Harden JVM TLS init, crash-handler nesting, and jmp_buf publish/observe#627
Closed
jbachorik wants to merge 2 commits into
Closed
Harden JVM TLS init, crash-handler nesting, and jmp_buf publish/observe#627jbachorik wants to merge 2 commits into
jbachorik wants to merge 2 commits into
Conversation
- JVMThread::initialize() bails out on a null current_thread instead of scanning pthread keys with a bogus match target - enterCrashHandler() uses a single atomic increment-then-check instead of load-then-branch-then-increment, closing a nested-signal race that could let more than CRASH_HANDLER_NESTING_LIMIT handlers past the gate - ProfiledThread::_jmp_buf is now std::atomic<jmp_buf*> with explicit acquire/release ordering across the signal-handler boundary - add gtest coverage for checkFault()'s guard clauses and the VTable-stub null-klass / safeFetch64==0 guard conditions Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Contributor
CI Test ResultsRun: #28602713837 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Failed Testsglibc-amd64/debug / 8-ibmJob: View logs No detailed failure information available. Check the job logs. Summary: Total: 32 | Passed: 31 | Failed: 1 Updated: 2026-07-02 15:59:19 UTC |
safeFetch64's fault trampoline needs a registered SIGSEGV/SIGBUS handler (SafeAccess::handle_safefetch) to catch the fault; without it, a real signal killed the CI test process.
Contributor
Benchmark ResultsPipeline: https://gitlab.ddbuild.io/DataDog/apm-reliability/benchmarking-platform/-/pipelines/122411442 Commit:
|
| Benchmark | JDK | Latest | Dev | Δ (dev vs latest) | Issues L/D |
|---|---|---|---|---|---|
| akka-uct | 21 | ✅ 10160 ms (7 iters) | ✅ 10448 ms (7 iters) | ≈ +2.8% (±20%) | — / — |
| akka-uct | 25 | ✅ 9047 ms (8 iters) | ✅ 8903 ms (8 iters) | ≈ -1.6% (±19.3%) | — / — |
| finagle-chirper | 21 | ✅ 5935 ms (11 iters) | ✅ 5931 ms (11 iters) | ≈ -0.1% (±45.1%) | |
| finagle-chirper | 25 | ✅ 5522 ms (12 iters) | ✅ 5364 ms (12 iters) | ≈ -2.9% (±43.8%) | |
| fj-kmeans | 21 | ✅ 2831 ms (22 iters) | ✅ 2862 ms (22 iters) | ≈ +1.1% (±4.4%) | — / — |
| fj-kmeans | 25 | ✅ 2630 ms (24 iters) | ✅ 2841 ms (22 iters) | 🔴 +8% | — / — |
| future-genetic | 21 | ✅ 2163 ms (29 iters) | ✅ 2063 ms (30 iters) | 🟢 -4.6% | — / — |
| future-genetic | 25 | ✅ 1933 ms (32 iters) | ✅ 2122 ms (29 iters) | 🔴 +9.8% | — / — |
| naive-bayes | 21 | ✅ 1282 ms (45 iters) | ✅ 1241 ms (46 iters) | ≈ -3.2% (±56.1%) | — / — |
| naive-bayes | 25 | ✅ 1014 ms (56 iters) | ✅ 1013 ms (56 iters) | ≈ -0.1% (±55.4%) | — / — |
| reactors | 21 | ✅ 16197 ms (5 iters) | ✅ 15328 ms (5 iters) | ≈ -5.4% (±10.8%) | — / — |
| reactors | 25 | ✅ 18204 ms (5 iters) | ✅ 18699 ms (5 iters) | ≈ +2.7% (±7.4%) | — / — |
Internal counter details (ddprof)
ddprof internal counters, latest / dev (✅ = 0, · = unavailable):
| Benchmark | JDK | Dropped rec | Dropped jvmti | Dropped trace | Skipped WC | AGCT fail | Unwind fail |
|---|---|---|---|---|---|---|---|
| akka-uct | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| akka-uct | 25 | ✅ / ✅ | ✅ / ✅ | 1 / 2 | 2566 / 2140 | ✅ / ✅ | ✅ / ✅ |
| finagle-chirper | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / 4 | ✅ / 8323 | ✅ / ✅ | ✅ / ✅ |
| finagle-chirper | 25 | ✅ / ✅ | ✅ / ✅ | 1 / 2 | 8332 / 8152 | ✅ / ✅ | ✅ / ✅ |
| fj-kmeans | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| fj-kmeans | 25 | ✅ / ✅ | ✅ / ✅ | 3 / 3 | 1283 / 1276 | ✅ / ✅ | ✅ / ✅ |
| future-genetic | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| future-genetic | 25 | ✅ / ✅ | ✅ / ✅ | 1 / 2 | 2872 / 2906 | ✅ / ✅ | ✅ / ✅ |
| naive-bayes | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ |
| naive-bayes | 25 | ✅ / ✅ | ✅ / ✅ | 3 / 3 | 3459 / 3445 | ✅ / ✅ | ✅ / ✅ |
| reactors | 21 | ✅ / ✅ | ✅ / ✅ | ✅ / ✅ | 1642 / ✅ | ✅ / ✅ | ✅ / ✅ |
| reactors | 25 | ✅ / ✅ | ✅ / ✅ | 1 / ✅ | 1828 / 1889 | ✅ / ✅ | ✅ / ✅ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?:
Stacks on #615 and fixes a scoped set of crash-protection findings surfaced during a review pass of that PR. Each finding below is from that review; findings not addressed here (
s-1, and low-severity comment/dead-code itemss-3/s-4/s-6) are intentionally left out — see Additional Notes.b-0 (HIGH) —
JVMThreadTLS init has no null-guard, undefined key on failureddprof-lib/src/main/cpp/threadLocal.h:213.ThreadLocal<JVMThread*>::initialize()had no early-out for a nullcurrent_thread. In release builds (asserts compiled out), a nullcurrent_threadwould make the pthread-key scan match the first never-created key (pthread_getspecific()also returnsNULLfor it), silently producing a bogus_keythatisKeyValid()/isInitialized()would falsely report as valid — later stack-walk paths would then dereference an unrelated TLS value as aJVMThread*. Separately, theboolreturn ofJVMThread::initialize()was discarded atvmEntry.cpp:559.Fix:
ThreadLocal<JVMThread*>::initialize()now returnsfalseand leaves_key == INVALID_KEYwhencurrent_thread == nullptr;get()checksisKeyValid()before callingpthread_getspecific()instead of relying solely on an assert;JVMThread::initialize()restores the explicit null-guard;VM::ready()now logs a warning and degrades gracefully instead of discarding the result.Crash-handler nesting race (found while reviewing
b-0's neighborhood, not in the original finding set)ddprof-lib/src/main/cpp/threadLocalData.h.enterCrashHandler()used load-then-branch-then-increment (u32 prev = __atomic_load_n(...); if (prev < LIMIT) { __atomic_add_fetch(...); return true; }), which lets a nested signal on the same thread observe the same pre-increment value between the load and the add, letting more thanCRASH_HANDLER_NESTING_LIMIThandlers past the gate.Fix: single atomic increment-then-check (
__atomic_add_fetchthen compare, with a compensating decrement on the reject path).g-5 (LOW) —
_jmp_bufpublish/observe is plain load/store across a signal boundaryddprof-lib/src/main/cpp/threadLocalData.h:203._jmp_bufis written inwalkVM()(setJmpCtx) and read incheckFault()'s SEGV-handler path (getJmpCtx) via plain non-atomic/non-volatile load/store. Benign in practice on x86-64/aarch64, but per the C++ memory model, accessing a shared object from a signal handler that is neithervolatile sig_atomic_tnor a lock-free atomic is UB and permits reordering around the publishing store.Fix:
_jmp_bufis nowstd::atomic<jmp_buf*>, withsetJmpCtx/getJmpCtx/isProtectedusing explicitmemory_order_release/memory_order_acquire, matching how_crash_depthis already hardened.s-2 (MEDIUM) — crash-protection gtest suite never exercises the real
walkVM/checkFaultddprof-lib/src/test/cpp/hotspot_crash_protection_ut.cpp:116. The existing suite exercised onlyProfiledThreadprimitives with hand-rolledjmp_bufs; it never calledHotspotSupport::walkVMorHotspotSupport::checkFault, so the invariants this PR relies on (checkFault's guard/ordering logic) had zero direct coverage.Fix: added tests that call the real
HotspotSupport::checkFault()directly — a null-ProfiledThread*no-op, and an uninitialized-JVMThreadno-op that leaves an installedjmp_bufuntouched. (FullwalkVMintegration coverage needs a live JVM, which this gtest binary doesn't have — see note in the test file.)x-7 (LOW) — VTable-stub null-klass check has no test assertion
ddprof-lib/src/main/cpp/hotspot/hotspotSupport.cpp:627. The null check on the VTable-stub receiver klass (klass != nullptr ? klass->name() : nullptr) had no explicit test; a mutation removing/inverting it wouldn't be caught by existing tests.Fix: added a test replicating the exact guard condition, asserting a null klass yields a null symbol and skips the frame-fill.
x-8 (LOW) —
safeFetch64==0TOCTOU guard has no test assertionddprof-lib/src/main/cpp/hotspot/vmStructs.h:679. The TOCTOU guard (mark = SafeAccess::safeFetch64(...); if (mark == 0) return nullptr;, protecting againstMonitorDeflationThreadfreeing anObjectMonitorbetween reading the mark word and dereferencing it) had no explicit test; inverting or removing the check wouldn't be caught.Fix: added a test that mprotects a page to force
safeFetch64to return its error value (0) and asserts the caller treats that as "give up" rather than shifting 0 into a bogus klass pointer.(
x-7/x-8's underlying call sites depend on JVM-populated static offsets not available in this gtest binary, so both tests replicate the exact guard condition verbatim rather than driving the real call site — consistent with this test file's existing pattern forisJavaThread()'s fast path.)Motivation:
These were flagged during a review pass over #615 as defensive-hardening gaps in the crash-protection machinery (TLS initialization, signal-reentrancy accounting, and cross-signal-handler-boundary memory ordering). None are regressions introduced by #615 itself, but #615 touches the surrounding code closely enough that fixing them here keeps the two PRs easy to review independently.
Additional Notes:
Deliberately out of scope:
s-1:walkVMnow silently drops samples whenProfiledThreadTLS is absent, whereas it previously attempted a protected walk. This is a design/telemetry question (is the coverage reduction intended? isSAMPLES_DROPPED_THREAD_LOCALsurfaced in telemetry?) for the Use ThreadLocal to setup protection longjmp buffer and fix jmp_buf chaining #615 author to weigh in on, not a code fix.s-3/s-4/s-6: stale comments and a dead accessor (VMThread::exception()), left for the Use ThreadLocal to setup protection longjmp buffer and fix jmp_buf chaining #615 author since they're purely about that PR's own new code/comments.How to test the change?:
Added new gtest cases to
ddprof-lib/src/test/cpp/hotspot_crash_protection_ut.cppcovering thecheckFaultguard clauses and the null-klass/safeFetch64guard conditions. Ran the full:ddprof-lib:gtestDebugsuite locally (macOS) with no regressions; the new Linux-only test additions will be exercised by CI.For Datadog employees:
credentials of any kind, I've requested a security review (run the
dd:platform-security-reviewskill, or file a request via the PSEC review form).
bewairealso runs automatically on every PR.Unsure? Have a question? Request a review!