Skip to content

Harden JVM TLS init, crash-handler nesting, and jmp_buf publish/observe#627

Closed
jbachorik wants to merge 2 commits into
zgu/objmonitor_deflationfrom
jbachorik/pr615-crash-protection-fixes
Closed

Harden JVM TLS init, crash-handler nesting, and jmp_buf publish/observe#627
jbachorik wants to merge 2 commits into
zgu/objmonitor_deflationfrom
jbachorik/pr615-crash-protection-fixes

Conversation

@jbachorik

@jbachorik jbachorik commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?:
Stacks on #615 and fixes a scoped set of crash-protection findings surfaced during a review pass of that PR. Each finding below is from that review; findings not addressed here (s-1, and low-severity comment/dead-code items s-3/s-4/s-6) are intentionally left out — see Additional Notes.

b-0 (HIGH) — JVMThread TLS init has no null-guard, undefined key on failure
ddprof-lib/src/main/cpp/threadLocal.h:213. ThreadLocal<JVMThread*>::initialize() had no early-out for a null current_thread. In release builds (asserts compiled out), a null current_thread would make the pthread-key scan match the first never-created key (pthread_getspecific() also returns NULL for it), silently producing a bogus _key that isKeyValid()/isInitialized() would falsely report as valid — later stack-walk paths would then dereference an unrelated TLS value as a JVMThread*. Separately, the bool return of JVMThread::initialize() was discarded at vmEntry.cpp:559.
Fix: ThreadLocal<JVMThread*>::initialize() now returns false and leaves _key == INVALID_KEY when current_thread == nullptr; get() checks isKeyValid() before calling pthread_getspecific() instead of relying solely on an assert; JVMThread::initialize() restores the explicit null-guard; VM::ready() now logs a warning and degrades gracefully instead of discarding the result.

Crash-handler nesting race (found while reviewing b-0's neighborhood, not in the original finding set)
ddprof-lib/src/main/cpp/threadLocalData.h. enterCrashHandler() used load-then-branch-then-increment (u32 prev = __atomic_load_n(...); if (prev < LIMIT) { __atomic_add_fetch(...); return true; }), which lets a nested signal on the same thread observe the same pre-increment value between the load and the add, letting more than CRASH_HANDLER_NESTING_LIMIT handlers past the gate.
Fix: single atomic increment-then-check (__atomic_add_fetch then compare, with a compensating decrement on the reject path).

g-5 (LOW) — _jmp_buf publish/observe is plain load/store across a signal boundary
ddprof-lib/src/main/cpp/threadLocalData.h:203. _jmp_buf is written in walkVM() (setJmpCtx) and read in checkFault()'s SEGV-handler path (getJmpCtx) via plain non-atomic/non-volatile load/store. Benign in practice on x86-64/aarch64, but per the C++ memory model, accessing a shared object from a signal handler that is neither volatile sig_atomic_t nor a lock-free atomic is UB and permits reordering around the publishing store.
Fix: _jmp_buf is now std::atomic<jmp_buf*>, with setJmpCtx/getJmpCtx/isProtected using explicit memory_order_release/memory_order_acquire, matching how _crash_depth is already hardened.

s-2 (MEDIUM) — crash-protection gtest suite never exercises the real walkVM/checkFault
ddprof-lib/src/test/cpp/hotspot_crash_protection_ut.cpp:116. The existing suite exercised only ProfiledThread primitives with hand-rolled jmp_bufs; it never called HotspotSupport::walkVM or HotspotSupport::checkFault, so the invariants this PR relies on (checkFault's guard/ordering logic) had zero direct coverage.
Fix: added tests that call the real HotspotSupport::checkFault() directly — a null-ProfiledThread* no-op, and an uninitialized-JVMThread no-op that leaves an installed jmp_buf untouched. (Full walkVM integration coverage needs a live JVM, which this gtest binary doesn't have — see note in the test file.)

x-7 (LOW) — VTable-stub null-klass check has no test assertion
ddprof-lib/src/main/cpp/hotspot/hotspotSupport.cpp:627. The null check on the VTable-stub receiver klass (klass != nullptr ? klass->name() : nullptr) had no explicit test; a mutation removing/inverting it wouldn't be caught by existing tests.
Fix: added a test replicating the exact guard condition, asserting a null klass yields a null symbol and skips the frame-fill.

x-8 (LOW) — safeFetch64==0 TOCTOU guard has no test assertion
ddprof-lib/src/main/cpp/hotspot/vmStructs.h:679. The TOCTOU guard (mark = SafeAccess::safeFetch64(...); if (mark == 0) return nullptr;, protecting against MonitorDeflationThread freeing an ObjectMonitor between reading the mark word and dereferencing it) had no explicit test; inverting or removing the check wouldn't be caught.
Fix: added a test that mprotects a page to force safeFetch64 to return its error value (0) and asserts the caller treats that as "give up" rather than shifting 0 into a bogus klass pointer.

(x-7/x-8's underlying call sites depend on JVM-populated static offsets not available in this gtest binary, so both tests replicate the exact guard condition verbatim rather than driving the real call site — consistent with this test file's existing pattern for isJavaThread()'s fast path.)

Motivation:
These were flagged during a review pass over #615 as defensive-hardening gaps in the crash-protection machinery (TLS initialization, signal-reentrancy accounting, and cross-signal-handler-boundary memory ordering). None are regressions introduced by #615 itself, but #615 touches the surrounding code closely enough that fixing them here keeps the two PRs easy to review independently.

Additional Notes:
Deliberately out of scope:

How to test the change?:
Added new gtest cases to ddprof-lib/src/test/cpp/hotspot_crash_protection_ut.cpp covering the checkFault guard clauses and the null-klass/safeFetch64 guard conditions. Ran the full :ddprof-lib:gtestDebug suite locally (macOS) with no regressions; the new Linux-only test additions will be exercised by CI.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a security review (run the dd:platform-security-review
    skill, or file a request via the PSEC review form).
    bewaire also runs automatically on every PR.
  • This PR doesn't touch any of that.
  • JIRA: N/A

Unsure? Have a question? Request a review!

- JVMThread::initialize() bails out on a null current_thread instead of
  scanning pthread keys with a bogus match target
- enterCrashHandler() uses a single atomic increment-then-check instead of
  load-then-branch-then-increment, closing a nested-signal race that could
  let more than CRASH_HANDLER_NESTING_LIMIT handlers past the gate
- ProfiledThread::_jmp_buf is now std::atomic<jmp_buf*> with explicit
  acquire/release ordering across the signal-handler boundary
- add gtest coverage for checkFault()'s guard clauses and the VTable-stub
  null-klass / safeFetch64==0 guard conditions

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@jbachorik jbachorik added the AI label Jul 2, 2026
@datadog-prod-us1-4

This comment has been minimized.

@dd-octo-sts

dd-octo-sts Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

CI Test Results

Run: #28602713837 | Commit: 02e00c6 | Duration: 13m 58s (longest job)

1 of 32 test jobs failed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

glibc-amd64/debug / 8-ibm

Job: View logs

No detailed failure information available. Check the job logs.

Summary: Total: 32 | Passed: 31 | Failed: 1


Updated: 2026-07-02 15:59:19 UTC

safeFetch64's fault trampoline needs a registered SIGSEGV/SIGBUS handler
(SafeAccess::handle_safefetch) to catch the fault; without it, a real
signal killed the CI test process.
@dd-octo-sts

dd-octo-sts Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results

Pipeline: https://gitlab.ddbuild.io/DataDog/apm-reliability/benchmarking-platform/-/pipelines/122411442 Commit: 1e2b49b082ff3b9fe8709cd2b6d0c3e2f0ad126b

⚠️ Significant outliers

  • 🔴 fj-kmeans (JDK 25): runtime +8% (2630→2841 ms)
  • 🟢 future-genetic (JDK 21): runtime -4.6% (2163→2063 ms)
  • 🔴 future-genetic (JDK 25): runtime +9.8% (1933→2122 ms)
  • ⚠️ finagle-chirper (JDK 21): wallclock unwinds skipped — latest 0 → dev 8323 (new)
Runtime details (per benchmark × JDK)
Benchmark JDK Latest Dev Δ (dev vs latest) Issues L/D
akka-uct 21 ✅ 10160 ms (7 iters) ✅ 10448 ms (7 iters) ≈ +2.8% (±20%) — / —
akka-uct 25 ✅ 9047 ms (8 iters) ✅ 8903 ms (8 iters) ≈ -1.6% (±19.3%) — / —
finagle-chirper 21 ✅ 5935 ms (11 iters) ✅ 5931 ms (11 iters) ≈ -0.1% (±45.1%) ⚠️ W:1 / ⚠️ W:1
finagle-chirper 25 ✅ 5522 ms (12 iters) ✅ 5364 ms (12 iters) ≈ -2.9% (±43.8%) ⚠️ W:1 / ⚠️ W:1
fj-kmeans 21 ✅ 2831 ms (22 iters) ✅ 2862 ms (22 iters) ≈ +1.1% (±4.4%) — / —
fj-kmeans 25 ✅ 2630 ms (24 iters) ✅ 2841 ms (22 iters) 🔴 +8% — / —
future-genetic 21 ✅ 2163 ms (29 iters) ✅ 2063 ms (30 iters) 🟢 -4.6% — / —
future-genetic 25 ✅ 1933 ms (32 iters) ✅ 2122 ms (29 iters) 🔴 +9.8% — / —
naive-bayes 21 ✅ 1282 ms (45 iters) ✅ 1241 ms (46 iters) ≈ -3.2% (±56.1%) — / —
naive-bayes 25 ✅ 1014 ms (56 iters) ✅ 1013 ms (56 iters) ≈ -0.1% (±55.4%) — / —
reactors 21 ✅ 16197 ms (5 iters) ✅ 15328 ms (5 iters) ≈ -5.4% (±10.8%) — / —
reactors 25 ✅ 18204 ms (5 iters) ✅ 18699 ms (5 iters) ≈ +2.7% (±7.4%) — / —
Internal counter details (ddprof)

ddprof internal counters, latest / dev (✅ = 0, · = unavailable):

Benchmark JDK Dropped rec Dropped jvmti Dropped trace Skipped WC AGCT fail Unwind fail
akka-uct 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
akka-uct 25 ✅ / ✅ ✅ / ✅ 1 / 2 2566 / 2140 ✅ / ✅ ✅ / ✅
finagle-chirper 21 ✅ / ✅ ✅ / ✅ ✅ / 4 ✅ / 8323 ✅ / ✅ ✅ / ✅
finagle-chirper 25 ✅ / ✅ ✅ / ✅ 1 / 2 8332 / 8152 ✅ / ✅ ✅ / ✅
fj-kmeans 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
fj-kmeans 25 ✅ / ✅ ✅ / ✅ 3 / 3 1283 / 1276 ✅ / ✅ ✅ / ✅
future-genetic 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
future-genetic 25 ✅ / ✅ ✅ / ✅ 1 / 2 2872 / 2906 ✅ / ✅ ✅ / ✅
naive-bayes 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅ ✅ / ✅
naive-bayes 25 ✅ / ✅ ✅ / ✅ 3 / 3 3459 / 3445 ✅ / ✅ ✅ / ✅
reactors 21 ✅ / ✅ ✅ / ✅ ✅ / ✅ 1642 / ✅ ✅ / ✅ ✅ / ✅
reactors 25 ✅ / ✅ ✅ / ✅ 1 / ✅ 1828 / 1889 ✅ / ✅ ✅ / ✅

@jbachorik jbachorik closed this Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant