fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614

Draft
r1viollet wants to merge 1 commit into main from r1viollet/fix-sigprof-jfr-races

@r1viollet

Contributor

What does this PR do?:

Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.

Race 1 — stop() side (ctimer_linux.cpp):

disableEngines() set _enabled=false, but a handler that had already passed the _enabled=true check could still be executing inside recordSample() when _jfr.stop() freed the JFR buffers → use-after-free → SIGSEGV (or a hang if crashtracking catches the crash).

Fix: add an _inflight counter, incremented on every handler entry before the _enabled check, decremented on every exit path. CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning. The caller (Profiler::stop) then proceeds to _jfr.stop() only once all handlers have fully exited.
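The shape of this fix can be sketched roughly as follows. This is a minimal illustration, not the code in ctimer_linux.cpp: g_enabled, g_inflight, g_samples, InflightGuard, handlerSketch and drainInflightSketch are hypothetical names, the handler is called as a plain function rather than from a real SIGPROF, and the counter g_samples stands in for the JFR buffer write. The sketch assumes sequentially consistent atomics (the std::atomic default), because the handler does "increment, then read flag" while the stopper does "clear flag, then read counter", and weaker orderings would allow each side to miss the other's write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for profiler state; not the real members.
static std::atomic<bool> g_enabled{false};
static std::atomic<int>  g_inflight{0};
static std::atomic<int>  g_samples{0};  // stands in for writes into JFR buffers

// RAII guard: the counter goes up on entry and comes back down on every
// exit path, including the early return below.
struct InflightGuard {
    InflightGuard() { g_inflight.fetch_add(1); }
    ~InflightGuard() {
        int prev = g_inflight.fetch_sub(1);
        assert(prev > 0);  // an underflow would mean unbalanced enter/exit
        (void)prev;
    }
};

// Shape of the handler: count first, check the flag second.
void handlerSketch() {
    InflightGuard guard;
    if (!g_enabled.load()) {
        return;  // disabled: leave without touching the "JFR" state
    }
    g_samples.fetch_add(1);  // stands in for recordSample()
}

// Shape of the drain: wait until no handler is between entry and exit.
// Unbounded, like the drain described in this PR.
void drainInflightSketch() {
    while (g_inflight.load() != 0) {
        std::this_thread::yield();
    }
}
```

In this sketch a stopper would clear g_enabled, call drainInflightSketch(), and only then release whatever the handler might have been using. Lock-free std::atomic operations are safe to use from a signal handler, which is why the guard does nothing beyond a fetch_add and a fetch_sub.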

Race 2 — start() side (profiler.cpp):

enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF delivered in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after both _jfr.start() and _cpu_engine->start() have returned successfully (immediately before _state.store(RUNNING)).
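The reordering can be illustrated with a small sketch. ProfilerSketch, its members, and the two callbacks are hypothetical; the callbacks stand in for _jfr.start() and _cpu_engine->start(), and rollback of a partially started profiler is deliberately left out. The only point shown is that the enabled flag flips after both initialization steps have succeeded.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

enum class State { IDLE, RUNNING };

// Hypothetical miniature of the start() path; not the real Profiler class.
struct ProfilerSketch {
    std::atomic<bool>  enabled{false};
    std::atomic<State> state{State::IDLE};

    // jfrStart and engineStart are assumed to return true on success.
    bool start(const std::function<bool()>& jfrStart,
               const std::function<bool()>& engineStart) {
        assert(!enabled.load());  // precondition: handlers are still gated off

        if (!jfrStart()) {
            return false;  // recording structures not ready: stay disabled
        }
        if (!engineStart()) {
            return false;  // rollback of the recording is omitted in this sketch
        }

        // Both steps succeeded, so a handler that now observes enabled == true
        // only ever sees fully initialized state.
        enabled.store(true);          // stands in for enableEngines()
        state.store(State::RUNNING);  // stands in for _state.store(RUNNING)
        return true;
    }
};
```

A caller could pass lambdas that inspect enabled while they run; with this ordering they should always observe false.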

Motivation:

Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.

How to test the change?:

Controlled reproducer in DataDog/profiling-backend using AnalysisEndpointTest.testResourceExhausted with the bad dd-trace-java agent (0e13e90dac) and a patched libjavaProfiler.so:

  • Without fix: ~60% failure rate per iteration (SIGSEGV / hang)
  • Race 1 fix only (drainInflight): ~20% failure rate — Race 2 still active
  • Race 2 fix only (move enableEngines): ~40% failure rate — Race 1 still active
  • Both fixes together: 12/12 iterations clean against v_1.44.0 baseline

Additional Notes:

  • drainInflight() is an unbounded spin. In practice recordSample() completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up.
  • The _inflight counter is incremented even when CriticalSection fails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.
  • Related: Revert "Ignore capturing connection continuation for armeria (#11657)" dd-trace-java#11685 (revert of the dd-trace-java commit that exposed these races).
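The bounded-spin follow-up mentioned in the first note could look something like the sketch below. It is not part of this PR: drainInflightBounded, its parameters, and the warning text are all assumptions, and fprintf to stderr stands in for whatever logging facility the profiler actually uses.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical bounded variant of the drain. Returns true if the counter
// reached zero before the timeout elapsed, false otherwise.
bool drainInflightBounded(const std::atomic<int>& inflight,
                          std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (inflight.load() != 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            // Log and give up instead of spinning forever.
            std::fprintf(stderr,
                         "drainInflight: timed out with %d handler(s) in flight\n",
                         inflight.load());
            return false;
        }
        std::this_thread::yield();
    }
    return true;
}
```

What the caller does with a false return is the open design question: going on to free the buffers anyway would reintroduce the use-after-free that the drain exists to prevent, so a bounded version would probably need to skip or defer the teardown in that case.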

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: [PROF-XXXX]

… lifecycle

The CPU profiler sends SIGPROF to all threads via per-thread kernel timers.
The signal handler checks _enabled and, if true, calls recordSample() which
accesses JFR buffers. Two races existed around the recording cycle transition
(default every 60 s) where JFR structures could be in mid-init or mid-teardown
while the handler was active:

Race 1 — stop() side (TOCTOU on _enabled vs _jfr.stop()):
  A handler that passed the _enabled=true check could still be executing
  inside recordSample() when disableEngines() set _enabled=false and
  _jfr.stop() freed JFR buffers — use-after-free → SIGSEGV.

  Fix: add an _inflight counter (incremented on handler entry, decremented
  on all exits). CTimer::stop() calls drainInflight() after deleting per-
  thread timers, spinning until _inflight==0 before returning to the caller
  that proceeds to _jfr.stop(). Any handler that fires after disableEngines()
  sees _enabled=false and returns early without touching JFR.

Race 2 — start() side (enableEngines() before _jfr.start()):
  enableEngines() set _enabled=true before _jfr.start() had completed.
  A SIGPROF in that window would see _enabled=true and call recordSample()
  on partially-initialized JFR structures.

  Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start()
  have both returned successfully (just before _state.store(RUNNING)).

Validated empirically: a controlled reproducer in DataDog/profiling-backend
(AnalysisEndpointTest.testResourceExhausted with a 60 s recording period)
showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both
fixes applied (12/12 iterations clean). Each fix alone only partially
addressed the failures, confirming both races were independently active.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Jun 23, 2026


Pipelines

⚠️ Warnings

🚦 41 Pipeline jobs failed

CI Run | test-matrix / test-linux-glibc-aarch64 (11-j9, debug)

CI Run | test-matrix / test-linux-glibc-aarch64 (17-j9, debug)

CI Run | test-matrix / test-linux-glibc-aarch64 (21, debug)

View all 41 failed jobs.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 0ed3fb5

@dd-octo-sts

dd-octo-sts Bot commented Jun 23, 2026

Contributor

CI Test Results

Run: #28029632399 | Commit: f9b0970 | Duration: 15m 52s (longest job)

24 of 32 test jobs failed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

No detailed failure information is available for the failed jobs listed below. Check the job logs.

  • musl-amd64/debug / 25-librca
  • musl-aarch64/debug / 25-librca
  • musl-aarch64/debug / 21-librca
  • musl-aarch64/debug / 8-librca
  • musl-amd64/debug / 17-librca
  • glibc-aarch64/debug / 17-j9
  • glibc-aarch64/debug / 17
  • glibc-aarch64/debug / 21
  • glibc-aarch64/debug / 11-j9
  • musl-aarch64/debug / 11-librca
  • glibc-aarch64/debug / 21-graal
  • glibc-aarch64/debug / 25-graal
  • glibc-aarch64/debug / 25
  • musl-aarch64/debug / 17-librca
  • glibc-aarch64/debug / 17-graal
  • glibc-amd64/debug / 11
  • glibc-amd64/debug / 17-graal
  • glibc-amd64/debug / 11-j9
  • glibc-amd64/debug / 21-graal
  • glibc-amd64/debug / 25
  • glibc-amd64/debug / 25-graal
  • glibc-amd64/debug / 21
  • glibc-amd64/debug / 17
  • glibc-amd64/debug / 8

Summary: Total: 32 | Passed: 8 | Failed: 24


Updated: 2026-06-23 13:45:29 UTC
