fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614
fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614r1viollet wants to merge 1 commit into
Conversation
… lifecycle The CPU profiler sends SIGPROF to all threads via per-thread kernel timers. The signal handler checks _enabled and, if true, calls recordSample() which accesses JFR buffers. Two races existed around the recording cycle transition (default every 60 s) where JFR structures could be in mid-init or mid-teardown while the handler was active: Race 1 — stop() side (TOCTOU on _enabled vs _jfr.stop()): A handler that passed the _enabled=true check could still be executing inside recordSample() when disableEngines() set _enabled=false and _jfr.stop() freed JFR buffers — use-after-free → SIGSEGV. Fix: add an _inflight counter (incremented on handler entry, decremented on all exits). CTimer::stop() calls drainInflight() after deleting per- thread timers, spinning until _inflight==0 before returning to the caller that proceeds to _jfr.stop(). Any handler that fires after disableEngines() sees _enabled=false and returns early without touching JFR. Race 2 — start() side (enableEngines() before _jfr.start()): enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures. Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start() have both returned successfully (just before _state.store(RUNNING)). Validated empirically: a controlled reproducer in DataDog/profiling-backend (AnalysisEndpointTest.testResourceExhausted with a 60 s recording period) showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both fixes applied (12/12 iterations clean). Each fix alone only partially addressed the failures, confirming both races were independently active. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
CI Test ResultsRun: #28029632399 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Failed Testsmusl-amd64/debug / 25-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 25-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 21-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 8-librcaJob: View logs No detailed failure information available. Check the job logs. musl-amd64/debug / 17-librcaJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17-j9Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 21Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 11-j9Job: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 11-librcaJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 21-graalJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 25-graalJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 25Job: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 17-librcaJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 11Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 17-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 11-j9Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 21-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 25Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 25-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 21Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 17Job: View logs No detailed failure information available. Check the job logs. Summary: Total: 32 | Passed: 8 | Failed: 24 Updated: 2026-06-23 13:45:29 UTC |
What does this PR do?:
Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.
Race 1 — stop() side (
ctimer_linux.cpp):disableEngines()sets_enabled=false, but a handler that already passed the_enabled=truecheck could still be executing insiderecordSample()when_jfr.stop()freed JFR buffers → use-after-free → SIGSEGV (or hang if the crash is caught by crashtracking).Fix: add an
_inflightcounter, incremented on every handler entry before the_enabledcheck, decremented on every exit path.CTimer::stop()callsdrainInflight()after deleting per-thread timers, spinning until_inflight==0before returning. The caller (Profiler::stop) then proceeds to_jfr.stop()only once all handlers have fully exited.Race 2 — start() side (
profiler.cpp):enableEngines()set_enabled=truebefore_jfr.start()had completed. A SIGPROF delivered in that window would see_enabled=trueand callrecordSample()on partially-initialized JFR structures.Fix: move
enableEngines()to after both_jfr.start()and_cpu_engine->start()have returned successfully (immediately before_state.store(RUNNING)).Motivation:
Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.
How to test the change?:
Controlled reproducer in DataDog/profiling-backend using
AnalysisEndpointTest.testResourceExhaustedwith the bad dd-trace-java agent (0e13e90dac) and a patchedlibjavaProfiler.so:drainInflight): ~20% failure rate — Race 2 still activeenableEngines): ~40% failure rate — Race 1 still activev_1.44.0baselineAdditional Notes:
drainInflight()is an unbounded spin. In practicerecordSample()completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up._inflightcounter is incremented even whenCriticalSectionfails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.For Datadog employees: