feat(profiler): async-signal-safety sanitizer#540
Merged
Conversation
This comment has been minimized.
This comment has been minimized.
Contributor
CI Test ResultsRun: #26421841285 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-05-25 22:29:30 UTC |
5e1f30d to
23b1e2d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23b1e2d147
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
00d3527 to
9365886
Compare
Adds a runtime sanitizer for the profiler's signal handlers via a
thread-local depth counter, an RAII guard at each handler entry, and
DEBUG_ASSERT_NOT_IN_SIGNAL() at known async-signal-unsafe APIs. In
debug/ASAN builds the assertion aborts with a file:line diagnostic and
writes /tmp/signal-safety-violation.txt for CI to upload as an artifact.
Three macros encapsulate the depth counter — production code never
touches the counter directly:
SIGNAL_HANDLER_GUARD() — RAII increment/decrement
SIGNAL_HANDLER_GUARD_RELEASE() — manual early release before
chaining to handlers that may
longjmp through us
SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() — decrement at setjmp landing
The depth counter and isInSignalContext()/getInSignalDepth() accessors
are always on (~2 thread-local writes per signal handler invocation) and
live in guards.{h,cpp} so production code can query signal-context state
for AS-safe deferred paths (used by Profiler::dlopen_hook in the
companion commit). Only the DEBUG_ASSERT_NOT_IN_SIGNAL diagnostic macro
is conditional on NDEBUG/ASAN_ENABLED; in release builds it expands to
a no-op.
Wired into all 10 installed signal handlers (ITimer, ITimerJvmti,
CTimer, CTimerJvmti, PerfEvents, WallClockASGCT, WallClockJvmti,
segvHandler, busHandler, wakeupHandler).
Assertions placed at:
Dictionary::lookup (inserting overloads), Dictionary::clear,
Recording::writeClasses, FlightRecorder::recordDatadogSetting,
FlightRecorder::recordHeapUsage, Mutex::lock.
Two longjmp-aware fixes:
- J9 SIGSEGV null-pointer-check handler: segv/busHandler use
GUARD_RELEASE() before chaining to orig_segvHandler/orig_busHandler,
since the chained handler may siglongjmp and skip our destructor.
- HotSpot checkFault longjmp: walkVM's setjmp landing uses
UNWIND_AFTER_LONGJMP() to undo the increment from the unwound frame.
CI test_workflow.yml uploads /tmp/signal-safety-violation.txt as an
artifact when test jobs fail on glibc amd64/aarch64.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2ee2f1f to
aeebd96
Compare
dlopen_hook was performing file I/O, malloc, and mutex acquisition
(parseLibraries, patch_sigaction, installHooks) potentially from
async-signal context — e.g. when the JVM's DWARF unwinder lazily loads
libgcc_s during stack walking on J9/aarch64, our PLT-patched dlopen
hook fires from within a signal handler. Other JVM-internal lazy
loads (observed under Graal on aarch64 in long-running tests) can
trigger the same path.
Three-pronged fix:
1. Force libgcc_s.so.1 to load during Profiler::start() via a plain
dlopen on the init thread (Profiler::prewarmUnwinder). Once
libgcc_s is mapped, the JVM's later resolve finds it already
loaded and the known lazy-load path that would invoke our hook
from signal context never runs.
2. Guard dlopen_hook with isInSignalContext() so any *other* lazy
load from signal context is detected and refresh() is deferred.
refresh() acquires Mutex and calls malloc, both AS-unsafe. The
dlopen_hook in signal context now calls Libraries::markDirty()
(a single atomic store, AS-safe) instead.
3. Background refresher thread in Libraries polls the dirty flag
every REFRESH_INTERVAL_NS (5 s) and runs refresh() when it sees
pending work. Bounds the window during which a lazily-loaded
library is unresolvable by the stack walker — without the
refresher the gap would be the entire chunk-flush interval
(≥30 s). The thread runs only between Profiler::start and stop;
wake-up on stop is via SIGIO/pthread_kill (same pattern as the
wallclock thread) so shutdown latency is bounded by syscall
interruption, not the tick interval.
The SONAME "libgcc_s.so.1" is hardcoded by necessity: the release build
links with -static-libgcc, so referencing _Unwind_Backtrace would not
materialize libgcc_s.so as a NEEDED dependency — only dlopen by SONAME
can map the shared object. libgcc_s.so.1 has been the stable SONAME
since 2002; a bump would constitute a C++ ABI break.
Libraries::refresh() encapsulates updateSymbols + patch_sigaction +
installHooks + (optional) updateBuildIds, and clears the dirty flag
on entry so concurrent dlopen_hook markDirty calls during refresh
re-arm us for the next tick. remote_symbolication state moves into
Libraries via setRemoteSymbolication() so the refresh path is
self-contained and dlopen_hook doesn't need to reach into Profiler.
Also adds Mutex::tryLock() (wraps pthread_mutex_trylock, which is on
the POSIX async-signal-safe list) as a primitive available for future
deferred-path work.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
JUnit 5.10+ dropped Java 8 support; 5.12+ dropped Java 11; 6.x requires Java 17. The Java 8 and 11 CI targets (both HotSpot and J9) run the Gradle test worker on the test JDK itself — the profiler attaches to its own process — so JUnit Platform classes must be loadable on Java 8/11. Revert libs.versions.toml to junit 5.9.2 / junit-platform 1.9.2 (the last known-working pair). Add Dependabot ignore rules to prevent automated bumps past 5.9.x / 1.9.x. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
aeebd96 to
9103d9c
Compare
Each gtest test file was being compiled alongside all 59 library sources, so the per-config build cost was O(n_tests * n_sources). On slow CI runners this turned a debug gtest run into a multi-minute rebuild even when only one test file changed. Register a single compileGtestLibrary<Config> task that compiles the library sources once per build configuration. Each test's compile task now compiles only its own .cpp; each link task pulls in both its own objects and the shared library objects via the new GtestTaskBuilder.withSharedLibObjects() hook. Also adds a buildGtest<Config> aggregation task that runs compile + link without execution, for CI flows that invoke binaries directly. Ported from PR #539 (cherry-picked the build-cache portion only; sanitizer-runtime fixes from that PR are out of scope here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local docker test runner only handled Temurin/Bellsoft/Semeru JDKs, so the 17-graal / 21-graal / 25-graal variants used by CI could not be reproduced locally. CI installs Graal via SDKMAN, but adding SDKMAN to the docker image is heavier than needed. Add get_graal_jdk_url() pointing at the Oracle GraalVM archive URLs and route --jdk=*-graal through it, mirroring the J9/Semeru variant. Versions match the CI's JAVA_<v>_GRAAL_VERSION values so docker reproduction is faithful to what CI runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The always-on _in_signal_handler_depth TLS variable, accessed first from inside our SIGPROF/SIGVTALRM handlers via SIGNAL_HANDLER_GUARD(), was declared with the default global-dynamic TLS model. On first access in a given thread glibc lazily allocates the dtv slot via malloc() and takes the heap lock — both async-signal-unsafe. Reproduced deterministically on Graal aarch64 (glibc 17-graal debug) running ClassGCTest: SIGPROF arrived on the VM Thread while Graal's JVMCI compiler held the malloc heap lock through c2v_notifyCompilerPhaseEvent. Stack: #2 __libc_malloc -- waiting on heap lock #4 allocate_dtv_entry #7 _dl_tlsdesc_dynamic #8 TLS wrapper for _in_signal_handler_depth #9 SignalHandlerScope::SignalHandlerScope #10 CTimer::signalHandler The heap holder is itself blocked at a safepoint waiting for VM Thread to check in, and VM Thread is stuck in malloc -> full process deadlock. Switch the variable to the initial-exec TLS model so the loader allocates its slot from the static TLS surplus at libjavaProfiler.so load time. Every existing thread is fixed up at dlopen and every new thread receives the slot at pthread_create. Access is then a register-relative load — async-signal-safe, lock-free, malloc-free. Also narrow the type to uint8_t (realistic max depth ~3) to make the intent explicit; alignment-wise this is the same slot. Refresher tick reduced from 5 s to 500 ms so a library lazily loaded from signal context becomes resolvable by the stack walker within half a second. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous thread_local + initial-exec TLS approach failed two ways:
- default global-dynamic TLS lazily allocates the dtv slot via malloc
on first access, deadlocking when the heap lock holder waits for a
safepoint (observed on Graal aarch64 in ClassGCTest)
- initial-exec fixes the malloc but consumes static TLS surplus. On
Graal that surplus is already exhausted by libjvmcicompiler.so, so
libjavaProfiler.so fails to dlopen with "cannot allocate memory in
static TLS block"
Move the counter into ProfiledThread::_signal_depth, which is reached
via pthread_getspecific — POSIX-guaranteed AS-safe and never allocates.
Once a ProfiledThread exists for the calling thread, depth tracking is
a plain byte read/write on the owning thread (signal handlers run on
the interrupted thread, so no cross-thread synchronization is needed).
Treat ProfiledThread == null as "assume in signal" for production
gates (Profiler::dlopen_hook now defers refresh() conservatively when
running on uninstrumented JVM threads — VM Thread, JIT, GC). Same
threads were the deadlock source under the previous design.
DEBUG_ASSERT_NOT_IN_SIGNAL skips its check when ProfiledThread is null
so AS-unsafe-but-legitimate code on uninstrumented threads doesn't
trip a false-positive abort.
Refresher tick stays at 500 ms (set in previous commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review surfaced 7 findings — 4 CONFIRMED, 3 PLAUSIBLE. This commit addresses all 7: #8 (CONFIRMED) — split isInSignalContext into a strict variant. dlopen_hook was treating null ProfiledThread as "in signal" and deferring refresh on every dlopen from uninstrumented JVM threads (VM Thread, JIT, GC), delaying wasmtime sigaction patching by up to 500 ms. Add isInTrackedSignalContext() that returns false on null (only true when one of our SignalHandlerScopes is positively on the stack); dlopen_hook now uses it so JVM-internal threads get synchronous refresh again. isInSignalContext() retains its conservative semantics for any future caller that wants AS-safe-by-default. #7 (CONFIRMED) — switchLibraryTrap was called before startRefresher despite the invariant comment. Reorder so the refresher is running before the trap can fire. #9 (CONFIRMED) — DEBUG_ASSERT_NOT_IN_SIGNAL was on the 1-arg and 2-arg Dictionary::lookup overloads but missed the 4-arg form that actually mallocs. bounded_lookup's runtime-decided for_insert path was uncovered. Move the assertion into the 4-arg lookup, gated on for_insert (read-only lookups are AS-safe). #4 (CONFIRMED) — Comments referenced "REFRESH_INTERVAL_NS (5s)" but the actual constant is 500 ms. Fix both stale mentions. #13 (PLAUSIBLE) — SIGNAL_HANDLER_GUARD_RELEASE before chaining leaves depth == 0 inside a chained handler that returns normally; DEBUG_ASSERT_NOT_IN_SIGNAL inside such a handler would not fire. Document the trade-off in segvHandler — the longjmp safety property is more important than the sanitizer coverage gap, which is bounded to third-party signal handler code we don't own. #2 (PLAUSIBLE) — refresherLoop used OS::sleep without an EINTR loop; any unmasked signal (SIGCHLD, SIGURG, RT signals) would cause premature ticks. Wrap the sleep in an explicit elapsed-time loop using OS::nanotime so the refresher ticks at 500 ms regardless of stray signals. #14 (PLAUSIBLE) — refresherLoop published _refresher_tid before blocking SIGPROF/SIGVTALRM; a stale per-thread timer from a previous lifecycle could fire on the refresher during the window. Block signals first, then publish the TID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move the EINTR-resilient deadline sleep out of Libraries::refresherLoop into a new OS::sleepWhile(max_nanos, std::atomic<bool>&) primitive (clock_nanosleep TIMER_ABSTIME on Linux, nanosleep + nanotime on macOS). - Delete isInSignalContext(), Mutex::tryLock(), OS::uninterruptibleSleep — zero callers after the earlier switch to isInTrackedSignalContext. - Cover SignalHandlerScope::release() and signalHandlerUnwindAfterLongjmp() with unit tests; both were stub-replaceable before. - Upload /tmp/signal-safety-violation.txt on musl CI jobs as well. - Refresh the guards.h doc block to describe the current null-PT policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?:
Adds a runtime async-signal-safety sanitizer for the profiler's signal handlers. All 10 installed signal handlers are wrapped with a
SignalHandlerScopeRAII guard that tracks nesting depth in a thread-local counter. 11 known-unsafe APIs (Dictionary inserting lookups, FlightRecorder lifecycle methods,Mutex::lock) getDEBUG_ASSERT_NOT_IN_SIGNAL()calls that abort viawrite(2)+_exit(1)in debug/ASAN builds if reached from signal context. Release builds are unaffected (macro compiles to((void)0)).Motivation:
A number of profiler crashes in recent weeks all share the same root cause: async-signal-unsafe code reachable from signal handlers, caught only in production. Today the AS-safety rules exist only in contributor memory. This sanitizer enforces them in debug/ASAN builds so violations are caught at development time, not in production.
Additional Notes:
Two commits:
e9d70357— PROF-14764:SignalHandlerScopeRAII +DEBUG_ASSERT_NOT_IN_SIGNAL()macro + 3 unit tests16baae4a— PROF-14765: assertions wired into 11 unsafe call sitesRecording::addThreadwas audited and confirmed signal-safe — it usesThreadIdTable::insert(atomic CAS only).The
restoreSignalHandlerinos_linux.cpp/os_macos.cpp(trivial SIGSEGV/SIGBUS fallback handlers) has noSignalHandlerScope— their bodies are single-statement and call none of the asserted APIs, so no false-positive risk today. Tracked as a follow-up for completeness.How to test the change?:
./gradlew :ddprof-lib:gtestDebug— full debug suite must pass with no assertion fires./gradlew :ddprof-lib:gtestDebug_signalSafety_ut— 3 unit tests covering depth symmetry and nesting./gradlew :ddprof-lib:gtestDebug_dictionary_concurrent_ut— concurrent signal-context bounded_lookup + dump-thread clear; assertions in inserting overloads must not fire on the signal sideFor Datadog employees:
@DataDog/security-design-and-guidance.