feat(profiler): async-signal-safety sanitizer by jbachorik · Pull Request #540 · DataDog/java-profiler

jbachorik · 2026-05-24T21:43:16Z

What does this PR do?:
Adds a runtime async-signal-safety sanitizer for the profiler's signal handlers. All 10 installed signal handlers are wrapped with a SignalHandlerScope RAII guard that tracks nesting depth in a thread-local counter. 11 known-unsafe APIs (Dictionary inserting lookups, FlightRecorder lifecycle methods, Mutex::lock) get DEBUG_ASSERT_NOT_IN_SIGNAL() calls that abort via write(2)+_exit(1) in debug/ASAN builds if reached from signal context. Release builds are unaffected (macro compiles to ((void)0)).

Motivation:
A number of profiler crashes in recent weeks all share the same root cause: async-signal-unsafe code reachable from signal handlers, caught only in production. Today the AS-safety rules exist only in contributor memory. This sanitizer enforces them in debug/ASAN builds so violations are caught at development time, not in production.

Additional Notes:
Two commits:

e9d70357 — PROF-14764: SignalHandlerScope RAII + DEBUG_ASSERT_NOT_IN_SIGNAL() macro + 3 unit tests
16baae4a — PROF-14765: assertions wired into 11 unsafe call sites

Recording::addThread was audited and confirmed signal-safe — it uses ThreadIdTable::insert (atomic CAS only).

The restoreSignalHandler in os_linux.cpp / os_macos.cpp (trivial SIGSEGV/SIGBUS fallback handlers) has no SignalHandlerScope — their bodies are single-statement and call none of the asserted APIs, so no false-positive risk today. Tracked as a follow-up for completeness.

How to test the change?:

./gradlew :ddprof-lib:gtestDebug — full debug suite must pass with no assertion fires
./gradlew :ddprof-lib:gtestDebug_signalSafety_ut — 3 unit tests covering depth symmetry and nesting
./gradlew :ddprof-lib:gtestDebug_dictionary_concurrent_ut — concurrent signal-context bounded_lookup + dump-thread clear; assertions in inserting overloads must not fire on the signal side

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.
JIRA: PROF-14763

dd-octo-sts · 2026-05-25T01:31:31Z

CI Test Results

Run: #26421841285 | Commit: d7f68a4 | Duration: 12m 0s (longest job)

✅ All 32 test jobs passed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	✅	✅
8-orcl	-	✅	-	-
11	-	✅	-	-
11-j9	✅	✅	-	-
11-librca	-	-	✅	✅
17	✅	✅	-	-
17-graal	✅	✅	-	-
17-j9	✅	✅	-	-
17-librca	-	-	✅	✅
21	✅	✅	-	-
21-graal	✅	✅	-	-
21-librca	-	-	✅	✅
25	✅	✅	-	-
25-graal	✅	✅	-	-
25-librca	-	-	✅	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0

Updated: 2026-05-25 22:29:30 UTC

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23b1e2d147

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Adds a runtime sanitizer for the profiler's signal handlers via a thread-local depth counter, an RAII guard at each handler entry, and DEBUG_ASSERT_NOT_IN_SIGNAL() at known async-signal-unsafe APIs. In debug/ASAN builds the assertion aborts with a file:line diagnostic and writes /tmp/signal-safety-violation.txt for CI to upload as an artifact. Three macros encapsulate the depth counter — production code never touches the counter directly: SIGNAL_HANDLER_GUARD() — RAII increment/decrement SIGNAL_HANDLER_GUARD_RELEASE() — manual early release before chaining to handlers that may longjmp through us SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() — decrement at setjmp landing The depth counter and isInSignalContext()/getInSignalDepth() accessors are always on (~2 thread-local writes per signal handler invocation) and live in guards.{h,cpp} so production code can query signal-context state for AS-safe deferred paths (used by Profiler::dlopen_hook in the companion commit). Only the DEBUG_ASSERT_NOT_IN_SIGNAL diagnostic macro is conditional on NDEBUG/ASAN_ENABLED; in release builds it expands to a no-op. Wired into all 10 installed signal handlers (ITimer, ITimerJvmti, CTimer, CTimerJvmti, PerfEvents, WallClockASGCT, WallClockJvmti, segvHandler, busHandler, wakeupHandler). Assertions placed at: Dictionary::lookup (inserting overloads), Dictionary::clear, Recording::writeClasses, FlightRecorder::recordDatadogSetting, FlightRecorder::recordHeapUsage, Mutex::lock. Two longjmp-aware fixes: - J9 SIGSEGV null-pointer-check handler: segv/busHandler use GUARD_RELEASE() before chaining to orig_segvHandler/orig_busHandler, since the chained handler may siglongjmp and skip our destructor. - HotSpot checkFault longjmp: walkVM's setjmp landing uses UNWIND_AFTER_LONGJMP() to undo the increment from the unwound frame. CI test_workflow.yml uploads /tmp/signal-safety-violation.txt as an artifact when test jobs fail on glibc amd64/aarch64. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

dlopen_hook was performing file I/O, malloc, and mutex acquisition (parseLibraries, patch_sigaction, installHooks) potentially from async-signal context — e.g. when the JVM's DWARF unwinder lazily loads libgcc_s during stack walking on J9/aarch64, our PLT-patched dlopen hook fires from within a signal handler. Other JVM-internal lazy loads (observed under Graal on aarch64 in long-running tests) can trigger the same path. Three-pronged fix: 1. Force libgcc_s.so.1 to load during Profiler::start() via a plain dlopen on the init thread (Profiler::prewarmUnwinder). Once libgcc_s is mapped, the JVM's later resolve finds it already loaded and the known lazy-load path that would invoke our hook from signal context never runs. 2. Guard dlopen_hook with isInSignalContext() so any *other* lazy load from signal context is detected and refresh() is deferred. refresh() acquires Mutex and calls malloc, both AS-unsafe. The dlopen_hook in signal context now calls Libraries::markDirty() (a single atomic store, AS-safe) instead. 3. Background refresher thread in Libraries polls the dirty flag every REFRESH_INTERVAL_NS (5 s) and runs refresh() when it sees pending work. Bounds the window during which a lazily-loaded library is unresolvable by the stack walker — without the refresher the gap would be the entire chunk-flush interval (≥30 s). The thread runs only between Profiler::start and stop; wake-up on stop is via SIGIO/pthread_kill (same pattern as the wallclock thread) so shutdown latency is bounded by syscall interruption, not the tick interval. The SONAME "libgcc_s.so.1" is hardcoded by necessity: the release build links with -static-libgcc, so referencing _Unwind_Backtrace would not materialize libgcc_s.so as a NEEDED dependency — only dlopen by SONAME can map the shared object. libgcc_s.so.1 has been the stable SONAME since 2002; a bump would constitute a C++ ABI break. Libraries::refresh() encapsulates updateSymbols + patch_sigaction + installHooks + (optional) updateBuildIds, and clears the dirty flag on entry so concurrent dlopen_hook markDirty calls during refresh re-arm us for the next tick. remote_symbolication state moves into Libraries via setRemoteSymbolication() so the refresh path is self-contained and dlopen_hook doesn't need to reach into Profiler. Also adds Mutex::tryLock() (wraps pthread_mutex_trylock, which is on the POSIX async-signal-safe list) as a primitive available for future deferred-path work. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

JUnit 5.10+ dropped Java 8 support; 5.12+ dropped Java 11; 6.x requires Java 17. The Java 8 and 11 CI targets (both HotSpot and J9) run the Gradle test worker on the test JDK itself — the profiler attaches to its own process — so JUnit Platform classes must be loadable on Java 8/11. Revert libs.versions.toml to junit 5.9.2 / junit-platform 1.9.2 (the last known-working pair). Add Dependabot ignore rules to prevent automated bumps past 5.9.x / 1.9.x. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Each gtest test file was being compiled alongside all 59 library sources, so the per-config build cost was O(n_tests * n_sources). On slow CI runners this turned a debug gtest run into a multi-minute rebuild even when only one test file changed. Register a single compileGtestLibrary<Config> task that compiles the library sources once per build configuration. Each test's compile task now compiles only its own .cpp; each link task pulls in both its own objects and the shared library objects via the new GtestTaskBuilder.withSharedLibObjects() hook. Also adds a buildGtest<Config> aggregation task that runs compile + link without execution, for CI flows that invoke binaries directly. Ported from PR #539 (cherry-picked the build-cache portion only; sanitizer-runtime fixes from that PR are out of scope here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The local docker test runner only handled Temurin/Bellsoft/Semeru JDKs, so the 17-graal / 21-graal / 25-graal variants used by CI could not be reproduced locally. CI installs Graal via SDKMAN, but adding SDKMAN to the docker image is heavier than needed. Add get_graal_jdk_url() pointing at the Oracle GraalVM archive URLs and route --jdk=*-graal through it, mirroring the J9/Semeru variant. Versions match the CI's JAVA_<v>_GRAAL_VERSION values so docker reproduction is faithful to what CI runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The always-on _in_signal_handler_depth TLS variable, accessed first from inside our SIGPROF/SIGVTALRM handlers via SIGNAL_HANDLER_GUARD(), was declared with the default global-dynamic TLS model. On first access in a given thread glibc lazily allocates the dtv slot via malloc() and takes the heap lock — both async-signal-unsafe. Reproduced deterministically on Graal aarch64 (glibc 17-graal debug) running ClassGCTest: SIGPROF arrived on the VM Thread while Graal's JVMCI compiler held the malloc heap lock through c2v_notifyCompilerPhaseEvent. Stack: #2 __libc_malloc -- waiting on heap lock #4 allocate_dtv_entry #7 _dl_tlsdesc_dynamic #8 TLS wrapper for _in_signal_handler_depth #9 SignalHandlerScope::SignalHandlerScope #10 CTimer::signalHandler The heap holder is itself blocked at a safepoint waiting for VM Thread to check in, and VM Thread is stuck in malloc -> full process deadlock. Switch the variable to the initial-exec TLS model so the loader allocates its slot from the static TLS surplus at libjavaProfiler.so load time. Every existing thread is fixed up at dlopen and every new thread receives the slot at pthread_create. Access is then a register-relative load — async-signal-safe, lock-free, malloc-free. Also narrow the type to uint8_t (realistic max depth ~3) to make the intent explicit; alignment-wise this is the same slot. Refresher tick reduced from 5 s to 500 ms so a library lazily loaded from signal context becomes resolvable by the stack walker within half a second. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous thread_local + initial-exec TLS approach failed two ways: - default global-dynamic TLS lazily allocates the dtv slot via malloc on first access, deadlocking when the heap lock holder waits for a safepoint (observed on Graal aarch64 in ClassGCTest) - initial-exec fixes the malloc but consumes static TLS surplus. On Graal that surplus is already exhausted by libjvmcicompiler.so, so libjavaProfiler.so fails to dlopen with "cannot allocate memory in static TLS block" Move the counter into ProfiledThread::_signal_depth, which is reached via pthread_getspecific — POSIX-guaranteed AS-safe and never allocates. Once a ProfiledThread exists for the calling thread, depth tracking is a plain byte read/write on the owning thread (signal handlers run on the interrupted thread, so no cross-thread synchronization is needed). Treat ProfiledThread == null as "assume in signal" for production gates (Profiler::dlopen_hook now defers refresh() conservatively when running on uninstrumented JVM threads — VM Thread, JIT, GC). Same threads were the deadlock source under the previous design. DEBUG_ASSERT_NOT_IN_SIGNAL skips its check when ProfiledThread is null so AS-unsafe-but-legitimate code on uninstrumented threads doesn't trip a false-positive abort. Refresher tick stays at 500 ms (set in previous commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code review surfaced 7 findings — 4 CONFIRMED, 3 PLAUSIBLE. This commit addresses all 7: #8 (CONFIRMED) — split isInSignalContext into a strict variant. dlopen_hook was treating null ProfiledThread as "in signal" and deferring refresh on every dlopen from uninstrumented JVM threads (VM Thread, JIT, GC), delaying wasmtime sigaction patching by up to 500 ms. Add isInTrackedSignalContext() that returns false on null (only true when one of our SignalHandlerScopes is positively on the stack); dlopen_hook now uses it so JVM-internal threads get synchronous refresh again. isInSignalContext() retains its conservative semantics for any future caller that wants AS-safe-by-default. #7 (CONFIRMED) — switchLibraryTrap was called before startRefresher despite the invariant comment. Reorder so the refresher is running before the trap can fire. #9 (CONFIRMED) — DEBUG_ASSERT_NOT_IN_SIGNAL was on the 1-arg and 2-arg Dictionary::lookup overloads but missed the 4-arg form that actually mallocs. bounded_lookup's runtime-decided for_insert path was uncovered. Move the assertion into the 4-arg lookup, gated on for_insert (read-only lookups are AS-safe). #4 (CONFIRMED) — Comments referenced "REFRESH_INTERVAL_NS (5s)" but the actual constant is 500 ms. Fix both stale mentions. #13 (PLAUSIBLE) — SIGNAL_HANDLER_GUARD_RELEASE before chaining leaves depth == 0 inside a chained handler that returns normally; DEBUG_ASSERT_NOT_IN_SIGNAL inside such a handler would not fire. Document the trade-off in segvHandler — the longjmp safety property is more important than the sanitizer coverage gap, which is bounded to third-party signal handler code we don't own. #2 (PLAUSIBLE) — refresherLoop used OS::sleep without an EINTR loop; any unmasked signal (SIGCHLD, SIGURG, RT signals) would cause premature ticks. Wrap the sleep in an explicit elapsed-time loop using OS::nanotime so the refresher ticks at 500 ms regardless of stray signals. #14 (PLAUSIBLE) — refresherLoop published _refresher_tid before blocking SIGPROF/SIGVTALRM; a stale per-thread timer from a previous lifecycle could fire on the refresher during the window. Block signals first, then publish the TID. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Move the EINTR-resilient deadline sleep out of Libraries::refresherLoop into a new OS::sleepWhile(max_nanos, std::atomic<bool>&) primitive (clock_nanosleep TIMER_ABSTIME on Linux, nanosleep + nanotime on macOS). - Delete isInSignalContext(), Mutex::tryLock(), OS::uninterruptibleSleep — zero callers after the earlier switch to isInTrackedSignalContext. - Cover SignalHandlerScope::release() and signalHandlerUnwindAfterLongjmp() with unit tests; both were stub-replaceable before. - Upload /tmp/signal-safety-violation.txt on musl CI jobs as well. - Refresh the guards.h doc block to describe the current null-PT policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jbachorik added the AI label May 24, 2026

This comment has been minimized.

Sign in to view

jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 2 times, most recently from 5e1f30d to 23b1e2d Compare May 25, 2026 11:13

jbachorik marked this pull request as ready for review May 25, 2026 11:38

jbachorik requested a review from a team as a code owner May 25, 2026 11:38

jbachorik added no-review AI: Author Signed-off labels May 25, 2026

dd-octo-sts Bot approved these changes May 25, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated

Comment thread ddprof-lib/src/main/cpp/libraries.h Outdated

jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 3 times, most recently from 00d3527 to 9365886 Compare May 25, 2026 13:13

jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 3 times, most recently from 2ee2f1f to aeebd96 Compare May 25, 2026 16:37

jbachorik and others added 2 commits May 25, 2026 18:41

jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch from aeebd96 to 9103d9c Compare May 25, 2026 16:43

jbachorik and others added 6 commits May 25, 2026 19:00

jbachorik merged commit 827c2ac into main May 26, 2026
104 checks passed

jbachorik deleted the worktree-prof-14763-as-safety-sanitizer branch May 26, 2026 07:46

github-actions Bot added this to the 1.43.0 milestone May 26, 2026

jbachorik changed the title ~~feat(profiler): async-signal-safety sanitizer (PROF-14763)~~ feat(profiler): async-signal-safety sanitizer May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(profiler): async-signal-safety sanitizer#540

feat(profiler): async-signal-safety sanitizer#540
jbachorik merged 9 commits into
mainfrom
worktree-prof-14763-as-safety-sanitizer

jbachorik commented May 24, 2026 •

edited by atlassian Bot

Loading

Uh oh!

This comment has been minimized.

dd-octo-sts Bot commented May 25, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jbachorik commented May 24, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

dd-octo-sts Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Test Results

Status Overview

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jbachorik commented May 24, 2026 •

edited by atlassian Bot

Loading

dd-octo-sts Bot commented May 25, 2026 •

edited

Loading