Skip to content

feat(profiler): async-signal-safety sanitizer#540

Merged
jbachorik merged 9 commits into
mainfrom
worktree-prof-14763-as-safety-sanitizer
May 26, 2026
Merged

feat(profiler): async-signal-safety sanitizer#540
jbachorik merged 9 commits into
mainfrom
worktree-prof-14763-as-safety-sanitizer

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 24, 2026

What does this PR do?:
Adds a runtime async-signal-safety sanitizer for the profiler's signal handlers. All 10 installed signal handlers are wrapped with a SignalHandlerScope RAII guard that tracks nesting depth in a thread-local counter. 11 known-unsafe APIs (Dictionary inserting lookups, FlightRecorder lifecycle methods, Mutex::lock) get DEBUG_ASSERT_NOT_IN_SIGNAL() calls that abort via write(2)+_exit(1) in debug/ASAN builds if reached from signal context. Release builds are unaffected (macro compiles to ((void)0)).

Motivation:
A number of profiler crashes in recent weeks all share the same root cause: async-signal-unsafe code reachable from signal handlers, caught only in production. Today the AS-safety rules exist only in contributor memory. This sanitizer enforces them in debug/ASAN builds so violations are caught at development time, not in production.

Additional Notes:
Two commits:

  • e9d70357PROF-14764: SignalHandlerScope RAII + DEBUG_ASSERT_NOT_IN_SIGNAL() macro + 3 unit tests
  • 16baae4aPROF-14765: assertions wired into 11 unsafe call sites

Recording::addThread was audited and confirmed signal-safe — it uses ThreadIdTable::insert (atomic CAS only).

The restoreSignalHandler in os_linux.cpp / os_macos.cpp (trivial SIGSEGV/SIGBUS fallback handlers) has no SignalHandlerScope — their bodies are single-statement and call none of the asserted APIs, so no false-positive risk today. Tracked as a follow-up for completeness.

How to test the change?:

  • ./gradlew :ddprof-lib:gtestDebug — full debug suite must pass with no assertion fires
  • ./gradlew :ddprof-lib:gtestDebug_signalSafety_ut — 3 unit tests covering depth symmetry and nesting
  • ./gradlew :ddprof-lib:gtestDebug_dictionary_concurrent_ut — concurrent signal-context bounded_lookup + dump-thread clear; assertions in inserting overloads must not fire on the signal side

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14763

@jbachorik jbachorik added the AI label May 24, 2026
@datadog-official

This comment has been minimized.

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 25, 2026

CI Test Results

Run: #26421841285 | Commit: d7f68a4 | Duration: 12m 0s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-05-25 22:29:30 UTC

@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 2 times, most recently from 5e1f30d to 23b1e2d Compare May 25, 2026 11:13
@jbachorik jbachorik marked this pull request as ready for review May 25, 2026 11:38
@jbachorik jbachorik requested a review from a team as a code owner May 25, 2026 11:38
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23b1e2d147

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/libraries.h Outdated
@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 3 times, most recently from 00d3527 to 9365886 Compare May 25, 2026 13:13
Adds a runtime sanitizer for the profiler's signal handlers via a
thread-local depth counter, an RAII guard at each handler entry, and
DEBUG_ASSERT_NOT_IN_SIGNAL() at known async-signal-unsafe APIs.  In
debug/ASAN builds the assertion aborts with a file:line diagnostic and
writes /tmp/signal-safety-violation.txt for CI to upload as an artifact.

Three macros encapsulate the depth counter — production code never
touches the counter directly:

  SIGNAL_HANDLER_GUARD()                — RAII increment/decrement
  SIGNAL_HANDLER_GUARD_RELEASE()        — manual early release before
                                          chaining to handlers that may
                                          longjmp through us
  SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() — decrement at setjmp landing

The depth counter and isInSignalContext()/getInSignalDepth() accessors
are always on (~2 thread-local writes per signal handler invocation) and
live in guards.{h,cpp} so production code can query signal-context state
for AS-safe deferred paths (used by Profiler::dlopen_hook in the
companion commit).  Only the DEBUG_ASSERT_NOT_IN_SIGNAL diagnostic macro
is conditional on NDEBUG/ASAN_ENABLED; in release builds it expands to
a no-op.

Wired into all 10 installed signal handlers (ITimer, ITimerJvmti,
CTimer, CTimerJvmti, PerfEvents, WallClockASGCT, WallClockJvmti,
segvHandler, busHandler, wakeupHandler).

Assertions placed at:
  Dictionary::lookup (inserting overloads), Dictionary::clear,
  Recording::writeClasses, FlightRecorder::recordDatadogSetting,
  FlightRecorder::recordHeapUsage, Mutex::lock.

Two longjmp-aware fixes:
  - J9 SIGSEGV null-pointer-check handler: segv/busHandler use
    GUARD_RELEASE() before chaining to orig_segvHandler/orig_busHandler,
    since the chained handler may siglongjmp and skip our destructor.
  - HotSpot checkFault longjmp: walkVM's setjmp landing uses
    UNWIND_AFTER_LONGJMP() to undo the increment from the unwound frame.

CI test_workflow.yml uploads /tmp/signal-safety-violation.txt as an
artifact when test jobs fail on glibc amd64/aarch64.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 3 times, most recently from 2ee2f1f to aeebd96 Compare May 25, 2026 16:37
jbachorik and others added 2 commits May 25, 2026 18:41
dlopen_hook was performing file I/O, malloc, and mutex acquisition
(parseLibraries, patch_sigaction, installHooks) potentially from
async-signal context — e.g. when the JVM's DWARF unwinder lazily loads
libgcc_s during stack walking on J9/aarch64, our PLT-patched dlopen
hook fires from within a signal handler.  Other JVM-internal lazy
loads (observed under Graal on aarch64 in long-running tests) can
trigger the same path.

Three-pronged fix:

  1. Force libgcc_s.so.1 to load during Profiler::start() via a plain
     dlopen on the init thread (Profiler::prewarmUnwinder).  Once
     libgcc_s is mapped, the JVM's later resolve finds it already
     loaded and the known lazy-load path that would invoke our hook
     from signal context never runs.

  2. Guard dlopen_hook with isInSignalContext() so any *other* lazy
     load from signal context is detected and refresh() is deferred.
     refresh() acquires Mutex and calls malloc, both AS-unsafe.  The
     dlopen_hook in signal context now calls Libraries::markDirty()
     (a single atomic store, AS-safe) instead.

  3. Background refresher thread in Libraries polls the dirty flag
     every REFRESH_INTERVAL_NS (5 s) and runs refresh() when it sees
     pending work.  Bounds the window during which a lazily-loaded
     library is unresolvable by the stack walker — without the
     refresher the gap would be the entire chunk-flush interval
     (≥30 s).  The thread runs only between Profiler::start and stop;
     wake-up on stop is via SIGIO/pthread_kill (same pattern as the
     wallclock thread) so shutdown latency is bounded by syscall
     interruption, not the tick interval.

The SONAME "libgcc_s.so.1" is hardcoded by necessity: the release build
links with -static-libgcc, so referencing _Unwind_Backtrace would not
materialize libgcc_s.so as a NEEDED dependency — only dlopen by SONAME
can map the shared object.  libgcc_s.so.1 has been the stable SONAME
since 2002; a bump would constitute a C++ ABI break.

Libraries::refresh() encapsulates updateSymbols + patch_sigaction +
installHooks + (optional) updateBuildIds, and clears the dirty flag
on entry so concurrent dlopen_hook markDirty calls during refresh
re-arm us for the next tick.  remote_symbolication state moves into
Libraries via setRemoteSymbolication() so the refresh path is
self-contained and dlopen_hook doesn't need to reach into Profiler.

Also adds Mutex::tryLock() (wraps pthread_mutex_trylock, which is on
the POSIX async-signal-safe list) as a primitive available for future
deferred-path work.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
JUnit 5.10+ dropped Java 8 support; 5.12+ dropped Java 11; 6.x requires
Java 17.  The Java 8 and 11 CI targets (both HotSpot and J9) run the
Gradle test worker on the test JDK itself — the profiler attaches to its
own process — so JUnit Platform classes must be loadable on Java 8/11.

Revert libs.versions.toml to junit 5.9.2 / junit-platform 1.9.2 (the
last known-working pair).  Add Dependabot ignore rules to prevent
automated bumps past 5.9.x / 1.9.x.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch from aeebd96 to 9103d9c Compare May 25, 2026 16:43
jbachorik and others added 6 commits May 25, 2026 19:00
Each gtest test file was being compiled alongside all 59 library sources,
so the per-config build cost was O(n_tests * n_sources). On slow CI
runners this turned a debug gtest run into a multi-minute rebuild even
when only one test file changed.

Register a single compileGtestLibrary<Config> task that compiles the
library sources once per build configuration. Each test's compile task
now compiles only its own .cpp; each link task pulls in both its own
objects and the shared library objects via the new
GtestTaskBuilder.withSharedLibObjects() hook.

Also adds a buildGtest<Config> aggregation task that runs compile + link
without execution, for CI flows that invoke binaries directly.

Ported from PR #539 (cherry-picked the build-cache portion only;
sanitizer-runtime fixes from that PR are out of scope here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local docker test runner only handled Temurin/Bellsoft/Semeru JDKs,
so the 17-graal / 21-graal / 25-graal variants used by CI could not be
reproduced locally. CI installs Graal via SDKMAN, but adding SDKMAN to
the docker image is heavier than needed.

Add get_graal_jdk_url() pointing at the Oracle GraalVM archive URLs and
route --jdk=*-graal through it, mirroring the J9/Semeru variant.
Versions match the CI's JAVA_<v>_GRAAL_VERSION values so docker
reproduction is faithful to what CI runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The always-on _in_signal_handler_depth TLS variable, accessed first from
inside our SIGPROF/SIGVTALRM handlers via SIGNAL_HANDLER_GUARD(), was
declared with the default global-dynamic TLS model.  On first access in
a given thread glibc lazily allocates the dtv slot via malloc() and
takes the heap lock — both async-signal-unsafe.

Reproduced deterministically on Graal aarch64 (glibc 17-graal debug)
running ClassGCTest: SIGPROF arrived on the VM Thread while Graal's
JVMCI compiler held the malloc heap lock through
c2v_notifyCompilerPhaseEvent.  Stack:

  #2  __libc_malloc                    -- waiting on heap lock
  #4  allocate_dtv_entry
  #7  _dl_tlsdesc_dynamic
  #8  TLS wrapper for _in_signal_handler_depth
  #9  SignalHandlerScope::SignalHandlerScope
  #10 CTimer::signalHandler

The heap holder is itself blocked at a safepoint waiting for VM Thread
to check in, and VM Thread is stuck in malloc -> full process deadlock.

Switch the variable to the initial-exec TLS model so the loader
allocates its slot from the static TLS surplus at libjavaProfiler.so
load time.  Every existing thread is fixed up at dlopen and every new
thread receives the slot at pthread_create.  Access is then a
register-relative load — async-signal-safe, lock-free, malloc-free.

Also narrow the type to uint8_t (realistic max depth ~3) to make the
intent explicit; alignment-wise this is the same slot.

Refresher tick reduced from 5 s to 500 ms so a library lazily loaded
from signal context becomes resolvable by the stack walker within
half a second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous thread_local + initial-exec TLS approach failed two ways:

  - default global-dynamic TLS lazily allocates the dtv slot via malloc
    on first access, deadlocking when the heap lock holder waits for a
    safepoint (observed on Graal aarch64 in ClassGCTest)
  - initial-exec fixes the malloc but consumes static TLS surplus.  On
    Graal that surplus is already exhausted by libjvmcicompiler.so, so
    libjavaProfiler.so fails to dlopen with "cannot allocate memory in
    static TLS block"

Move the counter into ProfiledThread::_signal_depth, which is reached
via pthread_getspecific — POSIX-guaranteed AS-safe and never allocates.
Once a ProfiledThread exists for the calling thread, depth tracking is
a plain byte read/write on the owning thread (signal handlers run on
the interrupted thread, so no cross-thread synchronization is needed).

Treat ProfiledThread == null as "assume in signal" for production
gates (Profiler::dlopen_hook now defers refresh() conservatively when
running on uninstrumented JVM threads — VM Thread, JIT, GC).  Same
threads were the deadlock source under the previous design.

DEBUG_ASSERT_NOT_IN_SIGNAL skips its check when ProfiledThread is null
so AS-unsafe-but-legitimate code on uninstrumented threads doesn't
trip a false-positive abort.

Refresher tick stays at 500 ms (set in previous commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review surfaced 7 findings — 4 CONFIRMED, 3 PLAUSIBLE.  This
commit addresses all 7:

  #8 (CONFIRMED) — split isInSignalContext into a strict variant.
     dlopen_hook was treating null ProfiledThread as "in signal" and
     deferring refresh on every dlopen from uninstrumented JVM threads
     (VM Thread, JIT, GC), delaying wasmtime sigaction patching by up
     to 500 ms.  Add isInTrackedSignalContext() that returns false on
     null (only true when one of our SignalHandlerScopes is positively
     on the stack); dlopen_hook now uses it so JVM-internal threads
     get synchronous refresh again.  isInSignalContext() retains its
     conservative semantics for any future caller that wants
     AS-safe-by-default.

  #7 (CONFIRMED) — switchLibraryTrap was called before startRefresher
     despite the invariant comment.  Reorder so the refresher is
     running before the trap can fire.

  #9 (CONFIRMED) — DEBUG_ASSERT_NOT_IN_SIGNAL was on the 1-arg and
     2-arg Dictionary::lookup overloads but missed the 4-arg form
     that actually mallocs.  bounded_lookup's runtime-decided
     for_insert path was uncovered.  Move the assertion into the
     4-arg lookup, gated on for_insert (read-only lookups are
     AS-safe).

  #4 (CONFIRMED) — Comments referenced "REFRESH_INTERVAL_NS (5s)"
     but the actual constant is 500 ms.  Fix both stale mentions.

  #13 (PLAUSIBLE) — SIGNAL_HANDLER_GUARD_RELEASE before chaining
     leaves depth == 0 inside a chained handler that returns
     normally; DEBUG_ASSERT_NOT_IN_SIGNAL inside such a handler
     would not fire.  Document the trade-off in segvHandler — the
     longjmp safety property is more important than the sanitizer
     coverage gap, which is bounded to third-party signal handler
     code we don't own.

  #2 (PLAUSIBLE) — refresherLoop used OS::sleep without an EINTR
     loop; any unmasked signal (SIGCHLD, SIGURG, RT signals) would
     cause premature ticks.  Wrap the sleep in an explicit
     elapsed-time loop using OS::nanotime so the refresher ticks at
     500 ms regardless of stray signals.

  #14 (PLAUSIBLE) — refresherLoop published _refresher_tid before
     blocking SIGPROF/SIGVTALRM; a stale per-thread timer from a
     previous lifecycle could fire on the refresher during the
     window.  Block signals first, then publish the TID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move the EINTR-resilient deadline sleep out of Libraries::refresherLoop
  into a new OS::sleepWhile(max_nanos, std::atomic<bool>&) primitive
  (clock_nanosleep TIMER_ABSTIME on Linux, nanosleep + nanotime on macOS).
- Delete isInSignalContext(), Mutex::tryLock(), OS::uninterruptibleSleep —
  zero callers after the earlier switch to isInTrackedSignalContext.
- Cover SignalHandlerScope::release() and signalHandlerUnwindAfterLongjmp()
  with unit tests; both were stub-replaceable before.
- Upload /tmp/signal-safety-violation.txt on musl CI jobs as well.
- Refresh the guards.h doc block to describe the current null-PT policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik merged commit 827c2ac into main May 26, 2026
104 checks passed
@jbachorik jbachorik deleted the worktree-prof-14763-as-safety-sanitizer branch May 26, 2026 07:46
@github-actions github-actions Bot added this to the 1.43.0 milestone May 26, 2026
@jbachorik jbachorik changed the title feat(profiler): async-signal-safety sanitizer (PROF-14763) feat(profiler): async-signal-safety sanitizer May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant