feat(aws): cross-invocation tracecontext propagation#8182
Benchmarks
Benchmark execution time: 2026-06-12 17:29:22
Comparing candidate commit ff37f7a in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 1449 metrics, 18 unstable metrics.
Overall package size
Self size: 6.22 MB
Dependency sizes
| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.2 | 85.93 kB | 825.11 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| dc-polyfill | 0.1.11 | 25.74 kB | 25.74 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe
🎉 All green! 🧪 All tests passed 🔗 Commit SHA: ff37f7a
…s-invocation continuity
Persist the current trace context as a synthetic `_datadog_{N}` STEP operation
when the SDK suspends to PENDING, so subsequent invocations (read by the
upstream datadog-lambda-js wrapper) can resume the same trace.
Files:
- src/handler.js: install a hook on the SDK's terminationManager.terminate
inside bindStart. Save fires only for resumable reasons (PENDING_TERMINATION_REASONS
allow-list mirrors the SDK's TerminationReason enum entries that result in
Status: PENDING). Gated by DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED
(default on; opt out with 'false'/'0').
- src/trace-checkpoint.js: NEW. Datadog-only header inject (private
TextMapPropagator with tracePropagationStyle.inject = ['datadog'], shadows
the live tracer config), dedup against prior _datadog_N op via
JSON.stringify-after-stripping-x-datadog-parent-id, deterministic blake2b
stepId so the save is idempotent within an execution.
- test/handler.checkpoint.spec.js: unit tests for the termination hook
(pending vs non-pending reasons, env-var gate, idempotency, default reason).
- test/trace-checkpoint.spec.js: unit tests for the save module
(queue START+SUCCEED before terminating, dedup on parent-id-only changes).
- test/index.spec.js: integration coverage for SDK safe-paths
(single cycle, child-context, step-suspend-step).
- packages/dd-trace/src/config/supported-configurations.json and
generated-config-types.d.ts: register DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED.
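The dedup and idempotency ideas described for src/trace-checkpoint.js can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper names `stripParentId`, `isSameTraceContext`, and `checkpointStepId` are hypothetical, and the hash input and the 32-character truncation are assumptions made for the example.

```javascript
const { createHash } = require('node:crypto')

// Hypothetical helper: drop the one header that changes with every span, so
// two contexts that differ only in parent id compare as equal.
function stripParentId (headers) {
  const { 'x-datadog-parent-id': _ignored, ...rest } = headers
  return rest
}

// Compare via JSON.stringify after stripping the parent id. This relies on
// both objects having been built with the same key insertion order.
function isSameTraceContext (previousHeaders, currentHeaders) {
  return JSON.stringify(stripParentId(previousHeaders)) ===
    JSON.stringify(stripParentId(currentHeaders))
}

// Deterministic step id: the same execution and checkpoint number always hash
// to the same id, so a repeated save targets the same operation.
// Assumption: the real hash input and output length may differ.
function checkpointStepId (executionArn, checkpointNumber) {
  return createHash('blake2b512')
    .update(`${executionArn}:_datadog_${checkpointNumber}`)
    .digest('hex')
    .slice(0, 32)
}
```

With helpers like these, a save would be skipped when `isSameTraceContext(prior, current)` is true, and a retried save within one execution would reuse the id returned by `checkpointStepId`.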
…merScheduler bug
Skip wait_for_callback (happy path) and the entire invoke describe block (happy + error). All three fail deterministically in CI under @aws/durable-execution-sdk-js-testing's current TimerScheduler, whose hasScheduledFunction() undercounts in-flight scheduled functions and trips the test orchestrator's "Cannot return PENDING status with no pending operations." validation.
Production (real AWS backend) is not affected: the validation is mock-only.
Fix is open upstream as aws/aws-durable-execution-sdk-js#544; re-enable these tests once a release containing it is pinned in packages/dd-trace/test/plugins/versions/package.json.
…led guard
The guard was defensive against a "same terminationManager passed to bindStart twice" scenario that cannot happen in the SDK as it stands: each Lambda invocation calls initializeExecutionContext, which constructs a fresh `new TerminationManager()`, so warm starts share the wrapper closure but not the terminationManager instance.
Removing the Symbol + the guard + the explicit "twice across invocations" unit test that only covered a contrived re-entry.
Drive-by: fix four pre-existing space-before-function-paren lint errors in the same file.
…cute span, not its parent
Drop the `getParentSpanId` helper and inline the read directly during
`state` initialization. While inlining, switch the anchor from the
execute span's *parent* (typically `aws.lambda`'s id) to the execute
span's *own* id (`span.context().toSpanId()`).
Why anchor at the execute span:
- It's a span this integration owns and just created, so always defined
and never depends on what upstream context happened to be active when
`bindStart` fired.
- Topology becomes "resumed invocations are continuations of the first
execute" — matching the user-facing model of a single durable
execution. The old shape made resumes look like sibling Lambda
invocations under whatever upstream span happened to be there.
- In the no-upstream case the old code already fell through to the
propagator default (= execute span's own id) via `if (parentId)` —
so this just makes the behavior consistent across environments.
Rename for clarity:
- `saveTraceContextCheckpointIfUpdated`'s `checkpointAnchorSpanId`
parameter -> `firstExecutionSpanId`. JSDoc spells out it's only
consulted on the very first save; once a prior `_datadog_{N}` exists,
the function reuses that checkpoint's `x-datadog-parent-id` verbatim.
- The local `latestParentId` (the value carried forward across saves)
-> `anchoredSpanId`, reflecting that it IS the anchor we've been
using since the first save.
- handler.js's `state.parentSpanId` -> `state.firstExecutionSpanId`.
Note: dd-trace-py's `_resolve_override_parent_id` currently anchors at
the execute span's parent (matching the old JS behavior). A follow-up
should bring Python in line with this change so both languages produce
the same trace shape.
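The anchoring rule in this commit message can be sketched as below. The function name `resolveAnchorSpanId` and the shape of `priorCheckpointHeaders` are assumptions for illustration; only `firstExecutionSpanId`, `anchoredSpanId`, and the `x-datadog-parent-id` header come from the text above.

```javascript
// Hypothetical helper: choose the span id that resumed invocations attach to.
// `priorCheckpointHeaders` is assumed to be the headers object stored in the
// latest `_datadog_{N}` operation, or undefined when none exists yet.
function resolveAnchorSpanId (priorCheckpointHeaders, firstExecutionSpanId) {
  const anchoredSpanId = priorCheckpointHeaders?.['x-datadog-parent-id']
  // Once a checkpoint exists, its parent id is reused verbatim. The execute
  // span's own id is only consulted on the very first save.
  return anchoredSpanId ?? firstExecutionSpanId
}

// Hypothetical helper: overwrite the injected parent id with the anchor so
// every checkpoint in the execution points at the same span.
function buildCheckpointHeaders (injectedHeaders, priorCheckpointHeaders, firstExecutionSpanId) {
  return {
    ...injectedHeaders,
    'x-datadog-parent-id': resolveAnchorSpanId(priorCheckpointHeaders, firstExecutionSpanId)
  }
}
```

Under this sketch, every later invocation carries the first execute span's id forward, which matches the "resumed invocations are continuations of the first execute" topology described above.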
…ainst TimerScheduler bug" This reverts commit 8baa8ce.
…gainst TimerScheduler bug" This reverts commit 748a826.
| "default": "true" | ||
| } | ||
| ], | ||
| "DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED": [ |
Has this landed in other tracers yet? Don't we usually disable it by default in the first iteration so we can try to get customer feedback?
Great question, and totally fair to want a feedback runway on anything new. I'd push back on default-off here though, for a few reasons:
- The feature only delivers its value when it's on by default. The whole point is connecting traces across suspended/resumed invocations so customers see one coherent trace instead of disconnected fragments. If it's opt-in, the overwhelming majority of customers will never discover the flag, and the durable-execution traces they get will look broken (like orphaned spans) by comparison.
- This is trace-context propagation, not a new product surface. The "ship opt-in first, gather feedback" pattern fits net-new features with novel behavior or perf/risk profiles. Cross-invocation propagation is in the same family as our SQS, SNS, Kinesis, and EventBridge context propagation, all of which are on by default and have been for years without flag-related customer issues.
- The safety concern is already covered. The implementation is strictly best-effort: every checkpoint write is wrapped so a failure is logged and swallowed, and it can never break or fail a customer's durable execution. The flag still exists as a kill switch, so anyone who does hit an issue (or just wants it off) can disable it immediately. Default-on doesn't remove the escape hatch; it just changes who has to take action, and with an opt-out flag that is the rare exception rather than everyone.
- Cross-tracer parity. This already landed default-on in dd-trace-py (#17773). Shipping Node default-off would diverge our defaults across languages, which creates an inconsistent customer experience and avoidable support/debugging confusion ("why does my Python durable trace connect but my Node one doesn't?").
- There's no clean path to flip opt-in → opt-out later. Turning a default-off flag on in a later release is itself a behavior change we'd have to risk-assess and communicate, so "default-off now, default-on later" isn't actually lower-risk; it just defers the same change and adds a migration step. We don't have an established convention for that flip, whereas default-on guarded by an opt-out flag is a well-trodden path.

We will definitely keep the kill switch prominent in the docs so the off-ramp is obvious. We will mention it in the public documentation, the release notes, and the README.md files. Given that it mirrors our existing propagation defaults, is best-effort by construction, and already matches Python, I think default-on is the right call here. Let me know if I'm missing context on the feedback process you had in mind though.
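For reference, the "default on, opt out with 'false'/'0'" semantics of the kill switch could look roughly like this. It is only a sketch of the semantics: the function name is hypothetical, the trimming and lowercasing are assumptions, and the tracer's real parsing lives in its configuration layer and may differ.

```javascript
// Hypothetical parser for the opt-out flag: enabled unless the variable is
// explicitly set to 'false' or '0'.
function isCrossInvocationTracingEnabled (env = process.env) {
  const raw = env.DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED
  // Unset means the default applies, which is "on".
  if (raw === undefined) return true
  // Assumption: whitespace and letter case are ignored.
  const value = raw.trim().toLowerCase()
  return value !== 'false' && value !== '0'
}
```

A customer who wants the feature off would set `DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED=false` on the function; everyone else gets connected traces without taking any action.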
BridgeAR
left a comment
It would be nice to look in the comments before landing :)
| await saveCheckpoint(checkpointManager, executionArn, newNumber, currentHeaders) | ||
| } catch (e) { | ||
| log.debug('Failed to save trace context checkpoint', e) |
I believe if we write the code itself in a way that it is defensive, we do not need to wrap it in an additional try/catch :)
Wrapping methods in a try catch is something I would rather not do (and I believe we mostly do not do that in other spots).
Good call — I dropped the method-level try/catch in saveTraceContextCheckpointIfUpdated and made the helpers propagate normally.
The one bit of handling I brought back is the .catch() at the call site in maybeSaveCheckpoint. I don't think we can make that one go away with defensive code: the failure mode isn't bad input we can guard against, it's the SDK's checkpointManager.checkpoint() rejecting at runtime. Since #onTerminate fires this as void maybeSaveCheckpoint(...), an unhandled rejection there would surface in the customer's Lambda. So the .catch() is just fire-and-forget hygiene at the boundary rather than a try/catch wrapping the logic — which I think matches what you were after.
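A minimal sketch of that fire-and-forget boundary, assuming simplified signatures: `maybeSaveCheckpoint` and `onTerminate` here are stand-ins with hypothetical parameters, and the `checkpoint()` call omits its real arguments.

```javascript
// Hypothetical stand-in for the plugin's save routine. In the PR the real
// work ends in the SDK's checkpointManager.checkpoint(), which can reject.
async function maybeSaveCheckpoint (checkpointManager) {
  // Arguments omitted; the real call passes a step id and an operation.
  await checkpointManager.checkpoint()
}

function onTerminate (checkpointManager, log) {
  // `void` marks the promise as intentionally not awaited. The `.catch()`
  // keeps a runtime rejection from surfacing as an unhandled rejection in
  // the customer's Lambda; the helpers themselves propagate errors normally.
  void maybeSaveCheckpoint(checkpointManager).catch(error => {
    log.debug('Failed to save trace context checkpoint', error)
  })
}
```

The handling sits at the single call site where the promise is dropped, rather than inside the save logic.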
| } | ||
|
|
||
| const originalTerminate = terminationManager.terminate | ||
| terminationManager.terminate = function (...terminateArgs) { |
I am unsure: who defined this terminate method / the terminationManager? Is the manager potentially defined by our users?
This is a great question. It's definitely worth nailing down precisely.
The terminationManager isn't user-defined — it's owned entirely by the SDK.
- The channel we hook wraps the SDK's internal runHandler(event, context, executionContext, durableExecutionMode, checkpointToken, handler) (index.js:4452).
- args[2] is the SDK's executionContext.
- args[5] is the only customer-supplied argument (their handler).
So customers only ever call withDurableExecution(handler) — they can't inject or define this object.
And it's also constructed fresh on every invocation: withDurableExecution returns async (event, context) => …, which calls initializeExecutionContext(...), which builds a new executionContext with terminationManager: new TerminationManager() (index.js:4436) each time. terminate is then invoked by the SDK's own CheckpointManager.executeTermination.
Because it's a fresh instance per invocation, we wrap exactly once and nothing accumulates or leaks across invocations. We also capture-and-delegate (originalTerminate.apply(...)), so we compose with any pre-existing wrap rather than replacing it - the only caveat is the generic one where a different non-delegating patcher on the same method could conflict, which is very unlikely on an SDK-internal object only runHandler ever sees.
The wrap itself is also defensive: we bail if terminationManager.terminate isn't a function (line 63), and we always call originalTerminate.apply(this, terminateArgs), so the SDK's termination behavior is preserved unchanged — we just enqueue a best-effort checkpoint first.
And this isn't just reasoned from reading the source: the trace-checkpoint propagation integration tests in index.spec.js drive the real @aws/durable-execution-sdk-js (the version pinned in our test matrix) through actual suspend/resume cycles — real TerminationManager, real CheckpointManager.executeTermination — and assert the datadog checkpoint is written. So this path is covered against the actual SDK, not a mock, which also guards against the wrapping assumptions drifting if the SDK changes.
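The capture-and-delegate pattern described in this reply can be sketched as follows. `wrapTerminate` and `beforeTerminate` are hypothetical names, and the sketch assumes `beforeTerminate` does not throw synchronously (for example because it only starts a fire-and-forget save).

```javascript
// Hypothetical installer. `beforeTerminate` stands in for enqueueing the
// best-effort checkpoint save.
function wrapTerminate (terminationManager, beforeTerminate) {
  const originalTerminate = terminationManager?.terminate
  // Bail out if the SDK object does not have the shape we expect.
  if (typeof originalTerminate !== 'function') return false

  terminationManager.terminate = function (...terminateArgs) {
    beforeTerminate(...terminateArgs)
    // Delegate with the original receiver and arguments so the SDK's
    // termination behavior is preserved, including its return value.
    return originalTerminate.apply(this, terminateArgs)
  }
  return true
}
```

Because the original function is captured rather than replaced outright, this composes with a wrap that was installed earlier on the same method.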
Thank you, that seems fine! Just as above: I noticed this is happening in the plugin and wrapping should always happen inside of our instrumentations. Please move this there :)
That's a really important find. I basically overlooked it coming from the POC.
…joey/cross-invocation-tracecontext-propagation
… in bindStart
Extract #shouldInstallTerminationHook so bindStart decides whether to install the cross-invocation checkpoint hook, instead of the hook self-gating with early returns. The hook now assumes its preconditions and recomputes the handler args, execute span, and termination manager it wraps.
Stub operationName() in the handler checkpoint spec: it reaches into the tracer's nomenclature, which the bare tracer stub doesn't provide.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…js so that it's only done once instead of every invocation
Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
BridgeAR
left a comment
Mostly LGTM! Only a few small things are left that would be nice to clean up before landing
…er.js Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
…-checkpoint.js Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
…-checkpoint.js Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
BridgeAR
left a comment
LGTM, thanks for the quick follow-ups
…-checkpoint.js Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
1780848
into
joey/apm-ai-toolkit/aws-durable-execution-sdk-js
* workflow(aws-durable-execution-sdk-js): install_package
* workflow(aws-durable-execution-sdk-js): generate_app
* workflow(aws-durable-execution-sdk-js): compile
* workflow(aws-durable-execution-sdk-js): test:att1:iter1:fixer
* workflow(aws-durable-execution-sdk-js): test:att1:iter2:fixer
* workflow(aws-durable-execution-sdk-js): feature_implement
* workflow(aws-durable-execution-sdk-js): get_lint_failures
* workflow(aws-durable-execution-sdk-js): lint_and_fix:att1:iter1:fix_lint_errors
* workflow(aws-durable-execution-sdk-js): review_cycle:att1:iter1:batch_fix
* remove the unnecessary dd-api-key
* clean up # Conflicts: # index.d.ts
* yarn.lock changed...
* fixing yarn.lock
* remove the unintended finish() guard
* update span names
* use a fixed service name instead
* update resource names
* naming consistency
* small fix
* Python PR parity
* Undo unnecessary changes
* Finish error spans on asyncEnd
* Simplify orchestrion file
* Class/file name changes
* Several simplifications and improvements
* Do not explicitly set component
* Remove includeReplayedTag
* Smaller simplifications
* Tests simplification
* chore: update supported-integrations
* More test simplification
* Add aws.durable.operation_id and aws.durable.operation_name
* Fix checks
* Linter
* Test simplifications
* More test improvements
* Lazy thenables + only close this integration's spans
* Code simplifications
* Fix rebase
* Mirror changes in v5
* Test waitForCondition happy path
* Comment improvements based on guidelines
* supress child context for WaitForCallback
* Increase tested version
* Address review comments
* Avoid patching on the plugin by creating a "settle" channel
* Do not skipTime to avoid interfering with tracer's timers
* Fix test flakiness
* Test durable-execution-sdk-js only on node >=22
* Linter
* Rename context variables
* Lint
* Add esbuild bundling acceptance test
* ESM smoke tests
* Update packages/datadog-plugin-aws-durable-execution-sdk-js/src/handler.js
  Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
* Update packages/datadog-plugin-aws-durable-execution-sdk-js/src/checkpoint.js
  Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
* Address various review comments
* Operation name
* feat(aws): cross-invocation tracecontext propagation (#8182)
  Persist the current trace context as a synthetic `_datadog_{N}` STEP operation when the SDK suspends to PENDING, so subsequent invocations (read by the upstream datadog-lambda-js wrapper) can resume the same trace.
  ---------
  Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
  Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Pablo Martínez Bernardo <pablo.martinezbernardo@datadoghq.com>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
Co-authored-by: pablomartinezbernardo <134320516+pablomartinezbernardo@users.noreply.github.com>
Co-authored-by: Ruben Bridgewater <ruben@bridgewater.de>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
https://datadoghq.atlassian.net/browse/APMSVLS-493
Adds cross-invocation trace-context continuity for the @aws/durable-execution-sdk-js integration. Each invocation of a durable execution now writes `_datadog_{N}` checkpoints on suspend when the trace context changes, so subsequent invocations of the same execution can extract the trace context from the checkpoints and attach to a common anchor span.
NOTE: The extraction side of these checkpoints is in DataDog/datadog-lambda-js#774.
Motivation
A durable execution is a logically single workflow that the SDK transparently runs across N Lambda invocations (suspending on ctx.wait, ctx.waitForCallback, ctx.invoke, retries, etc.).
Before this PR, dd-trace produced one isolated trace per physical invocation. Customers couldn't see the workflow end-to-end in APM.
This PR makes those invocations show up under a single anchor, while leaving the per-invocation aws.durable.execute spans intact.
Without this, every resume of a suspended durable execution starts a fresh, unconnected trace.
Changes
New module: packages/datadog-plugin-aws-durable-execution-sdk-js/src/trace-checkpoint.js
operation via the SDK's checkpoint manager.
anchor verbatim.
_installTerminationCheckpointHook will: checkpointManager.checkpoint(stepId, START) + checkpoint(stepId, SUCCEED) calls.
Tests
Why some tests are skipped
All three tests race against a `TimerScheduler` bug in `@aws/durable-execution-sdk-js` that is fixed upstream in aws/aws-durable-execution-sdk-js#544. Once that fix is published in a release we pin against, the skips will be removed.
The race only manifests at the suspend → resume boundary when resume is driven externally (by `sendCallbackSuccess()` for `wait_for_callback`, or by the target function completing for chained `invoke()`). Timer-driven resumes (`ctx.wait`, `ctx.waitForCondition`, etc.) take a single, ordered code path through `TimerScheduler` and are unaffected.
This PR adds the cross-invocation trace-context checkpoint hook. On every PENDING termination, the hook calls back into the SDK via `checkpointManager.checkpoint(stepId, START)` + `checkpoint(stepId, SUCCEED)`. That extra async work overlaps with the SDK's own pending-state bookkeeping at the same boundary where `TimerScheduler` is coordinating drain, which is precisely the state machine #544 cleans up.
Why production is unaffected
The `TimerScheduler` code path involved is only used by `@aws/durable-execution-sdk-js-testing`'s `LocalDurableTestRunner` (simulated clock + in-process callback resolution). Real Lambda invocations don't drive resume through `TimerScheduler`: the resume of a suspended execution is a fresh invocation initiated by the Durable Execution service, not a continuation inside the same process. The checkpoint writes themselves complete normally; what races is the test harness's observation of the resumed invocation.
Other Notes
REPLAY -> NEW transitions
In dd-trace-py PR-17773, the `mark_trace_context_checkpoints_visited` method was added to address a glitch caused by the `_datadog_{N}` steps we added. That issue doesn't exist for Node.js.
Python SDK pattern:
visited set.
Node.js SDK pattern (durable-context.ts:209):
So our synthetic ops are invisible to the SDK's replay-mode bookkeeping — they can't keep it stuck in REPLAY.