Provide ability to capture continuations using the Context API#11663
Provide ability to capture continuations using the Context API#11663mcculls wants to merge 11 commits into
Conversation
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR extends the datadog.context API to support capturing/resuming execution context via a new ContextContinuation abstraction, and wires dd-trace’s scope continuation mechanism to also implement that API (so Context.current().capture() can hold traces similarly to the legacy tracer continuation).
Changes:
- Introduces
ContextContinuation+ContextListenerand addsContext.capture()/ContextManager.capture(...)/ listener registration APIs. - Implements
capture(...)for both the defaultThreadLocalContextManagerand dd-trace’sContinuableScopeManager, and makesAgentScope.Continuationalso implementContextContinuation. - Adds extensive tests for continuation semantics, noop scopes, and listener lifecycle events.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal-api/src/main/java/datadog/trace/bootstrap/instrumentation/api/AgentScope.java | Makes AgentScope.Continuation also implement ContextContinuation and maps methods (resume→activate, release→cancel). |
| dd-trace-core/src/main/java/datadog/trace/core/scopemanager/ContinuableScopeManager.java | Adds ContextManager.capture(Context) implementation and an addListener(...) stub. |
| dd-trace-core/src/test/java/datadog/trace/core/scopemanager/ScopeManagerTest.java | Adds tests verifying Context API capture/continuation behavior holds/unblocks traces and activates spans. |
| components/context/src/main/java/datadog/context/Context.java | Adds Context.capture() default method. |
| components/context/src/main/java/datadog/context/ContextManager.java | Adds capture + listener APIs and a static register(ContextListener) helper. |
| components/context/src/main/java/datadog/context/ContextContinuation.java | New public API for capturing/resuming context across execution units. |
| components/context/src/main/java/datadog/context/ContextListener.java | New public listener API for attach/detach/capture/release events. |
| components/context/src/main/java/datadog/context/ThreadLocalContextManager.java | Implements capture/resume/release semantics and listener notification for the default manager. |
| components/context/src/main/java/datadog/context/NoopContextScope.java | New cached noop ContextScope implementation for same-context attach/resume no-ops. |
| components/context/src/main/java/datadog/context/EmptyContextContinuation.java | New noop continuation for root/empty context. |
| components/context/src/main/java/datadog/context/EmptyContext.java | Makes EmptyContext constructor private (enforces singleton). |
| components/context/src/main/java/datadog/context/TestContextManager.java | Delegates new capture and addListener APIs to the underlying manager. |
| components/context/src/test/java/datadog/context/ContextContinuationTest.java | New comprehensive tests for continuation lifecycle, concurrency, holds, and event ordering. |
| components/context/src/test/java/datadog/context/ContextManagerTest.java | Adds tests for noop-scope caching/collisions and listener behaviors. |
| components/context/src/test/java/datadog/context/ContextProvidersForkedTest.java | Updates custom manager stub to implement new APIs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
There was a problem hiding this comment.
Early review of the API with a lot of refinement comments - trying to nail the API right before merging it. The implementation feels quite close to the current one so far. It's also nice to see the AgentSpan.Continuation nicely retrofitted to the new ContextContinuation.
Glad to see this piece of context coming! 🤝
396e8a0 to
c8de592
Compare
|
By the way, my comments only are early feedback. I don't expect to have them to get answer in timely manner. |
2754977 to
d40fa11
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d40fa11a82
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
dougqh
left a comment
There was a problem hiding this comment.
I appreciate the work to move away from the old OpenTracing model towards OpenTelemetry.
I do think OpenTelemetry is genuinely a better model. ContextManagers never quite solved the hybrid async environments that automatic instrumentation encounters.
As we do the move, I do want to take care that we preserve the safety measures that were added previously.
Those were...
1 - stack depth limit
2 - unfinished continuations are tolerated
3 - scopes were allowed to be closed across threads
4 - out-of-order scope close was tolerated
In the past, I know that our instrumentations contributed to those problems, but I think we still need to be careful about these things because of outside / manual instrumentation. There have been issues in those areas as recently as last year.
Based on an initial review, it looks like we’ve lost some of these guards. I’m not sure that we need all of them, but I do think we need a depth limit and tolerance for unfinished continuations.
I'll need some time to review this more comprehensively, but I wanted to share that concern up front.
amarziali
left a comment
There was a problem hiding this comment.
Thanks. Looks good to me
Previously we needed our own limit because scopes always escaped onto the heap via our separate scope stack. Now scopes naturally reside on the thread's own call stack. Given this I honestly believe we don't need the overhead of enforcing a separate stack limit and can rely on the JVM's own limits.
The new code uses the same basic continuation algorithm, so should tolerate the same scenarios.
This came with caveats - the first thing the old The old code did update the scope's internal book-keeping regardless which thread closed it - but in practice this doesn't solve the issue of stale scopes left in the scope stack until the original thread happens to trigger cleanup. This is not an issue when we store scopes on the actual call stack, we naturally get cleanup of them as the call stack unwinds. The new scopes have a minimum amount of state / book-keeping - basically just a flag to record when a close attempt was successful. This means the scope tolerates accidental extra calls to close it, firing only when the expected state matches. A feature we now get which wasn't previously available is when we get to a known safe point, we can potentially 'swap' in (i.e. reset to) the expected context. This is possible because we don't have a separate scope stack which may contain a mix of stale and valid scopes. We just have the natural call stack and a notion of the currently attached context.
We have tests that specifically cover out-of-order close. We do stop an out-of-order close from changing the currently attached context if it doesn't match what we expected, but that's intentional. We'll likely introduce a special mode to track out-of-order close calls back to the originating method/instrumentation like in OTel (such features are usually off by default for performance reasons.) There are still situations where a misbehaving instrumentation could leave a context attached - this is the equivalent of an instrumentation leaving an unclosed scope on the stack in the old code, which wasn't recoverable. But now we have the ability to safely reset to the right context (at known safe points) which we don't have with the old approach. If there is a leak then the next time that thread handles a request, a new context will attach and overwrite any stale context - from that point on (assuming the instrumentations for that request behave) the context will properly attach and detach. Only once the request unwinds completely will the stale context "re-appear". This also means that unlike the old approach we won't continually accumulate stale contexts as we don't have the separate scope stack. We'd only ever have potentially one stale context lingering per-thread, and that stale context would not impact any request handling on top of it. Reminder that this code will no doubt evolve as we migrate away from the old approach. But fundamentally the old scope stack will be going away, as it limits how we can model more interesting async primitives. This also gives us a cleaner/simpler foundation to build on - it's not the destination, but a stepping stone. |
PerfectSlayer
left a comment
There was a problem hiding this comment.
Looking great! 👍
Not much important feedback, mostly about testing and mostly polish about coverage and being explicit.
I like having the API interop tested (hopefully we wont have to maintain it long).
|
@dougqh I'd like to merge this today unless there are obvious blockers response from Doug (which somehow ended up edited into my comment?)
|
| * }); | ||
| * }</pre> | ||
| * | ||
| * <p>If a continuation is never resumed (e.g. a task is cancelled before it runs), you must release |
There was a problem hiding this comment.
I'm concerned by the requirement to explicitly release.
There are places where that's hard to do.
I also think it complicates handling of dispatch loops. For dispatch loops, I rather guard the loop with ignoring incoming context - rather than stopping it on the originating end.
I say that because dispatch loops are often set-up lazily with a varied set of construction points, so defending against all of them seems messier.
I think if we could come up with a system without this requirement; it would clean-up instrumentation considerably. Right now, I feel like we're taking a step backwards.
There was a problem hiding this comment.
TBH that decision is best left up to the consumer of the capture and release events.
The code here as it stands won't cause any resource leak per se (the continuation is not kept in any internal cache) - what it will do is delay calling the release event until the conditions have been met.
Now the tracer, as a consumer of that event, can decide to forgo waiting on the release event - if it determines that there is either no need to hold back the trace, or that continuing to wait will cause resource issues. It is better placed to make that decision because it knows how many traces are pending, etc.
This avoids overcomplicating the context layer while still allowing downstream consumers to decide how to handle long-running situations. The alternative would be to try and use some kind of weak reference, which would introduce other issues around liveliness.
There was a problem hiding this comment.
I think it would be better to complicate the context layer rather than complicating the numerous instrumentations. But mostly, I just wish we explored this as a group before the decision was made.
There was a problem hiding this comment.
No, this is about the consumer of the events - i.e. the tracer core - not instrumentations.
Instrumentations are what trigger these events, which is why I want to keep the context layer simple - rather than second guessing when to bypass the "rules" and send additional events to the tracer core.
There was a problem hiding this comment.
I'm not proposing that the context layer second guess when to bypass the rules.
But what I'm saying is that a rule that assumes that the user code is well-behaved is problematic, I don't regard that as a bypass. For me, it is just about maintaining our requirement that we "do no harm" even when the user code is misbehaving.
So I'm just asking for the system to be resilient to misuse, that doesn't necessarily mean doing some fix-up -- just not allowing for runaway memory consumption.
|
|
||
| ContextListener[] ls = listeners; | ||
| notifyDetach(previous, ls); | ||
| holder.current = context; |
There was a problem hiding this comment.
It looks like this can grow in and unbounded fashion.
We need to add a guard against that.
There was a problem hiding this comment.
Technically this is guarded by the call stack size - also I would prefer not to add book-keeping until we verify it is needed because it a) can end up consuming resources more than protecting them and b) can lead to objects escaping from the stack which is worse then if we'd kept things simple.
There was a problem hiding this comment.
Doesn't that presume that the end user uses the API correctly, I don't think that's a safe assumption.
We're really not suppose to have any unbounded data structures, so the Context chain is worrying from my perspective. The fact that is now Context and not Scope doesn't change the situation.
There was a problem hiding this comment.
Context scopes are not collected in this layer because the scope stack has gone away - the only place they can accumulate is on the call stack. If a method erroneously calls activate multiple times and throws the scope away then there won't be a leak of scopes because there is no scope stack pinning them in the heap.
To leak context scopes you would need to call the API multiple times with different contexts, and keep the scopes alive in a way where they escape onto the heap. I'm just wary that by re-introducing state back into scopes (or any kind of "stack") that we'll go back to heap-allocating scopes which results in more allocations and more garbage to collect vs. the simple approach used here.
There was a problem hiding this comment.
Yes, what you are saying is true for Scopes; that's correct. However, the Context linked list is itself unbounded. While misuse may be rare or hard, I still think we should guard against it.
And I'm not proposing putting state back into Scope or Continuation. Continuation need only be an immutable reference to Context.
I'm actually trying to move to a model with less state in the user visible objects by moving the linked list out of Context.
There was a problem hiding this comment.
there is no linked list in Context - Context is an immutable object with no links to other contexts
| @Override | ||
| public void close() { | ||
| if (!closed) { | ||
| // check for out-of-order close to avoid corrupting the current state |
There was a problem hiding this comment.
Even if we don't have the same degree of forgiveness in the API, I'd at least like to see the strict mode restored so we can diagnose problems.
There was a problem hiding this comment.
Strict mode will be added in a separate PR
There was a problem hiding this comment.
What's the time frame? Before the next release?
There was a problem hiding this comment.
Strict mode will be added before ThreadLocalContextManager becomes the default implementation.
The default manager for the next release will still be ContinuableScopeManager. This PR is a migration step to get instrumentations away from AgentScope and related structures and towards a lighter API.
This PR does not change the default implementation. It is extending the API to help move instrumentations away from the complex Agent API onto a slimmer more focused API. It does extend I also disagree that the data structure is unbounded, but that's best discussed off-line. Benchmarks so far show that the simpler approach actually reduces memory and overhead due to fewer objects escaping onto the heap. |
Benchmarking shows these scopes aren't alive long enough to pay for the caching
db47043 to
1d73e24
Compare
|
For the most part, I think we should settle the design before discussing performance, but I did want to raise some concerns about the escape analysis claims. Escape analysis in C2 is actually a very brittle optimization, so unfortunately, a microbenchmark doesn't prove much. I fully expect in real world usage the Scope will get allocated because the Scope creation and Scope closing will often sit around I/O work that the JIT will fail to inline fully through. To really verify the optimization is happening, I'd want to see an allocation profile from a real application (for instance Spring PetClinic). If we really want to avoid the allocation, I think a claim check system is likely a better solution. Rather than return an object, we return just a number. Then on close, we provide the same number again, so the stack can be verified. And then we can wrap the claim check mechanism with a Scope object for public and legacy APIs. |
|
I think most the majority of the problems that I’ve raised can be solved with two key changes. Make Context Immutable By making Context immutable, we can solve the continuation leaking problems But doing so requires that Context not contain a pointer to another Context The simplest solution would be a Context[] in ContextHolder and an index into the Context[] The primary downside is that it makes the ContextHolder more expensive to construct which is bad for virtual threads While that is a bit ugly, it is at least nicely contained in one place Adding a NO_CONTEXT mechanism From a survey of the “leak” PRs, the primary problem is context leaking into a dispatch loop I think the cleanest way to that is introduce a NO_CONTEXT mechanism that ignores any surrounding Context already on the stack. |
8760c68 to
df32a37
Compare
I would need to have an example of the dispatch loop code pattern to help me visualize. But adding a mechanism might not fix wrong context API usage from customers because we won't be able to expose it. |
This helps reduce allocations when attaching the same context multiple times.
As stated elsewhere, I agree that the Scopes are bounded, but the Contexts are not. As for benchmarks, I would put API safety - avoiding the unhappy case ahead of maximizing performance in the happy case. I'd like to think we can find approach that can balance safety and performance. |
By dispatch loop, I mean anything of the form... This applies to servers, executors, message listener loops, etc. The trouble with dispatch loops is that they are often set-up lazily and capture the surrounding context. I had Claude do a sweep and this matches most (all) of the PRs where we're fixing context leaks currently. As for API, you are quite right that it isn't something for customers per se. |
If I can add something, the leakage fix PR we're doing so far is to fill the technical debts we have with instrumentations that have not addressed that well in the past. But I think that also is not directly related to this PR in particular but a wider topic |
The way the new context api is supposed to work is by attaching context to those work item using
That's the kind of issue we would have to deal with as part of rolling out the new API. Making sure the new API is correctly leverage across all the instrumentations. Hence the tool for leakage detection from test runs - note that requires having tests with decent coverage. |
What Does This Do
Introduces basic continuations and events to the Context API
Motivation
This is meant as a stepping stone from the old API towards a simpler foundation
Additional Notes
Benchmarking summary from Claude:
ThreadLocalContextManager: nested attach() calls add zero bytes because theContextScopeImplobjects are stack-allocated by escape analysis.ContinuableScopeManager: eachContinuableScopeobject adds ≈32 B because they escape to the heap via theScopeStacklinked structure.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issue/merge. You can also:/merge --commit-message "..."/merge -c/merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)Jira ticket: [PROJ-IDENT]