PROPOSAL: feat(agents): first-class detached (background) agent-tool runs with a durable completion hook #1758
threepointone wants to merge 8 commits into
Adds the design record for first-class detached sub-agent runs with a durable named-method completion hook and progress/milestone signaling, in response to #1752. Co-authored-by: Cursor <cursoragent@cursor.com>
Implements the core of rfc-detached-agent-tools (#1752):
- `runAgentTool(cls, { detached })` dispatches a sub-agent without awaiting, returning `{ runId, status: "running" }`. Fire-and-forget (`detached: true`) or a durable per-run callback (`detached: { onFinish: "methodName" }`).
- Durable, eviction-surviving completion delivery via a single guarded funnel with two independent ledger slots (finish / give-up) using a claim+lease, so delivery is exactly-once on the happy path and at-least-once under failure. A premature give-up can never dedupe a child's real late completion (the #1752 production incident).
- Warm fast path (waitUntil) + durable self-scheduling reconcile backbone (this.schedule) that self-cancels once no detached run remains.
- Reconcile fork: detached runs are never sealed `interrupted` on a lost observer (the normal state for a background run); the backbone owns them and re-arms on restart.
- Absolute `maxBudgetMs` give-up ceiling (default 24h, finite because a detached run has no observer to notice a leak) surfaced as `interrupted`/`budget-exceeded`.
- `cancelAgentTool(runId)` by-id cancellation through the same guarded path.
Schema bumped to v10 with detached + ledger columns (idempotent migrations).
Co-authored-by: Cursor <cursoragent@cursor.com>
Drives the detached delivery funnel directly: exactly-once on terminal, dedupe under concurrent fast-path/backbone deliveries, and the independent finish/give-up slots so a budget give-up never dedupes a child's real late completion (#1752). Also switches the ledger claim to rowsWritten() since UPDATE ... RETURNING row counts are not a reliable claim signal on Workers SQLite. Co-authored-by: Cursor <cursoragent@cursor.com>
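The two-slot behaviour this commit exercises can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: `Ledger`, `tryClaim`, `deliver`, and the `LEASE_MS` value are hypothetical names and numbers, and the PR itself claims slots with a SQL `UPDATE` checked through `rowsWritten()` rather than in-memory state.

```typescript
// Hypothetical in-memory model of one run's delivery ledger. The PR keeps these
// as *_claimed_at / *_delivered_at columns and claims them with a SQL UPDATE.
type Slot = "finish" | "giveUp";

interface SlotState {
  claimedAt: number | null;
  deliveredAt: number | null;
}

type Ledger = Record<Slot, SlotState>;

// Assumed lease length, for illustration only.
const LEASE_MS = 30_000;

function newLedger(): Ledger {
  return {
    finish: { claimedAt: null, deliveredAt: null },
    giveUp: { claimedAt: null, deliveredAt: null }
  };
}

// A claim succeeds when the slot is undelivered and no unexpired lease holds it.
function tryClaim(ledger: Ledger, slot: Slot, now: number): boolean {
  const state = ledger[slot];
  if (state.deliveredAt !== null) return false;
  if (state.claimedAt !== null && now - state.claimedAt < LEASE_MS) return false;
  state.claimedAt = now;
  return true;
}

// Claim, run the callback, then mark delivered. If the callback throws, the slot
// stays claimed but undelivered, so a retry after the lease expires can redeliver.
function deliver(
  ledger: Ledger,
  slot: Slot,
  now: number,
  callback: () => void
): boolean {
  if (!tryClaim(ledger, slot, now)) return false;
  callback();
  ledger[slot].deliveredAt = now;
  return true;
}
```

Because the give-up and finish slots are claimed independently, delivering a give-up leaves the finish slot open for a later real completion, while a second delivery into an already delivered slot is rejected.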
Adds a changeset (minor) and a "Detached (background) runs" section to the agent-tools doc covering the detached handle, durable onFinish, budget give-up, explicit cancellation, and the inspectAgentToolRun null-means-not-yet contract from #1752. Co-authored-by: Cursor <cursoragent@cursor.com>
Adds `detached: { notify: true }` sugar: when a detached sub-agent run
finishes, a Think agent injects the result back into the chat via
submitMessages (idempotent per run + status) so the model reacts, without
hand-wiring onFinish. Override formatDetachedCompletion() to customize. Wired
generically in the base Agent by resolving the conventional notify hook by name
so the core stays decoupled from the chat layer.
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds a `research_background` tool that dispatches a Researcher with
`detached: { notify: true }` (returns immediately, result posted back into the
chat on completion) and a `cancelBackground(runId)` callable built on
cancelAgentTool. Updates the system prompt and README to cover the background
flow.
Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detected
Latest commit: fc6a2ab
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 2 packages
    await this.schedule(
      DETACHED_BACKBONE_CADENCE_S[1],
      DETACHED_RECONCILE_CALLBACK as keyof this
    );
🟡 Backbone reconcile cadence never escalates — indices 2 and 3 of DETACHED_BACKBONE_CADENCE_S are dead code
_cfDetachedReconcileTick always reschedules at DETACHED_BACKBONE_CADENCE_S[1] (15 seconds), never advancing through the rest of the cadence array [5, 15, 30, 120]. The entries at indices [2] (30s) and [3] (120s) are unused dead code. The RFC specifies "exponential-ish cadence, e.g. 5s → 30s → 2m, capped" and mentions "the cadence resets to the fast end (5s)" on a new dispatch, but neither behavior is implemented — every reconcile tick polls at a fixed 15-second interval. This wastes DO alarm budget for long-running detached runs that would otherwise back off to 2-minute intervals.
Prompt for agents
The _cfDetachedReconcileTick method at line 8256 always reschedules with DETACHED_BACKBONE_CADENCE_S[1] (15s). The DETACHED_BACKBONE_CADENCE_S array is [5, 15, 30, 120] but only indices 0 and 1 are ever used — indices 2 and 3 are dead code. To implement the escalating cadence described in the RFC, the reconcile tick needs to track which cadence step it is on (e.g. via a payload on the schedule, or by counting consecutive ticks with no deliveries) and advance through the array. When a new detached dispatch arrives, _armDetachedBackbone should reset back to index 0. This would reduce steady-state alarm cost for long-running detached runs from every 15s to every 120s.
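One way to realise the escalation requested above, as a minimal sketch: keep a step counter alongside the schedule, advance it after each tick, cap it at the last entry, and reset it on a new dispatch. Only the `[5, 15, 30, 120]` array comes from the review; `cadenceDelaySeconds`, `nextCadenceStep`, and `simulateDelays` are hypothetical helpers, and how the step is persisted (schedule payload or otherwise) is left open.

```typescript
// Cadence values as quoted in the review comment, in seconds.
const DETACHED_BACKBONE_CADENCE_S = [5, 15, 30, 120] as const;

// Hypothetical helper: delay for a given step, clamped to the array bounds.
function cadenceDelaySeconds(step: number): number {
  const last = DETACHED_BACKBONE_CADENCE_S.length - 1;
  const index = Math.min(Math.max(step, 0), last);
  return DETACHED_BACKBONE_CADENCE_S[index];
}

// Hypothetical helper: a new dispatch resets to the fast end, otherwise the
// step advances by one and stays capped at the slowest entry.
function nextCadenceStep(step: number, newDispatch: boolean): number {
  if (newDispatch) return 0;
  return Math.min(step + 1, DETACHED_BACKBONE_CADENCE_S.length - 1);
}

// Simulates the delays used by consecutive reconcile ticks with no new dispatch.
function simulateDelays(ticks: number): number[] {
  const delays: number[] = [];
  let step = 0;
  for (let i = 0; i < ticks; i++) {
    delays.push(cadenceDelaySeconds(step));
    step = nextCadenceStep(step, false);
  }
  return delays;
}
```

With these helpers, six consecutive ticks would use delays of 5, 15, 30, 120, 120, and 120 seconds, and any new dispatch would return the next tick to 5 seconds.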
    async cancelAgentTool(runId: string, reason?: unknown): Promise<void> {
      const row = this._readAgentToolRun(runId);
      if (!row) return;
      if (this._isAgentToolRowHardTerminal(row.status)) return;
      const message =
        reason instanceof Error
          ? reason.message
          : String(reason ?? "cancelled by parent");
      try {
        const child = await this._cf_resolveSubAgent(row.agent_type, runId);
        const adapter = this._asAgentToolChildAdapter(child);
        await adapter.cancelAgentToolRun(runId, reason);
      } catch {
        // Best-effort child teardown; we still record the aborted terminal so the
        // parent stops watching and any wired callback fires.
      }
      await this._deliverDetachedTerminal(runId, "finish", {
        runId,
        agentType: row.agent_type,
        status: "aborted",
        error: message
      });
    }
🔴 cancelAgentTool on awaited runs causes onAgentToolFinish to fire twice
cancelAgentTool (line 7876) is documented to work on both detached and awaited runs, but for an awaited run it routes through _deliverDetachedTerminal which calls onAgentToolFinish (line 8086). The concurrent awaited path in runAgentTool — still tailing the child stream — also observes the aborted terminal and calls _finishAgentToolRun (packages/agents/src/index.ts:8455), which fires onAgentToolFinish again unconditionally (line 8474). The _updateAgentToolTerminal SQL guard (WHERE status NOT IN (...)) prevents double-writing the row, but _finishAgentToolRun calls the hook regardless. Since onAgentToolFinish is the documented cost-metering hook (the reporter merges child token cost there), a double-fire can cause double-counted costs.
Prompt for agents
cancelAgentTool at line 7876 always calls _deliverDetachedTerminal, which fires onAgentToolFinish. But for an awaited (non-detached) run, the concurrent awaited path in runAgentTool will also call _finishAgentToolRun → onAgentToolFinish when it observes the child's aborted terminal. This causes onAgentToolFinish to fire twice.
Two possible fixes:
1. In cancelAgentTool, check if the run is detached (row.detached === 1) and only use _deliverDetachedTerminal for detached runs. For non-detached runs, just cancel the child and let the existing awaited path handle the terminal delivery naturally.
2. Alternatively, make _finishAgentToolRun check the finish_delivered_at or finish_claimed_at columns before firing onAgentToolFinish, so the claim mechanism protects against double-fire on both paths.
Option 1 is simpler and doesn't change the awaited path. Option 2 is more robust but changes a shared code path.
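Option 2 can be sketched in isolation as a claim that both paths must win before firing the hook. This is a simplified in-memory illustration, not the PR's code: `RunRow`, `claimFinish`, and `finishRunOnce` are hypothetical names, the claim here ignores leases, and a real implementation would claim the `finish_claimed_at` column in SQL.

```typescript
// Hypothetical in-memory stand-in for the finish_claimed_at column on a run row.
interface RunRow {
  runId: string;
  finishClaimedAt: number | null;
}

type FinishHook = (runId: string, runStatus: string) => void;

// First caller wins the claim; every later caller sees it taken and backs off.
function claimFinish(row: RunRow, now: number): boolean {
  if (row.finishClaimedAt !== null) return false;
  row.finishClaimedAt = now;
  return true;
}

// Shared funnel: the hook fires only for the caller that won the claim.
function finishRunOnce(
  row: RunRow,
  runStatus: string,
  hook: FinishHook,
  now: number
): boolean {
  if (!claimFinish(row, now)) return false;
  hook(row.runId, runStatus);
  return true;
}

// Simulation: the cancel path and the awaited path both observe "aborted".
const runRow: RunRow = { runId: "run-1", finishClaimedAt: null };
let meteredCalls = 0;
const meterCost: FinishHook = () => {
  meteredCalls += 1;
};

const firedFromCancel = finishRunOnce(runRow, "aborted", meterCost, 1_000);
const firedFromAwaited = finishRunOnce(runRow, "aborted", meterCost, 1_005);
```

In this simulation the cost-metering hook runs once: the cancel path wins the claim and the awaited path is turned away, which is the property that prevents double-counted costs.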
Super small convenience comment:
Closes #1752 (proposed direction — seeking feedback before we lock the API).
TL;DR
Adds a supported detached mode to `runAgentTool` so a parent agent can dispatch a sub-agent, keep working, and be notified exactly once when the child finishes, even across parent Durable Object eviction. This is the framework-owned version of the ~200 lines of poll / fast-path / idempotency / race-handling glue that @rwdaigle describes in #1752.
For chat agents there's a one-liner, `detached: { notify: true }`, that injects the result back into the conversation so the model reacts to it.
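A minimal sketch of that one-liner's call shape, assuming the option and result shapes listed under Public API in this description. The types and the `runAgentTool` function below are local stand-ins so the snippet runs on its own: the stub is synchronous, starts nothing, and returns a fixed `runId`, whereas in an agent the call would go through the agent's own `runAgentTool` method. `Researcher` is the sub-agent class named in the example commit; the `input` payload is made up.

```typescript
// Local stand-ins; the real definitions live in the agents package.
type DetachedAgentToolConfig = {
  onFinish?: string; // the PR types this as `keyof this`
  maxBudgetMs?: number;
  notify?: boolean;
};

type DetachedRunAgentToolResult = {
  runId: string;
  agentType: string;
  status: "running";
};

// Placeholder sub-agent class for the sketch.
class Researcher {}

// Stub with the call shape described in this PR. It only echoes a handle.
function runAgentTool(
  cls: new () => object,
  options: { input: unknown; detached: true | DetachedAgentToolConfig }
): DetachedRunAgentToolResult {
  return { runId: "run-1", agentType: cls.name, status: "running" };
}

// The one-liner: dispatch in the background and ask for the result to be
// posted back into the chat when the run finishes.
const handle = runAgentTool(Researcher, {
  input: { query: "example query" },
  detached: { notify: true }
});
```

The returned handle carries only the `runId`, `agentType`, and a `running` status; the result itself arrives later through the notify path.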
This PR is the first slice of the detached agent-tools RFC: detached dispatch + durable completion + budget + cancel + docs + Think `notify`. The progress/milestone signalling tier in the RFC is intentionally deferred to a follow-up (see Scope below).
How it maps to the three asks in #1752
1. A supported "detached" mode on `runAgentTool`.
`runAgentTool(Cls, { input, detached: true | { onFinish, maxBudgetMs, notify } })` returns `DetachedRunAgentToolResult` (`{ runId, agentType, status }`) immediately, without awaiting. The full run lifecycle (run row in `cf_agent_tool_runs`, `agent-tool-event` broadcast, child recovery, `onAgentToolStart` / `onAgentToolFinish`, cost) fires regardless, exactly as on the awaited path. Detached runs deliberately inherit no `AbortSignal` (the child must outlive the spawning turn); cancel explicitly with `cancelAgentTool(runId)`.
2. A durable completion hook guaranteed to fire exactly once across eviction.
The framework now owns the eviction-survival + dedupe you built by hand, via a two-tier delivery with a claim+lease ledger:
- Warm fast path: `ctx.waitUntil` off the dispatch cuts latency while the parent isolate is warm.
- Durable backbone: a `this.schedule` callback (`_cfDetachedReconcileTick`, backoff cadence `[5s, 15s, 30s, 120s]`) that inspects the child to terminal, survives parent DO eviction, and re-arms itself on parent startup whenever outstanding detached runs exist.
- Ledger: `*_claimed_at` / `*_delivered_at` columns claimed atomically via SQLite `rowsWritten` under a `DETACHED_DELIVERY_LEASE_MS` lease. The happy path delivers exactly once; a crash mid-delivery re-delivers (at-least-once), never zero.
- `onFinish` (named method, resolved by `keyof this` so it survives rehydration) and the global `onAgentToolFinish` both fire for detached runs.
This directly resolves the two sharp edges you flagged:
- `null` is not proof a run is gone. The reconciler treats a `null` inspect as "not yet reconciled," never as failure, and detached runs are excluded from the await-style `interrupted` sealing, so a poll racing the child's first write can't manufacture a spurious "outcome unconfirmed."
- A budget give-up (`interrupted` / `budget-exceeded`) and a later real completion are delivered as two distinct events. A premature give-up can never consume the key and dedupe away the child's real summary. (Covered by a dedicated test, see below.)
3. Documented contract for `inspectAgentToolRun` after eviction.
`docs/agent-tools.md` now specifies the lazy post-eviction reconcile behaviour and the `null` = "not yet started / not yet reconciled," not "failed" semantics, plus the full detached lifecycle, budget, and cancellation contract.
Public API
- `runAgentTool(Cls, { …, detached })` overload → `DetachedRunAgentToolResult`.
- `DetachedAgentToolConfig`: `{ onFinish?: keyof this; maxBudgetMs?: number; notify?: boolean }`.
- `cancelAgentTool(runId)`: idempotent explicit cancellation (delivers `onFinish` `aborted`).
- `detachedMaxBudgetMs` (default 24h): fleet-wide backstop ceiling.
- `AgentToolInterruptedReason` member `"budget-exceeded"` (soft seal; a late completion still repairs the run and re-fires the hook).
- `@cloudflare/think`: `detached: { notify: true }` injects the completion into chat via `submitMessages` (idempotent per `runId` + terminal status); override `formatDetachedCompletion(run, result)` to customise the text.
Storage / migration
Schema bumped to v10. New columns on `cf_agent_tool_runs` (added via `addColumnIfNotExists`, so existing DOs migrate forward in place): `detached`, `detached_on_finish`, `detached_max_budget_at`, and the four ledger columns `finish_claimed_at` / `finish_delivered_at` / `give_up_claimed_at` / `give_up_delivered_at`. `cf_agent_tool_runs` is created lazily (not part of the constructor DDL snapshot), so no snapshot change.
Tests
`packages/agents/src/tests/agent-tool-detached.test.ts` drives the delivery ledger directly:
- `onFinish` + the global hook exactly once on terminal;
Example
`examples/agents-as-tools` gains a `research_background` tool (`detached: { notify: true }`) that returns immediately and posts the result back into the chat on completion, plus a `cancelBackground(runId)` callable, so reviewers can try the end-to-end flow in a preview build.
Scope (what's deferred)
The RFC's progress / milestone signalling tier (`reportProgress`, `awaitAgentToolMilestone`, `onProgress`, the milestone/progress projection, and the client reducer) is intentionally not in this PR. It rides reserved data parts on the child's live AI-SDK stream and touches the hot `_forwardAgentToolStream` path, so it deserves its own reviewable change. This PR is complete and useful on its own and is exactly the detached + durable-completion ask from #1752. Follow-up PR to come if the direction here lands.
Validation
- `pnpm run check`: 106 projects typecheck + lint + format ✅
- `packages/agents` tests ✅ · `@cloudflare/think` tests ✅ · `agents-as-tools` example tests ✅
Review starting points
- `design/rfc-detached-agent-tools.md`: full design + rationale.
- `packages/agents/src/index.ts`: `runAgentTool` detached branch, `_parseDetachedOption`, `_deliverDetachedTerminal` (claim+lease), `_detachedFastPath`, `_cfDetachedReconcileTick` (backbone), `cancelAgentTool`, reconciler exclusion.
- `packages/agents/src/agent-tool-types.ts`: `DetachedAgentToolConfig`, `DetachedRunAgentToolResult`.
- `packages/think/src/think.ts`: `_cfDetachedNotifyFinish` + `formatDetachedCompletion`.
cc @rwdaigle: this is a direct response to your write-up; would love your eyes on the shape (`onFinish` named-method callback, the give-up/real-completion two-slot ledger, budget defaults) before we finalise.
Made with Cursor