Add retries to Think step.prompt()#1777
Conversation
step.prompt() now accepts an optional retries option with maxAttempts, baseDelayMs, and maxDelayMs. Retryable errors (e.g. Workers AI 3040 capacity errors) trigger fresh prompt attempts with jittered exponential backoff, while timeouts, validation errors, aborted, and skipped prompts are surfaced terminally. Each attempt uses unique workflow step names and idempotency keys so retries are durable across workflow hibernation and replays.
🦋 Changeset detectedLatest commit: a622d14 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
| expect(sleepCalls[0].duration).toBeGreaterThanOrEqual(0); | ||
| expect(sleepCalls[0].duration).toBeLessThanOrEqual(100); | ||
| expect(sleepCalls[1].duration).toBeGreaterThanOrEqual(0); | ||
| expect(sleepCalls[1].duration).toBeLessThanOrEqual(200); |
There was a problem hiding this comment.
🚩 Test backoff bounds may be vacuously satisfied due to the jitter bug
The retry test at packages/think/src/tests/workflows.test.ts:297-300 asserts sleepCalls[0].duration <= 100 and sleepCalls[1].duration <= 200. Because of the near-zero jitter fraction bug (BUG-0001), these assertions pass trivially (durations are always ~0). After the jitter bug is fixed, these bounds should still hold mathematically (jitter is in [0, upperBound]), but the test should also verify that durations are not always zero — e.g., asserting that at least one sleep duration is > 0 — to serve as a meaningful regression test for the backoff mechanism.
Was this helpful? React with 👍 or 👎 to provide feedback.
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
- Derive retry jitter from two raw SHA-256 digest bytes so the backoff fraction spans the full [0, 1) range instead of collapsing to ~0ms (base64url char codes / 0xffff was near-zero, defeating jitter). - Keep the original :submit/:wait step names on the first attempt so in-flight workflows replay without re-executing completed steps; only retries get suffixed names. - Persist and read back submitMessages maxRetries through the durable submission queue so modelMaxRetries actually reaches the inference loop (also fixes the maxRetries typecheck error). - Strengthen the backoff test to assert jitter is non-zero and add a regression test for stable first-attempt step names.
…ep-prompt-retries # Conflicts: # packages/think/src/think.ts
|
Remove from maxModelRetries from Think.submitMessages() because you can do it in beforeTurn()/getModel() (this sets the retry count for the whole agent) |
Addresses the chatRecovery race in step.prompt() retries without disabling chatRecovery (which usefully preserves in-flight turn state across DO restarts/stalls). - Before each retry, cancel the abandoned attempt's submission via a durable `:cancel-N` step so its turn and any chatRecovery continuation cannot keep running and race the fresh attempt on the same session (the cause of duplicate/interleaved output). No-op once terminal. - Add `retries.retryOnTimeout` (default true) to fail fast on wait timeouts instead of retrying a likely-to-repeat timeout. - Log each retry via console.warn with step/attempt/delay/error. - Tests: assert abandoned attempts are cancelled before retry and the successful final attempt is not; assert retryOnTimeout=false fails fast without backoff.
The test file's local _promptStep options type omitted retryOnTimeout, failing the tests tsconfig typecheck in CI (the main package tsconfig does not compile the test directory, so it passed locally).
When I run step.prompt() as part of an agentic workflow, I want to have an option for the step.prompt() to be retried (in case the agent fails due to capacity constraints, etc).
I have kept this 'dumb', which will retry the whole step.prompt(). I think the Think harness itself should have retries or 'continues' to ensure the task is pushed to completion before stopping.