Skip to content

[codex] Add Claude API retry backoff events#60

Draft
DorianZheng wants to merge 3 commits intomainfrom
feature/rate-limiting-retry
Draft

[codex] Add Claude API retry backoff events#60
DorianZheng wants to merge 3 commits intomainfrom
feature/rate-limiting-retry

Conversation

@DorianZheng
Copy link
Copy Markdown
Member

What changed

  • added Claude API retry handling in the container agent runner for 429, 500, and 503 failures
  • implemented exponential backoff with jitter (base 1s, max 60s) with per-group containerConfig.maxRetries
  • surfaced retry state and terminal failures as new runtime events so Dune can render retry countdowns and exhausted-retry errors
  • added unit and integration coverage for retry classification, backoff timing, and host event emission

Why

AgentLite previously treated Claude API rate limits and transient upstream failures as hard errors. That caused noisy failures with no retry visibility for the UI and no structured runtime signal when retries were exhausted.

Impact

  • agent groups can now tune retry count with containerConfig.maxRetries
  • Dune can subscribe to run.retry to show rate limited, retrying in Xs
  • terminal failures now emit run.error with retry metadata instead of failing silently

Root cause

The runtime streamed SDK output but had no retry wrapper around query() and no dedicated runtime event for retry state or exhausted retry failures.

Validation

  • npm run typecheck
  • npm test
  • npm test -- src/run-stream-events.test.ts src/container-runner.test.ts container/agent-runner/src/retry.test.ts
  • npm test -- src/exports.test.ts src/agent/actions-http.e2e.test.ts src/acp/client.e2e.test.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant