fix(llmist): classify OpenRouter provider errors instead of crashing the agent#1417
fix(llmist): classify OpenRouter provider errors instead of crashing the agent#1417aaight wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
Requesting changes: the handled-result path works for non-retryable OpenRouter failures, but the new Sentry tagging contract is not met for retryable 429/503 provider errors.
Code Issues
Should Fix
- src/backends/llmist/index.ts:206 — OpenRouter 429 and 503 errors are retryable under the existing llmist retry policy. On the final failed attempt,
getRetryConfig(...).onRetriesExhaustedcaptures the same error first withsource: 'llm_retries_exhausted'; only after that does this catch add the newopenrouter_provider_errorwarning. That means rate-limit/model-unavailable provider failures still produce an unclassified retry-exhaustion Sentry event, so operators cannot filter all handled OpenRouter provider errors by the new tag as documented. Please tag or suppress the retry-exhaustion capture for classified OpenRouter errors, or add a regression test that simulates exhausted 429/503 retries and proves no unclassified Sentry event is emitted.
🕵️ codex · gpt-5.5 · run details
| model, | ||
| runId, | ||
| }); | ||
| captureException(err, { |
There was a problem hiding this comment.
This catch is too late to be the only Sentry event for the retryable OpenRouter kinds. For OpenRouter: Rate limit exceeded and OpenRouter: Model temporarily unavailable, llmist's isRetryableError returns true, so after the last retry CASCADE's getRetryConfig(...).onRetriesExhausted already calls captureException(error, { tags: { source: 'llm_retries_exhausted' } }) before the error reaches this block. The run avoids the generic agent_execution tag, but these provider failures still create an unclassified retry-exhaustion event alongside the new warning. The retry-exhaustion path needs to be tagged/suppressed for classified OpenRouter errors, or covered by a test that proves 429/503 only produce the intended OpenRouter-tagged Sentry event.
Summary
MNG-1646: handle OpenRouter provider errors (HTTP 402 credit exhaustion, 401 unauthorized, 429 rate-limited, 503 model-unavailable) in the llmist engine adapter instead of letting them bubble up as generic
agent_executioncrashes.Linear issue: https://linear.app/mongrel/issue/MNG-1646/fix-handle-openrouter-credit-exhaustion-in-llmist
What changed
New module
src/backends/llmist/openrouterErrors.ts— pure string-based classifier with three exports:classifyOpenRouterError(err)— recognises both llmist's wrapped form (OpenRouter: Insufficient credits...produced byenhanceError) and bare HTTP signals (402, 401, 429, 503). Returnsnullfor anything unrelated so non-OpenRouter errors keep flowing through the existing pipeline.formatOpenRouterErrorMessage(kind, raw)— produces the operator-facing plain-English summary (e.g.OpenRouter rejected the request because the account has insufficient credits. Top up the OpenRouter balance at https://openrouter.ai/credits or switch the project to a different model/engine before retrying.) that the PM lifecycle surfaces on the work-item card.OPENROUTER_ERROR_SENTRY_TAG+openRouterErrorSentryTagValue(kind)— stable Sentry tag contract (openrouter_provider_errorkey,openrouter_insufficient_credits/openrouter_rate_limit/openrouter_unauthorized/openrouter_model_unavailable/openrouter_provider_errorvalues) so operators can filter billing/config failures separately from real agent crashes.src/backends/llmist/index.ts— wraprunAgentLoop(...)in a try/catch. Classified OpenRouter errors are logged at ERROR, captured to Sentry atwarninglevel under the new tag, and returned as{ success: false, output: '', error: <friendly>, cost: <tree.getTotalCost() or 0> }. Unclassified errors are re-thrown so the sharedexecuteAgentPipelinecontinues to record them as genericagent_executionfailures with full stack capture — unchanged behaviour for every other error path.tests/unit/backends/openrouterErrors.test.ts— 21 unit tests covering every classifier branch (wrapped 402/401/429/503, bare 402, "Insufficient balance" wording, OpenRouter-prefixed unknown, null/undefined input, string and message-bearing object inputs, case insensitivity), everyformatOpenRouterErrorMessagekind including raw-message preservation and truncation (PM-card comment-cap safety), and the Sentry tag-value contract.tests/unit/backends/llmist.test.ts— 5 new tests forLlmistEngine.execute: insufficient-credits returns a handledAgentEngineResult(no throw), Sentry capture uses the dedicated tag instead ofagent_execution, distinct tags per kind, non-OpenRouter errors still propagate (regression net for "every llmist crash silently becomes a handled warning"), and bare HTTP 402 is still classified.CLAUDE.md— new "OpenRouter provider error classification" paragraph under the Engines section so the next operator triaging an OpenRouter outage finds the contract immediately.Why this approach
The Sentry incident (issue 129527989) showed CASCADE handed back a raw
Error: OpenRouter: Insufficient credits. Add funds at https://openrouter.ai/credits\nOriginal error: ...fromrunAgentLoop, which the shared pipeline reported under the genericagent_executionSentry tag and surfaced verbatim on the PM card vialifecycle.handleFailure. That's three problems in one:Bearer-prefix bug once).This PR keeps the existing surface (
AgentResultstill flows throughexecuteAgentPipeline→lifecycle.handleFailure) and only inserts a classifier before the error becomes generic. No changes to retry policy (provider-credit failures are already non-retryable perisRetryableError), no changes towithBudget, no preflight balance check against/api/v1/credits(that's a separate spec — see "Why not preflight balance" below). New PM provider integrations and other engines are untouched.Why not preflight balance /
max_tokensclampingThe Linear issue's hypothesis includes the option of preflighting the OpenRouter balance and clamping
max_tokensto fit. That would require:/api/v1/creditscall (rate-limited, adds latency to every run boot).customModels.ts, but the worst-caseoutput_tokens × output_priceignores prompt cost and provider fees).max_tokenscap through llmist'sAgentBuilder(no obvious knob today).That's a meaningful extension and a separate PR; the current fix solves the user-visible problem (cryptic agent crash → actionable operator message) and adds the regression net needed to land the preflight later without re-introducing the crash path.
Test plan
npx vitest run --project unit-backends tests/unit/backends/openrouterErrors.test.ts— 21 tests pass.npx vitest run --project unit-backends tests/unit/backends/llmist.test.ts— 19 tests pass (14 existing + 5 new).npx vitest run --project unit-backends— full backends suite, 1203 tests pass.npm test— full unit suite, 10042 tests pass / 33 skipped.npm run typecheck— clean.npm run lint— only pre-existing warnings ontests/unit/gadgets/github/core/createPR.test.ts; my files (src/backends/llmist/openrouterErrors.ts,src/backends/llmist/index.ts,tests/unit/backends/openrouterErrors.test.ts,tests/unit/backends/llmist.test.ts) are clean (npx @biomejs/biome checkon those four exits 0).tests/unit/architecture-docs.test.ts, 41 tests).🕵️ claude-code · claude-opus-4-7 · run details