Skip to content

fix(llmist): classify OpenRouter provider errors instead of crashing the agent#1417

Open
aaight wants to merge 1 commit into
devfrom
fix/openrouter-credit-exhaustion-handling
Open

fix(llmist): classify OpenRouter provider errors instead of crashing the agent#1417
aaight wants to merge 1 commit into
devfrom
fix/openrouter-credit-exhaustion-handling

Conversation

@aaight

@aaight aaight commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

MNG-1646: handle OpenRouter provider errors (HTTP 402 credit exhaustion, 401 unauthorized, 429 rate-limited, 503 model-unavailable) in the llmist engine adapter instead of letting them bubble up as generic agent_execution crashes.

Linear issue: https://linear.app/mongrel/issue/MNG-1646/fix-handle-openrouter-credit-exhaustion-in-llmist

What changed

  • New module src/backends/llmist/openrouterErrors.ts — pure string-based classifier with three exports:

    • classifyOpenRouterError(err) — recognises both llmist's wrapped form (OpenRouter: Insufficient credits... produced by enhanceError) and bare HTTP signals (402, 401, 429, 503). Returns null for anything unrelated so non-OpenRouter errors keep flowing through the existing pipeline.
    • formatOpenRouterErrorMessage(kind, raw) — produces the operator-facing plain-English summary (e.g. OpenRouter rejected the request because the account has insufficient credits. Top up the OpenRouter balance at https://openrouter.ai/credits or switch the project to a different model/engine before retrying.) that the PM lifecycle surfaces on the work-item card.
    • OPENROUTER_ERROR_SENTRY_TAG + openRouterErrorSentryTagValue(kind) — stable Sentry tag contract (openrouter_provider_error key, openrouter_insufficient_credits / openrouter_rate_limit / openrouter_unauthorized / openrouter_model_unavailable / openrouter_provider_error values) so operators can filter billing/config failures separately from real agent crashes.
  • src/backends/llmist/index.ts — wrap runAgentLoop(...) in a try/catch. Classified OpenRouter errors are logged at ERROR, captured to Sentry at warning level under the new tag, and returned as { success: false, output: '', error: <friendly>, cost: <tree.getTotalCost() or 0> }. Unclassified errors are re-thrown so the shared executeAgentPipeline continues to record them as generic agent_execution failures with full stack capture — unchanged behaviour for every other error path.

  • tests/unit/backends/openrouterErrors.test.ts — 21 unit tests covering every classifier branch (wrapped 402/401/429/503, bare 402, "Insufficient balance" wording, OpenRouter-prefixed unknown, null/undefined input, string and message-bearing object inputs, case insensitivity), every formatOpenRouterErrorMessage kind including raw-message preservation and truncation (PM-card comment-cap safety), and the Sentry tag-value contract.

  • tests/unit/backends/llmist.test.ts — 5 new tests for LlmistEngine.execute: insufficient-credits returns a handled AgentEngineResult (no throw), Sentry capture uses the dedicated tag instead of agent_execution, distinct tags per kind, non-OpenRouter errors still propagate (regression net for "every llmist crash silently becomes a handled warning"), and bare HTTP 402 is still classified.

  • CLAUDE.md — new "OpenRouter provider error classification" paragraph under the Engines section so the next operator triaging an OpenRouter outage finds the contract immediately.

Why this approach

The Sentry incident (issue 129527989) showed CASCADE handed back a raw Error: OpenRouter: Insufficient credits. Add funds at https://openrouter.ai/credits\nOriginal error: ... from runAgentLoop, which the shared pipeline reported under the generic agent_execution Sentry tag and surfaced verbatim on the PM card via lifecycle.handleFailure. That's three problems in one:

  1. Operator dashboards mix billing/config issues with real CASCADE bugs.
  2. The card message is a stack trace fragment, not an actionable next step.
  3. There's no regression net against future llmist SDK changes (which already shipped a Bearer-prefix bug once).

This PR keeps the existing surface (AgentResult still flows through executeAgentPipelinelifecycle.handleFailure) and only inserts a classifier before the error becomes generic. No changes to retry policy (provider-credit failures are already non-retryable per isRetryableError), no changes to withBudget, no preflight balance check against /api/v1/credits (that's a separate spec — see "Why not preflight balance" below). New PM provider integrations and other engines are untouched.

Why not preflight balance / max_tokens clamping

The Linear issue's hypothesis includes the option of preflighting the OpenRouter balance and clamping max_tokens to fit. That would require:

  • A new /api/v1/credits call (rate-limited, adds latency to every run boot).
  • A model→tokens→cost projection (we already have per-model pricing in customModels.ts, but the worst-case output_tokens × output_price ignores prompt cost and provider fees).
  • A way to pass a per-request max_tokens cap through llmist's AgentBuilder (no obvious knob today).

That's a meaningful extension and a separate PR; the current fix solves the user-visible problem (cryptic agent crash → actionable operator message) and adds the regression net needed to land the preflight later without re-introducing the crash path.

Test plan

  • npx vitest run --project unit-backends tests/unit/backends/openrouterErrors.test.ts — 21 tests pass.
  • npx vitest run --project unit-backends tests/unit/backends/llmist.test.ts — 19 tests pass (14 existing + 5 new).
  • npx vitest run --project unit-backends — full backends suite, 1203 tests pass.
  • npm test — full unit suite, 10042 tests pass / 33 skipped.
  • npm run typecheck — clean.
  • npm run lint — only pre-existing warnings on tests/unit/gadgets/github/core/createPR.test.ts; my files (src/backends/llmist/openrouterErrors.ts, src/backends/llmist/index.ts, tests/unit/backends/openrouterErrors.test.ts, tests/unit/backends/llmist.test.ts) are clean (npx @biomejs/biome check on those four exits 0).
  • Architecture-docs drift test passes (tests/unit/architecture-docs.test.ts, 41 tests).

🕵️ claude-code · claude-opus-4-7 · run details

@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.68345% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/backends/llmist/openrouterErrors.ts 95.74% 3 Missing and 1 partial ⚠️
src/backends/llmist/index.ts 95.55% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Requesting changes: the handled-result path works for non-retryable OpenRouter failures, but the new Sentry tagging contract is not met for retryable 429/503 provider errors.

Code Issues

Should Fix

  • src/backends/llmist/index.ts:206 — OpenRouter 429 and 503 errors are retryable under the existing llmist retry policy. On the final failed attempt, getRetryConfig(...).onRetriesExhausted captures the same error first with source: 'llm_retries_exhausted'; only after that does this catch add the new openrouter_provider_error warning. That means rate-limit/model-unavailable provider failures still produce an unclassified retry-exhaustion Sentry event, so operators cannot filter all handled OpenRouter provider errors by the new tag as documented. Please tag or suppress the retry-exhaustion capture for classified OpenRouter errors, or add a regression test that simulates exhausted 429/503 retries and proves no unclassified Sentry event is emitted.

🕵️ codex · gpt-5.5 · run details

model,
runId,
});
captureException(err, {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch is too late to be the only Sentry event for the retryable OpenRouter kinds. For OpenRouter: Rate limit exceeded and OpenRouter: Model temporarily unavailable, llmist's isRetryableError returns true, so after the last retry CASCADE's getRetryConfig(...).onRetriesExhausted already calls captureException(error, { tags: { source: 'llm_retries_exhausted' } }) before the error reaches this block. The run avoids the generic agent_execution tag, but these provider failures still create an unclassified retry-exhaustion event alongside the new warning. The retry-exhaustion path needs to be tagged/suppressed for classified OpenRouter errors, or covered by a test that proves 429/503 only produce the intended OpenRouter-tagged Sentry event.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants