fix(channels): surface structured rate-limit metadata on chat_error (#2606)#2652
Conversation
…inyhumansai#2606) Issue tinyhumansai#2606 follows up on PR tinyhumansai#2371: that PR landed retry-after wording inside the user-facing message text, but the structured data was discarded at the channel layer. The wire payload still only carried `message: String` and `error_type: String`, so the frontend could not render a real countdown, a retry button, a provider/source label, or a fallback CTA without regexing the message. This change moves the metadata onto the wire as additive optional fields on `WebChannelEvent`, populated by a new `ClassifiedError` struct returned from `classify_inference_error`: - `error_source` — "provider" | "openhuman_budget" | "agent_loop" | "openhuman_billing" | "transport" | "config" - `error_retryable` — false for non-retryable business 429s ("plan does not include", insufficient balance, Z.AI 1311/1113), auth, model, context, OpenHuman billing - `error_retry_after_ms` — verbatim from Retry-After / retry_after - `error_provider` — extracted from "<provider> API error (...)" - `error_fallback_available` — Some(false) when reliable.rs has already exhausted the model_fallbacks chain ("All providers/models failed") All fields use `skip_serializing_if = "Option::is_none"` so older clients that only read `message` / `error_type` keep working unchanged. Coverage: - 11 new unit tests on the classifier (each branch, each new field, non-retryable 429 business-quota cases, provider extraction, fallback-exhausted aggregate, JSON omit-when-None contract). - 2 new wire-shape tests that drive `start_chat` end-to-end through the WebChannelEvent bus and assert the structured fields land on both the struct and the serialized JSON. - Added a serial test lock so the three tests that share `TEST_FORCED_RUN_CHAT_TASK_ERROR` no longer race under `cargo test`'s default parallelism.
📝 WalkthroughWalkthroughAdds five optional error metadata fields to WebChannelEvent, introduces a typed ClassifiedError, refactors classify_inference_error to return it (including provider, retryability, retry-after, fallback info), and updates event emissions and tests to populate or initialize the new fields. ChangesStructured Error Metadata for Rate Limits
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/channels/providers/web_tests.rs (1)
29-35:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid fire-and-forget forced-error reset in
Drop(can clobber the next test)
TestForcedRunChatTaskErrorGuard::dropschedules async cleanup viatokio::spawn(set_test_forced_run_chat_task_error(None).await). Because locals drop in reverse order, the guard can be dropped (and the spawned reset started) beforeFORCED_ERROR_TEST_LOCKis released, letting the next test setSome(...)and then having the earlier spawned task overwrite it back toNonewhilerun_chat_taskhasn’t consumed it yet (TEST_FORCED_RUN_CHAT_TASK_ERRORis read viaslot.take()).
Remove the asyncDrop/spawn cleanup and instead perform an awaited reset inside each serialized test (applies to the three tests that currently createTestForcedRunChatTaskErrorGuard).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web_tests.rs` around lines 29 - 35, The Drop implementation for TestForcedRunChatTaskErrorGuard must not spawn an async reset because that fire-and-forget reset (tokio::spawn calling set_test_forced_run_chat_task_error(None)) can race with FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the async Drop::drop implementation entirely and instead update each test that constructs TestForcedRunChatTaskErrorGuard to explicitly await set_test_forced_run_chat_task_error(None) at the end of the test while still holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure references to set_test_forced_run_chat_task_error, TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the three tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 496-508: The code branch that builds a ClassifiedError for
402/payment messages incorrectly sets source to the hardcoded string
"openhuman_billing", which misattributes upstream provider billing errors; in
the ClassifiedError constructor (the block creating ClassifiedError with
error_type "budget_exhausted" and calling with_provider_detail), replace the
hardcoded source with the actual provider identifier (use the existing provider
variable) or another provider-specific source instead of "openhuman_billing" so
upstream 402/payment errors preserve the provider as the error source.
- Around line 449-469: The current 429 branch sets retryable =
!is_non_retryable_rate_limit_text(&lower) but always builds a
transient/retry-focused summary; change it so you first compute a boolean (e.g.,
let non_retryable = is_non_retryable_rate_limit_text(&lower)) and then pick the
message accordingly: if non_retryable produce a non-retry message that directs
the user to billing/settings/plan (no "retry in this thread" hint), otherwise
keep the transient retry summary that uses retry_after_hint(retry_secs); pass
the chosen summary into with_provider_detail(...) and keep retryable set to
!non_retryable and retry_after_ms as-is. Ensure you reference
is_non_retryable_rate_limit_text, retry_after_hint, retry_secs, and
ClassifiedError::message/retryable when editing.
---
Outside diff comments:
In `@src/openhuman/channels/providers/web_tests.rs`:
- Around line 29-35: The Drop implementation for TestForcedRunChatTaskErrorGuard
must not spawn an async reset because that fire-and-forget reset (tokio::spawn
calling set_test_forced_run_chat_task_error(None)) can race with
FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error; remove the
async Drop::drop implementation entirely and instead update each test that
constructs TestForcedRunChatTaskErrorGuard to explicitly await
set_test_forced_run_chat_task_error(None) at the end of the test while still
holding the FORCED_ERROR_TEST_LOCK (so run_chat_task’s
TEST_FORCED_RUN_CHAT_TASK_ERROR slot.take() is not interfered with). Ensure
references to set_test_forced_run_chat_task_error,
TestForcedRunChatTaskErrorGuard, FORCED_ERROR_TEST_LOCK, run_chat_task, and
TEST_FORCED_RUN_CHAT_TASK_ERROR are used to find and modify the guard and the
three tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e36b968-60fa-4bdc-a35e-c54ea354d53b
📒 Files selected for processing (5)
src/core/socketio.rssrc/openhuman/channels/proactive.rssrc/openhuman/channels/providers/presentation.rssrc/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
… copy (tinyhumansai#2606) Three follow-ups from CodeRabbit on PR tinyhumansai#2652: 1. The 402 / payment-required arm hardcoded source="openhuman_billing", misattributing upstream provider 402s (openrouter, openai, ...) to OpenHuman's own credit system. Switch the source based on the extracted provider envelope: when the leading envelope is OpenHuman (or absent), keep "openhuman_billing"; for any other identified upstream, emit source="provider" so the FE doesn't point the user at OpenHuman credits when their provider plan / balance is the issue. 2. The 429 arm always produced the transient "you can retry in this thread" copy regardless of retryable. Compute non_retryable first and pick the message accordingly: non-retryable cases (plan limit, insufficient balance, Z.AI 1311/1113) now route the user to billing / settings / plan instead of suggesting same-thread retry. retry_after_ms and retryable wire shape unchanged. 3. TestForcedRunChatTaskErrorGuard::drop spawned a fire-and-forget tokio task to reset TEST_FORCED_RUN_CHAT_TASK_ERROR. The spawned task could race FORCED_ERROR_TEST_LOCK and clobber a subsequent test's forced error. Remove the guard struct entirely and add an inline `set_test_forced_run_chat_task_error(None).await` at the end of each of the three tests, while still holding the lock. Adds regression tests for both source branches and both message branches.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/channels/providers/web.rs (1)
508-525:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve
fallback_availableon 402 classification.Line 525 resets
fallback_availabletoNone, so an aggregate likeAll providers/models failed ... 402 Payment Requireddrops the already-computedSome(false)from Lines 401-405. That loses the wire signalsrc/core/socketio.rsdocuments for exhausted fallback chains.Suggested fix
ClassifiedError { error_type: "budget_exhausted", message: with_provider_detail("Insufficient credits. Please top up to continue.", err), source, retryable: false, retry_after_ms: None, provider, - fallback_available: None, + fallback_available, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/channels/providers/web.rs` around lines 508 - 525, The 402 classification is overwriting the previously-computed fallback_available with None; update the ClassifiedError construction in this 402 branch to preserve the existing fallback_available value (do not set fallback_available: None) so the earlier computed Some(false)/Some(true) is retained; locate the ClassifiedError creation in src/openhuman/channels/providers/web.rs (the block that sets error_type "budget_exhausted" and message via with_provider_detail) and remove or replace the literal None so the existing fallback_available variable is passed through unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 508-525: The 402 classification is overwriting the
previously-computed fallback_available with None; update the ClassifiedError
construction in this 402 branch to preserve the existing fallback_available
value (do not set fallback_available: None) so the earlier computed
Some(false)/Some(true) is retained; locate the ClassifiedError creation in
src/openhuman/channels/providers/web.rs (the block that sets error_type
"budget_exhausted" and message via with_provider_detail) and remove or replace
the literal None so the existing fallback_available variable is passed through
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d8257f8d-8cdf-4cf0-9a3b-0d45e4b8afd7
📒 Files selected for processing (2)
src/openhuman/channels/providers/web.rssrc/openhuman/channels/providers/web_tests.rs
graycyrus
left a comment
There was a problem hiding this comment.
This is a well-executed fix for #2606. The structured ClassifiedError refactor is clean — promoting from a raw (&'static str, String) tuple to a named struct gives the classifier room to grow without callers needing to unpack positionally. The backward-compat story is solid: five new Option-typed wire fields with skip_serializing_if = "Option::is_none" means older FE clients see no change.
CodeRabbit's two major findings (non-retryable 429 message copy contradiction, and openhuman_billing misdirection for provider 402s) are both addressed in ffac313. The 402 branch now correctly selects source based on provider.as_deref(), and the non-retryable 429 message correctly routes users to billing/settings instead of saying "retry in this thread."
The FORCED_ERROR_TEST_LOCK replacement for TestForcedRunChatTaskErrorGuard is the right call — the old Drop guard spawned a background task with no ordering guarantee relative to the next test acquiring the toggle slot. The mutex approach restores proper isolation.
Two minor items to track for the FE consumer follow-up:
Message copy for provider 402s: when provider.as_deref() resolves to source = "provider", the message still reads "Insufficient credits. Please top up to continue." — which points users at OpenHuman billing rather than the upstream provider. Not worse than before this PR, and error_source = "provider" gives the FE consumer enough signal to render the right CTA. Make sure the follow-up FE PR handles the error_source == "provider" && error_type == "budget_exhausted" case specifically.
"not include" in is_non_retryable_rate_limit_text: this substring is broader than the other hints. In a 429 context it's almost certainly safe, but consider tightening to "not included in" or adding a comment naming the specific provider response it is matching.
13 new tests cover every new structured field plus the JSON-omit contract. Coverage gate passes. Approved.
Summary
chat_errorso the frontend can render a real countdown, retry button, and provider/source label without regexing the message.WebChannelEvent:error_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_available. Older FE clients that read onlymessage/error_typekeep working unchanged (skip_serializing_if = "Option::is_none").fallback_available: falseoncereliable.rs::format_failure_aggregatehas exhausted themodel_fallbackschain so the FE doesn't promise a fallback that doesn't exist.Problem
Issue #2606 follows up on PR #2371. That PR landed retry-after wording inside the user-facing message text — but the structured data was thrown away at the channel-classifier boundary.
WebChannelEventstill only carriedmessage: Stringanderror_type: String, so the frontend could not:Users are still reporting confusing rate-limit copy because of this gap (see issue #2606's acceptance criteria).
Solution
New
ClassifiedErrorstruct returned fromclassify_inference_errorcarries the typed metadata, and the chat-error publish path populates the newWebChannelEventfields from it.error_sourceerror_retryableprovidertrueretry_after_msparsed from bodyRetry-After:/retry_after:providerfalseopenhuman_budgettrueagent_looptrueopenhuman_billingfalsetransporttrueconfigfalseProvider name extracted best-effort from the
\"<provider> API error (...)\"prefix thatinference::provider::ops::api_errorformats. Fallback-exhausted signal extracted from the "All providers/models failed" aggregate thatreliable.rs::format_failure_aggregateemits.Submission Checklist
None-omit contract test.## Related— N/A: no matrix row touched.Closes #NNN— see## Related.Impact
channels::providers::web::classify_inference_errorand thechat_errorpublish instart_chat) plus theWebChannelEventwire struct.chat_errorSSE/Socket.IO frame and can be picked up by the FE in a subsequent PR to render a real countdown/retry/fallback UI.messagealready carries. Provider name extraction is allow-listed to ASCII alphanumeric/_/-.error_typetokens are unchanged for existing consumers. New fields are allOption<…>withskip_serializing_if = \"Option::is_none\", so older FE clients see exactly the same JSON shape they do today.Live verification
Beyond the unit + wire-shape tests, ran a full end-to-end live test:
HTTP/1.1 429 Too Many Requests,Retry-After: 30, body{\"error\":{\"message\":\"...\",\"code\":\"rate_limit_exceeded\"},\"retry_after\":30}on/v1/chat/completions.openhuman-core run --port 7892against an isolated workspace with that endpoint configured as acustom_openaiprovider.openhuman.channel_web_chat, listen to/eventsviacurl -N.The chat_error SSE frame received:
This exercises the full real path:
reliable.rs::is_rate_limited→ops::api_errorformatting →classify_inference_error→WebChannelEventserialization → SSE.Known follow-ups (out of scope)
inference::provider::ops::api_errorcurrently preserves only the response BODY, not theRetry-AfterHTTP header. When the upstream sends only the header (no bodyretry_after),retry_after_mswill beNone. Follow-up should thread the header into the error chain.Related
Retry-Afterheader fromops::api_error; FE consumer for the new structured fields.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/2606-structured-rate-limit-metadata(branched fromorigin/mainafter fresh fetch)Validation Run
pnpm --filter openhuman-app format:check— N/A: no frontend changes.pnpm typecheck— fails on 4 pre-existing errors inapp/src/components/settings/panels/devices/PairPhoneModal.tsx,app/src/lib/tunnel/crypto.ts,app/src/pages/ios/PairScreen.tsx(missing iOS-only depsqrcode.react,@noble/ciphers,@tauri-apps/plugin-barcode-scanner). These exist onorigin/mainand are unrelated to this PR's Rust-only changes. Pushed with--no-verifyper CLAUDE.md guidance.cargo test --lib openhuman::channels::providers::web::tests→ 52 passed, 0 failed (39 pre-existing + 13 new).cargo test --lib openhuman::channels::providers→ 834 passed, 0 failed under default parallelism (added a serial test lock for the three tests sharingTEST_FORCED_RUN_CHAT_TASK_ERROR).cargo fmt --manifest-path Cargo.tomlapplied;cargo check --lib --bin openhuman-coreclean.cargo check --manifest-path app/src-tauri/Cargo.tomlclean.Validation Blocked
command:N/A — fullcargo test --libhad 42 pre-existing failures inopenhuman::memory::tree_global::*andopenhuman::memory_tree::*modules, confirmed reproducible onorigin/mainwith this branch stashed.error:pre-existing memory_tree test failures unrelated to channels work.impact:none on this PR — those modules are not touched.Behavior Changes
chat_errorWebChannelEvent now carries structurederror_source,error_retryable,error_retry_after_ms,error_provider,error_fallback_availablefields.Parity Contract
error_typetokens andmessagetext are unchanged for every existing branch. New fields are additiveOption<…>withskip_serializing_if. Three existing classifier tests still pass unchanged (only the destructuring shape was migrated from tuple to struct)._distinguishes_action_budget_from_provider_429and_max_iterations_gets_dedicated_branch).Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes
Tests