
feat(proxy): deterministic substitution/failover engine (v0.9.1) #38

Merged: WayforthOfficial merged 2 commits into main from feat/substitution-failover-engine on Jun 22, 2026

@WayforthOfficial (Owner)

Extends the Reliability Proxy /proxy/{slug} from single-hop into a config-driven, multi-hop substitution/failover engine. Deterministic layer only — ML-ranked selection is a later phase. Built in apps/api (co-located with /proxy + billing; zero cross-service hop), per the approved plan.

⚠️ Idempotency (#5) — FLAGGED FOR YOUR REVIEW BEFORE MERGE

  • pre-send (conn refused, connect-timeout, HTTP 5xx, 429) → always fail over.
  • post-send-ambiguous (read-timeout-after-send, invalid-body-after-200) on the managed rail → fails over (the end user is always refunded, so never double-charged) under a cost cap (failover_post_send_max_cost, default 25 cr), preceded by a same-provider retry-first, with full instrumentation (duplicate_upstream_cost_possible + measured second_upstream_cost_credits per event).
  • x402 / on-chain per-call settlement → STRICT: never fail over (and no retry) on post-send; the error is surfaced. Irreversible double-settlement is the real risk.
  • failover_post_send defaults true, flippable to strict via env if the measured leak turns material. The residual risk is Wayforth's duplicate upstream cost — now measured in substitution_events, not assumed.
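
The policy above can be read as a small decision function. The following is a minimal sketch, not the code in core/substitution.py: the enum names, the `may_fail_over` helper, and the choice to compare the substitute hop's cost against the cap with `<=` are assumptions made for illustration. Only `failover_post_send` and `failover_post_send_max_cost` (default 25 credits) are names taken from this PR.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    PRE_SEND = "pre_send"
    POST_SEND_AMBIGUOUS = "post_send_ambiguous"


class Rail(Enum):
    MANAGED = "managed"
    X402 = "x402"  # on-chain per-call settlement


@dataclass(frozen=True)
class FailoverPolicy:
    # Field names follow the settings named in the PR; the class shape is assumed.
    failover_post_send: bool = True
    failover_post_send_max_cost: float = 25.0  # credits


def may_fail_over(failure: FailureClass, rail: Rail,
                  next_hop_cost: float, policy: FailoverPolicy) -> bool:
    """Return True when substituting to the next provider is allowed."""
    if failure is FailureClass.PRE_SEND:
        return True  # request never reached upstream, so failover is always safe
    if rail is Rail.X402:
        return False  # strict: a second settlement would be irreversible
    if not policy.failover_post_send:
        return False  # operator flipped the managed rail to strict
    # Assumption: the cap bounds the cost of the substitute hop.
    return next_hop_cost <= policy.failover_post_send_max_cost
```

With the defaults, a post-send failure on the managed rail may substitute to a 10-credit peer but not to a 30-credit one, and any post-send failure on the x402 rail surfaces the error instead.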

What it does

  • Config-driven groups (substitution_groups table, seeded web-search + llm-inference; weather/maps config-ready, no peers yet). A provider is never substituted outside its group. Ordered by wri_score when present, else curated manual_rank; never sorts on nulls, never random (verified live: tavily wri 85.5 ranks first, null-wri peers fall to manual_rank).
  • Billing correctness: charges ONLY the served provider (+ its embedded 1.5% routing fee); every failed hop nets to zero (deduct+refund, per-hop idempotency keys); the surviving ledger row is the real served provider.
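  • Failure detection: _try_execute_managed_ex classifies failures by httpx exception type and HTTP status. As merged (after the merge-review commit): ConnectError/ConnectTimeout/PoolTimeout/5xx/429 are pre_send; ReadTimeout/WriteError/WriteTimeout/200-invalid-body and anything unclassifiable are post_send.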
  • Self-heal surface: X-Wayforth-Served-By + X-Wayforth-Fallback headers (+ wrap envelope). Group exhaustion → clean 502 listing providers tried + reasons.
  • substitution_events log (migration 064, shared DB) — future learned-layer training signal; nothing consumes it yet.
  • Guardrails: depth cap; per-provider creds via key_var (missing → skip); primary success path untouched (engine only invoked on the failure branch — zero overhead).
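
The ordering rule in the first bullet can be sketched as follows. This is an illustration under stated assumptions rather than the actual group loader: the `Peer` dataclass and `order_peers` function are hypothetical, a higher wri_score is assumed to rank earlier, a lower manual_rank is assumed to rank earlier, and the slug is used only as a final tie-breaker to keep the result deterministic. Apart from tavily and its 85.5 score, the peer names and ranks are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Peer:
    slug: str
    wri_score: Optional[float]  # None when no score has been measured
    manual_rank: int            # curated fallback order (assumed: lower is better)


def order_peers(peers: List[Peer]) -> List[Peer]:
    """Scored peers first (best score first), then unscored peers by manual_rank."""
    scored = [p for p in peers if p.wri_score is not None]
    unscored = [p for p in peers if p.wri_score is None]
    # None never reaches a sort key, so the sort cannot fail or reorder on nulls.
    scored.sort(key=lambda p: (-p.wri_score, p.manual_rank, p.slug))
    unscored.sort(key=lambda p: (p.manual_rank, p.slug))
    return scored + unscored


chain = order_peers([
    Peer("peer-b", None, 2),
    Peer("tavily", 85.5, 3),
    Peer("peer-a", None, 1),
])
print([p.slug for p in chain])  # ['tavily', 'peer-a', 'peer-b']
```

Partitioning before sorting is one way to guarantee that a null score is never compared against a float; the real loader may do the same thing in SQL instead.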

Verification

  • 13 engine unit tests green (ordering, depth cap, pre/post classification + gates, served-only billing, retry-first, invalid-body, 502 shape, event emission). Full offline suite green; import chain clean.
  • Migration 064 applied + verified on prod (additive/idempotent): both groups seeded, substitution_events (15 cols) created. Loader ordering validated live against real wri data.
  • VERSION 0.9.0 → 0.9.1.

Not auto-merging — please review the idempotency (#5) policy above first.

🤖 Generated with Claude Code

MytelligentPRV and others added 2 commits June 21, 2026 22:31
Extends the Reliability Proxy /proxy/{slug} from single-hop into a config-driven,
multi-hop substitution engine. Deterministic layer only — ML-ranked selection is
a later phase.

- core/substitution.py: group loader (cached; ordered by wri_score when present,
  else curated manual_rank — never sorts on nulls, never random), FailoverPolicy,
  category validators, run_with_failover (classify -> retry-first -> idempotency
  gate -> chain -> bill -> log), substitution_events writer.
- Substitution GROUPS are config-driven (substitution_groups table, seeded
  web-search + llm-inference; weather/maps config-ready, no peers yet). A provider
  is NEVER substituted outside its group.
- Failure detection distinguishes pre_send (conn refused, connect-timeout, 5xx,
  429) vs post_send_ambiguous (read-timeout-after-send, invalid-body-after-200)
  via httpx exception types — new _try_execute_managed_ex in execute.py.
- Billing: charges ONLY the served provider (+ its embedded 1.5% routing fee);
  every failed hop nets to zero (deduct+refund, per-hop idempotency keys); the
  surviving ledger row is the real served provider.
- Idempotency (#5, FLAGGED FOR REVIEW): pre-send always fails over; post-send on
  the MANAGED rail fails over (user always refunded) under a cost cap, with a
  same-provider retry-first and full instrumentation (duplicate_upstream_cost_
  possible + measured second_upstream_cost on each event); x402/on-chain is STRICT
  (no post-send failover). failover_post_send defaults true, flippable to strict.
- substitution_events log (migration 064, shared DB) — future learned-layer
  training signal; nothing consumes it yet.
- Visible self-heal: X-Wayforth-Served-By + X-Wayforth-Fallback headers (+ wrap
  envelope served_by/fallback). Group exhaustion -> clean 502 listing providers
  tried + reasons.
- Guardrails: depth cap; per-provider creds via key_var (missing -> skip);
  PRIMARY SUCCESS PATH UNTOUCHED (engine only invoked on the failure branch).
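
To illustrate the billing bullet above (every failed hop nets to zero, only the served provider is charged), here is a minimal in-memory sketch. The `Ledger` class, the `hop_key` format, and `bill_chain` are hypothetical and stand in for the real billing code; each price is assumed to already include the embedded 1.5% routing fee.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ledger:
    balance: float
    entries: Dict[str, float] = field(default_factory=dict)  # idempotency key -> signed amount

    def apply(self, key: str, amount: float) -> bool:
        """Apply a signed amount once per key; a replayed key is a no-op."""
        if key in self.entries:
            return False
        self.entries[key] = amount
        self.balance += amount
        return True


def hop_key(request_id: str, hop: int, action: str) -> str:
    # Assumed key shape: one key per (request, hop, action).
    return f"{request_id}:{hop}:{action}"


def bill_chain(ledger: Ledger, request_id: str,
               hops: List[Tuple[str, float, bool]]) -> Optional[str]:
    """hops holds (provider, price, succeeded). Returns the served provider, or None."""
    for i, (provider, price, succeeded) in enumerate(hops):
        ledger.apply(hop_key(request_id, i, "deduct"), -price)
        if succeeded:
            return provider  # the surviving deduction belongs to the served provider
        ledger.apply(hop_key(request_id, i, "refund"), price)  # failed hop nets to zero
    return None


ledger = Ledger(balance=100.0)
served = bill_chain(ledger, "req-1", [("primary", 10.0, False), ("peer", 12.0, True)])
print(served, ledger.balance)  # peer 88.0
```

Because every deduct and refund carries its own key, replaying the same chain for the same request leaves the balance unchanged.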

Migration 064 applied + verified on prod (additive/idempotent). 13 engine unit
tests green. VERSION 0.9.0 -> 0.9.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n idempotency key

Merge-review fixes (billing-critical, #5 guarantee):
- Settlement matrix: WriteError/WriteTimeout move pre_send -> post_send_ambiguous
  (a partial write may have reached + been processed upstream). PoolTimeout moves
  post_send -> pre_send (no connection was ever acquired → request never sent).
  Now: ConnectError/ConnectTimeout/PoolTimeout=pre_send; ReadTimeout/WriteError/
  WriteTimeout=post_send; 5xx/429 received=pre_send (fail-over-safe);
  unclassifiable=post_send (fail safe). Direct parametrized test of the matrix.
- Retry-first idempotency gate: a same-provider retry after a POST-send failure
  only runs when the provider honors an idempotency key (_IDEMPOTENCY_KEY_PROVIDERS,
  empty today) — otherwise it would create the duplicate cost it's meant to
  prevent, so it is SKIPPED (straight to the cost-capped, instrumented
  substitution). pre_send retry is always safe.
- Tests: classification matrix (httpx + http-status), post-send-retry-skipped,
  pre-send-retry-succeeds, cap-exceeded refund safety (primary still refunded).
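
The settlement matrix and the retry-first gate described in this commit can be sketched as below. This is not the code in execute.py: to keep the example dependency-free it matches on the exception class name, whereas real code would more likely use isinstance checks against the httpx exception classes. The function names are hypothetical; `_IDEMPOTENCY_KEY_PROVIDERS` is the name from the commit and is empty, as stated there.

```python
from typing import FrozenSet, Optional

PRE_SEND = "pre_send"
POST_SEND = "post_send_ambiguous"

# httpx exception class name -> failure class, per the matrix in this commit.
_EXCEPTION_CLASS = {
    "ConnectError": PRE_SEND,
    "ConnectTimeout": PRE_SEND,
    "PoolTimeout": PRE_SEND,    # no connection was acquired, so nothing was sent
    "ReadTimeout": POST_SEND,
    "WriteError": POST_SEND,    # a partial write may have been processed upstream
    "WriteTimeout": POST_SEND,
}

_IDEMPOTENCY_KEY_PROVIDERS: FrozenSet[str] = frozenset()  # empty today


def classify_exception(exc: BaseException) -> str:
    """Unclassifiable errors fall back to post_send, the conservative class."""
    return _EXCEPTION_CLASS.get(type(exc).__name__, POST_SEND)


def classify_status(status_code: int) -> Optional[str]:
    """A received 5xx or 429 is treated as pre_send; other statuses are not classified here."""
    if status_code == 429 or 500 <= status_code <= 599:
        return PRE_SEND
    return None


def should_retry_same_provider(
    failure_class: str,
    provider: str,
    idempotent_providers: FrozenSet[str] = _IDEMPOTENCY_KEY_PROVIDERS,
) -> bool:
    """Retry-first gate: a post-send retry needs a provider that honors an idempotency key."""
    if failure_class == PRE_SEND:
        return True
    return provider in idempotent_providers
```

With the allow-list empty, every post-send failure skips the same-provider retry and goes straight to the cost-capped substitution, while pre-send failures are always retried first.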

24 engine tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@WayforthOfficial WayforthOfficial merged commit d3d83b6 into main Jun 22, 2026
1 check failed
2 participants