Independent corroboration: 179K API calls confirm cache_read weight change is the dominant factor #3

@seanGSISG

quota-composition-breakdown.mjs.md
cache-read-weight-transition.mjs.md
dual-window-simulation.mjs.md
thinking-token-estimation.mjs.md
rates.json.md
tool-use-deep-dive.mjs.md

Independent corroboration of dual-window findings with 179K API calls (Dec 2025 - Apr 2026)

Your rate-limit header analysis in 02_RATELIMIT-HEADERS.md identified the dual sliding window system and measured per-1% utilization costs. You noted a key limitation: "No before-data. Proxy started April 4."

We have before-data. 178,009 API calls across 5,605 session files from a Max 20x ($200/mo) account, spanning Dec 23, 2025 to April 11, 2026. We built four analysis scripts to validate your findings against the full transition period.

Your benchmarks hold up

Our CacheRead per estimated 1% utilization: 1.62M-1.72M tokens. Your range: 1.5M-2.1M. Comfortably inside it, just below the midpoint.

Our cache_read share of visible tokens: 89.8-95.2% across months. Your measurement: 96-99%. Close match. The slight gap likely comes from our higher cache_write share in months with more cold starts.

The cache_read weight change is the whole story

We computed daily quota under two formulas, the old one (cache_read weighted 0x) and the new one (cache_read weighted 1x), then totaled it by month:

Month   Quota (cache_read 0x)   Quota (cache_read 1x)   Multiplier (1x / 0x)
Feb     219.1M                  2,153.6M                9.8x
Mar     394.2M                  4,838.0M                12.3x
Apr     150.1M                  2,200.9M                14.7x

The multiplier sits at roughly 10-15x throughout the dataset, all the way back to December 2025. Cache_read has always been 90-95% of visible token volume. The ratio doesn't jump at a transition date. What changed is whether the quota system applies that weight.
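A minimal sketch of the 0x vs 1x comparison, assuming each day's usage has already been summed into four token counts. Only the cache_read weight switch comes from the analysis; the other weights and all function names here are hypothetical placeholders, not confirmed quota rules.

```javascript
// Hypothetical weights. Only the cache_read switch (0x vs 1x) comes from the
// analysis; output = 5 mirrors the 5x weight in the thinking-token table, and
// input/cacheWrite = 1 are placeholders, not confirmed quota rules.
const BASE_WEIGHTS = { input: 1, output: 5, cacheWrite: 1, cacheRead: 0 };

// Weighted quota for a usage total under a given weight table.
function quota(usage, weights) {
  return (
    usage.input * weights.input +
    usage.output * weights.output +
    usage.cacheWrite * weights.cacheWrite +
    usage.cacheRead * weights.cacheRead
  );
}

// Fraction of visible (unweighted) tokens that are cache reads.
function cacheReadShare(usage) {
  const visible = usage.input + usage.output + usage.cacheWrite + usage.cacheRead;
  return usage.cacheRead / visible;
}

// Quota under the old (0x) and new (1x) cache_read weight, plus their ratio.
function compareRegimes(usage) {
  const q0 = quota(usage, { ...BASE_WEIGHTS, cacheRead: 0 });
  const q1 = quota(usage, { ...BASE_WEIGHTS, cacheRead: 1 });
  return { q0, q1, multiplier: q1 / q0, share: cacheReadShare(usage) };
}

// Illustrative (made-up) daily totals with ~93% cache reads.
const day = { input: 1e6, output: 2e6, cacheWrite: 10e6, cacheRead: 180e6 };
console.log(compareRegimes(day));
// q0 = 21,000,000, q1 = 201,000,000, multiplier ≈ 9.57, share ≈ 0.933
```

The multiplier depends only on the token mix and the weights, so a steady cache_read share gives a steady multiplier; the transition only changes which of q0 and q1 the quota system actually charges.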

Under the old formula, our maximum quota consumption in any 5-hour window across the entire dataset was 21.2M tokens, or 11.8% of the budget. We would never have been rate-limited. Not in any month.

5-hour window simulation confirms the transition

We bucketed all calls into 5-hour tumbling windows using your midpoint of 1.8M per 1% (100% = ~180M visible tokens):

Month   5h windows   Avg util   Max util   Windows >80%   Windows >100%
Feb     70           17.0%      57.4%      0              0
Mar     103          27.0%      123.9%     7              1
Apr     47           23.1%      156.1%     1              1

February: zero windows exceeded 80%. March: seven crossed 80%, one crossed 100%. That lines up with when we first hit rate limits.

Under 0x weight, every one of those windows was fine. The March window that crossed 100% had a quota(0x) of just 15.3M, or 8.5%.
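The tumbling-window bucketing can be sketched like this, assuming each call already carries an epoch-millisecond timestamp and its weighted quota. The 180M budget is 100 times the 1.8M midpoint of the 1.5M-2.1M per-1% range. Aligning windows to the Unix epoch is an assumption, since the script's own alignment isn't stated, and `simulateWindows` is a hypothetical name.

```javascript
// Budget: 100 x the midpoint of the 1.5M-2.1M tokens-per-1% range.
const PER_PERCENT = (1.5e6 + 2.1e6) / 2; // 1.8M tokens per 1%
const BUDGET = PER_PERCENT * 100; // ~180M tokens = 100%
const WINDOW_MS = 5 * 60 * 60 * 1000; // 5 hours

// calls: [{ ts: epoch milliseconds, quota: weighted tokens }]
// Windows are tumbling (non-overlapping) and, as an assumption, aligned to the Unix epoch.
function simulateWindows(calls) {
  const totals = new Map();
  for (const { ts, quota } of calls) {
    const key = Math.floor(ts / WINDOW_MS);
    totals.set(key, (totals.get(key) ?? 0) + quota);
  }
  // Utilization in percent for every window that saw at least one call.
  const utils = [...totals.values()].map((q) => (q / BUDGET) * 100);
  return {
    windows: utils.length,
    avgUtil: utils.length ? utils.reduce((a, b) => a + b, 0) / utils.length : 0,
    maxUtil: utils.length ? Math.max(...utils) : 0,
    over80: utils.filter((u) => u > 80).length,
    over100: utils.filter((u) => u > 100).length,
  };
}

// A single 21.2M-token window, like the old-formula maximum, is ~11.8% of budget.
console.log(simulateWindows([{ ts: 0, quota: 21.2e6 }]).maxUtil.toFixed(1)); // prints 11.8
```

Only windows containing at least one call are counted, which is one plausible reading of the "Windows" column; averaging over all calendar windows, idle ones included, would lower the average utilization.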

The thinking token "blind spot" is not a blind spot

You flagged thinking tokens as a major unknown. We can measure them because our JSONL logs contain the actual content blocks.

Thinking blocks appear in 1.9-36.7% of calls. Average size: 112-664 characters (~28-166 tokens). Quota impact:

Month   Gap tokens   Gap × 5x weight   Share of total quota
Feb     0.01M        0.03M             0.0%
Mar     0.06M        0.29M             0.0%
Apr     0.49M        2.44M             0.1%

Thinking tokens contribute 0.0-0.1% of total quota. The real hidden output comes from server-side iterations: a usage field that grew from 7.7% of calls in Feb to 52.4% in Apr, and calls carrying it produce 60x more output tokens. We documented this separately in anthropics/claude-code#45756.
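Counting thinking characters from logged content blocks and dividing by ~4 reproduces the 112-664 characters (~28-166 tokens) conversion above. This is a sketch under assumptions: the record shape (`message.content` holding blocks with `type: "thinking"`) is assumed for the JSONL logs, the chars/4 ratio is a heuristic rather than a tokenizer, and the function names are hypothetical.

```javascript
// Heuristic: ~4 characters per token. This matches the 112-664 chars to
// ~28-166 tokens conversion in the text; it is not a real tokenizer.
const CHARS_PER_TOKEN = 4;
const OUTPUT_WEIGHT = 5; // the 5x weight applied to gap tokens in the table

// Assumed log record shape: { message: { content: [{ type: "thinking", thinking: "..." }, ...] } }
function thinkingChars(record) {
  const blocks = record?.message?.content;
  if (!Array.isArray(blocks)) return 0;
  return blocks
    .filter((b) => b?.type === "thinking" && typeof b.thinking === "string")
    .reduce((sum, b) => sum + b.thinking.length, 0);
}

function estimateTokens(chars) {
  return Math.round(chars / CHARS_PER_TOKEN);
}

// Fraction of total quota taken by gap tokens once the output weight is applied.
function gapShare(gapTokens, totalQuota) {
  return (gapTokens * OUTPUT_WEIGHT) / totalQuota;
}

// April figures from the table: 0.49M gap tokens against 2,200.9M total quota.
console.log((gapShare(0.49e6, 2200.9e6) * 100).toFixed(2) + "%"); // prints 0.11%
```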

@fgrosswig's 64x reduction in our data

Our heaviest pre-transition day (March 21): 356.2M visible tokens, quota(0x) = 27.4M, quota(1x) = 359.0M, a 13.1x multiplier. On the days with the highest multipliers, it reaches 18.9x to 31.0x.

A 64x multiplier requires cache_read at ~98.4% of visible tokens. You measured 96-99%. The numbers fit. Accounts with higher R:W (cache_read to cache_write) ratios, meaning less cache_write from cold starts, would see multipliers closer to 64x.
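The ~98.4% figure follows from a rough model in which every non-cache-read token counts once: if cache_read is a share s of visible tokens, then quota(1x) / quota(0x) ≈ 1 / (1 - s). A sketch with hypothetical function names:

```javascript
// Simplifying assumption: every non-cache-read token weighs 1x, so with
// cache_read share s of visible tokens V:
//   quota(0x) = (1 - s) * V,  quota(1x) = V,  multiplier m = 1 / (1 - s).
function multiplierFromShare(s) {
  return 1 / (1 - s);
}

// Inverse: the cache_read share needed to produce a given multiplier.
function shareFromMultiplier(m) {
  return 1 - 1 / m;
}

console.log((shareFromMultiplier(64) * 100).toFixed(1)); // prints 98.4
// A 96-99% share maps to roughly 25x-100x under this model, which brackets 64x.
console.log(multiplierFromShare(0.96).toFixed(0), multiplierFromShare(0.99).toFixed(0)); // prints 25 100
// The measured March 21 multiplier uses the real quotas instead of the model.
console.log((359.0 / 27.4).toFixed(1)); // prints 13.1
```

If no token type weighs less than 1x, quota(0x) is at least (1 - s) × V, so this model gives an upper bound on the multiplier for a given share; output tokens carrying extra weight pull the real figure somewhat lower.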

Counterfactual

Month   Active days   Days >180M, quota(0x)   Days >180M, quota(1x)
Feb     26            0                       1
Mar     31            0                       11
Apr     11            0                       6

Zero days exceed budget under the old formula. Eighteen exceed it under the new formula.
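The daily counterfactual amounts to grouping per-call quota by calendar day and counting the days above the 180M budget under each regime. A minimal sketch; grouping by UTC date and the per-call record shape are assumptions, and `counterfactual` is a hypothetical name.

```javascript
const DAILY_BUDGET = 180e6; // the ~180M figure from the window analysis, applied per day

// calls: [{ ts: epoch ms, q0: quota at 0x, q1: quota at 1x }]
// Groups by UTC calendar date; the original scripts' timezone is not stated.
function dailyTotals(calls) {
  const days = new Map();
  for (const { ts, q0, q1 } of calls) {
    const day = new Date(ts).toISOString().slice(0, 10); // "YYYY-MM-DD"
    const t = days.get(day) ?? { q0: 0, q1: 0 };
    t.q0 += q0;
    t.q1 += q1;
    days.set(day, t);
  }
  return days;
}

// Active days, plus how many of them exceed the budget under each regime.
function counterfactual(calls, budget = DAILY_BUDGET) {
  const totals = [...dailyTotals(calls).values()];
  return {
    activeDays: totals.length,
    over0x: totals.filter((t) => t.q0 > budget).length,
    over1x: totals.filter((t) => t.q1 > budget).length,
  };
}
```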

Scripts

All four scripts read the same 5,605 session files. They're in our analysis repo if you want to cross-reference:

  • quota-composition-breakdown.mjs - Token composition under 0x vs 1x weight, ArkNill benchmark comparison
  • cache-read-weight-transition.mjs - Daily/weekly quota under both regimes, transition detection, counterfactual
  • dual-window-simulation.mjs - 5h tumbling window simulation, utilization distributions, 7d rolling peaks
  • thinking-token-estimation.mjs - Thinking block prevalence, output gap analysis, blind spot cross-reference
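For the 7d rolling peaks that dual-window-simulation.mjs reports, a two-pointer trailing sum is one way to compute them. This is a hypothetical sketch, not the script's code; evaluating the window only at each call's timestamp is an assumption.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;
const ROLLING_MS = 7 * DAY_MS;

// calls: [{ ts: epoch ms, quota: weighted tokens }]
// Returns the largest trailing 7-day quota sum, evaluated at each call's timestamp.
function peakRolling7d(calls) {
  const sorted = [...calls].sort((a, b) => a.ts - b.ts);
  let start = 0;
  let sum = 0;
  let peak = 0;
  for (let end = 0; end < sorted.length; end++) {
    sum += sorted[end].quota;
    // Drop calls that fall outside the 7-day window ending at this call.
    while (sorted[end].ts - sorted[start].ts >= ROLLING_MS) {
      sum -= sorted[start].quota;
      start++;
    }
    peak = Math.max(peak, sum);
  }
  return peak;
}
```

Unlike the 5h tumbling buckets, this window slides with every call, so a burst straddling a bucket boundary is still counted in full.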

Dataset: 178,009 non-synthetic API calls, 5,605 session files, Dec 23, 2025 to Apr 11, 2026, Max 20x ($200/mo), two machines (Linux workstation + Windows laptop).
