quota-composition-breakdown.mjs.md
cache-read-weight-transition.mjs.md
dual-window-simulation.mjs.md
thinking-token-estimation.mjs.md
rates.json.md
tool-use-deep-dive.mjs.md
Independent corroboration of dual-window findings with 179K API calls (Dec 2025 - Apr 2026)
Your rate-limit header analysis in 02_RATELIMIT-HEADERS.md identified the dual sliding window system and measured per-1% utilization costs. You noted a key limitation: "No before-data. Proxy started April 4."
We have before-data. 178,009 API calls across 5,605 session files from a Max 20x ($200/mo) account, spanning Dec 23, 2025 to April 11, 2026. We built four analysis scripts to validate your findings against the full transition period.
## Your benchmarks hold up
Our CacheRead per estimated 1% utilization: 1.62M-1.72M tokens. Your range: 1.5M-2.1M. Dead center.
Our cache_read share of visible tokens: 89.8-95.2% across months. Your measurement: 96-99%. Close match. The slight gap likely comes from our higher cache_write share in months with more cold starts.
## The cache_read weight change is the whole story
We computed daily quota under two formulas: the old (cache_read = 0x) and the new (cache_read = 1x).
| Month | Quota (0x) | Quota (1x) | Multiplier |
|-------|-----------:|-----------:|-----------:|
| Feb   | 219.1M     | 2,153.6M   | 9.8x       |
| Mar   | 394.2M     | 4,838.0M   | 12.3x      |
| Apr   | 150.1M     | 2,200.9M   | 14.7x      |
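As a sketch, the two regimes differ only in the weight applied to `cache_read_input_tokens`. Treating every other category at 1x is a simplifying assumption for illustration (the real weights for cache writes and output are not confirmed here); the field names follow the API usage object:

```javascript
// Toy quota computation; only the cache_read weight differs between
// the old (0x) and new (1x) regimes. All other weights are assumed
// to be 1x purely for illustration.
function dailyQuota(calls, cacheReadWeight) {
  return calls.reduce((sum, u) =>
    sum +
    (u.input_tokens ?? 0) +
    (u.cache_creation_input_tokens ?? 0) +
    cacheReadWeight * (u.cache_read_input_tokens ?? 0) +
    (u.output_tokens ?? 0), 0);
}

// A call whose visible tokens are ~95% cache_read:
const calls = [{
  input_tokens: 2_000,
  cache_creation_input_tokens: 2_500,
  cache_read_input_tokens: 95_000,
  output_tokens: 500,
}];
dailyQuota(calls, 0); // old regime: 5,000 tokens count
dailyQuota(calls, 1); // new regime: 100,000 tokens count (20x)
```

With cache_read at ~95% of visible volume, flipping the weight from 0x to 1x multiplies counted quota by ~20x, which is the same mechanism behind the 10-15x daily multipliers in the table.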
The multiplier is 10-15x on every single day since December 2025. Cache_read has always been 90-95% of visible token volume. The ratio doesn't jump at a transition date. What changed is whether the quota system applies that weight.
Under the old formula, our maximum quota consumption in any 5-hour window across the entire dataset was 21.2M tokens, or 11.8% of the budget. We would never have been rate-limited. Not in any month.
## 5-hour window simulation confirms the transition
We bucketed all calls into 5-hour tumbling windows using your midpoint (100% = ~180M visible tokens):
| Month | Windows | Avg Util | Max Util | >80% | >100% |
|-------|--------:|---------:|---------:|-----:|------:|
| Feb   | 70      | 17.0%    | 57.4%    | 0    | 0     |
| Mar   | 103     | 27.0%    | 123.9%   | 7    | 1     |
| Apr   | 47      | 23.1%    | 156.1%   | 1    | 1     |
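The bucketing itself is simple; a minimal sketch, where `visibleTokens` is assumed to be the sum of all four usage fields and 180M is the midpoint quoted above:

```javascript
// Assign each call to a 5-hour tumbling window, then report each
// window's utilization as visible tokens over the ~180M midpoint.
const WINDOW_MS = 5 * 60 * 60 * 1000;
const BUDGET = 180_000_000; // 100% utilization midpoint

function windowUtilizations(calls) {
  const windows = new Map();
  for (const { ts, visibleTokens } of calls) {
    const key = Math.floor(ts / WINDOW_MS); // tumbling, not sliding
    windows.set(key, (windows.get(key) ?? 0) + visibleTokens);
  }
  return [...windows.values()].map(total => total / BUDGET);
}
```

Note the caveat: a true sliding window needs per-call expiry, so tumbling windows are a coarser approximation that can undercount peaks straddling a boundary.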
February: zero windows exceeded 80%. March: seven crossed 80%, one crossed 100%. That lines up with when we first hit rate limits.
Under 0x weight, every one of those windows was fine. The March window that crossed 100% had a quota(0x) of just 15.3M, or 8.5%.
## The thinking token "blind spot" is not a blind spot
You flagged thinking tokens as a major unknown. We can measure them because our JSONL logs contain the actual content blocks.
Thinking blocks appear in 1.9-36.7% of calls. Average size: 112-664 characters (~28-166 tokens). Quota impact:
| Month | Gap Tokens | Gap * 5x Weight | % of Total Quota |
|-------|-----------:|----------------:|-----------------:|
| Feb   | 0.01M      | 0.03M           | 0.0%             |
| Mar   | 0.06M      | 0.29M           | 0.0%             |
| Apr   | 0.49M      | 2.44M           | 0.1%             |
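The character-to-token conversion behind the ~28-166 token figures is the usual ~4 characters/token heuristic, not an exact tokenizer; a sketch:

```javascript
// Estimate thinking tokens from JSONL content blocks using the rough
// 4-characters-per-token heuristic (an approximation, not a tokenizer).
const CHARS_PER_TOKEN = 4;

function estimateThinkingTokens(contentBlocks) {
  return contentBlocks
    .filter(block => block.type === 'thinking')
    .reduce((sum, block) =>
      sum + Math.round(block.thinking.length / CHARS_PER_TOKEN), 0);
}

// A 112-character thinking block estimates to ~28 tokens, the low end
// of the range above; non-thinking blocks are ignored.
estimateThinkingTokens([
  { type: 'thinking', thinking: 'x'.repeat(112) },
  { type: 'text', text: 'visible answer' },
]); // 28
```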
Thinking tokens contribute 0.0-0.1% of total quota. The real hidden output comes from server-side iterations (a usage field that grew from 7.7% of calls in Feb to 52.4% in Apr, each producing 60x more output tokens). We documented this separately in anthropics/claude-code#45756.
## @fgrosswig's 64x reduction in our data
Our heaviest pre-transition day (March 21): 356.2M visible tokens, quota(0x) = 27.4M, quota(1x) = 359.0M. A 13.1x multiplier. On peak days the multiplier reaches 18.9x to 31.0x.
A 64x multiplier requires cache_read at ~98.4% of visible tokens. You measured 96-99%. The numbers fit. Accounts with higher R:W ratios (less cache_write from cold starts) would see multipliers closer to 64x.
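The arithmetic behind that ~98.4% figure: if cache_read is a fraction f of visible tokens and, as a back-of-envelope assumption, every other category counts at 1x in both regimes, the 0x-to-1x multiplier is 1 / (1 - f).

```javascript
// Multiplier implied by a given cache_read share of visible tokens,
// assuming all other categories weigh 1x under both regimes.
const multiplier = f => 1 / (1 - f);

multiplier(0.95);     // ≈ 20x, typical of this dataset
multiplier(0.984375); // = 64x, the share implied by the 64x report
```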
## Counterfactual
| Month | Active Days | Days > 180M quota(0x) | Days > 180M quota(1x) |
|-------|------------:|----------------------:|----------------------:|
| Feb   | 26          | 0                     | 1                     |
| Mar   | 31          | 0                     | 11                    |
| Apr   | 11          | 0                     | 6                     |
Zero days exceed budget under the old formula. Eighteen exceed it under the new formula.
## Scripts
All four scripts read the same 5,605 session files. They're in our analysis repo if you want to cross-reference:
- `quota-composition-breakdown.mjs` - Token composition under 0x vs 1x weight, ArkNill benchmark comparison
- `cache-read-weight-transition.mjs` - Daily/weekly quota under both regimes, transition detection, counterfactual
- `dual-window-simulation.mjs` - 5h tumbling window simulation, utilization distributions, 7d rolling peaks
- `thinking-token-estimation.mjs` - Thinking block prevalence, output gap analysis, blind spot cross-reference
Dataset: 178,009 non-synthetic API calls, 5,605 session files, Dec 23, 2025 to Apr 11, 2026, Max 20x ($200/mo), two machines (Linux workstation + Windows laptop).