
feat(memory): single authoritative token accounting for compaction (fire + boundary)#1720

Draft
mattzcarey wants to merge 5 commits into main from fix/1593-autocompaction-calibration
Conversation

@mattzcarey mattzcarey commented Jun 9, 2026

Contributor

Fixes #1593

Design

The Session derives one authoritative context size in model tokens, and that single number drives both decisions: whether to compact, and where to cut. Because both decisions read the same number, they cannot disagree (the #1593 failure was exactly such a disagreement).

This is how earendil-works/pi does it (core/compaction/compaction.ts), and hermes-agent (agent/context_compressor.py) follows the same rule: model-reported usage is a whole-prompt signal, per-message decisions use a cheap proportional heuristic, and the two scales never mix.

Counting

Session._knownContextTokens() resolves in priority order:

  1. The compactAfter() counter, if configured. It is whole-prompt by signature, so () => lastUsage.inputTokens is legal and documented.
  2. Usage metadata on assistant messages. SessionMessage gained metadata?: unknown (an AI SDK UIMessage passes straight through), and metadata.usage / metadata.totalUsage is read the way pi reads it: calculateContextTokens returns totalTokens when present, else the sum of the components. The context estimate is the last reported usage plus the heuristic for any newer messages. When messages carry usage, configuration is one line:
    session.compactAfter(100_000);
  3. Neither. The chars/4 heuristic applies, as before.
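The priority order above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: the `Msg` and `Usage` shapes, the `heuristicTokens` helper, the free function `knownContextTokens`, and the choice to try `usage` before `totalUsage` are all assumptions. Only the names `calculateContextTokens`, `metadata.usage`, and `metadata.totalUsage` come from the description.

```typescript
// Hypothetical shapes for illustration; the real SessionMessage types metadata as unknown.
type Usage = {
  totalTokens?: number;
  inputTokens?: number;
  outputTokens?: number;
  cachedInputTokens?: number;
};
type Msg = {
  role: "user" | "assistant" | "tool";
  text: string;
  metadata?: { usage?: Usage; totalUsage?: Usage };
};

// The chars/4 heuristic, summed over a list of messages.
const heuristicTokens = (msgs: Msg[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.text.length / 4), 0);

// totalTokens when present, else the sum of the components; null when nothing is usable.
function calculateContextTokens(u: Usage | undefined): number | null {
  if (!u) return null;
  if (typeof u.totalTokens === "number" && u.totalTokens > 0) return u.totalTokens;
  const sum = (u.inputTokens ?? 0) + (u.outputTokens ?? 0) + (u.cachedInputTokens ?? 0);
  return sum > 0 ? sum : null;
}

function knownContextTokens(msgs: Msg[], counter?: (all: Msg[]) => number): number | null {
  // 1. A configured whole-prompt counter wins and only ever sees the full history.
  if (counter) return counter(msgs);
  // 2. Newest assistant usage, plus the heuristic for any messages after it.
  for (let i = msgs.length - 1; i >= 0; i--) {
    const m = msgs[i];
    if (!m || m.role !== "assistant") continue;
    const total =
      calculateContextTokens(m.metadata?.usage) ?? calculateContextTokens(m.metadata?.totalUsage);
    if (total !== null) return total + heuristicTokens(msgs.slice(i + 1));
  }
  // 3. Nothing authoritative: the caller falls back to the raw heuristic.
  return null;
}
```

In this sketch, a history whose last assistant message reports 900 input and 100 output tokens, followed by one 40-character user message, resolves to 1000 + 10 = 1010.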

Compaction

The same total flows to createCompactFunction as CompactContext.contextTokens. The tail walk is always the plain heuristic walk; when an authoritative total is known, the budget is converted into heuristic units (tailTokenBudget * heuristic / contextTokens, where heuristic is the chars/4 estimate for the whole history). This is equivalent to scaling every per-message estimate up to the model's scale: the relative distribution across messages is unchanged, and the budget is expressed in model tokens.
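A minimal sketch of that conversion, assuming a simplified message shape. The function bodies, the `findTailCut` name, and the exact break condition are assumptions made for illustration; the PR's own walk may differ in detail.

```typescript
type Msg = { id: string; text: string };

// chars/4 heuristic for one message.
const estimate = (m: Msg): number => Math.ceil(m.text.length / 4);

// Convert a budget in model tokens into heuristic units.
// A missing, non-finite, or non-positive contextTokens leaves the budget untouched.
function budgetInHeuristicUnits(tailTokenBudget: number, msgs: Msg[], contextTokens?: number): number {
  const heuristicTotal = msgs.reduce((sum, m) => sum + estimate(m), 0);
  if (
    contextTokens === undefined ||
    !Number.isFinite(contextTokens) ||
    contextTokens <= 0 ||
    heuristicTotal === 0
  ) {
    return tailTokenBudget;
  }
  return (tailTokenBudget * heuristicTotal) / contextTokens;
}

// Walk from newest to oldest, keeping messages while they fit the budget.
// Returns the index of the first message kept in the tail.
function findTailCut(msgs: Msg[], budget: number, minTailMessages: number): number {
  let used = 0;
  let cut = msgs.length;
  for (let i = msgs.length - 1; i >= 0; i--) {
    const m = msgs[i];
    if (!m) break;
    const cost = estimate(m);
    const kept = msgs.length - cut;
    if (used + cost > budget && kept >= minTailMessages) break;
    used += cost;
    cut = i;
  }
  return cut;
}

// Ten messages of 400 characters each: 100 heuristic tokens apiece, 1000 in total.
const history: Msg[] = Array.from({ length: 10 }, (_, i) => ({ id: `m${i}`, text: "x".repeat(400) }));

// The model reports 2000 tokens, so a 600-token budget is worth 300 heuristic units.
const calibratedCut = findTailCut(history, budgetInHeuristicUnits(600, history, 2000), 2);
const rawCut = findTailCut(history, budgetInHeuristicUnits(600, history), 2);
```

Here the calibrated walk keeps three messages (cut at index 7) while the uncalibrated walk keeps six (cut at index 4), because the model counts twice as many tokens as the heuristic does.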

That is the whole #1593 fix. Previously the Session's whole-prompt counter was adapted into per-message calls, so every message appeared to exceed the budget and tailTokenBudget silently degraded to minTailMessages.
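The failure mode can be reproduced in a few lines. The walk below is a hypothetical reconstruction of the pre-fix shape (a user counter called once per message), not the removed code itself; `tailCutWithCounter`, `Msg`, and the numbers are assumptions chosen for illustration.

```typescript
type Msg = { id: string; text: string };
type Counter = (msgs: Msg[]) => number;

// Tail walk that asks a user-supplied counter for each message's cost.
function tailCutWithCounter(
  msgs: Msg[],
  counter: Counter,
  budget: number,
  minTailMessages: number
): number {
  let used = 0;
  let cut = msgs.length;
  for (let i = msgs.length - 1; i >= 0; i--) {
    const m = msgs[i];
    if (!m) break;
    const cost = counter([m]);
    const kept = msgs.length - cut;
    if (used + cost > budget && kept >= minTailMessages) break;
    used += cost;
    cut = i;
  }
  return cut;
}

const history: Msg[] = Array.from({ length: 10 }, (_, i) => ({ id: `m${i}`, text: "x".repeat(400) }));

// Usage-style counter: ignores its argument and returns the whole-prompt total.
const lastUsage = { inputTokens: 50_000 };
const usageCounter: Counter = () => lastUsage.inputTokens;

// Message-scoped counter: responds to the messages it is given.
const perMessageCounter: Counter = (msgs) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.text.length / 4), 0);

const degradedCut = tailCutWithCounter(history, usageCounter, 20_000, 2);
const healthyCut = tailCutWithCounter(history, perMessageCounter, 20_000, 2);
```

With the usage-style counter, every message appears to cost 50,000 tokens against a 20,000-token budget, so the walk keeps only the two messages that minTailMessages guarantees (cut at index 8). The message-scoped counter fits the whole history in the same budget (cut at index 0).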

There are no user counters in the boundary walk at all. pi has no per-message counter either. A per-message user counter is the footgun that caused #1593, so it is removed rather than documented against. Each remaining lever means one thing:

| Lever | Meaning |
| --- | --- |
| compactAfter(threshold, { tokenCounter }) | whole-prompt total (fire + calibration) |
| assistant metadata.usage / totalUsage | same, zero config |
| protectHead / tailTokenBudget / minTailMessages | tunables, all in model tokens |

Breaking changes in the experimental package: CompactOptions.tokenCounter, the CompactTokenCounter type, and CompactContext.tokenCounter are removed. The changeset is minor.

Docs

docs/sessions.md gets a single "Token counting" section (priority order, usage-metadata example, new utils in the helper list), docs/think/index.md points at it, and experimental/session-memory/README.md documents the priority order under Auto-Compaction.

Tests

The tests assert the call contracts directly:

  • The #1593 end-to-end repro ("Auto-compaction trigger fires, but compaction doesn't run due to internal token counting") checks that the whole-prompt counter only ever sees the full history, and that the boundary lands at the model-scale budget (toMessageId: "tool-5", not the minTailMessages degradation).
  • The same createCompactFunction no-ops on the raw heuristic and cuts correctly once the context supplies contextTokens.
  • compactAfter(1000) with no counter, and usage attached to the appended assistant message, fires and cuts at the calibrated boundary.
  • calculateContextTokens handles the totalTokens, input+output, prompt+completion and cached shapes. The newest assistant usage wins, usage on non-assistant messages is ignored, and no usage returns null.
  • A non-finite or negative contextTokens is ignored and the raw heuristic applies.

Full experimental-memory suite: 193 passed (7 files). oxfmt and oxlint are clean, and the memory module has no typecheck errors.

Notes for maintainers

  • Usage extraction only reads assistant messages and only accepts positive token counts. The component-sum fallback mirrors pi and can over-count caches on providers where cached tokens are a subset of input tokens, which errs toward compacting earlier.
  • Calibration assumes the model total distributes across messages roughly in proportion to the heuristic. pi and hermes make the same assumption for trailing estimates, and it is strictly better than the previous behavior, which used an identical value for every message.
  • Users who need a custom tokenizer still have a path: put it on compactAfter() as a whole-prompt counter. Per-message precision in the walk buys little once the budget is correct at the model's scale, and the walk was the only place the two token scales could be confused.
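The cache over-count mentioned in the first note is easy to see with made-up numbers. The `Usage` shape and all figures below are hypothetical; they only illustrate what happens when cached tokens are already included in the input count.

```typescript
// Hypothetical usage report from a provider where cached tokens are a subset of input tokens.
type Usage = { inputTokens: number; outputTokens: number; cachedInputTokens: number };

// Component-sum fallback: adds every component, as described in the note above.
const componentSum = (u: Usage): number => u.inputTokens + u.outputTokens + u.cachedInputTokens;

const reported: Usage = { inputTokens: 10_000, outputTokens: 500, cachedInputTokens: 8_000 };

// On such a provider the true context is input + output, because cached is part of input.
const actualContext = reported.inputTokens + reported.outputTokens;
const estimatedContext = componentSum(reported);

// With a 15,000-token threshold the estimate fires compaction while the true size would not.
const threshold = 15_000;
const firesOnEstimate = estimatedContext >= threshold;
const firesOnActual = actualContext >= threshold;
```

The estimate is 18,500 against a true 10,500, so compaction fires earlier than strictly necessary, which is the safe direction for the error.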

…nction

A usage-style counter (() => usage.inputTokens, ignoring its arguments)
previously returned the same huge value for every message in the tail
walk, degrading tailTokenBudget to minTailMessages. Probe the counter
with zero messages to detect its scope: message-scoped counters are
used per-message as before; whole-prompt counters are called once over
the full history and the total calibrates the built-in heuristic, so
the budget is honored at the model's scale with O(1) counter calls.

Fixes #1593
@changeset-bot

changeset-bot Bot commented Jun 9, 2026

🦋 Changeset detected

Latest commit: 8e6cf35

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

| Name | Type |
| --- | --- |
| agents | Minor |

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

@pkg-pr-new

pkg-pr-new Bot commented Jun 9, 2026

agents

npm i https://pkg.pr.new/agents@1720

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1720

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1720

create-think

npm i https://pkg.pr.new/create-think@1720

hono-agents

npm i https://pkg.pr.new/hono-agents@1720

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1720

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1720

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1720

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1720

commit: 8e6cf35

@mattzcarey

Contributor Author

not convinced by this

@mattzcarey mattzcarey marked this pull request as draft June 9, 2026 21:18
Replace the absolute empty-probe threshold (8 tokens) with a two-point
probe that classifies by whether the counter responds to its input:

- counter([]) throws or returns <= 0: message-scoped by construction,
  used per-message as before.
- counter([]) > 0: probe counter(history). If it grew by at least half
  the heuristic estimate, it is a message-scoped counter with fixed
  per-call overhead (chat-template priming, baked-in system prompt) —
  use it per-message minus that overhead instead of misclassifying it
  and discarding its accuracy. If it stayed flat, it is whole-prompt /
  usage-style — calibrate the heuristic by total/heuristic as before.

Matches the design principle in hermes-agent (context_compressor) and
earendil-works/pi (compaction.ts): model-reported totals are only ever
a whole-prompt signal distributed proportionally over a per-message
heuristic; per-message accuracy is kept when the counter actually has it.
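The two-point probe in this commit message can be sketched as below. A later commit in this PR strips the probe machinery, so this is only a hypothetical reconstruction of the intermediate design; `classifyCounter`, the `Scope` type, and the heuristic helper are assumed names, and the half-of-heuristic growth threshold is taken from the text above.

```typescript
type Msg = { text: string };
type Counter = (msgs: Msg[]) => number;
type Scope =
  | { kind: "message"; overhead: number } // use per-message, minus fixed per-call overhead
  | { kind: "whole-prompt"; total: number }; // calibrate the heuristic by total / heuristic

const heuristic = (msgs: Msg[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.text.length / 4), 0);

function classifyCounter(counter: Counter, history: Msg[]): Scope {
  // Probe 1: an empty input. Throwing or returning <= 0 means message-scoped.
  let empty: number;
  try {
    empty = counter([]);
  } catch {
    return { kind: "message", overhead: 0 };
  }
  if (!(empty > 0)) return { kind: "message", overhead: 0 };

  // Probe 2: the full history. Growth of at least half the heuristic estimate
  // means the counter responds to its input and has a fixed per-call overhead.
  const full = counter(history);
  const grew = full - empty >= heuristic(history) / 2;
  return grew ? { kind: "message", overhead: empty } : { kind: "whole-prompt", total: full };
}

const sample: Msg[] = [{ text: "x".repeat(400) }, { text: "y".repeat(400) }];

const plain = classifyCounter((msgs) => heuristic(msgs), sample);
const primed = classifyCounter((msgs) => 7 + heuristic(msgs), sample);
const usageStyle = classifyCounter(() => 50_000, sample);
```

In this sketch `plain` is message-scoped with no overhead, `primed` is message-scoped with an overhead of 7, and `usageStyle` is classified as whole-prompt with a total of 50,000.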
@mattzcarey mattzcarey marked this pull request as ready for review June 9, 2026 21:32
…ountTokens

Zero-config path mirroring earendil-works/pi's compaction design:

- SessionMessage.metadata (AI SDK UIMessage-compatible). When an
  assistant message carries model-reported usage (metadata.usage /
  metadata.totalUsage), the Session uses it automatically: the
  compactAfter fire decision becomes last-usage + heuristic for
  trailing messages (pi's estimateContextTokens), and compact() flows
  the same total to the boundary walk via CompactContext.contextTokens,
  which calibrates the built-in heuristic to the model's scale. No
  tokenCounter configuration needed.
- calculateContextTokens mirrors pi: totalTokens || sum of components
  (input/output/cached, camelCase and OpenAI-style names).
- New CompactOptions.countTokens: strictly message-scoped counter used
  directly by the tail walk, no shape detection; wins over tokenCounter.
- Counter resolution priority: countTokens > tokenCounter (two-point
  probe) > usage-metadata calibration > raw heuristic.

Changeset bumped to minor.
@mattzcarey mattzcarey changed the title from "fix(memory): calibrate whole-prompt token counters in createCompactFunction" to "feat(memory): usage-metadata token accounting, explicit countTokens, and foolproof token counters" Jun 9, 2026
@mattzcarey mattzcarey marked this pull request as draft June 9, 2026 21:50
Strip the probe machinery. The model is: one authoritative context size
in model tokens (compactAfter counter > usage metadata > none), computed
once in Session._knownContextTokens and used for BOTH the fire decision
and the boundary walk (CompactContext.contextTokens), which calibrates
the chars/4 heuristic. The boundary walk never calls a whole-prompt
counter per-message — the cause of #1593 — and never probes anything.

- delete resolveTailCounter, MESSAGE_SCOPED_MIN_GROWTH, countTokens
- CompactContext.tokenCounter removed; contextTokens is the only flow
- CompactOptions.tokenCounter documented strictly message-scoped
- tests rewritten to pin the simple contracts: whole-prompt counters
  only ever see the full history; explicit counters only ever see one
  message; contextTokens calibrates; usage metadata is zero-config
@mattzcarey mattzcarey changed the title from "feat(memory): usage-metadata token accounting, explicit countTokens, and foolproof token counters" to "feat(memory): single authoritative token accounting for compaction (fire + boundary)" Jun 9, 2026
@mattzcarey

Contributor Author

hmmm still sloppy.

…tgun

A whole-prompt counter passed there silently degraded tailTokenBudget
to minTailMessages (the #1593 trap). pi has no per-message user counter
at all; the calibrated heuristic covers the walk. With the counter gone
the calibration also simplifies: scaling every per-message estimate by
contextTokens/heuristic is identical to dividing the budget by it, so
the walk is now always plain findTailCutByTokens against
tailBudgetInHeuristicUnits(). CompactTokenCounter and
findTailCutByTokensWithCounter are deleted.

Docs updated: docs/sessions.md (single token-counting model, usage
metadata example, new utils), docs/think/index.md pointer,
experimental/session-memory README.
Development

Successfully merging this pull request may close these issues.

Auto-compaction trigger fires, but compaction doesn't run due to internal token counting
