Skip to content

Stop the agent log/CPU runaway: terminate the tool-result trim loop + cap append-only logs#462

Merged
rockfordlhotka merged 4 commits into
mainfrom
claude/rockbot-agent-log-growth-Trynw
Jun 5, 2026
Merged

Stop the agent log/CPU runaway: terminate the tool-result trim loop + cap append-only logs#462
rockfordlhotka merged 4 commits into
mainfrom
claude/rockbot-agent-log-growth-Trynw

Conversation

@rockfordlhotka

Copy link
Copy Markdown
Member

Why

Two related failures let rockbot-agent consume unbounded resources with nothing stopping it.

1. Non-terminating tool-result trim loop (the 2026-06-05 incident) — new in this PR

Symptom (from the on-call investigation): the agent pod pegged at ~2 cores with 407,597 identical "Trimmed tool result … (head 120, tail 80)" log lines in a 5-minute window, re-processing the same ~5 call IDs forever. Every MCP session starved (todo was just the one the user noticed); the downstream MCP services were provably healthy. 0 restarts, 42h uptime → in-memory wedge. kubectl delete pod cleared it, but that's a band-aid.

Root cause: TrimLargeToolResultsAsync looped on while(true) until the estimated context fit the char budget, always re-trimming the single largest tool result. When the non-tool content alone (system prompt + injected skill bodies + the stash registry) already exceeded the budget, and every tool result was already at the ~200-char elision floor — where re-adding the elision marker keeps a result the same size or larger — the total never dropped below budget and the loop spun forever. (head 120 / tail 80 is exactly the 200-char floor at the default 0.6 head ratio — the fingerprint of a loop stuck on floor-sized results.)

Fix: three independent termination guarantees —

  1. stop when the budget is met (unchanged);
  2. stop when the largest tool result can no longer shrink — since we always pick the largest, nothing smaller can reach the budget either;
  3. a hard pass cap as defence-in-depth.

Also clamps the legacy slice length so the 200-char floor can't index past a short result and throw.

De-duplication (requested in review): the trim loop existed as two byte-identical copies — the native RockBotFunctionInvokingChatClient path and the text-based AgentLoopRunner path — that had already drifted. Both are now thin forwarders to a single canonical ToolResultTrimmer.TrimAsync, so this class of bug can't be half-fixed again.

2. Append-only JSONL log growth — earlier commits on this branch

A shared dream-driven retention pass caps the append-only JSONL logs, wired to k8s config, plus line-trimming of persistent session files.

Tests

  • Two new [Timeout] regression tests reproduce the wedge (huge non-tool content + a tiny budget) and hang on the old loop, pass on the new one.
  • All AgentLoopRunnerTrimStashTests pass (11/11), confirming the head+tail+stash behaviour is unchanged for the normal trim path.

Follow-ups (not in this PR)

The investigation surfaced adjacent gaps worth separate work: spawn_wisps results embed fresh GUIDs, so RepetitiveToolCallDetector (keyed on tool+args+result) never fires for wisp dispatch; there's no cross-loop/cross-session circuit breaker on wisp volume; and FileWispExecutionLog.ReadAllAsync re-parses the entire JSONL on every successful wisp (O(n²) under load).

🤖 Generated with Claude Code

claude and others added 4 commits June 3, 2026 06:00
The skill-usage, tool-call, feedback, skill-resource-usage, and
wisp-executions JSONL logs were append-only with no rotation, so they
grew without bound (per-session directories accumulated files forever;
the two global files grew line-by-line indefinitely).

Add a shared JsonlLogRetention helper plus an opt-in IPrunableLog
contract. Each file-backed store implements IPrunableLog and delegates
to the helper with its own on-disk layout: per-session directories drop
aged session files then cap file count; single-file logs trim to a
trailing line budget (serialized against their writer). DreamService
resolves all registered IPrunableLog instances and prunes them once per
cycle, before the memory-count early-return so retention runs every
cycle. New DreamOptions knobs (LogRetentionEnabled, MaxFileAge,
MaxFilesPerDirectory, MaxLinesPerFile) control the policy; no new
background service is introduced.

https://claude.ai/code/session_012sTRRT47bKJwQBSksmFuLm
The retention knobs added by this PR were never bound to configuration —
WithDreaming() took no callback, so DreamOptions (including the new
LogRetention* values, and the pre-existing Dream:CronSchedule) were always
defaults. Bind the Dream section in Program.cs and surface the four knobs as
agent.logRetention.* Helm values + Dream__* ConfigMap keys. The ConfigMap uses
`dig` rather than `default` so an explicit false/0 (which disable a dimension)
are honoured instead of being swallowed as "empty".

Values are sized from observed live traffic: maxFileAge 30d (floored at the
widest dream query window so pruning never starves a pass), maxFilesPerDirectory
1000 (backstop; age pruning is the real control), maxLinesPerFile 10000 (~11 MB
for the wisp log at ~1.1 KB/line, vs the never-trimming 50k code default).

Also close the persistent-session gap: age/count pruning never reaps a
continuously-written {sessionId}.jsonl (blazor-session, cli-session) because it
is never aged out and never the oldest file — on the live cluster the UI
session's tool-call log alone was 2 MB and growing. Add
JsonlLogRetention.TrimSessionFilesAsync, which line-trims each surviving session
file to MaxLinesPerFile while holding the writer's own per-session semaphore so a
trim can never race an append; wire it into the three per-session stores after
their age/count prune. A byte-size gate skips the tiny ephemeral files unread.

Tests: per-session line-trim (over-budget trimmed, under-budget skipped, correct
lock key, store integration) and config-binding regression (TimeSpan/bool/0
shapes from the ConfigMap). Docs updated (dream-service, agent-host, values).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tag for the log-retention image deployed to the live cluster for testing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The head+tail+stash context-window trim looped on `while(true)` until the
estimated context fit the char budget, always re-trimming the single largest
tool result. When the *non-tool* content alone (system prompt + injected skill
bodies + stash registry) exceeded the budget and every tool result was already
at the ~200-char elision floor — where re-adding the elision marker keeps a
result the same size or larger — the total never dropped below budget and the
loop spun forever.

Production incident 2026-06-05: rockbot-agent pegged at ~2 cores, emitting
400k+ identical "Trimmed tool result ... (head 120, tail 80)" log lines in a
5-minute window (head 120 / tail 80 is exactly the 200-char floor at the
default 0.6 head ratio), starving every MCP session until the pod was killed.

Adds three independent termination guarantees: stop when the budget is met;
stop when the largest tool result can no longer shrink (nothing smaller can
reach the budget either); and a hard pass cap as defence-in-depth. Also clamps
the legacy slice length so the 200-char floor can't index past a short result
and throw.

The trim loop existed as two byte-identical copies (native FICC path and the
text-based AgentLoopRunner path) that had already drifted. Both are now thin
forwarders to a single canonical `ToolResultTrimmer.TrimAsync`, so the fix
lives in one place. Two `[Timeout]` regression tests would hang on the old
loop and pass on the new one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rockfordlhotka rockfordlhotka merged commit bab77c1 into main Jun 5, 2026
2 checks passed
@rockfordlhotka rockfordlhotka deleted the claude/rockbot-agent-log-growth-Trynw branch June 5, 2026 07:35
rockfordlhotka added a commit that referenced this pull request Jun 5, 2026
Follow-up to the trim-loop fix (#462). The per-agent-loop
RepetitiveToolCallDetector is the only existing guard against repeated
identical tool calls, but it is rebuilt fresh per AgentLoopRunner.RunAsync and
dies with that loop — so it never spans completion-eval reprompts, scheduled
re-fires, or message/A2A re-triggers. And because spawn_wisps results embed
fresh batch/wisp GUIDs, two byte-identical dispatches never look identical to
it anyway. Nothing counts "I have already run this exact wisp too many times —
stop."

WispDispatchCircuitBreaker is a process-wide singleton checked at the single
choke point every wisp flows through (SpawnWispsExecutor). It keeps a
fixed-window dispatch count keyed by the exact definition hash; once a
definition exceeds WispOptions.DispatchCircuitBreakerMaxPerWindow (default 30)
within DispatchCircuitBreakerWindow (default 5m), further dispatches of that
definition are refused until the window rolls over.

- Keyed on the exact definition hash, not the value-stripped shape hash, so
  wisps that vary by date/id never trip it — only a truly identical loop does.
- A fixed window keeps memory at O(distinct definitions) regardless of dispatch
  rate, which matters precisely in the runaway case.
- A tripped dispatch does NO work (no LLM, no tool calls) and is NOT written to
  the execution log, so a runaway collapses into a cheap, self-limiting error
  instead of an unbounded spin (and avoids re-bloating the JSONL log).
- The refusal message is intentionally stable (no counts/ids), so identical
  refusals also look identical to the agent-loop's repetitive-result detector,
  which then nudges the model off the loop.
- Emits a rockbot.wisp.circuit_breaker.trips metric for alerting.

Conservative by default and fully configurable; disabled-path and zero-limit
both short-circuit to always-allow. 7 new tests cover threshold, sustained
trip, window reset, per-definition independence, and the disabled paths.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants