Stop the agent log/CPU runaway: terminate the tool-result trim loop + cap append-only logs by rockfordlhotka · Pull Request #462 · MarimerLLC/rockbot

rockfordlhotka · 2026-06-05T07:17:49Z

Why

Two related failures let rockbot-agent consume unbounded resources with nothing stopping it.

1. Non-terminating tool-result trim loop (the 2026-06-05 incident) — new in this PR

Symptom (from the on-call investigation): the agent pod pegged at ~2 cores with 407,597 identical "Trimmed tool result … (head 120, tail 80)" log lines in a 5-minute window, re-processing the same ~5 call IDs forever. Every MCP session starved (todo was just the one the user noticed); the downstream MCP services were provably healthy. 0 restarts, 42h uptime → in-memory wedge. kubectl delete pod cleared it, but that's a band-aid.

Root cause: TrimLargeToolResultsAsync looped on while(true) until the estimated context fit the char budget, always re-trimming the single largest tool result. When the non-tool content alone (system prompt + injected skill bodies + the stash registry) already exceeded the budget, and every tool result was already at the ~200-char elision floor — where re-adding the elision marker keeps a result the same size or larger — the total never dropped below budget and the loop spun forever. (head 120 / tail 80 is exactly the 200-char floor at the default 0.6 head ratio — the fingerprint of a loop stuck on floor-sized results.)

Fix: three independent termination guarantees —

stop when the budget is met (unchanged);
stop when the largest tool result can no longer shrink — since we always pick the largest, nothing smaller can reach the budget either;
a hard pass cap as defence-in-depth.

Also clamps the legacy slice length so the 200-char floor can't index past a short result and throw.

De-duplication (requested in review): the trim loop existed as two byte-identical copies — the native RockBotFunctionInvokingChatClient path and the text-based AgentLoopRunner path — that had already drifted. Both are now thin forwarders to a single canonical ToolResultTrimmer.TrimAsync, so this class of bug can't be half-fixed again.

2. Append-only JSONL log growth — earlier commits on this branch

A shared dream-driven retention pass caps the append-only JSONL logs, wired to k8s config, plus line-trimming of persistent session files.

Tests

Two new [Timeout] regression tests reproduce the wedge (huge non-tool content + a tiny budget) and hang on the old loop, pass on the new one.
All AgentLoopRunnerTrimStashTests pass (11/11), confirming the head+tail+stash behaviour is unchanged for the normal trim path.

Follow-ups (not in this PR)

The investigation surfaced adjacent gaps worth separate work: spawn_wisps results embed fresh GUIDs, so RepetitiveToolCallDetector (keyed on tool+args+result) never fires for wisp dispatch; there's no cross-loop/cross-session circuit breaker on wisp volume; and FileWispExecutionLog.ReadAllAsync re-parses the entire JSONL on every successful wisp (O(n²) under load).

🤖 Generated with Claude Code

The skill-usage, tool-call, feedback, skill-resource-usage, and wisp-executions JSONL logs were append-only with no rotation, so they grew without bound (per-session directories accumulated files forever; the two global files grew line-by-line indefinitely). Add a shared JsonlLogRetention helper plus an opt-in IPrunableLog contract. Each file-backed store implements IPrunableLog and delegates to the helper with its own on-disk layout: per-session directories drop aged session files then cap file count; single-file logs trim to a trailing line budget (serialized against their writer). DreamService resolves all registered IPrunableLog instances and prunes them once per cycle, before the memory-count early-return so retention runs every cycle. New DreamOptions knobs (LogRetentionEnabled, MaxFileAge, MaxFilesPerDirectory, MaxLinesPerFile) control the policy; no new background service is introduced. https://claude.ai/code/session_012sTRRT47bKJwQBSksmFuLm

The retention knobs added by this PR were never bound to configuration — WithDreaming() took no callback, so DreamOptions (including the new LogRetention* values, and the pre-existing Dream:CronSchedule) were always defaults. Bind the Dream section in Program.cs and surface the four knobs as agent.logRetention.* Helm values + Dream__* ConfigMap keys. The ConfigMap uses `dig` rather than `default` so an explicit false/0 (which disable a dimension) are honoured instead of being swallowed as "empty". Values are sized from observed live traffic: maxFileAge 30d (floored at the widest dream query window so pruning never starves a pass), maxFilesPerDirectory 1000 (backstop; age pruning is the real control), maxLinesPerFile 10000 (~11 MB for the wisp log at ~1.1 KB/line, vs the never-trimming 50k code default). Also close the persistent-session gap: age/count pruning never reaps a continuously-written {sessionId}.jsonl (blazor-session, cli-session) because it is never aged out and never the oldest file — on the live cluster the UI session's tool-call log alone was 2 MB and growing. Add JsonlLogRetention.TrimSessionFilesAsync, which line-trims each surviving session file to MaxLinesPerFile while holding the writer's own per-session semaphore so a trim can never race an append; wire it into the three per-session stores after their age/count prune. A byte-size gate skips the tiny ephemeral files unread. Tests: per-session line-trim (over-budget trimmed, under-budget skipped, correct lock key, store integration) and config-binding regression (TimeSpan/bool/0 shapes from the ConfigMap). Docs updated (dream-service, agent-host, values). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tag for the log-retention image deployed to the live cluster for testing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The head+tail+stash context-window trim looped on `while(true)` until the estimated context fit the char budget, always re-trimming the single largest tool result. When the *non-tool* content alone (system prompt + injected skill bodies + stash registry) exceeded the budget and every tool result was already at the ~200-char elision floor — where re-adding the elision marker keeps a result the same size or larger — the total never dropped below budget and the loop spun forever. Production incident 2026-06-05: rockbot-agent pegged at ~2 cores, emitting 400k+ identical "Trimmed tool result ... (head 120, tail 80)" log lines in a 5-minute window (head 120 / tail 80 is exactly the 200-char floor at the default 0.6 head ratio), starving every MCP session until the pod was killed. Adds three independent termination guarantees: stop when the budget is met; stop when the largest tool result can no longer shrink (nothing smaller can reach the budget either); and a hard pass cap as defence-in-depth. Also clamps the legacy slice length so the 200-char floor can't index past a short result and throw. The trim loop existed as two byte-identical copies (native FICC path and the text-based AgentLoopRunner path) that had already drifted. Both are now thin forwarders to a single canonical `ToolResultTrimmer.TrimAsync`, so the fix lives in one place. Two `[Timeout]` regression tests would hang on the old loop and pass on the new one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Follow-up to the trim-loop fix (#462). The per-agent-loop RepetitiveToolCallDetector is the only existing guard against repeated identical tool calls, but it is rebuilt fresh per AgentLoopRunner.RunAsync and dies with that loop — so it never spans completion-eval reprompts, scheduled re-fires, or message/A2A re-triggers. And because spawn_wisps results embed fresh batch/wisp GUIDs, two byte-identical dispatches never look identical to it anyway. Nothing counts "I have already run this exact wisp too many times — stop." WispDispatchCircuitBreaker is a process-wide singleton checked at the single choke point every wisp flows through (SpawnWispsExecutor). It keeps a fixed-window dispatch count keyed by the exact definition hash; once a definition exceeds WispOptions.DispatchCircuitBreakerMaxPerWindow (default 30) within DispatchCircuitBreakerWindow (default 5m), further dispatches of that definition are refused until the window rolls over. - Keyed on the exact definition hash, not the value-stripped shape hash, so wisps that vary by date/id never trip it — only a truly identical loop does. - A fixed window keeps memory at O(distinct definitions) regardless of dispatch rate, which matters precisely in the runaway case. - A tripped dispatch does NO work (no LLM, no tool calls) and is NOT written to the execution log, so a runaway collapses into a cheap, self-limiting error instead of an unbounded spin (and avoids re-bloating the JSONL log). - The refusal message is intentionally stable (no counts/ids), so identical refusals also look identical to the agent-loop's repetitive-result detector, which then nudges the model off the loop. - Emits a rockbot.wisp.circuit_breaker.trips metric for alerting. Conservative by default and fully configurable; disabled-path and zero-limit both short-circuit to always-allow. 7 new tests cover threshold, sustained trip, window reset, per-definition independence, and the disabled paths. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude and others added 4 commits June 3, 2026 06:00

Bump version to 0.12.27

f17dbe2

Tag for the log-retention image deployed to the live cluster for testing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rockfordlhotka mentioned this pull request Jun 5, 2026

Add a wisp dispatch circuit breaker for runaway re-dispatch loops #463

Merged

rockfordlhotka merged commit bab77c1 into main Jun 5, 2026
2 checks passed

rockfordlhotka deleted the claude/rockbot-agent-log-growth-Trynw branch June 5, 2026 07:35

rockfordlhotka mentioned this pull request Jun 5, 2026

RepetitiveToolCallDetector is blind to tools whose result embeds GUIDs/timestamps (e.g. spawn_wisps) #464

Open

rockfordlhotka mentioned this pull request Jun 5, 2026

FileWispExecutionLog.ReadAllAsync re-parses the entire JSONL on every successful wisp (O(n²) under load) #465

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop the agent log/CPU runaway: terminate the tool-result trim loop + cap append-only logs#462

Stop the agent log/CPU runaway: terminate the tool-result trim loop + cap append-only logs#462
rockfordlhotka merged 4 commits into
mainfrom
claude/rockbot-agent-log-growth-Trynw

rockfordlhotka commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rockfordlhotka commented Jun 5, 2026

Why

1. Non-terminating tool-result trim loop (the 2026-06-05 incident) — new in this PR

2. Append-only JSONL log growth — earlier commits on this branch

Tests

Follow-ups (not in this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants