fix(codex): stream per-item LLM-call rows with full tool detail #1432
Codex collapsed an entire run into one agent_run_llm_calls row written at turn.completed, with tools stored as bare name strings (input dropped), so the dashboard showed one end-of-run row of empty "bash" badges: not realtime, no command detail.

Codex reports token usage only once (cumulative) at turn.completed, so per-row token attribution is not possible.

Persist a Claude-Code-style content-block row per item.completed as it streams (text, or tool_use with full input; tool names normalized to the Claude vocab so the shared parser renders the command/args). Keep the single cumulative cost/usage row at turn.completed unchanged, so run-total cost stays accurate.

New codex rows are content-block arrays and render via the existing parseClaudeCodeBlocks path; parseCodexPayload stays as a fallback for old rows.

Also poll the run-detail LLM-calls list (and run status) while the run is active so the streamed rows appear live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
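The per-item row shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `CodexItem` type, the `normalizeToolName` and `itemToRow` helpers, and the pass-through behaviour for unknown tools are all assumptions made for the example. Only the two mappings the PR description names (command_execution to Bash, read_file to Read) are included.

```typescript
// Hypothetical content-block shape, modelled on the [{type:'text'}] / [{type:'tool_use'}]
// rows the PR describes. The real types are not shown in the PR.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string; input: Record<string, unknown> };

// Assumed, simplified view of a codex item.completed payload (illustration only).
type CodexItem =
  | { kind: 'message'; text: string }
  | { kind: 'tool'; tool: string; input: Record<string, unknown> };

// Only the mappings named in the PR description; the full table is not known here.
const CLAUDE_TOOL_NAMES = new Map<string, string>([
  ['command_execution', 'Bash'],
  ['read_file', 'Read'],
]);

// Hypothetical helper: unknown tool names pass through unchanged (an assumption).
function normalizeToolName(codexName: string): string {
  return CLAUDE_TOOL_NAMES.get(codexName) ?? codexName;
}

// One completed item becomes exactly one single-block row: text or tool_use, never both.
function itemToRow(item: CodexItem): ContentBlock[] {
  if (item.kind === 'message') {
    return [{ type: 'text', text: item.text }];
  }
  return [{ type: 'tool_use', name: normalizeToolName(item.tool), input: item.input }];
}

const toolRow = itemToRow({
  kind: 'tool',
  tool: 'command_execution',
  input: { command: 'git status' },
});
const textRow = itemToRow({ kind: 'message', text: 'Looks good.' });

console.log(JSON.stringify(toolRow));
console.log(JSON.stringify(textRow));
```

Because the tool row keeps the full `input` object, a parser that already understands Claude-Code-style blocks has what it needs to show the command next to the tool name.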
nhopeatall
left a comment
Summary
APPROVE — the per-item streaming is correctly gated on item.completed, reuses the existing claude-code content-block parser path, and leaves the cost/usage accounting untouched. codex.test.ts (82), llmResponseParser.test.ts (30), and codex-cost.test.ts (9) all pass locally and biome is clean.
What I verified:
- Cost unchanged. The single turn.completed row still carries the per-turn delta; per-item rows carry no tokens, so the dashboard summary (reduce over rows) and the run-total cost are identical to before. codex-cost.test.ts (turn.completed-only streams) still produces exactly N delta-priced rows; the per-item rows are purely additive when item.completed events are present.
- No double-persist. An item.completed is either a message (→ text row) or a function_call/command_execution (→ tool row), never both; item.delta events never persist. callNumber stays monotonic across item + cost rows, so listLlmCallsMeta's orderBy(callNumber) keeps display order chronological.
- Rendering. Array-shaped rows parse via parseClaudeCodeBlocks → summarizeInput, identical in the backend parser and its web/src/lib/llm-response-parser.ts mirror, so the normalizeCodexTool → Bash/Read/… mapping renders the command/args with no parser change needed.
- Polling. refetchInterval: (query) => query.state.data?.status === 'running' ? … : false matches the existing convention (debug-analysis, prs, global/runs); 'running' is the only non-terminal run status; the getById → LlmCallList hand-off cleanly stops polling on completion.
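The polling gate from the last bullet can be illustrated in isolation. This is a sketch under stated assumptions: `RunQueryLike` is a hypothetical stand-in for the query object the data-fetching library passes in, the terminal status names are invented for the example, and `POLL_MS` is a placeholder because the PR's actual interval is elided in the review.

```typescript
// 'running' is the only non-terminal status per the review; the terminal names
// below are assumptions for illustration.
type RunStatus = 'running' | 'completed' | 'failed';

// Hypothetical minimal shape of the query object handed to refetchInterval.
interface RunQueryLike {
  state: { data?: { status: RunStatus } };
}

// Placeholder interval; the value used in the PR is not shown.
const POLL_MS = 2000;

// Poll only while the run is active. Returning false stops polling, so the
// query self-terminates once the run reaches a terminal status.
function runRefetchInterval(query: RunQueryLike): number | false {
  return query.state.data?.status === 'running' ? POLL_MS : false;
}

console.log(runRefetchInterval({ state: { data: { status: 'running' } } }));
console.log(runRefetchInterval({ state: { data: { status: 'completed' } } }));
console.log(runRefetchInterval({ state: {} }));
```

Returning `false` when no data has loaded yet means the first fetch is a normal one-off request, and polling only begins once a response confirms the run is still active.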
Code Issues
Should Fix (stale comments — no behavior change needed)
- src/backends/codex/index.ts:289-296: handleStructuralEvent's doc still asserts the persistence boundary is "ONE storeLlmCall row is written exactly when turn.completed fires … intermediate events … do NOT persist a row." That is precisely the invariant this PR removes (item.completed now persists rows). A future maintainer debugging row counts would be misled by this comment.
- src/backends/codex/index.ts:45-54: CodexTurnAccumulator's doc still says it "Collects text, tool summaries, and usage … exactly one storeLlmCall row is persisted per completed turn." toolNames was dropped and there are now many rows per turn. Worth refreshing together with the persistTurnLlmCall header (174-188), which still reads as if the turn row is the only row for the turn.
Minor
- src/backends/codex/index.ts:218: turn: context.llmCallCount now emits the global call counter rather than the turn ordinal (the test pins turn: 3 for the first/only turn because two item rows preceded it). The field isn't consumed by the parser, so this is cosmetic, but a dedicated turn counter (or renaming the field to call) would keep stored payloads from being misleading.
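One way the dedicated turn counter suggested above might look, as a sketch only: `turnCount`, `nextItemRow`, and `nextTurnRow` are hypothetical names, and the real context object in the PR has more fields than shown here.

```typescript
// Hypothetical slice of the run context. llmCallCount is the field named in the
// review; turnCount is the suggested addition and does not exist in the PR.
interface RunContext {
  llmCallCount: number;
  turnCount: number;
}

// Per-item rows advance only the global row counter.
function nextItemRow(ctx: RunContext): { callNumber: number } {
  ctx.llmCallCount += 1;
  return { callNumber: ctx.llmCallCount };
}

// The turn.completed row advances both counters, so `turn` stays a true ordinal
// while callNumber remains monotonic across item and cost rows.
function nextTurnRow(ctx: RunContext): { callNumber: number; turn: number } {
  ctx.llmCallCount += 1;
  ctx.turnCount += 1;
  return { callNumber: ctx.llmCallCount, turn: ctx.turnCount };
}

const ctx: RunContext = { llmCallCount: 0, turnCount: 0 };
nextItemRow(ctx); // e.g. a tool call
nextItemRow(ctx); // e.g. the agent message
const turnRow = nextTurnRow(ctx);

console.log(turnRow); // callNumber is 3, turn is 1
```

With two item rows ahead of it, the cost row gets row number 3 but turn ordinal 1, which is the distinction the review points out is currently lost.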
🕵️ claude-code · claude-opus-4-8 · run details
nhopeatall
left a comment
Summary
APPROVE — the per-item streaming approach is sound. Item rows carry no tokens and the single turn.completed row keeps the cumulative per-turn delta, so summed token/cost totals are unchanged. New rows are content-block arrays that render through the existing parseClaudeCodeBlocks path (verified the web/src/lib/llm-response-parser.ts mirror is in sync and summarizeInput handles the normalized Bash/Read/Write/Edit names). Run/LLM-calls polling is correctly gated on status === 'running' and self-terminates. Ran codex.test.ts (82) + llmResponseParser.test.ts (30) locally — all green.
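The claim that summed totals are unchanged can be checked with a small sketch. The row fields (`inputTokens`, `outputTokens`, `costUsd`) and the `sumRows` helper are assumptions for illustration and are not the dashboard's actual schema or reducer.

```typescript
// Hypothetical row shape: token and cost fields are optional because per-item
// rows carry none of them.
interface LlmCallRow {
  callNumber: number;
  inputTokens?: number;
  outputTokens?: number;
  costUsd?: number;
}

interface RunTotals {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Sum over all rows, treating missing usage as zero.
function sumRows(rows: LlmCallRow[]): RunTotals {
  return rows.reduce<RunTotals>(
    (acc, row) => ({
      inputTokens: acc.inputTokens + (row.inputTokens ?? 0),
      outputTokens: acc.outputTokens + (row.outputTokens ?? 0),
      costUsd: acc.costUsd + (row.costUsd ?? 0),
    }),
    { inputTokens: 0, outputTokens: 0, costUsd: 0 },
  );
}

// Old layout: one row per turn, carrying all usage.
const oldRows: LlmCallRow[] = [
  { callNumber: 1, inputTokens: 1200, outputTokens: 300, costUsd: 0.25 },
];

// New layout: two token-less item rows, then the same usage row.
const newRows: LlmCallRow[] = [
  { callNumber: 1 },
  { callNumber: 2 },
  { callNumber: 3, inputTokens: 1200, outputTokens: 300, costUsd: 0.25 },
];

console.log(sumRows(oldRows), sumRows(newRows));
```

Since the item rows contribute zero to every field, any reducer of this form gives the same totals for both layouts.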
Minor (non-blocking)
- src/backends/codex/index.ts:219: the turn.completed cost row still carries the accumulated text summary, which parseCodexPayload renders as textPreview. Since the same text now streams as its own per-item text row, the agent message appears twice in the LLM-calls list (its own row + the cost row). Consider dropping text from the cost payload so it reads as a pure usage row. See inline.
- src/backends/codex/index.ts:218: turn: context.llmCallCount no longer equals the turn ordinal now that llmCallCount also increments per item; it's effectively the global row number. Nothing consumes .turn in parseCodexPayload, so this is cosmetic, but the field name is now misleading.
- src/utils/llmResponseParser.ts:6 (+ web mirror): the file header still documents Codex: JSON object {turn, text?, tools?: string[], usage?}. Codex now also emits content-block arrays and the cost row no longer has tools; a one-line update keeps the "keep both in sync" comment accurate. (Outside this PR's diff; flagging for a follow-up.)
🕵️ claude-code · claude-opus-4-8 · run details
    // the turn.completed row carries the turn's cost/usage + a short text summary.
    const turnPayload = JSON.stringify({
      turn: context.llmCallCount,
      text: acc.textSummary.join(' ').slice(0, 500) || undefined,
The cost row keeps text: acc.textSummary.join(' ')…, and parseCodexPayload turns that into a textPreview, so the agent's message now renders twice in the LLM-calls list — once as its own per-item text row, and again here on the usage row. Dropping text from this payload (leaving turn/usage/delta/reasoning) would make the cost row a pure usage row and better match the claude-code-style per-item layout the PR is aiming for. Non-blocking.
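A rough sketch of the suggested change, assuming a simplified payload: the `TurnPayload` and `TurnUsage` shapes and the `toUsageOnlyPayload` helper are hypothetical, and the real payload also carries delta and reasoning fields that are not modelled here.

```typescript
// Hypothetical usage shape; the real field names are not shown in the review.
interface TurnUsage {
  inputTokens: number;
  outputTokens: number;
}

// Simplified stand-in for the turn.completed payload.
interface TurnPayload {
  turn: number;
  text?: string;
  usage?: TurnUsage;
}

// Copy the payload and drop `text`, so the cost row serializes as a pure usage
// row and the message is shown only on its own per-item text row.
function toUsageOnlyPayload(payload: TurnPayload): string {
  const usageOnly: TurnPayload = { ...payload };
  delete usageOnly.text;
  return JSON.stringify(usageOnly);
}

const serialized = toUsageOnlyPayload({
  turn: 1,
  text: 'Review complete.',
  usage: { inputTokens: 1200, outputTokens: 300 },
});

console.log(serialized);
```

An object-shaped parser that builds a preview only when `text` is present would then render no preview for the cost row, which removes the duplicate.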
Problem (observed on prod codex runs)
Confirmed on the live run a2521469:
- One agent_run_llm_calls row, written ~1s before it ended: codex persists one row per turn.completed, and a codex review is a single turn.
- Tools stored as bare name strings (["bash" ×50]), with command/args dropped, so the dashboard showed a wall of empty bash badges.
Key constraint (from the real event stream)
The run's engine_log has 137 item.completed events (each tool call with its command, each text chunk) streaming throughout, but token usage is reported exactly once, cumulative, on the single turn.completed. So per-row token attribution is impossible for codex; only the run total is knowable.
Fix
- Persist a Claude-Code-style content-block row per item.completed as it streams: [{type:'text',…}] or [{type:'tool_use', name, input}] with the full input, tool names normalized to the Claude vocab (command_execution → Bash, read_file → Read, …) so the shared parseClaudeCodeBlocks/summarizeInput renders the command/args. These rows carry no tokens.
- The single cumulative cost/usage row at turn.completed is unchanged, so run-total cost stays exactly accurate (no token-delta rework).
- New codex rows are arrays → render via the existing claude-code parser path; parseCodexPayload stays as a fallback for old object-shaped rows.
- The run-detail LLM-calls list + run status now poll while the run is active (refetchInterval), so streamed rows appear live.
Result
Codex runs show realtime, per-item rows that read like claude-code's (Bash · git status, Read · src/foo.ts) instead of one end-of-run row of empty badges. Run cost is unchanged.
Tests
- codex.test.ts: per-item rows (text + tool with input, normalized); deltas don't persist (only item.completed); cost row + per-turn delta unchanged; updated the 3 tests that encoded the old one-row-per-turn shape.
- llmResponseParser.test.ts: codex array rows render with the command; old object rows still parse (backcompat).
- unit-backends (1184) green.
Note
Per-item rows carry no per-row tokens. This is a codex limitation (it doesn't break usage down per item), documented in code. Accurate cost lives on the one turn.completed row + the run total.
🤖 Generated with Claude Code