
fix(codex): stream per-item LLM-call rows with full tool detail #1432

Merged: zbigniewsobiecki merged 1 commit into dev from fix/codex-realtime-llm-logs on Jun 23, 2026


@zbigniewsobiecki

Member

Problem (observed on prod codex runs)

Confirmed on the live run a2521469:

  • Not realtime: a 12-min run produced one agent_run_llm_calls row, written ~1s before it ended — codex persists one row per turn.completed, and a codex review is a single turn.
  • Tool calls look wrong: that row stored tools as bare name strings (["bash" ×50]) — command/args dropped — so the dashboard showed a wall of empty bash badges.

Key constraint (from the real event stream)

The run's engine_log has 137 item.completed events (each tool call with its command, each text chunk) streaming throughout — but token usage is reported exactly once, cumulative, on the single turn.completed. So per-row token attribution is impossible for codex; only the run total is knowable.

Fix

Persist a Claude-Code-style content-block row per item.completed as it streams: [{type:'text',…}] or [{type:'tool_use', name, input}] with the full input, with tool names normalized to the Claude vocab (command_execution → Bash, read_file → Read, …) so the shared parseClaudeCodeBlocks / summarizeInput renders the command/args. These rows carry no tokens.
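
A minimal sketch of the per-item mapping described above, not the code from this PR. The `CodexItem` shape, the `'message'` item type, and the helper names `normalizeToolName` and `itemToRow` are assumptions for illustration; only the two name mappings and the content-block shapes come from the description.

```typescript
// Claude-Code-style content blocks, as quoted in the description.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string; input: Record<string, unknown> };

// Hypothetical item shape: the real codex event fields are not shown in this PR.
interface CodexItem {
  type: string;
  text?: string;
  input?: Record<string, unknown>;
}

// Only the two mappings named in the description; unknown names pass through.
const CLAUDE_TOOL_NAMES = new Map<string, string>([
  ['command_execution', 'Bash'],
  ['read_file', 'Read'],
]);

function normalizeToolName(codexName: string): string {
  return CLAUDE_TOOL_NAMES.get(codexName) ?? codexName;
}

// One row per completed item: a text block or a tool_use block, never both.
// Returns null when there is nothing worth persisting (empty message).
function itemToRow(item: CodexItem): ContentBlock[] | null {
  if (item.type === 'message') {
    return item.text ? [{ type: 'text', text: item.text }] : null;
  }
  return [
    { type: 'tool_use', name: normalizeToolName(item.type), input: item.input ?? {} },
  ];
}
```

Keeping the full `input` on the block is what lets a shared summarizer show the command or path instead of a bare tool badge.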

The single cumulative cost/usage row at turn.completed is unchanged — run-total cost stays exactly accurate (no token-delta rework). New codex rows are arrays → render via the existing claude-code parser path; parseCodexPayload stays as a fallback for old object-shaped rows.
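
The array-versus-object distinction above can be sketched as a shape check on the stored payload. This is an illustrative assumption about how such a dispatch could look; the actual logic inside the shared parser is not shown in this PR, and `classifyCodexRow` is a hypothetical name.

```typescript
type RowKind = 'content-blocks' | 'legacy-codex-object' | 'unparseable';

// Decide which rendering path a stored row would take, based only on its JSON shape.
function classifyCodexRow(raw: string): RowKind {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return 'unparseable';
  }
  // New per-item rows are arrays of content blocks (claude-code parser path).
  if (Array.isArray(value)) return 'content-blocks';
  // Old rows and the turn.completed cost row are plain objects (fallback path).
  if (value !== null && typeof value === 'object') return 'legacy-codex-object';
  return 'unparseable';
}
```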

The run-detail LLM-calls list + run status now poll while the run is active (refetchInterval), so streamed rows appear live.
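
A sketch of the status-gated polling, written as a standalone function so it can be read without the query library. The `QueryLike` interface is a simplified stand-in for the query object passed to `refetchInterval`, and `POLL_MS` is an assumed value; the interval the PR actually uses is not stated.

```typescript
// Simplified stand-in for the query object handed to a refetchInterval callback.
interface QueryLike {
  state: { data?: { status?: string } };
}

// Assumed polling period in milliseconds; the real value is not given in this PR.
const POLL_MS = 2000;

// Poll only while the run is active; returning false turns polling off.
function runRefetchInterval(query: QueryLike): number | false {
  return query.state.data?.status === 'running' ? POLL_MS : false;
}
```

Because the callback is re-evaluated against the latest data, polling stops on its own once the run reports any status other than `'running'`.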

Result

Codex runs show realtime, per-item rows that read like claude-code's — Bash · git status, Read · src/foo.ts — instead of one end-of-run row of empty badges. Run cost is unchanged.

Tests

  • codex.test.ts: per-item rows (text + tool with input, normalized); deltas don't persist (only item.completed); cost row + per-turn delta unchanged; updated the 3 tests that encoded the old one-row-per-turn shape.
  • llmResponseParser.test.ts: codex array rows render with the command; old object rows still parse (backcompat).
  • typecheck (src + web) + biome clean; full unit-backends (1184) green.

Note

Per-item rows carry no per-row tokens — a codex limitation (it doesn't break usage down per item), documented in code. Accurate cost lives on the one turn.completed row + the run total.
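
To illustrate why token-less rows leave the totals untouched, here is a small sketch of a summary that reduces over rows. The `LlmCallRowMeta` fields and `summarizeRows` are hypothetical names, not the dashboard's real types.

```typescript
// Hypothetical row metadata: per-item rows leave the token/cost fields undefined.
interface LlmCallRowMeta {
  callNumber: number;
  inputTokens?: number;
  outputTokens?: number;
  costUsd?: number;
}

interface RunTotals {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Sum usage across rows, treating missing values as zero.
function summarizeRows(rows: LlmCallRowMeta[]): RunTotals {
  return rows.reduce<RunTotals>(
    (totals, row) => ({
      inputTokens: totals.inputTokens + (row.inputTokens ?? 0),
      outputTokens: totals.outputTokens + (row.outputTokens ?? 0),
      costUsd: totals.costUsd + (row.costUsd ?? 0),
    }),
    { inputTokens: 0, outputTokens: 0, costUsd: 0 },
  );
}
```

With two item rows followed by one cost row, the totals equal those of the cost row alone, which is the behavior the note relies on.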

🤖 Generated with Claude Code

Codex collapsed an entire run into one agent_run_llm_calls row written at
turn.completed, with tools stored as bare name strings (input dropped) — so the
dashboard showed one end-of-run row of empty "bash" badges: not realtime, no
command detail. Codex reports token usage only once (cumulative) at
turn.completed, so per-row token attribution is not possible.

Persist a Claude-Code-style content-block row per item.completed as it streams
(text, or tool_use with full input; tool names normalized to the Claude vocab so
the shared parser renders the command/args). Keep the single cumulative
cost/usage row at turn.completed unchanged, so run-total cost stays accurate.
New codex rows are content-block arrays and render via the existing
parseClaudeCodeBlocks path; parseCodexPayload stays as a fallback for old rows.

Also poll the run-detail LLM-calls list (and run status) while the run is active
so the streamed rows appear live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 86.36364% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/backends/codex/index.ts | 86.36% | 9 Missing ⚠️ |


@nhopeatall nhopeatall left a comment

Collaborator


Summary

APPROVE — the per-item streaming is correctly gated on item.completed, reuses the existing claude-code content-block parser path, and leaves the cost/usage accounting untouched. codex.test.ts (82), llmResponseParser.test.ts (30), and codex-cost.test.ts (9) all pass locally and biome is clean.

What I verified:

  • Cost unchanged. The single turn.completed row still carries the per-turn delta; per-item rows carry no tokens, so the dashboard summary (reduce over rows) and the run-total cost are identical to before. codex-cost.test.ts (turn.completed-only streams) still produces exactly N delta-priced rows — the per-item rows are purely additive when item.completed events are present.
  • No double-persist. An item.completed is either a message (→ text row) or a function_call/command_execution (→ tool row), never both; item.delta events never persist. callNumber stays monotonic across item + cost rows, so listLlmCallsMeta's orderBy(callNumber) keeps display order chronological.
  • Rendering. Array-shaped rows parse via parseClaudeCodeBlocks → summarizeInput, identical in the backend parser and its web/src/lib/llm-response-parser.ts mirror, so the normalizeCodexTool → Bash/Read/… mapping renders the command/args with no parser change needed.
  • Polling. refetchInterval: (query) => query.state.data?.status === 'running' ? … : false matches the existing convention (debug-analysis, prs, global/runs); 'running' is the only non-terminal run status; the getById→LlmCallList hand-off cleanly stops polling on completion.
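
The per-turn delta mentioned under "Cost unchanged" can be sketched as a subtraction of the previous cumulative usage from the current one. This assumes the cumulative figure keeps growing across turns; the `CumulativeUsage` shape and `usageDelta` helper are hypothetical and are not taken from the codex backend.

```typescript
// Hypothetical usage shape; the real field names are not shown in this PR.
interface CumulativeUsage {
  inputTokens: number;
  outputTokens: number;
}

// Usage attributable to one turn = cumulative now minus cumulative at the previous turn.
// The first turn has no predecessor, so its delta is the cumulative value itself.
function usageDelta(
  previous: CumulativeUsage | undefined,
  cumulative: CumulativeUsage,
): CumulativeUsage {
  const prev = previous ?? { inputTokens: 0, outputTokens: 0 };
  return {
    // Clamp at zero so a counter reset cannot produce negative usage (a design choice).
    inputTokens: Math.max(0, cumulative.inputTokens - prev.inputTokens),
    outputTokens: Math.max(0, cumulative.outputTokens - prev.outputTokens),
  };
}
```

Since only the turn.completed row carries a delta, summing deltas over all rows still reproduces the run total regardless of how many item rows sit between them.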

Code Issues

Should Fix (stale comments — no behavior change needed)

  • src/backends/codex/index.ts:289-296: handleStructuralEvent's doc still asserts the persistence boundary is "ONE storeLlmCall row is written exactly when turn.completed fires … intermediate events … do NOT persist a row." That is precisely the invariant this PR removes (item.completed now persists rows). A future maintainer debugging row counts would be misled by this comment.
  • src/backends/codex/index.ts:45-54: CodexTurnAccumulator's doc still says it "Collects text, tool summaries, and usage … exactly one storeLlmCall row is persisted per completed turn." toolNames was dropped and there are now many rows per turn. Worth refreshing together with the persistTurnLlmCall header (174-188), which still reads as if the turn row is the only row for the turn.

Minor

  • src/backends/codex/index.ts:218: the payload field turn: context.llmCallCount now emits the global call counter rather than the turn ordinal (the test pins turn: 3 for the first/only turn because two item rows preceded it). The field isn't consumed by the parser, so this is cosmetic, but a dedicated turn counter (or renaming the field to call) would keep stored payloads from being misleading.

🕵️ claude-code · claude-opus-4-8 · run details

@zbigniewsobiecki zbigniewsobiecki merged commit 82060de into dev Jun 23, 2026
9 checks passed

@nhopeatall nhopeatall left a comment

Collaborator


Summary

APPROVE — the per-item streaming approach is sound. Item rows carry no tokens and the single turn.completed row keeps the cumulative per-turn delta, so summed token/cost totals are unchanged. New rows are content-block arrays that render through the existing parseClaudeCodeBlocks path (verified the web/src/lib/llm-response-parser.ts mirror is in sync and summarizeInput handles the normalized Bash/Read/Write/Edit names). Run/LLM-calls polling is correctly gated on status === 'running' and self-terminates. Ran codex.test.ts (82) + llmResponseParser.test.ts (30) locally — all green.

Minor (non-blocking)

  • src/backends/codex/index.ts:219 — the turn.completed cost row still carries the accumulated text summary, which parseCodexPayload renders as textPreview. Since the same text now streams as its own per-item text row, the agent message appears twice in the LLM-calls list (its own row + the cost row). Consider dropping text from the cost payload so it reads as a pure usage row. See inline.
  • src/backends/codex/index.ts:218: the payload field turn: context.llmCallCount no longer equals the turn ordinal now that llmCallCount also increments per item; it's effectively the global row number. Nothing consumes .turn in parseCodexPayload, so this is cosmetic, but the field name is now misleading.
  • src/utils/llmResponseParser.ts:6 (+ web mirror) — the file header still documents Codex: JSON object {turn, text?, tools?: string[], usage?}. Codex now also emits content-block arrays and the cost row no longer has tools; a one-line update keeps the "keep both in sync" comment accurate. (Outside this PR's diff — flagging for a follow-up.)

🕵️ claude-code · claude-opus-4-8 · run details

// the turn.completed row carries the turn's cost/usage + a short text summary.
const turnPayload = JSON.stringify({
  turn: context.llmCallCount,
  text: acc.textSummary.join(' ').slice(0, 500) || undefined,

Collaborator


The cost row keeps text: acc.textSummary.join(' ')…, and parseCodexPayload turns that into a textPreview, so the agent's message now renders twice in the LLM-calls list — once as its own per-item text row, and again here on the usage row. Dropping text from this payload (leaving turn/usage/delta/reasoning) would make the cost row a pure usage row and better match the claude-code-style per-item layout the PR is aiming for. Non-blocking.
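
One way the suggestion could look, as a sketch only: build the cost payload from usage fields and a dedicated turn ordinal, and leave the text summary out. `TokenUsage`, `CostRowInput`, and `buildUsageOnlyPayload` are hypothetical names, and the reasoning field mentioned above is omitted because its shape is not shown.

```typescript
// Hypothetical usage shape for illustration.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical inputs available when turn.completed fires.
interface CostRowInput {
  turnOrdinal: number; // dedicated turn counter, not the global row counter
  usage: TokenUsage; // cumulative usage reported by the event
  delta: TokenUsage; // usage attributed to this turn
  textSummary: string[]; // already streamed as per-item text rows
}

// Serialize a pure usage row: textSummary is deliberately not included,
// so the agent message is not rendered a second time on the cost row.
function buildUsageOnlyPayload(input: CostRowInput): string {
  const { turnOrdinal, usage, delta } = input;
  return JSON.stringify({ turn: turnOrdinal, usage, delta });
}
```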
