fix(tracing): close trace corruption in long-running sessions (#865)#867
Conversation
…race corruption Two confirmed concurrency bugs in `Trace.snapshot()` and `FileExporter.export()` that caused trace files in long-running sessions to either lose events (M2) or show a stale terminating status instead of "crashed" (M3). See umbrella issue #865 for the full reproduction matrix. M2 — debounce drops events with no follow-up snapshot `snapshot()` returns early when `snapshotPending=true` and was never re-scheduled after the in-flight write completed. In bursty turns (an LLM step that fires 5-10 tool calls back-to-back), 7 of 8 events would be dropped from disk until a fresh event arrived in a future turn. If the process exited in the gap, the burst's tail was lost permanently. Natural reproducer fired 20/20 trials. Fix: track `snapshotRequestedDuringPending` whenever `snapshot()` short- circuits. In `.finally()`, if the flag is set and the trace is neither crashed nor ending, schedule exactly one follow-up snapshot via `queueMicrotask`. Bounded: at most one extra write per "burst → quiet" cycle regardless of burst size. M3 — FileExporter.export() races flushSync with no crashed guard `endTrace()` calls `exporter.export()`, which does writeFile(tmp) → rename(tmp, final). The writeFile of a multi-MB trace takes 100+ms, a wide window for `flushSync` to interleave. The `crashed` flag added in 3846387 only guarded `snapshot()`, not `export()`, so flushSync's synchronous crashed write was overwritten by the export's rename. Natural reproducer fired 10/10 trials with ~52 MB traces. Fix: give `FileExporter` a private `_crashed` flag with `markCrashed()`. Check it at entry, before the writeFile, and before the rename (drop tmp and bail in the last case). `flushSync` iterates exporters and calls `markCrashed()` on each `FileExporter` BEFORE its own synchronous write. M2 companion — endTraceStarted gate Once `endTrace()` claims the canonical write, no concurrent snapshot may run. Without this, the M2 follow-up's queued microtask runs after the `await snapshotPromise` in endTrace but BEFORE the root-span endTime mutation, capturing pre-end state and racing to be last writer. Gate the follow-up against an `endTraceStarted` flag set at endTrace entry. M1 (architectural residual) — NOT closed in this PR The TOCTOU between the synchronous `if (this.crashed)` check and the asynchronous kernel `fs.rename` syscall remains. The existing `crashed` flag from 3846387 protects against it on local SSD (microsecond rename window). It can theoretically fire on slow/network filesystems. Documented as a known design hazard in the umbrella issue. Reproducer kept as `test.skip` for local experimentation. Tests: - `test/altimate/tracing-rename-race.test.ts` — reproducers/regressions for all three mechanisms. M2-natural was 20/20 → now 0/20; M3-natural was 10/10 → now 0/10. M1-natural verifies existing protection. - `test/altimate/tracing-display-crash.test.ts:500` updated: `flushSync then endTrace` previously asserted endTrace overwrote flushSync's crashed status. Under the new policy that assertion is inverted — flushSync wins. Closes #865 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds exporter crash signaling, per-session crash guards, snapshot follow-up scheduling, endTrace/flushSync coordination, and a regression test suite to prevent async-rename and debounce races from overwriting or losing trace events. ChangesTrace file concurrency corruption safeguards
Sequence DiagramsequenceDiagram
participant Event as Tool Event
participant Trace
participant FileExporter
participant FS as fs operations
participant flushSync as flushSync()
Event->>Trace: snapshot()/events
Trace->>FileExporter: export(trace) (async)
FileExporter->>FS: writeFile(tmpPath)
alt new events during write
Trace->>Trace: snapshotRequestedDuringPending = true
end
FileExporter->>FileExporter: re-check crashedSessions
alt crashedSessions contains session
FileExporter->>FS: unlink(tmpPath)
else
FileExporter->>FS: rename(tmpPath, canonical)
end
flushSync->>FileExporter: markCrashed(sessionId)
Trace->>Trace: .finally schedules queueMicrotask(snapshot) if snapshotRequestedDuringPending and not endTraceStarted
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
1 similar comment
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
There was a problem hiding this comment.
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/opencode/src/altimate/observability/tracing.ts">
<violation number="1" location="packages/opencode/src/altimate/observability/tracing.ts:176">
P2: `FileExporter.markCrashed()` sets a permanent exporter-wide flag, which can disable exports for later traces when the same exporter instance is reused.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
Addresses the consensus code review of #867 (6 reviewers, 4 of 5 successful externals + Claude flagged the same critical bug at MAJOR/CRITICAL). CRITICAL fix — per-session crash tracking on FileExporter The original PR put `_crashed` as instance state on FileExporter. In the TUI worker (worker.ts:58-74,96-97), a single FileExporter is cached and shared across every session — the spread `[...tracingExporters]` copies the array, not the inner objects, and `Trace.withExporters` only allocates a fresh FileExporter when `options.maxFiles != null` (rarely set in user config). One session's flushSync would permanently mark the shared exporter crashed, silently suppressing endTrace exports for every subsequent session in the worker — a worse regression than the bug the original PR was fixing. Refactor: `private crashedSessions = new Set<string>()` on FileExporter, with `markCrashed(sessionId: string)`. Export checks `crashedSessions.has(trace.sessionId)` at all three checkpoints. The guard now scopes to the actual crashing trace, not the shared instance. Regression test in tracing-rename-race.test.ts: shared FileExporter, Trace A flushSync, Trace B endTrace — asserts B's file lands with status:"completed". Would have caught the original regression. MAJOR fix — endTrace return value after flushSync `endTrace()` short-circuits at entry when `this.crashed` is set and returns `getTracePath()` instead of letting export() bail and returning undefined. Restores the API contract that endTrace returns a usable file path whenever the canonical file exists. Adopts the "crash wins everywhere" policy — once flushSync has fired, the crashed write is authoritative across all exporters; no further exports run. MAJOR fix — TraceExporter interface adds optional markCrashed Replaces the `instanceof FileExporter` dispatch in flushSync with a polymorphic `exp.markCrashed?.(this.sessionId)`. Adds optional `markCrashed?(sessionId: string): void` to the TraceExporter interface. Custom exporters with canonical-file semantics can now participate in the crash guard without modifying Trace. Drops the brittle instanceof check that broke if FileExporter was ever subclassed or duplicated across module boundaries. Minor fixes (PR-introduced issues only) - flush() now drains M2 follow-up snapshots (bounded loop, 16 iters max) so callers see disk == memory after it returns. The previous version only awaited the current promise; a queued follow-up would land after flush() resolved. - Microtask-ordering invariant documented inline at endTrace entry — explains why setting endTraceStarted before the await is airtight. - Test afterEach safety guard restores fs.rename even if a prior test panicked mid-patch, preventing cross-file leakage of the patched fn. - Misleading comment in M2 deterministic test (claimed "no follow-up scheduled") updated to reflect post-fix behavior. - FileExporter naming: `_crashed` → `crashedSessions` (drops the inconsistent underscore prefix; the bare-name style matches Trace.crashed). - markCrashed docstring clarified: "bail at the next checkpoint" rather than overclaiming retroactive cancellation. Tests: - 366 pass (added shared-FileExporter regression test); 0 fail in the tracing suite. Full opencode suite has only pre-existing unrelated failures (tool.registry, sql_analyze, E2E validators). Pre-existing issue noted but NOT fixed in this commit: - `Trace.withExporters` mutates the caller's array (tracing.ts:413-414). GLM-5.1 surfaced this during review. It's pre-existing code (not introduced by this PR series) and unrelated to the M2/M3 fixes. Filing as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
Consensus review addressed (6 reviewers + Claude)Pushed commit Blocking issues — fixed
PR-introduced minor issues — fixed
Not fixed in this commit
Regression test for the critical fix
Test results
Ready for re-review. |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
1 similar comment
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/opencode/src/altimate/observability/tracing.ts (1)
191-205:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNormalize the crash-guard key before storing/checking it.
Line 191 stores the raw
sessionId, but Line 204 checks againsttrace.sessionId, whichTrace.buildTraceFile()sanitizes on Line 778. For ids containing/,\,., or:,markCrashed()won't match the laterexport()checks, so the async rename can still overwrite the synchronous crashed file.💡 Proposed fix
export class FileExporter implements TraceExporter { readonly name = "file" private dir: string private maxFiles: number private crashedSessions = new Set<string>() + + private static normalizeSessionId(sessionId: string | undefined) { + return (sessionId ?? "unknown").replace(/[/\\.:]/g, "_") || "unknown" + } + /** Mark a session's exports as superseded by a synchronous crash write. * Subsequent or in-flight export() calls for that session bail at the * next checkpoint (entry, pre-writeFile, or pre-rename). Idempotent. */ markCrashed(sessionId: string) { - if (sessionId) this.crashedSessions.add(sessionId) + this.crashedSessions.add(FileExporter.normalizeSessionId(sessionId)) } async export(trace: TraceFile): Promise<string | undefined> { - if (this.crashedSessions.has(trace.sessionId)) return undefined + const safeId = FileExporter.normalizeSessionId(trace.sessionId) + if (this.crashedSessions.has(safeId)) return undefined let tmpPath: string | undefined try { await fs.mkdir(this.dir, { recursive: true }) - const safeId = (trace.sessionId ?? "unknown").replace(/[/\\.:]/g, "_") || "unknown" const filePath = path.join(this.dir, `${safeId}.json`) tmpPath = filePath + `.tmp.${Date.now()}.${Math.random().toString(36).slice(2, 8)}` - if (this.crashedSessions.has(trace.sessionId)) return undefined + if (this.crashedSessions.has(safeId)) return undefined await fs.writeFile(tmpPath, JSON.stringify(trace, null, 2)) - if (this.crashedSessions.has(trace.sessionId)) { + if (this.crashedSessions.has(safeId)) { await fs.unlink(tmpPath).catch(() => {}) return undefined }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opencode/src/altimate/observability/tracing.ts` around lines 191 - 205, markCrashed stores the raw sessionId but export checks the sanitized id produced by Trace.buildTraceFile, so IDs with characters like / \ . : won't match and the crash-guard can fail; fix by normalizing the key consistently when storing and checking in the crashedSessions set (i.e., call the same sanitization used by Trace.buildTraceFile or a shared Trace.sanitizeSessionId helper) inside markCrashed and before the has() check in export so both use the identical sanitized session id.packages/opencode/test/altimate/tracing-rename-race.test.ts (1)
30-43: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winUse
tmpdir()instead of manual temp-dir lifecycle.This test still manages
os.tmpdir(),fs.mkdir, andafterEachcleanup manually. Please switch the fixture setup toawait usingwithtmpdir()so cleanup is automatic and consistent across the test suite.As per coding guidelines, "Use the
tmpdirfunction fromfixture/fixture.tsto create temporary directories for tests with automatic cleanup in test files" and "Always useawait usingsyntax withtmpdir()for automatic cleanup when the variable goes out of scope".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opencode/test/altimate/tracing-rename-race.test.ts` around lines 30 - 43, The test manually creates and cleans tmpDir in beforeEach/afterEach and manipulates originalRename; replace this with the tmpdir fixture by importing tmpdir from fixture/fixture.ts and using "await using tmp = tmpdir()" (or similar name) inside the test setup so the directory lifecycle is automatic; remove the manual fs.mkdir, os.tmpdir usage, and the afterEach cleanup/restore block that checks (fs as any).rename !== originalRename, and update references from tmpDir to the tmpdir fixture variable; ensure originalRename handling is preserved only if still needed for fs.rename patching but scope it inside the test so teardown is automatic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@packages/opencode/src/altimate/observability/tracing.ts`:
- Around line 191-205: markCrashed stores the raw sessionId but export checks
the sanitized id produced by Trace.buildTraceFile, so IDs with characters like /
\ . : won't match and the crash-guard can fail; fix by normalizing the key
consistently when storing and checking in the crashedSessions set (i.e., call
the same sanitization used by Trace.buildTraceFile or a shared
Trace.sanitizeSessionId helper) inside markCrashed and before the has() check in
export so both use the identical sanitized session id.
In `@packages/opencode/test/altimate/tracing-rename-race.test.ts`:
- Around line 30-43: The test manually creates and cleans tmpDir in
beforeEach/afterEach and manipulates originalRename; replace this with the
tmpdir fixture by importing tmpdir from fixture/fixture.ts and using "await
using tmp = tmpdir()" (or similar name) inside the test setup so the directory
lifecycle is automatic; remove the manual fs.mkdir, os.tmpdir usage, and the
afterEach cleanup/restore block that checks (fs as any).rename !==
originalRename, and update references from tmpDir to the tmpdir fixture
variable; ensure originalRename handling is preserved only if still needed for
fs.rename patching but scope it inside the test so teardown is automatic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b9caca4e-62ea-48f1-9703-498fc2072aff
📒 Files selected for processing (3)
packages/opencode/src/altimate/observability/tracing.tspackages/opencode/test/altimate/tracing-display-crash.test.tspackages/opencode/test/altimate/tracing-rename-race.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/opencode/test/altimate/tracing-display-crash.test.ts
There was a problem hiding this comment.
1 issue found across 3 files (changes from recent commits).
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…kup key (cubic P1)
cubic-dev-ai flagged a P1 (confidence 8) in the consensus-review fix: the
crash guard could be silently bypassed for any sessionId containing the
file-name-unsafe chars `/`, `\`, `.`, or `:`. Six independent LLM reviewers
missed this — see memory note on AI-review blind spots for data-flow bugs.
The mismatch:
- flushSync calls `markCrashed(this.sessionId)` with the RAW value
- FileExporter.export checks `crashedSessions.has(trace.sessionId)`
where `trace.sessionId` comes from `buildTraceFile` and is sanitized
via `replace(/[/\\.:]/g, "_")`
For any sessionId containing one of those chars, the Set entry and the
lookup key differ, so the guard returns false and export proceeds — the
exact scenario the M3 fix was meant to prevent.
Fix: sanitize in markCrashed using the same regex, so the Set always
stores the canonical (sanitized) key. Lookup-and-store now agree.
Regression test added: a sessionId with `:`, `/`, and `.` is marked
crashed, then export() is called with the corresponding sanitized trace.
Pre-fix: export proceeds and overwrites flushSync's crashed file.
Post-fix: export returns undefined; flushSync's file stands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
2 similar comments
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
dev-punia-altimate
left a comment
There was a problem hiding this comment.
Multi-Persona Review — Verdict: ship
This PR adds 13 new Databricks semantic models to Studio with comprehensive coverage of cost, job, warehouse, and query metrics. All models are well-documented, follow consistent join patterns, and have been locally verified. No critical or high-severity issues were found across all review layers.
15/15 agents completed · 280s · 15 findings (0 critical, 4 high, 9 medium)
High
- [persona_cto] Adding 13 new ClickHouse-backed semantic models without a clear strategy for model lifecycle management, versioning, or schema evolution. The codebase now has 83 models — a 18% increase — with no accompanying monitoring, drift detection, or deprecation pathway.
- 💡 Introduce a semantic model registry with versioned schemas, automated schema diffing, and a deprecation policy. Consider using a metadata store (e.g., Snowflake table or internal API) to track model ownership, last tested date, and usage metrics.
- [persona_product-manager] PR adds 13 Databricks semantic models but does not include any test coverage or verification for the end-to-end Studio user flows mentioned in the PR body (e.g., 'What was my Databricks cost in the last 30 days?'). The test plan lists curl commands but no automated tests or integration validation for actual user questions.
- 💡 Add automated integration tests that simulate real user queries against the new models, or create a follow-up ticket to validate end-to-end Studio interactions before release.
- [persona_security] User-supplied filter values (e.g., workspace_id, instance_id) are likely passed directly into ClickHouse SQL queries via string interpolation in semantic models, with no evidence of parameterization or input sanitization. →
app/semantic_layer/models/databricks_billing_usage.py:45- 💡 Use parameterized queries with bound parameters (e.g., via ClickHouse driver's execute() with args) instead of string formatting. Validate all tenant-specific filters (workspace_id, instance_id) against an allowlist of known valid IDs.
- [persona_security] No explicit authorization check is visible in the added models to ensure that a user requesting Databricks semantic data has access to the specified tenant (workspace_id/instance_id). This enables IDOR across tenants. →
app/semantic_layer/models/databricks_access_workspaces.py:23- 💡 Enforce tenant-scoped access control by validating that the authenticated user's tenant context matches the workspace_id/instance_id in every query. Use a middleware or service layer to inject and validate tenant context before query execution.
Medium
- [persona_tech-lead] Semantic model files are added directly under app/semantic_layer/models/ without a clear sublayering or abstraction for ClickHouse-specific logic, risking tight coupling between data source and model definition. →
app/semantic_layer/models/databricks_billing_usage.py:1- 💡 Consider introducing a
app/semantic_layer/sources/clickhouse/layer to encapsulate ClickHouse table mappings and transformations, with semantic models importing from there. This isolates data source details and enables reuse across dialects.
- 💡 Consider introducing a
- [persona_tech-lead] Multiple Databricks models (e.g., job_run_timeline, job_task_run_timeline) duplicate identical foreign key join definitions (e.g., job_id, run_id) without reusing a shared schema or base class. →
app/semantic_layer/models/databricks_job_run_timeline.py:45- 💡 Extract common entity definitions (e.g.,
JobEntity,RunEntity) into a shared module (e.g.,app/semantic_layer/models/_common.py) and inherit or compose them to reduce duplication and ensure consistency.
- 💡 Extract common entity definitions (e.g.,
- [persona_cto] The PR introduces deep cross-model joins across 13 new models with 7 shared foreign entities. While the join logic is well-documented, it creates a tightly coupled semantic graph that increases cognitive load and risk of breakage during schema changes.
- 💡 Refactor shared entities into a reusable base model definition pattern (e.g.,
BaseEntityMixin) to reduce duplication and enforce consistency. Add unit tests that validate join integrity independently of the full manifest.
- 💡 Refactor shared entities into a reusable base model definition pattern (e.g.,
- [persona_cto] Multiple models rely on JSONExtractString for critical joins (e.g., query → warehouse), which bypasses the semantic layer’s abstraction and forces users to write raw SQL. This undermines the purpose of the semantic layer.
- 💡 Preprocess and flatten JSON fields in the ClickHouse materialized views or ETL pipeline so that
warehouse_idbecomes a first-class column. This preserves abstraction and enables agent-friendly querying.
- 💡 Preprocess and flatten JSON fields in the ClickHouse materialized views or ETL pipeline so that
- [persona_product-manager] PR documents limitations like JSON-encoded compute fields and potential 2x cost overstatement due to MergeTree behavior, but these are not surfaced to users in Studio. A non-technical user asking for cost data may receive misleading numbers without context.
- 💡 Add user-facing documentation or tooltips in Studio explaining known data quirks (e.g., 'Costs may be temporarily inflated during data sync') to manage expectations.
- [persona_product-manager] The PR adds workspace and instance filtering via foreign keys, but there's no indication that the API endpoint (/copilot/semantic-metrics/list) enforces tenant isolation. Without this, users might see data from other tenants.
- 💡 Verify and document that tenant headers (e.g., X-Tenant-ID) are properly enforced in all model queries to ensure data isolation across customers.
- [persona_security] Error messages from ClickHouse query failures may be exposed to users via the /copilot/semantic-metrics/query endpoint, potentially leaking schema structure, table names, or internal column names. →
app/semantic_layer/models/databricks_query_history.py:110- 💡 Wrap ClickHouse query execution in try-catch blocks and return generic error messages (e.g., 'Invalid query parameters') to end users. Log detailed errors server-side only.
- [persona_devops] Adds 13 new ClickHouse-backed semantic models without evidence of schema migration or data backfill plan for existing Databricks tenants. No mention of how historical data is handled or if tables are pre-populated.
- 💡 Confirm that all dbx_* ClickHouse tables exist and are populated in production before deployment. Add a pre-deployment check script to validate table existence and row counts.
- [persona_devops] No feature flag or toggle mechanism to enable/disable Databricks semantic models per tenant. Risk of exposing incomplete or untested models to all tenants.
- 💡 Implement a tenant-level feature flag (e.g., 'enable_databricks_semantic_models') to roll out gradually to controlled tenants (e.g., thredup first) before full rollout.
Low
- [persona_tech-lead] Model names use inconsistent casing: some use snake_case (e.g.,
databricks_billing_usage), others use mixed case in comments (e.g., 'SQL Warehouses') — though acceptable, it reduces internal consistency. →app/semantic_layer/models/databricks_warehouse_inventory.py:10- 💡 Standardize all model names and comments to use snake_case consistently (e.g., 'sql_warehouses') to align with Python and project conventions.
- [persona_devops] No new environment variables introduced, but models rely on ClickHouse connection details. No confirmation that ClickHouse configs are properly configured in all environments.
- 💡 Verify that CLICKHOUSE_HOST, CLICKHOUSE_PORT, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD are set and tested in staging and production environments.
Multi-Persona Review · vllm:qwen3-next-80b (waves) + vllm-fallback (synth) ·
The test asserted `schema_inspect` IS in the output, contradicting its own name. The production code at sql-analyze.ts:56-58 flips `isRealFailure` to true when the dispatcher result includes an `error` field, which suppresses the progressive-disclosure suggestion. Test name described correct behaviour; assertion did not. Flipping to `not.toContain` matches both the implementation and the test name. Pre-existing fail on main (verified on v0.8.0 release CI run 26734399443) — unrelated to the trace corruption fix but blocking #867's CI. Landing the fix in this PR for the unblock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
2 similar comments
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
|
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
Summary
Closes #865.
Two confirmed concurrency bugs in the trace pipeline that caused trace files in long-running sessions to either lose events (M2) or show a stale terminating status instead of
"crashed"(M3). Both fire 100% of the time under realistic conditions in the reproducers added in this PR. The third architecturally-related bug (M1) remains as a documented design hazard; commit38463876b's existing protection holds against it on local SSD.See #865 for the full reproduction matrix and mechanism writeups.
What changed
M2 — debounce drops events with no follow-up snapshot
Trace.snapshot()returned early whensnapshotPending=trueand was never re-scheduled after the in-flight write completed. In bursty turns (an LLM step firing several tool calls back-to-back), the disk file lagged memory until a fresh event arrived in a future turn. If the process exited in the gap, the burst's tail was lost. Natural reproducer fired 20/20 trials, losing 7 of 8 burst events each.Fix: track
snapshotRequestedDuringPendingwheneversnapshot()short-circuits. In.finally(), if the flag is set and the trace is neither crashed nor ending, schedule exactly one follow-up snapshot viaqueueMicrotask. Bounded — at most one extra write per "burst → quiet" cycle regardless of burst size.M3 —
FileExporter.export()racesflushSyncwith nocrashedguardendTrace()callsexporter.export(), which doeswriteFile(tmp)→rename(tmp, final). On a multi-MB trace the writeFile takes 100+ms, a wide window forflushSyncto interleave. Thecrashedflag added in38463876bonly guardedTrace.snapshot(), notFileExporter.export(), soflushSync's synchronous crashed write was overwritten by the export's rename. Natural reproducer fired 10/10 trials with ~52 MB traces.Fix: give
FileExportera private_crashedflag withmarkCrashed(). Check it at entry, before the writeFile, and before the rename (drop tmp + bail).flushSynciterates exporters and callsmarkCrashed()on eachFileExporterbefore its own synchronous write.M2 companion —
endTraceStartedgateOnce
endTrace()claims the canonical write, no concurrent snapshot may run. Without this, the M2 follow-up's queued microtask runs afterawait snapshotPromisein endTrace but before the root-spanendTimemutation, capturing pre-end state and racing to be last writer.M1 — NOT closed in this PR
The TOCTOU between the synchronous
if (this.crashed)check and the asynchronous kernelfs.renamesyscall remains. The existingcrashedflag from38463876bprotects against it on local SSD (microsecond rename window) and is verified byM1-naturalin the new test file (5/5 trials still preserved crashed status under preload + flushSync). Theoretical exposure remains on slow/network filesystems. Documented as a known design hazard in #865; deterministic reproducer kept astest.skipfor local experimentation.Tests
packages/opencode/test/altimate/tracing-rename-race.test.ts— reproducers + regressions for M1/M2/M3.M2-natural: was 20/20 drops → 0/20 after fixM3-natural: was 10/10 clobbered → 0/10 after fixM2 deterministic(injected fs.rename delay): also passes after fix (extended wait for follow-up snapshot)M1-natural: regression against existingcrashedprotection (0/5 clobbered)M1 residual:test.skip— documents the architectural TOCTOUpackages/opencode/test/altimate/tracing-display-crash.test.ts:500. The pre-existing testflushSync then endTrace — endTrace overwrites crashed statusasserted the opposite of the new policy. Updated to assert thatflushSync's crashed status is preserved (M3 fix's intent).Verification
tool.registry,sql_analyze, validator E2E) and unchanged by this PR.Test plan
crashedflag holds)🤖 Generated with Claude Code
Summary by cubic
Fixes trace corruption in long-running sessions by preserving crash writes and catching up dropped snapshots. Also corrects a failing
sql_analyzetest assertion that was blocking CI.Trace.snapshot()now schedules one follow-up when a call is dropped during an in-flight write;flush()drains this follow-up so disk matches memory, and snapshots stop onceendTrace()begins.FileExporteradds per-sessionmarkCrashed(sessionId)(with sanitization) and checks at entry, pre-write, and pre-rename;TraceExportersupports optionalmarkCrashed; shared-exporter regression fixed so one session’s crash doesn’t suppress others;endTrace()returns the crashed file path when a crash occurs.sql_analyzeassertion fixed to not expectschema_inspectwhen an error is returned, unblocking CI.Written for commit 49695c7. Summary will update on new commits.
Summary by CodeRabbit
Bug Fixes
Tests