Skip to content

bug: intermittent startup crash — process dies between agent_started and first prompt render #116

@franklixuefei

Description

@franklixuefei

Problem

Conductor intermittently crashes during startup, after emitting workflow_started and agent_started events but before rendering the first agent prompt. The process exits silently with no error event logged. Retrying the same command succeeds.

Observed across 56 workflow runs of pr-reviewer (v0.1.8 and v0.1.9) on Windows over 8 days. 26 of 56 runs (46%) crashed on startup. Every crash is recoverable by simply re-running the same command.

Evidence

Crash signature

All crash logs have exactly 2 events and nearly identical file sizes (90,748-90,750 bytes):

Event 1: workflow_started  (timestamp T)
Event 2: agent_started     (timestamp T + 0.001s)
(EOF — no further events, no error event)

Time between workflow_started and agent_started is ~1-2ms. The process dies before the next event (agent_prompt_rendered) would be emitted.

Crash rate

Metric Value
Total pr-reviewer runs 56
2-event crashes 26 (46%)
Successful runs 30
Crash log sizes 90,748 (4x), 90,749 (6x), 90,750 (16x)
Successful log sizes 92KB - 722KB

Crash-before-success patterns

Every crash was eventually followed by a successful run of the same workflow with the same arguments:

5 crash(es) then success
1 crash then success
3 crash(es) then success
1 crash then success
1 crash then success
1 crash then success
2 crash(es) then success
1 crash then success
3 crash(es) then success
3 crash(es) then success
1 crash then success
2 crash(es) then success
2 crash(es) then success

Worst case: 5 consecutive crashes before success. No crash was permanent -- retrying always works.

Near-crash logs (survived slightly longer)

Some runs survived past agent_started but died before completing the first turn:

Events Size Last event
2 90,750 bytes agent_started (crash)
3 92,645 bytes agent_prompt_rendered (crash)
4 92,764 bytes agent_turn_start (crash)

This suggests the failure window extends from agent_started through early agent_turn_start, with most crashes happening before agent_prompt_rendered.

Environment

  • Conductor: v0.1.9 (also observed on v0.1.8)
  • Python: 3.13.13
  • OS: Windows NT 10.0.26220.0 (Windows 11)
  • Workflow: pr-reviewer (8 agents, copilot provider)
  • Provider: copilot (Copilot CLI MCP)

Reproduction

Not deterministic -- crashes are intermittent (~46% of runs). Steps:

  1. Run any conductor workflow that uses the copilot provider:
    conductor --silent run pr-review.yaml --input pr_data_dir="tmp/pr-review" --web-bg
    
  2. Observe that approximately half of runs crash immediately (check $TEMP/conductor/ for 2-event logs)
  3. Re-run the same command -- it succeeds

Impact

  • Wasted time: Each crash adds 10-60 seconds (backoff) before retry
  • Pipeline reliability: Automated pipelines that invoke conductor must implement retry loops with crash detection
  • Silent failure: No error message or event is logged -- the process just exits. Callers must infer the crash from the 2-event log pattern or short runtime

Possible causes (speculation)

Since the crash happens between process startup and first LLM API call, possible causes include:

  • Race condition during MCP server initialization (copilot provider starts filesystem server via Node.js)
  • Port binding failure for the MCP transport
  • Python subprocess spawning race (conductor spawns child processes for MCP servers)
  • Async event loop initialization failure

The fact that retrying always works suggests a transient resource contention issue rather than a configuration problem.

Workaround

Callers should retry conductor on short-lived crashes. Example (PowerShell):

$maxRetries = 3
$crashThresholdSec = 10

for ($attempt = 0; $attempt -le $maxRetries; $attempt++) {
    $start = [DateTime]::UtcNow
    $proc = Start-Process conductor -ArgumentList $args -NoNewWindow -PassThru
    $proc.WaitForExit()
    $runtime = ([DateTime]::UtcNow - $start).TotalSeconds

    if ($proc.ExitCode -eq 0) { break }
    if ($runtime -ge $crashThresholdSec) { break }  # real failure, not crash

    $backoff = [math]::Min(10 * [math]::Pow(2, $attempt), 60)
    Start-Sleep -Seconds $backoff
}

This distinguishes crashes (runtime < 10s) from real workflow failures (runtime > 10s) and only retries crashes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions