bug: intermittent startup crash — process dies between agent_started and first prompt render

## Problem

Conductor intermittently crashes during startup, after emitting `workflow_started` and `agent_started` events but before rendering the first agent prompt. The process exits silently with no error event logged. Retrying the same command succeeds.

Observed across 56 workflow runs of `pr-reviewer` (v0.1.8 and v0.1.9) on Windows over 8 days. 26 of 56 runs (46%) crashed on startup. Every crash is recoverable by simply re-running the same command.

## Evidence

### Crash signature

All crash logs have exactly 2 events and nearly identical file sizes (90,748-90,750 bytes):

```
Event 1: workflow_started  (timestamp T)
Event 2: agent_started     (timestamp T + 0.001s)
(EOF — no further events, no error event)
```

Time between `workflow_started` and `agent_started` is ~1-2ms. The process dies before the next event (`agent_prompt_rendered`) would be emitted.

### Crash rate

| Metric | Value |
|--------|-------|
| Total pr-reviewer runs | 56 |
| 2-event crashes | 26 (46%) |
| Successful runs | 30 |
| Crash log sizes | 90,748 (4x), 90,749 (6x), 90,750 (16x) |
| Successful log sizes | 92KB - 722KB |

### Crash-before-success patterns

Every crash was eventually followed by a successful run of the same workflow with the same arguments:

```
5 crash(es) then success
1 crash then success
3 crash(es) then success
1 crash then success
1 crash then success
1 crash then success
2 crash(es) then success
1 crash then success
3 crash(es) then success
3 crash(es) then success
1 crash then success
2 crash(es) then success
2 crash(es) then success
```

Worst case: 5 consecutive crashes before success. No crash was permanent -- retrying always works.

### Near-crash logs (survived slightly longer)

Some runs survived past `agent_started` but died before completing the first turn:

| Events | Size | Last event |
|--------|------|------------|
| 2 | 90,750 bytes | `agent_started` (crash) |
| 3 | 92,645 bytes | `agent_prompt_rendered` (crash) |
| 4 | 92,764 bytes | `agent_turn_start` (crash) |

This suggests the failure window extends from `agent_started` through early `agent_turn_start`, with most crashes happening before `agent_prompt_rendered`.

## Environment

- Conductor: v0.1.9 (also observed on v0.1.8)
- Python: 3.13.13
- OS: Windows NT 10.0.26220.0 (Windows 11)
- Workflow: `pr-reviewer` (8 agents, copilot provider)
- Provider: copilot (Copilot CLI MCP)

## Reproduction

Not deterministic -- crashes are intermittent (~46% of runs). Steps:

1. Run any conductor workflow that uses the copilot provider:
   ```
   conductor --silent run pr-review.yaml --input pr_data_dir="tmp/pr-review" --web-bg
   ```
2. Observe that approximately half of runs crash immediately (check `$TEMP/conductor/` for 2-event logs)
3. Re-run the same command -- it succeeds

## Impact

- **Wasted time:** Each crash adds 10-60 seconds (backoff) before retry
- **Pipeline reliability:** Automated pipelines that invoke conductor must implement retry loops with crash detection
- **Silent failure:** No error message or event is logged -- the process just exits. Callers must infer the crash from the 2-event log pattern or short runtime

## Possible causes (speculation)

Since the crash happens between process startup and first LLM API call, possible causes include:
- Race condition during MCP server initialization (copilot provider starts filesystem server via Node.js)
- Port binding failure for the MCP transport
- Python subprocess spawning race (conductor spawns child processes for MCP servers)
- Async event loop initialization failure

The fact that retrying always works suggests a transient resource contention issue rather than a configuration problem.

## Workaround

Callers should retry conductor on short-lived crashes. Example (PowerShell):

```powershell
$maxRetries = 3
$crashThresholdSec = 10

for ($attempt = 0; $attempt -le $maxRetries; $attempt++) {
    $start = [DateTime]::UtcNow
    $proc = Start-Process conductor -ArgumentList $args -NoNewWindow -PassThru
    $proc.WaitForExit()
    $runtime = ([DateTime]::UtcNow - $start).TotalSeconds

    if ($proc.ExitCode -eq 0) { break }
    if ($runtime -ge $crashThresholdSec) { break }  # real failure, not crash

    $backoff = [math]::Min(10 * [math]::Pow(2, $attempt), 60)
    Start-Sleep -Seconds $backoff
}
```

This distinguishes crashes (runtime < 10s) from real workflow failures (runtime > 10s) and only retries crashes.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: intermittent startup crash — process dies between agent_started and first prompt render #116

Problem

Evidence

Crash signature

Crash rate

Crash-before-success patterns

Near-crash logs (survived slightly longer)

Environment

Reproduction

Impact

Possible causes (speculation)

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric	Value
Total pr-reviewer runs	56
2-event crashes	26 (46%)
Successful runs	30
Crash log sizes	90,748 (4x), 90,749 (6x), 90,750 (16x)
Successful log sizes	92KB - 722KB

Events	Size	Last event
2	90,750 bytes	`agent_started` (crash)
3	92,645 bytes	`agent_prompt_rendered` (crash)
4	92,764 bytes	`agent_turn_start` (crash)

bug: intermittent startup crash — process dies between agent_started and first prompt render #116

Description

Problem

Evidence

Crash signature

Crash rate

Crash-before-success patterns

Near-crash logs (survived slightly longer)

Environment

Reproduction

Impact

Possible causes (speculation)

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions