fix(fleet-data): bound external subprocess calls with wait-with-deadline #164

Merged
NagyVikt merged 1 commit into main from fix/fleet-data-subprocess-timeouts
May 16, 2026

@NagyVikt

Summary

Wraps every external subprocess call in fleet-data with a sync wait-with-deadline so a hung child (network stall, auth prompt, wedged tmux) can no longer freeze the dashboards' 250ms tick.

Freeze vector being closed

Before this PR, each dashboard tick could shell out to tmux, agent-auth, git, or gh and block indefinitely on Command::output() / Child::wait_with_output(). A single wedged child (tmux server lock contention, agent-auth prompting for credentials, a stalled gh pr list over a slow network) would stall the entire UI until killed manually.

New module

rust/fleet-data/src/subprocess.rs exports:

pub fn output_with_deadline(
    cmd: std::process::Command,
    deadline: std::time::Duration,
) -> std::io::Result<std::process::Output>

Spawns the child, polls try_wait every 10ms, and on expiry kills and reaps the child before returning an error of kind io::ErrorKind::TimedOut. Sync on purpose (the crate is sync).
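For readers who want to see the shape of such a helper, here is a minimal sketch of the spawn / poll / kill-and-reap loop described above. It is not the code merged in this PR: the reader threads, the null stdin, and the `POLL_INTERVAL` name are assumptions made so that the example stands alone.

```rust
use std::io::{self, Read};
use std::process::{Command, Output, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Poll interval from the PR description (10ms). The constant name is hypothetical.
const POLL_INTERVAL: Duration = Duration::from_millis(10);

/// Sketch only: run `cmd`, returning its output or `TimedOut` once `deadline` passes.
pub fn output_with_deadline(mut cmd: Command, deadline: Duration) -> io::Result<Output> {
    let mut child = cmd
        .stdin(Stdio::null()) // assumption: never let the child wait on a terminal prompt
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let started = Instant::now();

    // Drain both pipes on helper threads so a chatty child cannot block on a
    // full pipe while this thread is only polling try_wait.
    let mut out_pipe = child.stdout.take().expect("stdout was piped");
    let mut err_pipe = child.stderr.take().expect("stderr was piped");
    let out_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = out_pipe.read_to_end(&mut buf);
        buf
    });
    let err_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = err_pipe.read_to_end(&mut buf);
        buf
    });

    loop {
        if let Some(status) = child.try_wait()? {
            // Child exited: the pipes reach EOF, so the readers finish
            // (unless a grandchild still holds the write end open).
            let stdout = out_reader.join().unwrap_or_default();
            let stderr = err_reader.join().unwrap_or_default();
            return Ok(Output { status, stdout, stderr });
        }
        if started.elapsed() >= deadline {
            let _ = child.kill(); // stop the child
            let _ = child.wait(); // reap it so no zombie is left behind
            // The reader threads are detached here rather than joined.
            return Err(io::Error::new(
                io::ErrorKind::TimedOut,
                "subprocess exceeded its deadline",
            ));
        }
        thread::sleep(POLL_INTERVAL);
    }
}
```

A call such as `output_with_deadline(cmd, Duration::from_millis(500))` then behaves like `Command::output()` on the happy path and fails with `TimedOut` instead of blocking when the child hangs.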

Two named deadline constants so call sites stay consistent:

  • TMUX_READ_DEADLINE = 500ms for tmux calls (the read paths plus select_window and set_pane_option). Local socket; normal latency is single-digit ms. 500ms keeps a stalled tmux server contained to at most ~2 ticks.
  • HEAVY_CMD_DEADLINE = 2s for agent-auth list, git, gh. These touch disk / network / remote APIs. 2s is short enough to drop wedged calls fast, long enough for a real gh pr list --json files over a slow link.
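The "~2 ticks" figure follows from dividing the deadline by the tick length. The sketch below spells out that arithmetic. The two deadline values and the 250ms tick come from this PR; `DASHBOARD_TICK` and `worst_case_ticks` are hypothetical names, and the calculation assumes a single call blocking the tick thread.

```rust
use std::time::Duration;

/// Deadline values as stated in the PR description.
pub const TMUX_READ_DEADLINE: Duration = Duration::from_millis(500);
pub const HEAVY_CMD_DEADLINE: Duration = Duration::from_secs(2);

/// Dashboard tick from the PR description (hypothetical constant name).
pub const DASHBOARD_TICK: Duration = Duration::from_millis(250);

/// Upper bound on how many ticks one wedged call can hold up:
/// ceil(deadline / tick), computed in whole milliseconds.
pub fn worst_case_ticks(deadline: Duration, tick: Duration) -> u128 {
    let d = deadline.as_millis();
    let t = tick.as_millis();
    (d + t - 1) / t
}
```

Under these assumptions a wedged tmux call costs at most 2 ticks and a wedged heavy command at most 8.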

Call sites converted

  • tmux.rs: has_session, list_panes, capture_pane, display_message, select_window, set_pane_option (TMUX_READ_DEADLINE)
  • accounts.rs: load_live (HEAVY_CMD_DEADLINE)
  • git.rs: branch_contains_pr, open_prs_with_files (HEAVY_CMD_DEADLINE)
  • panes.rs: the concurrent tmux capture-pane fan-out applies its deadline per child (timed from each child's own spawn instant), not to the whole join. Slow stragglers are killed and reaped; their pane falls back to an empty scrollback tail (which classify reads as Idle).
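The per-child deadline in the fan-out can be pictured roughly as follows. This is a sketch under assumptions, not the code in panes.rs: `capture_all` is a hypothetical name, stdout is read only after the child exits (which is only safe while the output fits in the OS pipe buffer), and stderr is discarded.

```rust
use std::io::Read;
use std::process::{Child, Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Sketch: run all commands concurrently; each child gets `deadline`
/// measured from its own spawn instant. Failures and stragglers yield "".
pub fn capture_all(cmds: Vec<Command>, deadline: Duration) -> Vec<String> {
    // Spawn everything first so the children run in parallel.
    let mut slots: Vec<Option<(Child, Instant)>> = cmds
        .into_iter()
        .map(|mut cmd| {
            cmd.stdin(Stdio::null())
                .stdout(Stdio::piped())
                .stderr(Stdio::null())
                .spawn()
                .ok()
                .map(|child| (child, Instant::now()))
        })
        .collect();
    let mut tails = vec![String::new(); slots.len()];

    while slots.iter().any(Option::is_some) {
        for (i, slot) in slots.iter_mut().enumerate() {
            let Some((child, spawned)) = slot.as_mut() else { continue };
            match child.try_wait() {
                Ok(Some(status)) => {
                    // Finished in time: keep stdout only on a zero exit.
                    if status.success() {
                        if let Some(mut pipe) = child.stdout.take() {
                            let _ = pipe.read_to_string(&mut tails[i]);
                        }
                    }
                    *slot = None;
                }
                // Still running and inside its own deadline: poll again later.
                Ok(None) if spawned.elapsed() < deadline => {}
                // Past its deadline (or try_wait failed): kill, reap, keep "".
                _ => {
                    let _ = child.kill();
                    let _ = child.wait();
                    *slot = None;
                }
            }
        }
        thread::sleep(Duration::from_millis(10));
    }
    tails
}
```

Because each slot carries its own `Instant`, one wedged child is killed once its own deadline passes and never delays the result slots of the others.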

Every call site already had a "non-zero exit -> empty / best-effort fallback" posture, so the timeout collapses onto the same path. No behavior change on the happy path.
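To illustrate why the timeout needs no new handling at the call sites, here is a small sketch of the "empty on failure" posture. `stdout_or_empty` is a hypothetical helper, and the runner is passed in as a closure only so that the example stands alone; in the crate the call sites would presumably call `output_with_deadline` directly.

```rust
use std::io;
use std::process::{Command, Output};
use std::time::Duration;

/// Sketch: a non-zero exit, a spawn error, and a timeout all map to "".
pub fn stdout_or_empty<R>(run: R, cmd: Command, deadline: Duration) -> String
where
    R: FnOnce(Command, Duration) -> io::Result<Output>,
{
    match run(cmd, deadline) {
        // Happy path: unchanged by the deadline wrapper.
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout).into_owned(),
        // Non-zero exit: the fallback the call sites already had.
        Ok(_) => String::new(),
        // Any io::Error, including ErrorKind::TimedOut, lands on the same fallback.
        Err(_) => String::new(),
    }
}
```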

Tests

Three new unit tests in subprocess::tests:

  • output_returns_quickly_when_command_finishes_fast (runs true, 2s deadline)
  • output_returns_timeout_error_for_sleep_longer_than_deadline (runs sleep 5, 100ms deadline)
  • child_is_reaped_on_timeout (runs sh -c "sleep 10", asserts kind=TimedOut + sanity-checks the reap primitive)
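The reap primitive that the third test sanity-checks can be sketched like this. `kill_and_reap` is a hypothetical name, not necessarily what subprocess.rs uses.

```rust
use std::io;
use std::process::{Child, ExitStatus};

/// Sketch: stop a child and collect its exit status so no zombie remains.
pub fn kill_and_reap(child: &mut Child) -> io::Result<ExitStatus> {
    // The kill result is ignored: the wait below settles the child's fate either way.
    let _ = child.kill();
    // wait() reaps the process; std caches the status on the Child afterwards.
    child.wait()
}
```

After this returns, a later `child.try_wait()` yields `Ok(Some(status))` rather than `Ok(None)`, which is one way for a test to confirm the child was actually reaped.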

Verification

  • cargo test -p fleet-data -> 69 passed, 0 failed (3 new)
  • cargo check --workspace -> clean
  • cargo clippy -p fleet-data --all-targets -> no new warnings

Test plan

  • Run the four dashboard binaries against a healthy tmux server and confirm no visible latency change.
  • Simulate a wedge (pkill -STOP tmux) during a dashboard tick and confirm the tick returns within ~500ms with a one-frame empty fleet instead of freezing.

…dline

The dashboards in fleet-data poll on a ~250ms tick and shell out to
external tools (tmux, agent-auth, git, gh) on the hot path. Any one of
those children could hang indefinitely - a wedged tmux server, an
agent-auth prompting for credentials, a gh call stuck on the network -
and freeze the tick along with it.

Add `subprocess::output_with_deadline`, a sync poll-loop wrapper around
`Command::output` that kills + reaps the child on expiry and returns
`io::ErrorKind::TimedOut`. Route every subprocess call site in this
crate through it:

- tmux read/write calls (`has_session`, `list_panes`, `capture_pane`,
  `display_message`, `select_window`, `set_pane_option`) use
  `TMUX_READ_DEADLINE` (500ms): generous enough to ride out a momentary
  tmux-server stall without freezing more than ~2 frames.
- `accounts::load_live` (`agent-auth list`), `git::branch_contains_pr`
  (`git merge-base`), and `git::open_prs_with_files` (`gh pr list`) use
  `HEAVY_CMD_DEADLINE` (2s): short enough that broken auth or stalled
  network drops out fast, long enough for a real `gh pr list --json
  files` over a slow link to complete.
- The concurrent `tmux capture-pane` fan-out in `panes::list_panes`
  applies the deadline per child (tracked from its own spawn time), not
  to the whole join, so one wedged pane can't take the whole batch with
  it; stragglers fall back to an empty scrollback tail.

Every call site already treated a non-zero exit as the empty / best-effort
fallback, so the timeout error collapses onto the same path.

Add three subprocess::tests cases covering the fast-finish, timeout, and
kill-and-reap paths.

Verification:
- cargo test -p fleet-data => 69 passed, 0 failed (3 new)
- cargo check --workspace => clean
- cargo clippy -p fleet-data --all-targets => no new warnings
@NagyVikt merged commit 68b7260 into main on May 16, 2026
2 checks passed