fix(fleet-data): bound external subprocess calls with wait-with-deadline by NagyVikt · Pull Request #164 · recodeee/codex-fleetui

NagyVikt · 2026-05-16T16:44:31Z

Summary

Wraps every external subprocess call in fleet-data with a sync wait-with-deadline so a hung child (network stall, auth prompt, wedged tmux) can no longer freeze the dashboards' 250ms tick.

Freeze vector being closed

Before this PR, each dashboard tick could shell out to tmux, agent-auth, git, or gh and block indefinitely on Command::output() / Child::wait_with_output(). A single wedged child (tmux server lock contention, agent-auth prompting for credentials, a stalled gh pr list over a slow network) would stall the entire UI until killed manually.

New module

rust/fleet-data/src/subprocess.rs exports:

pub fn output_with_deadline(
    cmd: std::process::Command,
    deadline: std::time::Duration,
) -> std::io::Result<std::process::Output>

Spawns the child, polls try_wait every 10ms, and on expiry kills and reaps it and returns an io::ErrorKind::TimedOut error. Sync on purpose (the crate is sync).
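For reviewers who want the shape of the loop without opening the diff, here is a minimal sketch of a function with this signature. It is not the code from subprocess.rs: the POLL_INTERVAL name, the stdin/stdout/stderr wiring, and the two pipe-draining threads are assumptions made so the example stands alone.

```rust
use std::io::{self, Read};
use std::process::{Command, Output, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical name for the 10ms poll interval mentioned in the description.
const POLL_INTERVAL: Duration = Duration::from_millis(10);

pub fn output_with_deadline(mut cmd: Command, deadline: Duration) -> io::Result<Output> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let started = Instant::now();

    // Assumption: drain both pipes on helper threads so a chatty child cannot
    // block on a full pipe buffer while we are busy polling.
    let mut out_pipe = child.stdout.take().expect("stdout was piped");
    let mut err_pipe = child.stderr.take().expect("stderr was piped");
    let out_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = out_pipe.read_to_end(&mut buf);
        buf
    });
    let err_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = err_pipe.read_to_end(&mut buf);
        buf
    });

    let status = loop {
        // try_wait never blocks: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            break status;
        }
        if started.elapsed() >= deadline {
            // Kill, then wait so the child is reaped and does not linger as a zombie.
            // The reader threads are left detached; they end when the pipes close.
            let _ = child.kill();
            let _ = child.wait();
            return Err(io::Error::new(
                io::ErrorKind::TimedOut,
                "subprocess exceeded its deadline",
            ));
        }
        thread::sleep(POLL_INTERVAL);
    };

    let stdout = out_reader.join().unwrap_or_default();
    let stderr = err_reader.join().unwrap_or_default();
    Ok(Output { status, stdout, stderr })
}
```

Returning a plain io::Result<Output> keeps the helper a drop-in replacement for Command::output(), so existing error handling at the call sites does not need a new error type.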

Two named deadline constants so call sites stay consistent:

TMUX_READ_DEADLINE = 500ms for tmux calls (the reads, plus the select_window and set_pane_option writes). Local socket; normal latency is single-digit ms. 500ms keeps a stalled tmux server contained to at most ~2 ticks of the 250ms loop.
HEAVY_CMD_DEADLINE = 2s for agent-auth list, git, gh. These touch disk / network / remote APIs. 2s is short enough to drop wedged calls fast, long enough for a real gh pr list --json files over a slow link.
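The "~2 ticks" figure is just the deadline divided by the tick interval. A small sketch of that arithmetic, assuming the constants are plain Duration values; TICK and ticks_blocked are hypothetical names used only for illustration:

```rust
use std::time::Duration;

/// Hypothetical constant for the dashboards' 250ms tick.
pub const TICK: Duration = Duration::from_millis(250);
pub const TMUX_READ_DEADLINE: Duration = Duration::from_millis(500);
pub const HEAVY_CMD_DEADLINE: Duration = Duration::from_secs(2);

/// Worst-case number of ticks one timed-out call can hold up, rounded up.
pub fn ticks_blocked(deadline: Duration) -> u128 {
    let tick = TICK.as_millis();
    (deadline.as_millis() + tick - 1) / tick
}
```

Under these assumptions a timed-out tmux call costs at most 2 ticks and a timed-out heavy command at most 8, which is the bounded worst case this PR trades for the previous unbounded stall.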

Call sites converted

tmux.rs: has_session, list_panes, capture_pane, display_message, select_window, set_pane_option (TMUX_READ_DEADLINE)
accounts.rs: load_live (HEAVY_CMD_DEADLINE)
git.rs: branch_contains_pr, open_prs_with_files (HEAVY_CMD_DEADLINE)
panes.rs: the concurrent tmux capture-pane fan-out applies its deadline per child (timed from each child's own spawn instant), not to the whole join. Slow stragglers are killed and reaped; their pane falls back to an empty scrollback tail (which classify reads as Idle).
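The per-child deadline in the panes.rs fan-out can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: fan_out is a hypothetical name, and stdout is read only after the child exits, which is fine for small outputs but would stall a child that writes more than the OS pipe buffer holds.

```rust
use std::io::Read;
use std::process::{Child, Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs all commands concurrently. Each child gets `deadline` measured from
/// its own spawn instant. Returns one stdout string per command; children that
/// fail to spawn, exit non-zero, or time out yield an empty string.
pub fn fan_out(cmds: Vec<Command>, deadline: Duration) -> Vec<String> {
    // Spawn everything up front so the children run in parallel.
    let mut running: Vec<Option<(Child, Instant)>> = cmds
        .into_iter()
        .map(|mut cmd| {
            cmd.stdin(Stdio::null())
                .stdout(Stdio::piped())
                .stderr(Stdio::null());
            cmd.spawn().ok().map(|child| (child, Instant::now()))
        })
        .collect();
    let mut results = vec![String::new(); running.len()];

    while running.iter().any(Option::is_some) {
        for (slot, result) in running.iter_mut().zip(results.iter_mut()) {
            let Some((child, spawned)) = slot.as_mut() else { continue };
            match child.try_wait() {
                Ok(Some(status)) => {
                    // Exited: keep stdout only on success, otherwise stay empty.
                    if status.success() {
                        if let Some(mut pipe) = child.stdout.take() {
                            let _ = pipe.read_to_string(result);
                        }
                    }
                    *slot = None;
                }
                // Still running and inside its own budget: check again later.
                Ok(None) if spawned.elapsed() < deadline => {}
                _ => {
                    // Straggler (or wait error): kill, reap, keep the empty fallback.
                    let _ = child.kill();
                    let _ = child.wait();
                    *slot = None;
                }
            }
        }
        thread::sleep(Duration::from_millis(10));
    }
    results
}
```

Because each budget starts at that child's own spawn instant, one wedged pane costs the batch at most one deadline, and the panes that finished on time keep their captured output.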

Every call site already had a "non-zero exit -> empty / best-effort fallback" posture, so the timeout collapses onto the same path. No behavior change on the happy path.
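A sketch of why the timeout needs no new branch at the call sites: if a caller already maps "anything but a successful exit" to an empty result, an Err carrying ErrorKind::TimedOut lands in the same arm. The stdout_or_empty helper is hypothetical and only illustrates the pattern.

```rust
use std::io;
use std::process::Output;

/// Collapses every failure mode onto the same empty fallback: spawn errors,
/// `ErrorKind::TimedOut` from a deadline wrapper, and non-zero exits.
pub fn stdout_or_empty(result: io::Result<Output>) -> String {
    match result {
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout).into_owned(),
        _ => String::new(),
    }
}
```

A call site written this way cannot distinguish "timed out" from "failed", which matches the best-effort posture described above.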

Tests

Three new unit tests in subprocess::tests:

output_returns_quickly_when_command_finishes_fast (runs true, 2s deadline)
output_returns_timeout_error_for_sleep_longer_than_deadline (runs sleep 5, 100ms deadline)
child_is_reaped_on_timeout (runs sh -c "sleep 10", asserts kind=TimedOut + sanity-checks the reap primitive)

Verification

cargo test -p fleet-data -> 69 passed, 0 failed (3 new)
cargo check --workspace -> clean
cargo clippy -p fleet-data --all-targets -> no new warnings

Test plan

Run the four dashboard binaries against a healthy tmux server and confirm no visible latency change.
Simulate a wedge (pkill -STOP tmux) during a dashboard tick and confirm the tick returns within ~500ms with a one-frame empty fleet instead of freezing.

…dline

The dashboards in fleet-data poll on a ~250ms tick and shell out to external tools (tmux, agent-auth, git, gh) on the hot path. Any one of those children could hang indefinitely - a wedged tmux server, an agent-auth prompting for credentials, a gh call stuck on the network - and freeze the tick along with it.

Add `subprocess::output_with_deadline`, a sync poll-loop wrapper around `Command::output` that kills + reaps the child on expiry and returns `io::ErrorKind::TimedOut`. Route every subprocess call site in this crate through it:

- tmux read/write calls (`has_session`, `list_panes`, `capture_pane`, `display_message`, `select_window`, `set_pane_option`) use `TMUX_READ_DEADLINE` (500ms): generous enough to ride out a momentary tmux-server stall without freezing more than ~2 frames.
- `accounts::load_live` (`agent-auth list`), `git::branch_contains_pr` (`git merge-base`), and `git::open_prs_with_files` (`gh pr list`) use `HEAVY_CMD_DEADLINE` (2s): short enough that broken auth or stalled network drops out fast, long enough for a real `gh pr list --json files` over a slow link to complete.
- The concurrent `tmux capture-pane` fan-out in `panes::list_panes` applies the deadline per child (tracked from its own spawn time), not to the whole join, so one wedged pane can't take the whole batch with it; stragglers fall back to an empty scrollback tail.

Every call site already treated a non-zero exit as the empty / best-effort fallback, so the timeout error collapses onto the same path.

Add three subprocess::tests cases covering the fast-finish, timeout, and kill-and-reap paths.

Verification:
- cargo test -p fleet-data => 69 passed, 0 failed (3 new)
- cargo check --workspace => clean
- cargo clippy -p fleet-data --all-targets => no new warnings

NagyVikt merged commit 68b7260 into main May 16, 2026
2 checks passed

NagyVikt merged 1 commit into main from fix/fleet-data-subprocess-timeouts
