fix(fleet-data): bound external subprocess calls with wait-with-deadline#164
Merged
Conversation
…dline The dashboards in fleet-data poll on a ~250ms tick and shell out to external tools (tmux, agent-auth, git, gh) on the hot path. Any one of those children could hang indefinitely - a wedged tmux server, an agent-auth prompting for credentials, a gh call stuck on the network - and freeze the tick along with it. Add `subprocess::output_with_deadline`, a sync poll-loop wrapper around `Command::output` that kills + reaps the child on expiry and returns `io::ErrorKind::TimedOut`. Route every subprocess call site in this crate through it: - tmux read/write calls (`has_session`, `list_panes`, `capture_pane`, `display_message`, `select_window`, `set_pane_option`) use `TMUX_READ_DEADLINE` (500ms): generous enough to ride out a momentary tmux-server stall without freezing more than ~2 frames. - `accounts::load_live` (`agent-auth list`), `git::branch_contains_pr` (`git merge-base`), and `git::open_prs_with_files` (`gh pr list`) use `HEAVY_CMD_DEADLINE` (2s): short enough that broken auth or stalled network drops out fast, long enough for a real `gh pr list --json files` over a slow link to complete. - The concurrent `tmux capture-pane` fan-out in `panes::list_panes` applies the deadline per child (tracked from its own spawn time), not to the whole join, so one wedged pane can't take the whole batch with it; stragglers fall back to an empty scrollback tail. Every call site already treated a non-zero exit as the empty / best-effort fallback, so the timeout error collapses onto the same path. Add three subprocess::tests cases covering the fast-finish, timeout, and kill-and-reap paths. Verification: - cargo test -p fleet-data => 69 passed, 0 failed (3 new) - cargo check --workspace => clean - cargo clippy -p fleet-data --all-targets => no new warnings
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wraps every external subprocess call in
fleet-datawith a sync wait-with-deadline so a hung child (network stall, auth prompt, wedged tmux) can no longer freeze the dashboards' 250ms tick.Freeze vector being closed
Before this PR, each dashboard tick could shell out to
tmux,agent-auth,git, orghand block indefinitely onCommand::output()/Child::wait_with_output(). A single wedged child (tmux server lock contention,agent-authprompting for credentials, a stalledgh pr listover a slow network) would stall the entire UI until killed manually.New module
rust/fleet-data/src/subprocess.rsexports:Spawns, polls
try_waiton a 10ms loop, kills + reaps on expiry, returnsio::ErrorKind::TimedOut. Sync on purpose (the crate is sync).Two named deadline constants so call sites stay consistent:
TMUX_READ_DEADLINE = 500msfor tmux read-only calls. Local socket; normal latency is single-digit ms. 500ms keeps a stalled tmux server contained to at most ~2 ticks.HEAVY_CMD_DEADLINE = 2sforagent-auth list,git,gh. These touch disk / network / remote APIs. 2s is short enough to drop wedged calls fast, long enough for a realgh pr list --json filesover a slow link.Call sites converted
tmux.rs:has_session,list_panes,capture_pane,display_message,select_window,set_pane_option(TMUX_READ_DEADLINE)accounts.rs:load_live(HEAVY_CMD_DEADLINE)git.rs:branch_contains_pr,open_prs_with_files(HEAVY_CMD_DEADLINE)panes.rs: the concurrenttmux capture-panefan-out applies its deadline per child (timed from each child's own spawn instant), not to the whole join. Slow stragglers are killed and reaped; their pane falls back to an empty scrollback tail (whichclassifyreads as Idle).Every call site already had a "non-zero exit -> empty / best-effort fallback" posture, so the timeout collapses onto the same path. No behavior change on the happy path.
Tests
Three new unit tests in
subprocess::tests:output_returns_quickly_when_command_finishes_fast(runstrue, 2s deadline)output_returns_timeout_error_for_sleep_longer_than_deadline(runssleep 5, 100ms deadline)child_is_reaped_on_timeout(runssh -c "sleep 10", asserts kind=TimedOut + sanity-checks the reap primitive)Verification
cargo test -p fleet-data-> 69 passed, 0 failed (3 new)cargo check --workspace-> cleancargo clippy -p fleet-data --all-targets-> no new warningsTest plan
pkill -STOP tmux) during a dashboard tick and confirm the tick returns within ~500ms with a one-frame empty fleet instead of freezing.