uts: stop the base-station feed from silently dying #74

Merged
haoruizhou merged 2 commits into main from uts-telemetry-audit on Jun 7, 2026

@haoruizhou

Two fixes for the "telemetry stops after a few random minutes" failure on
the MacBook base station.

websocket_bridge.redis_listener: wrap the pub/sub loop in a reconnect loop
with health_check_interval. Previously a single Redis connection blip (idle
timeout, transient Docker-bridge hiccup, Redis restart) made the listener
coroutine return for good while the WebSocket server kept running — PECAN
stayed connected but never received another frame, with no error surfaced.
ws_relay already reconnects this way; redis_listener now matches.
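
The reconnect pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `listen_forever`, `subscribe`, `handle`, and the backoff parameters are hypothetical names. It assumes `subscribe` opens a fresh connection and returns an async iterator of messages each time it is called. With redis-py, that factory would presumably build the client with `health_check_interval` set and iterate `pubsub.listen()`, and `retry_on` would need to include `redis.exceptions.ConnectionError`.

```python
import asyncio
import logging
from typing import AsyncIterator, Awaitable, Callable, Tuple, Type

log = logging.getLogger("redis_listener")


async def listen_forever(
    subscribe: Callable[[], AsyncIterator[bytes]],
    handle: Callable[[bytes], Awaitable[None]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, OSError, asyncio.TimeoutError),
    retry_delay: float = 1.0,
    max_delay: float = 30.0,
) -> None:
    """Forward pub/sub messages to `handle`, resubscribing after any drop."""
    delay = retry_delay
    while True:
        try:
            # `subscribe` (hypothetical) opens a new connection on every call.
            async for message in subscribe():
                delay = retry_delay  # traffic is flowing again: reset backoff
                await handle(message)
            # A stream that ends quietly is treated like a dropped connection.
            log.warning("pub/sub stream ended; resubscribing in %.1fs", delay)
        except retry_on as exc:
            log.warning("pub/sub connection lost (%s); retrying in %.1fs", exc, delay)
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)  # capped exponential backoff
```

The key property is that the coroutine only exits on cancellation, so a connection blip is logged and retried instead of ending the listener while the WebSocket server keeps running.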

main.py: the child-process monitor only logged "Process X died!" once per
second forever and never recovered. Because the parent stayed alive, neither
Docker's `restart: unless-stopped` nor systemd's `Restart=always` ever saw the
failure. Now a dead child tears down the surviving children and exits non-zero
so the supervisor restarts the whole stack cleanly.
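
A rough sketch of the fail-fast monitor described above, under stated assumptions: `supervise`, `poll_interval`, and `grace` are hypothetical names, and the children are assumed to expose the `multiprocessing.Process` methods `is_alive()`, `terminate()`, `kill()`, and `join()`. The actual change in `main.py` may differ in detail.

```python
import time


def supervise(processes, poll_interval=1.0, grace=5.0):
    """Block until any child dies, stop the survivors, and return an exit code."""
    while True:
        dead = [p for p in processes if not p.is_alive()]
        if dead:
            break
        time.sleep(poll_interval)

    # Ask the surviving children to stop.
    for p in processes:
        if p.is_alive():
            p.terminate()

    # Give each child `grace` seconds to exit, then force-kill stragglers.
    for p in processes:
        p.join(timeout=grace)
        if p.is_alive():
            p.kill()
            p.join()

    # Non-zero so Docker or systemd sees a failure and restarts the stack.
    return 1
```

The parent would then finish with something like `sys.exit(supervise(processes))`, which is what makes the failure visible to the supervisor.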

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60fd029649

# silently keeps running half-dead. Fail fast instead: tear down the
# surviving children and exit non-zero so the supervisor restarts the
# whole stack cleanly.
dead = [p for p in processes if not p.is_alive()]

P2: Avoid restarting telemetry for optional video failures

Because this check treats any child death as fatal, an optional media process can now take down the core telemetry stack. main.py starts Video whenever ENABLE_VIDEO is true, the car systemd unit sets ENABLE_VIDEO=true, and run_video() returns when GStreamer reports an error. In a camera-failure or misconfigured-GStreamer scenario, the parent exits non-zero and systemd restarts the whole service repeatedly, so CAN telemetry never stays up even though only the optional video feed failed. Consider limiting fail-fast to critical children or making optional child failures non-fatal.
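
One way to limit fail-fast to critical children, sketched under assumptions: `classify_dead` and `optional_names` are hypothetical, and the children are assumed to carry a `name` attribute (as `multiprocessing.Process` objects do), with the video child named `"Video"`.

```python
def classify_dead(processes, optional_names):
    """Split dead children into (fatal, tolerated) by whether they are optional."""
    fatal, tolerated = [], []
    for p in processes:
        if p.is_alive():
            continue
        if p.name in optional_names:
            tolerated.append(p)  # e.g. the video feed: log it, keep telemetry up
        else:
            fatal.append(p)  # a critical child: tear down and exit non-zero
    return fatal, tolerated
```

The monitor loop could call `classify_dead(processes, {"Video"})` on each tick, exit non-zero only when `fatal` is non-empty, and log `tolerated` entries once rather than on every poll.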

@haoruizhou haoruizhou merged commit b7fbc39 into main Jun 7, 2026
40 checks passed
@haoruizhou haoruizhou deleted the uts-telemetry-audit branch June 7, 2026 18:01