uts: stop the base-station feed from silently dying #74
Two fixes for the "telemetry stops after a few random minutes" failure on the MacBook base station.

websocket_bridge.redis_listener: wrap the pub/sub loop in a reconnect loop with health_check_interval. Previously a single Redis connection blip (idle timeout, transient Docker-bridge hiccup, Redis restart) made the listener coroutine return for good while the WebSocket server kept running. PECAN stayed connected but never received another frame, with no error surfaced. ws_relay already reconnects this way; redis_listener now matches.

main.py: the child-process monitor only logged "Process X died!" once per second forever and never recovered. Because the parent stayed alive, neither Docker's `restart: unless-stopped` nor systemd's `Restart=always` ever saw the failure. Now a dead child tears down the surviving children and exits non-zero so the supervisor restarts the whole stack cleanly.
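For readers who want the shape of the reconnect loop, here is a minimal sketch. It is not the code in this PR: `listen_forever`, `listen_once`, `retry_delay`, and the default `errors` tuple are hypothetical names chosen for illustration, and the real listener presumably talks to Redis directly.

```python
import asyncio
import logging

logger = logging.getLogger("redis_listener")


async def listen_forever(listen_once, retry_delay=1.0,
                         errors=(ConnectionError, OSError)):
    """Keep a pub/sub listener alive across connection blips.

    listen_once is a hypothetical coroutine function that opens a fresh
    connection, subscribes, and forwards messages until the connection ends.
    With redis-py, errors would likely also need to include
    redis.exceptions.ConnectionError and redis.exceptions.TimeoutError, and
    the client would be created with health_check_interval set (assumption:
    the exact wiring in websocket_bridge may differ).
    """
    while True:
        try:
            await listen_once()
            # A clean return still means the subscription is gone.
            logger.warning("listener returned; reconnecting in %.1fs", retry_delay)
        except errors as exc:
            logger.warning("listener failed (%s); reconnecting in %.1fs",
                           exc, retry_delay)
        # Pause briefly so a Redis outage does not turn into a busy loop.
        await asyncio.sleep(retry_delay)
```

Task cancellation still works in this sketch: `asyncio.CancelledError` is not in `errors`, so cancelling the task ends the loop rather than triggering a reconnect.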
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60fd029649
    # silently keeps running half-dead. Fail fast instead: tear down the
    # surviving children and exit non-zero so the supervisor restarts the
    # whole stack cleanly.
    dead = [p for p in processes if not p.is_alive()]
Avoid restarting telemetry for optional video failures
With this check treating any child death as fatal, an optional media process can now take down the core telemetry stack. main.py starts Video whenever ENABLE_VIDEO is true, the car systemd unit sets ENABLE_VIDEO=true, and run_video() returns when GStreamer reports an error. So when the camera fails or GStreamer is misconfigured, the parent exits non-zero and systemd restarts the whole service repeatedly; CAN telemetry never stays up even though only the optional video feed failed. Consider limiting fail-fast to critical children or making optional child failures non-fatal.
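One possible shape for that suggestion is sketched below. This is an illustration under assumptions, not the repository's code: `OPTIONAL_CHILDREN`, `split_dead`, `monitor`, and `poll_interval` are hypothetical names, and it assumes each child is a `multiprocessing.Process` (or similar) whose `name` identifies it, with the video child named "Video".

```python
import logging
import sys
import time

logger = logging.getLogger("monitor")

# Assumption: children are told apart by Process.name, and only the
# video child is optional.
OPTIONAL_CHILDREN = {"Video"}


def split_dead(processes, optional=OPTIONAL_CHILDREN):
    """Return (critical_dead, optional_dead) for children that have exited."""
    dead = [p for p in processes if not p.is_alive()]
    critical = [p for p in dead if p.name not in optional]
    optional_dead = [p for p in dead if p.name in optional]
    return critical, optional_dead


def monitor(processes, poll_interval=1.0):
    """Fail fast on a dead critical child; only warn about optional ones."""
    reported = set()
    while True:
        critical, optional_dead = split_dead(processes)
        for p in optional_dead:
            if p.name not in reported:  # log once, not every poll
                logger.warning("optional process %s died; continuing", p.name)
                reported.add(p.name)
        if critical:
            names = ", ".join(p.name for p in critical)
            logger.error("critical process(es) died: %s; shutting down", names)
            for p in processes:
                if p.is_alive():
                    p.terminate()
            for p in processes:
                p.join(timeout=5)
            # Non-zero exit lets Docker or systemd restart the stack.
            sys.exit(1)
        time.sleep(poll_interval)
```

With this split, a failed video feed is logged once and telemetry keeps running, while the death of any other child still triggers the teardown and non-zero exit this PR introduces.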