fix(status): self-heal stale shm file and fail fast on init error#464
fix(status): self-heal stale shm file and fail fast on init error#464
Conversation
The shared memory file at /tmp/rtp2httpd_status_<pid> is keyed solely on the supervisor PID, so a previous instance that exited abnormally (SIGKILL, OOM, power loss) leaves a stale file behind. When PID reuse hits — common in containers with restart=always (supervisor is always PID 1), embedded systems with small pid_max, or any environment where /tmp is persistent — status_init() failed with EEXIST and the supervisor "continued anyway", forking workers that immediately segfaulted because most code paths dereference status_shared without NULL checks (zerocopy_init, buffer_pool, status_*, worker disconnect handling, etc.). Two fixes: 1. status_init() retries open(O_EXCL) once after unlinking on EEXIST. The path is bound to our own PID and no other live process in the same PID namespace can hold that PID, so the leftover is safe to remove. 2. main() now treats status_init() failure as fatal and exits, instead of producing a SIGSEGV restart loop that hits the supervisor's restart rate limiter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net |
There was a problem hiding this comment.
Pull request overview
This PR aims to harden the status-tracking startup path in rtp2httpd’s supervisor/worker shared-memory setup by recovering from stale /tmp shm files and aborting startup instead of entering a worker crash loop when status initialization fails.
Changes:
- Add an
EEXISTrecovery path instatus_init()that unlinks a presumed stale status file and retries creation once. - Change
main()to treatstatus_init()failure as fatal and exit immediately instead of continuing into supervisor startup. - Expand inline comments to document the PID-based shm filename behavior and the rationale for failing fast.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/status.c |
Adds stale shm-file unlink-and-retry logic during status shared-memory initialization. |
src/rtp2httpd.c |
Changes startup flow so status initialization failure aborts process startup immediately. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Drop rot-prone references (specific call sites) and PR-context examples (OpenWrt, container scenarios). Keep only the non-obvious WHY: the PID-namespace invariant that justifies unlink-and-retry, and the reason status_init failure must be fatal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net |
logger() in utils.c gates on status_shared being non-NULL, but mmap() sets it to MAP_FAILED ((void*)-1) on failure, and the pipe-creation failure path calls munmap() without clearing the pointer. Either case turned the subsequent error log into a SIGSEGV, defeating the fail-fast behavior added in the previous commit. - Stage mmap() into a local until success, then assign status_shared. - Reset status_shared to NULL after munmap() in the pipe-failure path. Reported by Copilot review on PR #464. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two issues raised in PR #464 review: 1. Cross-namespace PID collision: the path /tmp/rtp2httpd_status_<pid> is only unique within a PID namespace. Two containers sharing /tmp can legitimately have the same PID, and the previous unlink-on-EEXIST recovery would silently destroy the sibling instance's directory entry, causing later cross-cleanup races. Replace the heuristic with an exclusive non-blocking flock(2): on EEXIST, open the existing file and try to lock it. Lock acquired ⇒ no live owner ⇒ stale, safe to unlink. Lock busy ⇒ refuse with a clear error rather than corrupting the sibling. The fresh fd retains the flock for the lifetime of the daemon and is inherited by workers (per-OFD lock semantics keep it held until the last process closes its copy), so future recovery attempts can detect us. 2. Pipe-creation failure path used logger() with status_shared non-NULL but log_mutex still uninitialized, so logger() → status_add_log_entry → pthread_mutex_lock on a zero mutex (UB; not portable). Stage all initialization on a local pointer and only assign status_shared after the struct is fully usable (mutexes init'd, pipes created). All failure paths munmap the local and return with status_shared still NULL, so logger() falls through its NULL guard. Reported by Copilot review on PR #464. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net |
…lish" This reverts commit e5ad6f6.
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net |
Summary
Fixes the supervisor crash loop that occasionally appears in user reports:
Root cause
status_init()creates/tmp/rtp2httpd_status_<pid>withO_EXCL, keyed solely on the supervisor's PID. A previous instance that exited abnormally (SIGKILL, OOM-killer, power loss, kernel panic) leaves the file behind becausestatus_cleanup()never ran. When the new supervisor happens to get the same PID,open(O_EXCL)fails withEEXIST.This is rare on a desktop Linux box but very plausible in:
restart=always— the supervisor is always PID 1, so a crashed PID 1 + auto-restart is essentially guaranteed to hit this.pid_max, persistent (non-tmpfs)/tmp./tmpsurvives reboots.The previous behavior was to log "Failed to initialize status tracking" and "continue anyway — status page won't work but streaming will". But that comment was wrong: ~100 sites across
zerocopy.c,buffer_pool.c,status.c,worker.c, etc. dereferencestatus_shared->...without a NULL check. So workers immediately segfault, the supervisor restarts them, they segfault again, and we hit the restart rate limiter.Fix
status_init()— onEEXIST, unlink and retry once. The path is bound to our own PID and no other live process in the same PID namespace can hold that PID, so the leftover is safe to remove.main()— ifstatus_init()still fails (e.g./tmpnot writable), logLOG_FATALand exit with code 1 instead of forking workers that will segfault immediately.Test plan
Verified locally with three scenarios:
/tmp/rtp2httpd_status_<pid>, clean shutdown removes it.Stale shared memory file ... found, removing and retryingand starts normally.unlink()fails withEISDIR; daemon logsFailed to initialize status tracking, exitingand exits with code 1, no worker fork loop.🤖 Generated with Claude Code