fix(status): self-heal stale shm file and fail fast on init error by stackia · Pull Request #464 · stackia/rtp2httpd

stackia · 2026-05-05T08:05:55Z

Summary

Fixes the supervisor crash loop that occasionally appears in user reports:

[Supervisor] Failed to create shared memory file: File exists
[Supervisor] Failed to initialize status tracking
[Supervisor] Worker 0 (pid 7) killed by signal 11
[Supervisor] Worker 1 (pid 8) killed by signal 11
...
[Supervisor] Worker 0 restart rate limited (3 restarts in 5 seconds)

Root cause

status_init() creates /tmp/rtp2httpd_status_<pid> with O_EXCL, keyed solely on the supervisor's PID. A previous instance that exited abnormally (SIGKILL, OOM-killer, power loss, kernel panic) leaves the file behind because status_cleanup() never ran. When the new supervisor happens to get the same PID, open(O_EXCL) fails with EEXIST.

This is rare on a desktop Linux box but very plausible in:

Containers with restart=always — the supervisor is always PID 1, so a crashed PID 1 + auto-restart is essentially guaranteed to hit this.
OpenWrt / embedded — small pid_max, persistent (non-tmpfs) /tmp.
Any environment where /tmp survives reboots.

The previous behavior was to log "Failed to initialize status tracking" and "continue anyway — status page won't work but streaming will". But that comment was wrong: ~100 sites across zerocopy.c, buffer_pool.c, status.c, worker.c, etc. dereference status_shared->... without a NULL check. So workers immediately segfault, the supervisor restarts them, they segfault again, and we hit the restart rate limiter.

Fix

Self-heal in status_init() — on EEXIST, unlink and retry once. The path is bound to our own PID and no other live process in the same PID namespace can hold that PID, so the leftover is safe to remove.
Fail fast in main() — if status_init() still fails (e.g. /tmp not writable), log LOG_FATAL and exit with code 1 instead of forking workers that will segfault immediately.

Test plan

Verified locally with three scenarios:

Happy path — fresh start creates /tmp/rtp2httpd_status_<pid>, clean shutdown removes it.
Self-heal — pre-create a stale file at the path the daemon will use; daemon logs Stale shared memory file ... found, removing and retrying and starts normally.
Fatal exit — replace the path with a directory so unlink() fails with EISDIR; daemon logs Failed to initialize status tracking, exiting and exits with code 1, no worker fork loop.

🤖 Generated with Claude Code

The shared memory file at /tmp/rtp2httpd_status_<pid> is keyed solely on the supervisor PID, so a previous instance that exited abnormally (SIGKILL, OOM, power loss) leaves a stale file behind. When PID reuse hits — common in containers with restart=always (supervisor is always PID 1), embedded systems with small pid_max, or any environment where /tmp is persistent — status_init() failed with EEXIST and the supervisor "continued anyway", forking workers that immediately segfaulted because most code paths dereference status_shared without NULL checks (zerocopy_init, buffer_pool, status_*, worker disconnect handling, etc.). Two fixes: 1. status_init() retries open(O_EXCL) once after unlinking on EEXIST. The path is bound to our own PID and no other live process in the same PID namespace can hold that PID, so the leftover is safe to remove. 2. main() now treats status_init() failure as fatal and exits, instead of producing a SIGSEGV restart loop that hits the supervisor's restart rate limiter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-05T08:07:03Z

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

Copilot

Pull request overview

This PR aims to harden the status-tracking startup path in rtp2httpd’s supervisor/worker shared-memory setup by recovering from stale /tmp shm files and aborting startup instead of entering a worker crash loop when status initialization fails.

Changes:

Add an EEXIST recovery path in status_init() that unlinks a presumed stale status file and retries creation once.
Change main() to treat status_init() failure as fatal and exit immediately instead of continuing into supervisor startup.
Expand inline comments to document the PID-based shm filename behavior and the rationale for failing fast.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`src/status.c`	Adds stale shm-file unlink-and-retry logic during status shared-memory initialization.
`src/rtp2httpd.c`	Changes startup flow so status initialization failure aborts process startup immediately.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Drop rot-prone references (specific call sites) and PR-context examples (OpenWrt, container scenarios). Keep only the non-obvious WHY: the PID-namespace invariant that justifies unlink-and-retry, and the reason status_init failure must be fatal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-05T08:12:28Z

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

logger() in utils.c gates on status_shared being non-NULL, but mmap() sets it to MAP_FAILED ((void*)-1) on failure, and the pipe-creation failure path calls munmap() without clearing the pointer. Either case turned the subsequent error log into a SIGSEGV, defeating the fail-fast behavior added in the previous commit. - Stage mmap() into a local until success, then assign status_shared. - Reset status_shared to NULL after munmap() in the pipe-failure path. Reported by Copilot review on PR #464. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-05T08:15:48Z

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two issues raised in PR #464 review: 1. Cross-namespace PID collision: the path /tmp/rtp2httpd_status_<pid> is only unique within a PID namespace. Two containers sharing /tmp can legitimately have the same PID, and the previous unlink-on-EEXIST recovery would silently destroy the sibling instance's directory entry, causing later cross-cleanup races. Replace the heuristic with an exclusive non-blocking flock(2): on EEXIST, open the existing file and try to lock it. Lock acquired ⇒ no live owner ⇒ stale, safe to unlink. Lock busy ⇒ refuse with a clear error rather than corrupting the sibling. The fresh fd retains the flock for the lifetime of the daemon and is inherited by workers (per-OFD lock semantics keep it held until the last process closes its copy), so future recovery attempts can detect us. 2. Pipe-creation failure path used logger() with status_shared non-NULL but log_mutex still uninitialized, so logger() → status_add_log_entry → pthread_mutex_lock on a zero mutex (UB; not portable). Stage all initialization on a local pointer and only assign status_shared after the struct is fully usable (mutexes init'd, pipes created). All failure paths munmap the local and return with status_shared still NULL, so logger() falls through its NULL guard. Reported by Copilot review on PR #464. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-05T08:30:48Z

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

…lish" This reverts commit e5ad6f6.

github-actions · 2026-05-05T08:34:31Z

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

Copilot AI review requested due to automatic review settings May 5, 2026 08:05

Copilot started reviewing on behalf of stackia May 5, 2026 08:06 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

Comment thread src/status.c

Comment thread src/status.c Outdated

Comment thread src/rtp2httpd.c

stackia requested a review from Copilot May 5, 2026 08:17

Copilot started reviewing on behalf of stackia May 5, 2026 08:17 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

Comment thread src/status.c

Comment thread src/status.c

Revert "fix(status): flock-based liveness check; full init before pub…

93a728f

…lish" This reverts commit e5ad6f6.

stackia merged commit eec67dc into main May 5, 2026
5 checks passed

stackia deleted the worktree-fix-shm-eexist branch May 5, 2026 08:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(status): self-heal stale shm file and fail fast on init error#464

fix(status): self-heal stale shm file and fail fast on init error#464
stackia merged 5 commits intomainfrom
worktree-fix-shm-eexist

stackia commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stackia commented May 5, 2026

Summary

Root cause

Fix

Test plan

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants