Skip to content

fix(status): self-heal stale shm file and fail fast on init error#464

Merged
stackia merged 5 commits intomainfrom
worktree-fix-shm-eexist
May 5, 2026
Merged

fix(status): self-heal stale shm file and fail fast on init error#464
stackia merged 5 commits intomainfrom
worktree-fix-shm-eexist

Conversation

@stackia
Copy link
Copy Markdown
Owner

@stackia stackia commented May 5, 2026

Summary

Fixes the supervisor crash loop that occasionally appears in user reports:

[Supervisor] Failed to create shared memory file: File exists
[Supervisor] Failed to initialize status tracking
[Supervisor] Worker 0 (pid 7) killed by signal 11
[Supervisor] Worker 1 (pid 8) killed by signal 11
...
[Supervisor] Worker 0 restart rate limited (3 restarts in 5 seconds)

Root cause

status_init() creates /tmp/rtp2httpd_status_<pid> with O_EXCL, keyed solely on the supervisor's PID. A previous instance that exited abnormally (SIGKILL, OOM-killer, power loss, kernel panic) leaves the file behind because status_cleanup() never ran. When the new supervisor happens to get the same PID, open(O_EXCL) fails with EEXIST.

This is rare on a desktop Linux box but very plausible in:

  • Containers with restart=always — the supervisor is always PID 1, so a crashed PID 1 + auto-restart is essentially guaranteed to hit this.
  • OpenWrt / embedded — small pid_max, persistent (non-tmpfs) /tmp.
  • Any environment where /tmp survives reboots.

The previous behavior was to log "Failed to initialize status tracking" and "continue anyway — status page won't work but streaming will". But that comment was wrong: ~100 sites across zerocopy.c, buffer_pool.c, status.c, worker.c, etc. dereference status_shared->... without a NULL check. So workers immediately segfault, the supervisor restarts them, they segfault again, and we hit the restart rate limiter.

Fix

  1. Self-heal in status_init() — on EEXIST, unlink and retry once. The path is bound to our own PID and no other live process in the same PID namespace can hold that PID, so the leftover is safe to remove.
  2. Fail fast in main() — if status_init() still fails (e.g. /tmp not writable), log LOG_FATAL and exit with code 1 instead of forking workers that will segfault immediately.

Test plan

Verified locally with three scenarios:

  • Happy path — fresh start creates /tmp/rtp2httpd_status_<pid>, clean shutdown removes it.
  • Self-heal — pre-create a stale file at the path the daemon will use; daemon logs Stale shared memory file ... found, removing and retrying and starts normally.
  • Fatal exit — replace the path with a directory so unlink() fails with EISDIR; daemon logs Failed to initialize status tracking, exiting and exits with code 1, no worker fork loop.

🤖 Generated with Claude Code

The shared memory file at /tmp/rtp2httpd_status_<pid> is keyed solely on
the supervisor PID, so a previous instance that exited abnormally (SIGKILL,
OOM, power loss) leaves a stale file behind. When PID reuse hits — common
in containers with restart=always (supervisor is always PID 1), embedded
systems with small pid_max, or any environment where /tmp is persistent —
status_init() failed with EEXIST and the supervisor "continued anyway",
forking workers that immediately segfaulted because most code paths
dereference status_shared without NULL checks (zerocopy_init,
buffer_pool, status_*, worker disconnect handling, etc.).

Two fixes:

1. status_init() retries open(O_EXCL) once after unlinking on EEXIST.
   The path is bound to our own PID and no other live process in the same
   PID namespace can hold that PID, so the leftover is safe to remove.

2. main() now treats status_init() failure as fatal and exits, instead of
   producing a SIGSEGV restart loop that hits the supervisor's restart
   rate limiter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 08:05
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to harden the status-tracking startup path in rtp2httpd’s supervisor/worker shared-memory setup by recovering from stale /tmp shm files and aborting startup instead of entering a worker crash loop when status initialization fails.

Changes:

  • Add an EEXIST recovery path in status_init() that unlinks a presumed stale status file and retries creation once.
  • Change main() to treat status_init() failure as fatal and exit immediately instead of continuing into supervisor startup.
  • Expand inline comments to document the PID-based shm filename behavior and the rationale for failing fast.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/status.c Adds stale shm-file unlink-and-retry logic during status shared-memory initialization.
src/rtp2httpd.c Changes startup flow so status initialization failure aborts process startup immediately.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/status.c
Comment thread src/status.c Outdated
Comment thread src/rtp2httpd.c
Drop rot-prone references (specific call sites) and PR-context examples
(OpenWrt, container scenarios). Keep only the non-obvious WHY: the
PID-namespace invariant that justifies unlink-and-retry, and the reason
status_init failure must be fatal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

logger() in utils.c gates on status_shared being non-NULL, but mmap()
sets it to MAP_FAILED ((void*)-1) on failure, and the pipe-creation
failure path calls munmap() without clearing the pointer. Either case
turned the subsequent error log into a SIGSEGV, defeating the fail-fast
behavior added in the previous commit.

- Stage mmap() into a local until success, then assign status_shared.
- Reset status_shared to NULL after munmap() in the pipe-failure path.

Reported by Copilot review on PR #464.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/status.c
Comment thread src/status.c
Two issues raised in PR #464 review:

1. Cross-namespace PID collision: the path /tmp/rtp2httpd_status_<pid> is
   only unique within a PID namespace. Two containers sharing /tmp can
   legitimately have the same PID, and the previous unlink-on-EEXIST
   recovery would silently destroy the sibling instance's directory entry,
   causing later cross-cleanup races.

   Replace the heuristic with an exclusive non-blocking flock(2): on
   EEXIST, open the existing file and try to lock it. Lock acquired ⇒ no
   live owner ⇒ stale, safe to unlink. Lock busy ⇒ refuse with a clear
   error rather than corrupting the sibling.

   The fresh fd retains the flock for the lifetime of the daemon and is
   inherited by workers (per-OFD lock semantics keep it held until the
   last process closes its copy), so future recovery attempts can detect
   us.

2. Pipe-creation failure path used logger() with status_shared non-NULL
   but log_mutex still uninitialized, so logger() → status_add_log_entry
   → pthread_mutex_lock on a zero mutex (UB; not portable).

   Stage all initialization on a local pointer and only assign
   status_shared after the struct is fully usable (mutexes init'd, pipes
   created). All failure paths munmap the local and return with
   status_shared still NULL, so logger() falls through its NULL guard.

Reported by Copilot review on PR #464.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://thankful-water-0a297bf00-464.eastasia.1.azurestaticapps.net

@stackia stackia merged commit eec67dc into main May 5, 2026
5 checks passed
@stackia stackia deleted the worktree-fix-shm-eexist branch May 5, 2026 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants