fix(logging): never delete .log files held by live processes #145
Merged
cleanup_old_logs sorted by mtime and trimmed oldest unconditionally. In hub-only mode the long-lived hub and its root agent-server fall idle between writes, so their .log mtime stays older than short-lived sub-agents'. Each sub-agent startup re-runs cleanup and unlinks the hub/server's still-open log, stranding KB of classifier failure logs in unreadable write-only fds.

- OpenOptions opens with .read(true) so the inode can be re-opened via /dev/fd/N even after unlink (was O_WRONLY|O_APPEND, now O_RDWR).
- cleanup_with_alive_filter parses each filename's PID and skips deletion whenever kill(pid, 0) reports the writer is still alive.
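To illustrate why read access on the descriptor matters, here is a minimal Unix-only sketch of recovering a log after its path has been unlinked. It is not the code from this PR: the helper names (open_log, recover_via_dev_fd, demo_unlinked_log_recovery) and the temp-file name are hypothetical, and it assumes the platform exposes /dev/fd.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};
use std::os::unix::io::AsRawFd;
use std::path::Path;

/// Opens a log in append mode with read access as well
/// (O_RDWR | O_APPEND rather than O_WRONLY | O_APPEND).
fn open_log(path: &Path) -> std::io::Result<File> {
    OpenOptions::new().create(true).append(true).read(true).open(path)
}

/// Reads everything written so far by re-opening the descriptor through
/// /dev/fd/N. The original path is not needed, so this works after unlink.
fn recover_via_dev_fd(log: &File) -> std::io::Result<String> {
    let mut reopened = File::open(format!("/dev/fd/{}", log.as_raw_fd()))?;
    // Where /dev/fd/N behaves like dup(2) the offset is shared with the
    // writer, so rewind explicitly before reading.
    reopened.seek(SeekFrom::Start(0))?;
    let mut contents = String::new();
    reopened.read_to_string(&mut contents)?;
    Ok(contents)
}

/// Hypothetical end-to-end demo: write, unlink, keep writing, then recover.
fn demo_unlinked_log_recovery() -> std::io::Result<String> {
    let path = std::env::temp_dir().join(format!("demo-{}.log", std::process::id()));
    let mut log = open_log(&path)?;
    writeln!(log, "before unlink")?;
    fs::remove_file(&path)?; // what the old cleanup did to a live writer's log
    writeln!(log, "after unlink")?;
    recover_via_dev_fd(&log)
}
```

Calling demo_unlinked_log_recovery() should return both lines, including the one written after the unlink, because the inode stays alive for as long as the descriptor is open.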
Summary
- cleanup_old_logs was unlinking still-open log files of long-lived hub / agent-server processes, stranding KB of logs in unreadable write-only fds and making classifier failures undebuggable.
- Fix: open log files O_RDWR (so an unlinked inode can still be re-opened via /dev/fd/N), and teach cleanup to skip any .log whose filename PID is still alive.
- Observed with --decision=classifier; users reported repeated classifier failures with no extractable log evidence. Root cause was that each ephemeral sub-agent's startup re-ran cleanup and deleted the hub/server's older-mtime log.

Changes
src/log_writer.rs
- RotatingFileWriter::new and rotation path: OpenOptions adds .read(true).
- cleanup_old_logs now delegates to cleanup_with_alive_filter(impl Fn(u32) -> bool); production wiring passes bootstrap::is_alive.
- pid_from_log_filename parses both loopal-{ts}-{pid}.log and rotated *.N.log shapes.

src/bootstrap/mod.rs
- pub(crate) use discovery::is_alive so log_writer can reach it.

Test plan
- bazel build //:loopal passes
- bazel build //:loopal --config=clippy zero warnings
- bazel build //:loopal --config=rustfmt clean
- bazel test //:loopal-unit-test (3 new tests pass alongside 40 existing)
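Taking the liveness check as an impl Fn(u32) -> bool is what lets the cleanup logic be unit-tested without spawning real processes. The sketch below shows the idea under stated assumptions; it is not the code from this PR. pid_from_log_filename is reimplemented from the filename shapes named above, the rotated layout is assumed to be loopal-{ts}-{pid}.N.log, and select_for_deletion is a hypothetical pure helper that stands in for the filesystem-touching cleanup.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Extracts the writer PID from `loopal-{ts}-{pid}.log` or a rotated
/// `loopal-{ts}-{pid}.N.log` name (the rotated layout is an assumption).
fn pid_from_log_filename(name: &str) -> Option<u32> {
    let stem = name.strip_prefix("loopal-")?.strip_suffix(".log")?;
    // Drop a trailing `.N` rotation index if present.
    let stem = match stem.rsplit_once('.') {
        Some((head, idx)) if !idx.is_empty() && idx.bytes().all(|b| b.is_ascii_digit()) => head,
        _ => stem,
    };
    let (_ts, pid) = stem.rsplit_once('-')?;
    pid.parse().ok()
}

/// Picks which logs to delete: everything beyond the `keep` newest, except
/// files whose writer PID is reported alive. Names without a parseable PID
/// are treated as having no live writer (an assumption of this sketch).
fn select_for_deletion(
    mut logs: Vec<(PathBuf, SystemTime)>,
    keep: usize,
    is_alive: impl Fn(u32) -> bool,
) -> Vec<PathBuf> {
    logs.sort_by_key(|(_, mtime)| *mtime); // oldest first
    let excess = logs.len().saturating_sub(keep);
    logs.into_iter()
        .take(excess)
        .filter(|(path, _)| {
            let pid = path
                .file_name()
                .and_then(|n| n.to_str())
                .and_then(pid_from_log_filename);
            // Skip deletion when the filename's PID belongs to a live writer.
            !pid.map_or(false, |p| is_alive(p))
        })
        .map(|(path, _)| path)
        .collect()
}
```

A test can pass a closure such as |pid| pid == 100 to simulate a live hub, while production code would pass a real check (here, bootstrap::is_alive). Note that skipping live writers means more than `keep` files may remain on disk, which is the intended trade-off.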