Emit named IN_CREATE/IN_DELETE for inotify directory watches#58
Emit named IN_CREATE/IN_DELETE for inotify directory watches#58chenhunghan wants to merge 2 commits into
Conversation
EVFILT_VNODE reports that a watched directory changed but not which child, so directory events were queued with no name. fsnotify-based consumers (notably the k0s manifest applier, which re-applies only when a *.yaml entry appears) filter on the entry name and silently drop nameless events, so manifests written after the watch was established were never picked up. Keep a per-watch snapshot of the directory's entry names; on each NOTE_WRITE re-list the directory and diff against the snapshot to emit a named IN_CREATE per added child and IN_DELETE per removed one, matching real inotify semantics. The blocking-read and non-blocking collect paths share one process_vnode_event() helper. The snapshot is allocated on add_watch and freed on rm_watch and inotify_close. Add a regression test (test-inotify Test 6) that watches a fresh directory, creates a child, and asserts a named IN_CREATE for it is delivered; this fails before the fix (the event arrives without a name). Validated with make check on Apple Silicon. (cherry picked from commit 2a5fa28b5257ceb5ed116c231aecc7c4a2ef54ab)
There was a problem hiding this comment.
2 issues found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/syscall/inotify.c">
<violation number="1" location="src/syscall/inotify.c:379">
P2: Filesystem I/O (opendir/readdir) under global inotify_lock causes potential lock contention and latency regression</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| int queued = 0; | ||
| bool overflow = false; | ||
|
|
||
| if (w->is_dir && (fflags & NOTE_WRITE) && w->path) { |
There was a problem hiding this comment.
P2: Filesystem I/O (opendir/readdir) under global inotify_lock causes potential lock contention and latency regression
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/inotify.c, line 379:
<comment>Filesystem I/O (opendir/readdir) under global inotify_lock causes potential lock contention and latency regression</comment>
<file context>
@@ -296,8 +303,133 @@ static void pipe_drain(inotify_instance_t *inst)
+ int queued = 0;
+ bool overflow = false;
+
+ if (w->is_dir && (fflags & NOTE_WRITE) && w->path) {
+ char **now = NULL;
+ int now_n = 0;
</file context>
A directory watch snapshots the child-name set on each change and diffs it against the previous baseline to recover the name kqueue omits. dir_snapshot returned an empty list on any failure (opendir error, a mid-read readdir error, or an allocation failure), which the diff could not tell apart from a genuinely empty directory. On a transient failure the IN_DELETE pass then saw every known child as missing and queued a spurious IN_DELETE for each, and the baseline was overwritten with the empty list, so every later change re-reported the whole directory as IN_CREATE. The corruption was permanent. Make dir_snapshot return bool: on failure it frees any partial result and reports false, and readdir read errors are now detected by clearing errno before each call. process_vnode_event only diffs against and advances to a snapshot that succeeded; on failure it keeps the previous baseline so the next successful snapshot reconciles whatever changed in between. The watch-add path documents that a failed initial snapshot is a best-effort empty baseline. Validated with make check on Apple Silicon; test-inotify still passes 6/6.
4aa4da1 to
7fd5187
Compare
| * On failure keep the previous baseline; the next successful snapshot | ||
| * reconciles whatever changed in between. | ||
| */ | ||
| if (dir_snapshot(w->path, &now, &now_n)) { |
There was a problem hiding this comment.
Stale w->path reads the wrong directory after rename
The snapshot/diff design rests on w->path (captured by strdup(path) at inotify_add_watch time, src/syscall/inotify.c:594) being a valid handle to the directory whose entries the diff should compare. The watch's identity, however, is w->host_fd — an O_EVTONLY fd that follows the inode across renames. Once the two diverge, dir_snapshot(w->path, ...) at src/syscall/inotify.c:409 no longer reads the directory the watch is registered for.
I reproduced both failure modes with two small static aarch64 binaries cross-built and run against build/elfuse at this PR's tip (7fd5187).
Scenario A — old path no longer exists
mkdtemp /tmp/A
inotify_add_watch(fd, /tmp/A, IN_CREATE | IN_DELETE)
rename /tmp/A -> /tmp/A-renamed /* watched inode moves with the fd */
touch /tmp/A-renamed/child /* in the watched dir */
mv raises NOTE_RENAME on the watched fd (delivered as IN_MOVE_SELF, correct). The subsequent touch raises NOTE_WRITE on the same fd; process_vnode_event calls dir_snapshot("/tmp/A", ...), opendir returns ENOENT, the function returns false, the keep-baseline branch fires — and no IN_CREATE for child is ever emitted.
Actual output:
watching /tmp/elfuse-renA-KRUJg3 (wd=1)
renamed /tmp/elfuse-renA-KRUJg3 -> /tmp/elfuse-renA-KRUJg3-renamed; old path is now nonexistent
created /tmp/elfuse-renA-KRUJg3-renamed/child in the renamed (watched) dir
IN_CREATE name="child" count: 0 (expected 1 on Linux)
verdict: BUG REPRODUCED (named event lost after rename)
From the caller's perspective the watch silently stops naming children after a rename — exactly the failure mode #55 was opened to fix.
Scenario B — old path is taken by a different inode
mkdtemp /tmp/A
inotify_add_watch(fd, /tmp/A, IN_CREATE | IN_DELETE)
rename /tmp/A -> /tmp/A-renamed /* watched inode moves with the fd */
mkdir /tmp/A /* fresh dir, different inode, unwatched */
touch /tmp/A/decoy /* in the new unwatched dir */
touch /tmp/A-renamed/child /* in the watched dir */
Now opendir("/tmp/A") succeeds, but it reads the new /tmp/A, not the directory the watch tracks. The diff compares the new directory's entries (["decoy"]) to the baseline ([]) and synthesises an IN_CREATE for the decoy. The real child event in the renamed directory never surfaces because the watched dir is never sampled.
Actual output:
step 1: watching /tmp/elfuse-rename-A-nydrPl (wd=1)
step 2: rename /tmp/elfuse-rename-A-nydrPl -> /tmp/elfuse-rename-A-nydrPl-renamed (watched inode moves with the fd)
step 3: mkdir /tmp/elfuse-rename-A-nydrPl again (NEW inode at old path -- unwatched)
step 4a: created /tmp/elfuse-rename-A-nydrPl/decoy in the NEW (unwatched) dir
step 4b: created /tmp/elfuse-rename-A-nydrPl-renamed/child in the WATCHED (renamed) dir
step 5: draining events for ~1s
event: wd=1 mask=0x00000100 (IN_CREATE) len=8 name=decoy
summary:
total IN_CREATE: 1
IN_CREATE name="decoy" (BUG if >0): 1
IN_CREATE name="child" (expected: 1): 0
verdict: BUG REPRODUCED
The emitted event is objectively wrong on both axes: it names a file in a directory the caller never watched, and it suppresses the named event for the actual child of the watched directory. This isn't theoretical — build systems (yarn/pnpm/Cargo) and test runners routinely mv build → build.old; mkdir build; … for atomic swap; the watch on the old build/ will start fabricating events for the unrelated replacement directory after the swap.
Reproducers
Both binaries are small static aarch64 ELFs cross-built with aarch64-linux-gnu-gcc -static -O2 -D_GNU_SOURCE. Sources attached as Scenario A / Scenario B above; they exit non-zero on the buggy path. A correct fix should make both exit zero (Scenario A: emit IN_CREATE name="child"; Scenario B: emit IN_CREATE name="child" and not name="decoy").
Fixes #55.
Synthesizes named
IN_CREATE/IN_DELETEfor directory watches by snapshotting the directory's entry names and diffing on eachNOTE_WRITE(kqueueEVFILT_VNODEreports only that the directory changed, not which child). Full rationale in the commit message.Adds a regression test (
test-inotifyTest 6) that watches a fresh directory, creates a child, and asserts a namedIN_CREATEis delivered.Validation on Apple Silicon:
make elfuseandmake checkpass; the new test fails before this change (event arrives with an empty name) and passes after.Summary by cubic
Emit named IN_CREATE/IN_DELETE for directory watches by diffing a per-watch snapshot on each change and keeping the previous baseline if a snapshot fails to avoid spurious events. This fixes nameless directory events so fsnotify-based tools can match child names (fixes #55).
Written for commit 7fd5187. Summary will update on new commits.