Prune notes deleted from disk on full index and watcher startup #42
Merged
full_index was upsert-only: it added and updated notes but never removed DB rows for files that no longer existed on disk. The only deletion path was the live watcher's per-event handler, so files removed while the watcher was down orphaned their rows indefinitely, polluting co-occurrence and community detection with ghost nodes. Add reconcile_deletions() to diff the DB against a full disk scan and prune orphans (FK cascade drops chunks/summaries/triples; sqlite-vec rows cleared explicitly). full_index runs it by default; the watcher runs it on startup to self-heal offline deletions. An empty scan skips pruning to avoid wiping the index on a misconfigured or unmounted vault_root. Add `neurostack index --no-prune` to opt out, and tests covering the cascade, the empty-scan guard, and exclude-dir handling.
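The diff-and-prune step described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `notes.path` column, the `*.md` glob, the helper `scan_vault`, and the name `reconcile_deletions_sketch` are all assumptions made for the example, and the explicit sqlite-vec cleanup is only noted in a comment.

```python
import logging
import sqlite3
from pathlib import Path

log = logging.getLogger(__name__)


def scan_vault(vault_root: Path, exclude_dirs: set) -> set:
    """Return vault-relative POSIX paths of Markdown notes (hypothetical helper)."""
    if not vault_root.is_dir():
        # Unmounted or misconfigured root: report an empty scan.
        return set()
    found = set()
    for path in vault_root.rglob("*.md"):
        rel = path.relative_to(vault_root)
        # Skip files that sit under any excluded directory name.
        if any(part in exclude_dirs for part in rel.parts[:-1]):
            continue
        found.add(rel.as_posix())
    return found


def reconcile_deletions_sketch(conn: sqlite3.Connection, vault_root, exclude_dirs=()) -> int:
    """Delete `notes` rows whose file is missing from the scan; return the count pruned."""
    on_disk = scan_vault(Path(vault_root), set(exclude_dirs))
    if not on_disk:
        # Empty-scan guard: every row would look orphaned, so refuse to prune.
        log.warning("scan of %s found no notes; skipping prune", vault_root)
        return 0

    # Assumed schema: notes(path) holds the vault-relative path of each note.
    in_db = {row[0] for row in conn.execute("SELECT path FROM notes")}
    orphans = sorted(in_db - on_disk)

    with conn:  # one transaction; commits on success, rolls back on error
        # A real implementation would also clear sqlite-vec rows here,
        # since virtual tables are not covered by the FK cascade.
        conn.executemany("DELETE FROM notes WHERE path = ?", [(p,) for p in orphans])
    return len(orphans)
```

In this sketch, rows under an excluded directory count as orphans because the scan never sees them; whether the merged code treats excluded directories the same way is not stated here.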
Problem

`full_index` was upsert-only: it added and updated notes but never removed DB rows for files that no longer existed on disk. The only deletion path was the live watcher's per-event handler, so any file removed while the watcher was down orphaned its rows indefinitely.

These ghost nodes inflate note counts and pollute co-occurrence and community detection, dragging GraphRAG modularity toward random. A vault with a deleted `calendar/` folder still showed whole communities built entirely from notes that no longer exist on disk.

Fix

- `reconcile_deletions(conn, vault_root, exclude_dirs)` (`watcher.py`): diffs the `notes` table against a full disk scan and prunes orphans. FK cascade (`ON DELETE CASCADE`, `foreign_keys=ON`) drops chunks/summaries/triples; sqlite-vec virtual tables aren't cascaded so they're cleared explicitly. Returns the count pruned.
- `full_index` runs it by default (`prune=True`): every `neurostack index` self-cleans.
- `neurostack watch` sweeps offline deletions on boot, so the always-on process self-heals without a manual re-index.
- `neurostack index --no-prune` to opt out.

Safety

An empty scan (unmounted / misconfigured `vault_root`) makes every note look orphaned. `reconcile_deletions` refuses to prune when the scan finds zero files: it warns and bails rather than wiping the index. A partial-mount case is not yet guarded (noted as a follow-up).

Tests

New `tests/test_reconcile.py`: cascade prune, no-op when clean, the empty-scan guard, and exclude-dir handling. Full suite: 556 passed.
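
For anyone who wants to check the cascade behaviour in isolation, here is a small standalone sketch using only the standard-library `sqlite3` module. It is not taken from `tests/test_reconcile.py`: the table layout, the `chunk_vectors` table (an ordinary table standing in for a sqlite-vec virtual table), and the sample path are assumptions made for illustration. It shows that `ON DELETE CASCADE` only fires when `foreign_keys` is enabled on the connection, and that a table with no foreign key has to be cleared by hand.

```python
import sqlite3


def build_demo_db(foreign_keys: bool = True) -> sqlite3.Connection:
    """Create an in-memory DB with one note, one chunk, and one vector row."""
    conn = sqlite3.connect(":memory:")
    # FK enforcement is a per-connection setting in SQLite.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if foreign_keys else 'OFF'}")
    conn.executescript(
        """
        CREATE TABLE notes (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            note_id INTEGER NOT NULL REFERENCES notes(id) ON DELETE CASCADE,
            body TEXT
        );
        -- Hypothetical stand-in for a vector table: no FK, so no cascade.
        CREATE TABLE chunk_vectors (chunk_id INTEGER PRIMARY KEY, embedding BLOB);
        INSERT INTO notes VALUES (1, 'calendar/example.md');
        INSERT INTO chunks VALUES (10, 1, 'ghost chunk');
        INSERT INTO chunk_vectors VALUES (10, x'00');
        """
    )
    return conn


def prune_note(conn: sqlite3.Connection, path: str) -> None:
    """Delete one note; vector rows are removed explicitly, chunks via cascade."""
    with conn:
        # Clear vector rows first, while the chunk ids can still be looked up.
        conn.execute(
            "DELETE FROM chunk_vectors WHERE chunk_id IN ("
            " SELECT c.id FROM chunks c JOIN notes n ON n.id = c.note_id"
            " WHERE n.path = ?)",
            (path,),
        )
        conn.execute("DELETE FROM notes WHERE path = ?", (path,))


def count(conn: sqlite3.Connection, table: str) -> int:
    """Row count for a table name supplied by this demo (not user input)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

With `foreign_keys` on, pruning the note leaves all three tables empty; with it off, the `chunks` row survives as an orphan, which is why the pragma matters for the prune path.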