Prune notes deleted from disk on full index and watcher startup #42

Merged
raphasouthall merged 1 commit into main from fix/index-prune-orphaned-notes on Jun 3, 2026

Conversation

@raphasouthall
Owner

Problem

full_index was upsert-only: it added and updated notes but never removed DB rows for files that no longer existed on disk. The only deletion path was the live watcher's per-event handler, so any file removed while the watcher was down orphaned its rows indefinitely.

These ghost nodes inflate note counts and pollute co-occurrence and community detection — dragging GraphRAG modularity toward random. A vault with a deleted calendar/ folder still showed whole communities built entirely from notes that no longer exist on disk.

Fix

  • reconcile_deletions(conn, vault_root, exclude_dirs) (watcher.py) — diffs the notes table against a full disk scan and prunes orphans. FK cascade (ON DELETE CASCADE, foreign_keys=ON) drops chunks/summaries/triples; sqlite-vec virtual tables aren't cascaded so they're cleared explicitly. Returns the count pruned.
  • full_index runs it by default (prune=True) — every neurostack index self-cleans.
  • Watcher startup reconcile: neurostack watch sweeps offline deletions on boot, so the always-on process self-heals without a manual re-index.
  • neurostack index --no-prune to opt out.
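
The diff-and-prune step in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in watcher.py: the schema, the `note_vectors` table (an ordinary table standing in for a sqlite-vec virtual table), and the names `reconcile_deletions_sketch` and `demo` are all assumptions, and the sketch leaves out `exclude_dirs` and the empty-scan guard described under Safety.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema for illustration only; the real tables may differ.
SCHEMA = """
CREATE TABLE notes (path TEXT PRIMARY KEY);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    note_path TEXT NOT NULL REFERENCES notes(path) ON DELETE CASCADE
);
-- Plain table standing in for a sqlite-vec virtual table (no FK, no cascade).
CREATE TABLE note_vectors (note_path TEXT, embedding BLOB);
"""


def reconcile_deletions_sketch(conn, vault_root):
    """Delete rows for notes missing from disk and return how many were pruned."""
    root = Path(vault_root)
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*.md")}
    in_db = {row[0] for row in conn.execute("SELECT path FROM notes")}
    orphans = sorted(in_db - on_disk)
    for path in orphans:
        # Not covered by the cascade, so clear it explicitly.
        conn.execute("DELETE FROM note_vectors WHERE note_path = ?", (path,))
        # Cascades to chunks because foreign_keys is ON for this connection.
        conn.execute("DELETE FROM notes WHERE path = ?", (path,))
    conn.commit()
    return len(orphans)


def demo():
    """Index two notes, leave only one on disk, then reconcile."""
    conn = sqlite3.connect(":memory:")
    # Foreign key enforcement is off by default in most SQLite builds.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    for path in ("kept.md", "calendar/gone.md"):
        conn.execute("INSERT INTO notes (path) VALUES (?)", (path,))
        conn.execute("INSERT INTO chunks (note_path) VALUES (?)", (path,))
        conn.execute("INSERT INTO note_vectors VALUES (?, ?)", (path, b"\x00"))
    conn.commit()
    with tempfile.TemporaryDirectory() as vault:
        Path(vault, "kept.md").write_text("still here")
        pruned = reconcile_deletions_sketch(conn, vault)
    remaining = {
        table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("notes", "chunks", "note_vectors")
    }
    return pruned, remaining
```

Calling `demo()` prunes the one note whose file is absent. Its `chunks` row disappears through the cascade, while the `note_vectors` row only disappears because of the explicit delete.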

Safety

An empty scan (unmounted / misconfigured vault_root) makes every note look orphaned. reconcile_deletions refuses to prune when the scan finds zero files — it warns and bails rather than wiping the index. A partial-mount case is not yet guarded (noted as a follow-up).
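
A minimal sketch of the empty-scan guard, assuming the set difference is factored into a helper. The name `find_orphans`, the logger name, and the warning text are hypothetical; only the behaviour (warn and prune nothing when the scan finds zero files) comes from the description above.

```python
import logging

log = logging.getLogger("reconcile_sketch")  # hypothetical logger name


def find_orphans(db_paths, disk_paths):
    """Return DB paths missing from disk, or [] when the disk scan is empty."""
    db_paths = set(db_paths)
    disk_paths = set(disk_paths)
    if not disk_paths:
        # An unmounted or misconfigured vault_root looks exactly like
        # "every note was deleted", so warn and prune nothing.
        log.warning("scan found 0 files; skipping prune of %d notes", len(db_paths))
        return []
    return sorted(db_paths - disk_paths)
```

With an empty `disk_paths` the helper returns an empty list regardless of what the DB holds. A scan that finds even one file passes the check, which is why a partially mounted vault is still unguarded.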

Tests

New tests/test_reconcile.py: cascade prune, no-op when clean, the empty-scan guard, and exclude-dir handling. Full suite: 556 passed.

full_index was upsert-only: it added and updated notes but never removed
DB rows for files that no longer existed on disk. The only deletion path
was the live watcher's per-event handler, so files removed while the
watcher was down orphaned their rows indefinitely, polluting co-occurrence
and community detection with ghost nodes.

Add reconcile_deletions() to diff the DB against a full disk scan and prune
orphans (FK cascade drops chunks/summaries/triples; sqlite-vec rows cleared
explicitly). full_index runs it by default; the watcher runs it on startup
to self-heal offline deletions. An empty scan skips pruning to avoid wiping
the index on a misconfigured or unmounted vault_root.

Add `neurostack index --no-prune` to opt out, and tests covering the
cascade, the empty-scan guard, and exclude-dir handling.
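
One way an opt-out flag like this could be wired up, assuming an argparse-based CLI (the PR does not say which CLI library neurostack uses). `build_parser` and the help strings are hypothetical; only the `index` subcommand, the `--no-prune` flag, and the `prune=True` default come from the description.

```python
import argparse


def build_parser():
    """Build a parser where `index` prunes unless --no-prune is given."""
    parser = argparse.ArgumentParser(prog="neurostack")
    sub = parser.add_subparsers(dest="command", required=True)
    index = sub.add_parser("index", help="run a full index")
    # store_false defaults to True, so pruning stays on unless the flag is passed.
    index.add_argument(
        "--no-prune",
        dest="prune",
        action="store_false",
        help="skip removal of notes deleted from disk",
    )
    return parser
```

Parsing `["index"]` yields `prune=True` and parsing `["index", "--no-prune"]` yields `prune=False`, which could then be passed through to `full_index`.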
@raphasouthall raphasouthall merged commit 5e49fef into main Jun 3, 2026
5 checks passed
@raphasouthall raphasouthall deleted the fix/index-prune-orphaned-notes branch June 3, 2026 11:28