Identities, Collections, and Deduplication #286
Proposal and implementation plans for the identity discovery, collections, and deduplication system.
Content-hash-based duplicate detection across accounts with soft-delete merging. Three-signal identity discovery (From header, OAuth, config). CLI commands: deduplicate (dry-run default, --apply, --undo) and list-identities. All query paths exclude dedup-soft-deleted rows.
Named collections grouping multiple sources with a default "All" collection. SourceIDs filtering in both DuckDB and SQLite query paths. CLI collections command with CRUD operations. Includes fixes for buffer corruption in normalizeRawMIME, deep-copy in Clone(), and openStore consolidation.
Table-driven tests for dedup engine, collections CRUD, identity discovery, and source filter helpers. Incorporates review findings.
roborev: Combined Review
Force-pushed from e1e5be4 to 68c6d34
- DuckDB: add deleted_at IS NULL predicate to buildWhereClause, buildFilterConditions, GetTotalStats, and buildSearchConditions so soft-deleted duplicates are excluded from Parquet-backed queries. Handle missing column in older Parquet files via parquetCTEs (sketched below).
- Collections: call EnsureDefaultCollection during InitSchema so the "All" collection is always present. Add new sources to "All" in GetOrCreateSource so newly added accounts join automatically.
- TOML output: replace manual string escaping in writeIdentitiesTOML with the BurntSushi/toml encoder to prevent injection via crafted From: addresses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
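A minimal sketch of the missing-column handling the first bullet describes, assuming a CTE that projects NULL for deleted_at when older Parquet files lack the column. Column names, the file glob, and the function shape are all illustrative, not the shipped code:

```go
package query

// parquetCTE normalizes older Parquet files that predate the deleted_at
// column by projecting NULL for it inside a CTE, so every downstream
// predicate can assume the column exists.
func parquetCTE(hasDeletedAt bool) string {
	cols := "id, source_id, sent_at"
	if hasDeletedAt {
		cols += ", deleted_at"
	} else {
		// Older files: no deleted_at on disk, so treat every row as live.
		cols += ", NULL AS deleted_at"
	}
	return "WITH messages AS (SELECT " + cols +
		" FROM read_parquet('messages/*.parquet'))"
}
```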
roborev: Combined Review
- Error when --account resolves to zero sources instead of silently falling through to per-source mode
- Sanitize src.Identifier in batchID to prevent path separators in manifest filenames
- Distinguish nil from empty SourceIDs in query filter (empty = match nothing, nil = no filter; see the sketch below)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
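A sketch of the nil-vs-empty contract from the last bullet. appendSourceFilter is the helper name used later in this thread; this body is an illustrative reconstruction:

```go
package query

import "strings"

// appendSourceFilter distinguishes an unscoped query (nil) from a scope
// that resolved to no sources (empty slice).
func appendSourceFilter(where []string, args []any, sourceIDs []int64) ([]string, []any) {
	if sourceIDs == nil {
		return where, args // nil = no filter: query is unscoped
	}
	if len(sourceIDs) == 0 {
		return append(where, "1 = 0"), args // empty = scoped to nothing: match no rows
	}
	ph := strings.TrimSuffix(strings.Repeat("?,", len(sourceIDs)), ",")
	where = append(where, "m.source_id IN ("+ph+")")
	for _, id := range sourceIDs {
		args = append(args, id)
	}
	return where, args
}
```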
roborev: Combined Review
…lIDsByFilter

- Error when list-identities --account resolves to zero sources instead of returning unscoped results from all accounts
- Add missing deleted_at IS NULL filter in DuckDB GetGmailIDsByFilter fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixed (5 total):
Disagree with the following:
roborev: Combined Review
Dry run, in my opinion, shouldn't do anything that writes or mutates... including backup. It should clearly disclose what it would have done, and what it can't do because of a potential mutate.
Okay, I agree with that.
…Scan

- Scan no longer backfills rfc822_message_id during dry-run; instead reports how many messages need backfill and notes they'll be included on --apply
- Backfill is now scoped to AccountSourceIDs, not global
- Engine.Scan requires non-empty AccountSourceIDs to prevent accidental cross-account grouping; CLI handles the unscoped case via per-source iteration
- list-identities --account errors on zero-source resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review
When LF-only headers are followed by a body containing CRLF sequences, the previous code could match \r\n\r\n in the body instead of \n\n at the actual header boundary. Now finds both delimiters and uses the earliest match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
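A sketch of the earliest-delimiter fix described above (function name and package are hypothetical):

```go
package mime

import "bytes"

// headerBoundary returns the index of the header/body boundary, or -1 if
// neither delimiter is present. Finding both delimiters and taking the
// earliest prevents a CRLF pair in the body from masking an LF-only
// header boundary.
func headerBoundary(raw []byte) int {
	crlf := bytes.Index(raw, []byte("\r\n\r\n"))
	lf := bytes.Index(raw, []byte("\n\n"))
	switch {
	case crlf == -1:
		return lf
	case lf == -1:
		return crlf
	case lf < crlf:
		// LF-only headers: the true boundary precedes any CRLF pair in the body.
		return lf
	default:
		return crlf
	}
}
```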
…SourceIDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review
Label union and raw MIME backfill are additive enrichment that leaves survivors strictly better off. Reversing them would require tracking per-merge deltas for no user benefit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
roborev: Combined Review
When a user passes --log-file, the startup warning previously printed LogsDir even though logs were actually going to FilePath. Report the path that was actually used so the warning is actionable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add TestAppendSourceFilter cases for nil, empty, single, and multi-ID inputs to pin the boundary behavior of the SQL builder. Add TestEngine_Scan_RejectsEmptyAccountSourceIDs to ensure Engine.Scan rejects an unscoped scan (both nil and empty-slice AccountSourceIDs). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use strings.HasPrefix for the "(?i)" guard in list_identities.go instead of byte slicing (safer on short inputs). Replace fmt.Sscanf with strconv.ParseInt in parseInt64CSV so malformed rows like "123abc" are rejected instead of silently accepted. Flag the SQLite-specific GROUP_CONCAT in ListLikelyIdentities with a comment for the Postgres dialect port. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
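An illustrative reconstruction of the stricter parsing described above; only the helper name parseInt64CSV comes from the commit:

```go
package dedup

import (
	"fmt"
	"strconv"
	"strings"
)

// parseInt64CSV parses a comma-separated list of int64 IDs, rejecting any
// malformed element instead of silently truncating it.
func parseInt64CSV(s string) ([]int64, error) {
	if strings.TrimSpace(s) == "" {
		return nil, nil
	}
	parts := strings.Split(s, ",")
	ids := make([]int64, 0, len(parts))
	for _, p := range parts {
		// strconv.ParseInt fails on trailing garbage such as "123abc",
		// which fmt.Sscanf("%d", ...) would have silently read as 123.
		id, err := strconv.ParseInt(strings.TrimSpace(p), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid id %q: %w", p, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}
```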
Wrap AddSourcesToCollection and RemoveSourcesFromCollection in s.withTx so per-source inserts/deletes are atomic — previously a mid-loop failure could leave the membership table partially updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
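A sketch of the withTx pattern this refers to, with one wrapped method; the real signatures in the store package may differ:

```go
package store

import (
	"context"
	"database/sql"
)

type Store struct{ db *sql.DB }

// withTx runs fn inside a transaction, committing on success and rolling
// back on any error.
func (s *Store) withTx(ctx context.Context, fn func(tx *sql.Tx) error) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		_ = tx.Rollback() // best-effort; fn's error is the one to report
		return err
	}
	return tx.Commit()
}

// AddSourcesToCollection inserts every membership row in one transaction,
// so a mid-loop failure rolls back rather than leaving partial membership.
func (s *Store) AddSourcesToCollection(ctx context.Context, collectionID int64, sourceIDs []int64) error {
	return s.withTx(ctx, func(tx *sql.Tx) error {
		for _, id := range sourceIDs {
			if _, err := tx.ExecContext(ctx,
				`INSERT OR IGNORE INTO collection_sources (collection_id, source_id) VALUES (?, ?)`,
				collectionID, id); err != nil {
				return err
			}
		}
		return nil
	})
}
```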
Route datetime('now') through s.dialect.Now() and INSERT OR IGNORE
through s.dialect.InsertOrIgnore(...) so dedup queries port cleanly
to PostgreSQL. Wrap the has_sent_label EXISTS column in CAST(... AS
INTEGER) so int scans are dialect-agnostic. Scan archived_at as
sql.NullTime in GetDuplicateGroupMessages and GetAllRawMIMECandidates
so the survivor tiebreaker no longer depends on a hard-coded
timestamp layout. Also JOIN message_raw in CountMessagesWithoutRFC822ID
so the reported count matches what BackfillRFC822IDs actually
processes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
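An assumed shape for the dialect seam described above; only the Now and InsertOrIgnore method names come from the commit:

```go
package store

// Dialect abstracts the SQL fragments that differ between SQLite and
// PostgreSQL so query builders never embed engine-specific syntax.
type Dialect interface {
	// Now returns the SQL expression for the current timestamp.
	Now() string
	// InsertOrIgnore returns the dialect's conflict-ignoring INSERT prefix.
	InsertOrIgnore(table string) string
}

type sqliteDialect struct{}

func (sqliteDialect) Now() string { return "datetime('now')" }
func (sqliteDialect) InsertOrIgnore(table string) string {
	return "INSERT OR IGNORE INTO " + table
}

// A PostgreSQL implementation would return now() and render conflict
// handling as an ON CONFLICT DO NOTHING clause instead, which is why the
// dedup queries route through the interface rather than hard-coding
// SQLite syntax.
```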
- Report: copy report.SampleGroups instead of aliasing report.Groups to prevent silent mutation via future appends.
- Add SkippedDecompressionErrors and log a warning per failure in scanNormalizedHashGroups; surface the count in FormatReport.
- Count empty normalized Message-IDs as failed in BackfillRFC822IDs so updated+failed matches the number of candidates processed.
- Manifest IDs: key remote manifest grouping by (account, source_type) so an account spanning multiple source types gets a per-type manifest with the correct SourceType label. Disambiguate manifest IDs only when an account contributes duplicates from more than one source type (preserves existing single-type IDs). On filename truncation, append a 4-byte hash suffix in SanitizeFilenameComponent so distinct accounts with identical 40-char prefixes produce unique manifest IDs (sketched below).
- Undo: continue through all pending manifests in Engine.Undo, joining cancellation errors with errors.Join, and document the best-effort semantics in godoc.
- Methodology doc: note in FormatMethodology that content-hash is byte-sensitive below the header boundary (CRLF vs LF body differences will not match) and that merge only backfills raw MIME; point users to repair-encoding / full cache rebuild for missing parsed bodies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
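A sketch of the truncate-plus-hash rule from the manifest-ID bullet. The real SanitizeFilenameComponent also strips path separators; the 40-char limit matches the prefix length mentioned above, the rest is assumed:

```go
package remote

import (
	"crypto/sha256"
	"encoding/hex"
)

// sanitizeFilenameComponent truncates long components and appends a short
// digest so distinct inputs with identical prefixes stay distinguishable.
func sanitizeFilenameComponent(s string) string {
	const maxLen = 40
	if len(s) <= maxLen {
		return s
	}
	sum := sha256.Sum256([]byte(s))
	// 4 bytes = 8 hex chars: accounts sharing a 40-char prefix now map to
	// distinct manifest IDs.
	return s[:maxLen] + "-" + hex.EncodeToString(sum[:4])
}
```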
- Replace the dedup-backup file copy (main + -wal + -shm) with VACUUM INTO, giving an atomic point-in-time snapshot (sketched below).
- Accept --undo multiple times (StringArray) and, in per-source mode, print a consolidated footer listing all batch IDs for a single undo command.
- On --undo error, print the restored count and any in-progress manifests before returning the wrapped error so best-effort partial success is visible.
- Surface cancelErrs from Engine.Undo to stderr instead of hiding them inside the wrapped error.
- Reword the still-running warning to say in-progress manifests "cannot be cancelled" and factor the print block into printStillRunningWarning.
- Append a random run-XXXXXXXX suffix to single-run batch IDs so they can never be a prefix of per-source batch IDs generated in the same second.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
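A minimal sketch of the VACUUM INTO backup, assuming a database/sql handle to the SQLite store and that the driver accepts a bound parameter for the target path (SQLite treats the filename as an expression):

```go
package dedup

import (
	"context"
	"database/sql"
)

// backupBeforeDedup writes a compacted, transactionally consistent snapshot
// to dest; unlike a file copy, there are no -wal/-shm siblings to chase.
func backupBeforeDedup(ctx context.Context, db *sql.DB, dest string) error {
	_, err := db.ExecContext(ctx, "VACUUM INTO ?", dest)
	return err
}
```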
InitSchema returned early from the FTS5 branch when the sqlite3 build lacks the fts5 module, which meant EnsureDefaultCollection never ran. Without the collections tables, GetOrCreateSource's "add to All" insert silently failed and the "All" collection was absent from listings — causing TestCollections_CRUD to fail with list=1, want 2. Fall through instead so collection setup still runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
roborev: Combined Review
Haven't forgotten about this, will look at this soon!
via Codex

Thanks for putting this together. The direction is useful, especially the reversible local dedup flow, survivor selection, batch IDs, and the idea of default/user-defined collections. Before this goes further, I think we should align on the long-term data model and scope semantics, because several parts of this branch seem to encode a different model than I think msgvault should use. My current understanding of the model we want is:

[…]
That model is attractive because it gives us a crisp safety boundary: we never dedup across independent archives unless the user has deliberately grouped them into a collection.

Concerns and questions:
The implementation has a good safety rule: only stage remote deletion when loser and survivor share the same […]
The spec discusses import-time dedup ([…]).

Recommended architecture direction: […]
I think this branch is useful, but I would like us to settle these boundaries before we bake in the naming/API shape. The main thing I want to avoid is agents and humans reading "account" differently across commands, docs, and package APIs.
This is good feedback. Let me take another pass at clarifying the spec first.
Let me know if I can assist. This will be good to get designed correctly so it can provide a good foundation for different types of msgvault frontend environments.
…hema

Foundations for first-class collection scoping across the CLI.

- account_scope.go: replaces the ambiguous ResolveAccount() (which silently treated collections as accounts when called with --account) with two narrow resolvers, ResolveAccountFlag and ResolveCollectionFlag. Each rejects the other kind with a hint to the correct flag. Returns a generic Scope value (Source or Collection, never both) used by callers (see the sketch below).
- collections.go (cmd): switches the --accounts member resolver to ResolveAccountFlag so collection names cannot be silently expanded into the membership list.
- store/schema.sql: promotes the collections and collection_sources tables into the canonical embedded schema instead of creating them lazily at first use.
- store/collections.go: drops the eight defensive ensureCollectionSchema() guards now that schema init covers them.
- store/sync.go: GetOrCreateSource auto-adds new sources to the All collection; the previous code swallowed errors with `_, _ = Exec`, which silently broke the bootstrap on FK errors. Now logs at warn level so failures are visible.
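Assumed shapes for the two narrow resolvers; the resolver names come from the commit, while scopeStore and its lookup methods are hypothetical stand-ins for the real store queries:

```go
package cmd

import "fmt"

// Scope is a source-id set resolved from exactly one of --account or
// --collection, never both.
type Scope struct{ SourceIDs []int64 }

type scopeStore interface {
	SourceIDByName(name string) (int64, bool)
	CollectionSourceIDs(name string) ([]int64, bool)
}

// ResolveAccountFlag resolves --account strictly to a source; a collection
// name is rejected with a hint to the correct flag.
func ResolveAccountFlag(st scopeStore, name string) (Scope, error) {
	if id, ok := st.SourceIDByName(name); ok {
		return Scope{SourceIDs: []int64{id}}, nil
	}
	if _, ok := st.CollectionSourceIDs(name); ok {
		return Scope{}, fmt.Errorf("%q is a collection, not an account; use --collection", name)
	}
	return Scope{}, fmt.Errorf("unknown account %q", name)
}

// ResolveCollectionFlag is the symmetric resolver for --collection.
func ResolveCollectionFlag(st scopeStore, name string) (Scope, error) {
	if ids, ok := st.CollectionSourceIDs(name); ok {
		return Scope{SourceIDs: ids}, nil
	}
	if _, ok := st.SourceIDByName(name); ok {
		return Scope{}, fmt.Errorf("%q is an account, not a collection; use --account", name)
	}
	return Scope{}, fmt.Errorf("unknown collection %q", name)
}
```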
Adds collection-aware scope flags to the four user-facing commands that previously only accepted --account, and widens search.Query so the engine speaks in source-id slices throughout.

- deduplicate, list-identities, search, stats: each gains --collection, mutually exclusive with --account at the cobra layer. Resolution goes through the new ResolveAccountFlag/ResolveCollectionFlag resolvers; ambiguous-name and wrong-flag inputs error with hints to the correct flag.
- internal/search/Query: AccountID *int64 -> AccountIDs []int64 so a collection scope expands cleanly into a multi-source filter. Engine call sites in internal/query/sqlite.go and duckdb.go switch to the existing appendSourceFilter helper. internal/mcp/handlers.go and internal/search/parser.go follow.
- search_vector: --account resolution moves to the new helpers in the vector search path so vector mode honors the same scope rules.
- internal/remote/engine.go: rejects multi-source scope with an explicit error rather than silently dropping AccountIDs[1:]. CLI layers also block --collection in remote mode; this is defensive for any future caller that bypasses those checks.
- stats: gains GetStatsForScope plumbing so stats output respects the same scope as search/dedup/list-identities. (Predicate centralization and unscoped-catalog semantics restoration land in the next commit.)

Banner output mentions when a run is collection-scoped so the user can see they are crossing source boundaries.
…tion
Replaces the global flat-map identity model with per-account confirmed
identities, so an address that means "me" in one source no longer
silently means "me" in every source.
- internal/store/account_identities: new table (source_id, address,
source_signal, confirmed_at) with AddAccountIdentity,
ListAccountIdentities, and GetIdentitiesForScope helpers. Addresses
are lowercased; insert is idempotent on (source_id, address).
- internal/store/migrations: small applied_migrations sentinel table
with IsMigrationApplied / MarkMigrationApplied.
- internal/store/migrate_legacy_identity: MigrateLegacyIdentityConfig
fans the legacy [identity].addresses out to every existing source on
first startup, marks the sentinel, and is a no-op on subsequent
runs. RunStartupMigrations wraps it and returns a one-shot user
notice when the migration applies.
- cmd/store_resolver: openLocalStoreAndInit chains Open + InitSchema +
RunStartupMigrations. Every command that opens the local store
(~27 sites) now calls runStartupMigrations after InitSchema; legacy
[identity] config in config.toml is read only at migration time and
triggers a one-time deprecation warning.
- internal/dedup: Config.IdentityAddresses (flat map) replaced with
  IdentityAddressesBySource map[int64]map[string]struct{}. Sent-copy
  matching now keys per-source via the message's SourceID and is
  case-insensitive. Existing OR'd signals (SENT label, is_from_me)
remain.
- cmd/deduplicate: builds the per-source identity map by calling
GetIdentitiesForScope per source, both in the explicitly-scoped path
and the per-source iteration in runDeduplicatePerSource.
- cmd/list-identities: gains --confirmed to print the persisted union
of identities for a scope (account or collection), distinct from
the existing discovery output.
- internal/dedup: cleanup picked up while wiring the new config:
printDedupSummary now surfaces the batch ID even on Execute errors,
and the dead `m.ID == batchID` branch in undo manifest matching is
removed.
After this commit, dedup sent-copy detection treats identity correctly
per-source. Migration preserves prior behavior for users with an
existing [identity] config block by copying every legacy address to
every existing account on upgrade.
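A sketch of the per-source sent-copy check this config shape enables; the field name is from the commit, while isOwnAddress and the receiver are hypothetical:

```go
package dedup

import "strings"

// Config carries confirmed identity addresses keyed by source ID, replacing
// the old flat map that treated "me" as global.
type Config struct {
	IdentityAddressesBySource map[int64]map[string]struct{}
}

// isOwnAddress reports whether from is a confirmed identity for sourceID.
func (c *Config) isOwnAddress(sourceID int64, from string) bool {
	addrs, ok := c.IdentityAddressesBySource[sourceID]
	if !ok {
		return false // "me" in another source does not leak into this one
	}
	_, ok = addrs[strings.ToLower(from)]
	return ok
}
```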
A live message is one not hidden by dedup AND not recorded as deleted from the source server. This commit puts that contract behind a single helper and applies it to every read surface that previously missed either column.

- internal/store/live_messages.go: LiveMessagesWhere(alias) returns the canonical SQL predicate (sketched below). Used by every read path that touches the messages table.
- internal/store/store.go: GetStats applies LiveMessagesWhere to the message count; thread, attachment, and label counts keep catalog COUNT(*) in the unscoped path so msgvault stats matches its prior user-visible numbers. Adds GetStatsForScope(sourceIDs), which uses message-derived counts under scope (where catalog tables have no scope column to filter on); SourceCount in scoped mode is the count of distinct source IDs in the scope.
- internal/vector/sqlitevec/backend.go: seedPending, dropDeletedFromSource, and filteredMessageIDs all use the predicate so embedding seeding and retrieval honor the contract. seedPending carries a note that dedup Execute does not remove vector-store rows by design; query-time live filtering is the contract.
- internal/vector/sqlitevec/fused.go batchGetSubjects: drops the redundant live filter in subject hydration. Liveness is enforced upstream in the ranking CTE; re-filtering at hydration would silently strip subjects for any row soft-deleted between ranking and hydration.
- internal/query/sqlite_text.go and duckdb_text.go: text/FTS message search applies the predicate.
- Vector test helpers (backend_testhelpers, fused_test, hybrid/engine_test, embed/testsupport, embed_vector_test) gain the deleted_at column in their schema fixtures. internal/sync/sync_test updates one assertion that asserted the old (buggy) GetStats behavior of including source-deleted rows in the count.

For each fixed read surface, an integration test inserts a row, soft-deletes it, and asserts it does not appear (text_search_live_test, backend_test, store_test live-message tests).
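A sketch of the canonical predicate in the single-argument form this commit describes (a later revision in this thread adds a flag gating the source-deletion half):

```go
package store

// LiveMessagesWhere returns the canonical liveness predicate, optionally
// qualified by a table alias.
func LiveMessagesWhere(alias string) string {
	p := ""
	if alias != "" {
		p = alias + "."
	}
	// Live = not hidden by dedup AND not recorded as deleted on the server.
	return p + "deleted_at IS NULL AND " + p + "deleted_from_source_at IS NULL"
}
```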
…fety

Adds the third rung in the dedup safety progression: scan -> hide -> local hard delete -> remote delete. Each rung is a separate, explicit user action; the system never escalates between them.

dedup-purge:

- new top-level command. --batch <id> (repeatable) purges rows hidden by specific dedup batches; --all-hidden purges every dedup-hidden row. Mutually exclusive, exactly one required.
- VACUUM INTO backup before deleting (--no-backup to opt out, mirroring deduplicate's UX); interactive [y/N] confirmation (-y to skip).
- internal/store/dedup.go gains PurgeBatch and PurgeAllHidden, both scoped to rows where deleted_at IS NOT NULL AND delete_batch_id IS NOT NULL so manually soft-deleted rows aren't caught up (see the sketch below). Cascade: attachments (metadata), message_recipients, message_labels, message_bodies, and message_raw all have ON DELETE CASCADE on messages(id), so the single DELETE handles them. Content-addressed attachment blobs survive purge by design.
- Vector and parquet caches may hold stale entries for purged rows; the command summary recommends 'build-cache --full-rebuild' after a large purge. Pending remote-deletion manifests reference source-server message IDs, not local rows, so they remain valid after a local purge.

Tests cover store-layer behavior (batch purge, all-hidden purge, count of distinct batches, FK cascade verification, no-op on unknown batch, post-purge undo returns zero) and CLI-level flag wiring (mutual exclusion, neither-flag rejection).

Also picked up while touching dedup correctness:

- BackfillRFC822IDs now wraps each batch in a transaction and counts updates only after a successful commit, so a mid-batch failure does not over-report `updated`. Progress reporting is monotonic against the precomputed total.
- Raw MIME backfill counts inserts from affected rows rather than pre-computing.
- Repeated decompression warnings are capped so a single corrupt blob doesn't flood logs.
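A sketch of the PurgeBatch scoping described above; column names are from the commit, the receiver shape is assumed. The ON DELETE CASCADE constraints on messages(id) remove the dependent recipient, label, body, and raw rows:

```go
package store

import (
	"context"
	"database/sql"
)

type Store struct{ db *sql.DB }

// PurgeBatch hard-deletes only rows that a specific dedup batch hid,
// leaving manually soft-deleted rows untouched.
func (s *Store) PurgeBatch(ctx context.Context, batchID string) (int64, error) {
	res, err := s.db.ExecContext(ctx, `
		DELETE FROM messages
		WHERE deleted_at IS NOT NULL
		  AND delete_batch_id = ?`, batchID)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
```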
roborev: Combined Review
Changed my mind. Am closing this branch and will submit a fresh PR in the morning.
Superseded by #304 — same scope, flat 5-commit history, design docs and worked-out alignment notes stripped out for review/merge clarity. Leaving this PR open as the historical record (the design alignment review trail and iterative refinement live here).
Authoritative reference for how msgvault organizes ingested communications, identifies which messages belong to whom, and removes redundant local copies without destroying the underlying archive. Defines the conceptual model (Account, Identity, Collection — AIC), schema, CLI surface, read-side contract, manifest formats, errors, and migration semantics. Companion to the dedup/identities/collections branch work and roborev review on PR wesm#286. The README and concept assets in this directory introduce the model; this spec is the contract.
Per the global attribution convention, default credit on shipped artifacts is Jesse's; joint credit is opt-in. The spec footer carried a "Drafted with Claude Code ... Primeradiant Superpowers" line that wasn't explicitly opted in for this artifact. Keep the @jesserobbins authorship and the @wesm + wesm#286 design-review acknowledgement.
Implements and resolves #278 (supersedes #286). Specification: [`docs/accounts-identities-collections-dedup/spec.md`](docs/accounts-identities-collections-dedup/spec.md).

## What this branch adds

A reusable account/collection scope, per-account identity storage, and a deduplication pipeline staged into a four-rung safety progression (scan → hide → local hard delete → remote delete). Every read surface in the archive routes through one canonical visibility predicate.

## New CLI surface

| Command | Purpose |
| --- | --- |
| `msgvault deduplicate` (aliases `dedup`, `dedupe`) | Find and merge duplicates by RFC822 Message-ID with optional `--content-hash` fallback. Scoped via `--account` or `--collection`. `--dry-run`, `--undo <batch-id>` (repeatable), `--delete-dups-from-source-server`. |
| `msgvault delete-deduped` | Local hard-delete rung. `--batch <id>` (repeatable) or `--all-hidden`. VACUUM INTO backup before deletion; interactive confirmation by default. |
| `msgvault identity {list,show,add,remove}` | Per-account identity management. `list` accepts `--account`/`--collection`/`--json`. Replaces the removed `list-identities` command. |
| `msgvault collection {list,create,add,remove,delete,show}` | User-defined groupings of accounts. The `All` collection is auto-managed and rejects explicit mutation with `ErrCollectionImmutable`. |
| `msgvault search --collection <name>` | Collection scope on FTS, vector, and hybrid search modes. `--account` and `--collection` are mutually exclusive. |
| `msgvault stats --account/--collection` | Per-scope stats. |
| `msgvault delete-staged --permanent` | Opt into batch permanent deletion. Default is Gmail trash (~30-day recovery). Mutually exclusive with `--yes`; `--permanent` requires typing `delete` to confirm. Execution is gated by `MSGVAULT_ENABLE_REMOTE_DELETE=1` for the v1 release. |

## Key semantics

- **One account = one source.** `--account` resolves an account/source identifier; collection names supplied to `--account` are rejected with a hint to use `--collection`. `--collection` resolves only collections; account names supplied to `--collection` are rejected symmetrically.
- **Cross-source dedup is opt-in via `--collection`.** Outside collection scope, dedup never crosses source boundaries. This protects sent-message provenance (alice's Sent and bob's Inbox share an RFC822 Message-ID and must both survive).
- **Remote deletion is same-source-only**, even under collection scope. Cross-account dedup hides locally; it never stages a remote delete that crosses sources.
- **Survivor selection** has source-type preference (default `gmail,imap,mbox,emlx,hey`) with tiebreakers on raw MIME presence, label count, archive time, and id. The sent-copy eligibility filter runs before survivor selection.
- **Content-hash fallback** is supplementary, not transitive. Groups with multiple Message-ID survivors are skipped; groups containing both an MID survivor and a sent-copy orphan are skipped (per spec § Detection / Survivor selection).

## Identity model

- Confirmed identities live in `account_identities (source_id, address, source_signal, confirmed_at)` with multi-signal set semantics (signals stored as a comma-delimited sorted set).
- Case is preserved on insert because the column accommodates phone E.164 and synthetic handles; comparison uses `identifierMatch` / `NormalizeIdentifierForCompare`, which case-folds email-shaped values only.
- A one-time legacy migration copies any `[identity].addresses` config block into per-account records. If no eligible source exists at startup, the migration is deferred until first source creation. Apple-Mail and other non-email source types are filtered.
- Auto-confirm runs at source creation for `add-account`, `add-imap`, `add-o365`, `import-mbox`, `import-emlx`, `import-whatsapp`, and `import-gvoice`. `import-imessage` is exempted (iMessage contacts are not self-identifying).
- `confirmDefaultIdentity` is the shared helper; the prompt is routed through `cmd.OutOrStdout()` so test harnesses can capture it.

## Live-message contract

`internal/store.LiveMessagesWhere(alias, hideDeletedFromSource)` is the canonical SQL fragment. It always filters `deleted_at IS NULL` (dedup losers are hidden everywhere) and gates `deleted_from_source_at IS NULL` behind the boolean (archive views may show source-deleted rows by design). All read paths route through it: `internal/store` API/list/search/summary, `internal/query` SQLite + DuckDB engines (including text search and the raw-MIME helper), `internal/vector/sqlitevec` backend and fused-search filter, dedup engine, and stats.

## Schema changes

- New tables: `collections`, `collection_sources`, `account_identities`, `applied_migrations`, plus dedup operation tracking.
- New column: `messages.deleted_at` (added via `ALTER TABLE` migration on schema upgrade).
- `internal/store/migrations.go` introduces a forward-only migration runner with `MarkMigrationApplied`.

## Safety guardrails

- `delete-staged` defaults to trash; `--permanent` is opt-in, mutually exclusive with `--yes`, and requires typing the literal word `delete`.
- `delete-deduped --all-hidden` always prompts, even when `--yes` is set.
- `delete-staged` execution is gated by `MSGVAULT_ENABLE_REMOTE_DELETE=1`. Read-only modes (`--list`, `--dry-run`, `show-deletion`) work without the gate.
- `deduplicate --dry-run` and `--undo` are mutually exclusive (undo writes; dry-run promises no writes).
- Cancelled deletion manifests are persisted in their own `cancelled/` directory rather than removed, so audit history survives.
- Database backup before any local hard delete or merge (skippable with `--no-backup`).

## TUI cache invalidation

The `tui.go` cache-staleness check counts dedup hides (`deleted_at >= LastSyncAt AND deleted_from_source_at IS NULL`) as a third signal alongside new messages and source-deletions, forcing a full Parquet rebuild when present. Counts are kept disjoint from source-deletion counts to avoid double-attribution.

## Documentation

- [`docs/accounts-identities-collections-dedup/spec.md`](docs/accounts-identities-collections-dedup/spec.md) — full specification (scope semantics, identity model, dedup detection/survivor selection, safety progression, CLI surface).
- [`docs/accounts-identities-collections-dedup/README.md`](docs/accounts-identities-collections-dedup/README.md) — narrative overview with concept diagrams.

## Notable behavior changes

- `list-identities` removed; use `identity list`.
- `[identity].addresses` in `config.toml` becomes legacy input after the one-time migration; write per-account identities instead.
- Remote-source eligibility for staged deletion is currently `gmail` only. The manifest format (`GmailIDs`) and executor (`gmail.API`) are Gmail-specific; IMAP is gated until a manifest format that records source type and an IMAP executor exist.
- Collection identity is derived as the union of member-account confirmed identities; it is not configured separately.
- After a large `delete-deduped` run, vector and Parquet caches may hold stale entries; the command recommends `build-cache --full-rebuild`.

---------

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements the identity, collection, and deduplication system from #278.
New commands
- `list-identities` — auto-discover sent-from addresses across accounts using three signals (From header, OAuth, config); prints likely identities for configuring the `[identity].addresses` list.
- `collections` — manage named source groups (`create`, `list`, `show`, `add`, `remove`, `delete`); a default `All` collection is seeded automatically and tracks every source.
- `deduplicate` — find and merge duplicates across sources for the same account.

`deduplicate` behavior

- Groups by `Message-ID`; optional `--content-hash` also groups by normalized raw MIME.
- Picks a survivor (`--prefer <source-types>` and identity-based sent-copy preference), unions labels, and soft-deletes the pruned copies.
- Dry-run by default (`--dry-run`); `--apply` required to write.
- `--undo <batch-id>` restores hidden rows; `--undo` is repeatable to undo multiple batches.
- Backup via `VACUUM INTO` (skip with `--no-backup`).
- `--delete-dups-from-source-server` additionally stages pruned copies for remote deletion (destructive, opt-in; remote sources only).

Query layer

- `SourceIDs` filter propagated through DuckDB and SQLite query paths, plus soft-delete exclusion everywhere.

Closes #278.