docs: add accounts/identities/collections/dedup specification#308
Closed
jesserobbins wants to merge 60 commits into
Closed
docs: add accounts/identities/collections/dedup specification#308jesserobbins wants to merge 60 commits into
jesserobbins wants to merge 60 commits into
Conversation
…urce filter Schema and store-layer foundations for the identities/collections/ deduplication feature. - internal/store/schema.sql: adds the canonical collections, collection_sources, account_identities, applied_migrations, and related tables (and indexes) the rest of the feature builds on. - internal/store/collections.go: CRUD for collections, default-All bootstrap (EnsureDefaultCollection), atomic membership updates, and lookup by name. - internal/store/live_messages.go: LiveMessagesWhere(alias) returns the canonical SQL predicate that a row is not dedup-hidden and not recorded as deleted from the source server. Used everywhere a read surface needs the live-message contract. - internal/store/store.go: applies LiveMessagesWhere to GetStats message count; adds GetStatsForScope(sourceIDs) for per-account or per-collection stats. Catalog counts (threads, attachments, labels) remain unchanged in the unscoped path. - internal/store/sync.go: GetOrCreateSource auto-adds new sources to the default All collection; logs at warn level on failure rather than swallowing errors. - internal/query/models.go and source_filter.go: introduce MessageFilter.SourceIDs (multi-source) alongside SourceID, with a shared appendSourceFilter helper that emits source_id IN (...). - internal/config/config.go: legacy [identity].addresses block and IdentityAddressSet helper. Used only by the one-time migration to per-account identities. - internal/logging/logging.go: log file open warning includes the resolved path so users can tell which path actually failed. - cmd/msgvault/cmd/store_resolver.go: openLocalStoreAndInit chains Open + InitSchema + RunStartupMigrations so commands using a local store consistently apply schema and run the legacy-identity migration. - cmd/msgvault/cmd/build_cache: minor changes to keep the parquet cache builder honest about the new schema.
Adds the store-layer dedup operations the engine relies on, plus the explicit local hard-delete rung of the safety progression: scan -> hide -> local hard delete -> remote delete. Each rung is a separate, explicit user action; the system never escalates between them. Store-layer dedup operations: - internal/store/dedup.go: FindDuplicatesByRFC822ID and GetDuplicateGroupMessages drive the engine's scan path; MergeDuplicates writes (deleted_at, delete_batch_id) atomically; UndoDedup clears them for a batch and returns rows-restored. - BackfillRFC822IDs wraps each batch in a transaction and counts updates only after a successful commit so a mid-batch failure does not over-report `updated`. Progress reporting is monotonic against the precomputed total. - StreamMessageRaw and the raw-MIME backfill keep the dedup engine honest about candidate rows that have raw MIME available. Hard-delete rung: - internal/store/dedup.go: PurgeBatch and PurgeAllHidden permanently remove dedup-hidden rows. Both scope to rows where deleted_at IS NOT NULL AND delete_batch_id IS NOT NULL so manually soft-deleted rows aren't caught up. Cascade: attachments (metadata), message_recipients, message_labels, message_bodies, and message_raw all have ON DELETE CASCADE on messages(id), so the single DELETE handles them. Content-addressed attachment blobs survive purge by design. - cmd/msgvault/cmd/dedup_purge.go: top-level command. --batch <id> (repeatable) purges rows hidden by specific dedup batches; --all-hidden purges every dedup-hidden row. Mutually exclusive, exactly one required. VACUUM INTO backup before deleting (--no-backup to opt out, mirroring deduplicate's UX); interactive [y/N] confirmation (-y to skip). Vector and parquet caches may hold stale entries for purged rows; the command summary recommends 'build-cache --full-rebuild' after a large purge. Pending remote-deletion manifests reference source-server message IDs, not local rows, so they remain valid after a local purge. Tests cover store-layer behavior (batch purge, all-hidden purge, distinct-batch counting, FK cascade, no-op on unknown batch, post-purge undo returns zero) and CLI flag wiring (mutual exclusion, neither-flag rejection).
…, and migration - Gate remote-delete to Gmail only: drop "imap" from remoteSourceTypes. IMAP runs with --delete-dups-from-source-server staged Gmail-format manifests that delete-staged would misroute. - Exclude dedup-hidden rows from the analytics cache: the primary sqlite_scanner export path was missing the `deleted_at IS NULL` filter the CSV-fallback path already applied. Linux/macOS full rebuilds could surface deleted rows in analytics. The CSV fallback also gains a TIMESTAMP override for `deleted_at` so both export paths expose the same relation and DuckDB's Binder can resolve the column. - Gate delete-staged execution behind MSGVAULT_ENABLE_REMOTE_DELETE=1, with both --list and --dry-run always permitted (they don't invoke the Gmail API). The gate check now sits below the read-only branches so neither inspection mode can be blocked by the env var. - Route --undo before scope resolution: handle --undo before resolving --account / --collection so a stale or renamed account can't block undoing a valid batch. cobra marks --undo mutually exclusive with the scope flags. --undo errors aggregate across batches instead of stopping at the first failure. - Add a deferred return value to MigrateLegacyIdentityConfig so the "addresses configured but no sources yet" case is distinguishable from "already applied" and "nothing to migrate". RunStartupMigrations surfaces a notice on the deferred path so users know their legacy [identity] config is parked until the first source is created. - Route dedup queries through LiveMessagesWhere: replace ad-hoc `deleted_at IS NULL` predicates in FindDuplicatesByRFC822ID, GetDuplicateGroupMessages, GetAllRawMIMECandidates, CountActiveMessages, CountMessagesWithoutRFC822ID, and BackfillRFC822IDs with the canonical helper so the live-message contract is applied uniformly.
FormatMethodology now branches on collection vs account scope. Collection mode prints that cross-account merging is intentional (with --undo as the escape hatch); account/per-source mode keeps the existing 'NEVER merges across accounts' wording. Adds Config.ScopeIsCollection plumbed from the --collection branch in cmd. LiveMessagesWhere docstring no longer claims internal/query MUST apply the predicate. The query engines intentionally apply only deleted_at by default and gate deleted_from_source_at behind the opt-in HideDeletedFromSource filter, because msgvault is an archive — rows deleted from the source server stay visible by default.
…bind, deferred-notice count Closes the four open roborev findings on PR 304 that survived the 5188612/eb7aaaf consolidation, in one bundled commit: - deduplicate.go: declare --undo mutually exclusive with the destructive / scan-side flags (--delete-dups-from-source-server, --prefer, --content-hash, --no-backup, --yes). Cobra's existing exclusivity only covered --account/--collection/--dry-run, so a user invoking `msgvault deduplicate --undo X --delete-dups-from-source-server` had the destructive flag silently ignored. Now the CLI rejects the combination at parse time. - collection.go: gate the numeric-source-id branch in resolveAccountList on a leading-digit check. strconv.ParseInt accepts a leading '+', so an E.164 phone identifier like "+15551234567" parsed as the integer 15551234567 and was treated as a source ID, silently breaking WhatsApp/Google Voice accounts that key on phone numbers. Restricting the numeric branch to tokens whose first byte is a decimal digit lets signed inputs fall through to identifier resolution. - account_identity.go: confirmDefaultIdentity now skips the write when the source already has any identity rows. Every ingest command (add-account / add-imap / add-o365 / import-*) calls this on each invocation, including reruns and OAuth rebinds, so without this guard an identity the user explicitly removed via `identity remove` would be re-added on the next ingest re-run, silently affecting dedup sent-copy detection. The guard preserves the documented "freshly created source" intent and degrades gracefully if the user has removed every identity (in which case the default is restored, which matches the user's likely intent on a clean slate). - migrate_legacy_identity.go: the deferred-path notice now reports the post-normalization address count rather than len(rawAddresses). Configs like `addresses = ["", " ", "alice@x"]` previously announced "3 address(es) parked" while only one was actually queued. Test updated for the new contract.
Roborev flagged that engine.Scan() runs before the user confirmation and backup, but Scan can backfill rfc822_message_id when not in dry-run mode, so a user who declines the prompt has had the database touched without the dedup-merge backup. The behavior is correct — backfill is idempotent metadata derivation (fills a previously-NULL column from MIME data already on the row, never overwrites a non-NULL value, and changes no message content) and the backup contract exists for the dedup merge (which soft-deletes losers and transfers labels), not for this metadata catch-up. Restructuring Scan to be read-only would force either a double-pass backfill or a report that under-reports duplicates that backfill would have surfaced; neither is worth the cost given backfill's risk profile. What's actually missing is consent legibility: the existing prompt says "hide N duplicates (reversible with --undo)" without telling the user that scan already wrote N' rfc822_message_id values (and that those stay regardless of the answer). FormatReport already prints the backfill count, so the user sees it in the scan output, but the prompt doesn't scope itself. - Add a one-line note to both confirm prompts (single-source and per-source paths) when BackfilledCount > 0, stating that scan already performed the backfill and it is kept regardless of the answer. - Expand the godoc on dedup.Engine.Scan to document the side effect, why it lives in scan rather than execute, and why the backup contract is unchanged. No behavior change.
The surrounding loop already does an `if p == "" { continue }` on the
trimmed token, so the leading-digit check can index p[0] directly.
…hint Account = one ingest source/archive; collection = named grouping of accounts. Cross-source comparison requires --collection. Updates the deduplicate Long help, package docstring, and FormatMethodology so they describe the model instead of the legacy 'Gmail sync + mbox of the same mailbox' framing. Adds a one-liner to the methodology output saying Message-ID is primary and content-hash is supplementary. The dry-run report previously suggested rerunning with --apply, which does not exist; rephrase as 're-run without --dry-run'.
Previously, runHybridSearch was called before scope resolution and its body looked at searchAccount directly while ignoring searchCollection. A 'msgvault search --collection X --mode hybrid' silently searched the entire archive. Hoist scope resolution into resolveSearchScope, called from searchCmd.RunE before the mode branch, and pass the resolved Scope into both runLocalSearch and runHybridSearch. The vector path now copies scope.SourceIDs() into filter.SourceIDs (covering --account and --collection) and prints the same collection banner the FTS path prints. Tests cover --collection scoping for FTS and (under the sqlite_vec build tag) vector mode.
ListMessages, GetMessagesSummariesByIDs, SearchMessages, SearchMessagesQuery, searchMessagesLike, CountMessagesForSource, CountMessagesWithRaw, GetRandomMessageIDs, and CopySubset all filtered only on deleted_from_source_at IS NULL, leaving dedup-hidden rows visible through the API, batch-summary hydration paths, count helpers, random-sample helper, and the subset copier. Route them through internal/store.LiveMessagesWhere so the canonical predicate (deleted_at IS NULL AND deleted_from_source_at IS NULL) is applied uniformly. internal/query/sqlite.go GetMessageSummariesByIDs adds m.deleted_at IS NULL alongside the existing deleted_from_source_at filter so vector/hybrid hit hydration matches the rest of the query layer's policy.
cacheNeedsBuild watched only deleted_from_source_at, so a dedup run after the last cache build left soft-hidden rows in Parquet until a manual --full-rebuild. Add a parallel COUNT on deleted_at after LastSyncAt that triggers a full rebuild the same way deletions already do. Drop the now-redundant 'msgvault build-cache --full-rebuild' hint from the dedup post-run summary.
Previously delete-staged defaulted to permanent batch deletion with --trash as the recoverable opt-in, requiring the elevated mail.google.com OAuth scope by default. Flip the safety default: trash (gmail.modify scope, recoverable for 30 days) is the default, and --permanent is the explicit opt-in that escalates to mail.google.com. The MSGVAULT_ENABLE_REMOTE_DELETE=1 release gate is unchanged. Update the dedup post-run hint, command help, examples, prompt text, and quickstart docs to reference --permanent.
Adds info-level start/done log lines to Engine.Scan, Engine.Execute, and Engine.Undo so a dedup run is traceable from the daily log without enabling debug. Each entry line records scope (account, sources, is_collection, dry_run, content_hash_fallback); each exit line records counts (groups, messages_to_prune / groups_merged, messages_removed, labels_transferred, raw_mime_backfilled, manifests_cancelled, manifests_still_running) and duration_ms. Mirrors the structured logging pattern used in cmd/msgvault/cmd/search.go.
Vector and hybrid modes need text to embed; --account or --collection alone don't supply any. Fail fast at the command boundary with a clear message and a hint to use --mode=fts for queryless scoped searches. Found by codex review on jesse/identities-collections-dedup.
…ack; document BackfillRFC822IDs failed-row semantics DeleteAllDeduped's docstring promises 'every dedup-hidden row regardless of batch', but the DELETE filter required delete_batch_id IS NOT NULL. That's currently a no-op (every dedup writer sets the batch ID), but the filter would silently drop any future writer that hides a row without a batch. Drop the redundant filter; keep distinct-batch counting on the narrower predicate so reporting is unchanged. While in this function, switch the deferred rollback to the committed- flag pattern so a Commit() error doesn't leave a confusing rollback attempt on a finalized tx. Add a one-line comment in BackfillRFC822IDs explaining that failed rows are not retried in the same run (lastID advances past them) and will re-enter the candidate set on the next invocation via the 'rfc822_message_id IS NULL' filter.
Email addresses are case-insensitive in practice, but the unique key on account_identities is (source_id, address) with case preserved (chat handles need that). RemoveAccountIdentity now folds case for tokens containing '@' so 'identity remove Foo@x.com' matches a row stored as 'foo@x.com'. Synthetic identifiers without '@' still match case-sensitively.
EnsureDefaultCollection re-runs INSERT OR IGNORE on every InitSchema, which means a 'collection remove All <account>' silently reverts on the next CLI invocation. AddSourcesToCollection, RemoveSourcesFromCollection, and DeleteCollection now reject the default name with ErrCollectionImmutable so users get a clear error instead of a silent revert. Also fix idiomatic ordering of rows.Err()/rows.Close() in hydrateCollection.
RunStartupMigrations now returns a structured StartupMigrationResult
with Applied/Deferred flags so the CLI's runStartupMigrations can log
the deferred case ('migration deferred until a source exists') without
also emitting the misleading 'legacy identity migrated' line that the
applied path uses.
…dicate
cacheNeedsBuild's deleted_at COUNT now adds 'AND deleted_from_source_at
IS NULL' so a row that is both source-deleted and dedup-hidden after
LastSyncAt isn't reported twice in the reason string. Both still
trigger the same FullRebuild=true.
LiveMessagesWhere returns package-level constants for the two common
aliases ("" and "m") so hot read paths don't allocate a fmt.Sprintf
result on every call.
Databases created before this branch landed don't have the deleted_at column on messages, but the new code paths (LiveMessagesWhere, dedup engine, cache staleness) read it directly. Without this migration, upgraded databases fail with 'no such column: deleted_at' on the first read. Add the ALTER to the schema-upgrade list alongside delete_batch_id and deleted_from_source_at, and a regression test that opens a pre-deleted_at messages table and confirms InitSchema adds the column.
DeleteAllDeduped now deletes every dedup-hidden row regardless of
batch (the d246fea broadening), but the CLI precount still required
delete_batch_id IS NOT NULL on the row count. On databases that
contain batchless dedup-hidden rows the user prompt under-reports
('Will permanently delete N') vs the actual count printed afterward.
Split the count into two queries so the row count uses the broad
predicate (matching DeleteAllDeduped) while distinct-batch count
keeps the narrower one (only meaningful for rows with a batch ID).
The case-insensitive remove fix in 409b3b1 left the add path case-sensitive, so 'identity add Foo@x.com' on top of an existing 'foo@x.com' inserted a duplicate row that 'identity remove' would then collapse on first call. AddAccountIdentity (and the legacy-identity migration) now look up existing rows with LOWER(address) when the input contains '@', and update the existing row's signal set instead of inserting a duplicate. The first-write casing is preserved in storage; chat handles and other synthetic identifiers still match case-sensitively. Tests cover the paired add+remove case-fold, the case-preserving storage invariant, and the chat-handle case-sensitive guard.
cfg.DatabaseDSN() can return a file: URI when [data].database_url is configured. The dedup and delete-deduped backup builders compose paths with filepath.Dir/Base + concatenation, which fails for URI inputs (produces 'file:/...db.dedup-backup-...' that VACUUM INTO can't write). Add cfg.DatabasePath(): returns the on-disk path for plain filesystem DSNs, strips the scheme/query for file: URIs, and rejects non-file DSNs (postgres://, mysql://) where the SQLite-only backup helpers can't operate. Both backup call sites resolve the path up-front so non-file DSNs error before the first scan, not at backup time. Also moves the unused dedup.NewEngine construction in runDeduplicate inside the single-account branch — the per-source path builds its own scoped engines.
The previous implementation stripped the 'file:' prefix and any '?' query suffix manually, leaving percent-encoded bytes intact. A path like 'file:/var/lib/my%20vault.db' was returned as '/var/lib/my%20vault.db' (literal '%20'), which doesn't exist on disk and breaks backup/existence checks for paths with spaces or other escaped characters. Use url.Parse so percent-encoding decodes correctly, and fall back to u.Opaque for relative-path 'file:rel/path' shape that SQLite also accepts. Tests cover percent-encoded paths, relative paths, query suffixes, and the existing rejection of non-file DSNs.
The previous heuristic 'strings.Contains(addr, "@")' folded case for any identifier with an '@' anywhere in it. Matrix MXIDs like '@user:server.org' start with '@' and contain a '.' so they would have been incorrectly treated as case-insensitive, collapsing distinct chat handles into one row. Replace with looksLikeEmail(addr): '@' must not be at index 0 and the substring after the last '@' must contain '.'. That accepts standard emails (foo@x.com) and excludes Matrix MXIDs, bare handles, and phone-shaped strings. Apply at the AddAccountIdentity, RemoveAccountIdentity, and legacy-config migration sites. Test coverage: existing email case-fold pair, new Matrix MXID guard (two distinct cases produce two rows), and the existing chat-handle case-sensitive guard.
…rationApplied db.Exec never returns sql.ErrNoRows — that sentinel only comes from Row.Scan on a missing row. The conditional was never hit and made the error path harder to read. Remove the clause and the now-unused database/sql + errors imports.
… ListAccountIdentities errors
…legacy cleanup in CLI
Legacy [identity] migration parks (deferred=true) when no source exists at startup. The previous flow ran runStartupMigrations before GetOrCreateSource in every ingest command, so the deferred migration never fired on the same invocation that created the first source — the user only got their per-account identities on a *later* command. Add runPostSourceCreateMigrations and call it after source creation in all ingest paths (add-account, add-imap, add-o365, import-mbox, import-emlx, import-imessage, import-gvoice, import-whatsapp, serve's scheduled sync). MigrateLegacyIdentityConfig is idempotent once the sentinel is set, so the post-create call is O(1) on subsequent invocations. Add CLI-level regression test exercising the deferred-then-applied path on a single add-account invocation.
Previous fix re-ran the migration after first source creation but
left the pre-source pass intact. On a fresh install with [identity]
addresses configured, the user saw both the deferred notice ("will
run on the next command") and the applied notice in the same
add-account invocation — the deferred notice was a lie.
Add runStartupMigrationsForIngest variant that skips the pre-source
call when ListSources is empty. The post-source-create call does
all the work and emits the accurate "applied" notice once. Steady-
state behavior is unchanged.
Strengthen the regression test to assert the deferred notice does
NOT fire in an invocation that creates the first source.
The post-import call covers both --no-resume and resume paths; the in-loop call only fired on the resume path and was a no-op the second time anyway (migration sentinel already set). Addresses roborev job 492 finding (claude-code, Low).
Two iter11 codex Low findings: 1. MigrateLegacyIdentityConfig deduped the input list by exact trimmed string, so a config block with case-variant emails (Alice@Example.com + alice@example.com) wrote duplicate rows for every source. Switch dedupe key to NormalizeIdentifierForCompare so the rule matches the rest of the identity subsystem; preserve first-seen casing for storage so synthetic identifiers (Matrix MXIDs) keep their case. 2. MergeFilterIntoQuery only applied filter.SourceIDs when len > 0, so an explicit empty (non-nil) slice fell through silently and the original query's AccountIDs leaked through. appendSourceFilter treats an explicit empty multi-source slice as match-nothing (1=0); align MergeFilterIntoQuery to the same contract via filter.SourceIDs != nil. Allocate the merged slice with make+copy instead of append-from-nil, which would collapse explicit empty back to nil. Tests cover both: case-variant collapse with synthetic-MXID preservation, plus distinguishing nil vs non-nil-empty SourceIDs.
…hrough in collection lookup Three iter12 findings: 1. Codex Medium — internal/store/dedup.go pre-lowered From: addresses in SQL (LOWER(p_from.email_address)) before passing to the dedup engine, which then re-normalizes case-aware via NormalizeIdentifierForCompare. For synthetic identifiers like Matrix MXIDs (case-sensitive contract elsewhere in the system), the SQL LOWER() collapsed @alice:matrix.org to @alice:matrix.org before the per-source identity match could find the case-mixed stored identity. Drop the LOWER() in both GetDuplicateGroupMessages and GetAllRawMIMECandidates; the Go-side normalize handles the email vs synthetic split correctly. Regression test added. 2. Codex Low — cmd/msgvault/cmd/collection.go resolveAccountList errored immediately when a plain-digit token failed source-ID lookup, blocking valid numeric identifiers (e.g. unprefixed phone numbers) from being resolved by identifier. Fall through to ResolveAccountFlag on numeric-lookup miss. Regression test added. 3. Claude Low — store_resolver.go logged Warn for both deferred and applied migration outcomes; the user-facing notice on stderr already covers it. Demote success paths to Info; leave Warn for the actual error case.
…allel test Three iter13 findings: 1. Codex Medium — stats.go's --collection guard checked Scope.IsEmpty but a collection with zero member sources has Collection set (so IsEmpty returns false) yet SourceIDs() returns []. The store layer treats empty SourceIDs as unscoped/global, so `stats --collection <empty>` printed archive-wide counts. Add an explicit len(sourceIDs) == 0 check matching deduplicate.go's pattern. Regression test added. 2. Claude Low — DuplicateMessage.FromEmail's struct comment still said "lower-cased From: address" after iter12 dropped LOWER() from the underlying SQL. A future reader following that comment would reintroduce the exact bug iter12 fixed. Updated to describe the case-aware contract (raw value preserved; Go-side NormalizeIdentifierForCompare splits email-vs-synthetic). 3. Claude Low — iter12's regression test only covered GetDuplicateGroupMessages, not the parallel content-hash path GetAllRawMIMECandidates which got the same LOWER() removal. Added TestStore_GetAllRawMIMECandidates_PreservesFromCase to cover the second SQL site.
…on methodology Two iter14 in-scope findings: 1. Codex Low — collection.go's resolveAccountList numeric-token path (introduced in iter12 fix 72be0eb) swallowed every error from GetSourceByID and fell through to identifier resolution. Real DB failures got reported as "not found" / "no valid accounts" instead of being surfaced. Added store.ErrSourceNotFound sentinel; resolveAccountList now uses errors.Is to fall through only on definitive not-found, returning real errors immediately. 2. Claude Low — dedup.FormatMethodology described --collection scope as "Cross-account dedup is enabled" even when the collection had only a single member source, where no cross-account merging can happen. Gate both methodology branches on len(AccountSourceIDs) > 1; single-member collections now report the same-account guarantee. Regression test added.
Iter15 codex Medium: when [identity] config has legacy addresses, runPostSourceCreateMigrations populated account_identities first, then confirmDefaultIdentity's len(existing)>0 guard suppressed the source's own account-identifier write — leaving the source without its own identity and breaking dedup sent-copy detection. Reorder ingest paths so confirmDefaultIdentity runs BEFORE runPostSourceCreateMigrations: addaccount.go (both token-reusable and post-authorize sites), addimap.go, addo365.go, import_gvoice.go, import.go (whatsapp), import_emlx.go (single + multi-account), import_mbox.go. The legacy migration uses set-semantics (mergeSignalSet), so the migration merges into existing rows correctly when called after the default write. import-imessage skipped (no confirmDefaultIdentity call — explicitly noDefaultIdentity). serve.go scheduled-sync skipped (daemon path doesn't write a default identity). Documented the ordering requirement in confirmDefaultIdentity's godoc and at each call site. Regression test added that doesn't pass --no-default-identity (the iter10 test did, masking this bug).
Three iter15 in-scope items: 1. printStillRunningWarning now says "currently executing" rather than "in progress / cannot be cancelled," so the message is unambiguous when both in-progress manifests and pending-cancel failures exist (the latter surface as a returned error from Undo, not via this warning). 2. runDeduplicatePerSource's exit-on-Execute-error path now prints the accumulated executedBatches undo hint via the new printAccumulatedUndoHint helper before returning the error. Without this, a user who hit an error on source N had no visibility into how to undo sources 1..N-1's changes. Test added for the helper. 3. Added a FUTURE REFINEMENT comment on DeleteAllDeduped's godoc noting that the `deleted_at IS NOT NULL` predicate is currently safe (only the dedup soft-hide path sets that column) but would need a positive marker (`delete_batch_id IS NOT NULL OR dedup_origin = 1`-style) or a column rename if a future feature shares `deleted_at` semantics. Out of scope for wesm#304; flagged for the future change-author.
Iter18 surfaced the same finding both agents flagged in iter16 + iter17 (which Jesse reverted as out-of-scope polish): migration error returns before printImportSummary in three call sites (importSingleAccount, importAutoAccounts loop, runImportMbox). Claude predicted the re-flag explicitly: "a future reader (or another iteration of this loop) will hit the same finding." Add brief inline comments at each site documenting the scope rationale — minimal in-scope shape for wesm#304's deferred legacy [identity] migration retry; restructuring pre-existing summary code is out-of-scope; migration is idempotent and retries on the next invocation; full polish tracked in private/drafts/2026-05-02-issue-import-migration-error-ux.md. Comments only; no behavior change.
…MessagesWhere
LiveMessagesWhere now takes (alias, hideDeletedFromSource). The dedup-hide
column (deleted_at) is filtered always; source-deleted handling stays
caller-controlled via the existing HideDeletedFromSource fields on
MessageOptions, MessageFilter, and search.Query.
Every read path now calls the helper:
- internal/store, internal/dedup, internal/vector existing helper
callers pass true (canonical live view).
- internal/vector inline raw-SQL fragments at backend.go:307,818,847
and fused.go:132 rewritten to interpolate the helper.
- internal/query list paths (sqlite.go:193,267,889,1145; duckdb.go:656,
860,1121,1473,2312) collapse the paired (deleted_at, opt-in
deleted_from_source_at) fragments into one helper call, passing the
call site's HideDeletedFromSource field.
- internal/query hydration paths (sqlite.go:744, sqlite_text.go:451,
duckdb_text.go:436) interpolate the helper with true.
- internal/query Gmail-staging paths (sqlite.go:1007, duckdb.go:1670)
collapse to one helper call passing literal true; this surface feeds
remote-deletion staging and must never honor an opt-in.
- SubAggregate's redundant opts.HideDeletedFromSource / filter
.HideDeletedFromSource handling collapses by OR-ing opts into filter
before buildFilterConditions; the inline append at duckdb.go:1064
is deleted.
Helper signature:
func LiveMessagesWhere(alias string, hideDeletedFromSource bool) string
Four pre-computed constants cover (""|"m") × (true|false). Other
aliases fall back to fmt.Sprintf.
New table-driven test in internal/store/live_messages_test.go covers
the constant fast-paths and fmt.Sprintf fallback at both boolean
values across three aliases.
`runHybridSearch` opens its own raw `*sql.DB` via `sql.Open`, bypassing `store.Open` / `store.InitSchema`. For scoped queries this was harmless because `resolveSearchScope` runs `InitSchema` on the same DSN beforehand. For unscoped queries the dispatch went straight to the raw open, so on a database upgraded from a pre-wesm#304 build the vector backend's `LiveMessagesWhere` filter hit "no such column: deleted_at" until something else triggered schema init. Hoist the init into the unscoped branch via a small `initLocalSchema` helper that opens, inits, migrates, and closes — migration state persists in the file on disk, and `runHybridSearch` reopens against the now-current schema. Adds a regression test that drops `deleted_at` and asserts the unscoped vector dispatch re-adds it before the search runs.
User-facing introduction to the AIC data model and the dedup safety ladder, with worked HOWTO examples for list-accounts, identity, collection, search, dedup, delete-deduped, and delete-staged. Diagram sources live alongside in assets/ as self-contained HTML that renders to PNG via headless Chrome; assets/README.md documents regeneration. spec.md is intentionally not included while the spec is still in draft.
Authoritative reference for how msgvault organizes ingested communications, identifies which messages belong to whom, and removes redundant local copies without destroying the underlying archive. Defines the conceptual model (Account, Identity, Collection — AIC), schema, CLI surface, read-side contract, manifest formats, errors, and migration semantics. Companion to the dedup/identities/collections branch work and roborev review on PR wesm#286. The README and concept assets in this directory introduce the model; this spec is the contract.
roborev: Combined Review (
|
Contributor
Author
|
Okay, this is silly. I was trying to avoid another commit to #304 but I'm just going to tack this on as it's the spec. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
This commits the authoritative specification for how msgvault organizes ingested communications, identifies which messages belong to whom, and removes redundant local copies without destroying the underlying archive — Accounts, Identities, Collections, Deduplication (AIC + dedup).
The spec exists today only in the working tree of #304's branch. Publishing it as its own PR lets the document be read, cited, and reviewed independently of the implementation. The document was ground-truth for the audit that drove the destructive-path safety follow-ups (the PRs that will follow this one).
The README and concept diagrams already in this directory introduce the model. This spec is the contract: schema, CLI surface, read-side predicate (
LiveMessagesWhere), manifest formats, error catalog, migration semantics, and the safety-ladder rungs.Summary
docs/accounts-identities-collections-dedup/spec.md(~926 lines, no other files touched).Test plan
docs/accounts-identities-collections-dedup/assets/resolve.Authored by @jesserobbins, with Claude Code (Opus 4.7) and Primeradiant Superpowers.