Skip to content

docs: add accounts/identities/collections/dedup specification#308

Closed
jesserobbins wants to merge 60 commits into
wesm:mainfrom
jesserobbins:jesse/spec-aic-dedup
Closed

docs: add accounts/identities/collections/dedup specification#308
jesserobbins wants to merge 60 commits into
wesm:mainfrom
jesserobbins:jesse/spec-aic-dedup

Conversation

@jesserobbins
Copy link
Copy Markdown
Contributor

Motivation

This commits the authoritative specification for how msgvault organizes ingested communications, identifies which messages belong to whom, and removes redundant local copies without destroying the underlying archive — Accounts, Identities, Collections, Deduplication (AIC + dedup).

The spec exists today only in the working tree of #304's branch. Publishing it as its own PR lets the document be read, cited, and reviewed independently of the implementation. The document was ground-truth for the audit that drove the destructive-path safety follow-ups (the PRs that will follow this one).

The README and concept diagrams already in this directory introduce the model. This spec is the contract: schema, CLI surface, read-side predicate (LiveMessagesWhere), manifest formats, error catalog, migration semantics, and the safety-ladder rungs.

Summary

Test plan

  • Spec renders correctly on GitHub (markdown, tables, asset links).
  • All linked assets in docs/accounts-identities-collections-dedup/assets/ resolve.

Authored by @jesserobbins, with Claude Code (Opus 4.7) and Primeradiant Superpowers.

…urce filter

Schema and store-layer foundations for the identities/collections/
deduplication feature.

- internal/store/schema.sql: adds the canonical collections,
  collection_sources, account_identities, applied_migrations, and
  related tables (and indexes) the rest of the feature builds on.
- internal/store/collections.go: CRUD for collections, default-All
  bootstrap (EnsureDefaultCollection), atomic membership updates,
  and lookup by name.
- internal/store/live_messages.go: LiveMessagesWhere(alias) returns
  the canonical SQL predicate that a row is not dedup-hidden and not
  recorded as deleted from the source server. Used everywhere a read
  surface needs the live-message contract.
- internal/store/store.go: applies LiveMessagesWhere to GetStats
  message count; adds GetStatsForScope(sourceIDs) for per-account or
  per-collection stats. Catalog counts (threads, attachments, labels)
  remain unchanged in the unscoped path.
- internal/store/sync.go: GetOrCreateSource auto-adds new sources to
  the default All collection; logs at warn level on failure rather
  than swallowing errors.
- internal/query/models.go and source_filter.go: introduce
  MessageFilter.SourceIDs (multi-source) alongside SourceID, with a
  shared appendSourceFilter helper that emits source_id IN (...).
- internal/config/config.go: legacy [identity].addresses block and
  IdentityAddressSet helper. Used only by the one-time migration to
  per-account identities.
- internal/logging/logging.go: log file open warning includes the
  resolved path so users can tell which path actually failed.
- cmd/msgvault/cmd/store_resolver.go: openLocalStoreAndInit chains
  Open + InitSchema + RunStartupMigrations so commands using a local
  store consistently apply schema and run the legacy-identity
  migration.
- cmd/msgvault/cmd/build_cache: minor changes to keep the parquet
  cache builder honest about the new schema.
Adds the store-layer dedup operations the engine relies on, plus the
explicit local hard-delete rung of the safety progression: scan ->
hide -> local hard delete -> remote delete. Each rung is a separate,
explicit user action; the system never escalates between them.

Store-layer dedup operations:
- internal/store/dedup.go: FindDuplicatesByRFC822ID and
  GetDuplicateGroupMessages drive the engine's scan path;
  MergeDuplicates writes (deleted_at, delete_batch_id) atomically;
  UndoDedup clears them for a batch and returns rows-restored.
- BackfillRFC822IDs wraps each batch in a transaction and counts
  updates only after a successful commit so a mid-batch failure does
  not over-report `updated`. Progress reporting is monotonic against
  the precomputed total.
- StreamMessageRaw and the raw-MIME backfill keep the dedup engine
  honest about candidate rows that have raw MIME available.

Hard-delete rung:
- internal/store/dedup.go: PurgeBatch and PurgeAllHidden permanently
  remove dedup-hidden rows. Both scope to rows where deleted_at IS
  NOT NULL AND delete_batch_id IS NOT NULL so manually soft-deleted
  rows aren't caught up. Cascade: attachments (metadata),
  message_recipients, message_labels, message_bodies, and
  message_raw all have ON DELETE CASCADE on messages(id), so the
  single DELETE handles them. Content-addressed attachment blobs
  survive purge by design.
- cmd/msgvault/cmd/dedup_purge.go: top-level command. --batch <id>
  (repeatable) purges rows hidden by specific dedup batches;
  --all-hidden purges every dedup-hidden row. Mutually exclusive,
  exactly one required. VACUUM INTO backup before deleting
  (--no-backup to opt out, mirroring deduplicate's UX); interactive
  [y/N] confirmation (-y to skip).

Vector and parquet caches may hold stale entries for purged rows; the
command summary recommends 'build-cache --full-rebuild' after a large
purge. Pending remote-deletion manifests reference source-server
message IDs, not local rows, so they remain valid after a local purge.

Tests cover store-layer behavior (batch purge, all-hidden purge,
distinct-batch counting, FK cascade, no-op on unknown batch,
post-purge undo returns zero) and CLI flag wiring (mutual exclusion,
neither-flag rejection).
…, and migration

- Gate remote-delete to Gmail only: drop "imap" from
  remoteSourceTypes. IMAP runs with
  --delete-dups-from-source-server staged Gmail-format manifests
  that delete-staged would misroute.

- Exclude dedup-hidden rows from the analytics cache: the primary
  sqlite_scanner export path was missing the `deleted_at IS NULL`
  filter the CSV-fallback path already applied. Linux/macOS full
  rebuilds could surface deleted rows in analytics. The CSV
  fallback also gains a TIMESTAMP override for `deleted_at` so
  both export paths expose the same relation and DuckDB's Binder
  can resolve the column.

- Gate delete-staged execution behind
  MSGVAULT_ENABLE_REMOTE_DELETE=1, with both --list and --dry-run
  always permitted (they don't invoke the Gmail API). The gate
  check now sits below the read-only branches so neither inspection
  mode can be blocked by the env var.

- Route --undo before scope resolution: handle --undo before
  resolving --account / --collection so a stale or renamed account
  can't block undoing a valid batch. cobra marks --undo mutually
  exclusive with the scope flags. --undo errors aggregate across
  batches instead of stopping at the first failure.

- Add a deferred return value to MigrateLegacyIdentityConfig so
  the "addresses configured but no sources yet" case is
  distinguishable from "already applied" and "nothing to migrate".
  RunStartupMigrations surfaces a notice on the deferred path so
  users know their legacy [identity] config is parked until the
  first source is created.

- Route dedup queries through LiveMessagesWhere: replace ad-hoc
  `deleted_at IS NULL` predicates in FindDuplicatesByRFC822ID,
  GetDuplicateGroupMessages, GetAllRawMIMECandidates,
  CountActiveMessages, CountMessagesWithoutRFC822ID, and
  BackfillRFC822IDs with the canonical helper so the live-message
  contract is applied uniformly.
FormatMethodology now branches on collection vs account scope. Collection
mode prints that cross-account merging is intentional (with --undo as the
escape hatch); account/per-source mode keeps the existing 'NEVER merges
across accounts' wording. Adds Config.ScopeIsCollection plumbed from the
--collection branch in cmd.

LiveMessagesWhere docstring no longer claims internal/query MUST apply
the predicate. The query engines intentionally apply only deleted_at by
default and gate deleted_from_source_at behind the opt-in
HideDeletedFromSource filter, because msgvault is an archive — rows
deleted from the source server stay visible by default.
…bind, deferred-notice count

Closes the four open roborev findings on PR 304 that survived the
5188612/eb7aaaf consolidation, in one bundled commit:

- deduplicate.go: declare --undo mutually exclusive with the destructive
  / scan-side flags (--delete-dups-from-source-server, --prefer,
  --content-hash, --no-backup, --yes). Cobra's existing exclusivity only
  covered --account/--collection/--dry-run, so a user invoking
  `msgvault deduplicate --undo X --delete-dups-from-source-server` had
  the destructive flag silently ignored. Now the CLI rejects the
  combination at parse time.

- collection.go: gate the numeric-source-id branch in resolveAccountList
  on a leading-digit check. strconv.ParseInt accepts a leading '+', so an
  E.164 phone identifier like "+15551234567" parsed as the integer
  15551234567 and was treated as a source ID, silently breaking
  WhatsApp/Google Voice accounts that key on phone numbers. Restricting
  the numeric branch to tokens whose first byte is a decimal digit lets
  signed inputs fall through to identifier resolution.

- account_identity.go: confirmDefaultIdentity now skips the write when
  the source already has any identity rows. Every ingest command
  (add-account / add-imap / add-o365 / import-*) calls this on each
  invocation, including reruns and OAuth rebinds, so without this guard
  an identity the user explicitly removed via `identity remove` would
  be re-added on the next ingest re-run, silently affecting dedup
  sent-copy detection. The guard preserves the documented "freshly
  created source" intent and degrades gracefully if the user has
  removed every identity (in which case the default is restored, which
  matches the user's likely intent on a clean slate).

- migrate_legacy_identity.go: the deferred-path notice now reports the
  post-normalization address count rather than len(rawAddresses).
  Configs like `addresses = ["", "  ", "alice@x"]` previously announced
  "3 address(es) parked" while only one was actually queued. Test
  updated for the new contract.
Roborev flagged that engine.Scan() runs before the user confirmation and
backup, but Scan can backfill rfc822_message_id when not in dry-run mode,
so a user who declines the prompt has had the database touched without
the dedup-merge backup.

The behavior is correct — backfill is idempotent metadata derivation
(fills a previously-NULL column from MIME data already on the row, never
overwrites a non-NULL value, and changes no message content) and the
backup contract exists for the dedup merge (which soft-deletes losers
and transfers labels), not for this metadata catch-up. Restructuring
Scan to be read-only would force either a double-pass backfill or a
report that under-reports duplicates that backfill would have surfaced;
neither is worth the cost given backfill's risk profile.

What's actually missing is consent legibility: the existing prompt says
"hide N duplicates (reversible with --undo)" without telling the user
that scan already wrote N' rfc822_message_id values (and that those
stay regardless of the answer). FormatReport already prints the
backfill count, so the user sees it in the scan output, but the prompt
doesn't scope itself.

- Add a one-line note to both confirm prompts (single-source and
  per-source paths) when BackfilledCount > 0, stating that scan already
  performed the backfill and it is kept regardless of the answer.
- Expand the godoc on dedup.Engine.Scan to document the side effect,
  why it lives in scan rather than execute, and why the backup contract
  is unchanged.

No behavior change.
The surrounding loop already does an `if p == "" { continue }` on the
trimmed token, so the leading-digit check can index p[0] directly.
…hint

Account = one ingest source/archive; collection = named grouping of
accounts. Cross-source comparison requires --collection. Updates the
deduplicate Long help, package docstring, and FormatMethodology so
they describe the model instead of the legacy 'Gmail sync + mbox of
the same mailbox' framing. Adds a one-liner to the methodology
output saying Message-ID is primary and content-hash is supplementary.

The dry-run report previously suggested rerunning with --apply, which
does not exist; rephrase as 're-run without --dry-run'.
Previously, runHybridSearch was called before scope resolution and
its body looked at searchAccount directly while ignoring
searchCollection. A 'msgvault search --collection X --mode hybrid'
silently searched the entire archive.

Hoist scope resolution into resolveSearchScope, called from
searchCmd.RunE before the mode branch, and pass the resolved Scope
into both runLocalSearch and runHybridSearch. The vector path now
copies scope.SourceIDs() into filter.SourceIDs (covering --account
and --collection) and prints the same collection banner the FTS
path prints. Tests cover --collection scoping for FTS and (under
the sqlite_vec build tag) vector mode.
ListMessages, GetMessagesSummariesByIDs, SearchMessages,
SearchMessagesQuery, searchMessagesLike, CountMessagesForSource,
CountMessagesWithRaw, GetRandomMessageIDs, and CopySubset all
filtered only on deleted_from_source_at IS NULL, leaving dedup-hidden
rows visible through the API, batch-summary hydration paths, count
helpers, random-sample helper, and the subset copier.

Route them through internal/store.LiveMessagesWhere so the canonical
predicate (deleted_at IS NULL AND deleted_from_source_at IS NULL) is
applied uniformly. internal/query/sqlite.go GetMessageSummariesByIDs
adds m.deleted_at IS NULL alongside the existing
deleted_from_source_at filter so vector/hybrid hit hydration matches
the rest of the query layer's policy.
cacheNeedsBuild watched only deleted_from_source_at, so a dedup run
after the last cache build left soft-hidden rows in Parquet until a
manual --full-rebuild. Add a parallel COUNT on deleted_at after
LastSyncAt that triggers a full rebuild the same way deletions
already do. Drop the now-redundant 'msgvault build-cache --full-rebuild'
hint from the dedup post-run summary.
Previously delete-staged defaulted to permanent batch deletion with
--trash as the recoverable opt-in, requiring the elevated
mail.google.com OAuth scope by default. Flip the safety default:
trash (gmail.modify scope, recoverable for 30 days) is the default,
and --permanent is the explicit opt-in that escalates to
mail.google.com. The MSGVAULT_ENABLE_REMOTE_DELETE=1 release gate
is unchanged.

Update the dedup post-run hint, command help, examples, prompt
text, and quickstart docs to reference --permanent.
Adds info-level start/done log lines to Engine.Scan, Engine.Execute,
and Engine.Undo so a dedup run is traceable from the daily log
without enabling debug. Each entry line records scope (account,
sources, is_collection, dry_run, content_hash_fallback); each exit
line records counts (groups, messages_to_prune / groups_merged,
messages_removed, labels_transferred, raw_mime_backfilled,
manifests_cancelled, manifests_still_running) and duration_ms.

Mirrors the structured logging pattern used in cmd/msgvault/cmd/search.go.
Vector and hybrid modes need text to embed; --account or --collection
alone don't supply any. Fail fast at the command boundary with a clear
message and a hint to use --mode=fts for queryless scoped searches.

Found by codex review on jesse/identities-collections-dedup.
…ack; document BackfillRFC822IDs failed-row semantics

DeleteAllDeduped's docstring promises 'every dedup-hidden row regardless
of batch', but the DELETE filter required delete_batch_id IS NOT NULL.
That's currently a no-op (every dedup writer sets the batch ID), but the
filter would silently drop any future writer that hides a row without a
batch. Drop the redundant filter; keep distinct-batch counting on the
narrower predicate so reporting is unchanged.

While in this function, switch the deferred rollback to the committed-
flag pattern so a Commit() error doesn't leave a confusing rollback
attempt on a finalized tx.

Add a one-line comment in BackfillRFC822IDs explaining that failed rows
are not retried in the same run (lastID advances past them) and will
re-enter the candidate set on the next invocation via the
'rfc822_message_id IS NULL' filter.
Email addresses are case-insensitive in practice, but the unique key on
account_identities is (source_id, address) with case preserved (chat
handles need that). RemoveAccountIdentity now folds case for tokens
containing '@' so 'identity remove Foo@x.com' matches a row stored as
'foo@x.com'. Synthetic identifiers without '@' still match
case-sensitively.
EnsureDefaultCollection re-runs INSERT OR IGNORE on every InitSchema,
which means a 'collection remove All <account>' silently reverts on
the next CLI invocation. AddSourcesToCollection,
RemoveSourcesFromCollection, and DeleteCollection now reject the
default name with ErrCollectionImmutable so users get a clear error
instead of a silent revert.

Also fix idiomatic ordering of rows.Err()/rows.Close() in
hydrateCollection.
RunStartupMigrations now returns a structured StartupMigrationResult
with Applied/Deferred flags so the CLI's runStartupMigrations can log
the deferred case ('migration deferred until a source exists') without
also emitting the misleading 'legacy identity migrated' line that the
applied path uses.
…dicate

cacheNeedsBuild's deleted_at COUNT now adds 'AND deleted_from_source_at
IS NULL' so a row that is both source-deleted and dedup-hidden after
LastSyncAt isn't reported twice in the reason string. Both still
trigger the same FullRebuild=true.

LiveMessagesWhere returns package-level constants for the two common
aliases ("" and "m") so hot read paths don't allocate a fmt.Sprintf
result on every call.
Databases created before this branch landed don't have the deleted_at
column on messages, but the new code paths (LiveMessagesWhere, dedup
engine, cache staleness) read it directly. Without this migration,
upgraded databases fail with 'no such column: deleted_at' on the
first read.

Add the ALTER to the schema-upgrade list alongside delete_batch_id and
deleted_from_source_at, and a regression test that opens a
pre-deleted_at messages table and confirms InitSchema adds the column.
DeleteAllDeduped now deletes every dedup-hidden row regardless of
batch (the d246fea broadening), but the CLI precount still required
delete_batch_id IS NOT NULL on the row count. On databases that
contain batchless dedup-hidden rows the user prompt under-reports
('Will permanently delete N') vs the actual count printed afterward.

Split the count into two queries so the row count uses the broad
predicate (matching DeleteAllDeduped) while distinct-batch count
keeps the narrower one (only meaningful for rows with a batch ID).
The case-insensitive remove fix in 409b3b1 left the add path
case-sensitive, so 'identity add Foo@x.com' on top of an existing
'foo@x.com' inserted a duplicate row that 'identity remove' would
then collapse on first call.

AddAccountIdentity (and the legacy-identity migration) now look up
existing rows with LOWER(address) when the input contains '@', and
update the existing row's signal set instead of inserting a
duplicate. The first-write casing is preserved in storage; chat
handles and other synthetic identifiers still match case-sensitively.

Tests cover the paired add+remove case-fold, the case-preserving
storage invariant, and the chat-handle case-sensitive guard.
cfg.DatabaseDSN() can return a file: URI when [data].database_url is
configured. The dedup and delete-deduped backup builders compose paths
with filepath.Dir/Base + concatenation, which fails for URI inputs
(produces 'file:/...db.dedup-backup-...' that VACUUM INTO can't write).

Add cfg.DatabasePath(): returns the on-disk path for plain filesystem
DSNs, strips the scheme/query for file: URIs, and rejects non-file
DSNs (postgres://, mysql://) where the SQLite-only backup helpers
can't operate. Both backup call sites resolve the path up-front so
non-file DSNs error before the first scan, not at backup time.

Also moves the unused dedup.NewEngine construction in runDeduplicate
inside the single-account branch — the per-source path builds its
own scoped engines.
The previous implementation stripped the 'file:' prefix and any '?'
query suffix manually, leaving percent-encoded bytes intact. A path
like 'file:/var/lib/my%20vault.db' was returned as
'/var/lib/my%20vault.db' (literal '%20'), which doesn't exist on disk
and breaks backup/existence checks for paths with spaces or other
escaped characters.

Use url.Parse so percent-encoding decodes correctly, and fall back to
u.Opaque for relative-path 'file:rel/path' shape that SQLite also
accepts. Tests cover percent-encoded paths, relative paths, query
suffixes, and the existing rejection of non-file DSNs.
The previous heuristic 'strings.Contains(addr, "@")' folded case for
any identifier with an '@' anywhere in it. Matrix MXIDs like
'@user:server.org' start with '@' and contain a '.' so they would have
been incorrectly treated as case-insensitive, collapsing distinct
chat handles into one row.

Replace with looksLikeEmail(addr): '@' must not be at index 0 and the
substring after the last '@' must contain '.'. That accepts standard
emails (foo@x.com) and excludes Matrix MXIDs, bare handles, and
phone-shaped strings. Apply at the AddAccountIdentity, RemoveAccountIdentity,
and legacy-config migration sites.

Test coverage: existing email case-fold pair, new Matrix MXID guard
(two distinct cases produce two rows), and the existing chat-handle
case-sensitive guard.
…rationApplied

db.Exec never returns sql.ErrNoRows — that sentinel only comes from
Row.Scan on a missing row. The conditional was never hit and made
the error path harder to read. Remove the clause and the now-unused
database/sql + errors imports.
Legacy [identity] migration parks (deferred=true) when no source
exists at startup. The previous flow ran runStartupMigrations before
GetOrCreateSource in every ingest command, so the deferred migration
never fired on the same invocation that created the first source —
the user only got their per-account identities on a *later* command.

Add runPostSourceCreateMigrations and call it after source creation
in all ingest paths (add-account, add-imap, add-o365, import-mbox,
import-emlx, import-imessage, import-gvoice, import-whatsapp,
serve's scheduled sync). MigrateLegacyIdentityConfig is idempotent
once the sentinel is set, so the post-create call is O(1) on
subsequent invocations.

Add CLI-level regression test exercising the deferred-then-applied
path on a single add-account invocation.
Previous fix re-ran the migration after first source creation but
left the pre-source pass intact. On a fresh install with [identity]
addresses configured, the user saw both the deferred notice ("will
run on the next command") and the applied notice in the same
add-account invocation — the deferred notice was a lie.

Add runStartupMigrationsForIngest variant that skips the pre-source
call when ListSources is empty. The post-source-create call does
all the work and emits the accurate "applied" notice once. Steady-
state behavior is unchanged.

Strengthen the regression test to assert the deferred notice does
NOT fire in an invocation that creates the first source.
The post-import call covers both --no-resume and resume paths;
the in-loop call only fired on the resume path and was a no-op
the second time anyway (migration sentinel already set).

Addresses roborev job 492 finding (claude-code, Low).
Two iter11 codex Low findings:

1. MigrateLegacyIdentityConfig deduped the input list by exact
   trimmed string, so a config block with case-variant emails
   (Alice@Example.com + alice@example.com) wrote duplicate rows
   for every source. Switch dedupe key to NormalizeIdentifierForCompare
   so the rule matches the rest of the identity subsystem; preserve
   first-seen casing for storage so synthetic identifiers (Matrix
   MXIDs) keep their case.

2. MergeFilterIntoQuery only applied filter.SourceIDs when len > 0,
   so an explicit empty (non-nil) slice fell through silently and
   the original query's AccountIDs leaked through. appendSourceFilter
   treats an explicit empty multi-source slice as match-nothing
   (1=0); align MergeFilterIntoQuery to the same contract via
   filter.SourceIDs != nil. Allocate the merged slice with make+copy
   instead of append-from-nil, which would collapse explicit empty
   back to nil.

Tests cover both: case-variant collapse with synthetic-MXID
preservation, plus distinguishing nil vs non-nil-empty SourceIDs.
…hrough in collection lookup

Three iter12 findings:

1. Codex Medium — internal/store/dedup.go pre-lowered From: addresses
   in SQL (LOWER(p_from.email_address)) before passing to the dedup
   engine, which then re-normalizes case-aware via
   NormalizeIdentifierForCompare. For synthetic identifiers like
   Matrix MXIDs (case-sensitive contract elsewhere in the system),
   the SQL LOWER() collapsed @alice:matrix.org to @alice:matrix.org
   before the per-source identity match could find the case-mixed
   stored identity. Drop the LOWER() in both GetDuplicateGroupMessages
   and GetAllRawMIMECandidates; the Go-side normalize handles the
   email vs synthetic split correctly. Regression test added.

2. Codex Low — cmd/msgvault/cmd/collection.go resolveAccountList
   errored immediately when a plain-digit token failed source-ID
   lookup, blocking valid numeric identifiers (e.g. unprefixed
   phone numbers) from being resolved by identifier. Fall through
   to ResolveAccountFlag on numeric-lookup miss. Regression test
   added.

3. Claude Low — store_resolver.go logged Warn for both deferred
   and applied migration outcomes; the user-facing notice on
   stderr already covers it. Demote success paths to Info; leave
   Warn for the actual error case.
…allel test

Three iter13 findings:

1. Codex Medium — stats.go's --collection guard checked Scope.IsEmpty
   but a collection with zero member sources has Collection set
   (so IsEmpty returns false) yet SourceIDs() returns []. The store
   layer treats empty SourceIDs as unscoped/global, so
   `stats --collection <empty>` printed archive-wide counts. Add
   an explicit len(sourceIDs) == 0 check matching deduplicate.go's
   pattern. Regression test added.

2. Claude Low — DuplicateMessage.FromEmail's struct comment still
   said "lower-cased From: address" after iter12 dropped LOWER()
   from the underlying SQL. A future reader following that comment
   would reintroduce the exact bug iter12 fixed. Updated to describe
   the case-aware contract (raw value preserved; Go-side
   NormalizeIdentifierForCompare splits email-vs-synthetic).

3. Claude Low — iter12's regression test only covered
   GetDuplicateGroupMessages, not the parallel content-hash path
   GetAllRawMIMECandidates which got the same LOWER() removal.
   Added TestStore_GetAllRawMIMECandidates_PreservesFromCase to
   cover the second SQL site.
…on methodology

Two iter14 in-scope findings:

1. Codex Low — collection.go's resolveAccountList numeric-token
   path (introduced in iter12 fix 72be0eb) swallowed every error
   from GetSourceByID and fell through to identifier resolution.
   Real DB failures got reported as "not found" / "no valid
   accounts" instead of being surfaced. Added store.ErrSourceNotFound
   sentinel; resolveAccountList now uses errors.Is to fall through
   only on definitive not-found, returning real errors immediately.

2. Claude Low — dedup.FormatMethodology described --collection
   scope as "Cross-account dedup is enabled" even when the
   collection had only a single member source, where no
   cross-account merging can happen. Gate both methodology
   branches on len(AccountSourceIDs) > 1; single-member collections
   now report the same-account guarantee. Regression test added.
Iter15 codex Medium: when [identity] config has legacy addresses,
runPostSourceCreateMigrations populated account_identities first,
then confirmDefaultIdentity's len(existing)>0 guard suppressed
the source's own account-identifier write — leaving the source
without its own identity and breaking dedup sent-copy detection.

Reorder ingest paths so confirmDefaultIdentity runs BEFORE
runPostSourceCreateMigrations: addaccount.go (both token-reusable
and post-authorize sites), addimap.go, addo365.go, import_gvoice.go,
import.go (whatsapp), import_emlx.go (single + multi-account),
import_mbox.go. The legacy migration uses set-semantics
(mergeSignalSet), so the migration merges into existing rows
correctly when called after the default write.

import-imessage skipped (no confirmDefaultIdentity call —
explicitly noDefaultIdentity). serve.go scheduled-sync skipped
(daemon path doesn't write a default identity).

Documented the ordering requirement in confirmDefaultIdentity's
godoc and at each call site. Regression test added that doesn't
pass --no-default-identity (the iter10 test did, masking this
bug).
Three iter15 in-scope items:

1. printStillRunningWarning now says "currently executing" rather
   than "in progress / cannot be cancelled," so the message is
   unambiguous when both in-progress manifests and pending-cancel
   failures exist (the latter surface as a returned error from
   Undo, not via this warning).

2. runDeduplicatePerSource's exit-on-Execute-error path now prints
   the accumulated executedBatches undo hint via the new
   printAccumulatedUndoHint helper before returning the error.
   Without this, a user who hit an error on source N had no
   visibility into how to undo sources 1..N-1's changes. Test
   added for the helper.

3. Added a FUTURE REFINEMENT comment on DeleteAllDeduped's godoc
   noting that the `deleted_at IS NOT NULL` predicate is currently
   safe (only the dedup soft-hide path sets that column) but
   would need a positive marker (`delete_batch_id IS NOT NULL OR
   dedup_origin = 1`-style) or a column rename if a future feature
   shares `deleted_at` semantics. Out of scope for wesm#304; flagged
   for the future change-author.
Iter18 surfaced the same finding both agents flagged in iter16 +
iter17 (which Jesse reverted as out-of-scope polish): migration
error returns before printImportSummary in three call sites
(importSingleAccount, importAutoAccounts loop, runImportMbox).
Claude predicted the re-flag explicitly: "a future reader (or
another iteration of this loop) will hit the same finding."

Add brief inline comments at each site documenting the scope
rationale — minimal in-scope shape for wesm#304's deferred legacy
[identity] migration retry; restructuring pre-existing summary
code is out-of-scope; migration is idempotent and retries on the
next invocation; full polish tracked in
private/drafts/2026-05-02-issue-import-migration-error-ux.md.

Comments only; no behavior change.
…MessagesWhere

LiveMessagesWhere now takes (alias, hideDeletedFromSource). The dedup-hide
column (deleted_at) is filtered always; source-deleted handling stays
caller-controlled via the existing HideDeletedFromSource fields on
MessageOptions, MessageFilter, and search.Query.

Every read path now calls the helper:

- internal/store, internal/dedup, internal/vector existing helper
  callers pass true (canonical live view).
- internal/vector inline raw-SQL fragments at backend.go:307,818,847
  and fused.go:132 rewritten to interpolate the helper.
- internal/query list paths (sqlite.go:193,267,889,1145; duckdb.go:656,
  860,1121,1473,2312) collapse the paired (deleted_at, opt-in
  deleted_from_source_at) fragments into one helper call, passing the
  call site's HideDeletedFromSource field.
- internal/query hydration paths (sqlite.go:744, sqlite_text.go:451,
  duckdb_text.go:436) interpolate the helper with true.
- internal/query Gmail-staging paths (sqlite.go:1007, duckdb.go:1670)
  collapse to one helper call passing literal true; this surface feeds
  remote-deletion staging and must never honor an opt-in.
- SubAggregate's redundant opts.HideDeletedFromSource / filter
  .HideDeletedFromSource handling collapses by OR-ing opts into filter
  before buildFilterConditions; the inline append at duckdb.go:1064
  is deleted.

Helper signature:

  func LiveMessagesWhere(alias string, hideDeletedFromSource bool) string

Four pre-computed constants cover (""|"m") × (true|false). Other
aliases fall back to fmt.Sprintf.

New table-driven test in internal/store/live_messages_test.go covers
the constant fast-paths and fmt.Sprintf fallback at both boolean
values across three aliases.
`runHybridSearch` opens its own raw `*sql.DB` via `sql.Open`,
bypassing `store.Open` / `store.InitSchema`. For scoped queries
this was harmless because `resolveSearchScope` runs `InitSchema`
on the same DSN beforehand. For unscoped queries the dispatch
went straight to the raw open, so on a database upgraded from a
pre-wesm#304 build the vector backend's `LiveMessagesWhere` filter
hit "no such column: deleted_at" until something else triggered
schema init.

Hoist the init into the unscoped branch via a small
`initLocalSchema` helper that opens, inits, migrates, and
closes — migration state persists in the file on disk, and
`runHybridSearch` reopens against the now-current schema.

Adds a regression test that drops `deleted_at` and asserts the
unscoped vector dispatch re-adds it before the search runs.
User-facing introduction to the AIC data model and the dedup safety
ladder, with worked HOWTO examples for list-accounts, identity,
collection, search, dedup, delete-deduped, and delete-staged. Diagram
sources live alongside in assets/ as self-contained HTML that renders
to PNG via headless Chrome; assets/README.md documents regeneration.
spec.md is intentionally not included while the spec is still in
draft.
Authoritative reference for how msgvault organizes ingested
communications, identifies which messages belong to whom, and
removes redundant local copies without destroying the underlying
archive. Defines the conceptual model (Account, Identity,
Collection — AIC), schema, CLI surface, read-side contract,
manifest formats, errors, and migration semantics.

Companion to the dedup/identities/collections branch work and
roborev review on PR wesm#286. The README and concept assets in this
directory introduce the model; this spec is the contract.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 3, 2026

roborev: Combined Review (f59aab1)

Code review verdict: no Medium, High, or Critical issues found.

All agents reported the code as clean or provided no actionable findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Okay, this is silly. I was trying to avoid another commit to #304 but I'm just going to tack this on as it's the spec.

@jesserobbins jesserobbins deleted the jesse/spec-aic-dedup branch May 3, 2026 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant