Skip to content

feat(admin): reindex-embeddings command (closes #743)#1258

Draft
grimmjoww wants to merge 4 commits intovectorize-io:mainfrom
grimmjoww:feat/reindex-embeddings-clean
Draft

feat(admin): reindex-embeddings command (closes #743)#1258
grimmjoww wants to merge 4 commits intovectorize-io:mainfrom
grimmjoww:feat/reindex-embeddings-clean

Conversation

@grimmjoww
Copy link
Copy Markdown
Contributor

Summary

Adds hindsight-admin reindex-embeddings — re-embeds rows where embedding IS NULL, idempotent, schema-introspected, model-agnostic. Closes #743.

The use case: swapping the embedding model on a populated bank. ensure_embedding_dimension (migrations.py) refuses to auto-migrate the vector(N) column dimension when rows already exist, leaving operators stuck mid-upgrade. The current workaround is hand-rolled Python; this gives the project an official, idempotent, resumable path.

What it does

  1. Discovers every vector(N) column in the schema by introspecting pg_attribute + pg_type — no hardcoded table list, so future tables Just Work.
  2. Counts rows per column where embedding IS NULL AND text IS NOT NULL.
  3. Verifies the loaded model's output dimension matches each column's stored dimension (fails fast with the exact ALTER TABLE command the operator needs to run).
  4. Re-embeds in batches via LocalSTEmbeddingsexecutemany. Only touches embedding IS NULL rows, so safe to interrupt with Ctrl-C and resume.
  5. Rebuilds HNSW / IVFFlat / vchordrq vector indexes (skip with --skip-index-rebuild).
  6. Optionally verifies recall via a self-match check (--verify-recall): samples 5 rows, queries each row's embedding via cosine distance, asserts the row appears in its own top-5 nearest neighbors. Catches broken pipelines before users discover them via degraded recall.

Flags

Flag Default Purpose
--schema, -s public Schema to reindex
--bank, -b (all) Filter to a single bank id
--batch-size, -B 16 Tune up on faster GPUs
--dry-run false Discover + count, no encode/write
--skip-index-rebuild false Skip REINDEX after re-embedding
--verify-recall false Self-match sanity check (works standalone too)
--auto-backup PATH (none) Run hindsight-admin backup before destructive work; abort on failure
--yes, -y false Skip confirmation prompt

Tested with

  • BAAI/bge-small-en-v1.5 (Hindsight default, 384-dim) — full end-to-end on a populated DB: 8 rows re-embedded in 0.3s, idempotent re-run = 0 pending, --verify-recall 5/5 self-matched.
  • 8 unit tests using an isolated schema fixture: column discovery, dimension parsing, pending count (zero + nonzero), batched re-embed, idempotency, bank filter, empty-text skip.

Designed model-agnostic — anything LocalSTEmbeddings can load works (sentence-transformers compatible). Built specifically for the Qwen3-Embedding-4B/8B upgrade scenario via --trust-remote-code-aware env vars; that path uses the same code, just with a different HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL and an ALTER TABLE ... TYPE vector(<NEW_DIM>) step before running the command.

Workflow

# 1. Backup
hindsight-admin backup /backups/pre-reindex.zip

# 2. Stop API. Update HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL.

# 3. Wipe vectors so column type can be migrated:
psql -c "ALTER TABLE memory_units ALTER COLUMN embedding TYPE vector(<NEW_DIM>) USING NULL;"
psql -c "ALTER TABLE mental_models ALTER COLUMN embedding TYPE vector(<NEW_DIM>) USING NULL;"

# 4. Restart API (loads new model — dimension matches now).

# 5. Re-embed:
hindsight-admin reindex-embeddings --auto-backup /backups/pre-reembed.zip --verify-recall --yes

Test plan

  • pytest tests/test_admin_reindex.py — 8/8 pass
  • ruff check + ruff format — clean
  • End-to-end on populated DB with bge-small-en-v1.5
  • --dry-run shows counts without encoding
  • --verify-recall works standalone (no rows pending → just runs the check)
  • --auto-backup aborts re-embed if backup fails
  • CI green

Happy to revise scope/style/naming if there's a different shape you'd prefer for this — figured an official path is useful for anyone hitting #743.

grimmjoww and others added 4 commits April 25, 2026 01:37
…on populated banks

Closes vectorize-io#743.

ensure_embedding_dimension() refuses to migrate populated tables, leaving
operators stuck mid-upgrade after swapping HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL.
This adds a hindsight-admin CLI command that:

  1. Discovers every vector(N) column in the schema via pg_attribute (not
     hardcoded), so future tables with embeddings work automatically.
  2. Verifies the loaded model's dimension matches each column's stored
     dimension; aborts with migration guidance if mismatched.
  3. Re-embeds rows with NULL embedding in resumable batches via the
     existing LocalSTEmbeddings engine.
  4. Idempotent — only touches WHERE embedding IS NULL — safe to interrupt
     and re-run.
  5. Rebuilds HNSW/IVFFlat/vchordrq vector indexes after bulk update so
     recall returns fresh neighbors.

Workflow becomes:
  hindsight-admin backup ...
  # update HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL
  ALTER TABLE memory_units ALTER COLUMN embedding TYPE vector(N) USING NULL;
  ALTER TABLE mental_models ALTER COLUMN embedding TYPE vector(N) USING NULL;
  # restart api (loads new model)
  hindsight-admin reindex-embeddings

Flags:
  --schema / -s             target schema
  --bank / -b               restrict to one bank id
  --batch-size / -B         encode/update batch size
  --dry-run                 inspect schema + count pending rows, no writes
  --skip-index-rebuild      skip REINDEX after re-embedding
  --yes / -y                skip confirmation

Tests cover:
  - schema introspection finds memory_units + mental_models
  - dimension parsed correctly from vector(N) type
  - count_pending counts only NULL embeddings with non-empty text
  - re-embed populates NULL embeddings end-to-end
  - re-running is a no-op (idempotency)
  - --bank filter restricts to one bank
  - rows with empty/NULL text are skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Sort imports per I001
- Drop f-string prefix where no placeholders (F541)
- Reformat to project line-length 120
- Remove stale --verify-recall reference from module docstring

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--verify-recall: post-embed sanity check that samples 5 rows, queries
each row's embedding via cosine distance, and verifies the row appears
in its own top-5 nearest neighbors. Catches broken embedding pipelines
before users discover them via degraded recall. Works as a standalone
health check too (runs even when no rows are pending re-embed).

--auto-backup PATH: runs hindsight-admin backup before destructive
re-embed work. Aborts the re-embed if backup fails, so users can't
accidentally run reindex without a recoverable snapshot.

Also defers the missing-model env var check until model load is
actually needed, so --dry-run and standalone --verify-recall work
without requiring HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL to be set.

Refs vectorize-io#743
Extends `_discover_vector_columns` to find both `vector(N)` and `halfvec(N)`
columns, not just `vector`. Operators on pgvector ≥ 0.7 may pick `halfvec`
to fit dimensions > 2000 inside HNSW's 4000-dim limit, or just to halve
storage at ~0.5% MTEB cost. Operators on vchord (which Hindsight already
supports via vchordrq indexes) get the same coverage.

Display + ALTER guidance now uses the actual stored type, so a user with a
halfvec column doesn't get told to ALTER it back to vector by mistake.

No test changes — the existing test schema uses `vector(N)` (matches what
Hindsight's migrations.py creates today). Halfvec coverage is a forward-
compatibility move for users running Qwen3-Embedding-4B (2560-dim) or
similar models that need >2000 dims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@grimmjoww
Copy link
Copy Markdown
Contributor Author

Heads-up on the failing `verify-generated-files` check: this is a pre-existing master issue, not introduced by this PR.

The error is in `hindsight_embed/daemon_embed_manager.py:55` — `subprocess.DETACHED_PROCESS` and `subprocess.CREATE_NEW_PROCESS_GROUP` are Windows-only constants. The `if platform.system() == "Windows"` guard around them isn't visible to `ty`'s static analysis on the Linux runner, so the lint flags both as `unresolved-attribute`.

This PR only touches `hindsight_api/admin/reindex.py`, `hindsight_api/admin/cli.py`, and `tests/test_admin_reindex.py` — nothing in `hindsight_embed/`.

If you'd like, I can include a small fix in this PR (e.g. `getattr(subprocess, "DETACHED_PROCESS", 0)` or `# type: ignore[unresolved-attribute]` on those lines) so the check goes green. Or it can stay scoped to the reindex command and the lint fix lands separately. Let me know your preference.

@grimmjoww
Copy link
Copy Markdown
Contributor Author

Heads-up — converting to draft.

While testing this locally on an RTX 5080 mobile (Blackwell + Optimus), I've hit a kernel-level hang that hard-resets the machine. The pattern is constant GPU work (embedder running) + concurrent outbound HTTPS to the LLM provider — both happening during the post-consolidation graph build. Reproduces on torch cu128 and cu130. Doesn't appear to be a fault of this PR (reindex-embeddings is just a CLI for swapping embedders), but I'd rather not ship the tooling that lets users walk into the same trap until I've isolated whether it's a torch/driver/Hindsight-architecture interaction or hardware-specific.

Lint fix in #1263 is unrelated — that one stands.

Will undraft once I have a clear root-cause picture.

@grimmjoww grimmjoww marked this pull request as draft April 26, 2026 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: re-embed existing memories when changing embedding model

1 participant