feat(admin): reindex-embeddings command (closes #743)#1258
feat(admin): reindex-embeddings command (closes #743)#1258grimmjoww wants to merge 4 commits intovectorize-io:mainfrom
Conversation
…on populated banks Closes vectorize-io#743. ensure_embedding_dimension() refuses to migrate populated tables, leaving operators stuck mid-upgrade after swapping HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL. This adds a hindsight-admin CLI command that: 1. Discovers every vector(N) column in the schema via pg_attribute (not hardcoded), so future tables with embeddings work automatically. 2. Verifies the loaded model's dimension matches each column's stored dimension; aborts with migration guidance if mismatched. 3. Re-embeds rows with NULL embedding in resumable batches via the existing LocalSTEmbeddings engine. 4. Idempotent — only touches WHERE embedding IS NULL — safe to interrupt and re-run. 5. Rebuilds HNSW/IVFFlat/vchordrq vector indexes after bulk update so recall returns fresh neighbors. Workflow becomes: hindsight-admin backup ... # update HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL ALTER TABLE memory_units ALTER COLUMN embedding TYPE vector(N) USING NULL; ALTER TABLE mental_models ALTER COLUMN embedding TYPE vector(N) USING NULL; # restart api (loads new model) hindsight-admin reindex-embeddings Flags: --schema / -s target schema --bank / -b restrict to one bank id --batch-size / -B encode/update batch size --dry-run inspect schema + count pending rows, no writes --skip-index-rebuild skip REINDEX after re-embedding --yes / -y skip confirmation Tests cover: - schema introspection finds memory_units + mental_models - dimension parsed correctly from vector(N) type - count_pending counts only NULL embeddings with non-empty text - re-embed populates NULL embeddings end-to-end - re-running is a no-op (idempotency) - --bank filter restricts to one bank - rows with empty/NULL text are skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Sort imports per I001 - Drop f-string prefix where no placeholders (F541) - Reformat to project line-length 120 - Remove stale --verify-recall reference from module docstring Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--verify-recall: post-embed sanity check that samples 5 rows, queries each row's embedding via cosine distance, and verifies the row appears in its own top-5 nearest neighbors. Catches broken embedding pipelines before users discover them via degraded recall. Works as a standalone health check too (runs even when no rows are pending re-embed). --auto-backup PATH: runs hindsight-admin backup before destructive re-embed work. Aborts the re-embed if backup fails, so users can't accidentally run reindex without a recoverable snapshot. Also defers the missing-model env var check until model load is actually needed, so --dry-run and standalone --verify-recall work without requiring HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL to be set. Refs vectorize-io#743
Extends `_discover_vector_columns` to find both `vector(N)` and `halfvec(N)` columns, not just `vector`. Operators on pgvector ≥ 0.7 may pick `halfvec` to fit dimensions > 2000 inside HNSW's 4000-dim limit, or just to halve storage at ~0.5% MTEB cost. Operators on vchord (which Hindsight already supports via vchordrq indexes) get the same coverage. Display + ALTER guidance now uses the actual stored type, so a user with a halfvec column doesn't get told to ALTER it back to vector by mistake. No test changes — the existing test schema uses `vector(N)` (matches what Hindsight's migrations.py creates today). Halfvec coverage is a forward- compatibility move for users running Qwen3-Embedding-4B (2560-dim) or similar models that need >2000 dims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Heads-up on the failing `verify-generated-files` check: this is a pre-existing master issue, not introduced by this PR. The error is in `hindsight_embed/daemon_embed_manager.py:55` — `subprocess.DETACHED_PROCESS` and `subprocess.CREATE_NEW_PROCESS_GROUP` are Windows-only constants. The `if platform.system() == "Windows"` guard around them isn't visible to `ty`'s static analysis on the Linux runner, so the lint flags both as `unresolved-attribute`. This PR only touches `hindsight_api/admin/reindex.py`, `hindsight_api/admin/cli.py`, and `tests/test_admin_reindex.py` — nothing in `hindsight_embed/`. If you'd like, I can include a small fix in this PR (e.g. `getattr(subprocess, "DETACHED_PROCESS", 0)` or `# type: ignore[unresolved-attribute]` on those lines) so the check goes green. Or it can stay scoped to the reindex command and the lint fix lands separately. Let me know your preference. |
|
Heads-up — converting to draft. While testing this locally on an RTX 5080 mobile (Blackwell + Optimus), I've hit a kernel-level hang that hard-resets the machine. The pattern is constant GPU work (embedder running) + concurrent outbound HTTPS to the LLM provider — both happening during the post-consolidation graph build. Reproduces on torch cu128 and cu130. Doesn't appear to be a fault of this PR ( Lint fix in #1263 is unrelated — that one stands. Will undraft once I have a clear root-cause picture. |
Summary
Adds
hindsight-admin reindex-embeddings— re-embeds rows whereembedding IS NULL, idempotent, schema-introspected, model-agnostic. Closes #743.The use case: swapping the embedding model on a populated bank.
ensure_embedding_dimension(migrations.py) refuses to auto-migrate thevector(N)column dimension when rows already exist, leaving operators stuck mid-upgrade. The current workaround is hand-rolled Python; this gives the project an official, idempotent, resumable path.What it does
vector(N)column in the schema by introspectingpg_attribute+pg_type— no hardcoded table list, so future tables Just Work.embedding IS NULL AND text IS NOT NULL.ALTER TABLEcommand the operator needs to run).LocalSTEmbeddings→executemany. Only touchesembedding IS NULLrows, so safe to interrupt with Ctrl-C and resume.--skip-index-rebuild).--verify-recall): samples 5 rows, queries each row's embedding via cosine distance, asserts the row appears in its own top-5 nearest neighbors. Catches broken pipelines before users discover them via degraded recall.Flags
--schema,-spublic--bank,-b--batch-size,-B16--dry-run--skip-index-rebuild--verify-recall--auto-backup PATHhindsight-admin backupbefore destructive work; abort on failure--yes,-yTested with
BAAI/bge-small-en-v1.5(Hindsight default, 384-dim) — full end-to-end on a populated DB: 8 rows re-embedded in 0.3s, idempotent re-run = 0 pending,--verify-recall5/5 self-matched.Designed model-agnostic — anything
LocalSTEmbeddingscan load works (sentence-transformers compatible). Built specifically for the Qwen3-Embedding-4B/8B upgrade scenario via--trust-remote-code-aware env vars; that path uses the same code, just with a differentHINDSIGHT_API_EMBEDDINGS_LOCAL_MODELand anALTER TABLE ... TYPE vector(<NEW_DIM>)step before running the command.Workflow
Test plan
pytest tests/test_admin_reindex.py— 8/8 passruff check+ruff format— cleanbge-small-en-v1.5--dry-runshows counts without encoding--verify-recallworks standalone (no rows pending → just runs the check)--auto-backupaborts re-embed if backup failsHappy to revise scope/style/naming if there's a different shape you'd prefer for this — figured an official path is useful for anyone hitting #743.