feat(admin): reindex-embeddings command (closes #743) by grimmjoww · Pull Request #1258 · vectorize-io/hindsight

grimmjoww · 2026-04-25T05:38:53Z

Summary

Adds hindsight-admin reindex-embeddings — re-embeds rows where embedding IS NULL, idempotent, schema-introspected, model-agnostic. Closes #743.

The use case: swapping the embedding model on a populated bank. ensure_embedding_dimension (migrations.py) refuses to auto-migrate the vector(N) column dimension when rows already exist, leaving operators stuck mid-upgrade. The current workaround is hand-rolled Python; this gives the project an official, idempotent, resumable path.

What it does

Discovers every vector(N) column in the schema by introspecting pg_attribute + pg_type — no hardcoded table list, so future tables Just Work.
Counts rows per column where embedding IS NULL AND text IS NOT NULL.
Verifies the loaded model's output dimension matches each column's stored dimension (fails fast with the exact ALTER TABLE command the operator needs to run).
Re-embeds in batches via LocalSTEmbeddings → executemany. Only touches embedding IS NULL rows, so safe to interrupt with Ctrl-C and resume.
Rebuilds HNSW / IVFFlat / vchordrq vector indexes (skip with --skip-index-rebuild).
Optionally verifies recall via a self-match check (--verify-recall): samples 5 rows, queries each row's embedding via cosine distance, asserts the row appears in its own top-5 nearest neighbors. Catches broken pipelines before users discover them via degraded recall.

Flags

Flag	Default	Purpose
`--schema`, `-s`	`public`	Schema to reindex
`--bank`, `-b`	(all)	Filter to a single bank id
`--batch-size`, `-B`	`16`	Tune up on faster GPUs
`--dry-run`	false	Discover + count, no encode/write
`--skip-index-rebuild`	false	Skip REINDEX after re-embedding
`--verify-recall`	false	Self-match sanity check (works standalone too)
`--auto-backup PATH`	(none)	Run `hindsight-admin backup` before destructive work; abort on failure
`--yes`, `-y`	false	Skip confirmation prompt

Tested with

BAAI/bge-small-en-v1.5 (Hindsight default, 384-dim) — full end-to-end on a populated DB: 8 rows re-embedded in 0.3s, idempotent re-run = 0 pending, --verify-recall 5/5 self-matched.
8 unit tests using an isolated schema fixture: column discovery, dimension parsing, pending count (zero + nonzero), batched re-embed, idempotency, bank filter, empty-text skip.

Designed model-agnostic — anything LocalSTEmbeddings can load works (sentence-transformers compatible). Built specifically for the Qwen3-Embedding-4B/8B upgrade scenario via --trust-remote-code-aware env vars; that path uses the same code, just with a different HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL and an ALTER TABLE ... TYPE vector(<NEW_DIM>) step before running the command.

Workflow

# 1. Backup
hindsight-admin backup /backups/pre-reindex.zip

# 2. Stop API. Update HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL.

# 3. Wipe vectors so column type can be migrated:
psql -c "ALTER TABLE memory_units ALTER COLUMN embedding TYPE vector(<NEW_DIM>) USING NULL;"
psql -c "ALTER TABLE mental_models ALTER COLUMN embedding TYPE vector(<NEW_DIM>) USING NULL;"

# 4. Restart API (loads new model — dimension matches now).

# 5. Re-embed:
hindsight-admin reindex-embeddings --auto-backup /backups/pre-reembed.zip --verify-recall --yes

Test plan

pytest tests/test_admin_reindex.py — 8/8 pass
ruff check + ruff format — clean
End-to-end on populated DB with bge-small-en-v1.5
--dry-run shows counts without encoding
--verify-recall works standalone (no rows pending → just runs the check)
--auto-backup aborts re-embed if backup fails
CI green

Happy to revise scope/style/naming if there's a different shape you'd prefer for this — figured an official path is useful for anyone hitting #743.

…on populated banks Closes vectorize-io#743. ensure_embedding_dimension() refuses to migrate populated tables, leaving operators stuck mid-upgrade after swapping HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL. This adds a hindsight-admin CLI command that: 1. Discovers every vector(N) column in the schema via pg_attribute (not hardcoded), so future tables with embeddings work automatically. 2. Verifies the loaded model's dimension matches each column's stored dimension; aborts with migration guidance if mismatched. 3. Re-embeds rows with NULL embedding in resumable batches via the existing LocalSTEmbeddings engine. 4. Idempotent — only touches WHERE embedding IS NULL — safe to interrupt and re-run. 5. Rebuilds HNSW/IVFFlat/vchordrq vector indexes after bulk update so recall returns fresh neighbors. Workflow becomes: hindsight-admin backup ... # update HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL ALTER TABLE memory_units ALTER COLUMN embedding TYPE vector(N) USING NULL; ALTER TABLE mental_models ALTER COLUMN embedding TYPE vector(N) USING NULL; # restart api (loads new model) hindsight-admin reindex-embeddings Flags: --schema / -s target schema --bank / -b restrict to one bank id --batch-size / -B encode/update batch size --dry-run inspect schema + count pending rows, no writes --skip-index-rebuild skip REINDEX after re-embedding --yes / -y skip confirmation Tests cover: - schema introspection finds memory_units + mental_models - dimension parsed correctly from vector(N) type - count_pending counts only NULL embeddings with non-empty text - re-embed populates NULL embeddings end-to-end - re-running is a no-op (idempotency) - --bank filter restricts to one bank - rows with empty/NULL text are skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Sort imports per I001 - Drop f-string prefix where no placeholders (F541) - Reformat to project line-length 120 - Remove stale --verify-recall reference from module docstring Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

--verify-recall: post-embed sanity check that samples 5 rows, queries each row's embedding via cosine distance, and verifies the row appears in its own top-5 nearest neighbors. Catches broken embedding pipelines before users discover them via degraded recall. Works as a standalone health check too (runs even when no rows are pending re-embed). --auto-backup PATH: runs hindsight-admin backup before destructive re-embed work. Aborts the re-embed if backup fails, so users can't accidentally run reindex without a recoverable snapshot. Also defers the missing-model env var check until model load is actually needed, so --dry-run and standalone --verify-recall work without requiring HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL to be set. Refs vectorize-io#743

Extends `_discover_vector_columns` to find both `vector(N)` and `halfvec(N)` columns, not just `vector`. Operators on pgvector ≥ 0.7 may pick `halfvec` to fit dimensions > 2000 inside HNSW's 4000-dim limit, or just to halve storage at ~0.5% MTEB cost. Operators on vchord (which Hindsight already supports via vchordrq indexes) get the same coverage. Display + ALTER guidance now uses the actual stored type, so a user with a halfvec column doesn't get told to ALTER it back to vector by mistake. No test changes — the existing test schema uses `vector(N)` (matches what Hindsight's migrations.py creates today). Halfvec coverage is a forward- compatibility move for users running Qwen3-Embedding-4B (2560-dim) or similar models that need >2000 dims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

grimmjoww · 2026-04-25T07:32:49Z

Heads-up on the failing `verify-generated-files` check: this is a pre-existing master issue, not introduced by this PR.

The error is in `hindsight_embed/daemon_embed_manager.py:55` — `subprocess.DETACHED_PROCESS` and `subprocess.CREATE_NEW_PROCESS_GROUP` are Windows-only constants. The `if platform.system() == "Windows"` guard around them isn't visible to `ty`'s static analysis on the Linux runner, so the lint flags both as `unresolved-attribute`.

This PR only touches `hindsight_api/admin/reindex.py`, `hindsight_api/admin/cli.py`, and `tests/test_admin_reindex.py` — nothing in `hindsight_embed/`.

If you'd like, I can include a small fix in this PR (e.g. `getattr(subprocess, "DETACHED_PROCESS", 0)` or `# type: ignore[unresolved-attribute]` on those lines) so the check goes green. Or it can stay scoped to the reindex command and the lint fix lands separately. Let me know your preference.

grimmjoww · 2026-04-26T00:48:17Z

Heads-up — converting to draft.

While testing this locally on an RTX 5080 mobile (Blackwell + Optimus), I've hit a kernel-level hang that hard-resets the machine. The pattern is constant GPU work (embedder running) + concurrent outbound HTTPS to the LLM provider — both happening during the post-consolidation graph build. Reproduces on torch cu128 and cu130. Doesn't appear to be a fault of this PR (reindex-embeddings is just a CLI for swapping embedders), but I'd rather not ship the tooling that lets users walk into the same trap until I've isolated whether it's a torch/driver/Hindsight-architecture interaction or hardware-specific.

Lint fix in #1263 is unrelated — that one stands.

Will undraft once I have a clear root-cause picture.

grimmjoww and others added 4 commits April 25, 2026 01:37

grimmjoww mentioned this pull request Apr 25, 2026

fix(embed): satisfy ty on Windows-only subprocess attrs #1263

Open

3 tasks

grimmjoww marked this pull request as draft April 26, 2026 00:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(admin): reindex-embeddings command (closes #743)#1258

feat(admin): reindex-embeddings command (closes #743)#1258
grimmjoww wants to merge 4 commits intovectorize-io:mainfrom
grimmjoww:feat/reindex-embeddings-clean

grimmjoww commented Apr 25, 2026

Uh oh!

grimmjoww commented Apr 25, 2026

Uh oh!

grimmjoww commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

grimmjoww commented Apr 25, 2026

Summary

What it does

Flags

Tested with

Workflow

Test plan

Uh oh!

grimmjoww commented Apr 25, 2026

Uh oh!

grimmjoww commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant