feat: support vector indexes via ALTER TABLE ... CREATE INDEX #449

Open
wombatu-kun wants to merge 1 commit into lance-format:main from wombatu-kun:vector-index

@wombatu-kun
Contributor

This is one of three PRs that supersede #436, splitting it per reviewer feedback.

Summary

Extends the existing ALTER TABLE … CREATE INDEX … USING method (...) statement (grammar rule LanceSqlExtensions.g4#createIndex) to accept vector index methods (ivf_flat, ivf_pq, ivf_hnsw_pq, ivf_hnsw_sq) alongside the existing scalar methods (btree, fts). No new SQL statement is introduced — the grammar rule is unchanged; only the method parameter accepts new values, exactly as the reviewer suggested.

Example:

```sql
ALTER TABLE lance.db.items CREATE INDEX emb_idx USING ivf_pq (embedding)
WITH (num_partitions = 256, num_sub_vectors = 16, metric = 'cosine');
```
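The `WITH (...)` clause carries integer options such as `num_partitions` and `num_sub_vectors`. As a rough illustration of how such options might be validated before building index parameters, here is a minimal sketch. The class `VectorIndexOptionsSketch` and its method are hypothetical and are not the PR's `VectorIndexParamsBuilder`; only the option names come from the example above, and treating a missing option as "not set" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper for illustration; not the PR's actual parameter builder. */
final class VectorIndexOptionsSketch {
  private VectorIndexOptionsSketch() {}

  /**
   * Reads an optional positive integer option (for example num_partitions).
   * Returns empty when the key is absent; throws on non-numeric or non-positive values.
   */
  static OptionalInt positiveInt(Map<String, String> options, String key) {
    String raw = options.get(key);
    if (raw == null) {
      return OptionalInt.empty();
    }
    int value;
    try {
      value = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(key + " must be an integer, got: " + raw, e);
    }
    if (value <= 0) {
      throw new IllegalArgumentException(key + " must be positive, got: " + value);
    }
    return OptionalInt.of(value);
  }

  public static void main(String[] args) {
    // Options as they might arrive from the WITH clause in the example above.
    Map<String, String> options = new HashMap<>();
    options.put("num_partitions", "256");
    options.put("num_sub_vectors", "16");
    System.out.println(positiveInt(options, "num_partitions"));
    System.out.println(positiveInt(options, "num_sub_vectors"));
  }
}
```

Rejecting bad values up front would surface a clear error at statement analysis time rather than from native code during training.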

Vector index training currently runs single-shot on the driver (`AddIndexExec.runVectorIndex`). Lance's distributed vector-index path requires pre-computed IVF centroids: per-fragment tasks cannot train a global codebook on their own, and the native code rejects the call with "Build Distributed Vector Index: missing precomputed IVF centroids". A follow-up can precompute centroids in a Spark job and re-enable the per-fragment build via `IvfBuildParams.Builder.setCentroids`.
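Because vector methods take a different execution path from scalar ones, the method name effectively selects the build strategy. The sketch below shows one way that classification could look. The class `IndexMethodsSketch` is hypothetical and does not reproduce the code in `AddIndexExec`; the method names are the ones listed in this description, and case-insensitive matching is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; the PR's real dispatch logic may differ. */
final class IndexMethodsSketch {
  // Method names taken from the PR description.
  private static final Set<String> SCALAR =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList("btree", "fts")));
  private static final Set<String> VECTOR =
      Collections.unmodifiableSet(
          new HashSet<>(Arrays.asList("ivf_flat", "ivf_pq", "ivf_hnsw_pq", "ivf_hnsw_sq")));

  private IndexMethodsSketch() {}

  /** True when the method would take the single-shot driver path described above. */
  static boolean isVector(String method) {
    return VECTOR.contains(normalize(method));
  }

  /** True for the scalar methods that existed before this change. */
  static boolean isScalar(String method) {
    return SCALAR.contains(normalize(method));
  }

  /** Returns the normalized method name, or throws if it is in neither family. */
  static String requireSupported(String method) {
    String normalized = normalize(method);
    if (!SCALAR.contains(normalized) && !VECTOR.contains(normalized)) {
      throw new IllegalArgumentException("Unsupported index method: " + method);
    }
    return normalized;
  }

  // Assumption: method names are matched case-insensitively.
  private static String normalize(String method) {
    return method.trim().toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(isVector("ivf_pq"));
    System.out.println(isScalar("btree"));
  }
}
```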

`DistanceTypes` is shared infrastructure for parsing user-facing metric strings (`l2` / `cosine` / `dot` / `hamming`) into the `DistanceType` enum from lance-core.
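To make the parsing step concrete, here is a minimal sketch of mapping a metric string to an enum value. It is not the PR's `DistanceTypes` class: the nested `DistanceType` enum is a local stand-in for the one in lance-core (its constant names are assumed), and trimming plus case-insensitive matching are assumptions as well.

```java
import java.util.Locale;

/** Illustrative stand-in; the PR's DistanceTypes implementation may differ. */
final class DistanceTypesSketch {
  /** Local stand-in for lance-core's DistanceType enum; constant names are assumed. */
  enum DistanceType { L2, COSINE, DOT, HAMMING }

  private DistanceTypesSketch() {}

  /** Maps a user-facing metric string (l2, cosine, dot, hamming) to the enum. */
  static DistanceType parse(String metric) {
    if (metric == null) {
      throw new IllegalArgumentException("metric must not be null");
    }
    // Assumption: input is trimmed and matched case-insensitively.
    switch (metric.trim().toLowerCase(Locale.ROOT)) {
      case "l2":
        return DistanceType.L2;
      case "cosine":
        return DistanceType.COSINE;
      case "dot":
        return DistanceType.DOT;
      case "hamming":
        return DistanceType.HAMMING;
      default:
        throw new IllegalArgumentException(
            "Unknown metric '" + metric + "'; expected one of l2, cosine, dot, hamming");
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("cosine"));
  }
}
```

Listing the accepted values in the error message helps users correct a mistyped `metric` option without consulting the documentation.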

Index correctness is verified through the existing DataFrame API path (the `"nearest"` read option backed by `LanceSparkReadOptions.CONFIG_NEAREST`), so this PR has no dependency on the SQL TVF PR — the two can be merged in any order.

Test plan

  • `make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 -Dtest=LanceVectorIndexTest` — 6 integration tests pass (1 skipped on 3.x: float16 needs Arrow 18+)
  • `make test SPARK_VERSION=3.5 -Dtest=VectorIndexParamsBuilderTest` — 9 unit tests pass
  • `make lint` — checkstyle + spotless clean
  • `make test-all` — please run on CI to verify all version modules

🤖 Generated with Claude Code

Extends the existing `ALTER TABLE … CREATE INDEX … USING method (...)`
statement to accept vector index methods (`ivf_flat`, `ivf_pq`,
`ivf_hnsw_pq`, `ivf_hnsw_sq`) alongside the existing scalar methods
(`btree`, `fts`). No new SQL statement is introduced — the grammar rule
`LanceSqlExtensions.g4#createIndex` is unchanged; only the `method`
parameter accepts new values.

Vector index training currently runs single-shot on the driver
(`AddIndexExec.runVectorIndex`) because Lance's distributed vector-index
path requires pre-computed IVF centroids — per-fragment tasks cannot
train a global codebook on their own. A follow-up can precompute
centroids in a Spark job and re-enable the per-fragment build via
`IvfBuildParams.Builder.setCentroids`.

`DistanceTypes` is shared infrastructure for parsing user-facing metric
strings (`l2` / `cosine` / `dot` / `hamming`) into the `DistanceType`
enum from lance-core.

Index correctness is verified through the existing DataFrame API path
(`option("nearest", QueryUtils.queryToString(query))`), so this PR has
no dependency on the SQL TVF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

enhancement New feature or request
