feat: surface _distance virtual column from vector search #451
Draft
wombatu-kun wants to merge 6 commits into lance-format:main from
Adds a `lance_vector_search` table-valued function exposing Lance
ANN/kNN search to Spark SQL.
Relation to the existing KNN path: vector search is already reachable
via the DataFrame API today, by setting `LanceSparkReadOptions.CONFIG_NEAREST`
(`"nearest"` option) to a JSON-serialized `org.lance.ipc.Query`. That
path requires (1) a compile-time dependency on `org.lance.ipc.Query`,
(2) manual JSON serialization of the query, and (3) a JVM caller —
which blocks pure-SQL workflows (Spark Thrift Server / Connect, BI
tools, dbt models, notebooks driven by SQL). The TVF wraps the same
`CONFIG_NEAREST` mechanism behind a SQL-native interface with named
arguments (Spark 3.5+):
    SELECT id, category
    FROM lance_vector_search(
        table  => 'lance.db.items',
        column => 'embedding',
        query  => array(0.1f, 0.2f, ...),
        k      => 10,
        metric => 'cosine')
    WHERE category = 'books';
Distance metrics (`l2`, `cosine`, `dot`, `hamming`), IVF/PQ/HNSW knobs
(`nprobes`, `refine_factor`, `ef`), and `use_index` (brute-force escape
hatch) are all parameters of the function.
Tests run in brute-force mode (`use_index=false`) so this PR has no
dependency on the vector-index DDL feature — the two PRs can be
reviewed and merged independently.
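For readers unfamiliar with the term, the brute-force mode used by the tests amounts to an exact scan: compute the distance from the query to every row and keep the `k` smallest. The following is a minimal sketch of that idea only. The class `BruteForceKnn` and its methods are hypothetical and are not part of Lance or this PR, and the plain Euclidean distance used here is an assumption that may not match the exact values Lance reports for `l2`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative exact k-nearest-neighbour scan; hypothetical, not the Lance implementation. */
final class BruteForceKnn {

    /** A row id paired with its distance to the query vector. */
    static final class Hit {
        final int rowId;
        final double distance;

        Hit(int rowId, double distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    /** Euclidean distance between two equal-length vectors (assumed convention). */
    static double l2(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Scores every row against the query and returns the k closest, nearest first. */
    static List<Hit> topK(float[][] rows, float[] query, int k) {
        List<Hit> hits = new ArrayList<>();
        for (int id = 0; id < rows.length; id++) {
            hits.add(new Hit(id, l2(rows[id], query)));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.distance));
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }
}
```

Because every row is scored, the result does not depend on any index, which is why such a mode is convenient as a test baseline.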
Known limitation: the virtual `_distance` column produced by Lance is
present in the returned Arrow batches but not in the relation schema,
so `SELECT _distance` / `ORDER BY _distance` will fail to resolve.
This is addressed in a follow-up PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 18, 2026
Extends the existing `ALTER TABLE … CREATE INDEX … USING method (...)`
statement to accept vector index methods (`ivf_flat`, `ivf_pq`,
`ivf_hnsw_pq`, `ivf_hnsw_sq`) alongside the existing scalar methods
(`btree`, `fts`). No new SQL statement is introduced — the grammar rule
`LanceSqlExtensions.g4#createIndex` is unchanged; only the `method`
parameter accepts new values.
Vector index training currently runs single-shot on the driver
(`AddIndexExec.runVectorIndex`) because Lance's distributed vector-index
path requires pre-computed IVF centroids — per-fragment tasks cannot
train a global codebook on their own. A follow-up can precompute
centroids in a Spark job and re-enable the per-fragment build via
`IvfBuildParams.Builder.setCentroids`.
`DistanceTypes` is shared infrastructure for parsing user-facing metric
strings (`l2` / `cosine` / `dot` / `hamming`) into the `DistanceType`
enum from lance-core.
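As an illustration of what such a parsing helper might look like, here is a minimal sketch. Everything in it is hypothetical: `MetricParser` and the local `Metric` enum stand in for `DistanceTypes` and lance-core's `DistanceType`, whose real signatures are not shown in this PR text, and the case-insensitive, whitespace-trimming behaviour is an assumption.

```java
import java.util.Locale;

/** Hypothetical stand-in for a user-facing metric-string parser. */
final class MetricParser {

    /** Local placeholder enum; the real DistanceType lives in lance-core. */
    enum Metric { L2, COSINE, DOT, HAMMING }

    /**
     * Maps a user-supplied metric name to the enum.
     * Assumption: names are matched case-insensitively after trimming.
     */
    static Metric parse(String name) {
        if (name == null) {
            throw new IllegalArgumentException("metric must not be null");
        }
        switch (name.trim().toLowerCase(Locale.ROOT)) {
            case "l2":
                return Metric.L2;
            case "cosine":
                return Metric.COSINE;
            case "dot":
                return Metric.DOT;
            case "hamming":
                return Metric.HAMMING;
            default:
                throw new IllegalArgumentException(
                    "Unknown metric '" + name + "'; expected one of l2, cosine, dot, hamming");
        }
    }
}
```

Centralizing the mapping in one place means the index DDL and the TVF would reject an unknown metric with the same error message.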
Index correctness is verified through the existing DataFrame API path
(`option("nearest", QueryUtils.queryToString(query))`), so this PR has
no dependency on the SQL TVF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SparkSession.close()` throws checked `IOException` on Spark 4.x but not on 3.x. The base test file is added to every version module's test source set, so the missing `throws` clause broke `make test` on Spark 4.0/4.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces the virtual `_distance` column produced by Lance vector search in the relation's schema, so references to it in `SELECT`, `ORDER BY`, and `WHERE` resolve during analysis.
Implementation: a thin Table-decorator (`LanceVirtualColumnsTable`) wraps the underlying `LanceDataset` and appends extra `StructField`s to `schema()`. The decorator is applied inside `LanceVectorSearchTableFunction` via a `transformUp` over the analyzed plan. This is done at the TVF site (rather than in `LanceDataSource.getTable`) because `SupportsCatalogOptions` routes `.load()` through `catalog.loadTable(ident)`, which bypasses `getTable` and never sees the per-read `nearest` option.
The decorator is generalized so future virtual columns (`_score` for FTS, `_rowid`, `_score_explain`, …) can reuse it by passing additional `StructField`s; no per-column decorator class is needed.
Filter and top-N pushdown for `_distance` are explicitly blocked in `LanceScanBuilder`, because Lance native currently rejects `_distance` as an unknown column in `WHERE` and column orderings. Two contract tests pin that upstream behaviour at the JNI layer; once Lance starts accepting `_distance` in pushdowns, the tests will fail and signal that the guards can be relaxed.
Depends on the `lance_vector_search` TVF PR; please merge that one first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
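The decorator idea described in this commit can be sketched without Spark on the classpath. The types below (`Field`, `Table`, `BaseTable`, `VirtualColumnsTable`) are hypothetical stand-ins for Spark's `StructField`, the DataSource V2 `Table`, `LanceDataset`, and `LanceVirtualColumnsTable`; the `float` type given to `_distance` in the usage note is likewise an assumption. The sketch shows only the schema-appending behaviour, not the `transformUp` wiring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Spark-free sketch of a schema-extending table decorator; all types are hypothetical. */
final class VirtualColumnsSketch {

    /** Minimal stand-in for a schema field: a name and a type string. */
    static final class Field {
        final String name;
        final String type;

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Minimal stand-in for a table that exposes a name and a schema. */
    interface Table {
        String name();
        List<Field> schema();
    }

    /** The undecorated table with its physical columns only. */
    static final class BaseTable implements Table {
        private final String name;
        private final List<Field> fields;

        BaseTable(String name, Field... fields) {
            this.name = name;
            this.fields = Collections.unmodifiableList(Arrays.asList(fields.clone()));
        }

        public String name() { return name; }
        public List<Field> schema() { return fields; }
    }

    /** Decorator: delegates everything, but appends extra fields to the schema. */
    static final class VirtualColumnsTable implements Table {
        private final Table delegate;
        private final List<Field> extra;

        VirtualColumnsTable(Table delegate, Field... extra) {
            this.delegate = delegate;
            this.extra = Arrays.asList(extra.clone());
        }

        public String name() { return delegate.name(); }

        public List<Field> schema() {
            List<Field> all = new ArrayList<>(delegate.schema());
            all.addAll(extra); // virtual columns come after the physical ones
            return Collections.unmodifiableList(all);
        }
    }
}
```

Under these assumptions, wrapping a base table with `new VirtualColumnsTable(base, new Field("_distance", "float"))` yields a schema that ends in `_distance`, and adding a second virtual column is just another constructor argument rather than a new class.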
Adds `testTvfWithIndexAgreesWithBruteForce`, the one test in the vector-search suite that exercises both halves of the story: `ALTER TABLE … CREATE INDEX … USING ivf_pq` (vector-index PR) followed by `lance_vector_search(... use_index=true)` (TVF PR). Verifies the two features compose end-to-end and that brute-force / indexed top-k agree on the planted neighbour. This makes PR 3 depend on both PR 1 (vector-index) and PR 2 (TVF) rather than only on PR 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
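The kind of agreement check this test performs can be illustrated with a small helper. This is a sketch under assumptions: `TopKAgreement` and its methods are hypothetical and do not appear in the PR, and the actual test may assert agreement differently (for example with a different threshold or on different identifiers).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helpers for comparing an indexed (approximate) top-k with an exact one. */
final class TopKAgreement {

    /** Fraction of the exact top-k ids that also appear in the approximate top-k. */
    static double overlap(List<Long> exact, List<Long> approx) {
        if (exact.isEmpty()) {
            return 1.0; // nothing to miss
        }
        Set<Long> approxIds = new HashSet<>(approx);
        int shared = 0;
        for (Long id : exact) {
            if (approxIds.contains(id)) {
                shared++;
            }
        }
        return (double) shared / exact.size();
    }

    /** True when the row planted next to the query shows up in the result ids. */
    static boolean containsPlanted(List<Long> ids, long plantedId) {
        return ids.contains(Long.valueOf(plantedId));
    }
}
```

An approximate index such as IVF-PQ is not guaranteed to return exactly the brute-force result, so checking for the planted neighbour plus some overlap is a more robust assertion than strict list equality.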
This is one of three PRs that supersede #436, splitting it per reviewer feedback.
Summary
Surfaces the virtual `_distance` column produced by Lance vector search in the relation's schema, so references to it in `SELECT`, `ORDER BY`, and `WHERE` resolve during analysis. Also adds the one integration test that exercises TVF + indexed vector search end-to-end (the only test that requires both #449 and #450; the rest of the suites in those PRs cover their halves in isolation).

Implementation
A thin Table-decorator (`LanceVirtualColumnsTable`) wraps the underlying `LanceDataset` and appends extra `StructField`s to `schema()`. The decorator is applied inside `LanceVectorSearchTableFunction` via a `transformUp` over the analyzed plan. This is done at the TVF site (rather than in `LanceDataSource.getTable`) because `SupportsCatalogOptions` routes `.load()` through `catalog.loadTable(ident)`, which bypasses `getTable` and never sees the per-read `nearest` option.

Generalization
The decorator is generalized so future virtual columns (`_score` for FTS, `_rowid`, `_score_explain`, …) can reuse it by passing additional `StructField`s; no per-column decorator class is needed. This addresses the reviewer's concern "There may be more column additions in the future, so maybe it needs to be generalized."

Pushdown guards
Filter and top-N pushdown for `_distance` are explicitly blocked in `LanceScanBuilder`, because Lance native currently rejects `_distance` as an unknown column in `WHERE` and column orderings. Two contract tests pin that upstream behaviour at the JNI layer; once Lance starts accepting `_distance` in pushdowns, the tests will fail and signal that the guards can be relaxed.

Integration test
`testTvfWithIndexAgreesWithBruteForce` creates a vector index via `ALTER TABLE … CREATE INDEX … USING ivf_pq` (#449's feature), then runs `lance_vector_search(... use_index=true)` (#450's feature) and verifies that the indexed top-k contains the planted neighbour and overlaps with brute-force. This is the only place the two halves are exercised together.

Test plan
- `make test SPARK_VERSION=3.5 -Dtest=LanceVectorSearchTest`: 14 integration tests pass (5 from #450, "feat: add lance_vector_search SQL table-valued function", + 1 integration + 8 `_distance` tests)
- `make lint`: checkstyle + spotless clean
- `make test-all` on CI

🤖 Generated with Claude Code
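The pushdown guards described above can be pictured as a partition of the incoming filters: anything that references a virtual column stays with Spark for post-scan evaluation, and everything else is handed to the native scan. The sketch below is a Spark-free illustration under assumptions. `PushdownGuardSketch` and its `Filter` type are hypothetical and do not reproduce `LanceScanBuilder`; the "return the filters that must still be evaluated after the scan" shape is modelled on Spark's `SupportsPushDownFilters.pushFilters`, and column names are matched exactly for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of blocking pushdown for filters on virtual columns. */
final class PushdownGuardSketch {

    /** Columns that exist only in the result, not in the stored data (assumed set). */
    static final Set<String> VIRTUAL_COLUMNS = new HashSet<>(Arrays.asList("_distance"));

    /** Minimal stand-in for a filter: its SQL text and the columns it references. */
    static final class Filter {
        final String sql;
        final List<String> references;

        Filter(String sql, String... references) {
            this.sql = sql;
            this.references = Arrays.asList(references.clone());
        }
    }

    /**
     * Adds pushable filters to {@code pushed} and returns the ones that must be
     * evaluated after the scan because they touch a virtual column.
     */
    static List<Filter> pushFilters(List<Filter> filters, List<Filter> pushed) {
        List<Filter> postScan = new ArrayList<>();
        for (Filter f : filters) {
            boolean touchesVirtual = false;
            for (String column : f.references) {
                if (VIRTUAL_COLUMNS.contains(column)) {
                    touchesVirtual = true;
                    break;
                }
            }
            if (touchesVirtual) {
                postScan.add(f); // keep in Spark; the native scan would reject it
            } else {
                pushed.add(f);   // safe to hand to the native scan
            }
        }
        return postScan;
    }
}
```

With this shape, a query such as `WHERE category = 'books' AND _distance < 0.5` would push only the `category` predicate, and relaxing the guard later amounts to removing `_distance` from the virtual-column set.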