feat: surface _distance virtual column from vector search#451

Draft
wombatu-kun wants to merge 6 commits into lance-format:main from wombatu-kun:vector-search-distance-column
wombatu-kun (Contributor) commented Apr 18, 2026

This is one of three PRs that supersede #436, splitting it per reviewer feedback.

⚠️ Depends on both #449 (vector-index) and #450 (TVF). Marked as draft until both merge. The diff below currently includes #449's and #450's changes; once both are merged the diff will reduce to just this PR's contents (7 _distance-specific files + 1 integration test).

Summary

Surfaces the virtual _distance column produced by Lance vector search in the relation's schema, so references to it in SELECT, ORDER BY, and WHERE resolve during analysis. Also adds the one integration test that exercises TVF + indexed vector search end-to-end (the only test that requires both #449 and #450 — the rest of the suites in those PRs cover their halves in isolation).

Implementation

A thin Table decorator (LanceVirtualColumnsTable) wraps the underlying LanceDataset and appends extra StructFields to schema(). The decorator is applied inside LanceVectorSearchTableFunction via a transformUp over the analyzed plan. It is applied at the TVF site rather than in LanceDataSource.getTable because SupportsCatalogOptions routes .load() through catalog.loadTable(ident), which bypasses getTable and therefore never sees the per-read nearest option.
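
The decorator idea can be illustrated without Spark on the classpath. The following is a minimal sketch, not the PR's code: `Field`, `SimpleTable`, `BaseTable`, and `VirtualColumnsTable` are hypothetical stand-ins for Spark's `StructField`, `Table`, `LanceDataset`, and `LanceVirtualColumnsTable`. Letting a physical column win a name clash is an assumption of this sketch, not something the PR states.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for Spark's StructField (name and type only); not the real API.
final class Field {
    final String name;
    final String type;

    Field(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Simplified stand-in for Spark's Table interface.
interface SimpleTable {
    String name();
    List<Field> schema();
}

// Hypothetical base table playing the role of LanceDataset.
final class BaseTable implements SimpleTable {
    private final String name;
    private final List<Field> fields;

    BaseTable(String name, List<Field> fields) {
        this.name = name;
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
    }

    @Override public String name() { return name; }
    @Override public List<Field> schema() { return fields; }
}

// Decorator: delegates to the wrapped table and appends virtual fields to schema().
final class VirtualColumnsTable implements SimpleTable {
    private final SimpleTable delegate;
    private final List<Field> virtualFields;

    VirtualColumnsTable(SimpleTable delegate, List<Field> virtualFields) {
        this.delegate = delegate;
        this.virtualFields = new ArrayList<>(virtualFields);
    }

    @Override public String name() { return delegate.name(); }

    @Override public List<Field> schema() {
        List<Field> combined = new ArrayList<>(delegate.schema());
        for (Field extra : virtualFields) {
            // Assumption of this sketch: a physical column with the same name wins.
            boolean clash = false;
            for (Field existing : combined) {
                if (existing.name.equals(extra.name)) clash = true;
            }
            if (!clash) combined.add(extra);
        }
        return Collections.unmodifiableList(combined);
    }
}

final class VirtualColumnsSketch {
    // Collects column names in schema order.
    static List<String> columnNames(SimpleTable table) {
        List<String> names = new ArrayList<>();
        for (Field f : table.schema()) names.add(f.name);
        return names;
    }

    public static void main(String[] args) {
        List<Field> physical = new ArrayList<>();
        physical.add(new Field("id", "long"));
        physical.add(new Field("embedding", "array<float>"));
        SimpleTable base = new BaseTable("items", physical);

        List<Field> virtual = new ArrayList<>();
        virtual.add(new Field("_distance", "float"));
        SimpleTable wrapped = new VirtualColumnsTable(base, virtual);

        System.out.println(columnNames(wrapped));
    }
}
```

Because the wrapper takes a list of fields, adding another virtual column in this sketch only means passing one more `Field`; the base table's own schema is left untouched.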

Generalization

The decorator is generalized so future virtual columns (_score for FTS, _rowid, _score_explain, …) can reuse it by passing additional StructFields — no per-column decorator class needed. This addresses the reviewer's concern "There may be more column additions in the future, so maybe it needs to be generalized."

Pushdown guards

Filter and top-N pushdown for _distance are explicitly blocked in LanceScanBuilder, because Lance native currently rejects _distance as an unknown column in WHERE and column orderings. Two contract tests pin that upstream behaviour at the JNI layer; once Lance starts accepting _distance in pushdowns, the tests will fail and signal that the guards can be relaxed.
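
As a rough illustration of what such a guard does, the sketch below splits filters into those that may be pushed to the native scan and those that must be evaluated after it, and refuses top-N pushdown when the ordering touches a virtual column. It uses plain Java stand-ins: `SimpleFilter`, `PushdownGuardSketch`, and the `VIRTUAL_COLUMNS` set are hypothetical names, and the actual logic in `LanceScanBuilder` may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Spark filter: its SQL text plus the columns it references.
final class SimpleFilter {
    final String sql;
    final Set<String> references;

    SimpleFilter(String sql, Set<String> references) {
        this.sql = sql;
        this.references = references;
    }
}

final class PushdownGuardSketch {
    // Columns that exist only in search results and are unknown to the native scanner.
    static final Set<String> VIRTUAL_COLUMNS = Set.of("_distance");

    static boolean referencesVirtualColumn(SimpleFilter filter) {
        for (String column : filter.references) {
            if (VIRTUAL_COLUMNS.contains(column)) return true;
        }
        return false;
    }

    // Returns two lists: index 0 holds pushable filters, index 1 holds filters
    // that must be evaluated after the scan because they touch a virtual column.
    static List<List<SimpleFilter>> partition(List<SimpleFilter> filters) {
        List<SimpleFilter> pushed = new ArrayList<>();
        List<SimpleFilter> residual = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (referencesVirtualColumn(f)) {
                residual.add(f);
            } else {
                pushed.add(f);
            }
        }
        List<List<SimpleFilter>> result = new ArrayList<>();
        result.add(pushed);
        result.add(residual);
        return result;
    }

    // Top-N pushdown is allowed only if no ordering column is virtual.
    static boolean canPushTopN(List<String> orderByColumns) {
        for (String column : orderByColumns) {
            if (VIRTUAL_COLUMNS.contains(column)) return false;
        }
        return true;
    }
}
```

With this shape, a query such as `WHERE category = 'books' AND _distance < 0.5` would push only the `category` predicate in the sketch and leave the `_distance` predicate to be evaluated after the scan.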

Integration test

testTvfWithIndexAgreesWithBruteForce creates a vector index via ALTER TABLE … CREATE INDEX … USING ivf_pq (#449's feature), then runs lance_vector_search(... use_index=true) (#450's feature) and verifies that the indexed top-k contains the planted neighbour and overlaps with the brute-force top-k. This is the only place the two halves are exercised together.
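
The kind of check described here can be sketched in plain Java: compute an exact top-k by squared L2 distance, then measure what fraction of it an approximate result recovers. The class and method names below are hypothetical, and the real test runs through Spark SQL rather than in-memory arrays. Any acceptance threshold for the overlap is left to the caller, since the PR does not state one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TopKOverlapSketch {
    // Squared Euclidean distance; the ordering matches true L2 distance.
    static double l2Squared(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Exact nearest neighbours: sort all row ids by distance to the query, keep the first k.
    static List<Integer> bruteForceTopK(float[][] vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble(i -> l2Squared(vectors[i], query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    // Fraction of the exact top-k that also appears in the approximate result.
    static double overlap(List<Integer> exact, List<Integer> approximate) {
        if (exact.isEmpty()) return 1.0;
        Set<Integer> found = new HashSet<>(approximate);
        int hits = 0;
        for (Integer id : exact) {
            if (found.contains(id)) hits++;
        }
        return (double) hits / exact.size();
    }
}
```

A test in this style would typically assert both that the planted row id is present in the indexed result and that `overlap` stays above some chosen threshold, because IVF-PQ results are approximate and need not match brute force exactly.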

Test plan

🤖 Generated with Claude Code

Adds a `lance_vector_search` table-valued function exposing Lance
ANN/kNN search to Spark SQL.

Relation to the existing KNN path: vector search is already reachable
via the DataFrame API today, by setting `LanceSparkReadOptions.CONFIG_NEAREST`
(`"nearest"` option) to a JSON-serialized `org.lance.ipc.Query`. That
path requires (1) a compile-time dependency on `org.lance.ipc.Query`,
(2) manual JSON serialization of the query, and (3) a JVM caller —
which blocks pure-SQL workflows (Spark Thrift Server / Connect, BI
tools, dbt models, notebooks driven by SQL). The TVF wraps the same
`CONFIG_NEAREST` mechanism behind a SQL-native interface with named
arguments (Spark 3.5+):

  SELECT id, category
  FROM lance_vector_search(
         table  => 'lance.db.items',
         column => 'embedding',
         query  => array(0.1f, 0.2f, ...),
         k      => 10,
         metric => 'cosine')
  WHERE category = 'books';

Distance metrics (`l2`, `cosine`, `dot`, `hamming`), IVF/PQ/HNSW knobs
(`nprobes`, `refine_factor`, `ef`), and `use_index` (brute-force escape
hatch) are all parameters of the function.

Tests run in brute-force mode (`use_index=false`) so this PR has no
dependency on the vector-index DDL feature — the two PRs can be
reviewed and merged independently.

Known limitation: the virtual `_distance` column produced by Lance is
present in the returned Arrow batches but not in the relation schema,
so `SELECT _distance` / `ORDER BY _distance` will fail to resolve.
This is addressed in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vova Kolmakov and others added 5 commits April 18, 2026 18:36
Extends the existing `ALTER TABLE … CREATE INDEX … USING method (...)`
statement to accept vector index methods (`ivf_flat`, `ivf_pq`,
`ivf_hnsw_pq`, `ivf_hnsw_sq`) alongside the existing scalar methods
(`btree`, `fts`). No new SQL statement is introduced — the grammar rule
`LanceSqlExtensions.g4#createIndex` is unchanged; only the `method`
parameter accepts new values.

Vector index training currently runs single-shot on the driver
(`AddIndexExec.runVectorIndex`) because Lance's distributed vector-index
path requires pre-computed IVF centroids — per-fragment tasks cannot
train a global codebook on their own. A follow-up can precompute
centroids in a Spark job and re-enable the per-fragment build via
`IvfBuildParams.Builder.setCentroids`.

`DistanceTypes` is shared infrastructure for parsing user-facing metric
strings (`l2` / `cosine` / `dot` / `hamming`) into the `DistanceType`
enum from lance-core.
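
A metric-string parser of this kind might look like the sketch below. It is an illustration only: the local `Metric` enum is a hypothetical stand-in for lance-core's `DistanceType`, and the trimming, case-insensitivity, and error message are assumptions rather than the behaviour of the actual `DistanceTypes` class.

```java
import java.util.Locale;

// Hypothetical stand-in for lance-core's DistanceType enum.
enum Metric { L2, COSINE, DOT, HAMMING }

final class MetricParserSketch {
    // Maps a user-facing metric string to the enum, rejecting unknown values.
    static Metric parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("metric must not be null");
        }
        // Assumption: input is matched case-insensitively after trimming whitespace.
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "l2":
                return Metric.L2;
            case "cosine":
                return Metric.COSINE;
            case "dot":
                return Metric.DOT;
            case "hamming":
                return Metric.HAMMING;
            default:
                throw new IllegalArgumentException(
                    "Unsupported metric '" + raw + "'; expected one of l2, cosine, dot, hamming");
        }
    }
}
```

Centralizing the mapping in one place lets both the index DDL and the search path reject an unknown metric with the same error.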

Index correctness is verified through the existing DataFrame API path
(`option("nearest", QueryUtils.queryToString(query))`), so this PR has
no dependency on the SQL TVF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SparkSession.close()` throws checked `IOException` on Spark 4.x but
not on 3.x. The base test file is added to every version module's test
source set, so the missing `throws` clause broke `make test` on Spark
4.0/4.1.
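
The shape of the fix can be shown with stand-in classes. In the sketch below, `SessionV3` and `SessionV4` are hypothetical types that mimic the two `close()` signatures described above; they are not Spark classes. Java permits a method to declare `throws IOException` even when its body cannot throw it, so a single shared source file can compile against either signature.

```java
import java.io.IOException;

// Stand-in for a 3.x-style session: close() declares no checked exception.
final class SessionV3 {
    void close() { }
}

// Stand-in for a 4.x-style session: close() declares a checked IOException.
final class SessionV4 {
    void close() throws IOException { }
}

final class ThrowsClauseSketch {
    // The throws clause is unnecessary here but legal, so the same
    // method signature works for both session variants.
    static String tearDownV3(SessionV3 session) throws IOException {
        session.close();
        return "closed-v3";
    }

    // Without the throws clause this method would not compile.
    static String tearDownV4(SessionV4 session) throws IOException {
        session.close();
        return "closed-v4";
    }
}
```

A `try`/`catch (IOException e)` around the 3.x-style call would not work as a shared alternative, because Java rejects a catch clause for a checked exception that the `try` body can never throw.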

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces the virtual `_distance` column produced by Lance vector search
in the relation's schema, so references to it in `SELECT`, `ORDER BY`,
and `WHERE` resolve during analysis.

Implementation: a thin Table-decorator (`LanceVirtualColumnsTable`)
wraps the underlying `LanceDataset` and appends extra `StructField`s to
`schema()`. The decorator is applied inside `LanceVectorSearchTableFunction`
via a `transformUp` over the analyzed plan — done at the TVF site (rather
than in `LanceDataSource.getTable`) because `SupportsCatalogOptions`
routes `.load()` through `catalog.loadTable(ident)`, which bypasses
`getTable` and never sees the per-read `nearest` option.

The decorator is generalized so future virtual columns (`_score` for
FTS, `_rowid`, `_score_explain`, …) can reuse it by passing additional
`StructField`s — no per-column decorator class needed.

Filter and top-N pushdown for `_distance` are explicitly blocked in
`LanceScanBuilder`, because Lance native currently rejects `_distance`
as an unknown column in `WHERE` and column orderings. Two contract
tests pin that upstream behaviour at the JNI layer; once Lance starts
accepting `_distance` in pushdowns, the tests will fail and signal that
the guards can be relaxed.

Depends on the `lance_vector_search` TVF PR — please merge that one
first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds testTvfWithIndexAgreesWithBruteForce, the one test in the
vector-search suite that exercises both halves of the story:
ALTER TABLE … CREATE INDEX … USING ivf_pq (vector-index PR) followed
by lance_vector_search(... use_index=true) (TVF PR). Verifies the two
features compose end-to-end and that brute-force / indexed top-k
agree on the planted neighbour.

This makes PR 3 depend on both PR 1 (vector-index) and PR 2 (TVF)
rather than only on PR 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the vector-search-distance-column branch from ff66c12 to 3325078 Compare April 18, 2026 11:42
Labels

enhancement New feature or request
