feat: vector search SQL extension with _distance column by wombatu-kun · Pull Request #436 · lance-format/lance-spark

wombatu-kun · 2026-04-15T09:52:33Z

Closes #65

Summary

Adds a lance_vector_search table-valued function exposing Lance ANN/kNN search to Spark SQL, with the virtual _distance column surfaced in the TVF's output schema so it resolves in SELECT, ORDER BY, and WHERE.

Capabilities

SELECT … FROM lance_vector_search(table, column, query, k [, metric, nprobes, refine_factor, ef, use_index]) — positional or named (Spark 3.5+) arguments
Distance metrics: l2, cosine, dot, hamming
Both indexed (IVF / IVF-PQ / IVF-HNSW-PQ / IVF-HNSW-SQ) and brute-force scans via use_index
Vector dtypes: float32, float64, float16 (the latter requires Arrow 18+, i.e. Spark 4.0+)
Pre-filtering on scalar columns (WHERE category = 'odd')
_distance exposed as a non-nullable FLOAT; units depend on metric (squared L2, 1 − cos(θ), negative inner product, integer Hamming as float)
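To make the units above concrete, here is a small plain-Python sketch of the four quantities named for _distance. It is illustrative only: the function names are made up for this example, the Hamming variant assumes byte-packed vectors, and none of it is Lance's actual kernel code.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(theta) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def negative_inner_product(a, b):
    """Negated dot product, so that 'smaller is closer' still holds."""
    return -sum(x * y for x, y in zip(a, b))

def hamming_as_float(a, b):
    """Number of differing bits between two byte sequences, returned as a float."""
    return float(sum(bin(x ^ y).count("1") for x, y in zip(a, b)))
```

Because each metric has its own scale, a threshold such as WHERE _distance < x only makes sense relative to the metric chosen for that call.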

Limitations

Fragment-local top-k: search runs per fragment and unions the results, so raw TVF output can contain up to k × num_fragments rows — add ORDER BY _distance LIMIT k on top for the true global top-k.
_distance is not pushed down: filters (WHERE _distance < x) and sort orderings (ORDER BY _distance) referencing _distance are evaluated by Spark above the scan rather than pushed into Lance native, which currently rejects _distance as an unknown column in WHERE and column orderings.
Single vector column per call: the column argument is one string; combining two vector columns in a single TVF invocation is not supported.
Driver-side query vector: the query expression is evaluated on the driver during planning. Non-foldable expressions (e.g. a column reference) are rejected.
Named arguments require Spark 3.5+: on Spark 3.4 all arguments must be positional.
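The fragment-local top-k limitation can be illustrated with a toy simulation. The sketch below is plain Python under made-up data (two in-memory "fragments" and a squared-L2 helper, all hypothetical); it only mirrors the behaviour described above, where each fragment returns its own k nearest rows and the caller applies the equivalent of ORDER BY _distance LIMIT k.

```python
import heapq

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fragment_top_k(fragment, query, k):
    """Return the k nearest (distance, row_id) pairs within a single fragment."""
    scored = [(squared_l2(vec, query), row_id) for row_id, vec in fragment]
    return heapq.nsmallest(k, scored)

# Two hypothetical fragments of (row_id, vector) pairs.
fragments = [
    [(0, [0.0, 0.0]), (1, [5.0, 5.0]), (2, [1.0, 0.0])],
    [(3, [0.0, 1.0]), (4, [9.0, 9.0]), (5, [0.5, 0.5])],
]
query = [0.0, 0.0]
k = 2

# Raw output: the union of per-fragment results, up to k * num_fragments rows.
raw_rows = [row for frag in fragments for row in fragment_top_k(frag, query, k)]

# Equivalent of ORDER BY _distance LIMIT k applied on top of the raw output.
global_top_k = sorted(raw_rows)[:k]
```

Here raw_rows holds four rows (k × 2 fragments), and only the final sort-and-limit step yields the two globally nearest rows.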

🤖 Generated with Claude Code

hamersaw

I think we need to break this into multiple PRs and have a discussion on what components are useful and implemented correctly. Right now I see:

Adding new SQL for CREATE VECTOR INDEX. I don't necessarily think this makes sense, as we already have an ALTER TABLE ... CREATE INDEX ... API that can be extended with whatever we want to support.
Adding a new SQL function for performing vector searches. I think this could make sense, but it's a little difficult to review given all the changes proposed in this PR. I would want to isolate whether this adds significant functionality over the existing KNN query implementation.
Exposing the _distance column. This makes a ton of sense; again, I want to understand which specific portions are added for this functionality. There may be more column additions in the future, so maybe it needs to be generalized.

Wrap the DataSourceV2Relation returned by lance_vector_search with a Table decorator that appends the virtual _distance column to schema(), so references to it in SELECT, ORDER BY, and WHERE resolve during analysis. Block _distance from filter and topN pushdown since Lance native rejects it as an unknown column in WHERE / sort orderings; two contract tests pin that upstream behaviour at the JNI layer so the guards can be relaxed once Lance starts accepting it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
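The pushdown guard described in this commit message could look roughly like the following sketch. It is written in plain Python rather than against Spark's DataSource V2 interfaces, and Filter, split_filters, and can_push_top_n are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Tuple

# Columns that exist only in the TVF's output schema, not in the Lance dataset.
VIRTUAL_COLUMNS = frozenset({"_distance"})

class Filter(NamedTuple):
    """Hypothetical stand-in for a single-column predicate."""
    column: str
    op: str
    value: object

def split_filters(filters: List[Filter]) -> Tuple[List[Filter], List[Filter]]:
    """Separate filters that may be pushed into the scan from those left to Spark."""
    pushed = [f for f in filters if f.column not in VIRTUAL_COLUMNS]
    post_scan = [f for f in filters if f.column in VIRTUAL_COLUMNS]
    return pushed, post_scan

def can_push_top_n(order_by_columns: List[str]) -> bool:
    """Top-N pushdown is refused when any sort key is a virtual column."""
    return not any(c in VIRTUAL_COLUMNS for c in order_by_columns)

filters = [Filter("category", "=", "odd"), Filter("_distance", "<", 0.5)]
pushed, post_scan = split_filters(filters)
```

In this sketch the category predicate is pushed down as a pre-filter, while the _distance predicate stays in post_scan to be evaluated above the scan.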

wombatu-kun · 2026-04-18T10:36:44Z

Closing this PR — split into three focused PRs per @hamersaw's review feedback:

feat: support vector indexes via ALTER TABLE ... CREATE INDEX #449 (independent, ready)
feat: add lance_vector_search SQL table-valued function #450 (independent, ready)
feat: surface _distance virtual column from vector search #451 (draft, depends on #450)

Reviewer's three concerns and how they were addressed:

"CREATE VECTOR INDEX doesn't make sense; we already have ALTER TABLE … CREATE INDEX that can be extended." On re-reading the original PR, this concern was based on a misread: the grammar rule LanceSqlExtensions.g4#createIndex was already a single statement parametrised by method. #449 simply adds new accepted values (ivf_flat, ivf_pq, ivf_hnsw_pq, ivf_hnsw_sq) for that parameter; no new SQL statement is introduced. This is now spelled out explicitly in #449's description.
"Want to isolate whether the TVF adds significant functionality over the existing KNN query implementation." The "existing KNN query implementation" you're referring to is LanceSparkReadOptions.CONFIG_NEAREST (the "nearest" DataFrame read option). #450's description articulates exactly what the TVF adds on top: SQL-native syntax with named arguments, no JVM/JSON-construction requirement, composability with WHERE/JOIN/ORDER BY for pure-SQL clients (Spark Thrift Server / Connect, BI tools, dbt, SQL notebooks).
"_distance makes sense, but want to understand which portions are added for it, and maybe generalize for future columns." #451 isolates exactly the _distance-specific changes and generalises the decorator into LanceVirtualColumnsTable(virtualColumns: List<StructField>) so future per-batch virtual columns (_score for FTS, _rowid, …) reuse it without new classes.

#449 and #450 are independent and can be merged in either order. #451 stays in draft until #450 merges.
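The generalised decorator mentioned in the third point can be pictured with a minimal sketch. The classes below are plain-Python stand-ins (Field, BaseTable, and VirtualColumnsTable are hypothetical names, not the actual Java classes in the PR); they only show the idea of wrapping a table and appending a configurable list of virtual columns to its schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    data_type: str
    nullable: bool

class BaseTable:
    """Hypothetical stand-in for the underlying table."""
    def __init__(self, fields: List[Field]):
        self._fields = list(fields)

    def schema(self) -> List[Field]:
        return list(self._fields)

class VirtualColumnsTable:
    """Decorator that exposes the delegate's schema plus extra virtual columns."""
    def __init__(self, delegate: BaseTable, virtual_columns: List[Field]):
        self._delegate = delegate
        self._virtual_columns = list(virtual_columns)

    def schema(self) -> List[Field]:
        # Virtual columns are appended after the physical columns.
        return self._delegate.schema() + self._virtual_columns

base = BaseTable([Field("id", "long", False), Field("embedding", "array<float>", True)])
distance = Field("_distance", "float", False)  # non-nullable FLOAT, per the description
table = VirtualColumnsTable(base, [distance])
column_names = [f.name for f in table.schema()]
```

Adding a further virtual column in this sketch is a matter of passing one more Field in the list, which is the reuse the generalisation is aiming for.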

github-actions bot added the enhancement label (New feature or request) on Apr 15, 2026

hamersaw reviewed Apr 18, 2026

Vova Kolmakov and others added 2 commits April 18, 2026 16:24

feat: vector search SQL extension

71e9460

wombatu-kun force-pushed the vector-search-extension branch from 5d44bae to 93b1c59 Compare April 18, 2026 09:34

This was referenced Apr 18, 2026

feat: support vector indexes via ALTER TABLE ... CREATE INDEX #449 (Open)
feat: add lance_vector_search SQL table-valued function #450 (Open)
feat: surface _distance virtual column from vector search #451 (Draft)

wombatu-kun closed this Apr 18, 2026

wombatu-kun wants to merge 2 commits into lance-format:main from wombatu-kun:vector-search-extension