feat: vector search SQL extension with _distance column #436

Closed
wombatu-kun wants to merge 2 commits into lance-format:main from wombatu-kun:vector-search-extension

@wombatu-kun
Contributor

Closes #65

Summary

Adds a lance_vector_search table-valued function exposing Lance ANN/kNN search to Spark SQL, with the virtual _distance column surfaced in the TVF's output schema so it resolves in SELECT, ORDER BY, and WHERE.

Capabilities

  • SELECT … FROM lance_vector_search(table, column, query, k [, metric, nprobes, refine_factor, ef, use_index]) — positional or named (Spark 3.5+) arguments
  • Distance metrics: l2, cosine, dot, hamming
  • Both indexed (IVF / IVF-PQ / IVF-HNSW-PQ / IVF-HNSW-SQ) and brute-force scans via use_index
  • Vector dtypes: float32, float64, float16 (the latter requires Arrow 18+, i.e. Spark 4.0+)
  • Pre-filtering on scalar columns (WHERE category = 'odd')
  • _distance exposed as a non-nullable FLOAT; units depend on metric (squared L2, 1 − cos(θ), negative inner product, integer Hamming as float)
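To make the `_distance` units listed above concrete, here is a minimal pure-Python sketch of the four metrics as the description states them (squared L2, 1 − cos(θ), negative inner product, Hamming count as a float). The function names are hypothetical, and treating Hamming input as bytes compared bit by bit is an assumption; Lance's native kernels may differ in numeric detail.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cos(theta); 0.0 for parallel vectors, 1.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_dot(a, b):
    """Negative inner product, so that smaller still means closer."""
    return -sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Number of differing bits between two byte strings, returned as a float.

    Assumption: vectors are packed bytes compared bitwise.
    """
    return float(sum(bin(x ^ y).count("1") for x, y in zip(a, b)))
```

Under every one of these conventions a smaller `_distance` means a closer match, which is why `ORDER BY _distance` ascending is the natural ordering regardless of metric.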

Limitations

  • Fragment-local top-k: search runs per fragment and unions the results, so raw TVF output can contain up to k × num_fragments rows — add ORDER BY _distance LIMIT k on top for the true global top-k.
  • _distance is not pushed down: filters (WHERE _distance < x) and sort orderings (ORDER BY _distance) referencing _distance are evaluated by Spark above the scan rather than pushed into Lance native, which currently rejects _distance as an unknown column in WHERE and column orderings.
  • Single vector column per call: the column argument is one string; combining two vector columns in a single TVF invocation is not supported.
  • Driver-side query vector: the query expression is evaluated on the driver during planning. Non-foldable expressions (e.g. a column reference) are rejected.
  • Named arguments require Spark 3.5+: on Spark 3.4 all arguments must be positional.
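The fragment-local top-k limitation can be illustrated with a small simulation. This is only a sketch under assumptions: fragments are modelled as lists of `(row_id, vector)` pairs, squared L2 is the metric, and all function names are hypothetical rather than taken from the PR.

```python
import heapq


def fragment_top_k(fragment, query, k):
    """Return the k nearest (distance, row_id) pairs within one fragment."""
    scored = [
        (sum((x - q) ** 2 for x, q in zip(vec, query)), row_id)
        for row_id, vec in fragment
    ]
    return heapq.nsmallest(k, scored)


def raw_tvf_output(fragments, query, k):
    """Union of per-fragment results: up to k * len(fragments) rows."""
    rows = []
    for fragment in fragments:
        rows.extend(fragment_top_k(fragment, query, k))
    return rows


def global_top_k(rows, k):
    """Equivalent of ORDER BY _distance LIMIT k applied above the scan."""
    return heapq.nsmallest(k, rows)


# Two fragments of three one-dimensional vectors each.
fragments = [
    [(0, [0.0]), (1, [1.0]), (2, [5.0])],
    [(3, [0.5]), (4, [0.25]), (5, [9.0])],
]
raw = raw_tvf_output(fragments, query=[0.0], k=2)
best = global_top_k(raw, k=2)
```

Here `raw` holds four rows (k × 2 fragments), and only the final `global_top_k` step reduces them to the two globally nearest rows, one from each fragment.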

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement New feature or request label Apr 15, 2026
Collaborator

@hamersaw hamersaw left a comment

I think we need to break this into multiple PRs and have a discussion on what components are useful and implemented correctly. Right now I see:

  • Adding new SQL for CREATE VECTOR INDEX. I don't necessarily think this makes sense, as we already have an ALTER TABLE ... CREATE INDEX ... API that can be extended with whatever we want to support.
  • Adding new SQL function for performing vector searches. I think this could make sense, but it's a little difficult to review given all the changes proposed in this PR. I would want to isolate whether this adds significant functionality over the existing KNN query implementation.
  • Exposing the _distance column. This makes a ton of sense; again, I want to understand which specific portions are added for this functionality. There may be more column additions in the future, so maybe it needs to be generalized.

Vova Kolmakov and others added 2 commits April 18, 2026 16:24
Wrap the DataSourceV2Relation returned by lance_vector_search with a
Table decorator that appends the virtual _distance column to schema(),
so references to it in SELECT, ORDER BY, and WHERE resolve during
analysis. Block _distance from filter and topN pushdown since Lance
native rejects it as an unknown column in WHERE / sort orderings; two
contract tests pin that upstream behaviour at the JNI layer so the
guards can be relaxed once Lance starts accepting it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
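The pushdown guard described in the commit message above can be sketched as follows. This is a Python illustration of the idea only, not the Java code in the commit: the `Filter` type, `split_filters`, and `can_push_top_n` are hypothetical names, and a filter is modelled as nothing more than its SQL text plus the set of columns it references.

```python
from dataclasses import dataclass

# Columns that exist only in the TVF's output schema, not in the Lance dataset.
VIRTUAL_COLUMNS = frozenset({"_distance"})


@dataclass(frozen=True)
class Filter:
    sql: str
    references: frozenset


def split_filters(filters):
    """Partition filters into (pushed to the scan, evaluated by the engine)."""
    pushed, residual = [], []
    for f in filters:
        if f.references & VIRTUAL_COLUMNS:
            residual.append(f)  # native layer would reject the unknown column
        else:
            pushed.append(f)
    return pushed, residual


def can_push_top_n(order_columns):
    """Top-N pushdown is allowed only if no sort key is a virtual column."""
    return not any(col in VIRTUAL_COLUMNS for col in order_columns)


filters = [
    Filter("category = 'odd'", frozenset({"category"})),
    Filter("_distance < 0.5", frozenset({"_distance"})),
]
pushed, residual = split_filters(filters)
```

With this split, `category = 'odd'` still reaches the scan as a pre-filter while `_distance < 0.5` stays above it; relaxing the guard later would amount to shrinking `VIRTUAL_COLUMNS`.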
@wombatu-kun
Contributor Author

Closing this PR — split into three focused PRs per @hamersaw's review feedback:

  1. #449 feat: support vector indexes via ALTER TABLE ... CREATE INDEX (independent, ready)
  2. #450 feat: add lance_vector_search SQL table-valued function (independent, ready)
  3. #451 feat: surface _distance virtual column from vector search (draft, depends on #450)

Reviewer's three concerns and how they were addressed:

  1. "CREATE VECTOR INDEX doesn't make sense; we already have ALTER TABLE … CREATE INDEX that can be extended." On re-reading the original PR, this concern was based on a misread: the grammar rule LanceSqlExtensions.g4#createIndex was already a single statement parametrised by method. #449 simply adds new accepted values (ivf_flat, ivf_pq, ivf_hnsw_pq, ivf_hnsw_sq) for that parameter; no new SQL statement is introduced. This is now spelled out explicitly in #449's description.

  2. "Want to isolate whether the TVF adds significant functionality over the existing KNN query implementation." The "existing KNN query implementation" you're referring to is LanceSparkReadOptions.CONFIG_NEAREST (the "nearest" DataFrame read option). #450's description articulates exactly what the TVF adds on top: SQL-native syntax with named arguments, no JVM/JSON-construction requirement, composability with WHERE/JOIN/ORDER BY for pure-SQL clients (Spark Thrift Server / Connect, BI tools, dbt, SQL notebooks).

  3. "_distance makes sense, but want to understand which portions are added for it, and maybe generalize for future columns." #451 isolates exactly the _distance-specific changes and generalises the decorator into LanceVirtualColumnsTable(virtualColumns: List<StructField>) so future per-batch virtual columns (_score for FTS, _rowid, …) reuse it without new classes.

PR 1 and PR 2 are independent and can be merged in either order. PR 3 is draft until PR 2 merges.
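For readers unfamiliar with the decorator approach mentioned in point 3, a rough Python analogue of a table wrapper that appends virtual columns to the schema might look like this. It is a sketch under assumptions, not the Java class from #451: `Field`, `BaseTable`, and `VirtualColumnsTable` are hypothetical stand-ins, and skipping a virtual column whose name already exists in the base schema is a choice made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str
    nullable: bool


class BaseTable:
    """Stand-in for the underlying table; exposes only a schema."""

    def __init__(self, fields: List[Field]):
        self._fields = list(fields)

    def schema(self) -> List[Field]:
        return list(self._fields)


class VirtualColumnsTable:
    """Decorator: same table, with extra virtual columns appended to schema()."""

    def __init__(self, delegate, virtual_columns: List[Field]):
        self._delegate = delegate
        self._virtual = list(virtual_columns)

    def schema(self) -> List[Field]:
        base = self._delegate.schema()
        existing = {f.name for f in base}
        # Assumption: a real column with the same name wins over the virtual one.
        return base + [f for f in self._virtual if f.name not in existing]


base = BaseTable([Field("id", "long", False), Field("vec", "array<float>", True)])
wrapped = VirtualColumnsTable(base, [Field("_distance", "float", False)])
names = [f.name for f in wrapped.schema()]
```

Supporting another virtual column in this model is a matter of passing one more `Field` in the list, which mirrors the "reuse it without new classes" goal stated above.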
