feat: vector search SQL extension with _distance column #436
wombatu-kun wants to merge 2 commits into lance-format:main
hamersaw left a comment
I think we need to break this into multiple PRs and have a discussion on what components are useful and implemented correctly. Right now I see:
- Adding new SQL for `CREATE VECTOR INDEX`. I don't necessarily think this makes sense, as we already have an `ALTER TABLE ... CREATE INDEX ..` API that can be extended with whatever we want to support.
- Adding a new SQL function for performing vector searches. I think this could make sense, but it's a little difficult to review given all the changes proposed in this PR. I would want to isolate whether this adds significant functionality over the existing KNN query implementation.
- Exposing the `_distance` column. This makes a ton of sense; again, I want to understand which specific portions are added for this functionality. There may be more column additions in the future, so maybe it needs to be generalized.
Wrap the DataSourceV2Relation returned by lance_vector_search with a Table decorator that appends the virtual _distance column to schema(), so references to it in SELECT, ORDER BY, and WHERE resolve during analysis.

Block _distance from filter and topN pushdown since Lance native rejects it as an unknown column in WHERE / sort orderings; two contract tests pin that upstream behaviour at the JNI layer so the guards can be relaxed once Lance starts accepting it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
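As a rough illustration of the two behaviours this commit message describes (a schema decorator that appends `_distance`, and guards that keep `_distance` out of filter and top-N pushdown), here is a minimal Python sketch. It is not the PR's Java code: `WithDistanceColumn`, `split_filters`, and `can_push_top_n` are hypothetical names, and the schema and filters are modelled as plain tuples purely for illustration.

```python
DISTANCE_COLUMN = "_distance"


class WithDistanceColumn:
    """Hypothetical decorator: wraps a base schema and appends the virtual column."""

    def __init__(self, base_schema):
        # base_schema: list of (name, type, nullable) tuples (an assumed shape)
        self._base_schema = list(base_schema)

    def schema(self):
        # Report _distance as a non-nullable float after the stored columns,
        # so SELECT / ORDER BY / WHERE references can resolve against it.
        return self._base_schema + [(DISTANCE_COLUMN, "float", False)]


def split_filters(filters):
    """Split (column, op, value) filters into pushable and residual ones.

    Filters on _distance stay residual: they are evaluated above the scan
    instead of being handed to the native reader.
    """
    pushed = [f for f in filters if f[0] != DISTANCE_COLUMN]
    residual = [f for f in filters if f[0] == DISTANCE_COLUMN]
    return pushed, residual


def can_push_top_n(orderings):
    """Allow top-N pushdown only if no (column, direction) ordering uses _distance."""
    return all(column != DISTANCE_COLUMN for column, _direction in orderings)
```

With guards of this shape, a query such as `WHERE category = 'odd' AND _distance < 0.5` would push only the `category` predicate and leave the `_distance` predicate for the engine to apply afterwards.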
5d44bae to 93b1c59
Closing this PR — split into three focused PRs per @hamersaw's review feedback:
Reviewer's three concerns and how they were addressed:
PR 1 and PR 2 are independent and can be merged in either order. PR 3 is draft until PR 2 merges.
Closes #65
Summary
Adds a `lance_vector_search` table-valued function exposing Lance ANN/kNN search to Spark SQL, with the virtual `_distance` column surfaced in the TVF's output schema so it resolves in `SELECT`, `ORDER BY`, and `WHERE`.

Capabilities
- `SELECT … FROM lance_vector_search(table, column, query, k [, metric, nprobes, refine_factor, ef, use_index])`: positional or named (Spark 3.5+) arguments
- `l2`, `cosine`, `dot`, `hamming`
- `use_index`
- `float32`, `float64`, `float16` (the latter requires Arrow 18+, i.e. Spark 4.0+)
- (`WHERE category = 'odd'`)
- `_distance` exposed as a non-nullable `FLOAT`; units depend on metric (squared L2, `1 − cos(θ)`, negative inner product, integer Hamming as float)

Limitations
- `k × num_fragments` rows; add `ORDER BY _distance LIMIT k` on top for the true global top-k.
- `_distance` is not pushed down: filters (`WHERE _distance < x`) and sort orderings (`ORDER BY _distance`) referencing `_distance` are evaluated by Spark above the scan rather than pushed into Lance native, which currently rejects `_distance` as an unknown column in WHERE and column orderings.
- `column` argument is one string; combining two vector columns in a single TVF invocation is not supported.
- `query` expression is evaluated on the driver during planning. Non-foldable expressions (e.g. a column reference) are rejected.

🤖 Generated with Claude Code
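To make the `_distance` units and the `k × num_fragments` limitation concrete, here is a small pure-Python sketch. It assumes each fragment contributes its own top-k (one reading of the limitation above) and that Hamming distance counts differing bits between integer elements. `distance`, `fragment_top_k`, and `global_top_k` are hypothetical helpers, not part of the PR or of Lance.

```python
import heapq
import math


def distance(metric, a, b):
    """Distance in the units the description lists for each metric (illustrative)."""
    if metric == "l2":
        return float(sum((x - y) ** 2 for x, y in zip(a, b)))  # squared L2
    if metric == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norms  # 1 - cos(theta)
    if metric == "dot":
        return -float(sum(x * y for x, y in zip(a, b)))  # negative inner product
    if metric == "hamming":
        # Assumption: elements are integers and we count differing bits.
        return float(sum(bin(x ^ y).count("1") for x, y in zip(a, b)))
    raise ValueError(f"unknown metric: {metric}")


def fragment_top_k(rows, query, k, metric="l2"):
    """Top-k (distance, row_id) pairs within a single fragment."""
    scored = [(distance(metric, vec, query), row_id) for row_id, vec in rows]
    return heapq.nsmallest(k, scored)


def global_top_k(fragments, query, k, metric="l2"):
    """Return (all per-fragment candidates, true global top-k).

    The candidate list holds up to k * len(fragments) rows; sorting by
    distance and truncating plays the role of ORDER BY _distance LIMIT k.
    """
    candidates = [hit for rows in fragments
                  for hit in fragment_top_k(rows, query, k, metric)]
    return candidates, sorted(candidates)[:k]


fragments = [
    [(0, [0.0, 0.0]), (1, [3.0, 4.0])],
    [(2, [1.0, 0.0]), (3, [0.0, 2.0])],
]
candidates, top = global_top_k(fragments, [0.0, 0.0], k=2)
```

Here `candidates` holds four rows (two per fragment) while `top` keeps only the two nearest, which is why the outer `ORDER BY _distance LIMIT k` is needed on top of the TVF.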