feat: surface the _score virtual column on FTS queries #455
Open
wombatu-kun wants to merge 3 commits into lance-format:main from
Add query-side FTS on Lance tables, complementing the existing
ALTER TABLE ... CREATE INDEX USING fts syntax. Users can now run
SELECT id, content FROM t WHERE lance_match(content, 'apache spark')
and the predicate executes natively through the Lance inverted index
rather than as a post-scan substring match.
Implementation:
- `LanceMatch` Catalyst expression registered via `injectFunction`,
with `CodegenFallback` + a substring fallback eval for non-Lance scans
or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule` — logical-plan optimizer rule that matches
`Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))`
BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` /
`_lance_fts_query` into the relation's `CaseInsensitiveStringMap`.
MVP handles top-level LanceMatch and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls
`LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the
spec through `LanceScan` → `LanceInputPartition` → per-fragment
scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(
queryText, column))` is applied natively.
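
The extraction step described in the list above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `Expr`, `LanceMatch`, `And`, `Other`, and `Extraction` are hypothetical stand-ins, not the real Catalyst or lance-spark classes, and the real rule operates on a logical plan rather than a bare predicate. Only the two option keys (`_lance_fts_column`, `_lance_fts_query`) and the two recognized shapes are taken from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model only: these are NOT the real Catalyst or lance-spark classes. */
class FtsPushdownSketch {

  /** Hypothetical stand-in for a Catalyst predicate. */
  interface Expr {}

  /** Stand-in for lance_match(column, 'query'). */
  static final class LanceMatch implements Expr {
    final String column;
    final String query;

    LanceMatch(String column, String query) {
      this.column = column;
      this.query = query;
    }
  }

  /** Stand-in for a conjunction of two predicates. */
  static final class And implements Expr {
    final Expr left;
    final Expr right;

    And(Expr left, Expr right) {
      this.left = left;
      this.right = right;
    }
  }

  /** Any predicate the sketch treats as opaque (scalar filters, OR, NOT, ...). */
  static final class Other implements Expr {
    final String sql;

    Other(String sql) {
      this.sql = sql;
    }
  }

  /** The extracted FTS predicate plus the filter that stays in the plan, if any. */
  static final class Extraction {
    final LanceMatch match;
    final Optional<Expr> residual;

    Extraction(LanceMatch match, Optional<Expr> residual) {
      this.match = match;
      this.residual = residual;
    }
  }

  /** Recognizes only LanceMatch and And(LanceMatch, rest); everything else is left alone. */
  static Optional<Extraction> extract(Expr condition) {
    if (condition instanceof LanceMatch) {
      return Optional.of(new Extraction((LanceMatch) condition, Optional.<Expr>empty()));
    }
    if (condition instanceof And) {
      And and = (And) condition;
      // A second LanceMatch in the remainder is a multi-match shape: do not extract.
      if (and.left instanceof LanceMatch && !containsMatch(and.right)) {
        return Optional.of(new Extraction((LanceMatch) and.left, Optional.of(and.right)));
      }
    }
    return Optional.empty(); // unrecognized shape: the fallback eval handles it
  }

  /** True if a LanceMatch appears anywhere in a tree of And nodes. */
  static boolean containsMatch(Expr e) {
    if (e instanceof LanceMatch) {
      return true;
    }
    if (e instanceof And) {
      return containsMatch(((And) e).left) || containsMatch(((And) e).right);
    }
    return false;
  }

  /** Copies the relation options and adds the two FTS keys named in the commit message. */
  static Map<String, String> withFtsOptions(Map<String, String> options, LanceMatch match) {
    Map<String, String> out = new LinkedHashMap<>(options);
    out.put("_lance_fts_column", match.column);
    out.put("_lance_fts_query", match.query);
    return out;
  }
}
```

In this model, `extract` returning an empty `Optional` corresponds to the cases where the predicate stays in the plan and is evaluated by the substring fallback.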
Tests (base + version-specific subclasses on 3.4_2.12 / 3.5_2.12, with
cross-module reuse into 3.5_2.13 / 4.0_2.13 / 4.1_2.13 via existing
build-helper pom configuration): 2 FTS cases (single-term match,
combined with scalar filter). Full regression on 3.5_2.12 passes.
Documentation: new docs/src/operations/dql/fts.md page (syntax,
requirements, mechanics, limitations) cross-linked from the CREATE INDEX
FTS options section.
Known limitations (documented in fts.md):
- Only single LanceMatch extraction; OR/NOT/multi-match fall back to
Catalyst eval.
- Operator/fuzziness/boost not yet exposed.
- FTS selectivity not reported to the cost-based optimizer.
Count-star through FTS and the `_score` column are landing as separate
follow-up PRs stacked on this one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the five Spark `MetadataColumn` instances (`_rowid`, `_rowaddr`, `_row_created_at_version`, `_row_last_updated_at_version`, `_fragid`) from `LanceDataset` into a new `LanceMetadataColumns` registry class, and replace the hand-maintained exclusion filter and if-ladder in `LanceFragmentScanner.getColumnNames` with iteration over the registry.

Before this change, adding a virtual column required four coupled edits: a new anonymous `MetadataColumn` constant in `LanceDataset`, an entry in the `METADATA_COLUMNS` array, a clause in the "exclude from regular columns" filter, and a stanza in the "append to scanner projection" if-ladder. All four had to stay in lock-step; forgetting one produced silent bugs (a column in the schema but not in the scanner output, or Lance rejecting an unknown name). After this change, adding a column is a one-line entry in the registry.

Call sites that only needed the column name string (`LanceDataset.X_COLUMN.name()`) now reference `LanceConstant.X` directly; the `MetadataColumn` object was never used at those sites, only its name.

Pure refactor: no behavior change. All existing tests pass on `lance-spark-3.5_2.12` (the default module). The column-ordering invariant that `LanceFragmentScannerTest` exercises is preserved by the new `PROJECTABLE` list, which intentionally excludes `_fragid` (computed per-fragment outside the scanner's projection list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
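
As a rough illustration of the registry idea in this commit, the sketch below replaces an exclusion filter and an if-ladder with two loops over shared lists. It is a simplified model, not the actual `LanceMetadataColumns` class: the real registry holds Spark `MetadataColumn` objects, while this one holds plain names, and the output ordering shown (regular columns first, then metadata columns in registry order) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy registry holding names only; the real one holds Spark MetadataColumn objects. */
class MetadataRegistrySketch {

  /** Every metadata column name listed in the commit message. */
  static final List<String> ALL = Collections.unmodifiableList(Arrays.asList(
      "_rowid", "_rowaddr", "_row_created_at_version",
      "_row_last_updated_at_version", "_fragid"));

  /** Columns the scanner may project; _fragid is computed outside the projection list. */
  static final List<String> PROJECTABLE = Collections.unmodifiableList(Arrays.asList(
      "_rowid", "_rowaddr", "_row_created_at_version",
      "_row_last_updated_at_version"));

  /** Builds the scanner projection from the columns a query asked for. */
  static List<String> getColumnNames(List<String> requested) {
    List<String> out = new ArrayList<>();
    // Replaces the hand-written exclusion filter: keep only regular columns.
    for (String name : requested) {
      if (!ALL.contains(name)) {
        out.add(name);
      }
    }
    // Replaces the if-ladder: append requested metadata columns in registry order.
    for (String name : PROJECTABLE) {
      if (requested.contains(name)) {
        out.add(name);
      }
    }
    return out;
  }
}
```

With this shape, registering a new projectable column means adding its name to the two lists; `getColumnNames` needs no change.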
Exposes Lance's native BM25 relevance column as a projectable `Float`
`_score` column on Lance tables when an FTS predicate is active:
SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');
Implementation:
- `LanceConstant.FTS_SCORE = "_score"` — matches
`lance_index::scalar::inverted::SCORE_COL` on the Rust side.
- Registered as a nullable `Float` `MetadataColumn` through the
central `LanceMetadataColumns` registry (added to both `ALL` and
`PROJECTABLE`). The registry's generalization from the previous
metadata-columns refactor PR means no edits to `LanceDataset` or
`LanceFragmentScanner.getColumnNames` — the new column flows through
both the exclusion filter and the scanner-projection loop
automatically.
- `LanceFragmentScanner` rejects a `_score` reference without an
active FTS predicate at scan-build time with a clear error.
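
The scan-build-time guard in the last bullet could look roughly like the following. This is a hypothetical sketch: the class name, method name, parameter types, exception type, and error text are all assumptions for illustration, and only the `_score` constant and the "reject without an active FTS predicate" behavior come from the commit message.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical guard; not the actual LanceFragmentScanner code. */
class ScoreGuardSketch {

  /** Same value as LanceConstant.FTS_SCORE in the commit message. */
  static final String FTS_SCORE = "_score";

  /** Fails fast when _score is projected but no FTS query is attached to the scan. */
  static void validateProjection(List<String> projectedColumns, Optional<String> ftsQuery) {
    if (projectedColumns.contains(FTS_SCORE) && !ftsQuery.isPresent()) {
      throw new IllegalArgumentException(
          "Column '_score' requires an active lance_match(...) predicate in the same query");
    }
  }
}
```

Checking before the scanner is created surfaces a readable message instead of whatever error the underlying scan would produce for an unavailable column.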
Using `MetadataColumn` rather than a schema decorator keeps `_score`
hidden by default — `SELECT *` does not include it — which is the
correct virtual-column contract. The `LanceVirtualColumnsTable`
decorator (used for `_distance` in the vector-search path) would
promote the column into the base `schema()` and regress `SELECT *`;
`SupportsMetadataColumns` is the right Spark-native mechanism for
columns like `_score`.
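
The difference between the two mechanisms can be modeled in a few lines. The sketch below is a toy, not Spark code: `Table`, `expandStar`, and `canResolve` are hypothetical names, and the model captures only one rule, that `SELECT *` expands to the base schema while an explicit reference can also resolve against metadata columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of star expansion; names here are illustrative, not Spark APIs. */
class VirtualColumnContractSketch {

  /** A table with a base schema and separately declared metadata columns. */
  static final class Table {
    final List<String> schema;
    final List<String> metadataColumns;

    Table(List<String> schema, List<String> metadataColumns) {
      this.schema = schema;
      this.metadataColumns = metadataColumns;
    }
  }

  /** SELECT * sees the base schema only. */
  static List<String> expandStar(Table t) {
    return new ArrayList<>(t.schema);
  }

  /** An explicit column reference may resolve against either list. */
  static boolean canResolve(Table t, String name) {
    return t.schema.contains(name) || t.metadataColumns.contains(name);
  }

  /** _score declared as a metadata column: hidden from SELECT *. */
  static Table viaMetadataColumn() {
    return new Table(Arrays.asList("id", "content"), Arrays.asList("_score"));
  }

  /** _score promoted into the base schema by a decorator: visible in SELECT *. */
  static Table viaSchemaDecorator() {
    return new Table(
        Arrays.asList("id", "content", "_score"), Collections.<String>emptyList());
  }
}
```

In both variants `SELECT _score` resolves; only the decorator variant changes what `SELECT *` returns.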
Test: `BaseFtsQueryTest#testFtsScoreColumn` asserts that `_score` is a
non-null, positive BM25 value when FTS is active; the version-specific
`FtsQueryTest` subclasses inherit it.
Docs: adds "The `_score` Column" section to `fts.md`, the fallback
note about no BM25 without an index, and the `ORDER BY _score`
limitation around Spark's `RangePartitioner.sketch` path.
Depends on the central metadata-columns registry PR (carried as the
preceding commit on this branch until that PR lands); stacked on
top of the base FTS PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Exposes Lance's native BM25 relevance column as a projectable `Float` `_score` column on Lance tables when an FTS predicate is active:
SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');

Implementation
- `LanceConstant.FTS_SCORE = "_score"`, matching `lance_index::scalar::inverted::SCORE_COL` on the Rust side.
- Registered as a nullable `Float` `MetadataColumn` through the central `LanceMetadataColumns` registry from "refactor: centralize Lance metadata column definitions" (#452), added to both `ALL` and `PROJECTABLE`. The registry's generalization means no edits to `LanceDataset` or `LanceFragmentScanner.getColumnNames`: the new column flows through both the exclusion filter and the scanner-projection loop automatically. This is the payoff of the metadata-columns refactor PR.
- `LanceFragmentScanner` rejects a `_score` reference without an active FTS predicate at scan-build time with a clear error message.

Why `MetadataColumn`, not `LanceVirtualColumnsTable`
Using `MetadataColumn` rather than a schema decorator keeps `_score` hidden by default (`SELECT *` does not include it), which is the correct virtual-column contract. The `LanceVirtualColumnsTable` decorator (used for `_distance` in the vector-search path) would promote the column into the base `schema()` and regress `SELECT *`. `SupportsMetadataColumns` is the right Spark-native mechanism for columns like `_score`.

Stacking
Depends on #453 and #452; please merge both first. This branch is stacked on `feat/fts-base` with #452's registry commit cherry-picked on top (carried until #452 lands, at which point rebasing this branch onto `main` drops the duplicate commit cleanly).

Test plan
- `./mvnw test -pl lance-spark-3.5_2.12 -Dtest=FtsQueryTest`: 3 FTS cases pass (including new `testFtsScoreColumn`)
- `make test SPARK_VERSION=3.5 SCALA_VERSION=2.12`: full regression passes
- `make lint`: checkstyle + spotless clean

🤖 Generated with Claude Code