feat: surface the _score virtual column on FTS queries by wombatu-kun · Pull Request #455 · lance-format/lance-spark

wombatu-kun · 2026-04-18T13:37:15Z

Summary

Exposes Lance's native BM25 relevance column as a projectable Float _score column on Lance tables when an FTS predicate is active:

SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');

Implementation

- LanceConstant.FTS_SCORE = "_score", which matches lance_index::scalar::inverted::SCORE_COL on the Rust side.
- Registered as a nullable Float MetadataColumn through the central LanceMetadataColumns registry introduced in #452 (refactor: centralize Lance metadata column definitions), and added to both ALL and PROJECTABLE. Because the registry is generic, no edits to LanceDataset or LanceFragmentScanner.getColumnNames are needed: the new column flows through both the exclusion filter and the scanner-projection loop automatically. This is the payoff of the metadata-columns refactor PR.
- LanceFragmentScanner rejects a _score reference without an active FTS predicate at scan-build time with a clear error message.
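
The scan-build-time check in the last item could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: `ScoreProjectionCheck`, `validate`, and the error text are hypothetical, and the pushed-down FTS query is modelled as an `Optional<String>` rather than the real `FtsQuerySpec`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the scanner-side check; all names are illustrative. */
final class ScoreProjectionCheck {
  // Same value the PR description gives for LanceConstant.FTS_SCORE.
  static final String FTS_SCORE = "_score";

  /** Throws if the projection asks for _score but no FTS query was pushed down. */
  static void validate(List<String> projectedColumns, Optional<String> ftsQuery) {
    if (projectedColumns.contains(FTS_SCORE) && !ftsQuery.isPresent()) {
      throw new IllegalArgumentException(
          "Column '" + FTS_SCORE + "' requires an active lance_match(...) predicate");
    }
  }

  public static void main(String[] args) {
    // Accepted: _score is projected and an FTS query is present.
    validate(Arrays.asList("id", "_score"), Optional.of("apache spark"));
    // Accepted: no FTS query, but _score is not referenced.
    validate(Arrays.asList("id"), Optional.empty());
    try {
      // Rejected: _score is referenced without an FTS query.
      validate(Arrays.asList("id", "_score"), Optional.empty());
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing while the scan is being built, rather than returning nulls, surfaces the mistake before any fragment is read.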

Why `MetadataColumn`, not `LanceVirtualColumnsTable`

Using MetadataColumn rather than a schema decorator keeps _score hidden by default — SELECT * does not include it — which is the correct virtual-column contract. The LanceVirtualColumnsTable decorator (used for _distance in the vector-search path) would promote the column into the base schema() and regress SELECT *. SupportsMetadataColumns is the right Spark-native mechanism for columns like _score.
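
The difference between the two mechanisms can be illustrated with a toy model of column resolution. The sketch below does not use Spark's classes; `ToyTable`, `expandStar`, and `resolve` are hypothetical, and it only mimics the behaviour described above: `SELECT *` expands to the base schema, while a metadata column resolves only when it is named explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of a table with a base schema and separate metadata columns. */
final class ToyTable {
  final List<String> schema;          // stands in for what schema() reports
  final List<String> metadataColumns; // stands in for what metadataColumns() reports

  ToyTable(List<String> schema, List<String> metadataColumns) {
    this.schema = schema;
    this.metadataColumns = metadataColumns;
  }

  /** SELECT * sees only the base schema. */
  List<String> expandStar() {
    return new ArrayList<>(schema);
  }

  /** An explicit reference may resolve to a data column or a metadata column. */
  boolean resolve(String name) {
    return schema.contains(name) || metadataColumns.contains(name);
  }

  public static void main(String[] args) {
    // _score registered as a metadata column: hidden from SELECT *.
    ToyTable viaMetadata =
        new ToyTable(Arrays.asList("id", "content"), Arrays.asList("_score"));
    // _score promoted into the base schema, as a schema decorator would do.
    ToyTable viaDecorator =
        new ToyTable(Arrays.asList("id", "content", "_score"), Collections.<String>emptyList());

    System.out.println(viaMetadata.expandStar());      // prints [id, content]
    System.out.println(viaDecorator.expandStar());     // prints [id, content, _score]
    System.out.println(viaMetadata.resolve("_score")); // prints true
  }
}
```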

Stacking

Depends on #453 and #452 — please merge both first. This branch is stacked on feat/fts-base with #452's registry commit cherry-picked on top (carried until #452 lands, at which point rebasing this branch onto main drops the duplicate commit cleanly).

Test plan

./mvnw test -pl lance-spark-3.5_2.12 -Dtest=FtsQueryTest — 3 FTS cases pass (including new testFtsScoreColumn)
make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 — full regression passes
make lint — checkstyle + spotless clean
CI green across all 5 version modules

🤖 Generated with Claude Code

Add query-side FTS on Lance tables, complementing the existing ALTER TABLE ... CREATE INDEX USING fts syntax. Users can now run SELECT id, content FROM t WHERE lance_match(content, 'apache spark') and the predicate executes natively through the Lance inverted index rather than as a post-scan substring match.

Implementation:
- `LanceMatch` Catalyst expression registered via `injectFunction`, with `CodegenFallback` + a substring fallback eval for non-Lance scans or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule`: logical-plan optimizer rule that matches `Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))` BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` / `_lance_fts_query` into the relation's `CaseInsensitiveStringMap`. MVP handles top-level LanceMatch and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls `LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the spec through `LanceScan` → `LanceInputPartition` → per-fragment scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(queryText, column))` is applied natively.

Tests (base + version-specific subclasses on 3.4_2.12 / 3.5_2.12, with cross-module reuse into 3.5_2.13 / 4.0_2.13 / 4.1_2.13 via existing build-helper pom configuration): 2 FTS cases (single-term match, combined with scalar filter). Full regression on 3.5_2.12 passes.

Documentation: new docs/src/operations/dql/fts.md page (syntax, requirements, mechanics, limitations) cross-linked from the CREATE INDEX FTS options section.

Known limitations (documented in fts.md):
- Only single LanceMatch extraction; OR/NOT/multi-match fall back to Catalyst eval.
- Operator/fuzziness/boost not yet exposed.
- FTS selectivity not reported to the cost-based optimizer.

Count-star through FTS and the `_score` column are landing as separate follow-up PRs stacked on this one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
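
The two predicate shapes the MVP rule is said to handle can be sketched with a toy expression tree. This does not use Catalyst; `ToyFtsExtraction` and its nested types are hypothetical, and treating only the left operand of `And` as a match candidate is an assumption read off the `And(LanceMatch, rest)` wording, not confirmed behaviour of `LanceFtsPushdownRule`.

```java
import java.util.Optional;

/** Toy model of extracting a single FTS match from a filter condition. */
final class ToyFtsExtraction {
  interface Expr {}

  /** Stands in for LanceMatch(col, 'q'). */
  static final class Match implements Expr {
    final String column;
    final String query;
    Match(String column, String query) { this.column = column; this.query = query; }
  }

  static final class And implements Expr {
    final Expr left;
    final Expr right;
    And(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  /** Any other predicate, kept as opaque text. */
  static final class Other implements Expr {
    final String sql;
    Other(String sql) { this.sql = sql; }
  }

  /** Returns the match to push down, or empty so the caller falls back to row-by-row eval. */
  static Optional<Match> extract(Expr condition) {
    if (condition instanceof Match) {
      return Optional.of((Match) condition); // top-level match
    }
    if (condition instanceof And && ((And) condition).left instanceof Match) {
      return Optional.of((Match) ((And) condition).left); // And(match, rest)
    }
    return Optional.empty(); // OR, NOT, multi-match, and anything else
  }

  public static void main(String[] args) {
    Match m = new Match("content", "apache spark");
    System.out.println(extract(m).isPresent());                              // prints true
    System.out.println(extract(new And(m, new Other("id > 1"))).isPresent()); // prints true
    System.out.println(extract(new Or(m, new Other("id > 1"))).isPresent());  // prints false
  }
}
```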

Extract the five Spark `MetadataColumn` instances (`_rowid`, `_rowaddr`, `_row_created_at_version`, `_row_last_updated_at_version`, `_fragid`) from `LanceDataset` into a new `LanceMetadataColumns` registry class, and replace the hand-maintained exclusion filter and if-ladder in `LanceFragmentScanner.getColumnNames` with iteration over the registry.

Before this change, adding a virtual column required edits in four coupled places: a new anonymous `MetadataColumn` constant in `LanceDataset`, an entry in the `METADATA_COLUMNS` array, a clause in the "exclude from regular columns" filter, and a stanza in the "append to scanner projection" if-ladder. All four had to stay in lock-step; forgetting one produced silent bugs (column in schema but not in scanner output, or Lance rejecting an unknown name).

After this change, adding a column is a one-line entry in the registry. Call sites that only needed the column name string (`LanceDataset.X_COLUMN.name()`) now reference `LanceConstant.X` directly, since the `MetadataColumn` object was never used at those sites, only its name.

Pure refactor: no behavior change. All existing tests pass on `lance-spark-3.5_2.12` (the default module). The invariant that `LanceFragmentScannerTest` exercises around column ordering is preserved by the new `PROJECTABLE` list, which intentionally excludes `_fragid` (computed per-fragment outside the scanner's projection list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
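
A rough picture of how a registry can replace both the exclusion filter and the if-ladder is given below. This is a sketch only: `ToyMetadataColumns` and `scannerProjection` are hypothetical, the columns are plain strings rather than Spark `MetadataColumn` objects, and the ordering (data columns first, then metadata columns in registry order) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy registry: one list drives both the exclusion filter and the projection loop. */
final class ToyMetadataColumns {
  // Names come from the commit message; the list order here is an assumption.
  static final List<String> ALL = Arrays.asList(
      "_rowid", "_rowaddr", "_row_created_at_version", "_row_last_updated_at_version", "_fragid");

  // _fragid is left out: the commit message says it is computed outside the scanner projection.
  static final List<String> PROJECTABLE = ALL.subList(0, 4);

  /** Builds the column list handed to the scanner for a requested projection. */
  static List<String> scannerProjection(List<String> requested) {
    List<String> out = new ArrayList<>();
    for (String name : requested) {
      if (!ALL.contains(name)) {
        out.add(name); // exclusion filter: keep only regular data columns
      }
    }
    for (String name : PROJECTABLE) {
      if (requested.contains(name)) {
        out.add(name); // one loop instead of one if-branch per column
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(scannerProjection(Arrays.asList("_fragid", "_rowid", "id")));
    // prints [id, _rowid]
  }
}
```

With this shape, supporting a new column means adding one name to the lists; neither loop changes.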

Exposes Lance's native BM25 relevance column as a projectable `Float` `_score` column on Lance tables when an FTS predicate is active:

SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');

Implementation:
- `LanceConstant.FTS_SCORE = "_score"`, which matches `lance_index::scalar::inverted::SCORE_COL` on the Rust side.
- Registered as a nullable `Float` `MetadataColumn` through the central `LanceMetadataColumns` registry (added to both `ALL` and `PROJECTABLE`). The registry's generalization from the previous metadata-columns refactor PR means no edits to `LanceDataset` or `LanceFragmentScanner.getColumnNames`: the new column flows through both the exclusion filter and the scanner-projection loop automatically.
- `LanceFragmentScanner` rejects a `_score` reference without an active FTS predicate at scan-build time with a clear error.

Using `MetadataColumn` rather than a schema decorator keeps `_score` hidden by default (`SELECT *` does not include it), which is the correct virtual-column contract. The `LanceVirtualColumnsTable` decorator (used for `_distance` in the vector-search path) would promote the column into the base `schema()` and regress `SELECT *`; `SupportsMetadataColumns` is the right Spark-native mechanism for columns like `_score`.

Test: `BaseFtsQueryTest#testFtsScoreColumn` asserts `_score` is non-null and positive BM25 when FTS is active, inherited by the version-specific `FtsQueryTest` subclasses.

Docs: adds "The `_score` Column" section to `fts.md`, the fallback note about no BM25 without an index, and the `ORDER BY _score` limitation around Spark's `RangePartitioner.sketch` path.

Depends on the central metadata-columns registry PR (carried as the preceding commit on this branch until that PR lands); stacked on top of the base FTS PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Vova Kolmakov and others added 3 commits April 18, 2026 20:12

github-actions bot added the enhancement label (New feature or request) on Apr 18, 2026

wombatu-kun wants to merge 3 commits into lance-format:main from wombatu-kun:feat/fts-score
