feat: surface the _score virtual column on FTS queries #455

Open
wombatu-kun wants to merge 3 commits into lance-format:main from
wombatu-kun:feat/fts-score

Conversation

@wombatu-kun
Contributor

Summary

Exposes Lance's native BM25 relevance column as a projectable Float _score column on Lance tables when an FTS predicate is active:

SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');

Implementation

  • LanceConstant.FTS_SCORE = "_score" — matches lance_index::scalar::inverted::SCORE_COL on the Rust side.
  • Registered as a nullable Float MetadataColumn through the central LanceMetadataColumns registry introduced in #452 (refactor: centralize Lance metadata column definitions), and added to both ALL and PROJECTABLE. Because the registry drives both the exclusion filter and the scanner-projection loop, the new column needs no edits to LanceDataset or LanceFragmentScanner.getColumnNames. This is the payoff of that refactor.
  • LanceFragmentScanner rejects a _score reference without an active FTS predicate at scan-build time with a clear error message.
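
The scan-build-time check in the last bullet can be pictured roughly as follows. This is a minimal stdlib-only sketch, not the connector's code: the class name `ScoreProjectionCheck`, the `validate` signature, the use of `Optional<String>` in place of the FTS query spec, and the choice of `IllegalArgumentException` are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the scan-build-time check; not the connector's actual code. */
final class ScoreProjectionCheck {
    // Mirrors the value of LanceConstant.FTS_SCORE described above.
    static final String FTS_SCORE = "_score";

    /**
     * Throws if the projection asks for _score while no FTS query is attached.
     * ftsQuery is an assumed placeholder for the FTS query spec carried by the scan.
     */
    static void validate(List<String> projectedColumns, Optional<String> ftsQuery) {
        if (projectedColumns.contains(FTS_SCORE) && !ftsQuery.isPresent()) {
            // Exception type and wording are assumptions; the PR only promises "a clear error".
            throw new IllegalArgumentException(
                "Column '" + FTS_SCORE + "' requires an active FTS predicate, e.g. "
                    + "WHERE lance_match(<column>, '<query>')");
        }
    }
}
```

Failing at scan-build time rather than at read time means a query such as `SELECT _score FROM t` with no `lance_match` predicate is rejected before any fragment is opened.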

Why MetadataColumn, not LanceVirtualColumnsTable

Using MetadataColumn rather than a schema decorator keeps _score hidden by default — SELECT * does not include it — which is the correct virtual-column contract. The LanceVirtualColumnsTable decorator (used for _distance in the vector-search path) would promote the column into the base schema() and regress SELECT *. SupportsMetadataColumns is the right Spark-native mechanism for columns like _score.
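
The difference between the two mechanisms can be illustrated with a toy model. The sketch below does not use Spark; `HiddenColumnModel` and its methods are hypothetical names, and the column lists are made up. It only models the contract described above: `SELECT *` expands to `schema()`, while metadata columns resolve only when referenced by name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of star expansion versus explicit metadata-column references. */
final class HiddenColumnModel {
    // Data columns as reported by schema(); SELECT * expands to exactly these.
    static final List<String> SCHEMA = Arrays.asList("id", "content");
    // Columns reported separately, in the style of SupportsMetadataColumns.
    static final List<String> METADATA = Arrays.asList("_rowid", "_score");

    /** SELECT * sees only the schema columns. */
    static List<String> expandStar() {
        return new ArrayList<>(SCHEMA);
    }

    /** An explicit reference resolves against the schema first, then the metadata columns. */
    static boolean resolves(String name) {
        return SCHEMA.contains(name) || METADATA.contains(name);
    }

    /** What a schema decorator would do: append the virtual column to schema() itself. */
    static List<String> expandStarWithDecorator() {
        List<String> columns = new ArrayList<>(SCHEMA);
        columns.add("_score");
        return columns;
    }
}
```

In this model `expandStar()` never contains `_score` while `resolves("_score")` is true, which is the hidden-by-default behaviour; `expandStarWithDecorator()` shows the `SELECT *` regression the description warns about.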

Stacking

Depends on #453 and #452 — please merge both first. This branch is stacked on feat/fts-base with #452's registry commit cherry-picked on top (carried until #452 lands, at which point rebasing this branch onto main drops the duplicate commit cleanly).

Test plan

  • ./mvnw test -pl lance-spark-3.5_2.12 -Dtest=FtsQueryTest — 3 FTS cases pass (including new testFtsScoreColumn)
  • make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 — full regression passes
  • make lint — checkstyle + spotless clean
  • CI green across all 5 version modules

🤖 Generated with Claude Code

Vova Kolmakov and others added 3 commits April 18, 2026 20:12
Add query-side FTS on Lance tables, complementing the existing
ALTER TABLE ... CREATE INDEX USING fts syntax. Users can now run

    SELECT id, content FROM t WHERE lance_match(content, 'apache spark')

and the predicate executes natively through the Lance inverted index
rather than as a post-scan substring match.

Implementation:
- `LanceMatch` Catalyst expression registered via `injectFunction`,
  with `CodegenFallback` + a substring fallback eval for non-Lance scans
  or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule` — logical-plan optimizer rule that matches
  `Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))`
  BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` /
  `_lance_fts_query` into the relation's `CaseInsensitiveStringMap`.
  MVP handles top-level LanceMatch and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls
  `LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the
  spec through `LanceScan` → `LanceInputPartition` → per-fragment
  scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(
  queryText, column))` is applied natively.
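
The shape matching done by the pushdown rule can be sketched without Catalyst. The classes below are toy stand-ins, not Spark's expression types, and `ftsOptions` is a hypothetical helper; only the two option keys and the two supported shapes come from the description above. Whether the real rule also accepts a match on the right side of an `And` is not stated, so this sketch checks the left child only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the two filter shapes the MVP rule recognizes. */
final class FtsPushdownSketch {
    interface Expr {}

    static final class LanceMatch implements Expr {
        final String column;
        final String query;
        LanceMatch(String column, String query) { this.column = column; this.query = query; }
    }

    static final class And implements Expr {
        final Expr left;
        final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    /** Any other predicate, kept opaque. */
    static final class Other implements Expr {
        final String sql;
        Other(String sql) { this.sql = sql; }
    }

    /**
     * Returns the options to inject into the relation, or empty when the
     * condition is not a recognized shape (the query then falls back to
     * ordinary expression evaluation).
     */
    static Optional<Map<String, String>> ftsOptions(Expr condition) {
        LanceMatch match = null;
        if (condition instanceof LanceMatch) {
            match = (LanceMatch) condition;                  // Filter(LanceMatch(...))
        } else if (condition instanceof And && ((And) condition).left instanceof LanceMatch) {
            match = (LanceMatch) ((And) condition).left;     // Filter(And(LanceMatch, rest))
        }
        if (match == null) {
            return Optional.empty();
        }
        Map<String, String> options = new LinkedHashMap<>();
        options.put("_lance_fts_column", match.column);
        options.put("_lance_fts_query", match.query);
        return Optional.of(options);
    }
}
```
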

Tests (base + version-specific subclasses on 3.4_2.12 / 3.5_2.12, with
cross-module reuse into 3.5_2.13 / 4.0_2.13 / 4.1_2.13 via existing
build-helper pom configuration): 2 FTS cases (single-term match,
combined with scalar filter). Full regression on 3.5_2.12 passes.

Documentation: new docs/src/operations/dql/fts.md page (syntax,
requirements, mechanics, limitations) cross-linked from the CREATE INDEX
FTS options section.

Known limitations (documented in fts.md):
- Only single LanceMatch extraction; OR/NOT/multi-match fall back to
  Catalyst eval.
- Operator/fuzziness/boost not yet exposed.
- FTS selectivity not reported to the cost-based optimizer.

Count-star through FTS and the `_score` column are landing as separate
follow-up PRs stacked on this one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extract the five Spark `MetadataColumn` instances (`_rowid`, `_rowaddr`,
`_row_created_at_version`, `_row_last_updated_at_version`, `_fragid`) from
`LanceDataset` into a new `LanceMetadataColumns` registry class, and replace
the hand-maintained exclusion filter and if-ladder in
`LanceFragmentScanner.getColumnNames` with iteration over the registry.

Before this change, adding a virtual column required edits in four coupled
sites: a new anonymous `MetadataColumn` constant in `LanceDataset`, an entry
in the `METADATA_COLUMNS` array, a clause in the "exclude from regular
columns" filter, and a stanza in the "append to scanner projection" if-ladder.
All four had to stay in lock-step; forgetting one produced silent bugs
(column in schema but not in scanner output, or Lance rejecting an unknown
name). After this change, adding a column is a one-line entry in the registry.
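
A rough picture of the registry-driven approach, as a stdlib-only sketch: the class and method names below are hypothetical, columns are modelled as plain strings rather than `MetadataColumn` objects, and appending projectable columns in registry order is an assumption about how ordering is preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Simplified model of a metadata-column registry with ALL and PROJECTABLE lists. */
final class MetadataColumnRegistrySketch {
    // Every metadata column the table exposes.
    static final List<String> ALL = Collections.unmodifiableList(Arrays.asList(
        "_rowid", "_rowaddr", "_row_created_at_version",
        "_row_last_updated_at_version", "_fragid"));

    // Columns the scanner can project; _fragid is computed per fragment instead.
    static final List<String> PROJECTABLE = Collections.unmodifiableList(
        ALL.stream().filter(name -> !name.equals("_fragid")).collect(Collectors.toList()));

    /** Replaces the hand-maintained exclusion filter: drop anything in the registry. */
    static List<String> regularColumns(List<String> requested) {
        return requested.stream()
            .filter(name -> !ALL.contains(name))
            .collect(Collectors.toList());
    }

    /** Replaces the if-ladder: append requested projectable columns in registry order. */
    static List<String> scannerProjection(List<String> requested) {
        List<String> projection = new ArrayList<>(regularColumns(requested));
        for (String name : PROJECTABLE) {
            if (requested.contains(name)) {
                projection.add(name);
            }
        }
        return projection;
    }
}
```

With this structure, a new virtual column only needs an entry in `ALL` (and `PROJECTABLE` if the scanner should emit it); both helper methods pick it up without further changes.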

Call sites that only needed the column name string (`LanceDataset.X_COLUMN
.name()`) now reference `LanceConstant.X` directly — the `MetadataColumn`
object was never used at those sites, only its name.

Pure refactor: no behavior change. All existing tests pass on
`lance-spark-3.5_2.12` (the default module) — the invariant that
`LanceFragmentScannerTest` exercises around column ordering is preserved
by the new `PROJECTABLE` list, which intentionally excludes `_fragid`
(computed per-fragment outside the scanner's projection list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Exposes Lance's native BM25 relevance column as a projectable `Float`
`_score` column on Lance tables when an FTS predicate is active:

    SELECT id, _score FROM t WHERE lance_match(content, 'apache spark');

Implementation:
- `LanceConstant.FTS_SCORE = "_score"` — matches
  `lance_index::scalar::inverted::SCORE_COL` on the Rust side.
- Registered as a nullable `Float` `MetadataColumn` through the
  central `LanceMetadataColumns` registry (added to both `ALL` and
  `PROJECTABLE`). The registry's generalization from the previous
  metadata-columns refactor PR means no edits to `LanceDataset` or
  `LanceFragmentScanner.getColumnNames` — the new column flows through
  both the exclusion filter and the scanner-projection loop
  automatically.
- `LanceFragmentScanner` rejects a `_score` reference without an
  active FTS predicate at scan-build time with a clear error.

Using `MetadataColumn` rather than a schema decorator keeps `_score`
hidden by default — `SELECT *` does not include it — which is the
correct virtual-column contract. The `LanceVirtualColumnsTable`
decorator (used for `_distance` in the vector-search path) would
promote the column into the base `schema()` and regress `SELECT *`;
`SupportsMetadataColumns` is the right Spark-native mechanism for
columns like `_score`.

Test: `BaseFtsQueryTest#testFtsScoreColumn` asserts that `_score` is
non-null and holds a positive BM25 value when FTS is active; the test is
inherited by the version-specific `FtsQueryTest` subclasses.

Docs: adds "The `_score` Column" section to `fts.md`, the fallback
note about no BM25 without an index, and the `ORDER BY _score`
limitation around Spark's `RangePartitioner.sketch` path.

Depends on the central metadata-columns registry PR (carried as the
preceding commit on this branch until that PR lands); stacked on
top of the base FTS PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (bot) added the enhancement label on Apr 18, 2026