feat: support full-text search queries via lance_match() SQL function #453
Open
wombatu-kun wants to merge 1 commit into lance-format:main from
Add query-side FTS on Lance tables, complementing the existing
ALTER TABLE ... CREATE INDEX USING fts syntax. Users can now run

    SELECT id, content FROM t WHERE lance_match(content, 'apache spark')

and the predicate executes natively through the Lance inverted index
rather than as a post-scan substring match.
Implementation:
- `LanceMatch` Catalyst expression registered via `injectFunction`,
with `CodegenFallback` + a substring fallback eval for non-Lance scans
or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule` — logical-plan optimizer rule that matches
`Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))`
BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` /
`_lance_fts_query` into the relation's `CaseInsensitiveStringMap`.
MVP handles top-level LanceMatch and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls
`LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the
spec through `LanceScan` → `LanceInputPartition` → per-fragment
scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(
queryText, column))` is applied natively.
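The option handoff between the optimizer rule and the scan builder can be sketched with plain Java collections. This is a toy model, not the PR's code: only the two option keys come from the description above, while `FtsOptionsSketch`, `injectFts`, `readFts`, the fields of the stand-in `FtsQuerySpec`, and the `path` option are hypothetical names chosen for illustration. Lower-casing the keys only mimics what a case-insensitive options map would do.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Toy model of the option handoff; none of these classes are the PR's real code.
final class FtsOptionsSketch {
    // Option keys named in the PR description.
    static final String FTS_COLUMN = "_lance_fts_column";
    static final String FTS_QUERY = "_lance_fts_query";

    // Hypothetical stand-in for the PR's FtsQuerySpec (field names are assumptions).
    static final class FtsQuerySpec {
        final String column;
        final String queryText;

        FtsQuerySpec(String column, String queryText) {
            this.column = column;
            this.queryText = queryText;
        }
    }

    // Optimizer side: copy the existing options and add the two FTS keys.
    // Keys are lower-cased to mimic a case-insensitive options map.
    static Map<String, String> injectFts(Map<String, String> options, String column, String query) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            out.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        out.put(FTS_COLUMN, column);
        out.put(FTS_QUERY, query);
        return out;
    }

    // Scan-builder side: an FTS spec exists only when both keys are present.
    static Optional<FtsQuerySpec> readFts(Map<String, String> options) {
        String column = options.get(FTS_COLUMN);
        String query = options.get(FTS_QUERY);
        if (column == null || query == null) {
            return Optional.empty();
        }
        return Optional.of(new FtsQuerySpec(column, query));
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("path", "/tmp/t.lance"); // hypothetical pre-existing option
        Map<String, String> injected = injectFts(base, "content", "apache spark");
        FtsQuerySpec spec = readFts(injected).get();
        System.out.println(spec.column + " / " + spec.queryText);
        System.out.println(readFts(base).isPresent());
    }
}
```

Passing the query through relation options keeps the rule decoupled from the scan classes: a relation without the two keys simply produces no spec and scans as before.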
Tests (base + version-specific subclasses on 3.4_2.12 / 3.5_2.12, with
cross-module reuse into 3.5_2.13 / 4.0_2.13 / 4.1_2.13 via existing
build-helper pom configuration): 2 FTS cases (single-term match,
combined with scalar filter). Full regression on 3.5_2.12 passes.
Documentation: new docs/src/operations/dql/fts.md page (syntax,
requirements, mechanics, limitations) cross-linked from the CREATE INDEX
FTS options section.
Known limitations (documented in fts.md):
- Only single LanceMatch extraction; OR/NOT/multi-match fall back to
Catalyst eval.
- Operator/fuzziness/boost not yet exposed.
- FTS selectivity not reported to the cost-based optimizer.
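To make the first limitation concrete, the sketch below models which filter shapes would be pushed down and which would fall back. It uses a toy expression tree rather than Spark Catalyst classes, and every name in it (`FtsShapeSketch`, `Match`, `Scalar`, `pushable`, and so on) is hypothetical. It also assumes the match must be the left operand of the `And`, mirroring the `And(LanceMatch, rest)` shape as written; the actual rule may accept either order.

```java
import java.util.Optional;

// Toy expression tree; these are NOT Spark Catalyst classes.
final class FtsShapeSketch {
    interface Expr {}

    static final class Match implements Expr {
        final String column;
        final String query;
        Match(String column, String query) { this.column = column; this.query = query; }
    }

    static final class And implements Expr {
        final Expr left;
        final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Not implements Expr {
        final Expr child;
        Not(Expr child) { this.child = child; }
    }

    // Any non-FTS predicate, for example a scalar comparison.
    static final class Scalar implements Expr {
        final String sql;
        Scalar(String sql) { this.sql = sql; }
    }

    // True if a Match node appears anywhere in the tree.
    static boolean containsMatch(Expr e) {
        if (e instanceof Match) return true;
        if (e instanceof And) return containsMatch(((And) e).left) || containsMatch(((And) e).right);
        if (e instanceof Or) return containsMatch(((Or) e).left) || containsMatch(((Or) e).right);
        if (e instanceof Not) return containsMatch(((Not) e).child);
        return false;
    }

    // Returns the single Match to push down, or empty to signal the fallback path.
    static Optional<Match> pushable(Expr condition) {
        if (condition instanceof Match) {
            return Optional.of((Match) condition);
        }
        if (condition instanceof And) {
            And and = (And) condition;
            // Assumption: only And(Match, rest) with no further Match inside rest.
            if (and.left instanceof Match && !containsMatch(and.right)) {
                return Optional.of((Match) and.left);
            }
        }
        return Optional.empty(); // OR, NOT, and multi-match shapes land here
    }

    public static void main(String[] args) {
        Expr top = new Match("content", "apache spark");
        Expr combined = new And(new Match("content", "spark"), new Scalar("id > 10"));
        Expr either = new Or(new Match("content", "spark"), new Scalar("id > 10"));
        System.out.println(pushable(top).isPresent());
        System.out.println(pushable(combined).isPresent());
        System.out.println(pushable(either).isPresent());
    }
}
```

Returning an empty result rather than throwing matches the described behavior: unrecognized shapes stay in the plan and are evaluated by the expression's fallback eval.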
Count-star through FTS and the `_score` column are landing as separate
follow-up PRs stacked on this one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 18, 2026
Summary
Add query-side FTS on Lance tables, complementing the existing `ALTER TABLE ... CREATE INDEX USING fts` syntax:

    SELECT id, content FROM t WHERE lance_match(content, 'apache spark')

The predicate now executes natively through the Lance inverted index rather than as a post-scan substring match.
Implementation
- `LanceMatch`: Catalyst expression registered via `injectFunction`, with `CodegenFallback` + substring fallback eval for non-Lance scans or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule`: logical-plan optimizer rule that matches `Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))` BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` / `_lance_fts_query` into the relation's `CaseInsensitiveStringMap`. MVP handles top-level `LanceMatch` and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls `LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the spec through `LanceScan` → `LanceInputPartition` → per-fragment scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(queryText, column))` is applied natively.
Count-star through FTS and the `_score` column are landing as separate follow-up PRs stacked on this one.
Documentation
New `docs/src/operations/dql/fts.md` page (syntax, requirements, mechanics, limitations), cross-linked from the CREATE INDEX FTS options section.
Known limitations (documented in fts.md)
- Single `LanceMatch` extraction only; `OR` / `NOT` / multi-match fall back to Catalyst eval.
Test plan
- `./mvnw test -pl lance-spark-3.5_2.12 -Dtest=FtsQueryTest`: 2 FTS cases pass (single-term match, combined with scalar filter)
- `make test SPARK_VERSION=3.5 SCALA_VERSION=2.12`: full regression passes
- `make lint`: checkstyle + spotless clean
- … `build-helper` pom entries)
🤖 Generated with Claude Code