feat: support full-text search queries via lance_match() SQL function #453

Open
wombatu-kun wants to merge 1 commit into lance-format:main from wombatu-kun:feat/fts-base

@wombatu-kun

Summary

Add query-side FTS on Lance tables, complementing the existing ALTER TABLE ... CREATE INDEX USING fts syntax:

SELECT id, content FROM t WHERE lance_match(content, 'apache spark');

The predicate now executes natively through the Lance inverted index rather than as a post-scan substring match.

Implementation

  • LanceMatch — Catalyst expression registered via injectFunction, with CodegenFallback + substring fallback eval for non-Lance scans or query shapes the rule doesn't recognize.
  • LanceFtsPushdownRule — logical-plan optimizer rule that matches Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..)) BEFORE V2ScanRelationPushDown and injects _lance_fts_column / _lance_fts_query into the relation's CaseInsensitiveStringMap. MVP handles top-level LanceMatch and And(LanceMatch, rest) shapes.
  • Scan threading: LanceDataset.newScanBuilder reads those options and calls LanceScanBuilder.setFtsQuery(FtsQuerySpec); the builder threads the spec through LanceScan → LanceInputPartition → per-fragment scanner, where ScanOptions.Builder.fullTextQuery(FullTextQuery.match(queryText, column)) is applied natively.
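
The shape matching described for LanceFtsPushdownRule can be illustrated with a small, dependency-free model. This is only a sketch under stated assumptions: the `Expr`, `Opaque`, and `Pushdown` types are hypothetical stand-ins rather than Catalyst classes, and the PR text does not say whether the real rule accepts the match on the right side of an `And` or whether it removes the match from the `Filter`. Only the two option keys and the two handled shapes come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: these classes stand in for Catalyst expressions and are not Spark's API.
final class FtsPushdownSketch {
  interface Expr {}

  static final class LanceMatch implements Expr {
    final String column;
    final String query;
    LanceMatch(String column, String query) {
      this.column = column;
      this.query = query;
    }
  }

  static final class And implements Expr {
    final Expr left;
    final Expr right;
    And(Expr left, Expr right) {
      this.left = left;
      this.right = right;
    }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) {
      this.left = left;
      this.right = right;
    }
  }

  // Any predicate the sketch does not look inside, such as a scalar comparison.
  static final class Opaque implements Expr {
    final String sql;
    Opaque(String sql) {
      this.sql = sql;
    }
  }

  // Hypothetical result type: options to inject, plus the remaining predicate (null if none).
  static final class Pushdown {
    final Map<String, String> options;
    final Expr rest;
    Pushdown(Map<String, String> options, Expr rest) {
      this.options = options;
      this.rest = rest;
    }
  }

  // Builds the two relation options named in the PR description.
  static Map<String, String> toOptions(LanceMatch match) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("_lance_fts_column", match.column);
    options.put("_lance_fts_query", match.query);
    return options;
  }

  // Recognizes only a top-level LanceMatch and And(LanceMatch, rest).
  // Every other shape yields empty, modelling the fallback to row-by-row evaluation.
  static Optional<Pushdown> extract(Expr condition) {
    if (condition instanceof LanceMatch) {
      return Optional.of(new Pushdown(toOptions((LanceMatch) condition), null));
    }
    if (condition instanceof And) {
      And and = (And) condition;
      // Assumption: the match sits on the left and the right side is not itself a match.
      if (and.left instanceof LanceMatch && !(and.right instanceof LanceMatch)) {
        return Optional.of(new Pushdown(toOptions((LanceMatch) and.left), and.right));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Expr filter = new And(new LanceMatch("content", "apache spark"), new Opaque("id > 10"));
    System.out.println(extract(filter).map(p -> p.options).orElse(null));
  }
}
```

In this model an `Or` at the top level returns `Optional.empty()`, which mirrors the documented behaviour that OR / NOT / multi-match shapes are not pushed down.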

Count-star through FTS and the _score column are landing as separate follow-up PRs stacked on this one.

Documentation

New docs/src/operations/dql/fts.md page (syntax, requirements, mechanics, limitations), cross-linked from the CREATE INDEX FTS options section.

Known limitations (documented in fts.md)

  • Single LanceMatch extraction only; OR / NOT / multi-match fall back to Catalyst eval.
  • Operator / fuzziness / boost not yet exposed.
  • FTS selectivity not reported to the cost-based optimizer.
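
For the shapes that fall back to Catalyst eval, the predicate is evaluated per row as a substring test. The following is a rough sketch of what such a fallback could look like, not the PR's code: the class name is hypothetical, and both the null propagation and the case-sensitive comparison are assumptions, since the description does not specify them.

```java
// Hypothetical sketch of a per-row substring fallback; not the actual LanceMatch eval.
final class SubstringFallbackSketch {

  // Assumption: a null input yields null (SQL-style), otherwise a case-sensitive contains.
  static Boolean eval(String value, String query) {
    if (value == null || query == null) {
      return null;
    }
    return value.contains(query);
  }

  public static void main(String[] args) {
    System.out.println(eval("learning apache spark internals", "apache spark"));
    System.out.println(eval("Apache Spark internals", "apache spark"));
    System.out.println(eval(null, "apache spark"));
  }
}
```

A plain substring test is sensitive to case and to word order, so a fallback of this kind could return different rows than a tokenized index match for the same query text.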

Test plan

  • ./mvnw test -pl lance-spark-3.5_2.12 -Dtest=FtsQueryTest — 2 FTS cases pass (single-term match, combined with scalar filter)
  • make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 — full regression passes
  • make lint — checkstyle + spotless clean
  • CI green across 3.4_2.12 / 3.5_2.12 / 3.5_2.13 / 4.0_2.13 / 4.1_2.13 (FTS test inherited cross-module via existing build-helper pom entries)

🤖 Generated with Claude Code

Add query-side FTS on Lance tables, complementing the existing
ALTER TABLE ... CREATE INDEX USING fts syntax. Users can now run

    SELECT id, content FROM t WHERE lance_match(content, 'apache spark')

and the predicate executes natively through the Lance inverted index
rather than as a post-scan substring match.

Implementation:
- `LanceMatch` Catalyst expression registered via `injectFunction`,
  with `CodegenFallback` + a substring fallback eval for non-Lance scans
  or query shapes the rule doesn't recognize.
- `LanceFtsPushdownRule` — logical-plan optimizer rule that matches
  `Filter(LanceMatch(col, 'q'), DataSourceV2Relation(LanceTable, ..))`
  BEFORE V2ScanRelationPushDown and injects `_lance_fts_column` /
  `_lance_fts_query` into the relation's `CaseInsensitiveStringMap`.
  MVP handles top-level LanceMatch and `And(LanceMatch, rest)` shapes.
- `LanceDataset.newScanBuilder` reads those options and calls
  `LanceScanBuilder.setFtsQuery(FtsQuerySpec)`; the builder threads the
  spec through `LanceScan` → `LanceInputPartition` → per-fragment
  scanner, where `ScanOptions.Builder.fullTextQuery(FullTextQuery.match(
  queryText, column))` is applied natively.
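
The scan-side step of reading the two injected options back into a query spec might look roughly like the sketch below. It is a standalone illustration: the `Spec` class and `fromOptions` helper are hypothetical names, the real `FtsQuerySpec` fields are not shown in this description, and a plain `Map` with lower-cased keys stands in for Spark's `CaseInsensitiveStringMap`.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: turns the two relation options into a (column, queryText) pair.
final class FtsOptionsSketch {
  static final String COLUMN_KEY = "_lance_fts_column";
  static final String QUERY_KEY = "_lance_fts_query";

  // Stand-in for FtsQuerySpec; the field names are assumptions.
  static final class Spec {
    final String column;
    final String queryText;
    Spec(String column, String queryText) {
      this.column = column;
      this.queryText = queryText;
    }
  }

  // Returns a spec only when both options are present; otherwise the scan stays a plain scan.
  static Optional<Spec> fromOptions(Map<String, String> options) {
    // Lower-case the keys to imitate case-insensitive option lookup.
    Map<String, String> normalized = new HashMap<>();
    for (Map.Entry<String, String> entry : options.entrySet()) {
      normalized.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
    }
    String column = normalized.get(COLUMN_KEY);
    String queryText = normalized.get(QUERY_KEY);
    if (column == null || queryText == null) {
      return Optional.empty();
    }
    return Optional.of(new Spec(column, queryText));
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put("_LANCE_FTS_COLUMN", "content");
    options.put("_lance_fts_query", "apache spark");
    fromOptions(options).ifPresent(s -> System.out.println(s.column + " / " + s.queryText));
  }
}
```

Requiring both keys keeps a partially populated option map from producing a half-specified query.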

Tests (base + version-specific subclasses on 3.4_2.12 / 3.5_2.12, with
cross-module reuse into 3.5_2.13 / 4.0_2.13 / 4.1_2.13 via existing
build-helper pom configuration): 2 FTS cases (single-term match,
combined with scalar filter). Full regression on 3.5_2.12 passes.

Documentation: new docs/src/operations/dql/fts.md page (syntax,
requirements, mechanics, limitations) cross-linked from the CREATE INDEX
FTS options section.

Known limitations (documented in fts.md):
- Only single LanceMatch extraction; OR/NOT/multi-match fall back to
  Catalyst eval.
- Operator/fuzziness/boost not yet exposed.
- FTS selectivity not reported to the cost-based optimizer.

Count-star through FTS and the `_score` column are landing as separate
follow-up PRs stacked on this one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

enhancement New feature or request
