Skip to content

refactor: centralize Lance metadata column definitions#452

Open
wombatu-kun wants to merge 1 commit intolance-format:mainfrom
wombatu-kun:refactor/metadata-columns-registry
Open

refactor: centralize Lance metadata column definitions#452
wombatu-kun wants to merge 1 commit intolance-format:mainfrom
wombatu-kun:refactor/metadata-columns-registry

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

Summary

  • Extract the five MetadataColumn instances (_rowid, _rowaddr, _row_created_at_version, _row_last_updated_at_version, _fragid) from LanceDataset into a new LanceMetadataColumns registry class
  • Replace the hand-maintained exclusion filter and if-ladder in LanceFragmentScanner.getColumnNames with iteration over ALL / PROJECTABLE
  • Migrate callers that only needed the column name string (LanceDataset.X_COLUMN.name()LanceConstant.X)

Motivation

Before this change, adding a virtual column took edits in three coupled sites: a new anonymous MetadataColumn constant in LanceDataset, an entry in the METADATA_COLUMNS array, a clause in the "exclude from regular columns" filter, and a stanza in the "append to scanner projection" if-ladder. All had to stay in lock-step; forgetting one produced silent bugs (column in schema but not in scanner output, or Lance rejecting an unknown name). After, adding a column is a one-line registry entry.

This refactor unblocks the _score column PR (stacked in the FTS PR series); future virtual columns such as _score_explain benefit from the same one-line shape.

Pure refactor — no user-visible behavior change. The asymmetry between ALL (exclusion + Spark registration) and PROJECTABLE (scanner projection) pins the invariant that _fragid is computed per-fragment outside the scanner, preserving the original behavior.

Test plan

  • make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 — full regression passes
  • make lint — checkstyle + spotless clean
  • CI green across all 5 version modules (3.4_2.12, 3.5_2.12, 3.5_2.13, 4.0_2.13, 4.1_2.13)

🤖 Generated with Claude Code

Extract the five Spark `MetadataColumn` instances (`_rowid`, `_rowaddr`,
`_row_created_at_version`, `_row_last_updated_at_version`, `_fragid`) from
`LanceDataset` into a new `LanceMetadataColumns` registry class, and replace
the hand-maintained exclusion filter and if-ladder in
`LanceFragmentScanner.getColumnNames` with iteration over the registry.

Before this change, adding a virtual column required edits in three coupled
sites: a new anonymous `MetadataColumn` constant in `LanceDataset`, an entry
in the `METADATA_COLUMNS` array, a clause in the "exclude from regular
columns" filter, and a stanza in the "append to scanner projection" if-ladder.
All four had to stay in lock-step; forgetting one produced silent bugs
(column in schema but not in scanner output, or Lance rejecting an unknown
name). After this change, adding a column is a one-line entry in the registry.

Call sites that only needed the column name string (`LanceDataset.X_COLUMN
.name()`) now reference `LanceConstant.X` directly — the `MetadataColumn`
object was never used at those sites, only its name.

Pure refactor: no behavior change. All existing tests pass on
`lance-spark-3.5_2.12` (the default module) — the invariant that
`LanceFragmentScannerTest` exercises around column ordering is preserved
by the new `PROJECTABLE` list, which intentionally excludes `_fragid`
(computed per-fragment outside the scanner's projection list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant