feat: implement native update path with stable row-id preservation #407

Draft
ivscheianu wants to merge 2 commits into lance-format:main from ivscheianu:stable-row-id
ivscheianu (Contributor) commented Apr 9, 2026

Fix for #406

Summary

Reworks SparkPositionDeltaWrite (Spark 3.5) to use Spark's native DeltaWriter.update() method instead of representUpdateAsDeleteAndInsert(), enabling stable row-ID preservation across updates.

Problem

The previous implementation threw UnsupportedOperationException from update(), forcing Spark to decompose updates into separate delete + insert operations. This prevented the connector from knowing which _rowid maps to which newly written row, making it impossible to attach RowIdMeta to new fragments. Without RowIdMeta, lance-core cannot trace updated rows back to their originals, breaking _row_created_at_version tracking and stable row-ID continuity.

Additionally, each worker opened its own Dataset to apply deletions, causing unnecessary I/O and potential race conditions.

Changes

Spark 3.5 (SparkPositionDeltaWrite):

  • Implement update(): capture _rowid from the id row, record deletion via _rowaddr, and write the data row — all in one call.
  • On commit(): if stable row IDs are enabled, attach RowIdMeta to new fragments using RowIdMeta.fromRowIds() (from companion lance PR).
  • Move deletion application from workers to driver: tasks send Map<fragmentId, RoaringBitmap> to the driver, which consolidates and applies deletions in a single Dataset session.
  • Enable useStableRowIds(true) on CommitBuilder when the dataset has them.
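
To make the task-side part of this concrete, here is a minimal sketch of the bookkeeping a single update() call might do. It is not the connector's code: the class UpdateBookkeeping and its methods are hypothetical, java.util.BitSet stands in for RoaringBitmap so the example needs only the JDK, and it assumes that _rowaddr packs the fragment ID into the upper 32 bits and the row offset into the lower 32 bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-task state for update(); an illustration, not the connector's class. */
final class UpdateBookkeeping {
    // fragment id -> row offsets deleted in that fragment.
    // BitSet is a stand-in for RoaringBitmap (assumption made for a JDK-only sketch).
    private final Map<Integer, BitSet> deletions = new HashMap<>();

    // _rowid values of updated rows, in the order their replacement rows are written.
    private final List<Long> capturedRowIds = new ArrayList<>();

    // Assumed _rowaddr layout: upper 32 bits = fragment id, lower 32 bits = row offset.
    static int fragmentId(long rowAddr) {
        return (int) (rowAddr >>> 32);
    }

    static int rowOffset(long rowAddr) {
        return (int) rowAddr;
    }

    /** Records one update: remember the stable row ID and mark the old row as deleted. */
    void recordUpdate(long rowId, long rowAddr) {
        capturedRowIds.add(rowId);
        // The sketch assumes offsets fit in a non-negative int, as BitSet requires.
        deletions.computeIfAbsent(fragmentId(rowAddr), k -> new BitSet())
                 .set(rowOffset(rowAddr));
        // A real writer would also write the new data row at this point.
    }

    Map<Integer, BitSet> deletions() {
        return deletions;
    }

    List<Long> capturedRowIds() {
        return capturedRowIds;
    }
}
```

Because the row ID is captured in the same call that writes the replacement row, the order of capturedRowIds matches the order of the newly written rows, which is the mapping that the delete + insert decomposition loses.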

Spark 3.4 (SparkPositionDeltaWrite):

  • Same driver-side deletion consolidation refactor for consistency.
  • No row-ID capture (3.4's DSv2 API doesn't support update()).
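
The driver-side deletion consolidation shared by both Spark versions can be sketched as a merge of the per-task maps. This is an illustration under assumptions rather than the PR's implementation: DeletionConsolidator is a hypothetical name, and java.util.BitSet again stands in for RoaringBitmap so the example needs only the JDK.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical driver-side helper; an illustration, not the connector's class. */
final class DeletionConsolidator {
    /**
     * Merges the fragment-id -> deleted-offsets maps reported by each task
     * into a single map, taking the union of offsets per fragment.
     */
    static Map<Integer, BitSet> consolidate(List<Map<Integer, BitSet>> perTask) {
        Map<Integer, BitSet> merged = new HashMap<>();
        for (Map<Integer, BitSet> task : perTask) {
            for (Map.Entry<Integer, BitSet> entry : task.entrySet()) {
                // or() unions the task's offsets into the merged bitmap in place.
                merged.computeIfAbsent(entry.getKey(), k -> new BitSet())
                      .or(entry.getValue());
            }
        }
        return merged;
    }
}
```

With the merged map in hand, the driver can apply all deletions to each fragment in one pass, so only one Dataset needs to be opened instead of one per worker.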

CDF test expectation corrections:

  • BaseCdfVersionTrackingTest, BaseCdfConfigTest, BaseCdfQueryPatternsTest: updated _row_created_at_version expectations from 1 to the actual insert version (typically 2). Previously, the tests passed only because the JNI gap in lance-core was silently dropping version metadata.

Companion PR

This PR depends on lance-format/lance#6465 which fixes the JNI version metadata serialization gap and the row-ID lookup bug in Operation::Update. Both PRs are needed for correct end-to-end _row_created_at_version tracking.

github-actions Bot (Contributor) commented Apr 9, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

github-actions Bot added the "bug" label (Something isn't working) Apr 9, 2026
ivscheianu changed the title from "fix: spark connector cannot preserve stable row ids across updates" to "feat: implement native update path with stable row-id preservation" Apr 9, 2026
github-actions Bot added the "enhancement" label (New feature or request) Apr 9, 2026