[SPARK-56970][SS] Split CommitMetadata into CommitMetadataBase + V1/V2 case classes #56307

Open
ericm-db wants to merge 1 commit into apache:branch-4.x from ericm-db:SPARK-56970-branch-4.x

@ericm-db
Contributor

@ericm-db ericm-db commented Jun 3, 2026

What changes were proposed in this pull request?

Backport of [SPARK-56970] (#56018) to branch-4.x.

Refactor CommitLog so that the commit log metadata is dispatched through a CommitMetadataBase trait with concrete CommitMetadata (V1, watermark only) and CommitMetadataV2 (watermark + stateUniqueIds) case classes. The deserializer now reads the wire-format version from the file header and constructs the matching subclass.

This is preparation for CommitMetadataV3 (which adds sink metadata for streaming sink evolution) in a follow-up.

Notable changes:

  • Add CommitMetadataBase trait and CommitMetadataV2 case class.
  • CommitMetadata becomes V1 (no stateUniqueIds field).
  • Add CommitLog.createMetadata factory that dispatches by version and defaults to the configured STATE_STORE_CHECKPOINT_FORMAT_VERSION.
  • CommitLog.readCommitMetadata reads the version line and constructs the matching subclass.
  • MicroBatchExecution, OfflineStateRepartitionRunner, and the existing tests are updated to use the new types / factory.
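The shape described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the field name `nextBatchWatermarkMs`, the `Map[Long, Seq[String]]` type for `stateUniqueIds`, the `version` member, and the `configuredVersion` stand-in for `STATE_STORE_CHECKPOINT_FORMAT_VERSION` are all assumptions made for the example.

```scala
object CommitMetadataSketch {
  // Common interface; each concrete class reports its own wire-format version.
  sealed trait CommitMetadataBase {
    def version: Int
    def nextBatchWatermarkMs: Long
  }

  // V1: watermark only.
  final case class CommitMetadata(nextBatchWatermarkMs: Long = 0L)
      extends CommitMetadataBase {
    val version: Int = 1
  }

  // V2: watermark plus state unique IDs (value type simplified for the sketch).
  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long = 0L,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty)
      extends CommitMetadataBase {
    val version: Int = 2
  }

  // Hypothetical stand-in for the configured checkpoint format version.
  val configuredVersion: Int = 2

  // Factory that dispatches by version and defaults to the configured one.
  // For version 1 the stateUniqueIds argument is simply not used.
  def createMetadata(
      nextBatchWatermarkMs: Long = 0L,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty,
      version: Int = configuredVersion): CommitMetadataBase =
    version match {
      case 1 => CommitMetadata(nextBatchWatermarkMs)
      case 2 => CommitMetadataV2(nextBatchWatermarkMs, stateUniqueIds)
      case other =>
        throw new IllegalArgumentException(
          s"Unsupported commit metadata version: $other")
    }
}
```

With this layout, adding a V3 would mean one more case class and one more `case` in the factory, which is the extension point the follow-up is expected to use.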

Why are the changes needed?

The pre-refactor CommitMetadata carried both the V1 and V2 wire shapes in a single case class, with stateUniqueIds optional. That made it awkward to add a V3 wire format with additional fields, and forced serialize to take the wire version from SQLConf rather than from the metadata itself.
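To illustrate the last point, here is a minimal sketch of a `serialize` that takes the version from the metadata object instead of from configuration. It assumes a `v<N>` header line followed by a JSON payload, builds the JSON by hand to stay dependency-free, and uses simplified field types; none of these details are taken from the actual patch.

```scala
object SerializeSketch {
  private def quote(s: String): String = "\"" + s + "\""

  sealed trait CommitMetadataBase {
    def version: Int
    def json: String
  }

  final case class CommitMetadata(nextBatchWatermarkMs: Long)
      extends CommitMetadataBase {
    val version: Int = 1
    // V1 payload carries the watermark only; no stateUniqueIds key is written.
    def json: String =
      "{" + quote("nextBatchWatermarkMs") + ":" + nextBatchWatermarkMs + "}"
  }

  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long,
      stateUniqueIds: Map[Long, Seq[String]])
      extends CommitMetadataBase {
    val version: Int = 2
    def json: String = {
      // Sort by operator id so the output is deterministic.
      val ids = stateUniqueIds.toSeq.sortBy(_._1)
        .map { case (opId, uids) =>
          quote(opId.toString) + ":" + uids.map(quote).mkString("[", ",", "]")
        }
        .mkString("{", ",", "}")
      "{" + quote("nextBatchWatermarkMs") + ":" + nextBatchWatermarkMs + "," +
        quote("stateUniqueIds") + ":" + ids + "}"
    }
  }

  // The header version comes from the metadata instance, not from a conf lookup.
  def serialize(metadata: CommitMetadataBase): String =
    "v" + metadata.version + "\n" + metadata.json
}
```

Because the version travels with the value, a writer cannot emit a V2 payload under a V1 header (or the reverse) just because the session configuration changed between construction and write.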

Does this PR introduce any user-facing change?

No new public API. The wire format for V1 changes slightly: V1 commit log files no longer serialize stateUniqueIds: null. Old V1 files continue to be read because the V1 deserializer ignores the (now-unknown) field.
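The backward-compatibility claim can be pictured with a small tolerant reader. This sketch uses a regular expression instead of a JSON library purely to stay self-contained, and the field name `nextBatchWatermarkMs` and the sample payloads are assumptions; the real deserializer is not reproduced here.

```scala
object TolerantV1ReaderSketch {
  final case class CommitMetadata(nextBatchWatermarkMs: Long)

  // Matches only the watermark field; every other key in the payload is ignored.
  private val WatermarkField = """"nextBatchWatermarkMs"\s*:\s*(-?\d+)""".r

  def parseV1(json: String): CommitMetadata =
    WatermarkField.findFirstMatchIn(json) match {
      case Some(m) => CommitMetadata(m.group(1).toLong)
      case None =>
        throw new IllegalArgumentException(
          s"Missing nextBatchWatermarkMs in: $json")
    }
}
```

Under these assumptions, an old-style payload such as `{"nextBatchWatermarkMs":7,"stateUniqueIds":null}` and a new-style payload `{"nextBatchWatermarkMs":7}` both parse to the same `CommitMetadata(7)`.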

This PR also relaxes the version-exact-match check on read so that a commit log opened with the V2 conf can deserialize a V1 file. This incidentally resolves SPARK-50653.
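A rough sketch of the difference between the exact-match check and a relaxed one is below. The `v<N>` header format and the specific relaxed rule (accept any file version from 1 up to the configured version) are assumptions for illustration; the PR text only states that a V1 file is readable under the V2 conf.

```scala
object VersionCheckSketch {
  // Parses a header line such as "v1" into 1.
  def parseVersion(header: String): Int = {
    val trimmed = header.trim
    require(trimmed.startsWith("v") && trimmed.length > 1,
      s"Malformed version header: $header")
    try trimmed.substring(1).toInt
    catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"Malformed version header: $header")
    }
  }

  // Old behaviour as described: the file version must equal the configured one.
  def strictlyAccepts(fileVersion: Int, configuredVersion: Int): Boolean =
    fileVersion == configuredVersion

  // Assumed relaxed rule: older file versions are readable under a newer conf.
  def relaxedAccepts(fileVersion: Int, configuredVersion: Int): Boolean =
    fileVersion >= 1 && fileVersion <= configuredVersion

  // Reads the version from the first line and applies the relaxed check.
  def readVersion(content: String, configuredVersion: Int): Int = {
    val fileVersion = parseVersion(content.split("\n", 2).head)
    if (!relaxedAccepts(fileVersion, configuredVersion)) {
      throw new IllegalStateException(
        s"Cannot read v$fileVersion commit log with configured version $configuredVersion")
    }
    fileVersion
  }
}
```

The caller would then use the returned version to pick the matching metadata class, so the configured version acts as an upper bound rather than as the selector.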

How was this patch tested?

  • Existing CommitLogSuite (V1, V2, and cross-version); the cross-version test now asserts successful V1 deserialization.
  • sql/core main and test sources compile cleanly on branch-4.x (build/sbt sql/Test/compile).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

@anishshri-db
Contributor

@ericm-db - can you make sure that CI is passing here?

@ericm-db
Contributor Author

ericm-db commented Jun 4, 2026

@cloud-fan Can you help resolve the CI here? It seems there are CI failures related to the Spark 5.0 build that are unrelated to this change.

@ericm-db ericm-db force-pushed the SPARK-56970-branch-4.x branch from a55e188 to 5fddbac Compare June 4, 2026 22:28
@cloud-fan
Contributor

CI fixed by #56325, please rebase

…2 case classes

Refactor `CommitLog` so that the commit log metadata is dispatched through a
`CommitMetadataBase` trait with concrete `CommitMetadata` (V1, watermark only)
and `CommitMetadataV2` (watermark + `stateUniqueIds`) case classes. The
deserializer now reads the wire-format version from the file header and
constructs the matching subclass.

This is preparation for `CommitMetadataV3` (which adds sink metadata for
streaming sink evolution) in a follow-up PR.

Notable changes:
- Add `CommitMetadataBase` trait and `CommitMetadataV2` case class.
- `CommitMetadata` becomes V1 (no `stateUniqueIds` field).
- Add `CommitLog.createMetadata` factory that dispatches by version and
  defaults to the configured `STATE_STORE_CHECKPOINT_FORMAT_VERSION`.
- `CommitLog.readCommitMetadata` reads the version line and constructs the
  matching subclass.
- `MicroBatchExecution`, `OfflineStateRepartitionRunner`, and the existing
  tests are updated to use the new types / factory.

The pre-refactor `CommitMetadata` carried both the V1 and V2 wire shapes in a
single case class, with `stateUniqueIds` optional. That made it awkward to
add a V3 wire format with additional fields, and forced `serialize` to
take the wire version from `SQLConf` rather than from the metadata itself.

No new public API. The wire format for V1 changes slightly: V1 commit log
files no longer serialize `stateUniqueIds: null`. Old V1 files continue to
be read because the V1 deserializer ignores the (now-unknown) field.

This PR also relaxes the version-exact-match check on read so that a
commit log opened with the V2 conf can deserialize a V1 file. This
incidentally resolves SPARK-50653.

- Existing `CommitLogSuite` (V1, V2, and cross-version) passes; the
  cross-version test now asserts successful V1 deserialization.
- `StreamingSinkEvolutionSuite` (from SPARK-56719) still passes.

Generated-by: Claude Code (claude-opus-4-7)

Co-authored-by: Isaac
@ericm-db ericm-db force-pushed the SPARK-56970-branch-4.x branch from 5fddbac to ce8d843 Compare June 5, 2026 21:34
