[SPARK-56970][SS] Split CommitMetadata into CommitMetadataBase + V1/V2 case classes #56307
Open
ericm-db wants to merge 1 commit into
Contributor
@ericm-db - can you make sure that CI is passing here?
Contributor Author
@cloud-fan Can you help resolve the CI here? It seems that there are CI failures related to the Spark 5.0 build that are unrelated to this change.
a55e188 to
5fddbac
Contributor
CI fixed by #56325, please rebase.
…2 case classes

Refactor `CommitLog` so that the commit log metadata is dispatched through a `CommitMetadataBase` trait with concrete `CommitMetadata` (V1, watermark only) and `CommitMetadataV2` (watermark + `stateUniqueIds`) case classes. The deserializer now reads the wire-format version from the file header and constructs the matching subclass.

This is preparation for `CommitMetadataV3` (which adds sink metadata for streaming sink evolution) in a follow-up PR.

Notable changes:
- Add `CommitMetadataBase` trait and `CommitMetadataV2` case class.
- `CommitMetadata` becomes V1 (no `stateUniqueIds` field).
- Add `CommitLog.createMetadata` factory that dispatches by version and defaults to the configured `STATE_STORE_CHECKPOINT_FORMAT_VERSION`.
- `CommitLog.readCommitMetadata` reads the version line and constructs the matching subclass.
- `MicroBatchExecution`, `OfflineStateRepartitionRunner`, and the existing tests are updated to use the new types / factory.

The pre-refactor `CommitMetadata` carried both the V1 and V2 wire shape in a single case class, with `stateUniqueIds` optional. That made it awkward to add a V3 wire format with additional fields, and forced `serialize` to take the wire version from `SQLConf` rather than from the metadata itself.

No new public API. The wire format for V1 changes slightly: V1 commit log files no longer serialize `stateUniqueIds: null`. Old V1 files continue to be read because the V1 deserializer ignores the (now-unknown) field.

This PR also relaxes the version-exact-match check on read so that a commit log opened with the V2 conf can deserialize a V1 file. This incidentally resolves SPARK-50653.

- Existing `CommitLogSuite` (V1, V2, and cross-version) passes; the cross-version test now asserts successful V1 deserialization.
- `StreamingSinkEvolutionSuite` (from SPARK-56719) still passes.

Generated-by: Claude Code (claude-opus-4-7)
Co-authored-by: Isaac
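For readers who have not opened the diff, the version-dispatching factory described in this commit message might look roughly like the sketch below. It is not the code in this PR: `CommitMetadataFactorySketch`, `SketchConf`, and the `Map[Long, Seq[String]]` shape of `stateUniqueIds` are assumptions made for illustration, and `SketchConf` only stands in for reading `STATE_STORE_CHECKPOINT_FORMAT_VERSION` from the real configuration.

```scala
object CommitMetadataFactorySketch {
  // Common interface: every version carries its own wire version and the watermark.
  sealed trait CommitMetadataBase {
    def version: Int
    def nextBatchWatermarkMs: Long
  }

  // V1: watermark only.
  final case class CommitMetadata(nextBatchWatermarkMs: Long = 0L)
      extends CommitMetadataBase {
    def version: Int = 1
  }

  // V2: watermark plus state store unique ids.
  // The Map[Long, Seq[String]] shape is a simplification assumed for this sketch.
  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long = 0L,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty)
      extends CommitMetadataBase {
    def version: Int = 2
  }

  // Hypothetical stand-in for the configured STATE_STORE_CHECKPOINT_FORMAT_VERSION.
  final case class SketchConf(stateStoreCheckpointFormatVersion: Int)

  // Dispatch on an explicit version if given, otherwise on the configured one.
  def createMetadata(
      conf: SketchConf,
      nextBatchWatermarkMs: Long,
      stateUniqueIds: Map[Long, Seq[String]] = Map.empty,
      version: Option[Int] = None): CommitMetadataBase = {
    version.getOrElse(conf.stateStoreCheckpointFormatVersion) match {
      case 1 => CommitMetadata(nextBatchWatermarkMs)
      case 2 => CommitMetadataV2(nextBatchWatermarkMs, stateUniqueIds)
      case other =>
        throw new IllegalArgumentException(s"Unsupported commit metadata version: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = SketchConf(stateStoreCheckpointFormatVersion = 2)
    // Falls back to the configured version (2 here).
    println(createMetadata(conf, 1000L, Map(0L -> Seq("id-a"))))
    // An explicit version overrides the configured default.
    println(createMetadata(conf, 1000L, version = Some(1)))
  }
}
```

Keeping the version on the metadata value itself is what would let a later V3 case class be added as one more branch of the `match`, without callers consulting the configuration at serialization time.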
5fddbac to
ce8d843
What changes were proposed in this pull request?
Backport of [SPARK-56970] (#56018) to `branch-4.x`.

Refactor `CommitLog` so that the commit log metadata is dispatched through a `CommitMetadataBase` trait with concrete `CommitMetadata` (V1, watermark only) and `CommitMetadataV2` (watermark + `stateUniqueIds`) case classes. The deserializer now reads the wire-format version from the file header and constructs the matching subclass.

This is preparation for `CommitMetadataV3` (which adds sink metadata for streaming sink evolution) in a follow-up.

Notable changes:
- `CommitMetadataBase` trait and `CommitMetadataV2` case class.
- `CommitMetadata` becomes V1 (no `stateUniqueIds` field).
- `CommitLog.createMetadata` factory that dispatches by version and defaults to the configured `STATE_STORE_CHECKPOINT_FORMAT_VERSION`.
- `CommitLog.readCommitMetadata` reads the version line and constructs the matching subclass.
- `MicroBatchExecution`, `OfflineStateRepartitionRunner`, and the existing tests are updated to use the new types / factory.

Why are the changes needed?
The pre-refactor `CommitMetadata` carried both the V1 and V2 wire shape in a single case class, with `stateUniqueIds` optional. That made it awkward to add a V3 wire format with additional fields, and forced `serialize` to take the wire version from `SQLConf` rather than from the metadata itself.

Does this PR introduce any user-facing change?
No new public API. The wire format for V1 changes slightly: V1 commit log files no longer serialize `stateUniqueIds: null`. Old V1 files continue to be read because the V1 deserializer ignores the (now-unknown) field.

This PR also relaxes the version-exact-match check on read so that a commit log opened with the V2 conf can deserialize a V1 file. This incidentally resolves SPARK-50653.
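To make the read-side behavior concrete, here is a minimal sketch of header-based dispatch. It is an illustration under stated assumptions, not the code in this PR: the object name `CommitLogWireSketch` is hypothetical, the payload is a simplified line-oriented `key=value` format standing in for the real JSON body, and `stateUniqueIds` is modeled as `Map[Long, List[String]]`. The only parts taken from the description are the version header line, the V1/V2 split, and the rule that the V1 reader ignores an unknown `stateUniqueIds` field.

```scala
object CommitLogWireSketch {
  sealed trait CommitMetadataBase {
    def version: Int
    def nextBatchWatermarkMs: Long
  }

  final case class CommitMetadata(nextBatchWatermarkMs: Long = 0L)
      extends CommitMetadataBase {
    def version: Int = 1
  }

  final case class CommitMetadataV2(
      nextBatchWatermarkMs: Long = 0L,
      stateUniqueIds: Map[Long, List[String]] = Map.empty)
      extends CommitMetadataBase {
    def version: Int = 2
  }

  // The header is derived from the metadata itself, not from configuration.
  // V1 writes no stateUniqueIds field at all.
  def serialize(metadata: CommitMetadataBase): String = {
    val fields = metadata match {
      case CommitMetadata(wm) =>
        List(s"nextBatchWatermarkMs=$wm")
      case CommitMetadataV2(wm, ids) =>
        val encoded = ids.toList.sortBy(_._1)
          .map { case (op, xs) => s"$op:${xs.mkString(",")}" }
          .mkString("|")
        List(s"nextBatchWatermarkMs=$wm", s"stateUniqueIds=$encoded")
    }
    (s"v${metadata.version}" :: fields).mkString("\n")
  }

  // Decode the simplified "op:id,id|op:id" encoding used by this sketch.
  private def parseIds(encoded: String): Map[Long, List[String]] =
    encoded.split('|').toList.filter(_.nonEmpty).map { entry =>
      val Array(op, ids) = entry.split(":", 2)
      op.toLong -> ids.split(',').toList.filter(_.nonEmpty)
    }.toMap

  // Read the version line first, then build the matching subclass.
  // There is no comparison against a configured version here.
  def deserialize(text: String): CommitMetadataBase = {
    val lines = text.split('\n').toList
    val fields = lines.drop(1).filter(_.contains("=")).map { line =>
      val Array(key, value) = line.split("=", 2)
      key -> value
    }.toMap
    val watermark = fields.get("nextBatchWatermarkMs").map(_.toLong).getOrElse(0L)
    lines.headOption match {
      case Some("v1") =>
        // Any stateUniqueIds entry written by an older V1 writer is ignored.
        CommitMetadata(watermark)
      case Some("v2") =>
        val ids = fields.get("stateUniqueIds").map(parseIds).getOrElse(Map.empty[Long, List[String]])
        CommitMetadataV2(watermark, ids)
      case other =>
        throw new IllegalStateException(s"Unsupported commit log version: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    // An "old" V1 file that still carries a null stateUniqueIds field.
    val oldV1File = "v1\nnextBatchWatermarkMs=42\nstateUniqueIds=null"
    println(deserialize(oldV1File))
    println(serialize(CommitMetadataV2(7L, Map(0L -> List("a", "b")))))
  }
}
```

Because `deserialize` branches only on the header line, a reader in this sketch would accept a V1 file regardless of which version is configured, which mirrors the relaxed check described above.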
How was this patch tested?
- `CommitLogSuite` (V1, V2, and cross-version); the cross-version test now asserts successful V1 deserialization.
- `sql/core` main and test sources compile cleanly on `branch-4.x` (`build/sbt sql/Test/compile`).

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)