feat(parquet): add content defined chunking for arrow writer #9450
kszucs wants to merge 15 commits into apache:main from
Conversation
```diff
 let mut path: Vec<String> = vec![];
 path.extend(path_so_far.iter().copied().map(String::from));
-leaves.push(Arc::new(ColumnDescriptor::new(
+let mut desc = ColumnDescriptor::new(
```
I didn't want to break the API of ColumnDescriptor, so setting repeated_ancestor_def_level below.
You could add a `new_with_repeated_ancestor`, and perhaps change `new` to call that with the default value for `repeated_ancestor_def_level`.
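The suggested pattern might look like this sketch, using a toy stand-in for `ColumnDescriptor` (the real type has more fields; everything except `repeated_ancestor_def_level` is illustrative):

```rust
// Toy stand-in for parquet's ColumnDescriptor; only the constructor pattern
// matters here, not the real field set.
pub struct ColumnDescriptor {
    max_def_level: i16,
    repeated_ancestor_def_level: i16,
}

impl ColumnDescriptor {
    /// Existing public constructor keeps its signature and delegates,
    /// defaulting the new field so existing callers are unaffected.
    pub fn new(max_def_level: i16) -> Self {
        Self::new_with_repeated_ancestor(max_def_level, 0)
    }

    /// New, wider constructor that also accepts the extra level.
    pub fn new_with_repeated_ancestor(
        max_def_level: i16,
        repeated_ancestor_def_level: i16,
    ) -> Self {
        Self {
            max_def_level,
            repeated_ancestor_def_level,
        }
    }
}

fn main() {
    let d = ColumnDescriptor::new(2);
    println!("{} {}", d.max_def_level, d.repeated_ancestor_def_level);
}
```

This keeps the public API source-compatible while avoiding the mutate-after-construction step.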
We don't necessarily need to store the codegen script in the repository. Alternatively we could just reference https://github.com/apache/arrow/blob/main/cpp/src/parquet/chunker_internal_generated.h as a source for cdc_generated.rs. Likely it won't be regenerated at all.
Hi @kszucs. 👋 Apologies, I have been unusually bandwidth constrained lately. I will try to give this a good look in the next few days. Thank you for your patience 🙏 (and for adding this to arrow-rs).

Hi @etseidl! No worries, I really appreciate you taking the time to review!
etseidl
left a comment
Flushing a few early observations/questions. Still need to do the deep dive.
```rust
let cdc_chunkers = match props_ptr.content_defined_chunking() {
    Some(opts) => {
        let chunkers = file_writer
            .schema_descr()
            .columns()
            .iter()
            .map(|desc| ContentDefinedChunker::new(desc, opts))
            .collect::<Result<Vec<_>>>()?;
        Some(chunkers)
    }
    None => None,
};
```
Suggested change:

```rust
let cdc_chunkers = props_ptr.content_defined_chunking().map(|opts| {
    file_writer
        .schema_descr()
        .columns()
        .iter()
        .map(|desc| ContentDefinedChunker::new(desc, opts))
        .collect::<Result<Vec<_>>>()
}).transpose()?;
```
Can simplify this a bit.
```rust
/// A chunk of data with level and value offsets for record-shredded nested data.
#[derive(Debug, Clone, Copy)]
pub(crate) struct Chunk {
```
"chunk" is an overloaded term (I keep thinking column chunk). What do you think of changing this to CdcChunk?
```rust
/// Create a sliced view of this `ArrayLevels` for a CDC chunk.
pub(crate) fn slice_for_chunk(&self, chunk: &Chunk) -> Self {
```
I have trouble with calling this a "view" when it's actually allocating new vectors for the levels and non-null indices. I'm thinking out loud, but I wonder if we could create an actual ArrayLevelsView that uses proper slices of the underlying Vecs and pass that to write_internal.
Nevermind...I tried implementing this but it would require a ton of changes to the level handling. I guess just update the comment to say that copies of data will be made.
```rust
#[test]
fn test_slice_for_chunk_flat() {
    // Required field (no levels): array [1..=6], slice values 2..5
```
Suggested change:

```rust
// Required field (no levels): array [1..=6], slice values 3..=5
```
? Trying to understand value_offset == 2
```rust
self.row_group_writer_factory
    .create_row_group_writer(self.writer.flushed_row_groups().len())?,
),
x => {
```
Is this change necessary? Or is it left over from earlier debugging?
etseidl
left a comment
Just a few more random comments...I'll get into the meat of the chunking tomorrow. Looking good so far!
```rust
    None => return Ok(()),
};

let chunks = in_progress.close()?;
```
Also curious about this change and below. Does the close have to happen before calling next_row_group?
```rust
/// Note that the parquet writer has a related `data_page_size_limit` property that
/// controls the maximum size of a parquet data page after encoding. While setting
/// `data_page_size_limit` to a smaller value than `max_chunk_size` doesn't affect
/// the chunking effectiveness, it results in more small parquet data pages.
```
Suggested change:

```rust
/// Note that the parquet writer has a related [`data_page_size_limit`] property that
/// controls the maximum size of a parquet data page after encoding. While setting
/// `data_page_size_limit` to a smaller value than `max_chunk_size` doesn't affect
/// the chunking effectiveness, it results in more small parquet data pages.
///
/// [`data_page_size_limit`]: WriterPropertiesBuilder::set_data_page_size_limit
```
```rust
min_chunk_size: 256 * 1024,
max_chunk_size: 1024 * 1024,
```
Can you add constants for these above? 🙏
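The requested constants could be as simple as the following sketch (the names are hypothetical, not taken from the PR; only the values come from the defaults above):

```rust
// Hypothetical names for the default CDC chunk-size bounds; the PR may
// choose different names or visibility.
pub const DEFAULT_CDC_MIN_CHUNK_SIZE: usize = 256 * 1024; // 256 KiB
pub const DEFAULT_CDC_MAX_CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB

fn main() {
    println!(
        "min = {}, max = {}",
        DEFAULT_CDC_MIN_CHUNK_SIZE, DEFAULT_CDC_MAX_CHUNK_SIZE
    );
}
```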
```rust
/// EXPERIMENTAL: Returns content-defined chunking options, or `None` if CDC is disabled.
///
/// For more details see [`WriterPropertiesBuilder::set_content_defined_chunking`]
pub fn content_defined_chunking(&self) -> Option<&CdcOptions> {
```
Does this only make sense as a global option, or would making it a per-column property be reasonable?
Which issue does this PR close?
Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row causes all subsequent pages to shift, resulting in nearly every byte being re-uploaded to content-addressable storage (CAS) systems. CDC instead determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster upload times.
See more details in https://huggingface.co/blog/parquet-cdc
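The boundary-detection idea can be sketched in a few lines of Rust. This is a hypothetical, simplified gear-hash chunker: the real implementation hashes column values rather than raw bytes, enforces min/max chunk sizes, and ships a pre-generated gear table (`cdc_generated.rs`), whereas the table here is a toy stand-in.

```rust
// Toy gear table built from an xorshift PRNG; illustrative only.
fn gear_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
    for e in t.iter_mut() {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        *e = x;
    }
    t
}

/// Return cut points: a boundary is declared whenever the low bits of the
/// rolling gear hash are all zero. Because the left-shift pushes old bytes
/// out of the 64-bit hash, each hash depends only on the last 64 bytes, so
/// an insertion early in the stream leaves boundaries far past it intact.
fn chunk_boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut hash: u64 = 0;
    let mut cuts = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[b as usize]);
        if hash & mask == 0 {
            cuts.push(i + 1);
        }
    }
    cuts
}

fn main() {
    // Deterministic pseudo-random input; mask 63 gives ~64-byte average chunks.
    let data: Vec<u8> = (0..4096u32)
        .map(|i| (i.wrapping_mul(2654435761) >> 24) as u8)
        .collect();
    let cuts = chunk_boundaries(&data, 63);
    println!("{} boundaries", cuts.len());
}
```

Prepending a byte shifts every boundary that is at least 64 bytes into the stream by exactly one position, which is the self-synchronizing property CDC relies on for deduplication.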
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):
What changes are included in this PR?
- `parquet/src/column/chunker/` module
- changes to `ArrowColumnWriter`
- `CdcOptions` struct (`min_chunk_size`, `max_chunk_size`, `norm_level`)
- `repeated_ancestor_def_level` field added to `ColumnDescriptor` for nested field values iteration

Are these changes tested?
Yes, unit tests are located in `cdc.rs` and ported from the C++ implementation.

Are there any user-facing changes?
New experimental API, disabled by default; no behavior change for existing code.