Fix massive spill files for StringView/BinaryView columns by EeshanBembi · Pull Request #19444 · apache/datafusion

EeshanBembi · 2025-12-21T19:53:33Z

Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:

- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use a simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
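
To illustrate why a sliced view array can inflate a spill file, here is a minimal std-only sketch. `ToyViewArray` and all of its methods are hypothetical names invented for this illustration; it is not Arrow's actual layout or API, only a model in which each view is an (offset, length) pair into a shared buffer.

```rust
use std::sync::Arc;

/// Toy stand-in for a view array (hypothetical, NOT Arrow's layout):
/// each view is an (offset, len) pair into one shared data buffer.
struct ToyViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ToyViewArray {
    /// Build an array by appending every value to a single buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Zero-copy slice: keeps a reference to the whole original buffer.
    fn slice(&self, start: usize, len: usize) -> Self {
        Self {
            buffer: Arc::clone(&self.buffer),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// Compaction: copy only the bytes the remaining views point at.
    fn gc(&self) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(offset, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[offset..offset + len]);
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Bytes a naive writer would put on disk for the data buffer.
    fn spilled_buffer_bytes(&self) -> usize {
        self.buffer.len()
    }

    /// Read back the i-th value through its view.
    fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.buffer[offset..offset + len]).unwrap()
    }
}
```

In this model, slicing 5 rows out of 1000 leaves `spilled_buffer_bytes()` unchanged, while calling `gc()` on the slice shrinks the buffer to just the bytes those 5 rows reference.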

bharath-techie · 2025-12-22T13:03:09Z

datafusion/physical-plan/src/spill/mod.rs

+            "https://produkty%2Fpulove.ru/album/login",
+        ];
+
+        let mut urls = Vec::with_capacity(200_000);


This might be quite heavy. Maybe we can keep just the minimal reproducible version to verify that the changes work as expected (like the test above this one)?

bharath-techie · 2025-12-22T13:06:44Z

datafusion/physical-plan/src/spill/mod.rs

+    if any_gc_performed {
+        Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
+    } else {
+        Ok(batch.clone())


Can we just return the batch without cloning it?

bharath-techie · 2025-12-22T13:08:51Z

datafusion/physical-plan/src/spill/mod.rs

+    let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns());
+    let mut any_gc_performed = false;
+
+    for array in batch.columns() {


Maybe let's exit early and return the batch as is if there are no view arrays in the RecordBatch?

pepijnve · 2025-12-22T13:21:37Z

datafusion/physical-plan/src/spill/mod.rs

+}
+
+fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool {
+    if len < 10 {


Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.
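
As a rough sketch of the alternative this comment points toward, the decision can be made from buffer capacity alone. The constant name, the function signature, and the `>=` comparison below are assumptions for illustration; only the 10KB figure comes from the PR description.

```rust
/// Hypothetical threshold mirroring the 10KB figure from the PR description.
const MIN_BUFFER_BYTES_FOR_GC: usize = 10 * 1024;

/// Decide from data buffer capacities alone; row count is deliberately
/// ignored, because a slice with very few rows can still pin a large buffer.
fn should_gc(data_buffer_capacities: &[usize]) -> bool {
    let total: usize = data_buffer_capacities.iter().sum();
    total >= MIN_BUFFER_BYTES_FOR_GC
}
```

With this shape, a three-row slice that still references an 8 MiB buffer qualifies for GC, whereas a row-count cutoff would have skipped it.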

pepijnve · 2025-12-22T13:26:23Z

datafusion/physical-plan/src/spill/mod.rs

+        return false;
+    }
+
+    let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum();


Can we use some of the existing size calculation methods like get_buffer_memory_size instead of duplicating calculations?

pepijnve · 2025-12-22T13:43:44Z

datafusion/physical-plan/src/spill/mod.rs

+#[cfg(test)]
+const VIEW_SIZE_BYTES: usize = 16;
+#[cfg(test)]
+const INLINE_THRESHOLD: usize = 12;


There's a public constant for this: MAX_INLINE_VIEW_LEN.

2010YOUY01

Thanks, it's a good idea to include compaction inside SpillManager.

One follow-up is to refactor the external sort. It already does some compaction outside the SpillManager, so with this PR it would compact redundantly. I believe we should simply remove that:

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Line 483 in bb9a4a7

fn organize_stringview_arrays(

2010YOUY01 · 2025-12-23T03:05:24Z

datafusion/physical-plan/src/spill/mod.rs

+    }
+
+    if any_gc_performed {
+        Ok(RecordBatch::try_new(batch.schema(), new_columns)?)


nit: I think we can get rid of the any_gc_performed condition and always take this branch, to make it a little simpler.

2010YOUY01 · 2025-12-23T03:07:17Z

datafusion/physical-plan/src/spill/mod.rs

+            "https://produkty%2Fpulove.ru/album/login",
+        ];
+
+        let mut urls = Vec::with_capacity(200_000);


Yes, it would be great to make tests faster and use less memory.

2010YOUY01 · 2025-12-23T03:09:05Z

datafusion/physical-plan/src/spill/in_progress_spill_file.rs

    }

    /// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
+    /// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.


I recommend adding more comments to explain the rationale for the view GC, perhaps by copying from

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Line 483 in bb9a4a7

fn organize_stringview_arrays(

bharath-techie · 2026-01-12T07:47:27Z

Hi @EeshanBembi,
I see you've addressed a bunch of comments. Could you kindly address the remaining ones and take this PR forward? Let us know if any help is required.

The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:

- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()

bharath-techie · 2026-02-05T06:41:52Z

datafusion/physical-plan/Cargo.toml

 tokio = { workspace = true }

 [dev-dependencies]
+arrow-data = { workspace = true }


Is this an unintentional change in the diff?

This is intentional: arrow-data is added as a dev-dependency because the test helper calculate_string_view_waste_ratio uses arrow_data::MAX_INLINE_VIEW_LEN to determine whether a string value would be inlined in a StringView array.
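
For readers unfamiliar with the inline limit, the following is a hedged sketch of how such a check might look. It is not the PR's calculate_string_view_waste_ratio helper, whose body is not shown in this thread. The constant is redefined locally only to keep the example free of external crates, and `is_inlined`, `required_buffer_bytes`, and `waste_ratio` are hypothetical names.

```rust
/// Local mirror of the 12-byte inline limit (assumption for a crate-free
/// example); real code would use `arrow_data::MAX_INLINE_VIEW_LEN` instead.
const MAX_INLINE_VIEW_LEN: usize = 12;

/// A value this short is stored inside the view itself, not in a data buffer.
fn is_inlined(value: &str) -> bool {
    value.len() <= MAX_INLINE_VIEW_LEN
}

/// Bytes that must live in a data buffer for these values.
fn required_buffer_bytes(values: &[&str]) -> usize {
    values
        .iter()
        .copied()
        .filter(|v| !is_inlined(v))
        .map(str::len)
        .sum()
}

/// Hypothetical waste ratio: fraction of buffer bytes no value needs.
fn waste_ratio(values: &[&str], buffer_bytes: usize) -> f64 {
    if buffer_bytes == 0 {
        return 0.0;
    }
    let used = required_buffer_bytes(values).min(buffer_bytes);
    1.0 - used as f64 / buffer_bytes as f64
}
```

Inlined values contribute nothing to the required buffer size, so an array holding only short strings that still references a large buffer has a waste ratio of 1.0 in this model.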

xudong963

This fix might also resolve #19846.

adriangb · 2026-02-13T23:13:45Z

Hi @EeshanBembi just pinging here, I'd love to see this across the line, just missing one final push!

EeshanBembi · 2026-02-16T11:37:26Z

Hey @adriangb, could you please have a look? Thanks.

adriangb · 2026-02-16T16:19:36Z

datafusion/physical-plan/src/spill/in_progress_spill_file.rs

    }

    /// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
+    /// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.


@EeshanBembi could you enhance the comment here?

adriangb · 2026-02-16T16:23:04Z

datafusion/physical-plan/src/spill/mod.rs

+/// - `get_record_batch_memory_size` operates on entire arrays, not just the data buffers
+/// - We need to check buffer capacity specifically to determine GC potential
+/// - The data buffers are what gets compacted during GC, so their size is the key metric
+fn should_gc_view_array(data_buffers: &[arrow::buffer::Buffer]) -> bool {


Can we check if there is anything to gc (i.e. if the array is sliced)? Or is gc() a no-op in that case?
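
One way to picture the check being asked about is to compare the bytes the views reference with the size of the buffer and skip the copy when nothing would be reclaimed. The sketch below is a toy model with hypothetical names and makes no claim about what Arrow's gc() does for unsliced arrays. It assumes views are non-overlapping (offset, length) ranges into a single buffer.

```rust
use std::borrow::Cow;

/// Toy model (NOT Arrow's representation): `views` are (offset, len) ranges
/// into `buffer`, assumed non-overlapping. Returns the buffer unchanged when
/// every byte is still referenced, otherwise a compacted copy.
fn compact_if_sliced<'a>(buffer: &'a [u8], views: &[(usize, usize)]) -> Cow<'a, [u8]> {
    let referenced: usize = views.iter().map(|&(_, len)| len).sum();
    if referenced >= buffer.len() {
        // Every byte is still in use: nothing to reclaim, so skip the copy.
        return Cow::Borrowed(buffer);
    }
    // Some bytes are unreferenced: copy only what the views point at.
    let mut compacted = Vec::with_capacity(referenced);
    for &(offset, len) in views {
        compacted.extend_from_slice(&buffer[offset..offset + len]);
    }
    Cow::Owned(compacted)
}
```

Returning `Cow` makes the no-op case explicit: an unsliced array borrows its original buffer, and only a sliced one pays for the copy.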

github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 21, 2025

EeshanBembi force-pushed the fix-stringview-spill-gc branch 2 times, most recently from cc6c180 to 7dfb1e2 Compare December 22, 2025 10:00

EeshanBembi mentioned this pull request Dec 22, 2025

[Bug] BinaryView/StringView columns are spilled without GC and results in enormous spill files #19414

Open

bharath-techie reviewed Dec 22, 2025

pepijnve reviewed Dec 22, 2025

2010YOUY01 reviewed Dec 23, 2025

EeshanBembi requested review from bharath-techie and pepijnve December 31, 2025 10:38

EeshanBembi marked this pull request as ready for review December 31, 2025 10:38

EeshanBembi requested a review from 2010YOUY01 January 2, 2026 13:19

EeshanBembi added 4 commits February 3, 2026 16:05

Address PR review feedback for StringView/BinaryView GC

e1a18e5

- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes

Apply cargo fmt

a5e0a12

EeshanBembi force-pushed the fix-stringview-spill-gc branch from 1e72adf to c1a03a8 Compare February 3, 2026 10:36

bharath-techie reviewed Feb 5, 2026

xudong963 reviewed Feb 6, 2026

EeshanBembi added 3 commits February 14, 2026 14:23

Merge remote-tracking branch 'upstream/main' into fix-stringview-spil…

eba362f

…l-gc

fix: remove unused import Array to fix clippy

7f08877

merge upstream/main and resolve Cargo.toml conflict

1ae3bb1

EeshanBembi requested a review from xudong963 February 16, 2026 11:36

marc-pydantic pushed a commit to pydantic/datafusion that referenced this pull request Feb 16, 2026

Add fix for GC spilling (apache#19444)

28910ed

adriangb reviewed Feb 16, 2026
