Skip to content

Fix massive spill files for StringView/BinaryView columns#19444

Open
EeshanBembi wants to merge 7 commits intoapache:mainfrom
EeshanBembi:fix-stringview-spill-gc
Open

Fix massive spill files for StringView/BinaryView columns#19444
EeshanBembi wants to merge 7 commits intoapache:mainfrom
EeshanBembi:fix-stringview-spill-gc

Conversation

@EeshanBembi
Copy link
Contributor

Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:

  • Add gc_view_arrays() function to apply GC on view arrays
  • Integrate GC into InProgressSpillFile::append_batch()
  • Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 21, 2025
@EeshanBembi EeshanBembi force-pushed the fix-stringview-spill-gc branch 2 times, most recently from cc6c180 to 7dfb1e2 Compare December 22, 2025 10:00
"https://produkty%2Fpulove.ru/album/login",
];

let mut urls = Vec::with_capacity(200_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be quite heavy - maybe we can just keep the minimal reproducible version to verify that the changes are working as expected [ like the test above this ]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be great to make tests faster and use less memory.

if any_gc_performed {
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
Ok(batch.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just return the batch without clone ?

let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns());
let mut any_gc_performed = false;

for array in batch.columns() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lets exit early and return the batch as it is if there are no view arrays in the RecordBatch ?

}

fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool {
if len < 10 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.

return false;
}

let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use some of the existing size calculation methods like get_buffer_memory_size instead of duplicating calculations?

#[cfg(test)]
const VIEW_SIZE_BYTES: usize = 16;
#[cfg(test)]
const INLINE_THRESHOLD: usize = 12;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a public constant for this MAX_INLINE_VIEW_LEN

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's a good idea to include compaction inside SpillManager

One follow-on to do is refactoring the external sort, now it's doing some compaction already outside the SpillManager, so with this PR it would be doing redundant compactions, I believe we should just simply remove that

fn organize_stringview_arrays(

}

if any_gc_performed {
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can get rid of the any_gc_performed condition, and always go this branch to make it a little bit simpler

"https://produkty%2Fpulove.ru/album/login",
];

let mut urls = Vec::with_capacity(200_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be great to make tests faster and use less memory.

}

/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
/// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to add more comments to explain the rationale for views gc, perhaps just copy and paste from

fn organize_stringview_arrays(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EeshanBembi could you enhance the comment here?

@EeshanBembi EeshanBembi marked this pull request as ready for review December 31, 2025 10:38
@EeshanBembi EeshanBembi requested a review from 2010YOUY01 January 2, 2026 13:19
@bharath-techie
Copy link
Contributor

Hi @EeshanBembi ,
I see you've addressed bunch of comments , can you kindly address remaining comments and take this PR forward ? Let us know if any help is required.

Add garbage collection for StringView and BinaryView arrays before spilling
to disk. This prevents sliced arrays from carrying their entire original
buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files
instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally
via gc_view_arrays(), making the organize_stringview_arrays() function in
external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()
@EeshanBembi EeshanBembi force-pushed the fix-stringview-spill-gc branch from 1e72adf to c1a03a8 Compare February 3, 2026 10:36
tokio = { workspace = true }

[dev-dependencies]
arrow-data = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this unintentional change / git diff ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, arrow-data is added as a dev-dependency because the test helper calculate_string_view_waste_ratio uses arrow_data::MAX_INLINE_VIEW_LEN to determine whether a string value would be inlined in a StringView array.

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix might also resolve the #19846

@adriangb
Copy link
Contributor

Hi @EeshanBembi just pinging here, I'd love to see this across the line, just missing one final push!

@EeshanBembi
Copy link
Contributor Author

Hey @adriangb , could you please have a look? Thanks

marc-pydantic pushed a commit to pydantic/datafusion that referenced this pull request Feb 16, 2026
}

/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
/// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EeshanBembi could you enhance the comment here?

/// - `get_record_batch_memory_size` operates on entire arrays, not just the data buffers
/// - We need to check buffer capacity specifically to determine GC potential
/// - The data buffers are what gets compacted during GC, so their size is the key metric
fn should_gc_view_array(data_buffers: &[arrow::buffer::Buffer]) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check if there is anything to gc (i.e. if the array is sliced)? Or is gc() a no-op in that case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] BinaryView/StringView columns are spilled without GC and results in enormous spill files

6 participants