Fix massive spill files for StringView/BinaryView columns#19444
Fix massive spill files for StringView/BinaryView columns#19444EeshanBembi wants to merge 7 commits intoapache:mainfrom
Conversation
cc6c180 to
7dfb1e2
Compare
| "https://produkty%2Fpulove.ru/album/login", | ||
| ]; | ||
|
|
||
| let mut urls = Vec::with_capacity(200_000); |
There was a problem hiding this comment.
This might be quite heavy - maybe we can just keep the minimal reproducible version to verify that the changes are working as expected [ like the test above this ]
There was a problem hiding this comment.
Yes, it would be great to make tests faster and use less memory.
| if any_gc_performed { | ||
| Ok(RecordBatch::try_new(batch.schema(), new_columns)?) | ||
| } else { | ||
| Ok(batch.clone()) |
There was a problem hiding this comment.
Can we just return the batch without clone ?
| let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns()); | ||
| let mut any_gc_performed = false; | ||
|
|
||
| for array in batch.columns() { |
There was a problem hiding this comment.
Maybe lets exit early and return the batch as it is if there are no view arrays in the RecordBatch ?
| } | ||
|
|
||
| fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool { | ||
| if len < 10 { |
There was a problem hiding this comment.
Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.
| return false; | ||
| } | ||
|
|
||
| let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum(); |
There was a problem hiding this comment.
Can we use some of the existing size calculation methods like get_buffer_memory_size instead of duplicating calculations?
| #[cfg(test)] | ||
| const VIEW_SIZE_BYTES: usize = 16; | ||
| #[cfg(test)] | ||
| const INLINE_THRESHOLD: usize = 12; |
There was a problem hiding this comment.
There's a public constant for this MAX_INLINE_VIEW_LEN
2010YOUY01
left a comment
There was a problem hiding this comment.
Thanks, it's a good idea to include compaction inside SpillManager
One follow-on to do is refactoring the external sort, now it's doing some compaction already outside the SpillManager, so with this PR it would be doing redundant compactions, I believe we should just simply remove that
| } | ||
|
|
||
| if any_gc_performed { | ||
| Ok(RecordBatch::try_new(batch.schema(), new_columns)?) |
There was a problem hiding this comment.
nit: I think we can get rid of the any_gc_performed condition, and always go this branch to make it a little bit simpler
| "https://produkty%2Fpulove.ru/album/login", | ||
| ]; | ||
|
|
||
| let mut urls = Vec::with_capacity(200_000); |
There was a problem hiding this comment.
Yes, it would be great to make tests faster and use less memory.
| } | ||
|
|
||
| /// Appends a `RecordBatch` to the spill file, initializing the writer if necessary. | ||
| /// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size. |
There was a problem hiding this comment.
I recommend to add more comments to explain the rationale for views gc, perhaps just copy and paste from
There was a problem hiding this comment.
@EeshanBembi could you enhance the comment here?
|
Hi @EeshanBembi , |
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files. Changes: - Add gc_view_arrays() function to apply GC on view arrays - Integrate GC into InProgressSpillFile::append_batch() - Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size) Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers. Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold - Improve documentation and add inline comments - Remove redundant test_exact_clickbench_issue_19414 - Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant. Changes: - Remove organize_stringview_arrays() call and function from sort.rs - Use batch.clone() for early return (cheaper than creating new batch) - Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant - Update comment in spill_manager.rs to reference gc_view_arrays()
1e72adf to
c1a03a8
Compare
| tokio = { workspace = true } | ||
|
|
||
| [dev-dependencies] | ||
| arrow-data = { workspace = true } |
There was a problem hiding this comment.
is this unintentional change / git diff ?
There was a problem hiding this comment.
This is intentional, arrow-data is added as a dev-dependency because the test helper calculate_string_view_waste_ratio uses arrow_data::MAX_INLINE_VIEW_LEN to determine whether a string value would be inlined in a StringView array.
|
Hi @EeshanBembi just pinging here, I'd love to see this across the line, just missing one final push! |
|
Hey @adriangb , could you please have a look? Thanks |
| } | ||
|
|
||
| /// Appends a `RecordBatch` to the spill file, initializing the writer if necessary. | ||
| /// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size. |
There was a problem hiding this comment.
@EeshanBembi could you enhance the comment here?
| /// - `get_record_batch_memory_size` operates on entire arrays, not just the data buffers | ||
| /// - We need to check buffer capacity specifically to determine GC potential | ||
| /// - The data buffers are what gets compacted during GC, so their size is the key metric | ||
| fn should_gc_view_array(data_buffers: &[arrow::buffer::Buffer]) -> bool { |
There was a problem hiding this comment.
Can we check if there is anything to gc (i.e. if the array is sliced)? Or is gc() a no-op in that case?
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.
Changes:
Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.
Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.