Fix #933 — bound compaction memory so wide-row tables don't OOM (#942)
Merged: erikdarlingdata merged 1 commit into dev on May 7, 2026
Conversation
#935 added `temp_directory` so DuckDB could spill, but on wider workloads the working set still blew past the 4 GB cap before spill caught up (reporter saw OOM at 3.7 GiB compacting 15 `query_snapshots` files). Three knobs combined to feed that:

- `memory_limit = 4 GB` was too high — DuckDB held off spilling until late
- `threads` defaulted to N cores, multiplying per-thread row-group buffers
- `ROW_GROUP_SIZE 122880` buffered up to 122k wide-VARCHAR rows per group

Drop `memory_limit` to 1 GB, cap `threads` to 2, and shrink `ROW_GROUP_SIZE` to 8192. On 1.7 M rows of real `query_stats` data this drops peak working set from 1236 MB → 166 MB (87% reduction) at a 31% wall-time cost. Memory now plateaus instead of growing with row count, which is the load-bearing change for issue #933.

Adds `tools/CompactionRepro` — a standalone reproducer that splits a real monthly parquet file into N per-cycle-shaped chunks and runs the same pair-merge logic with the tuning knobs exposed on the command line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
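As a rough sketch, the three knobs map onto DuckDB statements like the following (file names and paths here are illustrative, not the collector's actual code):

```sql
-- Lower the buffer cap so DuckDB spills to temp_directory early
-- instead of riding up toward the process limit (was 4GB).
SET memory_limit = '1GB';

-- Fewer worker threads means fewer concurrent row-group buffers (was N cores).
SET threads = 2;

-- ROW_GROUP_SIZE is a Parquet COPY option; 122880 is DuckDB's default.
-- 8192 keeps at most ~8k wide-VARCHAR rows buffered per row group.
COPY (SELECT * FROM read_parquet('query_snapshots_*.parquet'))
TO 'query_snapshots_compacted.parquet'
(FORMAT PARQUET, ROW_GROUP_SIZE 8192);
```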
Summary
#935 added `temp_directory` so DuckDB could spill during compaction, but on wider workloads the working set still blew past the 4 GB cap before spill caught up (reporter saw OOM at 3.7 GiB compacting 15 `query_snapshots` files). `memory_limit = 4GB` was too high (DuckDB held off spilling), `threads` defaulted to N cores (per-thread row-group buffers multiplied), and `ROW_GROUP_SIZE 122880` buffered up to 122k wide-VARCHAR rows per group. Drop `memory_limit` to `1GB`, cap `threads = 2`, and shrink `ROW_GROUP_SIZE` to `8192`. Memory now plateaus instead of growing with row count.

Fixes #933
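For anyone verifying the effective limits in a DuckDB session, the built-in `duckdb_settings()` table function reports them; a quick check (not part of this PR) might look like:

```sql
-- Show the current values of the three relevant settings.
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'threads', 'temp_directory');
```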
Repro tool
`tools/CompactionRepro` — standalone .NET console app that splits a real monthly parquet file into N per-cycle-shaped chunks and runs the same pair-merge logic with the tuning knobs exposed on the command line. Useful for validating future changes to compaction; one merge step is sketched below.
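A single pair-merge step amounts to reading two chunk files and rewriting them as one. Roughly, and with hypothetical file names (the actual app wires these up from its command-line knobs):

```sql
SET memory_limit = '1GB';
SET threads = 2;

-- Merge one pair of per-cycle chunks into a single output file.
COPY (
    SELECT *
    FROM read_parquet(['chunk_00.parquet', 'chunk_01.parquet'])
)
TO 'merged_00.parquet'
(FORMAT PARQUET, ROW_GROUP_SIZE 8192);
```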
Validation

On a real local archive (`202604_query_stats.parquet`, 1.7M rows, ~70 MB):

- 87% peak memory reduction (1236 MB → 166 MB peak working set)
- 31% slower wall time
- Output 14% larger (smaller row groups → smaller compression dictionaries — acceptable trade for not crashing)
Test plan
🤖 Generated with Claude Code