feat: replace semaphore and queued write buffers with pooled approach#446
Open
hamersaw wants to merge 2 commits intolance-format:mainfrom
Open
feat: replace semaphore and queued write buffers with pooled approach#446hamersaw wants to merge 2 commits intolance-format:mainfrom
hamersaw wants to merge 2 commits intolance-format:mainfrom
Conversation
Replace SemaphoreArrowBatchWriteBuffer and QueuedArrowBatchWriteBuffer with a single PooledArrowBatchWriteBuffer that pre-allocates a fixed pool of VectorSchemaRoots and cycles through them via reset(). This eliminates per-batch vector reallocation (semaphore) and per-batch child allocator overhead (queued) while preserving buffer capacity across batches. Byte-based flush (maxBatchBytes) uses row-level tracking: fixed-width costs are precomputed from the schema, variable-width growth is tracked via getDataBuffer().readableBytes() deltas. The queue_depth option now controls pool size. The use_queued_write_buffer option is still parsed for backward compatibility but no longer affects buffer selection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BaseVariableWidthVector.setSafe() writes via valueBuffer.setBytes() which does not advance the buffer's writerIndex — that only happens at setValueCount(). So getDataBuffer().readableBytes() stayed at 0 during the write loop and byte-based flush never triggered. Use getBufferSizeFor(rowCount) instead, which reads from the offset buffer (which setSafe does update). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SemaphoreArrowBatchWriteBufferandQueuedArrowBatchWriteBufferwith a singlePooledArrowBatchWriteBufferthat pre-allocates a fixed pool ofVectorSchemaRoots and cycles through them viareset()maxBatchBytes) uses row-level tracking: fixed-width costs precomputed from schema, variable-width growth tracked viagetDataBuffer().readableBytes()deltasMotivation
Benchmarks show the pooled approach cuts allocations from ~100K (semaphore) / ~10K (queued) to ~40 while matching or beating throughput. With
poolSize=1it behaves identically to the old semaphore writer (serial producer/consumer) but with fewer allocations. WithpoolSize>1it provides the same pipelining as the queued writer without per-batch child allocator overhead.The
queue_depthoption now controls pool size. Theuse_queued_write_bufferoption is still parsed for backward compatibility but no longer affects buffer selection.Test plan
PooledArrowBatchWriteBufferTestcovers: basic write/read, partial batch, empty write, large dataset, pool-size-1, multiple columns, exact batch boundary, single-row batch, error propagation, byte-based flush (small rows, large strings, single large row)SparkConnectorWrite*integration tests pass (both semaphore and queued test classes exercise the same pooled buffer now)LanceDataWriterTestpasses🤖 Generated with Claude Code