
feat: replace semaphore and queued write buffers with pooled approach#446

Open
hamersaw wants to merge 2 commits into lance-format:main from hamersaw:feature/pool-arrow-batch-write-buffer

Conversation

@hamersaw
Collaborator

Summary

  • Replace SemaphoreArrowBatchWriteBuffer and QueuedArrowBatchWriteBuffer with a single PooledArrowBatchWriteBuffer that pre-allocates a fixed pool of VectorSchemaRoots and cycles through them via reset()
  • Eliminates per-batch vector reallocation (semaphore) and per-batch child allocator overhead (queued) while preserving buffer capacity across batches
  • Byte-based flush (maxBatchBytes) uses row-level tracking: fixed-width costs precomputed from schema, variable-width growth tracked via getDataBuffer().readableBytes() deltas
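The pool-and-reset cycle can be sketched with a plain blocking queue. This is a hypothetical model: the real buffer pools Arrow `VectorSchemaRoot`s and calls their `reset()`, whereas the `Slot` type here is just a stand-in to show the acquire/reset/release flow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of the pooled write-buffer pattern: a fixed set of
// slots is allocated once up front and recycled via reset(), so steady-state
// writes perform no new allocations.
final class PooledWriteBuffer {
    // Stand-in for an Arrow VectorSchemaRoot.
    static final class Slot {
        final StringBuilder data = new StringBuilder();
        void reset() { data.setLength(0); } // clear contents, keep capacity
    }

    private final BlockingQueue<Slot> pool;

    PooledWriteBuffer(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new Slot()); // pre-allocate the whole pool once
        }
    }

    // Blocks when all slots are in flight; with poolSize = 1 this degenerates
    // to a strictly serial producer/consumer handoff.
    Slot acquire() {
        try {
            return pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    void release(Slot slot) {
        slot.reset();   // recycle in place instead of reallocating
        pool.add(slot);
    }
}
```

With `poolSize > 1`, a producer can fill the next slot while the consumer drains the previous one, which is the pipelining the queued writer provided.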

Motivation

Benchmarks show the pooled approach cuts allocations from ~100K (semaphore) / ~10K (queued) to ~40 while matching or beating throughput. With poolSize=1 it behaves identically to the old semaphore writer (serial producer/consumer) but with fewer allocations. With poolSize>1 it provides the same pipelining as the queued writer without per-batch child allocator overhead.

The queue_depth option now controls pool size. The use_queued_write_buffer option is still parsed for backward compatibility but no longer affects buffer selection.

Test plan

  • PooledArrowBatchWriteBufferTest covers: basic write/read, partial batch, empty write, large dataset, pool-size-1, multiple columns, exact batch boundary, single-row batch, error propagation, byte-based flush (small rows, large strings, single large row)
  • Existing SparkConnectorWrite* integration tests pass (both semaphore and queued test classes exercise the same pooled buffer now)
  • LanceDataWriterTest passes
  • Compile verified across base, 3.4, and 3.5 modules

🤖 Generated with Claude Code

Replace SemaphoreArrowBatchWriteBuffer and QueuedArrowBatchWriteBuffer
with a single PooledArrowBatchWriteBuffer that pre-allocates a fixed
pool of VectorSchemaRoots and cycles through them via reset(). This
eliminates per-batch vector reallocation (semaphore) and per-batch
child allocator overhead (queued) while preserving buffer capacity
across batches.

Byte-based flush (maxBatchBytes) uses row-level tracking: fixed-width
costs are precomputed from the schema, variable-width growth is tracked
via getDataBuffer().readableBytes() deltas.
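The per-row accounting described above can be modeled in a few lines. This is a hypothetical sketch; `BatchByteTracker` is an illustrative name, not the PR's actual class, and the variable-width delta stands in for the growth observed on Arrow's variable-width data buffers.

```java
// Hypothetical model of byte-based flush accounting: fixed-width bytes per
// row are precomputed from the schema, while variable-width growth is fed
// in as a per-row delta.
final class BatchByteTracker {
    private final long fixedBytesPerRow; // precomputed once from the schema
    private final long maxBatchBytes;    // flush threshold
    private long variableBytes;          // running variable-width total
    private long rows;

    BatchByteTracker(long fixedBytesPerRow, long maxBatchBytes) {
        this.fixedBytesPerRow = fixedBytesPerRow;
        this.maxBatchBytes = maxBatchBytes;
    }

    // varWidthDelta: how much the variable-width buffers grew for this row.
    // Returns true when the accumulated batch size crosses maxBatchBytes.
    boolean addRowAndShouldFlush(long varWidthDelta) {
        rows++;
        variableBytes += varWidthDelta;
        return rows * fixedBytesPerRow + variableBytes >= maxBatchBytes;
    }
}
```

The point of the split is that fixed-width cost is a constant per row and needs no buffer inspection, so only variable-width columns require measuring actual buffer growth.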

The queue_depth option now controls pool size. The use_queued_write_buffer
option is still parsed for backward compatibility but no longer affects
buffer selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) label on Apr 17, 2026
BaseVariableWidthVector.setSafe() writes via valueBuffer.setBytes() which
does not advance the buffer's writerIndex — that only happens at
setValueCount(). So getDataBuffer().readableBytes() stayed at 0 during
the write loop and byte-based flush never triggered. Use
getBufferSizeFor(rowCount) instead, which reads from the offset buffer
(which setSafe does update).
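The failure mode can be illustrated with a small pure-Java model. This is a hypothetical mock of the Arrow behavior described above, not Arrow code: `setSafe()` writes at absolute offsets without moving `writerIndex`, so `readableBytes()` reads 0 mid-batch, while a size computed from the offset buffer stays accurate.

```java
// Hypothetical model of a variable-width vector's buffers. setSafe() updates
// the offset buffer but not writerIndex; only setValueCount() advances
// writerIndex, which is why readableBytes() is useless during the write loop.
final class VarWidthBufferModel {
    private final byte[] data = new byte[1024];
    private final int[] offsets = new int[64]; // offsets[i + 1] = end of value i
    private int writerIndex;                   // advanced only by setValueCount()

    void setSafe(int index, byte[] value) {
        int start = offsets[index];
        System.arraycopy(value, 0, data, start, value.length);
        offsets[index + 1] = start + value.length; // offset buffer IS updated
        // writerIndex deliberately untouched, mirroring setBytes()
    }

    void setValueCount(int count) {
        writerIndex = offsets[count]; // only now does readableBytes() move
    }

    // What the buggy accounting read: 0 until setValueCount().
    int readableBytes() { return writerIndex; }

    // What the fix reads: derived from the offset buffer, correct mid-batch.
    int getBufferSizeFor(int rowCount) { return offsets[rowCount]; }
}
```

In this model, as in the fix, `getBufferSizeFor(rowCount)` reports real growth after every `setSafe()` call, so byte-based flush can trigger before the batch is sealed.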

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
