
feat: replace semaphore and queued write buffers with pooled approach#446

Open
hamersaw wants to merge 2 commits into lance-format:main from hamersaw:feature/pool-arrow-batch-write-buffer

Conversation

@hamersaw
Collaborator

Summary

  • Replace SemaphoreArrowBatchWriteBuffer and QueuedArrowBatchWriteBuffer with a single PooledArrowBatchWriteBuffer that pre-allocates a fixed pool of VectorSchemaRoots and cycles through them via reset()
  • Eliminates per-batch vector reallocation (semaphore) and per-batch child allocator overhead (queued) while preserving buffer capacity across batches
  • Byte-based flush (maxBatchBytes) uses row-level tracking: fixed-width costs precomputed from schema, variable-width growth tracked via getDataBuffer().readableBytes() deltas
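The pool-and-reset cycle can be sketched with a plain blocking queue. This is a hypothetical model: the real buffer pools Arrow `VectorSchemaRoot`s and calls their `reset()`, whereas the `Slot` type here is just a stand-in to show the acquire/reset/release flow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of the pooled write-buffer pattern: a fixed set of
// slots is allocated once up front and recycled via reset(), so steady-state
// writes perform no new allocations.
final class PooledWriteBuffer {
    // Stand-in for an Arrow VectorSchemaRoot.
    static final class Slot {
        final StringBuilder data = new StringBuilder();
        void reset() { data.setLength(0); } // clear contents, keep capacity
    }

    private final BlockingQueue<Slot> pool;

    PooledWriteBuffer(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new Slot()); // pre-allocate the whole pool once
        }
    }

    // Blocks when all slots are in flight; with poolSize = 1 this degenerates
    // to a strictly serial producer/consumer handoff.
    Slot acquire() {
        try {
            return pool.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    void release(Slot slot) {
        slot.reset();   // recycle in place instead of reallocating
        pool.add(slot);
    }
}
```

With `poolSize > 1`, a producer can fill the next slot while the consumer drains the previous one, which is the pipelining the queued writer provided.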

Motivation

Benchmarks show the pooled approach cuts allocations from ~100K (semaphore) / ~10K (queued) to ~40 while matching or beating throughput. With poolSize=1 it behaves identically to the old semaphore writer (serial producer/consumer) but with fewer allocations. With poolSize>1 it provides the same pipelining as the queued writer without per-batch child allocator overhead.

The queue_depth option now controls pool size. The use_queued_write_buffer option is still parsed for backward compatibility but no longer affects buffer selection.

Test plan

  • PooledArrowBatchWriteBufferTest covers: basic write/read, partial batch, empty write, large dataset, pool-size-1, multiple columns, exact batch boundary, single-row batch, error propagation, byte-based flush (small rows, large strings, single large row)
  • Existing SparkConnectorWrite* integration tests pass (both semaphore and queued test classes exercise the same pooled buffer now)
  • LanceDataWriterTest passes
  • Compile verified across base, 3.4, and 3.5 modules

🤖 Generated with Claude Code

Replace SemaphoreArrowBatchWriteBuffer and QueuedArrowBatchWriteBuffer
with a single PooledArrowBatchWriteBuffer that pre-allocates a fixed
pool of VectorSchemaRoots and cycles through them via reset(). This
eliminates per-batch vector reallocation (semaphore) and per-batch
child allocator overhead (queued) while preserving buffer capacity
across batches.

Byte-based flush (maxBatchBytes) uses row-level tracking: fixed-width
costs are precomputed from the schema, variable-width growth is tracked
via getDataBuffer().readableBytes() deltas.
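The per-row accounting described above can be modeled in a few lines. This is a hypothetical sketch; `BatchByteTracker` is an illustrative name, not the PR's actual class, and the variable-width delta stands in for the growth observed on Arrow's variable-width data buffers.

```java
// Hypothetical model of byte-based flush accounting: fixed-width bytes per
// row are precomputed from the schema, while variable-width growth is fed
// in as a per-row delta.
final class BatchByteTracker {
    private final long fixedBytesPerRow; // precomputed once from the schema
    private final long maxBatchBytes;    // flush threshold
    private long variableBytes;          // running variable-width total
    private long rows;

    BatchByteTracker(long fixedBytesPerRow, long maxBatchBytes) {
        this.fixedBytesPerRow = fixedBytesPerRow;
        this.maxBatchBytes = maxBatchBytes;
    }

    // varWidthDelta: how much the variable-width buffers grew for this row.
    // Returns true when the accumulated batch size crosses maxBatchBytes.
    boolean addRowAndShouldFlush(long varWidthDelta) {
        rows++;
        variableBytes += varWidthDelta;
        return rows * fixedBytesPerRow + variableBytes >= maxBatchBytes;
    }
}
```

The point of the split is that fixed-width cost is a constant per row and needs no buffer inspection, so only variable-width columns require measuring actual buffer growth.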

The queue_depth option now controls pool size. The use_queued_write_buffer
option is still parsed for backward compatibility but no longer affects
buffer selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) label on Apr 17, 2026
BaseVariableWidthVector.setSafe() writes via valueBuffer.setBytes() which
does not advance the buffer's writerIndex — that only happens at
setValueCount(). So getDataBuffer().readableBytes() stayed at 0 during
the write loop and byte-based flush never triggered. Use
getBufferSizeFor(rowCount) instead, which reads from the offset buffer
(which setSafe does update).
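The failure mode can be illustrated with a small pure-Java model. This is a hypothetical mock of the Arrow behavior described above, not Arrow code: `setSafe()` writes at absolute offsets without moving `writerIndex`, so `readableBytes()` reads 0 mid-batch, while a size computed from the offset buffer stays accurate.

```java
// Hypothetical model of a variable-width vector's buffers. setSafe() updates
// the offset buffer but not writerIndex; only setValueCount() advances
// writerIndex, which is why readableBytes() is useless during the write loop.
final class VarWidthBufferModel {
    private final byte[] data = new byte[1024];
    private final int[] offsets = new int[64]; // offsets[i + 1] = end of value i
    private int writerIndex;                   // advanced only by setValueCount()

    void setSafe(int index, byte[] value) {
        int start = offsets[index];
        System.arraycopy(value, 0, data, start, value.length);
        offsets[index + 1] = start + value.length; // offset buffer IS updated
        // writerIndex deliberately untouched, mirroring setBytes()
    }

    void setValueCount(int count) {
        writerIndex = offsets[count]; // only now does readableBytes() move
    }

    // What the buggy accounting read: 0 until setValueCount().
    int readableBytes() { return writerIndex; }

    // What the fix reads: derived from the offset buffer, correct mid-batch.
    int getBufferSizeFor(int rowCount) { return offsets[rowCount]; }
}
```

In this model, as in the fix, `getBufferSizeFor(rowCount)` reports real growth after every `setSafe()` call, so byte-based flush can trigger before the batch is sealed.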

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
