-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Background
Commit 23d0e8aa0 introduced Arrow memory-backed tables (ArrowNodeTable) to enable zero-copy data sharing with Arrow ecosystem. However, the current implementation has performance limitations compared to native Ladybug tables.
Current Implementation Issues
The current ArrowNodeTable implementation (see src/storage/table/arrow_node_table.cpp):
- Converts all Arrow data upfront into
std::vector<std::unique_ptr<common::Value>>(seereadArrowTableData()) - Returns one row at a time during scan (row-at-a-time processing)
- Does NOT support semi-masks for filtering during scan
- Does NOT utilize selection vectors for efficient batch filtering
- Missing SIP (Sideways Information Passing) optimization for joins
Optimization Tasks
1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering
Problem: Arrow tables currently ignore semi-masks set on TableScanState::semiMask, which means:
- Hash joins cannot push down semi-join filters to Arrow table scans
- Subquery unnesting cannot benefit from semi-mask-based filtering
- Missing significant join optimization opportunities
Implementation:
- In
ArrowNodeTable::scanInternal(), check ifscanState.semiMaskis enabled - Apply semi-mask filtering similar to
NodeGroup::applySemiMaskFilter()(seesrc/storage/table/node_group.cpp:186-213) - Filter rows based on
semiMask->range(startOffset, endOffset)before returning data - Skip rows that don't match the semi-mask to avoid unnecessary data conversion
Key files:
src/storage/table/arrow_node_table.cppsrc/include/common/mask.h(SemiMask interface)src/storage/table/node_group.cpp(reference implementation)
2. Support Selection Vectors
Problem: The current implementation always returns a selection vector with size 1 (setSelSize(1)), ignoring the vectorized processing capabilities of Ladybug.
Implementation:
- Process data in batches (up to
DEFAULT_VECTOR_CAPACITYrows) instead of one row at a time - Populate selection vector with valid row indices after applying filters
- Support predicate pushdown through selection vectors
- Handle both flat and unflat output vectors properly
Key files:
src/include/common/data_chunk/sel_vector.h(SelectionVector interface)src/processor/operator/scan/scan_node_table.cpp(how scan uses selection vectors)
3. Support for ASP (Asymmetric Semi-join with Predicate) Joins
Problem: Arrow tables cannot participate in asymmetric semi-join optimizations where:
- Build side is much smaller than probe side
- Semi-join predicate can be pushed down to the Arrow table scan
- Early filtering could significantly reduce data transfer
Implementation:
- Ensure Arrow tables properly expose row count statistics for optimizer decisions
- Support predicate pushdown through
TableScanState::columnPredicateSets - Enable early termination when semi-join condition is satisfied
Key files:
src/include/storage/predicate/column_predicate.hsrc/include/planner/operator/sip/side_way_info_passing.h(SIPInfo)src/processor/operator/hash_join/hash_join_probe.h
4. Batch Processing Instead of Row-at-a-Time
Problem: Current implementation converts all data upfront then returns one row per scan call, which is inefficient.
Implementation:
- Process Arrow data in columnar batches (up to
DEFAULT_VECTOR_CAPACITY) - Read data directly from Arrow arrays into Ladybug vectors without intermediate
Valueobjects - Use Arrow's chunked array support for memory-efficient processing
- Implement lazy/partial data reading instead of full table materialization
Key files:
src/common/arrow/arrow_converter.h(Arrow-to-Ladybug conversion utilities)src/include/common/types/value/value.h
5. Zero-Copy Data Access (Bonus/Advanced)
Problem: Converting Arrow arrays to Ladybug Value objects is expensive and loses Arrow's columnar benefits.
Implementation:
- For compatible Arrow types (primitive types), access Arrow buffers directly
- Implement custom
ValueVectorbacked by Arrow array buffers - Use Arrow's C Data Interface for efficient data sharing
- Avoid
Valueobject creation for simple types
Key files:
src/include/common/arrow/arrow.hsrc/common/arrow/arrow_converter.cpp
Acceptance Criteria
- Semi-mask support: Arrow tables should respect semi-masks set by hash joins and skip non-matching rows
- Selection vectors: Arrow scans should populate selection vectors with up to
DEFAULT_VECTOR_CAPACITYrows per call - Performance: Queries with semi-joins on Arrow tables should show measurable performance improvement compared to current row-at-a-time processing
- Correctness: All existing tests in
test/api/arrow_node_table_test.cppshould continue to pass - New tests: Add tests demonstrating:
- Semi-mask filtering on Arrow tables
- Batch processing with selection vectors
- Join queries with SIP optimization enabled
References
- Native semi-mask filtering:
src/storage/table/node_group.cpp:186-213 - Selection vector usage:
src/include/common/data_chunk/sel_vector.h - SIP planning:
src/include/planner/operator/sip/side_way_info_passing.h - Current Arrow implementation:
src/storage/table/arrow_node_table.cpp - Semi-masker operator:
src/include/processor/operator/semi_masker.h - Hash join probe:
src/include/processor/operator/hash_join/hash_join_probe.h
Priority
- Semi-mask support (SIP) - HIGH
- Selection vector batch processing - HIGH
- Predicate pushdown / ASP joins - MEDIUM
- Zero-copy data access - LOW (advanced optimization)