Skip to content

Optimize Arrow Memory Tables #183

@adsharma

Description

@adsharma

Background

Commit 23d0e8aa0 introduced Arrow memory-backed tables (ArrowNodeTable) to enable zero-copy data sharing with Arrow ecosystem. However, the current implementation has performance limitations compared to native Ladybug tables.

Current Implementation Issues

The current ArrowNodeTable implementation (see src/storage/table/arrow_node_table.cpp):

  1. Converts all Arrow data upfront into std::vector<std::unique_ptr<common::Value>> (see readArrowTableData())
  2. Returns one row at a time during scan (row-at-a-time processing)
  3. Does NOT support semi-masks for filtering during scan
  4. Does NOT utilize selection vectors for efficient batch filtering
  5. Missing SIP (Sideways Information Passing) optimization for joins

Optimization Tasks

1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering

Problem: Arrow tables currently ignore semi-masks set on TableScanState::semiMask, which means:

  • Hash joins cannot push down semi-join filters to Arrow table scans
  • Subquery unnesting cannot benefit from semi-mask-based filtering
  • Missing significant join optimization opportunities

Implementation:

  • In ArrowNodeTable::scanInternal(), check if scanState.semiMask is enabled
  • Apply semi-mask filtering similar to NodeGroup::applySemiMaskFilter() (see src/storage/table/node_group.cpp:186-213)
  • Filter rows based on semiMask->range(startOffset, endOffset) before returning data
  • Skip rows that don't match the semi-mask to avoid unnecessary data conversion

Key files:

  • src/storage/table/arrow_node_table.cpp
  • src/include/common/mask.h (SemiMask interface)
  • src/storage/table/node_group.cpp (reference implementation)

2. Support Selection Vectors

Problem: The current implementation always returns a selection vector with size 1 (setSelSize(1)), ignoring the vectorized processing capabilities of Ladybug.

Implementation:

  • Process data in batches (up to DEFAULT_VECTOR_CAPACITY rows) instead of one row at a time
  • Populate selection vector with valid row indices after applying filters
  • Support predicate pushdown through selection vectors
  • Handle both flat and unflat output vectors properly

Key files:

  • src/include/common/data_chunk/sel_vector.h (SelectionVector interface)
  • src/processor/operator/scan/scan_node_table.cpp (how scan uses selection vectors)

3. Support for ASP (Asymmetric Semi-join with Predicate) Joins

Problem: Arrow tables cannot participate in asymmetric semi-join optimizations where:

  • Build side is much smaller than probe side
  • Semi-join predicate can be pushed down to the Arrow table scan
  • Early filtering could significantly reduce data transfer

Implementation:

  • Ensure Arrow tables properly expose row count statistics for optimizer decisions
  • Support predicate pushdown through TableScanState::columnPredicateSets
  • Enable early termination when semi-join condition is satisfied

Key files:

  • src/include/storage/predicate/column_predicate.h
  • src/include/planner/operator/sip/side_way_info_passing.h (SIPInfo)
  • src/processor/operator/hash_join/hash_join_probe.h

4. Batch Processing Instead of Row-at-a-Time

Problem: Current implementation converts all data upfront then returns one row per scan call, which is inefficient.

Implementation:

  • Process Arrow data in columnar batches (up to DEFAULT_VECTOR_CAPACITY)
  • Read data directly from Arrow arrays into Ladybug vectors without intermediate Value objects
  • Use Arrow's chunked array support for memory-efficient processing
  • Implement lazy/partial data reading instead of full table materialization

Key files:

  • src/common/arrow/arrow_converter.h (Arrow-to-Ladybug conversion utilities)
  • src/include/common/types/value/value.h

5. Zero-Copy Data Access (Bonus/Advanced)

Problem: Converting Arrow arrays to Ladybug Value objects is expensive and loses Arrow's columnar benefits.

Implementation:

  • For compatible Arrow types (primitive types), access Arrow buffers directly
  • Implement custom ValueVector backed by Arrow array buffers
  • Use Arrow's C Data Interface for efficient data sharing
  • Avoid Value object creation for simple types

Key files:

  • src/include/common/arrow/arrow.h
  • src/common/arrow/arrow_converter.cpp

Acceptance Criteria

  1. Semi-mask support: Arrow tables should respect semi-masks set by hash joins and skip non-matching rows
  2. Selection vectors: Arrow scans should populate selection vectors with up to DEFAULT_VECTOR_CAPACITY rows per call
  3. Performance: Queries with semi-joins on Arrow tables should show measurable performance improvement compared to current row-at-a-time processing
  4. Correctness: All existing tests in test/api/arrow_node_table_test.cpp should continue to pass
  5. New tests: Add tests demonstrating:
    • Semi-mask filtering on Arrow tables
    • Batch processing with selection vectors
    • Join queries with SIP optimization enabled

References

  • Native semi-mask filtering: src/storage/table/node_group.cpp:186-213
  • Selection vector usage: src/include/common/data_chunk/sel_vector.h
  • SIP planning: src/include/planner/operator/sip/side_way_info_passing.h
  • Current Arrow implementation: src/storage/table/arrow_node_table.cpp
  • Semi-masker operator: src/include/processor/operator/semi_masker.h
  • Hash join probe: src/include/processor/operator/hash_join/hash_join_probe.h

Priority

  1. Semi-mask support (SIP) - HIGH
  2. Selection vector batch processing - HIGH
  3. Predicate pushdown / ASP joins - MEDIUM
  4. Zero-copy data access - LOW (advanced optimization)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions