Optimize Arrow Memory Tables

## Background
[Commit `23d0e8aa0`](https://github.com/LadybugDB/ladybug/commit/23d0e8aa0) introduced Arrow memory-backed tables (`ArrowNodeTable`) to enable zero-copy data sharing with Arrow ecosystem. However, the current implementation has performance limitations compared to native Ladybug tables.

## Current Implementation Issues

The current `ArrowNodeTable` implementation (see `src/storage/table/arrow_node_table.cpp`):
1. **Converts all Arrow data upfront** into `std::vector<std::unique_ptr<common::Value>>` (see `readArrowTableData()`)
2. **Returns one row at a time** during scan (row-at-a-time processing)
3. **Does NOT support semi-masks** for filtering during scan
4. **Does NOT utilize selection vectors** for efficient batch filtering
5. **Missing SIP (Sideways Information Passing)** optimization for joins

## Optimization Tasks

### 1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering

**Problem:** Arrow tables currently ignore semi-masks set on `TableScanState::semiMask`, which means:
- Hash joins cannot push down semi-join filters to Arrow table scans
- Subquery unnesting cannot benefit from semi-mask-based filtering
- Missing significant join optimization opportunities

**Implementation:**
- In `ArrowNodeTable::scanInternal()`, check if `scanState.semiMask` is enabled
- Apply semi-mask filtering similar to `NodeGroup::applySemiMaskFilter()` (see `src/storage/table/node_group.cpp:186-213`)
- Filter rows based on `semiMask->range(startOffset, endOffset)` before returning data
- Skip rows that don't match the semi-mask to avoid unnecessary data conversion

**Key files:**
- `src/storage/table/arrow_node_table.cpp`
- `src/include/common/mask.h` (SemiMask interface)
- `src/storage/table/node_group.cpp` (reference implementation)

### 2. Support Selection Vectors

**Problem:** The current implementation always returns a selection vector with size 1 (`setSelSize(1)`), ignoring the vectorized processing capabilities of Ladybug.

**Implementation:**
- Process data in batches (up to `DEFAULT_VECTOR_CAPACITY` rows) instead of one row at a time
- Populate selection vector with valid row indices after applying filters
- Support predicate pushdown through selection vectors
- Handle both flat and unflat output vectors properly

**Key files:**
- `src/include/common/data_chunk/sel_vector.h` (SelectionVector interface)
- `src/processor/operator/scan/scan_node_table.cpp` (how scan uses selection vectors)

### 3. Support for ASP (Asymmetric Semi-join with Predicate) Joins

**Problem:** Arrow tables cannot participate in asymmetric semi-join optimizations where:
- Build side is much smaller than probe side
- Semi-join predicate can be pushed down to the Arrow table scan
- Early filtering could significantly reduce data transfer

**Implementation:**
- Ensure Arrow tables properly expose row count statistics for optimizer decisions
- Support predicate pushdown through `TableScanState::columnPredicateSets`
- Enable early termination when semi-join condition is satisfied

**Key files:**
- `src/include/storage/predicate/column_predicate.h`
- `src/include/planner/operator/sip/side_way_info_passing.h` (SIPInfo)
- `src/processor/operator/hash_join/hash_join_probe.h`

### 4. Batch Processing Instead of Row-at-a-Time

**Problem:** Current implementation converts all data upfront then returns one row per scan call, which is inefficient.

**Implementation:**
- Process Arrow data in columnar batches (up to `DEFAULT_VECTOR_CAPACITY`)
- Read data directly from Arrow arrays into Ladybug vectors without intermediate `Value` objects
- Use Arrow's chunked array support for memory-efficient processing
- Implement lazy/partial data reading instead of full table materialization

**Key files:**
- `src/common/arrow/arrow_converter.h` (Arrow-to-Ladybug conversion utilities)
- `src/include/common/types/value/value.h`

### 5. Zero-Copy Data Access (Bonus/Advanced)

**Problem:** Converting Arrow arrays to Ladybug `Value` objects is expensive and loses Arrow's columnar benefits.

**Implementation:**
- For compatible Arrow types (primitive types), access Arrow buffers directly
- Implement custom `ValueVector` backed by Arrow array buffers
- Use Arrow's C Data Interface for efficient data sharing
- Avoid `Value` object creation for simple types

**Key files:**
- `src/include/common/arrow/arrow.h`
- `src/common/arrow/arrow_converter.cpp`

## Acceptance Criteria

1. **Semi-mask support:** Arrow tables should respect semi-masks set by hash joins and skip non-matching rows
2. **Selection vectors:** Arrow scans should populate selection vectors with up to `DEFAULT_VECTOR_CAPACITY` rows per call
3. **Performance:** Queries with semi-joins on Arrow tables should show measurable performance improvement compared to current row-at-a-time processing
4. **Correctness:** All existing tests in `test/api/arrow_node_table_test.cpp` should continue to pass
5. **New tests:** Add tests demonstrating:
   - Semi-mask filtering on Arrow tables
   - Batch processing with selection vectors
   - Join queries with SIP optimization enabled

## References

- Native semi-mask filtering: `src/storage/table/node_group.cpp:186-213`
- Selection vector usage: `src/include/common/data_chunk/sel_vector.h`
- SIP planning: `src/include/planner/operator/sip/side_way_info_passing.h`
- Current Arrow implementation: `src/storage/table/arrow_node_table.cpp`
- Semi-masker operator: `src/include/processor/operator/semi_masker.h`
- Hash join probe: `src/include/processor/operator/hash_join/hash_join_probe.h`

## Priority

1. Semi-mask support (SIP) - HIGH
2. Selection vector batch processing - HIGH
3. Predicate pushdown / ASP joins - MEDIUM
4. Zero-copy data access - LOW (advanced optimization)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize Arrow Memory Tables #183

Background

Current Implementation Issues

Optimization Tasks

1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering

2. Support Selection Vectors

3. Support for ASP (Asymmetric Semi-join with Predicate) Joins

4. Batch Processing Instead of Row-at-a-Time

5. Zero-Copy Data Access (Bonus/Advanced)

Acceptance Criteria

References

Priority

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimize Arrow Memory Tables #183

Description

Background

Current Implementation Issues

Optimization Tasks

1. Support SIP (Sideways Information Passing) / Semi-Mask Filtering

2. Support Selection Vectors

3. Support for ASP (Asymmetric Semi-join with Predicate) Joins

4. Batch Processing Instead of Row-at-a-Time

5. Zero-Copy Data Access (Bonus/Advanced)

Acceptance Criteria

References

Priority

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions