[DataLoader] Support concurrent multi-file reads per split#539

Merged
robreeves merged 9 commits into linkedin:main from robreeves:pyice2
Apr 13, 2026

@robreeves
Collaborator

Summary

Builds on #537. Adds a files_per_split parameter to OpenHouseDataLoader that controls how many files each DataLoaderSplit reads concurrently. DataLoaderSplit now accepts a list of FileScanTasks and sets concurrent_streams to match, enabling parallel I/O within a single split. Defaults to 1 (preserving current behavior).

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

data_loader.py — New files_per_split parameter (default 1). Scan tasks are grouped into chunks before being passed to DataLoaderSplit.
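The grouping step described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `group_into_splits` and the string file names are hypothetical stand-ins for the real scan tasks, and the actual implementation in data_loader.py may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_into_splits(tasks: Sequence[T], files_per_split: int = 1) -> List[List[T]]:
    """Group tasks into consecutive chunks of at most files_per_split items.

    Hypothetical helper; assumes invalid input is rejected with ValueError.
    """
    if files_per_split < 1:
        raise ValueError(f"files_per_split must be >= 1, got {files_per_split}")
    return [
        list(tasks[start:start + files_per_split])
        for start in range(0, len(tasks), files_per_split)
    ]


# Five files with files_per_split=2 leave a remainder chunk of one file.
files = ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"]
print(group_into_splits(files, files_per_split=2))
# → [['a.parquet', 'b.parquet'], ['c.parquet', 'd.parquet'], ['e.parquet']]
```

With the default of 1, every chunk holds a single file, which corresponds to the previous one-file-per-split behavior.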

data_loader_split.py: __init__ now takes file_scan_tasks: Sequence[FileScanTask] instead of a single task. concurrent_streams is set to match the number of files. The id property hashes all file paths for determinism.
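One way a deterministic ID could be derived from several file paths is sketched below. The function name `split_id`, the choice of SHA-256, and the NUL separator are assumptions made for illustration; the PR only states that the id property hashes all file paths and that an empty list is rejected.

```python
import hashlib
from typing import Sequence


def split_id(file_paths: Sequence[str]) -> str:
    """Return a stable hex digest for an ordered list of file paths.

    Hypothetical sketch: the real id property may use a different
    hash function or path encoding.
    """
    if not file_paths:
        raise ValueError("a split needs at least one file")
    digest = hashlib.sha256()
    for path in file_paths:
        digest.update(path.encode("utf-8"))
        # Separator so ["ab"] and ["a", "b"] do not hash to the same value.
        digest.update(b"\0")
    return digest.hexdigest()


paths = ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]
# The same paths in the same order always yield the same ID.
print(split_id(paths) == split_id(list(paths)))
# → True
```

In this sketch the ID depends on path order; whether the actual implementation sorts paths first is not stated in the description.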

Tests — New multi-file split tests in test_data_loader_split.py (iteration, deterministic IDs, transforms, empty-list validation). New files_per_split tests in test_data_loader.py (grouping, remainder, larger-than-total, data preservation, invalid input).

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

make verify passes — 213 tests pass, lint, format, and mypy all green.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

…iceberg

Re-introduce the ArrivalOrder scan order and batch_size parameter that
were removed in linkedin#504. The original removal was necessary because the
fork dependency (sumedhsakdeo/iceberg-python) could not pass ELR.

Now that li-pyiceberg 0.11.3 includes the ArrivalOrder API from
upstream (apache/iceberg-python#3046), we can restore the functionality
using an approved registry dependency.
…per_split

Add files_per_split parameter to OpenHouseDataLoader that controls how
many files each DataLoaderSplit reads concurrently. DataLoaderSplit now
accepts a list of FileScanTasks and sets concurrent_streams to match,
enabling parallel I/O within a single split.
@robreeves robreeves marked this pull request as ready for review April 11, 2026 06:46
Collaborator

@ShreyeshArangath ShreyeshArangath left a comment

LGTM

@robreeves robreeves merged commit ad69f90 into linkedin:main Apr 13, 2026
2 checks passed
@robreeves robreeves deleted the pyice2 branch April 13, 2026 18:13