Redesign json array streaming for datafusion #31
Merged
zhuqi-lucas merged 1 commit into branch-51 on Feb 3, 2026
Conversation
Pull request overview
This PR fixes a buffer boundary bug in JsonArrayToNdjsonReader where data could be lost when processing large JSON arrays that exceed the internal buffer size. The fix changes the implementation to use BufReader's fill_buf()/consume() pattern instead of Read::read(), which allows precise control over byte consumption and prevents data loss.
Changes:
- Wrapped the inner reader in `BufReader<R>` for precise byte consumption control
- Refactored `fill_internal_buffer()` to use the `fill_buf()`/`consume()` pattern instead of `read()` (a minimal sketch of this pattern follows the list)
- Added comprehensive tests to verify the fix with data larger than the 64KB buffer size
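For background, here is a minimal, self-contained sketch of the `fill_buf()`/`consume()` pattern the fix relies on. The wrapper type `BoundaryAwareReader` is a hypothetical stand-in, not the PR's `JsonArrayToNdjsonReader`; only the `std::io::BufRead` calls reflect real library API.

```rust
use std::io::{BufRead, BufReader, Read, Result};

/// Hypothetical adapter illustrating the pattern: pull bytes through the
/// `BufRead` API so that exactly the bytes used are consumed.
struct BoundaryAwareReader<R: Read> {
    inner: BufReader<R>,
}

impl<R: Read> BoundaryAwareReader<R> {
    fn new(inner: R) -> Self {
        Self { inner: BufReader::new(inner) }
    }

    /// Copy at most `out.len()` bytes into `out`, consuming only what was
    /// copied. Unread bytes stay buffered inside the `BufReader`, so nothing
    /// is lost when the caller's buffer fills up mid-chunk.
    fn fill_internal_buffer(&mut self, out: &mut [u8]) -> Result<usize> {
        let available = self.inner.fill_buf()?;       // borrow whatever is buffered
        let n = available.len().min(out.len());
        out[..n].copy_from_slice(&available[..n]);
        self.inner.consume(n);                        // advance by exactly `n` bytes
        Ok(n)
    }
}

fn main() -> Result<()> {
    let data = b"[{\"id\":1},{\"id\":2}]".to_vec();
    let mut reader = BoundaryAwareReader::new(&data[..]);
    let mut scratch = [0u8; 4];
    let n = reader.fill_internal_buffer(&mut scratch)?;
    assert_eq!(&scratch[..n], &b"[{\"i"[..]); // remaining bytes are still in the BufReader
    Ok(())
}
```

Contrast this with `Read::read()` into a fixed scratch buffer, where bytes that do not fit must be carried over manually; forgetting that carry-over is the kind of boundary bug this PR fixes.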
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| datafusion/datasource-json/src/utils.rs | Core fix: Changed JsonArrayToNdjsonReader to use BufReader wrapper and refactored buffer filling logic to prevent data loss at buffer boundaries; added comprehensive tests |
| datafusion/datasource-json/Cargo.toml | Added serde_json dependency for test validation |
| Cargo.lock | Updated lock file with new dependency |
Upstream:
apache#19924
PR Description
Summary
This PR introduces a high-performance streaming architecture for reading JSON array format files ([{...}, {...}, ...]) in DataFusion. The new design processes arbitrarily large JSON array files with constant memory usage (~32MB) instead of loading the entire file into memory.
Motivation
The previous implementation had critical limitations, chiefly that it loaded the entire JSON array file into memory before parsing.
Solution Architecture
The new design uses a streaming character substitution approach that converts JSON array format to NDJSON on-the-fly:
Key Components
- `JsonArrayToNdjsonReader`: A streaming `Read` + `BufRead` adapter that performs character-level transformation, stripping the outer `[` and trailing `]` and converting element-separating `,` to `\n` (a simplified sketch of this transformation follows the list)
- `ChannelReader`: Bridges the async-to-sync boundary by receiving `Bytes` chunks from a channel and implementing `std::io::Read`
- `JsonArrayStream`: Custom stream wrapper that holds `SpawnedTask` handles for proper cancel-safety
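To make the character-substitution idea concrete, here is a simplified, in-memory sketch of the transformation. It is not the PR's streaming `JsonArrayToNdjsonReader`, which applies the same rules incrementally over a `Read` adapter; this version assumes the whole input is already in a string and that the input is valid JSON.

```rust
/// Illustrative transform: rewrite a JSON array `[{...},{...}]` into NDJSON
/// by dropping the outer brackets and turning top-level commas into newlines.
/// Tracks string literals and nesting depth so commas and brackets inside
/// individual records are left untouched.
fn json_array_to_ndjson(input: &str) -> String {
    let mut out = String::new();
    let (mut depth, mut in_string, mut escaped) = (0i32, false, false);
    for c in input.chars() {
        if in_string {
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => { in_string = true; out.push(c); }
            '[' | '{' => {
                depth += 1;
                if !(c == '[' && depth == 1) { out.push(c); } // drop the outermost `[`
            }
            ']' | '}' => {
                if !(c == ']' && depth == 1) { out.push(c); } // drop the outermost `]`
                depth -= 1;
            }
            ',' if depth == 1 => out.push('\n'), // element separator → newline
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let ndjson = json_array_to_ndjson(r#"[{"a":[1,2]},{"b":"x,y"}]"#);
    assert_eq!(ndjson, "{\"a\":[1,2]}\n{\"b\":\"x,y\"}");
}
```

Because the transform only rewrites bytes at nesting depth 1 and outside string literals, commas and brackets inside each record pass through unchanged, which is what lets the downstream NDJSON reader parse every output line independently.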
Memory Budget (~32MB total)
API Changes
- Renamed the `format_array` option to `newline_delimited` (inverted semantics for clarity):
  - `newline_delimited: true` (default) → NDJSON format
  - `newline_delimited: false` → JSON array format
- Renamed `NdJsonReadOptions` to `JsonReadOptions` (with a deprecation alias)
- Removed the `compression_level` field from `JsonOptions`
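A hedged usage sketch of the option semantics above. The `SessionContext::register_json` call is existing DataFusion API; the `JsonReadOptions` name comes from the rename described in this PR, while the `newline_delimited(...)` builder-style setter and the file paths are assumptions for illustration, not confirmed by this page.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Assumption: `JsonReadOptions` exposes `newline_delimited` as a
    // builder-style setter; the exact setter shape may differ in the PR.
    let ndjson_opts = JsonReadOptions::default();                          // NDJSON (default)
    let array_opts = JsonReadOptions::default().newline_delimited(false);  // JSON array format

    ctx.register_json("logs", "data/logs.ndjson", ndjson_opts).await?;
    ctx.register_json("events", "data/events.json", array_opts).await?;

    ctx.sql("SELECT count(*) FROM events").await?.show().await?;
    Ok(())
}
```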
Testing
- Unit tests covering `JsonArrayToNdjsonReader`
- Tests exercising `JsonOpener` with the JSON array format
Limitations