[SPARK-57275][CONNECT] Validate row count after consuming all arrow batches #56343

Open
biruktesf-db wants to merge 1 commit into apache:master from biruktesf-db:fix-multiple-arrow-batch
biruktesf-db commented Jun 5, 2026

What changes were proposed in this pull request?

When consuming query results in the Spark Connect Python client, count each RecordBatch once (num_records += batch.num_rows) and validate the total against row_count only after the IPC stream is fully consumed. The Scala client (SparkResult.scala) already validates after the loop and is unaffected.
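
As a rough illustration of the count-then-validate order described above, the sketch below uses a hypothetical `FakeBatch` stand-in for a PyArrow RecordBatch and a hypothetical helper `count_rows_then_validate`; neither name comes from the patch, and `ValueError` stands in for SparkConnectException so the snippet has no Spark dependency.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeBatch:
    """Hypothetical stand-in for a pyarrow.RecordBatch; only num_rows is modeled."""

    num_rows: int


def count_rows_then_validate(batches: Iterable[FakeBatch], expected_row_count: int) -> int:
    """Count every batch exactly once, then compare against the reported total."""
    num_records = 0
    for batch in batches:
        # Each RecordBatch contributes its rows once; nothing is validated yet.
        num_records += batch.num_rows

    # Validate only after the whole stream has been consumed.
    if num_records != expected_row_count:
        # ValueError is an assumption here; the real client raises SparkConnectException.
        raise ValueError(
            f"Expected {expected_row_count} rows in arrow batch but got {num_records}."
        )
    return num_records


# Two batches of 2 and 3 rows satisfy a reported total of 5.
total = count_rows_then_validate([FakeBatch(2), FakeBatch(3)], expected_row_count=5)
```

With this ordering, a mismatch is reported only when the summed row count of the whole message differs from the reported total, not when an individual batch is smaller than it.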

Why are the changes needed?

The Arrow IPC streaming format wraps a result as [Schema][RecordBatch]*[EOS], so a single message can carry multiple RecordBatches, and pa.ipc.open_stream(...) parses all of them. The server's arrow_batch.row_count is the total number of rows across every RecordBatch in the message, but the Spark Connect client validates the row count inside the per-batch loop:

  for batch in reader:
      num_records_in_batch += batch.num_rows
      if num_records_in_batch != b.arrow_batch.row_count:   # checked too early
          raise SparkConnectException(...)
      num_records += num_records_in_batch                    # also double-counts

When a message contains more than one RecordBatch, the check fires after the first batch, before the stream is fully consumed, and throws:

SparkConnectException: Expected N rows in arrow batch but got M. (M < N)

Impact: Any code path that produces multi-RecordBatch IPC streams fails to fetch results, even though the payload is well-formed and parseable by PyArrow.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests for the client.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

biruktesf-db force-pushed the fix-multiple-arrow-batch branch from 73dfc7e to 568fe92 on June 5, 2026 18:24