[SPARK-57275][CONNECT] Validate row count after consuming all arrow batches by biruktesf-db · Pull Request #56343 · apache/spark

biruktesf-db · 2026-06-05T12:25:20Z

What changes were proposed in this pull request?

When consuming query results in the Spark Connect Python client, count each RecordBatch exactly once (num_records += batch.num_rows) and validate the total against row_count only after the IPC stream is fully consumed. The Scala client (SparkResult.scala) already validates after the loop and is unaffected.
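
The intended control flow can be sketched as follows. This is an illustration, not the actual patch: `Batch`, `RowCountMismatch`, and `count_rows` are hypothetical stand-ins (the real client iterates a PyArrow reader and raises SparkConnectException). The only point is that the accumulator grows once per RecordBatch and the comparison happens after the loop.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Batch:
    """Hypothetical stand-in for a pyarrow.RecordBatch; only num_rows matters here."""
    num_rows: int


class RowCountMismatch(Exception):
    """Hypothetical stand-in for SparkConnectException."""


def count_rows(batches: Iterable[Batch], expected_row_count: int) -> int:
    """Count rows across all batches of one message, then validate once."""
    num_records = 0
    for batch in batches:
        # Count each RecordBatch exactly once.
        num_records += batch.num_rows
    # Validate only after the whole stream has been consumed.
    if num_records != expected_row_count:
        raise RowCountMismatch(
            f"Expected {expected_row_count} rows in arrow batch but got {num_records}."
        )
    return num_records
```

With this shape, `count_rows([Batch(2), Batch(3)], 5)` succeeds, whereas a check placed inside the loop would have compared 2 against 5 after the first batch and failed.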

Why are the changes needed?

The Arrow IPC streaming format wraps a result as [Schema][RecordBatch]*[EOS], so a single message can carry multiple RecordBatches, and pa.ipc.open_stream(...) parses all of them. The server's arrow_batch.row_count is the total number of rows across every RecordBatch in the message, but the Spark Connect client validates the row count inside the per-batch loop:

  for batch in reader:
      num_records_in_batch += batch.num_rows
      if num_records_in_batch != b.arrow_batch.row_count:   # checked too early
          raise SparkConnectException(...)
      num_records += num_records_in_batch                    # also double-counts

When a message contains more than one RecordBatch, the check fires after the first batch, before the stream is fully consumed, and throws:

SparkConnectException: Expected N rows in arrow batch but got M. (M < N)

Impact: Any code path that produces multi-RecordBatch IPC streams fails to fetch results, even though the payload is well-formed and parseable by PyArrow.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests for the client.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

[SPARK-55448][CONNECT] Validate row count after consuming all arrow b…

568fe92

biruktesf-db force-pushed the fix-multiple-arrow-batch branch from 73dfc7e to 568fe92 Compare June 5, 2026 18:24

biruktesf-db wants to merge 1 commit into apache:master from biruktesf-db:fix-multiple-arrow-batch
