GH-3411 Expose row group index via Parquet reader #3412

Merged
Fokko merged 2 commits into apache:master from uros7251brick:expose-row-group-idx
Mar 12, 2026

@uros7251brick
Contributor

Rationale for this change

Engines like Apache Spark need to know which row group a record belongs to — for example, to expose row group metadata as a hidden column, or to correlate records with row group-level statistics. Without this API, callers have no way to determine the current row group index during sequential reads.

What changes are included in this PR?

Similar to how getCurrentRowIndex() was introduced to expose the current row's file-level index, this adds getCurrentRowGroupIndex() to expose the index of the row group currently being read.

New API:

  • ParquetFileReader.getCurrentRowGroupIndex() — returns the 0-based index of the last row group read via readNextRowGroup() / readNextFilteredRowGroup(). Returns -1 before any row group has been read.
  • ParquetReader.getCurrentRowGroupIndex() — same semantics, for the high-level record reader.
  • ParquetRecordReader.getCurrentRowGroupIndex() — same, for the Hadoop MapReduce record reader.
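
The contract in the list above can be modeled with a small self-contained sketch. `RowGroupIndexDemo`, its nested `ToyReader`, and `tagRecords` are hypothetical names invented for illustration; this is not the Parquet implementation, and it mirrors only the documented behavior: -1 before any row group has been read, then the 0-based index of the last row group read.

```java
import java.util.ArrayList;
import java.util.List;

final class RowGroupIndexDemo {

    /** Hypothetical stand-in for a row-group-oriented reader (not Parquet code). */
    static final class ToyReader {
        private final List<List<String>> rowGroups;
        private int currentRowGroupIndex = -1; // nothing has been read yet

        ToyReader(List<List<String>> rowGroups) {
            this.rowGroups = rowGroups;
        }

        /** Returns the next row group, or null once all row groups are consumed. */
        List<String> readNextRowGroup() {
            int next = currentRowGroupIndex + 1;
            if (next >= rowGroups.size()) {
                return null; // index keeps pointing at the last row group read
            }
            currentRowGroupIndex = next;
            return rowGroups.get(next);
        }

        /** 0-based index of the last row group read, or -1 before the first read. */
        int getCurrentRowGroupIndex() {
            return currentRowGroupIndex;
        }
    }

    /** Prefixes each record with the index of the row group it came from. */
    static List<String> tagRecords(ToyReader reader) {
        List<String> tagged = new ArrayList<>();
        List<String> group;
        while ((group = reader.readNextRowGroup()) != null) {
            int index = reader.getCurrentRowGroupIndex();
            for (String record : group) {
                tagged.add(index + ":" + record);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader(List.of(List.of("a", "b"), List.of("c")));
        System.out.println(reader.getCurrentRowGroupIndex()); // prints -1
        System.out.println(tagRecords(reader));               // prints [0:a, 0:b, 1:c]
    }
}
```

Tagging each record with its row group index is roughly what an engine would do to surface the index as a hidden column; a real caller would query the Parquet reader instead of `ToyReader`.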

The returned index is the actual file-level row group index, meaning it correctly reflects gaps when empty row groups are skipped (e.g. if row group 1 is empty, the indices reported will be 0, 2, ... not 0, 1, ...).
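
To make the gap behavior concrete, here is a minimal sketch under stated assumptions. `EmptyRowGroupGapDemo` and its `ToyReader` are hypothetical and model a file only as a list of per-row-group row counts; the skipping logic is an assumption chosen to match the description above, not a copy of Parquet's code.

```java
import java.util.ArrayList;
import java.util.List;

final class EmptyRowGroupGapDemo {

    /** Hypothetical reader that skips empty row groups (not Parquet code). */
    static final class ToyReader {
        private final long[] rowCounts;        // rows in each file-level row group
        private int nextCandidate = 0;         // next file-level index to inspect
        private int currentRowGroupIndex = -1; // -1 until a row group is read

        ToyReader(long... rowCounts) {
            this.rowCounts = rowCounts.clone();
        }

        /** Advances to the next non-empty row group; false when none remain. */
        boolean readNextRowGroup() {
            while (nextCandidate < rowCounts.length) {
                int candidate = nextCandidate++;
                if (rowCounts[candidate] > 0) {
                    // Report the file-level index, not a count of groups read so far.
                    currentRowGroupIndex = candidate;
                    return true;
                }
            }
            return false;
        }

        int getCurrentRowGroupIndex() {
            return currentRowGroupIndex;
        }
    }

    /** Collects the index reported after each successful read. */
    static List<Integer> reportedIndices(long... rowCounts) {
        ToyReader reader = new ToyReader(rowCounts);
        List<Integer> seen = new ArrayList<>();
        while (reader.readNextRowGroup()) {
            seen.add(reader.getCurrentRowGroupIndex());
        }
        return seen;
    }

    public static void main(String[] args) {
        // Row group 1 is empty, so the reported indices skip it.
        System.out.println(reportedIndices(3, 0, 2)); // prints [0, 2]
    }
}
```

Because the reported value is a file-level index, it can be used directly to look up that row group's entry in the file footer metadata, which a running count of non-empty groups could not do.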

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Closes #3411

Contributor

@Fokko left a comment

Seems reasonable to me, thanks @uros7251brick for working on this

@Fokko merged commit 2686e85 into apache:master on Mar 12, 2026
5 checks passed
@Fokko
Contributor

Fokko commented Mar 12, 2026

Thanks @uros7251brick 🙌

Linked issue

Expose row group index in Parquet readers

2 participants