GH-3598: Expose getRowRanges(int) #3599

Open
peter-toth wants to merge 3 commits into apache:master from peter-toth:GH-3598-parquet-reader-row-range-apis

peter-toth commented Jun 5, 2026

Rationale for this change

This PR is based on @mbutrovich's previous work.

This change opens up an API needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need the column-index-derived row ranges that may pass the configured filter for a row group, so that they can plan which pages to read without reading column data themselves.

getRowRanges(int) already existed as a private helper; this change makes it public and gives it a well-defined behavior when no filter is configured.

What changes are included in this PR?

  • getRowRanges(int blockIndex): made public; returns the row ranges within the row group that may pass the configured filter. The computation is metadata-only (consults the column index from the footer; no column data is read). With no filter configured, it shortcuts to a RowRanges covering all rows of the row group rather than asserting that a filter is present.
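
The contract described in this bullet can be modeled roughly as follows. This is a minimal sketch with stand-in types, not the Parquet classes: `Range`, `rowRanges`, and `coveredRows` are hypothetical names introduced for illustration, and the filtered ranges are passed in precomputed instead of being derived from a column index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in model only: these are NOT the org.apache.parquet classes.
final class RowRangesSketch {

  /** Inclusive range of row indexes [from, to] within one row group. */
  static final class Range {
    final long from;
    final long to;

    Range(long from, long to) {
      this.from = from;
      this.to = to;
    }
  }

  /**
   * Hypothetical model of the described contract: with no filter configured,
   * return a single range covering the whole row group; otherwise return the
   * ranges that column-index filtering kept (passed in here precomputed).
   */
  static List<Range> rowRanges(long rowCount, boolean filterConfigured, List<Range> kept) {
    if (!filterConfigured) {
      // Assumption of this sketch: an empty row group yields no ranges.
      return rowCount == 0
          ? Collections.<Range>emptyList()
          : Collections.singletonList(new Range(0, rowCount - 1));
    }
    return kept;
  }

  /** Number of rows a planner would still have to consider. */
  static long coveredRows(List<Range> ranges) {
    long rows = 0;
    for (Range r : ranges) {
      rows += r.to - r.from + 1;
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Range> kept = new ArrayList<>();
    kept.add(new Range(100, 199));
    kept.add(new Range(400, 449));

    // No filter: every row of the 1000-row group is covered.
    System.out.println(coveredRows(rowRanges(1000, false, kept)));
    // With a filter: only the rows inside the kept ranges are covered.
    System.out.println(coveredRows(rowRanges(1000, true, kept)));
  }
}
```

The point of the no-filter branch is that a caller gets a usable answer in both cases and does not need to check for a configured filter before asking for ranges.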

Are these changes tested?

Yes. TestParquetFileReaderRowRanges verifies that, with no filter configured, getRowRanges returns ranges covering all rows of the row group.

Are there any user-facing changes?

No.

Closes #3598

peter-toth changed the title GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges Jun 5, 2026
…RowRanges

### Rationale for this change

Opening up APIs needed by a later materialization feature in Spark. External readers (e.g. a Spark-side scanner) need (a) the column-index-derived row ranges that may pass the configured filter for a row group, and (b) a metadata-only estimate of the on-disk compressed bytes those ranges correspond to for the currently requested columns, so they can plan I/O without reading column data.

### What changes are included in this PR?

- `getRowRanges(int blockIndex)`: made public; returns row ranges that may pass the configured filter. With no filter, shortcuts to all rows of the row group.
- `getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges)`: metadata-only sum of compressed page sizes for the reader's currently requested columns whose pages overlap the given row ranges. Dictionary pages are not represented in OffsetIndex and are therefore excluded.
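
The overlap sum described in the second bullet could look roughly like the sketch below. It uses stand-in types under stated assumptions: `Page`, `Range`, `overlaps`, and `compressedBytes` are hypothetical names, and the page boundaries and sizes are hand-built here, whereas the description says the real method takes them from each requested column's OffsetIndex.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in model only: these are NOT the org.apache.parquet classes.
final class CompressedBytesSketch {

  /** One data page: inclusive row span plus its compressed size in bytes. */
  static final class Page {
    final long firstRow;
    final long lastRow;
    final int compressedSize;

    Page(long firstRow, long lastRow, int compressedSize) {
      this.firstRow = firstRow;
      this.lastRow = lastRow;
      this.compressedSize = compressedSize;
    }
  }

  /** Inclusive range of row indexes [from, to] within one row group. */
  static final class Range {
    final long from;
    final long to;

    Range(long from, long to) {
      this.from = from;
      this.to = to;
    }
  }

  /** True if the page shares at least one row with any of the ranges. */
  static boolean overlaps(Page page, List<Range> ranges) {
    for (Range r : ranges) {
      if (r.from <= page.lastRow && page.firstRow <= r.to) {
        return true;
      }
    }
    return false;
  }

  /**
   * Sums the compressed sizes of all pages, across the given columns, that
   * overlap the ranges. Dictionary pages never appear in the page lists, so
   * they are excluded by construction, as in the description above.
   */
  static long compressedBytes(List<List<Page>> columns, List<Range> ranges) {
    if (ranges.isEmpty()) {
      return 0L; // empty ranges short-circuit to 0
    }
    long total = 0;
    for (List<Page> pages : columns) {
      for (Page page : pages) {
        if (overlaps(page, ranges)) {
          total += page.compressedSize;
        }
      }
    }
    return total;
  }

  public static void main(String[] args) {
    List<Page> column = Arrays.asList(
        new Page(0, 99, 400), new Page(100, 199, 350), new Page(200, 249, 120));
    List<List<Page>> columns = Collections.singletonList(column);

    // Rows 90..110 touch the first two pages only.
    List<Range> ranges = Collections.singletonList(new Range(90, 110));
    System.out.println(compressedBytes(columns, ranges));
  }
}
```

Because a page is counted whole as soon as one of its rows falls inside a range, the result is an upper-bound style estimate of the bytes to fetch, not an exact count of the bytes belonging to the selected rows.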

### Are these changes tested?

Yes. `TestParquetFileReaderRowRanges` covers: no-filter row ranges cover all rows, empty ranges short-circuit to 0, full ranges equal the per-page OffsetIndex sum and are strictly less than the column-chunk total (proving dictionary-page exclusion), and partial ranges fall between 0 and the full total.

### Are there any user-facing changes?

No.

Closes apache#3598

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
peter-toth force-pushed the GH-3598-parquet-reader-row-range-apis branch from dc0e426 to 6d3427b June 5, 2026 14:05

* @throws ColumnIndexStore.MissingOffsetIndexException if any requested column lacks an
* offset index
*/
public long getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges) {

Member

Do we really want to maintain this? All the methods it uses seem to be publicly available, so application code can implement it directly.

peter-toth commented Jun 8, 2026

Author

@wgtmac, I can move this part to Spark.
What's your take on the getRowRanges() change in this PR and the other PR? Shall I combine them into one?

Member

I'm in favor of keeping them as separate PRs.

peter-toth changed the title from "GH-3598: Expose getRowRanges(int) and add getCompressedBytesForRowRanges" to "GH-3598: Expose getRowRanges(int)" Jun 8, 2026
wgtmac requested a review from Copilot June 9, 2026 06:11

Copilot AI left a comment

Contributor

Pull request overview

This PR exposes ParquetFileReader#getRowRanges(int) as a public API so external readers (e.g., Spark scanners) can compute column-index-derived row ranges for a row group using footer metadata only. It also defines behavior when no record filter is configured by returning row ranges covering the entire row group, and adds a regression test for that case.

Changes:

  • Promotes getRowRanges(int blockIndex) from private to public and documents its metadata-only behavior.
  • Adds a no-filter fast path that returns RowRanges covering all rows in the row group.
  • Adds TestParquetFileReaderRowRanges to verify the no-filter behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java | Exposes getRowRanges(int) publicly and adds the no-filter behavior/documentation. |
| parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileReaderRowRanges.java | Adds a test ensuring getRowRanges covers all rows when no filter is configured. |

Comment on lines +1505 to 1510
public RowRanges getRowRanges(int blockIndex) {
  if (!FilterCompat.isFilteringRequired(options.getRecordFilter())) {
    return RowRanges.createSingle(blocks.get(blockIndex).getRowCount());
  }
  RowRanges rowRanges = blockRowRanges.get(blockIndex);
  if (rowRanges == null) {

Author

Fixed in b04cbb8.


Development

Successfully merging this pull request may close these issues.

Expose row-range planning APIs on ParquetFileReader (getRowRanges, getCompressedBytesForRowRanges)

3 participants