GH-3601: Cache shouldIgnoreStatistics version parsing result by yadavay-amzn · Pull Request #3607 · apache/parquet-java

yadavay-amzn · 2026-06-07T02:37:49Z

Closes #3601.

Summary

Caches the result of CorruptStatistics.shouldIgnoreStatistics to avoid redundant version string parsing. The createdBy string is constant per file but was parsed R×C times (row groups × columns) during footer reading.

Changes

Added a bounded ConcurrentHashMap<String, Boolean> cache (max 64 entries) keyed on createdBy
Cheap type check (BINARY/FIXED_LEN_BYTE_ARRAY) still runs before cache lookup
When cache is full, computes directly without storing (best-effort cap)

Testing

testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)
testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)
All existing CorruptStatisticsTest tests pass (correctness preserved)

asifsmohammed

I generally like the approach @yadavay-amzn, here's why I would not use cache alone as I mentioned in the issue. It's better than the current approach of checking it RGxCC times

In high workload scenarios we are going to call shouldIgnoreStatistics RGxCC times i.e. we are going to call .size and .computeIfAbsent(hash + equals) that many times which is still not ideal.

The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaData → fromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.

This eliminates the need to check shouldIgnoreStatistics for every RGxCC. Both approaches could coexist (cache helps other callers), but the caller-side fix is where we see real improvements

…tMetadataConverter

yadavay-amzn · 2026-06-08T20:29:13Z

@asifsmohammed Thanks for reviewing! I am an agreement with your suggestions.

Removed the global static cache entirely. Now computes shouldIgnoreStatistics once per file in fromParquetMetadata before the row-group loop and passes the pre-computed flag through buildColumnChunkMetaData to fromParquetStatisticsInternal.

cc @wgtmac

Copilot

Pull request overview

Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.

Changes:

Added a new internal statistics conversion overload that accepts a precomputed shouldIgnoreCorruptStats boolean.
Added an overloaded buildColumnChunkMetaData method that accepts the precomputed flag and routes footer-reading through it.
Precomputes shouldIgnoreCorruptStats once in fromParquetMetadata(...) and passes it into the column loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    boolean shouldIgnoreCorruptStats =
+        CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY);
+    return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);


+    // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
+    // (the only types affected by PARQUET-251), and always false for other types.
+    boolean shouldIgnoreCorruptStats =
+        CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY);


+    // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
+    // (the only types affected by PARQUET-251), and always false for other types.


wgtmac reviewed Jun 8, 2026

View reviewed changes

Comment thread parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java Outdated

asifsmohammed reviewed Jun 8, 2026

View reviewed changes

apacheGH-3601: Compute shouldIgnoreStatistics once per file in Parque…

cbd31f7

…tMetadataConverter

yadavay-amzn force-pushed the fix/3601-shouldIgnoreStatistics-cache branch from e1ce8ad to cbd31f7 Compare June 8, 2026 20:29

wgtmac requested a review from Copilot June 9, 2026 05:24

Copilot started reviewing on behalf of wgtmac June 9, 2026 05:24 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache

yadavay-amzn commented Jun 7, 2026

Uh oh!

Uh oh!

asifsmohammed left a comment

Uh oh!

yadavay-amzn commented Jun 8, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
		// (the only types affected by PARQUET-251), and always false for other types.

Conversation

yadavay-amzn commented Jun 7, 2026

Summary

Changes

Testing

Uh oh!

Uh oh!

asifsmohammed left a comment

Choose a reason for hiding this comment

Uh oh!

yadavay-amzn commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yadavay-amzn commented Jun 8, 2026 •

edited

Loading