Skip to content

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607

Open
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache
Open

GH-3601: Cache shouldIgnoreStatistics version parsing result#3607
yadavay-amzn wants to merge 1 commit into
apache:masterfrom
yadavay-amzn:fix/3601-shouldIgnoreStatistics-cache

Conversation

@yadavay-amzn

Copy link
Copy Markdown
Contributor

Closes #3601.

Summary

Caches the result of CorruptStatistics.shouldIgnoreStatistics to avoid redundant version string parsing. The createdBy string is constant per file but was parsed R×C times (row groups × columns) during footer reading.

Changes

  • Added a bounded ConcurrentHashMap<String, Boolean> cache (max 64 entries) keyed on createdBy
  • Cheap type check (BINARY/FIXED_LEN_BYTE_ARRAY) still runs before cache lookup
  • When cache is full, computes directly without storing (best-effort cap)

Testing

  • testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)
  • testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)
  • All existing CorruptStatisticsTest tests pass (correctness preserved)

Comment thread parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java Outdated

@asifsmohammed asifsmohammed left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like the approach @yadavay-amzn, here's why I would not use cache alone as I mentioned in the issue. It's better than the current approach of checking it RGxCC times

  • In high workload scenarios we are going to call shouldIgnoreStatistics RGxCC times i.e. we are going to call .size and .computeIfAbsent(hash + equals) that many times which is still not ideal.

The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaDatafromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.

This eliminates the need to check shouldIgnoreStatistics for every RGxCC. Both approaches could coexist (cache helps other callers), but the caller-side fix is where we see real improvements

@yadavay-amzn yadavay-amzn force-pushed the fix/3601-shouldIgnoreStatistics-cache branch from e1ce8ad to cbd31f7 Compare June 8, 2026 20:29
@yadavay-amzn

yadavay-amzn commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@asifsmohammed Thanks for reviewing! I am an agreement with your suggestions.

Removed the global static cache entirely. Now computes shouldIgnoreStatistics once per file in fromParquetMetadata before the row-group loop and passes the pre-computed flag through buildColumnChunkMetaData to fromParquetStatisticsInternal.

cc @wgtmac

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.

Changes:

  • Added a new internal statistics conversion overload that accepts a precomputed shouldIgnoreCorruptStats boolean.
  • Added an overloaded buildColumnChunkMetaData method that accepts the precomputed flag and routes footer-reading through it.
  • Precomputes shouldIgnoreCorruptStats once in fromParquetMetadata(...) and passes it into the column loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1837 to +1839
boolean shouldIgnoreCorruptStats =
CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY);
return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats);
Comment on lines +1883 to +1886
// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
// (the only types affected by PARQUET-251), and always false for other types.
boolean shouldIgnoreCorruptStats =
CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY);
Comment on lines +1883 to +1884
// Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY
// (the only types affected by PARQUET-251), and always false for other types.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize repeated shouldIgnoreStatistics calls during footer reading in ParquetMetadataConverter

4 participants