GH-3601: Cache shouldIgnoreStatistics version parsing result#3607
GH-3601: Cache shouldIgnoreStatistics version parsing result#3607yadavay-amzn wants to merge 1 commit into
Conversation
asifsmohammed
left a comment
There was a problem hiding this comment.
I generally like the approach @yadavay-amzn, here's why I would not use cache alone as I mentioned in the issue. It's better than the current approach of checking it RGxCC times
- In high workload scenarios we are going to call
shouldIgnoreStatisticsRGxCC times i.e. we are going to call.sizeand.computeIfAbsent(hash + equals) that many times which is still not ideal.
The cleaner fix would be at the caller side in ParquetMetadataConverter.fromParquetMetadata(): compute shouldIgnoreStatistics once before the row group loop using the file-level createdBy, then pass the pre-computed boolean through buildColumnChunkMetaData → fromParquetStatisticsInternal. This eliminates all per-column overhead — no hash, no lookup, just a boolean flowing through the call chain.
This eliminates the need to check shouldIgnoreStatistics for every RGxCC. Both approaches could coexist (cache helps other callers), but the caller-side fix is where we see real improvements
…tMetadataConverter
e1ce8ad to
cbd31f7
Compare
|
@asifsmohammed Thanks for reviewing! I am an agreement with your suggestions. Removed the global static cache entirely. Now computes cc @wgtmac |
There was a problem hiding this comment.
Pull request overview
Optimizes footer metadata conversion by avoiding repeated CorruptStatistics.shouldIgnoreStatistics(created_by, ...) evaluation during row-group/column iteration, by precomputing a boolean once and threading it through column-chunk metadata/statistics conversion.
Changes:
- Added a new internal statistics conversion overload that accepts a precomputed
shouldIgnoreCorruptStatsboolean. - Added an overloaded
buildColumnChunkMetaDatamethod that accepts the precomputed flag and routes footer-reading through it. - Precomputes
shouldIgnoreCorruptStatsonce infromParquetMetadata(...)and passes it into the column loop.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| boolean shouldIgnoreCorruptStats = | ||
| CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveTypeName.BINARY); | ||
| return buildColumnChunkMetaData(metaData, columnPath, type, createdBy, shouldIgnoreCorruptStats); |
| // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY | ||
| // (the only types affected by PARQUET-251), and always false for other types. | ||
| boolean shouldIgnoreCorruptStats = | ||
| CorruptStatistics.shouldIgnoreStatistics(parquetMetadata.getCreated_by(), PrimitiveTypeName.BINARY); |
| // Compute once per file: the result is the same for BINARY and FIXED_LEN_BYTE_ARRAY | ||
| // (the only types affected by PARQUET-251), and always false for other types. |
Closes #3601.
Summary
Caches the result of
CorruptStatistics.shouldIgnoreStatisticsto avoid redundant version string parsing. ThecreatedBystring is constant per file but was parsed R×C times (row groups × columns) during footer reading.Changes
ConcurrentHashMap<String, Boolean>cache (max 64 entries) keyed oncreatedByTesting
testCachingBehavior: verifies cache is populated (cacheSize grows from 0→1→2)testCorrectnessWhenCacheIsFull: fills cache to 64 entries, verifies correct results on the cache-bypass path (65th+ distinct strings)CorruptStatisticsTesttests pass (correctness preserved)