Support sharded Parquet file querying and conversion #7610
Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
errGroup.SetLimit(p.concurrency)
for i := range shardBlockIDs {
	errGroup.Go(func() error {
		blk, err := p.newParquetBlock(egCtx, shardBlockIDs[i], shardIDs[i], bucketOpener, bucketOpener, p.chunksDecoder, p.rowRangesCache, noopQuota, noopQuota, noopQuota)
I haven't looked at how this parquet sharding works for quite some time now... The sharding is done at the column level... So do we really need to open all shards here? Or, based on the sharding, can we tell when opening only one file is enough?
We can't tell which shard holds a matching series, so we need to open all shard files. (The converter mark only stores the shard count, with no per-shard label metadata.)
We should at least try to get some hints? The sharding is based on sorting order. For example if we sort by metric name label, we should be able to get the min and max value for each shard and store them in the convert marker. This way based on metric name in the query we can tell which shard file we need to open.
This doesn't block this PR but I think it is something we should do to optimize the query path
Agreed, we can store MinName and MaxName in the ConverterMark and then use them to prune shards, since __name__ is always the primary sort key. I'll track it as a follow-up PR.
# splits a block into more parquet shards for better read parallelization.
# Default is unlimited (single shard).
# CLI flag: -parquet-converter.num-row-groups
[num_row_groups: <int> | default = 2147483647]
We should have an integration test with sharding enabled?
I added an e2e test.
Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
This PR adds support for querying sharded Parquet files within a bucket store and enables the conversion of sharded Parquet files.
Benchmark Results
Currently, the concurrency is hard-coded to 4.
Which issue(s) this PR fixes:
Fixes #7176 #7174
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags