HIVE-29536: Stabilize rebalance compaction tests#6487
HIVE-29536: Stabilize rebalance compaction tests#6487InvisibleProgrammer wants to merge 5 commits into
Conversation
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for working on the fix! I've added some suggestions and requests for improving it.
| .reduce(0, Integer::sum); | ||
|
|
||
| int optimalRecordsInBucket = allRecordCount / bucketCount; | ||
| int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1; |
There was a problem hiding this comment.
bba153e to
9f5cedf
Compare
zabetak
left a comment
There was a problem hiding this comment.
Many thanks for the PR @InvisibleProgrammer ! Left a bunch of minor comments that we don't necessarily need to address.
However, I would really like to understand which tests should verify data and which tests should verify buckets and how we should choose one or the other.
No need to apply code changes right now just reply to the comments and we can decide how to advance based on the answers.
| // Check if the compaction succeed | ||
| verifyCompaction(1, TxnStore.CLEANING_RESPONSE); | ||
|
|
||
| String[][] expectedBuckets = new String[][] { |
There was a problem hiding this comment.
In some rebalance tests like this one we use explicit buckets (exptectedBuckets) along with verifyRebalance and in others we use just data (expectedData) together with verifyDataAfterCompaction. How do we determine if a test should use one or the other?
There was a problem hiding this comment.
I wanted to minimize the changes and focus on tests that are flaky. I haven't seen a flakiness in that test case. Honestly, I don't know why it is so stable.
Should I rewrite all the rebalance test cases? It would be a future proof solution.
| conf.setBoolVar(HiveConf.ConfVars.HIVE_COMPACTOR_GATHER_STATS, false); | ||
| conf.setBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER, false); | ||
|
|
||
| //set grouping size to have 3 buckets, and re-create driver with the new config |
There was a problem hiding this comment.
Is the comment still relevant? Are we going to have 3 buckets at the end of the compaction?
There was a problem hiding this comment.
No, I remove this.
| /* | ||
| Validate the data after the test case | ||
| - the table is balanced (or if not, only numberOfDeletedRows amount of rows are missing | ||
| - there is only one writeId | ||
| - buckets has unique bucketId and the bucketId doesn't change inside a bucket | ||
| - data is sorted by column b (so the order of column a is not predictable) | ||
| - all the required value present | ||
| */ |
There was a problem hiding this comment.
This could be a Javadoc comment since it seems to be more than just an implementation detail.
There was a problem hiding this comment.
I can move it. As it is a private method in a test class, I don't see the difference.
| fs.globStatus(searchPath, AcidUtils.baseFileFilter)) | ||
| .map(FileStatus::getPath) | ||
| .map(Path::getName) | ||
| .sorted() |
There was a problem hiding this comment.
Can we have more than one base? If yes is it a valid scenario?
There was a problem hiding this comment.
I know about two scenarios when we can have multiple base files:
- major compaction, after the compaction finished and we are waiting for the cleaner.
- partitioned table
As far I remember, we have no such test case.
| int optimalRecordsInBucket = allRecordCount / bucketCount; | ||
| int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount; | ||
|
|
||
| for (int i = 0; i < bucketCount; i++) { | ||
| if (bucketData[i].size() > maximumRecordCountInABucket || bucketData[i].size() < optimalRecordsInBucket) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
nit: As far as I understand, optimalRecordsInBucket is a lowerBound and maximumRecordCountInAAbucket is an upperBound for the bucket size. Using the lower/upperBound naming could make the code a bit more easier to follow.
There was a problem hiding this comment.
Renamed. My next commit will contain the change.
| record RowData(String colA, Long colB) {} | ||
|
|
||
| record RowInfo(long writeId, long bucketId, long rowId, RowData rowData) { | ||
| private static final ObjectMapper MAPPER = new ObjectMapper(); | ||
|
|
||
| static RowInfo fromRawString(String row) throws JsonProcessingException { | ||
| // Example row data to parse: "{\"writeid\":7,\"bucketid\":537001984,\"rowid\":10}\t5\t4", | ||
|
|
||
| String[] parts = row.split("\t", 3); | ||
|
|
||
| JsonNode json = MAPPER.readTree(parts[0]); | ||
|
|
||
| return new RowInfo( | ||
| json.get("writeid").asLong(), | ||
| json.get("bucketid").asLong(), | ||
| json.get("rowid").asLong(), | ||
| new RowData( | ||
| parts[1], // colA | ||
| Long.parseLong(parts[2]) // colB | ||
| ) | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
This record classes could potentially be used by other compaction tests but putting them here makes them bit harder to find. Possibly a better fit would be TestDataProvider associated with APIs that return RowInfo objects instead of strings. Anyways just an idea, I am fine to leave them here as well.
There was a problem hiding this comment.
Interesting idea. Let me check if we have immediate gain sharing this code (can we use it at some other existing tests).
Forcing code to a common place right after when we introduce them is a tricky question: if we were right, there will be something that people tend to use and re-use. If we were wrong, we usually create 'shared' classes, DTOs, utility classes - with only one usage and make the code harder to understand and maintain.
There was a problem hiding this comment.
Refactored. The next commit will contain the change.
| expectedData.addAll(List.of( | ||
| new RowData("6", 4L), | ||
| new RowData("3", 4L), | ||
| new RowData("4", 4L), | ||
| new RowData("2", 4L), | ||
| new RowData("5", 4L) | ||
| )); |
There was a problem hiding this comment.
It's a bit strange that for some data we use directly Set#add and for other we pass through Set#addAll and List.
There was a problem hiding this comment.
I'm pretty sure they are leftovers from various attempts. I had a couple of iterations when I tried to decide the proper data struct for those assertions. Let me check them.
| AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf); | ||
|
|
||
| /* | ||
| Validate the data after the test case |
There was a problem hiding this comment.
The rowId should be checked as well. It has to be increasing within a file, otherwise the delete operation won't work.
There was a problem hiding this comment.
Done. The next commit will contain the change.
| verifyCompaction(1, TxnStore.CLEANING_RESPONSE); | ||
|
|
||
| // Populate expected data | ||
| Set<RowData> expectedData = new HashSet<>(); |
There was a problem hiding this comment.
Why do you hard-code the expected values? Why not just run a select before and after the compaction and compare the results?
There was a problem hiding this comment.
We had a discussion about this before. That solution is the closest to checking the data files themselves. And actually, this method runs a selects the data and checks the data based on a result of the select statement. For the select see TestData.getBucketData or getStructuredBucketData.
| "{\"writeid\":7,\"bucketid\":537067520,\"rowid\":17}\t17\t17", | ||
| }, | ||
| }; | ||
| verifyRebalance(testDataProvider, tableName, null, expectedBuckets, |
There was a problem hiding this comment.
I thought that the idea of this fix is to have one universal way of validating the result of the rebalance compaction and get rid of the hard-coded data. Why did you keep this? Now we have some tests which using the new way of validation and some tests which using the old way of validation. I don't really like it. We should use one approach to validate the data and use it in all rebalance tests.
There was a problem hiding this comment.
I kept test cases that are not affected by the flakiness problem. For example, in this test case we run a compaction that defines the buckets explicitly (CLUSTERED INTO 4 BUCKETS) so that rebalance compaction always produces the exact same clusters. The flaky test cases are flaky because we cannot guarantee if it ends with 3 or 4 buckets.
|
|
||
| @Test | ||
| public void testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder() throws Exception { | ||
| conf.setBoolVar(HiveConf.ConfVars.COMPACTOR_CRUD_QUERY_BASED, true); |
There was a problem hiding this comment.
Would it make sense to extract these config settings into one place?
There was a problem hiding this comment.
Extracted them into a method.
| "{\"writeid\":1,\"bucketid\":537001984,\"rowid\":3}\t1\t4\ttomorrow", | ||
| }, | ||
| }; | ||
| for(int i = 0; i < 3; i++) { |
There was a problem hiding this comment.
I am just wondering why we need this data validation before the compaction? Do you know anything about the reason? Does it matter how the rows look like before the compaction or the intention here is rather to check if the data is imbalanced?
There was a problem hiding this comment.
The reason is not documented. My personal opinion is checking the data before compaction means we don't trust in a simple insert overwrite statement in Hive.
What do you think? Should we keep the original test logic or remove the checks before compaction?
31b4606 to
8e8085d
Compare
|
8e8085d to
268c47a
Compare
|



Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:
...
There are two causes identified:
What changes were proposed in this pull request?
The goal of the change is to stabilize those tests by doing two things:
Please note: I also refactored the code little bit and extracted rebalance compaction tests into a new class.
Why are the changes needed?
We experienced regular and serious regression issues due to the effect of the orc version number.
Does this PR introduce any user-facing change?
No
How was this patch tested?
With the existing tests.