HIVE-29536: Stabilize rebalance compaction tests by InvisibleProgrammer · Pull Request #6487 · apache/hive

InvisibleProgrammer · 2026-05-15T07:34:17Z

Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:

There are two causes identified:

Firstly, the number of buckets and even the order of the elements inside a bucket depend on the ORC version string: https://issues.apache.org/jira/browse/HIVE-29536?focusedCommentId=18080335&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-18080335 (thanks to @thomasrebele for digging into it)
Secondly, the base directory can change as well (like here: 1a90d27#diff-dedd154465fd42855d9d6710d54553660dae87405ce2e4ea931475de1d5bb816L199)

What changes were proposed in this pull request?

The goal of the change is to stabilize those tests by doing two things:

Rebalance assertions are no longer hard-coded. Instead, we check whether the buckets are balanced and whether all the data is present.
The base folder is located dynamically.

Please note: I also refactored the code a little bit and extracted the rebalance compaction tests into a new class.

Why are the changes needed?

We experienced regular and serious regressions caused by changes in the ORC version number.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With the existing tests.

thomasrebele

Thank you for working on the fix! I've added some suggestions and requests for improving it.

thomasrebele · 2026-05-15T11:58:56Z

+        .reduce(0, Integer::sum);
+
+    int optimalRecordsInBucket = allRecordCount / bucketCount;
+    int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;


See comment https://github.com/apache/hive/pull/6487/changes#r3248007538

zabetak

Many thanks for the PR @InvisibleProgrammer ! Left a bunch of minor comments that we don't necessarily need to address.

However, I would really like to understand which tests should verify data and which tests should verify buckets and how we should choose one or the other.

No need to apply code changes right now just reply to the comments and we can decide how to advance based on the answers.

zabetak · 2026-06-02T09:32:50Z

+    // Check if the compaction succeed
+    verifyCompaction(1, TxnStore.CLEANING_RESPONSE);
+
+    String[][] expectedBuckets = new String[][] {


In some rebalance tests like this one we use explicit buckets (expectedBuckets) along with verifyRebalance, and in others we use just data (expectedData) together with verifyDataAfterCompaction. How do we determine whether a test should use one or the other?

I wanted to minimize the changes and focus on the tests that are flaky. I haven't seen any flakiness in that test case. Honestly, I don't know why it is so stable.

Should I rewrite all the rebalance test cases? That would be a future-proof solution.

zabetak · 2026-06-02T09:33:50Z

+    conf.setBoolVar(HiveConf.ConfVars.HIVE_COMPACTOR_GATHER_STATS, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER, false);
+
+    //set grouping size to have 3 buckets, and re-create driver with the new config


Is the comment still relevant? Are we going to have 3 buckets at the end of the compaction?

No, I'll remove it.

zabetak · 2026-06-02T09:35:54Z

+    /*
+      Validate the data after the test case
+        - the table is balanced (or if not, only numberOfDeletedRows amount of rows are missing
+        - there is only one writeId
+        - buckets has unique bucketId and the bucketId doesn't change inside a bucket
+        - data is sorted by column b (so the order of column a is not predictable)
+        - all the required value present
+     */


This could be a Javadoc comment since it seems to be more than just an implementation detail.

I can move it. As it is a private method in a test class, I don't see the difference.
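To make the id-related bullets in that comment concrete, here is a minimal standalone sketch of how they could be checked. It is not the code from this PR: the class `BucketIdCheckSketch`, the record `Row`, and the method `hasConsistentIds` are hypothetical names, and each inner list is assumed to hold the rows read from one bucket file.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BucketIdCheckSketch {
  // Hypothetical, reduced row model: only the two ids needed for this check.
  record Row(long writeId, long bucketId) {}

  // Returns true when all rows share one writeId, every bucket uses a single
  // bucketId, and no two buckets use the same bucketId.
  static boolean hasConsistentIds(List<List<Row>> buckets) {
    Set<Long> writeIds = new HashSet<>();
    Set<Long> seenBucketIds = new HashSet<>();
    for (List<Row> bucket : buckets) {
      Set<Long> idsInBucket = new HashSet<>();
      for (Row row : bucket) {
        writeIds.add(row.writeId());
        idsInBucket.add(row.bucketId());
      }
      if (idsInBucket.size() > 1) {
        return false; // the bucketId changed inside a bucket
      }
      for (Long id : idsInBucket) {
        if (!seenBucketIds.add(id)) {
          return false; // two buckets share the same bucketId
        }
      }
    }
    return writeIds.size() <= 1;
  }
}
```

The sort-order and data-presence bullets are left out on purpose so that the sketch stays focused on the ids.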

zabetak · 2026-06-02T09:39:05Z

+        fs.globStatus(searchPath, AcidUtils.baseFileFilter))
+        .map(FileStatus::getPath)
+        .map(Path::getName)
+        .sorted()


Can we have more than one base? If yes is it a valid scenario?

I know about two scenarios when we can have multiple base files:

major compaction, after the compaction finished and we are waiting for the cleaner.

partitioned table

As far as I remember, we have no such test case.
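For illustration, the selection step after the glob could look like the sketch below when more than one base directory is present. It works on plain directory names instead of the Hadoop `FileSystem` API so that it stays self-contained. The class `BaseDirLookupSketch` and the method `latestBase` are hypothetical, picking the last name after sorting is an assumption about the intent, and it only gives the newest base if the write ids in the names are zero-padded to the same width.

```java
import java.util.List;
import java.util.Optional;

final class BaseDirLookupSketch {
  // Hypothetical helper: returns the lexicographically last directory name
  // that starts with "base_", or an empty Optional if there is none.
  // Assumption: zero-padded write ids, so name order matches write id order.
  static Optional<String> latestBase(List<String> dirNames) {
    return dirNames.stream()
        .filter(name -> name.startsWith("base_"))
        .sorted()
        .reduce((first, second) -> second); // keep the last element
  }
}
```

Returning an `Optional` forces the caller to decide what a missing base means for the test, instead of failing later with an unrelated exception.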

zabetak · 2026-06-02T09:48:13Z

+    int optimalRecordsInBucket = allRecordCount / bucketCount;
+    int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;
+
+    for (int i = 0; i < bucketCount; i++) {
+      if (bucketData[i].size() > maximumRecordCountInABucket || bucketData[i].size() < optimalRecordsInBucket) {
+        return false;
+      }
+    }


nit: As far as I understand, optimalRecordsInBucket is a lower bound and maximumRecordCountInABucket is an upper bound for the bucket size. Using lowerBound/upperBound naming could make the code a bit easier to follow.

Renamed. My next commit will contain the change.
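A minimal standalone sketch of the same bounds check with the lower/upper naming, working on bucket sizes only. The class `BucketBalanceSketch` and the method `isBalanced` are hypothetical and not part of this PR. The lower bound is the floor of the average bucket size and the upper bound is its ceiling.

```java
import java.util.List;

final class BucketBalanceSketch {
  // Hypothetical helper: true when every bucket size lies between
  // floor(total / bucketCount) and ceil(total / bucketCount), inclusive.
  static boolean isBalanced(List<Integer> bucketSizes) {
    int bucketCount = bucketSizes.size();
    if (bucketCount == 0) {
      return true; // nothing to balance
    }
    int total = bucketSizes.stream().mapToInt(Integer::intValue).sum();
    int lowerBound = total / bucketCount;                     // floor
    int upperBound = (total + bucketCount - 1) / bucketCount; // ceiling
    return bucketSizes.stream()
        .allMatch(size -> size >= lowerBound && size <= upperBound);
  }

  public static void main(String[] args) {
    // 17 rows in 3 buckets: the bounds are 5 and 6.
    System.out.println(isBalanced(List.of(6, 6, 5)));
    System.out.println(isBalanced(List.of(7, 5, 5)));
  }
}
```

With 17 rows and 3 buckets the bounds are 5 and 6, so sizes 6, 6, 5 pass while 7, 5, 5 fail. When the row count is divisible by the bucket count, both bounds coincide and every bucket must have exactly the same size.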

zabetak · 2026-06-02T09:53:13Z

+  record RowData(String colA, Long colB) {}
+
+  record RowInfo(long writeId, long bucketId, long rowId, RowData rowData) {
+    private static final ObjectMapper MAPPER = new ObjectMapper();
+
+    static RowInfo fromRawString(String row) throws JsonProcessingException {
+      // Example row data to parse: "{\"writeid\":7,\"bucketid\":537001984,\"rowid\":10}\t5\t4",
+
+      String[] parts = row.split("\t", 3);
+
+      JsonNode json = MAPPER.readTree(parts[0]);
+
+      return new RowInfo(
+          json.get("writeid").asLong(),
+          json.get("bucketid").asLong(),
+          json.get("rowid").asLong(),
+          new RowData(
+              parts[1], // colA
+              Long.parseLong(parts[2])  // colB
+          )
+      );
+    }
+  }


These record classes could potentially be used by other compaction tests, but putting them here makes them a bit harder to find. Possibly a better fit would be TestDataProvider, associated with APIs that return RowInfo objects instead of strings. Anyway, just an idea; I am fine with leaving them here as well.

Interesting idea. Let me check whether sharing this code gives us an immediate gain (that is, whether we can use it in some other existing tests).
Moving code to a common place right after introducing it is tricky: if we are right, we end up with something that people use and re-use. If we are wrong, we create 'shared' classes, DTOs, and utility classes with only one usage, and make the code harder to understand and maintain.

Refactored. The next commit will contain the change.

zabetak · 2026-06-02T09:54:40Z

+    expectedData.addAll(List.of(
+        new RowData("6", 4L),
+        new RowData("3", 4L),
+        new RowData("4", 4L),
+        new RowData("2", 4L),
+        new RowData("5", 4L)
+    ));


It's a bit strange that for some data we use Set#add directly and for other data we go through Set#addAll and List.

I'm pretty sure they are leftovers from various attempts. I went through a couple of iterations while trying to decide on the proper data structure for those assertions. Let me check them.

kuczoram · 2026-06-01T12:35:59Z

+    AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf);
+
+    /*
+      Validate the data after the test case


The rowId should be checked as well. It has to be increasing within a file, otherwise the delete operation won't work.

Done. The next commit will contain the change.
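A minimal sketch of such a rowId check for the rows of a single file. The class `RowIdOrderSketch` and the method `isStrictlyIncreasing` are hypothetical and not taken from this PR. Requiring a strictly increasing sequence (no repeated rowId) is an assumption; the comment above only says "increasing".

```java
import java.util.List;

final class RowIdOrderSketch {
  // Hypothetical helper: true when each rowId is greater than the previous
  // one, in the order the rows were read from one file.
  static boolean isStrictlyIncreasing(List<Long> rowIds) {
    for (int i = 1; i < rowIds.size(); i++) {
      if (rowIds.get(i) <= rowIds.get(i - 1)) {
        return false; // repeated or decreasing rowId
      }
    }
    return true;
  }
}
```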

kuczoram · 2026-06-01T12:37:11Z

+    verifyCompaction(1, TxnStore.CLEANING_RESPONSE);
+
+    // Populate expected data
+    Set<RowData> expectedData = new HashSet<>();


Why do you hard-code the expected values? Why not just run a select before and after the compaction and compare the results?

We had a discussion about this before. That solution is the closest to checking the data files themselves. And actually, this method does run a select and checks the data based on the result of that select statement. For the select, see TestData.getBucketData or getStructuredBucketData.
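To illustrate the order-independent part of that validation, here is a minimal sketch that compares a set of expected rows with the rows found across all buckets. The class `DataPresenceSketch` and the method `containsExactly` are hypothetical. The `RowData` record mirrors the one quoted earlier in this review, and the expected rows are assumed to be unique.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DataPresenceSketch {
  // Same shape as the RowData record quoted in this review.
  record RowData(String colA, Long colB) {}

  // Hypothetical helper: true when the buckets together contain every
  // expected row exactly once, regardless of bucket or row order.
  static boolean containsExactly(Set<RowData> expected, List<List<RowData>> buckets) {
    Set<RowData> actual = new HashSet<>();
    int rowCount = 0;
    for (List<RowData> bucket : buckets) {
      actual.addAll(bucket);
      rowCount += bucket.size();
    }
    // The row count comparison catches rows that appear more than once.
    return rowCount == expected.size() && actual.equals(expected);
  }
}
```

Because records get value-based `equals` and `hashCode`, the comparison does not depend on how many buckets the compaction produced.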

kuczoram · 2026-06-01T12:39:50Z

+            "{\"writeid\":7,\"bucketid\":537067520,\"rowid\":17}\t17\t17",
+        },
+    };
+    verifyRebalance(testDataProvider, tableName, null, expectedBuckets,


I thought that the idea of this fix was to have one universal way of validating the result of the rebalance compaction and to get rid of the hard-coded data. Why did you keep this? Now we have some tests that use the new way of validation and some tests that use the old way. I don't really like it. We should choose one approach to validate the data and use it in all rebalance tests.

I kept the test cases that are not affected by the flakiness problem. For example, in this test case we run a compaction that defines the buckets explicitly (CLUSTERED INTO 4 BUCKETS), so the rebalance compaction always produces exactly the same buckets. The flaky test cases are flaky because we cannot guarantee whether the compaction ends with 3 or 4 buckets.

kuczoram · 2026-06-01T12:41:30Z

+
+  @Test
+  public void testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder() throws Exception {
+    conf.setBoolVar(HiveConf.ConfVars.COMPACTOR_CRUD_QUERY_BASED, true);


Would it make sense to extract these config settings into one place?

Extracted them into a method.

kuczoram · 2026-06-01T12:47:15Z

+            "{\"writeid\":1,\"bucketid\":537001984,\"rowid\":3}\t1\t4\ttomorrow",
+        },
+    };
+    for(int i = 0; i < 3; i++) {


I am just wondering why we need this data validation before the compaction. Do you know anything about the reason? Does it matter what the rows look like before the compaction, or is the intention here rather to check that the data is imbalanced?

The reason is not documented. My personal opinion is that checking the data before compaction means we don't trust a simple insert overwrite statement in Hive.
What do you think? Should we keep the original test logic or remove the checks before compaction?

sonarqubecloud · 2026-06-10T02:27:43Z

Quality Gate passed

Issues
7 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sonarqubecloud · 2026-06-15T10:13:27Z

Quality Gate passed

Issues
7 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added tests pending tests passed and removed tests pending labels May 15, 2026

thomasrebele suggested changes May 15, 2026

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels May 20, 2026

thomasrebele suggested changes May 21, 2026

Comment thread .../hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestRebalanceCompactor.java

Comment thread .../hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestRebalanceCompactor.java Outdated

asf-ci-hive added tests passed tests pending and removed tests pending tests passed labels May 21, 2026

InvisibleProgrammer force-pushed the fix_rebalance_tests_flakyness branch from bba153e to 9f5cedf Compare May 26, 2026 13:33

asf-ci-hive added tests failed and removed tests pending labels May 26, 2026

thomasrebele approved these changes May 26, 2026

asf-ci-hive added tests pending tests unstable and removed tests failed tests pending labels May 26, 2026

zabetak reviewed Jun 2, 2026

kuczoram reviewed Jun 3, 2026

asf-ci-hive added tests pending and removed tests unstable labels Jun 9, 2026

InvisibleProgrammer force-pushed the fix_rebalance_tests_flakyness branch from 31b4606 to 8e8085d Compare June 9, 2026 20:34

asf-ci-hive added tests failed and removed tests pending labels Jun 9, 2026

asf-ci-hive added tests pending and removed tests failed labels Jun 10, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 10, 2026

zsmiskolczi added 5 commits June 15, 2026 11:10

HIVE-29536: Stabilize rebalance compaction tests

ebf22d0

Address review comments

48e4f8e

Address SonarQube issues

dd69565

Address review comments

9d1e3cd

Address review comments

268c47a

InvisibleProgrammer force-pushed the fix_rebalance_tests_flakyness branch from 8e8085d to 268c47a Compare June 15, 2026 09:10

asf-ci-hive added tests pending and removed tests passed labels Jun 15, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 15, 2026
