HIVE-29651: Update ZookeeperExternalSessionsRegistryClient to handle … by tanishq-chugh · Pull Request #6528 · apache/hive

tanishq-chugh · 2026-06-08T06:18:12Z

…multiple HiveServer2 instances submitting DAGs concurrently to available Tez External Sessions

What changes were proposed in this pull request?

This PR introduces a distributed locking mechanism to synchronize Tez session assignments across multiple HiveServer2 instances.

Why are the changes needed?

To prevent execution errors when multiple HS2 instances submit DAGs concurrently to the same Tez AMs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + added UTs

tanishq-chugh · 2026-06-08T06:21:58Z

Hi @ayushtkn, could you please help with a review of this patch?
Thanks!

ayushtkn

Thanks @tanishq-chugh, I have dropped some comments.

ayushtkn · 2026-06-08T07:35:25Z

+      try {
+        client.delete().guaranteed().forPath(claimsPath + "/" + appId);
+      } catch (KeeperException.NoNodeException e) {
+        // If the claim Node has already been deleted, we can ignore it.


Can you add a debug log here?

Addressed this in commit - 7f6b08f

ayushtkn · 2026-06-08T07:41:00Z

+
+    try {
+      synchronized (lock) {
+        while (System.nanoTime() < endTimeNs) {


This is wrong and a known anti-pattern: the comparison breaks if endTimeNs goes negative due to overflow.

Right, changed the code to fix this in commit: fdead45
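For readers following the thread, the overflow concern can be illustrated with a minimal sketch. This is not the code from the PR; the class and method names (`DeadlineSketch`, `withinTimeout`, `awaitReady`) are hypothetical. It follows the guidance in the `System.nanoTime()` Javadoc to compare elapsed differences rather than absolute values, since a precomputed `start + timeout` can wrap around to a negative number.

```java
import java.util.concurrent.TimeUnit;

final class DeadlineSketch {
  private final Object lock = new Object();
  private boolean ready; // guarded by lock

  /** Naive form: breaks when startNs + timeoutNs overflows past Long.MAX_VALUE. */
  static boolean naiveBeforeDeadline(long nowNs, long startNs, long timeoutNs) {
    long endTimeNs = startNs + timeoutNs; // may wrap to a negative value
    return nowNs < endTimeNs;
  }

  /** Overflow-safe form: the difference of two nanoTime readings stays correct across wrap-around. */
  static boolean withinTimeout(long nowNs, long startNs, long timeoutNs) {
    return nowNs - startNs < timeoutNs;
  }

  /** Waits until markReady() is called or the timeout elapses; returns false on timeout. */
  boolean awaitReady(long timeout, TimeUnit unit) throws InterruptedException {
    final long timeoutNs = unit.toNanos(timeout);
    final long startNs = System.nanoTime();
    synchronized (lock) {
      while (!ready) {
        long remainingNs = timeoutNs - (System.nanoTime() - startNs);
        if (remainingNs <= 0) {
          return false;
        }
        // wait(0) would block forever, so round sub-millisecond remainders up to 1 ms.
        lock.wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNs)));
      }
      return true;
    }
  }

  void markReady() {
    synchronized (lock) {
      ready = true;
      lock.notifyAll();
    }
  }
}
```

With a start reading close to `Long.MAX_VALUE`, the naive comparison reports a timeout immediately, while the subtraction form still sees only the few nanoseconds that actually elapsed.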

ayushtkn · 2026-06-08T07:52:14Z

    // We never close external sessions that don't have errors.
    try {
      if (externalAppId != null) {
+        LOG.info("Returning external session with appID: {}", externalAppId);


Too much information :-). Please change it to debug.

Made the change in commit: 422c01f

ayushtkn · 2026-06-08T07:56:55Z

+      HiveConf conf = new HiveConf();
+      conf.setVar(ConfVars.HIVE_ZOOKEEPER_QUORUM, connectString);
+      conf.setVar(ConfVars.HIVE_SERVER2_TEZ_EXTERNAL_SESSIONS_NAMESPACE, "/tez_ns_fifo");
+      conf.setIntVar(ConfVars.HIVE_SERVER2_TEZ_EXTERNAL_SESSIONS_WAIT_MAX_ATTEMPTS, 15);


How did you arrive at the magic number 15 here and 5 above? The default is 60; why do we need to change it?

Leftover, removed in commit: 6bab8dc

ayushtkn · 2026-06-08T07:58:25Z

+      ZookeeperExternalSessionsRegistryClient registry3 = new ZookeeperExternalSessionsRegistryClient(conf);
+      try {
+        Future<String> future1 = executor.submit(registry1::getSession);
+        Thread.sleep(500);


This will lead to flaky behaviour, and it does not guarantee FIFO either. Ideally this should use latches, something like:

CountDownLatch r1Started = new CountDownLatch(1);
CountDownLatch r2Started = new CountDownLatch(1);

Future<String> future1 = executor.submit(() -> {
  r1Started.countDown();
  return registry1.getSession();
});
r1Started.await();

Future<String> future2 = executor.submit(() -> {
  r2Started.countDown();
  return registry2.getSession();
});
r2Started.await();

Right, changed the test case in commit: 6bab8dc
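As a self-contained illustration of the latch approach suggested above (not the actual test from the PR), the sketch below replaces `Thread.sleep` with `CountDownLatch` handshakes. The class name `LatchOrderingSketch` and the recorded strings are hypothetical; the registry calls are replaced by simple list appends so the example runs without ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class LatchOrderingSketch {
  /** Runs two tasks and returns the order in which they recorded their start. */
  static List<String> startOrder() {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    List<String> order = new CopyOnWriteArrayList<>();
    CountDownLatch firstStarted = new CountDownLatch(1);
    CountDownLatch secondStarted = new CountDownLatch(1);
    try {
      executor.submit(() -> {
        order.add("first");       // stand-in for the first registry call
        firstStarted.countDown(); // signal only after the step we care about
      });
      firstStarted.await();       // the second task is not submitted before this returns

      executor.submit(() -> {
        order.add("second");      // stand-in for the second registry call
        secondStarted.countDown();
      });
      secondStarted.await();
      return new ArrayList<>(order);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(startOrder());
  }
}
```

One design point worth noting: a latch only proves that the code before `countDown()` has run. If the latch is released before a blocking call such as `getSession()`, it shows the task has started, not that the call has already queued itself, so a real test may need an additional signal from inside the code under test.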

ayushtkn · 2026-06-08T08:04:53Z

+          case CHILD_ADDED:
+            // A Tez AM was claimed by another HS2, so remove the AM from the available list of this particular HS2
+            available.remove(applicationId);
+            break;


Curious about the connection events: if we lose the connection and CONNECTION_RECONNECTED is then sent, are we sure the cache will replay all the missed events and our state will be correct?

Even more curious about the connection LOST case.
Suppose the network was down longer than the session timeout, so ZooKeeper deleted all of your ephemeral claim nodes. If you don't handle LOST, your local taken set will think it still owns the Tez AMs, but other HiveServer2 instances will see the nodes disappear and claim them right out from under you.

Nice catch @ayushtkn!
In the CONNECTION_RECONNECTED case, the cache does replay all the missed events, but while testing I encountered a race condition between two listeners. I have addressed this in commit: 26ef308

Regarding the connection LOST case, I have added logic to kill running DAGs and clear this HS2's taken state at the same time that ZK deletes its ephemeral claim nodes, in the same commit: 26ef308
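The bookkeeping discussed in this thread can be modelled without ZooKeeper. The sketch below is an assumption-laden illustration, not the PR's implementation: `ClaimStateSketch`, its methods, and the `ConnState` enum are hypothetical (the enum stands in for Curator's `ConnectionState`), and actually killing DAGs is left to the caller. It shows why the taken set has to be cleared on LOST: otherwise a later claim-node deletion event would be ignored for an AM this instance no longer owns.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class ClaimStateSketch {
  /** Hypothetical stand-in for Curator's ConnectionState enum. */
  enum ConnState { SUSPENDED, RECONNECTED, LOST }

  private final Object lock = new Object();
  private final Set<String> available = new HashSet<>(); // AMs this HS2 could claim
  private final Set<String> taken = new HashSet<>();     // AMs this HS2 believes it owns

  /** A Tez AM registered itself and is unclaimed. */
  void onSessionRegistered(String appId) {
    synchronized (lock) {
      available.add(appId);
      lock.notifyAll();
    }
  }

  /** Local claim; a real client would also create an ephemeral claim node here. */
  boolean claim(String appId) {
    synchronized (lock) {
      if (!available.remove(appId)) {
        return false;
      }
      taken.add(appId);
      return true;
    }
  }

  /** A claim node appeared, so the AM is no longer available to this instance. */
  void onClaimCreated(String appId) {
    synchronized (lock) {
      available.remove(appId);
    }
  }

  /** A claim node disappeared; the AM becomes available unless we still own it. */
  void onClaimDeleted(String appId) {
    synchronized (lock) {
      if (!taken.contains(appId)) {
        available.add(appId);
        lock.notifyAll();
      }
    }
  }

  /** On LOST, forget ownership and return the affected AMs so the caller can kill their DAGs. */
  Set<String> onConnectionStateChanged(ConnState state) {
    if (state != ConnState.LOST) {
      return Collections.emptySet();
    }
    synchronized (lock) {
      Set<String> released = new HashSet<>(taken);
      taken.clear();
      return released;
    }
  }

  boolean isAvailable(String appId) {
    synchronized (lock) {
      return available.contains(appId);
    }
  }

  boolean isTaken(String appId) {
    synchronized (lock) {
      return taken.contains(appId);
    }
  }
}
```

In this model the released AMs are not put straight back into the available set; they return only when a deletion event for the claim node is observed, which keeps the cache as the single source of truth for availability.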

ayushtkn · 2026-06-08T08:14:57Z

    }
  }
+
+  private final class ClaimsPathListener implements PathChildrenCacheListener {


I am wondering whether we need this class at all. Would something like this be possible?

CuratorCacheListener claimsListener = CuratorCacheListener.builder()
    .forCreates(childData -> {
      if (childData == null) return;
      String applicationId = getApplicationId(childData);
      synchronized (lock) {
        available.remove(applicationId);
      }
    })
    .forDeletes(childData -> {
      if (childData == null) return;
      String applicationId = getApplicationId(childData);
      synchronized (lock) {
        if (!taken.contains(applicationId)) {
          available.add(applicationId);
          lock.notifyAll();
        }
      }
    })
    .build();

Yes, changed this in commit: 484b2e4

sonarqubecloud · 2026-06-12T11:20:37Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added the tests pending label Jun 8, 2026

ayushtkn reviewed Jun 8, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 8, 2026

tanishq-chugh force-pushed the zk-conc-issue branch from 00b5f51 to 26ef308 Compare June 10, 2026 10:18

asf-ci-hive added tests pending, tests failed, tests unstable, tests passed and removed tests passed, tests pending, tests failed, tests unstable labels Jun 10, 2026

tanishq-chugh force-pushed the zk-conc-issue branch from 0b9f372 to 86f90cf Compare June 11, 2026 04:04

asf-ci-hive added tests pending, tests unstable and removed tests passed, tests pending, tests unstable labels Jun 11, 2026

tanishq-chugh added 3 commits June 12, 2026 00:30

HIVE-29651: Update ZookeeperExternalSessionsRegistryClient to handle …

af525a2

…multiple HiveServer2 instances submitting DAGs concurrently to available Tez External Sessions

Fix formatting

cfa2cb8

Logging changes to address review comments

d6bbac3

tanishq-chugh added 9 commits June 12, 2026 00:30

Update timeout calculation in getSession to prevent overflow

c9d89d4

Refactor ClaimsPathListener to use in-place methods instead of a clas…

e829fa5

…s definition

Fix the FIFO test as per the review comment & remove leftover config …

4dfcde1

…change

Address the cases of HS2-ZK connection getting lost / connection gett…

ca84f18

…ing reconnected after being suspended

Change the log line added to debug

9e20093

Fix formatting issues

507f89d

Address SonarQube issues - 1

4a5ec5d

Logic to kill orphan DAGs left behind by crashed HS2

e8f219b

Address Sonarqube - 2

733594b

tanishq-chugh force-pushed the zk-conc-issue branch from 05c2151 to 733594b Compare June 11, 2026 19:01

asf-ci-hive added tests pending, tests unstable and removed tests unstable, tests pending labels Jun 11, 2026

Update Mocks in TestTezTask as per the new code logic & Remove leftov…

5fb5b05

…er from TestZookeeperExternalSessionsRegistryClient

asf-ci-hive added tests pending and removed tests unstable labels Jun 12, 2026

Fix flakiness of newly added UT testFIFOSessionClaimsFromDifferentReg…

644a1ab

…istries

asf-ci-hive added tests passed, tests pending and removed tests pending, tests passed labels Jun 12, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 12, 2026


3 participants