HDDS-15531. DNS refresh on connection failure for Client to OM by kerneltime · Pull Request #10486 · apache/ozone

kerneltime · 2026-06-11T05:06:19Z

What changes were proposed in this pull request?

This is PR 2 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).

This PR fixes the Client → OM Hadoop-RPC path and introduces the shared infrastructure (connection-failure classifier, opt-in flag) that the remaining proxy-provider PRs in this series will reuse.

Stacked on #10485 (PR-1 of 4 — Ratis peer addresses as hostnames). Please review PR-1 first; that branch is the merge base for this PR. The diff against master includes PR-1's hunks until PR-1 lands.

Why this matters

OMProxyInfo — the per-OM peer record inside OMFailoverProxyProvider — constructs an InetSocketAddress from the configured host:port once at process start, then re-uses it for the proxy's lifetime. InetSocketAddress performs a one-shot DNS lookup at construction and freezes the resolved IP. When an OM pod is rescheduled in Kubernetes (or any environment where pod IPs change while DNS names remain stable), every subsequent client RPC dials the now-defunct IP forever. The failover ring walks past the broken peer to the next OM, but the broken peer never recovers without a process restart.

Note: the gRPC OM-client variant (GrpcOMFailoverProxyProvider) already passes a placeholder InetSocketAddress(0) and lets gRPC's NameResolver re-resolve hostnames on its own schedule — so it is unaffected by this bug and unchanged in this PR.

What this PR does

1. Shared infrastructure

Artifact	Purpose
`org.apache.hadoop.hdds.utils.ConnectionFailureUtils`	Classifies a `Throwable` (with cause-chain walk bounded to depth 16) as a connection-class failure or not. Connection-class types: `ConnectException`, `SocketTimeoutException`, `NoRouteToHostException`, `UnknownHostException`, `EOFException`, `SocketException`. Application-level errors (`OMException`, `OMNotLeaderException`, `AccessControlException`, `RetryAction`-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical failures.
`ozone.client.failover.resolve-needed`	Single opt-in flag for the entire DNS-refresh-on-failure feature, defaulting `false` so existing non-K8s deployments see zero behaviour change. Mirrors `dfs.client.failover.resolve-needed` (HDFS-14118) and `hbase.resolve.hostnames.on.failure`.

The flag and classifier ship with this PR but are also used (in subsequent PRs) by the SCM proxy provider and the DN heartbeat task. Landing them with the OM PR follows @szetszwo's "OM first" guidance.

2. `OMProxyInfo` becomes refreshable

Change	Why
Preserves the original `host:port` string (`rpcAddrStr`) alongside the resolved `InetSocketAddress`	The string is the source of truth for re-resolution; the `InetSocketAddress` is now a derived cache.
`refreshAddressIfChanged()` re-resolves `rpcAddrStr` outside the entry monitor, then atomically swaps the cached address, the derived `dtService`, and nulls the cached `proxy` under the monitor	DNS lookup must not happen under any lock (a slow/dead resolver would freeze concurrent readers). Building `dtService` first means a `SecurityUtil.buildTokenService` throw (rare but possible) aborts cleanly with no half-swap.
Old proxy is stopped via `RPC.stopProxy(staleProxy)` outside the monitor	Avoid blocking concurrent readers behind socket teardown.
`setCachedAddressForTest` (`@VisibleForTesting`)	Test seam to inject a deliberately stale cached address without a real OM peer.

3. `OMFailoverProxyProviderBase.shouldRetry` calls the refresh on connection-class exceptions

After the existing shouldFailover filter and before the failover-index advance, when the flag is enabled and ConnectionFailureUtils.isConnectionFailure(exception) matches, the provider:

Calls OMProxyInfo.refreshAddressIfChanged() for the current peer.
On a successful refresh, returns FAILOVER_AND_RETRY but pins nextProxyIndex to the current node so RetryInvocationHandler.performFailover() does NOT advance past the just-refreshed peer. Without this pin, an N-peer HA cluster would skip the now-fixed peer for up to N-1 attempts.

4. `HadoopRpcOMFailoverProxyProvider` wiring

Pass the preserved hostname string into OMProxyInfo at construction. HadoopRpcOMFollowerReadFailoverProxyProvider adjusts to match.

Atomic-replace pattern

Every refresh path in this series follows the same ordering: build the new resource → atomically install it → close the old resource. Never remove-then-build (a build failure would leave the system without a peer). Never close-then-build (same). For OMProxyInfo.refreshAddressIfChanged:

Re-resolve rpcAddrStr (no lock).
Build the new dtService (no lock).
Capture the stale proxy reference, swap the address/dtService/proxy=null fields under the entry monitor.
Stop the stale proxy outside the monitor.

Secure-cluster prerequisite

When ozone.client.failover.resolve-needed=true is enabled on a Kerberos-secured cluster, operators must also set hadoop.security.token.service.use_ip=false in core-site.xml. The Hadoop delegation-token service identifier defaults to an IP:port string; after a refresh, the per-OM service identifier built from the new IP no longer matches the IP-based service captured on long-lived tokens, and token selection silently fails for the refreshed peer. With use_ip=false the service identifier is the stable hostname:port, which survives any IP change. Same prerequisite HADOOP-17068 carries. Documented inline in ozone-default.xml.

How was this patch tested?

Test class	Count	Coverage
`TestConnectionFailureUtils` (new)	20	Filter classification: bare types, `IOException`-wrapped, deeply nested cause chains (3 levels), application-level negative cases, length-2 cause-chain cycles (terminates), 1024-deep non-matching chains (cost bound).
`TestOMProxyInfoDnsRefresh` (new)	4	Per-instance refresh: no-op preserves cached proxy, swap on IP change, rebuilt proxy is dialed with the freshly-resolved address (not a sentinel), `dtService` updates. Uses the `@VisibleForTesting` setter rather than reflection.
`TestOMFailoverProxyProviderRefreshWired` (new)	5	End-to-end retry path. `SocketTimeoutException` triggers `maybeRefreshCurrentOmAddress` (the AWS EC2 silent-drop case end-to-end). `ConnectException` triggers refresh. `OMException` does NOT. Flag-off does NOT. After a successful refresh, `performFailover` stays on the same `nodeId`.

Existing regression suites confirmed non-regressed: TestOMFailoverProxyProvider (8), TestOMFailovers (1).

Scope of this PR

Includes: Client → OM Hadoop RPC. Includes the shared ConnectionFailureUtils and the ozone.client.failover.resolve-needed flag.
Free transitive benefit: OM ↔ OM control-plane RPC (OMInterServiceProtocol) rides on the same OMFailoverProxyProvider, so OM-to-OM Hadoop-RPC recovery is also fixed by this PR.
Out of scope (later PRs):
- PR-3: OM → SCM (SCMFailoverProxyProviderBase, SCMProxyInfo). Same shape, different provider.
- PR-4: DN → SCM heartbeat (EndpointStateMachine, SCMConnectionManager, StateContext). Brings the ozone.datanode.scm.heartbeat.address.refresh.threshold knob.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15514

Copilot

Pull request overview

This PR introduces an opt-in “DNS refresh on connection failure” mechanism for the Client → OM Hadoop RPC failover path (and shared infrastructure to be reused by later PRs), addressing stale InetSocketAddress resolutions when OM IPs change under stable hostnames (eg. Kubernetes rescheduling).

Changes:

Added ConnectionFailureUtils and a new opt-in config ozone.client.failover.resolve-needed (default false) to gate DNS refresh behavior to connection-class failures only.
Made OMProxyInfo refreshable: it now preserves the original host:port string, can re-resolve DNS on failures, atomically swaps cached address / dtService, and discards & stops the stale proxy.
Wired refresh into OMFailoverProxyProviderBase.shouldRetry (with “retry same proxy” pinning after a successful refresh) and updated delegation-token aggregation to recompute after refresh.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/ConnectionFailureUtils.java	New shared classifier for “connection-class” failures (bounded cause-chain walk).
hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/utils/TestConnectionFailureUtils.java	Unit tests covering classifier positives/negatives, wrapping, cycles, depth bound.
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java	Adds the `ozone.client.failover.resolve-needed` config key + default.
hadoop-hdds/common/src/main/resources/ozone-default.xml	Documents the new flag and secure-cluster co-config requirement.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java	Stores original address string; adds DNS refresh + proxy discard/stop; synchronizes accessors.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java	Adds opt-in wiring in `shouldRetry` and the refresh hook/pinning behavior.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFailoverProxyProvider.java	Makes aggregated DT service recomputable after per-node refresh (`volatile` + hook).
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFollowerReadFailoverProxyProvider.java	Uses synchronized proxy accessor for connection-id lookup under refreshable proxy state.
hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMProxyInfoDnsRefresh.java	New tests validating `OMProxyInfo.refreshAddressIfChanged()` behavior and proxy rebuild usage.
hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProviderRefreshWired.java	New “wired” tests proving refresh hook gating and “retry same OM after refresh” pinning.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java	Ensures Ratis peer address uses `hostname:port` string (no resolved IP baking).
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/ratis/TestOzoneManagerRatisServer.java	Adds test asserting `RaftPeer.address` preserves hostname form.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMRatisServerImpl.java	Clarifies hostname address intent for SCM Ratis peers (comment-only).
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAManagerImpl.java	Clarifies hostname address intent for SCM HA add/join flow (comment-only).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+      LOG.warn("Failed to re-resolve OM address {}: {}", rpcAddrStr,
+          ex.getMessage());


+  @VisibleForTesting
+  synchronized void setCachedAddressForTest(InetSocketAddress address) {
+    this.rpcAddr = address;
+  }


+        LOG.warn("Failed to stop stale OM proxy for nodeId {}: {}",
+            nodeId, stopEx.getMessage());


kerneltime · 2026-06-12T06:12:33Z

Addressed Copilot's three findings in the latest force-push:

OMProxyInfo.setCachedAddressForTest now rejects null via Objects.requireNonNull, so a null injection surfaces immediately as a test bug rather than as a confusing NPE inside refreshAddressIfChanged.
The LOG.warn at the IllegalArgumentException catch and the one at RPC.stopProxy failure now pass the exception object (not .getMessage()), so SLF4J emits the full stack trace.
testRefreshUpdatesDelegationTokenService was indeed vacuous: the original assertion would pass even if the refresh code forgot to rebuild dtService. Added a setCachedDtServiceForTest hook, stale the field with a sentinel value before refresh, and assert post-refresh that dtService equals SecurityUtil.buildTokenService(info.getAddress()). A regression that skipped the dtService swap would now fail this test.

Also retitled the PR to HDDS-15531 per @szetszwo's subtask request.

Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address), where the default DnsNameResolver re-resolves hostnames on connection failure. Two of the three pre-existing createRaftPeer paths in OM, and the AddSCMRequest path in SCMHAManagerImpl, were passing new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which bakes the resolved IP into RaftPeer.address. Once baked, Ratis (and gRPC under it) keeps dialing that IP for the channel's lifetime, so peer-pod restarts in Kubernetes never recover until the parent process is restarted. Switch every createRaftPeer / AddSCMRequest call to pass the hostname:port string. Collapse the two OzoneManagerRatisServer overloads into one. Replace the misleading "// TODO : Should we use IP instead of hostname??" comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with explanatory comments citing HDDS-15514. Add testCreateRaftPeerUsesHostnameAddress to assert the contract: RaftPeer.address must NEVER be an IPv4 numeric form. This catches any future regression that re-introduces InetSocketAddress at this seam. This is the first of four PRs splitting HDDS-15514 along its natural code-path boundaries. No flag, no exception classifier, and no atomic swap machinery in this PR -- those land with the proxy-provider PRs that follow.

@VisibleForTesting

OMProxyInfo constructs an InetSocketAddress at process start and reuses it for the proxy's lifetime. InetSocketAddress freezes the resolved IP at construction; when an OM pod is rescheduled to a new IP under a stable DNS name (Kubernetes), every subsequent client RPC dials the gone-away IP forever and only a process restart recovers. Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag (ozone.client.failover.resolve-needed, default false). Shared infrastructure (used by subsequent PRs in this series): - ConnectionFailureUtils: classifies a Throwable's cause chain (depth-bounded to 16) as a connection-class failure. Connection types: ConnectException, SocketTimeoutException, NoRouteToHostException, UnknownHostException, EOFException, SocketException. Application errors (OMException, OMNotLeaderEx, AccessControlException, RetryAction-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical errors. - ozone.client.failover.resolve-needed flag. Client -> OM Hadoop RPC mechanism: - OMProxyInfo preserves the original host:port string and refreshAddressIfChanged() re-resolves it outside the entry monitor; on IP change, atomically swaps the cached InetSocketAddress / dtService / proxy=null under the monitor; stops the stale proxy via RPC.stopProxy outside the monitor. - OMFailoverProxyProviderBase.shouldRetry calls the refresh on connection-class exceptions only when the flag is on. On a successful refresh, returns FAILOVER_AND_RETRY but pins nextProxyIndex to the current node so RetryInvocationHandler does NOT skip past the just-refreshed peer. - HadoopRpcOMFailoverProxyProvider and the follower-read variant pass the preserved hostname string to OMProxyInfo at construction. OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC recovery is a free transitive benefit of this PR. The gRPC OM client (GrpcOMFailoverProxyProvider) was already correct (placeholder InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and is unchanged. Secure-cluster prerequisite documented inline in ozone-default.xml: when this flag is true on a Kerberos cluster, operators must also set hadoop.security.token.service.use_ip=false in core-site.xml. Same prerequisite HADOOP-17068 carries: the Hadoop delegation-token service ID defaults to IP:port and would silently fail token selection after a refresh without that co-config. This is PR 2 of 4 splitting HDDS-15514 along its natural code-path boundaries. PR-1 (Ratis hostname-only fix) is the merge base. Subsequent PRs: - PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo). - PR-4: DN -> SCM heartbeat (EndpointStateMachine / SCMConnectionManager / StateContext). Tests: - TestConnectionFailureUtils (new, 20 tests): bare types, IOException-wrapped, deeply nested chains (3 levels), application negative cases, length-2 cause cycles (terminates), 1024-deep non-matching chains (cost bound). - TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached proxy, swap on IP change, rebuilt proxy uses freshly-resolved address, dtService updates. Uses a @VisibleForTesting setter. - TestOMFailoverProxyProviderRefreshWired (new, 5 tests): SocketTimeoutException triggers refresh (the AWS EC2 silent-drop case end-to-end); ConnectException triggers refresh; OMException does NOT; flag-off does NOT; nextProxyIndex stays pinned after successful refresh. Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1) verified non-regressed.

kerneltime · 2026-06-15T07:38:11Z

Thank you @adoroszlai

This was referenced Jun 11, 2026

HDDS-15532. DNS refresh on connection failure for OM to SCM #10487

Draft

HDDS-15533. DNS refresh on heartbeat failure for DN to SCM #10488

Draft

HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths #10473

Closed

kerneltime requested review from Copilot, ivandika3 and szetszwo and removed request for szetszwo June 11, 2026 15:27

Copilot started reviewing on behalf of kerneltime June 11, 2026 15:27 View session

kerneltime marked this pull request as ready for review June 11, 2026 15:27

Copilot AI reviewed Jun 11, 2026

View reviewed changes

adoroszlai marked this pull request as draft June 11, 2026 16:31

kerneltime force-pushed the HDDS-15514-client-om-refresh branch from eb66a5e to a6b6a3d Compare June 12, 2026 06:12

kerneltime changed the title ~~HDDS-15514. DNS refresh on connection failure for Client to OM~~ HDDS-15531. DNS refresh on connection failure for Client to OM Jun 12, 2026

kerneltime force-pushed the HDDS-15514-client-om-refresh branch from a6b6a3d to 2095b73 Compare June 12, 2026 06:20

kerneltime added 2 commits June 11, 2026 23:34

kerneltime force-pushed the HDDS-15514-client-om-refresh branch from 2095b73 to 37500bf Compare June 12, 2026 06:34

kerneltime marked this pull request as ready for review June 13, 2026 17:56

kerneltime requested review from adoroszlai, amaliujia and smengcl June 13, 2026 17:57

kerneltime mentioned this pull request Jun 13, 2026

HDDS-15514. Pass Ratis peer addresses as hostnames #10485

Merged

adoroszlai reviewed Jun 13, 2026

View reviewed changes

Comment thread hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/ConnectionFailureUtils.java Outdated

Fix javadoc

78a3321

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-15531. DNS refresh on connection failure for Client to OM#10486

HDDS-15531. DNS refresh on connection failure for Client to OM#10486
kerneltime wants to merge 3 commits into
apache:masterfrom
kerneltime:HDDS-15514-client-om-refresh

kerneltime commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

kerneltime commented Jun 12, 2026

Uh oh!

Uh oh!

kerneltime commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		LOG.warn("Failed to re-resolve OM address {}: {}", rpcAddrStr,
		ex.getMessage());

		LOG.warn("Failed to stop stale OM proxy for nodeId {}: {}",
		nodeId, stopEx.getMessage());

Conversation

kerneltime commented Jun 11, 2026

What changes were proposed in this pull request?

Why this matters

What this PR does

1. Shared infrastructure

2. OMProxyInfo becomes refreshable

3. OMFailoverProxyProviderBase.shouldRetry calls the refresh on connection-class exceptions

4. HadoopRpcOMFailoverProxyProvider wiring

Atomic-replace pattern

Secure-cluster prerequisite

How was this patch tested?

Scope of this PR

What is the link to the Apache JIRA?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

kerneltime commented Jun 12, 2026

Uh oh!

Uh oh!

kerneltime commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

2. `OMProxyInfo` becomes refreshable

3. `OMFailoverProxyProviderBase.shouldRetry` calls the refresh on connection-class exceptions

4. `HadoopRpcOMFailoverProxyProvider` wiring