HDDS-15531. DNS refresh on connection failure for Client to OM#10486
Open
kerneltime wants to merge 3 commits into
Open
HDDS-15531. DNS refresh on connection failure for Client to OM#10486kerneltime wants to merge 3 commits into
kerneltime wants to merge 3 commits into
Conversation
This was referenced Jun 11, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces an opt-in “DNS refresh on connection failure” mechanism for the Client → OM Hadoop RPC failover path (and shared infrastructure to be reused by later PRs), addressing stale InetSocketAddress resolutions when OM IPs change under stable hostnames (eg. Kubernetes rescheduling).
Changes:
- Added
ConnectionFailureUtilsand a new opt-in configozone.client.failover.resolve-needed(defaultfalse) to gate DNS refresh behavior to connection-class failures only. - Made
OMProxyInforefreshable: it now preserves the originalhost:portstring, can re-resolve DNS on failures, atomically swaps cached address / dtService, and discards & stops the stale proxy. - Wired refresh into
OMFailoverProxyProviderBase.shouldRetry(with “retry same proxy” pinning after a successful refresh) and updated delegation-token aggregation to recompute after refresh.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/ConnectionFailureUtils.java | New shared classifier for “connection-class” failures (bounded cause-chain walk). |
| hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/utils/TestConnectionFailureUtils.java | Unit tests covering classifier positives/negatives, wrapping, cycles, depth bound. |
| hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java | Adds the ozone.client.failover.resolve-needed config key + default. |
| hadoop-hdds/common/src/main/resources/ozone-default.xml | Documents the new flag and secure-cluster co-config requirement. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java | Stores original address string; adds DNS refresh + proxy discard/stop; synchronizes accessors. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java | Adds opt-in wiring in shouldRetry and the refresh hook/pinning behavior. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFailoverProxyProvider.java | Makes aggregated DT service recomputable after per-node refresh (volatile + hook). |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFollowerReadFailoverProxyProvider.java | Uses synchronized proxy accessor for connection-id lookup under refreshable proxy state. |
| hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMProxyInfoDnsRefresh.java | New tests validating OMProxyInfo.refreshAddressIfChanged() behavior and proxy rebuild usage. |
| hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProviderRefreshWired.java | New “wired” tests proving refresh hook gating and “retry same OM after refresh” pinning. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java | Ensures Ratis peer address uses hostname:port string (no resolved IP baking). |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/ratis/TestOzoneManagerRatisServer.java | Adds test asserting RaftPeer.address preserves hostname form. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMRatisServerImpl.java | Clarifies hostname address intent for SCM Ratis peers (comment-only). |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAManagerImpl.java | Clarifies hostname address intent for SCM HA add/join flow (comment-only). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+160
to
+161
| LOG.warn("Failed to re-resolve OM address {}: {}", rpcAddrStr, | ||
| ex.getMessage()); |
Comment on lines
+114
to
+117
| @VisibleForTesting | ||
| synchronized void setCachedAddressForTest(InetSocketAddress address) { | ||
| this.rpcAddr = address; | ||
| } |
Comment on lines
+202
to
+203
| LOG.warn("Failed to stop stale OM proxy for nodeId {}: {}", | ||
| nodeId, stopEx.getMessage()); |
eb66a5e to
a6b6a3d
Compare
Contributor
Author
|
Addressed Copilot's three findings in the latest force-push:
Also retitled the PR to HDDS-15531 per @szetszwo's subtask request. |
a6b6a3d to
2095b73
Compare
Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address), where the default DnsNameResolver re-resolves hostnames on connection failure. Two of the three pre-existing createRaftPeer paths in OM, and the AddSCMRequest path in SCMHAManagerImpl, were passing new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which bakes the resolved IP into RaftPeer.address. Once baked, Ratis (and gRPC under it) keeps dialing that IP for the channel's lifetime, so peer-pod restarts in Kubernetes never recover until the parent process is restarted. Switch every createRaftPeer / AddSCMRequest call to pass the hostname:port string. Collapse the two OzoneManagerRatisServer overloads into one. Replace the misleading "// TODO : Should we use IP instead of hostname??" comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with explanatory comments citing HDDS-15514. Add testCreateRaftPeerUsesHostnameAddress to assert the contract: RaftPeer.address must NEVER be an IPv4 numeric form. This catches any future regression that re-introduces InetSocketAddress at this seam. This is the first of four PRs splitting HDDS-15514 along its natural code-path boundaries. No flag, no exception classifier, and no atomic swap machinery in this PR -- those land with the proxy-provider PRs that follow.
OMProxyInfo constructs an InetSocketAddress at process start and reuses
it for the proxy's lifetime. InetSocketAddress freezes the resolved IP
at construction; when an OM pod is rescheduled to a new IP under a
stable DNS name (Kubernetes), every subsequent client RPC dials the
gone-away IP forever and only a process restart recovers.
Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag
(ozone.client.failover.resolve-needed, default false).
Shared infrastructure (used by subsequent PRs in this series):
- ConnectionFailureUtils: classifies a Throwable's cause chain
(depth-bounded to 16) as a connection-class failure. Connection
types: ConnectException, SocketTimeoutException,
NoRouteToHostException, UnknownHostException, EOFException,
SocketException. Application errors (OMException, OMNotLeaderEx,
AccessControlException, RetryAction-coded responses) are NOT
classified as connection failures, so DNS load is not amplified
by logical errors.
- ozone.client.failover.resolve-needed flag.
Client -> OM Hadoop RPC mechanism:
- OMProxyInfo preserves the original host:port string and
refreshAddressIfChanged() re-resolves it outside the entry
monitor; on IP change, atomically swaps the cached
InetSocketAddress / dtService / proxy=null under the monitor;
stops the stale proxy via RPC.stopProxy outside the monitor.
- OMFailoverProxyProviderBase.shouldRetry calls the refresh on
connection-class exceptions only when the flag is on. On a
successful refresh, returns FAILOVER_AND_RETRY but pins
nextProxyIndex to the current node so RetryInvocationHandler
does NOT skip past the just-refreshed peer.
- HadoopRpcOMFailoverProxyProvider and the follower-read variant
pass the preserved hostname string to OMProxyInfo at
construction.
OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on
the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC
recovery is a free transitive benefit of this PR. The gRPC OM client
(GrpcOMFailoverProxyProvider) was already correct (placeholder
InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and
is unchanged.
Secure-cluster prerequisite documented inline in ozone-default.xml:
when this flag is true on a Kerberos cluster, operators must also set
hadoop.security.token.service.use_ip=false in core-site.xml. Same
prerequisite HADOOP-17068 carries: the Hadoop delegation-token service
ID defaults to IP:port and would silently fail token selection after
a refresh without that co-config.
This is PR 2 of 4 splitting HDDS-15514 along its natural code-path
boundaries. PR-1 (Ratis hostname-only fix) is the merge base.
Subsequent PRs:
- PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo).
- PR-4: DN -> SCM heartbeat (EndpointStateMachine /
SCMConnectionManager / StateContext).
Tests:
- TestConnectionFailureUtils (new, 20 tests): bare types,
IOException-wrapped, deeply nested chains (3 levels), application
negative cases, length-2 cause cycles (terminates), 1024-deep
non-matching chains (cost bound).
- TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached
proxy, swap on IP change, rebuilt proxy uses freshly-resolved
address, dtService updates. Uses a @VisibleForTesting setter.
- TestOMFailoverProxyProviderRefreshWired (new, 5 tests):
SocketTimeoutException triggers refresh (the AWS EC2 silent-drop
case end-to-end); ConnectException triggers refresh; OMException
does NOT; flag-off does NOT; nextProxyIndex stays pinned after
successful refresh.
Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1)
verified non-regressed.
2095b73 to
37500bf
Compare
adoroszlai
reviewed
Jun 13, 2026
Contributor
Author
|
Thank you @adoroszlai |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This is PR 2 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).
This PR fixes the Client → OM Hadoop-RPC path and introduces the shared infrastructure (connection-failure classifier, opt-in flag) that the remaining proxy-provider PRs in this series will reuse.
Why this matters
OMProxyInfo— the per-OM peer record insideOMFailoverProxyProvider— constructs anInetSocketAddressfrom the configuredhost:portonce at process start, then re-uses it for the proxy's lifetime.InetSocketAddressperforms a one-shot DNS lookup at construction and freezes the resolved IP. When an OM pod is rescheduled in Kubernetes (or any environment where pod IPs change while DNS names remain stable), every subsequent client RPC dials the now-defunct IP forever. The failover ring walks past the broken peer to the next OM, but the broken peer never recovers without a process restart.Note: the gRPC OM-client variant (
GrpcOMFailoverProxyProvider) already passes a placeholderInetSocketAddress(0)and lets gRPC'sNameResolverre-resolve hostnames on its own schedule — so it is unaffected by this bug and unchanged in this PR.What this PR does
1. Shared infrastructure
org.apache.hadoop.hdds.utils.ConnectionFailureUtilsThrowable(with cause-chain walk bounded to depth 16) as a connection-class failure or not. Connection-class types:ConnectException,SocketTimeoutException,NoRouteToHostException,UnknownHostException,EOFException,SocketException. Application-level errors (OMException,OMNotLeaderException,AccessControlException,RetryAction-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical failures.ozone.client.failover.resolve-neededfalseso existing non-K8s deployments see zero behaviour change. Mirrorsdfs.client.failover.resolve-needed(HDFS-14118) andhbase.resolve.hostnames.on.failure.The flag and classifier ship with this PR but are also used (in subsequent PRs) by the SCM proxy provider and the DN heartbeat task. Landing them with the OM PR follows @szetszwo's "OM first" guidance.
2.
OMProxyInfobecomes refreshablehost:portstring (rpcAddrStr) alongside the resolvedInetSocketAddressInetSocketAddressis now a derived cache.refreshAddressIfChanged()re-resolvesrpcAddrStroutside the entry monitor, then atomically swaps the cached address, the deriveddtService, and nulls the cachedproxyunder the monitordtServicefirst means aSecurityUtil.buildTokenServicethrow (rare but possible) aborts cleanly with no half-swap.RPC.stopProxy(staleProxy)outside the monitorsetCachedAddressForTest(@VisibleForTesting)3.
OMFailoverProxyProviderBase.shouldRetrycalls the refresh on connection-class exceptionsAfter the existing
shouldFailoverfilter and before the failover-index advance, when the flag is enabled andConnectionFailureUtils.isConnectionFailure(exception)matches, the provider:OMProxyInfo.refreshAddressIfChanged()for the current peer.FAILOVER_AND_RETRYbut pinsnextProxyIndexto the current node soRetryInvocationHandler.performFailover()does NOT advance past the just-refreshed peer. Without this pin, an N-peer HA cluster would skip the now-fixed peer for up to N-1 attempts.4.
HadoopRpcOMFailoverProxyProviderwiringPass the preserved hostname string into
OMProxyInfoat construction.HadoopRpcOMFollowerReadFailoverProxyProvideradjusts to match.Atomic-replace pattern
Every refresh path in this series follows the same ordering: build the new resource → atomically install it → close the old resource. Never remove-then-build (a build failure would leave the system without a peer). Never close-then-build (same). For
OMProxyInfo.refreshAddressIfChanged:rpcAddrStr(no lock).dtService(no lock).Secure-cluster prerequisite
When
ozone.client.failover.resolve-needed=trueis enabled on a Kerberos-secured cluster, operators must also sethadoop.security.token.service.use_ip=falseincore-site.xml. The Hadoop delegation-token service identifier defaults to anIP:portstring; after a refresh, the per-OM service identifier built from the new IP no longer matches the IP-based service captured on long-lived tokens, and token selection silently fails for the refreshed peer. Withuse_ip=falsethe service identifier is the stablehostname:port, which survives any IP change. Same prerequisite HADOOP-17068 carries. Documented inline inozone-default.xml.How was this patch tested?
TestConnectionFailureUtils(new)IOException-wrapped, deeply nested cause chains (3 levels), application-level negative cases, length-2 cause-chain cycles (terminates), 1024-deep non-matching chains (cost bound).TestOMProxyInfoDnsRefresh(new)dtServiceupdates. Uses the@VisibleForTestingsetter rather than reflection.TestOMFailoverProxyProviderRefreshWired(new)SocketTimeoutExceptiontriggersmaybeRefreshCurrentOmAddress(the AWS EC2 silent-drop case end-to-end).ConnectExceptiontriggers refresh.OMExceptiondoes NOT. Flag-off does NOT. After a successful refresh,performFailoverstays on the samenodeId.Existing regression suites confirmed non-regressed:
TestOMFailoverProxyProvider(8),TestOMFailovers(1).Scope of this PR
ConnectionFailureUtilsand theozone.client.failover.resolve-neededflag.OMInterServiceProtocol) rides on the sameOMFailoverProxyProvider, so OM-to-OM Hadoop-RPC recovery is also fixed by this PR.SCMFailoverProxyProviderBase,SCMProxyInfo). Same shape, different provider.EndpointStateMachine,SCMConnectionManager,StateContext). Brings theozone.datanode.scm.heartbeat.address.refresh.thresholdknob.What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-15514