Skip to content

HDDS-15531. DNS refresh on connection failure for Client to OM#10486

Open
kerneltime wants to merge 3 commits into
apache:masterfrom
kerneltime:HDDS-15514-client-om-refresh
Open

HDDS-15531. DNS refresh on connection failure for Client to OM#10486
kerneltime wants to merge 3 commits into
apache:masterfrom
kerneltime:HDDS-15514-client-om-refresh

Conversation

@kerneltime

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This is PR 2 of 4 splitting HDDS-15514 (originally proposed as a single ~160KB patch in #10473, split per @szetszwo's review feedback).

This PR fixes the Client → OM Hadoop-RPC path and introduces the shared infrastructure (connection-failure classifier, opt-in flag) that the remaining proxy-provider PRs in this series will reuse.

Stacked on #10485 (PR-1 of 4 — Ratis peer addresses as hostnames). Please review PR-1 first; that branch is the merge base for this PR. The diff against master includes PR-1's hunks until PR-1 lands.

Why this matters

OMProxyInfo — the per-OM peer record inside OMFailoverProxyProvider — constructs an InetSocketAddress from the configured host:port once at process start, then re-uses it for the proxy's lifetime. InetSocketAddress performs a one-shot DNS lookup at construction and freezes the resolved IP. When an OM pod is rescheduled in Kubernetes (or any environment where pod IPs change while DNS names remain stable), every subsequent client RPC dials the now-defunct IP forever. The failover ring walks past the broken peer to the next OM, but the broken peer never recovers without a process restart.

Note: the gRPC OM-client variant (GrpcOMFailoverProxyProvider) already passes a placeholder InetSocketAddress(0) and lets gRPC's NameResolver re-resolve hostnames on its own schedule — so it is unaffected by this bug and unchanged in this PR.

What this PR does

1. Shared infrastructure

Artifact Purpose
org.apache.hadoop.hdds.utils.ConnectionFailureUtils Classifies a Throwable (with cause-chain walk bounded to depth 16) as a connection-class failure or not. Connection-class types: ConnectException, SocketTimeoutException, NoRouteToHostException, UnknownHostException, EOFException, SocketException. Application-level errors (OMException, OMNotLeaderException, AccessControlException, RetryAction-coded responses) are NOT classified as connection failures, so DNS load is not amplified by logical failures.
ozone.client.failover.resolve-needed Single opt-in flag for the entire DNS-refresh-on-failure feature, defaulting false so existing non-K8s deployments see zero behaviour change. Mirrors dfs.client.failover.resolve-needed (HDFS-14118) and hbase.resolve.hostnames.on.failure.

The flag and classifier ship with this PR but are also used (in subsequent PRs) by the SCM proxy provider and the DN heartbeat task. Landing them with the OM PR follows @szetszwo's "OM first" guidance.

2. OMProxyInfo becomes refreshable

Change Why
Preserves the original host:port string (rpcAddrStr) alongside the resolved InetSocketAddress The string is the source of truth for re-resolution; the InetSocketAddress is now a derived cache.
refreshAddressIfChanged() re-resolves rpcAddrStr outside the entry monitor, then atomically swaps the cached address, the derived dtService, and nulls the cached proxy under the monitor DNS lookup must not happen under any lock (a slow/dead resolver would freeze concurrent readers). Building dtService first means a SecurityUtil.buildTokenService throw (rare but possible) aborts cleanly with no half-swap.
Old proxy is stopped via RPC.stopProxy(staleProxy) outside the monitor Avoid blocking concurrent readers behind socket teardown.
setCachedAddressForTest (@VisibleForTesting) Test seam to inject a deliberately stale cached address without a real OM peer.

3. OMFailoverProxyProviderBase.shouldRetry calls the refresh on connection-class exceptions

After the existing shouldFailover filter and before the failover-index advance, when the flag is enabled and ConnectionFailureUtils.isConnectionFailure(exception) matches, the provider:

  1. Calls OMProxyInfo.refreshAddressIfChanged() for the current peer.
  2. On a successful refresh, returns FAILOVER_AND_RETRY but pins nextProxyIndex to the current node so RetryInvocationHandler.performFailover() does NOT advance past the just-refreshed peer. Without this pin, an N-peer HA cluster would skip the now-fixed peer for up to N-1 attempts.

4. HadoopRpcOMFailoverProxyProvider wiring

Pass the preserved hostname string into OMProxyInfo at construction. HadoopRpcOMFollowerReadFailoverProxyProvider adjusts to match.

Atomic-replace pattern

Every refresh path in this series follows the same ordering: build the new resource → atomically install it → close the old resource. Never remove-then-build (a build failure would leave the system without a peer). Never close-then-build (same). For OMProxyInfo.refreshAddressIfChanged:

  1. Re-resolve rpcAddrStr (no lock).
  2. Build the new dtService (no lock).
  3. Capture the stale proxy reference, swap the address/dtService/proxy=null fields under the entry monitor.
  4. Stop the stale proxy outside the monitor.

Secure-cluster prerequisite

When ozone.client.failover.resolve-needed=true is enabled on a Kerberos-secured cluster, operators must also set hadoop.security.token.service.use_ip=false in core-site.xml. The Hadoop delegation-token service identifier defaults to an IP:port string; after a refresh, the per-OM service identifier built from the new IP no longer matches the IP-based service captured on long-lived tokens, and token selection silently fails for the refreshed peer. With use_ip=false the service identifier is the stable hostname:port, which survives any IP change. Same prerequisite HADOOP-17068 carries. Documented inline in ozone-default.xml.

How was this patch tested?

Test class Count Coverage
TestConnectionFailureUtils (new) 20 Filter classification: bare types, IOException-wrapped, deeply nested cause chains (3 levels), application-level negative cases, length-2 cause-chain cycles (terminates), 1024-deep non-matching chains (cost bound).
TestOMProxyInfoDnsRefresh (new) 4 Per-instance refresh: no-op preserves cached proxy, swap on IP change, rebuilt proxy is dialed with the freshly-resolved address (not a sentinel), dtService updates. Uses the @VisibleForTesting setter rather than reflection.
TestOMFailoverProxyProviderRefreshWired (new) 5 End-to-end retry path. SocketTimeoutException triggers maybeRefreshCurrentOmAddress (the AWS EC2 silent-drop case end-to-end). ConnectException triggers refresh. OMException does NOT. Flag-off does NOT. After a successful refresh, performFailover stays on the same nodeId.

Existing regression suites confirmed non-regressed: TestOMFailoverProxyProvider (8), TestOMFailovers (1).

Scope of this PR

  • Includes: Client → OM Hadoop RPC. Includes the shared ConnectionFailureUtils and the ozone.client.failover.resolve-needed flag.
  • Free transitive benefit: OM ↔ OM control-plane RPC (OMInterServiceProtocol) rides on the same OMFailoverProxyProvider, so OM-to-OM Hadoop-RPC recovery is also fixed by this PR.
  • Out of scope (later PRs):
    • PR-3: OM → SCM (SCMFailoverProxyProviderBase, SCMProxyInfo). Same shape, different provider.
    • PR-4: DN → SCM heartbeat (EndpointStateMachine, SCMConnectionManager, StateContext). Brings the ozone.datanode.scm.heartbeat.address.refresh.threshold knob.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15514

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an opt-in “DNS refresh on connection failure” mechanism for the Client → OM Hadoop RPC failover path (and shared infrastructure to be reused by later PRs), addressing stale InetSocketAddress resolutions when OM IPs change under stable hostnames (eg. Kubernetes rescheduling).

Changes:

  • Added ConnectionFailureUtils and a new opt-in config ozone.client.failover.resolve-needed (default false) to gate DNS refresh behavior to connection-class failures only.
  • Made OMProxyInfo refreshable: it now preserves the original host:port string, can re-resolve DNS on failures, atomically swaps cached address / dtService, and discards & stops the stale proxy.
  • Wired refresh into OMFailoverProxyProviderBase.shouldRetry (with “retry same proxy” pinning after a successful refresh) and updated delegation-token aggregation to recompute after refresh.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/ConnectionFailureUtils.java New shared classifier for “connection-class” failures (bounded cause-chain walk).
hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/utils/TestConnectionFailureUtils.java Unit tests covering classifier positives/negatives, wrapping, cycles, depth bound.
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java Adds the ozone.client.failover.resolve-needed config key + default.
hadoop-hdds/common/src/main/resources/ozone-default.xml Documents the new flag and secure-cluster co-config requirement.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java Stores original address string; adds DNS refresh + proxy discard/stop; synchronizes accessors.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java Adds opt-in wiring in shouldRetry and the refresh hook/pinning behavior.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFailoverProxyProvider.java Makes aggregated DT service recomputable after per-node refresh (volatile + hook).
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/HadoopRpcOMFollowerReadFailoverProxyProvider.java Uses synchronized proxy accessor for connection-id lookup under refreshable proxy state.
hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMProxyInfoDnsRefresh.java New tests validating OMProxyInfo.refreshAddressIfChanged() behavior and proxy rebuild usage.
hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMFailoverProxyProviderRefreshWired.java New “wired” tests proving refresh hook gating and “retry same OM after refresh” pinning.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java Ensures Ratis peer address uses hostname:port string (no resolved IP baking).
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/ratis/TestOzoneManagerRatisServer.java Adds test asserting RaftPeer.address preserves hostname form.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMRatisServerImpl.java Clarifies hostname address intent for SCM Ratis peers (comment-only).
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAManagerImpl.java Clarifies hostname address intent for SCM HA add/join flow (comment-only).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +160 to +161
LOG.warn("Failed to re-resolve OM address {}: {}", rpcAddrStr,
ex.getMessage());
Comment on lines +114 to +117
@VisibleForTesting
synchronized void setCachedAddressForTest(InetSocketAddress address) {
this.rpcAddr = address;
}
Comment on lines +202 to +203
LOG.warn("Failed to stop stale OM proxy for nodeId {}: {}",
nodeId, stopEx.getMessage());
@adoroszlai adoroszlai marked this pull request as draft June 11, 2026 16:31
@kerneltime kerneltime force-pushed the HDDS-15514-client-om-refresh branch from eb66a5e to a6b6a3d Compare June 12, 2026 06:12
@kerneltime kerneltime changed the title HDDS-15514. DNS refresh on connection failure for Client to OM HDDS-15531. DNS refresh on connection failure for Client to OM Jun 12, 2026
@kerneltime

Copy link
Copy Markdown
Contributor Author

Addressed Copilot's three findings in the latest force-push:

  1. OMProxyInfo.setCachedAddressForTest now rejects null via Objects.requireNonNull, so a null injection surfaces immediately as a test bug rather than as a confusing NPE inside refreshAddressIfChanged.
  2. The LOG.warn at the IllegalArgumentException catch and the one at RPC.stopProxy failure now pass the exception object (not .getMessage()), so SLF4J emits the full stack trace.
  3. testRefreshUpdatesDelegationTokenService was indeed vacuous: the original assertion would pass even if the refresh code forgot to rebuild dtService. Added a setCachedDtServiceForTest hook, stale the field with a sentinel value before refresh, and assert post-refresh that dtService equals SecurityUtil.buildTokenService(info.getAddress()). A regression that skipped the dtService swap would now fail this test.

Also retitled the PR to HDDS-15531 per @szetszwo's subtask request.

@kerneltime kerneltime force-pushed the HDDS-15514-client-om-refresh branch from a6b6a3d to 2095b73 Compare June 12, 2026 06:20
Ratis builds gRPC channels via NettyChannelBuilder.forTarget(address),
where the default DnsNameResolver re-resolves hostnames on connection
failure. Two of the three pre-existing createRaftPeer paths in OM, and
the AddSCMRequest path in SCMHAManagerImpl, were passing
new InetSocketAddress(omNode.getInetAddress(), ratisPort) -- which bakes
the resolved IP into RaftPeer.address. Once baked, Ratis (and gRPC under
it) keeps dialing that IP for the channel's lifetime, so peer-pod
restarts in Kubernetes never recover until the parent process is
restarted.

Switch every createRaftPeer / AddSCMRequest call to pass the
hostname:port string. Collapse the two OzoneManagerRatisServer
overloads into one.

Replace the misleading "// TODO : Should we use IP instead of hostname??"
comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl with
explanatory comments citing HDDS-15514.

Add testCreateRaftPeerUsesHostnameAddress to assert the contract:
RaftPeer.address must NEVER be an IPv4 numeric form. This catches any
future regression that re-introduces InetSocketAddress at this seam.

This is the first of four PRs splitting HDDS-15514 along its natural
code-path boundaries. No flag, no exception classifier, and no atomic
swap machinery in this PR -- those land with the proxy-provider PRs
that follow.
OMProxyInfo constructs an InetSocketAddress at process start and reuses
it for the proxy's lifetime. InetSocketAddress freezes the resolved IP
at construction; when an OM pod is rescheduled to a new IP under a
stable DNS name (Kubernetes), every subsequent client RPC dials the
gone-away IP forever and only a process restart recovers.

Fix it at the FailoverProxyProvider seam, gated by a new opt-in flag
(ozone.client.failover.resolve-needed, default false).

Shared infrastructure (used by subsequent PRs in this series):
  - ConnectionFailureUtils: classifies a Throwable's cause chain
    (depth-bounded to 16) as a connection-class failure. Connection
    types: ConnectException, SocketTimeoutException,
    NoRouteToHostException, UnknownHostException, EOFException,
    SocketException. Application errors (OMException, OMNotLeaderEx,
    AccessControlException, RetryAction-coded responses) are NOT
    classified as connection failures, so DNS load is not amplified
    by logical errors.
  - ozone.client.failover.resolve-needed flag.

Client -> OM Hadoop RPC mechanism:
  - OMProxyInfo preserves the original host:port string and
    refreshAddressIfChanged() re-resolves it outside the entry
    monitor; on IP change, atomically swaps the cached
    InetSocketAddress / dtService / proxy=null under the monitor;
    stops the stale proxy via RPC.stopProxy outside the monitor.
  - OMFailoverProxyProviderBase.shouldRetry calls the refresh on
    connection-class exceptions only when the flag is on. On a
    successful refresh, returns FAILOVER_AND_RETRY but pins
    nextProxyIndex to the current node so RetryInvocationHandler
    does NOT skip past the just-refreshed peer.
  - HadoopRpcOMFailoverProxyProvider and the follower-read variant
    pass the preserved hostname string to OMProxyInfo at
    construction.

OM <-> OM Hadoop-RPC control-plane (OMInterServiceProtocol) rides on
the same OMFailoverProxyProvider machinery, so OM-to-OM Hadoop-RPC
recovery is a free transitive benefit of this PR. The gRPC OM client
(GrpcOMFailoverProxyProvider) was already correct (placeholder
InetSocketAddress(0); gRPC's NameResolver re-resolves on its own) and
is unchanged.

Secure-cluster prerequisite documented inline in ozone-default.xml:
when this flag is true on a Kerberos cluster, operators must also set
hadoop.security.token.service.use_ip=false in core-site.xml. Same
prerequisite HADOOP-17068 carries: the Hadoop delegation-token service
ID defaults to IP:port and would silently fail token selection after
a refresh without that co-config.

This is PR 2 of 4 splitting HDDS-15514 along its natural code-path
boundaries. PR-1 (Ratis hostname-only fix) is the merge base.
Subsequent PRs:
  - PR-3: OM -> SCM (SCMFailoverProxyProviderBase / SCMProxyInfo).
  - PR-4: DN -> SCM heartbeat (EndpointStateMachine /
          SCMConnectionManager / StateContext).

Tests:
  - TestConnectionFailureUtils (new, 20 tests): bare types,
    IOException-wrapped, deeply nested chains (3 levels), application
    negative cases, length-2 cause cycles (terminates), 1024-deep
    non-matching chains (cost bound).
  - TestOMProxyInfoDnsRefresh (new, 4 tests): no-op preserves cached
    proxy, swap on IP change, rebuilt proxy uses freshly-resolved
    address, dtService updates. Uses a @VisibleForTesting setter.
  - TestOMFailoverProxyProviderRefreshWired (new, 5 tests):
    SocketTimeoutException triggers refresh (the AWS EC2 silent-drop
    case end-to-end); ConnectException triggers refresh; OMException
    does NOT; flag-off does NOT; nextProxyIndex stays pinned after
    successful refresh.

Existing TestOMFailoverProxyProvider (8) and TestOMFailovers (1)
verified non-regressed.
@kerneltime kerneltime force-pushed the HDDS-15514-client-om-refresh branch from 2095b73 to 37500bf Compare June 12, 2026 06:34
@kerneltime kerneltime marked this pull request as ready for review June 13, 2026 17:56
@kerneltime

Copy link
Copy Markdown
Contributor Author

Thank you @adoroszlai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants