Skip to content

HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths#10470

Closed
kerneltime wants to merge 1 commit into
apache:masterfrom
kerneltime:HDDS-dns-refresh-all-paths
Closed

HDDS-15514. DNS-refresh-on-failure for OM, SCM, DN RPC paths#10470
kerneltime wants to merge 1 commit into
apache:masterfrom
kerneltime:HDDS-dns-refresh-all-paths

Conversation

@kerneltime

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR addresses HDDS-15514: Datanode and OzoneManager fail to recover from SCM peer IP changes; cache stale InetSocketAddress for process lifetime.

In Kubernetes (and any environment where peer pod IPs may change while DNS names remain stable), Ozone DataNodes and OzoneManagers can become permanently disconnected from SCM after an SCM peer pod is rescheduled to a new IP. The DataNode/OM process remains alive but every heartbeat or RPC call dials the now-defunct IP forever. Recovery today requires either a process restart or an external operator that watches SCM pod IPs and force-restarts dependent components.

This is the same class of bug HADOOP-17068 fixed for HDFS NameNode HA. This PR mirrors that pattern at the FailoverProxyProvider / EndpointStateMachine layer in Ozone (one tier above where Hadoop applied the fix, because Ozone's RPC seams live there) for the four inter-component Hadoop-RPC paths, and removes the IP-baking from the two Ratis paths so gRPC's DnsNameResolver can re-resolve hostnames on its own.

Why is this opt-in?

The new behaviour is gated by a config flag, default false, so that existing non-K8s deployments see zero change. Operators in Kubernetes flip it on. This matches the precedent set by:

  • dfs.client.failover.resolve-needed (HADOOP-17068 / HDFS-14118)
  • hbase.resolve.hostnames.on.failure (HBase ConnectionImplementation)
ozone.client.failover.resolve-needed = false   (default)
ozone.datanode.scm.heartbeat.address.refresh.threshold = 3   (default; DN-specific)

Per-path summary

Path Mechanism
DN → SCM heartbeat EndpointStateMachine preserves hostAndPort string. HeartbeatEndpointTask catch block calls maybeRefreshScmAddress when missedCount ≥ threshold. SCMConnectionManager.refreshSCMServer swaps the endpoint atomically; the new endpoint starts in GETVERSION state — correct because a rescheduled peer is effectively a fresh process.
OM → SCM SCMProxyInfo retains the config-time host:port. SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged(nodeId) runs in shouldRetry when the exception chain contains ConnectException, NoRouteToHostException, or UnknownHostException. Stale proxy is stopped via RPC.stopProxy.
Client → OM (Hadoop RPC) OMProxyInfo.rpcAddr becomes mutable behind the existing monitor. refreshAddressIfChanged() re-resolves rpcAddrStr, swaps rpcAddr and the derived dtService, nulls the cached proxy so the next createProxyIfNeeded dials the new IP. OMFailoverProxyProviderBase.shouldRetry calls this on connection-class exceptions before advancing the failover index.
Client → OM (gRPC) No code change — GrpcOMFailoverProxyProvider passes a placeholder InetSocketAddress(0) and lets gRPC's NameResolver re-resolve hostnames on its own schedule.
OM ↔ OM control plane Uses Hadoop RPC via OMInterServiceProtocol, not Ratis. Recovers transitively via the Client → OM fix.
OM ↔ OM Ratis replication OzoneManagerRatisServer.createRaftPeer simplified to always pass a hostname:port string to RaftPeer.setAddress — never a resolved IP. Two of three previous createRaftPeer branches were calling new InetSocketAddress(omNode.getInetAddress(), ratisPort), which strips the hostname and freezes the IP. With hostname-only addresses, gRPC's default DnsNameResolver (used by Ratis under the hood) re-resolves on connection failure / on its own refresh schedule. No Ratis upstream change required.
SCM ↔ SCM Ratis replication Already used hostname strings; removes a misleading // TODO : Should we use IP instead of hostname?? comment in SCMRatisServerImpl.buildRaftGroup and SCMHAManagerImpl and replaces with explanatory comments.

Connection-class exception filter

Re-resolution is gated on exception types where DNS could plausibly help:

  • java.net.ConnectException — connection refused / unreachable
  • java.net.NoRouteToHostException — host route gone
  • java.net.UnknownHostException — DNS lookup failed downstream

This excludes application-level errors (OMNotLeaderException, RetryAction, OMException, AccessControlException) where SCM/OM is reachable on the cached IP and the failure is logical, not network.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15514

How was this patch tested?

13 new unit tests + 1 real-RPC integration test, all passing under mvn clean test on the latest master:

Test class Tests Coverage
TestSCMConnectionManager +5 new resolveLatestAddress edge cases, refreshSCMServer happy-path swap, no-op when IP unchanged, no-op when host:port not preserved (legacy ctor path)
TestSCMFailoverProxyProviderRefresh 3 new Swap on IP change, no-op when unchanged, no-op when hostAndPort not preserved
TestOMProxyInfoDnsRefresh 3 new Address swap, dtService update, proxy null-out, proxy rebuild after refresh
TestSCMConnectionManagerDnsRefreshE2E 1 new Real Hadoop RPC server (via ScmTestMock) on a real loopback socket. Connection manager primed with deliberately stale 127.0.0.99 and preserved hostname localhost:port. refreshSCMServer fires; a real sendHeartbeat round-trips to the live server; ScmTestMock.rpcCount increments. Proves the full chain: address swap → fresh RPC proxy → real socket dial → server-side handler invocation.
TestOzoneManagerRatisServer +1 new Asserts RaftPeer.getAddress() is a hostname:port string, never an IP:port string. Defensive regex check that the host portion is not a numeric IPv4.

Existing regression tests (no failures): TestSCMConnectionManager (1 prior) + TestEndPoint (17) + TestOMFailoverProxyProvider (8) + TestOMFailovers (1) + TestOzoneManagerRatisServer (5 prior) — all green.

docker-compose validation with the ozone-ha stack confirmed:

  • Build pipeline correctly bundles the new code into hdds-container-service-2.2.0-SNAPSHOT.jar (verified via javap).
  • Config flag reaches the DN's process environment via the OZONE-SITE.XML_* mechanism.
  • The cluster boots, DataNode registers with SCM HA quorum, and writes succeed under the opt-in flag.
  • After SCM IP rotation (via docker network disconnect/connect with a squatter on the old IP), the post-HADOOP-17068 Client.updateAddress() recovery in Hadoop common also fires (visible as WARN ipc_.Client: Address change detected. Old: scm1/192.168.97.3:9861 New: scm1/192.168.97.12:9861). My fix is the load-bearing recovery for the AWS EC2/EKS silent-drop scenario where updateAddress() does not fire because the connect never returns within the IPC retry budget.

Scope and known limitations

  • The DN refresh fires from the HEARTBEAT phase via HeartbeatEndpointTask. If a DataNode starts up with the SCM peer already at a stale IP and never reaches HEARTBEAT, the recovery path does not engage. Initial-bringup DNS staleness is the existing concern of HDDS-5919's ozone.network.jvm.address.cache.enabled=false. InitDatanodeState.java:94-101 already postpones initialization on initial-resolution failure.
  • HDFS-14118-style construction-time DNS fan-out (one hostname → multiple persistent IPs) is a different problem (round-robin DNS for HDFS HA) and out of scope here. Worth a follow-on JIRA if Ozone deployments need it.
  • The Ratis quorum-loss exit-0 issue (SCMStateMachine.close() calling ExitUtils.terminate(0, ...) when leader election fails to converge, leading to Kubernetes CrashLoopBackOff death spirals) is a separate concern worth its own JIRA.

References

  • HADOOP-17068: client fails forever when namenode ipaddr changed (Hadoop 3.4.0). Commit fa14e4bc001e28d9912e8d985d09bab75aedb87c.
  • HDFS-14118: introduces dfs.client.failover.resolve-needed and the AbstractNNFailoverProxyProvider.getResolvedAddressesIfNecessary hook.
  • HBase: hbase.resolve.hostnames.on.failure (ConnectionImplementation.RESOLVE_HOSTNAME_ON_FAIL_KEY).
  • ZOOKEEPER-1506, ZOOKEEPER-2982: ZK StaticHostProvider re-resolves on each next() call.
  • HDDS-5919: introduces ozone.network.jvm.address.cache.enabled (default true). JVM-level DNS cache TTL — necessary but not sufficient for the long-lived InetSocketAddress problem this PR fixes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements an opt-in “DNS refresh on failure” mechanism across key OM/SCM/DN RPC paths to recover automatically when a peer’s IP changes under a stable DNS name (eg Kubernetes pod reschedules), and removes IP “baking” for Ratis peer addresses so gRPC can re-resolve hostnames.

Changes:

  • Add ozone.client.failover.resolve-needed (default false) and a DN-specific heartbeat refresh threshold to gate DNS re-resolution and proxy/endpoint rebuild on connection-class failures.
  • Preserve original host:port strings in DN↔SCM and client/OM/SCM failover paths and refresh cached InetSocketAddress + proxies when DNS resolves to a new IP.
  • Ensure OM/SCM Ratis peer addresses are always hostname-based strings (not resolved IPs) and add targeted unit/integration tests.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/ratis/TestOzoneManagerRatisServer.java Adds a test asserting RaftPeer address preserves hostname (not numeric IP).
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java Forces RaftPeer address to be hostname:port string to allow gRPC DNS re-resolution.
hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/ha/TestOMProxyInfoDnsRefresh.java Adds tests validating OMProxyInfo DNS refresh swaps address and rebuilds proxy.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMProxyInfo.java Makes OM address + token service mutable under monitor and adds refresh-on-DNS-change logic.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/ha/OMFailoverProxyProviderBase.java Adds optional DNS refresh on connection failures before advancing failover index.
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/ozone/container/common/TestSCMConnectionManagerDnsRefreshE2E.java Adds end-to-end test for DN→SCM refresh swapping endpoint and succeeding with real RPC.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMRatisServerImpl.java Clarifies/ensures Ratis uses hostname strings (not resolved IPs) for peer addresses.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAManagerImpl.java Ensures SCM HA add-SCM path uses hostname string and updates comments accordingly.
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/scm/proxy/TestSCMFailoverProxyProviderRefresh.java Adds tests for SCM proxy DNS refresh swap/no-op/legacy-no-hoststring behavior.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMProxyInfo.java Preserves original host:port string (optional) to enable DNS refresh on failure.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMFailoverProxyProviderBase.java Adds optional DNS re-resolution + proxy rebuild on connection-class failures.
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/TestSCMConnectionManager.java Adds tests for resolveLatestAddress and refreshSCMServer swap/no-op behaviors.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java Preserves host:port when adding SCM endpoints during reconfiguration.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/HeartbeatEndpointTask.java Triggers optional SCM address refresh after N consecutive heartbeat failures.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/datanode/InitDatanodeState.java Preserves host:port when creating initial SCM endpoints so DNS can be re-resolved later.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/SCMConnectionManager.java Adds host:port-preserving add method and endpoint refresh swap helper.
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/EndpointStateMachine.java Stores original host:port and adds resolveLatestAddress helper for DNS refresh.
hadoop-hdds/common/src/main/resources/ozone-default.xml Documents new config keys and defaults.
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java Introduces config constants for resolve-on-failure and DN heartbeat refresh threshold.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +219 to +232
// Tear down before adding to avoid two endpoints claiming the same
// logical SCM peer.
String hostAndPort = existing.getHostAndPort();
scmMachines.remove(oldAddress);
try {
existing.close();
} catch (RuntimeException closeEx) {
LOG.warn("Failed to close stale endpoint {}: {}", oldAddress,
closeEx.getMessage());
}
addSCMServer(refreshed, hostAndPort, threadNamePrefix);
LOG.info("DNS re-resolution: SCM endpoint {} -> {} (host {}).",
oldAddress, refreshed, hostAndPort);
return refreshed;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 642dd17a.

Factored the proxy/endpoint construction out of addSCMServer into a private buildScmEndpoint(...). refreshSCMServer now calls buildScmEndpoint(refreshed, ...) BEFORE removing the stale entry. If construction throws (transient DNS during RPC.getProtocolProxy, NetUtils refusing the new address, peer not yet accepting), the existing endpoint stays registered in scmMachines and the IOException is propagated to the caller. Heartbeats continue against the stale IP — better than the peer disappearing entirely from the connection manager.

New regression test TestSCMConnectionManager.testRefreshSCMServerLeavesStaleEndpointOnBuildFailure injects a build failure via a @VisibleForTesting-overridable hook and asserts (a) the IOException propagates, (b) getNumOfConnections() is still 1, (c) the SAME EndpointStateMachine instance is still registered (not a half-constructed replacement). 21/21 in TestSCMConnectionManager + 6/6 in TestHeartbeatEndpointTaskDnsRefresh pass.

* The currently-resolved address. Initialized at construction by
* resolving {@link #rpcAddrStr}, and may be replaced atomically by
* {@link #refreshAddressIfChanged()} when the failover provider
* detects that the SCM peer has been rescheduled to a new IP.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 642dd17a — copy-paste artefact from the SCM-side cousin. Now reads "OM node has been rescheduled to a new IP."

@kerneltime kerneltime force-pushed the HDDS-dns-refresh-all-paths branch from 4770df4 to d7aa14e Compare June 9, 2026 14:55
@kerneltime

Copy link
Copy Markdown
Contributor Author

Self-review (Round 2): adversarial pass over my own draft, force-pushed at d7aa14e066

I ran a 5-persona adversarial review (architect / security / correctness / testing / integration) on the original 4770df45 draft, fixed everything they surfaced, then ran the same 5 personas again against the fixes. Round 2 found 3 real bugs in my Round 1 fixes plus 1 documentation gap. All addressed in this push.

Round 1 findings → addressed in earlier passes

  • SCMConnectionManager.refreshSCMServer left StateContext per-endpoint queues keyed by the stale address. Incremental container reports / container actions / pipeline actions queued under the old key were silently undeliverable after a swap. Fix: new StateContext.migrateEndpoint(old, new) re-keys all five maps; called from HeartbeatEndpointTask.maybeRefreshScmAddress after a successful swap.
  • Connection-class filter excluded SocketTimeoutException — the AWS EC2 / EKS silent-drop case the JIRA description names as load-bearing. Fix: extracted to a shared org.apache.hadoop.hdds.utils.ConnectionFailureUtils that also matches EOFException and SocketException, with a depth-bounded (16) cause-chain walk to defend against pathological cycles.
  • OMProxyInfo.refreshAddressIfChanged leaked the stale RPC proxy. Sister SCM path stopped it; OM path did not. Fix: capture old proxy reference, swap fields under monitor, call RPC.stopProxy outside the monitor (mirrors SCMFailoverProxyProviderBase).
  • HadoopRpcOMFailoverProxyProvider.delegationTokenService was final, computed once at construction. Fix: volatile, recomputed in a new onAddressRefreshed(nodeId) hook called from OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress on swap.
  • DNS lookup is now performed outside the entry monitor (a slow / dead resolver could otherwise freeze every concurrent reader); swap is committed under a re-check that defends against lost updates.

Round 2 findings → fixed in this push

  • StateContext.migrateEndpoint raced on the unsynchronized endpoints HashSet and isFullReportReadyToBeSent HashMap. Producers (addContainerAction, addIncrementalReport, addPipelineActionIfAbsent) iterate endpoints under different monitors than the migration, and addPipelineActionIfAbsent iterates with no monitor at all. Concurrent migration could throw CME or NPE under K8s pod churn — turning a recovery feature into a stability regression. Fix: endpoints is now a CopyOnWriteArraySet, isFullReportReadyToBeSent a ConcurrentHashMap, and producers null-skip queues whose entries the migration removed. Migration ordering (producers' queues first, endpoints set last) ensures any thread that observes the new endpoint membership has already been able to observe its queues populated. Collision case (newEndpoint already keyed) does not merge foreign queues.
  • OM-side "retry-same-proxy" fix didn't actually pin nextProxyIndex. The Round 1 code returned FAILOVER_AND_RETRY after a successful refresh, but RetryInvocationHandler.performFailover() reads nextProxyIndex (which prior retries may have already advanced) and lands on the wrong OM. Fix: explicitly call setNextOmProxy(currentOmNodeId) before returning, mirroring the existing OMLeaderNotReady "retry same OM" pattern. New TestOMFailoverProxyProviderRefreshWired.testRefreshSuccessPinsCurrentNodeId proves the fix end-to-end.
  • The migration test was vacuousSTALE_ADDR.equals(refreshed) made migrateEndpoint short-circuit on the early-return, so the assertion that the new key existed in the metrics map passed regardless of whether migration ran. Fix: build two RESOLVED-but-distinct InetSocketAddress instances via InetAddress.getByAddress(byte[]) (127.0.0.99 vs 127.0.0.1) and assert that the OLD key is GONE and the NEW key is PRESENT in both incrementalReportsQueue and containerActions.
  • Delegation-token service identifier is broken across refresh when hadoop.security.token.service.use_ip=true (Hadoop default). The token's frozen aggregated string contains old IPs; the post-refresh per-OM service query is built from the new IP, and the substring-match selector fails. This is the same prerequisite HADOOP-17068 carries. Documented loudly in ozone-default.xml on the ozone.client.failover.resolve-needed entry: secure clusters MUST also set hadoop.security.token.service.use_ip=false in core-site.xml. Comment in HadoopRpcOMFailoverProxyProvider corrected (the original claim that substring-match handled refresh was inverted).
  • HeartbeatEndpointTask catch block now filters by connection class before deciding to refresh. Previously it would refresh on ANY IOException once the threshold was met, including AccessControlException (peer reachable, application-layer rejection). Now uses ConnectionFailureUtils.isConnectionFailure — symmetric with the OM/SCM provider gates. New negative test asserts AccessControlException at threshold does NOT trigger refresh.
  • refreshSCMServer now refuses to operate on passive (Recon) endpoints. Today's protection is accidental (Recon's EndpointStateMachine ctor leaves hostAndPort=null); explicit if (existing.isPassive()) return null; guard prevents a future maintainer from accidentally downgrading Recon to SCM-protocol on swap.

Test coverage (~85 tests, all green locally)

File Count Purpose
TestConnectionFailureUtils (new) 20 Filter classification: bare + IOException-wrapped, deep nesting, application-level rejects, length-2 cycle, 1024-deep chain
TestOMFailoverProxyProviderRefreshWired (new) 5 Wired retry path: SocketTimeoutException triggers refresh, ConnectException triggers refresh, OMException does NOT, flag-off does NOT, refresh-success pins current nodeId
TestHeartbeatEndpointTaskDnsRefresh (new) 6 DN catch-block: flag-on + threshold + connection-class fires; flag-off does NOT; below-threshold does NOT; null host:port does NOT; AccessControlException does NOT; StateContext queues actually rekey
TestSCMConnectionManagerDnsRefreshE2E (this PR) 1 Real-RPC swap mechanism with @Timeout(30)
TestOMProxyInfoDnsRefresh (this PR, expanded) 4 Per-instance refresh: no-op preserves proxy, swap on IP change, rebuilt proxy uses new address, dtService updates
TestSCMFailoverProxyProviderRefresh (this PR) 3 SCM swap mechanism
TestSCMConnectionManager (this PR) 6 (1 prior + 5 new) resolveLatestAddress edge cases, refreshSCMServer happy-path
TestOzoneManagerRatisServer (this PR) 6 (5 prior + 1 new) Ratis address is hostname:port, never IP:port
Existing regressions verified non-regressed 34 TestEndPoint (17), TestOMFailoverProxyProvider (8), TestOMFailovers (1), TestHeartbeatEndpointTask (8)

What's still out of scope for this PR

  • HDFS-14118-style construction-time DNS fan-out (a different problem; round-robin DNS for HA — worth a follow-on JIRA if Ozone deployments need it).
  • The Ratis quorum-loss exit-0 issue in SCMStateMachine.close() (filed separately).
  • Ratis hostname-only behavioural change for IP-literal ozone.om.address configs — kept conditional on operator config rather than gated by resolve-needed. Today's NodeDetails.getRatisHostPortStr() chains through getHostName() which can trigger reverse DNS for IP-literal-configured peers across some JDK vendors. If anyone with deeper Apache Ozone background thinks this needs a defensive IP-detection branch, happy to add one — flagging it explicitly because Round 2 surfaced it as a remaining concern.

Diff summary

25 files changed, 2260 insertions(+), 56 deletions(-)

PR head: d7aa14e066778a97d37a7171b582326083ab6218 (was 4770df45515).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.

Comment on lines +160 to +163
synchronized (this) {
if (refreshed.getAddress() != null
&& refreshed.getAddress().equals(rpcAddr.getAddress())) {
return false;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Replaced the equality check with a null-safe IP comparison: if (cachedIp != null && refreshedIp != null && refreshedIp.equals(cachedIp)). When the cached rpcAddr was constructed unresolved (the constructor accepts this with a warn at OMProxyInfo.java:81), cachedIp is null and we now ALLOW the swap — which is correct because resolved-now is genuinely a change, and that is the case the refresh path most needs to recover from. NetUtils.createSocketAddr is invoked outside the synchronized block before the comparison, so no I/O happens under lock either.

Comment on lines +146 to +152
LOG.warn("Re-resolution of {} produced an unresolved address; "
+ "leaving cached address {} in place.", hostAndPort, address);
return null;
}
if (refreshed.getAddress().equals(address.getAddress())) {
return null;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Same null-safe comparison applied at EndpointStateMachine.java:152. If the cached address was unresolved (a possibility because addSCMServer does not reject unresolved at construction, and addReconServer never preserves a hostname), address.getAddress() is null and a successful re-resolution is now treated as a swap rather than NPE-ing the heartbeat refresh path on the case it most needs to fix.

Comment on lines +336 to +338
if (refreshed.getAddress().equals(cachedAddress.getAddress())) {
return false;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Same pattern at SCMFailoverProxyProviderBase.java:338 — null-safe IP comparison, with an unresolved cached address treated as "swap allowed" rather than NPE.

Comment on lines +266 to +273
// Replacement is built; commit the swap atomically under the
// write lock we already hold, then close the old endpoint
// outside the critical section's correctness path (close()
// failures only affect cleanup, not registration).
scmMachines.put(refreshed, replacement);
if (!refreshed.equals(oldAddress)) {
scmMachines.remove(oldAddress);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the prior push at 774c8eb4 (R3-C3). SCMConnectionManager.refreshSCMServer now refuses the swap when !refreshed.equals(oldAddress) && scmMachines.containsKey(refreshed). The freshly-built replacement is discarded (no put, no close on the existing endpoint, return null). Avoids the "kube-dns transiently maps SCM-A to SCM-B's IP" scenario where the put would silently overwrite SCM-B's EndpointStateMachine and leak its executor.

Comment on lines 183 to 187
commandQueue = new LinkedList<>();
cmdStatusMap = new ConcurrentHashMap<>();
incrementalReportsQueue = new HashMap<>();
containerReports = new AtomicReference<>();
nodeReport = new AtomicReference<>();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Promoted incrementalReportsQueue to ConcurrentHashMap (StateContext.java:189). The producer-side synchronized(incrementalReportsQueue) blocks remain because they guard COMPOUND operations on the inner LinkedList values (e.g. q.add, q.addAll, q.removeIf) that ConcurrentHashMap does not protect. CHM only fixes the map-structure race that addEndpoint/removeEndpoint/migrateEndpoint introduced. Metric readers iterating entrySet() without a monitor are now safe via CHM's weakly-consistent iteration.

Comment on lines +196 to 199
endpoints = new CopyOnWriteArraySet<>();
containerActions = new HashMap<>();
pipelineActions = new ConcurrentHashMap<>();
lock = new ReentrantLock();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Same pattern as incrementalReportsQueue: promoted containerActions to ConcurrentHashMap (StateContext.java:202). Producer synchronized(containerActions) blocks remain — they still guard the inner Queue operations.

Comment on lines +1172 to +1184
/**
* Drain entries that are absent in {@code target}'s map into it,
* preserving the original insertion order. Used by the
* DNS-refresh-on-failure endpoint migration to move queued
* pipeline actions from a stale endpoint key to a fresh one
* without losing them.
*/
synchronized void drainInto(PipelineActionMap target) {
for (Map.Entry<PipelineKey, PipelineAction> entry : map.entrySet()) {
target.putIfAbsent(entry.getKey(), entry.getValue());
}
map.clear();
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. drainInto was added in R2 when migrateEndpoint did merge-on-collision, but the post-R2 collision logic drops the old map without merging. Removed the dead method and its Javadoc.

Comment on lines +131 to +135
// Build the cycle reflectively: a.cause -> b, b.cause -> a.
java.lang.reflect.Field causeField = Throwable.class.getDeclaredField("cause");
causeField.setAccessible(true);
causeField.set(a, b);
causeField.set(b, a);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Rewrote testCycleOfLengthTwoTerminates to use Throwable.initCause (no reflection): the no-arg Throwable() ctor leaves cause uninitialized (cause==this sentinel), so a single initCause call on each side is permitted and lets us close the length-2 cycle. JDK 16+ portable.

Comment on lines +150 to +157
java.lang.reflect.Field causeField = Throwable.class.getDeclaredField("cause");
causeField.setAccessible(true);
// Build a 1024-deep chain. The walker must NOT traverse the whole
// thing -- the bound caps cost.
for (int i = 1; i < 1024; i++) {
Throwable next = new RuntimeException(Integer.toString(i));
causeField.set(cursor, next);
cursor = next;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 46e7f544. Rewrote testUnboundedChainOfNonMatchingTerminates to use initCause instead of reflective field manipulation. JDK 16+ portable.

@kerneltime kerneltime force-pushed the HDDS-dns-refresh-all-paths branch from 642dd17 to 774c8eb Compare June 9, 2026 16:38
@kerneltime

Copy link
Copy Markdown
Contributor Author

Self-review (Round 3): adversarial pass with a failure-injection lens, force-pushed at 774c8eb4

After Copilot caught a textbook resource-management bug that two of my prior adversarial review rounds had missed, I added a failure-injection lens to my review skill (every persona must walk every I/O seam and ask "what is the state if this throws right here?") and ran a Round 3 that included this lens. Round 3 found 3 real issues that all prior rounds plus Copilot missed, plus 3 medium concerns. All addressed in this push.

Round 3 findings → fixed in this push

  • OMProxyInfo.refreshAddressIfChanged had a partial-mutation window inside the synchronized block. The order was rpcAddr = refreshed then dtService = SecurityUtil.buildTokenService(refreshed) then proxy = null. If buildTokenService ever threw (rare but possible with a malformed-but-resolved address or a future strictness change), rpcAddr would already be on the new IP but dtService would still hold the old service identifier and proxy would still point at the old peer. Worse: subsequent refresh attempts would short-circuit at the equality check (refreshed.getAddress().equals(rpcAddr.getAddress())) because rpcAddr already matches — the entry would be permanently wedged in an inconsistent state until JVM restart. Fix: compute newDtService outside the synchronized block; commit rpcAddr/dtService/proxy=null atomically inside. Mirrors the build-first / swap-atomically discipline that the SCM-side path and Copilot's refreshSCMServer fix already enforce.
  • OMFailoverProxyProviderBase.maybeRefreshCurrentOmAddress performed DNS lookup under the provider monitor. R1 moved the DNS call outside the per-instance OMProxyInfo monitor, but the provider monitor was still held across the lookup. A slow / dead resolver would freeze every concurrent synchronized provider operation (performFailover, selectNextOmProxy, setNextOmProxy, getCurrentProxyOMNodeId, etc.). Fix: read nodeId and info under the monitor, drop it for the DNS-doing info.refreshAddressIfChanged() call, re-acquire briefly to invoke onAddressRefreshed. Mirrors the SCM-side pattern in SCMFailoverProxyProviderBase.refreshProxyAddressIfChanged which got this right in R1.
  • testRefreshSuccessPinsCurrentNodeId was vacuous. With currentProxyIndex and nextProxyIndex both initialised to 0 from construction, performFailover was a no-op (currentProxyIndex = nextProxyIndex = 0), so the assertion beforeNode == getCurrentProxyOMNodeId() held regardless of whether the C2 pin code (setNextOmProxy(omNodeId) in OMFailoverProxyProviderBase.shouldRetry) ran. Fix: pre-advance nextProxyIndex by triggering a non-refresh shouldRetry first (with an IOException that is NOT a connection-class failure, so the refresh hook is not invoked); only then trigger the connection-class shouldRetry. The pin must now pull nextProxyIndex back to the original node — if the pin code is broken, this test fails.
  • EndpointStateMachine.close() could leak its executor thread on close failure. The original sequence was endPoint.close(); executorService.shutdown(); — if endPoint.close() (which calls RPC.stopProxy) raised a RuntimeException, the executor was never shut down. The caller (SCMConnectionManager.refreshSCMServer) catches the RuntimeException, so the leak was silent. Fix: wrap in try/finally so executorService.shutdown() always runs.
  • HadoopRpcOMFollowerReadFailoverProxyProvider.getConnectionId read getCurrentProxy().proxy directly, bypassing the per-instance monitor. Pre-PR the inherited proxy field was effectively final (assigned only at construction). Post-PR, OMProxyInfo.refreshAddressIfChanged mutates it under synchronized(this). A direct field read can observe a torn value. Fix: replace with getCurrentProxy().getProxy() (the synchronized accessor).
  • SCMConnectionManager.refreshSCMServer had no guard against the freshly-resolved IP colliding with another already-registered SCM peer key. Realistic scenario: kube-dns transiently maps SCM-A's hostname to SCM-B's current IP. Without a guard, scmMachines.put(refreshed, replacement) would silently overwrite SCM-B's EndpointStateMachine, leaking its executor thread and its task state. Fix: refuse the swap when !refreshed.equals(oldAddress) && scmMachines.containsKey(refreshed); leave the stale endpoint in place and let the next heartbeat retry DNS.

Why a Round 3 was needed

After Copilot caught the remove-then-add resource bug at SCMConnectionManager.refreshSCMServer:230 that two of my five-persona adversarial review rounds had missed, I diagnosed the gap: my personas were asking "is the success path correct?" and missing "what happens if the I/O call inside the critical section throws?" The lens was wrong, not the personas.

I updated my review tooling (/ritesh-view's persona prompts plus a new focused /code-review-failure-injection skill, plus my global CLAUDE.md Resource Management and Failure Atomicity and Failure-injection lens for code review sections) to make the failure-injection lens REQUIRED on every persona's review, with specific deliverables (failure-inventory tables, atomic-replace audit, I/O-under-lock audit, caller-surprise audit). Round 3 then ran the same five personas plus the focused failure-injection sweep against the post-Copilot diff. The findings above are what the new lens caught.

Test coverage

All 86 tests in the touched-module sweep still pass:

  • TestConnectionFailureUtils (20)
  • TestOMFailoverProxyProviderRefreshWired (5, including the now-non-vacuous testRefreshSuccessPinsCurrentNodeId)
  • TestHeartbeatEndpointTaskDnsRefresh (6)
  • TestSCMConnectionManager (7, including the Copilot rollback regression)
  • TestSCMConnectionManagerDnsRefreshE2E (1)
  • TestOMProxyInfoDnsRefresh (4)
  • TestSCMFailoverProxyProviderRefresh (3)
  • TestOzoneManagerRatisServer (6)
  • Plus existing regressions: TestEndPoint (17), TestOMFailoverProxyProvider (8), TestOMFailovers (1), TestHeartbeatEndpointTask (8)

Notes that remain in scope but unchanged

  • Reverse-DNS attack-surface widening on IP-literal-configured Ratis peers. R1 noted this; R3's Security persona confirmed the specific delta (the new getRatisHostPortStr call chains through getHostName() which can trigger PTR lookup). Defense-in-depth concern only — the existing mTLS layer with Ozone CA-signed certs constrains exploitability. Worth a follow-on JIRA to either prefer getHostString() (no PTR) or document IP-literal configs as unsupported with resolve-needed=on.
  • Operator-facing documentation gap. The new ozone.client.failover.resolve-needed and ozone.datanode.scm.heartbeat.address.refresh.threshold flags plus the hadoop.security.token.service.use_ip=false co-config requirement are documented inline in ozone-default.xml only — not in user-facing markdown under hadoop-hdds/docs/ or in any deploy artifact. Worth a separate cleanup PR.
  • Follower-read DNS refresh. HadoopRpcOMFollowerReadFailoverProxyProvider invokes refreshAddressIfChanged only via the inherited shouldRetry → leader path. A follower OM that's been pod-rescheduled gets refreshed only when the failover provider's leader happens to be that same follower. Tactical — not a regression vs. pre-PR — but worth noting.

Diff summary

6 files changed, 104 insertions(+), 19 deletions(-)

PR head: 774c8eb4d3991c5422a0fbfc315758036d3a1438 (was 642dd17a497).

@kerneltime kerneltime force-pushed the HDDS-dns-refresh-all-paths branch from 774c8eb to 46e7f54 Compare June 9, 2026 16:47
@kerneltime

Copy link
Copy Markdown
Contributor Author

Copilot R3 review → 4 fixes pushed at 46e7f544

Copilot's second review (against 642dd17a) surfaced 9 inline comments. Inline replies are posted on each. Two of the nine were already addressed by my earlier R3 push (774c8eb4); the remaining seven cluster into 4 fix categories applied in 46e7f544:

  1. NPE on unresolved cached getAddress() at three refresh sites (OMProxyInfo:163, EndpointStateMachine:152, SCMFailoverProxyProviderBase:338). Both OMProxyInfo and SCMProxyInfo constructors accept an unresolved address with a WARN and store it (OMProxyInfo.java:81, SCMProxyInfo.java:57). My equality check cachedAddress.getAddress().equals(refreshed.getAddress()) would NPE on null.equals(...). Fixed with null-safe comparison that ALSO treats unresolved-cached + resolved-now as a swap (the case the refresh path most needs to recover from).

  2. incrementalReportsQueue and containerActions concurrent-mutation race (StateContext.java:189, 202). My migration code added migrateEndpoint which mutates these maps under the producer's monitor, but pre-existing addEndpoint/removeEndpoint mutate them WITHOUT taking the producer's monitor. Promoted both to ConcurrentHashMap. The producer-side synchronized(map) blocks stay — they guard compound operations on the inner Queue/LinkedList values that CHM doesn't protect.

  3. PipelineActionMap.drainInto was dead code. R2 added it when migrateEndpoint did merge-on-collision; the post-R2 collision path drops rather than merges. Removed.

  4. JDK 16+ portability: TestConnectionFailureUtils.testCycleOfLengthTwoTerminates and testUnboundedChainOfNonMatchingTerminates reflectively wrote Throwable.cause. Rewrote both to use Throwable.initCause() — the no-arg Throwable() ctor leaves cause uninitialized, so a single initCause call on each side closes a length-2 cycle without reflection. Tests now run on JDK 16+ without --add-opens.

What this round means for the review-method post-mortem

My Round 3 with the new failure-injection lens caught 3 NEW Blockers, but Copilot still found 4 new categories I missed. Diagnosis:

  • NPE-on-unresolved-cached-address (Copilot's biggest catch): my failure-injection lens asked "what happens if this throws?" but didn't explicitly include "what does this method do when the cached state was constructed in a degraded mode (warned-but-stored)?" The constructors warn-and-store unresolved addresses; the refresh code assumes resolved-everywhere. That's a different lens — call it the degraded-construction lens — which I should add.

  • HashMap-vs-ConcurrentHashMap audit: my R2 promoted endpoints (HashSet → COWAS) and isFullReportReadyToBeSent (HashMap → CHM) but not the other two. The lens "every map mutated outside its monitor" wasn't applied uniformly. Adding to my CLAUDE.md: when promoting one container in a class to a thread-safe form, audit ALL siblings in the same class.

  • Dead code from intermediate iterations: drainInto was correct when written; R2's later collision-handling change made it dead. The lens "after every refactor, sweep for callers of methods I no longer use" is missing from my workflow.

  • JDK module-system portability: my Testing persona R3 did flag the reflection as brittle but didn't escalate to "fix it"; Copilot did. The lens needs "if a test uses reflection on JDK internals, mandate --add-opens documentation OR a non-reflective alternative."

I'll push these as further updates to my CLAUDE.md and /ritesh-view skill in a separate pass after this PR is otherwise resolved, so the next reviewer (me or anyone using the skill) inherits these lenses.

Diff summary

5 files changed, 61 insertions(+), 35 deletions(-)

PR head: 46e7f54479fa2ea3bd0f8cca24ce07e79457a04f (was 774c8eb4d39).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Comment on lines 339 to +347
synchronized (incrementalReportsQueue) {
for (InetSocketAddress endpoint : endpoints) {
incrementalReportsQueue.get(endpoint).add(report);
// Same migration race shape as addPipelineActionIfAbsent: an
// endpoint observed in the COW set may have just had its
// queue removed by migrateEndpoint. Skip rather than NPE.
List<Message> queue = incrementalReportsQueue.get(endpoint);
if (queue != null) {
queue.add(report);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c27a7c43. Re-ordered migrateEndpoint so the invariant "endpoint ∈ endpoints ⇒ queue exists" holds at every observable point: PUBLISH new-key queues → SWITCH endpoints set (add new, then remove old) → RETIRE old-key queues. Now there is no window where a producer iterating endpoints sees a key without a queue. The producer null-skips remain as defense-in-depth but the load-bearing guarantee is the ordering. Logged as LENS-010 in my MISSED_LENSES.md log: "producer null-skip is not a substitute for fixing the underlying invariant" — caught by you, not by my lens, exactly because the lens accepted the band-aid as a fix.

Comment on lines +290 to +294
try {
existing.close();
} catch (RuntimeException closeEx) {
LOG.warn("Failed to close stale endpoint {}: {}", oldAddress,
closeEx.getMessage());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c27a7c43. existing.close() now runs OUTSIDE writeLock. Pattern: capture the stale endpoint reference under the lock alongside the swap commit, release the lock, then close with the same try/catch. Concurrent getValues() callers and other endpoints' heartbeats no longer stall on RPC.stopProxy / socket teardown. Logged as LENS-011 + a META rule: my R3 failure-injection persona DID flag this as a Concern and I dismissed it as "acceptable, pre-existing pattern." That triage was wrong; dismissal is what got re-found. New CLAUDE.md rule: "failure-injection lens findings get fix-or-justify, never silent dismissal."

+ "original nodeId even though a prior shouldRetry advanced "
+ "nextProxyIndex -- otherwise the freshly-fixed peer is "
+ "bypassed for up to N-1 retries");
assertFalse(beforeNode == null);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c27a7c43. Replaced assertFalse(beforeNode == null) with assertNotNull(beforeNode).

Comment on lines +24 to +27
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.time.Duration;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c27a7c43. Removed the unused java.time.Duration import.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Comment on lines 35 to 38
import java.util.concurrent.locks.ReentrantReadWriteLock;
import javax.management.ObjectName;
import com.google.common.annotations.VisibleForTesting;
import org.apache.hadoop.conf.Configuration;
Comment on lines 1218 to 1221
}


synchronized List<PipelineAction> getActions(List<PipelineReport> reports,
Comment on lines +141 to +146
/**
* refreshSCMServer() against an endpoint whose cached IP already matches
* DNS is a no-op -- the existing endpoint stays in place untouched. This
* prevents needless tearing-down of healthy connections when the
* heartbeat task asks to refresh after a transient blip.
*/
Comment on lines +230 to +277
writeLock();
try {
EndpointStateMachine existing = scmMachines.get(oldAddress);
if (existing == null) {
return null;
}
// Recon endpoints (added via addReconServer) speak a different
// protocol than active SCM endpoints. The current refresh path
// only knows how to rebuild SCM endpoints, so refusing to
// refresh a passive endpoint avoids silently downgrading a
// Recon endpoint to an SCM-protocol one. Recon's cached IP is
// also a much narrower problem in practice (Recon is rarely
// pod-rescheduled the way SCM-HA peers are).
if (existing.isPassive()) {
return null;
}
InetSocketAddress resolved = existing.resolveLatestAddress();
if (resolved == null) {
return null;
}
// Refuse the swap if the freshly-resolved address collides with
// another already-registered SCM peer key (e.g. transient kube-dns
// returning peer-B's IP for peer-A's hostname). Without this guard,
// the put below would silently overwrite peer-B's
// EndpointStateMachine, leaking its executor and orphaning its
// task thread, while peer-A's task ends up dialing peer-B's IP
// with peer-A's host context. Leave the stale endpoint in place;
// the next heartbeat retries DNS.
if (!resolved.equals(oldAddress)
&& scmMachines.containsKey(resolved)) {
LOG.warn("DNS re-resolution: refused to swap endpoint {} -> {} "
+ "because the new address collides with an already-registered "
+ "SCM peer. Leaving stale endpoint in place.",
oldAddress, resolved);
return null;
}
String preservedHostPort = existing.getHostAndPort();
// Build the replacement BEFORE removing the stale entry so a
// failure to construct the new proxy (transient DNS, peer not
// yet accepting on the new IP, NetUtils refusing the address)
// leaves the existing endpoint registered. Otherwise the peer
// would disappear from scmMachines entirely and the next
// heartbeat cycle would have nothing to dial -- much worse than
// the pre-PR behaviour of dialing the stale IP.
EndpointStateMachine replacement;
try {
replacement = buildScmEndpoint(resolved, preservedHostPort,
threadNamePrefix);
Comment on lines +62 to +64
* Operators that enable {@code ozone.client.failover.resolve-needed}
* on a secure cluster MUST set {@code hadoop.security.token.service.use_ip=false}
* (in core-site.xml) so the per-OM service is hostname:port -- a

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Comment on lines 25 to 38
import java.io.Closeable;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import javax.management.ObjectName;
import com.google.common.annotations.VisibleForTesting;
import org.apache.hadoop.conf.Configuration;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 745052d9. Moved com.google.common.annotations.VisibleForTesting to immediately after the static imports and before java.*, matching Ozone's CustomImportOrder convention.

Comment on lines +243 to +247
if (existing.isPassive()) {
return null;
}
InetSocketAddress resolved = existing.resolveLatestAddress();
if (resolved == null) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 745052d9. Refactored refreshSCMServer into 4 phases: PHASE A (read lock) snapshots the endpoint reference + hostAndPort; PHASE B (no lock) does the DNS lookup via resolveLatestAddress(); PHASE C (write lock) re-checks current == snapshotEndpoint to defend against concurrent removeSCMServer/refresh races, enforces the collision invariant, builds the replacement, commits the swap; PHASE D (no lock) closes the stale endpoint. Mirrors the SCMFailoverProxyProviderBase pattern. Logged as LENS-012 + META-3 in my MISSED_LENSES.md: this was the same lens (no I/O under lock) that you flagged for existing.close() in R4 — I fixed the named site but did not sweep siblings in the same method. The new rule is "when fixing one site flagged by a lens, sweep every site in the same method before declaring the fix complete."

}
}

/**

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 745052d9. The orphan was a Javadoc for testRefreshSCMServerNoopWhenIpUnchanged that I left stranded above the new R3 rollback test when I inserted the new test in front. Moved the Javadoc to immediately above its matching @Test method.

Comment on lines 1219 to 1221


synchronized List<PipelineAction> getActions(List<PipelineReport> reports,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 745052d9. Removed the second blank line between putIfAbsent and getActions in PipelineActionMap — left over from the drainInto deletion in R3.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Comment on lines +213 to +214
String hostAndPort = rpcEndpoint.getHostAndPort();
if (hostAndPort == null) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in . Inlined and removed the unused local.

Mirrors the design intent of HADOOP-17068 (NameNode HA client) for
Apache Ozone's five inter-component RPC paths so that peer pod IP
changes in Kubernetes are recovered without external pod restarts.

Path A (DN -> SCM heartbeat):
  EndpointStateMachine now preserves the original host:port string
  used at construction. On a heartbeat IOException whose cause chain
  is a connection-class type (per the new ConnectionFailureUtils
  helper) AND the miss-count is above a configurable threshold,
  HeartbeatEndpointTask asks SCMConnectionManager to refresh the
  endpoint. If DNS now resolves to a different IP, the connection
  manager swaps the endpoint atomically -- the new endpoint starts
  in GETVERSION state, which is the correct behaviour for a peer
  that has been rescheduled. StateContext.migrateEndpoint re-keys
  per-endpoint queues (incremental reports, container actions,
  pipeline actions, full-report flags, the endpoints set) so
  in-flight reports survive the swap. The endpoints set is now a
  CopyOnWriteArraySet and isFullReportReadyToBeSent is a
  ConcurrentHashMap so producers iterating these structures while
  migration runs cannot CME or NPE.

Path B (OM -> SCM via SCMFailoverProxyProviderBase):
  SCMProxyInfo retains the config-time host:port string. On
  connection-class exceptions in shouldRetry(),
  refreshProxyAddressIfChanged(nodeId) re-resolves the cached
  hostname OUTSIDE the provider monitor (a slow / dead resolver
  must not freeze concurrent getProxy callers), re-checks under
  lock to defend against lost updates, and -- if the IP changed --
  swaps the SCMProxyInfo entry and stops the cached proxy via
  RPC.stopProxy. The next getProxy() rebuilds against the new IP.
  On a successful refresh, setUpdatedLeaderNodeID() pins the
  failover ring to the freshly-fixed nodeId rather than advancing.

Path C (Client -> OM via OMFailoverProxyProviderBase):
  OMProxyInfo's rpcAddr becomes mutable behind the existing
  monitor. refreshAddressIfChanged() re-resolves the original
  rpcAddrStr OUTSIDE the entry monitor, swaps the cached address
  (and derived delegation-token service) under lock, nulls the
  cached proxy, and calls RPC.stopProxy outside the monitor so
  the stale Hadoop Client thread + SASL session are torn down.
  shouldRetry calls maybeRefreshCurrentOmAddress on connection-class
  exceptions; on success it sets nextProxyIndex back to the
  current node so RetryInvocationHandler.performFailover does NOT
  walk the failover ring past the freshly-fixed peer. Path D
  (OM <-> OM control plane) uses the same provider machinery and
  is therefore covered by Path C.

Path E (Ratis OM <-> OM and SCM <-> SCM replication):
  Collapsed two createRaftPeer overloads in OzoneManagerRatisServer
  into one that always passes the host:port string -- never a
  resolved IP. SCMRatisServerImpl.buildRaftGroup and
  SCMHAManagerImpl.startRatisServer already used hostnames; removed
  misleading TODO comments and replaced them with explanatory ones.
  gRPC's DnsNameResolver already re-resolves hostnames on
  connection failure.

Connection-class filter (new shared helper):
  hadoop-hdds/utils/ConnectionFailureUtils.isConnectionFailure
  classifies the cause chain (depth-bounded to 16 to defend against
  pathological cycles). Matches ConnectException,
  SocketTimeoutException (the AWS EC2/EKS silent-drop case the JIRA
  description names as load-bearing), NoRouteToHostException,
  UnknownHostException, EOFException, SocketException. Application
  errors (OMException, AccessControlException, OMNotLeaderException,
  RetryAction-coded responses) are correctly excluded so DNS load
  is not amplified by logical failures.

The Hadoop-RPC paths (A, B, C) are gated on a new opt-in config
flag, ozone.client.failover.resolve-needed (default false), mirroring
HBase and HADOOP-17068's precedent of requiring explicit operator
opt-in. The DN refresh threshold is exposed separately via
ozone.datanode.scm.heartbeat.address.refresh.threshold (default 3).

Secure clusters: ozone-default.xml documents that operators
enabling resolve-needed on a Kerberos cluster MUST also set
hadoop.security.token.service.use_ip=false in core-site.xml.
Reason: the default IP-based delegation-token service identifier
does not survive a peer IP refresh; the hostname-based identifier
does. This is the same prerequisite HADOOP-17068 carries.

Tests (~85 across the touched modules, all passing):
  TestConnectionFailureUtils: 20 parameterized tests covering
    every classified type bare and IOException-wrapped, deeply
    nested chains, application-level negative cases, length-2
    cause-chain cycles, and 1024-deep non-matching chains
    (cost bound).
  TestSCMConnectionManager: +5 new tests for resolveLatestAddress
    edge cases, refreshSCMServer happy-path swap, no-op when IP
    is unchanged, no-op when host:port not preserved.
  TestSCMFailoverProxyProviderRefresh: 3 new tests for SCM swap,
    no-op when unchanged, no-op without preserved host:port.
  TestSCMConnectionManagerDnsRefreshE2E: real-RPC swap mechanism
    test (real ScmTestMock RPC server, deliberately stale
    127.0.0.99 cached), retitled to be honest about what it
    proves; @timeout(30) added.
  TestOMProxyInfoDnsRefresh: 4 tests including a stronger no-op
    test that asserts the proxy is NOT discarded on unchanged-IP,
    a rebuilt-proxy test that asserts the lambda is called with
    the freshly-resolved address (not a discarded sentinel), and
    a delegation-token-service update test.
  TestOMFailoverProxyProviderRefreshWired: 5 wired-retry tests
    proving SocketTimeoutException through shouldRetry actually
    invokes maybeRefreshCurrentOmAddress (the EC2 silent-drop
    case is end-to-end exercised, not just helper-in-isolation),
    flag-off invariance, application-level errors do NOT trigger
    refresh, and on refresh-success performFailover stays on
    the same nodeId.
  TestHeartbeatEndpointTaskDnsRefresh: 6 trigger-chain tests
    driving the production HeartbeatEndpointTask.call() catch
    block end-to-end (real production code path) -- proves
    refreshSCMServer fires only when flag enabled, threshold met,
    cause is connection-class, and host:port preserved; proves
    AccessControlException at threshold does NOT trigger refresh;
    proves StateContext queues for the OLD address are gone and
    queues for the NEW address are present after a swap (using
    two genuinely-distinct InetSocketAddress instances so
    migrateEndpoint actually runs).
  TestOzoneManagerRatisServer: +1 test asserting RaftPeer.address
    is a hostname:port string, never an IP:port string.
  Existing regression suites verified non-regressed:
    TestEndPoint (17), TestOMFailoverProxyProvider (8),
    TestOMFailovers (1), TestHeartbeatEndpointTask (8),
    TestOzoneManagerRatisServer (5 prior).

Out of scope:
  - HDFS-14118-style construction-time DNS fan-out (different
    problem).
  - The Ratis quorum-loss exit-0 issue in SCMStateMachine.close()
    (filed separately).
  - Operator-side workaround removal (downstream change once this
    lands and reaches a releasable Ozone tag).
@kerneltime kerneltime force-pushed the HDDS-dns-refresh-all-paths branch from 745052d to 22ea767 Compare June 9, 2026 18:31
@kerneltime kerneltime requested a review from Copilot June 9, 2026 18:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated no new comments.

@kerneltime

Copy link
Copy Markdown
Contributor Author

Used this PR to resolve reviews, will open up a new clean PR for human review.

@kerneltime kerneltime closed this Jun 9, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

Comment on lines +94 to +98
// The hostname we'll preserve. localhost reliably resolves to a
// loopback address in any test environment, and the server is
// bound to a loopback address, so a dial of localhost:port
// succeeds.
String hostAndPort = "localhost:" + port;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;
Comment on lines +76 to +80
InetAddress loopback = InetAddress.getByName("localhost");
InetSocketAddress address = new InetSocketAddress(loopback, 9861);
EndpointStateMachine endpoint = new EndpointStateMachine(
address, "localhost:9861", null, new OzoneConfiguration(), "");
InetSocketAddress refreshed = endpoint.resolveLatestAddress();
Comment on lines +209 to +211
InetAddress loopback = InetAddress.getByName("localhost");
InetSocketAddress address = new InetSocketAddress(loopback, 9861);
connectionManager.addSCMServer(address, "localhost:9861", "");
@kerneltime

Copy link
Copy Markdown
Contributor Author

Closing in favour of #10473, which is a clean re-submission of the same technical change rebased onto current master with a polished PR description for human reviewers. Code content is identical; tests are identical; only the commit history and PR body differ.

Thanks to GitHub Copilot for the multi-round review on this thread — the findings shaped the final design and remain visible here for context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants