Related to #2176.
I think there may be a residual race in the same request-timeout / connect-success path fixed in #2176, but with a different symptom: a leaked per-host connection permit for HTTP/1.1.
Version
Observed with:
org.asynchttpclient:async-http-client:3.0.10
- HTTP/1.1
- finite
maxConnectionsPerHost
- request timeouts enabled
I also checked current main and the relevant ordering appears unchanged.
Summary
Under a timeout storm against a peer that accepts TCP/TLS connections but does not send a response or close the socket, AHC may leak a per-host semaphore permit permanently.
The suspected race is:
- A new HTTP/1.1 connection is opened, so a per-host permit is acquired and stored on
NettyResponseFuture.partitionKeyLock.
NettyConnectListener.onSuccess(...) starts running on the event loop.
onSuccess(...) calls future.takePartitionKeyLock(), moving the permit token off the future.
- Before
future.attachChannel(channel, false) runs, the request timeout fires.
TimeoutTimerTask.expire(...) calls:
requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, TimeoutException)
- At this point
future.channel() can still be null, because attachChannel(...) has not run yet.
NettyRequestSender.abort(...) therefore cannot close the newly connected channel.
- The future is aborted and
releasePartitionKeyLock() releases nothing, because onSuccess(...) already took the token.
onSuccess(...) can then attach the token to channel.closeFuture().
- If the remote peer keeps the socket established and never sends FIN/RST,
closeFuture() never fires, so the per-host permit is never released.
Final state: the request future is done/aborted, but the HTTP/1.1 channel is live and holds a per-host permit until the socket or client/JVM is closed.
Why #2176 may not fully cover this
#2176 correctly handles the case where timeoutsHolder == null or future.isDone() is observed by NettyConnectListener.onSuccess(...).
This suspected case is narrower:
- timeout abort sees
future.channel() == null, so it cannot close the concrete channel;
onSuccess(...) has already taken the partition lock from the future;
onSuccess(...) does not necessarily observe future.isDone() before attaching the permit to channel.closeFuture();
- after that, the only release path is channel close.
So #2176 fixes an observed NPE / early-exit problem in this race family, but it may still leave a permit-accounting window.
Source points
Relevant 3.0.10 source paths:
-
NettyConnectListener.onSuccess(...)
- takes the partition lock via
future.takePartitionKeyLock();
- later calls
attachSemaphoreToChannelClose(...);
- only after that path does
writeRequest(...) call future.attachChannel(channel, false).
-
TimeoutTimerTask.expire(...)
- calls
requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, ...).
-
NettyRequestSender.abort(...)
- only closes the channel if the passed
channel is non-null and active.
-
NettyResponseFuture.terminateAndExit(...)
- calls
releasePartitionKeyLock(), but by then the token may already have been taken by onSuccess(...).
-
DefaultChannelPool
- TTL / idle cleanup only scans pooled idle channels, not active/orphaned channels that were never returned to the pool.
Operational symptom
In production, after a burst of request timeouts to one host, we saw sustained TooManyConnectionsPerHostException for that same host.
The important shape was:
- timeout burst first;
- then a flat
TooManyConnectionsPerHostException plateau;
- no gradual decay;
- cleared only when the JVM / AHC client was recycled.
That looks like a fixed number of per-host permits were orphaned and never returned.
Expected behavior
A request timeout racing with connect success should not be able to leave an HTTP/1.1 per-host permit owned only by a live channel that is no longer associated with an active request.
At minimum, either:
- the concrete channel should be attached/published before timeout abort can observe
future.channel() == null;
- or timeout/connect race handling should close the just-connected channel when the future was already aborted;
- or the permit should be released if the future has completed before request ownership is fully transferred to the channel.
Actual behavior
From the source, it looks possible for:
- timeout abort to run with
future.channel() == null;
onSuccess(...) to have already removed the permit token from the future;
- the future abort path to release nothing;
- the permit to be bound to
channel.closeFuture();
- the channel to remain established indefinitely if the peer never closes.
Question
Is this interleaving possible in 3.0.10 / current main?
If yes, should AHC close the channel or release the per-host permit when a request timeout wins before future.attachChannel(...) publishes the channel?
Related to #2176.
I think there may be a residual race in the same request-timeout / connect-success path fixed in #2176, but with a different symptom: a leaked per-host connection permit for HTTP/1.1.
Version
Observed with:
org.asynchttpclient:async-http-client:3.0.10maxConnectionsPerHostI also checked current
mainand the relevant ordering appears unchanged.Summary
Under a timeout storm against a peer that accepts TCP/TLS connections but does not send a response or close the socket, AHC may leak a per-host semaphore permit permanently.
The suspected race is:
NettyResponseFuture.partitionKeyLock.NettyConnectListener.onSuccess(...)starts running on the event loop.onSuccess(...)callsfuture.takePartitionKeyLock(), moving the permit token off the future.future.attachChannel(channel, false)runs, the request timeout fires.TimeoutTimerTask.expire(...)calls:requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, TimeoutException)future.channel()can still benull, becauseattachChannel(...)has not run yet.NettyRequestSender.abort(...)therefore cannot close the newly connected channel.releasePartitionKeyLock()releases nothing, becauseonSuccess(...)already took the token.onSuccess(...)can then attach the token tochannel.closeFuture().closeFuture()never fires, so the per-host permit is never released.Final state: the request future is done/aborted, but the HTTP/1.1 channel is live and holds a per-host permit until the socket or client/JVM is closed.
Why #2176 may not fully cover this
#2176 correctly handles the case where
timeoutsHolder == nullorfuture.isDone()is observed byNettyConnectListener.onSuccess(...).This suspected case is narrower:
future.channel() == null, so it cannot close the concrete channel;onSuccess(...)has already taken the partition lock from the future;onSuccess(...)does not necessarily observefuture.isDone()before attaching the permit tochannel.closeFuture();So #2176 fixes an observed NPE / early-exit problem in this race family, but it may still leave a permit-accounting window.
Source points
Relevant 3.0.10 source paths:
NettyConnectListener.onSuccess(...)future.takePartitionKeyLock();attachSemaphoreToChannelClose(...);writeRequest(...)callfuture.attachChannel(channel, false).TimeoutTimerTask.expire(...)requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, ...).NettyRequestSender.abort(...)channelis non-null and active.NettyResponseFuture.terminateAndExit(...)releasePartitionKeyLock(), but by then the token may already have been taken byonSuccess(...).DefaultChannelPoolOperational symptom
In production, after a burst of request timeouts to one host, we saw sustained
TooManyConnectionsPerHostExceptionfor that same host.The important shape was:
TooManyConnectionsPerHostExceptionplateau;That looks like a fixed number of per-host permits were orphaned and never returned.
Expected behavior
A request timeout racing with connect success should not be able to leave an HTTP/1.1 per-host permit owned only by a live channel that is no longer associated with an active request.
At minimum, either:
future.channel() == null;Actual behavior
From the source, it looks possible for:
future.channel() == null;onSuccess(...)to have already removed the permit token from the future;channel.closeFuture();Question
Is this interleaving possible in 3.0.10 / current
main?If yes, should AHC close the channel or release the per-host permit when a request timeout wins before
future.attachChannel(...)publishes the channel?