Skip to content

Possible HTTP/1.1 per-host permit leak on request-timeout / connect-success race #2189

@alex-ro

Description

@alex-ro

Related to #2176.

I think there may be a residual race in the same request-timeout / connect-success path fixed in #2176, but with a different symptom: a leaked per-host connection permit for HTTP/1.1.

Version

Observed with:

  • org.asynchttpclient:async-http-client:3.0.10
  • HTTP/1.1
  • finite maxConnectionsPerHost
  • request timeouts enabled

I also checked current main and the relevant ordering appears unchanged.

Summary

Under a timeout storm against a peer that accepts TCP/TLS connections but does not send a response or close the socket, AHC may leak a per-host semaphore permit permanently.

The suspected race is:

  1. A new HTTP/1.1 connection is opened, so a per-host permit is acquired and stored on NettyResponseFuture.partitionKeyLock.
  2. NettyConnectListener.onSuccess(...) starts running on the event loop.
  3. onSuccess(...) calls future.takePartitionKeyLock(), moving the permit token off the future.
  4. Before future.attachChannel(channel, false) runs, the request timeout fires.
  5. TimeoutTimerTask.expire(...) calls:
    requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, TimeoutException)
  6. At this point future.channel() can still be null, because attachChannel(...) has not run yet.
  7. NettyRequestSender.abort(...) therefore cannot close the newly connected channel.
  8. The future is aborted and releasePartitionKeyLock() releases nothing, because onSuccess(...) already took the token.
  9. onSuccess(...) can then attach the token to channel.closeFuture().
  10. If the remote peer keeps the socket established and never sends FIN/RST, closeFuture() never fires, so the per-host permit is never released.

Final state: the request future is done/aborted, but the HTTP/1.1 channel is live and holds a per-host permit until the socket or client/JVM is closed.

Why #2176 may not fully cover this

#2176 correctly handles the case where timeoutsHolder == null or future.isDone() is observed by NettyConnectListener.onSuccess(...).

This suspected case is narrower:

  • timeout abort sees future.channel() == null, so it cannot close the concrete channel;
  • onSuccess(...) has already taken the partition lock from the future;
  • onSuccess(...) does not necessarily observe future.isDone() before attaching the permit to channel.closeFuture();
  • after that, the only release path is channel close.

So #2176 fixes an observed NPE / early-exit problem in this race family, but it may still leave a permit-accounting window.

Source points

Relevant 3.0.10 source paths:

  • NettyConnectListener.onSuccess(...)

    • takes the partition lock via future.takePartitionKeyLock();
    • later calls attachSemaphoreToChannelClose(...);
    • only after that path does writeRequest(...) call future.attachChannel(channel, false).
  • TimeoutTimerTask.expire(...)

    • calls requestSender.abort(nettyResponseFuture.channel(), nettyResponseFuture, ...).
  • NettyRequestSender.abort(...)

    • only closes the channel if the passed channel is non-null and active.
  • NettyResponseFuture.terminateAndExit(...)

    • calls releasePartitionKeyLock(), but by then the token may already have been taken by onSuccess(...).
  • DefaultChannelPool

    • TTL / idle cleanup only scans pooled idle channels, not active/orphaned channels that were never returned to the pool.

Operational symptom

In production, after a burst of request timeouts to one host, we saw sustained TooManyConnectionsPerHostException for that same host.

The important shape was:

  • timeout burst first;
  • then a flat TooManyConnectionsPerHostException plateau;
  • no gradual decay;
  • cleared only when the JVM / AHC client was recycled.

That looks like a fixed number of per-host permits were orphaned and never returned.

Expected behavior

A request timeout racing with connect success should not be able to leave an HTTP/1.1 per-host permit owned only by a live channel that is no longer associated with an active request.

At minimum, either:

  • the concrete channel should be attached/published before timeout abort can observe future.channel() == null;
  • or timeout/connect race handling should close the just-connected channel when the future was already aborted;
  • or the permit should be released if the future has completed before request ownership is fully transferred to the channel.

Actual behavior

From the source, it looks possible for:

  • timeout abort to run with future.channel() == null;
  • onSuccess(...) to have already removed the permit token from the future;
  • the future abort path to release nothing;
  • the permit to be bound to channel.closeFuture();
  • the channel to remain established indefinitely if the peer never closes.

Question

Is this interleaving possible in 3.0.10 / current main?

If yes, should AHC close the channel or release the per-host permit when a request timeout wins before future.attachChannel(...) publishes the channel?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions