GH-3398: Fix potential ClassLoader leak caused by ThreadLocal lambda in Binary.java #3447

Merged
wgtmac merged 1 commit into apache:master from LuciferYang:GH-3398
Jun 8, 2026

LuciferYang (Contributor) commented Mar 16, 2026

Rationale for this change

Fixes #3398

Binary.FromCharSequenceBinary cached a UTF-8 CharsetEncoder in a static ThreadLocal created with ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The method-reference supplier compiles to a synthetic class loaded by the application ClassLoader. In long-lived, pooled-thread environments (Spark/Flink executors, web containers), worker threads outlive a job or a hot redeploy, so the ThreadLocal keeps the application ClassLoader reachable. The ClassLoader can then never be unloaded, and Metaspace usage grows over time until the JVM throws OutOfMemoryError: Metaspace.
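For reference, the pattern described above can be sketched roughly as follows. This is a minimal reconstruction for illustration, not the original source: the class name CachedEncoderSketch is hypothetical, and IllegalStateException stands in for ParquetEncodingException so the snippet compiles without Parquet on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; only the ThreadLocal pattern mirrors the description above.
final class CachedEncoderSketch {
  // The method reference is turned into a supplier object whose synthetic class
  // belongs to the ClassLoader that loaded this class; the static field keeps it alive.
  private static final ThreadLocal<CharsetEncoder> ENCODER =
      ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

  static ByteBuffer encodeUTF8(CharSequence value) {
    try {
      // CharsetEncoder.encode(CharBuffer) resets the encoder first, so reuse per thread is safe.
      return ENCODER.get().encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      // Stand-in for ParquetEncodingException (assumption, to stay self-contained).
      throw new IllegalStateException("UTF-8 not supported.", e);
    }
  }
}
```

Each pooled thread that calls encodeUTF8 ends up with its own cached encoder for as long as the thread lives, which is the lifetime mismatch the description refers to.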

What changes are included in this PR?

encodeUTF8(CharSequence) now creates a fresh CharsetEncoder per call instead of caching one in a ThreadLocal, which removes the static lambda reference that pinned the ClassLoader.

A fresh encoder from newEncoder() keeps the default CodingErrorAction.REPORT, so the encoding behavior is unchanged from the previous implementation: malformed UTF-16 (for example, an unpaired surrogate) still fails fast with a ParquetEncodingException rather than being silently replaced. The exception message is also corrected: the old "UTF-8 not supported." was misleading, since the failure is caused by malformed input, not a missing charset.
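A minimal sketch of the per-call approach, under stated assumptions: FreshEncoderSketch and its nested EncodingException are hypothetical stand-ins (the latter for ParquetEncodingException), and the message text is illustrative rather than the wording used in the merged change.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical class; illustrates "fresh CharsetEncoder per call" only.
final class FreshEncoderSketch {
  /** Hypothetical stand-in for ParquetEncodingException. */
  static final class EncodingException extends RuntimeException {
    EncodingException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  static ByteBuffer encodeUTF8(CharSequence value) {
    try {
      // newEncoder() starts with CodingErrorAction.REPORT for malformed input,
      // so an unpaired surrogate raises CharacterCodingException instead of being replaced.
      return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      // Illustrative message (assumption): points at the input, not at charset support.
      throw new EncodingException("Failed to encode CharSequence as UTF-8", e);
    }
  }
}
```

No static state is involved, so nothing in this path outlives the call.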

fromCharSequence() is only a fallback in AvroWriteSupport.fromAvroString() for CharSequence values that are neither String nor Avro Utf8; both common paths already encode without a ThreadLocal. Allocating an encoder per call on this rare path has no measurable impact.

Are these changes tested?

Yes. TestBinary adds coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a supplementary code point, and the empty string, cross-checked against String#getBytes(UTF_8)), and for malformed UTF-16, which must throw ParquetEncodingException with a CharacterCodingException cause.
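The cross-check for valid input could look roughly like the sketch below. It is not the TestBinary code: Utf8CrossCheckSketch, its sample strings, and the helper names are assumptions chosen to cover the same four categories without depending on JUnit or Parquet.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; mirrors the categories listed above, not the real test class.
final class Utf8CrossCheckSketch {
  static final String[] SAMPLES = {
    "hello",                 // ASCII
    "h\u00e9llo \u4e16\u754c", // multi-byte BMP characters
    "\uD83D\uDE00",          // supplementary code point U+1F600 as a valid surrogate pair
    ""                       // empty string
  };

  // Encodes with a fresh strict encoder and copies out exactly the encoded bytes.
  static byte[] encodeStrict(CharSequence value) throws CharacterCodingException {
    ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    byte[] out = new byte[buf.remaining()];
    buf.get(out);
    return out;
  }

  // True when every sample encodes to the same bytes as String#getBytes(UTF_8).
  static boolean allSamplesMatch() {
    try {
      for (String s : SAMPLES) {
        if (!Arrays.equals(encodeStrict(s), s.getBytes(StandardCharsets.UTF_8))) {
          return false;
        }
      }
      return true;
    } catch (CharacterCodingException e) {
      return false; // well-formed samples should never reach this branch
    }
  }
}
```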

Are there any user-facing changes?

No. The encoded bytes and the fail-fast behavior on malformed input are unchanged; only the internal encoder lifecycle and the exception message differ.

steveloughran (Contributor) left a comment

LGTM.

There is the cost of the toString() call followed by creating the ByteBuffer from the result, but as you note, this isn't a common path.

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses a potential ClassLoader pinning/leak risk in Binary.FromCharSequenceBinary by removing a ThreadLocal initialized via a lambda/method reference and switching the CharSequence-to-UTF-8 encoding path to a stateless implementation.

Changes:

  • Removed the ThreadLocal<CharsetEncoder> (and related exception handling) previously used for UTF-8 encoding in FromCharSequenceBinary.
  • Implemented stateless UTF-8 encoding for CharSequence via value.toString().getBytes(StandardCharsets.UTF_8).
  • Cleaned up now-unused imports related to CharsetEncoder/CharacterCodingException and ParquetEncodingException.

Comment on lines 267 to 269
   private static ByteBuffer encodeUTF8(CharSequence value) {
-    try {
-      return ENCODER.get().encode(CharBuffer.wrap(value));
-    } catch (CharacterCodingException e) {
-      throw new ParquetEncodingException("UTF-8 not supported.", e);
-    }
+    return ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
   }

Member

Seems a valid concern, is it? @steveloughran @LuciferYang

Contributor Author

Confirmed. getBytes(UTF_8) replaces malformed input instead of throwing, so it wasn't behavior-preserving and the catch wasn't dead code.
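To make the difference concrete, here is a small sketch contrasting the two paths on an unpaired surrogate. MalformedInputSketch and its method names are hypothetical; only standard JDK calls are used.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class contrasting lenient and strict UTF-8 encoding.
final class MalformedInputSketch {
  // 'a', a lone high surrogate (malformed UTF-16), then 'b'.
  static final String UNPAIRED = "a\uD800b";

  // String#getBytes(Charset) replaces malformed input with the charset's
  // replacement bytes; for UTF-8 that is '?' (0x3F), so this yields a, ?, b.
  static byte[] lenientBytes() {
    return UNPAIRED.getBytes(StandardCharsets.UTF_8);
  }

  // A fresh encoder reports malformed input by throwing, so this returns true.
  static boolean strictRejects() {
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(UNPAIRED));
      return false;
    } catch (CharacterCodingException e) {
      return true;
    }
  }
}
```

The lenient path silently changes the data, which is why the earlier getBytes-based revision was not behavior-preserving.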

Went with your suggestion and now use a fresh CharsetEncoder per call: it still drops the ThreadLocal/lambda behind the leak, while newEncoder() keeps the default CodingErrorAction.REPORT, so unpaired surrogates still fail fast with a ParquetEncodingException.

Also added tests (valid encoding + malformed-UTF-16 rejection), corrected the misleading "UTF-8 not supported." message, and updated the PR description to match.

…ambda in Binary

FromCharSequenceBinary cached a UTF-8 CharsetEncoder in a static ThreadLocal
created with ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The
method-reference supplier compiles to a synthetic class loaded by the
application ClassLoader, so in long-lived pooled-thread environments
(Spark/Flink executors, web containers) the ThreadLocal keeps that ClassLoader
reachable. The ClassLoader can then never be unloaded, leaking Metaspace.

Encode with a fresh CharsetEncoder per call instead, which removes the static
lambda reference. A new encoder keeps the default CodingErrorAction.REPORT, so
malformed UTF-16 (e.g. an unpaired surrogate) still fails fast with a
ParquetEncodingException rather than being silently replaced, as
String#getBytes(UTF_8) would do.

Add TestBinary coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a
supplementary code point and the empty string, cross-checked against
String#getBytes(UTF_8)) and for malformed-UTF-16 rejection.

wgtmac merged commit 65f7ade into apache:master Jun 8, 2026
5 checks passed

wgtmac commented Jun 8, 2026

Member

Thanks @LuciferYang for fixing this and @steveloughran for the review!

@LuciferYang

Contributor Author

Thank you @wgtmac and @steveloughran

Development

Successfully merging this pull request may close these issues.

[Bug] Potential ClassLoader Leak: ThreadLocal.withInitial lambda in Binary.java pins ClassLoader causing Metaspace OOM

4 participants