GH-3398: Fix potential ClassLoader leak caused by ThreadLocal lambda in Binary.java by LuciferYang · Pull Request #3447 · apache/parquet-java

LuciferYang · 2026-03-16T12:01:08Z

Rationale for this change

Binary.FromCharSequenceBinary cached a UTF-8 CharsetEncoder in a static ThreadLocal created with ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The method-reference supplier compiles to a synthetic class loaded by the application ClassLoader. In long-lived, pooled-thread environments (Spark/Flink executors, web containers), worker threads outlive a job or a hot redeploy, so the ThreadLocal keeps the application ClassLoader reachable. The ClassLoader can then never be unloaded, and Metaspace grows over time until the JVM throws OutOfMemoryError: Metaspace.
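
For reference, the pattern described above looks roughly like the following sketch. The class name CachedEncoderSketch is hypothetical and only stands in for the relevant part of Binary.FromCharSequenceBinary, and IllegalArgumentException is used in place of ParquetEncodingException so the snippet compiles without Parquet on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the relevant part of Binary.FromCharSequenceBinary.
final class CachedEncoderSketch {
  // Static ThreadLocal initialized from a method reference. Per the PR
  // description, this supplier is what keeps the application ClassLoader
  // reachable from long-lived pooled threads.
  private static final ThreadLocal<CharsetEncoder> ENCODER =
      ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

  static ByteBuffer encodeUTF8(CharSequence value) {
    try {
      // encode(CharBuffer) resets the encoder before encoding, so one
      // encoder per thread can be reused across calls.
      return ENCODER.get().encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      // Stand-in (assumption) for ParquetEncodingException; the message is the old one.
      throw new IllegalArgumentException("UTF-8 not supported.", e);
    }
  }
}
```

The encoding itself is correct here; the concern raised in this PR is only about what the static ThreadLocal keeps reachable.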

What changes are included in this PR?

encodeUTF8(CharSequence) now creates a fresh CharsetEncoder per call instead of caching one in a ThreadLocal, which removes the static lambda reference that pinned the ClassLoader.

A fresh encoder from newEncoder() keeps the default CodingErrorAction.REPORT, so the encoding behavior is unchanged from the previous implementation: malformed UTF-16 (for example, an unpaired surrogate) still fails fast with a ParquetEncodingException rather than being silently replaced. The exception message is also corrected: the old "UTF-8 not supported." was misleading, since the failure is malformed input, not a missing charset.
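
A minimal sketch of the per-call approach, under stated assumptions: FreshEncoderSketch is a hypothetical class name, IllegalArgumentException stands in for ParquetEncodingException, and the exception message is illustrative rather than the text actually used in the PR.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual Binary.java code.
final class FreshEncoderSketch {
  static ByteBuffer encodeUTF8(CharSequence value) {
    try {
      // A new encoder per call: no static state, and the default
      // CodingErrorAction.REPORT makes malformed input throw.
      return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    } catch (CharacterCodingException e) {
      // Stand-in (assumption) for ParquetEncodingException; message is illustrative.
      throw new IllegalArgumentException("Malformed input: cannot encode as UTF-8", e);
    }
  }
}
```

With this shape, encodeUTF8("a\uD800b") throws because the high surrogate is not followed by a low surrogate, while well-formed input yields the same bytes as String#getBytes(UTF_8).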

fromCharSequence() is only a fallback in AvroWriteSupport.fromAvroString() for CharSequence values that are neither String nor Avro Utf8; both common paths already encode without a ThreadLocal. Allocating an encoder per call on this rare path has no measurable impact.

Are these changes tested?

Yes. TestBinary adds coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a supplementary code point, and the empty string, cross-checked against String#getBytes(UTF_8)), and for malformed UTF-16, which must throw ParquetEncodingException with a CharacterCodingException cause.
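
The test matrix described above can be approximated without JUnit or Parquet as follows. This is only a sketch: EncodeChecks and its method names are hypothetical, and it exercises a plain CharsetEncoder rather than Binary.fromCharSequence itself.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check harness mirroring the cases listed for TestBinary.
final class EncodeChecks {
  static byte[] strictEncode(CharSequence value) throws CharacterCodingException {
    ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
    byte[] out = new byte[buf.remaining()];
    buf.get(out);
    return out;
  }

  // True when the strict encoder and String#getBytes agree on the bytes.
  static boolean matchesGetBytes(String s) {
    try {
      return Arrays.equals(strictEncode(s), s.getBytes(StandardCharsets.UTF_8));
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  // True when the strict encoder refuses the input.
  static boolean rejects(String s) {
    try {
      strictEncode(s);
      return false;
    } catch (CharacterCodingException e) {
      return true;
    }
  }

  static boolean runAll() {
    // ASCII, multi-byte BMP, a supplementary code point (surrogate pair), empty string.
    String[] valid = {"hello", "h\u00e9llo \u4e16\u754c", "\uD83D\uDE00", ""};
    for (String s : valid) {
      if (!matchesGetBytes(s)) {
        return false;
      }
    }
    // Unpaired high surrogate and unpaired low surrogate are malformed UTF-16.
    return rejects("\uD800") && rejects("\uDC00x");
  }
}
```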

Are there any user-facing changes?

No. The encoded bytes and the fail-fast behavior on malformed input are unchanged; only the internal encoder lifecycle and the exception message differ.

steveloughran

LGTM.

There is the penalty of the toString() and then the ByteBuffer creation from that, but as you note, this isn't a common path.

Copilot

Pull request overview

This PR addresses a potential ClassLoader pinning/leak risk in Binary.FromCharSequenceBinary by removing a ThreadLocal initialized via a lambda/method reference and switching the CharSequence-to-UTF-8 encoding path to a stateless implementation.

Changes:

Removed the ThreadLocal<CharsetEncoder> (and related exception handling) previously used for UTF-8 encoding in FromCharSequenceBinary.
Implemented stateless UTF-8 encoding for CharSequence via value.toString().getBytes(StandardCharsets.UTF_8).
Cleaned up now-unused imports related to CharsetEncoder/CharacterCodingException and ParquetEncodingException.


wgtmac · 2026-06-08T02:44:34Z

    private static ByteBuffer encodeUTF8(CharSequence value) {
-      try {
-        return ENCODER.get().encode(CharBuffer.wrap(value));
-      } catch (CharacterCodingException e) {
-        throw new ParquetEncodingException("UTF-8 not supported.", e);
-      }
+      return ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8));
    }


Seems a valid concern, is it? @steveloughran @LuciferYang

Confirmed. getBytes(UTF_8) replaces malformed input instead of throwing, so it wasn't behavior-preserving and the catch wasn't dead code.

Went with your suggestion and now use a fresh CharsetEncoder per call: it still drops the ThreadLocal/lambda behind the leak, while newEncoder() keeps the default CodingErrorAction.REPORT, so unpaired surrogates still fail fast with a ParquetEncodingException.

Also added tests (valid encoding + malformed-UTF-16 rejection), corrected the misleading "UTF-8 not supported." message, and updated the PR description to match.
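
To make the behavioral difference discussed in this thread concrete, the following sketch shows what String#getBytes(UTF_8) does with an unpaired surrogate. The class and method names are hypothetical; the replacement behavior itself is documented for String#getBytes(Charset).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo of the lenient path that the first revision of the PR used.
final class GetBytesReplacementDemo {
  static byte[] lenientEncode(CharSequence value) {
    // getBytes(Charset) never throws on malformed input; it substitutes the
    // charset's replacement bytes, which for UTF-8 is a single '?' (0x3F).
    return value.toString().getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(lenientEncode("\uD800")));   // prints [63]
    System.out.println(Arrays.toString(lenientEncode("a\uD800b"))); // prints [97, 63, 98]
  }
}
```

The malformed character is silently turned into '?', which is why the original try/catch around the encoder was not dead code.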

…ambda in Binary

FromCharSequenceBinary cached a UTF-8 CharsetEncoder in a static ThreadLocal created with ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The method-reference supplier compiles to a synthetic class loaded by the application ClassLoader, so in long-lived pooled-thread environments (Spark/Flink executors, web containers) the ThreadLocal keeps that ClassLoader reachable. The ClassLoader can then never be unloaded, leaking Metaspace.

Encode with a fresh CharsetEncoder per call instead, which removes the static lambda reference. A new encoder keeps the default CodingErrorAction.REPORT, so malformed UTF-16 (e.g. an unpaired surrogate) still fails fast with a ParquetEncodingException rather than being silently replaced, as String#getBytes(UTF_8) would do.

Add TestBinary coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a supplementary code point and the empty string, cross-checked against String#getBytes(UTF_8)) and for malformed-UTF-16 rejection.

wgtmac · 2026-06-08T05:59:22Z

Thanks @LuciferYang for fixing this and @steveloughran for the review!

LuciferYang · 2026-06-08T06:04:37Z

Thank you @wgtmac and @steveloughran

steveloughran approved these changes Jun 3, 2026

wgtmac requested a review from Copilot June 8, 2026 02:27

Copilot started reviewing on behalf of wgtmac June 8, 2026 02:27

Copilot AI reviewed Jun 8, 2026

LuciferYang force-pushed the GH-3398 branch from 0d862b1 to f3182ba June 8, 2026 03:55

wgtmac approved these changes Jun 8, 2026

wgtmac merged commit 65f7ade into apache:master Jun 8, 2026
5 checks passed
