GH-3398: Fix potential ClassLoader leak caused by ThreadLocal lambda in Binary.java #3447
Conversation
steveloughran
left a comment
There was a problem hiding this comment.
LGTM.
There is the penalty of the toString() then the bytebuffer creation from that, but as you note: this isn't a common path.
There was a problem hiding this comment.
Pull request overview
This PR addresses a potential ClassLoader pinning/leak risk in Binary.FromCharSequenceBinary by removing a ThreadLocal initialized via a lambda/method reference and switching the CharSequence-to-UTF-8 encoding path to a stateless implementation.
Changes:
- Removed the
ThreadLocal<CharsetEncoder>(and related exception handling) previously used for UTF-8 encoding inFromCharSequenceBinary. - Implemented stateless UTF-8 encoding for
CharSequenceviavalue.toString().getBytes(StandardCharsets.UTF_8). - Cleaned up now-unused imports related to
CharsetEncoder/CharacterCodingExceptionandParquetEncodingException.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private static ByteBuffer encodeUTF8(CharSequence value) { | ||
| try { | ||
| return ENCODER.get().encode(CharBuffer.wrap(value)); | ||
| } catch (CharacterCodingException e) { | ||
| throw new ParquetEncodingException("UTF-8 not supported.", e); | ||
| } | ||
| return ByteBuffer.wrap(value.toString().getBytes(StandardCharsets.UTF_8)); | ||
| } |
There was a problem hiding this comment.
Seems a valid concern, is it? @steveloughran @LuciferYang
There was a problem hiding this comment.
Confirmed. getBytes(UTF_8) replaces malformed input instead of throwing, so it wasn't behavior-preserving and the catch wasn't dead code.
Went with your suggestion and now use a fresh CharsetEncoder per call: it still drops the ThreadLocal/lambda behind the leak, while newEncoder() keeps the default CodingErrorAction.REPORT, so unpaired surrogates still fail fast with a ParquetEncodingException.
Also added tests (valid encoding + malformed-UTF-16 rejection), corrected the misleading "UTF-8 not supported." message, and updated the PR description to match.
…ambda in Binary FromCharSequenceBinary cached a UTF-8 CharsetEncoder in a static ThreadLocal created with ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The method-reference supplier compiles to a synthetic class loaded by the application ClassLoader, so in long-lived pooled-thread environments (Spark/Flink executors, web containers) the ThreadLocal keeps that ClassLoader reachable. The ClassLoader can then never be unloaded, leaking Metaspace. Encode with a fresh CharsetEncoder per call instead, which removes the static lambda reference. A new encoder keeps the default CodingErrorAction.REPORT, so malformed UTF-16 (e.g. an unpaired surrogate) still fails fast with a ParquetEncodingException rather than being silently replaced, as String#getBytes(UTF_8) would do. Add TestBinary coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a supplementary code point and the empty string, cross-checked against String#getBytes(UTF_8)) and for malformed-UTF-16 rejection.
|
Thanks @LuciferYang for fixing this and @steveloughran for the review! |
|
Thank you @wgtmac and @steveloughran |
Rationale for this change
Fixes #3398
Binary.FromCharSequenceBinarycached a UTF-8CharsetEncoderin a staticThreadLocalcreated withThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder). The method-reference supplier compiles to a synthetic class loaded by the application ClassLoader. In long-lived, pooled-thread environments (Spark/Flink executors, web containers), worker threads outlive a job or a hot redeploy, so theThreadLocalkeeps the application ClassLoader reachable. The ClassLoader can then never be unloaded, and Metaspace grows over time until it throwsOutOfMemoryError: Metaspace.What changes are included in this PR?
encodeUTF8(CharSequence)now creates a freshCharsetEncoderper call instead of caching one in aThreadLocal, which removes the static lambda reference that pinned the ClassLoader.A fresh encoder from
newEncoder()keeps the defaultCodingErrorAction.REPORT, so the encoding behavior is unchanged from the previous implementation: malformed UTF-16 (for example, an unpaired surrogate) still fails fast with aParquetEncodingExceptionrather than being silently replaced. The exception message is also corrected — the old "UTF-8 not supported." was misleading, since the failure is malformed input, not a missing charset.fromCharSequence()is only a fallback inAvroWriteSupport.fromAvroString()forCharSequencevalues that are neitherStringnor AvroUtf8; both common paths already encode without aThreadLocal. Allocating an encoder per call on this rare path has no measurable impact.Are these changes tested?
Yes.
TestBinaryadds coverage for valid UTF-8 encoding (ASCII, multi-byte BMP, a supplementary code point, and the empty string, cross-checked againstString#getBytes(UTF_8)), and for malformed UTF-16, which must throwParquetEncodingExceptionwith aCharacterCodingExceptioncause.Are there any user-facing changes?
No. The encoded bytes and the fail-fast behavior on malformed input are unchanged; only the internal encoder lifecycle and the exception message differ.