Skip to content

bound recovery scan to input end in xs_textdecoder_decode#1637

Open
rootvector2 wants to merge 1 commit into
Moddable-OpenSource:publicfrom
rootvector2:textdecoder-recovery-bounds
Open

bound recovery scan to input end in xs_textdecoder_decode#1637
rootvector2 wants to merge 1 commit into
Moddable-OpenSource:publicfrom
rootvector2:textdecoder-recovery-bounds

Conversation

@rootvector2

Copy link
Copy Markdown
Contributor

the illegal-sequence recovery loop in xs_textdecoder_decode scans continuation bytes without checking src < srcEnd, so a streaming decode() that finishes a buffered partial lead with a short final chunk runs src past the input and passes a negative length to the trailing c_memcpy; caught with an asan build of xst.

@phoddie

phoddie commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Interesting find.

I believe the proposed change will avoid the out of bounds read. But I think there may be a bug lurking in that code path that is the real source of the problem. I need to investigate a bit further.

@rootvector2

Copy link
Copy Markdown
Contributor Author

agreed, the bounds check only stops the over-read. the root cause is that the recovery loop scans continuation bytes straight from src and ignores bufferLength. in the streaming completion path the lead and its first continuation come from the saved buffer, so the loop walks src past bytes it never owned, and with no src < srcEnd guard it runs off the end of the view when the backing store holds more continuation-range bytes. that negative srcEnd - src is what poisons the trailing c_memcpy. happy to rework the loop to consume from buffer first the way the legal decode path does if you'd rather fix it at the source. your call on direction.

@phoddie

phoddie commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

I think I'd do something like this:

			const uint8_t *s = &utf8[1];
			while (clen-- > 0) {
				uint8_t c = *s++;
				if ((lower <= c) && (c <= upper)) {
					if (bufferLength) {
						bufferLength--;
						buffer++;
					}
					else
						src++;
				}
				else
					break;
			}

It reuses utf8 to avoid having to decide whether to read from buffer or src. Something similar would be in the second loop as well.

…_decode

the illegal-sequence recovery loop read continuation bytes straight from src and advanced src for each one, ignoring the bytes carried over in the streaming buffer. completing a buffered partial lead with a short final chunk walked src past the view end and passed a negative length to the trailing c_memcpy. read the already-assembled utf8 bytes and advance buffer or src the same way the legal decode path does, in both passes.
@rootvector2 rootvector2 force-pushed the textdecoder-recovery-bounds branch from f690635 to 8e8b58c Compare June 16, 2026 09:12
@rootvector2

Copy link
Copy Markdown
Contributor Author

makes sense, that's the real fix. reworked both loops to read the continuation bytes from utf8 and advance buffer before src, so recovery counts the buffered bytes the same way the legal path does and never runs src past the view. side effect is that F0 80 90 90 now decodes to four U+FFFD per spec instead of the two my old guard produced. asan build is clean and decode.js / decodestream.js / surrogates.js pass; bumped the decodestream.js regression to the four-replacement result.

@phoddie

phoddie commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Thanks for reviewing, testing, and integrating. We'll get this merged.

@mkellner

Copy link
Copy Markdown
Collaborator

This has been merged and will be available in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants