Fix UB in number tokenizer: use pointer arithmetic instead of derefer…#5

Merged
rayokota merged 1 commit into master from fix-ub-tokenizer on Jun 15, 2026

rayokota (Owner) commented Jun 15, 2026

…encing end iterator

Summary

The number-literal scanner in Tokenizer.cpp passed &*byteStart and &*byteEnd to std::regex_search, where byteEnd is path_.cend(). Dereferencing a past-the-end iterator (*path_.cend()) is undefined behavior, even when only its address is taken.

Most standard libraries (libstdc++, libc++) happen to tolerate this, but MSVC's checked iterators (_ITERATOR_DEBUG_LEVEL) correctly reject it, causing an assertion failure / test crash on Windows MSVC 2022 builds.
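One way to avoid the `&*it` idiom altogether is to turn an iterator into a pointer by offset rather than by dereference. The sketch below is illustrative only: `to_pointer` is a hypothetical helper, not something in `Tokenizer.cpp`. It assumes the iterator belongs to the string it is passed with.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of this PR): converts a string iterator to a
// raw pointer without dereferencing it, so it is also safe for s.cend().
inline const char* to_pointer(const std::string& s,
                              std::string::const_iterator it) {
    assert(it >= s.cbegin() && it <= s.cend());  // iterator must belong to s
    // Iterator subtraction yields an offset in [0, size()]; data() + offset
    // is a valid pointer for every such offset, including one past the end.
    return s.data() + (it - s.cbegin());
}
```

Because the iterator is only subtracted and never dereferenced, checked-iterator builds have nothing to reject, and the result for `s.cend()` is the same one-past-the-end pointer as `s.data() + s.size()`.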

Fix

Form the regex range from path_.data() pointers instead of dereferencing iterators:

const char* byteStart = path_.data() + byte_offsets_[position_];
const char* byteEnd   = path_.data() + path_.size();
std::cmatch match;
if (std::regex_search(byteStart, byteEnd, match, numregex) && ...)

data() + size() is a valid one-past-the-end pointer (legal to form, never dereferenced), which is exactly what std::regex_search expects for its [first, last) range. std::cmatch already operates on const char* ranges, so the surrounding match handling is unchanged. No behavioral change on conforming platforms — this only removes the UB that MSVC flags.
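For readers who want to try the pattern in isolation, here is a minimal, self-contained sketch of a pointer-range regex scan. Everything in it is an assumption made for illustration: `scan_number_at` is a hypothetical function, the pattern is a stand-in because the real `numregex` is not shown in this PR, and `match_continuous` is used here in place of whatever condition follows `&&` in the actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the tokenizer's number scan. Returns the matched
// text, or an empty string if no number starts at byte `offset`.
inline std::string scan_number_at(const std::string& text, std::size_t offset) {
    assert(offset <= text.size());
    // Illustrative pattern only; the real numregex is not shown in the PR.
    static const std::regex numregex(R"(-?[0-9]+(\.[0-9]+)?)");

    const char* first = text.data() + offset;
    const char* last  = text.data() + text.size();  // formed, never dereferenced

    std::cmatch match;  // match_results specialised for const char* ranges
    // match_continuous requires the match to begin exactly at `first`.
    if (std::regex_search(first, last, match, numregex,
                          std::regex_constants::match_continuous)) {
        return match.str();
    }
    return std::string();
}
```

Note that an offset equal to `text.size()` gives an empty range (`first == last`), which `std::regex_search` handles without touching any character.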

Testing

  • All existing generated tests pass.

Reported by @u19809 while building with MSVC 2022 on Windows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rayokota merged commit 7e0f9eb into master on Jun 15, 2026
4 checks passed