[BUG] multi_tensor_apply: int32 overflow in TensorListMetadata::sizes causes illegal memory access for tensors with numel > INT_MAX #2918

`multi_tensor_apply` silently truncates per-tensor sizes from 64-bit to 32-bit, causing an illegal memory access when any input tensor has `numel() > INT_MAX` (2,147,483,647).

In `transformer_engine/common/multi_tensor/multi_tensor_apply.cuh`, `TensorListMetadataBase::sizes` is declared as `int sizes[...]` (int32), but it is populated from `Tensor::numel()` (size_t / int64):

```
  // multi_tensor_apply.cuh:24
  int sizes[depth_to_max_tensors[n - 1]];
  ...
  // multi_tensor_apply.cuh:68
  tl.sizes[loc_tensor_info] = tensor_lists[0][t]->numel();   // int64 -> int32, silent truncation
```

For a tensor with `numel() = 2,476,250,368` (e.g. an embedding of shape `[19345706, 128]`), the stored value wraps to 2,476,250,368 - 2^32 = -1,818,716,928. Downstream kernels (e.g. `multi_tensor_l2norm_kernel`) compute element offsets from this negative size, which produces out-of-bounds global-memory accesses and the following error at the next CUDA sync:

RuntimeError: .../multi_tensor_apply.cuh:92 in function multi_tensor_apply:
CUDA Error: an illegal memory access was encountered
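
The wraparound arithmetic above can be reproduced without a GPU. The following is a minimal standalone sketch, not code from Transformer Engine: `truncate_to_int32` and `example_numel` are hypothetical helper names introduced only for illustration, and the cast goes through `uint32_t` to make the modulo-2^32 step explicit.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Transformer Engine): mimics storing a
// 64-bit element count into a 32-bit signed field. Casting through uint32_t
// keeps only the low 32 bits; the final conversion to int32_t reinterprets
// them as a two's-complement value (well defined since C++20, and the
// observed behavior on the platforms CUDA targets).
inline std::int32_t truncate_to_int32(std::uint64_t numel) {
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(numel));
}

// Element count of the [19345706, 128] embedding from the report.
inline std::uint64_t example_numel() {
  return static_cast<std::uint64_t>(19345706) * 128;  // 2,476,250,368
}
```

Calling `truncate_to_int32(example_numel())` yields -1,818,716,928, matching the value derived above.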

Any workload that passes a tensor with `numel() > INT_MAX` to TE's multi_tensor utilities hits this. In particular, `megatron.training.utils.calc_params_l2_norm` → `multi_tensor_applier(multi_tensor_l2norm, ...)` crashes for any model containing a single parameter with more than about 2.147 billion elements (common for large-vocab embeddings, tied output layers, over-encoding tables, etc.).
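
One possible mitigation is to fail loudly on the host instead of truncating silently. The sketch below assumes the `sizes` field stays 32-bit and shows a checked narrowing conversion. `checked_numel_to_int` is a hypothetical name, not an existing Transformer Engine function, and this is not necessarily the fix the maintainers would choose.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical guard (not part of Transformer Engine): narrows a 64-bit
// element count to int, throwing instead of wrapping when it does not fit.
inline int checked_numel_to_int(std::uint64_t numel) {
  constexpr auto kMax =
      static_cast<std::uint64_t>(std::numeric_limits<int>::max());
  if (numel > kMax) {
    throw std::overflow_error("tensor numel " + std::to_string(numel) +
                              " does not fit in a 32-bit int size field");
  }
  return static_cast<int>(numel);
}
```

The alternative is to widen the field itself (for example to `int64_t`) so that large tensors work rather than being rejected. That would enlarge the metadata struct, which may matter if the struct is passed by value as a kernel argument.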
