[fix] Respect per-op activation-offload markers in fused grouped MLP by fanshiqing · Pull Request #3128 · NVIDIA/TransformerEngine

fanshiqing · 2026-06-14T14:50:02Z

Description

The fused groupedMLP offloaded it's entire saved set (fc1_input, activation_input and fc2_input) whenever CPU offload is globally enabled, ignored Megatron-LM's per-module offload selection.

e.g., MLM set --offload-modules moe_act while grouped_mlp will also offload fc1_input, unexpected behavior.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>

greptile-apps · 2026-06-14T14:54:41Z

Greptile Summary

Fixes a bug in the fused grouped MLP where CPU activation offloading ignored Megatron-LM's per-module --offload-modules selection and blindly offloaded all three saved tensors (fc1_input, activation_input, fc2_input) whenever offloading was globally enabled.

Introduces saved_activations, a tuple pairing each saved tensor with a per-op boolean flag derived from fine_grained_activation_offloading on fc1_op / activation_op. When any flag is set, tensors not selected for offload are explicitly protected via mark_not_offload (sets _TE_do_not_offload so the hook skips them), and only the selected tensors are passed to start_offload / mark_activation_offload.
When no per-op markers are present (fine_grained_offload=False), all non-None tensors continue to be offloaded, preserving legacy behavior.

Confidence Score: 4/5

Safe to merge for the common case; the fix correctly prevents unwanted offloading of fc1_input when only moe_act is selected.

The fix is logically sound and addresses the described regression. The one remaining concern is that mark_not_offload is invoked even when no CPU offload context is active, causing an unnecessary prepare_for_saving / restore_from_saved cycle on QuantizedTensor types. This is benign today but is dead work on every forward pass where fine_grained_activation_offloading is set without an active offload context.

transformer_engine/pytorch/ops/fused/grouped_mlp.py — specifically the if fine_grained_offload: block around line 1470

Important Files Changed

Filename	Overview
transformer_engine/pytorch/ops/fused/grouped_mlp.py	Adds per-op fine-grained CPU offload selection for the three fused activation tensors (fc1_x, activation_in, fc2_x); core logic is correct but `mark_not_offload` is called outside the `cpu_offloading` guard, executing unnecessarily when offloading is inactive.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Forward pass requires_grad] --> B[Compute offload flags\noffload_fc1_x = fc1_op.fine_grained_activation_offloading\noffload_act = activation_op.fine_grained_activation_offloading]
    B --> C{fine_grained_offload\n= offload_fc1_x OR offload_act}

    C -- False\nlegacy path --> D[cpu_offloading?]
    D -- Yes --> E[start_offload + mark_activation_offload\nall three tensors]
    D -- No --> F[No offloading]

    C -- True\nper-op path --> G[Build keep list: tensors where sel=False]
    G --> H[mark_not_offload on keep\nsets _TE_do_not_offload=True]
    H --> I[cpu_offloading?]
    I -- Yes --> J[start_offload + mark_activation_offload\nonly selected tensors]
    I -- No --> K[No offloading - hook not active]

    J --> L[save_for_backward\npush_tensor hook fires per tensor]
    E --> L
    L --> M{_check_if_offload\nchecks _TE_do_not_offload}
    M -- False --> N[Offload tensor GPU to CPU]
    M -- True --> O[Keep tensor on GPU]

_{Reviews (1): Last reviewed commit: "[fix] Respect per-op activation-offload ..." | Re-trigger Greptile}

greptile-apps · 2026-06-14T14:54:45Z

+            if fine_grained_offload:
+                keep = [t for t, sel in saved_activations if t is not None and not sel]
+                if keep:
+                    mark_not_offload(*keep)


mark_not_offload called when no offload context is active

mark_not_offload is called whenever fine_grained_offload=True, even when cpu_offloading=False (i.e., no offloading context has been entered). In that case the hook (push_tensor) is never installed, so _TE_do_not_offload is never checked. More importantly, for the QuantizedTensor types used here (grouped_fc1_x, activation_in, saved_grouped_fc2_x), mark_not_offload internally calls prepare_for_saving, which is a destructive-then-restore operation — unnecessary work when no offloader is active. Guard the block with if cpu_offloading and fine_grained_offload: to match the start_offload / mark_activation_offload scope below.

[fix] Respect per-op activation-offload markers in fused grouped MLP

ce5dd25

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>

fanshiqing requested a review from timmoon10 as a code owner June 14, 2026 14:50

github-actions Bot added the org-contribution label Jun 14, 2026

fanshiqing mentioned this pull request Jun 14, 2026

[feat] Support fine-grained activation offloading in fused group mlp NVIDIA/Megatron-LM#5082

Open

greptile-apps Bot reviewed Jun 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Respect per-op activation-offload markers in fused grouped MLP#3128

[fix] Respect per-op activation-offload markers in fused grouped MLP#3128
fanshiqing wants to merge 1 commit into
NVIDIA:mainfrom
fanshiqing:fix_fused_mlp_fingrained_offloading

fanshiqing commented Jun 14, 2026

Uh oh!

greptile-apps Bot commented Jun 14, 2026

Uh oh!

greptile-apps Bot Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fanshiqing commented Jun 14, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps Bot commented Jun 14, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant