Skip to content

[fix] Respect per-op activation-offload markers in fused grouped MLP#3128

Open
fanshiqing wants to merge 1 commit into
NVIDIA:mainfrom
fanshiqing:fix_fused_mlp_fingrained_offloading
Open

[fix] Respect per-op activation-offload markers in fused grouped MLP#3128
fanshiqing wants to merge 1 commit into
NVIDIA:mainfrom
fanshiqing:fix_fused_mlp_fingrained_offloading

Conversation

@fanshiqing

Copy link
Copy Markdown
Member

Description

The fused groupedMLP offloaded it's entire saved set (fc1_input, activation_input and fc2_input) whenever CPU offload is globally enabled, ignored Megatron-LM's per-module offload selection.

  • e.g., MLM set --offload-modules moe_act while grouped_mlp will also offload fc1_input, unexpected behavior.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes a bug in the fused grouped MLP where CPU activation offloading ignored Megatron-LM's per-module --offload-modules selection and blindly offloaded all three saved tensors (fc1_input, activation_input, fc2_input) whenever offloading was globally enabled.

  • Introduces saved_activations, a tuple pairing each saved tensor with a per-op boolean flag derived from fine_grained_activation_offloading on fc1_op / activation_op. When any flag is set, tensors not selected for offload are explicitly protected via mark_not_offload (sets _TE_do_not_offload so the hook skips them), and only the selected tensors are passed to start_offload / mark_activation_offload.
  • When no per-op markers are present (fine_grained_offload=False), all non-None tensors continue to be offloaded, preserving legacy behavior.

Confidence Score: 4/5

Safe to merge for the common case; the fix correctly prevents unwanted offloading of fc1_input when only moe_act is selected.

The fix is logically sound and addresses the described regression. The one remaining concern is that mark_not_offload is invoked even when no CPU offload context is active, causing an unnecessary prepare_for_saving / restore_from_saved cycle on QuantizedTensor types. This is benign today but is dead work on every forward pass where fine_grained_activation_offloading is set without an active offload context.

transformer_engine/pytorch/ops/fused/grouped_mlp.py — specifically the if fine_grained_offload: block around line 1470

Important Files Changed

Filename Overview
transformer_engine/pytorch/ops/fused/grouped_mlp.py Adds per-op fine-grained CPU offload selection for the three fused activation tensors (fc1_x, activation_in, fc2_x); core logic is correct but mark_not_offload is called outside the cpu_offloading guard, executing unnecessarily when offloading is inactive.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Forward pass requires_grad] --> B[Compute offload flags\noffload_fc1_x = fc1_op.fine_grained_activation_offloading\noffload_act = activation_op.fine_grained_activation_offloading]
    B --> C{fine_grained_offload\n= offload_fc1_x OR offload_act}

    C -- False\nlegacy path --> D[cpu_offloading?]
    D -- Yes --> E[start_offload + mark_activation_offload\nall three tensors]
    D -- No --> F[No offloading]

    C -- True\nper-op path --> G[Build keep list: tensors where sel=False]
    G --> H[mark_not_offload on keep\nsets _TE_do_not_offload=True]
    H --> I[cpu_offloading?]
    I -- Yes --> J[start_offload + mark_activation_offload\nonly selected tensors]
    I -- No --> K[No offloading - hook not active]

    J --> L[save_for_backward\npush_tensor hook fires per tensor]
    E --> L
    L --> M{_check_if_offload\nchecks _TE_do_not_offload}
    M -- False --> N[Offload tensor GPU to CPU]
    M -- True --> O[Keep tensor on GPU]
Loading

Reviews (1): Last reviewed commit: "[fix] Respect per-op activation-offload ..." | Re-trigger Greptile

Comment on lines +1470 to +1473
if fine_grained_offload:
keep = [t for t, sel in saved_activations if t is not None and not sel]
if keep:
mark_not_offload(*keep)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 mark_not_offload called when no offload context is active

mark_not_offload is called whenever fine_grained_offload=True, even when cpu_offloading=False (i.e., no offloading context has been entered). In that case the hook (push_tensor) is never installed, so _TE_do_not_offload is never checked. More importantly, for the QuantizedTensor types used here (grouped_fc1_x, activation_in, saved_grouped_fc2_x), mark_not_offload internally calls prepare_for_saving, which is a destructive-then-restore operation — unnecessary work when no offloader is active. Guard the block with if cpu_offloading and fine_grained_offload: to match the start_offload / mark_activation_offload scope below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant