[fix] Respect per-op activation-offload markers in fused grouped MLP#3128
[fix] Respect per-op activation-offload markers in fused grouped MLP#3128fanshiqing wants to merge 1 commit into
Conversation
Signed-off-by: Shiqing Fan <shiqingf@nvidia.com>
Greptile SummaryFixes a bug in the fused grouped MLP where CPU activation offloading ignored Megatron-LM's per-module
Confidence Score: 4/5Safe to merge for the common case; the fix correctly prevents unwanted offloading of fc1_input when only moe_act is selected. The fix is logically sound and addresses the described regression. The one remaining concern is that transformer_engine/pytorch/ops/fused/grouped_mlp.py — specifically the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Forward pass requires_grad] --> B[Compute offload flags\noffload_fc1_x = fc1_op.fine_grained_activation_offloading\noffload_act = activation_op.fine_grained_activation_offloading]
B --> C{fine_grained_offload\n= offload_fc1_x OR offload_act}
C -- False\nlegacy path --> D[cpu_offloading?]
D -- Yes --> E[start_offload + mark_activation_offload\nall three tensors]
D -- No --> F[No offloading]
C -- True\nper-op path --> G[Build keep list: tensors where sel=False]
G --> H[mark_not_offload on keep\nsets _TE_do_not_offload=True]
H --> I[cpu_offloading?]
I -- Yes --> J[start_offload + mark_activation_offload\nonly selected tensors]
I -- No --> K[No offloading - hook not active]
J --> L[save_for_backward\npush_tensor hook fires per tensor]
E --> L
L --> M{_check_if_offload\nchecks _TE_do_not_offload}
M -- False --> N[Offload tensor GPU to CPU]
M -- True --> O[Keep tensor on GPU]
Reviews (1): Last reviewed commit: "[fix] Respect per-op activation-offload ..." | Re-trigger Greptile |
| if fine_grained_offload: | ||
| keep = [t for t, sel in saved_activations if t is not None and not sel] | ||
| if keep: | ||
| mark_not_offload(*keep) |
There was a problem hiding this comment.
mark_not_offload called when no offload context is active
mark_not_offload is called whenever fine_grained_offload=True, even when cpu_offloading=False (i.e., no offloading context has been entered). In that case the hook (push_tensor) is never installed, so _TE_do_not_offload is never checked. More importantly, for the QuantizedTensor types used here (grouped_fc1_x, activation_in, saved_grouped_fc2_x), mark_not_offload internally calls prepare_for_saving, which is a destructive-then-restore operation — unnecessary work when no offloader is active. Guard the block with if cpu_offloading and fine_grained_offload: to match the start_offload / mark_activation_offload scope below.
Description
The fused groupedMLP offloaded it's entire saved set (fc1_input, activation_input and fc2_input) whenever CPU offload is globally enabled, ignored Megatron-LM's per-module offload selection.
--offload-modules moe_actwhile grouped_mlp will also offloadfc1_input, unexpected behavior.Type of change
Changes
Please list the changes introduced in this PR:
Checklist: