Expert Parallelism: common C API + NCCL EP backend#3127
Conversation
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
Greptile SummaryThis PR introduces the Expert Parallelism (EP) C API and its NCCL EP backend for Hopper+ GPUs (SM≥90), adding
Confidence Score: 3/5The core communication path is functional for the common single-top_k case, but the global fallback_layer_cfg_ in the LRU cache will cause hard failures for any model that uses EP layers with differing top_k values — a realistic pattern in heterogeneous MoE designs. The transformer_engine/common/ep/ep_backend.cpp (fallback_layer_cfg_ process-wide constraint and pre-lock group_config_ read), transformer_engine/common/include/transformer_engine/comm_window.h (NCCL internal struct forward declaration) Important Files Changed
Sequence DiagramsequenceDiagram
participant User as Framework (PyTorch/JAX)
participant API as ep_api.cpp (C API)
participant BE as EPBackend (singleton)
participant Cache as LRU Handle Cache
participant NCCL as ncclEp* (libnccl_ep.a)
User->>API: nvte_ep_initialize(ep_comm, group_cfg)
API->>BE: EPBackend::initialize()
BE->>NCCL: ncclGetVersion() ≥ 2.30.4 check
BE->>NCCL: ncclEpCreateGroup(ep_group_, ep_comm, cfg)
BE-->>User: initialized
loop Each training step
User->>API: nvte_ep_prepare(handle_mem, topk_idx, ...)
BE->>Cache: prepare_handle_locked()
Cache->>NCCL: ncclEpInitHandle() [on miss]
BE->>NCCL: ncclEpUpdateHandle() [AllGather routing]
User->>API: nvte_ep_dispatch(...)
BE->>NCCL: ncclEpDispatch()
User->>API: nvte_ep_combine(...)
BE->>NCCL: ncclEpCombine()
User->>API: nvte_ep_combine_bwd + nvte_ep_dispatch_bwd
BE->>NCCL: ncclEpDispatch(bwd) + ncclEpCombine(bwd)
end
User->>API: nvte_ep_shutdown()
BE->>NCCL: ncclEpGroupDestroy(ep_group_)
|
| void EPBackend::initialize(ncclComm_t ep_comm, NVTEEpGroupConfig config) { | ||
| EPBackend& inst = instance(); | ||
| std::lock_guard<std::mutex> lock(inst.mutex_); | ||
| NVTE_CHECK(!inst.initialized_, "EP already initialized. Call initialize only once per process."); |
There was a problem hiding this comment.
The error message "Call initialize only once per process" contradicts the public API contract documented in
ep.h ("Re-init after shutdown is allowed; double-init throws"). A user who calls shutdown() and then tries to re-init (which is explicitly documented as valid) would see a misleading error if they accidentally double-initialized without an intervening shutdown, potentially making the failure mode harder to diagnose.
| NVTE_CHECK(!inst.initialized_, "EP already initialized. Call initialize only once per process."); | |
| NVTE_CHECK(!inst.initialized_, | |
| "EP already initialized. Call nvte_ep_shutdown() before re-initializing."); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| struct ncclWindow_vidmem; | ||
|
|
||
| /*! \brief NCCL window plus byte offset for a zero-copy payload tensor. */ | ||
| typedef struct { | ||
| struct ncclWindow_vidmem* window; /*!< NCCL window, or NULL to use the raw data pointer. */ |
There was a problem hiding this comment.
Forward-declaring an NCCL internal struct creates a fragile ABI dependency
struct ncclWindow_vidmem is not part of NCCL's stable public ABI — it is an implementation detail that happens to back ncclWindow_t. If NCCL ever renames the underlying struct (e.g., in a major refactor), this forward declaration will conflict with <nccl.h> in any translation unit that includes both headers, producing a type-mismatch error. A more stable approach is to declare an incomplete opaque struct that is explicitly TE-owned (e.g., struct NVTEWindowOpaque;) and cast to/from ncclWindow_t inside ep_backend.cpp where <nccl.h> is already included.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| arch_list: list[str] = [] | ||
| for t in arch_tokens: | ||
| if t.lower() == "native": | ||
| try: | ||
| out = subprocess.check_output( | ||
| ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"], | ||
| stderr=subprocess.DEVNULL, | ||
| ).decode() | ||
| except (subprocess.CalledProcessError, FileNotFoundError) as e: | ||
| raise RuntimeError( | ||
| "NVTE_CUDA_ARCHS=native requires nvidia-smi to resolve the host arch." | ||
| ) from e | ||
| for line in out.splitlines(): | ||
| cap = line.strip().replace(".", "") | ||
| if cap.isdigit() and int(cap) >= 90 and cap not in arch_list: | ||
| arch_list.append(cap) | ||
| else: | ||
| bare = t.rstrip("af") | ||
| if bare.isdigit() and int(bare) >= 90 and bare not in arch_list: | ||
| arch_list.append(bare) |
There was a problem hiding this comment.
Stamp-file increment is not atomic: parallel builds can race
prev_gencode is checked, the make build is launched, and the stamp is written in three distinct steps with no locking between them. In a distributed training environment where each rank's setup script (or a pip install in a parallel job) invokes build_nccl_ep_submodule() simultaneously, two processes can both pass the stale-stamp check, both launch make -j, and one can overwrite the stamp while the other is still building, leaving a mismatched stamp if the builds produce different outputs. A file lock (e.g., fcntl.flock around the whole check-build-stamp block) or a build-level lock would make this robust.
|
Pipeline 54707935 |
This is a resubmission of #3034, which was reverted in #3126 due to CI issues.
Implementation is from @phu0ngng and it is already approved by @ptrendx and @timmoon10.