Revert "Expert Parallelism: common C API + NCCL EP backend"#3126
Conversation
This reverts commit c3396ee. Signed-off-by: Tim Moon <tmoon@nvidia.com>
f7f6a07 to
ee354cd
Compare
Greptile SummaryThis PR reverts #3034 ("Expert Parallelism: common C API + NCCL EP backend"), removing the entire NCCL EP feature to unblock CI infrastructure dependencies. It is a clean, mechanical revert that deletes ~2,450 lines of EP implementation, tests, and build scaffolding.
Confidence Score: 5/5The revert is clean and complete — no EP symbols, headers, or build references remain in the tree, confirmed by a full codebase grep. Every file touched by the original EP PR is either deleted or mechanically reverted. The only noteworthy difference from a perfect restore is the omission of os.RTLD_LAZY in _load_core_library(), which matches the pre-EP code and works correctly under glibc. No functional regressions are introduced. transformer_engine/common/init.py — the _load_core_library() binding-mode flag is worth verifying is intentionally left as-is. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["PR #3126: Revert EP PR #3034"] --> B["Remove 3rdparty/nccl submodule"]
A --> C["Delete EP source files\nep_api.cpp / ep_backend.cpp / ep_backend.h\nep.h / comm_window.h"]
A --> D["Revert build system\nsetup.py + CMakeLists.txt"]
A --> E["Revert Python public API\ntransformer_engine/__init__.py"]
A --> F["Revert test infrastructure\ntest.sh / CMakeLists.txt\nDelete test_ep.cu / run_test_ep.sh"]
A --> G["Revert common/__init__.py\nRTLD_GLOBAL|RTLD_LAZY → RTLD_GLOBAL"]
D --> D1["setup.py removes:\nbuild_nccl_ep_submodule()\n_discover_nccl_home()"]
D --> D2["common/CMakeLists.txt removes:\nNVTE_WITH_NCCL_EP option\nlibnccl_ep.a static link\nep_api.cpp / ep_backend.cpp sources"]
D --> D3["tests/CMakeLists.txt removes:\ntest_ep target\nNCCL header discovery"]
E --> E1["Removes:\nis_nccl_ep_available()\nrequire_nccl_ep()\n_nccl_runtime_version()"]
F --> F1["qa/test.sh reverts to:\nset -e\ncmake --build build\nmpirun test_comm_gemm"]
Reviews (1): Last reviewed commit: "Revert "Expert Parallelism: common C API..." | Re-trigger Greptile |
| def _load_core_library(): | ||
| """Load shared library with Transformer Engine C extensions""" | ||
| return ctypes.CDLL(_get_shared_object_file("core"), mode=ctypes.RTLD_GLOBAL | os.RTLD_LAZY) | ||
| return ctypes.CDLL(_get_shared_object_file("core"), mode=ctypes.RTLD_GLOBAL) |
There was a problem hiding this comment.
The EP PR added
os.RTLD_LAZY alongside RTLD_GLOBAL so that lazy binding would prevent unresolved NCCL EP symbols from causing a dlopen failure at load time. The revert drops that flag, restoring the pre-EP state. POSIX requires exactly one of RTLD_LAZY or RTLD_NOW to be specified; without either, the behavior is implementation-defined (glibc defaults to lazy binding, so this works in practice, but it is technically underdefined). Consider restoring os.RTLD_LAZY to make the binding mode explicit, regardless of whether EP is present.
| return ctypes.CDLL(_get_shared_object_file("core"), mode=ctypes.RTLD_GLOBAL) | |
| return ctypes.CDLL(_get_shared_object_file("core"), mode=ctypes.RTLD_GLOBAL | os.RTLD_LAZY) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reverts #3034 since we need to resolve dependencies in our CI infrastructure first.