Expert Parallelism: common C API + NCCL EP backend by phu0ngng · Pull Request #3034 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-22T02:42:51Z

Summary

First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.

Every network-bound payload tensor takes an optional NVTECommWindow. When a window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the device-to-device (D2D) memcpy from the user buffers into the symmetric staging buffers.

Implementation

Public C API (`transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h}`)

Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).

Lifecycle (host-only, eager):

void     nvte_ep_initialize(void* ep_comm, NVTEEpGroupConfig group_config);
void     nvte_ep_shutdown(void);

uint64_t nvte_ep_register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);

nvte_ep_initialize — borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
nvte_ep_shutdown — tear down the backend; idempotent; does not destroy ep_comm.
nvte_ep_register_layer — reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.
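
The register-then-allocate flow described above can be sketched with mock types. This is only an illustration under stated assumptions: `MockEpRegistry`, `MockLayerConfig`, `MockHandle`, `make_handle`, and the sizing rule are all hypothetical and are not part of TransformerEngine. A host vector stands in for whatever allocation the caller would really make.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; these are NOT the TransformerEngine types.
struct MockLayerConfig {
  int top_k;
  std::size_t alignment;
};

struct MockHandle {
  std::uint64_t id;  // returned by registration
  void* mem;         // caller-owned buffer of the reported size
};

class MockEpRegistry {
 public:
  // Same shape as nvte_ep_register_layer: returns an id and reports a buffer size.
  std::uint64_t register_layer(const MockLayerConfig& cfg, std::size_t* handle_mem_size) {
    // Made-up sizing rule, for illustration only.
    *handle_mem_size = 256 + 64 * static_cast<std::size_t>(cfg.top_k);
    const std::uint64_t id = next_id_++;
    sizes_[id] = *handle_mem_size;
    return id;
  }
  std::size_t size_of(std::uint64_t id) const { return sizes_.at(id); }

 private:
  std::uint64_t next_id_ = 0;
  std::unordered_map<std::uint64_t, std::size_t> sizes_;
};

// Caller-side pattern: register once, allocate once, then reuse {id, mem} every step.
inline MockHandle make_handle(MockEpRegistry& reg, const MockLayerConfig& cfg,
                              std::vector<unsigned char>& storage) {
  std::size_t bytes = 0;
  const std::uint64_t id = reg.register_layer(cfg, &bytes);
  storage.resize(bytes);  // stand-in for the caller's real allocation
  return MockHandle{id, storage.data()};
}
```

Keeping the allocation on the caller's side is what lets the per-step calls stay allocation-free: the library only ever sees a buffer that already exists.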

Per-step (allocation-free, CUDA-graph capturable)

void nvte_ep_prepare(NVTEEpHandle handle, NVTETensor topk_idx, NVTETensor token_counts,
                     size_t dispatch_output_per_expert_alignment, cudaStream_t stream);

void nvte_ep_dispatch(NVTEEpHandle handle, NVTETensor topk_idx,
                      NVTETensor tokens, NVTECommWindow tokens_win,
                      NVTETensor topk_weights, NVTECommWindow topk_weights_win,
                      NVTETensor recv_tokens, NVTECommWindow recv_tokens_win,
                      NVTETensor recv_topk_weights,  NVTECommWindow recv_topk_weights_win,
                      cudaStream_t stream);

void nvte_ep_combine(NVTEEpHandle handle, NVTETensor expert_out, NVTECommWindow expert_out_win,
                     NVTETensor result, cudaStream_t stream);

void nvte_ep_dispatch_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                          NVTETensor g_recv_topk_weights, NVTECommWindow g_recv_topk_weights_win,
                          NVTETensor grad_tokens, NVTETensor grad_topk_weights, cudaStream_t stream);

void nvte_ep_combine_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                         NVTETensor grad_expert_out, NVTECommWindow grad_expert_out_win,
                         cudaStream_t stream);

nvte_ep_prepare — all-gather the routing map and write routing maps to handle.mem.
nvte_ep_dispatch — scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
nvte_ep_combine — scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
nvte_ep_dispatch_bwd — backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights) do not.
nvte_ep_combine_bwd — backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
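
To make the dispatch and combine semantics above concrete, here is a single-process CPU reference model. It is a sketch, not the backend: `RecvRow`, `reference_dispatch`, `reference_combine`, and `apply_weights` are hypothetical names, there is no communication, and the row layout is an assumption. It only shows that combine is an unweighted sum, so with identity experts each token comes back multiplied by top_k unless the caller scales by the routing weights first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row as seen on the expert side, tagged with where it came from.
struct RecvRow {
  int src_token;            // index of the source token
  int expert;               // destination expert
  float weight;             // routing weight that travelled with the token
  std::vector<float> data;  // hidden-dim payload
};

// Dispatch: copy each token to each of its top_k experts.
inline std::vector<RecvRow> reference_dispatch(
    const std::vector<std::vector<float>>& tokens,
    const std::vector<std::vector<int>>& topk_idx,
    const std::vector<std::vector<float>>& topk_weights) {
  std::vector<RecvRow> recv;
  for (std::size_t t = 0; t < tokens.size(); ++t) {
    for (std::size_t k = 0; k < topk_idx[t].size(); ++k) {
      recv.push_back(RecvRow{static_cast<int>(t), topk_idx[t][k], topk_weights[t][k], tokens[t]});
    }
  }
  return recv;
}

// Caller-side step: scale expert outputs by their routing weight before combine.
inline void apply_weights(std::vector<RecvRow>& rows) {
  for (RecvRow& row : rows) {
    for (float& v : row.data) v *= row.weight;
  }
}

// Combine: unweighted sum of expert outputs back onto the source token.
inline std::vector<std::vector<float>> reference_combine(const std::vector<RecvRow>& expert_out,
                                                         std::size_t num_tokens,
                                                         std::size_t hidden) {
  std::vector<std::vector<float>> result(num_tokens, std::vector<float>(hidden, 0.0f));
  for (const RecvRow& row : expert_out) {
    for (std::size_t h = 0; h < hidden; ++h) {
      result[static_cast<std::size_t>(row.src_token)][h] += row.data[h];
    }
  }
  return result;
}
```

Calling `reference_combine` directly on the dispatched rows yields top_k times each token; calling `apply_weights` first yields the usual weighted mixture.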

Backend + build

NCCL EP backend (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
Runtime gates (in EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
Stub path: when NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
Build wiring
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default and auto-disables NCCL EP when no requested CUDA arch is >= 90. Setting NVTE_BUILD_WITH_NCCL_EP=1 explicitly with all archs < 90 is treated as a user error; set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
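
The window-or-pointer choice attributed to make_payload_tensor() above can be sketched as follows. The types here are mock stand-ins (the real ones are ncclWindow_t, NVTECommWindow, and ncclEpTensor_t), and `make_payload` and `is_zero_copy` are hypothetical names, so treat this as an illustration of the branching only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for ncclWindow_t.
using MockWindow = void*;

// Mirrors the described side-band struct {window, offset}.
struct MockCommWindow {
  MockWindow window;   // nullptr means "no window supplied"
  std::size_t offset;  // offset into the window
};

// Mirrors the fields the description says are set on the per-call tensor.
struct MockPayload {
  void* data;              // plain device pointer (fallback path)
  MockWindow win_hdl;      // window handle (zero-copy path)
  std::size_t win_offset;  // offset paired with win_hdl
};

// Exactly one of the two paths is populated per call.
inline MockPayload make_payload(void* tensor_data, MockCommWindow win) {
  MockPayload p{nullptr, nullptr, 0};
  if (win.window != nullptr) {
    p.win_hdl = win.window;  // zero-copy: backend addresses the window directly
    p.win_offset = win.offset;
  } else {
    p.data = tensor_data;  // fallback: backend copies through staging buffers
  }
  return p;
}

inline bool is_zero_copy(const MockPayload& p) { return p.win_hdl != nullptr; }
```

Because a null window selects the fallback, a caller that never registers windows can pass zero-initialized structs everywhere and still get correct results.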

Testing

C++ distributed tests under tests/cpp_distributed/.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-22T02:48:16Z

Greptile Summary

This PR introduces the Expert Parallelism (EP) foundation for TransformerEngine: a public C API (ep.h, comm_window.h), an NCCL EP backend singleton (EPBackend), and a stub path for non-EP builds. The implementation covers the full forward/backward lifecycle (prepare → dispatch → combine and their backward counterparts) with optional symmetric-memory zero-copy via NVTECommWindow.

Public C API defines NVTEEpGroupConfig (now including max_token_dtype to size NCCL EP staging buffers correctly), NVTEEpLayerConfig, NVTEEpHandle, and per-step allocation-free ops that are CUDA-graph capturable.
NCCL EP backend is a process-level singleton wrapping ncclEpGroup_t; runtime gates for SM≥90, NCCL≥2.30.4, and NVLS multicast are enforced in initialize(); handles are cached per handle_id and lazily opened on first use.
Build wiring auto-detects NCCL prefix via _discover_nccl_home, builds libnccl_ep.so from the 3rdparty/nccl submodule, and falls back to throwing stubs when NVTE_WITH_NCCL_EP=OFF.
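
The runtime gates listed above can be expressed as a small pure function. This sketch takes plain integers so it compiles without CUDA or NCCL; in the backend the values are said to come from cudaDeviceGetAttribute and ncclGetVersion. `EpPlatformInfo` and `ep_gate_failure` are hypothetical names, and the order of the checks is an assumption.

```cpp
#include <cassert>
#include <string>

// NCCL encodes releases from 2.9 onward as major*10000 + minor*100 + patch,
// which is the integer that ncclGetVersion reports.
constexpr int nccl_version_code(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Hypothetical snapshot of the values the gates inspect.
struct EpPlatformInfo {
  int sm;                    // compute capability as major*10 + minor, e.g. 90
  int nccl_version;          // integer version code
  bool multicast_supported;  // CUDA multicast / NVLS availability
};

// Returns an empty string when every gate passes, else the first failure.
inline std::string ep_gate_failure(const EpPlatformInfo& info) {
  if (info.sm < 90) {
    return "SM " + std::to_string(info.sm) + " is below 90";
  }
  if (info.nccl_version < nccl_version_code(2, 30, 4)) {
    return "NCCL version code " + std::to_string(info.nccl_version) + " is below 23004";
  }
  if (!info.multicast_supported) {
    return "CUDA multicast (NVLS) is not supported";
  }
  return std::string();
}
```

Checking at runtime rather than at build time is what allows the library to build and load on any architecture, as the PR summary states.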

Confidence Score: 4/5

Safe to merge with one fix: register_layer reads ep_group_ before acquiring the mutex, racing with a concurrent shutdown() call.

The backend is well-structured and the previously flagged issues (dtype sizing, max_recv_tokens_per_rank validation, NCCL_HOME warning, native arch handling) are all resolved in this revision. The remaining concern is that ncclEpHandleMemSize is called with ep_group_ outside the mutex in register_layer, while shutdown() acquires the same mutex and destroys ep_group_. A concurrent shutdown during layer registration — plausible in framework error-recovery paths — would cause a use-after-free in the NCCL EP group.

transformer_engine/common/ep/ep_backend.cpp — the register_layer method acquires the mutex too late.

Important Files Changed

Filename	Overview
transformer_engine/common/ep/ep_backend.cpp	Core NCCL EP singleton implementation; `register_layer` reads `ep_group_` before acquiring the mutex, creating a race with concurrent `shutdown()` calls.
transformer_engine/common/ep/ep_backend.h	Clean singleton header; handle cache and lifecycle members are well-structured; no issues found.
transformer_engine/common/ep/ep_api.cpp	Thin C-API delegation layer; parameter null checks are present and delegations are correct.
transformer_engine/common/ep/ep_api_stub.cpp	Stub path for non-EP builds; all stubs throw via `ep_not_built()` except `nvte_ep_shutdown` which is intentionally a no-op; correct.
transformer_engine/common/include/transformer_engine/ep.h	Well-documented public C API; added `max_token_dtype` field addresses the prior dtype-width staging buffer issue; no issues found.
transformer_engine/common/include/transformer_engine/comm_window.h	Minimal sideband struct for zero-copy symmem path; NULL-window convention is clear and correctly propagated through make_payload_tensor.
setup.py	NCCL EP build wiring; `_discover_nccl_home` is well-handled; stale `libnccl_ep.so` skips rebuild silently on arch changes.
transformer_engine/common/CMakeLists.txt	NCCL EP CMake wiring; header/lib discovery, rpath embedding, and stub fallback are all correctly gated on NVTE_WITH_NCCL_EP.
transformer_engine/common/util/logging.h	Adds NVTE_CHECK_NCCL macro following the same pattern as existing CUDA/cuBLAS macros; correct.

ptrendx · 2026-05-26T21:59:40Z

+# No MPI dependency — processes are spawned by run_test_ep.sh with
+# --rank / --nranks flags.  ncclUniqueId exchange uses a
+# shared temp file (see test_ep_common.h for details).


I believe that the other distributed tests do rely on MPI, so why don't we also do that here?
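
For context on the MPI-free approach in the quoted comment: the actual exchange lives in test_ep_common.h and is not shown in this thread, so the following is only a sketch of one way a file-based id exchange could work. `UniqueIdBytes`, `publish_id`, and `wait_for_id` are hypothetical names; the 128-byte array stands in for ncclUniqueId.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

// Stand-in for ncclUniqueId, an opaque 128-byte blob in nccl.h.
using UniqueIdBytes = std::array<char, 128>;

// Rank 0: write under a temporary name, then rename so readers never see a
// partial file (rename replaces the target atomically on POSIX filesystems).
inline bool publish_id(const std::string& path, const UniqueIdBytes& id) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    out.write(id.data(), static_cast<std::streamsize>(id.size()));
    out.flush();
    if (!out) return false;
  }  // stream is closed here, before the rename
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Other ranks: poll until the file holds a full id or the timeout expires.
inline bool wait_for_id(const std::string& path, UniqueIdBytes* id,
                        std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  do {
    std::ifstream in(path, std::ios::binary);
    if (in && in.read(id->data(), static_cast<std::streamsize>(id->size()))) {
      return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  } while (std::chrono::steady_clock::now() < deadline);
  return false;
}
```

A scheme like this needs a path every rank can see and a cleanup step between runs, which is part of the trade-off against reusing MPI for the broadcast.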

ptrendx · 2026-05-27T18:44:56Z

+
+// RAII owner for a cudaMalloc'd device buffer; frees on destruction.
+template <typename T>
+struct DevBuf {


We have a very similar thing already in the test_common.h

ptrendx · 2026-05-27T18:53:47Z

+  // Spot-check 3 hidden-dim positions per token to catch partial-row writes.
+  const int probes[3] = {0, hidden_dim_ / 2, hidden_dim_ - 1};
+  for (int tok = 0; tok < num_tokens_; ++tok) {
+    float exp = __bfloat162float(h_tok[tok * hidden_dim_]) * static_cast<float>(top_k_);


Why do we hardcode BF16 everywhere? I assume that NCCL EP works with the other datatypes, right?

Added tests for FP16 and FP32 in the pipeline tests and found that NCCL EP has assertions that expect the input to be BF16.
I requested that NCCL EP support other types. The FP16 and FP32 unit tests will be skipped for now.

ptrendx · 2026-05-27T19:00:37Z

+namespace transformer_engine {
+namespace ep {
+
+/*! \brief EP backend singleton — owns the NCCL EP group; borrows the comm. */


If it borrows the communicator then on the framework side we need to make sure that it stays alive.
Also, if it is a singleton, how does it work with multiple GPUs per process?

If it borrows the communicator then on the framework side we need to make sure that it stays alive.

Added docstring note for this. Will see what to do in PyTorch.

Also, if it is a singleton, how does it work with multiple GPUs per process?

Added note that only a single process, single device is supported for now.

ptrendx · 2026-05-27T19:04:13Z

+typedef struct {
+  int ep_size;             /*!< EP world size. */
+  int num_experts;         /*!< Total experts across all ranks. */
+  int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */
+  /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */
+  int max_recv_tokens_per_rank;
+  int hidden_dim;  /*!< Token hidden dimension. */
+  int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */
+  /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */
+  int allow_handle_mem_reloc;
+} NVTEEpGroupConfig;
+
+/*! \brief Per-layer EP configuration. */
+typedef struct {
+  int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */
+  int top_k;             /*!< Per-token expert fan-out. Required. */
+  size_t dispatch_output_per_expert_alignment;
+  /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match
+   *   between nvte_ep_register_layer and nvte_ep_prepare. */
+} NVTEEpLayerConfig;


If we make this a public API then we should probably version those?

Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.
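
As a side note on the dispatch_output_per_expert_alignment field in the struct quoted above ("pow2; 0/1 = no padding"), the rounding it implies can be sketched like this. The helper names are hypothetical, and the assumption that zones are laid out back to back in expert order (with an empty expert taking no slots) is mine, not something the thread confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 0 and 1 both mean "no padding"; anything else must be a power of two.
inline bool is_valid_alignment(std::size_t a) { return a == 0 || (a & (a - 1)) == 0; }

// Round a per-expert token count up to the next multiple of the alignment.
inline std::size_t padded_count(std::size_t tokens, std::size_t alignment) {
  if (alignment <= 1) return tokens;
  return (tokens + alignment - 1) & ~(alignment - 1);  // valid for powers of two only
}

// Start offset of each expert's zone, with the padded total as the last entry.
inline std::vector<std::size_t> zone_offsets(const std::vector<std::size_t>& counts,
                                             std::size_t alignment) {
  std::vector<std::size_t> offsets(counts.size() + 1, 0);
  for (std::size_t e = 0; e < counts.size(); ++e) {
    offsets[e + 1] = offsets[e] + padded_count(counts[e], alignment);
  }
  return offsets;
}
```

The slots between an expert's real count and its padded count are the "padded slots" that the PR description says are zeroed in grad_expert_out.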

…rce at dispatch Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

… static layer registration Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…ential per-op and pipeline tests Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…f16 tolerance Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…ower per-dispatch dtypes Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…er TE_LIB to remove --no-as-needed Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…per disambiguation, OpenMP link) and gate FullForwardBackward to bf16 Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…instead of dropping them from the parameter list Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

greptile-apps · 2026-05-29T00:23:12Z

+  NVTE_CHECK_NCCL(ncclEpHandleMemSize(ep_group_, NCCL_EP_LAYOUT_EXPERT_MAJOR, &hcfg, &hm_size,
+                                      layer_config.top_k));
+  *handle_mem_size = hm_size;
+  std::lock_guard<std::mutex> lock(mutex_);
+  return insert_new_entry(hm_size, layer_config.top_k,
+                          layer_config.dispatch_output_per_expert_alignment);


ep_group_ read outside the mutex in register_layer

ncclEpHandleMemSize accesses ep_group_ at line 293 before the lock_guard is acquired at line 296. shutdown() acquires the same mutex and calls ncclEpGroupDestroy(inst.ep_group_) followed by setting ep_group_ = nullptr. If a framework error handler calls shutdown() concurrently with a register_layer in flight, ncclEpHandleMemSize will race with ncclEpGroupDestroy, causing a use-after-free. Moving the lock_guard to before the ncclEpHandleMemSize call would eliminate the race.
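
The suggested fix, taking the lock before touching the shared group, can be illustrated with a miniature backend. This is a sketch with hypothetical names (`MockBackend`, `Group`); it is not the code in ep_backend.cpp, and it returns a bool where the real code would presumably throw.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>

// Stand-in for the NCCL EP group that the real backend owns.
struct Group {
  std::size_t handle_mem_size;
};

class MockBackend {
 public:
  void initialize(std::size_t handle_mem_size) {
    std::lock_guard<std::mutex> lock(mutex_);
    group_.reset(new Group{handle_mem_size});
  }

  // Destroys the group under the lock; calling it twice is harmless.
  void shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    group_.reset();
  }

  // The lock is taken BEFORE group_ is read, so shutdown() cannot free the
  // group while this call is still using it.
  bool register_layer(std::size_t* handle_mem_size, std::uint64_t* handle_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!group_) return false;  // not initialized, or already shut down
    *handle_mem_size = group_->handle_mem_size;
    *handle_id = next_id_++;
    return true;
  }

 private:
  std::mutex mutex_;
  std::unique_ptr<Group> group_;
  std::uint64_t next_id_ = 0;
};
```

The cost is that registration now holds the mutex across the size query, which serializes concurrent registrations; since registration is a host-side setup step rather than a per-step call, that is likely acceptable.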

phu0ngng requested a review from ptrendx as a code owner May 22, 2026 02:42

greptile-apps Bot reviewed May 22, 2026

This was referenced May 22, 2026

[PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035

Draft

[JAX] Expert Parallelism: JAX primitives + VJPs #3036

Open

[Common] Initial NCCL EP integration + Distributed CPP unit tests #3023

Open

phu0ngng requested a review from timmoon10 May 22, 2026 16:17

greptile-apps Bot reviewed May 22, 2026

Expert Parallelism: common C API + NCCL EP v0.1 backend

17e5126

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng force-pushed the phuong/ep-2-commwindow branch from 099857f to 17e5126 Compare May 22, 2026 23:07

phu0ngng and others added 2 commits May 23, 2026 19:36

Expert Parallelism: persistent ncclEpHandle cache with allow_handle_mem_reloc gating

cef4b33

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

0086be4

for more information, see https://pre-commit.ci

ptrendx reviewed May 26, 2026

ptrendx reviewed May 26, 2026
