Expert Parallelism: common C API + NCCL EP backend #3034
Greptile Summary
This PR introduces the Expert Parallelism (EP) foundation for TransformerEngine: a public C API (
Confidence Score: 4/5
Safe to merge with one fix: The backend is well-structured and the previously flagged issues (dtype sizing,
transformer_engine/common/ep/ep_backend.cpp - the
Important Files Changed
Reviews (5): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
099857f to 17e5126
…em_reloc gating Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
| # No MPI dependency — processes are spawned by run_test_ep.sh with | ||
| # --rank / --nranks flags. ncclUniqueId exchange uses a | ||
| # shared temp file (see test_ep_common.h for details). |
I believe that the other distributed tests do rely on MPI, so why don't we also do that here?
| // RAII owner for a cudaMalloc'd device buffer; frees on destruction. | ||
| template <typename T> | ||
| struct DevBuf { |
We already have a very similar thing in test_common.h.
| // Spot-check 3 hidden-dim positions per token to catch partial-row writes. | ||
| const int probes[3] = {0, hidden_dim_ / 2, hidden_dim_ - 1}; | ||
| for (int tok = 0; tok < num_tokens_; ++tok) { | ||
| float exp = __bfloat162float(h_tok[tok * hidden_dim_]) * static_cast<float>(top_k_); |
Why do we hardcode BF16 everywhere? I assume that NCCL EP works with the other datatypes, right?
Added tests for FP16 and FP32 in the pipeline tests and found that NCCL EP has assertions that expect the input to be in BF16.
I have asked the NCCL EP team to support other types. The FP16 and FP32 unit tests will be skipped for now.
| namespace transformer_engine { | ||
| namespace ep { | ||
| /*! \brief EP backend singleton — owns the NCCL EP group; borrows the comm. */ |
If it borrows the communicator then on the framework side we need to make sure that it stays alive.
Also, if it is a singleton, how does it work with multiple GPUs per process?
> If it borrows the communicator then on the framework side we need to make sure that it stays alive.
Added a docstring note for this. Will see what to do in PyTorch.
> Also, if it is a singleton, how does it work with multiple GPUs per process?
Added a note that only a single process, single device is supported for now.
| typedef struct { | ||
| int ep_size; /*!< EP world size. */ | ||
| int num_experts; /*!< Total experts across all ranks. */ | ||
| int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */ | ||
| /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */ | ||
| int max_recv_tokens_per_rank; | ||
| int hidden_dim; /*!< Token hidden dimension. */ | ||
| int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */ | ||
| /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */ | ||
| int allow_handle_mem_reloc; | ||
| } NVTEEpGroupConfig; | ||
|
|
||
| /*! \brief Per-layer EP configuration. */ | ||
| typedef struct { | ||
| int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */ | ||
| int top_k; /*!< Per-token expert fan-out. Required. */ | ||
| size_t dispatch_output_per_expert_alignment; | ||
| /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match | ||
| * between nvte_ep_register_layer and nvte_ep_prepare. */ | ||
| } NVTEEpLayerConfig; |
If we make this a public API then we should probably version those?
Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.
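For the follow-up mentioned above, one common way to version a public C struct is a leading size field that lets the library tell which layout the caller was compiled against. The sketch below only illustrates that convention: `ExampleLayerConfigV`, `kSizeV1`, and `read_config` are hypothetical names and are not part of the TE API or of this PR.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical versioned config struct; not part of the TE API.
struct ExampleLayerConfigV {
  std::size_t struct_size;  // caller sets this to sizeof() of the struct it compiled against
  int top_k;
  std::size_t alignment;    // pretend this field was added in a later release
};

// Size of the struct as it looked before `alignment` was added.
constexpr std::size_t kSizeV1 = offsetof(ExampleLayerConfigV, alignment);

// Copies only the bytes the caller provided; fields the caller does not know
// about keep their zero defaults. Returns false for an implausible size.
inline bool read_config(const void* user_cfg, ExampleLayerConfigV* out) {
  std::size_t user_size = 0;
  std::memcpy(&user_size, user_cfg, sizeof(user_size));
  if (user_size < kSizeV1 || user_size > sizeof(ExampleLayerConfigV)) return false;
  *out = ExampleLayerConfigV{};  // zero-initialize so newer fields get defaults
  std::memcpy(out, user_cfg, user_size);
  out->struct_size = sizeof(ExampleLayerConfigV);
  return true;
}
```

With this layout an older caller keeps working after a field is appended, at the cost of one extra member in every struct.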
…rce at dispatch Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
… static layer registration Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ential per-op and pipeline tests Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…f16 tolerance Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ower per-dispatch dtypes Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…er TE_LIB to remove --no-as-needed Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…per disambiguation, OpenMP link) and gate FullForwardBackward to bf16 Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…instead of dropping them from the parameter list Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
| NVTE_CHECK_NCCL(ncclEpHandleMemSize(ep_group_, NCCL_EP_LAYOUT_EXPERT_MAJOR, &hcfg, &hm_size, | ||
| layer_config.top_k)); | ||
| *handle_mem_size = hm_size; | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| return insert_new_entry(hm_size, layer_config.top_k, | ||
| layer_config.dispatch_output_per_expert_alignment); |
ep_group_ read outside the mutex in register_layer
ncclEpHandleMemSize accesses ep_group_ at line 293 before the lock_guard is acquired at line 296. shutdown() acquires the same mutex and calls ncclEpGroupDestroy(inst.ep_group_) followed by setting ep_group_ = nullptr. If a framework error handler calls shutdown() concurrently with a register_layer in flight, ncclEpHandleMemSize will race with ncclEpGroupDestroy, causing a use-after-free. Moving the lock_guard to before the ncclEpHandleMemSize call would eliminate the race.
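The suggested fix can be illustrated with a small stand-alone model. This is a sketch, not the backend's code: `FakeEpGroup` and `MiniBackend` are hypothetical stand-ins, and the size lookup merely plays the role of `ncclEpHandleMemSize`. The point is the ordering: the lock is acquired before `ep_group_` is read, so `shutdown()` cannot free the group in between.

```cpp
#include <cstddef>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for the NCCL EP group handle; not the real type.
struct FakeEpGroup {
  std::size_t handle_mem_size;
};

// Hypothetical miniature of the backend, showing only the locking order.
class MiniBackend {
 public:
  explicit MiniBackend(std::size_t hm_size) : ep_group_(new FakeEpGroup{hm_size}) {}
  ~MiniBackend() { shutdown(); }

  // The lock is taken before ep_group_ is read, so shutdown() cannot
  // destroy the group while the size query is in flight.
  int register_layer(std::size_t* handle_mem_size) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ep_group_ == nullptr) throw std::runtime_error("EP backend is shut down");
    *handle_mem_size = ep_group_->handle_mem_size;  // stands in for ncclEpHandleMemSize
    return next_handle_id_++;
  }

  // Idempotent teardown under the same mutex.
  void shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    delete ep_group_;
    ep_group_ = nullptr;
  }

 private:
  std::mutex mutex_;
  FakeEpGroup* ep_group_;
  int next_handle_id_ = 0;
};
```

Holding the mutex across the size query serializes registration with teardown; since registration is described as host-only and eager, the longer critical section is unlikely to matter for per-step performance.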
Summary
First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.
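As an illustration of what a runtime gate of this kind might look like, the sketch below checks the SM>=90 and NCCL>=2.30.4 thresholds listed later in this description. The function names are hypothetical. The version encoding (major*10000 + minor*100 + patch for NCCL 2.9 and newer) reflects how NCCL's version code is understood to work and should be checked against nccl.h. The multicast/NVLS check is omitted.

```cpp
#include <stdexcept>
#include <string>

// Assumed NCCL encoding for versions >= 2.9: major*10000 + minor*100 + patch.
constexpr int encode_nccl_version(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Hypothetical gate. In a real backend sm_major would come from
// cudaDeviceGetAttribute and nccl_version_code from ncclGetVersion.
inline void check_ep_requirements(int sm_major, int nccl_version_code) {
  if (sm_major < 9) {
    throw std::runtime_error("EP requires SM >= 90, got SM major " +
                             std::to_string(sm_major));
  }
  if (nccl_version_code < encode_nccl_version(2, 30, 4)) {
    throw std::runtime_error("EP requires NCCL >= 2.30.4, got version code " +
                             std::to_string(nccl_version_code));
  }
}
```

Because the check takes plain integers, it can run on any machine, which matches the stated goal that build and load work on any arch and only initialization fails.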
Every network-bound payload tensor takes an optional NVTECommWindow. When the window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the D2D memcpy from the user buffers to the symmetric staging buffers.
Implementation
Public C API (transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h})
Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).
Lifecycle (host-only, eager):
- nvte_ep_initialize: borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
- nvte_ep_shutdown: tear down the backend; idempotent; does not destroy ep_comm.
- nvte_ep_register_layer: reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.
Per-step (allocation-free, CUDA-graph capturable):
- nvte_ep_prepare: all-gather the routing map and write routing maps to handle.mem.
- nvte_ep_dispatch: scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
- nvte_ep_combine: scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
- nvte_ep_dispatch_bwd: backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights).
- nvte_ep_combine_bwd: backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
Backend + build
- Under transformer_engine/common/ep/: EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
- EPBackend::initialize checks at runtime: SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
- When NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default; auto-disables NCCL EP when no requested CUDA arch >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as user error. Set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME is resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
Testing
tests/cpp_distributed/.
Type of change
Checklist:
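The window-or-pointer selection described for make_payload_tensor() under Backend + build can be sketched with plain structs. The types below are hypothetical stand-ins. The field names `data`, `win_hdl`, and `win_offset` are taken from the description, but the real `ncclEpTensor_t` and `NVTECommWindow` layouts are not reproduced here.

```cpp
#include <cstddef>

// Hypothetical stand-in for NVTECommWindow.
struct CommWindow {
  void* window = nullptr;  // plays the role of ncclWindow_t
  std::size_t offset = 0;
};

// Hypothetical stand-in for the per-call payload descriptor.
struct PayloadTensor {
  void* data = nullptr;        // plain device pointer (HBM fallback)
  void* win_hdl = nullptr;     // symmetric-memory window handle (zero-copy)
  std::size_t win_offset = 0;  // offset of the payload inside the window
};

// Chooses the zero-copy path when a window is supplied, else the plain pointer.
inline PayloadTensor make_payload(void* tensor_data, const CommWindow& w) {
  PayloadTensor p;
  if (w.window != nullptr) {
    p.win_hdl = w.window;
    p.win_offset = w.offset;
  } else {
    p.data = tensor_data;
  }
  return p;
}
```

Keeping the window as a side-band argument, rather than a field on the tensor itself, is consistent with the note that NCCL peer handles are not carried on NVTETensor.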