Expert Parallelism: common C API + NCCL EP backend #3034

Open
phu0ngng wants to merge 21 commits into NVIDIA:main from phu0ngng:phuong/ep-2-commwindow

@phu0ngng
Collaborator

Summary

First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.

Every network-bound payload tensor takes an optional NVTECommWindow. When a window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the device-to-device (D2D) memcpy from the user buffers into the symmetric staging buffers.

Implementation

Public C API (transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h})

Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).

Lifecycle (host-only, eager):

void     nvte_ep_initialize(void* ep_comm, NVTEEpGroupConfig group_config);
void     nvte_ep_shutdown(void);

uint64_t nvte_ep_register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);
  • nvte_ep_initialize — borrow an external ncclComm_t for the EP sub-group and init the singleton backend.

  • nvte_ep_shutdown — tear down the backend; idempotent; does not destroy ep_comm.

  • nvte_ep_register_layer — reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.

Per-step (allocation-free, CUDA-graph capturable):

void nvte_ep_prepare(NVTEEpHandle handle, NVTETensor topk_idx, NVTETensor token_counts,
                     size_t dispatch_output_per_expert_alignment, cudaStream_t stream);

void nvte_ep_dispatch(NVTEEpHandle handle, NVTETensor topk_idx,
                      NVTETensor tokens, NVTECommWindow tokens_win,
                      NVTETensor topk_weights, NVTECommWindow topk_weights_win,
                      NVTETensor recv_tokens, NVTECommWindow recv_tokens_win,
                      NVTETensor recv_topk_weights,  NVTECommWindow recv_topk_weights_win,
                      cudaStream_t stream);

void nvte_ep_combine(NVTEEpHandle handle, NVTETensor expert_out, NVTECommWindow expert_out_win,
                     NVTETensor result, cudaStream_t stream);

void nvte_ep_dispatch_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                          NVTETensor g_recv_topk_weights, NVTECommWindow g_recv_topk_weights_win,
                          NVTETensor grad_tokens, NVTETensor grad_topk_weights, cudaStream_t stream);

void nvte_ep_combine_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                         NVTETensor grad_expert_out, NVTECommWindow grad_expert_out_win,
                         cudaStream_t stream);
  • nvte_ep_prepare — all-gather the routing map and write the gathered routing maps to handle.mem.
  • nvte_ep_dispatch — scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
  • nvte_ep_combine — scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
  • nvte_ep_dispatch_bwd — backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights) do not.
  • nvte_ep_combine_bwd — backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
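
Because nvte_ep_combine performs an unweighted sum, the routing weights have to be applied to the expert outputs before the call. The following is a minimal host-side sketch of that pre-multiplication, not code from this PR. It assumes a row-major [num_recv_tokens, hidden_dim] float layout and one weight per received token row; scale_expert_out is a hypothetical helper name, and real code would presumably do this in a CUDA kernel on BF16 data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): scale each received token row by
// its routing weight so that an unweighted scatter-sum yields the weighted sum.
// Assumes expert_out is row-major [num_recv_tokens, hidden_dim].
void scale_expert_out(std::vector<float>& expert_out,
                      const std::vector<float>& recv_topk_weights,
                      std::size_t hidden_dim) {
  assert(hidden_dim > 0);
  assert(expert_out.size() == recv_topk_weights.size() * hidden_dim);
  for (std::size_t tok = 0; tok < recv_topk_weights.size(); ++tok) {
    for (std::size_t h = 0; h < hidden_dim; ++h) {
      expert_out[tok * hidden_dim + h] *= recv_topk_weights[tok];
    }
  }
}
```

With two rows of width 2 and weights {0.5, 2.0}, the first row is halved and the second doubled before the buffer is handed to the combine step.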

Backend + build

  • NCCL EP backend (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
  • Runtime gates (in EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
  • Stub path: when NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
  • Build wiring
    • setup.py builds libnccl_ep.so from 3rdparty/nccl by default and auto-disables NCCL EP when no requested CUDA arch is >= 90. An explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as a user error; set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
    • NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
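
The window-versus-pointer branch described for make_payload_tensor() can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: CommWindowSketch and PayloadSketch are stand-in types (the real ones are NVTECommWindow and ncclEpTensor_t), and the null check on the fallback pointer is a sketch-only sanity check.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in types for illustration only; the real NVTECommWindow holds an
// ncclWindow_t and the real payload type is ncclEpTensor_t.
struct CommWindowSketch {
  void* window;        // nullptr means "no symmetric-memory window supplied"
  std::size_t offset;  // offset of the tensor inside the window
};

struct PayloadSketch {
  void* win_hdl = nullptr;     // set on the zero-copy path
  std::size_t win_offset = 0;  // set on the zero-copy path
  void* data = nullptr;        // set on the fallback path
};

// Mirrors the described branch: prefer the window when one is supplied,
// otherwise fall back to the tensor's own device pointer.
PayloadSketch make_payload_sketch(void* tensor_data, CommWindowSketch win) {
  PayloadSketch p;
  if (win.window != nullptr) {
    p.win_hdl = win.window;
    p.win_offset = win.offset;
  } else {
    assert(tensor_data != nullptr);  // sketch-only check: fallback needs a buffer
    p.data = tensor_data;
  }
  return p;
}
```

Passing a window with a null handle selects the fallback path, which matches the NULL-window convention of the side-band struct.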

Testing

  • C++ distributed tests under tests/cpp_distributed/.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@phu0ngng phu0ngng requested a review from ptrendx as a code owner May 22, 2026 02:42
@greptile-apps
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR introduces the Expert Parallelism (EP) foundation for TransformerEngine: a public C API (ep.h, comm_window.h), an NCCL EP backend singleton (EPBackend), and a stub path for non-EP builds. The implementation covers the full forward/backward lifecycle (prepare → dispatch → combine and their backward counterparts) with optional symmetric-memory zero-copy via NVTECommWindow.

  • Public C API defines NVTEEpGroupConfig (now including max_token_dtype to size NCCL EP staging buffers correctly), NVTEEpLayerConfig, NVTEEpHandle, and per-step allocation-free ops that are CUDA-graph capturable.
  • NCCL EP backend is a process-level singleton wrapping ncclEpGroup_t; runtime gates for SM≥90, NCCL≥2.30.4, and NVLS multicast are enforced in initialize(); handles are cached per handle_id and lazily opened on first use.
  • Build wiring auto-detects NCCL prefix via _discover_nccl_home, builds libnccl_ep.so from the 3rdparty/nccl submodule, and falls back to throwing stubs when NVTE_WITH_NCCL_EP=OFF.
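
The runtime gates listed above reduce to a few comparisons once the device and library values have been queried. The sketch below keeps only that pure logic so it runs without a GPU. ep_runtime_supported and nccl_version_code are hypothetical names; in the backend the inputs would come from cudaDeviceGetAttribute and ncclGetVersion, and the real code may report failure differently (for example by throwing).

```cpp
#include <cassert>

// Encode an NCCL version the way NCCL_VERSION_CODE does for releases >= 2.9:
// major * 10000 + minor * 100 + patch (so 2.30.4 encodes as 23004).
constexpr int nccl_version_code(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Hypothetical pure-logic gate. sm_major/sm_minor are the compute capability,
// nccl_version is the integer reported by ncclGetVersion, and
// multicast_supported reflects the CUDA multicast/NVLS query.
bool ep_runtime_supported(int sm_major, int sm_minor, int nccl_version,
                          bool multicast_supported) {
  assert(sm_major >= 0 && sm_minor >= 0);
  const int sm = sm_major * 10 + sm_minor;
  return sm >= 90 && nccl_version >= nccl_version_code(2, 30, 4) &&
         multicast_supported;
}
```

Keeping the comparison separate from the queries makes the gate easy to unit-test on machines that cannot run the EP kernels.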

Confidence Score: 4/5

Safe to merge with one fix: register_layer reads ep_group_ before acquiring the mutex, racing with a concurrent shutdown() call.

The backend is well-structured and the previously flagged issues (dtype sizing, max_recv_tokens_per_rank validation, NCCL_HOME warning, native arch handling) are all resolved in this revision. The remaining concern is that ncclEpHandleMemSize is called with ep_group_ outside the mutex in register_layer, while shutdown() acquires the same mutex and destroys ep_group_. A concurrent shutdown during layer registration — plausible in framework error-recovery paths — would cause a use-after-free in the NCCL EP group.

transformer_engine/common/ep/ep_backend.cpp — the register_layer method acquires the mutex too late.

Important Files Changed

| Filename | Overview |
|---|---|
| transformer_engine/common/ep/ep_backend.cpp | Core NCCL EP singleton implementation; register_layer reads ep_group_ before acquiring the mutex, creating a race with concurrent shutdown() calls. |
| transformer_engine/common/ep/ep_backend.h | Clean singleton header; handle cache and lifecycle members are well-structured; no issues found. |
| transformer_engine/common/ep/ep_api.cpp | Thin C-API delegation layer; parameter null checks are present and delegations are correct. |
| transformer_engine/common/ep/ep_api_stub.cpp | Stub path for non-EP builds; all stubs throw via ep_not_built() except nvte_ep_shutdown which is intentionally a no-op; correct. |
| transformer_engine/common/include/transformer_engine/ep.h | Well-documented public C API; added max_token_dtype field addresses the prior dtype-width staging buffer issue; no issues found. |
| transformer_engine/common/include/transformer_engine/comm_window.h | Minimal sideband struct for zero-copy symmem path; NULL-window convention is clear and correctly propagated through make_payload_tensor. |
| setup.py | NCCL EP build wiring; _discover_nccl_home is well-handled; stale libnccl_ep.so skips rebuild silently on arch changes. |
| transformer_engine/common/CMakeLists.txt | NCCL EP CMake wiring; header/lib discovery, rpath embedding, and stub fallback are all correctly gated on NVTE_WITH_NCCL_EP. |
| transformer_engine/common/util/logging.h | Adds NVTE_CHECK_NCCL macro following the same pattern as existing CUDA/cuBLAS macros; correct. |

Reviews (5): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
Comment thread transformer_engine/common/ep/ep_backend.cpp
Comment thread setup.py Outdated
Comment thread setup.py Outdated
Comment thread transformer_engine/common/ep/ep_backend.cpp Outdated
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng force-pushed the phuong/ep-2-commwindow branch from 099857f to 17e5126 Compare May 22, 2026 23:07
phu0ngng and others added 2 commits May 23, 2026 19:36
…em_reloc gating

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment on lines +77 to +79
# No MPI dependency — processes are spawned by run_test_ep.sh with
# --rank / --nranks flags. ncclUniqueId exchange uses a
# shared temp file (see test_ep_common.h for details).
Member

I believe that the other distributed tests do rely on MPI, so why don't we also do that here?

Comment thread tests/cpp_distributed/CMakeLists.txt Outdated
Comment thread tests/cpp_distributed/test_ep_common.h Outdated
Comment thread tests/cpp_distributed/test_ep_common.h Outdated

// RAII owner for a cudaMalloc'd device buffer; frees on destruction.
template <typename T>
struct DevBuf {
Member

We have a very similar thing already in the test_common.h

Comment thread tests/cpp_distributed/test_ep_init.cu Outdated
Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
// Spot-check 3 hidden-dim positions per token to catch partial-row writes.
const int probes[3] = {0, hidden_dim_ / 2, hidden_dim_ - 1};
for (int tok = 0; tok < num_tokens_; ++tok) {
float exp = __bfloat162float(h_tok[tok * hidden_dim_]) * static_cast<float>(top_k_);
Member

Why do we hardcode BF16 everywhere? I assume that NCCL EP works with the other datatypes, right?

Collaborator Author

Added tests for FP16 and FP32 in the pipeline tests and found that NCCL EP has assertions that expect the input to be BF16.
I requested that NCCL EP support other types. FP16 and FP32 unit tests will be skipped for now.

Comment thread tests/cpp_distributed/test_ep_pipeline.cu Outdated
Comment thread tests/cpp_distributed/test_ep_coverage.cu Outdated
namespace transformer_engine {
namespace ep {

/*! \brief EP backend singleton — owns the NCCL EP group; borrows the comm. */
Member

If it borrows the communicator then on the framework side we need to make sure that it stays alive.
Also, if it is a singleton, how does it work with multiple GPUs per process?

Collaborator Author

@phu0ngng phu0ngng May 29, 2026

> If it borrows the communicator then on the framework side we need to make sure that it stays alive.

Added a docstring note for this. Will see what to do in PyTorch.

> Also, if it is a singleton, how does it work with multiple GPUs per process?

Added a note that only single-process, single-device use is supported for now.

Comment thread transformer_engine/common/ep/ep_backend.h
Comment on lines +28 to +47
typedef struct {
  int ep_size;             /*!< EP world size. */
  int num_experts;         /*!< Total experts across all ranks. */
  int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */
  /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */
  int max_recv_tokens_per_rank;
  int hidden_dim;          /*!< Token hidden dimension. */
  int max_num_sms;         /*!< Max SMs for EP kernels. 0 = auto. */
  /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */
  int allow_handle_mem_reloc;
} NVTEEpGroupConfig;

/*! \brief Per-layer EP configuration. */
typedef struct {
  int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */
  int top_k;             /*!< Per-token expert fan-out. Required. */
  size_t dispatch_output_per_expert_alignment;
  /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match
   *   between nvte_ep_register_layer and nvte_ep_prepare. */
} NVTEEpLayerConfig;
Member

If we make this a public API then we should probably version those?

Collaborator Author

Hi, no other TE public struct is versioned, so I think EP should follow the same convention for now. We can add versioning for all structs in a follow-up PR.
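
For reference in a possible follow-up, one common way to version a public config struct is to carry a version field that the library checks on entry. The sketch below is purely hypothetical: EpLayerConfigV, kEpLayerConfigVersion, make_layer_config, and layer_config_compatible do not exist in this PR or in TE, and the field set is trimmed for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout version the library was built with.
constexpr std::uint32_t kEpLayerConfigVersion = 1;

// Hypothetical versioned variant of a layer config; not part of the PR.
struct EpLayerConfigV {
  std::uint32_t struct_version;  // version the caller compiled against
  int top_k;                     // per-token expert fan-out
  std::size_t dispatch_output_per_expert_alignment;
};

// Caller-side helper: stamp the struct with the compiled-in version.
EpLayerConfigV make_layer_config(int top_k, std::size_t alignment) {
  assert(top_k > 0);  // top_k is a required field
  return EpLayerConfigV{kEpLayerConfigVersion, top_k, alignment};
}

// Library-side check: accept only layouts this build knows how to read.
bool layer_config_compatible(const EpLayerConfigV& cfg) {
  return cfg.struct_version >= 1 && cfg.struct_version <= kEpLayerConfigVersion;
}
```

A config stamped by a newer caller than the library understands would be rejected up front instead of being misread field by field.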

phu0ngng added 3 commits May 27, 2026 14:12
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…rce at dispatch

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
phu0ngng and others added 15 commits May 28, 2026 15:31
… static layer registration

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ential per-op and pipeline tests

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…f16 tolerance

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…ower per-dispatch dtypes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…er TE_LIB to remove --no-as-needed

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…per disambiguation, OpenMP link) and gate FullForwardBackward to bf16

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…instead of dropping them from the parameter list

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Comment on lines +293 to +298
NVTE_CHECK_NCCL(ncclEpHandleMemSize(ep_group_, NCCL_EP_LAYOUT_EXPERT_MAJOR, &hcfg, &hm_size,
layer_config.top_k));
*handle_mem_size = hm_size;
std::lock_guard<std::mutex> lock(mutex_);
return insert_new_entry(hm_size, layer_config.top_k,
layer_config.dispatch_output_per_expert_alignment);
Contributor

P1 ep_group_ read outside the mutex in register_layer

ncclEpHandleMemSize accesses ep_group_ at line 293 before the lock_guard is acquired at line 296. shutdown() acquires the same mutex and calls ncclEpGroupDestroy(inst.ep_group_) followed by setting ep_group_ = nullptr. If a framework error handler calls shutdown() concurrently with a register_layer in flight, ncclEpHandleMemSize will race with ncclEpGroupDestroy, causing a use-after-free. Moving the lock_guard to before the ncclEpHandleMemSize call would eliminate the race.
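
The suggested reordering can be illustrated with a self-contained mock. This is a sketch of the locking pattern only, under stated assumptions: BackendSketch, Group, and query_mem_size are stand-ins (query_mem_size plays the role of ncclEpHandleMemSize), the returned value is a made-up size rather than a handle id, and throwing on an uninitialized backend is an assumption about the desired behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <stdexcept>

// Mock backend for illustration; ep_group_ stands in for the NCCL EP group.
class BackendSketch {
 public:
  void initialize() {
    std::lock_guard<std::mutex> lock(mutex_);
    ep_group_ = &group_storage_;
  }

  void shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    ep_group_ = nullptr;  // real code would destroy the group here
  }

  // Take the lock before touching ep_group_, so shutdown() cannot tear the
  // group down while the size query is in flight.
  std::size_t register_layer(int top_k) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ep_group_ == nullptr) {
      throw std::runtime_error("EP backend not initialized");
    }
    assert(top_k > 0);
    return query_mem_size(*ep_group_, top_k);
  }

 private:
  struct Group {
    std::size_t bytes_per_k = 256;  // arbitrary value for the mock
  };

  static std::size_t query_mem_size(const Group& g, int top_k) {
    return g.bytes_per_k * static_cast<std::size_t>(top_k);
  }

  std::mutex mutex_;
  Group group_storage_;
  Group* ep_group_ = nullptr;
};
```

With the guard at the top of register_layer, a concurrent shutdown() either completes first (and registration fails cleanly) or waits until the query has finished.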
