Add CUB cooperative collectives#9266
Draft
tpn wants to merge 4 commits into
Draft
Conversation
Add experimental adapters and examples for warp and block collective result-placement semantics that are useful to cuda.coop frontends. The examples cover broadcast warp reduction, batched warp reduction, broadcast block reduction, row/segmented reductions, and scan/broadcast composition. Co-Authored-By: GPT-5.5 xhigh, Codex v0.130.0 Signed-off-by: Trent Nelson <trent@trent.me>
Add NVBench coverage for the experimental cooperative collective adapters and the direct CUB patterns they replace. The benchmark compares broadcast warp reduction, batched warp reduction, broadcast block reduction, row reduction, and scan/broadcast composition. Co-Authored-By: GPT-5.5 xhigh, Codex v0.130.0 Signed-off-by: Trent Nelson <trent@trent.me>
Add a shuffle allreduce fast path for warp reduce broadcast sums, covering both full warp and tiny logical-warp reductions without routing through owner-lane storage. Add a batched all-lane broadcast adapter for commutative reductions and benchmark it against serial batched reductions and the owner-lane WarpReduceBatched primitive. Co-Authored-By: GPT-5.5 xhigh, Codex v0.130.0 Signed-off-by: Trent Nelson <trent@trent.me>
Contributor
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Trent Nelson <trent@trent.me>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
closes
Adds CUB cooperative collective adapters with result-placement semantics used by downstream cuda.coop frontends:
cub::WarpReduceBroadcastfor warp reductions whose aggregate is returned to every logical lane.cub::WarpReduceBatchedBroadcastfor batched all-lane warp reductions.cub::BlockReduceBroadcastfor block reductions whose aggregate is returned to every thread.cub::BlockRowReduceandcub::BlockRowReduceWarpBroadcastfor row-shaped block reductions used by norm-style kernels.The implementation now lives in public CUB headers under the
cub::namespace:cub/warp/warp_reduce_broadcast.cuhcub/warp/warp_reduce_batched_broadcast.cuhcub/block/block_reduce_broadcast.cuhcub/block/block_row_reduce.cuhThis also adds focused Catch2 coverage and an nvbench target that compares the new adapters with equivalent handwritten or existing-CUB reduction idioms.
Style/guideline review notes:
AGENTS.md/CONTRIBUTING.mdguidance: pre-commit-managed formatting, targeted CUB build/test, and local consistency with nearby CUB warp/block collectives.Performance and adoption map
The table below combines the focused C++ primitive benchmarks with the cuda.coop CUTE/Numba wrapper experiments. The main point is that these primitives are not all sold the same way: owner-lane batched reduction is a clear speedup; all-lane and block-broadcast forms are mostly about replacing bespoke shuffle/shared-memory boilerplate with a named primitive while preserving performance.
cub::WarpReduceBroadcast<T>::Sum(x)cub::WarpReduceBatched<T, 4>::Sum(items); Numba wrapper:coop.warp.batched_sum(inputs, threads_in_warp=32)SHFLinstructions to 768. Numba reaches the intended codegen shape too: owner batched has 0SHFLand 4REDUXinstances in the wrapper experiment.Allreduce<4>/ inline shuffle helpercub::WarpReduceBroadcast<T, 4>::Sum(x)cub::WarpReduceBatchedBroadcast<T, 4, Width>::Sum(items); Numba wrapper:coop.warp.batched_sum_broadcast(inputs, outputs, ...); CUTE wrapper shape:coop.batched_sum(items)BlockReduce, shared scalar,if (threadIdx.x == 0), two barrierscub::BlockReduceBroadcast<T, BlockDimX>::Sum(x); CUTE-facing shape:coop.sum(value)warp_reduce,block_reduce,row_reduce_sumcoop.sum(...)/ row-reduction provider pathcuda.coop.cutlass.cute.sum/cuda.coop.cutlass.cute.batched_sumTakeaway: the C++ primitive layer already has one clear speedup case and several parity-with-less-boilerplate cases. The downstream Python story is credible as an API simplification path today, but CUTE still needs richer batched return plumbing before we should promise broad end-to-end speedups there.
Validation
pre-commit run --files cub/cub/warp/warp_reduce_broadcast.cuh cub/cub/warp/warp_reduce_batched_broadcast.cuh cub/cub/block/block_reduce_broadcast.cuh cub/cub/block/block_row_reduce.cuh cub/test/catch2_test_coop_collectives.cu cub/benchmarks/bench/collectives/coop_collectives.cu cub/cub/cub.cuhgit diff --checkninja -C build/cub-cpp20 cub.test.coop_collectivesctest --test-dir build/cub-cpp20 -R '^cub.test.coop_collectives$' --output-on-failureninja -C build/cub-benchmark cub.bench.collectives.coop_collectives.baseChecklist
Draft note: the new public headers include API comments and test snippets, but the CUB docs index/API page still needs a public-facing docs pass before this should be marked ready for review.