[cudax] Implement `cudax::coop::reduce` for warp groups within a block by davebayer · Pull Request #9258 · NVIDIA/cccl

davebayer · 2026-06-04T12:32:05Z

Fixes #9160.

coderabbitai · 2026-06-04T12:39:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: db13eebb-6d3f-4232-abe3-e42723c4ac5f

📥 Commits

Reviewing files that changed from the base of the PR and between 05ebea1 and f69a52a.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR implements cudax::coop::reduce for arbitrary warp groups within a CUDA block, addressing feature request #9160. It adds a warp-group reduction path that performs a two-stage reduction across warps in a group, adjusts the public wrapper to take the reduction functor by value, and includes tests validating multi-warp reductions.

Core Implementation Changes

cudax/include/cuda/experimental/__coop/reduce.cuh
- Added a new overload of __reduce_impl constrained to groups whose unit_type == warp_level and level_type == block_level.
- Implementation performs:
  - Per-warp reductions using cub::WarpReduce with per-warp TempStorage.
  - Writes one partial result per warp into shared memory, synchronizes, then reduces per-warp partials via a root-warp cub::WarpReduce.
  - Returns the reduced value (as cuda::std::optional) only from the group root; other threads return nullopt.
- Updated public reduce wrapper to accept the reduction functor by value (was forwarding reference) and forward it into __reduce_impl.

Supporting Changes

cudax/include/cuda/experimental/__group/queries.cuh
- Refactored non-matching _Unit/_GroupUnit branch in __rank_query_group to derive __unit_rank and __unit_count from the group's hierarchy via __rank_query/__count_query, then compute linear rank as __group_unit_rank * __unit_count + __unit_rank.
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
- make_instance now initializes the barrier only when mapping is valid, mapping rank is 0, and the calling thread’s rank within the _Unit (relative to the parent) is 0.
- Barrier init count scaled by __mapping_result.count() * __nthread_in_unit (where __nthread_in_unit is derived from gpu_thread.count for non-thread units).

Tests

cudax/test/coop/reduce/warps_within_block.cu (new)
- Adds ReduceKernel and Catch2 tests that form groups of nwarps_in_group warps and verify cudax::coop::reduce across multiple warps.
- Parametrized tests:
  - Integral types with integer reduction operators.
  - Floating-point types (float/double) with typical FP ops.
  - Per-thread input sizes 1–4.
  - Verification: exact equality for integers, relative tolerance for floats.
cudax/test/CMakeLists.txt
- Registers new test target coop.reduce.warps_within_block.

API / ABI Surface Changes

Added function: __reduce_impl(_Group, _Tp (&)[_Np], _RedFn) overload (internal path supporting warp groups).
Changed signature: reduce(_Group, _Tp (&)[_Np], _RedFn&&) → reduce(_Group, _Tp (&)[_Np], _RedFn) (functor passed by value).

Review Notes / Areas to Verify

Shared-memory layout and sizing for per-warp partials (compile-time warp-group extents) and correctness of root-warp selection.
Correctness and portability of the two-stage reduction across SMs and warp sizes.
Impact of changing the public reduce functor parameter to pass-by-value—check callers for expectations around forwarding/move semantics.
The commit also includes a fix for an nvcc segfault encountered during development (see commit message).

important:

Walkthrough

Adds a warp-scoped two-stage cooperative reduce overload, adjusts group rank computation and barrier initialization to account for per-unit thread layout, and adds parametrized CUDA tests exercising reductions across multiple warps, types, operators, and item counts.

Changes

Cooperative reduce for warps within block

Layer / File(s)	Summary
Core warp-level reduce implementation `cudax/include/cuda/experimental/__coop/reduce.cuh`	New `__reduce_impl` overload constrained to warp-level unit and block-level hierarchy performs two-stage reduction: per-warp `cub::WarpReduce` storing one partial per warp into shared memory, then final root-warp `cub::WarpReduce` over partials; only root warp/lane returns `optional` result. `reduce` wrapper now takes reduction functor by value.
Group hierarchy ranking correction `cudax/include/cuda/experimental/__group/queries.cuh`	Non-matching `_Unit`/`_GroupUnit` branch now derives `__unit_rank` and `__unit_count` from `__group.hierarchy()` via `__rank_query`/`__count_query` and composes linear rank as `__group_unit_rank * __unit_count + __unit_rank`.
Barrier synchronizer initialization fix `cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh`	`make_instance` initializes barrier only when mapping rank==0 and thread-rank-in-unit==0, computes per-unit thread counts for non-thread_level units, and scales initialization size to `__mapping_result.count() * __nthread_in_unit`.
Test validation for warp-level reduce `cudax/test/CMakeLists.txt`, `cudax/test/coop/reduce/warps_within_block.cu`	Adds `coop.reduce.warps_within_block` test target and a parametrized test kernel `ReduceKernel` that launches groups spanning multiple warps, tests NumItems=1..4, and verifies results for integral and floating-point type/operator combinations (exact for integrals, relative tolerance for floats).

Assessment against linked issues

Objective	Addressed	Explanation
Implement `cudax::coop::reduce` for arbitrary warps within a block groups [`#9160`]	✅

Possibly related PRs

NVIDIA/cccl#9203: Modifies __reduce_impl overloads and interacts with the same reduce wrapper changes.
NVIDIA/cccl#9167: Updates other __reduce_impl overloads and is related to wrapper functor passing changes.

Suggested reviewers

andralex
caugonnet
miscco

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

cudax/test/coop/reduce/warps_within_block.cu (1)

152-153: ⚡ Quick win

suggestion: The test names and tags still say this_warp on Lines 152 and 173, but this file/target is warps_within_block. Rename the C2H_TEST names/tags to warps_within_block to keep filtering and failure triage unambiguous.

Also applies to: 173-174

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9ab43e4-f31a-4ae2-94bb-fdad331fffa3

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7cb8b and 3aebfa2.

📒 Files selected for processing (5)

cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/__group/queries.cuh
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/warps_within_block.cu

github-actions · 2026-06-05T12:40:54Z

😬 CI Workflow Results

🟥 Finished in 32m 40s: Pass: 94%/55 | Total: 8h 19m | Max: 32m 40s | Hits: 67%/47318

See results here.

davebayer requested a review from a team as a code owner June 4, 2026 12:32

davebayer requested a review from andralex June 4, 2026 12:32

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread cudax/test/coop/reduce/warps_within_block.cu

This comment has been minimized.

Sign in to view

[cudax] Implement cudax::coop::reduce for warp groups within a block

05ebea1

davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 3aebfa2 to 05ebea1 Compare June 5, 2026 11:47

Fix nvcc segfault

f69a52a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cudax] Implement `cudax::coop::reduce` for warp groups within a block#9258

[cudax] Implement `cudax::coop::reduce` for warp groups within a block#9258
davebayer wants to merge 2 commits into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups

davebayer commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

github-actions Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davebayer commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Core Implementation Changes

Supporting Changes

Tests

API / ABI Surface Changes

Review Notes / Areas to Verify

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 32m 40s: Pass: 94%/55 | Total: 8h 19m | Max: 32m 40s | Hits: 67%/47318

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading