Skip to content

[cudax] Implement cudax::coop::reduce for warp groups within a block#9258

Open
davebayer wants to merge 2 commits into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups
Open

[cudax] Implement cudax::coop::reduce for warp groups within a block#9258
davebayer wants to merge 2 commits into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

Fixes #9160.

@davebayer davebayer requested a review from a team as a code owner June 4, 2026 12:32
@davebayer davebayer requested a review from andralex June 4, 2026 12:32
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: db13eebb-6d3f-4232-abe3-e42723c4ac5f

📥 Commits

Reviewing files that changed from the base of the PR and between 05ebea1 and f69a52a.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR implements cudax::coop::reduce for arbitrary warp groups within a CUDA block, addressing feature request #9160. It adds a warp-group reduction path that performs a two-stage reduction across warps in a group, adjusts the public wrapper to take the reduction functor by value, and includes tests validating multi-warp reductions.

Core Implementation Changes

  • cudax/include/cuda/experimental/__coop/reduce.cuh
    • Added a new overload of __reduce_impl constrained to groups whose unit_type == warp_level and level_type == block_level.
    • Implementation performs:
      • Per-warp reductions using cub::WarpReduce with per-warp TempStorage.
      • Writes one partial result per warp into shared memory, synchronizes, then reduces per-warp partials via a root-warp cub::WarpReduce.
      • Returns the reduced value (as cuda::std::optional) only from the group root; other threads return nullopt.
    • Updated public reduce wrapper to accept the reduction functor by value (was forwarding reference) and forward it into __reduce_impl.

Supporting Changes

  • cudax/include/cuda/experimental/__group/queries.cuh

    • Refactored non-matching _Unit/_GroupUnit branch in __rank_query_group to derive __unit_rank and __unit_count from the group's hierarchy via __rank_query/__count_query, then compute linear rank as __group_unit_rank * __unit_count + __unit_rank.
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh

    • make_instance now initializes the barrier only when mapping is valid, mapping rank is 0, and the calling thread’s rank within the _Unit (relative to the parent) is 0.
    • Barrier init count scaled by __mapping_result.count() * __nthread_in_unit (where __nthread_in_unit is derived from gpu_thread.count for non-thread units).

Tests

  • cudax/test/coop/reduce/warps_within_block.cu (new)

    • Adds ReduceKernel and Catch2 tests that form groups of nwarps_in_group warps and verify cudax::coop::reduce across multiple warps.
    • Parametrized tests:
      • Integral types with integer reduction operators.
      • Floating-point types (float/double) with typical FP ops.
      • Per-thread input sizes 1–4.
      • Verification: exact equality for integers, relative tolerance for floats.
  • cudax/test/CMakeLists.txt

    • Registers new test target coop.reduce.warps_within_block.

API / ABI Surface Changes

  • Added function: __reduce_impl(_Group, _Tp (&)[_Np], _RedFn) overload (internal path supporting warp groups).
  • Changed signature: reduce(_Group, _Tp (&)[_Np], _RedFn&&)reduce(_Group, _Tp (&)[_Np], _RedFn) (functor passed by value).

Review Notes / Areas to Verify

  • Shared-memory layout and sizing for per-warp partials (compile-time warp-group extents) and correctness of root-warp selection.
  • Correctness and portability of the two-stage reduction across SMs and warp sizes.
  • Impact of changing the public reduce functor parameter to pass-by-value—check callers for expectations around forwarding/move semantics.
  • The commit also includes a fix for an nvcc segfault encountered during development (see commit message).

important:

Walkthrough

Adds a warp-scoped two-stage cooperative reduce overload, adjusts group rank computation and barrier initialization to account for per-unit thread layout, and adds parametrized CUDA tests exercising reductions across multiple warps, types, operators, and item counts.

Changes

Cooperative reduce for warps within block

Layer / File(s) Summary
Core warp-level reduce implementation
cudax/include/cuda/experimental/__coop/reduce.cuh
New __reduce_impl overload constrained to warp-level unit and block-level hierarchy performs two-stage reduction: per-warp cub::WarpReduce storing one partial per warp into shared memory, then final root-warp cub::WarpReduce over partials; only root warp/lane returns optional result. reduce wrapper now takes reduction functor by value.
Group hierarchy ranking correction
cudax/include/cuda/experimental/__group/queries.cuh
Non-matching _Unit/_GroupUnit branch now derives __unit_rank and __unit_count from __group.hierarchy() via __rank_query/__count_query and composes linear rank as __group_unit_rank * __unit_count + __unit_rank.
Barrier synchronizer initialization fix
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
make_instance initializes barrier only when mapping rank==0 and thread-rank-in-unit==0, computes per-unit thread counts for non-thread_level units, and scales initialization size to __mapping_result.count() * __nthread_in_unit.
Test validation for warp-level reduce
cudax/test/CMakeLists.txt, cudax/test/coop/reduce/warps_within_block.cu
Adds coop.reduce.warps_within_block test target and a parametrized test kernel ReduceKernel that launches groups spanning multiple warps, tests NumItems=1..4, and verifies results for integral and floating-point type/operator combinations (exact for integrals, relative tolerance for floats).

Assessment against linked issues

Objective Addressed Explanation
Implement cudax::coop::reduce for arbitrary warps within a block groups [#9160]

Possibly related PRs

  • NVIDIA/cccl#9203: Modifies __reduce_impl overloads and interacts with the same reduce wrapper changes.
  • NVIDIA/cccl#9167: Updates other __reduce_impl overloads and is related to wrapper functor passing changes.

Suggested reviewers

  • andralex
  • caugonnet
  • miscco

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cudax/test/coop/reduce/warps_within_block.cu (1)

152-153: ⚡ Quick win

suggestion: The test names and tags still say this_warp on Lines 152 and 173, but this file/target is warps_within_block. Rename the C2H_TEST names/tags to warps_within_block to keep filtering and failure triage unambiguous.

Also applies to: 173-174


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9ab43e4-f31a-4ae2-94bb-fdad331fffa3

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7cb8b and 3aebfa2.

📒 Files selected for processing (5)
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/__group/queries.cuh
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/warps_within_block.cu

Comment thread cudax/test/coop/reduce/warps_within_block.cu
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@davebayer davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 3aebfa2 to 05ebea1 Compare June 5, 2026 11:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 32m 40s: Pass: 94%/55 | Total: 8h 19m | Max: 32m 40s | Hits: 67%/47318

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement cudax::coop::reduce for arbitrary warps within a block groups

1 participant