run to run warpspeed impl sm100+ by srinivasyadav18 · Pull Request #9263 · NVIDIA/cccl

srinivasyadav18 · 2026-06-04T19:09:35Z

Description

closes #7556

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-06-04T19:09:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-04T19:17:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea6c063b-5047-4b8b-afa0-fc5e45b22a54

📥 Commits

Reviewing files that changed from the base of the PR and between cbd13bb and 5550396.

📒 Files selected for processing (1)

cub/cub/device/dispatch/kernels/kernel_scan.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cub/cub/device/dispatch/kernels/kernel_scan.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR adds run-to-run deterministic execution to the warpspeed scan optimization on SM100+ targets, so DeviceScan can return the same results from one run to the next. The changes introduce a stable-reduction-order variant of the warpspeed lookahead logic and thread this stability setting through the scan dispatch pipeline.

Changes

Warpspeed Lookahead Stable Variant

Added warpIncrementalLookaheadStable() function template to cub/cub/detail/warpspeed/look_ahead.cuh that provides deterministic lookahead reduction by:

Anchoring reduction progress to 32-tile boundaries
Fixing reduction order by only reducing when an expected contiguous count of tile aggregates is available
Updating previous-state variables (idxTilePrev, aggrExclusiveCtaPrev) via reference in-place
Returning the computed exclusive aggregate for the last processed range
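
The ordering rule in the list above can be illustrated with a small host-side model. This is a minimal sketch in plain C++, not the code in look_ahead.cuh: `LookaheadState`, `next_window_end`, `try_advance_stable`, and `exclusive_prefix_stable` are hypothetical names, and the `num_ready` counter is an assumed stand-in for however the kernel observes published tile aggregates. It assumes the key property is that a window is reduced only once it is complete, so the floating-point association never depends on timing.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side model of a stable lookahead; not the CUB implementation.
struct LookaheadState
{
  int idx_tile_prev         = 0;    // first tile not yet folded into the prefix
  float aggr_exclusive_prev = 0.0f; // sum of tiles [0, idx_tile_prev)
};

// End of the next window: the next multiple of 32 after idx, clamped to idx_target.
inline int next_window_end(int idx, int idx_target)
{
  const int boundary = (idx / 32 + 1) * 32;
  return std::min(boundary, idx_target);
}

// Folds exactly one window into the state, or does nothing if that window is
// not fully published yet. The state is updated in place through the reference.
inline bool try_advance_stable(
  const std::vector<float>& tile_aggr, int num_ready, int idx_target, LookaheadState& st)
{
  assert(idx_target <= static_cast<int>(tile_aggr.size()));
  if (st.idx_tile_prev >= idx_target)
  {
    return false; // nothing left to fold
  }
  const int end = next_window_end(st.idx_tile_prev, idx_target);
  if (num_ready < end)
  {
    return false; // a partial window would change the summation order
  }
  float window = 0.0f;
  for (int i = st.idx_tile_prev; i < end; ++i)
  {
    window += tile_aggr[i]; // fixed left-to-right order inside the window
  }
  st.aggr_exclusive_prev += window;
  st.idx_tile_prev = end;
  return true;
}

// Simulates polling: `step` more tile aggregates become visible per poll.
inline float exclusive_prefix_stable(const std::vector<float>& tile_aggr, int idx_target, int step)
{
  assert(step > 0);
  const int num_tiles = static_cast<int>(tile_aggr.size());
  LookaheadState st;
  int num_ready = 0;
  while (st.idx_tile_prev < idx_target)
  {
    num_ready = std::min(num_ready + step, num_tiles);
    while (try_advance_stable(tile_aggr, num_ready, idx_target, st))
    {
    }
  }
  return st.aggr_exclusive_prev;
}
```

In this model, a target of tile 70 is always folded as the windows [0, 32), [32, 64), [64, 70), whatever value of `step` is used, so the returned prefix does not depend on the arrival schedule.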

Warpspeed Scan Pipeline Integration

Extended the warpspeed scan implementation in cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh with a new StableReductionOrder compile-time template parameter (default: false) that:

Selects between stable (warpIncrementalLookaheadStable) and non-stable (warpIncrementalLookahead) reduction paths
Modifies how previous-state updates and exclusive aggregates are computed based on the stability requirement
Propagates the setting from device kernel dispatch through the scan closure implementation
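
The compile-time routing described above could look roughly like the following. This is a minimal sketch, assuming C++17: `LookaheadPath`, `select_lookahead`, `scan_closure_sketch`, and `scan_body_sketch` are hypothetical stand-ins for the real closure and kernel body, which would call warpIncrementalLookaheadStable or warpIncrementalLookahead rather than return a tag.

```cpp
// Hypothetical tag that records which lookahead variant was selected.
enum class LookaheadPath
{
  non_stable,
  stable
};

// Picks the variant at compile time; the untaken branch is discarded.
template <bool StableReductionOrder = false>
constexpr LookaheadPath select_lookahead()
{
  if constexpr (StableReductionOrder)
  {
    return LookaheadPath::stable; // would call warpIncrementalLookaheadStable
  }
  else
  {
    return LookaheadPath::non_stable; // would call warpIncrementalLookahead
  }
}

// Stand-in for the scan closure: it carries the flag as a template parameter.
template <bool StableReductionOrder = false>
struct scan_closure_sketch
{
  constexpr LookaheadPath lookahead() const
  {
    return select_lookahead<StableReductionOrder>();
  }
};

// Stand-in for the kernel body: it forwards the flag to the closure.
template <bool StableReductionOrder = false>
constexpr LookaheadPath scan_body_sketch()
{
  return scan_closure_sketch<StableReductionOrder>{}.lookahead();
}

// The default keeps the existing non-stable behavior.
static_assert(scan_body_sketch<>() == LookaheadPath::non_stable);
static_assert(scan_body_sketch<true>() == LookaheadPath::stable);
```

Because the flag is a template parameter, the non-stable instantiation contains no stable-path code, which is presumably why the setting is threaded through as a compile-time value rather than a runtime argument.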

Kernel Dispatch Update

Updated DeviceScanKernel in cub/cub/device/dispatch/kernels/kernel_scan.cuh to pass the StableReductionOrder template parameter to device_scan_warpspeed_body, ensuring the stability requirement flows to the warpspeed execution path.

Policy Selection for Stable Reduction

Modified cub/cub/device/dispatch/tuning/tuning_scan.cuh to allow warpspeed scan selection when stable reduction order is required, but only for compute capability >= 10.0 (SM100+). Previously, warpspeed was skipped entirely for stable reduction requirements.
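
The change in gating can be modeled as a before/after pair of predicates. This is a sketch under stated assumptions, not the code in tuning_scan.cuh: `compute_capability`, `stable_gate_before`, and `stable_gate_after` are hypothetical names, and only the stable-reduction-order condition is modeled; any other conditions the selector applies before choosing warpspeed are left out.

```cpp
// Hypothetical model of the stable-order gate; not the code in tuning_scan.cuh.
struct compute_capability
{
  int major_cc;
  int minor_cc;
};

// Orders capabilities by major version first, then minor version.
constexpr bool operator>=(compute_capability lhs, compute_capability rhs)
{
  return lhs.major_cc != rhs.major_cc ? lhs.major_cc > rhs.major_cc : lhs.minor_cc >= rhs.minor_cc;
}

// Before the PR (as described): any stable-order request skipped warpspeed.
constexpr bool stable_gate_before(bool require_stable_reduction_order, compute_capability)
{
  return !require_stable_reduction_order;
}

// After the PR (as described): stable-order requests pass on cc >= 10.0.
constexpr bool stable_gate_after(bool require_stable_reduction_order, compute_capability cc)
{
  return !require_stable_reduction_order || cc >= compute_capability{10, 0};
}

// Stable order on SM100 was blocked before and is allowed after.
static_assert(!stable_gate_before(true, compute_capability{10, 0}), "");
static_assert(stable_gate_after(true, compute_capability{10, 0}), "");
// Stable order on SM90 stays blocked.
static_assert(!stable_gate_after(true, compute_capability{9, 0}), "");
// Requests without the stability requirement pass this gate either way.
static_assert(stable_gate_before(false, compute_capability{9, 0}), "");
static_assert(stable_gate_after(false, compute_capability{9, 0}), "");
```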

Related Issue

Closes #7556: Productize run-to-run DeviceScan

Walkthrough

Adds a deterministic warpIncrementalLookaheadStable lookahead to warpspeed scan, threads a compile-time StableReductionOrder flag into the warpspeed closure and dispatch, and updates policy gating to allow warpspeed on sm_100+ when stable reduction order is required.

Changes

Stable Warpspeed Scan Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Stable lookahead function `cub/cub/detail/warpspeed/look_ahead.cuh` | New `warpIncrementalLookaheadStable` deterministically anchors reduction progress to 32-tile boundaries, enforces fixed reduction order via expected tile count, updates `idxTilePrev` and `aggrExclusiveCtaPrev` by reference, and returns the exclusive aggregate. |
| Warpspeed kernel stable reduction routing `cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh`, `cub/cub/device/dispatch/kernels/kernel_scan.cuh` | `warpspeed_scan_closure` and `device_scan_warpspeed_body` gain `StableReductionOrder` template parameter; `lookahead` helper conditionally calls `warpIncrementalLookaheadStable` (stable path) or `warpIncrementalLookahead` (non-stable), and updates previous-state variables in the path-specific location. |
| Dispatch threading and policy gating `cub/cub/device/dispatch/tuning/tuning_scan.cuh` | `DeviceScanKernel` now forwards `StableReductionOrder` to warpspeed dispatch; policy selector allows warpspeed for stable reduction when compute capability >= sm_100 instead of blocking it outright. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Enable DeviceScan stable reduction path with warpspeed for run-to-run determinism [`#7556`] | ❓ | PR adds warpspeed-stable plumbing and policy gating, but does not show DeviceScan API overloads or env-based entry points required to expose run-to-run option at the public API layer. |

Possibly related PRs

NVIDIA/cccl#9169: Refactors warpspeed lookahead infrastructure; this PR adds the stable variant atop that refactor.
NVIDIA/cccl#9098: Propagates StableReductionOrder through DeviceScan dispatch; related to deterministic reduction-order plumbing.

Suggested reviewers

fbusato
bernhardmgruber
miscco

important: Confirm that warpspeed stable lookahead updates idxTilePrev and aggrExclusiveCtaPrev correctly across all lane widths and boundary conditions; the anchor-to-32-multiple alignment must not introduce off-by-one errors when tile indices are not 32-aligned.

important: Verify the if constexpr (StableReductionOrder) routing preserves memory-ordering and concurrent-access guarantees: stable path updates previous-state before shared-memory writeback while non-stable updates after, which changes leader/writeback timing.

suggestion: Consider adding a static_assert or comment that documents StableReductionOrder == true validity only for sm_100+ to catch accidental template misuse at compile-time.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)

1038-1047: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: Update the inline rationale for the require_stable_reduction_order → cc >= {10, 0} gate. warpIncrementalLookaheadStable is available for __cccl_ptx_isa >= 860 (sm_90+), but the scan policy selector only produces a scan_warpspeed_policy when cc >= {10, 0} (otherwise get_warpspeed_policy returns {}). Stable warpspeed on sm_90+ is therefore blocked by warpspeed policy/tuning availability, not by stable lookahead codegen availability.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411

📥 Commits

Reviewing files that changed from the base of the PR and between 316f9cc and cbd13bb.

📒 Files selected for processing (4)

cub/cub/detail/warpspeed/look_ahead.cuh
cub/cub/device/dispatch/kernels/kernel_scan.cuh
cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
cub/cub/device/dispatch/tuning/tuning_scan.cuh

srinivasyadav18 · 2026-06-05T14:41:35Z

/ok to test cbd13bb

github-actions · 2026-06-05T17:10:30Z

🥳 CI Workflow Results

🟩 Finished in 2h 26m: Pass: 100%/284 | Total: 11d 15h | Max: 2h 26m | Hits: 18%/1000913

See results here.

srinivasyadav18 · 2026-06-05T19:56:29Z

pre-commit.ci autofix

run to run warpspeed impl sm100+

cbd13bb

srinivasyadav18 requested a review from a team as a code owner June 4, 2026 19:09

srinivasyadav18 requested a review from pauleonix June 4, 2026 19:09

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

[pre-commit.ci] auto code formatting

5550396

srinivasyadav18 wants to merge 2 commits into NVIDIA:main from srinivasyadav18:run_to_run_opt_ws_sm100
