
run to run warpspeed impl sm100+ #9263

Open
srinivasyadav18 wants to merge 2 commits into NVIDIA:main from srinivasyadav18:run_to_run_opt_ws_sm100

Conversation

@srinivasyadav18
Contributor

Description

closes #7556

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested a review from a team as a code owner June 4, 2026 19:09
@srinivasyadav18 srinivasyadav18 requested a review from pauleonix June 4, 2026 19:09
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea6c063b-5047-4b8b-afa0-fc5e45b22a54

📥 Commits

Reviewing files that changed from the base of the PR and between cbd13bb and 5550396.

📒 Files selected for processing (1)
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR adds run-to-run deterministic support to the warpspeed scan optimization on SM100+ targets, so that DeviceScan can produce identical results across runs. It introduces a stable-reduction-order variant of the warpspeed lookahead logic and threads that setting through the scan dispatch pipeline.

Changes

Warpspeed Lookahead Stable Variant

Added warpIncrementalLookaheadStable() function template to cub/cub/detail/warpspeed/look_ahead.cuh that provides deterministic lookahead reduction by:

  • Anchoring reduction progress to 32-tile boundaries
  • Fixing reduction order by only reducing when an expected contiguous count of tile aggregates is available
  • Updating previous-state variables (idxTilePrev, aggrExclusiveCtaPrev) via reference in-place
  • Returning the computed exclusive aggregate for the last processed range
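
Floating-point addition is not associative, so a lookahead that folds in however many tile aggregates happen to be visible can round differently from run to run. The host-side sketch below illustrates the idea in the list above under stated assumptions: it consumes aggregates only in complete, 32-aligned windows and always in the same left-to-right order. The names `kWindow`, `reduce_ready_windows`, and `reduce_with_schedule` are hypothetical and are not taken from look_ahead.cuh; the real function operates on warps and tile state rather than a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical window width, mirroring the 32-tile boundary in the summary.
constexpr std::size_t kWindow = 32;

// Fold complete, window-aligned groups of tile aggregates into `running`.
// `available` is how many contiguous aggregates are visible on this call.
// `next` is the first tile not yet folded in; it stays a multiple of kWindow.
inline void reduce_ready_windows(const std::vector<float>& aggregates,
                                 std::size_t available,
                                 std::size_t& next,
                                 float& running)
{
  assert(available <= aggregates.size());
  assert(next % kWindow == 0);
  // A partially available window is left alone until it is complete.
  while (next + kWindow <= available)
  {
    float window_sum = 0.0f;
    for (std::size_t i = 0; i < kWindow; ++i)
    {
      window_sum += aggregates[next + i]; // fixed left-to-right order
    }
    running += window_sum;
    next += kWindow;
  }
}

// Drive the reduction with an arbitrary arrival schedule
// (each entry is the number of aggregates visible at that step).
inline float reduce_with_schedule(const std::vector<float>& aggregates,
                                  const std::vector<std::size_t>& schedule)
{
  std::size_t next = 0;
  float running    = 0.0f;
  for (std::size_t available : schedule)
  {
    reduce_ready_windows(aggregates, available, next, running);
  }
  return running;
}
```

Because every schedule performs the same additions in the same order, schedules such as `{96}` and `{10, 33, 64, 70, 96}` yield bit-identical sums for the same 96 aggregates. The cost is that a partially filled window must wait, which is the trade-off a stable variant accepts.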

Warpspeed Scan Pipeline Integration

Extended the warpspeed scan implementation in cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh with a new StableReductionOrder compile-time template parameter (default: false) that:

  • Selects between stable (warpIncrementalLookaheadStable) and non-stable (warpIncrementalLookahead) reduction paths
  • Modifies how previous-state updates and exclusive aggregates are computed based on the stability requirement
  • Propagates the setting from device kernel dispatch through the scan closure implementation
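
The compile-time routing described above can be sketched with `if constexpr`. This is a minimal illustration, not the code in kernel_scan_warpspeed.cuh: the stub functions and the `LookaheadPath` enum are hypothetical stand-ins that only report which branch was selected, and the real lookahead functions take arguments that are omitted here.

```cpp
#include <cassert>

// Hypothetical marker for which lookahead path ran.
enum class LookaheadPath
{
  stable,
  non_stable
};

// Hypothetical stand-ins for warpIncrementalLookaheadStable and
// warpIncrementalLookahead; they do no reduction work.
inline LookaheadPath lookahead_stable_stub()
{
  return LookaheadPath::stable;
}

inline LookaheadPath lookahead_non_stable_stub()
{
  return LookaheadPath::non_stable;
}

// The flag is a template parameter, so the branch that is not taken
// is discarded at compile time (requires C++17).
template <bool StableReductionOrder = false>
LookaheadPath run_lookahead()
{
  if constexpr (StableReductionOrder)
  {
    return lookahead_stable_stub();
  }
  else
  {
    return lookahead_non_stable_stub();
  }
}

// A caller forwards its own flag unchanged, as the dispatch layer
// is described as doing.
template <bool StableReductionOrder = false>
LookaheadPath scan_body_stub()
{
  return run_lookahead<StableReductionOrder>();
}
```

Calling `scan_body_stub<true>()` selects the stable stub, while `scan_body_stub<>()` keeps the default non-stable path, so existing callers that never pass the flag are unaffected.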

Kernel Dispatch Update

Updated DeviceScanKernel in cub/cub/device/dispatch/kernels/kernel_scan.cuh to pass the StableReductionOrder template parameter to device_scan_warpspeed_body, ensuring the stability requirement flows to the warpspeed execution path.

Policy Selection for Stable Reduction

Modified cub/cub/device/dispatch/tuning/tuning_scan.cuh to allow warpspeed scan selection when stable reduction order is required, but only for compute capability >= 10.0 (SM100+). Previously, warpspeed was skipped entirely for stable reduction requirements.
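
The change in gating can be written as two small predicates. This is a sketch assuming only what the paragraph above states; `compute_capability_stub`, `at_least`, and the two `allows_warpspeed_*` functions are hypothetical names and do not reproduce the selector in tuning_scan.cuh, which also depends on whether a warpspeed policy exists for the target.

```cpp
#include <cassert>

// Hypothetical compute-capability pair, compared major first, then minor.
struct compute_capability_stub
{
  int major_version;
  int minor_version;
};

constexpr bool at_least(compute_capability_stub cc, compute_capability_stub min_cc)
{
  return cc.major_version > min_cc.major_version
      || (cc.major_version == min_cc.major_version && cc.minor_version >= min_cc.minor_version);
}

// Before the PR (as described): a stable-order requirement rules out warpspeed.
constexpr bool allows_warpspeed_before(bool require_stable_reduction_order)
{
  return !require_stable_reduction_order;
}

// After the PR (as described): a stable-order requirement is acceptable
// when the compute capability is at least 10.0.
constexpr bool allows_warpspeed_after(compute_capability_stub cc, bool require_stable_reduction_order)
{
  return !require_stable_reduction_order || at_least(cc, compute_capability_stub{10, 0});
}
```

With these definitions, a stable request on compute capability 10.0 passes the new gate and a stable request on 9.0 does not, while non-stable requests pass both gates regardless of architecture.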

Related Issue

Closes #7556: Productize run-to-run DeviceScan

Walkthrough

Adds a deterministic warpIncrementalLookaheadStable lookahead to warpspeed scan, threads a compile-time StableReductionOrder flag into the warpspeed closure and dispatch, and updates policy gating to allow warpspeed on sm_100+ when stable reduction order is required.

Changes

Stable Warpspeed Scan Implementation

Layer: Stable lookahead function
File(s): cub/cub/detail/warpspeed/look_ahead.cuh
Summary: New warpIncrementalLookaheadStable deterministically anchors reduction progress to 32-tile boundaries, enforces fixed reduction order via expected tile count, updates idxTilePrev and aggrExclusiveCtaPrev by reference, and returns the exclusive aggregate.

Layer: Warpspeed kernel stable reduction routing
File(s): cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh, cub/cub/device/dispatch/kernels/kernel_scan.cuh
Summary: warpspeed_scan_closure and device_scan_warpspeed_body gain a StableReductionOrder template parameter; the lookahead helper conditionally calls warpIncrementalLookaheadStable (stable path) or warpIncrementalLookahead (non-stable path), and updates previous-state variables in the path-specific location.

Layer: Dispatch threading and policy gating
File(s): cub/cub/device/dispatch/tuning/tuning_scan.cuh
Summary: DeviceScanKernel now forwards StableReductionOrder to warpspeed dispatch; the policy selector allows warpspeed for stable reduction when compute capability >= sm_100 instead of blocking it outright.

Assessment against linked issues

Objective: Enable DeviceScan stable reduction path with warpspeed for run-to-run determinism [#7556]
Addressed: (no status captured)
Explanation: PR adds warpspeed-stable plumbing and policy gating, but does not show DeviceScan API overloads or env-based entry points required to expose run-to-run option at the public API layer.

Possibly related PRs

  • NVIDIA/cccl#9169: Refactors warpspeed lookahead infrastructure; this PR adds the stable variant atop that refactor.
  • NVIDIA/cccl#9098: Propagates StableReductionOrder through DeviceScan dispatch; related to deterministic reduction-order plumbing.

Suggested reviewers

  • fbusato
  • bernhardmgruber
  • miscco

important: Confirm that warpspeed stable lookahead updates idxTilePrev and aggrExclusiveCtaPrev correctly across all lane widths and boundary conditions; the anchor-to-32-multiple alignment must not introduce off-by-one errors when tile indices are not 32-aligned.
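
The boundary cases named in this note can be pinned down with compile-time checks. The sketch below assumes that "anchoring" means rounding a tile index down to the nearest multiple of 32; that reading, and the names `kTilesPerWindow`, `anchor_of`, and `tiles_since_anchor`, are assumptions for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical window width matching the 32-tile boundary.
constexpr std::uint32_t kTilesPerWindow = 32;

// Round a tile index down to the nearest multiple of kTilesPerWindow.
constexpr std::uint32_t anchor_of(std::uint32_t idx_tile)
{
  return idx_tile - (idx_tile % kTilesPerWindow);
}

// Number of tiles in [anchor_of(idx_tile), idx_tile), i.e. how many
// predecessors inside the current window precede this tile.
constexpr std::uint32_t tiles_since_anchor(std::uint32_t idx_tile)
{
  return idx_tile - anchor_of(idx_tile);
}

// Indices at and around a boundary, where off-by-one errors would show up.
static_assert(anchor_of(0) == 0, "first tile anchors at 0");
static_assert(anchor_of(31) == 0, "last tile of window 0 anchors at 0");
static_assert(anchor_of(32) == 32, "aligned index is its own anchor");
static_assert(anchor_of(33) == 32, "unaligned index rounds down");
static_assert(tiles_since_anchor(32) == 0, "aligned index has no in-window predecessors");
static_assert(tiles_since_anchor(63) == 31, "last tile of a window has 31 predecessors");
```

Checks of this shape cost nothing at run time and fail the build if the rounding rule changes, which makes them a cheap complement to device-side tests.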

important: Verify the if constexpr (StableReductionOrder) routing preserves memory-ordering and concurrent-access guarantees: stable path updates previous-state before shared-memory writeback while non-stable updates after, which changes leader/writeback timing.

suggestion: Consider adding a static_assert or comment that documents StableReductionOrder == true validity only for sm_100+ to catch accidental template misuse at compile-time.


Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)

1038-1047: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: Update the inline rationale for the require_stable_reduction_order / cc >= {10, 0} gate. warpIncrementalLookaheadStable is available for __cccl_ptx_isa >= 860 (sm_90+), but the scan policy selector only produces a scan_warpspeed_policy when cc >= {10, 0} (otherwise get_warpspeed_policy returns {}). Stable warpspeed on sm_90+ is therefore blocked by warpspeed policy/tuning availability, not by stable lookahead codegen availability.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dfcdb20c-106f-4ae5-a688-9e19e5475411

📥 Commits

Reviewing files that changed from the base of the PR and between 316f9cc and cbd13bb.

📒 Files selected for processing (4)
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh

@srinivasyadav18
Contributor Author

/ok to test cbd13bb

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

🥳 CI Workflow Results

🟩 Finished in 2h 26m: Pass: 100%/284 | Total: 11d 15h | Max: 2h 26m | Hits: 18%/1000913

See results here.

@srinivasyadav18
Contributor Author

pre-commit.ci autofix

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Productize run-to-run DeviceScan

1 participant