cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try by andralex · Pull Request #9265 · NVIDIA/cccl

andralex · 2026-06-04T22:46:57Z

Summary

Final (and most complex) slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration: parallel_for_scope.cuh + cuda_kernel_scope.cuh. Covers kernel launches, graph kernel/host nodes, occupancy queries, event timing, and driver-API calls.

Follow-up to #9241 (misc), #9248 (context/resources), #9249 (launch/host_launch). Companion to #9150 (utility), #9165 (stackable, merged).

Templated `cuda_try<F>` conversions

cuLaunchKernel, cuGraphAddKernelNode — driver API; cuda_try handles CUresult and the file already uses cuda_try<cuFuncGetAttribute>/cuda_try<cuCtxGetDevice>.
cudaGraphAddKernelNode, cudaGraphAddHostNode — out-param assigned through the node reference.
cudaGraphAddDependencies, cudaEventRecord (start), cudaGetDevice.
cudaEventCreate -> cuda_try<cudaEventCreateWithFlags>(cudaEventDefault) (semantically identical; cudaEventCreate is an overload set).
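
To illustrate the two call shapes listed above (a plain status check, and an out-parameter that comes back as the return value), here is a minimal sketch in plain C++. It is not the cudax implementation: `try_call`, `try_status`, `first_param`, and the `mock_*` functions are hypothetical stand-ins for `cuda_try<F>`, `cuda_try(...)`, and the CUDA runtime calls, and the dispatch on `is_invocable` is an assumption based on the commit notes in this PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Mock status type and APIs (assumptions; stand-ins for cudaError_t and runtime calls).
enum mock_status { mock_success = 0, mock_failure = 1 };

struct mock_event_t { unsigned flags; };
using mock_event = mock_event_t*;

// Leading out-parameter, shaped like cudaEventCreateWithFlags(cudaEvent_t*, unsigned).
inline mock_status mock_event_create(mock_event* out, unsigned flags)
{
  static mock_event_t storage{};
  storage.flags = flags;
  *out = &storage;
  return mock_success;
}

// No out-parameter, shaped like cudaEventRecord(event, stream).
inline mock_status mock_event_record(mock_event ev, int /*stream*/)
{
  return ev != nullptr ? mock_success : mock_failure;
}

// Status form: throw on failure instead of aborting.
inline void try_status(mock_status s)
{
  if (s != mock_success) throw std::runtime_error("mock API call failed");
}

// Extracts the first parameter type of a plain function pointer.
template <typename T> struct first_param;
template <typename R, typename P0, typename... Ps>
struct first_param<R (*)(P0, Ps...)> { using type = P0; };

// Templated form: F must name exactly one function.
template <auto F, typename... Args>
auto try_call(Args&&... args)
{
  if constexpr (std::is_invocable_v<decltype(F), Args...>)
  {
    // The arguments match F as written: call it and check the status.
    try_status(F(std::forward<Args>(args)...));
  }
  else
  {
    // Otherwise assume a leading out-parameter and return it by value.
    using out_t = std::remove_pointer_t<typename first_param<decltype(F)>::type>;
    out_t result{};
    try_status(F(&result, std::forward<Args>(args)...));
    return result;
  }
}
```

With this sketch, `mock_event ev = try_call<mock_event_create>(0u);` replaces a declare-then-fill sequence with a single assignment, which is the shape the graph-node and event-creation conversions above take.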

Kept in runtime-status `cuda_try(...)` form

Overload sets / multiple outputs, where cuda_try<F> can't name the function:

cudaLaunchKernel, cudaFuncGetAttributes — cuda_runtime.h templated wrappers.
cudaOccupancyMaxPotentialBlockSizeVariableSMem — two output params + templated.
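
The overload-set restriction can be reproduced without CUDA. The sketch below uses hypothetical names (`mock_get_attr`, `kernel_tag`, `try_status`, `query_attr`) and is only loosely modeled on a C entry point paired with a templated wrapper. Because the name refers to two entities, it cannot be passed as a single `auto` template argument, so the call is made first and its status is checked afterwards.

```cpp
#include <cassert>
#include <stdexcept>

enum mock_status { mock_success = 0, mock_failure = 1 };

// Status form: throw on failure instead of aborting.
inline void try_status(mock_status s)
{
  if (s != mock_success) throw std::runtime_error("mock API call failed");
}

// C-style entry point.
inline mock_status mock_get_attr(int* out, const void* func)
{
  if (func == nullptr) return mock_failure;
  *out = 42;
  return mock_success;
}

// Templated convenience wrapper with the same name: together they form an overload set.
template <typename T>
mock_status mock_get_attr(int* out, T* func)
{
  return mock_get_attr(out, static_cast<const void*>(func));
}

struct kernel_tag {};

inline int query_attr(kernel_tag* k)
{
  int value = 0;
  // A templated helper such as try_call<mock_get_attr>(k) would not compile here:
  // the compiler cannot pick one overload for an `auto` template argument.
  try_status(mock_get_attr(&value, k)); // overload resolution happens at the call
  return value;
}
```

The status form keeps ordinary overload resolution at the call site, at the cost of the caller declaring the output variable itself.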

Kept as `cuda_safe_call` (event-timing teardown)

The end cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime calls. In parallel_for_scope they run inside the noexcept SCOPE(exit); cuda_kernel_scope mirrors that for consistency (measurement cleanup — abort, don't throw).

Other correctness

Initialize the timing events to nullptr — the create calls are now assignments, which would otherwise trip GCC -Werror=maybe-uninitialized (same class as cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try #9249).
parallel_for host-callback args: the stream path deletes them via SCOPE(fail) if cudaLaunchHostFunc throws (the callback takes ownership only on success); the graph path already hands ownership to a ctx resource before the node is created.
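
The ownership rule for the stream path can be sketched with a small failure-only guard. This is an illustration under assumptions, not the STF `SCOPE(fail)` macro: `scope_fail`, `callback_args`, `enqueue_host_callback`, `launch`, and `live_args` are hypothetical names, and the mock enqueue frees the arguments immediately to stand in for a callback that runs later.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Runs its action only when the enclosing scope is left because of an exception.
template <typename F>
class scope_fail
{
public:
  explicit scope_fail(F f) : action_(std::move(f)), pending_(std::uncaught_exceptions()) {}
  scope_fail(const scope_fail&) = delete;
  scope_fail& operator=(const scope_fail&) = delete;
  ~scope_fail()
  {
    // More in-flight exceptions than at construction means this scope is unwinding.
    if (std::uncaught_exceptions() > pending_) action_();
  }

private:
  F action_;
  int pending_;
};

struct callback_args { int payload; };

inline int live_args = 0; // heap-allocated argument blocks still alive

// Hypothetical enqueue: the callback takes ownership of `args` only on success.
inline void enqueue_host_callback(callback_args* args, bool fail)
{
  if (fail) throw std::runtime_error("enqueue failed");
  delete args; // simulate the callback running and releasing its arguments
  --live_args;
}

inline void launch(bool fail)
{
  auto* args = new callback_args{1};
  ++live_args;
  // If the enqueue throws, nobody else owns `args`, so release them here.
  scope_fail guard{[&] { delete args; --live_args; }};
  enqueue_host_callback(args, fail);
}
```

On the success path the guard does nothing, which matches the PR's statement that success-path behavior is unchanged.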

Test plan

CI green on the cudax matrix entries that build these headers / STF tests
No success-path behavior change; new behavior is throw-vs-abort plus cleanup on throw


copy-pr-bot · 2026-06-04T22:47:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-04T22:47:06Z

/ok to test 80e0546

coderabbitai · 2026-06-04T22:52:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b26457e7-e954-4478-a3a7-48e83892db1b

📥 Commits

Reviewing files that changed from the base of the PR and between 280cf56 and 1d55338.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR finishes the migration of CUDAX's internal STF scope implementations from cuda_safe_call to cuda_try in cudax/include/cuda/experimental/__stf/internal, targeting parallel_for_scope.cuh and cuda_kernel_scope.cuh. It converts many kernel-launch, graph, occupancy, event, and driver-API error paths from abort-on-error semantics to throw-on-error via cuda_try while preserving abort semantics in a few noexcept teardown scopes. Several correctness and cleanup fixes are included.

Changes

cuda_kernel_scope.cuh

Replaced many cuda_safe_call usages with cuda_try (templated driver-form where applicable) for:
- driver API kernel launches (cuLaunchKernel)
- graph node creation / dependencies (cuGraphAddKernelNode / cudaGraphAddDependencies)
- device selection and event creation/recording in start()
Use cuda_try<cudaEventCreateWithFlags>(cudaEventDefault) for event creation.
Kept cuda_safe_call for event-timing teardown (end event record / synchronize / elapsed-time) to preserve abort-in-noexcept behavior; explanatory comments retained.
Initialized start_event and end_event to nullptr to avoid -Werror=maybe-uninitialized.

parallel_for_scope.cuh

Converted occupancy and related runtime/graph API calls to cuda_try where templated/wrapper forms are available.
Converted graph kernel/host node creation to cuda_try<...> with direct assignment to node handles (ownership to ctx resource).
Stream callback path: added SCOPE(fail) guard to delete heap-allocated callback-argument tuple if cudaLaunchHostFunc throws; switched to cuda_try for enqueue where appropriate.
Initialized device = -1 and event pointers to nullptr before use.
Left runtime overloads/multi-output APIs in runtime-status cuda_try form where a named templated conversion isn't feasible (examples: runtime cudaLaunchKernel wrapper, cudaFuncGetAttributes, cudaOccupancyMaxPotentialBlockSizeVariableSMem).

Error-handling strategy

Apply templated cuda_try<F> for driver APIs returning CUresult (e.g., cuLaunchKernel, cuGraphAddKernelNode) when a single function can be named as the template argument.
Use runtime-status cuda_try(...) for runtime overloads or APIs where identifying a single overload/template is not possible.
Preserve cuda_safe_call in teardown/noexcept scopes so failures abort rather than throw.

Additional fixes and notes

Initialize timing event pointers to nullptr to remove -Werror=maybe-uninitialized.
Ensure heap callback args are cleaned up on failure; graph path transfers ownership to context resource prior to node creation.
Commit message includes a portability/templating fix: replace NULL with nullptr in cuda_try to ensure correct template deduction across compilers.
PR contains five automated "/ok to test" comments (identical purpose, different run IDs) used to trigger CI.

Testing

CI must pass CUDAX matrix entries that build these headers and STF tests.
No success-path behavior change expected; changes affect error-path behavior (throw vs abort) and improve cleanup on failure.


Walkthrough

Replace many cuda_safe_call(...) wrappers with cuda_try(...) across kernel launch, graph node creation, event timing, and occupancy queries; initialize device/event members to safe defaults; add cleanup guard around host-callback enqueue; assign graph node/host-node handles directly from cuda_try returns.

Changes

Error Handling Standardization in Kernel Launch and Parallel-For Scopes

| Layer / File(s) | Summary |
| --- | --- |
| Member initialization and safer defaults: `cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh`, `cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh` | Event handles are initialized to `nullptr` and device ID to `-1` before conditional creation/use. |
| Stream event timing (start/end): `cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh`, `cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh` | Stream-context device query, event creation with `cudaEventCreateWithFlags(cudaEventDefault)`, event recording, synchronization, and elapsed-time computation use `cuda_try`. |
| Kernel launch and function attributes: `cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh` | Runtime and driver API kernel launches (`cudaLaunchKernel`, `cuLaunchKernel`) and function attribute queries (`cudaFuncGetAttributes`) switch from `cuda_safe_call` to `cuda_try`. |
| Graph kernel node creation: `cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh`, `cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh` | Graph kernel nodes across runtime/driver APIs and reduction/non-reduction paths use `cuda_try<...>` with direct assignment to `node`/`get_node()` instead of output-parameter calls. |
| Occupancy configuration queries: `cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh` | CUDA occupancy queries for main and final reduction kernels wrap `cudaOccupancyMaxPotentialBlockSizeVariableSMem` with `cuda_try`. |
| Graph dependency edges and kernel chaining: `cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh` | Dependency edge addition between consecutive kernel nodes in graph chains uses `cuda_try<cudaGraphAddDependencies>` while preserving `_CCCL_CTK_AT_LEAST(13, 0)` branching. |
| Host callbacks and graph host nodes: `cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh` | Host function launch is wrapped with `SCOPE(fail)` to delete heap-allocated callback arguments on failure; host graph node creation uses `cuda_try<cudaGraphAddHostNode>` with direct node assignment. |

suggestion: Possibly related PRs

NVIDIA/cccl#9165: Related STF migration switching graph/launch error-wrapping from cuda_safe_call to cuda_try for graph node creation/dependencies.

suggestion: Suggested labels
stf

suggestion: Suggested reviewers

caugonnet
oleksandr-pavlyk


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh (1)
581-587: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: SCOPE(exit) will abort if event creation throws.

If any cuda_try call at lines 608-614 throws, SCOPE(exit) runs with record_time=true but start_event/end_event still nullptr. Line 583 then calls cudaEventRecord(nullptr, ...), which fails and triggers cuda_safe_call to abort.

Guard the timing cleanup with a null check:
       if (record_time)
       {
+        if (start_event == nullptr || end_event == nullptr)
+        {
+          // Event creation failed; skip timing teardown
+        }
+        else
+        {
           cuda_safe_call(cudaEventRecord(end_event, t.get_stream()));
           cuda_safe_call(cudaEventSynchronize(end_event));

           float milliseconds = 0;
           cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event));
           // ... rest of timing code ...
+        }
       }
Or set record_time only after successful event creation.

Also applies to: 608-614

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f5f64e32-9d53-4f4f-924b-b4d14ad08c0e

📥 Commits

Reviewing files that changed from the base of the PR and between 89c81d7 and 80e0546.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh
cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh

Converting cuda_safe_call(cuLaunchKernel(...)) to the templated cuda_try<cuLaunchKernel>(...) form dropped one nesting level, but the closing `nullptr));` kept both parens, leaving an extra `)` that broke compilation of every TU including cuda_kernel_scope.cuh. Drop the stray paren. Verified locally by building cudax.test.stf.cpp.scoped_graph_task.

andralex · 2026-06-05T01:02:05Z

/ok to test d53149e

The end-of-timing event calls in cuda_kernel_scope::end() are single functions and end() is not a noexcept SCOPE, so switch them to the templated form: cuda_try<cudaEventRecord>, cuda_try<cudaEventSynchronize>, and `const float ms = cuda_try<cudaEventElapsedTime>(start, end)`. cudaLaunchKernel and cudaFuncGetAttributes stay in the runtime-status form: both are overload sets (C API + cuda_runtime.h templated wrapper), so a bare cuda_try<F> cannot name them. Verified locally by building cudax.test.stf.interface.cuda_kernel_empty_args and cudax.test.stf.examples.cuda_kernels_driver.

andralex · 2026-06-05T01:08:14Z

/ok to test 280cf56

…-internal-scopes

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1444d149-7d61-44fd-9752-118a57e52d35

📥 Commits

Reviewing files that changed from the base of the PR and between d53149e and 280cf56.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh

coderabbitai · 2026-06-05T01:12:49Z

+        cuda_try<cudaEventRecord>(end_event, t.get_stream());
+        cuda_try<cudaEventSynchronize>(end_event);

-        float milliseconds = 0;
-        cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event));
+        const float milliseconds = cuda_try<cudaEventElapsedTime>(start_event, end_event);



⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Ensure task cleanup always runs when timing calls fail.
cuda_try can throw in this block, and then t.clear() / support_task.reset() (Lines 418-423) are skipped after t.end_uncleared(). Please guard cleanup/reset with an unconditional scope-exit path so error handling here cannot leave task state uncleared.
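
One way to meet this request is an unconditional scope-exit guard declared before the throwing calls. The sketch below is an assumption about how that could look, not the change that landed: `scope_exit`, `mock_task`, and `end_with_timing` are hypothetical names, and the cleanup action is marked `noexcept` because it may run during stack unwinding.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Runs its action whenever the enclosing scope is left, by return or by exception.
template <typename F>
class scope_exit
{
public:
  explicit scope_exit(F f) : action_(std::move(f)) {}
  scope_exit(const scope_exit&) = delete;
  scope_exit& operator=(const scope_exit&) = delete;
  ~scope_exit() { action_(); }

private:
  F action_;
};

// Stand-in for the task state that has to be cleared at the end of the scope.
struct mock_task
{
  bool cleared = false;
  void clear() noexcept { cleared = true; }
};

inline void end_with_timing(mock_task& t, bool timing_fails)
{
  // Declared first, so it also runs if a later call throws.
  scope_exit cleanup{[&]() noexcept { t.clear(); }};

  // Stand-in for the record / synchronize / elapsed-time calls that may throw.
  if (timing_fails) throw std::runtime_error("timing call failed");
}
```

The exception still propagates to the caller; the guard only ensures the task state is cleared first.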

andralex · 2026-06-05T01:12:55Z

/ok to test 751c24d

In the parallel_for reduction graph path, the dependencies argument was NULL. As a templated cuda_try<F> argument, NULL is deduced as long/int rather than a null-pointer-constant, so is_invocable finds no valid form (Clang errors; GCC's __null accepts it, which is why local GCC builds missed it). Use nullptr.
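
The deduction issue can be checked at compile time without CUDA. In this sketch `mock_add_node` and `mock_node` are hypothetical stand-ins for a graph-node API that takes a dependency array. Once an argument has been deduced by a forwarding template, only its type reaches `is_invocable`: `std::nullptr_t` converts to any pointer, while an integer type does not, even if the value was a literal zero at the call site.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct mock_node;
enum mock_status { mock_success = 0, mock_failure = 1 };

// Shaped like a graph-node creation call: out-parameter, dependency array, count.
inline mock_status mock_add_node(mock_node** out, mock_node* const* deps, std::size_t num_deps)
{
  (void) deps;
  (void) num_deps;
  *out = nullptr;
  return mock_success;
}

using add_node_fn = decltype(&mock_add_node);

// nullptr is forwarded as std::nullptr_t, which converts to the pointer parameter.
constexpr bool accepts_nullptr =
  std::is_invocable_v<add_node_fn, mock_node**, std::nullptr_t, std::size_t>;

// An integer-typed argument (what a forwarded NULL can deduce to) does not convert.
constexpr bool accepts_long = std::is_invocable_v<add_node_fn, mock_node**, long, std::size_t>;
constexpr bool accepts_int  = std::is_invocable_v<add_node_fn, mock_node**, int, std::size_t>;

static_assert(accepts_nullptr, "nullptr stays a null pointer constant after deduction");
static_assert(!accepts_long && !accepts_int, "integer types are not pointer-convertible");
```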

andralex · 2026-06-05T01:53:16Z

/ok to test 1d55338

andralex · 2026-06-05T13:07:40Z

/ok to test 6ba1241

github-actions · 2026-06-05T13:44:12Z

😬 CI Workflow Results

🟥 Finished in 34m 26s: Pass: 96%/55 | Total: 8h 12m | Max: 34m 23s | Hits: 77%/45004

See results here.

andralex requested a review from a team as a code owner June 4, 2026 22:46

andralex requested a review from srinivasyadav18 June 4, 2026 22:46

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026


Merge remote-tracking branch 'origin/main' into andralex/stf-cuda-try…

751c24d

…-internal-scopes

coderabbitai Bot reviewed Jun 5, 2026


andralex enabled auto-merge (squash) June 5, 2026 01:13


caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026

Merge branch 'main' into andralex/stf-cuda-try-internal-scopes

6ba1241
