cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try#9265

Open
andralex wants to merge 6 commits into NVIDIA:main from andralex:andralex/stf-cuda-try-internal-scopes

Conversation

@andralex
Contributor

@andralex andralex commented Jun 4, 2026

Summary

Final (and most complex) slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration: parallel_for_scope.cuh + cuda_kernel_scope.cuh. Covers kernel launches, graph kernel/host nodes, occupancy queries, event timing, and driver-API calls.

Follow-up to #9241 (misc), #9248 (context/resources), #9249 (launch/host_launch). Companion to #9150 (utility), #9165 (stackable, merged).

Templated cuda_try<F> conversions

  • cuLaunchKernel, cuGraphAddKernelNode: driver API; cuda_try handles CUresult, and the file already uses cuda_try<cuFuncGetAttribute>/cuda_try<cuCtxGetDevice>.
  • cudaGraphAddKernelNode, cudaGraphAddHostNode — out-param assigned through the node reference.
  • cudaGraphAddDependencies, cudaEventRecord (start), cudaGetDevice.
  • cudaEventCreate -> cuda_try<cudaEventCreateWithFlags>(cudaEventDefault) (semantically identical; cudaEventCreate is an overload set).
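
To illustrate the conversion pattern in the list above, here is a minimal sketch of how a function-templated wrapper can turn a status-returning call with an out-parameter into a value-returning call that throws on failure. This is not the cudax implementation: `try_call`, `check`, `first_out`, `status_t`, `fake_get_device`, and `fake_record` are hypothetical stand-ins, and the sketch assumes the out-parameter comes first, as it does for cudaGetDevice and cudaEventCreateWithFlags.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for a status-returning C API (0 means success).
using status_t = int;
inline status_t fake_get_device(int* device) { *device = 3; return 0; }
inline status_t fake_record(int event) { return event >= 0 ? 0 : 1; }

// Pointee type of the first parameter of a function pointer.
template <class F>
struct first_out;
template <class R, class Out, class... Rest>
struct first_out<R (*)(Out*, Rest...)> { using type = Out; };

// Throw instead of aborting when the status is not success.
inline void check(status_t s) {
  if (s != 0) throw std::runtime_error("call failed with status " + std::to_string(s));
}

// If F accepts the arguments as given, call it and check the status.
// Otherwise treat F's first parameter as an out-pointer and return its value.
template <auto F, class... Args>
auto try_call(Args&&... args) {
  if constexpr (std::is_invocable_v<decltype(F), Args...>) {
    check(F(std::forward<Args>(args)...));
  } else {
    typename first_out<decltype(F)>::type out{};
    check(F(&out, std::forward<Args>(args)...));
    return out;
  }
}
```

With this sketch, `int device = try_call<fake_get_device>();` replaces a declare-then-fill sequence. Because `F` is a template argument, it has to name exactly one function, which is consistent with overload sets being kept in the runtime-status form.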

Kept in runtime-status cuda_try(...) form

Overload sets / multiple outputs, where cuda_try<F> can't name the function:

  • cudaLaunchKernel, cudaFuncGetAttributes: cuda_runtime.h templated wrappers.
  • cudaOccupancyMaxPotentialBlockSizeVariableSMem — two output params + templated.

Kept as cuda_safe_call (event-timing teardown)

The end cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime calls. In parallel_for_scope they run inside the noexcept SCOPE(exit); cuda_kernel_scope mirrors that for consistency (measurement cleanup — abort, don't throw).

Other correctness

  • Initialize the timing events to nullptr: the create calls are now assignments, which would otherwise trip GCC -Werror=maybe-uninitialized (same class as #9249).
  • parallel_for host-callback args: the stream path deletes them via SCOPE(fail) if cudaLaunchHostFunc throws (the callback takes ownership only on success); the graph path already hands ownership to a ctx resource before the node is created.
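
The stream-path ownership rule above can be sketched with a failure-only scope guard. This is an illustration under stated assumptions, not the STF code: `scope_fail`, `callback_args`, `enqueue_host_callback`, `host_callback`, `launch`, and `live_args` are hypothetical names, and the mock enqueue runs the callback immediately rather than on a stream.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Hypothetical guard: runs its callable only when the scope is left by an exception.
template <class Fn>
class scope_fail {
public:
  explicit scope_fail(Fn fn) : fn_(std::move(fn)), pending_(std::uncaught_exceptions()) {}
  scope_fail(const scope_fail&) = delete;
  scope_fail& operator=(const scope_fail&) = delete;
  ~scope_fail() {
    if (std::uncaught_exceptions() > pending_) fn_();
  }

private:
  Fn fn_;
  int pending_;
};

struct callback_args { int value; };
inline int live_args = 0;  // outstanding allocations, tracked for the demo only

// The callback owns the arguments once it has been enqueued successfully.
inline void host_callback(void* p) {
  delete static_cast<callback_args*>(p);
  --live_args;
}

// Mock enqueue: throws on failure, otherwise runs the callback right away.
inline void enqueue_host_callback(void (*cb)(void*), void* args, bool fail) {
  if (fail) throw std::runtime_error("enqueue failed");
  cb(args);
}

inline void launch(bool fail) {
  auto* args = new callback_args{42};
  ++live_args;
  // Delete the arguments only if the enqueue throws; on success the callback does it.
  scope_fail guard([&] { delete args; --live_args; });
  enqueue_host_callback(host_callback, args, fail);
}
```

Calling `launch(true)` propagates the exception without leaking the allocation, and `launch(false)` leaves the deletion to the callback, so the arguments are freed exactly once on either path.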

Test plan

  • CI green on the cudax matrix entries that build these headers / STF tests
  • No success-path behavior change; new behavior is throw-vs-abort plus cleanup on throw

…a_try

Final internal/ slice (the most complex): kernel launches, graph kernel/host
nodes, occupancy queries, event timing, and driver-API calls.

Templated cuda_try<F> conversions:
- cuLaunchKernel, cuGraphAddKernelNode (driver API; cuda_try handles CUresult),
- cudaGraphAddKernelNode / cudaGraphAddHostNode (out-param via the node ref),
- cudaGraphAddDependencies, cudaEventRecord (start), cudaGetDevice,
- cudaEventCreate -> cuda_try<cudaEventCreateWithFlags>(cudaEventDefault).

Kept in runtime-status cuda_try form (overload sets / multiple outputs):
- cudaLaunchKernel and cudaFuncGetAttributes (cuda_runtime.h templated wrappers),
- cudaOccupancyMaxPotentialBlockSizeVariableSMem (two output params + templated).

Kept as cuda_safe_call (event-timing teardown): the end record/synchronize/
elapsed calls. In parallel_for_scope they run inside the noexcept SCOPE(exit);
cuda_kernel_scope mirrors that for consistency.

Other correctness:
- Initialize the timing events to nullptr (the create calls are now assignments,
  which would otherwise trip GCC -Werror=maybe-uninitialized).
- parallel_for host-callback args: the stream path now deletes them via
  SCOPE(fail) if cudaLaunchHostFunc throws (the callback only takes ownership on
  success); the graph path already hands ownership to a ctx resource before the
  node is created.
@andralex andralex requested a review from a team as a code owner June 4, 2026 22:46
@andralex andralex requested a review from srinivasyadav18 June 4, 2026 22:46
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@andralex
Contributor Author

andralex commented Jun 4, 2026

/ok to test 80e0546

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b26457e7-e954-4478-a3a7-48e83892db1b

📥 Commits

Reviewing files that changed from the base of the PR and between 280cf56 and 1d55338.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR finishes the migration of CUDAX's internal STF scope implementations from cuda_safe_call to cuda_try in cudax/include/cuda/experimental/__stf/internal, targeting parallel_for_scope.cuh and cuda_kernel_scope.cuh. It converts many kernel-launch, graph, occupancy, event, and driver-API error paths from abort-on-error semantics to throw-on-error via cuda_try while preserving abort semantics in a few noexcept teardown scopes. Several correctness and cleanup fixes are included.

Changes

cuda_kernel_scope.cuh

  • Replaced many cuda_safe_call usages with cuda_try (templated driver-form where applicable) for:
    • driver API kernel launches (cuLaunchKernel)
    • graph node creation / dependencies (cuGraphAddKernelNode / cudaGraphAddDependencies)
    • device selection and event creation/recording in start()
  • Use cuda_try<cudaEventCreateWithFlags>(cudaEventDefault) for event creation.
  • Kept cuda_safe_call for event-timing teardown (end event record / synchronize / elapsed-time) to preserve abort-in-noexcept behavior; explanatory comments retained.
  • Initialized start_event and end_event to nullptr to avoid -Werror=maybe-uninitialized.

parallel_for_scope.cuh

  • Converted occupancy and related runtime/graph API calls to cuda_try where templated/wrapper forms are available.
  • Converted graph kernel/host node creation to cuda_try<...> with direct assignment to node handles (ownership to ctx resource).
  • Stream callback path: added SCOPE(fail) guard to delete heap-allocated callback-argument tuple if cudaLaunchHostFunc throws; switched to cuda_try for enqueue where appropriate.
  • Initialized device = -1 and event pointers to nullptr before use.
  • Left runtime overloads/multi-output APIs in runtime-status cuda_try form where a named templated conversion isn't feasible (examples: runtime cudaLaunchKernel wrapper, cudaFuncGetAttributes, cudaOccupancyMaxPotentialBlockSizeVariableSMem).

Error-handling strategy

  • Apply templated cuda_try for driver APIs returning CUresult (e.g., cuLaunchKernel, cuGraphAddKernelNode) when a specific function template can be named.
  • Use runtime-status cuda_try(...) for runtime overloads or APIs where identifying a single overload/template is not possible.
  • Preserve cuda_safe_call in teardown/noexcept scopes so failures abort rather than throw.

Additional fixes and notes

  • Initialize timing event pointers to nullptr to remove -Werror=maybe-uninitialized.
  • Ensure heap callback args are cleaned up on failure; graph path transfers ownership to context resource prior to node creation.
  • Commit message includes a portability/templating fix: replace NULL with nullptr in cuda_try to ensure correct template deduction across compilers.
  • PR contains five automated "/ok to test" comments (identical purpose, different run IDs) used to trigger CI.

Testing

  • CI must pass CUDAX matrix entries that build these headers and STF tests.
  • No success-path behavior change expected; changes affect error-path behavior (throw vs abort) and improve cleanup on failure.

Walkthrough

Replace many cuda_safe_call(...) wrappers with cuda_try(...) across kernel launch, graph node creation, event timing, and occupancy queries; initialize device/event members to safe defaults; add cleanup guard around host-callback enqueue; assign graph node/host-node handles directly from cuda_try returns.

Changes

Error Handling Standardization in Kernel Launch and Parallel-For Scopes

Layer / File(s) / Summary

  • Member initialization and safer defaults
    Files: cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh, cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    Summary: Event handles are initialized to nullptr and device ID to -1 before conditional creation/use.
  • Stream event timing (start/end)
    Files: cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh, cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    Summary: Stream-context device query, event creation with cudaEventCreateWithFlags(cudaEventDefault), event recording, synchronization, and elapsed-time computation use cuda_try.
  • Kernel launch and function attributes
    Files: cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh
    Summary: Runtime and driver API kernel launches (cudaLaunchKernel, cuLaunchKernel) and function attribute queries (cudaFuncGetAttributes) switch from cuda_safe_call to cuda_try.
  • Graph kernel node creation
    Files: cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh, cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    Summary: Graph kernel nodes across runtime/driver APIs and reduction/non-reduction paths use cuda_try<...> with direct assignment to node/get_node() instead of output-parameter calls.
  • Occupancy configuration queries
    Files: cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    Summary: CUDA occupancy queries for main and final reduction kernels wrap cudaOccupancyMaxPotentialBlockSizeVariableSMem with cuda_try.
  • Graph dependency edges and kernel chaining
    Files: cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh
    Summary: Dependency edge addition between consecutive kernel nodes in graph chains uses cuda_try<cudaGraphAddDependencies> while preserving _CCCL_CTK_AT_LEAST(13, 0) branching.
  • Host callbacks and graph host nodes
    Files: cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
    Summary: Host function launch is wrapped with SCOPE(fail) to delete heap-allocated callback arguments on failure; host graph node creation uses cuda_try<cudaGraphAddHostNode> with direct node assignment.

suggestion: Possibly related PRs

  • NVIDIA/cccl#9165: Related STF migration switching graph/launch error-wrapping from cuda_safe_call to cuda_try for graph node creation/dependencies.

suggestion: Suggested labels
stf

suggestion: Suggested reviewers

  • caugonnet
  • oleksandr-pavlyk

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh (1)

581-587: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: SCOPE(exit) will abort if event creation throws.

If any cuda_try call at lines 608-614 throws, SCOPE(exit) runs with record_time=true while start_event/end_event are still nullptr. Line 583 then calls cudaEventRecord(nullptr, ...), which fails and causes cuda_safe_call to abort.

Guard the timing cleanup with a null check:

       if (record_time)
       {
+        if (start_event == nullptr || end_event == nullptr)
+        {
+          // Event creation failed; skip timing teardown
+        }
+        else
+        {
           cuda_safe_call(cudaEventRecord(end_event, t.get_stream()));
           cuda_safe_call(cudaEventSynchronize(end_event));

           float milliseconds = 0;
           cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event));
           // ... rest of timing code ...
+        }
       }

Or set record_time only after successful event creation.

Also applies to: 608-614


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f5f64e32-9d53-4f4f-924b-b4d14ad08c0e

📥 Commits

Reviewing files that changed from the base of the PR and between 89c81d7 and 80e0546.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh
  • cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh

Converting cuda_safe_call(cuLaunchKernel(...)) to the templated
cuda_try<cuLaunchKernel>(...) form dropped one nesting level, but the
closing `nullptr));` kept both parens, leaving an extra `)` that broke
compilation of every TU including cuda_kernel_scope.cuh. Drop the stray
paren. Verified locally by building cudax.test.stf.cpp.scoped_graph_task.
@andralex
Contributor Author

andralex commented Jun 5, 2026

/ok to test d53149e

The end-of-timing event calls in cuda_kernel_scope::end() are single
functions and end() is not a noexcept SCOPE, so switch them to the
templated form: cuda_try<cudaEventRecord>, cuda_try<cudaEventSynchronize>,
and `const float ms = cuda_try<cudaEventElapsedTime>(start, end)`.

cudaLaunchKernel and cudaFuncGetAttributes stay in the runtime-status form:
both are overload sets (C API + cuda_runtime.h templated wrapper), so a bare
cuda_try<F> cannot name them. Verified locally by building
cudax.test.stf.interface.cuda_kernel_empty_args and
cudax.test.stf.examples.cuda_kernels_driver.
@andralex
Contributor Author

andralex commented Jun 5, 2026

/ok to test 280cf56

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1444d149-7d61-44fd-9752-118a57e52d35

📥 Commits

Reviewing files that changed from the base of the PR and between d53149e and 280cf56.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh

Comment on lines +399 to 403
+cuda_try<cudaEventRecord>(end_event, t.get_stream());
+cuda_try<cudaEventSynchronize>(end_event);

-float milliseconds = 0;
-cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event));
+const float milliseconds = cuda_try<cudaEventElapsedTime>(start_event, end_event);

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Ensure task cleanup always runs when timing calls fail.
cuda_try can throw in this block, and then t.clear() / support_task.reset() (Lines 418-423) are skipped after t.end_uncleared(). Please guard cleanup/reset with an unconditional scope-exit path so error handling here cannot leave task state uncleared.
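
One way to read the request above is an unconditional scope-exit guard placed before the calls that may throw. The following is only a sketch under assumptions: `scope_exit`, `fake_task`, `read_timing`, and `end_scope` are hypothetical, and whether the real t.clear() / support_task.reset() can safely run from such a guard is not established by this thread.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical guard: always runs its callable when the scope ends.
template <class Fn>
class scope_exit {
public:
  explicit scope_exit(Fn fn) : fn_(std::move(fn)) {}
  scope_exit(const scope_exit&) = delete;
  scope_exit& operator=(const scope_exit&) = delete;
  ~scope_exit() { fn_(); }

private:
  Fn fn_;
};

// Hypothetical stand-in for the task; it only tracks whether it was cleared.
struct fake_task {
  bool cleared = false;
  void end_uncleared() {}
  void clear() noexcept { cleared = true; }
};

// Stand-in for the timing calls, which may throw after the migration.
inline void read_timing(bool fail) {
  if (fail) throw std::runtime_error("timing query failed");
}

inline void end_scope(fake_task& t, bool timing_fails) {
  t.end_uncleared();
  // Registered before the throwing calls, so cleanup runs on both paths.
  scope_exit cleanup([&]() noexcept { t.clear(); });
  read_timing(timing_fails);
}
```

The guard's callable is marked noexcept because it runs during stack unwinding, where a second exception would terminate the program.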

@andralex
Contributor Author

andralex commented Jun 5, 2026

/ok to test 751c24d

@andralex andralex enabled auto-merge (squash) June 5, 2026 01:13
In the parallel_for reduction graph path, the dependencies argument was
NULL. As a templated cuda_try<F> argument, NULL is deduced as long/int
rather than a null-pointer-constant, so is_invocable finds no valid form
(Clang errors; GCC's __null accepts it, which is why local GCC builds
missed it). Use nullptr.
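
The deduction issue described above can be reproduced without CUDA. In this sketch, `add_node`, `callable_with`, and `forward_call` are hypothetical names; `long` and `int` stand in for what a NULL macro typically deduces to once it passes through a forwarding template.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical API taking a pointer argument, like a graph dependencies array.
inline int add_node(const int* deps, std::size_t num_deps) {
  return deps == nullptr ? 0 : static_cast<int>(num_deps);
}

// True when F can be called with arguments of the deduced types.
template <auto F, class... Args>
inline constexpr bool callable_with = std::is_invocable_v<decltype(F), Args...>;

// A literal zero converts to a pointer only when written directly at the call
// site. Once deduced as long or int, it is an ordinary integer value.
static_assert(!callable_with<add_node, long, std::size_t>);
static_assert(!callable_with<add_node, int, std::size_t>);
// std::nullptr_t keeps its null-pointer meaning through deduction.
static_assert(callable_with<add_node, std::nullptr_t, std::size_t>);

// Forwarding wrapper in the style of a function-templated call helper.
template <auto F, class... Args>
int forward_call(Args&&... args) {
  static_assert(std::is_invocable_v<decltype(F), Args...>, "no valid call form");
  return F(std::forward<Args>(args)...);
}
```

Here `forward_call<add_node>(nullptr, std::size_t{0})` compiles, while passing a `long` zero in the first position would trip the static_assert.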
@andralex
Contributor Author

andralex commented Jun 5, 2026

/ok to test 1d55338

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026
@andralex
Contributor Author

andralex commented Jun 5, 2026

/ok to test 6ba1241

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 34m 26s: Pass: 96%/55 | Total: 8h 12m | Max: 34m 23s | Hits: 77%/45004

See results here.

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

2 participants