cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try by andralex · Pull Request #9249 · NVIDIA/cccl

andralex · 2026-06-04T03:40:53Z

Summary

Third slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration (complexity-ordered sequence). Covers the kernel/host launch scopes, which share an event-timing pattern.

Follow-up to #9241 (misc) and #9248 (context/resources). Companion to #9150 (utility), #9165 (stackable, merged).

Changes

Templated `cuda_try<F>` conversions

cudaLaunchKernelExC, cudaGraphAddKernelNode (out-param assigned through the cudaGraphNode_t&), cudaGraphKernelNodeSetAttribute, cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param), cudaLaunchHostFunc.

Kept in runtime-status form (overload sets — `cuda_try<F>` can't name them)

cudaEventCreate — cuda_runtime.h adds a flags overload.
cudaMallocAsync — templated wrapper.

Event-timing: end calls stay `cuda_safe_call`

The end cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime run inside the enclosing noexcept SCOPE(exit) body in both files. They keep cuda_safe_call on purpose — a CUDA error there must abort, not throw through the guard (which would std::terminate). The start create/record calls (normal flow) are converted.

Host-callback enqueue leak guard

The two stream-path cudaLaunchHostFunc enqueues now have a SCOPE(fail) { delete resolved / wrapper; }. The host callback only takes ownership of the heap args once the enqueue succeeds, so converting cuda_safe_call -> cuda_try would otherwise leak them on a throw. The graph-path cudaGraphAddHostNode calls don't need this — their args are owned by a ctx resource added before the node is created.

Notes

Pre-existing, left as-is: the timing events created in these paths are never cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).
Overload-set determination was made against the CTK 13.2 headers.

Test plan

CI green on the cudax matrix entries that build these headers / STF tests
No success-path behavior change; new behavior is throw-vs-abort plus rollback/cleanup on throw

Third internal/ slice, covering the kernel/host launch scopes and their shared event-timing pattern. - Convert eligible calls to the templated cuda_try<F> form: cudaLaunchKernelExC, cudaGraphAddKernelNode (out-param -> ref), cudaGraphKernelNodeSetAttribute, cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param -> ref), cudaLaunchHostFunc. - cudaEventCreate and cudaMallocAsync stay in the runtime-status form: both are overload sets (cuda_runtime.h flags overload / templated wrapper), so cuda_try<F> cannot name them. - Event timing's end record/synchronize/elapsed run inside the noexcept SCOPE(exit) body, so they keep cuda_safe_call: a CUDA error there should abort rather than throw through the guard (which would std::terminate). - The two stream-path cudaLaunchHostFunc enqueues now get a SCOPE(fail) that deletes the heap callback args (resolved / wrapper) if the enqueue throws -- the callback only takes ownership once the enqueue succeeds, so this closes the leak the new throw path would otherwise introduce. The graph-path host nodes are already covered because their args are owned by a ctx resource added before the node is created. Pre-existing and left as-is: the timing events created here are never cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).

copy-pr-bot · 2026-06-04T03:40:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-04T03:41:03Z

/ok to test 1c9ba06

coderabbitai · 2026-06-04T03:47:13Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

suggestion:

Walkthrough

CUDA runtime calls in host and kernel launch scopes are migrated from cuda_safe_call to cuda_try; host-callback enqueue paths switch heap payloads to std::unique_ptr with SCOPE(fail) cleanup and transfer ownership into ctx resources for graph mode; timing end-of-scope still uses cuda_safe_call inside noexcept guards.

Changes

Error handling modernization

Layer / File(s)	Summary
Kernel launch and graph operations `cudax/include/cuda/experimental/__stf/internal/launch.cuh`	`cudaLaunchKernelExC`, `cudaGraphAddKernelNode`, and `cudaGraphKernelNodeSetAttribute` now use `cuda_try` instead of `cuda_safe_call`. (ranges: `range_02a886e45cfd`, `range_461e159f15d6`)
Device async memory & per-run timing `cudax/include/cuda/experimental/__stf/internal/launch.cuh`	Async temporary device allocation/free and per-run CUDA event create/record use `cuda_try`; end-of-scope recording/sync remains `cuda_safe_call` inside a `noexcept` `SCOPE(exit)`. (ranges: `range_030de661f719`, `range_411a619a2aed`, `range_b6564207738b`)
Host callback enqueue with timing `cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh`	Added `#include <memory>`; stream timing uses `cuda_try<cudaEventCreateWithFlags>` and `cuda_try<cudaEventRecord>`; untyped (`resolved`) and typed (`wrapper`) payloads use `std::unique_ptr` and `cuda_try` for `cudaGraphAddHostNode` (graph) or `cudaLaunchHostFunc` (stream); ownership transferred to `host_callback_args_resource` for graph nodes or deleted by the callback on non-graph paths; successful enqueue/node-add calls `release()` to disarm automatic cleanup. (ranges: `range_a8d50701700e`, `range_b2990e8d97ae`, `range_c907250cf74b`, `range_e8d7639d7c19`, `range_c53d61b3c14d`, `range_b85ac2e944ff`, `range_4652f2ff94c1`)

suggested labels

stf

suggested reviewers

caugonnet
oleksandr-pavlyk

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

123-135: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: After cudaMallocAsync succeeds, any exception from the launch path unwinds past Line 135 and never releases th_dev_tmp_ptr. That turns failed launches into a device-memory leak. Add a fail guard immediately after allocation and keep its cleanup non-throwing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34a9ea1e-214e-42a6-a774-2b2ebd7965aa

📥 Commits

Reviewing files that changed from the base of the PR and between 6383b91 and 1c9ba06.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
cudax/include/cuda/experimental/__stf/internal/launch.cuh

…eation cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload), so it cannot be named by the templated cuda_try<F> form. Use the non-overloaded cudaEventCreateWithFlags with cudaEventDefault instead, which is exactly what cudaEventCreate(&e) does internally, so behavior is unchanged while keeping the templated form.

andralex · 2026-06-04T03:47:24Z

/ok to test d2105b3

coderabbitai

♻️ Duplicate comments (1)

cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

366-375: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Install the task cleanup guard before the first throwing CUDA call. Line 359 calls t.start() before any cleanup guard exists, and Line 389 is the first guard. If any cuda_try at Lines 373-375 throws, t.end_uncleared() and t.clear() are skipped; if Line 375 fails after the event creations, those event handles leak too. Move the existing guard above this block, or add an equivalent pre-guard cleanup path. As per coding guidelines, cudax/**/*: Focus on correctness, lifetime/resource ownership, stream ordering, host/device annotations, experimental API clarity, tests, and compatibility with the supported CUDA toolchains.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dacb43fe-0d9e-46e8-9502-66e98d338834

📥 Commits

Reviewing files that changed from the base of the PR and between 1c9ba06 and d2105b3.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
cudax/include/cuda/experimental/__stf/internal/launch.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh

The host_launch callback args (resolved / wrapper) are heap-allocated and guarded by SCOPE(fail) { delete ...; }. Transfer of ownership to the graph-path ctx resource was happening in the wrong order: - Untyped path: `resolved` was set to nullptr right after add_resource, but it is also used as the host node's userData. That made the graph node receive a null userData, so the callback dereferenced null on the success path. - Typed path: add_resource ran before cudaGraphAddHostNode, so a throw from the node creation would delete `wrapper` twice (SCOPE(fail) plus the resource's release_in_callback). Fix both by creating the host node first (while resolved/wrapper is still a valid userData), then handing ownership to the ctx resource, then nulling the pointer once at the end to disarm SCOPE(fail). On a throw before that point the resource has not been added, so SCOPE(fail) is the sole owner and frees the args exactly once.

andralex · 2026-06-04T04:12:36Z

/ok to test 39c1cab

coderabbitai

♻️ Duplicate comments (1)

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh (1)

244-256: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: The cuda_try timing calls at Lines 253-255 can throw after t.start() (Line 244) but before the SCOPE(exit) at Line 259 is installed. On throw here, t.end_uncleared()/t.clear() never run and the task is left half-open. Install the unwind guard (or a separate non-throwing prelude guard) before the first throwing CUDA call.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c3ca247-ca00-42ba-9ae2-6259d93466e2

📥 Commits

Reviewing files that changed from the base of the PR and between d2105b3 and 39c1cab.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh

Replace the raw new + SCOPE(fail){delete} + manual nulling design for the host_launch callback arguments with std::unique_ptr. The args are borrowed via .get() for the host node userData / cudaLaunchHostFunc argument and the ctx resource, and ownership is handed off with .release() once the node has been created (graph) or the launch has been enqueued (stream). On a throw before that point the unique_ptr frees the args; afterwards the ctx resource (graph) or the callback (stream) owns and frees them. Adds <memory>.

andralex · 2026-06-04T04:23:06Z

/ok to test d679c23

launch_impl allocates a temporary device buffer (cudaMallocAsync) and freed it after cuda_launcher returned. Now that cuda_launcher throws on error (via cuda_try<cudaLaunchKernelExC>), the trailing cudaFreeAsync was skipped on a throw, leaking the buffer. Free it from a SCOPE(exit) placed right after the allocation so it runs on both normal and exceptional exit. cuda_safe_call is used inside the noexcept SCOPE(exit) body.

andralex · 2026-06-04T04:27:48Z

/ok to test 70f0713

coderabbitai

🧹 Nitpick comments (1)

cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

127-135: ⚡ Quick win

important: qualify the new cleanup calls from the global namespace.

cuda_safe_call and cudaFreeAsync are added unqualified here. This repo requires free-function calls to start at ::, including calls in the same namespace, so this should be ::cuda::experimental::stf::cuda_safe_call(::cudaFreeAsync(...)).

As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div."

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9f47e5e-a718-4233-ad38-ce8ace3e72de

📥 Commits

Reviewing files that changed from the base of the PR and between d679c23 and 70f0713.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/internal/launch.cuh

The cudaGetDevice call in the timing branch was unchecked. Use the templated cuda_try<cudaGetDevice> form so a failure is reported.

andralex · 2026-06-04T04:35:54Z

/ok to test 92c2260

Match launch.cuh and satisfy GCC -Wmaybe-uninitialized: if cuda_try throws before both events are created, SCOPE(exit) still runs with record_time set.

andralex · 2026-06-04T22:25:12Z

/ok to test 14834d8

andralex · 2026-06-04T22:25:49Z

/ok to test 14834d8

andralex · 2026-06-04T22:27:56Z

/ok to test a19b4b9

…-internal-launch

After the cuda_safe_call -> cuda_try migration, the timing-event setup (cudaGetDevice / cudaEventCreateWithFlags / cudaEventRecord) can throw, but in launch.cuh and host_launch_scope.cuh it ran *before* the SCOPE(exit) that calls t.end_uncleared()/t.clear(). A throw there left the task half-open and leaked any partially-created event. Install the SCOPE(exit) guard first, then do the timing setup, and gate the end-of-scope timing teardown on a new timing_active flag that is only set once both events exist and the start event has been recorded. On a throw from the timing cuda_try calls the guard now runs the task cleanup, and the teardown is skipped because timing_active is still false. No success-path behavior change. Addresses CodeRabbit review comments on PR NVIDIA#9249.

andralex · 2026-06-05T01:46:07Z

/ok to test 7a9d75f

andralex · 2026-06-05T13:07:35Z

/ok to test b29368e

github-actions · 2026-06-05T13:44:21Z

😬 CI Workflow Results

🟥 Finished in 35m 04s: Pass: 96%/55 | Total: 8h 05m | Max: 34m 55s | Hits: 70%/46146

See results here.

andralex requested a review from a team as a code owner June 4, 2026 03:40

andralex requested a review from caugonnet June 4, 2026 03:40

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh Outdated

Comment thread cudax/include/cuda/experimental/__stf/internal/launch.cuh Outdated

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

cudax/stf: check cudaGetDevice in launch timing path

92c2260

The cudaGetDevice call in the timing branch was unchecked. Use the templated cuda_try<cudaGetDevice> form so a failure is reported.

This comment has been minimized.

Sign in to view

[STF] Initialize host_launch_scope timing events to nullptr

14834d8

Match launch.cuh and satisfy GCC -Wmaybe-uninitialized: if cuda_try throws before both events are created, SCOPE(exit) still runs with record_time set.

Merge branch 'main' into andralex/stf-cuda-try-internal-launch

a19b4b9

andralex enabled auto-merge (squash) June 4, 2026 22:31

caugonnet approved these changes Jun 4, 2026

View reviewed changes

andralex mentioned this pull request Jun 4, 2026

cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try #9265

Open

2 tasks

This comment has been minimized.

Sign in to view

andralex added 2 commits June 4, 2026 21:20

Merge remote-tracking branch 'origin/main' into andralex/stf-cuda-try…

33ffac5

…-internal-launch

This comment has been minimized.

Sign in to view

Merge branch 'main' into andralex/stf-cuda-try-internal-launch

b29368e

Conversation

andralex commented Jun 4, 2026

Summary

Changes

Templated cuda_try<F> conversions

Kept in runtime-status form (overload sets — cuda_try<F> can't name them)

Event-timing: end calls stay cuda_safe_call

Host-callback enqueue leak guard

Notes

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 4, 2026

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 4, 2026

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 5, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 35m 04s: Pass: 96%/55 | Total: 8h 05m | Max: 34m 55s | Hits: 70%/46146

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Templated `cuda_try<F>` conversions

Kept in runtime-status form (overload sets — `cuda_try<F>` can't name them)

Event-timing: end calls stay `cuda_safe_call`

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading