Skip to content

cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try#9249

Open
andralex wants to merge 11 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-launch
Open

cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try#9249
andralex wants to merge 11 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-launch

Conversation

@andralex
Copy link
Copy Markdown
Contributor

@andralex andralex commented Jun 4, 2026

Summary

Third slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration (complexity-ordered sequence). Covers the kernel/host launch scopes, which share an event-timing pattern.

Follow-up to #9241 (misc) and #9248 (context/resources). Companion to #9150 (utility), #9165 (stackable, merged).

Changes

Templated cuda_try<F> conversions

cudaLaunchKernelExC, cudaGraphAddKernelNode (out-param assigned through the cudaGraphNode_t&), cudaGraphKernelNodeSetAttribute, cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param), cudaLaunchHostFunc.

Kept in runtime-status form (overload sets — cuda_try<F> can't name them)

  • cudaEventCreatecuda_runtime.h adds a flags overload.
  • cudaMallocAsync — templated wrapper.

Event-timing: end calls stay cuda_safe_call

The end cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime run inside the enclosing noexcept SCOPE(exit) body in both files. They keep cuda_safe_call on purpose — a CUDA error there must abort, not throw through the guard (which would std::terminate). The start create/record calls (normal flow) are converted.

Host-callback enqueue leak guard

The two stream-path cudaLaunchHostFunc enqueues now have a SCOPE(fail) { delete resolved / wrapper; }. The host callback only takes ownership of the heap args once the enqueue succeeds, so converting cuda_safe_call -> cuda_try would otherwise leak them on a throw. The graph-path cudaGraphAddHostNode calls don't need this — their args are owned by a ctx resource added before the node is created.

Notes

  • Pre-existing, left as-is: the timing events created in these paths are never cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).
  • Overload-set determination was made against the CTK 13.2 headers.

Test plan

  • CI green on the cudax matrix entries that build these headers / STF tests
  • No success-path behavior change; new behavior is throw-vs-abort plus rollback/cleanup on throw

Third internal/ slice, covering the kernel/host launch scopes and their
shared event-timing pattern.

- Convert eligible calls to the templated cuda_try<F> form: cudaLaunchKernelExC,
  cudaGraphAddKernelNode (out-param -> ref), cudaGraphKernelNodeSetAttribute,
  cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param -> ref),
  cudaLaunchHostFunc.
- cudaEventCreate and cudaMallocAsync stay in the runtime-status form: both are
  overload sets (cuda_runtime.h flags overload / templated wrapper), so
  cuda_try<F> cannot name them.
- Event timing's end record/synchronize/elapsed run inside the noexcept
  SCOPE(exit) body, so they keep cuda_safe_call: a CUDA error there should
  abort rather than throw through the guard (which would std::terminate).
- The two stream-path cudaLaunchHostFunc enqueues now get a SCOPE(fail) that
  deletes the heap callback args (resolved / wrapper) if the enqueue throws --
  the callback only takes ownership once the enqueue succeeds, so this closes
  the leak the new throw path would otherwise introduce. The graph-path host
  nodes are already covered because their args are owned by a ctx resource added
  before the node is created.

Pre-existing and left as-is: the timing events created here are never
cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).
@andralex andralex requested a review from a team as a code owner June 4, 2026 03:40
@andralex andralex requested a review from caugonnet June 4, 2026 03:40
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 1c9ba06

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

suggestion:

Walkthrough

CUDA runtime calls in host and kernel launch scopes are migrated from cuda_safe_call to cuda_try; host-callback enqueue paths switch heap payloads to std::unique_ptr with SCOPE(fail) cleanup and transfer ownership into ctx resources for graph mode; timing end-of-scope still uses cuda_safe_call inside noexcept guards.

Changes

Error handling modernization

Layer / File(s) Summary
Kernel launch and graph operations
cudax/include/cuda/experimental/__stf/internal/launch.cuh
cudaLaunchKernelExC, cudaGraphAddKernelNode, and cudaGraphKernelNodeSetAttribute now use cuda_try instead of cuda_safe_call. (ranges: range_02a886e45cfd, range_461e159f15d6)
Device async memory & per-run timing
cudax/include/cuda/experimental/__stf/internal/launch.cuh
Async temporary device allocation/free and per-run CUDA event create/record use cuda_try; end-of-scope recording/sync remains cuda_safe_call inside a noexcept SCOPE(exit). (ranges: range_030de661f719, range_411a619a2aed, range_b6564207738b)
Host callback enqueue with timing
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
Added #include <memory>; stream timing uses cuda_try<cudaEventCreateWithFlags> and cuda_try<cudaEventRecord>; untyped (resolved) and typed (wrapper) payloads use std::unique_ptr and cuda_try for cudaGraphAddHostNode (graph) or cudaLaunchHostFunc (stream); ownership transferred to host_callback_args_resource for graph nodes or deleted by the callback on non-graph paths; successful enqueue/node-add calls release() to disarm automatic cleanup. (ranges: range_a8d50701700e, range_b2990e8d97ae, range_c907250cf74b, range_e8d7639d7c19, range_c53d61b3c14d, range_b85ac2e944ff, range_4652f2ff94c1)

suggested labels

stf

suggested reviewers

  • caugonnet
  • oleksandr-pavlyk

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

123-135: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: After cudaMallocAsync succeeds, any exception from the launch path unwinds past Line 135 and never releases th_dev_tmp_ptr. That turns failed launches into a device-memory leak. Add a fail guard immediately after allocation and keep its cleanup non-throwing.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 34a9ea1e-214e-42a6-a774-2b2ebd7965aa

📥 Commits

Reviewing files that changed from the base of the PR and between 6383b91 and 1c9ba06.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
  • cudax/include/cuda/experimental/__stf/internal/launch.cuh

Comment thread cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh Outdated
Comment thread cudax/include/cuda/experimental/__stf/internal/launch.cuh Outdated
…eation

cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload),
so it cannot be named by the templated cuda_try<F> form. Use the
non-overloaded cudaEventCreateWithFlags with cudaEventDefault instead,
which is exactly what cudaEventCreate(&e) does internally, so behavior is
unchanged while keeping the templated form.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test d2105b3

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

366-375: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Install the task cleanup guard before the first throwing CUDA call. Line 359 calls t.start() before any cleanup guard exists, and Line 389 is the first guard. If any cuda_try at Lines 373-375 throws, t.end_uncleared() and t.clear() are skipped; if Line 375 fails after the event creations, those event handles leak too. Move the existing guard above this block, or add an equivalent pre-guard cleanup path. As per coding guidelines, cudax/**/*: Focus on correctness, lifetime/resource ownership, stream ordering, host/device annotations, experimental API clarity, tests, and compatibility with the supported CUDA toolchains.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dacb43fe-0d9e-46e8-9502-66e98d338834

📥 Commits

Reviewing files that changed from the base of the PR and between 1c9ba06 and d2105b3.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
  • cudax/include/cuda/experimental/__stf/internal/launch.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh

The host_launch callback args (resolved / wrapper) are heap-allocated and
guarded by SCOPE(fail) { delete ...; }. Transfer of ownership to the
graph-path ctx resource was happening in the wrong order:

- Untyped path: `resolved` was set to nullptr right after add_resource,
  but it is also used as the host node's userData. That made the graph
  node receive a null userData, so the callback dereferenced null on the
  success path.
- Typed path: add_resource ran before cudaGraphAddHostNode, so a throw
  from the node creation would delete `wrapper` twice (SCOPE(fail) plus
  the resource's release_in_callback).

Fix both by creating the host node first (while resolved/wrapper is still
a valid userData), then handing ownership to the ctx resource, then
nulling the pointer once at the end to disarm SCOPE(fail). On a throw
before that point the resource has not been added, so SCOPE(fail) is the
sole owner and frees the args exactly once.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 39c1cab

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh (1)

244-256: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: The cuda_try timing calls at Lines 253-255 can throw after t.start() (Line 244) but before the SCOPE(exit) at Line 259 is installed. On throw here, t.end_uncleared()/t.clear() never run and the task is left half-open. Install the unwind guard (or a separate non-throwing prelude guard) before the first throwing CUDA call.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c3ca247-ca00-42ba-9ae2-6259d93466e2

📥 Commits

Reviewing files that changed from the base of the PR and between d2105b3 and 39c1cab.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh

Replace the raw new + SCOPE(fail){delete} + manual nulling design for the
host_launch callback arguments with std::unique_ptr. The args are borrowed
via .get() for the host node userData / cudaLaunchHostFunc argument and the
ctx resource, and ownership is handed off with .release() once the node has
been created (graph) or the launch has been enqueued (stream). On a throw
before that point the unique_ptr frees the args; afterwards the ctx resource
(graph) or the callback (stream) owns and frees them. Adds <memory>.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test d679c23

launch_impl allocates a temporary device buffer (cudaMallocAsync) and freed
it after cuda_launcher returned. Now that cuda_launcher throws on error (via
cuda_try<cudaLaunchKernelExC>), the trailing cudaFreeAsync was skipped on a
throw, leaking the buffer. Free it from a SCOPE(exit) placed right after the
allocation so it runs on both normal and exceptional exit. cuda_safe_call is
used inside the noexcept SCOPE(exit) body.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 70f0713

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)

127-135: ⚡ Quick win

important: qualify the new cleanup calls from the global namespace.

cuda_safe_call and cudaFreeAsync are added unqualified here. This repo requires free-function calls to start at ::, including calls in the same namespace, so this should be ::cuda::experimental::stf::cuda_safe_call(::cudaFreeAsync(...)).

As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9f47e5e-a718-4233-ad38-ce8ace3e72de

📥 Commits

Reviewing files that changed from the base of the PR and between d679c23 and 70f0713.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/internal/launch.cuh

The cudaGetDevice call in the timing branch was unchecked. Use the
templated cuda_try<cudaGetDevice> form so a failure is reported.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 92c2260

@github-actions

This comment has been minimized.

Match launch.cuh and satisfy GCC -Wmaybe-uninitialized: if cuda_try throws
before both events are created, SCOPE(exit) still runs with record_time set.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 14834d8

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 14834d8

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test a19b4b9

@andralex andralex enabled auto-merge (squash) June 4, 2026 22:31
@github-actions

This comment has been minimized.

andralex added 2 commits June 4, 2026 21:20
After the cuda_safe_call -> cuda_try migration, the timing-event setup
(cudaGetDevice / cudaEventCreateWithFlags / cudaEventRecord) can throw, but
in launch.cuh and host_launch_scope.cuh it ran *before* the SCOPE(exit) that
calls t.end_uncleared()/t.clear(). A throw there left the task half-open and
leaked any partially-created event.

Install the SCOPE(exit) guard first, then do the timing setup, and gate the
end-of-scope timing teardown on a new timing_active flag that is only set once
both events exist and the start event has been recorded. On a throw from the
timing cuda_try calls the guard now runs the task cleanup, and the teardown is
skipped because timing_active is still false. No success-path behavior change.

Addresses CodeRabbit review comments on PR NVIDIA#9249.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test 7a9d75f

@github-actions

This comment has been minimized.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test b29368e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 35m 04s: Pass: 96%/55 | Total: 8h 05m | Max: 34m 55s | Hits: 70%/46146

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants