cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try#9249
cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try#9249andralex wants to merge 11 commits into
Conversation
Third internal/ slice, covering the kernel/host launch scopes and their shared event-timing pattern. - Convert eligible calls to the templated cuda_try<F> form: cudaLaunchKernelExC, cudaGraphAddKernelNode (out-param -> ref), cudaGraphKernelNodeSetAttribute, cudaFreeAsync, cudaEventRecord (start), cudaGraphAddHostNode (out-param -> ref), cudaLaunchHostFunc. - cudaEventCreate and cudaMallocAsync stay in the runtime-status form: both are overload sets (cuda_runtime.h flags overload / templated wrapper), so cuda_try<F> cannot name them. - Event timing's end record/synchronize/elapsed run inside the noexcept SCOPE(exit) body, so they keep cuda_safe_call: a CUDA error there should abort rather than throw through the guard (which would std::terminate). - The two stream-path cudaLaunchHostFunc enqueues now get a SCOPE(fail) that deletes the heap callback args (resolved / wrapper) if the enqueue throws -- the callback only takes ownership once the enqueue succeeds, so this closes the leak the new throw path would otherwise introduce. The graph-path host nodes are already covered because their args are owned by a ctx resource added before the node is created. Pre-existing and left as-is: the timing events created here are never cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).
|
/ok to test 1c9ba06 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
suggestion: WalkthroughCUDA runtime calls in host and kernel launch scopes are migrated from ChangesError handling modernization
suggested labels
suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)
123-135:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: After
cudaMallocAsyncsucceeds, any exception from the launch path unwinds past Line 135 and never releasesth_dev_tmp_ptr. That turns failed launches into a device-memory leak. Add a fail guard immediately after allocation and keep its cleanup non-throwing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 34a9ea1e-214e-42a6-a774-2b2ebd7965aa
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuhcudax/include/cuda/experimental/__stf/internal/launch.cuh
…eation cudaEventCreate is an overload set (cuda_runtime.h adds a flags overload), so it cannot be named by the templated cuda_try<F> form. Use the non-overloaded cudaEventCreateWithFlags with cudaEventDefault instead, which is exactly what cudaEventCreate(&e) does internally, so behavior is unchanged while keeping the templated form.
|
/ok to test d2105b3 |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)
366-375:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: Install the task cleanup guard before the first throwing CUDA call. Line 359 calls
t.start()before any cleanup guard exists, and Line 389 is the first guard. If anycuda_tryat Lines 373-375 throws,t.end_uncleared()andt.clear()are skipped; if Line 375 fails after the event creations, those event handles leak too. Move the existing guard above this block, or add an equivalent pre-guard cleanup path. As per coding guidelines,cudax/**/*: Focus on correctness, lifetime/resource ownership, stream ordering, host/device annotations, experimental API clarity, tests, and compatibility with the supported CUDA toolchains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dacb43fe-0d9e-46e8-9502-66e98d338834
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuhcudax/include/cuda/experimental/__stf/internal/launch.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
- cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
The host_launch callback args (resolved / wrapper) are heap-allocated and
guarded by SCOPE(fail) { delete ...; }. Transfer of ownership to the
graph-path ctx resource was happening in the wrong order:
- Untyped path: `resolved` was set to nullptr right after add_resource,
but it is also used as the host node's userData. That made the graph
node receive a null userData, so the callback dereferenced null on the
success path.
- Typed path: add_resource ran before cudaGraphAddHostNode, so a throw
from the node creation would delete `wrapper` twice (SCOPE(fail) plus
the resource's release_in_callback).
Fix both by creating the host node first (while resolved/wrapper is still
a valid userData), then handing ownership to the ctx resource, then
nulling the pointer once at the end to disarm SCOPE(fail). On a throw
before that point the resource has not been added, so SCOPE(fail) is the
sole owner and frees the args exactly once.
|
/ok to test 39c1cab |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh (1)
244-256:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: The
cuda_trytiming calls at Lines 253-255 can throw aftert.start()(Line 244) but before theSCOPE(exit)at Line 259 is installed. On throw here,t.end_uncleared()/t.clear()never run and the task is left half-open. Install the unwind guard (or a separate non-throwing prelude guard) before the first throwing CUDA call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4c3ca247-ca00-42ba-9ae2-6259d93466e2
📒 Files selected for processing (1)
cudax/include/cuda/experimental/__stf/internal/host_launch_scope.cuh
Replace the raw new + SCOPE(fail){delete} + manual nulling design for the
host_launch callback arguments with std::unique_ptr. The args are borrowed
via .get() for the host node userData / cudaLaunchHostFunc argument and the
ctx resource, and ownership is handed off with .release() once the node has
been created (graph) or the launch has been enqueued (stream). On a throw
before that point the unique_ptr frees the args; afterwards the ctx resource
(graph) or the callback (stream) owns and frees them. Adds <memory>.
|
/ok to test d679c23 |
launch_impl allocates a temporary device buffer (cudaMallocAsync) and freed it after cuda_launcher returned. Now that cuda_launcher throws on error (via cuda_try<cudaLaunchKernelExC>), the trailing cudaFreeAsync was skipped on a throw, leaking the buffer. Free it from a SCOPE(exit) placed right after the allocation so it runs on both normal and exceptional exit. cuda_safe_call is used inside the noexcept SCOPE(exit) body.
|
/ok to test 70f0713 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh (1)
127-135: ⚡ Quick winimportant: qualify the new cleanup calls from the global namespace.
cuda_safe_callandcudaFreeAsyncare added unqualified here. This repo requires free-function calls to start at::, including calls in the same namespace, so this should be::cuda::experimental::stf::cuda_safe_call(::cudaFreeAsync(...)).As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g.,
::cuda::ceil_div."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b9f47e5e-a718-4233-ad38-ce8ace3e72de
📒 Files selected for processing (1)
cudax/include/cuda/experimental/__stf/internal/launch.cuh
The cudaGetDevice call in the timing branch was unchecked. Use the templated cuda_try<cudaGetDevice> form so a failure is reported.
|
/ok to test 92c2260 |
This comment has been minimized.
This comment has been minimized.
Match launch.cuh and satisfy GCC -Wmaybe-uninitialized: if cuda_try throws before both events are created, SCOPE(exit) still runs with record_time set.
|
/ok to test 14834d8 |
|
/ok to test 14834d8 |
|
/ok to test a19b4b9 |
This comment has been minimized.
This comment has been minimized.
After the cuda_safe_call -> cuda_try migration, the timing-event setup (cudaGetDevice / cudaEventCreateWithFlags / cudaEventRecord) can throw, but in launch.cuh and host_launch_scope.cuh it ran *before* the SCOPE(exit) that calls t.end_uncleared()/t.clear(). A throw there left the task half-open and leaked any partially-created event. Install the SCOPE(exit) guard first, then do the timing setup, and gate the end-of-scope timing teardown on a new timing_active flag that is only set once both events exist and the start event has been recorded. On a throw from the timing cuda_try calls the guard now runs the task cleanup, and the teardown is skipped because timing_active is still false. No success-path behavior change. Addresses CodeRabbit review comments on PR NVIDIA#9249.
|
/ok to test 7a9d75f |
This comment has been minimized.
This comment has been minimized.
|
/ok to test b29368e |
😬 CI Workflow Results🟥 Finished in 35m 04s: Pass: 96%/55 | Total: 8h 05m | Max: 34m 55s | Hits: 70%/46146See results here. |
Summary
Third slice of the
cudax/include/cuda/experimental/__stf/internal/cuda_safe_call->cuda_trymigration (complexity-ordered sequence). Covers the kernel/host launch scopes, which share an event-timing pattern.Follow-up to #9241 (misc) and #9248 (context/resources). Companion to #9150 (utility), #9165 (stackable, merged).
Changes
Templated
cuda_try<F>conversionscudaLaunchKernelExC,cudaGraphAddKernelNode(out-param assigned through thecudaGraphNode_t&),cudaGraphKernelNodeSetAttribute,cudaFreeAsync,cudaEventRecord(start),cudaGraphAddHostNode(out-param),cudaLaunchHostFunc.Kept in runtime-status form (overload sets —
cuda_try<F>can't name them)cudaEventCreate—cuda_runtime.hadds a flags overload.cudaMallocAsync— templated wrapper.Event-timing: end calls stay
cuda_safe_callThe end
cudaEventRecord/cudaEventSynchronize/cudaEventElapsedTimerun inside the enclosing noexceptSCOPE(exit)body in both files. They keepcuda_safe_callon purpose — a CUDA error there must abort, not throw through the guard (which wouldstd::terminate). The start create/record calls (normal flow) are converted.Host-callback enqueue leak guard
The two stream-path
cudaLaunchHostFuncenqueues now have aSCOPE(fail) { delete resolved / wrapper; }. The host callback only takes ownership of the heap args once the enqueue succeeds, so convertingcuda_safe_call->cuda_trywould otherwise leak them on a throw. The graph-pathcudaGraphAddHostNodecalls don't need this — their args are owned by actxresource added before the node is created.Notes
cudaEventDestroy'd (a leak in the calibration path, unrelated to this change).Test plan