cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try#9265
cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try#9265andralex wants to merge 6 commits into
Conversation
…a_try Final internal/ slice (the most complex): kernel launches, graph kernel/host nodes, occupancy queries, event timing, and driver-API calls. Templated cuda_try<F> conversions: - cuLaunchKernel, cuGraphAddKernelNode (driver API; cuda_try handles CUresult), - cudaGraphAddKernelNode / cudaGraphAddHostNode (out-param via the node ref), - cudaGraphAddDependencies, cudaEventRecord (start), cudaGetDevice, - cudaEventCreate -> cuda_try<cudaEventCreateWithFlags>(cudaEventDefault). Kept in runtime-status cuda_try form (overload sets / multiple outputs): - cudaLaunchKernel and cudaFuncGetAttributes (cuda_runtime.h templated wrappers), - cudaOccupancyMaxPotentialBlockSizeVariableSMem (two output params + templated). Kept as cuda_safe_call (event-timing teardown): the end record/synchronize/ elapsed calls. In parallel_for_scope they run inside the noexcept SCOPE(exit); cuda_kernel_scope mirrors that for consistency. Other correctness: - Initialize the timing events to nullptr (the create calls are now assignments, which would otherwise trip GCC -Werror=maybe-uninitialized). - parallel_for host-callback args: the stream path now deletes them via SCOPE(fail) if cudaLaunchHostFunc throws (the callback only takes ownership on success); the graph path already hands ownership to a ctx resource before the node is created.
|
/ok to test 80e0546 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
SummaryThis PR finishes the migration of CUDAX's internal STF scope implementations from cuda_safe_call to cuda_try in cudax/include/cuda/experimental/__stf/internal, targeting parallel_for_scope.cuh and cuda_kernel_scope.cuh. It converts many kernel-launch, graph, occupancy, event, and driver-API error paths from abort-on-error semantics to throw-on-error via cuda_try while preserving abort semantics in a few noexcept teardown scopes. Several correctness and cleanup fixes are included. Changescuda_kernel_scope.cuh
parallel_for_scope.cuh
Error-handling strategy
Additional fixes and notes
Testing
suggestion: WalkthroughReplace many cuda_safe_call(...) wrappers with cuda_try(...) across kernel launch, graph node creation, event timing, and occupancy queries; initialize device/event members to safe defaults; add cleanup guard around host-callback enqueue; assign graph node/host-node handles directly from cuda_try returns. ChangesError Handling Standardization in Kernel Launch and Parallel-For Scopes
suggestion: Possibly related PRs
suggestion: Suggested labels suggestion: Suggested reviewers
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh (1)
581-587:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: SCOPE(exit) will abort if event creation throws.
If any
cuda_trycall at lines 608-614 throws,SCOPE(exit)runs withrecord_time=truebutstart_event/end_eventstill nullptr. Line 583 then callscudaEventRecord(nullptr, ...), which fails and triggerscuda_safe_callto abort.Guard the timing cleanup with a null check:
if (record_time) { + if (start_event == nullptr || end_event == nullptr) + { + // Event creation failed; skip timing teardown + } + else + { cuda_safe_call(cudaEventRecord(end_event, t.get_stream())); cuda_safe_call(cudaEventSynchronize(end_event)); float milliseconds = 0; cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event)); // ... rest of timing code ... + } }Or set
record_timeonly after successful event creation.Also applies to: 608-614
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f5f64e32-9d53-4f4f-924b-b4d14ad08c0e
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuhcudax/include/cuda/experimental/__stf/internal/parallel_for_scope.cuh
This comment has been minimized.
This comment has been minimized.
Converting cuda_safe_call(cuLaunchKernel(...)) to the templated cuda_try<cuLaunchKernel>(...) form dropped one nesting level, but the closing `nullptr));` kept both parens, leaving an extra `)` that broke compilation of every TU including cuda_kernel_scope.cuh. Drop the stray paren. Verified locally by building cudax.test.stf.cpp.scoped_graph_task.
|
/ok to test d53149e |
The end-of-timing event calls in cuda_kernel_scope::end() are single functions and end() is not a noexcept SCOPE, so switch them to the templated form: cuda_try<cudaEventRecord>, cuda_try<cudaEventSynchronize>, and `const float ms = cuda_try<cudaEventElapsedTime>(start, end)`. cudaLaunchKernel and cudaFuncGetAttributes stay in the runtime-status form: both are overload sets (C API + cuda_runtime.h templated wrapper), so a bare cuda_try<F> cannot name them. Verified locally by building cudax.test.stf.interface.cuda_kernel_empty_args and cudax.test.stf.examples.cuda_kernels_driver.
|
/ok to test 280cf56 |
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1444d149-7d61-44fd-9752-118a57e52d35
📒 Files selected for processing (1)
cudax/include/cuda/experimental/__stf/internal/cuda_kernel_scope.cuh
| cuda_try<cudaEventRecord>(end_event, t.get_stream()); | ||
| cuda_try<cudaEventSynchronize>(end_event); | ||
|
|
||
| float milliseconds = 0; | ||
| cuda_safe_call(cudaEventElapsedTime(&milliseconds, start_event, end_event)); | ||
| const float milliseconds = cuda_try<cudaEventElapsedTime>(start_event, end_event); | ||
|
|
There was a problem hiding this comment.
important: Ensure task cleanup always runs when timing calls fail.
cuda_try can throw in this block, and then t.clear() / support_task.reset() (Lines 418-423) are skipped after t.end_uncleared(). Please guard cleanup/reset with an unconditional scope-exit path so error handling here cannot leave task state uncleared.
|
/ok to test 751c24d |
In the parallel_for reduction graph path, the dependencies argument was NULL. As a templated cuda_try<F> argument, NULL is deduced as long/int rather than a null-pointer-constant, so is_invocable finds no valid form (Clang errors; GCC's __null accepts it, which is why local GCC builds missed it). Use nullptr.
|
/ok to test 1d55338 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test 6ba1241 |
😬 CI Workflow Results🟥 Finished in 34m 26s: Pass: 96%/55 | Total: 8h 12m | Max: 34m 23s | Hits: 77%/45004See results here. |
Summary
Final (and most complex) slice of the
cudax/include/cuda/experimental/__stf/internal/cuda_safe_call->cuda_trymigration:parallel_for_scope.cuh+cuda_kernel_scope.cuh. Covers kernel launches, graph kernel/host nodes, occupancy queries, event timing, and driver-API calls.Follow-up to #9241 (misc), #9248 (context/resources), #9249 (launch/host_launch). Companion to #9150 (utility), #9165 (stackable, merged).
Templated
cuda_try<F>conversionscuLaunchKernel,cuGraphAddKernelNode— driver API;cuda_tryhandlesCUresultand the file already usescuda_try<cuFuncGetAttribute>/cuda_try<cuCtxGetDevice>.cudaGraphAddKernelNode,cudaGraphAddHostNode— out-param assigned through the node reference.cudaGraphAddDependencies,cudaEventRecord(start),cudaGetDevice.cudaEventCreate->cuda_try<cudaEventCreateWithFlags>(cudaEventDefault)(semantically identical;cudaEventCreateis an overload set).Kept in runtime-status
cuda_try(...)formOverload sets / multiple outputs, where
cuda_try<F>can't name the function:cudaLaunchKernel,cudaFuncGetAttributes—cuda_runtime.htemplated wrappers.cudaOccupancyMaxPotentialBlockSizeVariableSMem— two output params + templated.Kept as
cuda_safe_call(event-timing teardown)The end
cudaEventRecord/cudaEventSynchronize/cudaEventElapsedTimecalls. Inparallel_for_scopethey run inside the noexceptSCOPE(exit);cuda_kernel_scopemirrors that for consistency (measurement cleanup — abort, don't throw).Other correctness
nullptr— the create calls are now assignments, which would otherwise trip GCC-Werror=maybe-uninitialized(same class as cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try #9249).parallel_forhost-callback args: the stream path deletes them viaSCOPE(fail)ifcudaLaunchHostFuncthrows (the callback takes ownership only on success); the graph path already hands ownership to actxresource before the node is created.Test plan