cudax/stf: migrate internal/ misc files from cuda_safe_call to cuda_try#9241
cudax/stf: migrate internal/ misc files from cuda_safe_call to cuda_try#9241andralex wants to merge 8 commits into
Conversation
First (simplest) slice of the internal/ cuda_safe_call -> cuda_try
migration. Covers six low-complexity leaf files whose CUDA calls are
standalone queries, device syncs, memcpies, or single graph-node adds
(no transactional cleanup or noexcept-dispatch concerns):
- dot.cuh cudaGetDevice -> templated cuda_try
- scheduler.cuh cudaGetDeviceCount -> templated cuda_try
- executable_graph_cache.cuh cudaGetDeviceCount -> templated cuda_try
- thread_hierarchy.cuh cudaDeviceSynchronize (x3),
cudaLaunchCooperativeKernel -> cuda_try
- algorithm.cuh cudaGraphAddChildGraphNode,
cudaGraphLaunch (x2) -> cuda_try
- scalar_interface.cuh cudaMemcpyAsync,
cudaGraphAddMemcpyNode -> cuda_try
Where a call has a single synthesizable output parameter and the result
is used, the templated cuda_try<F>(args...) form is used so the result
is a const-initialized local (cudaGetDevice, cudaGetDeviceCount,
cudaGraphAddMemcpyNode). cudaGraphAddChildGraphNode keeps the
runtime-status form because its node handle is intentionally discarded
(a templated form would trip -Wunused-variable).
thread_hierarchy.cuh's calls live in UNITTEST bodies, but they are plain
sequential test code (not runtime-dispatched host-task lambdas), so a
thrown cuda_try simply fails the test; converting keeps the migration
consistent.
slice.cuh is intentionally left for a later slice: its pin_memory calls
run in 2D/3D loops where a mid-loop throw would leak already-pinned
regions, so it needs unpin-on-throw rollback rather than a mechanical
conversion.
|
/ok to test 3e0972d |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
important: Walkthrough This PR replaces many cuda_safe_call usages with cuda_try (returning/template forms where applicable), moves adapter cleanup into SCOPE(exit), and updates a stream_adapter move-constructor to move members. Changes Error Handling Refactor
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/scheduler.cuh (1)
56-56: ⚡ Quick winsuggestion: Make
num_devicesconst by moving initialization to default member initializer.The member variable
num_devicesis assigned fromcuda_try<cudaGetDeviceCount>()but never modified after construction. Per the skill file requirement, "any variable not modified must beconst(including values assigned from function returns likecuda_try<...>())".Refactor to inline initialization at the declaration (line 74) and remove the assignment from the constructor body. This matches the pattern in
reorderer.cuhand enforces immutability.Refactor to const with inline initialization
scheduler() { - num_devices = cuda_try<cudaGetDeviceCount>(); assert(num_devices > 0); } ... protected: - int num_devices = 0; + const int num_devices = cuda_try<cudaGetDeviceCount>();Based on learnings: the skill file explicitly requires const for variables assigned from
cuda_try<...>()that are not subsequently modified.Also applies to: 74-74
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c9c1d6e9-979e-4bc6-9c44-b3b8c057e643
📒 Files selected for processing (6)
cudax/include/cuda/experimental/__stf/internal/algorithm.cuhcudax/include/cuda/experimental/__stf/internal/dot.cuhcudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuhcudax/include/cuda/experimental/__stf/internal/scalar_interface.cuhcudax/include/cuda/experimental/__stf/internal/scheduler.cuhcudax/include/cuda/experimental/__stf/internal/thread_hierarchy.cuh
This comment has been minimized.
This comment has been minimized.
Follow-up to the internal/ misc migration:
- algorithm.cuh / scalar_interface.cuh: switch eligible cuda_try calls to
the templated cuda_try<F>(args...) form. For the discarded child-graph
node, use ::std::ignore = cuda_try<cudaGraphAddChildGraphNode>(...).
cudaGraphLaunch and cudaMemcpyAsync become cuda_try<...> as well.
cudaGraphInstantiateWithFlags stays runtime-status (its output is written
into caller-owned shared_ptr storage).
- algorithm.cuh: in run_dynamic, release the stream adapter via
SCOPE(exit) { adapter.clear(); } right after acquisition, matching run(),
so the adapter is cleared even if graph instantiation/launch throws.
- adapters.cuh: move (not copy) adapter_state and alloc in the
stream_adapter move constructor.
|
/ok to test daab287 |
This comment has been minimized.
This comment has been minimized.
…rnal/
- algorithm.cuh / executable_graph_cache.cuh: use the templated
cuda_try<cudaGraphInstantiateWithFlags>(...) form, assigning the
instantiated handle into *eg / *res. The handle is a trivially-assignable
pointer, so capture-then-assign works even though the storage is owned by
the surrounding shared_ptr.
- Restore cuda_try<cudaGraphLaunch>(*eg, stream) (direct form compiles
fine: no output param, properly typed args, non-nullary).
- Construct the cudaGraphExec_t shared_ptr via the braced form with an
inline deleter, dropping the redundant named deleter lambdas.
- Value-initialize the handle (new cudaGraphExec_t{}) so a throw from
instantiation cannot leave the shared_ptr deleter calling
cudaGraphExecDestroy on an indeterminate handle.
- scheduler.cuh: initialize num_devices in the constructor member-init
list instead of assigning in the body.
|
/ok to test bebaa9b |
This comment has been minimized.
This comment has been minimized.
cudaMemcpyAsync returns cudaError_t with no synthesizable output, so the templated cuda_try<F> form does not apply. Use the runtime-status form instead.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh (1)
134-134: ⚡ Quick winsuggestion: These replacements introduce direct uses of
cuda_try,cudaMemcpyAsync, andcudaGraphAddMemcpyNode, so they should be qualified from::and this header should include thecuda_trydefinition directly instead of relying on a transitive include. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace" and "Files must include all headers related to the symbols that they are using. No transitive header inclusions are allowed."Also applies to: 270-270
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2a2134c7-569a-4e92-a3ed-89446692014f
📒 Files selected for processing (1)
cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh
|
/ok to test 077b9f4 |
This comment has been minimized.
This comment has been minimized.
Harden previously-bare CUDA calls that ran unchecked: - the cudaGraphExecDestroy calls in the shared_ptr deleters (algorithm.cuh, executable_graph_cache.cuh) now use cuda_safe_call. Deleters run in a noexcept context, so cuda_safe_call (abort on error) is correct -- cuda_try must not be used there. - the debug-only cudaGraphDebugDotPrint in dump_algorithm now uses cuda_safe_call. The cudaGraphExecUpdate probe stays as-is: it intentionally captures the status via cudaGetLastError and returns it as a bool.
|
/ok to test 3ef273b |
This comment has been minimized.
This comment has been minimized.
|
/ok to test ff3e69f |
This comment has been minimized.
This comment has been minimized.
|
/ok to test ca417d3 |
😬 CI Workflow Results🟥 Finished in 35m 25s: Pass: 96%/55 | Total: 10h 30m | Max: 35m 22s | Hits: 50%/67038See results here. |
|
/ok to test ca417d3 |
Summary
First (simplest) slice of the
cudax/include/cuda/experimental/__stf/internal/cuda_safe_call->cuda_trymigration. The fullinternal/directory has ~90 sites across 16 files; it is being landed as a sequence of focused PRs in increasing order of complexity. This PR covers six low-complexity leaf files whose CUDA calls are standalone queries, device syncs, memcpies, or single graph-node adds — no transactional cleanup or noexcept-dispatch concerns.Companion to PRs #9146 (allocators), #9150 (utility), #9165 (stackable).
Changes (6 files, 12 sites)
dot.cuhcudaGetDevicecuda_try<F>(result used)scheduler.cuhcudaGetDeviceCountcuda_try<F>executable_graph_cache.cuhcudaGetDeviceCountcuda_try<F>thread_hierarchy.cuhcudaDeviceSynchronize(x3),cudaLaunchCooperativeKernelcuda_tryalgorithm.cuhcudaGraphAddChildGraphNode,cudaGraphLaunch(x2)cuda_tryscalar_interface.cuhcudaMemcpyAsync,cudaGraphAddMemcpyNodeNotes
out = cuda_try<F>(args...)) is used where there is a single synthesizable output parameter whose result is used, giving a const-initialized local. All converted signatures were checked against CTK 13.2 to confirm they are unambiguous forcuda_try's first/last-output selection.cudaGraphAddChildGraphNodekeeps the runtime-status form: its node handle is intentionally discarded, and a templated form would trip-Wunused-variable.thread_hierarchy.cuhcalls are inUNITTESTbodies but are plain sequential test code (not runtime-dispatched host-task lambdas), so a throwncuda_tryjust fails the test. Converted for consistency.Intentionally deferred
slice.cuh(7pin_memorycalls): these run in 2D/3D loops where a mid-loop throw would leak already-pinned regions. It needs unpin-on-throw rollback rather than a mechanical conversion, so it is held for a later (more complex) slice.execution_policy.cuh/hashtable_linearprobing.cuhonly#includethe header; no calls to convert.Test plan