cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try#9248
cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try#9248andralex wants to merge 11 commits into
Conversation
…to cuda_try Second slice of the internal/ migration, covering the stream-lifetime / resource-handle files. Two sites get real transactional handling: - ctx_resource.cuh: hold the batched callback resource list in a unique_ptr and only release() it after cudaStreamAddCallback succeeds, so a throw from the enqueue no longer leaks the list. The enqueue uses the templated cuda_try<cudaStreamAddCallback> form. - slice.cuh: pin() is all-or-nothing, but the 2D/3D paths pin several regions in a loop. Record each pinned base pointer and roll them back via SCOPE(fail) if a later pin_memory throws, leaving the slice unpinned (matching the address_is_pinned early-return that treats the base address as a proxy for the whole slice). pin_memory stays in the runtime-status form (it is an overload-set template); unpin_memory only aborts, never throws, so it is safe inside the noexcept SCOPE(fail). Adds <vector> and scope_guard.cuh includes. The remaining conversions are mechanical: - async_resources_handle.cuh: cudaGraphGetNodes / cudaGraphGetEdges keep the runtime-status form (multiple output pointers, one passed as nullptr). - context.cuh: UNITTEST bodies switch to cuda_try, using the templated cuda_try<F> form where applicable (cudaStreamCreate, cudaSetDevice, cudaStreamSynchronize/Destroy) and ::std::ignore for the discarded cudaGraphAddEmptyNode handle.
|
/ok to test 74be442 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
suggestion: suggestion: WalkthroughSwitches internal STF CUDA calls from ChangesSTF error handling and exception safety refactoring
Possibly related PRs
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 69632b7c-d1ea-4a44-8bfd-dcf3233fc03e
📒 Files selected for processing (4)
cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuhcudax/include/cuda/experimental/__stf/internal/context.cuhcudax/include/cuda/experimental/__stf/internal/ctx_resource.cuhcudax/include/cuda/experimental/__stf/internal/slice.cuh
…dge counts cudaGraphGetNodes / cudaGraphGetEdges have multiple output pointers, but passing nullptr for the ones we don't need leaves a single synthesized output (numNodes / numEdges) as the last parameter, so cuda_try's last-output form applies: const size_t nnodes = cuda_try<cudaGraphGetNodes>(g, nullptr); const size_t nedges = cuda_try<cudaGraphGetEdges>(g, nullptr, nullptr[, nullptr]); This also lets the counts be const and drops the separate declarations.
|
/ok to test 50b86a1 |
…ryable Replace the null-then-erase pass with an in-place swap-from-back compaction: each stream-dependent resource is released and then removed by moving the last element into its slot and popping. A resource leaves `resources` only after release(stream) succeeds, so if it throws, the vector still holds the failing resource plus everything not yet processed -- release() can be retried with nothing lost and no double-release. No null slots linger, so no separate erase pass (and no <algorithm>) is needed.
|
/ok to test cfbd028 |
This comment has been minimized.
This comment has been minimized.
Replace the rank-0/1/2/3 x contiguous_dims switch matrix in both pin()
and unpin() with one recursive helper, reserved::for_each_contiguous_hunk,
that invokes f(base, n) for each maximal contiguous hunk: the leading
contiguous_dims() dims form one run, the trailing dims are enumerated
recursively. This covers every element exactly once, handles any rank
(drops the old rank <= 3 limit), and normalizes the odd address math in
the previous 2D paths (which double-counted the stride/extent).
- unpin() is now just a walk calling unpin_memory on each hunk base.
unpin_memory ignores not-registered regions, so this is safe on
fully- or partially-pinned slices.
- pin()'s rollback is simply SCOPE(fail) { unpin(s); } -- no separate
`pinned` vector, and no <vector> include. On a mid-walk failure
unpin() releases the hunks we pinned and no-ops the rest, leaving the
slice fully unpinned, consistent with the address_is_pinned() proxy.
pin_memory stays in the runtime-status cuda_try(pin_memory(...)) form
(it is an overload-set template, so cuda_try<pin_memory> cannot name it).
The helper is C++17 (recursive generic lambda) and lives in `reserved`.
|
/ok to test 1c6cd4c |
…ests In the two UNITTESTs that create a user stream, release it through a SCOPE(exit) guard right after creation instead of a trailing cudaStreamDestroy, so the stream is destroyed even if the test body throws. cuda_safe_call is used inside the noexcept SCOPE(exit) body.
|
/ok to test d37dbfe |
|
/ok to test a7b4930 |
This comment has been minimized.
This comment has been minimized.
The cudaMemcpyAsync in this UNITTEST task body was unchecked; wrap it in cuda_safe_call so a failure is reported. (cuda_safe_call, not cuda_try: this runs inside the task-body callback where throwing is not safe.)
|
/ok to test 7089769 |
This comment has been minimized.
This comment has been minimized.
ctx_resource.cuh uses SCOPE(exit) in the resource-release callback but only included core.cuh and cuda_safe_call.cuh, so SCOPE was undefined. Because ctx_resource.cuh is widely included, this broke the cudax build across the matrix. Add the missing scope_guard.cuh include.
|
/ok to test 2d42937 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test f8a0e77 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test b3c2f46 |
😬 CI Workflow Results🟥 Finished in 35m 37s: Pass: 96%/55 | Total: 7h 58m | Max: 35m 33s | Hits: 79%/44081See results here. |
Summary
Second slice of the
cudax/include/cuda/experimental/__stf/internal/cuda_safe_call->cuda_trymigration (the directory is being landed as a sequence of complexity-ordered PRs). This slice covers the stream-lifetime / resource-handle files and includes two genuinely transactional fixes.Follow-up to #9241 (internal/ misc). Companion to #9150 (utility), #9165 (stackable, merged).
Transactional changes
ctx_resource.cuhThe batched callback-resource list was
new-allocated and ownership handed to the host callback viacudaStreamAddCallback. If the enqueue failed, the list leaked (the callback that woulddeleteit never runs). Now the list is held in aunique_ptrandrelease()d only after a successful enqueue:slice.cuhpin()is all-or-nothing (theaddress_is_pinnedearly-return treats the base address as a proxy for the whole slice), but the 2D/3D paths pin several regions in a loop. A mid-loopcuda_try(pin_memory(...))throw previously couldn't happen (cuda_safe_callaborted); now it can, so pinned regions are recorded and rolled back:pin_memorystays in the runtime-status form (it is an overload-set template, socuda_try<F>cannot apply);unpin_memoryonly aborts, never throws, so it is safe inside thenoexceptSCOPE(fail). Adds<vector>andscope_guard.cuhincludes.Mechanical changes
async_resources_handle.cuh—cudaGraphGetNodes/cudaGraphGetEdgeskeep the runtime-status form (multiple output pointers, one passed asnullptr).context.cuh—UNITTESTbodies switch tocuda_try, using the templatedcuda_try<F>form where applicable (cudaStreamCreate,cudaSetDevice,cudaStreamSynchronize/Destroy) and::std::ignorefor the discardedcudaGraphAddEmptyNodehandle. Captured handles areconst.Test plan