cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try by andralex · Pull Request #9248 · NVIDIA/cccl

andralex · 2026-06-04T00:56:42Z

Summary

Second slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration (the directory is being landed as a sequence of complexity-ordered PRs). This slice covers the stream-lifetime / resource-handle files and includes two genuinely transactional fixes.

Follow-up to #9241 (internal/ misc). Companion to #9150 (utility), #9165 (stackable, merged).

Transactional changes

`ctx_resource.cuh`

The batched callback-resource list was new-allocated and ownership handed to the host callback via cudaStreamAddCallback. If the enqueue failed, the list leaked (the callback that would delete it never runs). Now the list is held in a unique_ptr and release()d only after a successful enqueue:

auto callback_list = ::std::make_unique<...>(mv(callback_resources));
cuda_try<cudaStreamAddCallback>(stream, release_lambda, callback_list.get(), 0);
callback_list.release();

`slice.cuh`

pin() is all-or-nothing (the address_is_pinned early-return treats the base address as a proxy for the whole slice), but the 2D/3D paths pin several regions in a loop. A mid-loop cuda_try(pin_memory(...)) throw previously couldn't happen (cuda_safe_call aborted); now it can, so pinned regions are recorded and rolled back:

::std::vector<T*> pinned;
SCOPE(fail) { for (T* p : pinned) unpin_memory(p); };
const auto pin_one = [&pinned](T* ptr, size_t n) { cuda_try(pin_memory(ptr, n)); pinned.push_back(ptr); };

pin_memory stays in the runtime-status form (it is an overload-set template, so cuda_try<F> cannot apply); unpin_memory only aborts, never throws, so it is safe inside the noexcept SCOPE(fail). Adds <vector> and scope_guard.cuh includes.

Mechanical changes

async_resources_handle.cuh — cudaGraphGetNodes / cudaGraphGetEdges keep the runtime-status form (multiple output pointers, one passed as nullptr).
context.cuh — UNITTEST bodies switch to cuda_try, using the templated cuda_try<F> form where applicable (cudaStreamCreate, cudaSetDevice, cudaStreamSynchronize/Destroy) and ::std::ignore for the discarded cudaGraphAddEmptyNode handle. Captured handles are const.

Test plan

CI green on the cudax matrix entries that build the affected headers / STF tests
No success-path behavior change; new behavior is throw-vs-abort plus rollback on throw

…to cuda_try Second slice of the internal/ migration, covering the stream-lifetime / resource-handle files. Two sites get real transactional handling: - ctx_resource.cuh: hold the batched callback resource list in a unique_ptr and only release() it after cudaStreamAddCallback succeeds, so a throw from the enqueue no longer leaks the list. The enqueue uses the templated cuda_try<cudaStreamAddCallback> form. - slice.cuh: pin() is all-or-nothing, but the 2D/3D paths pin several regions in a loop. Record each pinned base pointer and roll them back via SCOPE(fail) if a later pin_memory throws, leaving the slice unpinned (matching the address_is_pinned early-return that treats the base address as a proxy for the whole slice). pin_memory stays in the runtime-status form (it is an overload-set template); unpin_memory only aborts, never throws, so it is safe inside the noexcept SCOPE(fail). Adds <vector> and scope_guard.cuh includes. The remaining conversions are mechanical: - async_resources_handle.cuh: cudaGraphGetNodes / cudaGraphGetEdges keep the runtime-status form (multiple output pointers, one passed as nullptr). - context.cuh: UNITTEST bodies switch to cuda_try, using the templated cuda_try<F> form where applicable (cudaStreamCreate, cudaSetDevice, cudaStreamSynchronize/Destroy) and ::std::ignore for the discarded cudaGraphAddEmptyNode handle.

copy-pr-bot · 2026-06-04T00:56:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-04T00:56:57Z

/ok to test 74be442

coderabbitai · 2026-06-04T01:02:37Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

suggestion:

Walkthrough

Switches internal STF CUDA calls from cuda_safe_call to cuda_try, refactors ctx resource release to be exception-safe with callback ownership transfer, and makes memory pinning transactional by iterating contiguous hunks with rollback on failure.

Changes

STF error handling and exception safety refactoring

Layer / File(s)	Summary
Graph and stream API migration to cuda_try `cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh`, `cudax/include/cuda/experimental/__stf/internal/context.cuh`	Graph node/edge queries and unit-test stream creation/destruction/synchronization switch from `cuda_safe_call` to `cuda_try`; test stream locals are `const`, and ignored graph node results use `cuda_try`.
Exception-safe callback resource batching `cudax/include/cuda/experimental/__stf/internal/ctx_resource.cuh`	Release path compacts non-callback resources in-place; callback list is built under `unique_ptr`, enqueue uses `cuda_try`, ownership transferred after successful enqueue; callback deletes vector and calls `release_in_callback()` per resource.
Transactional memory pinning with rollback `cudax/include/cuda/experimental/__stf/internal/slice.cuh`	Adds `reserved::for_each_contiguous_hunk`; `pin()` iterates hunks and calls `cuda_try(pin_memory(...))` per hunk and installs `SCOPE(fail)` rollback to call `unpin(s)` on failure; `unpin()` updated to traverse hunks.

Possibly related PRs

cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try #9165: Parallel migration of CUDA Graph API error handling from cuda_safe_call to cuda_try in stackable_ctx_impl graph paths.

Suggested labels

stf

Suggested reviewers

caugonnet
oleksandr-pavlyk
NaderAlAwar

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69632b7c-d1ea-4a44-8bfd-dcf3233fc03e

📥 Commits

Reviewing files that changed from the base of the PR and between 6383b91 and 74be442.

📒 Files selected for processing (4)

cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh
cudax/include/cuda/experimental/__stf/internal/context.cuh
cudax/include/cuda/experimental/__stf/internal/ctx_resource.cuh
cudax/include/cuda/experimental/__stf/internal/slice.cuh

…dge counts cudaGraphGetNodes / cudaGraphGetEdges have multiple output pointers, but passing nullptr for the ones we don't need leaves a single synthesized output (numNodes / numEdges) as the last parameter, so cuda_try's last-output form applies: const size_t nnodes = cuda_try<cudaGraphGetNodes>(g, nullptr); const size_t nedges = cuda_try<cudaGraphGetEdges>(g, nullptr, nullptr[, nullptr]); This also lets the counts be const and drops the separate declarations.

andralex · 2026-06-04T01:05:04Z

/ok to test 50b86a1

…ryable Replace the null-then-erase pass with an in-place swap-from-back compaction: each stream-dependent resource is released and then removed by moving the last element into its slot and popping. A resource leaves `resources` only after release(stream) succeeds, so if it throws, the vector still holds the failing resource plus everything not yet processed -- release() can be retried with nothing lost and no double-release. No null slots linger, so no separate erase pass (and no <algorithm>) is needed.

andralex · 2026-06-04T01:57:53Z

/ok to test cfbd028

Replace the rank-0/1/2/3 x contiguous_dims switch matrix in both pin() and unpin() with one recursive helper, reserved::for_each_contiguous_hunk, that invokes f(base, n) for each maximal contiguous hunk: the leading contiguous_dims() dims form one run, the trailing dims are enumerated recursively. This covers every element exactly once, handles any rank (drops the old rank <= 3 limit), and normalizes the odd address math in the previous 2D paths (which double-counted the stride/extent). - unpin() is now just a walk calling unpin_memory on each hunk base. unpin_memory ignores not-registered regions, so this is safe on fully- or partially-pinned slices. - pin()'s rollback is simply SCOPE(fail) { unpin(s); } -- no separate `pinned` vector, and no <vector> include. On a mid-walk failure unpin() releases the hunks we pinned and no-ops the rest, leaving the slice fully unpinned, consistent with the address_is_pinned() proxy. pin_memory stays in the runtime-status cuda_try(pin_memory(...)) form (it is an overload-set template, so cuda_try<pin_memory> cannot name it). The helper is C++17 (recursive generic lambda) and lives in `reserved`.

andralex · 2026-06-04T03:27:59Z

/ok to test 1c6cd4c

…ests In the two UNITTESTs that create a user stream, release it through a SCOPE(exit) guard right after creation instead of a trailing cudaStreamDestroy, so the stream is destroyed even if the test body throws. cuda_safe_call is used inside the noexcept SCOPE(exit) body.

andralex · 2026-06-04T03:30:13Z

/ok to test d37dbfe

andralex · 2026-06-04T03:34:26Z

/ok to test a7b4930

The cudaMemcpyAsync in this UNITTEST task body was unchecked; wrap it in cuda_safe_call so a failure is reported. (cuda_safe_call, not cuda_try: this runs inside the task-body callback where throwing is not safe.)

andralex · 2026-06-04T04:59:00Z

/ok to test 7089769

ctx_resource.cuh uses SCOPE(exit) in the resource-release callback but only included core.cuh and cuda_safe_call.cuh, so SCOPE was undefined. Because ctx_resource.cuh is widely included, this broke the cudax build across the matrix. Add the missing scope_guard.cuh include.

andralex · 2026-06-04T22:22:59Z

/ok to test 2d42937

andralex · 2026-06-05T01:10:59Z

/ok to test f8a0e77

andralex · 2026-06-05T13:07:38Z

/ok to test b3c2f46

github-actions · 2026-06-05T13:45:06Z

😬 CI Workflow Results

🟥 Finished in 35m 37s: Pass: 96%/55 | Total: 7h 58m | Max: 35m 33s | Hits: 79%/44081

See results here.

andralex requested a review from a team as a code owner June 4, 2026 00:56

andralex requested a review from caugonnet June 4, 2026 00:56

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread cudax/include/cuda/experimental/__stf/internal/context.cuh

Comment thread cudax/include/cuda/experimental/__stf/internal/slice.cuh Outdated

Comment thread cudax/include/cuda/experimental/__stf/internal/slice.cuh Outdated