Skip to content

cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try#9248

Open
andralex wants to merge 11 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-ctx
Open

cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try#9248
andralex wants to merge 11 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-ctx

Conversation

@andralex
Copy link
Copy Markdown
Contributor

@andralex andralex commented Jun 4, 2026

Summary

Second slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration (the directory is being landed as a sequence of complexity-ordered PRs). This slice covers the stream-lifetime / resource-handle files and includes two genuinely transactional fixes.

Follow-up to #9241 (internal/ misc). Companion to #9150 (utility), #9165 (stackable, merged).

Transactional changes

ctx_resource.cuh

The batched callback-resource list was new-allocated and ownership handed to the host callback via cudaStreamAddCallback. If the enqueue failed, the list leaked (the callback that would delete it never runs). Now the list is held in a unique_ptr and release()d only after a successful enqueue:

auto callback_list = ::std::make_unique<...>(mv(callback_resources));
cuda_try<cudaStreamAddCallback>(stream, release_lambda, callback_list.get(), 0);
callback_list.release();

slice.cuh

pin() is all-or-nothing (the address_is_pinned early-return treats the base address as a proxy for the whole slice), but the 2D/3D paths pin several regions in a loop. A mid-loop cuda_try(pin_memory(...)) throw previously couldn't happen (cuda_safe_call aborted); now it can, so pinned regions are recorded and rolled back:

::std::vector<T*> pinned;
SCOPE(fail) { for (T* p : pinned) unpin_memory(p); };
const auto pin_one = [&pinned](T* ptr, size_t n) { cuda_try(pin_memory(ptr, n)); pinned.push_back(ptr); };

pin_memory stays in the runtime-status form (it is an overload-set template, so cuda_try<F> cannot apply); unpin_memory only aborts, never throws, so it is safe inside the noexcept SCOPE(fail). Adds <vector> and scope_guard.cuh includes.

Mechanical changes

  • async_resources_handle.cuhcudaGraphGetNodes / cudaGraphGetEdges keep the runtime-status form (multiple output pointers, one passed as nullptr).
  • context.cuhUNITTEST bodies switch to cuda_try, using the templated cuda_try<F> form where applicable (cudaStreamCreate, cudaSetDevice, cudaStreamSynchronize/Destroy) and ::std::ignore for the discarded cudaGraphAddEmptyNode handle. Captured handles are const.

Test plan

  • CI green on the cudax matrix entries that build the affected headers / STF tests
  • No success-path behavior change; new behavior is throw-vs-abort plus rollback on throw

…to cuda_try

Second slice of the internal/ migration, covering the stream-lifetime /
resource-handle files. Two sites get real transactional handling:

- ctx_resource.cuh: hold the batched callback resource list in a
  unique_ptr and only release() it after cudaStreamAddCallback succeeds,
  so a throw from the enqueue no longer leaks the list. The enqueue uses
  the templated cuda_try<cudaStreamAddCallback> form.

- slice.cuh: pin() is all-or-nothing, but the 2D/3D paths pin several
  regions in a loop. Record each pinned base pointer and roll them back
  via SCOPE(fail) if a later pin_memory throws, leaving the slice
  unpinned (matching the address_is_pinned early-return that treats the
  base address as a proxy for the whole slice). pin_memory stays in the
  runtime-status form (it is an overload-set template); unpin_memory only
  aborts, never throws, so it is safe inside the noexcept SCOPE(fail).
  Adds <vector> and scope_guard.cuh includes.

The remaining conversions are mechanical:

- async_resources_handle.cuh: cudaGraphGetNodes / cudaGraphGetEdges keep
  the runtime-status form (multiple output pointers, one passed as
  nullptr).
- context.cuh: UNITTEST bodies switch to cuda_try, using the templated
  cuda_try<F> form where applicable (cudaStreamCreate, cudaSetDevice,
  cudaStreamSynchronize/Destroy) and ::std::ignore for the discarded
  cudaGraphAddEmptyNode handle.
@andralex andralex requested a review from a team as a code owner June 4, 2026 00:56
@andralex andralex requested a review from caugonnet June 4, 2026 00:56
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 74be442

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

suggestion:

suggestion:

Walkthrough

Switches internal STF CUDA calls from cuda_safe_call to cuda_try, refactors ctx resource release to be exception-safe with callback ownership transfer, and makes memory pinning transactional by iterating contiguous hunks with rollback on failure.

Changes

STF error handling and exception safety refactoring

Layer / File(s) Summary
Graph and stream API migration to cuda_try
cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh, cudax/include/cuda/experimental/__stf/internal/context.cuh
Graph node/edge queries and unit-test stream creation/destruction/synchronization switch from cuda_safe_call to cuda_try; test stream locals are const, and ignored graph node results use cuda_try.
Exception-safe callback resource batching
cudax/include/cuda/experimental/__stf/internal/ctx_resource.cuh
Release path compacts non-callback resources in-place; callback list is built under unique_ptr, enqueue uses cuda_try, ownership transferred after successful enqueue; callback deletes vector and calls release_in_callback() per resource.
Transactional memory pinning with rollback
cudax/include/cuda/experimental/__stf/internal/slice.cuh
Adds reserved::for_each_contiguous_hunk; pin() iterates hunks and calls cuda_try(pin_memory(...)) per hunk and installs SCOPE(fail) rollback to call unpin(s) on failure; unpin() updated to traverse hunks.

Possibly related PRs

Suggested labels

stf

Suggested reviewers

  • caugonnet
  • oleksandr-pavlyk
  • NaderAlAwar

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69632b7c-d1ea-4a44-8bfd-dcf3233fc03e

📥 Commits

Reviewing files that changed from the base of the PR and between 6383b91 and 74be442.

📒 Files selected for processing (4)
  • cudax/include/cuda/experimental/__stf/internal/async_resources_handle.cuh
  • cudax/include/cuda/experimental/__stf/internal/context.cuh
  • cudax/include/cuda/experimental/__stf/internal/ctx_resource.cuh
  • cudax/include/cuda/experimental/__stf/internal/slice.cuh

Comment thread cudax/include/cuda/experimental/__stf/internal/context.cuh
Comment thread cudax/include/cuda/experimental/__stf/internal/slice.cuh Outdated
Comment thread cudax/include/cuda/experimental/__stf/internal/slice.cuh Outdated
…dge counts

cudaGraphGetNodes / cudaGraphGetEdges have multiple output pointers, but
passing nullptr for the ones we don't need leaves a single synthesized
output (numNodes / numEdges) as the last parameter, so cuda_try's
last-output form applies:

  const size_t nnodes = cuda_try<cudaGraphGetNodes>(g, nullptr);
  const size_t nedges = cuda_try<cudaGraphGetEdges>(g, nullptr, nullptr[, nullptr]);

This also lets the counts be const and drops the separate declarations.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 50b86a1

…ryable

Replace the null-then-erase pass with an in-place swap-from-back
compaction: each stream-dependent resource is released and then removed
by moving the last element into its slot and popping. A resource leaves
`resources` only after release(stream) succeeds, so if it throws, the
vector still holds the failing resource plus everything not yet
processed -- release() can be retried with nothing lost and no
double-release. No null slots linger, so no separate erase pass (and no
<algorithm>) is needed.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test cfbd028

@github-actions

This comment has been minimized.

Replace the rank-0/1/2/3 x contiguous_dims switch matrix in both pin()
and unpin() with one recursive helper, reserved::for_each_contiguous_hunk,
that invokes f(base, n) for each maximal contiguous hunk: the leading
contiguous_dims() dims form one run, the trailing dims are enumerated
recursively. This covers every element exactly once, handles any rank
(drops the old rank <= 3 limit), and normalizes the odd address math in
the previous 2D paths (which double-counted the stride/extent).

  - unpin() is now just a walk calling unpin_memory on each hunk base.
    unpin_memory ignores not-registered regions, so this is safe on
    fully- or partially-pinned slices.
  - pin()'s rollback is simply SCOPE(fail) { unpin(s); } -- no separate
    `pinned` vector, and no <vector> include. On a mid-walk failure
    unpin() releases the hunks we pinned and no-ops the rest, leaving the
    slice fully unpinned, consistent with the address_is_pinned() proxy.

pin_memory stays in the runtime-status cuda_try(pin_memory(...)) form
(it is an overload-set template, so cuda_try<pin_memory> cannot name it).
The helper is C++17 (recursive generic lambda) and lives in `reserved`.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 1c6cd4c

…ests

In the two UNITTESTs that create a user stream, release it through a
SCOPE(exit) guard right after creation instead of a trailing
cudaStreamDestroy, so the stream is destroyed even if the test body
throws. cuda_safe_call is used inside the noexcept SCOPE(exit) body.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test d37dbfe

@andralex andralex enabled auto-merge (squash) June 4, 2026 03:32
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test a7b4930

@github-actions

This comment has been minimized.

The cudaMemcpyAsync in this UNITTEST task body was unchecked; wrap it in
cuda_safe_call so a failure is reported. (cuda_safe_call, not cuda_try:
this runs inside the task-body callback where throwing is not safe.)
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 7089769

@github-actions

This comment has been minimized.

ctx_resource.cuh uses SCOPE(exit) in the resource-release callback but only
included core.cuh and cuda_safe_call.cuh, so SCOPE was undefined. Because
ctx_resource.cuh is widely included, this broke the cudax build across the
matrix. Add the missing scope_guard.cuh include.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 2d42937

@github-actions

This comment has been minimized.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test f8a0e77

@github-actions

This comment has been minimized.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test b3c2f46

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 35m 37s: Pass: 96%/55 | Total: 7h 58m | Max: 35m 33s | Hits: 79%/44081

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants