cudax/stf: migrate internal/ misc files from cuda_safe_call to cuda_try by andralex · Pull Request #9241 · NVIDIA/cccl

andralex · 2026-06-03T18:24:17Z

Summary

First (simplest) slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration. The full internal/ directory has ~90 sites across 16 files; it is being landed as a sequence of focused PRs in increasing order of complexity. This PR covers six low-complexity leaf files whose CUDA calls are standalone queries, device syncs, memcpies, or single graph-node adds — no transactional cleanup or noexcept-dispatch concerns.

Companion to PRs #9146 (allocators), #9150 (utility), #9165 (stackable).

Changes (6 files, 12 sites)

File	Calls	Form
`dot.cuh`	`cudaGetDevice`	templated `cuda_try<F>` (result used)
`scheduler.cuh`	`cudaGetDeviceCount`	templated `cuda_try<F>`
`executable_graph_cache.cuh`	`cudaGetDeviceCount`	templated `cuda_try<F>`
`thread_hierarchy.cuh`	`cudaDeviceSynchronize` (x3), `cudaLaunchCooperativeKernel`	runtime-status `cuda_try`
`algorithm.cuh`	`cudaGraphAddChildGraphNode`, `cudaGraphLaunch` (x2)	runtime-status `cuda_try`
`scalar_interface.cuh`	`cudaMemcpyAsync`, `cudaGraphAddMemcpyNode`	memcpy: runtime-status; memcpy-node: templated (returned)

Notes

Templated form (out = cuda_try<F>(args...)) is used where there is a single synthesizable output parameter whose result is used, giving a const-initialized local. All converted signatures were checked against CTK 13.2 to confirm they are unambiguous for cuda_try's first/last-output selection.
cudaGraphAddChildGraphNode keeps the runtime-status form: its node handle is intentionally discarded, and a templated form would trip -Wunused-variable.
thread_hierarchy.cuh calls are in UNITTEST bodies but are plain sequential test code (not runtime-dispatched host-task lambdas), so a thrown cuda_try just fails the test. Converted for consistency.

Intentionally deferred

slice.cuh (7 pin_memory calls): these run in 2D/3D loops where a mid-loop throw would leak already-pinned regions. It needs unpin-on-throw rollback rather than a mechanical conversion, so it is held for a later (more complex) slice.
execution_policy.cuh / hashtable_linearprobing.cuh only #include the header; no calls to convert.

Test plan

CI green on the cudax matrix entries that build the affected headers / STF tests
No success-path behavior change — all conversions are throw-vs-abort

First (simplest) slice of the internal/ cuda_safe_call -> cuda_try migration. Covers six low-complexity leaf files whose CUDA calls are standalone queries, device syncs, memcpies, or single graph-node adds (no transactional cleanup or noexcept-dispatch concerns): - dot.cuh cudaGetDevice -> templated cuda_try - scheduler.cuh cudaGetDeviceCount -> templated cuda_try - executable_graph_cache.cuh cudaGetDeviceCount -> templated cuda_try - thread_hierarchy.cuh cudaDeviceSynchronize (x3), cudaLaunchCooperativeKernel -> cuda_try - algorithm.cuh cudaGraphAddChildGraphNode, cudaGraphLaunch (x2) -> cuda_try - scalar_interface.cuh cudaMemcpyAsync, cudaGraphAddMemcpyNode -> cuda_try Where a call has a single synthesizable output parameter and the result is used, the templated cuda_try<F>(args...) form is used so the result is a const-initialized local (cudaGetDevice, cudaGetDeviceCount, cudaGraphAddMemcpyNode). cudaGraphAddChildGraphNode keeps the runtime-status form because its node handle is intentionally discarded (a templated form would trip -Wunused-variable). thread_hierarchy.cuh's calls live in UNITTEST bodies, but they are plain sequential test code (not runtime-dispatched host-task lambdas), so a thrown cuda_try simply fails the test; converting keeps the migration consistent. slice.cuh is intentionally left for a later slice: its pin_memory calls run in 2D/3D loops where a mid-loop throw would leak already-pinned regions, so it needs unpin-on-throw rollback rather than a mechanical conversion.

copy-pr-bot · 2026-06-03T18:24:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-03T18:24:30Z

/ok to test 3e0972d

coderabbitai · 2026-06-03T18:33:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

important: Walkthrough

This PR replaces many cuda_safe_call usages with cuda_try (returning/template forms where applicable), moves adapter cleanup into SCOPE(exit), and updates a stream_adapter move-constructor to move members.

Changes

Error Handling Refactor

Layer / File(s)	Summary
Device state queries `cudax/include/cuda/experimental/__stf/internal/dot.cuh`, `cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh`, `cudax/include/cuda/experimental/__stf/internal/scheduler.cuh`	Device enumeration and current-device reads now capture results via `cuda_try<...>()` instead of calling `cuda_safe_call(... &out)`.
Graph operations and adapter cleanup `cudax/include/cuda/experimental/__stf/internal/algorithm.cuh`, `cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh`	`cudaGraphAddChildGraphNode`, `cudaGraphInstantiateWithFlags`, and `cudaGraphLaunch` now use `cuda_try` (template form where applicable); adapter cleanup moved into `SCOPE(exit)` guards so `adapter.clear()` runs on all exits; `cudaGraphDebugDotPrint` wrapped with `cuda_safe_call`.
Data copy operations `cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh`	`stream_data_copy` wraps `cudaMemcpyAsync` with `cuda_try(...)`; `graph_data_copy` returns `cuda_try<cudaGraphAddMemcpyNode>(...)` directly.
Unit test synchronization changes `cudax/include/cuda/experimental/__stf/internal/thread_hierarchy.cuh`	Unit tests replace `cuda_safe_call` with `cuda_try` for `cudaDeviceSynchronize()` and `cudaLaunchCooperativeKernel(...)` calls.
Allocator move-constructor fix `cudax/include/cuda/experimental/__stf/allocators/adapters.cuh`	`stream_adapter(stream_adapter&&)` now moves `adapter_state` and `alloc` using `mv(...)` rather than copying them.

Suggested labels: stf
Suggested reviewers:
- caugonnet
- oleksandr-pavlyk

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

cudax/include/cuda/experimental/__stf/internal/scheduler.cuh (1)
56-56: ⚡ Quick win

suggestion: Make num_devices const by moving initialization to default member initializer.

The member variable num_devices is assigned from cuda_try<cudaGetDeviceCount>() but never modified after construction. Per the skill file requirement, "any variable not modified must be const (including values assigned from function returns like cuda_try<...>())".

Refactor to inline initialization at the declaration (line 74) and remove the assignment from the constructor body. This matches the pattern in reorderer.cuh and enforces immutability.
Refactor to const with inline initialization
   scheduler()
   {
-    num_devices = cuda_try<cudaGetDeviceCount>();
     assert(num_devices > 0);
   }
   
   ...
   
 protected:
-  int num_devices = 0;
+  const int num_devices = cuda_try<cudaGetDeviceCount>();
Based on learnings: the skill file explicitly requires const for variables assigned from cuda_try<...>() that are not subsequently modified.

Also applies to: 74-74

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c9c1d6e9-979e-4bc6-9c44-b3b8c057e643

📥 Commits

Reviewing files that changed from the base of the PR and between ee9f95b and 3e0972d.

📒 Files selected for processing (6)

cudax/include/cuda/experimental/__stf/internal/algorithm.cuh
cudax/include/cuda/experimental/__stf/internal/dot.cuh
cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh
cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh
cudax/include/cuda/experimental/__stf/internal/scheduler.cuh
cudax/include/cuda/experimental/__stf/internal/thread_hierarchy.cuh

Follow-up to the internal/ misc migration: - algorithm.cuh / scalar_interface.cuh: switch eligible cuda_try calls to the templated cuda_try<F>(args...) form. For the discarded child-graph node, use ::std::ignore = cuda_try<cudaGraphAddChildGraphNode>(...). cudaGraphLaunch and cudaMemcpyAsync become cuda_try<...> as well. cudaGraphInstantiateWithFlags stays runtime-status (its output is written into caller-owned shared_ptr storage). - algorithm.cuh: in run_dynamic, release the stream adapter via SCOPE(exit) { adapter.clear(); } right after acquisition, matching run(), so the adapter is cleared even if graph instantiation/launch throws. - adapters.cuh: move (not copy) adapter_state and alloc in the stream_adapter move constructor.

andralex · 2026-06-03T20:32:09Z

/ok to test daab287

…rnal/ - algorithm.cuh / executable_graph_cache.cuh: use the templated cuda_try<cudaGraphInstantiateWithFlags>(...) form, assigning the instantiated handle into *eg / *res. The handle is a trivially-assignable pointer, so capture-then-assign works even though the storage is owned by the surrounding shared_ptr. - Restore cuda_try<cudaGraphLaunch>(*eg, stream) (direct form compiles fine: no output param, properly typed args, non-nullary). - Construct the cudaGraphExec_t shared_ptr via the braced form with an inline deleter, dropping the redundant named deleter lambdas. - Value-initialize the handle (new cudaGraphExec_t{}) so a throw from instantiation cannot leave the shared_ptr deleter calling cudaGraphExecDestroy on an indeterminate handle. - scheduler.cuh: initialize num_devices in the constructor member-init list instead of assigning in the body.

andralex · 2026-06-04T00:15:05Z

/ok to test bebaa9b

cudaMemcpyAsync returns cudaError_t with no synthesizable output, so the templated cuda_try<F> form does not apply. Use the runtime-status form instead.

coderabbitai

🧹 Nitpick comments (1)

cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh (1)

134-134: ⚡ Quick win

suggestion: These replacements introduce direct uses of cuda_try, cudaMemcpyAsync, and cudaGraphAddMemcpyNode, so they should be qualified from :: and this header should include the cuda_try definition directly instead of relying on a transitive include. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace" and "Files must include all headers related to the symbols that they are using. No transitive header inclusions are allowed."

Also applies to: 270-270

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2a2134c7-569a-4e92-a3ed-89446692014f

📥 Commits

Reviewing files that changed from the base of the PR and between bebaa9b and 45e4467.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh

andralex · 2026-06-04T03:33:36Z

/ok to test 077b9f4

Harden previously-bare CUDA calls that ran unchecked: - the cudaGraphExecDestroy calls in the shared_ptr deleters (algorithm.cuh, executable_graph_cache.cuh) now use cuda_safe_call. Deleters run in a noexcept context, so cuda_safe_call (abort on error) is correct -- cuda_try must not be used there. - the debug-only cudaGraphDebugDotPrint in dump_algorithm now uses cuda_safe_call. The cudaGraphExecUpdate probe stays as-is: it intentionally captures the status via cudaGetLastError and returns it as a bool.

andralex · 2026-06-04T04:55:47Z

/ok to test 3ef273b

andralex · 2026-06-04T22:07:53Z

/ok to test ff3e69f

andralex · 2026-06-05T01:34:48Z

/ok to test ca417d3

github-actions · 2026-06-05T02:11:52Z

😬 CI Workflow Results

🟥 Finished in 35m 25s: Pass: 96%/55 | Total: 10h 30m | Max: 35m 22s | Hits: 50%/67038

See results here.

andralex · 2026-06-05T22:29:15Z

/ok to test ca417d3

andralex requested a review from a team as a code owner June 3, 2026 18:24

andralex requested a review from caugonnet June 3, 2026 18:24

github-project-automation Bot added this to CCCL Jun 3, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 3, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread cudax/include/cuda/experimental/__stf/internal/algorithm.cuh Outdated

This comment has been minimized.

Sign in to view

andralex mentioned this pull request Jun 4, 2026

cudax/stf: migrate internal/ context + resources from cuda_safe_call to cuda_try #9248

Open

2 tasks

This comment has been minimized.

Sign in to view

[STF] Fix cudaMemcpyAsync cuda_try form in scalar_interface

45e4467

cudaMemcpyAsync returns cudaError_t with no synthesizable output, so the templated cuda_try<F> form does not apply. Use the runtime-status form instead.

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Merge branch 'main' into andralex/stf-cuda-try-internal-misc

077b9f4

andralex enabled auto-merge (squash) June 4, 2026 03:33

andralex mentioned this pull request Jun 4, 2026

cudax/stf: migrate internal/ launch + host_launch_scope from cuda_safe_call to cuda_try #9249

Open

2 tasks

This comment has been minimized.

Sign in to view

Merge branch 'main' into andralex/stf-cuda-try-internal-misc

ff3e69f

caugonnet approved these changes Jun 4, 2026

View reviewed changes

andralex mentioned this pull request Jun 4, 2026

cudax/stf: migrate internal/ parallel_for + cuda_kernel scopes from cuda_safe_call to cuda_try #9265

Open

2 tasks

This comment has been minimized.

Sign in to view

Merge branch 'main' into andralex/stf-cuda-try-internal-misc

ca417d3

caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026

Conversation

andralex commented Jun 3, 2026

Summary

Changes (6 files, 12 sites)

Notes

Intentionally deferred

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 3, 2026

Uh oh!

andralex commented Jun 3, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment has been minimized.

andralex commented Jun 3, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 4, 2026

Uh oh!

This comment has been minimized.

andralex commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 35m 25s: Pass: 96%/55 | Total: 10h 30m | Max: 35m 22s | Hits: 50%/67038

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading