Skip to content

cudax/stf: migrate internal/ misc files from cuda_safe_call to cuda_try#9241

Open
andralex wants to merge 8 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-misc
Open

cudax/stf: migrate internal/ misc files from cuda_safe_call to cuda_try#9241
andralex wants to merge 8 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-internal-misc

Conversation

@andralex
Copy link
Copy Markdown
Contributor

@andralex andralex commented Jun 3, 2026

Summary

First (simplest) slice of the cudax/include/cuda/experimental/__stf/internal/ cuda_safe_call -> cuda_try migration. The full internal/ directory has ~90 sites across 16 files; it is being landed as a sequence of focused PRs in increasing order of complexity. This PR covers six low-complexity leaf files whose CUDA calls are standalone queries, device syncs, memcpies, or single graph-node adds — no transactional cleanup or noexcept-dispatch concerns.

Companion to PRs #9146 (allocators), #9150 (utility), #9165 (stackable).

Changes (6 files, 12 sites)

File Calls Form
dot.cuh cudaGetDevice templated cuda_try<F> (result used)
scheduler.cuh cudaGetDeviceCount templated cuda_try<F>
executable_graph_cache.cuh cudaGetDeviceCount templated cuda_try<F>
thread_hierarchy.cuh cudaDeviceSynchronize (x3), cudaLaunchCooperativeKernel runtime-status cuda_try
algorithm.cuh cudaGraphAddChildGraphNode, cudaGraphLaunch (x2) runtime-status cuda_try
scalar_interface.cuh cudaMemcpyAsync, cudaGraphAddMemcpyNode memcpy: runtime-status; memcpy-node: templated (returned)

Notes

  • Templated form (out = cuda_try<F>(args...)) is used where there is a single synthesizable output parameter whose result is used, giving a const-initialized local. All converted signatures were checked against CTK 13.2 to confirm they are unambiguous for cuda_try's first/last-output selection.
  • cudaGraphAddChildGraphNode keeps the runtime-status form: its node handle is intentionally discarded, and a templated form would trip -Wunused-variable.
  • thread_hierarchy.cuh calls are in UNITTEST bodies but are plain sequential test code (not runtime-dispatched host-task lambdas), so a thrown cuda_try just fails the test. Converted for consistency.

Intentionally deferred

  • slice.cuh (7 pin_memory calls): these run in 2D/3D loops where a mid-loop throw would leak already-pinned regions. It needs unpin-on-throw rollback rather than a mechanical conversion, so it is held for a later (more complex) slice.
  • execution_policy.cuh / hashtable_linearprobing.cuh only #include the header; no calls to convert.

Test plan

  • CI green on the cudax matrix entries that build the affected headers / STF tests
  • No success-path behavior change — all conversions are throw-vs-abort

First (simplest) slice of the internal/ cuda_safe_call -> cuda_try
migration. Covers six low-complexity leaf files whose CUDA calls are
standalone queries, device syncs, memcpies, or single graph-node adds
(no transactional cleanup or noexcept-dispatch concerns):

  - dot.cuh                  cudaGetDevice          -> templated cuda_try
  - scheduler.cuh            cudaGetDeviceCount     -> templated cuda_try
  - executable_graph_cache.cuh cudaGetDeviceCount   -> templated cuda_try
  - thread_hierarchy.cuh     cudaDeviceSynchronize (x3),
                             cudaLaunchCooperativeKernel -> cuda_try
  - algorithm.cuh            cudaGraphAddChildGraphNode,
                             cudaGraphLaunch (x2)        -> cuda_try
  - scalar_interface.cuh     cudaMemcpyAsync,
                             cudaGraphAddMemcpyNode      -> cuda_try

Where a call has a single synthesizable output parameter and the result
is used, the templated cuda_try<F>(args...) form is used so the result
is a const-initialized local (cudaGetDevice, cudaGetDeviceCount,
cudaGraphAddMemcpyNode). cudaGraphAddChildGraphNode keeps the
runtime-status form because its node handle is intentionally discarded
(a templated form would trip -Wunused-variable).

thread_hierarchy.cuh's calls live in UNITTEST bodies, but they are plain
sequential test code (not runtime-dispatched host-task lambdas), so a
thrown cuda_try simply fails the test; converting keeps the migration
consistent.

slice.cuh is intentionally left for a later slice: its pin_memory calls
run in 2D/3D loops where a mid-loop throw would leak already-pinned
regions, so it needs unpin-on-throw rollback rather than a mechanical
conversion.
@andralex andralex requested a review from a team as a code owner June 3, 2026 18:24
@andralex andralex requested a review from caugonnet June 3, 2026 18:24
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 3, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 3, 2026

/ok to test 3e0972d

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 3, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

important: Walkthrough

This PR replaces many cuda_safe_call usages with cuda_try (returning/template forms where applicable), moves adapter cleanup into SCOPE(exit), and updates a stream_adapter move-constructor to move members.

Changes

Error Handling Refactor

Layer / File(s) Summary
Device state queries
cudax/include/cuda/experimental/__stf/internal/dot.cuh, cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh, cudax/include/cuda/experimental/__stf/internal/scheduler.cuh
Device enumeration and current-device reads now capture results via cuda_try<...>() instead of calling cuda_safe_call(... &out).
Graph operations and adapter cleanup
cudax/include/cuda/experimental/__stf/internal/algorithm.cuh, cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh
cudaGraphAddChildGraphNode, cudaGraphInstantiateWithFlags, and cudaGraphLaunch now use cuda_try (template form where applicable); adapter cleanup moved into SCOPE(exit) guards so adapter.clear() runs on all exits; cudaGraphDebugDotPrint wrapped with cuda_safe_call.
Data copy operations
cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh
stream_data_copy wraps cudaMemcpyAsync with cuda_try(...); graph_data_copy returns cuda_try<cudaGraphAddMemcpyNode>(...) directly.
Unit test synchronization changes
cudax/include/cuda/experimental/__stf/internal/thread_hierarchy.cuh
Unit tests replace cuda_safe_call with cuda_try for cudaDeviceSynchronize() and cudaLaunchCooperativeKernel(...) calls.
Allocator move-constructor fix
cudax/include/cuda/experimental/__stf/allocators/adapters.cuh
stream_adapter(stream_adapter&&) now moves adapter_state and alloc using mv(...) rather than copying them.
  • Suggested labels: stf
  • Suggested reviewers:
    • caugonnet
    • oleksandr-pavlyk

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/scheduler.cuh (1)

56-56: ⚡ Quick win

suggestion: Make num_devices const by moving initialization to default member initializer.

The member variable num_devices is assigned from cuda_try<cudaGetDeviceCount>() but never modified after construction. Per the skill file requirement, "any variable not modified must be const (including values assigned from function returns like cuda_try<...>())".

Refactor to inline initialization at the declaration (line 74) and remove the assignment from the constructor body. This matches the pattern in reorderer.cuh and enforces immutability.

Refactor to const with inline initialization
   scheduler()
   {
-    num_devices = cuda_try<cudaGetDeviceCount>();
     assert(num_devices > 0);
   }
   
   ...
   
 protected:
-  int num_devices = 0;
+  const int num_devices = cuda_try<cudaGetDeviceCount>();

Based on learnings: the skill file explicitly requires const for variables assigned from cuda_try<...>() that are not subsequently modified.

Also applies to: 74-74


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c9c1d6e9-979e-4bc6-9c44-b3b8c057e643

📥 Commits

Reviewing files that changed from the base of the PR and between ee9f95b and 3e0972d.

📒 Files selected for processing (6)
  • cudax/include/cuda/experimental/__stf/internal/algorithm.cuh
  • cudax/include/cuda/experimental/__stf/internal/dot.cuh
  • cudax/include/cuda/experimental/__stf/internal/executable_graph_cache.cuh
  • cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh
  • cudax/include/cuda/experimental/__stf/internal/scheduler.cuh
  • cudax/include/cuda/experimental/__stf/internal/thread_hierarchy.cuh

Comment thread cudax/include/cuda/experimental/__stf/internal/algorithm.cuh Outdated
@github-actions

This comment has been minimized.

Follow-up to the internal/ misc migration:

- algorithm.cuh / scalar_interface.cuh: switch eligible cuda_try calls to
  the templated cuda_try<F>(args...) form. For the discarded child-graph
  node, use ::std::ignore = cuda_try<cudaGraphAddChildGraphNode>(...).
  cudaGraphLaunch and cudaMemcpyAsync become cuda_try<...> as well.
  cudaGraphInstantiateWithFlags stays runtime-status (its output is written
  into caller-owned shared_ptr storage).

- algorithm.cuh: in run_dynamic, release the stream adapter via
  SCOPE(exit) { adapter.clear(); } right after acquisition, matching run(),
  so the adapter is cleared even if graph instantiation/launch throws.

- adapters.cuh: move (not copy) adapter_state and alloc in the
  stream_adapter move constructor.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 3, 2026

/ok to test daab287

@github-actions

This comment has been minimized.

…rnal/

- algorithm.cuh / executable_graph_cache.cuh: use the templated
  cuda_try<cudaGraphInstantiateWithFlags>(...) form, assigning the
  instantiated handle into *eg / *res. The handle is a trivially-assignable
  pointer, so capture-then-assign works even though the storage is owned by
  the surrounding shared_ptr.
- Restore cuda_try<cudaGraphLaunch>(*eg, stream) (direct form compiles
  fine: no output param, properly typed args, non-nullary).
- Construct the cudaGraphExec_t shared_ptr via the braced form with an
  inline deleter, dropping the redundant named deleter lambdas.
- Value-initialize the handle (new cudaGraphExec_t{}) so a throw from
  instantiation cannot leave the shared_ptr deleter calling
  cudaGraphExecDestroy on an indeterminate handle.
- scheduler.cuh: initialize num_devices in the constructor member-init
  list instead of assigning in the body.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test bebaa9b

@github-actions

This comment has been minimized.

cudaMemcpyAsync returns cudaError_t with no synthesizable output, so the
templated cuda_try<F> form does not apply. Use the runtime-status form
instead.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh (1)

134-134: ⚡ Quick win

suggestion: These replacements introduce direct uses of cuda_try, cudaMemcpyAsync, and cudaGraphAddMemcpyNode, so they should be qualified from :: and this header should include the cuda_try definition directly instead of relying on a transitive include. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace" and "Files must include all headers related to the symbols that they are using. No transitive header inclusions are allowed."

Also applies to: 270-270


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2a2134c7-569a-4e92-a3ed-89446692014f

📥 Commits

Reviewing files that changed from the base of the PR and between bebaa9b and 45e4467.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/internal/scalar_interface.cuh

@andralex andralex enabled auto-merge (squash) June 4, 2026 03:33
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 077b9f4

@github-actions

This comment has been minimized.

Harden previously-bare CUDA calls that ran unchecked:

- the cudaGraphExecDestroy calls in the shared_ptr deleters (algorithm.cuh,
  executable_graph_cache.cuh) now use cuda_safe_call. Deleters run in a
  noexcept context, so cuda_safe_call (abort on error) is correct -- cuda_try
  must not be used there.
- the debug-only cudaGraphDebugDotPrint in dump_algorithm now uses
  cuda_safe_call.

The cudaGraphExecUpdate probe stays as-is: it intentionally captures the
status via cudaGetLastError and returns it as a bool.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test 3ef273b

@github-actions

This comment has been minimized.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 4, 2026

/ok to test ff3e69f

@github-actions

This comment has been minimized.

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test ca417d3

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 35m 25s: Pass: 96%/55 | Total: 10h 30m | Max: 35m 22s | Hits: 50%/67038

See results here.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test ca417d3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants