cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try by andralex · Pull Request #9268 · NVIDIA/cccl

andralex · 2026-06-05T01:35:14Z

Summary

Migrates the cudax/include/cuda/experimental/__stf/stream/interfaces/ data interfaces (hashtable, slice, slice reduction ops) from cuda_safe_call to cuda_try. Part of the ongoing STF cuda_safe_call -> cuda_try rollout; the large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are handled in separate PRs.

Changes (3 files, 15 sites)

Templated cuda_try<F> for single-function calls: cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes (out-param -> returned cudaPointerAttributes).
Kept runtime-status cuda_try(...) for overload sets (cuda_runtime.h templated wrappers): cudaHostAlloc, cudaMallocAsync.
hashtable_linearprobing.cuh leak guard: in stream_data_allocate, after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the following cudaMemsetAsync throws — closing the leak the new throw path would otherwise introduce. Adds scope_guard.cuh.

Validation

Built locally (cpp20): cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises the 2D cudaMemcpy2DAsync path), cudax.test.stf.reductions.reduce_sum — all compile and link.

Test plan

CI green on the cudax matrix
No success-path behavior change; new behavior is throw-vs-abort plus the alloc leak-guard

Covers the stream-backend data interfaces (hashtable, slice, slice reduction ops). Uses the templated cuda_try<F> form for single-function calls (cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes). cudaHostAlloc and cudaMallocAsync stay in the runtime-status form: both are overload sets (cuda_runtime.h templated wrappers), so cuda_try<F> cannot name them. hashtable stream_data_allocate: after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the subsequent cudaMemsetAsync throws, so the new throw path does not leak the allocation. Adds scope_guard.cuh. Verified locally by building cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (2D cudaMemcpy2DAsync path), and cudax.test.stf.reductions.reduce_sum.

andralex · 2026-06-05T01:35:17Z

placeholder

copy-pr-bot · 2026-06-05T01:35:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-06-05T01:38:44Z

/ok to test b6dd048

andralex · 2026-06-05T01:39:24Z

/ok to test 33ffac5

copy-pr-bot · 2026-06-05T01:39:27Z

/ok to test 33ffac5

@andralex, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

coderabbitai · 2026-06-05T01:40:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cb3e4545-7fad-4c69-89e0-f7ba2d58ee54

📥 Commits

Reviewing files that changed from the base of the PR and between b6dd048 and 3ae7ba1.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh

🚧 Files skipped from review as they are similar to previous changes (2)

cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR migrates CUDA runtime error handling in the data interfaces under cudax/include/cuda/experimental/__stf/stream/interfaces/ from abort-on-error cuda_safe_call(...) to exception-based cuda_try(...). Three interface files are changed (hashtable, slice, slice_reduction_ops). Large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are intentionally excluded for separate treatment.

Changes

Files modified: 3 | Sites changed: 15 | Lines changed: +26/-18

hashtable_linearprobing.cuh
- Replaced many CUDA runtime calls with cuda_try variants for copy/alloc/free/sync flows.
- Added scope_guard.cuh and introduced a SCOPE(fail) leak-guard: after successful device cudaMallocAsync, a subsequent failing cudaMemsetAsync will trigger a cleanup that frees the allocated buffer (cleanup call itself remained wrapped with the legacy cuda_safe_call).
- Host fallback/alloc path and deallocation paths switched to cuda_try (cuda_try<cudaStreamSynchronize>, cuda_try<cudaFreeHost>, cuda_try<cudaFreeAsync>) where applicable.
slice.cuh
- Converted async memcpy calls (0D/1D/2D/contiguous higher-dimensional branches) from cuda_safe_call(...) to cuda_try forms. Uses templated cuda_try<F> for single-function calls where possible.
- Rewrote get_memory_type to use cuda_try<cudaPointerGetAttributes> returning cudaPointerAttributes instead of filling an out-parameter.
slice_reduction_ops.cuh
- Switched host-side stream synchronization in op and init_op from cuda_safe_call(cudaStreamSynchronize(...)) to cuda_try<cudaStreamSynchronize>(...).

Notable implementation details / decisions

Introduced a templated cuda_try<F> for single-function CUDA calls (used for functions like cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, and cudaPointerGetAttributes).
Kept runtime-status cuda_try(...) form for overload sets provided by cuda_runtime.h (e.g., cudaHostAlloc, cudaMallocAsync) because those cannot be referenced as a single function template parameter.
After discovering that cudaMemcpyAsync is an overload set (CTK 13.2+ runtime wrapper), some attempted cuda_try<cudaMemcpyAsync> uses were reverted to runtime-status form cuda_try(cudaMemcpyAsync(...)) to maintain compiler compatibility (Clang rejects treating overload sets as a single template parameter).
The leak-guard uses SCOPE(fail) to call cudaFreeAsync on partially-initialized device allocations; that cleanup remains invoked via the existing safe wrapper in this diff.

API / compatibility

No exported/public API signature changes.
Success-path behavior unchanged; error-path behavior changes from abort to throwing exceptions (and added leak-guard prevents device allocation leaks on initialization failure).

Validation / tests

Local C++20 build succeeded for: cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction, and cudax.test.stf.reductions.reduce_sum.
CI expected to run on the cudax matrix.
Comments captured: three automated "/ok to test" author comments and one "placeholder" author comment; no reviewer discussion captured.
Commit notes: include a follow-up fix reverting templated cuda_try<cudaMemcpyAsync> uses to runtime-status form for compiler compatibility.

Walkthrough

Three STF stream-interface files migrate CUDA error handling from cuda_safe_call(...) to cuda_try<...>(...). Hashtable allocation adds a scope-guard cleanup for partial device allocation failures. Slice updates async memcpy and pointer-attribute queries. Reduction ops use cuda_try for stream synchronization.

Changes

STF Stream Interface Error Handling Refactor

Layer / File(s)	Summary
Hash table allocation, initialization, and error-safe deallocation `cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh`	Added `scope_guard.cuh`; async memcpy switched to `cuda_try(cudaMemcpyAsync)`; host allocation path uses `cuda_try<cudaStreamSynchronize>` + `cuda_try(cudaHostAlloc)`; device path uses `cuda_try(cudaMallocAsync)` and `cuda_try<cudaMemsetAsync)` with `SCOPE(fail)` cleanup freeing via `cudaFreeAsync`; deallocation uses `cuda_try`-wrapped sync and frees.
Slice async copies and pointer-attribute query `cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh`	`data_copy` branches (0D/1D/2D/contiguous) now use `cuda_try` wrappers for `cudaMemcpyAsync`/`cudaMemcpy2DAsync`; `get_memory_type` uses `cuda_try<cudaPointerGetAttributes>` to obtain `cudaPointerAttributes` directly.
Reduction op stream synchronization `cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh`	Host-side `cudaStreamSynchronize` calls in `op()` and `init_op()` replaced with `cuda_try<cudaStreamSynchronize>(s)`.

Suggested labels

stf

Suggested reviewers

caugonnet
oleksandr-pavlyk

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…set) cudaMemcpyAsync is an overload set on CTK 13.2+ (cuda_runtime.h adds an alternate-spelling wrapper), so cuda_try<cudaMemcpyAsync> is ill-formed. Clang rejects it; GCC accepts it leniently, which is why local GCC builds missed it. Revert the cudaMemcpyAsync calls in the stream slice and hashtable interfaces to the runtime-status cuda_try(cudaMemcpyAsync(...)) form. The other (non-overloaded) calls keep the templated form.

andralex · 2026-06-05T01:51:16Z

/ok to test 3ae7ba1

github-actions · 2026-06-05T02:51:44Z

😬 CI Workflow Results

🟥 Finished in 58m 10s: Pass: 94%/55 | Total: 21h 49m | Max: 58m 10s | Hits: 16%/136943

See results here.

andralex · 2026-06-05T22:29:27Z

/ok to test 3ae7ba1

andralex requested a review from a team as a code owner June 5, 2026 01:35

andralex requested a review from srinivasyadav18 June 5, 2026 01:35

github-project-automation Bot added this to CCCL Jun 5, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026

andralex enabled auto-merge (squash) June 5, 2026 01:37

caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-stream-misc

andralex commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andralex commented Jun 5, 2026

Summary

Changes (3 files, 15 sites)

Validation

Test plan

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Notable implementation details / decisions

API / compatibility

Validation / tests

Walkthrough

Changes

Suggested labels

Suggested reviewers

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 58m 10s: Pass: 94%/55 | Total: 21h 49m | Max: 58m 10s | Hits: 16%/136943

Uh oh!

andralex commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading