Skip to content

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268

Open
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-stream-misc
Open

cudax/stf: migrate stream/interfaces/ from cuda_safe_call to cuda_try#9268
andralex wants to merge 2 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cuda-try-stream-misc

Conversation

@andralex
Copy link
Copy Markdown
Contributor

@andralex andralex commented Jun 5, 2026

Summary

Migrates the cudax/include/cuda/experimental/__stf/stream/interfaces/ data interfaces (hashtable, slice, slice reduction ops) from cuda_safe_call to cuda_try. Part of the ongoing STF cuda_safe_call -> cuda_try rollout; the large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are handled in separate PRs.

Changes (3 files, 15 sites)

  • Templated cuda_try<F> for single-function calls: cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes (out-param -> returned cudaPointerAttributes).
  • Kept runtime-status cuda_try(...) for overload sets (cuda_runtime.h templated wrappers): cudaHostAlloc, cudaMallocAsync.
  • hashtable_linearprobing.cuh leak guard: in stream_data_allocate, after the device cudaMallocAsync succeeds, the buffer is freed via SCOPE(fail) if the following cudaMemsetAsync throws — closing the leak the new throw path would otherwise introduce. Adds scope_guard.cuh.

Validation

Built locally (cpp20): cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction (exercises the 2D cudaMemcpy2DAsync path), cudax.test.stf.reductions.reduce_sum — all compile and link.

Test plan

  • CI green on the cudax matrix
  • No success-path behavior change; new behavior is throw-vs-abort plus the alloc leak-guard

Covers the stream-backend data interfaces (hashtable, slice, slice reduction
ops). Uses the templated cuda_try<F> form for single-function calls
(cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize,
cudaFreeHost, cudaFreeAsync, cudaPointerGetAttributes).

cudaHostAlloc and cudaMallocAsync stay in the runtime-status form: both are
overload sets (cuda_runtime.h templated wrappers), so cuda_try<F> cannot name
them.

hashtable stream_data_allocate: after the device cudaMallocAsync succeeds, the
buffer is freed via SCOPE(fail) if the subsequent cudaMemsetAsync throws, so the
new throw path does not leak the allocation. Adds scope_guard.cuh.

Verified locally by building cudax.test.stf.hashtable.test,
cudax.test.stf.reductions.slice2d_reduction (2D cudaMemcpy2DAsync path), and
cudax.test.stf.reductions.reduce_sum.
@andralex andralex requested a review from a team as a code owner June 5, 2026 01:35
@andralex andralex requested a review from srinivasyadav18 June 5, 2026 01:35
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

placeholder

@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026
@andralex andralex enabled auto-merge (squash) June 5, 2026 01:37
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test b6dd048

@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test 33ffac5

@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented Jun 5, 2026

/ok to test 33ffac5

@andralex, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cb3e4545-7fad-4c69-89e0-f7ba2d58ee54

📥 Commits

Reviewing files that changed from the base of the PR and between b6dd048 and 3ae7ba1.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
  • cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
🚧 Files skipped from review as they are similar to previous changes (2)
  • cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
  • cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR migrates CUDA runtime error handling in the data interfaces under cudax/include/cuda/experimental/__stf/stream/interfaces/ from abort-on-error cuda_safe_call(...) to exception-based cuda_try(...). Three interface files are changed (hashtable, slice, slice_reduction_ops). Large stream files (event_types.cuh, stream_ctx.cuh, stream_task.cuh) are intentionally excluded for separate treatment.

Changes

Files modified: 3 | Sites changed: 15 | Lines changed: +26/-18

  • hashtable_linearprobing.cuh

    • Replaced many CUDA runtime calls with cuda_try variants for copy/alloc/free/sync flows.
    • Added scope_guard.cuh and introduced a SCOPE(fail) leak-guard: after successful device cudaMallocAsync, a subsequent failing cudaMemsetAsync will trigger a cleanup that frees the allocated buffer (cleanup call itself remained wrapped with the legacy cuda_safe_call).
    • Host fallback/alloc path and deallocation paths switched to cuda_try (cuda_try<cudaStreamSynchronize>, cuda_try<cudaFreeHost>, cuda_try<cudaFreeAsync>) where applicable.
  • slice.cuh

    • Converted async memcpy calls (0D/1D/2D/contiguous higher-dimensional branches) from cuda_safe_call(...) to cuda_try forms. Uses templated cuda_try<F> for single-function calls where possible.
    • Rewrote get_memory_type to use cuda_try<cudaPointerGetAttributes> returning cudaPointerAttributes instead of filling an out-parameter.
  • slice_reduction_ops.cuh

    • Switched host-side stream synchronization in op and init_op from cuda_safe_call(cudaStreamSynchronize(...)) to cuda_try<cudaStreamSynchronize>(...).

Notable implementation details / decisions

  • Introduced a templated cuda_try<F> for single-function CUDA calls (used for functions like cudaMemcpyAsync, cudaMemcpy2DAsync, cudaMemsetAsync, cudaStreamSynchronize, cudaFreeHost, cudaFreeAsync, and cudaPointerGetAttributes).
  • Kept runtime-status cuda_try(...) form for overload sets provided by cuda_runtime.h (e.g., cudaHostAlloc, cudaMallocAsync) because those cannot be referenced as a single function template parameter.
  • After discovering that cudaMemcpyAsync is an overload set (CTK 13.2+ runtime wrapper), some attempted cuda_try<cudaMemcpyAsync> uses were reverted to runtime-status form cuda_try(cudaMemcpyAsync(...)) to maintain compiler compatibility (Clang rejects treating overload sets as a single template parameter).
  • The leak-guard uses SCOPE(fail) to call cudaFreeAsync on partially-initialized device allocations; that cleanup remains invoked via the existing safe wrapper in this diff.

API / compatibility

  • No exported/public API signature changes.
  • Success-path behavior unchanged; error-path behavior changes from abort to throwing exceptions (and added leak-guard prevents device allocation leaks on initialization failure).

Validation / tests

  • Local C++20 build succeeded for: cudax.test.stf.hashtable.test, cudax.test.stf.reductions.slice2d_reduction, and cudax.test.stf.reductions.reduce_sum.
  • CI expected to run on the cudax matrix.
  • Comments captured: three automated "/ok to test" author comments and one "placeholder" author comment; no reviewer discussion captured.
  • Commit notes: include a follow-up fix reverting templated cuda_try<cudaMemcpyAsync> uses to runtime-status form for compiler compatibility.

Walkthrough

Three STF stream-interface files migrate CUDA error handling from cuda_safe_call(...) to cuda_try<...>(...). Hashtable allocation adds a scope-guard cleanup for partial device allocation failures. Slice updates async memcpy and pointer-attribute queries. Reduction ops use cuda_try for stream synchronization.

Changes

STF Stream Interface Error Handling Refactor

Layer / File(s) Summary
Hash table allocation, initialization, and error-safe deallocation
cudax/include/cuda/experimental/__stf/stream/interfaces/hashtable_linearprobing.cuh
Added scope_guard.cuh; async memcpy switched to cuda_try(cudaMemcpyAsync); host allocation path uses cuda_try<cudaStreamSynchronize> + cuda_try(cudaHostAlloc); device path uses cuda_try(cudaMallocAsync) and cuda_try<cudaMemsetAsync) with SCOPE(fail) cleanup freeing via cudaFreeAsync; deallocation uses cuda_try-wrapped sync and frees.
Slice async copies and pointer-attribute query
cudax/include/cuda/experimental/__stf/stream/interfaces/slice.cuh
data_copy branches (0D/1D/2D/contiguous) now use cuda_try wrappers for cudaMemcpyAsync/cudaMemcpy2DAsync; get_memory_type uses cuda_try<cudaPointerGetAttributes> to obtain cudaPointerAttributes directly.
Reduction op stream synchronization
cudax/include/cuda/experimental/__stf/stream/interfaces/slice_reduction_ops.cuh
Host-side cudaStreamSynchronize calls in op() and init_op() replaced with cuda_try<cudaStreamSynchronize>(s).

Suggested labels

stf

Suggested reviewers

  • caugonnet
  • oleksandr-pavlyk

Comment @coderabbitai help to get the list of available commands and usage tips.

…set)

cudaMemcpyAsync is an overload set on CTK 13.2+ (cuda_runtime.h adds an
alternate-spelling wrapper), so cuda_try<cudaMemcpyAsync> is ill-formed.
Clang rejects it; GCC accepts it leniently, which is why local GCC builds
missed it. Revert the cudaMemcpyAsync calls in the stream slice and
hashtable interfaces to the runtime-status cuda_try(cudaMemcpyAsync(...))
form. The other (non-overloaded) calls keep the templated form.
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test 3ae7ba1

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 58m 10s: Pass: 94%/55 | Total: 21h 49m | Max: 58m 10s | Hits: 16%/136943

See results here.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 5, 2026
@andralex
Copy link
Copy Markdown
Contributor Author

andralex commented Jun 5, 2026

/ok to test 3ae7ba1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants