Skip to content

[libcu++] Make argument namespace and wrappers construction public#9251

Open
pciolkosz wants to merge 4 commits into
NVIDIA:mainfrom
pciolkosz:make_arguments_framework_public
Open

[libcu++] Make argument namespace and wrappers construction public#9251
pciolkosz wants to merge 4 commits into
NVIDIA:mainfrom
pciolkosz:make_arguments_framework_public

Conversation

@pciolkosz
Copy link
Copy Markdown
Contributor

@pciolkosz pciolkosz commented Jun 4, 2026

Closes #9254

This PR removes underscores from the cuda::argument namespace and the user facing argument wrappers and bound objects construction.

Traits and other utility functions remain underscored, those are meant mostly to be used inside CCCL APIs to examine the argument wrappers passed by the user, but they will most likely be public later on, once we are more confident with the design.

@pciolkosz pciolkosz requested review from a team as code owners June 4, 2026 06:04
@pciolkosz pciolkosz requested a review from fbusato June 4, 2026 06:04
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
Copy link
Copy Markdown
Contributor

@davebayer davebayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I'll block this PR. We said there would be a discussion about the naming. I'm still strongly against having a separate namespace for these types.

@github-project-automation github-project-automation Bot moved this from In Review to In Progress in CCCL Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d1c6b83b-e6b1-426b-aee3-11d4b4323496

📥 Commits

Reviewing files that changed from the base of the PR and between 2e084a9 and 2c740b7.

📒 Files selected for processing (3)
  • libcudacxx/include/cuda/__argument/argument.h
  • libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar_traits.fail.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • libcudacxx/include/cuda/__argument/argument.h

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This pull request makes the user-facing CUDA argument wrapper API public by removing leading underscores from names and introducing the cuda::argument header. Wrapper and bounds types, their deduction guides, factory helpers, and the public header were renamed/moved from the internal cuda::argument namespace to cuda::argument. Internal inspection helpers and trait machinery remain underscored for now (exposed under cuda::argument::* where needed).

Key Changes

  • New public header: libcudacxx/include/cuda/argument (forwards to implementation).
  • Renamed wrapper templates (public):
    • __constant → constant
    • __constant_sequence → constant_sequence
    • __immediate → immediate
    • __immediate_sequence → immediate_sequence
    • __deferred → deferred
    • __deferred_sequence → deferred_sequence
  • Renamed bounds types and factories:
    • __no_bounds → no_bounds
    • __static_bounds → static_bounds
    • __runtime_bounds → runtime_bounds
    • __bounds(...) → bounds(...)
  • Trait/inspection helpers remain underscored and are now accessed under cuda::argument::__* (e.g., __traits, __unwrap, _lowest, _highest).
  • Namespace helper macros updated to open/close cuda::argument instead of cuda::__argument.

Codebase Updates

All usages across CUB, benchmarks, kernels, and tests were updated to use the new public names and header. Notable updated files include:

  • cub/benchmarks/bench/segmented_topk/{fixed,variable}/keys.cu
  • cub/cub/agent/agent_batched_topk.cuh
  • cub/cub/detail/segmented_params.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/kernels/kernel_batched_topk.cuh
  • Multiple libcudacxx headers and tests under test/libcudacxx/cuda/argument/*

Search confirms many test and example files now include <cuda/argument> and refer to cuda::argument::{constant, immediate, deferred, static_bounds, runtime_bounds, bounds}, while trait helpers remain used via cuda::argument::__*.

API/ABI Impact

  • Public-facing type names and header usage changed as intended; wrappers and bounds are now available under cuda::argument.
  • Internal trait/helper symbols remain underscored (no intended public change to their semantics); callers that relied on previous underscored names should migrate to the new public wrappers where appropriate.
  • No functional changes intended — most edits are renames, header reorganization, and tests updated accordingly. Some constructor/validation paths tightened bounds checks (see commit notes); reviewers should validate bounds-related behavior where tests exercise it.

Review Notes

  • A blocking comment from @pciolkosz requests a team discussion about naming and expresses opposition to introducing a separate cuda::argument namespace; this should be resolved with the team (they tagged @gevtushenko).
  • Commit-level changes include added assertions (e.g., validating deferred_sequence is a sequence) and tightened bounds validation in some constructors; reviewers should check immediate/immediate_sequence/deferred constructors and related tests for intended behavior.

suggestion:

Walkthrough

This PR replaces internal cuda::__argument wrapper/bounds/traits names with public cuda::argument equivalents, adds a public forwarding header <cuda/argument>, and updates CUB benchmarks, dispatch/kernels/agent, and libcudacxx tests to use the new API. No algorithmic behavior was altered.

Changes

CUDA argument API public-API migration

Layer / File(s) Summary
Core wrapper types and public header
libcudacxx/include/cuda/__argument/argument.h, libcudacxx/include/cuda/__argument/argument_bounds.h, libcudacxx/include/cuda/argument, libcudacxx/include/cuda/std/__internal/namespaces.h
Rename wrapper/bounds types and factories from underscored __* to public * names; update deduction guides, wrapper detection, __is_wrapper_v, __unwrap, __traits_impl, and __lowest_/__highest_; add <cuda/argument> forwarding header and update namespace macros.
CUB dispatch, kernel, agent, and param extraction
cub/cub/device/dispatch/dispatch_batched_topk.cuh, cub/cub/device/dispatch/kernels/kernel_batched_topk.cuh, cub/cub/agent/agent_batched_topk.cuh, cub/cub/detail/segmented_params.cuh
Switch CUB code to accept and query ::cuda::argument wrapper types and traits (e.g., constant, immediate, deferred, __traits, __unwrap) and update kernel parameter element types derived from those traits.
CUB segmented top-k benchmarks using public API
cub/benchmarks/bench/segmented_topk/fixed/keys.cu, cub/benchmarks/bench/segmented_topk/variable/keys.cu
Update includes and argument construction for benchmark dispatches to use ::cuda::argument::immediate/::cuda::argument::immediate_sequence/::cuda::argument::constant and ::cuda::argument::bounds.
CUB device segmented top-k tests
cub/test/catch2_test_device_segmented_topk_keys.cu, cub/test/catch2_test_device_segmented_topk_pairs.cu
Replace internal ::cuda::__argument::__* wrappers in test launch callsites with ::cuda::argument::immediate, immediate_sequence, and bounds.
libcudacxx argument API test suite
libcudacxx/test/libcudacxx/cuda/argument/*
Update all libcudacxx argument tests to include <cuda/argument> and reference cuda::argument wrapper/traits/bounds helpers in assertions and examples; add compile-fail tests validating misuse cases.

Assessment against linked issues

Objective Addressed Explanation
Expose argument annotation framework for public usage in cub::DeviceBatchedTopK (#9254)

Possibly related PRs

  • NVIDIA/cccl#9074: Prior CUB batched-topk work touching segmented-params and argument wrappers.
  • NVIDIA/cccl#8875: Introduces or earlier-migrates the public cuda::argument wrapper/traits API used here.

Suggested labels

libcu++

Suggested reviewers

  • oleksandr-pavlyk
  • fbusato
  • gevtushenko

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar.fail.cpp

libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar.fail.cpp:11:10: fatal error: 'cuda/argument' file not found
11 | #include <cuda/argument>
| ^~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/2c740b7dfa2080377f6c51928e2432fe8f5e4413-db8a65d78cde1563/tmp/clang_command_.tmp.9207c4.txt
++Contents of '/tmp/coderabbit-infer/2c740b7dfa2080377f6c51928e2432fe8f5e4413-db8a65d78cde1563/tmp/clang_command_.tmp.9207c4.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-li

... [truncated 1187 characters] ...

m" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/db8a65d78cde1563/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar.fail.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar_traits.fail.cpp

libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar_traits.fail.cpp:11:10: fatal error: 'cuda/argument' file not found
11 | #include <cuda/argument>
| ^~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/2c740b7dfa2080377f6c51928e2432fe8f5e4413-9226476111f7d90b/tmp/clang_command_.tmp.c88bce.txt
++Contents of '/tmp/coderabbit-infer/2c740b7dfa2080377f6c51928e2432fe8f5e4413-9226476111f7d90b/tmp/clang_command_.tmp.c88bce.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unk

... [truncated 1208 characters] ...

r/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/9226476111f7d90b/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/argument/deferred_sequence_scalar_traits.fail.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4424dd10-8027-4d66-97f4-e3d43a7b6cf4

📥 Commits

Reviewing files that changed from the base of the PR and between 6b7ae38 and 06c277f.

📒 Files selected for processing (20)
  • cub/benchmarks/bench/segmented_topk/fixed/keys.cu
  • cub/benchmarks/bench/segmented_topk/variable/keys.cu
  • cub/cub/agent/agent_batched_topk.cuh
  • cub/cub/detail/segmented_params.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/kernels/kernel_batched_topk.cuh
  • cub/test/catch2_test_device_segmented_topk_keys.cu
  • cub/test/catch2_test_device_segmented_topk_pairs.cu
  • libcudacxx/include/cuda/__argument/argument.h
  • libcudacxx/include/cuda/__argument/argument_bounds.h
  • libcudacxx/include/cuda/argument
  • libcudacxx/include/cuda/std/__internal/namespaces.h
  • libcudacxx/test/libcudacxx/cuda/argument/argument_bounds.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/argument_traits.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/deferred_argument.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/dynamic_argument.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/static_argument.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/static_bounds_conversion.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/static_bounds_type_mismatch.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/argument/usage_example.pass.cpp

Comment thread libcudacxx/include/cuda/__argument/argument.h
@pciolkosz
Copy link
Copy Markdown
Contributor Author

I'm sorry, but I'll block this PR. We said there would be a discussion about the naming. I'm still strongly against having a separate namespace for these types.

Tagging @gevtushenko

@pciolkosz pciolkosz force-pushed the make_arguments_framework_public branch from 6002bc6 to 3210e3d Compare June 4, 2026 06:36
@github-actions

This comment has been minimized.

auto segment_sizes = ::cuda::__argument::__constant<MaxSegmentSize>{};
auto k = ::cuda::__argument::__constant<MaxNumSelected>{};
auto select_direction = ::cuda::__argument::__constant<cub::detail::topk::select::max>{};
auto segment_sizes = ::cuda::argument::constant<MaxSegmentSize>{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++26 has constant_wrapper. It's hard to back-port fully to earlier C++ versions (though there are proposals pending review that might fix this), but for now, it's how C++ spells "compile-time constant." Does this cuda::argument::constant do the same thing? If so, then should we consider just making this an alias to constant_wrapper (or a hypothetical cuda::std::constant_wrapper)?


using segment_size_val_t = typename ::cuda::__argument::__traits<SegmentSizeParameterT>::element_type;
using num_segments_val_t = typename ::cuda::__argument::__traits<NumSegmentsParameterT>::element_type;
using segment_size_val_t = typename ::cuda::argument::__traits<SegmentSizeParameterT>::element_type;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this library usable without exposing __traits to the public API? If not, then should we consider waiting until __traits is ready?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main user facing part is the construction of the wrappers, while our APIs will use the traits to examine them. Long term we want to expose the __traits I think, but we don't have to right now so keeping them private leaves us room to change them easily and make public later

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining! That makes sense.

//! @brief Wraps a compile-time constant argument value.
template <auto _Value>
struct __constant
struct constant
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more or less like constant_wrapper. Should we consider spelling it cuda::std::constant_wrapper?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We considered that design, cuda::std::constant_wrapper is difficult to back port to older c++ standards and the timeline probably doesn't leave enough room to implement it. I think its fine to just interoperate with constant wrapper later on

static_assert(__valid_static_bounds_v<__element_type, _StaticBounds>,
"static argument bounds cannot be represented by the element type");

_Arg __arg_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a public member. That means it's possible to set the value without validating the bounds. Should we instead consider making the value private, so that every constructed immediate instance has been validated?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make it a private member and add an underscored __access structure that is a friend or I can just underscore the variable and have less code this way. I don't mind either way, but we need a way to access it at the end of the day.
But I think this question does apply to __unwrap, which currently has two version, const and not-const reference. I could see the non-const reference version being not necessary, but it's also not public yet, so we don't have to resolve it right now. I might as well remove it and we can add it when needed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pciolkosz Thanks for answering! I don't quite understand the argument why validation is separate from construction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: Why can't users access the value (by value)? Isn't this just a validated value? Why wouldn't this class just have a value() member?

Comment on lines 318 to +319
_Arg __arg_;
__runtime_bounds<__element_type> __runtime_bounds_{};
runtime_bounds<__element_type> __runtime_bounds_{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are public members. That means it's possible to set them without validation. Should we instead consider making the members private, so that every constructed immediate_sequence instance has been validated?


template <class _ElementType, auto _Lowest, auto _Highest>
inline constexpr bool __valid_static_bounds_v<_ElementType, __static_bounds<_Lowest, _Highest>> =
inline constexpr bool __valid_static_bounds_v<_ElementType, static_bounds<_Lowest, _Highest>> =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are runtime_bounds and static_bounds. Do we also need a representation of half-run-time, half-static bounds?

submdspan represents bounds generally as any type for which structured binding into two elements, both convertible to index_type, is well-formed. This would let users pass in bounds however they like. It also would permit a unified representation of any combination of run-time and static bounds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the template parameters' types are unconstrained here. What are the actual requirements on these types?

Copy link
Copy Markdown
Contributor Author

@pciolkosz pciolkosz Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can tighten the static bounds with runtime bounds, so in this case you can do static_bounds<0, 1000> and runtime_bounds(100, 1000) which effectively would be a half-run-time, half-static bounds. It avoids overcomplicating the input types at the expense of this case being a bit more verbose, but I think it should be rare anyway.

Also, the template parameters' types are unconstrained here. What are the actual requirements on these types?

Good point, I added type validation

@@ -410,7 +410,7 @@
"static argument bounds cannot be represented by the element type");

_Arg __arg_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my above comments on public members and bounds checking; thanks!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

🥳 CI Workflow Results

🟩 Finished in 4h 05m: Pass: 100%/340 | Total: 5d 20h | Max: 2h 07m | Hits: 94%/499980

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Expose argument annotation framework for public usage in cub::DeviceBatchedTopK

3 participants