[libcu++] Adds exec::guarantee and the max_total_num_items guarantee #9278

Open
elstehle wants to merge 3 commits into NVIDIA:main from elstehle:fea/total-num-items-guarantees

elstehle (Contributor) commented Jun 5, 2026

Closes #9279

Description

Adds cuda::execution::guarantee together with its first guarantee, cuda::execution::max_total_num_items. Where require lets a caller demand properties from an algorithm, guarantee lets a caller promise properties of the problem that an algorithm may exploit. Guarantees are bundled with guarantee(...) and surfaced through a dedicated __get_guarantees query, mirroring require.

max_total_num_items communicates an upper bound on the total number of items processed (e.g. the combined size of all segments in cub::DeviceBatchedTopK), which an algorithm can use to size intermediate offset types. Because this bound often cannot be attached to a single parameter (as in DeviceBatchedTopK and other segmented algorithms), we decided it belongs in the guarantees API.
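To illustrate how an algorithm might use such a bound to size its offsets, here is a minimal standalone sketch. It does not use the API from this PR: offset_for_t and total_items are hypothetical names, and the 32-bit/64-bit split is an assumption about what "sizing intermediate offset types" could look like.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not part of CCCL): picks the narrowest unsigned offset
// type that can represent MaxItems.
template <std::uint64_t MaxItems>
using offset_for_t =
  std::conditional_t<(MaxItems <= std::numeric_limits<std::uint32_t>::max()),
                     std::uint32_t,
                     std::uint64_t>;

// Hypothetical segmented routine: sums the segment sizes and returns the total
// in the offset type selected from the promised bound.
template <std::uint64_t MaxItems>
offset_for_t<MaxItems> total_items(const std::uint32_t* segment_sizes, int num_segments)
{
  std::uint64_t wide_total = 0;
  for (int i = 0; i < num_segments; ++i)
  {
    wide_total += segment_sizes[i];
  }
  // The caller promised this bound; exceeding it is a contract violation.
  assert(wide_total <= MaxItems);
  return static_cast<offset_for_t<MaxItems>>(wide_total);
}

// A bound of one billion fits 32-bit offsets; five billion does not.
static_assert(std::is_same_v<offset_for_t<1'000'000'000>, std::uint32_t>);
static_assert(std::is_same_v<offset_for_t<5'000'000'000>, std::uint64_t>);
```

The point of the sketch is only that the bound describes the whole problem (the sum over all segments) rather than any single argument, which is why it is hard to attach to one parameter.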

Design decisions

  • max_total_num_items first, min_total_num_items later. Lower bounds are presumably rare in practice, so we optimize for the common case and keep the two as separate, composable guarantees (guarantee(max_total_num_items<N>(), min_total_num_items<M>())) instead of one lower+upper guarantee.
  • Compile-time and runtime upper bounds, both first-class: max_total_num_items<N>() (static), max_total_num_items(n) (runtime), and max_total_num_items<N>(n) (static bound + runtime refinement, asserting n <= N).
  • Inferred integral bound type rather than a hard-coded int64_t: a 32-bit bound stays 32-bit instead of widening to 64-bit, so max_total_num_items(1000000) still provides an int32 static upper bound. Narrower types can be requested explicitly (max_total_num_items<cuda::std::int16_t{1000}>()).
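The three construction forms and the inferred bound type can be sketched in plain C++17 as follows. This is a mock written for illustration, not the implementation in this PR: max_items_holder and max_items are hypothetical names, and using the type's maximum as the static bound in the runtime-only case is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the holder type; names and layout are assumptions.
template <class T, T StaticBound>
struct max_items_holder
{
  static_assert(std::is_integral_v<T>, "bounds must be integral");
  using bound_type                = T;
  static constexpr T static_bound = StaticBound;
  T runtime_bound                 = StaticBound; // refined bound, at most static_bound
};

// (1) Compile-time bound: the type is inferred from the constant (int for 1000).
template <auto N>
constexpr auto max_items()
{
  return max_items_holder<decltype(N), N>{};
}

// (2) Runtime bound: the static bound falls back to the maximum of the argument type.
template <class T>
constexpr auto max_items(T n)
{
  return max_items_holder<T, std::numeric_limits<T>::max()>{n};
}

// (3) Static bound plus runtime refinement; n must not exceed N.
template <auto N>
constexpr auto max_items(decltype(N) n)
{
  assert(n <= N);
  return max_items_holder<decltype(N), N>{n};
}

// An int constant stays int; a narrower type can be requested explicitly.
static_assert(std::is_same_v<decltype(max_items<1000>())::bound_type, int>);
static_assert(std::is_same_v<decltype(max_items<std::int16_t{1000}>())::bound_type, std::int16_t>);
```

In this mock, max_items<1000>(10) keeps 1000 as the static bound and records 10 as the runtime refinement, while max_items(42) only knows that the bound fits in an int.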

Example

#include <cuda/execution.guarantee.h>
#include <cuda/execution.max_total_num_items.h>

namespace ex = cuda::execution;

// Compile-time upper bound (type inferred: fits int -> 32-bit offsets):
auto static_env = cuda::std::execution::env{ex::guarantee(ex::max_total_num_items<1'000'000'000>())};

// ... or a runtime upper bound (num_items is an integral value supplied by the caller):
auto runtime_env = cuda::std::execution::env{ex::guarantee(ex::max_total_num_items(num_items))};

// Passed to an algorithm that understands the guarantee, e.g. (once wired up) cub::DeviceBatchedTopK(..., env);

@elstehle elstehle requested a review from a team as a code owner June 5, 2026 08:52
@elstehle elstehle requested a review from Jacobfaib June 5, 2026 08:52
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 5, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 5, 2026
@elstehle elstehle requested review from ericniebler and pciolkosz June 5, 2026 08:52
coderabbitai Bot (Contributor) commented Jun 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 093424e5-164e-408a-944e-49e814f821b3

📥 Commits

Reviewing files that changed from the base of the PR and between 90b7581 and 38ff9c8.

📒 Files selected for processing (2)
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR introduces a guarantees facility to libcudacxx that lets callers communicate properties of the problem which algorithms can exploit for optimization. The initial guarantee provides an upper bound on the total number of items processed, which is useful for sizing intermediate structures in segmented or batched algorithms.

Key Features

Guarantees Mechanism

  • Adds cuda::execution::guarantee() for bundling guarantee objects into an execution environment.
  • Introduces the __get_guarantees query key for accessing guarantee state from environments; it follows forwarding-query semantics (mirroring require).
  • Adds base exec::__guarantee type as the common base for guarantee types.
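The mechanism described in this list (a common base type, a variadic bundling function that rejects non-guarantees, and a query key on the environment) can be mocked in standard C++ roughly as below. All names here (guarantee_base, bundle_guarantees, get_guarantees_t, env_with_guarantees) are hypothetical, and std::tuple stands in for the execution property the real facility is said to use.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>

// Hypothetical common base; stands in for the facility's guarantee base type.
struct guarantee_base
{};

// Hypothetical guarantee carrying an upper bound.
struct max_items_guarantee : guarantee_base
{
  long long bound;
  explicit constexpr max_items_guarantee(long long b)
      : bound(b)
  {}
};

// Bundles guarantees and rejects, at compile time, any argument that does not
// derive from guarantee_base.
template <class... Gs>
constexpr auto bundle_guarantees(Gs... gs)
{
  static_assert((std::is_base_of_v<guarantee_base, Gs> && ...),
                "every argument must derive from guarantee_base");
  return std::tuple<Gs...>{gs...};
}

// Hypothetical query key: calling it on an environment forwards to env.query(key).
struct get_guarantees_t
{
  template <class Env>
  constexpr auto operator()(const Env& env) const
  {
    return env.query(*this);
  }
};
inline constexpr get_guarantees_t get_guarantees{};

// Minimal environment that answers the get_guarantees query.
template <class Guarantees>
struct env_with_guarantees
{
  Guarantees guarantees;
  constexpr const Guarantees& query(get_guarantees_t) const
  {
    return guarantees;
  }
};
```

An algorithm in this mock would call get_guarantees(env) and inspect the returned bundle; passing a type that does not derive from guarantee_base to bundle_guarantees fails the static_assert, which is the kind of rejection the guarantee.fail.cpp test is described as exercising.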

max_total_num_items Guarantee

  • Provides an upper-bound guarantee representing the total number of items processed.
  • Overloads:
    • Compile-time: max_total_num_items<N>()
    • Runtime: max_total_num_items(n)
    • Hybrid: max_total_num_items<N>(n) with assertion that n <= N
  • Preserves narrow integral types via type inference (avoids widening to 64-bit); supports explicit typed literals (e.g., max_total_num_items<cuda::std::int16_t{1000}>()).
  • Queryable via exec::__get_max_total_num_items token; exposes static and runtime bounds through the holder type.

Design Notes

  • Chooses to expose only an upper-bound guarantee initially (max_total_num_items). A separate min_total_num_items may be added later to maintain composability.
  • Usage pattern shown via constructing an environment: cuda::std::execution::env{ex::guarantee(ex::max_total_num_items<1000>())}.

Files Changed

Implementation Headers

  • libcudacxx/include/cuda/__execution/guarantee.h (+82): Core guarantee mechanism, query key, and guarantee() factory.
  • libcudacxx/include/cuda/__execution/max_total_num_items.h (+140): Holder type storing static and optional runtime bounds, query token, and three factory overloads with integral type inference and validation.

Public Headers

  • libcudacxx/include/cuda/execution.guarantee.h (+26): Public wrapper for guarantee facility.
  • libcudacxx/include/cuda/execution.max_total_num_items.h (+26): Public wrapper for max_total_num_items guarantee.
  • libcudacxx/include/cuda/execution (+2): Umbrella header updated to include the new headers.

Tests

  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp (+39): Validates guarantee environment construction, query extraction, and forwarding-query property.
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp (+31): Negative test covering improper guarantee construction.
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp (+83): Comprehensive tests for compile-time, runtime, and hybrid bounds, type inference, and queryability.
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp (+26): Validates rejection of non-integral bounds.

Summary

This change adds a composable, queryable guarantees API and the first guarantee, max_total_num_items, enabling algorithms to receive an upper bound on total items for more efficient internal sizing. The implementation supports static and runtime bounds (and a hybrid mode), preserves narrow integral types via inference, integrates with the existing execution query infrastructure, and includes positive and negative tests covering the API surface.

Walkthrough

Adds a guarantee facility (base type, query key, variadic guarantee(...)) and a max_total_num_items guarantee with static/runtime bounds, public wrapper headers, and tests exercising positive and negative cases.

Changes

Execution Guarantees Facility

| Layer / File(s) | Summary |
| --- | --- |
| Guarantee facility and query mechanism (libcudacxx/include/cuda/__execution/guarantee.h) | __guarantee base class, __get_guarantees_t query object and global constant, and guarantee(...) variadic template that validates that all arguments derive from __guarantee and packages them into a cuda::std::execution::prop. |
| Max total num items guarantee implementation (libcudacxx/include/cuda/__execution/max_total_num_items.h) | __max_total_num_items_holder_t stores compile-time and runtime bounds with highest() and query(...) accessors. Three max_total_num_items(...) factories provide compile-time-only, runtime-only, and combined construction modes, with an assertion that the runtime bound does not exceed the static bound. |
| Public header exports and integration (libcudacxx/include/cuda/execution, libcudacxx/include/cuda/execution.guarantee.h, libcudacxx/include/cuda/execution.max_total_num_items.h) | Wrapper headers forward to implementation files with system-header pragmas; the main execution umbrella header includes both wrappers to expose the complete guarantee API. |
| Guarantee facility tests (libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp, guarantee.fail.cpp) | guarantee.pass.cpp validates that guarantee(max_total_num_items<1000>()) correctly exposes guarantees through environment queries, with forwarding-query assertions; guarantee.fail.cpp exercises rejection of non-guarantee properties. |
| Max total num items tests (libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp, max_total_num_items.fail.cpp) | max_total_num_items.pass.cpp validates compile-time type deduction, runtime inference, combined bound narrowing, element-type selection, and forwarding-query behavior via static_assert and assert. max_total_num_items.fail.cpp confirms the integral-only constraint via a floating-point rejection test. |

Assessment against linked issues

Objective Addressed Explanation
Add an option to guarantee a static and runtime max_total_num_items to DeviceBatchedTopK (#9279)

Suggested labels

libcu++

Suggested reviewers

  • ericniebler
  • Jacobfaib

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp

libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp:11:10: fatal error: 'cuda/execution.guarantee.h' file not found
11 | #include <cuda/execution.guarantee.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/38ff9c881557f7eceb839816abd62806b124b14f-394afc0731c0f36f/tmp/clang_command_.tmp.5fdf79.txt
++Contents of '/tmp/coderabbit-infer/38ff9c881557f7eceb839816abd62806b124b14f-394afc0731c0f36f/tmp/clang_command_.tmp.5fdf79.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-tripl

... [truncated 1183 characters] ...

nternal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/394afc0731c0f36f/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp" "-O0"
"-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp

libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp:11:10: fatal error: 'cuda/execution.max_total_num_items.h' file not found
11 | #include <cuda/execution.max_total_num_items.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp:16:1-19:1: ERROR translating statement 'CompoundStmt'
Aborting translation of method 'test' in file 'libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp': "Assert_failure src/clang/cAst_utils.ml:249:53"
Uncaught Internal Error: "Assert_failure src/clang/cAst_utils.ml:249:53"
Error backtrace:
Raised at ClangFrontend__CAst_utils.get_decl_from_typ_ptr in file "src/clang/cAst_utils.ml", line 249, characters 53-65
Called from ClangFrontend__CTrans.CTrans_funct.get_destructor_decl_ref in file "src/clang/cTrans.ml", line 658, characters 12-59
Called from ClangFrontend__CTrans.CTrans_funct.destructor_calls.(fun) in file "src/clang/cTr

... [truncated 2200 characters] ...

rc/clang/cFrontend_errors.ml", line 48, characters 6-141
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.add_method in file "src/clang/cFrontend_decl.ml" (inlined), line 54, characters 4-52
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.function_decl in file "src/clang/cFrontend_decl.ml", line 90, characters 12-151
Called from ClangFrontend__CFrontend_decl.CFrontend_decl_funct.translate_one_declaration in file "src/clang/cFrontend_decl.ml", line 453, characters 10-56
Called from Stdlib__List.iter in file "list.ml", line 110, characters 12-15
Called from Stdlib__List.iter in file "list.ml" (inlined), line 110, characters 17-25
Called from Base__List0.iter in file "src/list0.ml" (inlined), line 25, characters 16-35
Called from ClangFrontend__CFrontend.compute_i



coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cc66350-916c-4788-a194-ec55f2aa4233

📥 Commits

Reviewing files that changed from the base of the PR and between 2b21bec and 90b7581.

📒 Files selected for processing (9)
  • libcudacxx/include/cuda/__execution/guarantee.h
  • libcudacxx/include/cuda/__execution/max_total_num_items.h
  • libcudacxx/include/cuda/execution
  • libcudacxx/include/cuda/execution.guarantee.h
  • libcudacxx/include/cuda/execution.max_total_num_items.h
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/guarantee.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.fail.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/max_total_num_items.pass.cpp

Comment thread libcudacxx/test/libcudacxx/cuda/execution/guarantee.fail.cpp
github-actions Bot (Contributor) commented Jun 5, 2026

😬 CI Workflow Results

🟥 Finished in 1h 20m: Pass: 87%/115 | Total: 20h 06m | Max: 53m 56s | Hits: 99%/266428

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add an option to guarantee a static and runtime max_total_num_items to DeviceBatchedTopK

1 participant