cuda::device::warp_match_any#9243

Open
fbusato wants to merge 9 commits into NVIDIA:main from fbusato:warp_match_any

@fbusato fbusato commented Jun 3, 2026

Description

This PR provides cuda::device::warp_match_any, analogous to cuda::device::warp_match_all, for completeness.
The main difference is that warp_match_any returns a lane mask instead of a bool value.
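
To make the difference concrete, here is a minimal host-side C++ sketch of the semantics. It is not the PR's implementation: `model_match_any`, `model_match_all`, `Warp`, and `make_alternating` are hypothetical names, the 32 lanes are modeled as a plain array, and it assumes that only lanes named in `lane_mask` take part in the comparison.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a warp: one 32-bit value per lane.
using Warp = std::array<std::uint32_t, 32>;

// Models match-any for the calling lane `self`: returns the set of
// participating lanes (bits of lane_mask) that hold the same value as `self`.
inline std::uint32_t model_match_any(const Warp& values, unsigned self, std::uint32_t lane_mask)
{
  assert(lane_mask != 0 && self < 32 && ((lane_mask >> self) & 1u));
  std::uint32_t result = 0;
  for (unsigned lane = 0; lane < 32; ++lane)
  {
    const bool participates = ((lane_mask >> lane) & 1u) != 0;
    if (participates && values[lane] == values[self])
    {
      result |= (1u << lane);
    }
  }
  return result;
}

// Models match-all: true only if every participating lane holds the same value.
inline bool model_match_all(const Warp& values, unsigned self, std::uint32_t lane_mask)
{
  return model_match_any(values, self, lane_mask) == lane_mask;
}

// Illustration helper: lane i holds i % 2.
inline Warp make_alternating()
{
  Warp w{};
  for (unsigned i = 0; i < 32; ++i)
  {
    w[i] = i % 2;
  }
  return w;
}
```

With alternating values, an even lane gets back the mask of all even lanes, while the match-all model reports false for the full warp and true once the mask is restricted to the even lanes.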

@fbusato fbusato self-assigned this Jun 3, 2026
@fbusato fbusato requested a review from a team as a code owner June 3, 2026 21:18
@fbusato fbusato added the libcu++ For all items related to libcu++ label Jun 3, 2026
@fbusato fbusato requested a review from a team as a code owner June 3, 2026 21:18
@fbusato fbusato requested a review from gonidelis June 3, 2026 21:18
@fbusato fbusato added this to CCCL Jun 3, 2026
@fbusato fbusato requested a review from griwes June 3, 2026 21:18
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 3, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 3, 2026

coderabbitai Bot commented Jun 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 13b76f61-97d2-40ca-80d0-e5837894e2c8

📥 Commits

Reviewing files that changed from the base of the PR and between a17e799 and de51be0.

📒 Files selected for processing (1)
  • libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR introduces cuda::device::warp_match_any, a new warp-level utility function that returns a mask of lanes whose values match the calling lane. It mirrors the existing cuda::device::warp_match_all but provides a lane-value equality mask instead of a boolean result.

Changes

Core Implementation

  • New header: libcudacxx/include/cuda/__warp/warp_match_any.h
    • Implements the templated warp_match_any function for SM_70+
    • Supports any trivially copyable type _Tp
    • Handles padding bit clearing when __builtin_clear_padding is available; otherwise requires bitwise-comparable types
    • Processes data in 32-bit chunks using __match_any_sync intrinsic and intersects per-chunk results
    • Adds an extern "C" device stub to signal unsupported pre-SM_70 targets
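
The chunking and intersection steps listed above can be modeled on the host. The following is a sketch under stated assumptions rather than the header's code: `model_match_any_chunked` and `make_upper_word_split` are hypothetical names, lanes are modeled as an array, `std::memcpy` into zero-initialized 32-bit words stands in for the serialization, the inner loop stands in for one 32-bit match-any call, and the type is assumed to have no padding bits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical host-side model: match-any over a trivially copyable T,
// compared one 32-bit chunk at a time, intersecting the per-chunk masks.
template <class T>
std::uint32_t model_match_any_chunked(const std::array<T, 32>& values, unsigned self, std::uint32_t lane_mask)
{
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  assert(lane_mask != 0 && self < 32);

  constexpr std::size_t num_chunks = (sizeof(T) + 3) / 4;

  // Serialize every lane's value into zero-padded 32-bit words.
  std::array<std::array<std::uint32_t, num_chunks>, 32> words{};
  for (unsigned lane = 0; lane < 32; ++lane)
  {
    std::memcpy(words[lane].data(), &values[lane], sizeof(T));
  }

  std::uint32_t result = lane_mask; // start from all participating lanes
  for (std::size_t c = 0; c < num_chunks; ++c)
  {
    std::uint32_t chunk_mask = 0; // models one 32-bit match-any call
    for (unsigned lane = 0; lane < 32; ++lane)
    {
      if (((lane_mask >> lane) & 1u) && words[lane][c] == words[self][c])
      {
        chunk_mask |= (1u << lane);
      }
    }
    result &= chunk_mask; // a lane must match in every chunk
  }
  return result;
}

// Illustration helper: 8-byte values whose two halves of the warp differ
// in only one of the two 32-bit words.
inline std::array<std::uint64_t, 32> make_upper_word_split()
{
  std::array<std::uint64_t, 32> v{};
  for (unsigned i = 0; i < 32; ++i)
  {
    v[i] = (static_cast<std::uint64_t>(i / 16) << 32) | 7u;
  }
  return v;
}
```

The intersection is what makes the result correct for types wider than 32 bits: two lanes that agree in one word but differ in another drop out of the final mask.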

API Exposure

  • Updated libcudacxx/include/cuda/warp header to include and export the new functionality

Documentation

  • New documentation page: docs/libcudacxx/extended_api/warp/warp_match_any.rst
    • Describes function signature, semantics, constraints, preconditions (SM ≥ 70, non-zero lane mask), undefined-behavior rules, performance notes, and references
    • Includes example kernel and a note about padding-related compile errors unless __builtin_clear_padding is supported
  • Updated documentation: docs/libcudacxx/extended_api/warp/warp_match_all.rst
    • Clarified memory ordering behavior and lane mask constraints
    • Refined T type constraints regarding padding bit handling
    • Updated example and references
  • Updated index: docs/libcudacxx/extended_api/warp.rst
    • Added warp_match_any entry to the feature table (CCCL 3.5.0, CUDA 13.5)

Testing

  • New test file: libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp
    • Tests across multiple input types (integral widths, conditional __uint128_t, small structs/arrays)
    • Validates behavior with different lane-mask patterns (prefix and strided masks)
    • Includes tests for all-equal scenarios and alternating value patterns
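
As a rough illustration of the prefix and strided lane-mask patterns mentioned above, here is a host-side C++ sketch. The helper names `make_low_mask` and `make_stride_mask` appear in the review summary, but the signatures and bodies below are assumptions made for illustration, not the test file's code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed signature: mask selecting the lowest `count` lanes (1 <= count <= 32).
inline std::uint32_t make_low_mask(unsigned count)
{
  assert(count >= 1 && count <= 32);
  // Shifting a 32-bit value by 32 is undefined, so handle the full warp separately.
  return count == 32 ? 0xFFFFFFFFu : ((1u << count) - 1u);
}

// Assumed signature: mask selecting lanes 0, stride, 2 * stride, ... (stride >= 1).
inline std::uint32_t make_stride_mask(unsigned stride)
{
  assert(stride >= 1);
  std::uint32_t mask = 0;
  for (unsigned lane = 0; lane < 32; lane += stride)
  {
    mask |= (1u << lane);
  }
  return mask;
}
```

Both patterns always include lane 0, so the resulting masks are never zero, which keeps them compatible with the non-zero lane mask precondition.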

Walkthrough

Adds cuda::device::warp_match_any: templated device function that serializes values into 32-bit chunks (optional padding clearing), calls __match_any_sync per chunk on SM_70+, intersects chunk masks, and returns the matching lane_mask. Includes tests, public header wiring, a new docs page, and related warp documentation updates.

Changes

warp_match_any Feature

Layer (File(s)): Summary

  • Core warp_match_any implementation (libcudacxx/include/cuda/__warp/warp_match_any.h): Templated device function with input validation (trivially copyable, non-zero lane mask), chunk-based data serialization with optional __builtin_clear_padding support, per-chunk __match_any_sync invocation (SM_70+ with unsupported-before-SM_70 stub), and mask intersection returning final lane mask.
  • Warp module wiring (libcudacxx/include/cuda/warp): Include directive for cuda/__warp/warp_match_any.h added to expose the new function.
  • Device validation tests (libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp): Device helpers (make_low_mask, make_stride_mask), templated test routines validating all-equal lanes and alternating even/odd values, kernel instantiation for integral types, conditional __uint128_t, and custom types (char3, cuda::std::array<char,6>), plus host entry point using NV_DISPATCH_TARGET.
  • warp_match_any documentation (docs/libcudacxx/extended_api/warp/warp_match_any.rst): New user guide documenting signature, semantics, parameters, return value, constraints on T, execution preconditions (SM >= 70, non-zero lane mask), UB requirements, performance notes, references, and an example with Godbolt link.
  • Related documentation updates (docs/libcudacxx/extended_api/warp.rst, docs/libcudacxx/extended_api/warp/warp_match_all.rst): Warp index toctree and feature table updated to include warp_match_any. warp_match_all docs clarified for memory-ordering, updated T constraint/padding wording, expanded lane_mask UB requirements, example comment change, and Godbolt link replacement.

Possibly Related PRs

  • NVIDIA/cccl#9192: Introduces the _CCCL_BUILTIN_CLEAR_PADDING macro and pattern for handling padding in warp-level operations, directly related to this PR's padding-handling logic.

Suggested Reviewers

  • ericniebler
  • gonidelis

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp

libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp:14:10: fatal error: 'cuda/std/array' file not found
14 | #include <cuda/std/array>
| ^~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/de51be0a903576c5b25d8c1cbfce8c8deeaf8147-bc434e2a457b67d9/tmp/clang_command_.tmp.33d91a.txt
++Contents of '/tmp/coderabbit-infer/de51be0a903576c5b25d8c1cbfce8c8deeaf8147-bc434e2a457b67d9/tmp/clang_command_.tmp.33d91a.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-

... [truncated 1152 characters] ...

nternal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/bc434e2a457b67d9/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp" "-O0"
"-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
docs/libcudacxx/extended_api/warp/warp_match_any.rst (1)

36-37: 💤 Low value

suggestion: Constraints omit the is_bitwise_comparable requirement that the sibling warp_match_all doc lists for the no-__builtin_clear_padding case. Line 37 here describes padding but doesn't reference is_bitwise_comparable, while the implementation static_asserts it. Align with warp_match_all.rst line 37 for consistency.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 20e5a646-5b0a-4a6c-89fb-56027b96cc9e

📥 Commits

Reviewing files that changed from the base of the PR and between b336700 and d5c0bdb.

📒 Files selected for processing (6)
  • docs/libcudacxx/extended_api/warp.rst
  • docs/libcudacxx/extended_api/warp/warp_match_all.rst
  • docs/libcudacxx/extended_api/warp/warp_match_any.rst
  • libcudacxx/include/cuda/__warp/warp_match_any.h
  • libcudacxx/include/cuda/warp
  • libcudacxx/test/libcudacxx/cuda/warp/warp_match_any.pass.cpp

Comment on lines +57 to +59
auto __data_copy = __data;
_CCCL_BUILTIN_CLEAR_PADDING(&__data_copy);
const auto __data_ptr = ::cuda::std::addressof(__data_copy);
Contributor

Important: This introduces a needless copy. I believe this should only copy if is_bitwise_comparable_v is false

Contributor Author

@fbusato fbusato Jun 4, 2026

Yes, I'm aware of this issue. However, is_bitwise_comparable_v also checks padding, so I need to introduce another (internal) trait for that. I will open a second PR.

Contributor

Fine by me, although I would like to have an integer overload that does not do any of that and just forwards to the builtin

Contributor Author

I don't think it is needed. A direct call to the intrinsic and cuda::device::warp_match_any produce identical code, as expected: https://godbolt.org/z/sKaWz5dv1

@jrhemstad
Collaborator

question: Should this be cuda::device::warp_match_any to be symmetrical with cuda::device::warp_match_all?

@fbusato
Contributor Author

fbusato commented Jun 4, 2026

> question: Should this be cuda::device::warp_match_any to be symmetrical with cuda::device::warp_match_all?

Sorry, only the title is wrong; the implementation is correct. Let me update it.

@fbusato fbusato changed the title from cuda::device::warp_any_match to cuda::device::warp_match_any Jun 4, 2026
fbusato and others added 4 commits June 4, 2026 09:24
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
@fbusato fbusato requested a review from miscco June 4, 2026 16:35
@fbusato fbusato enabled auto-merge (squash) June 4, 2026 16:35
@github-actions

github-actions Bot commented Jun 5, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 12m: Pass: 100%/115 | Total: 1d 12h | Max: 1h 00m | Hits: 99%/336704

See results here.

Labels

libcu++ For all items related to libcu++

Projects

Status: In Review

3 participants