[libcu++] Adds a cuda::execution::tie_break requirement folded into determinism#9269
[libcu++] Adds a cuda::execution::tie_break requirement folded into determinism#9269elstehle wants to merge 3 commits into
cuda::execution::tie_break requirement folded into determinism#9269Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
OverviewThis PR embeds an optional tie-break preference into cuda::execution::determinism so callers can specify which equal elements are selected when algorithms encounter ties at a selection boundary. Tie-break values are subordinate to a determinism guarantee and cannot be requested on their own. Algorithms that do not support tie-break (e.g., reduce, scan, segmented reduce) now trigger a single, clear compile-time diagnostic if a tie-break is supplied. API Design and UsageTie-break constants (in cuda::execution::determinism::tie_break):
Example: Tie-break holders are not standalone requirements; attaching a tie-break to determinism::not_guaranteed is disallowed via static_assert. Implementation Detailslibcudacxx/include/cuda/__execution/determinism.h:
CUB integration (cub/*):
Tests:
Scope & Impact
MotivationResolves the request to let callers of top-k-style algorithms (e.g., DeviceBatchedTopK) choose deterministic tie-breaking behavior so that the composition of the result set is predictable when equal elements straddle selection boundaries. Walkthroughimportant: PR adds a tie-break preference to determinism requirements, implements a compile-time guard rejecting tie-breaks where unsupported, updates CUB reduce/scan/segmented_reduce to apply the guard, and adds compile-time tests. ChangesDeterminism tie-break support
Assessment against linked issues
Out-of-scope changes
Possibly related PRs
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Infer (1.2.0)libcudacxx/test/libcudacxx/cuda/execution/determinism.pass.cpplibcudacxx/test/libcudacxx/cuda/execution/determinism.pass.cpp:11:10: fatal error: 'cuda/execution.determinism.h' file not found ... [truncated 1195 characters] ... ernal-isystem" "/usr/local/include" "-internal-isystem" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
libcudacxx/include/cuda/__execution/determinism.h (1)
87-90: ⚡ Quick winsuggestion: Complete the Doxygen contract for
operator()documentation.
Line 87 documents a non-void function with Doxygen but omits@param[in]and@return; add both to keep generated docs and header comments consistent.As per coding guidelines: "When a function is documented with Doxygen, it must include:
//!@brief, `//! `@param`[in/out/in,out]` for every parameter, and `//! `@returnfor non-void functions."libcudacxx/test/libcudacxx/cuda/execution/determinism.pass.cpp (1)
35-38: ⚡ Quick winsuggestion: Extract the
gpu_to_gputie-break type and verify itsvalueandtie_breakmembers individually (like ther2r_smallertest) for symmetric coverage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d3c370d9-12ae-4364-aff8-5d7b43dc2a26
📒 Files selected for processing (5)
cub/cub/device/device_reduce.cuhcub/cub/device/device_scan.cuhcub/cub/device/device_segmented_reduce.cuhlibcudacxx/include/cuda/__execution/determinism.hlibcudacxx/test/libcudacxx/cuda/execution/determinism.pass.cpp
| using default_determinism_t = ::cuda::execution::determinism::__validate_no_tie_break_t< | ||
| ::cuda::std::execution::__query_result_or_t<requirements_t, | ||
| ::cuda::execution::determinism::__get_determinism_t, | ||
| ::cuda::execution::determinism::run_to_run_t>; | ||
| ::cuda::execution::determinism::run_to_run_t>>; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== DeviceReduce determinism guard sites =="
rg -n -C2 '__validate_no_tie_break_t|__get_determinism_t|ReduceByKey\s*\(' cub/cub/device/device_reduce.cuh
echo
echo "== env dispatch helper implementation =="
fd -a 'env_dispatch\.cuh$' cub | while read -r f; do
echo "-- $f --"
rg -n -C4 'dispatch_with_env(_and_tuning)?|__get_determinism_t|__validate_no_tie_break_t' "$f"
doneRepository: NVIDIA/cccl
Length of output: 4738
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== device_reduce.cuh: find ReduceByKey env-based overload and its dispatch path =="
rg -n "ReduceByKey\\s*\\(" cub/cub/device/device_reduce.cuh | head -n 50
echo
rg -n "dispatch_with_env(_and_tuning)?" cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show around env-based ReduceByKey (near match at line ~2215) =="
sed -n '2190,2285p' cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show further around implementation starting near 2390 =="
sed -n '2360,2465p' cub/cub/device/device_reduce.cuh
echo
echo "== env_dispatch.cuh: search determinism-related tokens =="
rg -n "determinism|tie_break|__validate_no_tie_break_t|__get_determinism_t" cub/cub/detail/env_dispatch.cuh
echo
echo "== env_dispatch.cuh: show whole file with line numbers (bounded) =="
wc -l cub/cub/detail/env_dispatch.cuh
sed -n '1,160p' cub/cub/detail/env_dispatch.cuhRepository: NVIDIA/cccl
Length of output: 11123
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== device_reduce.cuh: find ReduceByKey env-based overload and its dispatch path =="
rg -n "ReduceByKey\\s*\\(" cub/cub/device/device_reduce.cuh | head -n 80
echo
echo "== device_reduce.cuh: dispatch_with_env(_and_tuning) usage =="
rg -n "dispatch_with_env(_and_tuning)?" cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show around env-based ReduceByKey (near line ~2215 from prior output) =="
sed -n '2185,2285p' cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show env-parameter overload signature area (around ~2390 from prior output) =="
sed -n '2365,2455p' cub/cub/device/device_reduce.cuh
echo
echo "== env_dispatch.cuh: search determinism-related tokens =="
rg -n "determinism|tie_break|__validate_no_tie_break_t|__get_determinism_t" cub/cub/detail/env_dispatch.cuh
echo
echo "== env_dispatch.cuh: show full file (wc + sed) =="
wc -l cub/cub/detail/env_dispatch.cuh
sed -n '1,220p' cub/cub/detail/env_dispatch.cuhRepository: NVIDIA/cccl
Length of output: 11220
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== device_reduce.cuh: ReduceByKey overloads =="
rg -n "ReduceByKey\\s*\\(" cub/cub/device/device_reduce.cuh | head -n 120
echo
echo "== device_reduce.cuh: dispatch_with_env(_and_tuning) usage =="
rg -n "dispatch_with_env(_and_tuning)?" cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show around env-based ReduceByKey declaration/usage (near ~2215) =="
sed -n '2190,2265p' cub/cub/device/device_reduce.cuh
echo
echo "== device_reduce.cuh: show around ReduceByKey definitions starting near ~2390 =="
sed -n '2365,2465p' cub/cub/device/device_reduce.cuh
echo
echo "== env_dispatch.cuh: determinism/tie-break relevance =="
rg -n "determinism|tie_break|__validate_no_tie_break_t|__get_determinism_t" cub/cub/detail/env_dispatch.cuh
echo
echo "== env_dispatch.cuh: file size + contents (bounded) =="
wc -l cub/cub/detail/env_dispatch.cuh
sed -n '1,240p' cub/cub/detail/env_dispatch.cuhRepository: NVIDIA/cccl
Length of output: 9976
important: cub::DeviceReduce::ReduceByKey(..., EnvT) still accepts tie-break determinism via EnvT without enforcing the no-tie-break rule
The EnvT overload in cub/cub/device/device_reduce.cuh directly calls detail::dispatch_with_env_and_tuning(...) and contains no __validate_no_tie_break_t / determinism static_assert, while cub/cub/detail/env_dispatch.cuh only queries stream/memory/tuning (no determinism/tie_break handling). This allows require(determinism::run_to_run(determinism::tie_break::prefer_smaller_index)) in EnvT to be silently ignored instead of hard-failing like the other guarded helpers.
There was a problem hiding this comment.
Iiuc, it's a pre-existing behaviour, where we invoke the ByKey algorithm that may be non-deterministic for pseudo-associative operations like floating point addition without without validating the requested determinism, so a run_to_run/gpu_to_gpu request is silently accepted rather than rejected.
@srinivasyadav18, as our determinism expert, is this a pre-existing issue and we should start to validate requirements in the interfaces?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| using requested_determinism_t = ::cuda::execution::determinism::__validate_no_tie_break_t< | ||
| ::cuda::std::execution::__query_result_or_t<requirements_t, | ||
| ::cuda::execution::determinism::__get_determinism_t, | ||
| ::cuda::execution::determinism::not_guaranteed_t>; | ||
| ::cuda::execution::determinism::not_guaranteed_t>>; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== DeviceScan determinism guard sites =="
rg -n -C2 'scan_impl_env|__validate_no_tie_break_t|__get_determinism_t' cub/cub/device/device_scan.cuh
echo
echo "== DeviceScan by-key env overloads =="
rg -n -C3 'Exclusive(Sum|Scan)ByKey|Inclusive(Sum|Scan)ByKey|dispatch_with_env\(' cub/cub/device/device_scan.cuh
echo
echo "== env dispatch helper implementation =="
fd -a 'env_dispatch\.cuh$' cub | while read -r f; do
echo "-- $f --"
rg -n -C4 'dispatch_with_env(_and_tuning)?|__get_determinism_t|__validate_no_tie_break_t' "$f"
doneRepository: NVIDIA/cccl
Length of output: 12595
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search determinism validation symbols in env_dispatch.cuh =="
rg -n '__validate_no_tie_break_t|__get_determinism_t' cub/cub/detail/env_dispatch.cuh
echo
echo "== Locate scan_by_key_impl in device_scan.cuh =="
rg -n 'scan_by_key_impl' cub/cub/device/device_scan.cuh
echo
echo "== Context around scan_by_key_impl determinism-related code =="
rg -n -C6 'scan_by_key_impl' cub/cub/device/device_scan.cuh | head -n 200Repository: NVIDIA/cccl
Length of output: 119
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate scan_by_key_impl definition =="
rg -n 'scan_by_key_impl' cub/cub/device/device_scan.cuh
echo
echo "== Search determinism validation symbols near scan_by_key_impl =="
rg -n -C8 '__validate_no_tie_break_t|__get_determinism_t' cub/cub/device/device_scan.cuh | head -n 200
echo
echo "== Show scan_by_key_impl body (first match region) =="
# Print from the first occurrence of scan_by_key_impl through a reasonable window.
line=$(rg -n 'scan_by_key_impl' cub/cub/device/device_scan.cuh | head -n 1 | cut -d: -f1)
start=$((line-20))
end=$((line+220))
sed -n "${start},${end}p" cub/cub/device/device_scan.cuh | cat -nRepository: NVIDIA/cccl
Length of output: 13246
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Count occurrences of tie-break validation in device_scan.cuh =="
rg -n '__validate_no_tie_break_t' cub/cub/device/device_scan.cuh
echo
echo "== Find requested_determinism_t definitions/uses in device_scan.cuh =="
rg -n 'requested_determinism_t' cub/cub/device/device_scan.cuh
echo
echo "== Find scan_impl_determinism usage around by-key =="
rg -n 'scan_impl_determinism' cub/cub/device/device_scan.cuh | head -n 50
echo
echo "== Context around requested_determinism_t within scan_by_key_impl wrapper =="
# Narrow to the first requested_determinism_t occurrence
line=$(rg -n 'requested_determinism_t' cub/cub/device/device_scan.cuh | head -n 1 | cut -d: -f1)
start=$((line-60)); end=$((line+60))
sed -n "${start},${end}p" cub/cub/device/device_scan.cuh | cat -nRepository: NVIDIA/cccl
Length of output: 6763
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find __validate_no_tie_break_t across cub/cub =="
rg -n '__validate_no_tie_break_t' cub/cub | head -n 50
echo
echo "== Find __get_determinism_t across cub/cub =="
rg -n '__get_determinism_t' cub/cub | head -n 50
echo
echo "== Find detail::scan_by_key::dispatch call sites =="
rg -n 'detail::scan_by_key::dispatch' cub/cub | head -n 50
echo
echo "== Find scan_by_key::dispatch definition files =="
rg -n 'namespace scan_by_key|struct dispatch|dispatch\<' cub/cub/detail/scan_by_key* cub/cub/detail 2>/dev/null | head -n 200Repository: NVIDIA/cccl
Length of output: 4415
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate scan_by_key-related detail files =="
fd -a '*scan_by_key*' cub/cub/detail | head -n 200
echo
echo "== Search for 'scan_by_key::dispatch' definitions =="
rg -n 'scan_by_key::dispatch' cub/cub/detail | head -n 50
echo
echo "== Search for tie-break validation in scan_by_key details =="
rg -n '__validate_no_tie_break_t' cub/cub/detail | head -n 50
echo
echo "== Search for determinism query symbols in scan_by_key details =="
rg -n '__get_determinism_t|__query_result_or_t<.*determinism|not_guaranteed_t' cub/cub/detail | head -n 80
echo
echo "== If dispatch is found, show surrounding code =="
file=$(rg -l 'scan_by_key::dispatch' cub/cub/detail | head -n 1 || true)
if [ -n "${file:-}" ]; then
echo "-- First dispatch file: $file --"
rg -n -C4 'scan_by_key::dispatch' "$file" | head -n 120
else
echo "-- No dispatch definition found in cub/cub/detail/ with current search --"
fiRepository: NVIDIA/cccl
Length of output: 412
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Locate files with scan_by_key in cub/cub/detail =="
fd -a -g '*scan_by_key*' cub/cub/detail | head -n 200
echo
echo "== Find scan_by_key::dispatch occurrences across cub/cub/detail =="
rg -n 'scan_by_key::dispatch' cub/cub/detail | head -n 50
echo
echo "== Find scan_by_key dispatch / determinism helpers across cub/cub/detail =="
rg -n '__validate_no_tie_break_t|__get_determinism_t|not_guaranteed_t|scan_impl_determinism' cub/cub/detail | head -n 200Repository: NVIDIA/cccl
Length of output: 177
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Find any cub/cub files with scan_by_key in the filename =="
fd -a -g '*scan_by_key*' cub/cub | head -n 200
echo
echo "== Search full cub/cub for 'scan_by_key::dispatch' =="
rg -n 'scan_by_key::dispatch' cub/cub | head -n 50
echo
echo "== Search full cub/cub for 'namespace scan_by_key' =="
rg -n 'namespace[[:space:]]+scan_by_key|struct[[:space:]]+scan_by_key' cub/cub | head -n 50
echo
echo "== Search full cub/cub for tie-break validation helpers in scan_by_key_impl/dispatch =="
rg -n '__validate_no_tie_break_t|scan_impl_determinism|not_guaranteed_t' cub/cub/device/device_scan.cuh cub/cub/detail 2>/dev/null | head -n 200Repository: NVIDIA/cccl
Length of output: 1316
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search determinism symbols in dispatch_scan_by_key.cuh =="
rg -n '__validate_no_tie_break_t|__get_determinism_t|not_guaranteed_t|scan_impl_determinism|determinism::' cub/cub/device/dispatch/dispatch_scan_by_key.cuh
echo
echo "== Search determinism symbols in tuning_scan_by_key.cuh =="
rg -n '__validate_no_tie_break_t|__get_determinism_t|not_guaranteed_t|scan_impl_determinism|determinism::' cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh
echo
echo "== Search for scan_by_key dispatch-ish entry points in dispatch_scan_by_key.cuh =="
rg -n 'scan_by_key|dispatch|dispatch\<' cub/cub/device/dispatch/dispatch_scan_by_key.cuh | head -n 200
echo
echo "== Show top portion of dispatch_scan_by_key.cuh (first 220 lines) =="
sed -n '1,220p' cub/cub/device/dispatch/dispatch_scan_by_key.cuh | cat -n
echo
echo "== Show portion around any determinism-related code in dispatch_scan_by_key.cuh =="
for pat in '__validate_no_tie_break_t' '__get_determinism_t' 'not_guaranteed_t' 'determinism'; do
line=$(rg -n "$pat" cub/cub/device/dispatch/dispatch_scan_by_key.cuh | head -n 1 | cut -d: -f1 || true)
if [ -n "$line" ]; then
start=$((line-60)); end=$((line+120));
echo "-- Context around $pat at line $line --"
sed -n "${start},${end}p" cub/cub/device/dispatch/dispatch_scan_by_key.cuh | cat -n
fi
doneRepository: NVIDIA/cccl
Length of output: 116
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search for 'determinism' substring in dispatch_scan_by_key.cuh =="
rg -n 'determin' cub/cub/device/dispatch/dispatch_scan_by_key.cuh || true
echo
echo "== Search for 'determin' substring in tuning_scan_by_key.cuh =="
rg -n 'determin' cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh || true
echo
echo "== Show first 260 lines of dispatch_scan_by_key.cuh =="
sed -n '1,260p' cub/cub/device/dispatch/dispatch_scan_by_key.cuh | cat -n
echo
echo "== Show first 260 lines of tuning_scan_by_key.cuh =="
sed -n '1,260p' cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh | cat -nRepository: NVIDIA/cccl
Length of output: 21756
critical: EnvT tie-break determinism is rejected for scan_impl_env, but not for Exclusive*ByKey / Inclusive*ByKey env overloads.
scan_impl_envuses__validate_no_tie_break_t<...__get_determinism_t...>+static_assert(aroundcub/cub/device/device_scan.cuh:131-143).- The
*ByKeyenv overloads only calldetail::dispatch_with_env(env, ...)and thenscan_by_key_impl<tuning_t>(...);scan_by_key_impl/detail::scan_by_key::dispatchhave no__get_determinism_t/__validate_no_tie_break_tchecks. - Tie-break determinism requirements can therefore likely compile and be ignored on the
*ByKeyenv paths.
Fix: add the same tie-break rejection (requested_determinism_t via __validate_no_tie_break_t + the existing determinism static_assert) to the *ByKey EnvT path (e.g., a scan_by_key_env wrapper analogous to scan_impl_env, or the check in scan_by_key_impl/shared ByKey dispatch).
🥳 CI Workflow Results🟩 Finished in 2h 17m: Pass: 100%/340 | Total: 12d 16h | Max: 2h 13m | Hits: 36%/1291429See results here. |
Closes #9255
Note, this is an alternative proposal to #9238.
Adds an optional tie-break preference to
cuda::execution::determinism.A deterministic guarantee (
determinism::run_to_run/determinism::gpu_to_gpu) can now be called with a tie-break tag to specify which of the elements that compare equal at an algorithm's selection boundary are kept — e.g.require(determinism::run_to_run(determinism::tie_break::prefer_smaller_index)). The tags live incuda::execution::determinism::tie_break::with valuesunspecified(default),prefer_smaller_index, andprefer_larger_index.A word on the motivation in top-k: ties at the K-th element are the source of non-determinism; pairing a determinism requirement with a tie-break lets users specify which amongst the competing items should make it into the result set.
The determinism options I envision for top-k:
require(...)determinism::not_guaranteeddeterminism::run_to_rundeterminism::run_to_run(determinism::tie_break::prefer_smaller_index)determinism::run_to_run(determinism::tie_break::prefer_larger_index)My mental model:
tie_breakis purely a criterion about which items are selected, i.e., it operates on the result set. Ordering, i.e.,{stable,unstable}/sorted/unsorted, is the orthogonal concern that operates on the result sequence (output_ordering). A tie-break says nothing about order.Why part of
determinismrather than a standalone requirement? A tie-break is only meaningful alongside a deterministic guarantee. Instead of a separatetie_breakrequirement that has to be paired withdeterminism(andstatic_assert-ed), the tie-break is now produced by the guarantee itself, so it is structurally impossible to request one without determinism — whiledeterminismstays unchanged for users who don't care (it defaults tounspecified). Attaching a tie-break tonot_guaranteedis astatic_assert. (This started as a standalonecuda::execution::tie_breakrequirement and was folded intodeterminismper review.)A note on levels: an explicit
prefer_smaller_index/prefer_larger_indexfully fixes the result set, so for top-k it produces the same result onrun_to_runandgpu_to_gpu. The tag is accepted on both, since in general the determinism level (reproducibility, incl. numerics) and the tie-break (which set) are orthogonal. See #9238 (comment)Algorithms that do not support a tie-break (reduce / scan / segmented reduce) route their requested determinism through a shared guard and reject a tie-break at compile time with a single, friendly diagnostic instead of a deep template error.
Exposing this within
cuda::execution(rather than burying it in a CUB/top-k namespace) keeps the requirements interface homogeneous for users.