Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
- Coverage   88.04%   87.63%   -0.41%
==========================================
  Files          96       95       -1
  Lines        7032     7071      +39
==========================================
+ Hits         6191     6197       +6
- Misses        841      874      +33
```
@coderabbitai review

✅ Actions performed: review triggered.
Actionable comments posted: 4
♻️ Duplicate comments (3)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (1)
160-180: ⚠️ Potential issue | 🟠 Major

Missing `cudaGetLastError()` after kernel launches. Per coding guidelines, every kernel launch must be followed by `cudaGetLastError()` to detect launch failures immediately. The current code only checks errors at stream synchronization (lines 201-207), which delays error detection and makes debugging harder. This pattern repeats throughout all `*_impl` functions: kernel launches at lines 80, 83-84, 160-161, 177-180, 289-290, 305-308, 408-409, 413-414, 429-432, 581-583, 587-588, 603-606, 725-726, 741-744.

Example fix pattern for one kernel launch:

```diff
 rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
     buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
     buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
     n_groups, compute_tie_corr, false, use_gmem);
+cudaError_t err = cudaGetLastError();
+if (err != cudaSuccess)
+  throw std::runtime_error(
+      std::string("rank_sums_from_sorted_kernel launch failed: ") +
+      cudaGetErrorString(err));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 160 - 180: After each device kernel launch and any CUB launch in wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and check its return (and log or propagate the error) immediately to detect launch failures early; repeat this pattern for all similar kernel/CUB launches in the file (the reviewer listed examples at the start of each *_impl: kernels around symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels noted in the comment) so each launch is followed by an immediate cudaGetLastError() check rather than relying solely on stream synchronization.

.github/workflows/publish.yml (2)
136-136: ⚠️ Potential issue | 🟠 Major

Still bypassing `abi3audit` on Linux. Line 136 overrides the Linux repair sequence from `pyproject.toml` with an auditwheel-only command, so the Limited API check never runs for the published wheels. Either append the same `abi3audit` step here or move the new exclude list into `pyproject.toml` and let that table stay the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/publish.yml at line 136, The CI overrides the Linux repair step by setting CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3 audit in pyproject.toml; restore the Limited API check by either appending the abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel repair) or move the custom exclude list into the existing repair/table in pyproject.toml so the original abi3audit invocation is preserved; update the CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair sequence still invokes abi3audit for Linux builds.
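The chained form this prompt asks for follows the pattern shown in the cibuildwheel documentation (run `abi3audit` after `auditwheel repair`). A sketch — the exact excludes and invocation in `publish.yml` are not shown here, so this is illustrative only:

```yaml
# Hypothetical workflow fragment — merge into the real CIBW configuration.
env:
  CIBW_REPAIR_WHEEL_COMMAND_LINUX: >-
    auditwheel repair -w {dest_dir} {wheel} &&
    pipx run abi3audit --strict --report {wheel}
```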
124-133: ⚠️ Potential issue | 🟠 Major

Still assuming `lib64` in the RAPIDS wheels. Lines 129 and 130 still hard-code `lib64`, but `CMakeLists.txt:18-75` already treats the installed layout as `lib*`. If either wheel lands under `lib` instead, these symlinks break before the extensions link. Resolve the actual `.so` from `lib*/` rather than composing a `lib64` path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/publish.yml around lines 124 - 133, The symlink creation assumes a lib64 layout and should instead resolve the actual .so under any lib* directory; update the ln -sf steps that reference RMM_ROOT and LOG_ROOT so they search for the shared objects (e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so or by using a find/glob to locate the first matching .so) and create symlinks to those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT and LOG_ROOT variables and the two ln -sf operations (for librmm and librapids_logger).
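The glob-based resolution this prompt describes can be sketched as a small helper; the function name `resolve_so` and the directory layout below are illustrative, not taken from the workflow:

```shell
# Sketch: locate a shared object under any lib*/ directory of an installed
# wheel, instead of hard-coding lib64. resolve_so is a hypothetical helper.
resolve_so() {
  root="$1"   # e.g. "$RMM_ROOT" or "$LOG_ROOT"
  name="$2"   # e.g. librmm.so
  # Print the first match under lib/, lib64/, or any other lib* layout.
  find "$root" -path "$root/lib*/$name" -print -quit
}
```

The workflow's two `ln -sf` steps could then symlink `$(resolve_so "$RMM_ROOT" librmm.so)` and `$(resolve_so "$LOG_ROOT" librapids_logger.so)` rather than composing a `lib64` path.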
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 201-208: The pyproject.toml currently declares librmm under
[tool.uv.extra-build-dependencies], which is ignored by PEP 517 build isolation;
move the librmm requirement into the PEP 517 build requirements by adding
"librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so source
builds (pip install --no-binary) can find headers required by CMakeLists.txt for
the Wilcoxon CUDA extensions, or alternatively update the project installation
docs to explicitly state that librmm (librmm-cu12>=25.10) must be preinstalled
and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds when librmm is not
available.
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 154-156: The warp buffer offset is wrong when use_gmem is false:
in the compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem
: smem + n_groups but the shared-memory layout places grp_sums, s_sum, s_sq,
s_nnz (4 arrays of size n_groups) before the warp buffer; update the offset
calculation in the compute_tie_corr block to use smem + 4 * n_groups (keeping
the use_gmem ? smem logic) so warp_buf points after those four arrays (refer to
variables compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 540-546: The cudaHostRegister calls for h_data, h_indices,
h_rank_sums, and h_tie_corr are missing error checks; wrap each
cudaHostRegister(...) invocation with a check of the returned cudaError_t and
handle failures (e.g., log the error using cudaGetErrorString(err) and
return/throw an error code) so failures aren’t silent; update the block where
cudaHostRegister is called (references: cudaHostRegister, h_data, h_indices,
h_rank_sums, h_tie_corr) to verify the return value, clean up any previously
pinned memory on failure, and propagate an error up from the function.
---
Duplicate comments:
In @.github/workflows/publish.yml:
- Line 136: The CI overrides the Linux repair step by setting
CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3
audit in pyproject.toml; restore the Limited API check by either appending the
abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel
repair) or move the custom exclude list into the existing repair/table in
pyproject.toml so the original abi3audit invocation is preserved; update the
CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair
sequence still invokes abi3audit for Linux builds.
- Around line 124-133: The symlink creation assumes a lib64 layout and should
instead resolve the actual .so under any lib* directory; update the ln -sf steps
that reference RMM_ROOT and LOG_ROOT so they search for the shared objects
(e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so
or by using a find/glob to locate the first matching .so) and create symlinks to
those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT
and LOG_ROOT variables and the two ln -sf operations (for librmm and
librapids_logger).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 160-180: After each device kernel launch and any CUB launch in
wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.
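To avoid repeating the check by hand at every launch site, guidance like this is often wrapped in a macro. The sketch below uses hypothetical host-side stand-ins for the CUDA runtime symbols so it compiles without a GPU toolchain; in `wilcoxon_ovr.cu` the real declarations would come from `<cuda_runtime.h>`, and the macro name `RSC_CHECK_LAUNCH` is an assumption, not an existing project symbol:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the CUDA runtime so this sketch builds off-GPU;
// the real file gets these from <cuda_runtime.h>.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidConfiguration = 9 };
inline const char* cudaGetErrorString(cudaError_t e) {
    return e == cudaSuccess ? "no error" : "invalid configuration argument";
}
static cudaError_t g_last_error = cudaSuccess;  // test hook, not real CUDA state
inline cudaError_t cudaGetLastError() {
    cudaError_t e = g_last_error;
    g_last_error = cudaSuccess;  // like CUDA, reading clears the sticky error
    return e;
}

// The reviewed pattern: invoke immediately after every <<<...>>> launch so a
// bad launch configuration surfaces at the launch site, not at the next sync.
#define RSC_CHECK_LAUNCH(kernel_name)                                        \
    do {                                                                     \
        cudaError_t err_ = cudaGetLastError();                               \
        if (err_ != cudaSuccess)                                             \
            throw std::runtime_error(std::string(kernel_name) +              \
                                     " launch failed: " +                    \
                                     cudaGetErrorString(err_));              \
    } while (0)
```

In the real file the macro would sit right after each launch, e.g. `fill_row_indices_kernel<<<...>>>(...); RSC_CHECK_LAUNCH("fill_row_indices_kernel");`.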
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 548a08c7-d10f-413f-b6b1-1d702b116125
📒 Files selected for processing (8)
- .github/workflows/publish.yml
- CMakeLists.txt
- pyproject.toml
- src/rapids_singlecell/_cuda/nb_types.h
- src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
✅ Files skipped from review due to trivial changes (2)
- src/rapids_singlecell/_cuda/nb_types.h
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
🚧 Files skipped from review as they are similar to previous changes (2)
- CMakeLists.txt
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
```toml
[tool.uv]
# librmm headers are needed at build time for the wilcoxon CUDA kernels.
# The headers are identical across cu12/cu13 — only the .so differs (loaded
# at runtime via librmm.load_library()). cu12 is used here as the build-time
# provider; cu13 envs get the same headers.

[tool.uv.extra-build-dependencies]
rapids-singlecell = [ "librmm-cu12>=25.10" ]
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's check the pyproject.toml file at the specified lines
wc -l pyproject.toml
```

Repository: scverse/rapids-singlecell
Length of output: 87

🏁 Script executed:

```shell
# Read the relevant section of pyproject.toml
head -n 220 pyproject.toml | tail -n 30
```

Length of output: 816

🏁 Script executed:

```shell
# Check CMakeLists.txt for the rmm requirements at the specified lines
if [ -f CMakeLists.txt ]; then
  head -n 25 CMakeLists.txt | tail -n 15
fi
```

Length of output: 650

🏁 Script executed:

```shell
# Check if CMakeLists.txt mentions librmm or rmm
if [ -f CMakeLists.txt ]; then
  rg -n "rmm|librmm" CMakeLists.txt | head -20
fi
```

Length of output: 1181

🏁 Script executed:

```shell
# Check the build-system configuration in pyproject.toml
rg -A 5 "\[build-system\]" pyproject.toml
```

Length of output: 177

🏁 Script executed:

```shell
# Search for any documentation or comments about rmm requirements
rg -i "rmm|wilcoxon" --type=py --type=toml --type=md | head -30
```

Length of output: 3193

🏁 Script executed:

```shell
# Check for package configuration and publishing info
rg -i "packages|distribution|sdist" pyproject.toml | head -20
```

Length of output: 225

🏁 Script executed:

```shell
# Look for documentation about build requirements
fd -e md -e rst -e txt | xargs rg -l "librmm|rmm|build.*require" 2>/dev/null | head -10
```

Length of output: 129

🏁 Script executed:

```shell
# Check README and installation docs
if [ -f README.md ]; then
  head -100 README.md
fi
```

Length of output: 2931

🏁 Script executed:

```shell
# Search for RSC_BUILD_EXTENSIONS usage and documentation
rg -n "RSC_BUILD_EXTENSIONS|BUILD_EXTENSIONS" --type=py --type=toml --type=md
```

Length of output: 160

🏁 Script executed:

```shell
# Check pyproject.toml for wheel/sdist configuration
cat pyproject.toml | grep -A 10 -B 2 "wheel\|sdist\|package"
```

Length of output: 2382
Move librmm to a build-system-agnostic configuration or document the requirement explicitly.
The `[tool.uv.extra-build-dependencies]` entry is ignored by pip and generic PEP 517 build isolation. While the project publishes an sdist and `CMakeLists.txt` (lines 177–180) requires rmm for the Wilcoxon CUDA extensions, source builds via `pip install --no-binary` will fail to locate librmm headers during compilation unless the package is manually provided. Either add `librmm-cu12>=25.10` to `[build-system].requires` in `pyproject.toml`, or clearly document this requirement in installation documentation along with guidance on using the `RSC_BUILD_EXTENSIONS=OFF` flag to skip CUDA extension builds when librmm is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 201 - 208, The pyproject.toml currently declares
librmm under [tool.uv.extra-build-dependencies], which is ignored by PEP 517
build isolation; move the librmm requirement into the PEP 517 build requirements
by adding "librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so
source builds (pip install --no-binary) can find headers required by
CMakeLists.txt for the Wilcoxon CUDA extensions, or alternatively update the
project installation docs to explicitly state that librmm (librmm-cu12>=25.10)
must be preinstalled and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds
when librmm is not available.
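If the first option is taken, the change is a one-line addition to the build requirements. A sketch follows; the other `requires` entries and the backend name are placeholders, since the project's actual `[build-system]` table is not reproduced in this review:

```toml
[build-system]
requires = [
    "scikit-build-core",      # placeholder for the project's real backend deps
    "librmm-cu12>=25.10",     # new: exposes librmm headers inside PEP 517 isolation
]
build-backend = "scikit_build_core.build"  # placeholder
```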
```cpp
    if (compute_tie_corr) {
      // Warp buf always in shared memory (32 doubles — always fits)
      double* warp_buf = use_gmem ? smem : smem + n_groups;
```
Incorrect warp buffer offset in shared memory path.

When `use_gmem=false`, the shared memory layout allocates 4 arrays of size `n_groups` (`grp_sums`, `s_sum`, `s_sq`, `s_nnz`) before the warp buffer. The offset calculation should be `smem + 4 * n_groups`, not `smem + n_groups`.
🐛 Proposed fix

```diff
 if (compute_tie_corr) {
     // Warp buf always in shared memory (32 doubles — always fits)
-    double* warp_buf = use_gmem ? smem : smem + n_groups;
+    double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
 #pragma unroll
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
    if (compute_tie_corr) {
      // Warp buf always in shared memory (32 doubles — always fits)
      double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 154 -
156, The warp buffer offset is wrong when use_gmem is false: in the
compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem : smem
+ n_groups but the shared-memory layout places grp_sums, s_sum, s_sq, s_nnz (4
arrays of size n_groups) before the warp buffer; update the offset calculation
in the compute_tie_corr block to use smem + 4 * n_groups (keeping the use_gmem ?
smem logic) so warp_buf points after those four arrays (refer to variables
compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
```cpp
  cudaHostRegister(const_cast<float*>(h_data),
                   (size_t)h_indptr[n_cols] * sizeof(float), 0);
  cudaHostRegister(const_cast<int*>(h_indices),
                   (size_t)h_indptr[n_cols] * sizeof(int), 0);
  cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
                   0);
  cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
```
Missing error checks for `cudaHostRegister`.

`cudaHostRegister` can fail (e.g., if memory is already pinned, or the system doesn't support pinning). Without error checking, failures are silent and async transfers may not work as expected.
Suggested fix

```diff
 // Pin host memory for async transfers
-cudaHostRegister(const_cast<float*>(h_data),
-                 (size_t)h_indptr[n_cols] * sizeof(float), 0);
-cudaHostRegister(const_cast<int*>(h_indices),
-                 (size_t)h_indptr[n_cols] * sizeof(int), 0);
-cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
-                 0);
-cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
+cudaError_t pin_err;
+pin_err = cudaHostRegister(const_cast<float*>(h_data),
+                           (size_t)h_indptr[n_cols] * sizeof(float), 0);
+// Note: cudaErrorHostMemoryAlreadyRegistered is acceptable
+if (pin_err != cudaSuccess && pin_err != cudaErrorHostMemoryAlreadyRegistered)
+  throw std::runtime_error("Failed to pin h_data: " +
+                           std::string(cudaGetErrorString(pin_err)));
+// ... similar for other registers
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 540 - 546,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums, and h_tie_corr
are missing error checks; wrap each cudaHostRegister(...) invocation with a
check of the returned cudaError_t and handle failures (e.g., log the error using
cudaGetErrorString(err) and return/throw an error code) so failures aren’t
silent; update the block where cudaHostRegister is called (references:
cudaHostRegister, h_data, h_indices, h_rank_sums, h_tie_corr) to verify the
return value, clean up any previously pinned memory on failure, and propagate an
error up from the function.
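The cleanup-on-failure part of this guidance can be factored into a helper. The sketch below stubs `cudaHostRegister`/`cudaHostUnregister` with hypothetical host functions (plus a failure hook so the rollback path is exercisable) so it compiles without CUDA; the rollback logic is the point, and the helper name `pin_all` is an assumption:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the CUDA runtime; real code uses <cuda_runtime.h>.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidValue = 1 };
inline const char* cudaGetErrorString(cudaError_t) { return "invalid argument"; }
static int g_fail_on_call = -1;  // test hook: which registration should fail
inline cudaError_t cudaHostRegister(void*, std::size_t, unsigned) {
    return g_fail_on_call-- == 0 ? cudaErrorInvalidValue : cudaSuccess;
}
inline cudaError_t cudaHostUnregister(void*) { return cudaSuccess; }

// Pin every (pointer, byte-count) buffer; if any registration fails, unpin
// the buffers already registered and propagate the error, so nothing is
// left silently half-pinned.
inline void pin_all(const std::vector<std::pair<void*, std::size_t>>& bufs) {
    for (std::size_t i = 0; i < bufs.size(); ++i) {
        cudaError_t err = cudaHostRegister(bufs[i].first, bufs[i].second, 0);
        if (err != cudaSuccess) {
            for (std::size_t j = 0; j < i; ++j)
                cudaHostUnregister(bufs[j].first);
            throw std::runtime_error(std::string("cudaHostRegister failed: ") +
                                     cudaGetErrorString(err));
        }
    }
}
```

In `wilcoxon_ovr.cu` the call sites would pass `h_data`, `h_indices`, `h_rank_sums`, and `h_tie_corr` with their byte sizes.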
@coderabbitai review

✅ Actions performed: review triggered.
🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (2)
686-692: CSR histogram kernel lacks `cudaGetLastError()` check.

The `csr_col_histogram_kernel` launch at lines 690-691 doesn't check for launch errors. If the kernel fails, the subsequent `cudaMemcpy` to `h_col_counts` would silently read uninitialized/incorrect data.

Suggested fix

```diff
 csr_col_histogram_kernel<<<blocks, tpb>>>(csr_indices, csr_indptr,
                                           d_col_counts, n_rows, n_cols);
+cudaError_t err = cudaGetLastError();
+if (err != cudaSuccess)
+  throw std::runtime_error(
+      std::string("csr_col_histogram_kernel launch failed: ") +
+      cudaGetErrorString(err));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 686 - 692, After launching csr_col_histogram_kernel(...) add a cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if an error is returned, log/return the CUDA error (or propagate it) rather than proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel launch and the device buffer d_col_counts (and host h_col_counts) so failures are detected and handled before copying results back to host.
182-203: Consider adding `cudaGetLastError()` after kernel launches.

While stream synchronization at lines 225-230 will surface async errors, adding `cudaGetLastError()` immediately after kernel launches (lines 183-184, 200-203) would catch launch configuration errors (e.g., invalid shared memory size) earlier, making debugging easier. This is a minor improvement since errors are eventually caught at sync.

Suggested improvement

```diff
 fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in, n_rows, sb_cols);
+if (cudaGetLastError() != cudaSuccess)
+  throw std::runtime_error("fill_row_indices_kernel launch failed");

 // Sort: keys = block columns [col, col+sb_cols), already F-order
 const float* keys_in = block + (long long)col * n_rows;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 182 - 203, After the launches of fill_row_indices_kernel and rank_sums_from_sorted_kernel (and optionally after the cub::DeviceSegmentedRadixSort::SortPairs call), immediately call cudaGetLastError() and handle/report any non-success result so launch-time failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier; locate the kernel launches by the symbols fill_row_indices_kernel and rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError() check right after each launch (or use your project's CUDA error-check macro) to log/propagate the launch error before later stream synchronization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 686-692: After launching csr_col_histogram_kernel(...) add a
cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately
after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if
an error is returned, log/return the CUDA error (or propagate it) rather than
proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel
launch and the device buffer d_col_counts (and host h_col_counts) so failures
are detected and handled before copying results back to host.
- Around line 182-203: After the launches of fill_row_indices_kernel and
rank_sums_from_sorted_kernel (and optionally after the
cub::DeviceSegmentedRadixSort::SortPairs call), immediately call
cudaGetLastError() and handle/report any non-success result so launch-time
failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier;
locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a64a3c4e-ea59-4657-95d3-8688ccbed35a
📒 Files selected for processing (12)
- .github/workflows/publish.yml
- pyproject.toml
- src/rapids_singlecell/_cuda/__init__.py
- src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
- src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
- src/rapids_singlecell/tools/_rank_genes_groups/_core.py
- src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
- tests/test_rank_genes_groups_wilcoxon.py
- tests/test_rank_genes_groups_wilcoxon_binned.py
💤 Files with no reviewable changes (1)
- src/rapids_singlecell/tools/_rank_genes_groups/_core.py
✅ Files skipped from review due to trivial changes (1)
- src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/rapids_singlecell/_cuda/__init__.py
- .github/workflows/publish.yml
- pyproject.toml
- src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
No description provided.