Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
- Coverage   88.04%   87.63%   -0.41%
==========================================
  Files          96       95       -1
  Lines        7032     7071      +39
==========================================
+ Hits         6191     6197       +6
- Misses        841      874      +33
```
@coderabbitai review

✅ Actions performed: review triggered.
Actionable comments posted: 4
♻️ Duplicate comments (3)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (1)
160-180: ⚠️ Potential issue | 🟠 Major

Missing `cudaGetLastError()` after kernel launches. Per coding guidelines, every kernel launch must be followed by `cudaGetLastError()` to detect launch failures immediately. The current code only checks errors at stream synchronization (lines 201-207), which delays error detection and makes debugging harder. This pattern repeats throughout all `*_impl` functions: kernel launches at lines 80, 83-84, 160-161, 177-180, 289-290, 305-308, 408-409, 413-414, 429-432, 581-583, 587-588, 603-606, 725-726, 741-744.

Example fix pattern for one kernel launch:

```diff
 rank_sums_from_sorted_kernel<<<sb_cols, tpb_rank, smem_rank, stream>>>(
     buf.keys_out, buf.vals_out, group_codes, buf.sub_rank_sums,
     buf.sub_tie_corr, nullptr, nullptr, nullptr, n_rows, sb_cols,
     n_groups, compute_tie_corr, false, use_gmem);
+cudaError_t err = cudaGetLastError();
+if (err != cudaSuccess)
+  throw std::runtime_error(
+      std::string("rank_sums_from_sorted_kernel launch failed: ") +
+      cudaGetErrorString(err));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 160 - 180: After each device kernel launch and any CUB launch in wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and check its return (and log or propagate the error) immediately to detect launch failures early; repeat this pattern for all similar kernel/CUB launches in the file (the reviewer listed examples at the start of each *_impl: kernels around symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels noted in the comment) so each launch is followed by an immediate cudaGetLastError() check rather than relying solely on stream synchronization.

.github/workflows/publish.yml (2)
136-136: ⚠️ Potential issue | 🟠 Major

Still bypassing `abi3audit` on Linux. Line 136 overrides the Linux repair sequence from `pyproject.toml` with an auditwheel-only command, so the Limited API check never runs for the published wheels. Either append the same `abi3audit` step here or move the new exclude list into `pyproject.toml` and let that table stay the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/publish.yml at line 136, The CI overrides the Linux repair step by setting CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3 audit in pyproject.toml; restore the Limited API check by either appending the abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel repair) or move the custom exclude list into the existing repair/table in pyproject.toml so the original abi3audit invocation is preserved; update the CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair sequence still invokes abi3audit for Linux builds.
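The chained form this prompt asks for follows the pattern shown in the cibuildwheel documentation (run `abi3audit` after `auditwheel repair`). A sketch — the exact excludes and invocation in `publish.yml` are not shown here, so this is illustrative only:

```yaml
# Hypothetical workflow fragment — merge into the real CIBW configuration.
env:
  CIBW_REPAIR_WHEEL_COMMAND_LINUX: >-
    auditwheel repair -w {dest_dir} {wheel} &&
    pipx run abi3audit --strict --report {wheel}
```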
124-133: ⚠️ Potential issue | 🟠 Major

Still assuming `lib64` in the RAPIDS wheels. Lines 129 and 130 still hard-code `lib64`, but `CMakeLists.txt:18-75` already treats the installed layout as `lib*`. If either wheel lands under `lib` instead, these symlinks break before the extensions link. Resolve the actual `.so` from `lib*/` rather than composing a `lib64` path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/publish.yml around lines 124 - 133, The symlink creation assumes a lib64 layout and should instead resolve the actual .so under any lib* directory; update the ln -sf steps that reference RMM_ROOT and LOG_ROOT so they search for the shared objects (e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so or by using a find/glob to locate the first matching .so) and create symlinks to those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT and LOG_ROOT variables and the two ln -sf operations (for librmm and librapids_logger).
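The glob-based resolution this prompt describes can be sketched as a small helper; the function name `resolve_so` and the directory layout below are illustrative, not taken from the workflow:

```shell
# Sketch: locate a shared object under any lib*/ directory of an installed
# wheel, instead of hard-coding lib64. resolve_so is a hypothetical helper.
resolve_so() {
  root="$1"   # e.g. "$RMM_ROOT" or "$LOG_ROOT"
  name="$2"   # e.g. librmm.so
  # Print the first match under lib/, lib64/, or any other lib* layout.
  find "$root" -path "$root/lib*/$name" -print -quit
}
```

The workflow's two `ln -sf` steps could then symlink `$(resolve_so "$RMM_ROOT" librmm.so)` and `$(resolve_so "$LOG_ROOT" librapids_logger.so)` rather than composing a `lib64` path.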
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 201-208: The pyproject.toml currently declares librmm under
[tool.uv.extra-build-dependencies], which is ignored by PEP 517 build isolation;
move the librmm requirement into the PEP 517 build requirements by adding
"librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so source
builds (pip install --no-binary) can find headers required by CMakeLists.txt for
the Wilcoxon CUDA extensions, or alternatively update the project installation
docs to explicitly state that librmm (librmm-cu12>=25.10) must be preinstalled
and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds when librmm is not
available.
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh`:
- Around line 154-156: The warp buffer offset is wrong when use_gmem is false:
in the compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem
: smem + n_groups but the shared-memory layout places grp_sums, s_sum, s_sq,
s_nnz (4 arrays of size n_groups) before the warp buffer; update the offset
calculation in the compute_tie_corr block to use smem + 4 * n_groups (keeping
the use_gmem ? smem logic) so warp_buf points after those four arrays (refer to
variables compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 540-546: The cudaHostRegister calls for h_data, h_indices,
h_rank_sums, and h_tie_corr are missing error checks; wrap each
cudaHostRegister(...) invocation with a check of the returned cudaError_t and
handle failures (e.g., log the error using cudaGetErrorString(err) and
return/throw an error code) so failures aren’t silent; update the block where
cudaHostRegister is called (references: cudaHostRegister, h_data, h_indices,
h_rank_sums, h_tie_corr) to verify the return value, clean up any previously
pinned memory on failure, and propagate an error up from the function.
---
Duplicate comments:
In @.github/workflows/publish.yml:
- Line 136: The CI overrides the Linux repair step by setting
CIBW_REPAIR_WHEEL_COMMAND to an auditwheel-only command, which bypasses the abi3
audit in pyproject.toml; restore the Limited API check by either appending the
abi3audit step to CIBW_REPAIR_WHEEL_COMMAND (run abi3audit after auditwheel
repair) or move the custom exclude list into the existing repair/table in
pyproject.toml so the original abi3audit invocation is preserved; update the
CIBW_REPAIR_WHEEL_COMMAND or pyproject.toml accordingly and ensure the repair
sequence still invokes abi3audit for Linux builds.
- Around line 124-133: The symlink creation assumes a lib64 layout and should
instead resolve the actual .so under any lib* directory; update the ln -sf steps
that reference RMM_ROOT and LOG_ROOT so they search for the shared objects
(e.g., under "$RMM_ROOT"/lib*/librmm.so and "$LOG_ROOT"/lib*/librapids_logger.so
or by using a find/glob to locate the first matching .so) and create symlinks to
those resolved paths rather than hard-coding lib64, keeping the same RMM_ROOT
and LOG_ROOT variables and the two ln -sf operations (for librmm and
librapids_logger).
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 160-180: After each device kernel launch and any CUB launch in
wilcoxon_ovr.cu (e.g., after fill_row_indices_kernel<<<...>>>, after the
cub::DeviceSegmentedRadixSort::SortPairs(...) call, and after
rank_sums_from_sorted_kernel<<<...>>>), insert a call to cudaGetLastError() and
check its return (and log or propagate the error) immediately to detect launch
failures early; repeat this pattern for all similar kernel/CUB launches in the
file (the reviewer listed examples at the start of each *_impl: kernels around
symbols fill_row_indices_kernel, rank_sums_from_sorted_kernel and other kernels
noted in the comment) so each launch is followed by an immediate
cudaGetLastError() check rather than relying solely on stream synchronization.
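To avoid repeating the check by hand at every launch site, guidance like this is often wrapped in a macro. The sketch below uses hypothetical host-side stand-ins for the CUDA runtime symbols so it compiles without a GPU toolchain; in `wilcoxon_ovr.cu` the real declarations would come from `<cuda_runtime.h>`, and the macro name `RSC_CHECK_LAUNCH` is an assumption, not an existing project symbol:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the CUDA runtime so this sketch builds off-GPU;
// the real file gets these from <cuda_runtime.h>.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidConfiguration = 9 };
inline const char* cudaGetErrorString(cudaError_t e) {
    return e == cudaSuccess ? "no error" : "invalid configuration argument";
}
static cudaError_t g_last_error = cudaSuccess;  // test hook, not real CUDA state
inline cudaError_t cudaGetLastError() {
    cudaError_t e = g_last_error;
    g_last_error = cudaSuccess;  // like CUDA, reading clears the sticky error
    return e;
}

// The reviewed pattern: invoke immediately after every <<<...>>> launch so a
// bad launch configuration surfaces at the launch site, not at the next sync.
#define RSC_CHECK_LAUNCH(kernel_name)                                        \
    do {                                                                     \
        cudaError_t err_ = cudaGetLastError();                               \
        if (err_ != cudaSuccess)                                             \
            throw std::runtime_error(std::string(kernel_name) +              \
                                     " launch failed: " +                    \
                                     cudaGetErrorString(err_));              \
    } while (0)
```

In the real file the macro would sit right after each launch, e.g. `fill_row_indices_kernel<<<...>>>(...); RSC_CHECK_LAUNCH("fill_row_indices_kernel");`.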
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 548a08c7-d10f-413f-b6b1-1d702b116125
📒 Files selected for processing (8)
- .github/workflows/publish.yml
- CMakeLists.txt
- pyproject.toml
- src/rapids_singlecell/_cuda/nb_types.h
- src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
✅ Files skipped from review due to trivial changes (2)
- src/rapids_singlecell/_cuda/nb_types.h
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
🚧 Files skipped from review as they are similar to previous changes (2)
- CMakeLists.txt
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovo.cu
```toml
[tool.uv]
# librmm headers are needed at build time for the wilcoxon CUDA kernels.
# The headers are identical across cu12/cu13 — only the .so differs (loaded
# at runtime via librmm.load_library()). cu12 is used here as the build-time
# provider; cu13 envs get the same headers.

[tool.uv.extra-build-dependencies]
rapids-singlecell = [ "librmm-cu12>=25.10" ]
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's check the pyproject.toml file at the specified lines
wc -l pyproject.toml
```

Repository: scverse/rapids-singlecell
Length of output: 87

🏁 Script executed:

```shell
# Read the relevant section of pyproject.toml
head -n 220 pyproject.toml | tail -n 30
```

Length of output: 816

🏁 Script executed:

```shell
# Check CMakeLists.txt for the rmm requirements at the specified lines
if [ -f CMakeLists.txt ]; then
  head -n 25 CMakeLists.txt | tail -n 15
fi
```

Length of output: 650

🏁 Script executed:

```shell
# Check if CMakeLists.txt mentions librmm or rmm
if [ -f CMakeLists.txt ]; then
  rg -n "rmm|librmm" CMakeLists.txt | head -20
fi
```

Length of output: 1181

🏁 Script executed:

```shell
# Check the build-system configuration in pyproject.toml
rg -A 5 "\[build-system\]" pyproject.toml
```

Length of output: 177

🏁 Script executed:

```shell
# Search for any documentation or comments about rmm requirements
rg -i "rmm|wilcoxon" --type=py --type=toml --type=md | head -30
```

Length of output: 3193

🏁 Script executed:

```shell
# Check for package configuration and publishing info
rg -i "packages|distribution|sdist" pyproject.toml | head -20
```

Length of output: 225

🏁 Script executed:

```shell
# Look for documentation about build requirements
fd -e md -e rst -e txt | xargs rg -l "librmm|rmm|build.*require" 2>/dev/null | head -10
```

Length of output: 129

🏁 Script executed:

```shell
# Check README and installation docs
if [ -f README.md ]; then
  head -100 README.md
fi
```

Length of output: 2931

🏁 Script executed:

```shell
# Search for RSC_BUILD_EXTENSIONS usage and documentation
rg -n "RSC_BUILD_EXTENSIONS|BUILD_EXTENSIONS" --type=py --type=toml --type=md
```

Length of output: 160

🏁 Script executed:

```shell
# Check pyproject.toml for wheel/sdist configuration
cat pyproject.toml | grep -A 10 -B 2 "wheel\|sdist\|package"
```

Length of output: 2382
Move librmm to a build-system-agnostic configuration or document the requirement explicitly.
The `[tool.uv.extra-build-dependencies]` entry is ignored by pip and generic PEP 517 build isolation. While the project publishes an sdist and `CMakeLists.txt` (lines 177–180) requires rmm for the Wilcoxon CUDA extensions, source builds via `pip install --no-binary` will fail to locate librmm headers during compilation unless the package is manually provided. Either add `librmm-cu12>=25.10` to `[build-system].requires` in `pyproject.toml`, or clearly document this requirement in installation documentation along with guidance on using the `RSC_BUILD_EXTENSIONS=OFF` flag to skip CUDA extension builds when librmm is unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 201 - 208, The pyproject.toml currently declares
librmm under [tool.uv.extra-build-dependencies], which is ignored by PEP 517
build isolation; move the librmm requirement into the PEP 517 build requirements
by adding "librmm-cu12>=25.10" to [build-system].requires in pyproject.toml so
source builds (pip install --no-binary) can find headers required by
CMakeLists.txt for the Wilcoxon CUDA extensions, or alternatively update the
project installation docs to explicitly state that librmm (librmm-cu12>=25.10)
must be preinstalled and show using RSC_BUILD_EXTENSIONS=OFF to skip CUDA builds
when librmm is not available.
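If the first option is taken, the change is a one-line addition to the build requirements. A sketch follows; the other `requires` entries and the backend name are placeholders, since the project's actual `[build-system]` table is not reproduced in this review:

```toml
[build-system]
requires = [
    "scikit-build-core",      # placeholder for the project's real backend deps
    "librmm-cu12>=25.10",     # new: exposes librmm headers inside PEP 517 isolation
]
build-backend = "scikit_build_core.build"  # placeholder
```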
```cpp
    if (compute_tie_corr) {
      // Warp buf always in shared memory (32 doubles — always fits)
      double* warp_buf = use_gmem ? smem : smem + n_groups;
```
Incorrect warp buffer offset in shared memory path.

When `use_gmem=false`, the shared memory layout allocates 4 arrays of size `n_groups` (`grp_sums`, `s_sum`, `s_sq`, `s_nnz`) before the warp buffer. The offset calculation should be `smem + 4 * n_groups`, not `smem + n_groups`.
🐛 Proposed fix

```diff
 if (compute_tie_corr) {
     // Warp buf always in shared memory (32 doubles — always fits)
-    double* warp_buf = use_gmem ? smem : smem + n_groups;
+    double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
 #pragma unroll
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
    if (compute_tie_corr) {
      // Warp buf always in shared memory (32 doubles — always fits)
      double* warp_buf = use_gmem ? smem : smem + 4 * n_groups;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh` around lines 154 -
156, The warp buffer offset is wrong when use_gmem is false: in the
compute_tie_corr branch the code sets double* warp_buf = use_gmem ? smem : smem
+ n_groups but the shared-memory layout places grp_sums, s_sum, s_sq, s_nnz (4
arrays of size n_groups) before the warp buffer; update the offset calculation
in the compute_tie_corr block to use smem + 4 * n_groups (keeping the use_gmem ?
smem logic) so warp_buf points after those four arrays (refer to variables
compute_tie_corr, warp_buf, use_gmem, smem, n_groups).
```cpp
  cudaHostRegister(const_cast<float*>(h_data),
                   (size_t)h_indptr[n_cols] * sizeof(float), 0);
  cudaHostRegister(const_cast<int*>(h_indices),
                   (size_t)h_indptr[n_cols] * sizeof(int), 0);
  cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
                   0);
  cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
```
Missing error checks for `cudaHostRegister`.

`cudaHostRegister` can fail (e.g., if memory is already pinned, or the system doesn't support pinning). Without error checking, failures are silent and async transfers may not work as expected.
Suggested fix

```diff
 // Pin host memory for async transfers
-cudaHostRegister(const_cast<float*>(h_data),
-                 (size_t)h_indptr[n_cols] * sizeof(float), 0);
-cudaHostRegister(const_cast<int*>(h_indices),
-                 (size_t)h_indptr[n_cols] * sizeof(int), 0);
-cudaHostRegister(h_rank_sums, (size_t)n_groups * n_cols * sizeof(double),
-                 0);
-cudaHostRegister(h_tie_corr, n_cols * sizeof(double), 0);
+cudaError_t pin_err;
+pin_err = cudaHostRegister(const_cast<float*>(h_data),
+                           (size_t)h_indptr[n_cols] * sizeof(float), 0);
+// Note: cudaErrorHostMemoryAlreadyRegistered is acceptable
+if (pin_err != cudaSuccess && pin_err != cudaErrorHostMemoryAlreadyRegistered)
+  throw std::runtime_error("Failed to pin h_data: " +
+                           std::string(cudaGetErrorString(pin_err)));
+// ... similar for other registers
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 540 - 546,
The cudaHostRegister calls for h_data, h_indices, h_rank_sums, and h_tie_corr
are missing error checks; wrap each cudaHostRegister(...) invocation with a
check of the returned cudaError_t and handle failures (e.g., log the error using
cudaGetErrorString(err) and return/throw an error code) so failures aren’t
silent; update the block where cudaHostRegister is called (references:
cudaHostRegister, h_data, h_indices, h_rank_sums, h_tie_corr) to verify the
return value, clean up any previously pinned memory on failure, and propagate an
error up from the function.
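The cleanup-on-failure part of this guidance can be factored into a helper. The sketch below stubs `cudaHostRegister`/`cudaHostUnregister` with hypothetical host functions (plus a failure hook so the rollback path is exercisable) so it compiles without CUDA; the rollback logic is the point, and the helper name `pin_all` is an assumption:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the CUDA runtime; real code uses <cuda_runtime.h>.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidValue = 1 };
inline const char* cudaGetErrorString(cudaError_t) { return "invalid argument"; }
static int g_fail_on_call = -1;  // test hook: which registration should fail
inline cudaError_t cudaHostRegister(void*, std::size_t, unsigned) {
    return g_fail_on_call-- == 0 ? cudaErrorInvalidValue : cudaSuccess;
}
inline cudaError_t cudaHostUnregister(void*) { return cudaSuccess; }

// Pin every (pointer, byte-count) buffer; if any registration fails, unpin
// the buffers already registered and propagate the error, so nothing is
// left silently half-pinned.
inline void pin_all(const std::vector<std::pair<void*, std::size_t>>& bufs) {
    for (std::size_t i = 0; i < bufs.size(); ++i) {
        cudaError_t err = cudaHostRegister(bufs[i].first, bufs[i].second, 0);
        if (err != cudaSuccess) {
            for (std::size_t j = 0; j < i; ++j)
                cudaHostUnregister(bufs[j].first);
            throw std::runtime_error(std::string("cudaHostRegister failed: ") +
                                     cudaGetErrorString(err));
        }
    }
}
```

In `wilcoxon_ovr.cu` the call sites would pass `h_data`, `h_indices`, `h_rank_sums`, and `h_tie_corr` with their byte sizes.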
@coderabbitai review

✅ Actions performed: review triggered.
🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu (2)
686-692: CSR histogram kernel lacks `cudaGetLastError()` check.

The `csr_col_histogram_kernel` launch at lines 690-691 doesn't check for launch errors. If the kernel fails, the subsequent `cudaMemcpy` to `h_col_counts` would silently read uninitialized/incorrect data.

Suggested fix

```diff
 csr_col_histogram_kernel<<<blocks, tpb>>>(csr_indices, csr_indptr,
                                           d_col_counts, n_rows, n_cols);
+cudaError_t err = cudaGetLastError();
+if (err != cudaSuccess)
+  throw std::runtime_error(
+      std::string("csr_col_histogram_kernel launch failed: ") +
+      cudaGetErrorString(err));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 686 - 692, After launching csr_col_histogram_kernel(...) add a cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if an error is returned, log/return the CUDA error (or propagate it) rather than proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel launch and the device buffer d_col_counts (and host h_col_counts) so failures are detected and handled before copying results back to host.
182-203: Consider adding `cudaGetLastError()` after kernel launches.

While stream synchronization at lines 225-230 will surface async errors, adding `cudaGetLastError()` immediately after kernel launches (lines 183-184, 200-203) would catch launch configuration errors (e.g., invalid shared memory size) earlier, making debugging easier. This is a minor improvement since errors are eventually caught at sync.

Suggested improvement

```diff
 fill_row_indices_kernel<<<sb_cols, 256, 0, stream>>>(buf.vals_in, n_rows, sb_cols);
+if (cudaGetLastError() != cudaSuccess)
+  throw std::runtime_error("fill_row_indices_kernel launch failed");

 // Sort: keys = block columns [col, col+sb_cols), already F-order
 const float* keys_in = block + (long long)col * n_rows;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu` around lines 182 - 203, After the launches of fill_row_indices_kernel and rank_sums_from_sorted_kernel (and optionally after the cub::DeviceSegmentedRadixSort::SortPairs call), immediately call cudaGetLastError() and handle/report any non-success result so launch-time failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier; locate the kernel launches by the symbols fill_row_indices_kernel and rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError() check right after each launch (or use your project's CUDA error-check macro) to log/propagate the launch error before later stream synchronization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu`:
- Around line 686-692: After launching csr_col_histogram_kernel(...) add a
cudaGetLastError() (and optionally cudaDeviceSynchronize()) check immediately
after the kernel launch and before the subsequent cudaMemcpy to h_col_counts; if
an error is returned, log/return the CUDA error (or propagate it) rather than
proceeding to cudaMemcpy. Ensure the check references csr_col_histogram_kernel
launch and the device buffer d_col_counts (and host h_col_counts) so failures
are detected and handled before copying results back to host.
- Around line 182-203: After the launches of fill_row_indices_kernel and
rank_sums_from_sorted_kernel (and optionally after the
cub::DeviceSegmentedRadixSort::SortPairs call), immediately call
cudaGetLastError() and handle/report any non-success result so launch-time
failures (e.g., invalid <<<...>>> or shared memory size) are caught earlier;
locate the kernel launches by the symbols fill_row_indices_kernel and
rank_sums_from_sorted_kernel in wilcoxon_ovr.cu and add a cudaGetLastError()
check right after each launch (or use your project's CUDA error-check macro) to
log/propagate the launch error before later stream synchronization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a64a3c4e-ea59-4657-95d3-8688ccbed35a
📒 Files selected for processing (12)
- .github/workflows/publish.yml
- pyproject.toml
- src/rapids_singlecell/_cuda/__init__.py
- src/rapids_singlecell/_cuda/wilcoxon/kernels_wilcoxon.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_common.cuh
- src/rapids_singlecell/_cuda/wilcoxon/wilcoxon_ovr.cu
- src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
- src/rapids_singlecell/tools/_rank_genes_groups/_core.py
- src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
- src/rapids_singlecell/tools/_rank_genes_groups/_wilcoxon.py
- tests/test_rank_genes_groups_wilcoxon.py
- tests/test_rank_genes_groups_wilcoxon_binned.py
💤 Files with no reviewable changes (1)
- src/rapids_singlecell/tools/_rank_genes_groups/_core.py
✅ Files skipped from review due to trivial changes (1)
- src/rapids_singlecell/tools/_rank_genes_groups/__init__.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/rapids_singlecell/_cuda/__init__.py
- .github/workflows/publish.yml
- pyproject.toml
- src/rapids_singlecell/tools/_rank_genes_groups/_utils.py
No description provided.