Vectorize filter_on_target_knockdown by LeonHafner · Pull Request #71 · ArcInstitute/cell-load

LeonHafner · 2026-04-01T01:27:44Z

Summary

Replace the old loop-based filter_on_target_knockdown with the vectorized reimplementation (measured 40x speedup on large datasets)
Remove helpers that were only used by the old implementation: _mean, is_on_target_knockdown, set_var_index_to_col
Add --verbose flag to the CLI to expose per-stage cell/perturbation count output

How it works

The new implementation avoids per-perturbation loops by:

Extracting a single dense submatrix X_sub (n_cells × n_matched_perts) for all target genes at once
Computing control means for all genes in one matrix op: X_sub[control_mask].mean(axis=0)
Using np.bincount + fancy indexing for the perturbation-level and cell-level filter stages
Deferring adata.copy() to the end, covering only the kept subset

Outputs were verified to be identical to the old implementation on a 385k-cell dataset.

gemini-code-assist

Code Review

This pull request refactors the filter_on_target_knockdown function to utilize vectorized operations with NumPy and Pandas, significantly improving performance over the previous loop-based approach. Additionally, a --verbose flag has been added to the CLI to provide detailed counts during the filtering stages. A review comment correctly identifies a potential memory bottleneck where the densification of the expression submatrix for all matched perturbations could lead to out-of-memory errors on large datasets, suggesting a more memory-efficient approach using sparse indexing.

gemini-code-assist · 2026-04-01T01:29:18Z

src/cell_load/utils/data_utils.py

+    if sp.issparse(X):
+        X_sub = X[:, pert_positions].toarray()   # (n_cells, n_matched)
+    else:
+        X_sub = np.asarray(X)[:, pert_positions]


The current implementation densifies the entire submatrix X_sub for all matched perturbations. For large datasets with many perturbations (e.g., whole-genome screens), this can lead to excessive memory consumption or Out-Of-Memory (OOM) errors. Since X is typically sparse in single-cell data, it is more efficient to avoid full densification and instead use sparse indexing or only densify the specific elements needed for each stage.

Consider keeping X_sub sparse if X is sparse, and then using np.asarray(...).flatten() when extracting specific values (like diag_expr or expr_vals) or computing means to maintain the speedup while significantly reducing the memory footprint.

LeonHafner added 3 commits March 30, 2026 16:35

add fast filter_on_target_knockdown

e23d426

add output summary

31deabd

replace old function

904a7ee

gemini-code-assist bot reviewed Apr 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize filter_on_target_knockdown#71

Vectorize filter_on_target_knockdown#71
LeonHafner wants to merge 3 commits intoArcInstitute:mainfrom
LeonHafner:vectorize_filter_on_target_knockdown

LeonHafner commented Apr 1, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LeonHafner commented Apr 1, 2026

Summary

How it works

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant