Vectorize filter_on_target_knockdown #71

Open
LeonHafner wants to merge 3 commits into ArcInstitute:main from LeonHafner:vectorize_filter_on_target_knockdown

Conversation

@LeonHafner

Summary

  • Replace the old loop-based filter_on_target_knockdown with the vectorized reimplementation (measured 40x speedup on large datasets)
  • Remove helpers that were only used by the old implementation: _mean, is_on_target_knockdown, set_var_index_to_col
  • Add a --verbose flag to the CLI that prints per-stage cell and perturbation counts during filtering

How it works

The new implementation avoids per-perturbation loops by:

  • Extracting a single dense submatrix X_sub (n_cells × n_matched_perts) for all target genes at once
  • Computing control means for all genes in one matrix op: X_sub[control_mask].mean(axis=0)
  • Using np.bincount + fancy indexing for the perturbation-level and cell-level filter stages
  • Deferring adata.copy() to the end, covering only the kept subset

Outputs were verified to be identical to the old implementation on a 385k-cell dataset.
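
The stages above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function signature, variable names (`pert_positions`, `pert_codes`, `control_mask`), and the cutoffs are assumptions.

```python
import numpy as np
import scipy.sparse as sp

def filter_on_target_knockdown_sketch(X, pert_positions, pert_codes,
                                      control_mask, threshold=0.3,
                                      min_fraction=0.5):
    """Illustrative sketch of the vectorized filter (hypothetical names/cutoffs).

    X              : (n_cells, n_genes) expression matrix, sparse or dense
    pert_positions : column indices of the matched target genes
    pert_codes     : per-cell integer index into pert_positions
    control_mask   : boolean mask selecting control cells
    """
    # One dense submatrix for all matched target genes at once
    if sp.issparse(X):
        X_sub = X[:, pert_positions].toarray()
    else:
        X_sub = np.asarray(X)[:, pert_positions]

    # Control means for every target gene in a single matrix op
    control_means = X_sub[control_mask].mean(axis=0)

    # Each cell's expression of its own target gene via fancy indexing
    diag_expr = X_sub[np.arange(X_sub.shape[0]), pert_codes]

    # Cell-level stage: target expression below threshold * control mean
    keep_cell = diag_expr < threshold * control_means[pert_codes]

    # Perturbation-level stage via np.bincount: fraction of passing cells
    n_perts = len(pert_positions)
    passing = np.bincount(pert_codes[keep_cell], minlength=n_perts)
    total = np.bincount(pert_codes, minlength=n_perts)
    frac = passing / np.maximum(total, 1)
    keep_pert = frac > min_fraction

    # Combined mask; adata.copy() would then be applied once to this subset
    return keep_cell & keep_pert[pert_codes]
```

Because every stage is a single NumPy operation over the whole submatrix, the per-perturbation Python loop disappears entirely, which is where the reported speedup would come from.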

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the filter_on_target_knockdown function to utilize vectorized operations with NumPy and Pandas, significantly improving performance over the previous loop-based approach. Additionally, a --verbose flag has been added to the CLI to provide detailed counts during the filtering stages. A review comment correctly identifies a potential memory bottleneck where the densification of the expression submatrix for all matched perturbations could lead to out-of-memory errors on large datasets, suggesting a more memory-efficient approach using sparse indexing.

Comment on lines +303 to +306

```python
if sp.issparse(X):
    X_sub = X[:, pert_positions].toarray()  # (n_cells, n_matched)
else:
    X_sub = np.asarray(X)[:, pert_positions]
```

Severity: high

The current implementation densifies the entire submatrix X_sub for all matched perturbations. For large datasets with many perturbations (e.g., whole-genome screens), this can lead to excessive memory consumption or Out-Of-Memory (OOM) errors. Since X is typically sparse in single-cell data, it is more efficient to avoid full densification and instead use sparse indexing or only densify the specific elements needed for each stage.

Consider keeping X_sub sparse if X is sparse, and then using np.asarray(...).flatten() when extracting specific values (like diag_expr or expr_vals) or computing means to maintain the speedup while significantly reducing the memory footprint.
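
One way the suggestion could look is sketched below. This is a hedged illustration of the reviewer's idea, not the PR's code; the variable names mirror the snippet above but are assumptions.

```python
import numpy as np
import scipy.sparse as sp

def target_expr_no_densify(X, pert_positions, pert_codes, control_mask):
    """Sketch: keep X_sub sparse, densifying only the values each stage needs."""
    if sp.issparse(X):
        X_sub = X[:, pert_positions].tocsc()  # submatrix stays sparse
        # Sparse .mean(axis=0) returns a 1 x n np.matrix; flatten to 1-D
        control_means = np.asarray(X_sub[control_mask].mean(axis=0)).flatten()
        # Densify one element per cell instead of the whole submatrix
        rows = np.arange(X_sub.shape[0])
        diag_expr = np.asarray(X_sub[rows, pert_codes]).flatten()
    else:
        X_sub = np.asarray(X)[:, pert_positions]
        control_means = X_sub[control_mask].mean(axis=0)
        diag_expr = X_sub[np.arange(X_sub.shape[0]), pert_codes]
    return control_means, diag_expr
```

The fancy-indexed extraction `X_sub[rows, pert_codes]` materializes only n_cells values rather than an n_cells × n_matched dense array, so peak memory scales with the number of cells instead of the full submatrix.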

