Skip to content

feat: allow exponentiation post agg for log-fold-change in rank_genes_groups#4037

Open
ilan-gold wants to merge 19 commits intomainfrom
ig/exp_post_agg
Open

feat: allow exponentiation post agg for log-fold-change in rank_genes_groups#4037
ilan-gold wants to merge 19 commits intomainfrom
ig/exp_post_agg

Conversation

@ilan-gold
Copy link
Copy Markdown
Contributor

@ilan-gold ilan-gold commented Apr 7, 2026

My three questions here:

  1. What to do about t-tests? Do people expect fold change of the input to the t-test (as is currently done) or fold change of the exponent?
  2. How do we test this? Dummy groups? Wait for illico to compare? I added a very simple test.
  3. Does the name exp_post_agg make sense? exp_post_mean?

@ilan-gold ilan-gold changed the title feat: allow exponentiation post agg for log-fold-change feat: allow exponentiation post agg for log-fold-change in rank_genes_groups Apr 7, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.63%. Comparing base (bb617ba) to head (0437475).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/scanpy/tools/_rank_genes_groups.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4037      +/-   ##
==========================================
+ Coverage   78.58%   78.63%   +0.04%     
==========================================
  Files         118      118              
  Lines       12753    12764      +11     
==========================================
+ Hits        10022    10037      +15     
+ Misses       2731     2727       -4     
Flag Coverage Δ
hatch-test.low-vers 77.93% <93.33%> (+0.02%) ⬆️
hatch-test.pre 78.51% <93.33%> (+11.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/scanpy/_settings/presets.py 86.61% <100.00%> (+0.10%) ⬆️
src/scanpy/tools/_rank_genes_groups.py 93.31% <92.85%> (+0.47%) ⬆️

... and 1 file with indirect coverage changes

@ilan-gold ilan-gold added this to the 1.13.0 milestone Apr 14, 2026
@ilan-gold ilan-gold added the Area – Reproducibility Exact replication label Apr 14, 2026
@ilan-gold ilan-gold marked this pull request as ready for review April 14, 2026 14:29
@ilan-gold ilan-gold requested a review from eroell April 14, 2026 14:31
@ilan-gold ilan-gold requested a review from flying-sheep April 14, 2026 14:33
@ilan-gold ilan-gold self-assigned this Apr 16, 2026
Copy link
Copy Markdown
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 1: I didn’t look at the math, but I assume to get both accurate logfoldchanges and correct t-test results, the t-test needs the means of the data, and the logfoldchanges need the means of e^data. So we’d have to calculate means twice I think. If @eroell doesn’t chime in I’ll re-learn the math …

Regarding 3: maybe lfc_estim_fast or so?

@ilan-gold
Copy link
Copy Markdown
Contributor Author

Regarding 1:

I think my question is less about math than expectations. Do people expect #864 to be applied to all tests that report LFC or just wilcoxon?

Regarding 3 lfc_estim_fast

I like this but I wonder if being more explicit about the math is nicer. @eroell thoughts?

@flying-sheep
Copy link
Copy Markdown
Member

Do people expect #864 to be applied to all tests that report LFC or just wilcoxon?

I think it would be confusing if the option only applied to one of them, but of course I don’t know what people expect.

@ilan-gold
Copy link
Copy Markdown
Contributor Author

That's a pretty good reason! Nothing in the issue is focused on wilcoxon, I guess!

@ilan-gold
Copy link
Copy Markdown
Contributor Author

Regarding 3: maybe lfc_estim_fast or so?

Looking back at things, I think this setting was created for a time when the code looked very different. It is only faster for t-test to do exp_post_agg.

@flying-sheep
Copy link
Copy Markdown
Member

I think you’re right, but expm1 is a more expensive operation than the others, so doing it on the full data instead of just the means might do something. Probably not a huge factor, but let’s see.

@scverse-benchmark
Copy link
Copy Markdown

No changes in benchmarks.

Comparison: https://github.com/scverse/scanpy/compare/063262b7c2f6f7e6ca675e5efc2603e21d22df22..39712198c255e2a603c2dcfdcc8a7279f7d410e4
Last changed: Fri, 17 Apr 2026 10:43:14 +0000

More details: https://github.com/scverse/scanpy/pull/4037/checks?check_run_id=71803666404

@flying-sheep
Copy link
Copy Markdown
Member

huh, I didn’t expect it to have zero impact. OK then, let’s see what @eroell says

@eroell
Copy link
Copy Markdown
Contributor

eroell commented Apr 21, 2026

What to do about t-tests? Do people expect fold change of the input to the t-test (as is currently done) or fold change of the exponent?

I think the log(mean(exp(values))) as log fold change would also apply here

Does the name exp_post_agg make sense? exp_post_mean?

How about mean_in_log_space? True would be the current setting, and False would be the new one

@eroell
Copy link
Copy Markdown
Contributor

eroell commented Apr 21, 2026

Also, a side comment on my understanding of the issue: my agreement with the issue there is that I also consider a log fold change to compute the mean of a group in "count space".

Since data seems to lie around generally in log space prior to this function call, the issue mentions log(mean(exp(values))). The innermost exp is only there since the original values experienced a log1p already to become the values in this expression

From this point of view comes also my suggestion for the argument name of calling the current version log space.

For the tests, if a logfoldchange is needed for a test, I'd go with the new one in count space.

Sorry for the slow response to this issue, thank you so much for taking it up!

Comment thread src/scanpy/tools/_rank_genes_groups.py Outdated
corr_method: _CorrMethod = "benjamini-hochberg",
tie_correct: bool = False,
layer: str | None = None,
exp_post_agg: bool | Default = Default(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mean_in_log_space?

@eroell
Copy link
Copy Markdown
Contributor

eroell commented Apr 21, 2026

Regarding the unchanged benchmark, of which I did not get into the weeds;

Does this test numerically computed logfoldchange values, or rather rankings of genes generated by them?

@ilan-gold
Copy link
Copy Markdown
Contributor Author

Does this test numerically computed logfoldchange values, or rather rankings of genes generated by them?

Only speed

mean_in_log_space

I like this!

thanks for the reply!

@ilan-gold ilan-gold requested a review from flying-sheep April 22, 2026 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area – Reproducibility Exact replication

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rank_genes_groups logfoldchange different than seurat

3 participants