Summary
modkit bedmethyl merge performs an outer join: a position is retained if it is present in any input.
When merging biological or technical replicates, it would be valuable to keep only positions that are reproducible across samples instead (present in at least N inputs), optionally requiring a minimum valid coverage per sample.
Motivation
Pooling replicates with the current outer join keeps single-replicate positions, which can include non-reproducible/low-confidence calls. After the merge, replicate provenance is lost (the merged bedMethyl has no "number of contributing samples" field), so this filtering cannot be done downstream from the merged file alone. Doing it inside merge (which already reads each input per region via the tabix index) is natural and efficient.
Proposed feature
Two optional flags on bedmethyl merge, fully backward-compatible (defaults reproduce current outer join):
--min-samples <N> (default 1): output a position only if it appears in at least N input files. Setting N = number of inputs performs an inner join (keep only positions present in all samples).
--min-sample-coverage <C> (default 0): an input only counts toward a position (both for the --min-samples tally and for the summed counts) when that input's record has at least C valid coverage.
Example (3 replicates, require presence in all three with >= 5 valid coverage each):
    modkit bedmethyl merge rep1.bed.gz rep2.bed.gz rep3.bed.gz \
        -o replicates_inner.bed -g genome_sizes.tsv \
        --min-samples 3 --min-sample-coverage 5
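To make the intended semantics concrete, here is a minimal Python sketch of how the two proposed flags could interact. It is not modkit code: the `merge_replicates` function, the simplified record shape (a key mapped to valid coverage and modified count), and the sample data are all hypothetical, and real bedMethyl rows carry more columns.

```python
from collections import defaultdict

# Hypothetical simplified record: key -> (valid_coverage, n_modified).
# Key is (chrom, start, strand, mod_code); real bedMethyl has more fields.
Key = tuple[str, int, str, str]
Sample = dict[Key, tuple[int, int]]


def merge_replicates(samples, min_samples=1, min_sample_coverage=0):
    """Sum counts per position, keeping positions seen in >= min_samples inputs.

    An input contributes to a position (to both the sample tally and the
    summed counts) only if its valid coverage is >= min_sample_coverage.
    """
    # key -> [contributing inputs, summed valid coverage, summed modified]
    tally = defaultdict(lambda: [0, 0, 0])
    for sample in samples:
        for key, (valid_cov, n_mod) in sample.items():
            if valid_cov < min_sample_coverage:
                continue  # this input does not count toward this position
            entry = tally[key]
            entry[0] += 1
            entry[1] += valid_cov
            entry[2] += n_mod
    return {
        key: (cov, n_mod)
        for key, (n_inputs, cov, n_mod) in sorted(tally.items())
        if n_inputs >= min_samples
    }


# Illustrative data only (three replicates, one modification code).
rep1 = {("chr1", 100, "+", "m"): (10, 8), ("chr1", 200, "+", "m"): (3, 1)}
rep2 = {("chr1", 100, "+", "m"): (6, 3), ("chr1", 200, "+", "m"): (7, 2)}
rep3 = {("chr1", 100, "+", "m"): (5, 5), ("chr1", 300, "+", "m"): (9, 0)}

outer = merge_replicates([rep1, rep2, rep3])  # defaults: current outer join
inner = merge_replicates([rep1, rep2, rep3], min_samples=3, min_sample_coverage=5)
```

With the defaults, all three positions are kept, which matches the current outer join. With `min_samples=3` and `min_sample_coverage=5`, only `chr1:100` survives: `chr1:200` has a single input that passes the coverage threshold, and `chr1:300` is present in only one replicate.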