Feature Request: Inner-join for modkit bedmethyl merge #625

@SuhasSrinivasan

Summary

modkit bedmethyl merge performs an outer join: a position is retained if it is present in any input.
When merging biological or technical replicates, it would be valuable to instead keep only positions that are reproducible across samples (present in at least N inputs), optionally requiring a minimum valid coverage in each sample.

Motivation

Pooling replicates with the current outer join keeps single-replicate positions, which can include non-reproducible/low-confidence calls. After the merge, replicate provenance is lost (the merged bedMethyl has no "number of contributing samples" field), so this filtering cannot be done downstream from the merged file alone. Doing it inside merge (which already reads each input per region via the tabix index) is natural and efficient.

Proposed feature

Two optional flags on bedmethyl merge, fully backward-compatible (defaults reproduce current outer join):

  • --min-samples <N> (default 1): output a position only if it appears in at least N input files. Setting N = number of inputs performs an inner join (keep only positions present in all samples).
  • --min-sample-coverage <C> (default 0): an input counts toward a position (both for the --min-samples tally and for the summed counts) only when its record at that position has a valid coverage of at least C.

Example (3 replicates, require presence in all three with a valid coverage of at least 5 in each):

modkit bedmethyl merge rep1.bed.gz rep2.bed.gz rep3.bed.gz \
  -o replicates_inner.bed -g genome_sizes.tsv \
  --min-samples 3 --min-sample-coverage 5
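
To make the intended semantics concrete, here is a minimal Python sketch of how the two flags could interact. It is an illustration only, not modkit's implementation: the function name `merge_positions`, the position key `(chrom, start, strand, mod_code)`, and the reduction of each record to a `(valid_cov, n_mod)` pair are all assumptions made for this example. A real bedMethyl record carries more count columns, which would be summed in the same way.

```python
from collections import defaultdict


def merge_positions(inputs, min_samples=1, min_sample_coverage=0):
    """Merge per-sample records under the proposed filtering rules.

    inputs: list of dicts, one per sample, mapping a position key to a
            (valid_cov, n_mod) tuple. Hypothetical simplified record.
    Returns a dict mapping position key -> (valid_cov, n_mod, n_samples).
    """
    # Running totals per position: [valid_cov, n_mod, n_samples]
    totals = defaultdict(lambda: [0, 0, 0])
    for records in inputs:
        for key, (valid_cov, n_mod) in records.items():
            # --min-sample-coverage: a low-coverage record contributes
            # neither to the sample tally nor to the summed counts.
            if valid_cov < min_sample_coverage:
                continue
            entry = totals[key]
            entry[0] += valid_cov
            entry[1] += n_mod
            entry[2] += 1
    # --min-samples: keep positions seen in at least N qualifying inputs.
    return {
        key: tuple(entry)
        for key, entry in totals.items()
        if entry[2] >= min_samples
    }


# Three toy replicates (made-up numbers, for illustration only).
site_a = ("chr1", 100, "+", "m")
site_b = ("chr1", 200, "+", "m")
rep1 = {site_a: (10, 8), site_b: (6, 1)}
rep2 = {site_a: (7, 5), site_b: (3, 0)}
rep3 = {site_a: (9, 9)}

# Defaults reproduce the current outer join: both sites are kept.
outer = merge_positions([rep1, rep2, rep3])

# Inner join with a per-sample coverage floor: only site_a survives.
inner = merge_positions(
    [rep1, rep2, rep3], min_samples=3, min_sample_coverage=5
)
```

With the defaults, `site_b` is kept even though it is missing from `rep3`. With `min_samples=3` and `min_sample_coverage=5`, it is dropped: `rep2` falls below the coverage floor at that site and `rep3` has no record there, so only one input qualifies. The `n_samples` tally is returned here only to make the filtering visible; whether the merged output should expose it is a separate question.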
