FEAT: Reapply mask to discarded mutations#412
Conversation
- following what Fede (federica.brando@irbbarcelona.org) did in the PR #153
- Also adding a new feature to extract flagged regions to a bed file. - Adding wrapper function to clean-up main. e On branch 118-reapply-mask-discarded-mutation
e Your branch is ahead of 'origin/118-reapply-mask-discarded-mutation' by 1 commit.
- add output as pipeline in 'module.config' - add output in `filtermaf` module - add output in the `mutationpreprocessing`subworkflow
- Added ANNOTATEPANELS with the VEP and other annotation steps - CREATEPANELS now only contains the creation of the modalities of panels
- Bed files emited by ANNOTATEPANEL are used by MUTPREPROCESSING - Added arguments to CREATECONSENSUSPANELS for the ANNOTATEPANEL subworkflow.
- Get the flagged bed into 0-based half-open coordinates. - Useful to use bedtools and filter the positions from the panels.
- publishDir was wrongly done, already corrected - uncommented the filtering logic in CREATEPANELS although it is not doing anything as of now
- Created new process using bedtools to remove positions from panel tsv file using the flagged bed obtained from MUTPREPROCESSING - Added this process to the CREATE_PANELS subworkflow - Added metadata about this process - Added flagged bed as input to CREATEPANELS
-Changed to ANNOTATEPANELS
- Also removed debug printing of channels
There was a problem hiding this comment.
Pull request overview
This pull request introduces a new filtering mechanism to remove variants from annotated panels that overlap with flagged genomic regions, significantly refactors the panel creation workflow into two distinct subworkflows (ANNOTATEPANELS and CREATEPANELS), and reapplies code reviews from PR #153. The changes enable masking of sites where filters (repetitive_variant, cohort_n_rich, other_sample_SNP) were applied to ensure these regions are properly excluded from downstream analysis.
Changes:
- Added a new FILTER_CAPTURED_PANEL module that uses bedtools intersect to remove variants overlapping flagged regions and track removed variants
- Split panel creation into two workflows: ANNOTATEPANELS (handles VEP annotation, custom annotation, and domain annotation) and CREATEPANELS (applies filtering and generates final panels)
- Refactored filter_cohort.py with improved code structure, logging, documentation, and a new extract_flagged_regions_bed function to output flagged positions in BED format
- Updated workflow dependencies so MUTPREPROCESSING outputs flagged_bed that feeds into CREATEPANELS for filtering
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| workflows/deepcsa.nf | Reorganized workflow to use ANNOTATEPANELS before MUTPREPROCESSING, then CREATEPANELS with flagged_bed input; updated references from CREATEPANELS to ANNOTATEPANELS outputs where appropriate |
| subworkflows/local/mutationpreprocessing/main.nf | Added flagged_bed emission from FILTERBATCH output to enable downstream filtering |
| subworkflows/local/createpanels/createpanels/main.nf | Refactored to accept complete_annotated_panel and flagged_bed inputs, applies FILTERCAPTUREDPANEL before generating panels, removed annotation logic moved to ANNOTATEPANELS |
| subworkflows/local/createpanels/createpanels/meta.yml | Added empty meta.yml file (requires documentation) |
| subworkflows/local/createpanels/annotatepanels/main.nf | New subworkflow containing annotation logic (VEP, custom annotation, domain annotation) previously in createpanels, generates initial panels for MUTPREPROCESSING |
| subworkflows/local/createpanels/annotatepanels/meta.yml | Comprehensive documentation for the annotation workflow with detailed input/output descriptions |
| modules/local/filtermaf/main.nf | Added flagged_muts output emission for flagged-pos.bed file |
| modules/local/filter_captured_panel/main.nf | New module that filters annotated panel using bedtools intersect to remove variants overlapping flagged regions, outputs both filtered panel and removed variants |
| modules/local/filter_captured_panel/meta.yml | Documentation for the filtering module |
| conf/modules.config | Split FILTERBATCH configuration to publish .bed files, disabled publishing for MERGEBATCH |
| conf/tools/panels.config | Updated process paths from CREATEPANELS to ANNOTATEPANELS for VEP and related processes |
| bin/utils.py | Improved documentation for add_filter function with proper parameter descriptions |
| bin/filter_cohort.py | Major refactoring: split monolithic main function into separate functions (flag_repetitive_variants, flag_cohort_n_rich, flag_other_samples_snp, flag_gnomad_snp, expand_filter_column, extract_flagged_regions_bed), added logging, improved documentation, added BED output generation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Publish FILTERCAPTUREDPANEL output into createpanels folder - Add bedtools to versions instead of python
FerriolCalvet
left a comment
There was a problem hiding this comment.
Looks good Marta!
I added some minor comments in some files but I also have some more general comments that I would like to discuss before merging, mainly because there are different masking scenarios that probably I did not acknowledge properly :pray-hands:
(I am trying to explain what I understood to make sure that I got it properly)
The way it works for the masking is by removing the positions from the panels let's say that there is a flagged position in the middle of an exon, this would result in the exon in the BED regions of the consensus panel being split in two right? and on that position not being in the consensus panel TSV.
OncodriveFML requires a BED file with the coordinates of the different genomic regions, this is probably not impacting at all, but we could look at the consequence of this change outside omega.
Initially I thought that a possibility could be to "just" putting the depths to 0 of that position and then this could have a similar effect to the final definitions of panels. The reality is that with the current implementation this works perfect for positions that need to be masked for the entire cohort, but does not allow the flexibility of removing a position only in one sample.
We can discuss whether we want to merge this implementation and then work on an additional processing that enables the sample-specific tuning of depths that could then impact the panel building process downstream.
Minor more general comments:
- do we want to use the ymls? I guess it does no harm here, but we should discuss it, and probably apply it as part of the clean branch
|
Thank you for the review Ferriol! I will check the comments tomorrow, but I wanted to answer the general comment so everything is here in case we need it. You're right, removing a position from the panel makes the regions in the bed to be split. About the ymls, I think it could be useful for new people trying to understand how the pipeline works. But I am okay with moving them to the clean branch. |
|
Some time ago I made some fixes to prevent the 0s problem Fede experienced above (4551e17), so hopefully this should not be a problem now. I don't remember if there were other methods for which this was also a problem but I think it is worth testing it again. And yes, agree with the ymls, I would leave the ones you created here and then try to add updated ones for all steps in the clean branch. let's discuss soon! |
|
We have decided to try applying the option of forcing the depth to 0 for those positions to be removed in specific samples. If it is too complex, we could merge this branch and do it later. |
- Now create_mask_matrix.py gets the cohort and sample beds. - If a position is in the cohort bed, it is masked in all the samples.
- If there are no filters, an empty bed is generated to avoid crushing the pipeline - Correct names in documentation
UpdateI have successfully applied the depth = 0 mask. The depths are mask correctly and the pipeline seems to be running without problems (omega, OncodriveFML, etc). I will write here some details of how it is done so we have it all written. Panel filtering approach
Depth = 0 approach
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 20 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- avoid creating a json file - improve bed files management in createmaskmatrix - change module.conf to join filter_criteria and filter_criteria_somatic and use them directly
- Updated the container to add polars - Updated code to use polars
- Change from filterbatch to writemaf to obtain the filters
- Removed the module generated to create the bed file - Removed the generation of the all_samples bed from the FILTERBATCH step - Added logic to write_maf.py - Removed filter configuration from FILTERBATCH and added to WRITEMAF. - This avoids adding more modules. Bad part is that bed files are created sequentially instead of in parallel. Good part is that we make sure that those bed files contain all the positions.
FerriolCalvet
left a comment
There was a problem hiding this comment.
looks great!!!
There are very minor things but I believe that once we merge this branch then we can go to tackle improvements in the namings of processes, variables and many things
The only change I think is worth applying and that requires a bit more of work is that the all samples flagged positions only includes the ones filtered in the entire cohort, or we could have one that includes all and another with only the cohort-wide ones. This could help inform us about positions that are to be discarded always
- There are two outputs now: - Cohort-wide -> filters that are applied to all samples equally - all -> cohort-wide filters + sample-specific filters
There was a problem hiding this comment.
perfect! much cleaner now, thanks
Copilot summary
This pull request introduces a new filtering step for annotated panels to remove variants overlapping flagged genomic regions, updates the panel creation workflow to integrate this filtering, and makes several related configuration and workflow changes. Additionally, some code reviews from PR153 are reapplied
Key changes:
New module for filtering captured panels:
filter_captured_panelprocess (modules/local/filter_captured_panel/main.nf) and its metadata (modules/local/filter_captured_panel/meta.yml), which filters annotated panels usingbedtools intersectto remove variants overlapping flagged regions. Outputs both filtered panels and a list of removed variants. [1] [2]Integration of filtering into panel creation workflow:
subworkflows/local/createpanels/createpanels/main.nf, previouslymain.nf) to include the new filtering step (FILTERCAPTUREDPANEL). The workflow now takes aflagged_bedinput fromMUTPREPROCESSINGand generates panels from the filtered annotated panel. The output now includes the list of removed variants. [1] [2] [3]subworkflows/local/createpanels/annotatepanels/main.nf). [1] [2]Workflow and configuration updates:
workflows/deepcsa.nf) to use the new, more modularCREATEPANELSandANNOTATEPANELSsubworkflows.flagged_bedoutput from theFILTERBATCHprocess, enabling downstream filtering.FILTERBATCHprocess inconf/modules.configto publish.bedfiles for flagged mutations, and adjustedMERGEBATCHand related processes to disable publishing..bedfiles to theFILTER_BATCHprocess inmodules/local/filtermaf/main.nf.