
Submit tasks as job arrays and fix RNAs in distill summaries#472

Open
tpall wants to merge 25 commits into WrightonLabCSU:dev from tpall:dev

Conversation


@tpall commented Nov 27, 2025

Changes

  • Added an array_size = 10 parameter to nextflow.config and an array directive to conf/base.config for more efficient cluster execution.
  • Fixed the inclusion of rRNA, tRNA, and QUAST summaries in genome_stats.tsv and metabolism_summary.xlsx in the bin/distill.py script.
  • Refactored channel usage (Channel to channel) for consistency across workflows, and replaced the implicit closure variable with an explicit one (e.g. it.name to it -> it.name) for readability.
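A minimal sketch of the closure change described above (the file glob and variable names are placeholders, not the actual DRAM2 code):

```groovy
// Before: implicit `it` variable inside the closure.
channel.fromPath('*.fa').map { it.baseName }

// After: explicit, named closure parameter for readability.
channel.fromPath('*.fa').map { fasta -> fasta.baseName }
```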

Computing environment and command

  • nextflow version 25.10.0.10289
  • openjdk 22.0.1-internal 2024-04-16
  • singularity 3.8.5
  • slurm
  • x86_64 GNU/Linux
nextflow run tpall/DRAM -r dev --input_fasta ./DRAM/input_fasta --outdir ./DRAM/call-annotate-distill --threads 8 --summarize --qc --use_kofam --use_dbcan --use_merops --use_viral --use_methyl --use_sulfur -profile singularity --slurm --partition main -with-report -with-trace -with-timeline --array_size 10 --queue_size 10 -resume --annotate 

@madeline-scyphers madeline-scyphers self-requested a review December 1, 2025 21:14

madeline-scyphers (Member) commented Dec 1, 2025

Hey @tpall, thanks for this. The job array addition is nice. There is a larger planned update to batch many of the inputs into single jobs to reduce the burden on the queue, since running DRAM with lots of inputs can overwhelm a SLURM scheduler. Job arrays weren't supported on the version of Nextflow we initially developed DRAM2 on, but we recently moved to >=24 (we should lock in >=24.04.0 if we add in job arrays, since there are early 24.* prereleases out there). I will have to do some testing on job arrays and their implications, because from my initial testing it seems like they stop the next step from proceeding until there are enough inputs to fill an array. Which might be ok. But if we are going to be doing batching anyway, job arrays might not be that important and not worth it.

Also thanks for some of the other QoL updates like updating some of the syntax to DSL2 (Channel -> channel, etc.).

I will have to review the code more fully, which I can get to in a couple of weeks. I have a deadline next week, and probably won't be able to review much before then.

But I will leave just a couple of quick thoughts.

Thanks again

Comment thread conf/base.config
}

withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
array = params.array_size

I would like to support people running DRAM2 with local executor (such as on their own computer if they want), which doesn't support array. So the array should only be used with executors that support it.
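One way to do this (a hedged sketch, not a tested change: the profile name and process selectors mirror the PR's pattern but are illustrative) is to scope the array directive to the SLURM profile so the local executor, which lacks array support, never sees it:

```groovy
// Sketch: only the slurm profile applies the array directive.
// Runs with the local executor use the default process settings.
profiles {
    slurm {
        process {
            executor = 'slurm'
            withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
                array = params.array_size
            }
        }
    }
}
```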

Comment thread conf/base.config Outdated
maxRetries = 2
}

withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {

jobs under DRAM:ANNOTATE:QC:COLLECT.* could also have a job array

Comment thread conf/base.config Outdated
Comment on lines +68 to +71
withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
array = params.array_size
}


this code here in base.config for the job array should probably be in modules.config
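The move could look roughly like the following fragment in conf/modules.config (a sketch assuming the pipeline keeps per-process directives there, as nf-core-style pipelines conventionally do):

```groovy
// conf/modules.config -- per-process directives live here instead of base.config.
process {
    withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
        array = params.array_size
    }
}
```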

tpall and others added 10 commits December 18, 2025 15:20
Adds a DECOMPRESS_FASTA module (bbtools reformat.sh in the existing
bbmap container) and routes only .gz inputs through it via a branch
on the fasta channel. Basename stripping is unified so sample.fa and
sample.fa.gz produce the same downstream name, keeping outputs
identical regardless of input compression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
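The branch-and-merge routing described in this commit could be sketched as follows (channel and module names are illustrative, not the exact DRAM2 identifiers):

```groovy
workflow {
    // Split inputs: only .gz files go through decompression.
    ch_fasta = channel.fromPath(params.input_fasta)
        .branch {
            gz:    it.name.endsWith('.gz')
            plain: true
        }

    // Decompress the gzipped branch, then merge both branches back
    // into a single channel for downstream processes.
    ch_ready = DECOMPRESS_FASTA(ch_fasta.gz).mix(ch_fasta.plain)
}
```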
It was already defined in nextflow.config (default 10) and consumed by
conf/base.config, but absent from nextflow_schema.json, so runs emitted
a schema-validation warning. Added alongside queue_size under Process
Options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the never-included trees module and all scripts only it referenced
(parse_annotations.py, update_annots_trees.py, color_labels.R), plus
update_tree.py which had no references at all. Also removes
assets/trees/ refpkgs (only consumed by the dropped module) and the
DRAM-v1 standalone DB setup scripts under assets/internal/ which were
never wired into the DSL2 pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both are unreferenced DRAM-v1 helpers under bin/assets/forms/. They
shell out to a DRAM-setup.py CLI that isn't part of the DSL2 pipeline,
and upstream has already removed them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight references used DB_CHANNEL_SETUP (all caps) but the workflow is
defined as DB_channel_SETUP. Groovy is case-sensitive so the references
failed at runtime with "No such variable: DB_CHANNEL_SETUP".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tpall and others added 2 commits April 25, 2026 23:02
Variable was declared as formattedOutputchannels (lowercase 'c') at
line 124, but seven references used formattedOutputChannels (uppercase
'C'), tripping MissingPropertyException at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mmseqs2 index files (*.mmsdb, .mmsdb.idx, .mmsdb_h, ...) are
intermediate, memory-mapped artifacts only consumed by MMSEQS_SEARCH
via the Nextflow channel; they are never read from results/. Copying
them per-bin can be hundreds of GB and was filling GPFS quota,
causing publishDir copy failures that aborted the pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
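One common way to exclude such artifacts (a sketch under the assumption that the process publishes via publishDir; the output path and filename pattern are hypothetical) relies on saveAs returning null, which tells Nextflow to skip publishing that file:

```groovy
// Sketch: publish DB-search outputs, but never copy the large
// mmseqs2 index files (*.mmsdb, *.mmsdb.idx, *.mmsdb_h, ...).
publishDir "${params.outdir}/db_search", mode: 'copy',
    saveAs: { filename -> filename.contains('.mmsdb') ? null : filename }
```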
