GENEFLOW is the latest version of the Illumina sequencing processing workflow at NYU's Center for Genomics and Systems Biology. The pipeline is built with Nextflow and runs the following steps:
- Archive the Run Directory
- Basecalling
- Demultiplexing (Optional)
- Demultiplexing Reports
- Data Merging
- FastQC Reports
- MultiQC Report
- Data Delivery
The pipeline integrates with TuboWeb, a web-based platform for NGS data analysis and visualization, to retrieve run metadata and customize run parameters.
To successfully deploy and run the GENEFLOW pipeline, follow these setup steps:
Modify the launch script (launch.sh) to include your email address:
#SBATCH --mail-user=your_netID@nyu.edu
Global Configuration in Nextflow
Configure the global variables in nextflow.config as follows:
- alpha: The primary work directory for the pipeline.
- tmp_dir: Temporary working directory for Picard tools.
- fastqc_path: Destination for rsyncing FastQC files (e.g., web server).
- archive_path: Destination for archived run directories.
- admin_email: Email for pipeline administration notifications.
- Set up the module paths
- Specify the workDir
- Configure email settings
TuboWeb API Configuration
- API path
- User credentials
- API key
File Delivery and Storage Paths
- delivery_folder_root: Destination for FastQ files.
- raw_run_dir_delivery_root: Destination for raw run directories.
- raw_run_root: Storage location for raw run directories.
- alpha: The primary work directory for the pipeline (as in nextflow.config).
Gmail Credentials
Set the Gmail user and password for email notifications.
For ease of use, a launch.sh script is provided to initiate the pipeline. It takes two required parameters and one optional parameter:
- Run Directory Path
- Flowcell ID
- Optional: Entry point for the pipeline (used to resume the pipeline from a specific step)
- Basic launch:
launch.sh /scratch/gencore/sequencers/{machine_name}/{run_dir_name} {fcid}
- Specific Example Launch:
launch.sh /scratch/gencore/sequencers/NB502067/240124_NB502067_0578_AHKFT5BGXV HKFT5BGXV
- Launch with Entry Point (for resuming at a specific step such as demultiplexing, e.g. 'demux'):
launch.sh /scratch/gencore/sequencers/{machine_name}/{run_dir_name} {fcid} {entry}
- Example with Entry Point:
launch.sh /scratch/gencore/sequencers/NB502067/240124_NB502067_0578_AHKFT5BGXV HKFT5BGXV demux
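Because the run directory and flowcell ID are passed separately, a mismatched pair is an easy mistake to make. The sketch below is a hypothetical pre-flight check and is not part of launch.sh. It assumes, based only on the examples above, that the run directory name ends with the flowcell ID; `check_launch_args` is an illustrative name.

```shell
# Hypothetical pre-flight check, not part of launch.sh.
# Assumption (from the examples above): the run directory name ends with
# the flowcell ID, e.g. ..._AHKFT5BGXV for flowcell HKFT5BGXV.
check_launch_args() {
    run_dir=${1%/}   # drop one trailing slash, if present
    fcid=$2
    if [ -z "$run_dir" ] || [ -z "$fcid" ]; then
        echo "usage: check_launch_args <run_dir> <fcid>" >&2
        return 2
    fi
    # Compare the flowcell ID against the last path component only.
    case "${run_dir##*/}" in
        *"$fcid") return 0 ;;
        *)
            echo "flowcell ID '$fcid' does not match run directory '${run_dir##*/}'" >&2
            return 1
            ;;
    esac
}
```

It could be called before launch.sh, for example `check_launch_args "$run_dir" "$fcid" && launch.sh "$run_dir" "$fcid"`.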
In a production environment, launch.sh is typically submitted as a Slurm batch job with sbatch. Create the directories for the error and output files beforehand, because Slurm does not create them:
mkdir -p /scratch/gencore/GENEFLOW/alpha/logs/HKFT5BGXV/pipeline
sbatch --output=/scratch/gencore/GENEFLOW/alpha/logs/HKFT5BGXV/pipeline/slurm-%j.out \
--error=/scratch/gencore/GENEFLOW/alpha/logs/HKFT5BGXV/pipeline/slurm-%j.err \
--job-name="GENEFLOW_MANAGER_(HKFT5BGXV)" \
launch.sh /scratch/gencore/sequencers/NB502067/240124_NB502067_0578_AHKFT5BGXV HKFT5BGXV
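The submission above repeats the flowcell ID in four places, so it may be convenient to derive everything from the two launch arguments. The following is a minimal sketch under that assumption. `pipeline_log_dir`, `submit_run`, `LOG_ROOT`, and `DRY_RUN` are hypothetical names introduced here, and the log root is copied from the example above.

```shell
# Hypothetical helper, not shipped with the pipeline.
# LOG_ROOT mirrors the log path used in the example above.
LOG_ROOT="${LOG_ROOT:-/scratch/gencore/GENEFLOW/alpha/logs}"

# Print the per-flowcell log directory for a given flowcell ID.
pipeline_log_dir() {
    printf '%s/%s/pipeline\n' "$LOG_ROOT" "$1"
}

# Create the log directory, then submit launch.sh with sbatch.
# With DRY_RUN=1 the commands are printed instead of executed.
submit_run() {
    run_dir=$1
    fcid=$2
    log_dir=$(pipeline_log_dir "$fcid")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mkdir -p $log_dir"
        echo "sbatch --output=$log_dir/slurm-%j.out --error=$log_dir/slurm-%j.err --job-name='GENEFLOW_MANAGER_($fcid)' launch.sh $run_dir $fcid"
        return 0
    fi
    # Slurm needs the output/error directory to exist before the job starts.
    mkdir -p "$log_dir" || return 1
    # The job name is quoted because unquoted parentheses are shell syntax.
    sbatch --output="$log_dir/slurm-%j.out" \
           --error="$log_dir/slurm-%j.err" \
           --job-name="GENEFLOW_MANAGER_($fcid)" \
           launch.sh "$run_dir" "$fcid"
}
```

Setting `DRY_RUN=1` before calling `submit_run` prints the two commands so they can be reviewed before anything is submitted.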
The pipeline includes a regression test suite that compares new pipeline output against known-good (ground truth) results.
- Establish ground truth: Run the pipeline normally on a real run directory. The QC reports produced by MultiQC (demux report, run stats summary, undetermined barcodes) serve as the baseline. Copy these report files to a persistent ground truth directory (e.g. /home/gencore/GENEFLOW_TESTS_GT/<fcid>/).
- Prepare a test run directory: Copy the original sequencer run directory and rename it with a test flowcell ID (e.g. 000000000-TEST1). This prevents test runs from writing logs, deliveries, and other artifacts into the production directories for the real run.
- Configure the test: Create a test config file (e.g. test-miseq.config) that sets:
  - params.run_dir_path: path to the renamed test run directory
  - params.truth_dir: path to the ground truth reports saved in step 1
- Run the test: Submit launch_tests.sh via Slurm. It runs the pipeline with the --test flag, which:
  - Skips the deliver process (no emails sent, no data delivered)
  - Enables the compare_runs process, which calls compare_runs.py to diff the new MultiQC reports against the ground truth files
- Results: compare_runs.py compares three report types (demux report, run stats summary, undetermined barcodes) cell-by-cell. It reports PASS if all values match within tolerance, or FAIL with a detailed breakdown of differences. Exit code 0 = pass, 2 = fail.
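The exit codes make the comparison easy to script. The sketch below maps an exit code to a status label. `describe_compare_exit` is a hypothetical name, and treating any code other than 0 or 2 as an unexpected error is an assumption of this sketch, not documented behavior of compare_runs.py.

```shell
# Hypothetical wrapper around the documented exit codes of compare_runs.py.
# 0 = pass and 2 = fail come from the text above; treating every other
# code as an unexpected error is an assumption of this sketch.
describe_compare_exit() {
    case "$1" in
        0) echo "PASS" ;;
        2) echo "FAIL" ;;
        *) echo "ERROR (unexpected exit code $1)" ;;
    esac
}
```

After running the comparison in a shell, `describe_compare_exit $?` prints the corresponding label.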
- After a successful production run of flowcell H323NDRX7, copy its reports
mkdir -p /home/gencore/GENEFLOW_TESTS_GT/H323NDRX7
cp /path/to/multiqc/output/*.txt /home/gencore/GENEFLOW_TESTS_GT/H323NDRX7/
- Copy and rename the run directory
cp -r /scratch/gencore/sequencers/A01097/250822_A01097_0361_AH323NDRX7 \
/scratch/gencore/sequencers/A01097/250822_A01097_0361_ANOVATEST1
- Create test-novaseq.config pointing to both paths
- Update launch_tests.sh with the new fcid/config, then sbatch launch_tests.sh
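The copy-and-rename step in this example can be sketched as a small helper. It assumes the run directory name ends in an underscore-separated flowcell field, as in the paths above, and replaces only that last field. `test_run_dir` and `make_test_run_dir` are hypothetical names and are not part of the pipeline.

```shell
# Hypothetical helper: build the test run directory path by replacing the
# last underscore-separated field of the original run directory name.
test_run_dir() {
    orig=${1%/}      # drop one trailing slash, if present
    new_field=$2
    printf '%s_%s\n' "${orig%_*}" "$new_field"
}

# Copy the original run directory to the renamed test location.
# Refuses to overwrite an existing destination.
make_test_run_dir() {
    dest=$(test_run_dir "$1" "$2")
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        return 1
    fi
    cp -r "${1%/}" "$dest"
}
```

For the example above, `make_test_run_dir /scratch/gencore/sequencers/A01097/250822_A01097_0361_AH323NDRX7 ANOVATEST1` would produce the same destination path as the cp -r command shown.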