The workflow requires a set of genome reference files and settings to run. All files and settings for a genome are defined in a genome.json file. When launching the workflow, a reference genome is chosen by passing the path to a specific genome.json in the genome parameter (in params.yml or --genome)
The genome.json file includes
| Field | Description | Required? | Example |
|---|---|---|---|
| name | The name of the species / genome-version | Required | human |
| bowtie_index | Path to the bowtie2 index directory and prefix | Required | bowtie2/GRCh38 |
| gtf | Path to the gene annotation | Optional; used for TSS / promoter analysis | filteredGTF/GRCh38_transcripts.gtf |
| macs_genome | Name of the genome for the MACS peak-caller | optional; used to adjust MACS parameters | "hs" |
| speciesName | Name of the species for OrgDb | optional; used for ArchR analysis | Homo sapiens |
| archrAlias | Name of the genome for [ArchR] (https://www.archrproject.com/bookdown/getting-set-up.html) | optional; used for ArchR analysis | hg38 |
| annotationConvention | For non-standard ArchR genomes | optional; used for ArchR analysis | ENSEMBL |
- All files (
bowtie_index,gtf, ...) can be either- An absolute path (
/path/to/genome) - A relative path starting from the location of the
genome.jsonfile (genes/filtered.gtf) - A AWS S3 url (s3://path/to/genome)
- An absolute path (
- The ArchR settings are only needed for downstream analysis (clustering etc.) and optional
Pre-build reference genomes for human and mouse are available for download:
- Human: http://scale.pub.s3.amazonaws.com/genomes/grch38.tgz
- Mouse: http://scale.pub.s3.amazonaws.com/genomes/mm39.tgz
- Human / Mouse barnyard: http://scale.pub.s3.amazonaws.com/genomes/grch38_mm39.tgz
Download these to your analysis server, unpack them and then use e.g.
--genome path/to/genomes/grch38/grch38.json
The workflow includes a tool to pre-filter a GTF to a subset of features (genes / transcripts). This filtered GTF can then be used for analysis:
- Navigate to the directory with your initial
.gtffile. - Create a
gtfFilter.jsonby which to filter your GTF (example) - Install script dependency:
pip install pybedtools - Use script to filter your gtf:
ScaleTagToolkit/bin/filter_gtf.py -gtf <path/to/genes.gtf> -j <pathToFilterJson> -o <path/to/filtered.gtf>
- Replace the
gtffield of thegenome.jsonwith the your newly createdfiltered.gtf