Merged
Binary file removed images/AToL-architecture.png
Binary file added images/atol_and_insdc.png
116 changes: 87 additions & 29 deletions index.md
@@ -4,68 +4,124 @@ description: Documentation for the Australian Tree of Life (AToL)
toc: false
---

## About the Australian Tree of Life
The Australian Tree of Life Bioinformatics team (AToL Bioinformatics) is
developing infrastructure for the rapid generation and publication of genome
assemblies and annotations. The current focus of AToL Bioinformatics is
optimising and automating these processes in the Genome Engine.

The Australian Tree of Life (AToL) project is developing infrastructure for the rapid generation and publication of genome assemblies and annotations. The current focus of the AToL project is optimising and automating these processes in the Genome Engine.
Our Genome Engine was inspired by the Wellcome Sanger Institute's [Genome
Engine](https://www.sanger.ac.uk/tool/genome-engine/), and uses some of their
pipelines under the hood.

You can learn more about the initiative here: [Australian Tree of Life](https://www.biocommons.org.au/atol)

The AToL Genome Engine was inspired by the Wellcome Sanger Institute's [Genome Engine](https://www.sanger.ac.uk/tool/genome-engine/), and uses some of their pipelines under the hood.
You can learn more about the Australian Tree of Life activity
[here](https://www.biocommons.org.au/atol).

## What does the Genome Engine do?

The AToL Genome Engine is an automated workflow for assembling and annotating genome sequences from raw sequence data, brokering data to International Nucleotide Sequence Database Collaboration (INSDC) repositories, and drafting short Genome Notes.
The Genome Engine is a semi-automated workflow for assembling and annotating
genome sequences from raw sequence data, brokering data to International
Nucleotide Sequence Database Collaboration (INSDC) repositories, and drafting
short Genome Notes.

This involves:
- Ingesting raw sequence data from the [Bioplatforms Australia Data Portal](https://data.bioplatforms.com/)
- Ingesting raw sequence data from the [Bioplatforms Australia Data
Portal](https://data.bioplatforms.com/)
- Processing sampling and sequencing metadata
- Assembling genome sequences from sequence read data
- Annotating assembled genomes
- Brokering sample metadata, sequence reads, and genome assemblies to the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home) (ENA)
- Generating an automatic Genome Note providing details and metrics about sampling, sequencing and assembly
- Brokering sample metadata, sequence reads, and genome assemblies to the
[European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home) (ENA)
- Generating an automatic Genome Note providing details and metrics about
sampling, sequencing and assembly

At present, the Genome Engine is configured to ingest data generated as part of [Bioplatforms Australia’s](https://bioplatforms.com/) Framework Initiatives which are available from the Bioplatforms Australia Data Portal. In future, we intend to make the Genome Engine available to any Australian researcher for use with their own sequencing data.
At present, the Genome Engine is configured to ingest data generated as part of
[Bioplatforms Australia’s](https://bioplatforms.com/) Framework Initiatives,
which are available from the Bioplatforms Australia Data Portal. In future, we
intend to make the Genome Engine available to any Australian researcher for use
with their own sequencing data.

## How does the Genome Engine work?

### Data retrieval and processing

The Genome Engine accesses sequence data and metadata in bulk from the [Bioplatforms Australia Data Portal](https://data.bioplatforms.com/) API. The metadata are provided by the collecting researcher and sample preparation and sequencing facilities.
The Genome Engine accesses sequence data and metadata in bulk from the
[Bioplatforms Australia Data Portal](https://data.bioplatforms.com/) API. The
metadata are provided by the collecting researcher and sample preparation and
sequencing facilities.

Packages are filtered to select those relevant to genome assembly and annotation, and metadata are validated and mapped to an intermediary, INSDC-compliant schema.
Packages are filtered to select those relevant to genome assembly and
annotation, and metadata are validated and mapped to an intermediary,
INSDC-compliant schema.
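
The filtering and mapping step can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the field names, the `data_type` labels, and the functions `is_relevant` and `map_metadata` are hypothetical and do not reflect the Genome Engine's actual schema or code.

```python
from typing import Any

# Hypothetical source-to-target field names; the real intermediary schema
# is not shown in this documentation.
FIELD_MAP = {
    "scientific_name": "organism",
    "taxon_id": "tax_id",
    "collection_date": "collection_date",
}

# Hypothetical labels for package types relevant to assembly and annotation.
RELEVANT_TYPES = {"pacbio-hifi", "hi-c"}


def is_relevant(package: dict[str, Any]) -> bool:
    """Keep only packages whose data type is used for assembly or annotation."""
    return package.get("data_type") in RELEVANT_TYPES


def map_metadata(package: dict[str, Any]) -> dict[str, Any]:
    """Validate required fields, then rename them to the target schema."""
    missing = [field for field in FIELD_MAP if not package.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    return {target: package[source] for source, target in FIELD_MAP.items()}
```

Validating before mapping means an incomplete package fails early with a list of the fields that need attention, rather than producing a partial record.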

Taxon and sample identifiers are extracted to determine which packages can be combined in the assembly process and to retrieve species information from the [NCBI Taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy).
Taxon and sample identifiers are extracted to determine which packages can be
combined in the assembly process and to retrieve species information from the
[NCBI Taxonomy](http://www.ncbi.nlm.nih.gov/taxonomy).
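
As a rough illustration of how identifiers might determine which packages are combined, the sketch below groups packages by a (taxon, sample) key. The key choice and the field names `taxon_id`, `sample_id`, and `package_id` are assumptions; the actual grouping rules used by the Genome Engine may differ.

```python
from collections import defaultdict
from typing import Any


def group_packages(
    packages: list[dict[str, Any]],
) -> dict[tuple[Any, Any], list[str]]:
    """Group package IDs by (taxon_id, sample_id); all names are hypothetical."""
    groups: dict[tuple[Any, Any], list[str]] = defaultdict(list)
    for package in packages:
        key = (package["taxon_id"], package["sample_id"])
        groups[key].append(package["package_id"])
    return dict(groups)
```

Packages that share a key would be candidates for a single assembly, while packages with different keys are kept apart.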

### Genome assembly and annotation

Sequence read data are processed and assembled on High-Performance Computing (HPC) facilities at the [Pawsey Supercomputing Research Centre](https://pawsey.org.au/),
provided by the [Australian BioCommons Leadership Share](https://www.biocommons.org.au/ables) (ABLeS) program.
Sequence read data are processed and assembled on High-Performance Computing
(HPC) facilities at the [Pawsey Supercomputing Research
Centre](https://pawsey.org.au/), provided by the [Australian BioCommons
Leadership Share](https://www.biocommons.org.au/ables) (ABLeS) program.

The assembly pipeline used is an adaptation of the [Sanger Tree of Life (ToL) assembly pipeline](https://pipelines.tol.sanger.ac.uk/genomeassembly), which includes the following steps:
The assembly pipeline used is an adaptation of the [Sanger Tree of Life (ToL)
assembly pipeline](https://pipelines.tol.sanger.ac.uk/genomeassembly), which
includes the following steps:
- assembly using [hifiasm](https://github.com/chhylp123/hifiasm)
- redundant contig removal with [purge_dups](https://github.com/dfguan/purge_dups)
- optional haplotype resolution with hifiasm and scaffolding with [YaHS](https://github.com/c-zhou/yahs) if Hi-C data is available
- redundant contig removal with
[purge_dups](https://github.com/dfguan/purge_dups)
- optional haplotype resolution with hifiasm and scaffolding with
[YaHS](https://github.com/c-zhou/yahs) if Hi-C data are available
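
The conditional nature of the steps listed above can be sketched as a simple planning function. This is only an illustration: `plan_assembly_steps` is a hypothetical name, the step labels are descriptive strings rather than real commands, and the order follows the list above, which is not necessarily the order in which the pipeline executes them.

```python
def plan_assembly_steps(has_hic: bool) -> list[str]:
    """Return the listed assembly steps, adding the Hi-C steps only if available."""
    steps = [
        "contig assembly (hifiasm)",
        "redundant contig removal (purge_dups)",
    ]
    if has_hic:
        # These steps are optional and depend on Hi-C data being present.
        steps.append("haplotype resolution (hifiasm)")
        steps.append("scaffolding (YaHS)")
    return steps
```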

Quality assessment and annotation of assembled genomes are currently in development.
Oxford Nanopore-based contig building and scaffolding are currently in
development, along with quality assessment and annotation of assembled genomes.

### Data brokering

The data broker component of the Genome Engine uses sample, sequencing, and assembly metadata to submit files automatically to the ENA. BioSample information is submitted using the [ToL sample checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000053), a minimum standard for sample metadata devised by the [Darwin Tree of Life project](https://www.darwintreeoflife.org/) to facilitate data contextualisation and interoperability. Experiment, read, and assembly data are submitted according to ENA’s standards and schemas. In order to comply with these standards, certain metadata fields in the original Bioplatforms metadata must be filled and vocabulary terms used (see the [FAQ](https://australianbiocommons.github.io/atol/faq) for more information about metadata requirements). AToL’s metadata mapping processes allow for these metadata to be formatted in XML files for programmatic submission to the ENA.

The submitted XML files include the data release date, which is determined according to the embargo release date specified in the Bioplatforms data portal. Once records are made public on their release date, they are exchanged with and made available from other INSDC databases at the US National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) and the DNA Data Bank of Japan ([DDBJ](https://www.ddbj.nig.ac.jp/index-e.html)).
The data broker component of the Genome Engine uses sample, sequencing, and
assembly metadata to submit files automatically to the ENA. BioSample
information is submitted using the [ToL sample
checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000053), a minimum
standard for sample metadata devised by the [Darwin Tree of Life
project](https://www.darwintreeoflife.org/) to facilitate data
contextualisation and interoperability. Experiment, read, and assembly data are
submitted according to ENA’s standards and schemas. To comply with these
standards, certain fields in the original Bioplatforms metadata must be
completed and the required vocabulary terms used (see the
[FAQ](https://australianbiocommons.github.io/atol/faq) for more information
about metadata requirements). The Genome Engine's metadata mapping processes
allow for these metadata to be formatted in XML files for programmatic
submission to the ENA.

The submitted XML files include the data release date, which is determined
according to the embargo release date specified in the Bioplatforms data
portal. Once records are made public on their release date, they are exchanged
with and made available from other INSDC databases at the US National Center
for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) and the
DNA Data Bank of Japan ([DDBJ](https://www.ddbj.nig.ac.jp/index-e.html)).
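
To show how a release date might be carried in a submission file, here is a minimal sketch that builds a submission XML with an `ADD` action and a `HOLD` action. The element and attribute names follow our understanding of ENA's programmatic submission format and should be checked against the current ENA documentation; the function name `build_submission_xml` is hypothetical and this is not the Genome Engine's own code.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_submission_xml(release_date: date) -> str:
    """Build a submission XML that adds records and holds them until a date."""
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")

    # Action 1: add the accompanying objects (samples, runs, and so on).
    add_action = ET.SubElement(actions, "ACTION")
    ET.SubElement(add_action, "ADD")

    # Action 2: keep the records private until the embargo release date.
    hold_action = ET.SubElement(actions, "ACTION")
    ET.SubElement(hold_action, "HOLD", HoldUntilDate=release_date.isoformat())

    return ET.tostring(submission, encoding="unicode")
```

For example, `build_submission_xml(date(2030, 1, 1))` returns a document whose `HOLD` element has `HoldUntilDate="2030-01-01"`.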

### Genome Note generation

Once a genome has been assembled, a Genome Note document is generated, outlining key metadata and assembly metrics. The Genome Note pipeline populates a template document with metadata values relating to taxonomy, specimen collection, nucleic acid extraction, sequencing, and assembly, and key metrics calculated in the assembly pipeline. The Genome Note also contains the accession numbers generated during brokering to the ENA. The project lead and project collaborators (as they are listed in the Bioplatforms metadata) are named as first and second authors.
Once a genome has been assembled, a Genome Note document is generated,
outlining key metadata and assembly metrics. The Genome Note pipeline populates
a template document with metadata values relating to taxonomy, specimen
collection, nucleic acid extraction, sequencing, and assembly, and key metrics
calculated in the assembly pipeline. The Genome Note also contains the
accession numbers generated during brokering to the ENA. The project lead and
project collaborators (as they are listed in the Bioplatforms metadata) are
named as first and second authors.
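
Template population of this kind can be sketched with Python's standard `string.Template`. The template wording, the placeholder names, and the function `render_genome_note` are all hypothetical; the real Genome Note template and its fields are not reproduced here.

```python
from string import Template

# Hypothetical one-sentence template; the real Genome Note is a full document.
NOTE_TEMPLATE = Template(
    "We present a genome assembly of $species. "
    "The assembly spans $assembly_length_mb Mb with a contig N50 of "
    "$contig_n50_mb Mb. The assembly accession is $assembly_accession."
)


def render_genome_note(values: dict[str, str]) -> str:
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return NOTE_TEMPLATE.substitute(values)
```

Using `substitute` rather than `safe_substitute` makes a missing metadata value fail loudly instead of leaving an unfilled placeholder in the draft.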

Genome Notes will be made available to researchers prior to release to provide an opportunity to manually edit and add content.
Genome Notes will be made available to researchers prior to release to provide
an opportunity to manually edit and add content.

![Diagram of overall AToL architecture](images/AToL-architecture.png)
*Australian Tree of Life architecture overview. Note: the interactive AToL web application is currently in development.*
![Diagram of genome engine data flow](./images/atol_and_insdc.png) *Genome
Engine data flow.*

## Partners

The Australian Tree of Life is a collaborative initiative. It is co-funded by Bioplatforms Australia and the Minderoo Foundation, and supported by project partners at the University of Melbourne and QCIF. The AGRF are hosting a PhD student intern.
AToL Bioinformatics is co-funded by Bioplatforms Australia and the Minderoo
Foundation, and supported by project partners at the University of Melbourne
and QCIF. The AGRF are hosting a PhD student intern.

Bioplatforms Australia is enabled by NCRIS.

@@ -83,6 +139,8 @@ Bioplatforms Australia is enabled by NCRIS.

## Acknowledgements

This documentation page makes use of the [ELIXIR toolkit theme](https://github.com/ELIXIR-Belgium/elixir-toolkit-theme).
This documentation page makes use of the [ELIXIR toolkit
theme](https://github.com/ELIXIR-Belgium/elixir-toolkit-theme).

{% include image.html file="elixir-toolkit-theme_logo.svg" alt="Elixir Toolkit Theme logo" max-width="15em" %}
{% include image.html file="elixir-toolkit-theme_logo.svg" alt="Elixir Toolkit
Theme logo" max-width="15em" %}