SNPs introduced when using high --max-iterations #50

@Markusjsommer

Expected Behavior

A penguin assembly with few or no SNPs relative to the reads used for assembly.

Current Behavior

A high --max-iterations value results in SNPs that are not supported by the reads used during assembly.

Steps to Reproduce (for bugs)

Observable on most samples we tested with --max-iterations 15.

It also happens on some of the contigs with the benchmark rhinovirus data here: https://github.com/AnnSeidel/penguin-analysis/tree/main/benchmarking/rhinovirus-3-mixture. Screenshots are attached.

Context

We noticed that for larger viral genomes a high --max-iterations value helped generate a more contiguous assembly. We were surprised to find a large number of SNPs that were not supported when we aligned the reads back to this assembly.
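To make "not supported by the reads" concrete, here is a minimal Python sketch of the kind of check we mean. It assumes the reads have already been aligned back to the contig and summarized as per-position base counts (for example, from a pileup). The function name `unsupported_positions`, the thresholds, and the toy data are all hypothetical and are not part of penguin.

```python
from collections import Counter

def unsupported_positions(contig, pileup, min_depth=5, min_fraction=0.1):
    """Flag contig positions whose base is backed by too few aligned reads.

    contig: assembled sequence as a string.
    pileup: one Counter per contig position, counting the read bases aligned there.
    min_depth / min_fraction: illustrative thresholds, not tuned values.
    """
    flagged = []
    for pos, (base, counts) in enumerate(zip(contig, pileup)):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to judge this position
        if counts.get(base, 0) / depth < min_fraction:
            flagged.append((pos, base, dict(counts)))
    return flagged

# Toy example: the contig has G at position 2, but 19 of 20 reads show T.
contig = "ACGTAC"
pileup = [
    Counter({"A": 20}),
    Counter({"C": 18, "T": 1}),
    Counter({"T": 19, "G": 1}),
    Counter({"T": 20}),
    Counter({"A": 3}),   # low depth, skipped
    Counter({"C": 20}),
]
print(unsupported_positions(contig, pileup))
```

For a mixed sample such as the rhinovirus mixture, `min_fraction` would probably need to be low enough that genuine minor-strain variants are not flagged.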

For some assemblies there does not seem to be a magic number for --max-iterations: either the assembly will not be in one piece, or it will have SNPs. Since these SNPs are not supported by reads, we could use a tool like pilon to correct the assembly, but it may be easier/better to fix this within penguin to avoid this caveat to assembly accuracy.
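As a rough illustration of the read-based correction step, the sketch below replaces a contig base with the read-majority base when the reads overwhelmingly disagree with it. This is a simplified stand-in, not pilon's algorithm and not anything penguin does. It handles substitutions only, and the function name `polish_snps`, the thresholds, and the toy data are hypothetical.

```python
from collections import Counter

def polish_snps(contig, pileup, min_depth=5, min_fraction=0.9):
    """Replace contig bases that a strong majority of aligned reads contradict.

    contig: assembled sequence as a string.
    pileup: one Counter per contig position, counting the read bases aligned there.
    Returns the corrected sequence and a list of (position, old_base, new_base).
    """
    corrected = list(contig)
    changes = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # leave poorly covered positions untouched
        top_base, top_count = counts.most_common(1)[0]
        if top_base != corrected[pos] and top_count / depth >= min_fraction:
            changes.append((pos, corrected[pos], top_base))
            corrected[pos] = top_base
    return "".join(corrected), changes

# Toy example: position 2 is G in the contig, but 19 of 20 reads show T.
draft = "ACGTAC"
counts_per_position = [
    Counter({"A": 20}),
    Counter({"C": 18, "T": 1}),
    Counter({"T": 19, "G": 1}),
    Counter({"T": 20}),
    Counter({"A": 3}),   # low depth, left as-is
    Counter({"C": 20}),
]
polished, changes = polish_snps(draft, counts_per_position)
print(polished, changes)
```

A post-hoc pass like this can mask the symptom, but it adds a step to every assembly, which is why a fix inside the assembler seems preferable.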

See below for the rhinovirus assembly with default settings (few/no SNPs) and with --max-iterations 15 (many SNPs), also zoomed in.

Screenshot from 2024-10-31 13-11-35
Screenshot from 2024-10-31 13-11-58
Screenshot from 2024-10-31 13-12-18
