MMSeqs2 splits alignment into two segments #1094

@prototaxites

Hi,

I have a query database of approximately 10M COX1 marker gene sequences (nucleotide), and I am trying to find near-exact alignments to a single search contig (also nucleotide). I know that one of the database sequences has an identical or near-identical match to the contig, with an expected alignment length of approximately 658 bp (the length of the marker gene amplicon), and I can find it with BLAST:

> blastn -query ../../cois.fa -subject ../../bAloAeg2.k1001_c90.mito.ctg.fasta -outfmt 6

GBMIN134883-17|COI-5P|Alopochen_aegyptiaca|Animalia,Chordata,Aves,Anseriformes,Anatidae,None,Alopochen,Alopochen_aegyptiaca,None	ctg000001c	99.696	658	2	0	1	658	10294	9637	0.0	1205

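I am searching for this hit with MMseqs2, but it is missing from my results because it fails my alignment filter of --min-seq-id 0.96 --min-aln-len 450. The search below, shown without those two filter options, reports the ~658 bp alignment as two shorter alignments (364 bp and 294 bp) that meet at target positions 10000/10001: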

> mmseqs search mini/db contigs/db search_contigs/db mmseqs_tmp \
    --cov-mode 2 --max-accept 10 \
    --search-type 3 --alignment-mode 3 \
    --threads 16 --strand 2

GBMIN134883-17|COI-5P|Alopochen_aegyptiaca|Animalia,Chordata,Aves,Anseriformes,Anatidae,None,Alopochen,Alopochen_aegyptiaca,None	ctg000001c	0.997	364	1	0	658	295	9637	10000	7.335E-190	649
GBMIN134883-17|COI-5P|Alopochen_aegyptiaca|Animalia,Chordata,Aves,Anseriformes,Anatidae,None,Alopochen,Alopochen_aegyptiaca,None	ctg000001c	0.996	294	1	0	294	1	10001	10294	7.815E-152	524

Is there anything I can tweak to help this alignment appear as a single segment?
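
As a possible stopgap, the two rows could be stitched back together after the search and the identity/length filter applied to the stitched hit instead. The Python sketch below is only an illustration and not part of MMseqs2: `Hit`, `parse_hit`, `can_join`, and `merge_hits` are hypothetical names. It assumes the 12-column layout of the rows above (query, target, identity, alignment length, mismatches, gap opens, qstart, qend, tstart, tend, e-value, bit score) and that target coordinates ascend, as they do in both rows. The merged identity is a length-weighted average of the segment identities, which only approximates what a single alignment would report.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    target: str
    fident: float
    alnlen: int
    qstart: int
    qend: int
    tstart: int
    tend: int


def parse_hit(line: str) -> Hit:
    """Parse one whitespace-separated row in the column order assumed above."""
    f = line.split()
    return Hit(f[0], f[1], float(f[2]), int(f[3]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]))


def can_join(a: Hit, b: Hit, max_gap: int) -> bool:
    """True if b continues a on the target and on the query, same strand."""
    a_rev = a.qstart > a.qend
    b_rev = b.qstart > b.qend
    if a_rev != b_rev:
        return False
    t_gap = b.tstart - a.tend - 1
    # On the reverse strand the query coordinates run downwards.
    q_gap = (a.qend - b.qstart - 1) if a_rev else (b.qstart - a.qend - 1)
    return 0 <= t_gap <= max_gap and 0 <= q_gap <= max_gap


def merge_hits(hits: list[Hit], max_gap: int = 0) -> list[Hit]:
    """Stitch collinear, adjacent segments of the same query/target pair."""
    merged: list[Hit] = []
    for h in sorted(hits, key=lambda x: (x.query, x.target, x.tstart)):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.query == h.query
                and prev.target == h.target and can_join(prev, h, max_gap)):
            total = prev.alnlen + h.alnlen
            # Length-weighted identity: an approximation, not a realignment.
            fident = (prev.fident * prev.alnlen + h.fident * h.alnlen) / total
            merged[-1] = Hit(prev.query, prev.target, fident, total,
                             prev.qstart, h.qend, prev.tstart, h.tend)
        else:
            merged.append(h)
    return merged


# The two rows from the search above, with the query ID shortened for readability.
rows = [
    "GBMIN134883-17 ctg000001c 0.997 364 1 0 658 295 9637 10000 7.335E-190 649",
    "GBMIN134883-17 ctg000001c 0.996 294 1 0 294 1 10001 10294 7.815E-152 524",
]
stitched = merge_hits([parse_hit(r) for r in rows])
# Apply the same thresholds as --min-seq-id 0.96 --min-aln-len 450.
kept = [h for h in stitched if h.fident >= 0.96 and h.alnlen >= 450]
for h in kept:
    print(h.query, h.target, round(h.fident, 3), h.alnlen,
          h.qstart, h.qend, h.tstart, h.tend)
```

With `max_gap=0` only segments that abut exactly are joined, which is the case for the two rows here. This does not explain why the alignment is split in the first place, so a search-side option would still be preferable.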

Query sequence:

>GBMIN134883-17|COI-5P|Alopochen_aegyptiaca|Animalia,Chordata,Aves,Anseriformes,Anatidae,None,Alopochen,Alopochen_aegyptiaca,None
CACCCTATATCTTATCTTCGGAGCGTGGGCCGGAATAATTGGCACAGCACTTAGCCTGCT
AATCCGCGCAGAACTGGGCCAACCAGGAACCCTCCTAGGTGACGATCAAATTTACAATGT
AATCGTCACCGCCCACGCTTTTGTAATAATCTTCTTCATGGTGATACCTATCATAATTGG
AGGGTTCGGCAACTGATTAGTCCCCCTAATAATCGGCGCCCCTGATATGGCGTTTCCACG
AATAAACAACATAAGCTTCTGACTCCTCCCCCCGTCATTCCTTCTACTACTCGCCTCATC
TACCGTGGAAGCTGGCGCTGGTACCGGCTGAACCGTGTACCCGCCCCTAGCAGGCAACCT
GGCCCACGCTGGAGCCTCAGTGGACCTGGCTATTTTCTCCCTCCATTTAGCTGGTGTTTC
TTCTATCCTCGGAGCCATTAACTTCATCACTACAGCCATCAACATAAAACCCCCCGCACT
CTCACAATACCAAACCCCCCTCTTCGTCTGATCCGTCCTAATCACCGCTATCCTACTCCT
CCTCTCACTTCCCGTTCTCGCCGCTGGCATCACAATGCTACTGACCGACCGAAACCTAAA
CACCACATTCTTTGACCCCGCCGGAGGAGGAGACCCAATCCTGTACCAACACCTATTC

Target sequence is attached as a file to save space.

bAloAeg2.k1001_c90.mito.ctg.fasta.txt
