A small repo for generating and storing DIAMOND-formatted reference databases (.dmnd file extension) for diamond blastx, with customizable specificity.
First create and activate the environment:
micromamba create -f environment.yaml
micromamba activate diamond-db-creator
Then create a config.yaml file (like the example) with the list of all INSDC references you would like to include in your database. For example:
references:
  seg4-H1: U08903.1
  seg4-H2: CY005413.1
  seg3-H5N1: NC_007359.1
Note that each key (e.g. seg4-H1) becomes the prefix of the dataset name returned in the diamond blastx results.tsv. Because DIAMOND matches proteins, each CDS in a reference sequence receives its own identifier of the form id|CDS{i}; in Loculus, all sequences that match id|CDS{i} are mapped to the sequence id. The keys should be the same as the sequenceName they are assigned to, e.g. {segment}-{reference}. Alternatively, you can add lists of accepted matches to the config.accepted_dataset_matches field.
seqName dataset pident ...
MW874350.1 seg3-H5N1|CDS1 0.4829120176662018 ...
MW874350.1 seg3-H1N1|CDS2 0.34937439846005774 ...
MW874350.1 seg3-H3N2|CDS1 0.22552301255230126 ...
Then build the database by running snakemake with the path to your config file:
snakemake --config config_file=<path-to-config-file>
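The mapping from per-CDS dataset names (such as seg3-H5N1|CDS1) back to the reference key can be illustrated with a minimal Python sketch. It is not the code used by this repo or by Loculus. It assumes that results.tsv is tab-separated and has the seqName, dataset and pident columns shown in the example above. The helper names reference_key and best_reference_per_sequence are hypothetical, and picking the hit with the highest pident is an illustrative choice rather than documented behavior.

```python
import csv
import io


def reference_key(dataset: str) -> str:
    """Strip the per-CDS suffix, e.g. 'seg3-H5N1|CDS1' -> 'seg3-H5N1'."""
    return dataset.split("|", 1)[0]


def best_reference_per_sequence(tsv_text: str) -> dict:
    """Return, for each seqName, the reference key of its highest-pident hit.

    Assumption: the input is tab-separated with a header row containing
    the columns 'seqName', 'dataset' and 'pident'.
    """
    best = {}  # seqName -> (reference key, pident)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        seq_name = row["seqName"]
        pident = float(row["pident"])
        if seq_name not in best or pident > best[seq_name][1]:
            best[seq_name] = (reference_key(row["dataset"]), pident)
    return {seq_name: key for seq_name, (key, _) in best.items()}


# Rows taken from the example results above, with only the three shown columns.
example = (
    "seqName\tdataset\tpident\n"
    "MW874350.1\tseg3-H5N1|CDS1\t0.4829120176662018\n"
    "MW874350.1\tseg3-H1N1|CDS2\t0.34937439846005774\n"
    "MW874350.1\tseg3-H3N2|CDS1\t0.22552301255230126\n"
)

print(best_reference_per_sequence(example))
```

For the three example rows, MW874350.1 is assigned to seg3-H5N1, since that hit has the highest pident.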