Skip to content

Comments

Draft: try adding capsid variant calling and resistance#109

Open
schorlton-bugseq wants to merge 1 commit intoPoonLab:masterfrom
schorlton-bugseq:addCAI
Open

Draft: try adding capsid variant calling and resistance#109
schorlton-bugseq wants to merge 1 commit intoPoonLab:masterfrom
schorlton-bugseq:addCAI

Conversation

@schorlton-bugseq
Copy link

Starting to work on #108

Thanks @ArtPoon for initial response. Still not working so @WilliamZekaiWang would definitely value your input here.

Open questions from the PR so far (apologies as I am not an HIV expert):

  • Are capsid coordinates correct? Derived from https://www.hiv.lanl.gov/components/sequence/HIV/search/help.html#region
  • Where were min_overlap thresholds borrowed from? Could not find upon a quick search of the sierra/sierrapy repos
  • Why no lenacapavir calling when using HIVDB v9.8 XML as is? I.e. what are we missing?
  • Are the DRM comments hardcoded in sierra-local, and if so, how to update/get those from the XML for CAI?

@WilliamZekaiWang
Copy link
Collaborator

Hi! I will look more intensively into the questions but from what I could recall and find:

Are the DRM comments hardcoded in sierra-local, and if so, how to update/get those from the XML for CAI?

We pull the comments from https://github.com/hivdb/hivfacts/tree/main/data, which is the repository for the data found on https://hivdb.stanford.edu. The script in sierralocal/updater.py should pull all the resistance mutations. If we want to add CAI in, the relevant files should be there and we can include it into the updater script

Why no lenacapavir calling when using HIVDB v9.8 XML as is? I.e. what are we missing?

I think this is a symptom of not having the necessary files to call lenacapvir from the hivfacts repository mentioned above. It could also be from incorrectly parsing the xml file. Might be a more complicated problem.

Where were min_overlap thresholds borrowed from? Could not find upon a quick search of the sierra/sierrapy repos

I have a memory that it was on the https://hivdb.stanford.edu website, but I couldn't find it after looking around. I will keep looking, I'm pretty sure it was there.

Are capsid coordinates correct? Derived from https://www.hiv.lanl.gov/components/sequence/HIV/search/help.html#region

It matches the coordinates from HIVDB facts:

https://github.com/hivdb/hivfacts/blob/baf19dd1821607eafabd4784b4ba85f00e6e09aa/data/genes_hiv1.yml#L19-L32

I think we got our positions for RT, IN, PR from here

@schorlton-bugseq
Copy link
Author

Thanks so much! I forgot to include that I was using a freshly downloaded database (beyond just the XML) using that script, however when calling sierralocal I was calling via python interface (

def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=False,
program='post', do_subtype=False): # pragma: no cover
) which only takes the XML and APOBEC JSON as arguments (which I was pointing to the latest files). So if there are other files which should be updated, they were not being used. I was particularly concerned about some of the other files like comments because I saw some 7 year old files in https://github.com/PoonLab/sierra-local/tree/2896b833af1259e9cd5907fa6819c3da53beb88b/sierralocal/data and looks like they are referenced (
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'INSTI-comments.csv')
with open(dest, 'r') as insti_file:
self.insti_comments = dict(csv.reader(insti_file, delimiter='\t'))
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'PI-comments.csv')
with open(dest, 'r') as pi_file:
self.pi_comments = dict(csv.reader(pi_file, delimiter='\t'))
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'RT-comments.csv')
with open(dest, 'r') as rt_file:
self.rt_comments = dict(csv.reader(rt_file, delimiter='\t'))
), however maybe that is only a partially related issue and better to get lenacapavir reported first before worrying about comments.

It could also be from incorrectly parsing the xml file. Might be a more complicated problem.

The lenacapavir looked to me like other drugs in the XML (https://github.com/hivdb/hivfacts/blob/main/data/algorithms/HIVDB_9.8.xml#L2719), so I was more thinking it may be an omission of alignment/examination of capsid in general from sierralocal, although I'm really not certain at this point.

Thank you both again and happy to help in your investigation too.

@WilliamZekaiWang
Copy link
Collaborator

WilliamZekaiWang commented May 27, 2025

Ah I think I see some problems.

I was looking through my old commits and I directly accessed the mutation files in jsonwriter.py

# make dictionary for primary type
dest = str(Path(os.path.dirname(__file__)) / 'data' / 'mutation-type-pairs_hiv1.csv')
with open(dest, 'r', encoding='utf-8-sig') as mut_type_pairs1_files:
mut_type_pairs1_files = csv.DictReader(mut_type_pairs1_files)
self.primary_type_dic = {}
for row in mut_type_pairs1_files:
gene = row['gene']
pos = row['position']
aa = row['aas']
mut = row['mutationType']
if gene not in self.primary_type_dic:
self.primary_type_dic.update({gene: {}})
if pos not in self.primary_type_dic[gene]:
self.primary_type_dic[gene].update({pos: {}})
self.primary_type_dic[gene][pos].update({aa: mut})

and then checked for them later down the same file to be written into the json output

def primary_type(self, gene, position, AA):
"""
see if specific amino acid's primary type through checking hivbd facts
@param gene: str, RT, IN, PR
@param position: int, position of mutation relative to POL
@param AA: new amino acid
@return: bool
"""
position = str(position)
if gene in self.primary_type_dic:
if position in self.primary_type_dic[gene]:
keys = self.primary_type_dic[gene][position].keys()
for aa in AA:
for key in keys:
if aa in key:
return self.primary_type_dic[gene][position][key]
return "Other"

However, when you call the sierralocal function it should still read the reference files located in sierralocal/data, only APOBEC mutations can have a set change in path. I should probably change this so you can change the reference directory all the mutation files. That being said, we should probably update the data files anyways, 7 years is a long time.

I think you're right on the data not being used. We don't have a dedicated dictionary for the capsid protein, so it's just ignored. Your changes should include it, but the issue may arise from this function, which adjusts gene positions in respects to pol:

def create_gene_map(self):
"""
Returns a dictionary with the AMINO ACID position bounds
for each gene in Pol, based on the HXB2 reference annotations.
@return pol_aa_map: dict
"""
# start and end nucleotide coordinates in HXB2 pol
convert = lambda x: int((x - self.pol_start) / 3)
pol_aa_map = {}
for key, val in self.pol_nuc_map.items():
pol_aa_map[key] = (convert(val[0]), convert(val[1]))
return pol_aa_map

The result of this indexes CA in negative indexes, which then gets filtered out here:

def get_genes(self, pol_aligned_sites, pol_first_aa, pol_last_aa):
"""
Determines the first POL gene that is present in
the query sequence, by virtue of gene breakpoints
TODO: sierra uses different minimum numbers of sites per gene
(40, 60 and 30 for PR, RT and IN)
@param pol_aligned_sites: list, sublist holds alignment program
output aligned POL sites
@param pol_first_aa: int, location of first amino acid in pol
@param pol_last_aa: int, location of last amino acid in pol
@return: list, list of length 1
[gene, first aa position in pol, last aa position in pol
first na position in pol, last na position in pol]
"""
# good here
min_overlap = {'PR': 40, 'RT': 60, 'IN': 30}
genes = []
for gene, bounds in self.gene_map.items():
aa_start, aa_end = bounds
geneLength = aa_end - aa_start + 1
try:
overlap = min(aa_end, pol_last_aa) - max(aa_start, pol_first_aa)
if overlap < min_overlap[gene]:
# discard alignment of this gene, too short
continue
aligned_sites = filter(lambda x: x['PosAA'] >= aa_start and x['PosAA'] <= aa_end,
pol_aligned_sites)
aligned_sites = list(aligned_sites)
first_aa = max(pol_first_aa - aa_start, 1)
last_aa = min(pol_last_aa - aa_start + 1, geneLength)
first_na = aligned_sites[0]['PosNA']
last_na = aligned_sites[-1]['PosNA'] - 1 + aligned_sites[-1]['LengthNA']
genes.append((gene, first_aa, last_aa, first_na, last_na))
except:
# if post-align returns nothing, need to raise message
# "There were no PR, RT, and IN genes found, refuse to process."
pass
return genes

However, I checked sierrapy with a few random complete HIV sequences from NCBI nucleotide database and found that they also only picked up PR, IN, and RT. Do you have a sequence that captures CA in sierrapy?

I was also hunting for the comments, and I want to say they're included in the xml files now? Probably needs more checking though

<TEXT><![CDATA[A105T has been selected in 2 patients receiving LEN in combination with M66I. Alone it is associated with 3-fold reduced LEN susceptibility and an RC of ~60%. A105E was selected in vitro by GS-CA1. It is associated with ~4-fold reduced LEN susceptibility. A105S has been reported as a subdominant variant in one patient receiving LEN. Its effect on susceptibility has not been reported.]]></TEXT>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants