Draft: try adding capsid variant calling and resistance by schorlton-bugseq · Pull Request #109 · PoonLab/sierra-local

schorlton-bugseq · 2025-05-25T00:02:43Z

Starting to work on #108

Thanks @ArtPoon for initial response. Still not working so @WilliamZekaiWang would definitely value your input here.

Open questions from the PR so far (apologies as I am not an HIV expert):

Are capsid coordinates correct? Derived from https://www.hiv.lanl.gov/components/sequence/HIV/search/help.html#region
Where were min_overlap thresholds borrowed from? Could not find upon a quick search of the sierra/sierrapy repos
Why no lenacapavir calling when using HIVDB v9.8 XML as is? I.e. what are we missing?
Are the DRM comments hardcoded in sierra-local, and if so, how to update/get those from the XML for CAI?

WilliamZekaiWang · 2025-05-25T15:02:12Z

Hi! I will look more intensively into the questions but from what I could recall and find:

Are the DRM comments hardcoded in sierra-local, and if so, how to update/get those from the XML for CAI?

We pull the comments from https://github.com/hivdb/hivfacts/tree/main/data, which is the repository for the data found on https://hivdb.stanford.edu. The script in sierralocal/updater.py should pull all the resistance mutations. If we want to add CAI in, the relevant files should be there and we can include it into the updater script

Why no lenacapavir calling when using HIVDB v9.8 XML as is? I.e. what are we missing?

I think this is a symptom of not having the necessary files to call lenacapvir from the hivfacts repository mentioned above. It could also be from incorrectly parsing the xml file. Might be a more complicated problem.

Where were min_overlap thresholds borrowed from? Could not find upon a quick search of the sierra/sierrapy repos

I have a memory that it was on the https://hivdb.stanford.edu website, but I couldn't find it after looking around. I will keep looking, I'm pretty sure it was there.

Are capsid coordinates correct? Derived from https://www.hiv.lanl.gov/components/sequence/HIV/search/help.html#region

It matches the coordinates from HIVDB facts:

https://github.com/hivdb/hivfacts/blob/baf19dd1821607eafabd4784b4ba85f00e6e09aa/data/genes_hiv1.yml#L19-L32

I think we got our positions for RT, IN, PR from here

schorlton-bugseq · 2025-05-25T18:09:50Z

Thanks so much! I forgot to include that I was using a freshly downloaded database (beyond just the XML) using that script, however when calling sierralocal I was calling via python interface (

sierra-local/sierralocal/main.py

Lines 125 to 126 in 2896b83

    
           def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=False, 
        
                           program='post', do_subtype=False): # pragma: no cover

) which only takes the XML and APOBEC JSON as arguments (which I was pointing to the latest files). So if there are other files which should be updated, they were not being used. I was particularly concerned about some of the other files like comments because I saw some 7 year old files in https://github.com/PoonLab/sierra-local/tree/2896b833af1259e9cd5907fa6819c3da53beb88b/sierralocal/data and looks like they are referenced (

sierra-local/sierralocal/jsonwriter.py

Lines 29 to 39 in 2896b83

    
           dest = str(Path(os.path.dirname(__file__)) / 'data' / 'INSTI-comments.csv') 
        
           with open(dest, 'r') as insti_file: 
        
               self.insti_comments = dict(csv.reader(insti_file, delimiter='\t')) 
        
           dest = str(Path(os.path.dirname(__file__)) / 'data' / 'PI-comments.csv') 
        
           with open(dest, 'r') as pi_file: 
        
               self.pi_comments = dict(csv.reader(pi_file, delimiter='\t')) 
        
           dest = str(Path(os.path.dirname(__file__)) / 'data' / 'RT-comments.csv') 
        
           with open(dest, 'r') as rt_file: 
        
               self.rt_comments = dict(csv.reader(rt_file, delimiter='\t'))

), however maybe that is only a partially related issue and better to get lenacapavir reported first before worrying about comments.

It could also be from incorrectly parsing the xml file. Might be a more complicated problem.

The lenacapavir looked to me like other drugs in the XML (https://github.com/hivdb/hivfacts/blob/main/data/algorithms/HIVDB_9.8.xml#L2719), so I was more thinking it may be an omission of alignment/examination of capsid in general from sierralocal, although I'm really not certain at this point.

Thank you both again and happy to help in your investigation too.

WilliamZekaiWang · 2025-05-27T16:37:20Z

Ah I think I see some problems.

I was looking through my old commits and I directly accessed the mutation files in jsonwriter.py

sierra-local/sierralocal/jsonwriter.py

Lines 88 to 102 in 2896b83

    
           # make dictionary for primary type 
        
           dest = str(Path(os.path.dirname(__file__)) / 'data' / 'mutation-type-pairs_hiv1.csv') 
        
           with open(dest, 'r', encoding='utf-8-sig') as mut_type_pairs1_files: 
        
               mut_type_pairs1_files = csv.DictReader(mut_type_pairs1_files) 
        
               self.primary_type_dic = {} 
        
               for row in mut_type_pairs1_files: 
        
                   gene = row['gene'] 
        
                   pos = row['position'] 
        
                   aa = row['aas'] 
        
                   mut = row['mutationType'] 
        
                   if gene not in self.primary_type_dic: 
        
                       self.primary_type_dic.update({gene: {}}) 
        
                   if pos not in self.primary_type_dic[gene]: 
        
                       self.primary_type_dic[gene].update({pos: {}}) 
        
                   self.primary_type_dic[gene][pos].update({aa: mut})

and then checked for them later down the same file to be written into the json output

sierra-local/sierralocal/jsonwriter.py

Lines 532 to 548 in 2896b83

    
               def primary_type(self, gene, position, AA): 
        
                   """ 
        
                   see if specific amino acid's primary type through checking hivbd facts 
        
                   @param gene: str, RT, IN, PR 
        
                   @param position: int, position of mutation relative to POL 
        
                   @param AA: new amino acid 
        
                   @return: bool 
        
                   """ 
        
                   position = str(position) 
        
                   if gene in self.primary_type_dic: 
        
                       if position in self.primary_type_dic[gene]: 
        
                           keys = self.primary_type_dic[gene][position].keys() 
        
                           for aa in AA: 
        
                               for key in keys: 
        
                                   if aa in key: 
        
                                       return self.primary_type_dic[gene][position][key] 
        
                   return "Other"

However, when you call the sierralocal function it should still read the reference files located in sierralocal/data, only APOBEC mutations can have a set change in path. I should probably change this so you can change the reference directory all the mutation files. That being said, we should probably update the data files anyways, 7 years is a long time.

I think you're right on the data not being used. We don't have a dedicated dictionary for the capsid protein, so it's just ignored. Your changes should include it, but the issue may arise from this function, which adjusts gene positions in respects to pol:

sierra-local/sierralocal/nucaminohook.py

Lines 360 to 371 in 2896b83

    
               def create_gene_map(self): 
        
                   """ 
        
                   Returns a dictionary with the AMINO ACID position bounds 
        
                   for each gene in Pol, based on the HXB2 reference annotations. 
        
                   @return pol_aa_map: dict 
        
                   """ 
        
                   # start and end nucleotide coordinates in HXB2 pol 
        
                   convert = lambda x: int((x - self.pol_start) / 3) 
        
                   pol_aa_map = {} 
        
                   for key, val in self.pol_nuc_map.items(): 
        
                       pol_aa_map[key] = (convert(val[0]), convert(val[1])) 
        
                   return pol_aa_map

The result of this indexes CA in negative indexes, which then gets filtered out here:

sierra-local/sierralocal/nucaminohook.py

Lines 373 to 413 in 2896b83

    
               def get_genes(self, pol_aligned_sites, pol_first_aa, pol_last_aa): 
        
                   """ 
        
                   Determines the first POL gene that is present in  
        
                   the query sequence, by virtue of gene breakpoints 
        
                   TODO: sierra uses different minimum numbers of sites per gene 
        
                   (40, 60 and 30 for PR, RT and IN) 
        
                   @param pol_aligned_sites: list, sublist holds alignment program 
        
                   output aligned POL sites 
        
                   @param pol_first_aa: int, location of first amino acid in pol 
        
                   @param pol_last_aa: int, location of last amino acid in pol 
        
                   @return: list, list of length 1 
        
                   [gene, first aa position in pol, last aa position in pol 
        
                    first na position in pol, last na position in pol] 
        
                   """ 
        
                   # good here 
        
                   min_overlap = {'PR': 40, 'RT': 60, 'IN': 30} 
        
                   genes = [] 
        
                   for gene, bounds in self.gene_map.items(): 
        
                       aa_start, aa_end = bounds 
        
                       geneLength = aa_end - aa_start + 1 
        
                       try: 
        
                           overlap = min(aa_end, pol_last_aa) - max(aa_start, pol_first_aa) 
        
                           if overlap < min_overlap[gene]: 
        
                               # discard alignment of this gene, too short 
        
                               continue 
        
                           aligned_sites = filter(lambda x: x['PosAA'] >= aa_start and x['PosAA'] <= aa_end, 
        
                                                  pol_aligned_sites) 
        
                           aligned_sites = list(aligned_sites) 
        
                           first_aa = max(pol_first_aa - aa_start, 1) 
        
                           last_aa = min(pol_last_aa - aa_start + 1, geneLength) 
        
                           first_na = aligned_sites[0]['PosNA'] 
        
                           last_na = aligned_sites[-1]['PosNA'] - 1 + aligned_sites[-1]['LengthNA'] 
        
                           genes.append((gene, first_aa, last_aa, first_na, last_na)) 
        
                       except: 
        
                           # if post-align returns nothing, need to raise message 
        
                           # "There were no PR, RT, and IN genes found, refuse to process." 
        
                           pass 
        
                   return genes

However, I checked sierrapy with a few random complete HIV sequences from NCBI nucleotide database and found that they also only picked up PR, IN, and RT. Do you have a sequence that captures CA in sierrapy?

I was also hunting for the comments, and I want to say they're included in the xml files now? Probably needs more checking though

sierra-local/sierralocal/data/HIVDB_9.4.xml

Line 75 in 2896b83

    
           <TEXT><![CDATA[A105T has been selected in 2 patients receiving LEN in combination with M66I. Alone it is associated with 3-fold reduced LEN susceptibility and an RC of ~60%. A105E was selected in vitro by GS-CA1. It is associated with ~4-fold reduced LEN susceptibility. A105S has been reported as a subdominant variant in one patient receiving LEN. Its effect on susceptibility has not been reported.]]></TEXT>

try adding capsid

415865d

WilliamZekaiWang mentioned this pull request May 27, 2025

Capsid alignment and variants #108

Open

schorlton-bugseq mentioned this pull request Sep 4, 2025

Added multiple arguments related with data files #118

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Draft: try adding capsid variant calling and resistance#109

Draft: try adding capsid variant calling and resistance#109
schorlton-bugseq wants to merge 1 commit intoPoonLab:masterfrom
schorlton-bugseq:addCAI

schorlton-bugseq commented May 25, 2025

Uh oh!

WilliamZekaiWang commented May 25, 2025

Uh oh!

schorlton-bugseq commented May 25, 2025

Uh oh!

WilliamZekaiWang commented May 27, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

schorlton-bugseq commented May 25, 2025

Uh oh!

WilliamZekaiWang commented May 25, 2025

Uh oh!

schorlton-bugseq commented May 25, 2025

Uh oh!

WilliamZekaiWang commented May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WilliamZekaiWang commented May 27, 2025 •

edited

Loading