Protein structural determinants of HIV evolution


Protein structure, measured by relative solvent accessibility, serves as a general constraint on the evolution of viral proteins. It was recently shown that distance from functional surfaces may explain an additional portion of the evolution of the viral receptor-binding proteins. However, the general applicability of distance constraints to explain viral protein adaptation remains an open question. Furthermore, it is not known whether or not RSA and distance act uniformly for all of the proteins in a single virus, nor is it known if these constraints are important for rapidly mutating viruses. Therefore, we have carried out an analysis of structural predictors of site-wise in HIV-1. Our results indicate that structural constraints do not act uniformly in HIV-1 and that each contraint (RSA and distance) show mixed results for individual proteins. Of the two broad categories of HIV proteins we surveyed, neither showed a common pattern of adaptive evolution or modeling accuracy. Structural constraints were highly predictive in just one of the three enzymes and one of three structural proteins in HIV-1. For the two predictable proteins, the protease enzyme and the gp120 structural protein, structure could explain between 30% and 40% of the variation in . We also found that glycosylations in the gp120 protein of the receptor-binding complex could explain just 2% of the variation in and could not explain gp120 evolution independently of either RSA or distance from the apical surface. Finally, we found with a very conservative cross-validation scheme that adding distance from an optimized reference point significantly improves evolutionary predictions in the majority of HIV proteins.


Despite the development of dozens of antiretrovirals, decades of vaccine development, and billions spent to halt its transmission, HIV-1 remains a major public health issue. A major challenge in combating the HIV epidemic lies in the viruses ability to adapt to even the most complex, combined therapeutic schedules. Such a high level of adaptability stems from the fact that HIV is among the most rapidly mutating human viruses, it maintains a large intra-patient population size, and it hase a high replication rate (Mansky and Temin 1995, Frost et al. 2005, Wei et al. 1995, Perelson et al. 1996 , Perelson et al. 197, Chun et al. 1997). Further complicating the development of new vaccines and therapeutics, the specific important selection pressures in HIV, the strength of those pressures, and the proteins that are adapting most rapidly is still not well understood.

There is evidence in at least one HIV-1 gene (env), that evolution occurs through a number of possible mechanisms that might suggest various patterns of evolution of the receptor-binding protein; one such hypothesis is the glycan shield whereby the virus shift glycosylation patterns to block antibody binding directly (Wei et al. 2002). Such a constraint could produce a pattern of site-wise estimates that are highly predictable with just a single category of likely glycosylation sites. Although, there is some evidence that the glycan shield hypothesis does not play out in sequence evolution (Frost et al. 2005), beyond the env gene, relatively little is known about the pattern of site-wise positive and negative selection in HIV proteins. It may be the case that HIV proteins generally evolve by similar constraints (e.g. protein fold preservation) to eukaryotic proteins. Likewise, it might be that folding constraints like the relative solvent accessibility (RSA) of amino acid sites, the linear packing density, or the distance from a functional surface do not at all constrain the evolution of such a rapidly mutating virus. Further, if these constraints are important for HIV evolution, it might be that they act uniformly in all HIV proteins. Or conversely, it may be that the human immune systems applies such asymmetric evolutionary pressures that none of the traditional constraints work for such a rapidly adapting virus.

Talk a little bit about the distribution of sitewise dN/dS

In this study, we examine the structural predictors of in HIV-1. We find that, similar to influenza, RSA and distance to an optimized reference point are generally significant predictors of virus evolution. Additionally, we find that five out of the six models we produce are significant, but only two can be considered particular strong; those two models account for between 30% and 40% of . In addition, we find that HIV integrase is the first protein out of many hundred tested by several independent studies that does not have a single statistically significant structural predictor of site-wise evolution. For gp120, we also examine the ability of glycosylations to explain . The sites within four angstroms of glycosylations can partially predict the evolution of gp120. However, after controlling for either of the structural constraints we use in combined models, glycosylations lose all statistically power. Thus, we find that glycosylations in the gp120 protein can not predict evolution better than simply RSA or distance from the apical surface. Finally, we find, with a very conservative cross-validation scheme, that adding distance from an optimized reference point significantly improves evolutionary predictions in the majority of HIV proteins.


Obtaining and preparing HIV gene sequences

Gene sequences came from the HIV database at Los Alamos National Laboratory (HIV Database). We used only pre-made gene alignments. The alignments were assumed correct and therefore not changed for all of the genes in this study. To identify the genes in the whole genome sequences, we used the annotation landmarks available in the sequence database (HIV Landmarks). With this script, sequences were filtered to include only those with entirely canonical bases, and to change stop codons in the original alignment to gaps. We used Biopython for all sequence manipulation and cleaning (Cock et al. 2009).

Calculating the site-wise evolutionary rate ratio

We built phylogenetic trees with FastTree 2.1.7 with the following input line (Price et al. 2009).

 fasttree -nt -gtr -nosupport alignment.fasta > alignment.tree 

In addition, we made minor changes to the FastTree source code to prevent rounding errors in the tree branch lengths (the branch length issue was detailed here).

Given a codon alignment and evolutionary tree, we used the phylogenetic software HyPhy to calculate the evolutionary rate ratio (Kosakovsky-Pond et al. 2005). We used the built-in one-rate fixed effects likelihood (FEL) method with the default model (Muse 1994). The FEL method calculates an independent value at each column in the alignment. To calculate a one-rate value, the model computes a single value for the entire alignment and an individual at each site. It then normalizes each value by dividing by the average to obtain (Yang 2006).

Obtaining and mapping protein structures

Protein structures were obtained from the RCSB Protein Databank (PDB) (Berman et al. 2000, Berman 2008). For each of the six proteins, we obtained both the monomeric structure and, if quaternary structure was functionally important, the full biological assembly. This study included all three of the enzymes encoded by the HIV pol gene: reverse transcriptase (PDBID: 1HYS, Sarafianos et al. 2001), integrase (PDBID: 1EX4 and 1K6Y, Chen et al. 2000 and Wang et al. 2001), and protease (PDBID: 1HPV, Kim et al. 2000). In addition we included the three HIV structural proteins that did not bind nucleic acid. Those included the capsid (p24, PDBID: 3H47), the matrix (p17, PDBID: 1HIW, Hill et al.), and the major component of the HIV receptor binding protein (gp120, PDBID: 4TVP, Pancera et al. 2014). We broadly followed the HIV structure suggestions curated by the PDB (HIV Structures).

Integrase constituted a special case as it lacked a full length structure in a single experiment and it was the protein and showed no structural predictors. As a result, we tried a number of different structures to find structural predictors. We first tried the 1EX4 structure alone; this structure include the C-terminal domain, but lacks the N-terminal region. In addition, we tried using the 1K6Y structure alone which includes the N-terminal region, but lacks a portion of the C-terminal domain. Then, in an attempt to have a full length model, we aligned the center structural domain of 1EX4 with that of 1K6Y. Fortunately, there is a relatively large gap from site 46 to site 56 in the N-terminal domain of 1K6Y and the central domain of the two structures align with . Therefore, the core of the two proteins aligned almost perfectly making it possible to simply translate and add the coordinates of the N-terminal domain atoms from 1K6Y to the more complete 1EX4 structure. This procedure produced a nearly full length model of integrase that was used to construct the further models.

To map protein structures onto the existing alignment, we used a developmental version of the sequence alignment software MAFFT (Katoh and Frith 2012, Katoh and Standley 2013, Katoh and Standley 2014). This version included the ability to add a sequence to an existing alignment while removing any sites where there was an insertion in the protein structure sequence. Therefore, we used the following input line.

 mafft --mapout --addfragments aas.fasta alignment.fasta > added_alignment.fasta 

Calculating structural predictors

We used two independent structural predictors from the protein structures. The first was relative solvent accessibility (RSA) which has been detailed extensively previously (Scherrer et al. 2012, Meyer and Wilke 2012, Meyer et al. 2013, Meyer and Wilke 2015). We used the program DSSP to compute the absolute solvent accessibilities of each residue in the protein (Kabsch and Sander 1983). Then, we used maximum absolute accessibility values for each residue to normalize solvent exposure to a value between 0 and 1 (Tien et al. 2013). For all models in this study, we used the RSA of a chain in the functional multimeric state of the biological assembly. The second metric was distance to a reference point (Meyer and Wilke 2015). To compute the set of distances, we used each C-alpha in the protein as a reference point, and calculated the distance to every other C-alpha. Thus, the distances from a single C-alpha to every other C-alpha constituted a single distance set. We repeated this calculation using every amino acid to generate a symmetric matrix of distances where the diagonal is always zero as it is simply the distance from an amino acid to itself. For gp120, we included glycosylation sites as a categorical predictor. To find sites covered by sugar moieties, we used the glycosylations found in the 4TVP structure to calculate all amino acid sites that were within four angstroms of any sugar atom. Then, we included the set of these sites are a predictor in the linear models below.

Constructing linear models and cross validation

It has been shown previously that RSA is a relatively strong structural predictor for site-wise both in viruses and large enzyme datasets. Thus, we started with a linear model that predicted site-wise with site-wise RSA. Then, to find the best reference point among the entire set of reference points, we constructed combined linear models with both RSA and the distance set from a single reference point. To guard against overfitting, we trained the model on a randomly chosen 75% of the data and reserved 25% for validation of the best reference point found in the training set. To be clear, all of the possible reference points were available for training, but for each training set we only used , RSA, and distances for 75% of the sites. We repeated the training and validation 100 times for each protein. We found that the training and validation and p-value we very similar. Therefore, for each protein, we used the site that was found most frequently during training as the best site for the final prediction. All of our models and figures were generated in the statistical language R utilizing the ggplot2 package (Ihaka and Gentleman 1996, Wickham 2009).


HIV proteins and structural predictors of evolution

The HIV genome encodes 15 proteins that can be divided into three general categories. The first, the enzymes, are all encoded by the pol gene which encodes three separate proteins. They are, in order of their reference genome numbering, protease, reverse transcriptase (including the RNase H domain), and integrase. Each of these three enzymes assumes a homodimeric quaternary structure and all three have clinically available inhibitory drugs. The second category of proteins encoded by the HIV genome are the structural proteins; they are encoded by two different genes in HIV. The matrix, capsid, and nucleocapsid are thought to be purely structural and are encoded by the gag gene. The external glycoprotein (gp120) and transmembrane glycoprotein (gp41) are encoded by the env gene, and serve the receptor-binding and uptake functions for the virus. The third category of HIV proteins carry out a disparate mix of accessory and regulatory functions for the virus. These include vpu, vif, vpr, p6, nef, rev, and tat. For this study, we focused on the enzymes and structural proteins. Specifically we analyzed sequence data and structures for protease, reverse transcriptase, integrase, the matrix and capsid proteins as well as the apical portion of the ectodomain of the spike glycoprotein, gp120. For each of these proteins, we wondered to what extent metrics taken entirely from the protein’s structure could predict the evolution of the protein’s sequence measured as . In addition, although structural metrics have been shown to predict evolution in several disparate proteins in unrelated viruses, it is still not known whether or not all of the proteins in a single virus would show a similar connection between structure and evolution.

There are a number of structure-based predictors of evolution (either measure as Rate4site or ) that have been described previously (Bustamante et al. 2000, Franzosa and Xia 2009, Ramsey et al. 2011, Scherrer et al. 2012, Shahmoradi et al. 2014, Echave et al. 2014, Huang et al. 2014, Marcos and Echave; for a complete review see Sikosek and Chan 2014); the most common are relative solvent accessibility (RSA) and weighted contact number (WCN). Both predictors have been shown to provide statistically significant predictions of in other viral systems and large enzyme protein sets alike (Sikosek and Chan 2014, Shahmoradi et al. 2014, Meyer and Wilke 2015); generally, WCN calculated on the amino acid side chain is a stronger predictor than RSA calculated in either the monomeric or functional multimeric state. However, for several viral proteins, RSA has been shown to be a better predictor of (Shahmoradi et al. 2014), and it has been used successfully modeling the evolution of influenza (Meyer and Wilke 2012, Meyer et al. 2013). Therefore, in this study, we used RSA as the baseline evolutionary predictor. In addition, it is possible to calculate RSA in any of several different conformational states. Most simply, one could ignore the higher order quaternary structure of protein complexes and measure RSA on each monomeric unit. Alternatively, one could generate ever more complex multimeric assemblies on which to calculate RSA. We chose to use the minimal multimeric state (called the biological assembly in the protein databank) which has been shown to be the strongest extensible predictor of in influenza (Meyer and Wilke 2015).

In addition to RSA, we used a single distance-based dataset to predict . This has previously proven effective at identifying regions of possible functional important in viral proteins (Meyer and Wilke 2015). In the case of HIV, we generated a set of distances using the pairwise distance from each C-alpha atom to every other C-alpha atom in the protein. Then, we performed a cross validation analysis to identify the best reference point in each protein while controlling for RSA.

Protein-specific structural models of HIV evolution

Evolution of the HIV capsid

The capsid protein is a functional homo-hexamer with extensive contacts in adjacent monomeric units. The hexameric biological assemblies form the basic building blocks for constructing the entire super-structure of the viral capsid. We found that no part of the capsid protein appeared to be under particularly strong evolutionary pressure. With whole-gene (Table 1) the capsid has the second slowest rate of adaptation among the proteins we studied; also, the distribution of shows almost no positively selected sites (Figure 1A). The combined RSA-Distance model could predict less than 10% of the variation in . The sites with the highest predicted rates of adaption were near the N-terminal region of the protein, and the sites with the lowest predicted rate of adaptation were near the C-terminal end (Figure 2A). As expected, the distance to sites nearest the N-terminal region of the monomeric protein showed the highest positive predictive power for (Figure 2B); therefore, sites nearest the N-terminal region of the capsid were, on average, under the strongest pressure from structural constraints.

Evolution of the HIV gp120 protein

The HIV receptor-binding complex is made up of an ectodomain as well as transmembrane and intracellular segments. The ectodomain consists of a homo-trimer of hetero-dimers where each of the dimers consists of a gp120 and gp41 protein. The gp120 protein serves as the apical segment and the gp41 protein serves as the basal segment. Thus, gp120 is farthest from the viral surface and the most exposed HIV protein to the host immune system.

The apical solvent exposed surface of the gp120 protein appears to be under intense adaptive pressure. In addition, with whole-gene (Table 1), gp120 has the highest rate of adaptation in HIV and a large number of sites with (Figure 1B). Therefore, solvent exposure alone accounted for a large fraction of the adaptive evolution (). Nevertheless, adding a single distance constraint to the model nearly doubled the variation explained. Where the RSA-only model produced a relatively uniform rate prediction over the entire surface of gp120, adding the set of distances to the linear model oriented the elevated predictions toward the apical surface of the protein (Figure 3A).

We also investigated the effect of gp120 glycosylations on selection. We found that, even though glycosylation sites alone were able to predict a small amount of adaptive evolution in gp120, adding glycosylation sites to the combined model did not improve or change the overall rate predictions. This finding might be expected since glycosylation sites are on the protein surface. Thus, RSA, and not specific glycosylation sites, provides the most systematic adaptive pressure to gp120 (Figure 3B).

Finally, as we expected from the predicted values, we found that the sites whose distances provided the highest positive predictive power were located on the apical surface of the ectodomain (Figure 3C). Therefore, sites near the surface of the protein and oriented farthest away from the viral surface are on average under the strongest adaptive pressure. Also, no clear connection emerged between the glycosylation pattern and the surface with maximum adaptive pressure (Figure 3D).

Evolution of the HIV matrix

The fundamental biological assembly of the HIV matrix consists of homo-trimers of the matrix protein. Unlike the other structural proteins (the capsid and the gp120 protein), the matrix protein has a relatively small contact surface with adjacent subunits and a relatively high whole-gene (Table 1) and a large number of sites with (Figure 1C). Nevertheless, despite the structural differences in structure, shape, and whole-gene between the matrix and capsid proteins, the predictors performed almost identically in the RSA-Distance combined models. The validated combined model for the matrix protein explained approximately 9.9% of the variation in , compared to 9.2% for the capsid. Also, like the capsid, there appeared to be no region that is under uniformly strong selective pressure (Figure 4A), and the cluster of sites with the highest pressure were on the long eccentric arms of the trimer (Figure 4B).

Evolution of the HIV integrase

The available structures of integrase show it as a symmetric homo-dimeric structure with long arms protruding from the interaction surface. No structural predictors provided a statistically significant model of integrase evolution (Table 1). In addition, it showed the lowest rate of adaptive evolution and a very small number of sites with (Figure 1D.

Evolution of the HIV protease

The protease forms a heart-shaped homo-dimer. The catalytic site of the complex lies in the interface between the two subunits. In addition, the protease had a low average (Table 1) and a very small number of sites with (Figure 1E). For functional reasons, the enzyme active site must be accessible, and the protein-protein binding region must maintain a relatively high solvent exposure (Figure 5) compared to other protein-protein interfaces. As a result, we expected the highly solvent exposed catalytic site to cause an RSA-only model to perform poorly. Indeed, RSA alone could explain only 4.7% of the variation in . However, adding distance produced the second best model among the six proteins tested. The predictions with the combined RSA-Distance model showed that the catalytic site of protease is very conserved relative to the exterior regions of the protein (Figure 5A), a finding that would have been impossible without the distance predictor. Moreover, as with the gp120 protein the pattern of predicted rates is almost perfectly recapitulated by the correlation map (Figure 5B).

Evolution of the HIV reverse transcriptase

As with the other HIV enzymes, reverse transcriptase has a relatively low rate of adaptation at (Table 1) and almost no sites with (Figure 1F). The reverse transcriptase protein is structurally unique among the proteins in HIV. Although it forms a homo-dimeric quaternary structure, the two proteins in the functional assembly do not form the same tertiary structure. In addition, one of the two subunits in the functional assembly retains a separate catalytic domain known as RNase H. The unusual fact that the reverse transcriptase quaternary structure is composed on two identical polymers with between their structures means that any structural model is likely doomed to failure. Despite this fact, we did find that the RSA of the first chain alone was a weak, but statistically significant predictor of . The weakness of the model was highlighted by the fact that the site-wise predictions showed no clear trend toward important regions of the protein (Figure 6A and 6B). For example, the region contacting nucleic acid did not appear to be particular conserved. Also, no set of distances from the first chain could add any predictive power, and strangely the RNase H domain appeared to be strongly conserved relative to the nucleic acid contact site of the polymerase (Figure 6C and 6D).

Summary of combined models of HIV evolution

In summary, we found that using RSA in a single predictor model performs poorly in the majority of HIV proteins (Figure 7 and Table 1). With the exception of gp120, RSA could predict less than 6% of the variation in . This is in contrast to influenza, where RSA explains greater than 10% of the variation in both the hemagglutinin and neuraminidase proteins (Meyer and Wilke 2015). In addition, for the integrase protein, RSA was not even a statistically significant predictor of ; that makes integrase the only protein among the many hundred that have been tested to not display any folding-related evolutionary constraint. For the other five proteins, could be at least partly predicted with RSA.

Adding the distance to a reference point improved predictions of in all of the proteins tested. On average, the models that included distance were able to account for nearly twice as much variation in on the test data set than did the models with RSA alone. Importantly, our cross-validated test sets showed very similar values to those from the training set. Therefore, our optimization scheme is likely not over-fitting the data. Also, for each protein, there appeared to be a connection between the strength of the initial model that included only RSA and the strength of the best model that included both distance and RSA (Figure 8). However, protease was a significant outlier from the trend, and caused the overall correlation to not be significant. For example, the RSA-only model for protease was the third worst, but jumped to become the second best with a distance predictor. By contrast, the RSA-only model for reverse transcriptase was virtually identical in predictive power to that of protease; however, the addition of distance did not provide even a minimally significant improvement in predictions.


We have surveyed structural predictors of in HIV-1. We found that the common structural predictors used in other viruses show mixed results for HIV. We found that neither the structural proteins nor the enzymes show a common pattern of either adaptive evolution or modeling accuracy. Evolution could only be modeled in one of the structural proteins (gp120) and one of the enzymes (protease). However, the models generated for those two proteins could explain between 30% and 40% of . We found that glycosylations in the gp120 protein could not predict evolution better than simply RSA or distance from the apical surface. Finally, we showed with a very conservative cross-validation scheme that adding distance from an optimized reference point significantly improves evolutionary predictions in the majority of HIV proteins.

As a result of the error prone polymerase reverse transcriptase, HIV is among the most rapidly mutating and evolving viruses in nature. This fact alone sheds doubt on whether or not structural constraints could be used to predict any portion of adaptation in HIV. It could have been the case that the rapid accumulation of unfixed mutations lead to difficulty in sampling fixed differences or that folding and functional constraints were washed out by an enormous number of sites under positive selection. Neither of these possibilities was systematically true. The distribution of sitewise was always continuous (Figure 1) and structural constraints predicted some portion of evolution in almost all cases. This finding contradicts the long standing hypothesis in molecular evolution contending that sites in proteins evolve under a limited number of discrete regimes. Typically, one regime is that of where the number of amino acid substitutions is greater than the expectation under neutral DNA changes. A second regime is where the rate of amino acid substitutions exactly equals the null expectation of neutral DNA substitutions. A third regime is where the number of amino acid substitutions is less than the null expectation for a neutral DNA substitution. Historically selection analyses have focused on the sitewise relative to the neutral expectation of 1 and calculated the probability that a point value is actually different from the null. However, as our distribution plots show, even in viruses with enormous mutation rates and a large number of confounding polymorphisms that inflate sitewise estimates, the vast majority of sites have . More likely, due to a long history of selection for protein folding (and by proxy conserved functional folds) the traditional “neutral” expectation is far from a reasonable generic null model of protein adaptation. Therefore, there really is no neutral regime per se, and structured proteins likely do not ever evolve “neutrally” with any regularity. This is supported by the fact that the distribution of sitewise values is always smooth and continuous, and we recently found similar sitewise distributions in influenza protein. In general, there is probably a different neutral expectation for each protein fold and the only meaningful statistical test would be the extent to which a site is an outlier for that fold’s sitewise distribution. Moreover, a null model of seems to be constructed on the misunderstanding that a string of letters is a reasonable approximation of proteins; our data suggest it is not.

HIV-1 proteins generally evolve according to known structural constraints. In particular, combined models of the rate of adaptation in gp120 and protease were quite successful. Interestingly, the extent to which each of the structural constraints predicted evolution in gp120 and protease varied widely. Each of the two predictors, RSA and distance, accounted for approximately the same proportion of in gp120 whereas RSA was largely useless as a predictor for protease. Therefore, it appears the only critical evolutionary constraint for the protease protein is the relative biophysical coupling of a site to the enzymatic active site.

Of the six proteins we tested, gp120 was the most easily predicted by both RSA and the combined model. This finding is not particularly surprising since viral binding proteins are reliably under significant pressure from the host immune system and perform among the most critical functions for viral survival. We found a similarly predictive model of evolution for influenza hemagglutinin which performs largely the same function as gp120. Moreover, the directionality of the adaptive pressure was oriented in exactly the same direction in gp120 as in influenza hemagglutinin; the apical surface of both proteins is under intense adaptive pressure. Surprisingly, RSA and distance predicted a significantly larger portion of the evolution in gp120 than in influenza hemagglutinin. In addition, the small amount of that could be accounted for by the category of glycosylation sites, was completely obliterated by controlling for either RSA or distance, or both. Therefore, contrary to the glycan shield hypothesis (Wei et al. 2002), it appears glycosylation status is not significant systematic driver of sitewise adaptation in HIV. This result is in line with what has been shown previously from sequence analysis (Frost et al. 2005).

Due to the success of structural constraints in predicting protein evolution, we did not anticipate that the integrase protein would have no significant structural predictor of evolution. Similarly surprising was that this fact remained true no matter which domains we included and no matter how complete we made the structural model. This is the first protein, not to mention the first viral protein, that has been found to have no apparent structural constraints for its evolution. There are a number of possible explanations. First, considering the historical difficulty of crystallizing the HIV-1 integrase, it may be that the protein spends most of its time in a relatively disordered state. Such a situation would necessarily reduce the pressure to be folded and reduce the RSA constraint. Second, since there are no full length structures generated in a single experiment, it is possible that the all of the partial structures that are available to us missed a significant portion of correct protein contacts. Third, it is also possible that the crystal structures are simply not capturing any real physiological state. Fourth, some combination of these issues might be at play. For example, if the protein is frequently disordered, it may be relatively easy for the protein to be contorted into non-physiological states during crystallization.