The role of epistasis in determining the fitness landscape of HIV proteins

The rapid evolution of HIV is constrained by interactions between mutations which affect viral fitness. In this work, we explore the role of epistasis in determining the fitness landscape of HIV for multiple drug target proteins, including Protease, Reverse Transcriptase, and Integrase. Epistatic interactions between residues modulate the mutation patterns involved in drug resistance with unambiguous signatures of epistasis best seen in the comparison of a maximum entropy sequence co-variation (Potts) model predicted and experimental HIV sequence “prevalences” when expressed as higher-order marginals (beyond triplets) of the sequence probability distribution. In contrast, the evidence for epistasis based on experimental measures of fitness such as replicative capacity is weak; the correspondence with Potts model “prevalence”-based predictions is obscured by site conservation and limited precision. Double mutant cycles provide in principle one of the best ways to probe epistatic interactions experimentally without reference to a particular background, and we find they reveal that the most strongly interacting mutations in HIV involve correlated sets of drug-resistance-associated residues, however the analysis is complicated by the small dynamic range of measurements. The use of correlated models for the design of experiments to probe viral fitness can help identify the epistatic interactions involved in mutational escape, and lead to better inhibitor therapies. Author summary Protein covariation models provide an alternative to experimental measures for estimating the fitness of mutations in proteins from across a variety of organisms. Yet, for viral proteins, it has been shown that models including epistatic couplings between residues, or other machine learning models perform no better or even worse than a simpler independent model devoid of such epistatic couplings in estimating viral fitness measurements such as replicative capacities, providing weak or ambiguous evidence for epistasis. We show that the evidence for long-range epistasis is strong by the analysis of the high-order marginals of the MSA distribution (up to subsequences of length 14), which are accurately captured by a correlated Potts sequence-covariation model but not by an independent model. While double mutant cycles in principle provide well-established biophysical probes for epistatic interactions, we demonstrate that the analysis and comparison between model and experiment is difficult due to the much smaller dynamic range of the measurements, making them more susceptible to noise.


Introduction 1
A major challenge in biological research, clinical medicine, and biotechnology is how to 2 decipher and exploit the effects of mutations [1]. In efforts ranging from the 3 identification of genetic variations underlying disease-causing mutations, to the 4 understanding of the genotype-phenotype mapping, to development of modified proteins 5 with useful properties, there is a need to rapidly assess the functional effects of 6 mutations. Experimental techniques to assess the effect of multiple mutations on 7 phenotype have been effective [2][3][4][5], but functional assays to test all possible 8 combinations are not possible due to the vast size of the mutational landscape. Recent 9 advances in biotechnology have enabled deep mutational scans [6] and multiplexed 10 assays [7] for a more complete description of the mutational landscapes of a few 11 proteins, but remain resource intensive and limited in scalability. The measured 12 phenotypes depend on the type of experiment and are susceptible to changes in 13 experimental conditions making the comparison between measurements difficult [8]. 14 These methodologies are also utilized under externally applied conditions, but how in 15 vitro selection pressures can be extended to the interpretation of pressures in vivo is not 16 always clear [9]. capacity, is weak and confounded by a number of factors including data limitations, 53 statistical and experimental errors. It is instead in the comparison of higher-order 54 marginals that unambiguous signatures of epistasis can be seen. Although double 55 mutant cycle experiments in principle provide one of the best biophysical ways to 56 examine epistasis, we demonstrate with numerical examples that accurate predictions of 57 double mutant cycles are difficult due to the small dynamic range of the measurements 58 making them much more susceptible to noise. It has been suggested that the success of 59 the Potts sequence covariation model at recapitulating high-throughput mutation 60 experiments depends in part on the extent to which experimental assays can capture 61 phenotypes that are under direct, long-term selection [1]. Measures such as 62 thermostability, activity, or binding energetics of a protein generally do not all 63 contribute to fitness in the same way. We find that while the Potts model performs 64 marginally better than an independent model when predicting experimental replicative 65 capacities, nevertheless it provides a more general representation of the protein fitness 66 landscape capturing contributions from different features of the landscape, replicative 67 capacities and folding energetics, that are not fully captured by either measurement on 68 their own. 69

70
Protein sequence covariation models have been extensively used to study networks of 71 interacting residues for inference of protein structure and function. The Potts model is 72 a maximum-entropy model based on the observed mutational correlations in a multiple 73 sequence alignment (MSA) and constrained to accurately capture the bivariate 74 (pairwise) residue frequencies in the MSA. A central quantity known as the "statistical" 75 energy of a sequence E(S) (Equation 2, Methods) is commonly interpreted to be 76 proportional to fitness; the model predicts that sequences will appear in the dataset 77 with probability P (S) ∝ e −E(S) , such that sequences with favorable statistical energies 78 are more prevalent in the MSA. P (S) describes the "prevalence" landscape of a protein 79 and the marginals of P (S) can be compared with observed frequencies in a multiple 80 sequence alignment. Previous studies have indicated that the Potts model is an accurate 81 predictor of "prevalence" in HIV proteins [20,21,23,[31][32][33][34][35]; "prevalence" is often used 82 as a proxy for "fitness" with covariation models serving as a natural extension for 83 measures of "fitness" based on experiments and model predictions have been compared 84 to different experimental measures of "fitness" with varying degrees of 85 success [1,21,23,28,31,33,35]. Site-independent models, devoid of interactions between 86 sites have also been reported to capture experimentally measured fitness well, in 87 particular for viral proteins [1,36] with studies (on HIV Nef and protease) implying that 88 the dominant contribution to the Potts model predicted sequence statistical energy 89 comes from site-wise "field" parameters h i (see Methods) in the model [28,35]. In this 90 study, we show that interaction between sites is necessary to capture the higher order 91 (beyond pairwise) mutational landscape of HIV proteins for functionally relevant sites, 92 such as those involved in engendering drug resistance, and cannot be predicted by a 93 site-independent model. The correspondence between model predictions of fitness based 94 on "prevalences" in natural sequences with experimentally measured "fitness" is The correspondence between sequence covariation models and sequence statistics in multiple sequence alignments is very strong across different HIV proteins. The correspondence between either covariation models, or "prevalences" in multiple sequence alignments, with other experimental measures of "fitness" is less clear and often inconsistent between different statistics and measures of fitness.
"Prevalence" landscape of HIV proteins and the role of high. This provides support for the importance of epistasis in these datasets. However, 112 in Supplementary File 1 Fig 9 we also show that for some datasets the difference 113 between the Potts and Independent distributions is small, and so may be a less reliable 114 test of the importance of covariation. The importance of correlations is also apparent 115 through the fact that the Potts model also accurately predicts the likelihoods and 116 "entrenchment" of mutations based on the sequence background, as has been verified 117 using aggregate sequence statistics from the MSA [34]. Distribution of the number of mutations (Hamming distances) in drug-experienced HIV-1 sequences as captured by the Potts and independent models. Probabilities of observing sequences with any k number of mutations relative to the HIV-1 subtype B wild-type consensus sequence as observed in original MSA (black) and predicted by the Potts (blue) and independent (gray) models are shown for HIV-1 protease (PR) in (A), and reverse transcriptase (RT) in (B), respectively. The independent model predicted distribution does not accurately capture the distribution of hamming distances in the dataset MSA, especially near the ends of the distribution with either very low or very high number of mutations, where the epistatic effects can be more significant.
But the most direct and strongest evidence of the ability of the Potts model to 119 capture epistatic interactions is seen in its ability to reproduce the higher-order because most sequences in an unbiased MSA are observed only once due to the 125 minuscule sample size in comparison to the vast size of sequence space. Only sequence 126 marginals up to sizes ∼ 14 residues, depending upon protein family, are observed with 127 sufficient frequencies such that their marginal counts are a good proxy for the marginal 128 probabilities predicted by the Potts model. Figure 3 shows the rank-correlation between 129 model predicted marginal probabilities and marginal frequencies in the MSA for 130 subsequences of lengths 2 − 14, with a subsequence being the concatenation of amino 131 acid characters from an often nonconsecutive subset of residue positions. With further 132 increase in the subsequence length, data limitations due to finite sampling become more 133 prominent and the observed and model predicted marginals become dominated by noise. 134 The Potts model's ability to predict higher-order marginals, much beyond the pairwise, 135 for drug-resistance associated sites, while the independent model cannot, provides the 136 most direct evidence of the ability of the Potts model to accurately capture the 137 long-range epistatic interactions that modulate the "prevalence" of amino acid residues 138 at connected sites in the protein. The Potts model is able to accurately predict the 139 higher-order marginal frequencies (which have not been directly fit) at drug-associated 140 sites with a Spearman ρ 2 ≈ 0.95 for the longest subsequence (of length 14) in PR, interactions between residues is more pronounced and the site-independent model is not 147 able to capture the higher-order marginals. In contrast, for residue positions that are 148 not associated with drug resistance, the site-independent model can sufficiently recover 149 the higher order marginals in the MSA. Sites in the protein associated with drug 150 resistance, also however, exhibit considerably more variability contributing to their  drug-associated positions, the role of correlations is more apparent. This is suggestive 159 that strong couplings between sites that are likely to co-mutate, allow for mutations at 160 lesser costs to fitness than the individual mutations alone, resulting in mutational mutations at single sites, in which case drug treatment would likely be ineffective [37]. 164 In contrast to Fig 3A for HIV PR, Fig 3E shows the somewhat improved predictive 165 capacity of a site-independent model in capturing the higher order sequence statistics 166 for drug-resistance associated positions in HIV IN. This is indicative that correlations 167 between drug-resistance-associated sites appear to play a stronger role in protease than 168 in IN. This is also in line with the fact that the IN enzyme is more conserved than PR 169 (Supplementary File 1 Fig 1). Amongst the three drug-target proteins, PR, RT, and IN, 170 the degree of "evolutionary conservation" is considerably higher in IN than in the 171 others. The lack of variability at sites or considerably smaller site-entropies in IN plays 172 a role in obscuring the effect of correlations, as discussed. Furthermore, the MSA depth 173 for IN is also considerably lower than in PR or RT, which adversely affects the quality 174 of the Potts model fit, further making the correlated model less distinguishable from a 175 site-independent one [38]. interactions, upto the 14 th order, that is captured accurately by the Potts model. This 180 illustrates the existence of correlated networks of long-range interactions between sites 181 in HIV, which play an important role in determining its evolutionary fitness landscape. 182 The entrenchment of primary resistance mutations in HIV was shown to be contingent 183 on the presence of specific patterns of background mutations beyond the well studied 184 primary/accessory compensatory pairs, and could not be predicted on the basis of the 185 number of background mutations alone [34], also indicating that long-range correlations 186 involving many sites can potentially shape the evolutionary trajectory of the virus.  [1,9]. Overall, we find that the 215 evidence for epistasis from measures of fitness based on experimental replicative 216 capacities is much weaker compared to that available from the higher-order marginal 217 statistics and can be confounded by different factors. Double mutant cycles provide a biophysical means to interrogate epistatis without 241 reference to a specific sequence background [43]. For a pair of mutations α, β at 242 positions i, j in the protein respectively, the strength of epistatic interactions can be 243 quantified using the difference between the sum of the independent mutational effects, 244 ∆E i α + ∆E j β , and the effect of the corresponding double mutation, ∆E ij αβ .
If ∆∆E = 0, then the two mutations are epistatically coupled, whereas if ∆∆E = 0,    In this section, we explore the contribution from changes in structural stability due to a 300 mutation to its Potts model predicted likelihood(s). To explore the impact of a 301 mutation on structural stability, we employ a well-known protein design algorithm 302 called FoldX [44,45], which uses an empirical force field to determine the energetic 303 effects of a point mutation. FoldX mutates protein side chains using a probability-based 304 rotamer library while exploring alternative conformations of the surrounding side chains, 305 in order to model the energetic effects of a mutation. We observe good correspondence 306 between Potts model predicted likelihoods and FoldX predicted changes in structural  Fig 8A), but the FoldX predicted changes in structural stabilities do not correlate so 312 well with experimentally measured replicative capacities (Fig 7C, Supplementary File 1 Fig 8C). This is indicative that different measures of fitness such as thermostability,  [28] involving combinations (of three or lesser) of mutations in protease associated with resistance to (particularly second-generation) inhibitors in clinic, are compared to changes in Potts statistical energies, ∆Es with a Spearman rank-order correlation, ρ = 0.66 (p 0.001). [28] also report statistically significant correlation (|ρ| = 0.46) with a Potts model inferred using the Adaptive Cluster Expansion (ACE) algorithm. (B) FoldX predicted changes in folding energies, ∆∆Gs (PDB: 3S85) of the mutations also correlate well with Potts predicted changes in statistical energies, ∆Es for the same (Spearman ρ = −0.57). The HIV-1 protease structure (PDB: 3S85) is used as reference, repaired using the RepairPDB function in the FoldX suite, and the free energy of mutants is calculated with the BuildModel function under default parameters. Changes in structural stability due to mutations correlate well with their predicted likelihoods (estimated by the Potts model ∆Es) as seen here with a Spearman rank-order correlation, ρ = −0.57 (p < 0.001) between the two. However, FoldX calculations are susceptible to small changes in structure that can be caused by the presence of small-molecule ligands, etc. For another PDB:4LL3, we still find statistically significant correlation between the two (ρ = −0.64). (C) Experimental relative fitness measurements however, do not correlate as well with FoldX predicted changes in folding energies due to the mutations (ρ = −0.36).

323
Fitness is a complex concept at the foundation of ecology and evolution. The measures 324 of fitness range from those such as replicative capacity, protein stability, catalytic 325 efficiency, that can be determined experimentally in the lab to measures stemming from 326 the "prevalence" in collections of sequences obtained from nature, that can be often fall short of predictions by site-independent models possibly as a consequence of 340 the limited diversity of the sequence alignments [1] or due to a discrepancy between the 341 proxy for viral fitness in the laboratory and the in vivo fitness of the virus [9]. Here, we 342 show that the signatures of epistasis, although weakly detectable in comparisons with 343 experimental fitness, are best manifested for viruses like HIV in the comparison of the 344 Potts model predicted and experimental HIV sequence "prevalences" when expressed as 345 higher-order marginals of the sequence probability distribution. The model, which is 346 parameterized to reproduce the bivariate marginals in the MSA, also accurately 347 captures the higher order marginal probabilities (seen in Fig 3 A,C,   Different measures may also contribute to fitness in different ways. In this study, we 373 employ FoldX to probe the contribution of structural changes and folding energetics due 374 to mutations to their predicted/observed likelihoods, finding that the Potts model 375 predicted likelihoods of mutations in HIV correlate well with FoldX predicted changes 376 in free energies. FoldX predictions, however, do not correlate well with experimental 377 replicative capacity measurements. This is suggestive that the overall fitness landscape 378 predicted by the Potts model includes contributions from many different features, some 379 may even be orthogonal and thus, may not necessarily correlate well with each other.

380
The evolution of viruses like HIV under drug and immune selection pressures induces 381 correlated mutations due to constraints on the structural stability and fitness (ability to 382 assemble, replicate, and propagate infection) of the virus [46]. This is a manifestation of 383 the epistatic interactions in the viral genome. The analysis presented here provides a 384 framework based on sequence prevalence to examine the role of correlated mutations in 385 determining the structural and functional fitness landscape of HIV proteins, especially 386 under drug-selection pressure. Epistatic effects are vital in shaping the higher order 387 (well beyond pairwise) "prevalence" landscape of HIV proteins involved in engendering 388 drug resistance. Identifying/elucidating the epistatic effects for key resistance mutations 389 can help in designing better experiments to probe epistasis and has the potential to 390 impact future HIV drug therapies.

392
The Potts Hamiltonian model of protein sequence covariation is a probabilistic model 393 built from the single and pairwise site amino-acid frequencies in a protein multiple 394 sequence alignment, aimed at describing the probabilities of observing different 395 sequences in the MSA. To approximate the unknown empirical probability distribution 396 P (S) that best describes a sequence S of length L with each residue encoded in a 397 Q-letter alphabet, using a model probability distribution P m (S)[U+2060] (as in [47]), 398 we choose the maximum entropy (or least biased) distribution as the model distribution. 399 Similar distributions that maximize the entropy, with the constraint that the empirical 400 univariate and bivariate marginal distributions are preserved, have been derived 401 in [10,11,22,31,48]. We follow a derivation of the maximum entropy model in [31,47], 402 which takes the form of an exponential distribution: where the quantity E(S) is the Potts statistical energy of a sequence S of length L; the 405 model parameters h i Si , called "fields", represent the statistical energy of residue S i at 406 position i in S; and J ij SiSj called "couplings" represent the energy contribution of a pair 407 of residues at positions i, j. In this form, the Potts Hamiltonian consists of LQ "field" 408 terms h i Si and L 2 Q 2 "coupling" terms J ij SiSj . For the distribution P m ∝ e −E , negative 409 fields and couplings indicate favored amino acids. The change in Potts energy for a 410 mutation α → β at position i in S is given by: In this form, ∆E(S i α→β ) > 0 implies that residue β is more favorable than residue α at 412 the given position and vice versa. The sample size or MSA depth however, plays a 413 critical role in determining the quality and effectiveness of the model [49]. sequences with exposure to both NRTIs and NNRTIs were selected due to much lesser 428 number of sequences exposed to only NRTIs or only NNRTIs. Sequences with insertions 429 ("#") and deletions ("∼") are removed. Multiple sequence alignments for p24 Capsid 430 are obtained from the the Los Alamos HIV sequence database [52]  It has been previously established that phylogenetic corrections are not required for 443 HIV patient protein sequences [23,31] as they exhibit star-like phylogenies [53,54]. For 444 model inference, HIV patient sequences, are given sequence weights such that the 445 effective number of sequences obtained from any single patient is 1. Sequences obtained 446 from different patients are considered to be independent.
. The process is then iteratively carried out until the desired reduction in amino acid 469 characters is achieved. Using the reduced alphabet, the original MSA is then re-encoded 470 and the bivariate marginals are recalculated. Small pseudocounts are added to the 471 bivariate marginals, as described by [17] to make up for sampling biases or limit 472 divergences in the inference procedure. Due to reduction in alphabet, some mutations may not be amenable to our analysis 485 when comparing to experimental fitness measurements such as replicative capacities, etc. 486 We choose mutations corresponding to marginals with higher values in the MSA to be 487 more representative of the model predictions. previously [10,11,22,31,48,[58][59][60][61][62]. The methodology followed here is similar to the one 494 in [31], where, the bivariate marginals are estimated by generating sequences through a 495 Markov Chain Monte Carlo (MCMC) sampling procedure, given a set of fields and 496 couplings. The Metropolis criterion for the generated sequence(s) is proportional to 497 their Potts energies. This is followed by a gradient descent step using a 498 multidimensional Newton search, to determine the optimal set of Potts parameters that 499 minimizes the difference between the empirical bivariate marginal distribution and the 500 bivariate marginal estimates from the MCMC sample. Although the methodology 501 involves approximations during the computation of the Newton steps, the advantage of 502 the methodology is that it avoids making explicit approximations to the model 503 probability distribution at the cost of being heavily computationally extensive. We have 504 employed a GPU implementation of the MCMC methodology, which makes it 505 computationally tractable without resorting to more approximate inverse inference 506 methods. The MCMC algorithm implemented on GPUs has been previously used to 507 infer accurate Potts models as in [23,34,38].

508
The scheme for choosing the Newton update step, however is critical. A 509 quasi-Newton parameter update approach determining the updates to J ij and h i by 510 inverting the system's Jacobian was developed in [31], which we follow here. We further, 511 take advantage of the gauge invariance of the Potts Hamiltonian and use a fieldless 512 gauge in which h i = 0 for all i, and compute the expected change in the model bivariate 513 marginals ∆f ij m (hereafter dropping the m subscript) due to a change in J ij to the first 514 order by: The challenging part of the computation is computing the Jacobian ∂f ij S i S j ∂J kl S k S l and 516 inverting the linear system in equation 6 in order to solve for the changes in ∆J ij and 517 ∆h i given the ∆f ij . We choose the ∆f ij as: with a small enough damping parameter γ such that the linear (and other) 519 approximations are valid.

520
The site-independent model is inferred based on the univariate marginals h i in the 521 MSA alone, giving the "field" parameters as: with a small pseudo-count added to the h i to avoid indeterminate logarithms. The

525
The computational cost of fitting L 2 * (4 − 1) 2 + L * (4 − 1) model parameters for 526 the smallest protein in our analysis, PR, on 2 NVIDIA K80 or 4 NVIDIA TitanX GPUs 527 is ≈ 20h. For a more detailed description of data preprocessing, model inference, and 528 comparison with other methods, we refer the reader to [19] and [17,23,38,49]. 529 Prediction of higher order marginals 530 The Potts model inferred using the methodology described above is generative, allowing 531 for generation of new synthetic sequences which very closely represent the sequences in 532 the MSA of protein sequences obtained from HIV patients. For prediction and 533 comparison of the higher-order marginals, both the Potts and independent models are 534 used to generate new sequences, and subsequence frequencies (marginals) are compared 535 between the dataset MSA and the Potts/independent model generated MSAs. For each 536 subsequence of length 2 − 14, the process is repeated for 500 randomly picked 537 subsequences and the Spearman correlation coefficient is calculated for all subsequences 538 which appear with a frequency greater than the threshold (to avoid noise).

539
Statistical Robustness of HIV Potts models 540 Finite sampling and overfitting can affect all inference problems, and the inverse Ising 541 inference is no exception. In case of the Potts model, the number of model parameters 542 can vastly outsize the number of sequences in the MSA, yet it is possible to fit accurate 543 Potts models [49] to those MSAs, as the model is not directly fit to the sequences but to 544 the bivariate marginals of the MSA. However, finite sampling can affect the estimation 545 of the marginal distributions, which, in turn, affects model inference. In fact, overfitting 546 in the inverse Ising inference arises due to the finite-sampling error in the bivariate 547 marginals estimated from a finite-sized MSA. The degree of overfitting can be 548 quantified using the "signal-to-noise ratio" (SNR), which is a function of the sequence 549 length L, alphabet size q, number of sequences in the MSA N , and the degree of 550 evolutionary conservation in the protein. If the SNR is small, the Potts model is unable 551 to reliably distinguish high scoring sequences in the data set from low-scoring sequences. 552 If SNR is close to or greater than 1, then overfitting is minimal and the Potts model is 553 more reliable. In the analysis presented here, IN has the lowest SNR (0.14 compared to 554 43.7 for RT, and 21.6 for PR) on account of being one of the more conserved proteins 555 with the lowest number of sequences in the MSA, and may be more affected by 556 overfitting. Different predictions of the Potts model, however are differently affected by 557 finite sampling errors with predictions of ∆Es which form the basis of Potts model 558 "fitness" predictions among the more robust [49]. The Potts model is also able to 559 accurately capture the higher-order marginals in the MSA. Thus, we conclude that the 560 MSA sample sizes used in this study are sufficiently large to construct Potts models for 561 these HIV proteins that adequately reflect the effect of the sequence background on 562 mutations.

563
Protein stability analysis 564 The changes in folding free energies due to mutations are analyzed using FoldX [44,45], 565 which uses an empirical force field to determine the energetic effects of point mutations. 566 The HIV-1 protease structure (PDB: 3S85) is used as reference, repaired using the 567 RepairPDB function in the FoldX suite, and the free energy of mutants is calculated 568 with the BuildModel function under default parameters. For each mutation, the mean 569 of 10 FoldX calculations is used as the ∆∆G value.