Abstract
Pathogen evolution of drug resistance often occurs in a stepwise manner via accumulation of multiple mutations, which in combination have a non-additive impact on fitness, a phenomenon known as epistasis. Epistasis complicates the sequence-structure-function relationship and undermines our ability to predict evolution. We present a computational method to predict evolutionary trajectories that accounts for epistasis, using the Rosetta Flex ddG protocol to estimate drug binding free energy changes upon mutation and an evolutionary model based in thermodynamics and statistical mechanics. We apply this method to predict evolutionary trajectories to known multiple mutations associated with resistant phenotypes in malaria. Resistance to the combination drug sulfadoxine-pyrimethamine (SP) in malaria-causing species Plasmodium falciparum (Pf) and Plasmodium vivax (Pv) has arisen via the accumulation of multiple point mutations in the DHFR and DHPS genes. Four known PfDHFR pyrimethamine resistance mutations are highly prevalent in field-isolates and multiple studies have shown epistatic interactions between these mutations determine the accessible evolutionary trajectories to the highly resistant quadruple mutation N51I,C59R,S108N,I164L. We simulated the possible evolutionary trajectories to this quadruple PfDHFR mutation as well as the homologous PvDHFR mutations. In both cases, our most probable pathways agreed well with those determined experimentally. We also applied this method to predict the most likely evolutionary pathways to observed multiple mutations associated with sulfadoxine resistance in PfDHPS and PvDHPS. This novel method can be applied to any drug-target system where the drug acts by binding to the target.
Significance Statement
Antimicrobial resistance (AMR) is a major public health threat, resulting from the overuse and misuse of antimicrobial drugs. Our ability to monitor emerging resistance and direct the most appropriate treatment strategy (stewardship) would be strengthened by the development of methods to accurately predict the mutational pathways that lead to resistance. We present a computational method to predict the most likely resistance pathways that result from mutations that alter drug binding affinity. We utilized the Rosetta Flex ddG protocol and a thermodynamic evolutionary model. This novel approach can accurately capture known resistance trajectories and can be applied to any system where resistance arises via changes in drug binding affinity.
Introduction
The development of new antimicrobial drugs and therapeutics, and the design of more successful drug deployment strategies to reduce the prevalence of resistance, requires an understanding of the underlying molecular evolution.
Antimicrobial resistance (AMR) poses a huge global health threat and arises through a wide range of mechanisms, including: the protection of drug targets by post-translational modifications such as methylation; the presence and increased expression of proteins that bind to targets, preventing drug interactions; limiting the physical entry of the drug compound into the cell; increasing the efficiency of drug efflux through the type or number of efflux pumps; and, in the case of bacteria, horizontal gene transfer [1–3]. One of the major routes to resistance, and a focus of the work presented here, is genomic variation within protein coding regions. Of particular significance are single-nucleotide polymorphisms (SNPs) in the antimicrobial target gene that alter the protein structure and prevent efficient binding of the antimicrobial drug. Provided these SNPs do not prevent the target from carrying out its function, the resistant strains will proliferate within the population [4].
The evolution of resistance alleles is affected by the evolutionary interplay between selection for resistance and selection for protein function, as well as drug concentration, epistasis between resistant mutations and mutational bias [5–8]. Prime examples of this are the Plasmodium falciparum and vivax parasites that cause the majority of malaria infections and which have evolved strong resistance to many antimalarial drugs, including pyrimethamine [9], chloroquine [10] and sulfadoxine [11].
There were an estimated 229 million new cases of malaria worldwide in 2019, resulting in approximately 409,000 deaths, predominantly among children under 5 [12]. P. falciparum accounts for the majority of cases in Africa, Southeast Asia, the Eastern Mediterranean and Western Pacific, whilst P. vivax is predominant in Central and South America.
P. falciparum malaria has been treated with the combination drug sulfadoxine-pyrimethamine (SP), which targets the folate metabolic pathway, since the 1970s. Numerous resistance mutations have arisen within its genome as a result of SNPs in P. falciparum dihydrofolate reductase (PfDHFR) and dihydropteroate synthase (PfDHPS) genes, which are the targets of pyrimethamine and sulfadoxine respectively [11, 13, 14]. Although SP is not usually used to treat P. vivax, co-infections with P. falciparum have meant SP resistance mutations have also arisen in the P. vivax genome [15]. The enzymes of the folate pathway are largely conserved across Plasmodium species, and so polymorphisms in equivalent positions have been observed in P. vivax DHFR (PvDHFR) and DHPS (PvDHPS) and are thought to confer resistance to SP [16–19].
The DHFR gene encodes an enzyme that uses NADPH to synthesize tetrahydrofolate, a co-factor in the synthesis of amino acids [20], and pyrimethamine acts to disrupt this process, thereby blocking DNA synthesis and slowing down growth. Stepwise acquisition of multiple mutations leading to resistance to pyrimethamine has been observed in both PfDHFR [7, 9] and PvDHFR [8]. DHPS exists as a bifunctional enzyme with hydroxymethyldihydropterin pyrophosphokinase (HPPK) and performs a key step in folate synthesis by catalysing the conversion of para-aminobenzoate (pABA) to dihydropteroate. The anti-folate drug sulfadoxine targets DHPS by preventing this conversion and mutations in DHPS have arisen that confer resistance to sulfadoxine [21, 22].
In PfDHFR, a combination of four mutations, Asn-51 to Ile (N51I), Cys-59 to Arg (C59R), Ser-108 to Asn (S108N) and Ile-164 to Leu (I164L), has been reported to result in resistance to pyrimethamine [23] by altering the binding pocket and reducing the affinity for the drug [24]. Non-additivity in both pyrimethamine binding free energy and the concentration required to inhibit cell growth by 50% (IC50) has been observed experimentally for combinations of these four mutations [7, 9]. This non-additivity, known as epistasis, means that mutations which on their own are not associated with a resistance phenotype may be associated with one when combined with other mutations. Epistasis between these mutations has been shown to determine the evolutionary trajectories to the quadruple mutation N51I,C59R,S108N,I164L, which is strongly associated with pyrimethamine resistance.
A similar investigation was conducted into the homologous set of PvDHFR mutations, Asn-50 to Ile (N50I), Ser-58 to Arg (S58R), Ser-117 to Asn (S117N) and Ile-173 to Leu (I173L), and the accessible evolutionary trajectories to the quadruple mutation [8]; some combinations of these mutations have been observed to result in pyrimethamine resistance both in vivo and in vitro [19, 25]. Evolutionary simulations accounting for growth rates, IC50 measurements for increasing concentrations of pyrimethamine and nucleotide bias predicted the most likely pathways to the quadruple mutation for different drug concentrations. The authors observed that the trajectories at each concentration were influenced by epistasis between the mutations and the adaptive conflict between endogenous function and acquisition of drug resistance.
These studies, along with other investigations [5, 26], have highlighted the prevalence of epistasis among resistance mutations and the importance of considering epistatic interactions between mutations when predicting evolutionary trajectories of drug resistance.
Epistasis between mutations within the same protein arises due to energetic interactions between the amino acids, where the impact of a mutation on either function, fitness or a physical property depends upon the protein sequence [27]. When epistasis occurs between two or more mutations, their combined impact on protein fitness or a physical trait such as stability or binding affinity does not equal the sum of their independent impacts. As well as in the aforementioned studies, epistasis has been observed in the evolution of many pathogens, such as Escherichia coli [28], Influenza [29], and RNA viruses [30]. Epistasis may also have important consequences for the success of AMR management strategies that aim to reduce resistance via the cessation of use of a particular drug. In theory, stopping the use of a drug should result in reversion of resistance mutations, because most resistance mutations incur a large fitness cost in the absence of drug [31–33]. However, the success of this strategy has been mixed, and in some cases bacterial populations remained resistant [34–36]. This may be the result of compensatory mutations (a type of epistasis), which mitigate the deleterious impact of resistance mutations, allowing them to remain in a population and thus retain resistance even in the absence of drug selection pressures [37].
Therefore, understanding how epistasis arises and predicting which mutations will interact epistatically is important for anticipating future mutations and the most likely evolutionary trajectories to resistance, designing new drugs and developing strategies to minimize resistance.
Whilst experimental methods have been successful in characterising the interactions between resistance mutations and predicting evolutionary trajectories, they are expensive and time consuming. Therefore, identifying computational methods that can accurately characterise how mutations interact could potentially enable fast and reliable predictions of evolutionary trajectories to resistance.
One of the main mechanisms of drug resistance is selection for mutations that reduce drug binding free energy. Several computational tools exist to predict changes in binding free energy upon mutation; however, their ability to predict epistasis is limited or unknown. The Rosetta protocol Flex ddG [38] is the current state-of-the-art method for predicting changes in protein-protein and protein-ligand binding free energy. Rosetta is a software suite for macromolecular modelling and design that uses all-atom mixed physics- and knowledge-based potentials, and provides a diverse set of protocols to perform specific tasks, such as structure prediction, molecular docking and homology modelling [39]. The Flex ddG protocol has been found to perform better than machine learning methods and comparably to molecular dynamics methods when tested on a large dataset of ligand binding free energy changes upon protein mutation [40, 41]. However, its ability to capture epistasis has not yet been tested. Therefore, we investigated how well Flex ddG can capture epistasis between mutations and observed a good correlation with experimental data.
Next, we developed a method for predicting evolutionary trajectories to multiple resistance mutations, using predicted binding free energy changes and an evolutionary model based in thermodynamics and statistical mechanics. We tested this method by simulating the accessible evolutionary trajectories to known resistant quadruple mutants in both PfDHFR and PvDHFR, as these targets have been studied experimentally. We observed good agreement between the most likely trajectories predicted by our model and those predicted experimentally.
Once we had demonstrated that our method for predicting evolutionary trajectories agrees well with experimentally determined trajectories, we extended our analysis to the DHPS gene of P. falciparum and P. vivax. This work represents the first attempt to predict evolutionary trajectories to resistance in these genes, either experimentally or computationally.
We compared our predictions to the frequency of mutations found in isolate data from countries from several geographical regions including South America, West and East Africa, South and Southeast Asia and Melanesia.
Results
Rosetta Flex ddG accurately predicts binding free energy changes and epistasis for PfDHFR mutations
We investigated whether Flex ddG was able to capture experimentally measured non-additivity in binding free energy. The change in binding free energy was predicted for the combinatorically complete set of the four PfDHFR pyrimethamine resistance mutations N51I, C59R, S108N and I164L. The authors of the Flex ddG protocol suggest conducting a minimum of 35 runs and taking the average of the distribution as the prediction for that mutation. We observed that the distributions were not well characterised by only 35 runs and so we undertook 150 runs for each mutation. Similarly, many of the distributions were not Normal and so we chose to use the peak of the distribution as a prediction for each mutation.
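The difference between averaging the runs and taking the peak of the distribution can be illustrated with a minimal Python sketch. The histogram-based peak estimate, the bin count and the synthetic right-skewed sample are all assumptions made for illustration; the text does not specify how the peak was located, and the numbers below are not Flex ddG output.

```python
import random
import statistics

def histogram_peak(values, n_bins=15):
    """Return the centre of the most populated equal-width bin (a crude mode estimate)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width

# Synthetic, right-skewed stand-in for 150 runs of one mutation (hypothetical).
rng = random.Random(1)
runs = [0.05 + rng.expovariate(4.0) for _ in range(150)]

mean_estimate = statistics.mean(runs)   # what averaging the runs would report
peak_estimate = histogram_peak(runs)    # what a peak-based summary would report
```

For a skewed sample of this kind the two summaries can differ noticeably, which is the motivation given above for not relying on the average alone.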
We compared our predictions with the data of Sirawaraporn et al. (1997) [9], who experimentally determined the change in binding free energy for a subset of the combinatorically complete set of the four known PfDHFR resistance mutations (Table 1). Here a positive ΔΔG value indicates a mutation that destabilises pyrimethamine binding and a negative ΔΔG value indicates a stabilising mutation. Note that Rosetta Flex ddG calculates the change in binding free energy as ΔΔG = ΔGmut – ΔGWT, whereas Sirawaraporn et al. (1997) calculated the change as the reverse, ΔΔG = ΔGWT – ΔGmut, where WT indicates the wild-type free energy and mut indicates the mutant free energy. Therefore, in the original manuscript, a mutation that destabilised the binding corresponded to a negative ΔΔG, whilst here we have reversed the signs of the original data to enable comparison with our predictions. Furthermore, Flex ddG predictions are in Rosetta Energy Units (R.E.U), whereas the data from [9] are in units of kcal/mol.
Our predictions had a Pearson’s correlation of 0.63 with the experimental dataset and 8/9 of the mutations were correctly classified as either stabilising or destabilising, indicating that the predictions capture the sign and broad trend of the experimental data. The single mutation S108N was found to be the only destabilising single mutation in both the experimental data and the predictions, with all other single mutations having a stabilising impact on pyrimethamine binding. There are, however, some discrepancies. In the experimental data the double mutation N51I/S108N is more destabilizing to binding than single mutation S108N, whereas the peak of the Flex ddG distribution was stabilizing, with only a few of the 150 runs predicting that the mutation was destabilizing. The triple mutation C59R/S108N/I164L was found experimentally to be the most destabilizing of the triple mutations, but Flex ddG predicted it to be only mildly destabilizing and the least destabilizing of the triple mutations. Furthermore, the quadruple mutation was found experimentally to have the most destabilising impact of all combinations of single and multiple mutations, whereas Flex ddG predicted it to be less destabilising than the double mutation C59R/S108N.
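The two metrics used above, Pearson's correlation and the number of mutations with the correct sign, can be sketched as follows. The function names and the toy values are hypothetical and are not taken from Table 1. Because Pearson's correlation is unchanged by a linear rescaling of either variable, predictions in R.E.U can be compared with measurements in kcal/mol without a unit conversion.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def sign_agreement(predicted, measured):
    """Count pairs where both values are destabilising (> 0) or both are not."""
    return sum(1 for p, m in zip(predicted, measured) if (p > 0) == (m > 0))

# Hypothetical ddG values (positive = destabilising); for illustration only.
toy_predicted = [0.9, -0.1, -0.2, 1.3, -0.4]
toy_measured = [1.5, -0.8, -0.6, 2.1, 0.3]

r = pearson_r(toy_predicted, toy_measured)
n_correct = sign_agreement(toy_predicted, toy_measured)
```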
Sirawaraporn et al. (1997) determined the ‘interaction energy’, which they defined as the difference between the change in binding free energy due to a multiple mutation and the sum of the changes in binding free energy when the mutations occur on their own. For a double mutation, the interaction energy is the same as the pairwise epistasis between the pair of mutations. For triple and quadruple mutations, the interaction energy quantifies the non-additivity between the mutations and indicates the level of epistasis between the mutations.
We determined the interaction energy by finding the difference between the predicted change in binding free energy of a multiple mutation and the sum of the predictions of their independent binding free energy changes (Table 1). A positive value of the interaction energy indicates the sum of the independent impacts is more destabilizing than the impact of the multiple mutation and a negative value indicates the sum is less destabilizing than the combined impact.
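A minimal sketch of this calculation, using the sign convention stated above (positive when the summed single-mutant effects are more destabilising than the multiple mutant), might look like the following. The dictionary layout, the function name and the toy values for mutations "A" and "B" are assumptions for illustration and are not values from Table 1.

```python
def interaction_energy(ddg, mutations):
    """Sum of single-mutant ddG values minus the ddG of the combined mutant.

    Positive: the summed single effects are more destabilising than the
    multiple mutant. Negative: the multiple mutant is more destabilising
    than the sum of its parts.
    """
    combined = ddg[frozenset(mutations)]
    additive = sum(ddg[frozenset([m])] for m in mutations)
    return additive - combined

# Hypothetical ddG values keyed by genotype (positive = destabilising).
toy_ddg = {
    frozenset(["A"]): -0.2,      # mildly stabilising on its own
    frozenset(["B"]): 1.0,       # destabilising on its own
    frozenset(["A", "B"]): 1.5,  # more destabilising than the sum (0.8)
}

pair_energy = interaction_energy(toy_ddg, ["A", "B"])  # negative in this toy case
```

For a double mutant this quantity is the pairwise epistasis; for triple and quadruple mutants it measures the total departure from additivity.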
A correlation of 0.765 with the data was observed and 4/5 values were correctly classified as having either a positive or negative interaction energy. The incorrectly classified mutation was again N51I/S108N, where the interaction energy was predicted to be positive, but found experimentally to be negative, because the sum of the individual predictions was destabilising but the double mutation itself was predicted to have a stabilising impact.
Both the experimental data and our predictions found that the quadruple mutation had the largest magnitude interaction energy reflecting the greatest difference between the stabilising impact of the sum of the individual mutations and the destabilising impact of the quadruple mutation itself.
The triple mutation C59R,S108N,I164L was found to have a large negative interaction energy in both the data and our predictions, where the triple mutation was found to be destabilizing whilst the sum of the individual impacts was stabilising to pyrimethamine binding. We also observed a large negative interaction energy between S108N and C59R, where C59R is stabilizing in the wild-type background but destabilizing in the background of S108N, an example of sign epistasis and in agreement with the observations of both Sirawaraporn et al. (1997) and Lozovsky et al. (2009).
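Sign epistasis of the kind described for C59R can be checked by comparing the effect of a mutation in two backgrounds. The sketch below is illustrative only: the helper names are hypothetical, and the numbers were chosen purely to reproduce the qualitative pattern in the text, not taken from the predictions or from [9].

```python
def effect_in_background(ddg, mutation, background=()):
    """Change in ddG caused by adding `mutation` to `background`.

    All ddG values are relative to wild type, so the wild-type value is 0.
    """
    bg = frozenset(background)
    base = ddg[bg] if bg else 0.0
    return ddg[bg | {mutation}] - base

def shows_sign_epistasis(ddg, mutation, background):
    """True if the mutation's effect changes sign between wild type and background."""
    in_wild_type = effect_in_background(ddg, mutation)
    in_background = effect_in_background(ddg, mutation, background)
    return in_wild_type * in_background < 0

# Illustrative numbers only (positive = destabilising drug binding).
toy_ddg = {
    frozenset(["C59R"]): -0.3,
    frozenset(["S108N"]): 0.8,
    frozenset(["C59R", "S108N"]): 1.4,
}

flips = shows_sign_epistasis(toy_ddg, "C59R", ["S108N"])
```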
However, whilst the interaction energy of the triple mutation N51I,C59R,S108N was negative for both data and predictions, in our predictions its magnitude was much smaller compared to the data. Both single mutations N51I and C59R were predicted to be only marginally stabilising – almost neutral - to pyrimethamine binding, whilst in the experimental data both mutations have a large stabilising impact. Furthermore, the triple mutation was predicted to be only marginally more destabilising than single mutation S108N, resulting in the small negative interaction energy.
A thermodynamic evolutionary model can predict the most likely evolutionary trajectories to quadruple mutations in both PfDHFR and PvDHFR
We simulated the evolutionary trajectories to quadruple mutant N51I,C59R,S108N,I164L, using an evolutionary model, adapted from previous studies [42–44], in which selection acts to reduce the binding affinity between PfDHFR and pyrimethamine. Briefly, starting from the wild-type protein, we randomly sample a value from the Flex ddG distributions for each of the four single mutations and calculate the fitness of the mutated protein (Eq. 1), and the fixation probability (Eq. 2). A mutation is then chosen with a probability proportional to the fixation probability and this is repeated until the quadruple mutation is reached. We carried out 10,000 runs and determined the probability of each trajectory.
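The sampling procedure described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the model itself: Eq. 1 and Eq. 2 are not reproduced in this section, so the logistic `fitness` function, the use of Kimura's haploid fixation probability, the population size, the value of `kt` and the invented `toy_sample_ddg` distributions are all placeholders.

```python
import math
import random
from collections import Counter

MUTATIONS = ("N51I", "C59R", "S108N", "I164L")

# Invented single-mutant ddG means (positive = weaker drug binding).
TOY_SINGLE = {"N51I": -0.1, "C59R": -0.2, "S108N": 1.0, "I164L": -0.3}

def toy_sample_ddg(genotype, rng):
    """Stand-in for drawing one value from a genotype's Flex ddG distribution."""
    # Invented epistasis rule: the other mutations flip sign once S108N is present.
    flip = -1.0 if "S108N" in genotype else 1.0
    mean = sum(v if m == "S108N" else flip * v
               for m, v in TOY_SINGLE.items() if m in genotype)
    return rng.gauss(mean, 0.2)

def fitness(ddg, kt=0.6):
    """Assumed fitness: a logistic function that increases with ddG (not Eq. 1)."""
    return 1.0 / (1.0 + math.exp(-ddg / kt))

def fixation_probability(f_current, f_mutant, pop_size=1000):
    """Kimura's haploid fixation probability, used here as a stand-in for Eq. 2."""
    s = f_mutant / f_current - 1.0
    if abs(s) < 1e-12:
        return 1.0 / pop_size          # neutral limit
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:               # strongly deleterious: avoid overflow
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)

def simulate_trajectory(sample_ddg, rng):
    """Walk from wild type to the quadruple mutant; return the order of mutations."""
    genotype, f_current, path = frozenset(), fitness(0.0), []
    while len(genotype) < len(MUTATIONS):
        candidates = [m for m in MUTATIONS if m not in genotype]
        fits = [fitness(sample_ddg(genotype | {m}, rng)) for m in candidates]
        weights = [fixation_probability(f_current, f) for f in fits]
        if sum(weights) == 0.0:
            return None                # no accessible step from this genotype
        i = rng.choices(range(len(candidates)), weights=weights)[0]
        genotype, f_current = genotype | {candidates[i]}, fits[i]
        path.append(candidates[i])
    return tuple(path)

rng = random.Random(0)
runs = [simulate_trajectory(toy_sample_ddg, rng) for _ in range(2000)]
counts = Counter(p for p in runs if p is not None)
total = sum(counts.values())
path_probs = {path: n / total for path, n in counts.items()}
```

Resampling the ddG value at every step, rather than using a single summary value per genotype, lets the width and shape of each predicted distribution influence which steps are accessible.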
The two most likely trajectories were S108N/C59R/N51I/I164L and S108N/C59R/I164L/N51I, respectively (Figure 1a), corresponding to the two most likely pathways determined in [7]. The fourth most likely pathway predicted by our model, S108N/N51I/C59R/I164L, corresponds to the third most likely pathway predicted by [7].
Figure 1. a) Simulated evolutionary pathways to the highly resistant quadruple PfDHFR mutant N51I,C59R,S108N,I164L. Line thickness indicates the likelihood of a mutation at each step. Dotted lines indicate zero probability of a mutation at that step. The two most likely pathways are S108N/C59R/N51I/I164L and S108N/C59R/I164L/N51I, respectively. b) The frequency of the 16 possible combinations of mutations N51I, C59R, S108N and I164L in PfDHFR, including wild-type, observed in our isolate data.
We compared our predictions to the frequency of mutations observed in our isolate data (Figure 1b). In the isolate data, S108N was the most frequent single mutation, C59R,S108N the most frequent double mutation and N51I,C59R,S108N the most frequent triple mutation. This suggests the pathway proceeds in the order S108N/C59R/N51I/I164L, in agreement with the most likely pathway from both our evolutionary simulations and those using experimental data.
Single mutations C59R and N51I are the second and third most prevalent single mutations, respectively, in both our isolate data and our pathway predictions. I164L is absent as a single mutation from the isolate data and has a zero probability of being selected as the first step of our evolutionary trajectories. Whilst there is agreement between the model and the data regarding the most common double mutation (C59R,S108N), the agreement is less convincing for the remaining double mutations. For example, N51I,S108N is the second most frequent double mutation in the isolate data but the fourth most likely double mutation predicted by the model when summing over all pathways. There is better agreement for the triple mutations, where C59R,S108N,I164L is the second most frequent triple in the data and the second most likely triple in the simulations when summing over all possible pathways.
We also compared our pathway predictions to the distributions of mutations from isolates from different geographical areas. We grouped the isolate data into five geographical regions: South America (Brazil, Colombia and Peru), West Africa (Benin, Burkina Faso, Cameroon, Cape Verde, Cote d’Ivoire, Gabon, Gambia, Ghana, Guinea, Mali, Mauritania, Nigeria and Senegal), Central and East Africa (Congo, Eritrea, Ethiopia, Kenya, Madagascar, Malawi, Tanzania, Uganda), Southeast Asia (Bangladesh, Cambodia, Indonesia, Laos, Myanmar, Thailand and Vietnam) and Melanesia (Papua New Guinea).
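The likelihood of an intermediate double or triple mutation "when summing over all pathways" can be obtained from the pathway probabilities as in the sketch below. The function name is hypothetical and the three pathway probabilities are made-up numbers, not the simulation results.

```python
from collections import defaultdict

def intermediate_probabilities(path_probs):
    """Map step number -> {genotype: probability}, summing over all pathways.

    A pathway contributes its probability to the genotype formed by its
    first k mutations, for every k up to the pathway length.
    """
    by_step = defaultdict(lambda: defaultdict(float))
    for path, prob in path_probs.items():
        for k in range(1, len(path) + 1):
            by_step[k][frozenset(path[:k])] += prob
    return by_step

# Hypothetical pathway probabilities, for illustration only.
toy_paths = {
    ("S108N", "C59R", "N51I", "I164L"): 0.5,
    ("S108N", "C59R", "I164L", "N51I"): 0.3,
    ("S108N", "N51I", "C59R", "I164L"): 0.2,
}

marginals = intermediate_probabilities(toy_paths)
double_probs = marginals[2]   # probability of each double mutant
triple_probs = marginals[3]   # probability of each triple mutant
```

Ranking the entries of `double_probs` or `triple_probs` gives an ordering that can be compared directly with the observed frequencies of the corresponding alleles in isolate data.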
Examining the geographical distribution of PfDHFR mutations (Figure 2), we can see that some combinations of the aforementioned set of four resistance mutations are observed in high frequencies in most regions (See Supplementary data ‘PfDHFR_pcnt.csv’). By considering the frequencies of the possible combinations of the four mutations N51I, C59R, S108N and I164L, we can infer the likely evolutionary pathways to pyrimethamine resistance followed in each region and compare with our predicted pathways.
Figure 2. The geographical distribution of mutant PfDHFR alleles found in our isolate data. The size of each pie chart is proportional to the number of isolates from that particular country.
The evolution of pyrimethamine resistance in West Africa and Southeast Asia appears to have followed the two main pathways predicted here and in previous studies, S108N/C59R/N51I/I164L and/or S108N/C59R/I164L/N51I.
In West Africa, the most frequent mutations were single mutation S108N, double mutation C59R,S108N and triple mutation N51I,C59R,S108N, which were found in 1.2%, 8.1% and 67.0% of isolates, respectively. The quadruple mutation N51I,C59R,S108N,I164L, single mutations C59R and N51I, and double mutations N51I,C59R and N51I,S108N were each found in <1% of isolates. This suggests West Africa is following the main trajectory S108N/C59R/N51I/I164L. In Southeast Asia, single mutation S108N, double mutation C59R,S108N, triple mutation N51I,C59R,S108N and quadruple mutation N51I,C59R,S108N,I164L were observed in <1%, 8.2%, 34.2% and 52.5% of isolates, respectively. Additionally, double mutants N51I,C59R, N51I,S108N and S108N,I164L were each observed in <1% of isolates and triple mutants N51I,C59R,I164L and C59R,S108N,I164L were observed in <1% and 3.7% of isolates, respectively. This suggests the two most likely pathways to pyrimethamine resistance in this region are S108N/C59R/N51I/I164L and S108N/C59R/I164L/N51I.
However, not all regions follow the most likely trajectories predicted by our evolutionary model. The evolution of pyrimethamine resistance in Central and East Africa appears to have followed a slightly different pathway. The three most common mutations were double mutations N51I,S108N and C59R,S108N and triple mutation N51I,C59R,S108N, which were found in 12.1%, 2.9% and 82.3% of isolates, respectively. Single mutation S108N, double mutation N51I,C59R, triple mutation N51I,S108N,I164L and quadruple mutation N51I,C59R,S108N,I164L were each found in <1% of isolates. All other combinations of the four mutations were absent. This suggests the main trajectory in this region is S108N/N51I/C59R/I164L, in addition to a second less likely trajectory S108N/C59R/N51I/I164L.
The distribution of mutations in South America also suggests alternative routes to pyrimethamine resistance. The most frequent mutations were single mutation S108N, double mutation N51I,S108N and triple mutation N51I,S108N,I164L which accounted for 52.0%, 24.0% and 8.0% of the isolates, respectively. All other combinations of the four mutations were not observed in South America. This suggests this region is following the trajectory S108N/N51I/I164L to the triple mutation.
In Papua New Guinea, however, the only combinations of the four mutations observed were single mutation S108N and double mutation C59R,S108N, found in <1% and 19.3% of isolates, respectively. Triple mutation C59R,S108N,S306F was observed in the remaining 79.8% of isolates.
From the geographical distribution we can see that there are a small number of other mutations that appear at non-negligible frequencies. In South America, the single mutation A16V was observed in 2.0% of isolates and the triple mutation C50R,N51I,S108N is observed in 8.0% of isolates.
Next, we generated Flex ddG predictions of the change in binding free energy for each mutation in the combinatorically complete set of four mutations in PvDHFR (N50I, S58R, S117N and I173L), which are homologous to the four in P. falciparum. We used these predictions to simulate the evolutionary trajectories to the quadruple mutation (Figure 3a) and compared our results to those presented in Jiang et al. (2013) [8]. In [8] the wild-type allele is considered to have a Serine at codons 58 and 117, and we do the same for ease of comparison. The most likely first step in our model is S58R, which Flex ddG predicts to be the only single mutation to reduce the binding affinity when considering both the average and the peak of the distribution (Figure S1). Single mutation S117N reduces the binding affinity on average, but the peak of the distribution corresponds to a mild stabilizing effect (Figure S2).
Figure 3. a) Simulated evolutionary pathways to quadruple PvDHFR mutant N50I,S58R,S117N,I173L. Line thickness indicates the likelihood of a mutation at each step. Dotted lines indicate zero probability of a mutation at that step. The two most likely pathways are S58R/S117N/I173L/N50I and S58R/S117N/N50I/I173L, respectively. b) The frequency of the 16 possible combinations of mutations N50I, S58R, S117N and I173L in PvDHFR, including wild-type, observed in our isolate data.
The most likely pathway predicted by our simulations (S58R/S117N/I173L/N50I) corresponds to the second most likely pathway predicted by Jiang et al. (2013) [8] for the highest pyrimethamine concentration. Our second most likely pathway (S58R/S117N/N50I/I173L) corresponds to the first most likely pathway predicted in [8] for the highest pyrimethamine concentration.
We compared the frequency of the 16 possible combinations of mutations in the pathway to the frequency found in our P. vivax isolate data (Figure 3b). Mutations S117N and S58R were the top two most frequent single mutations in our isolate data, respectively. The double mutant S58R,S117N was the most frequent mutation of the set found in our isolate data and was the most likely double mutation in our simulations when considering all possible routes to each of the six possible double mutants. The only triple mutant observed in the data was S58R,S117N,I173L and this was the most likely of the four possible triple mutations in our simulations when considering all possible routes. The quadruple mutation N50I,S58R,S117N,I173L is not observed in our isolate data, and has not been observed in the literature either.
We grouped the data into five broad geographical regions: South America (Brazil, Colombia, Guyana, Panama, Peru), East Africa (Ethiopia, Eritrea, Madagascar, Sudan, Uganda), South Asia (Afghanistan, Bangladesh, India, Pakistan, Sri Lanka), Southeast Asia (Cambodia, China, Laos, Malaysia, Myanmar, Philippines, Thailand and Vietnam) and Melanesia (Papua New Guinea). If we consider the geographical distribution of mutations in PvDHFR (Figure 4), we can see that the dominant mutations differ between regions and that, compared to PfDHFR (Figure 2), there is a more diverse set of mutations that occur at non-negligible frequencies. As mentioned previously, the quadruple mutation is not observed in the isolate data and so we infer trajectories only up to triple mutant combinations of the four mutations N50I, S58R, S117N and I173L.
Figure 4. The geographical distribution of mutant PvDHFR alleles found in our isolate data. The size of each pie chart is proportional to the number of isolates from that particular country.
In South America, single mutations S117N and S58R and double mutation S58R,S117N were observed in 13.2%, 3.5% and 28.8% of isolates, respectively. Double mutation S117N,I173L and triple mutation S58R,S117N,I173L were both observed in <1% of isolates. This suggests the evolution to triple mutation S58R,S117N,I173L follows a main trajectory S117N/S58R/I173L and two less likely alternative trajectories S58R/S117N/I173L and S117N/I173L/S58R.
However, in East Africa, South Asia, Southeast Asia and Papua New Guinea, only S117N, S58R and S58R,S117N are observed out of the possible mutational combinations.
In addition to the set of four mutations studied here, we observed other PvDHFR mutations in our isolate data. The double mutation S58R,S117N is now so frequent in the population that both mutations are considered fixed and the allele containing the substitutions at those residues is considered the major allele [45]. Therefore, we also observe mutations N117T and R58K at low frequencies in our dataset as these are mutations that occur following the Asparagine substitution at position 117 and the Arginine substitution at position 58.
Two quadruple mutations, F57L,S58R,T61M,N117T and F57I,S58R,T61M,N117T, were also found at high frequency in our isolate data. Mutation F57L,S58R,T61M,N117T was found in 38.5% of isolates from Papua New Guinea and 15.1% of isolates from Southeast Asia, whilst F57I,S58R,T61M,N117T was found in 29.7% of isolates from Southeast Asia. Intermediates of these two quadruple mutations were also found in our isolate data. The high frequency of these quadruple mutations suggests they may result in a resistant phenotype, and future work could study the trajectories to these mutations.
Predicting the most likely evolutionary pathways to multiple resistant mutations in PfDHPS
The PfDHPS mutations Ser-436 to Ala (S436A), Ala-437 to Gly (A437G), Lys-540 to Glu (K540E) and Ala-581 to Gly (A581G) are highly prevalent in isolate data and have been shown to confer resistance to sulfadoxine, both independently and in combination. In our data, the quadruple mutation S436A,A437G,K540E,A581G is observed, along with high frequency intermediate single, double and triple mutant combinations.
We generated Flex ddG predictions of the change in binding free energy upon mutation for the combinatorically complete set of these four mutations. Triglia et al. (1997) [46] reported a 10-fold decrease in sulfadoxine binding affinity for single mutation A437G, but only a 4-fold decrease in binding affinity for A581G. Double mutations A437G/A581G and S436A/A437G were reported to result in a further 10-fold decrease in binding affinity, with a 70-fold decrease for A437G/K540E, relative to A437G alone. These results suggest these mutations are interacting epistatically.
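To relate a reported fold change in binding affinity to a free energy change, one common approach is ΔΔG = RT ln(fold change). The sketch below applies this relation under two assumptions that are not stated in the text: that the fold change corresponds to a ratio of binding or inhibition constants, and that the temperature is 298.15 K. It is a worked conversion for orientation, not necessarily the procedure used to prepare Table 2.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def fold_change_to_ddg(fold, temperature_k=298.15):
    """Return ddG in kcal/mol for a `fold`-fold loss of binding affinity.

    Positive values mean weaker (destabilised) binding. The default
    temperature is an assumption, not a value taken from the text.
    """
    return R_KCAL * temperature_k * math.log(fold)

ddg_10_fold = fold_change_to_ddg(10)  # e.g. the 10-fold decrease reported for A437G
ddg_4_fold = fold_change_to_ddg(4)    # e.g. the 4-fold decrease reported for A581G

# Fold changes reported relative to A437G are additive in ddG with the
# A437G value, because logarithms of multiplied ratios add.
ddg_a437g_k540e = ddg_10_fold + fold_change_to_ddg(70)
```

On this scale a 10-fold change corresponds to roughly 1.4 kcal/mol, so the differences between the reported mutations are of the order of 1 to 4 kcal/mol.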
We compared the peak of our predicted distributions for each mutation to the data presented in Triglia et al. (1997) [46] (Table 2). We observed Pearson’s correlation of 0.469 with 5/5 correctly classified as having a destabilizing impact. In the data, single mutation A437G was found to have a greater destabilizing impact than A581G. If we consider the average of the distributions, A437G was found to be around 50 times more destabilizing than A581G (< ΔΔGA437G >= 0.259, < ΔΔGA581G >= 0.005 (R.E.U)). However, if we consider the peak of the distributions, A437G has a smaller destabilizing impact than A581G ( (R.E.U)). If we analyze the overall distribution of A437G (Fig. S3) we can see that whilst the peak of the distribution occurs around small destabilizing changes in free energy and there is a long tail where many of the runs have predicted a much larger destabilizing impact, with a maximum of 1.58 (R.E.U), and only 2/150 runs predicted a stabilizing impact. The A581G distribution (Fig. S4), however, has a maximum destabilizing prediction of just 0.246, with almost half (65/150) of the runs predicting a stabilizing effect. This again highlights the need to consider the entire distribution of predictions for a specific mutation, instead of a summarizing statistic, when predicting the most likely evolutionary pathways.
We simulated the possible evolutionary trajectories to the known PfDHPS sulfadoxine-resistance quadruple mutation S436A,A437G,K540E,A581G (Figure 5a). The most likely first step was single mutation A437G, in agreement with previous studies which found A437G was the most frequent single mutation in sulfadoxine-resistant PfDHPS isolates [13, 47] and the data presented in [46]. Furthermore, A437G is now considered fixed in the population [48] and A437G is also the most common mutation in our isolate data, followed by S436A (Figure 5b). The most likely double mutation in our simulations was A437G/S436A, followed by A437G/K540E. This corresponds well to our isolate data, in which the two most frequent double mutations were A437G,K540E and S436A,A437G, respectively. The most likely triple mutation from our simulations was A437G/K540E/A581G followed by A437G/A581G/K540E and the two most frequent triple mutations in our isolate data were A437G,K540E,A581G and S436A,A437G,K540E, respectively. The most likely pathway to the quadruple mutation was A437G/S436A/A581G/K540E, followed by A437G/K540E/A581G/S436A. In our isolate data, the quadruple mutation is not very common and is only found in 1/6038 isolates.
Figure 5. a) Simulated evolutionary pathways to quadruple PfDHPS mutation S436A,A437G,K540E,A581G. Line thickness indicates the likelihood of a mutation at each step. Dotted lines indicate zero probability of a mutation at that step. The two most likely pathways are A437G/S436A/A581G/K540E and A437G/K540E/A581G/S436A. b) The frequency of the 16 possible combinations of mutations S436A, A437G, K540E and A581G in PfDHPS, including wild-type, observed in our isolate data.
We grouped the isolate data into five geographical regions: South America (Brazil, Colombia and Peru), West Africa (Benin, Burkina Faso, Cameroon, Cape Verde, Cote d’Ivoire, Gabon, Gambia, Ghana, Guinea, Mali, Mauritania, Nigeria and Senegal), East Africa (Congo, Eritrea, Ethiopia, Kenya, Madagascar, Malawi, Tanzania, Uganda), Southeast Asia (Bangladesh, Cambodia, Indonesia, Laos, Myanmar, Thailand and Vietnam) and Melanesia (Papua New Guinea).
In Southeast Asia, the most common mutations were single mutation A437G, double mutations S436A,A437G, A437G,K540E and A437G,A581G and triple mutations S436A,A437G,K540E and A437G,K540E,A581G, which were observed in 4.6%, 5.8%, 6%, 2.3%, 21.5% and 28.6% of isolates, respectively. Single mutations S436A and K540E were observed in 1.1% and <1% of isolates, respectively, and the quadruple mutation S436A,A437G,K540E,A581G was observed in <1% of isolates. This suggests three possible trajectories to the quadruple mutation, with A437G/K540E/A581G/S436A being the most likely pathway, followed by A437G/K540E/S436A/A581G and A437G/S436A/K540E/A581G as the second and third most likely pathways, respectively.
In South America, the most common mutations were single mutation A437G and double mutation A437G,A581G which were both found in 10.0% of isolates and triple mutation A437G,K540E,A581G which was observed in 16.0% of isolates. All other combinations of the four mutations were not observed in the South America isolates. This suggests the pathway followed to the triple mutation is A437G/A581G/K540E.
In West Africa, single mutations A437G and S436A were observed in 41.9% and 17.2% of isolates, respectively, double mutations S436A,A437G and A437G,K540E were observed in 18.0% and <1% of isolates, respectively, and triple mutation A437G,K540E,A581G was observed in <1% of isolates. This suggests separate evolutionary trajectories to double mutation S436A,A437G and triple mutation A437G,K540E,A581G. The main trajectory to S436A,A437G is A437G/S436A, with a second, slightly less likely trajectory S436A/A437G, whilst the trajectory to the triple mutation likely occurs via A437G/K540E/A581G.
In East Africa, the most common mutations were single mutation A437G, double mutation A437G,K540E and triple mutation A437G,K540E,A581G, which were observed in 24.8%, 57% and 7.3% of isolates, respectively. Single mutations S436A and K540E were observed in 1.5% and <1% of alleles, respectively, double mutations S436A,A437G and A437G,A581G were observed in 1.1% and <1% of alleles, respectively, and triple mutation S436A,A437G,K540E was observed in <1% of isolates. This suggests separate trajectories to triple mutants S436A,A437G,K540E and A437G,K540E,A581G. The main trajectory to S436A,A437G,K540E occurs via pathway A437G/K540E/S436A with a second less likely pathway A437G/S436A/K540E. The main trajectory to A437G,K540E,A581G occurs via pathway A437G/K540E/A581G.
In Papua New Guinea, most isolates (56.3%) were the wild-type allele. The most common mutations were single mutation A437G and double mutation A437G,K540E, which were observed in 5.9% and 29.4% of isolates, respectively. All other combinations of the four mutations were absent from the isolates. This suggests the trajectory to the double mutation follows the pathway A437G/K540E.
Predicting the most likely evolutionary pathways to resistant mutations in PvDHPS
Mutations in PvDHPS homologous to those in PfDHPS have been reported to confer resistance to sulfadoxine, such as S382F/A/C, A383G, K512E/M/T and A553G, which correspond to PfDHPS mutations S436F/A, A437G, K540E and A581G, respectively [16, 49]. It has been suggested that these homologous mutations, most notably A383G and A553G, confer sulfadoxine resistance in P. vivax [50]. Pornthanakasem et al. (2016) [51] reported a 30-fold decrease in sulfadoxine binding affinity upon single mutation A383G relative to the wild-type allele, a further almost 4-fold decrease for the double mutation A383G,A553G relative to A383G, and an almost 5-fold decrease for the triple mutation S382A,A383G,A553G relative to A383G.
The single mutation A383G, along with double mutant A383G,A553G and triple mutant S382A,A383G,A553G, have been observed in 90% of mutations in areas where malaria is endemic in Thailand [52]. It is also thought that, like A437G, A383G is likely to be the initial mutation and may be necessary for the appearance of subsequent mutations, as it is most often found in field isolates [49, 52]. In fact, A383G is now so frequent that it is considered fixed in the population, much like its homologous mutation A437G. In our isolate data, the wild-type allele contains a glycine at position 383; however, to test our method we chose Ala-383 as the wild-type residue, and simulated evolutionary trajectories including the A383G mutation.
In our isolate data, mutation Met-205 to Ile (M205I) occurs frequently in combination with A383G, A553G and S382A/C. M205I has been previously identified by other studies as a common polymorphism occurring in P. vivax [49, 51]; however, it is unknown if this mutation confers resistance to sulfadoxine as it occurs in the PPPK gene of the bifunctional enzyme PPPK-DHPS. It may be possible, however, that M205I interacts via long-range epistasis with the three other aforementioned PvDHPS sulfadoxine resistance mutations, and so we chose to predict the most likely evolutionary pathways to quadruple mutant M205I,S382A,A383G,A553G, which was observed at high frequency in our isolate data.
We generated Flex ddG predictions of the change in sulfadoxine binding free energy for the combinatorically complete set of the four aforementioned mutations (M205I, S382A, A383G, A553G) and also the single mutation V585A. We then compared the peak of the predicted distributions for single mutations A383G and V585A, double mutation A383G,A553G and triple mutation S382A,A383G,A553G to the data presented in [51], and observed a Pearson’s correlation of 0.94 with 3/4 of the mutations correctly classified as stabilizing or destabilizing (Table 3). Despite the high correlation, the predicted destabilizing effect on sulfadoxine binding free energy for the triple mutation S382A,A383G,A553G was less than that of both single mutation A383G and double mutation A383G,A553G. This is at odds with the data, where the triple mutation is the most destabilizing mutation of the set studied.
Comparing Flex ddG predictions, A383G was the most destabilizing single mutation compared to M205I, S382A and A553G, and was predicted to be 7-times more destabilizing than the second most destabilizing mutation, A553G. Double mutation A383G,A553G was the most destabilizing double mutation and triple mutation M205I,S382A,A553G was the most destabilizing triple mutation.
We determined the non-additivity between the impact of multiple mutations and the sum of their independent impacts to investigate epistasis between the mutations (see Supplementary data, ‘PvDHPS_epistasis.xlsx’). Only one multiple mutation, M205I,A553G, appears to be additive, whilst all other multiple mutations were predicted to be non-additive, suggesting epistasis. All multiple mutations except double mutation S382A,A553G, triple mutation S382A,A383G,A553G and quadruple mutation M205I,S382A,A383G,A553G were more destabilizing than expected. The largest non-additive interactions were for mutations M205I,S382A,A553G and M205I,S382A,A383G,A553G, where M205I,S382A,A553G was predicted to be 9-times more destabilizing than expected, and M205I,S382A,A383G,A553G was predicted to be almost 10-times less destabilizing than expected. Interestingly, quadruple mutation M205I,S382A,A383G,A553G was predicted to be less destabilizing than single mutation A383G alone.
Single mutation M205I occurs in the PPPK gene and is not in direct contact with sulfadoxine, and on its own Flex ddG predicts it to have a mostly neutral impact on sulfadoxine binding affinity. However, it is predicted to interact epistatically with all mutations except A553G and A383G,A553G, although there are no direct interactions between M205I and the other mutations, suggesting long-range epistasis.
We simulated possible evolutionary trajectories to the quadruple mutation M205I,S382A,A383G,A553G (Figure 7a). The most likely first step in the pathway is single mutation A383G, in agreement with the observation that A383G is now a major allele. The two most likely pathways to the quadruple mutation are A383G/M205I/S382A/A553G and A383G/M205I/A553G/S382A respectively.
Figure 6. The geographical distribution of mutant PfDHPS alleles found in our isolate data. The size of each pie chart is proportional to the number of isolates from that particular country.
Figure 7. a) Simulated evolutionary pathways to quadruple PvDHPS mutation M205I,S382A,A383G,A553G. Line thickness indicates the likelihood of a mutation at each step. Dotted lines indicate zero probability of a mutation at that step. The two most likely pathways are A383G/M205I/S382A/A553G and A383G/M205I/A553G/S382A. b) The frequency of the 16 possible combinations of mutations at sites M205I, S382A, A383G and A553G in PvDHPS, including wild-type, observed in our isolate data.
Analyzing the frequency of each mutation in our isolate data (Figure 7b), M205I is the most frequent single mutation and A383G is the second most frequent, whereas A383G and A553G are the two most likely single mutations in our simulations (Supplementary data ‘PvDHPS_pathway_probabilities.csv’). The most frequent double mutations in the isolate data and the most likely double mutations in our evolutionary simulations were M205I,A383G and A383G,A553G, when summing over all possible pathways. The most frequent triple mutation in the isolate data, M205I,A383G,A553G, was the second most likely triple mutation in our simulations, considering all pathways. The most likely triple mutation in our simulations, M205I,S382A,A383G, corresponds to the second most frequent triple mutation in the isolate data.
To compare our simulations to the geographical distribution of mutations, we grouped the data into five rough regions: South America (Brazil, Colombia, Guyana, Mexico, Panama, Peru), East Africa (Ethiopia, Eritrea, Madagascar, Sudan, Uganda), South Asia (Afghanistan, Bangladesh, India, Pakistan, Sri Lanka), Southeast Asia (Cambodia, China, Laos, Malaysia, Myanmar, Philippines, Thailand and Vietnam) and Melanesia (Papua New Guinea).
If we consider the geographical distribution of PvDHPS mutations (Figure 8), we can see that the dominant mutations differ between regions. In Southeast Asia, many combinations of the set of four mutations M205I, S382A, A383G and A553G are observed. Single mutations M205I, A383G and A553G were observed in 1.2%, <1% and 5.9% of isolates, respectively. Double mutations M205I,A383G, S382A,A383G and A383G,A553G were observed in 28.2%, <1% and <1% of isolates, respectively. Triple mutations M205I,S382A,A383G, M205I,A383G,A553G and S382A,A383G,A553G were observed in 1.2%, 26.1% and <1% of isolates, respectively, and the quadruple mutation was observed in 11.3% of isolates. This suggests the evolutionary trajectory to the quadruple mutation occurs mainly via pathway M205I/A383G/A553G/S382A, with a possible alternative pathway A383G/M205I/A553G/S382A.
Figure 8. The geographical distribution of mutant PvDHPS alleles found in our isolate data. The size of each pie chart is proportional to the number of isolates from that particular country.
However, only a small number of the possible combinations of the four mutations were observed in South Asia, South America, East Africa and Papua New Guinea. In South Asia, mutations A383G, A553G and A383G,A553G were observed in 2.6%, 1.8% and 12.3% of isolates, respectively. In South America, mutations A383G, M205I and M205I,A383G were observed in 6.5%, 15.5% and 26.4%, respectively. In East Africa, mutations A383G and M205I were observed in 3.6% and 5.9% of isolates, respectively, and in Papua New Guinea, single mutation M205I was observed in 15.4% of isolates.
Other frequent multiple mutations include M205I,S382C,A383G which was found in 11.2% of South America isolates, E142G,M205I and E142G,M205I,A383G,A647V which were found in 36.9% and 21.4% of East African isolates, respectively and E132G,A383G,A553G which was found in 8.9% of Southeast Asian isolates.
Discussion
We have presented a method for predicting the most likely evolutionary trajectories to multiple mutants, utilizing the Rosetta Flex ddG protocol and a thermodynamic evolutionary model. The most likely pathways predicted by our model to the pyrimethamine-resistant quadruple PfDHFR mutant correspond well to those predicted by Lozovsky et al. (2009) [7]. They used experimentally determined IC50 values of PfDHFR pyrimethamine binding for the combinatorically complete set of the four PfDHFR mutations (N51I, C59R, S108N, I164L), combined with information regarding the nucleotide bias of the P. falciparum genome, to simulate the evolutionary trajectories. The three most likely pathways based on experimental IC50 values were found in the top four most likely pathways based on our simulations using predictions of binding free energy.
We also simulated the most likely evolutionary trajectories to the PvDHFR quadruple mutation N50I,S58R,S117N,I173L and compared our results to those of Jiang et al. (2013) [8]. They considered the relative growth rates of the different alleles at different drug concentrations when simulating evolutionary trajectories, which incorporate both change in pyrimethamine binding affinity (Ki) and catalytic activity (kcat). Our top two most likely pathways correspond to their top two most likely pathways for the highest pyrimethamine concentration they consider, albeit in reverse order. At high pyrimethamine concentrations, it is likely that alleles which significantly reduce binding affinity will be selectively favoured even if there is a slight reduction in catalytic activity. This may be why our predictions agree well with their predictions for high pyrimethamine concentration, but not for low-to-middle pyrimethamine concentrations: even though ligand concentration is included in our equation for protein fitness (Eq. 1), our model cannot account for adaptive conflict between Ki and kcat.
This highlights a limitation of our method as it only accounts for changes in binding affinity and does not account for changes in protein function. As previously mentioned, DHFR catalyzes the reduction of substrate DHF via oxidation of cofactor NADPH. Therefore, in the case of the DHFR enzyme, a future iteration of the model could include the impact resistance mutations have on binding of these two ligands, as a proxy for changes to enzyme function. However, this would require a much more complex model of protein fitness and would be much more computationally expensive.
Mutations occurring at a drug-binding site may also reduce the protein’s thermodynamic stability [53] and therefore may not be selected for, even if they improve the resistance phenotype. Therefore, our model may also be improved by including selection for mutations that do not reduce thermodynamic stability relative to the wild-type enzyme. There are several computational methods to predict changes in protein stability upon mutation, including mCSM [54], Rosetta Cartesian ddG [55] and FoldX [56]. However, it must also be noted that most proteins are marginally stable [57–59], a property which may have evolved either as an evolutionary spandrel [60, 61] (a characteristic that arises as a result of non-adaptive processes which is then used for adaptive purposes [62]) or due to selection for increased flexibility to improve certain functionalities [63, 64]. Therefore, the model would also have to account for the fact that a resistance mutation that increases protein stability relative to the wild-type stability may also result in a reduction in fitness.
The quadruple PvDHFR mutation (N50I,S58R,S117N,I173L) has not been noted in the literature or in our clinical isolate data. This may be because P. vivax is only exposed to pyrimethamine when present in co-infections with P. falciparum and so is not under continued selection for pyrimethamine resistance. Furthermore, the quadruple mutation only reached fixation in the evolutionary simulations described in [8] for the highest pyrimethamine concentrations, therefore clinical dosages of pyrimethamine may not be high enough to select for the quadruple mutation. Additionally, the quadruple mutation may result in a fitness impairment that requires compensatory mutations.
Despite its limitations, the evolutionary trajectories predicted using our method agree well with experimentally predicted trajectories for both PfDHFR and PvDHFR, and so we applied our method to predict the evolutionary trajectories to quadruple mutants in both PfDHPS and PvDHPS that may confer resistance to sulfadoxine. Evolutionary trajectories to multiple resistance mutations in these enzymes have not yet been investigated fully, either experimentally or computationally, although attempts have been made to infer the first step in sulfadoxine resistance. Therefore, our work presents the first attempt to predict evolutionary trajectories in these enzymes.
As we can see from the geographical distributions of mutations in PfDHPS (Figure 6) and PvDHPS (Figure 8), there is much more variation between different regions than was observed for the DHFR gene from both species (Figure 2 and Figure 4 for PfDHFR and PvDHFR, respectively). For both species, the quadruple mutations to which we simulated evolutionary trajectories are only observed in Southeast Asia, a region of high drug pressure. Furthermore, whilst we have only considered the trajectories for one multiple mutant for each gene, it may be beneficial to consider one per region for the DHPS gene from both species.
Oguike et al. (2016) [65] reported the emergence of a further PfDHPS sulfadoxine resistance mutation Ile-431 to Val (I431V) in Nigeria and found it occurred most commonly as a quintuple mutant with S436A, A437G, A581G and A613S. This quintuple mutation is also found frequently in our isolate data from West Africa, along with single, double, triple and quadruple combinations of the mutations involved. Therefore, it may also be useful to study the evolutionary trajectories to this quintuple mutation.
We might also consider quadruple PvDHPS mutation E142G,M205I,A383G,A647V, which is highly prevalent in East Africa. However, the mutations at codons 142 and 205 occur in the PvPPPK gene of the bifunctional enzyme PPPK-DHPS and are not directly involved in sulfadoxine binding. Therefore, this mutation may not result in a resistant phenotype.
We also inferred evolutionary pathways from our clinical isolate data from different geographical regions for each drug-target combination studied here. This analysis suggested that different regions often follow different evolutionary trajectories and that the most likely evolutionary trajectories predicted by our model are not always the most prevalent.
Geographical differences in the distribution of resistant alleles may be the result of drug regimens and gene flow in parasite populations. If we consider the geographical distribution of PfDHFR mutations (Figure 2), we can see that triple mutant N51I,C59R,S108N is common in West and East Africa as well as Southeast Asia, whereas the quadruple mutation N51I,C59R,S108N,I164L is only observed at high frequency in Southeast Asia. Combination drug SP was first used in 1967 to treat P. falciparum in Southeast Asia, and resistance was first noted that same year on the Thai-Cambodia and Thai-Myanmar borders [66]. In Africa, SP was first used in the 1980s, with resistance occurring later that decade. However, analysis of PfDHFR genotypes and microsatellite haplotypes surrounding the DHFR gene in Southeast Asia and Africa suggest a single resistant lineage that appeared in Southeast Asia accumulated multiple mutations, including the triple N51I,C59R,S108N [67–69], migrated to Africa and spread throughout the continent [70–72]. Variation in the frequency of PfDHFR (and PfDHPS) mutants across Africa occurs because of differences in the timing of chloroquine withdrawal and introduction of SP, as well as continued use of SP for intermittent preventive treatment (IPTp) in pregnant women residing in areas of moderate to high malaria transmission intensity [48, 73].
Pyrimethamine resistance increased in West Papua in the early 1960s following the introduction of mass drug administration [74]. In our data, the distribution in Papua New Guinea is made up mainly by the double mutant C59R,S108N and triple mutant C59R,S108N,S306F. Microsatellite haplotype analysis suggests C59R,S108N in Melanesia has two lineages, one of which originated in Southeast Asia whilst the other evolved indigenously [69].
Pyrimethamine resistance in South America looks surprisingly different from the distributions in Africa and Southeast Asia. SP was introduced in South America and low-level resistance was first noted in Colombia in 1981 [75]. Microsatellite haplotype analysis suggests pyrimethamine resistance evolved indigenously in South America, with at least two distinct lineages detected. A triple mutant lineage (C50I,N51I,S108N) was identified in Venezuela that possibly evolved from double mutant N51I,S108N [71]. A second triple mutant lineage (N51I,S108N,I164L) was identified in Peru and Bolivia which also possibly evolved from a distinct double mutant (N51I,S108N) lineage [76].
The distribution of PvDHFR pyrimethamine-resistance mutations (Figure 4) is much more diverse than the PfDHFR distribution (Figure 2), despite the lower number of PvDHFR isolates. However, the opposite is true for the DHPS gene, with a greater diversity of mutations found in the P. falciparum isolates.
In general, the PvDHFR gene is much more polymorphic than PfDHFR gene, with over 20 alleles observed in a limited geographical sampling [25], whereas fewer PfDHFR alleles have been observed despite much more extensive surveillance with non-synonymous changes and insertions/deletions occurring rarely [77]. It also appears that the origin of PvDHFR pyrimethamine resistance mutation is much more diverse than PfDHFR. Hawkins et al. (2008) [78] investigated isolates from Colombia, India, Indonesia, Papua New Guinea, Sri Lanka, Thailand and Vanuatu. They found multiple origins of the double PvDHFR mutant 58R,117N and three independent origins of triple mutant 58R,61M,117T and quadruple mutants 57I,61M,117T,173F and 57L,58R,61M,117T in Thailand, Indonesia and Papua New Guinea/Vanuatu. Shaukat et al. (2021) [79] assessed the evolutionary origin of PvDHFR pyrimethamine resistance mutations in Punjab, Pakistan and found multiple origins of single mutation S117N and a common origin of double mutant 58R,S117N and triple mutants 57L,58R,117N, 58R,61M,117N and 58R,118N,I173L. This is in contrast to the evolutionary origin of pyrimethamine resistance in PfDHFR, where mutations in Africa shared a common origin with a resistance lineage from Asia.
The distribution of PfDHPS sulfadoxine resistance mutations is strikingly different across different geographical regions (Figure 6). Microsatellite haplotype analysis of resistant alleles revealed single, double and triple sulfadoxine resistance mutations arose independently in Cambodia, Kenya, Cameroon and Venezuela [71, 80]. Microsatellite analysis of PfDHPS in Africa indicated multiple origins of single mutant (A437G) and double mutant (S436A,A437G and A437G,K540E) and different resistance lineages when comparing east and west Africa, resulting in differing distributions of resistance mutations [81].
Similarly, multiple origins of PvDHPS resistance mutations at codons 382, 383 and 553 were observed in Cambodia, Indonesia, Sri Lanka and Thailand [49], which may account for the differences in the distribution of mutations between geographical regions (Figure 8).
This highlights the need to distinguish between geographical regions and account for existing resistance alleles within that region and trace their lineages when attempting to predict the next step in evolutionary trajectories to highly resistant multiple mutants. Given the current dominant resistance allele from a specific region, our method could be used to predict the most likely next steps from a subset of likely mutations.
We have presented a new computational method for predicting the most likely evolutionary trajectories that has demonstrated good agreement with trajectories predicted experimentally and has the advantage of being much quicker and more cost-effective. This method can be applied to any system in which a drug binds to a target molecule, provided a structure of the complex exists or can be produced via structural modelling. Given the threat antimicrobial resistance poses, methods to accurately and efficiently predict future trajectories are vital and can inform treatment strategies and aid drug development.
Materials and Methods
Homology modelling
Homology modelling was carried out in Modeller [82] to produce complete structures of the target proteins bound to their drug molecules.
Several crystal structures of PfDHFR exist in the Protein Data Bank (PDB). The entry 3QGT provides the crystal structure of wild-type PfDHFR complexed with NADPH, dUMP and pyrimethamine, however residues in the ranges 86-95 and 232-282 are missing from the structure. Homology modelling was used to complete the structure using a second wild-type PfDHFR structure PDB entry 1J3I along with a wild-type PvDHFR structure PDB entry 2BLB.
To produce a complete structure of PvDHFR, PDB entry 2BLB was used as a template, which provides the X-ray crystal structure of wild-type P. vivax DHFR in complex with pyrimethamine. This structure was only missing a loop section between residues 87-105 and so Modeller was used to build this missing loop.
To produce a complete structure of PfDHPS, PDB entry 6JWX was used as a template, which provides the crystal structure of wild-type PfDHPS-HPPK in complex with sulfadoxine. This structure is missing a number of loop sections, so Modeller was used to build these loops using additional wild-type PfDHPS template 6JWR and P. vivax HPPK-DHPS template 5Z79.
To produce a complete structure of PvDHPS, PDB entry 5Z79 was used as a template, which provides the crystal structure of wild-type P. vivax HPPK-DHPS. This structure does not include sulfadoxine, however, and is missing several important loop sections. Modeller was used to build the structure complete with SDX-DHP and completed loops using additional templates 6JWX and 6JWA.
Flex ddG binding free energy predictions
The Rosetta Flex ddG protocol was used to estimate the change in binding free energy upon mutation, ΔΔG = ΔGmut – ΔGWT, for each step in all possible mutational trajectories for a set of stepwise resistance mutations (see supplementary data ‘Flex_ddG’ folder for example Rosetta script, example resfile and example command line; the protein-ligand structure files and ligand parameter files can be found in the folders named for the specific targets). To predict the change in binding free energy for a single or multiple mutation, we used the structure of the target protein with the drug molecule bound as input to Flex ddG and ran the protocol 150 times per mutation to produce a distribution of predictions of the change in the free energy of binding. We then found the peak of the distribution to produce a single estimate of the change in the binding free energy for the mutation, denoted ΔΔGX for mutation X.
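The text does not state how the peak of each 150-run distribution was located. One plausible approach, sketched here as an assumption rather than the procedure actually used, is to take the maximum of a Gaussian kernel density estimate. The function name `kde_peak`, the normal-reference bandwidth rule and the synthetic `toy_runs` data are all hypothetical illustrations.

```python
import math
import random
import statistics


def kde_peak(samples, grid_points=512):
    """Return the location of the maximum of a Gaussian kernel density estimate."""
    n = len(samples)
    # Normal-reference rule-of-thumb bandwidth (an assumption, not taken from the paper).
    bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    if bandwidth == 0.0:
        raise ValueError("all samples are identical; the peak is that value")
    lo, hi = min(samples), max(samples)
    best_x, best_density = lo, -1.0
    # Evaluate the (unnormalised) density on a regular grid and keep the maximum.
    for k in range(grid_points):
        x = lo + (hi - lo) * k / (grid_points - 1)
        density = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


# Synthetic stand-in for 150 Flex ddG runs: a narrow main mode plus a long
# destabilizing tail. These numbers are invented for illustration only.
rng = random.Random(0)
toy_runs = [rng.gauss(0.05, 0.03) for _ in range(120)] + [
    rng.uniform(0.3, 1.5) for _ in range(30)
]
toy_peak = kde_peak(toy_runs)
toy_mean = statistics.mean(toy_runs)
print(f"peak = {toy_peak:.3f}, mean = {toy_mean:.3f}")
```

With a long-tailed distribution of this kind the mean sits well above the peak, which is the same qualitative gap between the two summaries discussed for A437G in the Results.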
To predict the stepwise evolutionary trajectories, we must consider the interactions between the mutations in the pathway. The interaction energy (or epistasis) in the binding free energy between two mutations X and Y, can be written εXY = ΔΔGX,Y – (ΔΔGX + ΔΔGY). This quantifies by how much the change in binding free energy of the double mutant X, Y deviates from additivity of the single mutants, where each are calculated with respect to the wild-type. Therefore, the change in binding free energy when mutation Y occurs in the background of mutation X can be written ΔΔGX/Y = ΔΔGX,Y – ΔΔGX, where ΔΔGX/Y = ΔΔGY + εXY.
For a third mutation, Z, occurring in the background of double mutation X,Y, the interaction energy between Z and X,Y is εXY,Z = ΔΔGX,Y,Z – (ΔΔGX,Y + ΔΔGZ). The quantity εXY,Z is not the same as the third order epistasis between mutations X, Y, and Z, or the interaction energy εXYZ = ΔΔGX,Y,Z – (ΔΔGX + ΔΔGY + ΔΔGZ), as it does not account for the interaction between X and Y; rather, it only quantifies the interaction between Z and the two mutations X and Y.
Therefore, the change in binding free energy when mutation Z occurs in the background of double mutant X,Y can be calculated as ΔΔGX,Y/Z = ΔΔGX,Y,Z – ΔΔGX,Y, where ΔΔGX,Y/Z = ΔΔGZ + εXY,Z.
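The definitions above can be checked with a few lines of arithmetic. The sketch below implements the interaction energies and the background-dependent changes exactly as written in the text. The function names and the numerical values are hypothetical and chosen only to make the identities easy to verify.

```python
import math


def interaction_energy(ddg_combined, *ddg_parts):
    """epsilon = ddG of the combined mutant minus the sum of the ddGs of its parts."""
    return ddg_combined - sum(ddg_parts)


def step_ddg(ddg_background, ddg_combined):
    """Change in binding free energy when a mutation arises on an existing background."""
    return ddg_combined - ddg_background


# Hypothetical peak values (arbitrary units), each relative to wild-type.
ddg_x, ddg_y, ddg_z = 0.30, 0.10, 0.05
ddg_xy, ddg_xyz = 0.55, 0.50

eps_xy = interaction_energy(ddg_xy, ddg_x, ddg_y)            # pairwise epistasis
eps_xy_z = interaction_energy(ddg_xyz, ddg_xy, ddg_z)        # Z against the X,Y background
eps_xyz = interaction_energy(ddg_xyz, ddg_x, ddg_y, ddg_z)   # deviation from full additivity

ddg_x_then_y = step_ddg(ddg_x, ddg_xy)     # ddG_{X/Y}
ddg_xy_then_z = step_ddg(ddg_xy, ddg_xyz)  # ddG_{X,Y/Z}

# The step changes equal the single-mutant value plus the relevant interaction energy.
assert math.isclose(ddg_x_then_y, ddg_y + eps_xy)
assert math.isclose(ddg_xy_then_z, ddg_z + eps_xy_z)
print(eps_xy, eps_xy_z, eps_xyz)
```

In this toy case εXY,Z is negative while εXYZ is positive, which illustrates why the two quantities should not be used interchangeably.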
To estimate the change in binding free energy when mutation Y occurs in the background of mutation X, $\Delta\Delta G_{X/Y}$ for stepwise pathway X/Y, we subtracted the predictions for the first mutation X, $\{\Delta\Delta G_{X}^{i}\}$, from the predictions for the double mutation X,Y, $\{\Delta\Delta G_{X,Y}^{i}\}$, to create a set of 150 ‘predictions’ for the change in binding free energy when Y occurs in the background of X, i.e. $\Delta\Delta G_{X/Y}^{i} = \Delta\Delta G_{X,Y}^{i} - \Delta\Delta G_{X}^{i}$ for $i = \{1, \dots, 150\}$. To estimate the change in binding free energy when mutation Z occurs in the background of mutations X and Y we calculated $\Delta\Delta G_{X,Y/Z}^{i} = \Delta\Delta G_{X,Y,Z}^{i} - \Delta\Delta G_{X,Y}^{i}$. We applied a similar method for the quadruple mutations, so that we had a set of ‘predictions’ for each step in the possible evolutionary trajectories.
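The run-by-run subtraction described above can be sketched as follows. The pairing of run i of the background with run i of the combined mutant follows the index i = 1, …, 150 in the text. The function name `stepwise_runs` and the synthetic run values are hypothetical, and real inputs would be the per-run Flex ddG outputs.

```python
import random


def stepwise_runs(background_runs, combined_runs):
    """Pair run i of the background with run i of the combined mutant and subtract."""
    if len(background_runs) != len(combined_runs):
        raise ValueError("run counts must match so that run i can be paired with run i")
    return [combined - background
            for background, combined in zip(background_runs, combined_runs)]


# Synthetic stand-ins for 150 Flex ddG runs per genotype (invented values).
N_RUNS = 150
rng = random.Random(1)
ddg_x = [rng.gauss(0.25, 0.20) for _ in range(N_RUNS)]    # single mutant X
ddg_xy = [rng.gauss(0.60, 0.25) for _ in range(N_RUNS)]   # double mutant X,Y
ddg_xyz = [rng.gauss(0.50, 0.25) for _ in range(N_RUNS)]  # triple mutant X,Y,Z

ddg_x_then_y = stepwise_runs(ddg_x, ddg_xy)     # Y arising on background X
ddg_xy_then_z = stepwise_runs(ddg_xy, ddg_xyz)  # Z arising on background X,Y
print(len(ddg_x_then_y), len(ddg_xy_then_z))
```

The same helper extends to the fourth step by passing the triple-mutant and quadruple-mutant runs.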
Simulating Evolutionary Trajectories
To predict the most likely evolutionary trajectories to reach a quadruple mutant we used a model based in thermodynamics and statistical mechanics where the fitness of a protein is determined by the probability it would not be bound to a ligand, $P_{\mathrm{unbound}}$. We consider a two-state system in which the protein can either be bound or unbound and do not explicitly account for whether the protein is folded or unfolded in either the bound or unbound state. For ligand concentration [L] it can be shown that the probability a protein is unbound is
$$P_{\mathrm{unbound}} = \frac{1}{1 + [L]/K_d}$$
where $K_d$ is the protein-ligand dissociation constant and can be calculated as $K_d = c_0 e^{\Delta G/kT}$, where $c_0$ is a reference ligand concentration (set here arbitrarily to 0.6 M), $\Delta G$ is the protein-ligand binding free energy, $k$ is the Boltzmann constant and $T$ is the temperature in Kelvin.
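A minimal sketch of this fitness function follows, using the standard two-state result that the unbound fraction is Kd / (Kd + [L]). The value of kT (0.593 kcal/mol, roughly 298 K), the example binding free energy and the function names are assumptions made for illustration. Flex ddG reports Rosetta Energy Units, so treating them on a kcal/mol scale is itself an approximation.

```python
import math

KT = 0.593  # kcal/mol near 298 K (assumed; the text does not give a temperature)
C0 = 0.6    # M, the reference concentration quoted in the text


def dissociation_constant(delta_g, kt=KT, c0=C0):
    """K_d = c0 * exp(dG / kT); a more negative dG means tighter binding."""
    return c0 * math.exp(delta_g / kt)


def p_unbound(delta_g, ligand_conc, kt=KT, c0=C0):
    """Probability that the protein is not bound to the ligand (two-state model)."""
    kd = dissociation_constant(delta_g, kt, c0)
    return kd / (kd + ligand_conc)


# Hypothetical wild-type binding free energy and a destabilizing mutation.
dg_wt = -8.0            # invented value
dg_mut = dg_wt + 1.0    # ddG = +1.0 weakens drug binding
drug_conc = 1e-6        # M, invented value

fitness_wt = p_unbound(dg_wt, drug_conc)
fitness_mut = p_unbound(dg_mut, drug_conc)
print(f"wild-type fitness = {fitness_wt:.3f}, mutant fitness = {fitness_mut:.3f}")
```

A mutation that weakens drug binding raises Kd and therefore raises the unbound fraction, which is why destabilizing changes in binding free energy are favoured in this model.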
Starting from the wild-type protein, with binding free energy ΔGWT and fitness $P_{\mathrm{unbound}}^{WT}$, we extract one sample i from the 150 values of the predicted binding affinity changes for the single mutations to determine the binding free energy after mutation X,
$\Delta G_{X} = \Delta G_{WT} + \Delta\Delta G_{X}^{i}$, and calculate the fitness of each single mutant protein, $P_{\mathrm{unbound}}^{X}$
. We can calculate the probability that the mutation will fix in the population using the Kimura fixation probability
$$P_{\mathrm{fix}} = \frac{1 - e^{-2s}}{1 - e^{-2N_e s}}$$
where Ne is the effective population size and s is the selection coefficient
$$s = \frac{P_{\mathrm{unbound}}^{X} - P_{\mathrm{unbound}}^{WT}}{P_{\mathrm{unbound}}^{WT}}.$$
We also took into account the mutational bias of Plasmodium falciparum using the nucleotide mutation matrix calculated in [7]. The probabilities of fixation for each mutation were normalised by the sum of the probabilities of fixation for all possible mutations at that step in the trajectory. A mutation is then chosen with a probability proportional to this normalised probability of fixation.
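The fixation and normalisation step might be sketched as below. Several choices here are assumptions rather than details confirmed by the text: the haploid form of the Kimura formula, the selection coefficient defined as the relative fitness difference, and the mutational bias entering as a multiplicative weight on each fixation probability. The function names and the fitness values in the example are hypothetical.

```python
import math


def selection_coefficient(fitness_mutant, fitness_current):
    """Relative fitness difference (assumed definition of s)."""
    return (fitness_mutant - fitness_current) / fitness_current


def kimura_fixation(s, n_e):
    """Kimura fixation probability, haploid form (assumed):
    (1 - exp(-2s)) / (1 - exp(-2 * N_e * s))."""
    if abs(s) < 1e-12:
        return 1.0 / n_e              # neutral limit of the expression
    exponent = -2.0 * n_e * s
    if exponent > 700.0:
        return 0.0                    # strongly deleterious: avoid overflow
    # expm1 keeps precision for small |s|; the two minus signs cancel.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def step_probabilities(fitness_current, candidate_fitness, n_e, bias=None):
    """Normalised probability of each candidate mutation being the next step.

    bias maps mutation name -> relative mutation rate (assumed multiplicative).
    """
    weights = {}
    for name, fitness in candidate_fitness.items():
        s = selection_coefficient(fitness, fitness_current)
        rate = 1.0 if bias is None else bias[name]
        weights[name] = rate * kimura_fixation(s, n_e)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no candidate mutation has a non-zero weight")
    return {name: w / total for name, w in weights.items()}


# Illustrative fitness values only.
probs = step_probabilities(0.5, {"beneficial": 0.55, "deleterious": 0.45}, n_e=1000)
print(probs)
```

With a large effective population size, deleterious steps receive a vanishing share of the probability, so the choice is dominated by the beneficial candidates and their relative mutation rates.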
Once a single mutation is chosen, the binding free energy is set to $\Delta G_{X}$ of the chosen mutation, and a value is sampled from the distribution of each of the possible next steps X/Y in the trajectory, i.e. $\Delta G_{X,Y} = \Delta G_{X} + \Delta\Delta G_{X/Y}^{i}$. This continues until the end of the trajectory is reached.
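Putting the steps together, one simulated trajectory could look like the sketch below. It is a simplified illustration, not the authors' implementation: it assumes kcal/mol units at 298.15 K, the haploid Kimura form, a relative-difference selection coefficient, and no mutational bias; `toy_sample_step` is a hypothetical stand-in that draws a Gaussian value instead of sampling one of the 150 stepwise predictions, and the mutation labels A to D are placeholders.

```python
import math
import random

KT = 0.0019872041 * 298.15  # kcal/mol; temperature is an assumed value
C0 = 0.6                    # M; reference concentration from the text


def p_unbound(delta_g, ligand_conc):
    k_d = C0 * math.exp(delta_g / KT)
    return k_d / (k_d + ligand_conc)


def kimura_fixation(s, n_e):
    if abs(s) < 1e-12:
        return 1.0 / n_e
    exponent = -2.0 * n_e * s
    if exponent > 700.0:
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def simulate_trajectory(dg_wt, mutations, sample_step, ligand_conc, n_e, rng):
    """Return the order in which mutations fix in one simulated trajectory.

    sample_step(background, mutation, rng) must return one sampled stepwise
    ddG for `mutation` occurring in the given background (a sorted tuple).
    """
    background, remaining, path = (), list(mutations), []
    delta_g = dg_wt
    while remaining:
        fitness = p_unbound(delta_g, ligand_conc)
        new_dg, weights = {}, []
        for m in remaining:
            new_dg[m] = delta_g + sample_step(background, m, rng)
            s = (p_unbound(new_dg[m], ligand_conc) - fitness) / fitness
            weights.append(kimura_fixation(s, n_e))
        if sum(weights) == 0.0:
            break  # sketch-specific choice: stop if no step can fix
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        delta_g = new_dg[chosen]
        background = tuple(sorted(background + (chosen,)))
        remaining.remove(chosen)
        path.append(chosen)
    return path


def toy_sample_step(background, mutation, rng):
    """Hypothetical stepwise ddG: Gaussian noise around made-up means."""
    means = {"A": 1.0, "B": 0.8, "C": 1.2, "D": 0.9}
    return rng.gauss(means[mutation] + 0.1 * len(background), 0.2)


rng = random.Random(1)
mutations = ["A", "B", "C", "D"]
path = simulate_trajectory(-10.0, mutations, toy_sample_step, 1e-6, 10**6, rng)
print(" -> ".join(path))
```

Repeating the call many times and counting how often each ordering appears would give an estimate of the relative probability of each pathway under these assumptions.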
SNP data
Plasmodium falciparum and vivax SNP data were sourced from recent studies [45, 48]. In those studies, paired Illumina raw sequence data was mapped to the Pf3D7 (P. falciparum) or PVP01 (P. vivax) reference genome using bwa-mem software (default parameters). SNPs were called using the samtools and GATK software suites. Those SNPs occurring in non-unique, low quality or low coverage regions were discarded, and those in the candidate genes analysed here were extracted.
Acknowledgments
R.C.E and N.F. are funded by the Medical Research Council UK (Grant no. MR/T000171/1). T.G.C is funded by the Medical Research Council UK (Grant no. MR/M01360X/1, MR/N010469/1, MR/R025576/1, and MR/R020973/1) and BBSRC (Grant no. BB/R013063/1). S.C is funded by Medical Research Council UK grants (MR/M01360X/1, MR/R025576/1, and MR/R020973/1) and Bloomsbury SET. EM is funded by a Newton Institutional Links Grant (British Council, no. 261868591). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. R.C.E would like to thank Tanushree Tunstall for her helpful discussions.
Footnotes
Competing Interest Statement: None.