Abstract
The SARS-CoV-2 main protease (Mpro) is essential to viral replication and cleaves highly specific substrate sequences, making it an obvious target for inhibitor design. However, as for any virus, SARS-CoV-2 is subject to constant selection pressure, with new Mpro mutations arising over time. Identification and structural characterization of Mpro variants is thus critical for robust inhibitor design. Here we report sequence analysis, structure predictions, and molecular modeling for seventy-nine Mpro variants, constituting all clinically observed mutations in this protein as of April 29, 2020. Residue substitution is widely distributed, with some tendency toward larger and more hydrophobic residues. Modeling and protein structure network analysis suggest differences in cohesion and active site flexibility, revealing patterns in viral evolution that have relevance for drug discovery.
Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in late 2019 (1) and rapidly spread worldwide, causing an ongoing pandemic. Although the sequence of its RNA genome is highly similar to that of SARS-CoV-1, SARS-CoV-2 is believed to have arisen independently from a bat coronavirus (2), to which it shares 96% similarity (3). The emerging SARS-CoV-2 subsequently gained a modified spike protein due to recombination in an intermediate host, the pangolin (4, 5), followed by purifying selection for binding to the human ACE2 protein (6). No therapeutic agents able to reduce SARS-CoV-2 mortality in clinical settings are yet known, although extensive efforts are underway to discover new drugs or repurpose existing ones to inhibit key viral proteins. Here we focus on the main protease (Mpro), which plays a critical role in viral replication. Like other betacoronaviruses, SARS-CoV-2 is a positive-sense RNA virus that expresses all of its proteins as a single polypeptide chain, which is cleaved by Mpro to yield the mature proteins (7).
Inhibiting this key enzyme would prevent viral replication, reducing viral load and thus symptom intensity. A similar approach was instrumental in making HIV a manageable disease (8–10). However, the proteins in question differ markedly, rendering HIV protease inhibitors ineffective against SARS-CoV-2; indeed, a standard HIV protease inhibitor combination did not prove effective against COVID-19 in a recent clinical trial (11). Specifically, HIV protease is an aspartic protease (and functional only as a dimer, as the active site comprises one residue from each monomer), whereas Mpro is a 3CL cysteine protease that is likewise most active in the dimeric state, although each monomer has its own catalytic dyad (12). The 3CL cysteine proteases are characterized by a chymotrypsin-like fold and a cysteine-histidine catalytic dyad in the active site, implying both different structures and distinct chemical mechanisms. While the general strategy of seeking protease inhibitors is hence viable for both SARS-CoV-2 and HIV, drug development for the former depends on characterizing this novel enzyme.
Molecular modeling is an important tool for guiding inhibitor discovery, making it possible to evaluate large numbers of candidate drugs in silico to select experimental targets; however, standard approaches screen against only one version of the protein, typically the reference or wild-type (WT) sequence. In a host population, mutations accumulate with each viral passage, generating a mutational landscape rather than a single protein. The design of robust inhibitors that can protect against the multiple strains encountered in clinical settings requires characterization of this sequence space and the populations of conformations it engenders. Furthermore, effective and rapid response to future emerging coronavirus diseases requires both in silico screening and experimental testing of antiviral agents and a validated library of relatively general inhibitors that can be used as a basis for the development of specialized therapeutics. Central to the success of that effort will be developing an understanding of structural and functional variation in SARS-CoV-2 proteins, particularly as mutations accumulate and new strains emerge. Here we characterize all 79 known variants of Mpro as of 29 April, 2020, and analyze trends in amino acid substitutions and the resulting structural changes using network analysis and molecular modeling. To our knowledge this is the first detailed analysis of clinically relevant mutations in Mpro. Our analysis shows a trend toward substitution for larger and more hydrophobic residues versus the WT protein. Analysis of active site networks (ASN) from Mpro variants suggests differences in active site flexibility and cohesion that may serve to guide the design of robust, mutation-resistant inhibitors.
Results and Discussion
Mutations in Mpro are geographically distributed and occur throughout the protein
From the GISAID (https://www.gisaid.org/) (13) EpiCoV database (through 29 Apr, 2020), 78 unique non-synonymous mutations to Mpro were found in addition to the WT sequence, including 73 single point variants and 5 double variants. For genome sequences containing these Mpro variants, full genome alignments were performed using MUSCLE (14), and neighbor-joining trees were generated using MEGA X (15). Overall, the variation in SARS-CoV-2 sequences observed so far is relatively low, with mutation hotspots not evenly distributed throughout the genome, but localized to specific sequence regions (16). Because Mpro is critical for viral replication, mutations that have a large deleterious effect on virus replication are unlikely to be observed in clinical isolates; all Mpro variants investigated here are therefore assumed to be enzymatically competent. In general, codon usage and amino acid frequency in viruses of eukaryotes are essentially identical to those of their eukaryotic hosts, reflecting the viruses’ use of the host translation machinery (17). Mpro sequences found in sequences isolated from human hosts will therefore likely reflect bias toward human codon usage, somewhat limiting the scope of the observed mutation space.
The known mutations in Mpro are summarized in Figure 1. The tree was generated based on overall genome similarity; however, only sequences containing at least one mutation in Mpro were included in the analysis, along with the WT human sequence and two non-human reference sequences. The accession numbers and geographical sources are listed in Supplementary Table S2. The solid arcs around the outside of the diagram indicate Mpro mutations; color coding corresponds to the geographical source. Several mutations appear to have arisen more than once in the virus’s evolutionary history so far. Notably, K90R variants appear in multiple distantly related subtrees; five of these unique evolutionary events can be verified in Nextstrain’s SARS-CoV-2 phylogenetic tree (18). Further, L89F, P108S, and N274D arise at least twice in both trees.
These phylogenetic comparisons appear to support a multiple event hypothesis, but are subject to errors resulting from the sparsity of testing. The repeated occurrence of the same mutation in seemingly unrelated subtrees may be due to missing data that would show their evolutionary connectedness. The average branch length of Figure 1, which shows only topology, is 1.432161×10−4 base substitutions per site (including those from the bat (3) and pangolin (19)); 32.2% of the 1028 branches have, to ten significant figures, 0 base substitutions per site. For a genome of roughly 30,000 base pairs, this amounts to an average of only 4 substitutions per branch. All of these unique mutants therefore effectively belong to the same strain, making them difficult to place in an evolutionary context. For more diverged mutants, unfortunately-placed ambiguous nucleotides (20) could push them from one subtree to another. With the exception of five double variants, a majority of the sequences in Figure 1 arise from single point mutations. Whether and how Mpro mutations have affected viral fitness is not yet known, but at least three mutants have remained in the population long enough to accumulate another mutation: L220F to A191V/L220F, G15S to G15S/D48E and G15S/V35L, and K90R to V77A/K90R. It is worth noting that although a single variant A191V exists, the A191V/L220F double variant likely stemmed from an L220F ancestor due to its shared lineage with L220F single variants. A fifth double variant, A193T/R279C, was found but did not stem from any single mutation in our dataset; its origins remain unclear.
While a mutation’s prevalence and evolution in a population may be interpreted as a sign of stable viral function, the opposite does not necessarily indicate reduced virulence. Testing rates, social behavior, and time of first infection in each region are all factors that contribute to the spread of the disease and the availability of sequencing data. For instance, a large number of K90R mutants were collected in Iceland, where the number of tests per 1,000 people is nearly twice as many as the next leading country’s and more than seven times as many as the United States’ (Iceland: 141.75, USA: 18.21, as of 29 April, 2020) (21). Consequently, further investigation is needed to determine whether Mpro mutations affect viral fitness on a global scale. As such, without greater divergence and more sequences, it is difficult to tell if the presence of an Mpro mutation in unrelated subtrees is evidence of multiple evolutionary events, or an artifact of sparse testing.
Because only sequences harboring Mpro mutations were retained for analysis, certain geographical areas appear to be underrepresented. It is likely that the strains that had spread to underrepresented regions prior to our data collection simply did not have Mpro mutations. Different regions tend to be dominated by different mutants, a feature that might be explained by the timing at which these mutations arose or arrived. For instance, 83 of the 100 Mpro mutants from Iceland were K90R, and most stemmed from a single shared ancestor (Supplementary Table S2). Further, it is likely that heterogeneity in sequencing rates have resulted in a less-than-complete dataset. As of April 29th, the only North American, South American, and African Mpro mutants reported in the GISAID database that passed our filtering parameters were from Costa Rica and the USA, Argentina and Brazil, and the DRC respectively. This does not necessarily indicate a lack of Mpro mutations in other subregions, and may instead reflect differences in sequencing rates. In the structural analyses that follow, we focus on the differences in protein properties relative to WT, of the clinically observed Mpro variants.
Mpro mutations to date suggest selection for larger, more massive, and more hydrophobic residues
To reveal the global pattern of substitutions, we visualize mutations in Mpro - independent of sequence position or location in the three-dimensional structure - by a network in which the nodes, or vertices, are amino acid types and the edges (represented by arrows pointing in the direction of substitution) are directional indicators of how often one amino acid was observed to substitute for another (Figure 2). The weights of the edges indicate the frequency of the mutation across known Mpro variants, while node color reflects residue hydrophobicity on the scale of (23) (larger numbers indicate greater hydrophobicity.) The most obvious trend observed in the pattern of mutation so far is the preferential substitution of larger, more hydrophobic amino acids in place of smaller, less hydrophobic ones. Overall, the pattern is consistent with increased incidence of amino acids that are more likely to be present in folded domains, rather than in linker regions (24).
In particular, it is notable that alanine has very few incoming ties and a large number of outgoing ties, mostly to valine, which has a larger and more hydrophobic side chain. Alanine is at the same time one of the most common amino acids and one of those with the most variable prevalence in the human genome (25). Similarly, observed ties to isoleucine are mostly incoming from smaller residues, and leucine, which is also large and hydrophobic, likewise has more incoming than outgoing ties overall, with the bulk of its outgoing ties going to phenylalanine. However, aromatic residues per se do not appear to be selected at a higher rate than can be explained by their hydrophobicity. Also notable is the selection away from the secondary structure-breakers proline and glycine, both of which have only outgoing ties, and the propensity for lysine to be replaced by arginine even though both side chains are positively charged. Arginine is both larger and capable of making more and stronger hydrogen bonds, as well as cation-π interactions not available to lysine, leading to its known overrepresentation in inter-domain and inter-monomer interfaces (26–29).
The mean differences in sidechain properties for observed Mpro mutations are summarized in Table 1. As observed in the network representation (Figure 2), mutated residues are, on average, larger and more hydrophobic than those they replace. Although substituted residues are on average larger and more massive, we do not see strong evidence favoring bulky over compact residues net of mass: residue bulk (measured as volume/mass) for substituted residues did not differ significantly from WT (mean difference=0.02Å 3/Da, t = 1.87, p = 0.0650). The variant sequences are not significantly different from WT in charge or aromatic content.
Molecular modeling suggests regionally specific differences in Mpro variant structure
For WT and each Mpro variant, a molecular model was constructed using MODELLER 9.23 (30), based on the A chain monomer of the PDB structure 6Y2E (31), followed by annealing, correction of protonation states, and all-atom molecular dynamics simulation in explicit solvent (see Methods). Examples of representative models are shown in Supplementary Figure S1, with the positions of all mutated residues shown mapped onto the WT structure in Supplementary Figure S2. We do not observe gross differences in structure or dynamics across variants, as expected given that all variants were found in clinical isolates and are therefore necessarily functional; mutations leading to radically altered or misfolded structures would likely be strongly selected against. However, analysis of MD trajectories does suggest more subtle differences across variants, providing insight into function-preserving changes.
To assess the overall degree to which local structure is conserved across Mpro variants, we compute the cross-variant variance in average ϕ, Ψ backbone torsion angles by residue. In order to control for overall flexibility, we normalize this by the estimated variance in torsion angles within each trajectory. For arbitrary angle αi at residue i, this leads to the local variation index where αij is the vector of angles of type αi over the trajectory of variant j with corresponding angular mean , is the vector of such means across variants, VarB is the “between variant” angular variance in mean angles, and VarW is the “within variant” angular variance in αij. Intuitively, high values of v(αi) indicate relatively large between-variant variation in αi relative to angular variation seen within the trajectories themselves. For v(ϕi) and v(Ψi), such values correspond to systematic changes in local conformation associated with Mpro mutations. By turns, low values of v(ϕi) and v(Ψi) indicate residues whose local structure does not vary meaningfully across variants. It should be noted that such regions can be either flexible or rigid.
Figure 3 shows the mean local variation indices for ϕ, Ψ by residue for the 79 Mpro variants, indicated by color on the structure of Mpro WT. (Separate values for ϕ and Ψ are shown in Supplementary Figure S3.) It is immediately noteworthy that - with the minor exception of two small loop regions around N277 and F223 (respectively) - domain 3 shows little systematic variation across variants. The β-sheet-rich structure around the active site is also relatively well conserved. By contrast, we see relatively high levels of between-variant difference in the inter-domain region involving the termini (residues G2-A7 and S301-F305) and the double loop “active site gateway” region involving (respectively) L50-Y54 and D187-A191. The former is potentially significant in influencing large-scale flexibility (possibly relevant to dimerization), whereas the latter is of obvious relevance to substrate processing and specificity. This motivates a more detailed examination of variation in the active site, to which we return below.
The relatively high levels of conformational variation in the inter-domain regions suggest functionally relevant differences in global cohesion across variants. To assess this, we employ protein structure networks (PSNs), which are well-suited to assessing the looseness or cohesiveness of contacts among chemical groups. Moiety-level PSNs were constructed for each frame within each variant trajectory, using the definitions of (32) (Supplementary Figure S4). The assessment of global cohesion was performed by computing the mean degree k-core number for all moieties in each structure; to allow comparison of global cohesion within domains, we also compute mean core numbers within each of the three domains. The mean core number can be considered an index of structural cohesion, with higher values indicating greater numbers of redundant contacts among chemical groups (33). To account for within-trajectory autocorrelation in comparing mean core numbers, autocorrelation-corrected parametric bootstrap confidence intervals and standard errors were employed.
Figure 4 shows global and domain-specific cohesion levels (i.e., mean core numbers) for all variants, sorted in descending order of mean cohesion. (Means and standard errors for each variant can be found in Supplementary Table S1.) As suggested from the torsion angle analysis, cohesion differs significantly among variants, both globally and within domains. On average, the majority of variants are estimated to be less cohesively structured than WT, with the exception of domain 3 (in which WT does not differ significantly from the mean). It is possible that these differences indicate selection for more globally flexible structures (again, with the exception of domain 3). Whether or not this is the case, however, it appears clear that less cohesive structures are not strongly selected against. Such flexibility may affect dimerization kinetics, which is potentially relevant to the development of robust dimerization inhibitors.
Active site networks suggest potential activity differences across Mpro variants
The observation of structural variation in loop regions associated with the binding pocket motivates closer examination of variation in the Mpro active site. To this end, subgraphs of the full protein structure networks comprising moieties belonging to the active site residues and their neighbors were constructed to produce active site networks (ASNs) (34) for all conformations. A protein’s ASN describes physical interactions among active site moieties and other groups that are immediately adjacent in the 3D structure, irrespective of their positions in the amino acid sequence. Per (34), we compute for each ASN a constraint score, a general measure of active site flexibility that is associated with substrate specificity. The constraint score is the first principal component of a set of several network metrics (see Methods), with higher values indicating a greater tendency for the catalytic residues to be constrained by cohesive contacts with other residues, and lower values indicating fewer such constraints. Examples of ASNs corresponding to the maximum, minimum, and mean observed constraint values over all observed Mpro conformations are shown in Fig. 5.
Examination of the mean constraint scores for each variant trajectory suggests potential activity differences across Mpro variants. Fig. 5A shows mean constraint scores for each variant, with autocorrelation-corrected parametric bootstrap confidence intervals. Of the 79 trajectories examined, 22 (28%) were significantly below the grand mean (dotted vertical line) and 28 (35%) were significantly above it; similarly, when directly compared to WT, 12 variants were observed to be significantly less constrained, while 17 were significantly more constrained (i.e., bootstrap t-scores less than −2 or greater than 2, respectively). 43 out of 78 variants (55%) showed nominally higher levels of mean constraint than WT (discounting significance), suggesting a lack of uniform selection pressure for active sites that are more or less constrained than wild type (the fraction greater does not differ significantly from random deviation, p = 0.16, exact binomial test). Thus, although we do not see evidence here of systematic selection for net changes in active site constraint, we do see evidence that variants differ from each other and from WT in their average active site properties. These differences should be considered in the design of inhibitors that are robust to mutational change in Mpro over time. In particular, it is clear that the population of extant Mpro variants already possesses some phenotypic diversity in active site flexibility, potentially facilitating its ability to evolve around some types of inhibitors.
Methods
Sequence analysis and clustering
SARS-CoV-2 genome sequences were found by searching the GISAID (https://www.gisaid.org/) (13) EpiCoV database on 3 May, 2020, using the host keyword “human” and a cutoff date of 29 April, 2020, yielding a total of 15,432 SARS-CoV-2 genomes. Genomes outside the range of ± 3% reference (RefSeq: NC 045512.2) length (29,006bp – 30,800bp inclusive) or ≥ 1% N content were removed, leaving 10,644 “high-quality” sequences. Open reading frames in these high-quality full genomes were compared with a reference Mpro nucleotide sequence (WT, RefSeq: NC 045512.2, loc: 10,055–10,972), to extract Mpro sequences of at least 80% similarity using a script written in Python v3.7.0 (35). Genomes with gaps or ambiguous nucleotides (e.g. N, S, D, per International Union of Pure and Applied Chemistry (IUPAC) nomenclature (20)), in the Mpro sequence were excluded from this data set, leaving a total of 10,578 sequences from high-coverage genomes.
Nucleotide sequences were converted into amino acid sequences and screened for nonsynonymous mutations against the WT Mpro using code written in Wolfram Mathematica 12.1 (36), yielding 511 non-synonymous mutations in Mpro, 77 of which were unique. A single unique Mpro variant, found in a 24 April, 2020 dataset, but no longer available in the GISAID database, was also used in our analyses. Full genome alignments were performed using MUSCLE (14) on the complete set of non-synonymous Mpro mutants as well as reference WT, bat, and pangolin sequences. Trees were generated in MEGA X (15), using the Neighbor-Joining method (37); a bootstrap test (38) of 1000 replicates was performed, and distances were calculated using the Maximum Composite Likelihood model (38). In all, 515 full genomes were used in phylogenetic analyses; 78 unique Mpro mutants and a reference WT sequence (79 total) were used for molecular modeling.
Molecular modeling of wild-type and variant protein structures
Initial conditions for the WT trajectory used here are based on the A monomer of PDB structure 6Y2E (39), representing a mature (i.e., cleaved pro-sequence) protein. Initial variant protein structures were predicted using MODELLER 9.23 (30), using the 6Y2E structure as a template; three rounds of annealing and MD refinement were performed using the “slow” optimization level for each. Initial structures were then processed to correct protonation states to reflect their predicted cellular environment (with protonation states predicted using PROPKA 3.1 (40)). Each corrected model structure was then minimized and equilibrated in explicit solvent; simulations were performed using NAMD (41) with the CHARMM36 forcefield (42) in TIP3P water (43) at 310 K under periodic boundary conditions (with a 10 Å margin water box). Solvated protein models were energy-minimized for 10,000 iterations before being simulated for 0.5ns to adjust water box size, after which a 10ns trajectory was simulated with conformations being sampled every 20ps; an NpT ensemble was used, with temperature controlled via Langevin dynamics with a damping coefficient of 1/ps and Nosé-Hoover Langevin piston pressure control set to 1 atm (44, 45).
Network analysis
A protein structure network (PSN) was calculated for each modeled conformation of each variant via scripts employing the statnet (46–48), Rpdb (49), and bio3d (50) libraries for R (51). Vertices were defined using the method of (32), where each node represents a chemical moieity, with edges being defined by interatomic contacts. Specifically, two nodes i and j are considered adjacent if i contains atom g and j contains atom h such that the g, h distance is less than 1.1 times the sum of their respective van der Waals radii (using values from (52)). The node definitions are illustrated in Supplementary Figure S4A, and a small-moiety PSN of this type for WT Mpro is shown in Supplementary Figure S4B. Active site networks (ASNs) were constructed from each PSN as described in (34). Briefly, all vertices belonging to the catalytic Cys and His residues were identified, along with all vertices adjacent to these vertices within the PSN. The ASN was then defined as the subgraph of the corresponding PSN induced by this combined vertex set (Supplementary Figure S4C.)
To assess overall cohesion, degree k-core values (53) were calculated for each vertex in each PSN, and the average core number was computed for the entire protein and for the vertices in each domain, respectively. All calculations were performed using the sna library (48) for R. For each vertex associated with a moiety in the active site, three measures identified as associated with active site constraint by (34) were computed: the degree, or number of ties to other vertices; the triangle degree, or number of triangles (3-cliques) to which the vertex belongs; and core number, or number of the highest degree k-core (54) to which the vertex belongs. Physically, these respectively indicate the total number of contacts associated with the chemical group (potentially impeding its motion), the number of truss-like, triangular structures in which the group is embedded (again, restricting mobility), and the extent of local cohesion around the chemical group, which is found to distinguish “tighter” and “looser” packing regimes (33). To summarize the impact of each measure over the active site as a whole, values were averaged across active-site vertices. As an an additional constraint measure, the number of paths between each pair of active-site vertices through neighboring (i.e., non-active site) vertices was computed, and the log of the minimum of this value over the set of active site vertex pairs was employed as a measure of site cohesion. Intuitively, high values of site cohesion indicate that all active site chemical groups are connected by a large number of indirect contacts, while low values suggest that at least one pair of active site moieties has few local pathways holding them together. These four indices (mean active site degree, mean active site triangle degree, mean active site core number, and site cohesion) were used to produce an omnibus index of site constraint via principal component analysis (PCA) of the standardized network measures over all modeled conformations, per the approach of (34). This first principal component (the constraint score) accounted for approximately 71% of the variance in all four measures, and the ratio of its associated eigenvalue to the next largest was approximately 4.7 (confirming the dominance of the principal eigenvector).
Comparing mean cohesion and constraint scores across variants
Because cohesion and constraint scores are heavily autocorrelated within trajectories, we employ a parametric bootstrap strategy to obtain autocorrelation-corrected standard errors and confidence intervals (55). For each time series of scores for each trajectory, an autoregressive (AR) model with AIC-selected order was fit, and the estimated series mean obtained. (Estimation performed by maximum likelihood estimation using the ar function in R (51).) The whitened residuals from the time series model were then used to construct 5,000 parametric bootstrap replicate series, which were then re-fit to obtain bootstrap replicate means. Mean estimates from the bootstrap replicates were used to construct 95% bootstrap confidence intervals and standard errors for the series mean, as shown in Figs. 4 and 5. This procedure was applied to the MD trajectory for each variant. For the WT comparisons shown in Figs. 4 and 5, t values for mean constraint or cohesion score of each variant trajectory versus WT were constructed using the bootstrapestimated standard errors, with variant trajectories indicated in red or blue (respectively) if the differences of their mean scores versus WT led to t statistics below −2 or above 2. For cohesion scores, mean and bootstrap standard errors are provided for the full protein and each domain in Supplementary Table S1.
Author Contributions
R.W.M. and C.T.B. designed the study. T.J.C., G.R.T., M.G.C., and R.W.M analyzed sequence data. E.M.D. and C.T.B developed, implemented, and analyzed the simulation and network studies. M.G.C., V.F., S.Z., C.T.B., and R.W.M. performed structural analysis. T.J.C., G.R.T., C.T.B., and R.W.M. wrote the manuscript.
Supplementary materials
Table of Contents
Figures
Molecular models of representative variants (S1);
Sites of known Mpro mutations, mapped on the wild-type structure (S2);
Additional information on the fibril model reference measure (S3);
Local variation index values for Mpro backbone torsion angle, by residue number (S4);
Additional detail on the PSNs used in this work (S5)
Tables
Mean cohesion scores (k-core number) and autocorrelation-corrected bootstrap standard errors by variant (S1);
Accession numbers, locations, and dates of collection of all Mpro variants referred to in this work (S2)
Additional files available for download
Uncompressed version of the tree depicted in Figure 1 (muscle gisaid 20200429.txt);
Full acknowledgments for all sequences used in this work (514 acknowledgements.pdf)
Acknowledgments
This research was supported by NSF awards DMS-1361425 to C.T.B. and R.W.M. and IIS-1939237 to C.T.B., NASA Award 80NSSC20K0620 to R.W.M. and C.T.B., NIH grant 2R01EY021514 to RWM, the CIFAR Molecular Architecture of Life program, and the UC Irvine School of Physical Sciences Dean’s Excellence Fund. R.W.M. is a CIFAR Fellow.