Abstract
Understanding SARS-CoV-2 evolution is a fundamental effort in coping with the COVID-19 pandemic. The virus genomes have been broadly evolving due to the high number of infected hosts world-wide. Mutagenesis and selection are the two inter-dependent mechanisms of virus diversification. However, which mechanisms contribute to the mutation profiles of SARS-CoV-2 remain under-explored. Here, we delineate the contribution of mutagenesis and selection to the genome diversity of SARS-CoV-2 isolates. We generated a comprehensive phylogenetic tree with representative genomes. Instead of counting mutations relative to the reference genome, we identified each mutation event at the nodes of the phylogenetic tree. With this approach, we obtained the mutation events that are independent of each other and generated the mutation profile of SARS-CoV-2 genomes. The results suggest that the heterogeneous mutation patterns are mainly reflections of host (i) antiviral mechanisms that are achieved through APOBEC, ADAR, and ZAP proteins and (ii) probable adaptation against reactive oxygen species.
Importance
SARS-CoV-2 genomes are evolving worldwide. Revealing the evolutionary characteristics of SARS-CoV-2 is essential to understand host-virus interactions. Here, we aim to understand whether mutagenesis or selection is the primary driver of SARS-CoV-2 evolution. This study provides an unbiased computational method for profiling and analyzing independently occurring SARS-CoV-2 mutations. The results point out three host antiviral mechanisms shaping the mutational profile of SARS-CoV-2 through APOBEC, ADAR, and ZAP proteins. Besides, reactive oxygen species might have an impact on the SARS-CoV-2 mutagenesis.
Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread worldwide since its emergence in December 2019 (Hu, Guo, Zhou, & Shi, 2020) and had reportedly infected more than 83 million people, with a death toll of 1,831,412, as of 4 January 2021, according to the World Health Organization (WHO) (https://covid19.who.int/). Studies have focused on effective treatment of the disease, mostly through drug repurposing because of the urgency (Shah, Modi, & Sagar, 2020), and on finding a vaccine that will stop the spread of the virus. Though there are dozens of vaccine candidates in clinical development, the evolutionary potential of the virus might affect the efficacy of the immunizations and treatments. Therefore, understanding the genomic features and mutation dynamics of the virus is crucial to interpret its evolutionary patterns and its response to the available treatments and potential vaccines.
Analyzing virus sequence context and mutations has revealed many critical pieces of information about the characteristics of SARS-CoV-2. For example, the origin of SARS-CoV-2 was linked to bats and pangolins using phylogenetic analyses (Li, Yang, & Ren, 2020; Wu et al., 2020; Zhou et al., 2020; Zhu et al., 2020). Through mutational analyses, some genomic variants of the virus were associated with increased transmissibility (Hou et al., 2020; Volz et al., 2020). In addition, we and others studied the spread of the virus in a variety of countries by tracking the mutation events of sequences over time (Adebali et al., 2020; Popa et al., 2020; Volz et al., 2020).
Some studies also analyzed the mutation profiles of SARS-CoV-2 by counting the mutations on sequences (Graudenzi, Maspero, Angaroni, Piazza, & Ramazzotti, 2020; Popa et al., 2020). Care should be taken when counting mutations to create a mutation profile. Considering that virus genomes are evolutionarily linked to each other, counting all the mutations in the sequences with respect to a reference genome creates a mutation bias towards the most abundant or frequently sequenced isolates. In other words, if a mutation occurs in an ancestral genome, it will also be seen in all of its descendants unless it reverts. When the mutations are called relative to a reference genome, variants of a common origin will be counted multiple times, even though they are linked to a single mutation event. To overcome this issue, we created a phylogenetic tree and assigned only nucleotides that differ from the parent node as a mutation.
In this study, we retrieved more than 65,000 SARS-CoV-2 genome sequences from the GISAID (Global Initiative on Sharing All Influenza Data) database (Shu & McCauley, 2017) and clustered them based on their sequence identity. For each cluster, we assigned one sequence as the representative of that cluster. We aligned the representatives and constructed a phylogenetic tree using a maximum likelihood method. Then, we assigned mutations to the nodes of the phylogenetic tree and counted only the mutations that were newly acquired by a sequence with respect to its parent node. Finally, we analyzed mutation profiles and sequence diversity of SARS-CoV-2.
Results
We retrieved 66,938 genomes, which were all the available genomes in the GISAID database as of July 18th, 2020 (Shu & McCauley, 2017). Because multiple sequence alignment and tree building with more than 65,000 genomes were computationally intensive, we initially clustered the genomes and assigned a sequence from each cluster as its representative. With this method, we reduced the number of genomes to 20,089 without losing major sequence information. To assign mutations, we first aligned the 20,089 representative sequences and built a phylogenetic tree with a maximum likelihood method (see Methods). Then, we reconstructed the phylogenetic tree into a time-resolved phylogenetic tree, which infers the ancestral characters by estimating the molecular clock of the viral genomes (Sagulenko, Puller, & Neher, 2018). After constructing the tree, mutations were assigned based on the differences between the sequences and their parent node (Figure 1). This method enabled us to capture all the mutation events without recounting ancestral mutations. Moreover, we could also identify mutations that occurred repeatedly in different lineages, which would not be possible if the mutations were assigned relative to a reference genome.
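The difference between the two counting schemes can be illustrated with a minimal Python sketch. The toy tree, node names, and six-base sequences below are hypothetical and are not taken from the study; the sketch only assumes that an aligned sequence is available for every tip and every inferred ancestral node.

```python
from collections import Counter

# Hypothetical toy tree (child -> parent) with an aligned sequence per node.
parent_of = {"n1": "root", "tipA": "n1", "tipB": "n1", "tipC": "root"}
sequences = {
    "root": "ACGUAC",
    "n1":   "AUGUAC",  # C2U acquired once on the branch root -> n1
    "tipA": "AUGUAC",  # inherits C2U, no new mutation
    "tipB": "AUGUAU",  # inherits C2U, acquires C6U
    "tipC": "ACGUAU",  # acquires C6U independently
}

def differences(source_seq, target_seq):
    """Return (1-based position, source base, target base) for each differing site."""
    return [
        (i + 1, s, t)
        for i, (s, t) in enumerate(zip(source_seq, target_seq))
        if s != t and s in "ACGU" and t in "ACGU"
    ]

def tree_events(parent_of, sequences):
    """Count a mutation only on the branch where it first appears."""
    events = []
    for child, parent in parent_of.items():
        events.extend(differences(sequences[parent], sequences[child]))
    return events

def reference_calls(reference, sequences, tips):
    """Call every tip against one reference; inherited mutations are recounted."""
    calls = []
    for tip in tips:
        calls.extend(differences(reference, sequences[tip]))
    return calls

events = Counter(tree_events(parent_of, sequences))
calls = Counter(reference_calls(sequences["root"], sequences, ["tipA", "tipB", "tipC"]))
print("tree-based events:", dict(events))
print("reference-based calls:", dict(calls))
```

In this toy case the inherited C2U change is one event on the tree but is called twice against the reference, whereas the recurrent C6U change is recovered as two independent events on the tree.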
Mutation profile of SARS-CoV-2 and potentially related mechanisms
To generate the mutation profile of SARS-CoV-2, we performed mutational signature analysis for all 192 trinucleotide changes using 28,544 mutations from the 35,841 representative sequences and nodes (Figure 2A). We normalized all the trinucleotide changes by the occurrence of the trinucleotide to eliminate any sequence context bias. In general, the most abundant mutational patterns are C>U, G>U, U>C, and A>G substitutions, that are 33.8%, 14.8%, 13.7%, and 9.6% of total substitutions, respectively (Figure 2A).
An enzyme family known for causing C>U substitutions is the apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) family. Enzymes of the APOBEC family have antiviral activity against some RNA viruses, including coronaviruses (Harris & Dudley, 2015; Sharma et al., 2015; Woo, Wong, Huang, Lau, & Yuen, 2007). Briefly, they deaminate cytosine to uracil, which results either in a C>U substitution when the single-stranded viral RNA (plus strand) is edited or in a G>A substitution when the edit occurs on the complementary strand (minus strand). In agreement with previous studies, the impact of APOBEC is highly visible at C>U substitutions, while it is relatively low at G>A substitutions (8.8% of total substitutions) (Di Giorgio, Martignano, Torcia, Mattiuz, & Conticello, 2020; Graudenzi et al., 2020; Simmonds, 2020). This result suggests an asymmetric activity for APOBEC enzymes in favor of single-stranded viral RNA. Because viral RNA is most frequently present as the plus strand, we see the effect of APOBEC activity mainly as C>U substitutions rather than G>A substitutions, which possibly reflect APOBEC activity on the negative strand during RNA replication. Moreover, APOBEC proteins preferentially target 5’-[T/U]C-3’ and 5’-CC-3’ motifs when deaminating cytosine (McDaniel et al., 2020). These target sequence preferences are observed in our mutational profile, where 5 of the 7 highest normalized C>U mutation counts fall on 5’-UC-3’ and 5’-CC-3’ motifs (Figure 2A). It has also been found experimentally that the RNA editing cofactor A1CF, the APOBEC1 complementation factor, is among the SARS-CoV-2 RNA binders (Schmidt et al., 2020), which strengthens the hypothesis that APOBEC proteins contribute to the C>U substitutions.
The second most prevalent substitution is G>U, which might be associated with reactive oxygen species (ROS) in an APOBEC-related manner. A recent study revealed that the DNA damage response mediated by APOBEC3A (a member of the APOBEC family) results in ROS production (Niocel, Appourchaux, Nguyen, Delpeuch, & Cimarelli, 2019). ROS can induce oxidative DNA damage, usually transforming guanine into 7,8-dihydro-8-oxo-2’-deoxyguanine (oxoguanine), which can pair with adenine and lead to G>U substitution (Molteni, Principi, & Esposito, 2014; Waris & Ahsan, 2006). However, to date, there is no direct evidence of ROS-caused damage in the SARS-CoV-2 genome.
Another mechanism that can mutate the viral genome involves adenosine deaminase acting on RNA (ADAR), an enzyme that mediates the deamination of adenosine to inosine (A>I), which is later read as guanine (A>G) (Di Giorgio et al., 2020). A>G (plus strand) and U>C (minus strand) substitutions are observed at similar levels (13.8% and 9.6% of total substitutions, respectively) (Figure 2A). ADAR targets dsRNA, and therefore, equivalent levels of ADAR activity are expected to be present on both strands (Di Giorgio et al., 2020). The symmetric mutation profile for this pattern strongly suggests that ADAR acting on the replicating RNA contributes to A>G and U>C substitutions.
In the context of trinucleotides, mutations occurred predominantly in U(C>U)G, A(C>U)G, C(C>U)G, U(C>U)U, and A(C>U)A (Figure 2A). Notably, 3 of the 5 most frequently changed trinucleotides contain CG at their second and third positions. To examine whether these mutations were predominantly located at a single position in the viral genome or distributed throughout the genome, we identified dynamic positions, where more than 8 recurring mutations were observed. Afterwards, we investigated the contribution of these positions to the mutation profile (Figure 2B,C). With some exceptions, mutations at dynamic positions do not dominate the overall mutation profile. One exception is the G(G>C)G mutation at position 28,883, which corresponds to 64.3% of all mutations occurring on GGG. Although the percentage is high, the number of mutations occurring on GGG trinucleotides is only 14. Similarly, G(A>U)C mutations at position 29,869 correspond to 29% of all mutations occurring on the GAC trinucleotide, but the number of mutations on GAC is as low as 31. However, when only trinucleotides with more than 200 total mutations are considered, the position with the highest contribution is 11,083, with U(G>U)U mutations composing 15% of the total mutations on UGU. In conclusion, the mutation profile is not dominated by the dynamic positions; a position bias in the mutation distribution is observable only when the number of mutations of a trinucleotide is low. Several mutations labeled as impactful on signatures (Figure 2B) have been investigated in various studies for their possible effect on the severity and transmissibility of the virus. The most frequently mutated position, 11083, has been associated with the severity of the virus (Toyoshima, Nemoto, Matsumoto, Nakamura, & Kiyotani, 2020).
Together with 11083, mutations at 23403, 21575, 28881, 28883 positions (Berrio, Gartner, & Wray, 2020; Korber et al., 2020; van Dorp et al., 2020; Yin, 2020) have been associated with significant indication towards selection. Particularly the D614G mutation on Spike protein is associated with the fitness of the virus both by computational and clinical studies (Korber et al., 2020; Plante et al., 2020; Yin, 2020).
Codon usage of SARS-CoV-2 differentiates in favor of A and U containing codons
Next, we investigated the impact of a potential contribution of codon bias selection on the mutation profile. First, we counted all the formed and altered codons, which we referred to as “form” and “deform”, respectively. We calculated the relative difference between form and deform values for each codon to test potential convergence to host codon usage through mutations (Figure 3A). While UUU, AUA, AUU, and UAU are the intensively formed codons, CCA, UGG, GCU, and ACA are the most diminished ones. These results indicate a dominant forming of A and U containing trinucleotides, whereas G and C containing trinucleotides tend to disappear and reduce in number. In addition, all the codons that are translated into alanine (A) and proline (P) tend to diminish, resulting in lower translation of these amino acids in viral proteins. Considering that all the codons of A and P contain GC and CC in the first and second position, respectively, the reduction in these amino acids is probably related to selection against G and C presence (Figure 3A).
Human coronaviruses are known to have low GC content (GC%), and SARS-CoV-2 is no exception with ∼38 GC% (Berkhout & van Hemert, 2015; Xia, 2020). Moreover, it was suggested that the reduction in GC% is an adaptation strategy of SARS-CoV-2, especially to the human lung expressed genes (Y. Li et al., 2020). To determine whether the mutations of SARS-CoV-2 are an adaptation strategy to increase its viability inside the host or just a byproduct of the host immune response to the viral RNA, we obtained the human codon usage values (Nakamura, Gojobori, & Ikemura, 2000) and calculated the codon usage of SARS-CoV-2 (see Methods). We calculated the relative ratio of these values and grouped codons that encode the same amino acid (Figure 3B). If the viral genome is adapting to the host genome, one can hypothesize that the codons used more in the host than in the virus should be formed in the viral genome to increase the similarity, while the codons used more in the virus should diminish. The GCU, GAA, GGU, and CGU codons, which are used relatively frequently in SARS-CoV-2, tend to deform, in agreement with the hypothesis. However, the UGU, AUA, UUA, and GUU codons, which are also used relatively frequently in SARS-CoV-2, tend to be formed. A similar contradiction is also observed in the codons that are highly used in the human genome. In general, adaptation to the host codon usage cannot explain the formation tendency of the codons. The main driver of the formation tendency is likely to be selection pressure against GC% and, thus, an increase in A and U.
CG nucleotide deforms, while UU nucleotide forms
After observing more mutations in trinucleotides that contain CG at their second and third positions, and higher deformation in G and C containing codons, we decided to examine the deformation (Figure 4A) and formation (Figure 4B) of dinucleotides. Because deformation of a dinucleotide is dependent on its occurrence in the genome, we normalized the deformed value of each dinucleotide with respect to its occurrence in the reference genome. As also suggested by others (Wang et al., 2020; Xia, 2020), the CG dinucleotide is the most deformed among all (Figure 4A). Xia attributed the reduction in CG dinucleotides to a protein called zinc finger antiviral protein (ZAP), which binds the viral genome and mediates its degradation (Xia, 2020). Moreover, the study indicates that SARS-CoV-2 is the most CG deficient betacoronavirus (Xia, 2020). Thus, high CG deformation might be an adaptation of SARS-CoV-2 to escape ZAP under high purifying selection. In addition, the UU dinucleotide is formed more than all other dinucleotides. In general, A and U containing dinucleotides are formed, whereas C and G containing dinucleotides are deformed.
Discussion
The COVID-19 pandemic has been spreading aggressively, killing thousands of people and affecting the daily lives of many more. Moreover, the evolutionary behavior of SARS-CoV-2 might potentially weaken the efficiency of the current treatments and vaccines. Here, we performed a phylogenetic tree-based mutational analysis to assess the contribution of mutagenesis and selection mechanisms to SARS-CoV-2 mutation profiles.
The mutation profile of SARS-CoV-2 revealed that C>U, G>U, U>C, and A>G are the dominant substitutions. Based on these mutational patterns, we compiled some potential mechanisms that might be influencing the SARS-CoV-2 viral genome (Figure 5), namely APOBEC, ADAR, and ZAP. These mechanisms were linked to SARS-CoV-2 mutagenesis in previous studies as well (Graudenzi et al., 2020; Kosuge, Furusawa-Nishii, Ito, Saito, & Ogasawara, 2020; Xia, 2020). In addition, we suspect that ROS might be a driver of G>U substitutions; however, more studies should be conducted to link ROS to SARS-CoV-2.
Another aim of this study was to examine the main driver of the mutational patterns of SARS-CoV-2: whether the viral genome is inclined to converge to the host genome or the mechanisms we have discussed are the only contributors to the mutational patterns. Analyses of formed and deformed codons show an increase of A and U containing codons and a decrease of G and C containing codons. Furthermore, the comparison between human and SARS-CoV-2 codon usage does not reveal a strong correlation between codon usage percentages and SARS-CoV-2 formation tendency. Combined, these results suggest that the SARS-CoV-2 genome diverges through RNA editing mechanisms of the host, independently of any adaptive mechanism to increase its genomic similarity to the host genome, which was suggested in another study as well (Y. Li et al., 2020). Then, we examined the formation tendency of dinucleotides. In general, we observed a decrease of G and C containing dinucleotides and an increase of A and U containing dinucleotides. Strikingly, the deformation rate of CG dinucleotides and the formation of UU dinucleotides are extremely high. This phenomenon, which was observed in most human viruses (Caudill et al., 2020), was previously associated with the reduction of the hydrogen bonds between strands to achieve more efficient gene expression (Wang et al., 2020).
In conclusion, the mutational profile we generated supports the potential biological mechanisms contributing to the genome diversity of SARS-CoV-2. Strand asymmetry of some mutation signatures suggests mechanisms acting on the plus RNA strand only. The strand-wise equivalent mutation signature attributed to ADAR is in agreement with its mechanism of action, in which RNA is affected in the double-stranded form. Antiviral responses and selection cannot be distinguished from each other. On the one hand, host responses against the virus cause mutations; on the other hand, the reduced number of targets in the virus genome makes it less susceptible to the same antiviral attacks. Although we do not suggest a direct antiviral mechanism that reduces CG content, the reduced CG content can be explained by adaptation to the host antiviral mechanism mediated by ZAP. So far, the virus has been affected by the host antiviral mechanisms. Although there are several Spike protein amino acid substitutions that are likely to provide a selection advantage (Chand et al., 2020; Volz et al., 2020), selection has not been a major driving force across the entire genome. In the coming months, with a wide administration of the vaccines, it might be possible to see the effect of vaccination and selection pressure by observing amino acid changes that provide an advantage in escaping from immunized hosts.
Methods
Data retrieval and mutation assignment
66,938 SARS-CoV-2 genomes and their metadata, dated up to 18 July 2020, were retrieved from the GISAID database (Shu & McCauley, 2017). Initially, all the sequences with incomplete information (proper date or location of sample collection) in the metadata file were filtered out with a custom Python script. Then, the cd-hit program was used to cluster sequences and choose representatives (-c 0.999 -M 0 -T 80) (Fu, Niu, Zhu, Wu, & Li, 2012). 20,089 clusters were created, of which 18,334 contained only a single sequence, while 1,755 contained multiple sequences (up to 14,867 sequences in a cluster). Then, the first sequence of each cluster was chosen as the representative of that cluster. After representative selection, the representative sequences were aligned with the MAFFT algorithm using the Augur toolkit (Hadfield et al., 2018; Katoh & Standley, 2016). The Wuhan-Hu-1 genome (GenBank: NC_045512.2) was chosen as the reference genome for the alignment. Then, a phylogenetic tree was constructed using IQ-TREE (-fast -n AUTO -m GTR).
The tree was then reconstructed into a time-resolved tree using the treetime option of Augur (Hadfield et al., 2018). The sample with the earliest collection date among the representative sequences was chosen as the root, and marginal maximum likelihood estimation was used for date inference. A clock rate of 0.0008, with a standard deviation of 0.0004, was applied across the genome, and the date confidence flag was used to take the uncertainty of divergence time estimates into account. A constant coalescent model was chosen, and the ‘covariance-aware’ mode of Augur was turned off with the no covariance flag. With these parameters, the number of representative sequences was reduced to 19,045.
To assign the mutations to the nodes of the time-resolved tree, the ancestral option of Augur, which infers the ancestral sequences, was used by giving the time-resolved tree and the multiple sequence alignment of representative sequences as input (--inference joint).
Mutation profile analysis
The mutation list was obtained from the phylogenetic tree and includes the mutations observed at each step of the tree. Then, this mutation list was divided into 192 groups based on the 12 mutation types (e.g., A>U, G>C) and the 16 trinucleotide contexts in which the mutated base is centered (e.g., A>U:UAA, G>C:AGU). Each of the 192 mutation groups was normalized by the number of occurrences of its corresponding trinucleotide in the reference genome. Finally, these normalized mutation counts were plotted with the ggplot2 package (Wickham, 2016) in R, colored by their corresponding mutation type and trinucleotide content.
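The grouping and normalization described above can be sketched in Python as follows. This is an illustrative sketch and not the code used in the study: the short genome and the mutation list are hypothetical, and the flanking bases of each mutated site are assumed to be read from the reference genome.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"

def trinucleotide_counts(genome):
    """Count every overlapping trinucleotide in the (reference) genome."""
    counts = Counter()
    for i in range(1, len(genome) - 1):
        tri = genome[i - 1 : i + 2]
        if set(tri) <= set(BASES):
            counts[tri] += 1
    return counts

def mutation_profile(genome, mutations):
    """Build the 192-group profile from (1-based position, ref base, alt base).

    Assumption: flanking bases are taken from the reference genome.
    Each group count is divided by the occurrences of its trinucleotide.
    """
    tri_counts = trinucleotide_counts(genome)
    raw = Counter()
    for pos, ref, alt in mutations:
        i = pos - 1
        if i < 1 or i > len(genome) - 2 or ref == alt:
            continue  # skip sites without two flanking bases
        tri = genome[i - 1] + ref + genome[i + 1]
        if set(tri + alt) <= set(BASES):
            raw[(tri, alt)] += 1
    profile = {}
    for five, ref, three, alt in product(BASES, repeat=4):
        if ref == alt:
            continue
        tri = five + ref + three
        occurrences = tri_counts[tri]
        profile[(tri, alt)] = raw[(tri, alt)] / occurrences if occurrences else 0.0
    return profile

# Hypothetical toy input
genome = "AUCGUCGAUCA"
mutations = [(3, "C", "U"), (6, "C", "U"), (6, "C", "U"), (8, "A", "G")]
profile = mutation_profile(genome, mutations)
print(len(profile), profile[("UCG", "U")], profile[("GAU", "G")])
```

Here the three U(C>U)G events are divided by the two occurrences of UCG in the toy genome, so a frequent trinucleotide does not inflate its own mutation count.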
Observed mutations were first grouped by their position, mutation type, and trinucleotide, and frequently observed mutations (more than 8 occurrences at the same position on the same trinucleotide) were recorded. Then, observed mutations were grouped by their mutation type and trinucleotide, which resulted in the 192 groups of the mutation profile. The contribution of each frequently observed mutation to its profile group was calculated by dividing its occurrence count by the total occurrence count of the corresponding mutation type and trinucleotide group, and the mutations that contributed more than 5% were plotted with ggplot2 in RStudio.
The number of mutations observed by position is plotted and previously recorded frequently observed mutations (Figure 2B) are highlighted with respect to their mutation type and contribution range.
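The contribution calculation can be sketched as below. The event list is hypothetical and was chosen only so that one site accounts for 9 of 14 events in its group; the thresholds follow the values stated above (more than 8 occurrences, more than 5% contribution), and the function and variable names are illustrative.

```python
from collections import Counter

def dynamic_position_contributions(events, min_recurrence=8, min_share=0.05):
    """Share of each frequently mutated site within its profile group.

    events: (position, trinucleotide, alt base) tuples, one per mutation event.
    A site is kept if it recurs more than `min_recurrence` times and makes up
    more than `min_share` of its (trinucleotide, alt base) group.
    """
    events = list(events)
    per_site = Counter(events)
    per_group = Counter((tri, alt) for _, tri, alt in events)
    contributions = {}
    for (pos, tri, alt), n in per_site.items():
        if n > min_recurrence:
            share = n / per_group[(tri, alt)]
            if share > min_share:
                contributions[(pos, tri, alt)] = share
    return contributions

# Hypothetical toy events: one recurrent site plus scattered single events.
toy_events = (
    [(28883, "GGG", "C")] * 9
    + [(100 + k, "GGG", "C") for k in range(5)]
    + [(500, "UCG", "U")] * 3
)
contributions = dynamic_position_contributions(toy_events)
print(contributions)
```

The site with three events is not reported because it does not exceed the recurrence threshold, even though it makes up its whole group.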
Counting codon changes and codon usage
By using ancestral mutations from the time-resolved tree, the mutated codons were counted and labeled as the deformed count for each codon, while the counts of the resulting codons were referred to as formed codons. For each codon type, the ratio between the formed and deformed counts was taken and plotted on a log2 scale by using the ggplot2 package (Wickham, 2016).
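A minimal Python sketch of the formed and deformed counts and their log2 ratio is given below. The codon changes are hypothetical, and skipping codons with a zero formed or deformed count is an assumption of this sketch, since the log ratio is undefined for them.

```python
import math
from collections import Counter

def codon_form_deform(codon_changes):
    """Count formed and deformed codons and their log2(formed / deformed).

    codon_changes: (codon before mutation, codon after mutation) pairs.
    Assumption: codons with a zero formed or deformed count are left out of
    the ratio table because the log ratio is undefined.
    """
    formed, deformed = Counter(), Counter()
    for old, new in codon_changes:
        if old != new:
            deformed[old] += 1
            formed[new] += 1
    ratios = {
        codon: math.log2(formed[codon] / deformed[codon])
        for codon in set(formed) | set(deformed)
        if formed[codon] and deformed[codon]
    }
    return formed, deformed, ratios

# Hypothetical toy codon changes
changes = [
    ("GCU", "GUU"), ("GCU", "GUU"), ("GUU", "GCU"),
    ("CCA", "UCA"), ("UCA", "UUA"),
]
formed, deformed, ratios = codon_form_deform(changes)
print(ratios)
```

Positive values mark codons that are formed more often than they are lost, and negative values mark codons that tend to diminish.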
The human codon usage table was retrieved (Nakamura et al., 2000), and the SARS-CoV-2 codon usage table was calculated with a custom R script. The number of occurrences in the reference genome was retrieved for each codon, and the codons were then grouped by their corresponding amino acids. The usage ratio of each codon was calculated by dividing the occurrence of that codon by the summed occurrence of itself and its synonymous codons. Afterwards, the relative ratio of codon usage between Homo sapiens and SARS-CoV-2 was calculated by dividing the ratio of a codon in one species by the sum of its ratios in both species.
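The two ratios can be sketched in Python as follows. The codon counts are hypothetical placeholders and are not the values from the cited human table or from the SARS-CoV-2 reference genome; only the standard genetic code is taken as given.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_usage(codon_counts):
    """Usage of each codon relative to all codons for the same amino acid."""
    totals = Counter()
    for codon, n in codon_counts.items():
        totals[CODON_TABLE[codon]] += n
    return {
        codon: n / totals[CODON_TABLE[codon]]
        for codon, n in codon_counts.items()
        if totals[CODON_TABLE[codon]] > 0
    }

def relative_usage(usage_a, usage_b):
    """Ratio in species A divided by the sum of the ratios in A and B."""
    shared = usage_a.keys() & usage_b.keys()
    return {
        codon: usage_a[codon] / (usage_a[codon] + usage_b[codon])
        for codon in shared
        if usage_a[codon] + usage_b[codon] > 0
    }

# Hypothetical alanine codon counts for illustration only
virus_counts = {"GCU": 6, "GCC": 2, "GCA": 1, "GCG": 1}
human_counts = {"GCU": 2, "GCC": 4, "GCA": 2, "GCG": 2}
virus_vs_human = relative_usage(
    synonymous_usage(virus_counts), synonymous_usage(human_counts)
)
print(virus_vs_human)
```

With this definition, a value above 0.5 indicates a codon that is used relatively more in the first species than in the second.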
Dinucleotide changes
Observed mutations in the time-resolved tree were used to calculate the number of deformations observed for each dinucleotide. The dinucleotides formed by these mutations were also counted and recorded as the number of formations for each dinucleotide. Then, the occurrence count of each dinucleotide in the reference genome was retrieved. The deformation counts were normalized by dividing them by the corresponding occurrence counts in the reference genome and plotted together with the formation counts using ggplot2 in RStudio (Wickham, 2016).
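One possible way to derive these counts is sketched below in Python. It assumes that each substitution deforms the two overlapping dinucleotides that contain the original base and forms the two that contain the new base, with neighboring bases read from the reference genome; this bookkeeping, the toy genome, and the mutation list are illustrative assumptions.

```python
from collections import Counter

def dinucleotide_changes(genome, mutations):
    """Return formation counts and occurrence-normalized deformation counts.

    mutations: (1-based position, ref base, alt base) tuples.
    Assumption: neighbors come from the reference genome, and each
    substitution affects both overlapping dinucleotides.
    """
    occurrences = Counter(genome[i : i + 2] for i in range(len(genome) - 1))
    formed, deformed = Counter(), Counter()
    for pos, ref, alt in mutations:
        i = pos - 1
        if i > 0:  # dinucleotide with the upstream neighbor
            deformed[genome[i - 1] + ref] += 1
            formed[genome[i - 1] + alt] += 1
        if i < len(genome) - 1:  # dinucleotide with the downstream neighbor
            deformed[ref + genome[i + 1]] += 1
            formed[alt + genome[i + 1]] += 1
    normalized_deformed = {
        dinuc: n / occurrences[dinuc]
        for dinuc, n in deformed.items()
        if occurrences[dinuc]
    }
    return formed, normalized_deformed

# Hypothetical toy input: two C>U events, each inside a UCG context
genome = "AUCGUCGAUCA"
mutations = [(3, "C", "U"), (6, "C", "U")]
formed, normalized_deformed = dinucleotide_changes(genome, mutations)
print(dict(formed), normalized_deformed)
```

In this toy case each U(C>U)G event removes one CG and one UC and creates one UU and one UG, which mirrors the CG loss and UU gain pattern discussed in the Results.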
Code availability
All the codes and the processed data are publicly available at https://github.com/CompGenomeLab/covid19.
Acknowledgments
This study was supported by EMBO Installation Grant which is funded by TUBITAK. We thank Dr. Stuart James Lucas for his helpful feedback.