Intra-host variability in global SARS-CoV-2 genomes as signatures of RNA editing: implications in viral and host response outcomes

Since its zoonotic transmission in the human host, the SARS-CoV-2 virus has infected millions and has diversified extensively. A hallmark feature of viral system survival is their continuous evolution and adaptation within the host. RNA editing via APOBEC and ADAR family of enzymes has been recently implicated as the major driver of intra-host variability of the SARS-CoV-2 genomes. Analysis of the intra-host single-nucleotide variations (iSNVs) in SARS-CoV-2 genomes at spatio-temporal scales can provide insights on the consequence of RNA editing on the establishment, spread and functional outcomes of the virus. In this study, using 1,347 transcriptomes of COVID-19 infected patients across various populations, we find variable prevalence of iSNVs with distinctly higher levels in Indian population. Our results also suggest that iSNVs can likely establish variants in a population. These iSNVs may also contribute to key structural and functional changes in the Spike protein that confer antibody resistance.


INTRODUCTION
The SARS-CoV-2 pandemic has inflicted an enormous toll on human lives and livelihoods. As of December 1, 2020, 1.5 million individuals have succumbed to COVID-19 while 64 million have been confirmed diagnosed around the globe ( 1 ). The outcomes of viral infection are highly variable as apparent from regional distribution of spread, susceptibility across age groups, diversity in clinical symptoms, and significant molecular differences including tissue-level expressions in hosts. At genomics scale, thousands of viral genome sequences across the globe also reveal the differential patterns of diversity during the course of the pandemic. Besides, variability in the number of cases, mortality as well as recovery rates across different ethnicities is now being widely acknowledged ( 2 ). The spectrum of these evolving variations in different populations is suggestive of host-dependent factors. Failure to account for this variability can confound diagnostics and therapy that are dependent on the sequence and structure of the SARS-CoV-2 genomes.
One of the differentiating features of RNA viruses is their exceptionally high nucleotide mutation rates that results in a set of closely related but non-identical genotypes, termed as quasi-species , in a host ( 3 -8 ) .
Host-mediated RNA editing could contribute to the high mutation rates in retroviruses ( 9 , 10 ) besides other factors like high-rates of replication, low fidelity of RNA-dependent RNA polymerase and genomic recombination ( 11 , 12 ) . In a recent study, RNA editing by host deaminases has been proposed as a major driver of intra-host heterogeneity in SARS-CoV-2 genomes ( 13 ). The APOBEC (apolipoprotein B mRNA editing catalytic polypeptide-like) and ADAR1 (Adenosine Deaminase RNA Specific, 1) protein families are the most prominent RNA editing enzymes implicated in innate antiretroviral defence for a large family of viruses ( 9 , 10 ,

Intra-host SNVs as a consequence of potential RNA editing of SARS-CoV-2 genomes
We analysed 1,347 transcriptomic samples obtained from patients diagnosed with COVID-19 from different populations (Table 1). 1,160 samples could be aligned to the SARS-CoV-2 reference genome. We identified potential editing events as iSNVs by recording heteroplasmy in the sequence reads. To ensure specificity of iSNV detection, the filtering criteria and cutoffs were established by analysing an additional set of 500 samples sequenced in replicates (see "Material and Methods"). 958 of the 1,160 samples (~83%) harboured one or more iSNVs wi th frequencies ranging between 0.005 and 0.80 . We recorded a total of 86,595 iSNVs with a median of~19 iSNVs per sample ( Supplementary Fig. 1A), revealing extensive heteroplasmy in samples. We did not observe a significant correlation between the mean sample coverage depth and the number of iSNVs per sample suggesting that these events are unlikely to be a consequence of sequencing artefacts (Pearson's r=0.259, Supplementary Fig. 1C).
A bias towards the C-to-T/G-to-A and A-to-G/T-to-C nucleotide changes suggests RNA editing activity in both the strands of SARS-CoV-2 genome by the APOBEC and ADAR family of enzymes respectively (Fig. 1A).
ADAR mediated A-to-G/T-to-C changes were observed to be more frequent accounting for about 36% of all variant positions, though these changes had the lowest median frequencies compared to the others ( Supplementary Fig. 1D). Although A-to-G changes were seen in disproportionate numbers in few samples, APOBEC mediated C-to-T changes were more consistently observed with a median of 5 events per sample ( Fig. 1A). Both C-to-T/G-to-A and A-to-G/T-to-C combinations contributed significantly yet similarly to synonymous and missense changes, whereas only the former contributed to stop gains ( Supplementary Fig.   1E).
iSNVs contribute to variability throughout the SARS-CoV-2 genome, both in the coding as well as in the non-coding regions. A total of 18,216 genomic positions (4,961 polymorphic) ( Supplementary Table 3 ) contributed to iSNVs in one or more samples distributed evenly in all genes and protein-coding domains of the viral genome ( Supplementary Fig. 1F). Though the frequency spectrum of iSNVs was nearly uniform with respect to the nature of inflicted amino acid sequence change ( Supplementary Fig. 1G), a l arge fraction of the sites result ed in non-synonymous changes (Fig. 1B). Gain of stop codons that could alter the amount of functional proteins from the viral genomes were also observed.

Contribution of iSNVs to the spectrum of SARS-CoV-2 diversity across populations
There was an overall difference in the prevalence of iSNVs between the populations that could have arisen due to differential editing of the SARS-CoV-2 genomes within the hosts (Fig. 1C). Indian subpopulations distinctly displayed a significantly higher number of iSNVs compared to populations in Europe, China and USA.
The number of samples analysed in the populations of China, USA and India were comparable, thus ruling out any bias induced due to sampling. This highlights the role of the background population in ongoing variability in the SARS-CoV-2 genome.
The top 1% frequently observed heteroplasmic positions in major cohorts that exhibit a wide range of frequency differences are shown in Figure 1D. Each concentric circle depicts iSNV frequency in bins of 0.

Spatio-temporal dynamics of iSNVs in populations
The diversity in the prevalence of viral strains that could have potentially arisen out of these editing events is variable during different times of the pandemic as reflected from the appearance and disappearance of the genomes harbouring certain SNVs that originate from heteroplasmic sites. For instance, A2a clade that originated in March continues to be the predominant clade in the population whereas the India specific A3i/A4 clade that was the most abundant clade in India in March seems to have nearly disappeared by May ( Fig. 2A).
Though most of the sites seem to convert to a fixed variant in one or all populations (Fig. 1D, Supplementary   Fig. 2), positions like 1707, 15435, 24622, 26554 and 29029 that depict an increasing trend in the frequency of heteroplasmic variants in some cohorts suggest that these iSNVs could get fixed in the population at a later date. In a geographically restricted population, such events may contribute to the origin of new variants/haplotypes within the population over a period of time. We observe this in the East Indian cohort where the appearance of SNVs for a range of positions was seen to coincide with or after the iSNVs first surfaced in the population (Fig. 2B). This further strengthens the postulate that host-mediated editing may introduce SNVs into the travelling viral strains in a population.

Potential functional consequence of hyper-editing on Spike protein
We observed a few samples to be substrates for hyper-editing events in each cohort i.e reflected by extensive heteroplasmy at multiple sites. In order to identify these population outliers, we computed the Z-score values based on the distribution of the number of iSNVs per sample (Fig. 3A, Supplementary Table 4). In these samples, we observed A-to-G substitutions accounting for about one-third of all changed genomic positions implying extensive ADAR mediated editing activity (Fig. 3B). Of the 11,420 annotated variants, 1,481 correspond to protein-coding variants within the Spike protein.
To understand how iSNVs could impact function through amino acids substitutions in hyper-edited samples, we examined the mutational spectra of the Spike protein in more detail, given that the initial host-viral response is triggered by the attachment of the Spike protein of SARS-CoV-2 with the host ACE2 receptor, disrupting the host cell membrane and activating viral entry ( 25 ) . Amongst all the variants potentially induced by editing in the Spike protein, 1,068 were found to be nonsynonymous, 322 synonymous and 91 stop-coding. Figure 3C depicts the frequency of these sites on different functional protein domains, including the critical receptor-binding domain (RBD). The most frequently altered variants were D614, Y91, I105, and D428. The D614G mutation (one of the A2a clade defining variants) is near the S1/S2 cleavage site and has already been reported in the literature to be more geographically spreading than the D614 type ( 26 ) .
We looked at other Spike mutations in more detail with respect to their functional consequences. Most striking, some amino acid sites within Spike seem to evolve into more than one type of variant. In Figure 3D, we depict these hyper-variable sites, with residue A879, N1108, S1239, D1260 evolving into either of >4 amino acid substitutions. The conservation of each residue i.e., how variable that site is amongst closely related coronavirus species was also calculated. For instance, the HR1 region and residues near the transmembrane are observed to be more conserved while NTD and RBD domains have highly variable regions, in that, especially the RBM motif (residue 437-508) seems to be less conserved (Fig. 3E). Interestingly, RBM was also found to be variable in our study where specific amino acids at 442, 465, 468 conferred variations in~11 out of 25 hyper-edited samples. The highly conserved amino acids represent a similar function of the protein across species but the substitution of critical amino acids around functional sites like RBM motif might lead to evolved or newer biological activities.
In a recent study, accelerated evolution of the virus over time leading to repeated resurgence in an immunocompromised infected patient has been reported ( 27 ) . Genomic analysis revealed that the individual had not been infected multiple times. Instead, the virus had lingered and rapidly mutated in the body leading to non-synonymous changes predominantly in the Spike region. The Spike protein and RBD harboured 57% and 38% of the total changes despite occupying only 13% and 2% of the virus genome respectively. The changes also included Q493K and F486I that corroborated with an earlier study that had recognised additional mutations (N74K, F79I, T259K, K417E, K444Q, V445A, N450D, Y453F, L455F, E484K, G485D, F490L, F490S, H655Y, R682Q, R685S, V687G, G769E, Q779K and V1128A) to be implicated in antibody resistance ( 28 ) . Besides, N439K mutation in Spike has also been shown to be involved in immune escape from a panel of neutralizing monoclonal antibodies in another study ( 29 ) . 20 out of these 23 positions overlap with sites that are potential sites of editing in our samples (Supplementary Table 3).

DISCUSSIONS
The outcome of SARS-CoV-2 infection in individuals appears to be panning out differently across diverse p opulations with respect to prevalence as well as mortality in the number of deaths per million ( 30 ). From the initial strain reported in Wuhan, extensive variability in clade distribution has been observed across different parts of the world ( 31 ). This has been attributed to the containment of strains in lockdown conditions followed by the evolution of viral genomes during local transmissions. Many uncertain aspects of variable outcome of the same virus in different hosts, reinfection and long sequelae of COVID-19 is now getting highlighted, indicating an extensive cross-talk between the virus and host genomes.
In this study, we explored the variability in the SARS-CoV-2 genomes across populations as a surrogate for ongoing editing activity of the hosts. The APOBEC and ADAR family of enzymes are extensively reported to edit different families of viral genomes ( 9 , 10 , 14 -16 ) including the coronaviruses ( 17 ). Previous genome analyses of many human-associated RNA viruses have revealed footprints of past editing events with a propensity for fixation of alleles that allow them to evade the host editing machinery ( 32 ). SARS-CoV-2 being zoonotic and of recent origin, seems to harbour a large number of sites that could be potentially targeted by the host editing machinery. The signatures of these enzymatic activities seem to be reflected in the evolving SARS-CoV-2 genomes (Fig. 1A) which corroborates with an earlier report ( 13 ) . Recently, ADAR1 mediated editing in SARS-CoV-2 genomes has also been shown in transcriptome data of infected human cell-lines, Vero cells and clinical samples ( 33 ). Though we observed ADAR hyperactivity in some individuals, APOBEC-mediated C-to-T changes were more consistently observed in our samples (Fig. 1A) which may explain the overrepresentation of C-to-T SNVs in the SARS-CoV-2 populations ( 34 , 35 ). Besides the APOBEC and ADAR1 editing changes, we also observed substantial abundance of G-to-T and C-to-A substitutions in some samples. These might be due to Reactive Oxygen Species (ROS), as hypothesized in a recent study ( 36 ). ROS activity oxidizes guanine to 7,8-dihydro-8-oxo-2'-deoxyguanosine (oxoguanine) that base pairs with adenine, leading to G-to-T transversions. This change on the negative strand would be reflected as a C-to-A transversion in the reference genome. ROS also has its role implicated in mutagenesis of many other virus families ( 37 ).
The difference in the prevalence of iSNVs that could arise out of editing events across populations (Fig. 1C) highlights the active role of the host. The editing enzymes have been shown to display genetic variability in populations, perhaps a consequence of selection based on pathogenic loads ( 38 , 39 ) . Transmission of viruses through different host populations could thus shape the mutational landscape and govern the evolution of the viral genomes. Our group has previously shown a prominent insertion-deletion polymorphism in APOBEC3b which is associated with susceptibility to Plasmodium falciparum ( 40 ). The insertion allele that retains the APOBEC3b is nearly fixed in malaria-endemic regions in India and is associated with protection from severe malaria. The prevalence of COVID-19 related mortality is nearly negligible in Indian populations that are malaria endemic and have a high frequency of APOBEC3b ( 41 ). Though the correlation of APOBEC insertion with protection still needs to be tested, this nevertheless suggests the role of such family of enzymes in evolutionary outcomes of SARS-CoV-2 infection and the burden of disease.
Editing can alter the fitness of the strains by influencing its virulence, infectivity or transmissibility properties which may be both beneficial or detrimental to the virus and lead to fixation or removal of alleles from the populations. Though the extent of variability differs amongst individuals and populations, there seem to be preferred sites that are shared across populations (Fig. 1D, Supplementary Fig. 2). In most parts of the world, the A2a clade seems to have become the most prevalent with near disappearance of other clades which were once the predominant strain in certain local regions. For example, the A3i/A4 clade which was reported to be primarily endemic to India in April 2020 ( 23 , 24 ) had disappeared from the entire population as of October 2020 ( Fig. 2A). One of the A2a clade defining variants, D614G ( A23403G ), has been shown to increase the infectivity of SARS-CoV-2 ( 26 ) . Given that all A2a clade defining positions (241, 3037, 14408 and 23403) are heteroplasmic sites with C-to-T and A-to-G changes (Fig. 1D) Ongoing selection in the host might have contributed to the fixation of some of the sites as variants in SARS-CoV-2 genomes with potential consequences on the functioning of some of the viral proteins. Structural analysis of iSNVs that occur in Spike shows that more than 40% of these variants result in a substantial change in the property of these amino acids (Supplementary Table 5). Changes in specific amino acids, especially mutations close to functional sites, may lead to drastic changes in protein structure as they can either introduce significant change in the sidechain or charge of the residue. Also, it has been shown that APOBEC influenced C-to-T substitutions elevate the frequency of hydrophobic amino acid coding codons in SARS-CoV-2 peptides ( 48 ) . We also mapped 192 SNVs onto the protein structure by utilizing a full-length Spike protein model of closed conformation (Fig. 3F). Some variants are located on the surface of the receptor-binding domain (RBD) which is responsible to bind to the ACE2 receptor in the host cell. These could directly impact the binding of Spike protein to the receptor and might have a putative effect on the binding interface residues shown in surface representation (Fig. 3G). Overall, our findings indicate a comprehensive survey for one of the SARS-CoV-2 proteins and we propose that some of these distinct patterns of sites in hyper-edited samples may directly interfere with specific functional output.
Since SARS-CoV-2 has been recently transmitted to humans, differential editing is likely to confound interpretations of phenotypic outcomes of infection. As mentioned, the response of the host to SARS-CoV-2 infection has been extremely variable with different sequelae during and after infection ( 49 , 50 ). Few cases of reinfection are being reported where the reinfected viruses have many alterations and present with different symptoms ( 51 , 52 ). A relook at the sequence changes that have been reported in some reinfection studies shows considerable overlap with potential editing sites (Supplementary Table 6). Since many of these samples were studied using nanopore sequencing technique, there is a possibility that the heteroplasmy during the first event was not captured. Interestingly, in a recent communication (unpublished) where the subject was resampled after two months, few heteroplasmic sites in the first infection were observed to be fixed variants in the reinfection state as shown in Supplementary Table 6 (SRA Project ID PRJNA674796). Although, there were also some sites in the reinfection sample where this was not observed. Our meta-analysis of limited data suggests that a fraction of the reported reinfection cases could be a consequence of persistent editing of the viral genome within hosts. Since we observe that editing can lead to a change in the amino acid sequence, it can have implications on the structure and function of the protein. Noteworthy, we observed heteroplasmy in ~87% of the sites in the Spike protein that have been recently reported to confer antibody resistance. These mutations can have major implications in vaccine response as they could alter the immunogenicity of the antigenic RBD peptide leading to differential antibody titers in infected individuals ( 53 ). However, there are limitations to this analysis as the reports are few, the platforms were different, and these were not followup studies at regular intervals. These observations substantiate that editing within hosts may possibly lead to an evolved immune escape ability in some strains which may seem to be a case of reinfection in a host after weeks or months of the first incidence.
In conclusion, temporally tracking within-host variability of the virus in individuals and populations might provide important leads to the sites that are favourable or deleterious for virus survival. This information would be of enormous utility for diagnostics, design of vaccines as well as predicting the spread and infectivity of viral strains in the population. Conjoint analysis with the host variability in editing machinery should be the next step.

Datasets
We accessed 1,347 Illumina NGS samples belonging to various regions to recognise differential signatures of RNA editing in SARS-CoV-2 infected populations. Only those samples that were paired-end sequenced were used for an increased degree of accuracy in recognising novel "iSNVs" that could arise out of RNA editing events. All publicly available samples submitted in the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) till 23rd May 2020 from China, USA, Germany, Malaysia and United Kingdom were downloaded. Samples that had been sequenced in Bhubaneswar, Delhi and Hyderabad were also included that gave us a pan-India representation of SARS-CoV-2 infected East, North and South Indian populations ( Table 1)

Data Processing and Annotations
The sequence files retrieved from NCBI SRA ( 56 ) were downloaded in the SRA compressed file format and were converted to FASTQ format files using NCBI's SRA-Tools ( 57 ). In order to trim the adaptor sequences and filter low-quality reads from downstream analysis, we used Trimmomatic ( 58 ). The trimmed reads were then aligned against the human reference genome (GRCh37/hg19) using HISAT2 ( 59 ). All reads which couldn't be aligned in pairs to the human reference genome were filtered out from the human aligned files using SAMtools ( 60 )  We used REDItools2 ( 68 ) to call iSNVs in successfully aligned SARS-CoV-2 reads with a quality score ≥30.
Additional position-level parameters in REDItools2 were used to discard some positions on the basis of site quality. The sites whose reads did not achieve a mean quality score of ≥33 were rejected and sites that were located in long homopolymeric regions of length ≥4bp ( Supplementary Table 7 ) were excluded due to known sequencing errors in these regions (66). From the iSNVs called in the REDItools2 output file, potential editing events were filtered out using the following thresholds: number of minor allele reads ≥5, base coverage ≥20 and minor allele frequency ≥0.005 and ≤0. 80. In order to test the pipeline's efficiency in identifying potential editing events in SARS-CoV-2 genome, we additionally downloaded and processed 500 single-end NGS samples sequenced in replicates on an Illumina platform from NCBI SRA with the Project ID PRJNA655577.
We observed a high concordance between the replicates (Pearson's r=0.989, Supplementary Fig. 3) in sites qualifying the pipeline thresholds. Thus, we used these cutoffs as a qualifying criterion for defining a heteroplasmic site as a potential editing site in our subsequent analyses.

Identification of Samples with potential hyper-editing events
Z-score values based on the distribution of number of iSNVs per sample were calculated for each population using the Python package SciPy ( 69 ). Samples that showed a Z-score value > 3 were classified as hyper-edited in each cohort (Fig. 3A).
In order to categorize the specific amino acid change and the proteins containing the iSNVs , they were annotated using SnpEff version 4.5 ( 70 ). Conservation analysis of the full-length sequences of proteins was done on the basis of the six other coronaviruses (HCoV-229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV).
The multiple sequence alignment of seven protein sequences was performed by clustal-omega ( 71 ).
Conservation calculation of Spike amino acids was done using the ConSurf tool ( 72 ). The high-risk hyper-edited sites resulting in a single amino acid change in Spike proteins were screened and prioritized on the basis of sequence conservation and frequency i.e. number of times it is present in the total 25 hyper-edited samples. ConSurf conservation scores above 6 were considered and a total of 192 variants were filtered for mapping onto the structure to understand the effect on protein function and structure. We utilized PyMol ( 73 ) to determine interface residues from the available crystal structures. PDB 6M0J ( 74 ) and 6M17 ( 75 ) for ACE2 RBD binding and PDB 6W41 ( 76 ) and 7BZ5 ( 77 ) for antibody binding residues were used from Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) ( 78 ).

Data Analysis and Visualizations
Custom python scripts were used to automate the process of downloading and retrieving samples from NCBI SRA as well as to process samples through the pipeline. Python libraries such as NumPy ( 79 ), Pandas ( 80 ), Matplotlib ( 81 ) and Seaborn ( 82 ) were used for data handling and visualizations. SciPy ( 69 ) was used to compute pairwise two-sided t-tests and to calculate distribution statistics and Pearson's correlation coefficients.

Code Availability
The commands used in the pipeline have been detailed order-wise in Supplementary File 1. The automated codes, detailed individual sample information and statistics are available at the following GitHub repositoryhttps://github.com/pxthxk/iCoV19 .

ACKNOWLEDGEMENTS
In memory of all who have lost their lives to COVID-19. We gratefully acknowledge the authors from the originating laboratories of the submitted SARS-CoV-2 samples in NCBI SRA and also the CSIR-IGIB Delhi, ILS Bhubaneswar and CSIR-CCMB COVID-19 teams for their efforts in aggregating, processing and sharing sample transcriptomic data of COVID-19 infected individuals. We also acknowledge the authors from the originating laboratories of the submitted SARS-CoV-2 genome data in GISAID, the list of whom has been shared in Supplementary Table 8 . We would also like to thank Khusboo Singhal for her inputs and Pragyan Acharya and Abhay Sharma for their critical reviews of the manuscript.