Insertions in SARS-CoV-2 genome caused by template switch and duplications give rise to new variants that merit monitoring

The appearance of multiple new SARS-CoV-2 variants during the winter of 2020–2021 is a matter of grave concern. Some of these new variants, such as B.1.617.2, B.1.1.7, and B.1.351, manifest higher infectivity and virulence than the earlier SARS-CoV-2 variants, with potential dramatic effects on the course of the COVID-19 pandemic. So far, analysis of new SARS-CoV-2 variants focused primarily on point nucleotide substitutions and short deletions that are readily identifiable by comparison to consensus genome sequences. In contrast, insertions have largely escaped the attention of researchers although the furin site insert in the spike protein is thought to be a determinant of SARS-CoV-2 virulence and other inserts might have contributed to coronavirus pathogenicity as well. Here, we investigate insertions in SARS-CoV-2 genomes and identify 347 unique inserts of different lengths. We present evidence that these inserts reflect actual virus variance rather than sequencing errors. Two principal mechanisms appear to account for the inserts in the SARS-CoV-2 genomes, polymerase slippage and template switch that might be associated with the synthesis of subgenomic RNAs. We show that inserts in the Spike glycoprotein can affect its antigenic properties and thus merit monitoring. At least, three inserts in the N-terminal domain of the Spike (ins245IME, ins246DSWG, and ins248SSLT) that were first detected in 2021 are predicted to lead to escape from neutralizing antibodies, whereas other inserts might result in escape from T-cell immunity.

glycoprotein can affect its antigenic properties and thus merit monitoring. At least, three inserts 23 in the N-terminal domain of the Spike (ins245IME, ins246DSWG, and ins248SSLT) that were 24 first detected in 2021 are predicted to lead to escape from neutralizing antibodies, whereas other 25 inserts might result in escape from T-cell immunity. 26 27 105 and is also made available for use under a CC0 license.
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC The copyright holder for this preprint this version posted  https://doi.org/10.1101/2021.04.23.441209 doi: bioRxiv preprint Main text 28 The first SARS-CoV-2 genome was sequenced in January 2020. Since then, more than a milion 29 virus genomes have been collected and sequenced. Comparative analysis of SARS-CoV- 2 30 variants has provided for the identification of the routes of virus transmission [1][2][3][4] , the selective 31 pressure on different genes 5 , and the discovery of new variants associated with higher infectivity 32 [6][7][8] . In many cases, genome analysis only included search for point mutations, but some deletions 33 also have been identified, such as del69-70, one of the characteristic mutations of B.1.1.7 and 34 Cluster 5 2,3 or del157-158 in B.1.617.2 (delta) 9 . Moreover, recently, recurrent deletions have 35 been shown to drive antibody escape 10 . However, insertions are mostly ignored, both during 36 variant calling step and in the downstream analysis. 37 Although insufficiently studied, insertions appear to be crucial for beta-coronavirus evolution. 38 Three insertions in the spike (S) glycoprotein and in the nucleoprotein (N), that occurred early in 39 sarbecovirus evolution, have been shown to differentiate highly pathogenic beta-coronaviruses 40 (SARS-CoV-1, SARS-CoV-2 and MERS) from mildly pathogenic and non-pathogenic strains, 41 and suggested to be key determinants of human coronaviruses pathogenicity 11 . The best 42 characterized insert in SARS-CoV-2 is the PRRA tetrapeptide that so far is unique to SARS-43 CoV-2 and introduces a polybasic furin cleavage site into the S protein, enhancing its binding to 44 the receptor 12-14 . 45 Inserts in the SARS-CoV-2 genome are categorized in the CoV-GLUE database 15 , and the 46 preliminary results on systematic characterization of the structural variance and inserts in 47 particular have been reported 16 . Forty structural variants including three inserts, three 48 nucleotides long each, were discovered and shown to occur in specific regions of the SARS-49 CoV-2 genome. These variants have been further demonstrated to be enriched near the 5' and 3 ' 50 breakpoints of the non-canonical (nc) subgenomic (sg) RNAs of coronaviruses. Additionally, 51 indels have been shown to occur in arms of the folded SARS-CoV-2 genomic RNA 16 . However, 52 longer inserts that might have been introduced into the virus genome during SARS-CoV-2 53 evolution, to our knowledge, have not been systematically analyzed. 54 55 105 and is also made available for use under a CC0 license.
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC uracil, at about 45%, whereas the composition of the long inserts (9 nt or longer) was similar to 110 that of the SARS-CoV-2 genome average, with about 25% U (Figure 1b). This trend was even 111 more pronounced for inserts verified by raw reads (Supplementary Figure 1). Thus, we split the 112 collection of inserts into two categories, the short and long inserts, which we analyzed 113 separately. 114 We then checked whether inserts that were present in multiple genome sequences were 115 located close to each other in the global phylogenetic tree (see Materials and Methods). We did 116 not require strict monophyly because inserts are not always included in the genome sequences 117 (see Supplementary note 1). Of the 52 short inserts identified in multiple genomes, 18 met this 118 relaxed monophyly criterion (Supplementary Table 2). In 5 cases, identical insertions were 119 observed in genomes submitted from the same laboratory, and mostly, on the same date, which 120 implies that the genomes were sequenced and analyzed together, making it difficult to rule out a 121 sequencing error. Interestingly, 7 of the 8 cases that were validated by raw reads were not 122 monophyletic. By contrast, among the 15 long inserts that were found in multiple genomes, 14 123 met the monophyly criteria including all 5 validated by raw reads (Supplementary Table 2). 124 In summary, the inserts detected in SARS-CoV-2 genomes fell into the following categories: 230 125 (66.5%) short inserts, among which 15 (4.3% of the total) were validated by raw reads; and 116 126 long (at least, 9 nt) (33.5%) inserts. We additionally classified the long inserts into four groups, 127 in the order of increasing confidence: 87 (25%) singletons, 1 non-monophyletic insert observed 128 in two genomes, 9 (2.6%) monophyletic inserts observed in multiple genomes, all with no 129 available raw reads and 20 (5.8%) inserts (15 singletons and 5 monophyletic ones), for which the 130 insertions were validated by the raw reads. The 15 (4.3%) short inserts confirmed by read data 131 and 29 (8.4%) long inserts that were detected in multiple genomes (monophyletic and not) and/or 132 confirmed by raw reads comprised the set of the most reliable insertion events that were 133 observable across the evolution of SARS-CoV-2 during the pandemic (Supplementary Table 4). 134 Among these highly reliable inserts, there was only one frameshifting insert within orf6 gene. 135 136 105 and is also made available for use under a CC0 license.
(which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC Insertions are non-uniformly distributed along the SARS-CoV-2 genome 137 We found that the insertions were not randomly distributed along the genome, with most 138 occurring in the 3'-terminal third of the genome (Figure 1c). Two, not necessarily mutually 139 exclusive main hypotheses have been proposed on the origin of the short inserts (structural 140 variants) in coronavirus, namely, that they are associated with loops in the virus RNA structure 141 or occur near the hotspots of template switch, at the breakpoints of TRS-independent transcripts 142 16 . To differentiate between these two mechanisms, we compared the distribution of 347 inserts 143 along the SARS-CoV-2 genome with the distributions of structured regions 28 and of template 144 switch hotspots 26 . We detected a strong association of the insertions with the template switch 145 hotspots (Pearson correlation r = 0.42, p-value = 1.8x10 -14 ) (Figure 1d). Almost 25% of the 146 inserts occurred within 5 nt of a template switch hotspot, compared to less than 10% expected by 147 chance, and the distribution of lengths observed in the real data is significantly different from 148 random expectation (Wilcoxon test p-value < 2.2x10 -16 ; Figure 1e). Also, we observed that 149 inserts were significantly less frequent in predicted RNA stems than expected by chance 150 (permutation test p-value = 0.004; Figure 1f). Both these observations held, with statistical 151 significance, when we analyzed only the 45 highly confident insert (Supplementary Figure 4). 152 Thus, inserts in SARS-CoV-2 genomes are associated with template switch hotspots, and also 153 tend to occur in RNA loops. 154 155 Short insertions in SARS-CoV-2 are generated by template sliding 156 The notable difference in nucleotide composition and different phyletic patterns of short and 157 long inserts imply that the two types of insertions occur via different mechanisms. As pointed out 158 above, the short insertions are rarely monophyletic, indicating that short U-rich sequences were 159 inserted in the same position in the SARS-CoV-2 genome on multiple, independent occasions 160 during virus evolution in the course of the pandemic. Also, 43 of the 230 short inserts occurred 161 in runs of U or A, whereas 63 more probably result from local duplications (Supplementary table  162 2). These observations suggest that short insertions occur via template sliding (polymerase 163 stuttering) on runs of As or Us in the template (negative strand or positive strand, respectively) 164 phenomenon occurring during SARS-CoV-2 evolution, in case the errors are produced by 166 stuttering of the coronavirus RdRP, or an artifact if the errors come from the reverse transcriptase 167 or DNA polymerase that is used for RNA sequencing, or are a mix of biological and 168 experimental polymerase errors. However, for all 17 short inserts that were confirmed by raw 169 read analysis, we also detected the U enrichment (Supplementary Figure 1). Those inserts were 170 observed at high allele frequencies in the data (Supplementary Table 1 Table 1). Furthermore, elevated rate of thymine insertion has not been reported 175 as a common error of either Illumina or Oxford Nanopore technology [32][33][34][35] . In contrast, 176 production of longer transcripts and slow processing on polyU tracts has been demonstrated for 177 nsp12 (RdRP) of SARS-CoV-1 36 Table 4). Although we could not find a probable source of many 210 inserts, which is a limitation of our analysis, the principal reason of this failure is that most 211 inserts are only 9 nt long, so that there is not enough statistical power for detecting likely origin 212 sequences with substitutions. 213 We hypothesized that the long inserts, at least those, for which potential origin sequences were 214 detected, originate from template switching that occurs during the synthesis of the nc sgRNAs, 215 when the RdRP jumps from one genome location to another. To assess the possibility that RdRP 216 jumping contributes to insertions, we compared the insert locations and the sites of the likely 217 origin of the inserts with the experimental data on the SARS-CoV-2 transcriptome 26 . The jumps 218 produce junction reads that connect two distant locations in the genome. Regions that are 219 connected by multiple reads are hotspots of template switching ( Figure 2). If the insertion 220 sequence and the origin sequence are located close to these junction sites, it is likely that they 221 were introduced by the RdRP during sgRNA synthesis. As pointed out above, inserts show a 222 non-random proximity to template switch hotspots, so for the inserts with a traceable origin, we 223 additionally checked whether their sites of origin occurred close to the sites of RdRp jumping. 224 Although the information on the SARS-CoV-2 transcriptome is limited, among the 7 inserts with 225 predicted origins, we found that three insert sites were located within one end of the junction, 226 whereas their corresponding sites of origin were within 100 nucleotides of the other side of the 227 same junction (Figure 2a). To assess the significance of this finding, we performed two 228 permutation tests (see Material and Methods), in one of which the real insertion positions were 229 matched with start sites chosen randomly, whereas in the second one, both types of sites were 230 selected at random. Both tests showed significant co-localization of the inserts with template 231 switch junctions (Figure 2 b,c). 232 Thus, high-confidence long inserts in the SARS-CoV-2 genome probably originated either by 233 local duplication or by template switching which, at least in some cases, seemed to be associated 234 with nc sgRNA synthesis. Notably, the PRRA insert, the furin cleavage site that is one of were strongly supported by raw reads. Four more inserts (ins98KAE, ins214KLGP, ins245IME, 251 and ins246DSWG) were found in single genomes, but again, were strongly supported by raw 252 reads, where they reached allele frequency close to unity, so these are highly unlikely to be 253 artifacts (Supplementary Table 1 p-value is provided. The data on SARS-CoV-2 structure was obtained from 28 . 289 105 and is also made available for use under a CC0 license. (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC   (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC

311
Although structural variation is an important driver of betacoronaviruses evolution, in the 312 genome analysis during the COVID-19 pandemics, part of the structural variations, namely, long 313 insertions, to our knowledge, have not been systematically analyzed. This is a potentially 314 consequential omission, given that insertions in the S and N proteins might contribute to the 315 betacoronavirus pathogenicity. In particular, the furin cleavage site inserted into the S protein 316 seems to be crucial for SARS-CoV-2 pathogenicity 45,46 , and insertions that seem to differentiate 317 high-pathogenicity coronaviruses from low pathogenicity ones have been detected in both S and 318 N 11 . Furthermore, betacoronaviruses are known to produce transcripts of greater than genome 319 length 24 , suggesting that insertions occur frequently during the reproduction of these viruses. 320 Here we attempted a comprehensive identification and analysis of insertions in the SARS-CoV-2 321 protein-coding sequences that originated during the current pandemic. 322 We found that short and longer insertions substantially differed with respect to their nucleotide 323 compositions and mapping to the phylogenetic tree, suggesting that different mechanisms could 324 be at play. The short inserts were strongly enriched in U and in most cases occurred 325 independently in the tree. It appears likely that these inserts resulted from RdRP slippage on 326 short runs of A or U. Indeed, the observed excess of U in these inserts resembles the error profile 327 of SARS-CoV-1 RdRP 36 . In contrast, the composition of the long inserts was close to that of the 328 virus genome, and many of these insertions were found to be monophyletic, that is, these appear 329 to be rare events that did not occur at nucleotide runs. Sequence analysis of the SARS-CoV-2 330 genomes indicates that these insertions occur either through polymerase slippage resulting in 331 tandem duplication or more commonly, seem to have been triggered by illegitimate template 332 switching associated with the formation of nc sgRNAs. In support to our hypothesis template 333 switching in different RNA viruses has been demonstrated previously in a variety of 334 experimental settings 47,48 . Furthermore, template switching has been observed in coronaviruses 335 49 . 336 For approximately one third of the long insertions, we were unable to pinpoint the source of the 337 inserted sequence. This could be explained simply by the mutational deterioration of the 338 similarity between the source and insert sequences, especially, for 9 nt inserts. However, a third 339 mechanism of insertion cannot be ruled out. The PRRA insert that comprises the furin cleavage 340 site in the S protein resembles the long inserts and likely originated by similar mechanisms 341 although its origin by template switching and recombination with another sarbecovirus remains a 342 possibility 50 . 343 Long inserts are markedly overrepresented in the S glycoprotein, particularly, in the NTD. 344 Examination of the locations of these inserts on S protein structure strongly suggests that at least all inserts that were the result of misalignment were discarded. All the sequences that had more 365 than two insertions were discarded, in order to avoid genomes with multiple sequencing errors. 366 Also all inserts with ambiguous symbols were discarded. Information on the laboratory of 367 origin, sequencing platform and consensus assembly methods (where available) was extracted 368 from GISAID metadata. Additionally, because GISAID removes from alignment all inserts > 12 369 nucleotides that appeared only once, we downloaded the metadata for all genomes with inserts, 370 that were submitted before 2021-06-24, and selected those events that were at least 12 371 nucleotides long. We retained only those that had corresponding data in SRA, and were 372 confirmed by raw read data analysis (see below). 373 To identify all genomes that contained the particular insert, we downloaded all GISAID 374 sequences available by 2021-06-23 and used cdhit-est-2d to find identical sequences with 375 parameters: -s 0.99 -c 1.0 -n 11. 376 Insertion validation from raw read data 377 Raw reads were downloaded from SRA database (https://www.ncbi.nlm.nih.gov/sra) with SRA 378 Toolkit (Supplementary Table 1 CoV (MT040335.1) was added to this dataset. Each insertion sequence was compared to all 392 subsequences from a target sequence. All sequences with either the perfect match or with 393 mismatches was retrieved (putative insertion source, PIS). If a PIS was located immediately 394 upstream or downstream of an insertion sequence, it was annotated as duplication. If the PIS was 395 located in any other positions, the template switch model was accepted as the best explanation of 396 the observed insertion sequence. This procedure was implemented using a sliding window, of a 397 length equal to the length of the analyzed insert. If no perfect match was detected, the window 398 with the minimal number of mismatches was retrieved and considered a putative insertion 399 source. 400 To assess the significance of putative duplications and template switch events, we designed a 401 sampling procedure to test a hypothesis that an insertion is not the result of spurious matches 402 between an insertion sequence and corresponding PIS. Each insertion sequence was shuffled and 403 scanned against datasets using the sliding window described above. This procedure was 404 implemented as a set of C++ and Fortran programs (see Data Availability). Manual inspection of 405 results was performed using the FASTA3 program 53 . The number of mismatches between an 406 insertion sequence (observed or shuffled) and PIS was taken as weight W. The distribution of 407 weights Wshuffled was calculated for 1,000 shuffled insertion sequences This distribution was used 408 to calculate probability P(Wobserved ≥ Wshuffled). This probability is equal to the number of shuffled 409 insertion sequences with Wshuffled equal to or smaller than Wobserved. Small probability values 410 (P(Wobserved ≥ Wshuffled) ≤ 0.05) indicate statistical support for the hypothesis that the analyzed 411 insertion sequence results from a duplication or a template switch. 412 Short inserts (< 9 nucleotides) were marked as a duplication if the insert was identical either to 413 the left or the right adjoining sequence. Additionally, we separated single-nucleotide inserts 414 occurring within homopolymer runs from the rest of the inserts. 415 Analysis of transcriptome data and genomic RNA structure 416 To compare insert locations with RNA secondary structure, we utilized the data from Huston et 417 al., 2021 uploaded to github: https://github.com/pylelab/SARS-CoV-2_SHAPE_MaP_structure. 418 For our analysis we used the data from full-length secondary structure map (.ct file). We 419 considered all paired bases to be in stems, whereas those that are not paired were considered to 420 be located in the loops. Thus, an insert was assigned to the stem if it was flanked on both sides 421 by residues known to be paired. 422 The data on the SARS-CoV-2 transcriptome was extracted from Ref. 26. Pearson correlation 423 coefficient between insertion locations and template switch hotspots was calculated for bins of 424 size 100 nucleotides with cor.test() function in R version 3.6.3. 425 To calculate the random distributions for the analyses of distances to the closest junction and 426 appearance of insertions in stems, we performed 1000 permutations, where each time we 427 selected randomly from genome the same number of genome positions as in inserts dataset (347 428 when we analyzed all inserts, and 45 when we analyzed highly confident inserts). To compare 429 distribution of distances for real data and random data, the Wilcoxon sum rank test was 430 performed. In the case of inserts in stems, the p-value is the portion of cases in our simulation 431 that had the same or smaller number of junctions as the real data. 432 To analyze whether long insertions coincide with template switch hotspots, we utilized the data 433 on 5' and 3' ends of junctions from 26 . All junction reads within 100 nucleotides from insertion 434 site and insertion source positions were selected. Although the core elements that are involved in 435 template switching are 6-7 nt long, the neighboring regions appear to contribute as well 23,54 . 436 Furthermore, the peak areas representing the hotspots in the original data are wide, and in the 437 initial publication, 100 nt windows were used for the analysis (Fig. 1e) 26 . To verify the 438 significance of these findings we performed two simulations. In first scenario the positions of 439 inserts were fixed to the real positions from the data, but the locations of source sequences were 440 randomly sampled 1000 times from the genome, in second scenario both source and insertion 441 site positions were randomly sampled 1000 times. The p-value is the portion of cases in our 442 simulation that have the same or larger number of junctions as the real data. 443

444
To find the location of the selected SARS-CoV-2 genomes on the phylogenetic tree we utilized 445 UShER 55 . The phylogenetic tree of 2,501,152 genomes from GISAID, Genbank, COG-UK and 446 CNCB (2021-07-16) available at UCSC was used as the starting tree, and only leaves 447 representing records from GISAID (1347414 leaves) retained. 448