The landscape of SARS-CoV-2 RNA modifications

In 2019 the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the first documented cases of severe lung disease COVID-19. Since then, SARS-CoV-2 has been spreading around the globe resulting in a severe pandemic with over 500.000 fatalities and large economical and social disruptions in human societies. Gaining knowledge on how SARS-Cov-2 interacts with its host cells and causes COVID-19 is crucial for the intervention of novel therapeutic strategies. SARS-CoV-2, like other coronaviruses, is a positive-strand RNA virus. The viral RNA is modified by RNA-modifying enzymes provided by the host cell. Direct RNA sequencing (DRS) using nanopores enables unbiased sensing of canonical and modified RNA bases of the viral transcripts. In this work, we used DRS to precisely annotate the open reading frames and the landscape of SARS-CoV-2 RNA modifications. We provide the first DRS data of SARS-CoV-2 in infected human lung epithelial cells. From sequencing three isolates, we derive a robust identification of SARS-CoV-2 modification sites within a physiologically relevant host cell type. A comparison of our data with the DRS data from a previous SARS-CoV-2 isolate, both raised in monkey renal cells, reveals consistent RNA modifications across the viral genome. Conservation of the RNA modification pattern during progression of the current pandemic suggests that this pattern is likely essential for the life cycle of SARS-CoV-2 and represents a possible target for drug interventions.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an RNA virus that causes coronavirus disease 2019 . The ongoing COVID-19 pandemic has put an enormous burden on human society in 2020 and is expected to have even longer-lasting impacts. Despite tremendous ongoing research efforts, we still do not have sufficient antiviral treatment solutions or a vaccine. Over the last two decades, the closely related zoonotic betacoronaviruses SARS-CoV and MERS have caused recurring outbreaks in the human population. The ability of coronaviruses (CoV) for cross-species transmission, their known reservoirs in multiple species, and their high replication rates keep CoVs a threat for the human population even beyond the 2020 pandemic. Understanding the molecular mechanisms behind the replication of SARS-CoV-2 is urgently needed.
SARS-CoV-2 carries an enveloped positive-sense single-stranded RNA genome (~30kb) encoding a dense collection of structural and non-structural proteins (nsp), and accessory proteins. Like other members of the order Nidovirales, the genome encodes two polyproteins followed by a series of ORFs that are transcribed into sub-genomic RNAs (sgRNAs). Each transcribed sgRNA is thought to be translated into one protein, and its 3' untranslated region overlaps with the coding sequence of the shorter downstream sgRNAs (1,2). Upon cell entry, ORF1a and ORF1b can be translated directly from the viral genome. A −1 ribosomal frameshifting upstream of the ORF1a stop codon allows the translation of ORF1b (3). The resulting polyproteins, pp1a and pp1b, are further cleaved by viral proteases and yield 11 and 15 nsps, respectively. The RNA-dependent RNA polymerase (RdRP) nsp12 performs the genome replication and the transcription of sgRNAs through negative-sense RNA template intermediates. To transcribe the sgRNAs, the negative RNA intermediates undergo discontinuous transcription, in which the RdRP skips the genome region between transcription-regulatory sequences (TRS) located at the 5' end of the ORFs (TRS-B sites) and a corresponding TRS-Leader site at the 5' end of the viral genome (for a review please see Sola et al. (1)). As a consequence, viral sgRNAs share a common 5' leader sequence derived from the 5' end of the genome up to the TRS-L site. Like host mRNAs, the viral genomic RNA and the sgRNAs have a methylated 5' cap and a polyadenylated 3' tail.
Still, the transcriptomic aspect of CoVs, including SARS-CoV-2, is not fully understood. Transcript-level regulation of gene expression is widely used by the native cellular mechanisms of the host. Viruses have adopted and hijacked these mechanisms throughout their evolution (4

Sequencing read statistics
The direct RNA sequencing experiments yielded a total of 2.3, 1.2 and 1.3 million sequencing reads for the three samples. We mapped the sequences of each dataset to the combination of the human host genome, the yeast enolase gene used as the ONT DRS spike, and the SARS-CoV-2 NCBI reference genome. Notably, between 62-70% of the mapped reads were mapped to the virus genome (Fig. 1a), which is very much comparable to the fraction of viral reads obtained by Kim et al. (12) using the Vero host cell line (Fig. S1a). In contrast to Calu-3 cells which were used for this study, Vero cells are interferon-deficient. Thus, our observation seems to indicate that the interferon deficiency of Vero cells does not benefit the viral life cycle to an experimentally relevant extent. This is in line with a recent study analyzing the host transcription response to SARS-CoV-2 infection (15).

SARS-CoV-2 TRS-B sites and subgenomic RNAs
The long RNA sequencing reads generated for this study cover the entire SARS-CoV-2 genomic RNA as well as the different ORFs (Fig 1b,c, Fig. S1b). This allowed us to do an in-depth analysis of the genomic junctions, including the TRS-B sites described by Kim et al. (12). For comparability, we downloaded and

Detection of RNA modification
We mitigated the variability in the sgRNA expression levels and the ONT higher coverage bias at the 3'end of the transcripts by downsampling the collections of intact sgRNA reads. In this way, we get a quasi-uniform distribution of intact reads across all the samples and the sgRNAs except for ORF7b (Fig.   S1b). For comparison, we also applied the same data processing workflows on datasets from Kim et al.
We used the intact reads that were identified and down-sampled in the previous step for RNA modification detection by DRS using the two available in silico methods. For the identification of the modification sites, we used two different approaches for harnessing the sensed electrical signals from sequencing the native RNA molecules by nanopores. Typically, the electrical signal events aligned to positions, called squiggles, are compared between a condition with unknown putative modifications and a control condition. One strategy to detect the modified genomic positions is to compare the distribution of squiggles of two conditions, both encoding the transcripts of interest. Another strategy uses trained statistical models of the control condition to identify modification of the other condition by evaluating disagreement between the observed features and the model expectations.
Two sets of Galaxy workflows based on Tombo (16) and Nanocompore (17) tools were designed to compute the modification scores from the DRS data (Table S3). Both Tombo and Nanocompore support the distribution-based strategy while Tombo further can perform model-based modification detection.
Since Nanocompore supports biological replicates, we used it as the distribution-based strategy for calling modifications from the three replicates (Fig. 2a). We further used Tombo to train models and calculate modification scores for individual samples (Fig. 2b). To this end, we utilized the in vitro transcribed (IVT) data of SARS-CoV-2 from Kim et al. as the unmodified RNA control dataset for both Tombo and Nanocompore. The distribution of the signals derived from virus RNA and unmodified RNA is representatively depicted for Fr3 in Figure 3.

RNA modification sites of SARS-CoV-2 sgRNAs
We identified the positions modified in SARS-CoV-2 sgRNAs for all the sgRNAs and among all the datasets (Fig. 2a,b). We specifically focused on the modifications regions of the sub-genomic RNAs, i.e., the region downstream of the associated TRS-B sites. We excluded the genomic reads due to the moderately low number of intact reads. The modification results for 5'leader was also not considered due to the anomalies observed in the read coverage of the 5'leader site (Figure 1b). By comparing the model-based prediction for the presented datasets (Fr1-3), we identified a high level of correlation between the modification rates of sgRNA positions in the three replicates (Figure 2b). This prompted us to perform a correlation analysis as depicted in Figure 4 representatively for sgRNA S and N. Notably, this analysis revealed a high correlation not only for the modification sites but also for the fractions of modification between biological replicates. We therefore tested the correlation between our data and the previously published data (Kr) (Fig. 4). Remarkably, the top-ranked modification sites are consistent and the correlation of the fraction of RNA modification fractions is high (Fig. 4), too. This observation was confirmed by visual inspection of raw signals ( examples are shown in Fig. 5). We excluded the data of the Australian isolate from this analysis due to the relatively lower read coverage and different ratio of viral reads (Fig. S1a).
The large overlap of highly modified sites predicted by two independent algorithms supports the validity of our analysis and findings. However, for sites with a low modification ratio the predicted significance levels differ sometimes, indicating that additional biological replicates are needed to consistently reach a valid significance level.

Conclusions
RNA modifications are essential modulators of RNA stability and function. The recent invention of direct RNA sequencing protocols using nanopores enable unbiased detection of RNA modification. In general, the analysis of DRS raw signals is challenging and not well standardized and thus only possible for experienced bioinformaticians. To enable more researchers to use this technology, we present two highly standardized analysis pipelines for DRS sequencing data. These pipelines were integrated into the Galaxy platform (18) and are accessible at https://covid19.galaxyproject.org/direct-rnaseq together with workflows for mapping reads to the viral genome, for calling genomic variants, and for identifying and extracting sgRNA-derived reads. Using these pipelines we analyzed the DRS data sets generated for this study, serving as the first DRS data from Europe, and compared it with the data from previous studies.
Here we generated the SARS-CoV-2 DRS sequencing data sets for the first time for three biological replicates. In contrast, to previously published data, viruses were cultured in a disease-relevant human epithelial lung cell line. Remarkably, the infection resulted in more than 60% of poly-A enriched RNA reads from SARS-CoV-2. We provide experimental evidence for transcription of a total of 11 sgRNAs, two of which are not part of the public SARS-CoV-2 reference annotation.
The comparative analysis of our three replicates with published data demonstrates a high degree of similarity between isolates from different continents and at both early and recent stages of the epidemie.

Quantification and Statistical analysis Availability of analysis workflows and input data
The development of all analysis workflows used for the bioinformatic evaluation of the sequencing data was carried as part of the Covid-19 initiative of the Galaxy project (19). All Galaxy workflows and additional required inputs to them (beyond the sequencing data) are available from the Direct RNAseq subpage of the project at https://covid19.galaxyproject.org/direct-rnaseq .

Assignment of sequenced reads to viral transcripts
Mapping of the sequence reads to the corresponding genomes, extraction of intact reads and assignment to the sgRNAs were performed on the European Galaxy server.
For mapping, the ONT reads of each sample were first mapped to a virtual genome combined of the host (hg38) and the SARS-CoV-2 reference (NC_045512.2) genomes, as well as host rDNA (U13369.1) and ENO spike sequence using Minimap2 (20). The subset of reads that mapped to the viral genome was isolated using samtools (21) and served as input for a second round of mapping to only the viral genome and Minimap2 parameters optimized for the alignment of viral cross-junction sequences similar to Kim et al.. The complete mapping steps can be reproduced using our Read mapping to viral genome Galaxy workflow.
For the extraction of intact reads carrying the viral leader sequence and assignment of these reads to viral sgRNAs, we used a two-step strategy. First, we used bedtools (22) and samtools to extract reads, for which the mapping supported a junction between the viral TRS-L site and putative landing regions upstream of any potential longer ORF beyond ORF1ab. The list of landing region candidates used at this step includes the regions between each of the predicted structural ORFs and the next intervening upstream start codon, but also correspondingly defined regions upstream of potential alternative start codons within the S, 3a, M, 7b, N and 10 ORFs and enables a relatively unbiased detection of junction sites independent of prior assumptions about TRS-B sites.
Next, we inspected the resulting reads classifications with IGV (23) for evidence of junction events and used this information to build a list of TRS-B sites the use of which is supported by the sequencing data.
This list was then used in a second round of assignment of reads to viral sgRNAs, in which only reads supporting a junction event between the TRS-L site and any of the confirmed TRS-B sites (with 10 flanking bases on each side to account for alignment ambiguities around the junction sites) were considered.
Both read classification strategies can be reproduced using Galaxy workflows to classify ONT reads by candidate junction and to classify ONT reads by confirmed junction sites , respectively. We have also

RNA modification detection
The collections of FASTQ-formatted intact reads with the viral leader sequence were used as input to Tombo. First, tombo preprocess and tombo resquiggle commands were invoked on the FASTQ files and the associated FAST5 collection (option --rna). Tombo detect_modification was invoked using the subcommand model_sample_compare (options --fishers-method-context 2 --minimum-test-reads 20 --sample-only-estimates) on the re-squiggled viral reads and the downsampled IVT data from Kim et al. The subcommand level_sample_compare was also applied with the same configuration (data not shown). The methylation scores were extracted from the computed statistics using the subcommand text_output browser_files --file-types dampened_fraction. The plots for ionic signals were also generated using Tombo.
The second workflow for distribution-based comparison of conditions was developed in Galaxy using Nanocompore and Nanopolish (17,24). To align the raw sequencing event data to the reference genome, Nanopolish subcommand eventalign was used (options --samples --scale-events --print-read-names) (17). The alignments produced in the previous step in BAM format and the associated reads in fastq format were provided to the Nanopolish tool. In the next step, the tabular output of event alignment was treated by removing the rows for the portion of the events that were aligned to the first 100 positions of the genome that covers the leader region using awk. This step has been necessary to have a proper utilization of Nanocompore tool that does not natively support spliced alignments. In the next step the event_align data was processed using NanopolishComp (https://github.com/a-slide/NanopolishComp) followed by Nanocompore sampcomp (options --sequence_context 2 --logit) to obtain the methylation scores. The p-value score GMM_logit_pvalue_context_2 was used to predict methylation.  Heatmaps of the predicted fraction of modified bases using Tombo. The red marks show top-1% modified sites per sample that are common in at least two of the three samples. Figure 3: Distribution of nanopore measured ionic signals for exemplary regions with high modification scores according to Tombo and Nanocompore. Shown are signals obtained from unmodified RNA (black) and one representative sample, Fr3 (red).

Figure 4:
Correlation of the fraction of modified bases in the S (a) and N (b) sgRNAs computed using Tombo. Correlation coefficients are given in red circles. Figure 5: Direct RNA sequencing raw electrical signals of downsampled reads obtained from unmodified RNA (IVT, black), from samples generated for this study and from isolate from a published korean data set (Fr1-3 and Kr, red).

Figure S1
Mapping statistics of data sets and read distributions among the genomes. a: Mapping statistics of DRS reads for the human genome, ONT control ENO, and SARS-CoV-2. Depicted are results obtained for published data sets from Korea (Kr) and Australis (Au). b: Top panel, the total number of reads with a 5'leader sequence for the different sgRNAs and the genome. Bottom panel, the to maximal 4000 reads downsampled sgRNA reads for the downstream modification analysis.

Tables
Supplementary Tables   Table S1 : Genomic variants detected in the three studied isolates. To be included in this list the variant site had to show a depth of coverage (DP) > 10 and an alternate allele frequency (AF) > 0.5.  : Counts of reads supporting junctions between TRS-L and each of 16 TRS-B candidate regions. Asterisks mark candidate regions with unconvincingly low number of reads that were not considered for further analysis.

Sample
ng-and-coverage-analysis