ABSTRACT
Sequence analyses of RNA virus genomes remain challenging due to the exceptional genetic plasticity of these viruses. Because of high mutation and recombination rates, genome replication by viral RNA-dependent RNA polymerases leads to populations of closely related viruses that are generally referred to as ‘quasispecies’. Although standard (short-read) sequencing technologies readily allow consensus sequences to be determined for these ‘quasispecies’, it is far more difficult to reconstruct large numbers of full-length haplotypes of (i) RNA virus genomes and (ii) subgenome-length (sg) RNAs composed of noncontiguous genome regions that may be present in these virus populations. Here, we used a full-length, direct RNA sequencing (DRS) approach without any amplification step to characterize viral RNAs produced in cells infected with a human coronavirus representing one of the largest RNA virus genomes known to date.
Using DRS, we were able to map the longest (~26 kb) contiguous read to the viral reference genome. By combining Illumina and nanopore sequencing, a highly accurate consensus sequence of the human coronavirus (HCoV) 229E genome (27.3 kb) was reconstructed. Furthermore, using long reads that did not require an assembly step, we were able to identify, in infected cells, diverse and novel HCoV-229E sg RNAs that remain to be characterized. Also, the DRS approach, which does not require reverse transcription and amplification of RNA, allowed us to detect methylation sites in viral RNAs. Our work paves the way for haplotype-based analyses of viral quasispecies by demonstrating the feasibility of intra-sample haplotype separation. We also show how supplementary short-read sequencing (Illumina) can be used to reduce the error rate of nanopore sequencing.
Even though a number of technical challenges remain to be addressed to fully exploit the potential of the nanopore technology, our work illustrates that direct RNA sequencing may significantly advance genomic studies of complex virus populations, including predictions on long-range interactions in individual full-length viral RNA haplotypes.
Background
Coronaviruses (subfamily Coronavirinae, family Coronaviridae, order Nidovirales) are enveloped positive-sense (+) single-stranded (ss) RNA viruses that infect a variety of mammalian and avian hosts and are of significant medical and economic importance, as illustrated by recent zoonotic transmissions from diverse animal hosts to humans1,2. The genome sizes of coronaviruses (~30 kb) exceed those of most other RNA viruses. Coronaviruses use a special mechanism called discontinuous extension of minus strands3,4 to produce a nested set of 5’- and 3’-coterminal subgenomic (sg) mRNAs that carry a common 5’ leader sequence that is identical to the 5’ end of the viral genome5,6. These sg mRNAs contain different numbers of open reading frames (ORFs) that encode the viral structural proteins and several accessory proteins. With very few exceptions, only the 5’-located ORF (which is absent from the next smaller sg mRNA) is translated into protein (Fig. 1).
In HCoV-229E-infected cells, a total of 7 major viral RNAs are produced. The viral genome (RNA 1) is occasionally referred to as mRNA 1 because it (also) has an mRNA function. In its 5’-terminal region, the genome RNA contains two large ORFs, 1a and 1b, that encode the viral replicase polyproteins 1a and 1ab. mRNAs 2, 4, 5, 6, and 7 are used to produce the S protein, accessory protein 4, E protein, M protein and N protein, respectively. The 5’-ORF present in RNA 3 contains the central and 3’ regions of the S gene. Although this sg RNA has been consistently identified in HCoV-229E-infected cells, its mRNA function has been disputed and there is currently no evidence that this RNA is translated into protein7–9.
Like many other +RNA viruses, coronaviruses show high rates of recombination10–12. In fact, the mechanism to produce 5’ leader-containing sg mRNAs represents a prime example for copy-choice RNA recombination that, in this particular case, is guided by complex RNA-RNA interactions involving the transcription-regulating sequence (TRS) core sequences and likely requires additional interactions of viral proteins with specific RNA signals. In other virus systems, RNA recombination has been shown to generate ‘transcriptional units’ that control the expression of individual components of the genome13. The mechanisms involved in viral RNA recombination are diverse and may even extend to nonreplicating systems14. In the vast majority of cases, recombination results in defective RNA (dRNA) copies that lack essential cis-active elements and thus cannot be replicated. In other cases, functional recombinant RNA with new properties, such as the ability to replicate in a new host, may emerge15–18. In yet other cases, defective interfering RNAs (DI-RNAs) may be produced. These defective (subgenome-length) RNAs contain all the cis-acting elements required for efficient replication by a helper virus polymerase and, therefore, represent parasitic RNAs that compete for components of the viral replication/transcription complex with non-defective viral RNAs19.
To elucidate the many facets of recombination and to determine full-length haplotypes of, for example, virus mutants/variants in complex viral populations (quasispecies), long-read sequencing has become the method of choice. Short-read Next-Generation Sequencing technologies (NGS) – such as IonTorrent and Illumina – are restricted by read length (200-400 nucleotides20). For example, the use of inevitably highly fragmented viral RNAs considerably complicates the investigation of haplotypes21,22. Since the nested coronavirus mRNAs are almost identical to the original genome sequence, short-read data can usually not be unambiguously assigned to specific sg RNA species.
In this study, we performed direct RNA sequencing (DRS) on an array of nanopores, as developed by Oxford Nanopore Technologies (ONT)23. Nanopore sequencing does not have a limited reading length but is limited only by fragmentation of the input material24–26. Further, by using DRS, we avoid several drawbacks of previous sequencing methods, in particular cDNA synthesis and amplification of the input material. Thus, for example, cDNA synthesis can create artificial RNA-RNA chimerae27 that are difficult to discriminate from naturally occurring chimerae (such as spliced RNAs). Also, amplification prior to sequencing would remove all RNA modifications from the input material, whereas the nanopore sequencing technology preserves these modifications23,28.
Recently, nanopore sequencing has been used for metagenomic forays into the virosphere29 and studies focusing on transmission routes30,31. Furthermore, viral transcriptomes have been investigated using nanopore sequencing of cDNA32–35, being subject to bias from reverse transcription and amplification. Other studies used DRS to study the human poly(A) transcriptome36 and the transcriptome of DNA viruses such as HSV37. Furthermore, the genome of Influenza A virus has been completely sequenced in its original form using direct RNA sequencing38.
In the present study, we sequenced one of the largest known RNA genomes, that of HCoV-229E, a member of the genus Alphacoronavirus, with a genome size of about 27,300 nt, in order to assess the complex architectural details of viral sg RNAs produced in cells infected with recombinant HCoV-229E. Using DRS, we aim to capture complete viral mRNAs, including the full coronavirus genome, in single contiguous reads. Sequence analysis of thousands of full-length sg RNAs will allow us to determine the architectures (including leader-body junction sites) of the major viral mRNAs. In addition, this approach provides insight into the diversity of additional HCoV-229E sg RNAs, probably including DI-RNAs. Further, we aim to assess whether RNA modifications can be called directly from the raw nanopore signal of viral molecules without prior in vitro treatment, as has been shown for DNA39,40.
Results
Full genome sequencing without amplification
We sequenced total RNA samples obtained from Huh-7 cells infected with serially passaged recombinant human coronaviruses: wild-type (WT) HCoV-229E, HCoV-229E_SL2-SARS-CoV, and HCoV-229E_SL2-BCoV, respectively. In the latter two viruses, a conserved stem-loop structure (SL2) residing in the HCoV-229E 5’-UTR was replaced with the equivalent SL2 element from SARS-CoV and BCoV, respectively41. Total RNA samples obtained for the latter two (chimeric) viruses were pooled prior to sequence analysis. Hereafter, we refer to the first sample as WT RNA and to the second (pooled) sample as SL2 RNA (see Methods and Materials).
We performed two direct RNA sequencing runs (one per sample) on a MinION nanopore sequencer. As shown in Table 1, we achieved a throughput of 0.237 and 0.282 gigabases with 225 k and 181 k reads for the WT and SL2 samples, respectively. See SFig. 1 A for an overview of the read length distribution. For the WT and SL2 samples, 33.2% and 35.9% of the reads mapped to the reference HCoV-229E sequences, respectively; 15.8% and 10.2% mapped to the yeast enolase 2 mRNA sequence, a calibration strand added during library preparation, while 47.4% and 52.7% could be attributed to human host cell RNA. The remaining 3.50% and 1.11% of reads were not aligned by minimap2. Using BLAST42 against the nt database, 18.1% and 20.7% of these unaligned reads could still be attributed to HCoV, human or yeast. Because the reads that minimap2 did not align were mostly very short (median length <= 200 nt), of poor basecalling quality and represented only 0.62% and 0.15% of total nucleotides, respectively, we decided to use only the higher-quality reads that minimap2 could align (see SFig. 2 for detailed statistics).
The visualized raw voltage signal of a nanopore read is commonly called ‘squiggle’ (see SFig. 3). Different from all previous sequencing technologies, nanopore sequencing preserves the information about base modifications in the raw signal23. However, one of the biggest challenges is the accurate mapping of the raw voltage signal to bases (‘base calling’).
As expected for nanopore DRS23,38, reads had a median uncorrected error rate of about 15% for human and virus reads, while basecalling errors were reduced for yeast ENO2 mRNA reads, as the basecaller was trained on this calibration strand (see Table 1). This error rate included gaps but omitted discontinuous sites longer than six nucleotides since they indicated recombination. Half of all errors were deletions. In addition, we found that more than half of all single-nucleotide deletions occur in homopolymers, and most of the stretches that coincide with a deletion are three or more nucleotides long (see SFig. 4). A quarter of the errors were substitutions, which we argue are largely due to modified bases that impede the basecaller’s ability to assign bases correctly.
The HCoV-229E genome was 99.86% covered, with a large coverage bias towards both ends (see Fig. 2 and Fig. 1). The high coverage of the 3’-end reflects the higher abundance of mRNAs produced from the 3’-terminal genome regions and is a result of the discontinuous transcription mechanism employed by coronaviruses and several other nidoviruses5,43,44. The 3’-coverage is further increased by the directional sequencing that starts from the mRNA 3’-terminal poly(A) tail. Also, the observed coverage bias for the very 5’-end results from the coronavirus-specific transcription mechanism because all viral mRNAs are equipped with the 65-nt 5’-leader sequence derived from the 5’-end of the genome. The remainder of the high 5’-coverage bias likely reflects the presence of high numbers of DI-RNAs in which 5’- and 3’-proximal genomic sequences were fused, probably resulting from illegitimate recombination events as shown previously for other coronaviruses10,12,45. For the WT and SL2 samples, 38.37 % and 16.32 % were split-mapped, respectively. Of these, only 278 and 181 had multiple splits. The considerably larger fraction of split reads in the WT sample is explained by the high abundance of potential DI-RNA molecules, see Fig. 2 (c).
An alignment of the longest reads from both samples to the HCoV-229E reference indicates that they represent near complete virus genomes (SFig. 1 B). The observed peaks in the aligned reads length distribution (see Fig. 4) corresponded very well with the abundances of the known mRNAs produced in HCoV-229E-infected cells7–9 (see Fig. 1). Alignment of the reads to these canonical mRNA sequences confirmed these observed abundances (SFig. 5).
The median read length for the combined set of reads from both samples was 826 nt, with a maximum of 26,210 nt, covering 99.86% of the 27,276-nt-long virus genome, missing only 21 nt at the 5’-end, 15 nt at the 3’-end and those nucleotides that correspond to the skewed error distribution, with 5.7 percentage points more deletions than insertions (see Tab. 1). The median read length might sound short, however most of the viral RNAs (including many DI-RNAs) identified in HCoV-229E-infected cells were below 2,000 nt in length. Furthermore, this number nearly doubles the longest read length that can be obtained with short-read sequencing methods. We observed an abundance of very short reads, representing the 3’ (poly-A) end of the genome. This could be an artifact of RNA degradation, although we cannot estimate the exact fraction of affected transcripts. Because sequencing starts at the poly-A tail, fragmented RNA will not be sequenced beyond any 3’ break point. It is thus best to minimize handling time during RNA extraction and library preparation. Innovations in these fields will directly translate into larger median read lengths.
We obtained 99.15% and 98.79% identity in both samples (WT, SL2) respectively with the help of the consensus caller ococo46 using the reference genomes and all reads mapping to it. We attempted a standard long read assembly using Canu47, which yielded unusable results (WT: 389 contigs, longest 13kb, all other <4kb; SL2: 517 contigs, all <6kb). We think that current nanopore-only assembly tools are not equipped to handle special read data sets such as those originating from a small RNA virus genome. In addition, we assembled WT and SL2 consensus sequences using Nanopore and Illumina data with HG-CoLoR48 in an approach that uses long nanopore reads to traverse an assembly graph constructed from short Illumina reads. We thereby recovered 99.57% of the reference genome in a single contiguous sequence at 99.90% sequence identity to reference using this approach with the single longest read from the SL2 sample. This hybrid approach illustrates how short- and long-read technologies can be combined to reconstruct long transcripts accurately, which will greatly facilitate studies of haplotypes.
Uncharacterized subgenomic and defective interfering RNAs
In addition to the leader-to-body junctions expected for the canonical sg mRNAs 2, 4, 5, 6, and 7, we observed a surprisingly high number of recombination sites which were consistently found in our samples but have not been described previously (Fig. 3). In this study, we defined a recombination site as any site that flanks more than 100 consecutive gaps, as determined in a discontinuous mapping (‘spliced’ mapping). While there is currently no consensus on how to define such sites, we believe this to be a conservative definition, as this type of pattern is unlikely to result from e.g. miscalled homopolymer runs which, in our experience, typically affect fewer than 10 consecutive bases. We observe all known canonical HCoV-229E mRNAs at their expected lengths, including the (presumably) non-coding mRNA 3 (Fig. 4).
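The recombination-site definition above can be illustrated with a minimal sketch that scans the CIGAR string of one spliced alignment. It is an illustration under stated assumptions, not the analysis code used in the study: the function name `recombination_sites`, the 0-based half-open coordinates, and the decision to treat both `N` (skipped region) and `D` (deletion) operations as gaps are all hypothetical choices.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def recombination_sites(ref_start, cigar, min_gap=100):
    """Return the reference interval (start, end) of every gap longer than min_gap.

    ref_start is the 0-based leftmost reference position of the alignment.
    Intervals are half-open: the gap covers reference positions start..end-1.
    Treating both 'N' and 'D' as gaps is an assumption made for this sketch.
    """
    sites = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("N", "D") and length > min_gap:
            sites.append((pos, pos + length))
        # Only these operations consume reference positions.
        if op in ("M", "D", "N", "=", "X"):
            pos += length
    return sites

# Hypothetical read: 65 aligned bases, a 20,000-nt gap, then a short
# 5-nt deletion that stays below the threshold.
print(recombination_sites(0, "65M20000N500M5D300M"))
```

With a threshold of 100, short deletions such as miscalled homopolymer runs are ignored, which is the reason the text gives for calling the definition conservative.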
The aligned reads distribution revealed clusters for all known mRNAs which closely fit the expected molecule lengths (Fig. 4). The cluster positions show double peaks with a consistent distance of −65 nt, i.e. the length of the leader sequence. We observed that the 5’-end of reads has larger-than-average error rates and is often missing nucleotides (see SFig. 8 for detailed statistics). This might be due to a bias of the basecaller towards the end of reads. This is plausible, because the underlying classification algorithm is a bidirectional (i.e. forward and backward looking) long short-term memory (LSTM) neural network. The mapping algorithm was often unable to align these erroneous 5’-ends, leading to soft-clipped bases. Thus, for many reads representing canonical mRNAs, the included leader sequence was not aligned, which gives rise to the secondary peak at each cluster position. Many leader sequences can be recovered with a hidden Markov model on these soft-clipped 5’-ends (data not shown). We also observed additional clusters which likely correspond to highly abundant dRNAs (Fig. 4).
Interestingly, we also observed several unexpected recombination sites, e.g. at positions 3,000 to 4,000 (within ORF1a, see Fig. 3). These sites were confirmed by both nanopore and Illumina sequencing. They had a high read support and defined margins, suggesting a specific synthesis/amplification of these sg RNAs which, most likely, represent DI-RNAs. Since DI-RNAs are byproducts of viral replication and transcription, they present a larger diversity than the canonical viral mRNAs49–53.
Nanopore sequencing captures recombination events far better than Illumina, which allowed us to identify even complex sg RNAs (composed of sequences derived from more than 2 noncontiguous genome regions) at much higher resolution: For example, we found sg RNAs with up to four recombination sites in the 5’- and 3’-terminal genome regions (Fig. 3).
Consistent 5mC methylation signatures of coronavirus RNA
Nanopore sequencing preserves information on nucleotide modifications. Using a trained model, DNA and RNA modifications such as 5mC methylation could be identified (Fig. 5). To assess the false positive rate (FPR) of the methylation calling, we used an unmethylated RNA calibration standard (RCS) as a negative control which was added in the standard library preparation protocol for DRS. We considered a position to be methylated if at least 90 % of the reads showed a methylation signal for this particular position. Using this threshold, the estimated FPR was calculated to be below 5 %. Our experimental setup did not include a positive methylation control.
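The thresholding rule described above can be sketched as follows. This is a hedged illustration only: the input layout (a mapping from position to the number of reads with a methylation signal and the total read count) and both function names are assumptions, and estimating the FPR as the fraction of control positions that pass the threshold is one plausible reading of the text, not a documented detail of the software used.

```python
def methylated_positions(calls, min_fraction=0.9):
    """Return sorted positions where at least min_fraction of reads show a signal.

    calls maps a position to (reads_with_signal, total_reads); this layout
    is hypothetical and chosen only for the sketch.
    """
    return sorted(
        pos
        for pos, (with_signal, total) in calls.items()
        if total > 0 and with_signal / total >= min_fraction
    )

def false_positive_rate(control_calls, min_fraction=0.9):
    """Fraction of positions in an unmethylated control that are called methylated."""
    if not control_calls:
        raise ValueError("control_calls must not be empty")
    called = methylated_positions(control_calls, min_fraction)
    return len(called) / len(control_calls)

# Hypothetical counts for a viral RNA and for the unmethylated control.
viral = {101: (19, 20), 102: (8, 20), 103: (0, 0)}
control = {1: (10, 10), 2: (1, 10), 3: (0, 10), 4: (2, 10)}
print(methylated_positions(viral))
print(false_positive_rate(control))
```

Because the control is unmethylated, every position it contributes to the called set counts as a false positive under this reading.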
When analyzing 5mC methylation across various RNAs, we observed consistent patterns (Figure 5) that were reproducible for the corresponding genomic positions in different RNAs, suggesting that the methylation of coronavirus RNAs is sequence-specific and/or controlled by RNA structural elements. Methylated nucleotides could be identified across the genome, both in the leader sequence and in the body regions of viral mRNAs.
While the overall methylation pattern looks similar between subgenomic RNAs and the negative control (see SFig. 9), we nevertheless find consistent methylation across different subgenomic RNA “types”, i.e. methylated positions of mRNA 2 are mirrored in mRNA 4 etc.
Discussion
We found the diversity of sg RNAs identified in coronavirus-infected cells to be surprisingly high, with many sg RNAs not corresponding to the known canonical mRNAs. These ‘non-canonical’ sg RNAs had abundant read support and full-length sequences could be obtained for most of these RNAs.
As indicated above, only 12% of the sg RNAs were found to conform to our current understanding of discontinuous mRNA transcription in coronaviruses, resulting in mRNAs that (all) carry an identical 5’-leader sequence that is fused to the 3’-coding (‘body’) sequence of the respective mRNA. We however believe that 12% represents an underestimate because a large number of sg RNAs were probably omitted from the analysis: (1) RNA molecules degrade rapidly under laboratory conditions, even when handled carefully. The resulting fragments will only be sequenced if they contain a poly(A) tail. (2) The high sequencing error may introduce mismappings, especially for low-quality reads. These reads would not be assigned to the canonical model under our assumptions because of the high number of mismatches. However, we think the associated bias is low, because minimap2 is very robust against high error rates and because the reads are very long, thus ensuring that the mapper has sufficient aggregate information on a given read to position it very reliably on a reference. (3) The library preparation protocol for DRS includes the ligation of adapters via a T4 ligase. Any ligase could potentially introduce artificial chimeras, although we did not investigate this systematically. Again, we think that this does not affect our results substantially: First, this bias is random and it seems unlikely that we would observe the very same RNA ‘isoform’ many times if it were created by random ligation. Second, many ‘isoforms’ that we observed only once (e.g. those colored pink and blue in Figure 3) were structured plausibly: They contained a leader sequence and had recombined at expected (self-similar) sites corresponding to putative or validated TRSs, with downstream sequences being arranged in a linear 5’-3’-order.
(4) Finally, it is important to note that the RNA used for DRS was isolated from cells infected with a serially passaged pool of recombinant viruses rescued after transfection of in vitro-transcribed genome-length (27.3 kb) RNAs. Transfection of preparations of in vitro-transcribed RNA of this large size likely included a significant proportion of abortive transcripts that lacked varying parts of the 3’ genome regions, rendering them dysfunctional. It is reasonable to suggest that the presence of replication-incompetent RNAs lacking essential 3’-terminal genome regions may have triggered recombination events resulting in the emergence of DI-RNAs that contained all the 5’- and 3’-terminal cis-active elements required for RNA replication, but lacked non-essential internal genome regions. Upon serial passaging of the cell culture supernatants 21 times, DI-RNAs may have been enriched, especially in the HCoV-229E (wild-type) sample (Figure 3). Comparative DRS analyses of RNA obtained from cells infected with (i) plaque-purified HCoV-229E and (ii) newly rescued recombinant HCoV-229E (without prior plaque purification), respectively, would help to address the possible role of prematurely terminated in vitro transcripts produced from full-length cDNA in triggering the large number of DI-RNAs observed in our study.
Although, for the above reasons, the low percentage of canonical mRNAs (12%) in our samples likely represents an underestimate, our study may stimulate additional studies, for example to revisit the production of mRNAs from non-canonical templates54,55. Also, it is worth mentioning in this context that, for several other nidoviruses, such as murine hepatitis virus (MHV), bovine coronavirus (BCoV), and arteriviruses, evidence has been obtained that sg RNA transcription may also involve non-canonical TRS motifs56–60.
The majority of sg RNAs (other than mRNAs) we found in our samples likely represent DI-RNAs, which are a common occurrence in coronavirus in vitro studies19.
To our knowledge, this study is the first to perform RNA modification calling without prior treatment of the input sample. It only relies on the raw nanopore signal. While DNA modifications such as 5mC methylation have been explored extensively61, less is known about RNA modifications62, the importance of which is debated63. We found consistent 5mC methylation patterns across viral RNAs when tested at a FPR below 5 %. We were not able to assess the sensitivity and specificity of the methylation calling due to the absence of a positive control group, which was beyond the scope of this study.
RNA is known to have many different modifications, and we expect the presence of these on Coronavirus sgRNAs64 too. However, to our knowledge no comprehensive data exists on prior expectations for such modifications in Coronaviruses, which might or might not correspond to those observed in e.g. humans.
In addition, we observed that the software used39 will likely present high error rates in regions of low coverage or where the underlying reference assembly is erroneous. This is because the resquiggle algorithm – upon which this method is based – has to align the raw nanopore read signal to the basecalled read sequence (see SFig. 6). This is necessary to test the raw signal against learned modification models, of which at the time of manuscript preparation (May 2019) only 5mC was implemented for RNA. Nevertheless, the ability to call these modifications at an acceptable error rate without any RNA pretreatment is a powerful new option.
The validity of the methylation signal should be confirmed in future studies using e.g. bisulfite sequencing. Ideally, this validation should start from in vitro, synthetic transcripts where modified bases have been inserted in known positions. Furthermore, RNA modification detection from single-molecule sequencing is a current bioinformatic frontier, and algorithms and tools are under active development. We showed that consistent 5mC methylation patterns were seen across different subgenomic RNAs. However, the overall pattern of the methylation calls between subgenomic RNAs and the negative control was very similar. At a false positive rate of 5%, the RNA modifications we identified are supported by their consistent occurrence. However, we cannot rule out that the observed pattern might instead be caused by an alignment artifact. In the employed methylation calling algorithm, the raw signal is aligned to the nucleotide sequence after basecalling. If there is a systematic bias in this alignment, and certain sequence motifs cause a consistent mapping mismatch, this mismatch could lead to false positive methylated sites. This is because in these positions the signal would deviate from the expected one due to the misalignment, and not due to methylation. In future experiments this can be decided using a positive control in the form of an RNA transcript with known 5mC methylated sites. However, even though it is still in its early stages, the reading of RNA modifications from the read signal has great potential to elucidate viral biology.
We were able to reconstruct accurate consensus sequences, both for the Illumina and nanopore data. We also demonstrated that individual transcripts can be characterized. More problematic was the resolution of quasispecies in our experimental setup. Although DRS allowed us to confirm the presence of each of the two heterologous SL2 structures present in the SL2 sample, this was only possible for subgenome-length (DI-) RNAs. It appears that the high error rate of more than 10% was a critical limitation when analyzing the SL2 region located at the extreme 5’-terminal end of the 27.3 kb genome RNA. This high error rate made variant calling difficult, particularly under low-coverage conditions, as was the case in our analyses of the 5’-UTR of genome-length RNA (results not included). The current generation of long-read assemblers is not well suited to reconstruct many viral genome architectures, such as nested ones. The development of specialized assemblers would be of great help in virology projects.
We used a hybrid error correction method (HG-CoLoR48) that uses Illumina data to correct read-level errors. However, it remains questionable whether the corrected read sequence is truly representative of the ground truth read sequence. Signal-based correction methods such as Nanopolish65 may be more promising, however, at the time of manuscript writing (May 2019) correction on direct RNA data has not been implemented. We expect this to become available in the near future. Combined with the ever-increasing accuracy of the nanopore technology, we think this method might be able to study quasispecies soon.
There are recombination events observed in the Illumina data that were not detected in the Nanopore data. These are likely caused by misalignment of the short single-end reads (50 nt). A minimum of only 10 nt was required for mapping on either end of the gapped alignment. This was a trade-off between sensitivity to identify recombination sites and unspecific mapping.
In this work, we demonstrated the potential of long-read data as produced by nanopore sequencing. We were able to directly sequence the RNA molecules of two different samples of one of the largest RNA virus genomes known to date. We showed how very large RNA genomes and a diverse set of sg RNAs with complex structures can be investigated at high resolution without the need for a prior assembly step and without the bias introduced by cDNA synthesis that is typically required for transcriptome studies.
The detail and quality of the available data still require significant bioinformatic expertise as the available tooling is still at an early development stage. However, the technological potential of nanopore sequencing for new insights into different aspects of viral replication and evolution is very promising.
Future studies should investigate both strands of the coronavirus transcriptome. Studies focusing on RNA modifications need to employ well-defined positive and negative controls to assess the error rate of the current software alternatives. Also, the DRS method will be extremely powerful when it comes to analyzing the nature and dynamics of specific haplotypes in coronavirus populations under specific selection pressures, for example mutations and/or drugs affecting replication efficiency.
Our work also serves as a proof-of-concept demonstrating that consistent RNA modifications can be detected using nanopore DRS.
To fully exploit the potential of DRS, several improvements are needed: First and foremost, a significant reduction of the currently very high per-read error rate is crucial. This error rate is especially problematic in studies focusing on intra-sample heterogeneity and haplotypes. Secondly, protocols that limit RNA degradation during library preparation would be of great value. This could be achieved by shortening the library protocol. To limit the cost of DRS, barcoded adapters would be desirable. On the bioinformatics side, the basecaller for DRS data is still at an early stage and, for example, cannot accurately call either the poly(A) regions or the RNA-DNA hybrid adapter sequences. Further basecalling errors likely result from RNA modifications, which need to be modelled more accurately. However, once these limitations have been addressed, the use of nanopore-based DRS can be expected to greatly advance our understanding of the genomics of virus populations and their multiple haplotypes.
RNA virus samples
The two total RNA samples used in this study for DRS (ONT MinION) and Illumina sequencing were prepared at 24 h post infection from Huh-7 cells infected at an MOI of 3 with recombinant HCoV-229E WT, HCoV-229E_SL2-SARS-CoV and HCoV-229E_SL2-BCoV, respectively41. Prior to sequence analysis, the two RNA samples obtained from HCoV-229E_SL2-SARS-CoV- and HCoV-229E_SL2-BCoV-infected cells were pooled (SL2 sample, see SFig. 7).
Generation of recombinant viruses and total RNA isolation were carried out as described previously41. Briefly, full-length cDNA copies of the genomes of HCoV-229E (GenBank accession number NC_002645), HCoV-229E_SL2-SARS-CoV and HCoV-229E_SL2-BCoV, respectively, were engineered into recombinant vaccinia viruses using previously described methods66–68. Next, full-length genomic RNAs of HCoV-229E, HCoV-229E_SL2-SARS-CoV and HCoV-229E_SL2-BCoV, respectively, were transcribed in vitro using purified ClaI-digested genomic DNA of the corresponding recombinant vaccinia virus as a template. 1.5 μg of full-length viral genome RNA, along with 0.75 μg of in vitro-transcribed HCoV-229E nucleocapsid protein mRNA, were used to transfect 1 × 10^6 Huh-7 cells using the TransIT® mRNA transfection kit according to the manufacturer’s instructions (Mirus Bio LLC). At 72 h post transfection (p.t.), cell culture supernatants were collected and serially passaged in Huh-7 cells 21 times (WT) or 12 times (HCoV-229E_SL2-SARS-CoV and HCoV-229E_SL2-BCoV), respectively.
Nanopore sequencing and long-read assessment
For nanopore sequencing, 1 μg of RNA in 9 μl was carried into the library preparation with the Oxford nanopore direct RNA sequencing protocol (SQK-RNA001). All steps were followed according to the manufacturer’s specifications. The library was then loaded on an R9.4 flow cell and sequenced on a MinION device (Oxford Nanopore Technologies). The sequencing run was terminated after 48 h.
The raw signal data was basecalled using Albacore (v2.2.7, available through the Oxford Nanopore community forum). Due to the size of the raw signal files, only the basecalled data were deposited at the Open Science Framework (OSF; doi.org/10.17605/OSF.IO/UP7B4).
While it is customary to remove adapters after DNA sequencing experiments, we did not perform this preprocessing step. The reason is that the sequenced RNA is attached to the adapter molecule via a DNA linker, effectively creating a DNA-RNA chimera. The current basecaller – being trained on RNA – is not able to reliably translate the DNA part of the sequence into base space, which makes adapter trimming based on sequence distance unreliable. However, we found that the subsequent mapping is very robust against these adapter sequences. All mappings were performed with minimap269 (v2.8-r672) using the ‘spliced’ preset without observing the canonical GU…AG splicing motif (parameter -u n), and k-mer size set to 14 (-k 14).
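The mapping call described above can be sketched as follows. This is a minimal illustration, not the exact command line used in the study: the file names are hypothetical, the `-a` flag (SAM output) is an assumption, and the ‘spliced’ preset is assumed to correspond to minimap2's `-x splice`. Only `-u n` and `-k 14` are taken directly from the text.

```python
import shlex


def minimap2_command(reference, reads, kmer=14):
    """Build a minimap2 argument list for spliced mapping of DRS reads.

    -x splice : spliced-alignment preset
    -u n      : do not look for the canonical GU...AG splice motif
    -k 14     : k-mer size of 14
    -a        : SAM output (assumption, not stated in the text)
    """
    return [
        "minimap2", "-a",
        "-x", "splice",
        "-u", "n",
        "-k", str(kmer),
        reference, reads,
    ]


if __name__ == "__main__":
    # Hypothetical file names, shown only to print a copy-pasteable command.
    print(shlex.join(minimap2_command("reference.fasta", "reads.fastq")))
```

Building the command as a list keeps the parameters explicit and allows it to be passed unchanged to `subprocess.run` where minimap2 is installed.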
Raw read coverage and sequence identity to the HCoV-229E reference genome (WT: GenBank, NC_002645.1; SL2: stem loop 2 sequence replaced with SARS-CoV SL2 sequence) were determined from mappings to the references produced by minimap2. Read origin and sequencing error statistics were assessed by mapping the reads simultaneously with minimap2 to a concatenated mock-genome consisting of HCoV-229E (WT and SL2 variants, respectively), yeast enolase 2 mRNA (calibration strand, GenBank, NP_012044.1), and the human genome. Identity and error rates are the number of matching nucleotides (or number of nucleotide substitutions, insertions or deletions) divided by the total length of the alignment including gaps from indels.
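The identity and error rate definition given above can be illustrated with a short sketch. It assumes that each alignment is available as a SAM CIGAR string together with its edit distance (the `NM` tag, i.e. substitutions plus inserted plus deleted bases); the function name is hypothetical. Skipped reference regions (`N`, the gaps between discontinuous segments) and clipped bases are assumed not to count towards the alignment length.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_and_error_rate(cigar, edit_distance):
    """Return (identity, error_rate) for one alignment.

    Alignment length = aligned columns (M, =, X) plus gap columns (I, D).
    edit_distance    = substitutions + inserted + deleted bases (SAM NM tag).
    """
    aln_len = sum(
        int(length)
        for length, op in CIGAR_ELEMENT.findall(cigar)
        if op in "MID=X"
    )
    if aln_len == 0:
        raise ValueError("CIGAR string contains no aligned columns")
    matches = aln_len - edit_distance
    return matches / aln_len, edit_distance / aln_len


if __name__ == "__main__":
    # 21 alignment columns, of which 5 are substitutions or indel bases.
    print(identity_and_error_rate("10M2I5M1D3M", 5))
```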
Consensus calling of nanopore reads was performed with ococo46 (v0.1.2.6). The minimum required base quality was set to 0 in order to avoid gaps in low coverage domains.
We used the hybrid error correction tool HG-CoLoR48 in conjunction with the Illumina HiSeq short-read data sets of both samples to reduce errors in all reads that exceed 20k nt in length. The program builds a de Bruijn graph from the near noise-free short-read data and then substitutes fragments of the noisy long reads with paths found in the graph which correspond to that same fragment of the sequence. HG-CoLoR was run with default parameters except for the maximum order of the de Bruijn graph, which was set to 50 in order to fit the length of the short reads.
Illumina HiSeq sequencing and assembly
Illumina short-read sequencing was performed using the TruSeq RNA v2 kit to obtain RNA from species with polyA-tails and without any strand information. The three samples (WT, SL2_SARS-CoV, SL2_BCoV) selected for this study were prepared on a HiSeq 2500 lane and sequenced with 51 cycles. After demultiplexing, 23.2, 22.0, and 23.8 million single-end reads were obtained for the WT and the two SL2 samples, respectively. The raw sequencing data was deposited at the OSF (doi.org/10.17605/OSF.IO/UP7B4).
Characterization of transcript isoforms and subgenomic RNAs
We first defined TRS as an 8-mer with a maximum Hamming distance of 2 from the motif UCUCAACU. We then searched the HCoV-229E reference genome (GenBank, NC_002645.1) for all matching 8-mers. We then synthesized sg RNAs in silico as follows: For each pair of complementary 8-mers (5’-TRS, 3’-TRS) we accepted at most 1 mismatch to simulate base pairing under a stable energy state. We then joined two reference subsequences for each pair of TRS: First, the 5’-end up to but not including the 5’-TRS. Second, the 3’-end of the reference genome including the 3’-TRS and excluding the poly(A)-tail.
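The TRS search and the in silico joining step can be sketched as below. This is an illustration under stated assumptions rather than the analysis code itself: the function names are hypothetical, the pairing criterion is interpreted as "the two 8-mers differ from each other at no more than one position", the 3’-TRS is required to lie downstream of the 5’-TRS without overlap, and the input sequence is assumed to have its poly(A) tail already removed.

```python
MOTIF = "UCUCAACU"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_trs_sites(genome, motif=MOTIF, max_dist=2):
    """Return (position, 8-mer) for every window within max_dist of the motif."""
    rna = genome.upper().replace("T", "U")  # accept DNA or RNA alphabet
    k = len(motif)
    return [
        (i, rna[i:i + k])
        for i in range(len(rna) - k + 1)
        if hamming(rna[i:i + k], motif) <= max_dist
    ]


def mock_sg_rnas(genome, sites, max_pair_mismatch=1):
    """Join the 5'-end (excluding the 5'-TRS) to the 3'-end (including the 3'-TRS).

    Assumption: a pair is accepted if its two 8-mers differ at no more
    than max_pair_mismatch positions.
    """
    rna = genome.upper().replace("T", "U")
    candidates = []
    for pos5, kmer5 in sites:
        for pos3, kmer3 in sites:
            downstream = pos3 >= pos5 + len(kmer5)
            if downstream and hamming(kmer5, kmer3) <= max_pair_mismatch:
                candidates.append((pos5, pos3, rna[:pos5] + rna[pos3:]))
    return candidates
```

Each returned tuple holds the two TRS positions and the joined sequence, so every candidate can be traced back to the reference coordinates it was built from.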
This way we obtained about 5,000 candidate sg RNAs. To validate them, we mapped the nanopore reads to these ‘mock’ sg RNAs in a contiguous manner, i.e. all reads had to map consecutively without large gaps. To count as a putative hit, 95% of the read length had to uniquely map to a given mock transcript, and the mock transcript could not be more than 5% longer than the read.
We only considered putative hits as plausible if they had a read support of at least 5. With this threshold, we aim to balance the sensitivity of finding plausible novel transcripts with a need to control the number of false positives.
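The hit criteria and the read-support threshold can be expressed compactly. The sketch below assumes that alignments have already been reduced to unique mappings and summarized as tuples of (transcript name, read length, number of read bases aligned, transcript length); this record layout and the function names are hypothetical.

```python
from collections import Counter


def is_putative_hit(read_len, aligned_len, transcript_len,
                    min_mapped_frac=0.95, max_excess_frac=0.05):
    """Apply both criteria: >= 95% of the read maps to the transcript, and
    the transcript is at most 5% longer than the read."""
    mapped_enough = aligned_len >= min_mapped_frac * read_len
    not_too_long = transcript_len <= (1 + max_excess_frac) * read_len
    return mapped_enough and not_too_long


def plausible_transcripts(alignments, min_support=5):
    """Count putative hits per mock transcript and keep those with
    at least min_support supporting reads."""
    support = Counter(
        name
        for name, read_len, aligned_len, transcript_len in alignments
        if is_putative_hit(read_len, aligned_len, transcript_len)
    )
    return {name: n for name, n in support.items() if n >= min_support}
```

Raising `min_support` trades sensitivity for fewer false positives, which is the balance the threshold of 5 is meant to strike.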
Identification of 5mC methylation
We used Tombo (v1.3, default parameters)39 to identify signal level changes corresponding to 5mC methylation (see SFig. 6).
To assess the false positive rate (FPR) of the methylation calling, we used an RNA calibration standard (RCS) as a negative control. It is added in the standard library preparation protocol for direct RNA sequencing. This mRNA standard is derived from the yeast enolase II (YHR174W) gene70 and is produced using an in vitro transcription system. As a consequence, the mRNA standard is not methylated.
For a conservative resquiggle match score of 1.3 (part of the Tombo algorithm, default setting for RNA) and a methylation threshold of 0.9, the FPR was 4.67%, which met our requirement that the FPR be smaller than 5%. Our experimental setup did not include a positive methylation control.
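How such a false positive rate can be derived from the unmethylated control is sketched below. The sketch rests on assumptions that the text does not spell out: the methylation threshold is taken to apply to a per-position score between 0 and 1 (for example an estimated methylated fraction), a position counts as called when its score reaches the threshold, and every call on the control is a false positive. The function name and input layout are hypothetical and are not part of Tombo.

```python
def false_positive_rate(control_scores, threshold=0.9):
    """Fraction of control positions that are called methylated.

    control_scores: per-position scores in [0, 1] from the unmethylated
                    calibration standard (assumed input layout).
    threshold:      positions with score >= threshold count as called.
    """
    if not control_scores:
        raise ValueError("no control positions supplied")
    called = sum(score >= threshold for score in control_scores)
    return called / len(control_scores)


if __name__ == "__main__":
    # Toy values for illustration only, not data from this study.
    print(false_positive_rate([0.95, 0.10, 0.20, 0.50]))
```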
We used Inkscape version 0.92.1 (available from inkscape.org) to finalize our figures for publication.
Data access
Both the short-read cDNA (Illumina) and the raw as well as basecalled long-read RNA (ONT) data were archived and stored at the Open Science Framework, under accession doi.org/10.17605/OSF.IO/UP7B4. Analysis code has been deposited in the same repository.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by BMBF – InfectControl 2020 (03ZZ0820A) (KL, MM) and is part of the Collaborative Research Centre AquaDiva (CRC 1076 AquaDiva) of the Friedrich Schiller University Jena, funded by the Deutsche Forschungsgemeinschaft (DFG); supporting MH. The study is further supported by DFG TRR 124 “FungiNet”, INST 275/365-1, B05 (MM). The work of JZ was supported by the DFG (SFB 1021-A01 and KFO309-P3).
Authors’ contributions
AV developed the experimental design for sequencing with nanopores. AV, SK and KL analyzed and interpreted the data. JZ, RM, MH and MM were major contributors for discussion and in writing the manuscript. All authors wrote, commented, edited and approved the final manuscript.
Acknowledgements
We sincerely thank Celia Diezel for technical assistance in nanopore sequencing. We thank Ivonne Görlich and Marco Groth from the Core Facility DNA sequencing of the Leibniz Institute on Aging – Fritz Lipmann Institute in Jena for their help with Illumina sequencing. We also thank Nadja Karl (Medical Virology, Giessen) for excellent technical assistance. MH appreciates the support of the Joachim Herz Foundation by the add-on fellowship for interdisciplinary life science.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].