LinearTurboFold: Linear-Time Global Prediction of Conserved Structures for RNA Homologs with Applications to SARS-CoV-2

Sizhen Li; He Zhang; Liang Zhang; Kaibo Liu; Boxiang Liu; David H. Mathews; Liang Huang

doi:10.1101/2020.11.23.393488

Abstract

The constant emergence of COVID-19 variants reduces the effectiveness of existing vaccines and test kits. Therefore, it is critical to identify conserved structures in SARS-CoV-2 genomes as potential targets for variant-proof diagnostics and therapeutics. However, the algorithms to predict these conserved structures, which simultaneously fold and align multiple RNA homologs, scale at best cubically with sequence length, and are thus infeasible for coronaviruses, which possess the longest genomes (∼30,000 nt) among RNA viruses. As a result, existing efforts on modeling SARS-CoV-2 structures resort to single sequence folding as well as local folding methods with short window sizes, which inevitably neglect long-range interactions that are crucial in RNA functions. Here we present LinearTurboFold, an efficient algorithm for folding RNA homologs that scales linearly with sequence length, enabling unprecedented global structural analysis on SARS-CoV-2. Surprisingly, on a group of SARS-CoV-2 and SARS-related genomes, LinearTurboFold’s purely in silico prediction not only is close to experimentally-guided models for local structures, but also goes far beyond them by capturing the end-to-end pairs between 5’ and 3’ UTRs (∼29,800 nt apart) that match perfectly with a purely experimental work. Furthermore, LinearTurboFold identifies novel conserved structures and conserved accessible regions as potential targets for designing efficient and mutation-insensitive small-molecule drugs, antisense oligonucleotides, siRNAs, CRISPR-Cas13 guide RNAs and RT-PCR primers. LinearTurboFold is a general technique that can also be applied to other RNA viruses and full-length genome studies, and will be a useful tool in fighting the current and future pandemics.

Availability and implementation Our source code is available at https://github.com/LinearFold/LinearTurboFold.

Significance Statement Conserved RNA structures are critical for designing diagnostic and therapeutic tools for many diseases including COVID-19. However, existing algorithms are much too slow to model the global structures of full-length RNA viral genomes. We present LinearTurboFold, a linear-time algorithm that is orders of magnitude faster, making it the first method to simultaneously fold and align whole genomes of SARS-CoV-2 variants, the longest known RNA virus (∼ 30 kilobases). Our work enables unprecedented global structural analysis and captures long-range interactions that are out of reach for existing algorithms but crucial for RNA functions. LinearTurboFold is a general technique for full-length genome studies and can help fight the current and future pandemics.

Ribonucleic acid (RNA) plays important roles in many cellular processes.^{1, 2} To maintain their functions, secondary structures of RNA homologs are conserved across evolution.^{3, 4, 5} These conserved structures provide critical targets for diagnostics and treatments. Thus, there is a need for developing fast and accurate computational methods to identify structurally conserved regions.

Commonly, conserved structures involve compensatory base pair changes, where two positions in primary sequences mutate across evolution and still conserve a base pair, for instance, an AU or a CG pair replaces a GC pair in homologous sequences. These compensatory changes provide strong evidence for evolutionarily conserved structures.^{6, 7, 8, 9, 10} Meanwhile, they make it harder to align sequences when structures are unknown. To solve this issue, Sankoff proposed a dynamic programming algorithm that simultaneously predicts structures and a structural alignment for two or more sequences.¹¹ The major limitation of this approach is that the algorithm runs in O(n^{3k}) time for k sequences with average sequence length n. Several software packages provide implementations of the Sankoff algorithm^{12, 13, 14, 15, 16, 17} that use simplifications to reduce runtime.^*
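To make the notion of a compensatory change concrete, the following minimal sketch checks whether two alignment columns always form a base pair while both columns vary across sequences. The function names, the toy alignment, and the choice to accept GU wobble pairs are illustrative assumptions, not part of LinearTurboFold.

```python
# Pairs treated as valid here: Watson-Crick plus GU wobble (an assumption).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_types(alignment, i, j):
    """Return the set of pairs seen at 0-based columns i and j, or None if any row cannot pair."""
    observed = set()
    for seq in alignment:
        pair = (seq[i], seq[j])
        if pair not in CANONICAL:
            return None  # at least one sequence cannot form a pair here
        observed.add(pair)
    return observed


def is_compensatory(alignment, i, j):
    """True if every sequence pairs at (i, j) and both columns differ across sequences."""
    observed = pair_types(alignment, i, j)
    if observed is None:
        return False
    left = {p[0] for p in observed}
    right = {p[1] for p in observed}
    return len(left) > 1 and len(right) > 1


# Hypothetical toy alignment: GC, AU and CG pairs at columns 0 and 4.
toy_alignment = ["GAAAC", "AAAAU", "CAAAG"]
print(is_compensatory(toy_alignment, 0, 4))
```

A GC to GU change alone would not count under this definition, because only one of the two columns varies.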

As an alternative, TurboFold II,¹⁸ an extension of TurboFold,¹⁹ provides a more computationally efficient method to align and fold sequences. Taking multiple unaligned sequences as input, TurboFold II iteratively refines alignments and structure predictions so that they conform more closely to each other and converge on conserved structures. TurboFold II is significantly more accurate than other methods^{12, 14, 20, 21, 22} when tested on RNA families with known structures and alignments.

However, the cubic runtime and quadratic memory usage of TurboFold II prevent it from scaling to longer sequences such as full-length SARS-CoV-2 genomes, which contain ∼30,000 nucleotides; in fact, no joint-align-and-fold method can scale to these genomes, which are the longest among RNA viruses. As a (not very principled) workaround, most existing efforts for modeling SARS-CoV-2 structures^{29, 24, 25, 27, 28, 26} resort to local folding methods^{30, 31} with sliding windows plus a limited pairing distance, abandoning all long-range interactions, and only consider one SARS-CoV-2 genome (Fig. 1B–C), ignoring signals available in multiple homologous sequences. To address this challenge, we designed a linearized version of TurboFold II, LinearTurboFold (Fig. 1A), which is a global homologous folding algorithm that scales linearly with sequence length. This linear runtime makes it the first joint-fold-and-align algorithm to scale to full-length coronavirus genomes without any constraints on window size or pairing distance, taking about 13 hours to analyze a group of 25 SARS-CoV homologs. It also leads to significant improvement on secondary structure prediction accuracy as well as an alignment accuracy comparable to or higher than all benchmarks.

Fig. 1.

A: The LinearTurboFold framework. Like TurboFold II, LinearTurboFold takes multiple unaligned homologous sequences as input and outputs a secondary structure for each sequence, and a multiple sequence alignment (MSA). But unlike TurboFold II, LinearTurboFold employs two linearizations to ensure linear runtime: a linearized alignment computation (module 1) to predict posterior co-incidence probabilities (red squares) for all pairs of sequences (see Methods §1–4), and a linearized partition function computation (module 2) to estimate base-pairing probabilities (yellow triangles) for all the sequences (see Methods §5–6). These two modules take advantage of information from each other and iteratively refine predictions (Fig. S7). After several iterations, module 3 generates the final multiple sequence alignment (see Methods §7), and module 4 predicts secondary structures. Module 5 can stochastically sample structures. B–C: Prior studies (except for the purely experimental work by Ziv et al.) used local folding methods with limited window size and maximum pairing distance. B shows the local folding of the SARS-CoV-2 genome by Huston et al., which used a window of 3,000 nt that was advanced 300 nt at a time. It also limited the distance between nucleotides that can base pair to 500 nt. Some work also used homologous sequences to identify conserved structures, but they only predicted structures for one genome and utilized sequence alignments to identify mutations. By contrast, LinearTurboFold is a global folding method without any limitations on sequence length or pairing distance, and it jointly folds and aligns homologs to obtain conserved structures. Consequently, LinearTurboFold can capture long-range interactions even across the whole genome (the long arc in B and Fig. 3).

Over a group of 25 SARS-CoV-2 and SARS-related homologous genomes, LinearTurboFold predictions are close to the canonical structures³² and structures modeled with the aid of experimental data^{24, 25, 27} for several well-studied regions. Thanks to global rather than local folding, LinearTurboFold discovers a long-range interaction involving 5’ and 3’ UTRs (∼29,800 nt apart), which is consistent with recent purely experimental work,²³ and yet is out of reach for local folding methods used by existing studies (Fig. 1B–C). In short, our in silico method of folding multiple homologs can achieve results similar to, and sometimes more accurate than, experimentally-guided models for one genome. Moreover, LinearTurboFold identifies conserved structures supported by compensatory mutations, which are potential targets for small molecule drugs³³ and antisense oligonucleotides (ASOs).²⁶ We further identify regions that are (a) sequence-level conserved, (b) at least 15 nt long, and (c) accessible (i.e., likely to be completely unpaired) as potential targets for ASOs,³⁴ small interfering RNA (siRNA),³⁵ CRISPR-Cas13 guide RNA (gRNA)³⁶ and reverse transcription polymerase chain reaction (RT-PCR) primers.³⁷ LinearTurboFold is a general technique that can also be applied to other RNA viruses (e.g., influenza, Ebola, HIV, Zika, etc.) and full-length genome studies.

Results

The framework of LinearTurboFold has two major aspects (Fig. 1A): linearized structure-aware pairwise alignment estimation (module 1); and linearized homolog-aware structure prediction (module 2). LinearTurboFold iteratively refines alignments and structure predictions, specifically, updating pairwise alignment probabilities by incorporating predicted base-pairing probabilities (from module 2) to form structural alignments, and modifying base-pairing probabilities for each sequence by integrating the structural information from homologous sequences via the estimated alignment probabilities (from module 1) to detect conserved structures. After several iterations, LinearTurboFold generates the final multiple sequence alignment (MSA) based on the latest pairwise alignment probabilities (module 3) and predicts secondary structures using the latest pairing probabilities (module 4).

LinearTurboFold achieves linear time regarding sequence length with two major linearized modules: our recent work LinearPartition³⁸ (Fig. 1A module 2), which approximates the RNA partition function³⁹ and base pairing probabilities in linear time, and a novel algorithm LinearAlignment (module 1). LinearAlignment aligns two sequences by Hidden Markov Model (HMM) in linear time by applying the same beam search heuristic⁴⁰ used by LinearPartition. Finally, LinearTurboFold assembles the secondary structure from the final base pairing probabilities using an accurate and linear-time method named ThreshKnot⁴¹ (module 4). LinearTurboFold also integrates a linear-time stochastic sampling algorithm named LinearSampling⁴² (module 5), which can independently sample structures according to the homolog-aware partition functions and then calculate the probability of being unpaired for regions, which is an important property in, for example, siRNA sequence design.³⁵ Therefore, the overall end-to-end runtime of LinearTurboFold scales linearly with sequence length (see Methods §1–7 for more details).
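As an illustration of how a structure can be assembled from base-pairing probabilities in time linear in the number of candidate pairs, the sketch below keeps a pair only if its probability exceeds a threshold and each nucleotide is the other's most probable partner. This is a simplified reading of the thresholding idea behind module 4, not the ThreshKnot implementation itself; the function name, the sparse dictionary input, and the default threshold of 0.3 are assumptions.

```python
def threshknot_like(pair_probs, threshold=0.3):
    """Select mutually-best base pairs above a threshold.

    pair_probs: dict mapping (i, j) with i < j to a pairing probability.
    Returns a sorted list of selected (i, j) pairs.
    """
    best = {}  # position -> (highest probability seen, partner achieving it)
    for (i, j), p in pair_probs.items():
        for pos, partner in ((i, j), (j, i)):
            if pos not in best or p > best[pos][0]:
                best[pos] = (p, partner)

    selected = []
    for (i, j), p in pair_probs.items():
        # Keep (i, j) only if it is the best option for both ends.
        if p > threshold and best[i][1] == j and best[j][1] == i:
            selected.append((i, j))
    return sorted(selected)


# Hypothetical probabilities for a short hairpin.
toy_probs = {(1, 10): 0.9, (2, 9): 0.8, (2, 8): 0.4, (3, 8): 0.2}
print(threshknot_like(toy_probs))  # [(1, 10), (2, 9)]
```

Both passes visit each stored probability once, so the cost is linear in the number of nonzero entries, which stays proportional to sequence length when the upstream partition function computation uses a fixed beam.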

Scalability and Accuracy

To evaluate the efficiency of LinearTurboFold against the sequence length, we collected a dataset consisting of seven families of RNAs with sequence length ranging from 210 nt to 30,000 nt, including five families from the RNAstralign dataset plus 23S ribosomal RNA, HIV genomes and SARS-CoV genomes, and the calculation for each family uses five homologous sequences (see Methods §8 for more details). Fig. 2A compares the running times of LinearTurboFold with TurboFold II and two Sankoff-style simultaneous folding and alignment algorithms, LocARNA and MXSCARNA. Clearly, LinearTurboFold scales linearly with sequence length n, and is substantially faster than other algorithms, which scale superlinearly. The linearization in LinearTurboFold brought orders of magnitude speedup over the cubic-time TurboFold II, taking only 12 minutes on the HIV family (average length 9,686 nt) while TurboFold II takes 3.1 days (372× speedup). More importantly, LinearTurboFold takes only 40 minutes on five SARS-CoV sequences while all other benchmarks fail to scale. Regarding the memory usage (Fig. 2B), LinearTurboFold uses memory that grows linearly with sequence length, while other benchmarks use quadratic or more memory. In Fig. 2C–D, we also show the runtime and memory usage against the number of homologs (k = 5 ∼ 20), using sets of 16S rRNAs about 1,500 nt in length. The apparent complexity against the group size of LinearTurboFold is higher than that of TurboFold II because the cubic-time partition function calculation, which dominates the runtime of TurboFold II, was linearized in LinearTurboFold by LinearPartition (Fig. S10C).
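The reported speedup on the HIV family follows directly from the two runtimes quoted above, as the short calculation below shows. The second part is only a rough extrapolation under the assumption that a cubic-time method scales exactly as n³ from 9,686 nt to 30,000 nt; it is not a measured result.

```python
# Runtimes on the HIV family quoted in the text.
turbofold2_minutes = 3.1 * 24 * 60      # 3.1 days expressed in minutes
linearturbofold_minutes = 12

speedup = turbofold2_minutes / linearturbofold_minutes
print(round(speedup))  # 372

# Hypothetical extrapolation: pure cubic scaling from HIV length to SARS-CoV-2 length.
hiv_length, sars_cov_2_length = 9686, 30000
cubic_growth = (sars_cov_2_length / hiv_length) ** 3
extrapolated_days = 3.1 * cubic_growth
print(f"cubic growth factor ~{cubic_growth:.1f}, i.e. ~{extrapolated_days:.0f} days")
```

Under that assumption a cubic-time method would need on the order of months for five coronavirus genomes, before even considering its quadratic memory footprint.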

Fig. 2.

End-to-end Scalability and Accuracy Comparisons. A–B: End-to-end runtime and memory usage comparisons between benchmarks and LinearTurboFold against the sequence length. C–D: End-to-end runtime and memory usage comparisons against the group size. LinearTurboFold is the first joint-fold-and-align algorithm to scale to full-length coronavirus genomes (∼30,000 nt) due to its linear runtime. E–F: The F1 accuracy scores of the structure prediction and multiple sequence alignment (see Tab. S1 for more details). LocARNA and MXSCARNA are Sankoff-style simultaneous folding and alignment algorithms for homologous sequences. As negative controls, LinearPartition and Vienna RNAfold predicted structures for each sequence separately; LinearAlignment and MAFFT generated sequence-level alignments; RNAalifold folded pre-aligned sequences (e.g., from MAFFT) and predicted conserved structures. Statistical significances (two-tailed permutation test) between the benchmarks and LinearTurboFold are marked with one star (⋆) on the top of the corresponding bars if p < 0.05 or two stars if p < 0.01. The benchmarks whose accuracies are significantly lower than LinearTurboFold are annotated with black stars, while benchmarks higher than LinearTurboFold are marked with dark red stars. Overall, on structure prediction, LinearTurboFold achieves significantly higher accuracy than all evaluated benchmarks, and on multiple sequence alignment, it achieves accuracies comparable to TurboFold II and significantly higher than other methods (see Tab. S1 for detailed accuracies).

We next compare the accuracies of predicted secondary structures and MSAs between LinearTurboFold and several benchmark methods (see Methods §9). Besides Sankoff-style LocARNA and MXSCARNA, we also consider three types of negative controls: (a) single sequence folding (partition function-based): Vienna RNAfold³¹ (-p mode) and LinearPartition; (b) sequence-only alignment: MAFFT²¹ and LinearAlignment (a standalone version of the alignment method developed for this work, but without structural information in LinearTurboFold); and (c) an align-then-fold method that predicts consensus structures from MSAs (Fig. S6): MAFFT + RNAalifold.²⁰

For secondary structure prediction, LinearTurboFold, TurboFold II and LocARNA achieve higher F1 scores than single sequence folding methods (Vienna RNAfold and LinearPartition) (Fig. 2E), which demonstrates folding with homology information performs better than folding sequences separately. Overall, LinearTurboFold performs significantly better than all the other benchmarks on structure prediction. For the accuracy of MSAs (Fig. 2F), the structural alignments from LinearTurboFold obtain higher accuracies than sequence-only alignments (LinearAlignment and MAFFT) on all four families, especially for families with low sequence identity. On average, LinearTurboFold performs comparably with TurboFold II and significantly better than other benchmarks on alignments. We also note that the structure prediction accuracy of the align-then-fold approach (MAFFT + RNAalifold) depends heavily on the alignment accuracy, and is the worst when the sequence identity is low (e.g., SRP RNA) and the best when the sequence identity is high (e.g., 16S rRNA) (Fig. 2E–F).
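For readers unfamiliar with the metric, the following minimal sketch computes an F1 score between a predicted and a reference set of base pairs. It assumes exact matching of pairs; the evaluation protocol in Methods §9 may differ (for example, by tolerating slipped pairs), and the example pairs are hypothetical.

```python
def f1_score(predicted, reference):
    """F1 between two collections of (i, j) base pairs, using exact matches only."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)   # fraction of predicted pairs that are correct
    recall = true_pos / len(reference)      # fraction of reference pairs that are recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predicted pairs are correct, 2 of 4 reference pairs are found.
pred = [(1, 10), (2, 9), (3, 8)]
ref = [(1, 10), (2, 9), (4, 7), (5, 6)]
print(f1_score(pred, ref))  # 4/7, about 0.571
```

The same function applies to alignment accuracy if the inputs are sets of aligned position pairs instead of base pairs.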

Highly Conserved Structures in SARS-CoV-2 and SARS-related Betacoronaviruses

RNA sequences with conserved secondary structures play vital biological roles and provide potential targets. The current COVID-19 outbreak creates an urgent need to identify potential targets for diagnostics and therapeutics. Given its strong scalability and high accuracy, we used LinearTurboFold on a group of full-length SARS-CoV-2 and SARS-related (SARSr) genomes to obtain global structures and identify highly conserved structural regions.

We used a greedy algorithm to select the 16 most diverse genomes from all the valid SARS-CoV-2 genomes submitted to the Global Initiative on Sharing Avian Influenza Data (GISAID)⁴³ up to December 2020 (Methods §11). We further extended the group by adding 9 SARS-related homologous genomes (5 human SARS-CoV-1 and 4 bat coronaviruses).⁴⁴ In total, we built a dataset of 25 full-length genomes consisting of 16 SARS-CoV-2 and 9 SARS-related sequences (Tab. S2). The average pairwise sequence identities of the 16 SARS-CoV-2 and the total 25 genomes are 99.9% and 89.6%, respectively. LinearTurboFold takes about 13 hours and 43 GB on the 25 genomes.
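The selection criterion is detailed in Methods §11, which is not reproduced here. As one plausible reading, the sketch below implements a farthest-first greedy selection: starting from a seed sequence, it repeatedly adds the candidate whose highest identity to any already chosen sequence is lowest. The seeding rule, the identity definition over pre-aligned sequences, and the toy data are all assumptions.

```python
def identity(a, b):
    """Fraction of matching columns between two equal-length aligned sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_diverse(seqs, k):
    """Return indices of up to k sequences chosen by farthest-first greedy selection."""
    chosen = [0]  # assumption: seed with the first sequence (e.g. a reference genome)
    while len(chosen) < min(k, len(seqs)):
        candidates = [i for i in range(len(seqs)) if i not in chosen]
        # For each candidate, find its closest chosen sequence; pick the candidate
        # for which that closest identity is smallest.
        nxt = min(
            candidates,
            key=lambda i: max(identity(seqs[i], seqs[c]) for c in chosen),
        )
        chosen.append(nxt)
    return chosen


# Hypothetical pre-aligned toy sequences.
toy_seqs = ["AAAAAAAA", "AAAAAAAC", "CCCCAAAA", "AAAACCCC"]
print(greedy_diverse(toy_seqs, 3))  # [0, 2, 3]
```

This naive version costs O(k²·m) identity evaluations for m candidates, which is acceptable for picking 16 genomes but would need caching of pairwise identities for much larger selections.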

To evaluate the reliability of LinearTurboFold predictions, we first compare them with Huston et al.’s SHAPE-guided models²⁴ for regions with well-characterized structures across betacoronaviruses. For the extended 5’ and 3’ untranslated regions (UTRs), LinearTurboFold’s predictions are close to the SHAPE-guided structures (Fig. 3A–B), i.e., both identify the stem-loops (SLs) 1–2 and 4–7 in the extended 5’ UTR, and the bulged stem-loop (BSL), SL1, and a long bulge stem for the hypervariable region (HVR) including the stem-loop II-like motif (S2M) in the 3’ UTR. Interestingly, in our model, the high unpaired probability of the stem in the SL4b indicates the possibility of being single-stranded as an alternative structure, which is supported by experimental studies.^{26, 25} In addition, the compensatory mutations LinearTurboFold found in UTRs strongly support the evolutionary conservation of structures (Fig. 3A).

Fig. 3.

Secondary structure predictions of SARS-CoV-2 extended 5’ and 3’ UTRs. A: LinearTurboFold prediction. The nucleotides and base pairs are colored by unpaired probabilities and base-pairing probabilities, respectively. The compensatory mutations extracted by LinearTurboFold are annotated with alternative pairs in red boxes (see Tab. S3 for more fully conserved pairs with co-variational changes). B: SHAPE-guided model by Huston et al.²⁴ (window size 3000 nt sliding by 300 nt with maximum pairing distance 500 nt). The nucleotides are colored by SHAPE reactivities. Dashed boxes enclose the different structures between A and B. Our model is close to Huston et al.’s, but the major difference is that LinearTurboFold predicts the end-to-end pairs involving 5’ and 3’ UTRs (solid box in A), which is exactly the same interaction detected by Ziv et al. using the COMRADES experimental technique²³ (C). Such long-range interactions cannot be captured by the local folding methods used by prior experimentally-guided models (Fig. 1B). The similarity between models A and B as well as the exact agreement between A and C show that our in silico method of folding multiple homologs can achieve results similar to, if not more accurate than, experimentally-guided single-genome prediction. As negative controls (Fig. S11), the align-then-fold (RNAalifold) method cannot predict such long-range interactions. Although the single sequence folding algorithm (LinearPartition) predicts a long-range 5’-3’ interaction, the positions are not the same as in the LinearTurboFold model and Ziv et al.’s experimental result.

The most important difference between LinearTurboFold’s prediction and Huston et al.’s experimentally-guided model is that LinearTurboFold discovers an end-to-end interaction (29.8 kilobases apart) between the 5’ UTR (SL3, 60-82 nt) and the 3’ UTR (final region, 29845-29868 nt), which fold locally by themselves in Huston et al.’s model. Interestingly, this 5’-3’ interaction matches exactly with the one discovered by the purely experimental work of Ziv et al.²³ using the COMRADES technique to capture long-range base-pairing interactions (Fig. 3C). These end-to-end interactions have been well established by theoretical and experimental studies^{45, 46, 47} to be common in natural RNAs, but are far beyond the reach of local folding methods used in existing studies on SARS-CoV-2 secondary structures.^{24, 25, 27, 28} By contrast, LinearTurboFold predicts secondary structures globally without any limit on window size or base-pairing distance, enabling it to discover long-distance interactions across the whole genome. The similarity between our predictions and the experimental work shows that our in silico method of folding multiple homologs can achieve results similar to, if not more accurate than, those of experimentally-guided single-genome prediction. We also observed that LinearPartition, as a single sequence folding method, can also predict a long-range interaction between 5’ and 3’ UTRs, but it involves SL2 instead of SL3 of the 5’ UTR (Fig. 3A), which indicates that the homologous information helps LinearTurboFold adjust the positions of base pairs so that they are conserved. Additionally, the align-then-fold approach (MAFFT + RNAalifold) fails to predict such long-range interactions (Fig. S11B).

The frameshifting stimulation element (FSE) is another well-characterized region. For an extended FSE region, the LinearTurboFold prediction consists of two substructures (Fig. 4A): the 5’ part includes an attenuator hairpin and a stem, which are connected by a long internal loop (16 nt) including the slippery site, and the 3’ part includes three stem loops. We observe that our predicted structure of the 5’ part is consistent with experimentally-guided models^{24, 25, 28} (Fig. 4B–D). In the attenuator hairpin, the small internal loop motif (UU) was previously selected as a small molecule binder that stabilizes the folded state of the attenuator hairpin and impairs frameshifting.³³ For the long internal loop including the slippery site, we will show in the next section that it is both highly accessible and conserved (Fig. 5), which makes it a perfect candidate for drug design. For the 3’ region of the FSE, LinearTurboFold successfully predicts stems 1–2 (but misses stem 3) of the canonical three-stem pseudoknot³² (Fig. 4E). Our prediction is closer to the canonical structure compared to the experimentally-guided models^{24, 25, 28} (Fig. 4B–D); one such model (Fig. 4B) identified the pseudoknot (stem 3) but with an open stem 2. Note that all these experimentally-guided models for the FSE region were estimated for specific local regions. As a result, the models are sensitive to the context and region boundaries^{28, 24, 48} (see Fig. S12D–F for alternative structures of Fig. 4B–D with different regions). LinearTurboFold, by contrast, does not suffer from this problem by virtue of global folding without local windows. Besides SARS-CoV-2, we notice that the estimated structure of the SARS-CoV-1 reference sequence (Fig. 4F) from LinearTurboFold is similar to SARS-CoV-2 (Fig. 4A), which is consistent with the observation that the structure of the FSE region is highly conserved among betacoronaviruses.³² Finally, as negative controls, both the single sequence folding algorithm (LinearPartition in Fig. 4G) and the align-then-fold method (RNAalifold in Fig. S12G) predict quite different structures compared with the LinearTurboFold prediction (Fig. 4A) (39%/61% of pairs from the LinearTurboFold model are not found by LinearPartition/RNAalifold).

Fig. 4.

A–D: Secondary structure predictions of SARS-CoV-2 extended frameshifting stimulation element (FSE) region (13425–13545 nt). A: LinearTurboFold prediction. B–D: Experimentally-guided predictions from the literature,^{24, 28, 25} which are sensitive to the context and region boundaries due to the use of local folding methods (Fig. S12). E: The canonical pseudoknot structure by the comparative analysis between SARS-CoV-1 and SARS-CoV-2 genomes.³² For the 5’ region of the FSE shown in dotted boxes (attenuator hairpin, internal loop with slippery site, and a stem), the LinearTurboFold prediction (A) is consistent with B–D; for the 3’ region of the FSE shown in dashed boxes, our prediction (predicting stems 1–2 but missing 3) is closer to the canonical structure in E compared to B–D. F: LinearTurboFold prediction on SARS-CoV-1. G: Single sequence folding algorithm (LinearPartition) prediction on SARS-CoV-2, which is quite different from LinearTurboFold’s. As another negative control, the align-then-fold method (RNAalifold) predicts a rather dissimilar structure (Fig. S12G). H: Five examples from 59 fully conserved structures among 25 genomes (see Tab. S4 for details), 26 of which are novel compared with prior work.^{29, 24}

Fig. 5.

An illustration of accessible and conserved regions that LinearTurboFold identifies. A–B: Structurally-conserved accessible regions identified by LinearTurboFold by considering alignment and folding simultaneously. The regions at least 15 nt long with accessibility of at least 0.5 among all the 16 SARS-CoV-2 genomes are shaded on blue background. Structures are encoded in dot-bracket notation. “(” and “)” indicate nucleotides pairing in the 3’ and 5’ direction, respectively. “.” indicates an unpaired nucleotide. The positions with mutations compared to the SARS-CoV-2 reference sequence among three different subfamilies (SARS-CoV-2, SARS-CoV-1 and BCoV) are underlined. C: Accessible and conserved regions are not only accessible among SARS-CoV-2 genomes (pink circle) but also conserved (at sequence level) among both SARS-CoV-2 and SARS-related genomes (green circle). D: Two examples out of 33 accessible and conserved regions found by LinearTurboFold. Region 16 and Region 29 correspond to the accessible regions in A and B, respectively. Region 16 is also the long internal loop including the slippery site in the FSE region (H). The conservation of these regions on 9 SARS-related genomes is the number of mutated sites. The conservation on the ∼2M SARS-CoV-2 dataset is shown as both the average sequence identity with the reference sequence and the percentage of exact matches. E–F: Single sequence folding algorithms predict greatly different structures even if the sequence identities are high (grey boxes). These two regions, fully conserved among SARS-CoV-2 genomes, still fold into different structures due to mutations outside the regions. G: The positions of these 33 regions (red bars) across the whole genome (see Tab. S6 for more details). All the accessible and conserved regions are potential targets for siRNAs, ASOs, CRISPR-Cas13 gRNAs and RT-PCR primers.
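The dot-bracket notation used in the figure can be decoded with a simple stack, as in the minimal sketch below. It handles only nested structures written with "(", ")" and "."; pseudoknot brackets such as "[" and "]" are not supported, and the function name is hypothetical.

```python
def dotbracket_to_pairs(structure):
    """Convert a dot-bracket string into a sorted list of 1-based (i, j) base pairs."""
    stack, pairs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)              # opens a pair with a downstream partner
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unsupported symbol {ch!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dotbracket_to_pairs("((..))."))  # [(1, 6), (2, 5)]
```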

In addition to the well-studied UTRs and FSE regions, LinearTurboFold discovers 50 conserved structures with identical structures among 25 genomes, and 26 regions are novel compared to previous studies^{29, 24} (Fig. 4H and Tab. S4). These novel structures are potential targets for small-molecule drugs³³ and antisense oligonucleotides.^{26, 49} LinearTurboFold also recovers fully conserved base pairs with compensatory mutations (Tab. S3), which imply highly conserved structural regions whose functions might not have been explored. We also provide the whole multiple sequence alignment and predicted structures for 25 genomes from LinearTurboFold (see Fig. S13 for the format and link).

Highly Accessible and Conserved Regions in SARS-CoV-2 and SARS-related Betacoronaviruses

Studies show that the siRNA silencing efficiency, ASO inhibitory efficacy, CRISPR-Cas13 knockdown efficiency, and RT-PCR primer binding efficiency all correlate with the target region’s accessibility,^{37, 35, 36, 50} which is the probability of a target site being fully unpaired. However, most existing work for designing siRNAs, ASOs, CRISPR-Cas13 gRNAs, and RT-PCR primers does not take this feature into consideration^{51, 52} (Tab. S5). Here LinearTurboFold is able to provide more principled design candidates by identifying accessible regions of the target genome. In addition to accessibility, the emerging variants around the world reduce the effectiveness of existing vaccines and test kits (Tab. S5), which indicates that sequence conservation is another critical aspect for therapeutic and diagnostic design. LinearTurboFold, being a tool for both structural alignment and homologous folding, can identify regions that are both (sequence-wise) conserved and (structurally) accessible, and it takes advantage of not only SARS-CoV-2 variants but also homologous sequences, e.g., SARS-CoV-1 and bat coronavirus genomes, to identify conserved regions from historical and evolutionary perspectives.

To get unstructured regions, Rangan et al.²⁹ imposed a threshold on the unpaired probability of each position, which is a crude approximation because the probabilities are not independent of each other. By contrast, the widely-used stochastic sampling algorithm^{53, 42} builds a representative ensemble of structures by sampling independent secondary structures according to their probabilities in the Boltzmann distribution. Thus the accessibility of a region can be approximated as the fraction of sampled structures in which the region is single-stranded. LinearTurboFold utilized LinearSampling⁴² to generate 10,000 independent structures for each genome according to the modified partition functions after the iterative refinement (Fig. 1A module 5), and calculated accessibilities for regions at least 15 nt long. We then define accessible regions as those with accessibility of at least 0.5 among all 16 SARS-CoV-2 genomes (Fig. 5A–B). We also measure the free energy to open a target region [i, j],⁵⁴ denoted ΔG_u[i, j] = −RT (log Z_u[i, j] − log Z) = −RT log P_u[i, j], where Z is the partition function, which sums up the equilibrium constants of all possible secondary structures, Z_u[i, j] is the partition function over all structures in which the region [i, j] is fully unpaired, R is the universal gas constant and T is the thermodynamic temperature. Therefore P_u[i, j] = Z_u[i, j] / Z is the unpaired probability of the target region and can be approximated via sampling by s₀/s, where s is the sample size and s₀ is the number of samples in which the target region is single-stranded. The regions whose free energy changes are close to zero need less free energy to open, and are thus more accessible to bind with siRNAs, ASOs, CRISPR-Cas13 gRNAs and RT-PCR primers.
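The sampling-based estimate described above can be sketched as follows. The code takes sampled structures in dot-bracket notation, estimates P_u[i, j] as s₀/s, and converts it to an opening free energy with ΔG_u = −RT log P_u. It assumes a natural logarithm, T = 310.15 K (37 °C) and R in kcal/(mol·K); the function names and the four toy samples are hypothetical, whereas the paper uses 10,000 samples per genome from LinearSampling.

```python
import math

R = 1.98720425864083e-3  # gas constant in kcal/(mol*K)
T = 310.15               # assumed temperature: 37 degrees C in kelvin


def region_unpaired(structure, i, j):
    """True if positions i..j (1-based, inclusive) are all unpaired in a dot-bracket string."""
    return all(ch == "." for ch in structure[i - 1:j])


def accessibility(samples, i, j):
    """Estimate P_u[i, j] as s0 / s over a list of sampled structures."""
    s0 = sum(1 for s in samples if region_unpaired(s, i, j))
    return s0 / len(samples)


def opening_energy(p_unpaired):
    """Free energy (kcal/mol) needed to open a region with unpaired probability p_unpaired."""
    if p_unpaired == 0:
        return math.inf  # never observed unpaired in the sample
    return -R * T * math.log(p_unpaired)


# Hypothetical sampled structures for an 8-nt toy sequence.
toy_samples = ["..((..))", "....(..)", "(......)", "........"]
p = accessibility(toy_samples, 1, 2)
print(p, opening_energy(p))
```

Counting whole-region events in joint samples, rather than multiplying per-position unpaired probabilities, is what avoids the independence assumption criticized above.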

Next, to identify conserved regions that are highly conserved among both SARS-CoV-2 and SARS-related genomes, we require that these regions contain at most three mutated sites on the 9 SARS-related genomes compared to the SARS-CoV-2 reference sequence because historically conserved sites are also unlikely to change in the future,⁵⁵ and the average sequence identity with reference sequence over a large SARS-CoV-2 dataset is at least 0.999 (here we use a dataset of ∼ 2M SARS-CoV-2 genomes submitted to GISAID up to June 30, 2021^†; see Methods §11). Finally, we identified 33 accessible and conserved regions (Fig. 5G and Tab. S6), which are not only structurally accessible among SARS-CoV-2 genomes but also highly conserved among SARS-CoV-2 and SARS-related genomes (Fig. 5C). Because the specificity is also a key factor influencing siRNA efficiency,⁵⁶ we used BLAST against the human transcript dataset for these regions (Tab. S6). Finally, we also listed the GC content of each region. Among these regions, region 16 corresponds to the internal loop containing the slippery site in the extended FSE region, and it is conserved at both structural and sequence levels (Fig. 5D and 5H). Besides SARS-CoV-2 genomes, the SARS-related genomes such as the SARS-CoV-1 reference sequence (NC_004718.3) and a bat coronavirus (BCoV, MG772934.1) also form similar structures around the slippery site (Fig. 5A). By removing the constraint of conservation on SARS-related genomes, we identified 38 additional candidate regions (Tab. S7) that are accessible but only highly conserved on SARS-CoV-2 variants.
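The two conservation criteria can be expressed as a simple filter, sketched below for a single candidate region. It assumes that all sequences have already been cut to the aligned region, that a "mutated site" is a column where at least one SARS-related genome differs from the reference, and that identity is computed column-wise; these conventions and all names are assumptions rather than the procedure of Methods §11.

```python
def mutated_sites(reference_region, related_regions):
    """Count columns where at least one related genome differs from the reference."""
    return sum(
        1
        for col, ref_nt in enumerate(reference_region)
        if any(seq[col] != ref_nt for seq in related_regions)
    )


def average_identity(reference_region, variant_regions):
    """Mean fraction of columns identical to the reference over all variant genomes."""
    total = 0.0
    for seq in variant_regions:
        matches = sum(a == b for a, b in zip(reference_region, seq))
        total += matches / len(reference_region)
    return total / len(variant_regions)


def is_conserved(reference_region, related_regions, variant_regions,
                 max_mutated=3, min_identity=0.999):
    """Apply both thresholds quoted in the text to one candidate region."""
    return (
        mutated_sites(reference_region, related_regions) <= max_mutated
        and average_identity(reference_region, variant_regions) >= min_identity
    )


# Hypothetical 15-nt region with two mutated sites among related genomes.
ref_region = "ACGUACGUACGUACG"
related = [ref_region, "ACGUACGUACGUACA", "UCGUACGUACGUACG"]
variants = [ref_region] * 999 + ["ACGUACGUACGUACC"]
print(mutated_sites(ref_region, related), is_conserved(ref_region, related, variants))
```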

We also designed a negative control by analyzing the SARS-CoV-2 reference sequence alone using LinearSampling, which can also predict accessible regions. However, these regions are not structurally conserved among the other 15 SARS-CoV-2 genomes, resulting in vastly different accessibilities, except for one region in the M gene (Tab. S8). The reason for this difference is that, even with a high sequence identity (over 99.9%), single sequence folding algorithms still predict greatly dissimilar structures for the SARS-CoV-2 genomes (Fig. 5E–F). Both regions (in nsp11 and N genes) are fully conserved among the 16 SARS-CoV-2 genomes, yet they still fold into vastly different structures due to mutations outside the regions; as a result, the accessibilities are either low (nsp11) or in a wide range (N) (Fig. 5D). By contrast, because LinearTurboFold folds each sequence with the proclivity of base pairing inferred from all homologous sequences, its structure predictions are more consistent with each other and thus can detect conserved structures (Fig. 5A–B).

Discussion

The constant emergence of new SARS-CoV-2 variants is reducing the effectiveness of existing vaccines and test kits. To cope with this issue, there is an urgent need to identify conserved structures as promising targets for therapeutics and diagnostics that would work in spite of current and future mutations. Here we presented LinearTurboFold, an end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs, which is the first joint-fold-and-align algorithm to scale to full-length SARS-CoV-2 genomes without imposing any constraints on base-pairing distance. We also demonstrated that LinearTurboFold leads to significant improvement in secondary structure prediction accuracy as well as an alignment accuracy comparable to or higher than all benchmarks.

Unlike existing work on SARS-CoV-2 using local folding and single-sequence folding workarounds, LinearTurboFold enables unprecedented global structural analysis on SARS-CoV-2 genomes; in particular, it can capture long-range interactions, especially the one between 5’ and 3’ UTRs across the whole genome, which matches perfectly with a recent purely experimental work. Over a group of SARS-CoV-2 and SARS-related homologs, LinearTurboFold identifies not only conserved structures supported by compensatory mutations and experimental studies, but also accessible and conserved regions as vital targets for designing efficient small-molecule drugs, siRNAs, ASOs, CRISPR-Cas13 gRNAs and RT-PCR primers. LinearTurboFold is widely applicable to the analysis of other RNA viruses (influenza, Ebola, HIV, Zika, etc.) and full-length genome analysis.

Methods

Detailed descriptions of our algorithms, datasets, and evaluation metrics are available in the online version of the paper.

§1 Pairwise Hidden Markov Model

We use a pairwise Hidden Markov Model (pair-HMM) to align two sequences.^{57, 58} The model includes three actions (h): aligning two nucleotides from two sequences (ALN), inserting a nucleotide in the first sequence without a corresponding nucleotide in the second sequence (INS1), and inserting a nucleotide in the second sequence without a corresponding nucleotide in the first sequence (INS2). We then define 𝒜(x, y) as the set of all the possible alignments for the two sequences, and one alignment a ∈ 𝒜(x, y) as a sequence of m + 2 steps (h, i, j), where (h, i, j) means an alignment step at the position pair (i, j) by the action h. Thus, for the lth step a_l = (h_l, i_l, j_l) ∈ a, the values of i_l and j_l depend on the action h_l and the positions i_{l−1} and j_{l−1} of a_{l−1}: (i_l, j_l) = (i_{l−1} + 1, j_{l−1} + 1) if h_l = ALN; (i_{l−1} + 1, j_{l−1}) if h_l = INS1; and (i_{l−1}, j_{l−1} + 1) if h_l = INS2, with (ALN, 0, 0) as the first step, and (ALN, |x| + 1, |y| + 1) as the last one. For two sequences {ACAAGU, AACUG}, one possible alignment {−ACAAGU, AAC−−UG} can be specified as {(ALN, 0, 0) → (INS2, 0, 1) → (ALN, 1, 2) → (ALN, 2, 3) → (INS1, 3, 3) → (INS1, 4, 3) → (ALN, 5, 4) → (ALN, 6, 5) → (ALN, 7, 6)}, where a gap symbol (−) represents a nucleotide insertion in the other sequence at the corresponding position (Fig. S8). The action h_l in each step (h_l, i_l, j_l) corresponds to a line segment starting from the previous node (i_{l−1}, j_{l−1}) and stopping at the node (i_l, j_l). Thus the line segment is horizontal, vertical or diagonal towards the top-right corner when h_l is INS1, INS2 or ALN, respectively (Fig. S8).
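The step representation can be checked against the worked example with a short sketch. It assumes, as the example indicates, that ALN advances both positions, INS1 advances only the position in the first sequence, and INS2 advances only the position in the second sequence. The function name and the use of an ASCII hyphen as the gap symbol are illustrative choices.

```python
def alignment_to_steps(row_x, row_y, gap="-"):
    """Convert two gapped alignment rows into a list of (action, i, j) steps."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    steps = [("ALN", 0, 0)]          # fixed first step, no emission
    i = j = 0
    for cx, cy in zip(row_x, row_y):
        if cx != gap and cy != gap:  # both sequences consume a nucleotide
            action, i, j = "ALN", i + 1, j + 1
        elif cx != gap:              # nucleotide only in the first sequence
            action, i = "INS1", i + 1
        elif cy != gap:              # nucleotide only in the second sequence
            action, j = "INS2", j + 1
        else:
            raise ValueError("a column cannot contain two gaps")
        steps.append((action, i, j))
    steps.append(("ALN", i + 1, j + 1))  # fixed last step at (|x|+1, |y|+1)
    return steps

steps = alignment_to_steps("-ACAAGU", "AAC--UG")
print(" -> ".join(f"({h}, {i}, {j})" for h, i, j in steps))
```

For this alignment with m = 7 columns the function returns m + 2 = 9 steps, ending at (ALN, 7, 6).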

We initialize the first step with the state ALN of probability 1, thus p_π(ALN) = 1. p_t(h₂ | h₁) is the transition probability from the state h₁ to h₂, and p_e((c₁, c₂) | h₁) is the probability of the state h₁ emitting a character pair (c₁, c₂) with values from {A, G, C, U, −}. Both the emission and transition probabilities were taken from TurboFold II. The function e() yields a character pair based on a_l and the nucleotides of two sequences: for a_l = (h, i, j), it yields (x_i, y_j) if h = ALN, (x_i, −) if h = INS1, and (−, y_j) if h = INS2, where x_i and y_j are the ith and jth nucleotides of sequences x and y, respectively. Note that the first step a₀ = (ALN, 0, 0) and the last a_{m+1} = (ALN, |x| + 1, |y| + 1) do not have emissions.

The forward probability of a step (h, i, j) encompasses the probability of the partial alignments of x and y up to positions i and j: it sums, over all the alignments a that go through the step (h, i, j), the probability of the partial alignment a[: k], where a[: k] indicates the partial alignment from the starting node up to the kth step and a_k = (h, i, j). For instance, the forward probabilities of the steps illustrated in Fig. S8B, C and D correspond to the regions circled by the blue dashed lines. Similarly, the backward probability of the step (h, i, j) assembles the probability of the partial alignments a[k + 1 :] from the (k + 1)th step up to the end one.

For example, the corresponding backward probabilities are the regions circled by the yellow dashed line (Fig. S8B, C and D). Thus, the probability of observing two sequences, p(x, y), is the forward probability of the last step (ALN, |x| + 1, |y| + 1) or, equivalently, the backward probability of the first step (ALN, 0, 0).

§2 Posterior Co-incidence Probability Computation

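Nucleotide positions i and j in two sequences x and y are said to be co-incident (notated as i ∼ j) in an alignment a if the alignment path goes through the node (i, j).⁵⁷ Since the node (i, j) is reachable by three actions ℋ = {ALN, INS1, INS2}, the co-incidence probability for a position pair (i, j) given two sequences is p(i ∼ j | x, y) = ∑_{h∈ℋ} ∑_{a∈𝒜(x, y): (h, i, j)∈a} p(x, y, a) / p(x, y) (Equation 1), where p(x, y, a) is the probability of two sequences with the alignment a, and p(x, y) is the probability of observing two sequences, which is the sum of the probabilities of all the possible alignments: p(x, y) = ∑_{a∈𝒜(x, y)} p(x, y, a).

The co-incidence probability for positions i and j (Equation 1) can be computed from the forward and backward probabilities: for each state h in ℋ, the forward and backward probabilities of the step (h, i, j) are multiplied, the products are summed over the three states, and the sum is divided by p(x, y).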
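As an illustration of how co-incidence probabilities follow from forward and backward quantities, the sketch below runs an exhaustive (quadratic) forward-backward pass on a toy pair-HMM. The transition and emission values are made-up placeholders, not the TurboFold II parameters, and the dictionary-based tables and function names are illustrative assumptions.

```python
STATES = ("ALN", "INS1", "INS2")
MOVE = {"ALN": (1, 1), "INS1": (1, 0), "INS2": (0, 1)}

def trans(h_to, h_from):
    """Toy transition probability p_t(h_to | h_from); placeholder values."""
    return 0.6 if h_to == "ALN" else 0.2

def emit(h, x, y, i, j):
    """Toy emission probability for the step (h, i, j); placeholder values."""
    if h == "ALN":
        return 0.2 if x[i - 1] == y[j - 1] else 0.2 / 12
    return 0.25

def coincidence(x, y):
    n, m = len(x), len(y)
    nodes = [(i, j) for i in range(n + 1) for j in range(m + 1)]
    fwd = {(h, i, j): 0.0 for h in STATES for i, j in nodes}
    fwd[("ALN", 0, 0)] = 1.0
    for i, j in nodes:                       # row-major order: sources first
        for h in STATES:
            f = fwd[(h, i, j)]
            for h2 in STATES:
                i2, j2 = i + MOVE[h2][0], j + MOVE[h2][1]
                if f > 0.0 and i2 <= n and j2 <= m:
                    fwd[(h2, i2, j2)] += f * trans(h2, h) * emit(h2, x, y, i2, j2)
    bwd = {}
    for i, j in reversed(nodes):             # targets are filled before sources
        for h in STATES:
            if (i, j) == (n, m):
                bwd[(h, i, j)] = trans("ALN", h)   # final step has no emission
                continue
            total = 0.0
            for h2 in STATES:
                i2, j2 = i + MOVE[h2][0], j + MOVE[h2][1]
                if i2 <= n and j2 <= m:
                    total += trans(h2, h) * emit(h2, x, y, i2, j2) * bwd[(h2, i2, j2)]
            bwd[(h, i, j)] = total
    p_xy = bwd[("ALN", 0, 0)]                # probability of observing x and y
    return {(i, j): sum(fwd[(h, i, j)] * bwd[(h, i, j)] for h in STATES) / p_xy
            for i, j in nodes}

probs = coincidence("ACAAGU", "AACUG")
print(round(probs[(1, 2)], 4), round(probs[(6, 5)], 4))
```

Every alignment path passes through the nodes (0, 0) and (|x|, |y|), so their co-incidence probabilities are 1, which is a convenient sanity check.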

§3 LinearAlignment

Unlike a previous method⁵⁷ that fills out all the nodes in the alignment matrix by columns (Fig. S8), LinearAlignment scans the matrix based on the step count s, which is the sum of i and j (s = i + j) for the partial alignments of x_[1,i] and y_[1,j]. As shown in the pseudocode (Fig. S9), the forward phase starts from the node (0, 0) in the state ALN with probability 1, then iterates the step count s from 0 to |x| + |y| − 1. For each step count s and each state h from ℋ, we first collect all the nodes (i, j) with step count s for which a forward probability in the state h exists, which means the position pair (i, j) has been visited via the state h before. Then each node makes transitions to the next nodes by the three states, and updates the forward probabilities of those next nodes.

Without pruning, this alignment algorithm is still an exhaustive search and costs quadratic time and space for all the |x| × |y| nodes. To reduce the runtime, LinearAlignment uses the beam search heuristic⁴⁰ and keeps a limited number of promising nodes at each step. For each step count s with a state h, LinearAlignment first applies beam search over B(s, h), which is the collection of all the nodes (i, j) with step count s for which a forward probability in the state h exists (Fig. S9 line 6). The algorithm only saves the top b₁ nodes with the highest forward scores in B(s, h), and these are subsequently allowed to make transitions to the next states. Here b₁ is a user-specified beam size and the default value is 100. In total, O(b₁n) nodes survive because the step count s takes |x| + |y| values and each step count keeps b₁ nodes. For simplicity, we show the topological order and the beam search method with alignment examples (Fig. S8A), while the forward-backward algorithm adopts the same idea by summing the probabilities of all the possible alignments.

After the forward phase, the backward phase (Fig. S9) performs in linear time to calculate the co-incidence probabilities automatically because only a linear number of nodes in B(s, h) are stored. Thus by pruning low-scoring candidates at each step in the forward algorithm, we reduce the runtime from O(n²) to O(b₁n) for aligning two sequences. For k input homologous sequences, LinearTurboFold computes posterior co-incidence probabilities for each pair of sequences by LinearAlignment, which costs O(k²b₁n) runtime in total.
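The step-count scan with beam pruning can be sketched as follows. This is a simplified illustration of the forward phase only, not the pseudocode of Fig. S9: the transition and emission values are made-up placeholders, and the bucket layout and function names are assumptions made for the example.

```python
from collections import defaultdict

STATES = ("ALN", "INS1", "INS2")
MOVE = {"ALN": (1, 1), "INS1": (1, 0), "INS2": (0, 1)}

def trans(h_to, h_from):
    """Toy transition probability; placeholder values."""
    return 0.6 if h_to == "ALN" else 0.2

def emit(h, x, y, i, j):
    """Toy emission probability; placeholder values."""
    if h == "ALN":
        return 0.2 if x[i - 1] == y[j - 1] else 0.2 / 12
    return 0.25

def beam_forward(x, y, beam=100):
    """Return (approximate p(x, y), number of nodes kept after pruning)."""
    n, m = len(x), len(y)
    # buckets[s][h] maps a node (i, j) with i + j == s to its forward score.
    buckets = defaultdict(lambda: defaultdict(dict))
    buckets[0]["ALN"][(0, 0)] = 1.0
    kept = 0
    for s in range(n + m):                       # s = 0 .. |x| + |y| - 1
        for h in STATES:
            nodes = buckets[s][h]
            if len(nodes) > beam:                # keep the top-scoring nodes
                best = sorted(nodes.items(), key=lambda kv: kv[1], reverse=True)
                nodes = dict(best[:beam])
                buckets[s][h] = nodes
            kept += len(nodes)
            for (i, j), f in nodes.items():      # expand the survivors only
                for h2 in STATES:
                    i2, j2 = i + MOVE[h2][0], j + MOVE[h2][1]
                    if i2 <= n and j2 <= m:
                        target = buckets[i2 + j2][h2]
                        score = f * trans(h2, h) * emit(h2, x, y, i2, j2)
                        target[(i2, j2)] = target.get((i2, j2), 0.0) + score
    total = sum(buckets[n + m][h].get((n, m), 0.0) * trans("ALN", h)
                for h in STATES)
    return total, kept

exact, kept_all = beam_forward("ACAAGU", "AACUG", beam=10**9)
approx, kept_few = beam_forward("ACAAGU", "AACUG", beam=2)
print(exact, approx, kept_all, kept_few)
```

Because every transition increases the step count, all contributions to a bucket are final before it is pruned. With a beam of b₁ nodes per step count and state, the number of surviving nodes is bounded by 3·b₁·(|x| + |y|), and the pruned sum can only underestimate the exhaustive one.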

§4 Match Scores Computation and Modified LinearAlignment

To encourage the pairwise alignment to conform with the estimated secondary structures, LinearTurboFold predicts structural alignments by incorporating the secondary structural conformation. PMcomp⁵⁹ first proposed the match score to measure the structural similarity for position pairs between a pair of sequences, and TurboFold II adapts it as a prior. Based on the base pair probabilities P_x(i, j) estimated from the partition function for a sequence x, a position i could be paired with bases upstream, paired with bases downstream, or unpaired, with corresponding probabilities P_{x,>}(i) = ∑_{j<i} P_x(i, j), P_{x,<}(i) = ∑_{j>i} P_x(i, j) and P_{x,o}(i) = 1 − P_{x,>}(i) − P_{x,<}(i), respectively. The match score m_{x,y}(i, j) for two positions i and j from two sequences x and y is based on the probabilities of these three structural propensities from the last iteration (t − 1), combined with the weight parameters α₁, α₂ and α₃ trained in TurboFold II. The forward-backward phases integrate the match score as a prior when aligning two nucleotides (Fig. S9 line 10, and Fig. S9 line 12).
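The three structural propensities that enter the match score can be computed from a sparse table of base pair probabilities, as in the sketch below. The function name and the dictionary format are illustrative assumptions; the weighted combination into the match score itself is not reproduced here.

```python
def structural_propensities(pair_probs, n):
    """Return (upstream, downstream, unpaired) propensities for positions 1..n.

    `pair_probs` maps a base pair (i, j) with 1 <= i < j <= n to its
    probability. Index 0 of each returned list is unused padding.
    """
    upstream = [0.0] * (n + 1)     # paired with a base at a smaller position
    downstream = [0.0] * (n + 1)   # paired with a base at a larger position
    for (i, j), p in pair_probs.items():
        if not 1 <= i < j <= n:
            raise ValueError(f"invalid base pair {(i, j)}")
        downstream[i] += p         # from i's point of view, j lies downstream
        upstream[j] += p           # from j's point of view, i lies upstream
    unpaired = [1.0 - upstream[k] - downstream[k] for k in range(n + 1)]
    return upstream, downstream, unpaired

# Hypothetical base pair probabilities for a 5-nt sequence.
toy_probs = {(1, 5): 0.6, (2, 5): 0.3, (1, 4): 0.2}
up, down, free = structural_propensities(toy_probs, 5)
print(up[1:], down[1:], free[1:])
```

Only pairs that are actually stored are visited, so the cost is proportional to the number of surviving base pairs rather than to n².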

TurboFold II separately pre-computes match scores for all the O(n²) position pairs for pairs of sequences before the HMM alignment calculation. However, only a linear number of pairs O(b₁n) survive after applying the beam pruning in LinearAlignment. To reduce redundant time and space usage, LinearTurboFold calculates the corresponding match scores for co-incident pairs when they are first visited in LinearAlignment. Overall, for k homologous sequences, LinearTurboFold reduces the runtime of the whole module of pairwise posterior co-incidence probability computation from O(k²n²) to O(k²b₁n) by applying the beam search heuristic to the pairwise HMM alignment, and only calculating the match scores for position pairs that are needed.

§5 Extrinsic Information Calculation

To update partition functions for each sequence with the structural information from homologs, TurboFold¹⁹ introduces extrinsic information to model the proclivity for base pairing induced from the other sequences in the input set 𝒮. The extrinsic information e_x(i, j) for a base pair (i, j) in the sequence x maps the estimated base pairing probabilities of other sequences to the target sequence via the coincident nucleotides between each pair of sequences. For each other sequence y, it uses the base pair probability of each base pair (k, l) in y from the (t − 1)th iteration, together with the posterior co-incidence probabilities for the position pairs (i, k) and (j, l) from the tth iteration. The extrinsic information first sums the base pair probabilities of all alignable pairs in one other sequence, weighted by these co-incidence probabilities, and then accumulates this sum over all the other sequences. s_x,y is the sequence identity for sequences x and y. Sequences with a low identity contribute more to the extrinsic information than sequences of higher identity. The sequence identity is defined as the fraction of nucleotides that are aligned and identical in the alignment.
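The mapping described above can be sketched for a single base pair (i, j). The sketch assumes that each homolog's contribution is weighted by (1 − s_x,y), which is one way to make low-identity sequences contribute more; this weighting, the function name and the sparse dictionary inputs are assumptions made for illustration, not a statement of the exact formula used by TurboFold or LinearTurboFold.

```python
def extrinsic_information(x, i, j, pair_probs, coincidence, identity):
    """Sketch of e_x(i, j) for sequence `x`.

    pair_probs[y]        : {(k, l): base pair probability in sequence y}
    coincidence[(x, y)]  : {(pos_in_x, pos_in_y): co-incidence probability}
    identity[(x, y)]     : sequence identity s_{x,y} in [0, 1]
    """
    total = 0.0
    for y, probs_y in pair_probs.items():
        if y == x:
            continue
        co = coincidence[(x, y)]
        # Map every stored base pair (k, l) of y onto (i, j) of x.
        mapped = sum(p * co.get((i, k), 0.0) * co.get((j, l), 0.0)
                     for (k, l), p in probs_y.items())
        total += (1.0 - identity[(x, y)]) * mapped   # assumed weighting
    return total

# Hypothetical toy inputs for three sequences named "x", "y" and "z".
pair_probs = {"x": {}, "y": {(2, 8): 0.9}, "z": {(3, 9): 0.5}}
coincidence = {("x", "y"): {(1, 2): 0.8, (7, 8): 0.5},
               ("x", "z"): {(1, 3): 1.0, (7, 9): 1.0}}
identity = {("x", "y"): 0.5, ("x", "z"): 0.9}
print(extrinsic_information("x", 1, 7, pair_probs, coincidence, identity))
```

With sparse inputs, only base pairs and position pairs that survived beam pruning are touched, which is what keeps this step from growing quadratically with sequence length.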

§6 LinearPartition for Base Pairing Probabilities Estimation with Extrinsic Information

The classical partition function algorithm scales cubically with sequence length, and this slowness limits its extension to longer sequences. To address this bottleneck, our recent LinearPartition³⁸ algorithm approximates the partition function and base pairing probability matrix computation in linear time. LinearPartition is significantly faster, and correlates better with the ground truth structures than the traditional cubic partition function calculation. Thus LinearTurboFold uses LinearPartition to predict base pair probabilities instead of the traditional O(n³)-time partition function.

TurboFold introduces the extrinsic information in the partition function as a pseudo-free energy term for each base pair (i, j). Similarly, in LinearPartition, for each span [i, j], which is the subsequence x_i…x_j, and its associated partition function Q(i, j), the partition function is modified by the extrinsic information if (x_i, x_j) is an allowed pair, where λ denotes the contribution of the extrinsic information relative to the intrinsic information. Specifically, at each step j, among all possible spans [i, j] where x_i and x_j are paired, we replace the original partition function Q(i, j) with a modified value obtained by multiplying in the extrinsic information. Then LinearTurboFold applies the beam pruning heuristic over the modified partition function instead of the original.

Similarly, TurboFold II obtains the extrinsic information for all the O(n²) base pairs before the partition function calculation of each sequence, while only a linear number of base pairs survives in LinearPartition. Thus, LinearTurboFold only requires the extrinsic information for those promising base pairs that are visited in LinearPartition. Overall, for k homologous sequences, LinearTurboFold reduces the runtime of base pair probabilities estimation for each sequence from O(kn³ + k²n²) to a runtime that is linear in the sequence length n by applying the beam search heuristic to the partition function calculation, and only calculating extrinsic information for the saved base pairs.

§7 MSA Generation and Secondary Structure Prediction

After several iterations, TurboFold II builds the multiple sequence alignment using a probabilistic consistency transformation, generating a guide tree and performing progressive alignment over the pairwise posterior co-incidence probabilities.²² The whole procedure is accelerated by virtue of the sparse matrix obtained by discarding alignment pairs of probability smaller than a threshold (0.01 by default). Since LinearAlignment uses the beam search method and only saves a linear number of co-incident pairs, the MSA generation in LinearTurboFold runs in time linear in the sequence length.

Estimated base pair probabilities are fed into downstream methods to predict secondary structures. To maintain the end-to-end linear-time property, LinearTurboFold uses ThreshKnot,⁴¹ which is a thresholded version of ProbKnot⁶⁰ and only considers base pairs of probability exceeding a threshold θ (θ = 0.3 by default). We evaluate the performance of ThreshKnot and MEA with different hyperparameters (θ and γ). On a sampled RNAStrAlign training set, ThreshKnot is closer to the upper right-hand corner than MEA, which indicates that ThreshKnot always has a higher Sensitivity than MEA at a given PPV (Fig. S10B).
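The thresholding idea can be sketched as follows, assuming the usual ProbKnot rule that a base pair is kept when each of its two positions is the other's most probable partner, with the additional requirement that its probability exceeds θ. The function name and the sparse dictionary input are illustrative, and the handling of ties is an arbitrary choice of this sketch.

```python
def threshknot_pairs(pair_probs, theta=0.3):
    """Return mutually most probable base pairs with probability above theta.

    `pair_probs` maps (i, j) with i < j to a base pair probability.
    """
    best = {}   # position -> (probability, partner) of its best partner
    for (i, j), p in pair_probs.items():
        for pos, partner in ((i, j), (j, i)):
            if p > best.get(pos, (0.0, None))[0]:
                best[pos] = (p, partner)
    pairs = []
    for pos, (p, partner) in best.items():
        mutual = best.get(partner, (0.0, None))[1] == pos
        if pos < partner and p > theta and mutual:
            pairs.append((pos, partner))
    return sorted(pairs)

# Hypothetical base pair probabilities for a 10-nt sequence.
toy = {(1, 10): 0.9, (2, 9): 0.8, (2, 10): 0.4, (3, 8): 0.2, (4, 7): 0.35}
print(threshknot_pairs(toy))              # default theta = 0.3
print(threshknot_pairs(toy, theta=0.5))
```

The pass over the stored pairs is linear in their number, so the prediction step stays linear when only a linear number of base pairs survives pruning. The output is a list of pairs rather than a dot-bracket string because mutually maximal pairs are not guaranteed to be nested.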

§8 Efficiency and Scalability Datasets

Four datasets are built and used for measuring efficiency and scalability. To evaluate the efficiency and scalability of LinearTurboFold with sequence length, we collected groups of homologous RNA sequences with sequence length ranging from 200 nt to 29,903 nt with a fixed group size of 5. Sequences are sampled from the RNAStrAlign dataset,¹⁸ the Comparative RNA Web (CRW) Site,⁶¹ the Los Alamos HIV database (http://www.hiv.lanl.gov/) and the SARS-related betacoronaviruses (SARS-related).⁴⁴ RNAStrAlign, aggregated and released with TurboFold II, is an RNA alignment and structure database. Sequences in RNAStrAlign are categorized into families, i.e. sets of homologs, and some families are further split into subfamilies. Each subfamily or family includes a multiple sequence alignment and ground truth structures for all the sequences. 20 groups of five homologs were randomly chosen from the small subunit ribosomal RNA (Alphaproteobacteria subfamily), SRP RNA (Protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA families. For longer sequences, we sampled five groups of 23S rRNA (of sequence length ranging from 2,700 nt to 2,926 nt) from the CRW Site, HIV-1 genetic sequences (of sequence length ranging from 9,597 nt to 9,738 nt) from the Los Alamos HIV database, and SARS-related sequences (of sequence length ranging from 29,484 nt to 29,903 nt). All the sequences in one group belong to the same subfamily or subtype. We sampled five groups for each family and obtained 35 groups in total. Due to the runtime and memory limitations, we did not run TurboFold II on SARS-CoV-2 groups (Fig. 2, A and D).

To assess the runtime and memory usage of LinearTurboFold with group size, we fixed the sequence length around 1,500 nt, and sampled 5 groups of sequences from the small subunit ribosomal RNA (Alphaproteobacteria subfamily) with group size 5, 10, 15 and 20, respectively (Fig. 2, B and F). We used a Linux machine (CentOS 7.7.1908) with 2.30 GHz Intel Xeon E5-2695 v3 CPU and 755 GB memory, and gcc 4.8.5 for benchmarks.

We built a test set from the RNAStrAlign dataset to measure and compare the performance between LinearTurboFold and other methods. 60 groups of input sequences consisting of five homologous sequences were randomly selected from the small subunit ribosomal RNA (rRNA) (Alphaproteobacteria subfamily), SRP RNA (Protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA families from RNAStrAlign dataset. We removed sequences shorter than 1,200 nt for the small subunit rRNA to filter out subdomains, and removed sequences that are shorter than 200 nt for SRP RNA following the TurboFold II paper to filter out less reliable sequences. We resampled the test set five times and show the average PPV, Sensitivity and F1 scores over the five samples (Fig. 2, C and F).

An RNAStrAlign training set was built to compare accuracies between MEA and ThreshKnot. 40 groups of 3, 5 and 7 homologs were randomly sampled from 5S ribosomal RNA (Eubacteria subfamily), group I intron (IC1 subfamily), tmRNA, and tRNA families from RNAStrAlign dataset. We chose θ = 0.1, 0.2, 0.3, 0.4 and 0.5 for ThreshKnot, and γ = 1, 1.5, 2, 2.5, 3, 3.5, 4, 8 and 16 for MEA. We reported the average secondary structure prediction accuracies (PPV and Sensitivity) across all training families (Fig. S10B).

§9 Benchmarks

The Sankoff algorithm¹¹ uses dynamic programming to simultaneously fold and align two or more sequences, and it requires O(n^{3k}) time and O(n^{2k}) space for k input sequences with the average length n. Both LocARNA¹² and MXSCARNA¹⁴ are Sankoff-style algorithms.

LocARNA (local alignment of RNA) costs O(n²(n² + k²)) time and O(n² + k²) space by restricting the alignable regions. MXSCARNA progressively aligns multiple sequences as an extension of the pairwise alignment algorithm SCARNA⁶² with improved score functions. SCARNA first aligns stem fragment candidates, then removes the inconsistent matches in post-processing to generate the sequence alignment. MXSCARNA reduces runtime to O(k³n²) and space to O(k²n²) with a limited search space of folding and alignment. Both MXSCARNA and LocARNA use pre-computed base pair probabilities for each sequence as structural input. All the benchmarks use the default options and hyper-parameters running on the RNAStrAlign test set. TurboFold II iterates three times, then predicts secondary structures by MEA (γ = 1). LinearTurboFold also runs three iterations with default beam sizes (b₁ = b₂ = 100) in LinearAlignment and LinearPartition, then predicts structures with ThreshKnot (θ = 0.3).

§10 Significance Test

We use a paired, two-tailed permutation test⁶³ to test whether differences are significant. Following common practice, the repetition number is 10,000, and the significance threshold α is 0.05.
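A paired, two-tailed permutation test of this kind can be sketched as a sign-flipping procedure on per-item score differences. The sketch below is a generic illustration under that assumption, with a fixed random seed for reproducibility; the function name, the test statistic (absolute sum of differences) and the example scores are not taken from the paper.

```python
import random

def paired_permutation_test(scores_a, scores_b, repetitions=10_000, seed=0):
    """Two-tailed p-value for the difference between paired scores.

    Each repetition randomly swaps the two scores within every pair, which
    is equivalent to flipping the sign of that pair's difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(repetitions):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return hits / repetitions

# Hypothetical per-family accuracies for two methods.
method_a = [0.9] * 20
method_b = [0.5] * 20
p_value = paired_permutation_test(method_a, method_b)
print(p_value, p_value < 0.05)
```

With a threshold of α = 0.05, the difference is called significant when the returned p-value falls below 0.05.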

§11 SARS-CoV-2 Datasets

We used two large SARS-CoV-2 datasets. The first dataset is used to draw a representative sample of the most diverse SARS-CoV-2 genomes. We downloaded all the genomes submitted to GISAID⁴³ by December 29, 2020 (downloaded on December 29, 2020), filtered out low-quality genomes (with more than 5% unknown characters and degenerate bases, shorter than 29,500 nt, or with framing errors in the coding region), and also discarded genomes with more than 600 mutations compared with the SARS-CoV-2 reference sequence (NC_045512.2).⁶⁴ After preprocessing, this dataset includes about 258,000 genomes. To identify a representative group of samples with more variable mutations, we designed a greedy algorithm to select the 16 most diverse genomes among those found at least twice in the 258,000 genomes. The general idea of the greedy algorithm is to choose genomes one by one, each time taking the genome with the most new mutations compared with the already selected samples, which consist of only the reference sequence at the beginning.
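The greedy selection can be sketched as below, under the assumption that each genome is represented by its set of mutations relative to the reference and that "new" mutations are those not yet seen in any selected genome. The function name, the alphabetical tie-breaking rule and the toy mutation sets are illustrative assumptions.

```python
def greedy_diverse_selection(genomes, k):
    """Pick up to k genomes, each adding the most not-yet-covered mutations.

    `genomes` maps a genome name to its set of mutations relative to the
    reference. The reference itself carries no mutations, so the covered
    set starts empty. Ties are broken by name for reproducibility.
    """
    remaining = dict(genomes)
    covered = set()
    selected = []
    while remaining and len(selected) < k:
        name = max(sorted(remaining),
                   key=lambda g: len(remaining[g] - covered))
        covered |= remaining.pop(name)
        selected.append(name)
    return selected

# Hypothetical toy input: four genomes with a handful of mutations each.
toy_genomes = {
    "g1": {"C241T", "A23403G"},
    "g2": {"C241T", "A23403G", "C3037T"},
    "g3": {"G28881A", "G28882A"},
    "g4": {"C241T"},
}
print(greedy_diverse_selection(toy_genomes, 3))
```

Here g2 is chosen first because it contributes three new mutations, g3 second with two, and the remaining genomes add nothing new, so the tie is resolved by name.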

The second, larger, dataset is to evaluate the conservation of regions with respect to more up-to-date variants. We downloaded all the genomes submitted to GISAID by June 30, 2021 (downloaded on July 25, 2021), and did the same preprocessing as the first dataset. This resulted in a dataset of ∼2M genomes, which was used to evaluate conservation in Figure 5 and Tables S5, S6, S7.

Supporting Information

Fig. S6.

Approaches for analyzing homologous sequences can be categorized into three plans⁶⁵ (related to Fig. 1). Plan A involves two steps: first aligning sequences and then folding the aligned multiple sequences. This approach works well for homologs with a high sequence identity. Plan B employs joint folding and alignment for multiple sequences, and it requires more time and space. Plan C folds sequences separately first and then aligns structures. Methods in italics in each plan are evaluated on the RNAStrAlign dataset (Tab. S1).

Fig. S7.

The flowchart of LinearTurboFold with more detailed information (related to Fig. 1). At iteration 0, LinearPartition calculates the partition function and estimates the base pair probabilities for each sequence. From iteration 1 to T, the two major modules LinearAlignment and LinearPartition are conducted and updated in order with the match score and extrinsic information, respectively. The match score and extrinsic information are required and calculated for promising position pairs and base pairs during the LinearAlignment and LinearPartition computations, respectively. After T iterations, the match score and LinearAlignment computations are performed one more time over the latest base pair probabilities. A multiple sequence alignment is generated based on the pairwise co-incidence probabilities from the (T + 1)-th iteration, and secondary structures are predicted according to the base pair probabilities for each sequence from the T-th iteration.

Fig. S8.

Illustrations of LinearAlignment. A: An example of aligning two sequences and the beam search method based on the step count. The x-axis and y-axis of the matrix represent two sequences x and y. Yellow nodes have the same step count 6. At step count 6, the red path is discarded because of its lower probability compared to others. There are two complete alignment paths (in green and blue) and the observed alignments are on the right side of the matrix with corresponding colors. B: The area enclosed by the blue dashed line corresponds to the forward probability at the node (3, 3), which includes all the partial alignments arriving at the node (3, 3) by the state h. The region circled by the orange dashed line corresponds to the backward probability, which maintains all the partial alignments starting from the step (ALN, 3, 3). C and D: The regions circled by the blue dashed lines correspond to forward probabilities, and the regions circled by the orange dashed lines to backward probabilities, for the steps shown in the respective panels.

Fig. S9.

The pseudocode of the LinearAlignment algorithm forward and backward phases (co-incidence probability computation). The pseudocode ignores boundary conditions for simplicity.

Table S1.

Structure prediction and multiple sequence alignment accuracies (related to Fig. 2).

Fig. S10.

A: The maximum values of the extrinsic information as a function of sequence length. The maximal value for each sequence is recorded when running LinearTurboFold on the collected dataset of sequence length ranging from 200 nt to 30,000 nt. B: Accuracy comparison between ThreshKnot and MEA on the training set with different hyper-parameters. C: The proportion of alignment runtime in the total runtime as the group size grows from 5 to 20. D: LinearPartition uses thermodynamic parameters from Vienna RNAfold,³¹ which are a subset of the RNAstructure⁶⁷ partition function terms. By only replacing the TurboFold II partition function with LinearPartition with an infinite beam size (i.e., no approximation), the runtime decreases. This indicates that part of the speedup of LinearTurboFold comes from a simplified energy model.

Table S2.

Detailed information of the sampled 16 SARS-CoV-2 genomes and 9 SARS-related genomes (related to Fig. 3, 4 and 5). This dataset includes the reference sequences of SARS-CoV-2 and SARS-CoV-1 (NC_045512.2, NC_004718.3). Most of the SARS-CoV-2 genomes include the D614G mutation, which has been a dominant mutation in the SARS-CoV-2 spike protein. The B.1.1.7 lineage is a more infectious and lethal variant of SARS-CoV-2 first detected in the United Kingdom around November 2020. We utilized MAFFT²¹ to generate the multiple sequence alignment and calculated the sequence identity with the reference sequence.

Fig. S11.

Secondary structure prediction of SARS-CoV-2 for extended 5’ and 3’ UTRs (related to Fig. 3). A: LinearPartition prediction of the SARS-CoV-2 reference sequence (NC_045512.2) alone (single sequence folding). LinearPartition also predicts a long-range interaction between 5’ and 3’ UTRs. However, it involves SL2 of the 5’ UTR rather than SL3, which disagrees with the LinearTurboFold prediction and Ziv et al. (Fig. 3). B: RNAalifold (MFE) prediction over 25 genomes. RNAalifold did not find any 5’-3’ pairs.

Fig. S12.

Secondary structure predictions of SARS-CoV-2 for extended frameshifting stimulation element (13425-13545 nt) (related to Fig. 4). A–F: Experimentally-guided structures with different probing data for different regions. The structures in each column were estimated with the same experimental data but different regions. The structures in the second row were predicted by Rangan et al. for a fixed region of 198 nt⁴⁸. G: RNAalifold (MFE) prediction over 25 genomes.

Table S3.

Fully conserved base pairs across 25 complete SARS-CoV-2 and SARS-related genomes with compensatory mutations (related to Fig. 3 and 4). The positions and nucleotide types of base pairs correspond to the reference sequence of SARS-CoV-2 (NC_045512.2). The mutations are from the other 24 genomes.

Table S4.

Fully conserved structures among 25 genomes (related to Fig. 3 and 4). Regions with compensatory mutations are annotated with alternative base pairs. Novel regions compared with Rangan et al.^29,24 are annotated with stars.

Fig. S13.

A glimpse of the whole MSA and aligned predicted structures for 25 genomes from LinearTurboFold, which are available at https://github.com/LinearFold/LinearTurboFold/blob/main/sars-cov-2_results/. Each genome corresponds to three lines: name, aligned sequence and aligned structure, respectively.

Table S5.

Accessibility and conservation of target regions for public RT-PCR forward/reverse primers and probes.⁶⁸ The accessibility is computed by LinearTurboFold, and it is underlined if larger than zero. The conservation on 9 SARS-related genomes is the number of mutated sites. The conservation on the ∼2M SARS-CoV-2 dataset is the percentage of exact matches, which is underlined or bold if less than 0.97 or 0.5, respectively. (The average sequence identity of SARS-CoV-2 genomes is 0.9987, and the average length of primers and probes is 23 nt. Therefore, the probability of randomly sampling a region of length 23 nt without mutations is 0.9987²³ ≈ 0.97).

Table S6.

Accessible and conserved regions (related to Fig. 5) with two kinds of constraints on conservation: 1) at most three mutations on SARS-related genomes; 2) an average sequence identity on the SARS-CoV-2 dataset of at least 0.999. The start positions and sequences correspond to the reference sequence of SARS-CoV-2 (NC_045512.2). The accessibilities are calculated from folding with homologs (LinearTurboFold) and single sequence folding (LinearPartition), respectively. We searched for these regions among the human representative transcript set (RefSeq Select RNA sequences, refseq_select) using BLAST, and several regions have exact matches with human transcripts (underlined). Single sequence folding yields only one accessible region (bold). The conservation of these regions on 9 SARS-related genomes is the number of mutated sites. The table also shows two types of conservation on a large SARS-CoV-2 dataset containing ∼2M genomes submitted to GISAID up to June 30, 2021: the average sequence identity with the reference sequence, and the percentage of exact matches of the whole region, respectively.

Table S7.

Accessible and conserved regions with a loose constraint on conservation: the average sequence identity on the ∼2M SARS-CoV-2 dataset is at least 0.999. The table keeps the same format as Tab. S6 and only displays new regions not included in that table.

Table S8.

Accessible regions by single sequence folding (applying LinearSampling on the SARS-CoV-2 reference sequence alone). The accessibilities of the corresponding regions in the other 15 SARS-CoV-2 genomes are calculated for each sequence separately. Except for the region in the M gene (in bold), all accessible regions on the reference sequence are not accessible on the other sequences, and always result in a wide range of accessibilities. By contrast, LinearTurboFold is able to find regions that are accessible across all 16 SARS-CoV-2 genomes thanks to the fact that consensus folding is determined across the homologous sequences (Tab. S6).

Footnotes

https://github.com/LinearFold/LinearTurboFold
^* Besides these joint-fold-and-align algorithms, there exist two alternative approaches to homologous folding: align-then-fold and fold-then-align; see Fig. S6 for details.
^† The average sequence identity is 0.9987 on that ∼2M dataset (downloaded on July 25, 2021).

References

↵
Sean R. Eddy. Non-coding RNA genes and the modern RNA world. Nature Reviews Genetics, 2(12):919–929, 2001.
OpenUrl CrossRef PubMed Web of Science
↵
Jennifer A. Doudna and Thomas R. Cech. The chemical repertoire of natural ribozymes. Nature, 418(6894):222–228, 2002.
OpenUrl CrossRef PubMed Web of Science
↵
Eric P Nawrocki and Sean R Eddy. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22):2933–2935, 2013.
OpenUrl CrossRef PubMed Web of Science
↵
Edwin A. Brown, Hangchun Zhang, Li-Hua Ping, and Stanley M. Lemon. Secondary structure of the 5’ nontranslated regions of hepatitis C virus and pestivirus genomic RNAs. Nucleic Acids Research, 20(19):5041–5045, 1992.
OpenUrl CrossRef PubMed Web of Science
↵
Justin Ritz, Joshua S. Martin, and Alain Laederach. Evolutionary evidence for alternative structure in RNA sequence co-variation. PLoS Computational Biology, 9(7):e1003152–e1003152, 2013.
OpenUrl
↵
Elena Rivas, Jody Clements, and Sean R Eddy. Estimating the power of sequence covariation for detecting conserved RNA structure. Bioinformatics, 36(10):3072–3076, 2020.
OpenUrl
↵
Robert W Holley, Jean Apgar, George A Everett, James T Madison, Mark Marquisee, Susan H Merrill, John Robert Penswick, and Ada Zamir. Structure of a ribonucleic acid. Science, pages 1462–1465, 1965.
↵
Harry F Noller, JoAnn Kop, Virginia Wheaton, Jürgen Brosius, Robin R Gutell, Alexei M Kopylov, Ferdinand Dohme, Winship Herr, David A Stahl, Ramesh Gupta, et al. Secondary structure model for 23S ribosomal RNA. Nucleic Acids Research, 9(22):6167–6189, 1981.
Norman R Pace, David K Smith, Gary J Olsen, and Bryan D James. Phylogenetic comparative analysis and the secondary structure of ribonuclease P RNA—a review. Gene, 82(1):65–75, 1989.
KP Williams and DP Bartel. Phylogenetic analysis of tmRNA secondary structure. RNA, 2(12):1306–1310, 1996.
David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics, 45(5):810–825, 1985.
Sebastian Will, Kristin Reiche, Ivo L Hofacker, Peter F Stadler, and Rolf Backofen. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Computational Biology, 3(4):e65, 2007.
Jakob H Havgaard, Elfar Torarinsson, and Jan Gorodkin. Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Computational Biology, 3(10):1896–1908, 2007.
Yasuo Tabei, Hisanori Kiryu, Taishin Kin, and Kiyoshi Asai. A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics, 9(1):33, 2008.
Zhenjiang Xu and David H Mathews. Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics, 27(5):626–632, 2011.
David H Mathews and Douglas H Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2):191–203, 2002.
Kengo Sato, Yuki Kato, Tatsuya Akutsu, Kiyoshi Asai, and Yasubumi Sakakibara. DAFS: simultaneous aligning and folding of RNA sequences via dual decomposition. Bioinformatics, 28(24):3218–3224, 2012.
Zhen Tan, Yinghan Fu, Gaurav Sharma, and David H. Mathews. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Research, 45(20):11570–11581, 2017.
Arif O Harmanci, Gaurav Sharma, and David H Mathews. TurboFold: iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinformatics, 12(1):108, 2011.
Stephan H Bernhart, Ivo L Hofacker, Sebastian Will, Andreas R Gruber, and Peter F Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9(1):1–13, 2008.
Kazutaka Katoh and Daron M Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, 2013.
Chuong B Do, Mahathi SP Mahabhashyam, Michael Brudno, and Serafim Batzoglou. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15(2):330–340, 2005.
Omer Ziv, Jonathan Price, Lyudmila Shalamova, Tsveta Kamenova, Ian Goodfellow, Friedemann Weber, and Eric A Miska. The short- and long-range RNA-RNA interactome of SARS-CoV-2. Molecular Cell, 80(6):1067–1077, 2020.
Nicholas C Huston, Han Wan, Madison S Strine, Rafael de Cesaris Araujo Tavares, Craig B Wilen, and Anna Marie Pyle. Comprehensive in vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms. Molecular Cell, 81(3):584–598, 2021.
Ilaria Manfredonia, Chandran Nithin, Almudena Ponce-Salvatierra, Pritha Ghosh, Tomasz K Wirecki, Tycho Marinus, Natacha S Ogando, Eric J Snijder, Martijn J van Hemert, Janusz M Bujnicki, et al. Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeutically-relevant elements. Nucleic Acids Research, 48(22):12436–12452, 2020.
Lei Sun, Pan Li, Xiaohui Ju, Jian Rao, Wenze Huang, Lili Ren, Shaojun Zhang, Tuanlin Xiong, Kui Xu, Xiaolin Zhou, et al. In vivo structural characterization of the SARS-CoV-2 RNA genome identifies host proteins vulnerable to repurposed drugs. Cell, 184(7):1865–1883, 2021.
Christiane Iserman, Christine A Roden, Mark A Boerneke, Rachel SG Sealfon, Grace A McLaughlin, Irwin Jungreis, Ethan J Fritch, Yixuan J Hou, Joanne Ekena, Chase A Weidmann, et al. Genomic RNA elements drive phase separation of the SARS-CoV-2 nucleocapsid. Molecular Cell, 80(6):1078–1091, 2020.
Tammy CT Lan, Matthew F Allan, Lauren Malsick, Stuti Khandwala, Sherry SY Nyeo, Mark Bathe, Anthony Griffiths, and Silvi Rouskin. Structure of the full SARS-CoV-2 RNA genome in infected cells. BioRxiv, 2020.
Ramya Rangan, Ivan N Zheludev, Rachel J Hagey, Edward A Pham, Hannah K Wayment-Steele, Jeffrey S Glenn, and Rhiju Das. RNA genome conservation and secondary structure in SARS-CoV-2 and SARS-related viruses: a first look. RNA, 26(8):937–959, 2020.
Jessica S Reuter and David H Mathews. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11(1):1–9, 2010.
Ronny Lorenz, Stephan H Bernhart, Christian Hoener Zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F Stadler, and Ivo L Hofacker. ViennaRNA package 2.0. Algorithms for Molecular Biology, 6(1):1, 2011.
Jamie A Kelly, Alexandra N Olson, Krishna Neupane, Sneha Munshi, Josue San Emeterio, Lois Pollack, Michael T Woodside, and Jonathan D Dinman. Structural and functional conservation of the programmed -1 ribosomal frameshift signal of SARS coronavirus 2 (SARS-CoV-2). Journal of Biological Chemistry, 295(31):10741–10748, 2020.
Hafeez S Haniff, Yuquan Tong, Xiaohui Liu, Jonathan L Chen, Blessy M Suresh, Ryan J Andrews, Jake M Peterson, Collin A O’Leary, Raphael I Benhamou, Walter N Moss, et al. Targeting the SARS-CoV-2 RNA genome with small molecule binders and ribonuclease targeting chimera (RIBOTAC) degraders. ACS Central Science, 6(10):1713–1721, 2020.
Zhi John Lu and David H Mathews. Fundamental differences in the equilibrium considerations for siRNA and antisense oligodeoxynucleotide design. Nucleic Acids Research, 36(11):3738–3745, 2008.
Steffen Schubert, Arnold Grünweller, Volker A Erdmann, and Jens Kurreck. Local RNA target structure influences siRNA efficacy: systematic analysis of intentionally designed binding regions. Journal of Molecular Biology, 348(4):883–893, 2005.
Omar O Abudayyeh, Jonathan S Gootenberg, Patrick Essletzbichler, Shuo Han, Julia Joung, Joseph J Belanto, Vanessa Verdine, David BT Cox, Max J Kellner, Aviv Regev, et al. RNA targeting with CRISPR–Cas13. Nature, 550(7675):280–284, 2017.
Stephen A Bustin and Tania Nolan. Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. Journal of Biomolecular Techniques: JBT, 15(3):155, 2004.
He Zhang, Liang Zhang, David H Mathews, and Liang Huang. LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. Bioinformatics, 36(Supplement_1):i258–i267, 2020.
John S. McCaskill. The equilibrium partition function and base pair probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.
Liang Huang and Kenji Sagae. Dynamic programming for linear-time incremental parsing. In Proceedings of ACL 2010, pages 1077–1086, Uppsala, Sweden, 2010. ACL.
Liang Zhang, He Zhang, David H. Mathews, and Liang Huang. ThreshKnot: Thresholded prob-knot for improved RNA secondary structure prediction. BioRxiv, 2019.
He Zhang, Liang Zhang, Sizhen Li, David Mathews, and Liang Huang. LinearSampling: Linear-time stochastic sampling of RNA secondary structure with applications to SARS-CoV-2. BioRxiv, 2020.
Stefan Elbe and Gemma Buckland-Merrett. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1(1):33–46, 2017.
Carmine Ceraolo and Federico M Giorgi. Genomic variance of the 2019-nCoV coronavirus. Journal of Medical Virology, 92(5):522–528, 2020.
Matthew G Seetin and David H Mathews. RNA structure prediction: an overview of methods. In Bacterial Regulatory RNA, pages 99–122. Springer, 2012.
Thomas JX Li and Christian M Reidys. The rainbow spectrum of RNA secondary structures. Bulletin of Mathematical Biology, 80(6):1514–1538, 2018.
Wan-Jung C Lai, Mohammad Kayedkhordeh, Erica V Cornell, Elie Farah, Stanislav Bellaousov, Robert Rietmeijer, Enea Salsi, David H Mathews, and Dmitri N Ermolenko. mRNAs and lncRNAs intrinsically form secondary structures with short end-to-end distances. Nature Communications, 9(1):1–11, 2018.
Ramya Rangan, Andrew M Watkins, Jose Chacon, Rachael Kretsch, Wipapat Kladwang, Ivan N Zheludev, Jill Townley, Mats Rynge, Gregory Thain, and Rhiju Das. De novo 3D models of SARS-CoV-2 RNA elements from consensus experimental secondary structures. Nucleic Acids Research, 49(6):3092–3108, 2021.
Valeria Lulla, Michal P Wandel, Katarzyna J Bandyra, Rachel Ulferts, Mary Wu, Tom Dendooven, Xiaofei Yang, Nicole Doyle, Stephanie Oerum, Rupert Beale, et al. The stem loop 2 motif is a site of vulnerability for SARS-CoV-2. BioRxiv, 2021.
Zhi J. Lu and David H. Mathews. Efficient siRNA selection using hybridization thermodynamics. Nucleic Acids Research, 36:640–647, 2008.
Stephen A Bustin, Vladimir Benes, Jeremy A Garson, Jan Hellemans, Jim Huggett, Mikael Kubista, Reinhold Mueller, Tania Nolan, Michael W Pfaffl, Gregory L Shipley, et al. The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments. Clinical Chemistry, 55:611–622, 2009.
Myungsun Park, Joungha Won, Byung Yoon Choi, and C Justin Lee. Optimization of primer sets and detection protocols for SARS-CoV-2 of coronavirus disease 2019 (COVID-19) using PCR and real-time PCR. Experimental & Molecular Medicine, 52(6):963–977, 2020.
Ye Ding and Charles E Lawrence. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Research, 31(24):7280–7301, 2003.
Ulrike Mückstein, Hakim Tafer, Jörg Hackermüller, Stephan H Bernhart, Peter F Stadler, and Ivo L Hofacker. Thermodynamics of RNA–RNA binding. Bioinformatics, 22(10):1177–1182, 2006.
Sean R Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079–2088, 1994.
Elham Fakhr, F Zare, and Ladan Teimoori-Toolabi. Precise and efficient siRNA design: a key point in competent gene silencing. Cancer Gene Therapy, 23(4):73–82, 2016.
Arif Ozgun Harmanci, Gaurav Sharma, and David H Mathews. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics, 8(1):130, 2007.
Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
Ivo L Hofacker, Stephan HF Bernhart, and Peter F Stadler. Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14):2222–2227, 2004.
Stanislav Bellaousov and David H Mathews. ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA, 16(10):1870–1880, 2010.
Jamie J Cannone, Sankar Subramanian, Murray N Schnare, James R Collett, Lisa M D’Souza, Yushi Du, Brian Feng, Nan Lin, Lakshmi V Madabusi, Kirsten M Müller, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3(1):2, 2002.
Yasuo Tabei, Koji Tsuda, Taishin Kin, and Kiyoshi Asai. SCARNA: fast and accurate structural alignment of RNA sequences by matching fixed-length stem fragments. Bioinformatics, 22(14):1723–1729, 2006.
Nima Aghaeepour and Holger H Hoos. Ensemble-based prediction of RNA secondary structures. BMC Bioinformatics, 14(1):139, 2013.
Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, JunHua Tian, Yuan-Yuan Pei, et al. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):265–269, 2020.
Paul P Gardner and Robert Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5(1):1–18, 2004.
Matthias Hochsmann, Thomas Toller, Robert Giegerich, and Stefan Kurtz. Local similarity in RNA secondary structures. In Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003, pages 159–168. IEEE, 2003.
David H Mathews. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10(8):1178–1190, 2004.
Adeel Afzal. Molecular diagnostic technologies for COVID-19: Limitations and challenges. Journal of Advanced Research, 2020.

Posted September 29, 2021.

Subject Area

Bioinformatics


