Abstract
The constant emergence of COVID-19 variants reduces the effectiveness of existing vaccines and test kits. Therefore, it is critical to identify conserved structures in SARS-CoV-2 genomes as potential targets for variant-proof diagnostics and therapeutics. However, the algorithms to predict these conserved structures, which simultaneously fold and align multiple RNA homologs, scale at best cubically with sequence length, and are thus infeasible for coronaviruses, which possess the longest genomes (∼30,000 nt) among RNA viruses. As a result, existing efforts on modeling SARS-CoV-2 structures resort to single sequence folding as well as local folding methods with short window sizes, which inevitably neglect long-range interactions that are crucial in RNA functions. Here we present LinearTurboFold, an efficient algorithm for folding RNA homologs that scales linearly with sequence length, enabling unprecedented global structural analysis on SARS-CoV-2. Surprisingly, on a group of SARS-CoV-2 and SARS-related genomes, LinearTurboFold’s purely in silico prediction not only is close to experimentally-guided models for local structures, but also goes far beyond them by capturing the end-to-end pairs between 5’ and 3’ UTRs (∼29,800 nt apart) that match perfectly with a purely experimental work. Furthermore, LinearTurboFold identifies novel conserved structures and conserved accessible regions as potential targets for designing efficient and mutation-insensitive small-molecule drugs, antisense oligonucleotides, siRNAs, CRISPR-Cas13 guide RNAs and RT-PCR primers. LinearTurboFold is a general technique that can also be applied to other RNA viruses and full-length genome studies, and will be a useful tool in fighting the current and future pandemics.
Availability and implementation Our source code is available at https://github.com/LinearFold/LinearTurboFold.
Significance Statement Conserved RNA structures are critical for designing diagnostic and therapeutic tools for many diseases including COVID-19. However, existing algorithms are much too slow to model the global structures of full-length RNA viral genomes. We present LinearTurboFold, a linear-time algorithm that is orders of magnitude faster, making it the first method to simultaneously fold and align whole genomes of SARS-CoV-2 variants, the longest known RNA virus (∼ 30 kilobases). Our work enables unprecedented global structural analysis and captures long-range interactions that are out of reach for existing algorithms but crucial for RNA functions. LinearTurboFold is a general technique for full-length genome studies and can help fight the current and future pandemics.
Ribonucleic acid (RNA) plays important roles in many cellular processes.1, 2 To maintain their functions, secondary structures of RNA homologs are conserved across evolution.3, 4, 5 These conserved structures provide critical targets for diagnostics and treatments. Thus, there is a need for developing fast and accurate computational methods to identify structurally conserved regions.
Commonly, conserved structures involve compensatory base pair changes, where two positions in primary sequences mutate across evolution and still conserve a base pair, for instance, an AU or a CG pair replaces a GC pair in homologous sequences. These compensatory changes provide strong evidence for evolutionarily conserved structures.6, 7, 8, 9, 10 Meanwhile, they make it harder to align sequences when structures are unknown. To solve this issue, Sankoff proposed a dynamic programming algorithm that simultaneously predicts structures and a structural alignment for two or more sequences.11 The major limitation of this approach is that the algorithm runs in $O(n^{3k})$ time for k sequences of average length n. Several software packages provide implementations of the Sankoff algorithm12, 13, 14, 15, 16, 17 that use simplifications to reduce runtime.*
As an alternative, TurboFold II,18 an extension of TurboFold,19 provides a more computationally efficient method to align and fold sequences. Taking multiple unaligned sequences as input, TurboFold II iteratively refines alignments and structure predictions so that they conform more closely to each other and converge on conserved structures. TurboFold II is significantly more accurate than other methods12, 14, 20, 21, 22 when tested on RNA families with known structures and alignments.
However, the cubic runtime and quadratic memory usage of TurboFold II prevent it from scaling to longer sequences such as full-length SARS-CoV-2 genomes, which contain ∼ 30,000 nucleotides; in fact, no joint-align-and-fold methods can scale to these genomes, which are the longest among RNA viruses. As a (not very principled) workaround, most existing efforts for modeling SARS-CoV-2 structures29, 24, 25, 27, 28, 26 resort to local folding methods30, 31 with sliding windows plus a limited pairing distance, abandoning all long-range interactions, and only consider one SARS-CoV-2 genome (Fig. 1B– C), ignoring signals available in multiple homologous sequences. To address this challenge, we designed a linearized version of TurboFold II, LinearTurboFold (Fig. 1A), which is a global homologous folding algorithm that scales linearly with sequence length. This linear runtime makes it the first joint-fold-and-align algorithm to scale to full-length coronavirus genomes without any constraints on window size or pairing distance, taking about 13 hours to analyze a group of 25 SARS-CoV homologs. It also leads to significant improvement on secondary structure prediction accuracy as well as an alignment accuracy comparable to or higher than all benchmarks.
Over a group of 25 SARS-CoV-2 and SARS-related homologous genomes, LinearTurboFold predictions are close to the canonical structures32 and structures modeled with the aid of experimental data24, 25, 27 for several well-studied regions. Thanks to global rather than local folding, LinearTurboFold discovers a long-range interaction involving 5’ and 3’ UTRs (∼29,800 nt apart), which is consistent with recent purely experimental work,28 and yet is out of reach for local folding methods used by existing studies (Fig. 1B–C). In short, our in silico method of folding multiple homologs can achieve results similar to, and sometimes more accurate than, experimentally-guided models for one genome. Moreover, LinearTurboFold identifies conserved structures supported by compensatory mutations, which are potential targets for small molecule drugs33 and antisense oligonucleotides (ASOs).26 We further identify regions that are (a) sequence-level conserved, (b) at least 15 nt long, and (c) accessible (i.e., likely to be completely unpaired) as potential targets for ASOs,34 small interfering RNA (siRNA),35 CRISPR-Cas13 guide RNA (gRNA)36 and reverse transcription polymerase chain reaction (RT-PCR) primers.37 LinearTurboFold is a general technique that can also be applied to other RNA viruses (e.g., influenza, Ebola, HIV, Zika, etc.) and full-length genome studies.
Results
The framework of LinearTurboFold has two major aspects (Fig. 1A): linearized structure-aware pairwise alignment estimation (module 1); and linearized homolog-aware structure prediction (module 2). LinearTurboFold iteratively refines alignments and structure predictions, specifically, updating pairwise alignment probabilities by incorporating predicted base-pairing probabilities (from module 2) to form structural alignments, and modifying base-pairing probabilities for each sequence by integrating the structural information from homologous sequences via the estimated alignment probabilities (from module 1) to detect conserved structures. After several iterations, LinearTurboFold generates the final multiple sequence alignment (MSA) based on the latest pairwise alignment probabilities (module 3) and predicts secondary structures using the latest pairing probabilities (module 4).
LinearTurboFold achieves linear time regarding sequence length with two major linearized modules: our recent work LinearPartition38 (Fig. 1A module 2), which approximates the RNA partition function39 and base pairing probabilities in linear time, and a novel algorithm LinearAlignment (module 1). LinearAlignment aligns two sequences by Hidden Markov Model (HMM) in linear time by applying the same beam search heuristic40 used by LinearPartition. Finally, LinearTurboFold assembles the secondary structure from the final base pairing probabilities using an accurate and linear-time method named ThreshKnot41 (module 4). LinearTurboFold also integrates a linear-time stochastic sampling algorithm named LinearSampling42 (module 5), which can independently sample structures according to the homolog-aware partition functions and then calculate the probability of being unpaired for regions, which is an important property in, for example, siRNA sequence design.35 Therefore, the overall end-to-end runtime of LinearTurboFold scales linearly with sequence length (see Methods §1–7 for more details).
Scalability and Accuracy
To evaluate the efficiency of LinearTurboFold against the sequence length, we collected a dataset consisting of seven families of RNAs with sequence length ranging from 210 nt to 30,000 nt, including five families from the RNAstralign dataset plus 23S ribosomal RNA, HIV genomes and SARS-CoV genomes, and the calculation for each family uses five homologous sequences (see Methods §8 for more details). Fig. 2A compares the running times of LinearTurboFold with TurboFold II and two Sankoff-style simultaneous folding and alignment algorithms, LocARNA and MXSCARNA. Clearly, LinearTurboFold scales linearly with sequence length n, and is substantially faster than other algorithms, which scale superlinearly. The linearization in LinearTurboFold brought orders of magnitude speedup over the cubic-time TurboFold II, taking only 12 minutes on the HIV family (average length 9,686 nt) while TurboFold II takes 3.1 days (372× speedup). More importantly, LinearTurboFold takes only 40 minutes on five SARS-CoV sequences while all other benchmarks fail to scale. Regarding the memory usage (Fig. 2B), LinearTurboFold uses memory linear in sequence length, while other benchmarks use quadratic or more memory. In Fig. 2C–D, we also show the runtime and memory usage against the number of homologs (k = 5 ∼ 20), using sets of 16S rRNAs about 1,500 nt in length. The apparent complexity of LinearTurboFold against the group size is higher than that of TurboFold II because the cubic-time partition function calculation, which dominates the runtime of TurboFold II, was linearized in LinearTurboFold by LinearPartition (Fig. S10C).
We next compare the accuracies of predicted secondary structures and MSAs between LinearTurboFold and several benchmark methods (see Methods §9). Besides Sankoff-style LocARNA and MXSCARNA, we also consider three types of negative controls: (a) single sequence folding (partition function-based): Vienna RNAfold31 (-p mode) and LinearPartition; (b) sequence-only alignment: MAFFT21 and LinearAlignment (a standalone version of the alignment method developed for this work, but without structural information in LinearTurboFold); and (c) an align-then-fold method that predicts consensus structures from MSAs (Fig. S6): MAFFT + RNAalifold.20
For secondary structure prediction, LinearTurboFold, TurboFold II and LocARNA achieve higher F1 scores than single sequence folding methods (Vienna RNAfold and LinearPartition) (Fig. 2E), which demonstrates folding with homology information performs better than folding sequences separately. Overall, LinearTurboFold performs significantly better than all the other benchmarks on structure prediction. For the accuracy of MSAs (Fig. 2F), the structural alignments from LinearTurboFold obtain higher accuracies than sequence-only alignments (LinearAlignment and MAFFT) on all four families, especially for families with low sequence identity. On average, LinearTurboFold performs comparably with TurboFold II and significantly better than other benchmarks on alignments. We also note that the structure prediction accuracy of the align-then-fold approach (MAFFT + RNAalifold) depends heavily on the alignment accuracy, and is the worst when the sequence identity is low (e.g., SRP RNA) and the best when the sequence identity is high (e.g., 16S rRNA) (Fig. 2E–F).
Highly Conserved Structures in SARS-CoV-2 and SARS-related Betacoronaviruses
RNA sequences with conserved secondary structures play vital biological roles and provide potential targets. The current COVID-19 outbreak creates an urgent need to identify potential targets for diagnostics and therapeutics. Given its strong scalability and high accuracy, we used LinearTurboFold on a group of full-length SARS-CoV-2 and SARS-related (SARSr) genomes to obtain global structures and identify highly conserved structural regions.
We used a greedy algorithm to select the 16 most diverse genomes from all the valid SARS-CoV-2 genomes submitted to the Global Initiative on Sharing Avian Influenza Data (GISAID)43 up to December 2020 (Methods §11). We further extended the group by adding 9 SARS-related homologous genomes (5 human SARS-CoV-1 and 4 bat coronaviruses).44 In total, we built a dataset of 25 full-length genomes consisting of 16 SARS-CoV-2 and 9 SARS-related sequences (Tab. S2). The average pairwise sequence identities of the 16 SARS-CoV-2 and the total 25 genomes are 99.9% and 89.6%, respectively. LinearTurboFold takes about 13 hours and 43 GB on the 25 genomes.
To evaluate the reliability of LinearTurboFold predictions, we first compare them with Huston et al.’s SHAPE-guided models24 for regions with well-characterized structures across betacoronaviruses. For the extended 5’ and 3’ untranslated regions (UTRs), LinearTurboFold’s predictions are close to the SHAPE-guided structures (Fig. 3A–B), i.e., both identify the stem-loops (SLs) 1–2 and 4–7 in the extended 5’ UTR, and the bulged stem-loop (BSL), SL1, and a long bulge stem for the hypervariable region (HVR) including the stem-loop II-like motif (S2M) in the 3’ UTR. Interestingly, in our model, the high unpaired probability of the stem in SL4b indicates the possibility of it being single-stranded as an alternative structure, which is supported by experimental studies.26, 25 In addition, the compensatory mutations LinearTurboFold found in UTRs strongly support the evolutionary conservation of structures (Fig. 3A).
The most important difference between LinearTurboFold’s prediction and Huston et al.’s experimentally-guided model is that LinearTurboFold discovers an end-to-end interaction (29.8 kilobases apart) between the 5’ UTR (SL3, 60-82 nt) and the 3’ UTR (final region, 29845-29868 nt), which fold locally by themselves in Huston et al.’s model. Interestingly, this 5’-3’ interaction matches exactly with the one discovered by the purely experimental work of Ziv et al.23 using the COMRADES technique to capture long-range base-pairing interactions (Fig. 3C). These end-to-end interactions have been well established by theoretical and experimental studies45, 46, 47 to be common in natural RNAs, but are far beyond the reach of local folding methods used in existing studies on SARS-CoV-2 secondary structures.24, 25, 27, 28 By contrast, LinearTurboFold predicts secondary structures globally without any limit on window size or base-pairing distance, enabling it to discover long-distance interactions across the whole genome. The similarity between our predictions and the experimental work shows that our in silico method of folding multiple homologs can achieve results similar to, if not more accurate than, experimentally-guided single-genome predictions. We also observed that LinearPartition, as a single sequence folding method, can also predict a long-range interaction between 5’ and 3’ UTRs, but it involves SL2 instead of SL3 of the 5’ UTR (Fig. 3A), which indicates that the homologous information helps LinearTurboFold shift base pairs to conserved positions. Additionally, the align-then-fold approach (MAFFT + RNAalifold) fails to predict such long-range interactions (Fig. S11B).
The frameshifting stimulation element (FSE) is another well-characterized region. For an extended FSE region, the LinearTurboFold prediction consists of two substructures (Fig. 4A): the 5’ part includes an attenuator hairpin and a stem, which are connected by a long internal loop (16 nt) including the slippery site, and the 3’ part includes three stem loops. We observe that our predicted structure of the 5’ part is consistent with experimentally-guided models24, 25, 28 (Fig. 4B–D). In the attenuator hairpin, the small internal loop motif (UU) was previously selected as a small molecule binder that stabilizes the folded state of the attenuator hairpin and impairs frameshifting.33 For the long internal loop including the slippery site, we will show in the next section that it is both highly accessible and conserved (Fig. 5), which makes it a perfect candidate for drug design. For the 3’ region of the FSE, LinearTurboFold successfully predicts stems 1–2 (but misses stem 3) of the canonical three-stem pseudoknot32 (Fig. 4E). Our prediction is closer to the canonical structure compared to the experimentally-guided models24, 25, 28 (Fig. 4B–D); one such model (Fig. 4B) identified the pseudoknot (stem 3) but with an open stem 2. Note that all these experimentally-guided models for the FSE region were estimated for specific local regions. As a result, the models are sensitive to the context and region boundaries28, 24, 48 (see Fig. S12D–F for alternative structures of Fig. 4B–D with different regions). LinearTurboFold, by contrast, does not suffer from this problem by virtue of global folding without local windows. Besides SARS-CoV-2, we notice that the estimated structure of the SARS-CoV-1 reference sequence (Fig. 4F) from LinearTurboFold is similar to SARS-CoV-2 (Fig. 4A), which is consistent with the observation that the structure of the FSE region is highly conserved among betacoronaviruses.32 Finally, as negative controls, both the single sequence folding algorithm (LinearPartition in Fig. 4G) and the align-then-fold method (RNAalifold in Fig. S12G) predict quite different structures compared with the LinearTurboFold prediction (Fig. 4A) (39%/61% of pairs from the LinearTurboFold model are not found by LinearPartition/RNAalifold).
In addition to the well-studied UTRs and FSE regions, LinearTurboFold discovers 50 conserved structures with identical structures among 25 genomes, and 26 regions are novel compared to previous studies29, 24 (Fig. 4H and Tab. S4). These novel structures are potential targets for small-molecule drugs33 and antisense oligonucleotides.26, 49 LinearTurboFold also recovers fully conserved base pairs with compensatory mutations (Tab. S3), which imply highly conserved structural regions whose functions might not have been explored. We also provide the whole multiple sequence alignment and predicted structures for 25 genomes from LinearTurboFold (see Fig. S13 for the format and link).
Highly Accessible and Conserved Regions in SARS-CoV-2 and SARS-related Betacoronaviruses
Studies show that the siRNA silencing efficiency, ASO inhibitory efficacy, CRISPR-Cas13 knockdown efficiency, and RT-PCR primer binding efficiency, all correlate with the target region’s accessibility,37, 35, 36, 50 which is the probability of a target site being fully unpaired. However, most existing work for designing siRNAs, ASOs, CRISPR-Cas13 gRNAs, and RT-PCR primers does not take this feature into consideration51, 52 (Tab. S5). Here LinearTurboFold is able to provide more principled design candidates by identifying accessible regions of the target genome. In addition to accessibility, the emerging variants around the world reduce effectiveness of existing vaccines and test kits (Tab. S5), which indicates sequence conservation is another critical aspect for therapeutic and diagnostic design. LinearTurboFold, being a tool for both structural alignment and homologous folding, can identify regions that are both (sequence-wise) conserved and (structurally) accessible, and it takes advantage of not only SARS-CoV-2 variants but also homologous sequences, e.g., SARS-CoV-1 and bat coronavirus genomes, to identify conserved regions from historical and evolutionary perspectives.
To get unstructured regions, Rangan et al.29 imposed a threshold on unpaired probability of each position, which is a crude approximation because the probabilities are not independent of each other. By contrast, the widely-used stochastic sampling algorithm53, 42 builds a representative ensemble of structures by sampling independent secondary structures according to their probabilities in the Boltzmann distribution. Thus the accessibility for a region can be approximated as the fraction of sampled structures in which the region is single-stranded. LinearTurboFold utilized LinearSampling42 to generate 10,000 independent structures for each genome according to the modified partition functions after the iterative refinement (Fig. 1A module 5), and calculated accessibilities for regions at least 15 nt long. We then define accessible regions as those with an accessibility of at least 0.5 in all 16 SARS-CoV-2 genomes (Fig. 5A–B). We also measure the free energy to open a target region [i, j],54 denoted:
$$\Delta G_u[i, j] = -RT\,(\log Z_u[i, j] - \log Z) = -RT \log P_u[i, j]$$
where $Z$ is the partition function which sums up the equilibrium constants of all possible secondary structures, $Z_u[i, j]$ is the partition function over all structures in which the region [i, j] is fully unpaired, R is the universal gas constant and T is the thermodynamic temperature. Therefore $P_u[i, j] = Z_u[i, j] / Z$ is the unpaired probability of the target region and can be approximated via sampling by $s_0/s$, where $s$ is the sample size and $s_0$ is the number of samples in which the target region is single-stranded. The regions whose free energy changes are close to zero need less free energy to open, and are thus more accessible for binding with siRNAs, ASOs, CRISPR-Cas13 gRNAs and RT-PCR primers.
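The sampling-based estimate above can be illustrated with a minimal sketch. It assumes the sampled structures are available as dot-bracket strings; the function names, the toy ensemble, and the temperature of 37 °C are illustrative assumptions and are not taken from LinearSampling or LinearTurboFold.

```python
import math

R_KCAL = 1.98720425864083e-3   # gas constant in kcal/(mol*K)
T_KELVIN = 310.15              # assumed 37 degrees C; not stated in the text


def accessibility(samples, i, j):
    """Fraction s0/s of sampled structures whose region [i, j] is fully unpaired.

    `samples` are dot-bracket strings; i and j are 1-based and inclusive.
    """
    s0 = sum(1 for s in samples if all(c == "." for c in s[i - 1:j]))
    return s0 / len(samples)


def opening_energy(p_unpaired):
    """Free energy to open a region, -RT log P_u, in kcal/mol."""
    if p_unpaired == 0.0:
        return math.inf  # region never open in the sample
    return -R_KCAL * T_KELVIN * math.log(p_unpaired)


# Hypothetical toy ensemble of four sampled structures of length 10.
toy_samples = ["((..))....", "..........", "(((....)))", "....((..))"]
p_u = accessibility(toy_samples, 7, 10)   # region [7, 10]
dg_u = opening_energy(p_u)
print(p_u, dg_u)
```

With a real ensemble the same two functions would be applied to every window of at least 15 nt, keeping the windows whose accessibility is at least 0.5 in all genomes.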
Next, to identify conserved regions that are highly conserved among both SARS-CoV-2 and SARS-related genomes, we require that these regions contain at most three mutated sites on the 9 SARS-related genomes compared to the SARS-CoV-2 reference sequence because historically conserved sites are also unlikely to change in the future,55 and the average sequence identity with reference sequence over a large SARS-CoV-2 dataset is at least 0.999 (here we use a dataset of ∼ 2M SARS-CoV-2 genomes submitted to GISAID up to June 30, 2021†; see Methods §11). Finally, we identified 33 accessible and conserved regions (Fig. 5G and Tab. S6), which are not only structurally accessible among SARS-CoV-2 genomes but also highly conserved among SARS-CoV-2 and SARS-related genomes (Fig. 5C). Because the specificity is also a key factor influencing siRNA efficiency,56 we used BLAST against the human transcript dataset for these regions (Tab. S6). Finally, we also listed the GC content of each region. Among these regions, region 16 corresponds to the internal loop containing the slippery site in the extended FSE region, and it is conserved at both structural and sequence levels (Fig. 5D and 5H). Besides SARS-CoV-2 genomes, the SARS-related genomes such as the SARS-CoV-1 reference sequence (NC_004718.3) and a bat coronavirus (BCoV, MG772934.1) also form similar structures around the slippery site (Fig. 5A). By removing the constraint of conservation on SARS-related genomes, we identified 38 additional candidate regions (Tab. S7) that are accessible but only highly conserved on SARS-CoV-2 variants.
We also designed a negative control by analyzing the SARS-CoV-2 reference sequence alone using LinearSampling, which can also predict accessible regions. However, these regions are not structurally conserved among the other 15 SARS-CoV-2 genomes, resulting in vastly different accessibilities, except for one region in the M gene (Tab. S8). The reason for this difference is that, even with a high sequence identity (over 99.9%), single sequence folding algorithms still predict greatly dissimilar structures for the SARS-CoV-2 genomes (Fig. 5E–F). Both regions (in nsp11 and N genes) are fully conserved among the 16 SARS-CoV-2 genomes, yet they still fold into vastly different structures due to mutations outside the regions; as a result, the accessibilities are either low (nsp11) or in a wide range (N) (Fig. 5D). By contrast, LinearTurboFold folds each sequence with the proclivity of base pairing inferred from all homologous sequences, so its structure predictions are more consistent with each other and can thus detect conserved structures (Fig. 5A–B).
Discussion
The constant emergence of new SARS-CoV-2 variants is reducing the effectiveness of existing vaccines and test kits. To cope with this issue, there is an urgent need to identify conserved structures as promising targets for therapeutics and diagnostics that would work in spite of current and future mutations. Here we presented LinearTurboFold, an end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs, which is the first joint-fold-and-align algorithm to scale to full-length SARS-CoV-2 genomes without imposing any constraints on base-pairing distance. We also demonstrate that LinearTurboFold leads to significant improvement on secondary structure prediction accuracy as well as an alignment accuracy comparable to or higher than all benchmarks.
Unlike existing work on SARS-CoV-2 using local folding and single-sequence folding workarounds, LinearTurboFold enables unprecedented global structural analysis on SARS-CoV-2 genomes; in particular, it can capture long-range interactions, especially the one between 5’ and 3’ UTRs across the whole genome, which matches perfectly with a recent purely experimental work. Over a group of SARS-CoV-2 and SARS-related homologs, LinearTurboFold identifies not only conserved structures supported by compensatory mutations and experimental studies, but also accessible and conserved regions as vital targets for designing efficient small-molecule drugs, siRNAs, ASOs, CRISPR-Cas13 gRNAs and RT-PCR primers. LinearTurboFold is widely applicable to the analysis of other RNA viruses (influenza, Ebola, HIV, Zika, etc.) and full-length genome analysis.
Methods
Detailed descriptions of our algorithms, datasets, and evaluation metrics are available in the online version of the paper.
§1 Pairwise Hidden Markov Model
We use a pairwise Hidden Markov Model (pair-HMM) to align two sequences.57, 58 The model includes three actions (h): aligning two nucleotides from two sequences (ALN), inserting a nucleotide in the first sequence without a corresponding nucleotide in the other sequence (INS1), and inserting a nucleotide in the second sequence without a corresponding nucleotide in the first sequence (INS2). We then define 𝒜(x, y) as the set of all possible alignments for the two sequences, and one alignment a ∈ 𝒜(x, y) as a sequence of m + 2 steps (h, i, j), where (h, i, j) means an alignment step at the position pair (i, j) by the action h. Thus, for the l-th step $a_l = (h_l, i_l, j_l) \in a$, the values of $i_l$ and $j_l$ depend on the action $h_l$ and the positions $i_{l-1}$ and $j_{l-1}$ of $a_{l-1}$:
$$(i_l, j_l) = \begin{cases} (i_{l-1} + 1,\ j_{l-1} + 1) & \text{if } h_l = \text{ALN} \\ (i_{l-1} + 1,\ j_{l-1}) & \text{if } h_l = \text{INS1} \\ (i_{l-1},\ j_{l-1} + 1) & \text{if } h_l = \text{INS2} \end{cases}$$
with (ALN, 0, 0) as the first step, and (ALN, |x| + 1, |y| + 1) as the last one. For two sequences {ACAAGU, AACUG}, one possible alignment {−ACAAGU, AAC−−UG} can be specified as {(ALN, 0, 0) → (INS2, 0, 1) → (ALN, 1, 2) → (ALN, 2, 3) → (INS1, 3, 3) → (INS1, 4, 3) → (ALN, 5, 4) → (ALN, 6, 5) → (ALN, 7, 6)}, where a gap symbol (−) represents a nucleotide insertion in the other sequence at the corresponding position (Fig. S8). The action $h_l$ in each step $(h_l, i_l, j_l)$ corresponds to a line segment starting from the previous node $(i_{l-1}, j_{l-1})$ and stopping at the node $(i_l, j_l)$. Thus the line segment is horizontal, vertical or diagonal towards the top-right corner when $h_l$ is INS1, INS2 or ALN, respectively (Fig. S8).
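The step encoding can be checked on the example above with a small sketch. The function name `alignment_to_steps` and the use of an ASCII hyphen as the gap symbol are assumptions made for illustration; the update rule for (i, j) follows the three actions as described in this section.

```python
def alignment_to_steps(row_x, row_y, gap="-"):
    """Convert two gapped alignment rows into a list of (action, i, j) steps."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    steps = [("ALN", 0, 0)]          # non-emitting first step
    i = j = 0
    for cx, cy in zip(row_x, row_y):
        if cx == gap and cy == gap:
            raise ValueError("a column cannot contain two gaps")
        if cx == gap:                # nucleotide only in the second sequence
            action, j = "INS2", j + 1
        elif cy == gap:              # nucleotide only in the first sequence
            action, i = "INS1", i + 1
        else:                        # two nucleotides aligned
            action, i, j = "ALN", i + 1, j + 1
        steps.append((action, i, j))
    steps.append(("ALN", i + 1, j + 1))  # non-emitting last step
    return steps


steps = alignment_to_steps("-ACAAGU", "AAC--UG")
for step in steps:
    print(step)
```

The printed steps reproduce the path listed in the text, ending at (ALN, 7, 6), i.e. (ALN, |x| + 1, |y| + 1).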
We initialize the first step with the state ALN of probability 1, thus $p_\pi(\text{ALN}) = 1$. $p_t(h_2 \mid h_1)$ is the transition probability from the state $h_1$ to $h_2$, and $p_e((c_1, c_2) \mid h_1)$ is the probability of the state $h_1$ emitting a character pair $(c_1, c_2)$ with values from {A, G, C, U, −}. Both the emission and transition probabilities were taken from TurboFold II. The function $e(\cdot)$ yields a character pair based on $a_l$ and the nucleotides of two sequences:
$$e(a_l) = \begin{cases} (x_{i_l},\ y_{j_l}) & \text{if } h_l = \text{ALN} \\ (x_{i_l},\ -) & \text{if } h_l = \text{INS1} \\ (-,\ y_{j_l}) & \text{if } h_l = \text{INS2} \end{cases}$$
where $x_i$ and $y_j$ are the i-th and j-th nucleotides of sequences x and y, respectively. Note that the first step $a_0 = (\text{ALN}, 0, 0)$ and the last $a_{m+1} = (\text{ALN}, |x| + 1, |y| + 1)$ do not have emissions.
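We denote by $\alpha^{h}_{i,j}$ the forward probability, which encompasses the probability of the partial alignments of x and y up to positions i and j, summed over all the alignments that go through the step (h, i, j):
$$\alpha^{h}_{i,j} = \sum_{a[:k],\ a_k = (h, i, j)} p_\pi(\text{ALN}) \prod_{l=1}^{k} p_t(h_l \mid h_{l-1})\, p_e(e(a_l) \mid h_l)$$
where a[: k] indicates the partial alignments from the starting node up to the k-th step and $a_k = (h, i, j)$; the emission term is taken as 1 for the non-emitting last step. In Fig. S8B, C and D, a forward probability corresponds to the region circled by the blue dashed lines. Similarly, the backward probability $\beta^{h}_{i,j}$ assembles the probability of partial alignments a[k + 1 :] from the (k + 1)-th step up to the last one:
$$\beta^{h}_{i,j} = \sum_{a[k+1:],\ a_k = (h, i, j)} \prod_{l=k+1}^{m+1} p_t(h_l \mid h_{l-1})\, p_e(e(a_l) \mid h_l)$$
Backward probabilities correspond to the regions circled by the yellow dashed line (Fig. S8B, C and D). Thus, the probability of observing two sequences p(x, y) is $\alpha^{\text{ALN}}_{|x|+1,|y|+1}$ or $\beta^{\text{ALN}}_{0,0}$.
§2 Posterior Co-incidence Probability Computation
Nucleotide positions i and j in two sequences x and y are said to be co-incident (notated as i ∼ j) in an alignment a if the alignment path goes through the node (i, j).57 Since the node (i, j) is reachable by three actions ℋ = {ALN, INS1, INS2}, the co-incidence probability for a position pair (i, j) given two sequences is:
$$p(i \sim j \mid x, y) = \sum_{h \in \mathcal{H}} \ \sum_{a \in \mathcal{A}(x, y),\ (h, i, j) \in a} \frac{p(x, y, a)}{p(x, y)} \qquad (1)$$
where p(x, y, a) is the probability of two sequences with the alignment a, and p(x, y) is the probability of observing two sequences, which is the sum of the probabilities of all the possible alignments:
$$p(x, y) = \sum_{a \in \mathcal{A}(x, y)} p(x, y, a)$$
The co-incidence probability for positions i and j (Equation 1) can be computed by:
$$p(i \sim j \mid x, y) = \frac{\sum_{h \in \mathcal{H}} \alpha^{h}_{i,j}\, \beta^{h}_{i,j}}{p(x, y)}$$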
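As a concrete illustration of the forward-backward computation, the following is a minimal exhaustive (quadratic-time) sketch on a toy pair-HMM. All transition and emission values are hypothetical placeholders, not the TurboFold II parameters, and the names `coincidence`, `emit`, `P_T` and `MOVE` are introduced only for this example. The non-emitting first and last ALN steps are handled as described in §1.

```python
import math

STATES = ("ALN", "INS1", "INS2")
MOVE = {"ALN": (1, 1), "INS1": (1, 0), "INS2": (0, 1)}  # (di, dj) per action
# Hypothetical toy transition probabilities, NOT the TurboFold II values.
P_T = {g: {"ALN": 0.8, "INS1": 0.1, "INS2": 0.1} for g in STATES}


def emit(h, x, y, i, j):
    """Toy emission probability of step (h, i, j); positions are 1-based."""
    if h == "ALN":
        return 0.16 if x[i - 1] == y[j - 1] else 0.03
    return 0.25  # INS1 emits (x_i, -), INS2 emits (-, y_j)


def coincidence(x, y):
    n, m = len(x), len(y)
    alpha = {h: [[0.0] * (m + 2) for _ in range(n + 2)] for h in STATES}
    beta = {h: [[0.0] * (m + 2) for _ in range(n + 2)] for h in STATES}

    # Forward: alpha[h][i][j] sums all partial alignments ending in (h, i, j).
    alpha["ALN"][0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            for h, (di, dj) in MOVE.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                inflow = sum(alpha[g][pi][pj] * P_T[g][h] for g in STATES)
                alpha[h][i][j] = inflow * emit(h, x, y, i, j)
    # Non-emitting last step (ALN, n + 1, m + 1).
    alpha["ALN"][n + 1][m + 1] = sum(
        alpha[g][n][m] * P_T[g]["ALN"] for g in STATES)

    # Backward: beta[g][i][j] sums all completions after step (g, i, j).
    beta["ALN"][n + 1][m + 1] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            for g in STATES:
                if (i, j) == (n, m):
                    beta[g][i][j] = P_T[g]["ALN"]  # only the last step remains
                    continue
                beta[g][i][j] = sum(
                    P_T[g][h] * emit(h, x, y, i + di, j + dj)
                    * beta[h][i + di][j + dj]
                    for h, (di, dj) in MOVE.items()
                    if i + di <= n and j + dj <= m)

    p_xy = alpha["ALN"][n + 1][m + 1]
    co = [[sum(alpha[h][i][j] * beta[h][i][j] for h in STATES) / p_xy
           for j in range(m + 1)] for i in range(n + 1)]
    return co, p_xy, beta["ALN"][0][0]


co, p_xy, p_xy_backward = coincidence("ACAAGU", "AACUG")
print(p_xy, p_xy_backward)
```

The forward value at the last step and the backward value at the first step give the same p(x, y), and the nodes (0, 0) and (|x|, |y|), which every alignment path visits, have co-incidence probability 1.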
§3 LinearAlignment
Unlike a previous method57 that fills out all the nodes in the alignment matrix by columns (Fig. S8), LinearAlignment scans the matrix based on the step count s, which is the sum of i and j (s = i + j) for the partial alignments of x[1,i] and y[1,j]. As shown in the pseudocode (Fig. S9), the forward phase starts from the node (0, 0) in the state ALN of probability 1, then iterates the step count s from 0 to |x| + |y| − 1. For each step count s with a specific state h from ℋ, we first collect all the nodes (i, j) with step count s whose forward probability for the state h already exists, which means the position pair (i, j) has been visited via the state h before. Then each node makes transitions to the next nodes (i + 1, j + 1), (i + 1, j) and (i, j + 1) by the three states ALN, INS1 and INS2, and updates the corresponding forward probabilities, respectively.
The algorithm described so far is still an exhaustive search and costs quadratic time and space for all the |x| × |y| nodes. To reduce the runtime, LinearAlignment uses the beam search heuristic40 and keeps a limited number of promising nodes at each step. For each step count s with a state h, LinearAlignment first applies beam search over B(s, h), which is the collection of all the nodes (i, j) with step count s whose forward probability for the state h exists (Fig. S9 line 6). The algorithm only keeps the top $b_1$ nodes with the highest forward scores in B(s, h), and only these are subsequently allowed to make transitions to the next states. Here $b_1$ is a user-specified beam size with a default value of 100. In total, $O(b_1 n)$ nodes survive because the step count s takes |x| + |y| values and each step count keeps $b_1$ nodes. For simplicity, we show the topological order and the beam search method with alignment examples (Fig. S8A), while the forward-backward algorithm adopts the same idea by summing the probabilities of all the possible alignments.
After the forward phase, the backward phase (Fig. S9) performs in linear time to calculate the co-incidence probabilities automatically because only a linear number of nodes in B(s, h) are stored. Thus by pruning low-scoring candidates at each step in the forward algorithm, we reduce the runtime from O(n2) to O(b1n) for aligning two sequences. For k input homologous sequences, LinearTurboFold computes posterior co-incidence probabilities for each pair of sequences by LinearAlignment, which costs O(k2b1n) runtime in total.
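The step-count ordering and beam pruning described above can be sketched in a few lines of Python. This is a minimal illustration only: the emission and transition values in `emit` and `trans` are made-up toy numbers rather than the trained HMM parameters, the function names are hypothetical, and only the forward phase is shown (the backward phase would visit the surviving nodes in reverse step order).

```python
STATES = ("ALN", "INS1", "INS2")
# Number of nucleotides of x and y consumed when entering each state.
MOVE = {"ALN": (1, 1), "INS1": (1, 0), "INS2": (0, 1)}

def emit(state, x, y, i, j):
    """Toy emission probability for entering `state` at node (i, j), 1-based."""
    if state == "ALN":
        return 0.20 if x[i - 1] == y[j - 1] else 0.02
    return 0.25  # a single inserted nucleotide

def trans(prev, nxt):
    """Toy transition probabilities (each row sums to 1)."""
    if prev == "ALN":
        return 0.8 if nxt == "ALN" else 0.1
    return 0.6 if nxt == "ALN" else 0.2

def beam_forward(x, y, beam=100):
    """Forward pass ordered by step count s = i + j, keeping `beam` nodes per (s, h)."""
    n, m = len(x), len(y)
    # by_step[s][h] maps node (i, j) -> forward probability alpha_h(i, j).
    by_step = [{h: {} for h in STATES} for _ in range(n + m + 1)]
    by_step[0]["ALN"][(0, 0)] = 1.0
    for s in range(n + m):
        for h in STATES:
            nodes = by_step[s][h]
            if len(nodes) > beam:  # beam pruning: keep the top-scoring nodes only
                kept = sorted(nodes.items(), key=lambda kv: kv[1], reverse=True)[:beam]
                nodes = by_step[s][h] = dict(kept)
            for (i, j), alpha in nodes.items():
                for h2 in STATES:
                    di, dj = MOVE[h2]
                    i2, j2 = i + di, j + dj
                    if i2 > n or j2 > m:
                        continue
                    p = alpha * trans(h, h2) * emit(h2, x, y, i2, j2)
                    target = by_step[i2 + j2][h2]
                    target[(i2, j2)] = target.get((i2, j2), 0.0) + p
    total = sum(by_step[n + m][h].get((n, m), 0.0) for h in STATES)
    return total, by_step
```

Because ALN advances the step count by two and the insertion states by one, every node at step s has received all of its incoming probability before it is expanded, so the step count is a valid topological order. With a very large beam the sketch degenerates to the exhaustive quadratic algorithm.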
§4 Match Scores Computation and Modified LinearAlignment
To encourage the pairwise alignment to conform with the estimated secondary structures, LinearTurboFold predicts structural alignments by incorporating the secondary structural conformation. PMcomp59 first proposed the match score to measure the structural similarity for position pairs between a pair of sequences, and TurboFold II adapts it as a prior. Based on the base pair probabilities $P_x(i, j)$ estimated from the partition function for a sequence x, a position i can be paired with a base upstream, paired with a base downstream, or unpaired, with corresponding probabilities $P_{x,>}(i) = \sum_{j<i} P_x(i, j)$, $P_{x,<}(i) = \sum_{j>i} P_x(i, j)$ and $P_{x,o}(i) = 1 - P_{x,>}(i) - P_{x,<}(i)$, respectively. The match score $m_{x,y}(i, j)$ for two positions i and j from two sequences x and y is based on the probabilities of these three structural propensities from the last iteration (t − 1):

$$m_{x,y}(i, j) = \alpha_1 \left[ \sqrt{P^{(t-1)}_{x,>}(i)\, P^{(t-1)}_{y,>}(j)} + \sqrt{P^{(t-1)}_{x,<}(i)\, P^{(t-1)}_{y,<}(j)} \right] + \alpha_2 \sqrt{P^{(t-1)}_{x,o}(i)\, P^{(t-1)}_{y,o}(j)} + \alpha_3$$

where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are weight parameters trained in TurboFold II. The forward-backward phases integrate the match score as a prior when aligning two nucleotides (Fig. S9 line 10, and Fig. S9 line 12).
TurboFold II separately pre-computes match scores for all the $O(n^2)$ position pairs for pairs of sequences before the HMM alignment calculation. However, only a linear number of pairs, $O(b_1 n)$, survive after applying the beam pruning in LinearAlignment. To avoid redundant time and space usage, LinearTurboFold calculates the corresponding match scores for co-incident pairs when they are first visited in LinearAlignment. Overall, for k homologous sequences, LinearTurboFold reduces the runtime of the whole module of pairwise posterior co-incidence probability computation from $O(k^2 n^2)$ to $O(k^2 b_1 n)$ by applying the beam search heuristic to the pairwise HMM alignment, and only calculating the match scores for position pairs that are needed.
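As a rough illustration of the match score, the sketch below derives the three structural propensities from a sparse base pair probability table and combines them. It assumes the TurboFold II form of the score (a weighted sum of square-root overlaps of the upstream, downstream and unpaired propensities, plus a constant); the function names are hypothetical and the default weights `a1`, `a2`, `a3` are illustrative placeholders, not the trained values.

```python
import math

def propensities(pair_probs, n):
    """Per-position (upstream, downstream, unpaired) probabilities.

    pair_probs maps 1-based pairs (i, j) with i < j to base pair probabilities.
    """
    up = [0.0] * (n + 1)    # paired with a base at a smaller index
    down = [0.0] * (n + 1)  # paired with a base at a larger index
    for (i, j), p in pair_probs.items():
        down[i] += p
        up[j] += p
    unpaired = [max(0.0, 1.0 - up[i] - down[i]) for i in range(n + 1)]
    return up, down, unpaired

def match_score(px, py, i, j, a1=1.0, a2=0.8, a3=0.5):
    """Structural similarity of position i in x and position j in y.

    a1, a2, a3 are placeholder weights chosen for illustration only.
    """
    up_x, down_x, un_x = px
    up_y, down_y, un_y = py
    paired = math.sqrt(up_x[i] * up_y[j]) + math.sqrt(down_x[i] * down_y[j])
    return a1 * paired + a2 * math.sqrt(un_x[i] * un_y[j]) + a3
```

In a lazy variant, `match_score` would be called only for the nodes (i, j) that survive beam pruning, which is what keeps the number of evaluations linear in the sequence length.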
§5 Extrinsic Information Calculation
To update partition functions for each sequence with the structural information from homologs, TurboFold19 introduces extrinsic information to model the proclivity for base pairing induced from the other sequences in the input set 𝒮. The extrinsic information $e_x(i, j)$ for a base pair (i, j) in the sequence x maps the estimated base pairing probabilities of other sequences to the target sequence via the co-incident nucleotides between each pair of sequences:

$$e_x(i, j) = \sum_{y \in \mathcal{S} \setminus \{x\}} (1 - s_{x,y}) \sum_{k, l} P_y^{(t-1)}(k, l)\; p^{(t)}(i \sim k \mid x, y)\; p^{(t)}(j \sim l \mid x, y)$$

where $P_y^{(t-1)}(k, l)$ is the base pair probability for a base pair (k, l) in the sequence y from the (t − 1)th iteration, and $p^{(t)}(i \sim k \mid x, y)$ and $p^{(t)}(j \sim l \mid x, y)$ are the posterior co-incidence probabilities for position pairs (i, k) and (j, l), respectively, from the (t)th iteration. The extrinsic information first sums the base pair probabilities of all alignable pairs from one other sequence, weighted by the co-incidence probabilities, and then iterates over all the other sequences. $s_{x,y}$ is the sequence identity for sequences x and y. Sequences with a low identity contribute more to the extrinsic information than sequences of higher identity. The sequence identity is defined as the fraction of nucleotides that are aligned and identical in the alignment.
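The following sketch shows how the extrinsic information for a single base pair could be accumulated from sparse co-incidence and base pair tables. It assumes the TurboFold weighting in which each other sequence y contributes with weight (1 − s_{x,y}); the function name and the data layout (`others` as a list of tuples) are hypothetical choices made for this illustration.

```python
def extrinsic(i, j, others):
    """Extrinsic information for base pair (i, j) of the target sequence.

    others: one tuple (identity, coinc, pair_probs) per other sequence y, where
      identity   is the sequence identity s_{x,y} in [0, 1],
      coinc[i]   is a dict {k: p(i ~ k)} of surviving co-incident positions,
      pair_probs maps (k, l) with k < l to the base pair probability in y.
    """
    total = 0.0
    for identity, coinc, pair_probs in others:
        inner = 0.0
        for k, p_ik in coinc.get(i, {}).items():
            for l, p_jl in coinc.get(j, {}).items():
                if k < l:
                    inner += p_ik * p_jl * pair_probs.get((k, l), 0.0)
        total += (1.0 - identity) * inner  # dissimilar homologs weigh more
    return total
```

Because beam pruning leaves only a bounded number of co-incident partners per position, the two inner loops stay small regardless of sequence length.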
§6 LinearPartition for Base Pairing Probabilities Estimation with Extrinsic Information
The classical partition function algorithm scales cubically with sequence length, which makes it impractical for long sequences. To address this bottleneck, our recent LinearPartition38 algorithm approximates the partition function and base pairing probability matrix computation in linear time. LinearPartition is significantly faster, and correlates better with the ground truth structures than the traditional cubic partition function calculation. Thus LinearTurboFold uses LinearPartition to predict base pair probabilities instead of the traditional $O(n^3)$-time partition function.
TurboFold introduces the extrinsic information in the partition function as a pseudo-free energy term for each base pair (i, j). Similarly, in LinearPartition, for each span [i, j], which is the subsequence $x_i \ldots x_j$, and its associated partition function Q(i, j), the partition function is modified as

$$\tilde{Q}(i, j) = Q(i, j) \cdot e_x(i, j)^{\lambda}$$

if $(x_i, x_j)$ is an allowed pair, where λ denotes the contribution of the extrinsic information relative to the intrinsic information. Specifically, at each step j, among all possible spans [i, j] where $x_i$ and $x_j$ are paired, we replace the original partition function Q(i, j) with $\tilde{Q}(i, j)$ by multiplying in the extrinsic information. Then LinearTurboFold applies the beam pruning heuristic over the modified partition function instead of the original.
Similarly, TurboFold II obtains the extrinsic information for all the $O(n^2)$ base pairs before the partition function calculation of each sequence, while only a linear number of base pairs survive in LinearPartition. Thus, LinearTurboFold only requires the extrinsic information for those promising base pairs that are visited in LinearPartition. Overall, for k homologous sequences, LinearTurboFold reduces the runtime of base pair probability estimation from $O(kn^3 + k^2 n^2)$ to a runtime that is linear in the sequence length n by applying the beam search heuristic to the partition function calculation, and only calculating extrinsic information for the saved base pairs.
§7 MSA Generation and Secondary Structure Prediction
After several iterations, TurboFold II builds the multiple sequence alignment using a probabilistic consistency transformation, generating a guide tree and performing progressive alignment over the pairwise posterior co-incidence probabilities.22 The whole procedure is accelerated by exploiting sparse matrices, discarding alignment pairs with probability smaller than a threshold (0.01 by default). Since LinearAlignment uses the beam search method and only saves a linear number of co-incident pairs, the runtime of MSA generation in LinearTurboFold is straightforwardly linear in the sequence length.
Estimated base pair probabilities are fed into downstream methods to predict secondary structures. To maintain the end-to-end linear-time property, LinearTurboFold uses ThreshKnot,41 which is a thresholded version of ProbKnot60 and only considers base pairs of probability exceeding a threshold θ (θ = 0.3 by default). We evaluate the performance of ThreshKnot and MEA with different hyperparameters (θ and γ). On a sampled RNAStrAlign training set, the ThreshKnot curve lies closer to the upper right-hand corner than the MEA curve, which indicates that ThreshKnot has a higher Sensitivity than MEA at a given PPV (Fig. S10B).
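The thresholding step can be illustrated with a short sketch. It follows the usual description of ProbKnot (keep a pair only if each base is the other's most probable partner) with the additional probability cutoff θ; the function name and the sparse dictionary input are assumptions of this example rather than the actual implementation.

```python
def threshknot(pair_probs, theta=0.3):
    """Select base pairs that are mutually most probable and exceed theta.

    pair_probs maps pairs (i, j) with i < j to base pair probabilities.
    Returns the selected pairs as a sorted list of (i, j) tuples.
    """
    best = {}  # position -> (probability, partner) of its best surviving pair
    for (i, j), p in pair_probs.items():
        if p <= theta:
            continue  # pairs at or below the threshold are never considered
        for a, b in ((i, j), (j, i)):
            if a not in best or p > best[a][0]:
                best[a] = (p, b)
    # Keep (i, j) only when i's best partner is j and j's best partner is i.
    return sorted(
        (i, j) for i, (_, j) in best.items() if i < j and best[j][1] == i
    )
```

A single pass over the stored probabilities suffices, so the cost is linear in the number of surviving base pairs. Nothing in the selection forbids crossing pairs, which is why this family of methods can output pseudoknots.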
§8 Efficiency and Scalability Datasets
Four datasets are built and used for measuring efficiency and scalability. To evaluate the efficiency and scalability of LinearTurboFold with sequence length, we collected groups of homologous RNA sequences with sequence length ranging from 200 nt to 29,903 nt with a fixed group size of 5. Sequences are sampled from the RNAStrAlign dataset,18 the Comparative RNA Web (CRW) Site,61 the Los Alamos HIV database (http://www.hiv.lanl.gov/) and the SARS-related betacoronaviruses (SARS-related).44 RNAStrAlign, aggregated and released with TurboFold II, is an RNA alignment and structure database. Sequences in RNAStrAlign are categorized into families, i.e. sets of homologs, and some families are further split into subfamilies. Each subfamily or family includes a multiple sequence alignment and ground truth structures for all the sequences. 20 groups of five homologs were randomly chosen from the small subunit ribosomal RNA (Alphaproteobacteria subfamily), SRP RNA (Protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA families. For longer sequences, we sampled five groups each of 23S rRNA (of sequence length ranging from 2,700 nt to 2,926 nt) from the CRW Site, HIV-1 genetic sequences (of sequence length ranging from 9,597 nt to 9,738 nt) from the Los Alamos HIV database, and SARS-related sequences (of sequence length ranging from 29,484 nt to 29,903 nt). All the sequences in one group belong to the same subfamily or subtype. We sampled five groups for each family and obtained 35 groups in total. Due to runtime and memory limitations, we did not run TurboFold II on SARS-CoV-2 groups (Fig. 2, A and D).
To assess the runtime and memory usage of LinearTurboFold with group size, we fixed the sequence length around 1,500 nt, and sampled 5 groups of sequences from the small subunit ribosomal RNA (Alphaproteobacteria subfamily) with group size 5, 10, 15 and 20, respectively (Fig. 2, B and F). We used a Linux machine (CentOS 7.7.1908) with 2.30 GHz Intel Xeon E5-2695 v3 CPU and 755 GB memory, and gcc 4.8.5 for benchmarks.
We built a test set from the RNAStrAlign dataset to measure and compare the performance of LinearTurboFold and other methods. 60 groups of input sequences, each consisting of five homologous sequences, were randomly selected from the small subunit ribosomal RNA (rRNA) (Alphaproteobacteria subfamily), SRP RNA (Protozoan subfamily), RNase P RNA (bacterial type A subfamily) and telomerase RNA families from the RNAStrAlign dataset. We removed sequences shorter than 1,200 nt for the small subunit rRNA to filter out subdomains, and removed sequences shorter than 200 nt for SRP RNA, following the TurboFold II paper, to filter out less reliable sequences. We resampled the test set five times and show the average PPV, Sensitivity and F1 scores over the five samples (Fig. 2, C and F).
An RNAStrAlign training set was built to compare accuracies between MEA and ThreshKnot. 40 groups of 3, 5 and 7 homologs were randomly sampled from the 5S ribosomal RNA (Eubacteria subfamily), group I intron (IC1 subfamily), tmRNA, and tRNA families from the RNAStrAlign dataset. We chose θ = 0.1, 0.2, 0.3, 0.4 and 0.5 for ThreshKnot, and γ = 1, 1.5, 2, 2.5, 3, 3.5, 4, 8 and 16 for MEA. We reported the average secondary structure prediction accuracies (PPV and Sensitivity) across all training families (Fig. S10B).
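For reference, the three accuracy measures can be computed from a predicted and a ground truth set of base pairs as sketched below. The sketch assumes exact matching of base pairs; whether the evaluation tolerates small positional shifts is not stated here, so that option is left out, and the function name is hypothetical.

```python
def structure_accuracy(predicted, truth):
    """Return (PPV, Sensitivity, F1) for predicted vs. ground truth base pairs.

    Both arguments are iterables of (i, j) tuples; only exact matches count.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    ppv = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    if ppv + sensitivity == 0.0:
        return ppv, sensitivity, 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return ppv, sensitivity, f1
```

PPV is the fraction of predicted pairs that are correct, Sensitivity is the fraction of true pairs that are recovered, and F1 is their harmonic mean.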
§9 Benchmarks
The Sankoff algorithm11 uses dynamic programming to simultaneously fold and align two or more sequences, and it requires $O(n^{3k})$ time and $O(n^{2k})$ space for k input sequences with the average length n. Both LocARNA12 and MXSCARNA14 are Sankoff-style algorithms.
LocARNA (local alignment of RNA) costs $O(n^2(n^2 + k^2))$ time and $O(n^2 + k^2)$ space by restricting the alignable regions. MXSCARNA progressively aligns multiple sequences as an extension of the pairwise alignment algorithm SCARNA62 with improved score functions. SCARNA first aligns stem fragment candidates, then removes the inconsistent matches in post-processing to generate the sequence alignment. MXSCARNA reduces runtime to $O(k^3 n^2)$ and space to $O(k^2 n^2)$ by limiting the search space of folding and alignment. Both MXSCARNA and LocARNA use pre-computed base pair probabilities for each sequence as structural input. All the benchmarks use the default options and hyperparameters and are run on the RNAStrAlign test set. TurboFold II iterates three times, then predicts secondary structures by MEA (γ = 1). LinearTurboFold also runs three iterations with default beam sizes ($b_1 = b_2 = 100$) in LinearAlignment and LinearPartition, then predicts structures with ThreshKnot (θ = 0.3).
§10 Significance Test
We use a paired, two-tailed permutation test63 to assess whether differences are statistically significant. Following common practice, the repetition number is 10,000, and the significance threshold α is 0.05.
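A paired, two-tailed permutation test of this kind can be sketched as a sign-flipping procedure over per-item score differences. The sketch below is one common formulation and is not taken from the benchmark scripts; the function name, the use of the mean difference as the test statistic, and the fixed seed are assumptions of this example.

```python
import random

def paired_permutation_test(scores_a, scores_b, repetitions=10_000, seed=0):
    """Two-tailed p-value for the mean paired difference between two methods.

    scores_a[i] and scores_b[i] are the scores of the two methods on item i.
    Under the null hypothesis the two labels are exchangeable within a pair,
    so each difference keeps or flips its sign with probability 1/2.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(repetitions):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance so that ties with the observed value are counted.
        if abs(permuted / len(diffs)) >= observed - 1e-12:
            hits += 1
    return hits / repetitions
```

The returned fraction would then be compared with the significance threshold, for example `paired_permutation_test(f1_a, f1_b) < 0.05`, where `f1_a` and `f1_b` are hypothetical lists of per-group F1 scores.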
§11 SARS-CoV-2 Datasets
We used two large SARS-CoV-2 datasets. The first dataset is used to draw a representative sample of the most diverse SARS-CoV-2 genomes. We downloaded all the genomes submitted to GISAID43 by December 29, 2020 (downloaded on December 29, 2020), filtered out low-quality genomes (with more than 5% unknown characters and degenerate bases, shorter than 29,500 nt, or with framing errors in the coding region), and also discarded genomes with more than 600 mutations compared with the SARS-CoV-2 reference sequence (NC_045512.2).64 After preprocessing, this dataset includes about 258,000 genomes. To identify a representative group of samples with more variable mutations, we designed a greedy algorithm to select the 16 most diverse genomes among those found at least twice in the 258,000 genomes. The general idea of the greedy algorithm is to choose genomes one by one, each time picking the genome with the most new mutations compared with the already selected samples, which consist of only the reference sequence at the beginning.
The second, larger dataset is used to evaluate the conservation of regions with respect to more up-to-date variants. We downloaded all the genomes submitted to GISAID by June 30, 2021 (downloaded on July 25, 2021), and applied the same preprocessing as for the first dataset. This resulted in a dataset of ∼2M genomes, which was used to evaluate conservation in Figure 5 and Tables S5, S6, S7.
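The greedy selection can be sketched as follows. This is one plausible reading of the description, in which "new mutations" means mutations not yet present in any selected genome; the function name, the representation of each genome as a set of mutation labels relative to the reference, and the alphabetical tie-breaking are assumptions of this example. Filtering to genomes found at least twice is assumed to happen beforehand.

```python
def select_diverse(mutations_by_genome, count):
    """Greedily pick up to `count` genomes that add the most unseen mutations.

    mutations_by_genome maps a genome name to its set of mutations relative
    to the reference sequence. The reference itself contributes no mutations,
    so the covered set starts empty.
    """
    covered = set()
    chosen = []
    candidates = dict(mutations_by_genome)
    while candidates and len(chosen) < count:
        # Iterate in sorted order so that ties are broken deterministically.
        name = max(sorted(candidates), key=lambda g: len(candidates[g] - covered))
        if not candidates[name] - covered:
            break  # no remaining genome adds anything new
        chosen.append(name)
        covered |= candidates.pop(name)
    return chosen
```

Each round scans all remaining candidates, so selecting a handful of genomes from a large collection costs a number of passes over the data equal to the number of genomes selected.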