Abstract
Translation elongation plays a central role in multiple aspects of protein biogenesis, e.g., differential expression, cotranslational folding and secretion. However, our current understanding on the regulatory mechanisms underlying translation elongation dynamics and the functional roles of ribosome stalling in protein synthesis still remains largely limited. Here, we present a deep learning-based framework, called ROSE, to effectively decipher the contextual regulatory code of ribosome stalling and reveal its functional connections to translational control of protein expression from ribosome profiling data. Our validation results on both human and yeast datasets have demonstrated superior performance of ROSE over conventional prediction models. With high prediction accuracy, ROSE provides a precise index to estimate the translation elongation rate at codon resolution. We have demonstrated that ROSE can successfully decode diverse regulatory factors of ribosome stalling, including codon usage bias, tRNA adaptation, codon cooccurrence bias, proline codons, N6-methyladenosine (m6A) modification, mRNA secondary structure and protein-nucleotide binding. In addition, our comprehensive genome-wide in silico studies based on ROSE have revealed notable functional interplay between elongation dynamics and several cotranslational events in protein biogenesis, including protein targeting by the signal recognition particle (SRP) and protein secondary structure formation. Furthermore, our intergenic analysis suggests that the enriched ribosome stalling events at the 5’ end of the coding sequences (also referred to as the ramp sequences) can be involved in the modulation of translation efficiency. These findings indicate that ROSE can offer a powerful tool to analyze the large-scale ribosome profiling data and provide novel insights into the landscape of ribosome stalling, which will further expand our understanding on translation elongation dynamics.
Introduction
The translation process, including translation initiation, elongation and termination, is a fundamental biological process that delivers genetic information to functional proteins in living cells. The dysregulation of translation has been shown to be associated with a variety of diseases, such as neurological disorder and cancers [1]. Elongation is a crucial step of mRNA translation after initiation, in which the ribosome scans the mRNA sequence and gradually grows the nascent peptide chain by appending new amino acids (Supplementary Figure 1). Although numerous studies have shown that the local elongation rate along an mRNA sequence varies a lot, the underlying regulatory mechanisms of this phenomenon still remains far from clear [2–6]. On the other hand, translation elongation plays essential roles in diverse aspects of protein biogenesis, such as differential expression, cotranslational folding, covalent modification and secretion [3, 4, 6]. In addition, the connections between the elongation rate and human health are increasingly emerging, which further underscores the necessity of a well understanding on the regulatory mechanisms and functions of ribosome stalling and translation elongation dynamics [3, 7].
Gene expression profiling techniques, such as microarrays [8] and RNA-Seq [9], which measure mRNA abundance at a transcriptome-wide level, have been routinely used to study the regulation of gene expression. However, most of the biological questions with respect to the translational control of protein expression cannot be addressed by these methods. In recent years, ribosome profiling has emerged as a high-throughput sequencing-based approach to measure the ribosome occupancy on mRNAs at a translatome-wide level in vivo [2, 4, 5, 10, 11]. With an accurate inference of the ribosome A-site (i.e., the entry position of aminoacyl-tRNA) in a ribosome-protected fragment (also referred to as the ribosome footprint, ~30 nucleotides), ribosome profiling provides a genome-wide snapshot of translation elongation dynamics and offers a new angle to estimate translation efficiency. Currently, ribosome profiling has been widely used to study a number of important biological problems related to translation [2, 4, 5], such as the identification of novel or alternatively translated genome regions [12–19], and the discovery of critical regulatory factors in translational control and protein biogenesis [20–36]. Based on the current available large-scale studies involving ribosome profiling experiments, several databases, e.g., GWIPS-viz [37] and RPFdb [38], have been established to collect and curate these profiling data.
Although a large amount of sequencing data have been produced by ribosome profiling experiments, researchers are still challenged by the complexity, heterogeneity and insufficient coverage of these data to derive unbiased and biologically relevant conclusions [2, 4, 5]. Recently, deep learning has become one of the most popular and powerful techniques in the machine learning field [39, 40]. Its superiority over traditional machine learning models has been demonstrated in a wide range of applications, such as speech recognition [41], image classification [40] and natural language processing [42]. Specifically, deep learning has also been successfully applied to analyze large-scale genomic data and uncover notable biological patterns [43–48]. In this work, we propose a deep learning-based framework, called ROSE (RibosOme Stalling Estimator), to address the aforementioned challenges and model translation elongation dynamics based on the ribosome profiling data.
Our framework ROSE casts the modeling problem into a classification task and predicts the ribosome stalling events by integrating the encoded sequence features. It is trained based on both human and yeast ribosome profiling data to derive evolutionarily conserved conclusions about the underlying regulatory mechanisms of ribosome stalling and translation elongation dynamics. Our validation results have demonstrated that ROSE can greatly outperform conventional machine learning models for analyzing genomic sequence data. We have shown that ROSE can successfully decipher diverse regulatory factors of ribosome stalling, including codon usage bias, tRNA adaptation, codon cooccurrence bias, proline codons, N6-methyladenosine (m6A) modification, mRNA secondary structure and protein-nucleotide binding, and provide an accurate estimate of the elongation rate at codon resolution. Moreover, our comprehensive in silico studies reveal several notable regulatory relations between elongation dynamics and the cotranslational events in protein biogenesis. In particular, our analysis identifies interesting interplay between elongation dynamics and the signal recognition particle (SRP) binding of the transmembrane (TM) segments. Our results indicate that ribosomes tend to stall with high probability at a position ~50 codons downstream from a TM segment, which may promote the molecular recognition by SRP. In addition, our studies show that protein secondary structure elements (SSEs) and their transition patterns tend to be highly correlated with the local elongation rates, which may imply an essential regulatory effect of elongation dynamics on cotranslational folding. Furthermore, our intergenic analysis suggests that the enriched ribosome stalling events in the ramp sequences (i.e., the 5′ end of the coding sequences) could be highly important in the modulation of translation efficiency. These results demonstrate that ROSE can offer an effective and powerful tool for analyzing large-scale ribosome profiling data and provide new insights to understand the landscape of ribosome stalling, which is crucial for revealing the contextual regulation and the functional roles of translation elongation dynamics.
Results
Designing, training and validating ROSE
We propose a deep learning-based framework, called RibosOme Stalling Estimator (ROSE), to analyze the large-scale ribosome profiling data and study the contextual regulation of ribosome stalling and its functions in protein biogenesis (Fig. 1a). Unlike previous work that characterized translation elongation dynamics mainly using the stochastic simulation approaches [22, 27, 28], ROSE formalizes the modeling problem into a classification task, in which the resulting prediction score can be used to measure the probability of a ribosome stalling event. In this classification framework, those codon positions in which the ribosome footprint densities exceed 95% of the profiled regions are defined as the positive samples representing the occurrences of ribosome stalling, while the remaining sites are regarded as the negative samples (Online Methods).
We assume that a ribosome stalling event is primarily determined by its surrounding sequence. The codon position of interest, i.e., the ribosome A-site (Supplementary Fig. 1), is first extended both upward and downward by 30 codons, which yields the codon sequence profile of a putative stalling event. We then encode this sequence and feed the encoded features into a deep convolutional neural network (CNN) to learn the complex relations between ribosome stalling and its contextual features (Fig. 1b). We call the prediction score directly outputted by the CNN the intergenic ribosome stalling score, which is also termed interRSS (Online Methods). The name “interRSS” comes from the fact that all the scores along the genome are calculated by a universal model and can be compared intergenetically under the same criterion. To further eliminate the interRSS bias among different genes and facilitate the study on the interplay between the intragenic factors (e.g., the binding of the signal recognition particle (SRP) on the transmembrane segments) and elongation dynamics, we also normalize the interRSS within each gene and obtain the intragenic ribosome stalling score, which is also termed intraRSS (Online Methods). We collectively call interRSS and intraRSS the ribosome stalling score (RSS). In principle, RSS can be statistically considered as an estimate of the local translation elongation rate. A higher RSS generally indicates a lower elongation rate and vice versa.
ROSE relies on a number of motif detectors (i.e., convolution operators) to scan the input sequence and integrate those stalling relevant motifs to capture the intrinsic contextual features of ribosome stalling (Fig. 1b, Online Methods). Unlike previous CNN architectures used for analyzing biological data [44, 47], our new CNN framework includes multiple parallel convolution-pooling modules, which can not only significantly reduce the model complexity, but also alleviate the potential overfitting problem (Online Methods). The standard error back-propagation algorithm is used to learn the network parameters of the CNN model [49]. We also deploy several optimization techniques, including regularization [50], dropout [50, 51] and early stopping [50], to further overcome the overfitting problem. In addition, we propose an efficient strategy for large-scale automatic model selection, i.e., calibrating the model hyperparameters without any human intervension (Online Methods). To further boost prediction performance, we also implement an ensemble version of ROSE which consists of 64 CNN classifiers.
Based on our screening principles (Online Methods), we selected two ribosome profiling datasets from GWIPS-viz [37] as our training and test data, including one human dataset [52] (denoted by Battle15) and one yeast dataset [27] (denoted by Pop14). We first performed a normalization procedure similar to that in [21] to remove the technical and experimental biases from the ribosome profiling data (Fig. 1a, Online Methods). Since some protein-coding genes can be poorly sequenced due to the issue of sequencing depth and the influence of differential expression, which may introduce unexpected biases to our analysis, here we mainly focused on those genes with sequencing coverage above 60%.
We compared the performance of ROSE to that of a classical model, called gkm-SVM, which is developed mainly based on the support vector machine (SVM) and has been successfully applied in various analyses of genomic sequence data [53, 54]. Our tests on both human and yeast datasets showed that ROSE can greatly outperform gkm-SVM with an increase in the area under the receiver operating characteristic curve (AUROC) by up to 18.4% (Figs. 2a and 2b). In addition, the ensemble version of ROSE (also termed eROSE) consistently had superior performance compared to the single version (also termed sROSE).
It has been widely observed that the first 30–50 codons of a coding sequence (CDS) are often enriched with rare codons, and create a “ramp” to reduce the elongation rate during the initial translation elongation process [55, 56]. Such a ramp sequence has been proposed to be universal in prokaryotic and eukaryotic genes, and has been suggested to serve the purpose of reducing the likelihood of downstream ribosomal traffic jams and thus increasing the overall translation efficiency [6, 55, 56]. To focus solely on the elongation process and remove the possible measurement biases introduced by the ramp regions, here we excluded all the reads of these regions, i.e., the first 50 codons at the 5’ end of the coding sequences, from training data. On the other hand, we showed that even without using any training sample from ramps, ROSE can still successfully predict the ribosome stalling events in these regions, with the AUROC scores above 83.0% (Fig. 2c). Although the existence of a ramp region and its effects on the downstream elongation process still remain controversial [29, 57], here we observed a significantly increasing trend of ribosome stalling in the ramp regions based on our method (Fig. 2d; P = 4.64 × 10−33 for human and P = 3.80 × 10−3 for yeast, one-sided Wilcoxon rank-sum test). As ROSE was trained independently from the ribosome profiling reads of the 5’ end of the coding sequences, our observation reached an unbiased conclusion that the translation elongation rate in the ramp sequences is relatively low compared to that in the downstream regions, which basically reconfirmed the existence of the ramp sequences.
ROSE deciphers the regulatory code of ribosome stalling
Next, we analyzed several factors that may play important roles in regulating ribosome stalling during translation elongation. Previous studies [4, 10, 12, 21, 22, 24, 27–29, 58, 59] have provided various mechanistic insights into translation elongation dynamics based on statistical analysis. Nevertheless, due to unexpected experimental distortions (e.g., the use of cycloheximide in ribosome stabilization [24]) and insufficient modeling power, these studies may derive insignificant or even contradictory conclusions [2, 21, 58]. With stringent screening and normalization procedures as well as superior prediction performance, ROSE enables one to systematically investigate and effectively decode diverse factors that regulate ribosome stalling. Here, we mainly focused on codon usage bias, tRNA adaptation, codon cooccurrence bias, proline codons, N6-methyladenosine (m6A) modification, mRNA secondary structure and protein-nucleotide binding, and studied how they affect ribosome stalling and elongation dynamics.
Codon usage bias
Systematic variation has been observed in codon usage across species, among genes in a genome, or even within a gene [60]. Such a codon usage bias has been demonstrated to be crucial in various cellular functions, such as splicing control, stabilization of mRNA structure, maintenance of translation fidelity and regulation of protein folding [3, 7]. Particularly, rare codons are normally highly associated with low translation elongation rates, which has been widely believed to be responsible for proper protein folding [3, 6, 7, 60, 61]. Several metrics have been proposed to measure the codon usage frequency, such as the codon adaptation index (cAI) [62] and the %MinMax score [63]. Here, we used ROSE to reexamine how the codon rareness can affect ribosome stalling based on both cAI and %MinMax profiles. Specifically, we scanned the ribosome occupancy sites along a genome, and then compared the intraRSSes of those sites that were enriched with rare codons to those of the background (Online Methods). We calculated cAI for both ribosome A- and P-sites, and %MinMax for the local region around the ribosome A-site (i.e., five codons both upstream and downstream from the A-site). Consistent results were observed for both human and yeast (Figs. 3a, 3b and Supplementary Figs. 2a, 2b), i.e., those sites enriched with rare codons displayed significantly higher intraRSSes than the background (P < 10−65 for cAI and P < 10−25 for %MinMax, one-sided Wilcoxon rank-sum test). These results implied that the codon bias is an important factor to modulate translation elongation dynamics.
tRNA adaptation
Another codon related feature affecting the translation elongation rate is the tRNA concentration. In general, the codons recognized by abundant cognate tRNAs have short decoding time [64]. However, previous analyses of ribosome profiling data often led to inconsistent conclusions on the effects of this feature, which may be attributed to experimental bias, methodological difference or unknown coregulation factors [5, 24, 27, 57, 61]. In addition to tRNA concentration, the strength of a wobble pairing interaction may also influence the elongation rate [3]. The tRNA adaptation index (tAI) has been proposed to consider both the tRNA concentration (approximated by the copy number of the corresponding tRNA gene) and the strength of codon-anticodon pairing (computed according to the Crick wobble rules) [65]. In fact, it has been found that codon bias is often correlated with tRNA abundance [55, 66–68]. Our analysis also observed a certain correlation between cAI (also %MinMax) and tAI for both human and yeast datasets (Supplementary Table 1). We reexamined the effects of tRNA concentration and the strength of wobble pairing on the elongation rate using ROSE, in which the ribosome occupancy sites were quantified by their tAI scores. In particular, we compared the intraRSSes of those ribosome A- and P-sites enriched with low tAI scores to those of the background (Online Methods), and our comparative analysis on both yeast and human datasets supported the conclusion that lower tRNA concentrations and weaker wobble pairing interactions tend to induce ribosome stalling and associate with lower elongation rates (Figs. 3a, 3b and Supplementary Figs. 2a and 2b; P < 10−60 by one-sided Wilcoxon rank-sum test).
Codon cooccurrence bias
In addition to the aforementioned codon usage bias, the codon cooccurrence bias, i.e., the nonuniform distribution of synonymous codon orders, can also affect translation elongation dynamics [6, 69]. Previous studies have suggested that after being recharged, tRNAs may still stay around the ribosome, and tRNA recycling can modulate the translation elongation rate [6, 69]. Under this hypothesis, we would expect the highly isoaccepting-codon-reused regions to be depleted of ribosome stalling events. To examine this problem, we first defined a new metric, called the codon cooccurrence index (cCI), which measures the autocorrelation (i.e., reuseness) of isoaccepting codons in a local region (Online Methods). Our studies on both human and yeast showed that those regions enriched with autocorrelated codons (i.e., with high cCI scores) displayed significantly lower intraRSSes than the background (Figs. 3a and 3b; P < 10−9 by one-sided Wilcoxon rank-sum test). This result provided a novel evidence to support the argument that the codon cooccurrence bias, probably coordinated with tRNA recycling, can modulate the ribosome stalling tendency and control the local elongation rate.
Proline codons
The unique structure of the proline side chain is generally associated with a relatively low efficiency in its peptide bond formation, which may slow down translation elongation [21, 61, 70–75]. Several studies have confirmed the relatively low translation elongation rates at proline codons [21, 61]. Here we performed an extended study on the influence of proline codons on ribosome stalling using ROSE. In particular, four peptide patterns of proline were investigated, including XPX, XPP, PPX and PPP, in which the three positions correspond to the ribosome E-, P- and A-sites, respectively, and “P” and “X” represent proline and non-proline amino acids, respectively. The comparative analysis on both human and yeast datasets showed that these peptide patterns of proline displayed significantly higher intraRSSes than the background (Figs. 3a and 3b; P < 10−100 by one-sided Wilcoxon rank-sum test). In addition, the dipeptide and tripeptide patterns of proline, including XPP, PPX and PPP, showed significantly higher intraRSSes than XPX (Fig. 3c; P < 10−40 by one-sided Wilcoxon rank-sum test). This result implied that a codon sequence with more prolines may have a higher chance to induce ribosome stalling and yield a lower translation elongation rate.
N6-methyladenosine modification
Notably, we found that the mRNA modification N6-methyladenosine (m6A) within codons can also influence ribosome stalling (Fig. 3d). N6-methyladenosine is probably the most prevalent post-transcriptional modification in mRNAs and plays vital roles in regulating mRNA stability and translation efficiency. Most m6A sites are distributed in mRNA regulatory regions such as 3’-UTRs and the locations around the stop codons, where the m6A “readers” such as YTHDF1 and YTHDF2 can bind to dynamically regulate gene expression [76]. Moreover, it has been reported that m6A in the CDS can modulate the translation initiation to promote translation efficiency [77]. During translation, the additional methyl group of an m6A site may act as a stumbling block to ribosomes and thus result in a translation pause. Recently, Choi et al. elucidated that the m6 A-modified codons at the ribosome A-sites can reduce the translation elongation rate in E.coli [78]. Thus, it is reasonable to hypothesize that the m6A marks in eukaryotic cells may also be closely correlated to the ribosome stalling tendency. We used ROSE to test this hypothesis based on the translatome-wide m6A mapping obtained from the known single-nucleotide resolution sequencing data, including two human datasets (denoted by Linder15 [79] and Ke15 [80], respectively) and one yeast dataset (denoted by Schwartz13 [81]). The analysis results showed that in human, those codons modified by m6 A at the ribosome A-sites had a significantly higher ribosome stalling tendency than the background (Fig. 3a and Supplementary Fig. 2c; P = 9.61 × 10−13 for Linder15 and P = 2.20 × 10−64 for Ke15, one-sided Wilcoxon rank-sum test). To ensure that such an increase of intraRSS did not result from the underlying adenine nucleotides in the codon sites of interest, we also constructed a control dataset which contained 10,000 randomly-selected codon sites covering the adenine nucleotides but without m6A modification. In the control test, those m6A codon sites from the human translatome still exhibited higher intraRSSes than the control dataset (Supplementary Fig. 2d; P = 1.76 × 10−9 for Linder15 and P = 4.52 × 10−54 for Ke15, onesided Wilcoxon rank-sum test). For yeast, the difference of intraRSS between the m6A codon sites and the background was insignificant (Fig. 3b and Supplementary Fig. 2d; P > 0.05 by two-sided Wilcoxon rank-sum test). This result may be due to the fact that the number of codons modified by m6A in the yeast dataset was quite limited (only 278 samples after read mapping), and the m6A modification was only observed in yeast meiosis [81]. Overall, our analysis provided a new insight into the relation between the m6A modification and translation elongation in human and yeast, which can further expand the previous conclusion drawn from the E. coli data [78].
mRNA secondary structure
During the translation elongation process, the ribosome should first unwind the locally folded mRNA secondary structures (e.g., stem-hairpin or stem-internal loops) to move forward [82, 83]. This indicates that in a highly double-stranded region, translation elongation can be slowed down, which thus increases the probability of ribosome stalling [3, 5, 6, 27, 58, 84–87]. To verify this hypothesis, we first ran RNAfold [88] to predict the secondary structures of all mRNA sequences in the background dataset, which contained 10,000 randomly-selected ribosome occupancy sites from the genome. Here, the mRNA sequences covering the codon sites of interest were 183 nucleotides long, as we extended each putative ribosome occupancy site by 30 codons both upward and downward as input to ROSE. We then measured the folding level of each sequence by computing its double-stranded ratio (denoted by ds%) in the local region of a ribosome A-site, and regarded the top 5,000 mRNA sequences with the highest ds% scores as highly folded. Next, we compared the intraRSSes of those highly-folded mRNAs to those of the remaining sequences. For both human and yeast, we observed an expected increase of intraRSS for those mRNAs with highly-folded structures (Figs. 3a and 3b; P < 10−8 by one-sided Wilcoxon rank-sum test), which provided a novel evidence on the regulatory effect of mRNA secondary structure on translation elongation.
RNA-binding proteins
Recently, RNA-binding proteins (RBPs) have received broad interests for their crucial roles in post-transcriptional and translational regulation [89, 90]. By affecting the stability and the translation process of their target mRNAs, RBPs can act as important regulatory factors to control gene expression [91, 92]. As a specific RBP, the fragile X mental retardation protein (FRMP) is essential for neuronal translation regulation, whose transcriptional inactivation has been known to be involved in many diseases, such as fragile X syndrome and autism [93]. It has been found that FMRP can associate with polyribosomes and impede the elongation of a peptide chain [94]. Structural studies have also indicated that FRMP may directly bind to both RNAs and ribosomal proteins to prohibit ribosome movement (Fig. 3e) [95]. Therefore, a region with downstream FMRP binding in the CDS is expected to have a higher chance of ribosome stalling. Here, we estimated the FMRP binding affinity of a region downstream the ribosome A-site based on the known FMRP binding sites identified by the PAR-CLIP experiment [93] (Online Methods). We calculated the intraRSSes of the codon sites enriched with FMRP binding downstream, and found that these sites had significantly higher intraRSSes than the background (Fig. 3f; P = 5.49 × 10−15 by one-sided Wilcoxon rank-sum test), which confirmed the effectiveness of our model to capture the ribosome stalling events regulated by this specific RNA-binding protein.
For general RBPs, we hypothesized that it would be highly probable for ribosomes to stall in an mRNA region enriched with RBP binding motifs downstream, as the bound RBPs may obstruct ribosome movement. To verify this hypothesis, we analyzed the intraRSSes of those codon sites with strong RBP binding propensity downstream (Online Methods), in which the RBP binding motifs and the corresponding binding affinities were downloaded from the CISBP-RNA database [90]. We found that indeed these sites exhibited significantly higher intraRSSes than the background (Figs. 3a and 3b; max E-score estimation, P = 9.36 × 10−82 for human and P = 1.90 × 10−4 for yeast, one-sided Wilcoxon rank-sum test). The increases were significant for human with respect to all four criteria used to estimate the RBP binding affinities, i.e., max/average E- and Z-scores (Fig. 3a and Supplementary Fig. 2a, Online Methods). However, for yeast, the increase was only significant for the max E-score (Fig. 3b and Supplementary Fig. 2b), which may be attributed to the limited number of RBP binding motifs available in yeast (3 motifs in yeast vs. 91 motifs in human). Overall, our analysis showed that RBP binding may act as another factor in influencing ribosome stalling. Of course, we cannot rule out other unknown factors associated with RBP binding motifs that may virtually control ribosome stalling, which will certainly require additional experimental studies and further investigation.
ROSE unveils intragenic RSS landscapes
As the RSS outputted by ROSE encodes diverse factors affecting the ribosome stalling events, it can be regarded as a useful indicator of the local elongation rate. On the other hand, due to the uneven nature of translation elongation dynamics, each gene can have a specific intragenic landscape of ribosome stalling, which may be important for many cellular functions, e.g., modulating the distribution of ribosomes along an mRNA [55, 56], regulating the protein cotranslational folding process [96], and assisting protein translocation across membranes [97]. Here, we analyzed the intraRSS landscape by relating it to several cotranslational events in protein biogenesis, including protein secondary structure formation and protein targeting by the signal recognition particle (SRP).
RSS correlates with protein secondary structure
Although codon bias has been shown to associate with protein secondary structure elements (SSEs) and cotranslational folding [3, 6, 97, 98], it lacks more direct evidence to verify that such an association is imparted from the local elongation dynamics. Here, we sought to probe the relation between the protein SSEs and the local elongation rates based on the intraRSS computed by ROSE. In particular, we first derived a set of non-redundant protein chains across human and yeast genomes from the Protein Data Bank (PDB) [99], in which BLAST [100] with the sequence-similarity cutoff P = 10−7 was used to compare two protein sequences. The SSEs of these protein chains were then determined based on the mapping to the DSSP database [101, 102], which contains the experimentally-determined secondary structure assignments for the protein sequences in the PDB. We then investigated the intraRSS landscapes of different SSE patterns, including a single chain of alpha helix (H), beta stand (B) or random coil (C), and transitions between different SSEs.
To obtain the average position-specific intraRSSes of a certain SSE pattern, all the eligible SSE-aligned sequences with a particular window size were extracted from the genome with five flanking amino acids on both sides, and then the mean intraRSS of each position was calculated. Note that here we mainly considered the intraRSSes of those codons at the ribosome P-sites, where the corresponding amino acids are concatenated to the nascent peptides (Supplementary Fig. 1). Overall, we found that with the window size of six, all the tendencies of the intraRSS change for individual SSE patterns were species independent, showing consistent trends for both human and yeast datasets (Figs. 4a and 4b; Spearman correlation coefficient R > 0.6). To further eliminate the bias that may be caused by the window size, we also repeated the same analysis procedure with the window size of ten, and found that five out of seven SSE patterns still showed similar trends for both human and yeast (Supplementary Fig. 3; Spearman correlation coefficient R > 0.5). The remaining two SSE patterns displayed relatively weak agreement in the intraRSS tendencies (Supplementary Fig. 3; Spearman correlation coefficient R < 0.1), which was probably attributed to the dramatic drop of the sample size in both datasets (i.e., from thousands to tens).
We further compared the intraRSSes of the structured (i.e., alpha helix or beta strand) and random coil residues at the ribosome P-sites. Consistent with the previous report that frequent codons were usually enriched in the structured regions while depleted in the random coils [97], our results showed a significantly higher stalling probability in the coils than in the alpha helix or beta strand regions (Fig. 4c). Furthermore, we examined the tendency of the intraRSS change along a protein secondary structure fragment. As expected, the intraRSS landscape showed a lower chance of stalling in the middle of a structured region while a higher chance in the middle of a coil region, when compared to the corresponding flanking regions on both sides (Fig. 4a and Supplementary Fig. 3a). This behavior was reminiscent of another previous study on the relations between codon frequency and protein secondary structure, in which the tRNA adaptation index (i.e., tAI) was mainly used as an indicator of the elongation rate [98]. Our intraRSS landscape showed a similar trend to the previous finding [98] that the transitions from structured to coil regions generally accompanied an increase in the stalling probability on the transition boundaries (Fig. 4b and Supplementary Fig. 3b). In addition, the opposite transitions (i.e., from coil to structured regions) exhibited roughly symmetrical trends in the change of intraRSS (Fig. 4b and Supplementary Fig. 3b). Notably, the ribosome stalling positions revealed by intraRSS here were slightly different from those reported in [98]. In particular, our results displayed more symmetrical stalling positions than the previous results [98], suggesting that ribosomes are prone to stall at the coil residues near the transition boundaries. Admittedly, we cannot exclude the influence of other factors on these findings, such as the database upgrade or the discrepancy of SSE assignment between different databases (e.g., JOY [103] vs. DSSP [101, 102]). Nevertheless, our results indicated the necessity of incorporating other controlling factors in addition to tRNA adaptation to better estimate the translation elongation rate.
RSS associates with SRP recognition
Next, we investigated whether the intraRSS landscape can reflect the elongation process that regulates the coupling between the protein translation and translocation activities. We were particularly interested in the interplay between the translational pause and the signal recognition particle (SRP) binding of the transmembrane (TM) segments (Fig. 5a). We expected that our model would effectively capture the ribosome stalling events encoded by the heterogeneity of amino acid composition in and around the TM domains.
We first downloaded all the available TM protein sequences of human and yeast as well as the corresponding TM domain information from the Uniprot database [104]. To avoid the biases that may be caused by the influence between different TM segments, here we only considered the single-pass integral proteins and the last TM segments of the multispan TM proteins, which resulted in 4,235 human and 561 yeast proteins. For yeast proteins, we also excluded 65 TM sequences that are not bound by SRP according to the previous experimental study [105]. To characterize the intraRSS landscape along the elongation process, all the protein sequences were aligned with regard to the start of the TM segment whose position was indexed as zero, and then the mean intraRSS of each codon between positions -10 and +80 was calculated.
We first focused on the yeast TM proteins, whose translation has been previously characterized both computationally and experimentally [106]. The intraRSS landscape computed by ROSE captured two major stalling events during the TM protein translation process (Fig. 5b). The first stalling event after the TM start occurred right at the end of the TM segment, where the structured TM segment (majorly alpha helix) is transited to a more flexible intracellular region. This result agreed well with our previous conclusion about the relation between intraRSS and protein secondary structure (Fig. 4b and Supplementary Fig. 3b). The other intraRSS peak, spanning positions from +50 to +70, probably represented the intrinsic stalling to promote the nascent-chain recognition by SRP, which was consistent with the previous report [106]. Indeed, a TM segment generally contains ~20 residues and the length of the ribosome exit tunnel is ~30 residues. Thus position +50 is approximately the place where the translated TM segment emerges from the exit tunnel and is bound by SRP. To further verify whether this intraRSS peak was truly relevant to SRP binding, we also specifically examined the intraRSS landscape of the TM segments that were not associated with SRP binding (termed SRP-), which were obtained from the previous experimental study [105]. Compared to the above result on the sequences with SRP binding (termed SRP+), the first stalling event remained at position +20, while the intraRSS peak in the putative SRP-associated stalling region (i.e., positions from +50 to +70) was significantly diminished (Fig. 5b; P = 1.5 × 10−3 by one-sided Wilcoxon rank-sum test), which indicated that intraRSS is an excellent indicator of the ribosome stalling event that regulates the synthesis of a TM domain. Notably, the TM segments at positions between 0 and +20 of the SRP-proteins showed a remarkable increase in intraRSS compared to the result of the SRP+ proteins, suggesting that an alternative membrane targeting may occur in the absence of SRP binding. To probe whether our position alignment scheme can induce bias to our analysis, we also performed a similar study in which the TM ends were aligned together, from which we drew the same conclusion (Supplementary Fig. 4a).
For human, as we lacked the SRP recognition data, it was difficult to separate the TM domains bound and unbound by SRP. Here, we only analyzed the intraRSS landscape of a mixed human dataset that did not distinguish the TM segments with and without SRP binding. In this analysis, although we still observed a similar intraRSS peak near the end of the TM segment, the peak signal around position +50 was relatively weak compared to that in yeast (Supplementary Fig. 4b). Such a discrepancy may be caused by either the mixed effect of the TM domains with and without SRP binding in the human dataset or the intrinsic mechanistic difference in cotranslational protein targeting mediated by SRP recognition between human and yeast.
ROSE uncovers intergenic RSS landscapes
As discussed previously (see Section “Designing, training and validating ROSE”), ROSE was able to capture the stalling features of the ramp region (i.e., the first 30–50 codons downstream from the start codon) of a gene. The ribosome stalling in the ramp was expected to enable the fluency of the downstream translation elongation process by reducing the chance of ribosome jamming. In other words, the aggregation of ribosomes in the ramp regions may facilitate the downstream translation [10, 55, 56, 60]. Thus, we expected a gene with a higher interRSS in the ramp region to be more efficiently translated. To verify this hypothesis, we performed a large-scale study to investigate the relation between the interRSS of the ramp region and the translation efficiency of the corresponding gene.
We used the logarithm of the protein expression level divided by the corresponding mRNA expression level to measure the translation efficiency (TE) of each gene. In our analysis, the mRNA and protein expression data of human (lymphoblastoid cell lines, LCLs) and yeast (S. cerevisiae) were obtained from [52] and [107, 108], respectively. As short genes may introduce bias to our analysis of the ramp regions, here we only focused on those genes with more than 300 codons, which in total resulted in 12,734 and 3,590 genes for human and yeast, respectively. We then divided these obtained genes into two classes: those with the highest 10% mean interRSSes of the ramp regions were defined as the ramp genes, while the remaining ones were regarded as the non-ramp genes. Our comparison showed that the ramp genes owned stringently higher translation efficiency than the non-ramp genes (Figs. 6a and 6b; P = 3.85 × 10−7 for human and P = 2.7 × 10−3 for yeast, one-sided Wilcoxon rank-sum test). In particular, for human, the ramp genes displayed dominantly higher translation efficiency than the non-ramp genes approximately starting from the TE median, while for yeast, the same phenomenon was observed for those genes roughly with top 10% TE (Figs. 6a and 6b).
Furthermore, we compared the enriched functional categories between ramp and non-ramp genes. Specifically, we only focused on those genes with high translation efficiency (i.e., with top 50% TE) as they may be crucial for supporting the fundamental cellular activities, and used all the expressed genes in the dataset as the background for the gene ontology (GO) analysis (Supplementary Table 2). Interestingly, we found that among these genes in yeast, the ramp genes were significantly enriched with many housekeeping GO terms, such as translation regulation and amino acid biogenesis, while the non-ramp genes showed much weaker enrichment or even depletion of these GO categories (Fig. 6c and Supplementary Table 3). For human, such a GO enrichment was not significantly observed after correcting the P values (Supplementary Table 3; P > 0.05). This may be due to the intrinsic expression property of LCLs, as we also failed to observe a significant GO enrichment even when simply focusing on those genes of high TE without differentiating their interRSS levels of the ramps (Supplementary Table 3; P > 0.05). Nevertheless, for those cell cycle related GO terms, the corrected P values of the ramp genes with high TE were expectedly smaller than those of the non-ramp genes (Supplementary Table 3). Taken together, these results suggested that interRSS can provide a good indicator for studying the regulatory functions of the ramp sequences, which may modulate the translation efficiency of important genes at the elongation level.
Discussion
In addition to those factors investigated in the previous sections, we also studied the correlation between the amino acid charge and ribosome stalling, which still remains controversial and unclear. Several studies claimed that the positively-charged amino acids can slow down the formation of a peptide chain by interacting with the negatively-charged ribosomal exist tunnel [27, 58, 86, 111, 112]. However, others found no such correlation by arguing the quality of experimental data and the methodological limitations in the previous studies [21]. Here, we reexamined this problem based on our method. Indeed, for those codon sites enriched with the positively-charged amino acids upstream (i.e., with the 10,000 highest ratios of the positively-charged amino acids in the upstream 30 codons) in the genome, we did not observe a significant increase of intraRSS compared to the result of the background. To probe this problem in more detail, we further looked into the spesific positively-charged amino acids, including histidine, lysine and arginine, and asked whether any particular amino acid can contribute to ribosome stalling. However, although significant difference of intraRSS was observed between those sites enriched with a particular amino acid upstream and the background, we cannot reach a consistent conclusion over human and yeast datasets (Supplementary Fig. 5).
In this study, we have proposed a deep learning-based method, called ROSE, to model translation elongation dynamics by integrating the underlying sequence features. To our best knowledge, our work is the first attempt to exploit the machine learning/deep learning technique to model the elongation process and predict ribosome stalling based on the large-scale ribosome profiling data. Through comprehensive analyses on both human and yeast datasets, we have shown that ROSE can effectively decipher diverse factors that control the elongation rate and the ribosome stalling tendency. Moreover, ROSE provides a genome-wide estimate on the local elongation rate and enables us to investigate both intragenic and intergenic landscapes of ribosome stalling. Based on ROSE, we have validated the existence of the ramp sequences in an unbiased manner, and uncovered interesting interplay between elongation dynamics and the cotranslational events in protein biogenesis. For instance, our results have revealed that translation elongation dynamics may modulate the TM recognition by SRP and also associate with the protein secondary structure formation during the cotranslational folding process. In addition, the interRSS landscape indicates that the ramp sequences may be involved in the modulation of translation efficiency. Although translation initiation has been confirmed to be crucial for regulating translation efficiency and differential expression [6, 56], our studies have suggested a potential regulatory function of translation elongation to determine protein expression. Overall, our method can provide an effective and powerful tool to analyze the large-scale ribosome profiling data, and further expand our understanding on the regulatory mechanisms and functions of ribosome stalling and translation elongation dynamics.
Methods
Datasets and data preprocessing
The training and test datasets were downloaded from GWIPS-viz [37], in which abundant ribosome profiling data have been maintained to facilitate the downstream analysis. Here we applied the normalization method introduced by Artieri and Fraser [21] to remove the technical and experimental biases from the ribosome profiling data. More specifically, after mapping the ribosome profiling and mRNA-seq reads to the reference genome, their codon-level reads were first scaled by the mean coverage level within each gene, which canceled out the coverage differences among genes. Next, the scaled ribosome profiling reads were divided by the scaled mRNA-seq reads in the corresponding locations to eliminate the shared biases between these two fractions. After that, a logarithm operation was further performed to yield the final normalized values of the ribosome profiling data. To exclude the unexpected biases that may be introduced from the reads of poorly-sequenced genes, here we mainly considered the reads of those genes whose read coverage ratios were larger than 60%. After normalization, we selected those codon sites whose read abundance was in the top 5% as positive (i.e., foreground) samples, while randomly chose the same number of codon sites from the remaining 95% as negative (i.e., background) samples. These labeled samples were then used as training and test data in a binary classification task. Note that here we excluded all the reads of the ramp regions (i.e., the first 50 codons at the 5’ end of the coding sequences) from the training data.
In this study, we mainly focused on eukaryotic cells, including human and yeast cells. As a previous study [24] has shown that the use of cycloheximide (CHX) for stabilizing ribosomes before sequencing can induce unexpected bias to the measurement (i.e., the ribosomes can continue moving forward in the present of CHX), here we only considered the profiling data with the use of the flash-freezing method [11] instead of CHX. Moreover, after applying the aforementioned normalization technique on the original human and yeast ribosome profiling datasets downloaded from GWIPS-viz, we only kept those datasets with more than 10,000 positive samples, which were generally sufficient enough to train a reliable machine learning model. In the end, such a normalization and screening procedure yielded two qualified datasets (Supplementary Fig. 6), i.e., the human dataset of lymphoblastoid cell lines (LCLs) [52] (denoted by Battle15) and the yeast dataset of S. cerevisiae [27] (denoted by Pop14), which included 109,770 and 20,902 samples, respectively. For each dataset, we randomly selected 90% of the samples as training data and the remaining 10% as test data. The final performance of our model was mainly reported based on the test data (Figs. 2a and 2b). Note that as the isoform-level analysis of ribosome profiling data has not been completely solved [10], here we mainly chose the largest transcript of each gene as the reference genome to construct the gene-level ribosome profiles.
Model design
A convolutional neural network (CNN) is a specific type of neural network in deep learning, which has been widely used in common data science fields, such as computer vision [113] and natural language processing [114]. In particular, CNNs have also been used to model biological sequence data, e.g., the predictions of protein-nucleotide binding [44] and effects of noncoding variants [47]. Generally speaking, a CNN is comprised of multiple local motif detectors (i.e., convolution operators) that are invariant with certain transformations, such as translation and rotation, and subsampling (i.e., pooling operators) for dimension reduction and efficient training. To further increase the learning capacity of the network, many layers of these operators are often stacked together, and then followed by several fully-connected layers, and finally the output layer.
In our framework, we first encode the input codon sequence using the one-hot encoding technique [115], that is, the mth codon is encoded as a binary vector of length 64, in which the mth position is one while the others are zeros, after indexing all 64 codons. Then the encoded information is fed into one convolutional layer and one pooling layer to learn the hidden features. In the convolutional layer, several one-dimensional convolution operations are performed over the 64-channel input data, in which each channel corresponds to one dimension of the input vector, and the weight matrix (i.e., kernel) can be regarded as the position weight matrix (PWM). More specifically, given a codon sequence s = (c1,…, cn) and the corresponding one-hot representation S, where n stands for the input length (here n = 61 as we extend the codon site of interest on both sides by 30 codons) and ci represents the ith codon in the sequence, the convolutional layer computes X = conv(S), i.e., where 1 ≤ i ≤ n – m + 1, 1 ≤ k ≤ d, m is the kernel size, and d is the kernel number. Next, the rectified linear activation function (ReLU) is used to imitate the neuron activation, that is, the output of the convolutional layer is further processed by the activation function Y = ReLU(X), where
After convolution and rectification, we reduce the dimension of matrix Y using the max pooling operation, which computes the maximum value within a scanning window of size three and step size two. More specifically, given the upstream input Y, the max pooling operation computes Z = pool(Y), i.e., where i is the index of the output position, j is the index of the start input position, k is the index of the kernel, and m is the size of the scanning window during the pooling operation (here we choose m = 3).
To enable the local motif detectors to scan sequence motifs in different ranges synchronously, while not increasing the model complexity too much, here we propose a parallel architecture, which includes three kernels of different sizes, corresponding to short (5–7), mediate (8–9) and long (10–13) ranges, respectively. The outputs of these three kinds of convolution operators are further rectified and then subsampled independently and in parallel, and finally concatenated into a unified representation U. To calculate the final probability of a ribosome stalling event, the unified representation is directly fed to a sigmoid layer, which computes where W is the weight matrix of the sigmoid layer.
Note that the sequential (i.e., layer-wise) architecture in conventional CNNs, in which several convolutional and pooling layers are stacked together, can also detect motifs in different ranges. The reason that our parallel architecture can significantly reduce the model complexity comes from the fact that the parallelism simulates the SUM operation, e.g., (a1 + a2) + (b1 + b2), while the sequentiality mimics the PRODUCT operation, e.g., (a1 + a2) × (b1 + b2). Obviously, the computational complexity of the latter is much higher than that of the former. Our network reduction can be useful for relieving the potential overfitting problem during the training process. We note that a similar idea has also been proposed in [114]. However, the pooling operation in [114] is carried out over the whole convolutional layer without any window restriction, which is quit different from ours. In summary, a complete CNN in our deep learning framework can be formulated as where i represents the kernel index in the parallel architecture, and encode(·), conv(·), ReLU(·), pool(·), concat(·) and sigm(·) represent the one-hot encoding, convolution, ReLU, max pooling, concatenation and sigmoid operations, respectively.
The above calculated probability p(s) is defined as the intergenic ribosome stalling score (also termed interRSS), which measures the likelihood of ribosome stalling at a codon position. To eliminate the interRSS bias among different genes, we further define the intragenic ribosome stalling score (also termed intraRSS) as follows, where interRSS(position) represents the interRSS of the codon position of interest and mean (gene) stands for the mean interRSS of the corresponding gene. When computing mean(gene), we exclude those codon positions in the ramp region (i.e., the first 50 codons at the 5’ end of the coding sequence). In addition, as limited by the input length of our model (i.e., 61 codons), we mainly use the mean interRSS of codon positions 31-50 to estimate the interRSS level of the ramp region.
Model training and model selection
Given the training samples {(si, yi)}i, the loss function of our model is defined as the sum of the negative log likelihoods (NLLs), i.e., where si is the input codon sequence and yi is the true label. To train the CNN, the standard stochastic gradient descent with the error backpropagation algorithm is performed [49]. To further optimize the training procedure, we also apply several training strategies, including the minibatch and momentum techniques [50]. In addition, we use the Adam algorithm for stochastic optimization to achieve an adaptive moment estimation [116]. To further overcome the overfitting issue, we also apply several regularization techniques, including L2-regularization-based weight decay [50], dropout [51] and early stopping [50].
The network structure and the aforementioned optimization techniques introduce a number of hyperparameters to our framework, such as the kernel size, kernel number, base learning rate, weight decay coefficient and the max number of training iterations. It is important to perform proper hyperparameter calibration and model selection for accurate modeling. Although we can achieve this goal using the conventional cross-validation strategies, it is generally time-consuming to test all possible combinations of these hyperparameters. To conquer this difficulty, here we propose a one-way model selection strategy for automatic and efficient hyperparameter calibration. In this strategy, we first arbitrarily choose the initial values of the hyperparameters from a candidate set. Then, we separate the hyperparameters into two groups, including those describing the network structure (denoted by H1), such as the kernel size and the kernel number, and those describing the optimization procedure (denoted by H2), such as the base learning rate and the weight decay coefficient. Next, by fixing the values of the hyperparameters in H2, we calibrate those hyperparameters in H1 using a three-fold cross-validation (CV) procedure, and determine their optimal values that achieve the best CV performance. Similarly, the hyperparameters in H2 are also calibrated via the three-fold CV procedure after fixing the previously determined values of the hyperparameters in H1. The final values of all hyperparameters of ROSE are provided in Supplementary Table 4. The ROC curves and AUROC scores of the CNNs with calibrated hyperparameters are shown in Supplementary Figs. 7a and 7b for the Battle15 and Pop14 datasets, respectively. Though we can carry out this procedure for more iterations (i.e., multi-way), our test results show that the one-way implementation generally yields satisfying prediction performance in this study.
After hyperparameter calibration and model selection, we train the final ROSE model using the whole training dataset. Due to the nature of non-convex optimization, random weight initialization may affect the search result of the gradient descent algorithm. Here, we use the Xavier initialization algorithm to automatically determine the initial scales of weights according to the number of input and output neurons [117]. To account for the potential initialization bias and further boost the prediction performance, we also implement an ensemble version of ROSE (termed eROSE), in which 64 CNNs are trained independently and then combined together to compute the final prediction score, i.e., in which pi (s) represents the probability calculated by the ith CNN.
Our implementation of ROSE depends on the Caffe library [118], and the Tesla K20c GPUs are used to speed up the training process.
Statistical analysis on diverse factors influencing ribosome stalling
Diverse factors, such as tRNA adaptation and mRNA secondary structure, can interplay with each other to affect the ribosome stalling tendency. To investigate whether a factor potentially influences ribosome stalling, we first identified those codon sites along the genome that were enriched with this factor, and then checked whether the predicted (intra)RSSes of these positions were significantly different from those of the background. In particular, given a factor, such as tRNA adaptation, we first computed its quantity (e.g., tAI) across the genome and then chose those codon sites whose quantities were in the top N list (here N was set to 10,000 in our study). After that, we ran the Wilcoxon rank sum test to compare the (intra)RSSes of the chosen sites with those of a background dataset, which was generated by randomly selecting 10,000 ribosome occupancy sites from the genome. If the (intra)RSSes of those codon sites enriched with the factor and the background were significantly different, we said this factor can influence ribosome stalling. In addition, we probed the correlations between different factors based on the background dataset, and found little correlation between different factors that we were interested in, except cAI, %MinMax and tAI (Supplementary Table 1). This enhanced our conclusion about the regulatory function of a factor in controlling ribosome stalling, as it basically excluded the possibility that the observed effect was indirect and caused by the influence from other known factors.
Codon cooccurrence index
Cannarozzi et al. [69] have proposed the tRNA pairing index (TPI) to characterize the codon order within each gene. However, the TPI measurement is not suitable for our study as here we mainly focused on the codon distribution within a local region rather than over the whole transcript. Here, we proposed a novel index, called the codon cooccurrence index (cCI), to estimate the codon cooccurrence level in a local region. Precisely speaking, given the codon of interest at position i, we only considered its local region [i – w, i + w], where w stands for the window size. For each codon at position p ∈ [i – w, i + w], we checked whether it had an isoaccepting codon in the upstream region [i – u, p – 1]. We used notation isop to represent this indicator, that is, isop = 1 if the indicator holds true, and isop = 0 otherwise. Thus, the cCI at position i was defined as in which we set w = 5 and u = 30 in our study.
Estimating the RBP binding affinity
Suppose that we index the codon position at the ribosome A-site as zero. Then the downstream region covering positions from +1 to +3 is still protected by the ribosome (Supplementary Figure 1). We were particularly interested in estimating the binding affinity of FMRP or other RBPs in the region of the next ten codons after the ribosome protected fragment (i.e., codons from +4 to +13), which was denoted by R, and then investigating the influence of this estimated binding affinity score on ribosome stalling. For FMRP, we mainly used the abundance of the mapped reads of the FMRP binding sites identified by PAR-CLIP [93] to estimate its binding affinity. In particular, if there are N reads identified in region [i, i + x], then for any site s ∈ [i, i + x], its FMRP binding affinity, denoted by aff (s), was estimated by aff (s) = N/x. After that, the overall binding affinity of the region R right after the ribosome protected fragment was calculated by aff(R) = ∑s∈R aff(s). Here we only considered those binding sites whose lengths are within one standard derivation from the mean calculated based on the length distribution of FMRP binding sites, as the extremely long regions may introduce bias to our analysis.
For general RBPs, the estimation of their binding affinity was similar except that now we used the E- and Z-scores provided by [90] instead of the mapped PAR-CLIP reads. In particular, given a region R and an RBP binding motif set M, for any 7-mer m, we defined aff-max(R) = maxm∈R(maxm∈M(m)) for the max-score estimation, and aff-mean(R) = meanm∈R(maxm∈M(m)) for the mean-score estimation, where maxm∈m (m) returns the maximum E- or Z-score of the 7-mer m within the set M.
Author contributions
S.Z., H.H., J.Zhou and J.Zeng conceived the research project. J.Zeng supervised the research project. S.Z. preprocessed raw data, designed and implemented ROSE, and carried out model training and validation tasks. X.H. prepared the sequencing data of m6A modification. S.Z., H.H., J.Zhou and J.Zeng performed all the computational and statistical analyses. S.Z., H.H., J.Zeng, J.Zhou and X.H. wrote the manuscript. All the authors discussed the test results and commented on the manuscript.
Competing financial interests
The authors declare no competing financial interests.
Acknowledgments
This work was supported in part by the National Basic Research Program of China Grant 2011CBA00300, 2011CBA00301, the National Natural Science Foundation of China Grant 61033001, 61361136003 and 61472205, and China’s Youth 1000-Talent Program, the Beijing Advanced Innovation Center for Structural Biology. The authors are grateful to Drs. T. Jiang, Q. Zhang and W. Chen for their helpful discussions about this work. They thank Ms. T. Chu for her help on preparing the figures in this paper.
Footnotes
↵† These authors contributed equally to this work.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].
- [24].↵
- [25].
- [26].
- [27].↵
- [28].↵
- [29].↵
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].
- [46].
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].
- [68].↵
- [69].↵
- [70].↵
- [71].
- [72].
- [73].
- [74].
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].
- [86].↵
- [87].↵
- [88].↵
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].↵
- [95].↵
- [96].↵
- [97].↵
- [98].↵
- [99].↵
- [100].↵
- [101].↵
- [102].↵
- [103].↵
- [104].↵
- [105].↵
- [106].↵
- [107].↵
- [108].↵
- [109].↵
- [110].↵
- [111].↵
- [112].↵
- [113].↵
- [114].↵
- [115].↵
- [116].↵
- [117].↵
- [118].↵