Abstract
Motivation Fast and accurate identification of transmembrane (TM) topology is well suited for the annotation of whole membrane proteome, and in turn the initial step to predict the structure and function of membrane proteins. However, till now the methods that utilize only amino acid sequence (pureseq) will suffer from low prediction accuracy, whereas the methods that exploit sequence profile or consensus will need too much computing time.
Method This article employs a deep learning framework DeepCNF that predicts TM topology from amino acid sequence only. Compared to previous pureseq approaches that based on Hidden Markov Models (HMM) or Dynamic Bayesian Network (DBN), DeepCNF can accommodate a lot more context information by a hierarchical deep neural network, and simultaneously model the interdependency between adjacent topology labels.
Result Experimental results show that our TM prediction method PureseqTM not only outperforms existing pureseq methods, but also reaches or even surpasses the profile/consensus methods. On the 39 newly released membrane proteins, our approach successfully identifies the correct TM segments and boundaries for at least 3 cases while either of the other approaches failed to do so. When applied to the entire Human proteome, our method can identify the incorrect annotations of TM regions by UniProt, as well as discover the membrane-related proteins that are not manually curated as membrane protein.
Availability http://pureseqtm.predmp.com/
Introduction
Transmembrane proteins (TPs) are key players in energy production, material transport, and communication between cells [1]. TPs are encoded by ~30% genes in the various genomes [2] and have been targeted by ~50% of therapeutic drugs [3]. Despite their abundance and importance, the number of solved TPs structures is relatively low compared to those of non-transmembrane proteins (non-TPs). In particular, if we take the non-redundancy threshold to 40% sequence identity, then there are only about 1500 non-redundant TPs whereas the non-redundant number of non-TPs is more than 34000. The underlying reason is that the experimental determination of TPs is challenging as membrane proteins are often too large for NMR spectroscopy and difficult to crystallize for X-ray crystallography [4]. Thus, it is critical to develop computational methods for the prediction of TP structure from amino acid sequence, and the initial step is the accurate identification of the transmembrane topology [5].
As shown in the left part of Figure 1, transmembrane (TM) topology refers to the locations of the membrane-spanning segments, which could be represented as a 1D 0/1 string to indicate the location of each residue to reside in (label 1) or out of (label 0) the membrane. This simple but direct definition of TM topology is consistent with the 3-label definition used by many other works that divides non-TM regions (i.e., label 0) into inner or outer sides [6–10]. In this work, we only focus on the prediction of TM topology of the alpha-helical TPs because of the following two facts: (i) almost all the TM regions in Eukaryotic TPs are alpha-helical except some of them are beta-barrel in the mitochondrial membrane [8]; (ii) more than 85% of the available TPs that have 3D structure belong to the alpha-helical TPs [11]. If there is only one TM segment, then this membrane protein is denoted as single-pass transmembrane protein (sTP); similarly, a multi-pass transmembrane protein (mTP) will contain two or more TM segments. Here we mainly focus on the topology prediction of multi-pass transmembrane protein as about 75% of the current available alpha-helical TPs are mTPs [11].
Till now, a variety of approaches have been proposed to predict the 1D TM topology from the input sequence of a membrane protein. These approaches can be roughly categorized into three groups: (a) single-sequence-based methods that only rely on the input amino acid sequence information (or, ‘pureseq’ features). Two representative methods are TMHMM/Phobius [7, 12] and Philius [8], where the former established a Hidden Markov Model (HMM) model and the latter employed a Dynamic Bayesian Network (DBN) model. The advantage of the pureseq methods is their extremely fast running speed, while the disadvantage is the relatively low prediction accuracy; (b) evolutionary-based methods that consider the information embedded in the homologous Multiple Sequence Alignment (MSA) through evolutionary analysis (or, ‘profile’ features). Two representative methods are OCTOPUS [10] and MEMSAT-SVM [9]. The advantage of the profile methods is their improved prediction accuracy over pureseq methods, but at the cost of significantly reduced running speed due to the search for MSA; (c) consensus methods that combine the outputs from different predictors. TOPCONS2 [13] and CCTOP [6] are the two representatives. As the consensus methods will also integrate the profile methods, their running speed can’t compete with that of the pureseq methods. So, it remains a question that can we develop an approach for TM topology prediction that can reach the accuracy of profile or consensus methods but as efficient as pureseq methods.
Recently, we have developed a deep learning framework Deep Convolutional Neural Fields (DeepCNF) [14] for a variety of protein sequence labeling problems ranging from secondary structure element (SSE) [15], solvent accessibility (ACC) [16], to order/disorder region (DISO) [17, 18], which obtained the top-tier performance according to the third-party evaluations [19–22]. In this work, we employed DeepCNF for the prediction of TM topology labels based on amino acid sequence information only (denoted as PureseqTM). Briefly, DeepCNF can be viewed as Conditional Random Field (CRF) with Deep Convolutional Neural Network (DCNN) as its non-linear feature generating function. DeepCNF can model not only complex relationship between the input features and TM labels, but also the correlation among adjacent TM labels. These properties give DeepCNF better ability to model long-range dependencies embedded in the input features, and better performance over CRF (a model compatible to or even better than HMM and DBN) and DCNN on 1D sequence labeling tasks. Furthermore, besides considering amino acid as simply 20 alphabets, we may also take into account the physical-chemical properties of amino acids [23]. Consequently, our proposed method PureseqTM can easily take as input these pureseq features, and achieve the performacne compatible to or even better than profile and consensus methods, while keeps the similar running speed of pureseq methods.
As shown in the right part of Figure 1, PureseqTM has two modules: (i) a module for discriminating TPs and non-TPs, and (ii) a module for predicting TM topology. Experimental results show that PureseqTM greatly outperforms existing pureseq methods Phobius and Philius, especially on the identification of the correct number and boundaries of the TM segments (measured by protein-level and segment-level accuracy, respectively). Specifically, on the 39 newly released mTPs, PureseqTM achieved the best performance 0.667 in terms of protein-level accuracy, which is 12.9%, 7.7% and 5.2% better than Phobius, Philius and even Topcons2, respectively. Moreover, PureseqTM correctly identified the number and boundaries of the TM segments for at least 3 cases among this dataset where Phobius, Philius, or Topcons2 was not able to do so. Finally, we applied our method on the entire Human proteome from UniProt [24]. The results indicate that PureseqTM can not only identify the incorrect annotations of TM regions by UniProt, but also discover the membrane-related proteins that are not reviewed as membrane protein by UniProt. Since it is time-consuming to generate sequence profiles, this makes our method a good tool for proteome-wide TM topology prediction.
Result
Model Architecture
In general, as shown in Figure 2, the model architecture of PureseqTM could be considered as an integration of Hidden Markov Model (HMM) and Dynamic Bayesian Network (DBN) to produce an initial estimation of the probabilities of the transmembrane topology labels as well as the probabilities of their transitions (or more precisely, topology change). Then the long-range dependencies embedded in these information are in turn effectively exploited by a Deep Convolutional Neural Field (DeepCNF) model for predicting the binary topology labels at each amino acid position.
HMM features generated by Phobius
Phobius [12] considers the transmembrane topology prediction problem as a supervised learning problem over the observed input amino acid sequences a= a1,…,aL, and output the hidden topology labels y = y1,…,yL. The yi is from the four topology labels {s,i,m,o}, which correspond respectively to signal peptides, cytoplasmic (‘inside’) loops, membrane-spanning segments, and non-cytoplasmic (‘outside’) loops. Then Phobius established a HMM that is a generative model to produce a joint probability distribution between observed sequence a and hidden states y. When the HMM model is trained, it is possible to inference the posterior probability of the four topology labels at each amino acid position via Forward-Backward algorithm. Thus, the generated four probabilities of the topology labels from Phobius become the HMM features for our proposed DeepCNF model.
DBN features generated by Philius
DBN could be regarded as a strict generalizations of HMM [25], where the latter only contains two variables (one observation and one hidden state) in each time frame i and one connection between adjacent frames (i.e., the connection between the adjacent hidden states). On the contrary, DBN can model multiple variables in each frame as well as more than one connection between adjacent frames. This property enables DBN to encode the states that depict the transitions between the adjacent topology labels (i.e., ‘changeState’ according to Philius [8]). In particular, changeState can take value 0 or 1. If 1 is taken, then the topology label in the next frame shall be changed. Therefore, instead of predicting the posterior probability of the four topology labels (i.e., changeState is set to 0) which respectively correspond to ‘ss’, ‘ii’, ’mm’, and ’oo’, Philius is able to predict additional four posterior probability of the topology label transitions (i.e., changeState is set to 1) which respectively correspond to ‘xs’, ‘xi’, ’xm’ and ‘xo’ where ‘x’ indicate any other topology label. However, as signal peptide starts from the N-terminal and the probability of ‘xs’ only has value at the first frame, we may delete this state. Also, the ‘x’ could only be ‘m’ for ‘xi’ and ‘xo’ because of the state transition diagram in topology prediction. In summary, the generated seven transition probabilities between the topology labels from Philius become the DBN features for our proposed DeepCNF model.
Additional amino acid features
In addition to the predicted probabilities of the topology labels from Phobius and Philius, we may further utilize the features embedded in the input amino acid sequence (i.e., ‘pureseq’ features). One straightforward approach (denoted as one-hot encoding) uses a binary vector of 20 elements to indicate the amino acid type at position i. However, the 20 amino acids are not simply alphabetic letters, as they encode a variety of physiochemical properties. These properties of the 20 amino acids could be obtained from an on-line database (AAindex) [23] that forms a 20×494 matrix (i.e., each amino acid has 494 physiochemical properties). This high dimensional and redundant data can be reduced to a 20×5 matrix (i.e., each amino acid can be represented by a 5-dimentional vector), which basically represents bipolar, secondary structure, molecular volume, relative amino acid composition, and electrostatic charge, respectively [26]. Consequently, these 20+5 pureseq features are added to our proposed DeepCNF model.
Binary topology label prediction by DeepCNF
It has been reported that DeepCNF model was successfully applied to a variety of sequence labeling problems, such as protein secondary structure prediction [15], protein order/disorder region prediction [18], and detecting the boundaries of expressed transcripts from RNA-seq reads alignment [27]. Generally speaking, DeepCNF has two modules: (i) the Conditional Random Fields (CRF), and (ii) the Deep Convolutional Neural Network (DCNN). DeepCNF can not only model complex relationship between the features from the amino acid sequence and the topology label by a deep hierarchical architecture that allows the capture of long-range dependencies embedded in these features (in parameters W and U), but also explicitly depict the interdependency between adjacent topology labels (in parameter T).
Performance evaluation
We measure the prediction results in terms of the following evaluation criteria: protein-level accuracy, segment-level accuracy, and residue-level accuracy. The protein-level accuracy (pAccu) refers to the definition that a correct prediction of the whole protein should have the correct number of transmembrane segments at approximately correct locations (overlap of at least five residues) [13]. For segment-level accuracy, we use segment recall (sReca) and segment precision (sPrec). The segment recall is defined as the approximately correct prediction of the transmembrane segment with overlap of at least five residues to the ground-truth segment; while the vice versa is the definition for segment precision [8].
The residue-level accuracy is defined at each residue, which consists of Q2 accuracy, SOV score, recall (Reca), precision (Prec) and Matthews correlation coefficient (Mcc), respectively. The Q2 accuracy is defined as the percentage of residues for which the predicted transmembrane topology label is correct. The Segment OVerlap (SOV) score [28] measures how well the ground-truth and the predicted transmembrane regions match, especially at the middle region instead of terminal regions (see section S1 in Supplemental Material for details). In order to calculate recall, precision and Mcc, we define true positives (TP) and true negatives (TN) as the numbers of correctly predicted transmembrane and non-transmembrane residues, respectively; whereas false positives (FP) and false negatives (FN) are the numbers of misclassified transmembrane and non-transmembrane residues, respectively. Then recall and precision is defined as TP = (TP + FN) and TP = (TP + FP), respectively. Mcc is defined as follows:
Performance on validation dataset
The performance of our approach PureseqTM was compared to the performance of Phobius [12] and Philius [8] on the 164 validation dataset. Furthermore, to discover the importance of each input features, we conduct a study by incrementally adding the features from one-hot encoding, AAindex, HMM to DBN, respectively. This feature incremental study is critical as the HMM and DBN feature originate from Phobius and Philius, and we need to show the performance to what extend PureseqTM can reach without using these features.
As shown in Table 1, all our feature combination strategies outperform Phobius and Philius in terms of segment-level, and a large collection of residue-level accuracy, such as Q2, SOV, and Mcc. Notably, our DeepCNF model trained by only one-hot encoding features outperforms Phobius and Philius, especially in terms of Q2, SOV, and Mcc from residue-level accuracy, and segment recall and precision. This result suggests that there exist some long-range dependencies between amino acid sequence and transmembrane topology, and this information can be learned by our deep learning model better than that by HMM and DBN. Furthermore, with more features being added to DeepCNF, the performance, especially in terms of protein-level accuracy, increases incrementally. When all features are added, our method can reach 0.573 protein-level accuracy, which is compatible to Philius.
Performance on test dataset
To show the performance on “real-world” cases, we challenge our method using the 39 mTPs dataset, in which all data are released after Jul 2016 [11] and not homologous to the entries in the 328 training/validation set. Moreover, we performed a 3D structure comparison [29] between the proteins in the test dataset and those in the training dataset. Among the 39 mTPs, about 7 (15) of them have structural analogs in the 328 dataset whose TMscore > 0.65 (0.55). This means that at least 60% to 82% of the mTPs in our test dataset are novel fold to the training dataset [30]. This resembles actual challenges that no sequence or structure similarity could be found for those newly obtained mTPs.
For this task, we not only compare our PureseqTM with Phobius and Philius that rely on the input amino acid sequence only (i.e., ‘pureseq’ methods), but also compare with Topcons2 [13] which is a consensus approach relying on evolutionary profile additionally. As shown in Table 2, not surprisingly, our method again achieves better performance than Phobius and Philius in terms of some key measurements in residue-level accuracy, such as Q2 (better than 2.2%), SOV (better than 3.6%) and Mcc (better than 3.2%). Figure 3 shows the head-to-head comparison of the Mcc and SOV values between PureseqTM and the other three methods. These results indicate that PureseqTM outperforms others in terms of Mcc and SOV for a large portion of the cases in the test dataset.
Nonetheless, it’s really surprising that PureseqTM achieved the best performance 0.667 in terms of protein-level accuracy, which is 12.9%, 7.7% and 5.2% better than Phobius, Philius and even Topcons2, respectively. These results show that the improvement of our method is even higher than that in the 164 validation set, especially in protein-level accuracy. One possible explanation is that some of those mTPs in the validation set overlap with the training data of Phobius and Philius. Consequently, these results indicate that our approach could be applied to a newly released mTP that possibly has novel transmembrane topology, with only amino acid sequence as input.
Table 2 also shows a weird phenomenon that in terms of segment-level and residue-level accuracy, Philius, a pureseq method, outperforms the consensus method Topcons2 that relies on evolutionary profile as well as the output of Philius. In order to explain this case, we conduct a similar feature incremental study on training data but start from the 40 evolution-related features. See Method for details of how to generate these profile features. As shown in the bottom part of Table 1, the performance reaches its peak till the AAindex features are added, and it will decrease when adding HMM or DBN. This experiment might explain why the performance of Topcons2 is not compatible to that of Philius.
If the DeepCNF model trained by sequence profile (i.e., profile model) is applied to the 39 test set, we may find that in terms of residue-level accuracy, such as Q2 and SOV, the profile model will gain ~1-2% compared to the model trained by residue-related features (i.e., pureseq model). However, it seems that the protein-level and segment-level accuracy of the profile model doesn’t improve a lot over that of the pureseq model, which might indicate that the evolution-related features could improve the boundary detection, but not that useful for identifying the transmembrane segment. This hypothesis could also be used to explain the phenomenon why Topcons2 didn’t improve a lot over Philius in terms of protein-level and segment-level accuracy.
Case studies of the test set
To show the performance of our PureseqTM method, we select three case studies from the test set where the ground-truth of the transmembrane topology is known from PDBTM [11].
5lnko
This protein is the chain o from the ovine respiratory complex I, which has 120 residues and 2 transmembrane helices. Note that this protein is structurally dissimilar to the training dataset, where the most similar protein only has TMscore 0.47 to this protein. This indicates that 5lnko has a novel fold. As shown in Figure 4, our method predicted the correct number of transmembrane helices, while Phobius, Philius predicted one and Topcons2 failed to predict this protein as a TP. In terms of the residue-level accuracy (Table 3), our method reached 0.806, 0.833 precision, 0.758 Mcc, 0.908 Q2, and 0.978 SOV, which indicated that the overlap between our prediction and the ground-truth is very large. Therefore, the success of this case indicates that PureseqTM could be applied to predict the transmembrane topology for those “real-world” new cases that no sequence or structure similarity could be found in the sequence or template database.
5ldwY
This protein is the chain Y from the mammalian respiratory complex I, which has 141 residues and 4 transmembrane helices. The most similar protein to 5ldwY in the training dataset has TMscore 0.545, which indicates that there is no strong structural similarity. As shown in Figure 5, our method predicted the correct number of transmembrane helices, while Topcons2 predicted one and Phobius failed to predict this protein as a TP. Although Philius successfully predicted the correct number of transmembrane helices, in terms of residue-level accuracy, our method is significantly better than Philius (Table 4). In particular, the Q2, SOV and Mcc of our method is 0.823, 0.928, and 0.642, respectively, which is 6%, 5%, and 10% better than that of Philius. From this case, we may also found out that Philius tend to predict longer transmembrane segment, which will cause a better recall but lower down the precision. Another interesting phenomenon from this case study is that although using similar model architecture, say HMM in Phobius and DBN in Philius, it seems that for most of the cases Philius outperforms Phobius in terms of protein-level accuracy. This hypothesis could also be inferred from the fact that with the inclusion of DBN feature to our model, the protein-level accuracy increased significantly.
4he8E
This protein is the chain E from the respiratory complex I from Thermus thermophiles, which has 95 residues and 3 transmembrane helices. This protein has structural analogs in the training dataset, where the most similar protein has TMscore 0.694 to this protein. The sequence homologs (measured by Meff) of this protein is 1004. Figure 6 shows that our method and Topcons2 predicted the correct number of transmembrane helices, while Phobius and Philius predicted two. In terms of residue-level accuracy, the profile-based consensus method Topcons2 outperforms PureseqTM, especially in SOV and Mcc (Table 5). This is normal that when a protein has a large amount of sequence homologs, the profile-based approach will gain increased performance in the residue-level measurements, as indicated in Table 1. Thus, if the profile model of PureseqTM is used for this case, we shall obtain a much better prediction than Topcons2 (Table 5). Specifically, the Q2, SOV and Mcc of our profile mode is 0.852, 0.983, and 0.708, respectively, which is 4%, 9%, and 3% better than that of Topcons2.
Discrimination of transmembrane and non-transmembrane proteins
Before predicting the transmembrane topology, a preliminary task is to distinguish transmemrane proteins (TPs) and non-transmembrane proteins (non-TPs). This is critical for the application to genomic or proteomic data. To fulfill this task, we train a model to discriminate TPs and non-TPs by randomly adding ~1000 proteins from the non-TP dataset to the training set. To test the performance of this discrimination model, we compare with Phobius, Philius and Topcons2 on the discrimination dataset that contains 440 TPs (including both single-pass and multi-pass transmembrane proteins) and ~6400 non-TPs. We define true positives (TP) and true negatives (TN) as the numbers of correctly predicted TPs and non-TPs, respectively; whereas false positives (FP) and false negatives (FN) are the numbers of misclassified TPs and non-TPs, respectively.
As shown in Table 6, we observe that our method PureseqTM performs compatible to the other methods on this discrimination dataset, where all the four methods have relatively high success rate in distinguishing TPs from non-TPs. If we take a close look at the 15 false negatives of PureseqTM, 12 (12) of them are also false negatives of Phobius (Philius), which indicates that the failure rate of discrimination of PureseqTM depends largely on the Phobius and Philius. Furthermore, among the 15 FNs, 8 of them (53.3%) are actually single-pass transmembrane proteins (sTPs), which are not included in our training dataset. Similarly, if we take a look at the 50 false positives, 45 (31) of them are also false positives of Phobius (Philius). Among the 50 FPs, 35 of them (70%) are predicted to be sTPs. These phenomena indicate that our discrimination model can be reliably applied to distinguish mTPs from non-TPs. However, there is still room for improving the success rate to discriminate sTPs from non-TPs.
Application to Human proteome
Given the lower rate of false classifications of TPs and non-TPs, as well as the higher rate of correct identification of transmembrane segment, it should be interesting to see the application of PureseqTM for detecting the Human membrane proteome. To do so, we obtained the 20416 Human protein sequences from UniProt [24] to extract the reviewed transmembrane (TM) regions as ground truth, which results in total 5238 TPs. Among these 5238 TPs, 2399 are annotated as sTPs and the remaining 2839 are mTPs, respectively. It should be noted that even when there is proof for the existence of TM segments, it is difficult to determine their boundaries. Thus, only 52 of the TPs has experimental evidence (i.e., UniProt ECO: 0000269), while most of the ground truth annotations are based on predictions by TMHMM, Memsat and Phobius. To show the performance of PureseqTM, we first run a discrimination study on Human TPs and non-TPs, we then predict the TM topology segments on each identified TPs. We compare our method with Phobius, Philius, and Topcons2, respectively. Finally, several case studies will be displayed to imply the usefulness of PureseqTM for (i) correcting the UniProt entries whose TM segments are mislabeled, and (ii) discovering novel mTPs that are not annotated by UniProt.
First of all, Table 7 shows that PureseqTM correctly identifies 4912 (93.7%) TM proteins, second best only to Phobius that is one of the references for the ground-truth. Taking a close look at the 326 false negatives of PureseqTM, we find that 251 (286) of them are also false negatives of Phobius (Philius). Interestingly, among the 326 FN, only 36 (11%) of them are mTPs, whereas most of them belong to sTPs. In particular, among the 290 sTPs, about 115 (155) of them belong to type I (type II) signal-anchor sTP, and about 20 of them belong to mitochondria TM proteins (mtTPs). Therefore, more efforts shall be paid to improve the recognition rate for sTPs and mtTPs.
Secondly, Table 8 shows that PureseqTM reaches the second best to Phobius in terms of protein-level accuracy, segment-level accuracy as well as residue-level accuracy. This result implies that the TM topology boundaries detected by PureseqTM are closer to the UniProt annotation than those detected by Philius and Topcons2. However, it should be noted that the ground-truth itself in this task is actually a consensus annotation result, which will bias heavily to Phobius.
Thirdly, we point out that the TM topologies of a variety of UniProt entries are not correctly annotated. Specifically, there exist 186 UniProt entries in which the number and boundaries of TM segments only match with Phobius, but not Philius, Topcons, and PureseqTM (Supplemental Table S4). Here comes one example, sodium-dependent phosphate transport protein 3 (UniProt ID: O00624, and gene ID: SLC17A2), which has 439 residues and was reported to have 6-12 TM segments [31]. Phobius and UniProt annotate 9 segments, but Philius, Topcons, and PureseqTM all predict 11 segments (see Figure 7). An alternative evidence to show SLC17A2 being an 11 segment mTP comes from PredMP service [32] (http://database.predmp.com/#/databasedetail/O00624) that performs de novo folding assisted by the predicted contact map from RaptorX-Contact [33]. As shown in Figure 7, the Meff (number of non-redundant sequence homologs) value of SLC17A2 reaches ~9,680, and ln(Neff) is 4.9 where Neff is the length-normalized Meff. According to literature [34], when ln(Neff) is larger than 3.5, the predicted 3D models by RaptorX-Contact on average will have TMscore⩾0.6, which indicates that this 3D model is likely to have a correct fold. Hence, we might have strong evidences to show that SLC17A2 shall be an mTP with 11 TM segments, and the UniProt annotation is incorrect.
Fourthly, among the 380 false positives of PureseqTM, we find that 349 (209) of them are also false positives of Phobius (Philius), and 262 (68.9%) of them are predicted to be sTPs. For the remaining 118 UniProt entries that are predicted to be mTPs by PureseqTM, we figured out that 41 of them are also predicted to be a TP by Phobius, Philius, and Topcons2 (Supplemental Table S5). This phenomenon indicates that these non-TP UniProt entries might have a chance to be relevant to membrane (e.g., either buried within, interfacial to, or cross the membrane). We show here that at least the following 9 entries: Q9UMS5, Q8N3S3, Q9Y5W8, Q5JWR5, Q6NT55, P11511, P51690, Q9BXQ6, O95197 satisfy these conditions according to the literatures. For example, let’s take a detailed look at the putative homeodomain transcription factor 1 (UniProt ID: Q9UMS5, and gene ID: PHTF1), which has 762 residues. In the year 2003, J. Oyhenart et. al. determined that PHTF1 should be an integral membrane protein localized in an Endoplasmic Reticulum (ER) [35]. Later on, in the year 2015, Reta Birhanu Kitat et. al. performed an experiment based on high-PH reverse-phase StageTip fractionation to confirm that PHTF1 is a membrane protein [36]. According to our analysis, PHTF1 might contain 7 transmembrane helices (Figure 8). This evidence is not only supported by the prediction result from Topcons2, but also from the published literature by A. Manuel et. al., in which they proposed that the stretches of hydrophobic amino acid residues (from amino acids 99 to 115, 122 to 138, 477 to 493, 533 to 549, 610 to 626, 647 to 663, and 736 to 752) might be membrane-spanning domains [37]. The range of these transmembrane segments is quite close to PureseqTM and Topcons2 (Figure 8).
Finally, we challenge a question whether or not PureseqTM could find a new membrane protein (MP) that neither is labeled by UniProt, nor is predicted to be a MP by Phobius, Philius, and Topcons2. To answer this question, we first collect a subset from the 118 UniProt entreis that are predicted to be non-TP by all the other three methods. This list contains 11 entries, and at least one entry Q9BY12 (S phase cyclin A-associated protein in the endoplasmic reticulum, SCAPER) has strong literature evidences to be a membrane-related protein. Specifically, William Y. Tsang et. al. first reported that SCAPER is a perinuclear protein localized to the nucleus and primarily to the ER, in which SCAPER is most enriched in the membrane fraction [38]. Later on, William Y. Tsang et. al. further indicated that SCAPER may either be associated with the surface of the membrane or a transmembrane protein that spans the membrane at least twice [39]. This coincides with our prediction as shown in Figure 9.
Conclusion and discussion
This article has presented a deep learning method PureseqTM for transmembrane topology prediction from amino acid sequence only (i.e., pureseq features). PureseqTM makes its uniqueness from the other methods in that it employs a deep probabilistic graphical model DeepCNF to simultaneously capture long-range dependencies embedded in the input features as well as explicitly depict the interdependency between adjacent topology labels. Compared to HMM and DBN, CRF (a key module in DeepCNF) does not have the strict independence assumptions. This property allows CRF to accommodate any context information, which makes the feature design in CRF much more flexible than that in HMM and DBN. Experimental results show that PureseqTM performs much better than the state-of-the-art pureseq methods, and even reaches or outperforms the profile and consensus methods, in terms of protein-level, segment-level, and residue-level accuracy. Therefore, PureseqTM is the first approach, to our knowledge, that can reach high prediction accuracy while keeps efficient running speed, which enables the accurate annotation of whole membrane proteome in reasonable time.
Our feature incremental study has revealed three important discoveries. Firstly, the influence of the profile feature in transmembrane topology (TM) is less important compared to that in other protein structural properties such as 3-state secondary structure element (SSE), 3-state solvent accessibility (ACC), and 2-state order/disorder region (DISO). According to literatures [14, 18], the difference of the prediction accuracy in terms of QX (here X indicates the number of labels) between pureseq features and profile features are 10%, 8.5%, 4%, and 2% for SSE, ACC, DISO, and TM, respectively. Secondly, we found that the HMM and DBN features could further improve the prediction accuracy with the pureseq features, but fail to do so with the profile features. This might explain the phenomenon that in some cases the prediction accuracy of a consensus approach does not outperform the single method integrated in that consensus approach. Thirdly, although PureseqTM incorporates the HMM and DBN features generated by the two state-of-the-art pureseq methods Phobius and Philius, we show that without these features (say, only use the one-hot amino acid encoding feature) PureseqTM can still reach higher performance than Phobius and Philius in terms of segment-level and residue-level accuracy.
One obvious limitation of our method is the relatively low performance when the input sequence is a single-pass transmembrane protein (sTP), as shown in Table 6 and 7. This is normal because we exclude all the sTPs from our training/validation data. The underlying reasons are two folds: (i) the physical-chemical properties of sTPs are quite different with that of multi-pass transmembrane protein (mTPs), where the formers are mostly represented by water-soluble domains. If we add those sTPs to train our model, the prediction accuracy of mTPs will be influenced (data not shown); (ii) currently, compared to mTPs, sTPs only constitutes about 25% of the PDBTM database. However, according to our analysis, among ~5200 Human membrane proteins, ~45% of them are sTPs. This data imbalance will cause the trained model to be biased to mTPs. Therefore, in order to solve this problem, we shall distinguish them separately by a discrimination model (as what we’ve done for discriminating TPs and non-TPs), and train different models for predicting the TM region(s). Further, to solve the limited number of sTPs in the PDBTM (less than 150 if we set the non-redundancy threshold to 25% sequence identity) for training purpose, there appears a Membranome database that provides structural and functional information about more than 6000 sTPs from a variety of model organisms [40].
We may further improve the TM topology prediction for mTPs by leveraging the 2D pairwise features embedded in the co-evolutionary information [33]. It is obvious that the pairwise TM helical-helical interactions are critical for the formation of TM topology, in which the underlying pairwise features are the 2D distance map that contains all square distances between each pair of the residues. Recently, these pairwise features, as described by contact map or distance map, could be predicted accurately from co-evolutionary analysis through ultra-deep learning model [33, 34, 41]. Recently, a similar approach has been successfully applied to predict the SSE and significantly improved the accuracy [42]. Thus, we believe that such approach could also improve the TM topology prediction by a large margin.
Funding
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Awards No. FCC/1/1976-17-01, FCC/1/1976-18-01, FCC/1/1976-23-01, FCC/1/1976-25-01, FCC/1/1976-26-01, and URF/1/3450-01 to X.G. This work was also supported by National Institutes of Health (NIH) [R01GM089753] and National Science Foundation (NSF) [DBI-1564955] to J.X.
Conflict of Interest
none declared.
Method
Datasets
In general, there are three datasets used in this work: training, testing, and discrimination. We train and validate our method on the training dataset. The test dataset is applied for testing our method and comparing it with other approaches. The discrimination dataset is used for distinguishing tramsmbrane and non-transmembrane proteins. Below please find the details of each dataset.
Training dataset
The dataset used to train our proposed method is a subset of 510 transmembrane proteins created in Jul 2016 from PDBTM [11], in which any two proteins share less than 25% sequence identity [32]. Specifically, we choose 328 alpha-helical multi-pass Transmembrane Proteins (mTPs) from this 510 dataset, and divide them randomly into two equally sized sets: one for training and the other for validation (Supplemental Table S1 and S2).
Testing dataset
To test the performance of our method and compare with other approaches, we collect a set of TPs from PDBTM which are released after Jul 2016, or with no homology with the 510 dataset. To remove redundancy with the 328 training/validation dataset, we use a strict rule that (i) any two proteins in the test set share less than 25% sequence identity; (ii) there is no protein in the test set that shares >25% sequence identity or BLAST [43] E-value <0.001 with any proteins in the training/validation dataset. This creates a test dataset containing 39 mTPs (Supplemental Table S3). We use a protein structure alignment tool DeepAlign [29] to perform 3D structure comparison between the proteins in the test dataset and those in the training dataset.
Discrimination dataset
To show the performance of discriminating Transmembrane Proteins (TPs) and non-Transmembrane Proteins (non-TPs), we collect a subset containing 440 alpha-helical TPs from the 510 dataset as the TP dataset. For non-TP dataset, we first download the PDB25 dataset released at Sep 2016 [44], in which any two proteins share less than 25% sequence identity. We then exclude the proteins in PDB25 sharing >25% sequence identity or having a BLAST E-value <1 with any of the 510 dataset. This results in 6418 proteins as the non-TP dataset.
To label each residue from a given TP sequence, we used the following 9 labels extracted from PDBTM [11]: 1 (Side1), 2 (Side2), B (Beta-strand), H (alpha-helix), C (coil), I (membrane-inside), L (membrane-loop), F (interfacial helix), and U (unknown localizations). As in this work we focus on alpha-helical mTPs, the 9 labels are reduced to binary classification in which label H is denoted as ‘1’ and all other labels are denoted as ‘0’. For non-TPs, we label all residues as ‘0’.
Input features
Basically, our method in ‘pureseq’ mode only relies on the residue-related features. If evolutionary profile information is provided, our method in ‘profile’ mode could also take as input the evolution-related features. Below please find the details of each type of features.
If ‘pureseq’ mode is on, then our method only takes input the 36 residue-related features: (i) amino acid identity represented as a binary vector of 20 elements (or, one-hot encoding); (ii) reduced AAindex [23] which contains 5 highly interpretable numeric patterns of amino acid variability (see Table 2 in [26]). These features may allow for a richer representation of amino acids that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [16]; (iii) 4 predicted probabilities of the transmembrane topology labels from Phobius [12], which are cytoplasmic (label ‘i’), non-cytoplasmic (label ‘o’), transmembrane region (label ‘m’), and signal peptide (label ‘s’), respectively; (iv) 7 predicted transition probabilities between the transmembrane topology labels from Philius [8], which are ‘ii’, ‘oo’, ‘mm’, ‘ss’, ‘mi’, ‘mo’ and ‘xm’, respectively, where ‘x’ indicates ‘i’ or ‘o’.
If ‘profile’ mode is on, besides the 20 one-hot encoding and 5 AAindex features, we use additional 40 evolution-related features. In particular, we use 20 position specific scoring matrix (PSSM) generated by PSI-BLAST [43] to encode the evolutionary information at each residue. We also use 20 Hidden Markov Model (HMM) profile generated by HHpred [45], which is complementary to PSSM to some degree. The reason why we didn’t use the 4+7 predicted probabilities of the topology labels in profile mode is due to the fact that they didn’t improve the prediction accuracy (Table 1), especially the protein-level accuracy, on the validation dataset. The evolution-related features could be generated in TGT file using the procedure https://github.com/realbigws/TGT_Package.
DeepCNF training
The training procedure for the topology prediction using the DeepCNF model was based on the procedure used for training protein secondary structure element (SSE) [15]. In particular, we fix the model architecture with the following parameters: 5 layers, 100 neurons and 11 window length per position for each layer. The underlying reason is that such model architecture could reach or close to the best performance on validation data.
Although similar with the model to train SSE, here we use two cascaded procedures to train the predictor for 2-state transmembrane topology: (i) a model to distinguish TPs and non-TPs, and (ii) a model for accurately detecting transmembrane topology regions. We describe the detailed training procedures for the two models as follows.
Model for detecting transmembrane topology regions
We first describe how to train the model on mTPs. We train our model using the 164 entries from the training set and validate our model on the 164 entries from the validation set. As the model architecture is fixed, there is only one tunable hyper-parameter lambda in DeepCNF model, which is used for reducing over-fitting with a L2 norm. We tried a variety of values ranging from 0,0.5,1,1.5,2,2.5,3.5,5,7.5,10,12.5,15,17.5, 20,22.5,25,27.5,30 and found out that lambda=20 will produce the best performance on validation dataset (actually, there is no large difference when lambda is setting from 5 to 30). Although the trained model could detect topology regions accurately, it’s not performing well on the discrimination task. The underlying reason is straightforward as we didn’t feed any non-TP as the training data. We denote this trained model as ‘puretm.model’. A similar approach could be applied for training the model with evolution-related features, and we denote this model as ‘proftm.model’.
Model to distinguish TPs and non-TPs
To train this model, we use the same 164 entries from the training set and randomly choose 1000 proteins from the non-TP dataset, while keep the 164 entries from validation set as is. Note that there is a highly imbalanced non-TP/TP ratio (around 15:1) in this task. To deal with the imbalanced distribution of the topology labels, we train the DeepCNF model by maximizing AUC which is an unbiased measure for class-imbalanced data [46]. An alternative approach to train this discrimination model is to initialize the DeepCNF with the trained model ‘puretm.model’, and to early stop when the AUC value on the validation set declines. We denote this trained model as ‘detect.model’.
Feature incremental study
To show the importance of each type of features in ‘pureseq’ mode, we conduct an incremental study which is similar to a reverse operation of ablation study. Specifically, as there are four types of features: one-hot encoding, AAindex, HMM, and DBN, we incrementally add them to train the DeepCNF model on the 164 training set and check the performance on the 164 validation set. The order we choose for the four types of features is according to their model complexity. If the newly added feature type could significantly improve the prediction accuracy on validation dataset in all measurements, then we may infer that such type of feature will make large contribution to the prediction. On the other hand, if the newly added feature could only improve the accuracy in a part of the measurement but worsen or doesn’t change the others, then such feature might not be that important. A similar feature incremental study could be conducted for ‘profile’ mode.
Transmembrane topology prediction
When an amino acid sequence is input, PureseqTM will first call detect.model to distinguish TP and non-TP. If TP is identified, then PureseqTM will employ puretm.model to predict the transmembrane topology (‘1’ for transmembrane and ‘0’ for non-transmembrane) and also the corresponding probabilities at each residue. Our method also enables the sequence profile as the input when ‘profile’ mode is on. It should be noted that if the predicted transmembrane segment satisfies the two conditions: (i) the segment length is above 30 and (ii) there exist two peaks in the predicted probability, then PureseqTM will cut this segment into two at the position where the local minimum of the probability is found.
Programs to compare
We compare our method with the following programs: Phobius [12], Philius [8], and TOPCONS2 [13] for 2-state transmembrane topology prediction. Phobius and Philius are methods that only rely on the input sequence information (i.e., ‘pureseq’ methods); TOPCONS2 is a consensus approach that combines the outputs from different predictors ranging from Phobius, Philius, SCAMPI [47] and OCTOPUS [10]. It should be noted that SCAMPI is a pureseq method, whereas OCTOPUS rely on evolutionary profile (i.e., ‘profile’ method).
We run Phobius and Philius with their default parameters. For TOPCONS2, we submit all the relevant sequences from the datasets to its server to obtain the prediction results. It should be noted that all of the three approaches will return more labels (such as signal peptide) other than binary topology prediction. To transfer their results into 2-state transmembrane topology, we only keep the probability in state ‘transmembrane’ as label ‘1’, while regarding other states as label ‘0’ (i.e., non-transmembrane).
Data and software availability
The web server implementing the DeepCNF model for transmembrane topology prediction from the amino acid sequence features only is publicly available at http://pureseqtm.predmp.com/. The code for the stand-alone package is maintained on GitHub (https://github.com/PureseqTM/pureseqTM_package). The list of the training, validation, and testing dataset is available in the Supplemental Material. For more detailed information about the relevant dataset, such as the Human proteome dataset, users are suggested to visit https://github.com/PureseqTM/PureseqTM_Dataset.