A high-precision hybrid algorithm for predicting eukaryotic protein subcellular localization

Motivation: Subcellular location plays an essential role in protein synthesis, transport, and secretion, so determining it is an important step in understanding the mechanisms of trait-related proteins. Homology-based methods generally give reliable results when E-values are small. For proteins that do not share significant homologous domains with known proteins, we must resort to pattern recognition algorithms (SVM, Fisher discriminant, KNN, random forest, etc.), but these seldom give satisfactory results.

Results: Here, a novel hybrid method, "Basic Local Alignment Search Tool+Smith-Waterman+Needleman-Wunsch" or BLAST+SWNW, was developed by integrating the Basic Local Alignment Search Tool (BLAST), run with a loosened E-value, with the Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms, and was applied to predict protein subcellular localization in eukaryotes. When tested on Dataset I and Dataset II, BLAST+SWNW showed average accuracies of 97.18% and 99.60%, respectively, surpassing the performance of other algorithms in predicting eukaryotic protein subcellular localization.

Availability and Implementation: BLAST+SWNW is an open source collaborative initiative available in the GitHub repository (https://github.com/ZHANGDAHAN/BLAST-SWNW-for-SLP or http://202.206.64.158:80/link/72016CAC26E4298B3B7E0EAF42288935).

Contact: zhaqi1972@163.com; zhangdahan@genetics.ac.cn

Supplementary Information: Supplementary data are available at PLOS Computational Biology online.


Following the completion of human, Arabidopsis thaliana [1], and rice genome sequencing projects in the

Its drill dataset, which covers 18 subcellular location classes, includes 35,738 eukaryotic proteins from SWISS-PROT release 2013_11 (S1 Table), and its test dataset is composed of 3 independent test datasets T1, T2, and T3 that

Proteins located in each SL generally play fundamental roles in cell biology. In detail, the "Thylakoid" is an essential site for light-dependent reactions and consists of the thylakoid membrane and thylakoid lumen [31].
The "Lysosome" is a membrane-bound cellular organelle found in most animal cells that harbours over 50 different enzymes responsible for plasma membrane repair, cell signalling, and energy metabolism [32].

We take Dataset I as an example to illustrate BLAST+SWNW (Fig 1). First, for a sequence (named "queryseq") from the test dataset of Dataset I, we use BLASTP with a loosened E-value of 30 to search the drill dataset of Dataset I for similar protein sequences (named "databaseseqs"). In this paper, 30 is the default E-value of BLAST+SWNW. Second, we delete all non-standard amino acid characters from "queryseq" and "databaseseqs" to meet the input requirements of SW and NW. Third, p values between the "queryseq" and each similar "databaseseq" are calculated by integrating the output of SW and NW. In detail, for a "queryseq" and "databaseseq" pair, we set the parameter λ to 0.16931 and k to 0.20441, and define m and n as the lengths of "queryseq" and "databaseseq", respectively. Then, the similarity scores measured by SW (sw_score) and by NW
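(nw_score) are calculated using the MATLAB functions swalign() and nwalign(), and the P_SWNW values between "queryseq" and "databaseseq" are calculated as follows:

$$P_{SWNW}(\mathrm{queryseq}, \mathrm{databaseseq}) = P\big(S \ge \mathrm{sw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})\big) \times P\big(S \ge \mathrm{nw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})\big)$$

$$P\big(S \ge \mathrm{sw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})\big) = 1 - e^{-k\,m\,n\,e^{-\lambda\,\mathrm{sw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})}}$$

$$P\big(S \ge \mathrm{nw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})\big) = 1 - e^{-k\,m\,n\,e^{-\lambda\,\mathrm{nw\_score}(\mathrm{queryseq}, \mathrm{databaseseq})}}$$

Finally, the SL of the "databaseseq" with the minimum P_SWNW is chosen as the predicted SL of "queryseq".

[...] Table 1). The SL "Secreted" in the 18 SLs of Dataset I corresponds to a union of "Secreted" and "Cell wall" in the 24 SLs of Dataset II; SL "Chloroplast" in Dataset I is equivalent to the combination of "Chloroplast", "Thylakoid" and "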
Thylakoid membrane" in Dataset II; and SL "Plastid" in Dataset I corresponds to a union of "Plastid" and "Plastid membrane" in Dataset II. In this sense, LocTree3 can predict 22 SLs in Dataset II, which represent all SLs except for "Lysosome" and "Lysosome membrane". Additional details and the corresponding relationships between the 24 SLs of BLAST+SWNW, 11 SLs of CELLO and 5 SLs of YLOC are shown in S11 and S12 Tables.

A test dataset with 5497 sequences that were randomly selected from Dataset II was submitted to the [...] in S13 Table.

Two out of 24 SLs, "Lysosome" and "Lysosome membrane", are not available in LocTree3. When we tested sequences from the other 22 SLs, the average sensitivity, specificity and accuracy of LocTree3 were 74.01%, 98.77%, and 97.65%, respectively, with possible overlaps existing between the drill dataset of
LocTree3 and our test dataset. In contrast, even after the overlap was removed, the average sensitivity, specificity, and accuracy of BLAST+SWNW still reached 94.47%, 99.82%, and 99.60%, respectively, and the average sensitivity, specificity, and accuracy of BLAST+SWNW with an E-value=10⁻⁵ were 92.00%, 99.89%, and 99.57%, respectively, based on the same dataset (S14 Table).
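
As an illustration of how per-SL sensitivity, specificity, and accuracy of this kind can be computed, the following minimal Python sketch evaluates each SL one-vs-rest from true and predicted labels. It assumes the standard definitions (TP/(TP+FN), TN/(TN+FP), (TP+TN)/N) and a simple unweighted mean over SLs; the exact averaging used for the figures above is not stated in this section, and the function names and toy labels are hypothetical.

```python
def per_class_metrics(true_labels, pred_labels):
    """One-vs-rest sensitivity, specificity and accuracy for every true class."""
    n = len(true_labels)
    pairs = list(zip(true_labels, pred_labels))
    metrics = {}
    for c in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = n - tp - fn - fp
        metrics[c] = {
            "sensitivity": tp / (tp + fn),
            # Specificity is undefined if every sequence belongs to class c.
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "accuracy": (tp + tn) / n,
        }
    return metrics


def mean_metric(metrics, name):
    """Unweighted mean of one metric over all classes (an assumed averaging)."""
    values = [m[name] for m in metrics.values()]
    return sum(values) / len(values)


# Toy labels, for illustration only (not data from the paper).
true_sl = ["Nucleus", "Nucleus", "Secreted", "Lysosome"]
pred_sl = ["Nucleus", "Secreted", "Secreted", "Lysosome"]
result = per_class_metrics(true_sl, pred_sl)
```

Because each SL is scored against all others, specificity and accuracy are dominated by true negatives when there are many classes, which is one reason they can sit far above sensitivity.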

BLAST with an E-value=30 and LocTree3) by applying them to Dataset II; their sensitivities and weighted average sensitivities (marked as "In total") are shown in Fig 3. BLAST+SWNW with an E-value=30 and BLAST with an E-value=30 were clearly superior to the other two algorithms. Predictions for 16 of the 24 SLs were better with BLAST+SWNW with an E-value=30 than with LocTree3. BLAST+SWNW with an E-value=30 is therefore a better algorithm for predicting protein SLs, because it covers more subcellular locations and has higher sensitivity than LocTree3, BLAST, and others.

[...] Table). [...] (Fig 5). Therefore, 1 was chosen as the default, meaning no threshold was set for SW or NW.
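
The P_SWNW scoring step of BLAST+SWNW can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: it assumes each p value has the form P(S ≥ x) = 1 − exp(−k·m·n·e^(−λx)) with the λ and k given in the method description, that the SW and NW p values are combined by multiplication, and that the SW and NW scores have already been computed elsewhere (for example with swalign() and nwalign()). The helper names and the example hits are hypothetical.

```python
import math

LAMBDA = 0.16931  # lambda, as given in the method description
K = 0.20441       # k, as given in the method description
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(seq):
    """Drop every character that is not one of the 20 standard amino acids."""
    return "".join(ch for ch in seq.upper() if ch in STANDARD_AA)


def p_value(score, m, n, lam=LAMBDA, k=K):
    """Assumed form: P(S >= score) = 1 - exp(-k*m*n*exp(-lam*score))."""
    # Cap the exponent so very negative global scores cannot overflow exp().
    expected_hits = k * m * n * math.exp(min(-lam * score, 700.0))
    return -math.expm1(-expected_hits)


def p_swnw(sw_score, nw_score, m, n):
    """Assumed combination: product of the SW and NW p values."""
    return p_value(sw_score, m, n) * p_value(nw_score, m, n)


def predict_sl(query_len, hits):
    """Return the SL of the hit with the smallest P_SWNW.

    hits: iterable of (sl, database_len, sw_score, nw_score) tuples.
    """
    best = min(hits, key=lambda h: p_swnw(h[2], h[3], query_len, h[1]))
    return best[0]


# Hypothetical BLASTP hits for one query of length 100.
example_hits = [
    ("Nucleus", 120, 50.0, 20.0),
    ("Secreted", 110, 200.0, 150.0),
]
predicted = predict_sl(100, example_hits)
```

Since every p value is at most 1, a threshold of 1 on the SW or NW p values filters nothing, which is consistent with the default described above.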

The robustness of BLAST+SWNW to different levels of identity between drill and test sequences
After sequences in the drill dataset with 99% or higher identity to test sequences were removed, [...] programming to construct a matrix and use backtracking to find an optimal alignment between two given sequences. In this sense, BLAST can be more strongly reinforced by SW and NW than by pattern recognition algorithms. In other words, BLAST+SWNW is more powerful than BLAST+pattern recognition methods such as LocTree3. Additionally, the poor performance of pattern recognition algorithms (shown in S1 Text, S8-S10 Tables) suggests that additional intrinsic features should be added and used in SLP to improve prediction sensitivity, specificity, and accuracy.
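
The matrix-and-backtracking idea mentioned above can be illustrated with a minimal Needleman-Wunsch sketch in Python. It is a simplified teaching example under stated assumptions: it uses a fixed match/mismatch score and a linear gap penalty (hypothetical default values), whereas tools such as MATLAB's nwalign() use substitution matrices and their own gap settings, so the scores are not comparable with those used in BLAST+SWNW.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Backtrack from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "ACT")
```

Smith-Waterman follows the same pattern but clamps each cell at zero and backtracks from the highest-scoring cell, which yields a local rather than a global alignment.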