Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks

Characterization of protein complexes, i.e., sets of proteins assembling into a single larger physical entity, is important, as such assemblies play many essential roles in cells, such as gene regulation. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. Most community detection algorithms tend to be unsupervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms. Here, we present Super.Complex, a distributed, supervised AutoML-based pipeline for overlapping community detection in weighted networks. We also propose three new evaluation measures for the outstanding issue of comparing sets of learned and known communities satisfactorily. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities, and a parallel implementation can be run on a computer cluster for scaling to large networks. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Applying Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, including 234 complexes linked to SARS-CoV-2 (the virus that causes COVID-19) and 111 uncharacterized proteins present in 103 learned complexes. Super.Complex is generalizable, with the ability to improve results by incorporating domain-specific features.
Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.

Comparing Super.Complex with 2-stage clustering on the hu.MAP dataset using 6 existing and 3 new evaluation metrics shows comparable performance for both algorithms. In summary, Super.Complex achieves better performance on 7 precision-like metrics (18% higher on average and 9% by median, with the remaining 2 metrics favoring 2-stage clustering by 3% on average and by median), while the 2-stage clustering method achieves better performance on the recall-like measures of all 9 metrics, by 10% on average and 9% by median. Super.Complex achieves better F1-score-like measures on 4 of the metrics, by 4% on average and 2% by median, compared to 5 wins for 2-stage clustering, by 2% on average and by median.

Candidate communities can be merged to yield a community of higher community score. The maximum number of iterations provided guarantees that the seeding and growth converge to a solution.
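The merging step mentioned above can be sketched as follows. This is an illustrative sketch only, not the pipeline's actual post-processing: `score` stands in for the learned community fitness function, and the pairwise merge criterion and iteration strategy here are assumptions.

```python
def merge_communities(communities, score):
    """Illustrative merge pass: repeatedly merge a pair of candidate
    communities (sets of node ids) whenever their union achieves a
    higher community score than either community alone."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                union = merged[i] | merged[j]
                # Merge only if the combined community scores strictly higher.
                if score(union) > max(score(merged[i]), score(merged[j])):
                    merged[i] = union
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Toy fitness favoring communities of up to 4 nodes (purely for illustration).
toy_score = lambda c: len(c) if len(c) <= 4 else 0
print(merge_communities([{1, 2, 3}, {3, 4}, {9, 10}], toy_score))
# -> [{1, 2, 3, 4}, {9, 10}]
```

The pass repeats until no pair improves, which is why a cap on iterations (as in the pipeline) is useful on large inputs.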

Robustness of the Super.Complex algorithm

In the candidate community search, we can expect that the learned complexes will differ from experiment to experiment due to 3 modes of possible stochasticity, namely: (i) the ε-greedy heuristic.

Code was not available for ClusterEPs, or for the other supervised methods in Table 3 (RM, NN, and ClusterSS). Subsequently, the proteins of the virus have the potential to interact with multiple human proteins. We preliminarily investigated the human protein complexes that may be affected by the virus.

Qi overlap measure: a learned community C1 and a known community C2 are considered to match if

    |C1 ∩ C2| / |C1| > t  and  |C1 ∩ C2| / |C2| > t,

where t is the user-specified threshold, usually set to 0.5.

Here, |C| is the number of nodes in community C, |C1 ∩ C2| is the number of nodes common to both communities, and |C1 ∪ C2| is the size of their union, i.e., the total number of nodes in the two communities with each node counted once.
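In code, these similarity measures and the match-based evaluation they enable might look as follows. This is a minimal sketch assuming communities are represented as Python sets of node ids; `community_prf` is an illustrative helper, not the pipeline's exact metric definitions.

```python
def jaccard(c1, c2):
    """Jaccard coefficient |C1 ∩ C2| / |C1 ∪ C2| for two communities
    given as sets of node ids (each node counted once in the union)."""
    if not (c1 or c2):
        return 0.0
    return len(c1 & c2) / len(c1 | c2)

def qi_match(c1, c2, t=0.5):
    """Qi overlap measure: the communities match if their intersection
    covers more than a fraction t of each community."""
    if not (c1 and c2):
        return False
    overlap = len(c1 & c2)
    return overlap / len(c1) > t and overlap / len(c2) > t

def community_prf(learned, known, matches):
    """Illustrative precision/recall/F1 over sets of communities, given
    any boolean match predicate such as qi_match or a thresholded
    jaccard. Precision: fraction of learned communities matching some
    known community; recall: fraction of known communities matched."""
    p = sum(any(matches(l, k) for k in known) for l in learned) / len(learned)
    r = sum(any(matches(l, k) for l in learned) for k in known) / len(known)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

learned = [{"a", "b", "c"}, {"x", "y"}]
known = [{"a", "b", "c", "d"}]
print(jaccard(learned[0], known[0]))    # 3 shared / 4 total = 0.75
print(qi_match(learned[0], known[0]))   # 3/3 and 3/4 both > 0.5 -> True
print(community_prf(learned, known, qi_match))  # (0.5, 1.0, 0.666...)
```

Any similarity measure above the chosen threshold can serve as the match predicate, which is what allows the same precision/recall machinery to be reused across metrics.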

Once we have a way to match a learned community to a known community, or vice-versa, using a similarity measure, standard machine learning metrics such as precision, recall, and F1 scores can be computed. We also plot PR curves for the learned edges. The learned communities are evaluated against all known communities, against only the training communities, and against only the testing communities. We report the results on the test communities for the yeast experiment in Table 3.

Time complexity

The average time complexity of the algorithm is determined by its two main steps, the AutoML model training and the candidate community search. Here, c is the number of processes in the single compute node running the AutoML step.

The candidate community search step has a complexity of O((S/P) × G⁴ × K × X), keeping in mind that in the worst case, the feature extraction step has a complexity of O(n³) for a subgraph with n nodes. Here, P is the total number of processes used to run this code in parallel; S is the number of seeds used to sample candidate communities, which is typically N, the number of nodes in the network; G is the number of steps in each community's growth, and for the complexity calculation is also taken to be the final number of nodes in the subgraph after growth (as the two are similar); and K is the average degree of the graph. X is the machine learning model inference time for one value, which can be assumed to be a small constant in cases with fairly simple models. We note that in an alternate method that we provide, one can specify the number of neighbors to check, say M, instead of checking all neighbors, achieving gains when M < K, with the complexity now O((S/P) × G⁴ × M × X).
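A seeded growth loop of this shape can be sketched as below. This is an illustration under assumptions, not the pipeline's implementation: `score` stands in for the learned fitness function, and the graph is an adjacency-set dict. The G growth steps, the up-to-K (or sampled-M) neighbor evaluations per step, and the one model inference (cost X) per evaluation are all visible in the loop structure.

```python
import random

def grow_community(graph, seed, score, steps, m=None, eps=0.0):
    """Grow a candidate community from `seed` for up to `steps` (G)
    iterations. At each step, boundary neighbors (up to K of them, or
    a random sample of m, mirroring the M < K variant) are scored with
    the community fitness function `score`; the best-scoring neighbor
    is added, or, with probability eps, a random one (epsilon-greedy)."""
    community = {seed}
    for _ in range(steps):
        # Boundary: neighbors of current members not yet in the community.
        neighbors = set().union(*(graph[n] for n in community)) - community
        if not neighbors:
            break
        if m is not None and len(neighbors) > m:
            neighbors = random.sample(sorted(neighbors), m)
        if eps and random.random() < eps:
            chosen = random.choice(sorted(neighbors))
        else:
            # One model inference (cost X) per candidate neighbor.
            chosen = max(neighbors, key=lambda n: score(community | {n}))
        community.add(chosen)
    return community

# Toy example: density-like score = total within-community degree.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
density = lambda nodes: sum(len(graph[n] & nodes) for n in nodes)
print(grow_community(graph, "a", density, steps=2))  # {'a', 'b', 'c'}
```

With eps > 0 or m set, repeated runs can return different communities from the same seed, which is the stochasticity discussed in the robustness section above.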

Using all nodes as seeds (S = N), in very large sparse protein-interaction networks (small average degree K ≪ N) with small complexes (for example, we use < 10 nodes in our yeast experiments) and