Summary
The large and ever-increasing number of quantitative proteomics datasets constitutes a currently underexploited resource for gaining biological insight into proteins and their functions. Multiple observations by different laboratories indicate that the members of a protein complex often follow consistent trends across experiments. However, proteomic data are often noisy and incomplete: members of a complex may correlate in only a fraction of all experiments, or may not always be observed. Including potentially uninformative data therefore risks weakening such biological signals. We have previously used the Random Forest (RF) machine-learning algorithm to distinguish functional chromosomal proteins from ‘hitchhikers’ in an analysis of mitotic chromosomes. Although RFs are assumed to need large training sets, in this technical note we show that they can also detect small high-covariance groups, such as protein complexes, and relationships between them. We use artificial datasets to demonstrate that RFs robustly identify small groups even when working with mixes of noisy and apparently uninformative experiments. We then use our procedure to retrieve a number of chromosomal complexes from real quantitative proteomics results that compare wild-type and multiple different knock-out mitotic chromosomes. The procedure also revealed other proteins that covary strongly with these complexes, suggesting novel functional links. Integrating the RF analysis for several complexes revealed the known interdependency of kinetochore subcomplexes, as well as an unexpected dependency between the Constitutive Centromere-Associated Network (CCAN) and the condensin (SMC2/4) complex. Serving as a negative control, ribosomal proteins remained independent of kinetochore complexes. Together, these results show that this complex-oriented RF (nanoRF) can uncover subtle protein relationships and higher-order dependencies in integrated proteomics data.
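The idea of training an RF on a very small positive set can be illustrated with a toy example. The sketch below is not the authors' nanoRF implementation or their artificial datasets; it assumes scikit-learn's `RandomForestClassifier`, and every name and parameter (`make_toy_dataset`, `oob_complex_scores`, 200 proteins, 20 experiments, 8 complex members, 10 informative experiments) is hypothetical. It scores each protein by its out-of-bag (OOB) vote fraction, so that no protein is scored by trees that saw it during training. Missing values are not modelled.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_toy_dataset(n_proteins=200, n_experiments=20, n_members=8,
                     n_informative=10, noise_sd=0.3, seed=0):
    """Simulate protein ratios: rows are proteins, columns are experiments.

    Assumption for illustration: the first `n_members` proteins form a
    'complex' that shares a common profile in the first `n_informative`
    experiments only; all other values are independent noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n_proteins, n_experiments))
    shared = rng.normal(0.0, 2.0, size=n_informative)
    member_noise = rng.normal(0.0, noise_sd, size=(n_members, n_informative))
    X[:n_members, :n_informative] = shared + member_noise
    y = np.zeros(n_proteins, dtype=int)
    y[:n_members] = 1  # 1 = complex member, 0 = background
    return X, y


def oob_complex_scores(X, y, n_trees=500, seed=0):
    """Return, per protein, the OOB fraction of votes for the complex class."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,            # keep out-of-bag predictions
        class_weight="balanced",   # compensate for the tiny positive class
        random_state=seed,
    )
    forest.fit(X, y)
    # Column 1 corresponds to class label 1 (complex member).
    return forest.oob_decision_function_[:, 1]


X, y = make_toy_dataset()
scores = oob_complex_scores(X, y)
member_mean = scores[y == 1].mean()
background_mean = scores[y == 0].mean()
print(f"mean OOB score, complex members: {member_mean:.2f}")
print(f"mean OOB score, background:      {background_mean:.2f}")
```

Background proteins with high OOB scores would be the candidates for covarying with the complex; in this toy setting they arise only by chance.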
Abbreviations:
- RF: Random Forest
- MCCP: Multi-Classifier Combinatorial Proteomics
- nanoRF: Random forests trained with small training sets
- MVP: Multivariate proteomic profiling
- FP: Fractionation profiling
- ICP: Interphase chromatin probability
- CCAN: Constitutive Centromere-Associated Network
- Nup: Nucleoporin
- SMC: Structural Maintenance of Chromosomes
- SILAC: Stable Isotope Labeling by Amino acids in Cell culture