RT Journal Article
SR Electronic
T1 treeClust improves protein co-regulation analysis due to robust selectivity for close linear relationships
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 578971
DO 10.1101/578971
A1 Georg Kustatscher
A1 Piotr Grabowski
A1 Juri Rappsilber
YR 2019
UL http://biorxiv.org/content/early/2019/03/20/578971.abstract
AB Gene co-expression analysis is a widespread method to identify the potential biological function of uncharacterised genes. Recent evidence suggests that proteome profiling may provide more accurate results than transcriptome profiling. However, it is unclear which statistical measure is best suited to detect proteins that are co-regulated. We have previously shown that expression similarities calculated using treeClust, an unsupervised machine-learning algorithm, outperformed correlation-based analysis of a large proteomics dataset. The reason for this improvement is unknown. Here we systematically explore the characteristics of treeClust similarities. Leveraging synthetic data, we find that tree-based similarities are exceptionally robust against outliers and detect only close-fitting, linear protein – protein associations. We then use proteomics data to demonstrate that both of these features contribute to the improved performance of treeClust relative to Pearson, Spearman and robust correlation. Our results suggest that, for large proteomics datasets, unsupervised machine-learning algorithms such as treeClust may significantly improve the detection of biologically relevant protein – protein associations relative to correlation metrics.