RT Journal Article
SR Electronic
T1 Updating the in silico human surfaceome with meta-ensemble learning and feature engineering
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 499780
DO 10.1101/499780
A1 Daniel Bojar
YR 2018
UL http://biorxiv.org/content/early/2018/12/19/499780.abstract
AB Besides being targeted by most available drugs, human proteins located in the plasma membrane are responsible for a plethora of essential cellular functions, ranging from signaling to transport processes. To target and study these transmembrane proteins, their plasma membrane location has to be established. Yet experimental validation of the thousands of potential plasma membrane proteins is laborious and technically challenging. A recent study used machine learning to classify surface and non-surface transmembrane proteins in human cells based on curated high-quality training data from the Cell Surface Protein Atlas (CSPA) and other databases, reporting a cross-validation prediction accuracy of 93.5% (1). Here, we report SURFY2, an improved version of the surfaceome predictor SURFY, which uses the same training data with a meta-ensemble classification approach involving feature engineering. SURFY2 yielded predictions with an accuracy score of 95.5% on a test dataset that the classifier had never seen. Importantly, we found several high-confidence re-classifications of disease-relevant proteins among the discrepant predictions between SURFY and SURFY2. To rationalize the prediction mechanism of SURFY2 and to analyze differently classified transmembrane proteins, we investigated classifier feature importances and data distributions between prediction sets. SURFY2 exhibited both increased precision and increased recall compared to SURFY and delivers the best in silico human surfaceome to date. This updated version of the surfaceome will instigate further advances in drug targeting and in research on cellular signaling and transport processes.