ABSTRACT
Post-translational modifications (PTMs), chemical or proteinaceous covalent alterations to the side chains of amino acid residues in proteins, are a rapidly expanding feature class of significant importance in cell biology. Due to a high burden of experimental proof and the lack of effective means for experimentalists to prioritize PTMs by functional significance, fewer than ~2% of all PTMs currently have an assigned biological function. Here, we describe a new artificial neural network model, SAPH-ire TFx, for the functional prediction of experimentally observed eukaryotic PTMs. Unlike previous functional PTM prioritization models, SAPH-ire TFx is optimized with both receiver operating characteristic (ROC) and recall metrics that maximally capture the range of diverse feature sets comprising the functional modified eukaryotic proteome. The tool was developed through systematic evaluation of input features, model architectures, training procedures, and interpretation metrics using a 2018 training dataset of 430,750 PTMs containing 7,480 PTMs with literature-supported evidence of biological function. The resulting model was used to classify an expanded 2019 dataset of 512,015 PTMs (12,867 known functional) containing 102,475 PTMs not encountered in the original dataset. Model output from the 2019 extended dataset was benchmarked against pre-existing prediction models, revealing superior performance in classification of functional and/or disease-linked PTM sites. Finally, a dynamic web interface provides customizable graphical and tabular visualization of PTM and SAPH-ire TFx data within the context of all modifications within a protein family, exposing several metrics by which important functional PTMs can be identified for investigation.
AUTHOR SUMMARY
The modification of proteins after they are translated is an important process that can control the structure and function of the proteins on which the modifications occur. Hundreds of different types of modification happen at some point during the lifetime of every protein in eukaryotic cells and play an essential role in cellular processes such as cell division, cell communication, and gene regulation. With current state-of-the-art detection tools, the rate at which post-translational modifications are detected now far surpasses the rate at which they can be investigated for functionality. Furthermore, not all detected modifications are functional, making it difficult to determine which modifications merit experimental effort. Here, we describe a new computational tool – SAPH-ire TFx – capable of predicting functional modification sites from large-scale datasets, and consequently of focusing experimental effort on only those modifications that are likely to be biologically significant. We show that the tool performs well across multiple datasets within which known functional modifications are scattered, and that it outperforms prior functional prioritization tools. Finally, we provide a user-friendly web tool for experimentalists to investigate SAPH-ire TFx output for proteins and protein families of interest.