PT - JOURNAL ARTICLE
AU - Dagmar Ilzhoefer
AU - Michael Heinzinger
AU - Burkhard Rost
TI - SETH predicts nuances of residue disorder from protein embeddings
AID - 10.1101/2022.06.23.497276
DP - 2022 Jan 01
TA - bioRxiv
PG - 2022.06.23.497276
4099 - http://biorxiv.org/content/early/2022/06/26/2022.06.23.497276.short
4100 - http://biorxiv.org/content/early/2022/06/26/2022.06.23.497276.full
AB - Predictions of millions of protein 3D structures are only a few clicks away since the release of AlphaFold2 results for entire data sets. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer’s Disease. We showed that the absence of reliable AlphaFold2 predictions correlated only to a limited extent with IDRs. In contrast, many expert methods predict IDRs directly and reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments. Some of these input features are not always available and are computationally expensive to generate, which limits their scalability. In this work, we present the novel prediction method SETH, which predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which uses only single sequences as input. SETH, a relatively shallow convolutional neural network, outperformed much more complex state-of-the-art solutions while being much faster: predictions for the human proteome can be generated in less than 30 minutes on a machine with one RTX A6000 GPU with 48 GB RAM. Trained on a continuous disorder scale, our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of other predictors.
The new method is freely available at: https://github.com/DagmarIlz/SETH.
Competing Interest Statement: The authors have declared no competing interest.
Abbreviations:
3D: three-dimensional, i.e., co-ordinates of all atoms/residues in a protein
AI: artificial intelligence
AlphaFold2: AI-based method reliably predicting protein 3D structure from EI/MSAs (Jumper et al., 2021)
ANN: artificial feed-forward neural network
AUC: area under the receiver operating characteristic curve
CheZOD scores: chemical shift Z-scores (Nielsen and Mulder, 2019)
CI: confidence interval, here typically used as the 95% CI implying an interval between ±1.96*StandardError
CNN: convolutional neural network
EI: evolutionary information
IDR: intrinsically disordered region
LogReg: logistic regression
LSTM: long short-term memory cell
MSA: multiple sequence alignment
NLP: Natural Language Processing
NMR: nuclear magnetic resonance
PIDE: percentage pairwise sequence identity
pLDDT: predicted local distance difference test from AlphaFold2 (Jumper et al., 2021)
pLM: protein language model
SOTA: state-of-the-art
t-SNE: t-distributed stochastic neighbor embedding
ρ: Spearman correlation coefficient
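The CI entry in the abbreviations defines the 95% confidence interval as ±1.96*StandardError. The record does not say how the standard error was estimated, so the following is only a minimal sketch under one assumption: the standard error of the mean, computed from the sample standard deviation. The function name `ci95` and the example scores are hypothetical and are not taken from the paper.

```python
import math
import statistics


def ci95(values):
    """Return (mean, half_width), where half_width = 1.96 * standard error.

    Assumption: the standard error is that of the mean, i.e. the sample
    standard deviation divided by sqrt(n). The source only states the
    1.96 * StandardError rule, not the estimator.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, 1.96 * std_err


# Hypothetical per-sample performance values, for illustration only.
scores = [0.70, 0.74, 0.69, 0.73, 0.71, 0.75, 0.72, 0.68]
mean, half_width = ci95(scores)
print(f"{mean:.3f} ± {half_width:.3f}")
```

The interval is then read as mean − half_width to mean + half_width. If the standard error came from a different procedure (for example, resampling), only the `std_err` line would change.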