Abstract
Predictions of millions of protein 3D structures are only a few clicks away since the release of AlphaFold2 results for entire data sets. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer’s Disease. We showed that the absence of reliable AlphaFold2 predictions correlated with IDRs only to a limited extent. In contrast, many expert methods predict IDRs directly and reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments. Some of these input features are not always available and are computationally expensive to generate, which limits the scalability of these methods. In this work, we present the novel prediction method SETH, which predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which uses only single sequences as input. SETH, a relatively shallow convolutional neural network, outperformed much more complex state-of-the-art solutions while being much faster: it creates predictions for the human proteome in fewer than 30 minutes on a machine with one RTX A6000 GPU with 48 GB RAM. Trained on a continuous disorder scale, our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of other predictors. The new method is freely available at: https://github.com/DagmarIlz/SETH.
Competing Interest Statement
The authors have declared no competing interest.
9 Abbreviations
- 3D: three-dimensional, i.e., co-ordinates of all atoms/residues in a protein
- AI: artificial intelligence
- AlphaFold2: AI-based method reliably predicting protein 3D structure from EI/MSAs (Jumper et al., 2021)
- ANN: artificial feed-forward neural network
- AUC: area under the receiver operating characteristic curve
- CheZOD scores: chemical shift Z-scores (Nielsen and Mulder, 2019)
- CI: confidence interval; here typically the 95% CI, i.e., the interval of ±1.96*StandardError around the estimate
- CNN: convolutional neural network
- EI: evolutionary information
- IDR: intrinsically disordered region
- LogReg: logistic regression
- LSTM: long short-term memory cell
- MSA: multiple sequence alignment
- NLP: Natural Language Processing
- NMR: nuclear magnetic resonance
- PIDE: percentage pairwise sequence identity
- pLDDT: predicted local distance difference test from AlphaFold2 (Jumper et al., 2021)
- pLM: protein language model
- SOTA: state-of-the-art
- t-SNE: t-distributed stochastic neighbor embedding
- ρ: Spearman correlation coefficient