Quantitative Missense Variant Effect Prediction Using Large-Scale Mutagenesis Data

Cell Syst. 2018 Jan 24;6(1):116-124.e3. doi: 10.1016/j.cels.2017.11.003. Epub 2017 Dec 6.

Abstract

Large datasets describing the quantitative effects of mutations on protein function are becoming increasingly available. Here, we leverage these datasets to develop Envision, which predicts the magnitude of a missense variant's molecular effect. Envision combines 21,026 variant effect measurements from nine large-scale experimental mutagenesis datasets, a hitherto untapped training resource, with a supervised, stochastic gradient boosting learning algorithm. Envision outperforms other missense variant effect predictors both on large-scale mutagenesis data and on an independent test dataset comprising 2,312 TP53 variants whose effects were measured using a low-throughput approach. This dataset was never used for hyperparameter tuning or model training and thus serves as an independent validation set. Envision prediction accuracy is also more consistent across amino acids than other predictors. Finally, we demonstrate that Envision's performance improves as more large-scale mutagenesis data are incorporated. We precompute Envision predictions for every possible single amino acid variant in human, mouse, frog, zebrafish, fruit fly, worm, and yeast proteomes (https://envision.gs.washington.edu/).
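To make the learning approach named above concrete, the following is a minimal sketch of stochastic gradient boosting for regression. It is not Envision's implementation: the abstract does not describe the features, hyperparameters, or software used. The two toy features, the effect scores, and every function name here (fit_stump, fit_sgb, predict) are hypothetical and chosen only for illustration. Each round fits a one-split regression tree (a stump) to the residuals of a random subsample of the training data, which is what makes the boosting "stochastic".

```python
import random

def fit_stump(X, residuals):
    """Fit a one-split regression tree by exhaustive search over thresholds."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:
        # No usable split (all rows identical): fall back to a constant.
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]

def stump_predict(stump, row):
    j, t, left_mean, right_mean = stump
    return left_mean if row[j] <= t else right_mean

def fit_sgb(X, y, n_rounds=100, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss."""
    rng = random.Random(seed)
    base = sum(y) / len(y)            # initial prediction: the mean score
    preds = [base] * len(y)
    stumps = []
    k = max(2, int(subsample * len(y)))
    for _ in range(n_rounds):
        # Draw a random subsample without replacement for this round.
        idx = rng.sample(range(len(y)), k)
        Xs = [X[i] for i in idx]
        rs = [y[i] - preds[i] for i in idx]   # residuals = negative gradient
        stump = fit_stump(Xs, rs)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

# Toy data (invented): two numeric features per variant and an effect score.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(60)]
y = [1.0 if row[0] < 0.5 else 0.3 for row in X]

model = fit_sgb(X, y)
mean_y = sum(y) / len(y)
baseline_mse = sum((v - mean_y) ** 2 for v in y) / len(y)
model_mse = sum((v - predict(model, row)) ** 2 for row, v in zip(X, y)) / len(y)
print(f"baseline MSE: {baseline_mse:.4f}, boosted MSE: {model_mse:.4f}")
```

The subsample fraction and learning rate are the two knobs that distinguish this from plain gradient boosting; the values used here are arbitrary defaults for the toy data, not values reported for Envision.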

Keywords: large-scale mutagenesis; machine learning; variant effect prediction.

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods*
  • Databases, Genetic
  • Forecasting / methods
  • Genes, p53 / genetics
  • Humans
  • Machine Learning
  • Mutagenesis
  • Mutation, Missense*