Abstract
Deep proteomics profiling using labeled LC-MS/MS experiments has proven powerful for studying complex diseases. However, due to the dynamic nature of discovery mass spectrometry, the generated data contain a substantial fraction of missing values. This poses great challenges for data analysis, as many tools, especially those for high-dimensional data, cannot handle missing values directly. To address this problem, the NCI-CPTAC Proteogenomics DREAM Challenge was carried out to develop effective imputation algorithms for labeled LC-MS/MS proteomics data through crowd learning. The resulting algorithm, DreamAI, is based on an ensemble of six different imputation methods. The imputation accuracy of DreamAI, as measured by Pearson correlation, is about 15%-50% greater than that of existing tools among less abundant proteins, which are more likely to be missing in proteomics data sets. This new tool notably enhances data analysis capabilities in proteomics research.
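The ensemble idea and the Pearson-correlation evaluation described above can be illustrated with a minimal sketch. It is not the DreamAI implementation: the two imputers (row mean and row median) are simple stand-ins for the six methods, the data are simulated, values are masked completely at random (real missingness is often not at random), and all function names are hypothetical.

```python
import numpy as np


def impute_row_mean(x):
    """Fill each missing entry with the mean of that protein's observed values."""
    filled = x.copy()
    row_means = np.nanmean(x, axis=1)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = row_means[rows]
    return filled


def impute_row_median(x):
    """Fill each missing entry with the median of that protein's observed values."""
    filled = x.copy()
    row_medians = np.nanmedian(x, axis=1)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = row_medians[rows]
    return filled


def ensemble_impute(x, imputers):
    """Average the outputs of several imputers; observed entries are kept as-is."""
    averaged = np.mean([impute(x) for impute in imputers], axis=0)
    return np.where(np.isnan(x), averaged, x)


def masked_pearson(truth, imputed, mask):
    """Pearson correlation between true and imputed values on the masked entries."""
    return np.corrcoef(truth[mask], imputed[mask])[0, 1]


# Simulated abundance matrix (assumption): 50 proteins x 20 samples,
# each protein with its own baseline level plus sample-level noise.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 2.0, size=50)
truth = baseline[:, None] + rng.normal(0.0, 0.5, size=(50, 20))

# Hide roughly 20% of the entries completely at random (a simplification).
mask = rng.random(truth.shape) < 0.2
observed = truth.copy()
observed[mask] = np.nan

imputed = ensemble_impute(observed, [impute_row_mean, impute_row_median])
cor = masked_pearson(truth, imputed, mask)
print(f"Pearson correlation on masked entries: {cor:.3f}")
```

Masking values that were actually observed and then scoring the imputed values against them is one common way to estimate imputation accuracy when the true missing values are unknown.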
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Co-first authors
We added a new section describing the data sets and the related bioinformatic and statistical methods used in the challenge and the follow-up study. We have improved the presentation of the manuscript and provided more technical detail in the new Methods section and supplements. Author affiliations and supplemental files were also updated.
Abbreviations
- iTRAQ: isobaric tags for absolute and relative quantification
- TMT: tandem mass tags
- PTM: post-translational modifications
- MNAR: missing not at random
- KNN: K nearest neighbors
- PCA: principal component analysis
- CPTAC: Clinical Proteomic Tumor Analysis Consortium
- TCGA: The Cancer Genome Atlas
- CCRCC: clear cell renal cell carcinoma
- Cor: Pearson Correlation Coefficient
- NRMSD: normalized root mean squared deviation
- PNNL: Pacific Northwest National Laboratory
- JHU: Johns Hopkins University
- NAT: normal adjacent tumor
- MCMC: Markov Chain Monte Carlo
- SVM: support vector machine
- ANN: artificial neural network