ABSTRACT
One of the difficulties encountered in the statistical analysis of metaproteomics data is the high proportion of missing values, which are usually handled by imputation. However, imputation methods rely on restrictive assumptions about the missingness mechanism, namely missing “at random” or “not at random”. To circumvent these limitations in the context of feature selection for a multi-class comparison, we propose a univariate selection method that combines a test of association between missingness and classes with a test for differences of observed intensities between classes. This approach implicitly handles both missingness mechanisms. We performed a quantitative and qualitative comparison of our procedure with imputation-based feature selection methods on two experimental data sets, as well as on simulated data covering various scenarios for the missingness mechanism and the nature of the difference of expression (differential intensity or differential missingness). Whereas we observed similar prediction performance on the experimental data sets, the feature rankings and selections produced by the various imputation-based methods diverged strongly. We showed that the combined test reaches a compromise, correlating reasonably well with the other methods, and remains efficient in all simulated scenarios, unlike imputation-based feature selection methods.
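The following is a minimal sketch of the idea described above, not the authors' exact implementation: for each feature, one component tests whether missingness is associated with the class labels (here a chi-squared test on presence/absence counts) and the other tests whether the observed intensities differ between classes (here a one-way ANOVA on non-missing values); the two p-values are then combined, here with Fisher's method, which is an illustrative assumption rather than the paper's combination rule.

```python
# Illustrative sketch of a "combined" univariate feature test:
# one component addresses differential presence (missingness vs. class),
# the other differential intensity (observed values vs. class).
import numpy as np
from scipy import stats

def combined_test(intensities, classes):
    """intensities: 1D array with np.nan for missing values; classes: 1D array of labels."""
    labels = np.unique(classes)
    missing = np.isnan(intensities)

    # Test 1: association between missingness and class (contingency-table chi-squared).
    if missing.sum() in (0, missing.size):
        p_miss = 1.0  # no contrast in missingness, nothing to test
    else:
        table = np.array([[np.sum(missing[classes == c]), np.sum(~missing[classes == c])]
                          for c in labels])
        _, p_miss, _, _ = stats.chi2_contingency(table, correction=False)

    # Test 2: difference of observed (non-missing) intensities between classes (one-way ANOVA).
    groups = [intensities[(classes == c) & ~missing] for c in labels]
    groups = [g for g in groups if g.size > 1]
    if len(groups) >= 2:
        _, p_int = stats.f_oneway(*groups)
    else:
        p_int = 1.0  # not enough observed values to test intensity differences

    # Combine the two p-values (Fisher's method; illustrative choice of combination rule).
    _, p_combined = stats.combine_pvalues([p_miss, p_int], method="fisher")
    return p_combined
```

In this sketch a feature can be selected either because its missingness pattern depends on the class or because its observed intensities differ, which is what allows both missingness mechanisms to be handled without imputation.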
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Shared co-first authorship.
(i) We modified the combined test to account for more complex designs via random effects. (ii) We compared our results to a similar model at the peptide level proposed by Goeminne et al. (2020). (iii) We performed a simulation study that highlights the performance of each FSM as a function of the missingness mechanism and of the nature of the biological difference (differential abundance or differential presence).
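As a rough illustration of the two missingness mechanisms contrasted in point (iii), the toy example below generates MAR missingness (independent of the underlying intensity) and MNAR missingness (low intensities more likely to be missing); the parameter values are arbitrary assumptions chosen for readability, not the settings used in the simulation study.

```python
# Toy illustration of MAR vs MNAR missingness on simulated log-intensities.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
log_intensity = rng.normal(loc=20.0, scale=2.0, size=n)

# MAR: missingness independent of the (unobserved) intensity value.
mar_mask = rng.random(n) < 0.3  # 30% missing, uniformly at random

# MNAR: missingness probability increases as intensity decreases
# (left-censoring-like behaviour typical of low-abundance features).
p_mnar = 1.0 / (1.0 + np.exp(1.5 * (log_intensity - 18.0)))
mnar_mask = rng.random(n) < p_mnar

mar_data = np.where(mar_mask, np.nan, log_intensity)
mnar_data = np.where(mnar_mask, np.nan, log_intensity)

print(f"MAR:  {np.isnan(mar_data).mean():.2f} missing")
print(f"MNAR: {np.isnan(mnar_data).mean():.2f} missing, "
      f"mean observed = {np.nanmean(mnar_data):.2f} vs true mean = {log_intensity.mean():.2f}")
```

Under MNAR the mean of the observed values is biased upward relative to the true mean, which is the kind of distortion that imputation methods must assume a mechanism to correct for.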
Acronyms
- FSM: Feature Selection Method
- FDR: False Discovery Rate
- kNN: k-Nearest Neighbors
- SVD: Singular Value Decomposition
- SVM: Support Vector Machine
- RF: Random Forest
- MAR: Missing At Random
- MNAR: Missing Not At Random