TY  - JOUR
T1  - Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data
JF  - bioRxiv
DO  - 10.1101/171967
SP  - 171967
AU  - Runmin Wei
AU  - Jingye Wang
AU  - Mingming Su
AU  - Erik Jia
AU  - Tianlu Chen
AU  - Yan Ni
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/08/17/171967.abstract
N2  - Introduction Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied to handle missing values, but the choice of method can significantly affect subsequent data analyses and interpretations. By definition, there are three types of missing values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Objectives The aim of this study was to comprehensively compare common imputation methods for different types of missing values using two separate metabolomics data sets (977 and 198 serum samples, respectively) and to propose a strategy for dealing with missing values in metabolomics studies. Methods Imputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate the imputation accuracy for MCAR/MAR and MNAR, respectively. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared error was used to evaluate the overall sample distribution. Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis. Results Our findings demonstrated that RF imputation performed best for MCAR/MAR and that QRILC was the favored method for MNAR. Conclusion In combination with the “modified 80% rule”, we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics data.
ER  - 