Abstract
Motivation Missing values are often unavoidable in modern high-throughput measurements due to various experimental or analytical reasons. Imputation, the process of replacing missing values in a dataset with estimated values, plays an important role in multivariate and machine learning analyses. Three missingness patterns have been conceptualized: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each describes unique dependencies between the missing and observed data. As a result, the optimal imputation method for each dataset depends on the type of data, the cause of the missing data, and the nature of relationships between the missing and observed data. The challenge is to identify the optimal imputation solution for a given dataset.
Results Imputation for Lipidomics and Metabolomics (ImpLiMet) is a user-friendly UI platform that enables users to impute missing data using eight different methods. ImpLiMet can also suggest the optimal imputation solution for the user’s dataset through a grid-search-based investigation of the imputation error rate across three missingness data simulations. The effect of imputation can be visually assessed by principal component analysis (PCA), comparing the impact of removing all features and samples with missing data to that of the chosen imputation method.
Availability and implementation ImpLiMet is freely available at https://complimet.ca/shiny/implimet/ with software accessible at https://github.com/complimet/ImpLiMet.
1 Introduction
Missing data are a major problem for multivariate, machine learning (ML), and network analyses. For example, in large lipidomic or metabolomic datasets, measurements for some analytes may not be available in every sample due to routine technical variability, low abundance, ion suppression from co-eluting analytes, inaccurate feature assignment in annotation pipelines, or because analytes are simply not present in a subset of samples. This “missingness” confounds ML approaches, limits the number of methodologies that can be utilized, and reduces the statistical power of models that exclude samples with missing values. Sample exclusion further alters cohort representation, notably when “missingness” is an indicator of a particular subgroup, biasing results towards the groups in which all analytes are observed and potentially leading to inaccurate interpretations (Jäger et al. 2021; Stoyanovich et al. 2020; Sterne et al. 2009).
Imputing missing values is a valid solution to the problems of “missingness” in multivariate analyses and can reduce data bias resulting from sample exclusion. Three types of missingness have been conceptualized that can be addressed by imputation (Mack et al. 2018; Scheffer 2002):
Missing completely at random (MCAR) refers to values whose absence is completely independent of any other data feature or covariate. In this type of missingness, each sample has the same probability of presenting an MCAR value because there is no underlying difference between the samples with and without missing data (Mack et al. 2018; Rubin 1976). A real-world example of MCAR is transient (i.e., random) technological failure over the course of data collection, such that there is no relationship between the samples with missing and observed values.
Missing at random (MAR) refers to missing values whose absence is related to the values of other measured features but not to the measured values of the same feature (Schafer 1997). Here, missing values do not depend on the variable in question but on the values of the other analytes present in each sample. An example of MAR would be when the value for one analyte is missing because its measurement is obscured by the abundance of another analyte in the same sample (i.e., ion suppression of co-eluting analytes in the case of lipidomic or metabolomic datasets).
Missing not at random (MNAR) refers to missing values that are absent because of a feature, condition, or covariate that is responsible for the absence in that sample. Here, the probability of missingness depends on the sample itself. A biological example of this group would be analytes that are not synthesized, thus not present, in every condition. A technological example would be when analytes are present in a given sample but are below the limit of quantification of the technology used to measure the data.
Multiple imputation methods can approximate missing values (Bertsimas et al. 2018; Schafer 1997). Recently, both Jäger et al. and Chilimoniuk et al. have compared and evaluated different imputation approaches with respect to the quality of the imputed data and their downstream impact on ML pipelines (Chilimoniuk et al. 2024; Jäger et al. 2021). Jäger et al. (2021) present a method for testing imputation quality based on the error rate and the downstream use of the data, and show that, in almost all of their examples, Random Forest (RF) imputation provides the best result. However, both Jäger et al. and Chilimoniuk et al. show that the optimal imputation solution is dataset dependent, influenced by the type of missing data, the cause of the missing data, and the nature of relationships between the missing and observed data (Chilimoniuk et al. 2024; Jäger et al. 2021). With this knowledge, the bioinformatic challenge becomes identifying the optimal imputation solution for a given dataset.
To address this challenge, we present Imputation for Lipidomics and Metabolomics (ImpLiMet). ImpLiMet is an R package available at https://github.com/complimet/ImpLiMet and online with a UI at Computational Lipidomics and Metabolomics (CompLiMet): https://complimet.ca/shiny/implimet/. ImpLiMet enables users to impute missing data using eight different methods. The effect of each method can be visualized by principal component analysis (PCA) comparing the impact of removing all features and samples with missing data to that of the chosen imputation method. To identify the optimal imputation solution, ImpLiMet further offers an optimization option wherein the error of each imputation method is evaluated and the user is informed of the method with the lowest mean absolute percentage error (MAPE) across three “missingness” simulations for their dataset.
2 Implementation
ImpLiMet is written in R and deployed with an RShiny UI. It is compatible with all modern web browsers. Figure 1 presents the ImpLiMet workflow and pseudocode for the optimization procedure. To run ImpLiMet, the user uploads their dataset with missing values as a .CSV file. If the dataset includes features measured in different units or by different platforms (multiple feature measurement groups), the user has the option to format their data so that the imputation methods treat features measured using different analyzers or in different units separately. An example of different measurement groups would be the combination of lipidomic and metabolomic data measured on different platforms. The user can then specify the percentage(s) of missing values at which features or samples are removed prior to choosing an imputation method or optimizing across methods. Eight imputation methods are available: 1) replacing with the minimum by feature, 2) replacing with the minimum divided by 5 by feature, 3) replacing with the maximum by feature, 4) replacing with the median by feature, 5) replacing with the mean by feature, 6) using K-Nearest Neighbors (kNN) (Hastie et al. 1999; Troyanskaya et al. 2001), 7) using Random Forest (RF) (Pantanowitz and Marwala 2009), or 8) using Multivariate Imputation by Chained Equations (MICE) (van Buuren et al. 1999; van Buuren et al. 2006). For kNN, RF, and MICE, the user is prompted to provide the number of neighbours, the number of trees, and the number of iterations, respectively. kNN is implemented using the impute.knn function, RF imputation uses the missRanger function (Stekhoven and Buehlmann 2011; Wright et al. 2017), and MICE uses the mice function (van Buuren et al. 2011).
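For orientation, the following minimal R sketch shows how the three model-based methods can be called on a numeric samples-by-features matrix. The function names follow the packages cited above (impute, missRanger, mice); the file name, object names, and parameter values are illustrative rather than ImpLiMet’s exact internal calls.

```r
library(impute)      # Bioconductor package providing impute.knn()
library(missRanger)  # random forest imputation built on ranger
library(mice)        # multivariate imputation by chained equations

# 'dat' is a numeric matrix (samples x features) containing NAs; the file name is hypothetical
dat <- as.matrix(read.csv("example_with_missing.csv", row.names = 1))

# kNN: impute.knn() expects features in rows, so the matrix is transposed in and out
knn_imp <- t(impute.knn(t(dat), k = 10)$data)

# Random Forest: each feature with missing values is iteratively predicted from the others
rf_imp <- missRanger(as.data.frame(dat), num.trees = 500)

# MICE: chained-equation imputation; maxit controls the number of iterations
mice_fit <- mice(as.data.frame(dat), m = 1, maxit = 2, printFlag = FALSE)
mice_imp <- complete(mice_fit)
```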
If the dataset contains at least 3 features and 6 samples with no missing values, ImpLiMet further offers an optimization option wherein the error of each imputation method is evaluated by removing all missing data, simulating the three different sources of missingness in the resulting complete subset, and testing all available imputation methods. Optimization then suggests, as the best imputation method, the one with the lowest mean absolute percentage error (MAPE) across the three “missingness” data simulations. The selected approach is used to impute the original dataset, and this result is provided as a download.
In the optimization step, samples without any missing values are selected to create a complete set. If the cleaned dataset obtained by removing all samples (rows) with missing values has no remaining values, optimization instead selects features (columns) without missing values. If fewer than 6 samples and 3 features remain, optimization of imputation cannot be performed. From this complete subset, ImpLiMet removes data values at the threshold percentage initially provided by the user for filtering. If a threshold percentage is not provided, i.e., if the user opts not to remove any features or samples from their dataset prior to imputation, ImpLiMet uses 30% as the threshold percentage in the optimization process. The threshold percentage is used to simulate the optimal imputation method for a given dataset at the level of the user’s specified tolerance for imputation. For extremely small datasets (e.g., a 6×3 matrix), only a 10% threshold will enable full optimization, as all other thresholds result in an insufficient sample size for imputation method testing and error calculation. The known values removed for simulation are kept as the hold-out set and are used to evaluate the imputation error as follows:
Given a dataset $X = \{x_{ij}\}$, $i = 1, \ldots, N_s$, $j = 1, \ldots, N_f$, where $N_s$ is the number of samples and $N_f$ is the number of features, with missing elements $x_{km}$, $(k,m) \in M$, the goal of imputation is to determine values for the missing elements that resemble the complete data. As the first step in optimization, any row or column with missing elements is removed, leading to the complete subset $X' = \{x'_{ij}\}$, $i = 1, \ldots, N'_s$, $j = 1, \ldots, N'_f$. From this subset, data removal is performed separately to simulate the MCAR, MAR, and MNAR mechanisms. Pseudocode for each missingness mechanism is provided in Figure 1B.
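The selection of the complete subset and the default hold-out threshold described above can be sketched in R as follows; dat and user_threshold are illustrative names rather than ImpLiMet’s internal variables.

```r
# Sketch of the complete-subset selection and default threshold used in optimization
user_threshold <- NULL                                            # filtering percentage, or NULL if none was requested
complete_sub <- dat[rowSums(is.na(dat)) == 0, , drop = FALSE]     # samples with no missing values
if (nrow(complete_sub) == 0) {
  complete_sub <- dat[, colSums(is.na(dat)) == 0, drop = FALSE]   # fall back to complete features
}
stopifnot(nrow(complete_sub) >= 6, ncol(complete_sub) >= 3)       # minimum size for optimization
threshold <- if (is.null(user_threshold)) 0.30 else user_threshold / 100  # fraction of cells to hold out
```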
For MCAR, a filtering matrix of dimension $N'_s \times N'_f$ is created by random sampling from a uniform distribution (minimum = 0, maximum = 1) using the runif function in R. The random values in the matrix are ranked; values ranked below the imputation threshold are set to NA (missing) and those above are set to one (retained). The element-wise product of this filtering matrix and the complete data matrix provides the MCAR example set for further testing.
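A minimal sketch of this MCAR simulation, assuming the filtering matrix is applied as an element-wise mask (the function name and threshold default are illustrative):

```r
# Simulate MCAR by masking a random fraction of cells in the complete matrix X
simulate_mcar <- function(X, threshold = 0.30) {
  u <- matrix(runif(length(X)), nrow = nrow(X), ncol = ncol(X))  # uniform filtering matrix
  mask <- matrix(1, nrow = nrow(X), ncol = ncol(X))
  mask[rank(u) <= threshold * length(X)] <- NA   # lowest-ranked cells become missing
  X * mask                                       # element-wise product keeps the remaining values
}
```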
For MNAR, missing value assignment is performed individually for each feature as follows: 1) a list of values is generated by sampling from a logistic distribution (location = 0, scale = 1), denoted L1; 2) a second list is generated by sampling from the uniform distribution (minimum = 0, maximum = 1), denoted L2; 3) a third list, L3, is generated from the element-wise product of L1 and L2; 4) the ranks of the values in L1 and L3, as well as of the feature measurements, are computed; 5) the highest and lowest ranks from L3, with the number of missing values dependent on the assigned threshold, are determined and the corresponding (feature-wise) ranks in L1 are identified. The feature values with equivalent ranks in the dataset are removed as missing.
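The following R sketch gives one possible reading of steps 1 through 5 for a single feature; the exact rank-matching used by ImpLiMet may differ, and the function name and threshold default are illustrative. The net effect is that the extreme values of the feature tend to be removed, so missingness depends on the feature’s own values.

```r
# One possible reading of the MNAR steps for a single feature vector x
simulate_mnar_feature <- function(x, threshold = 0.30) {
  n  <- length(x)
  L1 <- rlogis(n, location = 0, scale = 1)   # step 1
  L2 <- runif(n, min = 0, max = 1)           # step 2
  L3 <- L1 * L2                              # step 3
  n_miss <- max(1, round(threshold * n))
  lo <- ceiling(n_miss / 2)
  hi <- n_miss - lo
  extreme <- c(head(order(L3), lo), if (hi > 0) tail(order(L3), hi))  # extreme ranks of L3 (step 5)
  target_ranks <- rank(L1, ties.method = "first")[extreme]            # mapped through the ranks of L1
  # feature values holding the matched ranks are removed, so missingness
  # tends to hit the feature's own extreme values (MNAR)
  x[rank(x, ties.method = "first") %in% target_ranks] <- NA
  x
}
```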
For MAR, a co-dependence group is created by summing all feature values in a sample except the value in the current cell. If the input file contains information about feature groups, based on biological or analytical characteristics, this summation is performed within each feature group for each sample to build the co-dependence matrix. The MAR process follows MNAR steps 1 through 3. In step 4, the ranks of the values in L1, L3, and of the sample values in the filtering matrix are computed. Missing indices are assigned to the highest and lowest ranks from L3, with the number of missing values dependent on the sample threshold. The order of the values in L1 that produces the missing indices in L3 is retrieved, and the corresponding positions in the filtering matrix column for the co-dependent feature are assigned as NA.
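A comparable hedged sketch for MAR is given below, in which missingness in a feature is driven by the summed values of the other features (the co-dependence values) rather than by the feature itself; again, the rank-matching is one plausible reading of the description, and the function name is illustrative.

```r
# One possible reading of the MAR simulation for feature column j of matrix X
simulate_mar_feature <- function(X, j, threshold = 0.30) {
  codep <- rowSums(X[, -j, drop = FALSE])           # co-dependence value per sample
  n  <- nrow(X)
  L1 <- rlogis(n); L2 <- runif(n); L3 <- L1 * L2    # MNAR steps 1-3
  n_miss <- max(1, round(threshold * n))
  lo <- ceiling(n_miss / 2)
  hi <- n_miss - lo
  extreme <- c(head(order(L3), lo), if (hi > 0) tail(order(L3), hi))
  target_ranks <- rank(L1, ties.method = "first")[extreme]
  # samples whose co-dependence ranks match are set missing in feature j,
  # so missingness depends on the other features, not on feature j itself (MAR)
  X[rank(codep, ties.method = "first") %in% target_ranks, j] <- NA
  X
}
```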
After generating the three types of missingness datasets, each dataset is imputed using each of the eight available methods. For the multivariate methods, users are prompted to select either a simple or a full version of parameter optimization. Simple parameter optimization uses the following defaults: K = 10 for kNN, 500 trees for RF, and two iterations for MICE. If the full parameter search is selected, the accuracy of the imputed values is tested over a range of hyperparameters for kNN, MICE, and RF. Specifically, for kNN, the K-values tested range from 10 to 100 in increments of 20. Around the optimal K-value in this range, a refined search is conducted from K-4 to K+4 in increments of one to identify the K-value with the lowest error rate. For RF, numbers of trees in the sequence 5, 10, 20, 50, 100, 150, 200, and 500 are examined to determine the optimal forest size. For MICE, one to three iterations are tested. Error rates are calculated as the mean absolute percentage error (MAPE), defined as $\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{x_i - y_i}{x_i}\right|$, where $N$ is the number of missing values, $x_i$ is the actual value, and $y_i$ is the prediction. The MAPE results for each of the eight imputation methods under each missingness mechanism are displayed, and the method with the lowest MAPE value across the missingness mechanisms is highlighted and used for imputation.
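A minimal sketch of the MAPE calculation and the coarse-to-fine kNN search described above, assuming X_true holds the complete subset, X_missing its simulated-missing copy, and miss_idx the indices of the held-out cells (all names are illustrative):

```r
# MAPE between held-out values and their imputed estimates
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))

# Coarse-to-fine search over K for kNN imputation: coarse grid 10..100 by 20,
# then a refined search from K-4 to K+4 around the best coarse value
search_knn_k <- function(X_missing, X_true, miss_idx) {
  score <- function(k) {
    imputed <- t(impute::impute.knn(t(X_missing), k = k)$data)
    mape(X_true[miss_idx], imputed[miss_idx])
  }
  coarse <- seq(10, 100, by = 20)
  k_best <- coarse[which.min(sapply(coarse, score))]
  fine   <- seq(max(1, k_best - 4), k_best + 4)
  fine[which.min(sapply(fine, score))]
}
```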
The effect of imputation on the dataset is visualized by PCA comparing the original dataset, following removal of all samples and features with missing data, to the imputed dataset. PCA is shown both for samples, calculating PCs using features as variables, and for features, using their values across samples as variables. The user-provided sample and feature names are shown in the plots for reference. An example of the optimization, together with a comparison of the imputation errors obtained with the recommended and other methods, is presented in the Supplementary Materials. Briefly, from a subset of the metabolomics data published by Li et al. (Li et al. 2019) with complete data for 50 samples and 50 features, we removed values from 120 cells and tested the error rate of the imputed values using different methods. The results show that the recommended method, in this case RF, provides imputation with the lowest error between the imputed and the original data and the best agreement with the original dataset in PCA (full details are provided in the Supplementary Materials).
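A minimal sketch of the PCA comparison using base R's prcomp, assuming dat is the uploaded matrix and imputed_dat the output of one of the imputation calls above (names are illustrative):

```r
# PCA of the complete-case dataset versus the imputed dataset (sample-level view)
complete_only <- dat[rowSums(is.na(dat)) == 0, colSums(is.na(dat)) == 0, drop = FALSE]
imputed_dat   <- as.matrix(rf_imp)   # e.g. the RF-imputed data from the earlier sketch

pca_complete <- prcomp(complete_only, center = TRUE, scale. = TRUE)
pca_imputed  <- prcomp(imputed_dat,  center = TRUE, scale. = TRUE)

par(mfrow = c(1, 2))
plot(pca_complete$x[, 1:2], main = "Complete samples and features only")
plot(pca_imputed$x[, 1:2],  main = "After imputation")
```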
3 Conclusion
ImpLiMet is a versatile, open-access, Web-based application designed to assist users in identifying the optimal imputation solution for a given dataset. The imputation recommendation presents the best method based on the lowest overall error rate, while also reporting the imputation error rates of all methods for each type of missingness. ImpLiMet currently includes eight previously published imputation methods. Future work will include the addition of other published and newly developed imputation methods as well as analysis of the type of missingness present in the dataset.
Funding
This work was supported in part by RGPIN-2019-06796 to SALB from the Natural Sciences and Engineering Research Council of Canada (NSERC) as well as operating grant AI-4D-102-3 to SALB and MCC from the NRC AI4D Program. HO received an NSERC CREATE Matrix Metabolomics Scholarship.
Conflict of Interest
None declared.
Footnotes
Contact: steffanyann.bennett{at}uottawa.ca or miroslava.cuperlovic-culf{at}nrc-cnrc.gc.ca.
Supplementary information: Supplementary data are available at Bioinformatics online.