Grouped variable importance with random forests and application to multiple functional data analysis
Introduction
In the high-dimensional setting, the identification of the most relevant variables has been the subject of much research over the last two decades (Guyon and Elisseeff, 2003). For linear regression, the lasso method (Tibshirani, 1996) is widely used. Many variable selection procedures have also been proposed for nonlinear methods. In the context of random forests (Breiman, 2001), it has been shown that the permutation importance measure is an efficient tool for selecting variables (Díaz-Uriarte and Alvarez de Andrés, 2006, Genuer et al., 2010, Gregorutti et al., 2014).
In many situations such as medical studies and genetics, groups of variables can be clearly identified and it is of interest to select groups of variables rather than to select them individually (He and Yu, 2010). Indeed, interpretation of the model may be improved along with the prediction accuracy by grouping the variables according to a priori knowledge about the data. Furthermore, grouping variables can be seen as a solution to stabilize variable selection methods. In the linear setting, and more particularly for linear regression, the group lasso has been developed to deal with groups of variables, see for instance Yuan and Lin (2006a). Group variable selection has also been proposed for kernel methods (Zhang et al., 2008) and neural networks (Chakraborty and Pal, 2008). As far as we know, this problem has not been studied for the random forest algorithm introduced by Breiman (2001). In this paper, we adapt the permutation importance measure for groups of variables in order to select groups of variables in the context of random forests.
The first contribution of this paper is a theoretical analysis of the grouped variable importance measure. Generally speaking, the grouped variable importance does not reduce to the sum of the individual importances and may even be quite unrelated to it. However, in more specific models such as additive regression ones, we derive exact decompositions of the grouped variable importance measure.
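For concreteness, the additive-model decomposition referred to above can be sketched as follows (notation is ours and simplified; the paper's theoretical analysis gives the precise statement and assumptions, in particular independence between covariates):

```latex
% Additive regression model with independent covariates:
%   Y = \sum_{j=1}^{p} f_j(X_j) + \varepsilon .
% The importance of a group of variables J \subseteq \{1, \dots, p\}
% then reduces to the sum of the individual importances:
\[
  I(X_J) \;=\; \sum_{j \in J} I(X_j)
         \;=\; 2 \sum_{j \in J} \operatorname{Var}\!\bigl(f_j(X_j)\bigr).
\]
```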
The second contribution of this work is an original method for selecting functional variables based on the grouped variable importance measure. Functional Data Analysis (FDA) is a field of statistics that analyzes data indexed by a continuum. In our case, we consider data providing information about curves varying over time (Ramsay and Silverman, 2005, Ferraty and Vieu, 2006, Ferraty, 2011). One standard approach in FDA consists in projecting the functional variables onto a finite dimensional space spanned by a functional basis. Classical bases in this context are splines, Fourier, wavelets or Karhunen–Loève expansions, for instance. Most of the papers about regression and classification methods for functional data consider only one functional predictor; references include Cardot et al. (1999), Cardot et al. (2003), Rossi et al. (2006) and Cai and Hall (2006) for linear regression methods, Amato et al. (2006) and Araki et al. (2009) for logistic regression methods, Górecki and Smaga (2015) for the ANOVA problem, Biau et al. (2005) and Fromont and Tuleau (2006) for k-NN algorithms, and Rossi and Villa (2006) and Rossi and Villa (2008) for SVM classification. The multiple FDA problem, where several functional variables are observed simultaneously, has been less studied. Recently, Matsui and Konishi (2011) and Fan and James (2013) have proposed solutions to the linear regression problem with lasso-like penalties. The logistic regression case has been studied by Matsui (2014). Classification based on several functional variables has also been considered using the CART algorithm (Poggi and Tuleau, 2006) and SVM (Yang et al., 2005, Yoon and Shahabi, 2006).
We propose a new approach for multiple FDA using random forests and the grouped variable importance measure. Indeed, various groups of basis coefficients can be proposed for a given functional decomposition. For instance, one can choose to regroup all coefficients of a given functional variable. In this case, the selection of a group of coefficients corresponds to the selection of a functional variable. Various other groupings are proposed for wavelet decompositions. For a given family of groups, we adapt the recursive feature elimination algorithm (Guyon et al., 2002) which is particularly efficient when predictors are strongly correlated (Gregorutti et al., 2014). In the context of random forests, this backward-like selection algorithm is guided by the grouped variable importance. Note that by regrouping the coefficients, the computational cost of the algorithm is drastically reduced compared to a backward strategy that would eliminate only one coefficient at each step.
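The backward elimination over groups can be sketched as follows (a minimal illustration with names of our own choosing; an ordinary least-squares predictor stands in for the random forest, and a single-permutation importance estimate guides the elimination):

```python
import numpy as np

def perm_importance(predict, X, y, groups, rng=np.random.default_rng(0)):
    """Increase in mean squared error when each group of columns is
    permuted jointly (one shared row permutation per group)."""
    base = np.mean((y - predict(X)) ** 2)
    scores = []
    for J in groups:
        Xp = X.copy()
        Xp[:, J] = X[rng.permutation(len(X))][:, J]
        scores.append(np.mean((y - predict(Xp)) ** 2) - base)
    return scores

def lstsq_fit(X, y):
    """Least-squares stand-in for a regression forest."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda Z: Z @ beta

def group_rfe(fit, X, y, groups, importance):
    """Recursive feature elimination over groups: refit, score every
    remaining group, drop the least important one, repeat.
    Returns the elimination order (least important group first)."""
    groups = [list(g) for g in groups]
    order = []
    while groups:
        cols = sorted(c for g in groups for c in g)
        remap = {c: i for i, c in enumerate(cols)}
        local = [[remap[c] for c in g] for g in groups]
        predict = fit(X[:, cols], y)
        scores = importance(predict, X[:, cols], y, local)
        order.append(groups.pop(int(np.argmin(scores))))
    return order

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3*X[:, 0] + 3*X[:, 1] + 1.5*X[:, 2] + 1.5*X[:, 3] + 0.1*rng.normal(size=400)
order = group_rfe(lstsq_fit, X, y, [[0, 1], [2, 3], [4, 5]], perm_importance)
print(order)  # the pure-noise group [4, 5] is eliminated first
```

Eliminating whole groups at each step is what makes the backward search tractable: the number of refits grows with the number of groups, not with the number of individual coefficients.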
An extensive simulation study illustrates the application of the grouped importance measure for FDA. The method is then applied to a real life problem coming from aviation safety. The aim of this study is to explain and predict landing distances. We select the most relevant flight parameters regarding the risk of long landings, which is a major issue for airlines.
The group permutation importance measure is introduced in Section 2. Section 3 deals with multiple FDA using random forests and the grouped variable importance measure. The application to flight data analysis is presented in Section 4. Note that additional experiments about the grouped variable importance are given in Appendix B. In order to speed up the algorithm, the dimension of the data can be reduced in a preprocessing step. In Appendix C, we propose a modified version of a well-known shrinkage method (Donoho and Johnstone, 1994) that simultaneously shrinks to zero the coefficients of the observed curves of a functional variable.
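The shrinkage step mentioned above builds on classical wavelet thresholding. As a hedged sketch, the per-coefficient soft-thresholding rule of Donoho and Johnstone (1994) with the universal threshold looks as follows (the modification proposed in Appendix C, which shrinks a coefficient jointly across all observed curves of a functional variable, is not reproduced here):

```python
import numpy as np

def soft_threshold(coefs, sigma):
    """Soft thresholding of a coefficient vector with the universal
    threshold sigma * sqrt(2 log n), where n is the number of
    coefficients (Donoho and Johnstone, 1994)."""
    n = coefs.size
    lam = sigma * np.sqrt(2.0 * np.log(n))
    # Shrink every coefficient towards zero by lam, clipping at zero.
    return np.sign(coefs) * np.maximum(np.abs(coefs) - lam, 0.0)

# Toy coefficient vector: two large coefficients, six near-noise ones.
c = np.array([5.0, -4.0, 0.3, -0.2, 0.1, 0.05, -0.1, 0.2])
print(soft_threshold(c, sigma=0.5))  # small coefficients are set to zero
```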
The grouped variable importance measure
Let Y be a random variable in ℝ and X = (X₁, …, X_p) a random vector in ℝ^p. We denote by f the regression function f(x) = E[Y | X = x]. Let V(·) and C(·, ·) denote the variance and the variance–covariance matrix, respectively.
The permutation importance introduced by Breiman (2001) measures the accuracy of each variable X_j for predicting Y. It is based on the elementary property that the quadratic risk E[(Y − f(X))²] is the minimum error for predicting Y knowing X. The formal definition of the variable importance of X_j is I(X_j) = E[(Y − f(X⁽ʲ⁾))²] − E[(Y − f(X))²], where X⁽ʲ⁾ = (X₁, …, X′_j, …, X_p) and X′_j is an independent replicate of X_j.
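In the same spirit, the grouped variant permutes all variables of a group jointly and measures the resulting increase in quadratic risk. A minimal numpy sketch (function and variable names are ours; a least-squares predictor stands in for the forest, and any fitted model exposing a prediction function, such as a random forest, could be plugged in instead):

```python
import numpy as np

def grouped_permutation_importance(predict, X, y, groups, n_repeats=20, rng=None):
    """Estimate the permutation importance of each group of columns.

    For a group J, the columns X[:, J] are permuted *jointly* (the same
    row permutation for every column in the group), and the average
    increase in mean squared error over the unpermuted risk is returned.
    """
    rng = np.random.default_rng(rng)
    base_mse = np.mean((y - predict(X)) ** 2)
    importances = []
    for J in groups:
        inc = 0.0
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, J] = X[perm][:, J]
            inc += np.mean((y - predict(Xp)) ** 2) - base_mse
        importances.append(inc / n_repeats)
    return np.array(importances)

# Toy demo: a linear least-squares predictor stands in for a forest.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = grouped_permutation_importance(lambda Z: Z @ beta, X, y,
                                     groups=[[0, 1], [2, 3]], rng=1)
print(imp)  # the informative group {X1, X2} dominates the noise group
```

In this additive toy model, the importance of the informative group is close to twice the variance of its contribution to y, in line with the decomposition discussed in the introduction.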
Multiple functional data analysis using grouped variable importance
In this section, we consider an application of grouped variable selection for multiple functional regression with scalar response Y. Each covariate X_j takes its values in the Hilbert space of square-integrable functions on [0, 1], equipped with the inner product ⟨f, g⟩ = ∫ f(t)g(t) dt. One common approach of functional data analysis is to project the variables onto a finite dimensional subspace of this space and to use the basis coefficients in a learning algorithm (Ramsay and Silverman, 2005).
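As an illustration of this projection step (a sketch under our own choices of basis size and names; spline or wavelet bases would be handled analogously), each observed curve can be reduced to a few Fourier coefficients, and the coefficients of one functional variable then form one candidate group:

```python
import numpy as np

def fourier_coefficients(curves, n_basis):
    """Project each observed curve (one per row, sampled on a regular
    grid of [0, 1]) onto the first n_basis Fourier basis functions;
    returns one coefficient vector per curve."""
    t = np.linspace(0.0, 1.0, curves.shape[1])
    basis = [np.ones_like(t)]
    for k in range(1, (n_basis + 1) // 2 + 1):
        basis.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        basis.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
    B = np.stack(basis[:n_basis], axis=1)             # (time, n_basis)
    coef, *_ = np.linalg.lstsq(B, curves.T, rcond=None)
    return coef.T                                      # (curves, n_basis)

# Two functional covariates observed on 50 units and 101 time points.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
X1 = np.sin(2 * np.pi * np.outer(rng.uniform(1, 2, 50), t))
X2 = rng.normal(size=(50, 101))
design = np.hstack([fourier_coefficients(X1, 5), fourier_coefficients(X2, 5)])
groups = [list(range(0, 5)), list(range(5, 10))]  # one group per variable
print(design.shape)  # (50, 10)
```

Selecting the group of columns 0–4 versus 5–9 in the resulting design matrix then amounts to selecting the first or the second functional variable.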
A case study: variable selection for aviation safety
In this section, we study a real problem coming from aviation safety. Airlines collect large amounts of information during flights using flight data recorders. For several years now, airlines have been required to use these data for flight safety purposes. A large number of flight parameters (up to 1000) are recorded each second, including the aircraft speed, accelerations, heading, position, and warning signals. Each flight thus provides a multivariate time series corresponding to this family of parameters.
Conclusion
We have considered the selection of grouped variables using random forests and proposed a new permutation-based importance measure for groups of variables. Our theoretical analysis provided exact decompositions of the grouped importance measure into a sum of the individual importances for specific models such as additive regression models. A simulation study highlighted the fact that, in general, the importance of a group does not reduce to the sum of the individual importances.
References (45)

- Amato, U., et al. (2006). Dimension reduction in functional regression with applications. Comput. Statist. Data Anal.
- Cardot, H., et al. (1999). Functional linear model. Statist. Probab. Lett.
- Genuer, R., et al. (2010). Variable selection using random forests. Pattern Recognit. Lett.
- He, Z., Yu, W. (2010). Stable feature selection for biomarker discovery. Comput. Biol. Chem.
- Ma, S., et al. (2007). Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics.
- Matsui, H. (2014). Variable and boundary selection for functional data via multiclass logistic regression modeling. Comput. Statist. Data Anal.
- Matsui, H., Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Comput. Statist. Data Anal.
- Rossi, F., Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing.
- Antoniadis, A., et al. (2001). Wavelet estimators in nonparametric regression: A comparative simulation study. J. Stat. Softw.
- Araki, Y., et al. (2009). Functional logistic discrimination via regularized basis expansions. Comm. Statist. Theory Methods.
- Biau, G., et al. (2005). Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory.
- Breiman, L. (1996). Bagging predictors. Mach. Learn.
- Breiman, L. (2001). Random forests. Mach. Learn.
- Breiman, L., et al. (1984). Classification and Regression Trees.
- Cai, T.T., Hall, P. (2006). Prediction in functional linear regression. Ann. Statist.
- Cardot, H., et al. (2003). Spline estimators for the functional linear model. Statist. Sinica.
- Chakraborty, D., Pal, N.R. (2008). Selecting useful groups of features in a connectionist framework. IEEE Trans. Neural Netw.
- Chatterjee, S., et al. (2012). Sparse group lasso: Consistency and climate applications.
- Díaz-Uriarte, R., Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics.
- Donoho, D.L., Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
- Donoho, D.L., et al. (1995). Wavelet shrinkage: asymptopia? J. R. Stat. Soc. Ser. B.