Grouped variable importance with random forests and application to multiple functional data analysis

https://doi.org/10.1016/j.csda.2015.04.002

Abstract

The selection of grouped variables using the random forest algorithm is considered. First, a new importance measure adapted for groups of variables is proposed. Theoretical insights into this criterion are given for additive regression models. Second, an original method for selecting functional variables based on the grouped variable importance measure is developed. Using a wavelet basis, it is proposed to regroup all of the wavelet coefficients for a given functional variable and use a wrapper selection algorithm with these groups. Various other groupings which take advantage of the frequency and time localization of the wavelet basis are proposed. An extensive simulation study is performed to illustrate the use of the grouped importance measure in this context. The method is applied to a real-life problem from aviation safety.

Introduction

In the high dimensional setting, identification of the most relevant variables has been the subject of much research during the last two decades (Guyon and Elisseeff, 2003). For linear regression, the lasso method (Tibshirani, 1996) is widely used. Many variable selection procedures have also been proposed for nonlinear methods. In the context of random forests (Breiman, 2001), it has been shown that the permutation importance measure is an efficient tool for selecting variables (Díaz-Uriarte and Alvarez de Andrés, 2006, Genuer et al., 2010, Gregorutti et al., 2014).

In many situations such as medical studies and genetics, groups of variables can be clearly identified and it is of interest to select groups of variables rather than to select them individually (He and Yu, 2010). Indeed, interpretation of the model may be improved along with the prediction accuracy by grouping the variables according to a priori knowledge about the data. Furthermore, grouping variables can be seen as a way to stabilize variable selection methods. In the linear setting, and more particularly for linear regression, the group lasso has been developed to deal with groups of variables, see for instance Yuan and Lin (2006a). Group variable selection has also been proposed for kernel methods (Zhang et al., 2008) and neural networks (Chakraborty and Pal, 2008). As far as we know, this problem has not been studied for the random forest algorithm introduced by Breiman (2001). In this paper, we adapt the permutation importance measure for groups of variables in order to select groups of variables in the context of random forests.

The first contribution of this paper is a theoretical analysis of the grouped variable importance measure. Generally speaking, the grouped variable importance does not reduce to the sum of the individual importances and may even be quite unrelated to it. However, in more specific models such as additive regression ones, we derive exact decompositions of the grouped variable importance measure.

The second contribution of this work is an original method for selecting functional variables based on the grouped variable importance measure. Functional Data Analysis (FDA) is a field of statistics that analyzes data indexed by a continuum. In our case, we consider data providing information about curves varying over time (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Ferraty, 2011). One standard approach in FDA consists in projecting the functional variables onto a finite dimensional space spanned by a functional basis. Classical bases in this context are splines, Fourier, wavelets or Karhunen–Loève expansions, for instance. Most papers about regression and classification methods for functional data consider only one functional predictor; references include Cardot et al. (1999, 2003), Rossi et al. (2006) and Cai and Hall (2006) for linear regression methods, Amato et al. (2006) and Araki et al. (2009) for logistic regression methods, Górecki and Smaga (2015) for the ANOVA problem, Biau et al. (2005) and Fromont and Tuleau (2006) for k-NN algorithms, and Rossi and Villa (2006, 2008) for SVM classification. The multiple FDA problem, where p functional variables are observed, has been less studied. Recently, Matsui and Konishi (2011) and Fan and James (2013) have proposed solutions to the linear regression problem with lasso-like penalties. The logistic regression case has been studied by Matsui (2014). Classification based on several functional variables has also been considered using the CART algorithm (Poggi and Tuleau, 2006) and SVM (Yang et al., 2005; Yoon and Shahabi, 2006).

We propose a new approach for multiple FDA using random forests and the grouped variable importance measure. Indeed, various groups of basis coefficients can be proposed for a given functional decomposition. For instance, one can choose to regroup all coefficients of a given functional variable. In this case, the selection of a group of coefficients corresponds to the selection of a functional variable. Various other groupings are proposed for wavelet decompositions. For a given family of groups, we adapt the recursive feature elimination algorithm (Guyon et al., 2002) which is particularly efficient when predictors are strongly correlated (Gregorutti et al., 2014). In the context of random forests, this backward-like selection algorithm is guided by the grouped variable importance. Note that by regrouping the coefficients, the computational cost of the algorithm is drastically reduced compared to a backward strategy that would eliminate only one coefficient at each step.
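
To make this backward strategy concrete, here is a minimal sketch of an elimination loop over groups of columns, in the spirit of the recursive feature elimination described above. The function name `grouped_rfe`, the use of a hold-out split rather than out-of-bag samples, and the forest settings are illustrative assumptions, not the authors' implementation; any per-group importance function, such as the grouped permutation importance sketched later in Section 2, can be plugged in.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def grouped_rfe(X, y, groups, importance_fn, n_trees=500, seed=0):
    """Backward elimination over groups of columns of X.

    `groups` maps a group name to an array of column indices of X.
    `importance_fn(model, X_val, y_val, local_groups)` must return one
    importance value per group; the least important group is removed at
    each step.  Returns group names from last eliminated to first.
    """
    remaining = list(groups)
    eliminated = []
    while remaining:
        cols = np.concatenate([groups[g] for g in remaining])
        X_tr, X_val, y_tr, y_val = train_test_split(X[:, cols], y, random_state=seed)
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X_tr, y_tr)
        # express the groups relative to the columns kept at this step
        local_groups, start = [], 0
        for g in remaining:
            local_groups.append(np.arange(start, start + len(groups[g])))
            start += len(groups[g])
        importances = importance_fn(forest, X_val, y_val, local_groups)
        worst = int(np.argmin(importances))
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]
```

At each step the forest is refitted on the remaining groups and the least important group is discarded, so the number of refits equals the number of groups rather than the number of individual coefficients.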

An extensive simulation study illustrates the application of the grouped importance measure for FDA. The method is then applied to a real-life problem from aviation safety. The aim of this study is to explain and predict landing distances. We select the most relevant flight parameters with respect to the risk of long landings, which is a major issue for airlines.

The group permutation importance measure is introduced in Section 2. Section 3 deals with multiple FDA using random forests and the grouped variable importance measure. The application to flight data analysis is presented in Section 4. Note that additional experiments about the grouped variable importance are given in Appendix B. In order to speed up the algorithm, the dimension of the data can be reduced in a preprocessing step. In Appendix C, we propose a modified version of a well-known shrinkage method (Donoho and Johnstone, 1994) that simultaneously shrinks to zero the coefficients of the observed curves of a functional variable.
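
For background on the shrinkage step mentioned above, the sketch below implements the standard universal-threshold rule of Donoho and Johnstone (1994) for a single curve, using the PyWavelets library; the function name `visu_shrink` and the choice of the `db4` wavelet are illustrative assumptions, and the joint, across-curves modification proposed in Appendix C is not reproduced here.

```python
import numpy as np
import pywt


def visu_shrink(curve, wavelet="db4", level=None):
    """Standard Donoho-Johnstone shrinkage (universal threshold) for one curve."""
    coeffs = pywt.wavedec(curve, wavelet, level=level)
    # noise level estimated from the finest-scale detail coefficients (MAD estimator)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(curve)))  # universal threshold
    # soft-threshold the detail coefficients, keep the approximation coefficients
    shrunk = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)[: len(curve)]
```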

Section snippets

The grouped variable importance measure

Let $Y$ be a random variable in $\mathbb{R}$ and $X=(X_1,\dots,X_p)$ a random vector in $\mathbb{R}^p$. We denote by $f(x)=\mathbb{E}[Y\,|\,X=x]$ the regression function. Let $\mathrm{Var}(X)$ and $\mathrm{Cov}(X)$ denote the variance and variance–covariance matrices of $X$.

The permutation importance introduced by Breiman (2001) measures the accuracy of each variable $X_j$ for predicting $Y$. It is based on the elementary property that the quadratic risk $\mathbb{E}[(Y-f(X))^2]$ is the minimum error for predicting $Y$ knowing $X$. The formal definition of the variable importance
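
Although the formal definition is not reproduced in this snippet, the extension to groups can be sketched as follows: the importance of a group is the mean increase of the prediction error when all variables of the group are permuted jointly, with the same permutation, so that the dependence inside the group is preserved. A minimal illustration with a scikit-learn forest is given below; the function name `grouped_permutation_importance` and the use of a validation sample instead of out-of-bag observations are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np


def grouped_permutation_importance(model, X_val, y_val, groups, n_repeats=10, seed=0):
    """Mean increase of the validation MSE when a whole group is permuted at once.

    `groups` is a list of arrays of column indices.  The same row permutation
    is applied to every column of a group, which preserves the dependence
    structure inside the group.
    """
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y_val - model.predict(X_val)) ** 2)
    importances = []
    for cols in groups:
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            perm = rng.permutation(X_val.shape[0])
            X_perm[:, cols] = X_val[perm][:, cols]  # joint permutation of the group
            perm_mse = np.mean((y_val - model.predict(X_perm)) ** 2)
            increases.append(perm_mse - base_mse)
        importances.append(np.mean(increases))
    return np.array(importances)
```

Selecting groups then amounts to ranking or thresholding these values, exactly as with the individual permutation importance.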

Multiple functional data analysis using grouped variable importance

In this section, we consider an application of grouped variable selection for multiple functional regression with scalar response $Y$. Each covariate $X_1,\dots,X_p$ takes its values in the Hilbert space $L^2([0,1])$ equipped with the inner product $\langle f,g\rangle_{L^2}=\int_0^1 f(t)g(t)\,dt$, for $f,g\in L^2([0,1])$. One common approach in functional data analysis is to project the variables onto a finite dimensional subspace of $L^2([0,1])$ and to use the basis coefficients in a learning algorithm (Ramsay and Silverman, 2005). For
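
As an illustration of the grouping discussed above (all wavelet coefficients of a given functional variable form one group), the sketch below builds the coefficient design matrix with the PyWavelets library; the helper name `build_wavelet_groups`, the `db4` wavelet and the decomposition level are illustrative assumptions.

```python
import numpy as np
import pywt


def build_wavelet_groups(curves_by_variable, wavelet="db4", level=3):
    """Expand each functional variable on a wavelet basis and group its coefficients.

    `curves_by_variable` is a list of arrays, one per functional variable,
    each of shape (n_samples, n_timepoints).  Returns the matrix of
    coefficients and a dict mapping each variable to its column indices.
    """
    blocks, groups, start = [], {}, 0
    for j, curves in enumerate(curves_by_variable):
        # one row of concatenated wavelet coefficients per observed curve
        coefs = np.array(
            [np.concatenate(pywt.wavedec(c, wavelet, level=level)) for c in curves]
        )
        blocks.append(coefs)
        groups[j] = np.arange(start, start + coefs.shape[1])
        start += coefs.shape[1]
    return np.hstack(blocks), groups
```

Other groupings, for instance one group per resolution level within each variable, only change how the column indices are partitioned.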

A case study: variable selection for aviation safety

In this section, we study a real problem coming from aviation safety. Airlines collect large amounts of information during flights using flight data recorders. For several years now, airlines have been required to use these data for flight safety purposes. A large number of flight parameters (up to 1000) are recorded each second, including aircraft speed, accelerations, heading, position, and warning signals. Each flight provides a multivariate time series corresponding to this family of

Conclusion

We have considered the selection of grouped variables using random forests and proposed a new permutation-based importance measure for groups of variables. Our theoretical analysis provided exact decompositions of the grouped importance measure into a sum of the individual importances for specific models such as additive regression models. A simulation study highlighted the fact that in general the importance of a group does not reduce to the sum of the individual importances. Since the idea of

References (45)

  • G. Biau et al., Functional classification in Hilbert spaces, IEEE Trans. Inform. Theory (2005)
  • L. Breiman, Bagging predictors, Mach. Learn. (1996)
  • L. Breiman, Random forests, Mach. Learn. (2001)
  • L. Breiman et al., Classification and Regression Trees (1984)
  • T. Cai et al., Prediction in functional linear regression, Ann. Statist. (2006)
  • H. Cardot et al., Spline estimators for the functional linear model, Statist. Sinica (2003)
  • D. Chakraborty et al., Selecting useful groups of features in a connectionist framework, IEEE Trans. Neural Netw. (2008)
  • S. Chatterjee et al., Sparse group lasso: Consistency and climate applications
  • R. Díaz-Uriarte et al., Gene selection and classification of microarray data using random forest, BMC Bioinformatics (2006)
  • D.L. Donoho et al., Ideal spatial adaptation by wavelet shrinkage, Biometrika (1994)
  • D.L. Donoho et al., Wavelet shrinkage: asymptopia, J. R. Stat. Soc. Ser. B (1995)
  • Y. Fan, G. James, Functional additive regression (2013)