Grouped variable importance with random forests and application to multiple functional data analysis
Introduction
In the high-dimensional setting, the identification of the most relevant variables has been the subject of much research over the last two decades (Guyon and Elisseeff, 2003). For linear regression, the lasso method (Tibshirani, 1996) is widely used. Many variable selection procedures have also been proposed for nonlinear methods. In the context of random forests (Breiman, 2001), it has been shown that the permutation importance measure is an efficient tool for selecting variables (Díaz-Uriarte and Alvarez de Andrés, 2006, Genuer et al., 2010, Gregorutti et al., 2014).
In many situations such as medical studies and genetics, groups of variables can be clearly identified and it is of interest to select groups of variables rather than to select them individually (He and Yu, 2010). Indeed, interpretation of the model may be improved along with the prediction accuracy by grouping the variables according to a priori knowledge about the data. Furthermore, grouping variables can be seen as a solution to stabilize variable selection methods. In the linear setting, and more particularly for linear regression, the group lasso has been developed to deal with groups of variables, see for instance Yuan and Lin (2006a). Group variable selection has also been proposed for kernel methods (Zhang et al., 2008) and neural networks (Chakraborty and Pal, 2008). As far as we know, this problem has not been studied for the random forest algorithm introduced by Breiman (2001). In this paper, we adapt the permutation importance measure for groups of variables in order to select groups of variables in the context of random forests.
The first contribution of this paper is a theoretical analysis of the grouped variable importance measure. Generally speaking, the grouped variable importance does not reduce to the sum of the individual importances and may even be quite unrelated to it. However, in more specific models such as additive regression ones, we derive exact decompositions of the grouped variable importance measure.
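For concreteness, the additive-model decomposition referred to above can be sketched as follows (notation is ours and simplified; the paper's theoretical analysis gives the precise statement and assumptions, in particular independence between covariates):

```latex
% Additive regression model with independent covariates:
%   Y = \sum_{j=1}^{p} f_j(X_j) + \varepsilon .
% The importance of a group of variables J \subseteq \{1, \dots, p\}
% then reduces to the sum of the individual importances:
\[
  I(X_J) \;=\; \sum_{j \in J} I(X_j)
         \;=\; 2 \sum_{j \in J} \operatorname{Var}\!\bigl(f_j(X_j)\bigr).
\]
```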
The second contribution of this work is an original method for selecting functional variables based on the grouped variable importance measure. Functional Data Analysis (FDA) is a field of statistics that analyzes data indexed by a continuum. In our case, we consider data providing information about curves varying over time (Ramsay and Silverman, 2005, Ferraty and Vieu, 2006, Ferraty, 2011). One standard approach in FDA consists in projecting the functional variables onto a finite dimensional space spanned by a functional basis. Classical bases in this context are splines, Fourier, wavelets or Karhunen–Loève expansions, for instance. Most of the papers about regression and classification methods for functional data consider only one functional predictor; references include Cardot et al. (1999), Cardot et al. (2003), Rossi et al. (2006) and Cai and Hall (2006) for linear regression methods, Amato et al. (2006) and Araki et al. (2009) for logistic regression methods, Górecki and Smaga (2015) for the ANOVA problem, Biau et al. (2005) and Fromont and Tuleau (2006) for k-NN algorithms, and Rossi and Villa (2006) and Rossi and Villa (2008) for SVM classification. The multiple FDA problem, where several functional variables are observed simultaneously, has been less studied. Recently, Matsui and Konishi (2011) and Fan and James (2013) have proposed solutions to the linear regression problem with lasso-like penalties. The logistic regression case has been studied by Matsui (2014). Classification based on several functional variables has also been considered using the CART algorithm (Poggi and Tuleau, 2006) and SVM (Yang et al., 2005, Yoon and Shahabi, 2006).
We propose a new approach for multiple FDA using random forests and the grouped variable importance measure. Indeed, various groups of basis coefficients can be proposed for a given functional decomposition. For instance, one can choose to regroup all coefficients of a given functional variable. In this case, the selection of a group of coefficients corresponds to the selection of a functional variable. Various other groupings are proposed for wavelet decompositions. For a given family of groups, we adapt the recursive feature elimination algorithm (Guyon et al., 2002) which is particularly efficient when predictors are strongly correlated (Gregorutti et al., 2014). In the context of random forests, this backward-like selection algorithm is guided by the grouped variable importance. Note that by regrouping the coefficients, the computational cost of the algorithm is drastically reduced compared to a backward strategy that would eliminate only one coefficient at each step.
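The backward elimination over groups can be sketched as follows (a minimal illustration with names of our own choosing; an ordinary least-squares predictor stands in for the random forest, and a single-permutation importance estimate guides the elimination):

```python
import numpy as np

def perm_importance(predict, X, y, groups, rng=np.random.default_rng(0)):
    """Increase in mean squared error when each group of columns is
    permuted jointly (one shared row permutation per group)."""
    base = np.mean((y - predict(X)) ** 2)
    scores = []
    for J in groups:
        Xp = X.copy()
        Xp[:, J] = X[rng.permutation(len(X))][:, J]
        scores.append(np.mean((y - predict(Xp)) ** 2) - base)
    return scores

def lstsq_fit(X, y):
    """Least-squares stand-in for a regression forest."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda Z: Z @ beta

def group_rfe(fit, X, y, groups, importance):
    """Recursive feature elimination over groups: refit, score every
    remaining group, drop the least important one, repeat.
    Returns the elimination order (least important group first)."""
    groups = [list(g) for g in groups]
    order = []
    while groups:
        cols = sorted(c for g in groups for c in g)
        remap = {c: i for i, c in enumerate(cols)}
        local = [[remap[c] for c in g] for g in groups]
        predict = fit(X[:, cols], y)
        scores = importance(predict, X[:, cols], y, local)
        order.append(groups.pop(int(np.argmin(scores))))
    return order

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3*X[:, 0] + 3*X[:, 1] + 1.5*X[:, 2] + 1.5*X[:, 3] + 0.1*rng.normal(size=400)
order = group_rfe(lstsq_fit, X, y, [[0, 1], [2, 3], [4, 5]], perm_importance)
print(order)  # the pure-noise group [4, 5] is eliminated first
```

Eliminating whole groups at each step is what makes the backward search tractable: the number of refits grows with the number of groups, not with the number of individual coefficients.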
An extensive simulation study illustrates the application of the grouped importance measure for FDA. The method is then applied to a real life problem coming from aviation safety. The aim of this study is to explain and predict landing distances. We select the most relevant flight parameters regarding the risk of long landings, which is a major issue for airlines.
The group permutation importance measure is introduced in Section 2. Section 3 deals with multiple FDA using random forests and the grouped variable importance measure. The application to flight data analysis is presented in Section 4. Note that additional experiments about the grouped variable importance are given in Appendix B. In order to speed up the algorithm, the dimension of the data can be reduced in a preprocessing step. In Appendix C, we propose a modified version of a well-known shrinkage method (Donoho and Johnstone, 1994) that simultaneously shrinks to zero the coefficients of the observed curves of a functional variable.
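The shrinkage step mentioned above builds on classical wavelet thresholding. As a hedged sketch, the per-coefficient soft-thresholding rule of Donoho and Johnstone (1994) with the universal threshold looks as follows (the modification proposed in Appendix C, which shrinks a coefficient jointly across all observed curves of a functional variable, is not reproduced here):

```python
import numpy as np

def soft_threshold(coefs, sigma):
    """Soft thresholding of a coefficient vector with the universal
    threshold sigma * sqrt(2 log n), where n is the number of
    coefficients (Donoho and Johnstone, 1994)."""
    n = coefs.size
    lam = sigma * np.sqrt(2.0 * np.log(n))
    # Shrink every coefficient towards zero by lam, clipping at zero.
    return np.sign(coefs) * np.maximum(np.abs(coefs) - lam, 0.0)

# Toy coefficient vector: two large coefficients, six near-noise ones.
c = np.array([5.0, -4.0, 0.3, -0.2, 0.1, 0.05, -0.1, 0.2])
print(soft_threshold(c, sigma=0.5))  # small coefficients are set to zero
```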
The grouped variable importance measure
Let Y be a random variable in ℝ and X = (X₁, …, X_p) a random vector in ℝ^p. We denote by f the regression function f(x) = E[Y | X = x]. Let V(·) and C(·, ·) denote the variance and the variance–covariance matrix, respectively.
The permutation importance introduced by Breiman (2001) measures the accuracy of each variable X_j for predicting Y. It is based on the elementary property that the quadratic risk E[(Y − f(X))²] is the minimum error for predicting Y knowing X. The formal definition of the variable importance of X_j is I(X_j) = E[(Y − f(X⁽ʲ⁾))²] − E[(Y − f(X))²], where X⁽ʲ⁾ = (X₁, …, X′_j, …, X_p) and X′_j is an independent replicate of X_j.
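In the same spirit, the grouped variant permutes all variables of a group jointly and measures the resulting increase in quadratic risk. A minimal numpy sketch (function and variable names are ours; a least-squares predictor stands in for the forest, and any fitted model exposing a prediction function, such as a random forest, could be plugged in instead):

```python
import numpy as np

def grouped_permutation_importance(predict, X, y, groups, n_repeats=20, rng=None):
    """Estimate the permutation importance of each group of columns.

    For a group J, the columns X[:, J] are permuted *jointly* (the same
    row permutation for every column in the group), and the average
    increase in mean squared error over the unpermuted risk is returned.
    """
    rng = np.random.default_rng(rng)
    base_mse = np.mean((y - predict(X)) ** 2)
    importances = []
    for J in groups:
        inc = 0.0
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, J] = X[perm][:, J]
            inc += np.mean((y - predict(Xp)) ** 2) - base_mse
        importances.append(inc / n_repeats)
    return np.array(importances)

# Toy demo: a linear least-squares predictor stands in for a forest.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = grouped_permutation_importance(lambda Z: Z @ beta, X, y,
                                     groups=[[0, 1], [2, 3]], rng=1)
print(imp)  # the informative group {X1, X2} dominates the noise group
```

In this additive toy model, the importance of the informative group is close to twice the variance of its contribution to y, in line with the decomposition discussed in the introduction.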
Multiple functional data analysis using grouped variable importance
In this section, we consider an application of grouped variable selection for multiple functional regression with scalar response Y. Each covariate X_j takes its values in the Hilbert space of square-integrable functions on [0, 1], equipped with the inner product ⟨f, g⟩ = ∫ f(t)g(t) dt. One common approach of functional data analysis is to project the variables onto a finite dimensional subspace of this space and to use the basis coefficients in a learning algorithm (Ramsay and Silverman, 2005).
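As an illustration of this projection step (a sketch under our own choices of basis size and names; spline or wavelet bases would be handled analogously), each observed curve can be reduced to a few Fourier coefficients, and the coefficients of one functional variable then form one candidate group:

```python
import numpy as np

def fourier_coefficients(curves, n_basis):
    """Project each observed curve (one per row, sampled on a regular
    grid of [0, 1]) onto the first n_basis Fourier basis functions;
    returns one coefficient vector per curve."""
    t = np.linspace(0.0, 1.0, curves.shape[1])
    basis = [np.ones_like(t)]
    for k in range(1, (n_basis + 1) // 2 + 1):
        basis.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        basis.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
    B = np.stack(basis[:n_basis], axis=1)             # (time, n_basis)
    coef, *_ = np.linalg.lstsq(B, curves.T, rcond=None)
    return coef.T                                      # (curves, n_basis)

# Two functional covariates observed on 50 units and 101 time points.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
X1 = np.sin(2 * np.pi * np.outer(rng.uniform(1, 2, 50), t))
X2 = rng.normal(size=(50, 101))
design = np.hstack([fourier_coefficients(X1, 5), fourier_coefficients(X2, 5)])
groups = [list(range(0, 5)), list(range(5, 10))]  # one group per variable
print(design.shape)  # (50, 10)
```

Selecting the group of columns 0–4 versus 5–9 in the resulting design matrix then amounts to selecting the first or the second functional variable.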
A case study: variable selection for aviation safety
In this section, we study a real problem coming from aviation safety. Airlines collect large amounts of information during flights using flight data recorders. For several years now, airlines have been required to use these data for flight safety purposes. A large number of flight parameters (up to 1000) are recorded each second, including the aircraft speed, accelerations, heading, position, and warning signals. Each flight thus provides a multivariate time series corresponding to this family of parameters.
Conclusion
We have considered the selection of grouped variables using random forests and proposed a new permutation-based importance measure for groups of variables. Our theoretical analysis provided exact decompositions of the grouped importance measure into a sum of the individual importances for specific models such as additive regression models. A simulation study highlighted the fact that, in general, the importance of a group does not reduce to the sum of the individual importances.
References (45)

- Amato, U., et al. (2006). Dimension reduction in functional regression with applications. Comput. Statist. Data Anal.
- Cardot, H., et al. (1999). Functional linear model. Statist. Probab. Lett.
- Genuer, R., et al. (2010). Variable selection using random forests. Pattern Recognit. Lett.
- He, Z., Yu, W. (2010). Stable feature selection for biomarker discovery. Comput. Biol. Chem.
- Ma, S., et al. (2007). Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics.
- Matsui, H. (2014). Variable and boundary selection for functional data via multiclass logistic regression modeling. Comput. Statist. Data Anal.
- Matsui, H., Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Comput. Statist. Data Anal.
- Rossi, F., Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing.
- Antoniadis, A., et al. (2001). Wavelet estimators in nonparametric regression: A comparative simulation study. J. Stat. Softw.
- Araki, Y., et al. (2009). Functional logistic discrimination via regularized basis expansions. Comm. Statist. Theory Methods.
- Biau, G., et al. (2005). Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory.
- Breiman, L. (1996). Bagging predictors. Mach. Learn.
- Breiman, L. (2001). Random forests. Mach. Learn.
- Breiman, L., et al. (1984). Classification and Regression Trees.
- Cai, T.T., Hall, P. (2006). Prediction in functional linear regression. Ann. Statist.
- Cardot, H., et al. (2003). Spline estimators for the functional linear model. Statist. Sinica.
- Chakraborty, D., Pal, N.R. (2008). Selecting useful groups of features in a connectionist framework. IEEE Trans. Neural Netw.
- Chatterjee, S., et al. (2012). Sparse group lasso: Consistency and climate applications.
- Díaz-Uriarte, R., Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics.
- Donoho, D.L., Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
- Donoho, D.L., et al. (1995). Wavelet shrinkage: asymptopia? J. R. Stat. Soc. Ser. B.