Sparse Additive Interaction Learning

Diseases are now thought to be the result of changes in entire biological networks whose states are affected by a complex interaction of genetic and environmental factors. In general, power to detect interactions is low, the number of possible interactions could be enormous and their effects may be non-linear. Existing approaches such as the lasso might keep an interaction but remove a main effect, which is problematic for interpretation. In this work, we introduce a sparse additive interaction learning model called sail for detecting non-linear interactions with a key environmental or exposure variable in high-dimensional settings. Our method can accommodate either the strong or weak heredity constraint. We develop a computationally efficient fitting algorithm with automatic tuning parameter selection, which scales to high-dimensional datasets. Through an extensive simulation study, we show that sail outperforms existing penalized regression methods in terms of prediction error, sensitivity and specificity when there are non-linear interactions with an exposure variable. We apply sail to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data to select non-linear interactions between clinical diagnosis and Aβ protein load in 96 brain regions on the mini-mental state examination score. Our algorithms are available in an R package (https://github.com/greenwoodlab).


Introduction
Computational approaches to variable selection have become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies, where the data has become massive, yet where it is believed that the number of truly important variables is small relative to the total number of variables. Although many approaches have been developed for main effects, there is an enduring interest in powerful methods for estimating interactions, since interactions may reflect important modulation of a genomic system by an external factor [1]. Accurately capturing interactions holds the potential to improve our understanding of biological phenomena and to improve prediction accuracy. For example, a model that considered interactions between brain imaging data and genetic features had better classification accuracy compared to a model that considered the main effects only [2].
Furthermore, the manifestations of disease are often considered to be the result of changes in entire biological networks whose states are affected by a complex interaction of genetic and environmental factors [3]. However, there is a general deficit of such replicated interactions in the literature [4]. Indeed, power to detect interactions is always lower than for main effects, and in high-dimensional settings (p >> n), this lack of power to detect interactions is exacerbated, since the number of possible interactions could be enormous and their effects may be non-linear. Hence, analytic methods that may improve power are essential.
Interactions may occur in numerous types and of varying complexities. In this paper, we consider one specific type of interaction model, in which one (exposure) variable E is involved in possibly non-linear interactions with a high-dimensional set of measures X, leading to effects on a response variable Y. We propose a multivariable penalization procedure for detecting non-linear interactions between X and E.

A Sparse additive interaction model
Let Y = (Y_1, ..., Y_n) ∈ R^n be a continuous outcome variable, X_E = (E_1, ..., E_n) ∈ R^n a binary or continuous environment vector, and X = (X_1, ..., X_p) ∈ R^{n×p} a matrix of predictors, possibly high-dimensional. Furthermore, let f_j : R → R be a smoothing method for variable X_j given by a projection onto a set of basis functions:

f_j(X_j) = \sum_{\ell=1}^{m_j} \psi_{j\ell}(X_j) \beta_{j\ell}.   (1)

Here, the {\psi_{j\ell}}_{\ell=1}^{m_j} are a family of basis functions in X_j [5]. Let Ψ_j be the n × m_j matrix of evaluations of the ψ_{jℓ}, and θ_j = (β_{j1}, ..., β_{jm_j}) ∈ R^{m_j} for j = 1, ..., p (θ_j is an m_j-dimensional column vector of basis coefficients for the jth main effect). In this article we consider an additive interaction regression model of the form

g(μ) = β_0 + \sum_{j=1}^{p} Ψ_j θ_j + β_E X_E + \sum_{j=1}^{p} (X_E ∘ Ψ_j) τ_j,   (2)

where g(·) is a known link function, μ = E[Y | Ψ, X_E], β_0 is the intercept, β_E is the coefficient for the environment variable, τ_j = (τ_{j1}, ..., τ_{jm_j}) ∈ R^{m_j} are the basis coefficients for the jth interaction term, and (X_E ∘ Ψ_j) is the n × m_j matrix formed by the component-wise multiplication of the column vector X_E by each column of Ψ_j. For a continuous response,
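To make the notation concrete, here is a small numerical sketch (not the sail implementation) of how a basis matrix Ψ_j and the interaction block (X_E ∘ Ψ_j) can be constructed. The truncated-power basis and the knot placement below are illustrative assumptions; the paper itself uses B-splines (e.g. the bs() function in R).

```python
import numpy as np

# Build a simple truncated-power spline basis Psi_j for one predictor,
# then the interaction block (X_E o Psi_j) by multiplying each basis
# column by the exposure E. Basis family and knots are illustrative.

def spline_basis(x, knots, degree=3):
    """Truncated-power basis: x, x^2, ..., x^degree, (x - k)_+^degree."""
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 1, n)          # one predictor X_j
e = rng.integers(0, 2, n)         # binary exposure X_E

psi = spline_basis(x, knots=[0.25, 0.5, 0.75])   # n x m_j basis matrix
interaction = e[:, None] * psi                   # (X_E o Psi_j), also n x m_j
```

Rows of the interaction block are identically zero for unexposed observations, which is what makes the interaction term pick up exposure-specific curvature.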

we use the squared-error loss to estimate the parameters:

L(Θ) = \frac{1}{2n} \left\| Y − β_0 1_n − β_E X_E − \sum_{j=1}^{p} Ψ_j θ_j − \sum_{j=1}^{p} (X_E ∘ Ψ_j) τ_j \right\|_2^2,   (3)

and for a binary response Y_i ∈ {−1, +1} we use the logistic loss:

L(Θ) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(−Y_i \left(β_0 + β_E E_i + \sum_{j=1}^{p} Ψ_j^{(i)} θ_j + \sum_{j=1}^{p} (E_i Ψ_j^{(i)}) τ_j\right)\right)\right),   (4)

where Θ := (β_0, β_E, θ_1, ..., θ_p, τ_1, ..., τ_p), D := (Y, Ψ, X_E) is the working data, and Ψ_j^{(i)} denotes the ith row of Ψ_j. Here we assume that p is large relative to n, and particularly that \sum_{j=1}^{p} m_j / n is large. Due to the large number of parameters to estimate with respect to the number of observations, one commonly-used approach is to shrink the regression coefficients by placing a constraint on the values of (β_E, θ_j, τ_j). Certain constraints have the added benefit of producing a sparse model in the sense that many of the coefficients will be set exactly to 0 [6]. Such a reduced predictor set can lead to a more interpretable model with smaller prediction variance, albeit at the cost of having biased parameter estimates [7]. In light of these goals, we consider the following objective function:

\arg\min_{Θ} L(Θ) + λ(1−α)\left(w_E |β_E| + \sum_{j=1}^{p} w_j \|θ_j\|_2\right) + λα \sum_{j=1}^{p} w_{jE} \|τ_j\|_2,   (5)

where λ > 0 and α ∈ (0, 1) are tuning parameters and w_E, w_j, w_{jE} are non-negative penalty factors. This penalty alone, however, does not guarantee that an interaction term is non-zero only when its corresponding main effects are also in the model [8]. This is known as the strong heredity principle [9]. Indeed, large main effects are more likely to lead to detectable interactions [10]. In the next section we discuss how a simple reparametrization of the model (5) can lead to this desirable property.
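The objective in (5) can be evaluated directly once the coefficient blocks are in hand. The sketch below is an illustration of the penalty structure, not the package's internal code; the exact mixing of λ and α follows our reading of (5).

```python
import numpy as np

# Evaluate: squared-error loss + lasso penalty on beta_E
# + group-lasso penalty on each theta_j + penalty on each tau_j block.

def sail_objective(y, yhat, beta_E, thetas, taus, lam, alpha,
                   w_E=1.0, w=None, w_int=None):
    p = len(thetas)
    w = np.ones(p) if w is None else w
    w_int = np.ones(p) if w_int is None else w_int
    loss = 0.5 * np.mean((y - yhat) ** 2)
    main = lam * (1 - alpha) * (w_E * abs(beta_E)
                                + sum(w[j] * np.linalg.norm(thetas[j])
                                      for j in range(p)))
    inter = lam * alpha * sum(w_int[j] * np.linalg.norm(taus[j])
                              for j in range(p))
    return loss + main + inter

# tiny check: zero loss, one main-effect block, no interaction
y = np.zeros(4); yhat = np.zeros(4)
val = sail_objective(y, yhat, beta_E=2.0,
                     thetas=[np.array([3.0, 4.0])],
                     taus=[np.array([0.0, 0.0])],
                     lam=1.0, alpha=0.5)
```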

Strong and weak heredity
The strong heredity principle states that an interaction term can only have a non-zero estimate if its corresponding main effects are estimated to be non-zero. The weak heredity principle allows for a non-zero interaction estimate as long as one of the corresponding main effects is estimated to be non-zero [9]. In the context of penalized regression methods, these principles can be formulated as structured sparsity [11] problems. Several authors have proposed to modify the type of penalty in order to achieve the heredity principle [12, 13]. We take an alternative approach. Following Choi et al. [14], we introduce a new set of parameters γ = (γ_1, ..., γ_p) ∈ R^p and reparametrize the coefficients for the interaction terms τ_j in (2) as a function of γ_j and the main effect parameters θ_j and β_E. This reparametrization for both strong and weak heredity is summarized in Table 1.
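The effect of the two reparametrizations can be checked numerically. The sketch below assumes the strong-heredity form τ_j = γ_j β_E θ_j and the weak-heredity form τ_j = γ_j(β_E · 1 + θ_j), which is our reading of Table 1; the numeric values are arbitrary.

```python
import numpy as np

# tau as a function of gamma_j and the main-effect parameters
def tau_strong(gamma_j, beta_E, theta_j):
    return gamma_j * beta_E * theta_j            # strong heredity

def tau_weak(gamma_j, beta_E, theta_j):
    return gamma_j * (beta_E + theta_j)          # weak heredity (assumed form)

theta = np.array([0.5, -1.2, 0.3])   # basis coefficients for one main effect

# Strong heredity: either main effect at zero kills the interaction.
assert np.all(tau_strong(0.8, 0.0, theta) == 0)        # beta_E = 0
assert np.all(tau_strong(0.8, 2.0, np.zeros(3)) == 0)  # theta_j = 0

# Weak heredity: one non-zero main effect suffices.
assert np.any(tau_weak(0.8, 2.0, np.zeros(3)) != 0)    # only beta_E non-zero
assert np.any(tau_weak(0.8, 0.0, theta) != 0)          # only theta_j non-zero
```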

Instead of penalizing τ_j as in (5), we penalize γ_j, leading to the following objective function:

\arg\min_{Θ} L(Θ) + λ(1−α)\left(w_E |β_E| + \sum_{j=1}^{p} w_j \|θ_j\|_2\right) + λα \sum_{j=1}^{p} w_{jE} |γ_j|.   (6)

Toy example
We present here a toy example to illustrate our method. With a sample size of n = 100, we sample p = 20 covariates X_1, ..., X_p independently from a N(0, 1) distribution truncated to the interval [0, 1]. Data were generated from a model that follows the strong heredity principle, but where only one covariate, X_2, is involved in an interaction with a binary exposure variable E:

Y = β_E E + f_1(X_1) + f_2(X_2) + E · f_2(X_2) + ε.

For illustration, the function f_1(·) is assumed to be linear, whereas f_2(·) is non-linear. The error term ε is generated from a normal distribution with variance chosen such that the signal-to-noise ratio (SNR) is 2. We generated a single simulated dataset and used the strong heredity sail method with cubic B-splines to estimate the functional forms. 10-fold cross-validation (CV) was used to choose the optimal value of penalization. We used α = 0.5 and default values for all other arguments. We plot the solution path for both main effects and interactions in Figure 1, coloring the lines that correspond to the selected model. We see that our method is able to correctly identify the truly associated variables. In Figure 2, we plot the true component functions f_1(X_1) and E · f_2(X_2) together with their estimates from this analysis with sail; we are able to capture the correct shape of the functional forms.
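A data-generating mechanism in the spirit of this toy example can be sketched as follows. The specific functional forms f_1 and f_2 below are illustrative choices, not necessarily the ones used in the paper, and the uniform covariates stand in for the truncated normal draws.

```python
import numpy as np

# Toy generator: p = 20 covariates, only X2 interacts with a binary
# exposure E, strong heredity holds, and SNR = var(signal)/var(noise) = 2.

rng = np.random.default_rng(2020)
n, p = 100, 20
X = rng.uniform(0, 1, size=(n, p))     # stand-in for truncated N(0,1) on [0,1]
E = rng.integers(0, 2, n)

f1 = lambda t: 3.0 * t                 # linear main effect of X1 (assumed)
f2 = lambda t: np.sin(2 * np.pi * t)   # non-linear effect of X2 (assumed)

signal = 2.0 * E + f1(X[:, 0]) + f2(X[:, 1]) + E * f2(X[:, 1])
sigma = np.sqrt(np.var(signal) / 2.0)  # noise scale targeting SNR = 2
y = signal + rng.normal(0, sigma, n)
```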

Related Work
Methods for variable selection of interactions can be broken down into two categories: those targeting linear interaction effects and those targeting non-linear interaction effects. Many of the linear-effect methods consider all pairwise interactions in X [13,14,16,17], which can be computationally prohibitive when p is large.
This computational limitation is reflected in the relatively small number of variables used in the simulations and real-data analyses of [13,14,16,17]. More recent proposals

for selection of interactions allow the user to restrict the search space to interaction candidates [18,19]. This is useful when the researcher wants to impose prior information on the model. Two-stage procedures, where interaction candidates are chosen from an initial screen of main effects, have shown good performance when p is large [20,21] in the linear setting. Far fewer methods are available for estimating non-linear interactions. For example, Radchenko and James (2010) [12] proposed a model of the form

Y = β_0 + \sum_{j=1}^{p} f_j(X_j) + \sum_{j<k} f_{jk}(X_j, X_k) + ε,

where the f(·) are smooth component functions. This method is more computationally expensive than sail since it considers all pairwise interactions between the basis functions, and its effectiveness in simulations or real-data applications is unknown as there is no software implementation.
While working on this paper, we became aware of the recently proposed pliable lasso [22], which considers interactions between X_{n×p} and another matrix Z_{n×K} and takes the form

\hatμ = β_0 1 + Z θ_0 + \sum_{j=1}^{p} X_j (β_j + Z α_j),   (8)

where α_j is a K-dimensional vector. Our proposal is most closely related to this method with Z being a single-column matrix; the key difference is that our predictor variables may have non-linear effects. As pointed out by the authors of the pliable lasso, either their method or ours can be seen as a varying-coefficient model, i.e., the effect of X varies as a function of the exposure variable E or Z in equation (8).
The main contributions of this paper are fourfold. First, we develop a model for non-linear interactions with a key exposure variable, following either the weak or strong heredity principle, that is computationally efficient and scales to the high-dimensional setting (n << p).
Second, through simulation studies, we show improved performance over existing methods that only consider linear interactions or additive main effects. Third, we show that our method possesses the oracle property [23], i.e., it performs as well as if the true model were known in advance. Fourth, all of our algorithms are implemented in the sail R package hosted on GitHub with extensive documentation (http://sahirbhatnagar.com/sail/). In particular, our implementation also allows for linear interaction models, user-defined basis expansions, a cross-validation procedure for selecting the optimal tuning parameter, and differential shrinkage parameters to apply the adaptive lasso [24] idea.
The rest of the paper is organized as follows. Section 2 describes our optimization procedure and some details about the algorithm used to fit the sail model for the least squares case. In Section 3, through simulation studies we compare the performance of our proposed approach and demonstrate the scenarios where it can be advantageous to use sail over existing methods. Section 4 contains some real data examples and Section 5 discusses some limitations and future directions.

Algorithm and Computational Details
In this section we describe a blockwise coordinate descent algorithm for fitting the least-squares version of the sail model in (6). We fix the value for α and minimize the objective function over a decreasing sequence of λ values (λ_max > · · · > λ_min). We use the subgradient equations to determine the maximal value λ_max such that all estimates are zero. Due to the heredity principle, this reduces to finding the largest λ such that all main effects (β_E, θ_1, ..., θ_p) are zero. Following Friedman et al. [25], we construct a λ-sequence of 100 values decreasing from λ_max to 0.001 λ_max on the log scale, and use the warm start strategy where the solution for λ_ℓ is used as a starting value for λ_{ℓ+1}.
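The λ path and warm-start loop described above can be sketched as follows; fit_one_lambda() is a placeholder for the blockwise coordinate descent inner loop, not the actual sail solver.

```python
import numpy as np

# 100 lambda values decreasing from lambda_max to 0.001 * lambda_max,
# equally spaced on the log scale, traversed with warm starts.

def lambda_sequence(lambda_max, n_lambda=100, ratio=0.001):
    return np.exp(np.linspace(np.log(lambda_max),
                              np.log(ratio * lambda_max), n_lambda))

lams = lambda_sequence(5.0)

def fit_one_lambda(lam, start):
    # a real solver would iterate the gamma / theta / beta_E updates here
    return start

coef = np.zeros(10)  # at lambda_max every penalized coefficient is zero
for lam in lams:     # warm start: the solution at lambda_l seeds lambda_{l+1}
    coef = fit_one_lambda(lam, start=coef)
```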

Blockwise coordinate descent for least-squares loss
The strong heredity sail model with least-squares loss has the form

Y = β_0 1_n + β_E X_E + \sum_{j=1}^{p} Ψ_j θ_j + \sum_{j=1}^{p} γ_j β_E (X_E ∘ Ψ_j) θ_j + ε,   (9)

and the objective function is given by

\arg\min_{Θ} \frac{1}{2n} \left\| Y − β_0 1_n − β_E X_E − \sum_{j=1}^{p} Ψ_j θ_j − \sum_{j=1}^{p} γ_j β_E (X_E ∘ Ψ_j) θ_j \right\|_2^2 + λ(1−α)\left(w_E |β_E| + \sum_{j=1}^{p} w_j \|θ_j\|_2\right) + λα \sum_{j=1}^{p} w_{jE} |γ_j|.   (10)

Solving (10) in a blockwise manner allows us to leverage computationally fast algorithms for ℓ1- and ℓ2-norm penalized regression. We show in Supplemental Section A that, by careful construction of pseudo responses and pseudo design matrices, existing efficient algorithms can be used to estimate the parameters. Indeed, the objective function simplifies to a modified lasso problem when holding all θ_j fixed, and a modified group lasso problem when holding β_E and all γ_j fixed.
Denote the n-dimensional residual column vector R = Y − Ŷ. The subgradient equations (11)–(14) are obtained by setting the corresponding blocks of the gradient of (10) to zero, where s_1 is in the subgradient of the ℓ1 norm,

s_1 ∈ ∂|β_E|: s_1 = sign(β_E) if β_E ≠ 0, and s_1 ∈ [−1, 1] otherwise,

s_2 is in the subgradient of the ℓ2 norm,

s_2 ∈ ∂\|θ_j\|_2: s_2 = θ_j / \|θ_j\|_2 if θ_j ≠ 0, and \|s_2\|_2 ≤ 1 otherwise,

and s_3 is in the subgradient of the ℓ1 norm,

s_3 ∈ ∂|γ_j|: s_3 = sign(γ_j) if γ_j ≠ 0, and s_3 ∈ [−1, 1] otherwise.

Define the partial residual without the jth predictor, for j = 1, ..., p, as the residual obtained when the contribution of Ψ_j is removed from the fit, and the partial residual without the jth interaction, for j = 1, ..., p, analogously. From the subgradient equations (11)–(14) we obtain the coordinate updates (15)–(18), where S(x, t) = sign(x)(|x| − t)_+ is the soft-thresholding operator. We see from (15) and (16) that there are closed-form solutions for the intercept and β_E. From (18), each γ_j also has a closed-form solution and can be solved efficiently for j = 1, ..., p using a coordinate descent procedure [25]. Since there is no closed-form solution for θ_j, we use a quadratic majorization technique [26] to solve (17). Furthermore, we update each θ_j in a coordinate-wise fashion and leverage this to implement further computational speedups, which are detailed in Supplemental Section A.2. From these estimates, we compute the interaction effects using the reparametrizations presented in Table 1, e.g., τ̂_j = γ̂_j β̂_E θ̂_j, j = 1, ..., p, for the strong heredity sail model. We provide an overview of the computations in Algorithm 1. A more detailed version of this algorithm is given in Section A.1 of the Appendix.
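The two thresholding operators at the heart of these updates can be sketched directly; this is a minimal illustration of the operators themselves, not of the full blockwise solver.

```python
import numpy as np

# Scalar soft-thresholding S(x, t) = sign(x) * max(|x| - t, 0), used in the
# closed-form updates, and its group (l2) analogue used for vector blocks.

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def group_soft_threshold(v, t):
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)       # whole block set exactly to zero
    return (1.0 - t / norm) * v       # block shrunk toward zero

assert soft_threshold(3.0, 1.0) == 2.0
assert soft_threshold(-0.5, 1.0) == 0.0
```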
Algorithm 1 Blockwise Coordinate Descent for Least-Squares sail with Strong Heredity. For a decreasing sequence λ = λ_max, ..., λ_min and fixed α:
1. Initialize β_0, β_E, θ_j and γ_j for j = 1, ..., p, and set the iteration counter k ← 0.
2. Repeat the following until convergence:
(a) Update γ = (γ_1, ..., γ_p):
 i. Compute the pseudo design matrix for j = 1, ..., p.
 ii. Compute the pseudo response Ỹ by removing from Y the contribution of every term not involving γ.
 iii. Solve the resulting lasso problem for γ.
(b) Update θ_j for j = 1, ..., p:
 i. Compute the pseudo design matrix.
 ii. Compute the pseudo response Ỹ by removing from Y the contribution of every term not involving θ_j.
 iii. Solve the resulting group lasso problem for θ_j.
(c) Update β_E and β_0 using their closed-form solutions, and set k ← k + 1.

Maximum penalty parameter (Lambda max)
The subgradient equations (12)–(14) can be used to determine the largest value of λ such that all coefficients are 0. From subgradient equation (12), we see that β_E = 0 is a solution if

\frac{1}{n} |⟨X_E, R⟩| ≤ λ(1−α) w_E.

From subgradient equation (13), we see that θ_j = 0 is a solution if

\frac{1}{n} \|Ψ_j^T R\|_2 ≤ λ(1−α) w_j,

and from subgradient equation (14), γ_j = 0 is a solution if

\frac{1}{n} |⟨(X_E ∘ Ψ_j) θ_j, R⟩| ≤ λα w_{jE}.

Due to the strong heredity property, the parameter vector (β_E, θ_1, ..., θ_p, γ_1, ..., γ_p) will be entirely equal to 0 if (β_E, θ_1, ..., θ_p) = 0. Therefore, the smallest value of λ for which the entire parameter vector (excluding the intercept) is 0 is

λ_max = \frac{1}{n(1−α)} \max\left( \frac{|⟨X_E, R_0⟩|}{w_E}, \; \max_{j} \frac{\|Ψ_j^T R_0\|_2}{w_j} \right),

where R_0 = Y − Ȳ is the residual after fitting the intercept alone.
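Numerically, λ_max is the smallest λ that zeroes every penalized coefficient. The sketch below computes it from simulated data; the 1/(n(1−α)) scaling is our reading of the subgradient conditions and should be treated as an assumption.

```python
import numpy as np

# lambda_max for the strong heredity model: only the main-effect conditions
# matter, since gamma_j = 0 follows from theta_j = 0 by heredity.

rng = np.random.default_rng(7)
n, p, m = 80, 30, 4
alpha, w_E = 0.5, 1.0
w = np.ones(p)

E = rng.normal(size=n)
Psi = [rng.normal(size=(n, m)) for _ in range(p)]
r = rng.normal(size=n)
r = r - r.mean()                  # residual after fitting the intercept only

lam_E = np.abs(E @ r) / (n * (1 - alpha) * w_E)
lam_theta = max(np.linalg.norm(Psi[j].T @ r) / (n * (1 - alpha) * w[j])
                for j in range(p))
lambda_max = max(lam_E, lam_theta)
```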

Weak Heredity
Our method can easily be adapted to enforce the weak heredity property: an interaction term can be non-zero as long as at least one of its corresponding main effects is non-zero. To do so, we reparametrize the coefficients for the interaction terms in (2) as τ_j = γ_j (β_E 1_{m_j} + θ_j). We defer the algorithm details for fitting the sail model with weak heredity to Section A.3 of the Appendix, as it is very similar to Algorithm 1 for the strong heredity sail model.

Adaptive sail
The weights for the environment variable, main effects and interactions are given by w_E, w_j and w_{jE}, respectively. These weights allow a different penalty to be applied to each variable; in particular, any variable with a weight of zero is not penalized at all. This feature is useful for two main reasons:
1. Prior knowledge about the importance of certain variables can be encoded: larger weights penalize a variable more, while smaller weights penalize it less.
2. It allows users to apply the adaptive sail, similar to the adaptive lasso [24].
We describe the adaptive sail in Algorithm 2. This is a general procedure that can be applied to the weak and strong heredity settings, as well as to both the least-squares and logistic loss functions. We provide this capability in the sail package through the penalty.factor argument and provide an example in Section C.6 of the Appendix.
Algorithm 2 Adaptive sail algorithm.
1. For a decreasing sequence λ = λ_max, ..., λ_min and fixed α, run the sail algorithm.
2. Use cross-validation or a data-splitting procedure to determine the optimal value λ^[opt] of the tuning parameter.
3. Let β_E^[opt], θ_j^[opt] and τ_j^[opt] for j = 1, ..., p be the coefficient estimates corresponding to the model at λ^[opt].
4. Set the weights to the inverses of the magnitudes of these estimates, e.g., w_E = 1/|β_E^[opt]|, w_j = 1/‖θ_j^[opt]‖_2 and w_{jE} = 1/‖τ_j^[opt]‖_2.
5. Run the sail algorithm with the weights defined in step 4, and use cross-validation or a data-splitting procedure to choose the optimal value of λ.
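The weight construction in the adaptive step can be sketched as follows, using the usual adaptive-lasso convention of inverse-magnitude weights; the exact weight formula here is our assumption, and the small eps guard against division by zero is an implementation detail.

```python
import numpy as np

# Adaptive weights: coefficients estimated near zero in the first pass
# receive very large weights, i.e., heavy shrinkage in the second pass.

def adaptive_weights(beta_E_opt, theta_opt, tau_opt, eps=1e-8):
    w_E = 1.0 / (abs(beta_E_opt) + eps)
    w_j = np.array([1.0 / (np.linalg.norm(th) + eps) for th in theta_opt])
    w_jE = np.array([1.0 / (np.linalg.norm(ta) + eps) for ta in tau_opt])
    return w_E, w_j, w_jE

theta_hat = [np.array([2.0, 0.0]), np.zeros(2)]   # second main effect is null
tau_hat = [np.array([1.0, 1.0]), np.zeros(2)]
w_E, w_j, w_jE = adaptive_weights(1.5, theta_hat, tau_hat)

# the null second block gets an enormous weight in both penalty groups
assert w_j[1] > 1e6 and w_jE[1] > 1e6
```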

Flexible design matrix
The definition of the basis expansion functions in (1) is very flexible, in the sense that our algorithms are independent of this choice. As a result, the user can apply any basis expansion they desire. In the extreme case, one could apply the identity map, i.e., f j (X j ) = X j which leads to a linear interaction model (referred to as linear sail). When little information is known a priori about the relationship between the predictors and the response, by default, we choose to apply the same basis expansion to all columns of X . This is a reasonable approach when all the variables are continuous. However, there are often situations when the data contains a combination of categorical and continuous variables. In these cases it may be sub-optimal to apply a basis expansion to the categorical variables. Owing to the flexible nature of our algorithm, we can handle this scenario in our implementation by allowing a user-defined design matrix. The only extra information needed is the group membership of each column in the design matrix. We provide such an example in the sail package showcase in Section C.7 of the Appendix.
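Section C.7 demonstrates this in R via model.matrix; the sketch below mirrors the same idea in Python, expanding only the continuous variable and recording the group membership the group lasso penalty needs. The column layout and the polynomial expansion are illustrative assumptions.

```python
import numpy as np

# Mixed design: a continuous variable gets a basis expansion, a binary
# categorical variable is kept as a single column, and a group vector
# records which columns belong together.

rng = np.random.default_rng(3)
n = 40
age = rng.uniform(20, 80, n)                 # continuous: expanded
sex = rng.integers(0, 2, n).astype(float)    # categorical: single column

def expand(x, degree=3):                     # simple polynomial expansion
    return np.column_stack([x ** d for d in range(1, degree + 1)])

design = np.column_stack([expand(age), sex[:, None]])
group = np.array([1, 1, 1, 2])   # first 3 columns form group 1; sex is group 2
```

The group vector is what tells a group-lasso solver to keep or drop the three expansion columns of age as a unit.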

Simulation Study
In this section, we use simulated data to understand the performance of sail in different scenarios.

Comparator Methods
Since no other packages directly address our chosen problem, we selected comparator methods based on the following criteria: 1) penalized regression methods that can handle high-dimensional data (n < p), 2) allow at least one of linear effects, non-linear effects or interaction effects, and 3) have a software implementation in R. The selected methods can be grouped into three categories:
1. Linear main effects: lasso [27], adaptive lasso [24]
2. Linear interactions: lassoBT [21], GLinternet [18]
3. Non-linear main effects: HierBasis [28], SPAM [29], gamsel [30]
For GLinternet we specified the interactionCandidates argument so as to only consider interactions between the environment and all other X variables. For all other methods we supplied (X, X_E) as the data matrix, fit 100 tuning parameter values, and used the default values otherwise. lassoBT considers all pairwise interactions, as there is no way for the user to restrict the search space. SPAM applies the same basis expansion to every column of the data matrix; we chose 5 basis spline functions. HierBasis and gamsel select whether a term in an additive model is zero, linear, or a non-linear spline, up to a specified maximum degrees of freedom per variable.
We compare the above listed methods with our main proposal method sail, as well as with adaptive sail (Algorithm 2), sail weak which has the weak heredity property and linear sail as described in Section 2.5. For each function f j , we use a B-spline basis matrix with degree=5 implemented in the bs function in R [31]. We center the environment variable and the basis functions before running the sail method.

Simulation Design
To make the comparisons with other methods as fair as possible, we followed a simulation framework that has been previously used for variable selection methods in additive models [32,33], extended to include interaction effects. The covariates are simulated as follows. First, we generate z_{i1}, ..., z_{ip}, u_i, v_i independently from a standard normal distribution truncated to the interval [0,1] for i = 1, ..., n. Then we set

X_{ij} = (z_{ij} + t · u_i)/(1 + t) for j = 1, ..., 4, and X_{ij} = (z_{ij} + t · v_i)/(1 + t) for j = 5, ..., p,

where the parameter t controls the amount of correlation among predictors. The first four variables are non-zero (i.e., active in the response), while the remaining variables are noise. This leads to a compound symmetry correlation structure, with Corr(X_j, X_k) = t²/(1 + t²) within the active set and within the noise set, but the covariates of the non-zero and zero components are independent. We consider the case where p = 1000 and t = 0. The outcome Y is then generated following one of the models and assumptions described below.
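This covariate-generating step can be sketched as follows; the formula X_j = (Z_j + t·U)/(1 + t) is our reconstruction from the cited simulation framework, and uniform draws stand in for the truncated normals.

```python
import numpy as np

# Active covariates (first 4) share latent factor U, noise covariates
# share V, which induces compound symmetry within each block while
# keeping the two blocks independent. t = 0 gives independent covariates.

def simulate_X(n, p, t, rng):
    Z = rng.uniform(0, 1, size=(n, p))   # stand-in for truncated N(0,1) on [0,1]
    U = rng.uniform(0, 1, size=(n, 1))
    V = rng.uniform(0, 1, size=(n, 1))
    X = np.empty((n, p))
    X[:, :4] = (Z[:, :4] + t * U) / (1 + t)   # active block, shares U
    X[:, 4:] = (Z[:, 4:] + t * V) / (1 + t)   # noise block, shares V
    return X

X = simulate_X(200, 1000, t=0.0, rng=np.random.default_rng(11))
```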
We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects, and 3) interactions. The simulation scenarios are designed specifically to test the performance of each of these characteristics.

Hierarchy simulation
Scenario (a) Truth obeys strong hierarchy. In this situation, the true model for Y contains main effect terms for all covariates involved in interactions.
Scenario (b) Truth obeys weak hierarchy. Here, in addition to the interaction, the E variable has its own main effect but the covariates X_3 and X_4 do not.
Scenario (c) Truth only has interactions. In this simulation, the covariates involved in interactions do not have main effects as well.

Non-linearity simulation scenario
Truth is linear. sail is designed to model non-linearity; here we assess its performance if the true model is completely linear.

Interactions simulation scenario
Truth only has main effects. sail is designed to capture interactions; here we assess its performance when there are none in the true model.
The true component functions are the same as in [32,33] and are given by f_1(t) = 5t, f_2(t) = 3(2t − 1)², f_3(t) = 4 sin(2πt)/(2 − sin(2πt)), and f_4(t) = 6(0.1 sin(2πt) + 0.2 cos(2πt) + 0.3 sin²(2πt) + 0.4 cos³(2πt) + 0.5 sin³(2πt)). We set β_E = 2 and draw ε from a normal distribution with variance chosen such that the signal-to-noise ratio is 2.

Results
The test set MSE results for each of the five simulation scenarios are shown in Figure 3. We see that sail, adaptive sail and sail weak have the best performance in terms of both MSE and yielding correct sparse models when the truth follows a strong hierarchy (scenario 1a), as we would expect, since this is exactly the scenario that our method targets. We also visually inspected whether our method could correctly capture the shape of the association between the predictors and the response for both main and interaction effects; to do so, we plotted the true and predicted curves for scenario 1a) only, which are shown in Figure 5.

Real Data Application
In this section we illustrate sail on several real data examples.

Alzheimer's Disease Neuroimaging Initiative
Alzheimer's is an irreversible neurodegenerative disease that results in a loss of mental function due to the deterioration of brain tissue. The overall goal of the Alzheimer's Disease Neuroimaging Initiative (ADNI) is to validate biomarkers for use in Alzheimer's disease clinical treatment trials [34]. The patients were selected into the study based on their clinical diagnosis: controls, mild cognitive impairment (MCI) or Alzheimer's disease (AD). PET amyloid imaging was used to assess amyloid beta (Aβ) protein load in 96 brain regions. The response we use here is general cognitive decline, as measured by a continuous mini-mental state examination score. We applied sail to this data to see if there were any non-linear interactions between clinical diagnosis and Aβ protein in the 96 brain regions on mini-mental state examination.
There were a total of 343 patients who we divided randomly into equal sized training/validation/test splits. We ran the strong heredity sail with cubic B-splines and α = 0.1. We also applied the lasso, lassoBT, HierBasis and GLinternet to this data. Using the same default settings and strategy as the simulation study, we ran each method on the training data, determined the optimal tuning parameter on the validation data, and assessed MSE on the test data. We repeated this process 200 times.

In Figure 8 we plot the mean test set MSE against the mean number of active variables (±1 SD) for GLinternet, HierBasis, lasso, lassoBT and sail.
We see that sail produces the sparsest models but does not perform as well as HierBasis and GLinternet in terms of MSE. sail achieves an MSE similar to both the lasso and lassoBT with fewer variables on average. GLinternet produces the largest models and seems to be sensitive to the train/validation/test split, as evidenced by the large standard deviations.
To visualize the results from the sail method, we chose the train/validation/test split that led to the best test set MSE, and then plotted the interaction effects in Figure 9. The left panel shows the left middle occipital gyrus, a region in the occipital lobe known for visual object perception: a higher Aβ protein load leads to a worse cognitive score for the MCI and AD groups but not for the controls. The right panel shows the cuneus, a region known to be involved in basic visual processing: here, a higher Aβ protein load leads to better cognitive scores for the MCI and AD groups and poorer scores for the controls.

Discussion
In this article we have introduced the sparse additive interaction learning model sail for detecting non-linear interactions with a key environmental or exposure variable in high-dimensional settings. Using a simple reparametrization, we are able to achieve either the weak or strong heredity property without using a complex penalty function. We developed a blockwise coordinate descent algorithm to solve the sail objective function for the least-squares loss function. All our algorithms are implemented in a computationally efficient, well-documented and freely available R package. Furthermore, our method is flexible enough to handle any type of basis expansion including the identity map, which allows for linear interactions. Our implementation allows the user to selectively apply the basis expansions to the predictors, allowing for example, a combination of continuous and categorical predictors. An extensive simulation study shows that sail, adaptive sail and sail weak outperform existing penalized regression methods in terms of prediction error, sensitivity and specificity when there are non-linear main effects only, as well as interactions with an exposure variable.
Our method however does have its limitations. sail can currently only handle X_E · f(X) or f(X_E) · X and does not allow for f(X, X_E), i.e., only one of the variables in the interaction can have a non-linear effect and we do not consider the tensor product. The reparametrization leads to a non-convex optimization problem which makes convergence rates difficult to assess, though we did not experience any major convergence issues in our simulations and real data analysis. The memory footprint can also be an issue depending on the degree of the basis expansion and the number of variables.
To our knowledge, our proposal is the first to allow for non-linear interactions with a key exposure variable following the weak or strong heredity property in high-dimensional settings.
We also provide a first software implementation for these models.
A Algorithm Details

In this appendix we provide more specific details about the algorithms used to solve the sail objective function.

A.1 Least-Squares sail with Strong Heredity
A more detailed algorithm for fitting the least-squares sail model with strong heredity is given in Algorithm 3.

A.2 Details on Update for θ
Here we discuss a computational speedup in the updates for the θ parameter. The partial residual R_s used for updating θ_s (s ∈ {1, ..., p}) at the kth iteration is obtained by removing from Y the fitted value at the kth iteration excluding the contribution from Ψ_s, denoted Y^{(k)}_{(−s)}. Denote by θ_s^{(k),new} the solution for predictor s at the kth iteration. To update the parameters for the next predictor θ_{s+1} (s + 1 ∈ {1, ..., p}) at the kth iteration, we need the partial residual R_{s+1}, where θ_s^{(k)} is the parameter value prior to the update and θ_s^{(k),new} is the updated value. Taking the difference between the two partial residuals yields a correction term Δ such that R_{s+1} = R_s + Δ, so the partial residual for the next predictor can be computed by updating the previous partial residual by Δ rather than being re-calculated from scratch. This formulation can lead to computational speedups, especially when Δ = 0, in which case the partial residual does not need to be re-calculated at all.
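The residual-update identity can be checked numerically. The sketch below does so for a main-effects-only model (the sail update also carries interaction terms, which we omit here for brevity): the next partial residual computed from scratch matches the stale quantity corrected by Δ = Ψ_s(θ_s^old − θ_s^new).

```python
import numpy as np

# Verify that correcting a cached residual by delta reproduces the
# freshly recomputed partial residual for the next predictor.

rng = np.random.default_rng(5)
n, p, m = 30, 5, 4
Psi = [rng.normal(size=(n, m)) for _ in range(p)]
theta = [rng.normal(size=m) for _ in range(p)]
y = rng.normal(size=n)

fitted = sum(Psi[j] @ theta[j] for j in range(p))   # fit before the update
theta_new = 0.5 * theta[0]                          # some update for theta_0
delta = Psi[0] @ (theta[0] - theta_new)

# partial residual for predictor 1, recomputed with theta_0 updated
r_direct = y - (Psi[0] @ theta_new
                + sum(Psi[j] @ theta[j] for j in range(2, p)))
# same quantity, from the stale fit plus the correction delta
r_update = (y - fitted + Psi[1] @ theta[1]) + delta

assert np.allclose(r_direct, r_update)
```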

A.3 Least-Squares sail with Weak Heredity
The least-squares sail model with weak heredity has the form

Y = β_0 1_n + β_E X_E + \sum_{j=1}^{p} Ψ_j θ_j + \sum_{j=1}^{p} γ_j (X_E ∘ Ψ_j)(β_E 1_{m_j} + θ_j) + ε,

and the objective function pairs the least-squares loss with the penalty λ(1−α)(w_E |β_E| + \sum_j w_j \|θ_j\|_2) + λα \sum_j w_{jE} |γ_j|. Denote the n-dimensional residual column vector R = Y − Ŷ. As in the strong heredity case, the subgradient equations involve s_1 in the subgradient of the ℓ1 norm (for β_E), s_2 in the subgradient of the ℓ2 norm (for θ_j), and s_3 in the subgradient of the ℓ1 norm (for γ_j). Define the partial residuals without the jth predictor (j = 1, ..., p), without X_E, and without the jth interaction (j = 1, ..., p) by removing the corresponding term's contribution from the fit. From the subgradient equations, β_E = 0, θ_j = 0 and γ_j = 0 are solutions whenever the corresponding partial-residual inner products fall below their penalty thresholds, and the non-zero updates involve the soft-thresholding operator S(x, t) = sign(x)(|x| − t)_+. As was the case in the strong heredity sail model, there are closed-form solutions for the intercept and β_E; each γ_j also has a closed-form solution and can be computed efficiently for j = 1, ..., p using the coordinate descent procedure implemented in the glmnet package [25], while we use the quadratic majorization technique implemented in the gglasso package [26] to solve for the θ_j. Algorithm 4 details the procedure used to fit the least-squares weak heredity sail model.

A.3.1 Lambda Max
The smallest value of λ for which the entire parameter vector (β_E, θ_1, ..., θ_p, γ_1, ..., γ_p) is 0 coincides with the λ_max of the least-squares strong heredity sail model.

C sail Package Showcase
In this section we briefly introduce the freely available and open source sail package in R. More comprehensive documentation is available at https://sahirbhatnagar.com/sail.
Note that this entire section is reproducible; the code and text are combined in an .Rnw file and compiled using knitr [35].

C.3 Cross-Validation
cv.sail is the main function for cross-validation, along with plot, predict, and coef methods for objects of class cv.sail. We run it in parallel:

    set.seed(432)  # to reproduce results (randomness due to CV folds)

[Cross-validation curve: mean-squared error as a function of λ, with the number of non-zero variables along the top axis.]

Recall that we consider the penalized least squares criterion for this problem, in which the weights w_E, w_j, w_jE are by default set to 1, as specified by the penalty.factor argument.
This argument allows users to apply separate penalty factors to each coefficient. In particular, any variable with penalty.factor equal to zero is not penalized at all. This feature is mainly useful for two reasons:
1. Prior knowledge about the importance of certain variables can be encoded: larger weights penalize a variable more, while smaller weights penalize it less.
2. It allows users to apply the adaptive sail, similar to the adaptive lasso.
In the following example, we want the environment variable to always be included, so we set the first element of p.fac to zero. We also apply less of a penalty to the main effects for X_2, X_3 and X_4. We see from the resulting solution path plot that the black line (corresponding to the X_E variable with penalty.factor equal to zero) is always included in the model.

C.7 User-Defined Design Matrix
A limitation of the sail method is that the same basis expansion function f(·) is applied to all columns of the predictor matrix X. Automatically selecting linear vs. non-linear components was not a focus of our paper, but it is an active area of research for main effects only, e.g., gamsel and HierBasis.
However, if the user has some prior knowledge on possible effect relationships, then they can supply their own design matrix. This can be useful for example, when one has a combination of categorical (e.g. gender, race) and continuous variables, but would only like to apply f (·) on the continuous variables. We provide an example below to illustrate this functionality.
We use the simulated dataset sailsim provided in our package. We first add a categorical variable race to the data:

    set.seed(1234)
    library(sail)
    x_df <- as.data.frame(sailsim$x)
    x_df$race <- factor(sample(1:2, nrow(x_df), replace = TRUE))

We then build the design matrix x with stats::model.matrix. One benefit of using stats::model.matrix is that it returns the group membership as an attribute:

    attr(x, "assign")

The group membership must be supplied to the sail function. This information is needed for the group lasso penalty, which selects each whole group as zero or non-zero.

C.7.1 Fit the sail Model
We need to set the argument expand = FALSE and provide the group membership. The first element of the group membership corresponds to the first column of x, the second element to the second column of x, and so on.
We can plot the solution path for both main effects and interactions using the plot method for objects of class sail. In this instance, since we provided a user-defined design matrix and set expand = FALSE, the numbers at the top of the plot represent the total number of non-zero coefficients.

C.7.2 Find the Optimal Value for λ
We can use cross-validation to find the optimal value of λ, and plot the cross-validated mean squared error as a function of λ. [Cross-validation curve omitted; the numbers along the top axis give the number of non-zero coefficients at each λ.] We then extract the estimated non-zero coefficients at the chosen value of λ.