Computational Statistics and Data Analysis

A sparse additive model for high-dimensional interactions with an exposure variable

sail possesses the oracle property, i.e., it performs as well as if the true model were known in advance. A computationally efficient fitting algorithm with automatic tuning parameter selection, which scales to high-dimensional datasets, is proposed. Simulation results show that sail outperforms existing penalized regression methods in terms of prediction accuracy and support recovery when there are non-linear interactions with an exposure variable. sail is applied to detect non-linear interactions between genes and a prenatal psychosocial intervention program on cognitive performance in children at 4 years of age. Results show that individuals who are genetically predisposed to lower educational attainment are those who stand to benefit the most from the intervention. The proposed algorithms are implemented in an R package available on CRAN (https://cran.r-project.org/package=sail).


Introduction
Computational approaches to variable selection have become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies, where the data has become massive, yet where it is believed that the number of truly important variables is small relative to the total number of variables. Although many approaches have been developed for main effects, there is an enduring interest in powerful methods for estimating interactions, since interactions may reflect important modulation of a genomic system by an external factor and vice versa (Bhatnagar et al., 2018).
Interactions can take numerous forms and vary in complexity. In this paper, we consider one specific type of interaction model, where an exposure variable E is involved in possibly non-linear interactions with a high-dimensional set of measures X, leading to effects on a response variable Y. We propose a multivariable penalization procedure for detecting non-linear interactions between X and E. Our method is motivated by the Nurse Family Partnership (NFP), a program of prenatal and infancy home visiting by nurses for low-income mothers and their children (Olds et al., 1998). In this intervention, NFP nurses guided pregnant women and parents of young children to improve the outcomes of pregnancy, their children's health and development, and their economic self-sufficiency, with the goal of reducing disparities over the life course. Early intervention in young children has been shown to positively impact intellectual abilities (Campbell and Ramey, 1994), and more recent studies have shown that cognitive performance is also strongly influenced by genetic factors (Rietveld et al., 2013). Given the important roles of both environment and genetics, we are interested in finding interactions between these two components on cognitive function in children.

A sparse additive interaction model
Let Y ∈ R be a continuous outcome variable, E ∈ R a binary or continuous environment/exposure variable of known importance, and X ∈ R^p a vector of additional predictors, possibly high-dimensional. Assume that we have n observations of each quantity, denoted by Y = (Y_1, ..., Y_n) ∈ R^n, X_E = (E_1, ..., E_n) ∈ R^n, and X = (X_1, ..., X_p) ∈ R^{n×p}. Furthermore, let f_j : R → R be a smoothing method for variable X_j given by a projection onto a set of basis functions:

f_j(X_j) = Σ_{ℓ=1}^{m_j} ψ_{jℓ}(X_j) β_{jℓ},   (1)

where {ψ_{jℓ}}_{ℓ=1}^{m_j} is a family of basis functions in X_j (Hastie et al., 2015). Let Ψ_j be the n × m_j matrix of evaluations of the ψ_{jℓ}, and θ_j = (β_{j1}, ..., β_{jm_j}) ∈ R^{m_j} for j = 1, ..., p (θ_j is an m_j-dimensional column vector of basis coefficients for the jth main effect). In this article we consider an additive interaction regression model of the form

Y = β_0 · 1_n + β_E X_E + Σ_{j=1}^p Ψ_j θ_j + Σ_{j=1}^p (X_E ∘ Ψ_j) τ_j + ε,   (2)

where β_0 ∈ R is the intercept, β_E ∈ R is the coefficient for the environment variable, τ_j = (τ_{j1}, ..., τ_{jm_j}) ∈ R^{m_j} are the basis coefficients for the jth interaction term, (X_E ∘ Ψ_j) is the n × m_j matrix formed by the component-wise multiplication of the column vector X_E by each column of Ψ_j, and ε ∈ R^n is a vector of i.i.d. errors with mean zero and finite variance. Here we assume that p is large relative to n, and particularly that Σ_{j=1}^p m_j / n is large. Owing to the large number of parameters to estimate relative to the number of observations, a commonly used approach in the penalization literature is to shrink the regression coefficients by placing a constraint on the values of (β_E, θ_j, τ_j). Certain constraints have the added benefit of producing a sparse model in the sense that many of the coefficients will be set exactly to 0 (Bühlmann and Van De Geer, 2011). Such a reduced predictor set can lead to a more interpretable model with smaller prediction variance, albeit at the cost of biased parameter estimates (Fan et al., 2014). In light of these goals, consider the following
penalized objective function:

Q(Θ) = (1/2n) ||Y − β_0 · 1_n − β_E X_E − Σ_j Ψ_j θ_j − Σ_j (X_E ∘ Ψ_j) τ_j||_2^2 + λ(1 − α) ( w_E |β_E| + Σ_j w_j ||θ_j||_2 ) + λα Σ_j w_jE ||τ_j||_2,   (3)

where λ > 0 and α ∈ (0, 1) are adjustable tuning parameters, and w_E, w_j, w_jE are non-negative penalty factors for j = 1, ..., p, which serve as a way of allowing parameters to be penalized differently (see Algorithm 2 for more details on how to estimate these weights). The first term in the penalty penalizes the main effects while the second term penalizes the interactions. The parameter α controls the relative weight on the two penalties. Note that we do not penalize the intercept. An issue with (3) is that, since no constraint is placed on the structure of the model, it is possible that an estimated interaction term is non-zero while the corresponding main effects are zero. While there may be certain situations where this is plausible, statisticians have generally argued that interactions should only be included if the corresponding main effects are also in the model (McCullagh and Nelder, 1989). This is known as the strong heredity principle (Chipman, 1996). Indeed, large main effects are more likely to lead to detectable interactions (Cox, 1984).
The strong heredity principle states that an interaction term can only have a non-zero estimate if its corresponding main effects are estimated to be non-zero, whereas the weak heredity principle allows for a non-zero interaction estimate as long as one of the corresponding main effects is estimated to be non-zero (Chipman, 1996). In the context of penalized regression methods, these principles can be formulated as structured sparsity problems (Bach et al., 2012). Several authors have proposed modifying the type of penalty in order to achieve the heredity principle (Radchenko and James, 2010; Bien et al., 2013; Lim and Hastie, 2015; Haris et al., 2016). We take an alternative approach. In Section 2 we discuss how a simple reparametrization of the model (3) can lead to this desirable property.
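The design matrices entering model (2) can be sketched in a few lines. The following minimal numpy example uses a simple polynomial basis as a stand-in for the B-splines used later in the paper; the model only requires the n × m_j matrices Ψ_j, whatever basis produces them. Names (`build_design`, `m`) are illustrative, not from the sail package.

```python
import numpy as np

def build_design(X, E, m=3):
    """Construct main-effect blocks Psi_j and interaction blocks X_E o Psi_j.

    A polynomial basis stands in for B-splines; any basis expansion works,
    since the model only needs the per-predictor matrices Psi_j.
    """
    n, p = X.shape
    # Psi_j: n x m matrix of basis evaluations for predictor j
    Psi = [np.column_stack([X[:, j] ** d for d in range(1, m + 1)])
           for j in range(p)]
    # interaction block: component-wise product of X_E with each column of Psi_j
    Psi_E = [E[:, None] * Pj for Pj in Psi]
    return Psi, Psi_E

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.uniform(0.0, 1.0, size=(n, p))
E = rng.binomial(1, 0.5, size=n).astype(float)
Psi, Psi_E = build_design(X, E)
```

Each element of `Psi` is an n × m block for one predictor, and each element of `Psi_E` is the corresponding interaction block (X_E ∘ Ψ_j).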

Related work
Methods for variable selection of interactions can be divided into two categories: linear and non-linear interaction effects. Many of the linear effect methods consider all pairwise interactions in X (Zhao et al., 2009; Choi et al., 2010; Bien et al., 2013; She and Jiang, 2014), which can be computationally prohibitive when p is large. More recent proposals for the selection of interactions allow the user to restrict the search space to interaction candidates (Lim and Hastie, 2015; Haris et al., 2016). This is useful when the researcher wants to impose prior information on the model. Two-stage procedures, where interaction candidates are considered from an initial screen of main effects, have shown good performance when p is large in the linear setting (Hao et al., 2018; Shah, 2016). Far fewer methods are available for estimating non-linear interactions. For example, Radchenko and James (2010) proposed a model of the form Y = β_0 + Σ_{j=1}^p f_j(X_j) + Σ_{j>k} f_jk(X_j, X_k) + ε, where the f(·) are smooth component functions. This method is more computationally expensive than sail, since it considers all pairwise interactions between the basis functions, and its effectiveness in simulations or real-data applications is unknown, as there is no software implementation.

Our contributions
The main contributions of this paper are five-fold. First, we develop a model for non-linear interactions with a key exposure variable, following either the weak or strong heredity principle, that is computationally efficient and scales to the high-dimensional setting (n ≪ p). Second, through simulation studies, we show improved performance in terms of prediction accuracy and support recovery over existing methods that only consider linear interactions or additive main effects. Third, we show that our method possesses the oracle property (Fan and Li, 2001), i.e., it performs as well as if the true model were known in advance. Fourth, we demonstrate the performance of our method in two applications: 1) gene-environment interactions in a prenatal psychosocial intervention program (Olds et al., 1998) and 2) a study aimed at identifying which clinical variables influence mortality rates amongst seriously ill hospitalized patients (Connors et al., 1995). Fifth, we implement our algorithms in the sail R package on CRAN (https://cran.r-project.org/package=sail), along with extensive documentation. In particular, our implementation also allows for linear interaction models, user-defined basis expansions, a cross-validation procedure for selecting the optimal tuning parameter, and differential shrinkage parameters to apply the adaptive lasso idea (Zou, 2006).
The rest of the paper is organized as follows. Section 2 describes our optimization procedure and some details of the algorithm used to fit the sail model in the least-squares case. Theoretical results are given in Section 3. In Section 4, through simulation studies, we compare the performance of our proposed approach and demonstrate the scenarios where it can be advantageous to use sail over existing methods. Section 5 contains two real data examples, and Section 6 discusses some limitations and future directions.

Strong and weak heredity
Following Choi et al. (2010), we introduce a new set of parameters γ = (γ_1E, ..., γ_pE) ∈ R^p and reparametrize the coefficients for the interaction terms τ_j in (2) as a function of γ_jE and the main effect parameters θ_j and β_E. This reparametrization for both strong and weak heredity is summarized in Table 1.
To perform variable selection in this new parametrization, we penalize γ = (γ_1E, ..., γ_pE) instead of penalizing τ as in (3), leading to the following penalized objective function:

Q(Θ) = (1/2n) ||Y − β_0 · 1_n − β_E X_E − Σ_j Ψ_j θ_j − Σ_j (X_E ∘ Ψ_j) τ_j||_2^2 + λ(1 − α) ( w_E |β_E| + Σ_j w_j ||θ_j||_2 ) + λα Σ_j w_jE |γ_jE|,   (4)

where τ_j is given by the reparametrization in Table 1. An estimate of the regression parameters is given by Θ̂ = arg min_Θ Q(Θ). This penalty allows for the possibility of excluding the interaction term from the model even if the corresponding main effects are non-zero. Furthermore, smaller values for α lead to more interactions being included in the final model, while values approaching 1 favor main effects. Similar to the elastic net (Zou and Zhang, 2009), we fix α and obtain a solution path over a sequence of λ values.

Table 1
Summary of the reparametrization and penalty terms for the strong and weak heredity sail models. Note that the penalty terms are identical for both model types, i.e., the reparametrization only affects the likelihood term of the objective function.

Model | Reparametrization | Penalty
Strong heredity | τ_j = γ_jE β_E θ_j | λ(1 − α)(w_E |β_E| + Σ_j w_j ||θ_j||_2) + λα Σ_j w_jE |γ_jE|
Weak heredity | τ_j = γ_jE (β_E 1_{m_j} + θ_j) | λ(1 − α)(w_E |β_E| + Σ_j w_j ||θ_j||_2) + λα Σ_j w_jE |γ_jE|
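The penalized objective under the strong-heredity reparametrization can be sketched numerically. The function below is an illustrative sketch (names and defaults are ours, not the sail package's): it evaluates a least-squares loss plus the main-effect and interaction penalties, with the interaction coefficients built as τ_j = γ_jE β_E θ_j so heredity is automatic.

```python
import numpy as np

def sail_objective(Y, E, Psi, beta0, betaE, theta, gamma,
                   lam, alpha=0.5, wE=1.0, w=None, wI=None):
    """Penalized least-squares objective under the strong-heredity
    reparametrization tau_j = gamma_jE * betaE * theta_j (sketch)."""
    n, p = len(Y), len(Psi)
    w = np.ones(p) if w is None else w
    wI = np.ones(p) if wI is None else wI
    fit = beta0 + betaE * E
    for j in range(p):
        tau_j = gamma[j] * betaE * theta[j]            # heredity built in
        fit = fit + Psi[j] @ theta[j] + (E[:, None] * Psi[j]) @ tau_j
    rss = np.sum((Y - fit) ** 2) / (2 * n)
    # main-effect penalty: lasso on betaE, group lasso on each theta_j
    pen_main = lam * (1 - alpha) * (wE * abs(betaE)
               + sum(w[j] * np.linalg.norm(theta[j]) for j in range(p)))
    # interaction penalty: lasso on the scalars gamma_jE
    pen_int = lam * alpha * sum(wI[j] * abs(gamma[j]) for j in range(p))
    return rss + pen_main + pen_int
```

With all coefficients at zero, the objective reduces to the residual sum of squares term alone, which is the value used to locate λ_max on the solution path.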

Toy example
We present here a toy example to better illustrate the methods proposed in this paper. With a sample size of n = 100, we sample p = 20 covariates X_1, ..., X_p independently from a N(0, 1) distribution truncated to the interval [0, 1]. Data were generated from a model which follows the strong heredity principle, but where only one covariate, X_2, is involved in an interaction with a binary exposure variable E. The error term ε is generated from a normal distribution with variance chosen such that the signal-to-noise ratio (SNR) is 2. We generated a single simulated dataset and used the strong heredity sail method (described below) with B-splines (df = 5) to estimate the functional forms. 10-fold cross-validation (CV) was used to choose the optimal level of penalization.
We used α = 0.5 and default values for all other arguments. We plot the solution path for both main effects and interactions in Fig. 1 (top panel), coloring the lines to correspond to the selected model. We see that our method is able to correctly identify the true model. We can also visually see the penalty and the strong heredity principle working in tandem: the interaction term E · f_2(X_2) (orange lines in the bottom panel) can only be non-zero if the main effects E and f_2(X_2) (black and orange lines, respectively, in the top panel) are non-zero, while non-zero main effects do not imply a non-zero interaction.
In Fig. 1 (bottom panel), we plot the true component functions f_1(X_1) and E · f_2(X_2) along with their estimates from this sail analysis. We are able to capture the shape of the correct functional form. The lack of fit for f_1(X_1) can be partially explained by the fact that sail is fitting a spline to a linear function. Nevertheless, this example demonstrates that sail can identify trends reasonably well.

Blockwise coordinate descent for least-squares loss
Here we describe a blockwise coordinate descent algorithm for fitting the least-squares version of the sail model in (4).
We fix the value of α and minimize the objective function over a decreasing sequence of λ values (λ_max > · · · > λ_min). We use the subgradient equations to determine the maximal value λ_max such that all estimates are zero (the derivation of λ_max is provided in Supplemental Section B.3). Due to the heredity principle, this reduces to finding the largest λ such that all main effects (β_E, θ_1, ..., θ_p) are zero. Following Friedman et al. (2010), we construct a λ-sequence of 100 values decreasing from λ_max to 0.001 λ_max on the log scale, and use a warm start strategy where the solution for λ_ℓ is used as the starting value for λ_{ℓ+1}.
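The λ-sequence construction is a one-liner; a minimal sketch (function name ours):

```python
import numpy as np

def lambda_sequence(lam_max, n_lambda=100, ratio=0.001):
    """100 values decreasing from lam_max to ratio * lam_max on the
    log scale, as in Friedman et al. (2010)."""
    return np.exp(np.linspace(np.log(lam_max), np.log(ratio * lam_max), n_lambda))

lams = lambda_sequence(2.5)
# warm starts: the fit at lams[k] initializes the fit at lams[k + 1]
```

Log spacing places more grid points near λ_max, where the active set changes quickly, which is why it is preferred over a linear grid.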
We assume that Y, Ψ_j, X_E and X_E ∘ Ψ_j have been centered by their sample means; for Ψ_j and X_E ∘ Ψ_j, the centering values are the m_j-dimensional vectors of column means. Since the intercept (β_0) is not penalized and all variables have been centered, we can omit it from the loss function and compute it once the algorithm has converged for all other parameters. The strong heredity sail model with least-squares loss has the form

Y = β_E X_E + Σ_{j=1}^p Ψ_j θ_j + Σ_{j=1}^p γ_jE β_E (X_E ∘ Ψ_j) θ_j + ε,   (5)

and the objective function is given by

Q(Θ) = (1/2n) ||Y − β_E X_E − Σ_j Ψ_j θ_j − Σ_j γ_jE β_E (X_E ∘ Ψ_j) θ_j||_2^2 + λ(1 − α) ( w_E |β_E| + Σ_j w_j ||θ_j||_2 ) + λα Σ_j w_jE |γ_jE|.   (6)

Minimizing (6) in a blockwise manner allows us to leverage computationally fast algorithms for ℓ1- and ℓ2-norm penalized regression. Indeed, by careful construction of pseudo responses and pseudo design matrices, existing efficient algorithms can be used to estimate the parameters. The objective function simplifies to a modified lasso problem when holding all θ_j fixed, and to a modified group lasso problem when holding β_E and all γ_jE fixed. The main computations are provided in Algorithm 1. A more detailed version of the derivations is given in Supplemental Section B.1. The sequence of tuning parameters (λ_max > · · · > λ_min) is automatically chosen by our software package based on the data inputs X, Y, and X_E. The user may also choose to supply their own decreasing sequence. We recommend B-splines with 5 degrees of freedom for the basis function, and α = 0.5 to provide similar penalties to both main effects and interactions. Smaller values of α will favor the inclusion of more interaction terms. The weights for the environment variable (w_E), main effects (w_j) and interactions (w_jE) can be chosen via the adaptive sail (Algorithm 2), or be left at their default values of 1. Smaller weights penalize the corresponding variables less. The default value for the convergence threshold (ε) is 1 × 10^{−4}.

Fig. 1. Top: Solution paths for the five basis coefficients for X_1 and X_2, respectively. λ_{1SE} is the largest value of penalization for which the CV error is within one standard error of the minimizing value λ_min. Bottom: Estimated smooth functions for X_1 and the X_2 · E interaction by the sail method, based on λ_min.
Algorithm 1 Blockwise Coordinate Descent for Least-Squares sail with Strong Heredity.
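Algorithm 1 alternates between two standard proximal updates: soft-thresholding for the lasso-type subproblems in β_E and the γ_jE, and group soft-thresholding for the group-lasso-type subproblems in the θ_j. Below is a minimal numpy sketch of these two building blocks only; the surrounding blockwise loop and the pseudo-response/pseudo-design construction of the actual algorithm are omitted.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: the elementwise lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def group_soft_threshold(z, t):
    """Proximal operator of t * ||.||_2: shrinks an entire coefficient
    block toward zero, setting it exactly to zero when ||z||_2 <= t."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= t else (1.0 - t / nz) * z
```

The group operator either keeps or kills a whole block of basis coefficients, which is what makes an entire smooth term f_j enter or leave the model at once.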

Details on update for θ
Here we discuss a computational speedup in the updates for the θ parameters; Ψ̃_s denotes the pseudo design matrix for predictor s. The partial residual R_s used for updating θ_s (s ∈ {1, ..., p}) at the kth iteration is given by

R_s^{(k)} = Y − Ŷ_{(−s)}^{(k)},   (7)

where Ŷ_{(−s)}^{(k)} is the fitted value at the kth iteration excluding the contribution from Ψ̃_s:

Ŷ_{(−s)}^{(k)} = Ŷ^{(k)} − Ψ̃_s θ̃_s,   (8)

with Ŷ^{(k)} the full fitted value and θ̃_s the value of θ_s prior to the update. Using (8), (7) can be re-written as

R_s^{(k)} = R^{(k)} + Ψ̃_s θ̃_s,   (9)

where

R^{(k)} = Y − Ŷ^{(k)}   (10)

is the full residual, and the solution for predictor s at the kth iteration is given by

θ̂_s = arg min_{θ_s} (1/2n) ||R_s^{(k)} − Ψ̃_s θ_s||_2^2 + λ(1 − α) w_s ||θ_s||_2.   (11)

Now we want to update the parameters for the next predictor, θ_{s+1} (s + 1 ∈ {1, ..., p}), at the kth iteration. The partial residual used to update θ_{s+1} is given by

R_{s+1}^{(k)} = R^{(k)} + Ψ̃_s θ̃_s − Ψ̃_s θ̂_s + Ψ̃_{s+1} θ̃_{s+1},   (12)

where R^{(k)} is given by (10), θ̃ is the parameter value prior to the update, and θ̂ is the updated value given by (11). Taking the difference between (12) and (9) gives

Δ = Ψ̃_{s+1} θ̃_{s+1} − Ψ̃_s θ̂_s.   (13)

Therefore R_{s+1}^{(k)} = R_s^{(k)} + Δ, and the partial residual for updating the next predictor can be computed by updating the previous partial residual by Δ, given by (13). This formulation can lead to computational speedups, especially when Δ = 0 (as is common in sparse fits where most blocks are zero), in which case the partial residual does not need to be re-calculated.
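This residual bookkeeping can be checked numerically: maintain the full residual, form each partial residual by adding back the block's contribution, and update incrementally after each block update. A sketch, with an illustrative "update" that simply zeroes each block (in the actual algorithm this would be the group-lasso solution):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 30, 5, 3
Psi = [rng.normal(size=(n, m)) for _ in range(p)]    # pseudo design blocks
theta = [rng.normal(size=m) for _ in range(p)]
Y = sum(P @ t for P, t in zip(Psi, theta)) + rng.normal(size=n)

R = Y - sum(P @ t for P, t in zip(Psi, theta))       # full residual, kept current

for s in range(p):
    R_s = R + Psi[s] @ theta[s]        # partial residual for block s
    theta_new = np.zeros(m)            # stand-in for the actual block update
    R = R_s - Psi[s] @ theta_new       # incremental update, no recomputation
    theta[s] = theta_new

R_full = Y - sum(P @ t for P, t in zip(Psi, theta))  # recompute from scratch
```

After the sweep, the incrementally maintained residual agrees with a full recomputation, which is the invariant the speedup relies on.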

Weak heredity
Our method can easily be adapted to enforce the weak heredity property, i.e., an interaction term can be present only if at least one of its corresponding main effects is non-zero. To do so, we reparametrize the coefficients for the interaction terms in (2) as τ_j = γ_jE (β_E · 1_{m_j} + θ_j), where 1_{m_j} is a vector of ones of dimension m_j (i.e. the length of θ_j). We defer the algorithmic details for fitting the weak heredity sail model to Supplemental Section B.4, as they are very similar to Algorithm 1 for the strong heredity sail model.
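The contrast between the two reparametrizations can be seen directly: under strong heredity the interaction coefficients vanish whenever either main effect is zero, while under weak heredity one non-zero main effect suffices. A small numeric check (the coefficient values are illustrative):

```python
import numpy as np

m_j = 4
theta_j = np.zeros(m_j)        # main effect of X_j estimated at zero
betaE = 1.5                    # non-zero environment main effect
gamma_jE = 2.0                 # non-zero interaction parameter

tau_strong = gamma_jE * betaE * theta_j                     # needs BOTH main effects
tau_weak = gamma_jE * (betaE * np.ones(m_j) + theta_j)      # needs at least one
```

Here `tau_strong` is identically zero because θ_j = 0, whereas `tau_weak` survives through β_E alone.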

Adaptive sail
The weights for the environment variable, main effects and interactions are given by w_E, w_j and w_jE, respectively. These weights serve as a means of allowing a different penalty to be applied to each variable. In particular, any variable with a weight of zero is not penalized at all. This feature is usually used for one of two reasons:
1. Prior knowledge about the importance of certain variables is available. Larger weights penalize a variable more, while smaller weights penalize it less.
2. It allows users to apply the adaptive sail, similar to the adaptive lasso (Zou, 2006).
We describe the adaptive sail in Algorithm 2. This is a general procedure that can be applied to both the weak and strong heredity settings. We provide this capability in the sail package through the penalty.factor argument.

Flexible design matrix
The definition of the basis expansion functions in (1) is very flexible, in the sense that our algorithms are agnostic to this choice. As a result, the user can apply any basis expansion they desire. In the extreme case, one could apply the identity map, i.e., f_j(X_j) = X_j, which leads to a linear interaction model (referred to as linear sail). When little is known a priori about the relationship between the predictors and the response, we choose, by default, to apply the same basis expansion to all columns of X. This is a reasonable approach when all the variables are continuous. However, there are often situations where the data contains a combination of categorical and continuous variables. In these cases it may be sub-optimal to apply a basis expansion to the categorical variables. Owing to the flexible nature of our algorithm, we handle this scenario in our implementation by allowing a user-defined design matrix. The only extra information needed is the group membership of each column in the design matrix. We illustrate such an example in a vignette of the sail R package.

Algorithm 2 Adaptive sail algorithm.
1. For a decreasing sequence λ = λ_max, ..., λ_min and fixed α, run the sail algorithm.
2. Use cross-validation or a data-splitting procedure to determine the optimal value of the tuning parameter: λ^[opt] ∈ {λ_max, ..., λ_min}.
3. Let β̂_E^[opt], θ̂_j^[opt] and τ̂_j^[opt] for j = 1, ..., p be the coefficient estimates of the model at λ^[opt].
4. Set the weights to be w_E = 1/(|β̂_E^[opt]| + 1/n), w_j = 1/(||θ̂_j^[opt]||_2 + 1/n), and w_jE = 1/(||τ̂_j^[opt]||_2 + 1/n) for j = 1, ..., p.
5. Run the sail algorithm with the weights defined in step 4, and use cross-validation or a data-splitting procedure to choose the optimal value of λ.
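Step 4 of Algorithm 2 can be sketched as follows; the function name and argument layout are illustrative, and the 1/(|estimate| + 1/n) form matches the adaptive weights discussed in the theory section.

```python
import numpy as np

def adaptive_weights(betaE_hat, theta_hat, tau_hat, n):
    """Adaptive sail weights: 1 / (|first-pass estimate| + 1/n).

    The 1/n offset avoids division by zero, so terms estimated at zero
    receive a large but finite penalty in the second pass."""
    wE = 1.0 / (abs(betaE_hat) + 1.0 / n)
    w = np.array([1.0 / (np.linalg.norm(t) + 1.0 / n) for t in theta_hat])
    wI = np.array([1.0 / (np.linalg.norm(t) + 1.0 / n) for t in tau_hat])
    return wE, w, wI

wE, w, wI = adaptive_weights(2.0, [np.zeros(5), np.ones(5)], [np.zeros(5)], n=100)
```

Blocks estimated at zero in the first pass get weight n (here 100), so they are heavily penalized in the second pass, while strongly estimated blocks are penalized lightly.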

Theory
In this section we study the asymptotic behavior of the sail estimator Θ̂, defined as the minimizer of (4), as well as its model selection properties. We show that sail possesses the oracle property when the sample size approaches infinity and the number of predictors is fixed. That is, under certain regularity conditions, it performs as well as if the true model were known in advance and has the optimal estimation rate (Zou, 2006). The regularity conditions and proofs are given in Supplemental Section 1.
Write the objective (4) as Q_n(Θ) = L_n(Θ) + Σ_{m=1}^{2p+1} λ_m p_m(Θ_m), where L_n is the least-squares loss and p_m is the ℓ1 or ℓ2 penalty on the mth coefficient block, with λ_1 = λ(1 − α) w_E, λ_m = λ(1 − α) w_m for m = 2, ..., p + 1, and λ_m = λα w_mE for m = p + 2, ..., 2p + 1. Define A_1 = {m ∈ {1, ..., p + 1} : Θ*_m ≠ 0} and A_2 = {m ∈ {p + 2, ..., 2p + 1} : Θ*_m ≠ 0}; that is, A_1 contains the indices of the main effects whose true coefficients are non-zero, and A_2 contains the indices of the interaction terms whose true coefficients are non-zero. Let A = A_1 ∪ A_2, a_n = max{λ_m : m ∈ A}, and b_n = min{λ_m : m ∉ A}. Note that our asymptotic results are stated for the main effects and interaction terms only, even though our formulation includes an unpenalized intercept. Consistency results immediately follow for β_0, since we assume the data has been centered, leading to a closed-form solution for the intercept in the least-squares setting.

Lemma 1 (Existence of a local minimizer). If a_n = o(1/√n), then there exists a local minimizer Θ̂_n of Q_n(Θ) such that ||Θ̂_n − Θ*|| = O_P(1/√n).
Lemma 1 states that if the tuning parameters corresponding to the non-zero coefficients converge to 0 at a speed faster than 1/√n, then there exists a local minimizer of Q_n(Θ) which is √n-consistent (Wang et al., 2007; Choi et al., 2010).

Theorem 1 (Model selection consistency). If √n b_n → ∞, then for the √n-consistent local minimizer Θ̂_n of Lemma 1,

P(Θ̂_{n, A^c} = 0) → 1.   (15)
Theorem 1 shows that sail can consistently remove the main effects and interaction terms which are not associated with the response, with high probability. Together with Lemma 1, we see that the asymptotic behavior of the penalty terms for the zero and non-zero predictors must be different to satisfy the model selection consistency property (15) (Nardi and Rinaldo, 2008). Specifically, when the tuning parameters for the non-zero coefficients converge to 0 faster than 1/√n (i.e. √n a_n → 0) and those for the zero coefficients are large enough (i.e. √n b_n → ∞), Lemma 1 and Theorem 1 imply that the √n-consistent estimator Θ̂_n satisfies P(Θ̂_{n, A^c} = 0) → 1. Next, we obtain the asymptotic distribution of the sail estimator.

Theorem 2 (Asymptotic normality). Denote by Θ_A the subvector of Θ indexed by A and by Θ*_A the corresponding true values. Under the regularity conditions, the subvector Θ̂_{n, A} of the local minimizer Θ̂_n given in Lemma 1 satisfies

√n (Θ̂_{n, A} − Θ*_A) →_d N(0, I*_A^{−1}),

where I*_A is the Fisher information matrix for Θ_A at Θ_A = Θ*_A, assuming A^c is known in advance.
Together, Theorems 1 and 2 establish that if the tuning parameters satisfy the conditions √n a_n → 0 and √n b_n → ∞, then as the sample size grows large, sail has the oracle property (Fan and Li, 2001). In order for the conditions on the tuning parameters to be satisfied, we follow the strategies outlined for the adaptive lasso (Zou, 2006), the adaptive group lasso (Nardi and Rinaldo, 2008) and the adaptive elastic-net (Zou and Zhang, 2009). That is, we define the adaptive weights as w_E = 1/(|β̂_E| + 1/n), w_j = 1/(||θ̂_j||_2 + 1/n) and w_jE = 1/(||τ̂_j||_2 + 1/n), where β̂_E, θ̂_j and τ̂_j are estimates from an initial fit. Here, the 1/n is to avoid division by zero.

Simulation study
In this section, we use simulated data to understand the performance of sail in different scenarios.

Comparator methods
Since there are no other packages that directly address our chosen problem, we selected comparator methods based on the following criteria: 1) penalized regression methods that can handle high-dimensional data (n < p), 2) allowing at least one of linear effects, non-linear effects or interaction effects, and 3) having a software implementation in R. The selected methods can be grouped into three categories:
1. Linear main effects: lasso (Tibshirani, 1996), adaptive lasso (Zou, 2006)
2. Linear interactions: lassoBT (Shah, 2016), GLinternet (Lim and Hastie, 2015)
3. Non-linear main effects: HierBasis (Haris et al., 2019), SPAM (Ravikumar et al., 2009), gamsel (Chouldechova and Hastie, 2015)
For GLinternet we specified the interactionCandidates argument so as to only consider interactions between the environment and all other X variables. For all other methods we supplied (X, X_E) as the data matrix, set the number of tuning parameters to 100, and used default values otherwise (R code for each method is available at https://github.com/sahirbhatnagar/sail/blob/master/my_sims/method_functions.R). lassoBT considers all pairwise interactions, as there is no way for the user to restrict the search space. SPAM applies the same basis expansion to every column of the data matrix; we chose 5 basis spline functions. HierBasis and gamsel select whether a term in an additive model is non-zero, linear, or a non-linear spline up to a specified maximum degrees of freedom per variable. We compare the methods listed above with our main proposal, sail, as well as with adaptive sail (Algorithm 2) and sail weak, which has the weak heredity property. For each function f_j, we use a B-spline basis matrix with degree = 5, implemented in the bs function in R (R Core Team, 2017). We center the environment variable and the basis functions before running the sail method.

Simulation design
To make the comparisons with other methods as fair as possible, we followed a simulation framework that has previously been used for variable selection in additive models (Lin and Zhang, 2006; Huang et al., 2010), and extended it to include interaction effects. The covariates are simulated as follows. First, we generate x_{i1}, ..., x_{i,1000} independently from a standard normal distribution truncated to the interval [0, 1], for i = 1, ..., n. The first four variables are non-zero (i.e. active in the response), while the remaining variables are noise. The exposure variable (X_E) is generated from a standard normal distribution truncated to the interval [−1, 1]. The outcome Y is then generated following one of the models and assumptions described below. We evaluate the performance of our method on three of its defining characteristics: 1) the strong heredity property, 2) non-linearity of predictor effects, and 3) interactions. The simulation scenarios are designed specifically to test these characteristics.
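The covariate-generation step can be sketched with rejection sampling for the truncated normals, and the noise scale set from the realized signal variance. This is an illustrative sketch: the `signal` below is a stand-in for the full generating models described in the scenarios, not the actual simulation code.

```python
import numpy as np

def truncated_normal(rng, lo, hi, size):
    """Standard normal truncated to [lo, hi] via rejection sampling."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draw = rng.normal(size=size)
        keep = draw[(draw >= lo) & (draw <= hi)]
        take = min(len(keep), size - filled)
        out[filled:filled + take] = keep[:take]
        filled += take
    return out

rng = np.random.default_rng(2023)
n, p = 200, 1000
X = truncated_normal(rng, 0.0, 1.0, n * p).reshape(n, p)   # covariates in [0, 1]
E = truncated_normal(rng, -1.0, 1.0, n)                    # exposure in [-1, 1]

signal = 2.0 * E                         # beta_E = 2; stand-in for the full signal
sigma2 = signal.var() / 2.0              # Var(signal) / Var(noise) = SNR = 2
Y = signal + rng.normal(scale=np.sqrt(sigma2), size=n)
```

Setting the noise variance to Var(signal)/SNR is what "variance chosen such that the signal-to-noise ratio is 2" amounts to operationally.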

Heredity simulation
Scenario (a): Truth obeys strong heredity. In this situation, the true model for Y contains main effect terms for all covariates involved in interactions. Scenario (b): Truth obeys weak heredity. Here, in addition to the interaction, the E variable has its own main effect, but the covariates X_3 and X_4 do not. Scenario (c): Truth only has interactions. In this simulation, the covariates involved in interactions have no corresponding main effects.

Non-linearity simulation scenario
Truth is linear. sail is designed to model non-linearity; here we assess its performance when the true model is completely linear.

Interactions simulation scenario
Truth only has main effects. sail is designed to capture interactions; here we assess its performance when there are none in the true model. The true component functions are the same as in Lin and Zhang (2006) and Huang et al. (2010), and are given by f_1(t) = 5t, f_2(t) = 3(2t − 1)^2, f_3(t) = 4 sin(2πt)/(2 − sin(2πt)), and f_4(t) = 6(0.1 sin(2πt) + 0.2 cos(2πt) + 0.3 sin(2πt)^2 + 0.4 cos(2πt)^3 + 0.5 sin(2πt)^3). We set β_E = 2 and draw ε from a normal distribution with variance chosen such that the signal-to-noise ratio is 2. Using this setup, we generated 200 replications, each consisting of a training set of n = 200, a validation set of n = 200 and a test set of n = 800. The training set was used to fit the model, and the validation set to select the optimal tuning parameter corresponding to the minimum prediction mean squared error (MSE). Variable selection metrics, including the true positive rate (TPR), false positive rate (FPR) and number of active variables (variables with a non-zero coefficient estimate), were assessed on the training set, and MSE was assessed on the test set.
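The four component functions transcribe directly; a sketch for readers who want to reproduce the simulation signal:

```python
import numpy as np

def f1(t): return 5.0 * t
def f2(t): return 3.0 * (2.0 * t - 1.0) ** 2
def f3(t): return 4.0 * np.sin(2 * np.pi * t) / (2.0 - np.sin(2 * np.pi * t))
def f4(t): return 6.0 * (0.1 * np.sin(2 * np.pi * t)
                         + 0.2 * np.cos(2 * np.pi * t)
                         + 0.3 * np.sin(2 * np.pi * t) ** 2
                         + 0.4 * np.cos(2 * np.pi * t) ** 3
                         + 0.5 * np.sin(2 * np.pi * t) ** 3)
```

Note that f_1 is linear, f_2 quadratic, and f_3, f_4 periodic, so the scenarios mix effects a spline basis can and cannot represent exactly.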

Results
The prediction accuracy and variable selection results for each of the five simulation scenarios are shown in Fig. 2 and Table 2, respectively. We see that sail, adaptive sail and sail weak have the best performance in terms of both MSE and yielding correct sparse models when the truth follows strong heredity (scenario 1a), as we would expect, since this is exactly the scenario our method targets. Our method is also competitive when only main effects are present (scenario 3) and performs just as well as methods that only consider linear and non-linear main effects (HierBasis, SPAM), owing to the penalization applied to the interaction parameters. Because the heredity property is violated in scenario 1c, no method can identify the correct model, with the exception of GLinternet. When only linear effects and interactions are present (scenario 2), we see that adaptive sail has similar MSE to the other linear interaction methods (lassoBT and GLinternet), with better TPR and FPR. It is important to note that the variable selection performance of sail is highly dependent on correctly selecting the exposure variable (X_E). In Supplemental Section C, we show the selection rates of X_E; sail consistently identifies the exposure variable across all simulation scenarios and replications. Overall, our simulation study suggests that sail outperforms existing methods when the true model contains non-linear interactions, and is competitive even when the truth has only linear effects or only additive main effects.

Gene-environment interactions in the nurse family partnership program
It is well known that environmental exposures can have an important impact on academic achievement. Indeed, early intervention in young children has been shown to positively impact intellectual abilities (Campbell and Ramey, 1994). More recent studies have shown that cognitive performance, a trait that measures the ability to learn, reason and solve problems, is also strongly influenced by genetic factors. Genome-wide association studies (GWAS) suggest that 20% of the variance in educational attainment (years of education) may be accounted for by common genetic variation (Rietveld et al., 2013; Okbay et al., 2016). Unsurprisingly, there is significant overlap between the SNPs that predict educational attainment and measures of cognitive function. An interesting question that arises is how the environment interacts with these genetic variants to predict measures of cognitive function. To address this question, we analyzed data from the Nurse Family Partnership (NFP), a psychosocial intervention program that begins in pregnancy and targets maternal health, parenting and mother-infant interactions (Olds et al., 1998). Stanford Binet IQ scores at 4 years of age were collected for 189 subjects (including 19 imputed using mice (Buuren and Groothuis-Oudshoorn, 2010)) born to women randomly assigned to control (n = 100) or nurse-visited intervention groups (n = 89). For each subject, we calculated a polygenic risk score (PRS) for educational attainment at nine different p-value thresholds, using weights from the GWAS conducted in Okbay et al. (2016). In this context, individuals with a higher PRS have a propensity for higher educational attainment. The goal of this analysis was to determine whether there is an interaction between genetic predisposition to educational attainment (X) and maternal participation in the NFP program (E) on IQ at 4 years of age (Y). We applied the weak heredity sail with cubic B-splines and α = 0.1 to encourage interactions, and selected the optimal tuning parameter using 10-fold cross-validation. This resulted in a total of 55 parameters to estimate. Our method identified an interaction between the intervention and the PRS which included genetic variants at the 0.0001 level of significance. This interaction is shown in Fig. 3. We see that the intervention has a much larger effect on IQ at lower PRS values than at higher values. In other words, perinatal home visitation by nurses can impact IQ scores in children who are genetically predisposed to lower educational attainment. Similar results were obtained for the other imputed datasets (Supplemental Section D). We also compared sail with two other interaction selection methods, lassoBT and GLinternet with default settings, on 200 bootstrap samples of the data. The average and standard deviation of the MSE and the size of the active set across the 200 bootstrap samples are given in Table 3. We see that sail tends to select sparser models while maintaining prediction performance similar to lassoBT. The GLinternet statistics are omitted, since the algorithm did not converge for many of the 200 bootstrap samples.

Fig. 3. Estimated interaction effect identified by the weak heredity sail using cubic B-splines and α = 0.1 for the Nurse Family Partnership data. The selected model, chosen via 10-fold cross-validation, contained three variables: the main effects for the intervention and the PRS for educational attainment using genetic variants significant at the 0.0001 level, as well as their interaction.

Study to understand prognoses preferences outcomes and risks of treatment
The Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) aimed to identify which clinical variables influence medium-term (six-month) mortality among seriously ill hospitalized patients and to improve clinical decision making (Connors et al., 1995). With a relatively large sample size of 9,105 and detailed documentation of clinical variables, the SUPPORT dataset allows detection of potential interactions using the strategy implemented in sail. We applied sail to test for non-linear interactions between acute renal failure or multiple organ system failure (ARF/MOSF), an important predictor of survival, and 13 other variables that were deemed clinically relevant. These variables included the number of comorbidities (excluding ARF/MOSF), age, sex, and multiple physiological and blood biochemical indices. The response was whether a patient survived six months after hospitalization.
A total of 8,873 samples had complete data on all variables of interest. We randomly divided these samples into equal-sized training/validation/test splits and ran lassoBT, GLinternet, and the weak heredity sail with cubic B-splines and α = 0.1 (as was done in the Nurse Family Partnership case study). A binomial family was specified for GLinternet, whereas lassoBT retained the same default settings as in the simulation study since it does not provide a specialized implementation for binary outcomes. We again ran each method on the training data, determined the optimal tuning parameter on the validation data based on the area under the receiver operating characteristic curve (AUC), and assessed AUC on the test data. We repeated this process 200 times and report the results in Table 3. We found that sail achieved prediction accuracy similar to lassoBT and GLinternet. However, the predictive performance of lassoBT and GLinternet relied on models that included many more variables. In Fig. 4, we visualize the two strongest interaction effects, associated with the number of comorbidities and age, respectively. For those having undergone ARF/MOSF, an increased number of comorbidities decreases their chance of survival, while there seems to be no such relationship for non-ARF/MOSF patients. The interaction between ARF/MOSF and age shows that the risk incurred by ARF/MOSF is most pronounced among patients between the ages of 70 and 80.
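The validation-based tuning step, picking the penalty level whose validation-set predictions maximize AUC, can be sketched as below. This is an illustrative re-implementation, not the code used in the study; the rank-based AUC shown here ignores ties in the scores, and both function names are ours.

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney) formulation; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def select_lambda(val_preds, y_val):
    """val_preds maps each candidate lambda to its validation-set scores;
    return the lambda with the highest validation AUC."""
    return max(val_preds, key=lambda lam: auc(y_val, val_preds[lam]))

# toy example: two candidate penalties, one clearly better on validation data
y_val = np.array([0, 1, 0, 1])
good = np.array([0.1, 0.9, 0.2, 0.8])   # ranks cases perfectly
bad = np.array([0.9, 0.1, 0.8, 0.2])    # ranks cases in reverse
best = select_lambda({0.1: good, 1.0: bad}, y_val)
```

The selected penalty would then be refit or evaluated on the held-out test split, as in the 200 repetitions reported in Table 3.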

Discussion
In this article we have introduced the sparse additive interaction learning model sail for detecting non-linear interactions with a key environmental or exposure variable in high-dimensional settings. Using a simple reparametrization, we are able to achieve either the weak or strong heredity property without using a complex penalty function. We developed a blockwise coordinate descent algorithm to solve the sail objective function for the least-squares loss. We further studied the asymptotic properties of our method and showed that, under certain conditions, it possesses the oracle property. All our algorithms have been implemented in a computationally efficient, well-documented and freely available R package on CRAN.
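The heredity idea behind the reparametrization can be illustrated numerically under the strong-heredity form, in which each interaction coefficient is a product involving both parent main effects, so it is automatically zero whenever either parent is zero. This is a simplified sketch of the principle, not the paper's full parametrization; the function name and toy values are ours.

```python
import numpy as np

def strong_heredity_interaction(gamma, beta_E, theta):
    # tau_j = gamma_j * beta_E * theta_j: the interaction coefficient can be
    # nonzero only when both the exposure main effect (beta_E) and the
    # main-effect coefficient for predictor j (theta_j) are nonzero.
    return gamma * beta_E * theta

theta = np.array([0.0, 1.5, -0.7])   # main effects for three predictors
gamma = np.array([2.0, 0.5, 1.0])    # interaction-specific parameters
tau = strong_heredity_interaction(gamma, beta_E=0.8, theta=theta)
# tau[0] is forced to zero because the first main effect is zero
```

Encoding heredity in the parametrization itself is what lets sail avoid a complex structured penalty: sparsity in the main effects propagates to the interactions for free.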
Furthermore, our method is flexible enough to handle any type of basis expansion, including the identity map, which allows for linear interactions. Our implementation allows the user to selectively apply the basis expansions to the predictors, allowing, for example, a combination of continuous and categorical predictors. An extensive simulation study shows that sail, adaptive sail and sail weak outperform existing penalized regression methods in terms of prediction accuracy, sensitivity and specificity when there are non-linear main effects only, as well as when there are interactions with an exposure variable. We then demonstrated the utility of our method for identifying non-linear interactions in both biological and epidemiological data.
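The role of the basis expansion, identity for a linear effect versus a spline basis for a non-linear one, can be sketched as follows. For simplicity this uses a cubic truncated-power basis as a stand-in for the cubic B-splines used in the paper; the function name is ours.

```python
import numpy as np

def expand(x, basis="identity", knots=()):
    """Basis expansion applied to a single predictor.

    'identity' keeps the predictor linear (so its interaction with the
    exposure is linear too); otherwise a cubic truncated-power basis is
    built, one column per polynomial term plus one per knot.
    """
    x = np.asarray(x, dtype=float)
    if basis == "identity":
        return x[:, None]                       # single linear column
    cols = [x, x**2, x**3] + [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 6)
linear = expand(x)                              # shape (6, 1)
smooth = expand(x, "spline", knots=(0.5,))      # shape (6, 4)
```

Mixing the two per predictor is what allows continuous variables to enter non-linearly while dummy-coded categorical variables stay linear.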
In the NFP program, we showed that individuals who are genetically predisposed to lower educational attainment are those who stand to benefit the most from the intervention. Analysis of the SUPPORT data revealed that, for those having undergone ARF/MOSF, an increased number of comorbidities decreased their chances of survival, while there seemed to be no such relationship for non-ARF/MOSF patients. In a bootstrap analysis of both datasets, we observed that sail tended to select sparser models while maintaining similar prediction performance compared to other interaction selection methods.
Our method does, however, have its limitations. sail can currently only handle interactions of the form X_E · f(X) or f(X_E) · X and does not allow for f(X, X_E), i.e., only one of the variables in the interaction can have a non-linear effect, and we do not consider the tensor product. The reparametrization leads to a non-convex optimization problem, which makes convergence rates difficult to assess, though we did not experience any major convergence issues in our simulations and real data analyses. The memory footprint can also be an issue, depending on the degree of the basis expansion and the number of variables. Furthermore, the functional form of the covariate effects is treated as known in our method. Being able to automatically select, for example, linear vs. non-linear components is currently an active area of research for main effects models (Haris et al., 2019). To our knowledge, our proposal is the first to allow for non-linear interactions with a key exposure variable under the weak or strong heredity property in high-dimensional settings. We also provide a first software implementation for these models.

Fig. 1. Top: Toy example solution paths for main effects and interactions. {X1_1, X1_2, X1_3, X1_4, X1_5} and {X2_1, X2_2, X2_3, X2_4, X2_5} are the five basis coefficients for X_1 and X_2, respectively. λ_1SE is the largest value of penalization for which the CV error is within one standard error of the minimizing value λ_min. Bottom: Estimated smooth functions for X_1 and the X_2 · E interaction by the sail method based on λ_min. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Illustration of estimated interaction effects identified by sail for the SUPPORT data. Median prediction curves (dark colors), based on 200 train/validate/test splits, represent the estimated marginal interaction effects. Coefficients estimated in each of the 200 splits were used to generate prediction curves forming a 90% confidence band (corresponding light colors).

Table 2
Mean (standard deviation) of the number of selected variables (|Ĵ|), true positive rate (TPR) and false positive rate (FPR) as percentages, from 200 replications for each of the five scenarios. |J| is the number of truly associated variables.
Boxplots of the test set mean squared error from 200 replications for each of the five simulation scenarios.

Table 3
Comparison of methods for selecting interactions using the Nurse Family Partnership program and SUPPORT datasets. Averages (standard deviations in parentheses) are based on 200 bootstrap samples. |Ĵ| is the number of variables selected by the method. GLinternet results are not reported for the NFP data since the algorithm did not converge in many of the bootstrap samples.