Fast Sparse-Group Lasso Method for Multi-response Cox Model with Applications to UK Biobank

We propose a Sparse-Group regularized Cox regression method to efficiently analyze large-scale, ultrahigh-dimensional, multi-response survival data. Our method has three key components: (1) a Sparse-Group penalty that encourages the coefficients to have small and overlapping support; (2) a variable screening procedure that minimizes the frequency of disk access when the data do not fit in RAM; and (3) an accelerated proximal gradient method that optimizes the regularized partial-likelihood objective. To demonstrate the efficacy of our method, we implemented the proposed algorithm for human genetics data and applied it to 16 time-to-event phenotypes from the UK Biobank.

Under the Cox proportional hazards model, the hazard of an individual with features $X$ is $h(t \mid X) = h_0(t)\, e^{\beta^T X}$, so the survival function is
$$S(t \mid X) = \exp\Big(-\int_0^t h_0(s)\, e^{\beta^T X}\, ds\Big).$$
Here $h_0 : \mathbb{R}^+ \to \mathbb{R}^+$ is the baseline hazard function. In our applications we are interested in the association between genetic variants and the response, so the baseline hazard is a nuisance parameter. Fortunately, we can estimate the parameters $\beta$ without knowing the baseline hazard, and achieve almost the same estimation accuracy as when the baseline hazard is known, by maximizing the partial likelihood (Cox 1972):
$$L(\beta) = \prod_{i:\, O_i = 1} \frac{e^{\beta^T X_i}}{\sum_{j:\, T_j \geq T_i} e^{\beta^T X_j}},$$
where $T_i$ is the observed (event or censoring) time of individual $i$ and $O_i$ indicates whether the event was observed.
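As a concrete reference point, below is a minimal NumPy sketch of the negative log partial likelihood for a single response, assuming no tied event times; the function and variable names are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, T, O):
    """Negative log partial likelihood of a Cox model (no tie correction).

    beta : (d,) coefficient vector
    X    : (n, d) feature matrix
    T    : (n,) observed times (event or censoring)
    O    : (n,) event indicators; 1 = event observed, 0 = right-censored
    """
    eta = X @ beta                          # linear predictors beta^T X_i
    nll = 0.0
    for i in np.where(O == 1)[0]:
        at_risk = T >= T[i]                 # risk set at time T_i
        # subtract log( e^{eta_i} / sum_{j in risk set} e^{eta_j} )
        nll -= eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return nll
```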

The Lasso method (Tibshirani 1996) assumes that only a small subset of predictors is associated with the response; in other words, it assumes the model has a sparsity structure. This assumption makes the Lasso very effective for high-dimensional data. A sparse solution can be obtained by adding the penalty $\lambda \|\beta\|_1$ to the original objective function, for an appropriately chosen regularization parameter $\lambda > 0$ (usually selected through cross-validation).

Sparse-Group Lasso (Simon et al. 2013) is a variant of the Lasso. It assumes not only that many individual elements of $\beta$ are 0, but also that within some predefined groups of variables, the corresponding parameters are 0 simultaneously. For a group $g \subseteq \{1, 2, \cdots, d\}$ of variables whose coefficients we believe to be zero together, we add an additional penalty $\lambda_g \|\beta_g\|_2$.
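To make the penalty concrete, here is a minimal NumPy sketch of the generic Sparse-Group penalty described above; the function name, the group encoding, and the per-group weights `lam_g` are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def sparse_group_penalty(beta, groups, lam, lam_g):
    """Sparse-Group Lasso penalty: lam * ||beta||_1 + sum_g lam_g[g] * ||beta_g||_2.

    beta   : (d,) coefficient vector
    groups : list of integer index arrays, each a group g of variables
    lam    : weight of the elementwise l1 penalty
    lam_g  : list of group-level weights, one per group
    """
    l1_part = lam * np.sum(np.abs(beta))
    group_part = sum(w * np.linalg.norm(beta[g]) for w, g in zip(lam_g, groups))
    return l1_part + group_part
```

In the multi-response setting described below, each group is the row of coefficients of one variable across the $K$ responses, and the group weight is $\lambda_g = \alpha \lambda$.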
In this section we define the notation and the key model assumptions used in the subsequent sections. For an integer $n$, define $[n] = \{1, 2, \cdots, n\}$, and define $x_+ = \max\{x, 0\}$ for all $x \in \mathbb{R}$.

We analyze $K \geq 1$ time-to-event responses on $n$ individuals. For example, the responses could be the times from birth to the onset of $K$ different diseases. The observed data take the form
$$\big(X_i,\ T_i^1, O_i^1,\ \cdots,\ T_i^K, O_i^K\big), \quad i = 1, \cdots, n.$$
Here $X_i \in \mathbb{R}^d$ is the $i$th individual's feature vector, and we write the full feature matrix as $X = [X_1, X_2, \cdots, X_n]^T \in \mathbb{R}^{n \times d}$. For $k = 1, \cdots, K$, $O_i^k = 1$ indicates that $T_i^k$ is the true time of event $k$ for the $i$th individual, and $O_i^k = 0$ indicates that the true time of event $k$ is right-censored at $T_i^k$. We assume each response follows a Cox proportional hazards model:
$$h_k(t \mid X_i) = h_{0,k}(t)\, e^{\beta_k^T X_i}, \quad k = 1, \cdots, K,$$
where $h_{0,k} : \mathbb{R}^+ \to \mathbb{R}^+$ is the baseline hazard function of the $k$th response. Let $n_k = \sum_{i=1}^n O_i^k$ be the number of observed events of type $k$.
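For concreteness, the multi-response survival data can be stored as three arrays, sketched below with purely illustrative dimensions and randomly generated placeholder values.

```python
import numpy as np

# Hypothetical dimensions: n individuals, d variants, K time-to-event responses.
n, d, K = 1000, 5000, 16
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))          # feature matrix, rows are X_i
T = rng.exponential(10.0, size=(n, K))   # T[i, k]: event or censoring time for response k
O = rng.integers(0, 2, size=(n, K))      # O[i, k] = 1 if event k observed, 0 if censored

n_events = O.sum(axis=0)                  # n_k: number of observed events for each response k
```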

We assume not only that each $\beta_k$ is sparse for $k \in [K]$, but also that the $\beta_k$ have a small common support. That is, $\{j \in [d] : \beta_k^j \neq 0 \text{ for some } k \in [K]\}$ is a small set relative to $d$. In human genetics applications, the first assumption means that each response is associated with a small set of genetic variants, and the second implies that there is significant overlap among the genetic variants associated with the different responses. This belief translates into the following regularized partial-likelihood objective:
$$\min_{B \in \mathbb{R}^{d \times K}} \ \sum_{k=1}^K \frac{1}{n_k}\, \ell_k(\beta_k) + \lambda \sum_{j=1}^d \big(\|\beta^j\|_1 + \alpha \|\beta^j\|_2\big). \qquad (2)$$
Here $\ell_k$ is the negative log-partial-likelihood of the $k$th response, so the first term is the sum of the $K$ negative log-partial-likelihoods, each normalized by the number of observed events. The second term is the regularization: $\|\beta^j\|_1$ and $\|\beta^j\|_2$ are the 1-norm and 2-norm of the coefficients of the $j$th variable. That is, if we collect all the parameters into a matrix $B = [\beta_1, \beta_2, \cdots, \beta_K] \in \mathbb{R}^{d \times K}$, then $\beta^j$ is the $j$th row of $B$ and $\beta_k$ is the $k$th column. Note that when $\alpha = 0$ the objective decouples across the $\beta_k$, which can then be optimized separately. In our implementation we solve a slightly more general problem that contains (2) as a special case.

Let $g_j \in \mathbb{R}^K$ denote the partial derivative of the smooth part of (2) with respect to the coefficients of variable $j$:
$$g_j = \frac{\partial}{\partial \beta^j} \sum_{k=1}^K \frac{1}{n_k}\, \ell_k(\beta_k). \qquad (3)$$
For each $\alpha > 0$ and $v \in \mathbb{R}^K$, define
$$\|v\|_{\alpha*} := \sup\{u^T v : u \in \mathbb{R}^K,\ \|u\|_1 + \alpha \|u\|_2 \leq 1\}. \qquad (4)$$
We use the following result to obtain an optimality condition for the solutions $\beta_1, \cdots, \beta_K$; the proofs are given in the appendix.
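The following sketch assembles the objective (2) from a per-response negative log partial likelihood routine (such as the single-response sketch above). The function signature and names are our illustration of the formula, not the authors' code.

```python
import numpy as np

def multiresponse_objective(B, X, T, O, lam, alpha, neg_log_pl):
    """Regularized objective (2): sum_k (1/n_k) * negative log partial likelihood
    of response k, plus lam * sum_j (||B[j, :]||_1 + alpha * ||B[j, :]||_2).

    B          : (d, K) coefficient matrix; column k is beta_k, row j is beta^j
    neg_log_pl : callable(beta, X, T_k, O_k) -> negative log partial likelihood
    """
    d, K = B.shape
    smooth = 0.0
    for k in range(K):
        n_k = O[:, k].sum()   # number of observed events of type k (assumed > 0)
        smooth += neg_log_pl(B[:, k], X, T[:, k], O[:, k]) / n_k
    penalty = lam * np.sum(np.abs(B).sum(axis=1) + alpha * np.linalg.norm(B, axis=1))
    return smooth + penalty
```

When `alpha = 0` the penalty reduces to an elementwise $\ell_1$ penalty on $B$, and the objective decouples into $K$ separate Lasso-regularized Cox problems.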

Proposition 1. For any $\lambda > 0$, the gradient at the optimal solution $\hat{B}$ of (2) satisfies
$$\|g_j\|_{\alpha*} \leq \lambda \ \text{ for all } j \in [d], \text{ with equality whenever } \hat{\beta}^j \neq 0. \qquad (5)$$
This result motivates us to first fit a model (solving (2)) using a small number of variables whose gradients have the largest $\alpha*$-norm, assuming the coefficients of the remaining variables are all zero.

Then, to verify the validity of this assumption, we check that $\|g_j\|_{\alpha*} \leq \lambda$ for the variables assumed to have zero coefficients. We refer to this step as KKT checking. Note that, based on the definition (4), it is not obvious how to compute $\|v\|_{\alpha*}$. A more explicit characterization (proved in the appendix) is:
$$\|v\|_{\alpha*} \leq \lambda \quad \text{if and only if} \quad \|S_1(v; \lambda)\|_2 \leq \alpha \lambda,$$
where $S_1(\cdot; \lambda)$ denotes elementwise soft-thresholding at level $\lambda$.

Using the above, we can check whether $\|v\|_{\alpha*} \leq \lambda$ quite easily, and we can compute $\|v\|_{\alpha*}$ to any fixed precision by binary search in $O(K \log \|v\|_\infty)$ time (since $\|v\|_{\alpha*} \leq \|v\|_\infty$).
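Below is a minimal sketch of the KKT check and of the binary-search computation of the $\alpha*$-norm, using the soft-thresholding characterization above; the function names and tolerance are our own choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding S_1(v; t) = sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dual_norm_leq(v, lam, alpha):
    """Check ||v||_{alpha*} <= lam via the characterization ||S_1(v; lam)||_2 <= alpha * lam."""
    return np.linalg.norm(soft_threshold(v, lam)) <= alpha * lam

def dual_norm(v, alpha, tol=1e-10):
    """Compute ||v||_{alpha*} by binary search on [0, ||v||_inf]."""
    lo, hi = 0.0, np.max(np.abs(v))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dual_norm_leq(v, mid, alpha):
            hi = mid          # mid is an upper bound on the dual norm
        else:
            lo = mid
    return hi
```

For KKT checking only `dual_norm_leq(g_j, lam, alpha)` is needed; the full norm is useful for screening, where variables can be ranked by `dual_norm(g_j, alpha)`.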

Now we are ready to state the overall structure of our algorithm. Suppose valid solutions for $\lambda_1, \cdots, \lambda_l$ have been obtained; for the next regularization parameters we iterate between a variable screening step, a fitting step, and a KKT-checking step, as laid out in the pseudocode below.

The fitting step minimizes (2) restricted to the screened variables with an accelerated proximal gradient method. Each iteration consists of:

1. A gradient step that decreases the smooth part of the objective,
$$B \leftarrow B - t\, \nabla f(B),$$
where $f$ denotes the smooth part of (2) and $t$ is a step size determined by backtracking line search.

2. A proximal step that keeps the regularization term small: the proximal operator $\mathrm{prox}_t : \mathbb{R}^K \to \mathbb{R}^K$,
$$\mathrm{prox}_t(x) = \arg\min_{z \in \mathbb{R}^K}\ \tfrac{1}{2}\|z - x\|_2^2 + t\lambda\big(\|z\|_1 + \alpha\|z\|_2\big),$$
is applied to each row $\beta^j$ of the coefficient matrix. To simplify the notation we omit the dependence of $\mathrm{prox}_t(x)$ on $\lambda$ and $\alpha$. A simple calculation shows that
$$\mathrm{prox}_t(x) = S_2\big(S_1(x; t\lambda);\ t\alpha\lambda\big),$$
where $S_1(\cdot; t\lambda)$ is the elementwise soft-thresholding operator and $S_2(\cdot; t\alpha\lambda) : \mathbb{R}^K \to \mathbb{R}^K$ is the group soft-thresholding operator
$$S_2(z; c) = \Big(1 - \frac{c}{\|z\|_2}\Big)_+ z.$$
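Below is an illustrative NumPy sketch of the proximal operator and of a single accelerated proximal gradient step. The momentum schedule shown is the standard FISTA update and the backtracking line search is omitted; the exact details of the paper's implementation may differ.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding S_1(v; t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_soft_threshold(v, t):
    """Group soft-thresholding S_2(v; t) = (1 - t / ||v||_2)_+ * v."""
    nrm = np.linalg.norm(v)
    if nrm <= t:
        return np.zeros_like(v)
    return (1.0 - t / nrm) * v

def prox(B, t, lam, alpha):
    """Row-wise proximal operator of t * lam * sum_j (||B[j]||_1 + alpha * ||B[j]||_2)."""
    out = np.empty_like(B)
    for j in range(B.shape[0]):
        out[j] = group_soft_threshold(soft_threshold(B[j], t * lam), t * alpha * lam)
    return out

def accelerated_prox_step(B, B_prev, grad_fn, t, lam, alpha, w):
    """One Nesterov-accelerated proximal gradient step (FISTA-style sketch).

    grad_fn(M) should return the gradient of the smooth part at the (p, K) matrix M;
    w is the current momentum weight.
    """
    w_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * w * w))   # standard FISTA weight update
    M = B + ((w - 1.0) / w_next) * (B - B_prev)          # extrapolation (momentum) step
    B_new = prox(M - t * grad_fn(M), t, lam, alpha)      # gradient step, then proximal step
    return B_new, w_next
```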
We describe the optimization algorithm in pseudocode, including details of the Nesterov acceleration.

Algorithm 1 (regularization path with screening, fitting, and KKT checking):
Initialize the ever-active set $A^{(0)} = \emptyset$;
Construct the regularization parameters $\lambda_1, \cdots, \lambda_L$;
Initialize a short list of initial regularization parameters $\Lambda^{(0)} = \{\lambda_1, \cdots, \lambda_{L^{(0)}}\}$;
Initialize the parameters $\hat{B}$;
Repeat the following until solutions have been obtained for all of $\lambda_1, \cdots, \lambda_L$:
  Set $v$ to be the largest index such that the solutions for $\lambda = \lambda_1, \cdots, \lambda_v$ have been obtained;
  Screening: compute (or use the cached) full gradients (3) $g_1, \cdots, g_d$, evaluated at the most recently obtained solution; select the variables with the largest $\|g_j\|_{\alpha*}$, together with the ever-active set, to form the set $S^{(i)}$ used in the fitting step;
  Fitting: obtain parameter estimates for each $\lambda \in \Lambda^{(i)}$ by solving (2) using only the variables in $S^{(i)}$; the optimization algorithm is described in Algorithm 2; coefficients of variables not in $S^{(i)}$ are set to zero;
  Checking: compute the full gradients $g_1(\lambda), \cdots, g_d(\lambda)$ at the solutions obtained for the regularization parameters $\lambda \in \Lambda^{(i)}$; find the smallest $\lambda \in \Lambda^{(i)}$ such that the KKT condition (5) is satisfied.

Algorithm 2 (accelerated proximal gradient for the fitting step):
Let $S$ be the set of variables used to optimize (2); $\beta^j$ is assumed to be 0 for $j \notin S$, and $p = |S|$ is the number of variables allowed to be non-zero;
Set the line-search parameter $\eta > 1$;
Collect the parameters into a matrix $B \in \mathbb{R}^{p \times K}$ whose rows are $\beta^j$, $j \in S$, and write the smooth part of the objective as a function $f(B)$;
Initialize $B^{(0)}$ at a user-specified value, or set $B^{(0)} = 0$; set the iteration count $i = 0$, the initial step size $t = 1$, and the Nesterov weights $w_0 = w_1 = 1$;
while $B$ has not converged do
  Nesterov acceleration: form the extrapolated point from $B^{(i)}$ and $B^{(i-1)}$ using the current weights;
  Compute the gradients $g_j$ at the extrapolated point;
  Backtracking line search: shrink the step size $t$ by the factor $\eta$ until the line-search condition holds, then take the gradient step followed by the proximal step to obtain $B^{(i+1)}$;
  Check convergence based on the change in the objective value or in the parameters;
end
return $B^{(i)}$.

Overall, we find that multiSnpnet-Cox improves prediction across a large number of diseases (16 in the application in this study; Table 1). The sparse solutions for the 16 responses range from hundreds to tens of thousands of active variables (Figure 1). In addition, analogous to our proposal in Qian et al. (2020), there is an alternative way to obtain a low-rank coefficient profile for regression. Instead of pursuing a non-convex problem directly, we can follow a two-stage procedure: 1) solve a full-rank regression, and 2) compute an SVD of the resulting coefficient matrix and use its low-rank approximation as the final estimator. We referred to this as the lazy approach.

When the matrices needed for the proximal step do not fit in GPU memory, frequent communication between system memory and GPU memory can significantly slow down the algorithm.

In our Sparse-Group penalty, the groups are defined as the effects of the same predictor on all of the responses. In practice it is often useful to define more general groups (e.g., genetic variants known to affect a biological mechanism that is associated with some of the responses). While general grouping does not change our algorithm by much, it could introduce irregular memory access patterns, which could hurt performance. In future work, we hope to develop methods for this more general setting.

Finally, we freeze the parameters for a response when its coefficients start to overfit. This proves to be quite effective in preventing irrelevant variables from entering the model, which helps both the speed and the validation metrics of the other responses. However, this step is not mathematically well justified; in particular, the proximal operator changes whenever we constrain some parameters to stay at their previous values. We hope to resolve this issue in future work.

Table 1 provides the mapping of phenotype IDs to descriptions.

We show a slightly more general result from convex analysis. Let $f : \mathbb{R}^K \to \mathbb{R}$ be continuously differentiable, convex, and bounded from below. Let $\|\cdot\|$ be a norm on $\mathbb{R}^K$ and $\|\cdot\|_*$ its corresponding dual norm. Let $\lambda > 0$, and set
$$x^* := \arg\min_x\ f(x) + \lambda \|x\|. \qquad (6)$$
We will show that $\|\nabla f(x^*)\|_* \leq \lambda$, with equality whenever $x^* \neq 0$. It is clear that (6) is equivalent to the constrained optimization problem
$$\min_{x, y}\ f(x) + \lambda \|y\| \quad \text{such that } x = y.$$
The Lagrangian dual is
$$g(z) := \inf_{x, y} L(x, y, z) = \inf_y \big\{\lambda \|y\| + z^T y\big\} + \inf_x \big\{f(x) - z^T x\big\}.$$
Using the definition of the dual norm, when $\|z\|_* > \lambda$ the infimum of the first term above is $-\infty$, and when $\|z\|_* \leq \lambda$ it is 0. Therefore
$$g(z) = \begin{cases} \inf_x\ f(x) - z^T x, & \|z\|_* \leq \lambda, \\ -\infty, & \text{otherwise}, \end{cases}$$
so the dual solution $z^* := \arg\max_z g(z)$ must satisfy $\|z^*\|_* \leq \lambda$. Now we go back to the Lagrangian. Since the primal objective is convex and Slater's condition holds, the solution to the primal problem can be obtained by minimizing
$$L(x, y, z^*) = f(x) + \lambda \|y\| + z^{*T}(y - x).$$
Minimizing over $x$ implies that, at the optimal $x^*$, we must have $\nabla f(x^*) = z^*$, so
$$\|\nabla f(x^*)\|_* = \|z^*\|_* \leq \lambda,$$
and minimizing over $y$ (whose optimum is attained at $y = x^*$) gives
$$\lambda \|x^*\| + \nabla f(x^*)^T x^* = 0. \qquad (13)$$
If $x^* \neq 0$, then by Hölder's inequality
$$\lambda \|x^*\| = -\nabla f(x^*)^T x^* \leq \|\nabla f(x^*)\|_* \|x^*\| \leq \lambda \|x^*\|,$$
so we must have $\|\nabla f(x^*)\|_* = \lambda$. Proposition 1 is a direct consequence of this result.

We prove the characterization of the $\alpha*$-norm for $\lambda = 1$. The regularization term can be written as
$$\|u\|_1 + \alpha \|u\|_2 = \sup_{v \in B^*_\alpha} v^T u,$$
where $B^*_\alpha$ is the unit dual ball
$$B^*_\alpha = \{v_1 + v_2 : \|v_1\|_\infty \leq 1,\ \|v_2\|_2 \leq \alpha\}.$$
That is, $\|v\|_{\alpha*} \leq 1$ if and only if $v \in B^*_\alpha$, which by definition means that $v = v_1 + v_2$ for some $\|v_1\|_\infty \leq 1$, $\|v_2\|_2 \leq \alpha$. Equivalently, we must have (and it is sufficient to have)
$$\inf_{\|v_1\|_\infty \leq 1} \|v - v_1\|_2 \leq \alpha.$$
The infimum on the left-hand side is attained at $v_1 = v - S_1(v; 1)$, which proves the claim.