## Abstract

Modeling variances in data has been an important topic in many fields, including financial and neuroimaging analysis. We consider the problem of regressing covariance matrices on a vector of covariates collected from each observational unit. The main aim is to uncover the variation in the covariance matrices across units that is explained by the covariates. This paper introduces *Covariate Assisted Principal* (CAP) regression, an optimization-based method for identifying components predicted by (generalized) linear models of the covariates. We develop computationally efficient algorithms to jointly search for the projection directions and regression coefficients, and we establish their asymptotic properties. In extensive simulation studies, our method shows higher accuracy and robustness in coefficient estimation than competing methods. Applied to a resting-state functional magnetic resonance imaging study, our approach identifies human brain network changes associated with age and sex.

## 1 Introduction

Modeling variances is an important topic in the statistics and financial literature. In linear regression with heterogeneous errors, various (generalized) linear models have been proposed to model the error variances using the covariates, either directly or indirectly as a function of the mean (see, for example, Box and Cox (1964); Carroll et al. (1982); Smyth (1989); Cohen et al. (1993)). These models use separate regressions on the covariates to predict the mean of the response and a scalar variance parameter of the error. Usually, the goal is to improve the efficiency of estimating the mean regression model, while the variance regression model itself is of less interest.

Regression models for covariance matrices have been studied under various settings. For time series, “autoregressive conditionally heteroscedastic” (ARCH) models (Engle and Kroner, 1995) were developed to model temporal heteroscedasticity. Anderson (1973) proposed an asymptotically efficient estimator for a class of covariance matrices modeled as linear combinations of symmetric matrices. Chiu et al. (1996) proposed to model the elements of the logarithm of the covariance matrix as linear functions of the covariates. Pourahmadi (1999) considered another type of matrix decomposition, in which the covariates linearly predict the unconstrained elements of the Cholesky decomposition. However, this approach is not order invariant and requires that the matrix columns/rows follow a meaningful ordering. These matrix regression models usually require a large number of parameters to be estimated.

Several approaches were proposed to extend matrix outcome regression models to high dimensions. Hoff and Niu (2012) introduced a regression model, where the covariance matrix is a parsimonious quadratic function of the explanatory variables. Applying low-rank approximation techniques, Fox and Dunson (2015) generalized the framework to a nonparametric covariance regression model and enabled scaling up to high dimensions. In a recent paper, Zou et al. (2017) linked the matrix outcome to a linear combination of similarity matrices of covariates, and studied the asymptotic properties of various estimators under this model. These approaches again model the whole covariance matrix as outcomes, and thus the interpretation could be challenging for large matrices.

Closely related to covariance matrices, principal component analysis (PCA) and related methods are widely used to generate interpretable results for large dimensional data. These methods have been extended to model multiple covariance matrices. Flury (1984) and Flury (1988) introduced a class of models, called common principal components models, to uncover the shared covariance structures. Boik (2002) generalized these models using spectral decompositions. Hoff (2009) developed a Bayesian hierarchical model and estimation procedure to study the heterogeneity in both the eigenvectors and eigenvalues of covariance matrices. Assuming that the eigenvectors span the same subspace, Franks and Hoff (2016) extended this to the so-called high dimensional setting with large *p* and small *n*. It is unclear, however, how these methods can be extended to incorporate multiple covariates.

In the application area of neuroimaging analysis, PCA-type methods are becoming increasingly popular for modeling covariance matrices, partly because of their interpretability and computational capability for large and multilevel observations. The study of covariance matrices (or correlation matrices after standardization) of multiple brain regions is commonly known as functional connectivity analysis (Friston, 2011). Decomposing the covariance matrices into separate components enables identification of coherently active brain subnetworks (Poldrack et al., 2011), and usually only a few principal components are needed to explain the variation in neuroimaging data (Friston et al., 1993). As before, it is unclear how these methods can be extended to include multiple covariates.

Indeed, modeling covariate-related alterations in covariance matrices is an important topic in neuroimaging analysis, because changes in functional connectivity have been found to be associated with various demographic and clinical factors, such as age, gender, and cognitive behavioral functions including developmental and mental health capacities (Just et al., 2006; Wang et al., 2007; Luo et al., 2011; Mennes et al., 2012; Hafkemeijer et al., 2015; Park et al., 2016). A commonly implemented approach is to regress one matrix entry on the covariates and fit this model repeatedly for each matrix element (see, for example, Wang et al. (2007) and Lewis et al. (2009)). Though this approach has good interpretability and is scalable, it suffers from multiplicity issues because of the large number of regressions involved (*p*(*p* − 1)*/*2 regressions for *p* brain regions). Adapting the covariance regression model proposed in Hoff and Niu (2012), Seiler and Holmes (2017) introduced a simplified model to analyze a large and multilevel neuroimaging dataset.

In this paper, we propose a *Covariate Assisted Principal (CAP) regression* model for multiple covariance matrix outcomes. This model integrates the PCA principle with a generalized linear model of multiple covariates. Analogous to PCA, our model aims to identify linear projections to allow for interpretability of the covariance matrices, while being computationally feasible for large data. Unlike PCA, our method targets the projections that are associated with the covariates. This enables us to study the changes in covariance matrices associated with subject-specific factors, such as individual demographic or disease information.

This paper is organized as follows. In Section 2, we introduce the proposed CAP regression model. Section 3 presents the estimation and computation algorithms for identifying the proposed principal projection directions. We compare the performance of our methods with competing approaches through simulation studies in Section 4, and apply them to a real fMRI dataset in Section 5. Section 6 concludes with a summary and a discussion of future directions.

## 2 Model

For each *i* ∈ {1, …, *n*}, let **y**_{it} ∈ ℝ^{p}, *t* = 1, …, *T*_{i}, be independent and identically distributed random samples from a multivariate normal distribution with mean zero and covariance matrix Σ_{i}, where Σ_{i} may depend on explanatory variables **x**_{i} ∈ ℝ^{q−1}. In our application example, **y**_{it} is a sample of brain fMRI measurements of *p* regions, and **x**_{i} is a vector of covariates postulated to be related to the fMRI measurements, both collected from subject *i*. We assume that there exists a vector *γ* ∈ ℝ^{p} such that the projected variance *γ*^⊤Σ_{i}*γ* satisfies the following multiplicative heteroscedasticity model:

 log(*γ*^⊤Σ_{i}*γ*) = β_{0} + **x**_{i}^⊤*β*_{1},  (1)

where β_{0} and *β*_{1} are model coefficients. The log-linear model follows Harvey (1976), in which Σ_{i} is a scalar.

A toy example of this model (*p* = 2) is shown in Figure 1. The covariance matrices, represented by the contour plot ellipses, vary as the covariate *x* varies. On the first projection direction (PD1) with the largest variability, there is no variation under different *x* values. However, the variance in the second direction (PD2) decreases as *x* increases. Our proposed model (1) thus aims to identify the second projection direction. In other words, the objective is to discover the rotation such that the data variation in the new space can be best characterized by the explanatory variables.
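The toy setting is easy to simulate. Below is a minimal sketch, with an illustrative direction *γ* and coefficients (β_{0}, *β*_{1}) = (0, −1) chosen for this example (not values from the paper): projecting each subject's data onto *γ* and regressing the log sample variances on *x* recovers the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T = 2, 200, 500
# Hypothetical direction and coefficients, chosen for illustration only.
gamma = np.array([1.0, 1.0]) / np.sqrt(2.0)
gperp = np.array([1.0, -1.0]) / np.sqrt(2.0)
beta0, beta1 = 0.0, -1.0   # log variance along gamma: beta0 + beta1 * x

x = rng.binomial(1, 0.5, size=n).astype(float)
log_var = np.empty(n)
for i in range(n):
    # Variance exp(beta0 + beta1 * x_i) along gamma, constant 4 along gperp.
    Sigma = np.exp(beta0 + beta1 * x[i]) * np.outer(gamma, gamma) \
            + 4.0 * np.outer(gperp, gperp)
    y = rng.multivariate_normal(np.zeros(p), Sigma, size=T)
    log_var[i] = np.log(np.var(y @ gamma))

# Least-squares slope of the log sample variance on x recovers beta1.
b1_hat = np.polyfit(x, log_var, 1)[0]
print(round(b1_hat, 2))
```

The variance along the orthogonal direction is constant, mirroring PD1 in Figure 1, while the covariate only moves the variance along *γ*.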

Compared with existing methods, our proposed model has two main advantages. First, unlike the models of Hoff and Niu (2012) and Zou et al. (2017), which directly model Σ_{i} by linear combinations of symmetric matrices constructed from **x**_{i}, we assume a log-linear model for the variance component after rotation. The linear form allows easy interpretation of the regression coefficients and provides the modeling flexibility shared by other (generalized) linear models, such as interactions, while the projection enables computational scalability similar to PCA. The common principal component approach studied in Flury (1984) only allows the eigenvalues to vary across a group indicator, whereas our model (1) directly accommodates multiple covariates, including continuous ones; this enables studying covariate-related changes in covariances in our fMRI experiment. Second, our model relaxes the standard complete common principal component assumption imposed in Flury (1984) and Boik (2002): we assume only that there exists at least one projection direction satisfying model (1). This partial common diagonalization assumption is more realistic for data of higher dimensions.

## 3 Method

We propose to estimate the model parameters by maximizing the likelihood function under a quadratic constraint:

 minimize_{*β*, *γ*} ℓ(*β*, *γ*)  subject to  *γ*^⊤**H***γ* = 1,  (2)

where ℓ(*β*, *γ*) is the negative log-likelihood function (ignoring constants),

 ℓ(*β*, *γ*) = (1/2) Σ_{i=1}^{n} { *T*_{i}**x**_{i}^⊤*β* + exp(−**x**_{i}^⊤*β*) *γ*^⊤**S**_{i}*γ* },  **S**_{i} = Σ_{t=1}^{T_i} **y**_{it}**y**_{it}^⊤,

with **x**_{i} here including a leading one so that *β* = (β_{0}, *β*_{1}^⊤)^⊤ ∈ ℝ^{q}, and **H** is a positive definite matrix in ℝ^{p×p}. Without the constraint, ℓ(*β*, *γ*), for any fixed *β*, is minimized by *γ* = **0**. Thus the constraint is critical.

Two natural choices of **H** in the constraint are:

(C1) **H** = **I**, which is equivalent to a unit constraint under the *ℓ*_{2}-norm, i.e., *γ*^⊤*γ* = 1;

(C2) **H** = (1/*n*) Σ_{i=1}^{n} **S**_{i}/*T*_{i}, which is equivalent to a unit constraint with respect to the average sample covariance, i.e., *γ*^⊤**H***γ* = 1.

(C1) is inspired by standard PCA, and (C2) by common principal component analysis. We show in the following proposition that (C1) can lead to a less appealing solution in certain situations.

**Proposition 1.** *When* **H** = **I** *in the optimization problem* (2), *for any fixed β, the solution of γ is the eigenvector corresponding to the minimum eigenvalue of the matrix* Σ_{i=1}^{n} exp(−**x**_{i}^⊤*β*)**S**_{i}.

This matrix can be regarded as a normalization of the covariance matrices based on the explanatory variables. Thus, constraint (C1) yields the projection direction with the lowest normalized data variation. In Appendix Section D, we further discuss the properties of these two constraints using examples. We focus on constraint (C2) in this paper because, in most scenarios, the signal is not associated with the smallest eigenvalue.
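The closed form in the proposition is easy to check numerically. A minimal sketch, with arbitrary illustrative values for *β* and randomly generated cross-product matrices **S**_{i} (none taken from the paper): the bottom eigenvector of Σ_{i} exp(−**x**_{i}^⊤*β*)**S**_{i} attains a quadratic form no larger than that of any unit vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 50, 4, 100
X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n)])
beta = np.array([0.5, -0.8])   # arbitrary fixed beta, for illustration

# Random per-subject cross-product matrices S_i = Y_i' Y_i.
S = np.empty((n, p, p))
for i in range(n):
    Y = rng.standard_normal((T, p))
    S[i] = Y.T @ Y

# A(beta) = sum_i exp(-x_i' beta) S_i; under (C1) the optimal gamma is its
# eigenvector with the smallest eigenvalue.
A = np.einsum('i,ijk->jk', np.exp(-X @ beta), S)
eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
gamma_hat = eigvecs[:, 0]

# Sanity check: no random unit vector achieves a smaller quadratic form.
U = rng.standard_normal((2000, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)
best_random = min(u @ A @ u for u in U)
print(gamma_hat @ A @ gamma_hat <= best_random + 1e-8)
```

This is the Rayleigh-quotient characterization of the smallest eigenvalue, which is all the proposition asserts for fixed *β*.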

### 3.1 Algorithm

The optimization problem (2) is biconvex. We propose to solve it by block coordinate descent. For given *γ*, the update of *β* is obtained by the Newton–Raphson algorithm. For given *β*, solving for *γ* requires quadratic programming. Though generic quadratic programming packages could be used, we derive the explicit solution in Proposition A.1 in the supplementary material. The procedure is summarized in Algorithm 1 and works for any positive definite **H**. For robustness against local minima, we randomly choose a series of initial values and take the estimate with the lowest objective function value.

### 3.2 Extension for finding multiple projection directions

It is possible that more than one projection direction is associated with the covariates. We propose to find these directions sequentially, adapting the strategy of extracting principal components one at a time.

### Algorithm 1: A block coordinate descent algorithm for solving optimization problem (2).

**Input:**

- **Y**: a list of data where the *i*th element is a *T*_{i} × *p* data matrix;
- **X**: an *n* × *q* matrix of covariate variables with the first column of ones;
- *β*^{(0)}, *γ*^{(0)}: initial values.

**Output:** (*β̂*, *γ̂*).

Given (*β*^{(s)}, *γ*^{(s)}) from the *s*th step, for the (*s* + 1)th step:

(i) update *β* by a Newton–Raphson step on ℓ(·, *γ*^{(s)}),

 *β*^{(s+1)} = *β*^{(s)} − [∇_{β}^{2}ℓ(*β*^{(s)}, *γ*^{(s)})]^{−1} ∇_{β}ℓ(*β*^{(s)}, *γ*^{(s)});

(ii) update *γ* by solving

 minimize_{γ} *γ*^⊤{ Σ_{i} exp(−**x**_{i}^⊤*β*^{(s+1)}) **S**_{i} }*γ*  subject to  *γ*^⊤**H***γ* = 1,

where **H** = **I** under (C1) and **H** = (1/*n*)Σ_{i}**S**_{i}/*T*_{i} under (C2), using Proposition A.1.

Repeat steps (i)–(ii) until convergence.
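To make the update rules concrete, here is a compact sketch of the block coordinate descent under (C2), written against the negative log-likelihood of problem (2). It is an illustrative implementation, not the authors' code: the *γ*-step solves the constrained quadratic program by whitening with the Cholesky factor of **H** and taking the bottom eigenvector (a standard equivalent of the explicit solution referenced from Proposition A.1), and random restarts guard against local minima. The synthetic check at the end uses invented parameter values.

```python
import numpy as np

def neg_loglik(beta, gamma, S, T, X):
    # l(beta, gamma) = 0.5 * sum_i [ T_i x_i'beta + exp(-x_i'beta) gamma'S_i gamma ]
    q = np.einsum('j,ijk,k->i', gamma, S, gamma)
    return 0.5 * np.sum(T * (X @ beta) + np.exp(-X @ beta) * q)

def cap_fit(S, T, X, n_iter=50, n_init=10, seed=0):
    """Block coordinate descent for (2) under (C2), with random restarts.

    S: (n, p, p) cross-product matrices Y_i'Y_i;  T: (n,) sample sizes;
    X: (n, q) design matrix with a leading column of ones.
    """
    rng = np.random.default_rng(seed)
    n, p, _ = S.shape
    H = np.einsum('ijk,i->jk', S, 1.0 / T) / n   # average sample covariance
    Li = np.linalg.inv(np.linalg.cholesky(H))    # H = L L', Li = L^{-1}
    best = (np.inf, None, None)
    for _ in range(n_init):
        beta = 0.5 * rng.standard_normal(X.shape[1])
        for _ in range(n_iter):
            # gamma-step: min gamma'A gamma s.t. gamma'H gamma = 1,
            # solved by whitening and taking the bottom eigenvector.
            A = np.einsum('i,ijk->jk', np.exp(-X @ beta), S)
            gamma = Li.T @ np.linalg.eigh(Li @ A @ Li.T)[1][:, 0]
            # beta-step: one Newton-Raphson update of the negative log-likelihood.
            q = np.einsum('j,ijk,k->i', gamma, S, gamma)
            r = np.exp(-X @ beta) * q
            grad = 0.5 * X.T @ (T - r)
            hess = 0.5 * (X * r[:, None]).T @ X
            beta = beta - np.linalg.solve(hess, grad)
        val = neg_loglik(beta, gamma, S, T, X)
        if val < best[0]:
            best = (val, beta, gamma)
    return best[1], best[2]

# Synthetic check: one basis direction's variance follows the log-linear model.
rng = np.random.default_rng(2)
n, p, Tn = 150, 3, 200
X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n).astype(float)])
G = np.linalg.qr(rng.standard_normal((p, p)))[0]   # random orthogonal basis
S, T = np.empty((n, p, p)), np.full(n, float(Tn))
for i in range(n):
    lam = np.array([np.exp(X[i, 1]), 4.0, 0.25])   # true slope beta1 = 1
    Y = rng.multivariate_normal(np.zeros(p), (G * lam) @ G.T, size=Tn)
    S[i] = Y.T @ Y
beta_hat, gamma_hat = cap_fit(S, T, X)
print(round(beta_hat[1], 2))
```

The gradient and Hessian follow directly from differentiating ℓ in *β*; the best-of-restarts rule matches the robustness strategy described above.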

Suppose Γ^{(k−1)} = (*γ*^{(1)}, …, *γ*^{(k−1)}) contains the first (*k* − 1) components (for *k* ≥ 2), and let

 **Ỹ**_{i} = **Y**_{i}(**I**_{p} − Γ^{(k−1)}Γ^{(k−1)⊤}),

where **Y**_{i} = (**y**_{i1}, …, **y**_{iT_i})^{⊤} for *i* = 1, …, *n*. We cannot directly apply Algorithm 1 to **Ỹ**_{i} as in PCA algorithms, since **Ỹ**_{i} is not of full rank. We therefore introduce a rank-completion step. The whole procedure is summarized in Algorithm 2. In the algorithm, step (iii) completes the data to full rank by replacing the zero eigencomponents with nonzero positive eigenvalues, namely the exponential of the model intercepts of the corresponding directions. This step also guarantees that the covariance matrix of **Ỹ**_{i} has no repeated eigenvalues, a necessary condition for unique eigenvector identification.

Analogous to the PCA approach, step (iv) imposes an orthogonality constraint ensuring that the *k*th direction is orthogonal to the previous ones, which is equivalent to the following optimization problem, for *k* ≥ 2:

 minimize_{β^{(k)}, γ^{(k)}} ℓ(*β*^{(k)}, *γ*^{(k)})  subject to  *γ*^{(k)⊤}**H***γ*^{(k)} = 1 and Γ^{(k−1)⊤}*γ*^{(k)} = **0**.

For any fixed *β*^{(k)}, we derive an explicit formula for *γ*^{(k)}; see Section A.2 of the supplementary material. The proof is adapted from Rao (1964, 1973).

### 3.3 Choosing the number of projection directions

We propose a data-driven approach to choose the number of projection directions. Extending the common principal component model, Flury and Gautschi (1986) introduced a metric to quantify the “deviation from diagonality”. Suppose **A** is a positive definite symmetric matrix; the deviation from diagonality is defined as

 *ν*(**A**) = det(diag(**A**)) / det(**A**),

where diag(**A**) is a diagonal matrix with the same diagonal elements as **A**, and det(**A**) is the determinant of **A**. From Hadamard’s inequality, *ν*(**A**) ≥ 1, with equality if and only if **A** is a diagonal matrix.

### Algorithm 2: An algorithm for finding the *k*th projection direction under constraint (C2).

**Input:**

- **Y**: a list of data where the *i*th element is a *T*_{i} × *p* data matrix;
- **X**: an *n* × *q* matrix of covariate variables with the first column of ones;
- Γ^{(k−1)}: a *p* × (*k* − 1) matrix containing the first (*k* − 1) directions;
- **B**^{(k−1)}: a *q* × (*k* − 1) matrix containing the model coefficients of the first (*k* − 1) directions.

**Output:** (*β̂*^{(k)}, *γ̂*^{(k)}).

(i) For *i* = 1, …, *n*, let **Ỹ**_{i} = **Y**_{i}(**I**_{p} − Γ^{(k−1)}Γ^{(k−1)⊤}).

(ii) Apply the singular value decomposition (SVD), **Ỹ**_{i} = **U**_{i}**D**_{i}**V**_{i}^⊤.

(iii) Construct the completed data by keeping {*D*_{i1}, …, *D*_{i(p−(k−1))}}, the first (*p* − (*k* − 1)) diagonal elements of the singular value matrix **D**_{i}, and replacing the remaining (*k* − 1) zero components with positive values determined by exp(β_{10}), …, exp(β_{(k−1)0}), where β_{10}, …, β_{(k−1)0} are the intercepts of the first (*k* − 1) directions (the first row of **B**^{(k−1)}).

(iv) Treat the completed data as the new data, and apply Algorithm 1 under constraint (C2) with the additional orthogonality constraint Γ^{(k−1)⊤}*γ*^{(k)} = **0**.

To adapt this metric to our model, let Γ^{(k)} ∈ ℝ^{p×k} denote the matrix containing the first *k* projection directions. We define the average deviation from diagonality as

 DfD(Γ^{(k)}) = Π_{i=1}^{n} [*ν*(Γ^{(k)⊤}**S**_{i}Γ^{(k)})]^{T_{i}/T_{+}},  T_{+} = Σ_{i}*T*_{i},

which is the weighted geometric mean of each subject’s deviation from diagonality. As *k* increases, the requirement that Γ^{(k)⊤}**S**_{i}Γ^{(k)} be diagonal, as in Flury and Gautschi (1986), becomes more stringent. In practice, we can plot the average deviation from diagonality and choose the first few projection directions with DfD values close to one, or choose the number right before a sudden jump in the plot. See an example in Section E of the supplementary material.
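Both quantities are a few lines of NumPy. A sketch, with `dfd` implementing *ν*(·) and `avg_dfd` the *T*_{i}-weighted geometric mean, computed on the log scale for numerical stability:

```python
import numpy as np

def dfd(A):
    """Deviation from diagonality: nu(A) = det(diag(A)) / det(A), >= 1."""
    sign, logdet = np.linalg.slogdet(A)
    return float(np.exp(np.sum(np.log(np.diag(A))) - logdet))

def avg_dfd(Gamma, S_list, T_list):
    """Weighted geometric mean of nu(Gamma' S_i Gamma), weights T_i / sum_j T_j."""
    w = np.asarray(T_list, dtype=float)
    w /= w.sum()
    logs = np.array([np.log(dfd(Gamma.T @ S @ Gamma)) for S in S_list])
    return float(np.exp(w @ logs))

A = np.array([[1.0, 0.5], [0.5, 1.0]])
print(dfd(np.diag([1.0, 2.0, 3.0])))   # equals 1 (up to floating point): diagonal
print(dfd(A))                          # equals det(diag(A)) / det(A) = 1 / 0.75
```

For a diagonal matrix the metric is exactly one, and it grows as off-diagonal structure appears, matching Hadamard's inequality above.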

### 3.4 Analysis under a Common Principal Component Model

We need additional assumptions to perform the theoretical analysis. Following Flury (1986), we assume that the covariance matrices Σ_{1}, …, Σ_{n} can be diagonalized by the same orthogonal matrix. That is, there exists an orthogonal matrix Γ such that

 Σ_{i} = ΓΛ_{i}Γ^⊤, *i* = 1, …, *n*,

where Γ = (*γ*_{1}, …, *γ*_{p}) and Λ_{i} = diag{λ_{i1}, …, λ_{ip}}. Suppose the eigenvalues are ordered by their averages, λ̄_{1} ≥ ⋯ ≥ λ̄_{p} with λ̄_{j} = (1/*n*)Σ_{i}λ_{ij}, and let **S**_{i}/*T*_{i} denote the *i*th sample covariance matrix. Suppose Φ = (*ϕ*_{1}, …, *ϕ*_{p}) and Δ_{i} = diag{δ_{i1}, …, δ_{ip}} are the maximum likelihood estimators of Γ and Λ_{i} (*i* = 1, …, *n*), respectively, using the method proposed in Flury (1984). Flury (1986) showed that both are consistent estimators, and thus ΦΔ_{i}Φ^⊤ is a consistent estimator of Σ_{i}. Therefore, we have

 ΦΔ_{i}Φ^⊤ − Σ_{i} → **0** in probability, *i* = 1, …, *n*. (10)

Based on (10), we replace **S**_{i} by the consistent estimator *T*_{i}ΦΔ_{i}Φ^⊤ in our optimization problem (2). Since Φ is an orthonormal eigenbasis, *γ* can be represented as a linear combination of the columns of Φ, i.e., *γ* = Φ**a**, where **a** = (*a*_{1}, …, *a*_{p})^⊤. The optimization problem (2) is then reformulated as

 minimize_{*β*, **a**} (1/2) Σ_{i=1}^{n} { *T*_{i}**x**_{i}^⊤*β* + exp(−**x**_{i}^⊤*β*) *T*_{i} Σ_{j=1}^{p} *a*_{j}^{2}δ_{ij} }  subject to  **a**^⊤Φ^⊤**H**Φ**a** = 1, (11)

where **H** = **I** under (C1) and **H** = (1/*n*)Σ_{i}**S**_{i}/*T*_{i} under (C2). For given *β*, under constraint (C1) it is equivalent to solve

 minimize_{**a**} Σ_{j=1}^{p} *a*_{j}^{2} { Σ_{i=1}^{n} *T*_{i} exp(−**x**_{i}^⊤*β*) δ_{ij} }  subject to  **a**^⊤**a** = 1.

When the eigenvectors are ordered by the average eigenvalues, δ̄_{1} ≥ ⋯ ≥ δ̄_{p}, we have *γ̂* = Φ**â** = *ϕ*_{p}. Therefore, under constraint (C1), the method yields the common eigenvector with the lowest average eigenvalue. Now consider constraint (C2). Let **b** = (*b*_{1}, …, *b*_{p})^⊤ with *b*_{j} = *a*_{j}δ̄_{j}^{1/2}, where δ̄_{j} = (1/*n*)Σ_{i}δ_{ij}. Minimizing the objective function in (11) under constraint (C2) is equivalent to solving the following problem,

 minimize_{**b**} Σ_{j=1}^{p} *b*_{j}^{2} { Σ_{i=1}^{n} *T*_{i} exp(−**x**_{i}^⊤*β*) δ_{ij}/δ̄_{j} }  subject to  **b**^⊤**b** = 1.

Suppose the *k*th component satisfies the model assumption, i.e., δ_{ik} = exp(**x**_{i}^⊤*β*). With *T*_{i} = *T*, the minimizer of the above optimization problem is **b̂** = **e**_{k}, the *k*th standard basis vector, if

 (1/*n*) Σ_{i=1}^{n} δ_{ij}/δ_{ik} > δ̄_{j}/δ̄_{k}  for all *j* ≠ *k*.

Since δ_{ij} is a consistent estimator of λ_{ij} for *i* = 1, …, *n* and *j* = 1, …, *p*, we impose the following condition.

**Condition 1** (Eigenvalue condition). *Assume* Σ_{i} = ΓΛ_{i}Γ^⊤ *is the eigendecomposition of* Σ_{i}, *with* Γ = (*γ*_{1}, …, *γ*_{p}) *an orthogonal matrix and* Λ_{i} = diag{λ_{i1}, …, λ_{ip}} *a diagonal matrix, for i* = 1, …, *n*. *The eigenvalues are ordered as* λ̄_{1} ≥ ⋯ ≥ λ̄_{p}, *where* λ̄_{j} = (1/*n*)Σ_{i}λ_{ij}. *Suppose there exists k* ∈ {1, …, *p*} *such that* λ_{ik} = exp(**x**_{i}^⊤*β*) *satisfies the model assumption, and assume*

 (1/*n*) Σ_{i=1}^{n} λ_{ij}/λ_{ik} > λ̄_{j}/λ̄_{k},  *for all j* ≠ *k*.

Under this condition, we propose a min–max algorithm (Algorithm 3) to identify the common principal component whose eigenvalues fit the log-linear regression model (1) while explaining large variation in the data. We call it a min–max approach as it contains a minimization step (of the objective function) and a maximization step (of data variation). To acquire the first *k* (*k* ≥ 2) directions, we propose to order the *p*_{s} components that satisfy *ŝ*(*j*) = *j* in step (iv) by their average eigenvalues and return the first min{*k*, *p*_{s}} components. Thus this algorithm also provides an estimate of the number of components.

#### 3.4.1 Asymptotic properties

We first discuss the asymptotic property of the *β* estimator given the true *γ*. As *β* is estimated by maximizing the log-likelihood function, we have the following theorem.

**Theorem 1.** *Assume model* (1) *holds. Under the true γ, we have*

 √(Σ_{i}*T*_{i}) (*β̂* − *β*) → N(**0**, 2**Σ**_{x}^{−1}) *in distribution*,

*where* **Σ**_{x} = lim (Σ_{i}*T*_{i})^{−1} Σ_{i}*T*_{i}**x**_{i}**x**_{i}^⊤ *and β̂ is the maximum likelihood estimator when the true γ is known.*
**γ**is knownA common principal component based method for solving optimization problem (2) under constraint (C2).

**Input:**

**Y**: a list of data where the ith element is a *T _{i}* ×

*p*data matrix

**X**: a *n* × *q* matrix of covariate variables with the first column of ones

**Output:**

(i) Use Flury (1984) method to estimate Φ and Δ

_{i}(*i*= 1, …,*n*) for**Y**with*n*groups.(ii) For

*j*= 1, …,*p*, estimate*β*with*γ*=*ϕ*, denoted by_{j}*β*^{(j)}.(iii) For each

*j*, minimize the objective function with(iv) For those

*ŝ*(*j*) =*j*, maximize the variance with(v) Estimate

When *p* = 1 and *T*_{i} = 1 (for *i* = 1, …, *n*), our proposed model (1) degenerates to a multiplicative heteroscedastic regression model, and the asymptotic distribution of *β̂* in Theorem 1 coincides with that in Harvey (1976). We now establish the asymptotic theory when *γ* is estimated by the common principal component approach (Flury, 1984).

**Theorem 2.** *Assume* Σ_{i} = ΓΛ_{i}Γ^⊤, *where* Γ = (*γ*_{1}, …, *γ*_{p}) *is an orthogonal matrix and* Λ_{i} = diag{λ_{i1}, …, λ_{ip}} *with* λ_{ik} ≠ λ_{il} (*k* ≠ *l*) *for at least one i* ∈ {1, …, *n*}. *Suppose there exists k* ∈ {1, …, *p*} *such that for all i* ∈ {1, …, *n*}, λ_{ik} = exp(**x**_{i}^⊤*β*). *Let γ̂*_{k} *be the maximum likelihood estimator of γ*_{k} *in Flury (1984). Then, assuming that the assumptions in Theorem 1 are satisfied, the estimator β̂ from Algorithm 3 is a* √(Σ_{i}*T*_{i})*-consistent estimator of β.*

## 4 Simulation Study

In the simulation study, we generate data from a multivariate normal distribution with *p* = 5 and covariance Σ_{i} for sample *i*. We assume the covariance matrices satisfy the common diagonalization assumption, i.e., Σ_{i} = ΓΛ_{i}Γ^⊤, where Γ is an orthogonal matrix and Λ_{i} is a diagonal matrix with diagonal elements {λ_{i1}, …, λ_{ip}}. For the log-linear model, *X*_{i} (*i* = 1, …, *n*) is generated from a Bernoulli distribution with probability 0.5 of being one; thus *q* = 2 because of the additional intercept column. Two scenarios are tested: (i) the null case with *β*_{1} = 0, and (ii) the alternative case with the second and third eigenvalues satisfying the regression model. In the null case with *β*_{1} = 0, λ_{ij} is generated from a log-normal distribution with mean *β*_{0} and variance 0.5^{2}; in the alternative case with *β*_{1} ≠ 0, λ_{ij} = exp(**x**_{i}^⊤*β*) with **x**_{i} = (1, *X*_{i})^⊤. The simulation is repeated 200 times.

As demonstrated in Section 3, constraint (C1) yields the component with the lowest normalized data variation; thus, in this section, we only present the performance under constraint (C2). Under constraint (C2), for higher-order directions, enforcing the orthogonality constraint reduces the parameter search space and increases computational complexity. In this simulation study, we implement both cases, with and without the orthogonality constraint. We compare the following methods:

(1) our proposed block coordinate descent method (Algorithms 1 and 2) under constraint (C2) without the orthogonality constraint, denoted as CAP;

(2) our proposed block coordinate descent method (Algorithms 1 and 2) under constraint (C2) with the orthogonality constraint, denoted as CAP-OC;

(3) our method under the complete common principal component model (Algorithm 3) for finding the first *k* projection directions, denoted as CAP-C;

(4) a principal component analysis (PCA) based method, where we apply PCA to each subject and regress each of the first *k* eigenvalues on the covariates, denoted as PCA;

(5) a common principal component method, where we apply common PCA to all subjects using the method in Flury (1984) and regress each of the first *k* eigenvalues on the covariates, denoted as CPCA.

We first evaluate the performance under the null case, i.e., *β*_{1} = 0. The intercepts are set to *β*_{0} = (5, 4, 1, −1, −2)^⊤. We present the estimates of *β* from CAP and CAP-C over 200 simulations in Figure E.2 in the supplementary material. Our estimate of *β*_{1} is centered around zero, with an average of 0.01 under CAP and −0.01 under CAP-C, both much smaller than the corresponding standard errors.

Under the alternative scenario, we set the coefficient matrix *β* with the first row containing the intercept terms (*β*_{0}’s). Under this setting, the second and third eigencomponents of the covariance matrices follow the log-linear model (1), and the eigenvalue condition (Condition 1) is satisfied. Table 1 presents the estimates of the model coefficient *β*_{1} over 200 simulations with *n* = 100 and *T*_{i} = 100. Since the intercept term is not of primary interest, we do not report it here. For our proposed methods (CAP and CAP-C), the coverage probability is obtained both from the asymptotic variance in Theorem 1 (CP-A) and from 500 bootstrap samples (CP-B); for the PCA and CPCA approaches, only CP-B is reported. As the data are generated under the complete common principal component assumption, the CAP-C approach yields the estimate of *β* with the lowest bias. The estimate of *β* from CAP (or CAP-OC) for the first direction (the second eigencomponent) is very close to that from CAP-C, and the coverage probability from either the asymptotic variance or the bootstrap achieves the designated level (*α* = 0.05). For the second direction (the third eigencomponent), the estimate from CAP has slightly higher bias and the coverage probability falls below 0.95. The estimation bias of CAP-OC is higher, due to the orthogonality restriction. The higher bias of the proposed CAP approaches is possibly due to the data manipulation step in Algorithm 2. Both PCA and CPCA do not take the covariate information into account, and thus their first two direction estimates are associated with the *β* components corresponding to the largest two eigenvalues, even though the first *β* component is zero. The estimates of *γ* from CAP, CAP-OC and CAP-C are presented in Table E.1 of Section E in the supplementary material.

To further assess the finite-sample performance of our proposed CAP approach, we vary the number of subjects, *n* = 50, 100, 500, 1000, and the number of observations within subject, *T*_{i} = 50, 100, 500, 1000. Figure 2 shows the estimates, the coverage probability from the asymptotic variance, and the mean squared error (MSE) of the model coefficients for the first two directions. As both *n* and *T*_{i} increase, the estimates of *β*_{1} in both the first and second identified directions converge to the true values. The coverage probability of the first direction is always close to the designated level, and the coverage probability of the second direction converges to 0.95 as both *n* and *T*_{i} increase. The MSEs of both directions converge to zero. The simulation results demonstrate that our proposed method (CAP) can successfully recover the eigencomponents that possess multiplicative heteroscedasticity. Similar results for CAP-C are shown in Figure E.3 of Section E in the supplementary material.

## 5 Resting-state fMRI data example

We apply our proposed method to Human Connectome Project (HCP) resting-state fMRI (rs-fMRI) data. Our dataset includes *n* = 118 healthy young adults (39 aged 22–25 and 79 aged 26–30; 42 female and 76 male) from the most recent S1200 release. The sample size selected here is typical for fMRI studies. The rs-fMRI dataset was preprocessed following the minimal preprocessing pipeline in Glasser et al. (2013). Global signal regression was performed to address whole-brain fluctuations typically treated as nuisances (Murphy et al., 2009; Fox et al., 2009). The blood-oxygen-level dependent (BOLD) signals were extracted from *p* = 20 functional brain regions in the default mode network (DMN) (Power et al., 2011) and averaged over voxels within a 5 mm radius. The BOLD time series are temporally correlated, so we first calculate the effective sample size (ESS) defined by Kass et al. (1998), which infers the equivalent number of independent samples,

 ESS_{j} = *T* / (1 + 2 Σ_{k≥1} ρ_{j}(*k*)),

where ρ_{j}(*k*) is the lag-*k* autocorrelation of the data of the *j*th brain region, for *j* = 1, …, *p*, and *T* = 1200 is the number of time points. We subsample ESS(*p*) = 660 time points (demeaned and variance stabilized (Beckmann and Smith, 2004)) for analysis.
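As a sketch of this diagnostic (not the authors' code), the ESS of a single series can be estimated from its empirical autocorrelations. Here the infinite sum over lags is truncated at the first non-positive autocorrelation, one common heuristic, since Kass et al. (1998) leave the truncation rule open. An AR(1) series with strong positive autocorrelation should show a much smaller ESS than white noise of the same length:

```python
import numpy as np

def ess(ts):
    """ESS = T / (1 + 2 * sum_k rho(k)); the lag sum is truncated at the
    first non-positive empirical autocorrelation (a common heuristic)."""
    ts = np.asarray(ts, dtype=float) - np.mean(ts)
    T = ts.size
    acov = np.correlate(ts, ts, mode='full')[T - 1:] / T
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, T):
        if rho[k] <= 0:
            break
        s += rho[k]
    return T / (1.0 + 2.0 * s)

rng = np.random.default_rng(4)
T = 4000
white = rng.standard_normal(T)   # independent samples: ESS close to T
eps = rng.standard_normal(T)
ar1 = np.empty(T)                # AR(1) with phi = 0.9: strongly correlated
ar1[0] = eps[0]
for t in range(1, T):
    ar1[t] = 0.9 * ar1[t - 1] + eps[t]
print(round(ess(white)), round(ess(ar1)))
```

For an AR(1) process with coefficient φ the theoretical ESS is roughly *T*(1 − φ)/(1 + φ), so the correlated series loses most of its nominal sample size.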

It has been shown that there exists a sex discrepancy in functional connectivity in the DMN (Gong et al., 2011; Zhang et al., 2018). In this study, the individual demographic variables age and sex (both as categorical variables), together with their interaction, are considered as the explanatory variables. For age, the category 22–25 is the reference level, labeled Age1, and 26–30 is labeled Age2; for sex, sex = 1 for male and 0 for female.

Four methods are compared in this study: *(i)* element-wise correlation regression (Wang et al., 2007); *(ii)* the common principal component method, i.e., the CPCA method in Section 4; *(iii)* our CAP-C method; and *(iv)* our CAP method. The simulation study (Section 4) showed that the proposed CAP approach without the orthogonality constraint outperforms the approach with it (CAP-OC). In this real data analysis, we therefore employ the CAP approach and include a *post hoc* procedure to examine the orthogonality among the identified projection directions.

For the element-wise correlation regression, each off-diagonal element in the correlation matrix is Fisher *z*-transformed and multiple testing adjustment is performed following the Benjamini and Hochberg (1995) procedure to control the false discovery rate (FDR). None of the FDR corrected p-values are significant at level 0.05. See Figure F.1 and Figure F.2 in the supplementary material for the raw and FDR corrected *p*-values.

We present the estimated regression coefficients (together with 95% confidence intervals) of first ten common PCs from the CPCA approach in Figure F.3 in the supplementary material. From the figure, only the model coefficients of the fourth component (CPC4) are significant, indicating that not all of the top PCs are related to either age or sex. Under the same common PCA assumption, CAP-C directly discovers the PCs that are relevant to the covariates. Thus, the first component identified by CAP-C is CPC4. Figure F.4 shows the estimated model coefficients (and 95% confidence intervals from 500 bootstrap samples) of the top seven discovered PCs.

Our CAP approach discovers five projection directions, where the number five is chosen based on the average DfD (see Figure F.5 in Section F of the supplementary material), and the orthogonality of these five directions is verified in Figure F.10. Figure 3 exhibits the model coefficients (and 95% confidence intervals from 500 bootstrap samples) of the five projection directions. For each identified projection direction, at least one of the covariates is significant. Take D1 as an example, which presents a significant age effect. To interpret the loadings, Figure 4a shows the loading profile; six brain regions have loading magnitudes greater than 0.2 (see Figure 4b for the brain map). This suggests that the connectivity between these brain regions shows a significant difference in the comparison (see Figure F.9 in the supplementary material for the scatter plot). Additional loading plots and brain maps are available in Figures F.7 and F.8 of the supplementary material, and comparisons under different contrasts are shown in Figure F.6. To compare with CAP-C, Table F.1 in the supplementary material displays the similarity (ranging between −1 and 1, with 0 indicating orthogonality) of the projection directions to the common PCs from CAP-C. The two significant PCs identified by CAP-C, CPC4 and CPC18, have similarity greater than 0.6 with the projections D2 and D4 identified by CAP, respectively.

To study the reliability of our proposed methods, we apply the estimated projections to three other scanning sessions of resting-state fMRI data acquired from the same subjects. Figure F.11 in the supplementary material shows the estimated model coefficients and 95% bootstrap confidence intervals, and Figure F.12 presents the comparisons under different contrasts. The estimates and significance are very similar to the results in Figure 3, which supports the existence of differences between age groups and/or sexes within these five components (also known as brain subnetworks) of the DMN.

## 6 Discussion

In this study, we introduce a Covariate Assisted Principal regression model for multiple covariance matrix outcomes. Our approach identifies projection directions that are associated with the explanatory variables or covariates. Under certain regularity conditions, our proposed estimators are asymptotically consistent. In extensive simulation studies, our model shows high estimation accuracy. Applied to resting-state fMRI studies, our method avoids the massive number of hypothesis tests incurred by the element-wise regression approach.

One challenge in modeling covariance matrices directly is the positive definiteness constraint. Via projections, the study of a positive definite matrix is decomposed into modeling the eigenvalues in orthogonal spaces, which relaxes the constraint and preserves geometric interpretation. Existing spectral decomposition based methods rely on the assumption that there exists a common diagonalization of the covariance matrices. In practice, this can be unrealistic, especially when *p* is large. Researchers are often more interested in studying a subset of the components related to the covariates. Though CAP enables identification of a small set of components, the theoretical analysis is challenging without the complete common diagonalization condition of CPCA. One future direction is to relax these assumptions.

The current framework assumes the dimension of the data, *p*, is fixed and less than both the number of observations within a subject and the number of subjects. Another future direction is to extend the method to settings of large *p*, small *n*.

## Footnotes

September 13, 2018