Discriminating the Influence of Correlated Factors from Multivariate Observations: the Back-to-Back Regression

Identifying causes solely from observations can be particularly challenging when i) potential factors are difficult to manipulate independently and ii) observations are multi-dimensional. To address this issue, we introduce "Back-to-Back" regression (B2B), a linear method designed to efficiently measure, from a set of correlated factors, those that most plausibly account for multidimensional observations. First, we prove the consistency of B2B, describe its links to other linear approaches, and show how it provides a robust, unbiased and interpretable scalar estimate for each factor. Second, we use a variety of simulated data to show that B2B outperforms least-squares regression and cross-decomposition techniques (e.g. canonical correlation analysis and partial least squares) at causal identification when the factors and the observations are partially collinear. Finally, we apply B2B to magnetoencephalography (MEG) recordings of 102 subjects performing a reading task, to test whether our method appropriately disentangles the respective contributions of word length and word frequency, two correlated factors known to cause early and late brain responses, respectively. The results show that these two factors are better disentangled with B2B than with other standard techniques.


Back-to-Back regression
We consider the measurement of a multivariate signal Y ∈ R^{n×d_y} (the dependent variables), generated from a set of putative causes X ∈ R^{n×d_x} (the independent variables), via some unknown linear apparatus F ∈ R^{d_x×d_y}. Not all the variables in X exert a causal influence on Y. By considering a square binary diagonal matrix of causal influences E ∈ D^{d_x×d_x}, we denote by XE the causal factors of Y. In summary, the problem can be formalized as:

    Y_i = (X_i E + N_i) F,    (1)

where Y_i are the observations, X_i the factors, E the cause selection, N_i the noise, and F the cause-effect mapping.

[Figure 1: Back-to-back regression identifies the subset of factors E_ii = 1 in X that influence some observations Y by 1) regressing from Y to X to obtain X̂, and 2) returning the diagonal of the regression coefficients Ĥ from X to X̂.]
Here, i indexes a given sample, and N_i is a sample-specific noise vector drawn from a centered distribution. While the variables X and N are independent of one another, we allow each of them to have any form of covariance structure. B2B consists of two successive regressions: a first regression from Y to X yields the estimate X̂ = YĜ, and a second regression from X to X̂ yields the coefficients Ĥ. The diagonal of the regression coefficients Ĥ, denoted by Ê = diag(Ĥ), is the desired estimate of the causal influence matrix E, as detailed in Appendix A.1.
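To make this generative model concrete, here is a minimal NumPy sketch of Equation (1); the dimensions and the choice of two causal factors are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 1000, 5, 20            # samples, factors, sensors (illustrative sizes)

X = rng.normal(size=(n, dx))       # putative causes
N = rng.normal(size=(n, dx))       # factor-level noise, independent of X
E = np.diag([1., 1., 0., 0., 0.])  # cause selection: only the first two factors are causal
F = rng.normal(size=(dx, dy))      # unknown linear cause-effect mapping

Y = (X @ E + N) @ F                # observations, following Equation (1)
```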

If using l2-regularized least squares [10, 26], B2B has a closed-form solution:

    Ĝ = (YᵀY + Λ_Y)⁻¹ YᵀX,    Ĥ = (XᵀX + Λ_X)⁻¹ XᵀYĜ,

where Λ_X and Λ_Y are two diagonal matrices of regularization parameters, useful to invert the covariance matrices of X and Y if these are ill-conditioned.

Performing two regressions over the same data sample can result in overfitting, as spurious correlations in the data absorbed by the first regression will be leveraged by the second one. To avoid this issue, we split our sample (X, Y) into two halves (X_1, Y_1) and (X_2, Y_2). The first regression is then performed using (X_1, Y_1), and the second regression using (X_2, Y_2). To compensate for the reduction in sample size caused by the split, B2B is repeated over many random splits, and the final estimate Ê of the causal influence matrix is the average of the estimates associated with each split [2]. To accelerate this ensembling procedure, we implemented the efficient leave-one-out cross-validation scheme detailed in [26], based on an eigendecomposition of the kernel matrix Σ_X = QVQᵀ, where Q contains the eigenvectors, V the eigenvalues, and λ the regularization: since (Σ_X + λI)⁻¹ = Q(V + λI)⁻¹Qᵀ, the regression coefficients G can be evaluated for every candidate λ from a single decomposition.
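The following is a minimal NumPy sketch of this split-and-ensemble procedure; a fixed scalar ridge penalty stands in for the leave-one-out scheme of [26], and the function and parameter names are ours.

```python
import numpy as np

def ridge(A, B, lam):
    """Closed-form ridge solution of argmin_W ||B - A W||^2 + lam ||W||^2."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)

def b2b(X, Y, lam=1.0, n_splits=100, seed=0):
    """Back-to-back regression: average of diag(H_hat) over random half-splits."""
    rng = np.random.default_rng(seed)
    n, dx = X.shape
    e_hat = np.zeros(dx)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        s1, s2 = perm[: n // 2], perm[n // 2:]
        G = ridge(Y[s1], X[s1], lam)       # 1) regress Y -> X on the first half
        H = ridge(X[s2], Y[s2] @ G, lam)   # 2) regress X -> X_hat = Y G on the second half
        e_hat += np.diag(H)
    return e_hat / n_splits
```

Applied to the simulated (X, Y) above, the first two entries of b2b(X, Y) should stand out clearly from the remaining three.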

We summarize the B2B procedure in Algorithm 1. The rest of this section provides a theoretical guarantee on the correctness of B2B.

[Algorithm 1: Back-to-back regression.]

Since EĤ = Ĥ, the rows of Ĥ corresponding to inactive features (E_ii = 0) are zero. Assuming, without loss of generality, that the active features in E are the first k features (k ∈ Z, k ∈ [0, d_x]), and rewriting X = (X_1, X_2) and N = (N_1, N_2) (with X_1 and N_1 containing the first k features), we have:

    diag_k(Ĥ) = diag((Σ_{X_1X_1} + Σ_{N_1N_1})⁻¹ Σ_{X_1X_1}),

where Σ_{AB} denotes the covariance of A and B. In the absence of noise, we have Σ_{N_1N_1} = 0, and so diag_k(Ĥ) = I and Ĥ = E. Therefore, we recover E from Ĥ.
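As a quick numerical illustration of this noiseless guarantee (our own sketch, using plain unregularized least squares rather than Algorithm 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dy = 2000, 4, 12
A = rng.normal(size=(dx, dx))
X = rng.normal(size=(n, dx)) @ A.T            # correlated factors, rows ~ N(0, AA^T)
E = np.diag([1., 1., 0., 0.])                 # first k=2 features are causal
F = rng.normal(size=(dx, dy))
Y = X @ E @ F                                 # noiseless observations (N = 0)

G = np.linalg.lstsq(Y, X, rcond=None)[0]      # regression 1: Y -> X
H = np.linalg.lstsq(X, Y @ G, rcond=None)[0]  # regression 2: X -> X_hat = Y G
print(np.round(np.diag(H), 3))                # ~ [1. 1. 0. 0.] = diag(E)
```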

In the presence of noise, the causal factors (E_ii = 1) correspond to the strictly positive elements of diag(Ĥ).

The methods to recover them are presented in Appendix A.

To evaluate B2B empirically, we generate synthetic datasets according to Equation 1, varying the number of causal factors n_c from 1 to d_x. Specifically, X ∈ R^{n×d_x} contains rows drawn from N(0, Σ_X); N ∈ R^{n×d_x} contains rows drawn from N(0, Σ_N); E ∈ R^{d_x×d_x} is a binary diagonal matrix containing n_c ones; Σ_X = AAᵀ, where A ∈ R^{d_x×d_x} contains entries drawn from N(0, σ²); Σ_N = BBᵀ, where B ∈ R^{d_x×d_x} contains entries drawn from N(0, σ²); and h ∈ R⁺ is a factor scaling the noise level.
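A minimal sketch of this simulation protocol; the default sizes, and the assumption that h multiplies the noise N, are ours.

```python
import numpy as np

def simulate(n=1000, dx=10, dy=30, n_c=3, h=0.5, sigma=1.0, seed=0):
    """Generate (X, Y, diag(E)) following Y = (XE + N)F, with correlated X and N."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=sigma, size=(dx, dx))
    B = rng.normal(scale=sigma, size=(dx, dx))
    X = rng.normal(size=(n, dx)) @ A.T      # rows drawn from N(0, AA^T)
    N = h * rng.normal(size=(n, dx)) @ B.T  # rows drawn from N(0, h^2 BB^T)
    E = np.diag((np.arange(dx) < n_c).astype(float))
    F = rng.normal(size=(dx, dy))
    Y = (X @ E + N) @ F
    return X, Y, np.diag(E)
```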

To simulate a wide range of experimental conditions, we sample 10 values in log-space for each of these parameters.

We compare B2B against four standard baselines. Forward regression consists of an l2-regularized "ridge" regression from the putative causes X to the observations Y:

    Ĝ_f = argmin_G ‖Y − XG‖² + λ‖G‖².

Backward regression consists of an l2-regularized "ridge" regression from Y to X:

    Ĥ_b = argmin_H ‖X − YH‖² + λ‖H‖².

CCA finds G_cca ∈ R^{d_z×d_y} and H_cca ∈ R^{d_z×d_x} such that X and Y are maximally correlated in a latent space Z:

    G_cca, H_cca = argmax_{G,H} corr(XHᵀ, YGᵀ).

PLS finds G_pls ∈ R^{d_z×d_y} and H_pls ∈ R^{d_z×d_x} such that X and Y are maximally covarying in a latent space Z:

    G_pls, H_pls = argmax_{G,H} cov(XHᵀ, YGᵀ).

We employ five-fold cross-validation to select the optimal number of components for CCA and PLS.

In most cases, E is not known, and AUC can thus not be estimated. To address this issue, we assess the ability of each model to reliably predict independent and identically distributed data from Y, given all of the X features versus all-but-one feature X_{-i} (i.e. "knocking out" X_i). This procedure results in two correlation metrics, R_full and R_knockout, whose difference ΔR_i indicates how much each X_i improves the prediction of Y; in our figures, ΔR is the average of the ΔR_i. A higher score means that, for prediction, the model relies on individual features rather than on combinations of features. A code sketch of this knock-out procedure is given below.

We provide in Appendix A.3 pseudo-code to assess feature importance for our algorithm as well as for the baselines. For the backward model, feature importance cannot be assessed, as the collinearity of X is never taken into account.

Figures 2 (bottom) and B.5 (right, in Appendix) show that our method outperforms these baselines.
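A minimal sketch of the knock-out importance ΔR_i, using a plain ridge forward model for concreteness; the solver and the column-wise correlation metric are our assumptions.

```python
import numpy as np

def ridge_fit(A, B, lam=1.0):
    """Closed-form ridge regression coefficients from A to B."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)

def mean_corr(Y_true, Y_pred):
    """Pearson correlation averaged over the columns of Y."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / (np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0))
    return r.mean()

def delta_r(X_tr, Y_tr, X_te, Y_te, i, lam=1.0):
    """R_full - R_knockout: how much feature X_i improves held-out prediction of Y."""
    r_full = mean_corr(Y_te, X_te @ ridge_fit(X_tr, Y_tr, lam))
    keep = [j for j in range(X_tr.shape[1]) if j != i]   # knock out feature i
    r_ko = mean_corr(Y_te, X_te[:, keep] @ ridge_fit(X_tr[:, keep], Y_tr, lam))
    return r_full - r_ko
```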

Next, we apply our method to brain-imaging data from an anonymized multimodal neuroimaging dataset recorded during a reading task. Twelve subjects were excluded from the analysis because of corrupted file headers.
Subjects were exposed to a rapid serial visual presentation of Dutch words. The word lists consisted of 120 sentences and scrambled lists of the same words. Each word was presented on the computer screen for 351 ms on average (min: 300 ms, max: 1400 ms). Successive words were separated by a 300 ms blank screen, and successive sentences were separated by an empty screen for a few (3-4) seconds.

We aim to identify the word features that cause a variation in brain responses. We consider four distinct but linearly correlated features. First, 'Word Length' refers to the total number of letters.

[Figure 3: A hundred subjects read approximately 2,700 words while their brain activity was recorded with MEG. Top. Average brain response to words (word onset at t=0 ms), as viewed from above the head (red = higher gradient of magnetic flux). Bottom. Each line represents a magnetometer, color-coded by its spatial position. Posterior responses, typical of primary visual cortex activity, peak around 100 ms after word onset and are followed by an anterior propagation of activity typical of semantic processing in the associative cortices.]

Word Length is expected to specifically cause a variation in the early evoked MEG responses (i.e. from 100 ms after stimulus onset) elicited by the retinotopically-tuned visual cortices.

We used the feature importance described in Algorithm 2 to assess the extent to which each feature X_i specifically improves the prediction of held-out Y data, using a five-fold cross-validation (with shuffled trials to homogenize the distributions between the training and testing splits). The effects of Word Length and Word Frequency rose from ≈ 100 ms and from ≈ 400 ms after word onset, respectively. Furthermore, Word Function improved the prediction performance of all models (all p < 0.0002), except for PLS (p = 0.7989).
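For concreteness, here is a sketch of how B2B can be run independently at each time sample of an epoched MEG array; the array shapes and the compact b2b helper (a condensed version of the earlier sketch) are our assumptions, not the paper's exact pipeline.

```python
import numpy as np

def b2b(X, Y, lam=1.0, n_splits=50, seed=0):
    """Split-ensembled back-to-back regression (condensed from the earlier sketch)."""
    rng = np.random.default_rng(seed)
    n, dx = X.shape
    ridge = lambda A, B: np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)
    e_hat = np.zeros(dx)
    for _ in range(n_splits):
        p = rng.permutation(n)
        s1, s2 = p[: n // 2], p[n // 2:]
        G = ridge(Y[s1], X[s1])                    # regression 1: Y -> X
        e_hat += np.diag(ridge(X[s2], Y[s2] @ G))  # regression 2: X -> X_hat
    return e_hat / n_splits

def b2b_over_time(X, meg):
    """X: (n_words, n_features) word features, e.g. length, frequency, function.
    meg: (n_words, n_sensors, n_times) epoched recordings.
    Returns (n_times, n_features) causal estimates, one B2B fit per time sample."""
    return np.stack([b2b(X, meg[:, :, t]) for t in range(meg.shape[2])])
```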

Overall, these results confirm that Word Length, Word Frequency and Word Function causally influence specific periods of the brain responses to words.

To assess which model would be most sensitive to these causal discoveries, we compared B2B to the other models across subjects (Figure 4, right). Together, these analyses help clarify what is being represented in brain responses, i.e. what feature causes specific brain activity.

Appendix A.1: Consistency of B2B

Since Ĝ is the least-squares estimator of X from Y:

    Ĝ = argmin_G ‖X − YG‖².    (2)

Replacing Y by its model definition Y = (XE + N)F, we have:

    Ĝ = argmin_G ‖X − (XE + N)FG‖².

Since N is centered and independent of X, we have:

    Ĝ = argmin_G ‖X − XEFG‖² + ‖NFG‖².

In the same way, for Ĥ, we have:

    Ĥ = argmin_H ‖XEFĜ − XH‖² + ‖NFĜ‖²,

a positive quantity whose first term reaches its minimum (zero) for H = EFĜ. Let us now prove that EFĜ = FĜ.

Let F† be the pseudo-inverse of F, and let Z = F†EFĜ; we then have FZ = FF†EFĜ. As E is a binary diagonal matrix, it is an orthogonal projection and therefore a contraction; evaluating the objective at Z thus yields a value no larger than its value at Ĝ. But since Ĝ = argmin_G ‖X − XEFG‖² + ‖NFG‖², we also have the reverse inequality, so the two objective values are equal. Summarizing the above, N being full rank, this yields EFĜ = FĜ.

Equation 1 does not explicitly contain a measurement-noise term. Yet, in most experimental cases, the problem is best described as:

    Y = (XE + N)F + M,

where M denotes measurement noise. This equation is actually equivalent to Equation 1 given our hypotheses. Indeed, over Img(F), we can rewrite M = M′F (with M′ = MF⁻¹ on that subspace), which leads to:

    Y = (XE + N + M′)F.

Consequently, assuming that F is full rank on Img(XE), B2B yields the same solutions in both formulations.

For B2B, feature importance is assessed as follows:

[Algorithm 2: B2B feature importance.]

For the Forward Model, the feature importance is assessed as follows:

[Algorithm 3: Forward feature importance.]

For the CCA and PLS models, the feature importance is assessed as follows:
[Algorithm: CCA and PLS feature importance.]

In the presence of noise, B2B yields a non-binary Ê. Three thresholding rules can be used to binarize its values and thus explicitly recover the "causal" features.

First, given a known signal-to-noise ratio, the threshold above which a feature should be considered "causal" can be derived analytically. Indeed, Equation 9 implies that the k first diagonal elements of Ĥ are bounded:

    σ_{X_k} / (σ_{X_k} + σ_{N_1}) ≤ Ĥ_ii ≤ σ_{X_1} / (σ_{X_1} + σ_{N_k}),

where σ_{X_1}, σ_{X_k}, σ_{N_1} and σ_{N_k} denote the largest and smallest eigenvalues of Σ_{X_1X_1} and Σ_{N_1N_1}, respectively.

The average value µ of the non-zero coefficients of diag(Ĥ) is the trace of Ĥ divided by k, and can be computed as:

    µ = tr(Ĥ)/k = (1/k) tr((Σ_{X_1X_1} + Σ_{N_1N_1})⁻¹ Σ_{X_1X_1}).

The decision threshold between "causal" and "non-causal" elements is thus a fraction of µ, the proportion depending on the relative need to limit type I or type II errors. In practice, we cannot use this procedure for our MEG study, because the signal-to-noise ratio is unknown.
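Under the population identity above, µ can be computed directly from known covariances; a small sketch, where the function name and the one-half default fraction are our choices:

```python
import numpy as np

def analytic_threshold(Sigma_X1, Sigma_N1, fraction=0.5):
    """Return fraction * mu, where mu = tr((Sigma_X1 + Sigma_N1)^-1 Sigma_X1) / k
    is the average diagonal value of H_hat over the k causal features."""
    k = Sigma_X1.shape[0]
    H_kk = np.linalg.solve(Sigma_X1 + Sigma_N1, Sigma_X1)
    mu = np.trace(H_kk) / k
    return fraction * mu
```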

Second, diag(Ĥ) can be binarized with the Sonquist-Morgan criterion [20], a non-parametric clustering procedure separating the small and the large values of a given set. Over all possible splits of the diagonal into the p largest values and the d_x − p smallest values, this procedure maximizes the ratio of inter-group to intra-group variance. Let m_1 and m_0 be the average values of the two clusters, p and d_x − p their sizes, and v the total variance of the sample; the Sonquist-Morgan criterion maximizes [15]:

    p (d_x − p) (m_1 − m_0)² / (d_x² v).

This procedure assumes that there exists at least one causal and at least one non-causal feature.
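A small sketch of this split search, implementing the between-group variance criterion directly (the constant factor 1/(d_x² v), which does not affect the argmax, is omitted; the example values are illustrative):

```python
import numpy as np

def sonquist_morgan_split(values):
    """Binarize a 1-D array at the split maximizing between-group variance.
    Returns a boolean mask marking the 'large' (putatively causal) cluster."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v)[::-1]               # indices sorted by decreasing value
    d = len(v)
    best_p, best_score = 1, -np.inf
    for p in range(1, d):                     # at least one element in each cluster
        m1 = v[order[:p]].mean()              # mean of the p largest values
        m0 = v[order[p:]].mean()              # mean of the d - p smallest values
        score = p * (d - p) * (m1 - m0) ** 2  # between-group variance, up to constants
        if score > best_score:
            best_p, best_score = p, score
    mask = np.zeros(d, dtype=bool)
    mask[order[:best_p]] = True
    return mask

print(sonquist_morgan_split([0.02, 0.61, 0.55, 0.01, 0.04]))  # [False True True False False]
```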

Third, second-order statistics across multiple datasets can be used to identify the elements of Ê that are reliably positive (e.g. with a significance test across subjects, as done in our MEG study).

[Figure B.5: Model comparison when the models are tested on the four word features (top) and on these four features together with another 10 word-embedding features (bottom). These results illustrate that, unlike regularized CCA, B2B remains robust even when the number of tested factors increases.]