## Abstract

Brain signals recorded with electroencephalography (EEG), magnetoencephalography (MEG) and related techniques often have poor signal-to-noise ratio due to the presence of multiple competing sources and artifacts. A common remedy is to average over repeats of the same stimulus, but this is not applicable for temporally extended stimuli that are presented only once (speech, music, movies, natural sounds). An alternative is to average responses over multiple subjects presented with identical stimuli, but differences in the geometry of brain sources and sensors across subjects reduce the effectiveness of this solution. Multiway canonical correlation analysis (MCCA) offers a solution to this problem by allowing data from multiple subjects to be fused in such a way as to extract components common to all. This paper reviews the method, offers application examples that illustrate its effectiveness, and outlines the caveats and risks entailed by the method.

## 1 Introduction

Stimulus-driven signals recorded with electroencephalography (EEG), magnetoencephalography (MEG) and related techniques compete with much stronger sources within the brain, the body, and the environment. The signal of interest usually represents only a fraction of the signal power at the electrode or sensor. To overcome the noise and artifacts, a common practice is to present the same stimulus multiple times and average the responses over repeated presentations. Supposing that the response is the same for all presentations, and the noise is uncorrelated between presentations, the signal-to-noise power ratio (SNR) improves with the number of repeats. SNR can be further improved by combining signals across sensors, i.e. spatial filtering. Spatial filters can be optimized based on assumptions about signal and noise (de Cheveigné and Parra, 2014), and this combination of temporal averaging and spatial filtering can greatly improve the SNR. However, averaging and optimization are not applicable if the stimulus is presented only once, for example because it is too long to be repeated (e.g. a long sample of speech or music), or because one wishes to probe a phenomenon likely to fade with repetitions (e.g. surprise).

Instead of presenting the same stimulus multiple times to one subject, one can also present the same stimulus to multiple subjects just once. To the extent that different subjects’ brains are functionally similar, we expect similar responses (Hasson et al., 2004; Dmochowski et al., 2012; Lankinen et al., 2014). Unfortunately, the position or orientation of neural sources relative to sensors or electrodes is likely to differ across subjects, so averaging over subjects in sensor space is suboptimal. In order to compare between subjects, or average over subjects, we first need some way to transform the data of each to a common representation that is comparable across subjects. This can be accomplished with spatial filters that are tuned to each individual subject (e.g. Haxby et al., 2011; Lankinen et al., 2014).

Canonical Correlation Analysis (CCA) is a powerful technique to find linear components that are correlated between two data matrices (Hotelling, 1936). Given two matrices **X**_{1} and **X**_{2} of size *T* × *d*_{1} and *T* × *d*_{2}, CCA produces transform matrices **V**_{1} and **V**_{2} of sizes *d*_{1} × *d*_{0} and *d*_{2} × *d*_{0}, where *d*_{0} is at most equal to the smaller of *d*_{1} and *d*_{2}. The columns of **Y**_{1} = **X**_{1}**V**_{1} are of norm 1 and mutually uncorrelated, as are the columns of **Y**_{2} = **X**_{2}**V**_{2}, while, more importantly, corresponding columns from each (“canonical correlate pairs”) are maximally correlated. The first pair of canonical correlates (CC) defines the linear combinations of each data matrix with the *highest possible correlation* between them. The next pair of CCs defines the most highly correlated combination that is uncorrelated with the first pair, and so on. Applied to data from two subjects, CCA can find spatial filters that maximize the brain activity common to both, transforming both subjects’ data so that they can more easily be compared or averaged. However, CCA does not address the issue of comparing or merging responses across more than two subjects.

Extensions to connect multiple data matrices have been proposed under names such as *multiple CCA* (Gross and Tibshirani, 2015; Witten and Tibshirani, 2009), *multiway CCA* (Sturm, 2016; Zhang et al., 2011), *multiset CCA* (Takane et al., 2008; Correa et al., 2010b,a; Hwang et al., 2012; Lankinen et al., 2014; Zhang et al., 2017; Vía, Santamaría and Pérez, 2005; Li et al., 2009), or *generalized CCA* (Kiers et al., 1994; Afshin-Pour et al., 2012; Melzer et al., 2001; Tenenhaus, 2011; Tenenhaus et al., 2015; Velden, 2011; Fu et al., 2017). This diversity of names covers a diversity of formulations (Kettenring, 1971) that all share the aim of finding components that are similar across data matrices. Recent progress addresses regularization (Tenenhaus, 2011), sparsity (Fu et al., 2017; Tenenhaus et al., 2015), missing data (van de Velden and Takane, 2012), nonlinearity (Melzer et al., 2001), or deep learning (Benton et al., 2017). Using similar techniques, Independent Component Analysis (ICA) has been generalized under the name of group ICA (GICA) (Eichele et al., 2011; Calhoun and Adali, 2012; Huster et al., 2015; Huster and Raud, 2018).

CCA has been used extensively for brain data analysis and modality fusion (Sui et al., 2012; Dähne et al., 2015; Dmochowski et al., 2017), and several studies have applied multiway CCA (MCCA) and variants thereof to merge data across subjects (Correa et al., 2010b; Afshin-Pour et al., 2012, 2014; Lankinen et al., 2014, 2018; Zhang et al., 2017; Li et al., 2009; Hwang et al., 2012; Karhunen et al., 2013; Haxby et al., 2011; Sturm, 2016). This paper builds on those studies with the aim of better understanding the range of applicability of the tool, what it achieves, and what its caveats are. We describe a simple formulation of MCCA that is easy to understand and explain.

We show that MCCA can be applied effectively to multi-subject datasets of EEG or fMRI, both to *denoise* the data prior to further analyses, and to *summarize* the data and reveal traits common across the population of subjects. MCCA-based denoising yields significantly better scores in an auditory stimulus-response classification task, and MCCA-based joint analysis of fMRI data reveals detailed subject-specific activation topographies. The aims of this paper are (a) to provide an intuitive understanding of MCCA, (b) investigate ways in which it can be put to use, and (c) demonstrate its effectiveness for a range of common tasks in the analysis of brain data.

## 2 Methods

In this section we describe a simple formulation of MCCA, show how it can be applied to a variety of tasks, and give details of the real and synthetic data sets used by the examples reported in the Results.

### 2.1 Data analysis

#### Signal model

Assume a data set consisting of *N* data matrices, each comprised of a time series matrix **X**_{n} of dimensions *T* (time) × *d* (channels). These could represent EEG, MEG or fMRI data recorded from *N* different subjects in response to the same stimulus. They could also be data from multiple imaging modalities gathered from the same subject. Each matrix **X**_{n} consists of linear combinations of a set of sources **S** common to all data matrices, to which is added a “noise” matrix **N**_{n} of sources uncorrelated with **S**, and uncorrelated with the noise matrices **N**_{n′≠n} added to the other data matrices:

**X**_{n} = **S** **A**_{n} + **N**_{n},

where **A**_{n} is a mixing matrix specific to subject *n*. The sources **S** might represent brain sources or networks driven by the same stimulus similarly across different subjects. We are interested in finding these “shared sources” and suppressing the noise. Note that this model assumes that responses of different subjects share the same source *time course*, but not necessarily the same spatial pattern over channels. The assumption of uncorrelated noise is usually only approximately met, due to spurious correlations.
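As a concrete illustration, the signal model can be simulated in a few lines of numpy. All sizes below are arbitrary choices for the sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, N, n_shared = 1000, 10, 5, 2   # hypothetical sizes: samples, channels, subjects, sources

# Shared source time courses S (T x n_shared), identical for all subjects.
S = rng.standard_normal((T, n_shared))

X = []
for n in range(N):
    A_n = rng.standard_normal((n_shared, d))   # subject-specific mixing matrix
    N_n = rng.standard_normal((T, d))          # subject-specific noise, uncorrelated with S
    X.append(S @ A_n + N_n)                    # X_n = S A_n + N_n

print(len(X), X[0].shape)
```

Each subject sees the same source time courses **S**, but through a different mixing matrix, which is why averaging in sensor space is suboptimal.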

#### A simple CCA formulation

Consider two data matrices, **X**_{1} and **X**_{2} of size *T* × *d*, where *T* is time and *d* the number of channels. All data are assumed to have zero mean. Each matrix is spatially whitened by applying principal component analysis (PCA) and scaling each principal component (PC) to unit norm, to obtain whitened matrices **X̃**_{1} and **X̃**_{2}. Whitened data are then concatenated and submitted to a new PCA to obtain a matrix **Y** = [**X**_{1}, **X**_{2}]**V** of size *T* × 2*d*, where **V** combines the whitening and second PCA matrices (Fig. 1 left). The submatrices **V**_{1} and **V**_{2} formed of the first and last *d* rows of **V** define transforms applicable to each data matrix:

**Y**_{n} = **X**_{n}**V**_{n}, *n* = 1, 2,

with **Y** = **Y**_{1} + **Y**_{2} (Fig. 1 center).

The outcome of this analysis is equivalent to standard CCA, as explained in the Discussion, the first *d* columns of **Y**_{1} and **Y**_{2} forming canonical pairs (within a scaling factor). Indeed, rotating **X̃**_{1} and **X̃**_{2} to maximize the correlation of the resulting **Y**_{1} and **Y**_{2}, as required by the CCA objective, is equivalent to rotating them with the goal of maximizing the norm of their sum, **Y**_{1} + **Y**_{2}, as achieved by the second PCA (Fig. 1 right). The appeal of this formulation is that it is easily extendable to multiple data matrices.
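The equivalence can be checked numerically: with whitened (orthonormal) matrices, the variances found by the second PCA come in pairs 1 ± ρ_i, where the ρ_i are the canonical correlations of classical CCA (the singular values of the cross-product of the whitened matrices). A minimal sketch, with hypothetical names and sizes:

```python
import numpy as np

def whiten(X):
    """Whiten columns of X via PCA (economy SVD): returns an orthonormal
    basis of the column space (unit-norm, mutually uncorrelated components)."""
    U, s, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
    return U[:, s > 1e-10 * s[0]]

rng = np.random.default_rng(1)
T, d = 500, 4
X1, X2 = rng.standard_normal((T, d)), rng.standard_normal((T, d))

Q1, Q2 = whiten(X1), whiten(X2)
# Second PCA on the concatenated whitened data: variances are squared
# singular values of the concatenation.
lam = np.linalg.svd(np.concatenate([Q1, Q2], axis=1), compute_uv=False) ** 2

# Classical CCA: canonical correlations are the singular values of Q1' Q2.
rho = np.linalg.svd(Q1.T @ Q2, compute_uv=False)

# Equivalence: variances of the concatenation come in pairs 1 +/- rho_i.
assert np.allclose(np.sort(lam), np.sort(np.concatenate([1 - rho, 1 + rho])))
```

The pairing 1 ± ρ_i follows from the block structure of the covariance of the concatenated whitened data, whose off-diagonal block is Q1ᵀQ2.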

#### A simple MCCA formulation

Consider *N* data matrices **X**_{n}, each of size *T* × *d*, with zero mean. Each data matrix is spatially whitened by applying PCA and scaling all PCs to unit norm to obtain whitened matrices **X̃**_{n}. Whitened data are then concatenated along the component dimension and submitted to a second PCA to obtain a matrix **Y** = [**X**_{1} … **X**_{N}]**V** of size *T* × *D*, *D* = *Nd*, where **V** combines the whitening and second PCA matrices (Fig. 2 left). The submatrices **V**_{n} of **V**, of size *d* × *D*, formed by extracting successive *d*-row blocks of **V**, define transforms applicable to each data matrix:

**Y**_{n} = **X**_{n}**V**_{n},

with **Y** = ∑_{n} **Y**_{n} (Fig. 2, right). If data matrices have different numbers of channels *d*_{n}, then **V**_{n} has size *d*_{n} × *D*, where *D* = ∑_{n} *d*_{n}. We call the columns of **Y**_{n} *canonical correlates* (CCs) by analogy with CCA, and those of **Y** *summary components* (SCs). Each SC is a sum of CCs over data sets. Columns of **Y** are mutually orthogonal by virtue of the final PCA, but the same is not usually true of **Y**_{n}. With *D* > *d* columns, **Y**_{n} forms an *overcomplete basis* of the patterns spanned by **X**_{n}. This formulation of MCCA is equivalent to the SUMCORR formulation of Kettenring (1971), as explained in the Discussion (Parra, 2018). The appeal of this formulation is that it is conceptually and computationally straightforward. PCs can be discarded from the initial PCAs so as to control dimensionality and limit overfitting effects (next section).
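A minimal numpy sketch of this formulation (the function name, argument names, and sizes are ours, not from the paper; an economy SVD implements each PCA):

```python
import numpy as np

def mcca(X_list, n_keep=None):
    """Simple MCCA: per-matrix PCA whitening, concatenation, second PCA.

    X_list : list of (T, d_n) arrays, assumed zero-mean and full rank.
    n_keep : number of PCs retained per matrix in the first PCA (optional).
    Returns V_list (per-matrix transforms V_n) and Y (summary components).
    """
    whitened, transforms = [], []
    for X in X_list:
        U, s, Wt = np.linalg.svd(X, full_matrices=False)
        k = len(s) if n_keep is None else min(n_keep, len(s))
        whitened.append(U[:, :k])            # unit-norm, mutually uncorrelated
        transforms.append(Wt[:k].T / s[:k])  # matrix mapping X -> U[:, :k]
    # Second PCA on the concatenation of whitened matrices.
    Z = np.concatenate(whitened, axis=1)
    Uz, sz, Vzt = np.linalg.svd(Z, full_matrices=False)
    Y = Uz * sz                              # summary components
    # Split the second-PCA rotation into per-matrix blocks to form V_n.
    V_list, offset = [], 0
    for W, t in zip(whitened, transforms):
        k = W.shape[1]
        V_list.append(t @ Vzt.T[offset:offset + k])  # size d_n x D
        offset += k
    return V_list, Y

# Demo: a component shared by all matrices shows up as a high-variance SC.
rng = np.random.default_rng(0)
T, d, N = 1000, 8, 5
s = np.sin(np.linspace(0, 30, T))            # shared source time course
X_list = [np.outer(s, rng.standard_normal(d))
          + rng.standard_normal((T, d)) for _ in range(N)]
X_list = [x - x.mean(0) for x in X_list]
V_list, Y = mcca(X_list)
print(np.var(Y[:, 0]) / np.var(Y[:, 1]))     # first SC variance elevated
```

By construction, the sum of the per-matrix canonical correlates **X**_{n}**V**_{n} equals **Y**, and the variance of the first SC is elevated when a component is shared across all matrices.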

The variances of the summary components (the columns of **Y**) reflect the degree to which temporal patterns are shared between data matrices (Fig. 3), the variance of each SC corresponding to the degree of correlation of each shared dimension found in the data. If the data matrices **X**_{n} share no components, the variances of all SCs are one (Fig. 3 a). If a component is shared by all *N* data matrices, the variance of the first SC is *N* (Fig. 3 d). For data matrices with a small number of samples, spurious correlations may cause the variance profile to be skewed (Fig. 3 b). In real data, shared activity often shows up as components with variance elevated relative to this background (Fig. 3 c).

#### Reduced-rank MCCA

It is often convenient to reduce the rank of each data matrix by discarding PCs with smallest variance after the initial PCA, keeping *d̃* < *d* components per matrix. The MCCA transform matrices **V**_{n} are then of size *d* × *D̃*, and the CC and SC matrices of size *T* × *D̃*, with *D̃* = *Nd̃*. This serves as a form of regularization that avoids computational issues with rank-deficient data, reduces the risk of overfitting, and limits computation and memory requirements. Importantly, this approach preserves the constraint that the resulting SCs are uncorrelated (Parra et al., 2018).

#### Dealing with data matrices with more channels than samples

CCA fails if the data matrices have fewer samples than channels (*T* ≤ *d*), as is typically the case for fMRI or calcium imaging data for which there are many more voxels or pixels than observation samples (Asendorf, 2015). A simple solution is to replace each data matrix **X**_{n} (size *T* × *d*) by a matrix **Ũ**_{n} of size *T* × *T̃* (*T̃* < *T*) with columns that capture the principal temporal patterns spanned by **X**_{n}. This can be done by applying singular value decomposition (SVD) to express the data as **X**_{n} = **U**_{n}**Σ**_{n}**W**_{n}^{t} and setting **Ũ**_{n} to the first *T̃* columns of **U**_{n}. Since the **Ũ**_{n} have more samples than channels, there is no obstacle to applying MCCA to them. This sequence of operations can be represented by a set of transform matrices **V**_{n} of size *d* × *D̃*, with *D̃* = *NT̃*. Applying them to the data yields canonical correlate and summary matrices of size *T* × *D̃*. Using this approach, it is straightforward to apply MCCA to datasets with a large number of “channels” such as data from calcium imaging or fMRI. An alternative to SVD is to apply PCA to **X**_{n}^{t} and use a subset of the matrix of projection vectors to form **Ũ**_{n}, a useful option if **X**_{n} is too large to fit in memory (the required covariance matrix can be calculated in chunks).
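The SVD step can be sketched as follows, with sizes loosely modeled on the fMRI example described later (random data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, T_keep = 165, 6309, 10     # e.g. fMRI: far more voxels than samples

X = rng.standard_normal((T, d))  # one subject's data matrix (T x d)

# Replace X by its first T_keep left singular vectors: the principal
# temporal patterns spanned by X, now with more samples than "channels".
U, s, Wt = np.linalg.svd(X, full_matrices=False)
X_tilde = U[:, :T_keep]          # T x T_keep, orthonormal columns

assert X_tilde.shape == (T, T_keep)
# Columns are unit-norm and mutually uncorrelated ("white"), so the first
# whitening PCA of MCCA can be skipped for these matrices.
assert np.allclose(X_tilde.T @ X_tilde, np.eye(T_keep))
```

When `X` is too large for an in-memory SVD, the same subspace can be obtained from the eigendecomposition of the *T* × *T* covariance `X @ X.T`, accumulated in chunks over channels.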

### 2.2 Applications of MCCA

#### Quantifying correlation between *N* data matrices

The variance of each column of **Y** indicates the degree to which a component is shared across data matrices. The value is 1 if the data matrices are perfectly uncorrelated, and *N* if all data matrices include that component (Fig. 3). The profile of variances over SCs thus offers a measure of “sharedness” between data matrices (but see Caveats).

#### Summarizing a set of data matrices

The first few columns of **Y** represent temporal patterns that capture most of the correlation across data matrices **X**_{n}. They form a basis of the signal subspace that contains those shared patterns.

#### Denoising

Each data matrix **X**_{n} may be denoised by projecting it onto the overcomplete basis of CCs, selecting the first *K* components, and projecting back. We refer to this procedure as “denoising”, as it can be used to attenuate components that are least shared across subjects. The procedure can be summarized by a denoising matrix **D**_{n}, the product of the first *K* columns of **V**_{n} by the first *K* rows of its pseudoinverse. The denoised data are obtained as **X̂**_{n} = **X**_{n}**D**_{n}.
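Assuming a subject-specific MCCA transform has already been computed, the denoising matrix can be formed as follows (sizes and the number of retained components are arbitrary; a random matrix stands in for the transform):

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, D, K = 1000, 10, 50, 3       # hypothetical sizes; K components kept

V_n = rng.standard_normal((d, D))  # stand-in for a subject's MCCA transform
X_n = rng.standard_normal((T, d))  # stand-in for the subject's data

# Denoising matrix: first K columns of V_n times first K rows of pinv(V_n).
D_n = V_n[:, :K] @ np.linalg.pinv(V_n)[:K, :]
X_denoised = X_n @ D_n             # data mapped back to sensor space

assert X_denoised.shape == (T, d)
assert np.linalg.matrix_rank(D_n) == K   # image is a K-dimensional subspace
```

The denoised data remain in sensor space (same *T* × *d* shape), but components beyond the first *K* CCs, i.e. those least shared across subjects, are attenuated.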

#### Dimensionality reduction

Dimensionality reduction is often performed by applying PCA to a data matrix and truncating the PC series (Cunningham and Yu, 2014). However, this equates relevance to variance, which may not be appropriate because noise sources can have high variance and useful targets small variance. MCCA can be used to weight dimensions according to their *consistency across data matrices*, which may be a better criterion than variance.

#### Outlier detection

Temporally-local glitches and artifacts may interfere with data interpretation and analysis. Analysis algorithms based on least-squares are particularly sensitive to high-amplitude artifacts. MCCA can be used to derive a cross-subject ‘consensus’ response, so that individual subject’s data points that deviate greatly from the consensus can be flagged as outliers and excluded from analysis.
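A minimal sketch of this idea, using a median across subjects as the consensus and a hypothetical deviation threshold (not a procedure specified in this paper):

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 1000, 8
# Hypothetical per-subject component time courses after MCCA alignment (T x N).
y = (np.tile(np.sin(np.linspace(0, 20, T))[:, None], (1, N))
     + 0.1 * rng.standard_normal((T, N)))
y[200:210, 3] += 5.0                 # inject a glitch into subject 3's data

consensus = np.median(y, axis=1, keepdims=True)  # robust cross-subject consensus
dev = np.abs(y - consensus)                      # deviation from consensus
thresh = 4 * np.median(dev)                      # hypothetical threshold choice
outliers = dev > thresh                          # flagged samples per subject
print(outliers[:, 3].sum())
```

Flagged samples can then be excluded or down-weighted before least-squares analyses, which are otherwise dominated by high-amplitude artifacts.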

### 2.3 Details of the evaluation examples

The methods are evaluated using six datasets, including synthetic data, EEG, and fMRI.

#### Example 1 - sinusoidal target in separable noise

Synthetic data for this example consisted of 10 data matrices, each of dimensions 10000 samples × 10 channels. Each was obtained by multiplying 9 Gaussian noise signals (independent and uncorrelated) by a 9 × 10 mixing matrix with random coefficients. To this background of noise was added a “target” consisting of a sinusoidal time series (Fig. 4, left) multiplied by a 1 × 10 mixing matrix with random coefficients. The target was the same for all data matrices, but the mixing matrices differed, as did the noise sources. The SNR was set to 10^{−20}, i.e. a very unfavorable SNR, but because the noise is not of full rank the target and background are in principle linearly separable.

#### Example 2 - sinusoidal target in non-separable noise

Synthetic data for this example consisted of 10 matrices of dimensions 10000 samples × 10 channels, each obtained by multiplying 10 Gaussian noise sources (independent and uncorrelated) by a 10 × 10 mixing matrix with random coefficients. To this background was added a sinusoidal target as in the previous example, with SNR varied as a parameter. The noise here is full rank so the target and background are not linearly separable.

#### Example 3 - sinusoidal target in EEG noise

Data for this example used EEG to simulate realistic neural activity as background noise. EEG data were recorded during approximately 20 minutes from one subject in the absence of any task, from 40 electrodes (32 standard positions plus additional electrodes on forehead and temple) at 2048 Hz sampling rate with a BioSemi system. A robust polynomial detrending routine (de Cheveigné and Arzounian, 2018) was used to remove slow drifts. Ten “data matrices” were produced by selecting three-second intervals of EEG data with random offsets, removing their means, and adding a target consisting of 4 cycles of a 4 Hz sinusoid multiplied by a 1 × 40 mixing matrix with random coefficients, renewed for each data matrix. The SNR of the target was varied as a parameter.

#### Example 4 - EEG response to tones

Data for this example were borrowed from a study on auditory attention (Southwell et al., 2017). EEG data were recorded using a 64-channel EEG system in response to 120 repetitions of a 1 kHz tone pip with interstimulus interval (ISI) randomized between 750 and 1550 ms (recorded for the purpose of locating electrodes responsive to sound). Data from a subset of 10 subjects were detrended using a robust detrending routine, bad channels were interpolated using spherical interpolation (EEGLAB), and the data were filtered between 2-45 Hz. A peristimulus epoch of duration 1.2 s (starting 0.2 s prestimulus) was defined for each trial, and the corresponding data were extracted as a 3D matrix of dimensions time × channel × trial. For each channel, the 0.2 s prestimulus waveform was averaged over trials and subtracted from that channel’s waveform (“baseline correction”). After applying the first PCA (of the two-step MCCA) to each subject, the first 30 PCs were retained and the remainder discarded.

Two analyses were performed on these data to try to extract the cortical response to the 1 kHz tone from the background EEG noise. In the first, repetition over trials was exploited to design a spatial filter for each subject using the joint diagonalization algorithm (JD) that maximizes the ratio of trial-averaged variance to total variance (de Cheveigné and Simon, 2008; de Cheveigné and Parra, 2014). This resulted in a set of 10 analysis matrices of size 64 × 30, one for each subject. In the second analysis, MCCA was applied, using 30 PCs from each subject in the first PCA, resulting in 10 subject-specific analysis matrices of size 64 × 300.

For each subject, the first column of the JD analysis matrix defines the best linear combination of channels to maximize repeat-reliability across trials, while the first column of the MCCA analysis matrix defines the best linear combination of channels to maximize correlation with the other subjects.

#### Example 5 - EEG response to speech

Data for this example were taken from a study on auditory cortical responses to natural speech (Di Liberto et al., 2015). The same data were also used in a recent study on the application of CCA to speech/EEG decoding (de Cheveigné et al., 2018). We borrowed the data from the first study, and the decoding methods and evaluation metrics from the second, with the purpose of evaluating the benefit of introducing a denoising stage based on MCCA before the speech/EEG decoding stage.

In brief, EEG data were recorded from 8 subjects using a 128-channel BioSemi system with standard electrode layout, at 512 Hz sampling rate. Each subject listened to 32 speech excerpts, each of duration 155 s, from an audio book, presented diotically via headphones, for a total of approximately 1.4 hours. The database included both the audio stimuli and the EEG responses. Further details about the stimulus and recording are available in Di Liberto et al. (2015). The EEG were preprocessed (downsampling to 64 Hz, detrending, artifact removal), and the stimulus temporal envelope calculated as described in de Cheveigné et al. (2018).

A decoding model (de Cheveigné et al., 2018; Dmochowski et al., 2017) was evaluated according to several metrics: correlation, d-prime, and percent-correct classification scores for a match vs mismatch classification task. The classification task consisted in deciding whether a segment of EEG matched the segment of stimulus of same duration that produced it (match) or some unrelated segment (mismatch). The duration of the segment was varied as a parameter from 1 to 64 s.
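One common definition of the d-prime metric, sketched here on synthetic score distributions (the details of the metric actually used are given in de Cheveigné et al. (2018)):

```python
import numpy as np

def dprime(match, mismatch):
    """Sensitivity index between two score distributions:
    d' = (mean_m - mean_mm) / sqrt((var_m + var_mm) / 2)."""
    return (np.mean(match) - np.mean(mismatch)) / np.sqrt(
        (np.var(match) + np.var(mismatch)) / 2)

rng = np.random.default_rng(5)
# Hypothetical correlation scores for matched vs mismatched segments.
match = rng.normal(0.3, 0.1, 1000)
mismatch = rng.normal(0.0, 0.1, 1000)
print(dprime(match, mismatch))
```

Larger d-prime means the matched and mismatched score distributions are better separated, and thus that classification is easier.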

This task is related to that of determining which of two concurrent voices is the focus of a listener’s attention (cocktail party phenomenon) (Ding and Simon, 2012; Fuglsang et al., 2017; Lalor et al., 2009; Khalighinejad et al., 2017; Koskinen and Seppä, 2014; Martin et al., 2014; Mesgarani and Chang, 2012; Mirkovic et al., 2015; O’Sullivan et al., 2014; Tiitinen et al., 2012; Zion Golumbic et al., 2013), of potential use for the “cognitive control” of an external device such as a hearing aid. The decoding model used CCA to relate the stimulus to the EEG response, producing multiple stimulus-response CC pairs that were used for discrimination. Further details of the decoding model, classification task, and metrics can be found in de Cheveigné et al. (2018). Here, we are only interested in knowing if scores for single-source decoding are improved by introducing a stage of EEG denoising based on MCCA.

For this denoising, the EEG data of each subject were submitted to MCCA, keeping 40 PCs in the first PCA, resulting in a 128 × 320 analysis matrix for each subject. The first 110 columns of this matrix were multiplied by the first 110 rows of its pseudoinverse to yield a 128 × 128 subject-specific denoising matrix. This has the effect of attenuating activity that is *least* correlated with the other subjects.

#### Example 6 - fMRI response to natural sounds

Data for this example were taken from a study that measured fMRI responses to natural sounds (Norman-Haignere et al., 2015). Responses were gathered from 10 subjects to each of 165 sounds belonging to 11 categories including speech, music, animal vocalizations, and others. For each subject, the recording session was repeated two or three times. See Norman-Haignere et al. (2015) for further details. For the present analysis, data for each subject were averaged over repeats and organized as a matrix **X**_{n} of 165 sounds × 6309 voxels (voxels from both hemispheres were used, and voxels outside a subject-specific region of interest that included primary and secondary auditory cortex were set to zero). In this analysis we are interested in finding particular profiles of response over sounds (for example speech vs non-speech, or music vs non-music), and also the brain areas associated with such profiles in each subject.

As there are more “channels” (voxels) than samples (*T* < *d*), an SVD was used as described in the Methods, and the first 10 dimensions were used for MCCA. The columns of the resulting matrices **Ũ**_{n} are white, so the first PCA can be dispensed with. Matrices were concatenated and subjected to the second-step PCA of the MCCA algorithm, and the first 15 columns (an arbitrary number) of the SC matrix were selected as a basis spanning the profiles over sounds that were most similar across subjects.

To find profiles specific to particular sound categories (e.g. speech, music, etc.), Joint Decorrelation (de Cheveigné and Parra, 2014) was used to find a linear transform applicable to the 15-column basis to maximize the variance over the selected category, relative to the other categories. This can be seen as a rotation of the basis so as to isolate activity specific to processing of that sound category. This 165 × 1 activation profile was then cross-correlated with the 165 × 6309 matrix of fMRI response data of each subject to find the topography specific to that subject (Haufe et al., 2014).

## 3 Results

The MCCA method is evaluated first with synthetic data to get an understanding of its basic properties and capabilities, and then with real EEG and fMRI data to see whether these extend to situations of practical use.

### 3.1 Synthetic data

#### Example 1 - sinusoidal target in separable noise

The data consist of 10 matrices made up of a sinusoidal target (Fig. 4, left) common to all data matrices, with added noise distinct across matrices (see Methods). At the unfavorable SNR of 10^{−20} the target is not visible in the raw signal of any of the data matrices (Fig. 4 center), and it cannot be extracted by averaging because of the extremely low SNR and the fact that the mixing coefficients are of random sign. Since the data are separable (the rank of the noise is only 9), the target *can* be recovered by applying the appropriate demixing matrix (inverse of the mixing matrix), however that matrix is unknown.

MCCA applied to the dataset produced projection matrices **V**_{n} that recover the target from **X**_{n} (Fig. 4 right). This benefit is similar to that of methods that leverage multiple repetitions to blindly discover spatial filters to improve SNR (de Cheveigné and Simon, 2008; de Cheveigné and Parra, 2014), but instead of repetitions, MCCA leverages the fact that the same target is mixed into multiple data matrices. To summarize, MCCA can reveal a target common across data matrices despite an extremely unfavorable SNR.

#### Example 2 - sinusoidal target in non-separable noise

Data are the same as in the previous example, except that the noise is full rank (10 independent sources mixed in 10 channels) so the target is no longer linearly separable, and one cannot expect to recover the target perfectly, especially at extremely low SNRs. Nonetheless, at a moderately unfavorable SNR (10^{−2} in power) MCCA can recover an estimate of the target that is noisy (Fig. 5 center) but much cleaner than the raw data (not shown). Figure 5 (right) shows the proportion of residual noise in the signal recovered by MCCA as a function of SNR, together with the same proportion for the best raw channel. MCCA provides a clear benefit over a range of SNRs. Two factors can contribute to failure: non-separability per se, and the fact that the algorithm fails to find the ideal demixing matrix. Figure 5 (right) also shows the proportion of residual noise for the ideal demixing matrix (yellow). The MCCA-derived matrix performs only slightly less well than the ideal matrix. To summarize, MCCA is of use even if the data are not separable.

#### Example 3 - sinusoidal target in real EEG noise

EEG background noise differs from the white Gaussian noise that was used in the previous simulations in several ways: it usually has full rank (in particular because of electrode-specific noise), but the variance is unequally distributed across dimensions. It is also temporally structured, with strong temporal correlation and an overall low-pass spectrum. The first component recovered by MCCA is plotted in Fig. 6 (right) for several values of SNR. For SNRs of 0.1 or better the target is almost perfectly recovered. At SNR=0.03 the recovered waveform is somewhat noisy, and at SNR=0.01 or below the target is lost. For comparison Fig. 6 (left) shows the time course of a raw data channel (the channel that showed the largest correlation with the target). For SNR=10 the target waveform is obvious in the raw data, but for smaller values of SNR it is lost in the EEG noise. Comparing Fig. 6 left and right, there is a range of SNRs (roughly 0.03 to 1) for which MCCA provides a clear benefit. Below SNR=0.03 the algorithm switched to some other component within the data (Fig. 6 right, lowest trace) that happened to be similar across data matrices because of random correlations.

To summarize, MCCA is effective at extracting a weak target from within real EEG noise.

### 3.2 Real data

#### Example 4 - EEG response to tones

In this example, contrary to the previous one, the target is not known. However, since the data were collected in response to multiple repeats *and* for multiple subjects, we can apply two different methods (JD and MCCA) to isolate stimulus-evoked activity common to all subjects and compare the results. JD finds a linear transform that optimizes signal to noise ratio assuming that the signal repeats over trials. Figure 7 (top) shows the result of applying the JD analysis to the data of one subject. In the plot on the top left, the blue line shows the mean over repeats of the first component, and the gray band shows ±2 SD of a bootstrap resampling of this mean. On the top right is the topography associated with this component (computed as the map of cross-correlation coefficients between the component and each channel (Haufe et al., 2014)). MCCA can similarly be used to design a subject-specific spatial filter that improves SNR. The plots on the bottom of Figure 7 show the result of applying the subject-specific matrix derived from the MCCA analysis for the same subject. Despite the different criteria used by the two analyses (consistency over trials for JD, consistency between subjects for MCCA) the patterns are remarkably similar. To summarize, it appears that MCCA can exploit between-subject consistency to find a spatial filter that is as effective as that found by JD that exploits between-trial consistency. This is useful for data that do not involve repeated trials.

The subject-specific MCCA analysis matrices (**V**_{n}) transform each subject’s data (**X**_{n}) into CCs (**Y**_{n}) that are well correlated across subjects, so that it makes sense to average them across subjects and interpret the SCs (**Y**) as reflecting shared activity. Figure 8 top left shows the trial- and subject-averaged time course of the first SC, which can be interpreted as our best estimate of stimulus-evoked activity common to all subjects. It benefits from several stages of enhancement: (a) spatial filtering within each subject, (b) averaging over trials, and (c) averaging across subjects. Also shown in Fig. 8 are the ten subject-specific topographies associated with this component. Despite some differences, topographies are quite similar across most subjects except S1. The bottom left plot shows the maximum over electrodes of the correlation coefficient between the first SC and each electrode (trial-averaged). Correlation coefficients are relatively high except for Subject 1, whose EEG response did not match the other subjects.

#### Example 5 - EEG response to speech

For stimuli presented once only, one cannot use repetition to distinguish the brain response from the noise. Instead, systems identification techniques (Lalor et al., 2009; Holdgraf et al., 2017; Crosse et al., 2016) are used to fit an encoding model to estimate the part of brain response that is driven by the stimulus, using some representation of the stimulus (e.g. envelope or spectrogram) that can be linearly related to the brain signals. The part of the response that fits the model can be taken as the “true” response, and the rest discarded as noise. However, this partition is contingent on the validity of the stimulus representation and the quality of the model. With MCCA, a “ground truth” response can instead be estimated based on similarity of brain responses across subjects.

EEG was recorded in response to continuous speech (see Methods), and a model was fit to stimulus and response to capture their correlation (de Cheveigné et al., 2018; Dmochowski et al., 2017). The model used CCA to form pairs of maximally-correlated linear transforms of the audio stimulus features and of the EEG, respectively (audio-EEG CCs). Note that this usage of CCA is unrelated to our usage of MCCA to merge data across subjects. The quality of the model was evaluated using a match vs mismatch classification task (see Methods). We computed *correlation*, *d-prime* and *percent correct* classification scores to evaluate the benefit of inserting a stage of MCCA-based denoising within the EEG preprocessing pipeline.
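The CCA step at the heart of this model can be sketched in numpy on synthetic data. This is a minimal illustration of CCA via per-set whitening and an SVD of the whitened cross-covariance, not the authors' actual pipeline; the dimensions and the synthetic "audio" and "eeg" matrices are hypothetical.

```python
import numpy as np

def cca(X, Y, k):
    """First k canonical pairs via per-set whitening + SVD of the
    whitened cross-covariance (minimal sketch, no regularization)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)  # orthonormal basis of X
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)  # orthonormal basis of Y
    U, s, Vt = np.linalg.svd(Ux.T @ Uy)
    A = Ux @ U[:, :k]        # canonical variates of X
    B = Uy @ Vt[:k].T        # canonical variates of Y
    return A, B, s[:k]       # s: canonical correlations, in [0, 1]

rng = np.random.default_rng(2)
latent = rng.standard_normal((500, 2))        # shared "stimulus-driven" activity
audio = latent @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((500, 5))
eeg = latent @ rng.standard_normal((2, 20)) + 0.1 * rng.standard_normal((500, 20))

A, B, r = cca(audio, eeg, k=2)
print(np.round(r, 2))        # first two canonical correlations close to 1
```

Because the two synthetic data sets share a two-dimensional latent signal with little noise, the first two canonical correlations come out near 1, mirroring the thick and thin lines of Fig. 9 (a) in spirit.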

Figure 9 (a) shows the correlation between the first audio-EEG CC pair (thick blue line) and subsequent pairs (thin lines), with and without MCCA-based de-noising, for one subject. To the extent that correlation is limited in part by EEG noise, the higher scores on the right suggest that denoising was effective. The d-prime metric measures the degree of separation between distributions of correlation scores for matched and mismatched segments. Figure 9 (b) shows the d-prime metric for the first pair (thick blue) and subsequent pairs (thin lines), with and without MCCA-based denoising for segments of duration 64 s. The dotted line shows the d-prime metric for the multivariate distributions of audio-EEG CC pairs. The larger d-prime scores with MCCA-based denoising suggest that it can effectively contribute to improved discrimination. Figure 9 (c) shows classification scores as a function of segment duration with (red) and without (black) MCCA-based denoising. The higher scores with MCCA-based denoising show its benefit for this task. Figure 9 (d) shows that a similar benefit is found in all subjects. The thick lines are scores for a duration of 16 s, whereas the thin lines are for segments of 2 s (lowest lines) or 64 s (highest lines). To summarize, MCCA is of benefit as a denoising tool for EEG responses to speech.

#### Example 6 - fMRI responses to natural sounds

Data were taken from a study that investigated fMRI responses to natural sounds (Norman-Haignere et al., 2015), in which 10 subjects listened to a set of 165 sounds belonging to 11 different classes. MCCA was applied to find patterns of selectivity to sound that were common across subjects, as explained in the Methods. In brief, the 165 × 6309 matrix of voxel activations of each subject was reduced to a 165 × 12 matrix using SVD, the reduced matrices were concatenated, and the result was submitted to PCA to obtain a 165 × 120 matrix of SCs. Their variances are plotted in Fig. 10 (top left). The first 10 SCs were subjected to a JD analysis to enhance the contrast between musical sounds (classes ‘Music’ + ‘VocalMusic’) and other sounds, as explained in the Methods.
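The reduce-concatenate-PCA pipeline can be sketched as follows. This is a numpy illustration on synthetic data with a smaller hypothetical voxel count (300 instead of 6309), not the study's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_voxels, n_subjects, k = 165, 300, 10, 12

# Synthetic stand-in: each "subject" sees the same hidden activation
# profile mixed into a different voxel space, plus noise.
shared = rng.standard_normal((n_samples, 3))
subjects = [shared @ rng.standard_normal((3, n_voxels))
            + 0.5 * rng.standard_normal((n_samples, n_voxels))
            for _ in range(n_subjects)]

whitened = []
for X in subjects:
    X = X - X.mean(axis=0)                 # center each voxel
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    whitened.append(U[:, :k])              # k sphered components per subject

Z = np.hstack(whitened)                    # 165 x (k * n_subjects)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
SCs = U * s                                # SCs, sorted by shared variance

print(SCs.shape)  # (165, 120)
```

Note that the per-subject SVD handles the "more voxels than samples" situation gracefully: only the first *k* left singular vectors are retained, so the 6309-voxel dimension never enters the concatenated matrix.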

The profile of activation over sounds of the first JD component is plotted in Fig. 10 (top right), with sounds ordered by class and coded as different colors. Activations of the first two classes (’Music’ +’VocalMusic’) are clearly distinct from that of the other classes. The corresponding topography of activation over voxels for each subject can be calculated by cross-correlating this component with the profile of activation over sounds of each voxel. Topographies for the left hemisphere for all subjects are plotted in Fig. 10 (bottom). To a first approximation, topographies are consistent in that a dorso-frontal concentration of activity is found in most subjects. To a second approximation, each topography includes additional regions, suggesting a wider network of activation that is more subject-specific. Such subject-specific details would be smoothed out by averaging over subjects. A similar JD analysis to enhance speech-specific activation revealed patterns with more ventral topographies (not shown). The outcome of this analysis is consistent with that reported by Norman-Haignere et al. (2015) using an ICA-related technique.

The benefit of MCCA here can be interpreted in terms of dimensionality reduction, based on *consistency across subjects* rather than variance as with PCA. Dimensionality reduction allowed the final JD analysis to be performed on a matrix of size 165 × 12 × 10 rather than 165 × 6309 × 10, making it more effective by reducing overfitting. If PCA had been used instead of MCCA, the 12 selected dimensions might well have been dominated by noise. Using MCCA ensures that they are instead dominated by activity similar across subjects, which is likely to be relevant because all subjects heard the same stimuli.

This example demonstrates that MCCA can also be applied to data with more channels (pixels or voxels) than data points. MCCA offers a powerful alternative way of summarizing high-dimensional data without having to explicitly model which parts of the brain response are driven by the stimulus features.

## 4 Discussion

MCCA finds a linear transform applicable to each data matrix within a data set to align them to common coordinates and reveal shared patterns. It can be used in several ways: as a *denoising* tool applicable to an individual data matrix, as a tool for *dimensionality reduction*, as a tool to *align* data matrices within a common space to allow comparisons, or as a tool to *summarize* data and reveal patterns that are general across data matrices. As formulated here, MCCA is easy to understand, straightforward to apply, and computationally cheap. Care is nonetheless required when applying it, in particular to avoid phenomena such as overfitting.

### What is new?

As reviewed in the Introduction, several versions of MCCA have been proposed in the literature and applied to the analysis of brain data. The contributions of this paper are the following. First, the formulation as a cascade of PCA, normalization, concatenation, and PCA offers an intuitive explanation that may help practitioners gain insight into this method. Past formulations may be hard to follow for the non-mathematically inclined, and their sheer number is bewildering. We used a similar 2-step formulation in a recent tutorial on joint decorrelation (de Cheveigné and Parra, 2014), and we hope that the present paper too will have tutorial value. Second, our usage of MCCA as a denoising tool, to attenuate noise within individual subjects based on across-subject consistency by projection on the overcomplete basis of its SCs, seems to be new. Third, we provide tutorial examples that may encourage researchers to put MCCA to work for a wider range of tasks, including denoising, outlier detection, summarization, and cross-subject statistics.

### How does it work?

The effect of the processing steps is schematized in Fig. 11. Multiple data matrices contain the same source component S, illustrated as a color gradient, mixed here into two 2-dimensional data matrices (Fig. 11a). Each point represents a sample in time (row of the data matrix) and the two axes represent two channels (columns of the data matrix). The color could represent a hidden sensory response that is similar across two subjects. The initial PCAs sphere each data matrix (b), so that the cloud of points is free to rotate in any direction. However, concatenating the sphered data matrices creates a cloud (in a 4-dimensional space) that is not spherical, because the shared component induces correlation along some direction in 4-D space (projected to 2-D in panel (c)). The second PCA finds this direction of correlation between the data matrices and aligns it with the first axis (d), in the process transforming each data matrix so that it is optimally aligned with the other (e).
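The sphere-concatenate-rotate sequence can be reproduced on a toy example in numpy (synthetic data; the hidden source, mixing weights, and noise levels are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
t = 1000
s = 2 * np.sin(2 * np.pi * np.arange(t) / 50)      # hidden shared source
X1 = np.outer(s, [1.0, 0.5]) + rng.standard_normal((t, 2))   # "subject" 1
X2 = np.outer(s, [-0.3, 0.8]) + rng.standard_normal((t, 2))  # "subject" 2

def sphere(X):
    """PCA-based sphering: orthonormal columns spanning the data."""
    X = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U

Z = np.hstack([sphere(X1), sphere(X2)])            # 4-D cloud, as in panel (c)
U, sv, Vt = np.linalg.svd(Z, full_matrices=False)
sc1 = U[:, 0]                                      # direction of shared variance (panel d)

r = abs(np.corrcoef(sc1, s)[0, 1])
print(round(r, 2))
```

The first SC of the concatenated sphered data recovers the hidden source with a correlation well above what either raw channel offers, which is the geometric intuition behind panels (c)-(e).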

### Relation with other formulations of CCA and MCCA

As explained by Parra (2018), the aim of MCCA is to find projection vectors **v**_{n} applicable to **X**_{n} that maximize the ratio of between-set to within-set covariance:

$$\rho = \frac{1}{N-1} \; \frac{\sum_{m \neq n} \mathbf{v}_m^{t} \mathbf{R}_{mn} \mathbf{v}_n}{\sum_{n} \mathbf{v}_n^{t} \mathbf{R}_{nn} \mathbf{v}_n},$$

where **R**_{nn} = **X**_{n}^{t}**X**_{n} and **R**_{mn} = **X**_{m}^{t}**X**_{n} are the covariance and cross-covariance matrices of the data. The divisor N − 1 ensures that ρ scales between 0 and 1. Setting to zero the derivative of ρ with respect to **v**, the solution is obtained by solving the generalized eigenvalue equation

$$\mathbf{R}\mathbf{v} = \lambda \mathbf{D}\mathbf{v},$$

where **R** is the covariance matrix of the concatenated data, **D** is its block-diagonal part (the within-set covariances **R**_{nn}), and λ = ρ(N − 1) + 1. Now, first decompose **D** = **U**Λ**U**^{t}. Because **D** is the block-diagonal matrix of the covariances of each data set, this decomposition amounts to doing PCA on each data set separately, i.e. whitening each data set. With this decomposition, the eigenvalue equation can be rewritten as

$$\left( \mathbf{\Lambda}^{-1/2} \mathbf{U}^{t} \mathbf{R} \, \mathbf{U} \mathbf{\Lambda}^{-1/2} \right) \mathbf{w} = \lambda \mathbf{w}, \qquad \mathbf{w} = \mathbf{\Lambda}^{1/2} \mathbf{U}^{t} \mathbf{v},$$

where the matrix in parentheses is the covariance of the whitened concatenated data. This last equation thus corresponds to performing PCA on the concatenated whitened data. In summary, the two-step PCA described in the Methods (‘simple MCCA formulation’) maximizes correlation between data sets. This corresponds to the standard SUMCORR formulation of MCCA described by Kettenring (1971) (see Parra, 2018). The relations between this and other MCCA formulations are described in Asendorf (2015).
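The equivalence between the generalized eigenvalue formulation and the two-step PCA, together with the relation λ = ρ(N − 1) + 1, can be checked numerically. This is a sanity-check sketch on synthetic data, not part of the original analysis:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, d = 3, 400, 4                        # N data sets, T samples, d channels each
shared = rng.standard_normal((T, 1))       # component common to all data sets
Xs = [shared @ rng.standard_normal((1, d))
      + rng.standard_normal((T, d)) for _ in range(N)]
X = np.hstack([Xi - Xi.mean(axis=0) for Xi in Xs])

R = X.T @ X                                # covariance of the concatenated data
D = np.zeros_like(R)                       # its block-diagonal (within-set) part
for n in range(N):
    sl = slice(n * d, (n + 1) * d)
    D[sl, sl] = R[sl, sl]

# Solve R v = lambda D v by whitening with D = U Lambda U^t.
lam_D, U = np.linalg.eigh(D)
W = U / np.sqrt(lam_D)                     # W = U Lambda^{-1/2}
C = W.T @ R @ W                            # covariance of whitened concatenated data
lam, w_all = np.linalg.eigh(C)             # eigenvalues in ascending order
lam_max, w_max = lam[-1], w_all[:, -1]
v = W @ w_max                              # eigenvector in original coordinates

rho = (v @ R @ v / (v @ D @ v) - 1) / (N - 1)
print(bool(np.isclose(lam_max, rho * (N - 1) + 1)))  # True
```

The largest eigenvalue of the whitened covariance (the second PCA) is exactly ρ(N − 1) + 1 for the corresponding projection vector, confirming that the two-step procedure solves the same problem as the generalized eigenvalue equation.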

### MCCA vs CCA

MCCA is understood as a generalization of CCA, but some differences are worth noting. For CCA the focus is usually on the CCs **Y**_{n} (n = 1, 2), whereas for MCCA it may also be on the SCs **Y**. For standard CCA the projection matrices are restricted to *d* (or min_{n} *d*_{n}) columns for each data set, whereas for MCCA it may be useful to consider more than *d* columns (as in Example 5). If the objective were to capture sources common to *all* data matrices, *d* components would suffice, but to capture also sources shared by *several data matrices but not all*, more than *d* columns are required. For CCA the *d* columns of **Y**_{1} are mutually uncorrelated, as are those of **Y**_{2}, whereas for MCCA the *D* columns of **Y**_{n} are in general mutually correlated. Columns of their sum **Y** are uncorrelated, however.

The large number (*D* > *d*) and non-orthogonality of the columns of **Y**_{n} might be disconcerting for the researcher familiar with CCA. The method may be modified so that **Y**_{n} instead consists of *d* orthogonal columns. For this, MCCA is applied as above; then, for each *n*, the first column of **Y**_{n} is projected out of **X**_{n}, and MCCA is applied again. This deflationary procedure terminates after *d* steps, because the dimensionality of each data matrix is then exhausted. Smaller matrices with orthogonal columns might be convenient in certain situations, but as pointed out they might not capture all shared sources. The procedure described in the Methods is better in this respect.

### Group analysis of multi-subject data

Gathering data from multiple subjects in response to the same stimulus serves several purposes. First, to counteract variability by increasing the number of observations, analogous to recording from repeated trials. Second, to make inferences at the population level via group-level statistical analysis. Third, to allow data-dependent analysis to improve SNR based on similarity between subjects, analogous to methods that improve SNR based on similarity between trials (de Cheveigné and Parra, 2014).

The conventional strategy of calculating a “grand average”, with corresponding channels or voxels of each subject being averaged together (Choi et al., 2013; Luck, 2005), is hampered by inter-subject differences in source-to-sensor mapping. The problem is mild for sources with broad topographies (as in Fig. 8), but for sources with more local spatial characteristics a mismatch between subjects may result in destructive summation. A similar problem affects measures of inter-subject correlation (ISC) applied directly to channels or voxels (Hasson et al., 2004), or to linear combinations that assume the same mixing vectors for all subjects (Dmochowski et al., 2012; Parra et al., 2018).

One simple expedient is to select, for each subject, a group of channels based on responses to a “localizer” stimulus or task, calculate a root mean square average waveform over these channels, and then average these over subjects (e.g. Chait et al. (2010)). However, this packs the multidimensional cortical activity into a single time course from which it may be hard to infer the richer dynamics of cortical activity. Another approach is to apply inverse modeling to map the activity to a source space common across subjects (Litvak and Friston, 2008). However, this requires accurate anatomical information for each subject and is subject to the validity of the reconstruction models (Mahjoory et al., 2017), as well as between-subject variability in source positions and orientations (Lio and Boulinguez, 2016).

Data-driven methods such as MCCA are attractive in that they find a mapping between subjects based only on shared temporal aspects of the data, without requiring external information. MCCA and related methods have been widely used for fMRI data (Li et al., 2009; Correa et al., 2010b; Hwang et al., 2012; Afshin-Pour et al., 2012; Karhunen et al., 2013; Haxby et al., 2011; Afshin-Pour et al., 2014) and EEG/MEG (Lankinen et al., 2014; Sturm, 2016; Zhang et al., 2017). In contrast to MCCA, which finds variance dimensions that are similar across subjects with no attempt to ensure that they correspond to sources within the brain, ICA-based approaches attempt to isolate sources common across subjects based on criteria of statistical independence (Calhoun and Adali, 2012; Eichele et al., 2011; Huster et al., 2015; Chen et al., 2016; Madsen et al.; Huster and Raud, 2018). Group ICA (GICA) as formulated by Eichele et al. (2011) can be seen as a concatenation of MCCA (as described here) with ICA. Isolating the MCCA step, as we do here, is useful conceptually and avoids the computational cost and assumptions associated with ICA. Hyperalignment, as used by Haxby et al. (2011), is conceptually the same as MCCA but with the transformations restricted to rotations, i.e. Procrustes analysis (Xu et al., 2012). Hyperalignment has the advantage of maintaining the metric distance of patterns between the original and transformed spaces, but the disadvantage that it cannot favor channels with higher inter-subject correlation.

The focus here is on *temporal patterns* common to all subjects and thus in the MCCA procedure the data are concatenated along the spatial dimension (channels). It is also possible to extract *spatial patterns* common across subjects by concatenating data along the temporal dimension. Methods for group analysis of data from multiple subjects are reviewed by Correa et al. (2010a,b); Calhoun and Adali (2012); Sui et al. (2012); Afshin-Pour et al. (2014); Dähne et al. (2015); Chen et al. (2016); Huster and Raud (2018).

### Denoising and dimensionality reduction

As described in the Methods and illustrated in the Results, data from single subjects can be denoised by projecting on the overcomplete basis of *D* CCs, truncating, and projecting back. Data dimensions that are not shared with other subjects are *downweighted* but not removed, so in general the rank of the data remains the same. Setting the cutoff to a relatively high order suppresses only those components that are very different from those found in other subjects, most likely to be noise. In Example 5, the set of 40 PCs that represented each subject was transformed into 320 CCs, of which 110 were selected before being projected back to obtain “denoised” data, yielding the benefit shown in Fig. 9. The CCs that were rejected absorbed some of the subject-specific patterns of noise, improving the outcome.
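The denoising recipe — project onto the overcomplete basis of CCs, truncate, project back — can be sketched in numpy as follows. The data are synthetic, and the cutoff K and all dimensions are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, N, K = 500, 8, 5, 10                 # K: hypothetical number of CCs kept
shared = np.sin(2 * np.pi * np.arange(T)[:, None] / 40)  # common source
Xs = [shared @ rng.standard_normal((1, d))
      + rng.standard_normal((T, d)) for _ in range(N)]

# Two-step MCCA: sphere each data set, concatenate, PCA.
Ws, Us = [], []
for X in Xs:
    X -= X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ws.append(Vt.T / s)                    # whitening matrix (d x d): X @ W = U
    Us.append(U)
Z = np.hstack(Us)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
Vs = [W @ Vt.T[n * d:(n + 1) * d] for n, W in enumerate(Ws)]  # V_n: d x D, D = N*d

# Denoise subject 0: project onto the D CCs, truncate to K, project back.
V = Vs[0]
Y = Xs[0] @ V                              # CCs of subject 0 (T x D)
X_hat = Y[:, :K] @ np.linalg.pinv(V)[:K]   # back to channel space

r_before = abs(np.corrcoef(Xs[0][:, 0], shared[:, 0])[0, 1])
r_after = abs(np.corrcoef(X_hat[:, 0], shared[:, 0])[0, 1])
print(round(r_before, 2), round(r_after, 2))
```

With the full set of D CCs the reconstruction is exact (projecting and back-projecting returns the original data); truncation is what downweights the subject-specific dimensions.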

It is often useful to reduce the dimensionality of the data for computational reasons (to reduce memory or computation time), or to avoid overfitting. The standard procedure of applying PCA and truncating the series of PCs implicitly equates variance to relevance, which may not be justified, as artifact sources may have high variance, and useful sources may be weak. MCCA is of use in this respect to replace the variance criterion by a criterion of consistency with other data. This can be done conservatively by removing a small fraction of SCs that represent the most atypical patterns within the data set.

As a tool to analyze or denoise the data of a single subject, MCCA is comparable to data-driven linear analysis techniques such as PCA, Independent Component Analysis (ICA), Joint Diagonalization, CCA and others. The fact that it uses a different criterion makes it *complementary* to those methods as a denoising or dimensionality reduction tool (e.g one can apply MCCA before or after ICA, JD, etc.).

### Caveats and cautions

A risk, common to other data-driven methods such as ICA or JD, is circularity of the analysis (Kriegeskorte et al., 2009). The method is designed to optimize correlation between data matrices, and therefore the observation that the components it finds *are* correlated between data matrices is of little weight, unless corroborated by careful cross-validation. Related to this issue is overfitting: each SC depends on a number of free parameters that can be large if many data matrices are involved. Overfitting can be detected using resampling and cross-validation methods, and the risk of overfitting can be reduced by dimensionality reduction or other regularization techniques.

MCCA can easily latch on to artifacts and noise patterns shared across data matrices. Uninteresting linear or polynomial trends (for example EEG drift potentials) may thus appear among the first MCCA components. More generally, MCCA can be biased towards narrowband or low-frequency components common across data matrices, *even if their phase is not aligned*, particularly if the noise is spectrally shaped or contains narrow-band components. This is illustrated in Fig. 12, which shows the result of applying MCCA to ten “data matrices”, each of 12 s duration, extracted at random from the same 40-channel EEG data that was used as background noise in Example 3. No known signal is common across these data matrices; nonetheless, the lowest-order SCs have narrow spectra (Fig. 12 left) and quasi-sinusoidal waveforms (right) that might make them seem significant. It is easy to understand why MCCA might take such components to be shared: a sinusoid of arbitrary phase can be expressed as the weighted sum of a sine and a cosine, and thus narrowband activity can be approximated as resulting from two sinusoidal components in quadrature phase. As this is the case for all data sets, MCCA will select the two-component sinusoidal basis as common. Such spurious components compete with genuine shared activity, complicating the analysis.
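The sine/cosine argument is easy to verify numerically (a minimal check; frequency and phase are arbitrary):

```python
import numpy as np

t = np.arange(1000) / 100.0
f = 2.0                                     # frequency in Hz, arbitrary
phi = 1.234                                 # arbitrary phase
x = np.sin(2 * np.pi * f * t + phi)

# Any-phase sinusoid is exactly a weighted sum of a sine and a cosine:
# sin(wt + phi) = cos(phi) sin(wt) + sin(phi) cos(wt)
basis = np.column_stack([np.sin(2 * np.pi * f * t), np.cos(2 * np.pi * f * t)])
a, b = np.cos(phi), np.sin(phi)
print(np.allclose(x, basis @ [a, b]))       # True
```

Since every data matrix's narrowband activity lies in this same two-dimensional subspace regardless of phase, MCCA can report it as "shared" even when the underlying signals are unrelated.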

MCCA assumes that temporal patterns are common across data matrices. A difference in latency of a brain response between different subjects may reduce the ability of MCCA to extract this activity. A common outcome in that case is two components, one with a shape similar to the average pattern over subjects, and the other similar to their difference (or derivative). MCCA can readily be extended to include time-lags to account for differences in response latency between subjects, although this comes at the expense of a greater number of parameters and a greater risk of overfitting. MCCA is obviously of no benefit in the absence of synchronous patterns, for example it is not well suited for analyzing resting-state data of a group of subjects.
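The time-lag extension amounts to augmenting each data matrix with delayed copies of its channels before applying MCCA, so that the spatial filters become spatio-temporal. A minimal sketch (the zero-padding at the edges is one of several possible conventions):

```python
import numpy as np

def lag_embed(X, lags):
    """Augment channels with time-shifted copies so MCCA can absorb
    between-subject latency differences (zero-padded at the edges)."""
    T, d = X.shape
    out = np.zeros((T, d * len(lags)))
    for i, L in enumerate(lags):
        if L >= 0:
            out[L:, i * d:(i + 1) * d] = X[:T - L]
        else:
            out[:T + L, i * d:(i + 1) * d] = X[-L:]
    return out

X = np.arange(12.0).reshape(6, 2)           # toy data: 6 samples, 2 channels
XL = lag_embed(X, lags=[0, 1, 2])
print(XL.shape)                             # (6, 6)
```

The augmented matrix has d × (number of lags) columns, which is where the greater number of parameters, and hence the greater risk of overfitting mentioned above, comes from.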

MCCA yields both CCs and SCs, either of which can be exploited. When reporting, it is important to specify which, to avoid confusion. As an example, the phrase ‘MCCA was applied as a preprocessing step’ is not sufficient to specify what was done.

### Applicability to real-time processing

This work was motivated in part by the need to steer an auditory assistive device using brain signals. An obstacle to reliable decoding is the high level of noise and artifacts in the EEG signals, and analysis and denoising methods are essential for the success of this application. To be useful, a method must be applicable to *real-time* processing, whereas MCCA as described here works in batch mode. It may nonetheless be of use in the following fashion. EEG data are recorded from a pool of subjects in response to a calibration sample of speech, and MCCA is used to derive a “canonical” EEG response to that sample. To adapt the system to a new user, EEG data are recorded in response to the calibration sample, and a spatial filter is designed (for example using CCA) to maximize similarity between the subject’s response and the canonical response. This spatial filter is then used in the real-time processing pipeline. This suggests that MCCA can be put to use in a practical application such as cognitive control of a hearing aid.
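The adaptation step for a new user can be sketched as follows. This is a synthetic illustration in which a plain least-squares fit stands in for the CCA mentioned above, and the "canonical" response is a hypothetical stand-in for the output of the pooled MCCA analysis:

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 1000, 16
canonical = np.sin(2 * np.pi * np.arange(T) / 60)  # "canonical" response (stand-in)
mix = rng.standard_normal(d)
X_new = np.outer(canonical, mix) + rng.standard_normal((T, d))  # new user's calibration EEG

# Least-squares spatial filter mapping the new user's channels onto the
# canonical response (CCA would generalize this to several dimensions).
w, *_ = np.linalg.lstsq(X_new - X_new.mean(axis=0), canonical, rcond=None)
y = X_new @ w                                      # filtered signal for the real-time pipeline

r = abs(np.corrcoef(y, canonical)[0, 1])
print(round(r, 2))
```

Once the filter `w` is computed offline from the calibration sample, applying it in real time is a single matrix-vector product per sample, which is compatible with the latency constraints of a hearing-aid pipeline.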

## 5 Conclusion

Multiway CCA is a powerful tool for analysis of multi-subject multivariate datasets. It can be used both to design spatial filters to denoise data of each individual subject, and to summarize data across subjects. Many related methods have been proposed in the literature, but the processing principles behind them, and the range of tasks that they can be used for, are not widely appreciated. The use of MCCA (or similar techniques) should be more prevalent given the ubiquitous need for merging data across subjects. In this paper we presented a formulation of MCCA that is relatively easy to understand, illustrated in detail how it works, and showed how it can be put to use for a wide range of common tasks in multi-subject multivariate data analysis.

## Acknowledgements

This work was supported by the EU H2020-ICT grant 644732 (COCOHA), and grants ANR-10-LABX-0087 IEC and ANR-10-IDEX-0001-02 PSL*. Lucas C. Parra received support from the National Science Foundation under grant DRL-1660548. Some of these ideas were tried out at the 2017 Telluride Neuromorphic Engineering Workshop. Malcolm Slaney and Sam Norman-Haignière offered useful comments on earlier versions of the manuscript.