## Abstract

Being able to remove or weigh down the influence of outlier data is desirable for any statistical models. While Magnetic and ElectroEncephaloGraphic (MEEG) data are often averaged across trials per condition, it is becoming common practice to use information from all trials to build linear models. Individual trials can, however, have considerable weight and thus bias inferential results. Here, rather than looking for univariate outliers, defined independently at each measurement point, we apply the principal component projection (PCP) method at each channel, deriving a single weight per trial at each channel independently. Using both synthetic data and open EEG data, we show (1) that PCP is efficient at detecting a large variety of outlying trials; (2) how PCP-based weights can be implemented in the context of the general linear model with accurate control of type 1 family-wise error rate; and (3) that our PCP-based Weighted Least Square (WLS) approach increases the statistical power of group analyses as well as a much slower Iterative Reweighted Least Squares (IRLS), although the weighting scheme is markedly different. Together, our results show that WLS based on PCP weights derived from whole trial profiles is an efficient method to weigh down the influence of outlier EEG data in linear models.

**Data availability** all data used are publicly available (CC0), all code (simulations and data 40 analyzes) is also available online in the LIMO MEEG GitHub repository (MIT license).

## Introduction

MEEG data are often epoched to form 3 or 4-dimensional matrices of, e.g., channel × time × trials and channel × frequency × time × trials. Several neuroimaging packages are dedicated to the analyses of such large multidimensional data, often using linear methods. For instance, in the LIMO MEEG toolbox (Pernet et al., 2011), each channel, frequency, and time frame is analyzed independently using the general linear model, an approach referred to as mass-univariate analysis. Ordinary Least Squares (OLS) are used to find model parameters that minimize the error between the model and the data. For least squares estimates to have good statistical properties, it is however expected that the error covariance off-diagonals are zeros, such that Cov(e) = σ^{2}I, I being the identity matrix (Christensen, 2002), assuming observations are independent and identically distributed. It is well established that deviations from that assumption lead to substantial power reduction and to an increase in the false-positive rate. When OLS assumptions are violated, robust techniques offer reliable solutions to restore power and control the false positive rate. Weighted Least Squares (WLS) is one such robust method that uses different weights across trials, such that Cov(e) = σ^{2}V, with V a diagonal matrix:
with y a n-dimensional vector (number of trials), X the n*p design matrix, β a p dimensional vector (number of predictors in X) and e the error vector of dimension n. The WLS estimators can then be obtained using an OLS on transformed data (eq. 2 and 3):
with W a 1*n vector of weights.

When applied to MEEG data, a standard mass-univariate WLS entails obtaining a weight for each trial but also each dimension analyzed, i.e. channels, frequencies and time frames. Following such procedure, a trial could be considered as an outlier or be assigned a low weight, for a single frequency or time frame, which is implausible given the well-known correlations of MEEG data over space, frequencies and time. We propose here that a single or a few consecutive data points should never be flagged as outliers or weighted down, and that a single weight per trial (and channel) should be derived instead, with weights taking into account the whole temporal or spectral profile. In the following, we demonstrate how the Principal Component Projection method (PCP - Filzmoser et al., 2008) can be used in this context, and how those weights can then be used in the context of the general linear model, applied here to event-related potentials.

## Method

### Trial-based Weighted Least Squares

An illustration of the method is shown in figure 1. Trial weights are computed as a distance among trials projected onto the main (>=99%) principal components space. Here, the principal components computed over the f time frames are those directions which maximize the variance across trials for uncorrelated (orthogonal) time periods (figure 1B). Outlier trials are points in the f-dimensional space which are far away from the bulk. By virtue of the PCA, these outlier trials become more visible along the principal component axes than in the original data space. Weights (figure 1E) for each trial are obtained using both the Euclidean norm (figure 1C, distance location) and the kurtosis weighted Euclidean norm (figure 1D, distance scatter) in this reduced PCA space (see Filzmoser et al., 2008 for details). We exploit this simple technique because it is computationally fast given the rich dimensional space of EEG data and because it does not assume the data to originate from a particular distribution. The only constraint is that there are more trials present than time frames. For instance, with trials ranging from −50 ms to +650 ms, sampled at 250 Hz (thus 176 time points), the method requires at least 177 trials. The PCP algorithm is implemented in the *limo_pcout.m* function, distributed with the LIMO MEEG toolbox (https://limo-eeg-toolbox.github.io/limo_meeg/). The WLS solution, implemented in *limo_WLS.m*, consists of computing model beta estimates using weights from the PCP method on OLS standardized robust residuals, following three steps:

(1) After the OLS solution is computed, an adjustment is performed on residuals by multiplying them by where h is a vector of Leverage points (i.e. the diagonal of the hat matrix

*H = X(X’X)*where X is the design matrix). This adjustment is necessary because leverage points are the most influential on the regression space, i.e. they tend to have low residual values (Hoaglin & Welsch, 1978).^{−1}X’(2) Residuals are then standardized using a robust estimator of dispersion, the median absolute deviation to the median (MAD), and re-adjusted by the tuning function. Here we used the bisquare function. The result is a series of weights with high weights for data points having high residuals (with a correction for Leverage).

(3) The WLS solution is then computed following equation 3.

### Simulation-based analyses

#### A. Outliers detection and parameters estimation

Simulated ERPs were generated to evaluate the classification accuracy of the PCP method and to estimate the robustness to outliers and low signal-to-noise ratio of the WLS solution in comparison to an OLS solution and a standard Iterative Reweighted Least Squares (IRLS) solution, which minimizes residuals at each time frame separately (implemented in *limo_IRLS.m*). To do so, we manipulated (i) the percentage of outliers, using 10%, 20%, 30%, 40% or 50% of outliers; (ii) the signal to noise ratio (defined relative to the mean over time of the background activity); and (iii) the type of outliers. The first set of outliers were defined based on the added noise: white noise, pink noise, alpha oscillations and gamma oscillations. In these cases, the noise started with the P1 component and lasted ~ 200ms (see below). The second set of outliers were defined based on their amplitude, or outlier to signal ratio (0.5, 0.8, 1.2, and 1.5 times the true N1 amplitude).

Synthetic data were generated for one channel, using the model developed by Yeung et al. (2018). The simulated signal corresponded to an event-related potential with P1 and N1 components (100 ms long) added to background activity with the same power spectrum as human EEG, generating 200 trials of 500 ms duration with a 250 Hz sampling rate. Examples for each type of simulation are shown in figure 2 and results are based, for each case, on a thousand random repetitions. Performance of the PCP algorithm at detecting outlying synthetic EEG trials was investigated by computing the confusion matrix and mapping the true and false positives rates in the Receiver Operating space, and by computing the Matthew Correlation Coefficients (MCC). Robustness was examined by computing the Pearson correlations and the Kolmorov-Smirnov (KS) distances between the ground truth mean and the OLS, WLS, and IRLS means. Pearson values allowed to estimate the linear relationships between estimated means and the truth while KS distances provide a fuller picture of the overall differences in distributions.

The code used to generate the ERP and the results are available at https://github.com/LIMO-EEG-Toolbox/limo_test_stats/tree/master/PCP_simulations.

#### B. Statistical inference

Accurate estimation of model parameters (i.e. beta estimates in the GLM - equation 3) is particularly important because it impacts group-level results. Inference at the single-subject level may, however, also be performed and accurate p-values need, therefore, to be derived. Here, error degrees of freedom are obtained using the Satterwaithe approximation (equation 4).

*with dfe, the degree of freedom of the error, I the identity matrix, and H the hat matrix.*

To validate p-values, simulations under the null were performed. Two types of data were generated: Gaussian data of size 120 trials × 100 time frames and EEG data of size 120 trials × 100 time frames with a P1 and N1 component as above, added to coloured background activity with the same power spectrum as human EEG. In each case, a regression (1 Gaussian random variable), an ANOVA (3 conditions of 40 trials - dummy coding) and an ANCOVA (3 conditions of 40 trials and 1 Gaussian random covariate) model were fitted to the data using the OLS, WLS and IRLS methods. The procedure was performed 10,000 times, leading to 1 million p-values per data/model/method combination and Type 1 errors with binomial confidence intervals were computed.

### Empirical data analysis

A second set of analyses used the publicly available multimodal face dataset (Wakeman & Henson, 2016) to (i) investigate the PCP classification; (ii) validate the GLM implementation for type 1 error family-wise control at the subject level; (iii) evaluate group results, contrasting WLS against the OLS and IRLS methods. This analysis can be reproduced using the script available at https://github.com/LIMO-EEG-Toolbox/limo_meeg/blob/master/resources/code/Method_validation.m.

#### A. EEG Data and Preprocessing

The experiment consisted in the presentation of familiar, unfamiliar, and scrambled faces, repeated twice at various intervals, leading to a factorial 3 (type of faces) by 3 (repetition) design. The preprocessing replicated Pernet et al (2021). EEG data were extracted from the MEG fif files, time corrected and electrode position re-oriented and saved according to EEG-BIDS (Pernet et al., 2019 - available at OpenNeuro https://10.18112/openneuro.ds002718.v1.0.2.). Data were imported into EEGLAB (Delorme & Makeig, 2004) using *the bids-matlab-tools v5.2 plug-in* and non-EEG channel types were removed. Bad channels were next automatically removed and data filtered at 0.5 Hz using *pop_clean_rawdata.m* of the *clean_radata* plugin v2.2 (transition band [0.25 0.75], bad channel defined as a flat line of at least 5 sec and with a correlation to their robust estimate based on other channels below 0.8). Data were then re-referenced to the average (*pop_reref.m*) and submitted to an independent component analysis (Onton et al., 2006) (*pop_runica.m* using the runnica algorithm sphering data by the number of channels −1). Each component was automatically labelled using the *ICLabel* v1.2.6 plug-in (Pion-Tonachini et al., 2019), rejecting components labeled as eye movements and muscle activity above 80% probability. Epochs were further cleaned if their power deviated too much from the rest of the data using the Artifact Subspace Reconstruction algorithm (Kothe & Makeig, 2013) (*pop_clean_rawdata.m*, burst criterion set to 20).

#### B. High vs. low weight trials and parameters estimation

At the subject level (1st level), ERP were modelled at each channel and time frame with the 9 conditions (type of faces × repetition) and beta parameter estimates obtained using OLS, WLS, and IRLS. For each subject, high vs. low weight trials were compared with each other at the channel showing the highest between trials variance to investigate what ERP features drove the weighting schemes. High and low trials were defined a priori as trials with weights (or mean weights for IRLS) below the first decile or above the 9th decile. We used a two-sample bootstrapt method to compare the 20% trimmed means of high and low trials in every participant, for each of these three quantities: temporal SNR (the standard deviation over time); global power (mean of squared absolute values, Parseval’s theorem); autocorrelation (distance between the 2 first peaks of the power spectrum density, Wiener-Khinchin theorem). A similar analysis was conducted at the group level averaging the metrics across trials. Computations of the three quantities have been automatized for LIMO MEEG v3.0 in the *limo_trialmetric.m* function.

#### C. Statistical inference

In mass-univariate analyses, once p-values are obtained, the family-wise type 1 error rate can be controlled using the distribution of maxima statistics from data generated under the null hypothesis (Pernet et al., 2015). Here, null distributions were obtained by first centering data per conditions, i.e. the mean is subtracted from the trials in each condition, such that these distributions had a mean of zero, but the shape of the distributions is unaffected. We then bootstrap these centred distributions (by sampling with the replacement), keeping constant the weights (since they are variance stabilizers) and the design. We computed 2500 bootstrap estimates per subject. A thousand of these bootstrap estimates were used to compute the family-wise type 1 error rate (FWER), while maxima and cluster maxima distributions were estimated using from 300 to 1,500 bootstraps estimates in steps of 300, to determine the convergence rate, i.e. the number of resamples needed to control the FWER. Since OLS was already validated in Pernet et al. (2015), here we only present WLS results. Statistical validations presented here and other statistical tests implemented in the LIMO MEG toolbox v3.0 (GLM validation, robust tests, etc.) are all available at https://github.com/LIMO-EEG-Toolbox/limo_test_stats/wiki.

#### D. Performance evaluation at the group level

At the group level (2nd level), we computed 3 by 3 repeated measures ANOVAs (Hotelling T^{^2}tests) separately on OLS, WLS, and IRLS estimates, with the type of faces and repetition as factors. Results are reported using both a correction for multiple comparisons with cluster-mass and with TFCE (threshold-free cluster enhancement) at p<.05 (Maris & Oostenveld, 2007; Pernet et al., 2015).

In addition to these thresholded maps, distributions were compared to further understand where differences originated from. First, we compared raw effect sizes (Hotelling T^{^2}) median differences between WLS vs. OLS and WLS vs. IRLS for each effect (face, repetition and interaction), using a percentile t-test with alpha adjusted across all 6 tests using Hochberg’s step-up procedure (Hochberg, 1988). This allowed checking if differences in results were due to effect size differences. Then, since multiple comparison correction methods are driven by the data structure, we compared the shapes of the F value and of the TFCE value distributions (TFCE reflecting clustering). Each distribution was standardized (equation 5) and WLS vs. OLS and WLS vs. IRLS distributions compared at multiple quantiles (Rousselet et al., 2017).

*with Yzi the standardized data, Y the data, and MAD the median absolute deviation.*

## Results

### Outliers detection

While the PCP method is used in the GLM to obtain weights and not to remove outliers directly, simulations allowed us to better understand what kind of trials are weighted down and how good the method is at detecting such trials. Figure 3 shows all the results for ERP simulated with a SNR of 1. Similar results were observed when using a SNR of 2 (supplementary figure 1). First and foremost, in all cases and for up to 40% of outlying trials, the PCP data are located in the upper left corner of the ROC space, indicating good performances. When reaching 50% of outliers, the true positive rate falls down to ~40% and the false positive rate remains below 40%. This is best appreciated by looking at the plots showing perfect control over false positives when data are contaminated with up to 40% of white, alpha, and gamma outliers. In those cases, the Matthew Correlation Coefficients also remain high (>0.6) although not perfect (not =1), indicating some false negatives. Compared with other types of noise, pink noise elicited very different results, with Matthew Correlation Coefficients around 0 indicating chance classification level. Results from amplitude outliers also show Matthew Correlation Coefficients close to 0 with a linear decrease in true positives and a linear increase in false positives as the percentage of outliers increases. This implies that the PCP method did not detect amplitude changes around peaks. These results are simply explained by the principal components being computed over time frames, and outliers with pink noise and weaker or stronger N1 do not affect the temporal profile of the ground truth sufficiently to lead to different eigenvectors (‘directions’) in this dimension when decomposing the covariance matrix, i.e. their temporal profiles do not differ from the ground truth.

### High vs. low trial weights

The classification for real ERP data confirmed the simulation results: the PCP algorithm weighted down trials with dynamics different from the bulk. Single subject analyses (supplementary table 1) and group analyses (figure 4) for WLS showed that trials with a low weight are less smooth than trials with a high weight (higher temporal variance ~10 vs. 7.26 uV and power ~131 vs. 69 dB, lower autocorrelation 11 vs. 12.25 ms), despite having similar spectra (as expected from data filtering and artefact reduction). In comparison, trials with low and high mean weights based on IRLS, were similar on those metrics (temporal variance ~9 vs. 7 uV, and power ~126 vs. 65 dB, autocorrelation 12.25 vs. 12 ms). While 11 out of 18 subjects show maximum between-trial variance on the same channels for WLS and IRLS, only 28% of low weight trials were the same between the two methods, and 56% of high weight trials. Since different trials have low or even high weights between methods, this further indicates that the weighting scheme from WLS differs from IRLS which relies on amplitude variations only.

### Estimation and Robustness

The effect of adding outliers on the mean can be seen in figure 5 and supplementary figure 2. The standard mean, i.e. the ordinary least squares ERPs, shows an almost linear decrease in Pearson correlations and linear increase in KS distances to the ground truth as the percentage of outlier increases, an expected behaviour since OLS are not robust. Our reference robust approach, IRLS, shows robustness to white noise, alpha, and gamma oscillations with higher Pearson correlations than the OLS. Yet it performed worse than the OLS with pink noise and amplitude outliers, showing lower correlations with the ground truth, despite having similar KS distances in all cases. As the IRLS solution for pink noise and amplitude outliers weights data to minimize residuals at each time point separately, these are also expected results, resulting in an average distance (over time) larger than OLS. The new WLS approach showed stronger resistance to outliers for white noise, alpha and gamma oscillations than the IRLS approach, with higher Pearson correlations. For pink noise and N1 amplitude outliers, it performs as well as the IRLS, despite different KS distances. The IRLS algorithm attenuates the influence of those data points that differ from the ground truth, but this may be from different trials at different time points. By doing so, KS distances to the ground truth were similar or lower (for alpha and gamma oscillations) than the OLS. The WLS approach attenuates the influence of trials with different time courses and thus, the WLS ERP mean is affected at every time point, even if the detection concerns a small part of the time course, leading to higher KS distances even with a small number of outliers.

### Statistical inference for single subjects

The average type 1 error rate for every channel and time frame tested with simulated data is at the nominal level (5%) for OLS. Results also show that IRLS are a little lenient, with small but significantly smaller p-values than expected, leading to an error rate of ~0.055. Conversely, WLS are conservative for simulated ERP, with p-values slightly too high, giving a type 1 error rate of ~0.04) and lenient with purely Gaussian data (type 1 error ~0.065 – table 1). This behaviour of WLS is caused by the PCP method which optimizes weights based on distances across time, except that with simulated Gaussian data there is no autocorrelation and the PCA returns a much higher number of dimensions, leading to a meaningless feature reduction and thus meaningless trial distances and weights.

The WLS family-wise type 1 error rate (i.e. controlling the error for statistical testing across the whole data space) examined using nullified ERP data from Wakeman and Henson (2015) shows a good probability coverage for both maximum and cluster statistics with 95% confidence intervals overlapping with the expected nominal value (figure 6). Individual mean values ranged from 0.039 to 0.070 for maximum statistics (across subject average 0.052) and 0.044 to 0.07 for spatial-temporal clustering (across subject average 0.051). Those results do not differ significantly from OLS results (paired bootstrap t-test). Additional analyses based on the number of bootstraps used to build the null distribution indicate that 800 to a 1000 bootstrap samples are enough to obtain stable results, and that the errors are relatively well distributed in space and time even if some channels tend to be more affected than others, i.e. there is no strong sampling bias: maximum number of error occurring at the same location was 0.05% using maximum statistics and 0.9% using spatial-temporal clustering, see bottom for figure 6, error density maps.

### Performance evaluation at the group level

Repeated measures ANOVAs using parameter estimates from each method revealed 2 spatial-temporal clusters for the face effect for both WLS and IRLS, but only the 1st cluster was declared statistically significant using OLS (table 2). The expected results (Wakeman & Henson, 2015) with full faces having stronger N170 responses than scrambled faces are replicated for all approaches (start of cluster 1). Maximum differences were observed over the N170 only when using OLS parameters. Using WLS and IRLS gave maxima much later (P280), a result also observed when using TFCE rather than spatial-temporal clustering. In each case, a repetition effect was also observed in a much more consistent way among methods with the second presentation of stimuli differing from the 1st and 3rd presentations (figure 7).

The statistical maps show that group results using based on WLS parameter estimates lead to smaller F values than those obtained from OLS or IRLS estimates (note the difference in maxima table 1 and scale in Figure 7), which is confirmed by the median differences in Hotelling T^{^2} values (supplementary tables 2, 3 & table 3). Considering uncorrected p-values, this translates into weaker statistical power for WLS: Face effect OLS = 34% of significant data frames, WLS = 31%, IRLS = 34%, Repetition effect OLS = 39%, WLS = 35%, IRLS = 39%. Results based on cluster-corrected p-values showed however more statistical power for WLS relative to OLS for the Face effect (OLS 20% WLS 22% IRLS 25% of significant data frames with cluster mass and 3%, 5% 3% of significant data frames with TFCE), and mixed results for the Repetition effect (OLS 31% WLS 28% IRLS 31% of significant data frames with cluster mass and 7%, 8% 7% of significant data frames with TFCE).

To further understand how cluster-based results lead to more statistical power for WLS while F values are smaller, we compared distributions’ shapes by comparing the deciles of normalized values (figure 8). For the face effect, WLS did not differ significantly from OLS or from IRLS for F-values, while TFCE values were significantly larger, from the 2nd decile onward when compared to OLS, and for deciles 2, 3, 4, 7, 8 and 9 compared to IRLS. For the repetition effect, WLS differed from OLS on deciles 2, 7, 8 and 9 for both F-values and TFCE values while it differed from IRLS on decile 9 only when looking at F-values, and deciles 2, 5, 8 and 9 when looking at TFCE values. Finally, for the interaction effect, WLS did not differ from OLS or IRLS in terms of F-values but had significantly weaker TFCE values than OLS (deciles 1, 3, 6, 7, 8 and 9) and IRLS (all deciles but the 4th). In summary, for the significant main face effect and repetition effect, a general pattern of more right skewed distributions of F-values and TFCE-values for WLS than for OLS and IRLS was observed while a shorter tail was observed for the non significant interaction effect.

## Discussion

Simulation and data-driven results indicate that the proposed WLS-PCP method is efficient at down weighting trials with dynamics differing from the bulk, leading to more accurate estimates. Results show that, for ERP, deriving weights based on the temporal profile provides a robust solution against white noise or uncontrolled oscillations. For biological (pink) noise and amplitude variations which do not alter the temporal profile, the PCP algorithm does not classify well outlier trials, leading to a decrease in detection performance compared with white, alpha or gamma noise. Rather than a defect, we see this as biologically relevant (see below). Importantly, even in those cases of failed detection, the overall correlations with the ground truth remained high (>=0.99). When analyzing real data, differences in amplitude variations were nevertheless captured by the PCP/WLS approach, with amplitude variations related to trials which were out of phase with the bulk of the data.

Group-level analyses of the face dataset replicated the main effect of face type (faces>scrambled) in a cluster from ~150ms to ~350ms but also revealed a late effect (>500ms), observed when using WLS and IRLS parameter estimates but absent when using OLS parameter estimates. Despite more data frames declared significant with WLS than OLS, effects sizes were smaller for WLS than for OLS and IRLS. The shape of the F distributions when using WLS parameter estimates were however more right skewed than when using OLS or IRLS, leading cluster corrections to declare more data points as significant. Indeed, under the null, very similar distributions of maxima are observed for the three methods leading to more power for the more skewed observed distributions. The interplay between 1st level regularization, 2nd level effect size, and multiple comparison procedures depends on many parameters and it is not entirely clear how statistical power is affected by their combination and requires deeper investigation via simulations. Empirically, we can nevertheless conclude that group results were statistically more powerful using robust approaches at the subject level than when using OLS.

Using the trial dynamics (temporal or spectral profile) to derive a single weight per trial makes sense, not just because the observed signal is autocorrelated, but also because it is biologically relevant. Let’s consider first the signal plus noise model of ERP generation (Hillyard, 1985; Jervis et al., 1983; Shah, 2004). In this conceptualization, ERPs are time-locked additive events running on top of background activity. An outlier time frame for a given trial may occur if 1) the evoked amplitude deviates from the bulk of evoked trials, or 2) the background activity deviates from the rest of the background activity. In the former case, the additional signal may be conceived either as a single process (a chain of neural events at a particular location) or a mixture of processes (multiple, coordinated neural events). In both cases, the data generating process is thought to be evolving over time (auto-regressive) which speaks against flagging or weighting a strong deviation at a particular time frame only. It is likely that several consecutive time frames deviate from most other trials, even though only one time frame is deemed an outlier. In the case of a deviation in background activity, it would mean that for an extremely brief period, a large number of neurons synchronized for non-experimentally related reasons, and for this trial only. Although we do not contend that such events cannot happen in general, this would mean that, in the context of ERP outlier detection, the background activity varies by an amount several folds bigger than the signal, which goes against theory and observations. Let now us consider the phase resetting model (Makeig et al., 2002; Sayers et al., 1974). In this model, ERPs are emerging from phase synchronization among trials, due to stimulus induced phase-resetting of background activity. If a trial deviates from the rest of the trials, this implies that it is out-of-phase. In this scenario, deriving different weights for different time frames (i.e. IRLS solution) means that the time course is seen as an alternation of normal and outlying time frames, which has no meaningful physiological interpretation. Thus, irrespective of the data generating model, the WLS approach seems biologically more appropriate than the IRLS method.

In conclusion, we propose a fast and straightforward weighting scheme for trials based on their temporal or spectral profiles. Results indicate that it captures and attenuates well ERP noise, leading to increased estimation precision and possibly increased statistical power at the group level.

## Acknowledgements

Thank you to EEGLAB/LIMO MEEG users who engaged with the beta version, got stuck but persevered until we solved their issues.

## Footnotes

New title, typos and various clarifications in method and results.