ElectroEncephaloGraphy robust statistical linear modelling using a single weight per trial

Being able to remove or down-weight outlier data is desirable for any statistical model. While Magnetic and ElectroEncephaloGraphic (MEEG) data are often averaged across trials per condition, it is becoming common practice to use information from all trials to build statistical linear models. Individual trials can, however, have considerable weight and thus bias inferential results (effect sizes as well as thresholded t/F/p maps). Here, rather than looking for univariate outliers, defined independently at each measurement point, we apply the principal component projection (PCP) method at each channel, deriving a single weight per trial at each channel independently. Using both synthetic data and open EEG data, we show (1) that PCP is efficient at detecting a large variety of outlying trials; (2) how PCP-based weights can be implemented in the context of the general linear model (GLM) with accurate control of the type 1 family-wise error rate; and (3) that our PCP-based Weighted Least Squares (WLS) approach increases the statistical power of group analyses as much as a much slower Iterative Reweighted Least Squares (IRLS) solution does, although the weighting schemes differ markedly. Together, our results show that WLS based on PCP weights derived from whole-trial profiles is an efficient method to down-weight the influence of outlier EEG data in linear models.

Data availability: all data used are publicly available (CC0); all code (simulations and data analyses) is also available online in the LIMO MEEG GitHub repository (MIT license).

Introduction

A single weight per trial (at each channel) should be derived instead, with weights taking into account the whole temporal or spectral profile. In the following, we demonstrate how the Principal Component Projection method (PCP (4)) can be used in this context, and how those weights can then be used in the context of the general linear model (WLS), applied here to event-related potentials. Such weighting should not be taken to bypass data preprocessing. Measurement and physiological artefacts are typically several orders of magnitude stronger than the signal of interest, spanning consecutive trials, and require dedicated algorithms. After data preprocessing, residual signal artefacts might however remain. In addition, natural and unintended experimental variations might also exist among trials, and the weighting scheme aims to account for these.

Method

Trial-based Weighted Least Squares

The new WLS solution consists of computing 1st-level GLM beta estimates using weights from the Principal Component Projection algorithm (PCP (4)). As such, the estimation procedure follows the usual steps of weighting schemes, like a standard Iterative Reweighted Least Squares (IRLS) solution:

(1) After the OLS solution is computed, the residuals are adjusted by multiplying them by 1/√(1 − h), where h is the vector of leverage values, i.e. the diagonal of the hat matrix H = X(X′X)⁻¹X′, where X is the design matrix reflecting experimental conditions. This adjustment is necessary because leverage points are the most influential on the regression space, i.e. they tend to have low residual values (5).

(2) Residuals are then standardized using a robust estimator of dispersion, the median absolute deviation to the median (MAD), and re-adjusted by the tuning function; here we used the bisquare function. A series of weights is next obtained, either based on distances (PCP method) or by minimizing the error term (IRLS), with, in both cases, low weights for data points having high residuals (with a correction for leverage).

(3) The solution is then computed following equation 3.
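If equation 3 is the standard weighted least squares estimator (this sketch assumes so, since the equation itself is not reproduced here), the per-channel computation reduces to a few lines:

```python
import numpy as np

def wls_betas(X, Y, w):
    """Weighted least squares with one weight per trial.

    X : (n_trials, n_regressors) design matrix
    Y : (n_trials, n_frames) data from one channel
    w : (n_trials,) trial weights (e.g. from the PCP step)

    Computes beta = (X'WX)^-1 X'WY with W = diag(w); because w is
    shared by all time frames, one solve covers the whole epoch.
    """
    XtW = X.T * w                      # equivalent to X' @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ Y)
```

With all weights equal to one this reduces to the OLS solution, which is a convenient sanity check.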

Unlike IRLS, WLS weights are not derived for each time frame independently but by looking at the multivariate distances in the principal component space (step 2 above), thus deriving a unique weight for all time frames. An illustration of the method is shown in figure 1. Trial weights are computed as a distance among trials projected onto the main (>=99% of variance) principal component space. Here, the principal components computed over the f time frames are the directions which maximize the variance across trials for uncorrelated (orthogonal) time periods (figure 1B). Outlier trials are points in the f-dimensional space which are far away from the bulk. By virtue of the PCA, these outlier trials become more visible along the principal component axes than in the original data space. Weights (figure 1E) for each trial are obtained using both the Euclidean norm (figure 1C, distance location) and the kurtosis-weighted Euclidean norm (figure 1D, distance scatter) in this reduced PCA space (see (4) for details). Weights are then applied to each trial (at each data point) on the original data (i.e. there is no data reduction). We exploit this simple technique because it is computationally fast given the rich dimensional space of EEG data and because it does not assume the data to originate from a particular distribution.
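A minimal numerical sketch of this distance computation follows; the exact standardization, the translation of distances into weights, and other details of the PCP algorithm (4) are simplified, and `pcp_weights` with its final distance-to-weight mapping is an illustrative assumption, not the LIMO implementation:

```python
import numpy as np

def pcp_weights(trials, var_kept=0.99):
    """Simplified PCP-style trial weights (illustrative sketch).

    trials : (n_trials, n_frames) single-channel data.
    Returns one weight per trial in (0, 1]; trials far from the
    bulk in the principal component space get small weights.
    """
    X = trials - np.median(trials, axis=0)            # robust centring
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    keep = np.searchsorted(cum, var_kept) + 1         # >=99% variance
    scores = X @ Vt[:keep].T                          # project on PCs
    med = np.median(scores, axis=0)
    mad = np.median(np.abs(scores - med), axis=0) * 1.4826
    Z = (scores - med) / mad                          # robust z-scores
    d_loc = np.sqrt((Z**2).sum(axis=1))               # distance location
    k = np.abs((Z**4).mean(axis=0) / (Z**2).mean(axis=0)**2 - 3)
    kw = k / k.sum() if k.sum() > 0 else np.ones(keep) / keep
    d_sca = np.sqrt((Z**2 * kw).sum(axis=1))          # kurtosis-weighted
    d = d_loc * d_sca                                 # combined distance
    return 1.0 / (1.0 + d / np.median(d))             # small w = outlier
```

The final line is a simple monotone mapping chosen for the sketch; the published algorithm uses a different translation of the two distances into weights.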

Simulation-based analyses

A. Outlier detection and effectiveness of the weighting scheme

Simulated ERPs were generated to evaluate the classification accuracy of the PCP method and to estimate the robustness to outliers and signal-to-noise ratio of the WLS solution, in comparison to an OLS solution and a standard Iterative Reweighted Least Squares (IRLS) solution, which minimizes residuals at each time frame separately (implemented in limo_IRLS.m). To do so, we manipulated (i) the percentage of outliers, using 10%, 20%, 30%, 40% or 50% of outliers; (ii) the signal-to-noise ratio (defined relative to the mean over time of the background activity); and (iii) the type of outliers. The first set of outliers was defined based on the added noise: white noise, pink noise, alpha oscillations and gamma oscillations, following (6). In these cases, the noise started with the P1 component and lasted ~200 ms (see below). The second set of outliers was defined based on their amplitude, or outlier-to-signal ratio (0.5, 0.8, 1.2, and 1.5 times the true N1 amplitude).
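As an illustration of this design, here is a hedged sketch of how such outlier trials can be injected; the ERP shape below is a hypothetical pair of Gaussian peaks standing in for the actual Yeung et al. model, and `simulate_trials` is not code from the paper:

```python
import numpy as np

def simulate_trials(n_trials=200, n_frames=250, prop_out=0.2,
                    snr=1.0, rng=None):
    """Sketch of the simulation setup: clean ERP trials plus a given
    proportion of white-noise outlier trials. The 'true' ERP here is
    a hypothetical P1/N1-like pair of Gaussian peaks, not the actual
    Yeung et al. model used in the paper; SNR scaling is likewise a
    stand-in for the paper's definition.
    """
    rng = np.random.default_rng(rng)
    t = np.linspace(0, 1, n_frames)
    erp = (np.exp(-((t - 0.10) / 0.020)**2)          # P1-like peak
           - 1.5 * np.exp(-((t - 0.17) / 0.025)**2)) # N1-like trough
    noise = rng.normal(size=(n_trials, n_frames)) * erp.std()
    trials = snr * erp + noise
    n_out = int(prop_out * n_trials)
    out_idx = rng.choice(n_trials, n_out, replace=False)
    # outlier trials: extra white noise over ~200 ms after the P1
    win = np.flatnonzero((t >= 0.1) & (t < 0.3))
    trials[np.ix_(out_idx, win)] += \
        3 * erp.std() * rng.normal(size=(n_out, win.size))
    return trials, out_idx
```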

Synthetic data were generated for one channel, using the model developed by Yeung et al. (2018; (7)). The simulated signal corresponded to an event-related potential with P1 and N1 components.

The experiment consisted in the presentation of familiar, unfamiliar, and scrambled faces, repeated twice at various intervals, leading to a factorial 3 (type of faces) by 3 (repetition) design.

The preprocessing reused the pipeline described in Pernet et al. (2021).

In mass-univariate analyses, once p-values are obtained, the family-wise type 1 error rate can be controlled using the distribution of maxima statistics from data generated under the null hypothesis (16). Here, null distributions were obtained by first centring the data per condition, i.e. the mean is subtracted from the trials in each condition, such that these distributions have a mean of zero while the shape of the distributions is unaffected. We then bootstrapped these centred distributions (by sampling with replacement), keeping the weights constant (since they are variance stabilizers) as well as the design. We computed 2500 bootstrap estimates per subject. A thousand of these bootstrap estimates were used to compute the family-wise type 1 error rate (FWER), while maxima and cluster maxima distributions were estimated using from 300 to 1,500 bootstrap estimates in steps of 300, to determine the convergence rate, i.e. the number of resamples needed to control the FWER. OLS was already validated in Pernet et al. (2015).

Outlier detection

The PCP method is used in the GLM to obtain weights, not to remove outliers directly. Investigating outlier detection performance allowed us, however, to examine what kind of trials are weighted down and how good the method is at detecting such trials. Figure 3 shows all the results for ERPs simulated with an SNR of 1. Similar results were observed when using an SNR of 2 (supplementary figure 1).

The classification for real ERP data confirmed the simulation results: the PCP algorithm weighted down trials with dynamics different from the bulk. Single-subject analyses (supplementary table 1) and group analyses (figure 4) for WLS showed that trials with a low weight are less smooth than trials with a high weight (higher temporal variance, ~10 vs. 7.26 μV, and power, ~131 vs. 69 dB; lower autocorrelation, 11 vs. 12.25 ms), despite having similar spectra (as expected from data filtering and artefact reduction). This is an important result because weights are determined from the multivariate Euclidean distance and skewness of OLS residuals in the principal component space, which are independent from those metrics but nevertheless capture those temporal variations.
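The centring-and-bootstrap procedure used to build the null distributions can be sketched as follows; the per-frame statistic is a plain one-way F ratio here, standing in for the full mass-univariate analysis, and `max_stat_null` is an illustrative name, not a LIMO function:

```python
import numpy as np

def max_stat_null(data, cond, weights, n_boot=1000, rng=None):
    """Null distribution of maximum statistics via trial centring.

    data    : (n_trials, n_frames) single-channel data
    cond    : (n_trials,) condition labels
    weights : (n_trials,) trial weights, kept fixed across resamples
    Centre each condition on its mean (null: no condition effect),
    then bootstrap trials with replacement and record, per resample,
    the maximum across frames of a between/within variance ratio.
    """
    rng = np.random.default_rng(rng)
    centred = data.astype(float).copy()
    for c in np.unique(cond):
        centred[cond == c] -= centred[cond == c].mean(axis=0)
    n = len(cond)
    maxima = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)               # resample trials
        d, g, w = centred[idx], cond[idx], weights[idx]
        gm = np.average(d, axis=0, weights=w)     # weighted grand mean
        between = np.zeros(d.shape[1])
        within = np.zeros(d.shape[1])
        for c in np.unique(g):
            m = g == c
            if m.sum() < 2:
                continue
            cm = np.average(d[m], axis=0, weights=w[m])
            between += m.sum() * (cm - gm)**2
            within += (w[m, None] * (d[m] - cm)**2).sum(axis=0)
        maxima[b] = (between / np.maximum(within, 1e-12)).max()
    return maxima   # e.g. 95th percentile gives a FWER-0.05 threshold
```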

By comparison, trials with low and high mean weights based on IRLS did not differ on those metrics (temporal variance ~9 vs. 7 μV, power ~126 vs. 65 dB, autocorrelation 12.25 vs. 12 ms), which is expected since IRLS weights are computed independently at each time frame. While 11 out of 18 subjects showed maximum between-trial variance on the same channels for WLS and IRLS, only 28% of low-weight trials were the same between the two methods, and 56% of high-weight trials. Since different trials have low or high weights between methods, this further indicates that the WLS weighting scheme captures trial dynamics well, differing from IRLS, which relies on amplitude variations only.
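For contrast, a per-frame IRLS weighting scheme of the kind limo_IRLS.m implements can be sketched as below (a generic bisquare IRLS, not the actual LIMO code); running it at every time frame yields the frame-by-frame weights that WLS replaces with a single weight per trial:

```python
import numpy as np

def irls_weights(X, y, tune=4.685, n_iter=50):
    """Generic bisquare IRLS at ONE time frame (illustrative sketch).

    X : (n_trials, n_regressors) design matrix
    y : (n_trials,) data at one time frame
    Returns the final trial weights for that frame.
    """
    # leverage correction, as in step (1) of the method
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
    adj = 1.0 / np.sqrt(1.0 - h)
    w = np.ones_like(y, dtype=float)
    for _ in range(n_iter):
        XtW = X.T * w
        b = np.linalg.solve(XtW @ X, XtW @ y)
        r = (y - X @ b) * adj                    # adjusted residuals
        mad = np.median(np.abs(r - np.median(r))) * 1.4826
        u = r / (tune * max(mad, 1e-12))         # standardized residuals
        w_new = np.where(np.abs(u) < 1, (1 - u**2)**2, 0.0)  # bisquare
        if np.allclose(w_new, w, atol=1e-6):
            break
        w = w_new
    return w
```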

Effectiveness of the weighting scheme

The effect of adding outliers on the mean can be seen in figure 5.

The WLS family-wise type 1 error rate (i.e. controlling the error for statistical testing across the whole data space), examined using nullified ERP data from Wakeman and Henson (2015), shows good probability coverage for both maximum and cluster statistics, with 95% confidence intervals overlapping the expected nominal value (figure 6).

To summarize, simulations with synthetic data and nullified real-world ERPs show that using a Satterthwaite approximation of the degrees of freedom of the error, in the context of a WLS solution to the GLM with a single weight per trial, controls the type 1 error rate well.

Performance evaluation at the group level

Repeated-measures ANOVAs using parameter estimates from each method revealed two spatial-temporal clusters for the face effect for both WLS and IRLS, but only the first cluster was declared statistically significant using OLS (table 2). The expected results (9), with full faces having stronger N170 responses than scrambled faces, are replicated for all approaches (start of cluster 1). Maximum differences were observed over the N170 only when using OLS parameters. Using WLS and IRLS gave maxima much later (P280), a result also observed when using TFCE rather than spatial-temporal clustering. In each case, a repetition effect was also observed, in a much more consistent way among methods, with the second presentation of stimuli differing from the 1st and 3rd presentations (figure 7).

The statistical maps show that group results based on WLS parameter estimates lead to smaller F values than those obtained from OLS or IRLS estimates (note the difference in maxima in table 1 and the scale in figure 7), which is confirmed by the median differences in Hotelling T² values.

Considering uncorrected p-values, this shift in F values translates into weaker statistical power for WLS: face effect, OLS = 34% of significant data frames, WLS = 31%, IRLS = 34%; repetition effect, OLS = 39%, WLS = 35%, IRLS = 39%. By contrast, results based on cluster-corrected p-values showed more statistical power for WLS relative to OLS for the face effect (OLS 20%, WLS 22%, IRLS 25% of significant data frames with cluster mass; 3%, 5%, 3% of significant data frames with TFCE), and mixed results for the repetition effect (OLS 31%, WLS 28%, IRLS 31% of significant data frames with cluster mass; 7%, 8%, 7% of significant data frames with TFCE). The comparison of distributions' shapes by deciles is shown in figure 8.

Simulation and data-driven results indicate that the proposed WLS-PCP method is efficient at down-weighting trials with dynamics differing from the bulk, leading to more accurate estimates. Results show that, for ERPs, deriving weights based on the temporal profile provides a robust solution against white noise or uncontrolled oscillations. For biological (pink) noise and amplitude variations which do not alter the temporal profile, the PCP algorithm does not classify outlier trials well, leading to a decrease in detection performance compared with white, alpha or gamma noise. Rather than a defect, we see this as biologically relevant (see below). Importantly, even in those cases of failed detection, the overall correlations with the ground truth remained high (>=0.99). When analyzing real data, differences in amplitude variations were also captured by the PCP/WLS approach, with amplitude variations related to trials which were out of phase with the bulk of the data.

Group-level analyses of the face dataset replicated the main effect of face type (faces > scrambled) in a cluster from ~150 ms to ~350 ms, but also revealed a late effect (>500 ms), observed when using WLS and IRLS parameter estimates but absent when using OLS parameter estimates. Despite more data frames being declared significant with WLS than OLS, effect sizes were smaller for WLS than for OLS and IRLS. The shape of the F distributions when using WLS parameter estimates was, however, more right-skewed than when using OLS or IRLS, leading cluster corrections to declare more data points significant. Indeed, under the null, very similar distributions of maxima are observed for the three methods, leading to more power for the more skewed observed distributions. The interplay between 1st-level regularization, 2nd-level effect size, and multiple comparison procedures depends on many parameters; it is not entirely clear how statistical power is affected by their combination, and this requires deeper investigation via simulations. Empirically, we can nevertheless conclude that group results were statistically more powerful using robust approaches at the subject level than when using OLS.

Because the proposed weighting technique relies on a PCA, it typically requires more trials (observations) than data points (time, frequency, or time per frequency), ensuring a full-rank covariance matrix. From a design perspective, this is a desirable feature. For a 'standard' ERP up to 1 second long, down-sampled at 250 Hz, that means 251 trials in total after artefact rejection (about 70 trials per condition for a 2 x 2 factorial design). One might argue that this is more than usually used, but there is evidence that many more trials than usual are needed for properly powered studies (see (1) for a discussion and references therein).
For power analyses looking at high-frequency modulations, and thus typically analyzing at 500 Hz, this means again 250 data points and thus at least 251 trials, which is more problematic. One can turn to the algorithm to compensate for the lack of trials: one solution may be to use a decomposition for rank-reduced data, while another may be to down-sample the data. Additional simulations varying the number of trials (1500 to 126) and the sampling rate (1000 Hz, 500 Hz, 250 Hz), using white noise outliers, indicate that having more trials than time frames always gives better classification results (as expected), but also that down-sampling gives better results than using ill-conditioned data (supplementary table 4). Down-sampling, however, interacts with the number of outliers (a priori unknown with real data), such that cases with a low sampling rate and a high number of outliers (>40%) have poor performance. In such conditions, computing the data decomposition with ill-conditioned data provides better results. As a practical solution, the default behaviour of the limo_pcout.m function is thus to down-sample the data by 2 if there are not enough trials. If this still results in rank-deficient data, the PCP algorithm computes weights on the original ill-conditioned data. For spectral analysis in the high-frequency domain, which will likely suffer from such data conditioning, this still provides a good (automated) solution. It might also be possible to derive better weights (and thus more robust results) by maximizing the multivariate spread of the data by removing more components (i.e. retaining less than 99% of the variance), using e.g. a data-driven criterion and/or a more robust decomposition. This would, however, require extensive simulations beyond the scope of this paper, which applied the PCP method (4) to create a new WLS scheme.
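The fallback described above can be sketched as follows; the actual logic lives in limo_pcout.m and may differ, and crude every-other-sample decimation stands in here for proper anti-aliased down-sampling:

```python
import numpy as np

def prepare_for_pcp(trials):
    """Sketch of the default fallback: if there are not more trials
    than time frames (so the trial covariance cannot be full rank),
    down-sample the frames by 2; otherwise return the data unchanged.
    If the result is still rank deficient, the PCP step would simply
    proceed on the ill-conditioned data, as described in the text.
    """
    n_trials, n_frames = trials.shape
    if n_trials <= n_frames:
        trials = trials[:, ::2]   # crude /2 decimation (no filtering)
    return trials
```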
Using the trial dynamics (temporal or spectral profile) to derive a single weight per trial makes sense, not just because the observed signal is autocorrelated, but also because it is biologically relevant. Let us first consider the signal-plus-noise model of ERP generation (20)(21)(22). In this conceptualization, ERPs are time-locked additive events running on top of background activity. An outlier time frame for a given trial may occur if (1) the evoked amplitude deviates from the bulk of evoked trials, or (2) the background activity deviates from the rest of the background activity. In the former case, the additional signal may be conceived either as a single process (a chain of neural events at a particular location) or a mixture of processes (multiple, coordinated neural events). In both cases, the data-generating process is thought to evolve over time (auto-regressive), which speaks against flagging or weighting a strong deviation at a particular time frame only: several consecutive time frames likely deviate from most other trials, even if only one time frame is deemed an outlier. In the case of a deviation in background activity, it would mean that, for an extremely brief period, a large number of neurons synchronized for reasons unrelated to the experiment, and for this trial only. Although we do not contend that such events cannot happen in general, this would mean that, in the context of ERP outlier detection, the background activity varies by an amount several folds bigger than the signal, which goes against theory and observations. Let us now consider the phase-resetting model (23,24). In this model, ERPs emerge from phase synchronization among trials, due to stimulus-induced phase resetting of background activity. If a trial deviates from the rest of the trials, this implies that it is out of phase.
In this scenario, deriving different weights for different time frames (i.e. the IRLS solution) means that the time course is seen as an alternation of normal and outlying time frames, which has no meaningful physiological interpretation. Thus, irrespective of the data-generating model, the WLS approach seems biologically more appropriate than the IRLS method.

In conclusion, we propose a fast and straightforward weighting scheme of single trials for statistical analyses. Results indicate that it captures and attenuates ERP noise well, leading to increased estimation precision and possibly increased statistical power at the group level.

Acknowledgements

Thank you to the EEGLAB/LIMO MEEG users who engaged with the beta version, got stuck, but persevered until we solved their issues.