1 1 Robust linear modelling in ElectroEncephaloGraphy can be obtained 2 using single weights reflecting each single trials ’ dynamics 3 4 5

Being able to remove or weigh down the influence of outlier data is desirable for any statistical models. While Magnetic and ElectroEncephaloGraphic (MEEG) data used to average trials per condition, it is now becoming common practice to use information from all trials to build linear models. Individual trials can, however, have considerable weight and thus bias inferential results. Here, rather than looking for outliers independently at each data point, we apply the principal component projection (PCP) method at each channel, deriving a single weight per trial at each channel independently. Using both synthetic data and open EEG data, we show (1) that PCP is efficient at detecting a large variety of outlying trials; (2) how PCP derived weights can be implemented in the context of the general linear model with accurate control of type 1 family-wise error rate; and (3) that our PCP-based Weighted Least Square (WLS) approach leads to in increase in power at the group results comparable to a much slower Iterative Reweighted Least Squares (IRLS), although the weighting scheme is markedly different. Together, results show that WLS based on PCP weights derived upon whole trial profiles is an efficient method to weigh down the influence of outlier data in linear models. Data availability all data used are publicly available (CC0), all code (simulations and data analyzes) is also available online in the LIMO MEEG GitHub repository (MIT license).


Introduction 41
Method 81 82 Trial-based Weighted Least Squares 83 84 An illustration of the method is shown in figure 1. Trial weights are computed as a distance among 85 trials projected onto the main (>=99%) principal components space. Here, the principal 86 components computed over the f time frames are those directions which maximize the variance 87 across trials for uncorrelated (orthogonal) time periods (figure 1B). Outlier trials are points in the 88 f-dimensional space which are far away from the bulk. By virtue of the PCA, these outlier trials 89 become more visible along the principal component axes than in the original data space. Weights  (1) After the OLS solution is computed, an adjustment is performed on residuals by 103 multiplying them by 1/√1 − ℎ where h is a vector of Leverage points (i.e. the diagonal of 104 the hat matrix = ( ′ ) −1 ′ where X is the design matrix). This adjustment is 105 necessary because leverage points are the most influential on the regression space, i.e. 106 they tend to have low residual values (Hoaglin & Welsch, 1978).

107
(2) Residuals are then standardized using a robust estimator of dispersion, the median 108 absolute deviation to the median (MAD), and re-adjusted by the tuning function. Here we 109 used the bisquare function. The result is a series of weights with high weights for data 110 points having high residuals (with a correction for Leverage).

111
(3) The WLS solution is then computed following equation 3. 112  The experiment consisted in the presentation of familiar, unfamiliar, and scrambled faces, 195 repeated twice at various intervals, leading to a factorial 3 (type of faces) by 3 (repetition) design.

196
The procedure followed ( schemes. High and low trials were defined a priori as trials with weights (or mean weights for 219 IRLS) below the first decile or above the 9th decile. We used two samples bootstrap-t on 20% 220 trimmed means to compare these quantities in high and low trials in every participant: temporal 221 SNR (the standard deviation over time); global power (mean of squared absolute values, 222 Parseval's theorem); autocorrelation (distance between the 2 first peaks of the power spectrum 223 density, Wiener-Khinchin theorem). A similar analysis was conducted at the group level averaging 224 across trials metrics. Computations of these three quantities have been automatized for LIMO 225 MEEG v3.0 in the limo_trialmetric.m function.
In mass-univariate analyses, once p-values are obtained, the family-wise type 1 error rate can be 230 controlled using the distribution of maxima statistics from data generated under the null 231 hypothesis (Pernet et al., 2015). Here, null distributions were obtained by first centering data per 232 conditions, i.e. the mean is subtracted from the trials in each condition, such that these 233 distributions had a mean of zero, but the shape of the distributions is unaffected. We then 234 bootstrap these centred distributions (by sampling with the replacement), keeping constant the 235 weights (since they are variance stabilizers) and the design. We computed 2,500 bootstrap 236 estimates per subject. A thousand of these bootstrap estimates were used to compute the family-237 wise type 1 error rate (FWER), while maxima and cluster maxima distributions were estimated 238 using the from 500 to 1,500 bootstraps estimates from the remaining set (e.g. use 500 estimates 239 to build the null distribution of maxima, and test FWER using 1000 draws, redo the analysis with 240 600 estimates to build the null distribution of maxima, and test FWER using again the 1000 241 independent draws, etc

334
Estimation and Robustness 335 336 The effect of adding outliers on the mean can be seen in figure 5 and supplementary figure 2. 337 The standard mean, i.e. the ordinary least squares ERPs, shows an almost linear decrease in 338 Pearson correlations and linear increase in KS distances to the ground truth as the percentage of 339 outlier increases, an expected behaviour since OLS are not robust. Our reference robust 340 approach, IRLS, shows robustness to white noise, alpha, and gamma oscillations with higher 341 Pearson correlations than the OLS. Yet it performed worse than the OLS with pink noise and 342 amplitude

374
Statistical inference for single subjects 375 376 The enough to obtain stable results, and that the errors do not appear at any spatial-temporal 398 locations, i.e. there are no sampling bias (maximum number of error occurring at the same 399 location was 0.05% using maximum statistics and 0.9% using spatial-temporal clustering, see

418
Performance evaluation at the group level 419 420 Repeated measures ANOVAs using parameter estimates from each method revealed 2 spatial-421 temporal clusters for the face effect for both WLS and IRLS, but only the 1st cluster was declared 422 significant using OLS ( and IRLS gave maxima much later (P280), a result also observed when using TFCE rather than 426 spatial-temporal clustering. In each case, a repetition effect was also observed in a much more 427 consistent way among methods with the second presentation of stimuli differing from the 1st 428 and 3rd presentations (figure 7  Group-level analyses of the face dataset replicated the main effect of face type (faces>scrambled) 493 in a cluster from ~150ms to ~350ms but also revealed a late effect (>500ms), observed when 494 using 1st level WLS and IRLS parameter estimates but absent when using OLS parameter 495 estimates. Despite more data frames declared significant with WLS than OLS, effects sizes were 496 smaller (and also smaller than IRLS). The shape of distributions when using WLS parameter 497 estrimates were however more right skewed than when using OLS or IRLS, leading clustering/tfce 498 corrections to declare more data points as significant. Indeed under null, very similar 499 distributions of maxima are observed leading to more power for the more skewed distributions. 500 The interplay between 1st level regularization, 2nd level effect size, and multiple comparison 501 procedures depends on many parameters and it is not entirely clear how statistical power is 502 affected by their combination and requires deeper investigation via simulations. Empirically, we 503 can nethertheless conclude that group results were statistically more powerful using robust 504 approaches at the subject level than when using OLS. 505 506 Using the trial dynamics (temporal or spectral profile) to derive a single weight per trial makes 507 sense, not just because the observed signal is autocorrelated, but also because it is biologically 508 relevant. Let's consider first the signal plus noise model for ERP (Hillyard, 1985;Jervis et al., 1983;509 Shah, 2004). In this conceptualization, ERPs are time-locked additive events running on top of 510 background activity. An outlier time frame for a given trial may occur if 1) the evoked amplitude 511 deviates from the bulk, or 2) the background activity deviates from the rest of the background 512 activity. In the former case, the additional signal may be conceived either as a single process (a 513 chain of neural events at a particular location) or a mixture of processes (multiple, coordinated 514 neural events). In both cases, the data generating process is thought to be evolving over time 515 (auto-regressive) which speaks against flagging or weighting a strong deviation at a particular 516 time frame only. What is likely, is that a minimum of consecutive time frames are seen as 517 deviating, even though only one time frame is deemed an outlier. In the latter case (assuming no 518 artefacts from recordings), a background deviation implies that for an extremely brief period of 519 time, a large number of neurons synchronized for non-experimentally related reasons, and this 520 event did not reoccur in other trials. Although we do not contend that such events cannot happen 521 in general, this means that, in the context of ERP outlier detection, the background activity varies 522 by an amount several folds bigger than the signal, which goes against theory and observations. 523 Let's consider now the phase resetting model (Makeig, S. et al., 2002; Sayers et al., 1974). In this 524 model, ERPs are emerging from the phase synchronization among trials, i.e., the occurrence of a 525 stimulus reset the background activity. If a given trial deviates from the rest of other trials, this 526 implies that it is out-of-phase. In this scenario, deriving different weights for different time 527 frames (i.e. IRLS solution) means that the time course is seen as an alternation of 'normal' and 528 outlying time frames, which has no meaningful physiological interpretation. 529 530 In conclusion, we propose a fast and straightforward weighting scheme for trials based on their 531 temporal (or spectral) profiles. Results indicate that it captures well undesired noise leading to 532 increased precision and possibly increased statistical power (more effect detected) at the group 533 level. 534 535 Acknowledgements 536 537 Thank you to EEGLAB/LIMO MEEG users who engaged with the beta version, got stuck but 538 persevered until we solved their issues. 539 540 541