DN3: An open-source Python library for large-scale raw neurophysiology data assimilation for more flexible and standardized deep learning

We propose an open-source Python library, called DN3, designed to accelerate deep learning (DL) analysis with encephalographic data. This library focuses on making experimentation rapid and reproducible, and it facilitates the integration of both public and private datasets. Furthermore, DN3 is designed in the interest of validating DL processes that include, but are not limited to, classification and regression across many datasets, to prove capacity for generalization. We explore the effectiveness of this library by presenting a general scheme for person disambiguation called T-Vectors, inspired by speech recognition. These are single vectors created from typically short, though arbitrary-length, electroencephalographic (EEG) data sequences that uniquely identify users relative to others. T-Vectors were trained by classifying nearly 1000 people using sequences as short as 1 second, and they generalize effectively to users never seen during training. Generalized performance is demonstrated on two commonly used and publicly accessible motor imagery task datasets, which are notorious for intra- and inter-subject signal variability. On these datasets, subjects can be identified with accuracies as high as 97.7% by simply adopting the label of the nearest-neighbouring T-Vectors, with no dependence on the task performed and little dependence on recording session, even when sessions are separated by days. Visualization of the T-Vectors from both datasets shows no conflation of subjects between datasets, and indicates a T-Vector manifold in which subjects cluster well. We conclude, first, that this is a desirable paradigm shift in EEG-based biometrics and, second, that this manifold deserves further investigation. Our proposed library provides a variety of essential tools that facilitated the development of T-Vectors. The T-Vectors codebase serves as a template for future projects using DN3, and we encourage leveraging our provided model for future work.

Author summary

We present a new Python library to train deep learning (DL) models with brain data. This library is tailored, but not limited, to developing neural networks for brain-computer interface (BCI) applications. There is abundant interest in leveraging DL in the wider neuroscience community, but we have found current solutions limiting. Furthermore, both BCI and DL benefit from benchmarking against multiple datasets and sharing parameters. Our library tries to be accessible to DL novices, yet not limiting to experts, while making experiment configurations more easily shareable and flexible for benchmarking. We demonstrated many of the features of our library by developing a deep neural network capable of disambiguating people from arbitrary lengths of electroencephalography data. We identify a variety of future avenues of study for the representations produced by our network, particularly in biometric applications and in addressing the variation in BCI classifier performance. We share our model, library, and the associated guides and documentation with the community at large.

Machine learning (ML) has long relied on handfuls of openly accessible datasets particular to certain sub-fields of study. The MNIST dataset, for example, has been a common touchstone for image classification and computer vision generally, and its use continues (for better or worse) to this day. While a demonstration of state-of-the-art performance on a single dataset like MNIST may at one time have been sufficient to warrant consideration of a new technique, most areas of ML increasingly compare performance across multiple datasets and tasks. In natural language processing (NLP), for example, the GLUE and SuperGLUE [1] benchmarks express an aggregate score across relevant tasks in the field, varying in dataset size and difficulty. Aspects of the brain-computer interfaces (BCIs) field present a similar but altogether unique variation on this theme. One could loosely divide the work of BCI research (and some of neuroscience more generally) into two streams of inquiry: data collection and data analysis. Provided perfect, or at least consistent, collection, analysis techniques are best validated by collecting new data to test hypothetical analyses. If collection is sufficiently consistent, improvements or subtle changes in collection do not strongly impact the interpretation of analysis experiments. This is true of other ML fields, but unlike BCI, collection in NLP is not a similarly active (or varied) line of research. Instead, quantity of data has taken precedence over quality or nature [2]. BCI research encourages some specialization in analysis or collection, with the potential consequence of less-than-ideal collection or analysis, respectively [3]. Likely as a result of both the difficulty in performing collection and the need to control this confound, thousands of research articles are published per year that solely leverage publicly available BCI datasets [4], which is reminiscent of classical ML. The consequence is that the more general applicability of these results is difficult to judge [5,6]. The MOABB [4] project is a current effort to establish a wide-ranging benchmark that in no small way addresses this concern, but it remains notably agnostic to analysis method and does not strongly integrate privately collected datasets (see the MOABB section).

Intertwined with the desire for stronger generalization claims is the expanded use of DL for BCI classification. While DL is far from established as the preferred analysis technique [5], there is undoubtedly keen interest in exploring it [6]. Software solutions for leveraging DL for BCI currently exist (see the section on prior work and ecosystems) but, to our knowledge, no publicly accessible DL library is as of yet strongly suited to BCI […] are versed in DL generally, DL remains an enigmatic and fast-moving field, with techniques rapidly falling in and out of fashion. It is here that the question of generalizability returns, in a complementary articulation: reproducibility. Often, extreme care is needed to avoid uninformative results, due to the notorious and sometimes dramatic dependence on hyperparameters, which often seem innocuous (at first), to which many DL techniques are prone.

If more generally applicable architectures had a consistent home, and authors wishing to share general models had a semi-standard approach to conform to, a versatile community-driven toolbox of well-motivated techniques could begin to develop and slow the pace of more inexplicable architecture choices. Furthermore, it would allow for stronger benchmarking and provide mechanisms for reproducibility.

Herein we present DN3, the deep neural networks for neurophysiology toolbox. This Python library is designed to leverage both public and private data (potentially integrated together) to train deep neural networks in a rapid and reproducible fashion. In particular, we use MNE-Python's tools for neurophysiological data access [8], storage, and processing, bridged with PyTorch, one of the most common and powerful modern deep learning libraries. Knowledge of these underlying libraries is mostly unnecessary, but their lower-level functionality remains available through DN3. While much of DN3 is tailored to trial-wise BCI classification, it can undoubtedly be used with neurophysiological data more generally. Furthermore, DN3 introduces a unique dataset and experiment configuration tracking module called the Configuratron. This module allows datasets to simply reside in the formats they were recorded in, yet be automatically prepared for DL processes using short, human-readable descriptions. These descriptions can then be easily shared to reproduce work entirely, or to reproduce the design alone when data cannot be shared. Finally, DN3 implements some existing classifiers and techniques, and it will remain open source (under a BSD license) so as to continue adding state-of-the-art techniques to its repertoire, with the goal of remaining convenient for both experts and beginners.

In short, DN3 is best suited to facilitating research at the intersection of deep learning and BCI (and potentially neurophysiological data science more generally, but we focus here on BCI). Experts strongly preferring either of these two fields stand to gain from DN3's consistent abstraction of the challenges in the other field. Furthermore, it can dramatically reduce boilerplate code, it introduces general mechanisms for experiment reproducibility, and it is fully open source.

In the section Prior work and ecosystems, we discuss the current Python ecosystem and alternatives to DN3. Next, we provide a structural overview of DN3 and its important modules. Last, we use DN3 to create a cross-paradigm and cross-hardware technique for subject disambiguation (consequently, identification) that we call T-Vectors. We suggest that this technique would otherwise have been a long and difficult undertaking but, by making good use of DN3, we have produced a (freely reusable) model with minimal code.

T-Vectors are a technique for producing single vectors from variable-length snippets of EEG data that robustly identify subjects. A neural network was pre-trained to classify over 1000 subjects and subsequently generalized, without further training, to completely unseen data, despite that data being recorded in different labs with different hardware. Novel subjects can be identified with well over 90% accuracy using nothing more than the labels of the nearest-neighbouring T-Vectors. We found our representation specifically robust to the task being performed by the subject and to the particular recording session of a sequence of data, yet not confounded by mixing subjects from multiple datasets.

Prior work and ecosystems

Python has not historically been the standard choice for data analysis in neuroscience, even though its large data-science ecosystem and open-source, open-access community cover a wide array of useful techniques for analysis. MATLAB has instead been the neuroscience research standard, and it does in fact include DL tools. However, few if any new DL approaches publish source code in MATLAB, opting instead for either TensorFlow or PyTorch (with a community of researchers and hobbyists constantly translating between the two frameworks). Thus, progress towards merging DL with neuroscience is dependent on MATLAB developers or experts re-implementing entire processes from scratch.

The MNE project, and MNE-Python [8] in particular, is a powerful set of tools for neuroscience data processing, organization, and analysis that furthers the large ecosystem of Python-based data science and is a strong alternative to MATLAB. As such, merging MNE-Python with one of the major Python-based DL libraries is a natural solution for studying neuroscience with DL, and one that has been adopted by prior work in DL with BCI data [9-11]. In the introduction, we discussed a variety of advantages to having a dedicated toolbox for users coming from either BCI or DL; it is worth highlighting why it is preferable not to simply use MNE-Python and a DL library directly for every DL-neuroscience experiment. MNE-Python's toolset is very large and makes few assumptions as to the ultimate application of the data. As such, the efficient development and evaluation of DL processes is nowhere near a first-class concern, and can require significant code development for each application, sometimes resulting in code variations just for minor differences in data.

In essence, we observe that Python is the de facto choice for DL and that MNE-Python provides many dataset utilities that can facilitate merging DL and neuroscience. However, there is room to add a more application-specific layer on top of MNE-Python to reduce boilerplate code, unify applications, and facilitate novice-level DL.

The braindecode package

The braindecode Python package is likely the most similar library to DN3 in many respects. Ostensibly, this package provides utilities to train several well-known neural network architectures as trial-wise classifiers or regressors of EEG and MEG data. Additionally, it features tools to use a variety of datasets, notably providing a bridge to the datasets featured as part of MOABB.

The potential advantages braindecode might have over DN3 include the use of the skorch [12] package (itself a layer above PyTorch), rather than PyTorch alone, as the DL workhorse. As skorch develops, models prepared for, and any tools added to, this (well maintained) library will likely serve to extend braindecode in a way that is not true for DN3. That said, we elected to avoid this due to the limitations caused by the fairly reductionist skorch, which enforces training pipelines that might exclude more eccentric approaches that may prove useful in an area with no apparent standard mechanisms. Consider that implementing adversarial architectures in skorch is difficult and, similarly, that procedures such as meta-learning, like MAML [13] or REPTILE [14], may not even be possible. Adversarial training paradigms have an existing (albeit small) BCI-specific literature [15], while meta-learning is commonly considered for transfer learning problems, itself a keenly sought-after methodology for core DL research and BCI [5,9]. We preferred to err on the side of flexibility in this regard. The braindecode package also provides utilities to enable explainable AI (XAI) techniques (notably from work done by Schirrmeister et al. [10]) that can be used as an (albeit rough) attempt at understanding the operations a neural network is performing. While we do not preclude the addition of an XAI module within DN3, we elected to avoid providing any ready-made XAI solutions for the time being, as there seem to be no apparent standards for this within BCI, and there is a large risk that, if used too liberally, XAI techniques can prove very misleading [16].

This being said, DN3 has notable advantages that were (at the time of writing) not available in braindecode, some of which may be difficult to add without significant re-design. The first and perhaps most interesting difference with DN3 is the Configuratron, which is a unique addition. Furthermore, the Dataset instances that it subsequently constructs have an application programming interface (API) that is readily compatible with many other libraries, such as other deep learning libraries like TensorFlow/Keras, and in fact braindecode itself. Later, we will discuss this API in more detail, but it notably includes a variety of conveniences not similarly provided by braindecode. […]

MOABB

MOABB is a Python library that also leverages MNE to load and, at various stages, represent BCI data (more specifically than DN3). Furthermore, it employs a similar set of abstractions around individual sessions, subjects, and datasets (each consisting of a number of the former abstractions, from left to right). Aside from not being specifically related to DL at all, MOABB makes another fundamentally different (though, for its application, reasonable) choice: to be strict on data, but lenient on process (e.g., the classifier). DN3 relaxes the strictness on data, trying to more effortlessly integrate […]

Structural overview

DN3 was organized around two pillars: preparing data and training deep neural network models. DN3 defines an API that connects these pillars through a single recommended point of interface. Figure 1 gives an overview of the major modules of DN3 and which of the two core aspects each interacts with (left or right of the dashed line).

Fig 1. High-level overview of the major aspects of DN3. The library is organized around ease of interface between dataset processing and representation, and the training of deep neural networks with these data.

We elaborate on the motivation of these modules from left to right, focusing on their purposes rather than explaining the particular API, which is documented (and kept more up to date than a publication) at https://dn3.readthedocs.io/en/latest/.

The Configuratron

Some of the more mundane, but highly common, BCI pipeline tasks, such as renaming or remapping channels, excluding inconsistent or bad subjects and sessions (which, in a small capacity, is also automatically discovered while loading datasets), and adding basic transforms such as normalization, are all specified using these configuration files. The majority of these options are meant to provide a much more efficient way of performing the myriad housekeeping (not really even preprocessing) steps that are, for the most part, handled by MNE-Python, but that must be done consistently for every subject and session (and dataset, if experimenting across datasets). Sharing these configuration files allows data to be loaded in, again, a consistent fashion between different researchers, encouraging reproducibility but also flexibility as to how the analysis aspect will be performed, without ever exchanging raw or processed data or a complete codebase, which itself may be impossible to share, e.g., due to privacy concerns.
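To make this concrete, the following is a minimal sketch of what such a human-readable dataset description might look like, parsed here with a standard YAML loader. The specific keys (toplevel, tmin, tlen, picks, rename_channels, exclude_people) are illustrative assumptions rather than DN3's documented schema; the Configuratron documentation at https://dn3.readthedocs.io/en/latest/ is the authoritative reference.

```python
# A minimal, illustrative dataset description in the spirit of the Configuratron.
# The keys below are assumptions for illustration only, not DN3's real schema.
import yaml  # PyYAML

EXAMPLE_CONFIG = """
datasets:
  my_private_smr:                # hypothetical, locally recorded dataset
    toplevel: /data/smr          # directory of raw recordings, left in native format
    tmin: -0.5                   # seconds before each event marker
    tlen: 4.5                    # total length of each extracted trial (seconds)
    picks: [eeg]                 # keep only EEG channels
    rename_channels: {EEG-Fz: Fz}
    exclude_people: [S04]        # drop an inconsistent subject
experiment:
  lr: 0.001                      # arbitrary extra fields can log DL hyperparameters
  batch_size: 128
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for name, spec in config["datasets"].items():
    print(f"{name}: {spec['tlen']} s trials starting at {spec['tmin']} s, from {spec['toplevel']}")
print("Logged hyperparameters:", config["experiment"])
```

Sharing a small file like this, rather than raw data or a full codebase, is what allows another researcher to reproduce the same data preparation.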
While leveraging standard directory structures and file types makes for minimal coding (the most common structure being a variation used by the datasets found at physionet.org, which hosts a number of commonly used public datasets; recordings and subjects can additionally be identified using a simple pattern-matching scheme), this structure can also be adapted to more custom solutions. Data can be forcibly injected at the session (see Figure 2), person, and dataset level, while still leveraging the remaining configuration options where appropriate. Furthermore, the configuration file behaves as a more general place for logging hyperparameters, by allowing arbitrary configuration elements. These can include preprocessing and transform details, or DL hyperparameters. Finally, a simple import system (accomplished through a YAML directive) allows other configurations to be imported into the experiment. This allows selecting among different hyperparameter sets, or pulling hyperparameters from web-enabled hyperparameter search tools.

Datasets

This module implements the higher-level containers that represent a dataset and its constituent parts. Datasets are specifically comprised of a set of Thinkers that represent each respective subject, and those are each comprised of a number of recordings. Figure 2 further illustrates this hierarchy and shows how MNE Raw and Epochs objects underpin the data layer below these. Thus, while MNE-compatible data is naturally integrated into this scheme, the Recording- and Thinker-level APIs allow for more customized data integration into the Dataset format.

Once data is represented as a Dataset, single data instances are fetched from disk or from system memory, depending on the configuration (important for the larger-scale datasets that are beginning to develop [17]). This is combined with experiment-focused conveniences like leave-one-(or multiple-)subject(s)-out cross-validation or randomized splits, and with the collection of dataset-level data distributions and statistics. Furthermore, DN3 includes a suite of utilities that help develop automated rejection of trials, or other steps to filter for the most pertinent sections of data.
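As an illustration of the subject-wise splitting conveniences described above, the following sketch implements leave-one-subject-out splits over plain Python dictionaries and NumPy arrays; it stands in for, and does not use, DN3's own Thinker and Dataset classes.

```python
# A minimal sketch of leave-one-subject-out (LOSO) splitting of the kind DN3
# provides as a convenience. Each "thinker" is simply a name mapped to a list
# of trial arrays; DN3's Dataset/Thinker objects wrap the same idea.
import numpy as np

def loso_splits(thinkers: dict):
    """Yield (held-out subject, training trials, test trials) for every subject."""
    for held_out in thinkers:
        train = [trial for name, trials in thinkers.items()
                 if name != held_out for trial in trials]
        yield held_out, np.stack(train), np.stack(thinkers[held_out])

# Toy data: 3 subjects, 4 trials each, 22 channels x 1152 samples (4.5 s at 256 Hz).
rng = np.random.default_rng(0)
thinkers = {f"S{i:02d}": [rng.standard_normal((22, 1152)) for _ in range(4)]
            for i in range(3)}

for subject, train, test in loso_splits(thinkers):
    print(f"hold out {subject}: train {train.shape}, test {test.shape}")
```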

Transforms

Common to all BCI analyses is some preprocessing step, or other adjustment of the raw data: for example, simply excluding non-neural recording channels, leveraging those channels to strip away artifacts, or spatial filtering and channel re-weighting. DN3 tries to minimize the time spent on steps like these before each experiment by creating a pipeline that transforms data while it is loaded. This could be as simple as stripping […]

Processes

Rather than limiting analysis to classification or regression using neural network models (or the set of modules that constitute them), DN3 employs a more abstract training Processes API to train... Trainables. The distinction may not be clear at the outset. Undoubtedly, the StandardClassification process is likely to be used the majority of the time, but DL as simply a process of classification or regression with a single loss function is, perhaps now more than ever, not sufficient.

261
There is now a wide variety of end-to-end systems being developed in DL [14,18-20] that cannot be fully characterized within such a frame. A community-driven solution like DN3 could help make more of these available for wider use. Thus, a Process in DN3 is a more general formulation than simply classification or regression, and it is designed to support diverse uses of backpropagation with respect to a (or even multiple) loss function(s). There are some added assumptions that incoming data is, in a sense, […] The advantage of using the DN3 API is that models can be generated based on Dataset instances, meaning according to the incoming channel set, sequence length, sampling frequency, and classification targets. This has been, in our experience, a way to alleviate a major source of tiresome code, particularly in projects that leverage several datasets (like the one we present below).
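To illustrate the idea of generating a model from Dataset properties, the following sketch builds a small classifier from the channel count, sequence length, sampling frequency, and number of targets. The toy architecture is plain PyTorch and an illustrative assumption; it is not DN3's model API.

```python
# A minimal sketch: derive a classifier's shape from dataset properties, in the
# spirit of how DN3 can size models from Dataset instances.
import torch
import torch.nn as nn

def make_classifier(channels: int, samples: int, sfreq: float, targets: int) -> nn.Module:
    kernel = max(3, int(0.1 * sfreq))        # temporal kernel spanning roughly 100 ms
    out_len = (samples - kernel) // 2 + 1    # sequence length after the strided convolution
    return nn.Sequential(
        nn.Conv1d(channels, 32, kernel_size=kernel, stride=2),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * out_len, targets),
    )

# For example, 22 channels, 4.5 s trials at 256 Hz, and a 4-class SMR task.
model = make_classifier(channels=22, samples=1152, sfreq=256.0, targets=4)
logits = model(torch.randn(8, 22, 1152))     # a batch of 8 trials
print(logits.shape)                          # torch.Size([8, 4])
```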

Processes would not be complete without the addition of Metrics to evaluate the suitability of Trainables. To this end, we have taken advantage of the existing Python data science ecosystem, and we provide utilities to leverage the many metrics provided by standard choices like sklearn [21] through wrapper functions and decorators.
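A minimal sketch of this kind of wrapping is shown below, adapting sklearn metrics so they can be called directly on model outputs; the wrapper itself is an illustrative assumption rather than DN3's actual utilities.

```python
# Wrap sklearn metrics so they accept (logits, labels) tensors directly.
import torch
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

def from_logits(metric_fn):
    """Adapt an sklearn metric to take a batch of logits and integer labels."""
    def wrapped(logits: torch.Tensor, labels: torch.Tensor) -> float:
        preds = logits.argmax(dim=-1).cpu().numpy()
        return float(metric_fn(labels.cpu().numpy(), preds))
    return wrapped

balanced_accuracy = from_logits(balanced_accuracy_score)
kappa = from_logits(cohen_kappa_score)

logits = torch.randn(16, 4)                  # fake model outputs over 4 classes
labels = torch.randint(0, 4, (16,))
print(balanced_accuracy(logits, labels), kappa(logits, labels))
```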

In this section, we present original work that showcases some of the unique advantages of DN3. We showcase integrating multiple datasets, BCI-dataset-specific utilities, and capitalizing on the wider Python ecosystem.

The recorded features of most subjects performing BCI tasks can vary dramatically both within and between subjects, making universally applicable classifiers difficult to develop [9,22-24]. The sensory motor rhythm (SMR) paradigm in particular exhibits a form of subject-dependent variability in which some 20-30% of people seem incapable of even eliciting the expected features for classification in the first place [22,23]. The nature of this dramatic variability is a topic of continual research but is not nearly fully characterized. Prior work groups this variability into psychological, physiological, anatomical, and demographic differences, though integrated studies of these factors are notably lacking [23]. These are perhaps overly inclusive categories, capturing all known correlates of performance. While these factors potentially inform the cause of the signal variations, the goal nonetheless remains that new users (or new sessions of existing users) can quickly use BCI systems [5,25,26]. To this end, previous work has found that identifying similar users implies what configuration of features and/or classifiers may be effective for a new context [26]. In other words, transfer learning (TL) can be enhanced by identifying sufficiently similar users.

More than just identifying similar users, Riemannian geometry-based covariance matrix classification is a large step towards better TL in BCI [5,27]. This alternative represents trials using the covariance matrix of the channels of each trial, resulting in a symmetric positive definite (SPD) matrix confined to its associated differentiable manifold M (i.e., the manifold of the set of SPD matrices). The advantage of this approach is the robustness of a trial's representation on M under a variety of conditions, notably under a variety of BCI preprocessing and recording techniques [5,27]. Specifically, this means that these robust representations allow for better subject-to-subject transfer and stability across different data collection efforts [5]. Prior work has shown that trials of particular users tend to cluster along M, and transfer learning can be further improved if these representations are shifted along M to a common centralized location [28]; however, this requires the use of Riemannian geometry to establish a central location along M, which is cumbersome. Thus, the stability afforded by confining trials to the invariant space M is traded off against the limitation in features asserted by the SPD representation and the inconvenience of the geometry, to which only some classification algorithms naturally extend.
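As a small worked example of this representation (independent of DN3), the following computes the covariance matrix of a single trial and checks that it is symmetric and positive definite; the shrinkage term is a common practical choice, assumed here, that guarantees positive definiteness even for rank-deficient trials.

```python
# Covariance-matrix trial representation used by Riemannian BCI methods:
# each (channels x samples) trial becomes a (channels x channels) SPD matrix.
import numpy as np

def trial_covariance(trial: np.ndarray, shrinkage: float = 1e-6) -> np.ndarray:
    """trial: (channels, samples) array."""
    channels, samples = trial.shape
    centered = trial - trial.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / (samples - 1)
    return cov + shrinkage * np.eye(channels)   # small ridge keeps the matrix SPD

rng = np.random.default_rng(0)
trial = rng.standard_normal((22, 1152))         # 4.5 s of 22-channel EEG at 256 Hz
C = trial_covariance(trial)
eigvals = np.linalg.eigvalsh(C)
print(C.shape, "symmetric:", np.allclose(C, C.T), "positive definite:", bool(eigvals.min() > 0))
```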

In the work we present below, we have first and foremost asked whether a deep neural network can be used to create a robust (invariant) descriptive vector of any EEG user. Ideally, this vector could identify a particular user juxtaposed with others, (in the best case) irrespective of the hardware used or activity performed. This should consequently be a model of latent subject-wise variation, where features or directions in this space may be descriptive of subject-wise differences. This is an indirect and data-driven modelling of latent differences, rather than an explicit account of these differences.

Towards application, this representation could be immediately leveraged to identify similar subjects and, as noted above, may be informative of how best to classify their features. However, it presents a variety of additional opportunities. As a consequence, the range of these representations creates a manifold, somewhat like M above, but […] This lens was useful insofar as it suggested a possibly effective method to deal with the known phenomenon of person-wise variation, one that was readily transferable to BCI data: summarize utterances into a fixed-length representation that is able to classify a large set of users. I-Vectors and their variants [29] are historically effective but, with more speech recognition being performed by DNNs, the similar X-Vector approach is common and seemingly more accurate [29].

Thus, herein we investigate whether DNN-derived X-Vectors, used mostly as they were presented in the automatic speech recognition (ASR) literature, are capable of modelling thinker, rather than speaker, instance differences. We call these vectors trained to identify thinkers: T-Vectors.

We would be remiss not to also mention that, while our own motivation comes from a desire to model the latent space of inter-subject variance, our proposed solution (if effective) is also very relevant to a body of research in EEG-based biometrics [30,31].

An ideal training dataset here is one that is as general and representative as possible (while remaining practically tractable), so that it transfers well to many applications. In this case, it should specifically represent a multitude of different people, performing various tasks, across many demographics and recording contexts, preferably separated in time. To our knowledge, the closest-fitting publicly available project was the Temple University Hospital EEG Corpus (TUEG), which advertises data from individuals as young as less than a year old to over 90, a split of 51% female, and sessions separated by up to eight months [17]. It consists of clinical EEG recordings from Temple University Hospital using a limited variety of monopolar EEG systems, all with a roughly 10/20 electrode configuration, in addition to a variety of auxiliary channels including electro-ocular (EOG), cardiographic, and myographic recordings. The majority of the recordings were made with sampling frequencies of 250 Hz or 256 Hz, though some were higher (typically multiples of these, such as 512 Hz or 1024 Hz) [17]. We limited our analysis to the version 1.2.0 subset of TUEG, which, after rejecting outlying data (see Preprocessing), featured 1364 people (using the whole project, which includes other versions of the data, results in over ten thousand targets; this has proven more challenging and will be considered in future work), roughly an order of magnitude more than any prior work in EEG user identification that we were aware of [30,31].

For downstream subject disambiguation (datasets considered once the T-Vector model was determined), we focused on two SMR datasets, due to the task's known intra- and inter-subject variability. First, the well-known BCI Competition IV dataset 2a (BCIC) [32], which we selected for its clear isolation of two distinct recording sessions separated by days, and for its limited subject set. The limited subject set was particularly valuable for visual examination. It similarly featured a largely 10/20 monopolar recording setup of 22 EEG channels, 3 EOG channels, and a single event trigger channel (the EOG and, of course, trigger channels were simply discarded in this work). Secondly, we used the easily accessible movement and motor imagery database (MMI) [33,34].

This featured 109 subjects using 64 channels in a 10/10 channel configuration, sampled ostensibly at 160 Hz. Four of these subjects were inconsistently sampled and excluded from training (see the uploaded configuration file for more details), leaving 105 possible subjects. Note that these datasets (particularly MMI) also featured in prior work on subject identification for biometric applications [30,31].

All three datasets were simply downloaded from publicly available locations: TUEG (https://www.isip.piconepress.com/projects/tuh_eeg/html/downloads.shtml; access requires a simple sign-up process), MMI (https://physionet.org/content/eegmmidb/1.0.0/), and BCIC (originally from http://www.bbci.de/competition/iv/, which requires a request for access; we found some difficulty with these files and instead converted the unrestricted copy found at http://bnci-horizon-2020.eu/database/data-sets to the native MNE (raw.fif) format). The Configuratron automatically prepared each of the datasets using the configuration files available with the project. This included renaming a variety of channels, notably from TUEG, excluding a number of uninformative sections of data, and more. The most consequential of these steps is presented in the following section; the rest can otherwise be found in the published source code.

Preprocessing

[…] An ordering for the channels was then determined, so that the same EEG channel was consistently found at the same tensor index; e.g., the FP1 channel was always found at index 1. This was necessary to allow the use of data across different recording hardware, and it has the benefit of allowing previously trained models to be transferable to other applications. DN3 has a system for doing this automatically, called the Deep1010 mapping, which consistently maps 77 EEG channels, 2 earlobe reference electrodes (this covers the very common 10/20 channel scheme and additionally adds the mid-way points to mostly cover the 10/10 extension [35]), and 11 auxiliary channels comprising 4 electro-oculogram channels (horizontal and vertical for the left and right eyes) and 7 miscellaneous channels that allow for the integration of other sources of potential artifacts, such as electro-cardiograms and myograms, additional reference channels, or other more application-specific channels. Ultimately, when training and evaluating T-Vectors, all but the EEG channels were removed. We explain the Deep1010 as it is a unique feature of DN3.

We cropped non-overlapping and reasonably dissimilar sequences by taking 1280 samples (at a sampling frequency of 256 Hz, corresponding to 5 seconds) every 7680 samples (30 seconds). In other words, the first 5 seconds of every 30 from each recording were considered for training. Herein these cropped sequences are referred to simply as points. Due to the extreme size of the TUEG dataset, it was difficult to triage the entire dataset, but a cursory look found notable artifacts and many cases of channels with absent (zero, or minimally varying) data. Scaling each point to lie between -1 and 1, we considered the overall distribution of these points and determined upper and lower bounds on the standard deviation of viable training points. (This scaling is also a default aspect of the Deep1010 mapping. Additional default behaviour includes mapping one of the auxiliary channels to a global scale parameter, which represents the factor by which a particular point's maximum absolute value, the 1 or -1 value in the point, relates to the absolute maximum value in the dataset, as specified in the configuration file. This allows for consistently scaled trials while still informing models of the scale of the point's context; it was not used for T-Vectors, to minimize any average amplitude being suggestive of identity.)
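A minimal sketch of this cropping scheme, written with NumPy rather than DN3's loading utilities, is given below; the numbers follow the text (1280-sample crops taken every 7680 samples, i.e., the first 5 seconds of every 30 at 256 Hz).

```python
# Keep the first 1280 samples (5 s at 256 Hz) of every 7680-sample (30 s) window.
import numpy as np

def crop_points(recording: np.ndarray, crop_len: int = 1280, stride: int = 7680) -> np.ndarray:
    """recording: (channels, samples); returns (n_points, channels, crop_len)."""
    channels, samples = recording.shape
    starts = range(0, samples - crop_len + 1, stride)
    return np.stack([recording[:, s:s + crop_len] for s in starts])

rng = np.random.default_rng(0)
recording = rng.standard_normal((19, 256 * 300))   # 5 minutes of 19-channel EEG
points = crop_points(recording)
print(points.shape)                                 # (10, 19, 1280): one point per 30 s window
```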

This is described in more detail in Appendix A, but it suffices to say that a very significant proportion of the data had curiously low or high variation as compared to more controlled, smaller-scale datasets, and these extremes were removed. We employed a variety of readily available tools from the DN3 library to accomplish this.

The BCIC dataset was prepared as if for classification of the associated 4-way SMR task (imagined left and right hand movement, foot movement, and tongue movement) [32]. Points were determined by taking 4.5-second crops from 0.5 seconds before the event marker until 4 seconds after; this time period has been shown in prior work to maximize the event-related signal for previous neural network classifiers [9,10]. Similar to some of the TUEG data, the sampling frequency of 250 Hz was upsampled to 256 Hz. […] Each point was scaled such that its largest value was 1 and its smallest value was -1 (note this was not per channel; the entire point/trial was scaled and shifted by the same factors), which was also done when extracting T-Vectors from downstream data.

Here, we simply employed the network used to create X-Vectors [37], but reduced the hidden size of the network to 384 rather than 512. Each T-Vector is then the 384-element-long hidden representation from the layer segment6 (as labelled by the original authors [37]). These are the activations of the second-to-last layer (third-to-last if including the final softmax), before the non-linearity. The entire network was of course subject to training during the pre-training stage, with the final softmax creating a distribution over the 1364 targets (people).
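For orientation, the following is a hedged sketch of an X-Vector-style network of the kind described above: TDNN frame-level layers, statistics pooling, and two segment-level layers with a hidden size of 384, the first of which yields the T-Vector. The kernel sizes, dilations, and channel counts follow the commonly published x-vector recipe and are assumptions here; the released weights and source code are the authoritative reference for the exact network used.

```python
# A sketch of an x-vector-style network with hidden size 384; the T-Vector is the
# pre-activation output of the first segment-level layer ("segment6"). Layer
# hyperparameters are assumptions following the general x-vector recipe.
import torch
import torch.nn as nn

class TVectorNet(nn.Module):
    def __init__(self, in_channels: int = 77, hidden: int = 384, n_people: int = 1364):
        super().__init__()
        def tdnn(cin, cout, k, d):
            return nn.Sequential(nn.Conv1d(cin, cout, k, dilation=d), nn.ReLU(),
                                 nn.BatchNorm1d(cout))
        self.frame_layers = nn.Sequential(
            tdnn(in_channels, hidden, 5, 1), tdnn(hidden, hidden, 3, 2),
            tdnn(hidden, hidden, 3, 3), tdnn(hidden, hidden, 1, 1),
            tdnn(hidden, hidden * 3, 1, 1))
        self.segment6 = nn.Linear(hidden * 6, hidden)     # T-Vector = this layer's pre-activation
        self.segment7 = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.output = nn.Linear(hidden, n_people)

    def forward(self, x: torch.Tensor, return_tvector: bool = False):
        h = self.frame_layers(x)                                    # (batch, 3*hidden, frames)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)     # statistics pooling
        tvec = self.segment6(stats)                                 # 384-element T-Vector
        if return_tvector:
            return tvec
        return self.output(self.segment7(tvec))                     # logits over the people

net = TVectorNet()
eeg = torch.randn(4, 77, 1280)    # 4 points over the Deep1010 EEG channels, 5 s at 256 Hz
print(net(eeg).shape, net(eeg, return_tvector=True).shape)
```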

Optimization was performed in batches of 128 using the ever-popular Adam optimizer [38] with a default learning rate of 0.001. This rate was divided by 10 at epochs 50 and 75 of the 100 epochs of training. A minimal L2 weight decay was added to the loss at a factor of 0.00001 (larger values appeared not to separate clusters as well during tuning with a small subset of 100 people).
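This schedule amounts to the following, sketched with a toy stand-in model and random data in place of the T-Vector network and the TUEG points; only the optimizer, learning-rate milestones, weight decay, and batch size follow the text.

```python
# Adam at 0.001, divided by 10 at epochs 50 and 75 of 100, weight decay 1e-5, batches of 128.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Sequential(                      # toy stand-in for the T-Vector network
    torch.nn.Conv1d(77, 16, kernel_size=25, stride=4), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(), torch.nn.Linear(16, 1364))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# Stand-in data: 256 random "points" of 77 channels x 1280 samples, labelled among 1364 people.
data = TensorDataset(torch.randn(256, 77, 1280), torch.randint(0, 1364, (256,)))
loader = DataLoader(data, batch_size=128, shuffle=True)

for epoch in range(100):
    for eeg, person in loader:
        optimizer.zero_grad()
        loss_fn(model(eeg), person).backward()
        optimizer.step()
    scheduler.step()    # learning rate: 1e-3, then 1e-4 after epoch 50, then 1e-5 after epoch 75
```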

To minimize sensitivity to any expected length, we further cropped each loaded batch uniformly to 20%-100% of its original length. Thus, the network is trained to identify users using as little as a single second's worth of data, with no particular consistency in task. This strategy is implemented as a DN3 batch transform, and it is part of the available transforms in the library.
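A minimal sketch of this batch-level cropping is given below as a plain function; the uniformly sampled length fraction (20%-100%) follows the text, while the random start offset is an assumption.

```python
# Crop an entire loaded batch to a random fraction (0.2 to 1.0) of its length.
import torch

def random_temporal_crop(batch: torch.Tensor, low: float = 0.2, high: float = 1.0) -> torch.Tensor:
    """batch: (batch, channels, samples); all items are cropped to the same random length."""
    samples = batch.shape[-1]
    crop = int(torch.empty(1).uniform_(low, high).item() * samples)
    start = int(torch.randint(0, samples - crop + 1, (1,)).item())
    return batch[..., start:start + crop]

batch = torch.randn(128, 77, 1280)        # a batch of 5 s points at 256 Hz
print(random_temporal_crop(batch).shape)  # e.g. torch.Size([128, 77, 883])
```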

At the end of pre-training, the model weights were frozen and no longer updated.

The final weights used can be downloaded here, and DN3 provides tools to easily recreate the network with these weights. Finally, the T-Vector representations of each point of each downstream dataset were collected and saved for analysis.

Analysis of vectors

We analyze T-Vectors with two complementary approaches. First, we consider a simple supervised prediction of notable variables using k-nearest-neighbours (k = 5 is used throughout). Naturally, subject identity was the most critical variable considered, but we also included: session identity, which task was being performed (e.g., canonical trial labels such as left- versus right-hand motor imagery), and dataset prediction (mixing T-Vectors from both downstream SMR datasets). The additional variables were included to highlight whether the spatial distribution of the points was informative of any other known quantity besides subject identity. These were all compared using 5-fold cross-validation, stratified by prediction target; that is, each fold had an equal (as possible) percentage of each target variable. Our second analysis visualizes the manifold of T-Vectors using t-distributed stochastic neighbour embeddings (t-SNE), noting how readily separable the T-Vector space appeared, and whether there were any ready interpretations of behaviour outside of this. Throughout, we set the perplexity of the t-SNE operations to 30.

Table 1. Accuracy of predicting targets over five-fold cross-validation using the majority label of the five nearest neighbours. Only the prediction of subject and dataset scored appreciably over chance (all but the dataset target were balanced; chance-level prediction was 1/(number of targets)). Sensitivity to variation in T-Vectors was accounted for by comparing single T-Vectors with an average of four sequential vectors. This averaging showed a consistent, though mild, trend in subject prediction. The MMI/Dataset row is dashed because it is the same experiment as BCIC/Dataset (and has the same uniform 100% prediction).

[…] In all but one case, prediction was well over 90% accurate at identifying the subject (and the remaining case was very near this point). Conversely, the local space of the T-Vector representation was markedly less informative of which session or task was being performed, irrespective of dataset. Interestingly, identifying which dataset T-Vectors belonged to was profoundly accurate, making no mistakes. Noting that this variable can be seen as a mixture of subject variation and some other dataset-specific variation, it is clear that some information besides subject identity is encoded by the T-Vector representation. At the very least, whatever confused the subject identity prediction within a dataset did not extend between the datasets considered.
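The two analyses amount to standard scikit-learn calls, sketched below on random stand-in vectors; the choices of k = 5, stratified 5-fold cross-validation, and a t-SNE perplexity of 30 follow the text.

```python
# 5-nearest-neighbour prediction under stratified 5-fold CV, plus a t-SNE embedding.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
tvectors = rng.standard_normal((500, 384))        # stand-in for extracted T-Vectors
subjects = rng.integers(0, 9, 500)                # stand-in subject labels

knn = KNeighborsClassifier(n_neighbors=5)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("per-fold accuracy:", cross_val_score(knn, tvectors, subjects, cv=cv).round(3))

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(tvectors)
print(embedding.shape)                            # (500, 2) points ready to plot
```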
Fig 3. […] Averaging four sequential vectors.

We observe a general tendency for points from the same subject to have a common localization in Figure 3, becoming stronger after pooling. This suggests that the vector representation is robust under different conditions. The BCIC dataset allowed for […] (predominantly blue, orange, green, and purple, respectively, although other colours also added to the mix). While patterns like this are hard to interpret using a single t-SNE plot (or even several, for that matter), a reasonable correlation between this zone and subject-specific performance was also observed. The subjects for which we observed the lowest single-vector (subject) classification were also A01, A02, A03, and A05, all of whom scored below 90% (with the remaining subjects scoring above this mark). The supplementary table in Appendix B provides more subject-specific details. After considering the subject-specific performance in conjunction with Figure 3, we concluded that the visualization is representative of how T-Vectors separate data from unseen subjects performing unseen tasks, with novel hardware. In other words, T-Vectors do seem to generalize to new datasets and subjects without any further adaptation.

Fig 4. […] is that each subject has roughly one cluster to account for them, with minor ambiguity. By visual inspection, we noted 8 representative clusters for 9 subjects for BCIC and 104, perhaps 105, clusters out of 105 subjects for MMI.

This pattern of subject separability is all the more clear in Figure 4, in which the two colours represent the two downstream datasets. Rather than isolating two major groupings (one for each dataset), the pattern indicates sets of localized structures correlated with subject identity. Many small pockets of data abound and, after counting all groupings that did not consist of individual points, the number of MMI clusters is either 104 or 105 (the center region has some ambiguity), which corresponds to the 105 subjects used from this dataset (recalling the high accuracy from Table 1, these groupings were largely homogeneous). The BCIC groupings appear to provide approximately 8 groupings, one fewer than the total subjects, but with a distinct cluster reminiscent of the most ambiguous region of Figure 3b, although this is not conclusive. Very little changed when adjusting for the differences in recorded channels; focusing only on the 22 EEG channels common to both datasets, Figure 4a looks largely identical to Figure 4b albeit, in the former, the MMI dataset appeared to encircle the BCIC dataset. We therefore conclude that, while determining the source dataset of a particular T-Vector is readily apparent from its neighbours (see Table 1), this was (for the most part) a consequence of separating the independent groups of subjects. In other words, a stability of subject-wise representation is shown across datasets that were recorded very differently.

T-Vectors are promising for identifying individuals using minimal EEG recordings. While these vectors may be effective as presented, we expect that, without introducing some notion of fairness [39], biases are likely to occur. For example, we might observe greater sensitivity for demographic intersections that are well represented in the data than for other demographics.

We also warn against quickly interpreting T-Vectors (or features therein) as strongly correlated to intersections of demographics or to notions of personal characteristics (e.g., intelligence). It would be an error to describe T-Vectors as an "objective" representation of variability; instead, they simply capture some latent features that seem to disambiguate individuals well. Specific investigation into any correlations needs to be done subsequently, with consideration of the likely data biases mentioned above.

With these considerations in mind, we intend to explore possible correlations between T-Vector components and markers of mental health, in addition to better qualifying how sensitive T-Vectors are to sessions separated by longer time scales, more varied hardware, other recording modalities (e.g., transfer to magnetoencephalography), and performed tasks. While capturing correlations like this is certainly a potentially interesting avenue, T-Vectors may also be informative of the scale of these differences, e.g., by distance along a direction of variation, a somewhat unique aspect of this continuous latent-space approach.

As mentioned in our motivation, prior work in transfer learning has considered the use of adaptive classifiers, whose major mechanism of performance transfer comes […]

Additionally, throughout all of the t-SNE visualizations there are a variety of singular points scattered about. While these were minimized after averaging sequential vectors (see Figure 3b), they are never removed entirely. It would be prudent to consider what these outliers are if they remain after further development. Could they, for instance, be overly contaminated with muscular artifacts, or have other notable characteristics? In this way, it is worth considering whether T-Vectors may also prove to be a quick form of data triage, detecting usable versus non-usable data given an expected template T-Vector.

In terms of biometric applications, the performance levels presented above are similar to previous work. For example, recent work with the MMI dataset was over 99% accurate, with a reduced channel set and a shorter time window [40]. However, this result was achieved by training a DNN with the first 90% of each session's data and predicting the identity of the user with the remaining data. While this is not uncommon in the literature [30], this paradigm does not necessarily generalize across datasets or hardware, and it may introduce channel effects and even information leakage, which artificially boosts performance. We are unaware of any prior work that does in fact generalize in this fashion, but it is clear that this is a desired property for the application [30]. We therefore suggest that T-Vectors represent the state of the art despite not reporting the greatest performance. Towards resolving a claim like this, previous work has considered the development of a score for EEG biometrics [31], called a U score, but we find some difficulty in fairly calculating U for our own work, as no other prior work performed such a great degree of pre-training to develop their features in the first place.

Specifically, U rightfully aims to minimize the amount of time needed to identify a subject, which is determined in terms of the total recording time used for training. It is ambiguous how pre-training would be factored in. If it is, then by virtue of this large number our method has an inconsequential score. If we focus only on the time needed to develop a single prediction, i.e., the time embodied by the nearest neighbours used for prediction, our own method outperforms the previous best score by 3.5%. However, in the interest of not excluding pre-training approaches, and of further extending the score to evaluate generalization across multiple datasets, we propose some revisions in Appendix C.

The source code for this project can be found at https://github.com/SPOClab-ca/T-Vectors and can serve as a template for other DN3-based projects. The particular T-Vector weights used in these analyses can be downloaded here.

We have presented DN3, a new Python-based deep learning library and set of APIs for BCI and more general neuroscience applications. DN3 aims to increase reproducibility while minimizing redundancy, at little to no expense in flexibility. Furthermore, we aim for it to be a community-driven solution for engaging with DL techniques that might otherwise be inaccessible. Intended additions to DN3 include more applications of large datasets, such as semi- and self-supervised tasks, meta-learning optimization, and adversarial learning processes. All the while, we hope to continue to integrate more multi-purpose preprocessing steps, transforms, and trainable modules as they develop. These goals ultimately can only be confirmed by the community at large, but we have presented a unique application of DL with EEG data that was considerably streamlined through the use of DN3.

A. TUEG dataset triage

Unlike the MMI and BCIC datasets, the TUEG dataset had little if any inspection for data quality. Furthermore, the number of channels can vary; in some instances, files indicate that a channel was recorded, but it ends up blank. As this scale of data is profoundly useful, but difficult to visually inspect for artifacts, errors, and the like, we considered the difference in distribution of an easily acquired statistic, the standard deviation of channel values for each trial, to narrow the focus onto which trials were likely best representing good data.

Log-histogram comparing the distribution of channel-wise standard deviations for all normalized points (cropped training sequences) in the MMI and TUEG datasets. Notice the large peak for TUEG at a standard deviation of 0, and the surprising (in relation to the MMI dataset) increase in deviations ≥ 0.8. We chose to narrow the selection of usable TUEG trials to those between the red lines (0.04 ≤ σ_trial ≤ 0.45), simply so that the noted extremes were rejected.
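A minimal sketch of this triage rule is given below on stand-in data. The bounds follow the figure (0.04 and 0.45); applying them to every channel of an already-scaled point is an assumption, and the released configuration files are definitive.

```python
# Keep a point only if all of its channel-wise standard deviations fall within bounds.
import numpy as np

def keep_point(point: np.ndarray, low: float = 0.04, high: float = 0.45) -> bool:
    """point: (channels, samples), already scaled to lie between -1 and 1."""
    sigma = point.std(axis=1)                      # one standard deviation per channel
    return bool(np.all((sigma >= low) & (sigma <= high)))

rng = np.random.default_rng(0)
good = np.clip(rng.standard_normal((19, 1280)) * 0.2, -1, 1)   # plausible EEG point
flat = np.zeros((19, 1280))                                     # blank channels...
flat[0] = np.clip(rng.standard_normal(1280), -1, 1)             # ...except one
print(keep_point(good), keep_point(flat))                       # True False
```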