Towards a real-world brain-computer interface for image retrieval

Brain decoding — the process of inferring a person’s momentary cognitive state from their brain activity — has enormous potential in the field of human-computer interaction. In this study we propose a zero-shot EEG-to-image brain decoding approach which makes use of state-of-the-art EEG preprocessing and feature selection methods, and which maps EEG activity to biologically inspired computer vision and linguistic models. We apply this approach to solve the problem of identifying viewed images from recorded brain activity in a reliable and scalable way. We demonstrate competitive decoding accuracies across two EEG datasets, using a zero-shot learning framework more applicable to real-world image retrieval than traditional classification techniques.

Research in the field of Brain-Computer Interfaces (BCI) began in the 1970s [1] with the aim of providing a new, intuitive, and rich method of communication between computer systems and their users. Typically, these methods involve measuring some aspect of neural activity and inferring or decoding an intended action or a particular characteristic of the user's cognitive state. Although BCI is still in its infancy, there are already practical applications in assistive technology as well as disease diagnosis [2,3]. Brain-controlled prosthetics [4] and spellers [5] have shown their potential to enable natural interaction in comparison with more traditional methods, such as mechanical prosthetics or eye-movement-based spellers. Other relevant applications include identifying the image that a user is viewing, usually referred to as image retrieval, which is of particular interest in the field of visual attention applied to advertising and marketing, searching and organising large collections of images, or reducing distractions.

The strength and timing of signal peaks can be quantified and analysed automatically. Event-related potential (ERP) analysis is well established and has strong applications in medical diagnosis [6] and in cognitive neuroscience research [7,8]; however, the broad characterisation of brain response used in traditional ERP methods is not richly informative enough to decode the level of detail required to make predictions about a participant's cognitive state, as required for BCI image identification.

Given the complexities of decoding the nature of an arbitrary visual stimulus from a person's brain activity, cognitive neuroscientists and BCI researchers have traditionally tackled the simpler task of determining which of some finite set of category labels corresponds to a particular pattern of brain activity.
In one of the first such studies, Haxby and colleagues [9] collected functional Magnetic Resonance Imaging (fMRI) data as participants viewed a series of images from the categories of human faces, cats, houses, chairs, scissors, shoes and bottles, along with images of random noise. The researchers were able to determine with 83% accuracy which category of object the participant was viewing.

However, fMRI is impractical for general BCI applications. Murphy et al. [10] used Electroencephalography (EEG) rather than fMRI and achieved 72% accuracy in classification across the two classes of mammals and tools. While this study addressed a much simpler problem with only two possible classes, it demonstrated category decoding using relatively inexpensive and less intrusive EEG data collection methods (fMRI and EEG technologies are discussed in more detail in the 'Brain Data' Section).

In the studies mentioned above, the classifiers did not determine specifically which stimulus image was displayed (as required for image retrieval); instead they only determined the category to which the stimulus image belongs. Moreover, as a classification approach, this is not scalable to new classes and, although it may yield high accuracy, it becomes less accurate as the number of classes increases. An alternative approach to BCI image classification makes use of rapid serial visual presentation (RSVP) [11,12]. The participant is presented with a rapid stream of images (approximately 10 per second) and is instructed to count the number of times a particular target image or object appears. A classifier can then reliably decode whether, for a given segment of brain data, the participant had been presented with a target or non-target image.
This RSVP approach could be more directly applied to our problem by showing a participant a target image from a gallery, and then presenting all of the images in the gallery one by one, with the expectation that the target image should elicit neural activity sufficiently different from the non-target images to identify it.

However, as the number of images in the gallery grows, it becomes impractical to present them in a real-world searching scenario.

As a more scalable solution, zero-shot learning presents a novel approach to brain decoding in which some feature space is created which can describe each stimulus class, and a mapping is defined between neural activation data and the stimulus feature space. Such a mapping can be defined with a subset of the full set of classes and/or instances, and tested using withheld classes/instances. With this approach, the system can decode arbitrary stimulus images it has not yet been exposed to. Introducing a feature-based model comes at a cost, however, as it also impacts the overall accuracy of the system. At present, most zero-shot systems in this area [13,14] suggest that semantic models can be a feasible approach to image decoding in a real-world BCI framework.

In this project, we aim to make use of these different levels of information in brain activity explicitly, by using specifically chosen feature generation models, rather than implicitly, by grouping our images into different categories. We also aim to perform a more difficult task: where Carlson et al. [16] utilise zero-shot learning only to determine membership of the stimulus to a particular object category, our approach will aim to determine which actual image was viewed. To this end, this paper proposes an EEG zero-shot learning framework for individual image retrieval which makes full use of both advanced visual and semantic image features. This approach is motivated by the ultimate goal of designing a system which can retrieve any arbitrary image specified by a neural activation generated by a user thinking about that image, although as a preliminary step towards this goal we restrict our experiments to cases where images are viewed rather than imagined.

The main contributions of this paper are:

• First time visual and semantic features are used together for EEG zero-shot learning, which translates to potential for a real-world BCI image retrieval system.

• State-of-the-art performance for the particular task of EEG-driven image retrieval in a zero-shot framework.

• Evaluation across two datasets from different sources, including a large open dataset for future comparative studies.

• Analysis of how well the chosen feature sets reflect the expected brain activity.

General Methodology

Our framework comprises three main components. First, the brain data must be cleaned and a subset of the EEG features extracted to represent the underlying cognitive states. Then we apply our chosen computer vision and semantic models to the stimuli, to create a representation of each image in this visuo-semantic feature space.

Finally, we use a linear regression algorithm to find a mapping between the brain and stimulus spaces which makes the brain decoding possible. A high-level overview of this architecture can be found in Figure 1.

Figure 1. Overview of the flow of information and processing during a single fold of cross-validation ('Zero-shot Prediction' Section). Model performance is determined by the fit of the predicted feature vectors: in the example above, the true target features are in the second position of a sorted list of neighbours. In this case, with a total of seven possible images, this results in a rank of 2 and a CMC AUC of 78.57% ('Measure of Accuracy' Section).
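The rank and CMC AUC quoted in the caption can be reproduced under one consistent reading of the measure (a sketch of our interpretation, not the authors' code): the single-query cumulative match characteristic (CMC) curve is 0 below the achieved rank and 1 from the rank onwards, and the AUC is the trapezoidal area under that curve divided by the gallery size.

```python
import numpy as np

def cmc_auc(rank, n_images):
    """AUC of a single-query cumulative match characteristic (CMC) curve.

    rank: 1-based position of the true image in the sorted neighbour list.
    n_images: number of candidate images in the gallery.
    """
    ks = np.arange(0, n_images + 1)          # evaluate CMC at k = 0 .. n
    cmc = (ks >= rank).astype(float)         # 1 once the true image is in the top k
    area = ((cmc[:-1] + cmc[1:]) / 2).sum()  # trapezoidal area under the curve
    return area / n_images

# Rank 2 in a gallery of 7 images, as in the Figure 1 example.
auc = 100 * cmc_auc(2, 7)
```

Under this reading, rank 2 of 7 gives 5.5/7, approximately 78.57%, matching the caption.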

Brain Data

Two of the most widely used approaches to recording brain activity are functional Magnetic Resonance Imaging (fMRI) and Electroencephalography (EEG). The former can localise the physical source of brain activity with high spatial accuracy. However, the temporal resolution of fMRI is limited, with a sample typically acquired only every 1-2 seconds. Moreover, fMRI requires an MRI scanner, a large and expensive piece of equipment using powerful magnetic fields and liquid helium coolant, making it unsuitable for BCI systems outside of laboratory or clinical settings. As a cheaper and more convenient alternative, Electroencephalography (EEG) can be used to measure the electrical activity produced in the brain. As neurons communicate, they produce a small electrical current. Individually this electrical activity is weak; however, these cells often fire in groups and produce an effect strong enough to be detected by sensitive conductors (or sensors) placed on the scalp. The mixing and distortion of these signals as they propagate to the scalp, however, make it much more challenging to localise each source. There is therefore a trade-off between cost/convenience and the quality of the information recorded; compared with fMRI (or MEG), EEG data is easier to obtain, but is more difficult to analyse in terms of the underlying brain activity. EEG is also impacted by a greater sensitivity to a variety of external artefacts, such as muscle movement, cardiac activity, ambient electrical activity, and electro-ocular activity, all of which negatively impact the signal-to-noise ratio (SNR). Some of these noise sources can be isolated and removed either with signal processing algorithms or by hand.

As we are interested in eliciting cognitive states associated with particular images, the experimental paradigms used for the EEG data in this study involve repeated presentations of images on a computer screen ('Datasets' Section). Each presentation of an image is termed a "trial", and the small window of EEG data associated with each trial is known as an "epoch". We use epochs which begin when an image is presented and are one second long, to comfortably encompass the informative brain activity [16]. It is these epochs which we attempt to map to image features, aiming to determine which image was presented at the time the epoch was recorded.

Preprocessing is a necessary stage of EEG data analysis that involves aligning, normalising and otherwise cleaning the raw data in order to make it more suitable for downstream analyses. The main goal of preprocessing the EEG data in our framework is to remove sources of noise in order to minimise obfuscation of the underlying useful signal. One of the strongest noise sources in EEG is ambient electrical activity near the recording equipment, such as personal computers, large lights, or improperly insulated wiring.

These signals are relatively easy to separate from brain activity based on their frequency, typically 60 Hz or 50 Hz (in America and Europe respectively). A lower frequency cut-off must also be established to remove slower sources of noise; these are generally slow changes in the electrical profile of the scalp or sensors, such as a gradual increase or decrease of perspiration leading to a change in conductivity. A band-pass filter was used to remove any signals in our data with a frequency outside the range 1-40 Hz, as in [10,11,14,19-22].
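As an illustration, a 1-40 Hz band-pass of this kind could be implemented as follows (a sketch using SciPy's Butterworth design, not the filter used by FASTER; the toy signal and function names are ours):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(data, fs, low=1.0, high=40.0, order=4):
    """Zero-phase Butterworth band-pass applied along the time axis.

    data: array whose last axis is time; fs: sampling rate in Hz.
    Second-order sections keep the filter numerically stable at a 1 Hz edge.
    """
    sos = butter(order, [low, high], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=-1)

# Toy channel: a 10 Hz "brain" rhythm plus 50 Hz mains noise and a slow drift.
fs = 500
t = np.arange(0, 4, 1 / fs)
clean = np.sin(2 * np.pi * 10 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 50 * t) + 2.0 * t
filtered = bandpass(noisy, fs)
```

After filtering, the 10 Hz component survives while the mains interference and the drift ramp are strongly attenuated.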

Channels with poor contact with the scalp were then identified using the variance, mean correlation and Hurst exponent, and these were removed and then interpolated from nearby sensors, similar to [19,23]. A section of the EEG lasting one second was extracted each time an image was displayed. These epochs were baselined [19,22] by subtracting the average value from the 500 ms prior to the image presentation. Epochs which fell outside the threshold for amplitude range, variance or channel deviation were removed, as in [11,19,21,23]. Following this, Independent Component Analysis (ICA) [24] was performed, primarily to identify artefacts related to eye movement, as in [10,19,20]. In this step the input signal is decomposed into an approximation of its sources; each component is then correlated with sensors placed nearest the eyes, and thresholds are set for spatial kurtosis, Hurst exponent and mean gradient. Components identified as isolating sources of noise are removed and the EEG signal is reconstructed from the remaining components.

Next, within each epoch, channels were examined for short-term artefacts using variance, median gradient, amplitude range and channel deviation. Channels identified as noisy within the bounds of the epoch were replaced by an interpolation from other nearby channels within that epoch. The recording is also downsampled to a rate of 120 Hz, as in [11,14,20], to reduce dimensionality before machine learning is applied.
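The epoching and baselining steps above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def extract_epoch(raw, onset_sample, fs, epoch_s=1.0, baseline_s=0.5):
    """Cut a one-second epoch at an image onset and baseline-correct it.

    raw: (n_channels, n_samples) continuous EEG; onset_sample: index at which
    the image appeared. The per-channel mean of the 500 ms before onset is
    subtracted, mirroring the baselining step described above.
    """
    n_base = int(baseline_s * fs)
    n_epoch = int(epoch_s * fs)
    baseline = raw[:, onset_sample - n_base:onset_sample].mean(axis=1, keepdims=True)
    return raw[:, onset_sample:onset_sample + n_epoch] - baseline

# Toy recording with a constant offset that baselining should remove.
fs = 500
raw = np.random.default_rng(0).normal(size=(64, 5 * fs)) + 10.0
epoch = extract_epoch(raw, onset_sample=2 * fs, fs=fs)
```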

All of the above preprocessing steps were implemented using the EEG preprocessing toolkit FASTER [19].

As a final preprocessing step before the EEG data are used in our regression model, the data are z-scored (standardised). We primarily perform this step to ensure that the mean of the data is zero, as this can simplify the parametrisation of our machine learning. This takes place at each iteration of the cross-validation: the mean values for the transformation are calculated using only the training samples, and the transformation is then applied to both the training and testing samples, to avoid any influence of the latter on the former.

After preprocessing, an EEG feature extraction process is used to continue reducing the dimensionality of the data by extracting the most discriminatory features from the preprocessed data, and further removing uninformative and noisy dimensions of the data. This facilitates the successful mapping of EEG data to our image feature space by extracting only those aspects of the EEG signal which are likely to be informative about the visual and semantic feature sets. Following the approaches used in Mitchell et al. [13] and evaluated in Caceres et al. [25], we ignore all but the features with the highest collinearity across presentations of the same stimulus on the screen. Concretely, the EEG data for a particular participant following preprocessing is a 3D matrix of size nE × nC × nT, where nE is the number of epochs (i.e. the number of stimulus presentation events), nC is the number of channels (or sensors) in the EEG headset, and nT is the number of timepoints in an epoch (the number of times sensor values were recorded during an epoch). In this work, we use an epoch length of one second and downsample the data to 120 Hz, giving nT = 120. We treat the data from each time sample and each sensor as a separate feature, giving a total of nC × nT candidate features.
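The fold-wise standardisation can be sketched as follows, with statistics computed on the training epochs only (names are illustrative):

```python
import numpy as np

def zscore_fold(train, test):
    """Standardise using statistics computed on the training fold only.

    train, test: (n_epochs, n_features) matrices. The test epochs never
    influence the mean/std, keeping the held-out evaluation clean.
    """
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    return (train - mu) / sd, (test - mu) / sd

rng = np.random.default_rng(1)
train, test = rng.normal(5.0, 2.0, (300, 40)), rng.normal(5.0, 2.0, (60, 40))
train_z, test_z = zscore_fold(train, test)
```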
In order to calculate feature collinearity, we reshape the nE × nC × nT data matrix to a 2D matrix of size nE × (nC × nT), or, equivalently, (nS × nP) × nF, where nS is the number of stimuli, nP is the number of times each stimulus was presented in a recording, and nF is the number of EEG features. We then transform this back into a 3D matrix of shape nF × nP × nS and term this matrix D. D is therefore composed of an nP × nS feature matrix for each EEG feature f. To calculate a stability score for a feature, we measure the consistency of the feature across different presentations of the same stimulus: we calculate the Pearson correlation for each pair of rows in D and use the mean of these correlations as the stability score for that EEG feature f,

stability(f) = (1 / nCom) Σ_{i < j} corr(D_f[i, :], D_f[j, :]),

where corr(x, y) = cov(x, y) / (σ_x σ_y), σ_x is the standard deviation of x, and nCom = nP(nP − 1) / 2 is the number of row pairs.

The first set of image features is generated with a bank of Gabor filters, using eight evenly spaced orientations (θ) and four standard deviation values (σ) ranging from two to five, resulting in a bank of 32 filters. The rest of the parameters were fixed at default, with ksize = (31, 31), wavelength of the sinusoidal factor (λ) = 6.0, spatial aspect ratio (γ) = 0.5, and phase offset (ψ) = 0.
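The stability scoring defined above can be sketched in a few lines (a sketch assuming the nF × nP × nS matrix D; the toy data are ours):

```python
import numpy as np

def stability_scores(D):
    """Mean pairwise Pearson correlation across presentations, per feature.

    D: array of shape (nF, nP, nS); for each EEG feature, a presentations-by-
    stimuli matrix, as in the matrix D described above.
    """
    nF, nP, nS = D.shape
    scores = np.empty(nF)
    for f in range(nF):
        corrs = np.corrcoef(D[f])        # nP x nP correlations between rows
        iu = np.triu_indices(nP, k=1)    # the nP(nP - 1)/2 unordered pairs
        scores[f] = corrs[iu].mean()
    return scores

# A feature driven by the stimulus scores high; pure noise scores near zero.
rng = np.random.default_rng(2)
signal = rng.normal(size=20)
stable = np.tile(signal, (6, 1)) + 0.1 * rng.normal(size=(6, 20))
noisy = rng.normal(size=(6, 20))
scores = stability_scores(np.stack([stable, noisy]))
```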

Each pixel coordinate (x, y) in an image is convolved with a Gabor filter g described by the parameters above. Let Lθ and Lσ denote the sets of parameter values defining the filter bank. Each image in our feature set was convolved with every filter, and the result summed to generate a 32-dimensional histogram v_gabor for each image:

v_gabor = [ Σ_{x,y ∈ grid} g(x, y, λ, Lθ_1, ψ, Lσ_1, γ), ..., Σ_{x,y ∈ grid} g(x, y, λ, Lθ_8, ψ, Lσ_4, γ) ]
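A sketch of such a filter bank and its 32-bin histogram, using a hand-rolled kernel following the standard real Gabor formulation with the parameters listed above (the original likely used a library implementation such as OpenCV's `getGaborKernel`; the helper names are ours):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(ksize, sigma, theta, lam, gamma, psi):
    """Real-valued Gabor kernel: a Gaussian envelope times a cosine carrier."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * xr / lam + psi))

# Bank of 32 filters: 8 evenly spaced orientations x 4 sigma values (2..5),
# with ksize=(31, 31), lambda=6.0, gamma=0.5 and psi=0 as above.
thetas = [i * np.pi / 8 for i in range(8)]
sigmas = [2, 3, 4, 5]
bank = [gabor_kernel(31, s, t, lam=6.0, gamma=0.5, psi=0.0)
        for t in thetas for s in sigmas]

def gabor_histogram(image):
    """32-dimensional v_gabor: each filter's response summed over the image."""
    return np.array([fftconvolve(image, k, mode="same").sum() for k in bank])

v_gabor = gabor_histogram(np.random.default_rng(3).random((64, 64)))
```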
We stack the v_gabor vectors to create the final matrix of Gabor features for our image set. The brain is also sensitive to higher-level visual information which is not adequately captured by simple and spatially local Gabor filters. In order to make use of higher-level visual processing in our system, we chose to apply a prominent computer vision technique, the visual bag of words (VBOW). Using VBOW has the benefit of finding features that generalise well across multiple different objects, and as such have the best chance of extending to new classes. Moreover, it removes spatial data, making the feature vector invariant to spatial transformations such as rotation, translation and scale, which are less relevant to intermediate-level visual information. A list of image descriptors was generated for each image, and used to produce a histogram v_sift of how often each 'visual word' encoded in the codebook appeared in the stimulus image.

This implementation made use of Dense SIFT, meaning the keypoints correspond to a regularly sampled grid, rather than a set of natural keypoints estimated for an image. A histogram v_sift was generated for each image, and collated into a matrix representing our stimulus image SIFT features x_sift.
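The codebook assignment behind v_sift can be sketched as follows (random descriptors and codebook stand in for real Dense SIFT output and k-means cluster centres):

```python
import numpy as np

def vbow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codebook 'visual word'
    and count occurrences: the v_sift histogram described above.

    descriptors: (n_keypoints, d) local descriptors for one image;
    codebook: (n_words, d) cluster centres.
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))

rng = np.random.default_rng(4)
codebook = rng.normal(size=(50, 128))      # e.g. k-means centres, 128-d SIFT
descriptors = rng.normal(size=(200, 128))  # hypothetical dense-SIFT output
v_sift = vbow_histogram(descriptors, codebook)
```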
Finally, as none of the previous visual features encapsulates colour information, we chose a global HSV histogram to model colour in our approach, since there is some evidence that the HSV colour space comes closer to reflecting human vision than RGB [37]. An HSV histogram v_hsv is generated for each image using a quantisation of four bits per pixel and channel, where iP is the list of pixels in the image, and k_hue, k_sat and k_value are the hue, saturation, and value of pixel k respectively. This gives each HSV channel 16 bins, producing a histogram of 48 features. The histograms are then collated into a matrix representing our HSV feature space x_hsv.
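A sketch of the 48-feature colour histogram, assuming an image already converted to HSV with channels scaled to [0, 1] (function name and binning details are ours):

```python
import numpy as np

def hsv_histogram(hsv_image, bins=16):
    """48-dimensional colour feature: a 16-bin (4-bit) histogram for each of
    the hue, saturation and value channels, concatenated (v_hsv above).

    hsv_image: (h, w, 3) array with HSV channels scaled to [0, 1].
    """
    hists = [np.histogram(hsv_image[..., c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    return np.concatenate(hists)

v_hsv = hsv_histogram(np.random.default_rng(5).random((32, 32, 3)))
```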
Some stimulus labels were multi-word expressions (MWE) which did not have a corresponding feature vector in gMat, the GloVe matrix. In these cases we used the mean of the MWE's composite words, following [41]. For example, the stimulus "plaster trowel" was set to the mean of the vector for "plaster" and the vector for "trowel".

For each of our images we chose a single word or MWE to represent the content (i.e. the depicted object), and take the row of the GloVe matrix which corresponds to that word as the feature vector for the image in our high-level semantic feature space.
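The MWE fallback can be sketched as follows (the toy vectors below are illustrative only; real GloVe vectors are much higher-dimensional):

```python
import numpy as np

# Toy stand-in for the GloVe matrix: word -> vector.
glove = {
    "plaster": np.array([0.2, 0.8, -0.1]),
    "trowel":  np.array([0.6, 0.0,  0.5]),
}

def label_vector(label):
    """Semantic feature vector for an image label; multi-word expressions
    absent from the vocabulary fall back to the mean of their components."""
    if label in glove:
        return glove[label]
    return np.mean([glove[w] for w in label.split()], axis=0)

v = label_vector("plaster trowel")  # mean of the two component vectors
```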
The final complete set of features is the concatenation of the features from each of the component visual and semantic models. Finally, before using these features in our classification model, we apply one further feature selection based on a measure of fit from the regression model (as described in the 'Zero-shot Prediction' Section).

Let EEG_y be the nC × nT-dimensional brain activity vector associated with stimulus image y. Assuming a linear relationship exists between these two components, multiple linear regression can be applied to find some set of weights w1 such that f_1(EEG_y) = vEEG_1 · w1_0 + vEEG_2 · w1_1 + ... produces a value as close as possible to features_y1, some vector of weights w2 such that f_2(EEG_y) = vEEG_1 · w2_0 + vEEG_2 · w2_1 + ... produces a value as close as possible to features_y2, and so on, until a vector can be stacked which is as close as possible to features_y.

Prior studies [10,13,14] have shown success using linear regression models with brain data when they are regularised. This, coupled with its speed and simplicity, made linear regression a natural choice for a baseline approach. L2 regularisation is used to reduce overfitting and improve the generalisation properties of the model. This choice is preferred over L1 regularisation given the expected high collinearity of our samples: signals recorded from nearby locations at very similar temporal instants should register very similar sources of brain activity. A good model will be able to generalise the relationship rather than being limited to projecting the particular samples and/or classes used in training. If this is achieved, the mapping mechanism and the representative feature spaces can be used within a zero-shot learning architecture.
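The regularised mapping can be sketched with the closed-form ridge solution, fitting all image-feature columns jointly (alpha and the synthetic data are illustrative):

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """L2-regularised multiple linear regression (ridge):
    W = (X'X + alpha * I)^-1 X'Y, solved for every output column at once.

    X: (n_epochs, n_eeg_features) EEG matrix;
    Y: (n_epochs, n_image_features) target image features.
    """
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

# Synthetic check: recover a known linear map under mild noise.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 30))
W_true = rng.normal(size=(30, 5))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 5))
W = fit_ridge(X, Y, alpha=0.1)
pred = X @ W
```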

Once a mapping between EEG data and the image feature space has been learned from training, a prediction of image features can be made for an EEG epoch withheld from the training set. To ensure a zero-shot framework, we use leave-one-class-out cross-validation to iteratively withhold all epochs associated with a particular stimulus/image for testing in each iteration. Concretely, this means we withhold the data for trials related to Stimulus 1 and train a regression model from the trials for the rest of the stimuli. We then pass the withheld testing trials into our regression model to produce a predicted image feature vector for each trial. We then return the trials for Stimulus 1, withhold those for the next stimulus, and repeat until every stimulus has been held out once.

Following the regression, there is one final step of feature selection over the predicted image features before moving to the feature matching for image retrieval. We do not make use of all image features in the predicted image feature vector, but instead select just those which are best represented in the EEG data. To make the distinction between useful and under-represented features, we approximate each feature's informativeness by calculating the measure of fit of our regression model. When predictions are fed to the classifier, we ignore the columns of the feature space and the predicted feature vectors with the lowest measure of fit. For each iteration of the train/test split, after the regression model has been fit, an R² measure of fit is calculated for each image feature column in features. For each epoch in a recording we produce a predicted image feature vector and collate these vectors into the matrix p. Each epoch is associated with a particular stimulus image, and each stimulus image is associated with a feature vector in features, so we generate t such that t_i is the feature vector associated with the stimulus image used in epoch i.
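The fold logic (leave-one-class-out ridge fit, R²-based masking of predicted feature columns, nearest-neighbour ranking) can be sketched as follows. This is a compact approximation, not the authors' implementation: alpha, keep_frac and the synthetic data are illustrative, and the sketch masks features per fold whereas the paper averages fit values across iterations.

```python
import numpy as np

def zero_shot_retrieval(EEG, stim_ids, features, alpha=10.0, keep_frac=0.5):
    """Leave-one-class-out zero-shot decoding sketch.

    EEG: (n_epochs, n_eeg_feats); stim_ids: stimulus index per epoch;
    features: (n_stimuli, n_img_feats) visuo-semantic vectors.
    Returns the 1-based rank of the true stimulus for each held-out class.
    """
    n_stim, n_img = features.shape
    ranks = []
    for s in range(n_stim):
        tr, te = stim_ids != s, stim_ids == s
        X, Y = EEG[tr], features[stim_ids[tr]]
        # Ridge map from EEG to image features on the training fold.
        W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
        # R^2 of each image-feature column on the training fold.
        resid = ((Y - X @ W) ** 2).sum(axis=0)
        total = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
        keep = np.argsort(1 - resid / total)[-int(keep_frac * n_img):]
        # Predict, average over the held-out epochs, rank by distance.
        pred = (EEG[te] @ W).mean(axis=0)
        d = ((features[:, keep] - pred[keep]) ** 2).sum(axis=1)
        ranks.append(1 + int((d < d[s]).sum()))
    return ranks

# Synthetic recording: EEG linearly related to the image features plus noise.
rng = np.random.default_rng(7)
features = rng.normal(size=(12, 20))
M = rng.normal(size=(20, 40))
stim_ids = np.repeat(np.arange(12), 6)
EEG = features[stim_ids] @ M + 0.05 * rng.normal(size=(72, 40))
ranks = zero_shot_retrieval(EEG, stim_ids, features)
```

With low noise the held-out stimulus is ranked at or near the top for every fold.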

These fit values are then averaged across iterations to produce an estimate of which image features are best represented in the EEG data. This estimation is reached entirely without influence from the withheld epochs. The last step of the brain decoding mechanism is implemented using a nearest neighbour classifier between the predicted image feature vector p_j from the EEG and the target image feature vector t_j.

Datasets

The first collection of EEG data analysed in this study is the Trento set [10], which uses 60 grayscale photographs as stimuli. This dataset was initially designed for a category classification task rather than image retrieval. There were three participants, two of whom took part in two experimental sessions and one who took part in one session. Participants were instructed to silently name the image with whatever term occurred naturally, while EEG data was collected with a 64-channel EEG headset sampling at 500 Hz. More details of the paradigm and recording of the data can be found in Murphy et al. [10]. The epoched data for each session therefore consist of a matrix of shape nE × nC × nT, where nE = 360, nC = 64 and nT = 500. Through the preprocessing steps outlined in the 'Preprocessing Approach (FASTER)' Section (including removal of noisy epochs), the resulting cleaned set was a matrix of size 340 × 7680 on average per recording. The number of epochs is approximate, as for each experimental session a different number of low-quality epochs are removed during preprocessing. In the original study, the aim was to train a linear binary classifier to distinguish between epochs associated with mammal or tool stimuli, which differs from our goal of matching epochs to particular images.

Stanford Data

The second EEG dataset we used to test our approach is an open dataset compiled at Stanford University [20]. Participants were presented with a series of colour photographs drawn from the categories human body, human face, animal body, animal face, fruit/vegetable, and man-made (inanimate object). There were 12 images in each category, and each image was presented 12 times in random order, for a total of 864 trials per recording. Again, categories are discarded and the experiment is treated as an image retrieval task with 72 individual images. There were 10 participants, all of whom completed two sessions, each comprising three separate EEG recordings, for a total of 60 recordings. The EEG was recorded using a 128-channel headset sampling at 1 kHz. Each recording therefore contained 864 epochs, each with 128,000 features in its raw form. The resulting cleaned set after preprocessing measured approximately 792 epochs × 128 channels × 120 timepoints, giving an EEG feature matrix of size 792 × 15,360.

Parameter Optimisation

A short gridsearch was performed to empirically optimise the parameters. A random recording from each dataset was chosen and used to perform this gridsearch for each experiment below. We then used the highest-performing parameter set to perform the decoding for the rest of the recordings with the same dataset and image feature set. We do expect that different recordings will perform best under different parameter settings, and as such accuracy could be maximised with a more rigorous approach to gridsearching. That said, we have chosen to determine parameters from a single recording in order to better reflect training in a real-world BCI system.
In order to compare the effectiveness of our chosen image feature models and confirm our expectation that combining the models would provide more predictive power than using them in isolation, the AUC for both datasets was calculated when using all visuo-semantic features (features_vs), and compared against using only the visual feature set (features_v) or the semantic feature set (features_s) individually.
Results are shown in Table 1 for the Trento dataset and Table 2 for the Stanford dataset.

Because of our zero-shot analysis framework, a study with directly comparable results could not be identified in a review of relevant EEG literature. However, the studies mentioned in the background section can provide a frame of reference. While Palatucci et al. [14] used image stimuli and decoded the image from brain activity, the focus was on decoding semantic information about the object in the image rather than retrieving the stimulus image based on the brain data. Moreover, where Palatucci et al. [14] made use of minimalistic line drawings, the photographs used in both datasets analysed in this study are much more visually complex. In order to best leverage this extra visual information, we added several visual feature sets to our analysis.

The leave-one-class-out task performed by Palatucci et al. [14] is similar enough to the task in this study to give context to our results, though given that the two studies use different datasets, a direct comparison with our approach is not possible. The paradigm used in that study was very similar to those used in the Trento and Stanford experiments, with participants being presented with a series of images and asked to silently name them. Compared with the Palatucci et al. [14] study, we obtain slightly stronger results (Table 3).

Table 3. Leave-one-class-out task percent rank accuracy.

In this paper we proposed an approach to zero-shot image retrieval in EEG data using a novel combination of feature sets, feature selection, and regression modelling. We have shown that a combination of visual and semantic feature sets performs better than using either of those feature sets in isolation. We also analysed the performance of each image feature model used in our approach individually, to help identify where future improvements could be made.

We hope that future work can improve upon this approach, for example by exploring a different word embedding model such as word2vec [44] or fastText [45].

Moreover, while our EEG feature selection may correctly quantify the usefulness of each particular timepoint in each channel, it is likely that features which are close in time and location will carry very similar information and thus receive similar scores, and so the feature selection method may select a set of good-quality but redundant features. In future work, we will explore feature selection methods that produce a small set of maximally informative EEG features. Nevertheless, our approach has demonstrated a marked improvement over the current state of the art for EEG zero-shot image decoding, and is a step towards the application of EEG to real-world BCI technologies.