Abstract
Objective This paper presents a graph signal processing (GSP)-based approach for decoding two-class motor imagery EEG data via deriving task-specific discriminative features.
Methods First, a graph learning (GL) method is used to learn subject-specific graphs from EEG signals. Second, by diagonalizing the normalized Laplacian matrix of each subject’s graph, an orthonormal basis is obtained, with which the graph Fourier transform (GFT) of the EEG signals is computed. Third, the GFT coefficients are mapped into a discriminative subspace for differentiating two-class data using a projection matrix obtained by the Fukunaga-Koontz transform (FKT). Finally, an SVM classifier is trained and tested on the variance of the resulting features to differentiate motor imagery classes.
Results The proposed method is evaluated on Dataset IVa of the BCI Competition III and its performance is compared to i) using features extracted on a graph constructed by Pearson correlation coefficients and ii) three state-of-the-art alternative methods.
Conclusion Experimental results indicate the superiority of the proposed method over alternative methods, reflecting the added benefit of integrating elements from GL, GSP and FKT.
Significance The proposed method and results underpin the importance of integrating spatial and temporal characteristics of EEG signals in extracting features that can more powerfully differentiate motor imagery classes.
I. Introduction
Electroencephalography (EEG) is a prevalent, non-invasive imaging modality for capturing brain activity at high temporal resolution [1]. A popular topic in EEG analysis is the study of motor imagery (MI) tasks, which are dynamic states of movement imagination during which primary sensorimotor areas exhibit patterns of neural activity that resemble an attenuated version of real executed movement [2, 3]. From a neurophysiological perspective, desynchronization of the neural populations during motor imagery tasks attenuates rhythms in the respective cortex and can be measured as a sign of brain activity [3, 4]. MI tasks are extensively utilized in brain-computer interface (BCI) systems, in which mental imagination of a movement is translated to executive commands via classification of features extracted from acquired EEG data [5-7].
Discrimination of mental states from EEG measurements in MI-BCI systems is a challenging task, for which numerous methods have been proposed [8]. One class of proposed methods aims at extracting features from the temporal evolution of the signal acquired at each individual electrode, either in the time, frequency, or time-frequency domain [9, 10]. An alternative class of methods aims at extracting spatial features as manifested in multichannel EEG signals [11-15]. Adaptive classifiers, matrix and tensor classifiers, transfer learning, and deep learning are among other methods that have more recently been proposed [7, 16, 17].
Graph signal processing (GSP) [18-21] is an emerging field that has attracted great interest. It has in particular been adopted in an increasing number of neuroimaging studies. In [22], insights provided by the GSP perspective for analysis of brain activity using functional magnetic resonance imaging (fMRI) and diffusion-weighted MRI (dMRI) data are presented. In [23], seven different graphs are constructed based on structural and functional connectivity between brain areas to evaluate the benefit of GSP for classification and dimensionality reduction of fMRI data. In [24-26], GSP is used to perform anatomically-informed spatial processing of fMRI data to enhance brain activation mapping. In [27], GSP is leveraged to introduce a measure of the coupling strength between brain structure and function, which has been used within the context of task decoding and individual fingerprinting for fMRI data [28]. In [29], GSP is used to predict autism spectrum disorder from resting-state fMRI data. In [30], by using a multi-modal imaging dataset consisting of EEG, MRI, and dMRI data, the role of structural connectivity in the representation of brain activity signals and their dynamics is explored in a GSP setting. A GSP-based method for feature extraction in near-infrared spectroscopy (NIRS)-based BCI is presented in [31] that captures the spatial information of the NIRS signals.
A number of studies have also shown promising results in applying GSP techniques in classification, dimensionality reduction, and denoising of EEG signals [32-37]. In [32], network harmonics of the brain structural connectivity graph are derived for tracking fast spatiotemporal cortical dynamics. In [33], a dimensionality reduction method for MI-BCI application is proposed via spectral graph decomposition of a brain structural graph. In [34], a GSP-based approach is presented for adaptive dimensionality reduction and classification of MI tasks exploiting geometrical and correlation graphs of the brain. In [35], GSP techniques are used for emotion recognition using EEG data. In [36], a graph Laplacian denoising method is proposed which improves the separation of MI and resting mental states in MI-BCI EEG data. In [37], an MI decoding approach is proposed that utilizes graph Slepian functions [38], using which discriminative features for classification are extracted from a structural sub-graph of the brain. Inspired by the promising results of the use of GSP in brain imaging applications, we propose a GSP-based method for classification of MI EEG data. Despite the benefits of GSP, its successful application heavily relies on using a suitable graph that can capture subtle intrinsic relations between the data elements. Such a graph is not readily available in many applications; for EEG data, for instance, although there exists a clear definition of graph vertices, there exists no gold standard definition of graph edges and edge weights. In the absence of a well-defined graph, given an ensemble of signals, graph learning (GL) techniques can be employed to learn a graph from the data at hand. Different GL methods have been proposed in the literature [39]. Here we employ a sub-category of GL that leverages GSP and imposes constraints on graph sparsity and on the smoothness of graph signals on the resulting graph [40-42].
The method proposed in this paper for EEG data classification is comprised of four stages. First, we use graph learning to learn subject-specific brain graphs; a conventional graph that uses Pearson correlation coefficients as the weight of the edges is also used for comparison. Second, by interpreting EEG data as graph signals, we transform them into the spectral domain of each graph. Third, we derive a discriminative spectral graph subspace that specifically aims at differentiating two-class data. Fourth, we use the extracted features for training and testing a binary classifier.
The remainder of this paper is structured as follows. Section II gives an overview of the fundamental concepts. A description of the proposed framework is presented in Section III. Section IV describes the experimental results and provides a discussion. Finally, Section V presents our concluding remarks.
II. Materials and methods
A. Dataset
To evaluate the proposed method, EEG data from the publicly available BCI Competition III Dataset IVa [43] were used. The data, comprising two classes of motor imagery EEG signals, were recorded from five healthy subjects (labeled aa, al, av, aw, and ay) using 118 electrodes arranged according to the extended international 10/20 system, at a sampling rate of 100 Hz. A total of 280 visual cues of length 3.5 seconds were presented to subjects, interleaved with rest intervals of random lengths between 1.75 and 2.25 seconds. Despite the limited number of subjects, the dataset is rich in the sense that it includes a large number of trials per subject, making it well suited for use within a machine learning setting, and it has been utilized in many studies.
During the presentation of target cues, subjects were asked to perform right hand or right foot motor imageries, and 140 trials were acquired for each class. According to the competition instructions, the trials were divided into training and test sets in each class, wherein the set sizes differed across the five subjects. More precisely, for the first two subjects most trials are labeled (60% and 80%, respectively), while for the other three subjects 30%, 20%, and 10% of the trials are labeled, respectively, with the remaining trials composing their test sets (for more details, see http://www.bbci.de/competition/iii/). As such, performing classification is more challenging on subjects av, aw, and ay due to their small training set size. In this work, a GSP-based approach is provided to tackle the problem of MI task classification in this dataset. In the following section, the principles of GSP are briefly reviewed.
B. Graph Signal Processing Fundamentals
Let $G = (V, E, A)$ denote a weighted, undirected graph, where $V = \{1, 2, \ldots, N\}$ represents the graph’s finite set of $N$ vertices (nodes), $E$ denotes the graph’s edge set, i.e., pairs $(i, j)$ where $i, j \in V$, and $A$ is a symmetric matrix ($A_{i,j} = A_{j,i}$) that denotes the graph’s weighted adjacency matrix. The weights in the adjacency matrix indicate the strength of the connection, or similarity, between two corresponding vertices; therefore, $A_{i,j} = 0$ if there is no connection/similarity between vertices $i$ and $j$. Moreover, it is assumed that there are no self-loops in the graph, which implies $A_{i,i} = 0$. Let $\ell_2(G)$ denote the Hilbert space of all square-summable real vectors $f \in \mathbb{R}^N$, with the inner product defined as $\langle f_1, f_2 \rangle = \sum_{n=1}^{N} f_1[n] f_2[n]$ and the $\ell_2$-norm defined as $\|f\|_2 = \sqrt{\langle f, f \rangle}$.
A real signal defined on the vertices of $G$, $f: V \to \mathbb{R}$, can thus be seen as a vector in $\ell_2(G)$, whose $n$-th component represents the signal value at the $n$-th vertex of $G$. The graph’s normalized Laplacian matrix is defined as $L = I - D^{-1/2} A D^{-1/2}$, where $I$ is the identity matrix and $D$ is the diagonal matrix of vertex degrees, i.e., $D_{i,i} = \sum_{j} A_{i,j}$. Since $L$ is real, symmetric, and positive semi-definite, it can be diagonalized via eigenvalue decomposition as:
$$L = U \Lambda U^T, \tag{1}$$
where $T$ denotes the transpose operator, $U = [u_1, u_2, \ldots, u_N]$ is an orthonormal matrix concatenating the eigenvectors $u_k \in \ell_2(G)$ in its columns, and $\Lambda$ is a diagonal matrix that stores the corresponding real, non-negative eigenvalues $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_N \le 2$. The eigenvalues define the graph Laplacian spectrum, and the corresponding eigenvectors form an orthonormal basis that spans $\ell_2(G)$. By using the Laplacian eigenvectors, a graph signal $f$ can be transformed into a spectral representation, commonly referred to as the graph Fourier transform (GFT) of $f$, denoted $\hat{f}$, obtained as:
$$\hat{f} = U^T f. \tag{2}$$
Given the orthonormality of the Laplacian eigenvectors, the inverse GFT of $\hat{f}$ is obtained as $f = U \hat{f}$. By synthesizing $f$ as a weighted sum of orthogonal graph frequency components $u_k$, the GFT coefficients of $f$ entail the degree of signal variability of $f$ over $G$. That is, each GFT coefficient represents the contribution of its corresponding graph Laplacian eigenvector to the graph signal. Importantly, the GFT satisfies Parseval’s energy conservation relation, i.e., $\|f\|_2^2 = \|\hat{f}\|_2^2$.
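The forward and inverse GFT described above can be illustrated with a minimal NumPy sketch. The random toy graph, the helper names `normalized_laplacian` and `gft_basis`, and the graph size are assumptions made purely for illustration; they are not part of the original pipeline.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix with positive degrees."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def gft_basis(A):
    """Eigenvalues (ascending) and orthonormal eigenvectors of the normalized Laplacian."""
    lam, U = np.linalg.eigh(normalized_laplacian(A))
    return lam, U

# Toy example: random symmetric weighted graph without self-loops (illustrative data only).
rng = np.random.default_rng(0)
N = 6
A = np.triu(rng.random((N, N)), 1)
A = A + A.T

lam, U = gft_basis(A)
f = rng.standard_normal(N)  # a graph signal, one value per vertex
f_hat = U.T @ f             # forward GFT
f_rec = U @ f_hat           # inverse GFT recovers the signal
```

Since `U` is orthonormal, the energy of `f` and `f_hat` coincide, which is the Parseval relation stated above.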
Graph Laplacian eigenvectors associated to larger eigenvalues entail a larger extent of variability, and as such, eigenvalues of the graph Laplacian matrix can be seen as an extension of frequency elements that define the Fourier domain in classical signal processing. To further illustrate the notion of frequency for graph signals, the total variation (TV) of a graph signal $f$ on graph $G$ can be quantified as:
$$TV(f) = f^T L f, \tag{3}$$
where larger values of $TV(f)$ indicate greater changes of $f$ on $G$, i.e., higher spatial variability. By viewing each graph Laplacian eigenvector as a graph signal, it can be seen that its total variation is equal to its corresponding eigenvalue, i.e., $TV(u_k) = u_k^T L u_k = \lambda_k$. This relation shows that each eigenvalue is a quantification of the extent of variability of its corresponding eigenvector. Specifically, the graph Laplacian eigenvalues can be seen as graph frequencies, indicating how the eigenvectors vary with respect to the graph $G$ [18, 19]. A graph signal $f$ is smooth on $G$ if its elements associated to vertices connected via large edge weights have similar values. $TV(f)$ is a quantification of the extent of variation of $f$ with respect to the structure of $G$, thus providing a measure of the degree of smoothness of $f$. A leading paradigm in graph learning exploits this notion of smoothness to learn a graph structure on which the data exhibit a certain regularity.
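The relation between total variation and graph frequency can be checked numerically. The sketch below assumes the quadratic form $f^T L f$ of the normalized Laplacian as the total variation, and also evaluates its equivalent pairwise form, which makes the smoothness interpretation explicit; the toy graph and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
A = np.triu(rng.random((N, N)), 1)
A = A + A.T                                   # symmetric weights, zero diagonal
d = A.sum(axis=1)                             # vertex degrees
L = np.eye(N) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
lam, U = np.linalg.eigh(L)

def total_variation(f, L):
    """Quadratic form f^T L f."""
    return float(f @ L @ f)

def total_variation_pairwise(f, A):
    """Same quantity as 0.5 * sum_ij A_ij (f_i/sqrt(d_i) - f_j/sqrt(d_j))^2."""
    g = f / np.sqrt(A.sum(axis=1))
    diff = g[:, None] - g[None, :]
    return float(0.5 * np.sum(A * diff ** 2))

# Total variation of each eigenvector, to be compared with its eigenvalue.
tv = np.array([total_variation(U[:, k], L) for k in range(N)])
f = rng.standard_normal(N)
```

The pairwise form shows why strongly connected vertices with similar (degree-normalized) values contribute little to the total variation.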
C. Learning Graphs from Smooth Signals
A class of GL methods enforces data smoothness, deriving a graph Laplacian matrix via solving [41]:
$$\min_{L} \ \mathrm{trace}(F^T L F) + \alpha \|L\|_F^2 \quad \text{s.t.} \quad \mathrm{trace}(L) = N, \ L_{i,j} = L_{j,i} \le 0 \ (i \ne j), \ L \mathbf{1} = \mathbf{0}, \tag{4}$$
where $F$ is an $N \times T$ matrix of graph signals, $\alpha$ is a regularization parameter, $\|\cdot\|_F$ denotes the Frobenius norm, and $\mathbf{1} = [1, \ldots, 1]^T$. Minimizing the first term in the objective function guarantees smoothness of the signals on the learned graph, which can be seen via invoking (3): $\mathrm{trace}(F^T L F) = \sum_{t=1}^{T} f_t^T L f_t = \sum_{t=1}^{T} TV(f_t)$, where $f_t$ denotes the $t$-th column of $F$. The Frobenius norm controls sparsity by shrinking edge weights. The imposed constraints ensure finding a valid Laplacian matrix. Considering $L = \mathrm{diag}(A\mathbf{1}) - A$, where $A\mathbf{1}$ denotes the vertices degree vector, the optimization in (4) can be solved more efficiently via a more general-purpose formulation with respect to the graph’s weighted adjacency matrix [42]:
$$\min_{A} \ \|A \circ Z\|_{1,1} + \alpha \|A\mathbf{1}\|_2^2 + \alpha \|A\|_F^2 \quad \text{s.t.} \quad \|A\|_{1,1} = N, \ A_{i,j} = A_{j,i} \ge 0, \ A_{i,i} = 0, \tag{5}$$
where $Z$ denotes the pairwise squared Euclidean distance matrix of the signals residing on the graph vertices, with entries given as $Z_{i,j} = \|x_i - x_j\|_2^2$, where $x_i \in \mathbb{R}^T$ denotes the signal vector residing on vertex $i$. The first term in this objective function finds the graph’s adjacency matrix under the smoothness assumption; note the equivalence between the first terms in (4) and (5), i.e., $\mathrm{trace}(F^T L F) = 0.5 \, \|A \circ Z\|_{1,1}$, where $\circ$ denotes the Hadamard product and $\|\cdot\|_{1,1}$ the elementwise $\ell_1$-norm. Intuitively, if smooth graph signals reside on well-connected vertices (i.e., vertices connected via large-weight edges), it is expected that these vertices have smaller distances $Z_{i,j}$. Alternatively, the objective function (5) can be improved by replacing the $\ell_2$-norm with a logarithmic barrier on the vertices degree vector as:
$$\min_{A} \ \|A \circ Z\|_{1,1} - \alpha \mathbf{1}^T \log(A\mathbf{1}) + \beta \|A\|_F^2 \quad \text{s.t.} \quad A_{i,j} = A_{j,i} \ge 0, \ A_{i,i} = 0, \tag{6}$$
where the second term ensures that the graph degrees are positive, thus improving the overall connectivity of the graph and ensuring that each vertex has at least one edge. $\alpha$ and $\beta$ are regularization parameters, and the constraints guarantee a valid adjacency matrix. The third term controls the sparsity of the resulting graph; intuitively, smaller values of $\beta$ yield sparser graphs by penalizing edges between vertices with larger $Z_{i,j}$ [42].
In the following, we refer to the GL approaches given in (5) and (6) as the l2-penalized and log-penalized methods, respectively.
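To make the smoothness term concrete, the following sketch verifies the stated equivalence between the Laplacian quadratic form and the distance-based term, and evaluates (rather than minimizes) a log-penalized objective. It assumes the combinatorial Laplacian `diag(A1) - A` for the equivalence and an objective of the form `||A∘Z||_{1,1} - alpha * 1^T log(A1) + beta * ||A||_F^2`; the function names, parameter values, and random data are hypothetical, and no solver is implemented here.

```python
import numpy as np

def pairwise_sq_dist(F):
    """Z[i, j] = ||x_i - x_j||^2, where x_i is row i of F (time course at vertex i)."""
    sq = np.sum(F ** 2, axis=1)
    Z = sq[:, None] + sq[None, :] - 2.0 * (F @ F.T)
    return np.maximum(Z, 0.0)  # clip tiny negative values caused by round-off

def smoothness_via_laplacian(F, A):
    """trace(F^T L F) with the combinatorial Laplacian L = diag(A1) - A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(F.T @ L @ F)

def log_penalized_objective(A, Z, alpha, beta):
    """Value of the assumed log-penalized objective for a candidate adjacency matrix A."""
    degrees = A.sum(axis=1)
    return np.sum(A * Z) - alpha * np.sum(np.log(degrees)) + beta * np.sum(A ** 2)

rng = np.random.default_rng(3)
N, T = 7, 50
F = rng.standard_normal((N, T))        # N vertices, T graph signals
A = np.triu(rng.random((N, N)), 1)
A = A + A.T                            # candidate symmetric adjacency, zero diagonal
Z = pairwise_sq_dist(F)

lhs = smoothness_via_laplacian(F, A)
rhs = 0.5 * np.sum(A * Z)              # 0.5 * ||A ∘ Z||_{1,1}
obj = log_penalized_objective(A, Z, alpha=1.0, beta=0.5)
```

In practice the minimization itself would be delegated to a dedicated solver, such as the primal-dual scheme described in [42].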
D. Two-Class Discriminative Subspace via Simultaneous Diagonalization of Covariance Matrices
After defining a brain graph, the graph spectral representations of the EEG signals were considered to find a discriminative subspace for two-class (right hand and right foot) MI classification. To this end, inspired by the Fukunaga-Koontz transform (FKT) [44] and the method presented in [29], simultaneous diagonalization of two covariance matrices was utilized. For a graph signal $f$ defined on $G$, let $\tilde{f}$ denote the de-meaned and normalized version of $f$, obtained as described in [45].
Let $F$ denote an $N \times T$ matrix that contains a single trial of the EEG time series, with $F_{c,t}$ being the signal value at electrode $c$ at time point $t$, and let $\hat{\tilde{F}} = U^T \tilde{F}$ denote the GFT matrix of the de-meaned and normalized trial $\tilde{F}$. The goal is to determine a projection matrix $W$ that simultaneously diagonalizes $\Sigma_1$ and $\Sigma_2$, with
$$\Sigma = \Sigma_1 + \Sigma_2,$$
where $\Sigma_1$ and $\Sigma_2$ denote the ensemble averaged covariance matrices of the trials in class 1 (right hand) and class 2 (right foot), respectively. As $\Sigma$ is positive definite, it can be eigen-decomposed as $\Sigma = V \Gamma V^T$, where $V$ is the matrix of eigenvectors of $\Sigma$ and $\Gamma$ is the diagonal matrix of the corresponding eigenvalues, using which a whitening transform $P$ can be obtained as:
$$P = \Gamma^{-1/2} V^T.$$
By whitening $\Sigma$ with $P$, the variances in the space spanned by $V$ become equal, resulting in all the eigenvalues becoming equal to one, i.e.:
$$P \Sigma P^T = P \Sigma_1 P^T + P \Sigma_2 P^T = S_1 + S_2 = I.$$
Consequently, eigenvalue decomposition of $S_1$ and $S_2$ gives:
$$S_1 = B \Psi_1 B^T, \quad S_2 = B \Psi_2 B^T, \quad \Psi_1 + \Psi_2 = I,$$
where $B$ denotes the eigenvectors, which are the same for both $S_1$ and $S_2$, and $\Psi_1$ and $\Psi_2$ are the diagonal matrices of their corresponding eigenvalues, which are complementary; i.e., by sorting the eigenvalues in descending order, the eigenvector associated with the largest eigenvalue of $S_1$ is associated with the smallest eigenvalue of $S_2$. Therefore, a small subset comprising the first and last eigenvectors of $B$ induces a suitable discriminatory transform for differentiating the two classes. Finally, the overall projection matrix can be obtained as $W = B^T P$.
By applying $W$ to the GFT coefficients, i.e., $y = W \hat{\tilde{f}}$, we obtained a feature vector $y$, the variance of which is maximized in one class while minimized in the other class. These features were then used for classification.
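The simultaneous diagonalization described in this section can be sketched in a few lines of NumPy. The function name `fkt`, the synthetic covariance matrices, and the dimensions are assumptions for illustration only; the sketch shows the whitening and shared-eigenvector steps, not the full feature extraction pipeline.

```python
import numpy as np

def fkt(S1, S2):
    """Return W = B^T P and the eigenvalues of the whitened class-1 covariance (descending)."""
    gamma, V = np.linalg.eigh(S1 + S2)       # eigendecomposition of the summed covariance
    P = np.diag(gamma ** -0.5) @ V.T         # whitening transform
    psi1, B = np.linalg.eigh(P @ S1 @ P.T)   # eigenvectors shared by both whitened covariances
    order = np.argsort(psi1)[::-1]
    psi1, B = psi1[order], B[:, order]
    return B.T @ P, psi1

# Synthetic two-class data (illustrative only): class 2 has channel-dependent scaling.
rng = np.random.default_rng(2)
n = 5
X1 = rng.standard_normal((n, 200))
X2 = rng.standard_normal((n, 200)) * np.linspace(0.5, 2.0, n)[:, None]
S1 = X1 @ X1.T / X1.shape[1]
S2 = X2 @ X2.T / X2.shape[1]

W, psi1 = fkt(S1, S2)
D1 = W @ S1 @ W.T   # diagonal, entries psi1
D2 = W @ S2 @ W.T   # diagonal, entries 1 - psi1
```

Rows of `W` at the top and bottom of the ordering correspond to directions whose variance is large in one class and small in the other.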
III. Proposed method
The proposed method for EEG-based MI task decoding is illustrated as a block diagram in Fig. 1. The training and test EEG signal sets for each subject are initially preprocessed, and then fed into the training and test phases, respectively. As temporal preprocessing, for each trial, we used the time points within the 0.5-2.5 second interval after the visual cue to construct graph signals; this 2-second interval has been previously used in related works [13, 15, 37]. Motor activity, be it real or imagined, modulates the mu and beta rhythms; therefore, we filtered the extracted signals with a third-order Butterworth filter with a passband of 8-30 Hz. Graph signals were then extracted from these filtered signals; in particular, we defined one graph signal per time instance, i.e., each signal represents EEG values across the 118 electrodes, which thus resulted in T=200 graph signals per trial.
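The temporal preprocessing can be sketched with SciPy as follows. The use of zero-phase filtering via `sosfiltfilt`, the function name `extract_graph_signals`, and the `cue_sample` argument are assumptions; the text above specifies only the filter order, the passband, and the time window.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 100  # sampling rate in Hz, as stated for the dataset

def extract_graph_signals(eeg, cue_sample, fs=FS, t_start=0.5, t_stop=2.5, band=(8.0, 30.0)):
    """Cut the post-cue window from eeg (channels x samples) and band-pass filter it.

    Returns an array of shape (channels, T); each column is one graph signal.
    """
    start = cue_sample + int(round(t_start * fs))
    stop = cue_sample + int(round(t_stop * fs))
    segment = eeg[:, start:stop]
    # Third-order Butterworth band-pass; zero-phase application is an assumption of this sketch.
    sos = butter(3, band, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, segment, axis=1)

# Illustrative random data standing in for one continuous recording.
rng = np.random.default_rng(5)
eeg = rng.standard_normal((118, 500))
graph_signals = extract_graph_signals(eeg, cue_sample=100)
```

With a 2-second window at 100 Hz, each trial yields 200 columns, matching the T=200 graph signals per trial mentioned above.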
A. Graph-based Representation of Brain Signals
In the training phase, we modeled the structure of the brain of each subject as a graph, in which vertices corresponded to the EEG electrodes and edges were defined by estimating the graph’s weighted adjacency matrix using the log-penalized and l2-penalized graph learning frameworks. As a means of comparison, we also defined a fully connected correlation graph in which edge weights were defined based on the degree of functional connectivity between electrode pairs; that is, for each electrode pair, the absolute value of the Pearson correlation coefficient between their time courses was defined as the edge weight, reflecting an estimate of the overall statistical dependency between the two electrodes [46].
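The correlation graph used for comparison can be built directly from the electrode time courses. The sketch below is a minimal version under the stated definition (absolute Pearson correlation as edge weight, no self-loops); the function name and the random data are illustrative assumptions.

```python
import numpy as np

def correlation_graph(X):
    """Adjacency matrix from X (electrodes x samples): |Pearson correlation|, zero diagonal."""
    A = np.abs(np.corrcoef(X))
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 300))   # toy data: 10 electrodes, 300 time points
A_corr = correlation_graph(X)
n_nonzero = int(np.count_nonzero(A_corr))
```

Such a graph is fully connected by construction: with N electrodes, the adjacency matrix has N(N-1) nonzero off-diagonal entries, which for N = 118 gives 13806.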
For each graph, the eigenvectors of L were used to compute the GFT of each graph signal. Using FKT, a transformation matrix that maps the GFT coefficients into a discriminative graph spectral subspace was then derived. The mapped data were then treated as discriminative features. To determine the most effective graph frequency harmonics for classifying the EEG signals, a feature selection algorithm was used; we ranked the GFT coefficients based on their energy using MATLAB’s rankfeatures function, which utilizes the Wilcoxon statistical test. GFT coefficients with higher ranks correspond to more distinctive features. The number of selected features for each subject was determined using 10-fold cross-validation.
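A rough Python analogue of this ranking step is sketched below. It is not MATLAB’s rankfeatures: it uses `scipy.stats.ranksums` and scores each feature by the absolute value of the Wilcoxon rank-sum statistic, which is an assumption about how a comparable ranking could be obtained. The function name and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import ranksums

def rank_features(X1, X2):
    """Rank features by |Wilcoxon rank-sum statistic| between two classes.

    X1, X2: arrays of shape (trials, features), one per class.
    Returns (indices sorted from most to least discriminative, scores).
    """
    scores = np.array([
        abs(ranksums(X1[:, j], X2[:, j]).statistic) for j in range(X1.shape[1])
    ])
    return np.argsort(scores)[::-1], scores

# Synthetic example: only feature 2 differs between the classes.
rng = np.random.default_rng(4)
X1 = rng.standard_normal((40, 5))
X2 = rng.standard_normal((40, 5))
X2[:, 2] += 3.0
order, scores = rank_features(X1, X2)
```

The number of top-ranked features to keep would then be chosen by cross-validation, as described above.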
B. Evaluation
The classifier was trained using labelled training data, where labels indicate the class of each trial, and classification performance was evaluated with the labelled test data. The projection matrix and the indices of the discriminative GFT features were computed in the training phase and subsequently used in the test phase. The logarithm of the variance of the GFT coefficients projected by W was used as features to train a support vector machine (SVM) classifier with a linear kernel. Since this projection maximizes the variance of the signals from one class while minimizing it for the signals from the other class, it provides discriminative features for classification. We used SVM due to its overall superior robustness and efficiency in BCI applications compared to other classifiers [7]. The linear kernel was selected for its simplicity and low computational cost.
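The per-trial feature computation can be sketched as follows. The function name `log_variance_features`, the optional `selected` argument, and the random inputs are illustrative assumptions; the variance is taken over the time axis of the projected GFT coefficients.

```python
import numpy as np

def log_variance_features(W, F_hat, selected=None):
    """Log-variance over time of projected GFT coefficients.

    W: projection matrix (components x coefficients).
    F_hat: GFT coefficients of one trial (coefficients x T).
    selected: optional indices of projected components to keep.
    """
    Y = W @ F_hat
    if selected is not None:
        Y = Y[selected]
    return np.log(np.var(Y, axis=1))

rng = np.random.default_rng(7)
F_hat = rng.standard_normal((3, 200))   # toy trial with 3 coefficients and T = 200
W_id = np.eye(3)                        # identity projection, for illustration only
feats_all = log_variance_features(W_id, F_hat)
feats_sel = log_variance_features(W_id, F_hat, selected=[0, 2])
```

One such feature vector per trial, together with the trial label, would then form the input to the linear SVM.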
IV. Results and discussion
Fig. 2(a) shows the arrangement of the 118 electrodes on the head. Fig. 2(b-d) shows the three brain graphs and their corresponding adjacency matrices for subject aa. The nodal degrees are comparable between the learned graphs but are differently scaled for the correlation graph due to the large difference between the degree distributions. The correlation graph is fully connected as it is defined based on the correlation of all electrode pairs, whereas the two learned graphs are notably sparse, a result of the sparsity-inducing terms used in the learning process. The log-penalized method yields sparser learned graphs compared to the l2-penalized method. For the log-penalized, l2-penalized, and correlation graphs, on average across subjects, the number of edges was 1684.8±367.9, 2519.2±366.9, and 13806, respectively; additional quantitative comparison of the graphs is presented in the supplementary material. The sparsity of graphs is desirable because it plays a key role in reducing the computational burden of algorithms and makes them suitable for online BCI applications. Graphs constructed for the other four subjects are shown in Fig. S2 in the supplementary material.
The distribution and histogram of the normalized Laplacian eigenvalues for the three graphs of subject aa are shown in Fig. 3(a). Most of the eigenvalues of the correlation graph are concentrated around one, whereas the eigenvalues of the learned graphs, especially the log-penalized graph, gradually increase and are more widely distributed along the spectrum. Eigenvalues with high multiplicity around one (a high peak at λ ≃1) in the correlation graph spectrum suggest vertex duplication, in which a new vertex added to the graph has a connectivity pattern identical to that of the duplicated vertex, resulting in vertices with the same connectivity profile [48].
Fig. 3(b) illustrates eigenvectors associated to several of the selected normalized Laplacian eigenvalues of the log-penalized graph. The first eigenvector is almost evenly distributed over all the graph vertices and, given that TV(u1) = λ1 = 0, there is no notable spatial variation. In the subsequent eigenvectors, the increase in spatial variability is proportional to the increase in graph frequencies. The last eigenvector is highly localized, which is in line with the characteristic of normalized Laplacian matrices of manifesting localized patterns of spatial variability at high frequencies.
Fig. 4 shows several of the eigenvectors and their corresponding eigenvalues for the three studied graphs for subject aa. The eigenvectors of the learned graphs capture a wider range of variability compared to the correlation graph, many eigenvectors of which manifest spatial patterns with similar spatial variabilities corresponding to a spectral value around one. This suggests that in the correlation graph most of the vertices are connected to other vertices in a rather similar pattern. Given that graph Laplacian eigenvectors form an orthonormal basis that represents signals, their broader spatial variability with respect to the graph structure can provide a more precise representation of signals. Accordingly, in the GFT, a graph signal is mapped to the graph frequency domain using the Laplacian eigenvectors, the spatial variability of which plays an important role in obtaining an effective decomposition. In the correlation graph, a small subset of the first eigenvectors captures a substantial portion of the total signal energy, whereas in the learned graphs, signal energy is distributed across a wider range of eigenvectors. The complete set of eigenvectors of the three studied graphs for subject aa is shown in supplementary material Fig. S3.
Alternatively, a weighted measure of the number of zero crossings (WZC) can be used to quantify the spatial variability of eigenvectors, or of any graph signal in general. Strictly speaking, WZC is a weighted measure of changes in the sign of the eigenvectors at adjacent graph vertices, wherein the adjacency matrix entries are used as weights, computed as:
$$\mathrm{WZC}(u_k) = \sum_{(i,j) \in E} A_{i,j} \, H(-u_k[i] \, u_k[j]),$$
where $H(\cdot)$ is the Heaviside step function. The WZC of the normalized Laplacian eigenvectors of the three studied graphs is shown in Fig. 5. Spatial variability of eigenvectors generally increases with increasing eigenvalue index. WZC gradually increases along the spectrum in the learned graphs, especially in the log-penalized one, whereas in the correlation graph, it sharply increases in the initial eigenvalue indices, and then only minimally changes in the remainder of the spectrum. These results corroborate visual interpretations made on spatial variability of eigenvectors as shown in Fig. 4, reflecting the superior capability of the learned graphs over the correlation graph in capturing a wide range of spatially varying patterns as manifested by EEG signals.
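A weighted zero-crossing count of this kind can be sketched as below. The exact normalization is an assumption: here each undirected edge is counted once, with the Heaviside step taken as zero at the origin, so that an edge contributes its weight only when the signal has strictly opposite signs at its two endpoints. The function name and the toy path graph are illustrative.

```python
import numpy as np

def weighted_zero_crossings(u, A):
    """Sum of edge weights over edges whose endpoints carry values of opposite sign."""
    sign_change = np.outer(u, u) < 0          # True where u[i] * u[j] < 0
    return 0.5 * float(np.sum(A * sign_change))  # 0.5: symmetric A lists each edge twice

# Toy path graph 0 - 1 - 2 with edge weights 2 and 3.
A = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 3.0],
              [0.0, 3.0, 0.0]])
wzc_alternating = weighted_zero_crossings(np.array([1.0, -1.0, 1.0]), A)
wzc_constant = weighted_zero_crossings(np.array([1.0, 1.0, 1.0]), A)
```

For the alternating signal both edges are crossed, giving the sum of the two weights, whereas a signal of constant sign has no crossings.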
In the first experiment, five different sets of the GFT coefficients were utilized. The first set consisted of the entire set of GFT coefficients, denoted all frequencies (AF). Inspired by prior works on the application of GFT on brain imaging data [49, 50], three additional sets of GFT coefficients were defined by dividing the spectrum into three equal frequency bands, denoted low (LF), medium (MF) and high (HF) frequencies. Inspired by [36], a fifth subset was defined via the union of the LF and HF subsets, denoted LF+HF. These five sets of GFT coefficients were then used as inputs to the FKT to derive a discriminative matrix for each set. Consequently, features for classification were extracted by computing the logarithm of the variance of the GFT coefficients projected by W. Table I presents classification accuracies using three different graphs for each individual subject and also on average across subjects.
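The five coefficient sets can be expressed as index sets over the eigenvalue-sorted spectrum. The sketch below assumes the bands are defined by eigenvalue index rather than eigenvalue magnitude, and that when the number of coefficients is not divisible by three the extra index goes to the lowest band (the behaviour of `np.array_split`); both choices are assumptions, as is the function name.

```python
import numpy as np

def spectral_bands(n_coeffs):
    """Index sets AF, LF, MF, HF and LF+HF over GFT coefficients sorted by eigenvalue."""
    idx = np.arange(n_coeffs)
    lf, mf, hf = np.array_split(idx, 3)
    return {
        "AF": idx,
        "LF": lf,
        "MF": mf,
        "HF": hf,
        "LF+HF": np.concatenate([lf, hf]),
    }

bands = spectral_bands(118)           # 118 electrodes give 118 GFT coefficients
band_sizes = {name: len(ix) for name, ix in bands.items()}
```

Given a GFT matrix of one trial, a band is then selected by row indexing, for example `F_hat[bands["LF"]]`, before applying the FKT.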
Using the LF GFT coefficients resulted in substantially higher classification accuracies compared to using the MF, HF or LF+HF components, in all subjects as well as on average across subjects. It also provided higher accuracies compared to using all the GFT coefficients, in subjects aa, aw, and ay. Moreover, the learned graphs achieved better results compared to the correlation graph in three out of five subjects (aa, av, and ay). To determine an optimal subset of the GFT coefficients that provide the most discriminative features for classification, we implemented feature selection. The logarithm of variance of the GFT coefficients was used as input to feature selection. Fig. 6 illustrates the scores of graph frequencies in the log-penalized graph for each subject and on average across subjects.
The lowest one-third of the eigenvalue indices attained substantially higher scores than the rest of the spectrum, corroborating the results presented in Table I. Therefore, we only used features from this sub-band as the most effective harmonics for each subject, classification accuracies for which are presented in Table II. To evaluate the effectiveness of using the FKT, results of classification using GFT coefficients (without FKT) are also provided in Table II. The direct use of the GFT coefficients is prone to overfitting due to the small size of the training samples in comparison to the dimension of the feature vectors, especially in the subjects with small training sets. Therefore, a subset of GFT coefficients, as determined by the feature selection step, was fed into the classifier.
The results suggest that using FKT notably improves the classification accuracies compared to directly using the GFT coefficients. That is, mapping the GFT coefficients onto the subspace provided by FKT results in features that better discriminate the two MI classes. This FKT-based approach of extracting features from a temporal set of GFT coefficients is in contrast to prior related works [22, 27] wherein the temporal mean or variance of the GFT coefficients is considered as feature, which notably discards the temporal dynamics. The temporal evolution of GFT coefficients of two representative EEG trials is shown in Fig. S4 in the supplementary material. Overall, the best average accuracy was obtained in the proposed method by using the log-penalized graph learning approach.
Finally, the performance of the proposed method is compared to three alternative state-of-the-art methods; see Table III. The proposed method using log-penalized graph learning outperforms the three alternative methods, on average across subjects. The GSL method, which is GSP-based [37], shows the best classification accuracy in subject av, whereas the RCSSP method, which utilizes an extension of FKT [15], shows the best accuracy in subject aw. In the other three subjects, the proposed method yields higher classification accuracy.
V. Conclusions
We proposed a GSP-based method for classification of motor imagery tasks from EEG signals. We treated EEG signals as functions defined on the vertices of three different graphs, in particular, two classes of subject-specific graphs learned from the data. Our analysis showed that imagined motor activities are generally spatially smooth on the learned graphs, and can thus be effectively represented by using only a subset of their graph frequency components. Furthermore, we showed that temporal dynamics manifested in EEG signals can be captured by using the FKT, resulting in a discriminative subspace that can better separate motor imagery classes. The classification results showed the superior performance of the proposed method compared to three prior related alternative methods, indicating the benefit of extracting features via integrating spatial and temporal characteristics of EEG signals within a GSP setting. In future work, to obtain more informative features at the resolution of multiple frequency bands rather than at the resolution of eigenvalues, we will investigate how EEG signals can be best abstracted based on the distribution of their energy in the graph spectral domain using filter banks [45, 51].
Acknowledgment
The authors certify that they have no conflict of interest to report with regard to the subject matter discussed in this paper. The authors are grateful to Itani and Thanou [29] for sharing the code of their paper. A preliminary version of this work has been presented [46].