Cross-subject Mapping of Neural Activity with Restricted Boltzmann Machines



Abstract
Subject-to-subject variability is a common challenge in generalizing models of neural data across subjects, in discriminating subject-specific and inter-subject features in large neural datasets, and in engineering neural interfaces with subject-specific tuning. We study the problem of the cross-subject mapping of neural activity. The objective is to obtain a task-specific representation of the source subject signal in the feature space of the destination subject. We propose to use the Restricted Boltzmann Machine (RBM) with Gaussian inputs and Bernoulli hidden units; once trained over the entire set of subjects, the RBM allows the mapping of source features onto destination feature spaces using Gibbs sampling. We also consider a novel, computationally efficient training technique for RBMs based on the minimization of the Fisher divergence, which allows the gradients of the RBM to be computed in closed form. We use neural decoding as a downstream application to test the method. Specifically, we test decoding on neuromuscular recordings of spike trains from the ten muscles that primarily control wing motion in an agile flying hawk moth, Manduca sexta. The dataset consists of this comprehensive motor program recorded from nine subjects, each driven by six discrete visual stimuli. The evaluations show that the source features can be decoded using the destination classifier with an accuracy of up to 95% when mapped using an RBM trained by Fisher divergence, showcasing the promising potential of RBMs for cross-subject mapping applications.

Author summary
In this study, we address the variability of neural data across subjects, which is a significant obstacle in developing models that can generalize across subjects. Our objective is to create a task-specific representation of the source subject signal in the feature space of the destination subject. To this end, we consider the application of the Restricted Boltzmann Machine (RBM) with Gaussian inputs and Bernoulli hidden units, trained on the joint feature space of the source subject and destination subject.
The trained RBM can then be used to map source features onto the destination feature spaces using Gibbs sampling. We also present a novel, score-based, computationally efficient training technique for RBMs based on Fisher divergence. Using neural decoding as a downstream application, we demonstrate the effectiveness of our method on neuromuscular recordings of spike trains from the ten muscles controlling wing motion in an agile flying hawk moth, Manduca sexta, recorded from nine subjects. Numerical evaluations show that the source features can be decoded using the destination classifier with up to 95% accuracy when mapped using an RBM trained by Fisher divergence.

Introduction
Learning algorithms in neuroscience are required to generalize well across unseen subjects of a population. Yet, implementing reliable cross-subject algorithms in neuroscience is a notoriously challenging problem. An important factor contributing to its difficulty arises from the non-stationary nature of neural activity signals, whose statistical properties vary dramatically even under slight changes in the recording conditions [1,2]. As a result, algorithms trained and optimized on data collected from a given subject fail to perform reliably when directly applied to other subjects. For instance, a neural decoder trained on one subject will perform close to a random choice classifier if applied directly to a different subject, thus failing to identify the correct neurological state or stimulus condition even when the subjects perform the same tasks simultaneously [3].

Problems of this type, i.e., problems where the training and test data originate from different distributions, are common in machine learning and are typically tackled within the sub-field of transfer learning. In the context of the cross-subject problem, various approaches have already been considered [2]. In our previous work on this problem, we focused on domain adaptation methods that map the source data onto the destination feature spaces [3]. Generative modeling is another promising approach to cross-subject mapping. More recently, we have also considered using directed graphical models, such as the conditional variational autoencoder (cVAE), to map the source data onto the feature space of the destination data [4], where the learning model for the downstream task is trained. This approach, however, requires a separate directed graphical model to be trained each time a new downstream task and/or new destination subject is considered.

In this paper, we propose to use undirected graphs to generate samples for cross-subject mapping. Specifically, we suggest training a Restricted Boltzmann Machine (RBM) [5,6] for this purpose, as described in Section 4. The RBM is a popular generative model that has had notable success in representation learning, with applications in a wide variety of tasks in neuroscience [7-11]. We note that applications of RBMs in transfer learning have been studied before [12-14] in computer vision.

The main objective of cross-subject learning is to obtain a task-specific representation of the neural activity of one or more source subjects in the destination spaces of one or multiple destination subjects. Once the RBM is trained using a given set of subjects, it can be used to map signals from any source subject within that set onto any destination feature space. We also consider an alternative training method for RBMs based on Fisher divergence minimization [15]. In contrast with conventional contrastive divergence training (which approximates maximum likelihood estimation; see [5] for details), Fisher divergence minimization allows the gradient of the RBM loss to be computed in closed form, fostering an efficient implementation that does not require iterative Gibbs sampling during training.

In Section 5, we evaluate the performance of our method for cross-subject decoding of discrete visual stimulus conditions using the spiking activity of the motor program, specifically the set of spiking motor units, in nine hawk moths [16]. Each moth is exposed to the same set of six visual stimuli, and the neuromuscular activity is collected in the form of spike trains extracted from fine-wire electromyography (EMG) of the ten primary flight muscles that control the wings, resulting in a comprehensive, spike-resolved motor program [17].

The goal of cross-subject mapping is to obtain a representation of the signals of one or more source subjects onto the feature space of one or more destination subjects. In principle, the mapping function can be assumed to be purely deterministic. However, in this paper, we adopt a probabilistic approach, which generates task-specific representations of source subjects in the feature spaces of one or more destination subjects. We outline the details below.

To learn the cross-subject mapping, we consider a conditional probability distribution that generates feature representations in the feature space of destination subjects given the feature vector of source subjects. One option is to directly parameterize the probability density function $p(x_D \mid x_S)$ of this conditional distribution. Another approach is to first learn the distribution $p(x_S)$ and the joint distribution $p(x_D, x_S)$ of all feature vectors, and then obtain the conditional distribution from the joint model. We adopt the latter approach and introduce a hidden layer $h$ into the joint model; sampling from the resulting conditional distribution then produces a representation of $x_S$ in the feature space of destination subjects. The hidden layer $h$ is designed to bridge source and destination feature vectors. We illustrate this generative framework in Fig. 1.

To this end, we aim to learn a generative framework from which we can easily sample the conditional distributions $p(x \mid h; \theta)$ and $p(h \mid x; \theta)$. In previous work [4], we considered the cVAE [18], a directed graphical model, to map the source features onto the feature space of the destination subjects via the decoder of the autoencoder. However, directed graphs are not flexible in accommodating new destination subjects, and the architecture of the cVAE is built on deep neural networks, which are not easy to fine-tune given the limited amount of data. In this paper, we instead consider the class of undirected graphical models, which adapts naturally to this generative framework. Given its relatively simple architecture and straightforward sampling scheme, the RBM is the first generative model we tackle. The details are elaborated below.

We will first discuss the Gauss-Bernoulli RBM and outline the principles of its training. We then review the standard training technique, which aims to minimize the contrastive divergence and thereby the Kullback-Leibler (KL) divergence between the data-generating distribution and the model. We also consider an alternative training technique that minimizes the Fisher divergence. Unlike the classical method, the Fisher divergence approach allows the gradients of the loss to be computed in closed form.

We adopt the following notation conventions. Recall that in the context of the cross-subject mapping problem outlined in Section 4.1, the vector $x$ comprises the feature vectors of the entire population of subjects, i.e., $x = (x_1, \ldots, x_M)$. Let $\nabla_x$ and $\Delta_x$ denote the gradient and Laplacian operators with respect to (w.r.t.) the vector $x$. Let $\mathrm{dg}(x)$ denote a diagonal matrix whose main diagonal is $x$. For a square matrix $A$, let $\mathrm{dg}(A)$ denote the diagonal matrix formed by setting all elements of $A$ not on the main diagonal to zero. We use $\|x\|$ (respectively, $\|A\|$) to denote the L2-norm of $x$ (respectively, the Frobenius norm of $A$). We further use $p^*(x)$ to denote the true data-generating distribution of $x$. In practice, $p^*(x)$ is usually unknown; therefore, given a set of observations, a standard problem is to estimate the model density $p(x)$ from some model class that best explains the data under an appropriate evaluation metric. In this paper, we focus on a parametric density model class $p(x; \theta)$, $\theta \in \Theta$, which is parameterized as an RBM (see Section 4.2.4).
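For concreteness, the operators above translate directly into NumPy; the following small illustration is ours, not from the paper:

```python
# Minimal NumPy illustration of the notation: dg(.) for vectors and
# matrices, the L2-norm, and the Frobenius norm.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
A = np.arange(9.0).reshape(3, 3)

dg_x = np.diag(x)                    # dg(x): diagonal matrix with x on its diagonal
dg_A = np.diag(np.diag(A))           # dg(A): off-diagonal entries of A set to zero

l2_norm = np.linalg.norm(x)          # ||x||: L2-norm of a vector
fro_norm = np.linalg.norm(A, "fro")  # ||A||: Frobenius norm of a matrix
```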

A common practice to measure the deviation of a postulated probability distribution $p(x)$ from the data-generating distribution $p^*(x)$ is to use the KL divergence, defined by
$$D_{\mathrm{KL}}(p^*, p) = \mathbb{E}_{p^*}\!\left[\log \frac{p^*(x)}{p(x)}\right], \tag{1}$$
where the expectation is taken w.r.t. $p^*(x)$ (as denoted by the subscript).

$D_{\mathrm{KL}}(p^*, p) \ge 0$, with equality if and only if $p = p^*$ almost surely. The minimization of the KL divergence over $\theta$ is equivalent to maximum likelihood estimation. The Fisher divergence of $p(x; \theta)$ from the data-generating pdf $p^*(x)$ is defined by
$$D_F(p^*, p) = \frac{1}{2}\,\mathbb{E}_{p^*}\!\left[\big\|\nabla_x \log p^*(x) - \nabla_x \log p(x; \theta)\big\|^2\right], \tag{2}$$
where the expectation is again taken w.r.t. the data-generating pdf $p^*(x)$. We note that $D_F(p^*, p) \ge 0$, with equality if and only if $p^* = p$ almost surely.
Under mild regularity conditions, the Fisher divergence Eq. (2) can be written as [15]
$$D_F(p^*, p) = c^* + \mathbb{E}_{p^*}\!\left[s_F(x, \theta)\right], \tag{3}$$
where $c^*$ is a term that does not depend on $\theta$ and $s_F(x, \theta)$ is the Hyvärinen score, defined as
$$s_F(x, \theta) = \frac{1}{2}\big\|\nabla_x \log p(x; \theta)\big\|^2 + \Delta_x \log p(x; \theta). \tag{4}$$
The result Eq. (3) enables minimizing the Fisher divergence over the space of parameters $\Theta$ by minimizing the empirical average of the Hyvärinen scores $s_F(x_n, \theta)$. Let $\theta^*$ denote the parameter value that minimizes the Fisher divergence between $p(x; \theta)$ and $p^*(x)$ among all model class candidates. By standard asymptotic analysis, it can be shown that the estimate $\hat{\theta}_f = \arg\min_\theta \frac{1}{N}\sum_{n} s_F(x_n, \theta)$ satisfies $\hat{\theta}_f \to \theta^*$ in probability as the number of data points grows (we refer to [15] and references therein for details). This estimation procedure is known as score matching. It has been proved that score matching using the Langevin Monte Carlo method is equivalent to contrastive divergence in the limit of infinitesimal step size [20]. Although this result implies that this variant of contrastive divergence can retain the consistency of score matching, we note that the equivalence holds only for that particular MCMC method; the practical performance of the two methods differs.
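As a toy illustration of score matching (our example, not from the paper), consider fitting a univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$, for which the Hyvärinen score is $s_F(x) = (x - \mu)^2 / (2\sigma^4) - 1/\sigma^2$, and the minimizers of its empirical average are available in closed form:

```python
# Toy score-matching example: fit N(mu, sigma^2) by minimizing the empirical
# Hyvarinen score  s_F(x) = 0.5*(x - mu)^2 / sigma^4 - 1/sigma^2.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=100_000)

# d/d(mu) = 0  =>  mu_hat = sample mean
mu_hat = data.mean()

# With tau = 1/sigma^2 the objective is 0.5*tau^2*E[(x-mu)^2] - tau,
# so d/d(tau) = 0  =>  tau_hat = 1 / E[(x - mu_hat)^2]
sigma2_hat = ((data - mu_hat) ** 2).mean()

print(mu_hat, sigma2_hat)  # ~2.0 and ~2.25, matching the true parameters
```

No sampling from the model is needed at any point, which is precisely the appeal of score matching over contrastive divergence.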
The Gauss-Bernoulli RBM defines a joint density over visible inputs $x \in \mathbb{R}^D$ and binary hidden units $h \in \{0, 1\}^H$:
$$p(x, h; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(x, h; \theta)\big), \tag{5}$$
where the energy function $E(x, h; \theta)$ is given by
$$E(x, h; \theta) = \frac{1}{2}(x - c)^\top \Lambda (x - c) - h^\top (W \Lambda x + b). \tag{6}$$
Here, $W \in \mathbb{R}^{H \times D}$ is the matrix of weights connecting the units of the hidden and input layers, $b \in \mathbb{R}^H$ and $c \in \mathbb{R}^D$ are the vectors of hidden and input layer biases, and $\Lambda = \mathrm{dg}(\lambda)$ denotes the diagonal precision matrix of the inputs. All these parameters are freely learnable, and they are jointly denoted by $\theta$ in Eq. (5) and Eq. (6). We illustrate a Gauss-Bernoulli RBM model in Fig. 2. It is easy to see that the conditional densities are given by
$$p(h \mid x; \theta) = \prod_{j=1}^{H} \sigma\big((W \Lambda x + b)_j\big)^{h_j}\,\big(1 - \sigma\big((W \Lambda x + b)_j\big)\big)^{1 - h_j}, \tag{7}$$
$$p(x \mid h; \theta) = \mathcal{N}\big(x \mid c + W^\top h,\; \Lambda^{-1}\big). \tag{8}$$
The marginal density $p(x; \theta)$ of the visible inputs can also be written in energy-based form,
$$p(x; \theta) = \frac{1}{Z(\theta)} \exp\big(-\mathcal{F}(x; \theta)\big), \tag{9}$$
where $Z(\theta)$ is the normalizing constant (partition function) and $\mathcal{F}(x; \theta)$ is the free energy:
$$\mathcal{F}(x; \theta) = \frac{1}{2}(x - c)^\top \Lambda (x - c) - \mathbf{1}_H^\top \gamma, \tag{10}$$
with $\gamma = \log(\mathbf{1}_H + \exp(W \Lambda x + b))$ denoting the element-wise Softplus function.

The negative log-likelihood of the parameters of the RBM, i.e., the logarithmic loss, can be written as
$$\ell(\theta) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{F}(x_n; \theta) + \log Z(\theta). \tag{11}$$
The gradient obtains a particularly interesting form:
$$\nabla_\theta \ell(\theta) = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \mathcal{F}(x_n; \theta) - \mathbb{E}_{p(x; \theta)}\big[\nabla_\theta \mathcal{F}(x; \theta)\big], \tag{12}$$
where the expectation in the second term is taken w.r.t. the model distribution $p(x; \theta)$. This expectation is analytically intractable; in practice, it is approximated with samples obtained by running a Gibbs chain for a small number of steps, yielding the popular training method known as contrastive divergence (CD), which produces biased estimates of the model parameters [6].

We see that an important implication of approximating MLE-based learning through contrastive divergence minimization is the lack of consistency guarantees. Specifically, minimizing the contrastive divergence is not guaranteed to converge to the data-generating parameter $\theta^*$ that minimizes the KL divergence from $p(x; \theta)$ to the data-generating pdf $p^*(x)$. The impediment can be traced back to the computation of the gradient of the logarithmic loss and the analytical intractability of the second term in Eq. (12), which appears due to the intractability of the partition function $Z(\theta)$. In contrast, the Hyvärinen score of the Gauss-Bernoulli RBM admits the closed form
$$s_F(x, \theta) = \frac{1}{2}\big\|\Lambda W^\top \sigma - \Lambda (x - c)\big\|^2 + \mathrm{tr}\big(\Lambda W^\top \mathrm{dg}(\sigma') W \Lambda\big) - \mathrm{tr}(\Lambda), \tag{13}$$
with $\sigma = \sigma(W \Lambda x + b)$ and $\sigma' = \sigma'(W \Lambda x + b)$, where $\sigma$ and $\sigma'$ respectively denote the element-wise Sigmoid operator and its first derivative.
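To make the preceding concrete, here is a minimal PyTorch sketch of a Gauss-Bernoulli RBM under the energy parameterization of Eq. (6). This is our own illustration rather than the authors' implementation; all names and hyperparameters are hypothetical. It implements the free energy of Eq. (10), the score and Hyvärinen score of Eq. (13) (with parameter gradients obtained via autograd rather than the explicit closed-form expressions), the conditional samplers of Eqs. (7)-(8), and the clamped Gibbs procedure for cross-subject mapping described below.

```python
# Sketch of a Gauss-Bernoulli RBM trained by score matching (Fisher
# divergence minimization), assuming the energy of Eq. (6).
import torch
import torch.nn.functional as F

class GaussBernoulliRBM(torch.nn.Module):
    def __init__(self, n_visible, n_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(0.01 * torch.randn(n_hidden, n_visible))
        self.b = torch.nn.Parameter(torch.zeros(n_hidden))          # hidden bias
        self.c = torch.nn.Parameter(torch.zeros(n_visible))         # visible bias
        self.log_lam = torch.nn.Parameter(torch.zeros(n_visible))   # log-precision

    def _pre(self, x):
        # hidden pre-activation W.Lam.x + b for a batch of row vectors x
        return (x * torch.exp(self.log_lam)) @ self.W.T + self.b

    def free_energy(self, x):
        # Eq. (10): 0.5 (x-c)' Lam (x-c) - sum_j softplus((W Lam x + b)_j)
        lam = torch.exp(self.log_lam)
        quad = 0.5 * ((x - self.c) ** 2 * lam).sum(dim=1)
        return quad - F.softplus(self._pre(x)).sum(dim=1)

    def score_x(self, x):
        # grad_x log p(x) = Lam W' sigma - Lam (x - c)
        lam = torch.exp(self.log_lam)
        sig = torch.sigmoid(self._pre(x))
        return lam * (sig @ self.W) - lam * (x - self.c)

    def hyvarinen_score(self, x):
        # Eq. (13): 0.5 ||grad_x log p||^2 + tr(Lam W' dg(sig') W Lam) - tr(Lam)
        lam = torch.exp(self.log_lam)
        sig = torch.sigmoid(self._pre(x))
        sig_prime = sig * (1.0 - sig)
        sq = 0.5 * (self.score_x(x) ** 2).sum(dim=1)
        trace = sig_prime @ ((self.W * lam) ** 2).sum(dim=1)
        return sq + trace - lam.sum()

    @torch.no_grad()
    def sample_h(self, x):
        return torch.bernoulli(torch.sigmoid(self._pre(x)))         # Eq. (7)

    @torch.no_grad()
    def sample_x(self, h):
        mean = self.c + h @ self.W                                  # Eq. (8)
        return mean + torch.exp(-0.5 * self.log_lam) * torch.randn_like(mean)

def train_fd(rbm, data, epochs=100, lr=1e-3, batch=64):
    # Score matching: minimize the empirical Hyvarinen score; the closed-form
    # score makes training entirely sampling-free.
    opt = torch.optim.Adam(rbm.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            loss = rbm.hyvarinen_score(data[i:i + batch]).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return rbm

@torch.no_grad()
def cross_subject_map(rbm, x_src, src_idx, dst_idx, k=3):
    # Clamped Gibbs sampling (see the mapping procedure below): the source
    # entries of the visible vector stay fixed to x_src, and in the final
    # step the conditional mean c + W'h is returned for the destination
    # coordinates instead of a random draw.
    x = torch.zeros(x_src.shape[0], rbm.c.numel())
    x[:, src_idx] = x_src
    for _ in range(k):
        h = rbm.sample_h(x)
        x = rbm.sample_x(h)
        x[:, src_idx] = x_src        # re-clamp the source features
    return (rbm.c + h @ rbm.W)[:, dst_idx]
```

Relying on autograd applied to the closed-form score keeps training free of Gibbs chains, in contrast to CD, which must sample from the model at every update.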

Proof 1. Taking the derivative of the log-density $\log p(x; \theta)$ w.r.t. $x$, we obtain
$$\nabla_x \log p(x; \theta) = \Lambda W^\top \sigma - \Lambda (x - c),$$
which gives the first term in Eq. (13). To obtain the second term, we first compute the Hessian matrix of the log-density $\log p(x; \theta)$; we obtain
$$\nabla^2_{xx} \log p(x; \theta) = \Lambda W^\top \mathrm{dg}(\sigma') W \Lambda - \Lambda.$$
Plugging the Hessian into the Laplacian $\Delta_x \log p(x; \theta) = \mathrm{tr}(\nabla^2_{xx} \log p(x; \theta))$ gives the second term in Eq. (13), which completes the proof.

We observe that, unlike the logarithmic loss in Eq. (11), the Hyvärinen score can be evaluated explicitly in terms of the parameters $c$, $b$, and $W$ of the Gauss-Bernoulli RBM. Moreover, the calculation does not involve the partition function $Z(\theta)$. This simplifies the computation of the gradient w.r.t. the parameters of the RBM, which can now be obtained by straightforward application of matrix calculus, producing closed-form expressions that can be used directly for training the parameters of the RBM. Indeed, let $\hat{\theta}_f^{(n)}$ denote the parameter estimate at step $n$. The new parameter estimate is obtained through the following update rule (repeated until convergence):
$$\hat{\theta}_f^{(n+1)} = \hat{\theta}_f^{(n)} - \eta\,\frac{1}{B} \sum_{b=1}^{B} \nabla_\theta\, s_F\big(x_b, \hat{\theta}_f^{(n)}\big),$$
where $\eta$ is the learning rate and $B$ is the size of the minibatch of randomly chosen data points $x_b$, $b = 1, \ldots, B$, at step $n$.

Recall from Section 4.1 that in cross-subject mapping, the goal is to obtain destination feature representation(s) $x_D$ from the source feature vector(s) $x_S$. We now elaborate on how to use an RBM and the Gibbs sampler to sample such representations. We first initialize the visible layer by setting its source entries to $x_S$; the destination entries can be initialized arbitrarily (e.g., with zeros). We then alternate between two steps: (i) sample $\hat{h}$ from $p(h \mid x; \theta)$, and (ii) sample $x$ from $p(x \mid \hat{h}; \theta)$, resetting the source entries of $x$ to $x_S$ after each draw. We obtain the final estimate after repeating the above two steps $k \ge 1$ times; Fig. 1 illustrates an example with $k = 3$. This gives the destination feature space representations of the source feature vectors, which can be further processed using algorithms trained on destination data. Alternatively, in the final step, we can skip sampling from $p(x \mid \hat{h}; \theta)$ and instead infer $x_D$ as $\hat{x}_D = (x_j)_{j \in \mathcal{M}_D} = \arg\max_{x_D} p(x \mid \hat{h}; \theta)$. For simplicity and without loss of generality, we have assumed that the sets of source and destination subjects are disjoint.

We study a comprehensive flight motor program for hawk moths. We describe the experimental protocol and related procedures only briefly here; the interested reader is referred to [16], where the data set was first published, for more details. The subjects were presented with visual stimuli in the form of spheres drifting at 5 cycles/second. The spheres drift in opposite directions about the three axes of rotation, which results in 6 different visual stimuli, also known as pitch (up, down), roll (left, right), and yaw (left, right) [16]. The moth responds to each of the 6 discrete stimuli by producing a corresponding turning effort.

Before we delve into more details on the cross-subject neural decoder, we briefly describe our methodology for constructing feature representations from spike trains, proposed in [16]. Since the spike trains are given by variable-length vectors of spike timings, we consider Gaussian kernels, a strategy commonly used in neuroscience, to interpolate the spike trains. The Gaussian kernel is given by
$$K(t; t_n) = \exp\left(-\frac{(t - t_n)^2}{2\sigma^2}\right),$$
where $t_n$ denotes the timing of the $n$-th spike collected from an arbitrary muscle, trial, and moth, $\tau$ denotes the wing stroke cut-off threshold (we only consider spikes that satisfy $t_n \le \tau$), and $\sigma$ is the Gaussian kernel bandwidth.
The goal is to obtain a smooth multivariate time-series representation of the spike trains, in which the spike timing information is conveyed by centering one kernel at each spike and summing the kernels; this yields feature vectors of fixed dimension $\tau \cdot \nu_S$, where $\nu_S$ is the sampling frequency. For consistency, the muscles whose recordings are missing are filled with zero vectors of the same length as above. We then flatten the interpolated time series across muscles to obtain one large feature vector. Finally, we apply PCA and retain only the first $P$ modes. Fig. 3 depicts the feature space of each moth across the first two strongest principal components, i.e., modes after performing PCA. The diagrams also show the confidence ellipses corresponding to one standard deviation for each of the conditions. It can be clearly observed that the data demonstrate strong separability properties even in the first two dimensions of the PCA-based feature space; adding more features (i.e., PCA modes) only increases this separability.
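The feature-construction pipeline can be sketched as follows; this is our own illustration, and the threshold, bandwidth, and sampling-frequency values are placeholders rather than the paper's settings:

```python
# Sketch of the feature construction described above: Gaussian-kernel
# smoothing of spike trains, zero-filling of missing muscles, flattening
# across muscles, and PCA.
import numpy as np
from sklearn.decomposition import PCA

TAU = 0.05      # wing-stroke cut-off threshold tau, seconds (illustrative)
FS = 1000.0     # sampling frequency nu_S, Hz (illustrative)
SIGMA = 0.002   # Gaussian kernel bandwidth, seconds (illustrative)

t_grid = np.arange(0.0, TAU, 1.0 / FS)   # fixed grid of length TAU * nu_S

def smooth_spike_train(spike_times):
    """Sum one Gaussian kernel per spike; returns a fixed-length vector."""
    if spike_times is None or len(spike_times) == 0:
        return np.zeros_like(t_grid)      # missing muscle -> zero vector
    t_n = np.asarray(spike_times)
    t_n = t_n[t_n <= TAU]                 # keep only spikes with t_n <= tau
    return np.exp(-(t_grid[:, None] - t_n[None, :]) ** 2
                  / (2.0 * SIGMA ** 2)).sum(axis=1)

def trial_features(trial):
    """trial: list of 10 per-muscle spike-time arrays (None if missing)."""
    return np.concatenate([smooth_spike_train(m) for m in trial])

# X: (n_trials, 10 * len(t_grid)) matrix of flattened features, then
# features = PCA(n_components=P).fit_transform(X) retains the first P modes.
```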

• Scenario I (Fig. 4a): the features of a single source subject are mapped onto the feature space of a single destination subject.

• Scenario II (Fig. 4b): the features of a single source subject are jointly mapped onto the feature spaces of all remaining destination subjects.

In both of these scenarios, we evaluate the performance of the destination classifier (trained purely on destination data) on the transferred source features through RBM models trained with either standard contrastive divergence minimization or Fisher divergence minimization; we use RBM-CD and RBM-FD to denote these two RBM models, with CD standing for contrastive divergence (see Section 4.2.5) and FD standing for Fisher divergence (see Section 4.2.6). We compare the performance with two benchmarks:

1. Subject-specific neural decoding, where the classifier is trained and tested on purely destination data; equivalently, the data used to train the classifier and the test data come from the same individual subject. In this case, the performance of the neural decoder may be thought of as an upper bound, because it represents the best-case scenario in which the classifier has access to the most relevant information about the feature patterns of the test data.

2. Cross-subject neural decoding with no transfer, where the source data is directly decoded using the destination neural decoder, without transferring the source data into the destination feature space. As discussed in Section 3, the performance of the neural decoder in this case is usually close to a random guess, since the neural decoder does not take into account the differences between the neural activity patterns of the test (source) and training (destination) data.

The training and testing data for the source and destination data sets in both scenarios are formed by randomly splitting the original data sets of each moth into training and testing subsets according to $\omega \in (0, 1)$, which denotes the ratio between the number of training samples and the total number of trials. To characterize the performance of the cross-subject neural decoder statistically, the random train/test splitting procedure is repeated 100 times; the entire model, including the RBM, is then re-trained using the new training data, and the test performance is recorded.
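A sketch of this evaluation protocol is given below (our code; `fit_rbm` and `transfer` stand in for the RBM training and Gibbs-mapping routines, and $\omega = 0.8$ is a placeholder value):

```python
# Repeated random train/test splits for the cross-subject decoder, with the
# RBM re-trained on every split as described above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

OMEGA = 0.8     # train fraction omega (illustrative)
N_REPS = 100    # number of random train/test splits

def evaluate(X_src, y_src, X_dst, y_dst, fit_rbm, transfer):
    """fit_rbm: trains an RBM on the joint training data;
    transfer: maps source test features into the destination space."""
    accs = []
    for rep in range(N_REPS):
        Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
            X_src, y_src, train_size=OMEGA, stratify=y_src, random_state=rep)
        Xd_tr, _, yd_tr, _ = train_test_split(
            X_dst, y_dst, train_size=OMEGA, stratify=y_dst, random_state=rep)
        rbm = fit_rbm(Xs_tr, ys_tr, Xd_tr, yd_tr)   # re-train on each split
        clf = LinearDiscriminantAnalysis().fit(Xd_tr, yd_tr)  # destination decoder
        accs.append(clf.score(transfer(rbm, Xs_te), ys_te))
    return np.mean(accs), np.std(accs)
```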

In cross-subject mapping, we aim to find a destination-space feature representation of the source feature vector; the mapping should be consistent with the task that the source features encode. In other words, the transferred representation of the source feature vector should lie in the region of the destination feature space that corresponds to the task the source is performing, as illustrated in the top row of Fig. 5. In essence, when sampling from the joint pdf $p(x, h)$, we wish to sample from a stimulus-dependent distribution, and this should be guided by the task information the source feature vector encodes. Hence, the trials in the training data set should be organized such that there is an input-output correspondence with respect to the stimulus. As we are unable to establish correspondence between individual training trials, we assign the correspondences at random; the procedure is schematically depicted in the bottom row of Fig. 5. Namely, for each source feature vector from the training set, we randomly choose a destination feature vector from the same stimulus class and declare the pair to be an input-output pair.
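A minimal sketch (ours) of this random stimulus-conditioned pairing:

```python
# Pair each source training trial with a random destination trial that
# shares its stimulus label, then form joint RBM training vectors (x_S, x_D).
import numpy as np

def pair_by_stimulus(X_src, y_src, X_dst, y_dst, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    pairs = np.empty(len(y_src), dtype=int)
    for label in np.unique(y_src):
        src_ix = np.flatnonzero(y_src == label)
        dst_ix = np.flatnonzero(y_dst == label)
        # sample destination trials (with replacement) from the same class
        pairs[src_ix] = rng.choice(dst_ix, size=len(src_ix), replace=True)
    return np.hstack([X_src, X_dst[pairs]])   # joint visible vectors for the RBM
```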

We used the following values for the free parameters:

In both scenarios, we use Linear Discriminant Analysis (LDA) for classification, which we also used in [16] and which has been shown to deliver exceptionally high decoding accuracy. Fig. 6 shows the decoding results, with subject-specific decoding serving as the upper bound and cross-subject decoding with no transfer being the lower bound. The performance of the cross-subject neural decoder being upper-bounded by the performance of the subject-specific neural decoder is intuitively expected. Note that in Fig. 6 the lower bound varies around 0.16, which corresponds to random-choice decoding in our case (as there are 6 stimuli in the experiment; see Section 5.1). The performance of the cross-subject neural decoder is similar in Scenario II, but we omit it from Fig. 7 to avoid clutter. We conclude that the performance of the cross-subject neural decoder without transferring the source features to the destination is very poor and produces a poor lower bound. The poor decoding performance of the second benchmark, where the source test data is directly decoded using a decoder trained on the destination data without any prior transformation/adaptation of the source data to the destination feature space, is also intuitively expected. This can be most easily seen by inspecting Fig. 3, which depicts the feature spaces across the first two modes for each moth. Even though each moth individually exhibits a high degree of class separability (which ultimately results in very reliable subject-specific decoding performance, as demonstrated in Fig. 6), there is very little alignment between the geometric distribution of the classes/tasks (i.e., visual conditions) in the space spanned by the first two modes across different moths. In fact, the task-specific representations seem to occupy arbitrary segments of the feature space across the first two dimensions, and no discernible pattern can be directly observed; adding more modes/features, which are required to achieve high subject-specific separability, only exacerbates the differences between the feature spaces across moths. As a result, directly decoding any source moth using a neural decoder trained over a different, destination moth results in poor performance, as reported in Fig. 6.

The role of the RBM is to serve as a non-linear mapping function that takes source features and adapts them to the destination feature space where the decoder was trained; this effect can be observed in Fig. 8. The results in Fig. 8, in addition to the decoding results presented in Fig. 6, clearly demonstrate that the FD-RBM model has successfully learned a non-linear transformation that takes a task-specific feature representation from an arbitrary source and maps it into the adequate task-specific region of the destination feature space.

By comparing the results in Fig. 6 with the results in Fig. 7, we also observe that the neural decoder in Scenario I outperforms the neural decoders in Scenario II. This is intuitively expected: in Scenario II, the size of the joint destination feature vector $x_D$ is $M - 1$ times larger than the source feature vector $x_S$; that is, in Scenario II we are jointly obtaining the representation of a single source in 8 different destination feature spaces. The opposite reasoning applies to Scenario I. Hence, the drop in performance from Fig. 6 to Fig. 7 is expected. Furthermore, for a given population of subjects indexed in $\mathcal{M}$, we can view these two scenarios as the two extreme cases that put the upper (Scenario I) and lower (Scenario II) bounds on the performance.

We also observe that the cross-subject neural decoder with RBM-FD transfer outperforms the same cross-subject decoder with RBM-CD transfer. This is an interesting finding, further indicating that in the case of RBMs, training based on Fisher divergence minimization yields better results than the more conventional approach based on maximum likelihood. This result is also consistent with our findings on popular public datasets such as MNIST, where we used RBMs for applications such as compression and reconstruction and observed that an RBM trained via Fisher divergence minimization yields better performance.

The design of reliable, robust, and low-cost solutions for cross-subject mapping is a challenging problem in neuroscience. In this paper, we proposed a general framework for learning the joint distribution of source and destination feature representations across a set of subjects using an undirected graphical model, the RBM, which we evaluated on a neural decoding task and experimental data collected from nine hawk moths presented with six discrete visual stimuli.