## Abstract

Synaptic clustering on neuronal dendrites has been hypothesized to play an important role in implementing pattern recognition. Neighboring synapses on a dendritic branch can interact in a synergistic, cooperative manner via the nonlinear voltage-dependence of NMDA receptors. Inspired by the NMDA receptor, the single-branch clusteron learning algorithm (Mel 1991) takes advantage of location-dependent multiplicative nonlinearities to solve classification tasks by randomly shuffling the locations of “under-performing” synapses on a model dendrite during learning (“structural plasticity”), eventually resulting in synapses with correlated activity being placed next to each other on the dendrite. We propose an alternative model, the gradient clusteron, or G-clusteron, which uses an analytically-derived gradient descent rule where synapses are “attracted to” or “repelled from” each other in an input- and location-dependent manner. We demonstrate the classification ability of this algorithm by testing it on the MNIST handwritten digit dataset and show that, when using a softmax activation function, the accuracy of the G-clusteron on the All-vs-All MNIST task (85.9%) approaches that of logistic regression (92.6%). In addition to the synaptic location update plasticity rule, we also derive a learning rule for the synaptic weights of the G-clusteron (“functional plasticity”) and show that the G-clusteron with both plasticity rules can achieve 89.5% accuracy on the MNIST task and can learn to solve the XOR problem from arbitrary initial conditions.

## Introduction

In the discipline of machine learning, artificial neural networks (ANNs) have gained great popularity due to their impressive success in solving a wide variety of computational tasks. ANNs also hold a special appeal due to their similarity to the networks of neurons in biological brains, lending hope to the possibility that ANNs might serve as a general-purpose algorithm for replicating human-like intelligence. ANNs have also been used by computational neuroscientists to simulate activity dynamics and learning processes in the brain. Despite their utility for both machine learning and neuroscience, ANNs operate at a level of abstraction that disregards many of the details of biological neural networks. In particular, the artificial neurons in ANNs integrate their inputs linearly; that is to say, each input to a neuron in an ANN is given an independent weight and the neuron applies an activation function to the linear weighted sum of these inputs.

In contrast, synapses in biological neurons display an assortment of nonlinear interactions due to the passive biophysical properties of the cell as well as the active properties of voltage-gated ion channels. One such nonlinear channel of particular interest is linked to the NMDA receptor. The NMDA channel is both ligand-gated and voltage-gated (Jahr and Stevens 1990; Mayer, Westbrook, and Guthrie 1984; Nowak, Bregestovski, and Ascher 1984), allowing neighboring synapses on a dendrite to interact in a cooperative, supralinear manner. When two nearby NMDA-based synapses are activated simultaneously, the voltage-dependence of the NMDA receptor creates a voltage perturbation (depolarization) that is greater than the linear sum of the two inputs had they been activated independently (Polsky, Mel, and Schiller 2004).

The NMDA receptor has been shown to play a crucial role in structural plasticity, or the growth and elimination of synapses between neurons (Breton-Provencher, Coté, and Saghatelyan 2014; Kleindienst et al. 2011; Niculescu et al. 2018; Takahashi et al. 2012). Structural plasticity plays a crucial role in development (Elston and Fujita 2014) and learning (Caroni, Donato, and Muller 2012; Trachtenberg et al. 2002; Yang, Pan, and Gan 2009). Structural plasticity increases the likelihood that temporally correlated synapses are placed next to each other (El-Boustani et al. 2018; Kleindienst et al. 2011; Lu and Zuo 2017; Takahashi et al. 2012; Winnubst et al. 2015), which may take advantage of the NMDA supralinearity. While some structural plasticity may be due to new presynaptic-postsynaptic pairs being formed, structural plasticity can allow neurons with extant synaptic connections between them to find the most functionally effective location to make a synapse. The commonality of multiple synaptic contacts between a single presynaptic-postsynaptic pair of cells, as shown by electron microscopy (Bartol et al. 2015; Fares and Stepanyants 2009; Kasthuri et al. 2015; Motta et al. 2019), lends support to the idea that connected neurons may be sampling dendritic locations so as to form optimal functional clusters.

The synergistic coactivation of neighboring inputs on the dendrite via nonlinear NMDA receptors and the ability of a neuron to relocalize its synapses in response to correlated activity have been theorized to have a computational function. An early model (Mel 1991, 1992) proposed that, by placing synapses with correlated inputs next to each other, a neuron could learn to solve a classification task. To demonstrate this capability, Mel created a simplified model of a neuron consisting of a single long dendrite, called the *clusteron*. The dendrite consisted of a sequence of discrete locations, from 1 to N, where N is the number of features in the input (in an image, N would be the number of pixels). Each input to the neuron would impinge upon a specific dendritic location. The “activation” of each synapse was defined as the direct input to that synapse multiplied by the sum of the inputs to nearby synapses within a fixed radius on the dendrite.

In order to train the clusteron to recognize a class of patterns (e.g. to identify pictures of dogs), the clusteron is presented with patterns from the class to be recognized (called the positive class) and stores the average activation of each synapse per pattern. At every epoch of learning, the clusteron randomly swaps the locations of “poorly performing” synapses (i.e. synapses with a low activation relative to the other synapses) with each other on the dendrite, eventually resulting in a configuration wherein correlated inputs become spatially adjacent to each other and, thus, interact nonlinearly with each other, leading to a higher overall activation for the positive class of patterns.
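To make the clusteron's activation rule concrete, here is a minimal sketch in Python. The function name is ours, and we assume (as one natural reading of the rule) that the neighborhood sum includes the synapse's own input:

```python
import numpy as np

def clusteron_activations(x, radius):
    """Activation a_i = x_i * sum of inputs x_j within `radius` slots of slot i."""
    N = len(x)
    a = np.empty(N)
    for i in range(N):
        lo, hi = max(0, i - radius), min(N, i + radius + 1)
        a[i] = x[i] * x[lo:hi].sum()   # neighborhood sum, including x_i itself
    return a
```

During learning, synapses whose average activation falls below that of their peers would then have their dendritic slots randomly swapped with each other.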

In this work, we propose an alternative model for learning via dendritic nonlinearities and structural plasticity on the single dendrite, which we call the *gradient clusteron*, or G-clusteron. Unlike the original clusteron model, the G-clusteron uses a dendrite with continuous-valued locations (as opposed to the discrete locations in the clusteron) and a bell-shaped distance-dependent function to model the location dependence of interactions between synapses. These modifications allow us to analytically derive a gradient descent-based learning rule for the synaptic locations of the inputs to the G-clusteron. We show that the G-clusteron’s location-update rule can learn to solve the classic MNIST handwritten digit multiclass classification task with accuracy comparable to a linear classifier (logistic regression). Moreover, we derive an additional plasticity rule for the synaptic weights of the G-clusteron and show that when the synaptic location rule and the synaptic weight rule are used simultaneously, the G-clusteron can learn to solve the exclusive or (XOR) binary classification task, a feat which cannot be accomplished by a linear classifier (Minsky and Papert 1969).

## Results

The G-clusteron is a model neuron with a single one-dimensional “dendrite” containing synapses at various dendritic locations (Figure 1A). The *activation* of a synapse is defined as the product of its weighted input with a distance-weighted sum of the weighted input of every synapse on the dendrite (including itself). The G-clusteron then applies a nonlinearity to the sum of its synaptic activations.

Formally, for a given input pattern **x** = [*x*_{1}, *x*_{2}, …, *x*_{N}], the activation *a*_{i} of synapse *i* is:

$$a_i = w_i x_i \sum_{j=1}^{N} f(l_i, l_j)\, w_j x_j \tag{1}$$

where *l*_{i} is the real-valued location of synapse *i* on the dendrite, *w*_{i} is the weight of synapse *i*, and *f*(*l*_{i}, *l*_{j}) is a bell curve-shaped distance-dependent function which determines how much each synapse affects each other synapse, defined as:

$$f(l_i, l_j) = e^{-\frac{(l_i - l_j)^2}{r^2}} \tag{2}$$

where *r* is a hyperparameter that determines the width of the curve. Note that *f*(*l*_{i}, *l*_{j}) = 1 when *l*_{i} = *l*_{j} (i.e. when synapses *i* and *j* occupy the same location) and that *f*(*l*_{i}, *l*_{j}) approaches 0 as synapses *i* and *j* move further away from each other (Figure 1B). Also note that *f*(*l*_{i}, *l*_{j}) = *f*(*l*_{j}, *l*_{i}). For convenience of notation and computation, we define a matrix *F* such that *F*_{ij} = *f*(*l*_{i}, *l*_{j}); *F* is thus a symmetric matrix with ones on its diagonal.

For a given pattern of presynaptic inputs **x**, the G-clusteron sums its activations together with a bias term *b* to produce:

$$h(\mathbf{x}) = \sum_{i=1}^{N} a_i + b \tag{4}$$

Because we will be using the G-clusteron as a binary classifier, we apply a sigmoidal nonlinearity *g* and produce an output *ŷ* between 0 and 1:

$$\hat{y} = g(h(\mathbf{x})) = \frac{1}{1 + e^{-h(\mathbf{x})}}$$

As in logistic regression, *ŷ* can be interpreted as a probability estimate for the binary label of the input pattern **x**, with values closer to 0 representing a prediction for the label 0 (called the negative class) and values closer to 1 representing a prediction for the label 1 (called the positive class).

The G-clusteron thus differs from the original clusteron (Mel 1991) in two important ways: 1) each synapse has a real-valued location on the dendrite instead of an integer-valued location index, and 2) each synapse's activation depends on the inputs of its neighbors via a gradually decreasing distance-dependent function rather than a hard cutoff at a fixed distance.
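As a concrete illustration, the forward pass described above can be sketched in a few lines of Python. The Gaussian form of the distance-dependent function and all names here are our own illustrative choices:

```python
import numpy as np

def f_matrix(l, r):
    """Distance-dependent factors F[i, j] = exp(-(l_i - l_j)^2 / r^2)."""
    d = l[:, None] - l[None, :]          # signed pairwise distances
    return np.exp(-(d ** 2) / r ** 2)    # symmetric, ones on the diagonal

def g_clusteron_output(x, w, l, b, r):
    """Return (activations a_i, summed input h, sigmoid output y_hat)."""
    s = w * x                            # weighted inputs w_i * x_i
    F = f_matrix(l, r)
    a = s * (F @ s)                      # a_i = w_i x_i * sum_j f(l_i, l_j) w_j x_j
    h = a.sum() + b                      # sum of activations plus bias
    y_hat = 1.0 / (1.0 + np.exp(-h))     # sigmoidal nonlinearity
    return a, h, y_hat
```

Note that when two synapses with same-sign weighted inputs share a location, each contributes the other's full weighted input to its own activation, which is the source of the supralinear "clustering" advantage.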

Defining the output of the G-clusteron in this fashion allows us to derive a gradient descent plasticity rule for the synaptic locations. For each input pattern presented to the G-clusteron, we update each synaptic location according to the rule:

$$\Delta l_i = -\eta_L \frac{\partial J}{\partial l_i} \tag{6}$$

where:

$$\Delta l_i = \eta_L\,(\hat{y} - y) \sum_{j=1}^{N} w_i x_i\, w_j x_j\, f(l_i, l_j)\,(l_i - l_j) \tag{7}$$

where *y* ∈ {0, 1} is the true label (negative or positive class) for pattern **x**, *J* is the cross-entropy loss (see Methods), and *η*_{L} is the learning rate for the synaptic locations. This gradient rule for each synapse can be understood as summing over "forces" that depend on pairwise interactions between that synapse and each of the other synapses on the dendrite. The interaction between two synapses depends both on the weighted inputs of the synapses and on the distance between them, in the following manner:

- For positive-class training patterns, synapses with same-sign weighted inputs exhibit "attraction," while synapses with opposite-sign weighted inputs exhibit "repulsion" (Figure 1C, left).
- For negative-class training patterns this trend is reversed: same-sign synapses are repelled, while opposite-sign synapses are attracted (Figure 1C, right).
- The magnitude of the attraction or repulsion between two synapses is proportional to the product of the weighted inputs of the two synapses (Figure 1D).
- The magnitude of the attraction or repulsion between two synapses is distance-dependent: the attractive/repulsive force is small at very small distances, becomes larger at intermediate distances, and shrinks again at large distances (Figure 1E).

(The magnitude of the update for each synapse is also scaled by the magnitude of the error term, |*ŷ* − *y*|.)

Eq. 7 can be interpreted as a vector field along the dendrite, analogous to force fields in physical systems of particles. If we consider a "unit synapse" such that *wx* = 1 at an arbitrary location *l* on the dendrite, the magnitude and direction of the "force vector" created by the plasticity rule at that location for a given input pattern **x** is given by:

$$\vec{F}(l) = \eta_L\,(\hat{y} - y) \sum_{j=1}^{N} w_j x_j\, f(l, l_j)\,(l - l_j)$$
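A minimal sketch of one online application of the location update rule, assuming a Gaussian form for the distance-dependent function and absorbing all constant factors into the learning rate (names are ours):

```python
import numpy as np

def location_update(x, w, l, y, y_hat, r, eta_L):
    """Return updated synaptic locations after one pattern (x, y)."""
    s = w * x                                  # weighted inputs
    d = l[:, None] - l[None, :]                # d[i, j] = l_i - l_j
    F = np.exp(-(d ** 2) / r ** 2)             # distance-dependent factors
    # delta_l[i] = eta_L * (y_hat - y) * sum_j s_i s_j F_ij (l_i - l_j)
    delta_l = eta_L * (y_hat - y) * np.sum(s[:, None] * s[None, :] * F * d, axis=1)
    return l + delta_l
```

For a positive-class pattern (*y* = 1, so the error term is negative) with two same-sign inputs, the two synapses move toward each other, matching the "attraction" behavior described above.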
This interpretation of the location plasticity rule can give us some intuition as to how such an algorithm might be implemented biologically. We can imagine extracellular chemical signals being released at regions of the dendrite where there is a concentration of synaptic activity. Excitatory and inhibitory inputs would have different chemical signals associated with them. These signals could induce presynaptic axons to form or eliminate synapses on the dendritic regions that were recently active, with differential effects for excitatory and inhibitory synapses. The diffusion of these chemicals in the extracellular space around the dendrite would create a location-dependent field of chemical gradients.

### Toy Examples

To illustrate how the location update learning rule works, we consider several toy problems. For these problems, instead of training the G-clusteron to discriminate between two classes by presenting patterns from the positive and negative classes in a random order, we show what happens when the G-clusteron is given a dataset in which all examples are from the positive class (in other words, the G-clusteron is tasked with maximizing its output on all input patterns; see Methods) and when all examples are from the negative class (so the G-clusteron must minimize its output for all input patterns). [For all of the following examples, all synaptic weights are fixed at 1 and do not change over the course of learning; the results thus depend solely on the synaptic locations and inputs.]

For the first toy problem, we create a single synaptic input vector **x** where the values of **x** are randomly distributed between −1 and 1. This input vector is repeatedly presented to the G-clusteron with a positive label (*y* = 1), such that the G-clusteron will attempt to maximize its overall activation. Because the interaction between synapses in Eq. 1 is multiplicative, a sensible strategy would be to minimize the distance between synapses with same-sign inputs and maximize the distance between synapses with opposite-sign inputs. This would result in two clusters on the dendrite: one with positive-valued synaptic inputs and one with negative-valued synaptic inputs. The location update rule, by causing attraction between same-sign synapses and repulsion between opposite-sign synapses, does exactly this (Figure 2A).

We now show how the hyperparameter *r* (i.e. the width of the distance-dependent curve) affects the learning rule for input patterns from the positive class. In the first example, *r* was sufficiently large relative to the initial dispersion of synapses that all of the synapses could "feel" each other, enabling all synapses with the same sign to eventually aggregate into a single cluster. However, if *r* is reduced, synapses that start further apart do not exert a strong pull on each other, so the learning rule operates in a more local fashion, creating several same-sign clusters (Figure 2B).

We next train the G-clusteron on the same input pattern, but this time with a negative label (*y* = 0), such that the G-clusteron will minimize its overall activation. Here we pre-initialize the G-clusteron in a high-activation state where same-sign synapses start in separate clusters. Inversely to the first strategy, we would now like opposite-sign inputs to become intermixed while the distance between same-sign synapses increases. Because the plasticity rule flips attraction and repulsion for input patterns from the negative class, we achieve the expected result (Figure 2C).

Having demonstrated that the G-clusteron learns by aggregating and dispersing same-sign and opposite-sign synaptic inputs for individual input patterns, we illustrate that this mechanism serves to learn the correlation structure of a dataset comprised of multiple input patterns. To this end, we create a 30-dimensional multivariate Gaussian distribution where each dimension of the Gaussian had a mean of 0 and a covariance matrix Σ such that:

$$\Sigma_{ij} = \begin{cases} 1 & \text{if } i = j \\ c & \text{if } i \neq j \text{ and } \lceil i/5 \rceil = \lceil j/5 \rceil \\ 0 & \text{otherwise} \end{cases}$$

where *c* is a positive within-group covariance. In other words, dimensions *x*_{1} … *x*_{5} were all correlated with each other, as were dimensions *x*_{6} … *x*_{10}, *x*_{11} … *x*_{15}, and so on, but there were otherwise no correlations (Figure 3A, left). As such, each vector sampled from this distribution would exhibit similarity among its first five dimensions, its second five dimensions, and so forth (Figure 3B, right).
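This block-covariance construction can be sketched as follows; the within-block covariance value `c = 0.9` and the sample size are placeholders of ours, not values from the original experiments:

```python
import numpy as np

def block_covariance(n_dims=30, block=5, c=0.9):
    """Covariance matrix: 1 on the diagonal, c within blocks of `block` dims, 0 elsewhere."""
    groups = np.arange(n_dims) // block                     # group index of each dimension
    sigma = np.where(groups[:, None] == groups[None, :], c, 0.0)
    np.fill_diagonal(sigma, 1.0)
    return sigma

rng = np.random.default_rng(0)
sigma = block_covariance()
# zero-mean samples: correlated within each group of 5, independent across groups
X = rng.multivariate_normal(np.zeros(30), sigma, size=1000)
```

With these samples, the empirical correlations within a block are high while cross-block correlations hover near zero, mirroring the structure in Figure 3A.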

As the mean of each dimension is 0, a linear neuron would be unable to increase its average output on this dataset, as there is no independent information in each synapse that could be given a linear weight. However, the G-clusteron can take advantage of the correlational structure of the data. Because each group of correlated inputs will tend to be either all-positive or all-negative in any given input pattern, the learning rule will gradually move the correlated inputs together, creating dendritic clusters of correlated inputs. As these clusters form, the output of the G-clusteron increases (Figure 3C). Inversely, the G-clusteron can learn to decrease its output on this dataset by separating out the inputs such that no correlated inputs find themselves in the same cluster on the dendrite (Figure 3D).

### Learning MNIST

The ability of the G-clusteron to learn the correlational structure of a dataset enables it to perform supervised classification by aggregating correlated inputs from the patterns where y = 1 and disaggregating correlated inputs from the patterns where y = 0. We trained the G-clusteron on the MNIST dataset of images of handwritten digits (Figure 4A-B) and compared its performance to that of the original clusteron from (Mel 1991) as well as to logistic regression. Before attempting the all-vs-all classification task, we test the G-clusteron separately on each digit in a one-vs-all classification paradigm. For each digit from 0-9, a G-clusteron was trained on a dataset where half of the images were of that digit (positive class, label y = 1) and half of the images contained other digits (negative class, label y = 0). The G-clusteron was then tested on a holdout test set (See Methods). This procedure was repeated for the original clusteron (only positive-class training examples are used for the original clusteron, see Methods) as well as logistic regression. The results are shown in Table 1.

For all digits, all three classifiers achieved a classification accuracy far above chance level of 50%. Depending on the particular digit, the clusteron achieved between 74.4% and 93.2% accuracy, the G-clusteron achieved between 75.5% and 97.1% accuracy, and logistic regression achieved between 87.3% and 98.0% accuracy. We emphasize that both the clusteron and the G-clusteron had all their weights fixed to 1 during the entire learning procedure and were only able to update their synaptic locations. Our results on the one-vs-all MNIST task thus demonstrate the remarkable efficacy of structural plasticity-based learning.

The distribution of synapses and synaptic activations before and after learning MNIST (Figure 4C-E) can be instructive in understanding how the G-clusteron operates. In many instances, over the course of learning, the synapses on the dendrite moved from an approximately uniform synaptic density to a more clustered structure, with some regions of the dendrite having a higher synaptic density than others (Figure 4E). These higher density areas occasionally had larger activations for patterns from the positive class than for patterns from the negative class (Figure 4D), suggesting that the G-clusteron may be building structural clusters to take advantage of correlated inputs in the positive class relative to the negative class. However, not all high-density clusters showed high activation, and some high-activation regions were low density. Thus, while the G-clusteron learning algorithm does produce structural clusters as a consequence of learning, there is not a guaranteed correspondence between the structural clusters and activation level in a complex task like image classification.

Having shown that the clusteron and G-clusteron have satisfactory performance on the one-vs-all MNIST task, we now turn to the harder problem of all-vs-all classification. Here, we wish to train a single-layer network of 10 classifiers on the MNIST dataset (one for each digit) and have the network classify each digit correctly. We consider two standard ways to train a single-layer network on a multiclass classification problem. In the one-vs-rest (OVR) method, 10 units are independently trained on a one-vs-all paradigm, as before, and the classifier which produces the largest output (*ŷ*) for a given input pattern is declared the “winner”, and the input pattern is assigned to the positive class for that classifier (Figure 5A). In the softmax (SM) method, the units are trained simultaneously on each example and the raw outputs (*h*(*x*)) of all units are passed through a softmax function (see Methods), which normalizes the output of each unit by the sum of the outputs of all the units in the layer (Figure 5B). This normalized output is then used in the error term *ŷ* − *y* when calculating the update rule. Because the softmax method allows the classifiers to communicate with each other via the output normalization, it can often lead to superior results for multiclass classification (Duan et al. 2003).
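The SM readout described above can be sketched as follows (names are ours); note how normalizing by the layer couples the units through their shared error terms:

```python
import numpy as np

def softmax(h):
    """Normalize raw outputs h_k of the 10 units into probabilities."""
    z = np.exp(h - h.max())          # shift for numerical stability
    return z / z.sum()

def sm_errors(h, label, n_classes=10):
    """Per-unit error terms (y_hat - y) used in the update rules."""
    y = np.zeros(n_classes)
    y[label] = 1.0                   # one-hot true label
    return softmax(h) - y
```

Each unit then applies its location (and, later, weight) update rule with its own entry of this error vector, so raising one unit's output necessarily pushes down the normalized outputs of the others.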

Importantly, the original single-dendrite clusteron does not utilize an error signal in its learning rule, so the original clusteron can only be trained on the multiclass task with the OVR paradigm (Figure 5C). Both logistic regression and the G-clusteron, however, do use an error term, making it straightforward to perform multiclass classification with both the OVR and SM methods.

To test the performance of the G-clusteron under both multiclass classification paradigms, we trained the original clusteron, the G-clusteron, and logistic regression using the OVR method, and the G-clusteron and logistic regression (but not the original clusteron) using the SM method.

When using the OVR paradigm, the original clusteron achieved an accuracy of 71.3%, the G-clusteron 74.2%, and logistic regression 92.2%. While both the original clusteron and the G-clusteron performed far better than the chance level of 10%, neither did nearly as well as logistic regression. However, when we trained the G-clusteron using the SM method, it achieved an impressive accuracy of 85.9%, approaching logistic regression's 92.6% in the SM scheme. Although not superior to logistic regression, 85.9% accuracy on an all-vs-all MNIST classification task is notable for an algorithm that can only modify synaptic locations and not synaptic weights.

### Weight update rule for the G-clusteron

In addition to the location update rule, we also derive a gradient descent rule for the weights of the G-clusteron of the form:

$$\Delta w_i = -\eta_w \frac{\partial J}{\partial w_i} \tag{10}$$

where:

$$\Delta w_i = -\eta_w\,(\hat{y} - y)\, x_i \sum_{j=1}^{N} f(l_i, l_j)\, w_j x_j$$

with *η*_{w} as the learning rate for the weights. This rule can either be used on its own or in conjunction with the location update rule (Eq. 6) by simultaneously updating the weights and locations using their respective rules during each epoch. To see if the weight update rule can improve the accuracy of the G-clusteron on the all-vs-all MNIST task, we trained a single-layer G-clusteron network that used only the weight rule as well as a network that used both the location rule and the weight update rule simultaneously, using the same protocols (OVR and SM) as for the G-clusteron using only the location rule.
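One online step of the weight update rule can be sketched under the same assumptions as before (Gaussian distance-dependent function, constant factors absorbed into the learning rate; names are ours):

```python
import numpy as np

def weight_update(x, w, l, y, y_hat, r, eta_w):
    """Return updated weights after one pattern (x, y)."""
    s = w * x                                  # weighted inputs
    d = l[:, None] - l[None, :]
    F = np.exp(-(d ** 2) / r ** 2)             # distance-dependent factors
    # delta_w[i] = -eta_w * (y_hat - y) * x_i * sum_j F_ij w_j x_j
    delta_w = -eta_w * (y_hat - y) * x * (F @ s)
    return w + delta_w
```

Unlike an ordinary perceptron-style weight update, each synapse's update is scaled by the distance-weighted input of its neighborhood, so a weight grows fastest when its synapse sits inside an active cluster.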

When only the weight rule was used, the G-clusteron achieved an accuracy of 79.8% for OVR and 89.2% for SM, and when the update rules were used simultaneously, the G-clusteron achieved improved accuracies of 83.6% for OVR and 89.5% for SM. Thus, being able to manipulate synaptic locations in addition to synaptic weights may allow the G-clusteron to achieve slightly superior accuracy relative to a G-clusteron using either rule on its own (Figure 5D).

### XOR problem

To motivate the use of the weight update rule in combination with the location update rule for the G-clusteron, we consider the XOR function. The XOR function receives two bits of binary input and returns a 1 if the input bits are different and a 0 if the input bits are the same. The XOR function famously cannot be implemented by linear classifiers like the perceptron (Minsky and Papert 1969). It is thus valuable to demonstrate that the G-clusteron can implement this function. We first wish to show that there are values for the parameters (i.e. the weights and the locations of the two synapses) of the G-clusteron that will result in a correct implementation of the XOR function. We then show that a G-clusteron must be able to update both its synaptic locations and its synaptic weights if it is to learn to solve the XOR problem from an arbitrary initialization of its parameter values.

To implement the XOR function, the parameters of the G-clusteron must satisfy both of the following inequalities (see Methods):

$$w_1^2 + 2F_{12}\, w_1 w_2 < 0 \tag{12}$$

$$w_2^2 + 2F_{12}\, w_1 w_2 < 0 \tag{13}$$

Note that although we originally have 4 parameters (*w*_{1}, *w*_{2}, *l*_{1}, and *l*_{2}), the solution space satisfying these inequalities (Figure 6A-B) can be expressed in terms of only three parameters: *w*_{1}, *w*_{2}, and *F*_{12}.

When expressed in this manner, there are several things to observe about these inequalities. First, note that for *F*_{12} ≤ 0.5, there is no way to assign the weights such that Eq. 12 and Eq. 13 are satisfied (Figure 6B, left). This means that if the two synapses are initialized sufficiently far away from each other such that their distance-dependent factor is less than 0.5, a G-clusteron with only a weight update rule (Eq. 10) would have no way to solve the XOR problem.

Moreover, the range of weights that are valid to solve the XOR problem increases as *F*_{12} increases from 0.5 to 1, where the solution space for the weights is maximal (Figure 6B, center and right). In other words, as the synapses move closer together, there is a larger set of weights that would satisfy Eq. 12 and Eq. 13. However, even when the synapses occupy the exact same location (i.e. *F*_{12} = 1), there is still a large range of invalid weights. For example, if both weights are initialized with the same sign, a G-clusteron with only a location update rule (Eq. 6) could not implement the XOR function (Figure 6B and 6C). Therefore, for the G-clusteron to solve XOR from arbitrary initial parameter values, it must be able to update both its weights and its locations.
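A small helper, written as a sketch, makes these feasibility conditions easy to probe; here we take the two inequalities to be w₁² + 2F₁₂w₁w₂ < 0 and w₂² + 2F₁₂w₁w₂ < 0 (our reading of Eqs. 12 and 13):

```python
def solves_xor(w1, w2, F12):
    """True if (w1, w2, F12) satisfies both XOR feasibility inequalities."""
    cross = 2.0 * F12 * w1 * w2
    return (w1 ** 2 + cross < 0) and (w2 ** 2 + cross < 0)
```

Probing this predicate reproduces the observations above: same-sign weights never work, and no weight assignment works once *F*_{12} drops to 0.5 or below.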

To demonstrate this numerically, we created 3 G-clusterons: one with only a weight update rule, one with only a location update rule, and one with both update rules. For each G-clusteron, we ran 1000 trials with randomized initializations on the inputs and labels for the XOR problem. Each trial ran for 10,000 epochs or until convergence (see Methods).

A trial was expected to converge for a G-clusteron with only the weight update rule if the initial *F*_{12} was greater than 0.5, and for a G-clusteron with only the location update rule if the initial weight values were able to satisfy both inequalities (Eq. 12 and 13) given *F*_{12} = 1. For the G-clusteron with both the weight and the location update rules, all trials were expected to converge (Fig. 6D).

The G-clusteron with only the weight update rule converged on 452 of the 518 trials analytically expected to converge (Figure 6D, 6E). It unexpectedly failed on trials where *F*_{12} was initialized only slightly above the boundary condition of 0.5: the solution space for the weights when *F*_{12} is fixed near 0.5 is small (Figure 6B, middle), so the learning algorithm can take an exorbitantly long time to drive the weights into this exactingly narrow range of values.

The G-clusteron with only the location update rule converged for 249 trials out of 261 trials analytically expected to converge (Figure 6D, 6F), unexpectedly failing on trials where *F*_{12} was initialized very close to 0, as before, and also on trials where the weights were initialized very close to the boundary of the solution space (Fig. 6B, right panel). In these cases it was evidently difficult for the algorithm to find the bias value with sufficient precision to correctly perform the classification, even though the weights and locations were analytically valid.

Only the G-clusteron with both the weight and the location update rules was able to converge on almost all of the trials (947 out of 1000), unexpectedly failing only on trials where *F*_{12} was initialized very close to 0 (Fig. 6D, 6G). These unexpected failures occurred because the magnitude of the location update rule for a given epoch scales with the value of *F*_{12} and the weights at that epoch (Eq. 7). Thus, if *F*_{12} is initialized very close to 0, the synaptic locations may move only a minuscule amount each time the gradient rule is applied, depending on the values of the other parameters.

## Methods

### Derivation of Location Update Rule for the G-clusteron

We want a learning rule that updates the synaptic locations on each iteration of the algorithm of the form:

$$\Delta l_i = -\eta_L \frac{\partial J}{\partial l_i}$$

where Δ*l*_{i} is proportional to the negative gradient of the error with respect to the locations. For an arbitrary loss function *J* and nonlinearity *g*, where we have *J*(*g*(*h*(*l*_{i}))), the chain rule gives:

$$\frac{\partial J}{\partial l_i} = \frac{\partial J}{\partial g}\,\frac{\partial g}{\partial h}\,\frac{\partial h}{\partial l_i}$$

The first two factors of the gradient, ∂*J*/∂*g* and ∂*g*/∂*h*, are specific to the cost function and nonlinearity chosen. In our case, if we use the logit cross-entropy error

$$J = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$

and sigmoidal nonlinearity

$$g(h) = \frac{1}{1 + e^{-h}}$$

we have:

$$\frac{\partial J}{\partial g}\,\frac{\partial g}{\partial h} = \hat{y} - y$$

For the final term:

$$\frac{\partial h}{\partial l_k} = -\frac{4}{r^2} \sum_{j=1}^{N} w_k x_k\, w_j x_j\, f(l_k, l_j)\,(l_k - l_j)$$

[Note that the calculation of this derivative requires differentiating Equation 4 with respect to an *arbitrary* location *l*_{k} (i.e. finding ∂*h*/∂*l*_{k}). This results in an expression that depends only on *k* and *j*, which allows us to perform a notation substitution of *i* for *k*.] We thus have:

$$\frac{\partial J}{\partial l_i} = -\frac{4}{r^2}\,(\hat{y} - y) \sum_{j=1}^{N} w_i x_i\, w_j x_j\, f(l_i, l_j)\,(l_i - l_j)$$

We can subsume the constant factor 4/*r*² into the learning rate, which gives:

$$\Delta l_i = \eta_L\,(\hat{y} - y) \sum_{j=1}^{N} w_i x_i\, w_j x_j\, f(l_i, l_j)\,(l_i - l_j)$$

which is our location update rule (Eq. 7). The same expression holds if we use a softmax nonlinearity:

$$g_k(\mathbf{h}) = \frac{e^{h_k}}{\sum_m e^{h_m}}$$

and a cross-entropy cost function:

$$J = -\sum_k y_k \log \hat{y}_k$$

as we do for the multiclass MNIST classification task.
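Gradient derivations of this kind are straightforward to check numerically. The sketch below compares the analytic ∂*J*/∂*l*_{i} (before absorbing the constant 4/*r*² into the learning rate) against central finite differences, under our Gaussian choice for the distance-dependent function; all names are ours:

```python
import numpy as np

def loss(x, w, l, b, r, y):
    """Cross-entropy loss of the G-clusteron on one pattern."""
    s = w * x
    d = l[:, None] - l[None, :]
    F = np.exp(-(d ** 2) / r ** 2)
    h = (s * (F @ s)).sum() + b
    y_hat = 1.0 / (1.0 + np.exp(-h))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(x, w, l, b, r, y):
    """dJ/dl_i = -(4/r^2)(y_hat - y) sum_j s_i s_j F_ij (l_i - l_j)."""
    s = w * x
    d = l[:, None] - l[None, :]
    F = np.exp(-(d ** 2) / r ** 2)
    h = (s * (F @ s)).sum() + b
    y_hat = 1.0 / (1.0 + np.exp(-h))
    return -(4.0 / r ** 2) * (y_hat - y) * np.sum(s[:, None] * s[None, :] * F * d, axis=1)
```

Agreement of the two to finite-difference precision confirms that the symmetric double sum over *i* and *j* contributes the factor of 2 (and the Gaussian derivative the factor 2/*r*²) claimed in the derivation.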

### Derivation of Bias Update Rule

We also require a rule for the update of the bias term *b*. As above:

$$\Delta b = -\eta_b \frac{\partial J}{\partial b}$$

and

$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial g}\,\frac{\partial g}{\partial h}\,\frac{\partial h}{\partial b}$$

The first two factors are as above; the final factor is:

$$\frac{\partial h}{\partial b} = 1$$

so we have:

$$\Delta b = -\eta_b\,(\hat{y} - y)$$

(Note the difference in sign relative to the location rule: here the derivative of the distance-dependent function contributes no negative factor, so the minus sign of gradient descent survives in the final expression.)

### Derivation of Weight Update Rule

In addition to the location update rule, we also derive a gradient descent rule for the weights of the G-clusteron of the form:

$$\Delta w_i = -\eta_w \frac{\partial J}{\partial w_i}$$

where *η*_{w} is the learning rate for the weights. As with the location update rule, we have:

$$\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial g}\,\frac{\partial g}{\partial h}\,\frac{\partial h}{\partial w_i} = (\hat{y} - y)\,\frac{\partial h}{\partial w_i}$$

Taking the derivative, we obtain:

$$\frac{\partial h}{\partial w_i} = 2\,x_i \sum_{j=1}^{N} f(l_i, l_j)\, w_j x_j$$

We can subsume the factor 2 into the learning rate. The weight update rule for the G-clusteron is thus:

$$\Delta w_i = -\eta_w\,(\hat{y} - y)\, x_i \sum_{j=1}^{N} f(l_i, l_j)\, w_j x_j$$

This rule can easily be adapted to a batch learning protocol by simultaneously calculating the activations over the entire batch, as described below.

### Batch Learning

For computational efficiency on the MNIST task, the output of the G-clusteron and the learning rules can be implemented in a vectorized fashion. We assume here that we will be working with a dataset comprised of multiple input patterns, running from index *p* ∈ 1,2 … *P*. We will denote input *j* in pattern *p* as . To efficiently calculate each synaptic activation over an entire dataset, we define a matrix *S* such that

We also define a signed distance matrix *D* where

The distance-dependent factor matrix *F* can be expressed in terms of *D*:

The activation for synapse *j* in pattern , is the element *A*_{pj} of matrix *A*, which can be

computed as:

Where ° denotes the Hadamard product, i.e. elementwise multiplication.

For the location learning rule, we utilize a batch protocol where updates are performed after observing *P* patterns (*P* here is the size of the batch, not the dataset). Here, for each pattern *p* in the batch we define a matrix *Q*^{(p)} such that
The derivative of the loss function on the entire batch with respect to location *l*_{i} is given by:
Using our matrix notation, we have
However, this is computationally intensive as it requires an elementwise multiplication for every input pattern in the batch (*F*°*D* can be precomputed for the entire batch, but *F*°*D*°*Q*^{(p)} is different for every input pattern). We therefore rearrange and obtain:
Which averages over all input patterns in the batch first, requiring only two elementwise multiplications regardless of batch size.
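The rearranged batch gradient can be sketched as follows, again under the assumed forms of *S*, *D*, and *F* used above (our reading of the derivation, with *e*_p = *ŷ*_p − *y*_p as the per-pattern error):

```python
import numpy as np

# Naive form: for each pattern p, accumulate (F ∘ D ∘ Q^(p)) with the assumed
# Q^(p)[i, j] = e[p] * S[p, i] * S[p, j].  Averaging over patterns first gives
# C = S^T diag(e) S / P, so only two Hadamard products are needed in total.

def location_gradient(S, D, F, e):
    P = S.shape[0]
    C = S.T @ (e[:, None] * S) / P   # average the pattern-dependent part first
    G = F * D * C                    # the only two elementwise multiplications
    return G.sum(axis=1)             # gradient with respect to each l_i
```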

### Toy examples

To train the G-clusteron to continually increase or decrease its activation on a particular example or dataset, we treated the activation function as a threshold function that always returned the wrong answer, such that it would perform the maximal update at each epoch. For patterns with label *y* = 0, we fixed *ŷ* = 1 such that *ŷ* − *y* = 1; for patterns with label *y* = 1, we fixed *ŷ* = 0 such that *ŷ* − *y* = −1.
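This "always wrong" protocol can be sketched in a few lines (`forced_error` is a hypothetical helper name, not from the paper):

```python
# Clamp the prediction y_hat to the opposite of the true label y, so the error
# term (y_hat - y) is always +1 or -1 and every update has maximal magnitude.

def forced_error(y):
    y_hat = 1 - y        # always the wrong answer
    return y_hat - y     # +1 when y == 0, -1 when y == 1
```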

### Learning MNIST


For the MNIST learning tasks, we used the Tensorflow (Abadi et al. 2016) MNIST dataset, which is composed of 60,000 training examples and 10,000 test examples. The examples are split roughly evenly between the 10 digits.

For the one-vs-all experiment, the original clusteron was trained only on positive training examples, ∼6,000 examples per digit. The G-clusterons and logistic regression classifiers were trained on a balanced dataset with ∼6,000 images of the digit from the positive class and ∼6,000 total images of other digits. The test set was evenly split between positive and negative examples. Hyperparameters for the one-vs-all task can be found in Table 2.

For the all-vs-all experiment, in both the OVR and SM protocols, each of the 10 clusteron nodes was trained on all images from its positive class. Each G-clusteron, as well as the logistic regression classifier, was trained on the entire MNIST dataset.

For both one-vs-all and all-vs-all (SM and OVR), the radius parameter *r* of the G-clusteron was always set to 0.23. For the original clusteron, synapses interacted if they were within 10 synapses of each other. Hyperparameters were hand-tuned to maximize accuracy in all cases.

### XOR problem

The XOR function is defined as:

For the G-clusteron to solve the XOR problem, we require:

where h(**x**) is the output of the G-clusteron on the input vector **x** (in our case, the vector [*x*_{1}, *x*_{2}]) before applying the sigmoidal nonlinearity, as defined in Eq 4. In the case of two inputs, h(**x**) can be written as:

We therefore have (Table 5):

From Equations 27 and 28 and Table 5, we require:
which can be simplified to Equations 12 and 13 in Results. (Note that two other inequalities follow from Equations 27 and 28, requiring that neither *w*_{1} nor *w*_{2} be equal to 0; however, Equations 12 and 13 already guarantee this.)

To test the different G-clusteron learning rules on the XOR dataset, we created a G-clusteron for each of the three learning paradigms: weight update rule only, location update rule only, and both weight and location update rules together. Each G-clusteron was run on 1000 trials (i.e. parameter initializations) for 10,000 epochs per trial.

For each trial, the weight values were chosen randomly from a uniform distribution between [−1,1], and the initial locations were chosen such that *F*_{12} was uniformly distributed between [0,1]. The G-clusterons were trained with a stochastic gradient descent protocol such that each epoch, one of the four input vectors for the XOR function (Table 4) was presented to the algorithm, which would update its parameters according to the update rule. For computational efficiency, we stopped the learning protocol on a given trial when the algorithm converged. Convergence was defined as achieving perfect accuracy on all 4 input vectors of the XOR problem for 10 consecutive epochs. Hyperparameters and the net runtime for all 10,000 trials are shown in Table 6.
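One way to initialize the two locations so that *F*_{12} is uniform is to sample the target factor and invert it; the sketch below assumes the Gaussian form *F*_{12} = exp(−(*l*_{1} − *l*_{2})²/(2*r*²)), which may differ from the paper's exact definition:

```python
import numpy as np

# Sample u ~ Uniform(0, 1] as the desired value of F_12, then invert the
# assumed Gaussian distance factor to get the separation between l_1 and l_2.

def init_locations(r, rng):
    u = rng.uniform(1e-12, 1.0)          # target value of F_12
    d = r * np.sqrt(-2.0 * np.log(u))    # separation yielding F_12 = u
    l1 = rng.normal()                    # arbitrary reference location
    return l1, l1 + d
```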

For the XOR problem, the radius parameter *r* was set to 1. Momentum was not used for the XOR problem. Hyperparameters were hand-tuned to maximize the number of converged trials. Runtimes differed due to different convergence probabilities for the different rules.

All scripts were written in Python and run on an Acer Aspire 5 Notebook laptop with an i7-10510U quad-core processor (1.80 GHz) and 16 GB RAM, running the Windows 10 operating system.

## Discussion

We have shown that the G-clusteron is a robust single-neuron learning algorithm that can solve real-world classification tasks, such as MNIST handwritten digit classification. The G-clusteron’s ability to solve this task merely by moving synapses on its dendrite without using synaptic weights demonstrates the computational potential of structural plasticity that makes use of distance-dependent nonlinearities. For comparison, recent work to solve MNIST without training synaptic weights (via learning an effective network architecture) required a deep network with dozens of nodes per output class to achieve logistic regression-like performance (Gaier and Ha 2019). The G-clusteron achieves this level of accuracy using only a single neuron per output class (excluding the input layer).

While the original single-dendrite clusteron (Mel 1991) also exhibits impressive performance when tasked with one-vs-all classification on a single digit, the lack of a supervised gradient descent-based plasticity rule makes it difficult to scale this algorithm to a multiclass classification problem, as the clusteron learning rule does not have an error signal that can be communicated between neurons within a layer via a softmax activation function. [It should be noted that the multi-branch clusteron (Poirazi and Mel 2001) employs a supervised learning strategy that applies gradient descent to modify synaptic stability, which in turn determines how synapses are mapped to individual branches; however, it does not directly learn synaptic locations via gradient descent as we do here.]

The gradient descent plasticity rule of the G-clusteron enables effective multiclass classification and allows the use of a variety of techniques developed for classic ANNs. For example, we used the ADAM momentum-based adaptive learning method (Kingma and Ba 2015) to dynamically optimize our learning rate. Bringing dendritic cluster-based plasticity algorithms within the fold of gradient descent learning opens up many possibilities for the extension of the G-clusteron algorithm using the extant mathematical frameworks and robust literature for learning using gradient descent in ANNs (Rumelhart, Hinton, and Williams 1986).

We have also shown that incorporating a plasticity rule for the weights in a G-clusteron alongside the dendritic location plasticity rule enables the G-clusteron to solve the classic linearly inseparable XOR problem (Minsky and Papert 1969). This reinforces the possibility that neuron models with dendritic nonlinearities may be more computationally powerful than the M&P nodes used in ANNs (Gidon et al. 2020; Poirazi, Brannon, and Mel 2003; Poirazi and Mel 1999, 2001).

### Relationship to Previous Work

Although our work here treats dendrites with NMDA receptors as nonlinear integration units, recent work (Moldwin and Segev 2020) has shown that a detailed biophysical model of a cortical layer 5 pyramidal cell with NMDA synapses and other active mechanisms can implement the perceptron learning algorithm, implying that a neuron can behave as a linear classifier. However, this does not preclude the possibility of nonlinear integration and plasticity rules; rather, it should be taken as an indication that a neuron can elect to apply a simple plasticity rule that ignores the synergistic interactions between synapses and still manage to solve classification tasks as a perceptron would.

In addition to the original clusteron (Mel 1991, 1992) and its multiple-layer extension (Poirazi, Brannon, and Mel 2003; Poirazi and Mel 2001), there have been a variety of other attempts to model learning with dendrites and dendritic nonlinearities (Hawkins and Ahmad 2016; Schiess, Urbanczik, and Senn 2016; Urbanczik and Senn 2014). These recent models tend to treat the dendritic branch (or sometimes an entire dendritic tree) as discrete loci of nonlinearity. By contrast, the G-clusteron conceptualizes dendritic nonlinearities as being fluid, varying depending on precise synaptic locations. Conceivably, both discrete, branch-dependent nonlinearities and fluid, location-dependent non-linearities could exist simultaneously, further expanding the computational capabilities of real neurons.

From a computational standpoint, we note that the covariance perceptron (Gilson et al. 2019) also takes a correlation-based approach to solving learning tasks.

### Biological Plausibility

The G-clusteron takes several liberties with respect to the biological phenomena from which it is inspired. While a multiplicative activation function may be a tolerable first-order approximation of the synergistic cooperation between NMDA synapses, a sigmoidal function provides a closer fit (Poirazi, Brannon, and Mel 2003). Even to the extent that a multiplicative model does accurately represent the synergistic relationship between excitatory NMDA synapses, such a relationship does not exist between excitatory synapses and inhibitory synapses or between inhibitory synapses and other inhibitory synapses. The multiplicative interaction with negative inputs, however, is essential to the learning protocol of the G-clusteron, so this aspect of our algorithm should be viewed as “biologically inspired” rather than an accurate depiction of what occurs in real biological cells. We do note, however, that clustering dynamics have been observed to occur between dendritic spines and inhibitory synapses (Chen et al. 2012).

The mechanism of learning synaptic locations in the G-clusteron, namely the attraction and repulsion of synapses based on the activity of their presynaptic inputs, also warrants discussion from a biological standpoint. As we have described, the computation of the “forces” exerted by each synapse results in a vector field along the dendrite with regions of attraction and repulsion. Such a vector field could be implemented by the release of attractive or repulsive chemical factors in proportion to the local dendritic activation, which diffuse in the extracellular space, creating a distance-dependent gradient that effectively sums the “pull” of the synapses that were active along the dendrite. The attractive factors could stimulate the growth of presynaptic axonal boutons or postsynaptic filopodia at specific regions of the dendrite, while the repulsive factors could eliminate existing synapses.

Brain-derived neurotrophic factor (BDNF) and its precursor, proBDNF, are strong candidates for the signaling mechanism underlying the sort of structural plasticity we suggest here. BDNF has been shown to be responsible for structural plasticity during development by stabilizing correlated synapses, while proBDNF weakens synapses that exhibit uncorrelated activity (Niculescu et al. 2018). Another possible signaling agent is estradiol, which is also involved in the formation of new spines (Mendez, Garcia-Segura, and Muller 2011) and interacts with BDNF in spine regulation pathways (Murphy, Cole, and Segal 1998). Astrocytes and microglia, which play an important role in spine elimination (Stein and Zito 2019), may also be implicated in our model as a mechanism for intelligently rearranging synapses on the dendrite in response to local activity.

The weight update rule (Eq. 11) of the G-clusteron can also be understood in a biological framework. Experimental evidence has shown that when LTP is induced in one synaptic spine, the threshold for LTP induction in nearby spines within ∼10 µm is reduced (Harvey and Svoboda 2007), a process mediated by the intracellular diffusion of Ras (Harvey et al. 2008). The G-clusteron’s weight update rule qualitatively expresses a similar phenomenon: if a synapse has a strong input such that its weight increases by a large amount, nearby synapses will require less input to achieve a large weight update.

### Future Directions

The G-clusteron can be extended in a variety of directions. Because the G-clusteron is inspired by biological dendrites, we use one-dimensional synaptic locations, but our model and algorithm can be easily modified to operate with a “high-dimensional dendrite” where synapses interact with each other as a function of distance in a space with arbitrarily high dimensions, instead of merely moving along a line. Low-dimensional models, however, may allow for a greater degree of interpretability of the correlational structure in the input data.

Our model can also be made more biologically plausible in a number of ways, such as by using a sigmoidal nonlinearity instead of a multiplicative nonlinearity (Poirazi, Brannon, and Mel 2003; Poirazi and Mel 2001), restricting the nonlinearity to excitatory synapses, or incorporating distance-dependent attenuation (Rall 1967). It would be valuable to see how incorporating these features into the G-clusteron model would affect both the weight and location gradient update rules.

As with any single neuron model, the G-clusteron should also be explored in the context of a multi-layer network. The G-clusteron’s gradient descent-based update rules lend themselves to the possibility of a backpropagation algorithm for a deep network of G-clusteron neurons. This can create exciting avenues for a version of deep learning that incorporates synaptic weight updates together with dendritic nonlinearities and structural plasticity.

## Acknowledgements

The authors would like to thank Eyal Gal and Itamar Landau for their contributions to the development of the G-Clusteron plasticity rules.