Interleaved training prevents catastrophic forgetting in spiking neural networks

Artificial neural networks suffer from the inability to learn new tasks sequentially without overwriting memory of previously learned tasks, a phenomenon known as catastrophic forgetting. Biological neural networks, in contrast, are able to continuously learn many tasks over the course of an organism's lifetime, and typically learn best when tasks are trained sequentially. Here we used a multi-layer spiking neural network with biologically plausible dynamics and learning rules to study the synaptic mechanisms behind catastrophic forgetting and to explore possible solutions to the problem. The network utilized reward-modulated and non-reward-modulated STDP and implemented multiple mechanisms for homeostatic regulation of synaptic efficacy. We found that the network could learn to perform two complementary, complex foraging tasks, but underwent catastrophic forgetting upon sequential training due to a near-complete overwriting of synaptic weights. Interleaved training mitigated catastrophic forgetting by partially preserving synaptic weights from the later task while recovering those from the earlier task. Moreover, interleaved training pushed the final synaptic weight configuration toward the decision boundary between the configurations associated with each task, striking an optimal balance between both task representations.

Author Summary

The human brain is capable of learning an immense number of complex tasks over the course of a lifetime without forgetting previous tasks. Although modern artificial neural networks used in machine learning are getting closer to matching human-level capabilities in terms of the complexity of the tasks to be learned, these networks typically lose all ability to perform a task after being trained sequentially on a second task, a phenomenon known as catastrophic forgetting.
To better understand this phenomenon and how the brain overcomes it, we developed a spiking neural network which directly models the functioning of biological neurons and is updated according to biologically plausible learning rules. This network was able to learn to perform complementary complex foraging tasks but suffered from catastrophic forgetting when the tasks were trained sequentially. Catastrophic forgetting was prevented if the new task was trained by interleaving it with trials from the original task. Significantly, we found that interleaved training accomplishes this by representing the two tasks as a single hybrid task. This indicates that rather than learning to represent each task independently and toggling between those representations, the network compromised and found an optimal hybrid configuration of synaptic weights that was sufficient to perform both tasks.


Humans are capable of continuously learning to perform novel tasks throughout life without interfering with their ability to perform previous tasks. Conversely, while modern artificial neural networks (ANNs) are capable of learning to perform complicated tasks, ANNs have difficulty learning multiple tasks sequentially [1-3]. Sequential training typically results in catastrophic forgetting, a phenomenon which occurs when training on the new task completely overwrites the synaptic weights learned during the previous task, leaving the ANN incapable of performing the previous task [1-4]. Attempts to solve catastrophic forgetting have drawn on insights from the study of neurobiological learning, leading to the growth of neuroscience-inspired artificial intelligence (AI) [5,6]. These include equipping synapses with more complicated dynamics to ensure stability [7], or employing an interleaved training paradigm hypothesized to occur during sleep [4,8,9]. However, although these approaches can mitigate catastrophic forgetting in certain circumstances, a general solution that achieves human-level performance for continual learning remains an open question [6].

Parallel to the growth of neuroscience-inspired ANNs, there has been increasing investigation of how spiking neural networks (SNNs) can be trained to perform complex tasks [10-13]. SNNs attempt to provide a realistic model of brain functioning by taking into account the underlying neural dynamics, such as spiking, and using biologically plausible local learning rules. This potentially makes SNNs well suited to study and reveal specific mechanisms of how the brain avoids catastrophic forgetting.
Indeed, there has already been some progress along these lines showing that interleaved training can also help mitigate catastrophic forgetting in SNNs, provided the network has time to settle back to its baseline state between stimulus presentations [10,11]. However, these studies in SNNs have not focused on analyzing synaptic dynamics, which will ultimately be critical to understanding how catastrophic forgetting is mitigated in biological systems.

Here we used a multi-layer SNN to investigate the effect of interleaved training on synaptic dynamics. The network was equipped with spike-timing-dependent plasticity (STDP) and biologically plausible homeostatic plasticity rules which have been studied experimentally in the brain. The network could be trained to learn one of two complementary complex foraging tasks involving pattern discrimination, and exhibited graceful degradation following neural pruning. We found that the network exhibited catastrophic forgetting when trained on the tasks sequentially, but interleaved training allowed the network to learn to perform well on both tasks. We further show that interleaved training does this by partially recovering the synaptic weight distributions learned during sequential training for each task and pushing the final synaptic weight configuration toward the decision boundary between the configurations associated with each task.

… synapses sent by the same presynaptic neuron. Non-rewarded STDP allowed neurons in layer H to learn to represent and detect the 4 types of particle patterns in various spatial locations, while rewarded STDP allowed the neurons in layer O to make motor decisions based on the particle patterns detected in the visual field [12].

Complementary complex foraging tasks can be robustly learned

Using the network model described above, we began by training it on one of two complementary complex foraging tasks. In either task, the network learned to discriminate between a rewarded and a neutral particle pattern in order to acquire as much of the rewarded pattern as possible. In the following we consider pattern discriminability as a measure of performance, with chance performance being 0.5.

The paradigm for Task 1 is shown in Figure 2A. Task 1 consisted of an unsupervised training period during which 4 types of 2-particle patterns (horizontal, vertical, positive diagonal, and negative diagonal) were present in the environment with equal densities. This was followed by a rewarded training period during which the synapses between layers I and H were frozen, and a testing period during which all synapses were frozen. During both the rewarded training and testing periods only 2 types of patterns were present: horizontal (rewarded) and negative diagonal (neutral). Mean performance during the testing period was 0.686 ± 0.028. Figure 2B shows example trajectories of the simulated agent at the beginning of (left) and after (right) rewarded training. Prior to rewarded training the agent moves more or less randomly through the environment, but afterwards it moves in a way that appears to seek out horizontal patterns and avoid negative diagonal ones.

The complementary paradigm for Task 2 is shown in Figure 2C. Note that Task 2 had the two complementary patterns present during rewarded training and testing: vertical (rewarded) and positive diagonal (neutral). Mean performance during the testing period of Task 2 was 0.710 ± 0.020, similar to that of Task 1. Figure 2D shows example trajectories of the simulated agent at the beginning of (left) and after (right) rewarded training for Task 2. After Task 2 training, the agent moves in a way that appears to seek out vertical patterns and avoid positive diagonal ones. These results demonstrate that the network is capable of learning and performing either one of the two complementary complex foraging tasks.

Graceful degradation of performance following neural pruning

Previous work [12] suggested that this network learned to perform these types of complex foraging tasks in the following way: (1) neurons in layer H learned to respond only when a specific pair of I-layer neurons (representing both particles of a given rewarded pattern) fired together, and (2) neurons in layer H learned to project strongly to a single neuron in layer O to mediate approach behavior. However, it remained unclear to what degree the network learns to use a distributed code. Thus, we tested whether this network model exhibited evidence of a distributed code by implementing a neural pruning protocol.

As shown in Figure 3A, the network was first put through a Task 1 training and testing paradigm identical to that shown in Figure 2A. Following this, we implemented a neural pruning protocol in which a randomly selected neuron in layer H was removed every 10 aeons until layer H was empty (Figure 3B). Performance remained stable until around 225,000 aeons (Figure 3A), or until nearly 700 of the 784 neurons in layer H were removed (Figure 3C). The fact that such a large portion of layer H can be removed before any significant drop in performance occurs strongly suggests that the network uses a distributed code to make decisions rather than relying on the activity of a few select H-layer neurons.
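The pruning protocol described above can be sketched as a loop over the hidden layer. The layer H size (784) comes from the text; the output layer size, the exponential weight initialization, and the use of a bare weight matrix as a stand-in for the full network are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

N_H, N_O = 784, 8                       # layer H size from the text; N_O assumed
W = rng.exponential(0.05, (N_H, N_O))   # stand-in H -> O weights (illustrative)

surviving = list(range(N_H))
pruning_order = []
while surviving:
    # each pruning step: silence one randomly chosen H neuron's outgoing synapses
    victim = surviving.pop(rng.integers(len(surviving)))
    W[victim, :] = 0.0
    pruning_order.append(victim)
```

In the actual protocol one pruning step occurs every 10 aeons of simulated time, and performance is re-evaluated as the layer shrinks; the loop here only shows the bookkeeping of the removal order.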

Sequential training of both tasks leads to catastrophic forgetting

We next tested whether our network model could exhibit catastrophic forgetting by training sequentially on Task 1 followed by Task 2 (Figure 4A). As shown in Figure 4B, following Task 1 training, performance on Task 1 was 0.708 ± 0.035, while, as expected, performance on Task 2 was no better than chance. Conversely, following Task 2 training, performance on Task 1 was now no better than chance, while performance on Task 2 improved to 0.697 ± 0.031. Thus, sequential training on a complementary task caused the network to undergo catastrophic forgetting of the task trained earlier.

To better understand the source of catastrophic forgetting in our network, we next analyzed the synaptic weights between layers H and O. The synaptic weights were found to be exponentially distributed, and no obvious differences could be seen in the overall distributional structure of these synaptic weights when compared after training on each task (Figure 4C).

However, important differences were observed when task-relevant synapses (i.e. synapses in the top 10% following training on that task) were analyzed (Figure 4D). The structure in the distribution of Task 1-relevant synapses following Task 1 training (top-left) was destroyed following Task 2 training (top-right). Similarly, the structure in the distribution of Task 2-relevant synapses following Task 2 training (bottom-right) was not present following Task 1 training (bottom-left).

This suggests that, in our network, catastrophic forgetting results from a nearly complete overwriting of the synaptic weight matrix between layers H and O during training on each new task.

Interleaved training facilitates sequential learning without catastrophic forgetting

After confirming that our network exhibited catastrophic forgetting when trained sequentially, we added an interleaved training phase to our simulation (Figure 5A) to test whether it was capable of rescuing performance on Task 1 without overwriting Task 2. Figure 5B shows that, following interleaved training, the network achieved a performance of 0.666 ± 0.048 on Task 1 and a performance of 0.679 ± 0.024 on Task 2. Moreover, from the very onset of interleaved training the performance was significantly above chance (Figure 5A). Therefore, interleaved training allowed the network to relearn Task 1 without forgetting what it had just learned during training on Task 2. We next analyzed the synaptic weight distributions learned after each of the training phases. As before, the distributional structure of the synaptic weights was not noticeably different following training on Task 1 or Task 2 (Figure 4C).

However, following interleaved training, the distribution became bimodal, with a new peak forming at intermediate values of synaptic weights centered around 0.1 (Figure 5C).

We hypothesized that interleaved training relearned Task 1-relevant synapses while simultaneously preserving Task 2-relevant weights. Figure 6A shows that this hypothesis is partially correct. The left and middle columns show results analogous to those of Figure 4D, while the right column considers task-relevant synapses following interleaved training. In other words, here we identified task-relevant synapses after training on a task alone (Figure 6A, top/left for Task 1 and Figure 6A, middle/bottom for Task 2) and then traced the same set of synapses after training on the other task or after interleaved training (plotting the same set of synapses throughout different points in the simulation). It can be seen that for both tasks, the structure of …

To better understand the effect of interleaved training on the synaptic weights, we trained a support vector machine (SVM) with a radial basis function kernel to classify the synaptic weights between layers H and O according to whether they serve to perform Task 1 or Task 2.

To train the SVM, for each task (Task 1 and Task 2), we selected multiple "snapshots" of the synaptic weight matrices obtained from the last fifth of the Task 1 and Task 2 training phases (i.e. after performance appeared to reach an asymptote). We then used this SVM to classify the synaptic weight matrices which evolved during the second half of interleaved training. Figure 7A shows the average classification value across trials for synaptic weights associated with Task 1, Task 2, and interleaved training. While the SVM robustly classified the synaptic weight matrices from Task 1 and Task 2, the weight states from interleaved training were significantly closer to the decision boundary (typically on the Task 2 side). This indicates that the synaptic weight matrices from interleaved training are an amalgam of Task 1 and Task 2 states.
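The classification analysis can be sketched with scikit-learn. The snapshot data below are synthetic stand-ins (random prototype weight configurations plus small fluctuations), and the dimensionality is reduced from 784 for brevity; only the mechanics of the analysis are shown:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
D = 64                                    # stand-in for the 784-dim weight space

w1, w2 = rng.random(D), rng.random(D)     # prototype Task 1 / Task 2 weight states

# "snapshots" from late training: each prototype plus small fluctuations
task1 = w1 + 0.02 * rng.standard_normal((20, D))
task2 = w2 + 0.02 * rng.standard_normal((20, D))
interleaved = 0.5 * (w1 + w2) + 0.02 * rng.standard_normal((20, D))

X = np.vstack([task1, task2])
y = np.array([0] * 20 + [1] * 20)
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# |decision_function| measures distance from the RBF decision boundary
d_task = np.abs(svm.decision_function(X)).mean()
d_mix = np.abs(svm.decision_function(interleaved)).mean()
```

With this construction the interleaved snapshots, being an amalgam of the two prototypes, land much closer to the decision boundary than either set of task snapshots, mirroring the effect reported in Figure 7A.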

Since this classification occurred in a 784-dimensional synaptic weight space, we used principal component analysis (PCA) to reduce the dimensionality of the data in order to visualize the trajectory of the synaptic weights at the network level over the course of a simulation (Figure 7B). Here the beginning of the grey trajectory represents the initial weight distribution, which evolved to the Task 1 (Task 2) specific distribution indicated by red (blue) dots, and eventually to the interleaved-training-specific distribution (green dots). It can be seen that while synaptic weight matrices associated with Task 1 and Task 2 cluster in distinct regions of PC space, interleaved training pushes the synaptic weights to an intermediate location between Task 1 and Task 2. Our goal of visualization limited us to using only 3 PCs, which combined explained only 40% of the total variance. Therefore, we repeated our SVM classification using the data projected into PC space as input. Figure 7D demonstrates that this gave the same qualitative result as using the high-dimensional data, suggesting that PC space captures the major characteristics of weight space with which we are concerned.

In this study we report that a multi-layer SNN may exhibit catastrophic forgetting upon sequential training of complementary tasks, and that interleaved training is capable of mitigating catastrophic forgetting, as previously shown in ANNs [4,8,9] and SNNs [10,11]. Interestingly, it has been shown that humans actually learn to perform multiple tasks better under a sequential, batch training paradigm than they do under an interleaved paradigm [9]. This was because sequential, batch training allowed human brains to develop a factorized representation which optimally segregated both tasks, thus reducing interference between the memories for each task.
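The PCA projection of weight snapshots described earlier can be sketched the same way; the snapshot data below are again synthetic stand-ins used only to show the mechanics of the analysis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
D = 64                                    # stand-in for the 784-dim weight space
w1, w2 = rng.random(D), rng.random(D)     # prototype Task 1 / Task 2 weight states

snapshots = np.vstack([
    w1 + 0.02 * rng.standard_normal((20, D)),                # Task 1 states
    w2 + 0.02 * rng.standard_normal((20, D)),                # Task 2 states
    0.5 * (w1 + w2) + 0.02 * rng.standard_normal((20, D)),   # interleaved states
])

pca = PCA(n_components=3)                 # 3 PCs, as in the visualization
proj = pca.fit_transform(snapshots)
explained = pca.explained_variance_ratio_.sum()

# centroid of each group in PC space
c1, c2, c_mix = proj[:20].mean(0), proj[20:40].mean(0), proj[40:].mean(0)
```

Because PCA is a linear projection, the centroid of the interleaved states lands roughly midway between the two task clusters in PC space, reproducing the qualitative geometry of Figure 7B.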

However, when trained under an interleaved paradigm, human brains failed to separate the two tasks, and instead formed a single linear boundary for a hybrid task [9]. Therefore, it appears that the SNN presented in this study was able to capture some aspects of how humans learn under an interleaved paradigm.

Environment. Foraging behavior took place in a virtual environment consisting of a 50x50 grid with randomly distributed "food" particles. Each particle was two pixels in length and could be classified into one of four types depending on its orientation: vertical, horizontal, positively-sloped diagonal, or negatively-sloped diagonal. During the initial unsupervised training period, the particles were distributed at random with the constraints that each of the four types were equally represented and no two particles could be directly adjacent. During training and testing periods only the task-relevant particles were distributed. When a particle was acquired as a result of the virtual agent moving, it was removed from its current location and randomly assigned to a new location on the grid, again with the constraint that it not be directly adjacent to another particle.

This ensured a continuously changing environment with a constant particle density. The density of particles in the environment was set to 10%. The virtual agent could see a 7x7 grid of squares (the "visual field") centered on its current location, and it could move to any adjacent square, including diagonally, for a total of eight directions.

… (Fig 1). This structure included a basic feed-forward inhibitory circuit [33] found in many biological structures [33-38].
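The particle-placement rules described above (four orientations, no two particles directly adjacent, roughly 10% pixel density) can be sketched with rejection sampling. The offset encoding of orientations and the particle count are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
SIZE = 50
# orientation -> offset of a particle's second pixel relative to its first
OFFSETS = {"horizontal": (0, 1), "vertical": (1, 0),
           "pos_diag": (-1, 1), "neg_diag": (1, 1)}

grid = np.zeros((SIZE, SIZE), dtype=int)   # 0 = empty, 1 = particle pixel

def neighborhood_free(cells):
    """True if no already-placed pixel touches any of `cells` (8-neighborhood)."""
    for r, c in cells:
        if grid[max(r - 1, 0):min(r + 2, SIZE), max(c - 1, 0):min(c + 2, SIZE)].any():
            return False
    return True

def place_particle(kind):
    """Rejection-sample a location for a two-pixel particle of `kind`."""
    dr, dc = OFFSETS[kind]
    while True:
        r, c = rng.integers(SIZE), rng.integers(SIZE)
        cells = [(r, c), (r + dr, c + dc)]
        if all(0 <= rr < SIZE and 0 <= cc < SIZE for rr, cc in cells) \
                and neighborhood_free(cells):
            for rr, cc in cells:
                grid[rr, cc] = 1
            return

# ~10% pixel density: 124 two-pixel particles = 248 of 2500 cells,
# split equally among the four orientations
for kind in OFFSETS:
    for _ in range(31):
        place_particle(kind)
```

Because every placement checks the full 8-neighborhood of both pixels, each placed pixel ends up adjacent only to its own partner, which is the adjacency invariant the environment relies on.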

Each neuron in layer H received synaptic input from nine random neurons in layer I.

These connections initially had random strengths drawn from a normal distribution. Each neuron in layer H connected to every neuron in layer O with both an excitatory (W_ij) and an inhibitory (WI_ij) synapse. This provided an all-to-all connectivity pattern between these two layers and … grid location with a "food" particle present, the particle was removed and assigned to a randomly selected new location.

There was a 1% chance on every move that the virtual agent would ignore the activity in layer O and instead move in a random direction. Moreover, for every movement cycle that passed without the virtual agent acquiring a particle, this probability was increased by 1%. The random variability prevented the virtual agent from getting stuck in movement patterns corresponding to infinite loops. Synaptic noise was not sufficient to break out of all infinite loops, as some loops were the result of strong connections that would facilitate the same spiking pattern despite the noise. Other times, the probability of escape from a loop due to noise was simply so low that it would take an impractical amount of time to break the loop. While biological systems could utilize various different mechanisms to achieve the same goal, the method we implemented was efficient and effective for the scope of our study.

Each neuron was simulated using a map-based model of the form

V_{n+1} = f_α(V_n, I_n + β_n),
I_{n+1} = I_n − μ(V_n + 1) + μσ + μσ_n,

where V_n is the membrane potential, I_n is a slow dynamical variable describing the effects of slow conductances, and n is a discrete time-step (0.5 ms). Slow temporal evolution of I_n was achieved by using small values of the parameter μ << 1. Input variables β_n and σ_n were used to incorporate external current I_ext (e.g. background synaptic input): β_n = β_e I_ext, σ_n = σ_e I_ext. Parameter values were set to σ = 0.06, β_e = 0.133, σ_e = 1, and μ = 0.0005.
The nonlinearity f_α(V_n, I_n) was defined in the form of the piece-wise continuous function:

f_α(V_n, I) = α/(1 − V_n) + I, if V_n ≤ 0;
f_α(V_n, I) = α + I, if 0 < V_n < α + I and V_{n−1} ≤ 0;
f_α(V_n, I) = −1, otherwise.

This model is very computationally efficient and, despite its intrinsic low dimensionality, reproduces a rich repertoire of the spiking dynamics observed in biological neurons.
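The map-based neuron can be implemented in a few lines. The parameter values below are those given in the text, except α, whose value is not stated; α = 3.65 is an assumed typical value for this class of map, and the constant external drive is likewise illustrative:

```python
# Map-based (Rulkov-type) neuron model: discrete-time update, 0.5 ms per step
alpha, sigma, mu = 3.65, 0.06, 0.0005   # alpha is an assumed value
beta_e, sigma_e = 0.133, 1.0

def f(V, V_prev, I):
    """Piece-wise nonlinearity f_alpha(V, I); V_prev implements the spike reset."""
    if V <= 0.0:
        return alpha / (1.0 - V) + I
    if V < alpha + I and V_prev <= 0.0:
        return alpha + I
    return -1.0                          # post-spike reset

def step(V, V_prev, I, I_ext=0.0):
    """One update of the fast variable V and the slow variable I."""
    beta_n = beta_e * I_ext
    sigma_n = sigma_e * I_ext
    V_next = f(V, V_prev, I + beta_n)
    I_next = I - mu * (V + 1.0) + mu * sigma + mu * sigma_n
    return V_next, V, I_next

# drive the neuron with a constant external current and record its trace
V, V_prev, I = -1.0, -1.0, 0.0
trace = []
for _ in range(2000):
    V, V_prev, I = step(V, V_prev, I, I_ext=0.5)
    trace.append(V)
```

With this drive the fast variable repeatedly climbs through the first two branches of f and resets to −1, producing the spiking waveform the model is known for, while I evolves on a much slower timescale set by μ.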

Synaptic currents were simulated according to:

g^syn_{n+1} = γ g^syn_n + g_syn W_j (1 + XR), if spike_pre;
g^syn_{n+1} = γ g^syn_n, otherwise;
I^syn_n = − g^syn_n (V^post_n − V_rp).

Here g_syn is the strength of the synaptic coupling, modulated by the target rate W_j of the receiving neuron j. Indices pre and post stand for the pre- and post-synaptic variables, respectively. The first condition, spike_pre, is satisfied when pre-synaptic spikes are generated. Parameter γ controls the relaxation rate of the synaptic conductance after a presynaptic spike is received (0 ≤ γ < 1). The parameter R is the coefficient of variability in synaptic release; its standard value is 0.12. X is a random variable sampled from a uniform distribution with range [-1, 1]. Parameter V_rp defines the reversal potential and, therefore, the type of synapse (i.e. excitatory or inhibitory).

The term (1 + XR) introduces variability in synaptic release such that the effect of any synaptic interaction has an amplitude drawn from a uniform distribution with range [1 − R, 1 + R] multiplied by the average value of the synapse.
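A minimal sketch of this synaptic update follows; the periodic presynaptic spike train, the resting postsynaptic potential, and the parameter values other than R are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

g_syn, gamma, R = 0.1, 0.6, 0.12   # coupling, decay rate, release variability (R from text)
V_rp = 0.0                          # reversal potential (excitatory, illustrative)
W_j = 1.0                           # target-rate modulation of the receiving neuron

g = 0.0
currents = []
for n in range(200):
    spike_pre = (n % 50 == 0)       # assumed presynaptic spike every 50 steps
    if spike_pre:
        X = rng.uniform(-1.0, 1.0)  # (1 + X*R) scales each release event
        g = gamma * g + g_syn * W_j * (1.0 + X * R)
    else:
        g = gamma * g               # exponential relaxation between spikes
    V_post = -1.0                   # assume a resting postsynaptic neuron
    currents.append(-g * (V_post - V_rp))
```

Each release event lands within [1 − R, 1 + R] of the mean synaptic amplitude, and the conductance relaxes geometrically (rate γ) between presynaptic spikes.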

The value of an STDP event (trace) was calculated using the following equation [17,18]:

tr_k = K exp(−|t_p − t_r| / T_c),

where t_r and t_p are the times at which the pre- and post-synaptic spike events occurred, respectively, T_c is the time constant, set to 40 ms, and K is the maximum value of the trace tr_k, set to −0.04 for a post-before-pre event and 0.04 for a pre-before-post event.

A trace was immediately applied to synapses between neurons in layers I and H. However, for synapses between neurons in layers H and O, each trace was stored for 6 epochs after its creation before being erased. During storage, a trace had an effect whenever there was a rewarding or punishing event. In such a case, the synaptic weights were updated as follows:

W(t + 1) = W(t) + S_rp · tr_k,

where t is the current timestep, S_rp is a scaling factor for reward/punishment, and tr_k is the magnitude of the stored trace. In addition, homeostatic rules kept each neuron's total output weight sum close to the initial target value. The effect of these rules was that neurons with lower total output strength could increase their output strength more easily.
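The trace computation and the reward-gated update can be sketched as follows; the trace-buffer bookkeeping and the single-weight example are illustrative simplifications of the full network update:

```python
import math

T_C = 40.0     # trace time constant (ms), from the text
K = 0.04       # maximum trace magnitude, from the text

def stdp_trace(t_pre, t_post):
    """Signed STDP trace for one pre/post spike pair (times in ms)."""
    sign = 1.0 if t_pre <= t_post else -1.0   # pre-before-post potentiates
    return sign * K * math.exp(-abs(t_post - t_pre) / T_C)

# H -> O traces are buffered and only applied on a reward/punishment event
buffer = []                                    # [age_in_epochs, trace] pairs
buffer.append([0, stdp_trace(10.0, 15.0)])     # pre-before-post pair
buffer.append([0, stdp_trace(30.0, 22.0)])     # post-before-pre pair

def apply_reward(w, S_rp):
    """w(t+1) = w(t) + S_rp * tr_k for every stored trace."""
    for _, tr in buffer:
        w += S_rp * tr
    return w

def age_buffer():
    """Traces are kept for 6 epochs after creation, then erased."""
    for item in buffer:
        item[0] += 1
    buffer[:] = [item for item in buffer if item[0] < 6]

w = 0.5
w = apply_reward(w, S_rp=1.0)   # a reward event strengthens causal pairings
```

For I -> H synapses the trace would be applied immediately instead of buffered; only the H -> O pathway uses the delayed, reward-gated rule.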

The network was rewarded when the virtual agent moved to a location which contained a particle from a "food" pattern (horizontal in Task 1, vertical in Task 2), with S_rp = 1, but not when it moved to a location with a particle from a neutral pattern (negative/positive diagonal in Task 1/2). A small punishment of S_rp = -0.01 was applied if the agent moved to a location without a particle present, to help the virtual agent learn to acquire "food" as rapidly as possible.