Abstract
Computational neuroscience relies on gradient descent (GD) for training artificial neural network (ANN) models of the brain. The advantage of GD is that it is effective at learning difficult tasks. However, it produces ANNs that are a poor phenomenological fit to biology, making them less relevant as models of the brain. Specifically, it violates Dale’s law, by allowing synapses to change from excitatory to inhibitory, and leads to synaptic weights that are not log-normally distributed, contradicting experimental data. Here, starting from first principles of optimisation theory, we present an alternative learning algorithm, exponentiated gradient (EG), that respects Dale’s law and produces log-normal weights, without losing the power of learning with gradients. We also show that in biologically relevant settings EG outperforms GD, including learning from sparsely relevant signals and dealing with synaptic pruning. Altogether, our results show that EG is a superior learning algorithm for modelling the brain with ANNs.
Introduction
Computational neuroscience has recently become more concerned with building models that can learn ethological behaviours from realistic, high-dimensional inputs (Summerfield, 2022). This is important for the role of computational models as the interface between theory and empirical data (Levenstein et al., 2023), since models that can learn more realistic tasks can be directly compared to experimental data (Doerig et al., 2023). For this purpose, gradient descent (GD) is increasingly used in computational neuroscience studies (Costa et al., 2017; Driscoll et al., 2024; Flesch et al., 2022; Langdon and Engel, 2022; Malkin et al., 2024; Tang et al., 2023; Whittington et al., 2020; Yamins and DiCarlo, 2016; Yang et al., 2019). GD is leveraged in these studies because it is very effective for training artificial neural networks (ANNs), even though it is not considered biologically plausible (Lillicrap et al., 2020). Moreover, it is clear that GD is phenomenologically different from the learning algorithm(s) of the brain. First, GD prescribes synaptic changes that violate Dale’s law, i.e. changes from synaptic inhibition to excitation (Eccles, 1976). Second, GD does not produce synaptic weights that are log-normal (Pogodin et al., 2024), which have been observed across species and brain regions (Buzsáki and Mizuseki, 2014; Dorkenwald et al., 2022; Rößler et al., 2023; Song et al., 2005). Even if we set aside concerns about the biological plausibility of GD, these phenomenological differences motivate the search for an alternative more brain-like learning algorithm for training models of neural circuits. However, for an alternative algorithm to be better than GD for computational neuroscience, it should be no less powerful at learning than GD, and ideally better at learning in biologically relevant situations.
GD is effective at learning difficult tasks because it is fundamentally normative (Bredenberg and Savin, 2023; Levenstein et al., 2023). Specifically, GD provides the greatest improvement in a measure of behavioural performance with the smallest possible change to synapses (if we restrict ourselves to linear estimates of performance). But, what does it mean to measure synaptic weight change? Here we re-analyse this core theoretical assumption and show that GD does not align with biology, in part, because it is based on a measure of synaptic change that is agnostic to changes between inhibition and excitation. Using this insight we explore a related, but different learning algorithm, exponentiated gradient (EG) (Kivinen and Warmuth, 1997). Like GD, EG is also derived from normative first principles, and leverages gradient signals for learning, but it is based on a measure of synaptic change that adheres to Dale’s law. Crucially, it works as well as GD for deep learning (Bernstein et al., 2020; Schwarz et al., 2021; Sun et al., 2022), which rules out the common concern that brain-like algorithms do not scale to hard tasks (Bartunov et al., 2018).
We show that EG has a number of desirable properties for modelling the brain. First, by construction, EG maintains Dale’s law. Second, EG leads to networks with heavy-tailed, log-normal distributions of synaptic weights. Third, synaptic changes become proportional to existing synaptic weights, as has been observed experimentally (Loewenstein et al., 2011; Melander et al., 2021). Fourth, in line with EG being a better algorithm for modelling real neural learning, we find that neural networks trained with EG perform better than their GD trained counterparts in biologically relevant learning situations. For example, we find EG works better with synaptic pruning (Faust et al., 2021; Maret et al., 2011) and in situations where there are irrelevant inputs, as is common in the brain. Specifically, with EG, multi-compartmental neuron models better learn to identify task-relevant synapses, and likewise recurrent neural networks (RNNs) are better at sensorimotor tasks in the presence of many irrelevant signals. Altogether, these results show that EG is at least as powerful as GD for learning in biologically relevant contexts. As well, it provides a better phenomenological match to biological learning. This indicates that EG is a better learning algorithm than GD for modelling the brain.
Results
Deriving learning algorithms from optimisation theory
We begin with a brief discussion of how GD and EG can be derived from first principles, as this can explain why they are both effective learning algorithms, and illustrate why EG naturally respects Dale’s law.
Both GD and EG can be derived by framing the problem of learning as follows: how can we obtain the greatest improvement in performance with the least amount of synaptic change? This question can be addressed with the mirror descent framework from machine learning (Bubeck, 2015; Nemirovskij and Yudin, 1983). According to the mirror descent framework, changes to the synaptic weights in a neural network should minimise a combination of task error (quantified by a loss function) and a “synaptic change penalty”. This is formalised as:
w_{t+1} = argmin_w [ ⟨∇l(w_t), w − w_t⟩ + (1/η) D(w, w_t) ]    (1)

where wt are the synaptic weights at time t, ⟨∇l(wt), w − wt⟩ is a linear approximation of the loss function l(w) around wt, D(w, wt) is the synaptic change penalty from wt to a candidate w, and η defines the strength of the synaptic change penalty, thereby controlling the trade-off between reducing the loss and keeping weight changes small. Note that η will later play the role of the learning rate (Eqs. 3, 5).
The form that D takes will lead to different learning algorithms. Notably, GD follows from measuring synaptic change with the squared 2-norm, i.e. the squared Euclidean distance between the two synaptic weight vectors (Fig. 1A, left):
Figure 1: a| Left: GD weight geometry is Euclidean and treats E/I weights equally. Right: EG weight geometry is non-Euclidean and prohibits switching between E and I weights. b| Schematic of the Mod-Cog suite of tasks solved by a recurrent neural network (RNN); adapted from (Khona et al., 2023). c| Accuracy of EG and GD trained networks during learning (n=5). Response denotes the time period when the fixation input drops to 0. d| Left: taskwise testset performance of networks at different checkpoints during training. Checkpoints were taken at updates #50, #100, and at every 10% performance threshold from 25%-95%, and at the end of training. Right: cosine similarity (see methods) of taskwise accuracy between EG and GD trained network checkpoints (black: mean, grey: pairs of experiments). e| Histogram of final recurrent weights for GD (top) and EG (bottom). Dashed line shows a normal pdf for comparison. f| Same as e, but for CDFs rather than densities. g| Number of synaptic weight sign flips per update during training (dark: mean, light: std. deviation, n=5). h| R^2 of a linear fit to update size from current weight (n=5).
D(w, w_t) = ½ ||w − w_t||²_2 = ½ Σ_i (w_i − w_{t,i})²    (2)

where i is an index over individual synaptic weights, and w is the candidate for new synaptic weights. If we solve equation (1), given equation (2), we obtain GD:

w_{t+1} = w_t − η ∇l(w_t)    (3)
where ∇l indicates the gradient of the loss function with respect to the synaptic weights, and η indicates the learning rate (see appendix A for details of this derivation).
Note, however, that equation (2) makes no distinction between changes to synapses that obey Dale’s law and those that don’t (i.e. it doesn’t distinguish between changes that preserve the sign of synapses and those that don’t). An alternative choice for D that does respect this fundamental constraint is the unnormalised relative entropy (Fig. 1A, right), given by:
D(w, w_t) = Σ_i [ w_i log(w_i / w_{t,i}) − w_i + w_{t,i} ]    (4)

where we have assumed positive weights for now. If we consider a scenario where a candidate weight, w_j, has a different sign than the current weight, w_{t,j}, then equation (4) is effectively an infinite penalty since the log is undefined. Put another way, this version of D simply does not allow for synapses to change from excitatory to inhibitory. Therefore, when solving equation (1), given equation (4), the resulting weights do not change signs. If we incorporate negative weights into the architecture, we get EG (Kivinen and Warmuth, 1997):

w_{t+1} = w_t ⊙ exp(− η sign(w_t) ⊙ ∇l(w_t))    (5)
where ⊙ denotes element-wise multiplication. Unlike GD, EG does not treat all synaptic weight changes as equal – it instead favours changing already large synapses and completely prevents synapses from changing from inhibitory to excitatory, or vice-versa (because the exponential is always positive, the update can only scale the magnitude of the weight). Not only do these properties respect Dale’s law, they also lead to log-normal synaptic weights, as we show below.
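As a concrete illustration, the sketch below (our own, not the paper’s released code) applies one GD step and one EG step to the same weight vector and gradient. It shows that the EG update rescales each weight multiplicatively, so signs are always preserved, whereas a GD step with the same gradient can flip them:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=5) * rng.choice([-1.0, 1.0], size=5)
grad = rng.normal(size=5)   # stand-in for the gradient of the loss, ∇l(w_t)
eta = 0.5                   # exaggerated learning rate to make sign flips visible

w_gd = w - eta * grad                          # gradient descent (Eq. 3)
w_eg = w * np.exp(-eta * np.sign(w) * grad)    # exponentiated gradient (Eq. 5)

print("GD preserves all signs:", np.array_equal(np.sign(w_gd), np.sign(w)))  # often False
print("EG preserves all signs:", np.array_equal(np.sign(w_eg), np.sign(w)))  # always True
```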
EG is a better phenomenological fit to biology than GD
To investigate the impact of using either GD or EG for learning, we performed experiments with RNNs learning a suite of cognitive neuroscience style tasks named Mod-Cog (Khona et al., 2023; Molano-Mazon et al., 2022; Yang et al., 2019). We chose these tasks because they are a representative sample of the tasks often used in behavioural neuroscience experiments and they have provided fruitful insights into the nature of learning in RNNs in previous studies (Driscoll et al., 2024; Yang et al., 2019).
In the Mod-Cog tasks, an RNN receives four inputs: a fixation cue, a task cue, and two input variables each encoding a circular position. Depending on the task given (as indicated by the task cue) and the time-varying input representations, the network must give a specific output. For example, in one task, the network must simply remember the value of one of the input variables and output that value during the response period, which begins when the fixation cue drops to 0 (Fig. 1B, S1). In other, more complicated tasks, the network must estimate the delays between inputs and calculate output sequences accordingly. Overall, there were 82 such tasks, each with a different logic for producing the correct output. Networks were trained on all tasks simultaneously and tested on a subset of held out data (see Methods for more details).
Networks trained with either GD or EG learned to perform all tasks with high accuracy. Learning with both algorithms converged relatively rapidly, reaching almost peak performance after 1000 weight updates (Fig. 1C, left), with a final accuracy of 98.4% for both GD and EG (Fig. 1C, right). This indicates that both algorithms are effective for training RNNs on these simple cognitive tasks.
We next asked whether the algorithms found different solutions, despite being equally good at learning. To start, we analysed the distribution and spectral properties of the network activity when solving the tasks, but found little difference between networks trained by EG or GD (Fig. S2). Turning next to learning trajectories, we explored whether the two algorithms learned the tasks in a different order, which would indicate different solutions if tasks were easier or harder for one algorithm than the other. To examine this we took checkpoints of the models throughout training and measured the test accuracy on each of the tasks (Fig. 1D, left). When we compared the two algorithms’ profiles across the tasks at each of the checkpoints (measured as the cosine similarity between the two vectors of per task accuracies), we found that the two algorithms largely learned the tasks in the same order, indicating that the difficulty of each task was roughly equivalent for the two algorithms (Fig. 1D, right).
The final synaptic weight distributions discovered by the two algorithms, however, were different. We initialised the networks from random log-normal weight distributions (see Methods), because experimental data shows that brains typically have log-normal distributions in their synaptic strengths (as measured by electrophysiology or spine size) (Buzsáki and Mizuseki, 2014; Dorkenwald et al., 2022; Rößler et al., 2023; Song et al., 2005). When we examined the distribution of synaptic weights after training, we found that the weights of networks trained with GD were no longer log-normal (Fig. 1E, top). In contrast, weights in networks trained with EG were still well fit by a log-normal distribution after training (Fig. 1E, bottom). As a result, the Kolmogorov-Smirnov statistic (see Methods) for the GD distribution was 10 times larger than that for the EG distribution, and the two distributions were different from each other (Fig. 1F, Fig. S3B). Furthermore, starting from initial distributions that were normal, uniform or log-normal, we consistently found that the changes prescribed by EG were approximately log-normal, whereas GD changes were approximately normal (Fig. S3). Thus, GD and EG discover different synaptic weight solutions, despite exhibiting similar levels of accuracy across the tasks.
We then explored the implications with respect to Dale’s law, which implies that synapses cannot change from being excitatory to being inhibitory, and vice-versa (Eccles, 1976). Notably, GD does not provide any prohibition against changing the sign of a synaptic weight between positive and negative, whereas EG prevents such sign changes explicitly. Thus, if a network is initialised to obey Dale’s law, GD could alter the identities of excitatory and inhibitory neurons, whereas EG cannot. To confirm this, we examined the number of synaptic sign flips that occurred in the networks trained with GD and EG, and as expected, GD exhibited many synaptic sign flips throughout learning whereas EG exhibited none (Fig. 1G). Hence, EG relies on weight space updates that respect Dale’s law to find a solution, whereas GD does not.
Finally, we were interested in exploring how well the learning dynamics of the two algorithms matched previous neuroscientific observations. One notable, and replicated, observation is that the magnitude of weight updates, i.e. the amount of synaptic change that occurs, is proportional to the current synaptic weight magnitude (Loewenstein et al., 2011; Melander et al., 2021). Considering the multiplicative form of the EG update rule (Eq. 5), we expect EG to produce proportional weight updates and GD not to. This expectation, however, assumes that learning signals are independent of the synaptic weight magnitudes, which is not necessarily true in RNNs. To test this empirically, we fit a linear regression model from the absolute value of the synaptic weights to the magnitude of synaptic plasticity for each weight update. Here we found that for GD this model did not explain any of the update variance, whereas for EG a reasonable proportion of the variance was explained (Fig. 1H). As such, we conclude that EG synaptic changes are proportional, which matches known biology, whereas GD synaptic changes are not.
Figure 2: a| The right tail of the final weight distributions for EG (orange) contains many more weights than for GD (same data as Fig. 1; k: thousand). b| Response accuracy for networks (from Fig. 1) as smaller weights are pruned from the network. c| Cosine similarity between pruned and unpruned networks as smaller weights are pruned from the network. d| Top: accuracy of 92.5% pruned network re-learning Mod-Cog tasks. Bottom: cosine similarity between the retrained network and the original pruned network (from b). e| As d, but for 95% pruning level. f| As d, but for 97.5% pruning level. (n=5, EG orange, GD blue)
In sum, these results demonstrate that both GD and EG are effective learning algorithms for training RNNs on cognitive tasks, but EG produces solutions that are a better fit to existing neuroscience data: they match the distribution of synaptic weights seen in real brains, they respect the constraints of Dale’s law, and they exhibit proportionality of weight updates. Altogether, this demonstrates that training with EG provides a better fit to biology than training with GD.
Learning with EG is more robust to synaptic pruning
Having observed that the EG algorithm provides a better phenomenological fit to existing biological data than GD, we next tested whether EG is also a better learning algorithm for biologically-relevant situations. A notable difference between the two algorithms is the skewed, heavier-tailed synaptic weight distributions found by EG (Fig. 1E, 2A). This observation motivated us to investigate how robust the EG and GD trained networks are to synaptic pruning, an important cellular process that occurs during sleep and development (Faust et al., 2021; Maret et al., 2011). We hypothesised that the large weights found by EG form a backbone that dominates computational dynamics (Teramae and Fukai, 2014), and thus, that pruning the smaller weights would have relatively little effect on performance compared to GD trained networks.
To test network robustness to pruning, we took the networks trained on Mod-Cog and examined their accuracy after removing an increasing proportion of the smallest weights. We found that the EG trained networks maintained accuracy in the face of synaptic pruning much better than the GD trained networks, across all levels of pruning (Fig. 2B). To understand why, we examined the cosine similarity between the original and the pruned solutions, as this measure has been previously shown to correlate with robustness to pruning in artificial neural networks (Mason-Williams and Dahlqvist, 2024), and, as a measure of weight similarity, it also gives an insight into how much a network has to be re-trained to achieve pre-pruning accuracy. Indeed, we found that the EG trained networks had a higher pruned-unpruned cosine similarity compared to the GD trained networks, in line with their better post-pruning accuracy (Fig. 2C). Thus, learning with EG produces networks that are more robust to synaptic pruning.
Of course, biological synaptic pruning occurs over the course of development and animals continue learning with sparsified networks. Therefore we next asked whether there were differences between EG and GD when relearning the tasks after pruning. To test this, we took the RNNs trained on the cognitive tasks that had 92.5%, 95% and 97.5% of their synapses pruned and trained them for 1,000 more updates (see Methods for details). We observed that the EG networks were better at relearning the tasks, achieving higher accuracy more rapidly (Fig. 2D-F, top). As well, the EG network weights maintained a higher cosine similarity with the initial pruned weights during relearning (Fig. 2D-F, bottom), indicating that the EG networks had to change their solutions less to reach the new optimal weight-space solutions to the tasks after pruning. Altogether, these results demonstrate that EG is better for maintaining performance and relearning after synaptic pruning. Moreover, they show that in addition to providing a better phenomenological match to the brain, there are biologically relevant learning situations in which EG is better than GD.
EG is better at ignoring irrelevant features
The previous synaptic pruning experiments showed that EG finds solutions that rely on fewer synaptic weights than the solutions found by GD. Therefore, we hypothesised that finding solutions with a few large weights may also be advantageous for quickly learning tasks where only a few inputs are relevant. Furthermore, EG and its precursors have been shown to have better convergence guarantees in similar optimisation settings (Kivinen and Warmuth, 1997; Littlestone, 1988). Such tasks, involving sparsely relevant information, are highly pertinent for neural circuits as neurons receive both high levels of background noise and inputs from multiple brain areas, not all of which will be relevant in all contexts (Mante et al., 2013). Furthermore, despite receiving thousands of synaptic connections, generally only a few active synapses are required to drive a cell to fire action potentials (Ikegaya et al., 2013).
To test if EG is beneficial for learning with sparsely relevant synaptic inputs, we first investigated learning in a very simple setting. We modelled a single neuron as a rate-based point neuron receiving either 200, 2000, or 20000 random binary inputs – i.e. presynaptic activations of 1 or 0. We trained the neuron to be active whenever more than 50 of 100 randomly designated relevant synaptic inputs were active (Fig. 3A). For GD, this is equivalent to using the delta rule (Widrow and Hoff, 1988). To solve this task, the neuron needs to learn to ignore irrelevant synaptic inputs and assign large weights only to the relevant inputs. We found that when the total number of synaptic inputs was similar to the number of relevant inputs, EG and GD learning trajectories were also similar (Fig. 3B left). However, as the ratio of relevant inputs to irrelevant inputs decreased, gradient descent struggled to learn which of the inputs were relevant, whereas EG learned much more efficiently (Fig. 3B middle, right).
Figure 3: a| Task description: student neurons receive Bernoulli-distributed binary inputs with N dimensions, and are required to match the output of teacher neurons with 100 non-zero weights. Teacher output is 1 if 50 or more binary inputs are active and 0 otherwise. b| Student neuron performance as N increases (n=3). c| Morphologies of the biophysically detailed multi-compartment mouse pyramidal neuron models used in experiments. d| Example simulation showing a model neuron’s response to two impulse stimuli before and after training. e| Left: running average error for neurons trained with EG or GD for N=3000 synapses. Right: accuracy for individual models trained with EG or GD, averaged over the entire task. f| Difference in average accuracy increases with the total number of synapses. Error bars are s.d. from n=5 models. (EG orange, GD blue)
While striking, these point neuron experiments (Fig. 3A,B) abstract away a vast amount of biological detail and complexity. Hence, we next tested whether the learning properties of EG persisted in more realistic, spiking, multi-compartment models of pyramidal neurons with conductance-based synaptic inputs distributed across the dendritic tree (Fig. 3C). Again, the model neurons were required to learn to activate only when more than 50 of 100 relevant inputs were active and to ignore irrelevant inputs (Fig. 3D). In this case, optimisation is far more complicated, due to the fact that gradients have to be propagated throughout the entire multi-compartment structure (Bicknell and Häusser, 2021). Nonetheless, we again found that learning with EG was advantageous relative to GD across five different neuron morphologies (Fig. 3E). Furthermore, similar to our point neuron experiments, again we found that as the ratio of relevant inputs to irrelevant inputs decreased, the benefits of learning with EG relative to GD also grew (Fig. 3F). Together, these results demonstrate that learning with EG is beneficial for neurons learning from a sparse set of relevant inputs.
Figure 4: a| Schematic of an RNN controlling a two-joint planar arm through six muscles. The network receives delayed visual and proprioceptive feedback (see Methods). b| Three example arm reaches from random starting locations (x symbol) to target locations (filled circles) at the start and end of training. Different colours represent different reaches. Scale bars represent 20cm. c| Example feedback to the RNN controller. Left: from arm. Right: noise. d| Learning curves for RNNs trained with EG or GD. (From n=9 models). e| Validation reaches for EG and GD networks after 3 thousand updates. Scale bars represent 20 cm. f| As e but for 6 thousand updates.
EG is better at continuous control with irrelevant features
The above multi-compartment neuron model simulations show that EG’s advantage over GD extends through to realistic levels of biological complexity, albeit for single neurons performing simple tasks. Next, we turned our attention to neural circuits performing more complicated, ethologically relevant tasks in the presence of irrelevant noise. Specifically we considered the task of controlling a two-joint planar arm to perform reaches between random starting and target positions (Fig. 4A, B). This introduces biological complexity to the problem, including a non-linear mapping between the network’s output and the environment, and learning how to employ delayed feedback from the environment to the neural network. As is the case in real brains, we assume that there are many inputs to the circuit that are not relevant to the reaching task, and thus can be treated as noise.
We modelled motor circuits as an RNN receiving a constant stream of visual and proprioceptive inputs from the environment. The output of the RNN was a driving signal to six simulated muscles actuating the arm. In addition to the task relevant visual and proprioceptive signals the network also received additional irrelevant inputs with similar statistics to the task relevant inputs (Fig. 4C). Hence, in order to control the arm accurately, the network must learn to ignore the irrelevant noise.
We compared networks trained with GD and EG in this task and found that there was a clear difference in terms of the speed of learning. Specifically, we observed that EG trained networks (Fig. 4D, orange lines) learned to control the arm faster than GD trained networks (Fig. 4D, blue lines), and converged to a better final solution. EG’s better performance could also be seen by examining a circle of evaluation-reach trajectories after 3,000 and 6,000 updates. We found that EG trained arm controllers showed smooth, direct trajectories to the targets after 3,000 updates, whereas GD trained networks were noticeably less precise (Fig. 4E). Notably, when there were no irrelevant noise inputs the two algorithms were equivalent in their speed of learning (Fig. S4). Hence, EG leads to faster and better learning in the presence of irrelevant inputs, even when applied to non-trivial sensorimotor control tasks.
Discussion
In this paper we proposed that exponentiated gradient (EG) provides a better algorithm for training computational neuroscience models than gradient descent (GD). We presented three lines of evidence justifying this claim. First, we highlighted the theoretical principles shared by both learning rules, and empirically showed that EG is no less powerful than GD at learning a challenging battery of cognitive tasks (Fig. 1). Second, we found that EG matches neural phenomena better than GD: EG maintains Dale’s law and log-normal weights, and produces multiplicative, proportional synaptic changes. Third, we observed several advantages of EG over standard GD in biologically relevant task settings, including: a greater ability to adapt to synaptic pruning (Fig. 2), the ability for single neurons to learn when very few inputs are relevant to the task (Fig. 3), and the ability for recurrent networks to learn rapidly when engaged in sensorimotor control in the presence of noise signals (Fig. 4). Altogether, these results support our assertion that EG is a better algorithm than GD for training neural network models of the brain.
Theoretical considerations
We observed that EG leads to different solutions than GD. What is the cause of these differences? One observation is that using different synaptic proximity penalties (see Section 2.1) effectively alters the “distance” between distinct settings of synaptic weights (Gunasekar et al., 2018). As a result, given the same initial weights, GD and EG have different “distances to travel” to a target weight setting, and as such, in general they may be more inclined to discover some weight configurations over others. That observation aligns with our results in the Mod-Cog suite of tasks (Fig. 1) where we found that the behaviour of the networks was very similar, but the actual synaptic weights learned were quite distinct. Another observation to explain the differences between the algorithms is that the exponential operation in EG leads to more emphasis being placed on certain directions in the gradient, and this effect will likely be compounded by the multiplicative nature of the weight updates (Schwarz et al., 2021). Indeed, this nonlinear credit assignment process will bias the network towards weights where a small number of connections become very strong, which has implications for pruning (Fig. 2) and learning tasks with a small number of relevant inputs (Fig. 3, 4).
These considerations also have interesting implications for questions of network development and initialisation. Recent deep learning experiments have shown that at initialisation networks contain task-specific subcircuits that both change a lot during learning, and are capable of solving the task by themselves (Frankle and Carbin, 2018). As such, it may be that EG is better at discovering these so-called “winning lottery tickets” within a network’s initialisation, i.e. finding the subset of synaptic weights that should change a lot in order to learn the task. Relatedly, one potential disadvantage for EG is that, due to its inability to change the sign of synaptic weights, if the network is initialised in a state where the only feasible solutions require a sign change in some synapses, then GD will outperform EG. We did not observe this in our simulations, though, likely because this situation is only relevant in an under-parameterised regime where the likelihood of having randomly initialised a group of synapses to the right sign for solving the task is much smaller. However there are inhibitory-excitatory network architectures that learn well in such regimes (Cornford et al., 2021; Haber and Schneidman, 2022), and biological circuits are not under-parameterised.
Limitations and future work
Our analysis and experiments rely on gradient-based learning signals. While there is ongoing theoretical and experimental work investigating how brain circuits might estimate quantities correlated with the gradient (as they must in order to improve on tasks with small synaptic changes (Richards and Kording, 2023)), it is not yet clear how real neural circuits can engage in gradient-like calculations (Lillicrap et al., 2020). However, we emphasise that our central contribution is agnostic to the specific form of the neuronal credit signal – anything that is correlated with the gradient (e.g. reward signals) could be used in place of the explicit gradient. Instead, we leveraged the often implicit concept of synaptic distance, which dictates how such credit signals are used to update synaptic strengths. Nonetheless, future work could examine whether other, more biologically plausible credit assignment mechanisms would lead to the same set of results we obtained here.
Another potential limitation is the difficulty of the tasks. While these tasks are relevant to computational neuroscience (Khona et al., 2023; Yang et al., 2019), they are simple from an optimisation perspective: such tasks are far less complicated than controlling embodied agents in open-ended environments or engaging in real-world language modelling (Brown et al., 2020; Todorov et al., 2012). Previous work has shown that biologically plausible learning algorithms can fail to scale to more difficult tasks (Bartunov et al., 2018), raising the possibility that our conclusions may also change with an increase in task difficulty. However, because GD is known to perform well in more complicated tasks, and EG and similar algorithms have also been shown to scale-up to hard computer vision tasks (Bernstein et al., 2020; Schwarz et al., 2021; Sun et al., 2022), it is unlikely that such a concern applies here. But, we hope future work will explore more fully the representational differences between EG and GD in tasks that are of a real-world level of complexity.
Finally, we did not consider the role of different cell types. Instead we modelled the heterogeneous biological complexity of neural circuits abstractly using a homogeneous population of rate-based rectified linear units. Although it is a common practice to make such simplifications in Neuro-AI studies, the question of cell types may be particularly important when considering which synaptic proximity penalty, and therefore learning algorithm, to use. In particular it may be that synaptic plasticity in inhibitory and excitatory neurons should be modelled more accurately using different distance functions, and in line with this, recent work has shown that synaptic dynamics are different for PV+ interneurons and pyramidal neurons (Melander et al., 2021). The biophysical modelling approach we employed for the single-neuron experiments (Fig. 3) is ideal for exploring such cell-type differences in future work.
Importance for understanding learning in real brains
The use of EG as an algorithm for training models of neural circuits has both computational and experimental implications. First, it could facilitate downstream computational neuroscience studies investigating the functional properties of skewed, log-normally-distributed weights. While there are older works on log-normal-consistent learning (Van Rossum et al., 2000), these studies used synaptic plasticity rules that are unable to learn even the sort of simplified tasks we studied here. Our work provides a means for computational researchers to explore learning with log-normal distributions in weights while maintaining the ability to learn more demanding, or cognitively relevant tasks. Second, this work provides a normative, computational explanation for some previously observed experimental phenomena (e.g. multiplicative weight updates). Finally, we believe that the EG framework, and more generally mirror descent, is an important step towards data-constrained learning algorithms. Here, it allowed us to incorporate several biological observations without sacrificing the performance of GD. In future work, we hope this framework will co-evolve with experimental observations to develop better models of learning in neural circuits.
Funding Acknowledgement
This work was supported by NSERC (Discovery Grant: RGPIN-2020-05105; Discovery Accelerator Supplement: RGPAS-2020-00031; Arthur B. McDonald Fellowship: 566355-2022); CIFAR (Canada AI Chair; Learning in Machine and Brains Fellowship) and IVADO (IVADO postdoctoral fellowship). This research was enabled in part by support provided by (Calcul Québec) (https://www.calculquebec.ca/en/) and the Digital Research Alliance of Canada (https://alliancecan.ca/en). The authors acknowledge the material support of NVIDIA in the form of computational resources. The authors acknowledge the use of the UCL Myriad High Performance Computing Facility (Myriad@UCL), and associated support services, in the completion of this work.
Method details
All simulations were implemented and results analysed using Python. Neural networks were implemented in PyTorch, and multi-compartment neuron models in NEURON. Full code, model checkpoints, hyperparameter settings, and training data will be released upon publication. Simulations were run on either MILA Quebec AI institute’s compute cluster, or University College London’s Myriad high performance computing system. The majority of ANN experiments were run using either NVIDIA RTX8000 or A100 GPUs.
Recurrent Neural Network Architecture
For Mod-Cog (Sec. 2.2) and continuous control (Sec. 2.4) experiments, we modelled all recurrent neural networks (RNNs) as follows. At each time t and input xt, the hidden state ht of 2.5 thousand neurons was defined as:

h_t = ϕ(h_{t−1} W_hh + x_t W_in + b_h)

where ϕ is the neuronal nonlinear activation function. Throughout we set ϕ to be the rectified linear function (ReLU), in order to constrain activity to be positive only and to capture the rectification operation performed by biological neurons in low activity regimes. The values of xt were task dependent and at time t=0 the hidden state was set to be a vector of zeros.
The output activation of the RNN at each time point is:

ŷ_t = σ(h_t W_out + b_out)

where the choice of σ and the dimensionality of Wout are dependent on the task (see task details below).
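A minimal PyTorch sketch of this recurrence (our illustration; sizes, scales, and variable names are placeholders rather than the released implementation):

```python
import torch
import torch.nn.functional as F

n_in, n_hidden, n_out = 100, 2500, 17   # hidden size 2.5k as in the text; I/O sizes are task dependent

W_in  = torch.randn(n_in, n_hidden) * 0.01      # |x| x |h|
W_hh  = torch.randn(n_hidden, n_hidden) * 0.01  # |h| x |h|
W_out = torch.randn(n_hidden, n_out) * 0.01     # |h| x |y|
b_h, b_out = torch.zeros(n_hidden), torch.zeros(n_out)

def rnn_step(h, x):
    """One step: h_t = ReLU(h_{t-1} W_hh + x_t W_in + b_h); y_t = h_t W_out + b_out."""
    h = F.relu(h @ W_hh + x @ W_in + b_h)
    y = h @ W_out + b_out        # the output nonlinearity σ is applied per task (e.g. softmax)
    return h, y

x_seq = torch.randn(350, n_in)   # a dummy input sequence of length 350
h = torch.zeros(n_hidden)        # hidden state starts at zero (t = 0)
for x_t in x_seq:
    h, y_t = rnn_step(h, x_t)
```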
All network parameters (Wout ∈ R|h|×|y|, Whh ∈ R|h|×|h|, Win ∈ R|x|×|h|, bout ∈ R|y|, bh ∈ R|h|, where |h| denotes the cardinality of h) were randomly initialised using the same distribution for EG and GD. Each individual weight was drawn from a log-normal distribution with shape parameter α = 1.5. The resulting weights were positive, so to obtain the final weights we randomly multiplied them by ±1 with equal probability. This makes the weights zero mean and, using the variance of XY for independent X (the positive weights) and Y (the signs), fixes the variance of the final weights (the same for input, hidden, and output weights). For any positive α, this initialisation scheme produces the same mean and variance of the final weight distribution. We set α = 1.5 to roughly reproduce the shape of experimentally observed log-normal synaptic weights and did not observe that network training was sensitive to this choice (not shown).
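A sketch of this scheme (ours, with an assumed parameterisation: α is treated as the log-space standard deviation, and the target weight scale below is an arbitrary placeholder rather than the value used in the paper):

```python
import numpy as np

def lognormal_sign_init(shape, alpha=1.5, target_std=0.02, rng=None):
    """Positive log-normal magnitudes given random +/-1 signs so the final weights are zero mean.
    alpha (assumed log-space std) controls the shape; magnitudes are rescaled so that the
    final signed weights have standard deviation target_std regardless of alpha."""
    rng = np.random.default_rng() if rng is None else rng
    mags = rng.lognormal(mean=0.0, sigma=alpha, size=shape)
    mags *= target_std / np.sqrt(np.mean(mags ** 2))   # after sign flips, Var = E[X^2] = target_std^2
    signs = rng.choice([-1.0, 1.0], size=shape)         # equal-probability sign flips
    return mags * signs

W_hh = lognormal_sign_init((2500, 2500))
print(W_hh.mean(), W_hh.std())   # approximately 0 and target_std
```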
For the biases, under GD we initialised them from a uniform distribution (standard PyTorch initialisation). To account for the inability of EG to change parameter sign, models optimised by EG were instead initialised with two bias vectors, one positive and one negative, which together give the effective bias. For both GD and EG, the biases were therefore zero mean with matched variance.
Learning algorithm and optimisation details
For all experiments, apart from the multi-compartment neuron models, parameters were updated using gradients calculated using PyTorch’s automatic differentiation engine, autograd. For RNNs, gradients were calculated using backpropagation through time (BPTT) with no truncation. Parameter updates were then obtained according to either gradient descent (standard BPTT) or exponentiated gradient as described in the main text. We applied momentum α (in the standard PyTorch way for SGD) and weight decay γ:

g_t = α g_{t−1} + ∇l(w_t)
GD: w_{t+1} = (1 − ηγ) w_t − η g_t
EG: w_{t+1} = exp(−ηγ) w_t ⊙ exp(− η sign(w_t) ⊙ g_t)
For EG, the sign of weights ensures both negative and positive weights change correctly (a positive gradient ∇l(wt) means the weight should decrease, and without the sign wt ⊙ exp(− η∇l(wt)) would instead increase the negative weights). We further discuss the derivation of generic mirror descent and the weight decay we use for EG in the Supplementary.
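A minimal sketch of one such parameter update (our illustration, following the equations above; the released code may organise this differently, e.g. as a torch.optim.Optimizer subclass):

```python
import math
import torch

@torch.no_grad()
def eg_step(w, grad, buf, lr=1e-2, momentum=0.9, weight_decay=0.0):
    """One exponentiated-gradient step with momentum and multiplicative weight decay.
    w: parameter tensor, grad: its gradient, buf: momentum buffer (same shape as w)."""
    buf.mul_(momentum).add_(grad)                    # g_t = momentum * g_{t-1} + grad
    w.mul_(math.exp(-lr * weight_decay))             # EG weight decay: w <- exp(-eta*gamma) * w
    w.mul_(torch.exp(-lr * torch.sign(w) * buf))     # multiplicative, sign-preserving update
    return w, buf

@torch.no_grad()
def gd_step(w, grad, buf, lr=1e-2, momentum=0.9, weight_decay=0.0):
    """The corresponding gradient descent step: w <- (1 - eta*gamma) * w - eta * g_t."""
    buf.mul_(momentum).add_(grad)
    w.mul_(1.0 - lr * weight_decay).add_(buf, alpha=-lr)
    return w, buf
```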
Training details
For all experiments we stabilised network training by constraining the norm of the gradient vector to be less than or equal to a scalar value that was chosen via hyperparameter search (Pascanu et al., 2012). Training hyperparameters, such as learning rate, momentum, etc. were selected using grid search to minimise the final validation loss. Please see the individual task descriptions for specifics and further details.
Task-specific details
Mod-Cog
All 82 tasks comprising the Mod-Cog dataset of cognitive tasks shared the following input and output structure (see S1 for task examples). Network inputs were a concatenation of 4 vectors: fixation, stimulus modality 1, stimulus modality 2, and rule encoding. The fixation input was a one dimensional variable that remained at 1 for the duration of a trial until the onset of the decision (response) period when it dropped to 0. The two input stimulus modalities were both ∈ R16 and each encoded a one dimensional circular variable with preferred directions uniformly spaced from 0 to 2π. Input stimuli were presented on one or both input rings for trial time-periods in a task dependent manner as detailed below. Finally, throughout each trial, the rule encoding was a constant one-hot ∈ {0,1}^84 vector indicating the current task to be performed.
For Mod-Cog, network outputs ŷ were ∈ R17, comprising a scalar fixation output and an output ring ∈ R16. During the response period, output ring unit activities were converted to probabilities using a softmax function and trained using a cross entropy loss to match a one-hot target vector y, representing the location of target activity on the output ring. Before the response period, network outputs were trained to match the target vector y using a mean squared error loss.
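A hedged sketch of how such a time-dependent loss can be assembled (our illustration; tensor layouts and masking details are assumptions, and the handling of the fixation output during the response period is omitted):

```python
import torch
import torch.nn.functional as F

def modcog_loss(y_hat, y, response_mask):
    """y_hat, y: (time, batch, 17) outputs and targets, dim 0 = fixation, dims 1-16 = output ring.
    response_mask: (time, batch) boolean tensor, True during the response period."""
    resp, pre = response_mask, ~response_mask
    # Response period: cross entropy between the softmaxed output ring and the one-hot target.
    ring_logits = y_hat[..., 1:][resp]                # (n_response_steps, 16)
    ring_target = y[..., 1:][resp].argmax(dim=-1)     # index of the one-hot target on the ring
    ce = F.cross_entropy(ring_logits, ring_target)
    # Before the response period: mean squared error on all outputs (fixation and ring).
    mse = F.mse_loss(y_hat[pre], y[pre])
    return ce + mse
```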
To maximise computational efficiency task trials were drawn from neurogym objects (Molano-Mazon et al., 2022) and stored as a dataset as follows. Trials were drawn randomly from one of the 82 tasks, and concatenated together until a sequence length of 350 was obtained. 512 of these sequences were then stacked to form a batch. The dataset itself was composed of 1000 of these batches. Therefore 1000 updates corresponds to one pass through the dataset. Validation and test datasets of 100 batches were additionally drawn.
We trained networks for 6000 updates (i.e. 6 passes through the dataset). During training the learning rate followed a warmup (from 0), constant, cooldown (to 0) schedule. Warmup and decay periods were 500 and 1000 updates respectively. Best hyperparameters were selected based on the average final loss over three random seeds calculated on the validation dataset. These settings were then used for five more random seeds, and final results were reported on the held out test dataset. During training, Gaussian noise (mean 0, std deviation 0.1) was added to the task inputs. Validation and test sets were evaluated without noise. The subtask-specific details are deferred to the Supplementary.
Mod-Cog Pruning
For all pruning experiments non-bias network parameters (i.e. weights) were pruned according to L1 global unstructured pruning. As such, for a given prune percentage x, all non-bias parameters were sorted according to their absolute value, and the smallest x percentage of parameters were masked and set to 0. Networks were pruned after 5 thousand updates. For retraining experiments, the networks continued to learn with the pruning mask in place (i.e. pruned connections were held at 0) for the final thousand updates. For further details see the documentation for “torch.nn.utils.prune”.
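For reference, the following sketch shows this kind of pruning with PyTorch’s pruning utilities (the module and parameter names are stand-ins, not the paper’s model):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in network; the paper's RNN is larger and structured differently.
rnn = nn.RNN(input_size=100, hidden_size=256, nonlinearity="relu")
readout = nn.Linear(256, 17)

parameters_to_prune = [
    (rnn, "weight_ih_l0"),
    (rnn, "weight_hh_l0"),
    (readout, "weight"),
]

# Global L1 unstructured pruning: sort all listed weights by absolute value and
# mask (set to zero) the smallest 92.5% across the whole network.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.925,
)
# The pruning masks stay registered on the modules, so continued training
# (relearning after pruning) keeps the pruned connections at zero.
```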
Sparsely relevant inputs
Point neuron models
A student neuron received N independent Bernoulli inputs (p=0.5). Out of the N inputs, 100 were randomly designated to be relevant by a teacher neuron, whose weights were 0 for irrelevant inputs, and 1 for relevant inputs. For each activation, the output label from the teacher neuron was 1 if at least R=50 of those inputs were active.
Student neuron weights were initialised to R/(p · N). For the student neuron, its output x⊤w was transformed through a sigmoid activation function as

ŷ = σ(x⊤w − R) = 1 / (1 + exp(−(x⊤w − R)))

and then passed to the cross-entropy loss. Therefore, for x⊤w that exceeded R, ŷ was larger than 0.5, predicting y = 1.
For each N, we constructed a training dataset of 10,000 inputs, and validation and test datasets of 1,000 inputs each. Neurons were trained for 600 iterations with a batch size of 500, with hyperparameters chosen through grid search based on the best final accuracy on the validation set. Results were reported over 3 seeds.
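A sketch of the data-generating process and student initialisation for this task (our illustration of the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, R, p = 2000, 100, 50, 0.5        # total inputs, relevant inputs, threshold, Bernoulli rate

teacher_w = np.zeros(N)
teacher_w[rng.choice(N, size=K, replace=False)] = 1.0   # 100 randomly designated relevant synapses

def make_dataset(n_samples):
    x = rng.binomial(1, p, size=(n_samples, N)).astype(float)   # Bernoulli(0.5) binary inputs
    y = (x @ teacher_w >= R).astype(float)                      # label 1 if >= 50 relevant inputs active
    return x, y

x_train, y_train = make_dataset(10_000)

# Student initialisation as described above: every weight set to R / (p * N).
w_student = np.full(N, R / (p * N))
```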
Multi-compartment neuron models
We used six biophysically detailed models of mouse pyramidal neurons from the Allen Cell Types Database [IDs: 483061182, 483101699, 486560376, 484564503, 485836906 and 468193142] (Allen Cell Types Database, 2015). Each model comprises a 3-dimensional morphology reconstructed from experiments and is endowed with a suite of 10 Hodgkin-Huxley style ion channel conductances at the soma. The parameters of each model have been fitted to patch clamp recordings associated with its unique morphology; for full details, see (Gouwens et al., 2018).
Excitatory synapses with AMPA and voltage-dependent NMDA conductances, and inhibitory synapses with GABA conductances, were distributed uniformly across the basal dendrites. Synapse models were as described in (Bicknell and Häusser, 2021). For a pre-synaptic spike arriving at time t0, and local dendritic membrane potential Vloc(t), the synaptic current from a given excitatory or inhibitory synapse is described by

I_E(t) = w_E [ g_A(t − t_0) + γ σ_N(V_loc) g_N(t − t_0) ] (E_E − V_loc(t))    (1)
I_I(t) = w_I g_G(t − t_0) (E_I − V_loc(t))    (2)

In Eqs. (1-2), wE and wI denote synaptic weights, γ is the NMDA/AMPA ratio, EE and EI denote reversal potentials, and the functions gA, gN, gG describe double-exponential activation kinetics of the AMPA, NMDA and GABA conductances,

g(t) = N_g [ exp(−t/τ_decay) − exp(−t/τ_rise) ]

The normalizer N_g ensures that the activation peaks at a magnitude of 1. The voltage-dependence of the NMDA conductance, σN(Vloc), is modelled by the sigmoid

σ_N(V_loc) = 1 / (1 + C exp(−ρ V_loc))

using parameters C = 1/3.57, ρ = 0.062, as described by (Jahr and Stevens, 1990).

We set γ = 2 for the NMDA/AMPA ratio, and EE = 0 mV and EI = −75 mV for the reversal potentials. The rise and decay time constants of the activation kinetics were as follows. AMPA: .
Task
We consider N total synapses, each receiving independent Poisson spike trains at spontaneous rates of 1.25 spikes/s. A number K of the excitatory synapses are designated as “relevant”, to which the neuron should pay attention, whereas the remaining synapses are a source of noise. After a 100 ms burn-in period, an impulse stimulus is presented, with each synapse randomly receiving an input spike with probability pstim. The classification task requires the neuron to fire an action potential whenever at least R out of the K relevant synapses are activated at stimulus time. We fixed the parameters pstim = 0.5, R = 50, K = 100, while varying N.
The task was run for 60,000 total stimulus presentations with learning as described in the next section. Performance over the training period was quantified as the fraction of correct classifications in blocks of 40 presentations. For one of the model neurons (ID: 483061182) learning did not converge for either the GD or EG algorithm; this model was excluded from further analysis.
Learning rules
Learning was implemented using the gradient-based technique of (Bicknell and Häusser, 2021), generalised for use with the Allen Institute models.
In the binary classification task described above, a neuron is required to spike for one class of input patterns, which we denote ⊕, and remain silent for the other, denoted ⊖. Learning proceeds according to the Tempotron principle (Gütig and Sompolinsky, 2006): when a classification error occurs, synaptic weights are updated with the aim of pushing the somatic membrane potential, Vsoma, above spike threshold for ⊕ patterns, and below threshold for ⊖ patterns. To do this efficiently, the updates must depend on the derivative of Vsoma with respect to the weights, ∂Vsoma/∂wi. In this framework, the GD and EG learning rules therefore take the forms

GD: w_i ← w_i ± η ∂V_soma/∂w_i (t_max)    (3)
EG: w_i ← w_i exp(± η′ ∂V_soma/∂w_i (t_max))    (4)

In Eqs. (3-4), the symbol ± denotes that the sign is positive for ⊕ patterns and negative for ⊖ patterns, η and η′ are learning rates, and the derivative is evaluated at the time of maximum membrane potential during the pattern presentation, tmax. For ⊕ patterns, tmax is the time at which the subthreshold potential reaches its peak value, whereas for ⊖ patterns, tmax is the time of the first action potential (in practice, 2 ms before the action potential, for numerical stability). Note that these updates are only performed if there has been an error on presentation t; weights are left unchanged after successful classifications.
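Assuming the gradient ∂Vsoma/∂wi(tmax) has already been computed (as described in the next section), the error-gated updates can be sketched as follows (our illustration, not the released simulation code):

```python
import numpy as np

def update_weights(w, dV_dw, is_plus_pattern, did_spike, eta=1e-3, rule="EG"):
    """Error-gated, Tempotron-style update.
    w: synaptic weights; dV_dw: dV_soma/dw_i evaluated at t_max;
    is_plus_pattern: True for a ⊕ pattern, False for a ⊖ pattern;
    did_spike: whether the neuron actually fired during the presentation."""
    if did_spike == is_plus_pattern:
        return w                              # correct classification: no update
    sign = 1.0 if is_plus_pattern else -1.0   # push V_soma up for missed ⊕, down for false ⊖
    if rule == "GD":
        return w + sign * eta * dV_dw         # additive update (Eq. 3)
    return w * np.exp(sign * eta * dV_dw)     # multiplicative EG update (Eq. 4)
```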
Computing ∂Vsoma/∂wi for the biophysical models is nontrivial, as the derivative must be propagated through a system of differential equations that governs the membrane potential dynamics. In general, suppressing the technical model details, each neuron is described by a dynamical system

dV/dt = f(V, x_1, x_2, …, w),      dx_µ/dt = g_µ(V, x_µ)

where the vector V denotes the membrane potential in each of the many compartments of the discretized morphology, the variables xµ are Hodgkin-Huxley gating variables for the somatic ion channels (µ = 1, 2, …; typically two gating variables per channel species), f, gµ are nonlinear functions describing current flow and ion channel kinetics, and w is a vector of synaptic weights. The compartments of the morphology are ordered such that the somatic membrane potential is given by the first element of V, i.e. Vsoma ≡ V1.
Taking a derivative of the equation above with respect to a weight, wi, applying the chain rule, and swapping the order of the time and weight derivatives, leads to the system

d/dt (∂V/∂w_i) = (∂f/∂V) ∂V/∂w_i + Σ_µ (∂f/∂x_µ) ∂x_µ/∂w_i + ∂f/∂w_i    (5)
d/dt (∂x_µ/∂w_i) = (∂g_µ/∂V) ∂V/∂w_i + (∂g_µ/∂x_µ) ∂x_µ/∂w_i    (6)

Eqs. (5-6) describe how the membrane potential in each compartment depends on wi over time. By solving this system numerically for each weight, in parallel with simulations of the model, the crucial gradient terms can be read off as the first component of ∂V/∂wi(tmax).
For efficient application in the classification task, we simulated the models using the NEURON simulation environment (Hines and Carnevale, 1984), while solving the last equation for the gradients using a custom-written Python package. Learning rates η and η’ were selected via grid search to maximise task performance averaged over the entire training period.
Continuous Control
For continuous control, all trials were randomly generated; hyperparameters were therefore chosen based on grid-search performance, and experiments were re-run with 5 seeds for the final results.
For the continuous control experiments, the RNN output layer consisted of six neurons with a sigmoid nonlinearity to ensure positive outputs. The output from this last layer served as a muscle drive that was passed to the six muscle actuators controlling the two-joint planar arm. The environment models employed were from the open-source Python package motornet (Codol et al., 2024). Briefly, the actuators were six Hill-type muscle models, which produced forces based on their previous activation and the incoming muscle drive from the network. These forces were applied to the effector’s bones proportionally to their insertion points onto it. The new bone positions were then used to determine endpoint positions in Cartesian space.
Specifically, we employed the RigidTendonArm effector with default parameter values and the Euler algorithm for numerical integration. The muscle actuators were MujocoHillMuscle objects with passive force contribution set to zero. The simulation time constant was set to 10 ms for all models.
The RNN received as input a vector containing the Cartesian coordinates of the arm’s endpoint, start position and target position, as well as each actuator’s length and velocity. During training the visual feedback (the arm’s current endpoint) was delayed by 50 ms (5 simulation steps) and the proprioceptive feedback (each actuator’s length and velocity) was delayed by 20 ms (2 simulation steps) to match feedback delays observed in the motor cortex of biological organisms (Omrani et al., 2016).
Networks were trained to minimise a composite loss quantifying performance over each 1 second reach, combining a distance-to-target term with two regularisation terms. The distance term, l1(target, arm endpoints), is the mean of the l1 distance between the arm endpoint and target location over the 1 second reach, while λ1 and λ2 penalise large RNN hidden states and muscle drives respectively; both were set to 0.1. These environment states are used to compute the gradient, as the environment built through MotorNet is differentiable. This is equivalent to backpropagating through an accurate forward model of the effector.
To model learning in the presence of irrelevant inputs, the RNN received 500 time-varying noise inputs xt (at time t) in addition to the arm feedback. Each noise coordinate evolved according to second-order dynamics governed by a “speed” and a “momentum” variable (with a parameter θ = 0.0005): the momentum is sampled from an Ornstein–Uhlenbeck process, the speed integrates the momentum and additionally pushes xt away from zero and towards its initial value, and xt itself integrates the speed. Whenever a coordinate xt reaches zero, its speed and momentum are reset.
Analysis details
To compare task accuracies (Fig. 1) and weights during pruning (Fig. 2), we used the cosine similarity measure between two vectors x, y,

cos(x, y) = ⟨x, y⟩ / (||x||_2 ||y||_2)
For comparing weight distributions (Fig. S3), we used the one-sample Kolmogorov-Smirnov test statistic (Hollander et al., 2015) to compare distributions with a normal distribution,

D_n = sup_x |F_n(x) − F(x)|

where F_n(x) is the empirical CDF of the (z-scored, i.e. normalised by the sample mean and standard deviation) weights and F(x) is the CDF of the standard normal.
For comparing pairs of weight distributions (Fig. S3), we used the two-sample Kolmogorov-Smirnov test statistic (Hollander et al., 2015) for CDFs of z-scored weights Fn and Gn,

D = sup_x |F_n(x) − G_n(x)|
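Both measures are available in standard Python tooling; a minimal sketch (ours) of how they can be computed:

```python
import numpy as np
from scipy import stats

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def ks_to_normal(weights):
    """One-sample KS statistic between z-scored weights and the standard normal CDF."""
    z = (weights - weights.mean()) / weights.std()
    return stats.kstest(z, "norm").statistic

def ks_two_sample(w1, w2):
    """Two-sample KS statistic between two samples of z-scored weights."""
    zscore = lambda w: (w - w.mean()) / w.std()
    return stats.ks_2samp(zscore(w1), zscore(w2)).statistic
```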
For fitting the slope of eigenvalues (Fig. S2B), we used code made available by (Agrawal et al., 2022).
Supplementary
Acknowledgements
We would like to thank Daniel Levenstein, Colin Bredenberg, Zahid Padamsey, Leonid Savtchenko, and Dongyan Lin for their helpful comments on the manuscript and discussion. We would also like to thank all members of the Richards and Lajoie labs for creating a supportive and intellectually stimulating environment.
Appendix A: Mirror descent
In this section we provide a more technical overview of mirror descent and weight decay.
Mirror descent derivation
In this section, we provide an introduction to mirror descent. Several textbooks, such as (Bubeck, 2015; Shalev-Shwartz and others, 2012), provide a more detailed discussion on this framework.
In standard gradient descent, to minimise a loss function l(w), a single update starting from weight wt is derived as follows:

w_{t+1} = argmin_w [ ⟨∇l(w_t), w − w_t⟩ + (1/(2η)) ||w − w_t||²_2 ]    (1)

where η is the learning rate. Therefore, a single step of GD linearises the loss at the current weight wt, and then adds a quadratic term to the linearisation to establish a region in which the linearisation is considered accurate (which is controlled by η, which can be viewed as an alternative interpretation of the learning rate).

The problem in (1) is convex, and we can find its solution analytically, getting a gradient descent step,

w_{t+1} = w_t − η ∇l(w_t)    (2)

In mirror descent (introduced in (Nemirovskij and Yudin, 1983)), the weight update is regularised by a different penalty Dϕ(w, wt),

w_{t+1} = argmin_w [ ⟨∇l(w_t), w − w_t⟩ + (1/η) Dϕ(w, w_t) ]    (3)
What should D ϕ (w, w t) be to work as a penalty? Like for the two-norm in Eq. 1, we expect it to be nonnegative, zero at w = wt, and increase as w moves away from wt.
Mirror descent formalises this intuition of a penalty with the Bregman divergence (Bregman, 1967). To define it, we first take a differentiable function ϕ(w) that is strictly convex, i.e. a function that satisfies, for w′ ≠ w,

ϕ(w′) > ϕ(w) + ⟨∇ϕ(w), w′ − w⟩

For such functions, we can define the Bregman divergence as the difference between the function and its linearisation:

Dϕ(w, w_t) = ϕ(w) − ϕ(w_t) − ⟨∇ϕ(w_t), w − w_t⟩
It’s worth noting that mirror descent is usually defined for strongly convex functions (Bubeck, 2015), i.e. the ones following
for some constant m. This is helpful for deriving algorithmic convergence guarantees (regret bounds). However, to derive the update itself, we only need strict convexity to guarantee invertibility of ∇ϕ. Another way to view this requirement is by examining the minimisation problem in Eq. 3: for a strictly convex potential, the corresponding Bregman divergence is also strictly convex, so adding it to a linear function (which is the linearised loss) makes the overall problem have a unique minimum.
To derive a closed-form mirror descent update, we differentiate the functional in Eq. 3 w.r.t. w and set the result to zero. As it’s strictly convex (as a sum of convex functions and a strictly convex one), this will give us a unique minimum:

∇l(w_t) + (1/η) (∇ϕ(w_{t+1}) − ∇ϕ(w_t)) = 0   ⟹   ∇ϕ(w_{t+1}) = ∇ϕ(w_t) − η ∇l(w_t)    (4)

The first example of mirror descent is the regular gradient descent. For ϕ(w) = ½ ||w||²_2, we have that ∇ϕ(w) = w, and therefore get the gradient descent update in Eq. 2.
To derive exponentiated gradient descent, we need to split the weights into positive and negative ones. One way to do that is to take a vector of signs s and a vector of positive weights w+, such that w = s ⊙ w+ (where ⊙ stands for element-wise multiplication). We will only optimise over w+ and hold s fixed. We then take ϕ(w+) = Σ_i w+_i log w+_i, so Eq. 3 becomes

w+_{t+1} = argmin_{w+} [ ⟨∇_{w+} l(s ⊙ w+_t), w+ − w+_t⟩ + (1/η) Dϕ(w+, w+_t) ]

and therefore

Dϕ(w+, w+_t) = Σ_i [ w+_i log(w+_i / w+_{t,i}) − w+_i + w+_{t,i} ]

By the chain rule,

∇_{w+} l(s ⊙ w+_t) = s ⊙ ∇l(w_t)    (5)

Then, to derive the final update, we note that ∇ϕ(w+) = 1 + log w+, and therefore from Eqs. 4-5 we have that

log w+_{t+1} = log w+_t − η s ⊙ ∇l(w_t)
w+_{t+1} = w+_t ⊙ exp(− η s ⊙ ∇l(w_t))
w_{t+1} = s ⊙ w+_{t+1} = w_t ⊙ exp(− η sign(w_t) ⊙ ∇l(w_t))

where the last line uses the fact that the weight has a fixed sign.
Exponentiated gradient with weight decay
For both GD and EG, we added weight decay such that it decreases the current weight wt as γ′wt, where for GD γ′ = 1 − ηγ and for EG γ′ = exp(− ηγ). In terms of the mirror descent formulation, we can view it as replacing the penalty Dϕ(w, wt) in Eq. 3 with a penalty to a shrunk version of the current weight, Dϕ(w, γ′wt):

w_{t+1} = argmin_w [ ⟨∇l(w_t), w − w_t⟩ + (1/η) Dϕ(w, γ′ w_t) ]

which for GD recovers the standard γ||w||2 weight penalty.
Appendix B: Mod-Cog subtask specific details
The set of 82 Mod-Cog tasks was introduced by Khona et al. (2023). The Mod-Cog dataset extends an original set of 20 cognitive tasks from Yang et al. (2019) by including sequence generation and interval estimation requirements. The reader is referred to these references for full details including experimental motivation; here we provide a summary. The 20 original tasks can be split into 4 families: Go, Decision Making, Delay Decision Making and Matching. For all tasks, the network is required to match a target output direction for the duration of the response period. Mod-Cog adds 40 additional tasks by requiring the network to instead output a sequence (clockwise or anticlockwise) during the response period, with the sequence beginning at the base task’s target output. Furthermore, 11 of the original Yang et al. (2019) tasks involve delay periods. Mod-Cog adds a further 22 tasks by offsetting the original task’s target output (clockwise or anticlockwise) by a magnitude dependent on the length of the delay period. See Fig. S1 for examples.
Here we provide further details regarding the base set of cognitive tasks introduced by Yang et al. (2019) (text adapted from that paper).
Go Task family (6 tasks)
[Go, Reaction Time Go, Delay Go, Anti Go, Anti Reaction Time Go, Anti Delay Go]. A single stimulus is presented randomly on either input modality 1 or modality 2, and the target response is in the same direction as the stimulus, unless the task is an “Anti” variant, in which case it is the opposite direction. For non-Reaction Time tasks, the input appears before the fixation cue goes off, whereas for Reaction Time tasks the input appears when the fixation goes off and the network should respond as soon as the stimulus appears. For the delay versions of the task, the input stimulus ends before the fixation drops, with a random delay in between.
Decision making family (5 tasks)
[DM 1, DM 2, Ctx DM 1, Ctx DM 2, MultSen DM tasks]. In each trial, two stimuli are shown simultaneously and are presented till the end of the trial. Stimulus 1 is drawn randomly between 0 and 360°, while stimulus 2 is drawn uniformly between 90 and 270° away from stimulus 1. In DM 1, the two stimuli only appear in modality 1, while in DM 2 the two stimuli only appear in modality 2. In DM 1 and DM 2, the correct response should be made to the direction of the stronger stimulus (the stimulus with higher γ). In Ctx DM 1, Ctx DM 2 and MultSen DM tasks, each stimulus appears in both modality 1 and 2. In the Ctx DM 1 task, information from modality 2 should be ignored, and the correct response should be made to the stronger stimulus in modality 1. In the Ctx DM 2 task, information from modality 1 should be ignored. In the MultSen DM task, the correct response should be made to the stimulus that has a stronger combined strength in modalities 1 and 2.
Delay decision making family (5 tasks)
[Dly DM 1, Dly DM 2, Ctx Dly DM 1, Ctx Dly DM 2 and MultSen Dly DM]. These tasks are similar to the corresponding tasks in the DM family, except that in the Dly DM family tasks, the two stimuli are separated in time. The two stimuli are both shown briefly and are separated by a delay period. Another short delay period follows the offset of the second stimulus.
Matching family (4 tasks)
[DMS, DNMS, DMC, DNMC]. In these tasks, two stimuli are presented consecutively and separated by a delay period. Each stimulus can appear in either modality 1 or 2. The network response depends on whether or not the two stimuli are ‘matched’. In the DMS and DNMS tasks, two stimuli are matched if they point toward the same direction, regardless of their modalities. In DMC and DNMC tasks, two stimuli are matched if their directions belong to the same category: directions from 0 to 180° belong to the first category, while directions from 180 to 360° belong to the second category. In the DMS and DMC tasks, the network should respond toward the direction of the second stimulus if the two stimuli are matched and maintain fixation otherwise. In the DNMS and DNMC tasks, the network should respond only if the two stimuli are not matched, that is, a non-match, and fixate when it is a match.
Appendix C: Supplementary Figures
Figure S1: a| Go task family. b| Mod-Cog sequence generation applied to the go task family. c| Mod-Cog delay estimation with the go task family.
Figure S2: a| Histograms of EG (red) and GD (blue) trained network activity in log-space. Dashed lines show normal fit; data is a random sample of 1000 time points at the end of training. b| Spectral properties of the data shown in a. Left: cumulative explained variance vs eigenvalue. Right: log-log plots of eigenvalue index vs value. Dashed lines show linear fit (fit from the 3rd to 30th eigenvalues).
Figure S3: Gaussian fits to final weights and weight changes for EG and GD for different weight initialisations (zero mean with equal variance): log-normal (ln), normal (n), uniform (u). a| Final weights in log space for GD (top row) and EG (bottom row) with (left to right) log-normal, normal, and uniform initialisations. b| Kolmogorov-Smirnov (KS) test statistic for standard normal fits of z-scored log final weights. EG with a log-normal initialisation is much closer to a normal distribution in log space, and EG for all initialisations is closer than GD. c| Same as a, but for weight changes. d| Same as b, but for weight changes. GD weight changes are consistently more normal across weight initialisations. e| Same as a, but for changes of log weights. f| Same as b, but for changes of log weights. EG weight changes in log space are consistently more normal than those for GD (cf. d).
Figure S4: a| Learning curves for RNNs trained with EG (red, gold) or GD (blue, black). (From n=9 models; 500 distractors: blue and red; 0 distractors: gold and black.)