## Summary

During task execution cortical dynamics must bridge sensory cues with future behavioural outcomes. However, how cortical networks acquire such task-specific dynamics remains unclear. Here we propose that the cerebellum drives cortical dynamics to enable rapid and flexible task acquisition. We model cerebellar networks that are tuned through timing rules to provide cortical networks with task-outcome predictions. First, using sensorimotor tasks we show that cerebellar feedback with fixed cortical connectivity is sufficient for rapid task acquisition and one-shot task switching. Next, we demonstrate that, when trained in working memory tasks, the cerebellum can also underlie the maintenance of cognitive-specific dynamics, explaining a range of optogenetic and behavioural observations. Finally, we use our model to introduce a systems consolidation theory in which task information is gradually transferred from the cerebellum to the cortex. In summary, our results suggest that cortico-cerebellar loops are critical for task acquisition, switching and consolidation in the brain.

## Introduction

Learning to interact with the environment requires a continuous integration of fast-changing sensory cues with future behavioural outcomes. Growing evidence suggests that cortical dynamics integrate the task-specific information that is needed for such sensory-behavioural transformations^{1–5}. One dominant view in the field assumes that cortical networks are themselves learnt or optimised, leading to the rich dynamics required for task performance^{6–8}. However, to help ensure that task-encoding remains stable, cortical plasticity must be kept under control and relatively weak^{9,10}. This raises the question of how the cortex can acquire new task-specific dynamics in the presence of relatively fixed connectivity.

One possible solution is to consider feedback loops that drive cortical dynamics^{11}. Computational studies have extended recurrent neural network (RNN) models of cortical networks (Fig. 1a) to incorporate feedback loops for task acquisition. One type of feedback loop drives RNN dynamics by projecting the readout back to the RNN^{12–14} (Fig. 1b). Building on this line of work, two recent theoretical studies have suggested that thalamo-cortical feedback can both prepare and control RNN dynamics to achieve flexible motor sequencing^{15,16}. All of these studies assume connectivity within the RNN itself remains fixed, thereby avoiding complex learning rules while being able to reuse RNN dynamics for different contexts^{17}. However, these approaches either assume a relatively simple feedback (i.e. a linear combination of RNN activity) or rely on theoretically optimal, but biologically implausible, derivations for the feedback signal. In particular, the possible role of more powerful, highly adaptable brain regions is often overlooked.

Here we focus on the feedback loop between two key brain regions, the cortex and *the cerebellum*. The cerebellum is a highly plastic system and is well placed to drive cortical dynamics via a set of stereotypical, but functionally separable, cortico-cerebellar loops^{18,19}. Indeed, an ever-growing array of clinical^{20}, functional imaging^{21,22}, and optogenetic^{23–25} studies support an important cerebellar contribution to cortical activity in both motor and non-motor domains. Recently, two schools of thought have emerged on the computational role of cortico-cerebellar loops. The first asserts that the cerebellum reinforces cortical-dependent goal-directed behaviour by appropriately steering or stabilising cortical states in real-time^{26,27}. The second also promotes the cerebellum as a facilitator of goal-directed cortical transitions, but holds that it does so indirectly via teaching signals which lead to cortical plasticity^{28,29}. Whilst these two views may co-exist, it is the former that is well placed to operate under weakly plastic cortical networks. Moreover, the cerebellum acting as an instantaneous driver of cortical dynamics is in line with the fast activity-dependent cortico-cerebellar interactions that have been observed experimentally^{23–25}.

Here we put forward a computational framework in which the cerebellum learns to rapidly steer and stabilise task-dependent cortical dynamics. We test this model on a variety of motor and non-motor tasks, proposing that the cerebellum is optimised to support task acquisition in the cortex. This reduces the burden of learning in cortical networks and allows a given cortical area to rapidly switch between different tasks. In line with this, we show that a strong cortical dependence on cerebellar feedback arises after learning, consistent with recent behavioural and optogenetic experiments. Finally, we use this model to put forward a cerebellar-to-cortical systems consolidation theory, in which quickly learnt task-specific information encoded by the cerebellum is gradually transferred to the cortex. Overall, we introduce a computationally and experimentally supported theory for cerebellar-supported task acquisition, switching and consolidation in the brain.

## Results

### A theory of cerebellar-driven cortical dynamics for task acquisition

To study the role that cerebellar feedback can have in driving cortical dynamics during task acquisition, we explore different variants of cortical RNNs: without feedback (Fig. 1a), with readout feedback (Fig. 1b)^{14,30} and with feedback provided by a cortico-cerebellar loop (Fig. 1c). We introduce a model of cortico-cerebellar loops, in which a *cortical* RNN is reciprocally connected to a feedforward *cerebellar* network. In our model, temporal RNN representations **h**_{t} are passed onto the cerebellar network to compute task-specific predictions **c**_{t}, which are then sent back to the same cortical RNN. The final model output **z**_{t} is then a linear readout of the RNN activity
where *α* denotes the cortical internal memory (or leak) of the RNN neurons and *f*(*x*) is the cortical activation function, which is set as tanh(*x*). The remaining weight matrices in Eq. 1 are the recurrent, input, and cerebellar weights onto the RNN, and *W*_{rdt} are the readout weights (see Extended Data Fig. S1 for a detailed schematic). For computational efficiency, and due to the relatively long duration of the tasks, we train our model using a discrete approximation of a continuous RNN (see Methods). To highlight the need for optimised network connectivity rather than inherent cortical memory mechanisms, in our experiments we generally focus on small *α* = 0.1 (see Methods).

The cerebellar feedback **c**_{t} is a feedforward computation on the previous RNN activity
where *W*_{MF} represent the cerebellar (input) mossy fibre (MF) weights onto granule cells (GC) and *W*_{PF} the parallel fibre (PF) weights from GC to Purkinje cells (PC). Together, these constitute the main stages of processing in the cerebellar cortex^{31}. In general we model *W*_{MF} as highly divergent with an input/output ratio of 1:20 (see Methods) and the granule cell activation function as a rectified linear unit (ReLU), in line with the large numbers of cerebellar GCs and their relatively linear processing^{32}.
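The loop described above can be sketched numerically. This is a minimal sketch under stated assumptions, not the implementation used in this study: the update equations are not reproduced in the text, so the form `h_t = alpha * h_{t-1} + W_hh f(h_{t-1}) + W_ih x_t + W_ch c_t`, the readout `z_t = W_rdt f(h_t)`, the names `W_hh`, `W_ch`, `cerebellum` and `step`, and all network sizes are hypothetical. Only *α* = 0.1, the tanh and ReLU nonlinearities, and the 1:20 expansion are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; only the 1:20 MF-to-GC expansion comes from the text.
n_in, n_rnn, n_out = 3, 10, 2
n_gc = 20 * n_rnn
alpha = 0.1  # cortical leak (internal memory), as in the text

# Assumed weight names and random initialisation (illustrative only).
W_hh = rng.normal(0, 1 / np.sqrt(n_rnn), (n_rnn, n_rnn))   # recurrent
W_ih = rng.normal(0, 1 / np.sqrt(n_in), (n_rnn, n_in))     # sensory input
W_ch = rng.normal(0, 1 / np.sqrt(n_out), (n_rnn, n_out))   # cerebellar feedback
W_rdt = rng.normal(0, 1 / np.sqrt(n_rnn), (n_out, n_rnn))  # readout
W_mf = rng.normal(0, 1 / np.sqrt(n_rnn), (n_gc, n_rnn))    # mossy fibres (fixed)
W_pf = np.zeros((n_out, n_gc))  # parallel fibres (plastic; zero before learning)


def cerebellum(h_prev):
    """Feedforward cerebellar prediction c_t from the previous RNN state."""
    gc = np.maximum(0.0, W_mf @ h_prev)  # ReLU granule cell layer
    return W_pf @ gc


def step(h_prev, x_t, ablate=False):
    """One assumed discrete-time update of the cortical RNN."""
    c_t = np.zeros(n_out) if ablate else cerebellum(h_prev)
    h_t = alpha * h_prev + W_hh @ np.tanh(h_prev) + W_ih @ x_t + W_ch @ c_t
    z_t = W_rdt @ np.tanh(h_t)  # linear readout of RNN activity
    return h_t, z_t, c_t


h = np.zeros(n_rnn)
for t in range(5):
    x = np.ones(n_in) if t == 0 else np.zeros(n_in)  # cue only at the first step
    h, z, c = step(h, x)
```

Because `W_pf` starts at zero in this sketch, the cerebellar output is silent until the parallel fibre weights are trained; the cortical weights can be left untouched to mimic the fixed RNN case.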

We use biologically plausible gradient descent^{33} to optimise cortical weights during the acquisition of a given task (Eq. 1). In particular, we minimise a temporal error given by a task error function that compares the model output **z**_{t} with the desired task outcome **y**_{t} at time *t* (see Methods). These weights can all be optimised simultaneously during learning; we refer to this case as *fully plastic*. However, a key idea that we put forward in this study is that it is not the neocortex, but in fact the cerebellum, which acts as a key driver for task acquisition. For this reason we highlight the case in which RNN plasticity is constrained. In particular, we focus on conditions in which RNN plasticity is either absent (the *fixed RNN* case) or strictly limited to its input synapses (i.e. only input weights such as *W*_{ih} in Eq. 1 are plastic; the *input plastic* case). The latter case considers plasticity at both sensory and cerebello-cortical inputs during task acquisition, in line with experimental observations showing plasticity at cerebellar pathways to the cortex^{34,35}.

In contrast to cortical learning, the cerebellum is always optimised, through a separate but related cerebellar error, by adjusting its parallel fibre weights *W*_{PF} (we assume mossy fibre weights to be fixed, in line with experimental observations). Like the cortical prediction error, the cerebellar error function depends on the desired outcome **y**. Importantly, to enable the cerebellum to learn to estimate future outcomes (**c**_{t} ≈ **y**_{t+τ}) we formulate a temporal cerebellar learning rule. In this rule the cerebellum learns by comparing its own past output (within a time window *τ*) with future desired outcomes, i.e. **c**_{t−τ} is compared with **y**_{t}. This plasticity rule is consistent with time-dependent plasticity rules observed experimentally at parallel fibre synapses^{36}. For our motor-based tasks we generally consider a cerebellar time window of *τ* ≈ 150 ms^{36}, and for the later cognitive tasks we use longer windows of *τ* ≈ 600 ms (see Methods).
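The temporal rule can be illustrated with a short sketch in which the cerebellar output produced *τ* earlier is compared with the current desired outcome. The squared-error loss, the delta-rule form of the update, the learning rate `lr`, the function names, and the expression of *τ* in timesteps (`tau_steps`) are all assumptions made for illustration; the rule used in this study is specified in Methods.

```python
import numpy as np


def timed_pf_update(W_pf, gc_history, y, tau_steps, lr=0.01):
    """Delta-rule update of parallel fibre weights with a temporal offset.

    gc_history: (T, n_gc) granule cell activity over a trial
    y:          (T, n_out) desired outcomes
    tau_steps:  prediction horizon in timesteps (tau / dt)
    """
    T = y.shape[0]
    dW = np.zeros_like(W_pf)
    for t in range(tau_steps, T):
        gc_past = gc_history[t - tau_steps]
        c_past = W_pf @ gc_past          # cerebellar output tau steps ago
        err = c_past - y[t]              # compared with the later desired outcome
        dW += np.outer(err, gc_past)     # gradient of 0.5 * ||err||^2
    return W_pf - lr * dW / max(T - tau_steps, 1)


def horizon_loss(W_pf, gc_history, y, tau_steps):
    """Mean squared error between c_{t - tau} and y_t."""
    pred = gc_history[:-tau_steps] @ W_pf.T
    return float(np.mean((pred - y[tau_steps:]) ** 2))


# Toy data (hypothetical): rectified random granule activity and random targets.
rng = np.random.default_rng(1)
T, n_gc, n_out, tau_steps = 50, 40, 2, 3
gc_history = np.maximum(0.0, rng.normal(size=(T, n_gc)))
y = rng.normal(size=(T, n_out))

W_pf = np.zeros((n_out, n_gc))
loss_before = horizon_loss(W_pf, gc_history, y, tau_steps)
for _ in range(200):
    W_pf = timed_pf_update(W_pf, gc_history, y, tau_steps)
loss_after = horizon_loss(W_pf, gc_history, y, tau_steps)
```

Only `W_pf` is updated here, which matches the assumption in the text that mossy fibre weights stay fixed.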

### Cerebellum learns to drive cortical dynamics during a line drawing task

To study the functional consequences of cortico-cerebellar loops we first test the model in a motor-based line drawing task. In this task the model receives one out of six cues at the beginning of the task and learns to either remain still or produce one out of five possible straight lines (Fig. 2a; see Methods). Feedback in the form of desired outcomes (i.e. straight lines) is provided at each timestep. Consistent with behavioural studies on cerebellar patients^{37}, we find that cerebellar feedback significantly improves learning of the task and final performance (Fig. 2a,b). The ability of cerebellar feedback to facilitate learning does not depend on the degree of plasticity and internal memory in the cortical RNN (Fig. 2c). Interestingly, a fixed RNN with a plastic cerebellum achieves the same learning performance as a fully plastic or input plastic RNN. In contrast, when no feedback or a simple readout feedback is provided, the network can fail to learn the task due to the fading memory properties of RNNs (Fig. 2b,c). Classical cerebellar models posit that the cerebellum can act as a direct controller of motor tasks^{31}. To contrast this view with our model we also train an RNN with a direct cerebellar readout, which apart from the cortico-cerebellar feedback weights uses the same free network parameters, and find it insufficient to learn the task (Extended Data Figs. S1 and S2).

Next, we study how two known cerebellar features, (i) a large number of granule cells and (ii) timed plasticity rules, contribute to task proficiency^{36}. We find that a combination of high numbers of granule cells with a learning rule with a non-zero temporal horizon, *τ*, results in better cerebellar learning (Extended Data Fig. S3), which in turn drives better cortical representations and overall task performance (Fig. 2d and Extended Data Figs. S3,S4). Moreover, because both the cortical RNN readout and cerebellar network are trained on the same desired outcome, we observe that cerebellar output effectively predicts cortical readout *τ* ms ahead (Fig. 2e). Our model thus provides a theory of how the cerebellum learns to predict upcoming movements^{38,39}.

The advantage of a large number of granule cells has been well studied and is likely due to better linear separability of its inputs^{40}. However, what are the computational advantages of the cerebellum providing the cortical RNN with expected future outcomes? Due to the fading memory properties of RNNs, sensory cues are rapidly forgotten. Therefore a high cerebellar *τ* gives the cerebellar network the ability to map RNN activity to desired outcomes early on in the task. Consistent with this, we find that the predictive cerebellar output drives outcome-dependent RNN representations (Fig. 2f,g). This result showing potent initial drive of cortical activity could provide a justification for the observed role of the cerebellum in movement initiation^{41,42}.

Finally, to directly examine the role of cerebellar feedback on cortical dynamics, we inhibit – or “ablate” – cerebellar output (i.e. we set **c**_{t} = 0 in Eq. 1) during different stages of the task. In each case we observe a significant impairment in the model output, which returns to baseline (Fig. 2h-j). Moreover, this effect is most detrimental to task performance when ablation occurs at the start (Fig. 2k). These findings are consistent with the observed freezing effect of cerebellar lesions on gait^{43}. In line with both cortical and cerebellar networks working jointly to perform the task, we find that when the RNN is fully plastic, cerebellar ablations have a significant but reduced impact on the cortical dynamics (Fig. 2k and Extended Data Fig. S5). We also observe that the cortical RNN is particularly sensitive to the presence of noise in cerebellar output. Adding noise to the cerebellar output leads to irregular behaviour (Fig. S6), in line with the classical motor symptoms of cerebellar ataxia^{44}.
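A closed-loop ablation of this kind can be sketched as follows. This is an illustrative toy, assuming the same hypothetical update form as elsewhere (`h_t = alpha * h_{t-1} + W_hh f(h_{t-1}) + W_ih x_t + W_ch c_t`); the function `run_trial`, the random weights standing in for a trained network, and the ablation window are assumptions, not the trained models analysed here.

```python
import numpy as np


def run_trial(x, params, ablate=None, alpha=0.1):
    """Simulate the loop; ablate=(start, stop) silences c_t on those timesteps."""
    W_hh, W_ih, W_ch, W_mf, W_pf = params
    h = np.zeros(W_hh.shape[0])
    states = []
    for t, x_t in enumerate(x):
        c_t = W_pf @ np.maximum(0.0, W_mf @ h)
        if ablate is not None and ablate[0] <= t < ablate[1]:
            c_t = np.zeros_like(c_t)  # cerebellar output set to zero
        h = alpha * h + W_hh @ np.tanh(h) + W_ih @ x_t + W_ch @ c_t
        states.append(h)
    return np.array(states)


rng = np.random.default_rng(0)
n_in, n_rnn, n_out, T = 2, 8, 2, 20
n_gc = 20 * n_rnn
params = (
    rng.normal(0, 1 / np.sqrt(n_rnn), (n_rnn, n_rnn)),  # W_hh
    rng.normal(0, 1.0, (n_rnn, n_in)),                  # W_ih
    rng.normal(0, 1 / np.sqrt(n_out), (n_rnn, n_out)),  # W_ch
    rng.normal(0, 1 / np.sqrt(n_rnn), (n_gc, n_rnn)),   # W_mf
    rng.normal(0, 0.5 / np.sqrt(n_gc), (n_out, n_gc)),  # W_pf (stand-in for learnt weights)
)
x = np.zeros((T, n_in))
x[0, 0] = 1.0  # cue at trial onset

control = run_trial(x, params)
ablated = run_trial(x, params, ablate=(10, 15))
divergence = np.abs(control - ablated).max(axis=1)  # per-timestep deviation
```

Because the cerebellar output depends on the RNN state, silencing it changes all subsequent cortical states, so `divergence` is zero before the ablation onset and non-zero from the onset onwards, including after the window ends.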

Taken together, this motor-based task highlights the computational benefits of training a cerebellar network to drive cortical dynamics, predicting that the cortex can critically depend on cerebellar feedback for successful task execution. Furthermore, we demonstrate that cerebellar plasticity can effectively replace the need for local cortical plasticity.

### Cerebellar-mediated task switching in cortical networks

We have shown that cortico-cerebellar loops can enable successful task learning with minimal cortical plasticity. This opens the possibility of reusing cortical networks across different contexts and behaviours.

To demonstrate the model’s ability to adapt and perform context-dependent task switching, we consider how models trained in the line-drawing task can be retrained to a curl-field variant^{45}. In particular, we analyse how the cerebellar network can (i) successfully enable learning in a new task context and also (ii) rapidly revert, or *switch*, to a previously learned context.

As expected, when the new task context is introduced to the model, there is a steep increase in error before the model successfully learns the new task (Fig. 3a, left and middle). Notably, however, when the original task is reintroduced, the fixed RNN model recovers the initial dynamics significantly faster than the fully plastic model and more faithfully captures the behavioural data from macaque monkeys^{45} (Fig. 3a, right). This relatively slow switching back suggests that the fully plastic RNN is more prone to forgetting the original task^{9}.

We then asked how the cerebellar network might enable even faster task switching. In line with observed context-dependent activations^{46} and plasticity rules^{47} in the cerebellum, we consider cerebellar PFs which are task-specific. The extent of task-specificity at PFs is modelled by the *PF task overlap;* full overlap (100%) would imply that the same exact PFs are used across task contexts, while zero overlap (0%) implies that a completely different set of PFs is used for each task respectively.
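One way to picture PF task overlap is as a pair of boolean masks over the parallel fibres, one per task context. The sketch below is a hypothetical parametrisation: it assumes each task uses half of the available PFs (so that zero overlap is always feasible) and that plasticity for a task is restricted to its own mask. The function name `pf_task_masks` and these choices are assumptions, not the construction given in Methods.

```python
import numpy as np


def pf_task_masks(n_pf, overlap, rng):
    """Boolean PF masks for two tasks sharing a fraction `overlap` of their PFs.

    Each task is assumed to use k = n_pf // 2 parallel fibres.
    """
    k = n_pf // 2
    n_shared = int(round(overlap * k))
    perm = rng.permutation(n_pf)
    shared = perm[:n_shared]
    only_a = perm[n_shared:k]
    only_b = perm[k:2 * k - n_shared]
    mask_a = np.zeros(n_pf, dtype=bool)
    mask_b = np.zeros(n_pf, dtype=bool)
    mask_a[shared] = True
    mask_a[only_a] = True
    mask_b[shared] = True
    mask_b[only_b] = True
    return mask_a, mask_b


rng = np.random.default_rng(0)
mask_a, mask_b = pf_task_masks(200, overlap=0.0, rng=rng)

# Task-specific plasticity: only PFs in the active task's mask are updated.
W_pf = np.zeros((2, 200))
grad = np.ones_like(W_pf)  # placeholder gradient for illustration
lr = 0.1
W_pf[:, mask_b] -= lr * grad[:, mask_b]
```

With zero overlap, learning the second task leaves the weights used by the first task untouched, which is the property that permits immediate switching back in this toy picture.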

Our results show that the degree of PF task overlap predicts a tradeoff between the speed of learning the new task and the ability to rapidly switch back to the original task (Fig. 3b). Specifically, whilst maximal PF task overlap is beneficial when a new task is introduced, rapid switching is favoured when distinct PFs are used. To highlight the ability to immediately switch back to the original task (zero-shot switch) we focus on the zero-overlap case. For the fixed RNN, but not the fully plastic RNN, the model achieves near-perfect switching to the original task (Fig. 3c,d). Consistent with the need to learn a new task, all models show a substantial change in the neuronal activity (Fig. 3e and Extended Data Fig. S7a). However, we expect that models with minimal local cortical plasticity should result in minimal changes in the underlying dynamics of both tasks. To test this, we measure changes in the covariance of the neuronal activity between the new task and the initial task (see Methods)^{48}. As predicted, only the models with reduced cortical plasticity show the minimal changes observed experimentally (Fig. 3f and Extended Data Fig. S7b). On the other hand, in the fully plastic model the dynamics acquired after switching back to the initial task are significantly different to baseline (Extended Data Fig. S7c,d). This suggests that the fully plastic model learns a new solution to the initial task, explaining its relative slowness in switching.

Overall, we apply our models to demonstrate a cerebellar-driven solution to multi-task learning and task switching. We show that the underlying dynamics preserved by a fixed cortical RNN, supported by context-dependent cerebellar feedback, can support rapid behavioural changes whilst minimising forgetting of previously acquired task knowledge.

### Cerebellar temporal basis supports non-linear drawing task

Above we have modelled a case in which the cerebellum learns to drive cortical dynamics using a specific predictive time-window (namely *τ* = 150ms). However, a recent study has revealed a diversity of temporal plasticity windows to be at play in the cerebellum^{36,49} (Fig. 4a). Such diversity of temporal windows may enable the cerebellum to learn a *temporal basis* for upcoming events, which may enhance the cerebellum’s ability to predict future outcomes.

To demonstrate the benefit of diversity in temporal windows we consider a more realistic (and challenging) variant of the line-drawing task in which the model is now trained to produce a digit-like output (Fig. 4b; see Methods). This task is selected so as to produce a non-linear and highly varied set of future desired outcomes and therefore the need for richer cerebellar predictions. In particular, we consider a cerebellar network which simultaneously learns with a range, or “temporal basis”, of time-windows *τ* ∈ [0ms, 250ms] such that its prediction effectively spans a relatively long window of upcoming desired outcomes (see Methods).
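The idea of a temporal basis can be sketched by building, for every timestep, a stack of future desired outcomes at several horizons. The helper name `temporal_basis_targets`, the timestep `dt_ms`, and the particular horizons are hypothetical; only the range *τ* ∈ [0 ms, 250 ms] is taken from the text.

```python
import numpy as np


def temporal_basis_targets(y, taus_ms, dt_ms):
    """Stack future outcomes y_{t+tau} for each horizon tau.

    Returns targets (T, K, n_out) and a validity mask (T, K) that is False
    where t + tau runs past the end of the trial.
    """
    T, n_out = y.shape
    shifts = [int(round(tau / dt_ms)) for tau in taus_ms]
    targets = np.zeros((T, len(shifts), n_out))
    valid = np.zeros((T, len(shifts)), dtype=bool)
    for k, s in enumerate(shifts):
        if s >= T:
            continue  # horizon longer than the trial: no valid targets
        targets[: T - s, k] = y[s:]
        valid[: T - s, k] = True
    return targets, valid


# Toy trajectory: y_t = t, sampled every 50 ms (assumed timestep).
y = np.arange(10, dtype=float).reshape(-1, 1)
targets, valid = temporal_basis_targets(y, taus_ms=[0, 50, 150, 250], dt_ms=50)
```

Each cerebellar output unit (or group of units) can then be trained against one slice `targets[:, k]`, with `valid` masking out timesteps for which no future outcome exists.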

We find this heterogeneity of cerebellar time windows to enable both faster learning and higher performance thresholds (Fig. 4b,c and Extended Data Fig. S8). As expected, when considering the simpler line-drawing task, having multiple time windows does not improve learning (Fig. S8c). Moreover, in line with the results above, a fixed RNN achieves a performance comparable to the plastic RNN models across different degrees of internal memory in the cortical network (Fig. 4d). When comparing the network performance across different numbers of granule cells and time-windows, we find that higher numbers of granule cells combined with multiple time-window learning achieve the best average learning performance (Fig. 4e). Finally, as with the simpler line-drawing task, we find that cerebellar ablation is detrimental to the maintenance and development of these representations (Fig. 4f-h and Extended Data Fig. S9).

These results suggest that the diversity of cerebellar temporal windows observed experimentally^{36,49} can be beneficial in the presence of more challenging task conditions.

### Cerebellar-driven cortical dynamics maintains beliefs in an evidence accumulation task

So far we have focused purely on motor-based tasks, but growing evidence strongly suggests that the cerebellum also plays important roles in functions that go beyond direct motor control^{19,50}. To demonstrate this we model an evidence accumulation task that has been shown to be cerebellar-dependent^{24}. In this study Deverett et al.^{24} showed that optogenetic inhibition of the cerebellar output nuclei disrupts the ability of mice to determine whether the left or right cheek received more air puffs over a period of time (Fig. 5a). Unlike the previous tasks, here the desired outcome is only provided at the end of the task, making error-related signals highly sparse.
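The sparse structure of this task can be made concrete with a toy trial generator. This is a minimal sketch under stated assumptions: the Bernoulli puff probabilities, trial length, tie handling, and the names `make_trial` and `loss_mask` are hypothetical and are not the task parameters given in Methods.

```python
import numpy as np


def make_trial(T, p_left, p_right, rng):
    """Sample one trial: puff trains on two channels, target given only at the end."""
    while True:
        puffs = np.stack(
            [rng.random(T) < p_left, rng.random(T) < p_right], axis=1
        )
        n_left, n_right = puffs.sum(axis=0)
        if n_left != n_right:  # resample ties so the label is well defined
            break
    label = 0 if n_left > n_right else 1  # 0 = left, 1 = right
    loss_mask = np.zeros(T)
    loss_mask[-1] = 1.0  # error signal only at the final timestep
    return puffs.astype(float), label, loss_mask


rng = np.random.default_rng(0)
puffs, label, loss_mask = make_trial(T=40, p_left=0.3, p_right=0.15, rng=rng)
```

Multiplying a per-timestep error by `loss_mask` leaves a single non-zero teaching signal per trial, which is what makes the error information sparse in this setting.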

Similar to the motor tasks studied above, cerebellar feedback improves task learning relative to models without feedback or with readout feedback (Fig. 5b). Moreover, a fixed RNN achieves performance comparable or even superior to the fully plastic models across a range of degrees of cortical internal memory (Fig. 5c and Extended Data Fig. S10). These results suggest that weakly plastic cortical networks driven by the cerebellum may also be sufficient for learning cognitive-based tasks with sparse error information.

Next, our ablation analysis reveals strong similarities to the optogenetic observations by Deverett et al.^{24}. In particular, cerebellar ablation greatly impairs the model’s capacity to maintain and develop beliefs, mirroring the behavioural effects observed experimentally (Fig. 5d and Extended Data Fig. S11). The general trend observed experimentally – that later cerebellar ablation is particularly detrimental to task performance – is also captured by our model (Fig. 5e). Indeed, using the same behavioural regression performed by Deverett et al.^{24} (see Methods), we show that cerebellar ablation in later periods leads to a final choice in which information about previously seen inputs is greatly reduced (Fig. 5f), in line with experimental findings. We further quantified this inability to integrate prior input by analysing performance specifically for “history-centric” trials, which rely on remembering the history of input in the first third of the cue presentation period (see Methods). As expected, late cerebellar ablation is particularly detrimental for these trials (Extended Data Fig. S11).

Overall, our model predicts that the proper maintenance of model selectivity depends critically on cerebellar feedback during evidence accumulation. Consistent with behavioural results, these effects are emphasised when cerebellar ablation occurs in the later stages of the task.

### Cerebellar feedback sustains cortical dynamics in a delayed association task

Next we aim to demonstrate that cerebellar networks can also effectively drive cortical dynamics in tasks with long delay periods, while capturing both neuronal and behavioural observations. To achieve this we model a delayed association task which was recently shown to depend on cortico-cerebellar loops^{23}. In this study mice were presented with one of two stimuli (left or right) followed by a delay period, after which they were trained to lick in the corresponding direction (Fig. 6a, top). At the same time neural selectivity was recorded both in the anterior lateral motor cortex (ALM), a working memory and planning region, and in the cerebellar output nuclei (Fig. 6a, bottom). Timed photoinhibition was used to reveal ALM selectivity to strongly depend on the cerebellar output nuclei, and vice versa.

To model this task we follow the same protocol used experimentally^{23}, where one of two possible cues is presented followed by a delay period, after which the model makes a cue-based response (left or right; see Methods). Given the lack of sensory or teaching information during the delay period, it is particularly vital in this task for the cortico-cerebellar network to sustain stimulus representations. It is important to note that a standard randomly initialised RNN is unlikely to achieve this property, since memories of previous inputs naturally fade^{17} in the absence of task-induced plasticity.

We observe that cerebellar feedback consistently enables task acquisition (Extended Data Fig. S12), and identify a particularly interesting case when plasticity in the RNN is limited strictly to its input synapses (input plastic). In this case cerebellar feedback significantly improves cortical learning to reach near-perfect performance, whilst also enabling a high degree of stability in task selectivity throughout the delay period (Fig. 6b-d and Extended Data Fig. S12). We speculated that for this task input plasticity is particularly important, because the cerebellum is required to sustain task-specific predictions in the RNN throughout the entire delay period. We verified this stronger cerebello-cortical drive by using concepts from control theory^{51}. In particular, we can explicitly relate cerebello-cortical optimisation to a quantitative increase in the impact, or *energy*, of cerebellar feedback onto RNN activity (Extended Data Fig. S13; see Methods). Moreover, the ability of the cerebellum to drive cortical dynamics should depend on the cortical network’s ability to express those dynamics. In line with this view our results show that (even untrained) cortical recurrent weights are important in maintaining cerebellar predictions over time (Fig. S14).

Next, to demonstrate that the cerebellum helps drive task-specific dynamics in the cortical RNN we performed a simulated ablation in which the cerebellum is transiently removed during the delay period. We find that cerebellar ablation drastically disrupts cortical task selectivity, whilst cerebellar selectivity is also highly disrupted by cortical ablation (Fig. 6d-f and Extended Data Fig. S15). These ablation results are consistent with changes in cortical dynamics observed during optogenetic-mediated ablation^{23} (Fig. 6d, bottom and Extended Data Fig. S15a, bottom). Our model predicts that this effect should depend on the degree of plasticity in the cortical RNN. Consistent with this view in the presence of a fully plastic RNN we do not observe the strong cerebellar dependence that was observed experimentally (Fig. 6f and Extended Data Fig. S16). Taken together our results suggest that the cerebellum, not the neocortex, is the primary site of learning during the acquisition of this working memory task^{23}.

Overall, these results demonstrate that our model can capture working memory tasks and the observed dependency of cortical dynamics on cerebellar input. Moreover, our model makes the prediction that the cerebellum is a key site of plasticity during acquisition of delayed association tasks.

### Computational power of cerebellum can decode fading cortical memories

A prevalent feature in classical cerebellar theories is that the divergence provided by the granular layer enables better separation of its inputs^{52}. This is particularly important for cases in which input activity shows little variance across conditions and the original task-related signal is only weakly encoded, for which a large dimensional expansion at the granular cell layer would help to recover the original task-related signal^{53}.

In the delayed association task, for which memory of the initial stimulus quickly fades with time in the untrained RNN, we observe that as the delay period increases so does the need for more granule cells (Fig. 7a). Indeed, by considering the cerebellar network as a direct readout of the RNN, we observe an inverse relationship between the signal-to-noise ratio (SNR) of the RNN activity and the required number of granule cells to decode the signal (Fig. 7b). We formalise this relationship analytically, and show under reasonable assumptions that the required number of granule cells for distinct stimulus encodings is approximately , where and *θ* ∈ [0, 1) denotes the level of confidence (see Methods). We find this relationship consistent with our simulations (Fig. 7c) and (as expected) that these distinct population responses enable successful downstream discrimination of the initial stimulus (Fig. S17). With the cerebellar granular layer consisting of vast numbers of cells – > 50 billion in the human brain^{54} – receiving distinct combinations of input, it is therefore well placed to decode the potentially rapidly fading memories in cortical networks. This suggests a key cerebellar role in decoding overlapping, long-lasting temporal cortical dynamics, and supports experimental findings showing that cerebellar inhibition is more detrimental for learning when in the presence of long delay periods^{55}.

### Cerebellar task knowledge can be consolidated in the cortex

In each of the previous tasks, cerebellar feedback is shown to mediate learning and the maintenance of task-specific cortical dynamics. However, the neocortex is known to encode long-term representations of tasks^{56}. This suggests a need for a “consolidation” period, during which the memory stored in the cerebellum may be transferred to cortical areas.

To demonstrate cerebellar-to-cortical systems consolidation in our model we develop consolidation-specific learning rules. To achieve consolidation we train cortical recurrent weights to mimic cerebellar input (see Methods). Given that (i) the RNN state **h** is of significantly higher dimensionality than the cerebellar output **c**, and that (ii) **c** is a function of **h** (Eq. 2), several solutions that enable the RNN to capture the cerebellar input should exist. We also gradually decay the cerebellar-to-cortical input weights so that over training the cerebellum stops driving the cortical network, thereby giving full control of the task to the cortical RNN (Fig. 8a).

We tested this computational theory of consolidation on the cortico-cerebellar models (input plastic condition) trained on the previous delayed association task (Fig. 6d, top left). We consider two types of learning rule: (i) a simple biologically plausible rule, which depends on the ratio of cerebellar-to-cortical input and total RNN activity and (ii) a theoretically optimal rule based on a least squares solution (see Methods). In both cases, we observe that the RNNs gradually learn to perform the task without the need for cerebellar input (Fig. 8b-d). Indeed, as expected, we observe that the newly learned recurrent weights faithfully replicate the cerebellar input of the original network (Fig. 8e). In addition, we find that a model with fixed RNN connectivity does not perform as well as the input plastic condition (Extended Data Fig. S18). This is likely due to better network stability in the input plastic condition compared to the purely fixed RNN (Extended Data Fig. S16a,b).
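A least squares view of consolidation can be sketched as follows: find a change in recurrent weights whose contribution to each RNN unit best matches the input that unit previously received from the cerebellum. This is an illustrative sketch only. It assumes the cerebellar drive enters the RNN additively as `W_ch @ c_t` and that recurrent input is `W_hh @ tanh(h_{t-1})`; the function `consolidate_least_squares`, the weight names, and the toy data are hypothetical, and the rule actually used is given in Methods.

```python
import numpy as np


def consolidate_least_squares(W_hh, W_ch, H_prev, C):
    """Fold cerebellar drive into the recurrent weights by least squares.

    H_prev: (T, n_rnn) RNN states h_{t-1};  C: (T, n_out) cerebellar outputs c_t.
    Finds dW minimising ||tanh(H_prev) @ dW.T - C @ W_ch.T||^2.
    """
    F = np.tanh(H_prev)  # presynaptic recurrent activity
    U = C @ W_ch.T       # cerebellar input received by each RNN unit
    dW_T, *_ = np.linalg.lstsq(F, U, rcond=None)
    return W_hh + dW_T.T


# Toy data (hypothetical): random states and a random feedforward cerebellum.
rng = np.random.default_rng(0)
n_rnn, n_out, n_gc, T = 20, 2, 400, 200
W_hh = rng.normal(0, 1 / np.sqrt(n_rnn), (n_rnn, n_rnn))
W_ch = rng.normal(0, 1.0, (n_rnn, n_out))
W_mf = rng.normal(0, 1 / np.sqrt(n_rnn), (n_gc, n_rnn))
W_pf = rng.normal(0, 1 / np.sqrt(n_gc), (n_out, n_gc))
H_prev = rng.normal(size=(T, n_rnn))
C = np.maximum(0.0, H_prev @ W_mf.T) @ W_pf.T

W_new = consolidate_least_squares(W_hh, W_ch, H_prev, C)
drive = C @ W_ch.T
resid_before = np.linalg.norm(drive)  # mismatch if the cerebellum is simply removed
resid_after = np.linalg.norm(np.tanh(H_prev) @ (W_new - W_hh).T - drive)
```

In a gradual version of this procedure, one would interleave such updates with a slow decay of `W_ch`, so that the recurrent weights take over the drive as the cerebellar input fades.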

In summary, the framework we introduce here suggests that cortico-cerebellar loops may play an important role in systems consolidation by gradually transferring the rapidly learnt cerebellar knowledge to the cortex.

## Discussion

Growing experimental evidence suggests that cortico-cerebellar loops support behaviour, but their computational roles have remained unclear. Here we have introduced a systems-level modelling framework in which a feedforward cerebellar network receives the state of a cortical RNN and provides task-specific predictions in return. In our model, cerebellar feedback facilitates learning by shaping the underlying cortical dynamics during motor and cognitive tasks in a way that is consistent with both behavioural and optogenetic studies. Our work suggests that the cerebellum is a key site of learning in the brain, allowing for rapid context-switching of cortical dynamics that underlie behaviour. We finish by introducing a theory of cerebellar-to-cortical systems consolidation, in which task-specific knowledge is gradually transferred to the cortical network.

Our model is related to previous network architectures in that it uses feedback to enhance neuronal representations and selectivity in an otherwise fixed RNN, thereby facilitating task-relevant downstream processes^{13,14}. However, in contrast with previous work, in our model feedback does not stem directly from the (cortical) readout, but from a large cerebellar network with time-predictive learning rules (Fig. 1). The resulting architecture suggests a symbiotic cortico-cerebellar learning strategy: as cerebellar representations improve through rapid learning so do cortical representations, and as cortical representations improve cerebellar decoding becomes easier. According to our model the cortical RNN relies on cerebellar feedback to maintain appropriate representations. Therefore, one can consider the cerebellum as a controller of cortical dynamics. Indeed, two recent theoretical studies have shown how the thalamus, the intermediary in cerebellar-to-cortical communication, can optimally control cortical dynamics during motor preparation and execution^{15,16}. In contrast to existing theories, our work suggests that cerebellar-thalamic-cortical pathways do not have the ability, or indeed need, to provide highly precise control of cortical dynamics. Instead our work suggests that the cerebellum can only provide the cortex with relatively general predictions which can be reused across the cortex.

By retraining cortico-cerebellar networks in a novel task we propose a key role of the cerebellum in task switching (Fig. 3). In particular, we show that cerebellar feedback may provide a solution to the problem of context-dependent adaptation, which requires (i) an ability to learn a new context but also (ii) an instant retrieval of appropriate response to previously learned contexts^{57,58}. Interestingly, we observe that while recurrent cortical plasticity enables adaptation to a new task context there is catastrophic forgetting of the original context. This is at odds with well-known behaviour in the primate, and provides a computational explanation for why cortical modifications in the monkey cortex during motor adaptation are limited^{45}.

A unifying model of the cortico-cerebellar loop, and indeed the cerebellum itself, must extend to non-motor tasks. Recent task-based fMRI studies have revealed functional diversity of the cerebellar cortex across a range of cognitive functions^{21}. Our model inherently implies a high degree of heterogeneity – it suggests that different modules would be required to drive different parts of the cortex that in turn underlie different cognitive functions. In this study we modelled recent behavioural and optogenetic experimental observations^{23,24} which directly implicate the cerebellum in supporting cortical dynamics during evidence accumulation and delayed association tasks (Figs. 5 and 6). In particular, our results show that cortico-cerebellar interactions are enough to learn tasks with highly sparse teaching signals (i.e. only at the end of the task). By combining simulations and theoretical analysis we show that this is enabled by the sheer computational power of the cerebellum (Fig. 7). Moreover, we can best capture experimental observations in conditions in which RNN plasticity is limited, making the prediction that the cerebellum is the primary site of learning for these tasks. This provides an alternative to the commonly assumed view that cortical areas are optimised for specific tasks^{6–8}.

In our model the cerebellum drives cortical dynamics based on prediction error signals that depend on the desired task outcome. In the case of the working memory tasks and in line with the experimental task setup, the desired outcome can be interpreted as a reward signal. Therefore, from this perspective, the cerebellum learns to predict future rewarding events. This is consistent with the growing literature showing that the cerebellum encodes reward-related signals^{59}. However, it remains to be tested exactly how the reward-predictive representations developed by our model compare to those found experimentally.

Here we have also introduced a theory of cerebello-cortical task consolidation. Our theory suggests that cerebellar and cortical learning may operate at different timescales: after an initial fast stage of learning driven by the cerebellum, a period of consolidation might ensue in which the cortex gradually acquires task-specific knowledge encoded in the cerebellum (Fig. 8). Such a combination of fast and gradual learning is reminiscent of recent experimental results which suggest significantly faster timescales of plasticity in the hippocampus compared to the prefrontal cortex during a cognitive task^{60}. Moreover, the consolidation period can be related to the idea that a task-optimised cerebellum can be utilised as a cortical teacher^{28,29}. It is in principle possible for cerebellar-thalamo-cortical projections to support this dual role of the cerebellum as both a driver and teacher of cortical states. Indeed, anatomical evidence suggests that this could occur by providing “driving” and “teaching” input to basal and apical dendrites of cortical pyramidal cells, respectively^{61}.

Our work highlights commonalities of cortico-cerebellar interactions in motor and cognitive tasks alike. However, it also suggests interesting differences. The first marked distinction relates to the increased significance of cerebellar-to-cortical (input) plasticity during pure working memory (Fig. 6). This is in line with recent experimental evidence showing stronger plasticity at higher-order thalamo-cortical pathways^{35}. Indeed, because of the need to sustain information during the delay period without sensory or teaching input, it is advantageous for the network to encode a point attractor-like state (see Extended Data Fig. S16, left). Cerebello-cortical plasticity^{34,35} may thus enable greater *controllability* of cerebellar feedback to push the network to these states during working memory tasks, but less so in motor-based tasks^{51} (Fig. S13).

Related to the point above, the second difference we highlight is that cerebello-cortical consolidation is more readily achieved in networks with stable dynamics (cf. Fig. 8 and Extended Data Fig. S18). We speculate that unstable network dynamics make cerebellar-to-cortical consolidation less stable. Therefore, we predict that while cerebellar-to-cortical systems consolidation might be possible for tasks which involve discrete stable representations (e.g. working memory tasks), for tasks which involve faster, more dynamic responses (as often required in the motor domain) cerebellar control is likely to be required throughout life.

Overall, our work provides a theoretical framework with which to study the function and mechanisms of cerebellar-driven cortical dynamics for rapid task acquisition, switching and consolidation.

## Author contributions

J.P. developed the computational framework with guidance from R.P.C. J.P. performed all numerical and analytical work. J.P. and R.P.C. wrote the manuscript, with contributions from P.C. R.P.C. supervised the project.

## Methods

### Model architecture and training

The complete dynamics of each model architecture that we consider (Extended Data Fig. S1; no feedback, readout feedback, cerebellar feedback, no feedback with cerebellar readout) are given in Table S1. In all of our simulations we use a recurrent neural network (RNN) with 50 time-discrete units (see section below).

Unless otherwise stated, the feedforward cerebellar network contains a single hidden layer with 1000 units (granule cells), but other hidden layer sizes are also considered (Figs. 2d and 4e). This yields a divergence from the cortical RNN to the cerebellar granular layer of 50:1000 = 1:20. The cerebellar output layer, which we interpret as Purkinje cells, on the other hand, mirrors the desired task outcome and is therefore of significantly lower dimensionality (3 in evidence accumulation task and 2 in all other tasks).

For each task simulation, network parameters are initialised as follows. The RNN input, recurrent and cerebellar feedback weights *W*_{ih}, *W*_{hh}, […] are drawn from a uniform distribution […] where […]. The readout weights *W*_{rdt} and cerebellar weights, *W*_{MF}, *W*_{PF}, are initialised as […], where *a* is normalised according to the respective number of input neurons, […]. Note that the readout weight *W*_{rdt} includes a bias term. We conducted each task simulation with 5 random seeds for initialisation, except for the theoretical analysis done in Fig. 7b,c and Extended Data Fig. S17 for which we use 10.

During the learning of a task, model parameters are updated using gradient descent on the task error signal with respect to the model parameters (see section below). For each dataset each training session covers 1000 random examples, presented to the model in batches of 10, each of which we call a “trial”. The test set (used after training) also covers 1000 randomly generated examples. When analysing the learned network dynamics (e.g. model output with and without cerebellar ablation) the model with the best validation error during training was selected. An ADAM optimiser^{62} was used with initial learning rate *η* = 0.001 for the RNN (when plastic), readout and cerebellar network, except for the delayed association task for which we found an RNN learning rate of *η* = 0.0025 to provide more stable learning. The different plasticity constraints of the entire model, termed “fixed RNN”, “input plastic”, and “fully plastic”, are defined with respect to the cortical parameters of Eq. 1 as follows. For the fixed RNN case, only the cortical readout weights *W*_{rdt} are learned. For the input plastic case, the RNN input weights *W*_{ih} and […] are also learned. Finally, for the fully plastic case, the recurrent weight *W*_{hh} is also learned. In all of these cases the cerebellar “parallel fibres” *W*_{PF} are learned, whilst the “mossy fibres” *W*_{MF} remain constant, in line with mossy fibre synapses being (relatively) stable^{31,63}.
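The three plasticity regimes can be summarised as a lookup from regime to the set of learned parameters. The following is a minimal sketch rather than the released implementation: the dictionary and function names are hypothetical, and `W_fb` is a placeholder label for the cerebellar-to-cortical feedback weights, which we assume here to be the second set of input weights learned in the “input plastic” case.

```python
# Hypothetical names throughout; "W_fb" is a placeholder for the
# cerebellar-to-cortical feedback weights (assumed to be learned together
# with W_ih in the "input plastic" regime).
ALWAYS_LEARNED = frozenset({"W_rdt", "W_PF"})  # cortical readout, parallel fibres
NEVER_LEARNED = frozenset({"W_MF"})            # mossy fibres stay constant

PLASTICITY_REGIMES = {
    "fixed RNN": ALWAYS_LEARNED,
    "input plastic": ALWAYS_LEARNED | {"W_ih", "W_fb"},
    "fully plastic": ALWAYS_LEARNED | {"W_ih", "W_fb", "W_hh"},
}


def is_trainable(regime: str, parameter: str) -> bool:
    """Return True if `parameter` is updated under the given regime."""
    return parameter in PLASTICITY_REGIMES[regime]


for regime in PLASTICITY_REGIMES:
    print(regime, sorted(PLASTICITY_REGIMES[regime]))
```

In a PyTorch model, a lookup of this kind could be used to set `requires_grad` on each parameter before constructing the optimiser.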

#### Continuous dynamics of RNN model

A continuous version of our RNN can be expressed as
where *τ*_{M} is the membrane time constant (not to be confused with the cerebellar time window *τ*), *R*_{M} is the membrane resistance, and *f* is the rate-based non-linearity which we set as *f* = tanh. Discretising Eq. 3 with timesteps of Δ*t* yields the equations in Table S1, where […]. Note that as in^{33} we ignore the (1 – *α*)*R*_{M}. This simplifies notation and has no effect on dynamics if model weights are scaled accordingly. In general we use *τ*_{M} ≈ 20ms and Δ*t* = 50ms for the drawing tasks (Figs. 2, 3 and 4) and a higher *τ*_{M} ≈ 90ms with Δ*t* = 200ms for the cognitive tasks (Figs. 5, 6, 7 and 8), in line with^{6}. In both cases this gives us a cortical internal memory *α* = 0.1.
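As an arithmetic check on the time constants quoted above, the sketch below assumes the exponential-decay form *α* = exp(−Δ*t*/*τ*_{M}). This form is an assumption rather than something stated in the surviving text; it is used here because it yields *α* ≈ 0.1 for both parameter settings. The function name is hypothetical.

```python
import math


def internal_memory(dt_ms: float, tau_m_ms: float) -> float:
    """Leak factor alpha, ASSUMING alpha = exp(-dt / tau_M)."""
    return math.exp(-dt_ms / tau_m_ms)


alpha_drawing = internal_memory(dt_ms=50.0, tau_m_ms=20.0)     # drawing tasks
alpha_cognitive = internal_memory(dt_ms=200.0, tau_m_ms=90.0)  # cognitive tasks

# Both values round to 0.1 at one decimal place.
print(round(alpha_drawing, 1), round(alpha_cognitive, 1))
```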

#### Cortical and cerebellar learning rules

When the desired task outcome **y**_{t} is provided the associated error is computed as for the cortical network and for the cerebellar network, where denotes the task error function (mean squared error and cross-entropy loss for regression and classification tasks respectively) and *τ* is the cerebellar time window. The error gradients for the readout and cerebellar weights *W*_{rdt}, *W*_{PF} can then be obtained locally with a simple deltarule on the gradient of the error signal. That is,

Where *η* denotes the learning rate of the cortico-cerebellar network and **GC** denotes the hidden granule cell activity of the cerebellar network which is computed as (cf. Eq. 2).

For the input/recurrent weights (when plastic), obtaining error gradients is more difficult as temporal dependencies need to be considered. To improve biological feasibility, in this work we avoid backpropagation through time (BPTT) and instead use the e-prop algorithm^{33}. Details can be found in^{33}, but the main idea is that BPTT can be approximated with a mixture of locally computed synaptic eligibility traces and a current learning signal. Specifically, the error gradient for a given synapse *w*_{ij} from neuron *i* to *j* is computed as […] where for ease of notation we now use the superscript to denote timestep *t* and […] is the neuron *j* learning signal (obtained by one-step backpropagation through space except for the cerebellar readout architecture in Extended Data Fig. S1 D). […] is the synaptic eligibility trace of *w*_{ij} which is defined recursively by […] where […] is initialised as zero. Note that the terms in Eq. 6 are locally available to the synapse. In the case of our network dynamics (Eq. 1), the eligibility trace is simply defined by […], where *a*_{i} is the activation of the presynaptic neuron *i* (e.g. tanh(*h*_{i}) or *c*_{i}).

For all weights, the error gradients are accumulated across multiple examples (i.e. batch update) and timesteps before the weights themselves are updated.
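The combination of eligibility trace and learning signal can be illustrated for a single synapse. This is a minimal sketch under stated assumptions, not the exact rule of Eq. 6: we assume a leaky unit whose eligibility trace is a low-pass filter of presynaptic activity with decay *α*, and that the gradient is the sum over timesteps of learning signal times trace. The time indexing used in the original equations may differ, and the function name is hypothetical.

```python
def eprop_gradient(pre_activity, learning_signal, alpha):
    """Approximate gradient for one synapse i -> j.

    pre_activity:    presynaptic activations a_i, one per timestep
    learning_signal: learning signals L_j for the postsynaptic neuron
    alpha:           leak factor of the postsynaptic unit (assumed decay)
    """
    trace = 0.0     # eligibility trace, initialised as zero
    gradient = 0.0
    for a_i, l_j in zip(pre_activity, learning_signal):
        trace = alpha * trace + a_i   # local, recursively updated trace
        gradient += l_j * trace       # accumulated across timesteps
    return gradient


# A presynaptic pulse at the first step still contributes at the third step.
grad = eprop_gradient([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], alpha=0.5)
print(grad)
```

Because the trace only depends on presynaptic activity and the postsynaptic leak, no information needs to be propagated backwards in time.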

#### Learning rules for cerebellar-to-cortical consolidation

A period of “consolidation” is considered for the trained models of the delayed association task (Fig. 8 and Extended Data Fig. S18). During this period the model is presented with further trials (batch size 10) of training data but without their associated targets. The forward dynamics of the model then run as normal (Eq. 1) but now we use a consolidation learning rule for the RNN weights. We consider both an optimal learning rule which uses the least-squares algorithm and also a simple biological learning rule.

We first present the optimal consolidation learning rule, since this motivates the biological rule. We want to change the recurrent (cortico-cortical) input to match the cerebello-cortical input over the task. To this end we concatenate the time-dependent RNN activities […] and cerebellar output activities […], where ⊕ denotes vector concatenation. We then set the change in recurrent weight Δ^{cons}*W*_{hh} with […] where […] is the RNN consolidation learning rate and *F*^{lsq} is the least-squares solution

At the same time the cerebellar-cortical weights decay according to

Where is the rate of cerebellar-cortical decay. In the experiments shown we select .

For the biological learning rule, the cerebellar-cortical weight decays as in Eq. 8 but now the RNN weights are updated according to the ratio of cerebellar feedback against the whole population activity. That is, for the recurrent weight from neuron *i* to neuron *j* we have

For arbitrary timestep *t* and where denotes the *j*th row of the cerebello-cortical weight .

To demonstrate that Eq. 9 leads to changes in cortico-cortico input which are proportional to the cerebello-cortical input, we see that the change in recurrent input to a given RNN neuron *j* at time *t* becomes

That is, we recover a solution (up to proportionality) to Eq. 7. For this biological learning rule, to improve network stability, we found it beneficial to increase the RNN consolidation learning rate such that (where Δ^{cons}*w _{ij}* is accumulated over the whole sequence). This explains the initially faster learning (over the first few trials) for the biological learning rule (Fig. 8e).

For this consolidation learning period a learning optimiser is not used (i.e. ADAM is not used). Note that these consolidation learning rules do not require information about the desired task outcome (i.e. target) and are in that sense unsupervised.

### Task details

#### 1. Line drawing task

For the line drawing task, the model has to transform one of six possible 10-dimensional binary inputs **x** ∈ [0, 1]^{10} at timestep 1 into an associative “go” 2-dimensional line **y**^{line} (for five of the inputs) or a “no-go” stay at the origin (for one of the inputs). The starting point for each line is the origin, and the endpoints of each line are evenly spaced on the edge of the unit circle (see Fig. 2a, black dashed line). The model learns to draw the line over 20 discrete timesteps, with the intermediate target points spaced evenly, i.e. for a line with endpoint y_{end} we have .
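The target trajectories described above can be generated in a few lines. This sketch makes several assumptions that are not fixed by the text: the five “go” endpoints sit at angles 2π*k*/5, the intermediate target at timestep *t* is (*t*/20)·**y**_{end}, and cue index 5 is used for the “no-go” case. The function name is hypothetical.

```python
import math

N_STEPS = 20     # timesteps per trajectory
N_GO_CUES = 5    # cues that map to a line; one further cue is "no-go"


def line_targets(cue: int):
    """Return 20 (x, y) targets for cue 0..5; cue 5 is the 'no-go' cue."""
    if cue == N_GO_CUES:
        return [(0.0, 0.0)] * N_STEPS          # stay at the origin
    angle = 2.0 * math.pi * cue / N_GO_CUES    # assumed angular placement
    end = (math.cos(angle), math.sin(angle))   # endpoint on the unit circle
    # Evenly spaced points from the origin to the endpoint (assumed t/20 scaling).
    return [(end[0] * t / N_STEPS, end[1] * t / N_STEPS)
            for t in range(1, N_STEPS + 1)]


print(line_targets(0)[-1])
```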

For the stimulus timestep (timestep 1) as well as the remaining 19 timesteps, the model receives (through its *W*_{ih} connection) zero-mean Gaussian noise […]. Model errors are computed as the mean-squared error to the target response. Unless otherwise stated a cerebellar time window *τ* = 3 timesteps (≈ 150ms when *α* = 0.1) is used. The prediction error across time delay *t*_{0} between cortical output and cerebellar (or cortical) output (Fig. 2e) is computed as the cue/time average ||**c**_{t+t_{0}} – **z**_{t}||, where ||.|| is the Euclidean norm.

To analyse the effects of cerebellar ablation we consider partial cerebellar ablation at the start, middle, and end of the sequence (Fig. 2h-k and Extended Data Fig. S5). The specific time windows of these ablation periods are timesteps [1-6, 8-13, 15-20] (inclusive), respectively.

##### Curl-field variant

Once the models of the line drawing task are trained, we tested whether they could re-translate the same external inputs to a curl-field variant of the task (see e.g.^{45}). For this we selected models with cortical internal memory *α* = 0.5, since we found this resulted in faster learning which was comparable to the presented experimental data^{45}, but we find *α* = 0.1 (as presented in Fig. 2) also learns but more slowly. Switching and learning this curl-field new task “context” involved retraining the models to new desired outcomes (central grey curves in Fig. 3c).

Specifically, the curl-field target responses have the same end-point for each line (or same “no-go” zero cue), but intermediate target points now form a semi-ellipse between the origin and the respective end-point. Given the desired endpoint , this can be parameterised by
where is the angle to the end point and *t* runs uniformly between 0 and π (or, for direction towards (*x*_{end}, *y*_{end}) as in our experiments, from π to 2π).

To test how context-dependent cerebellar processing could enable rapid task switching, we considered the extent to which parallel fibre (PF) weights are shared across task contexts. In particular, we label the percentage of PFs used for each context as the PF task overlap. For example, if the PF task overlap is 25%, then 25% of the PFs used for cerebellar processing apply to both task contexts, whilst 75% specifically apply (and are trained) to the current context. Before learning, the PFs which are not shared (i.e. only apply to the curl-field context) are initialised randomly as in the original line-drawing task.

##### Neuronal activity and covariance during task switching

The change in activities and change in covariances (Fig. 3d-f and Extended Data Fig. S7) are computed as in^{48}. We record the RNN time-dependent activities (post non-linearity) given 1000 input examples in multiple periods: task 1 baseline, task 2 and task 1 switching (Fig. 3a). For the latter two periods these are recorded at their respective end, whilst we take two samples of the baseline period at its start and end. The change in activity between any two periods *P*1 and *P*2 is the average change in activity for a given neuron *i*, which is given by
where are the time-varying input-dependent activities of neuron *i* for periods *P*1, *P*2 respectively, and std_{i} is the standard deviation of that neuron in the start of the task 1 baseline period. Here |.| denotes the average (absolute) difference in activity across timesteps and input examples.

For each period, we also compute the covariance matrix of the RNN population. The change in covariance between two sessions is then computed as 1 minus the Pearson correlation between their respective covariance matrices^{48}.
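The covariance-change measure can be sketched with NumPy as follows. The layout of the activity arrays (samples by neurons, with timesteps and input examples flattened into the sample axis) and the use of all matrix entries in the correlation are assumptions made for illustration; the function name is hypothetical.

```python
import numpy as np


def covariance_change(acts_p1: np.ndarray, acts_p2: np.ndarray) -> float:
    """1 minus the Pearson correlation between two population covariance matrices.

    acts_p1, acts_p2: arrays of shape (n_samples, n_neurons), one per period.
    """
    cov1 = np.cov(acts_p1, rowvar=False)   # neurons x neurons
    cov2 = np.cov(acts_p2, rowvar=False)
    r = np.corrcoef(cov1.ravel(), cov2.ravel())[0, 1]
    return float(1.0 - r)


rng = np.random.default_rng(0)
baseline = rng.normal(size=(200, 5))
# Identical activity gives a change of (numerically) zero.
print(covariance_change(baseline, baseline))
```

Note that a pure rescaling of activity leaves this measure unchanged, since the Pearson correlation is invariant to scale.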

For the task 2 and task 1 switching periods we report changes with respect to the start of the task 1 baseline period. To account for natural variability in the network and better compare to the neural data in^{45}, we normalise the changes by taking away the changes observed within the baseline period itself. For example, the change in covariance in the task 2 period is , where *B*1, *B*2, *T*2 are the start of the task 1 baseline, end of task 1 baseline, and (end of) task 2 respectively. We apply the same normalisation to the reported experimental changes in the monkey M1 and PmD^{45}; this normalisation leads to (average) near-zero change for the M1 activity and PmD (Fig. 3f).

#### 2. Digit drawing task

For the digit drawing task the inputs are the same as the 10-dimensional binary vectors used in the line drawing task, except now the model must draw an associative digit over 20 timesteps instead of a line (Fig. 4a). The targets **y**^{digit} are constructed manually within the space [0, 1]^{2} and resemble the digits from 0 to 5 (inclusive). For exact implementation refer to the provided code (see below).

For the standard model with cerebellar feedback a cerebellar time window *τ* = 3 timesteps (≈ 150ms when *α* = 0.1) is generally used. For the model using cerebellar feedback with a temporal basis, we model the cerebellum with a range of time windows, i.e. *τ* = {*τ*_{i}}_{i} for some distinct *τ*_{i} ≥ 0ms. In this task we consider […] with *τ*_{i} = *i* timesteps (i.e. 0-250ms), so that the final cerebellar output is a concatenation of task predictions which span the subsequent 250ms period. Explicitly, after training we have cerebellar feedback […], where ⊕ denotes vector concatenation.

Zero-mean Gaussian noise is added to the input at each timestep. Model errors are computed as the mean-squared error to the target response.

To analyse the effects of cerebellar ablation we consider the same partial cerebellar ablation periods as in the line-drawing task. That is, we consider cerebellar ablation at the start, middle, and end of the sequence (Fig. 4 and Extended Data Fig. S9), which correspond to timesteps [1-6, 8-13, 15-20] (inclusive), respectively.

#### 3. Evidence accumulation task

In the evidence accumulation task the model receives 2-dimensional binary inputs (i.e. **x** ∈ {0, 1}^{2}) over a presentation period of *T*^{pres} = 45 timesteps. A non-zero input can occur for at most one of the two dimensions; that is, *x*_{t} ∈ {(1 0)^{⊤}, (0 1)^{⊤}, (0 0)^{⊤}}, where the rate of zero inputs *x*_{t} = (0 0)^{⊤} defines the sparsity of input *ρ* (*ρ* = 0.7 in our simulations). After this presentation of input there is then a delay period of *T*^{del} = 5 timesteps after which the model must classify at which dimension more non-zero input was received (or whether the number at each dimension was the same). That is, the desired outcome *y* takes one of three values which respectively correspond to more input in the first dimension, more input in the second dimension, or the same. This task resembles the experimental structure of^{24}, in which mice were trained to select the side of their whiskers which received more air puffs.
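A single input example for this task can be generated as sketched below. The following choices are assumptions made for illustration: non-zero inputs are equally likely on either side, the delay period contains no input, and the three outcomes are encoded as 0 (first dimension), 1 (second dimension) and 2 (tie). Input noise is omitted and the function name is hypothetical.

```python
import random

T_PRES, T_DEL, RHO = 45, 5, 0.7   # presentation length, delay length, sparsity


def make_example(rng: random.Random):
    """Return (inputs, label) for one evidence accumulation example."""
    inputs = []
    for _ in range(T_PRES):
        if rng.random() < RHO:
            inputs.append((0, 0))                        # no input at this step
        else:
            inputs.append(rng.choice([(1, 0), (0, 1)]))  # assumed equiprobable
    inputs += [(0, 0)] * T_DEL                           # delay period (assumed silent)
    n_first = sum(x[0] for x in inputs)
    n_second = sum(x[1] for x in inputs)
    label = 0 if n_first > n_second else 1 if n_second > n_first else 2
    return inputs, label


inputs, label = make_example(random.Random(0))
print(len(inputs), label)
```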

Zero-mean Gaussian noise is added to the input at each timestep. Model errors are defined by the cross-entropy loss to the target response. Model “belief” (Figs. 5d and S11) is defined as the model probability (obtained by applying a softmax on the readout) of the correct classification. Unless otherwise stated a cerebellar time window *τ* = 3 timesteps (≈ 600ms when *α* = 0.1) is used. For both readout and cerebellar feedback models, we apply a softmax operation to the feedback returned to the RNN so as to bound its values between 0 and 1.

To analyse the effects of cerebellar ablation we consider full cerebellar ablation (for the entire sequence 1-50; see Fig. 5d and Extended Data Fig. S11a-c, left) and also partial periods of ablation: at the start, middle, and end of the sequence (Fig. 5e,f and Extended Data Fig. S11 a-c, right). The specific time windows of these partial ablation periods are timesteps [1-15, 15-30, 30-45] (inclusive), respectively. To improve readability of our results, the mean error presented in the training curves for this task is smoothed using a Savitzky-Golay filter with window length 25 and polynomial order 3.

To compute the dependence of model choice on inputs over different temporal bins (Fig. 5f), we follow the method in^{24}. In particular, we divide the presentation period evenly into 3 time windows - [1-15, 16-30, 31-45] - and fit the model choice according to a logistic regression model

Where […] denotes the predicted model choice probability, *S* is the sigmoid logistic function, *E*_{i} = #*R*_{i} – #*L*_{i} is the difference in the total number of ‘right’ and ‘left’ inputs in window *i*, and *β*_{i} is the respective weight on that window. […] is fitted to minimise the negative log likelihood of the observed model decisions. We present the normalised weights of each window […].

##### History-centric cases

In line with^{24}, we observed cerebellar ablation to be particularly detrimental to input examples for which correct classification would depend on adequately maintaining past inputs (Fig. 5e,f and Extended Data Fig. S11), which we refer to as “history-centric” examples. We defined an input example as being history-centric if exposure only to the final third of the input sequence would lead strictly to the wrong answer. That is, examples (**x**, *y*) such that the “final-third target” is not equal to the desired outcome .
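The history-centric criterion can be expressed as a comparison of two labels: the one implied by the final third of the presentation period alone, and the one implied by the full presentation period. The sketch below is illustrative; the label encoding (0, 1, 2 for first dimension, second dimension, tie) and the function names are assumptions.

```python
def label_from_counts(inputs):
    """Label implied by a sequence of two-dimensional binary inputs."""
    n_first = sum(x[0] for x in inputs)
    n_second = sum(x[1] for x in inputs)
    if n_first > n_second:
        return 0
    if n_second > n_first:
        return 1
    return 2


def is_history_centric(inputs, t_pres=45):
    """True if the final third alone implies a different label than the full input."""
    presentation = inputs[:t_pres]
    final_third = presentation[2 * t_pres // 3:]   # timesteps 31-45 when t_pres = 45
    return label_from_counts(final_third) != label_from_counts(presentation)


# Early evidence favours the first dimension, late evidence the second.
example = [(1, 0)] * 5 + [(0, 0)] * 25 + [(0, 1)] * 2 + [(0, 0)] * 13
print(is_history_centric(example))
```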

#### 4. Delayed association task

In the delayed association task the model must associate one of two 10-dimensional binary inputs at timestep 1 to a desired binary response *y* at timestep *T*, where *T* is the sequence length or “delay” period^{23}. We select *T* = 15 timesteps but also consider other lengths (Fig. 7a). The task error (as presented in the main text) is defined at the end of the sequence. For stability, we train the network output 5 timesteps from the end of the sequence (timestep 10 onwards when *T* = 15).

Zero-mean Gaussian noise is added to the input at each timestep. Model errors are defined by the cross-entropy loss to the target response. Model “selectivity” is defined as the model output (readout) at the dimension of the correct classification (prior to the softmax operation). Unless otherwise stated a cerebellar time window *τ* = 3 timesteps (≈ 600ms when *α* = 0.1) is used. For both readout and cerebellar feedback models, we apply a softmax operation to the feedback returned to the RNN so as to bound its values between 0 and 1.

To analyse the effects of cerebellar ablation we consider cerebellar ablation within a particular time window between timesteps 8-12 (inclusive) which approximately mirrors the timings in^{23} (Fig. 6d,e and Extended Data Fig. S16) and also partial ablation periods during the start, middle, and end of the sequence (Fig. 6f). The specific time windows of these partial ablation periods are timesteps [1-5, 6-10, 11-15], respectively. To improve readability of our results, the mean error presented in the training curves for this task is smoothed using a Savitzky-Golay filter with window length 25 and polynomial order 3.

### Control-theoretic estimation of cerebellar feedback

For the delayed association task we analyse cerebellar-to-cortical input from a control-theoretic point of view. In particular, we quantify the effect of plasticity in the pathway between the cerebellar network and cortical RNN on cortical activations by estimating the energy cerebellar feedback induces in RNN state space^{51}. This level of energy reflects the potency of feedback onto the RNN: a low energy would reveal a suppressed RNN response, whereas a high energy would reveal an amplified response. We speculated that these two cases would arise from a non-optimised and optimised , respectively (Extended Data Fig. S13A).

As per Kao and Hennequin^{51}, we compute the energy of cerebellar feedback through the *controllability Gramian P* associated with RNN dynamics. Informally, *P* describes the “intrinsic manifold” of the RNN, that is, the directions which the RNN is most (or least) likely to visit. Formally, given a direction **v** in state space, the average energy generated along direction **v** is

In general, the Gramian matrix *P* is only defined for linear systems. In this work we therefore generalise the notion of controllability for the non-linear RNN dynamics as defined in Eq. 1. Here we use the noise covariance matrix Σ in its place, which for linear systems is shown to be equivalent to the Gramian^{51}, Σ = *P*. Explicitly, we compute Σ as the time-course average covariance of RNN hidden activations *h*_{t} under noisy inputs which follow a Wiener process. That is, […] where […] is a set of *N* samples of RNN states which each evolve according to

In our experiments we use *N* = 500 samples and simulate Eq. 14. To ignore intrinsic RNN transients that occur at the start of simulation, we discard the RNN states during the first 5 simulation timesteps when computing Σ. The energy generated from cerebellar feedback is then , where is the normalised direction being driven by the cerebellum in RNN state space. We report the energy generated (during the noise dynamics of Eq. 14) by cerebellar feedback at timestep 10, a time chosen strictly after the initial RNN transient phase (Extended Data Fig. S13b). For comparison we compare this to the energy generated by 100 random sample directions where *I* is the identity matrix. To enable greater interpretability we then normalise these energies by its highest possible value ; i.e. the input which elicits maximal amplification of RNN dynamics. This value can be computed as **u**^{⊤}Σ**u** where **u** is the principal eigenvector of Σ.
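The normalised energy can be computed directly from Σ. This is a sketch under one explicit assumption: the energy along a unit direction **v** is taken to be **v**^{⊤}Σ**v**, by analogy with the expression **u**^{⊤}Σ**u** given above for the maximal value. The function name is hypothetical.

```python
import numpy as np


def normalised_energy(sigma: np.ndarray, direction: np.ndarray) -> float:
    """Energy along `direction`, relative to the maximum achievable energy.

    sigma:     symmetric noise covariance matrix of RNN states
    direction: direction driven by the feedback (need not be unit length)
    """
    v = direction / np.linalg.norm(direction)            # normalise the direction
    energy = float(v @ sigma @ v)                        # assumed form v^T Sigma v
    max_energy = float(np.linalg.eigvalsh(sigma)[-1])    # largest eigenvalue = u^T Sigma u
    return energy / max_energy


sigma = np.diag([4.0, 1.0])
print(normalised_energy(sigma, np.array([0.0, 3.0])))
```

A value near 1 indicates feedback aligned with the most amplified direction of the RNN, whereas a value near 0 indicates feedback along a suppressed direction.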

### Cerebellum as a decoder of fading memories

For the delayed association task we discussed the need for a greater number of hidden cerebellar units (granule cells) to achieve good task performance (Fig. 7a). In particular, we find that the number of granule cells (GCs) required is inversely proportional to the *signal-to-noise ratio* (SNR) of the RNN hidden neurons.

Under reasonable assumptions, Eq. 15 can be shown analytically. We go through the main steps of this proof; a complete derivation can be found in supplementary note 1.

First, we express the RNN activations *f*(**h**) as the sum of “task-agnostic” and “task-relevant” features, or components. The former might incorporate general network connectivity, or intrinsic neural noise, whilst the latter depends solely on the task condition. We take these components to be independent of each other, so that the total variance of RNN activations can then be expressed as

Where and denote the variance caused by the task-agnostic and task-relevant components, respectively. Assuming each component is centred around zero (which we observe using simulations), the ratio of these two quantities then represents the SNR of the RNN population, . With a further assumption that this RNN population is centered at zero (which we observe as reasonable) and the mossy fibres *W*_{MF} are normally distributed, it can then be shown that the general cerebellar GC input *I*_{GC} can be expressed as

Where *k* is the number of neurons in the RNN population and […] is the variance of the mossy fibre weights, […]. Given this distribution for the input, the key step is considering the probability that the task-relevant input *I*_{ζ} can drive a sub-threshold GC activation (*I*_{GC} < 0) in one task-setting but super-threshold activation (*I*_{GC} > 0) in the other. We term this the probability of a “unique activation”; it can then be shown that for small SNR values

In other words, whether or not there is a unique activation in the GC is simply governed by SNR-dependent chance. Intuitively, therefore, given a large enough GC layer the likelihood is that at least one GC will have a unique activation. In fact, it can be shown that the number of neurons needed to have at least one unique activation with confidence *θ* can be expressed as (which is the result cited in the main text)
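The step from a per-cell probability to a required layer size can be illustrated generically. The sketch assumes that each GC has an independent probability `p_unique` of a unique activation, so that the probability of at least one unique activation among *N* cells is 1 − (1 − `p_unique`)^{N}. The SNR-dependent value of `p_unique` is left as an input, and the function name is hypothetical.

```python
import math


def granule_cells_needed(p_unique: float, theta: float) -> int:
    """Smallest N such that 1 - (1 - p_unique)**N >= theta.

    Assumes independent unique-activation events across granule cells.
    """
    if not (0.0 < p_unique < 1.0 and 0.0 < theta < 1.0):
        raise ValueError("p_unique and theta must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - theta) / math.log(1.0 - p_unique))


# Lower per-cell probabilities (lower SNR) require larger granular layers.
for p in (0.5, 0.1, 0.01):
    print(p, granule_cells_needed(p, theta=0.9))
```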

To estimate SNR(RNN) in the models for the delayed association task (Fig. 7b, left axis), we compute the variance of the task-agnostic component as the (average) variance of the population under the same task stimulus *s*, i.e. […]. By equally calculating the total variance […], the variance of the task-relevant component is then simply computed as the difference to the total variance, i.e. […]. To determine the minimum number of granule cells required to decode the stimulus from the RNN activity (Fig. 7b, right axis), we tested whether the cerebellar network could be trained to successfully discriminate the stimulus after 40 training sessions for varying quantities of granule cells (quantities as described below). The cerebellar network was deemed to successfully decode the stimulus if, for at least 9 of the 10 seeds, the average error during the last 4 training sessions was less than 5%.
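The variance decomposition used for this estimate can be sketched as follows. Several details are assumptions: variances are computed per neuron across samples and then averaged over neurons, each stimulus contributes the same number of samples, and the SNR is taken as the task-relevant variance divided by the task-agnostic variance. The function name is hypothetical.

```python
import numpy as np


def estimate_snr(acts_by_stimulus) -> float:
    """Estimate the SNR of a population from activity grouped by stimulus.

    acts_by_stimulus: list of arrays, one per stimulus, each of shape
                      (n_samples, n_neurons) with equal n_samples.
    """
    # Task-agnostic variance: average variance under a fixed stimulus.
    var_agnostic = np.mean([a.var(axis=0).mean() for a in acts_by_stimulus])
    # Total variance across all stimuli.
    var_total = np.concatenate(acts_by_stimulus, axis=0).var(axis=0).mean()
    # Task-relevant variance: the difference to the total variance.
    var_relevant = var_total - var_agnostic
    return float(var_relevant / var_agnostic)


stim_a = np.array([[1.0], [3.0]])    # mean +2, variance 1
stim_b = np.array([[-1.0], [-3.0]])  # mean -2, variance 1
print(estimate_snr([stim_a, stim_b]))
```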

We also generated synthetic inputs to directly test the ability of the cerebellar network for different SNRs (Fig. 7c and Extended Data Fig. S17). In this case the network must classify one of two 10-dimensional inputs of the form **x** = **x**_{ω} + **x**_{ζ}. Both **x**_{ω} and **x**_{ζ} are drawn randomly from normal distributions of zero mean, but only **x**_{ζ} changes for each task condition (i.e. once sampled **x**_{ω} is fixed). The distributions **x**_{ω}, **x**_{ζ} have distinct variances […] respectively.

The SNR of the input data is then simply , where we set and vary across a range of values (see below). The architecture of the cerebellar network remains the same as in the main cortico-cerebellar model, except that a spiking activation function is used at the granular layer; that is, in Eq. 2 where *f*_{spike}(*I*) = 1 if *I* > 0 and 0 otherwise.
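The synthetic setup can be illustrated with a small sketch. It is not the authors' implementation: the helper names (`make_inputs`, `granule_layer`), the Gaussian input weights without a bias term, and the use of standard deviations as parameters are assumptions for illustration only. The spiking nonlinearity follows the definition of *f*_{spike} given in the text.

```python
import random


def make_inputs(sigma_omega, sigma_zeta, rng, dim=10):
    """Return two inputs x = x_omega + x_zeta, one per task condition."""
    # Task-agnostic component: sampled once and shared by both conditions.
    x_omega = [rng.gauss(0.0, sigma_omega) for _ in range(dim)]
    inputs = []
    for _ in range(2):
        # Task-relevant component: resampled for each task condition.
        x_zeta = [rng.gauss(0.0, sigma_zeta) for _ in range(dim)]
        inputs.append([a + b for a, b in zip(x_omega, x_zeta)])
    return inputs


def f_spike(current):
    """Spiking activation: 1 if the input current is positive, else 0."""
    return 1 if current > 0 else 0


def granule_layer(x, weights):
    """Binary GC population vector for input x (weights: one row per GC)."""
    return [f_spike(sum(w * xi for w, xi in zip(row, x))) for row in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    m = 64  # number of granule cells (arbitrary for this demo)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(m)]
    x1, x2 = make_inputs(sigma_omega=1.0, sigma_zeta=0.5, rng=rng)
    gc1, gc2 = granule_layer(x1, weights), granule_layer(x2, weights)
    # The two conditions are separable at the GC level only if the codes differ.
    print("distinct GC codes:", gc1 != gc2)
```

When the task-relevant component vanishes (zero `sigma_zeta`), both conditions produce the same input and therefore the same GC population vector, which is the limiting case in which no number of GCs can help.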

To verify whether Eq. 19 indeed holds empirically, we ran the network on the dataset described above over a range of different SNRs *s* and network sizes *m*. We considered and *m* = 2^{k} for *k* between 0 and 14 inclusive (i.e. 225 distinct SNR/size configurations). We ran each configuration over 10 random seeds. To determine the number of GCs needed for a given SNR *s*, we considered the minimum *m* needed such that (1) the GC population vector is distinct for the two task inputs **x**_{1}, **x**_{2} (Fig. 7c and Extended Data Fig. S17a) and, correspondingly, (2) the cerebellar readout (Purkinje cells) can successfully learn to classify the initial input (Extended Data Fig. S17a). Classification was again deemed successful if the average error during the last 4 of the 40 training sessions was less than 5%. Importantly, a given number of GCs was counted as successful only if at least a fraction *θ* of the random seeds were successful (e.g. confidence *θ* = 0.6 would require 6/10 seeds to have distinct responses). As expected, we observed a strong relationship between (1) and (2) (Extended Data Fig. S17c), since the ability to successfully discriminate at the readout level depends on discrimination at the hidden GC level. In these experiments we selected *θ* = 0.9.
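The selection procedure in this paragraph can be summarised in code. The sketch below assumes that each run is summarised by a list of per-session error rates (as fractions); the function names and the `results` layout are hypothetical, while the thresholds (last 4 sessions, 5% error, fraction *θ* of seeds) follow the text.

```python
import math


def run_success(errors, last_n=4, threshold=0.05):
    """True if the mean error over the last `last_n` sessions is below threshold."""
    tail = errors[-last_n:]
    return sum(tail) / len(tail) < threshold


def config_success(errors_per_seed, theta=0.9):
    """True if at least a fraction theta of the seeds are successful."""
    n_seeds = len(errors_per_seed)
    # Small tolerance guards against floating-point error in theta * n_seeds.
    required = math.ceil(theta * n_seeds - 1e-9)
    n_ok = sum(run_success(errors) for errors in errors_per_seed)
    return n_ok >= required


def minimum_size(results, theta=0.9):
    """Smallest network size m whose runs meet the criterion, else None.

    results: {m: [error_curve_for_seed_0, error_curve_for_seed_1, ...]}
    """
    for m in sorted(results):
        if config_success(results[m], theta):
            return m
    return None


if __name__ == "__main__":
    sizes = [2 ** k for k in range(15)]  # m = 1, 2, 4, ..., 16384
    print(len(sizes), sizes[0], sizes[-1])
```

For a given SNR, `minimum_size` would be applied to the error curves collected for every *m*, with one curve per random seed.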

## Data and code availability

We used the PyTorch library for all neural network models. The code and the simulated data used for our experiments are available at https://github.com/neuralml/ccLoops.

## Acknowledgements

We would like to thank the Neural & Machine Learning group, James M Shine, Paul Dodson, Everton Agnes, Laureline Logiaco and Jake Stroud for useful feedback. J.P. was funded by an EPSRC Doctoral Training Partnership award (EP/R513179/1), P.C. by the Wellcome Trust (209453/Z/17/Z) and R.P.C. by the Medical Research Council (MR/X006107/1). This work made use of the HPC system Blue Pebble at the University of Bristol, UK.
