Abstract
Behavioural feedback is critical for learning in the cerebral cortex. However, such feedback is often not readily available. How the cerebral cortex learns efficiently despite the sparse nature of feedback remains unclear. Inspired by recent deep learning algorithms, we introduce a systems-level computational model of cerebro-cerebellar interactions. In this model a cerebral recurrent network receives feedback predictions from a cerebellar network, thereby decoupling learning in cerebral networks from future feedback. When trained in a simple sensorimotor task the model shows faster learning and reduced dysmetria-like behaviours, in line with the widely observed functional impact of the cerebellum. Next, we demonstrate that these results generalise to more complex motor and cognitive tasks. Finally, the model makes several experimentally testable predictions regarding (1) cerebro-cerebellar task-specific representations over learning, (2) task-specific benefits of cerebellar predictions and (3) the differential impact of cerebellar and inferior olive lesions. Overall, our work offers a theoretical framework of cerebro-cerebellar networks as feedback decoupling machines.
Introduction
Learning ultimately depends on environmental feedback1,2. To learn efficiently animals and humans must make good use of this feedback to update their internal models of the world3,4. However, external sensory feedback is inherently delayed and incomplete, thereby reducing the rate and extent of learning in neuronal circuits3. These observations suggest that the brain may employ a general mechanism to facilitate learning when external feedback is not readily available.
The cerebellum is a region of the brain specialised in building predictive models4,5. In the classical view, the cerebellum learns predictive internal models in the motor domain5–10. Consistent with this view is a large body of experimental observations in which cerebellar dysfunction causes motor learning deficits. However, more recently, cerebellar dysfunction has also been associated with impaired language processing, cognitive associative learning and working memory11–15. Moreover, an increasing body of behavioural12,14,16–20, anatomical21,22 and imaging23 studies alludes to a role of the cerebellum in cognition in animals and humans. Taken together, these studies suggest that the cerebellum learns internal models for both motor and non-motor functions in line with the proposed universal functional role of the cerebellum across the brain, including the cerebral cortex9,24–26.
Despite growing experimental evidence there are no specific computational models aiming to capture the functional roles of cerebro-cerebellar interactions during learning of motor and non-motor tasks. Building on recent deep learning developments we theorise that the cerebellum predicts future cerebral feedback signals given current cerebral activity. This feedback predicted by the cerebellum is then sent back to the cerebral network to drive learning. Specifically, we model a given cerebral area as a recurrent neural network27–30 which receives feedback predictions from a feedforward, cerebellar, network6,7. This view of cerebro-cerebellar interactions is in line with the classical forward models of cerebellar function6,7, in that in our model the cerebellum makes forward predictions (i.e. generates cerebral feedback predictions) given current cerebral activity.
We test our model on a range of sensorimotor, pattern recognition and visual-language tasks. Using these tasks we demonstrate that cerebellar predictions conveyed to the cerebral cortex facilitate learning. Moreover, models without a cerebellar component exhibit slower learning and dysmetria-like behaviours, consistent with a wide range of behavioural observations11,14,31,32. Our results indicate that the cerebellar-mediated facilitation of cerebral learning relies on the ability of the cerebellum to provide effective cerebral feedback predictions. Finally, we make several experimentally testable predictions regarding cerebro-cerebellar representations, task-specific temporal feedback, cerebro-cerebellar activity coupling and the different contributions of cerebellar output and inferior olive lesions for task learning.
Results
A systems-level computational model of cerebro-cerebellar interactions
In order to understand how cerebellar computations may shape cerebral processing, we introduce a cerebro-cerebellar systems-level model based on a recent deep learning algorithm33. In line with previous work we model a given cerebral cortical area A as a recurrent neural network (RNN)27–30 which is coupled with a cerebellar module C - cerebro-cerebellar RNN (ccRNN). We model the cerebellar module as a simple feedforward network C (Fig. 1a) in line with the cerebellar architecture6,7,9. The input layer of the cerebellar network receives cerebral cortical activity a and models the granule cells (GCs), which project to the output layer, modelling the Purkinje cells (PCs) that provide cerebellar predictions back to the cerebral cortex (Methods). To capture the dimensionality expansion observed between cerebral and cerebellar networks34,35 we constrain our model with M ≫ N, where M corresponds to the number of GCs and N the number of cerebral neurons, and use the same ratio found experimentally.
We study the behaviour of our model in a range of tasks. To train the model we use a prediction error function Etask which compares the model output with task-specific external feedback. Using standard gradient descent methods we generate feedback signals of a specific temporal horizon (see example of a RNN unrolled in time in Fig. 1b), fbt, which are then used to update the RNN input and recurrent weights (Fig. 1a; see Methods). For computational efficiency and in line with previous models we use a time-discrete approximation of time-continuous RNN models28.
Following our theoretical proposal the cerebellar module C continuously learns to predict cerebral feedback fbt given cerebral cortical activity at. The cerebellar network is optimised through error signals computed by comparing the actual cerebral feedback fbt at time t with the cerebellar predicted feedback C(at). We postulate that this comparison is done in an inferior olive-like structure, which generates the error signals, $E_{IO}$, used to optimise the cerebellar network (see Methods). However, similar to the external feedback, actual cerebral feedback is not always available, which would impact the ability of the cerebellar network to learn online to produce effective feedback signals. To circumvent this problem we propose that the cerebellum learns using its own feedback predictions when cerebral feedback is not available (Fig. 1b)33. This leads to a target feedback $\overline{fb}_t$ that combines the true cerebral feedback fbt with a self-prediction term, which enables the cerebellum to learn online (see full details in Methods).
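To make this architecture concrete, below is a minimal PyTorch sketch of the two coupled networks: a cerebral LSTM whose activity is copied to a feedforward cerebellar module that returns a predicted feedback vector of the same dimensionality. Class and variable names are ours (not those of the released code), and the sizes simply follow the simple visuomotor task described later.

```python
import torch
import torch.nn as nn

class CerebellarModule(nn.Module):
    """Feedforward cerebellar network: granule-cell expansion followed by a
    Purkinje-cell readout that maps cerebral activity a_t to predicted feedback."""
    def __init__(self, n_cerebral, n_granule):
        super().__init__()
        self.granule = nn.Linear(n_cerebral, n_granule)    # expansion, M >> N
        self.purkinje = nn.Linear(n_granule, n_cerebral)   # readout back to cerebral units
        nn.init.zeros_(self.purkinje.weight)               # predicted feedback starts at zero
        nn.init.zeros_(self.purkinje.bias)

    def forward(self, a):
        return self.purkinje(torch.relu(self.granule(a)))

class ccRNN(nn.Module):
    """Cerebral area A (an LSTM) coupled with a cerebellar module C."""
    def __init__(self, n_in, n_cerebral=50, n_granule=400, n_out=2):
        super().__init__()
        self.cerebral = nn.LSTMCell(n_in, n_cerebral)
        self.readout = nn.Linear(n_cerebral, n_out)
        self.cerebellum = CerebellarModule(n_cerebral, n_granule)

    def step(self, x, state=None):
        h, c = self.cerebral(x, state)
        # the cerebellum receives a copy of the cerebral activity; for simplicity only
        # the hidden state is passed here (the full model also passes the cell state)
        predicted_fb = self.cerebellum(h)
        return self.readout(h), (h, c), predicted_fb
```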
Cerebro-cerebellar model facilitates learning in a simple sensorimotor task
Inspired by classical sensorimotor studies of the cerebellum, we first test the model on a simple visuomotor task11,31,32,36,37. In this task the model must draw a straight line in a two-dimensional space towards one of seven target locations given a target-specific cue at the start of the task (Fig. 2a, top left). We train a cerebro-cerebellar RNN (ccRNN) and a cerebral-only RNN (cRNN) to perform this task (see full details in Supplementary). To train the models we provide teaching feedback by comparing the cerebral network output with the optimal trajectory (i.e. a straight line between starting and end points; Fig. 2a). In addition, this feedback is delayed with respect to the initial cue and incomplete (i.e. only available every few time steps). This models a more realistic setting in which task feedback is not always readily available. When this feedback is available at time t we calculate the prediction error as $E_{\mathrm{task}} = \|l_t - \hat{l}_t\|^2$, where $l_t$ and $\hat{l}_t$ denote the desired and current model two-dimensional trajectory (i.e. set of feedback points; cf. Fig. 2 schematic), the latter given by a linear readout on the network activity at (Methods). In particular, here we consider a feedback interval of every other time step for both cRNN and ccRNN (but see Fig. 4 for more general cases).
During learning the ccRNN model achieves near-zero error after a relatively small number of training sessions, while the cRNN, which lacks the cerebellar component, also learns the task but more slowly and with higher variability (Fig. 2b). These observations are in line with a large body of cerebellar experiments11,31,32. In addition, we also observe differences at the level of model output trajectories. While the ccRNN produces smooth and straight trajectories, the cRNN displays a much more variable trajectory towards all targets (Fig. 2b). Due to the sparse task feedback, in the absence of a cerebellar network the cRNN is not able to learn a correct trajectory at points for which there is no direct feedback, thus overshooting the target trajectory. In cerebellar patients this effect is referred to as dysmetria38, which in the motor domain results in ataxia. Ataxia is the lack of coordination and fine control during voluntary movements, a defining symptom resulting from cerebellar malfunction11,38. To evaluate the degree of dysmetria-like output in our models we measure the error between the model output and the optimal trajectory (i.e. a straight line in this case; see Methods). On this measure, the ccRNN shows a clear reduction in ataxia-like behaviour compared to the cRNN (Fig. 2c).
To highlight the conditions for which the cerebellum may facilitate learning in cerebral networks we test different lengths of cerebral feedback horizon (Methods). Our results show that the ccRNN only facilitates learning for short to medium feedback horizons (<50%, Figs. 2d, S2). These results suggest that the cerebellum is particularly important for cerebral learning in conditions in which cortical networks do not have internal effective feedback available for learning. This is consistent with experimental observations showing that the cerebellum becomes more important in the presence of challenging task conditions for which cerebral feedback might be short39. In contrast, for long cerebral feedback, having a cerebellar module harms learning. In this case the cerebral network has the level of feedback required to learn effectively, thus the noise inherent in the cerebellar feedback can impair learning. This observation suggests that the brain may use intermediate brain structures, such as the thalamus and the pons to gate cerebro-cerebellar interactions depending on task properties (see Discussion).
Next, to gain insight into how cerebral and cerebellar neuronal representations evolve jointly during learning, we use a dimensionality reduction method, demixed principal component analysis (dPCA; see Methods). dPCA enables us to extract low-dimensional neuronal representations that capture maximum variance across task variables. First, we focus on the two most informative cue-specific principal components using the neural activities of the recurrent neural network for both cRNN and ccRNN (see all components in Figs. S3, S4 and S5). Next, we calculate the two-dimensional Euclidean distance across the 7 different possible cues (Methods). Our results show that the ccRNN cerebral network is characterised by a stronger increase in the separation of stimulus components over learning when compared to the cRNN cerebral network (Fig. 2e). To contrast task-specific components with general temporal information, we compare the level of cue-specific and time-specific explained variance in both models. The ccRNN captures overall more cue-specific explained variance than the cRNN (Fig. 2f), demonstrating that the ccRNN encodes more task-relevant information, as the task requires the model to associate the cue information with specific output trajectories. Next, we applied dPCA to the activity of cerebellar neurons. Since the cerebellar module facilitates cue-to-target learning we expected cerebellar representations to be mostly dominated by task-specific information. This is indeed what we find: the distance between cue-related components is stronger during periods of high learning (Fig. 2g; compare with Fig. 2b), and most of the variance is explained by cue-specific PCs (95.4%; Fig. 2h).
Overall, our results suggest that in the context of a simple sensorimotor task, cerebellar-mediated decoupling of cerebral feedback enables faster learning and smoother motor trajectories. In addition, it makes a number of experimentally testable predictions about the evolution of task-specific cerebro-cerebellar representations throughout learning.
Cerebro-cerebellar model improves learning in complex sensorimotor and discrimination tasks
Next, to test whether the results from the simple visuomotor task generalise to more realistic settings we explore a range of more advanced sensorimotor tasks. We introduce two tasks in which the models are trained to draw digits given complex spatiotemporal sensory inputs. For these tasks we build on a standard machine learning dataset consisting of 10 (from 0 to 9) two-dimensional handwritten digits (see example in Fig. 3a; MNIST dataset40). In contrast to the previous task in which sensory input was only provided at the start of the task, here the model receives a part of a handwritten digit at any given point in time (i.e. a row of 28 pixels; see Methods). We refer to this task setting, in which input is provided over time, as online. Given this input, we consider two task variants (Fig. 3a) in which the model has to either (i) draw a straight line (online line drawing (LD) visuomotor task) or (ii) draw a digit (online digit drawing (DD) visuomotor task). Both tasks provide a more realistic model of drawing tasks (cf. Fig. 2) in which lines must be drawn given complex continuous sensory input. As in the previous task we consider cases of sparse task feedback.
As in the simple visuomotor task, here the ccRNN learns faster than the cRNN (Fig. 3b) while showing a strong reduction in dysmetria-like trajectories (Fig. 3c). The ccRNN also facilitates learning in the presence of short to medium feedback horizons in the cerebral network (Fig. 3d).
To test whether our observations in the sensorimotor tasks generalise to other task domains we train the model in a visual discrimination task. In this task the model receives the same handwritten digits presented sequentially over time but now must discriminate between the 10 classes of digits (online visual discrimination task, Fig. 3a). In line with the results in the visuomotor tasks, we find that the ccRNN also facilitates learning in this task, achieving higher accuracy after only 10 training sessions (Fig. 3b). Here we use the certainty the model has about the current class as a measure of dysmetria of thought41 (see Methods). Similarly to the tasks above, we find that dysmetria-like behaviours are reduced in the ccRNN model, which in this case shows that the model produces more accurate decisions (Fig. 3c). Finally, in line with previous tasks a cerebellar module facilitates learning in the presence of weak cerebral feedback (Figs. 3d, S6). These results are in line with the growing number of studies implicating the cerebellum in sensory discrimination and decision making tasks19,42,43.
Cerebellar-mediated learning facilitation depends on task feedback interval
In sensorimotor tasks there are physiological constraints inherent to animals and humans which impose limits on the rate at which external feedback is available44–46. To determine the rate of external feedback for which cerebellar predictions are most valuable we trained the model in two tasks (simple LD and LD visuomotor tasks) with a range of external feedback intervals. This feedback interval defines the rate at which external feedback is available for learning, resembling sensorimotor feedback which is typically sporadic rather than continuous11,47,48. We find that when external feedback is given at short intervals there is little advantage of the feedback predictions from the cerebellar component for both the simple LD and online LD visuomotor tasks (Fig. 4a,b). When the interval between external sensory feedback is increased, the benefits of the cerebellar-to-cerebral feedback predictions for learning in the ccRNN model become clear. In contrast, for long feedback intervals the feedback is too infrequent for both cRNN and ccRNN to be able to successfully learn the task. Next we evaluate the degree of dysmetria using the metrics introduced above. We observe qualitatively similar results: a model without a cerebellar network (cRNN) exhibits more variable trajectories for medium to long task feedback intervals (Fig. 4a). These results imply that whether cerebellar-to-cerebral feedback is beneficial for learning and leads to dysmetria-like behaviours depends on the rate of task feedback.
Similarity between cerebellar and cerebral feedback is task and learning dependent
The cerebro-cerebellar facilitation of learning shown above depends on the ability of the cerebellum to provide the cerebral network with effective feedback predictions. To study the level of similarity between the cerebellar predicted feedback and the theoretically optimal cerebral feedback as provided by gradient descent methods, we calculated the cosine similarity between cerebellar predictions and the optimal cerebral feedback in a range of tasks (Methods).
First, we measure the cosine similarity for tasks in which external sensory feedback is only provided at the end of the task: a variant of the simple LD task with feedback only at the end and the online visual discrimination task. This task setup allows for an easier interpretation of the similarity between cerebellar and cerebral feedback, which should decay gradually from the end to the beginning of the task sequence. Indeed, we observe that the cerebellar-cerebral feedback similarity is higher closer to the point at which external sensory feedback is available (i.e. end of the task; Fig. 5a,b top; cf. Figs. 2, 3) and remains high over learning, in particular for later points in the task (Fig. 5a,b bottom).
Next, we analyse the cosine similarity for conditions in which external feedback is available throughout the task. For this we consider the same visuomotor tasks as above (simple LD, online LD and online DD visuomotor tasks). In these tasks we observe more complex dependencies of the cerebro-cerebellar feedback similarity on task properties (Fig. 5c,d). For the simple LD task we observe that the predictions made during earlier points in the task are more similar than those at later points (Fig. 5c). These results suggest that the model first learns to align later points in the task and gradually learns to adjust earlier points, which are closer to the cue-specific information that defines the trajectory that the model must take. Interestingly, this behaviour is less prominent in the two other tasks, the online LD and DD visuomotor tasks, which are characterised by relatively more complex task-specific sensory input occurring throughout the task. For these two more complex tasks, and in contrast to the simple LD task, the similarity remains high throughout learning for later time points (Fig. 5d), reflecting the more challenging nature of these tasks and the need to continuously predict feedback as the task is never fully learnt.
These results make non-trivial predictions on when the cerebellum is able to better align with the cerebral feedback, which depend on task complexity, the properties of the task feedback, the exact task position and the learning stage.
Learning shapes cerebro-cerebellar activity coupling
The cosine similarity results show that the cerebellar module learns to predict cerebral feedback. Because the cerebellum maps cerebral activity onto (predicted) cerebral feedback, this suggests changes in the coupling between cerebellar and cerebral neuronal representations throughout learning. To study the degree of cerebro-cerebellar coupling we calculate the pairwise correlations between neurons in the cerebral recurrent neural network and the neurons of the cerebellar network (Methods). Although we observe a relatively small rise in the average cerebro-cerebellar coupling during the first few training sessions, as training progresses, there is a consistent decrease of the correlations (Fig. 6a).
To study more subtle changes in the correlation structure we use standard principal component analysis of the pairwise correlations (Fig. 6b). The first principal component reflects the changes in the average cerebro-cerebellar coupling (Fig. 6b). The second principal component shows a delayed increase with respect to the first, followed by a sustained decrease in the cerebro-cerebellar coupling (see Fig. S7 for the remaining components). These results are consistent with the need for the cerebellum to provide more effective feedback, and thus be more strongly coupled, in the earlier learning phases. To identify learning periods of consistent increases or decreases in coupling as training progresses we tracked the changes in correlations of cerebro-cerebellar pairs in early, mid and late learning (Fig. 6c). We observe that early in learning - when most learning occurs - a large part of the population shows a consistent increase in correlations, but this rapidly changes as learning progresses, with only a very small number of pairs showing increases in correlations later in learning.
To better assess the contribution of a plastic cerebellum to the cerebro-cerebellar coupling, we analysed a ccRNN in which the cerebellum does not learn. In this case we can still observe changes in cerebro-cerebellar coupling over learning for some tasks, which reflect changes in the RNN itself, but these are weaker when compared to the normal ccRNN (Fig. S8a). In this case cerebro-cerebellar correlations remain high throughout learning compared to a ccRNN with a plastic cerebellum. This is supported by their low-dimensional representations: whereas a plastic cerebellum leads to principal components that approach near-zero values after the initial learning phase (Fig. 6b, S7), in the case of the fixed cerebellum the principal components continue to fluctuate throughout learning (Fig. S8).
Although our model suggests a long-term decrease in the cerebro-cerebellar activity coupling, it highlights sub-populations which increase their coupling during specific periods of learning. This observation follows from our proposal, in which the cerebellum is trained to map cerebral neuronal activity onto cerebral feedback, both of which change over learning.
Differential impact of cerebellar output and inferior olive on learning
In experimental neuroscience a common paradigm is to inactivate the cerebellum in order to study its role in learning and behaviour. Here we perform in silico lesion experiments to reveal the impact of the modelled cerebellar feedback predictions during learning. First, we test cerebellar output lesions at different points in learning. In all tasks we observe that inactivating the output of the cerebellar module early in learning impairs further learning and performance (Fig. 7a,b). This is expected, as the cerebellar network provides feedback predictions that facilitate cerebral learning. Interestingly, we observe that when the cerebellum is suddenly removed, learning becomes worse than in the baseline model. This is likely due to the additional time taken to adapt to a new learning trajectory which no longer relies on cerebellar predictions. However, cerebellar lesions performed later in learning do not have an impact in the simple LD visuomotor task, which is explained by the fact that for this task the model can achieve near-zero error, so the learning signals provided by the cerebellum are no longer needed. In contrast, for all the online tasks we observe that inactivating the cerebellum even at later stages impairs learning. In these more realistic tasks the cortical network still relies on the feedback provided by the cerebellum as it never fully learns the task. Our results indicate that lesion studies should reveal a task-dependent nonlinear role of the cerebellum in cerebral learning.
Next, we assess the impact of disrupting cerebellar learning by modelling a complete lesion of our inferior olive-like error module (Methods). This manipulation effectively stops cerebellar learning, impairing the ability of the cerebellum to provide informative feedback learning signals to the cerebral network, which may in turn prevent the cerebral network from learning. For all of the tasks that we model, inactivating cerebellar learning has a strong impact throughout training, making the model return to naive performance (Fig. 7c,d). Thus, simulated inferior olive lesions predict that a cerebellum that cannot learn has a stronger negative impact on task learning than ablating the cerebellum itself. This further suggests that it is critical for the cerebellum to learn rapidly so that it can provide informative predictions.
Cerebro-cerebellar model facilitates learning in a visual-language task
Our framework applies not only to sensorimotor tasks, but should generalise to virtually any task within the grasp of current neural network models. To test the generalisability of our model, and inspired by cognitive tasks in which cerebellar patients have shown deficits49, we test our models in a caption generation task. In this task the network needs to generate a textual description for a given image. All models have two components: a pretrained convolutional neural network (CNN) to extract a lower-dimensional representation of the image, and a cRNN or ccRNN on top which is trained to map the low-dimensional visual input to captions that describe the image (Fig. 8a).
We use a standard machine learning dataset50 and the networks are trained to predict the next word (Methods). In contrast to the previous tasks, here we use a form of unsupervised learning, in which the prediction error module only uses the data itself (i.e. words) to generate teaching signals (Supplementary). We find that ccRNN models can exhibit faster learning (Fig. 8b, d) and better generalisation51 (Fig. S9) in the presence of short cerebral feedback horizons (≤ 40%). All models produce reasonable captions for images unseen during training, but ccRNN models tend to produce captions that better capture the context and semantics of the task (Figs. 8c, S10), consistent with the language deficits observed in cerebellar patients14.
Finally, we use a language metric (SPICE52) to measure the quality of the generated captions. These results show that the ccRNN generates richer captions (Fig. 8e) and that it is particularly beneficial for longer captions. This suggests that ccRNN is able to learn richer visuo-language contextual information.
Discussion
Inspired by recent deep learning developments, here we have introduced a systems-level computational model in which cerebellar networks predict cerebral feedback (Fig. 1). In this scheme cerebro-cerebellar loops decouple cerebral cortical networks from future feedback signals. We show that the ccRNN model accelerates learning and improves task behaviour in a range of sensorimotor and cognitive tasks (Figs. 2, 3 and 8). Our results are consistent with observed motor and cognitive deficits in cerebellar patients. Our model makes a number of predictions in terms of (1) task properties (Figs. 4 and 5), (2) cerebro-cerebellar representations and coupling (Figs. 2 and 6), and (3) the differential role of the cerebellum and the inferior olive throughout learning (Fig. 7).
Experimental studies have shown that incomplete or delayed external sensory feedback is important for learning45,53,54. Our model proposes that the cerebellum plays an important role in facilitating motor learning in the presence of incomplete or delayed feedback. Furthermore, our work suggests that cerebro-cerebellar networks are ideally placed to facilitate learning when task feedback is presented intermittently, at medium frequencies with respect to the task sequence. Similarly, our results suggest that cerebellum-dependent dysmetria should be more prevalent for tasks with intermediate to long inter-feedback intervals. Although there is a wide range of studies investigating the role of external sensory feedback in learning53,55 and the precise timing of feedback is known to be important for cerebellar function10,56, it remains to be tested what the optimal properties of task feedback for learning are. Taken together, we suggest cerebellar-mediated feedback predictions to be particularly important for temporally challenging tasks with sparse feedback.
Our representational analyses demonstrate that the cerebellum develops task-specific representations. Recent fMRI studies have observed that different regions of the cerebellum encode task-specific representations for different domains23,57. Similarly, our model predicts the need for different cerebellar modules to provide feedback estimations to the cerebral cortex for specific task domains. We have also studied the level of coupling between cerebellar and cerebral neural activity. Our results demonstrate an initial rise in correlations which coincides with steep periods of learning, followed by a general decay in the coupling during the remaining periods of learning. This general decay in coupling is also reflected in our simulated cerebellar lesions, which echo the existing literature in that after a task is consolidated in the cerebrum it becomes less cerebellar-dependent58,59.
In line with previous theoretical accounts6,7,9 we suggest that the cerebellar error function is computed by the inferior olive, which drives learning in the cerebellum via the climbing fibres. This cerebellar error function is a combination of true sensory feedback and self-predicted (bootstrapped) error signals (Fig. 1b), which is analogous to the bootstrapping principles commonly used in reinforcement learning60. The use of self-predictions in the cerebellum suggests the existence of different forms of feedback to the inferior olive from potentially multiple cerebellar modules, consistent with cerebellar-inferior olive connectivity61. Moreover, when simulating inferior olive lesions we show that task performance becomes severely impaired. This is due to the cerebellum being unable to learn, thereby providing outdated feedback signals back to the cerebral cortex. These results suggest non-trivial consequences of lesions for cerebro-cerebellar interactions.
While our model is consistent with experimental observations, there are several biological features that we have not considered. In particular, experimental studies suggest that the cerebellum can influence cerebral learning processes via its projections to the thalamus62–65. This is in line with ccRNN, where the cerebellum predicts feedback signals that contribute directly to cerebral learning. However, we have assumed direct long-range projections to the cerebral cortex, whereas in biology these projections are mediated through the thalamus and pons. It is possible that both structures provide bottlenecks that filter out non-relevant information, such as poorly estimated feedback (Figs. 2d, 3d) that would impair cerebral learning. In addition, cerebellar-thalamic-cerebral projections are known to target distal dendrites of pyramidal cells66,67, which a number of biologically plausible deep learning models have proposed to encode feedback error signals68,69. These dendritic-encoded error signals are akin to the gradient descent errors that we use to model cortical feedback signals. In future work it would be of interest to combine our work with these biologically plausible gradient descent models.
Throughout this paper we have assumed the existence of cerebral prediction error modules, which compare the output of a given cerebral area with a desired task output to generate a feedback teaching signal for the cerebral cortex. There is evidence of prediction errors across different brain areas, for example sensorimotor prediction errors in the neocortex70,71 or reward prediction errors in the VTA1,72. For simplicity, here we have focused on supervised (Figs. 2, 3) and unsupervised (Fig. 8) prediction errors, but these can in principle be readily replaced by reward-based prediction errors1,73. This would predict reward-specific encodings in the cerebellum, as observed recently74–76. Indeed, our model is of particular relevance to reinforcement learning due to the prevalence of sparse and delayed rewards (Fig. 4).
Finally, our model shares common features with classical internal models of the cerebellum (6,7; Table S1). In the forward model of sensorimotor control, the cerebellum receives an efferent copy of the motor commands and the respective external sensory feedback8,77. With these two input streams the forward model learns to predict the sensory consequences of motor commands. We and others have argued that a similar predictive model can in principle be applied to higher order brain regions such as the prefrontal cortex and the temporo-parietal cortex which are involved in planning of cognitive behaviour and decision making16,17,24,26 (Fig. 1a). In line with forward models the cerebellar module of ccRNN receives an efferent copy of the cerebral neural activity and cerebral feedback. Given these signals the cerebellum learns to predict future cerebral feedback.
Overall, our work offers a novel theoretical framework with which to study cerebro-cerebellar interactions, being consistent with experimental observations while making a large number of testable predictions across multiple levels of interrogation.
Methods
In all our experiments we model a cerebral area A as a long short-term memory recurrent neural network (LSTM)78 with parameters θ, which has recently been mapped onto cortical microcircuits79. A (trained) linear readout is attached to the LSTM output states and provides the final model output, to which a supervised error module Etask (which below we refer to as E) is applied.
In the cerebro-cerebellar RNN model (ccRNN) we attach a feedforward cerebellar module C with independent parameters Ψ to the RNN with reciprocal connections (Fig. 1). The cerebellar module is equivalent to the "synthesiser" used by Jaderberg et al.33 in the backward case. That is, the cerebellar module receives a copy of the RNN activity at (both cell and output LSTM states) and sends back a prediction of the future feedback (or error gradients) with respect to that activity, C(at).
To generate the desired cerebral temporal feedback (error gradients) we use backpropagation through time (BPTT). To highlight the link between BPTT-derived feedback and the cerebellar predicted feedback we start out from first principles, closely following Jaderberg et al.33. BPTT is the standard solution for updating parameters θ in deep learning. In an ideal world one would have access to all error signals within a task of length T, $E = \sum_{t=1}^{T} E_t$, and derive the resulting parameter updates as $\theta \leftarrow \theta - \alpha\Delta\theta$ where $\Delta\theta = \frac{\partial E}{\partial \theta}$, but this is impractical as it would require information about all possible future error signals. Instead, in practice BPTT over a time horizon K is commonly used (Fig. 1),

$$\Delta\theta = \left(\sum_{t'=t}^{t+K} fb_{t'}\,\frac{\partial a_{t'}}{\partial \theta}\right) + C(a_{t+K})\,\frac{\partial a_{t+K}}{\partial \theta} \quad (1)$$

where $fb_{t'}$ denotes the cerebral feedback, $C(a_{t+K})$ the cerebellar prediction of future feedback, and $\frac{\partial a_{t'}}{\partial \theta}$ represents the temporal changes in cerebral activity with respect to the parameters θ. These equations make clear the distinction between cerebral feedback, modelling feedback within the current horizon K, and cerebellar feedback, predicting future horizons. Note that if we set $C(a_{t+K}) = 0$ then we simply have standard truncated BPTT over a time horizon K, as commonly used in deep learning.
A key consequence of the cerebellum predicting future feedback is that strong or long feedback signals (i.e. T ≫ 0) are no longer necessary, thus decoupling learning in the cerebral network from future feedback signals. For this reason we focus on weak forms of BPTT with relatively small temporal horizons, in which we model only K time steps of feedback into the past from an error signal E; this is known as truncated BPTT. In our experiments the size of K - which we report as a percentage of the task length (cerebral temporal gradient) - varies but is generally small. For example, for the simple line drawing task we used one-step BPTT (i.e. K = 1; Fig. 2). Note that in the main text, as we focus on describing the simpler case of K = 1 (as used in the simple line drawing task), we use C(at) to refer to the cerebellar feedback prediction from the end of the current horizon, i.e. $C(a_{t+K})$ with K = 1.
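As a rough illustration of how the cerebellar prediction enters learning, the sketch below performs truncated BPTT over windows of K steps and, following the decoupled-interface scheme of Jaderberg et al.33, injects the cerebellar prediction $C(a_{t+K})$ as an additional gradient on the activity at the truncation boundary. It assumes the hypothetical `ccRNN` class sketched in the Results section; function and variable names are illustrative and the released implementation differs in detail.

```python
import torch

def train_truncation(model, xs, targets, loss_fn, opt, K, fb_scale=0.1):
    """One pass over a sequence with truncated BPTT (horizon K) plus cerebellar
    feedback predictions at each truncation boundary (cf. Eq. 1). `targets[t]` may
    be None when external task feedback is unavailable at timestep t."""
    state = None
    for start in range(0, len(xs), K):
        opt.zero_grad()
        loss, h_last = None, None
        for t in range(start, min(start + K, len(xs))):
            out, state, _ = model.step(xs[t], state)
            h_last = state[0]
            if targets[t] is not None:
                step_loss = loss_fn(out, targets[t])
                loss = step_loss if loss is None else loss + step_loss
        predicted_fb = model.cerebellum(h_last).detach()   # C(a_{t+K}), treated as a constant here
        if loss is not None:
            loss.backward(retain_graph=True)               # feedback within the current horizon
        h_last.backward(fb_scale * predicted_fb)           # cerebellar prediction of future feedback
        opt.step()
        state = tuple(s.detach() for s in state)           # cut the graph between truncations
```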
Cerebellar learning
The cerebellar parameters Ψ are themselves learnt, but to optimise a distinct, specialised error $E_{IO}$, which we posit to be computed at the inferior olive, the classical teacher of the cerebellar cortex6,7. This is defined by the difference between the cerebellar output and a target feedback signal $\overline{fb}_t$, i.e. $E_{IO} = \|\overline{fb}_t - C(a_t)\|^2$. Similar to the cerebral network we update the cerebellar parameters using gradient descent: $\Psi \leftarrow \Psi - \alpha_{IO}\Delta\Psi$, where $\Delta\Psi = \frac{\partial E_{IO}}{\partial \Psi}$.
Ideally we would simply set the target feedback as the true (desired) cerebral feedback. However, this would require an arbitrarily long number of steps of true cerebral feedback, exactly what we propose the cerebellar network removes the need for. How should the cerebellum itself learn about the future feedback? One elegant solution, which we take from Jaderberg et al.33, is to combine the currently available error with bootstrapped future cerebellar predictions (i.e. self-predictions). Formally, using the same notation as equation 1, the target for $C(a_t)$ is

$$\overline{fb}_t = \left(\sum_{t'=t}^{t+K} fb_{t'}\,\frac{\partial a_{t'}}{\partial a_t}\right) + C(a_{t+K})\,\frac{\partial a_{t+K}}{\partial a_t} \quad (2)$$
Note the resemblance of equation 2 to equation 1: in each case we consider a mixture of nearby “cerebral” error signals beyond which we rely on cerebellar prediction. It is also useful to compare equation 2 with standard reinforcement learning rules (e.g. temporal difference learning algorithm) which rely on similar bootstrapping principles60.
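A heavily simplified sketch of the corresponding cerebellar learning step is given below, assuming the hypothetical `CerebellarModule` from the Results sketch, one-step horizons (K = 1) and hypothetical tensors `a_t`, `fb_t`, `a_next`; for readability it drops the Jacobian factor that appears in Eq. 2 and simply bootstraps with the cerebellum's own prediction at the next step.

```python
import torch
import torch.nn.functional as F

def cerebellar_update(cerebellum, a_t, fb_t, a_next, io_opt):
    """Inferior-olive-like error E_IO: compare the prediction C(a_t) with a target
    mixing the currently available cerebral feedback fb_t with a bootstrapped
    self-prediction C(a_next) (cf. Eq. 2, Jacobian omitted)."""
    with torch.no_grad():
        target_fb = fb_t + cerebellum(a_next)      # bootstrapped target feedback
    e_io = F.mse_loss(cerebellum(a_t), target_fb)  # E_IO
    io_opt.zero_grad()
    e_io.backward()
    io_opt.step()
    return e_io.item()
```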
Other biological mappings of our framework
Here we describe other possible mappings between the proposed framework (cerebellum as a decoupling machine) and forward and feedback processing in the cerebral cortex.
Cerebellum as a spatial feedback decoupler
Our paper focuses on temporal problems being solved by a cerebral area modelled as a recurrent neural network (RNN) to which a cerebellar network provides predictions of the future errors/feedback with respect to that area. An analogous biologically relevant system also arises, however, when one considers cerebral processing in space using feedforward computations involving several distinct regions (Fig. S1).
This setup - where the "main" (cerebral) network is a feedforward composition of multiple brain regions - was also considered in Jaderberg et al.33. Now, as opposed to predicting errors which occur strictly at later points in time, the role of the cerebellar network is to predict errors which occur in later brain regions. The result is that an earlier region has access to its feedback (predicted by the cerebellum) without the need to wait for the later forward/backward propagation of spatial activity. Formally, if we assume cerebral processing as a sequence of feedforward computations, $A(x) = (a_N \circ a_{N-1} \circ \dots \circ a_1)(x)$, which defines a final error function $E(A(x))$, then the cerebellar network can provide predicted feedback at a given brain area $i$ as soon as its activity $a_i$ is computed: $C(a_i) \approx \frac{\partial E}{\partial a_i}$.
This perspective would effectively decouple feedback processing across the brain. This interpretation of the model is consistent with cerebellar-thalamo-cerebral projections targeting distal dendrites, which have been proposed as the site of error or feedback encoding which underlies efficient learning68,69.
Cerebellum as a forward decoupler
In classical cerebellar theory, the complement to the forward model hypothesis is the inverse model, in which the cerebellum predicts motor commands5, or even implicit mental predictions to solve a problem24, directly. Again we can consider this under the proposed framework, but now using its forward prediction version.
In this case the role of the cerebellum is not to predict future feedback activity, but the feedforward activity itself, i.e. $\hat{a}_j = C(a_i)$ for some later region $j > i$. $\hat{a}_j$ is fed as a replacement to region $j$, making it forward decoupled from the potentially slower intermediate processing $a_j \circ a_{j-1} \circ \dots \circ a_{i+1}$.
Functionally this would provide the organism with fast inputs (e.g. motor commands or potential mental solutions) without the need for potentially slower cerebral processing (Fig. S1 b). We also point out the relevance of direct predictions of later activity in the temporal case, where the cerebellum strictly predicts motor activity at later timesteps, as suggested in80. A broad comparison between this framework and the cerebellar internal model hypothesis is shown in Table S1.
Experimental details
To reduce learning instability we scale the cerebellar predicted feedback (Eq. 1) by 0.1, following Jaderberg et al.33. Both cerebral and cerebellar parameters are optimised using the feedback described above together with ADAM for overall learning efficiency81. Training the model involves iterating over training sessions for a given dataset, which is split into batches. For better learning stability model parameters were updated at the end of each batch.
In each experiment all initial RNN parameters are drawn from a uniform distribution $\mathcal{U}(-\frac{1}{\sqrt{n_{RNN}}}, \frac{1}{\sqrt{n_{RNN}}})$, where $n_{RNN}$ is the number of RNN units. The weights of the readout network and the feedforward weights of the cerebellar network (other than the final layer) are initialised according to $\mathcal{U}(-b_k, b_k)$, where $b_k$ denotes the "kaiming bound" as described by He et al.82 (slope $\sqrt{5}$), and the biases are drawn from $\mathcal{U}(-\frac{1}{\sqrt{n_{in}}}, \frac{1}{\sqrt{n_{in}}})$, where $n_{in}$ denotes the input size of the layer. The last layer (both weights and bias) of the cerebellar network is zero-initialised, so that the estimated feedback at the start is zero33.
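This initialisation corresponds closely to PyTorch defaults; a sketch, assuming the hypothetical `ccRNN` class from the Results section, might look as follows (the granule-layer biases keep the `nn.Linear` default, which already matches the bound above).

```python
import math
import torch.nn as nn

def initialise(model, n_rnn):
    """Uniform LSTM weights, Kaiming-uniform cerebellar hidden weights and a
    zero-initialised cerebellar output layer so predicted feedback starts at zero."""
    bound = 1.0 / math.sqrt(n_rnn)
    for p in model.cerebral.parameters():
        nn.init.uniform_(p, -bound, bound)                               # U(-1/sqrt(n_RNN), 1/sqrt(n_RNN))
    nn.init.kaiming_uniform_(model.cerebellum.granule.weight, a=math.sqrt(5))
    nn.init.zeros_(model.cerebellum.purkinje.weight)                     # zero-initialised output layer
    nn.init.zeros_(model.cerebellum.purkinje.bias)
```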
During learning, we employ truncated BPTT as follows. Given an input sequence of N timesteps x1, x2, …, xN and a temporal horizon K, we divide the sequence into K-sized truncations. In other words, the sequence is now made up of truncations $(x_1,\dots,x_K), \dots, (x_{(m-1)K+1},\dots,x_{mK}), (x_{mK+1},\dots,x_N)$, where N = mK + r for non-negative integers m, r with 0 ≤ r < K. Note that, along with the value of K, how well the sequence divides into truncations (i.e. the values m, r) can itself influence learning (e.g. Fig. 3d).
In all visuomotor tasks, to test the effect of predicted feedback against the availability of task feedback (i.e. the timesteps at which an external teaching signal is provided), we vary the external feedback interval. Given a feedback interval n, the target is only available every n timesteps. This is analogous to the rate at which one receives sensory information whilst performing a task (e.g. drawing freehand).
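For concreteness, sparse external feedback can be modelled by masking the per-timestep targets (a hypothetical helper, compatible with the training sketch above, where unavailable targets are marked as `None`).

```python
def apply_feedback_interval(targets, interval):
    """Keep the external target only every `interval` timesteps; elsewhere mark it
    as unavailable (None), so no task feedback is generated at those steps."""
    return [t if (i + 1) % interval == 0 else None for i, t in enumerate(targets)]
```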
In general, (standard) hyperparameters were selected by hand after a few trial runs. We used the PyTorch library for all neural network models. Our implementation is based on that of github.com/koz4k/dni-pytorch. The code used for our experiments is available at https://github.com/neuralml/ccDNI.
Delta and normalised error
To calculate the delta and normalised error with respect to a given model we take the difference or ratio of total errors during learning (all training sessions). For example, the normalised error of the ccRNN with respect to the cRNN is $\frac{\sum_s E_s^{\mathrm{ccRNN}}}{\sum_s E_s^{\mathrm{cRNN}}}$, where s indexes training sessions. Note that in the ablation case we compare against a "healthy" ccRNN and only consider the respective errors post-ablation, e.g. the normalised error for a model with cerebellar ablation at session 50 is $\frac{\sum_{s>50} E_s^{\mathrm{ablation}}}{\sum_{s>50} E_s^{\mathrm{ccRNN}}}$.
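As a trivial sketch, with hypothetical per-session error lists:

```python
def normalised_error(errors_model, errors_reference, from_session=0):
    """Ratio of total errors over training sessions, e.g. ccRNN relative to cRNN;
    `from_session` restricts the sums to post-ablation sessions when needed."""
    return sum(errors_model[from_session:]) / sum(errors_reference[from_session:])
```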
Cerebro-cerebellar coupling
To analyse how the coupling between the cerebral and cerebellar networks changes over learning we consider the (absolute) Pearson correlation between a given cerebral (LSTM) unit and a given unit in the cerebellar hidden (granular) layer over different bins during training. Values given are the average correlation over all RNN/cerebellar unit pairs.
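A sketch of this measure, assuming activity matrices of shape time x units gathered within a training bin (names and shapes are our assumptions):

```python
import numpy as np

def cerebro_cerebellar_coupling(rnn_act, gc_act):
    """Average absolute Pearson correlation over all cerebral (LSTM) / cerebellar
    (granule-layer) unit pairs. rnn_act: (T, n_rnn); gc_act: (T, n_gc)."""
    rnn_z = (rnn_act - rnn_act.mean(0)) / rnn_act.std(0)   # z-score each cerebral unit
    gc_z = (gc_act - gc_act.mean(0)) / gc_act.std(0)       # z-score each cerebellar unit
    corr = rnn_z.T @ gc_z / rnn_act.shape[0]               # (n_rnn, n_gc) pairwise correlations
    return np.abs(corr).mean()
```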
Computing details
All experiments were conducted on the BluePebble supercomputer at the University of Bristol, mostly on GPUs (GeForce RTX 2080 Ti) and some on CPUs. We estimate the total compute time (including unreported results) to be on the order of ~2000 hours.
Simple line drawing visuomotor task
In the line drawing task, an LSTM network receives a discrete input cue which signals the network to either (1) stay at zero or (2) draw a line in 2D space over a period of 10 timesteps. Here we set 6 distinct non-zero input-target pairs $(x_i, y_i)$, where each input $x_i$ is a (one-dimensional) integer ∈ {±1, ±2, ±3} and the targets are lines whose end points lie equidistantly on a circle centred on the origin with radius 10. To make the task more realistic we also consider a 7th target, for which the input cue is 0 and the target remains at zero throughout, so that the network must remain quiet at the centre of the drawing screen; this models periods in which the animal is not actively performing the task. Once an input cue is received at timestep t0, the model receives no new information (i.e. all future input is set to zero). The model is trained to minimise the mean squared error (MSE) between its output and the cue-based target.
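A sketch of how such targets can be constructed is given below; the exact end-point placement and line parameterisation are our assumptions, not those of the released code.

```python
import numpy as np

def make_targets(n_lines=6, radius=10.0, n_steps=10):
    """Line drawing targets: straight lines of equally spaced points towards end
    points placed equidistantly on a circle of radius 10, plus a stay-at-zero target."""
    angles = 2 * np.pi * np.arange(n_lines) / n_lines
    ends = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (6, 2) end points
    fractions = np.linspace(1.0 / n_steps, 1.0, n_steps)[:, None]        # fraction of line per step
    lines = fractions[None, :, :] * ends[:, None, :]                     # (6, 10, 2) trajectories
    quiet = np.zeros((1, n_steps, 2))                                    # 7th target: remain at zero
    return np.concatenate([lines, quiet], axis=0)                        # (7, 10, 2)
```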
The cerebral network is modelled by one hidden layer of 50 LSTM units and the cerebellar network by one hidden layer of 400 neurons. The learning rate is set to 0.001. Each epoch comprises 20 batches of 50 randomised examples. Unless explicitly stated we use a truncation size of K = 1, which covers 10% of the total task duration. Model results are averaged over 10 random seeds (with error bars), where each seed determines the initial weights of the network.
Online visuomotor tasks
For each online visuomotor task (Fig. 3) we use a standard dataset of handwritten digits (MNIST dataset). The model receives the same temporal input in all tasks, and the tasks are only differentiated by the desired model output. Given a 28 x 28 handwritten digit as input, at timestep i the model receives the pixels from row i of the image, so that the input is of dimension 28 and is presented over 28 timesteps.
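That is, each image becomes a 28-step sequence of 28-dimensional inputs (a minimal sketch):

```python
def image_to_sequence(image):
    """Split a 28x28 digit into a sequence: the input at timestep i is row i."""
    return [image[i] for i in range(image.shape[0])]   # 28 inputs, each of dimension 28
```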
In each case we have one hidden layer of 30 LSTM units in the main model and one hidden layer of 300 hidden units in the feedforward cerebellar network. Data was presented in batches of 50 with a learning rate of 0.0001.
Training and validation data was assigned a 4:1 split, containing 48000 and 12000 distinct image/number pairs respectively. Unless explicitly stated, the truncation value was K = 3 which is ~ 10% of the task duration. Model results are presented over 10 random seeds.
Online line drawing visuomotor task
In this variant each MNIST image of a digit 0-9 is allocated an associated xy position on the edge of a circle centred at 0 with radius 10, and the model must follow a line of equally spaced points towards that position (Fig. 3a, left). With the model output being a vector of size 2, the training loss is defined by the mean squared error (MSE) between the output of the model and the points forming the target line.
Online digit drawing visuomotor task
Like the online line drawing task, in this variant the model outputs a sequence of 2D coordinates corresponding to the input image. The target sequence, however, is now of a highly non-linear form, and in this case is a template of the number given as input (Fig. 3a, middle). The model is then trained to minimise the MSE between the model output and that target shape.
For each digit, the corresponding target drawing lies in [0,1] x [0,1], such that the gap between each successive point is equivalent. All model drawings begin in the top left corner (except for digit 1 which begins below-right). MSE scores are reported as 100 times their raw values to ease comparison with the line drawing case.
Online visual discrimination
This case differs from the others as it is a classification (or decision making) task, where at the end of the presentation of the MNIST image the model must decide which number the digit belongs to (between 0 and 9). Since the decision is only made at the end of the sequence and targets are unavailable at intermediate points, this is a task with hard temporal credit assignment. The output of the model is a vector of probabilities of size 10 (one entry for each number), and the model was trained to maximise the likelihood of the target number using a standard cross-entropy error function.
Visual-language task
The architecture for the caption generation task consists of a pretrained convolutional neural network (CNN) coupled with an RNN (LSTM). The cerebellar network only communicates with the LSTM. The LSTM network has one layer of 256 LSTM units and the cerebellar network has two hidden layers (i.e. here we explicitly model a layer of granule cells and one of Purkinje cells) of 1024 neurons.
The process from image to model-generated caption follows previous work83 and is described next. As part of image preprocessing and data augmentation, which helps prevent model overfitting, a given image is randomly cropped to size 224 x 224, flipped horizontally with even chance, and appropriately normalised before being given to a pretrained ResNet model84. A feature vector X of size 256 is thus obtained and passed to the LSTM at timestep 0. The LSTM is subsequently presented the "gold standard" caption one word per timestep, each time learning to predict the next word (unsupervised task); i.e., at timestep t the model learns to output $P(w_{t+1} \mid w_1, \dots, w_t, X)$. The network simultaneously learns a word embedding, so that each word wi is first transformed to a feature vector of size 256 before being given as input to the LSTM (as illustrated in Fig. 8a). With a preset vocabulary of 9956 distinct words, the final output of the model (P(wi)) is a probability vector of size 9956.
We found the models to be generally prone to overfitting the training data. For this reason, we apply dropout (during training as described by Srivastava et al.85) on the input to the LSTM, where a given input element is set to zero with p = 0.5 probability. Once training is complete the models can generate their own captions to previously unseen images (Figs. 8, S10). Given an image at timestep 0, the model output at timestep i is the word with the highest probability, and the same word is then provided as input to the model at timestep i + 1. In this way the model can autonomously output an entire sequence of words which forms a predicted caption. In the (highly) rare case where the model generates a sequence of > 20 words, we consider only the first 20 words as its caption.
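A sketch of this generation loop (greedy decoding) is given below, assuming an LSTM-based model with the `step` interface used in the earlier sketches, a learned `word_embed` layer, and hypothetical start/end token ids; it is illustrative only.

```python
import torch

def generate_caption(model, image_features, word_embed, start_id, end_id, max_len=20):
    """Feed the image feature at t=0, then repeatedly feed back the most probable
    word until an end token is produced or 20 words are generated."""
    out, state, _ = model.step(image_features, None)        # image presented at timestep 0
    token = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        out, state, _ = model.step(word_embed(token), state)
        token = out.argmax(dim=-1)                           # most probable next word
        if token.item() == end_id:
            break
        caption.append(token.item())
    return caption
```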
We used the COCO training dataset ILSVRC-2012-CLS1,50, which holds 414113 image-caption pairs with 82783 unique images, while the held-out validation set (used for Fig. 8b, c) holds 202654 image-caption pairs with 40504 unique images; note that each image therefore has ~5 distinct gold standard captions. Training takes place in batches of 100 image-caption pairs, with a learning rate of 0.001. Model performance is averaged over 10 random seeds. Performance is quantified in bits per word (BPW), which measures how well the model predicts the validation set. More specifically, if a model assigns high probability to the validation set (low BPW), it is not surprised by it, indicating a good model of the language.
In order to judge the models beyond their learning curves in BPW, we quantify their ability to generate captions using a variety of language modelling metrics popular in the field of language evaluation. In particular, we compare model-generated captions against the gold standard captions using standard metrics in language modelling. We use the Semantic Propositional Image Caption Evaluation (SPICE) metric, referred to as caption score. This metric has been shown to be more accurate as it better captures the semantic structure of the generated captions52.
Our code implementation is based on https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning.
Demixed principal component analysis
To study the response dynamics specific to task variables we perform demixed principal component analysis (dPCA)86. Demixed PCA extracts low-dimensional components that explain maximum population variance constrained by task specific variables, such as the input stimulus. As a result we obtain principal components that are specific to task variables. The simulated neural data we provide as input to dPCA is a three-dimensional array (n, s, t) with neuronal activity (concatenated across seeds), stimulus identity and time, respectively.
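As a simplified, illustrative stand-in for the full dPCA analysis86 (which we do not reimplement here), cue-specific components can be approximated by applying PCA to the stimulus-marginalised activity:

```python
import numpy as np
from sklearn.decomposition import PCA

def cue_specific_components(activity, n_components=2):
    """activity: array of shape (n_neurons, n_stimuli, n_time). Returns cue-specific
    components of shape (n_components, n_stimuli, n_time), from which e.g. pairwise
    Euclidean distances between cues can be computed (cf. Fig. 2e)."""
    n, s, t = activity.shape
    marginal = activity - activity.mean(axis=1, keepdims=True)   # remove cue-independent signal
    X = marginal.reshape(n, s * t).T                             # samples: stimulus x time
    Z = PCA(n_components=n_components).fit_transform(X)          # (s*t, n_components)
    return Z.T.reshape(n_components, s, t)
```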
Statistical analysis
Because the initial conditions of these types of models influence their learning trajectories, we run our models across 10 different randomly chosen seeds. Significance was then tested using a paired t-test across the different seeds on the relative changes (e.g. ccRNN relative to cRNN). Significance levels are represented as * (p < 0.05), ** (p < 0.01), *** (p < 0.001) and **** (p < 0.0001).
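For example, with hypothetical per-seed summary metrics for each model:

```python
from scipy import stats

def paired_seed_test(metric_ccrnn, metric_crnn):
    """Paired t-test across the 10 seeds on a per-seed summary metric
    (e.g. total training error) for ccRNN versus cRNN."""
    return stats.ttest_rel(metric_ccrnn, metric_crnn)
```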
Measuring cerebro-cerebellar feedback similarity
The learning curves of the ccRNN plotted against the cRNN with a limited feedback horizon highlight the benefit of the feedback predicted by the cerebellar network. This indicates that the predicted feedback can indeed approximate the desired cerebral feedback. To verify this, we quantified the cerebro-cerebellar feedback similarity using the cosine similarity between the predicted feedback and the optimal temporal cerebral feedback (as derived by gradient descent). Specifically, given two arbitrary vectors x and y, where x is the predicted feedback and y the true optimal feedback, $\mathrm{cossim}(x, y) = \frac{x \cdot y}{\|x\|_2 \,\|y\|_2}$, where · denotes the dot product and $\|\cdot\|_2$ the Euclidean norm.
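Equivalently, in code (flattening the feedback over units; a trivial sketch):

```python
import torch
import torch.nn.functional as F

def feedback_similarity(predicted_fb, true_fb):
    """Cosine similarity between the cerebellar predicted feedback and the optimal
    cerebral feedback: x.y / (||x||_2 ||y||_2)."""
    return F.cosine_similarity(predicted_fb.flatten(), true_fb.flatten(), dim=0).item()
```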
It is important to emphasise that the true feedback is never actually provided to the model (as it goes beyond the feedback horizon K considered). Instead the cerebellum only learns through a combination of cerebral feedback within horizon K and a bootstrapped term (see details above). This measure therefore allows us to evaluate how much information about this ideal feedback the cerebellum can approximate. The final result is shown in Fig. 5a. To provide the reader with intuition about how external feedback available only at the end of the task leads to a gradual loss of the ability of the cerebellum to make good predictions for earlier points in the task, we highlight two task variants in which the task error is only defined at the end: the online visual discrimination task and a simple line drawing variant where the external task feedback is only provided at the end of the task.
Supplementary Information
Acknowledgements
We would like to thank the Neural & Machine Learning group, Paul Anastasiades, Paul Dodson, Conor Houghton, Laurence Aitchison, Cian O'Donnell, James M. Shine, Max Jaderberg, Nadia Cerminara and Jasmine Pickford for useful feedback. We would also like to thank Samia Mohinta and Milton Llera Montero for help with model analysis and training. JP was funded by an EPSRC Doctoral Training Partnership award (EP/R513179/1) and EB by the Wellcome Trust (220101/Z/20/Z). This work made use of the HPC system Blue Pebble at the University of Bristol, UK.
Footnotes
↵1 This is a commonly used dataset available for our purposes under a Creative Commons license.
References