A spiking neural-network model of goal-directed behaviour

In mammals, goal-directed and planning processes support flexible behaviour that can be used to face new situations or changed conditions that cannot be tackled through more efficient but rigid habitual behaviours. Within the Bayesian modelling approach to brain and behaviour, probabilistic models have been proposed to perform planning as probabilistic inference. Recently, some models have started to face the important challenge met by this approach: grounding such processes in the computations implemented by brain spiking networks. Here we propose a model of goal-directed behaviour that has a probabilistic interpretation and is centred on a recurrent spiking neural network representing the world model. The model, building on previous proposals on spiking neurons and plasticity rules having a probabilistic interpretation, presents these novelties at the system level: (a) the world model is learnt in parallel with its use for planning, and an arbitration mechanism decides when to exploit the world-model knowledge through planning, or to explore, on the basis of an entropy-based measure of confidence in the world-model knowledge; (b) the world model is a hidden Markov model (HMM) able to simulate sequences of states and actions, so that planning selects actions through the same neural generative process used to predict states; (c) the world model learns the hidden causes of observations, and their temporal dependencies, through a biologically plausible unsupervised learning mechanism. The model is tested with a visuomotor learning task and validated by comparing its behaviour with the performance and reaction times of human participants solving the same task. The model represents a further step towards the construction of an autonomous architecture bridging goal-directed behaviour as probabilistic inference to brain-like computations.

Author summary

Goal-directed behaviour relies on brain processes supporting the planning of actions based on the prediction of their consequences before performing them in the environment. An important computational modelling approach to these processes sees the brain as a probabilistic machine implementing goal-directed processes that rely on probability distributions and operations on them. An important challenge for this approach is to explain how these distributions and operations might be grounded in the brain's spiking neurons and learning processes. Here we propose a hypothesis of how this might happen by presenting a computational model of goal-directed processes based on artificial spiking neural networks. The model presents three main novelties. First, it can plan even while it is still learning the consequences of actions, by deciding whether to plan or to explore the environment based on how confident it is in its predictions. Second, it is able to 'think of' alternative possible actions, and their consequences, by relying on the low-level stochasticity of neurons. Third, it can learn to anticipate the consequences of actions in an autonomous fashion based on experience. Overall, the model represents a novel hypothesis on how goal-directed behaviour might rely on the stochastic spiking processes and plasticity mechanisms of the brain's neurons.

In mammals, the acquisition and consolidation of instrumental behaviour involves two sets of processes, one underlying flexible goal-directed behaviour, used in particular to find solutions to new problems or face changing conditions, and the other related to habits, forming stimulus-response behaviour used to efficiently but inflexibly face familiar conditions [1][2][3]. As also highlighted in the computational literature [4], goal-directed processes are model based; that is, they rely on an internal representation of the external world (world model) to internally simulate (planning) the consequences of actions, or action sequences, usable to achieve desired world states (goals) before executing them in the environment [4][5][6][7]. When the agent has a model of the relevant part of the world and has to accomplish a new goal, goal-directed behaviour allows it to solve the task on the basis of planning and the world model. This is thanks to the fact that the world model represents the general dynamics of the world, in particular how it responds to the agent's actions, and so it can be used to pursue any goal (i.e., it is goal independent). The simulated achievement of the new goal might possibly be marked by an internal reward [8]. To an external observer the agent appears to solve the new task 'on the fly' or 'by insight'. Instead, habitual behaviour is model free, in the sense that it relies on actions directly triggered by stimuli (habits) and does not require a world model anticipating their outcomes [4,6,9]. Habits are task dependent as they rely on stimulus-response associations that can lead the agent to specific desired world states. Given a new desired state, the agent thus needs repeated experience of such a state to discover and learn by trial and error the new stimulus-response associations leading to it.

When a goal-directed system encounters a new task that involves an unknown part of the environment, or a part of the environment that has changed, it first needs to learn a model of it (or to update the existing model) before using it for planning. In this respect, goal-directed behaviour involves two subsets of processes, which tend to characterise two successive phases when a new problem or a changed environmental condition is faced. The first subset of processes is directed to the exploration of the environment to form an internal model of it, while the second subset is directed to the exploitation of the acquired knowledge to plan and execute actions successfully accomplishing the desired goal [10,11]. Here we consider the early phases of the solution of a new task, involving either a new environment or a new goal, and hence we focus on goal-directed behaviour and its exploration/exploitation processes.

In the brain, goal-directed behaviour relies on the ventral/associative basal ganglia and frontal cortex supporting the anticipation of the world dynamics and action consequences; instead, habitual behaviour relies on the motor basal ganglia and sensorimotor/premotor cortices able to acquire stimulus-response associations by reinforcement learning [9,[12][13][14]. The brain processes underlying goal-directed behaviour have been interpreted within different computational frameworks.
A current influential framework proposes that the brain implements goal-directed behaviour and planning through probabilistic representations and inferences, and this has been shown with specific models (e.g., [20][21][22]). These models rely on various probabilistic processes to represent the world, some of which are shown in Fig 1 through their corresponding graphical models.

Hidden Markov Models (HMMs) are one important means used to represent the world dynamics [23,24]. An HMM assumes that the agent cannot directly access the states of the world (they are 'hidden' to it) but can only infer them on the basis of noisy information from sensors. The model thus internally represents the states of the world as probability distributions over the possible hidden causes of observations, in particular with a different distribution for each time step. The probability distribution over states at each time step is assumed to depend only on the state of the previous time step (Markov property). The model also internally represents the probability distribution over the possible observations and assumes that it depends only on the current state. These representations support the implementation of planning as probabilistic inference [20][21][22].
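To make the generative structure just described concrete, the following minimal Python sketch samples a sequence of hidden states and observations under the two Markov assumptions above; the dimensions and probability values are arbitrary placeholders, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HMM: 3 hidden states, 4 possible observations.
K, O = 3, 4
pi = np.array([0.6, 0.3, 0.1])            # prior over the initial hidden state
A = np.array([[0.1, 0.8, 0.1],            # A[s, s'] = p(s' | s): state transitions
              [0.2, 0.2, 0.6],
              [0.7, 0.2, 0.1]])
B = np.array([[0.7, 0.1, 0.1, 0.1],       # B[s, o] = p(o | s): observation model
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])

def sample_sequence(T=5):
    """Generate hidden states and observations: s_t depends only on s_{t-1}
    (Markov property) and o_t depends only on s_t."""
    s = rng.choice(K, p=pi)
    states, obs = [], []
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(O, p=B[s]))
        s = rng.choice(K, p=A[s])
    return states, obs

print(sample_sequence())
```

Planning as probabilistic inference then amounts to conditioning such a generative process on desired future observations.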

Probabilistic models have the strength of capturing the uncertain nature of the world and the possible probabilistic representations and inferences that the brain might employ to represent it. However, their use as models of the brain, and not only of behaviour, encounters an important challenge; namely, the fact that the probability distributions that these models commonly use directly involve abstract/high-level aspects of cognition and behaviour, such as the probability distributions over world states, actions, and observations, and so this opens up the problem of explaining how such distributions and the inferences on them could rely on the firing of brain neurons [17,25,26].

One important possibility is that the needed probability distributions could rely on the probability distributions of the spikes of neurons, sampled through the actual spikes, and that the connections between neural populations, undergoing experience-dependent plasticity, support the conditional probabilities underlying the needed probabilistic inferences [21,[27][28][29][30]. An interesting approach to implement this idea, on which we build here, proposes mechanisms to ground important building blocks of probabilistic models on spiking neural networks similar to those of the brain, on some typical connectivity patterns of the cortex, and on biologically plausible plasticity rules [24,[31][32][33]. This approach proposes how spiking networks and spike-timing dependent plasticity

(STDP) could model the learning of hidden causes of observations [32]. In particular, a suitable connectivity (the winner-take-all pattern relying on lateral inhibition) and STDP can lead to the emergence of circuits able to identify the hidden causes of observations. Moreover,

STDP can also support the formation of probabilistic dependencies between such hidden causes, capturing their relations in time. This can instantiate an HMM usable to internally represent the perception of sequences of events [24,32,33].

Recently, mechanisms such as these have been used in recurrent spiking neural network models to implement planning [34][35][36]. These models are the state of the art in the realisation of probabilistic models of planning grounded on biologically plausible spiking neural networks. In these models, a two-dimensional neural map of spiking neurons with lateral connectivity is used as a model of the world, with the world consisting in a scenario involving navigation or robot motion tasks. Actions are then derived from the planned state sequences on the basis of the kinematics, either offline [35] or using a dedicated layer [36].

Here we propose a neural spiking architecture for probabilistic planning that builds on previous contributions, overcomes some of their limitations, and introduces some novelties. In particular, the architecture uses the mechanisms proposed in [24] (plus a reinforcement learning rule) to implement an HMM with a spiking recurrent neural network. A recurrent spiking neural network to implement the world model and planning was also used in [35,36]. With respect to these previous models, our architecture presents a number of structural and functional novelties at the system level.

A first novelty of the architecture is that the learning of the world model is intermixed with its use in planning, as required by the fact that goal-directed behaviour often tackles tasks involving new or changed parts of the world. This implies the non-trivial challenge that planning must be performed on the basis of a partial model of the world. To face this problem, the model uses an arbitration component, inspired by the mechanisms proposed in [37,38] for the arbitration between goal-directed and habitual behaviour, which decides whether to plan or to explore on the basis of an entropy-based measure of confidence in the world-model knowledge. A second novelty is that the world model is an HMM able to simulate sequences of states and actions, so that planning selects actions through the same neural generative process used to predict states. A third novelty is that the world model learns the hidden causes of observations, and their temporal dependencies, through the biologically plausible unsupervised learning mechanism proposed in [24]. This is an advancement in the biological plausibility of the architecture with respect to using supervised learning, as done in previous spiking network models of goal-directed behaviour [35,36], as the system is able to autonomously learn internal representations of the observed events and event sequences without the need of an external 'teacher' suggesting them.

A last contribution of this work is the use of the architecture to reproduce and interpret the results of the experiment presented in [12], where human participants learn to solve a visuomotor learning task. This allowed the validation of the model, in particular to check if the learning processes of the world model lead the model to match human performance, and if the arbitration mechanism, employing a variable planning time, leads it to reproduce the reaction times exhibited by human participants. This target experiment was also investigated by the model proposed in [38]. Although that model did not aim to bridge probabilistic modelling to neural mechanisms as done here, it used an interesting mechanism of arbitration between goal-directed and habitual behaviour based on entropy, as done here.

The rest of the paper is organised as follows. Section 1 describes the model architecture and functioning and the visuomotor learning task used to validate it.

This section explains the model architecture and functioning and the visuomotor task used to test it [12]. Although the main objective of this work is to propose the novel spiking neural model of goal-directed behaviour, the section starts by illustrating the visuomotor task in order to use it as an example while illustrating the model.

In the task proposed in [12], the participants are supposed to discover the correct associations between three different stimuli and three, out of five, possible motor responses. During the experiment, three different colours are projected on a screen in a pseudo-randomised order, in particular through sixty triplets each involving each colour once in a random order. After each colour perception, the participants have to press one of the five buttons of a keyboard with their right hand. Once the action is performed, a feedback on the screen informs the participants whether the association between the colour and the performed action was correct or wrong. Unbeknownst to the participants, a fixed number of errors is used to dynamically consider the action performed at a certain attempt as correct for the particular colour: the correct action for S1 comes after one error (so at the second attempt), for S2 after three errors (fourth attempt), and for S3 after four errors (fifth attempt). The activity of the participants is supposed to be organised in two phases: an initial exploratory phase where they search for the correct associations and a second exploitation phase where they should repeat the found correct associations until the end of the task (Fig 2). The participants are thus not supposed to explore all the possible colour-action associations since their objective is to discover and exploit one correct action for each colour.

Fig 2. The visuomotor learning task used to validate the model. Three colour stimuli are presented to the participants in a pseudo-random order, in particular in triplets containing all three colours once in a random order. The action consists in pressing one out of five possible buttons with the right hand. The figure shows four triplets of an ideal participant that never repeats an error for a certain colour and does not forget a found correct action. The colour receiving the first action in the second triplet is marked as the first stimulus (S1), and such action is considered the correct one for it. The colour different from S1 receiving the first action in the fourth triplet is marked as the second stimulus (S2), and such action is considered the correct one for it. The colour different from S1 and S2 receiving the first action in the fifth triplet is marked as the third stimulus (S3), and such action is considered the correct one for it.

Input layer. The input layer is composed of ten neurons, each one representing one of the stimuli, actions, and outcomes of the task. It sends afferent connections to the associative layer.

Goal layer. The goal layer encodes the pursued goal, with one neuron for each of the two possible goals (achieving a positive or a negative feedback). It also sends afferent connections to the associative layer.

Associative layer. The associative layer, forming the core of the model, is composed of 400 neurons, all connected to each other but without self-connections. The associative layer receives the mentioned afferent connections from the input and goal layers, and sends all-to-all efferent connections to the neurons of the output layer.

Output layer. Like the input layer, the output layer is composed of ten neurons, each one representing one of the stimuli, actions, and outcomes of the task. The output layer receives the mentioned afferent connections from the associative layer.

Together, the four layers instantiate a neural HMM implementing the system's world model used for planning. In particular, the input and output layers together form the observation part of the HMM, and have an identical structure. Given the unidirectional nature of brain connections, we used the two layers to separately implement the two functions played by the observation part of the HMM, namely the input from the external environment and the possible generative reconstruction of such input based on internal causes. The associative layer implements the probability distribution over the hidden causes of the observations and the probabilistic temporal dependencies between them. The goal layer can condition the latter distributions so as to possibly increase, with learning, the probability of sampling simulated stimulus-action-outcome sequences that lead to the desired goal.
Alongside the planning components, the system is formed by the following additional components used for exploration and arbitration.

Arbitration component. This component decides when to plan, to explore, or to act in the world. Currently the component does not have a neural implementation. The decision is made on the basis of the knowledge of the world model, measured as the average entropy of its probability distribution during a planning cycle. When entropy is lower than a threshold, and a goal has not been found, planning continues, whereas if a goal has been found the corresponding action is performed in the environment. If entropy is above the threshold, then control is passed to the exploration component, which selects the action to perform in the world. A planning cycle represents the internal simulation of the trial events (colour, action, action outcome).
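As a minimal sketch of this decision rule (the function names and threshold handling are our assumptions; the paper describes the component only at this level of detail), the average entropy of the layer's firing distributions over a planning cycle can be compared with the threshold as follows:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p + eps))

def arbitrate(cycle_distributions, threshold, goal_found):
    """Decide the next step after a planning cycle, following the text:
    confidence in the world model is measured as the average entropy of
    the layer's distribution over the steps of the cycle."""
    avg_h = np.mean([entropy(p) for p in cycle_distributions])
    if avg_h >= threshold:
        return "explore"                       # low confidence: hand control to exploration
    return "act" if goal_found else "plan"     # confident: act if the goal was matched, else keep planning
```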

The input layer is activated with the observed colour for a certain portion of the planning cycle (here 1/3 of its duration of 15 steps; for simplicity, one planning cycle is assumed to last the same number of steps as the actual trial, as done in [24]).

During one planning cycle, the arbitration mechanism operates as follows. When the sampled sequence does not lead to a goal matching, the system performs two operations. First, it updates the goal-associative connections so as to lower the goal-conditioned probability of the wrong sampled sequence. Second, it lowers the entropy threshold by a certain amount, thus ensuring that with time the probability of terminating the planning process increases and the system does not get stuck in planning.

After planning terminates, if the system has not found an action that leads to a goal matching, the action is produced by the exploration component and then performed in the environment.

[Algorithm 1: pseudocode of the overall model functioning; only fragments are recoverable. The goal is set to 'achieve a positive feedback' during the 60 visuomotor trials (if VisuoMotorTrials ≤ 60 then Goal ← AchievePositiveFeedback, else Goal ← AchieveNegativeFeedback); after the action is performed, the outcome is observed (Outcome ← Observe(Environment)) and the world model is trained (TrainWorldModel(Input, Action, Outcome)); further learning depends on whether Outcome = Goal.]

Note that when a trial starts, the architecture performs a planning cycle to evaluate entropy: this hypothesis is based on the fact that the task is novel. In a more general case where tasks could be familiar, a common habit/planning arbitration process might evaluate if a habit is available before triggering planning and the planning/exploration arbitration process considered here. Moreover, in case of failure the exploration component is trained so as to avoid selecting again actions found to be wrong (this hypothesis is inspired by the 'inhibition-of-return' mechanism of visual exploration, leading to the exclusion from exploration of already explored items [39]); the component is instead not trained in case of success, as this would amount to habitual learning, not possible in few trials. These hypotheses were identified by searching for the conditions allowing the correct reproduction of the target human data of the visuomotor task while fulfilling the challenging constraint that planning has to take place while learning the neural world model.
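Only fragments of the original pseudocode survive, so the following Python-style sketch reassembles the trial loop from those fragments and the surrounding prose; the env/model interfaces (planning_cycle, exploration.select, punish_sequence, train_to_avoid) are hypothetical names, and arbitrate is the sketch given above.

```python
def run_trial(env, model, goal, threshold, delta):
    """One trial, assembled from the surviving pseudocode fragments and the prose:
    plan while confident, otherwise explore; then act, observe, and learn."""
    colour = env.present_colour()
    action = None
    while action is None:
        seq = model.planning_cycle(colour, goal)           # sample a colour-action-outcome sequence
        mode = arbitrate(seq.distributions, threshold, seq.predicts(goal))
        if mode == "act":
            action = seq.action
        elif mode == "explore":
            action = model.exploration.select(colour, goal)
        else:                                              # keep planning
            model.punish_sequence(seq, goal)               # lower its goal-conditioned probability
            threshold -= delta                             # ensure planning eventually terminates
    outcome = env.step(action)                             # fragment: Outcome <- Observe(Environment)
    model.train_world_model(colour, action, outcome)       # fragment: TrainWorldModel(...)
    if outcome != goal:
        model.exploration.train_to_avoid(colour, goal, action)  # 'inhibition of return'
    return outcome
```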

Based on these mechanisms, at the beginning of the visuomotor test the model tends to sample random stimulus-action-outcome sequences because the world model has no knowledge. The arbitration component thus quickly passes control to the exploration component, which decides which action to execute, and this is performed in the environment. With the accumulation of experience trials, the world model improves by learning the hidden causes of observations (colours, actions, feedback) and the time dependencies between them. This leads the arbitration component to measure a higher confidence in the world model, so planning continues and samples with a higher probability (hidden causes of) colour-action-feedback sequences that actually exist in the world. When one of these sequences leads to an action that predicts a goal achievement in the output layer, and the action is actually successful when performed in the environment, this increases the goal-conditioned probability of sampling such sequence, so that the next time the same colour is encountered the sequence is readily selected by the planning process.

The neurons of the associative layer send excitatory connections to, and receive a common inhibitory signal from, a pool of inhibitory interneurons. This connectivity pattern tends to keep constant the overall firing rate of the layer. More in detail, the membrane potential $u_k(t)$ of a neuron of the model is:

$$u_k(t) = \frac{\tilde{u}_k(t)}{\tau} - i(t)$$

where $\tau$ is a scaling factor, $i(t)$ is the common inhibition received by all neurons, caused by the inhibitory interneurons to which they project, and $\tilde{u}_k(t)$ is the total activation received from other neurons:

$$\tilde{u}_k(t) = \sum_i w_{ki}\, s_i(t) + \sum_g w_{kg}\, s_g(t) + \sum_a w_{ka}\, s_a(t) + n$$

where $w_{ki}$ are the input-associative connection weights, $w_{kg}$ are the goal-associative connection weights, $w_{ka}$ are the internal associative connection weights, $s_i(t)$, $s_g(t)$, and $s_a(t)$ are the incoming spike signals ($s \in \{0, 1\}$) from the neurons of respectively the input, goal, and associative layer, and $n$ is a Gaussian noise component with a standard deviation $\nu$.

We then assume, as in [42], that the firing rate $v_k(t)$ of a neuron $k$, reflecting its spiking probability, is exponentially dependent on the membrane potential:

$$v_k(t) = \bar{v}\, e^{u_k(t)}$$

where $\bar{v}$ is a constant scaling the firing rate. This implies the following dependency of the neuron firing rate on the activation received by all neurons of the layer:

$$v_k(t) = \bar{v}\, \frac{e^{\tilde{u}_k(t)/\tau}}{\sum_{l=1}^{K} e^{\tilde{u}_l(t)/\tau}}$$

where $i(t)$ was assumed to be:

$$i(t) = \ln \sum_{l=1}^{K} e^{\tilde{u}_l(t)/\tau}$$

While the model on which we built assumed continuous time and an inhomogeneous Poisson process to produce the actual spikes of the layer [24], we considered discrete time, a fully Markov dependence between succeeding events, and a constant firing rate at each time step, assumed to be $\bar{v} = 1$ without loss of generality. These assumptions simplified the analysis of the system and did not alter the core functioning of the model, in particular with respect to the effects of the core unsupervised learning rule illustrated below. With these assumptions, the firing-rate equation above becomes a soft-max function where $\sum_{k=1}^{K} v_k(t) = 1$ is the layer's constant total firing, and $v_k(t)$ can be interpreted as $v_k(t) = p_t(k)$, with $p_t(k)$ being a categorical probability distribution indicating the likelihood that the neuron with index $k$ is the one to fire a spike at time $t$ while the other neurons remain silent.
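In code, the discrete-time winner-take-all sampling described by these equations reduces to a softmax over the total activations; this sketch uses arbitrary toy weights and the reconstructed placement of the scaling factor τ, both of which are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_step(W_in, s_in, W_goal, s_goal, W_rec, s_rec, tau=1.0, noise_sd=0.1):
    """One discrete time step of the winner-take-all layer: the lateral
    inhibition i(t) normalises the exponential rates into a softmax, and
    exactly one neuron is sampled to spike (v_bar = 1)."""
    u_tilde = W_in @ s_in + W_goal @ s_goal + W_rec @ s_rec \
              + rng.normal(0.0, noise_sd, size=W_in.shape[0])
    p = np.exp(u_tilde / tau)
    p /= p.sum()                      # equivalent to subtracting i(t) = ln sum exp(.)
    k = rng.choice(len(p), p=p)       # index of the neuron firing at time t
    s = np.zeros(len(p)); s[k] = 1.0
    return s, p

# Toy dimensions: 400 associative neurons, 10 input units, 2 goal units (assumed).
K = 400
W_in, W_goal, W_rec = [rng.normal(0, 0.1, (K, n)) for n in (10, 2, K)]
s_in = np.zeros(10); s_in[0] = 1.0     # observed colour
s_goal = np.zeros(2); s_goal[0] = 1.0  # pursued goal
s_rec = np.zeros(K)                    # previous-step associative spikes
spike, dist = layer_step(W_in, s_in, W_goal, s_goal, W_rec, s_rec)
```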

The neurons of the output layer, receiving afferent connections from the associative layer, have the same activation as the neurons of the associative layer.

The weights of the connections linking the input and associative layers, the associative neurons between them, and the associative and output layers are updated through a Spike-Timing Dependent Plasticity (STDP) rule [43][44][45][46]. In particular, we used the following STDP learning rule from [24,35] to update a connection weight $w_{post,pre}$ linking the pre-synaptic neuron $pre$ to the post-synaptic neuron $post$:

$$\Delta w_{post,pre} = \zeta\, s_{post}(t) \left( s_{pre}(t-1)\, e^{-w_{post,pre}} - c \right)$$

where $\zeta$ is a learning rate parameter, $\Delta w_{post,pre}$ is the size of the connection weight update, applied when the post-synaptic neuron fires at time $t$, and $c$ is a constant. For a given post-synaptic neuron, the rule decreases the weights from pre-synaptic neurons that tend to not fire before it fires. Instead, if the pre-synaptic neuron fires before the post-synaptic neuron ($s_{pre}(t-1) = 1$), then $w_{post,pre}$ increases if $c < e^{-w_{post,pre}}$ and decreases otherwise. This implies that $w_{post,pre}$ will tend to converge to the positive attractor point $w^*_{post,pre} = -\ln(c)$ reached when $e^{-w_{post,pre}} = c$. Overall, for a given neuron the rule tends to form positive incoming connections from neurons that fire just before it fires, and negative connections from all other neurons.
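A compact implementation of the reconstructed rule is sketched below (the exact published form may differ in detail); the final loop numerically checks the attractor $w^* = -\ln(c)$:

```python
import numpy as np

def stdp_update(W, s_pre_prev, s_post, zeta=0.01, c=0.05):
    """Reconstructed STDP step: when a post-synaptic neuron fires at time t,
    weights from pre-synaptic neurons that fired at t-1 move towards the
    attractor w* = -ln(c), while weights from silent ones are depressed."""
    dW = zeta * s_post[:, None] * (s_pre_prev[None, :] * np.exp(-W) - c)
    return W + dW

# A repeatedly co-active pre/post pair converges to the attractor.
W = np.zeros((1, 1))
for _ in range(10000):
    W = stdp_update(W, np.array([1.0]), np.array([1.0]))
print(W[0, 0], -np.log(0.05))   # both close to ~3.0
```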

The connections that form are the means through which the system implements conditional probabilities. For example, initially the associative units $k$, each representing possible hidden causes of observations, tend to fire with a certain prior probability distribution, say $p(k)$. The formation of input-associative connections allows an observation $i$ to generate the posterior conditional probability distribution $p(k|i)$ that for example implies an increased probability of selection of the hidden cause $k$.

Within the associative network, the learning rule leads to the formation of a connectivity that supports a sequential activation of the neurons encoding the hidden causes of the observations, where the sequence reflects the temporal order in which the observations, reflecting the world states, are experienced by the HMM. The reason is that once the hidden causes are formed, based on the input-associative connections, they tend to fire in sequence under the drive of the observations. As a consequence, the learning rule leads each associative neuron to connect with the associative neurons that fired before it and to form negative connections with those that did not fire. In this way, the connections within the associative network tend to form chain-like neural assemblies.

These connections are hence able to represent the temporal dependencies between hidden causes, e.g. between $a$ and $k$ corresponding to two succeeding observations, as conditional probabilities $p(k|a)$. Importantly, if the system observes different events following the initial observation of the trial (e.g., different actions and different outcomes after a certain initial colour), the world model will exploit its stochastic neural processes to represent such possible alternative sequences of events. This is at the core of the architecture's capacity to internally simulate alternative courses of actions and events and hence to plan in a goal-directed manner.

The same learning rule is also used to train the associative-output connections.

Initially, the output layer expresses a probability distribution, say $p(o)$, that tends to be uniform and so, when sampled, generates unstructured observations. With learning, the world model strengthens some connections between the spiking sequences sampled within the associative network and the observations activating the output layer. When the world model samples an internal sequence within the associative network, this leads to the generation of the observations on the basis of the reconstruction probability $p(o|k)$.

Overall, the neural HMM plus the output layer act as an auto-encoder returning as output the input, and able to capture in its internal states the hidden causes of observations: this is similar to what happens in a variational auto-encoder [47], a probabilistic version of the auto-encoder [48], with the difference that the model considered here generates sequences of patterns rather than single patterns. In this respect, the output layer acts similarly to the read-out layer of a dynamic reservoir network [49,50], which is however deterministic.

When the planning process has to generate an action to perform, or a predicted feedback to compare with the goal, the event generated at the output layer is considered to be the one that fired the most during the planning cycle. If the system had to generate sequences of events involving multiple actions and predicted states, one should consider other 'read-out' mechanisms, for example that an event is generated each time the unit encoding it fires a minimum number of spikes in sequence.
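In code, this read-out is simply an argmax over per-unit spike counts accumulated during the planning cycle; the second function sketches the alternative sequence read-out mentioned above (the array shapes and the min_run parameter are our assumptions):

```python
import numpy as np

def read_out(output_spikes):
    """output_spikes: (T, n_units) binary array of output-layer spikes over one
    planning cycle; the generated event is the unit that fired the most."""
    counts = output_spikes.sum(axis=0)
    return int(np.argmax(counts))

def read_out_sequence(output_spikes, min_run=3):
    """Alternative read-out for multi-event sequences: emit an event each time
    a unit fires at least `min_run` consecutive spikes."""
    events, run, last = [], 0, None
    for t in range(output_spikes.shape[0]):
        k = int(np.argmax(output_spikes[t])) if output_spikes[t].any() else None
        run = run + 1 if (k is not None and k == last) else 1
        last = k
        if k is not None and run == min_run:
            events.append(k)
    return events
```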

The goal-associative connection weights are updated on the basis of the success or failure to achieve the goal during planning or when the action is performed in the environment. In particular, the update is done on the basis of the following reinforcement learning rule:

$$\Delta w_{kg} = \eta\, m\, ET_k\, ET_g\, (w_{max} - |w_{kg}|)$$

where $\eta$ represents the learning rate, $m$ is the reward, equal to $1$ if the sequence resulted in a successful goal matching and $-1$ otherwise, $ET$ is the eligibility trace, equal to $1$ for units that have fired at least once during the planning cycle/trial and to $0$ otherwise, and $w_{max}$ is the maximum absolute value that the weight can reach ($w_{max} = 0.5$). The goal-associative connections allow the goal $g$ to condition the probability distribution over the hidden causes, $p(k|i, a, g)$. With learning, this allows the goal to condition the probability of the sampled hidden-cause sequences so as to increase the likelihood of those that involve the correct action. Moreover, when the goal changes, the system is able to rapidly re-condition the sampled sequences on the new goal.

The entropy threshold decreases linearly at each planning cycle so that the exploration component is eventually called if the planning process fails to match the goal within a certain time:

$$\theta_c = \theta_0 - \delta\, c$$

where $\theta_0$ is the value to which the entropy threshold is set at the beginning of the trial (and of the planning process), $c$ is the index of the planning cycle, and $\delta$ is its linear decrease.
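A sketch of the reconstructed reinforcement rule and of the threshold decay follows; the multiplicative eligibility form (ET for both the associative and goal units) and the symbols θ0 and δ are assumptions consistent with the definitions in the text:

```python
import numpy as np

def goal_weight_update(W_goal, et_assoc, et_goal, reward, eta=0.05, w_max=0.5):
    """Reconstructed reinforcement rule: move the goal-associative weights in
    the direction of the reward (m = +1 on goal matching, -1 otherwise) for the
    eligible units (ET = 1 if they fired during the planning cycle/trial),
    with a soft bound at |w| = w_max."""
    dW = eta * reward * np.outer(et_assoc, et_goal) * (w_max - np.abs(W_goal))
    return W_goal + dW

def entropy_threshold(theta0, delta, cycle):
    """Linear decay of the entropy threshold over the planning cycles of a trial."""
    return theta0 - delta * cycle
```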

The exploration component is a neural network formed by two layers. The first is an input layer formed by 6 neurons encoding the elements of the Cartesian product between the three possible colours and the two goals. The second is an output layer formed by 5 neurons representing the possible actions, receiving all-to-all connections from the input layer. When the exploration component is called to select the action, the input unit encoding the current colour-goal combination is activated and the action is selected on the basis of the resulting output activations.

The model functioning depends on seven important parameters, indicated in Tab 1. We searched for the best values of those parameters by fitting the model behaviour to the corresponding data of the human participants. In particular, we randomly sampled the parameter values and, for each sampled set, averaged the behaviour of 20 'simulated participants', in particular their performance in the 20 trials with the stimuli S1, S2, and S3, and the average reaction times over colours on the same trials (this because the original data on the reaction times of humans were not separated). These three performance datasets and one reaction-time dataset were compared with the corresponding average data from 14 human participants through a Pearson correlation coefficient $R_{d,m}$ computed as:

$$R_{d,m} = \frac{C_{d,m}}{\sqrt{V_d V_m}}$$

where $C_{d,m}$ is the covariance between the data from humans, $d$, and the data from the model, $m$; $V_d$ and $V_m$ are their respective variances. In particular, the coefficient was computed separately for the different datasets (performances and reaction times) and then averaged.
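The fitting procedure thus amounts to a random search scored by the averaged Pearson coefficient; in this sketch, the simulate function and the parameter ranges are placeholders standing in for the model run and the Tab 1 ranges:

```python
import numpy as np

def pearson(d, m):
    """R_{d,m} = C_{d,m} / sqrt(V_d * V_m)."""
    d, m = np.asarray(d, float), np.asarray(m, float)
    c = np.mean((d - d.mean()) * (m - m.mean()))
    return c / np.sqrt(d.var() * m.var())

def random_search(simulate, human_datasets, ranges, n_samples=200, seed=0):
    """Sample parameter sets uniformly within the given ranges, run the model,
    and keep the set maximising the correlation averaged over the datasets."""
    rng = np.random.default_rng(seed)
    best, best_r = None, -np.inf
    for _ in range(n_samples):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
        model_datasets = simulate(params)   # placeholder: average of 20 simulated participants
        r = np.mean([pearson(d, m) for d, m in zip(human_datasets, model_datasets)])
        if r > best_r:
            best, best_r = params, r
    return best, best_r
```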

The range of the parameters explored by the search, and the best parameter values that it found, are shown in Tab 1. The best parameter values, which had a correlation coefficient of 0.72, were used in all the simulations illustrated here.

This section illustrates the behaviour and functioning of the model when tested with the visuomotor learning task proposed in [12] and described in Sec 1.1. The reported results refer to twenty replications of the simulations, each representing a simulated participant performing the task. The results are also discussed from the perspective of the current state of the art on probabilistic and spiking neural-network models of goal-directed behaviour.

Fig 4 compares the performance of the model with that of the human participants. The performance is in particular very similar to that of humans for the colours whose correct action is found after one or three errors, whereas it is slightly lower for the colour whose correct action is found after four errors.

Once the model finds the correct action for one colour, when it encounters the same colour again it reproduces the correct action with a high probability. The architecture however takes more cycles to converge to such a high probability for S3 because the planner has a larger number of wrong sequences and so has a higher probability of wrongly anticipating a positive feedback. This problem is less impairing for S1, and in part for S2, which involve fewer wrong sequences during planning.

Fig 4. Comparison of the performance of the human and simulated participants. The performance (y-axis) is measured as the proportion of correct responses over the trials (x-axis), separately for the three different colour stimuli (S1, S2, S3). Curves indicate the values averaged over 14 human participants and 20 artificial participants; error bars indicate the standard error. The data of human participants are from [38].
The arbitration component decides to implement a different number of planning cycles (each involving the generation of colour-action-feedback sequences) depending on the knowledge stored in the world model. If a larger number of planning cycles is performed, the reaction times of the architecture are considered to be longer. These reaction times can be compared with those of the human participants (Fig 5). The reproduction of the human reaction times is particularly interesting and challenging as they have an inverted 'U' shape.

In the first trials, for each stimulus the entropy (uncertainty) of the world model is high as the associative layer expresses a rather uniform probability distribution. Indeed, the component still has to identify the hidden causes of stimuli and actions, so the neurons forming it tend to spike with a similar rate. As the entropy is high, the arbitration component tends to quickly pass control to the exploration component and so the reaction times are low. Over the following trials in the environment, the world model starts to form representations of the experienced colour-action-feedback sequences and to assign to them a higher posterior probability with respect to other patterns. The arbitration component thus tends to compute a lower entropy, the architecture plans for longer, and so the reaction times get longer. During this planning, the associative component tends to sample the learned sequences with a high probability conditioned on the observed colour. If none of the sequences leads to predict an event that matches the pursued goal through the output layer, the probability of such sequences is however decreased under the conditioning of the goal and control is again passed to the exploration component. When the action performed in the world manages to produce the desired goal, the system learns the corresponding sequence and assigns to it a high posterior probability. When the colour of such sequence is observed again, the sequence is sampled with a higher probability and results in a successful goal matching. At this point the entropy is low but the planning process samples such sequence with a high probability, obtains a successful matching of the goal, and the found successful action is performed: the exploitation of the found solution starts and the reaction times become low again.

Fig 5. Comparison of the reaction times of the human and simulated participants. (A) Reaction times of human participants averaged over S1, S2, and S3 (y-axis) for the 'representative steps' (x-axis); the 'representative steps' allow the alignment of the reaction times of the three stimuli so as to separate the exploration phase (first 5 steps) and the exploitation phase (6 steps onward); to this purpose, the reaction times for S1 obtained in succeeding trials from the first onward are assigned the steps (used to compute the averages shown in the plot) '1, 2, 6, 7, ...', whereas S2 is assigned the steps '1, 2, 3, 4, 6, 7, ...', and S3 is assigned the steps '1, 2, 3, 4, 5, 6, 7, ...'; data are taken from [38]. (B) Reaction times of the model plotted in the same way.

The results on the performance and reaction times of the model allow us to discuss two of the key features of the model with respect to the existing literature on probabilistic planning models based on spiking neural networks. The first feature is that the world model is learnt in parallel with its use for planning, and an arbitration mechanism decides when to explore or to plan on the basis of an entropy-based confidence in the world model. Previous models of probabilistic planning based on spiking neural networks did not consider the possibility of using approximate world models (as these were trained before the solution of the tasks) nor a mechanism of arbitration to decide whether to plan or to explore [35,36]. One of the first models that studied arbitration [37], which however did not rely on spiking neural networks, learned the world model during the solution of the task and arbitrated between goal-directed and habitual behaviour on the basis of uncertainty.

Another difference of our model with respect to the models of [37,38] is that it uses arbitration to select between exploration (attempts of different actions serving the world-model learning) and exploitation of the acquired knowledge (planning), rather than between goal-directed behaviour and habitual behaviour. This was done as habitual behaviour takes a long time to form and so it seems to be ruled out during the first attempts to solve new problems [3]. Instead, the first phase of the solution of a new task involves the learning of the model of the world based on the exploration of how the world responds to actions, and the possible exploitation of the collected knowledge by the goal-directed components. In this respect, the arbitration component proposed here makes decisions at a finer time granularity with respect to other models, namely during planning cycles, rather than at the coarser granularity of the trials, as in the other models. Future work could investigate the integration of the exploration/exploitation arbitration mechanism used here with the goal-directed/habitual arbitration mechanism used in previous models.

Another difference, linked to the previous point, is that the model presented here used an explicit representation of the goal to directly (learn to) condition the probability distributions of the world model, rather than generating a reward corresponding to the desired state and using it to perform reinforcement learning based on the world model (model-based reinforcement learning) as in [37,38]. Our different approach was also used in [35], which however conflated the goal, initial state, and environment conditions into a whole 'context' representation. Instead, in the model presented here the initial state and the goal have factored representations, and the 'environment condition' is considered as part of the world model state. Thus the model is able to express the goal-free probability distributions representing the 'objective dynamics' of the world, or to express the 'goal-based probability distributions' due to the agent's actions to accomplish the goal.

With the experience of the input stimuli, the STDP acting on the input-associative connections and on the internal associative connections leads the associative layer to form an internal representation of the hidden causes of the observations, namely of the colour, the action, the feedback, and the passage of time (the latter due to the fact that each observation lasts multiple time steps). At the same time, the plasticity of the associative layer leads it to form an HMM that represents in an increasingly accurate fashion the time-related probabilistic dependencies between the discovered hidden causes. Finally, once some possible sequences are encoded by the associative component starting from the current colour, the STDP acting on the goal-associative connections progressively increases the probability of sampling sequences that lead to the achievement of the goal and decreases the probability of those that do not. The effect of these processes can be seen in the graphs of the figure, in particular with respect to S1 for which a successful sequence is discovered after two trials (three graphs at the left). For this stimulus, during trials T4-T15 some specific neurons start to fire in sequence more strongly than other neurons, meaning that the system has learnt to represent the hidden causes, and their time dependencies, of the events of the first successful colour-action-feedback sequence.

During T4-T15 and T16-T20, the world model also learns the hidden causes, and their temporal dependencies, of the events of the other two sequences corresponding to S2 and S3 (second and third column of graphs in the figure). Here, the learning of the world model and its correct exploitation take more trials with respect to S1 as the goal (successful feedback) is achieved after a larger number of sequences (four and five for S2 and S3 respectively). This implies that the architecture takes longer to first learn the hidden causes of all such sequences and then to decrease the probability of the wrong ones based on the pursued goal.

Importantly, during these experiences the world model, which tends to record any aspect of the world dynamics independently of whether it is useful to pursue the current goal or not, also learns the sequences leading to a negative feedback. The next section shows how this knowledge might become useful to accomplish other goals.

At the beginning of the task, the output layer expresses a rather uniform probability distribution leading to random predictions of the trial events. With additional experiences of trials involving S1 (T4-T15), the world model starts to learn to represent the trial events and, under the conditioning of the current goal, to assign a high probability to the correct colour-action-feedback sequence. As a consequence, the probability distribution of the output layer starts to correctly predict such correct sequence.

During trials T4-T15 and T16-T20 the same process happens for the correct sequences of the two colours S2 and S3. Also for these stimuli, towards the end of all trials (T16-T20) the probability distribution expressed by the output layer, conditioned on the associative layer activation, has converged to a probability close to 1 for the correct sequence.

This functioning differs from that of previous spiking models of planning [35,36]. In these systems, the world model considers possible sequences of states while abstracting over the actions that might lead to them: actions are computed 'off-line' with respect to the planning processes searching possible state sequences to the goal. Moreover, the world model is trained during a random exploration of the environment where actions are chosen according to a uniform probability distribution. As a consequence, the world model can only reflect this probability distribution: given a new goal, the system thus has to infer the probabilities of new possible sequences of states and actions from scratch. Instead, the world model used here is an HMM that observes states and actions as similar events, independently of whether they are produced by the environment or by another part of the brain (e.g., the actions produced by the exploration component used here, or by a future habitual component). This allows the world model to learn state-transition probabilities that are sensitive to the probability of action selection. This might afford two possible advantages. First, it could support the biasing of the action selection probabilities, and hence the state probabilities, in favour of actions that lead to potentially useful effects from a given state, rather than any action in any state. This might be used to bias the world model to produce sequences of events involving only actions that are useful to accomplish states relevant for the agent's typical goals (this might capture the important concept of affordance used in the cognitive sciences [51], see [52]). Moreover, the use of state-action sequence probabilities might allow goals to bias only the probabilities of the action elements of the HMM rather than also the probabilities of the state elements, which could thus usefully reflect the actual physics of the world. If a distributed representation of goals is used that allows generalisation over them, this would for example allow new goals similar to previously pursued goals to immediately bias the action probability distribution, and hence the state probability distribution, expressed by the HMM in favour of potentially relevant actions and states.

A further novelty of the model presented here with respect to neural probabilistic models [35,36] is that the world model is more realistically learned during the solution of the novel task, rather than before the task solution. This caused the challenge of using a partial model of the world for planning, faced with the novel exploration/exploitation arbitration mechanism proposed here (this same challenge was faced by previous models [37,38], but these used goal-directed/habitual arbitration and were not grounded on spiking neural networks).

Another novelty of the world model presented here is that it learns on the basis of a biologically plausible unsupervised neural learning mechanism [24], rather than on the basis of the indication of the internal desired activation patterns by an external 'teacher' [35,36]. Computationally, finding the conditions for the successful functioning of such an unsupervised learning process, together with the acquisition of the world model while using it for planning, represented the hardest challenge in the construction of the architecture proposed here.

An important advantage of planning is that the world model can store general knowledge on the dynamics of the world that can be used to accomplish different goals. It was thus interesting to check to what extent the current architecture preserved this capacity, since it incrementally acquires a partial world model while solving the visuomotor task. To this purpose, after the architecture underwent the experiences reported in the previous section, it was required to perform additional trials to pursue the different goal of 'obtaining a negative feedback' in correspondence to the three colours. As shown in Fig 8A, when the goal is switched, the architecture is able to quickly change behaviour and choose the sequences that lead to the desired new goal given the colour. What happens is indeed that, under the conditioning of the observed colour, the world model already represents the hidden causes of the elements of the sequences and also assigns a high probability to these sequences. In particular, since the previous goal unit is now off, the probability of the different sequences tends to be similar, and so the system tends to sample all of them equally during planning. This allows the architecture to rapidly discover a sequence that leads to the desired new goal, to solve the new version of the task through it, and then to increase the probability of such sequence conditioned on the new goal.

Regarding the reaction times (Fig 8B), the model shows a transient increase in correspondence to the switch of the goal. This is due to the fact that with the new goal the system needs to sample some sequences before finding the successful ones. The reaction time is higher for S1 than for S2 and S3 as for S1 the model has fewer sequences available to reach the new 'negative feedback' goal and constantly one sequence achieving the 'positive feedback'.

These results represent a prediction of the model that might be tested in future experiments with human participants resembling the simulated test performed here (never performed with humans). In particular, the model predicts a certain performance and certain reaction times (Fig 8), possibly distinct for the stimuli S1/S2/S3, that might be measured and compared with those of humans.

Overall, these results show how, once the world model has acquired goal-independent knowledge of the environment dynamics, the architecture can use it to pursue different goals. This feature is the hallmark of the flexibility of goal-directed behaviour and is shared with the other neural probabilistic models [34,36]. These, however, were not validated with specific empirical data and were not used to produce specific empirical predictions, as done here.