Statistical learning as reinforcement learning phenomena

Statistical learning (SL) is the ability to extract regularities from the environment. In the domain of language, this ability is fundamental to the learning of words and structural rules. In the absence of reliable online measures, statistical word and rule learning have been primarily investigated using offline (post-familiarization) tests, which give limited insight into the dynamics of SL and its neural basis. Here, we capitalize on a novel task that tracks the online statistical learning of language rules, combined with computational modelling, to show that online SL responds to reinforcement learning principles rooted in striatal function. Specifically, we demonstrate, in two different cohorts, that a Temporal Difference (TD) model, which relies on prediction errors, accounts for participants' online learning behavior. We then show that the trial-by-trial development of predictions through learning strongly correlates with activity in both the ventral and dorsal striatum. Our results thus provide a detailed mechanistic account of language-related SL and an explanation for the oft-cited implication of the striatum in SL tasks. This work, therefore, bridges the longstanding gap between language learning and reinforcement learning phenomena.

Introduction

Statistical Learning (SL) is the ability to extract regularities from distributional information in the […]

[…] the TD model was 3 to 4 times better than the chance model at adjusting to participants' reaction times. We additionally compared the performance of the TD model against that of a Rescorla-Wagner (RW) model. Unlike the TD model, the RW model treats each AX_ combination as a single event, therefore combining the predictive values of the two (A plus X) elements (35-37), and so does not take into account […] participants' reaction times is plotted in Figure 2B with […]

It is generally understood that the final goal of (TD) learning is to inform behavior (37). Even if we consider predictions themselves as some form of covert behavior (38) used to optimize online learning and processing, our paradigm also required participants to make an overt response (a button press) to the presence of their target word. Reaction times are often used as modulators of a condition's related BOLD signal to extract the variability pertaining to such motor responses. However, as previously illustrated (Fig. 1 and Fig. 2), reaction times in the Rule block will tend to show a close relationship to online learning. Hence, a more suitable baseline to remove response-related brain activity is the reaction times to the No Rule block, that is, where no specific rule-learning can occur. We therefore contrasted P(A)-modulated Rule block activity with the reaction time (RT)-modulated No Rule block activity (activity estimates for the contrast between P(A)-modulated Rule and RT-modulated Rule activity are also reported in Fig. S4 and Table S4). Significant prediction-related Rule activity remained in the dorsal striatum, particularly in the bilateral caudate nuclei and right putamen (Figure S3 and Table S3). Altogether, therefore, our analyses (main and control) demonstrate that activity within the striatum was related to the computations that facilitate statistical rule-learning from speech, as predicted by the TD model.

In this study, we provide evidence for the SL of non-adjacent dependencies as an instance of reinforcement learning. A TD model of reinforcement learning, which capitalizes on the iteration of predictions and prediction errors, was able to mimic participants' reaction time data reflecting gradual SL over trials. This was replicated in two independent cohorts, producing similar model fits that were also clearly superior to those of simpler learning models. Functional neuroimaging data of participants' online learning behavior also allowed us to examine the neural correlates of prediction-based SL. In line with neuro-computational models of TD learning, the trial-by-trial development of predictions from the initial word of the dependencies was strongly related to activity in the bilateral striatum. Importantly, striatal activity was unrelated to the overt motor responses required by the task (i.e., button presses) or to more general computations, supporting the implication of the striatum specifically in prediction-based SL.

Evidence for the adequacy of a TD algorithm in capturing participants' online learning behavior offers novel insights into the mechanisms of SL. In particular, our results underscore the […]

[…] VTA/SNc) with responses consistent with the computation of reward P.E. (23,47-49). Under this light, our reported pattern of activity in the ventral striatum is consistent with the gradual transfer, over learning, of prediction-error-related dopaminergic responses from rewarding to predictive stimuli, as found in classical conditioning paradigms (50,51). That is, a gradual increase in response to A elements may be expected as their predictive value is learned, since these elements can never themselves be anticipated. Alternatively, activity in the ventral striatum could reflect inhibitory signals aimed at attenuating dopaminergic inputs from the VTA/SNc (52) in response to C elements as these become more predictable.

From a theoretical standpoint, it may be necessary to distinguish between the response of the […], namely, of (speech) motor programs corresponding to the predicted (C) elements. The selection of these motor-articulatory plans may be used to generate sensory-level predictions (38), ultimately translating into increasingly faster RTs for predicted C elements. In this view, activity in the pSTG (see Fig. 3) would reflect the downstream (i.e., sensory) consequences of this selection (38). We conjecture that prediction-based SL is fundamentally linked to such motor engagement as part of the learning mechanism orchestrated by the striatum. This is consistent with the observation that participants more adept at predicting speech inputs embedded in noise, a task known to involve the speech motor system (65), are also better statistical learners (15), and agrees with the well-accepted role of these structures in procedural learning (66,67) and the management of motor routines (49,64,68). Note that this speech motor engagement for learning should become critically important when putative alternative learning mechanisms (e.g., purely sensory-based ones) are weakest, for example, when a temporal separation is imposed between the elements to be associated, as in our non-adjacent dependency learning task.

In contrast to previous research on grammar learning (31,69,70), the trial-wise development of […]

To obtain an online measure of incidental learning, participants were instructed to detect, as […]. We modelled subjects' learning of the dependencies using a Temporal Difference (TD) model (17,18). Drawing from earlier models of associative learning, such as the Rescorla-Wagner model, the TD model relies on a measure of the mismatch between predicted and actual outcome (17,18,37,73) (i.e., the prediction error, P.E.). This scalar quantity is computed as:

∂(t) = o(t) − p(t − 1)

where ∂(t) is the P.E. term at a given time-point t within a trial, which amounts to the discrepancy between the outcome o at that time-point and the prediction p at the previous time-point.

Computationally, learning through TD is therefore conceptualized (and modelled) as prediction learning (37), where predictions p at each time-step are updated according to:

p(t − 1) ← p(t − 1) + α ∂(t)

where α is a free parameter that represents the learning rate of the participant and determines the weight attributed to new events and the P.E. they generate (17).
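As an illustration of how a prediction converges under repeated α-weighted P.E. updates, here is a minimal sketch. This is not the authors' fitted model; the parameter values and the number of pairings are arbitrary choices for illustration.

```python
# One TD step without discounting, following the symbols in the text:
# p = prediction, o = outcome, delta = prediction error, alpha = learning rate.
def td_update(p_prev: float, outcome: float, alpha: float) -> tuple[float, float]:
    """Compute the P.E. and update the prediction by an alpha-weighted step."""
    delta = outcome - p_prev          # P.E.: ∂(t) = o(t) − p(t−1)
    p_new = p_prev + alpha * delta    # update: p ← p + α·∂
    return p_new, delta

p = 0.0
for _ in range(10):                   # repeated pairings with outcome = 1
    p, delta = td_update(p, 1.0, alpha=0.3)

# The prediction approaches the outcome geometrically: after n updates,
# p = 1 − (1 − alpha)**n, and the P.E. shrinks accordingly.
```

With a larger α the prediction converges faster but tracks noise more closely, which is why α is left as a per-participant free parameter in the fitting described above.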

One of the advantages of TD models over simpler models of learning, such as the RW, is that they account for the sequence of events leading to an outcome, rather than treating each trial as a discrete temporal event. That is, although each trial for the participant (i.e., each three-word phrase) was equivalently treated as a trial for the TD model, model updates occurred at the presentation of each individual element (see below). TD models are thus sensitive to the precise temporal relationship between the succession of predictions and outcomes that take place in a learning trial (17). Note that this is particularly valuable in trying to account for the learning of non-adjacent dependencies as distinct from adjacent ones, making a TD model preferable in such cases. This feature is implemented as a temporal discounting factor: an additional free parameter γ that represents the devaluation of predictions that are more distant from the outcome (44,74). Thus, upon 'hearing' the final element of a rule (AXC) phrase, the prediction from the initial element A was updated according to: […]
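The element-by-element, γ-discounted updating described above can be sketched as a simple tabular TD scheme. This is a hypothetical illustration, not the authors' implementation: the value table, the bootstrapped within-trial update rule, and the parameter values are all assumptions.

```python
# Sketch of discounted TD learning over an A-X-C phrase: each element holds a
# predictive value, updates occur at every element presentation, and gamma
# devalues predictions that are further from the outcome (which arrives at C).
alpha, gamma = 0.3, 0.9
V = {"A": 0.0, "X": 0.0, "C": 0.0}    # predictive value of each element

def run_phrase(V, alpha, gamma, outcome=1.0):
    """Present A, X, C in sequence, updating each element's prediction."""
    seq = ["A", "X", "C"]
    for i, elem in enumerate(seq):
        if i < len(seq) - 1:
            # no outcome yet: the P.E. bootstraps from the (discounted)
            # prediction carried by the next element
            delta = gamma * V[seq[i + 1]] - V[elem]
        else:
            # final element: the outcome itself arrives
            delta = outcome - V[elem]
        V[elem] += alpha * delta
    return V

for _ in range(50):                    # repeated exposures to AXC phrases
    run_phrase(V, alpha, gamma)
```

After sufficient exposure the values converge to V(C) ≈ 1, V(X) ≈ γ, and V(A) ≈ γ², capturing the intuition in the text: the initial A element acquires predictive value for the distant C outcome, but discounted by its temporal distance.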

Functional images were realigned, and the mean of the images was co-registered to the T1. The T1 was then segmented into grey and white matter using the Unified Segmentation […]. Volumes were re-sampled to 2 mm³ voxels and spatially smoothed using an 8 mm FWHM kernel.

In all cases, data were high-pass filtered (to a maximum of 1/90 Hz). Serial autocorrelations were also estimated using an autoregressive (AR(1)) model. We additionally included, in all the models, […]

[…] (bilateral caudate nuclei, putamen, and ventral striatum; see Table S2) was modulated by the trial-by-trial development of […] (Table S3). Significant […] centered on the caudate nuclei and the right […]

Table S1. Whole-brain fMRI Rule P(A)-modulated activity vs. implicit baseline. Group-level fMRI local maxima for the P(A)-modulated Rule against implicit baseline contrast (see also red-yellow regions in Fig. 3, main text). Results are […]

Table S4. Whole-brain fMRI Rule P(A)-modulated activity vs. Rule RT-modulated activity. Group-level fMRI local maxima for the P(A)-modulated Rule minus RT-modulated Rule contrast (see also red-yellow regions in Fig. S4).

Results are reported for clusters FWE-corrected at p < 0.001 at the cluster level (minimum cluster size = 20), with an additional p < 0.001 uncorrected threshold at the voxel level. MNI coordinates were used. BA, Brodmann Area.

Text S1. Offline Recognition Test. Following each block, participants' knowledge of the rules was assessed via a recognition test. Participants were presented with correct sentences (phrases that conformed to the rules) and incorrect sentences (phrases that violated the rules). In half of the trials, incorrect sentences consisted of violations of the A_C dependencies, where the A and C elements maintained their correct order within the phrase but belonged to different rule structures (i.e., A1xC2, A2xC1). In the other half, incorrect sentences contained order violations, where the A and C elements of a dependency swapped positions (i.e., C1xA1 and C2xA2). The complete offline test consisted of a total of 48 test phrases (24 per dependency). Participants were required to discriminate phrases that could belong to the previously heard language from phrases that could not by pressing the appropriate button. A maximum of 1500 ms was allowed to respond, after which there was a jittered interval (1-3 s) before the next trial began. Participants' ability to discriminate rule items from violations was assessed by computing d-prime scores (d′) from their responses.

For each participant, the proportion of hits (i.e., yes responses to correct phrases) and false alarms (i.e., yes responses to incorrect phrases) was used to calculate the d′ score. Hit and false alarm rates of zero or one were corrected according to Macmillan and Kaplan (1985). We computed two distinct d′ scores by using false alarms to 1) order violations and 2) […]. […] was administered.
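A minimal sketch of the d′ computation described above. It assumes the common 1/(2N) replacement for hit or false alarm rates of zero or one (one standard reading of the Macmillan and Kaplan correction); the counts and trial numbers below are illustrative, not taken from the study.

```python
# d' from hit and false alarm counts, with extreme rates (0 or 1) replaced
# by 1/(2N) and 1 - 1/(2N) so that the z-transform stays finite.
from statistics import NormalDist

def dprime(hits: int, n_signal: int, fas: int, n_noise: int) -> float:
    hr = hits / n_signal
    far = fas / n_noise
    # correction for perfect rates
    hr = min(max(hr, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
    far = min(max(far, 1 / (2 * n_noise)), 1 - 1 / (2 * n_noise))
    z = NormalDist().inv_cdf            # inverse standard normal CDF
    return z(hr) - z(far)

# hypothetical participant: 20/24 hits, 6/24 false alarms
d = dprime(20, 24, 6, 24)
# a perfect participant still gets a finite score thanks to the correction
d_perfect = dprime(24, 24, 0, 24)
```

Separate d′ scores per violation type then follow by simply restricting the false alarm count to the relevant subset of incorrect phrases.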