Tracking and optimizing human performance using deep reinforcement learning in closed-loop behavioral- and neuro-feedback: a proof of concept

Reinforcement learning (RL) is a powerful, general-purpose machine learning framework within which we can model various deterministic, non-deterministic and complex environments. We applied RL to the problem of tracking and improving human sustained attention during a simple sustained attention to response task (SART) in a proof-of-concept study with two subjects, using state-of-the-art deep neural network-based RL in the form of Deep Q Networks (DQNs). While others have used RL in EEG settings previously, none have applied it in a neurofeedback (NFB) setting, which seems a natural problem within Brain Computer Interfaces (BCIs) to tackle using end-to-end RL in the form of DQNs, given both the problem's non-stationarity and the ability of RL to learn in a continuous setting. Furthermore, while phasic alerting has been widely explored, learning optimal alerting in a personalized way in real time is a less explored field, for which we believe RL to be a most suitable solution. First, we used empirically-derived simulated EEG and reaction-time data, and parameter/algorithmic exploration within this simulated model, to pick DQN parameters that are more likely to be optimal for the experimental setup and to explore the behavior of DQNs in this task setting. We then applied the method to two subjects, using parameters suggested by the simulation results, and show that we obtain different but plausible results for both subjects. The RL-based behavioral- and neuro-feedback BCI method we have developed here is input-feature agnostic and allows complex continuous actions to be learned in other, more complex closed-loop behavioral or neuro-feedback approaches.


Introduction
Vigilance is often defined as the ability to maintain alertness and attention over a prolonged period of time, possibly through boring, tiring and/or long tasks. For example, driving a car, flying a plane, being a radar operator, or a surgeon during longer procedures, or a security guard of a building or person, all require attention to remain high, both so that false negatives do not occur (i.e. so that targets that should be detected are not missed) and to avoid false positives (mistakenly recognizing some bit of anatomy as tumorous, or mistakenly imagining or even hallucinating, due to fatigue, a bleep on a radar, or thinking a car is overtaking behind you when the car is in fact at a constant and safe distance). In either case, the subject involved makes a decision against reality, leading at the very least to a harmless realization of this, or some kind of minimal correction (e.g. realizing the car behind is not in fact overtaking), but it can also lead to catastrophic system failures resulting in the loss of property, loss of life, or financial losses. It is well established that the human ability to sustain attention is limited (Oken et al., 2006), and we are typically only capable of vigilance for short bursts of time, though we have some ability to prolong this period of sustained attention via top-down (more internally driven) strategies (Hopfinger et al., 2000; Sarter et al., 2001) and cues (e.g. reminding oneself throughout a task that it is important and one must refocus, recalling why it is important, or replaying certain mental strategies to keep distractions away, as well as noticing more or less consciously different timings and task-related conditions that can help in steering behavior). It is also possible to prolong sustained attention via bottom-up or more externally driven strategies or cues (e.g. playing a sound to signal falling attention, vibrating a car's steering wheel, or briefly flashing a light). The problem with the top-down approach is that it is more likely to lead to fatigue and eventual abandonment of the internal strategy or cues people give themselves. Incorporating bottom-up, externally cued alerts of various kinds can therefore present an approach to improving attention, especially when coupled with monitoring of attention via, for example, electroencephalography (EEG). While attentional control, whether driven through bottom-up or top-down mechanisms, will be affected by fatigue in any case, bottom-up approaches are, in some sense, more independent of higher-level motivation, interest, or the ability to control one's attentional systems. Vigilance is one heavily studied type of attention, but other attentional systems have been identified which are relevant for sustained attention and vigilance-demanding tasks, and which can be 'hijacked' to improve vigilance performance on tasks.

Arousal and orienting for vigilance improvement
The broad literature on attention has found that attention is best understood as a number of interacting, partially overlapping systems that work more or less in parallel to meet conflicting and varying task demands. Perhaps most relevant to NFB and the current work are overall cortical (and bodily) arousal and the orienting systems. Some researchers have referred to arousal as the global "energy level" of the brain (Portas et al., 1998), referring to the idea that it reflects the global, or average, excitability of the cortex (or the brain generally). Other authors have suggested that arousal has both local and more global aspects (Barry, 2007). Indeed, vigilance or sustained attention has often been referred to as 'tonic arousal' or 'tonic alertness' (tonic referring to longer-lasting, broader, more widespread), as opposed to arousal on a shorter time-scale, which is more often called simply arousal, 'phasic arousal' or 'phasic alertness' (i.e. arousal which is just a shorter phase) (Oken et al., 2006). Arousal is in large part (but not entirely) dependent on norepinephrine (NE), released primarily from the locus coeruleus (LC) (Oken et al., 2006). This LC-NE release affects the cortex and the rest of the brain globally, as the NE neuromodulator is released to most brain regions in a non-specific manner. […] In either case, orienting is the act of the brain re-prioritizing current processing and (at least briefly) focusing energetic and attentional resources on a sudden new stimulus location/timing/feature set. In effect, this means that one area of the cortex or brain is given more blood, glucose, or whatever other resources are necessary to meet its temporarily increased need/recruitment compared to other areas of the cortex or brain.
This is clearly important from an evolutionary perspective, as an animal (such as a human) must be able to prioritize mental and physical resources to a specific location or event if a certain external event/cue signals the animal to do so (e.g. the roaring of a tiger, or a car in the distance alerting a human jaywalking despite a likely impending collision). Temporal orienting occurs when a cue (typically, and usually by definition, external) signals that an event is likely to happen in a specific (usually upcoming) time interval (Weinbach & Henik, 2012). Spatial orienting happens when a (usually external) cue tells us that an event is likely to happen in a specific spatial location (Weinbach & Henik, 2012). It seems that the temporal orienting system is more left-hemispheric and the spatial orienting system more right-hemispheric, with both orienting systems generally being most dependent on, and localized in, superior parietal areas, close to areas associated with sustained attention, which likewise depends on parietal cortical regions. Both orienting systems seem to interact with the NE arousal system, and so they present one way of modulating arousal in a bottom-up (and potentially completely automatable and consciously effortless) manner. This is important for NFB: we can in part 'hijack' the brain's natural arousal systems (whether temporal or spatial) with externally driven alerts or cues, and by monitoring and timing these cues to when arousal/attention drops below a certain threshold (as estimated by certain EEG-derived measures, for example), we can, to a non-trivial degree, maintain a certain level of arousal and vigilance.

Overview of RL and its use in this work
The term reinforcement learning (RL), as used here, refers to a branch of machine learning (ML). It is similar to, but distinct from, supervised learning, which is perhaps the most commonly applied form of ML. In supervised learning, the correct target value at any given training round is always known and given to the learning algorithm explicitly. This is typically contrasted with unsupervised learning, where no target output values are given for the inputs during training; instead, the algorithm finds clusters and other types of patterns in the data directly. RL lies somewhere between the two: it is more like the former (supervised learning), with some features of the latter (unsupervised learning), but distinct from both, resting on a rather different mathematical foundation (Sutton & Barto, 1998). We often refer to an RL system or algorithm as an "agent" that acts on the world (physical or virtual) in some way through actions. In RL, the main feedback that we provide to the agent is how well it performed, but not what the correct action or decision should have been in any given situation (unlike in supervised learning, where we adjust algorithmic parameters precisely according to how far the output is from the exact target value for a given input-output pair). This often leads to more unpredictable outcomes: while the agent may learn to reach the general goal, it may do so in highly peculiar, unpredictable and even undesirable ways, as we describe in the discussion section.

[…] The cited studies are a small subset of the types of BCI and neurofeedback work that have been done. NFB specifically, where the system learns in real time from the individual (and is thus personalized), typically depends on the user seeing their current brain activity in some way.
This, while useful and successful, requires significant effort from the user and distracts from performing an actual task in parallel. One can also do NFB without requiring the user to do anything, by passively having them perceive certain stimuli (typically sound or light cues). This latter approach can be referred to as passive NFB, versus the former (where the user has to exert effort), which we can call active NFB. Our work presented here provides a very general-purpose and powerful approach to performing passive NFB, though the approach need not be restricted to passive NFB and is probably usable in all BCI settings.

2.1. Reinforcement Learning framework
We wanted to apply end-to-end RL that is close to state-of-the-art. We used the Python […] RTs, as this should lead to faster convergence, but these led to the agent getting stuck in one of the two actions for long intervals of trials.

In order for the agent to explore the state-action space sufficiently well and find 'good' solutions, the agent must not get stuck in any one action for too long. Different exploration strategies can be used; we explored ε-exploration and Boltzmann-exploration, both in simulation and in the experimental runs. In the former, we select the best action found so far (i.e. alerting or not alerting) with probability 1-ε, and one of the two actions uniformly at random with probability ε. This means that the agent acts greedily (selecting the best action found so far) with an overall probability of 1-(ε/2). As long as ε is not too small, the other action will be chosen often enough for the agent to notice if it is (currently) the better one to choose. In Boltzmann exploration, we have a softmax or Boltzmann distribution-weighted selection of probabilities for selecting a given action from all possible actions:

P(a_i) = exp(Q(a_i)/τ) / Σ_j exp(Q(a_j)/τ)

In our case, with two actions, we have:

P(a_1) = exp(Q(a_1)/τ) / (exp(Q(a_0)/τ) + exp(Q(a_1)/τ))

The parameter τ is a temperature parameter that is typically annealed over time and controls the degree to which the actions are weighted according to their Q values. A high temperature leads to more random action selection and thus more exploration, while a small τ leads to behavior closer to greedy action selection (we have fully greedy action selection for τ → 0). In between, the agent explores non-optimal actions with a probability that depends on the specific range of the Q values. Since this depends on the rewards used, τ/Boltzmann action selection is more context-sensitive than ε-exploration.
In general, we can expect Boltzmann exploration to lead to better outcomes, as the non-optimal/non-greedy actions are weighted according to their estimated Q-values (i.e. according to how good the agent thinks those actions are), which is a more informative action selection than uniformly randomly selecting among all other actions with some probability ≈ ε. In practice, finding a good τ is more difficult and less intuitive than finding a good ε. We explored the use of both on the type of BCI/NFB presented here.
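As a concrete illustration, the two action-selection rules described above can be sketched in plain Python (a minimal sketch of ours, not the study's implementation; the function names and the max-subtraction for numerical stability are our own choices):

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the best-known action with probability 1-epsilon, else uniform random."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau, rng=random):
    """Sample an action with probability proportional to exp(Q/tau)."""
    m = max(q_values)  # subtract the max before exponentiating, for stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    r = rng.random() * total
    cum = 0.0
    for action, w in enumerate(weights):
        cum += w
        if r <= cum:
            return action
    return len(q_values) - 1
```

With a very small τ, `boltzmann` behaves almost greedily; with a large τ it approaches uniform random choice, matching the limits described in the text.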

For simplicity of analysis, we used only two actions. Action 0 is no alerting, i.e. doing nothing in that trial, and action 1 is alerting using an auditory phasic alert. While we could also train on other parameters, such as the strength/volume of the alerting cue, or the phase in a specific band of the EEG, these would make interpretation of the agent's decisions and the subsequent results even more complex and would not add to the proof-of-concept work presented here.
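The Q-network for this two-action setting can be illustrated with a minimal numpy sketch. It mirrors the layout reported later in the figures (6 input features, hidden layers of 5 units, 2 output Q-values), but the tanh nonlinearity, learning rate, and updating only the output layer are simplifying assumptions of ours, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# 6 inputs (last RT + 5 relative band powers) -> 5 hidden units -> 2 Q-values
# (action 0: no alert, action 1: play the auditory alert).
W1, b1 = rng.normal(0, 0.1, (5, 6)), np.zeros(5)
W2, b2 = rng.normal(0, 0.1, (2, 5)), np.zeros(2)

def q_values(x):
    """Forward pass: returns the two Q-value estimates and the hidden activations."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

def td_update(x, action, reward, x_next, gamma=0.9, lr=0.05):
    """One Q-learning step: move Q(x, action) toward reward + gamma * max Q(x')."""
    q, h = q_values(x)
    target = reward + gamma * np.max(q_values(x_next)[0])
    err = target - q[action]
    W2[action] += lr * err * h  # for brevity, only the chosen action's output weights
    b2[action] += lr * err
    return err
```

Repeated calls to `td_update` shrink the temporal-difference error toward zero, which is the sense in which the network "learns" the value of alerting versus not alerting.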

2.2. Task and cognitive setting
We used the Sustained Attention to Response Task (SART), which has previously been used in a number of seminal studies of attention. It requires the subject to press a given key (we set this to the space key) on all numbers seen except for a specific given number (we set this to be 3). […] Given the heavy right-tailed distribution of RTs observed experimentally and then modeled in the simulated data, 1 second constitutes a 'very bad' response, pushing the mean of the moving window considerably higher. We chose to do this to take into account the error rate, which is known to increase in more alerted states. The purpose of this kind of behavioral neurofeedback is ultimately to improve performance on a given task and to decrease the error rate as well. In other words, while alerting leads to lower reaction times in general, we also wanted to improve accuracy/error rate. While shorter RTs resulting from the alerted state are an 'improvement', a significantly higher error rate may not be acceptable, depending on the real-world task. We can think of the high reaction-time penalty as a kind of regularization applied to the loss function of the RL system.

The experiment was programmed in […]

The relative alpha band power was computed as

relative alpha power = 10^alpha / (10^delta + 10^theta + 10^alpha + 10^beta + 10^gamma)

where the exponents are the log-scale (bel) absolute band powers. The other relative band powers are computed analogously, as the relative proportion of a band's linear-scale power out of all the bands. This gives us values in the range [0,1], which are computed in a sliding-window fashion and emitted by the Muse-IO tool at 10 Hz.
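The relative band power computation follows directly from this definition (a sketch assuming, per the text, that the device reports log-scale band powers in bels; the function name is ours):

```python
def relative_band_powers(bels):
    """Convert log-scale (bel) band powers into each band's share of the
    total linear-scale power; the shares lie in [0, 1] and sum to 1."""
    linear = {band: 10.0 ** value for band, value in bels.items()}
    total = sum(linear.values())
    return {band: power / total for band, power in linear.items()}
```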

2.4. Simulation
The process of simulating the data was driven empirically. We had one of the two subjects do a run of SART with 300 trials, with 33% of the trials randomly alerting, to get some indication of the difference in reaction times for this subject between alerted and non-alerted states, as well as the distribution of the EEG band powers. Figure 1 shows the recorded EEG band values, the reaction times and the corresponding fitted lognormal distributions. Because the EEG bands did not display obvious differences between alerted and non-alerted states in themselves, we show only the fitted distributions and plots for alerted and non-alerted states combined, not individually. Note also in Figure 1 that a lognormal fit appears to be a decent fit to the empirically-observed reaction times from this subject. We tried a number of distributions and found that the lognormal was the single distribution that tended to best model the different bands as well as the reaction times. We discuss some issues and implications of this choice later. In any case, the variability of the band powers and the reaction times is such that a slight deviation from the chosen distributional form is not going to significantly alter the shape of the distribution, whether a lognormal fit is 'good' or not. As such, the lognormal fits were deemed sufficiently good for generating simulated values with the corresponding distribution parameters found from subject 1's non-feedback SART data. These are: […] for the non-alerted reaction times.

Figure 1: Histogram plots and lognormal fits based on the band power data from subject 1 performing SART with 300 trials and randomized alerting in 33% of the trials (sub-figures A-E), and a histogram plot and lognormal fit of the corresponding reaction times from subject 1 for alerted (in red) and non-alerted (in blue) trials.
Note that for the reaction times, we excluded 19 trials which were incorrect, as we penalize and model an incorrect response as having a reaction time of 1 second, which would unnecessarily skew the fitted reaction-time distribution to the right. While the incorrect trials tend to be very fast responses in which response inhibition failed to suppress the incorrect response, such incorrect responses can be costly in various real-world situations, and the way they are modeled can strongly influence the real-time adaptive learning system.
Using the MLE parameter estimates of the reaction times, we then simulated values from those distributions, in order to feed more empirically-tuned values for this task and this subject to the DQN simulation, and thereby tune the parameters of the RL learning to where it learns successfully and quickly. Figure 2 shows the simulated distributions using subject 1's empirical data.

Figure 2: The figure shows lognormal simulated data with parameters (μ and σ) derived from subject 1's 300-trial SART results shown in Figure 1 (without behavioral/neuro-feedback, with random alerting). Most noteworthy is that relative frontal alpha power is significantly higher in alerted states than in non-alerted states (Figure 2C). Most other relative band power measures do not differ as noticeably, though relative theta power is lower in alerted than in non-alerted states. The simulated reaction-time values, driven by the subject's actual reaction times, show that alerted states, while overlapping considerably with non-alerted states, tend to have lower reaction times.
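The fit-then-simulate step above can be sketched with the closed-form lognormal MLE (the mean and standard deviation of the log-samples); the helper names are ours and this is not the study's code:

```python
import math
import random

def fit_lognormal(samples):
    """MLE for lognormal parameters: mean and std of the log-transformed samples."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def simulate_rts(mu, sigma, n, seed=0):
    """Draw n simulated reaction times from the fitted lognormal distribution."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]
```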

2.5. Experimental setup
The two subjects then performed 5 runs of the SART each. Both subjects had previous practice on this implementation of SART to familiarize themselves with it. The task was performed at night, when the subjects were already somewhat drowsy, to increase the need to pay attention to the task and, hopefully, the contrast between non-alerted and alerted states.

The alerting cue was a 3000 Hz sine wave generated in Audacity. The sound file was saved as an uncompressed pulse-code modulated (PCM) wave file sampled at 44.1 kHz with 16 bits per sample, played at maximum volume on the same computer system, with the same set of Audio-Technica ATH-M50x headphones plugged into the audio/headphone jack. The sound was loud enough to initially be described by both subjects as startlingly loud.

2.6. Analysis
We used Student's two-sample t-test on the reaction-time results for the final analyses. We also used the Wilcoxon signed-rank test as a non-parametric counterpart to the t-test, as well as the unequal-variance t-test (Welch's t-test), though these gave almost identical results in all cases and so are not reported. Finally, we ran general linear models (GLMs) for each condition, testing whether the RTs were at all predictable from the relative EEG band powers, using linear terms but also including interactions between the features/regressors.

Figure 4: […] (τ = 0.001, 0.01, 0.1, 1) in Boltzmann- or softmax-driven action space exploration. Note that with this action exploration strategy, the agent finds that the alerting and non-alerting actions are much closer to being equally good than the ε-exploration strategy would suggest. Since with Boltzmann-exploration, given suitably chosen values of τ, we weight the different actions based on their utilities (i.e. their estimated Q values), the agent behaves less binarily in its choice of action. This is one of the main reasons we decided to use Boltzmann-exploration action selection for the experimental runs.
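For reference, the unequal-variance comparison mentioned above reduces to the Welch t statistic with Welch-Satterthwaite degrees of freedom (a stdlib-only sketch of ours; in practice the p-values would come from a standard statistics package):

```python
import math

def welch_t(a, b):
    """Welch's unequal-variance t statistic and its approximate degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)  # sample variances
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    sa, sb = var_a / na, var_b / nb
    t = (mean_a - mean_b) / math.sqrt(sa + sb)
    dof = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, dof
```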

Results
As mentioned in the caption of Figure 4, Boltzmann-exploration leads to a more balanced choice of which action to take, due to its weighting of the action selection by the estimated action utilities (i.e. the Q values). Figure 5 below shows four examples of the agent learning at different rates and learning different action preferences for different exploration-strategy parameter choices.

In simulation, such a graded action selection may or may not be useful, depending on the purposes of the simulation. Because brain dynamics in general, and EEG signals specifically, are neither stationary nor simple to model even on short time scales, we expected the softer action selection of Boltzmann exploration to prove more successful than an ε-greedy strategy.

Figure 5: A and B show the same agent (i.e. the same architecture) learning the simulated reaction times for ε = 0.1 and ε = 0.3. Notice that while the agent in A (ε = 0.1) learns somewhat more slowly, it takes fewer non-alerting actions, as they lead to an overall higher reaction time. The agent in B learned a few trials faster to quasi-greedily prefer the simulated alerting state, but it makes more exploratory (non-alerting) actions. This is less useful in the case of stationary simulated data, but is important for real, and especially brain-driven, inputs. The bottom two figures show the learning of Boltzmann-exploration-driven agents. The agent shown in C learned to prefer action 1 (alerting) after more trials than the agent in A, and continued to explore more after finding that it is the preferable action. The agent in D displays similar but even more exploratory behavior than C, still preferring alerting over non-alerting, but less strongly. Finally, note that all the simulated results shown and discussed here use the same pseudo-randomly generated sequences, always seeded with the same seed.

The figure below shows the RTs from all 5 experimental runs for subject 1:

The reaction-time plots are not very informative in themselves, and 'natural' variability in task performance between runs will have strong effects beyond those of alerting frequency and timing. Based on just the reaction times, the simple rewards may have been more effective for subject 1, who had a higher standard deviation of RTs during the complex-rewards-driven alerting (across all reaction times, i.e. for both alerted and non-alerted trials). For subject 1, we found no significant difference between alerted and non-alerted trials for the ε-simple alerting (p=0.165). The ε-complex alerting led to just barely significant (NB: without correcting for the multiple testing) results, with faster alerted trials than non-alerted (p=0.019). Subject 1 had very slightly faster RTs on alerted trials than non-alerted trials (p=0.0302) for the τ-simple agent and a smaller (and 'non-significant') difference for the τ-complex agent (p=0.114). Subject 1 had very similar RTs between alerted and non-alerted trials in the randomized (non-RL-driven) condition E (p=0.8834). Note that alerted states had lower RTs (and correspondingly lower p-values, sometimes significant when not corrected for multiple comparisons) for the RL-driven alerted states compared to the non-RL-driven randomly alerted states, shown in E. Subject 2 responded differently to the alerts learned for them. For the ε-simple case, subject 2 did not respond well to the alerts, having a somewhat higher RT on alerted trials than non-alerted trials (p=0.206). The ε-complex condition had almost statistically similar RTs (though note that there were only 35 non-alerted trials, p=0.3661). The τ-simple condition led to a small difference, with faster alerted than non-alerted trials, though not significant at conventional thresholds (p=0.0891).
The τ-complex case for subject 2 led to no 'significant' difference, but there were only 7 non-alerted trials, so the t-test (and the other tests tried, but not reported) is not going to pick up the small difference in alerted vs non-alerted trials that we find in this task. Finally, the randomized (non-RL-driven) condition for subject 2, shown in E, led to strongly similar RTs between alerted and non-alerted trials (p=0.9302). Again, note that non-randomized (RL-driven) alerting did lead to more significant differences between alerted and non-alerted trials.

Figure 9: Trials on which the RL system alerted (y value of 1) or did not alert (y value of 0) subject 2, for ε-simple (in A), ε-complex (in B), τ-simple (in C), and τ-complex (in D). E shows the randomized alerted vs non-alerted trials (RL ran in the background but did not control whether the subject was alerted or not).

Figure 10: The figure shows the weights between the input layer and the first hidden layer of the DQN for subject 1, across all 5 conditions. We trained a new DQN for each condition, so the neurons in the layers do not directly correspond in any way, except that the input layer always has the same input features (the columns, described in the x-axis label). Feature 1 is the RT and the other 5 features are the linear relative band power values of the delta, theta, alpha, beta and gamma bands, respectively. There were 5 neurons in the first (and every other) hidden layer (hence the 5 rows). A shows these weights for ε-simple, B for ε-complex, C for τ-simple, D for τ-complex, and E for the random alerts without RL. Note that for E, we show the weights for comparison: while the RL algorithm did run in the background, it did not alert, and thus it could not distinguish between action 1 (alerting) and action 0 (not alerting).

Figure 10 above shows the weights from the input layer to the first hidden layer for all 5 conditions for subject 1.
Figure 11 below shows these for subject 2. The columns are the different inputs (i.e. features) fed to the neural network underlying the DQNs: input 1 is the RT from the last trial and the next 5 inputs are the linear relative band power values over the last second for the delta, theta, alpha, beta and gamma bands, respectively. Sub-figure A shows these weights for ε-simple, B for ε-complex, C for τ-simple, D for τ-complex, and E for the randomized (non-RL-driven) alerts that played about 20% of the time. As noted in the captions of Figures 10 and 11, sub-figures E, for the randomized alerting, had the RL agents running in the background, but both actions were 'silent' to the subject; in addition, random alerts were played on some trials, explicitly unknown to the algorithms. Thus, the weights do not represent the same learned weights as in sub-figures A, B, C and D for contribution to the same task, though they do still give an indication of the relative importance of the different features to the activations of the first hidden layer's neurons, based on the reaction-time-based rewards.

Figure 11: The figure shows, analogously to Figure 10, the weights from the input layer to the first hidden layer neurons for subject 2. The sub-figures correspond to the conditions as for subject 1: A is for ε-simple, B for ε-complex, C for τ-simple, D for τ-complex and E for the randomized (non-RL-driven) alerts.
Next, we looked at the relationship between the reaction times and the rewards for the complex reward conditions. This relationship was almost perfectly linear for both subjects, and almost perfectly identical for both action exploration strategies. We show a plot of the reaction times and the corresponding rewards for subject 1's ε-complex run below:

Figure 12: An illustrative plot of the more complex rewards (on the y-axis) vs the reaction times (on the x-axis) for the ε-exploration condition of subject 1. All other complex reward plots have the same form and strong linear relationship between the two. This relationship is not at all surprising, of course, as the rewards are a function of how different the current reaction time was from the previous 10 trials' mean RT. Since most trials will have RTs similar to the mean of the previous 10 trials, most differences (and therefore most rewards) will fall close to 0, which is what we see.

Figure 12 suggests that a simpler reward rule, where we reward with the negative of the reaction time, may have achieved similar learning behavior. We discuss some of the issues around reward function engineering in the discussion that follows.

Finally, we looked for any linear relationships between the RTs and the relative EEG band powers, including interaction terms. All the p-values we report here are corrected for multiple comparisons using the Benjamini-Hochberg FDR procedure. For subject 1, we found no significant EEG band power predictors of the RT for the ε-simple, ε-complex and Boltzmann-complex conditions. For the Boltzmann-simple alerting, we find that all bands are slightly predictive of the RTs (p<0.05), as well as the interaction terms delta:theta power (p<0.05) and delta:beta power (p<0.05), but most strongly alpha:gamma power (p<0.0005). There were other plausible significant effects before FDR correction. Subject 2 had some significant p-values before FDR correction but none after, for any of the conditions. We give some interpretation of these and the rest of the results next.

Discussion

We have presented here some preliminary results of using a powerful end-to-end neural network-driven approach to modeling both the brain state and the behavioral state, as well as when to play an auditory alert to the user of the system. Having shown that the system can learn in a simulation setting, with simulation parameters derived from empirical/experimental runs, we then ran a (very) small-scale pilot with 2 subjects. As this was a pilot study to test that the methods can work in this setting at all, our focus was not on deriving specific results related to the EEG and the reaction times. Our primary focus was to show that the system can learn non-random alerting behavior and, perhaps, alert better than a randomly-alerting system. This latter point was, in part, successfully shown empirically.

In our simulation results section, we mentioned and showed how Boltzmann exploration of which action to choose leads to more balanced alerting. Interestingly, for 'good' values of the temperature τ (i.e. around 0.01) the agent learned to alert in a proportion similar to ε-exploration, though still not as often. The latter, on the other hand, will pick the best action with probability 1-ε, which means that once the agent finds that the alerting state leads to a lower reaction time, it is likely to prefer and consistently use that action when choosing greedily. While for many situations this may be desirable, when alerting human users or operators we do not want to alert constantly simply because alerting tends to improve the reaction time. This is why we implemented two reward rules, and showed results only with the approach which led to a more balanced proportion of alerted trials. The two reward mechanisms tried were, on the one hand, rewarding either a 1 or a -1, or these multiplied by some factor (e.g. 0.1 and -0.1, or 0.3 and -0.3); and, on the other hand, rewarding the difference between the mean of the moving window of the previous 10 states and the state we just 'landed' in. The latter has the effect of preferring not to alert too many times in a row, by virtue of the fact that if the system has just alerted many times in the window of states considered, then it is more likely that another alerted state will not have a lower reaction time and will lead to a negative reward, or punishment (i.e. because of regression to the mean of the alerted states). A related reason for the ineffectiveness of too much alerting in a row is likely that the human subject becomes habituated to the auditory alert.
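The two reward rules described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the study's exact implementation; the function names, the improvement criterion for the constant-reward rule, and the toy RT values are ours:

```python
from collections import deque


def simple_reward(current_rt, previous_rt, scale=1.0):
    """Constant-magnitude rule: +scale or -scale (e.g. 0.1/-0.1, 0.3/-0.3).
    Here we assume the sign is decided by whether the RT improved
    (decreased) relative to the previous trial."""
    return scale if current_rt < previous_rt else -scale


def windowed_reward(current_rt, rt_history, window=10):
    """Sliding-window rule: the difference between the mean RT of the
    previous `window` trials and the RT of the state we just 'landed'
    in. Positive when the new RT beats the recent average, so a run of
    effective alerts tends to pull the mean down and erode the reward."""
    recent = list(rt_history)[-window:]
    return sum(recent) / len(recent) - current_rt


# Toy usage: 10 previous reaction times in seconds
rts = deque([0.42, 0.40, 0.45, 0.41, 0.43, 0.39, 0.44, 0.42, 0.40, 0.41])
fast = windowed_reward(0.35, rts)   # faster than the recent mean -> positive
slow = windowed_reward(0.50, rts)   # slower than the recent mean -> negative
```

Note how the windowed rule makes the reward for alerting self-limiting: the more recent RTs have already been driven down, the harder it is for the next trial to beat their mean.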
We have experimented with various types of implicit 'regularizations' or modifications to the reward function to encourage the system not to behave in one way for too long. One approach we explored and found useful is to add an 'energy' term to the agent, which is decreased on alerting and increased on non-alerting, biasing the system towards non-alerting. This makes understanding and analysis of the behavior of the system more complicated still, and the energy term needs to be adjusted depending on the expected magnitude of the rewards for a given action.

Of course, running a continuous BCI NFB system is more complicated than the controlled case of the simulated distributions. Firstly, the brain and behavioral dynamics are highly non-stationary and would not follow the neat and unchanging distributions of the simulation (despite their being derived from empirical data). In general, while we expected alerted states to lead to lower reaction times, we found that this was not always the case. Part of the reason for this may be statistical, i.e. too few alerted or non-alerted trials (e.g. subject 1 had too few alerted trials for the ε-complex condition and subject 2 had very few non-alerted trials for the ε-complex and Boltzmann-complex conditions). Unlike with the simulation results, finding that the agent was stuck in one behavior/action for many trials does not directly suggest convergence of the algorithm to the 'best action'. Rather, we can imagine that being in one state for too long is likely to lead to diminishing returns (i.e. being alerted constantly for too long, or doing a repetitive and attention-demanding task for too long without the alerts), and the 'optimal action' is likely to switch. We observed this quite notably in the simple reward rule-driven alerting for both subjects.
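The energy-term idea can be sketched as follows. The paper does not specify how the energy enters the reward, so the shaping used here, the constants, and the function name are all our assumptions; the only parts taken from the text are that energy decreases on alerting, increases on non-alerting, and must be tuned to the expected reward magnitudes:

```python
def energy_biased_reward(raw_reward, action, energy,
                         alert_cost=0.05, rest_gain=0.05):
    """Bias the agent away from constant alerting with an 'energy' term.

    Energy is spent on every alert and recovered on non-alert trials.
    As a hypothetical shaping choice, the (possibly negative) energy is
    added to the reward of alert actions, so a long run of alerts makes
    further alerting progressively less attractive.
    action: 1 = alert, 0 = no alert.
    Returns the shaped reward and the updated energy.
    """
    if action == 1:
        energy -= alert_cost
    else:
        energy += rest_gain
    shaped = raw_reward + (energy if action == 1 else 0.0)
    return shaped, energy


# Toy usage: an alert trial followed by a non-alert trial
r1, e1 = energy_biased_reward(0.1, action=1, energy=0.0)  # energy dips
r2, e2 = energy_biased_reward(0.1, action=0, energy=e1)   # energy recovers
```

As the text notes, `alert_cost` and `rest_gain` have to be scaled against the expected reward magnitudes, otherwise the energy term either dominates the learning signal or has no effect.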
The complex rewards led to a different behavior, where a single action was preferred more strongly over a longer continuous time (though switching to the other action eventually). Interestingly, the sliding-window RT-difference reward rule (which is more complex than the simpler constant rewards) led to a simple and strong linear relationship between the ultimate reward and the reaction time, namely that the lower the reaction time, the higher the reward, with RTs near the running mean having rewards around 0. This is not surprising, though it is not at all immediately clear that simpler rewards of that form, where we reward the negative of the reaction time (i.e. small RTs are small punishments, large RTs are large punishments) or the inverse of the reaction time (small RTs lead to larger positive rewards and large RTs lead to smaller positive rewards), would lead to similar learning or behavior as the rule we employed. We found some plausible correlations, of various degrees of certainty, between the band powers (most notably the alpha band power and gamma band power linear interaction term) and the RT. We do not elaborate on those results, as this work's focus is on showing the ability to use this powerful deep neural network-driven RL approach to BCI and NFB.

An obvious weakness of the work we present here is its very small sample size (n=2). However, the point of the work is not so much to show generally-applicable neuroscientific results directly, but rather to show that we can work with deep RL in a BCI/NFB setting, which we have shown. We have also shown the effects of changing some of the parameters on simulated and real data. While in general Boltzmann exploration should learn better, as it exploits more information in the estimated Q values for the two actions, this is not necessarily what we observed.
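The two exploration strategies compared throughout this discussion can be sketched generically as follows. This is a textbook illustration with made-up Q-values, not the study's DQN code; the function names and example values are ours:

```python
import math
import random


def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action;
    otherwise pick the greedy (highest-Q) action, i.e. the best
    action is chosen with probability 1 - epsilon (plus ties from
    the random branch)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann(q_values, tau=0.01, rng=random):
    """Sample an action from a softmax over Q-values. The temperature
    tau controls how sharply the best action is preferred; sensible
    values depend on the Q-value magnitudes, which is why tau is
    harder to set by default than epsilon."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    r, cum = rng.random(), 0.0
    for a, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return a
    return len(q_values) - 1


# Toy usage with two actions (0 = no alert, 1 = alert)
q = [0.02, 0.05]
a_greedy = epsilon_greedy(q, epsilon=0.0)  # always the better action
a_soft = boltzmann(q, tau=0.01)            # usually, not always, the better action
```

This makes the tuning asymmetry concrete: ε = 0.1 behaves the same regardless of how large the Q-values are, whereas the same τ can make the softmax either near-greedy or near-uniform depending on the scale of the Q-value gap.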
In part, this is because it is easy to pick default parameter values for ε-exploration (e.g. ε = 0.1) which both quickly find the optimal state-action mapping and allow for sufficient exploration, compared to Boltzmann exploration, where sensible τ values depend on the specific Q-value magnitudes. These in turn depend on the reward function and therefore, in general, on the context to which the RL has been adapted. It is very likely that Boltzmann action-randomization will be more beneficial when there are more actions to be weighted, so that there is more information in the relative estimated utilities of the different actions, compared to our binary case.

It is possible, and we found multiple times, for trained agents to behave so as to maximize the rewards in 'funny ways'. During development of the presented methods, we had a version of the reward function where, on a wrong trial, the current RT was set to a high value (e.g. 1000 or 2000 milliseconds). This had the effect of punishing the RL agent when the human subject made a mistake on the task (e.g. most commonly this was pressing the space key when the response should have been inhibited). Having