Rational Thoughts in Neural Codes

Complex behaviors are often driven by an internal model, which integrates sensory information over time and facilitates long-term planning to reach subjective goals. We interpret behavioral data by assuming an agent behaves rationally — that is, they take actions that optimize their subjective reward according to their understanding of the task and its relevant causal variables. We apply a new method, Inverse Rational Control (IRC), to learn an agent’s internal model and reward function by maximizing the likelihood of its measured sensory observations and actions. This thereby extracts rational and interpretable thoughts of the agent from its behavior. We also provide a framework for interpreting encoding, recoding and decoding of neural data in light of this rational model for behavior. When applied to behavioral and neural data from simulated agents performing suboptimally on a naturalistic foraging task, this method successfully recovers their internal model and reward function, as well as the computational dynamics within the neural manifold that represents the task. This work lays a foundation for discovering how the brain represents and computes with dynamic beliefs.

Cognition | Neuroscience | Computation | Rational | Neural coding
Understanding how the brain works requires interpreting neural activity. The behaviorist tradition (1) aims to understand the brain as a black box solely from its inputs and outputs. Modern neuroscience has been able to gain major insights by looking inside the black box, but still largely relates measurements of neural activity to the brain's inputs and outputs. While this is the basis of both sensory neuroscience and motor neuroscience, most neural activity supports computations and cognitive functions that are left unexplained; we might call these functions 'thoughts'. To understand brain computations, we should relate neural activity to thoughts. The trouble is, how do you measure a thought?
Here we propose to model thoughts as dynamic beliefs that we impute to an animal, by combining explainable Artificial Intelligence (AI) cognitive models for naturalistic tasks with measurements of the animal's sensory inputs and behavioral outputs. We define an animal's task by the relevant dynamics of its world, observations it can make, actions it can take, and the goals it aims to achieve. The AI models that solve these tasks generate beliefs, their dynamics, and actions that reflect the essential computations needed to solve the task and generate behavior like the animal. With these estimated thoughts in hand, we propose an analysis of brain activity to find neural representations and transformations that potentially implement these thoughts.
Our approach combines the flexibility of complex neural network models with the interpretability of cognitive models. It goes beyond black-box neural network models that solve one particular task and find representational similarity with the brain (2-4). Instead, we solve a whole family of tasks, and then find the task whose solution best describes an animal's behavior. We then associate properties of this best-matched task with the animal's mental model of the world, and call it 'rational' since it is the right thing to do under this internal model of the world. Our approach also models latent variable dynamics, but it improves upon usual latent variable methods for neural activity that just compress data without regard to tasks or computation (5, 6). In contrast, our latent variables inherit meaning from the task itself, and from the animal's beliefs according to its internal model. This provides interpretability to both our behavioral and neural models.
To whom correspondence should be addressed. E-mail: xaq@rice.edu
We also want to ensure we can explain crucial neural computations that underlie ecological behavior in natural tasks. We can accomplish this by using tasks with key properties that ensure our model solutions implement these neural computations. First, a natural task should include latent or hidden variables: animals do not act directly upon their sensory data, as that data is merely an indirect observation of a hidden real world (7). Second, the task should involve uncertainty, since real-world sense data are fundamentally ambiguous and behavior improves when weighing evidence according to its reliability. Third, the relationships between latent variables and sensory evidence should be nonlinear in the task, since if linear computation were sufficient then animals would not need a brain: they could just wire sensors to muscles and compute the same result in one step. Fourth, the task should have relevant temporal dynamics, since actions affect the future, and useful properties of the world change; animals must account for this.
While natural tasks that animals perform every day do indeed have these properties, most neuroscience studies isolate a subset of them for simplicity. Although this has revealed important aspects of neural computation, it also potentially misses some of the fundamental structure of brain computation. Recent progress warrants increasing the naturalism and complexity of the tasks and models.
One major challenge for practical studies with increased complexity and naturalism is to record from many neurons with enough spatial and temporal precision to reveal the relevant computational dynamics for these tasks. Specifically, the dimensionality of neural data needs to be bigger than the dimensionality of our target tasks (8). Modern neurotechnology now affords us this opportunity: brain-wide calcium imaging at cellular resolution and fine-grained electrophysiological recording can record from thousands of neurons simultaneously at high frequency. Limited experimental time and coverage still hinder our ability to explore the neural representations. But with current large-scale neural data, we will increasingly have enough power to find neural representations and dynamics in naturalistic and cognitively interesting tasks.
This paper makes progress towards understanding how the brain produces complex behavior by providing methods to estimate thoughts and interpret neural activity. We first describe a model-based technique we call Inverse Rational Control for inferring latent dynamics which could underlie rational thoughts. Then we offer a theoretical framework about neural coding that shows how to use these imputed rational thoughts to construct an interpretable description of neural dynamics.
We illustrate these contributions by analyzing a task performed by an artificial brain, showing how to test the hypothesis that a neural network has an implicit representation of task-relevant variables that can be used to interpret neural computation. We choose an ecologically relevant foraging task that requires sensitivity to past rewards, current observations, and an internal memory state. Our approaches should serve as valuable tools for interpreting behavior and brain activity for real agents performing naturalistic tasks.

Results
Modeling behavior as rational. In an uncertain and partially observable environment, animals learn to plan and act based on limited sensory information and subjective values. To better understand these natural behaviors and interpret their neural mechanisms, it would be beneficial to estimate the internal model and reward function that explains animals' behavioral strategies. In this paper, we model animals as rational agents acting optimally to maximize their own subjective rewards, but under a family of possibly incorrect assumptions about the world. We then invert this model to infer the agent's internal assumptions and rewards and estimate the dynamics of internal beliefs. We call this approach Inverse Rational Control (IRC), because we infer the reasons that explain an agent's suboptimal behavior to control its environment.
This method creates a probabilistic model for an agent's trajectory of observations and actions, and selects model parameters that maximize the likelihood of this trajectory. We make assumptions about the agent's internal model, namely that it believes that it gets unreliable sensory observations about a world that evolves according to known stochastic dynamics. Finally, we assume that the agent's actions are chosen to maximize its own subjectively expected long-term utility. This utility includes both benefits, such as food rewards, and costs, such as energy consumed by actions; it should also account for internal states describing motivation, like hunger or fatigue, that modulate the subjective utility. We then use the agent's sequence of observations and actions to learn the parameters of this internal model for the world. Without a model, inferring both the rewards and latent dynamics is an underdetermined problem leading to many degenerate solutions. However, under reasonable model constraints, we demonstrate that the agent's reward functions and assumed dynamics can be identified. Our learned parameters include the agent's assumed stochastic dynamics of the world variables, the reliability of sensory observations about those world states, and subjective weights on action-dependent costs and state-dependent rewards.
A Markov Decision Process (MDP) is a Markov chain where at each time an agent can influence the world state transitions by deciding to take an action $a \in A$, according to $T(s_{t+1}|s_t, a_t)$. At each time step the agent receives a reward or incurs a cost (negative reward) that depends on the world state and action, $R(s_t, a_t)$. The agent's goal is to choose actions that maximize its value $V$, measured by total expected future reward (negative cost) with a temporal discount factor $\gamma \in (0, 1)$,
$$V = \Big\langle \sum_{t=1}^{\infty} \gamma^{t} R(s_t, a_t) \Big\rangle_{p(s_{1:\infty}, a_{1:\infty})}.$$
The actions are drawn from a state-dependent probability distribution called a policy, $\pi(a|s_t)$, which may be concentrated entirely on one action or may have some width. In a normal MDP, the agent can fully observe the current world state, but must plan for an unknown future. In a Partially Observed MDP (POMDP), the agent again does not know the future, but does not even know the current world state, and must instead infer it from its observations $o$.
Optimal solution of a POMDP requires the agent to compute a time-dependent posterior probability over the possible current world state, given its history of observations and actions. All of that history can be summarized concisely in a single distribution, the posterior $B$. It is useful to define a belief state $b$ as completely summarizing the posterior, so we can write $B(s_t|b_t) = B(s_t|o_{1:t}, a_{0:t-1})$. This belief state can be expressed recursively using the Markov property as a function of its previous value (Supplemental Information):
$$B(s_{t+1}|b_{t+1}) = \frac{1}{Z}\, O(o_{t+1}|s_{t+1}) \sum_{s_t} T(s_{t+1}|s_t, a_t)\, B(s_t|b_t),$$
where $O(o_{t+1}|s_{t+1})$ is the observation probability and $Z$ is a normalization constant.*
We can express the entire partially observed MDP as a fully observed MDP called a Belief MDP, where the relevant fully observed state is not the world state $s$ but instead the agent's own belief state $b$ (10). To do so, we must re-express the transitions and rewards as a function of these belief states, $T(b_{t+1}|b_t, o_{t+1}, a_t)$ and $R(b_t, a_t)$, as described in the Supplemental Information.
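The recursive belief update can be illustrated for a small discrete world. The following is a minimal sketch, not the authors' implementation: the array layouts `T[a, s, s_next]` and `O[s_next, o]`, the function name `belief_update`, and the toy two-state numbers are all assumptions made for illustration.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One step of a discrete Bayes filter over world states.

    belief: shape (S,), the current posterior B(s_t | b_t)
    T:      shape (A, S, S), T[a, s, s_next] = p(s_next | s, a)  (assumed layout)
    O:      shape (S, n_obs), O[s_next, o] = p(o | s_next)       (assumed layout)
    """
    predicted = belief @ T[action]                 # sum_s T(s'|s,a) B(s)
    unnormalized = O[:, observation] * predicted   # weight by observation probability
    Z = unnormalized.sum()                         # normalization constant
    return unnormalized / Z

# Toy telegraph process: reward unavailable (state 0) or available (state 1).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])      # a single hypothetical "wait" action
O = np.array([[0.8, 0.2],         # state 0 usually looks "blue" (obs 0)
              [0.3, 0.7]])        # state 1 usually looks "red"  (obs 1)
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
```

Seeing a "red" cue shifts the posterior toward the available state; the size of the shift depends on both the assumed dynamics and the assumed cue reliability, which is why these parameters can in principle be identified from behavior.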
The optimal agent then determines a value function $Q(b, a)$ over this belief space and allowed actions, based on its own subjective rewards and costs. This value can be computed recursively through the Bellman equation (11),
$$Q(b_t, a_t) = R(b_t, a_t) + \gamma \big\langle Q(b_{t+1}, a_{t+1}) \big\rangle_{p(b_{t+1}, a_{t+1}|b_t, a_t)},$$
where the average is taken over the next belief, which depends on the next observation through the belief transitions, and over the next action selected by the policy. The optimal policy deterministically selects whichever action maximizes the value $Q(b, a)$. An alternative stochastic policy samples from a softmax function over actions, $\pi(a|b) = \frac{1}{Z} \exp(Q(b, a)/\tau)$, with a temperature parameter $\tau$ and normalization constant $Z$. The randomness introduces a new sub-optimality to the agent: instead of choosing the action with the maximal value, the agent has some chance of choosing a worse action. In the limit of a low temperature $\tau$ we recover the optimal policy, but a real agent may be better described by a stochastic policy with some controlled exploration.
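The effect of the temperature on the softmax policy can be seen in a few lines. This is a minimal sketch with made-up action values; the function name `softmax_policy` is hypothetical.

```python
import numpy as np

def softmax_policy(q_values, temperature):
    """pi(a|b) proportional to exp(Q(b,a)/tau), computed stably."""
    z = (q_values - q_values.max(axis=-1, keepdims=True)) / temperature
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

q = np.array([1.0, 2.0, 0.5])        # illustrative values Q(b, a) for one belief
warm = softmax_policy(q, 1.0)        # stochastic: worse actions are sometimes chosen
cold = softmax_policy(q, 0.01)       # nearly deterministic: approaches the argmax
```

Subtracting the maximum before exponentiating does not change the probabilities but avoids overflow at low temperatures.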
Inverse Rational Control. Despite the appeal of optimality, animals rarely appear optimal in experimentally defined tasks, and not just by exhibiting more randomness. Short of optimality, what principled guidance can we have about an animal's actions that would help us understand its brain? One possibility is that an animal is 'rational', that is, optimal for different circumstances than those being tested. In this section we present a behavioral analysis based on the possibility that agents are rational in this sense. The core idea is to parameterize possible strategies of an agent by those tasks under which each is optimal, and find which of those best explains the behavioral data. We specify a family of POMDPs where each member has its own task dynamics, observation probabilities, and subjective rewards, together constituting a parameter vector $\theta$. These different tasks yield a corresponding family of optimal agents, rather than a single optimized agent. We then define a log-likelihood over the tasks in this family, given the experimentally observed data and marginalized over the agent's latent beliefs (Figure 1B):
$$L(\theta) = \log p(o_{1:T}, a_{1:T}|\theta, \phi) = \log \int db_{1:T}\; p(b_{1:T}, o_{1:T}, a_{1:T}|\theta, \phi). \qquad [3]$$
* A minor notational point is that we assume that the action is a function of the belief state, either because it is a deterministic function of the belief or because the stochastic output action is fully observed or appended to the belief state. Then we can write $B(s_t|o_t, a_{t-1}, b_{t-1}) = B(s_t|o_t, b_{t-1})$. This justifies the omission of an arrow in Figure 1A from $a_{t-1} \to b_t$. Alternatively we can allow the action to be partially observable and add another arrow to that figure.
In other words, we find a likelihood over which tasks an agent solves optimally. In [3] φ are known parameters in the experimental setup that determine the world dynamics. Since they only affect observed quantities in the graphical model, they do not affect the model likelihood over θ (Supplementary Information).
This mathematical structure connects interpretable models directly to experimentally observable data. We can now formalize important scientific problems in behavioral neuroscience. For example, we can maximize the likelihood to find the best interpretable explanation of an animal's behavior as rational within a model class, as we show below. We can also compare categorically different model classes that attribute to the agent different reward structures or assumptions about the task.
The log-likelihood [3] seems complicated, as it depends on the entire sequence of observations and actions and requires marginalization over latent beliefs. Nonetheless it can be calculated using the Markov property of the POMDP: the actions and observations constitute a Markov chain where the agent's belief state is a hidden variable. We show that it is possible to exploit this structure to compute this likelihood efficiently (Supplemental Information).
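Because the belief is a hidden variable in a Markov chain, the marginalization can be carried out with a forward recursion like that of a hidden Markov model. The sketch below assumes discretized beliefs and computes the action likelihood conditioned on the observations, dropping terms that depend only on the known parameters φ. The array layouts `Tb[a, o, b, b_next]` and `policy[b, a]`, the function name, and the random test data are assumptions for illustration, not the procedure from the Supplemental Information.

```python
import numpy as np

def irc_log_likelihood(actions, observations, prior, Tb, policy):
    """log p(a_{1:T} | o_{1:T}, theta), marginalized over discretized beliefs.

    prior:  shape (B,), distribution over the initial belief bin
    Tb:     shape (A, O, B, B), Tb[a, o, b, b_next] = p(b_next | b, a, o)
    policy: shape (B, A), policy[b, a] = pi(a | b)
    observations[t] is the observation received just before action t
    (observations[0] is unused in this sketch).
    """
    alpha = prior * policy[:, actions[0]]
    norm = alpha.sum()
    loglik = np.log(norm)
    alpha = alpha / norm
    for t in range(1, len(actions)):
        alpha = alpha @ Tb[actions[t - 1], observations[t]]  # propagate beliefs
        alpha = alpha * policy[:, actions[t]]                # weight by action taken
        norm = alpha.sum()
        loglik += np.log(norm)                               # accumulate in log space
        alpha = alpha / norm                                 # rescale to avoid underflow
    return loglik

rng = np.random.default_rng(0)
n_b, n_a, n_o = 4, 2, 3
Tb = rng.random((n_a, n_o, n_b, n_b))
Tb /= Tb.sum(axis=-1, keepdims=True)
policy = rng.random((n_b, n_a))
policy /= policy.sum(axis=1, keepdims=True)
prior = np.full(n_b, 1.0 / n_b)
actions = rng.integers(n_a, size=20)
observations = rng.integers(n_o, size=20)
ll = irc_log_likelihood(actions, observations, prior, Tb, policy)
```

The cost grows linearly with the length of the trajectory and quadratically with the number of belief bins, rather than exponentially as a naive sum over belief sequences would.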

Challenges and solutions for rationalizing behavior.
To solve the IRC problem, we need to parameterize the task, beliefs, and policies, and then we need to optimize the parameterized log-likelihood to find the best explanation of the data. This raises practical challenges that we need to address.
Our core idea for interpreting behavior is to parameterize everything in terms of tasks. All other elements of our models are ultimately referred back to these tasks. Consequently, the beliefs and transitions are distributions over latent task variables, the policy is expressed as a function of task parameters and preferences, and the log-likelihood is a function of the task parameters that we assume the agent assumes.
Thus, whatever representations we use for the belief space or policy, we need to be able to propagate our optimization over the task parameters through those representations. This is one requirement for practical solutions of IRC. A second requirement is that we can actually compute the optimal policies. Efficient representation of general beliefs and transitions is hard since the space of probabilities is much larger than the state space it measures. The belief state is a probability distribution and thus takes on continuous values even for discrete world states. For continuous variables the space of probabilities is potentially infinite-dimensional. This poses a substantial challenge both for machine learning and for the brain, and finding neurally plausible representations of uncertainty is an active topic of research (12-17). We consider two simple methods to solve IRC using lossy compression of the beliefs: discretization, or distributional approximation. We then provide a concrete example application in the discrete case.
Discrete beliefs and actions. If we have a discrete state space then we can use conventional solution strategies for Markov Decision Processes. For a small enough world space, we can exhaustively discretize the complete belief space, and then solve the Belief MDP problem with standard MDP algorithms (11, 18). In particular, the state-action value function Q(b, a) under a softmax policy π(a|b) can be expressed recursively by a Bellman equation, which we solve using value iteration (10, 11). The resultant value function then determines the softmax policy π, and thereby determines the policy-dependent term in the log-likelihood [3].
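One way to picture the discrete solution is a fixed-point iteration on Q in which the next action is averaged under the softmax policy derived from the current Q. This is a minimal sketch under stated assumptions: the belief transitions `T[a, b, b_next]` are taken to be already marginalized over observations, the toy rewards and transitions are random, and the function names are hypothetical. This iteration is not guaranteed to converge for every problem; the toy example uses two actions and a modest discount factor, for which it does.

```python
import numpy as np

def softmax(Q, temperature):
    z = (Q - Q.max(axis=1, keepdims=True)) / temperature
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def soft_value_iteration(R, T, gamma, temperature, n_iter=1000, tol=1e-10):
    """Iterate Q(b,a) = R(b,a) + gamma * sum_b' T(b'|b,a) sum_a' pi(a'|b') Q(b',a').

    R: shape (B, A) rewards on discretized beliefs
    T: shape (A, B, B), T[a, b, b_next] (assumed layout)
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(n_iter):
        pi = softmax(Q, temperature)
        V = (pi * Q).sum(axis=1)                       # value of each belief under pi
        Q_new = R + gamma * np.einsum('abc,c->ba', T, V)
        converged = np.abs(Q_new - Q).max() < tol
        Q = Q_new
        if converged:
            break
    return Q, softmax(Q, temperature)

rng = np.random.default_rng(0)
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, -0.2]])   # 3 belief bins, 2 actions
T = rng.random((2, 3, 3))
T /= T.sum(axis=-1, keepdims=True)
Q, pi = soft_value_iteration(R, T, gamma=0.7, temperature=0.5)
```

The returned policy is what enters the policy-dependent term of the log-likelihood, so changing the task parameters changes `R` and `T`, hence `Q`, hence the likelihood of the observed actions.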
Finally, to solve the IRC problem we can directly optimize this log-likelihood, for example by greedy line search (Supplementary Information). An alternative in higher-dimensional problems is to use Expectation-Maximization to find a local optimum, with a gradient ascent M-step (Supplementary Information, (19, 20)). To compute the gradient of the log-likelihood, we again use recursion to calculate the value gradient ∂Q/∂θ exactly, and use the chain rule to derive the policy gradient and then the Q auxiliary function gradient (Supplementary Information).
Continuous beliefs and actions. The computational expense of the discrete solution grows rapidly with problem size, and becomes intractable for continuous state spaces and continuous controls. A practical choice is to approximate posteriors by a finite set of summary statistics, and update them by a method like expectation propagation (21). The simplest example is to use quadratic statistics, i.e. a Gaussian posterior. This belief state can then be updated according to an extended Kalman filter that accounts for the agent's internal model of the stochastic nonlinear dynamics. For more general belief representations, the belief update equations may require additional flexibility.
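For a Gaussian belief, the state of knowledge is just a mean and a variance, and one extended Kalman filter step updates both. The scalar sketch below is illustrative only: the dynamics `f`, observation function `h`, their derivatives, and the noise variances are hypothetical stand-ins for the agent's internal model.

```python
import numpy as np

def ekf_step(mean, var, obs, f, f_prime, h, h_prime, q_var, r_var):
    """One scalar extended Kalman filter step for a Gaussian belief.

    f, h:             assumed dynamics and observation functions
    f_prime, h_prime: their derivatives, used for local linearization
    q_var, r_var:     assumed process and observation noise variances
    """
    # Predict: push the belief through the linearized dynamics.
    mean_pred = f(mean)
    F = f_prime(mean)
    var_pred = F * var * F + q_var
    # Update: correct the prediction with the new observation.
    H = h_prime(mean_pred)
    gain = var_pred * H / (H * var_pred * H + r_var)
    mean_new = mean_pred + gain * (obs - h(mean_pred))
    var_new = (1.0 - gain * H) * var_pred
    return mean_new, var_new

identity = lambda x: x
one = lambda x: 1.0
# With identity dynamics and equal prior and observation variances,
# the posterior mean lies halfway between the prior mean and the observation.
m, v = ekf_step(0.0, 1.0, 2.0, identity, one, identity, one, q_var=0.0, r_var=1.0)

# A nonlinear example: saturating dynamics with a linear observation.
m2, v2 = ekf_step(0.5, 0.2, 0.9, np.tanh, lambda x: 1.0 - np.tanh(x) ** 2,
                  identity, one, q_var=0.01, r_var=0.1)
```

In an IRC setting the noise variances and the parameters of `f` and `h` would belong to θ, so the belief trajectory itself depends on the task parameters being fit.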
Rational control with continuous actions also requires us to implement a family of continuous policies π that map from beliefs to actions. We use deep neural networks to implement these policies (22) through an actor-critic method (Deep Deterministic Policy Gradient, (23)), by which one 'critic' network estimates the value of each action taken by the 'actor' network.
Deep learning methods are commonly used in reinforcement learning to provide flexibility, but they lack interpretability: information about the policy is distributed across the weights and biases of the network. Crucially, to maintain interpretability, we parameterize this family by the task. Specifically, we provide the model parameters as additional inputs to a policy network, and learn the optimal policies simultaneously over a prior distribution on task parameters p(θ) (22). This allows the network to generalize its optimal strategies across POMDPs in the task family. It also allows us to compute policy gradients simply using auto-differentiation, which we exploit when optimizing the log-likelihood to find the parameters that best match an agent's behavior.
Ultimately, after optimizing the log-likelihood for either discrete or continuous representations, the end result is a set of parameters θ that best explain the observed behavioral data, and define the agent's assumed internal task model and subjective preferences. Within this model class, we have therefore found the best rational explanation for the agent's behavior.
Finding a neural code for rational thoughts. We don't presume that any real brain explicitly calculates a solution to the Bellman equation, but rather learns a policy by combining experience and mental modeling. With enough training, the result is an agent that behaves 'as if' it were solving the POMDP (Figure 2A).
If an animal's behavior is well-described as depending on latent beliefs, as we assume in Inverse Rational Control, then it makes sense that we should find neural correlates of these beliefs in the brain. If we can find such correlates, does this mean that the neurons encode or represent those beliefs? Some have argued that the notion of a neural code is a poor metaphor because it captures neither the causal or mechanistic structure of the brain, nor its relation to actions and affordances (24-26). For example, it may be that the brain does not use the neural signals that a neuroscientist can use to extract information about a task.
In contrast, we argue that the linked processes of encoding, recoding, and decoding can be a useful way of explaining task-relevant computation in the brain at the algorithmic or representational level (27). The brain's 'encoding' specifies how neural activity can be used to estimate task variables (Figure 2B), including both rewarded variables and irrelevant or nuisance ones that must be disentangled from them. 'Recoding' describes how that encoding is transformed over time and space by neural processing (Figure 2C). 'Decoding' describes how those estimates predict future actions (Figure 2D).
(In our use of these terms, we are taking the brain's perspective. The term 'decoding' more often reflects the scientist's perspective, where the scientist decodes brain activity to estimate encoding quality. Instead, we reserve the term decoding to describe how neural activity affects actions: we say that the brain decodes its own activity to generate behavior.)
Fig. 2. ... (C) We test the hypothesis that the brain recodes its beliefs rationally by testing whether the dynamics of the behaviorally estimated belief $\hat{b}$ match the dynamics of the neurally estimated beliefs $\check{b}$, as expressed through the update dynamics $\hat{f}_{dyn}(\hat{b}_t, o_t)$ and recoding function $\check{f}_{rec}(\check{b}_t, o_t)$. (D) Similarly, we test whether the brain decodes its beliefs rationally by comparing the behaviorally and neurally derived policies $\hat{\pi}_{act}$ and $\check{\pi}_{dec}$. Quantities estimated from behavior or from neurons are denoted by up-pointing or down-pointing hats, $\hat{\ }$ and $\check{\ }$ (Table S1).
This level of explanation need not capture every facet of neural responses nor the mechanism by which they evolve. Obviously it cannot explain responses to untested task variables. Nonetheless, it would be great progress if we can account for stimulus- and action-dependent neural dynamics within a task-relevant submanifold (28) that explains how pieces of information interact and predict behavior. Although this 'as-if' description cannot by itself legitimately claim to be causal, it provides useful predictions for causal tests about which neural features should influence computation and action (29, 30), and it could be promoted to a causal description if those tests succeed.
Next we describe the general structure of such a representation-level explanation. We then follow this approach to analyze an artificial brain performing a specific foraging task.
To begin the analysis, we propose to use Inverse Rational Control to estimate the agent's beliefs. To us, as scientists, the agent's beliefs are latent variables, so our algorithm can at best create a posterior $p(b)$ over those beliefs, or a point estimate $\hat{b}$ indicating the most probable belief. Here we will base our analyses on a point estimate over beliefs. First we will describe the general approach, and then we will apply this approach in an example analysis.
Encoding. First we aim to find the brain's encoding of the beliefs about latent variables. Specifically, we look for neural correlates of the estimate $\hat{b}$ of the agent's beliefs that we inferred by IRC. While there is little doubt that real behavior is influenced by uncertainty (31-34), it remains unclear how uncertain beliefs that influence actions are encoded by the brain. These beliefs could be represented in the brain in a multitude of ways, and resolving this question is an active topic of current research. One reason it is hard to make progress on this topic is that we cannot measure the agent's beliefs directly, except by assuming optimal inferences. IRC gives us a way to estimate suboptimal beliefs, so we can examine how the brain represents them.
Given beliefs $\hat{b}_t$ imputed by IRC, we can estimate how they are encoded in the neural responses $r$ using a (potentially nonlinear, potentially spatiotemporal) readout function $\phi_{enc}(r_t)$. This can be accomplished by minimizing an encoding loss such as
$$L_{enc} = \sum_t \ell_{enc}\big(\hat{b}_t, \phi_{enc}(r_t)\big),$$
where $\ell_{enc}$ penalizes differences between the behaviorally imputed beliefs and the neural readout. The fitted readout provides the neurally estimated beliefs $\check{b}_t = \phi_{enc}(r_t)$. (See Table S1.)
Recoding. While neural dynamics may affect every dimension of neural activity, we focus only on the interpretable dynamics within the lower-dimensional task manifold. By construction, those dynamics reflect the changes in the agent's beliefs. The rational control model predicts that beliefs are updated by sensory observations and past beliefs, with interactions that are determined by the internal model according to a function $b_{t+1} = f_{dyn}(b_t, o_t) + \eta_t$, where $f_{dyn}$ and $\eta_t$ reflect the deterministic and stochastic parts of the dynamics. If our neural analysis correctly identifies dynamics responsible for behavior, then the beliefs $\check{b}$ estimated from the neural encoding should be recoded over time following those same update rules. We estimate this neural recoding function $\check{f}_{rec}(\check{b}_t, o_t)$ directly from the sequence of neurally estimated beliefs $\check{b}$ by minimizing a recoding prediction loss, such as
$$L_{rec} = \sum_t \ell_{rec}\big(\check{b}_{t+1}, \check{f}_{rec}(\check{b}_t, o_t)\big),$$
where $\ell_{rec}$ penalizes differences between the actual and predicted future beliefs. We then compare $\check{f}_{rec}$ to the update dynamics posited by the behavioral model, $\hat{f}_{dyn}$ (Figure 2C). (We should compare these only over the distribution of experienced beliefs, i.e. those beliefs for which the recoding function matters in practice.) Agreement between these recoding functions implies that we have successfully understood the 'recoding' process. Even for good encoding models this is not guaranteed, since activity outside the encoding manifold could influence the neural dynamics.
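As a concrete instance of the encoding step, a linear readout with a squared-error loss can be fit by ordinary least squares. Everything in this sketch is synthetic and hypothetical: the stand-in beliefs `b_hat`, the random mixing into 40 simulated neurons, and the function names are assumptions used only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_neurons, n_beliefs = 500, 40, 2

# Synthetic stand-ins: behaviorally imputed beliefs and noisy linear "neurons".
b_hat = rng.random((n_time, n_beliefs))
mixing = rng.normal(size=(n_beliefs, n_neurons))
r = b_hat @ mixing + 0.05 * rng.normal(size=(n_time, n_neurons))

def fit_linear_encoding(r_train, b_train):
    """Least-squares readout minimizing sum_t ||b_t - phi_enc(r_t)||^2."""
    X = np.hstack([r_train, np.ones((len(r_train), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(X, b_train, rcond=None)
    return W

def read_out(r_new, W):
    """Neurally estimated beliefs from (possibly unseen) neural activity."""
    X = np.hstack([r_new, np.ones((len(r_new), 1))])
    return X @ W

W = fit_linear_encoding(r[:400], b_hat[:400])
b_check = read_out(r[400:], W)                    # held-out neural belief estimates
enc_loss = np.mean((b_check - b_hat[400:]) ** 2)  # held-out encoding error
```

Evaluating the loss on held-out time points guards against a readout that merely memorizes the training data, which matters when the number of neurons is large relative to the number of samples.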
The encoding dimensions may seem to change over time or context (2, 35). Perhaps this too should count as recoding, such that our approach of estimating beliefs from neural activity using a nonadaptive function ϕenc(r) would then miss important computations. However, this only indicates that our way of measuring the encoding is too limited. The real encoding model could be fixed but nonlinear (36), and can appear adaptive when measured by an inadequate model (37). More complex functions are harder to fit but the brain's neural code may require this added complexity.
Decoding. These encodings and recodings do not matter if the brain never decodes that information into behavior. We can evaluate how the brain uses this information by predicting actions from the neurally encoded beliefs, minimizing a decoding loss between observed actions and the distribution of actions $\check{a}$ predicted from the neurally estimated belief by the policy $\check{\pi}(\check{a}|\check{b})$:
$$L_{dec} = \sum_t \ell_{dec}\big(a_t, \check{\pi}_{dec}(\check{a}|\check{b}_t)\big),$$
where $\ell_{dec}$ penalizes actions that are unexpected according to the given policy. We then test the hypothesis that the brain decodes neurally encoded rational thoughts by comparing the neurally derived policy $\check{\pi}_{dec}$ against the behavioral policy, $\hat{\pi}_{act}$ (Figure 2D).
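A simple choice of decoding loss is the negative log-probability of each observed action, which leads to multinomial (softmax) regression from beliefs to actions. The sketch below uses a linear-in-belief model fit by plain gradient descent on synthetic data; the generating policy, the sample size, and the function names are hypothetical, and a nonlinear feature map could replace the raw belief.

```python
import numpy as np

def softmax_rows(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def fit_multinomial(X, actions, n_actions, lr=0.5, n_steps=3000):
    """Minimize the mean negative log-likelihood of actions by gradient descent."""
    X1 = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    W = np.zeros((X1.shape[1], n_actions))
    Y = np.eye(n_actions)[actions]                  # one-hot observed actions
    for _ in range(n_steps):
        P = softmax_rows(X1 @ W)
        W -= lr * X1.T @ (P - Y) / len(X1)          # cross-entropy gradient
    return W

def predict_proba(X, W):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return softmax_rows(X1 @ W)

# Synthetic data: actions sampled from a known softmax policy over a scalar belief.
rng = np.random.default_rng(2)
b = rng.random((3000, 1))
true_logits = b @ np.array([[-3.0, 0.0, 3.0]]) + np.array([1.0, 0.0, -1.0])
true_p = softmax_rows(true_logits)
actions = np.array([rng.choice(3, p=p) for p in true_p])

W_dec = fit_multinomial(b, actions, n_actions=3)
fit_p = predict_proba(b, W_dec)
```

Comparing `fit_p` with the policy of the behavioral model over the experienced beliefs is then the analogue of comparing the neurally derived and behavioral policies.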
Application to Foraging. We applied our analyses to understand the workings of a neural network performing a foraging task. The task requires an agent to combine unreliable sensory data with an internal memory to infer when and where rewards are available, and how to best acquire them. We train an artificial recurrent neural network to solve this task in a suboptimal but rational way, use Inverse Rational Control to infer its assumptions, subjective preferences, and beliefs, and then analyze its neural responses to test our coding framework.
Task description. Two locations ('feeding boxes') have hidden food rewards that appear and disappear according to independent telegraph processes with specified transition probabilities (Figure 3, (38)). The boxes provide unreliable color cues about the current reward availability, ranging from blue (probably unavailable) to red (probably available). We assume there are three possible locations for the agent: the locations of boxes 1 and 2, and a middle location 0. We include a small 'grooming' reward for staying at the middle location, to allow the agent to stop and rest. A few discrete actions are available to the agent: it can push a button to open a box to either get reward or observe its absence, it can move toward a new location, or it can do nothing. Traveling and pushing a button to open the box each have an associated cost. This disincentivizes the agent from repeating fruitless actions. When a button-press action is taken to open a box, any available reward there is acquired. Afterwards, the animal knows there is no more food available now in the box (since it was either unavailable or consumed) and the belief about food availability in that box is reset to zero.
Fig. 3. Illustration of foraging task with latent dynamics and partially observable sensory data. The reward availability in each of two boxes evolves according to a telegraph process, switching between available (red) and unavailable (blue), and colors give the animal an ambiguous sensory cue about the reward availability. The agent may travel between the locations of the two boxes. When a button is pressed to open a box, the agent receives any available reward.
Neural network agent. We first create a rational agent that solves a POMDP problem in this family, and then we use supervised learning to train a nonlinear recurrent neural network to match the belief dynamics and policy of that agent.
To create the rational agent, we discretize beliefs about reward availability for each box into N = 10 belief states. We define the transition matrix in the discretized belief space by binning the continuous transition matrix T(bt+1|bt, at). We allow a small diffusion between neighboring bins, which reflects dynamic belief stochasticity. With the defined transition matrices and reward functions for different actions for the internal model, we can solve for the optimal softmax policy. Figure S1A shows the architecture of our recurrent network. After training to match the rational agent, readouts of the neural activities closely match the POMDP agent's beliefs and policies (Figure S1B,C), but these task-relevant quantities are encoded implicitly in a large population of neurons.
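The binning step can be sketched for one box in the simplest case where the agent waits and receives no informative cue, so the belief just relaxes according to the telegraph process. This is an illustrative assumption, not the construction used in the paper: the appearance and disappearance probabilities, the diffusion weight, the reflecting edges, and the function name are all hypothetical.

```python
import numpy as np

def binned_belief_transition(n_bins, p_appear, p_vanish, diffusion=0.05):
    """Transition matrix over belief bins for one box with no new evidence.

    The continuous update b' = p_appear * (1 - b) + (1 - p_vanish) * b is
    applied to each bin center, mapped to the bin containing b', and then
    blurred slightly into neighboring bins.
    """
    centers = (np.arange(n_bins) + 0.5) / n_bins
    next_belief = p_appear * (1.0 - centers) + (1.0 - p_vanish) * centers
    idx = np.minimum((next_belief * n_bins).astype(int), n_bins - 1)
    T = np.zeros((n_bins, n_bins))
    T[np.arange(n_bins), idx] = 1.0                  # deterministic binned update
    # Nearest-neighbor diffusion with reflecting edges; each row sums to one.
    D = (1.0 - 2.0 * diffusion) * np.eye(n_bins)
    D += diffusion * (np.eye(n_bins, k=1) + np.eye(n_bins, k=-1))
    D[0, 0] += diffusion
    D[-1, -1] += diffusion
    return T @ D

T_wait = binned_belief_transition(n_bins=10, p_appear=0.1, p_vanish=0.05)
```

One such matrix per action (and per observation, when cues are informative) is what a standard MDP solver then operates on in the discretized Belief MDP.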
We then collected sensory observations and actions from the neural network agent while it was challenged by a different task than the one for which it was optimized (Methods). These inputs led to a time series of observations ot, actions at, and neural activity rt. Together these constitute the experimental measurements.
Inverse Rational Control for foraging. We don't know the agent's assumed world parameters, nor do we know the agent's subjective costs, nor the amount of randomness (softmax temperature). Our goal is to estimate a simulated agent's internal model and belief dynamics from its chosen actions in response to its sensory observations. We infer all of these using IRC.
The actions and sensory evidence (color cues, locations and rewards) obtained by the agent all constitute observations for the experimenter's learning of the agent's internal model. Based on these observations over 1000 time points, including 364 movements and 109 button presses, we use IRC to infer the parameters of the internal model that can best explain the behavioral data (Figure 4A). The comparison between the true parameters and the estimated parameters is shown in Figure 4B. Data limitations imply some discrepancy between the true parameters and the estimated parameters, which can be reduced with more data. With the estimated parameters, we are able to infer a posterior over the dynamic beliefs (Figure 5A). (Note that this is an experimenter's posterior over the agent's subjective posterior!) The inferred posterior is consistent with the agent's true subjective probability of the food availability in each box. The inferred distributions over beliefs reveal strong correlations between the true and estimated belief state (Figure 5B). Figure 5C-F shows that the artificial brain and inferred agent choose actions with similar frequencies, occupy the three locations for the same fraction of time, and wait similar amounts of time between pushing buttons or travelling. This demonstrates that the IRC-derived agent's internal model generates behaviors that are consistent with behaviors of the agent from which it learned.
Neural analysis of rational foraging. We can now use our neural coding framework to look inside the brain.
We assume that beliefs b_t are linearly encoded instantaneously in neural activity r_t. For our example synthetic brain, this is correct by construction. After performing linear regression of behaviorally derived beliefs b against neural activity r, we can estimate other beliefs b from previously unseen neural data. Figure 6A shows that these beliefs estimated from neural data are accurate. Figure 6B shows that the recoding dynamics obtained from the neural belief dynamics also match the dynamics described by the rational model. We characterize these neural dynamics using kernel ridge regression between b_t and b_{t+1} (Methods). The resultant temporal changes in the neurally-derived beliefs, Δb_t = f_rec(b_t, o_t) − b_t, agree with the corresponding changes in the behavioral model beliefs, Δb_t = f_dyn(b_t, o_t) − b_t. Although some of these changes are driven directly by the sensory observations (colors), that only explains part of the belief updates: even conditioned on a given sensory input at one time, the updates agree between the neurons and the behavioral model. This provides evidence that we understand the internal model that governs recoding at the algorithmic level.
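As an illustration of the linear regression step, the following minimal sketch fits a linear map from neural activity to beliefs on one portion of a data set and evaluates it on held-out activity. All data are synthetic stand-ins generated inside the script (random beliefs, a random linear embedding, Gaussian noise); the array sizes, noise level, and train/test split are assumptions made for the example, not values from the analysis reported here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_neurons, n_beliefs = 1000, 300, 2   # illustrative sizes (assumed)
n_train = 800                                  # hypothetical train/test split

# Synthetic stand-ins: beliefs in [0, 1], linearly embedded in noisy activity.
beliefs = rng.uniform(0.0, 1.0, size=(n_time, n_beliefs))
embedding = rng.normal(size=(n_beliefs, n_neurons))
activity = beliefs @ embedding + 0.1 * rng.normal(size=(n_time, n_neurons))

# Least-squares fit of beliefs ~ activity @ W + c on the training portion.
design = np.hstack([activity[:n_train], np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(design, beliefs[:n_train], rcond=None)
W, c = coef[:-1], coef[-1]

# Estimate beliefs from previously unseen neural data.
decoded = activity[n_train:] @ W + c
held_out = beliefs[n_train:]
corr = [np.corrcoef(decoded[:, k], held_out[:, k])[0, 1] for k in range(n_beliefs)]
```

Because the synthetic activity is linear in the beliefs by construction, the held-out correlations in `corr` should be close to 1; with real recordings the quality of this fit is itself a test of the linear-encoding assumption.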
Similarly, our analysis of neural decoding uses nonlinear multinomial regression to fit the probabilities π_dec(a|b) of allowed actions as a function of neurally derived beliefs (Methods). A comparison of the resultant function to the rational policy π_act shows that these two decoding functions match reasonably well (Figure 6C). This provides evidence that we understand the decoding process by which task-relevant neural activity generates behavior.

Discussion
In this work we used an explainable AI paradigm to infer an internal model, latent beliefs, and subjective preferences of a rational agent that solves a POMDP. We fit the model by maximizing the likelihood of the agent's sensory observations and actions over a family of tasks. We then described a neural coding framework for testing whether the imputed latent beliefs encoded in a low-dimensional manifold of neural responses are recoded and decoded in a manner consistent with this behavioral model. We illustrated these two contributions by analyzing the neural coding of an implicit computational model by an artificial neural network trained to solve a simple foraging task requiring memory, evidence integration, and planning. For this simulated data, we successfully recovered the agent's internal model and subjective preferences, and found neural computations consistent with that model.

Related work.
Our approach generalizes previous work in artificial intelligence on the inverse problem of learning agents by observing behavior. Methodologically, other studies of inverse problems address parts of Inverse Rational Control, but with a non-scientific goal: getting artificial agents to solve tasks by learning from demonstrations of expert behavior. Inverse Reinforcement Learning (IRL) tackles the problem of learning how an agent judges rewards and costs based on observed actions (39), but assumes a known dynamics model (20, 40). Conversely, Inverse Optimal Control (IOC) learns the agent's internal model for the world dynamics (41) and observations (42), but assumes the reward functions are known. In (43, 44) both reward function and dynamics were learned, but only the fully-observed MDP case is explored. We solve the natural but more difficult partially-observed setting, and ensure these solutions provide a scientific basis for interpreting animal behavior. As a cognitive theory, by positing a rational but possibly mistaken agent, our approach resembles Bayesian Theory of Mind (BToM) (45-50). Previous work in BToM has considered tasks with uncertainty about static latent variables that were unknown until fully observed (50), or tasks with partially observed variables but simpler trial-based structure (45, 46). Here we allow for a more natural world, with dynamic latent variables and partial observability, and we infer models where agents make long-term plans and choose sequences of actions. Where prior work in BToM learned subjective rewards (50) or internal models (48), our Inverse Rational Control infers both internal models and subjective preferences in a partially observable world.
In addition, BToM studies have focused their attention on models of behavior, whereas our purpose is to connect dynamic model computations to brain dynamics. Some work has posited a POMDP model for behavior and hypothesized how specific brain regions might implement the relevant computations (51). Here we demonstrate an analysis framework to test such connections, by examining neural representations of latent variables and showing how computational functions could be embodied by low-dimensional neural dynamics.
While low-dimensional neural dynamics is an important topic for emerging studies of large-scale neural activity (2, 6, 52), few have been able to relate these dynamic activity patterns to interpretable latent model variables. Far more commonly, these low-dimensional manifolds are attributed to an intrinsically generated manifold (28, 53), or are related to measurable quantities like sensory inputs or behavioral outputs (2, 54, 55). Population activity in the visual system is known to relate to latent representations extracted by trained deep networks (3, 4), and while this shows that many task-relevant features extracted by machine learning solutions are also task-relevant for the visual system, these feature sets do not yet account for temporal dynamics or uncertainty, nor are they readily interpretable (56). Our proposed model-based analysis of population activity is currently our best bet for finding interpretable computational principles.

Virtues of representation-level explanations.
Many researchers in machine learning express skepticism that we can find much that is human-interpretable about either artificial or biological neural networks (57, 58). One interesting counterargument is that near any solution found by machine learning optimization, there may be other solutions that perform similarly while retaining interpretability (59). More humbly, even if we cannot find an interpretable network that exactly instantiates the brain's computations, we may still glean satisfying and useful insights from partial explanations at a higher level of abstraction (60-62).
Although the brain may not be thoroughly interpretable, we may benefit from imposing some interpretability, even at the cost of a perfect model. On the other hand, we may find instead that this imposition may lead us to more accurate neural representations that better reflect our abilities to interact with latent variables at many scales, things that brute force deep learning methods fail to find without explicit training. Finally, task-based cognitive models may reveal core principles that appear canonically across the brain.
Our recoding and decoding analysis does not apply to neural responses directly, but rather to the task-relevant information encoded in those responses. This targeted dimensionality reduction abstracts away the fine details of the neural signals in favor of an algorithmic- or representational-level description. This decreases the number of parameters needed to characterize dynamics, reducing overfitting. More importantly, it can avoid the massive degeneracies inherent in neuron-level mechanisms: different neural networks could have entirely different neural dynamics but could share the task-relevant computations. This illustrates how a deeper, more invariant understanding of neural computations is possible at the algorithmic level than at the mechanistic level.
Limitations. Future states are relevant for selecting actions, but in our formalism they are embedded implicitly in a learned policy, so an agent does not need to imagine any possible futures once learning is completed. Introspectively, our thoughts are often dedicated to anticipating what might happen in the future, and neural activity shows signatures of such predictions (63). Thus a natural extension of our approach would be to examine the neural coding of these types of rational thoughts directed at future (and past) world states, both for learning and for re-evaluating policies dynamically.
We applied our method only to a fairly simple task, but our framework is quite general and can scale to much more complex tasks, and can model common errors of cognitive systems. It can be used to infer false beliefs derived from incorrect or incomplete knowledge of task parameters. It can also be used to infer incorrect structure within a given model class. For example, it is natural for animals to assume that some aspects of the world, such as reward rates at different locations, are not fixed, even if an experiment actually uses fixed rates (64). Similarly, an agent may have a superstition that different reward sources are correlated even when they are independent in reality. Given a model class that includes such counterfactual relationships between task variables, our method can test whether an agent holds these incorrect assumptions.
However, our approach does require a model, and it is unlikely that the brain's full internal model is easily expressible compactly. Large-scale tasks are being solved with neural networks (65, 66) that provide rich state representations, but may not permit interpretation. This may be an unavoidable limitation in a world of complex structure (57, 58). Or it may be that these uninterpretable representations are insufficiently constrained, and that richer tasks, multi-task training, and more latent variables may bias networks toward more human-interpretable representations (59, 67, 68) that relate more closely to actionable causal latent variables (24).
In experiments, uncontrolled but structured variability could arise from internal noise sources, internal states, or thoughts about other tasks. Here we have neglected these effects, but when analyzing task-relevant computation in real brains it may help to allow for structured latent dynamics that have no grounding in a task or model (69,70) or may have an implementational purpose (13, 15, 71, 72).

Conclusion.
The success of our methods on simulated agents suggests they could be fruitfully applied to experimental data from real animals performing such foraging tasks (38, 73), as well as to richer tasks requiring even more sophisticated computations. Using explainable AI to construct belief states, their dynamics, and their utility for solving interesting tasks will provide useful targets for interpreting dynamic neural activity patterns, which could help identify the neural substrates of thoughts.

Materials and Methods
Inverse Rational Control. Full mathematical details for IRC are available in Supplementary Information. Code for the discrete case is available at https://github.com/XaqLab/IRC_TwoSiteForaging.
Foraging task and POMDP agent parameters. The foraging task described in the Results has two reward boxes for which the true reward availability followed a telegraph process, alternating between available and unavailable at uniform switching rates. For the two boxes, the true appearance and disappearance probabilities in one time step were γ*_1 = 0.15, γ*_2 = 0.1 and ε*_1 = 0.05, ε*_2 = 0.04.
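The telegraph process for a single box can be sketched in a few lines. The example below uses the box 1 values quoted above (appearance probability 0.15, disappearance probability 0.05) and ignores the agent entirely, so food is never collected; the function names, the number of simulated steps, and the random seed are assumptions made for illustration.

```python
import random

def simulate_telegraph(p_appear, p_disappear, n_steps, seed=0):
    """Simulate food availability (0 or 1) for one box, with no agent present.

    p_appear    : per-step probability that food appears when absent
    p_disappear : per-step probability that food disappears when present
    """
    rng = random.Random(seed)
    state = 0
    states = []
    for _ in range(n_steps):
        if state == 0:
            if rng.random() < p_appear:
                state = 1
        elif rng.random() < p_disappear:
            state = 0
        states.append(state)
    return states

def stationary_availability(p_appear, p_disappear):
    """Long-run fraction of time steps with food available (two-state Markov chain)."""
    return p_appear / (p_appear + p_disappear)

box1 = simulate_telegraph(0.15, 0.05, n_steps=200_000)
empirical = sum(box1) / len(box1)
expected = stationary_availability(0.15, 0.05)
```

For these values the two-state chain spends a fraction 0.15 / (0.15 + 0.05) = 0.75 of its time in the available state when left undisturbed, and the simulated fraction `empirical` should approach that value for long runs.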
Each box also displayed a sensory cue at each time conditioned on the reward availability, comprising five possible colors, with redder (bluer) colors indicating higher (lower) probability that food is currently available in the box. For the task to be interesting, the distributions under the two states should overlap enough that the animal cannot depend primarily on the color cue to anticipate the food availability. Color values for both boxes are drawn independently at each time from a binomial distribution with five states, with mean q*_1 = 0.4 when food is available in the box, and q*_2 = 0.6 otherwise, and variance 0.96 in both cases. The target agent makes wrong assumptions about all of these parameters, acting rationally for a task where γ_1 = 0.2, γ_2 = 0.15, ε_1 = 0.1, ε_2 = 0.08, q_1 = 0.42, and q_2 = 0.66.
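The color-cue distributions can be checked numerically. The sketch below assumes that "five states" means a binomial distribution with n = 4 trials (outcomes 0 to 4) and that the quoted values 0.4 and 0.6 are the per-trial probabilities; under that reading the variance n q (1 − q) equals 0.96 in both cases, which matches the variance stated above. The overlap measure at the end is an illustrative choice, not a quantity used in the paper.

```python
from math import comb

def color_distribution(q, n_colors=5):
    """Binomial probabilities over color indices 0..n_colors-1 (assumes n = n_colors - 1 trials)."""
    n = n_colors - 1
    return [comb(n, k) * q**k * (1.0 - q) ** (n - k) for k in range(n_colors)]

def mean_and_variance(probs):
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

p_available = color_distribution(0.4)     # true cue distribution when food is available
p_unavailable = color_distribution(0.6)   # true cue distribution otherwise

mean_a, var_a = mean_and_variance(p_available)
mean_u, var_u = mean_and_variance(p_unavailable)

# Shared probability mass of the two distributions (1.0 would mean identical cues).
overlap = sum(min(a, u) for a, u in zip(p_available, p_unavailable))
```

The two distributions share a large portion of their probability mass, so a single color sample is only weak evidence about food availability and the agent has to integrate cues over time.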
We measure gains and losses in currency of reward, R ≡ 1. In those units, our target agent incurs a subjective cost of 0.3 when pressing the button, and a cost of 0.2 when traveling. Switching between boxes requires two steps, for a total cost of 0.4. We also allow a 'grooming' reward R = 0.2 for waiting at the center location. Our agent uses a softmax policy with temperature τ = 0.1.
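The effect of the softmax temperature can be illustrated with a short sketch. The action values below are hypothetical numbers chosen for the example; only the temperature τ = 0.1 comes from the text.

```python
import math

def softmax_policy(action_values, temperature=0.1):
    """Return action probabilities proportional to exp(value / temperature)."""
    scaled = [v / temperature for v in action_values]
    shift = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

values = [0.50, 0.45, 0.30]                  # hypothetical action values, in units of reward
probs = softmax_policy(values, temperature=0.1)
probs_warm = softmax_policy(values, temperature=1.0)
```

At a low temperature, value differences of a few hundredths of a reward already shift the choice probabilities noticeably, whereas at temperature 1.0 the same values give a nearly uniform policy.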

Simulated brain.
We trained a neural network to match the behavior of a rational agent. The target behavior was implemented by an agent that used optimal belief updates and a softmax policy trained to solve a Belief MDP by value iteration (11). Our neural network used one recurrently connected layer of 300 rectified linear units (ReLUs) that received external inputs from the world-generated observations and agent-generated actions. Beliefs were estimated from this recurrent layer by a linear weighted sum. In parallel, the recurrently connected neurons provided input to a two-layer perceptron, with 100 ReLU neurons followed by 5 policy neurons ( Figure S1).
The architecture was built in PyTorch and optimized by supervised learning using gradient descent in two phases, first on a mean-squared error loss function and then on a KL-divergence loss function. First, the recurrent connection strengths and the linear belief readout were jointly optimized by backpropagation through time to match the dynamic beliefs of the target agent. Second, the linear belief readout was discarded, and the recurrent units' outputs were passed through the two nonlinear stages and were optimized so that the 5 policy neurons matched the target POMDP policy at all times. After 60 iterations of 20 batches of 500 time points per batch, the trained neural network successfully reproduced the target beliefs within a mean squared error of 0.003, and the target policy within an average KL divergence of 0.005.
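The forward pass of this architecture, together with the two loss functions, can be sketched as follows. The sketch uses NumPy rather than PyTorch to stay dependency-light, leaves the weights random and untrained, and feeds random inputs; the input dimension, the number of belief outputs, the weight scales, and the absence of bias terms are all assumptions made for illustration.

```python
import numpy as np

N_INPUT = 8        # hypothetical size of the observation-plus-action input vector
N_RECURRENT = 300  # recurrent ReLU units
N_HIDDEN = 100     # ReLU units in the policy branch
N_ACTIONS = 5      # policy neurons, one per action
N_BELIEFS = 2      # assumed: one food-availability belief per box

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def init_params(rng):
    scale = 1.0 / np.sqrt(N_RECURRENT)
    return {
        "W_rec": rng.normal(0.0, scale, (N_RECURRENT, N_RECURRENT)),
        "W_in": rng.normal(0.0, 1.0, (N_RECURRENT, N_INPUT)),
        "W_belief": rng.normal(0.0, scale, (N_BELIEFS, N_RECURRENT)),
        "W_hidden": rng.normal(0.0, scale, (N_HIDDEN, N_RECURRENT)),
        "W_policy": rng.normal(0.0, 0.1, (N_ACTIONS, N_HIDDEN)),
    }

def step(r_prev, x, p):
    """One time step: recurrent update, linear belief readout, policy branch."""
    r = relu(p["W_rec"] @ r_prev + p["W_in"] @ x)
    belief = p["W_belief"] @ r                 # linear weighted sum (phase 1 target)
    hidden = relu(p["W_hidden"] @ r)
    policy = softmax(p["W_policy"] @ hidden)   # phase 2 target
    return r, belief, policy

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
params = init_params(rng)
r = np.zeros(N_RECURRENT)
for _ in range(10):
    x = rng.normal(size=N_INPUT)               # stand-in for observations and last action
    r, belief, policy = step(r, x, params)
    action = int(rng.choice(N_ACTIONS, p=policy))
```

In a training loop, `mse` would compare `belief` with the target agent's beliefs in the first phase, and `kl_divergence` would compare the target policy with `policy` in the second.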
The trained neural network could then be run autonomously in closed-loop mode, sampling its own actions from a softmax distribution applied to the 5 output neurons.
Neural coding analysis. Encoding: We find an encoding matrix W by regressing b against r. This produces neural estimates of task-relevant variables, b = W r + c, for new data. Recoding: We find dynamics by regressing b_t against (b_{t−1}, o_t) with kernel ridge regression. The kernel functions are radial basis functions with centers on all possible target beliefs and a width at half-max equal to the spacing between beliefs. This yields the 'recoding' function f_rec(b_t, o_t) representing the nonlinear dynamics of the neural beliefs. We compare the belief updates Δb_t = f_rec(b_t, o_t) − b_t from the recoding function with the corresponding belief updates from the task dynamics, Δb_t = f_dyn(b_t, o_t) − b_t. Decoding: We compute the brain's 'decoding' function, i.e. an approximate policy π_dec, using nonlinear multinomial regression of b against a with the same radial basis functions as used in recoding. We use a feature space of radial basis functions with centers on a 9 × 9 grid over beliefs, with width equal to the center spacing, and an outer product space over locations.

Supporting Information Appendix

[Figure S1 caption, partial] ... fully-connected neurons with a ReLU nonlinearity. There are 5 policy neurons, one for each possible action, and the network samples an action from the softmax over these policy neurons. Notice that there are no hats over these quantities, because these are not estimates. B: The neural network has almost the same beliefs as a rational agent given the same observations. C: Neural network reproduces the policy of a rational agent.

To find the optimal policy, an agent evaluates the value of each action and state. If the agent were given future observations and actions, then its future beliefs would be known. But when observations are unknown, the agent has only a distribution over beliefs, arising from the distribution of future observations it may encounter from the distribution of future world states. The transition probability between belief states, T(b_{t+1}|b_t, a_t), is then obtained by averaging over the distribution of future observations given the present belief and action. The parameters of this belief transition probability T(b_{t+1}|b_t, a_t) therefore include parameters from both the world state transitions T(s_{t+1}|s_t, a_t) and observation functions O(o_t|s_t).
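To make the dependence on both sets of parameters concrete, the following sketch computes the belief transition for a single box when the agent simply waits. It is a minimal illustration under stated assumptions: the belief is the scalar probability that food is available, the color cue is binomial over five colors with n = 4 trials, the observation is made after the world transition, and the function names are hypothetical. The numerical values are the target agent's assumed parameters for box 1 quoted in Materials and Methods.

```python
from math import comb

def color_likelihood(color, q, n_colors=5):
    """Probability of a color index under a binomial cue model (assumed n = n_colors - 1)."""
    n = n_colors - 1
    return comb(n, color) * q**color * (1.0 - q) ** (n - color)

def predict_belief(belief, p_appear, p_disappear):
    """Propagate P(food available) through one step of the telegraph process."""
    return belief * (1.0 - p_disappear) + (1.0 - belief) * p_appear

def belief_transition(belief, p_appear, p_disappear, q_available, q_unavailable, n_colors=5):
    """Return (next_belief, probability) pairs, one per possible color observation."""
    predicted = predict_belief(belief, p_appear, p_disappear)
    outcomes = []
    for color in range(n_colors):
        joint_available = predicted * color_likelihood(color, q_available, n_colors)
        joint_unavailable = (1.0 - predicted) * color_likelihood(color, q_unavailable, n_colors)
        p_color = joint_available + joint_unavailable      # P(o | b, a)
        outcomes.append((joint_available / p_color, p_color))  # Bayes update, its probability
    return outcomes

outcomes = belief_transition(0.5, p_appear=0.2, p_disappear=0.1,
                             q_available=0.42, q_unavailable=0.66)
total_probability = sum(p for _, p in outcomes)
expected_next_belief = sum(b * p for b, p in outcomes)
```

Each possible observation leads to a different next belief, so the belief transition is a distribution over next beliefs weighted by the observation probabilities. Its mean equals the predicted belief from the world dynamics alone, while its spread is set by the observation model.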
The true instantaneous reward function R(s, a) depends on the actual state and action. But for planning into the future, the agent must consider the reward as a function of its beliefs, which it expects to be

$$R(b_t, a_t) = \int ds_t\, R(s_t, a_t)\, B(s_t \mid b_t) \qquad [9]$$

These beliefs, belief transitions T, and rewards R then determine the optimal policy through the Bellman equation.

Line search method. In small problems like the foraging task considered in the main text, we can sometimes optimize the log-likelihood function L(θ) directly by a greedy line search method. Here we iteratively perform one-dimensional grid searches along random directions in parameter space. Once we find the optimal parameters on a line, we choose a new direction randomly from that starting point. We repeat this procedure until convergence.

† Unfortunately, the conventional notations in EM and reinforcement learning collide here, both using the same letter: this Q auxiliary function is denoted in the Calligraphic font to distinguish it from the state-action value function Q in the MDP model.

EM algorithm. The EM algorithm (19) enables us to solve for the parameters that give the best explanation of the observed data, while inferring unobserved states in the model. Recall that the log-likelihood of the observed data L(θ) can be written as

$$L(\theta) = \log \int db_{1:T}\; p(b_{1:T}, o_{1:T}, a_{1:T}, s_{1:T} \mid \theta, \phi) \qquad [19]$$

Here θ is a parameter vector which includes both assumptions about the world dynamics and the parameters determining the subjective magnitudes of rewards and action costs. We alternately update the parameters θ to improve the expected complete-data log-likelihood, and calculate the posterior over latent states based on the estimated parameters from the most recent iteration.
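The greedy line search described above can be sketched as follows. This is a minimal illustration, not the implementation used for the reported results: the objective is a hypothetical concave stand-in for the log-likelihood, the loop runs for a fixed number of rounds instead of testing for convergence, and the grid width and resolution are arbitrary choices.

```python
import math
import random

def greedy_line_search(objective, theta0, n_rounds=200, half_width=1.0, n_grid=41, seed=0):
    """Maximize `objective` by repeated 1-D grid searches along random directions."""
    rng = random.Random(seed)
    theta = list(theta0)
    best_value = objective(theta)
    for _ in range(n_rounds):
        # Draw a random unit direction in parameter space.
        direction = [rng.gauss(0.0, 1.0) for _ in theta]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Evaluate a grid of points on the line through the current parameters.
        line_theta, line_value = theta, best_value
        for i in range(n_grid):
            step = -half_width + 2.0 * half_width * i / (n_grid - 1)
            candidate = [t + step * d for t, d in zip(theta, direction)]
            value = objective(candidate)
            if value > line_value:
                line_theta, line_value = candidate, value
        # Move to the best point found on this line (or stay put if none is better).
        theta, best_value = line_theta, line_value
    return theta, best_value

def toy_log_likelihood(theta):
    """Hypothetical objective with its maximum at (0.2, 0.1)."""
    x, y = theta
    return -((x - 0.2) ** 2 + 2.0 * (y - 0.1) ** 2)

theta_hat, value_hat = greedy_line_search(toy_log_likelihood, [0.8, 0.9])
```

The search never accepts a worse point, so the objective is non-decreasing across rounds; the final accuracy is limited by the grid spacing along each line.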
According to the EM algorithm, in the E-step the estimated parameters θ_old from the previous iteration determine the posterior distribution of the latent variable given the observed data, P(b_{1:T} | a_{1:T}, o_{1:T}, θ_old). In the M-step, the observed data log-likelihood function to be maximized reduces to

$$L(\theta) = \mathcal{Q}(\theta, \theta_{\text{old}}) + H\big(P(b_{1:T} \mid a_{1:T}, o_{1:T}, \theta_{\text{old}})\big) \qquad [20]$$

To be consistent with (75), we use 𝒬(θ, θ_old) as the auxiliary function that describes the expected complete data log likelihood, and H(·) is the entropy of the posterior of the latent variable. Note that H(·) is not a function of θ, and thus has a fixed value if θ_old is fixed.
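The decomposition of the log-likelihood into an auxiliary function plus an entropy can be verified on a toy problem. The sketch below uses a single binary latent variable z and one fixed observation, with hypothetical likelihood values; it is unrelated to the foraging model and only illustrates the identity. As is standard for EM, the sum of the two terms equals the log-likelihood exactly at θ = θ_old and is a lower bound for other θ.

```python
import math

# Hypothetical p(x | z) for the single observed x, for z = 0 and z = 1.
LIKELIHOOD = {0: 0.2, 1: 0.7}

def joint(z, theta):
    """p(x, z | theta), where theta is the prior probability that z = 1."""
    prior = theta if z == 1 else 1.0 - theta
    return prior * LIKELIHOOD[z]

def log_likelihood(theta):
    return math.log(joint(0, theta) + joint(1, theta))

def posterior(theta_old):
    """E-step: p(z | x, theta_old) for z = 0, 1."""
    evidence = joint(0, theta_old) + joint(1, theta_old)
    return [joint(z, theta_old) / evidence for z in (0, 1)]

def auxiliary_q(theta, theta_old):
    """Expected complete-data log-likelihood under the old posterior."""
    post = posterior(theta_old)
    return sum(post[z] * math.log(joint(z, theta)) for z in (0, 1))

def entropy(theta_old):
    return -sum(p * math.log(p) for p in posterior(theta_old))

def lower_bound(theta, theta_old):
    return auxiliary_q(theta, theta_old) + entropy(theta_old)

theta_old = 0.3
theta_new = posterior(theta_old)[1]   # closed-form maximizer of auxiliary_q in this toy model
```

Here `lower_bound(theta_old, theta_old)` matches `log_likelihood(theta_old)`, and moving to `theta_new` increases the log-likelihood, which is the guarantee that each EM iteration relies on.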
The 𝒬 auxiliary function can be expressed as

$$\mathcal{Q}(\theta, \theta_{\text{old}}) = \big\langle \log p(b_{1:T}, a_{1:T}, o_{1:T}, s_{1:T} \mid \theta, \phi) \big\rangle_{P(b_{1:T} \mid a_{1:T}, o_{1:T}, s_{1:T}, \theta_{\text{old}}, \phi_{\text{old}})} \qquad [21]$$

where φ are the parameters in the experimental setup that determine the world dynamics. Since φ are fixed in the experiment and known in the analysis, they do not affect the model likelihood.
The complete data likelihood p(b_{1:T}, a_{1:T}, o_{1:T}, s_{1:T} | θ, φ) can be factorized into transition probabilities and policies at each time due to the Markov property. We can therefore decompose the expected complete data log likelihood 𝒬(θ, θ_old) using the graphical model structure, as described in [10]-[14], except now the posterior distribution over beliefs is based on the previous iteration's parameters:

$$\begin{aligned}
\mathcal{Q}(\theta, \theta_{\text{old}}) ={}& \big\langle \log P(b_1, o_1, s_1 \mid \theta, \phi) \big\rangle && [22] \\
&+ \sum_t \big\langle \log \pi(a_t \mid b_t, \theta) \big\rangle && [23] \\
&+ \cdots && [24]\text{-}[26]
\end{aligned}$$

Here ⟨·⟩ denotes the expectation under the posterior from the previous iteration, as in [21]. Instead of solving for the optimal θ in closed form, we use gradient descent to update the parameter θ in the M-step. With fixed parameters θ_old from the previous iteration, the entropy of the latent state H(P(b_{1:T} | a_{1:T}, o_{1:T}, θ_old)) is fixed. As a result, we only need to update parameter θ to maximize the function 𝒬(θ, θ_old) in the M-step. The first term in [22] reflects the initial belief distribution, and it has a negligible contribution to 𝒬 when there are many time points t. In [24], the transition probability P(b_{t+1} | b_t, a_{t+1}, o_t, θ) is a function of the dynamics parameters, while in [23], the policy term π(a_t | b_t, θ) is a function of both the dynamics parameters and the rewards. Since the transition probability is a matrix whose elements are functions of the dynamics parameters, the gradients can be taken element-wise. We will show how the gradient of the policy function can be derived based on the Q value function in the next part.