ABSTRACT
Progress in understanding how individual animals learn will require not only high-throughput, standardized methods for behavioral training but also advances in the analysis of the resulting behavioral data. In the course of training with multiple trials, an animal may change its behavior abruptly, and capturing such events calls for a trial-by-trial analysis of the animal’s strategy. To address this challenge, we developed an integrated platform for automated animal training and analysis of behavioral data. A low-cost and space-efficient apparatus serves to train entire cohorts of mice on a decision-making task under identical conditions. A generalized linear model (GLM) analyzes each animal’s performance at single-trial resolution. This model infers the momentary decision-making strategy and can predict the animal’s choice on each trial with an accuracy of ~80%. We also introduce automated software to assess the animal’s detailed trajectories and body poses within the apparatus. Unsupervised analysis of these features revealed unusual trajectories that represent hesitation in the response. This integrated hardware/software platform promises to accelerate the understanding of animal learning.
INTRODUCTION
Learning – the change of neural representation and behavior that results from past experience and the consequences of actions – is important for animals to survive and forms a central topic in neuroscience1. Different individuals may apply different strategies to the learning process, reflecting their individual personalities. Indeed, substantial differences in sensory biases, locomotion, motivation, and cognitive competence have been observed in populations of fruit flies2,3, rodents and primates4–6. Thus, it is critical to investigate learning at the individual level.
Rodents, especially the mouse, have become popular experimental animals in studying associative learning and decision-making, because of the wide availability of transgenic resources7–10. They can learn to perform complex decision-making tasks that probe cognitive components such as working memory and selective attention11–13. However, differences in learning strategies across individuals have rarely been addressed, partly owing to the limitations of data gathering and analysis.
Studying differences among individuals requires training and collecting data from multiple animals in a standardized and high-throughput fashion. The training procedures are often time-consuming, requiring several days to many weeks8,9, depending on the task. Although there have been advances in training automation, existing systems either require an experimenter to move animals from the home cage to the training apparatus14–16, or train animals within their own cages17–19. The former introduces additional sources of variability20,21, and the latter precludes tasks that require a large training arena. Following data acquisition, the analysis of behavior aims at understanding the learning process. Present approaches tend to focus on the performance averaged over many trials22. However, changes in behavior may happen from one trial to the next, and thus the modeling of behavior should similarly offer single-trial resolution to assess each animal’s individual approach to learning.
To address these challenges, we present Mouse Academy, an integrated platform for automated training of group-housed mice and analysis of behavioral changes in learning a decision-making task. We designed hardware that uses implanted radio frequency identification (RFID) chips to identify each mouse and guide the animal into a behavior training box. Synchronized video recordings and decision-making sequences are acquired as the animal learns. To analyze the decision-making sequences, we developed an iterative generalized linear model (GLM). This model makes a prediction of the animal’s choice on each trial and is updated based on the animal’s actual choice. The iterative GLM achieves a prediction accuracy of ~80%, and also reveals the decision-making strategy of the animal and how it changes over time. To analyze the animal’s behavior during the task in greater detail, we developed automated software that tracks the animal in video recordings and extracts its location and body pose using deep convolutional neural networks (CNNs). These features allowed us to perform an unsupervised analysis of each animal’s behavior, and discover individual traits of behavioral learning that were not apparent from the simple choice sequences.
RESULTS
The Mouse Academy platform consists of three components (Fig. 1): an automated RFID sorting and animal training system, an iterative GLM to analyze decision-making sequences, and behavior assessment software that extracts animal trajectories from video data.
Automated RFID sorting supports individual training programs
We designed the equipment in the following manner (Fig. 1a): RFID-tagged mice are grouped in a common home cage where food and bedding are supplied. The home cage connects to a behavior training box through a gated tunnel. The gates are controlled by a home-made RFID animal sorting system23: three RFID antennas are placed along the tunnel, with one near the home cage, one near the training box and one between the two; the motorized gates are placed between the RFID sensors, separating the tunnel into three compartments. An Arduino microcontroller integrates information from the RFID readers to open and shut the gates, allowing only one animal at a time to pass through the tunnel (Supplementary Figs. 1a, 1c, 1d and Supplementary Videos 1-4). The behavior box is outfitted with three ports, each of which contains a photo-transistor to detect snout entry, a solenoid valve to deliver water reward, and a light emitting diode (LED) to present visual cues. To maintain a controlled environment, the training box is isolated from the outside by a light- and sound-proof chamber (Supplementary Fig. 1b).
Once a mouse enters the training box, a training protocol is run to teach it a given task. In the experiments reported here, the animal must nose-poke the center port to initialize a trial and then hold the position for a short period. Visual or auditory stimuli are delivered, and based on these stimuli, the animal must choose to poke one of the side ports. If the correct response is chosen, the animal receives a water reward from a lick tube in the response port; otherwise a timeout punishment is applied. This training process is controlled by Bpod, an Arduino microcontroller that interfaces with the three ports. Data from the response ports as well as video recordings from an overhead camera are acquired simultaneously as the animal is trained.
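For illustration, such a trial can be described as an ordered list of states together with the events that end them. The sketch below is a plain-Python schematic (our own description, not the Bpod protocol code), with timings taken from the Online methods:

```python
# Illustrative description of one visual-task trial as a sequence of states
# (schematic only, not the Bpod API; timings from the Online methods).
visual_trial_states = [
    {"name": "wait_for_init", "exit_on": "center_poke"},
    {"name": "hold_center",   "duration_s": 0.1, "abort_on": "early_withdrawal"},
    {"name": "stimulus",      "action": "reward_side_LED_on", "exit_on": "center_withdrawal"},
    {"name": "decision",      "duration_s": 10.0, "exit_on": "side_poke"},
    {"name": "reward",        "condition": "correct_port", "action": "open_valve_3uL"},
    {"name": "timeout",       "condition": "incorrect_port", "duration_s": 5.0},
]
```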
The entire apparatus is orchestrated by a master program that coordinates the RFID sorting device, the Bpod system, synchronized video recording, data management and logging (Supplementary Fig. 1e). The program monitors the amount of water each animal consumes per day and regulates the time each animal can spend in the training box per session. In addition, the software updates the training protocol for each animal based on its performance, for example switching to a harder task once a simpler one has been mastered (Supplementary Fig. 2). This lets each animal learn at its own pace.
The apparatus can be assembled at a materials cost of $1500-2500, with the cheaper option using a Raspberry Pi computer as the controller (Supplementary Fig. 1f and Supplementary Table 1). Compared with designs in which each animal is automatically trained in its own home cage15,17, the system saves considerable space. Because housing and training are independent modules, the same system can be used for diverse training environments.
We tested the automated RFID sorting and animal training system by training group-housed mice to learn a variety of decision tasks, following similar procedures as reported previously11,12 (Supplementary Fig. 2 and Online methods). The training period lasted 28 days, with up to five mice in the common home cage. Each animal occupied the training box for 3-4 hours per day (13-15% of the 24 hours) throughout the entire training period (Figs. 2a, 2b and Supplementary Fig. 3). For a sample cohort of four animals trained in sessions of 90 trials each, we found that the behavior box was occupied most of the time, with brief empty intervals of <10 min (Figs. 2c, 2d and 2e). Each animal was trained for over 900 trials (10 sessions), and consumed more than 1.9 mL of water per day (Fig. 2f). Interestingly there was no circadian pattern to the animals’ training activity, even though the setup was illuminated on a daily light cycle (12 h on / 12 h off) (Fig. 2g). As observed previously, it appears that animals working for a goal can avoid circadian modulation of the locomotor pattern24,25.
A generalized linear model accurately predicts decision-making during training
In a decision-making task, an animal is asked to associate distinct stimuli with distinct responses. Although this is the ultimate goal, during learning, it is often observed that the animal begins by basing its decisions on unrelated input variables and gradually switches to using the stimulus variables that actually predict reward. We define a policy as a mapping of these variables to the animal’s decisions. A fundamental goal in the study of learning is to infer what policy the animal follows at any given time and to determine how the policy evolves with experience.
We applied a generalized linear model (GLM) to map factors relevant to the animal’s decision-making to its choices through logistic regression. A common way to build such a GLM is by fitting data of an entire session16,26. However, this loses single-trial resolution within the session. During learning, a change of policy can happen on any trial. Thus, we developed the model to make trial-by-trial choice predictions based on various factors the animal might plausibly use. The model works in an iterative two-step process (Fig. 1b). In the prediction step, the model makes a prediction for the next decision based on the input factors. Once the outcome of the animal’s decision is observed, an error term between the model’s prediction and the observation is computed. This error, after weighting by a reward factor and a temporal discount factor, is fed back to the loss function. In the update step, the model is updated by minimizing the regularized loss function. This iteration happens after every trial. The temporal discount factor accounts for the possibility that the most recent trials impact the current decision more than remote trials. The reward factor accounts for the fact that water rewards and timeout punishments may have effects of different magnitude on the updates of the animal’s policy.
We illustrate the utility of this model by fitting results from an easy visual task, in which one of the two choice ports lights up to indicate the location of the reward, and the optimal policy is to simply poke the port with the light (Supplementary Fig. 2a, 2a’ and 2a’’). All the mice eventually reached a >83% performance level, comparable to what mice achieve in similar tasks19,27. The GLM makes a prediction for the outcome of each trial based on a weighted combination of several input variables: the current visual stimulus, a constant bias term, and three terms representing the history of previous trials (Fig. 3a). These inputs from a previous trial include the port choice, whether that choice was rewarded, and a term indicating the multiplicative interaction between the choice and reward (Choice x Reward). This term supports a strategy called win-stay-lose-switch (WSLS), which chooses the same port if it was rewarded previously and the opposite one if not. Since a GLM cannot multiply two inputs, we provided this interaction term explicitly. Each of the above terms has a weight coefficient that can be positive or negative. For instance, a positive weight for the visual stimulus supports turning towards the light, and a negative weight away from the light.
To determine the extent of trial history that affects the animal’s behavior, we fitted the model to the response data including history-dependent terms up to three previous trials. We found that only the immediately preceding trial had an appreciable effect on the prediction accuracy, and thus restricted further analysis to those inputs (Fig. 3b). The model also has three hyperparameters (the temporal discount factor α, the reward factor r, and the regularization factor λ), and we optimized them for each animal by grid search. We found that each animal had a different set of hyperparameters, reflecting differences in the learning process across individuals (Fig. 3c). Among the four sample mice, Animal 2 had the lowest temporal discount factor, suggesting that it weighed recent trials more heavily and updated the policy more quickly. Indeed, this is the animal that learned the fastest among the four (Fig. 3d).
Predictions from the iterative GLM matched ~80% of the animals’ actual choices (Fig. 3f), and the predicted accuracy of each animal captured the actual fluctuations of its learning curve (Fig. 3e and Supplementary Fig. 4). We compared the performance of the GLM with two other modeling approaches (Online methods). The first model was fit to the animal’s average performance in the task; its trial-by-trial match of the animal’s actual choices was only ~59% (Fig. 3g). The second model was a logistic regression fitted to data in a sliding window of N trials. This sliding window model performed worse than the iterative GLM when the window size was small (N = 20 and 30 trials, Fig. 3h); for larger windows the performance was comparable. Overall, the iterative model is advantageous because it makes predictions online as every trial occurs and adapts dynamically to the growing data set.
Individual learning policies can be inferred from iterative GLM fitting
The iterative GLM serves to infer what policy the animal follows in making decisions. The linear weight of each input term reflects its relative importance for the decision. By following this weight vector across trials one obtains a policy matrix that documents how the animal’s policy changes during learning (Figs. 1b and 4c). To test that the model can correctly capture a time-varying policy, we simulated decision-making data from a ground truth policy that changed at a certain frequency, including a certain level of noise in the behavioral output (Fig. 4a). Over a wide range of policy change frequencies and noise levels, the GLM was able to capture the ground truth policy (Figs. 4a and 4b). In addition, different values of policy change frequency and noise levels led to different sets of hyperparameters fitted from the model, showing that the GLM can adapt to individuals with diverse learning characteristics (Supplementary Figs. 5a-e).
We then recovered the policy matrix of each animal from the GLM fits. All four animals started with the non-optimal policy of WSLS. Subsequently each animal followed its own learning process (Fig. 4c): Animal 2 had a clear bias towards the right port at the beginning but it rapidly found the optimal policy of following the light. The other three animals were slower learners. Animal 3 and Animal 4 followed similar processes to converge to the optimal policy. Animal 1 was distinct from the others. At the early stages, it had a strong bias towards the left port and it made decisions based on whether the previous choice was rewarded.
We further validated the transition between policies during learning by analyzing the first and last sessions of each animal and counting how many choices could be explained by each policy (Fig. 4d). Indeed, we found a clear switch from the (non-optimal) WSLS policy to the (optimal) stimulus-based policy (Fig. 4e and Supplementary Fig. 5f). The animals might have been biased towards the WSLS strategy by a shaping method we used during training, which offered the animal a repeat of the same stimulus every time it made a mistake (Online methods). To test whether these correlations in the trial sequence influenced the final policy we performed two additional analyses. First, we only included trials following a correct trial, and performed logistic regression on these trials for each session. This analysis showed that at least on these trials, all the animals based their decisions on the light stimulus by the end of learning (Supplementary Fig. 6a). Second, we compared the error rate on trials following an incorrect choice with that following a correct one. We found no significant difference between the two error rates during the last session (Supplementary Figs. 6b and 6c), suggesting that the animals treated these two types of trials identically.
Automated movement tracking reveals fine structure of behavioral responses
Thus far the report has focused on the animal’s responses only as sensed by the nose pokes into response ports. The GLM fits of those responses already revealed differences in policy across individuals. To gain further insight into these individual preferences, it is essential to track each animal’s behavior along the way from stimuli to responses10. We thus developed software that uses deep learning to automatically, quantitatively and accurately assess each animal’s behavior during decision-making (Figs. 1c and 5a).
To track the animal location and body coordinates, we recorded videos of the animal from above, and analyzed them with a sequence of two deep convolutional neural networks (CNNs) that were pre-trained on annotated pose data (Fig. 5a and Supplementary Videos 5-7). The first CNN was based on the multi-scale convolutional (MSC) Multibox detector28 (Supplementary Fig. 7a). For each frame of the video it computed a crop frame around the body of the mouse. The second CNN was a Stacked Hourglass Network29 that used the cropped video frame to locate seven body landmarks: the nose, the ears, the neck, the body sides and the tail of the animal (Supplementary Figs. 7a and 7b). These landmarks allowed precise identification of the animal’s position and body pose (Supplementary Fig. 7c), from which we further extracted two features: the body centroid (average position of the seven landmarks) and the orientation (angle of the line connecting the centroid and the nose).
To illustrate use of these behavioral trajectories, we focus on the period of the visual choice task where the animal reports its decision: from the time it leaves the center port to when it pokes one of the side ports. The trials fall into four groups based on the location of the stimulus and the response. As expected, the trajectories of position and orientation clearly distinguish left from right choices (Figs. 5b and 5d). Interestingly, the trajectories also reveal whether the decision was correct: on incorrect decisions, the trajectories reversed direction after ~0.5 s, because the animal quickly turned back to the center after finding no reward in the chosen port (Figs. 5c, 5e and Supplementary Video 5). A linear kernel support vector machine (SVM), trained to predict the category of each trial from a 1 s trajectory, was able to correctly distinguish correct and incorrect choices with an accuracy of over 90% (Supplementary Fig. 8). In addition, many of the trajectories were highly asymmetric and again revealed differences across individuals. For instance, Animal 2 and Animal 4 started from a location close to the right port, whereas Animal 1 started closer to the left port (Fig. 5c). This asymmetry correlates with the bias revealed by the iterative GLM: each animal prefers to select the port closer to its body location.
Unsupervised behavioral analysis reveals moments of hesitation
Whereas the supervised learning discussed above relies on prior classification of stimuli and responses, an unsupervised analysis has the potential to discover unexpected structures in the animal’s behavior30. We thus performed an unsupervised classification of the behavioral trajectories.
After subjecting all the trajectories of a given animal to principal component analysis (PCA) we projected the data onto the top three components, which explained over 95% of the variance (Figs. 6a, 6b and Supplementary Fig. 9a). Importantly, without any labels from trial types, these three PCs captured meaningful features that differentiated the animal’s responses. The first PC separated movements to the left from those to the right (Figs. 6a and 6b). The third PC captured the turning-back behavior after an incorrect choice (Fig. 6b and Supplementary Fig. 9a). The second PC captured different baseline positions (Fig. 6b). Each animal has its own preference for a baseline position somewhere off the midline of the chamber (Supplementary Fig. 9b).
We also projected the trajectories into 2 dimensions using a non-linear embedding method, t-distributed stochastic neighbor embedding31,32 (t-SNE). Unlike PCA, this embedding prioritizes the preservation of local structure within the data over global structure32. In the t-SNE space the trajectories formed clear clusters (Fig. 6c). Most of the clusters are dominated by one of the decision categories (Fig. 6c and Supplementary Fig. 9c). Interestingly, we found clusters in Animals 2, 3, and 4, in which the centroid trajectories were flat, unlike the trajectories of the four decision categories (Fig. 6d), suggesting that animals hesitated in these trials and made decisions only after a delay. Indeed, in trials flagged by these clusters, the animals had longer reaction times (Fig. 6e). Furthermore, such hesitating responses were more common following an incorrect trial (Fig. 6f); they may reflect a behavioral adjustment to prevent further mistakes33.
DISCUSSION
Despite the fact that rodents can be trained to perform interesting decision-making tasks7–10, the learning progress of individual animals has rarely been addressed. Doing so requires training and observing many animals in parallel under identical conditions, and the ability to analyze the decision policy of each animal on a trial-by-trial basis. To meet these demands, we developed Mouse Academy, an integrated platform for automated training and behavior analysis of individual animals.
We demonstrate here that Mouse Academy can train group-housed mice in an automated and highly efficient manner while simultaneously acquiring decision-making sequences and video recordings. Automated animal training has been of great interest in recent years and efforts have focused on two directions. In one design, multiple animals are trained in parallel within stacks of training boxes. This requires a technician to transfer animals from their home cages to the behavior boxes14–16. Such animal handling has been reported to introduce additional variability20,21, and even the mere presence of an experimenter can influence behavioral outcomes34. Thus, eliminating the requirement for human intervention, as in Mouse Academy, likely reduces experimental variation. In another design, a training setup is incorporated within the animals’ home cage17–19. By contrast, Mouse Academy separates the functions of housing and training, and that modular design allows easy adaptation to a different purpose. For instance, one can replace the 3-port discrimination box with a maze to study spatial navigation learning35,36, or with an apparatus for training under voluntary head-fixation37. In each case, a single training apparatus can serve many mice, potentially from multiple home cages.
To understand how an animal’s decision-making policies change in the course of learning, we developed a trial-by-trial iterative GLM. The evolution of the model is similar to online machine learning38 in which the data are streamed in sequentially, rather than in batch mode. The linear nature of the model supports a straightforward definition of the animal’s decision policy, namely as the vector of weights associated with different input variables. In addition, the simple linear structure allows rapid execution of the algorithm, which favors its use in real-time closed-loop behavior experiments. The model also allows several parametric adjustments. One specifies how much the recent trials are weighted over more distant ones in shaping the animal’s policy. Another rates the relative influence of reward versus punishments. Fitting these parameters to each animal already revealed differences in learning style. This model can have a broader use beyond mouse decision-making, for instance to track the progress of human learners from their answers to a series of quizzes39.
Finally we presented software for automated assessment of behaviors based on video recordings within Mouse Academy. Largely extending existing methods, the software uses deep convolutional neural networks for animal tracking and pose estimation. Because the animal’s movements are unconstrained, we performed the tracking in two stages: the first finds the animal within the video frame and the second locates the body landmarks. Compared to a single-shot procedure, this split approach requires fewer learning examples and less computation in the second stage38. The resulting behavioral trajectories can reveal intricate aspects of the animal’s decision process that are hidden from a mere record of the binary choices. The large data volume again calls for automated analysis, and both supervised machine learning methods30,40,41 and unsupervised classification30–32,42 have been employed for this purpose. Unsupervised analysis is not constrained by class labels, and can identify hidden structure in the data in an unbiased manner. In the present case, we discovered a motif wherein the animal hesitates on certain trials before taking action.
Mouse Academy can be combined with chronic wireless recording43,44, to allow synchronized data acquisition of neural responses. Researchers can seek correlations between neural activity and the policy matrix or even the behavioral trajectories. This will open the door to a mechanistic understanding of how neural representations and dynamics change in the course of animal learning.
ONLINE METHODS
Animals
Subjects were C57BL/6J male mice aged 8-12 weeks. All experiments were conducted in accordance with protocols approved by the Institutional Animal Care and Use Committee of the California Institute of Technology.
Hardware setup
The hardware setup comprises a behavioral training box, an engineered home cage, and a radio frequency identification (RFID) sorting system, which allows animals to move between the home cage and the training box. These components are coordinated by customized software.
The design file for the behavior box was modified from that of Sanworks LLC (https://github.com/sanworks/Bpod-CAD) using SolidWorks computer-aided design software, and the customized behavioral training box was manufactured in the lab. The behavior box is controlled by a Bpod state machine (r0.8, Sanworks LLC). To monitor the animal’s behavior, an IR webcam (Ailipu Technology or OpenMV Camera M7) is installed above the behavior box. The behavior box and the webcam are placed within a light- and sound-proof chamber. The chamber is made of particle board with walls covered by acoustic foam. A tunnel made of red plastic tubes connects the behavior box to a home cage (Supplementary Fig. 1b).
For the RFID access control system, an Arduino Mega 2560 microcontroller is connected with three RFID readers (ID-12LA, Sparkfun) with custom antenna coils spaced along the access tunnel. The microcontroller controls two generic servo motors fitted with plastic gates to grant individual access to the training box (Supplementary Fig. 1a).
The microcontroller identifies each animal by its implanted RFID chip and permits only one animal to go through the tunnel connecting the home cage and the behavioral training box (Supplementary Fig. 1c). It also communicates the animal’s identity to a master program running on a PC or Raspberry Pi (Matlab or Python). The master program coordinates the following programs: Bpod (https://github.com/sanworks/Bpod), synchronized video recording, data management and logging. A repository containing the design files, the firmware code for the microcontroller, and the software can be found at https://github.com/muqiao0626/Mouse_Academy.
Behavior training
The procedures for training mice to perform a selective attention task are similar to those previously reported11,12. Mice were water-restricted for seven days before training, and habituated in the automated training system, where they could collect reward freely for several sessions. The mice were then trained in sessions of 90 trials each to collect water rewards by performing two-alternative forced-choice tasks. Briefly, the animal had to nose-poke one of two choice ports based on the presented stimuli. If the decision was correct, 10% sucrose-sweetened water (3 μL) was delivered to the animal. For incorrect responses, the animal was punished with a five-second timeout. Following an incorrect response, the animal was presented with the identical trial again; this simple shaping procedure helps counteract biases in the behavior.
Over 28 days of training the animals learned increasingly complex tasks, from visual discrimination to a two-modality cued attention switching task11,12. The training progressed through six stages (Supplementary Fig. 2):
A simple visual task: In this task, the animal initiates a trial by poking the center port and holding the position for 100 ms. Then either the left or the right side port lights up briefly, until the animal moves away from the center port. The animal must then poke one of the two side ports within the decision period of 10 s. Choice of the port flagged by the light leads to a water reward, and choice of the other port leads to a time-out period during which no trials can be initiated. Data presented in the main text are from this stage of training only.
A simple auditory task: As in stage 1, except that the stimulus was a white-noise sound played from either the left or the right side to flag the reward port.
A cued single-modality (visual or auditory) switching task: Trials were presented in blocks of 15, each block consisting of single-modality (visual or auditory) stimulus presentation. Each block was like stage 1 or 2, except that the trial type was indicated by a 7 kHz (visual) or 18 kHz (auditory) pure tone.
A cued single- and double-modality switching task: Like stage 3, but distracting trials were introduced in which both visual and auditory stimuli were present, but only one of the modalities was relevant to the decision. The relevant modality was again indicated by the pure tone cues. In repeating blocks, four types of trials were presented: a. five visual-only trials; b. ten ‘attend to vision’ trials with auditory distractors; c. five auditory-only trials; d. ten ‘attend to audition’ trials with visual distractors. During the training, the time that the animal had to hold in the center port was gradually increased to 0.5 s, and the duration of the stimuli was gradually shortened to 0.2 s.
A cued double-modality switching task: Like stage 4 except that the single-modality trials were removed, and the block length was gradually shortened to three trials.
A selective attention task: Like stage 5, but the block structure was abandoned and all eight possible trial types were randomized: (audition vs vision) x (sound left or right) x (light left or right).
Iterative generalized linear model
We modeled the animal’s choice probability by a logistic regression. At each trial number t, the choice probability is defined as

$$P(y_t \mid \mathbf{x}_t) = \frac{1}{1 + \exp\!\left(-y_t\,\mathbf{w}_{t-1}^{\top}\mathbf{x}_t\right)} \qquad (1)$$

where yt indicates the binary choice of the animal (1 = right, –1 = left), xt is the vector of input factors on trial t, and wt-1 is the vector of weights for these factors obtained from fitting up to the preceding trial.
The prediction for the animal’s choice is simply the one with the higher model probability:

$$\hat{y}_t = \operatorname{sign}\!\left(\mathbf{w}_{t-1}^{\top}\mathbf{x}_t\right) \qquad (2)$$
After observing the animal’s actual choice zt, the cross-entropy error Et between the observation and model prediction is calculated as

$$E_t = -\,\frac{1+z_t}{2}\,\log P(y_t = +1 \mid \mathbf{x}_t) \;-\; \frac{1-z_t}{2}\,\log P(y_t = -1 \mid \mathbf{x}_t) \qquad (3)$$
We weight the error term by a reward factor Rt, and apply exponential temporal smoothing to get the loss function Lt:

$$L_t = \sum_{\tau \le t} \alpha^{\,t-\tau}\, R_\tau\, E_\tau \qquad (4)$$

where α is the smoothing discount factor accounting for the effect that distant trials have less impact on decision-making than immediately preceding trials, and Rt is defined as

$$R_\tau = \begin{cases} 1, & \text{if trial } \tau \text{ was rewarded} \\ r, & \text{if trial } \tau \text{ was unrewarded (timeout)} \end{cases} \qquad (5)$$
The values of Rt for rewarded and unrewarded trials may be different, accounting for the fact that rewards and punishments may have different effects on learning. For each time point, the weights in the model are determined by minimizing the loss function subject to L1 (lasso) regularization, namely

$$\mathbf{w}_t = \operatorname*{argmin}_{\mathbf{w}} \left( L_t + \lambda \,\lVert \mathbf{w} \rVert_1 \right) \qquad (6)$$
Then wt is used for prediction of the next trial. For subsequent analysis, we only used predictions starting at the 15th trial. The three hyperparameters for the temporal discount factor α, the reward factor r, and the regularization factor λ were selected by grid search.
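A minimal Python sketch of this iterative procedure is given below, substituting scikit-learn’s L1-penalized logistic regression for the exact optimizer (the parameter C plays the role of 1/λ); the function name and default hyperparameter values are illustrative, not those of the released code, and the input vectors xt are constructed as described in the following paragraph.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_iterative_glm(X, choices, rewards, alpha=0.9, r=0.5, lam=1.0, start=15):
    """Trial-by-trial GLM sketch (Eqns 1-6). X: (T, d) matrix of input vectors x_t;
    choices and rewards coded as +/-1 per trial. Returns choice predictions from
    trial `start` onward and the policy matrix of fitted weights. Assumes both
    choices occur within the first `start` trials."""
    T, d = X.shape
    preds, policy = [], []
    for t in range(start, T):
        # Update step (Eqns 3-6): refit on trials 0..t-1, weighting each trial's
        # error by the reward factor and the temporal discount factor.
        sw = alpha ** (t - 1 - np.arange(t)) * np.where(rewards[:t] == 1, 1.0, r)
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear",
                                 fit_intercept=False)
        clf.fit(X[:t], choices[:t], sample_weight=sw)
        w = clf.coef_.ravel()
        policy.append(w.copy())
        # Prediction step (Eqn 2): choice with the higher model probability.
        preds.append(1.0 if X[t] @ w >= 0 else -1.0)
    return np.array(preds), np.array(policy)
```

In such a sketch, the grid search over (α, r, λ) amounts to rerunning the fit for each candidate triple and keeping the combination that yields the highest prediction accuracy.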
To fit the decision-making sequences of the simple visual task, we included the following terms in the input vector xt (a sketch of how the vector is assembled is given after the list):
1. Visual_stimulus: +1 = light on right, −1 = light on left.
2. Bias: A constant value of +1. The associated weight determines whether the animal favors the left (negative) or the right (positive) port.
3. Choice_back_n: The choice the animal made n trials ago (+1 = right, −1 = left).
4. Reward_back_n: The reward the animal received n trials ago (+1 = reward, −1 = punishment).
5. Choice x Reward_back_n: The product of terms 3 and 4. This term corresponds to the win-stay-lose-switch (WSLS) strategy of repeating the last choice if it was rewarded and switching if it was punished.
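As an illustration of how these terms might be assembled for one trial (the helper name is ours; stimuli, choices and rewards are assumed to be coded as ±1 per trial):

```python
import numpy as np

def build_inputs(stimulus, choices, rewards, t):
    """Input vector x_t for the simple visual task (terms 1-5, history n = 1)."""
    return np.array([
        stimulus[t],                      # 1. Visual_stimulus on the current trial
        1.0,                              # 2. Bias (constant)
        choices[t - 1],                   # 3. Choice_back_1
        rewards[t - 1],                   # 4. Reward_back_1
        choices[t - 1] * rewards[t - 1],  # 5. Choice x Reward_back_1 (WSLS term)
    ])
```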
To determine the extent of history-dependence of the animal’s decisions, we fitted the model including terms 3-5 from up to three previous trials (n = 1, 2, 3), and found that only the immediately preceding trial had an appreciable effect on the model’s prediction accuracy. For the subsequent analysis, we therefore included terms 3-5 for the preceding trial (n = 1).
We compared the iterative generalized linear model (GLM) with two other models. The first only captures the animal’s average performance over all trials. If the fraction of correct responses is z, then the model simply predicts a correct response with probability z, and an error with probability 1 – z. Thus, the fraction of trials where the prediction matches the observation is $z^2 + (1 - z)^2$.
The second model is a sliding window logistic regression. To make a prediction for trial t, we fitted the logistic model presented above (Eqns 1–2) to the preceding n trials. The loss function is

$$L_t = \sum_{\tau = t-n}^{t-1} E_\tau \qquad (7)$$

and the weights are again optimized as in Eqn 6.
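A sketch of both baselines, again using scikit-learn as a stand-in for the exact optimizer (function names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def matched_fraction_average_model(z):
    """First baseline: expected fraction of matched trials for a model that only
    knows the overall fraction of correct responses z."""
    return z ** 2 + (1 - z) ** 2

def sliding_window_predictions(X, choices, n=30, lam=1.0):
    """Second baseline (Eqn 7): refit an L1-regularized logistic regression to the
    n preceding trials and predict trial t."""
    preds = np.full(len(choices), np.nan)
    for t in range(n, len(choices)):
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear",
                                 fit_intercept=False)
        clf.fit(X[t - n:t], choices[t - n:t])
        preds[t] = clf.predict(X[t:t + 1])[0]
    return preds
```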
Recovering policy matrices from simulated data
To test the model’s ability to recover policy matrices, we trained the model on data generated from pre-defined ground truth policies. The ground truth policies changed every 10, 30, or 90 trials. Binary choices were simulated at different noise levels using an ‘epsilon-greedy’ rule: with probability epsilon the simulator made a random choice, and with probability 1 − epsilon it chose the action indicated by the ground truth policy. The noise levels (epsilon values) ranged from 0 to 0.6. The similarity between the recovered policy and the ground truth policy was evaluated by the cosine similarity between the recovered weight vector and the ground truth weight vector.
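A minimal sketch of the simulation and of the similarity measure (variable names and the random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_choices(X, true_weights, epsilon=0.2):
    """Epsilon-greedy simulator: with probability epsilon pick a random port,
    otherwise follow the ground truth policy (sign of the linear score).
    X, true_weights: (T, d) arrays; the weights may change every few trials."""
    greedy = np.sign(np.einsum("td,td->t", X, true_weights))
    greedy[greedy == 0] = 1.0                      # break exact ties arbitrarily
    random_choice = rng.choice([-1.0, 1.0], size=len(X))
    explore = rng.random(len(X)) < epsilon
    return np.where(explore, random_choice, greedy)

def policy_similarity(w_recovered, w_true):
    """Cosine similarity between recovered and ground truth weight vectors."""
    return (w_recovered @ w_true) / (np.linalg.norm(w_recovered) * np.linalg.norm(w_true))
```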
Automated behavior assessment software
In this section we describe the software we developed to track the mouse. It consists of two parts, mouse detection and pose estimation, each implemented by a deep convolutional neural network (CNN) trained on annotated video data (Supplementary Fig. 7a). We collected a set of videos using red or IR light illumination from the top of the arena. From these videos we randomly extracted 15,000 frames and asked Amazon Mechanical Turk (AMT) workers to click on body landmarks that represent the skeleton of the mouse (Supplementary Fig. 7b).
Mouse detection
The architecture used for detection is the multi-scale convolutional (MSC) Multibox network28, which computes a list of bounding boxes along with a single confidence score for each box, corresponding to its likelihood of containing the object of interest, in this case a mouse. Each bounding box is encoded by 4 scalars, representing the coordinates of its upper-left and lower-right corners. The coordinates of each box are normalized with respect to the image dimensions to deal with different image sizes, and the associated confidence score is encoded by an additional node (which outputs a value from 0 to 1). The loss function is the weighted sum of two losses: confidence and location28. We trained the MSC-Multibox deep CNN to predict bounding boxes that are spatially closest to the ground truth boxes while maximizing the confidence of containing the mouse.
To achieve better and faster detection, we used prior bounding boxes whose aspect ratios closely match the distribution of the ground truth. As proposed previously by Erhan et al.28, these priors were selected because their Intersection over Union (IoU) with respect to the ground truth was over 0.5. In the matching process, each ground truth box is matched to its best prior, and the algorithm learns a residual between the two. At inference time, 100 bounding boxes are proposed, from which the best one is selected based on the highest score and non-maximum suppression.
We split the dataset into 12,750 frames for training, 750 for validation and 1,500 for testing. During training, we augmented data with random cropping and color variation. Using the Inception-ResNet-V2 architecture initialized with ImageNet pre-trained weights45, we fine-tuned the network on our training samples by updating the weights with stochastic gradient descent. For the optimizer, we used RMSProp, with the batch size set to 4, the initial learning rate set to 0.01, and the momentum and the decay both set to 0.9946. Images were resized to 299 × 299. We trained the detector on a machine with an 8-core Intel i7-6700K CPU, 32 GB of RAM, and an 8 GB GTX 1080 GPU. The model was trained for 288k iterations. A single forward pass took on average 15 ms.
We evaluated the models using the Intersection over Union (IoU) detection metric. Thresholding the IoU defines matches between the ground truth and predicted boxes and allows computing precision-recall curves. Precision-recall curves at different IoU thresholds are shown in Supplementary Fig. 7d. In Supplementary Table 2, we report mean average precision (mAP) and recall (mAR).
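For reference, a minimal implementation of the IoU used for prior matching and evaluation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2),
    with coordinates normalized to the image as in the Multibox encoding."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```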
Pose estimation
Given the bounding box generated by the MSC-Multibox deep CNN, we wish to determine the precise pixel locations of the keypoints that describe the body features of the mouse. Pose estimation is a well-established problem in computer vision; a good system must be robust to occlusion and deformation, succeed on rare and novel poses, and be invariant to changes in appearance caused by differences in lighting and background.
The keypoints we chose are the nose, the ears, the neck, the body sides, and the base of the tail (Supplementary Fig. 7b). These features were chosen because they are best recognized regardless of the size of the animal, and one can deduce from them secondary features, such as the orientation of the animal. We used a Stacked Hourglass Network29 to estimate the keypoints. This architecture has the capacity to learn all seven features and output pixel-level predictions. The output of the network is a set of heatmaps, one for each keypoint, representing the probability of the keypoint’s presence at every pixel (Supplementary Fig. 7a). We estimated the location of each keypoint by the maximum of its heatmap. A mean squared error (MSE) loss was used to compare the predicted heatmap to the ground truth.
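A minimal sketch of how keypoint locations are read out from the heatmaps (the keypoint names and their ordering are our assumption, for illustration only):

```python
import numpy as np

KEYPOINTS = ["nose", "left_ear", "right_ear", "neck",
             "left_side", "right_side", "tail_base"]  # illustrative naming/order

def keypoints_from_heatmaps(heatmaps):
    """heatmaps: (7, H, W) network output, one map per keypoint.
    Returns the (row, col) location of the maximum of each heatmap."""
    coords = {}
    for name, hm in zip(KEYPOINTS, heatmaps):
        r, c = np.unravel_index(np.argmax(hm), hm.shape)
        coords[name] = (r, c)
    return coords
```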
During training, cropped frames with the mouse centered in the bounding box were resized to a resolution of 256 × 256. We also augmented the training data as follows (p is the probability of applying each type of augmentation; a minimal sketch is given below the list):
rotation with p = 1: angles were selected uniformly between 0° and 180°
translation
horizontal and vertical flips
scaling with p = 1: scaling factors were chosen from a pool of 0.10 to 0.65, uniformly
color variation: adjusted brightness/contrast/gamma with p = 0.5 in order to emulate the effects of poor lighting/setup
Gaussian blur with p = 0.15: frames were blurred either by a σ = 1 or σ = 2 (chosen uniformly).
Gaussian Noise added independently across image with p = 0.15
JPEG artifact with p = 0.15: added artifacts of JPEG compression onto the image.
Extreme augmentations (combining multiple types of augmentation) were examined to make sure that the transformed data looked reasonable. Using original and augmented keypoint annotations, we trained a pose estimator from scratch.
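A minimal sketch of such a probability-gated augmentation pipeline using OpenCV; the parameters and structure are illustrative rather than the exact training code, and in practice the keypoint annotations must be transformed together with the frame:

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(frame):
    """Apply a subset of the augmentations listed above to one frame (sketch only)."""
    h, w = frame.shape[:2]
    # Rotation (p = 1): angle chosen uniformly in [0, 180) degrees.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(0, 180), 1.0)
    frame = cv2.warpAffine(frame, M, (w, h))   # keypoints would need the same M
    # Gaussian blur (p = 0.15): sigma of 1 or 2.
    if rng.random() < 0.15:
        frame = cv2.GaussianBlur(frame, (0, 0), float(rng.choice([1, 2])))
    # JPEG artifacts (p = 0.15): encode/decode with a low quality setting.
    if rng.random() < 0.15:
        ok, enc = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 30])
        frame = cv2.imdecode(enc, cv2.IMREAD_UNCHANGED)
    return frame
```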
Training started from randomly initialized weights and continued until validation accuracy plateaued, taking approximately 6 days (749k iterations). The network was trained using TensorFlow (Google) on a machine with an 8-core Intel Xeon CPU, 24 GB of RAM, and a 12 GB Titan XP GPU. For optimization, we used the RMSProp optimizer with momentum and decay both set to 0.99, a batch size of 8 and a learning rate of 0.00025. We dropped the learning rate once by a factor of 5 after validation accuracy plateaued (after 33 epochs). Batch normalization was used to improve training.
Evaluation was done using the standard Percentage of Correct Keypoints (PCK) metric, which reports the fraction of detections that fall within a given distance of the ground truth29. More than 85% of the keypoints for the nose, ears, and neck are inferred within an error radius of 0.5 cm, and more than 80% of the keypoints for the body sides and tail lie within an error radius of 1 cm (as a reference, the distance between the two ears is ~3 cm). The average PCK over all seven keypoints is ~80% within a radius of less than 0.5 cm (Supplementary Fig. 7c). Overall, the system’s performance can be characterized as high human-level: it significantly exceeds the typical annotator but is less precise than the absolute best possible.
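The PCK metric itself is straightforward to compute; a minimal sketch:

```python
import numpy as np

def pck(pred, truth, radius):
    """Percentage of Correct Keypoints: fraction of predictions within `radius`
    (in the same units as the coordinates, e.g. cm) of the ground truth.
    pred, truth: arrays of shape (num_examples, num_keypoints, 2)."""
    dist = np.linalg.norm(pred - truth, axis=-1)
    return float(np.mean(dist <= radius))
```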
Supervised and unsupervised analysis of behavioral trajectories
From the pose estimation, we extracted two features to describe an animal’s behavioral trajectories: the centroid, defined as the average position of the seven body landmarks, and the orientation, defined as the angle of the line from the centroid to the nose. For each trial, these two features were extracted for n frames (n = 30, i.e. 1 s, in most cases), so the data dimension for each trial is 3n (the two centroid coordinates and the orientation).
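A sketch of this feature extraction, assuming the per-frame keypoints are stacked into an array with the nose as the first keypoint (the ordering is our assumption):

```python
import numpy as np

def trial_features(landmarks):
    """landmarks: (n_frames, 7, 2) keypoint coordinates for one trial.
    Returns a 3n-dimensional vector: centroid x, centroid y, and orientation per frame."""
    centroid = landmarks.mean(axis=1)            # (n_frames, 2)
    dx, dy = (landmarks[:, 0] - centroid).T      # vector from centroid to nose
    orientation = np.arctan2(dy, dx)             # angle per frame
    return np.concatenate([centroid[:, 0], centroid[:, 1], orientation])
```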
To determine whether the behavioral trajectories contain information about the decision categories, a support vector machine (SVM) with a linear kernel was trained for each decision category. The training set was labelled with the decision category based on information about the visual stimulus and the animal’s choice (for example, “Stim: R, Choice: L” means that the light is on the right and the animal chooses the left port). Performance of the trained SVM was examined by prediction accuracy on the test set, and the F1 score, which is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
The performance was computed as the average across 10 repeated analyses (Supplementary Fig. 8).
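A sketch of this supervised analysis with scikit-learn; the one-vs-rest framing, the train/test split and the function name are illustrative choices, not necessarily those of the released scripts:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate_category(trajectories, is_category, n_repeats=10, seed=0):
    """Linear-kernel SVM for one decision category (e.g. 'Stim: R, Choice: L').
    trajectories: (n_trials, 3n) feature vectors; is_category: boolean labels."""
    accs, f1s = [], []
    for i in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            trajectories, is_category, test_size=0.2, random_state=seed + i)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        y_hat = clf.predict(X_te)
        accs.append(accuracy_score(y_te, y_hat))
        f1s.append(f1_score(y_te, y_hat))
    return np.mean(accs), np.mean(f1s)
```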
We performed non-linear embedding with t-distributed stochastic neighbor embedding (t-SNE) as previously described31,32. Briefly, the trajectory data of each trial were projected into a 2D t-SNE space. Point clouds on the t-SNE map represented candidate clusters, and density clustering identified these regions. We then plotted trajectories and reaction time distributions to confirm that the clusters were distinct from each other. A repository of the analysis scripts can be found at https://github.com/tonyzhang25/MouseAcademyBehavior.
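For orientation, a minimal sketch of these two steps, with DBSCAN standing in for the density-clustering step (parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

def embed_and_cluster(trajectories, perplexity=30, eps=2.0, min_samples=10):
    """Project 3n-dimensional trajectories into 2D with t-SNE, then find dense
    point clouds (candidate clusters) with DBSCAN."""
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(np.asarray(trajectories))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)
    return embedding, labels   # label -1 marks points outside any dense cluster
```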
AUTHOR CONTRIBUTIONS
M.Q. and M.M. designed the study; M.Q. and S.S. constructed the hardware setup and wrote the controlling software; M.Q. performed experiments and collected data for analysis; M.Q. developed the iterative generalized linear model with input from P.P. and M.M; C.S. developed the automated tracking software; T.Z. implemented animal tracking and behavioral trajectory analysis with input from M.Q., P.P. and M.M; M.Q. and M.M. wrote the manuscript with comments from all authors.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
DATA AVAILABILITY
The datasets analyzed during the current study are available in https://drive.google.com/open?id=1gkPbqGYKPGs7Rx1WNmubQW0dKyYE5YVR
ACKNOWLEDGEMENT
We thank Joshua Sanders for technical assistance in incorporating Bpod into our system. We thank Ann Kennedy for insightful comments and suggestions on analysis of the behavioral trajectories. We thank Oisin Mac Adoha and Yuxin Chen for helpful comments and discussions. This work was supported by a grant from the Simons Foundation (SCGB 543015, M.M. and P.P.) and a postdoctoral fellowship from the Swartz Foundation (M.Q.).