Better than maximum likelihood estimation of model-based and model-free learning style

Multiple decision making systems work together to shape the final choices in human behavior. Habitual and goal-directed systems are the two most important systems that are studied in the reinforcement learning (RL) literature by model-free and model-based learning methods. Human behavior resembles the weighted combination of these systems and such a combination is modeled by weighted summation of action’s value from the model based and model free systems. Extraction of this weighted parameter, which is important for many applications and computational modeling, has been mostly based on the maximum likelihood or maximum a posteriori methods. We show these methods bring many challenges and their respective extracted values are less reliable especially in the proximity of extremes values. We propose that using a free format learning method (k-nearest neighbor) which uses more information besides the fitted values e.g. global information like stay probability instead of trial by trial information can ameliorate the estimation error. The proposed method is examined by simulation and results show the advantage of the proposed method. In addition, investigation of the human behavior data from previous researchers proved the proposed method to result in more statistically robust results in predicting other behavioral indices such as the number of gaze directions toward each target. In brief, the proposed method increases the reliability of the estimated parameters and enhances the applicability of reinforcement learning paradigms in clinical trials.


Introduction
Human decision making behavior is believed to be controlled by multiple systems.
Habitual and goal-directed systems are responsible for most decisions and learnings during the human's lifetime [1,2]. While the habitual system reconstructs habits and automatic decisions, the goal-directed system is involved in planning behavior of individuals.
Modeling of the habitual and goal-directed systems has been mapped to Model Based (MB) and Model Free (MF) learnings in the literature of reinforcement learning. In MB learning, a model of environment which is the state transition probabilities are assumed to be recruited in mind and the value of each choice in the current state is calculated based on the model of the environment; on the other hand, in MF system the value of each action in each state is learned by trial and error and is updated without considering any explicit model of the environment. In both models, the choices of subjects depend on the estimated value of actions. Investigations on human beings' behavior have shown that individuals use a combination of MB and MF learnings to guide their behavior during learning tasks [2][3][4][5][6].
Computational hybrid model has proven to be the best descriptor of subjects' behavior, in which subjects run both MB and MF algorithms in parallel and make choices according to a weighted combination of the action values. Through this model, just one parameter (w) determines subject's preference towards MB and MF algorithms [7]. In many studies, this free parameter is the relative weight of the two algorithms' values that combines two algorithms' values and results in the final choice. It is usually assumed constant for each individual subject throughout the task but can vary across subjects [7][8][9][10].
The combination weight (w) is a free parameter which shows the interplay between two learning strategies, MB and MF, and is important in understanding human behavior.
Consideration of changes in this parameter due to pharmacological or cognitive manipulations or neuropsychiatric conditions will provide important insights for clinical research. Over-reliance on habits, for example, could lead to inflexible decision-making in addiction and compulsion [10,11]. Patients with obsessive compulsive disorder (OCD) show a deficit in goal-directed control and an over-reliance on habits [12]. Wit et.al. show that mild Parkinson disease leads to impaired stimulus-response habit formation [13].
Also, Culberth et.al. show that MB behavior is reduced in schizophrenic patients [14].
In a wider view, there is a growing consensus that computational modeling can also be helpful to understand psychiatric disorders. Computational models can break maladaptive types of behavior into distinct cognitive components, while the model parameters associated with the components can be used to understand the latent cognitive sources of deficits [15]. The share of habitual and goal-directed systems are among such components that can be used to assess and diagnose psychiatric disorders [16]. Therefore, estimation of the models' parameters in general, and the weight of MB learning, in particular, is important for many applications.
Reliable and precise estimation of the model's parameters is an essential step in extending the use of modeling to clinical applications. However, due to noisy behavior and confounding factors as well as low sample size, reliable estimation of parameters is a challenge especially for extreme values. In many cases, the model fitting results in extreme values for w; further analysis shows that the estimated value of w by model fitting is less reliable especially when the other parameters of the model are in appropriate range.
Simulations show that the low value of learning rate or high value of the Boltzmann machinery temperature ends in the high error of w estimation in model fitting.
Here, we propose that using all available information including behavioral measures of subjects besides the fitted values of model fitting can improve the reliability of parameter estimation in the assessment of the w parameter. To validate our approach, we used a twostep decision making task, which has been introduced by Daw et.al. previously; It can dissociate MB and MF contributions on human choice behavior [8]. Simulations showed that using some behavioral indices in the estimation of the w by the k-nearest neighbor (KNN) method resulted in a lower error in the retrieval of the real weight. Our results improve the applicability of MB and MF task in clinical trials and also in cognitive assessment protocols.

1-1 Daw Task
Daw et.al. have developed a two-step task by which MB and MF components of human choice behavior could be dissociated [7]. This task has been used by several other researchers [9,[17][18][19][20][21][22][23][24][25].The Markov Decision Process model for this task is illustrated in  Each Start-State action is predominantly associated (with a 70% probability) with one of the second level states. The transitions with 70% probability forenamed "Common"; and those with 30% probability named "Rare". Any action in state two or three terminate the trial and are associated with different reward probabilities that fluctuate independently across the session by a random walk. In any trial, first action has no reward and the second one results in rewarded or unrewarded trial. Thus, subjects have to make trial-by-trial adjustments in their choice to maximize the probability of achieved reward.

1-2 Model Structure
In the modeling of this task, subjects run both MB and MF algorithms in parallel and make choices according to the linear weighted combination of the action values that come from MB and MF systems [7]. This hybrid model has been used by several other  [25][26][27]. Fig 2 shows the flowchart of this model, the parameters of the model and available observations from the human task for each section.
In any trial (t) the value of each action (a) of first stage was calculated by the weighted sum of MB ( t ) and MF ( ) system value (weight: w) according to equation (1).
The stickiness to the previous action increases the value of the previous action by adding P to the value of previous action (equation (2)) The Boltzmann machine which is a stochastic, biologically-plausible approximation of the maximum operation [28] is used to extract the probability of choosing each action based on their values(equation (3)).
"β" is the inverse temperature which controls the trade-off between exploitation and exploration. Due to the non-deterministic environment and its probabilistic nature for rewards, it is usually assumed as a fixed parameter over trials but differs across subjects [7].
In the second stage of the task, corresponding Q values in each state determine the where P is the probability of transition towards the second step state S and calculated by Beta-Binomial Bayesian updating rule according to equation (5) [8].
(1, a, ) = where N(1,a,S) is the number of times S state has been achieved by performing the action a in the first step of the task.  Table 1.

Method
The overall structure of our algorithm to extract the w parameter is illustrated in Fig 3. When model fitting is used to extract the w, the model is initially specified and the set of estimated parameters including w is correspondingly determined by maximizing a similarity function between the model's decision and human decisions. Experimentally, both model structure and objective function affect the performance of model fitting to precisely identify parameter values and this effect becomes clearer when the provided data set is insufficient. This can be due to the limited number of trials, interactions between parameters that make them hard to disentangle or lack of behavior that can be used for the fitting process (e.g., in some Pavlovian conditioning experiments) [29].
Model fitting is an optimization problem and the objective function is a crucial determinant of the quality of the fitting procedure. Among the objective functions which can be used for fitting the observation to the model, the likelihood function is theoretically the best for fitting the parameters of individual agents [30]. This function tries to maximize the probability of the observed action in the corresponding situation. As a second objective function, to add some prior knowledge, "Maximum A Posteriori" (MAP) method can be recruited which maximizes the probability of a parameter set when an observation is captured [31]. To minimize the objective functions, we used interior-point optimization algorithm with 5 different random start points in the model as mentioned above.
In this paper, we want to use other available information including behavior statistics and indices besides the fitted values of parameter to extract w more precisely. In the proposed method, we use a KNN estimator as a learning system to extract the w from behavior. KNN uses the feature space of behavior and a dataset of labelled feature vectors which are used to estimate the w parameter. The dataset is generated by simulation of the model which has been assumed to be used by subjects.

2-1 Proposed KNN Estimator
KNN is a supervised free format learning method and has been used repeatedly by researchers as a good point estimator [32,33]. Different KNN estimators use different distance (neighbor) definitions and different voting methods. We used Euclidean distance in normalized feature space to find K-nearest neighbors and then estimations were calculated by the weighted sum of K-nearest neighbor values (Equation (6)).
In this equation, refers to the value of for the ℎ neighbor of observed behavior in the database. Also ̅ was defined based on the normalized inverse distance according to equation (7).
The stay probability varies between MB and MF system because of the difference in effects of current reward. It has been shown that there are known differences between MB and MF agents in stay probabilities in different situations [7,34]. Furthermore, the slope of stay probabilities, as indices for MF (equation (8)) and MB (equation (9)) behavior [35], were also used as another behavioral indicator in feature space.
Similar to stay probabilities, these conditions were specified by the outcome of the previous trial.   (10) and (11), were added to the feature space [35].
In these equations Fit and  (Table 3).

2-4 Model the noise in decision making
It has been shown that inclusion of lapse rate for subjects can improve the fitting quality in many psychophysical paradigms. This lapse rate is due to the trials in which subject has not attended and has responded randomly. We included this possibility for agents in simulations. To do so, we simulated agents in a range of noise levels, and at each noise level, a random number of trials, depending on the noise level, were selected for which the decisions of agents were reversed.

Results
By using random parameters for an RL agent with Daw8Param parameter set, we simulated a wide range of human behavior during Daw's task. These random parameters were sampled according to Table 4. All fitting methods were applied to observation and AIC was used to choose the best fitted model by each objective function. The model fitting error is calculable for these model fittings because the real value (which is equal to agent w) and extracted values are known. While the Mean Absolute Error was used as a point value of error , standard deviation or hinges were used to illustrate the distribution of error.
Hinges are the distances between the mean of half data which are below and above the MSE.

3-2 Effect of agent learning rate and temperature on model fitting
Assume that the Daw RL agent uses 3ParamV1 version and w is sought by best model fitting based on AIC. The question here is whether parameters' value affect fitting error.
To this end, we run 10000 agents with random parameters, according to Table 4, while and are individually assigned with fixed number sets. Fig 6 clarifies the result from this simulation.  Table 4.

3-3 KNN Parameters
For KNN estimator, the number of neighbors (K) is the only parameter that should be set. In addition to this parameter, the feature selection and feature combination can improve the KNN performance. To achieve the best performance, we found the best K by the exhaustive search to minimize MAE. According to Fig 7.A, the MAE is nearly constant when K is greater than 60 and up to 140 but the value of 101 for K is optimal and was used in all different situations.

3-3-1 Feature Selection
Selecting good features based on the previously mentioned analyses, can improve the performance of KNN estimator. Forward selection is one of the most commonly used feature selection methods. We used forward selection in two different situations based on the available information and analytics: 1-All features will be computed (needs fitting calculations).

2-Features from model fitting are excluded
Under such conditions for forward feature selection method, different feature subsets can be assumed. Based on Fig 7- 21 To have a deeper view we divide it to different areas. The diagonal area which is scattered points that have a small error (below 0.1). Ideally, this area percentage should be 100. The top and bottom area are those points that reported as common mistake of MLE.

3-4 Performance
We also consider the areas that the dominant strategy changed from MB to MF or reverse, as well as areas that the dominant strategy did not change. In addition, those areas that did not have a dominant strategy has been assumed as the transition areas. The scatter of KNN estimations of w by using all features (℘) and ℘sub1 are the nearest values to diagonal. In addition, the scatter illustrates that sticking to the extreme values which is the most problem of fitting methods is solved by KNN.  For KNN method, the tail of the distribution is shorter and includes lower values at the tail.
It means that variance of error is low and the calculated value of standard deviation (STD) confirms this (Table 5). On the other hand, for KNN methods probability of small error (errors between -0.05 and 0.05) are higher than just fitting methods. In addition, MAE which is reported in Table 5 confirms that the bias of KNN estimation is improved with regards to pure fitting method. Extreme errors in both ML and MAP are relatively higher than KNN based methods. Since it is possible to have extreme values for w in clinical conditions, these regions are more important. KNN methods correct these errors and make the method more robust for applications in clinical trials.

3-5 Noise in decision making
To investigate the effect of human mistake or lapse rate in doing actions, we run the simulation with different probability of additive noise in decision making. For each value of noise ratio, 10000 independent RL agents perform the task and both MLE and MAP fitting were applied on the observation by different models. AIC was then used to select the best fitted model for each individual agent's behavior.

Experimental data analyses
The two-step task of Daw et. al has been previously used to investigate the correlation between gaze data and the w. [8], where 5Param version of model has been used to extract the w value by MLE. KNN estimation was used to extract the combination weight from their data in the current investigation. Fig 11 illustrates the difference between the "Fitted This grouping is first checked by mean value of P-Stay in groups as a behavioral data.
According to Fig 11, groups changed for some subjects when the estimated w was used instead of fitted w. The first question arising then is which groups are more consistent with behavioral observation? To answer this question, the stay probability was extracted through different groupings as well as with group of subjects labeled differently between fitting and estimation (Fig 12). can be attributed to neglecting the transition (which is the main specification of MF subjects). This subject will therefore be a better candidate for MF rather than MB label and consequently the estimated w is better than that obtained by just fitting.
All the analysis given in the first part of the report by Konovalov  The proposed method is advantageous due to its lower error for extreme cases. Such extreme cases may be prevalent in clinical trials and psychiatric conditions and will make the proposed method to have superior performance over just model-fitting approaches.
MAP estimation is better than MLE in extreme values due to using a prior, KNN method works very better than MAP too. The mentioned improvements will enhance the applicability of the task of Daw et. al. for computational psychiatry purposes.
We showed that using the proposed method can help to increase the statistical power in analyzing the relation between parameters such as the gaze distribution to habitual and goal directed behavior. It was proven that consideration of behavioral parameters in estimation of combination weight (in addition to fitting), improves the consistency of behavior and subjects grouping and so other conclusions from this grouping can be more precise.
It should be noted that, any model fitting tries to minimize an objective function to extract the behavior under different assumptions. The MLE maximizes the likelihood while the extracted parameter by KNN will not maximize the likelihood even if the error of extracted w is lower. In fact, the flow of probabilities in reinforcement agent decisions cause that a specific parameter does not guarantee maximum likelihood while another parameter exists which satisfies the maximizes likelihood criterion. Although the Cramer-Rao Lower Band can be theoretically obtained by MLE, the above statement ensures that better estimations can be obtained by learning.
The proposed method can be considered as a maximum likelihood estimation using simulation-based estimation. Such a method not only uses trial by trial observations of the behavior, but also uses global observation such as stay probabilities in random variable space and tries to maximize the likelihood of observing all the mentioned behaviors together. For large sample sizes, it is possible that MLE and KNN methods converge to the same estimation error. For limited sample sizes, however, KNN has shown more reliability and avoids overfitting, and is considered a better option in a typical experimental condition.
In sum, our proposed method can enhance the estimation of the combination weight between model based and model free behaviors. This improvement is due to using behavioral indices from the data that make the estimation more robust. This robust estimation can facilitate the using of similar paradigms in clinical applications and help in diagnosis of psychiatric disorders.