Understanding melanopsin using Bayesian generative models – an Introduction

Understanding biological processes implies a quantitative description. In recent years a new tool set, Bayesian hierarchical modeling, has seen rapid development. We use these methods to model the kinetics of a specific protein in a neuroscience context: melanopsin. Melanopsin is a photoactive protein in retinal ganglion cells. Due to its photoactivity, melanopsin is widely used in optogenetic experiments and is an important component in the elucidation of neuronal interactions. Thus it is important to understand the relevant processes and develop mechanistic models. Here, with a focus on methodological aspects, we develop, implement, fit and discuss Bayesian generative models of melanopsin dynamics. We start with a sketch of a basic model and then translate it into formal probabilistic language. As melanopsin occurs in at least two states, a resting and a firing state, a basic model is defined by a non-stationary two-state hidden Markov process. Subsequently we add complexities in the form of (1) a hierarchical extension to fit multiple cells; (2) a wavelength dependency, to investigate the response at different colors of light stimulation; (3) an additional third state, to investigate whether melanopsin is bi- or tri-stable; (4) differences between sub-types of melanopsin as found in different species. This application of modeling melanopsin dynamics demonstrates several benefits of Bayesian methods. They directly model the uncertainty of parameters, are flexible in the distributions and relations of parameters in the modeling, and allow the inclusion of prior knowledge, for example parameter values based on biochemical data.


Introduction
Time-varying data can be analyzed with a multitude of statistical methods. Integrating ordinary or partial differential equations is one of the major tools in the natural sciences. For example, in order to analyze the morphology of an action potential, we could model the rise and fall by a system of two coupled differential equations. In a linear approximation this results in two exponential functions, whose time constants describe the rise and the fall.
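Such a two-time-constant approximation can be sketched numerically (an illustrative toy curve with made-up time constants, not a fit to real recordings):

```python
import numpy as np

def rise_fall(t, amplitude=1.0, tau_rise=1.0, tau_fall=5.0):
    """Difference of two exponentials: a common linear approximation
    of a rise-and-fall time course. Time constants are illustrative."""
    return amplitude * (np.exp(-t / tau_fall) - np.exp(-t / tau_rise))

t = np.linspace(0.0, 30.0, 301)
v = rise_fall(t)
# v starts at 0, rises on the scale of tau_rise, and decays with tau_fall.
```

Fitting such a curve to data yields the two time constants, but no mechanistic interpretation.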
Alternatively we could use the more complex Hodgkin-Huxley model (Hodgkin and Huxley, 1952). This system of equations not only describes the data better, but also allows a direct interpretation of model variables in terms of molecular and cellular properties. Furthermore, in many experiments multiple factors influence the dependent variable concurrently and the process of interest is non-stationary. In that case, extracting single time constants can be biased and unable to explain the data, and consequently the mechanistic model should be preferred.
The benefit of such generative models is the ability to generate 'fake data' using previously fitted parameters. This allows us to predict unseen data and to simulate experiments in which, for example, some of the parameters were changed. Thus, the first step in analyzing time-varying data is to develop a formal mechanistic model of the data.
Once we have specified the model, we need to estimate the values of its parameters based on measured data. The parameters of such systems of differential equations are most commonly estimated by maximum likelihood, i.e. as the one parameter set under which the occurrence of the data as observed is most likely. While often used, another approach has important benefits and improvements: Bayesian parameter estimation. It allows us to directly estimate parameter uncertainties, interpret them intuitively as probabilities about parameters conditioned on the data, and seamlessly include prior knowledge. Due to these benefits, Bayesian parameter estimation has seen a strong comeback and is becoming increasingly popular (Cronin et al., 2010; Ghasemi et al., 2011).
In order to use Bayesian estimation we need to understand three concepts: the likelihood, the prior and the posterior. The likelihood tells us how likely it is that our data were generated by a given set of parameter values. The prior tells us how likely certain parameter values are in the first place. Thus, if we know a priori from previous experiments that a receptor has a certain time constant, we can directly incorporate this knowledge into our current model fit and adequately influence the posterior of the time-constant parameter and all other co-dependent estimates. The posterior of each parameter is the distribution that shows us how probable each parameter value is, given our data and prior knowledge; it is thus a combination of prior and likelihood. In the end we do not only get a single best-fitting parameter value, but a distribution.
Thus, in addition to the most probable parameter value, we estimate the uncertainty of the parameters as a probability distribution. A broad probability distribution indicates that we cannot estimate the parameter well: neighboring parameter values have a similarly high posterior probability. A narrow distribution indicates that the parameter can be estimated with high precision. Furthermore, dependencies between several parameters might be complex, but can be modeled by these methods. With Bayesian methods we can flexibly use generative models and, importantly, the posterior probability can be interpreted as the uncertainty of a parameter, a straightforward and often implicitly used interpretation.
As an example to guide this paper we use patch clamp recordings of cells expressing melanopsin, a photosensitive opsin-type protein occurring naturally in the retina. In mammals it is expressed in ganglion cells that project to the suprachiasmatic nucleus and influence the circadian rhythm (Hankins et al., 2008; Do and Yau, 2010). A cell containing melanopsin will begin to fire if photons of a certain wavelength activate the protein. Melanopsin is activated using blue light (470 nm) and can subsequently be deactivated using green-yellow light (560 nm).
In contrast to other opsins, melanopsin's activation is tonic: once activated, it stays activated for several seconds to minutes (Spoida et al., 2016). Melanopsin presumably occurs in two states, the M (active) and R (resting, inactive) states (for a review see (Schmidt and Kofuji, 2009), but see (Emanuel and Do, 2015)). Activating the protein with blue light increases the probability of R-state melanopsin to change its configuration to the active M state. Concurrently, constant transition probabilities from R to M and from M to R exist that lead the cell to an equilibrium distribution of melanopsin in M- and R-state configurations. Here, we use data from melanopsin patch clamp recordings (Spoida et al., 2016), where cells at resting state are activated using blue light and subsequently deactivated with green-yellow light (Figure 1). In this paper we develop a Bayesian mechanistic model of melanopsin and discuss the implementation of the model, the inverse fit, model checks, the interpretation of the parameters and how we can exchange parts of the model in a modular way to improve our understanding and design new experiments.

bioRxiv preprint first posted online Mar. 11, 2016; doi: http://dx.doi.org/10.1101/043273. The copyright holder for this preprint is the author/funder. It is made available under a CC-BY 4.0 International license. Not peer-reviewed.
Methods and Results

Model building
It is helpful to start with a graphical model representation (Figure 2 A). In this paper we loosely follow the model notation in (Lee and Wagenmakers, 2014). Once the graphical model is specified, it can be directly implemented in a Bayesian programming language. In the graphical model (Figure 2 A) all parameters that change over time are shown inside the time-point plate and indicated with time indices. The main parameters are the proportions of firing (M) and resting (R) states. In every simulation time step t there is a certain probability p(MtoR)[t] to switch states from M to R. This transition probability is influenced by a constant rate and a green-light dependent rate. Of course the light-dependent rate is only taken into account when there is green light; thus we need a dummy-coded green-light variable G[t], which is 0 when there is no light and 1 when the green light is active. Because light activation happens at specific times determined by the experimenter, the transition probabilities change over time, i.e. they are non-stationary. The transitions are implemented using ordinary differential equations. One of the assumptions of the model is that the recorded patch-clamp currents are directly proportional to the proportion of the M-state. We do not expect the patch-clamp noise level to change during our recording time, and thus we include a constant Gaussian noise term in our model. To summarize: we model the patch-clamp currents using a Gaussian whose mean is proportional to the amount of M-state and thus non-stationary over time. The model allows us to intuitively grasp the parameters, interactions and mechanisms that are needed to model our data.
A more formal way to describe this implementation is as a non-stationary two-state hidden Markov model, for which we estimate the transition probabilities and the relevant factors. All scripts and models are documented and publicly available at http://osf.io/bn6pk.
In this paper we make use of Stan (Carpenter et al., 2016), in combination with R (R Core Team, 2013). The non-stationarity in our case was implemented by a logistic linear model with time-varying predictors. The model code is shown in Figure 2 B, parallel to the model graph. In the following, square brackets denote arrays, round brackets denote functions.
In our case, the linear model can be described by:

logit(p(RtoM)[t]) = beta0 + beta1 * B[t]

where beta0 is the constant change parameter, B[t] defines at which time intervals blue light is active, and beta1 is the blue-light dependent change parameter. The inverse logit function maps values from the domain of minus infinity to infinity into the domain of 0 to 1, thus into the domain of probabilities. This formulation as a logistic linear model allows us to connect the estimation of parameters over multiple cells with the idea of hierarchical or mixed models (see section Modular Improvements, hierarchical fit further down). The same formula defines the spontaneous transition probability from M- to R-state. Thus, for the size of the change of the R-state at each point in time, there exist two influences: some fraction of melanopsin changes state from M to R, and in the same time step some fraction changes from the resting to the firing state. The combined probability determines the proportion of R (or M, respectively), as captured by ordinary differential equations. At each simulated time step (with a predefined time resolution Δ) we update our R parameter (and M, respectively) by a first-order integration:

R[t+1] = R[t] + Δ * dR[t], with dR[t] = p(MtoR)[t] * M[t] - p(RtoM)[t] * R[t]

We use a discrete-time notation here to parallel the code of the implementation. In the two-state model it is necessary that the amount of M-state is the complement of the R-state. Thus:

M[t] = 1 - R[t]

We can make use of this relation and only calculate the change in the R-state and invert it for the M-state, but if we want to extend the model to three states, it is more sensible to implement both changes, dR and dM.
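To make the forward dynamics concrete, they can be simulated in a few lines (a sketch with made-up rate parameters and light timings, paralleling the update equations; not the actual Stan implementation):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(T=600, dt=0.2, b0_RtoM=-6.5, b1_blue=4.0,
             b0_MtoR=-6.5, b1_green=4.0):
    """Forward-simulate the two-state model. Blue light is on for
    steps [100, 300), green light for steps [400, 500) (arbitrary)."""
    B = np.zeros(T); B[100:300] = 1          # blue light indicator
    G = np.zeros(T); G[400:500] = 1          # green light indicator
    R = np.empty(T); M = np.empty(T)
    R[0], M[0] = 1.0, 0.0                    # all melanopsin at rest
    for t in range(T - 1):
        p_RtoM = inv_logit(b0_RtoM + b1_blue * B[t])
        p_MtoR = inv_logit(b0_MtoR + b1_green * G[t])
        dR = p_MtoR * M[t] - p_RtoM * R[t]   # first-order update
        R[t + 1] = R[t] + dt * dR
        M[t + 1] = 1.0 - R[t + 1]            # two-state constraint
    return R, M, B, G
```

Running `simulate()` yields an M-state trace that rises under blue light and decays again under green light, qualitatively matching Figure 1.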
The final important relationship to define is the relation to our data, including a noise distribution. In Stan this can be achieved by a sampling statement of the form C[t] ~ normal(mu[t], sigma) (see the model code in Figure 2 B). In Stan the tilde (~) means "is sampled from". Thus, the line defines that the measured current C is sampled from a normal distribution with time-varying mean mu[t] and constant variance sigma^2.
Before fitting the model we need additional statements about the type and range of the parameters, and we need to define the initial state, which could, for example, be random.
This concludes the implementation of the specified graphical model in Stan.

Bayesian Parameter estimation
In the next step we estimate the posterior parameter distributions. Here we give a short introduction to Bayesian data analysis and Monte Carlo sampling methodology. Our goal is to estimate the posterior probability distribution: colloquially, how probable it is that each possible parameter value underlies our data. According to the Bayesian framework, this consists firstly of the likelihood of the data given the parameters, in other words how likely it is that the data were generated by a specific set of parameters; and secondly of the prior distribution, which states how probable a parameter value is in the first place. In more formal terms, we are interested in the posterior distribution p(θ|D), given the likelihood of the data p(D|θ) and the prior parameter probabilities p(θ). Bayes' theorem states that these are directly related to each other: p(θ|D) ∝ p(D|θ) * p(θ). An example: we record a neuron spiking at 10 Hz.
Our imaginary model assumes that the spiking rate of the cell is sampled from a normal distribution with an unknown mean and a fixed standard deviation of 2 Hz. This model has only a single parameter to be estimated, and we can easily calculate the Gaussian likelihood: we get a low likelihood for a mean of 5 Hz, a higher likelihood for a mean of 12 Hz, and an even higher likelihood for a mean of 10 Hz. If we incorporate the prior knowledge that this specific neuron type is very rarely observed with a spiking rate higher than 5 Hz, Bayes' rule will integrate the information gained from the data and the prior information, and we will find the most likely parameter, given prior and data, at a lower estimate, for example 8 Hz. Whether data or prior dominates the posterior depends on how accurately, or certainly, the prior knowledge was specified, and on how much uncertainty, or noise, about the parameter the data carry. Bayes' rule automatically finds the optimal compromise between prior knowledge (what we think is a likely result) and our data (what actually happened).
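This toy example can be computed explicitly on a grid (the Gaussian prior below is one made-up way of encoding "rarely above 5 Hz"; the numbers are illustrative only):

```python
import numpy as np

rate_grid = np.linspace(0.0, 20.0, 2001)   # candidate firing rates (Hz)
observed = 10.0                             # measured spiking rate

# Likelihood: observed rate with Gaussian noise, sd fixed at 2 Hz.
lik = np.exp(-0.5 * ((observed - rate_grid) / 2.0) ** 2)

# Prior: this neuron type rarely fires above 5 Hz, encoded here
# (for illustration) as a Gaussian centered at 5 Hz with sd 3 Hz.
prior = np.exp(-0.5 * ((rate_grid - 5.0) / 3.0) ** 2)

# Posterior is proportional to prior * likelihood; normalize over the grid.
posterior = prior * lik
posterior /= posterior.sum() * (rate_grid[1] - rate_grid[0])

rate_map = rate_grid[np.argmax(posterior)]
# rate_map lies between the prior mean (5 Hz) and the observation (10 Hz).
```

The posterior peak is pulled below the observed 10 Hz by the prior, exactly the compromise described above.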
Calculating the posterior is straightforward for a single parameter: we could try out all parameter values using a grid approach, calculate the likelihoods and priors, and read off the posterior. This would be very inefficient, however, especially if we have to estimate multiple parameters concurrently, as there is a combinatorial explosion. This is where Markov chain Monte Carlo (MCMC) sampling comes into play. Instead of exhaustively sampling the space, we start at a random initial value and propose a jump to a new value. We evaluate the posterior at this value; if it is higher (thus more probable) than at the current value, we move there. If it is lower, we move there only with a probability equal to the ratio of the new to the current posterior value. From there we repeat the procedure for many iterations. This simple rule (known as the Metropolis algorithm, (Hastings, 1970)) ensures that we visit areas where the posterior is high more often, but from time to time explore other, less probable areas as well. Moreover, our Markov chains fulfill all assumptions of the ergodicity theorem; thus it is guaranteed that the Markov chain will ultimately converge to the true posterior. In the end, our estimate of the posterior consists of how often we visited each parameter value. The MCMC sampling algorithm allows us to estimate highly complex models with many parameters.
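A minimal Metropolis sampler, targeting a toy one-dimensional posterior, might look like this (an illustrative sketch, not the sampler used for the melanopsin model):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta):
    """Toy target: standard normal log-density (up to a constant)."""
    return -0.5 * theta ** 2

def metropolis(n_samples=20000, step=1.0, theta0=5.0):
    samples = np.empty(n_samples)
    theta = theta0
    lp = log_posterior(theta)
    for i in range(n_samples):
        proposal = theta + rng.normal(0.0, step)   # random jump proposal
        lp_new = log_posterior(proposal)
        # Accept if more probable; otherwise accept with probability
        # equal to the ratio of new to current posterior values.
        if np.log(rng.uniform()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples[i] = theta                         # record current value
    return samples

draws = metropolis()
post = draws[2000:]   # discard the initial warmup portion
# The histogram of `post` approximates the target distribution.
```

The frequency with which each value is visited approximates the posterior, as described in the text.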
Over the years more sophisticated algorithms have been developed. In this paper we use NUTS, the No-U-Turn sampler; it is more efficient than Metropolis samplers in the case of hierarchical linear models with correlated parameters. This algorithm stems from the family of Hamiltonian Monte Carlo (HMC) algorithms. With HMC algorithms we replace the randomly chosen proposal step of Metropolis with a scheme that samples the posterior more effectively. Imagine that the inverse of the posterior has a bowl shape: the most likely points lie in the valley, and the most unlikely ones rise as mountains the further away you go. We start at a random point in the posterior, place a marble there and send it on its way with a small push in a random direction. We simulate its trajectory for a while, and the position where the marble ends up is our new proposed value. We compare it to the current value and proceed as before. The marble has some momentum, so it might be just enough to roll through local minima. The NUTS algorithm is based on HMC but in addition makes certain not to allow any u-turns, where the marble rolls uphill (due to gained or initial momentum) and would come down the same way again. The exact algorithm is somewhat more involved, because it needs to ensure convergence towards the posterior, but this is the general idea; for details we refer the interested reader to (Hoffman and Gelman, 2014). NUTS allows for effective sampling of the posterior and reduces the risk of getting stuck in local minima or in narrow passages of the posterior landscape.

Model Fit & Sampling Diagnostic
Next we describe how Stan estimates the posterior distribution. Stan is a sophisticated open source implementation of HMC/NUTS with interfaces for a multitude of programming languages (R, Matlab, Python, Julia, Stata and a command line tool). It allows models to be specified in a comparatively simple way and offers many tools to evaluate the results. Stan comes with its own programming language, which is not difficult to learn given experience in Python, R, Matlab or C++. The Stan model is then compiled to C++ code by the Stan interface and sampled from by the MCMC algorithm. Sampling consists of two phases. The first is a warmup period in which sampling parameters are calibrated by the NUTS algorithm to sample effectively from the shape of the posterior. This is necessary because newly proposed values could lie outside the allowed range of a parameter, in which case the proposal has to be rejected, giving an overhead of likelihood calculations. If this happens too often, we sample inefficiently; but at the same time we do not want redundancies in the sampling resulting from small (but not rejected) step sizes. This tradeoff is automatically calibrated in the warmup period, and the following sampling period defines the final estimate of our posterior.
The chains of an MCMC sampler need to be diagnosed for proper convergence. Sometimes a chain can get stuck in certain parameter value constellations, for example in a bimodal posterior distribution, or the MCMC algorithm makes jumps that are too small and we do not explore the space appropriately. It is difficult to diagnose those problems when we only look at a single chain with a single starting value. Therefore we use multiple chains which run independently. This allows us to check whether the chains converged to the same posterior distribution, which is necessary (but not sufficient) for successful sampling. Several features can indicate proper convergence: we visually inspect the chains (Figure 3 A), compare the variance between chains to the variance within one chain (termed Rhat, which should be close to 1), look at the overlap of the posterior densities of the chains (Figure 3 B), or check the autocorrelation of a chain (Figure 3 C), i.e. how independent successive samples are from each other; high independence is preferred here. In Stan this is often reported as a single number, the effective number of samples, N_eff, which is the number of samples corrected by the autocorrelogram ((Gelman et al., 2013) p. 286). In our first model, we ran 5 chains with 300 warmup iterations and 500 samples. Visually, the chains seem converged and the posteriors overlap.
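A simplified version of the Rhat diagnostic can be computed directly from the chains (this is the classic Gelman-Rubin formula; Stan's own implementation uses a refined split-chain variant):

```python
import numpy as np

def rhat(chains):
    """Simplified potential scale reduction factor (Gelman & Rubin).
    `chains` has shape (n_chains, n_samples)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    B = n * chain_means.var(ddof=1)          # between-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
good = rng.normal(0.0, 1.0, size=(4, 500))           # well-mixed chains
bad = good + np.array([[0.0], [2.0], [4.0], [6.0]])  # chains stuck apart
# rhat(good) is close to 1, rhat(bad) is far above the 1.1 threshold.
```

Chains sampling the same region give Rhat near 1; chains stuck in different regions inflate the between-chain variance and push Rhat well above 1.1.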
Similarly, the Rhat is below 1.1 for all parameters. The autocorrelogram shows autocorrelation up to a certain degree, but it does not seem worrying (Figure 3 C, upper panel). In a similar vein, the effective number of samples is 700 for the upper and 1670 for the lower parameter, representing the 'worst' and 'best' effective sample values in this model. According to all our criteria, the chains of the MCMC sampler seem to have converged.

Posterior Predictive / Model checks
After we have obtained samples of the posterior distribution, and preferably before interpreting the results, we need to check the adequacy of our model. A powerful feature of generative models is that they are able to simulate new data from the currently estimated parameters. These new data should capture the important dynamics and effects of our original data; otherwise the model would be inadequate. When we sample new data from the posterior parameter estimates, this process is called a posterior predictive model check. In our example, we sample 1000 new traces from our posterior parameter estimate distributions (Figure 4). Because we randomly sample from a distribution of estimates, each trace will be a little different. Our original data should lie in the 95% credibility interval of the posterior predictive set. Only after we have ensured that the model is adequate do we inspect the posterior parameter estimates visually, or calculate and interpret summary statistics (often medians and percentiles) of the parameter distributions. The posterior check reveals two problems with our model. In the initial phase, marked with (1), we observe a mismatch between the observed and the predicted data: the posterior predictives indicate that the current is slowly increasing, whereas the data show no such trend. This first model mismatch can be readily explained. In the initial phase, the expressed melanopsin proteins are not activated; they need a first activation by blue light before they can reach an equilibrium between the R and M states. The model, however, falsely assumes that this equilibrium can be reached from the beginning. By either excluding this portion of the data or adding another initial state for melanopsin, this difference could be modeled. The model mismatch at (2) is currently not well understood: even though blue light is still activating melanopsin proteins and forcing them into the M state, the current diminishes again.
Mechanisms that could resolve this range from internalization of receptors, over delays in the G-protein-coupled pathway, to a hypothetical refractory period of melanopsin or of the G-protein-coupled pathway. This cannot be captured by the current model, and thus the posterior predictives show an expected maximum at the end of the blue-light period.
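The mechanics of a posterior predictive check can be sketched generically (a toy generative model with made-up posterior draws stands in for the melanopsin model here):

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_trace(theta, sigma, T=100):
    """Toy generative model: exponential approach to a plateau,
    observed with Gaussian noise (a stand-in for the real model)."""
    t = np.arange(T)
    mu = 1.0 - np.exp(-theta * t)
    return mu + rng.normal(0.0, sigma, size=T)

# Pretend these are draws from the joint posterior of (theta, sigma).
theta_draws = rng.normal(0.05, 0.005, size=1000)
sigma_draws = np.abs(rng.normal(0.05, 0.005, size=1000))

# One simulated trace per posterior draw = the posterior predictive set.
preds = np.array([generate_trace(th, s)
                  for th, s in zip(theta_draws, sigma_draws)])

# Pointwise 95% credibility band of the predictions.
lo = np.percentile(preds, 2.5, axis=0)
hi = np.percentile(preds, 97.5, axis=0)
# Data falling systematically outside [lo, hi] indicate model mismatch.
```

Plotting the band alongside the data, as in Figure 4 A, makes regions of mismatch such as (1) and (2) immediately visible.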
The posterior checks revealed two problems with this model; especially the first one could bias our parameters. Coping with these problems is left open for now, but it is not difficult to resolve them by enhancing the model. We are now ready to interpret our parameters for this single-cell fit. We expected the initial R-state parameter to be around one, due to our baseline correction. This is indeed the case: the average initial state for R is 1 [0.99, 1]. The estimated standard deviation (the estimated measurement noise) of our signal is 0.052 [0.050, 0.054]. We defined four main parameters in our model. The first, the constant parameter of the R-to-M transition, indicates the constant, spontaneous transition probability from the resting to the active state. The estimate is -6.5 [-6.6, -6.5] on the logit scale. In order to convert this to a more sensible unit, we first take the inverse of the logit function; then, to obtain the probability per second, we raise one minus this value to the power of the sampling frequency and subtract the result from one.
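Written out as code, this conversion might look as follows (assuming the 5 Hz resampled rate from Figure 1):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

beta0 = -6.5   # posterior mean estimate on the logit scale
f = 5          # sampling frequency after resampling (Hz)

p_per_sample = inv_logit(beta0)              # probability per time step
p_per_second = 1 - (1 - p_per_sample) ** f   # probability per second
```

The per-second probability follows from the chance that the transition does not happen in any of the f steps within one second.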

Figure 1:
Figure 1: A) Raw data of hOpn4l patch clamp recordings. hOpn4l was expressed in HEK 293 cells which express GIRK1/2 subunits. The GIRK-mediated K+ currents were sampled at 50-200 Hz. Blue light (470 nm) activates the current outflow, green/yellow light (560 nm) deactivates it. B) Data were resampled to 5 Hz. We then normalized the range by mapping the 95% percentile of each cell between 0 and -1.


Figure 2:
Figure 2: A) A graphical model description of the basic model. Filled parameters depict data that are given. At each point in time a fraction of the R state is changed into the M state with the non-stationary probability p(RtoM). The same process governs the change from M to R. The transition probabilities are influenced by constant (stationary) leakage probabilities and non-stationary, light-dependent activations. The active M state is used as a model of the measured current of the patch clamp. These recordings are inherently noisy, and we model this noise using a Gaussian with the non-stationary mean mu[t] and the standard deviation sigma. The parameter for the initial M/R state at t=1 was omitted from the graph. B) The graphical model implemented in the Stan programming language.

Figure 3:
Figure 3: A) Four independent MCMC chains with 500 samples each for two parameters, MtoR and Rinit. The chains all converged to the same value range; the variance between chains is similar to the variance within chains. Visually, these chains seem to converge to the same value. B) The posterior densities (marginals) of the chains in A. The chains all sample the same region of the posterior, which is an indication of convergence. These densities can be further summarized by, for example, the medians and 95% quantiles of the distributions. C) The autocorrelograms of the two parameters. The upper parameter (MtoR) has a higher autocorrelation; thus the effective number of independent samples drawn from the posterior is smaller than for the lower parameter (Rinit).

Figure 4:
Figure 4: A) The light gray band depicts the 95% credibility interval of 1000 posterior predictives. Posterior predictives are 'new cells' simulated from our posterior parameter estimates; they reflect the range of possible outcomes of the posterior model fit. The dark gray line depicts the median posterior predictive value. The black curve depicts the original data. The annotations "1" and "2" are discussed in the text. B) Parameter estimates for the single cell shown in A). Medians and 95% percentiles are shown.