Deep Reinforcement Learning for Optimal Experimental Design in Biology

Neythen J. Treloar 1,*, Nathan Braniff 2, Brian Ingalls 2, Chris P. Barnes 1,3

doi: https://doi.org/10.1101/2022.05.09.491138

1 Department of Cell and Developmental Biology, University College London, London, WC1E 6BT, UK
2 Department of Applied Mathematics, University of Waterloo, Canada
3 UCL Genetics Institute, University College London, United Kingdom
* For correspondence: neythen.treloar.14@ucl.ac.uk

Abstract

The field of optimal experimental design uses mathematical techniques to determine experiments that are maximally informative from a given experimental setup. Here we apply a technique from artificial intelligence—reinforcement learning—to the optimal experimental design task of maximizing confidence in estimates of model parameter values. We show that a reinforcement learning approach performs favourably in comparison with a one-step ahead optimisation algorithm and a model predictive controller for the inference of bacterial growth parameters in a simulated chemostat. Further, we demonstrate the ability of reinforcement learning to train over a distribution of parameters, indicating that this approach is robust to parametric uncertainty.

Author summary

Biological systems are often complex and typically exhibit non-linear behaviour, making accurate model parametrisation difficult. Optimal experimental design tools help address this problem by identifying experiments that are predicted to provide maximally accurate parameter estimates. In this work we use reinforcement learning, an artificial intelligence method, to determine such experiments. Our simulation studies show that this approach allows uncertainty in model parametrisation to be directly incorporated into the search for optimal experiments, opening a practical avenue for training an experimental controller without confident knowledge of the system’s parameter values. We present this method as complementary to existing optimisation approaches and we anticipate that artificial intelligence has a fundamental role to play in the future of optimal experimental design.

2 Introduction

A key goal of systems and synthetic biology is to apply engineering principles in service of understanding and building biological systems. Such approaches rely on inference of mechanistic models and downstream model-based prediction of cellular systems. Biological systems are typically complex, highly non-linear, and noisy, making development of accurate models a challenging task [1, 2, 3]. Furthermore, characterisation experiments can be resource-intensive and time consuming; efficient use of experimental effort is critical. The field of optimal experimental design (OED) uses mathematical techniques to identify experiments that will provide maximally informative characterisation data within the constraints of a fixed experimental capacity.

Application of OED to nonlinear biological systems is challenging; techniques for efficient OED in this context have been developed via Bayesian methods [4, 5, 6] and methods based on optimisation of the Fisher information [7, 8, 9, 10]. Here, we use reinforcement learning (RL), a branch of machine learning, to develop a novel Fisher information-based OED method for parametrisation of non-linear models of biological systems. Although we focus on an application relevant to synthetic biology, the method is general and can be applied to any OED model parametrisation task. Reinforcement learning methods learn control policies from data generated either by simulation or by interaction with a real system. Much reinforcement learning research has focused on games [11, 12, 13], but its effectiveness has also been demonstrated in optimising chemical reactions [14], controlling plasmas for nuclear fusion [15], and designing computer chips [16]. Standard OED methods determine experiments by optimising with respect to a model of the system. This can require complex non-linear programming or integration over posterior distributions to calculate online experimental inputs. In contrast, a reinforcement learning agent chooses its actions based on its experience, which is generated through trial-and-error interaction with the model during a training process. Although a satisfactory training process can be rather lengthy, the resulting trained agent can act as an experimental controller, rapidly identifying optimal experimental inputs.

Model-based optimal experimental design faces a fundamental challenge: the optimisation problem is formulated in terms of the underlying model, whose accuracy can only be guaranteed if the parameter values to be inferred are already known (and, of course, if those parameter values are known then there is no need to design experiments to infer them). An iterative OED estimation approach can be employed to resolve this problem [17, 18, 19]: begin with an initial guess for the parameter values, then iteratively apply OED, using the resulting experiments to improve the parameter estimates. Even in this case, the performance of the resulting experimental design will still depend on the initial guess of parameter values. Bayesian approaches to OED incorporate robustness to this initial guess by beginning with a probability distribution over parameter estimates (a prior) rather than a single point estimate [20]. However, Bayesian approaches typically involve computationally expensive integration and are restricted to specific distribution types [4, 6]. Our proposed reinforcement learning approach provides a complementary means to tackle the problem of limited parameter knowledge.

In the presentation below, we develop our novel reinforcement learning OED algorithm by iterative improvement of a baseline Fitted Q-learning approach. We begin by outlining our proposed formulation of reinforcement learning for OED and introducing a model of bacterial growth in a chemostat, which will be used as an application (Sections 3.1 and 3.2). We then apply the Fitted Q-learning algorithm [21] (FQ-learning), which we have previously shown to be effective in controlling biological systems [22], to the chemostat growth model. In Section 3.3, we use this initial analysis, in which we unrealistically assume prior knowledge of the true parameters, to explore the baseline performance of the RL-OED approach. Next, in Section 3.4 we consider an alternative formulation of the learning agent’s interaction with the model, based on a recurrent neural network. We demonstrate, using a direct measure of value-function construction, that the corresponding algorithm has the capacity to effectively predict the optimality of experiments independently of accurate prior parameter estimates. However, an assessment of OED performance reveals that this strategy yields unsatisfactory performance. To address that issue, in Section 3.5, we again alter the learning agent formulation, this time extending its behaviour to a continuous action space, through application of the Twin Delayed Deep Deterministic policy gradient (T3D) algorithm [23]. We call the resulting RL approach the Recurrent T3D (RT3D) algorithm. Successful OED performance of this algorithm is demonstrated in Section 3.6. Finally, we demonstrate that the RT3D algorithm is robust to uncertainty by training the agent over an ensemble of model instances drawn from a parameter distribution (Section 3.7), resulting in a robust controller and illustrating the algorithm’s ability to design experiments with limited prior knowledge. Throughout, we compare the OED performance of the reinforcement learning controllers with both a one step ahead optimiser (OSAO) and a model predictive controller (MPC).

3 Results

3.1 Reinforcement learning for optimal experimental design

We focus our work on local, Fisher information-based, experimental design. D-optimal design aims to maximise the determinant of the Fisher information matrix (FIM). For linear models with Gaussian measurement error, this goal is equivalent to minimising the volume of the confidence ellipsoid of the resulting parameter estimates [24]. This approach has been demonstrated to be useful even for non-linear systems [7, 8, 9, 10]. Fig 1A shows the expected outcome from a hypothetical OED application. Input profiles for two imagined experiments are shown, representing equivalent experimental effort. Data from the corresponding system outputs are used to infer model parameter values, resulting in expected confidence ellipsoids of parameter estimates. This comparison indicates that a poorly designed experiment (left panel: large confidence ellipsoid and low D-optimality, defined as the logarithm of the determinant of the Fisher information matrix) is less informative than the well designed experiment (right panel: small confidence ellipsoid and high D-optimality).

Figure 1:

Reinforcement learning for optimal experimental design. A) A hypothetical example of a poorly designed experiment (left) corresponding to an increasing sequence of input values u over time, with a resulting continual increase in the observable output Y. A corresponding confidence ellipse in p1-p2 parameter space is depicted. The logarithm of the determinant of the Fisher information matrix, log(|I|), is low. In contrast, a hypothetical well-designed experiment (right), which maximises the determinant of the Fisher information matrix, corresponds to non-intuitive choices of input and a resulting dynamic response in the output. The corresponding confidence ellipse is tight and the determinant of the Fisher information matrix is high. B) Optimal experimental design formulated as a reinforcement learning problem. The model dynamics, F, describe the rate of change of the state vector X in terms of model parameters θ and input u. For each time step, τ, an observation of the system, oτ, is provided to the agent, which chooses an action, aτ, to apply over that time step and receives a corresponding reward, rτ. C) Training over a parameter distribution. 1) A distribution of parameters is chosen (shown as uniform). 2) The RL controller is trained. Each episode employs a model parametrisation θ sampled from the distribution. 3) When acting as a feedback controller, the trained RL agent designs near optimal experiments across the parameter distribution. For example, well-designed experiments will be executed for either θ1 or θ2 (inputs u1(t) and u2(t) respectively). D) Model of an auxotrophic bacterial strain growing in a chemostat. The nutrient inflows, Cin and C0,in, can be controlled as part of an experiment.

Rather than insisting that the sequence of experimental inputs be decided prior to the start of the experiment (which would be an open-loop design), we allow the inputs to be selected based on measurements of the system output (closed-loop design). Reinforcement learning algorithms learn by observing discrete time series. We thus divide an experimental time series into 𝒯 discrete time steps. Below, we use t to refer to time and τ as the index of these discrete time steps: τ = 1, 2, …, 𝒯. The constant input applied to the system during each time step is determined from an experimental measurement taken at the beginning of the time step. (The time steps are thus sample-and-hold intervals.) We define an experimental design as a sequence of experimental inputs (one for each time step in the experiment). Other experimental features (such as initial conditions, environmental settings, and experiment length) are presumed fixed. In our construction the experimental inputs are determined by a reinforcement learning agent. Before experiments begin, the agent undergoes training on a dynamical model of the system of interest. (In our preliminary analysis (Section 3.3) this is done assuming the true parameters are known. We later demonstrate developments that relax this assumption and thus yield a general and robust approach to OED.) Training is carried out over a number of single experiment simulations, called episodes. During this model-based training, the reinforcement learning agent has the goal of maximising the logarithm of the determinant of the FIM over each experiment (details in Section 5.1). Once training is complete, the trained agent can act as a feedback controller to provide real-time inputs to the experimental system.

To formulate an OED problem in a reinforcement learning framework, we begin by defining (i) the agent’s environment, (ii) the observations available to the agent, (iii) the actions the agent can take, and (iv) the reward function that guides the agent towards optimal behaviour (Fig 1B). The agent’s environment is provided by a (pre-established) model of the experimental system (governed, in our case, by a set of differential equations). At the beginning of each time step, the agent is supplied with an observation, oτ. In our preliminary formulation, described below, the observation consists of the index of the current time step, τ, the system output, Yτ, and an estimate of the FIM, Iτ. In subsequent developments of the algorithm, we use an alternative observation composed of the time step index, τ, the system output, Yτ, and the past trajectory of experimental observations and inputs. (In the latter case the agent makes decisions without knowledge of the true system parameters.) The action taken by the agent at each time step is the experimental input, uτ, to be applied for the duration of the time step. The reward provided to the agent during training is the change in the logarithm of the determinant of the Fisher information matrix from the previous time step, rτ = log|Iτ| − log|Iτ−1|. Consequently, over an experiment, the cumulative reward (i.e. the return) is the logarithm of the determinant of the accumulated Fisher information matrix, and so the agent’s optimisation objective is equivalent to maximising the D-optimality score (Section 5.1).
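To make this reward construction concrete, the following minimal sketch (not the authors' code) walks through the bookkeeping for one episode, using random positive semi-definite matrices as placeholders for the model-derived FIM increments of Section 5.1; the return telescopes to the D-optimality score.

```python
import numpy as np

# Sketch of the reward bookkeeping described above: the per-step reward is the change
# in log|I|, so the return over an episode telescopes to log|I_T| (the D-optimality
# score). The FIM increments here are random placeholders, not model-derived.
rng = np.random.default_rng(0)

n_params, n_steps = 3, 10
fim = np.zeros((n_params, n_params))
prev_logdet = None
rewards = []

for tau in range(n_steps):
    A = rng.normal(size=(n_params, n_params))
    delta_fim = A @ A.T                      # placeholder positive semi-definite increment
    fim += delta_fim
    logdet = np.linalg.slogdet(fim)[1]       # log|I| after this interval
    # First step: log|I_0| is undefined (I_0 = 0), so the first reward is log|I_1| itself.
    rewards.append(logdet if prev_logdet is None else logdet - prev_logdet)
    prev_logdet = logdet

print("return  :", sum(rewards))
print("log|I_T|:", np.linalg.slogdet(fim)[1])   # equal up to floating-point error
```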

There is a circularity in this setup: calculating the FIM elements that make up the agent’s observation requires knowledge of the system parameters. One of the key advantages of the reinforcement learning approach is the flexibility to incorporate different experiences during training. In Section 3.7 below, we make use of this flexibility to develop a reinforcement learning strategy that avoids dependence on accurate estimates of the parameter values. In that method, training is carried out on model simulations over a range of parametrisations. In each episode, the agent learns to maximise the FIM-dependent reward, which depends on the parameter values used in the corresponding simulation. The trained agent can then be deployed as an experimental controller, making decisions based solely on experimental observations (assuming no specific knowledge of system parameters). Here, prior knowledge of parameter values is incorporated into the distribution of parameters over which training occurs, which can be of arbitrary shape. It should be noted that this approach is still reliant on the assumption that the model structure is accurate. This robust method of designing experiments for parameter inference is summarised in Fig 1C.

3.2 Chemostat bacterial growth model

The genetic engineering of bacterial strains can have unintended and unpredictable impacts on their growth dynamics, caused, e.g., by increased metabolic burden or by cross talk among regulatory pathways. Consequently, determining the growth characteristics of bacterial strains can be an important task in the field of synthetic biology. A thorough investigation of growth dynamics can be carried out by implementing a range of nutrient conditions in a flow-through bioreactor. (Affordable tabletop bioreactors [28] have become widely available, so this setup is feasible for most synthetic biology labs.) Here, we consider parameter inference for a model of auxotrophic bacterial growth in a chemostat (Fig 1D). (An auxotroph is a bacterial strain whose growth is reliant on a specific nutrient, e.g. a specific amino acid.) We suppose the agent controls the concentration of carbon source, Cin, and auxotrophic nutrient, C0,in, in the inflow.

The bacterial population, N, is the only measured output. (The concentrations C1 and C0 are hidden state variables.) The system equations are:

dN/dt = μ(C0, C1) N − qN
dC0/dt = q(C0,in − C0) − μ(C0, C1) N/γ0
dC1/dt = q(Cin − C1) − μ(C0, C1) N/γ1
μ(C0, C1) = μmax (C1/(Ks + C1)) (C0/(Ks0 + C0))

where μmax is the maximal growth rate, Ks is the half-maximal auxotrophic nutrient concentration, γ1 is the yield on this nutrient, Ks0 is the half-maximal concentration for the carbon source, with corresponding yield γ0, and q is the flow-through rate.

Here we consider experimental design with the goal of inferring parameter values for this model. We suppose that each experiment consists of a sequence of ten two-hour intervals. In each interval the experimental inputs Cin and C0,in are assigned constant values between the minimum and maximum bounds of 0.01 and 1 g L−1. We suppose that the initial conditions are fixed for each experiment as N = 20 × 109 cells L−1 and [C0, C1] = [1, 0] g L−1. Model simulations were performed with parameter values as in Table 1. In previous work we found that the parameters γ0 and γ1 are practically unidentifiable for this model implementation [29]. We thus focus on designing experiments to infer the values of the parameters K0, K1 and μmax.
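As an illustration, the sketch below simulates one such experiment with SciPy, assuming the double-Monod chemostat form written above and the stated protocol (ten two-hour intervals, inputs bounded between 0.01 and 1 g L−1); the parameter values are placeholders rather than the Table 1 values, and this is not the authors' CasADi implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of one simulated experiment: ten 2 h sample-and-hold intervals with constant
# inflow concentrations, assuming the double-Monod rate law sketched above.
theta = dict(mu_max=0.5, Ks=0.1, Ks0=0.1, gamma1=1e10, gamma0=1e10, q=0.3)  # placeholders

def chemostat_rhs(t, x, c_in, c0_in, p):
    N, C0, C1 = x
    mu = p["mu_max"] * (C1 / (p["Ks"] + C1)) * (C0 / (p["Ks0"] + C0))
    dN = mu * N - p["q"] * N
    dC0 = p["q"] * (c0_in - C0) - mu * N / p["gamma0"]
    dC1 = p["q"] * (c_in - C1) - mu * N / p["gamma1"]
    return [dN, dC0, dC1]

x = [20e9, 1.0, 0.0]                                 # N (cells/L), C0, C1 (g/L)
rng = np.random.default_rng(1)
for tau in range(10):                                # ten 2-hour intervals
    c_in, c0_in = rng.uniform(0.01, 1.0, size=2)     # constant inputs for this interval
    sol = solve_ivp(chemostat_rhs, (0.0, 2.0), x, args=(c_in, c0_in, theta), rtol=1e-8)
    x = sol.y[:, -1]
    print(f"step {tau}: inputs=({c_in:.2f}, {c0_in:.2f}) g/L, N={x[0]:.3e} cells/L")
```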

Table 1:

Parameters for the auxotroph system. Nominal parameter values, along with minimum and maximum bounds, used for simulations of a bacterial culture.

3.3 Fitted Q-learning for optimal design of chemostat growth experiments

We begin our analysis of the chemostat model by testing the effectiveness of the Fitted Q (FQ)-learning algorithm [21] for designing optimal experiments. We trained 10 FQ-learning agents over 50,000 episodes, where each episode constitutes a single simulated experiment (see Section 5.2 for details), with model parameters set to their nominal values (Table 1). FQ-learning operates over a discrete action space. As feasible actions, we chose 10 discrete values for each input, equally spaced between the minimum and maximum bounds of 0.01 and 1 g L−1. Inputs to the neural networks were scaled to be between 0 and 1 (Methods Section 5.5). Training took approximately 48 hours on a computing cluster with 40 CPU cores and a GeForce GTX 1080 Ti. In this preliminary analysis we assume prior knowledge of the parameter values, which are used in the training simulations and in the calculation of the FIM that determines the agent’s reward. Likewise, the one step ahead optimiser (OSAO) and model predictive controller (MPC), which are implemented for comparison, optimise with respect to the true parameters (see Section 5.6 for details).

The experimental input profiles and resulting system trajectories for a human-chosen rational design, the OSAO, the MPC, and the best performing FQ-agent are shown in Figure 2A-D. Figure 2E shows the training performance of the ten FQ-agent instances and the equivalent performance (far right) of the optimisers and the rational design. The FQ-agents successfully learn throughout the training process (as shown by the increase in optimality score as training progresses), with an average final optimality score of 16.10 (end-point of the blue average trace). The best FQ-agent performs significantly better than the OSAO, but not as well as the MPC. As shown, there is significant variance in the performance of the FQ-agents.

Figure 2:

Optimal experimental design to infer the values of model parameters for an auxotrophic bacterial strain growing in a chemostat. Control inputs chosen by (A) rational design, (B) one step ahead optimisation, (C) model predictive control, and (D) FQ-learning, along with the corresponding system trajectories. (E) Training progress of ten independent FQ-agents over 50000 episodes, their shared explore rate, and (far-right) the scores of the MPC, OSAO and rational design. The mean FQ-agent return is shown, along with error bars indicating one standard deviation.

To assess the quality of the experimental designs, we compared their performance in generating parameter estimates, as follows. We simulated the model driven by each experimental design and added noise to the corresponding observation outputs. We then determined parameter fits to this simulated data (see Section 5.7 for details). For each method, we calculated the normalised mean squared error (MSE) and the logarithm of the determinant of the covariance matrix of independent parameter estimates. The results are shown in Table 2. The best performing FQ-learning experiment (from Figure 2E) was used for this comparison. These results show that a high optimality score is a good predictor of both a low determinant of the covariance matrix and a low estimated error in the inferred parameter values. As expected from the D-optimality scores, the MPC greatly outperformed the rational design and the OSAO, while the performance of the best FQ-agent falls between these extremes. The results in this section and our preliminary investigations (Fig S1, [29]) show that reinforcement learning has potential for optimal experimental design. Below, we build on this potential by developing improvements to the method.
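For reference, the following sketch shows one way the two comparison metrics could be computed from repeated fits; the fitted values are random placeholders and the normalisation of the MSE (by the true parameter values) is an assumption.

```python
import numpy as np

# Sketch of the two comparison metrics: normalised MSE of the parameter estimates and
# the log-determinant of their covariance. `fits` stands in for the 30 independent fits
# obtained from noisy simulated data; here it is random placeholder data.
rng = np.random.default_rng(2)
theta_true = np.array([1.0, 0.1, 0.2])                       # placeholder "true" parameters
fits = theta_true * (1 + 0.05 * rng.normal(size=(30, 3)))    # placeholder fitted values

norm_mse = np.mean(((fits - theta_true) / theta_true) ** 2)  # error normalised by true values
log_det_cov = np.linalg.slogdet(np.cov(fits, rowvar=False))[1]
print(f"normalised MSE: {norm_mse:.4f}, log|cov|: {log_det_cov:.2f}")
```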

Table 2:

Comparison of the performance of optimal experimental designs in supporting parameter inference for the auxotroph model.

3.4 An agent without access to a priori parameter estimates can learn from observations of past time series

In the following sections we show that reinforcement learning can design experiments without a priori knowledge of the true parameter values. As an initial assessment of such an agent, we consider six different formulations of the agent’s observation and assess each variant’s ability to learn a value function. Reinforcement learning is based on using experience to estimate a value function that maps observation-action pairs to the expected return obtained by taking a given action after making a given observation. A good value function will show low error in predicting returns. In our case the return of a full experiment is equivalent to the D-optimality score. Here, all agents choose from the same set of actions and are rewarded in the same way as in Section 3.3; they differ in the observations that are available to them.

  • Agent Ia observes the time step index, the system measurements, and the elements of the FIM (identical to the FQ-controller in Section 3.3). Agent Ib is identical but does not observe the time step index.

  • Agent IIa observes the current time step index and the system measurements. Agent IIb is identical but does not observe the time step index.

  • Agent IIIa observes the current time step index, the system measurements, and the history of measurements and actions from the beginning of the experiment. Agent IIIb is identical but does not observe the time step index.

Here, formulation I, which makes use of a priori knowledge of the parameter values (via the FIM), acts as a positive control, while formulation II is a negative control that is not expected to have sufficient information to be successful. In each case the comparison between variants (a and b) reveals whether the agents learn to use the dynamics of the system, rather than simply mapping a time step to a predicted return. (The return is time dependent (as demonstrated in Fig S2B) because it is easier to gain further information about the system parameters during the early phases of the experiment, when little or no data has been acquired.)
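The sketch below illustrates how the observation vectors for the three formulations could be assembled at a single time step; the arrays are placeholders, and the layout (including passing all nine FIM elements, consistent with the input size reported in Section 5.2) is only one possible encoding.

```python
import numpy as np

# Placeholder quantities at time step tau (not real simulation output).
rng = np.random.default_rng(1)
tau = 4
y = np.array([0.37])                          # scaled population measurement at step tau
fim = np.eye(3)                               # placeholder 3 x 3 FIM estimate (formulation I)
y_history = rng.random((tau, 1))              # measurements seen so far (formulation III)
u_history = rng.random((tau, 2))              # inputs applied so far (formulation III)

obs_Ia = np.concatenate(([tau], y, fim.ravel()))   # time index, output, 9 FIM elements
obs_IIa = np.concatenate(([tau], y))               # time index, output only
obs_IIIa = {                                       # time index, output, full history
    "current": np.concatenate(([tau], y)),
    "history": np.hstack([y_history, u_history]),
}
# The "b" variants simply omit the time index tau from each observation.
print(obs_Ia.shape, obs_IIa.shape, obs_IIIa["history"].shape)
```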

All agents were evaluated by testing their ability to fit the value of randomly chosen experiments undertaken on chemostat environments over a range of different parametrisations (Methods Section 5.3 for details). This procedure tests each agent’s ability to learn the value function from a random policy applied to a range of model parametrisations. Because the parameter values are different for each experiment, the agent must learn to infer where it is in parameter space to optimally predict the value of a given observation-action pair. The results, in terms of mean square testing error at the end of training, are shown in Table 3. When time information is included in the observation (variant a) all agents show relatively low error, with Agent IIIa outperforming Agent Ia, which does better than Agent IIa. Thus, the elements of the FIM observed by Agent Ia are valuable for learning, but improved performance can be obtained by instead observing the full experimental history. When time is omitted from the observations (variant b) Agent Ib and Agent IIIb have minor increases in error, while Agent IIb is unable to learn a reasonable value function.

Table 3:

Mean square error in predicted returns for six agents with different observation formulations. Agents with access to the time series of previous observations but no a priori knowledge of parameter values (formulation III) are able to accurately learn value estimates using the dynamics of the system. Errors are reported as one standard deviation over five repeats.

The fact that Agent III’s performance was not significantly reduced by the omission of the time index shows that it is able to use the dynamics of the system to make its predictions, consistent with it being able to use the dynamics to infer where it is in parameter space. Finally, in assessing training error relative to the testing error, we found that there was minimal overfitting for all agents as shown in Fig S2A.

In summary, these results show that a reinforcement learning agent can use the observed dynamics of the system to make suitable value predictions in the absence of prior knowledge of the parameter values. We next compared Agent IIIa’s OED performance using the same procedure as in Section 3.3. Note that the reward is still dependent on the FIM, a function of the unknown parameter values. However, calculation of the reward is required only during the training period (when simulation parameters are known); it would not be required when deploying a trained agent on a real system. As with the FQ-agents, this OED comparison shows high variance between training repeats, with the best repeat falling somewhere between the OSAO and the MPC (Fig S3). This is likely due to the large number of action options: two experimental inputs, each with 10 possible values, yield 100 possible actions for which the value function needs to be learned for each observation. Q-learning in large discrete action spaces is a difficult learning problem [30] and we hypothesise that this is why our RL agents show high variance. This is supported by previous work in which we applied the same approach to a different system with a smaller set of just 12 action options and found much lower variation in the resulting FQ-agents [29].

3.5 Transition to continuous action space: The Recurrent T3D algorithm

To improve the RL method’s OED performance, we refine the approach further by transitioning to a continuous action space, thus avoiding the problems associated with Q-learning in large discrete action spaces. The application of Q-learning methods in continuous action spaces is non-trivial and requires an additional neural network which represents the agent’s policy. We developed a recurrent version of the continuous reinforcement learning algorithm Twin Delayed Deep Deterministic Policy Gradient (T3D) [23] (see Methods Section 5.4 for details). We call this novel algorithm Recurrent T3D (RT3D). The internal structure of the agent is shown in Fig 3A. The main components of the agent are the memory, the value network, and the policy network. During each time step, the observation, o, and reward, r, are determined from the environment and stored in memory. The observation also acts as input for the policy network, which chooses the action to apply. This action is also stored in memory. We define a transition as the tuple (o, a, r, o′, d) specifying, respectively, an observation, action, reward, a subsequent observation, and an indicator that is 1 if the episode terminated during this transition and otherwise is 0. Here a is the action taken after observing o, while r and o′ are the reward and observable state of the environment that result from taking action a. The memory contains all of the N transitions observed so far during training. The agent’s value and policy functions are periodically updated by training on the experience stored in the memory (dotted arrows in Fig 3A). The value function is approximated using a deep neural network. A GRU layer, a type of recurrent layer which excels at processing sequences such as time series data, takes as input all system measurements, Y, and experimental inputs, u, seen so far in the current experiment. The output of the GRU is concatenated with the current system measurements, Yτ, the index of the current time step, τ, and the current action, uτ. These are fed into a feed-forward network. The output of the value network is an estimate of the value of the supplied experimental input, uτ, given the sequence of observables. The policy network is structured similarly, but does not take the current experimental input uτ as input. The output of the policy network is the experimental input that is estimated to have the highest value for the given observation.
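A minimal Keras sketch of a value network with this structure is given below; the layer sizes follow Methods Section 5.4, while the input shapes and other details are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of a recurrent value network with the structure described above: a GRU over the
# history of measurements and inputs, concatenated with the current observation and the
# candidate action, followed by a feed-forward network producing a scalar value estimate.
n_meas, n_inputs = 1, 2

history = layers.Input(shape=(None, n_meas + n_inputs))   # (Y, u) pairs seen so far
current = layers.Input(shape=(n_meas + 1,))               # current Y and time step index
action = layers.Input(shape=(n_inputs,))                  # candidate experimental input

h = layers.GRU(64, return_sequences=True)(history)
h = layers.GRU(64)(h)                                     # two stacked GRU layers
x = layers.Concatenate()([h, current, action])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
q_value = layers.Dense(1)(x)                              # estimated value of the action

value_net = tf.keras.Model([history, current, action], q_value)
value_net.summary()
```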

Figure 3:

The recurrent T3D algorithm. (A) Structure of the RT3D agent (left) and the artificial neural networks used to approximate the value function (top-right) and policy (bottom-right). (B) Control inputs chosen by the best performing Recurrent T3D instance and the corresponding system trajectory. (C) Average training progress and shared explore rate of ten Recurrent T3D agents over 17500 episodes and the scores (far-right) of the MPC, OSAO and rational design. Error bars indicate one standard deviation

3.6 Recurrent T3D can design optimal experiments for a bacterial strain growing in a chemostat

We applied the RT3D algorithm to the chemostat model as in Section 3.3. Here the RT3D agent uses the nominal parameter values during training to generate the FIM (to determine the reward), while the OSAO and MPC have access to these parameter values for their optimisation calculations. Ten RT3D agents were trained for 17,500 simulation episodes (about 10-15 hours of computation time, see Methods). The RT3D algorithm chooses experimental inputs continuously between the maximum and minimum bounds. Inputs to the neural networks were scaled to be between 0 and 1 (Methods Section 5.5). The experimental input profiles and resulting system trajectories for the best performing RT3D agent are shown in Fig 3B. Fig 3C shows the training performance of the ten RT3D agents and the equivalent performance of the OSAO, MPC, best FQ-agent and rational designs. The average RT3D optimality score was 19.84. In comparison to the FQ-controllers, the average optimality score has increased to a level comparable to the MPC and the training performance is stable, with only minor differences between each of the 10 independently trained RT3D agents. The MPC reaches a slightly higher optimality score than the mean of the RT3D controllers, but the best RT3D agent performs better than the MPC. As before, we test the performance in terms of accuracy of parameter estimates, summarised in Table 4. As expected, the best performing RT3D agent has outperformed the other designs in terms of parameter covariance and parameter error. Again, we see the correlation of high D-optimality with low parameter error and low parameter covariance. From these results we conclude that the RT3D controllers perform at a similar level to the MPC and that the RT3D algorithm has significantly improved performance over the FQ algorithm, with significantly lower training cost in terms of the amount of data and computation time.

Table 4:

Performance metrics for optimal experimental design on the chemostat growth model. The best performing RT3D agent is reported.

3.7 Recurrent T3D can generate optimal experiments over a parameter distribution

Finally, we demonstrate the use of RT3D to design experiments assuming minimal knowledge of the system parameters. As shown in Section 3.4, the agent is able to infer its position in parameter space using the trajectory of experimental inputs and measurements and can then adapt the experiment to maximise the optimality score. To assess performance, we trained the RT3D controller over a parameter distribution as laid out in Fig 1C. Here, each episode is initialised using a different parameter set sampled from the uniform distribution indicated in Table 1. The training performance in Fig 4A shows that RT3D controllers can successfully learn to optimise the objective over the distribution of parameters.
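Schematically, training over the distribution amounts to drawing a fresh parameter sample at the start of every episode, as in the sketch below; the bounds and the run_episode stub are placeholders, not the Table 1 values or the paper's training code.

```python
import numpy as np

# Sketch of training over a parameter distribution (Fig 1C): each episode is initialised
# with a fresh parameter sample drawn uniformly between bounds.
rng = np.random.default_rng(3)
theta_min = np.array([0.1, 0.01, 0.01])      # placeholder lower bounds
theta_max = np.array([2.0, 1.00, 1.00])      # placeholder upper bounds

def run_episode(theta):
    """Stand-in for one simulated experiment plus the RT3D update step."""
    return rng.normal()                      # placeholder return (D-optimality score)

scores = [run_episode(rng.uniform(theta_min, theta_max)) for _ in range(100)]
print("mean return over sampled parametrisations:", np.mean(scores))
```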

Figure 4:

The RT3D algorithm trained over a parameter distribution. (A) Average training progress and shared explore rate of ten Recurrent T3D agents over 17500 episodes. For each episode the model parameters are sampled from a distribution. This was then averaged across the 10 repeats. Error bars indicate one standard deviation. (B) The optimality score of the best performing agent in panel A compared to an MPC with a priori parameter knowledge for different parameter samples. Four randomly sampled parametrisations, S1, S2, S3 and S4 are shown along with the lower and upper bounds of the parameter space, L and U, the nominal parameters from literature, N, and the centre of the distribution, C. (C) The experimental designs of the RT3D controller at different parameter samples. Here the RL controller is trained on a distribution over parameters and adapts its experimental design for different points in parameter space. For each experimental design the system is initialised with different parameter values: L, U, N and C

We carried out a comparison of the RT3D OED performance, using the best performing agent from Fig 4A. Here, we give the MPC controller an intrinsic advantage: we compared the performance, in terms of the optimality score, of the RT3D controller against an MPC controller that uses a priori knowledge of the sampled parameters. (Recall that the RT3D controller has only the knowledge gained by training against the parameter distribution.) Eight samples were investigated. Four of these (S1, S2, S3, S4) were sampled from the distribution. The remaining four were chosen specifically to reveal the behaviour of the RT3D controller across the parameter distribution. These were (N) the nominal parameter values, (L) the lower bounds, (U) the upper bounds, and (C) the centre of the distribution. The corresponding optimality scores are shown in Fig 4B (the parameter values for each sample can be found in Table S1). For every sample, the RT3D agent performed nearly as well as the MPC that has total system knowledge. From these results, we conclude that, by training over a parameter distribution, the RT3D controller can generate near optimal experiments across the whole distribution.

The experiments designed by the RT3D agent for the L, U, C and N parameter samples, along with the resulting system trajectories, are plotted in Fig 4C. The experimental inputs for the first three intervals are identical. This is expected, because the agent has little information with which to infer the system behaviour at the beginning of the experiment. After this initial stage, the experiments diverge. The differences between the experiments are relatively minor, suggesting that there is a ‘core’ experimental design which works well over the distribution; the agent introduces slight deviations to maximise its effectiveness for different regions of parameter space.

4 Discussion

In this work we demonstrated a novel application of reinforcement learning to the optimal design of biological experiments. The problem was formulated as the maximisation of the determinant of the Fisher information matrix (D-optimal design). First, the efficacy of the approach was tested using the FQ-learning algorithm to design optimal experiments in the unrealistic setting in which the method has prior knowledge of the parameter values we seek to identify. Positive results here indicated that this approach had the potential to design optimal experiments. We then introduced algorithm refinements that focussed on eliminating the dependence on prior knowledge of the parameter values. Dependence on the true parameter values is a limitation of other OED works [9, 10, 17, 18, 19], which require ad hoc verification [9] or other workarounds [17, 18, 19]. To decouple the RL controller from the true parameters, we used a recurrent neural network to enable it to make experimental decisions based on the full history of past measurements and experimental inputs. We showed that this approach produces an agent that can effectively design optimal experiments for systems sampled from a distribution over parameters, performing similarly to an MPC with explicit knowledge of the parameter values.

Our approach can complement existing methods. OED work has often been categorised as either Bayesian or Fisher information based [31, 32]. Fisher information based techniques have been limited to local analysis around a nominal parameter guess. Bayesian approaches allow global optimisation and the incorporation of prior knowledge, but can be computationally expensive. Here we build on previous work that has demonstrated the applicability of Fisher information based methods to non-linear biological systems [9, 10], including online OED of fed batch bioreactors [18, 19]. We have shown the capability to encode prior parameter knowledge and therefore optimise globally with respect to the Fisher information by training an RL controller over samples from a parameter distribution. The same technique could be used in the future to encompass measurement noise and stochasticity in the model dynamics, mirroring some of the statistical capabilities of Bayesian approaches. Bayesian approaches often rely on complex integrals over multiple probability distributions, which, for online experimental design, need to be determined during the experiment using methods such as Monte Carlo simulation [4, 6]. This can be computationally expensive, potentially limiting the speed at which experiments can be done. Previous Bayesian work has been limited to greedily optimising over the next time step only [4] or choosing from a limited set of experimental designs [6]. In the approach outlined here, a trained RL controller can be rapidly queried for inputs and therefore design experiments online. However, the price for this is a potentially lengthy training process which must be completed prior to the experiment. The longest training times in this work for our RT3D algorithm were around 12 hours of simulation time on a computing cluster or around 15 hours on a personal laptop.

Here, we have focussed on D-optimal design by maximising the determinant of the FIM. Alternative optimality criteria could be used, e.g. other Fisher information based metrics, such as minimising the trace of the inverse FIM (A-optimality) or maximising the minimum eigenvalue of the FIM (E-optimality), or other metrics such as maximising the uncertainty of the posterior predictive distribution [6]. We demonstrated the capability to learn over a parameter distribution using a uniform distribution. Future work could evaluate the performance of an RL controller trained over more complex distributions that are, e.g., non-symmetric or multimodal. Because the RL controller learns by sampling the distribution, these alternatives would be straightforward to implement. Indeed, the flexibility to learn over any distribution presents a compelling reason to use reinforcement learning for OED. Additionally, work in this area could address optimisation for model selection, using, e.g., Ds-optimality, in which experimental effort is focussed on a subset of the parameters of a nested model [33], or by introducing a term in the reward function to encourage the divergence of different model predictions [34].

Overall, we have developed and demonstrated the potential for reinforcement learning to be used for OED. As the systems we build and characterise in biology continue to increase in complexity, automated experimental design tools will become ever more important. Furthermore, the generality of this approach means it can be applied in many areas of science and engineering.

5 Materials and methods

5.1 Formulation of the optimal experimental design problem

Optimal experiments will be designed on systems which can be described by a set of non-linear differential equations:

dX/dt = F(X, θ, u),   (1)

where X is a vector of state variables, θ is a vector of parameters and u is a vector of experimental inputs. System measurements are assumed to be of the form Y = Xm + ϵ, where Xm are the measurable state variables and ϵ is a Gaussian noise term. We consider optimal experimental design tasks with the goal of inferring accurate estimates of the system parameters, θ. We define our objective as the determinant of the Fisher information matrix, |I|. This is called a D-optimality criterion. The theory demonstrating that D-optimality corresponds to maximally accurate parameter inferences holds only for linear systems with Gaussian errors, but previous work has shown that this same criterion can be successfully applied to non-linear systems [7, 8, 9, 10]. We follow an established method [9, 10] to obtain I from the system equations (Equation 1). First we obtain time derivatives for the sensitivity of each of the state variables with respect to each of the parameters:

d/dt (∂X/∂θj) = (∂F/∂X)(∂X/∂θj) + ∂F/∂θj.   (2)

The scale of parameters can vary, which can lead to poor conditioning of the Fisher information matrix [31]. To remedy this, the sensitivities are scaled by the parameter values, which is equivalent to using logarithmic sensitivities. The scaled sensitivities are

S̄j = θj (∂X/∂θj).

Writing Equation 2 in terms of the scaled sensitivities yields

dS̄j/dt = (∂F/∂X) S̄j + θj (∂F/∂θj).

We assume that measurement error ϵ is normally distributed with variance equal to 5% of the measured quantity, ϵ ~ 𝒩(0, σ2) with σ2 = 0.05Xm, and we assume measurements are independent. The time derivative of the scaled FIM can be written as [33]

dI/dt = S̄mᵀ Q S̄m,

where Q = diag(1/σ12, …, 1/σm2) is a diagonal matrix, Xmi and σi2 are the i-th measurable state variable and the associated variance, respectively, and S̄m denotes the scaled sensitivities of the measurable state variables. The FIM can then be determined by integration over an experiment, assuming that I(t = 0) = 0:

I(u, θ, 0, tf) = ∫0tf (dI/dt) dt.

For the optimisation objective we use the determinant of the FIM (the D-optimality score). Because this can vary over orders of magnitude, the optimality criterion is taken to be the logarithm of the determinant of the FIM:

log|I(u, θ, 0, tf)|.   (3)

We use the shorthand Iτ = I(u, θ, 0, tf), where tf is the time at the beginning of time step τ. The Python API of the CasADi library was used to solve all differential equations in our analysis [35].
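The sketch below illustrates this construction end-to-end on a toy logistic-growth model (not the paper's chemostat model), integrating the scaled sensitivities and the FIM alongside the state with SciPy rather than CasADi.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustration of the FIM construction above on a toy model dN/dt = r*N*(1 - N/K) with
# theta = (r, K). State vector: [N, sbar_r, sbar_K, I_rr, I_rK, I_KK], where sbar are the
# parameter-scaled sensitivities and I is the (symmetric) scaled FIM accumulated from t = 0.
theta = np.array([0.5, 10.0])          # placeholder (r, K)

def rhs(t, z, r, K):
    N, s_r, s_K = z[0], z[1], z[2]
    dN = r * N * (1 - N / K)
    dFdN = r * (1 - 2 * N / K)
    # Scaled sensitivity ODEs: d(sbar_j)/dt = dF/dN * sbar_j + theta_j * dF/dtheta_j
    ds_r = dFdN * s_r + r * (N * (1 - N / K))
    ds_K = dFdN * s_K + K * (r * N**2 / K**2)
    # Measurement variance sigma^2 = 0.05 * N (as assumed in this section)
    q = 1.0 / (0.05 * N)
    sbar = np.array([s_r, s_K])
    dI = q * np.outer(sbar, sbar)      # dI/dt = sbar^T * (1/sigma^2) * sbar
    return [dN, ds_r, ds_K, dI[0, 0], dI[0, 1], dI[1, 1]]

z0 = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0]    # N(0) = 0.1, zero sensitivities, I(0) = 0
sol = solve_ivp(rhs, (0, 20), z0, args=tuple(theta), rtol=1e-8)
Irr, IrK, IKK = sol.y[3:, -1]
fim = np.array([[Irr, IrK], [IrK, IKK]])
print("D-optimality score log|I| =", np.linalg.slogdet(fim)[1])
```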

5.2 Neural Fitted Q-learning algorithm

In neural FQ-learning [21], a value function is learned that maps observation-action pairs to values, Q(o, a). This value function is represented by a neural network. We define a transition as the tuple (o, a, r, o′, d) specifying, respectively, an observation, an action, a reward, a subsequent observation, and an indicator that is 1 if the episode terminated during this transition and otherwise is 0. Here a is the action taken after observing o, while r and o′ are the reward and the observation that result from taking action a. The state can be continuous, but the action is limited to a discrete set of feasible values. From a sequence of these state transitions, a sequence of Q-learning targets is generated as:

y = r + γ (1 − d) maxa Q(o′, a).

Here the term maxa Q(o′, a), where a is an action that can be taken by the agent, gives an estimate of the total future reward obtained after observing o′. This is weighted by γ, the discount factor, which dictates how heavily the possible future rewards affect decisions. The neural network is trained on the set of inputs and targets generated from all training data seen so far (Algorithm 1). A training iteration takes place after each episode, resulting in Algorithm 2. We used the Adam optimiser [36] to train the neural network because of its ability to dynamically adapt the learning rate, which is favourable when implementing reinforcement learning with a neural network [37].
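A minimal sketch of one round of FQ-iterations on a batch of stored transitions is shown below, using a scikit-learn regressor as a stand-in for the paper's TensorFlow value network; the per-action output layout and all numerical values are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of repeated FQ-iterations on stored transitions (o, a, r, o2, d). Actions are
# indices into a discrete set; the regressor maps an observation to one value per action.
rng = np.random.default_rng(4)
n_trans, obs_dim, n_actions, gamma = 200, 2, 4, 0.99

o = rng.random((n_trans, obs_dim))
a = rng.integers(n_actions, size=n_trans)
r = rng.random(n_trans)
o2 = rng.random((n_trans, obs_dim))
d = rng.integers(2, size=n_trans)

q_net = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500)
q_net.fit(o, rng.random((n_trans, n_actions)))          # arbitrary initialisation fit

for _ in range(5):                                       # repeated FQ-iterations
    q_next = q_net.predict(o2)                           # Q(o', a) for all actions
    y = r + gamma * (1 - d) * q_next.max(axis=1)         # Q-learning targets
    targets = q_net.predict(o)
    targets[np.arange(n_trans), a] = y                   # update only the taken action
    q_net.fit(o, targets)                                # refit to the new targets

print("fitted Q-values for first observation:", q_net.predict(o[:1]))
```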

In this work we use an ϵ-greedy policy in which a random action is chosen with probability ϵ and the action a = argmaxa Q(o, a) is chosen with probability 1 − ϵ. The explore rate was initially set to ϵ = 1 and decayed as Embedded Image where e is the episode number, starting at 0, and A is a constant that dictates the rate of decay. The clip function checks if its first argument is within the interval defined by the second and third arguments and, if it is not, clips it to the nearest edge of the interval. In this work, A was set to the total number of episodes used in a given training simulation divided by 11. This choice ensures that the explore rate reaches 0 before the end of training. This ϵ-greedy strategy is a widely used policy that has been proven effective [12, 11] and is easy to implement. The inputs to the neural network are continuous. The input layer is of size 11 if the elements of the FIM are included (Section 3.3); otherwise it is of size 2, to account for the single measured variable and the time step index (Agents II and III in Section 3.4). The neural network contains two hidden layers with 100 neurons each, all of which use the ReLU activation function. The linear output layer consists of 100 neurons, accounting for the discretisation of the two-dimensional action space into 10 bins along each axis. When a recurrent network is used (Agent III in Section 3.4), two additional GRU layers, each with 64 neurons, were added to process the sequence of previous observations and actions; their output is concatenated to the current observation before being passed to the rest of the network. In this case, the network is structured similarly to the value network in Fig 3A.
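For illustration, ϵ-greedy selection over the 10 × 10 discretised input grid could look like the following sketch; the q_values vector is a placeholder for the network's 100 outputs.

```python
import numpy as np

# Sketch of epsilon-greedy action selection over the 10 x 10 discretised input grid
# (two inflow concentrations, 10 levels each, giving 100 actions).
rng = np.random.default_rng(5)
levels = np.linspace(0.01, 1.0, 10)                              # feasible values per input (g/L)
actions = np.array([(c, c0) for c in levels for c0 in levels])   # all 100 input pairs

def epsilon_greedy(q_values, epsilon):
    if rng.random() < epsilon:
        idx = rng.integers(len(actions))             # explore: random action
    else:
        idx = int(np.argmax(q_values))               # exploit: highest predicted value
    return actions[idx]

q_values = rng.random(100)                           # placeholder network outputs
print("chosen inputs:", epsilon_greedy(q_values, epsilon=0.3))
```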

Algorithm 1

FQ-iteration


Algorithm 2

FQ-learning


Before applying reinforcement learning to the chemostat model, we verified the approach in preliminary applications to two other systems: an empirical growth model described by Monod dynamics (Fig S1) and a mechanistic model of gene transcription [29].

5.3 Value fitting

The process used for evaluating an agent’s ability to learn a value function is illustrated in Fig 5. First, a training set of 10,000 experiments was generated by sampling 10,000 different parametrisations of the chemostat model from a uniform distribution over parameters, with maximal and minimal values as in Table 1. For each parametrisation, 10 random experimental inputs were applied, sampled uniformly from the discrete set of actions. This procedure was repeated to generate an independent testing set. For each experiment the return at each time step was calculated as the sum of the rewards obtained after visiting each observation-action pair through the experiment, Gτ = rτ + rτ+1 + … + r𝒯. This results in two independent datasets consisting of N = 10,000 × 10 data-points, where each data-point is composed of an observation, o, an action, a, and the corresponding return, G. The value function of each agent was fitted to the training data set using repeated FQ-iterations (Algorithm 1), where the Q-learning targets were the returns, G. (Note, this training scenario is distinct from the episode-based reinforcement learning strategy employed in Section 3.3.) This procedure tests an agent’s ability to learn the value function from a random policy applied to a range of model parametrisations. Because the parameter values are different for each experiment, the agent must learn to infer where it is in parameter space to optimally predict the value of a given observation-action pair.
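The return calculation could be implemented as in the sketch below, where the per-step rewards are random placeholders for the log-determinant increments obtained from the simulated experiments.

```python
import numpy as np

# Sketch of the return calculation used for value fitting: for each random experiment, the
# return at step tau is the sum of rewards from tau to the end of the episode.
rng = np.random.default_rng(6)
n_experiments, n_steps = 5, 10

dataset = []
for _ in range(n_experiments):
    rewards = rng.random(n_steps)                     # placeholder per-step rewards
    returns = np.cumsum(rewards[::-1])[::-1]          # G_tau = sum of rewards from tau onward
    for tau in range(n_steps):
        obs, action = (tau, rng.random()), rng.integers(100)   # placeholder (o, a) pair
        dataset.append((obs, action, returns[tau]))

print(len(dataset), "data-points; first:", dataset[0])
```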

Figure 5:

Testing agents that do not have prior access to simulation parameter values. 1) A distribution of parameters was chosen (uniform distribution in this case). 2) Data was generated on simulated chemostat models using a random policy for action selection. 3) Agents were trained to predict the observed returns.

5.4 Twin Delayed Deep Deterministic Policy Gradient

Twin delayed deep deterministic policy gradient (T3D) [23] is an off-policy algorithm for continuous deep reinforcement learning (Algorithm 3). It is based on a previously established algorithm called deep deterministic policy gradient (DDPG) [38], but introduces a few modifications to improve learning stability. The DDPG algorithm is closely related to Q-learning and can be thought of as Q-learning adapted for continuous action spaces. Like deep Q-learning, DDPG uses a neural network to approximate and learn a Q-function Q(o, a) which maps observation-action pairs to a value. In addition, DDPG also learns a policy, a = π(o), which is represented by a second neural network. The policy network maps observations to actions and is trained to choose the action that maximises the value of the observation-action pair according to the value network: π(o) ≈ argmaxa Q(o, a), so that Q(o, π(o)) ≈ maxa Q(o, a).

The DDPG algorithm proceeds as follows. As in FQ-learning, a transition is defined as the tuple (o, a, r, o′, d) specifying, respectively, an observation, action, reward, a subsequent observation and an indicator that is 1 if the episode terminated during this transition and otherwise is 0. As the agent learns, it stores observed state transitions in a replay buffer, 𝒟, which can be thought of as its memory. Two tricks are used to increase the stability of the learning process in DDPG. Firstly, at each update a random sample, ℬ, of past experience is taken from the replay buffer to reduce the temporal correlation in the updates. Secondly, target networks Qtarg and πtarg are used to generate the Q-learning targets. The parameters of these networks are updated slowly towards the parameters of Q and π by Polyak averaging, θtarg = ρθtarg + (1 − ρ)θ. This reduces the dependence of the target on the trained parameters and further increases stability.

Three further additions to DDPG lead to the T3D algorithm. First, the policy network updates are delayed, updating half as frequently as the Q-networks. Second, to address a common failure mode of DDPG in which the policy can exploit incorrect sharp peaks in the Q-function, the target policy is smoothed by adding random noise to the target actions, which effectively regularises the algorithm:

a′ = clip(πtarg(o′) + clip(𝒩(0, σ), −c, c), alow, ahigh),

where c is an upper bound on the absolute value of the noise, alow and ahigh are lower and upper bounds on the target action respectively, and σ is the standard deviation of the noise. Finally, because all Q-learning methods involve maximising over target actions, they are prone to overestimating the Q-function. To reduce this tendency, double Q-learning is implemented in T3D: two Q-functions, Q1 and Q2, are learned and the one that gives the smaller value is used to calculate the Q-learning target. From a sequence of state transitions, ℬ, sampled from the replay buffer, a sequence of Q-learning targets, y, is created according to:

y = r + γ (1 − d) mini=1,2 Qi,targ(o′, a′).

The term (1 − d) mini=1,2 Qi,targ(o′, a′) gives an estimate of the total future reward obtained after entering state o′. The networks Q1 and Q2 are trained on the set of inputs by regressing to the targets with the following losses:

Li = 1/|ℬ| Σ(o,a,r,o′,d)∈ℬ (Qi(o, a) − y)2,  i = 1, 2.

Then, every other update, the policy network is updated by training it to maximise Q1:

maxπ 1/|ℬ| Σo∈ℬ Q1(o, π(o)).

Finally, the target networks are updated by Polyak averaging:

θi,targ = ρθi,targ + (1 − ρ)θi,  ϕtarg = ρϕtarg + (1 − ρ)ϕ,

where θi and ϕ are the parameters of the Q-networks and the policy network, respectively. We use an ϵ-greedy policy adapted to the continuous action space; a random action is chosen uniformly between alow and ahigh with probability ϵ and the action a = clip(π(o) + 𝒩(0, 0.2ϵ), alow, ahigh) is chosen with probability 1 − ϵ. The explore rate ϵ was set to decay exponentially as training progressed. The explore rate was initially set to ϵ = 1 and decayed as Embedded Image where e is the episode number and A is a constant that dictates the rate of decay, equal to the number of episodes divided by eleven. This ensures the explore rate reaches 0 before the end of training. The Adam optimiser [36] was used to train the neural networks because of its ability to dynamically adapt the learning rate, which is favourable when implementing reinforcement learning with a neural network [37].
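The sketch below traces the T3D target construction and the Polyak update with simple NumPy stand-ins for the networks; the hyperparameter values and the linear "critics" are illustrative assumptions, not the trained models.

```python
import numpy as np

# Sketch of the T3D target construction and Polyak averaging described above, with
# NumPy functions standing in for the value/policy networks.
rng = np.random.default_rng(7)
gamma, rho, c, sigma = 0.99, 0.995, 0.5, 0.2       # illustrative hyperparameters
a_low, a_high = 0.01, 1.0

def pi_targ(o):  return np.clip(o.mean(axis=1, keepdims=True), a_low, a_high)  # stand-in policy
def q_targ(o, a, w): return np.hstack([o, a]) @ w                              # stand-in critic

w1, w2 = rng.normal(size=(3, 1)), rng.normal(size=(3, 1))   # parameters of Q1_targ, Q2_targ
o2 = rng.random((64, 2)); r = rng.random((64, 1)); d = rng.integers(2, size=(64, 1))

# Target policy smoothing: add clipped noise to the target action, then clip to bounds.
noise = np.clip(sigma * rng.normal(size=(64, 1)), -c, c)
a2 = np.clip(pi_targ(o2) + noise, a_low, a_high)

# Double Q-learning target: use the smaller of the two target critics.
q_min = np.minimum(q_targ(o2, a2, w1), q_targ(o2, a2, w2))
y = r + gamma * (1 - d) * q_min

# Polyak averaging of (stand-in) target-network parameters towards the trained parameters.
w1_new = rng.normal(size=(3, 1))                 # placeholder freshly trained Q1 parameters
w1 = rho * w1 + (1 - rho) * w1_new
print("first three targets:", y[:3].ravel())
```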

The details of our neural networks are as follows (see Fig 3A). The value network contains a GRU cell composed of two layers, each with 64 neurons. This takes as input all system measurements, Y, and experimental inputs, u, seen so far in the current experiment. The GRU cell is a type of recurrent layer which excels at processing sequences such as time series data. The output of the GRU is concatenated with the current system measurements, Yτ, the current time step index, τ, and the current action, uτ. These are fed into a feed-forward network composed of two hidden layers, each with 128 neurons. The output of the value network is an estimate of the value of the supplied experimental input, uτ, given the sequence of observables. The policy network contains a recurrent GRU cell composed of two layers, each with 64 neurons. This takes as input all measurements and experimental inputs seen so far in the current experiment. The output of the GRU is concatenated with the current observation, composed of the current system measurements and the current time step index, and fed into a feed-forward network composed of two hidden layers, each with 128 neurons. The output of the policy network is the experimental input that is estimated to have the highest value for the given sequence of observations. TensorFlow version 2.4.1 [39] was used to implement all neural networks.

Algorithm 3

T3D


Because training an RL controller takes place over a large number of independent episodes, it is possible to gather experience from multiple episodes in parallel. We took advantage of this by running experimental simulations in parallel using the functionality provided by the CasADi library. This means that all of the computationally demanding aspects of the training process are parallelised. The training of the neural networks using TensorFlow is implemented in parallel and can be run on a GPU or a multicore CPU. The simulation of the experimental system using CasADi is parallelised and can take advantage of a multicore CPU. Experimental simulations were run in batches of 10 parallel simulations, but this number could be increased to take advantage of more computing resources. The average training time for the RT3D algorithm was 11.73 hours, with a standard deviation of 0.91 hours, over 20 total training runs on a computing cluster with 40 CPU cores and a GeForce GTX 1080 Ti. To assess the training time on a smaller scale personal computer, a single RT3D agent was trained on a 13-inch MacBook Pro with a 2 GHz quad-core Intel i5 processor and no GPU; the training time was 15.48 hours.

5.5 Scaling neural network inputs and outputs

To prevent network instability and improve convergence, inputs to neural networks are often scaled [40]. The inputs to the neural networks for all reinforcement learning agents were scaled to be approximately between 0 and 1. This was done by dividing each input by a scalar normalisation constant. At the nominal parameter values, the carrying capacity of the chemostat system is 48 × 109 cells L−1, so using a normalisation coefficient of 50 × 109 ensures that all population measurements are between 0 and 1 before entering the network. This normalisation of the population measurements was used in Section 3.3. The time step index, τ, is known to be between 0 and 10, so a normalisation constant of 10 was used for all sections. The elements of the FIM comprising the agent’s observations in Sections 3.3 and 3.4 present more of a challenge. To find a suitable normalisation constant for each of these, the highest value of each FIM element during the agent’s exploration phase was found during a trial training run, in which the portion of training where the explore rate is equal to 0 was simulated. This value was used to set a normalisation constant so all inputs lie between 0 and 1.

When sampling from a parameter distribution, as in Sections 3.4 and 3.7, normalisation is more challenging. Some regions of parameter space led to instability in the simulation output, and such simulations were discarded. To limit the number of discarded episodes, a square root was applied to each population measurement before it was normalised by 1 × 10⁵ and passed into the neural network. The square root increases the allowable population range while still providing sufficient separation between different population measurements for the agent to learn. (We also attempted a log₁₀ scaling of the population measurements but found that it led to poor performance; we hypothesise that, because the majority of episodes are stable and have populations that vary within a single order of magnitude, the agent was unable to sufficiently distinguish between population measurements under the logarithmic scaling.) The square-root normalisation was used in all RT3D training simulations (Sections 3.6, 3.7) and in Section 3.4. Approximately 2.5% of training episodes were discarded when sampling from a parameter distribution in Sections 3.4 and 3.7. For consistency, the output of the neural networks was also scaled to be between 0 and 1 by dividing the Q-learning targets by 100.
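A minimal sketch of the two input-scaling schemes described in this subsection is given below. It is an assumed implementation: the constants for the population, time step, and Q-learning targets follow the text, while the per-element FIM constants are placeholders that would, in practice, be determined from a trial training run as described above.

```python
import numpy as np

# Normalisation constants taken from the text; FIM_NORM is hypothetical.
POP_NORM = 50e9        # population measurements (nominal-parameter case)
TIME_NORM = 10.0       # time-step index ranges over 0..10
FIM_NORM = np.array([1.0, 1.0, 1.0])  # placeholder per-element FIM constants
Q_TARGET_NORM = 100.0  # Q-learning targets divided by 100

def scale_observation(pop, t, fim_elements):
    """Scaling used when training at the nominal parameter values."""
    return np.concatenate(([pop / POP_NORM, t / TIME_NORM],
                           np.asarray(fim_elements) / FIM_NORM))

def scale_population_sampled(pop):
    """Square-root scaling used when sampling parameters from a distribution."""
    return np.sqrt(pop) / 1e5
```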

5.6 Model predictive control

A model predictive controller (MPC) [41] uses a calibrated model of the system (Equation 1) to predict optimal control inputs. Assuming that the parameter values are known, the MPC integrates the model over a predefined time interval and optimises an objective function with respect to the control inputs u; in this work, the objective function was the optimality score of an experiment (Equation 3). The time horizon over which the MPC optimises is a hyperparameter to be chosen, and we consider two variants. The first optimises over a single time interval; we refer to this as the one-step-ahead optimiser (OSAO). At time step τ the OSAO solves the optimisation problem [equation not rendered]. (Note that the term log|Iτ| has already been determined by the start of this interval and so does not contribute to the optimisation; we include it here to simplify the interpretation of the full MPC optimality criterion below.) The second variant is a controller that optimises over the full time series of N intervals simultaneously; we refer to this as the MPC. At the beginning of the experiment, the MPC uses the model to choose every input that will be applied by solving the optimisation problem [equation not rendered], where 𝒯 is the number of time steps in the experiment. The optimisation problems for both the OSAO and the MPC were solved with the non-linear solver IPOPT [42], called from the CasADi library [35].
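Because the two optimisation problems appear above only as unrendered images, the following is a plausible reconstruction, assuming the objective is the D-optimality score log|I| accumulated over the experiment; the exact expressions in the original may differ.

```latex
% Hedged reconstruction (not verbatim from the paper): the OSAO optimises
% only the next interval, while the MPC optimises all inputs over the full
% experimental horizon.
\begin{align*}
\text{OSAO:}\quad u_\tau^{*} &= \operatorname*{arg\,max}_{u_\tau}\,
    \log\lvert I_{\tau+1}(u_\tau)\rvert,\\
\text{MPC:}\quad (u_1^{*},\dots,u_{\mathcal{T}}^{*}) &=
    \operatorname*{arg\,max}_{u_1,\dots,u_{\mathcal{T}}}\,
    \log\lvert I_{\mathcal{T}}(u_1,\dots,u_{\mathcal{T}})\rvert .
\end{align*}
```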

5.7 Parameter fitting

To confirm that our optimality scores correspond to improved parameter-estimation accuracy, we used simulated experiments to assess the correlation between high D-optimality scores and low covariance in the resulting parameter estimates. To generate simulated data, we simulated the model under the given experimental design and added normally distributed observation error with a variance equal to 5% of the corresponding output. Thirty independent simulated datasets were generated. When prior knowledge of the parameter values was assumed (Sections 3.3 and 3.6), the model was simulated at the nominal parameter values; in Section 3.7, parameters were sampled from the distribution before each simulated experiment. The non-linear solver IPOPT [42] was called from the CasADi library [35] to infer parameter estimates from the simulated data.
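A minimal sketch of this simulate-then-refit loop is shown below. It is not the authors' code: `simulate` is a hypothetical stand-in for the CasADi model simulation, the initial guess for the fit is an assumption, and the fit here uses SciPy's least-squares routine for brevity, whereas the paper solves the estimation problem with IPOPT via CasADi.

```python
# Hypothetical sketch: simulate the model under a design, add Gaussian noise
# with variance equal to 5% of each output, and re-estimate the parameters.
import numpy as np
from scipy.optimize import least_squares

def simulate(params, design):
    """Hypothetical stand-in for the model simulation; returns model outputs."""
    raise NotImplementedError

def fit_once(true_params, design, rng):
    y_clean = simulate(true_params, design)
    # Observation noise with variance equal to 5% of the corresponding output.
    y_noisy = y_clean + rng.normal(0.0, np.sqrt(0.05 * np.abs(y_clean)))
    residuals = lambda p: simulate(p, design) - y_noisy
    # Initial guess perturbed from the true values (an assumption).
    return least_squares(residuals, x0=np.asarray(true_params) * 1.1).x

rng = np.random.default_rng(0)
# Thirty independent simulated datasets, as in the text:
# estimates = [fit_once(nominal_params, design, rng) for _ in range(30)]
```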

Footnotes

  • * christopher.barnes@ucl.ac.uk

  • https://github.com/ucl-cssb/RED

  • https://doi.org/10.5281/zenodo.6521194

References

  1. [1] Kamil Erguler and Michael PH Stumpf. Practical limits for reverse engineering of dynamical systems: a statistical analysis of sensitivity and parameter inferability in systems biology models. Molecular BioSystems, 7(5):1593–1602, 2011.
  2. [2] Ryan N Gutenkunst, Joshua J Waterfall, Fergal P Casey, Kevin S Brown, Christopher R Myers, and James P Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10):e189, 2007.
  3. [3] Gabriele Lillacci and Mustafa Khammash. Parameter estimation and model selection in computational biology. PLoS Computational Biology, 6(3):e1000696, 2010.
  4. [4] Edouard Pauwels, Christian Lajaunie, and Jean-Philippe Vert. A Bayesian active learning strategy for sequential experimental design in systems biology. BMC Systems Biology, 8(1):1–11, 2014.
  5. [5] Juliane Liepe, Paul Kirk, Sarah Filippi, Tina Toni, Chris P Barnes, and Michael PH Stumpf. A framework for parameter estimation and model selection from experimental data in systems biology using approximate Bayesian computation. Nature Protocols, 9(2):439–456, 2014.
  6. [6] Juliane Liepe, Sarah Filippi, Michal Komorowski, and Michael PH Stumpf. Maximizing the information content of experiments in systems biology. PLoS Computational Biology, 9(1):e1002888, 2013.
  7. [7] Samuel Bandara, Johannes P Schlöder, Roland Eils, Hans Georg Bock, and Tobias Meyer. Optimal experimental design for parameter estimation of a cell signaling model. PLoS Computational Biology, 5(11):e1000558, 2009.
  8. [8] Jakob Ruess, Andreas Milias-Argeitis, and John Lygeros. Designing experiments to understand the variability in biochemical reaction networks. Journal of The Royal Society Interface, 10(88):20130588, 2013.
  9. [9] Nathan Braniff, Matthew Scott, and Brian Ingalls. Component characterization in a growth-dependent physiological context: optimal experimental design. Processes, 7(1):52, 2019.
  10. [10] Nathan Braniff, Addison Richards, and Brian Ingalls. Optimal experimental design for a bistable gene regulatory network. IFAC-PapersOnLine, 52(26):255–261, 2019.
  11. [11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  12. [12] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI, pages 2140–2146, 2017.
  13. [13] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  14. [14] Zhenpeng Zhou, Xiaocheng Li, and Richard N Zare. Optimizing chemical reactions with deep reinforcement learning. ACS Central Science, 2017.
  15. [15] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
  16. [16] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, et al. A graph placement methodology for fast chip design. Nature, 594(7862):207–212, 2021.
  17. [17] Tilman Barz, Diana C López Cárdenas, Harvey Arellano-Garcia, and Günter Wozny. Experimental evaluation of an approach to online redesign of experiments for parameter determination. AIChE Journal, 59(6):1981–1995, 2013.
  18. [18] MN Cruz Bournazou, T Barz, DB Nickel, DC Lopez Cárdenas, F Glauche, A Knepper, and P Neubauer. Online optimal experimental re-design in robotic parallel fed-batch cultivation facilities. Biotechnology and Bioengineering, 114(3):610–619, 2017.
  19. [19] David Benjamin Nickel, Mariano Nicolas Cruz-Bournazou, Terrance Wilms, Peter Neubauer, and Andreas Knepper. Online bioprocess data generation, analysis, and optimization for parallel fed-batch fermentations in milliliter scale. Technical report, Wiley Online Library, 2017.
  20. [20] Dries Telen, Dominique Vercammen, Filip Logist, and Jan Van Impe. Robustifying optimal experiment design for nonlinear, dynamic (bio)chemical systems. Computers & Chemical Engineering, 71:415–425, 2014.
  21. [21] Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
  22. [22] Neythen J Treloar, Alex JH Fedorec, Brian Ingalls, and Chris P Barnes. Deep reinforcement learning for the control of microbial co-cultures in bioreactors. PLoS Computational Biology, 16(4):e1007783, 2020.
  23. [23] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
  24. [24] Anthony Atkinson, Alexander Donev, and Randall Tobias. Optimum Experimental Designs, with SAS. Oxford University Press, 2007.
  25. [25] Masayuki Seto and Martin Alexander. Effect of bacterial density and substrate concentration on yield coefficients. Applied and Environmental Microbiology, 50(5):1132–1136, 1985.
  26. [26] JD Owens and JD Legan. Determination of the Monod substrate saturation constant for microbial growth. FEMS Microbiology Reviews, 3(4):419–432, 1987.
  27. [27] Robert A Cox. Quantitative relationships for specific growth rates and macromolecular compositions of Mycobacterium tuberculosis, Streptomyces coelicolor A3(2) and Escherichia coli B/r: an integrative theoretical approach. Microbiology, 150(5):1413–1426, 2004.
  28. [28] Harrison Steel, Robert Habgood, Ciarán Kelly, and Antonis Papachristodoulou. Chi.Bio: An open-source automated experimental platform for biological science research. bioRxiv, page 796516, 2019.
  29. [29] Neythen J Treloar. Towards the implementation of distributed systems in synthetic biology. PhD thesis, UCL (University College London), 2022.
  30. [30] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
  31. [31] Ankush Chakrabarty, Gregery T Buzzard, and Ann E Rundell. Model-based design of experiments for cellular processes. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 5(2):181–203, 2013.
  32. [32] Nathan Braniff and Brian Ingalls. New opportunities for optimal design of dynamic experiments in systems and synthetic biology. Current Opinion in Systems Biology, 9:42–48, 2018.
  33. [33] Anthony Atkinson, Alexander Donev, and Randall Tobias. Optimum Experimental Designs, with SAS, volume 34. Oxford University Press, 2007.
  34. [34] Lucia Bandiera, David Gomez-Cabeza, James Gilman, Eva Balsa-Canto, and Filippo Menolascina. Optimally designed model selection for synthetic biology. ACS Synthetic Biology, 9(11):3134–3144, 2020.
  35. [35] Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11(1):1–36, 2019.
  36. [36] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  37. [37] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint, 2015.
  38. [38] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  39. [39] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  40. [40] Xing Wan. Influence of feature scaling on convergence of gradient iterative algorithm. In Journal of Physics: Conference Series, volume 1213, page 032021. IOP Publishing, 2019.
  41. [41] Eduardo F Camacho and Carlos Bordons Alba. Model Predictive Control. Springer Science & Business Media, 2013.
  42. [42] Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.