Playing it safe: information constrains collective betting strategies

Every interaction of a living organism with its environment involves the placement of a bet. Armed with partial knowledge about a stochastic world, the organism must decide its next step or near-term strategy, an act that implicitly or explicitly involves the assumption of a model of the world. Better information about environmental statistics can improve the bet quality, but in practice resources for information gathering are always limited. We argue that theories of optimal inference dictate that “complex” models are harder to infer with bounded information and lead to larger prediction errors. Thus, we propose a principle of playing it safe where, given finite information gathering capacity, biological systems should be biased towards simpler models of the world, and thereby to less risky betting strategies. In the framework of Bayesian inference, we show that there is an optimally safe adaptation strategy determined by the Bayesian prior. We then demonstrate that, in the context of stochastic phenotypic switching by bacteria, implementation of our principle of “playing it safe” increases fitness (population growth rate) of the bacterial collective. We suggest that the principle applies broadly to problems of adaptation, learning and evolution, and illuminates the types of environments in which organisms are able to thrive.


INTRODUCTION
Risk is an inherent part of life. Whether in pathogen detection [1], phenotype selection [2,3], biochemical and evolutionary mechanisms [4,5], foraging and exploration-exploitation strategies [6][7][8][9], or reproduction [10,11], biological functions are partly shaped by the need to reduce risk. Broadly, risk arises from uncertain interactions of an organism with the world around it [12]. First, both the environment and the typical sensory apparatus are intrinsically noisy. Thus an organism only has probabilistic information about the state of the world. Second, any finite system can only observe some aspects of the world, while others, perhaps most, remain hidden from observation. Thus, an organism inherently functions with partial information, bounding its capacity for rational decision-making. In other words, living systems must necessarily play betting games, producing responses, e.g. expressing a phenotype, detecting an odor, mounting an immune response, that lead to outcomes that will probably be beneficial given the limited and uncertain information available for decision-making.
Every betting game has an associated risk. To improve the betting strategy, it is vital to know the odds of the game. An organism can mitigate the risk of its strategy, and minimize the likelihood of wrong bets, by gathering information to assess the odds and adapting to the learned statistics of the environment. Such learning and adaptation is particularly important if rare, but potentially catastrophic, events can occur.
Living systems can adapt in this way to the environment over many timescales, over the course of evolution, in response to environmental cues, and through ongoing life-long learning. Given the stochastic nature of the environment from the perspective of an agent inhabiting it, essentially all adaptation processes can be viewed as schemes for inferring a probability distribution [13,14]. The ability to infer this distribution is limited by physical and temporal constraints of information gathering. In turn, this limits the capacity for adaptation, and thus increases risk for the system. There are several probabilistic inference frameworks that explicitly or implicitly account for the limitations of information gathering processes guiding model choice [15][16][17][18][19][20][21][22]. Here we argue that a key idea arising from these frameworks is that "complex" models, in a sense defined in information geometry [22,23], are difficult to infer robustly from limited data, and that a small inaccuracy in inference of such models can lead to large deviations from the optimal strategy.
Because of this, if information-gathering is limited, it can pay to bias inference towards simple models. Thus, we propose a principle of "playing it safe", which dictates that given finite information, biological systems should effectively bias themselves towards simpler models. We derive this principle for a large class of probabilistic models using Bayesian probability theory and information geometry, and show how the bias towards simplicity can be tuned by the choice of prior. We then illustrate the efficacy of the principle in the classic example of Kelly betting [24,25], and demonstrate a biological realization in the phenomenon of stochastic phenotypic switching by bacteria. We conclude with a discussion of the implications of our finding for the types of environments that are "learnable", and in which living systems can prosper.

Optimising betting success is hard for complex models
Betting games are unpredictable by nature. However, bet placement can be optimised by knowing the probabilities controlling events in the game, which often requires estimation of a probability distribution P(x; θ) determined by d real parameters θ = {θ_1, . . . , θ_d}. In real-world settings, the information required to infer parameters is typically limited. For example, we may have N independent observations n = {x_1, . . . , x_N} of a random event which can be used for the inference of parameters. However, finite data leads to statistical fluctuations in the inference process, and we need to understand how these affect betting success.
Success in betting games can be quantified through a loss function. In many games the loss is a function of the game's true probability distribution P_θ* and an inferred distribution Q_φ adopted by the bettor. The true distribution can lie outside the manifold parameterized by θ within which the inference is performed, but we will nevertheless use the notation θ* for the data-generating distribution, because we will often consider cases where this distribution lies within the considered model family. We expect the loss to have a global minimum when φ = θ* if the truth lies within the model family under consideration, since exact knowledge of the game's odds allows the bettor to place bets optimally and thus minimise the loss. If the truth lies outside the model family, the loss should be minimized when φ comes "as close as possible" in terms of the loss function. Here, we adopt the Kullback-Leibler (KL) divergence to measure the difference between the true probabilities and the adopted model. The closer the adopted model is to the truth, the smaller the loss. The KL divergence is the loss function of the well-known Kelly betting setup. An important property of the KL divergence is that for many natural parametrizations of distributions it diverges on the boundaries of the parameter space. Changing the parameter by a small amount in such regions leads to large changes in the KL divergence. A way to understand the origin of this behaviour is through model complexity. Consider the Jeffreys prior,

w(θ) ∝ √det I(θ),

where I(θ) is the Fisher information metric (FIM), obtained by Taylor expansion from the KL divergence between models in the parametric family [26,27] (Materials and Methods). In information geometry, this quantity is understood as the density of distinguishable models, and is regarded as a local measure of complexity in a parametric model family [22,27].
In regions where Jeffreys prior is small, i.e., where the model family is simple, changing a parameter by a small amount does not move the distribution much in model space [22]. However, Jeffreys prior can become large and even diverge. In such regions, i.e., where the model family is complex, even models which are close in parameter space are far from each other in model space. In many natural parametrizations, divergences occur at the boundaries of parameter space (see Figure 1 and Eq. (25) in Materials and Methods for the multinomial distribution, and Appendix, Section A for the Poisson and Gaussian distributions).
The multinomial distribution in d = 2 variables is described by a single parameter, typically the event probability θ. The corresponding Jeffreys prior in Figure 1A takes the form w(θ) = 1/(π √(θ(1 − θ))). Equivalently, w(θ)² δθ² measures, up to a constant factor, the KL divergence between multinomial models parametrized by θ and θ + δθ. We see that the density of distinguishable models has a minimum at θ = 1/2 and diverges at the boundaries in these particular coordinates. The inset in Figure 1A shows that to keep the KL divergence D_KL(P_θ || P_φ) between two nearby multinomial models below some value ε, the parameter φ ≡ θ + δθ must lie closer to θ when θ itself lies closer to the boundary of the one-dimensional parameter space at 0. By symmetry, the same holds near the other boundary at 1. Thus, we see that in this case, when the true model is close to a boundary of parameter space, the parameter φ must be inferred with higher precision than when the true model lies near the centre at 1/2, in order to achieve the same degree of statistical proximity.
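This boundary effect is easy to verify numerically. The sketch below (Python; the helper names are ours, not from any published code) compares the KL cost of a fixed parameter offset δθ near the centre and near the boundary of the Bernoulli (d = 2 multinomial) parameter space:

```python
import math

def kl_bernoulli(p, q):
    """D_KL(P_p || P_q) for two Bernoulli (d = 2 multinomial) models."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def jeffreys_density(theta):
    """Jeffreys prior w(theta) = 1 / (pi * sqrt(theta * (1 - theta)))."""
    return 1.0 / (math.pi * math.sqrt(theta * (1 - theta)))

# The density of distinguishable models is minimal at theta = 1/2
# and grows towards the boundary of parameter space.
assert jeffreys_density(0.05) > jeffreys_density(0.5)

# The same offset delta costs several times more KL divergence near the
# boundary, so the parameter must be inferred there with higher precision.
delta = 0.02
kl_center = kl_bernoulli(0.5, 0.5 + delta)
kl_edge = kl_bernoulli(0.05, 0.05 + delta)
assert kl_edge > 2 * kl_center
```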
It is hard to optimise the KL divergence in regions of parameter space that are complex in the sense described above. For example, we may decide to rely on the maximum-likelihood estimate of parameters θ̂_ML and define the expected loss as

⟨L_θ*(θ̂_ML)⟩ = ∫ dθ̂_ML p(θ̂_ML; θ*, N) D_KL(P_θ* || P_θ̂_ML),

where p is the distribution of maximum-likelihood values under the true distribution θ* given N observations. In the limit N → ∞, this distribution becomes sharply peaked around θ*. However, for finite N, p has finite width and the maximum-likelihood estimate θ̂_ML is subject to statistical fluctuations due to finite sampling, thus limiting the precision with which the parameters can be determined. In Appendix, Section B we show for the Bernoulli and Poisson models that the KL divergence between the true model and the expected maximum-likelihood model plus/minus the standard error diverges as the truth approaches the boundary, even if the standard error on the maximum-likelihood estimate decreases to zero with increasing amounts of data. In addition, apart from fluctuations due to finite sampling, in real-world data the observations may be corrupted, effectively introducing an error floor in the precision with which parameters can be inferred.

We seek a model with the following properties. In the large-data limit (large N), the optimal model has to converge to the maximum-likelihood estimate θ̂_ML, which itself converges to the true model θ*. In the opposite limit of small data (small N), we should select a model that is less affected by statistical fluctuations. We will parametrize this in terms of a deterministic bias θ_bias and show below how to select it. For finite data size, the optimal model should represent a compromise between the maximum-likelihood model and the bias. Overall, we make the following Ansatz for the model:

θ(θ_bias, θ̂_ML) = κ θ_bias + (1 − κ) θ̂_ML,   (4)

where the parameter κ ∈ [0, 1] is used to interpolate between the bias and the maximum-likelihood term. In the next section we show that this Ansatz appears naturally in the inference of a large class of probability distributions (exponential families with a conjugate prior).
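As a minimal sketch (Python, with illustrative names), the Ansatz is just a convex combination, recovering maximum likelihood when κ = 0 and the pure bias when κ = 1:

```python
def ansatz(theta_bias, theta_ml, kappa):
    """Interpolated model: kappa * theta_bias + (1 - kappa) * theta_ML."""
    assert 0.0 <= kappa <= 1.0
    return kappa * theta_bias + (1.0 - kappa) * theta_ml

# Limiting cases: kappa -> 0 trusts the data, kappa -> 1 ignores it.
assert ansatz(0.5, 0.1, 0.0) == 0.1
assert ansatz(0.5, 0.1, 1.0) == 0.5
# In between, the model is pulled from the ML estimate towards the bias.
assert ansatz(0.5, 0.1, 0.25) == 0.2
```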
A priori, the values of the parameters κ and θ_bias can be chosen freely, as long as θ lies in the parameter space of the probabilistic model. We determine the optimal choice of these parameters through minimisation of the expected loss ⟨L_θ*(θ)⟩_θ̂_ML. There are different ways to optimise with respect to these parameters. For instance, we may keep κ fixed and optimise for the bias, or vice versa. These lead to different optimal models, and which way we pick depends on the setup of the inference problem we want to solve. We focus on the former approach to optimisation and later discuss the second option in an example. Thus, we optimise the expected loss with respect to the bias. To evaluate this condition, it is convenient to write it as a gradient of the expected loss:

∇_θ_bias ⟨L_θ*(θ)⟩_θ̂_ML = 0.   (6)

The gradient field is defined over parameter space and we are looking for optimal points where the gradient vanishes. As the minima of gradient fields are invariant under coordinate changes, the solution of the optimization problem is invariant to changes in parameterization of the underlying probabilistic model. However, next we also show that for a large class of probability distributions, there exists a particular parameterization that allows us to understand in detail how the optimum arises from an interplay of the amount of data available about the true model and the geometry of model space. This interplay constrains the regions of parameter space that yield safe betting models, and we thus refer to it as the "playing it safe" principle.
To make analytic progress, we specialise to exponential families, a large class of probability distributions that includes many well-known distributions such as the multinomial, Poisson, and Gaussian distributions [28,29]. An exponential family is a collection of probability densities that can be written in the canonical form

P(x; η) = exp(η · t(x) − F(η) + k(x)),   (7)

where η is the canonical parameter associated with the sufficient statistic t(x), F is the log-partition function, and k(x) is referred to as the auxiliary carrier term. Alternatively, we can use the mean parameters θ, which are related to the canonical parameters by a one-to-one map (see Materials and Methods). The mean parameters often also have a simple relation to commonly used parameterizations of well-known distributions. Examples are the event probabilities in the multinomial distribution, the average number of counts in the Poisson distribution, and the variance of the Gaussian distribution with zero mean. We give examples of these relations below and in Appendix, Sections D and E. The KL divergence between two densities P_η* and P_η of the same exponential family can be written as a Bregman divergence [30,31] between the associated mean parameters:

D_KL(P_η* || P_η) = B_F*(θ*, θ) = F*(θ*) − F*(θ) − ∇F*(θ) · (θ* − θ),   (8)

where F* is the dual of the log-partition function F (see Materials and Methods). Bregman divergences form a large class of statistical distances, including many commonly known measures, such as the squared Euclidean distance, the Mahalanobis distance, or the Itakura-Saito distance. We note that in our context, the particular choice of the probabilistic model P_η implies a particular Bregman divergence through the dual function F*. In Appendix, Section C1 we show that for exponential families in mean parameterisation, the optimality condition Eq. (6) takes the simple form

⟨I(θ) (θ* − θ)⟩_θ̂_ML = 0,   (9)

which holds even for differences δθ ≡ θ* − θ of finite size. In general, the condition provides d non-linear constraints on the parameters κ and θ_bias such that the expected loss is optimised.
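For the Bernoulli family, F*(θ) = θ log θ + (1 − θ) log(1 − θ), and the identity between the KL divergence and the Bregman divergence of F* can be verified directly (a numerical sketch, not the paper's code):

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def f_star(theta):
    """Dual of the Bernoulli log-partition function (negative entropy)."""
    return theta * math.log(theta) + (1 - theta) * math.log(1 - theta)

def bregman_f_star(a, b):
    """B_{F*}(a, b) = F*(a) - F*(b) - F*'(b) * (a - b)."""
    grad = math.log(b / (1 - b))  # derivative of F* at b
    return f_star(a) - f_star(b) - grad * (a - b)

# KL divergence equals the Bregman divergence between mean parameters.
for a, b in [(0.2, 0.6), (0.05, 0.5), (0.7, 0.3)]:
    assert abs(kl_bernoulli(a, b) - bregman_f_star(a, b)) < 1e-12
```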
We give an example illustrating the derivation of this condition for the Bernoulli model in Appendix, Section C2. We now illustrate that this condition, which holds for the mean parameterisation, implies a bias towards models of lower complexity in the sense described in the previous section. For this we focus on the case when the exponential family depends on a single parameter, and write out the condition in Eq. (9) explicitly using the definitions of δθ and θ in Eq. (4):

⟨I(θ) (θ* − κ θ_bias − (1 − κ) θ̂_ML)⟩_θ̂_ML = 0.

To illustrate the implications of this condition, we assume an infinite parameter space with a single boundary. For example, the Poisson distribution with parameter λ ∈ (0, ∞) has such a parameter space. We consider the scenario shown in Figure 2, where the Fisher information metric I diverges to one side of the true model and falls off to small values to the other side. We then analyse the large and small data limits as illustrated in Figure 2, and optimise θ_bias while κ is fixed to a finite value.

[Figure 2: Diverging Fisher information at the boundary leads to an asymmetric complexity weight I(θ) around the true model θ*, illustrated for a one-dimensional parameter space with a single boundary. The black and grey curves are the distributions of the empirically observed maximum-likelihood value θ̂_ML for small and large N, respectively. For small N, the distribution is wide and models to the left of the true model θ* (dashed red line) receive a higher complexity weight when evaluating ⟨I(θ)(θ* − θ)⟩_θ̂_ML (left-hand side of Eq. (9)) than models to the right of the true model.]

Large N: The distribution of the maximum-likelihood value becomes sharply peaked around θ*, and in order to make the bracket inside the expectation value vanish, we require θ_bias → θ*. We note that the local complexity I can be arbitrarily large in this limit, as it is multiplied by the bracket with value zero. Empirical parameters θ̂_ML for which θ* − θ̂_ML is not zero are improbable and multiplied by very small probability when taking the expectation value. Thus the entire expectation value will vanish, as required.

Small N: The distribution of the ML parameter has finite spread around the true model θ*. For simplicity of argument, suppose that the distribution is symmetric around θ*. To understand the optimization, we go through the possible cases for choosing the bias:

• θ_bias = θ*: The expectation value becomes nonzero (either positive or negative), because the ML values on one side of the true model θ* receive a higher complexity weight I(θ) than the ML values on the opposite side.
• θ_bias on the side of θ* with diverging complexity: The expectation value assumes even larger (negative or positive) values.
• θ_bias on the side of θ* with diminishing complexity: For the right magnitude of θ_bias, the expectation value is adjusted to zero, because the models θ(θ_bias, θ̂_ML) are biased to receive a smaller complexity weight. This demonstrates the "playing it safe" principle for this case.
This argument in the one-dimensional case immediately generalises to higher-dimensional parameter spaces. Equation (9) provides an independent condition for each of the d dimensions of parameter space, and the components of the bias θ_bias can thus be adjusted individually to satisfy all conditions.
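The pull of the bias away from a boundary can be seen in a small exact computation. Assuming a Bernoulli truth θ* = 0.1 near the boundary at 0, N = 10 observations, and fixed κ = 0.3 (all values illustrative), we average the KL loss of the interpolated model over every Binomial outcome and scan θ_bias:

```python
import math
from math import comb

def kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def expected_loss(theta_bias, theta_true=0.1, n=10, kappa=0.3):
    """Exact expectation of the KL loss over the Binomial distribution of ML estimates."""
    total = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * theta_true**k * (1 - theta_true)**(n - k)
        theta = kappa * theta_bias + (1 - kappa) * (k / n)  # Ansatz Eq. (4)
        total += prob * kl(theta_true, theta)
    return total

# Scan the bias: the optimum does NOT sit at theta* = 0.1 but is pushed
# towards the interior, where the model family is simpler.
grid = [i / 1000 for i in range(1, 1000)]
best_bias = min(grid, key=expected_loss)
assert 0.1 < best_bias < 0.35
```

The optimal bias exceeds the true parameter, away from the nearby boundary: the numerical counterpart of the "playing it safe" principle.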

Optimal Bayesian inference for exponential families
We now show how the "playing it safe" principle introduced in the previous section, and in particular the Ansatz Eq. (4), arises in Bayesian inference of exponential families with conjugate priors, for which we can provide an analytical treatment. In short, the information contained in the prior determines the parameter κ and the bias θ_bias in Eq. (4). We will show that by tuning the conjugate prior through its hyperparameters, we can control the strength of the bias relative to the maximum-likelihood term.
We now specialise to exponential families Eq. (7) in canonical coordinates η. If the true model η* is unknown, we can at best use the N independent observations n = {x_1, . . . , x_N} to determine a Bayesian posterior distribution for η,

P(η|n) ∝ P(n|η) P(η),

where P(η) is a prior distribution, and P(n|η) is the likelihood of seeing data n given the model parameterized by η. Given the posterior, the goal is to minimise the posterior expected loss

⟨L(θ)⟩ = ∫ dη P(η|n) D_KL(P_η || P_θ) = ∫ dθ′ P(θ′|n) B_F*(θ′, θ),

where in the second equality we have used relation Eq. (8) and transformed variables from canonical to mean parameters, also in the posterior density and integral. We want to determine the minimum value of this expression with respect to θ,

min_θ ⟨L(θ)⟩.

This optimal value is also known as the Bregman information. In [32] it was shown that the conditional optimal parameter θ is given by the posterior expectation value

θ = E(θ′|n).

Thus, the optimal parameter depends both on the data and the choice of prior. Our next goal is to understand how the prior choice, in relation to the available data size, affects the optimal parameter. Conjugate priors for exponential families allow us to make general analytic statements in this respect. We denote the conjugate prior of the exponential family by P(η; α), where α is a set of hyperparameters. In [33] it was shown for exponential families that under the conjugate prior the posterior mean is given by

θ = κ (α/N_0) + (1 − κ) θ̂_ML,   (16)

where N_0 is a function of the hyperparameters α, and κ ≡ N_0/(N_0 + N). From this expression we can understand the role played by the hyperparameters. N_0(α) is a threshold number of samples, determined by the hyperparameters, that sets the scale for whether the actual sample size N is large enough for the evidence to overcome the prior. Likewise, α/N_0 functions as an effective additional parameter that appears on an equal footing with the maximum-likelihood parameter. In fact, the quotient α/N_0 is the prior expectation of the parameter θ, and we denote it by θ_prior ≡ E(θ|α) = α/N_0.
Overall, this leaves us with the following form of the conditional optimal parameter:

θ = κ θ_prior + (1 − κ) θ̂_ML,   (17)

which shows that θ_prior takes the role of the bias θ_bias, such that overall this confirms our initial Ansatz.
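For the Bernoulli likelihood with its conjugate Beta(α_1, α_2) prior, this decomposition can be checked in a few lines (a sketch with illustrative numbers, where N_0 = α_1 + α_2 and θ_prior = α_1/N_0):

```python
from math import isclose

def posterior_mean(k, n, alpha1, alpha2):
    """Posterior mean of theta for k successes in n trials under a Beta(alpha1, alpha2) prior."""
    return (alpha1 + k) / (alpha1 + alpha2 + n)

alpha1, alpha2, n, k = 2.0, 3.0, 20, 7
n0 = alpha1 + alpha2            # prior "sample size" N0
kappa = n0 / (n0 + n)           # weight of the prior, Eq. (16)
theta_prior = alpha1 / n0       # prior expectation of theta
theta_ml = k / n                # maximum-likelihood estimate

# The posterior mean is exactly the convex form kappa*theta_prior + (1 - kappa)*theta_ML.
assert isclose(posterior_mean(k, n, alpha1, alpha2),
               kappa * theta_prior + (1 - kappa) * theta_ml)
```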
In the above derivation, the conjugate prior P(η; α) was defined on the canonical parameters and led to the linear convex form of Eq. (17). However, in the next section and in Appendix, Sections D and E, we show for the multinomial, Poisson, and Gaussian models that defining the conjugate prior on the event probabilities, the mean number of counts, and the variance, respectively, equally results in the optimal parameter having convex linear form. We emphasise that for the development of our theory it is not important on which set of parameters the conjugate prior is defined. What counts for our argument is that the final expression for the conditional optimal parameter has the linear convex form.
The conditional optimal parameter in convex linear form depends on the data and, crucially, also on the shape of the prior, set by its hyperparameters. While for a given prior the loss is minimised by the conditional optimal model, we may further optimise the expected loss with respect to the prior choice, which we do by optimising the hyperparameters. Given the convex linear form in Eq. (17), we see that there are two natural ways to optimise the hyperparameters. The first is to make a choice for θ_prior and then optimise the value of κ (or, equivalently, N_0). The second is to set the value of κ and then optimise θ_prior. We now illustrate both paths of optimisation for the multinomial distribution.
Optimization for the multinomial model

The multinomial model is the probability distribution of N events, each assuming one of d possible outcomes (Materials and Methods). The true model is parameterised by the d-dimensional set of probabilities θ*. We note that these probabilities are related to the mean parameters of the multinomial distribution by rescaling the mean parameters with a factor of 1/N. For simplicity of notation, and to emphasize their close relation, we use for the probabilities the same symbol θ that we used above for the mean parameters. It is natural to collect information on the model by counting event outcomes and computing empirical probabilities θ̂. The data n consists of the counts of occurrence x_i for each of the d possible outcomes over N observations (with Σ_i x_i = N). Following our earlier discussion, we use Bayesian inference with a conjugate prior and optimise the conditional model to minimise the expected loss. The likelihood function is the multinomial distribution, and its conjugate prior is a Dirichlet distribution, given by

P(θ; α) ∝ ∏_{i=1}^d θ_i^{α_i − 1}.

The set of d hyperparameters α_i > 0 determines the prior shape. Direct calculation yields, as expected, the linear convex form Eq. (17) of the conditional optimal model,

θ_i = κ (α_i/N_0) + (1 − κ) θ̂_i,  with N_0 ≡ Σ_j α_j and κ = N_0/(N_0 + N),   (20)

where θ̂_i = x_i/N are the empirical probabilities. The expression in Eq. (20) is also known as a pseudo-Laplacian law [15,34,35], with the α_i being "pseudocounts". We now specialise to d = 2, i.e. two possible event outcomes. The conditional optimal model depends on two parameters, α_1 and α_2, from which we define

N_0 = α_1 + α_2  and  θ_bias = α_1/N_0.   (21)

The next step is to optimize these hyperparameters in the two possible ways outlined in the previous section. First, we consider optimization of the bias θ_bias, while κ is fixed. Since θ_2 = 1 − θ_1, we are dealing with optimization in one dimension. For simplicity of notation, we suppress component indices in the following. The landscape of the KL loss is shown in Figure 3A for θ* = 0.2 and κ = 0.2.
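In code, the conditional optimal multinomial model of Eq. (20) is just pseudo-count smoothing (a sketch; the counts and pseudo-counts are illustrative):

```python
def conditional_optimal_model(counts, alphas):
    """Eq. (20): theta_i = (alpha_i + x_i) / (N0 + N), with N0 = sum(alphas), N = sum(counts)."""
    n = sum(counts)
    n0 = sum(alphas)
    return [(a + x) / (n0 + n) for a, x in zip(alphas, counts)]

# d = 3 outcomes, uniform pseudo-counts alpha_i = 1.
probs = conditional_optimal_model([7, 3, 0], [1.0, 1.0, 1.0])
assert abs(sum(probs) - 1.0) < 1e-12
# The unseen outcome is pulled off the boundary: it keeps nonzero probability.
assert probs[2] > 0.0
```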
The curve of the optimal bias choice and the optimized parameter θ_opt (averaged over empirical observations) as a function of N is shown in white. Asymptotically, the curve approaches θ* for both parameters. In Figure 3B, we show how the local complexity of the model with optimal parameter compares to the complexity of the true model, as a function of N and for different values of the true model. Asymptotically, the complexity ratio approaches unity, but the closer the true model lies to the boundary, the more data is required to reach the full complexity of the true model. We also observe in Figs. 3A and B that for small N there is an initial dip in the curves. At first sight this may seem to be in tension with the "playing it safe" principle. However, it is explained and resolved by the existence of the second boundary of the parameter space of the d = 2 multinomial model at θ = 1. For small N, the distribution of maximum-likelihood values is so broad that the diverging complexity at the second boundary influences the optimization, pushing the parameter away from that boundary.
Next, we consider optimization of the parameter κ, while θ_bias is held fixed. We make the choice α_1 = α_2, which by Eq. (21) implies θ_bias = 1/2. This is an appealing choice, as it implies equal prior knowledge in all directions of parameter space. The landscape of the KL loss is shown in Figure 3C with θ* = 0.2. The curve of the optimal choice for κ and the optimized parameter θ_opt as a function of N is shown in white. Asymptotically, κ approaches zero and the optimal model approaches θ*. In Figure 3D, we show for different true models that the complexity ratio approaches unity in the asymptotic limit, as expected, but the closer the true model lies to the boundary, the more data is required.
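The N-dependence of the optimal κ can be reproduced with a small exact scan (Python sketch; θ* = 0.2, θ_bias = 1/2, and the grid resolution are our illustrative choices):

```python
import math
from math import comb

def kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def expected_loss(kappa, n, theta_true=0.2, theta_bias=0.5):
    """Expected KL loss of theta = kappa*theta_bias + (1-kappa)*theta_ML over Binomial outcomes."""
    total = 0.0
    for k in range(n + 1):
        prob = comb(n, k) * theta_true**k * (1 - theta_true)**(n - k)
        theta = kappa * theta_bias + (1 - kappa) * (k / n)
        total += prob * kl(theta_true, theta)
    return total

grid = [i / 500 for i in range(1, 500)]  # kappa in (0, 1)
kappa_small_n = min(grid, key=lambda kp: expected_loss(kp, n=5))
kappa_large_n = min(grid, key=lambda kp: expected_loss(kp, n=200))

# Less data calls for a safer (larger) kappa; asymptotically kappa -> 0.
assert kappa_small_n > kappa_large_n
```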
We thus see that either way of optimising implies "playing it safe" as strategy. In (Appendix, Sections D and E), we demonstrate the optimization of the bias for the Poisson model with unknown average number of counts, and the Gaussian model with unknown variance parameter.
Playing it safe in bacterial phenotypic switching

Betting strategies in biology are often found in collective systems where bets can be distributed (or hedged) across the possible outcomes of random events. There are many cases of bet-hedging systems in biology [10,36,37]; a particularly well-known example is stochastic phenotypic switching in bacterial populations [2,38]. In the simplest setup, the population lives in an environment which can assume one of two different states: a nutrient-rich state, allowing the bacterial population to grow, or an antibiotic, hazardous state that kills off bacteria. Each bacterium switches stochastically, and independently from the others, between a growth and a resistant phenotype. The growth phenotype prospers in the growth state of the environment, but is instantly killed if the environment is in the antibiotic state. Conversely, the resistant phenotype does not grow in the growth environment, but survives antibiotic treatment. It has been shown that in order to maximise population growth, the bacteria should perform Kelly betting (also known as proportional betting) [2,24].
According to Kelly's theory, the bacteria should stochastically select each phenotype with a probability that is proportional to the probability of occurrence of the corresponding environmental state. We consider the case where the bacteria only have imprecise information about environmental probabilities, show how the "playing it safe" principle implies safer betting strategies in this case, and show how these choices translate into optimisation of growth of the bacterial population. In our toy model of stochastic phenotypic switching, successive discrete environmental states are chosen independently according to a Bernoulli model, with probability θ* for the nutrient state and probability 1 − θ* for the antibiotic state. The bacteria can maximise population growth by betting according to Kelly's theory, namely by matching their probabilistic model to the environment.
In Kelly betting any mismatch is penalised, and the loss is captured by the KL divergence between the true probabilistic model P_θ* and the model P_θ assumed by the bacterial population (the bettor). Instead of the loss, it is also instructive to consider the closely related growth-rate function derived by Kelly [24,25]:

G_θ*(θ) = [log 2 − H(θ*)] − D_KL(P_θ* || P_θ),

where H(·) is the entropy. This function captures the long-term population growth rate, i.e., the growth rate after a long succession of environmental states. The term in brackets is the maximal achievable growth rate, from which the KL loss is subtracted. In Figure 4A we show the long-term growth rate G_θ* of Kelly betting. The growth rate has a single maximum when the assumed model is equal to the truth, θ = θ*, for which the loss is zero. The further away the adopted model lies from the truth, the smaller the growth rate; it can even become negative if the assumed model lies too far from the truth. We will see (Figures 4A,B,C and S7) that there is a region of safe model choices. This region is given by models θ for which the long-term population growth rate is non-negative. Models in this region are safe because, while they might not be optimal, they guarantee that there are no long-term losses.
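For the two-state setup, the growth-rate function and its safe region can be evaluated directly (a Python sketch; the numerical values are illustrative):

```python
import math

def entropy(p):
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def growth_rate(theta_true, theta):
    """G_theta*(theta) = [log 2 - H(theta*)] - D_KL(P_theta* || P_theta)."""
    return (math.log(2) - entropy(theta_true)) - kl(theta_true, theta)

theta_true = 0.2
# The growth rate is maximal (and the loss zero) at theta = theta*.
assert growth_rate(theta_true, theta_true) > growth_rate(theta_true, 0.35)
# Far enough from the truth the growth rate turns negative: long-term losses.
assert growth_rate(theta_true, 0.65) < 0.0
# The safe region: all models theta with non-negative growth rate.
safe = [t / 100 for t in range(1, 100) if growth_rate(theta_true, t / 100) >= 0.0]
assert min(safe) < theta_true < max(safe)
```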
We now assess, for Kelly betting, how the amount of information available about the true model and the prior choice affect the magnitude of fluctuations in the assumed model. The empirical observations are subject to statistical fluctuations and as discussed above translate to the conditional optimal model. To quantify the magnitude of fluctuations, we consider the standard error of the empirical estimateθ. For N observations the standard error is given by σ N = θ * (1 − θ * )/N . The fluctuations of the empirical estimate, determine the fluctuations of the conditional optimal model. The magnitude of fluctuations of the conditional optimal strategy Eq. (17) within one standard deviation, is given by the bounds θ ± ≡ θ(θ ± ML ; N 0 , θ bias ), whereθ ± ML = θ * ± σ N . We will examine a toy model of bacterial phenotypic switching, in which environmental states are chosen and . The safe region (gray) constitutes models θ for which the growth rate function G θ * (θ) is non-negative. For larger values of N0, the standard error of the conditional optimal model θ is pulled into the safe region, such that a smaller loss in the growth rate is expected.
phenotypes expressed simultaneously at discrete time steps. The bacterial population then performs collective betting. We know that with perfect information about the environmental probabilities the optimal strategy would be proportional betting: matching the probability of expressing the growth phenotype to the probability of occurrence of the growth state of the environment. However, the bacteria do not know the probability of environmental states. We will suppose that instead they keep an implicit memory of the previous N environmental states, which they use to compute empirical probabilities.
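The shrinkage of the empirical estimate towards the prior bias in the conditional optimal strategy (Eq. (17)), and the resulting narrowing of the fluctuation bounds θ_±, can be sketched as follows (an illustrative calculation with θ* = 0.8 and N = 5; the helper names are ours):

```python
import math

def conditional_optimal(theta_ml, n, n0, theta_bias):
    """Posterior-mean model: the empirical estimate shrunk toward the prior bias."""
    return (n * theta_ml + n0 * theta_bias) / (n + n0)

def fluctuation_bounds(theta_true, n, n0, theta_bias):
    """One-standard-error bounds theta_± of the conditional optimal model."""
    sigma_n = math.sqrt(theta_true * (1 - theta_true) / n)
    return tuple(conditional_optimal(theta_true + s * sigma_n, n, n0, theta_bias)
                 for s in (-1, +1))

theta_true, n = 0.8, 5
for n0 in (0, 2, 8):  # n0 = 0 is maximum-likelihood estimation
    lo, hi = fluctuation_bounds(theta_true, n, n0, theta_bias=0.5)
    print(f"N0={n0}: theta_- = {lo:.3f}, theta_+ = {hi:.3f}, width = {hi - lo:.3f}")
```

Larger N_0 both narrows the interval [θ_−, θ_+] (by the factor N/(N + N_0)) and pulls its centre towards θ_bias = 1/2.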
To illustrate the effect of safe learning, we compare the growth of three types of bacterial colonies, distinguished only by their learning strategy, as they respond to a sequence of environmental states which includes a single antibiotic episode. In Figure 4B we show how the size of each population develops over time. All bacteria have a memory over the past N = 5 environmental states. In the extreme case of maximum-likelihood estimation (N_0 = 0, dark blue curve), the population is killed completely by the antibiotic state. In contrast, the safer the learning strategy (larger N_0), the smaller the loss a bacterial population suffers due to antibiotic treatment. After the antibiotic treatment, bacteria with the safest strategy recover their numbers most quickly (yellow curve), but they are eventually overtaken by bacteria with less safe learning strategies (teal curve). The observed trade-off between the ability to survive antibiotic treatment and to achieve high growth rates is the reason for the existence of an optimally safe learning strategy. The precise choice of optimal strategy depends on the statistics of the environment, as well as potentially on other factors, such as the future horizon over which safe growth needs to be guaranteed.
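The qualitative behaviour described above can be reproduced in a minimal simulation of the toy model, assuming Kelly-style doubling payoffs, a memory of N = 5 past states, and a single antibiotic episode. All names and parameter values here are illustrative and are not those used to produce Figure 4B:

```python
import math

MEMORY = 5  # implicit memory over past environmental states

def theta_opt(history, n0, theta_bias=0.5):
    """Conditional optimal model: empirical estimate shrunk toward the bias."""
    return (sum(history) + n0 * theta_bias) / (len(history) + n0)

def simulate(n0, env):
    """Log population size of a colony that collects a doubled payoff on the
    fraction of individuals expressing the phenotype matching the realised state."""
    log_pop, history = 0.0, [1] * MEMORY   # start remembering only growth states
    for s in env:
        b = theta_opt(history, n0)          # fraction expressing growth phenotype
        b_s = b if s == 1 else 1 - b        # fraction betting on the realised state
        if b_s == 0:
            return -math.inf                # colony wiped out
        log_pop += math.log(2 * b_s)
        history = history[1:] + [s]
    return log_pop

# A mostly-growth environment with a single antibiotic episode (s = 0).
env = [1] * 30 + [0] + [1] * 30
for n0 in (0, 1, 4):
    print(f"N0={n0}: final log-population = {simulate(n0, env)}")
```

As in the figure, the maximum-likelihood colony (N_0 = 0) bets everything on growth and is annihilated by the antibiotic state, while safer colonies survive it at the cost of a lower growth rate between episodes.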
In environments which fluctuate more, the optimal learning strategy will need to be safer, because the probabilities inferred from a finite memory of past events will have a larger variance. However, with a maximally safe learning strategy (θ = 0.5), the bacterial population cannot achieve any long-term growth, but only keeps its population size constant, as shown in Figure 4A. In turn, this means that long-term physical growth is only possible in environments which are not too random. In fact, the more suitable an environment is for life to prosper, the closer it is to being deterministic. Near-deterministic environments are thus livable but, as we have shown, rare yet potentially catastrophic events require organisms to "play it safe" to survive even in such worlds.
Finally, in Figure 4C we contrast different prior choices to show how they affect model safety, and plot the standard error bounds of the conditional optimal model θ_± as a function of the number of observations N. In this example we fix θ_bias = 1/2 which, by Eq. (21), implies N_0 = 2α. We observe that a large value of N_0 reduces the magnitude of statistical fluctuations and biases the conditional optimal model θ_opt towards a probability of 1/2, i.e. minimal complexity. Overall, the inferred model is pulled into the safe growth region, and larger values of N_0 thus result in a safer learning strategy. From (Appendix, Section F and Figure S6) we see that this effect is more pronounced the closer the true model lies to the boundary of parameter space. Notably, maximum-likelihood estimation (N_0 = 2α = 0) is the least safe strategy. For a large number of observations N, the fluctuations diminish and the inferred model converges to the true model. However, safer learning strategies (larger N_0) converge to the true model more slowly than less safe strategies. Furthermore, in the limit of maximum safety q = N_0/N → ∞, the optimal model θ_opt → 1/2, for which the growth rate vanishes. This illustrates the trade-off between safe learning and growth maximisation.

DISCUSSION

We have proposed that biological systems acting with bounded information should "play it safe" by biasing themselves towards models of the environment that are less complex, in the sense of being easier to infer despite statistical fluctuations in limited observed data. We derived this result analytically for exponential families, a very general class of probability distributions, including, e.g., the widely employed multinomial, Poisson, and Gaussian distributions.
In the context of collective betting, our results imply that optimal adaptation involves a balance between risk created by uncertainty in inference of the latent distribution, and making full use of the potential for growth or reward.
We illustrated these ideas in the example of bacteria in an environment which mostly supports growth, but occasionally manifests an antibiotic. In this situation, it pays to maintain diversity between high-growth and low-growth but resistant phenotypes, as a bet-hedging mechanism [2,38]. Indeed, wild-type Escherichia coli show a resistant fraction of ∼10^−6–10^−5 [39]. We showed how this sort of strategy arises in terms of "playing it safe" in an uncertain world. It will be interesting in the future to apply our analysis to other situations where populations of organisms must make decisions with limited information, for example in bacterial chemotaxis [40].

We also cast our analysis in the language of Bayesian inference, and noted that the bias towards simpler models can be implemented by an appropriate choice of prior. The optimal prior depends on knowledge of the true model, or rather its distribution in cases where the generative process in the world itself changes stochastically. It would be interesting to determine the sorts of evolutionary dynamics that would allow systems to adjust their priors to "play it safe" across a given distribution of generative processes.
Another interesting question is to understand how our considerations operate if we use the information-maximizing discrete priors discussed in [41,42]. We instead used the continuous conjugate priors because it seemed natural that a biological agent might have ways of tuning over a continuous family of choices, e.g., by adjusting concentration levels of a protein, or activity levels of a circuit. By contrast, it is harder to imagine biological methods for selecting over a discrete set of values. Nevertheless, it is important and interesting to consider how our "playing it safe" principle might interact with the discrete priors of [41,42], because the authors of these articles are precisely concerned with the limitations imposed on inference by having a finite amount of data. Indeed, their discrete priors depend on the number of data points N, and are designed to maximize the amount of information gained about the parameters from each observation. Perhaps there is some way for an organism to incorporate an understanding of the amount of data it is able to accumulate from the past, to effectively discretize its prior on the parameters of the world. It would be useful to understand whether and how some variant of our "playing it safe" principle could be realized in that context.
Finally, in a time-varying world, inferences are only useful up to some future time horizon. This restricted future utility should limit the value and precision required of the optimal model, while also increasing the imperative for "safety" in light of possible future changes in the world. We can expect trade-offs associated with keeping past memory to play a role in such situations [43,44]. Perhaps this observation also has a bearing on the widespread observation that Occam's Razor, understood as a preference for simplicity or parsimony, seems to be an organizing principle for mental functions [45][46][47][48][49][50], and that biases towards simple models are helpful in many statistical settings that require inference and prediction [22,41,[51][52][53].

Information geometry basics
We define the FIM by Taylor expansion of the KL divergence between two parametric distributions P_θ and P_φ. For two models which are close in parameter space, φ ≡ θ* − δθ, the Taylor expansion reads

D_KL(P_θ* || P_θ*−δθ) = (1/2) δθᵀ I(θ*) δθ + O(δθ³),

and the FIM I is given by the coefficient of the quadratic term,

I_ij(θ) = E_θ[∂_θi log P_θ(x) ∂_θj log P_θ(x)].

As an important example we consider the multinomial distribution, where the parameters θ_i are probabilities and the mean is Nθ. For the multinomial distribution Jeffreys prior takes the form of a Dirichlet distribution,

p_J(θ) = Dir(θ; 1/2, …, 1/2) ∝ ∏_i θ_i^{−1/2}.

From this form we see that Jeffreys prior has a minimum at θ_i = 1/d, but diverges when any of the parameters tends to zero, θ_i → 0. Analogously, in (Appendix, Section A) we show for the Poisson and Gaussian distributions, respectively, that Jeffreys prior diverges on the boundaries of the standard parameterisation of these distributions.
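For the one-parameter (Bernoulli) case, the defining property of the FIM as the curvature of the KL divergence, and the boundary divergence of Jeffreys prior, can be checked numerically (a sketch; function names are ours):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_info(theta, eps=1e-5):
    """FIM extracted from the quadratic term of the KL Taylor expansion;
    averaging +/- eps cancels the cubic term."""
    return (kl_bernoulli(theta, theta - eps) + kl_bernoulli(theta, theta + eps)) / eps**2

def jeffreys_unnormalised(theta):
    """Jeffreys prior density, proportional to sqrt(I(theta)) in one dimension."""
    return math.sqrt(1.0 / (theta * (1 - theta)))

# The KL curvature reproduces the analytic FIM, I(theta) = 1/(theta(1-theta)).
print(fisher_info(0.3), 1 / (0.3 * 0.7))
# The prior is minimal at theta = 1/2 and grows toward the boundaries.
print(jeffreys_unnormalised(0.5), jeffreys_unnormalised(0.01))
```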

Exponential families
An exponential family consists of a collection of densities

P_η(x) = exp[ηᵀ t(x) + k(x) − F(η)],

where η = [η_1 η_2 … η_d]ᵀ is the canonical parameter associated with the sufficient statistic t(x), k(x) is the auxiliary carrier term, and F is the log-partition function

F(η) = log ∫ exp[ηᵀ t(x) + k(x)] dν(x),

where ν(x) is the Lebesgue or counting measure. The log-partition function is convex. If the sufficient statistic is a collection of d functions, the canonical parameter η takes values in the convex set

Ω = {η ∈ R^d : F(η) < ∞}.

We consider exponential families where Ω is an open set, in which case the exponential family is called regular.
Since the log-partition function is convex, the set Ω is also convex. We restrict ourselves to minimal exponential families, which are defined by having no linear constraints amongst the parameters, nor amongst the components of the sufficient statistic. Three examples appear in the table below. An alternative set of parameters is given by the mean parameters θ, defined as the expectation of the sufficient statistic,

θ ≡ E_η[t(x)] = ∇F(η),

where the last relation can be verified by direct calculation. The distribution in mean parameterisation is obtained by inverting this defining equation of the mean parameter and substituting for η in the canonical form of the exponential family. The analogue of the log-partition function F for the mean parameters is the dual function of F, defined by the Legendre-Fenchel transform

F*(θ) = sup_η {ηᵀθ − F(η)}.

The dual function F* can thus be obtained by explicit substitution for η in terms of θ, after inverting the defining equation of the mean parameter. Finally, we note a relation between F* and the negative entropy −H(P_η(θ)) of the distribution:

−H(P_η(θ)) = ∫ P_η(θ) log P_η(θ) dν(x) = η(θ)ᵀθ − F(η(θ)) + E[k(x)] = F*(θ) + E[k(x)],

where we have used the definition of the mean parameters and the defining relation of the dual function. For distributions which afford a vanishing auxiliary carrier term k(x), the dual function is thus given by the negative entropy. This is for example the case for the Gaussian distribution.
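These relations are easy to verify for a concrete family. The sketch below uses the Bernoulli family, P_η(x) = exp(ηx − F(η)) with F(η) = log(1 + e^η) and vanishing carrier term, and checks that θ = ∇F(η) and that F* equals the negative entropy (function names are ours):

```python
import math

def F(eta):
    """Log-partition function of the Bernoulli family P(x) = exp(eta*x - F(eta))."""
    return math.log(1 + math.exp(eta))

def mean_param(eta, eps=1e-6):
    """Mean parameter theta = dF/deta, computed by a central difference."""
    return (F(eta + eps) - F(eta - eps)) / (2 * eps)

def F_star(theta):
    """Legendre-Fenchel dual of F, evaluated at the optimising eta."""
    eta = math.log(theta / (1 - theta))  # invert theta = sigmoid(eta)
    return eta * theta - F(eta)

theta = mean_param(0.7)
neg_entropy = theta * math.log(theta) + (1 - theta) * math.log(1 - theta)
print(theta, F_star(theta), neg_entropy)
```

The numerical derivative reproduces the sigmoid θ = e^η/(1 + e^η), and F*(θ) coincides with −H(P_η(θ)) since k(x) = 0 for this family.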

ACKNOWLEDGMENTS
This work was supported in part by the Simons Foundation MMLS Grant 400425 and by NIH grant R01EB026945. VB thanks the Galileo Galilei Institute, where he was a Simons Visiting Scientist, and the Aspen Center for Physics (supported by NSF grant PHY-1607611), for hospitality as this work was completed.
Appendix A: Jeffreys priors

For each distribution we Taylor expand the KL divergence,

D_KL(P_θ || P_θ−δθ) = (1/2) δθᵀ I(θ) δθ + O(δθ³),

where I is the FIM, given by

I_ij(θ) = E_θ[∂_θi log P_θ(x) ∂_θj log P_θ(x)].

The normalised Jeffreys prior is computed from the FIM through the relation

p_J(θ) = √det I(θ) / ∫ dθ' √det I(θ'). (A3)

Multinomial distribution
The KL divergence for discrete distributions is defined as

D_KL(P || Q) = Σ_x P(x) log [P(x)/Q(x)].

We consider two multinomial distributions with parameters θ and θ̃, for which

D_KL(P_θ || P_θ̃) = N Σ_i θ_i log(θ_i/θ̃_i),

where in the last step we have evaluated the mean of the multinomial distribution, Nθ_i. The next step is to Taylor expand the KL divergence around θ, by setting θ̃ = θ − δθ. For the Taylor expansion we find

D_KL(P_θ || P_θ−δθ) = N Σ_i δθ_i + (N/2) Σ_i δθ_i²/θ_i + O(δθ³) = (N/2) Σ_i δθ_i²/θ_i + O(δθ³),

where in the last step the linear term vanishes, since Σ_i θ_i = Σ_i θ̃_i = 1 and θ̃_i = θ_i − δθ_i. From the quadratic term we read off the form of the FIM,

I_ij(θ) = N δ_ij/θ_i,

where δ_ij is the Kronecker delta. Taking the square root of the determinant and computing the normalisation factor finally gives us Jeffreys prior,

p_J(θ) ∝ ∏_i θ_i^{−1/2}, i.e., p_J(θ) = Dir(θ; 1/2, …, 1/2),

where Dir is the Dirichlet distribution.

Poisson distribution
We consider two Poisson distributions with parameters λ and λ̃:

P_λ(n) = λⁿ e^{−λ}/n!, P_λ̃(n) = λ̃ⁿ e^{−λ̃}/n!.

The KL divergence between the two distributions is given by

D_KL(P_λ || P_λ̃) = λ̃ − λ + λ log(λ/λ̃).

The next step is to Taylor expand the KL divergence around λ, by setting λ̃ = λ + δλ. For the Taylor expansion we find

D_KL(P_λ || P_λ+δλ) = δλ²/(2λ) + O(δλ³).

From the quadratic term we read off the form of the FIM,

I(λ) = 1/λ.

Taking the square root gives us Jeffreys prior,

p_J(λ) ∝ λ^{−1/2}.

Jeffreys prior is shown in Figure S1. It diverges on the boundary of parameter space for λ → 0 and approaches zero for λ → ∞.

One-dimensional Gaussian
The KL divergence for continuous distributions is defined as

D_KL(P || Q) = ∫ dx p(x) log [p(x)/q(x)].

The KL divergence between two Gaussian distributions with the same mean µ and variances σ² and σ̃² is given by

D_KL(P_σ || P_σ̃) = log(σ̃/σ) + σ²/(2σ̃²) − 1/2.

The next step is to Taylor expand the KL divergence around σ, by setting σ̃ = σ + δσ. For the Taylor expansion we find

D_KL(P_σ || P_σ+δσ) = δσ²/σ² + O(δσ³).

From this we find the FIM,

I(σ) = 2/σ²,

and Jeffreys prior

p_J(σ) ∝ 1/σ.

Jeffreys prior for the Gaussian is shown in Figure S2. It diverges on the boundary of parameter space as σ → 0 and approaches zero for σ → ∞.
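The FIM expressions derived above for the Poisson and Gaussian families can be confirmed numerically from the curvature of the KL divergence (a sketch with illustrative parameter values; names are ours):

```python
import math

def kl_poisson(lam_t, lam):
    """D_KL between Poisson(lam_t) and Poisson(lam)."""
    return lam - lam_t + lam_t * math.log(lam_t / lam)

def kl_gauss_var(s_t, s):
    """D_KL between N(mu, s_t^2) and N(mu, s^2) with the same mean."""
    return math.log(s / s_t) + s_t**2 / (2 * s**2) - 0.5

def curvature(kl, x, eps=1e-4):
    """FIM from the quadratic term of the KL Taylor expansion;
    averaging +/- eps cancels the cubic term."""
    return (kl(x, x - eps) + kl(x, x + eps)) / eps**2

lam, sigma = 3.0, 1.5
print(curvature(kl_poisson, lam), 1 / lam)            # expect I(lambda) = 1/lambda
print(curvature(kl_gauss_var, sigma), 2 / sigma**2)   # expect I(sigma) = 2/sigma^2
```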

Appendix B: Standard error and diverging KL divergence
Consider the KL divergence for the Bernoulli model,

D_KL(P_θ* || P_θ) = H_θ*(θ) − H_θ*(θ*),

where H_x(y) is the cross-entropy function

H_x(y) = −x log y − (1 − x) log(1 − y).

For θ → 0, 1, the KL divergence diverges to plus infinity. What is the behaviour of the KL divergence when we try to estimate the value of θ using data from many Bernoulli trials generated by the true model θ*? In particular, how does the KL divergence behave when θ* lies very close to a boundary of parameter space?
To answer this question we consider the empirically estimated parameter θ̂, which is subject to statistical fluctuations due to finite data size. In general, the standard error for a parameter estimated from N independent trials is given by

σ_N = σ/√N,

where σ is the standard deviation of the distribution we are sampling from. Within the standard error, the estimated parameter lies within the bounds

θ̂_± = θ* ± σ_N.

For the Bernoulli model, the empirical estimate of the probability parameter is obtained as the mean number of events observed in N independent Bernoulli trials, and σ is the standard deviation of the Bernoulli distribution, σ = √(θ*(1 − θ*)). The statistically expected bounds are

θ̂_± = θ* ± √(θ*(1 − θ*)/N).

Let us now focus on the boundary at θ = 0 and recall that the KL divergence diverges at this boundary. One way to check whether we can expect the KL divergence to diverge for a given model θ* and number of observations N is therefore to check whether the boundary θ = 0 lies within the standard error bound θ̂_−. From the last equation we see that the boundary is included if the condition

θ* − √(θ*(1 − θ*)/N) ≤ 0

is satisfied. Rearranging this inequality gives

θ* ≤ 1/(N + 1).

Thus, for true models within 1/(N + 1) of the boundary, the point where the KL divergence diverges lies within the standard error, and statistically we can expect the KL divergence to diverge. See Figure S3 for plots.

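The condition θ* ≤ 1/(N + 1) can be checked directly (a minimal sketch; the function names are ours):

```python
import math

def lower_bound(theta_true, n):
    """One-standard-error lower bound of the empirical Bernoulli estimate."""
    return theta_true - math.sqrt(theta_true * (1 - theta_true) / n)

def expect_divergence(theta_true, n):
    """True when the boundary theta = 0 lies within one standard error,
    i.e. when a divergent KL estimate is statistically expected."""
    return lower_bound(theta_true, n) <= 0

# The analytic condition says this happens iff theta_true <= 1/(N+1).
n = 10  # 1/(N+1) ~ 0.0909
print(expect_divergence(0.09, n), expect_divergence(0.10, n))
```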
For the Poisson model, the empirical estimate of the parameter λ is obtained as the mean number of events observed in N independent Poisson trials, and σ is the standard deviation of the Poisson distribution, σ = √λ*. The statistically expected bounds are

λ̂_± = λ* ± √(λ*/N).

The boundary of parameter space lies at λ = 0. We therefore consider λ̂_− as the important bound, and the condition for the boundary of parameter space to be included is

λ* − √(λ*/N) ≤ 0,

or

λ* ≤ 1/N.

Therefore, for any given N we can find models λ* which lie close enough to the boundary according to the above condition, and for which we should expect the KL divergence to diverge.

Appendix C: Optimality condition in mean parameterisation

In mean parameterisation, the KL divergence between members of an exponential family is the Bregman divergence of the dual function,

D_KL(P_θ* || P_θ) = B_F*(θ* : θ) ≡ F*(θ*) − F*(θ) − (θ* − θ)ᵀ ∇F*(θ). (C2)

To compute the condition for the optimal model, we take the derivative of the Bregman divergence with respect to the bias. Next, we use the relation

I(θ) = ∇∇F*(θ),

relating the Fisher information metric in mean parameterisation to the dual function. We derive this relation starting with Eq. (8), replacing θ = θ* − δθ and Taylor expanding in δθ. Using the definition of the Bregman divergence Eq. (C2) and noting that ∇_δθi = −∇_θi, the coefficients of the first three orders in δθ are given in Eq. (C13). Using this relation, we rewrite the above condition in terms of −δθ ≡ θ* − θ. Thus, in mean parameterisation, the optimality constraint in Eq. (6) of the main text takes a simple form.

2. The optimality condition for the Bernoulli model in mean parameterisation derived from Taylor expansion

As an example, we evaluate the optimality condition for the Bernoulli model written in mean parameters. In this derivation we use a Taylor expansion of the KL divergence and find that all but a single term of the infinite series cancel, yielding the simplified form of the condition in Eq. (9) of the main text.
We consider the KL divergence between two Bernoulli distributions, denoted by P_θ* and P_θ respectively,

D_KL(P_θ* || P_θ) = Σ_{i=1,2} θ*_i log(θ*_i/θ_i),

with components θ_1 ≡ θ, θ_2 ≡ 1 − θ, where θ* ≡ θ + δθ, and expand the right-hand side in δθ. For the first few orders we find

D_KL(P_θ+δθ || P_θ) = Σ_i [δθ_i + δθ_i²/(2θ_i) − δθ_i³/(6θ_i²) + …],

where the linear term vanishes because Σ_i δθ_i = 0.
Next, we act with the derivative with respect to the bias on this expansion. Remembering that δθ is a function of θ, with ∂δθ/∂θ = −1, terms of successive order in δθ cancel after grouping. Using θ_2 = 1 − θ_1, the surviving term can be written in matrix notation with a 2 × 2 matrix which we recognise as the Fisher information matrix. Thus, we obtain the optimality condition of Eq. (9) of the main text.

Appendix D: Optimal parameter calculation for the Poisson distribution with unknown rate parameter

We consider two Poisson distributions with parameters λ* and λ:

P_λ*(n) = λ*ⁿ e^{−λ*}/n!, P_λ(n) = λⁿ e^{−λ}/n!. (D1)

We want to minimise the posterior mean loss

L_post(λ) = ∫ dλ̃ P(λ̃|n) D_KL(P_λ̃ || P_λ)

with respect to λ. Taking the derivative with respect to λ, we obtain the condition

∫ dλ̃ P(λ̃|n) (1 − λ̃/λ) = 0,

which is satisfied by the posterior mean,

λ_opt = ⟨λ̃⟩_post.

The conjugate prior for the mean number of counts λ is the Gamma distribution,

P(λ; α, β) ∝ λ^{α−1} e^{−βλ},

with hyperparameters α, β > 0, and the posterior is again a Gamma distribution, with parameters α + N X̄ and β + N, where X̄ is the empirical estimate of the rate λ from N observations. The posterior mean, which gives the optimal parameter choice, is

λ_opt = (N X̄ + α)/(N + β),

and we define, as before, the bias λ_bias = α/β and the number of virtual observations N_0 = β, so that λ_opt = (N X̄ + N_0 λ_bias)/(N + N_0). The hyperparameter β can thus be interpreted as virtual observations, and α as the sum of counts across β observations. If we do not trust the empirical estimate, then we should take α large compared to β. In this limit, the optimal λ value becomes large and lies further away from the boundary of parameter space situated at λ = 0.

Up to a scaling by 1/N, the maximum-likelihood value of the average count λ̂_ML computed from N observations is distributed according to the true Poisson distribution with the rate rescaled by N. This can be seen from the characteristic function of the Poisson distribution, exp(λ(e^{it} − 1)), and the fact that the characteristic function of the sum of N i.i.d.
random variables is given by the N'th power of the characteristic function of a single random variable. The empirically averaged expected loss is then given by

L(α, β) = Σ_n Poi(n; Nλ*) D_KL(P_λ* || Q_λ(n/N; α, β)). (D11)

We minimise this loss with respect to the hyperparameters, for instance by keeping β fixed and minimising with respect to α. In Figure S4 we show the optimal choice of the hyperparameter α as a function of the data size, confirming that for small N, α is large, while for increasing data size α → 0.
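The conjugate Gamma update and the resulting posterior-mean estimate λ_opt = (α + N X̄)/(β + N) can be sketched as follows (the counts and hyperparameter values are hypothetical):

```python
def gamma_posterior(alpha, beta, counts):
    """Gamma(alpha, beta) prior updated by N Poisson observations."""
    return alpha + sum(counts), beta + len(counts)

def lambda_opt(alpha, beta, counts):
    """Posterior mean: the loss-minimising rate estimate."""
    a, b = gamma_posterior(alpha, beta, counts)
    return a / b

counts = [2, 4, 3, 5, 1]  # N = 5 observations with empirical mean 3
# In the maximum-likelihood limit alpha, beta -> 0, the empirical mean is recovered.
print(lambda_opt(1e-9, 1e-9, counts))
# A prior with beta "virtual observations" shrinks the estimate toward alpha/beta.
print(lambda_opt(10.0, 2.0, counts))  # bias lambda_bias = alpha/beta = 5
```

The second estimate lands between the empirical mean (3) and the prior bias (5), and the pull towards the bias weakens as N grows relative to β.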
FIG. S4. Optimization of the expected loss for the Poisson distribution with respect to the prior shows a bias to low-complexity models. The true model lies at λ* = 3. (A) The bias λ_bias is optimised while κ = 0.2 (q = 0.25) is kept fixed. The plot shows the landscape of the expected loss as a function of the bias and the number of observations. We are using a non-linear colormap based on the cumulative distribution of the loss. (B) The plot shows, for different true models λ*, the ratio of the local complexity measured by Jeffreys prior of the expected optimal model to that of the true model.

Appendix E: Optimal parameter calculation for Gaussian with known mean and unknown variance
Consider two Gaussian distributions with the same known mean µ and unknown variances σ̃² and σ². The KL divergence between these two distributions is given by

D_KL(P_σ̃ || P_σ) = log(σ/σ̃) + σ̃²/(2σ²) − 1/2.

We want to minimise the posterior mean loss

L_post(σ²) = ∫ dσ̃² P(σ̃²|n) D_KL(P_σ̃ || P_σ) (E3)

with respect to σ², where P(σ̃²|n) is the posterior. Taking the derivative with respect to σ², we obtain the condition

∫ dσ̃² P(σ̃²|n) [1/(2σ²) − σ̃²/(2σ⁴)] = 0,

which is satisfied by the posterior mean,

σ²_opt = ⟨σ̃²⟩_post.

The conjugate prior for the variance is the scaled inverse-chi-squared distribution χ⁻². Following [54], the posterior distribution for the variance is again a scaled inverse-chi-squared distribution, with degrees of freedom ν + N and scale (ν σ²_bias + N σ̂²)/(ν + N), where σ̂² is the empirical variance. The loss-minimising value for σ² is given by the Bayesian mean, which comes out as

σ²_opt = (ν σ²_bias + N σ̂²)/(ν + N − 2),

and, in analogy with the main text, we define N_0 in terms of the hyperparameter ν. Up to a scaling of σ*²/N, the empirical variance is distributed according to the chi-squared distribution of order N (recall, the variance is given by the sum of N i.i.d. squared Gaussian random variables divided by N),

σ̂² ∼ (σ*²/N) χ²_N.

Finally, we optimise the expected loss with respect to the bias term, while κ is fixed; here we use the continuous version of the Kullback-Leibler divergence. By Eq. (18) of the main text, κ fixes q, which in turn provides a value for N_0 given N, where we also use the definition of N_0 in terms of the hyperparameter ν; this determines the hyperparameters ν and σ²_bias. Minimising the expected loss given N yields the optimal choice for σ²_bias. In Figure S5 we show the optimal bias value as a function of the data size.

FIG. S5. Optimization of the expected loss for the Gaussian distribution with respect to the prior shows a bias to low-complexity models. The true model lies at σ*² = 2 and we set µ = 0 for the known mean value. (A) The bias σ²_bias is optimised while κ = 0.2 (q = 0.25) is kept fixed. The plot shows the landscape of the expected loss as a function of the bias and the number of observations. We are using a non-linear colormap based on the cumulative distribution of the loss.
(B) The plot shows, for different true models σ*², the ratio of the local complexity measured by Jeffreys prior of the expected optimal model to that of the true model.

FIG. S6 (caption, beginning truncated): … respectively. In the left column of subfigures we show the growth rates G_θ*(θ) as a function of the model choice θ. Regions where the growth rate is non-negative are indicated by color-shaded boxes. In the right column of subfigures we show the statistical fluctuations of the inferred model as a function of N for two different learning strategies. The first learning strategy is maximum-likelihood estimation (α = 0) and the second uses α = 1/2. The latter strategy (α = 1/2) is safer, because the magnitude of fluctuations is shrunk and more contained within the interval of non-negative growth rates, in comparison to the maximum-likelihood strategy.