Acceleration of Evolutionary Processes by Learning and Extended Fisher's Fundamental Theorem

Natural selection is a general and powerful concept, not only for explaining the evolutionary processes of biological organisms but also for designing engineering systems such as genetic algorithms and particle filters. There is a surge of interest, both in biology and engineering, in considering the natural selection of intellectual agents that can learn individually. Learning of better behaviors for survival by individual agents may accelerate the evolutionary processes driven by natural selection. We have accumulating evidence that organisms can transmit their information to the next generation via epigenetic states or memes. Such an idea is also important for engineering applications. To accelerate the evolutionary process, an agent should change its strategy so that the population fitness increases the most; equivalently, an agent should update its strategy along the gradient of the population fitness. However, it has not yet been clarified whether and how an agent can estimate this gradient and thereby accelerate the evolutionary process. We also lack a methodology to quantify the acceleration, which would let us understand and predict the impact of learning. In this paper, we address these problems. We show that a learning agent can accelerate the evolutionary process by proposing ancestral learning, which uses only the information transmitted from the ancestors (ancestral information). We next show that the ancestral information is sufficient to estimate the gradient. In particular, learning can accelerate the evolutionary process without communication between agents. Finally, to quantify the acceleration, we extend Fisher's fundamental theorem (FF-thm) for natural selection to ancestral learning. Our extended FF-thm relates the acceleration of the evolutionary process to the variety of the individual fitness of the agents. With this theorem, we can quantitatively understand when and why learning is beneficial.


I. INTRODUCTION
A fundamental question in evolutionary biology is how organisms acquire sophisticated traits, functions, and strategies to survive in harsh and ever-changing environments. Attempts to answer this question led to the theory of natural selection [1]. The evolutionary process by natural selection is general and powerful enough not only to explain various biological phenomena but also to be applied to the optimization of engineering systems. Genetic and evolutionary algorithms [2] solve mathematical optimization problems by simulating the "evolution" of candidate solutions. Similarly, particle filters are designed to solve the filtering problem of latent state models by approximating a posterior distribution with a population of replicating particles [3].
While the original natural selection is a passive process in that the trait of an organism can change only randomly, several studies in both biology and engineering have considered the natural selection of intelligent agents (Fig. II.1 (a)) that can learn from experience and actively change their traits accordingly. For biological systems, researchers have discussed the fitness value of information processing of organisms, such as the sensing of environments [4][5][6][7]. In this context, some studies [8,9] pointed out the possibility that learning can accelerate the evolutionary process by natural selection. While this idea may seem to violate the conventional assumption that evolution is a blind watchmaker, we have accumulating evidence that organisms can transmit their information to the next generation not only via genes but also via epigenetic states or memes [10]. Epigenetic states and memes enable an organism to transmit the information that is necessary for learning. A pioneering study by Xue and Leibler [8] considered a growing population of agents, each of which follows a learning rule to choose the type its parent chose more frequently than the parent did (we call it Xue's rule). They showed that this simple learning rule can acquire the optimal type-switching strategy for changing environments. In another line of work, the effect of learning on evolution has also been discussed as the Baldwin effect [11,12].
In engineering, it has been shown that genetic algorithms and particle filters can be improved by introducing learning by individual agents or particles. The memetic algorithm [13] and information geometric optimization [14] are extended optimization algorithms that employ an active update of candidate solutions by, for example, gradient descent. In addition, some estimation algorithms for latent state models employ a population of replicating particles that also individually learn the parameters of the model [15].
In this work, we aim to understand the impact of learning on evolutionary processes, both qualitatively and quantitatively, from a general viewpoint.

A. Learning in evolutionary processes
Because the interplay of natural selection and learning is tangled, we first describe the situation we consider and the definition of learning in this work.
We consider a population of agents that replicate asexually. Each agent has a type and stochastically selects one type in each generation. The type and the state of the environment affect the number of offspring that the agent can generate. For biological systems, the type can be interpreted as a phenotypic trait of an organism. Stochastic type selection can be beneficial when the state of the environment changes over time. The type is inherited between generations not directly but indirectly [16,17]. Each agent also has a type-switching strategy that determines the probability of choosing each type. We assume that the strategy is heritable and also subject to selection. From the biological viewpoint, the strategy can be regarded as a genetic or epigenetic trait, and the types (phenotypes) of agents can be correlated among generations via inheritance of the strategy. From the engineering viewpoint, the strategy is related to hyperparameters that determine the behaviors of an agent.
Since the strategy is heritable, better strategies can be selected via natural selection if there is a diversity of strategies in the population. In a conventional evolutionary process, the diversity of strategies is generated by random (mutational or epigenetic) changes that occur when the strategy of an individual agent is inherited from one generation to the next. As learning of individual agents, we consider here the case where the inherited strategy is biased based on past information of the ancestors or the population. Specifically, we consider learning rules that bias the offspring's strategy to gain greater fitness. In general, the conventional random changes can also be regarded as a kind of learning rule in which no average gain of fitness is expected. We therefore call them passive or zero-th order learning rules. Our main focus here is the learning rules that can bias the strategy to have an average gain of fitness. They should update the strategy in the direction called the gradient of fitness, along which the fitness increases (defined rigorously in Section V). We call them active or first order because the gradient is closely related to the first derivative of the fitness with respect to the strategy [18].
Under the setting above, we have at least three problems about the interplay between natural selection and learning. The first is whether or when learning can accelerate the evolutionary process by which an agent acquires the optimal strategy. Since a learning rule must be simple enough to be implemented in biological systems, we should investigate whether the evolutionary process is accelerated even by simple learning rules. For engineering systems, such simplicity is desirable for building scalable learning algorithms. In previous work by Xue and Leibler [8], the simple Xue's rule was shown to achieve the optimal strategy under a constant environment via the evolutionary process. However, zero-th order random changes can also achieve the optimal strategy, and therefore active learning may not always be more beneficial or efficient than passive learning.
The second problem is whether an agent can estimate the gradient from accessible information. In particular, we do not know what information is sufficient for the estimation of the gradient. Although Xue's learning rule can find the optimal strategy by using only the information of the parent's type, the relationship between Xue's rule and the gradient is unclear. The information of the parent's type might be insufficient to estimate the gradient, and communication between agents of the same generation might be required. The sufficient condition is also important for engineering systems in order to find new variants of genetic algorithms and particle filters.
The last one is how to quantify and predict the acceleration of natural selection by learning. For the conventional evolutionary processes with natural selection, we have Fisher's fundamental theorem (FF-thm) and its variants [19]. The theorem states that the increase in the mean fitness of a population is proportional to the variance of the fitness in the population. From the relation, we can predict the progress and speed of the evolution in the population. Because the evolutionary process becomes more complicated by taking learning of individual agents into account, a simple relationship similar to FF-thm would facilitate our understanding of the impact and efficiency of learning. Furthermore, such a relationship may be applied to analyzing the performance of engineering systems.
In this paper, we address the three problems. First, we propose ancestral learning, which utilizes only the information transmitted from the ancestors via epigenetic states or memes. Ancestral learning is simple and therefore biologically reasonable, and it generalizes Xue's learning rule. We validate that ancestral learning accelerates the evolutionary process by numerically showing that the optimal type-switching strategy is acquired by ancestral learning faster than by zero-th order mutational rules. Second, we prove that the ancestral information is sufficient to estimate the gradient of the fitness. In particular, we show that ancestral learning updates the strategy in the direction of the gradient. Third, we derive an extended FF-thm for ancestral learning, which relates the variation of fitness among the ancestors to the fitness gain by ancestral learning. With this theorem, we can predict the acceleration of evolutionary processes by ancestral learning, which depends on the properties of the environment. The theorem enables us to quantitatively understand when and why ancestral learning becomes beneficial.

II. SETUP
We consider population dynamics of asexual agents with a discrete generation time t ∈ {0, 1, 2, . . . } (Fig. II.1 (b)). Let x^(t) ∈ X and y^(t) ∈ Y be the type of an agent and the state of the environment at time t. The type models the phenotypic trait of organisms in biological systems. Each agent has its own stochastic type-switching strategy π_F ∈ R^X, where π_F(x) is the probability to switch into type x and π_F satisfies ∑_{x∈X} π_F(x) = 1 and π_F(x) ≥ 0 for all x ∈ X (Fig. II.1 (b)). We call π_F a strategy and y^(t) ∈ Y the environmental state at time t. The environmental state y^(t) follows a distribution Q(y) on Y, which is independent of t. An agent with type x under environmental state y duplicates asexually and produces e^{k(x,y)} daughters on average (Fig. II.1 (c)). The quantity e^{k(x,y)} is called the individual fitness [20] of the agent. We define the paths (histories) of the types along a lineage and of the environmental states from time 0 to time t − 1 as X^(t) and Y^(t), respectively.
To define a "fitness" of a strategy, we first consider the case where the agents cannot learn and the strategy π_F is fixed in the population and over generations. The number N^(t) of the agents in the population at time t under a path Y^(T) of environmental states becomes

N^(t) = [∑_{x∈X} e^{k(x, y^(t−1))} π_F(x)] N^(t−1).   (II.1)

Here, the initial size N^(0) of the population is given as an initial condition. When the dependency on the strategy π_F or on a path Y^(T) of environmental states is clear from the context, we omit it. We can use this dynamical system to define a "fitness" of the strategy π_F. The cumulative population fitness of strategy π_F under Y^(t) up to time t is defined as

Λ^(t)(π_F) := ln [N^(t)/N^(0)] = ∑_{t′=0}^{t−1} ln ∑_{x∈X} e^{k(x, y^(t′))} π_F(x).   (II.2)

The time-averaged population fitness of π_F is defined as

λ(π_F) := lim_{t→∞} (1/t) Λ^(t)(π_F),   (II.3)

which exists almost surely and independently of Y^(t) owing to the ergodicity of the environmental state [21,22]. In the following, we call λ(π_F) the population fitness for short [23]. When the agents learn their strategies (Fig. II.1 (d)), the number N^(t)(π) of the agents with a strategy π at time t becomes

N^(t)(π) = ∑_{π′} L(π | π′) [∑_{x∈X} e^{k(x, y^(t−1))} π′(x)] N^(t−1)(π′).   (II.4)

Here, L(π | π′) is a (possibly stochastic) learning rule, which satisfies ∑_π L(π | π′) = 1. The learning rule L can depend on the information available to the agents for learning. We consider the following sources of available information.

FIG. II.1. Schematic representation of the setup for learning in evolutionary processes. We consider agents that can replicate and learn; examples are microbes, animals, and humans (a). (b-d) Schematic illustrations of the model. An agent at time t − 1 first determines its type based on its strategy π (b). The agent then produces e^{k(x, y^(t−1))} daughters depending on its type x and the environmental state y^(t−1) (c). After the replication, the daughters inherit strategies updated by a given learning rule L (d).
Each agent can transmit information to the next generation via epigenetic states or memes. Specifically, each agent can access the frequency of the types that its ancestors chose. While we do not explicitly consider communication between agents, we show that a learning rule without communication is sufficient to achieve acceleration of evolutionary processes via estimating the fitness gradient. In addition, we do not assume that the agent can sense the environmental state y. For a further generalization of these assumptions, see Section X. Under this setting, we consider how agents in the population can gradually acquire the optimal strategy by the zero-th or the first order learning. We note that the optimal strategy is unique due to the concavity of the population fitness λ(π) with respect to π. The concavity of λ follows from Eq. (V.3), which we prove later.
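Because the environmental states are drawn i.i.d. from Q, the population fitness λ(π_F) reduces to an expectation over Q that can be evaluated directly. The following is a minimal sketch (not the authors' code); it assumes, for illustration, the concrete model used later in Section IV: three types, Q = (0.6, 0.2, 0.2), and individual fitness e^{k(x,y)} = 4 if x = y and 1 otherwise.

```python
import math

# Assumed model (from Section IV, used here only for illustration):
# X = Y = {0, 1, 2}, Q = (0.6, 0.2, 0.2),
# individual fitness e^{k(x,y)} = 4 if x == y else 1.
Q = [0.6, 0.2, 0.2]

def fitness(x, y):
    return 4.0 if x == y else 1.0

def population_fitness(pi_F):
    """Time-averaged population fitness:
    lambda(pi_F) = sum_y Q(y) * ln( sum_x e^{k(x,y)} pi_F(x) ),
    exact here because the environmental states are i.i.d. with law Q."""
    return sum(q * math.log(sum(fitness(x, y) * pi_F[x] for x in range(3)))
               for y, q in enumerate(Q))

uniform = [1/3, 1/3, 1/3]
print(population_fitness(uniform))             # fitness of the uniform strategy
print(population_fitness([0.92, 0.04, 0.04]))  # near-optimal strategy (cf. Sec. IV)
```

A strategy concentrated on the frequent, well-matched type attains a higher λ than the uniform strategy, as expected.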

III. ANCESTRAL LEARNING
We first propose ancestral learning and validate that it can accelerate the evolutionary process. Ancestral learning is a self-reinforcement of the strategy by positive feedback. It updates the strategy every τ_est generations, where τ_est is a hyperparameter called the update interval. We suppose that the update occurs at time t = iτ_est − 1 (i = 1, 2, . . . ), and we regard the initial strategy π_F^(0) as being acquired at time −1 by the zero-th update. After an agent at time (i − 1)τ_est − 1 acquires the strategy π_F^(i−1) by the (i − 1)-th update, its descendants at time t ((i − 1)τ_est ≤ t < iτ_est) have the same strategy. At time iτ_est − 1, i.e., at the next update, each of the descendants calculates the empirical distribution j_emp^{π_F^(i−1)} of the ancestors' types back to time (i − 1)τ_est. Specifically,

j_emp^{π_F^(i−1)}(x) := (1/τ_est) ∑_{t′=(i−1)τ_est}^{iτ_est−1} δ_{x, x^(t′)},   (III.1)

where δ_{x,x′} is the Kronecker delta and x^(t′) is the type of the ancestor at time t′. When π_F^(i−1) is clear from the context, we omit it. After obtaining the empirical distribution, the agent updates its strategy by the rule

π_F^(i)(x) = (1 − α) π_F^(i−1)(x) + α j_emp^{π_F^(i−1)}(x),   (III.2)

where α is a hyperparameter called the learning rate. Under this rule of ancestral learning, the strategy π_F^(i) after the i-th update is a mixture of the previous strategy π_F^(i−1) and the frequency j_emp^{π_F^(i−1)} of the types that the ancestors chose. If the learning rate is close to 1, i.e., α ≈ 1, the updated strategy π_F^(i) becomes identical to the ancestors' type frequency. If α is small, the information of the ancestors' types is gradually assimilated into the strategy. Ancestral learning coincides with Xue's rule [8] when τ_est = 1. In addition, the rule does not require communications between agents.
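The update rule can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`empirical_distribution`, `ancestral_update`) and an example lineage of length τ_est = 4:

```python
def empirical_distribution(type_history, num_types):
    """Ancestral information j_emp: frequency of each type along the lineage."""
    j = [0.0] * num_types
    for x in type_history:
        j[x] += 1.0 / len(type_history)
    return j

def ancestral_update(pi_F, type_history, alpha):
    """One update of ancestral learning:
    pi'(x) = (1 - alpha) * pi(x) + alpha * j_emp(x)."""
    j = empirical_distribution(type_history, len(pi_F))
    return [(1 - alpha) * p + alpha * jx for p, jx in zip(pi_F, j)]

# Example: the ancestors chose type 0 three times out of four (tau_est = 4).
pi_new = ancestral_update([1/3, 1/3, 1/3], [0, 0, 1, 0], alpha=0.5)
print(pi_new)  # the probability of type 0 moves from 1/3 toward 3/4
```

With α = 1 the strategy would become the ancestors' type frequency itself; with small α the frequency is assimilated gradually, as described above.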
Ancestral learning is a biologically reasonable learning rule. The information used in the rule is only the empirical distribution j_emp of the ancestors' types, which can be stored and transmitted via epigenetic states or memes as discussed before. Owing to this property, we call j_emp the ancestral information. Also, the memory needed to store j_emp is reasonably small. In the following, we prove that the compressed information j_emp, instead of the whole path X^(t) of the ancestors' types, is sufficient for attaining the optimal strategy. The update rule of ancestral learning seems natural since it is similar to Hebb's rule [24], as pointed out in [8]. Hebb's rule is a self-reinforcement by positive feedback in that the synaptic connection between activated and coactivated neurons is strengthened.
The intuitive explanation of why ancestral learning can attain the optimal strategy is that replicating the types of the surviving ancestors is likely to contribute to the survival of the descendants. Due to the growth competition in the population, the empirical distribution j_emp of the ancestors' types deviates from the strategy π_F, and j_emp, seen as a strategy, has a greater population fitness than π_F. This deviation, known as survivorship bias, works as the driving force of ancestral learning.
To see this intuition more precisely, let us consider the simple case where the environment is constant (Y = { * }), the learning rate is α = 1.0, and the update interval τ_est is sufficiently long. In this case, the individual fitness depends only on the type, and we can abbreviate e^{k(x,y)} as e^{k(x)}. The optimal strategy π_F* is

π_F*(x*) = 1 (x* = argmax_{x∈X} k(x)),  π_F*(x) = 0 (otherwise),

which means that π_F* always selects the type x* maximizing the individual fitness e^{k(x)}. We calculate j_emp^{π_F^(i−1)} to see how ancestral learning updates the strategy π_F^(i−1) and to check that π_F^(i) converges to the optimal π_F* as i → ∞. Since j_emp(x) = (1/τ_est) ∑_{t′=(i−1)τ_est}^{iτ_est−1} δ_{x,x^(t′)} is the average of independent and identically distributed random variables {δ_{x,x^(t′)}}_{t′}, the law of large numbers implies that

j_emp(x) ≈ ⟨δ_{x,x^(t′)}⟩   (III.3)

when τ_est is sufficiently long (we discuss the case where τ_est is not large, and show that a small learning rate α can compensate for a small τ_est, in Sec. IX). We can interpret ⟨δ_{x,x^(t′)}⟩ as the following probability. Recall that an agent acquires the strategy π_F^(i−1) at time (i − 1)τ_est − 1 via the update by ancestral learning and that its descendants have the same strategy until the next update at time iτ_est − 1. Let us consider the sub-population that consists of these descendants, and choose an agent at time t′ + 1 ((i − 1)τ_est ≤ t′ < iτ_est) from the sub-population uniformly at random. Under this setting, ⟨δ_{x,x^(t′)}⟩ is the probability π_B(x) that the parent of the chosen agent expresses type x. Let N^(t′) be the number of the agents in the sub-population at time t′. In the sub-population, the number of the agents at time t′ + 1 whose parent expresses type x is e^{k(x)} π_F^(i−1)(x) N^(t′), and the total number of the agents at time t′ + 1 is ∑_{x′∈X} e^{k(x′)} π_F^(i−1)(x′) N^(t′). Therefore,

π_B(x) = e^{k(x)} π_F^(i−1)(x) / ∑_{x′∈X} e^{k(x′)} π_F^(i−1)(x′).   (III.4)

This equation and Eq. (III.3) imply that j_emp ≈ π_B when τ_est is sufficiently large. The probability π_B is called the retrospective process of π_F^(i−1) for the constant environment [25][26][27][28].
The retrospective process is biased so that π_B(x*), the probability to switch into the optimal type x*, is larger than π_F^(i−1)(x*), and it is therefore better fitted to the environment. Since ancestral learning updates the strategy to j_emp ≈ π_B, the strategy becomes π_B after the update.

We next consider the case where the environment is not constant. We calculate j_emp^{π_F^(i)} as in the constant-environment case. Since the environmental state y^(t) follows Q(y) independently of t, the law of large numbers implies that

j_emp(x) ≈ ∑_{y∈Y} Q(y) ⟨δ_{x,x^(t′)} | y⟩,   (III.5)

where ⟨δ_{x,x^(t′)} | y⟩ is the conditional expectation of δ_{x,x^(t′)} given that the environmental state at time t′ is y. We can interpret ⟨δ_{x,x^(t′)} | y⟩ as the following conditional probability π_B(x | y). Let us consider an agent that acquires strategy π_F^(i−1) at time (i − 1)τ_est − 1 and the sub-population that consists of its descendants as before. Suppose that we choose an agent at time t′ + 1 ((i − 1)τ_est ≤ t′ < iτ_est) from the sub-population uniformly at random and that the environmental state at time t′ is y. Under this setting, ⟨δ_{x,x^(t′)} | y⟩ is the probability π_B(x | y) that the parent of the chosen agent expresses type x. By a similar argument, we have

π_B(x | y) = e^{k(x,y)} π_F^(i−1)(x) / ∑_{x′∈X} e^{k(x′,y)} π_F^(i−1)(x′).   (III.6)

The probability π_B(x | y) is also called the retrospective process, and it is fitted to the environmental state y better than π_F. This equation and Eq. (III.5) imply that j_emp(x) converges to the averaged retrospective process π̄_B(x) := ∑_y π_B(x | y) Q(y). Therefore, π_F is updated to a mixture of the strategies π_B(x | y), each of which is better fitted to the corresponding environmental state. We will numerically (Section IV) and theoretically (Section V) show that the update to such a mixture strategy leads to the optimal one.
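The retrospective process and its environmental average can be computed explicitly. The sketch below assumes, for concreteness, the Section IV model (Q = (0.6, 0.2, 0.2), e^{k(x,y)} = 4 if x = y else 1); the helper names are introduced here for illustration. It exhibits the survivorship bias: starting from the uniform strategy, the limit π̄_B of j_emp over-represents the frequent, well-matched type 0.

```python
Q = [0.6, 0.2, 0.2]                      # assumed: the Section IV environment

def fitness(x, y):
    return 4.0 if x == y else 1.0        # e^{k(x,y)}

def retrospective_given_y(pi_F, y):
    """pi_B(x | y) = e^{k(x,y)} pi_F(x) / sum_{x'} e^{k(x',y)} pi_F(x')."""
    w = [fitness(x, y) * pi_F[x] for x in range(len(pi_F))]
    Z = sum(w)
    return [wx / Z for wx in w]

def averaged_retrospective(pi_F):
    """bar{pi}_B(x) = sum_y pi_B(x | y) Q(y): the large-tau_est limit of j_emp."""
    n = len(pi_F)
    return [sum(Q[y] * retrospective_given_y(pi_F, y)[x] for y in range(n))
            for x in range(n)]

print(averaged_retrospective([1/3, 1/3, 1/3]))
# survivorship bias: bar{pi}_B(0) exceeds the forward probability 1/3
```

Updating toward this π̄_B is exactly what ancestral learning does on average, as quantified in Section V.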

IV. ANCESTRAL LEARNING CAN ACCELERATE THE EVOLUTIONARY PROCESSES
Next, we validate that learning can accelerate the evolutionary process by numerically showing that the optimal type-switching strategy is acquired with ancestral learning faster than with the zero-th order mutational rules.
We simulate the evolutionary process by a multi-type branching process in a random environment [29,30]. Namely, we simulate the dynamical system defined by Eq. (II.4) while taking the individuality and the finite number of the agents into account. In the simulation, we set X = Y = {0, 1, 2} (Fig. III.1 (a)). In the following, the colors are numbered from left to right in the figure: 0 ∈ Y corresponds to red, 1 to yellow, and 2 to blue, as in Fig. III.1 (a). We also set Q(0) = 0.6 and Q(y) = 0.2 for the other y ∈ Y. Each agent with type x under environmental state y has four daughters if x = y and one daughter otherwise; in short, e^{k(x,y)} = 4 if x = y and e^{k(x,y)} = 1 otherwise. We represent a strategy π_F as a vector of the form (π_F(0), π_F(1), π_F(2)). Due to the symmetry of e^{k(x,y)}, the zero-th component π*(0) of the optimal strategy is higher than the others. We start the simulation from a single agent whose initial strategy is π_F^(0) = (1/3, 1/3, 1/3). We limit the number of agents in the population to N_max = 30 to avoid the intractability of the numerical experiment due to the exponential growth of the number of agents. If the number of agents in the next generation exceeds N_max, we select N_max agents uniformly at random.
We investigate three learning rules. Each learning rule updates the strategy at every time step, i.e., τ_est = 1. The first learning rule is ancestral learning with learning rate α = 0.01. The second and third ones are zero-th order mutational rules. Since there are innumerable zero-th order learning rules, we choose two representative ones as controls for ancestral learning. The second learning rule is π_F′ ← (1 − α)π_F + α δ_{x,x_k}, where π_F and π_F′ are the strategies of the agent before and after the update, respectively, and x_k is chosen uniformly at random from X. In biological systems, this rule can be seen as a random mutation of π_F whose rate is constant. The trajectory of π_F updated by this rule is a random walk over R^X if no growth occurs, that is, if e^{k(x,y)} = 1 for all x ∈ X and y ∈ Y. We therefore call this learning rule the random walk. The third learning rule is π_F′ ← (1 − α)π_F + α δ_{x,x_k}, where x_k is sampled from the discrete distribution π_F. In biological systems, this rule can be seen as a mutation of π_F whose rate depends on the current π_F. The change of mutation rate is known as adaptive mutation [31]. Therefore, we call this learning rule the adaptive random walk. The adaptive random walk coincides with ancestral learning if no growth occurs. In this sense, the adaptive random walk is a control to see the effect of the population growth on ancestral learning.

Figure III.1 shows the results of the simulations of the three learning rules: ancestral learning accelerates the evolutionary process. We show lineage trees up to t = 50. The population fitness of the population with ancestral learning increases faster than those with the other learning rules (Fig. III.1 (b)) along the lineage of the most successful agent, whose population fitness is the maximum among the agents at the end of each lineage tree. The acceleration of the evolutionary process is also observed at the lineage-tree level (Fig. III.1 (d-f)). In Fig. III.1 (g-i), we select the lineage of the most successful agent in each lineage tree and plot the trajectory of π_F along the lineage.
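The branching simulation described above can be sketched as follows. This is a minimal, seeded toy version, not the authors' code: it assumes the stated parameters (Q = (0.6, 0.2, 0.2), fourfold growth on a match, α = 0.01, N_max = 30, τ_est = 1) and implements only the ancestral-learning rule, omitting the random-walk controls.

```python
import random

random.seed(0)                       # seeded for reproducibility
Q = [0.6, 0.2, 0.2]                  # environment distribution

def offspring(x, y):                 # e^{k(x,y)}: 4 daughters if matched, else 1
    return 4 if x == y else 1

ALPHA, N_MAX, T = 0.01, 30, 300      # learning rate, population cap, generations

# Each agent is represented by its strategy; with tau_est = 1, every daughter's
# strategy is nudged toward the type its parent expressed (ancestral learning).
agents = [[1/3, 1/3, 1/3]]
for t in range(T):
    y = random.choices(range(3), weights=Q)[0]       # i.i.d. environmental state
    nxt = []
    for pi in agents:
        x = random.choices(range(3), weights=pi)[0]  # type expressed by parent
        child = [(1 - ALPHA) * p + (ALPHA if i == x else 0.0)
                 for i, p in enumerate(pi)]
        nxt.extend([list(child) for _ in range(offspring(x, y))])
    # cap the population by uniform random selection, as in the experiment
    agents = random.sample(nxt, N_MAX) if len(nxt) > N_MAX else nxt

print(max(pi[0] for pi in agents))   # pi_F(0) along surviving lineages
```

Over many generations, the interplay of selection and the self-reinforcing update drives π_F(0) upward, mirroring the trend reported in the figure.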
To see whether the optimal strategy is acquired by ancestral learning, we run another simulation until t = 1500. We first checked that the strategy converges, i.e., the strategies before and after an update are almost identical when t is sufficiently large. We then verified that the converged strategy is the optimal one. We checked the convergence of the strategy along the lineage of the most successful agent with ancestral learning. The strategy converges since the population fitness along the lineage reaches a ceiling (Fig. III.1 (c)). The convergence is also supported by the trajectory of the strategy (Fig. III.1 (j-l)). The converged strategy (approximately (0.92, 0.04, 0.04)) of the most successful agent with ancestral learning is close to the optimal since it satisfies the optimality condition (the Karush-Kuhn-Tucker condition) with small error [32, Theorem 16.2.1].
From these results, we conclude that ancestral learning accelerates the evolutionary process. Since ancestral learning does not use information communicated between agents of the same generation, we have numerically shown that learning can accelerate the evolutionary process even without communication.

V. ANCESTRAL INFORMATION IS SUFFICIENT TO ESTIMATE GRADIENT
We next address the second problem: whether an agent can estimate the gradient of the population fitness. Although we numerically showed that ancestral learning accelerates the evolutionary process, the relationship between ancestral learning and the fitness gradient is still unclear. The ancestral information j_emp used in ancestral learning might be insufficient to estimate the gradient, and communication between agents of the same generation might be required. In this section, we prove that the ancestral information j_emp is sufficient to estimate the gradient. This theoretically implies that an agent can estimate the gradient without communication between agents. It also implies that ancestral learning updates the strategy in the direction of the gradient.
FIG. III.1. (a) The parameters of the model. In the panel, 0 ∈ Y corresponds to red, 1 to yellow, and 2 to blue. The red environmental state occurs more frequently than the others. An agent has more daughters if its type is equal to the environmental state. (b-l) The simulated lineage trees of the agents that adopt ancestral learning (d), the random walk (e), and the adaptive random walk (f). Each curve in (b) shows the trajectory of the population fitness λ along the lineage of the most successful agent, whose λ is the maximum among the agents at the end of each lineage tree. Ancestral learning increases the population fitness the most among the three learning rules. The curve (c) shows the same plot as (b) for another simulation until t = 1500. The dotted line shows the population fitness of the most successful agent with ancestral learning at t = 1500. In the longer simulation, we can see the convergence of the population fitness in the population with ancestral learning. In (d-f), each point corresponds to an agent and its color represents the population fitness λ of the agent. Black lines connect parents to their daughters. The curves (g-i) show the trajectories of the strategy π_F along the lineage of the most successful agent updated by ancestral learning (g), by the random walk (h), and by the adaptive random walk (i), respectively. The curves (g-i) are truncations of those in (j-l) up to t = 50. In (j), the upper dotted line shows π_F(0) of the most successful agent and the lower one is the average of π_F(1) and π_F(2). The strategy of the most successful agent with ancestral learning converges to approximately (0.92, 0.04, 0.04). The converged strategy is close to the optimal since it satisfies the optimality condition ([32, Theorem 16.2.1]) with small error.

To calculate the gradient of the population fitness, we employ a pathwise formulation and variational principle [28] of the population dynamics. Let us consider the case where the path of the environmental states is Y^(t) and the agents do not learn but stick to a fixed strategy π_F. By applying Eq. (II.1) recursively, we find that the number of the agents at time t whose ancestors' path of types is X^(t) becomes

N^(t)(X^(t)) = e^{K(X^(t), Y^(t))} P_F(X^(t)) N^(0)(x^(0)),

where N^(0)(x) is the number of the initial agents with type x, and the quantities

K(X^(t), Y^(t)) := ∑_{t′=0}^{t−1} k(x^(t′), y^(t′)),  P_F(X^(t)) := ∏_{t′=0}^{t−1} π_F(x^(t′))

are the pathwise (historical) individual fitness and the pathwise forward probability, respectively. Under the pathwise formulation, we can represent the cumulative population fitness as

Λ^(t)(π_F) = ln ∑_{X^(t)} e^{K(X^(t), Y^(t))} P_F(X^(t)).

Since each y^(t) follows Q(y) independently, the population fitness satisfies (cf. [21,22])

λ(π_F) = ⟨ln ∑_{x∈X} e^{k(x,y)} π_F(x)⟩_{Q(y)}.   (V.3)

The form ⟨ln · ⟩_{π_F} on the right-hand side is equivalent to a scaled cumulant generating function [33], and the following variational principle holds:

λ(π_F) = ⟨max_π [⟨k(x, y)⟩_{π(x)} − D[π ∥ π_F]]⟩_{Q(y)},
where π runs over all distributions on X and D[· ∥ ·] is the Kullback-Leibler divergence (KL divergence) defined by

D[π ∥ π_F] := ∑_{x∈X} π(x) ln [π(x)/π_F(x)].

See Appendix XI A for the proof. By a direct calculation, we can see that the maximizer for each y is the retrospective process π_B(· | y). We can calculate the derivative of the population fitness from the variational principle:

∂λ(π_F)/∂π_F(x) = π̄_B(x)/π_F(x).

See Appendix XI B for the proof. We now have all the ingredients to calculate the gradient of the population fitness. Since the strategy π_F has the constraint ∑_{x∈X} π_F(x) = 1, we consider the following definition of the gradient. The gradient at π_F under the constraint ∑_{x∈X} π_F(x) = 1 is defined as the limiting direction, as ε → 0+, of

argmax_{δπ} {λ(π_F + δπ) : π_F + δπ ∈ D_{π_F}(ε)},

where the limit is one-sided from the positive real numbers, δπ ∈ R^X satisfies ∑_{x∈X}(π_F(x) + δπ(x)) = 1, i.e., ∑_{x∈X} δπ(x) = 0, and D_{π_F}(ε) is the sphere around π_F with radius ε. To define the sphere, we use the KL divergence as a natural distance over distributions on X. Intuitively, the gradient is the direction into which the population fitness increases the most among all alternatives that satisfy the constraint and have the same infinitesimal distance from π_F. The definition is related to a proximal operator [34] and coincides with the usual gradient if no constraint is imposed and the sphere is defined by the Euclidean distance. We prove that the gradient is directed toward π̄_B, i.e.,

grad λ(π_F) ∝ π̄_B − π_F.   (V.8)

See Appendix XI B for the proof. This result addresses the second problem. To estimate the gradient, an agent need only estimate π̄_B. By the discussion in the last paragraph of Section III, the ancestral information j_emp is an unbiased estimator of π̄_B, that is, ⟨j_emp⟩ = π̄_B. Therefore, an agent can estimate the gradient from the ancestral information without communication between agents of the same generation. The explicit formula of the gradient also implies that ancestral learning updates the strategy in the direction of the gradient: the direction of the update of the strategy by ancestral learning equals the right-hand side of Eq. (V.8) on average.
In particular, ancestral learning finds the optimal strategy if the learning rate is sufficiently small since λ is concave.
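The gradient result can be checked numerically under the Section IV model (assumed here for concreteness; `lam` and `avg_retro` are helper names introduced for this sketch): a step from π_F toward π̄_B increases λ, consistent with the gradient being proportional to π̄_B − π_F.

```python
import math

Q = [0.6, 0.2, 0.2]                           # Section IV environment (assumed)

def fitness(x, y):
    return 4.0 if x == y else 1.0             # e^{k(x,y)}

def lam(pi):
    """lambda(pi) = sum_y Q(y) ln sum_x e^{k(x,y)} pi(x)  (Eq. V.3)."""
    return sum(q * math.log(sum(fitness(x, y) * pi[x] for x in range(3)))
               for y, q in enumerate(Q))

def avg_retro(pi):
    """bar{pi}_B(x) = sum_y Q(y) e^{k(x,y)} pi(x) / sum_x' e^{k(x',y)} pi(x')."""
    out = [0.0, 0.0, 0.0]
    for y, q in enumerate(Q):
        Z = sum(fitness(x, y) * pi[x] for x in range(3))
        for x in range(3):
            out[x] += q * fitness(x, y) * pi[x] / Z
    return out

pi_F = [1/3, 1/3, 1/3]
bar_B = avg_retro(pi_F)
eps = 0.05                                    # a small ancestral-learning step
pi_step = [(1 - eps) * p + eps * b for p, b in zip(pi_F, bar_B)]
print(lam(pi_step) > lam(pi_F))               # the step raises lambda: True
```

Because λ is concave in π and λ(π̄_B) > λ(π_F) here, every convex-combination step toward π̄_B strictly increases λ.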

VI. FISHER'S FUNDAMENTAL THEOREM FOR ANCESTRAL LEARNING
We address the last problem, the quantification of the acceleration of the evolutionary process by learning, by extending the FF-thm to ancestral learning. Ancestral learning may increase the population fitness much faster under some environments than others, depending on the stochastic property Q(y) of the environment. In addition, the acceleration might also depend on the update interval τ_est and the learning rate α. We can understand such dependencies, as well as when and why learning becomes beneficial, by extending the conventional FF-thm to ancestral learning.
Let us first review the conventional FF-thm for natural selection [19]. The FF-thm relates the speed of evolution to the variance of the individual fitness in the population. To illustrate this, we consider the following fixed-type population dynamics in a constant environment. The set of types is X as before, and the type of a daughter is the same as that of its parent. The environment is constant: Y = {*}. The individual fitness of type x is e^{k(x)}; here, we omit the dependency of the individual fitness e^{k(x,*)} on the environmental state * since the environment is constant. Under this setting, the number N^{(t)}(x) of agents with type x at time t satisfies

N^{(t)}(x) = e^{k(x)} N^{(t-1)}(x).  (VI.1)

Since we are interested in statistics of the population such as the variance of the individual fitness, we focus on the fraction p^{(t)}(x) := N^{(t)}(x) / Σ_{x'∈X} N^{(t)}(x'), which evolves as

p^{(t)}(x) = e^{k(x)} p^{(t-1)}(x) / Σ_{x'∈X} e^{k(x')} p^{(t-1)}(x').  (VI.2)
We define the covariance of random variables f(x) and g(x) with respect to a probability distribution p(x) over X by Cov_p[f, g] := ⟨f g⟩_p − ⟨f⟩_p ⟨g⟩_p, and from this, the variance by Var_p[f] := Cov_p[f, f]. One measure of the evolutionary speed is the gain of the mean individual fitness ⟨e^{k(x)}⟩_{p^{(t)}}. The gain satisfies the following relation due to Eq. (VI.2):

⟨e^{k(x)}⟩_{p^{(t)}} − ⟨e^{k(x)}⟩_{p^{(t-1)}} = Var_{p^{(t-1)}}[e^{k(x)}] / ⟨e^{k(x)}⟩_{p^{(t-1)}}.  (VI.6)

See Appendix XI C for the proof. The equation reveals the relationship between the evolutionary speed and the variance of the individual fitness in the population. This equation is called the FF-thm for natural selection [35].
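The FF-thm for natural selection can be checked numerically. The sketch below runs one generation of the selection dynamics (Eq. (VI.2)) and compares the gain of the mean individual fitness with the variance-over-mean ratio; the fitness values are arbitrary illustrative numbers.

```python
def step(p, w):
    """One generation of selection (Eq. (VI.2)); w[x] = e^{k(x)}."""
    z = sum(wi * pi for wi, pi in zip(w, p))
    return [wi * pi / z for wi, pi in zip(w, p)]

def mean(f, p):
    return sum(fi * pi for fi, pi in zip(f, p))

def var(f, p):
    m = mean(f, p)
    return mean([fi * fi for fi in f], p) - m * m

w = [1.5, 0.7, 1.1]     # hypothetical individual fitnesses e^{k(x)}
p = [0.2, 0.5, 0.3]

gain = mean(w, step(p, w)) - mean(w, p)   # left-hand side of the FF-thm
ff = var(w, p) / mean(w, p)               # right-hand side: variance over mean
```

The two quantities agree to machine precision, as the theorem requires, because the update multiplies each weight by w(x) and renormalizes by the mean fitness.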
Since we are interested not in the mean individual fitness but in the population fitness at time t, we present an FF-thm for the population fitness. We introduce variants of the covariance and variance to extend the conventional FF-thm. We define a log-covariance and a log-variance by

log-Cov_p[f, g] := log⟨f g⟩_p − log⟨f⟩_p − log⟨g⟩_p,
log-Var_p[f] := log-Cov_p[f, f],

respectively. The log-covariance measures the similarity of two random variables as the covariance does, since the log-covariance is monotonically increasing with respect to the covariance. Indeed, we can prove that

log-Cov_p[f, g] = log(1 + Cov_p[f, g] / (⟨f⟩_p ⟨g⟩_p))  (VI.10)

by direct calculation. By using these quantities, we can obtain an extended FF-thm for the population fitness by a similar argument to Eq. (VI.6):

λ(p^{(t)}) − λ(p^{(t-1)}) = log-Var_{p^{(t-1)}}[e^{k(x)}].  (VI.11)

See Appendix XI C for the proof. This equation reveals the relationship between the speed of the evolutionary process, measured by the gain of the population fitness, and the log-variance of the individual fitness in the population. The FF-thm for the population fitness has a close connection to ancestral learning. To see this, let us first consider a simple case where the environment is constant (Y = {*}), the learning rate is α = 1.0, and τ_est ≈ ∞. Under this setting, we showed in Section III that the update of ancestral learning is π_F^{(i)} = π̄_B^{(i-1)}, where π̄_B^{(i-1)} is the retrospective process of π_F^{(i-1)}. This update is equivalent to Eq. (VI.2) if we identify p^{(t)} with π_F^{(i)}. In addition, the gain of the population fitness by the evolutionary process is equivalent to the acceleration of the evolutionary process by ancestral learning. To see this, we introduce a measure of the acceleration defined by ∆λ^{(i)} := λ(π_F^{(i)}) − λ(π_F^{(i-1)}), where π_F^{(i-1)} and π_F^{(i)} are the strategies of the agent before and after the update by ancestral learning. The gain ∆λ^{(i)} of the population fitness depends on ancestral learning and is independent of natural selection; we can therefore regard ∆λ^{(i)} as a measure of the acceleration. The gain ∆λ^{(i)} is equivalent to the left-hand side of Eq. (VI.11) if we identify p^{(t)} with π_F^{(i)} as before.
Owing to these two equivalences, we can extend the FF-thm for the population fitness (Eq. (VI.11)) to ancestral learning by substituting p^{(t)} with π_F^{(i)}:

∆λ^{(i)} = log-Var_{π_F^{(i-1)}}[e^{k(x)}].

This theorem reveals the relationship between the gain of the population fitness by an update of ancestral learning and the log-variance of the individual fitness under the strategy.
The theorem also reveals a trade-off between the acceleration ∆λ^{(i)} and the population fitness λ(π_F^{(i)}) by showing that the acceleration is larger when the agent expresses a variety of types. This can be interpreted as follows: by expressing a variety of types, the agent obtains information about which type best fits the environment. We call such a situation exploratory. On the other hand, an agent with the optimal strategy always expresses the same type under this setting (Eq. (III.2)). Therefore, the theorem implies that the acceleration is almost zero when the strategy is close to the optimum and λ(π_F^{(i)}) is large. We call such a situation exploitative. Thus, we can see the so-called exploration-exploitation trade-off in this setting.
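Both the log-covariance identity (Eq. (VI.10)) and the FF-thm for the population fitness can be verified by direct computation; the fitness values below are illustrative. In a constant environment the population fitness is λ(p) = log⟨e^{k(x)}⟩_p, so one generation of selection should gain exactly the log-variance.

```python
import math

def mean(f, p):
    return sum(fi * pi for fi, pi in zip(f, p))

def log_var(f, p):
    # log-variance: log <f^2>_p - 2 log <f>_p
    return math.log(mean([fi * fi for fi in f], p)) - 2.0 * math.log(mean(f, p))

def step(p, w):
    z = mean(w, p)
    return [wi * pi / z for wi, pi in zip(w, p)]

w = [1.5, 0.7, 1.1]     # hypothetical individual fitnesses e^{k(x)}
p = [0.2, 0.5, 0.3]

# gain of the population fitness over one generation of selection
gain = math.log(mean(w, step(p, w))) - math.log(mean(w, p))

# Eq. (VI.10): the log-variance expressed through the ordinary variance
var = mean([wi * wi for wi in w], p) - mean(w, p) ** 2
identity = math.log(1.0 + var / mean(w, p) ** 2)
```

Both `gain` and `identity` coincide with `log_var(w, p)` to machine precision.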
We can further extend the FF-thm for ancestral learning to the case where the environment is not constant, in which the fraction evolves as

p^{(t)}(x) = e^{k(x,y)} p^{(t-1)}(x) / Σ_{x'∈X} e^{k(x',y)} p^{(t-1)}(x'),  (VI.15)

with probability Q(y).

VII. MEASURES TO CHARACTERIZE ANCESTRAL LEARNING
By using the terms that appear in Eq. (VI.13), we can quantitatively characterize different aspects of strategies during and after learning. We define the actual gain ∆_ac λ^{(i)} and the expected gain ∆_ex λ^{(i)} as the left- and right-hand sides of Eq. (VI.13), respectively; the expected gain (Eq. (VII.2)) consists of a variance term Σ̄^{(i)} (Eq. (VII.3)) and a KL term KL^{(i)} (Eq. (VII.4)). The reason why the additional KL term (Eq. (VII.4)) appears in Eq. (VI.13) is attributed to the existence of two representative strategies: bet-concentrating and bet-balancing. Each term (Eqs. (VII.3) and (VII.4)) of the expected gain is associated with one of the representative strategies and equals the gain of the population fitness by the corresponding strategy. Bet-concentrating is defined as a situation where an agent expresses only a small subset of types that fit the environment. Formally, a strategy is bet-concentrating on X' ⊊ X if π_F(x) > 0 for x ∈ X' and π_F(x) = 0 otherwise. An example is the optimal strategy (Eq. (III.2)) for the constant environment, which concentrates on the single optimal type {x*}. The bet-concentrating strategy is beneficial when the environment is constant or the environmental states y ∈ Y are similar to each other, since an agent can survive by expressing not all but a few types in such situations. Here, similarity between two environmental states y and y' means the closeness of e^{k(x,y)} and e^{k(x,y')} for all x ∈ X (see the next paragraph for the formal definition). However, if the environmental states are dissimilar, an agent cannot reproduce efficiently by concentrating on only a few types because those types are not adaptive to some environmental states. An agent should stochastically choose types from a variety of alternatives to reduce the risk of bet-concentrating. The probability of expressing each type should be determined so that the strategy has a greater population fitness: even if the strategy is bet-concentrating on a subset X' with #X' > 1, the probabilities π_F(x) for x ∈ X' should be determined to maximize λ.
We define bet-balancing in X' as the stochastic expression of the types in X' whose probabilities are positive and are set so that the population fitness is maximized. In general, the optimal strategy is a combination of bet-concentrating and bet-balancing. For example, let us examine the optimal strategy π*_F = (0.72, 0.0, 0.28) in the model shown in Fig. VIII.1 (j), calculated numerically. The strategy is bet-concentrating on X' = {0, 2} and bet-balancing in X'.
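To illustrate how an optimum combines the two representative strategies, the following sketch maximizes the population fitness λ(π) = Σ_y Q(y) log Σ_x e^{k(x,y)} π(x) over the simplex by grid search in a small hypothetical model. The fitness table is not one of the paper's models; it is chosen so that type 2 is a poor generalist while types 0 and 1 hedge against two dissimilar states, so the optimum is bet-concentrating on {0, 1} and bet-balancing within it.

```python
import math

# hypothetical fitness table e^{k(x,y)}: rows = types x, columns = states y
W = [[4.0, 0.5],
     [0.5, 4.0],
     [1.0, 1.0]]
Q = [0.5, 0.5]   # distribution of environmental states

def lam(pi):
    """Population fitness: sum_y Q(y) log sum_x e^{k(x,y)} pi(x)."""
    return sum(q * math.log(sum(W[x][y] * pi[x] for x in range(3)))
               for y, q in enumerate(Q))

# exhaustive grid search over the 3-type simplex
best, best_pi = float("-inf"), None
steps = 100
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        pi = (i / steps, j / steps, (steps - i - j) / steps)
        v = lam(pi)
        if v > best:
            best, best_pi = v, pi
```

The grid optimum puts zero weight on type 2 (bet-concentrating) and splits the remaining probability evenly between types 0 and 1 (bet-balancing), with λ = log 2.25.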
During the evolutionary process with learning, an agent attains the optimal strategy by acquiring the two representative strategies. The variance and KL terms of the expected gain, Σ̄^{(i)} and KL^{(i)}, correspond to the gains of population fitness by acquiring the respective strategies: the variance term Σ̄^{(i)} measures the gain of population fitness by acquiring bet-concentrating, whereas the KL term KL^{(i)} measures that by acquiring bet-balancing. To see this interpretation, we rewrite the updated strategy π_F^{(i)}. We proved in Section III that π_F^{(i)} = π̄_B^{(i-1)} when τ_est ≈ ∞.

By definition,

π̄_B^{(i-1)}(x) = Σ_y Q(y) e^{k(x,y)} π_F^{(i-1)}(x) / Σ_{x'∈X} e^{k(x',y)} π_F^{(i-1)}(x').  (VII.6)

This equation is the transformation of the probability distribution π_F^{(i-1)} obtained by multiplying it by e^{k(x,y)} for each x ∈ X; in the transformation, the normalization factor is ⟨e^{k(x',y)}⟩_{π_F^{(i-1)}}. Let us examine the multiplicative factors. For convenience, we define a vector (⟨e^{k(x,y)}⟩_{Q(y)})_{x∈X} ∈ R^X by collecting the multiplicative factors for x ∈ X; it is the average of the vectors F_y := (e^{k(x,y)})_{x∈X} ∈ R^X defined for each y. We regard F_y as a representation of environmental state y by embedding it into R^X (Fig. VIII.1 (e,h)). We can use the embedding to measure the similarity between the environmental states y and y' by log-Cov_{π_F^{(i-1)}}[F_y, F_{y'}]. By taking the normalization factor ⟨e^{k(x',y)}⟩_{π_F^{(i-1)}} into account, we also define a scaled embedding f_y := F_y / ⟨e^{k(x',y)}⟩_{π_F^{(i-1)}}, which depends on the current strategy π_F^{(i-1)} in addition to y. We use the scaled embedding to rewrite Eq. (VII.5) as

π̄_B^{(i-1)}(x) = ⟨f_y(x)⟩_{Q(y)} π_F^{(i-1)}(x).  (VII.8)

The updated strategy π̄_B^{(i-1)} is more bet-concentrating when the environmental states are more similar, since if each f_y has similar peaks (larger components), so does their average (Fig. VIII.1 (e)). Iteration of such updates leads to concentration on the types where the peaks lie. We will see that the variance term (Eq. (VII.3)) measures the similarity of the environmental states and corresponds to the gain of the population fitness by being bet-concentrating. On the other hand, π̄_B^{(i-1)} is more bet-balancing when the environmental states are more dissimilar, since if each f_y has different peaks, their average becomes flat (Fig. VIII.1 (h)). Iteration of such updates leads to bet-balancing because no concentration occurs and the probabilities of expressing types are balanced so that the population fitness increases. We will see that the KL term KL^{(i)} measures the dissimilarity of the embedding vectors and corresponds to the gain of the population fitness by being bet-balancing. By this correspondence, we can interpret the vanishing of the KL term when the environment is constant as indicating that bet-balancing is unnecessary there.
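The rewriting in terms of scaled embeddings can be checked numerically: computing π̄_B directly from its definition and via the scaled-embedding form (Eq. (VII.8)) gives identical results. The fitness table and strategy below are illustrative.

```python
# hypothetical fitness table e^{k(x,y)} and environment distribution
W = [[4.0, 0.5], [0.5, 4.0], [1.0, 1.0]]
Q = [0.7, 0.3]
pi_F = [0.2, 0.3, 0.5]

# normalization factors <e^{k(x',y)}>_{pi_F} for each environmental state y
Z = [sum(W[x][y] * pi_F[x] for x in range(3)) for y in range(2)]

# scaled embeddings f_y(x) = e^{k(x,y)} / <e^{k(x',y)}>_{pi_F}
f = [[W[x][y] / Z[y] for y in range(2)] for x in range(3)]

# Eq. (VII.8): pi_B(x) = <f_y(x)>_{Q(y)} * pi_F(x)
pi_B_embed = [sum(Q[y] * f[x][y] for y in range(2)) * pi_F[x] for x in range(3)]

# direct retrospective distribution, averaged over environmental states
pi_B_direct = [sum(Q[y] * W[x][y] * pi_F[x] / Z[y] for y in range(2))
               for x in range(3)]
```

Both routes produce the same properly normalized distribution, confirming that the update is exactly the multiplication of π_F by the Q-average of the scaled embeddings.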
We now rewrite Eq. (VI.13) to see that the variance and KL terms, Σ̄^{(i)} and KL^{(i)}, measure the similarity and the dissimilarity of the environmental states, respectively. We first see that the variance term measures the similarity between the environmental states: the variance term equals an average of the log-covariances log-Cov_{π_F^{(i-1)}}[F_y, F_{y'}] over pairs of environmental states (Eq. (VII.9)). Since the log-covariance measures the similarity between two environmental states, the variance term measures the similarity among all environmental states. We can say the opposite for the KL term (Eq. (VII.10)); see Appendix XI C for the proof. The KL term is in principle larger when the environmental states are more dissimilar, since the second moment ⟨F_y(x) F_{y'}(x)⟩_{π_F^{(i-1)}} appears in the numerator, although ⟨F_y(x)⟩_{π_F^{(i)}} in the denominator may change the relationship. Therefore, the KL term measures the dissimilarity of the environmental states.

VIII. NUMERICAL VALIDATION OF THE FF-THM FOR ANCESTRAL LEARNING
We numerically verify the FF-thm for ancestral learning. We simulate four models whose environmental statistics Q(y) differ. In each model, we investigate whether the FF-thm holds, i.e., ∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)}. The learning rate is α = 1.0 unless otherwise specified. We also set τ_est = 1000 to avoid the fluctuation of j_est (cf. Eq. (IX.7)).
We first validate the FF-thm when the environment is constant. We simulate the model shown in Fig. VIII.1 (a), which we call the constant environment model. We observe that ∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)} along the lineage of an agent whose initial strategy is π_F^{(0)} = (0.5, 0.5) (Fig. VIII.1 (b)). To check the validity of the FF-thm beyond one lineage, we compare ∆_ac λ^{(1)} and ∆_ex λ^{(1)} for agents whose initial strategies are generated uniformly at random (Fig. VIII.1 (c)). We observe that ∆_ac λ^{(1)} ≈ ∆_ex λ^{(1)} for most of the random strategies.
We next verify the FF-thm when the environment is not constant by simulating three models. We first simulate the model shown in Fig. VIII.1 (d). Since the environmental states are similar in this model, we call it the similar environment model. In this model, the optimal strategy is bet-concentrating on {0} (Fig. VIII.1 (e)), and the variance term Σ̄^{(i)} is expected to dominate. Fig. VIII.1 (f) shows ∆_ac λ^{(i)}, ∆_ex λ^{(i)}, Σ̄^{(i)}, and KL^{(i)} along the lineage of an agent whose initial strategy is π_F^{(0)} = (0.5, 0.5). From the plot, we find that ∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)} and that the variance term dominates, as expected.
We next simulate the model shown in Fig. VIII.1 (g). Since the environmental states are dissimilar in this model, we call it the dissimilar environment model. In this model, the optimal strategy is bet-balancing, as illustrated in Fig. VIII.1 (h), and the KL term (Eq. (VII.4)) is expected to be non-negligible. Fig. VIII.1 (i) shows ∆_ac λ^{(i)}, ∆_ex λ^{(i)}, Σ̄^{(i)}, and KL^{(i)} along the lineage of an agent whose initial strategy is π_F^{(0)} = (0.9, 0.1). We verify ∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)} and find that the KL term is not negligible, as expected. We also observe that Σ̄^{(i)} ≈ KL^{(i)} as i increases.
We finally simulate the model shown in Fig. VIII.1 (j). In this model, the environmental states 0 and 1 are similar, whereas state 2 is dissimilar from them; we therefore call it the combined model. In this model, the optimal strategy π*_F = (0.72, 0, 0.28) is a combination of bet-concentrating on X' = {0, 2} and bet-balancing over X'. Fig. VIII.1 (k) shows ∆_ac λ^{(i)}, ∆_ex λ^{(i)}, Σ̄^{(i)}, and KL^{(i)} along the lineage of an agent whose initial strategy is (0.05, 0.15, 0.8). We can see that ∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)}, and we also observe that the KL term is not negligible. Since the variance term drops faster than the KL term, an agent acquires the bet-concentrating strategy first and the bet-balancing strategy afterwards. This interpretation is also supported by the strategy π_F^{(5)} = (0.31, 0.04, 0.65) just before the fifth update, when the variance term becomes negative for the first time: the strategy is almost concentrated on X' = {0, 2}, but it is not yet bet-balancing in X' since π_F^{(5)}(0) and π_F^{(5)}(2) are far from the optimal probabilities π*_F(0) and π*_F(2), respectively. To check the validity of the FF-thm beyond one lineage, we compare ∆_ac λ^{(1)} and ∆_ex λ^{(1)} for agents whose initial strategies are generated uniformly at random (Fig. VIII.1 (l)). We observe that ∆_ac λ^{(1)} ≈ ∆_ex λ^{(1)} for most of the random strategies.
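Beyond reproducing the reported trends, the core claim that repeated updates π_F ← π̄_B (α = 1, τ_est → ∞) climb the population fitness toward the optimum can be checked in a minimal two-type, two-state model. The fitness values are illustrative, not the paper's models; in this symmetric setting the optimal strategy is the balanced one (0.5, 0.5).

```python
import math

W = [[4.0, 0.5], [0.5, 4.0]]   # hypothetical e^{k(x,y)}
Q = [0.5, 0.5]

def lam(pi):
    """Population fitness sum_y Q(y) log sum_x e^{k(x,y)} pi(x)."""
    return sum(q * math.log(W[0][y] * pi[0] + W[1][y] * pi[1])
               for y, q in enumerate(Q))

def retro(pi):
    """Retrospective distribution pi_B averaged over the environment."""
    out = [0.0, 0.0]
    for y, q in enumerate(Q):
        z = W[0][y] * pi[0] + W[1][y] * pi[1]
        for x in range(2):
            out[x] += q * W[x][y] * pi[x] / z
    return out

pi = [0.9, 0.1]
gains = []
for _ in range(100):
    new = retro(pi)          # ancestral-learning update with alpha = 1
    gains.append(lam(new) - lam(pi))
    pi = new
```

Along the trajectory every update gain is non-negative, and the strategy converges to the optimum (0.5, 0.5) with λ = log 2.25, consistent with the update pointing along the gradient of λ.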

IX. TRADEOFF BETWEEN LEARNING RATE AND UPDATE INTERVAL
The FF-thm for ancestral learning was derived for α = 1 and τ_est ≫ 1. To address other situations, especially those where τ_est is not large, we further extend the FF-thm for ancestral learning to learning rates α < 1.0 and show that there is a trade-off relation between α and τ_est. First, we define an α-log-covariance by generalizing Eq. (VI.10).

[Caption of Fig. VIII.1: (a-c) The constant environment model (a). Note that Σ̄^{(i)} = ∆_ex λ^{(i)} (Eq. (VII.2)) when the environment is constant. At each update, we observe ∆_ac λ^{(i)} ≈ Σ̄^{(i)} (b); the dotted black line represents ∆λ = 0. Comparison between ∆_ac λ^{(1)} and Σ̄^{(1)} for randomly generated initial strategies (c): for most strategies, ∆_ac λ^{(1)} ≈ Σ̄^{(1)} when the learning rate is α = 1.0 or α = 0.1. (d-f) The similar environment model (d). Illustration of the representation of environmental state y in R^X by the embedding vector F_y (e); the x-th component of F_y is e^{k(x,y)}. The environmental states are similar since the two embedded vectors point in similar directions. The optimal strategy π* in this model is bet-concentrating: geometrically, the strategy lies on the red dotted line Σ_{x∈X} π_F(x) = 1 and the optimum is on the axis corresponding to the red type, so the strategy moves toward π* along this line. Comparison between ∆_ac λ^{(i)}, Σ̄^{(i)}, KL^{(i)} (Eq. (VII.4)), and ∆_ex λ^{(i)} (f); since the FF-thm for non-constant environments (Eq. (VI.13)) has the additional KL term, KL^{(i)} and ∆_ex λ^{(i)} are shown in addition to the quantities in (b). The FF-thm holds and the variance term dominates. (g-i) The dissimilar environment model (g). Illustration of the embeddings of the environmental states in this model (h); the states are dissimilar, and the optimal strategy is bet-balancing, lying at the middle of the red dotted line. The same comparison as (f) in this model (i): the FF-thm holds, the KL term is not negligible, and eventually Σ̄^{(i)} ≈ KL^{(i)}. (j-l) The combined model (j). Since the variance term drops earlier than the KL term, an agent learns bet-concentrating first and then acquires bet-balancing (k). The same comparison as (c) in this model (l): for most strategies, ∆_ac λ^{(1)} ≈ ∆_ex λ^{(1)} when α = 1.0 or α = 0.1.]
By using this quantity, we obtain the extended FF-thms for α < 1.0 (Eqs. (IX.2) and (IX.6)). To check the validity of the FF-thm (Eq. (IX.6)) for α < 1.0, we simulate the constant environment model (Fig. III.1 (a)) and the combined model (Fig. VIII.1 (j)) with learning rate α = 0.1. We compare ∆_ac λ^{(1)} and ∆_ex λ^{(1)} for agents whose initial strategies are generated uniformly at random (Fig. VIII.1 (c,l)). We observe that ∆_ac λ^{(1)} ≈ ∆_ex λ^{(1)} for most of the random strategies.
When τ_est < ∞, the FF-thm (Eq. (IX.2)) does not hold exactly, and ∆_ac λ^{(i)} < ∆_ex λ^{(i)}. Owing to the finite update interval, the ancestral information j_est fluctuates around its expectation π̄_B, which decreases the actual gain: E[λ(π_F^{(i)})] ≤ λ(E[π_F^{(i)}]) by the concavity of λ and Jensen's inequality. When τ_est is sufficiently large (but still finite), we can quantify this decrease by

∆_ac λ^{(i)} ≈ ∆_ex λ^{(i)} + (α²/2) Tr(I_λ V).  (IX.7)

Here, Tr(A) is the trace of a matrix A, V is the covariance matrix of j_est, and I_λ is the Hessian of λ. See Appendix XI F for the proof. We note that the second term is non-positive due to the negative semidefiniteness of I_λ, which follows from the concavity of λ. Since V is of the order 1/τ_est, the deviation α² Tr(I_λ V)/2 from the FF-thm for τ_est = ∞ is negligible if the learning rate is sufficiently small compared to the update interval, i.e., α²/τ_est ≪ 1. Thus, there is a trade-off between α and τ_est in relation to the efficiency of learning.
In Section VI, we mainly focused on the case of τ_est = ∞ to make the FF-thm (Eq. (IX.2)) intuitive. However, a short τ_est is realistic and might be beneficial in both biological and engineering systems. The benefit of a short τ_est is that an agent has more opportunities for acceleration by updating its strategy; the drawback is that the acceleration by each update becomes smaller due to the fluctuation of j_est around its expectation π̄_B. Equation (IX.7) indicates that this decrease is of the order α²/τ_est. It implies that an agent can keep the decrease small by adopting a small α compared to τ_est, although such a small learning rate makes learning slow (Eq. (IX.2)). In other words, a decrease in the memory size τ_est can be compensated by a decrease in the learning rate α. Since the decrease of the acceleration (Eq. (IX.7)) depends on the second power of α but only on the first power of 1/τ_est, an agent might prefer a pair of small α and short τ_est to a pair of large α and long τ_est. Indeed, we numerically showed in Section III that ancestral learning accelerates the evolutionary process with small α = 0.01 and short τ_est = 1. In such a situation, our extended FF-thm is insightful because the deviation (Eq. (IX.7)) is small.
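The α²/τ_est scaling of the deviation can be checked without Monte Carlo noise by taking the exact expectation over the binomial fluctuation of j_est in a two-type constant environment. The fitness values are illustrative, and the convex-combination form of the update with rate α is an assumption as before.

```python
import math

w = [math.exp(1.0), math.exp(0.2)]   # hypothetical individual fitnesses e^{k(x)}
pi_F = [0.5, 0.5]
# retrospective probability of type 0
pB = w[0] * pi_F[0] / (w[0] * pi_F[0] + w[1] * pi_F[1])

def lam(p0):
    """Population fitness of the two-type strategy (p0, 1 - p0)."""
    return math.log(w[0] * p0 + w[1] * (1.0 - p0))

def deficit(alpha, tau):
    """Exact Jensen gap between the noiseless update and the expected gain
    when j_est is built from tau ancestors, i.e. Binomial(tau, pB)/tau."""
    mean_lam = 0.0
    for c in range(tau + 1):
        prob = math.comb(tau, c) * pB**c * (1.0 - pB)**(tau - c)
        p0 = (1.0 - alpha) * pi_F[0] + alpha * (c / tau)
        mean_lam += prob * lam(p0)
    noiseless = lam((1.0 - alpha) * pi_F[0] + alpha * pB)
    return noiseless - mean_lam

d = deficit(0.1, 100)
```

Doubling α roughly quadruples the deficit, while doubling τ_est roughly halves it, matching the α²/τ_est behavior of the deviation term in Eq. (IX.7).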

X. DISCUSSION
In the present paper, we investigated the acceleration of the evolutionary process by learning. We first numerically showed that ancestral learning can accelerate the evolutionary process. We next proved that an agent can estimate the gradient of the population fitness from the ancestral information j_est without communication between agents. We then quantified the acceleration by extending the FF-thm to ancestral learning and revealed that the gain of the population fitness by ancestral learning is connected to the log-variance of the individual fitness under the strategy. We finally derived the trade-off relation between the learning rate and the update interval. Overall, we have established a theoretical framework to characterize and evaluate the impact of learning on evolutionary processes.
However, there remain several factors that might be useful for agents to learn but that we have not considered. One is the type of the parent. While an agent with ancestral learning uses the ancestors' types j_est, it does not use the type of its parent directly. Such direct dependence on the parent might be beneficial when the environmental state is strongly correlated with the previous state. When the type x of an agent depends on the type x' of its parent, the type-switching strategy should be modeled as a Markov transition T_F(x | x') instead of the distribution π_F(x). Promising techniques for this generalization are the large deviation theory and the variational representation, which played important roles in the present paper, for Markov chains in random environments [21, 22].
Another factor is communication between agents. Although we showed that an agent can estimate the gradient without communication, learning with such information might accelerate the evolutionary process further than ancestral learning alone. The acceleration by ancestral learning becomes small when the update interval τ_est is short due to the fluctuation of j_est (Eq. (IX.7)); communication between agents might be useful to suppress such fluctuation.
The last factor is sensing of the environmental state. In the context of population dynamics, researchers have considered the situation where an agent receives a sensing signal z about the environmental state y and then expresses its type by a signal-dependent strategy π_F(x | z) [4-7]. Since sensing is another form of information processing, we should consider the unification of sensing and learning to understand the significance of information processing for organisms. In such a setting, an agent might attain the optimal strategy π*_F(x | z) via extended ancestral learning. Also, such a sensing signal might improve ancestral learning. To achieve this unification, we need a theory that can integrate the prospective and retrospective information obtained by sensing and learning.

ACKNOWLEDGMENTS
The first author is supported by JSPS Research Fellowship Grant Number JP19J22607 and JST ACT-X Grant Number JPMJAX190L. This research is supported by JSPS KAKENHI Grant Numbers 19H05799 and 19H03216 and by JST CREST JPMJCR1927 and JPMJCR2011.

SOURCE CODE AVAILABILITY
The source code for the simulations is available at https://github.com/so-nakashima/learning_in_growing_systems. The simulations were written in C++17 with Boost 1.71.0 (version macro 107100) and run on Windows Subsystem for Linux 2. For the other plots, we used matplotlib-cpp (https://github.com/lava/matplotlib-cpp), which requires Python 3; we used Python 3.8.5.

XI. APPENDIX

A. Proof of the variational principle

The proof is a special case of [7, 28]; we give it here for the completeness of the paper. For a fixed y ∈ Y and an arbitrary distribution π over X,

log Σ_{x∈X} e^{k(x,y)} π_F(x) = log Σ_{x∈X} π(x) (π_F(x)/π(x)) e^{k(x,y)}.  (XI.1)

By applying Jensen's inequality, we have

log Σ_{x∈X} e^{k(x,y)} π_F(x) ≥ ⟨k(x,y)⟩_π − D[π ∥ π_F].

By substituting π(x) with π_B(x | y), we can see that the equality is attained. Therefore,

log Σ_{x∈X} e^{k(x,y)} π_F(x) = max_π { ⟨k(x,y)⟩_π − D[π ∥ π_F] }.  (XI.5)

B. Proofs of Eqs. (V.6) and (V.8)

The proof is essentially the same as [28]. Since the maximizer of the right-hand side of Eq. (XI.5) is π_B(x | y),

log Σ_{x∈X} e^{k(x,y)} π_F(x) = ⟨k(x,y)⟩_{π_B(·|y)} − D[π_B(· | y) ∥ π_F].  (XI.6)

We differentiate both sides with respect to π_F(x), taking into account the dependence of π_B on π_F. Since π_B is the maximizer of the functional F, the derivative of F at π_B is zero, and consequently the second term vanishes. Therefore, only the explicit dependence on π_F(x) remains. By taking the average with respect to Q(y), we have Eq. (V.6).

We next prove Eq. (V.8) via the method of Lagrange multipliers. For sufficiently small ε, we need to solve the following linearized optimization: maximize

Σ_{x∈X} (∂λ/∂π_F(x)) δπ(x)  (XI.10)

under the constraints Σ_{x∈X} δπ(x) = 0 and D[π_F ∥ π_F + δπ] = ε. For a sufficiently small ε, we can approximate D[π_F ∥ π_F + δπ] by using the Fisher information matrix [36] as

D[π_F ∥ π_F + δπ] ≈ (1/2) Σ_{x∈X} δπ(x)² / π_F(x).

Here, the Fisher information matrix is the |X| × |X| diagonal matrix with diagonal entries {1/π_F(x)}_{x∈X}. By using this approximation, we form the Lagrangian L. By differentiating L with respect to δπ(x), we obtain the stationary condition (Eq. (XI.14)) for all x ∈ X. By multiplying both sides of Eq. (XI.14) by π_F(x) and summing over x ∈ X, we can determine one of the multipliers; here we used Σ_{x∈X} δπ(x) = 0. By rearranging Eq. (XI.14) and substituting the multiplier, we obtain Eq. (V.8).

C. Proofs of the FF-thms

We first prove Eq. (VI.5) for the completeness of the paper. By direct calculation,
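The variational principle proved in Appendix XI A can be verified numerically: the functional ⟨k⟩_π − D[π ∥ π_F] evaluated at the retrospective distribution π_B reproduces log ⟨e^k⟩_{π_F} exactly, while any other distribution gives a smaller value. The fitness values and strategy below are illustrative.

```python
import math

k = [1.0, 0.2, -0.5]            # hypothetical log fitnesses k(x, y) for a fixed y
pi_F = [0.5, 0.3, 0.2]

# left-hand side of the variational principle: log <e^k>_{pi_F}
lhs = math.log(sum(math.exp(ki) * q for ki, q in zip(k, pi_F)))

def functional(pi):
    """<k>_pi - D[pi || pi_F], the objective of the variational principle."""
    return sum(p * (ki - math.log(p / q))
               for p, ki, q in zip(pi, k, pi_F) if p > 0)

# the maximizer: retrospective distribution pi_B(x) proportional to e^{k(x)} pi_F(x)
wts = [math.exp(ki) * q for ki, q in zip(k, pi_F)]
z = sum(wts)
pi_B = [wv / z for wv in wts]
```

At π = π_B the Jensen bound is tight, so `functional(pi_B)` equals `lhs` to machine precision, and any suboptimal distribution (e.g. the uniform one) yields a strictly smaller value.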