Abstract
The engineering of living cells able to learn algorithms by themselves, such as playing board games (a classic challenge for artificial intelligence), would allow complex ecosystems and tissues to be chemically reprogrammed to learn complex decisions. However, current engineered gene circuits encoding decision-making algorithms have failed to implement self-programmability and instead require supervised tuning. We present a strategy for engineering gene circuits that rewire themselves by reinforcement learning. We created a scalable, general-purpose library of Escherichia coli strains encoding elementary adaptive genetic systems capable of persistently adjusting their relative expression levels according to their previous behavior. Our strains can learn to master 3×3 board games such as tic-tac-toe, even when starting from a completely naïve state. We provide a general genetic mechanism for the autonomous learning of decisions in changeable environments.
One-Sentence Summary We propose a scalable strategy to engineer gene circuits capable of autonomously learning decision-making in complex environments.
Animal brains are powerful decision-making devices able to learn by themselves using reinforcement learning. Computational design methods have been used to experimentally implement biological adaptive behaviors(1-7), but not advanced decision making, which has instead been achieved artificially, using physical and chemical systems to engineer memory units with neural-network computational capabilities(8-14). Engineered gene circuits could endow living cells with decision-making capabilities, although their reprogramming has focused on modifying the encoding DNA, for example by mutating and recombining regulatory regions(5, 15). Such adaptation requires the directed rewiring(16, 17) of gene circuits, demanding the precise adjustment of every interaction, as when an experimenter computes a steepest-descent method(13) to infer the needed experimental adjustments. This hampers the engineering of large systems, where the individual adjustment of parameters would be impractical.
To dramatically simplify the training of a gene network towards a complex behavior, irrespective of its size, the computation and implementation of the required adaptation should be encoded in the gene network itself. We therefore propose a genetic strategy in which gene circuits autonomously rewire themselves towards a targeted behavior by shifting plasmid heteroplasmy. We test the capability for learning complex decision-making in living bacteria through the playing of board games, a common decision-making benchmark in artificial intelligence(18).
Inducible antibiotic resistance markers allow the adaptation of co-encoded genes via shifting plasmid ratios
We created a library of plasmids carrying 9 possible inducible promoter systems (Fig. 1A), which we transformed into the Escherichia coli DH10B Marionette strain (providing 12 chemically-driven promoters with minimal cross-talk activation) after we cured it of its former chloramphenicol resistance(19).
We co-transformed the cells with a mixture of two almost identical multi-copy plasmids, P1 and P2, both maintained by an ampicillin resistance gene (AmpR). We designed the P1 and P2 plasmids to stabilize their copy-number ratios within a cell by giving them the same length(20) (including the length of the fluorescent markers), a common promoter and translational insulator sequence, and the same medium-copy replicon. P1 and P2 encoded inducible operons for fluorescent proteins followed polycistronically by fusions of antibiotic resistance proteins (KanR/CmR for kanamycin/chloramphenicol resistance) or corresponding non-functional “dead” forms (dKanR/dCmR) (Fig. 1A). The CmR and KanR genes allow antibiotic selection of cells with higher P1 or P2 DNA copy number (denoted by a and b, respectively), as confirmed by flow cytometry experiments (Fig. S1). CmR or KanR selection thus shifts the P1:P2 plasmid ratio because the total plasmid copy number (a+b) is conserved.
We define the fraction of the P1 plasmid in cells co-transformed with both P1 and P2 plasmids (a/(a+b)) as the weight (Fig. 1A), analogously to the weights of artificial neural networks (ANNs). The distinguishable fluorescent proteins in P1 and P2 allow the weight to be accurately estimated from their fluorescence ratio alone (confirmed via qPCR, R2=0.94, and mixed-read Sanger sequencing(21), R2=0.99, Fig. S2). We therefore use fluorescence measurements to estimate the weight and refer to the corresponding values as the F-weight.
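As an illustration of this definition, the following minimal Python sketch estimates a weight from per-cell fluorescence; the channel names, the single-plasmid reference values and the simple normalization are assumptions for illustration, not the calibrated procedure validated in Fig. S2.

```python
# Minimal sketch (illustrative normalization): estimating a memregulon's
# weight a/(a+b) from the two distinguishable fluorescence channels.
# red_p1_only and green_p2_only are hypothetical reference values from
# cells carrying only P1 or only P2, respectively.

def f_weight(red, green, red_p1_only, green_p2_only):
    a = red / red_p1_only      # P1 copies, in units of the P1-only reference
    b = green / green_p2_only  # P2 copies, in units of the P2-only reference
    return a / (a + b)         # weight = a / (a + b)

f_weight(red=900.0, green=950.0, red_p1_only=1800.0, green_p2_only=1900.0)  # -> 0.5
```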
The strains co-transformed with the plasmids P1 and P2 realize a minimal gene circuit, which we call a memregulon (a contraction of memory regulon, analogous to the memristor element used in electronic circuits with neuromorphic behavior(22)), able to adjust its DNA levels according to its promoter activation and the antibiotic applied. A memregulon’s red fluorescence agrees with its weight multiplied by the red fluorescence of the corresponding P1-only cells (Fig. 1B and Fig. S16A). For instance, cells co-transformed with a memregulon of weight 0.5 are expected to show 50% of the red fluorescence per cell of cells transformed only with the plasmid P1. Usually this would require reengineering the mCherry promoter with suitable mutations. Fig. 1B shows that the 0.5-weight cells indeed have a red fluorescence per cell statistically equivalent to half of the induced and non-induced values of cells transformed only with the P1 plasmid (p<0.01), which contrasts with the inability of transcriptional regulation to lower the non-induced values.
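The expected-expression relation verified in Fig. 1B can be written as a one-line sketch; the reference fluorescence value below is a placeholder, not a measured quantity.

```python
# Assumed linear relation (Fig. 1B, S16A): expected red fluorescence per cell
# of a co-transformed culture = weight x red fluorescence of P1-only cells.
def expected_red(weight, red_p1_only):
    return weight * red_p1_only

expected_red(0.5, 1800.0)   # -> 900.0, i.e. 50% of the (placeholder) P1-only value
```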
In the following, we grow and maintain bacterial cultures on agar plates. We can measure the weights in a memregulon culture using fluorescence or DNA sequencing, always copying the cultures by replica plating to preserve the original plate (Fig. 1C). We show that the weight remains constant at the population level over many days and consecutive replica plating procedures (Fig. 1D), effectively functioning as a genetic memory system(20).
Memregulons produce gene circuits able to adapt their expression levels by self-modifying their DNA copies
Memregulon weight can be altered by culturing cells with specific antibiotics and promoter inducers. For example, kanamycin or chloramphenicol respectively decreases or increases the weight, corresponding to a reduction or an increase in mCherry fluorescence levels. On agar plates, we produce a new parental plate through replica plating onto a destination plate containing the selection antibiotic, ampicillin and the cognate inducers (Fig. 1E). We stop the antibiotic selection after a calibrated time with a subsequent replica plating using only ampicillin. We consider a memregulon activated when its promoter has been fully induced by the cognate inducer; only in this state can the memregulon show fluorescence and change its weight significantly (p<0.01, Fig. S3).
As the promoters might show small crosstalk with non-cognate inducers, we measured the change in weight in the presence of cognate and non-cognate inducers, which showed a significant variation in weight (p<0.01) only for the cognate inducer (Fig. 1F, S4 and S5). This allows different memregulons to be cultured together, where each memregulon can independently adjust its weight. Instead of manually modifying the mCherry expression levels through external manipulations of the plasmid DNA copy number, we allow the mCherry operon to persistently set its own P1 copy-number levels (i.e. mCherry expression levels) via selection pressures from kanamycin/chloramphenicol and cognate-inducer activation. This enables the local and unsupervised training of weights, a desired feature in the training of ANNs(23).
We can use the combined output (e.g. fluorescence) of a set of active memregulons for decision making. If the output is not the desired one, we then reproduce the environmental condition, activating the memregulons in the presence of kanamycin/chloramphenicol to downregulate/upregulate their expression and hence their contribution to the decision. This allows the decision-making to be trained by self-programming, through a stepwise downregulation/upregulation of the memregulons’ contribution to wrong/correct decisions.
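The training signal described above can be summarized by a toy update rule; this is a conceptual sketch rather than a mechanistic model of plasmid selection, and the step size eta is an illustrative parameter, not a measured quantity.

```python
# Toy sketch of the reinforcement rule: only active memregulons (cognate
# inducer present) change weight, moving down under kanamycin ("punish")
# or up under chloramphenicol ("reward"); weights stay within [0, 1].
def reinforce(weights, active, antibiotic, eta=0.2):
    delta = {"kanamycin": -eta, "chloramphenicol": +eta}[antibiotic]
    return [min(1.0, max(0.0, w + delta)) if on else w
            for w, on in zip(weights, active)]

# Only the first and third memregulons were activated, so only they decrease.
reinforce([0.5, 0.5, 0.5], active=[True, False, True], antibiotic="kanamycin")
```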
Choosing the highest weight among independent memregulon cocultures allows an experimental reinforcement learning algorithm to find the optimal path in decision trees
The distributed multicellular circuits (DMC)(24, 25) strategy allows us to explore whether a coculture of strains containing memregulons can learn complex decision-making. For this, we initially challenged the cultures with a mathematical problem equivalent to solving a maze (Fig. 2A), where a “rat” must find the path to the exit without backtracking. Although this corresponds to one of the simplest decision trees, it allows us to define the methodology to be used for more complex problems. Paths encounter crossings with 3 possible diversions each. Among the maze’s four crossings, the one encountered first is assigned the inducer L-arabinose (Ara) and the others the inducer 3-hydroxytetradecanoyl-homoserine lactone (OHC14). The optimal path follows diversions 1 and 2 at crossings a and b. To choose the diversion at each crossing, we set up three cocultures of two strains at a 1:1 cell ratio. Each coculture contains strains with the pBAD and pCin memregulons of different weights, encoding Marionette promoters(19) inducible by the chemicals Ara and OHC14, respectively. The ‘chosen’ diversion at crossings a and b is defined as the number of the coculture with the highest pBAD or pCin weight, respectively, which generates an integer value from a vector of analog values. The starting cultures were picked with weights such that their initial decisions were the furthest from the optimal path. The weights are measured by replica plating followed by measurement of the red and green fluorescence, adding the chemical inducer assigned to the crossing (Fig. 2B). If the two decisions (a, b) do not correspond to the unique path towards the exit, we apply a “punishment” selection using a destination plate containing kanamycin, ampicillin, and the inducers Ara and OHC14. We repeat this cycle of measurement and negative reinforcement learning twice, until no more kanamycin selection is needed because the memregulons have modified their weights to encode the output of the optimal path (Fig. 2C). Controls where the learning is done with either swapped antibiotics or swapped inducers show no change in decisions (Fig. S6).
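A compact sketch of the maze decision rule follows; the F-weight values are placeholders, and the 1-based diversion numbering mirrors the description above.

```python
# At each crossing, the chosen diversion is the index (1-3) of the coculture
# whose memregulon cognate to that crossing's inducer shows the highest
# F-weight after induction (winner-take-all over analog weights).
def choose_diversion(f_weights):
    return max(range(len(f_weights)), key=lambda i: f_weights[i]) + 1

# Example: pBAD F-weights of cocultures 1-3 measured under Ara at crossing a.
choose_diversion([0.31, 0.62, 0.45])   # -> 2
```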
Generalizing the experimental reinforcement learning algorithm allows finding the optimal strategy in the tic-tac-toe game
Because memregulon cocultures stably maintain the weights of their constituent memregulons (Fig. S7), we investigated whether the use of additional promoters could scale up the complexity of problems by challenging cocultures of memregulon strains to learn to master a board game. As with early computers, we chose the familiar game of tic-tac-toe, a two-player game on a 3×3 board in which the two players (“X” and “O”) alternately occupy one vacant board position; the winner is the first player to obtain 3 matching symbols in any row, column, or diagonal.
This game was studied recently using DNA computing(11, 26), which required implementing custom 3-input logic gates with catalytic DNA. However, combinatorial gates are not necessary to implement expert players if decisions are made by choosing the highest weight (the winner-take-all, WTA, strategy), even when using only linear positive weights(27) (Fig. S8).
It is also useful to define a measure of the general skill level at a game, akin to the Elo ranking(28). Thanks to the small size of a 3×3 board, we can use a computer simulation to play all possible games. For this, we input the measured F-weights into a simulation parametrized with our experimental data (supplementary text), where we evaluate the percentage of won or drawn games (called expertise) when playing all possible matches.
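For intuition, the following is an illustrative re-implementation of the expertise score under simplifying assumptions: the trainer X is enumerated exhaustively (including, for simplicity, openings away from the center), O plays deterministically by winner-take-all, and the multi-inducer F-weight of a coculture is taken as the sum of the weights of its memregulons cognate to X's occupied positions. The parametrization of the actual simulation follows the supplementary text; this sketch is not that code.

```python
# W is a 9x9 nested list of illustrative weights: W[pos][ind] is the weight,
# at board position pos, of the memregulon induced by the chemical assigned
# to position ind. Board positions are indexed 0-8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def o_move(board, W):
    # Winner-take-all: pick the free position whose coculture shows the
    # highest (assumed additive) F-weight under the inducers of X's positions.
    x_pos = [i for i, s in enumerate(board) if s == "X"]
    free = [i for i, s in enumerate(board) if s is None]
    return max(free, key=lambda p: sum(W[p][i] for i in x_pos))

def expertise(W):
    # Percentage of all possible matches (X exhaustive, O deterministic)
    # that O wins or draws.
    total = not_lost = 0

    def play(board, x_to_move):
        nonlocal total, not_lost
        w = winner(board)
        free = [i for i, s in enumerate(board) if s is None]
        if w or not free:
            total += 1
            not_lost += (w != "X")
            return
        if x_to_move:                      # enumerate every possible X move
            for p in free:
                play(board[:p] + ["X"] + board[p + 1:], False)
        else:                              # O plays its single WTA move
            p = o_move(board, W)
            play(board[:p] + ["O"] + board[p + 1:], True)

    play([None] * 9, True)
    return 100.0 * not_lost / total
```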
As an example of how reinforcement learning can automatically train the weights to achieve a complex computation, we generalized our previous experimental learning algorithm (Fig. 2B) to two-player games. We now consider one of the players to be a trainer (player X) and the other a bacterial player. The O player consists of a set of cocultures, one for each of the board positions except the central one (played first by player X) (Fig. 3A, left). We assigned a chemical inducer to each of the 9 board positions (Fig. 3A, right). The cells play a match against an opponent by reading their F-weights through replica-plating fluorescence measurements (Fig. 3B). The experimental algorithm is as follows (see Fig. 3C): As in the maze example, the chemicals activate the memregulon promoters involved in a decision vertex (acting as a “leaf” selector in the decision tree), but now the simultaneous use of more than one inducer to measure the F-weights allows the identification of all of the opponent’s positions. The highest “multi-inducer” F-weight among the memregulon cocultures at unoccupied positions “chooses” the bacteria’s next move. After several rounds, the match finishes and, if the O player loses, we apply a negative reinforcement learning operation to the cocultures assigned to positions occupied by the O player (Fig. 3C). This updates the parental cultures, and we play new matches until player O achieves mastery (100% expertise).
An example match is outlined in Fig. 3D: After player X starts at the center (round 0), player O could move to any of the other 8 unoccupied positions and, therefore, we consider cocultures at all of them. We perform replica F-weight measurements on the cocultures by inducing them with 3-oxohexanoyl-homoserine lactone (OC6, the inducer assigned to the center position, where X has moved) and then choose the position whose coculture had the highest F-weight. In the next round, X makes another move (corresponding to the position assigned to Ara) and we inquire about O’s move by inducing the 6 cocultures (at unoccupied positions) with OC6 and Ara (the inducers of the two positions currently occupied by X), and measuring the highest F-weight among them. O loses in round 3, so we apply a negative reinforcement operation with kanamycin selection (Fig. 1E) to the cocultures at positions previously occupied by O (Fig. 3D, encircled in green), adding all the inducers corresponding to X’s moves before round 3 (OC6, Ara, and OHC14), which led to O’s losing decisions. After this learning, we have updated 3 cocultures, which become new parental plates for replica measurements, together with the 5 unchanged ones. Bacteria play new matches until a match ends in a draw and the O player achieves mastery.
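The negative-reinforcement step applied after a lost match can be sketched under the same 9×9 weight-matrix convention and additive simplification as above; the step size is again an illustrative constant rather than a measured selection strength.

```python
# Every memregulon that contributed to a losing decision (one sitting at a
# position O occupied during the match and cognate to an inducer of an
# X-occupied position added during selection) has its weight lowered,
# mimicking kanamycin selection of the activated memregulons only.
def negative_reinforcement(W, o_positions, x_positions, step=0.1):
    for p in o_positions:          # cocultures replica-plated with kanamycin
        for i in x_positions:      # only the induced memregulons change
            W[p][i] = max(0.0, W[p][i] - step)
    return W
```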
In principle, a trainer can always choose strategies avoiding draws, although we do not impose any condition on the trainer. Two bacterial players can even learn together by playing against each other. To test this, we set up 2 cocultures of 2 memregulon strains, both chosen to have some knowledge of the game (with X and O expertises of 90% and 48%, respectively) and able to achieve mastery in a few learning operations. We performed a tournament of memregulon cocultures playing among themselves, applying positive or negative reinforcement to the players winning or losing matches. Both cocultures reached mastery after one match (Fig. S9).
A random player of 9-memregulon cocultures learns to master tic-tac-toe by playing with reinforcement learning
We asked whether our experimental algorithm could train a naïve bacterial player O (playing uniformly at random) to master the tic-tac-toe game. We chose the bacterial cultures to be the second player because the naïve player had a low starting expertise (20%). The starting cocultures (denoted by O0) consist of the same 9 memregulon strains at equal cell ratios and equal weights at all 8 positions (Fig. 4A); this experimentally implements a random player because all positions have the same cultures, so interrogating for the highest weight would give a random position. We performed all the experiments in 3 biological replicates:
O plays a tournament against a trainer player X (decision matrix in Table S1). The first match lasts for 5 rounds and ends with O losing. We show in Fig. 4B the F-weights of the cocultures at allowed positions as filled red circles (containing the F-weight value multiplied by 100). Their highest value represents O’s decision (O’s move in the next round). Player X wins the match, which triggers a negative reinforcement (L1) of the O0 cocultures at the 4 positions occupied by O in round 4, producing O1. The weight decreases at those positions, and the measurement of O1 in round 0 shows that the position with the highest F-weight has changed, implying a different decision (Fig. 4B). After each learning, we also compute the expertise of each of the biological replicates (Fig. 4C). Tables S2-S11 detail the computation of the O player’s expertise after each learning, by showing the results of using the measured F-weights to play every possible tic-tac-toe match. The cocultures continue the matches, losing each time in a different way and undergoing a negative reinforcement learning operation (L2 to L7) each time, which further changes the cocultures (O2 to O8). The expertise did not increase monotonically, but it reached 100% for all replicates in O8. We also validated the mastery by letting the cocultures play against an expert automaton (Fig. 4B).
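The tournament procedure (O0 to O8) amounts to the following condensed training loop; play_match_against_trainer stands for the replica-plating match protocol and is a hypothetical placeholder, while negative_reinforcement and expertise refer to the sketches above.

```python
# Condensed sketch of the training tournament: the player keeps playing the
# trainer and receives a negative-reinforcement update (L1, L2, ...) after
# every lost match, until exhaustive simulation reports 100% expertise.
def train_until_mastery(W, play_match_against_trainer,
                        negative_reinforcement, expertise, max_learnings=20):
    for learning in range(max_learnings):
        if expertise(W) >= 100.0:
            return W, learning                 # mastery reached
        o_pos, x_pos, lost = play_match_against_trainer(W)
        if lost:
            W = negative_reinforcement(W, o_pos, x_pos)
    return W, max_learnings
```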
Although the O8 cultures acquired their mastery by playing 8 games, they have the capability to win arbitrary matches (Fig. 4C, Fig. S10). As a positive control, we performed a single steepest-descent-like operation to manually train the weights, according to a computational calculation, to obtain an expert player (Osd) (supplementary text, Fig. S11). Two alternative learning tournaments were performed as negative controls, starting from O7; using either negative reinforcement with a swapped inducer (O7a) or chloramphenicol instead of kanamycin (O7b) did not improve the expertise, as the player lost against the expert automaton (Fig. 4A). We also verified that the cocultures maintained their expertise over time after cold storage (at 4 °C or −80 °C) (Fig. 4D, S12). Reinforcement learning also allowed naïve bacterial cocultures to reach mastery when acting as the first player (Fig. S13).
Memregulon cocultures can also learn to master arbitrary 3×3 board games
To explore the capacity of a consortium of memregulon strains to learn arbitrary algorithms, we performed computer simulations of cocultures of 9 memregulons at every position of a 3×3 board except the center, showing that they can learn, in fewer than 35 cycles (Fig. 4E), 98% of the possible games on this board (Fig. S14A). Moreover, pushing the limits of learning, they could even master more than one game simultaneously, although not always (Fig. S14B). In some cases, we found that such repeated learning tournaments required so many negative reinforcement steps that some weights vanished (Fig. S15A). If a weight vanishes, the P1 plasmid is lost, and with it the ability to store a memory, because a mixture of P1 and P2 is no longer possible. To rescue this, we add to the experimental algorithm an operation that we call memregulon fusion. For this, we mix each memregulon strain culture with another one that contains the same memregulon at a weight of 0.5. This mixing operation changes all weights by averaging each of them with 0.5. The averaging maintains the position with the highest weight, and therefore the player’s expertise (Fig. 4F), while increasing the weights smaller than 0.5 (Fig. S15B).
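A tiny sketch makes explicit why the fusion operation preserves decisions; the numerical weights are placeholders.

```python
# Memregulon fusion: a 1:1 mix with the same memregulon at weight 0.5
# averages every weight with 0.5. The map w -> (w + 0.5)/2 is strictly
# increasing, so the highest-weight position (the decision) is unchanged,
# while a vanished weight (0.0) is rescued to 0.25.
def fuse(weights):
    return [(w + 0.5) / 2.0 for w in weights]

before = [0.0, 0.7, 0.3]
after = fuse(before)               # every weight moves halfway towards 0.5
assert after.index(max(after)) == before.index(max(before))
```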
To allow our experimental learning optimization to converge towards mastery on arbitrary games, we need to avoid non-expert players becoming trapped in draws where no more learning occurs. For this, we further extended the experimental learning algorithm by applying reinforcement learning using chloramphenicol (instead of kanamycin) for selection. After the last match in which a negative reinforcement was applied, we incubated the cocultures with chloramphenicol and the inducers used in the match. We call this reinforcement “unlearning”, mirroring a similar concept from machine learning(29). After one round of unlearning, the bacteria altered their decisions and therefore their expertise also changed, thus avoiding being trapped in draws (Fig. 4F).
Discussion
We can better appreciate the computational power of our memregulon cocultures by identifying them with a single-layer artificial neural network of three 2-input neurons (maze example) or nine 9-input neurons (3×3 board games), with the only non-linearity coming from a winner-take-all (WTA) interaction among the neurons (the decision on the highest weight). Such networks can be universal function approximators, even when using positive weights exclusively(27). The change of a weight only when a memregulon is active is central to learning. This follows Hebb’s idea(30) that the change in synaptic strength (weight) should be proportional to the presynaptic cell activity and to a function of the postsynaptic cell activity. Long-term potentiation and long-term depression would correspond to a weight increase and decrease, respectively(31).
Moreover, similarly to neuromodulated synaptic plasticity, because the change of weights requires memregulon activity together with either kanamycin or chloramphenicol, these antibiotics act as neuromodulatory signals(32).
Memregulons also allow the construction of gene circuits with predefined behaviors because the red fluorescence per cell correlates linearly with the weight (Fig. S16A). Although positive and negative reinforcement learning could be thought of as equivalent to positive and negative selections in directed evolution(33), here we do not have mutations, which allows a smoother, faster and reversible traversal of the phenotypic landscape. Memregulons maintained their weight in solid cultures across many days, suggesting the possibility of using them in ecosystem-level gene circuits(34). It could be possible to enrich the computational capability by using different promoters in P1 and P2 (Fig. S16B), providing a mechanism to adapt the topology of gene circuits(35). Further developments could involve genetically encoding the computation of the maximum output among positions(36), negative selection markers(37), CRISPR to cleave(20) or regulate(38) the plasmid copies, engineered RNA replicons(39), engineered microbial ecosystems(40), as well as adding an extra memregulon library to each player, designed to receive the output of the first library through a cell-cell communication system, mimicking a hidden layer in a neural network. This would enable the processing of more complex information and, therefore, the learning of more advanced algorithms.
Adaptive gene circuits could already exist in prokaryotic or eukaryotic systems as a non-Darwinian adaptation tool(41). Heterozygotic mutations in multicopy plasmids(42), in polyploid Archaea(43) or in mitochondrial DNA (microheteroplasmy)(44) maintain the intracellular ratios of wild-type to mutant copies. As a mutation in a regulated growth-altering gene could suffice to set up reinforcement learning, it may be possible to identify memregulons in nature by mapping environmental conditions, genes, inducible promoters, and selection markers together with their inactivating mutations. This mapping would in fact establish a language for “teaching” algorithms to these cells. Reinforcement learning with memregulons provides a strategy for the unsupervised adaptation of complex gene circuits with a large, unknown number of interactions, which will allow the engineering of genetically encoded general-purpose computational devices capable of self-learning, opening the way to the engineering of synthetic living artificial intelligence.
Funding
Ministerio de Ciencia e Innovacion PID2020-118436GB-I00 (AJ)
BBSRC BB/P020615/1 (MI, AJ)
EPSRC-BBSRC grant BB/M017982/1 (AJ)
EU grant 610730 (AJ)
School of Life Sciences departmental allocation, Keele University (RG)
Volkswagen Foundation grant LIFE 93 065 (MI)
Author contributions
Conceptualization: AR, AJ
Software: AR, AJ
Formal analysis: AR, AJ
Methodology: AR, SP, AJ
Investigation: AR, SP, CV, MW, RG, AJ
Visualization: AR, AJ
Supervision: AJ
Writing – original draft: AR, AJ
Writing – review & editing: AR, SP, CV, MW, RG, MI, AJ
Competing interests
Authors declare that they have no competing interests.
Data and materials availability
All data are available in the main text or the supplementary materials.
Supplementary Materials
Materials and Methods
Supplementary Text
Figs. S1 to S19
Tables S1 to S13
References (1–44)
Data S1 to S15
Acknowledgments
We acknowledge M. Kushwaha, M. Fuegger and T. Nowak for discussions.