A methodology combining reinforcement learning and simulation to optimize biofabrication protocols applied to the simulated culture of epithelial sheets

Tissue Engineering and Regenerative Medicine aim to replicate and replace tissues for curing disease. However, full tissue integration and homeostasis are still far from reach. Biofabrication is an emerging field that identifies the processes required for generating biologically functional products with the desired structural organization and functionality. It can potentially revolutionize the regenerative medicine domain, which aims to use patients' cells to restore the structure and function of damaged tissues and organs. However, biofabrication still has limitations in the quality of processes and products. Biofabrication processes are often improved empirically, but this is slow, costly, and provides partial results. Computational approaches can tap into biofabrication's underused potential, supporting analysis, modeling, design, and optimization of biofabrication processes, speeding up their improvement towards higher product quality and, subsequently, higher clinical relevance. This work proposes a reinforcement learning-based computational design space exploration methodology to generate optimal protocols for the simulated fabrication of epithelial sheets. The optimization strategy builds on a deep reinforcement learning algorithm, the Advantage Actor-Critic (A2C), which uses a neural network model for learning, while simulations rely on the PalaCell2D simulation framework. Validation demonstrates the proposed approach on two protocol generation targets: maximizing the final number of obtained cells and optimizing the spatial organization of the cell aggregate.

Preprint submitted to Journal of Computational Science, May 15, 2023


Introduction
Biofabrication is the automated generation of biologically functional products with a structural organization from living cells and biomaterials through bioprinting or bioassembly and subsequent tissue maturation processes [1,2].
Biofabrication includes a rich set of methodologies to recapitulate the structural and functional organization and heterogeneity of biological parts, from cell aggregates to tissues, with the challenging goal of fabricating organs [2].
Tissue Engineering (TE) applications to Regenerative Medicine (RM) aim to use patients' cells to restore the structure and function of damaged tissues and organs [1]. Few products undergo clinical translation, and full tissue integration and homeostasis are still far from reach [3]. In these domains, the complexity of target biological systems and the intrinsic uncertainty in their behavior limit the capability to design optimal products and reliably control their structure and function during biofabrication [4], limiting the products' clinical relevance. The design space of a biofabrication process is a multidimensional region where input variables and process parameters determine the final product quality [5]. Hence, implementing an effective Design Space Exploration (DSE) of a biofabrication process is challenging and time-consuming [4]. Empirical DSE has high costs in terms of time and resources. Nevertheless, advancements in the field most often depend on human effort or luck. Automation and digitalization increase the yield of experimental campaigns [6] and thus of DSE. However, they still implement an empirical approach, which fails to support a thorough exploration even if experiments are parallel, faster, and more reliable. Beyond inefficient brute-force screenings, experimental designs of biological protocols can rely on established methodologies to optimize the usage of resources [7], such as One Factor At a Time (OFAT) [8,9], factorial experimental design [10], and Design of Experiment (DoE) computational techniques [11,9]. The latter supports research design by enabling efficient, systematic exploration and exploitation of complex design spaces and can tackle multi-factorial problems [12,13,14]. Nevertheless, DoE approaches show limitations in supporting complex processes such as biofabrication due to the enormous number of experiments required to map the design space and to build predictive models from data.
Computational approaches and Artificial Intelligence (AI) can support the design and optimization of biofabrication processes beyond the limitations of the empirical approach [15,16]. In particular, computational modeling and optimization exploit the power of computation to explore system behavior and to conduct systematic DSE, accounting for the relations between complex parameter configurations and process outputs. To impact the biofabrication domain, computational approaches must rely on white-box models of the physical process to explicitly represent the relations between the generated protocols and the biological mechanisms modeled [17]. Nevertheless, most computational approaches to biofabrication overlook biological complexity and primarily focus on the non-living components of the process. Models that express the complexity of the physical process have huge state spaces [18]. Model complexity requires computational DSE to rely on metaheuristics for computational feasibility [19]. Few approaches combine computational DSE and white-box modeling of biological complexity. At the same time, the potential combinations of different DSE and white-box simulation approaches are countless. Considering this, an essential first step is to assess the computational feasibility and the relevance of results for different combinations.
Previous work provided a model-based Optimization via Simulation (OvS) approach [20], where optimization relied on a Genetic Algorithm (GA) [21] and orchestrated the white-box simulation of a discrete model of the biofabrication of epithelial sheets [22]. This work continues the exploration of different DSE and white-box simulation approaches, with the application of Machine Learning (ML) and the simulation of white-box models to optimize biofabrication. In particular, optimization relies on a variant of the Synchronous Advantage Actor-Critic (A2C) Deep Reinforcement Learning (DRL) algorithm [23] and orchestrates the simulation of vertex models via PalaCell2D, a simulator of epithelial sheets growth [24,25]. Validation of the proposed methodology demonstrates its ability to maximize the final number of cells obtained and optimize their geometrical configuration. This work further assesses the computational feasibility and biological relevance of different combinations of computational DSE and white-box simulation approaches to optimize biofabrication. This computational approach aims to overcome limitations posed by empirical DSE by providing rational guidance to experimental campaigns, supporting innovation and quality standards in biofabrication processes, and increasing clinical and industrial translatability of the resulting products.
The remainder of the paper illustrates the state of the art (subsection 1.1), the proposed methodology (section 2), the validation strategy and experimental results (section 3), and future directions for advancing this research (section 4).

Models of biological complexity
The relations between the generated protocols and the biological mechanisms modeled must be explicit for the optimization to improve its performance and to provide an explanation, rooted in biological knowledge, for the provided results. Thus, a white-box modeling approach is preferable, but it has the drawback of higher model complexity, which implies long computational times for simulations.
Several white-box modeling approaches for biological complexity exist, including continuous, discrete, or hybrid formalisms [26,27]. Ordinary Differential Equations (ODEs) are frequently used in systems biology. In contrast, other approaches, such as Agent-Based Models (ABMs) and discrete models, have great potential to enrich the toolkit of modelers in computational biology [28,29], yet are less frequent, also due to their lack of standardization [30].
High-level Petri Nets (PN) models are accessible, formally complete, and naturally support modeling of complex biological processes [31] such as ontogenesis [32,33] or the relations between a host organism and the microbiota [34,35].
Other approaches, such as vertex models [36], allow capturing epithelial sheets' dynamic behavior.

Model-based Design Space Exploration
OvS approaches combine simulation and optimization to systematically explore the behavioral trajectories of a model under different configurations of parameters [20]. Considering that each trajectory corresponds to a biofabrication process design, OvS supports the computational generation of simulated biofabrication protocols to test in the laboratory, prioritizing the most informative experiments to optimize the process.
Two main categories of OvS approaches exist [20]. On the one hand, metamodel-based OvS uses a metamodel to estimate the input-output relations of the simulation model [37]. Substituting simulation models with metamodels in OvS has the advantage of reducing computational times [38].
However, this has a cost in terms of accuracy [39] and explainability of results.
Metamodel-based OvS of biofabrication mainly targets the non-living parts of the process. For example, ML approaches proved helpful in optimizing bioprinting [40,41], also by finding the best combination of bio-inks and scaffolds through multi-objective optimization [42]. Deep Learning (DL) proved able to optimize a bioprinting process, predicting optimal configurations of printing parameters for specific objectives [43,44,45]. Fewer metamodel-based OvS approaches account explicitly for the living parts in biofabrication. For example, DL predicted functional activations from different environmental stimuli [45]. Also, an Artificial Neural Network (ANN) predicted the induction of osteogenesis from biomechanical loading parameters [46]. However, these approaches rely on black-box models of the simulated system, which fail to provide the desired explainability.
On the other hand, model-based OvS relies on the simulation of a white-box model of the process, which provides higher explainability but has high computational costs [20]. To support real-world applications, OvS must provide informative and fully explainable results. Model-based OvS techniques tackle the complexity of the corresponding design space with optimization strategies based on heuristics and meta-heuristics, combining stochastic exploration with local exploitation of optimal solutions [47]. Nevertheless, few model-based OvS approaches exist in biomanufacturing. Most of them focus on the non-living parts of the biofabricated construct and the structural setup before product maturation. For example, Finite Elements Model (FEM)-based modeling and optimization are primarily employed in biomechanical research [48,49,50,51,52], while Computational Fluid Dynamics (CFD) simulations unravel hydrodynamic patterns and study the influence of fluid dynamics on cell behavior during tissue maturation [53,54,55].
Model-based OvS approaches that explicitly model and simulate the complexity of the biological components of the process can leverage several simulation and optimization approaches. Independently of the modeling formalism chosen, comprehensive models of biological systems are complex and have huge state spaces [18], requiring model-based OvS approaches to rely on metaheuristic algorithms [19] with a search component [56] for optimization. Metaheuristic algorithms can be based either on single solutions or on populations of solutions.
On the one hand, single-solution algorithms generate and improve a single solution by local search, which implies the risk of converging to local optima.
This category includes several algorithms, such as simulated annealing [57] and tabu search [58]. On the other hand, population-based metaheuristics work with a population of candidate solutions during the search process. By fostering the diversity of solutions within the population, these algorithms reduce the risk of converging to local optima. This category includes, among others, GAs [59], Particle Swarm Optimization (PSO) [60] and Ant Colony Optimization (ACO) [61].
A previous work [22] exemplifies a model-based OvS approach combining evolutionary computation and the simulation of discrete models to computationally generate protocols for the biofabrication of human epithelial monolayers. Simulation relies on discrete models whose high abstraction level can express detailed intracellular dynamics while still simulating hundreds of cells. Still, the computational complexity of such simulations is high and requires a metaheuristic optimization approach. The latter relies on a GA that iteratively generates new protocols to simulate, evolves them, and evaluates their performance by simulation; the best-performing ones pass to the next generation until reaching optimality.
This approach has the advantage of ensuring convergence within feasible computational times.
However, this approach sticks to an offline computational DSE paradigm, performing protocol generation and simulation at separate moments. This shows limitations in a wetware-in-the-loop scenario, where protocols should dynamically adapt to the current system state given a final target. In particular, the evolutionary process only learns from the outcome of simulations: it selects the protocols that generate the best products but does not learn from any intermediate point along the simulation, discarding valuable information for the optimization process. Also, protocol generation occurs offline from simulation and thus fails to adapt to the dynamic responses of a biofabricated system in real time.
Another approach for the computational DSE of processes with large state spaces is the combination of DRL with white-box simulations, which overcomes this limitation by learning continuously through interaction with the system itself. Reinforcement Learning (RL) is an ML approach whose core is an agent capable of self-learning by interacting with an environment, where numerical rewards guide the interactions and the training process [62,63].
Given this considerable advantage, RL has broad application in optimizing real-world manufacturing processes [64]. In DRL, the capability of DL models to work as feature extractors allows the application of RL to problems with high-dimensional state spaces, such as biofabrication processes [18].

Materials and Methods
This work proposes a computational DSE methodology where an RL model trained with the A2C algorithm learns optimal protocols for a simulated biofabrication process. Fig 1 presents the implementation of the proposed methodology.
The exploration engine (Fig 1.A) relies on a single-agent variation of the A2C algorithm and an ANN as the learning model; the training process occurs through the interaction between Actor and Critic, and learning rules and rewards adapt to the particular application (Fig 1.D). The simulator interface (Fig 1.B) allows the exploration engine to interact with the simulator of the process model. The simulator of the process model (Fig 1.C) works as a learning environment (Fig 1.E) for the training process.
The training process consists of epochs. Each learning epoch aims to generate the best possible action to take the learning environment closer to the target and contains the three following steps (numbered in Fig 1).
1. Observation collection: the exploration engine reads the simulation state and provides it to the learning model as an observation.
2. Action generation: the Actor generates an action based on the observation.
3. Action application: the action is applied to the learning environment, and the Critic evaluates its effect through the reward.
Each epoch results in a generated action. At the end of the training process, the sequence of generated actions to reach the optimal target defines an optimal biofabrication protocol.
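As an illustration only, the observe-act-apply epoch loop described above can be sketched as follows. All names here (ToySimulator, run_epoch, choose_action) are hypothetical stand-ins, not the actual rLotos classes:

```python
class ToySimulator:
    """Stand-in for the PalaCell2D simulator interface (hypothetical)."""
    def __init__(self):
        self.cells = 1

    def observe(self):
        return self.cells  # a real observation would be an image-like state

    def apply(self, action):
        self.cells += action  # a real action would set e.g. comprForce

def run_epoch(simulator, choose_action, compute_reward):
    """One learning epoch: collect observation, generate action, apply it."""
    obs = simulator.observe()          # 1. observation collection
    action = choose_action(obs)        # 2. action generation
    simulator.apply(action)            # 3. action application
    return action, compute_reward(simulator)

sim = ToySimulator()
protocol = []                          # the sequence of generated actions
for epoch in range(3):
    action, reward = run_epoch(sim, choose_action=lambda o: 1,
                               compute_reward=lambda s: s.cells)
    protocol.append(action)            # the protocol is built epoch by epoch
```

The resulting `protocol` list corresponds to the sequence of generated actions that defines a biofabrication protocol at the end of training.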
The following subsections provide a detailed description of the exploration engine (subsection 2.1), the learning model (subsection 2.2), and the simulator interface (Appendix B. Simulator interface), which are the constitutive components of the proposed methodology. Since they are application-specific, the Results and Discussion section provides a more detailed description of the simulator and the process model. Finally, the Training process subsection illustrates the training process.

Exploration engine
This work relies on an RL model for learning. RL aims to find an optimal agent behavior that maximizes rewards. In particular, the exploration engine relies on Actor-Critic algorithms [65], which combine Q-learning [66] and Policy Gradients [67]. Policy gradient methods directly model and optimize a policy representing an optimal agent behavior. In Actor-Critic, the Actor component generates an action to perform on the learning environment by randomly sampling from the probability distribution of all performable actions in a given state, relying on the same objective function as the REINFORCE algorithm [68].
The policy π is usually modeled with a function parameterized by the parameter vector θ: π_θ(a|s). The value of the reward J(θ), which works as the objective function, depends on the chosen policy (see Equation 1). In Equation 1, d^π(s) is the stationary distribution of the Markov chain [69] for π_θ, a ∈ A are actions, s ∈ S are states, and V^π(s) and Q^π(s, a) are, respectively, the value of state s and the value of the (s, a) pair when following a policy π [70].
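The referenced Equation 1 is not reproduced in this copy of the text; its standard form in the policy-gradient literature, written with the symbols just defined, is:

```latex
J(\theta) = \sum_{s \in S} d^{\pi}(s)\, V^{\pi}(s)
          = \sum_{s \in S} d^{\pi}(s) \sum_{a \in A} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)
```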
The goodness of the performed action a is evaluated based on the reward J(θ) generated by its execution in the learning environment. According to the Policy Gradient Theorem [71], the derivative of the expected reward is the expectation of the product of the reward and the gradient of the log of the policy π_θ. This allows computing the objective function without considering the derivative of the state distribution d^π(s), untying calculations from the double dependency on action selection and state distribution [70]. In other words, the state distribution depends on the policy parameters, but the policy gradient does not depend on the gradient of the state distribution [72]. This dramatically simplifies the calculation of the gradient ∇_θ J(θ), as in Equation 2.
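Equation 2 is likewise missing from this copy; its standard textbook form, consistent with the statement of the Policy Gradient Theorem above, is:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[ Q^{\pi}(s, a)\, \nabla_{\theta} \ln \pi_{\theta}(a \mid s) \right]
```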
In Actor-Critic algorithms, the Critic component evaluates the goodness of the performed action a, computing the reward J(θ) as the Mean Squared Error (MSE) between the current value and the best possible action [72]. Since random sampling from action distributions can determine high variability in the generated log probabilities, letting the policy converge to a sub-optimal choice, the A2C algorithm [23] adjusts the log probabilities by multiplying them by the advantage, a factor that quantifies how much better the chosen action is compared to the other possible actions (Equation 4).
By applying Equation 4 within Equation 3, we obtain a more stable training process, as depicted in Equation 5, where A(s_t, a_t) is the advantage.
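Equations 4 and 5 also do not survive in this copy; their standard forms in the A2C literature, consistent with the description above (the advantage as the gap between the action value and the state value, estimated by bootstrapping, and the advantage-weighted policy gradient), are:

```latex
A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)

\nabla_{\theta} J(\theta) = \mathbb{E}\left[ A(s_t, a_t)\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right]
```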
The exploration engine applies the actions generated by the ANN model to the environment. Then, after observing the effect of such actions, it generates the values to compute the A2C objective function, which in turn guides the update of the ANN weights through backpropagation [73].
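A minimal numeric sketch of the quantities involved in this update, for a single transition. The helper name `a2c_losses` is hypothetical and this is not the rLotos implementation; it only illustrates the bootstrapped advantage weighting the Actor loss and the MSE training the Critic:

```python
import math

def a2c_losses(log_prob, value, next_value, reward, gamma=0.99):
    """Return (actor_loss, critic_loss) for one transition (illustrative)."""
    td_target = reward + gamma * next_value   # bootstrapped return estimate
    advantage = td_target - value             # how much better than expected
    actor_loss = -log_prob * advantage        # advantage-weighted log-prob (to maximize)
    critic_loss = (td_target - value) ** 2    # MSE of the value estimate
    return actor_loss, critic_loss

actor_loss, critic_loss = a2c_losses(log_prob=math.log(0.5), value=1.0,
                                     next_value=2.0, reward=1.0)
```

In the actual framework these scalar losses are computed over batches by tensorflow, and backpropagation updates the shared ANN weights.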
Synchronous A2C [73] devises multiple agents acting simultaneously and combines the results they independently obtain, resolving inconsistencies. This work instead adopts a single-agent variant, which still lets the learning process exploit intermediate information and dynamically adapt to the system evolution.

Learning model
The learning model leverages the general ANN structure devised by the A2C algorithm [73]. The ANN structure includes a backbone (Fig 2.A) that processes the simulation states as observations and provides its outputs as input to both the Critic and the Actor components. The ANN model implementation uses the tensorflow library (version 2.9.1) [74], while the backbone builds on the ResNet18 model [75].

Model structure
The Actor (Fig 2.B) processes the backbone output to generate actions. It can produce both discrete and continuous actions: for each discrete one, the Actor generates a single value among the possible ones.

Model adaptability
The ANN model adapts to the application and the training process. The size and shape of the data provided to the ANN as input determine the size and shape of its output layers.
The input shape of the ANN is set by width and height, the width and height of the image-like observation of the environment, and by channels, the number of color-like channels.
The ANN output size adapts to the set of actions to generate.

Training process

The train_manager class allows implementing different environments and sets of hyperparameters to perform distinct parallel training processes (Fig 3.A) via its parallel_train function, relying on the Process class from the Python module multiprocessing and on a pipe, that is, a communication channel, passed to each subprocess. Each subprocess contains a whole and independent training process, based on the interactions between the train class (Fig 3.B), the environment file implementation (Fig 3.C), and the learning model (Fig 3.D), which allow the individual A2C agent (Fig 3.E) to learn by iteratively interacting with the learning environment (Fig 3.F).
Finally, a handle to each training subprocess is stored, so that it is possible to request real-time information through the pipe and to wait for all the subprocesses to complete.

Results and Discussion
The proposed methodology is intended to perform DSE of protocol generation for real-world applications and physical biofabrication processes. In perspective, the optimization engine will interact with an automatic culture system, whose specific effectors define the tunable parameters forming a biofabrication protocol. A first step towards real-world applications is optimizing protocols for simulated biofabrication.
In particular, this work relies on PalaCell2D, a simulator of the proliferation of epithelial cells oriented to tissue morphogenesis [24,25].

PalaCell2D setup
PalaCell2D relies on a vertex model [36] that represents cells through their membrane, modeled as a set of vertices (Fig 4). In the PalaCell2D model, the number of vertices modeling a cell changes dynamically with its size, so that the density of vertices remains uniform along the cell membrane. The simulator supports (i) the mechanical properties of the membrane and the (ii) internal and (iii) external forces that the cellular and extracellular compartments exert over the membrane. PalaCell2D supports simulating cellular processes such as apoptosis and proliferation, which the simulation controls based on the internal cellular pressure. The latter depends on model parameters set for the simulation (as in Eq (6)): the cell mass m, the cell area A, the pressure sensitivity η, and the target density ρ_0.
Internal pressure controls the growth of the cell mass. The mass changes with a rate that depends on the external pressure p_ext applied on the cell, the target area A_0, the mass growth rates during proliferation, ν, and relaxation, ν_relax, the simulation time step dt, and the pressure threshold p_max above which proliferation stops, as in Eq (7).
External pressure p ext depends on the contact with other cells and the external force F ext . The latter is either local (i.e., the interaction of a cell with a wall of the culture environment) or global (i.e., the force exerted over the whole culture environment).
The simulation scenario used for validation controls cell proliferation by applying a global external force (as in Eq (8)). The parameter a_prolif controls the probability of switching cells to the proliferation state.
In the experiment proposed in [24], F ext models an external pressure exerted on a deformable capsule that acts on the cellular vertices with an intensity that depends on their distance from the capsule center and the capsule radius.

Optimized protocol generation
The goodness of generated protocols depends on their capability to generate products close to the optimization target. Since simulations imply long computational times [4], results aim to analyze the evolution of the learning processes towards optimality rather than their convergence to a global optimum. Each learning epoch corresponds to numIter=20 simulation steps, and the total number of simulation steps is 3400. Performance evaluation occurs every 5 epochs.

Experimental results for Target 1
Each experiment runs an independent training process based on a different combination of two learning hyperparameters: the learning rate (lr) and the discount rate (gamma). Tab 1 illustrates the six combinations explored, summarizing the highest absolute final number of cells obtained as a performance measure for every setting. Exp3 (lr=0.0001, gamma=0.99) exhibits a steady increase in the mean reward along training.

For Target 2, the reward is the fraction of cells within the target (fraction_inside), as in Eq (9): fraction_inside = cells_inside / cells_final, where cells_inside and cells_outside are the numbers of cells inside and outside the target space, and cells_final = cells_inside + cells_outside is the total number of obtained cells.
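The Eq (9) reward transcribes directly into code, assuming (as the definitions above imply) that cells_final is the sum of the cells inside and outside the target:

```python
def fraction_inside(cells_inside, cells_outside):
    """Eq (9): fraction of the final cell count lying inside the target space."""
    cells_final = cells_inside + cells_outside  # total number of obtained cells
    if cells_final == 0:
        return 0.0  # guard for an empty simulation
    return cells_inside / cells_final

reward = fraction_inside(cells_inside=80, cells_outside=20)
```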
Increasing the mean fraction_inside along training indicates improving performance, while the variance of this value indicates whether learning is in exploration (high variance) or exploitation (low variance) mode.
Protocols for this target control both the initial position for seeding the first cell before the simulation starts and the compression stimuli along the simulation.

Framework setup for Target 2
The framework setup (see Appendix A. Framework setup parameters) still includes a simulated space in the environment that is a 400x400x1 grid.

Experimental results for Target 2
Each experiment runs an independent training process based on a different combination of two learning hyperparameters: the learning rate (lr) and the discount rate (gamma). Tab 2 illustrates the combinations explored. Variance trends show that the processes enter exploration mode. In particular, Exp7 (lr=0.001, gamma=0.99) and Exp10 (lr=0.0001, gamma=0.95) explore between windows 6 and 9, only to get back to exploitation during the last part of the training. Exp7 (lr=0.001, gamma=0.99) has the highest mean values among all training processes for this target, while its variance values are similar to the others.

Conclusions
This work presents a computational DSE approach to optimize the simulated biofabrication of epithelial sheets and to generate optimal biofabrication protocols based on a variant of the A2C algorithm. Starting from the idea proposed in [22], it aims to generate biofabrication protocols for the simulated growth of epithelial cells under compression stimuli to validate its capabilities.
Results show that the framework can explore the solution space defined by the specific application, providing a comparative performance analysis of different hyperparameter settings and showing that the framework can impact the structure of generated protocols through the training process.
Future works will target analyzing and testing other synchronous A2C variants that involve multiple agents acting simultaneously and joining their independent results, resolving their inconsistencies. Moreover, we plan to systematically explore hyperparameter combinations and tackle more complex optimization targets, such as different geometries for target areas to fill and the coexistence of several cell types within a construct.

Data Availability Statement
The source code and data used to produce the results and analyses presented in this manuscript are available from the public GitHub repository https://github.com/smilies-polito/rLotos.

Acknowledgements
We acknowledge the precious collaboration and support provided by

Appendix A. Framework setup parameters. A first set of parameters controls the simulation flow.

• numIter controls the number of simulation steps (or iterations) to perform.
• stopAt indicates the simulation step (or iteration) when to stop the simulation.
Another set of parameters controls the simulation evolution.
The biofabrication protocol sets their values according to the actions generated by the RL algorithm along the training. This work adds a new parameter to the ones provided by PalaCell2D: initialPos, the position of the first cell generated in the simulation space.
• initialPos sets the position of the initial cell, indicating its x and y coordinate values.
• compressionAxis sets the axis for external force application, with either 'X' or 'Y' as values.
• comprForce indicates the intensity of the external force applied.
Another set of parameters in the configuration supports the management of input and output files. Both input and output files are .vtp files from the Visualization Toolkit (VTK) [79].
• initialVTK indicates the name of the input .vtp file from which to restore the simulation (if this value is missing, the simulation starts from the beginning).
• finalVTK indicates the name of the generated output .vtp file.
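As an illustration of how a configuration step might serialize the parameters listed above, the sketch below builds an XML fragment with the standard-library ElementTree module. The XML layout here is hypothetical; the actual PalaCell2D configuration schema may differ:

```python
import xml.etree.ElementTree as ET

def build_config(num_iter=20, initial_pos=(200, 200), compression_axis="X",
                 compr_force=0.01, initial_vtk=None, final_vtk="out.vtp"):
    """Serialize protocol parameters into a (hypothetical) XML configuration."""
    root = ET.Element("parameters")
    ET.SubElement(root, "numIter").text = str(num_iter)
    pos = ET.SubElement(root, "initialPos")       # new parameter added by this work
    pos.set("x", str(initial_pos[0]))
    pos.set("y", str(initial_pos[1]))
    ET.SubElement(root, "compressionAxis").text = compression_axis
    ET.SubElement(root, "comprForce").text = str(compr_force)
    if initial_vtk:  # omitted -> the simulation starts from the beginning
        ET.SubElement(root, "initialVTK").text = initial_vtk
    ET.SubElement(root, "finalVTK").text = final_vtk
    return ET.tostring(root, encoding="unicode")

config_xml = build_config()
```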
Learning environment setup. This work implements the simulator interface (see Materials and Methods) to use PalaCell2D simulations as a learning environment. Moreover, it relies on the configure method, called by the implemented environment, to create the configuration file needed by the PalaCell2D simulator. The implemented PalaCell2D environment has the following parameters to interface with the learning process.
• width and height: the width and height in pixels of the observations provided to the learning model, with a default value of 300.
• numIter: the number of simulation steps per learning epoch, with a default value of 20.
• max_iterations: the maximum number of total simulation steps, with a default value of 4200.
Output management. Performance evaluation relies on the computation of defined metrics over the generated output. Data saving occurs every five training epochs. The helper file vtkInterface.py contains the following methods to read PalaCell2D outputs.
• read_cell_num: reads the number of cells in the simulation space from the simulation output file.
• create_png_from_vtk: reads the simulation output file and produces a visualization of the simulation space. This visualization centers on the cellular aggregate and not on the simulation space.
• create_decentered_pil_image: reads the simulation output file and produces a visualization of the simulation space. This visualization centers on the simulation space.
• add_target: adds a circular target with given center coordinates and radius to a chosen image.
• count_target_points: reads the simulation output file and returns the number of cellular vertices inside and outside a specified target.

Appendix B. Simulator interface. The simulator interface relies on the environment file, whose structure follows the one proposed in [80]. The environment file sets the ANN model structure (see Learning model) and the training process parameters for the target application.
The simulator interface includes the following set of required functions to advance the training process.
• reset: prepares the environment for the next episode starting from an initial state.
• render: returns an image showing the environment state.
• adapt_action: transforms the actions provided by the ANN into a format that can be used by the step method.
• step: acts on the environment with the generated actions and provides to the training process an observation of the environment, the reward value, and a boolean value telling whether the environment has reached a terminal state.
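A skeleton of an environment implementing the four functions above might look as follows. Method names and default parameter values follow the text; the class name and all method bodies are placeholders, not the actual rLotos PalaCell2D environment:

```python
class PalaCellEnvSketch:
    """Hypothetical skeleton of the simulator-interface environment."""

    def __init__(self, width=300, height=300, num_iter=20, max_iterations=4200):
        self.width, self.height = width, height
        self.num_iter = num_iter              # simulation steps per epoch
        self.max_iterations = max_iterations  # total simulation step budget
        self.iteration = 0

    def reset(self):
        """Prepare the environment for the next episode from an initial state."""
        self.iteration = 0
        return self.render()

    def render(self):
        """Return an image-like view of the environment state (all zeros here)."""
        return [[0] * self.width for _ in range(self.height)]

    def adapt_action(self, raw_action):
        """Transform the ANN output into the format used by step()."""
        return {"compressionAxis": "X" if raw_action[0] < 0.5 else "Y",
                "comprForce": float(raw_action[1])}

    def step(self, action):
        """Apply the action, then return (observation, reward, done)."""
        self.iteration += self.num_iter   # advance numIter simulation steps
        observation = self.render()
        reward = 0.0                      # a real reward would read the .vtp output
        done = self.iteration >= self.max_iterations
        return observation, reward, done

env = PalaCellEnvSketch()
obs = env.reset()
action = env.adapt_action([0.2, 0.01])
obs, reward, done = env.step(action)
```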
The simulator interface also includes the following set of required functions to manage data produced by the training process and keep track of performance.
• save_performance: indicates whether or not to collect performance indexes related to the environment into environment variables.
• get_performance: saves the performance indexes in a checkpoint file.
• load_performance: restores the performance indexes in the respective environment variables when restoring the training from a previous state through checkpoint files.
• check_performance: indicates whether or not to save the collected performance indexes in a checkpoint file.
• data_to_save: indicates which data to save in a checkpoint file that is accessible after the training.
• load_data_to_save: restores the collected data in the respective variables inside the environment when restoring the training from a previous state through checkpoint files.
Appendix C. Nested training process. To tackle Target 2 (see Target 2: a circular patch of cells), a helper training class allows synchronizing the two nested subprocesses. This class has the same structure as the one already presented in Training process, with the addition of a pipe allowing communication between the two subprocesses. The two environments perform the steps described in Training process during the training process.