Stochastic inference of clonal dominance in gene therapy studies

Clonal dominance is a wake-up-call for adverse events in gene therapy applications. This phenomenon has mainly been observed as a consequence of a malignancy progression, and, in some rare cases, also during normal haematopoiesis. We propose here a random-effects stochastic model that allows for a quick detection of clonal expansions that possibly occur during a gene therapy treatment. Starting from the Ito-type equation, the dynamics of cells duplication, death and differentiation at clonal level without clonal dominance can be described by a local linear approximation. The parameters of the base model, which are inferred using a maximum likelihood approach, are assumed to be shared across the clones. In order to incorporate the possibility of clonal dominance, we extend the base model by introducing random effects for the clonal parameters. This extended model is estimated using a tailor-made expectation maximization algorithm. The main idea of this paper is to compare the base and the extended models in high dimensional clonal tracking datasets by means of Akaike Information Criterion in order to detect the presence of clonal dominance. The method is evaluated using a simulation study, and is applied to investigating the dynamics of clonal expansion in a in-vivo model of rhesus macaque hematopoiesis. Author summary Preventing or quickly detecting clonal dominance is an important aspect in gene therapy applications. Over the past decades, clonal tracking has proven to be a cutting-edge analysis capable to unveil population dynamics and hierarchical relationships in vivo. For this reason, clonal tracking studies are required for safety and long-term efficacy assessment in preclinical and clinical studies. In this work we propose a random-effects stochastic framework that allows to investigate events of clonal dominance using high-dimensional clonal tracking data. Our framework is based on the combination between stochastic reaction networks and mixed-effects generalized linear models. We have shown in a simulation study and in a real world application that our method is able to detect the presence of clonal expansions. Our tool can provide statistical support to biologists in gene therapy surveillance analyses.


Introduction
The idea of gene therapy is that the correction of the defective gene(s) underlying the disease is, in principle, sufficient for inducing disease remission or even full recovery [1]. Since the blood system possesses a hierarchical structure with haematopoietic stem cells (HSC) at its root [2], correction of HSCs might be sufficient to eradicate a genetic disease in the blood system. Since transduction of stem cells has proven to be less efficient than transduction of more mature cells, it might be necessary to allow very high gene transfer rates, i.e., multiple vector copies, to ensure efficient genetic modification of HSCs [3,4]. But genetic modification of large numbers of cells is associated with the higher probability of unintentional vector insertion events near the growth-regulatory genes that may lead to insertional mutagenesis [5][6][7]. One particular drawback of insertional mutagenesis is the phenomenon of clonal dominance, which occurs if one or more clones dominate cell production [8]. The most extreme case of clonal dominance is monoclonality, where an entire tissue is dominated by the progeny of one particular cell. Although different gradients of clonal dominance (oligoclonality) exist, a precise threshold that defines dominance is hard to be specified in general, and thus a clear definition of what is meant by clonal dominance is required for any particular study.
Clonal dominance in malignant haematopoiesis has been previously identified as a consequence of a clonal competition that is corrupted by disease progression [9,10]. However, clonal dominance has also been observed in normal haematopoiesis, even in the case of truly neutral clonal markers [11][12][13]. Indeed, on the basis of various mathematical models, progression of monoclonality has been discussed also for normal (non-leukaemic) stem cell systems [14][15][16][17][18]. While there is strong evidence for clonal selection inducing monoclonal systems in the crypts of the small intestine [19][20][21][22], such a process has not been demonstrated for the haematopoietic system yet. To shed more light on those mechanisms, in this manuscript we extend the work of [23,24] and propose a random-effects cell differentiation network to model the dynamics of clonal expansions for high dimensional clonal tracking data.
More in detail, starting from the definition of the master equation [25], a set of Ito stochastic differential equations is derived to describe the first-two-order moments of the process. We estimate the parameters of the Ito system from its Euler-Maruyama local linear approximation (LLA) [26]. We propose a new inference procedure in the LLA formulation using a maximum likelihood approach, replacing the iterative weighted least square algorithm previously developed in [23,24]. Although the base LLA model formulation has been shown to be effective in modelling cell differentiation [24], it has some limitations as it only provides an average description of the dynamics across all the clones, and does not take into account possible extreme behaviour. Indeed in the base model all the dynamics parameters are shared across the clones, and thus is not possible to identify heterogeneous clonal patterns. Thus the base LLA formulation cannot be used to model clonal dominance. Therefore in this work we further increase the flexibility of the base LLA model to check if the process dynamics is mainly due to few clones and if those dominate a particular cell type. To this end we introduce random effects for the clones inside the LLA formulation, providing a mixed-effects LLA model. Then, if the mixed model outperforms the fixed one in terms of Akaike Information Criterion we use the former to infer the process parameters in order to identify which clones are mainly expanding and in which cell compartments. As every mixed-effects formulation, inference of parameters is performed by means of an expectation-maximization algorithm, for which we developed an efficient implementation. Effectively, our random-effects LLA formulation describes a stochastic process of clonal dominance on a network of cell lineages. We tested and validated our method using a simulation study. Finally, our model allowed to investigate the dynamics of clonal expansion in a in-vivo model of rhesus macaque hematopoiesis [27].

A stochastic model for cell differentiation
We consider three event types, such as cell duplication, cell death and cell differentiation for a time counting process Y t = (Y 1t , . . . , Y Nt ) (1) of a single clone in N distinct cell lineages. The time counting process Y t for a single clone in a time interval (t, t + t) evolves according to a set of reactions {v k } k and hazard functions {h k } k defined as v k = 8 > < > : for death Y it ij for differentiation (2) which contains a linear growth term with a duplication rate parameter ↵ i > 0, a quadratic term for cell death with a death rate parameter i > 0, and a linear term to describe cell differentiation from lineage i to lineage j with differentiation rate ij > 0 for each i 6 = j = 1, . . . , N. Finally we use the LLA formulation of Section 1.3.1 from S1 Text with net-effect matrix and hazard vector defined as In this formulation we implicitly assume that cells belonging to the same lineage obey to the same dynamics laws, that is all the clones share the same vector parameter ✓. In case we argue that clones behave differently in terms of dynamics we can use the random-effects LLA formulation of Eq. (6), where the random effects are defined on the vector parameter ✓ w.r.t. the clones. This is the case in our application study presented in next section, where we check wether there is heterogeneity in the clones for the duplication and death parameters, which we use as a proxy for a clonal expansion or contraction.

LLA formulation of clonal dominance
Let Y t = (Y 1t , . . . , Y Nt ) be a collection of "cells" of N different types at time t obeying to a network of stochastic biochemical reactions defined by a net-effect matrix V 2 Z N ⇥K , a vector parameter ✓ and an hazard vector h(Y, ✓) = (h 1 (Y, ✓), . . . , h K (Y, ✓)) and let be the local linear approximation of an Ito-type equation written in generalized linear model formulation (see Section 1 of S1 Text for details) where 2 is the noise variance, M t ✓ the mean drift, W t (✓) the diffusion matrix, and Y t = Y t+ t Y t is a finite-time increment. In system (4) all the cell counts Y 1 , . . . , Y N share the same parameter vector ✓. To infer the parameters of (4)-(5) we developed a maximum likelihood algorithm which is fully described in Section 1.4 of S1 Text. In some cases it may happen that the cells being analysed are drawn from a hierarchy of J different populations that possibly behave differently in terms of dynamics. In this case it might be of interest to quantify the population-average ✓ and the subject-specific effects u around the average ✓ for the description of the subject-specific dynamics. Therefore we introduce here a novel stochastic framework which is more flexible then the base LLA model thus allowing for the quantification of clonal contribution to the process. In particular, to quantify the contribution of each subject j = 1, . . . , J on the process dynamics we extend the LLA (4) with a mixed-effects model [43] introducing random effects u for the J distinct subjects on the parameter vector ✓, leading to a random-effects stochastic reaction network (RestoreNet). The extended random-effects formulation becomes where M is the block-diagonal design matrix for the random effects u centered in ✓, where each block M j is subject-specific. As in the case of the null model (4), to explain additional noise of the data, which has the additional advantage of avoiding singularity of the covariance matrix W (✓), we add to its diagonal a small quantity 2 which we infer from the data. Under this framework it can be shown that where = (✓, 2 , ⌧ 2 1 , . . . , ⌧ 2 p ) is the set of all the unknown parameters. Once the parameters are estimated (see next section for inference details), the conditional expectations E u| Y ; [u] can then be used as a proxy for the clone-specific rate parameters. This method allows to infer the clone-specific dynamic by extremely reducing the problem dimensionality from J · p to 2 · p + 1 (J 2).

Inference procedure
In order to infer the Maximum Likelihood estimatorˆ for = (✓, 2 , ⌧ 2 1 , . . . , ⌧ 2 p ) we develop an efficient tailor-made Expectation-Maximization algorithm where the collected cell increments Y and the random effects u take the roles of the observed and latent states respectively. The full analytical expression of and its partial derivatives @ @ j Q( | ⇤ ) are available (see Section 1.4 of S1 Text). In the EM-algorithm we iteratively update the E-function Q( | ⇤ ) using the current estimate ⇤ of and then we minimize the Q( | ⇤ ) w.r.t. . As the E-step function Q( | ⇤ ) is non-linear, we used the L-BFGS-B algorithm from the optim() base R function for optimization, to which we provided the objective function, along with its gradient r Q( | ⇤ ), as input. Given the high-dimensionality of the clones being analysed, and due to the sparsity of the clonal tracking datasets, the E-step function Q( | ⇤ ) and its gradient r Q( | ⇤ ) are written in a sparse block-diagonal matrix fashion, so as to reduce computational complexity and memory usage. The EM algorithm is run until a convergence criterion is met, that is when the relative errors of the E-step function Q( | ⇤ ) and the parameters ⇤ are lower than a predefined tolerance. Once we get the EM estimateˆ for the parameters we evaluate the goodness-of-fit of the mixed-model according to the conditional Akaike Information Criterion [44]. As every EM algorithm, the choice of the starting point s is very important from a computational point of view. We chose s = (✓ s , 2 s , ⌧ 2 1 = 0, . . . , ⌧ 2 p = 0) as a starting point where (✓ s , 2 s ) is the optimum found in the fixed-effects LLA formulation (4). This is a reasonable choice since we want to quantify how the dynamics E u| Y ;ˆ [u] j of each subject (clone) j departs from the average dynamics ✓ s . With the help of simulation studies (see Section 2 of S1 Text) we empirically proved that this choice always led to a conditional expectation E u| Y ; [u] consistent with the true clone-specific dynamic parameters ✓. Computational details can be found in Section 1.4 of S1 Text. The pseudocode of the EM algorithm is provided in Algorithm 3 of S1 Text.

Computational implementation
The maximimul likelihood inference for the basal model and the expectation maximization algorithm for the random-effects model are implemented in the package RestoreNet. Few minimal working examples showing the usage of the package are provided in Section 5 of S1 Text.

Results
A first comparative evaluation study on synthetic data, whose results are provided in Section 2 of S1 Text, shows how the proposed random-effects formulation is able to identify clonal dominance. We found that the random-effects model reached a significantly lower AIC than the null model, thus detecting the simulated dominance of a single clone into a cell type.
Next, we compared the base and random-effects models on the clonal tracking data of the rhesus macaque study fully described in Section "Clonal tracking data". Although the sample DNA amount was maintained constant during the whole experiment (200 ng for ZH33 and ZG66 or 500 ng for ZH17), the sample collected resulted in different magnitudes of total number of reads (see Table 2 from S1 Text). This discrepancy makes all the samples not comparable across time and cell types. Therefore we rescaled the barcode counts according to Eq. (19) from S1 Text, and we report the rescaled cell counts, at clonal level, in    corresponding model becomes effectively a birth/death model. We fitted both the fixed model (4) and the mixed-effects model (6) separately to the data of each animal, where J is equal to 481 (ZH33), 139 (ZH17), and 202 (ZG66) respectively. The size of the dynamic vector parameter ✓ is equal to 10, that is one scalar value for each combination of the five cell types with the duplication and death reactions. Also, N equals 11275 (ZH33), 2555 (ZH17), and 2770 (ZG66), while the number of time-points T is equal to 6 (ZH33), 5 (ZH17), and 4 (ZG66).
We report the results on model selection in Table 1 and the estimated parametersˆ in Table 3 of Section 4 from S1 Text. Then, from the estimated parametersˆ following Eq. (18) from S1 Text we computed the conditional expectations , which we use as a proxy for the k-th clone-specific net-duplication ↵ l l in each cell lineage l. The resulting values are reported in Fig. 2 in a box-plot fashion. To visualize our findings at clonal level, in Fig. 3 we propose to use a weighted pie chart. Each pie corresponds to a particular clone and is weighted by The biological interpretation of this figure is that the larger the diameter, the more the corresponding clone is dominating cell production into the lineage associated to the largest slice.
As a result, according to the AIC values, in each treated individual the mixed model (M 1 ) outperformed the fixed one (M 0 ). This means that the clones did not follow the same average dynamics for the birth/death process. Instead, the dynamic of some ZH17 (b) and ZG66 (c). Each k-th clone is identified with a pie whose slices are lineage-specific and weighted with w k defined as the difference between the corresponding duplication and death parameters, that is The diameter of the k-th pie is proportional to the euclidean 2-norm of w k . The legend scales are different across the three plot panels.
clones departed from the average dynamics with a significant (random) effect. In particular, the conditional net-duplication rates suggest that there is clonal dominance in specific cell lineages. As an example, for the animals ZH33 and ZG66 we observed clonal expansions into NK cells with high conditional rates. Whereas, for the animal ZH17 we observed clonal expansions into G and B cell lineages with high conditional rates. Finally, for the animal ZG66 we also observed events of clonal dominance into M and T cell lineages. Furthermore, the weighted pie charts shown in Fig. 3 reveal different gradients of clonal dominance between the three rhesus macaques. As an example, looking at the size of the pies it is possible to observe an higher clonal dominance of NK cells in ZH33 and of G cells in ZH17 compared to the expansions of M, NK and T cells detected in ZG66, where the diameters of the clone-specific pies are rather similar. Not only does the proposed mixed effect model detect clonal dominance of certain cell types, it is also able to detect which clones are responsible.

Discussion and conclusion
In this work we proposed a random-effects cell differentiation network which takes into account heterogeneity in the dynamics across the clones. Our framework extends the clone neutral local linear approximation of a stochastic quasi-reaction network, written in the Ito formulation, by introducing random-effects for the clones to allow for clonal dominance. To infer the parameter of the base (fixed-effects only) model we used a maximum likelihood approach. Whereas, to infer the parameters of the random-effects model, we have developed an expectation-maximization (EM) algorithm. We tested our framework with a ⌧ -leaping simulation study (see Section 2 from S1 Text for details), showing accurate performance of the method in the identification of a clonal expansion and in the inference of the true parameters. Subsequently, the application of the method on a rhesus macaque clonal tracking study revealed significant clonal dominance for specific cell types. Particularly interesting is that the NK clonal expansions detected by our model were already observed by former studies [27,45,46], and therefore our findings are consistent with those previously obtained. Indeed [45] described the oligoclonal expansions of NK cells and the long-term persistence of HSPCs and immature NK cells.
The main approximation in both the basal and random-effects formulations is the piece-wise linearity of the process. That is, in both cases we consider first a local linear approximation of the Ito equation, which then we use to infer the process parameters either with or without random-effects. Although the linearity assumption makes all the computations easier, this approximation becomes poor as the time lag increments (the ts) of the collected data increase. This can be addressed by introducing in the likelihood higher-order approximation terms than the ones considered by the Euler-Maruyama method. The Milstein approximation is a possible choice. Another, completely different, approach is to employ extended Kalman filtering (EKF) which is suitable for non-linear state space formulations. Also, our framework cannot consider false-negative errors or missing values of clonal tracking data. Also for this issue, an EKF formulation could be a possible extension.
Our tool can be considered as complementary to the classical Shannon entropy index [47] in detecting fast and uncontrolled growing of clones after a gene therapy treatment. Indeed, while the Shannon entropy measures the diversity of a population of clones as a whole, RestoreNet provides a clone-specific quantification of dominance in terms of conditional mean and variance of the expansion rates. In conclusion, our proposed stochastic framework allows to detect deviant clonal behaviour relative to the average dynamics of hematopoiesis. This is an important aspect for gene therapy applications where is crucial to quickly detect clonal dominance to prevent any adverse event that may be related to malignant scenarios. Therefore our tool can provide statistical support to biologists in gene therapy surveillance analyeses. With slight modifications our framework can be applied to every study of population dynamics that can be described with an Ito-type formulation, even when the whole population needs to be drawn from an hierarchical structure having subject-specific dynamics. S1 Text Stochastic inference of clonal dominance in gene therapy studies (PDF)

Author Contributions
• All authors analysed the data and wrote the paper.
• L.D.C. designed and implemented the stochastic framework.

Data availability
The data that supports the findings of this study is openly available at [27].

Code availability
• The code that supports the findings of this study is openly available at https://github.com/delcore-luca/ClonalDominance • The stochastic framework is implemented in the package RestoreNet available at https://github.com/delcore-luca/RestoreNet Stochastic quasi-reaction networks (S-QRN) allow to implement a particular class of stochastic differential equations that can be used to model biochemical reactions. More formally, let be a collection of molecules of N different types observed at time t, and consider K distinct (and competing) reactions each occurring with its own rate ✓ j . The coefficients r ji 's defining the left-side of the reaction are called reagents and represent the minimum amount of molecules of type i needed for the j-th reaction to occur. Similarly, the coefficients p ji defining the right-side of the reaction are called products and represents the amount of produced molecules of type i after the j-th reaction is triggered. We assume that, if we observe Y 0 = (r j1 , . . . , r jN ) molecules at time t = 0, the j-th reaction will occur after T j ⇠ exp(✓ j ), j = 1, . . . , K Namely, if exactly r ij molecules of each type i would be present, then the j-th reaction can only take place in one way, with the exponential hazard rate ✓ j . The interpretation is that, after a waiting time T j , r ji molecules of type i collide with each other and produce p ji molecules of type i (8i = 1, . . . , N), while the molecules move randomly in a hosting "cellular" environment. However, in general at time t = 0 we might observe Y i0 r ji molecules of each type i and, therefore, the j-th reaction can take place in a combinatorial number of ways leading to the following waiting time formulation In this case, the effect will be that at time t + T j we have the following expression for the number of molecules of substrate i, where v ji = p ji r ji is the j-th net effect. More compactly, for a set of K reactions and N species, the molecular transfer from reagent to product species is a net change of V = P R May 26, 2022 1/17 Algorithm 1: ⌧ -leaping algorithm Input: S (no. simulations), Y 0 (initial state), ⌧ (time lag), ✓(t) (reaction rates) . . .

Simulating a trajectory of molecules
A ⌧ -leaping algorithm is an alternative method to a Gillespie algorithm for simulating triggering-chain events. Instead of simulating a waiting time for the first reaction to occur and selecting the corresponding winner reaction, a ⌧ -leaping algorithm simulates the number of occurrences of each possible event after a time-lag equal to ⌧ elapsed. Formally, let {N r (t)} t 0 be an inhomogeneous Poisson point process representing the number of reactions of type r that took place up to (and including) time t. Therefore The last equation gives an estimate of the expected number of reactions of type r that took place within the time interval [t, t + t [. Furthermore, the expected number of molecules Y t+ t at time t + t given the current number of molecules Y t is given by The pseudocode of the ⌧ -leaping algorithm (⌧ -LA) is reported in Algorithm 1.

Local linear approximation
In order to estimate the rates ✓ = ⇥ ✓ 1 · · · ✓ K ⇤ 0 , we focus on the first two-order moments of the process, that is we consider the Ito equation where dY t = Y t+dt Y t is an infinitesimal time drift and µ(dY t ; ✓) and (dY t ; ✓) are called mean-drift and diffusion respectively. Given a -algebra (⌦, F , P), the solution (7) is called Ito diffusion. Instead of finding the Ito diffusion itself, we focus on the first two-order moments µ(dY t ; ✓) and (dY t ; ✓) of the infinitesimal time drift dY t which can be approximated with the following Lemma and Proposition.
Lemma 1. Given the hazard function as a limit of a conditional probability h(t) = lim dt!0 P(T < t + dt|T > t) dt for small dt the following approximation holds Proposition 1. An approximation of the mean drift µ(dY t ; ✓) and the diffusion (dY t ; ✓) for a small time increment dt is given by May 26, 2022 3/17 Using previous results and some linear algebra, the approximated Ito equation (7) can be further approximated as ) or more compactly where we included the term 2 I N so as to prevent singularity of the diffusion term, and to additionally explain noise variance. In practice, since we collect only discrete-time . . .
which is convenient for parameters inference.

Maximum Likelihood (ML)
We infer the parameters (✓, 2 ) with a maximum likelihood approach, that is we solve the following constrained optimization problem where the objective function is and we compactly write the diffusion matrix W ⇤ = W (✓, 2 ) as a function of the free parameters. Using the rules of matrix calculus [1], the partial derivatives of f w.r.t. ✓ and 2 can be written as where . , 1, .
Then, we solve (26) by using the objective function (14) and its gradients (15)-(16) inside the L-BFGS-B optimization algorithm from the optim() function of the stats R package. The inference procedure is summarised in Algorithm 2.

A mixed effects LLA model
In the system (24) all the molecules Y 1 , . . . , Y N share the same parameter vector ✓.
In some cases it may happen that the molecules being analysed are drawn from a hierarchy of J different populations having different properties. In this case it might be of interest to quantify the population-average ✓ and the subject-specific effects u around the average ✓ for the description of the subject-specific dynamics. Therefore, to quantify the contribution of each subject j = 1, . . . , J on the process's dynamics we extended the LLA (24) by introducing random effects u for the J distinct subjects on the parameter vector ✓, leading to the following mixed-effects [2] formulation where M is the block-diagonal design matrix for the random effects u centered in ✓, where each block M j is subject-specific. As in the case of the null model (24), to explain additional noise of the data and to avoid singularity of the stochastic covariance matrix W (✓) we added to its diagonal a small unknown quantity 2 which we infer from the data. In order to infer the maximum likelihood estimatorˆ for = (✓, 2 , ⌧ 2 1 , . . . , ⌧ 2 p ) we developed an efficient tailor-made expectation-maximization algorithm where Y and u take the roles of the observed and latent states respectively. Under this framework Also, the joint log-likelihood of Y and u is given by l( Y, u; ) / l( Y |u; ) + l(u; ) which only depends on u linearly via its first two-order conditional moments (18). Therefore, it follows for the E-step function that The gradient of Q( | ⇤ ) is defined by the following partial derivatives  In the EM-algorithm we iteratively update the E-function Q( | ⇤ ) using the current estimate ⇤ of and then we minimize the Q( | ⇤ ) w.r.t. . The EM algorithm is run until a convergence criterion is met, that is when the relative errors of both the E-step function Q( | ⇤ ) and the vector parameter are lower than a predefined tolerance. Once we get the EM estimateˆ for the parameters we evaluate the goodness-of-fit of the mixed-model according to the conditional Akaike Information Criterion [3]. As every EM algorithm, the choice of the starting point s is very important from a computational point of view. We chose as a starting point s = (✓ s , 2 s , ⌧ 2 1 = 0, . . . , ⌧ 2 p = 0) where (✓ s , 2 s ) is the optimum found in the fixed-effects LLA formulation (24). This is a reasonable choice since we want to quantify how the dynamics E u| Y ;ˆ [u] j of each subject j departs from the average dynamics ✓ s . The EM pseudocode is given in Algorithm 3.

Simulation studies
Here we mimic the dynamics of J = 3 distinct clones in four synthetic cell types A, B, C, D following the differentiation network structure of Figure 4. The net-effect matrix V and hazard vector h(Y, ✓) can be derived from equations (6)-(7) of the main paper. To simulate the clonal tracking data we used the ⌧ -leaping Algorithm 1 with a time lag ⌧ = 1, that has been run independently for each clone. We designed each simulation so that the first clone dominates lineage D and the third clone dominates lineage C. We first run a single simulation under different magnitudes for the noise variance 2 . Then we fitted both the base (24) and random-effects (27) models to the simulated data using Algorithms 2 and 3. We reported in Figure 2  model against the true clone-specific parameters. In the same figure we also show a piechart where each clone k is identified with a pie whose slices are lineage-specific and weighted with w k , defined as the difference between the conditional expectations of the duplication and death parameters, that is where lin is a cell lineage. The diameter of the k-th pie is proportional to the euclidean 2-norm of w k . Therefore, the larger the diameter, the more the corresponding clone is expanding into the lineage associated to the largest slice. The values of the estimated conditional expectations are reported in Table 1. The scatterplot of Figure 2 clearly indicate strong correlation between the true parameters and the conditional expectations E u| Y ;ˆ [u]. In particular, as expected, as the noise variance 2 increases, the parameter estimates gradually move away from the diagonal, so that the correlation decreases. Also, our model correctly detected the dominance of clones 1 and 3 in lineages D and C respectively even for large values of 2 , as suggested by the pie-charts of Figure 2 and by the values of Table 1.
Subsequently, to check parameter uncertainty we run n sim = 100 independent simulations separately for each noise variance setting. After fitting both the base (24) and the random-effects (27) models, the latter always reached a significantly lower AIC compared to the null model as suggested by the boxplots of Figure 3. In Figure 3 we also report the boxplots of the estimated conditional expectation E u| Y ;ˆ [u], obtained from the independent simulations, divided by the true parameters ✓ true . Not surprisingly, as the noise variance 2 increases, the parameter estimates get poor, but they still fluctuate around the true values, even under extreme magnitudes of 2 . These results clearly show accurate performance of the method in the identification of a simulated clonal dominance and in the inference of the true parameters, regardless of the noise level of the data. 12

Simulating clonal tracking datasets
A clonal tracking dataset compatible with RestoreNet's functions must be formatted as a 3-dimensional array Y whose ijk-entry Y ijk is the number of cells of clone k for cell type j collected at time i. The function get.sim.tl() can be used to simulate a trajectory of a single clone given an initial conditions Y 0 for the cell counts, and obeying to a particular cell differentiation network defined by a list rct.lst of biochemical reactions. In particular, our package considers only three cellular events, such as cell duplication (Y it ! 1), cell death (Y it ! ;) and cell differentiation (Y it ! Y jt ) for a clone-specific time counting process observed in N distinct cell lineages. The time counting process Y t for a single clone in a time interval (t, t + t) evolves according to a set of biochemical reactions defined as v k = 8 > < > : (0 . . . 1 i . . . 0) 0 dup. of the i-th cell type (0 · · · 1 i . . . 0) 0 death of the i-th cell type (0 · · · 1 i . . . 1 j . . . 0) 0 diff. of the i-th type into the j-th type (21) with the k-th hazard function given by Finally, the net-effect matrix and hazard vector are defined as In particular, the cellular events of duplication, death and differentiation are respectively coded in the package with the character labels "A->1", "A->0", and "A->B", where A and B are two distinct cell types. The following R code chunk shows how to simulate clone-specific trajectories of cells via a ⌧ -leaping simulation algorithm. In particular, as an illustrative example we focus on a simple cell differentiation network structure of four synthetic cell types A, B, C and D and only one clone, as illustrated in Figure 4.

Fitting the base model
The base model is defined as . . .
Further details can be found in []. The package RestoreNet allows to infer the parameters (✓, 2 ) of (24) with a maximum likelihood approach, that is by solving the following constrained optimization problem where the objective function f is the negative log-likelihood of the multivariate normal distribution N NT M✓, ⌃(✓, 2 ) . The following R code chunk shows how to accomplish this on a clonal tracking dataset simulated from the same differentiation network structure of previous section. In this case we simulate the trajectories of three independent clones following different dynamics of clonal dominance, that is we use clone-specific values for the vector parameter ✓.

Fitting the random-effects model
The random-effects model is defined as Y = where Y t+ t Y t = Y is the vector of cellular increments that took place in the time interval t, M is the block-diagonal design matrix for the random effects u centered in ✓, J is the number of clones, and each block M j is clone-specific. As in the case of the null model (24), to explain additional noise of the data, which has the additional advantage of avoiding singularity of the covariance matrix W (✓), we add to its diagonal a small quantity 2 which we infer from the data. Under this framework (see [] for details) the conditional distribution of the random effects u given the data Y has the following explicit formulation where E u| Y ; [u] and V u| Y ; (u) provide clone-specific mean and variance of the (random) reaction rates. The package RestoreNet allows to infer the vector parameter = (✓, 2 , ⌧ 2 1 , . . . , ⌧ 2 p ), and in turn to get the corresponding conditional first twoorder moments E u| Y ; [u] and V u| Y ; (u), by the means of an efficient tailor-made Expectation-Maximization algorithm where Y and u take the roles of the observed and latent states respectively. The following R code chunk shows how to accomplish this on the simulated clonal tracking dataset of previous section. In this example we use the optimal parameter vector✓ 0 estimated for the null model in the previous section, as initial guess for the corresponding parameters in the random-effects model. > re.res <-fit . re ( theta _ 0 = null . res $ fit $ par , Y = Y, rct . lst = rcts , maxemit = 100) # # random -effects model fitting > re.res$ fit $ par # # estimated parameters [1]

Visualizing results
The main graphical output of RestoreNet is a clonal piechart. In this representation each clone k is identified with a pie whose slices are lineage-specific and weighted with w k , defined as the difference between the conditional expectations of the random-effects on duplication and death parameters, that is w k = E u| Y ;ˆ [u k ↵ l ] E u| Y ;ˆ [u k l ], where l is a cell lineage. The diameter of the k-th pie is proportional to the euclidean 2-norm of w k . Therefore, the larger the diameter, the more the corresponding clone is expanding into the lineage associated to the largest slice. The package RestoreNet includes the function get.scatterpie() which returns a clonal piechart given a fitted random-effects model previously obtained with the function fit.re(). The following R code chunk illustrates how to obtain a clonal piechart with few lines of R code. > re.res <-fit . re ( theta _ 0 = null . res $ fit $ par , Y = Y, rct . lst = rcts , maxemit = 100) # # random -effects model fitting > get . scatterpie ( re .res , txt = TRUE ) # # get the clonal piechart