## Abstract

CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data’s structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.

## Introduction

Recent advancements in genome editing with CRISPR (clustered regularly interspaced short palindromic repeats) have renewed interest in the construction of large-scale cell lineage trees for complex organisms [McKenna et al., 2016, Woodworth et al., 2017, Spanjaard et al., 2018, Schmidt et al., 2017]. These lineage-tracing technologies, such as the GESTALT method [McKenna et al., 2016] that we focus on here^{1}, inject Cas9 and single-guide RNA (sgRNA) into the embryo of a transgenic organism harboring an array of CRISPR/Cas9 targets separated by short linker sequences (barcodes). These barcodes accumulate mutations because Cas9 cuts are imperfectly repaired by non-homologous end joining (NHEJ) during development while Cas9 and sgRNA are available. The resulting mutations are passed from parent cell to daughter cell, thereby encoding the ontogeny. Mutated barcodes are later sequenced from the organism, and computational phylogenetic methods are then used to estimate the cell lineage tree. Because these barcodes have great diversity, GESTALT provides researchers with rich data with the potential to reveal organism and disease development in high resolution.

The most common phylogenetic methods used to analyze GESTALT data are Camin-Sokal (C-S) parsimony [Camin and Sokal, 1965] and the neighbor-joining distance-based method [Saitou and Nei, 1987]. However, these methods are blind to the operation of the GESTALT mutation process, so the accuracy of the estimated trees is poor [Salvador-Martínez et al., 2018]. In addition, existing methods supply branch length estimates in terms of an abstract notion of distance rather than time, limiting their interpretability: these estimated trees only provide ordering information between nodes on the same lineage, not between nodes on parallel lineages. Moreover, C-S parsimony is unable to distinguish between equally parsimonious trees, so obtaining a single tree estimate is difficult in practice; we find over ten thousand parsimony-optimal trees for existing datasets. To address these challenges, we set out to develop a statistical model of the mutation process, allowing us to estimate branch lengths that correspond to time as well as the mutation parameters.

No appropriate likelihood model is currently available for GESTALT because CRISPR arrays violate many classical statistical phylogenetic assumptions. First, Cas9 enzymes may cut two targets simultaneously, with the entire intervening sequence deleted during NHEJ. In addition, once the nucleotide sequence for a target is modified, Cas9 is no longer able to cut the target. Thus sites are not independent, the mutation process is irreversible, and cuts can introduce long insertions and/or deletions. In contrast, the classical phylogenetic assumptions are that individual nucleotide positions are independent and that the mutation process is reversible and only introduces point mutations [Felsenstein, 2004, Yang, 2014]. Finally, these types of methods assume that there are many independent observations: their estimates are unstable when the effective sample size is much smaller than the number of parameters to estimate [Goolsby, 2016, Adams and Collyer, 2018, Julien et al., 2018].

In this paper, we introduce GAPML (GESTALT analysis using penalized Maximum Likelihood), a statistical model for GESTALT and a tree-estimation method (covering both topology and branch lengths) based on an iterative penalized maximum likelihood procedure. We model barcode mutations as a two-step process: targets are cut according to a continuous time Markov chain, immediately followed by random insertions or deletions of nucleotides (indels). Our method does not rely on the aforementioned classical assumptions. Instead we introduce the following assumptions tailored to the GESTALT setting:

- (*outermost-cut*) an indel is introduced by cuts at the outermost cut sites;
- (*target-rate*) the cut rates depend only on which targets are active (i.e. able to be cut);
- (*indel-probability*) the conditional probability that an indel is introduced depends only on which targets were cut.

From these assumptions, we show that the Markov process is “lumpable” and the aggregated process is compatible with Felsenstein’s pruning algorithm, thereby enabling efficient computation of the likelihood [Kemeny and Snell, 1976, Felsenstein, 1981]. Since only a small number of barcodes are usually available in practice, we propose a regularization method on the branch length and mutation parameters to stabilize and improve estimates. Our method extends maximum-likelihood phylogenetic methods with branch-length penalties [Kim and Sanderson, 2008] to the setting where the tree topology is unknown.

We validate our method on simulated and empirical data. In simulations, our method is more accurate than current tree-estimation methods. In addition, we reconstruct cell lineage trees of transgenic zebrafish from McKenna et al. [2016] and show that our trees better reflect the known biology of zebrafish development. Based on these results, we conclude that with appropriate statistical techniques it is possible to reconstruct an accurate cell lineage tree with current GESTALT technology, which addresses some concerns raised in Salvador-Martínez et al. [2018]. Our simulation engine and estimation method are available on Github (https://github.com/matsengrp/gestaltamania).

## Results

### Brief description of our probabilistic GESTALT evolution model

We model the GESTALT barcode (see Figure 1) as a continuous time Markov chain where the state space is the set of all nucleotide sequences. A state transition is an instantaneous event where either (1) an unmodified target is cut then the repair process inserts/deletes nucleotides around the cut site, or (2) two unmodified targets are cut, the intervening sequence is removed, and the repair process inserts/deletes nucleotides around the cut sites. The transition rate between barcode sequences depends on the entire sequence and each target is associated with a separate cut rate. If multiple copies of the barcode are used, we assume the barcodes are on separate chromosomes or are sufficiently far apart that they act in an independent and identically distributed (iid) manner.
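To make the transition dynamics concrete, the following is a minimal sketch (not our actual implementation) of how single-target cuts compete as exponential clocks in a chain of this kind; double cuts and the indel repair step are omitted, and the function name and encoding are purely illustrative.

```python
import random

def simulate_barcode(cut_rates, t_end, seed=0):
    """Sketch: simulate single-target cuts on one barcode as a CTMC.

    Each active (unmodified) target i is cut at exponential rate
    cut_rates[i]; once cut, a target is inactive forever
    (irreversibility). Returns the list of (time, target) cut events
    occurring before time t_end.
    """
    rng = random.Random(seed)
    active = set(range(len(cut_rates)))
    t, events = 0.0, []
    while active:
        total = sum(cut_rates[i] for i in active)
        t += rng.expovariate(total)          # waiting time to next cut
        if t > t_end:
            break
        # choose which active target is cut, proportional to its rate
        r = rng.uniform(0, total)
        for i in sorted(active):
            r -= cut_rates[i]
            if r <= 0:
                break
        active.remove(i)
        events.append((t, i))
    return events
```

With multiple identical barcodes, one would simply run this simulation independently per barcode, matching the iid assumption above.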

We use this Markov model for GESTALT barcodes evolving along a cell lineage tree where the vertices represent cell divisions and the edge lengths represent time between cell divisions. The full cell lineage tree describes the relationships of all cells in the organism. Since we only collect a small sample of all the cells, our goal is to recover the subtree describing the development of the observed sequences.

To estimate this subtree, our method needs to calculate the likelihood of possible trees and model parameters, which requires an enumeration of the possible barcodes at each internal node. However a full enumeration is infeasible. For example, a double cut (transition (2) described above) could remove one or more targets, which could have themselves been modified in an infinite number of possible ways before the double cut erased this history.

We have carefully chosen our assumptions to make this likelihood tractable yet maintain biological realism. Briefly, under the *target-rate* and *indel-probability* assumptions, we can group states together if they share the same set of unmodified targets to calculate the likelihood more efficiently, a property known more formally as “lumpability” (Figure 2). Since the number of targets in a barcode is typically small (e.g. 10 targets per barcode in McKenna et al. [2016], Schmidt et al. [2017]), calculating the likelihood becomes computationally feasible. In addition, the *outermost-cut* assumption allows us to exclude many groups from the likelihood computation, so that the number of enumerated groups at most internal nodes is typically linear in the number of unique indels.
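Lumpability has a classical criterion due to Kemeny and Snell: a partition of the state space yields a Markov chain on the lumps if and only if all states within a lump have the same total transition rate into every other lump. A toy check of this condition (state names and rate values are illustrative only, not our model's actual states):

```python
def is_lumpable(Q, partition):
    """Kemeny-Snell lumpability check (sketch).

    Q: dict-of-dicts rate matrix on detailed states, Q[s][t] = rate s -> t.
    partition: list of lists of states (the lumps).
    Lumpable iff, for every pair of distinct lumps A and B, all states
    in A have the same total rate into B.
    """
    for A in partition:
        for B in partition:
            if A is B:
                continue
            totals = {sum(Q.get(s, {}).get(t, 0.0) for t in B) for s in A}
            if len(totals) > 1:   # states in A disagree on rate into B
                return False
    return True
```

In the GESTALT setting, the analogue of the example below is that all alleles sharing the same set of unmodified targets transition into any other such group at a common rate, which is exactly what the *target-rate* and *indel-probability* assumptions buy us.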

### A maximum-likelihood tree estimation procedure

We follow current best practice for maximum-likelihood phylogenetics by optimizing the tree and mutation parameters of our model using a hill-climbing iterative search over tree space. First, we initialize the tree topology by selecting a random parsimony-optimal tree from C-S parsimony. At each subtree prune and regraft (SPR) iteration, we select a random subtree and regraft where the penalized log likelihood is highest (Figure 3c). The method stops when the tree no longer changes. At each iteration, we only consider SPR moves that preserve the parsimony score as we have found that the parsimony-optimal trees tend to have the highest likelihoods (Figure 18). The entire algorithm is presented in Algorithm 1. We discuss some important details of our method below.

We maximize a penalized log likelihood as opposed to the unpenalized version since the latter tends to give unstable and inaccurate estimates when the dataset is generated by a small number of barcodes. In particular, the length of the leaf branches and the variance of the target rates tend to be overestimated in such settings. Thus we use a penalty function that discourages large differences in branch lengths and target cut rates. Penalization introduces a slight complication since certain candidate SPR moves have naturally larger penalties. In order to make the penalty comparable between candidate SPR moves, we randomly select a leaf in the subtree and apply the candidate SPR moves only to that single leaf. When scoring the SPR moves, the penalty is calculated for the shared subtree, i.e. the tree where we ignore the random leaf. Finally, we regraft the entire subtree where the penalized log likelihood is highest.
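As a sketch of the penalization idea (the exact penalty form is given in the Methods; here we simply penalize the variance of the branch lengths and of the log target cut rates, which is one way to discourage large differences in both):

```python
def penalized_loglik(loglik, branch_lengths, log_target_rates, kappa):
    """Sketch of a penalized log likelihood: subtract kappa times the
    variance of branch lengths and of log target cut rates, so that
    widely spread values are discouraged. Not the paper's exact penalty.
    """
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return loglik - kappa * (variance(branch_lengths)
                             + variance(log_target_rates))
```

Equal branch lengths and equal rates incur no penalty; spreading either set out lowers the objective, which is the stabilizing behavior described above.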

Our method is able to estimate the tree at a finer resolution than existing methods (Figure 3a). The most commonly used method, C-S parsimony, produces estimates at the coarsest resolution: for nodes where the ordering is ambiguous, the method simply groups them under a single parent node. This commonly results in tree estimates with many multifurcating nodes (nodes with three or more children), some with ten or more children. Our method uses the estimated model parameters to estimate the order and time of ambiguous nodes by projecting the subtree onto the space of caterpillar trees (Figure 3b). By producing tree estimates at a finer resolution, our method allows researchers to learn more about the structure of the true cell lineage tree. In addition, taking advantage of the irreversibility property, we efficiently estimate the branch ordering within the caterpillar trees by solving a single optimization problem, rather than considering each possible ordering separately.

### Simulation engine and results

We built a simulation engine of the GESTALT mutation process during embryonic development. Since cell divisions during embryonic development begin in a fast metasynchronous fashion and gradually become more asynchronous [Moody, 1998], the simulation engine generates a cell lineage tree by performing a sequence of synchronous cell divisions followed by a birth-death process where the birth rate decays with time. We mutate the barcode along this cell lineage tree according to our model of the GESTALT mutation process. The simulation engine can generate data that closely resembles the data collected from zebrafish embryos in McKenna et al. [2016] (Figure 4a). We can input different barcode designs into the simulation engine to understand how they affect our ability to reconstruct the cell lineage tree.
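A minimal sketch of the tree-generation step, under simplifying assumptions: we track only the number of lineages, we evaluate the decaying birth rate at the current time rather than simulating the non-homogeneous process exactly, and all names are illustrative.

```python
import math
import random

def simulate_lineages(n_sync, t_end, birth0, decay, death, seed=0):
    """Sketch: n_sync rounds of synchronous divisions, then a
    birth-death process whose birth rate decays with time as
    birth0 * exp(-decay * t). Returns the lineage count at t_end.
    (Evaluating the decaying rate at the current event time is an
    approximation to the non-homogeneous process.)
    """
    rng = random.Random(seed)
    n = 2 ** n_sync              # lineages after synchronous divisions
    t = 0.0
    while n > 0 and t < t_end:
        birth = birth0 * math.exp(-decay * t)   # decaying birth rate
        total = n * (birth + death)
        t += rng.expovariate(total)             # time to next event
        if t >= t_end:
            break
        if rng.random() < birth / (birth + death):
            n += 1                               # a cell divides
        else:
            n -= 1                               # a lineage dies
    return n
```

The full engine additionally records the tree itself and runs the barcode mutation process of the previous section along its branches.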

We used our simulation engine to assess the validity and accuracy of the estimated model parameters and tree. Because our method infers branch lengths, we evaluate the accuracy using two metrics that include branch length information: BHV distance [Billera et al., 2001] and internal node height correlation (see Figure 15). We compare our method to a simpler model-free approach: estimating the tree topology using C-S parsimony [Camin and Sokal, 1965] or neighbor-joining (NJ) [Saitou and Nei, 1987] and then applying semiparametric rate smoothing (`chronos` in the `R` package `ape`) to estimate branch lengths [Sanderson, 2002]. We will refer to these two approaches as “CS+chronos” and “NJ+chronos.” We do not compare against the original tree estimates from C-S parsimony and neighbor-joining since those branch lengths correspond to edit distance and have very poor performance according to our two metrics. Our method consistently outperforms these alternative methods (Figure 4c). We note that previous *in silico* analyses of GESTALT measure accuracy in terms of the Robinson-Foulds (R-F) distance, which only depends on the tree topology [Salvador-Martínez et al., 2018]. However the R-F distance does not recognize that different tree topologies can be very similar depending on their branch lengths, and is therefore too coarse as a performance metric.
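The internal node height correlation metric can be sketched as follows, assuming nodes in the true and estimated trees are matched by the clade (set of leaf names) below them; this is an illustration, not our exact implementation:

```python
def height_correlation(true_heights, est_heights):
    """Pearson correlation between true and estimated internal node
    heights (sketch). Inputs map a clade identifier (e.g. a frozenset
    of leaf names) to that node's height; only clades present in both
    trees are compared.
    """
    shared = sorted(set(true_heights) & set(est_heights), key=str)
    xs = [true_heights[c] for c in shared]
    ys = [est_heights[c] for c in shared]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Unlike the Robinson-Foulds distance, this score rewards an estimate that places matched ancestral nodes at the right relative times, not just one with the right topology.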

**Algorithm 1:** Cell lineage tree reconstruction for penalty parameter *κ*

Initialize tree 𝕋. Let the sequenced GESTALT barcodes be denoted *D*.

**for** iteration *k* **do**

Pick a random subtree from 𝕋. Select one of its leaves, `C`.

**for** each possible SPR move involving the subtree that does not change the parsimony score (including the no-op) **do**

Construct 𝕋 ′ by applying the SPR move to leaf `C`, and let the shared subtree be 𝕋 ′ with `C` excluded. Set the score of the SPR move as the penalized log likelihood, with the penalty computed over the shared subtree, maximized with respect to the branch length parameters *ℓ* and the deactivation and indel process parameters *θ* and *β*, respectively.

**end for**

Update the tree 𝕋 by performing the score-maximizing SPR move on the entire subtree.

**end for**

We find, based on the simulations, that our likelihood-based method improves in performance as the number of independent barcodes increases (Figure 4b). In a simulation with a six-target barcode, the estimated tree from a single barcode has an internal node height correlation of 0.5 with the true tree, whereas using four barcodes increases the correlation to 0.9. Even though other analyses of GESTALT have recommended increasing the number of targets in a single barcode to improve tree estimation [Salvador-Martínez et al., 2018], it is more effective to increase the number of targets by introducing independent (and identical) barcodes (Figure 16).

### Improved zebrafish lineage reconstruction

To validate our method, we reconstructed cell lineages using our method and other tree-building methods on GESTALT data from zebrafish [McKenna et al., 2016]. As the true cell lineage tree is not known for zebrafish, we employed more indirect measures of validity. For each method, we asked (1) if similar conclusions could be made across different biological replicates and (2) if the tree estimates aligned with the known biology of zebrafish development. The dataset includes two adult zebrafish where cells were sampled from dissected organs. The organs were chosen to represent all germ layers: the brain and both eyes (ectodermal), the intestinal bulb and posterior intestine (endodermal), the heart and blood (mesodermal), and the gills (neural crest, with contributions from other germ layers). The heart was further divided into four samples: a piece of heart tissue, dissociated unsorted cells (DHCs), FACS-sorted GFP+ cardiomyocytes, and non-cardiomyocyte heart cells (NCs). In addition, datasets were collected from embryos before gastrulation (dome stage, 4.3 hours post-fertilization (hpf)), at pharyngula stage (30 hpf), and from early larvae (72 hpf), where the cell type assignments are unknown.

### GAPML captures more similar developmental relationships between tissue types across the two adult fish replicates

For each estimated tree, we calculated the *distance between tissues* — the average tree distance between a leaf of one tissue to the closest internal node leading to a leaf from the other tissue, weighted by the allele abundance (Figure 5). (All alleles that were found in the blood were removed since blood is found in all dissected organs and can confound the relationship between organs [McKenna et al., 2016].) Recall that all of the fitting procedures are completely agnostic to any tissue source or cell abundance information. For a good method, we expect the correlation between tissue distances from the two fish to be close to one. We tested if the correlations were significant by permuting the cell types and abundances in the estimated trees. The correlation was 0.770 (*p <* 0.001) using our method whereas ‘CS+chronos’ and ‘NJ+chronos’ had correlations of 0.306 (*p* = 0.21) and −0.325 (*p* = 0.22), respectively. One might be concerned that our method is consistent across fish replicates because it returns very similar trees regardless of the data. However, this is not the case: When we re-run our method with randomly permuted cell types and abundances, the average correlation between the tissue distances drops to zero.
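The tissue-distance summary can be sketched as follows, with the tree encoded by parent pointers and node depths (times); the encoding and function name are illustrative, not our implementation's:

```python
def tissue_distance(parent, depth, leaf_tissue, abundance, A, B):
    """Sketch of the 'distance between tissues' summary: for each leaf
    of tissue A, walk toward the root until reaching the first ancestor
    that also leads to a leaf of tissue B; the distance is the drop in
    depth (time). Average over A-leaves, weighted by allele abundance.
    `parent` maps node -> parent (root absent), `depth` maps node -> time.
    """
    # tissues reachable below each node, via each leaf's ancestor chain
    below = {n: set() for n in depth}
    for leaf, tis in leaf_tissue.items():
        n = leaf
        while n is not None:
            below[n].add(tis)
            n = parent.get(n)
    num = den = 0.0
    for leaf, tis in leaf_tissue.items():
        if tis != A:
            continue
        n = parent.get(leaf)
        while n is not None and B not in below[n]:
            n = parent.get(n)
        if n is not None:
            w = abundance.get(leaf, 1.0)
            num += w * (depth[leaf] - depth[n])
            den += w
    return num / den if den else float("nan")
```

Note the summary is asymmetric in A and B, which is why Figure 5 is a full matrix rather than a triangle.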

### GAPML estimates similar mutation parameters across fish replicates

For each time point, the fish replicates were traced using the same GESTALT barcode and processed using the same experimental protocol (Table 6a). We compared the estimated target rates from our method to those estimated using a model-free empirical average approach where the estimated target cut rate is the proportion of times a cut was observed in that target in the set of unique observed indels. The average correlation between the estimated target rates from our method was much higher than that for the alternate approach (Figure 6a). In fact, we can also compare target cut rates between fish of different ages that share the same barcode, even if the experimental protocols are slightly different. The 4.3hpf and 3day fish share the same barcode version and we find that the target rate estimates are indeed similar (Figure 6b). Again, a possible concern is that our method may have high correlation because it outputs very similar values regardless of the data. However, the estimated target cut rates were different for fish with different barcode versions. More specifically, the 30hpf fish used version 6 of the GESTALT barcode whereas the other fish used version 7. Visually, the target rates look quite different between those in the 30hpf fish and the other fish with the version 7 barcode (Figure 6b). Calculating the pairwise correlations between the estimated rates in 30hpf versus 3day fish, the average correlation, 0.416, is quite low and the bootstrap 95% confidence interval, (0.046, 0.655), is very wide and nearly covers zero.
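The model-free baseline described above is straightforward; a sketch (with an illustrative data encoding, each unique indel represented by the set of targets it cuts):

```python
from collections import Counter

def empirical_cut_rates(indels, n_targets):
    """Model-free baseline (sketch): the estimated cut rate of a target
    is the fraction of unique observed indels that involve a cut at
    that target. `indels` is a list of sets of cut targets, one set per
    unique observed indel.
    """
    counts = Counter(t for indel in indels for t in set(indel))
    total = len(indels)
    return [counts[t] / total for t in range(n_targets)]
```

This baseline ignores the tree entirely, so it double-counts indels shared by many cells through common ancestry, which is one reason the likelihood-based estimates correlate better across replicates.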

### GAPML recovers both cell-type and germ-layer restriction

It is well known that cells are pluripotent initially and specialize during development. To evaluate recovery of specialization by tissue type, we calculated the correlation between the estimated time of internal tree nodes and the number of descendant tissue types; to evaluate recovery of specialization by germ layer, we calculated the correlation between the estimated time of internal nodes and the number of germ layers represented at the leaves. (As before, all the estimation methods do not use the tissue origin and germ layer labels.) Since any tree should generally show a trend where parent nodes tend to have more descendant cell types than their children, we compared our tree estimate to the same tree but with random branch length assignments and randomly permuted tissue types. Our method estimated much higher correlations compared to these random trees (Table 7a). We show an example of the node times versus the number of descendant cell types and germ layers in Figure 7b. The estimated correlations from the other methods tended to be closer to zero compared to those in GAPML in all cases, except when using ‘NJ + chronos’ to analyze the second adult fish. However upon inspection, the correlation is high for ‘NJ + chronos’ because it estimates that cells are pluripotent for over 90% of the fish’s life cycle and specialize during a small time slice at the very end.
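The count of descendant tissue types per internal node, which we then correlate against the estimated node times, can be sketched as follows (tree encoding illustrative):

```python
def descendant_tissue_counts(children, leaf_tissue):
    """For each internal node, count the distinct tissue types among
    its descendant leaves (sketch). `children` maps an internal node to
    its list of children; leaves appear only in `leaf_tissue`.
    """
    def tissues(node):
        if node in leaf_tissue:                 # leaf: its own tissue
            return {leaf_tissue[node]}
        out = set()
        for c in children.get(node, ()):
            out |= tissues(c)                   # union over children
        return out
    return {n: len(tissues(n)) for n in children}
```

The germ-layer version of the analysis is identical with `leaf_tissue` replaced by a leaf-to-germ-layer map.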

### Analysis of the zebrafish GESTALT data

In this section, we analyze the fitted trees of the adult zebrafish in more detail. Our primary goals are to (1) check if summaries concord with known zebrafish biology, (2) generate new hypotheses about zebrafish development, and (3) generate new hypotheses on how to improve the experimental procedure. Again, as our method is agnostic to the tissue types, our trees have no prior assumptions or particular biases regarding the relationships between cell types.

Here we focus on the ordering and relative length of events. We ignore absolute estimated times since our fitting procedure for a single barcode heavily penalizes large differences between branch lengths. Though this procedure aids estimation accuracy, it also heavily biases the absolute time estimates. Thus in the figures, we scale time to be between 0 and *T* = 1 to draw focus away from the absolute times. (We anticipate the branch length estimates will improve with future iterations of the GESTALT technology; according to our simulations, the absolute branch length estimates are much more reliable when several barcodes are inserted into the organism.)

We begin with a coarse summary of the cell lineage tree: We plot the average distance between a leaf node of one tissue type to the most recent ancestor of each different tissue type (Figure 5). This matrix recapitulates some well-established facts about zebrafish development. For example, we estimate that tissue types from the endoderm and mesoderm tended to have shorter shared lineage distances; these tissue types tended to separate from the ectodermal tissues earliest. This signal potentially captures the migration of the mesoderm and endoderm through the blastopore, isolating them from the ectoderm [Solnica-Krezel, 2005]. In addition, previous studies have found that gills are formed when the anterior part of the intestine grows toward and fuses with the body integument [Shadrin and Ozernyuk, 2002]. The distance matrix shows a large proportion of gill cells dividing late from the other endoderm and mesoderm layers.

The distance matrix also shows that the GFP+ cardiomyocytes tend to be farthest away from other tissue types, which could be either a developmental signal or an artifact of the experimental protocol. GFP+ cardiomyocytes were sorted using fluorescence-activated cell sorting (FACS), and this purity could drive their separation from the other, more heterogeneous organ populations. An interesting biological speculation would be that the heart is the first organ to form during vertebrate embryo development and, in particular, the myocardial cells are the first to develop, driving this observed signal [Keegan et al., 2004]. These observations show that GAPML's improved lineage distance estimation provides a more refined measure of the developmental process, one that will only improve as experimental approaches become more sophisticated, as our simulations show.

The full cell lineage tree estimated using GAPML for the first adult zebrafish provides significantly more detail than the Camin-Sokal parsimony tree inferred for the original McKenna et al. [2016] publication (Figure 8). Our tree has estimated branch lengths whereas the branches were all unit-length in McKenna et al. [2016]. In addition, the dashed lines in our tree correspond to the caterpillar spines where we have estimated the ordering between children of multifurcating nodes. Since the original maximum parsimony tree estimate in McKenna et al. [2016] contained many multifurcating nodes and our method converts any multifurcating node to a caterpillar tree, our final tree contains many caterpillar trees. The longest caterpillar spine in our estimated tree starts from the root node and connects all the major subtrees that share no indel tracts. As the zebrafish embryo rapidly divides from the single-cell stage, these initial CRISPR editing events establish the founding cell in each subtree. GAPML estimates the target cut rates to order the events along the caterpillar trees, an impossible task in the original Camin-Sokal multifurcating trees. Lastly, we observe that the last three subtrees at the end of this spine (farthest away from the root) are primarily composed of alleles only observed in the intestinal bulb and the posterior intestine. This concords with our understanding of zebrafish development: Of the dissected organs, the digestive tract is the last to fully differentiate at day four [Moody, 1998]. In aggregate, these examples again show the power of a refined lineage tree to establish new interesting biological questions and a refined map in which to answer them.

### Analysis of GESTALT barcode mutation parameters

Finally, our model’s estimated target cut rate parameters (Table 1) provide an interesting resource when considering redesigns of the GESTALT barcode. Here we focus on the GESTALT barcode in the adult fish. The estimated target cut rates were very similar across the two fish replicates.

We estimated very different cut rates across the ten targets. Targets 1 and 9 had the highest cut rates; target 3 had the lowest. The ratio between the highest and lowest cut rates is at least 10 in both fish, i.e. a deletion at target 1 is at least 10 times more likely to occur than at target 3. In terms of the tree estimation, the targets with high cut rates mainly help capture early cell divisions, whereas targets with low cut rates tend to capture late cell divisions. Having a broad spectrum of target cut rates is useful for capturing cell divisions throughout the tree, though the specific details depend on the true tree. Our simulation engine may be useful for understanding how variation in the target rates affects estimation accuracy under various conditions.

The double cut rate is similar across the fish, and it is quite high: for the first adult fish, the double cut rate of 0.076 means that the probability of introducing a single-target indel, as opposed to an inter-target indel, in an unmodified barcode is 59%. To counter this, we can decrease the number of long inter-target deletions (and the number of masked events) by placing high cut-rate targets closer together in the barcode. One potentially helpful barcode design is to place the highest cut-rate targets in the center and the lowest cut-rate targets on the outside. Alternatively, designers could arrange the targets from highest to lowest cut rate. Table 1 shows that the current barcode design is counter to our suggestion, as the two targets with the highest cut rates in the two adult fish are targets 1 and 9, near the two ends of the barcode.
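One way to view this probability (a sketch under the assumption that single-target and inter-target cuts compete as independent exponential clocks; the exact parameterization of the double cut rate is given in the Methods) is

```latex
\Pr(\text{first edit is a single-target indel})
  \;=\; \frac{\sum_i \lambda_i}{\sum_i \lambda_i + \lambda_{\mathrm{dbl}}},
```

where the $\lambda_i$ are the single-target cut rates of the active targets and $\lambda_{\mathrm{dbl}}$ is the total rate of inter-target double cuts; for an unmodified barcode, all ten targets contribute to both sums.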

The characterization of target efficiencies in a compact multi-target barcode is a challenging problem. Our method can help steer the next generation of CRISPR-based lineage recording technologies to have increased recording capacity.

## Discussion

In this manuscript, we have proposed a statistical model for the mutation process for GESTALT, a new cell lineage tracing technology that inserts a synthetic barcode composed of CRISPR/Cas9 targets into the embryo. Our method, GAPML, estimates the cell lineage tree and the mutation parameters from the sequenced modified barcode. Unlike existing methods, our method estimates branch lengths and the ordering between children nodes that share the same parent. We demonstrate that our method outperforms existing methods on simulated data, provides more consistent results across biological replicates, and outputs trees that better concord with our understanding of developmental biology. We have answered the question “Is it possible to reconstruct an accurate cell lineage tree using CRISPR barcodes?” in Salvador-Martínez et al. [2018] in the affirmative: The cell lineage tree can be estimated to a high degree of accuracy as long as appropriate methods are used.

Our method provides a number of technical contributions to the phylogenetics literature. The GESTALT mutation process violates many of the classical phylogenetic assumptions that existing methods rely upon for computational tractability. Thus we determined the assumptions most suitable for this new setting and developed new methods so that the likelihood is computationally tractable and the estimated trees are accurate. We believe these techniques could be useful for other phylogenetic problems where the independent-site assumption does not hold. In addition, our methods may be useful as a jumping-off point for analyzing other CRISPR-based cell lineage tracing technologies, such as those using homing CRISPR barcodes [Kalhor et al., 2017, 2018]. There are still many areas of improvement for the current method, such as quantifying the uncertainty of our estimates, estimating meta-properties of the cell lineage tree for organisms of the same species, and utilizing data sampled at multiple time points.

Finally, the biological results were relatively limited since the goal of this manuscript was primarily on methods development and the data in McKenna et al. [2016] only provide tissue source information. In future work, we plan to apply our method to analyze data where each allele is paired with much richer information, such as single-cell gene expression data [Raj et al., 2018].

## Materials and Methods

### Data

The data processed in this paper are all from McKenna et al. [2016] and are available at the Gene Expression Omnibus under GSE81713. We use the aligned data where each allele is described by the observed insertions/deletions (indels) at each target. Each CRISPR target can only be modified once and indels can only be introduced via a double-stranded break at a target cut site, so we further processed the aligned data accordingly: merging indels when more than one was associated with a given target, and extending the deletion lengths and insertion sequences so that a target cut site was nested within each indel. For this paper, we assume that the processed data are correct and do not attempt to model the effects of alignment error.
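A sketch of this normalization step (the indel encoding `(target, del_start, del_end, insert_seq)` and the coordinate conventions are illustrative, not the paper's exact representation):

```python
def normalize_indels(indels, cut_sites):
    """Sketch of the preprocessing applied to aligned GESTALT alleles:
    (1) if several indels are associated with one target, merge them
    into a single spanning indel; (2) extend each indel's deleted range
    so that the target's cut site lies inside it.
    Each indel is (target, del_start, del_end, insert_seq).
    """
    by_target = {}
    for tgt, s, e, ins in indels:
        if tgt in by_target:
            s0, e0, ins0 = by_target[tgt]
            # merge: span both deletions, concatenate insertions
            by_target[tgt] = (min(s, s0), max(e, e0), ins0 + ins)
        else:
            by_target[tgt] = (s, e, ins)
    out = []
    for tgt, (s, e, ins) in sorted(by_target.items()):
        cut = cut_sites[tgt]
        # extend so the cut site is nested within the deleted range
        out.append((tgt, min(s, cut), max(e, cut + 1), ins))
    return out
```

The output satisfies the model's requirement that every indel be attributable to a double-stranded break at a target cut site.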

### Methodological contributions

Here we highlight the key methodological contributions that we needed to develop in order to analyze GESTALT. We needed to develop new methodology because the GESTALT mutation process violates many classical assumptions in phylogenetics.

Since GESTALT is a new technology, we introduce new mathematical abstractions for the biological process. We then consider all statistical models that satisfy our proposed set of assumptions. We carefully designed these assumptions to balance biological realism and computational feasibility. We next show that such models are “lumpable” and use this to efficiently calculate the likelihood. Even though lumpability has been used to reduce computation time for general Markov chains, this idea is rarely used in phylogenetics. We show how lumpability can be combined with Felsenstein’s algorithm when the mutation process is irreversible.

In addition, our method estimates trees at a finer resolution compared to other methods, which leads to better tree estimates. In particular, we resolve multifurcations as caterpillar trees and show how to efficiently search over caterpillar tree topologies. As far as we know, this is one of the few methods that tunes the tree topology, which is typically treated as a combinatorial optimization problem, by solving a continuous optimization problem. The closest equivalent appears in the Bayesian phylogenetics literature where local changes to the topology may be introduced via a Subtree-Slide operator [Höhna et al., 2008].

To handle the small number of barcodes, we improve the stability of the method by maximizing the *penalized* log likelihood. Previous phylogenetic methods that penalize branch lengths assume that the tree topology is known [Kim and Sanderson, 2008, Zhang et al., 2018]. However, the tree topology is unknown in GESTALT. We show how penalization can be combined with tree topology search methods. Combining the two is nontrivial, since a naïve approach biases the search towards incorrect topologies.

Finally, we use an automatic differentiation framework to optimize the phylogenetic likelihood. Automatic differentiation software has accelerated deep learning research since it allows researchers to quickly iterate on their models [Baydin et al., 2018]. Likewise, we found that automatic differentiation greatly accelerated our progress, and we believe it may similarly accelerate the development of maximum likelihood estimation methods in phylogenetics.
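To illustrate why automatic differentiation is convenient here, the sketch below implements a minimal forward-mode “dual number” and differentiates a toy one-branch log likelihood (each target mutates by time *t* with probability 1 − e^{−λt}). This is purely illustrative and not the paper’s actual implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """Forward-mode autodiff value: carries f(x) and df/dx together."""
    val: float
    dot: float
    def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o): return Dual(self.val * o.val,
                                      self.val * o.dot + self.dot * o.val)

def dexp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.dot)

def dlog(x):
    return Dual(math.log(x.val), x.dot / x.val)

def log_lik(lam, t=1.0, n_mut=30, n_tot=100):
    """Toy one-branch log likelihood: each of n_tot targets mutates by
    time t with probability 1 - exp(-lam * t); n_mut were observed mutated."""
    p_no = dexp(lam * Dual(-t, 0.0))
    p_yes = Dual(1.0, 0.0) + Dual(-1.0, 0.0) * p_no
    return (Dual(n_mut, 0.0) * dlog(p_yes)
            + Dual(n_tot - n_mut, 0.0) * dlog(p_no))

# The gradient with respect to lam at lam = 0.5 comes out of the same pass:
grad = log_lik(Dual(0.5, 1.0)).dot
```

With the exact gradient available, any off-the-shelf first-order optimizer can then maximize the likelihood over branch lengths and mutation parameters.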

### GESTALT framework and definitions

In this section, we present mathematical definitions for the many components in GESTALT, beginning in this paragraph with an overview in words. We begin with defining the *barcode* and the individual *targets* within it. A barcode is mutated when nucleotides are inserted and/or deleted, which is referred to as an *indel tract*. We then define a possibly-mutated barcode or *allele* as a collection of the observed indel tracts. Finally, we use all these abstractions to define the barcode mutation process, which is framed as a stochastic process where the state space corresponds to all possible alleles and the transitions correspond to indel tracts. To aid the reader, Table 2 briefly summarizes the definitions used in the paper.

The unmodified barcode is a nucleotide sequence where *M* disjoint subsequences are designated as targets (Figure 9a). The targets are numbered from 1 to *M* from left to right and the positions spanned by target *j* are specified by the set pos(*j*). Each target *j* is associated with a single cut site *c*(*j*) ∈ pos(*j*).

A barcode can be modified by the introduction of an indel tract. An indel tract, denoted by IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}], is a mutation event in which targets *j*_{0} and *j*_{1} are cut (*j*_{0} ≤ *j*_{1}), positions *p*_{0}, *p*_{0} + 1, …, *p*_{1} - 1 in the unmodified barcode are deleted, and a nucleotide sequence *s* is inserted. If *j*_{0} = *j*_{1}, only a single target is cut. When *p*_{0} = *p*_{1}, then no positions are deleted. If *s* is of length zero, then no nucleotides are inserted. We only consider indel tracts that modify the sequence, i.e. either *p*_{0} *< p*_{1} or *s* has positive length, and nest at least one target cut site between positions *p*_{0} and *p*_{1}.
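The definition of an indel tract and its validity conditions can be captured in a small sketch; the `IndelTract` container and the `cut_sites` mapping from target index to *c*(*j*) are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndelTract:
    """IT[p0, p1, s, j0, j1]: cut targets j0..j1, delete positions
    p0 .. p1 - 1, and insert nucleotide sequence s."""
    p0: int
    p1: int
    s: str
    j0: int
    j1: int

def is_valid(it, cut_sites):
    """Check the conditions in the text: the tract modifies the sequence
    (p0 < p1 or s nonempty) and nests at least one cut site in [p0, p1]."""
    modifies = it.p0 < it.p1 or len(it.s) > 0
    nests = any(it.p0 <= cut_sites[j] <= it.p1
                for j in range(it.j0, it.j1 + 1))
    return modifies and nests and it.j0 <= it.j1 and it.p0 <= it.p1
```

For instance, a tract that deletes positions 20–29 around a cut site at position 25 is valid, while a tract that deletes nothing and inserts nothing is not.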

A possibly-modified barcode, also referred to as an allele, is a sequence of disjoint indel tracts associated with a single barcode (Figure 9b)
*a* = {IT[*p*_{0,k}, *p*_{1,k}, *s*_{k}, *j*_{0,k}, *j*_{1,k}] : *k* = 1, …, *m*},
where *m* ≥ 0 and *p*_{1,k} < *p*_{0,k+1} for *k* = 1, …, *m* - 1. The positions *p*_{0,k}, *p*_{1,k} in the indel tracts always refer to the positions in the original unmodified barcode: deletions and insertions do not change the indexing. Let Ω be the set of all possible alleles.

A target *j* is active in allele *a* if none of the nucleotides in pos(*j*) are modified. That is, the status of target *j* in allele *a* is
TargStat(*j*; *a*) = 1{some position in pos(*j*) is modified in *a*}.
So TargStat(*j*; *a*) is 0 if target *j* is active and 1 if it is inactive. For convenience, denote the target status of allele *a* as
TargStat(*a*) = (TargStat(1; *a*), …, TargStat(*M*; *a*)).
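A minimal sketch of the target status computation, assuming an allele is a set of `(p0, p1, s, j0, j1)` tuples, `pos[j]` is the set of positions of target *j*, and a tract is taken to modify its deleted positions (with an insertion-only tract touching `p0`):

```python
def targ_stat(j, allele, pos):
    """TargStat(j; a): 1 if any nucleotide position of target j is
    modified by some indel tract in the allele, else 0 (active)."""
    for (p0, p1, s, j0, j1) in allele:
        modified = range(p0, max(p1, p0 + 1))
        if any(p in pos[j] for p in modified):
            return 1
    return 0

def target_status(allele, pos, M):
    """TargStat(a) = (TargStat(1; a), ..., TargStat(M; a))."""
    return tuple(targ_stat(j, allele, pos) for j in range(1, M + 1))
```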
We now introduce the rules governing how alleles change through the introduction of indel tracts. First, transitions between possible allele states Ω are constrained by the status of the targets. For a given allele *a*, we can introduce the indel tract *d* = IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}] if and only if targets *j*_{0} and *j*_{1} are active. Note that the set of transitions allowed under this rule is a superset of biologically-plausible transitions. For example, even if position *p*_{0} is deleted, introducing indel tract *d* is formally allowed in our scheme. However in order to exclude biologically-implausible transitions, our models assign near-zero probability to such transitions.

Let Apply(*a, d*) be the resulting allele from introducing indel tract *d* into allele *a*. If indel tract *d* does not overlap any other indel tract in *a*, then Apply(*a, d*) is simply the union *a* ∪ {*d*}. If *d* completely masks other indel tracts, i.e. all indel tracts in *a* are either completely within the range of *p*_{0} to *p*_{1} − 1 or completely outside of it, then Apply(*a, d*) is the resulting allele after removing the masked indel tracts and adding *d*. The last possibility is that *d* is adjacent to or overlaps, but does not fully mask, other indel tract(s) in *a*; then Apply(*a, d*) is the resulting allele after properly merging the overlapping indel tracts (Figure 10). From a biological perspective, it is impossible to introduce overlapping but non-masking indel tracts. However, the likelihood model is much simpler if we allow such events. Since deletion lengths tend to be short in the empirical data (75% quantile = 10, McKenna et al. [2016]), we will estimate that long deletion lengths occur with small probability, which means that these overlapping but non-masking indel tracts have very small probabilities. Thus we believe our model closely approximates the GESTALT mutation process. In the following section, we discuss in more detail an assumption that removes these overlapping indel tracts from the likelihood calculation entirely.
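The three cases of Apply(*a, d*) — disjoint union, masking, and merging — can be sketched as below, with tracts represented as `(p0, p1, s, j0, j1)` tuples. How merged insertion sequences combine is our own simplification; the model assigns such merge events near-zero probability in any case:

```python
def apply_tract(allele, d):
    """Apply(a, d): introduce indel tract d = (p0, p1, s, j0, j1) into
    allele a (a list of position-disjoint tracts). Tracts fully inside
    [p0, p1) are masked (dropped); tracts overlapping or adjacent to d
    are merged with it; all other tracts are kept unchanged."""
    p0, p1, s, j0, j1 = d
    kept, merged = [], [d]
    for t in allele:
        q0, q1 = t[0], t[1]
        if p0 <= q0 and q1 <= p1:          # fully masked by d
            continue
        elif q1 < p0 or p1 < q0:           # disjoint from d
            kept.append(t)
        else:                               # overlapping/adjacent: merge
            merged.append(t)
    new = (min(t[0] for t in merged),
           max(t[1] for t in merged),
           "".join(t[2] for t in sorted(merged)),   # simplification
           min(t[3] for t in merged),
           max(t[4] for t in merged))
    return sorted(kept + [new])
```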

### GESTALT Model and Assumptions

Here we define the GESTALT model more concretely and formalize the model assumptions presented in the Introduction. Recall the three model assumptions. First, as we rarely observe very long deletion lengths in the GESTALT data, we assume that indel tracts are introduced by cuts at the outermost cut sites (assumption A). In addition, based on our biological understanding of the CRISPR/Cas9 mutation process, we assume that mutations occur in two stages: first the targets are cut according to the rates of the active targets (assumption B) and then nucleotides are deleted/inserted according to a process that only depends on which targets were cut (assumption C). Figure 11 presents a flowchart of how the assumptions are used to derive later results.

The barcode mutation process up to time *T* is formulated as a continuous time Markov chain {*X*(*t*) : 0 ≤ *t ≤ T*} with state space Ω. As mentioned before, the choice of using state space Ω implicitly assumes that indel tracts are introduced instantaneously, i.e. nucleotides are inserted and/or deleted immediately after target(s) are cut.

For a given tree 𝕋, let *t*_{N} be the length of the branch ending with node N and let {*X*_{N}(*t*) : 0 ≤ *t* ≤ *t*_{N}} be the Markov process along the branch. In addition, let *a*_{N} be the allele observed at that leaf, and let Leaves(N) denote all leaves with ancestral node N. The set of leaves in the entire tree 𝕋 is denoted Leaves(𝕋). As notational shorthand, we denote the Markov process for the branches with end nodes in the set *s* as *X*_{s}. In addition, the observed alleles in the leaf set *s* are denoted *a*_{s}.

If there are multiple barcodes, we use the notation *X*^{(i)} to represent the process for the *i*th barcode and *a*^{(i)} to represent the allele observed on the *i*th barcode. In this manuscript, we assume the barcodes are independently and identically distributed. Therefore we will typically discuss the model and assumptions in the context of a single barcode and omit the index of the barcode.

Unfortunately, calculating the likelihood of the tree for a general model where the mutation rate depends on the entire sequence is computationally intractable: we would need to compute the transition rates between an infinite number of barcode states. Instead we introduce a model assumption so that we can aggregate the possible barcode states into lumped states indexed by target statuses. Then we can compute the likelihood using transition matrices of dimension at most 2^{M} × 2^{M}. As *M* is typically small (*M* = 10 in McKenna et al. [2016]), the assumption makes the likelihood computationally feasible.

First we formalize the *outermost-cuts* assumption, which states that the cuts for an indel tract occur at the outermost cut sites. We define this mathematically by requiring that for any indel tract where targets *j*_{0} and *j*_{1} are cut, the deletions to the left and right cannot extend past the cut site of the neighboring targets *j*_{0} - 1 and *j*_{1} + 1.

(*outermost-cuts*). *All indel tracts are of the form* IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}] *where*
*c*(*j*_{0} - 1) < *p*_{0} ≤ *c*(*j*_{0}) *and* *c*(*j*_{1}) ≤ *p*_{1} ≤ *c*(*j*_{1} + 1).

This assumption limits the possible mutation histories of the alleles. Note that Assumption 1 still allows indel tracts to deactivate targets immediately neighboring the cut site. That is, an indel tract *d* = IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}] can either have a *short* deletion to the left, so that no nucleotide in pos(*j*_{0} - 1) is deleted and target *j*_{0} - 1 is unaffected, or a *long* deletion to the left, which deletes nucleotides in pos(*j*_{0} - 1) and thereby deactivates target *j*_{0} - 1.

We can have similar short and long deletions to the right.

For the second assumption, let us introduce the concept of a target tract, which is a set of indel tracts that cut the same target(s) and deactivate the same target(s). A target tract, denoted TT[*j*′_{0}, *j*_{0}, *j*_{1}, *j*′_{1}], is the set of all indel tracts that cut targets *j*_{0} and *j*_{1} and delete nucleotides such that targets *j*′_{0} through *j*′_{1} are inactive.
Note that we always have *j*′_{0} ∈ {*j*_{0} - 1, *j*_{0}} and *j*′_{1} ∈ {*j*_{1}, *j*_{1} + 1}. Each indel tract *d* belongs to a single target tract; we denote its associated target tract by TT(*d*).
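A sketch of computing TT(*d*), writing the target tract as the tuple (*j*′_{0}, *j*_{0}, *j*_{1}, *j*′_{1}) of cut and deactivated targets; the `pos` mapping of target index to position set is an illustrative assumption:

```python
def target_tract(d, pos, M):
    """TT(d): the target tract of indel tract d = (p0, p1, s, j0, j1),
    as the tuple (j0', j0, j1, j1'), where j0'..j1' are the targets that
    d deactivates. Deleted positions are p0 .. p1 - 1, and the cut
    targets j0..j1 are always deactivated."""
    p0, p1, s, j0, j1 = d
    deleted = set(range(p0, p1))
    deactivated = [j for j in range(1, M + 1)
                   if j0 <= j <= j1 or (pos[j] & deleted)]
    return (min(deactivated), j0, j1, max(deactivated))
```

For example, a tract cutting only target 2 but whose deletion reaches into the positions of target 1 yields the target tract (1, 2, 2, 2).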

This second assumption decomposes the mutation process into a two-step procedure in which targets are cut and then indels are introduced; it combines the *target-rate* and *indel-probability* assumptions. In particular, the assumption states that the instantaneous rate of introducing an indel tract can be factorized into the rate of introducing any indel tract from a target tract, which depends on the target status of the current allele, and the conditional probability of introducing that particular indel tract, which only depends on the target tract.

(*target-rate, indel-probability*). *Let a be any allele, d be any indel tract that can be introduced into a, and τ be the target tract* TT(*d*). *The instantaneous rate of introducing indel tract d in allele X*(*t*) = *a can be factorized into two terms: first, a function h that only depends on τ,* TargStat(*a*), *and time t, then second, the conditional probability of introducing d given τ:*
*q*(*a* → Apply(*a, d*); *t*) = *h*(*τ*, TargStat(*a*), *t*) Pr(*d* | *τ*).

Thus the instantaneous rate of introducing *d* only depends on the allele *a* through its target status. Using this assumption, we will show that the mutation process is equivalent to a continuous time Markov chain in which we lump together possible allele states that share the same target status.

### Likelihood approximation: summing over likely ancestral states

The likelihood of a given tree and mutation parameters is the sum of the probability of the data over all possible mutation histories. There are many possible ancestral states since inter-target deletions can mask previously-introduced indel tracts. In this section, we present an approximation of the likelihood that only sums over the likely ancestral states and ignores those with very small probabilities. We also provide a simple algorithm that efficiently specifies the set of these likely states, which is useful when we actually implement the (approximate) likelihood calculation.

We first address the problem that inter-target indel tracts have too many possible histories for brute-force computation. Not only can inter-target deletions mask previously-introduced indel tracts, but they can also result from merging multiple overlapping indel tracts. To simplify the likelihood calculation, we ignore any mutation history in which indel tracts merge. We are motivated to do so by observing that indel tracts rarely merge: merging only occurs when many nucleotides are deleted, whereas we mostly observe short deletions in the experimental data. Thus, mutation histories involving merging indel tracts probably contribute a negligible fraction of the total likelihood. The approximation is therefore as follows:

*We approximate the likelihood with the probability of the data over all possible mutation histories where no two indel tracts are ever merged:*

Note that this approximation strictly lower bounds the full likelihood since we sum over a subset of the possible histories.

Based on Approximation 1, we define a partial ordering between alleles. Given two alleles *a, a’* ∈ Ω, we use *a’* ≽ *a* to indicate that *a* can transition to the allele *a’* without merging indel tracts; that is, there is a sequence of indel tracts *d*_{1}, …, *d*_{k} such that
*a’* = Apply(⋯ Apply(Apply(*a, d*_{1}), *d*_{2}) ⋯, *d*_{k}),
where no indel tracts merge.

Now we present an algorithm to concisely express the set of alleles that are summed over at each internal node under Approximation 1. For the tree 𝕋 and observed alleles *a*_{Leaves(𝕋)}, define the set of likely ancestral states at each internal node N as
We define AncState(·) over leaf nodes in the same way as (7), but we interpret this set as the alleles that likely preceded the observed allele. Henceforth, we use the shorthand notation AncState(N) ≡ AncState(N; 𝕋, *a*_{Leaves(𝕋)}) whenever the context is clear. The approximate likelihood from Approximation 1 is equal to summing over AncState(N) at each internal node N.

To construct the sets AncState(N), we also need sets of indel tracts of the following forms (Figure 12a):

- *wildcard*^{2} WC[*j*_{0}, *j*_{1}]: the set of all indel tracts that only deactivate targets within the range *j*_{0} to *j*_{1}
- *singleton-wildcard* SGWC[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}]: the union of the singleton set {IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}]} and its inner wildcard, if it exists.

The singleton of singleton-wildcard (9) refers to {IT[*p*_{0}, *p*_{1}, *s, j*_{0}, *j*_{1}]} and the inner wildcard of a singleton-wildcard refers to WC[*j*_{0} + 1, *j*_{1} - 1] if it exists and Ø otherwise.

Two singleton-wildcards are disjoint if the range of their positions don’t overlap. A wildcard is disjoint from a wildcard/singleton-wildcard if the range of their targets don’t overlap.

Given a set of indel tracts *D*, let the alleles generated by *D*, denoted Alleles(*D*), be the set of alleles that can be created using subsets of *D*, i.e. Alleles(*D*) is
We are interested in wildcards and singleton-wildcards because for any leaf node L with indel tracts IT[*p*_{0,m}, *p*_{1,m}, *s*_{m}, *j*_{0,m}, *j*_{1,m}] for *m* = 1, *…, M*_{L}, a superset of AncState(L) is the alleles generated by the union of their corresponding singleton-wildcards, i.e.
In fact, the following lemma states that we can use a recursive algorithm to compute supersets of AncState(N). The algorithm starts at the leaves and proceeds up towards the root. We first use (10) to compute supersets of AncState(L) for all leaf nodes L. Now consider any internal node N with two (direct) children nodes C_{1} and C_{2}. For mathematical induction, suppose that we have already computed supersets of AncState(C_{i}) for *i* = 1 and 2 that are the alleles generated by unions of wildcards/singleton-wildcards. We compute the superset of AncState(N) as the alleles generated by the nonempty pairwise intersections of wildcards and/or singleton-wildcards corresponding to AncState(C_{1}) and AncState(C_{2}). Since the intersection of wildcards and/or singleton-wildcards is always a wildcard or singleton-wildcard, we can also write this superset as the alleles generated by the union of disjoint wildcards and/or singleton-wildcards. We can repeatedly intersect wildcards/singleton-wildcards for internal nodes with multiple children nodes. The proof for the lemma is straightforward so we omit it here.
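The recursive intersection step can be sketched as below. This is a simplified version that assumes each singleton deactivates exactly the targets *j*_{0}..*j*_{1} it is tagged with, and it ignores the case where one singleton lies inside the other's inner wildcard; wildcards are `("WC", j0, j1)` and singleton-wildcards are `("SGWC", singleton_id, j0, j1)`:

```python
def inner_wc(sw):
    """Inner wildcard WC[j0+1, j1-1] of a singleton-wildcard, or None."""
    if sw is None or sw[3] - sw[2] < 2:
        return None
    return ("WC", sw[2] + 1, sw[3] - 1)

def intersect(u, v):
    """Intersect two wildcards and/or singleton-wildcards; None if empty."""
    if u is None or v is None:
        return None
    if u[0] == "SGWC" and v[0] == "SGWC":
        if u == v:
            return u                        # identical singleton-wildcards
        return intersect(inner_wc(u), inner_wc(v))
    if u[0] == "SGWC":                      # put the wildcard first
        u, v = v, u
    if v[0] == "SGWC":
        if u[1] <= v[2] and v[3] <= u[2]:   # singleton fits inside wildcard
            return v
        return intersect(u, inner_wc(v))
    j0, j1 = max(u[1], v[1]), min(u[2], v[2])
    return ("WC", j0, j1) if j0 <= j1 else None
```

Note that the result is always again a wildcard or singleton-wildcard (or empty), which is what allows the bottom-up recursion over internal nodes.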

*Consider any internal node* N *with children nodes* C_{1}, *…,* C_{K}. *Suppose for each child* C_{k} *we have that*
*where the sets are pairwise disjoint wildcards and/or singleton-wildcards. Then, we can also write* AncState(N) *in the form of* (11), *where the sets are the non-empty intersections across the children, i.e.*

For efficiency reasons, we are not satisfied with computing supersets of AncState(·); rather, we would like to concisely express the set of alleles that is exactly equal to AncState(·). The only case in which the algorithm computes a strict superset of AncState(N) is when the alleles observed at the leaves of N imply that the observed indel tracts must be introduced in a particular order. For example, if an allele has indel tracts *d*_{1} and *d*_{2}, we know that *d*_{1} must be introduced before *d*_{2} if *d*_{1} cuts target *j* and *d*_{2} deactivates target *j* (Figure 12b). Due to this ordering, we may find that the same indel tract observed in two alleles must have been introduced independently (also known as homoplasy in the phylogenetics literature). To indicate such orderings, we use the notation *d* ∈ *a* ⇒ *d’* ∈ *a* to denote that “if indel tract *d* is in allele *a*, then indel tract *d’* must also be in *a*.” The set of alleles respecting this ordering constraint is denoted
Per this definition, Order(*d* ⇒ *d’*) contains all alleles that do not include *d*.

The following lemma builds on Lemma 1 and computes the sets exactly equal to AncState(·). The algorithm is similar to before, but the parent nodes also adopt ordering requirements from their children nodes. Note that ordering requirements only ever involve observed indel tracts. Again, the proof for the lemma is straightforward so we omit it here.

*For any leaf node* L, *suppose its observed allele is* {*d*_{m} : *m* = 1, *…, M*_{L}} *for some M*_{L} ≥ 0, *where d*_{m} = IT[*p*_{0,m}, *p*_{1,m}, *s*_{m}, *j*_{0,m}, *j*_{1,m}]. *Denote its list of ordering requirements as*
*Then*
*where D*_{L,m} = SGWC[*p*_{0,m}, *p*_{1,m}, *s*_{m}, *j*_{0,m}, *j*_{1,m}].

*Similarly, for any internal node* N, *we can also write* AncState(N) *in the form of* (13). *If node* N *has children nodes* C_{1}, …, C_{K} *whose sets are pairwise disjoint wildcards and/or singleton-wildcards satisfying* (12), *and*
Now that we have shown that AncState(N) can be written in terms of disjoint wildcards and singleton-wildcards, we introduce one more piece of notation that will be useful later. Define SG(N) to be the singletons from the singleton-wildcards in AncState(N).

### Likelihood calculation: aggregating states

Here we show how to use the concept of “lumpability” to calculate the approximate likelihood (6), even when there are an infinite number of ancestral states. Recall that we previously proposed calculating (6), which sums over a subset of all possible ancestral states. Though this has decreased the set of states to sum over, the number of ancestral states under consideration at each tree node is still infinite. Even if we ignore the insertion sequences, the number of possible ancestral states still grows at a rate of *O*(*p*^{2}) where *p* is the number of positions in the unmodified barcode, and the likelihood calculation has complexity *O*(*p*^{6}) since we need to construct transition matrices. Since the barcode is 300 nucleotides long in McKenna et al. [2016], we cannot calculate the likelihood by considering all possible states separately.

To handle Markov processes with infinite state spaces, one solution is to partition the states into lumped states and show that the behavior of the original Markov process is equivalent to that of an aggregate Markov process over the lumped states [Kemeny and Laurie Snell, 1976, Hillston, 1995]. This property, called “lumpability,” is defined as follows.

*Let X*(*t*) *be a continuous time Markov chain with state space* Ω. *If there exists a partition* {*A*_{1}, *…, A*_{M}} *of* Ω *and a continuous time Markov chain Y* (*t*) *with state space* {*A*_{1}, *…, A*_{M}} *such that*
*then X is lumpable.*

Although lumpability is a well-established technique for Markov chains, the practical difficulty is typically in constructing the appropriate partition [Ganguly et al., 2014].

There is relatively little work on applying these ideas of lumpability in phylogenetics. (The one application in Davydov et al. [2017] calculates the likelihood of a codon model approximately by assuming states are lumpable, even though this is not necessarily true in their model; here we will show that the states are indeed lumpable.) Here we extend lumpability to the phylogenetics setting where we have different partitions of the state space at each tree node. In particular, for some indexing set *B*, define a partition {*g*(*b*; N) : *b* ∈ *B*} of Ω at every node N. Lumpability is only useful for efficient phylogenetic likelihood computation if these partitions are compatible with Felsenstein’s pruning algorithm [Felsenstein, 1981]. For any *b* ∈ *B* and node N, let *p*_{N}(*b*) be the component of the likelihood for the subtree below N for states in partition *b*:
By Felsenstein’s algorithm, we have
For lumpability to be useful, we must be able to show that there exists an easy-to-compute weight function *w*(*b,* N, *b’,* C) such that
and that for each tree node C, *X*_{C}(·) is indeed lumpable over the partition *g*(·; C). Obviously if the partitions are the same across all tree nodes, then we can just set all weights to one. However we will need to construct a different partition for each tree node for the GESTALT likelihood.
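For readers unfamiliar with the pruning recursion, here is a minimal self-contained version on a toy irreversible two-state chain (unmutated → mutated at rate λ); the GESTALT calculation replaces the two states with the lumped target statuses described next:

```python
import math

def trans(t, lam):
    """Transition matrix of a toy irreversible two-state chain
    (state 0 = unmutated, 1 = mutated) over a branch of length t."""
    p = math.exp(-lam * t)
    return [[p, 1.0 - p], [0.0, 1.0]]

def prune(node, lam):
    """Felsenstein's pruning algorithm: return the partial likelihoods
    [Pr(data below node | node state s)] for s = 0, 1. Nodes are
    ("leaf", observed_state, branch_len) or ("internal", children, branch_len)."""
    if node[0] == "leaf":
        return [1.0 if s == node[1] else 0.0 for s in range(2)]
    below = [1.0, 1.0]
    for child in node[1]:
        cb = prune(child, lam)
        P = trans(child[2], lam)
        below = [below[s] * sum(P[s][s2] * cb[s2] for s2 in range(2))
                 for s in range(2)]
    return below

# Two mutated leaves on unit-length branches below an unmutated root:
tree = ("internal", [("leaf", 1, 1.0), ("leaf", 1, 1.0)], 0.0)
lik = prune(tree, 1.0)[0]   # root conditioned on the unmutated state
```

Here the root-conditioned likelihood is (1 − e^{−λ})², one factor per independent branch, matching the per-child product in the recursion.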

We propose partitioning Ω at tree node N based on whether or not the allele is a likely ancestral state (i.e. is it in AncState(N)) and its target status (Figure 2):

*Define the indexing set B to be* {0, 1}^{M} ∪ {other}.

*For internal tree node* N, *partition the state space* Ω *into*
*For leaf node* N, *partition the state space* Ω *into*

To prove that the Markov process over the branch with end node N is lumpable with respect to the proposed partition, we show that the instantaneous transition rate from any allele in *g*(*b*; N) to the set *g*(*b’*; N) is the same. Therefore we use *q*_{lump} to denote the transition rates between the lumped states {*g*(*b*; N)}. The results below show that there are two types of transitions between the lumped states, which determines the appropriate formula for calculating *q*_{lump}. Either the transition corresponds to an observed indel tract and there is only one indel tract that is valid for transitioning between the lumped states; or the transition corresponds to a masked indel tract, in which case all indel tracts from the possible target tracts are valid transitions between the lumped states.

*Suppose Assumption 2 holds. Consider any branch with child node* C, *and target statuses b, b’*∈ {0, 1}^{M} *where the sets g*(*b*; C) *and g*(*b’*; C) *are nonempty. For any alleles a, a’* ∈ *g*(*b*; C), *we have*
*If the only transition from an element in g*(*b*; C) *to g*(*b’*; C) *is via the unique indel tract d* ∈ SG(C) *that deactivates the targets b’* \ *b, then*
*where h is defined in Assumption 2. Otherwise, we have*
*Proof.* The instantaneous transition rate for an allele *a* ∈ *g*(*b*; C) to the set *g*(*b’*; C) is
If *d* is an indel tract that can be introduced to the allele *a* ∈ *g*(*b*; C) and Apply(*a, d*) has target status *b’*, then we can introduce the same indel tract to any other allele *a’* ∈ *g*(*b*; C) and Apply(*a’, d*) will also have the target status *b’*. Therefore we have proven that (19) must hold for all *a, a’*∈ *g*(*b*; C).

To calculate the hazard rate between these lumped states, we rewrite the summation by grouping indel tracts with the same target tract. One of the following two cases must be true:

1. From the decomposition (13) of AncState(C), there is only one indel tract *d* in the sets *D*_{C,m} for *m* = 1, …, *M*_{C} such that Apply(*a, d*) ∈ *g*(*b*′; C) for all *a* ∈ *g*(*b*; C). *d* cannot be from a wildcard or the inner wildcard of a singleton-wildcard, since this would contradict the fact that *d* is the only indel tract in {*D*_{C,m}} such that Apply(*a, d*) ∈ *g*(*b*′; C) for all *a* ∈ *g*(*b*; C). Therefore *d* must be the singleton for some singleton-wildcard *D*_{C,m}. In other words, the only possible transition from *g*(*b*; C) to *g*(*b*′; C) is via the indel tract *d*.

2. Otherwise, for some target tract *τ*, there are at least two indel tracts *d, d*′ ∈ *τ* in the sets *D*_{C,m} for *m* = 1, …, *M*_{C} that deactivate targets *b*′ \ *b* such that Apply(*a, d*) ∈ *g*(*b*′; C) and Apply(*a, d*′) ∈ *g*(*b*′; C) for all *a* ∈ *g*(*b*; C). In this case, *d* and *d*′ must be from a wildcard or the inner wildcard of a singleton-wildcard (*d* and *d*′ cannot both be from a singleton of a singleton-wildcard since *d* ≠ *d*′). Therefore every indel tract *d* in *τ* satisfies Apply(*a, d*) ∈ *g*(*b*′; C) for all *a* ∈ *g*(*b*; C).

Therefore (20) simplifies to
Note that to construct the entire instantaneous transition rate matrix of the aggregated process, we can easily calculate the total transition rate away from a target status and then calculate the transition rate to sink state *g*(other; C) using the fact that each row sums to zero. The transition rate away from *g*(other; C) is zero. □
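The construction at the end of the proof — an aggregate rate matrix over target statuses with rows summing to zero and absorbing behavior once no further transitions are possible — can be sketched for a toy model in which each active target *j* is cut independently at rate λ_{j} (single-target cuts only; inter-target and masked transitions are omitted for brevity):

```python
from itertools import product

def lumped_rate_matrix(lam):
    """Aggregate rate matrix over target statuses b in {0,1}^M for a toy
    model where each still-active target j is cut at rate lam[j]. The
    diagonal is set so each row sums to zero, and the fully-inactive
    status has no outgoing transitions (all-zero row)."""
    M = len(lam)
    states = list(product((0, 1), repeat=M))
    idx = {b: i for i, b in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for b in states:
        for j in range(M):
            if b[j] == 0:                        # target j+1 still active
                b2 = b[:j] + (1,) + b[j + 1:]    # status after cutting it
                Q[idx[b]][idx[b2]] += lam[j]
                Q[idx[b]][idx[b]] -= lam[j]
    return states, Q
```

Even in this toy version, the matrix has dimension 2^{M} × 2^{M}, matching the bound stated earlier.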

We are finally ready to combine lumpability with Felsenstein’s pruning algorithm. The following theorem provides a recursive algorithm for calculating (6), using results from above.

*Suppose the above model assumptions hold. Consider any tree node* N, *target status b, and nonempty allele group g*(*b*; N). *Denote*
*If N is an internal node, then*
*where* Pr(*X*_{C}(*t*_{C}) ∈ *g*(*b*′; C) | *X*_{C}(0) ∈ *g*(*b*; C)) *is calculated using the instantaneous transition rates given in Lemma 3.*

*Proof.* For any internal node, we know that
(We do not need to sum over the partition *g*(other; N) since it contributes zero probability.) By irreversibility of the mutation process, *g*(*b*; N) ⊆*g*(*b*; C) if C is a child of N. By (19) in Lemma 3,
which means (22) also holds for node N. □

Note that (22) requires enumerating the possible target statuses. We can do this quickly using Lemma 2.

Since each node may have a different partition of the state space, we compute a separate instantaneous transition rate matrix for each branch. If we have multiple barcodes, we need to compute a separate matrix for each branch and for each barcode. Calculating the likelihood and its gradient can therefore become memory-intensive when there are many branches and/or barcodes. One way to reduce the amount of memory is to sum over subsets of AncState(·) instead. This is often reasonable since it is unlikely for the barcode to have many hidden events. For the analyses of the zebrafish data, we only sum over states that can be reached when at most one masked indel tract occurs along each branch. If there are more than 20 such states at a node, we only sum over the possible states when no masked indel tracts occur along that branch.

### Caterpillar trees

As discussed in the main manuscript, we resolve the multifurcations in the tree as caterpillar trees to estimate the ordering of events. Recall that a caterpillar tree is a tree where all the leaves branch off of a central path, which we call the “caterpillar spine.” Thus for each multifurcating node N with children nodes C_{1}, …, C_{K}, we resolve the multifurcation as a tree where the children nodes branch off of the caterpillar spine (Figure 3b). We do not consider all possible resolutions of the multifurcations since there are a super-exponential number of them and we likely do not have enough information to choose between all the possible trees (recall that they are parsimony-equivalent).

We need an efficient method to select the best ordering in each caterpillar tree since the number of possible orderings for *K* children nodes is *K*!, which is also super-exponential. Since it is computationally intractable to calculate the likelihood for each ordering, we take an alternative approach in which we introduce another approximation of the likelihood. This approximate likelihood can be computed using the same mathematical expression regardless of the ordering of the children nodes, which means we can tune over all possible orderings in the caterpillar trees by solving a single continuous optimization problem.

*We approximate the likelihood by considering only the mutation histories that have a constant allele along the caterpillar spines:*

To construct a mathematical expression for (23) that is independent of the ordering along caterpillar trees, we first re-parameterize the branch lengths for children of multifurcating nodes. For each child C of a multifurcating node, let *ℓ*_{C} indicate the distance between the child node and the multifurcating node and *β*_{C} ∈(0, 1) indicate the proportion of distance *ℓ*_{C} that is located on the caterpillar spine (Figure 13). We can capture all possible orderings in a caterpillar tree by varying the values of these two sets of parameters across the children of a multifurcating node.

Next we extend the likelihood calculation in Theorem 1 where the multifurcations are resolved as caterpillar trees and we want to calculate the approximate likelihood (23). We do this by recursing on the analogous quantity
To calculate (24), we use the recursive relation in which it is equal to
where *t*_{spine} = max{*ℓ*_{C}*β*_{C} : C ∈ children(N)}. Using the same machinery as in Lemma 3 and Theorem 1, we can then calculate (25) efficiently by grouping ancestral allele states into lumped states.
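The (*ℓ*_{C}, *β*_{C}) parameterization can be made concrete with a small sketch: each child attaches to the spine at distance *ℓ*_{C}*β*_{C} from the multifurcating node, the ordering along the spine is obtained by sorting the attachment points, and *t*_{spine} is their maximum. The dictionary representation is an illustrative assumption:

```python
def spine_layout(children):
    """Order children along a caterpillar spine. `children` maps a child
    name to (l, beta): l is the distance to the multifurcating node and
    beta the fraction of l lying on the spine, so the child attaches at
    distance l * beta and hangs off with pendant length l * (1 - beta)."""
    attach = {c: l * b for c, (l, b) in children.items()}
    order = sorted(children, key=lambda c: attach[c])
    t_spine = max(attach.values())
    return order, t_spine
```

Because the same expression is evaluated for any values of (*ℓ*_{C}, *β*_{C}), a continuous optimizer moving these parameters implicitly searches over the *K*! discrete orderings.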

### Penalization

Our algorithm fits the tree and mutation parameters by maximizing the penalized log likelihood. Penalization improves accuracy when the number of observations is small relative to the number of parameters; GESTALT exhibits this problem because the number of parameters is large and the number of independent barcodes is small (McKenna et al. [2016] only has one barcode).

We propose a tree-based penalty that discourages large differences in the branch lengths ℓ and the target rates λ. For multifurcating nodes, the branch lengths include the length of the spine as well as the lengths of the branches off of the spine. Let *L* be the number of branch lengths in ℓ. The penalty is then
where *Κ*_{1}, *Κ*_{2} > 0 are penalty parameters.

Our penalty on the branch length was also considered in the penalized likelihood framework in Kim and Sanderson [2008]. However the focus of Kim and Sanderson [2008] was to encourage development of methods that were more flexible and had fewer assumptions, rather than to improve estimation in high-dimensional settings. In particular, their work focused on the standard phylogenetic setting in which there are multiple independent sites. However, in the GESTALT setting, the current available datasets were generated using a single barcode and we cannot tune the penalty parameters using the common approach of cross-validation [Hastie et al., 2009, Arlot and Celisse, 2010]. In addition, Kim and Sanderson [2008] only discussed penalized likelihood in the context of a fixed topology. In our setting the true tree is unknown and we must consider various tree topologies.

We found that a major hurdle in applying this penalized likelihood framework is that some topologies will naturally have larger penalties. Therefore we cannot simply choose the tree with the highest penalized log likelihood. Our solution is to perform a hill-climbing iterative search and score topology updates based on the penalized log likelihood where the penalty is restricted to the shared subtree. We found that our method tends to choose topology updates that improve the tree estimate (see Figure 17b).

Alternatively, we could have applied regularization methods tailored for the setting where the topology is unknown. These methods typically regularize the tree towards a prespecified tree [Wu et al., 2013, Dinh et al., 2018]. However, we would like to place minimal assumptions on the developmental process, and we have little to no knowledge of the true tree.

### Tuning penalty parameters

By varying the values of the penalty parameters *κ*_{1} and *κ*_{2}, we can control the trade-off between minimizing the penalty and maximizing the log likelihood. Choosing appropriate values is crucial for estimation accuracy. A common approach for tuning penalty parameters is cross-validation [Arlot and Celisse, 2010]; we use this procedure whenever possible. Note that we keep the tree topology fixed when tuning the penalty parameters.

We can perform cross-validation when there are multiple barcodes. First we partition the barcodes into training and validation sets *T* and *V*, respectively. Next, for each *κ*, we fit the tree and mutation parameters using only the training data. We choose the *κ* with the highest validation log likelihood
For our simulation studies with two and four barcodes, we used half of the barcodes for the validation set and half for training.
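In code, the multi-barcode cross-validation loop might look like the following sketch, where `fit` and `validation_loglik` are hypothetical callables standing in for the actual GAPML fitting and likelihood routines:

```python
def select_penalty_param(barcodes, candidate_kappas, fit, validation_loglik):
    """Split the barcodes in half into training and validation sets, fit the
    model on the training barcodes for each candidate penalty parameter, and
    return the kappa with the highest validation log likelihood.
    `fit` and `validation_loglik` are hypothetical stand-ins."""
    half = len(barcodes) // 2
    train, valid = barcodes[:half], barcodes[half:]
    best_kappa, best_score = None, float("-inf")
    for kappa in candidate_kappas:
        params = fit(train, kappa)              # fitted tree + mutation parameters
        score = validation_loglik(valid, params)
        if score > best_score:
            best_kappa, best_score = kappa, score
    return best_kappa
```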

Unfortunately, cross-validation cannot be used when there is a single barcode, since we cannot split the dataset by barcodes. Instead we propose a variant of cross-validation, described in Algorithm 2. The main differences are that we partition the leaves, rather than the barcodes, into training and validation sets *s* and *s*^{c}, respectively; and we select the penalty parameter that maximizes the conditional probability of the observed alleles at *s*^{c} given the observed alleles at *s*.

To partition the leaves, we randomly select a subset of leaf children of each multifurcating node to put in the validation set *s*^{c}. We partition leaves in this manner, rather than simply dividing the leaves in half, because we must be able to evaluate (or closely approximate) (27) at the end of Algorithm 2 using the fitted branch length and mutation parameters. That is, we must be able to regraft the leaves in the set *s*^{c} onto the fitted tree. Regrafting is easy for the leaves in our specially-constructed set: the parent node of each leaf in *s*^{c} must be located somewhere along the caterpillar spine corresponding to its original multifurcating parent. In our implementation, we chose to regraft the leaves to the midpoints of their corresponding caterpillar spines. The regrafting procedure is illustrated in Figure 14. Note that we do not tune the branch lengths of these validation leaves, since doing so amounts to peeking at the validation data. In our simulations, we found that tuning these branch lengths to maximize the unpenalized (or penalized) log likelihood almost always selects the smallest penalty parameter, since it prioritizes maximizing the likelihood and, therefore, (27).

To assess each candidate penalty parameter *κ*, we compare the conditional probability of the observed alleles at *s*^{c} given the observed alleles at *s*. Our motivation is similar to that of cross-validation: if the alleles are observed from the tree with branch and mutation parameters ℓ* and θ*, we know that
by Jensen’s inequality. (Note that this conditional probability is high only if we have good estimates of both the mutation parameters and the branch lengths of the leaves *s*^{c}; it is not sufficient to have an accurate estimate of the mutation parameters alone.) Recall that cross-validation is motivated by a similar inequality but uses Pr(*X*; *ℓ*, *θ*) rather than a conditional probability.

From a theoretical standpoint, using (26) to select penalty parameters makes the most sense if we have an unbiased estimate of the expected conditional probability. Unfortunately, in our setting, the conditional probability in (27) is a biased estimate, since the fitted parameters depend on the observed alleles at the leaves *s*. Nonetheless, in simulations (where the truth is known), this biased estimate worked well: the selected penalty parameter was typically close to the best one.

**Algorithm 2** Cross-validation for a single barcode

Initialize *s* to be all the leaves in tree 𝕋. Throughout, let *s*^{c} denote all the leaves in 𝕋 not in *s*.

**for** each multifurcating node *N* where at least one child is a leaf **do**

Let *C*_{1}, …, *C*_{m} be the children of *N* that are leaves. Randomly select *m*′ ≥ 1 of them and remove these from *s*.

**end for**

Let 𝕋_{s} be the subtree over the leaves *s*.

**for** each candidate penalty parameter *κ* **do**

Maximize the penalized log likelihood of the tree 𝕋_{s} with respect to its branch lengths *𝓁* and mutation parameters *θ*.

**end for**

Return the penalty parameter that maximizes the conditional probability:
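The leaf-partition step of Algorithm 2 can be sketched as follows, with multifurcating nodes hypothetically represented as dicts carrying a `leaf_children` list (not the paper's actual data structures):

```python
import random

def partition_leaves(multifurcating_nodes, seed=0):
    """Sketch of the leaf-partition step of Algorithm 2: for each
    multifurcating node, move a random nonempty subset (m' >= 1) of its
    leaf children into the validation set s^c."""
    rng = random.Random(seed)
    validation = []
    for node in multifurcating_nodes:
        leaves = node["leaf_children"]
        if not leaves:
            continue
        m_prime = rng.randint(1, len(leaves))     # m' >= 1 leaves removed from s
        validation.extend(rng.sample(leaves, m_prime))
    return set(validation)
```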

Finally, we must simultaneously tune the penalty parameter and the topology of the tree (from Algorithm 1). Our full algorithm alternates between tuning the penalty parameters for a fixed tree topology and running a single iteration of Algorithm 1 for a fixed penalty parameter. After the penalty parameters are stable, we keep them fixed and only run Algorithm 1.
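This alternation can be sketched as follows, with `tune_kappa` and `topology_step` as hypothetical stand-ins for the penalty-tuning procedure and a single iteration of Algorithm 1 (for simplicity, the sketch stops alternating once κ is stable, whereas the full method would continue with topology-only updates):

```python
def fit_gapml(data, candidate_kappas, tune_kappa, topology_step,
              max_alternations=10):
    """Sketch of the full alternation: tune the penalty parameter for the
    current topology, then take one topology-update step for that penalty
    parameter; stop once the penalty parameter stops changing."""
    topology = None
    kappa = candidate_kappas[0]
    for _ in range(max_alternations):
        new_kappa = tune_kappa(data, topology, candidate_kappas)
        topology = topology_step(data, topology, new_kappa)
        if new_kappa == kappa:        # penalty parameter is stable
            break
        kappa = new_kappa
    return topology, kappa
```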

### Specific model implementation

Here we discuss the specific implementation we use to model the GESTALT data. We suppose the mutation process is homogeneous and does not depend on *t*; we therefore drop the time index *t*. Recall that the mutation model has two major components: the rates at which target tracts are introduced, and the conditional probability of an indel tract given the selected target tract.

To parameterize the rate at which a target tract is introduced, we further decompose the rate into a rate *h*_{0} that represents the rate at which the targets *j*_{0} and *j*_{1} are cut and various scaling factors that control how often deletions are short or long (recall the definition in (3) and (4)):
where *γ*_{0} and *γ*_{1} parameterize how often long deletions occur to the left and right, respectively.

We specify *h*_{0} using the assumption that the cutting time for target *j* follows an exponential distribution with rate *λ*_{j} > 0. For focal target cuts where *j*_{0} = *j*_{1}, we define
For double cuts at targets *j*_{0} and *j*_{1}, we suppose the cut time follows an exponential distribution with rate *ω* · (*λ*_{j_{0}} + *λ*_{j_{1}}), where *ω* is an additional model parameter that we estimate and that does not depend on the targets.

Our parameterization of the double-cut rate is based on the assumption that an inter-target deletion is introduced when the cuts at both targets occur within a small time window of length *ε*. For random cut times *X*_{j_{0}} and *X*_{j_{1}} for targets *j*_{0} and *j*_{1}, we approximate the event that the cut times occur within a small window *ε* with the distribution
The values on the left- and right-hand sides approach each other as *ε* → 0. The first component on the right-hand side of (28) approaches zero as *ε* → 0 and does not vary much across values of *λ*_{j_{0}}, *λ*_{j_{1}} if *ε* is sufficiently small. Hence we use the same value of *ω* for all targets. The second component on the right-hand side of (28) corresponds to an exponential distribution with rate *λ*_{j_{0}} + *λ*_{j_{1}}.

We can interpret *ω* in two ways. First, *ω* controls how often a double cut is introduced: in an unmodified barcode, the relative rate at which a double cut versus a single cut is introduced is *ω*Σ_{j_{1}<j_{2}}(*λ*_{j_{1}} + *λ*_{j_{2}}) versus Σ_{j} *λ*_{j}. The second interpretation, based on (28), is that *ω* serves as a proxy for *ε*: larger *ω* indicates that an inter-target deletion can be introduced by two cuts spaced farther apart in time.
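Under this parameterization, the double-cut rates and the double-versus-single comparison are a few lines of code; here `lam` is a hypothetical list holding the per-target rates λ_j:

```python
from itertools import combinations

def double_cut_rate(lam, j0, j1, omega):
    """Rate at which targets j0 < j1 are cut together, following the
    parameterization above: omega * (lambda_{j0} + lambda_{j1})."""
    return omega * (lam[j0] + lam[j1])

def relative_double_vs_single(lam, omega):
    """Total double-cut rate versus total single-cut rate for an
    unmodified barcode (illustrative sketch)."""
    double = sum(double_cut_rate(lam, j0, j1, omega)
                 for j0, j1 in combinations(range(len(lam)), 2))
    single = sum(lam)
    return double, single
```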

The second major component of the GESTALT mutation model specifies the conditional probability of introducing a particular indel tract given the selected target tract *τ*. An indel tract can be represented by its deletion lengths to the left and right and its insertion sequence. We suppose that the probability of an insertion sequence is uniform over all possible nucleotide sequences of that length. Let *X*_{0}, *X*_{1}, *X*_{2} be the random variables parameterizing the lengths of the left deletion, right deletion, and insertion, respectively. Let *x*_{τ,min,i} and *x*_{τ,max,i} for *i* = 0 and 1 specify the minimum and maximum deletion lengths to the left and right, respectively, for target tract *τ*. (For example, if the left deletion must be short, then *x*_{τ,min,0} = 0 and *x*_{τ,max,0} is the longest deletion that does not deactivate target *j*_{0} − 1. As another example, if the left deletion is long, then *x*_{τ,min,0} is the minimum deletion length needed to deactivate target *j*_{0} − 1 and *x*_{τ,max,0} is the longest length that does not deactivate target *j*_{0} − 2.) For insertions, *x*_{τ,min,2} = 0 and *x*_{τ,max,2} = ∞ regardless of the target tract.

We parameterize the conditional probability of indel *d* with left deletion, right deletion, and insertion lengths *x*_{0}, *x*_{1}, *x*_{2} given target tract *τ* as
where
where *p*_{0,boost} + *p*_{1,boost} + *p*_{2,boost} = 1. The probabilities *p*_{i,boost} ensure that we can never introduce an indel tract that deletes and inserts nothing. When the minimum insertion and deletion lengths are zero (in the case of focal target tracts), we use the probabilities *p*_{i,boost} to randomly pick whether to boost the minimum left deletion, right deletion, or insertion length by one.
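A sketch of this boosting rule (function and argument names are our own, not the paper's):

```python
import random

def boost_lengths(left, right, insert, p_boost, rng=random):
    """If the left deletion, right deletion, and insertion lengths are all
    zero (possible only for focal target tracts), pick one of them with
    probabilities p_boost = (p0, p1, p2) and increase it by one, so an
    indel tract never deletes and inserts nothing. Illustrative sketch."""
    if (left, right, insert) != (0, 0, 0):
        return left, right, insert
    i = rng.choices([0, 1, 2], weights=p_boost)[0]
    lengths = [left, right, insert]
    lengths[i] += 1
    return tuple(lengths)
```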

We assume that the deletion lengths follow a zero-inflated, truncated negative binomial distribution and that the insertion lengths follow a zero-inflated negative binomial distribution. Let NB(*m*, *q*) denote the negative binomial distribution, i.e. the distribution of the number of successes until *m* failures are observed, where *q* is the probability of success. The zero-inflation factor for deletion lengths is *p*_{i,0} for focal indel tracts and *p*_{i,1} for inter-target indel tracts, where left and right are indicated by *i* = 0 and *i* = 1, respectively. The zero-inflation factor for insertion lengths is *p*_{2,0} = *p*_{2,1}. Then for *i* = 0, 1, 2, we define
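The displayed definitions are not reproduced above. The sketch below shows one way to write a zero-inflated, truncated NB(*m*, *q*) PMF consistent with the description; the exact truncation and inflation points in the paper depend on the target tract, so treat this as illustrative:

```python
from math import comb

def nb_pmf(k, m, q):
    """PMF of NB(m, q): number of successes before m failures are
    observed, with success probability q."""
    return comb(k + m - 1, k) * (q ** k) * ((1 - q) ** m)

def zi_trunc_nb_pmf(k, m, q, p_zero, k_max):
    """Zero-inflated, truncated negative binomial (illustrative sketch):
    mass p_zero is placed at zero; the remaining mass follows NB(m, q)
    truncated to {0, ..., k_max} and renormalized."""
    norm = sum(nb_pmf(j, m, q) for j in range(k_max + 1))
    pmf = (1 - p_zero) * nb_pmf(k, m, q) / norm
    if k == 0:
        pmf += p_zero
    return pmf
```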

### Implementation

The code is implemented in Python using TensorFlow. We maximize the penalized log likelihood using Adam [Kingma and Ba, 2014].

### Comparison Methods

We use PHYLIP version 3.697 [Felsenstein, 1995], the neighbor-joining algorithm in Bio.Phylo (Biopython version 1.72) [Talevich et al., 2012], and the chronos function in R package ape version 5.2 [Paradis and Schliep, 2018].

### Evaluation metrics

Given ultrametric trees 1 and 2 with the same set of leaves, the internal node height correlation between the two trees is calculated as follows (Figure 15):

1. For each internal node in tree 1, find the matching node in tree 2 that is the most recent common ancestor of the same set of leaves.

2. Calculate the Pearson correlation of the heights of the matched nodes.

3. Repeat the first two steps with trees 1 and 2 swapped.

4. Average the two correlation values.

A correlation of 1 means that the two trees are identical; the smaller the correlation, the less similar the trees are.
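The steps above can be sketched as follows, hypothetically representing each ultrametric tree as a map from clades (frozensets of leaf names below an internal node) to internal node heights:

```python
def height_correlation(tree1, tree2):
    """Internal node height correlation between two ultrametric trees on
    the same leaves. Each tree is a dict mapping frozenset-of-leaves
    (the clade below an internal node) to that node's height (sketch)."""
    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    def one_direction(a, b):
        # The matching node in `b` is the MRCA of the same leaf set:
        # the smallest clade in `b` containing the clade from `a`.
        xs, ys = [], []
        for clade, height in a.items():
            mrca = min((c for c in b if clade <= c), key=len)
            xs.append(height)
            ys.append(b[mrca])
        return pearson(xs, ys)

    return 0.5 * (one_direction(tree1, tree2) + one_direction(tree2, tree1))
```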

### Simulation setup and additional results

For the results in Figure 4b, the data was simulated with 5 synchronous cell division cycles followed by a birth-death process in which the birth rate decayed at rate exp(−18*t*). The barcode was composed of six targets with *λ* = 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6. The weight *ω* was set to 0.06 so that 20% of the unique observed indel tracts were due to double cuts. We sampled 8% of the leaves so that the average number of unique observed alleles was around 100. We refer to this simulation setup as Simulation A; we ran 20 replicates.

The results in Figure 4c are from a larger simulation, which we refer to as Simulation B, that is closer to the data collected in McKenna et al. [2016]. Since zebrafish embryos undergo around 11 synchronous cell division cycles, this larger simulation entailed 9 synchronous cell division cycles followed by a birth-death process. We simulated a barcode composed of ten targets. The resulting tree had on average around 200 leaves. We ran GAPML for 8 topology tuning iterations; at each iteration, we considered at most 15 SPR moves. The displayed results are from 20 replicates.

For this larger simulation, we also compared the runtimes of the methods on a server with an Intel Xeon 2×8 core processor at 3.20GHz and 256 GB RAM. Obtaining tree topologies from C-S parsimony and neighbor-joining runs on the order of minutes. Branch length estimation using chronos runs on the order of seconds. In contrast, GAPML required up to three hours. Though the runtime of our method is much longer, it is still reasonable compared to the amount of time spent on data collection, which includes waiting until the organism is a certain age.

Using our simulation engine, we compared two simple barcode design ideas: a single barcode with many targets, as recommended in Salvador-Martínez et al. [2018], or many identical barcodes. We believe the latter is more effective, since spreading the targets over separate barcodes tends to create more unique alleles. In particular, the inter-target deletions tend to be shorter, which means fewer existing mutations are deleted and fewer targets are deactivated. To test this idea, we compared the two design options in a simulation setup where we iteratively increased the number of targets by six, i.e. either adding six targets to the existing barcode or adding a new barcode with six targets. Here we observe all 1024 leaves of a full binary tree with 10 levels. All targets had the same single-cut rate. We calibrated the double-cut weight *ω* to be around 18% for both barcode designs; this slightly favors the single-barcode design, since it would have a higher rate of double cuts *in vivo* compared to a multiple-barcode design. Nevertheless, we find in our simulations that splitting the targets over separate barcodes tends to result in a much larger number of unique alleles.

Next, to better understand our algorithm GAPML, we show in-depth simulation results from a single replicate (Figure 17). Here we use the settings from Simulation B. Starting from the initial tree topology, the algorithm tunes the branch lengths and mutation parameters to maximize the penalized likelihood. During the gradient descent algorithm, the BHV distance of the tree estimate decreases (Figure 17a). In addition, we see that the BHV distance of the tree estimate decreases as Algorithm 1 iteratively performs SPR moves to update the tree topology.

Our method searches over the maximally parsimonious trees since they tend to have the highest penalized log likelihood. To justify this restricted search, we compared the penalized log likelihoods of tree topology candidates with different parsimony scores, where the data was generated using Simulation A. To generate tree topologies with different parsimony scores, we started with the maximally parsimonious tree fit by Camin–Sokal parsimony and iteratively applied random SPR moves. For each tree rearrangement, we fit a model by maximizing the penalized log likelihood, with the same penalty parameter across all rearrangements. As seen in Figure 18, the most parsimonious trees have the highest penalized log likelihoods. Since our method aims to select a tree topology that maximizes the penalized log likelihood, it would not benefit from considering SPR moves that make the tree less parsimonious; considering these additional moves would only make the method slower.

### Zebrafish data analysis

For the zebrafish analyses, we estimated the tree over at most 400 randomly selected alleles (without replacement). 50% of the fish in this dataset had fewer than 400 alleles and the median number of unique alleles over the zebrafish datasets was 443. 25% of the fish in this dataset had more than 1000 alleles. We limit the number of alleles analyzed due to runtime restrictions.

To test whether the fitted trees recover similar developmental relationships across fish, rather than random noise, we ran a permutation test comparing the correlation between tissue distances from the estimated trees to that from random trees built on randomly shuffled data. More specifically, for a given tree topology, we randomly permute the observed alleles at the leaves. Each allele is associated with the number of times it is observed in each tissue type; we randomly shuffle these abundances over the possible tissue types within each allele. Finally, we randomly assign branch lengths along the tree by drawing samples from a uniform distribution and using the *t*-parameterization of Gavryushkin and Drummond [2016]. The correlation between tissue distances in these random trees is close to zero. All permutation tests were performed using 2000 replicates.
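A minimal sketch of such a permutation test, where `stat_fn` and the per-allele abundance layout are hypothetical stand-ins for the actual tissue-distance correlation computation:

```python
import random

def permutation_pvalue(observed_stat, stat_fn, data, n_perm=2000, seed=0):
    """Generic permutation test as described above: shuffle the tissue-type
    abundances within each allele, recompute the test statistic, and report
    the fraction of permuted statistics at least as extreme as the observed
    one (with the standard +1 correction)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        shuffled = []
        for abundances in data:       # one abundance vector per allele
            perm = abundances[:]
            rng.shuffle(perm)         # shuffle over the possible tissue types
            shuffled.append(perm)
        if stat_fn(shuffled) >= observed_stat:
            count += 1
    return (count + 1) / (n_perm + 1)
```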

We also tested if the Pearson correlation between the number of tissue types/cell types and the internal node times is different from that of random trees. The random trees were generated using the same procedure as above.

We conclude by noting that the random trees are generated using the estimated tree topology from each method. Thus the null distributions are different and the p-values are not directly comparable between methods. Though this slightly complicates interpretation, we prefer this approach since the estimated tree topology may naturally induce correlation between tissue distances. For most validation tests, the mean of the null distribution was similar across the different methods, so the p-values are somewhat comparable. The major exception was the tests checking recovery of cell-type and germ-layer restriction: here the means of the null distributions were very different, and we abstain from comparing p-values across methods.

For Figure 6a, we bootstrapped fish replicates to estimate confidence intervals for the average correlation between estimated target cut rates.