DeST-OT: Alignment of Spatiotemporal Transcriptomics Data

Spatially resolved transcriptomics (SRT) measures mRNA transcripts at thousands of locations within a tissue slice, revealing spatial variations in gene expression and distribution of cell types. In recent studies, SRT has been applied to tissue slices from multiple timepoints during the development of an organism. Alignment of this spatiotemporal transcriptomics data can provide insights into the gene expression programs governing the growth and differentiation of cells over space and time. We introduce DeST-OT (Developmental SpatioTemporal Optimal Transport), a method to align SRT slices from pairs of developmental timepoints using the framework of optimal transport (OT). DeST-OT uses semi-relaxed optimal transport to precisely model cellular growth, death, and differentiation processes that are not well-modeled by existing alignment methods. We demonstrate the advantage of DeST-OT on simulated slices. We further introduce two metrics to quantify the plausibility of a spatiotemporal alignment: a growth distortion metric which quantifies the discrepancy between the inferred and the true cell type growth rates, and a migration metric which quantifies the distance traveled between ancestor and descendant cells. DeST-OT outperforms existing methods on these metrics in the alignment of spatiotemporal transcriptomics data from the development of axolotl brain.

In vanilla optimal transport, and in the original formulation of PASTE [45], one has two constraints on alignment matrix Π, given by probability measures g 1 ∈ R n1 , g 2 ∈ R n2 : the row-sum of Π must be g 1 , and the column-sum of Π must be g 2 .Measures g 1 and g 2 are supported on indices i ∈ {1, . . ., n 1 } and j ′ ∈ {1, . . ., n 2 }, enumerating the spots of two slices of spatial transcriptomics data, S 1 = (X (1) , S (1) ) and S 1 = (X (2) , S (2) ).The rows x i ≡ X (1) i,• of X (1) and x ′ j ′ ≡ X (2) j ′ ,• of X (2) are indexed by spots, and each row is the expression vector at the given spot.Likewise, the rows s i ≡ S (1) i,• of S (1) and s ′ j ′ ≡ S (2) j ′ ,• of S (2) are the spatial coordinates of each spot.We use the convention that a prime on an index, e.g.j ′ , refers to a spot in the second slice, while the absence of a prime on an index denotes a spot in the first slice.

S1.1 Context
In PASTE, a convex combination of standard inter-slice expression cost and a Gromov-Wasserstein cost are used (this combination referred to as Fused Gromov-Wasserstein (FGW)).The corresponding optimization problem is: Above, C ∈ R n1×n2 is the inter-slice (gene expression) feature-distance matrix C ij ′ ≡ C(x i , x ′ j ′ ) whose ij ′ -th entry is the distance in expression space between the expression vector x i at spot i in the first slice and the expression vector x ′ j ′ at spot j ′ in the second slice.Matrices D (1) and D (2) in ( 12) are intra-slice (spatial) distance matrices.Define the 4-tensor L ≡ L(D (1) , D (2) ) ∈ R n1×n2×n1×n2 entry-wise via We let ⊗ denote the tensor-matrix multiplication operator, such that for the alignment matrix Π ∈ R n1×n2 ≥0 , L ⊗ Π denotes the n 1 × n 2 matrix whose ij ′ -th entry is k, l ′ L i j ′ k l ′ Π kl ′ .Let ⟨•, •⟩ F denote the Frobenius inner product of matrices.In this notation, the objective function ( 12) can be reformulated as follows: with L as in (13).The only difference between PASTE2 [24] and the original PASTE is the inclusion of the constraint 1 T n1 Π1 n2 = s, which goes along with a relaxation of the two marginal constraints to inequalities: Note that by considering the dual problem, partial optimal transport, in which there is a constraint s ∈ (0, 1) on the total mass 1 T n1 Π1 n2 transported, can be seen as an instance of unbalanced optimal transport associated to the total variation φ-divergence (see [36] § 4.2).Unbalanced optimal transport refers to the version of (14) obtained by deleting the two hard constraints, Π1 n2 = g 1 and Π T 1 n1 = g 2 , on the marginals of Π, and instead adding to the cost function in brackets a pair of "soft constraints" in the form of two terms: D φ (Π1 n2 |g 1 ) + D φ (Π T 1 n1 |g 2 ).Here, φ : (0, ∞) → [0, +∞] is a so-called entropy function, namely it is convex, positive, lower-semi-continuous, and with φ(1) = 0. Let D φ (•|•) denote the associated φ-divergence, defined for positive measures α, β on some common space X as D φ (α|β) = X φ( dα dβ )dβ, where we have supposed α and β are mutually absolutely continuous, for instance.When φ(p) = p log p − p + 1, one recovers the Kullback-Leibler (KL) divergence, which we denote by D φ (•|•) = KL(•∥•) in this case.

S1.2 DeST-OT objective
Our view of an OT objective function as a Hamiltonian, assigning an energy to a given transport plan Π, motivates the terminology we use for its components below.The first change we address in the formulation of DeST-OT, relative to PASTE, is the constraints on the optimization together with our calibration of mass on the second slice.Specifically, we relax the constraint Π1 n2 = g 1 , while preserving the other.This fixes the total mass transported by Π to be the total mass of the second (usually larger, in the spatiotemporal setting) slice, which is equipped with positive, not necessarily probability, measure g 2 = 1 n1 1 n2 .This allows the Π-marginal over the first slice, namely Π1 n2 , to be non-uniform, while fixing its total mass to the value n2 n1 .One is naturally lead to introduce a mass-flux term, as this object becomes non-zero as soon as we relax the first marginal constraint.In section S1.4,we discuss the value of ξ.The non-uniformity in Π1 n2 reflects that not all spots in the first slice are necessarily contributing the same amount to spots in the second slice -different cell types may have different growth rates.
The second change we address concerns the matrices used by the Gromov-Wasserstein (GW) term E GW (Π) = ⟨L⊗Π, Π⟩.This term favors Π matrices which are nearly isometries between the two slices, as whenever D (1) ik = D (2) j ′ l ′ , the corresponding summand of the GW term E GW (Π) vanishes.For a more general pair of intra-slice cost matrices, (symmetric, non-negative matrices) M (1) ∈ R n1×n1 and M (2) ∈ R n2×n2 , we generalize (13), defining the 4-tensor and we define the analogous GW energy (using M (i) in place of the D (i) ).We also refer to it as a quartet energy, as each summand defining it corresponds to two pairs points: We define the matrices M (i) from a pair of expression distance matrices for the first and second slice, C (1) ∈ R n1×n1 , C (2) ∈ R n2×n2 .These intra-slice feature matrices C (i) are distinct from their inter-slice counterpart C ∈ R n1×n2 + which contains the distance in (expression) feature space from spots in slice 1 and slice 2, and is therefore not a symmetric, square matrix in general.We specifically choose M (i) to be the Hadamard product of matrices C (i)  and D (i) , so that entry-wise, one defines M (1) ik and M (2) (2) j ′ l ′ .We refer to the M (i) as merged feature-spatial matrices.
To E M GW , we add a symmetric pair of terms E M triplet (1) and E M triplet (2) of the form ) 2 , i.e. entry-wise squaring of the merged feature-spatial matrix, which is why the superscript indicates the same dependence on the M-matrices as the GW term.Note that E M triplet (1) is identical to the terms in the sum defining E M GW which have i = k, which is why we name it for the first slice instead using the index of the merged feature-spatial matrix involved in the definition.This energy E M triplet (1) favors Π which predict a localized (in space, and in gene expression space) set of possible ancestors for each spot in the second slice.
Likewise, when is identical to the terms in the sum defining E M GW for which j ′ = ℓ ′ , and so is named for the second slice.Transport plans Π with low E M triplet (2) -energy are more likely to predict that descendants of a given spot in the first slice are close (spatially, and in feature-distance) in the second slice.Together, E M triplet (1) and E M triplet (2) encourage spot-wise continuity in Π, without penalizing Π for spatial distortions intrinsic to tissue growth, or for change in expression characteristic of development.We refer to the sum of these terms as the triplet energy between triplets of points, E triplet : Combining these changes, and adding in a term of entropy regularization, the formulation1 used by DeST-OT is: with L M defined in ( 16), and the notation L M ⊗ Π as defined just below (13).The choices of V (1) = ((M (1) ) 2 and V (2) = ((M (2) ) 2 as above enable us to write the sum of our triplet and quartet energies in (20) as where δ ik is 1 if and only if i = k and is 0 otherwise.The 4-tensor L M is defined implicitly by the above display, and the objective function used by DeST-OT can now expressed concisely as follows, though ( 20) is the expression we use in practice for computing gradients: As indicated by the underbrace in ( 22), we define the merged feature-spatial energy, denoted E M (Π) to be the sum of E M quartet (Π) and E M triplet (Π), to have a more descriptive way of referring to the left-hand side of (21).

S1.3 Connection to energy-based models
As such, we augment the optimal-transport formulation by framing it as a product-of-experts, constructing a total-energy which we optimize with respect to Π that carries the standard Wasserstein (doublet) energy ⟨C, Π⟩ F , the quartet energy ⟨L M ⊗ Π, Π⟩ F defined through the merged feature-spatial matrices, and two triplet energies which are also defined from these matrices: Tr ΠM (2) Π T and Tr Π T M (1) Π .Denoting the optimal-transport entropic-term density P H , we minimize the following product-of-experts Gibbs-Boltzmann distribution with respect to Π: Which offers the interpretation that taking the sum of the energy terms in our optimal-transport problem is equivalent to optimizing a product of the Boltzmann-distributions over our variable Π, where this product represents a product-ofexperts in which each "expert" distribution refines the matrix Π to be more biologically and spatially realistic with respect to the transcriptomic and spatial geometry of the first and second slices.Alternatively viewing each term as representing pairwise motifs (Wasserstein, or doublet term), ternary motifs (triplet term), and quaternary motifs (GW, or quartet term), these experts capture features ranging from lower to higher order.
S1.4 On the mass-flux term, semi-relaxed OT The semi-relaxed approach has been explicitly discussed in [41,12,6,4,32,7].Blondel et al. [4] formulate a version of our problem in their Definition 3 ("semi-relaxed smooth primal"), fixing their second marginal as we do, but without using entropic regularization in order to produce sparse transport plans.These authors point out that the mixed relaxed distance introduced by Benamou in [2] is a version of semi-relaxed OT, fixing the first marginal instead of a second, and using an ℓ 2 -penalty in place of a KL penalty.Interestingly, this penalty is applied directly to the analogue of ξ in this setting.A recent paper of Dong et al. [12] uses semi-relaxed transport for domain adaptation, aligning imaging data of different modalities.These authors use a Wasserstein loss, fixing the first marginal and in addition constraining the total mass transported.apply the semi-relaxed framework to graph matching, fixing the first marginal, and using a GW loss function.For matching two graphs, they find that fixing the first marginal while relaxing the second better preserves the structure of the first graph under the transport plan, versus the balanced regime.
These authors [41] note that semi-relaxed OT (in the absence of entropy regularization) produces sparser transport plans than vanilla optimal transport.This sparsity also motivated the earlier work of [32] on color transfer; maps that are "too" multi-valued (i.e.not sparse enough) can inappropriately mix colors being transferred and create spatial irregularities.These authors also fix the first marginal, using a sparse transport plan to assign colors from the second image to the pixel of the first, while preserving its geometry.On the theoretical side, Chizat et al. [7] introduce the notion of a semi-coupling to connect the (PDE) dynamic and static pictures of semi-relaxed OT.Lastly, we mention two studies of Frank-Wolfe [13] and Sinkhorn [14] in the semi-relaxed setting.We now discuss some properties of the mass-flux term ξ built from Π and advantages our formulation give us.Let us start by examining the objects analogous to ξ in the fully-unbalanced case.When both marginal constraints are relaxed, both of the vectors ξ 1 = Π1 n2 − g 1 and ξ 2 = Π T 1 n1 − g 2 may be non-zero.Towards relating vectors ξ 1 and ξ 2 , we take the inner product of these expressions with 1 ni to obtain: We solve each of these for the total mass transported, namely 1 T n1 Π1 n2 ≡ 1 T n2 Π T 1 n1 .After relating the two expressions through the total mass transported and re-arranging terms, one has with the last line following from our choice of normalization on the uniform measures g 1 = 1 n1 1 n1 and g 2 = 1 n1 1 n2 .In the semi-relaxed context of DeST-OT, ξ ≡ ξ 1 , and vector ξ 2 is zero, in which case, one has Thus, the semi-relaxed formulation allows us to describe all growth and death implicit in transport plan Π by a single vector ξ, instead of a pair of vectors (ξ 1 , ξ 2 ) of different dimensionality.Moreover, the hard constraint of the semi-relaxed formulation makes this single vector ξ interpretable in a way that the pair (ξ 1 , ξ 2 ) is not.Observe that the second marginal g 2 is fixed, with units assigning mass 1 n1 to each spot.Though the first marginal of Π is no longer fixed, ξ is by definition an additive perturbation of g 1 which uses the same units as g 2 , also assigning mass 1 n1 to each spot.ξ also uses these units, giving expression g 1 + ξ physical meaning.
Rescaled vector q 1 = n 1 (Π1 n2 ) has in its i-th element the amount of mass transported from spot i, in units of numbers of spots.These numbers are fractional, as Π is implicitly estimating these values in a sense averaged over t 2 − t 1 , with respect to a stochastic process similar to the one described in [6].Because the rescaled target measure n 1 g 2 shares these counting-measure units, we can reasonably interpret q 1 ≡ q 1 (Π) as the vector of the expected number of descendants of each spot in the first slice, as predicted by Π.This is clear, as we have: Where multiplying by n 1 scales q 1 to have the units of counting measure over spots.We lose this interpretation in the fully-unbalanced regime: when the target measure is not constrained to be uniform, the amount of mass transported from spot i may not reflect an expected number of descendants, as this value will depend on the mass of each descendant.
From the definitions of ξ and q 1 , we have and n 1 ξ has the following intuitive interpretation: at spot i,

S2 Growth rate conversion
The conversion of the growth vector (mass-flux) ξ to a growth rate can be done simply.We use the growth process described in [22], where one considers the PDE modeling a transcriptomic trajectory as a density evolving in time: With −∇ • (ρF) a continuity equation drift term, σ 2 2 ∆ρ a diffusion term, and Jρ a term representing growth and branching, which are all modeled using successive optimal transport steps.We convert our mass-flux term from the optimal transport to a growth using the principle of [22], where operator-splitting the branching component above from the drift-diffusion gives us a continuous-time growth update as: Thus we simply need to consider the solution to the ordinary differential equation ρ = Jρ for J between times t 1 , t 2 : For ρ(t 1 ) and ρ(t 2 ) a prior and posterior density over the spots in the first slice.By the semi-relaxed optimal transport formulation, we simply compare against a simple uniform prior density ρ(t 1 ) = 1 n1 which we start with.Thus one concludes the connection to a growth rate, on a per-spot basis in the first slice, is given as: Which establishes a simple monotonic connection between our growth vector ξ and a proper growth rate in time.In the notation of [22], τ = t 2 − t 1 , and one has J = τ −1 (p b − p d ), establishing a connection between the growth rate and birth-death parameters p b , p d , and τ .Conventionally, cell division and death are modeled as follows: each cell has an exponential clock of rate τ −1 .When a cell's clock rings, it dies with probability p d , or splits into two cells with probability p b = 1 − p d .As such, our growth rates have a natural connection to the parameters of this birth-death process.

S3 Connection to the continuity equation and interpretation of growth vector
Suppose we denote x(t k ) = x ∼ µ, x(t k+1 ) = y ∼ ν as samples of our slices S 1 and S 2 , and consider the pairwisealignment problem in the equivalent Monge-formulation (the discrete analogue of µ being g 1 , and ν being g 2 ).The Monge-problem is formulated as: Where it can be shown that an equivalent form exists in which the transport-map can be expressed as the integral of some vector field F(x, t), weighted by a time-dependent density ρ(x, t) and where the optimization is performed over a vector-field of least-norm: Where the vector field is optimized subject to the continuity equation: In the semi-relaxed case, one may either assume mass-conservation or generalize the vanilla continuity equation by assuming the presence of a source or sink term σ (i.e.σ = Jρ for J a growth rate as defined in S2) which allows mass to be introduced into the system in the form As such, ξ represents the integral of a divergence in the case of mass-conservation (i.e.local mass-redistribution), or a spatial divergence added to a source/sink term in the case where the total mass is increased or decreased in the system.To be precise, we have that in the balanced case: And general semi-relaxed case:

S3.1 Interpretation of ξ
Supposing we call the spatial volume encompassed by our tissue slice at time t, V t .Each element of ξ, ξ k , represents the approximate value of ∇ Generally speaking the volume V t is assumed to be connected such that, for N t dots at time t, the volume is We assume a cell type assignment is some function from a matrix of distance-based dissimilarities and feature-values to a partition P of the data ϕ(D, C) = P.Where P = {P k } K k=1 represents a decomposition of our dots into disjoint sets representing respective cell types.Following from this, we define the volume of a cell type k at time t, V t,k , as Definition 1. Tissue-slice growth.The term for ∆ = 1/2.By the divergence theorem, one identifies the sum of the divergences in the volume V t as the mass-flux across the boundary ∂V t .Thus, the growth of a tissue slice contained within in a volume at time t, V t , is defined as the mass-flux across this boundary integrated from t 1 to t 2 .
For a balanced optimal transport this is identically zero for any boundary as ξ = 0: For a semi-relaxed optimal transport with marginals that sum to the same value, such as the uniform marginals Π T 1 n1 = g 2 = 1 n2 1 n2 and marginal g 1 = 1 n1 1 n1 , one may have local growth or death as ξ is not necessarily the zero-vector, but globally mass is conserved: In the case of a semi-relaxed optimal transport, if we allow for this source or sink term where we generalize the above definition by asserting global mass gain or loss is possible.In particular, we can set the unbalanced marginals to reflect the normalized change in volume (or change in mass) as: Where η ∈ R could be chosen to reflect something known-such as a total change of mass or volume between the two slices.We also note that the interpretation of ξ can also be used to measure local growth of a tissue within some fixed boundary, as described in the corollary below.
Corollary 1 (Cell type growth).Given a partition P of tissue into separate cell types, and defining the vector z p by The term z T p ξ represents the volume integral and represents the mass-flux out of the boundary of cell type p, ∂V t,p , integrated from t 1 to t 2 .

S3.2 Defining an Unbalanced Optimal Transport problem to match an a priori known total mass-flux
We seek a balancing condition which fixes the mass-flux η from our optimal transport 24 to some interpretable and meaningful value.By our previous decomposition of the volume of a bar-coded grid of spots into boxes of equal volume ∆ 2 , or analogously a small hexagonal volume for Visium, and our assumption that the mass-density is constant for each spot, we have that the normalized mass-flux above must equate to: as we assume that the mass-density ρ M (x) = C is constant for any spot on the grid.For ∆ 2 a per-spot volume, we see that the volume-normalization lets this technology-specific factor cancel and the sum of the growth ξ is related to the change-in-spots directly as Which requires that the unbalanced condition have g 2 = 1 n1 1 n2 as the constraint measure.This makes sense, as , which has the interpretation that the density in the first slice sums to 1, and the density of the second is proportional to the ratio of the number of spots of the second slice to the first.With n2 n1 > 1 implying mass is generated as the second volume exceeds the first, n2 n1 < 1 implying mass is lost as the second volume is smaller than the first, and n2 n1 = 1 implying mass-conservation.

S4 Semi-relaxed Sinkhorn
Calculating the gradient with respect to Π In order to run Sinkhorn, we require an initial condition on Π and an analytical expression for the gradient of our primal objective with respect to Π.In particular, the following represents a primal feasible Π: Recall that the energy function E : Π → E(Π) of interest in our current formulation is We calculate the gradient of f as follows.The gradient of the expression term E expression (Π) is straightforward to compute: Next, we consider the gradient of the distance-regularization term E row corresponding to the rows of Π: F So, we have that: The gradient of the distance regularization term corresponding to the columns of Π is similarly given as: With, and the log applied element-wise above.So: Lastly, we compute the semi-relaxed gradient for f FGW (Π): ik Ψ (2) Owing to our semi-relaxed marginalization constraints, it holds that k Π kl ′ = 1 n1 and i Π ij ′ = 1 n1 , such that the rightmost term is a constant and disappears when we take the gradient with respect to Π.As such, the above is proportional to: Converting this to matrix form, we have: We consider each term independently.For the rightmost we consider the Fréchet derivative in the direction V: For the leftmost term, we have: We can therefore identify the final gradient of our semi-relaxed GW term as: With the squaring operation applied element-wise through Ψ (1) .

S4.1 Dual formulation of semi-relaxed Sinkhorn
For our optimal-transport problem, we have both a primal objective E(Π) we seek to minimize, as well as a single constraint that Π T 1 n1 = g 2 which restricts the set of primal-feasible Π.The Karush-Kuhn-Tucker (KKT) conditions introduce a set of dual variables for each constraint, and a Lagrangian which lower bounds the primal objective.The conditions, which involve first order constraints, dual constraints, and complementary slackness constraints, offer a set of conditions which must be satisfied by an optimal solution to the primal problem.As such, we introduce the dual-variable µ, and establish the Lagrangian L to maximize as a lower-bound to the primal objective: Collecting the non-entropic terms in the gradient of the above expression, we define the matrix C * to be: The KKT conditions imply the following first-order condition holds when we consider the gradient of our Lagrangian with respect to the variable being optimized, Π: We also have the following primal constraint for our singular equality condition on the second marginal: The dual constraints, which are implicitly accounted for in the entropy term of Sinkhorn which lifts the matrix Π through exponentiation such that Π ≽ 0 always holds, for technical correctness include a dual variable Ω ≽ 0 and a term in the Lagrangian of the form −⟨Ω, Π⟩ F .The two dual constraints would formally be given as: And a complementary slackness condition enforced elementwise through the matrix Π, where one either has the active constraint where Π ij > 0 and Ω ij = 0, or a slack, or inactive, constraint with Ω ij > 0 for the (omitted) component of the Lagrangian given as −⟨Ω, Π⟩ F .Thus the slackness condition would be written as: Looking more closely at the first-order conditions, one finds: Implying a similar solution-form to Sinkhorn involving an exponentiation with the Gibbs kernel, wherein one has K i,j = e −ϵ −1 C * i,j and: From Sinkhorn, we know there exist unique vectors u, v such that Π = diag (u)K diag (v) and see: We consider what the update rule should be, given this identification, by establishing: Noting the Hadamard multiplication, this directly implies an update rule for u in the form: This may analogously be identified from the general form of unbalanced optimal transport, wherein one uses an anisotropic proximal step weighting two divergences on the marginals g 1 .In particular, for a generalized optimal transport problem of the form: for φ 1 , φ 2 convex, positive lower-semi-continuous divergences.For these, one takes a proximal step for each marginal where u, v are updated as: This has the following closed-form updates for unbalanced Sinkhorn where we consider the case that D φ1 , D φ2 are taken to be KL-divergences: While we demonstrated that the semi-relaxed update follows directly from the KKT conditions, one can easily recover the semi-relaxed update in the limit of λ 2 → ∞ in the unbalanced case.The marginal update then becomes l+1) , where one recovers the marginal constraint on g 2 and our semirelaxed update rule.

S4.2 Converting semi-relaxed Sinkhorn to log-domain for stability
Consider a general form of entropy-regularized, fully unbalanced Wasserstein optimal transport: given measures , and given a candidate transport plan Π ∈ R n1×n2 + , let the functional Π → P ϵ (Π) be given by where φ 1 , φ 2 are convex, positive lower-semi-continuous, and with φ i (1) = 0.The terms D φi are the associated φ-divergences.Depending on the choice of the φ i , one can recover either the fully unbalanced constraints of moscot, or the semi-relaxed setting of DeST-OT.Specifically, moscot corresponds to taking φ 1 = φ 2 ≡ p → p log p − p + 1, in which case both φ-divergences are the KL divergence.On the other hand, DeST-OT corresponds to taking φ 1 ≡ p → p log p − p + 1, while the single hard constraint corresponds to choosing φ 2 = ι {1} , where Thus, (32) corresponds to a version of the moscot objective (3) with the GW term ⟨L ⊗ Π, Π⟩ omitted, or a version of the DeST-OT objective (22) without our modified GW term ⟨ L M ⊗ Π, Π⟩.In both cases, the hyperparameters attached to different terms have been suppressed.
Proposition 1 ([36], Proposition 2).One has where for such f ∈ R n1 + , g ∈ R n2 + , we define the dual objective to P ϵ (32) The optimal primal plan Π ⋆ is obtained from the optimal pair (f ⋆ , µ ⋆ ) of dual variables as Thus, connecting the dual variables of the generalized Sinkhorn objective for arbitrary φ-divergences to our primal variables u, v, we are able to express closed-form updates for the dual variables instead.The Sinkhorn iterations for the semi-relaxed case, in terms of the dual variables f , g are given as: Owing to the presence of the exponential of each dual variable on the right-hand side of 33 and 34, these cannot be evaluated directly for small values of the entropy-regularization ϵ.When ϵ → 0 we would recover an exact, non entropically-regularized optimal transport -however, this would cause the term inside of the logarithm to exceed numerical precision.For the purpose of introducing the log-domain Sinkhorn algorithm, we make the following definitions.For 1 ni ∈ R ni + and ϵ > 0, the Softmin operators Smin gi ϵ are defined for any vectors f ∈ R n1 and µ ∈ R n2 by: and Smin Rewriting 34 34 we see: And: One can simply evaluate the softmin Smin 1n 1 ϵ using the log-sum-exp trick, which we broadcast across dimensions for an efficient and stable log-domain update.

S5 Metrics of growth
Let S 1 = (S 1 , X 1 ) and S 2 = (S 2 , X 2 ) be a pair of spatiotemporal tissue slices, corresponding to timepoints t 1 and t 2 .For a collection of cell types shared across both slices, enumerated as {1, . . ., P }, suppose we are given cell type partitions P 1 = {P 1 (p)} P p=1 and P 2 = {P 2 (p)} P p=1 for each slice.The mass of cell type p at time t 1 is m 1 (p) = |P 1 (p)|, and the mass of cell type p at time t 2 is likewise m 2 (p) = |P t (p)|.The mass-flux of cell type p over these two timepoints is then given by m 2 (p) − m 1 (p).Below, we define a metric quantifying how well the cell type mass-flux inferred by an optimal transport method matches the empirical mass-flux of a cell type.If such a method outputs a pair (Π, ξ) of an alignment matrix and a growth vector, we measure the distortion between ξ and the true mass flux.We discuss the assumptions that go into this metric, and how it can be generalized.

S5.1 A metric of growth: individual cell level
To measure the accuracy of the growth rates across cell types and spots within the cell type, we define the true growth rate at time t 1 of cell type p as: In calling this a true growth rate, we are making two assumptions: first, that there are no cell type transitions between distinct cell types -in the next subsection we discuss how to relax this.The second assumption made is that the burden of accomplishing the change in mass is shared equally across cells of the same type, which can be viewed as an entropy-maximizing assumption.
The normalization factor 1 n1 makes these growth-rates consistent with the slice one mass-normalized values of the marginals g 1 , g 2 and ensures consistency across experiments.We quantify the total distortion between the true growth factors for each cell type and those derived from optimal transport as: This assesses the total divergence between the optimal-transport growth rates and the true, empirical growth rates under a fixed cell type labeling.The sum of the growth-rates across spots within a tissue slice equals the total growth rate of the slice:

S5.2 Generalizing the growth-metric to cell type transitions
The definition of growth given in the section above assumes that cells of type p transition and grow to their own cell type, and computes the distortion relative to this assumption.While this is valuable as a baseline metric when no cell type transitions are available, we generalize this notion when we have a density matrix of cell-to-cell reverse-time transitions T ∈ R P ×P ≥0 across P cell types.Supposing we have a vector m 2 ∈ R P ≥0 of cell type masses at time t 2 , and likewise m 1 at time t 1 , we modify m 1 (p) in the definition above to include transitions.The section above implicitly assumes that the matrix is diagonal (in particular, it is the P × P identity matrix T = 1 P ).Under ground truth transition matrix T, the mass of cell type p from time at time t 1 is described in terms of m 2 as follows: Which is to say, m 1 can be described in terms of m 2 via the matrix-multiplication and the mass-flux out of cell type p remains as defined in the previous sections with the only adjustment being that the new m 1 (p)'s account for cell type transitions.As before, we assume the cell type partitions across each slice are consistent with these cell type mass vectors.Now, consider an alignment matrix Π ∈ R n1×n2 , and let {p → q} denote the following set: {(i, j ′ )} , understood as the event that cell type p of the first slice is mapped to cell type q in the second slice.In making this definition, we show how any alignment matrix Π induces a reverse-time transition matrix: indeed, Π is a positive measure on such pairs, so one can naturally form a coarse-grained P × P matrix from Π using the events {p → q}: first, define Whatever T we construct from Π should be a transition matrix, row-normalized such that 1 P = T1 P .This is achieved by tilting each row p of measure Π with the factor n 1 /m 2 (q): Now, consider the distortion metric defined in (36).Instead of viewing J growth as a function of ξ only, as was done in the previous section, we now regard it as a function of both ξ and γ 1 ≡ (γ 1 (p)) P p=1 .Just as ξ arises from Π, vector γ 1 is defined from T through (37) and (35).Given the output (Π, ξ) of an OT method, the next proposition gives the optimal pair (T, γ 1 ) for this output according to the growth distortion metric (36).
Proof.When ξ is fixed, the minimizer of 36, in terms of γ 1 , is given in each entry by the cell type sample mean of ξ: The last step is to use the single hard constraint in our semi-relaxed formulation, along with our choice of mass calibration for the second slice: Π T 1 n1 = 1 n1 1 n2 .Continuing from the above display, one has By definition, all entries of T are necessarily positive.Therefore these row-sum conditions show us this is indeed a reverse-time transition matrix, and the proof is finished.

S5.3 Interpretation of per-cell growth distortion J growth
For the setting of γ 1 (p) in the previous section, rather than matching the cell type mass-fluxes assuming cell types transitioning to themselves, we find the optimal matrix T of cell types transitioning from t 1 to t 2 , calculate the adjusted mass contributed to m 2 from cell type p at t 1 as m 1 (p) = T p,• m 2 , and calculate the expected distortion in the mass-flux.
As noted in 2, the minimizer of J growth over γ 1 is simply the sample mean of the per-cell type mass-flux, given as: Therefore an alternative interpretation to one in which we match the mass-flux of the optimal transition matrix is that we are measuring the sum of the variances of each mass-flux variable within each cell type: In a tissue or cell type, the assumption that growth is relatively homogeneous or smoothly varying, as opposed to coming from a very small, sparse, set of cells, has been observed in morphogen studies where within tissues like the imaginal wing disk in drosophila one observes uniform cell divisions and growth [15] [25] [27].As such, measuring the within-cell type variance of the growth-factors is biologically reasonable and captures whether growth patterns are very sparse (i.e. when aligned via an unbalanced routine which assigns a few "representative" cells as those which grow), or whether growth patterns are relatively consistent.
S6 Procrustes-like problems and a cell migration metric S6.1 Quantifying a metric of naive rigid-body migration Suppose we are given the solution to either the generalized Procrustes' problem for an orthogonal transformation Q ∈ O(n) or the generalized Wahba's problem for a rotation Q ∈ SO(n) along with a general translation vector h ∈ R n as described in S6.3.This then describes a simple rigid-body transformation which relates the coordinate frames of slice 1 and 2, Supposing the true transformation relating the two frames is indeed a rigid-body transformation, one may directly quantify the cost of an alignment with respect to the matrix Π given Q, h.We introduce this as our naive rigid-body migration metric, given as: Where we take the difference between the spatial points s i and s ′ j ′ in our pair of slices (X (1) , S (1) ) and (X (2) , S (2) ).This difference is computed under the posterior implied by Π, and as we compare matrices Π which may be balanced, unbalanced, or semi-relaxed, we normalize by a factor of Π T i,. 1 n2 so that any posterior in 49 is normalized to a distribution consistently.We call this metric naive, as it is clear this transformation has determinant 1, and neither allows for volume expansion (growth) or volume shrinkage (death).Moreover, the true transformation underlying tissue development might involve some arbitrary (e.g.diffeomorphic) transformation, such that the rigid-body assumption is more generally inappropriate.As such, it is not necessarily the case that lower loss under this metric is universally better.However, one can say that if the alignment tends to incur exceptionally high loss under this metric, without any geometric consistency, the alignment may be spatially unrealistic.In other words, it might be aligning the sub-level set of points with similar features )} without accounting for the geometry of either slice.
S6.2 Invariance of the method to rigid-body transformations, or group action by members of SE(2) Let us call each spatial point in the first slice s i ∈ R 2 , and each point in the second slice s ′ j ′ ∈ R 2 .Suppose we want the second slice to be in the same coordinate basis as the first slice, but we do not know Q ∈ SE(2), h ∈ R 2 , representing the rigid-body transformation of the data sj ′ = Q(s ′ j ′ − h) which moves us into the correct, shared coordinate-frame.A reasonable question to ask is whether our objective returns an alignment matrix Π which is unique, independent of the specific Q, h that relate the two coordinate frames.From the general objective in 20, it is clear that our objective only depends on the points in the first and second slices s i , s ′ j ′ through the distance matrices D (1) and D (2) .Considering that s i is in the correct basis, let us restrict our attention to s ′ j ′ and D (2) .It is simple to check that: Which implies that our objective does not depend on the spatial location or rotation of the second slice relative to the first.Uniqueness therefore only depends on the convexity of the objective, which could in principle be achieved by using the projection of D (1) , D (2) onto the vector space of positive-semi definite matrices S + n or kernelization.The next reasonable question is whether one can recover the correct Q, h given the alignment matrix Π. Problems of this form include the Orthogonal Procrustes problem (where Q is simply an orthogonal matrix), and Wahba's problem (where Q ∈ SO(2) explicitly), which are both well studied and have existing solution methods.Thus, the final problem needed to align the coordinate frame can be given (in Procrustes' form) as: Or, in Wahba's problem form: min We offer both the Procrustes' and Wahba's form, with the latter having the restriction that Q ∈ SO(2) requires det Q = 1, which is to say we seek an explicit (orientation-preserving) rotation rather than a general orthogonal transformation (which may be a rotation composed with a reflection).In either case, the problem of finding a correct alignment is something that emerges after the outputs of DeST-OT, which is itself completely independent of the coordinate frame the first and second slice are set in and is invariant to linear translation and rotation, unlike works involving Gaussian processes such as [18].

S6.3 Review of the Solution to Wahba's problem
First, we consider an appropriate transformation of the coordinates to eliminate the dependence on the translation, for this light generalization of Wahba's problem to a joint-distribution across both slices [10].In particular, consider the change of variables given by: h = −Q T s cm + s ′ cm + ν For s cm , s ′ cm representing the center of mass of slice 1 and slice 2, and ν an undetermined constant.In particular, Substituting this into the primal objective: By definition of the center of mass-coordinates, the above equates to the following, with the minimization over r now over ν: Where the independence of the left and right primal objectives yields that the minimizer is simply ν ⋆ = 0.After the optimization for the rotation, the optimal translation is simply: Thus, one may simply focus on the problem: min As above, we use a simple spatial condition to partition each slice into two pieces.Here, we set two "pivot" lines, given by y = w 1 and y = w 2 .The cell type partitions P 1 , P 2 over the two slices are defined as follows: We assign features in a similar manner to the one-dimensional case.Previously we assigned features to the boundary of cell type segments and linearly interpolated between these.
In this slightly more general setting, let R i (A) and R i (B) denote (smallest) bounding rectangles for sets P i (A) and P i (B) for slices i = 1, 2.
We assign the first four unit vectors in the standard basis (e j ) 8 j=1 of R 8 to cell type A, and the last four to cell type B. The top and bottom sides of R i (A) are assigned to features e 1 and e 3 , while the left and right sides of the rectangle are assigned to features e 2 and e 4 .We make similar assignments for R i (B) using the last four basis vectors.At each spot s, lying in some bounding rectangle R for its cell type, the feature assigned to s is a convex combination of the features decorating the top and bottom sides of R, plus the features decorating the left and right sides of R. In particular, for R = [x min , x max ] × [y min , y max ] and (x, y) ∈ R, the coefficients λ x , λ y are defined as: x − x min x max − x min , λ y = y − y min y max − y min Thus, for a given cell type we have two gradients along the x and y direction which are scaled equivalently between the two slices S (1) and S (2) which we seek to align.The final feature vector within cell type A is therefore given as and repeat this for cell type B as g B (x, y) = λ x g x,L + (1 − λ x )g x,R + λ y g y,T + (1 − λ y )g y,B choosing features which are mutually-orthogonal-for simplicity 8-dimensional unit vectors e 1:8 .Between S (1) and S (2) these features are consistent, and λ x , λ y ∈ [0, 1] gives the proportion of each feature depending on how far the (x, y) coordinate is along the axis of each ellipse within a given cell type.I.e. for (x 1 , y 1 ) ∈ S (1) , (x 2 , y 2 ) ∈ S (2) we have that if λ A x1 = λ A x2 and λ A y1 = λ A y2 then f A (x 1 , y 1 ) = f A (x 2 , y 2 ) and the two features should be aligned between timepoints 1 and 2 (likewise for cell type B).As the features are chosen to be orthogonal, this alignment is unique.To model noise, we assume that the data we observe at each point (x, y) is distributed as ∼ N (f A (x, y), σ 2 1 8 ) for cell type A and ∼ N (g B (x, y), σ 2 1 8 ) for cell type B for varying levels of σ.S8 Benchmarking methods PASTE [45] By design, PASTE cannot align spatiotemporal data and does not infer cell growth and death because it uses balanced OT.For the purpose of benchmarking experiments in this work, we relax PASTE's balanced constraints so that we optimize the unbalanced version of the PASTE objective.For the alignment matrix Π computed by relaxed PASTE, we then take the difference between the row sums of Π and the uniform distribution g 1 (Eq.( 4)) as PASTE's inferred growth ξ.

moscot [21]
We use moscot.problems.spatiotemporal.SpatioTemporalProblem for solving spatiotemporal alignment problems in this work using moscot.For the alignment matrix Π computed by moscot, we take the difference between the row sums of Π and the uniform distribution g 1 as moscot's inferred growth ξ.For benchmarking experiments related to growth rates instead of growth ( §??), we use the numbers returned by SpatioTemporalProblem.posterior_growth_rates as moscot's inferred growth rates over spots in timepoint t 1 .

SLAT [44]
We use scSLAT.model.run_SLAT of the scSLAT package for computing an embedding for each spot in each timepoint.We then use scSLAT.model.spatial_match to compute an alignment between spots across the two timpoints.We construct a matrix Π such that Π ij ′ = 1 if spot i in timepoint 1 is the best spot aligned to spot j ′ in timepoint 2. We then divide Π by the number of spots at timpoint 1 to convert it into a semi-relaxed transport matrix.We take the difference between the row sums of Π and the uniform distribution g 1 as SLAT's inferred growth ξ.STalign [8] We first "rasterize" the spatial positions of our input data into two greyscale images.This effectively convolves a scatterplot of the spatial data with two-dimensional Gaussian noise, to produce an image approximating each tissue slice.We then ran STalign's LDDMM on the pair of images, outputting a diffeomorphism φ (which maps the first slice to the second) and its inverse φ −1 (mapping the second slice to the first).We then use φ −1 to construct a transport plan Π as follows: for spot j ′ in the second slice, we apply φ −1 to s j ′ , and select s i in the first slice which is closest to φ −1 (s j ′ ).Then we set Π ij ′ = 1, and repeat this procedure for each spot in the second slice, filling out each column of Π.The transport plan is thus a deterministic map from the spots of the second slice into those of the first.We then divide Π by the number of spots at timpoint 1 to convert it into a semi-relaxed transport matrix.We take the difference between the row sums of Π and the uniform distribution g 1 as STalign's inferred growth ξ.

Figure S1 :
Figure S1: Cell type transition matrix at each pair of timepoints during axolotl brain development Rows are cell types at the previous timepoint.Columns are cell types at the next timepoint.Matrices are column normalized.