The distribution of branch duration and detection of inversions in ancestral recombination graphs

Recent breakthroughs in the reconstruction of ancestral recombination graphs (ARGs) have meant that inference of genome-wide genealogies from large sequencing datasets is now possible. This development has also opened several new directions where research is urgently needed: improving our understanding of which aspects of genealogies are (and which are not) faithfully captured by reconstruction tools, as well as the development of inference methods that directly use ARGs as inputs. By deriving the distribution of branch duration under the SMC’ model, defined as the genomic interval spanned by a branch of the ARG until it is disrupted by recombination, we benchmark the quality of ARGs produced by several reconstruction tools: ARGweaver, Relate, tsinfer and tsdate, and ARG-Needle. Further, we develop an accurate and powerful ARG-based test for the presence of chromosomal inversions, based on the idea that suppression of recombination in individuals heterozygous for an inversion has a detectable impact on branch duration.


Introduction
In the presence of recombination, the genealogical history of a sample can be fully captured in the form of an ancestral recombination graph (ARG). This was first described by Griffiths and Marjoram (1997) as a realisation of the coalescent with recombination (CwR), a stochastic process operating backwards in time, generating a genealogy through a sequence of coalescence and recombination events (Hudson, 1983). Wiuf and Hein (1999) reframed the CwR as a stochastic process operating spatially along the genome: starting with the leftmost endpoint, local trees are generated sequentially moving to the right, reshaped by recombination events. While calculating the properties of the ARG under both frameworks is generally intractable, this seminal work spurred on a suite of simplifying approximations, enabling applications to large-scale genomic data.
The sequentially Markovian coalescent (SMC) model, proposed by McVean and Cardin (2005), imposed the assumption that the process along the genome is Markovian, which, in essence, prohibits recombination events in genetic material not ancestral to the sample. This was followed by the SMC' extension (Marjoram and Wall, 2006), which was shown to be an excellent approximation to the CwR, based on the joint distribution of pairwise coalescent times and a quantification of bias in population size estimates (Wilton et al., 2015), and the distribution of the next local tree conditional on the current one in a two-locus model (Hobolth and Jensen, 2014). Thus, for a small trade-off in accuracy, the SMC' model offered a substantially more tractable way of calculating analytic approximations to various quantities of interest, such as the correlation between coalescence time and linkage probability for a randomly sampled pair of sequences (Eriksson et al., 2009), and identity-by-descent tract length distributions and related quantities (Harris and Nielsen, 2013;Carmi et al., 2014). It also enabled the development of powerful new inference methods: for instance, by considering the genealogy of a single pair of sequences, Li and Durbin (2011) developed a HMMbased approach (the pairwise SMC, or PSMC) for inferring the history of human population sizes, which was subsequently extended by Schiffels and Durbin (2014) to multiple samples.
Meanwhile, the definition of the ARG has become decoupled from its initial description as the realisation of a stochastic process, to more broadly denote a genealogical network that captures genetic inheritance. Using this looser definition, the problem of explicitly reconstructing plausible ARGs from sequencing data has seen significant recent progress driven by the use of heuristic methods and principled approximations to the CwR. ARGweaver (Rasmussen et al., 2014) implements an MCMC scheme based on a time-discretised version of the SMC (or SMC') to obtain posterior samples of ARGs compatible with a given dataset. Relate (Speidel et al., 2019) and tsinfer/tsdate Wohns et al., 2022) reconstruct a single ARG from data, by using methods based on the Li and Stephens (2003) framework to first reconstruct the topologies and then estimating the branch lengths using Bayesian approaches with coalescent-based priors. ARG-Needle (Zhang et al., 2023) reconstructs a single ARG by sequentially threading in each sample, by first identifying the most closely related samples already in the ARG via genotype hashing, and subsequently estimating coalescence times under the Ascertained Sequentially Markovian Coalescent (ASMC) model (a coalescent-based HMM). These methods scale to thousands of human genomelength samples and have already been applied to many large-scale datasets, resulting in powerful inference of evolutionary events and parameters, such as the history of human demography , past population sizes (Speidel et al., 2019), signals of selection (Hejase et al., 2022), and genetic associations for complex traits (Zhang et al., 2023).
Methods for quantifying the quality of ARG reconstruction have so far been limited to simulation studies calculating the accuracy of inferred local tree topologies (Speidel et al., 2019;Kelleher et al., 2019) and pairwise coalescence times (Brandt et al., 2022). While calculating quantities of interest is generally very difficult under the CwR model, the tractability of the SMC' model makes it possible to derive metrics under which reconstructed ARGs can be compared against the ground truth in coalescent-based simulations, and hence identify which features are and are not captured faithfully. Only a small number of studies have so far explored this direction. Deng et al. (2021) calculated the probability that a single recombination event is tree-changing or topology-changing under the SMC' and hence derived an approximation to the distribution of the genomic distance between local trees, finding that Relate and tsinfer generally overestimate this quantity; McKenzie and Eaton (2022) extended these results to the multispecies coalescent model.
Moreover, ARGs can, in principle, capture the effects of all the evolutionary forces that have shaped the observed genetic diversity of a sample of sequences, while providing a much more efficient representation of genomes than multiple sequence alignments . Thus, ARGbased statistical inference methods have the potential to provide very powerful insights into various evolutionary events and parameters. This is a rapidly developing field, however progress has been hampered by the fact that methods that work very well on simulated ARGs often lose power and accuracy when applied to reconstructed genealogies, for reasons that are, in general, poorly understood. We thus have two main goals: to use the tractable SMC' model to better understand the ways in which reconstructed ARGs differ from simulated ones, and to use this model to develop genealogy-based inference methods.
We present a number of results in this direction, focusing on the idea of branch duration. For a given edge of the ARG, this can be defined as the span of the genome in which the branch is present in the local tree (as illustrated in Figure 1), which is determined by recombination events that change local tree topologies and branch lengths. Intuitively, the longer (resp. shorter) the time-length of a branch, the more (resp. less) likely it is to be broken up by recombination as we move along the genome, so its duration along the genome should be shorter (resp. longer). This is a fundamental property of the ARG which has several implications. For instance, in the absence of recombination,  Figure 1: Local trees of an ARG with n = 10 samples. Branch highlighted in red (1 → 20) spans one local tree and has duration 0.55; branch highlighted in green (8 → 14) spans two local trees and has duration 0.95; branch highlighted in blue (7 → 12) spans all three local trees and has duration 1. ARG simulated using msprime and visualised using tskit Kelleher et al., 2016). the genealogy is a single tree with all branch durations equal to the length of the genome, and the probability that a given branch carries at least one mutation is inversely proportional to its time-length (assuming mutations occur as a Poisson process along the branches). On the other hand, in the presence of recombination, the mutation rate per branch is less variable, because of the interplay between branch time-length and duration. Thus, the distribution of the duration of a given branch conditional on the local tree in which it first arises is a fundamental property of the ARG arising from the correlations between local trees and, to the best of our knowledge, has not been previously derived analytically.
We calculate the probability that a particular branch of the ARG is disrupted by the next recombination event under the SMC' model and use this to construct an accurate approximation of the distribution of branch duration along the genome. Through simulation studies, we use this to compare how closely the SMC and SMC' models approximate the CwR, and how well the aforementioned reconstruction methods estimate branch durations for ARGs simulated under these models. We also obtain several results demonstrating the effect of a recombination event on the height and total branch length of a given local tree.
Further, we propose a very close approximation to the distribution of clade duration along the genome and use this to construct an ARG-based test for detecting chromosomal inversions. This is based on the idea that recombination is suppressed in individuals heterozygous for an inversion, which manifests in the ARG as a clade of samples that persists for a longer stretch of the genome than would otherwise be expected. We demonstrate the utility of this test for both simulated and reconstructed ARGs.
Code implementing the results and used to produce the figures is publicly available at github. com/a-ignatieva/argbd.

Notation and background
The notation is illustrated in Figure 2. Let T be a fixed local tree with n leaves. Denote by T i (for i ∈ {2, · · · , n}) the population-scaled time at which the number of lineages in T jumps from i to i − 1, with T 2 being the time of MRCA and setting T n+1 := 0. Let n(t) be the number of lineages at time t, so n(0) = n, with n(T j ) = j − 1 and n(t) = 1 for t ≥ T 2 .
The recombination event occurs at genomic position 0.5; the recombination point R = (b, s) is shown as a cross; the coalescence point C = (b ′ , u) is shown as a circle; n(s) = 3 and L T (s) gives the total length of the branches shown in blue. The tree on the right T ′ is obtained by pruning the subtree below R and reattaching at C; solid vertical lines show branches that have not been affected by the recombination event.
For a branch b ∈ T , denote the lower end time by t ↓ (b) ≥ 0 and the upper end time by t ↑ (b) ≤ T 2 , with the time-length of the branch given by t(b) = t ↑ (b)−t ↓ (b). Let d ← (b) and d → (b) be the leftmost and rightmost endpoints of the genomic span of branch b, respectively, with its duration given by Let par(b), sib(b), ch 1 (b) and ch 2 (b) denote the parent, sibling, left child and right child branch of b respectively, such that we have the following relations: Define the sets of branches A(b) := {b, sib(b), par(b)} and B(b) := {b, sib(b), ch 1 (b), ch 2 (b)}, and denote by b r the root lineage extending past the MRCA node.
Let L T (t) be the sum of branch lengths in T above time t and up to T 2 : so that L T (0) is the total branch length of T . Denote by T x the local tree at position x along the genome.
Under the SMC', moving along the genome, a recombination event happens after an exponentially distributed waiting time with rate L T (0) · ρ/2; when this event happens, a location R is selected uniformly at random along the branches of T , say on branch b at time s, which we denote as R ∈ b or R = (b, s). A new coalescence point C is selected by allowing the recombining lineage to coalesce at rate 1 with all the lineages present above time s (including b r ). We denote a coalescence point on branch b ′ at time u as C ∈ b ′ or C = (b ′ , u). The next tree along the genome is then formed by pruning the subtree below the recombination point R and reattaching it at the chosen coalescence point C. We write R ∈ B(b) to mean that the recombination point is on one of the branches in B(b).
The difference with the spatial formulation of the CwR is that the coalescence point is restricted to be on the local tree T , whereas under the CwR it could be placed on any branches of the ARG corresponding to the full sequence of trees to the left of the recombination position. In essence, the SMC' approximation disallows any recombination events that occur in non-ancestral material, making the process Markovian along the genome. Figure 3: Possible events that do and do not disrupt branch b (shown in red); sib(b) is shown in blue, ch 1 (b) and ch 2 (b) in green. Recombination points are shown as crosses; coalescence points as circles. Branches that are disrupted (or newly added) are shown as dashed lines. Events highlighted in green (marked with ticks) do not disrupt b. Events highlighted in red (marked with stars) disrupt the branch in terms of both branch length and topology (b is topologically disrupted); those highlighted in blue (marked with dots) disrupt b via changing only its branch length.
The difference between the SMC and SMC' models is that under the SMC', the coalescence point can be chosen on the same branch b containing the recombination point, above time s (so that recombinations can occur that do not change the tree topology or branch lengths), whereas events of this type are disallowed under the SMC.

Probability that a branch is disrupted
Considering a fixed branch b ∈ T , when a recombination event occurs, we would like to know the probability that b is affected by this recombination event. This includes both changes in the branch length of b and events that change the topology of the clade around b (we will say in these cases that b is topologically disrupted by the recombination). This can happen because either (1) the recombination point is on b ′ ∈ B(b) and the coalescence point is not on b ′ , or (2) the recombination point is not on B(b) and the coalescence point is on b. The possible scenarios that do and do not disrupt b are illustrated in Figure 3.
We have, for a given tree T , The probability that b is disrupted by the recombination event is thus We now calculate each of these probabilities in turn for an arbitrary branch β ∈ T under the SMC'.

Probability recombination point is on branch β
Under the SMC', the recombination point is chosen uniformly at random along the branches of the tree. Thus, the probability that the recombination event happens on branch β is the ratio of the branch length to the total branch length of T , so and

Probability coalescence point is not on β given recombination point is on β
The probability that, conditional on the recombination event happening on branch β at time s, the coalescence point is on β has been derived by Deng et al. (2021), which in our notation is as follows.
Proposition 2.1 (Deng et al. (2021), Proposition 1). Letting k = n(s), so that T k is the first coalescence time just above time s,

5)
and Marginalising out the recombination time s, hence summing over k = n(s) in (2.4), gives the following.
and for x, y, A, B ∈ Z, x ≥ k, 2 ≤ y ≤ x, The proofs, translated into our notation, are given in the Appendix, Sections A1 and A2.

Probability coalescence point is on β given recombination point is not on B(β)
We start by conditioning on the recombination time s to obtain the following.
Proposition 2.3. Conditional on the recombination point R being at time s and on a branch outside the set B(β), with k = n(s), the probability that β is disrupted is (2.10) with Q kk and Q kj as defined in (2.5) and (2.6), respectively.
The proof is given in the Appendix, Section A4. Substituting the expressions (2.2), (2.3), (2.7) and (2.11) into (2.1) gives the desired probability that branch b is disrupted by the next recombination event.

Probability that a branch is topologically disrupted
Most ARG reconstruction algorithms focus on identifying the presence of recombination events through finding patterns of mutations not consistent with tree-like evolution. This, in general, does not allow for the detection of recombination events that only change branch lengths (panels highlighted in blue in Figure 3). We therefore also calculate the probability that a branch b is topologically disrupted (corresponding to panels highlighted in red in Figure 3).
The proof is given in the Appendix, Section A5.

Probability that a clade is disrupted
We now calculate the probability that a particular clade of branches is disrupted by the next recombination event, i.e. that the membership of sample nodes in the clade changes from one local tree to the next (but allowing for events that disrupt branches within the clade without changing the group of subtended samples). This can happen when a lineage within the clade recombines and coalesces outside the clade or its root branch, or if a lineage from outside the clade recombines and coalesces into the clade, as illustrated in Figure 4.
Let G be the set of branches subtended by a branch g, with clade MRCA time t ↓ (g) = T m , n G (t) the number of lineages in clade G at time t, abusing notation to write G ∪ g := G ∪ {g}. Let n G∪g (t) be the number of lineages in G ∪ g at time t and L G (t) the total branch length within the clade above time t. Then 1 = P(R ∈ G) · (P(C / ∈ G ∪ g|R ∈ G) + P(C ∈ G ∪ g|R ∈ G)) + + P(R ∈ g) Considering only the events that disrupt the clade, we obtain Theorem 2.2. The probability that a clade G is disrupted by a recombination event is The proof is given in the Appendix, Section A9.

Change in total branch length of tree
Conditioning on the tree T , we now consider the distribution of the change in total branch length following a single recombination event. First, considering the sign of the change, we have the following.
Proposition 2.5. The probability of C being negative, zero, or positive is given by (2.14) The proof is given in the Appendix, Section A6.
To explore the distribution of the magnitude of the change in branch length (when this is nonzero), we derive its density.
Proposition 2.6. Conditional on the change in total branch length being non-zero, the density of C is given by The proof is given in the Appendix, Section A7.

Change in tree height
The height of the tree, H(T ) = T 2 , can change following a recombination event if (1) the coalescence point is above T 2 , in which case a new root is formed and the tree height increases, or (2) the recombination happens on one of the two lineages descending from the MRCA, then the tree height can either increase or decrease. Let the set M := {ch 1 (b r ), ch 2 (b r )} contain the two branches descending from the MRCA, and let H = H(T ′ ) − H(T ) be the magnitude of the change in height. Then we have the following.
Proposition 2.7. Conditional on T , the probability of the change in tree height being negative, zero, or positive is given by where Q 1 , Q 2 and Q 3 are as defined in (2.8), (2.9) and (2.17), respectively.
The proof is given in the Appendix, Section A8.

Distribution of branch duration along the genome
For a given branch b, the distribution of its duration can be characterised by considering the rate at which branch-disrupting recombination events arrive as we move left-to-right along the genome. However, the instantaneous rate at which b is disrupted at position τ may not be the same as that at position τ ′ > τ , due to the effect of other recombination events that might occur between τ and τ ′ . Thus, the duration of a branch is the waiting time to the next branch-disrupting recombination event, but the rate at which this happens is inhomogeneous along the genome (and is, in fact, itself random).
Similarly to Deng et al. (2021), however, we find that if a recombination event between adjacent trees T and T ′ does not disrupt b, then where P T (b disrupted) is given by (2.1), based on simulation results as shown in the Appendix, Figure A2. Thus, an approximation to the distribution of branch duration can be constructed by assuming that the rate at which branch-disrupting recombination events arrive is homogeneous along the genome, which is equivalent to assuming that recombination events that do not disrupt the branch b also do not change the local tree: so if T is the local tree at position d ← (b), after each recombination event the newly formed local tree is T ′ = T . Recombination events occur as a Poisson process along the genome with rate L T (0) · ρ/2, allowing us to thin the process by multiplying this rate by the probability that the event is branch-disrupting, thereby offering a tractable approximation for the arrival of branch-disrupting recombination events. Then conditional is distributed as the waiting time to the first event in a Poisson process with rate That is, or, by rescaling, Analogously, if the recombination rate is not constant along the genome, with the population-scaled recombination rate at position w given by ρ(w)/2, then the intensity of the process at position w is instead given by and we have The quality of this approximation can be verified by simulation, using the probability integral transform as follows. For the i-th branch b i ∈ {b 1 , . . . , b m } of a simulated ARG, take T to be the local tree at position d ← (b i ), compute q i as and let p i = 1 − e −q i . Then a Q-Q plot can be constructed by plotting the ordered quantities p (1) ≤ . . . ≤ p (m) against the corresponding quantiles of the uniform distribution 1 1+m , . . . , m 1+m . If the approximation fits well, the points should lie on the diagonal. A Kolmogorov-Smirnov (K-S) goodness of fit test can be used to test the null hypothesis that the computed p i values are uniformly distributed.
If we were to consider only events that topologically disrupt the branch, we instead have the approximation A similar procedure to that described above can be used to check goodness of fit.

Distribution of clade duration along the genome
Similar approximations to those employed when investigating the distribution of branch duration along the genome can be used for the distribution of the waiting time until a clade G is broken up by a recombination event (that is, when the clade either gains or loses one or more samples as a consequence of recombination). Consider an inhomogeneous Poisson process with intensity at position w given by conditional on the local tree at d ← (G) (defined as the leftmost position along the genome where the clade arises). Again through rescaling time, we have Thus, for each clade G i in a simulated ARG, we can calculate its left and right endpoints d ← (G i ) and d → (G i ), respectively. Letting the quality of the approximation can again be checked using a Q-Q plot as described above.

Probability of a branch being disrupted by a recombination event
The left panel of Figure 5 shows (2.1) calculated for each branch of an ARG simulated under the SMC' with stdpopsim (Adrion et al., 2020;Lauterbur et al., 2022), against the normalised age of the branch t ↑ (b)/L T (0). Middle and right panels show the same quantities, but calculated for each branch of an ARG reconstructed from the simulated data: using Relate (v1.1.9) (middle), and tsinfer (v0.3.0) and tsdate (v0.1.5) (right).
For the simulated ARG, the calculated probability that the branch is disrupted by a recombination event is higher for older branches: this is as expected, as branches at the top of a local tree tend to have greater time-length and co-exist with fewer other lineages. However, old branches are generally difficult to accurately reconstruct from the sequences at the leaves due to lack of signal, unless they are strongly supported by mutations (which implies longer duration along the genome). As a result, reconstructed ARGs have fewer old branches (with fewer points lying towards the right of the plots), and the old branches that are present tend to have lower probability of disruption than those of a similar age in the simulated ARG. This bias is seen for both Relate and tsinfer/tsdate. Figure 5: Value of (2.1) calculated for each branch b, against t ↑ (b)/L T (0). Left panel: for a uniform random sample of 10 000 branches in one simulated ARG (n = 100, stdpopsim human Chr21, HapMapII GRCh38 recombination map, constant population size model with N e = 10 000 diploids). Middle and right: same quantity calculated for each branch of the ARG reconstructed from the simulated data using Relate (middle) and tsinfer/tsdate (right). Colour shows number of edges between the top end of the branch and the root of T (purple dots correspond to branches extending from the MRCA node).

Quality of approximation to the distribution of branch duration
We first assess the quality of the approximation derived in Section 2.7 by simulating an ARG under the SMC' and checking if the simulated branch durations follow (2.22), by using the procedure described in Section 2.7. Using stdpopsim, we simulated two ARGs under the SMC' with n = 100 (haploid samples) and the following sets of parameters: • dataset 1: Chr21 with HapMapII GRCh38 recombination map, mutation rate 1.29 · 10 −8 per site per generation, N e = 10 000 diploids, constant population size model; • dataset 2: 5Mb of Chr21 with flat recombination map, recombination rate 1.2 · 10 −8 per site per generation, mutation rate 1.29·10 −8 per site per generation, N e = 10 000 diploids, constant population size model.
We sampled 10 000 branches from each ARG (uniformly at random) for testing, to speed up computation. The corresponding Q-Q plots are shown in Figure 6 (blue points). The points adhere very closely to the diagonal, demonstrating that the approximation provides an excellent fit. The K-S p-values of 0.31 (left panel) and 0.75 (right panel) also suggest good agreement. Grouping branches by their depth (the number of branches on the way to the MRCA) or clade size (number of samples subtended by the branch) in the tree and constructing Q-Q plots for each group also did not reveal significant deviation from the diagonal (as seen in the Appendix, Figure A4), suggesting that the approximation holds for all branches.

Effects of recombination on a local tree
To calculate a Monte Carlo estimate of the marginal probability that the change in total branch length is negative, zero, or positive, we average over local trees simulated using msprime under the SMC' model. The results are shown in Figure 7. Recombination has a stabilising effect on the total branch length of local trees: when the total branch length is small (resp. large), the probability that the recombination event will increase the total branch length increases (resp. decreases). : Mean probability of change in total branch length (left) and tree height (right) being negative (red, top stack), zero (blue, middle stack) or positive (green, bottom stack). Trees were simulated and binned according to total branch length (left) or height (right), with 100 trees simulated per bin, and sample size n = 100. For each tree, probabilities were calculated using (2.14)-(2.16), the stacked bar plot shows the mean probabilities for each bin.
Similarly, we can average over trees simulated under the SMC' model to estimate the marginal probability that the change in tree height following a recombination event is negative, zero, or positive. The results are shown in Figure 7 (right panel). As with total branch length, recombination tends to increase (resp. decrease) the tree height with higher probability when the tree height is small (resp. large). Figure 8 shows the density (2.18) for three simulated trees with varying total branch lengths L T (0), for n = 10 and n = 100. In both cases, the density is concentrated around zero and skewed to the left (resp. right) when the total branch length is small (resp. large); it is roughly symmetric about zero for middling values of L T (0). These results shed light on why, despite the strong assumption that recombination events that do not disrupt the given branch also do not change the rest of the local tree, our approximation to the distribution of branch duration gives an extraordinarily close fit for data simulated under the SMC' model. For a branch that is close to the leaves, the probability that the branch is disrupted by the next recombination event is very small ( Figure 5). Thus, many recombination events will occur before this branch is disrupted. As can be seen in Figure 7, recombination has the effect of stabilising the total branch length, with events causing an increase (decrease) in total branch length being more likely if the current total branch length is relatively small (large). Thus, it seems the fluctuations in total branch length average out and do not significantly affect the overall rate of branch-disrupting recombination events. On the other hand, for a branch that is close to the root of the tree, per Figure 5 the probability of the branch being disrupted by the next recombination is relatively high. Thus, a relatively small number of recombination events are likely to occur before they affect the given branch. As illustrated in Figure 8, when a recombination changes the total branch length of the tree, the magnitude of this change is concentrated around 0. Thus, the effect of recombination on the rest of the tree does not appear to significantly affect the probability that the branch is disrupted.

Comparison of simulation models
We next simulate ARGs under the CwR and under the SMC, with the same two parameter sets as given in Section 3.2 and again compare the resulting branch durations to (2.22). For the CwR, the duration of a branch is taken to be the sum of all the genomic intervals where that branch appears in the local tree (to account for the presence of recombination events that occur in non-ancestral material). Figure 6 shows that the approximation is an excellent fit to the CwR (green points), with K-S p-values of 0.06 (left panel) and 0.94 (right panel). This suggests that the distribution of branch durations under the SMC' and that under the CwR are remarkably close.
Under the SMC (red points), branch durations are shorter in general, with points falling below the diagonal (with both K-S p-values < 0.001). This is due to the model disallowing recombination events that do not change the local tree, so branches are more frequently disrupted by recombination.

Reconstructed ARGs
We next compare the distribution of branch duration under the SMC' to that of ARGs reconstructed using Relate, tsinfer/tsdate, ARG-Needle, and ARGweaver. For each dataset simulated under the SMC' as described in Section 3.2, we used each tool to reconstruct an ARG, using the true simulation parameters as inputs for each tool (and for ARGweaver, a time discretisation grid with 100 points and selecting the MAP ARG from 1000 posterior samples, for ARG-Needle using 50 time discretisation points). We sense-checked the output of the ARG reconstruction methods using a number of metrics (comparing the MRCA times, local tree topologies, and LD decay against the simulated ARGs).
The corresponding Q-Q plots are shown in Figure 9 (calculated for a uniform random sample of 10 000 branches for each ARG; the corresponding histograms are also shown in the Appendix, Figure A3). For the simulated ARGs and those reconstructed by ARGweaver and ARG-Needle, we check branch duration against the approximation (2.22); for ARGs reconstructed using Relate and tsinfer/tsdate, we check against (2.23) which only takes topology-changing events into account (to adjust for the fact that these tools do not detect recombination events that affect only branch lengths). Moreover, Relate does not attempt to infer branch duration when the branch is not supported by at least one mutation, in which case the local topology is resampled from one local tree to the next (while tsinfer more directly captures the duration of branches in the reconstructed ARG based on shared ancestry). Therefore, we only test branches in Relate ARGs which carry at least one mutation. This requires conditioning the approximation (2.23) on the number of mutations being greater than zero; the corresponding calculations are provided in the Appendix, Section A10. ARGweaver accurately captures the branch duration distribution under the SMC model, showing a similar deviation from the diagonal as for data simulated under the SMC in Figure 6, although results were only calculated for n = 10 due to excessive runtimes for larger sample sizes. Note that while it is possible to use the SMC' model with the latest version of ARGweaver, we found that the resulting ARGs contained cycles, which prevented us from calculating the required probabilities.
ARGs reconstructed by tsinfer/tsdate consistently have an excess of long-lasting branches, while for Relate this depends on the sample size; both tools however produce skewed distributions of branch duration. This is likely to be, in part, due to the waiting distances between trees being generally skewed, as shown by Deng et al. (2021). The ARGs produced by tsinfer contain polytomies (nodes with more than two descendants), which is also not accounted for under the SMC' model. This results in the total branch length of a reconstructed local tree to often be greater than it would be if the polytomies were resolved to make the tree binary. Moreover, some recombination events that would disrupt the branch under the SMC' (such as recombination points located on child branches), may not do so when the tree contains polytomies.
Relate first reconstructs the sequence of (correlated) local trees along the entire genome and then identifies which branches persist between trees, using an argument based on the similarity of clades subtended by the branch in successive trees. Thus, branch durations are calculated approximately, rather than being explicitly inferred, which we do not account for.
The threading procedure used by ARG-Needle to reconstruct the ARG uses the ASMC model to estimate coalescence times between the added sequence and the closely-related samples already in the ARG, with the lowest possible resulting coalescence time dictating which branch the sequence is threaded to and over what genomic length. This procedure (which relies on a combination of maximum a posteriori and posterior mean estimates for the coalescence times) is optimised for metrics having a direct effect on downstream analyses, and appears to lead to ARG-Needle consistently underestimating branch durations. We note that this is somewhat affected by the choice of time discretisation used within the ASMC (particularly for small sample sizes), but our overall findings do not change significantly when toggling this parameter.
Considering the un-corrected distributions of the branch durations in the simulated and reconstructed ARGs (i.e. directly calculating the observed duration of each branch, without adjustment, and plotting the histogram) results in similar conclusions (shown in the Appendix, Figure A5). Further, Figure A6 (Appendix) shows histograms of the expected number of mutations per branch, being the product of the observed branch duration (in bp), observed time-length (in generations), and the mutation rate (per bp per generation), demonstrating large deviations between the distributions for simulated and reconstructed ARGs.
For Relate and tsinfer/tsdate, we find that using approximation (2.23) results in a marginally better fit than (2.22). This is as expected, since these algorithms are not designed to detect recombination events that only affect branch times. Additionally, we find that goodness of fit depends on the mutation rate: as the mutation rate increases, more recombination events are detected by the algorithms and thus branch time estimates become more accurate.

ARG-based test for inversions
Chromosomal inversions are a type of structural variant, whereby as a result of recombination, the genome breaks at two points and the segment between the breakpoints is reinserted in the opposite orientation. While recombination can proceed normally in individuals homozygous for the inversion, in heterozygotes recombination is substantially suppressed in the region containing the inversion, since a crossover recombination within the region is likely to result in the production of unbalanced gametes (Kirkpatrick, 2010). Inversions are a relatively common phenomenon in many organisms, including humans (Puig et al., 2015) and mice (Hammer et al., 1989), and can range in size from hundreds of kilobases to several megabases (Hanlon et al., 2022). They have been posited to play an important role in speciation and local adaptation (Kirkpatrick, 2010), and a large body of literature exists examining the interplay of inversions and selection from the population genetics perspective (e.g. Kirkpatrick and Barton, 2006).
Detecting inversions computationally from sequencing data typically relies on paired-end mapping (detecting reads mapping in the opposite orientation to the reference), split-read methods (detecting reads that map onto the reference with gaps), and de novo assembly (directly reconstructing the sequenced genome to look for differences with the reference) (Tattini et al., 2015). For instance, DELLY (Rausch et al., 2012) implements a combination of these approaches and was used to identify hundreds of inversions in data generated by the 1000 Genomes Project (Sudmant et al., 2015). In general, however, such methods suffer from high false positive rates and poor sensitivity, with their performance depending on the size of the inverted region, particularly for short-read sequencing data (Lucas Lledó and Cáceres, 2013); the detection of structural variants from long-read data is challenging due to high sequencing error rates (Sedlazeck et al., 2018).
Suppose that in a given ARG, an inversion happens on a branch g which subtends a clade G. Suppression of recombination in heterozygotes implies that if a lineage within G undergoes a recombination, it will coalesce with lineages in G with high probability; likewise, if a recombination event happens on a branch not carrying the inversion, with high probability it will coalesce with branches outside G. This implies that inversions can be detected by looking for clades that last for "too long" along the genome. Phrasing this as a hypothesis test, for each clade G i , we calculate its left and right endpoints d ← (G i ) and d → (G i ), and estimate the probability of observing a clade duration greater than d Note that we are imposing the simplifying assumption that recombination in heterozygotes is suppressed completely in the inverted region (so the clade G remains completely intact). In reality, recombination can occur in heterozygotes: multiple breakpoints occurring in the inverted region would allow for this, however such events have relatively low probability unless the inversion region is very large or the recombination rate is very high.

Simulated ARGs
To demonstrate the power of this test in detecting inversions, we used SLiM  to simulate an ARG with one inversion under balancing selection (n = 100, 5Mb with recombination rate 1·10 −8 per bp per generation, constant population size 10 000, inverted segment length 200kb, neutral mutations at rate 1·10 −8 per bp per generation added using msprime). Figure 10 shows the corresponding Q-Q plot (middle panel) and p-values for each clade (right panel). The durations of most clades are unaffected by the inversion, so most of the points on the Q-Q plot adhere tightly to the diagonal. However, there are outliers in the tail (right panel, shown in red) with p-value < 0.05/m, where m = 32 564 is the number of clades. Figure 11 shows in red the genomic positions of the clades with significant p-values; all of these overlap the location of the inverted region, and the clade subtended by the edge carrying the inversion is a significant outlier with the lowest p-value. The equivalent Q-Q and p-value plots for a neutral simulation, using SLiM with no inversion and otherwise the same parameters, are shown in the Appendix, Figure A7, demonstrating that in this case the p-values are approximately uniformly distributed and there are no clades with significant p-values.

Reconstructed ARGs
As can be seen from Figure 10 (left panel), similarly to branch duration, the distribution of clade duration in ARGs reconstructed using Relate, tsinfer/tsdate and ARG-Needle is skewed, with clade Figure 11: The spans of all clades classed as outliers based on (2.24) for the simulated ARG (resp. ARG reconstructed using Relate) are plotted as red (resp. purple) thick horizontal lines; a random sample of 2 000 other clades also shown in black, for reference. Corresponding p-values are shown on the y-axis. Blue vertical lines show breakpoints between local trees; green dotted lines delineate the region of the inversion; thin red (resp. purple) horizontal line shows Bonferronicorrected significance threshold for calculations using the simulated (resp. Relate) ARG. duration generally overestimated by these methods. Thus, directly applying the test as described above will lead to a high false positive rate. We now describe a correction which can be applied to counteract two main problematic features of reconstructed ARGs, focusing particularly on Relate (due to the presence of polytomies for tsinfer/tsdate being difficult to correct for, and the large bias seen with ARG-Needle which both under-and overestimates clade duration).
Let G = {G 1 , G 2 , . . . , G N } be a list of all clades in the reconstructed ARG. The first issue is that due to a lack of mutations around the leftmost and rightmost endpoints of a clade, Relate may overestimate its duration, causing false positives. To correct for this, we proceed as follows. Suppose that the root branches of a clade G i in trees T d ← (G i ) , . . . , T d → (G i ) have mutations at positions m i 1 < m i 2 < . . . < m i K . We (1) remove from G all clades with fewer than two mutations in total and fewer than M mutations per kb on average, and (2) measure an adjusted clade duration d using the positions of the leftmost and rightmost mutations that support the given clade. That is, we define The second issue is that the clade carrying the inversion may not be supported by mutations uniformly along the inverted region, causing it to appear and disappear multiple times in quick succession in the reconstructed ARG, which can cause false negatives. We correct for this by "merging" pairs of clades that are nearby on the genome (in terms of genetic distance, to allow for varying recombination rates). For two clades G i , G j ∈ G that have identical sets of sample descendants and are less than L centiMorgans apart, that is , and update G as We apply this to all pairs of clades in G iteratively, until no more clades can be merged together.
For each clade in the reduced list G i ∈ G, we thus calculate an adjusted version of (2.25): again taking p i = e − q i . Note that with the above definitions, this can be computed even though the clades in the reduced list will now not necessarily exist (with the re-defined durations) in the ARG itself. We thus obtain adjusted p-values for each clade, applying a significance threshold of 0.05/N , where N is the original number of clades in the reconstructed ARG. We apply this to an ARG reconstructed using Relate for the data simulated using SLiM (as described in Section 3.6.1), setting L = 0.01 and M = 0.05. We detect three significant clades, shown in purple in Figure 11; all of these overlap the inverted region. The clade spanning the entire inverted region remains a significant outlier with the lowest p-value. The corresponding Q-Q and p-value plots are shown in the top row of Figure A8 (Appendix), showing that the correction brings the points on the Q-Q plot very close to the diagonal. The equivalent plots for a simulation with no inversion are shown in the bottom row, demonstrating that the p-values are approximately uniformly distributed and there are no false positives.

Test error rates
We examined the performance of the test by applying it to ARGs simulated using SLiM with the parameters given in Section 3.6.1, varying the length of the inverted region from 0 to 200kb (100 ARGs in each case). Defining positive detection as there being at least one clade within the inverted region with a significant p-value, the resulting ROC curve is shown in Figure 12 (left panel). Fixing the false positive rate at 5% (corresponding roughly to one false positive per 100Mb) gives the confusion matrix shown in Table 1a, demonstrating very high sensitivity. For each inversion length, out of all the clades with significant p-values across the simulations, a high percentage lie within the inverted region.
Reconstructing an ARG using Relate for each simulated dataset and applying the adjustments described in Section 3.6.2 gives the ROC curve shown in Figure 12 (right panel); the results in Table 1b show high sensitivity is maintained for inversions longer than around 100kb. These results demonstrate very good performance in detecting the presence of inversions, as well as pinpointing the candidate clade and its position along the genome.     Table 1: Confusion matrices and results summaries for inversion detection test, based on 100 simulations for each given length of the inverted region: using (a) the simulated ARGs, (b) ARGs reconstructed using Relate.

Discussion
We have found that the distribution of branch duration is very accurately captured by the SMC' approximation to the CwR. Relate and tsinfer/tsdate seemingly do not capture the distribution of branch duration well, echoing the results of Deng et al. (2021) regarding the length of intervals between local trees.
The differences between the approximate distribution of branch durations and the distribution of branch durations in ARGs produced using reconstruction tools are due to both model misspecification and the particularities of each algorithm. We suggest that the bias seen in ARGs reconstructed using tsinfer stems from the presence of polytomies, which lead to an excess of deep branches with long durations. It is nontrivial to adjust for their presence within our model, and it is not currently possible to break polytomies at random using tskit without re-sampling branches in each tree independently (which would prevent a proper calculation of their duration). In general, deep branches can arise either through true demographic events or due to inaccuracies in ARG reconstruction; our method can be used to detect deep branches that are likely to be artefactual, which may, for instance, help with avoiding spurious inference of introgression.
We note that apart from ARGweaver, none of the ARG reconstruction methods we consider are explicitly optimised to recover the distribution of branch duration, focusing instead on other aspects such as node times, local tree topologies, and patterns of linkage disequilibrium (which are recovered well in our simulation studies). Our results can potentially be used to improve the estimates of branch duration produced by tsinfer and Relate during the topology reconstruction step and hence also improve the downstream inference of node times.
The concept of branch duration is closely related to the ARG-based definition of haplotype blocks proposed by Shipilina et al. (2023): the "set of genomic regions that descend from a particular edge in the ARG, which is defined by a unique coalescence event, and by the set of descendant samples". Our results thus provide a theoretical derivation for the distribution of the length of haplotype blocks under the SMC' approximation. Moreover, since we demonstrate that branch durations are not, in general, well reconstructed by Relate, tsinfer/tsdate and ARG-Needle, this definition may be difficult to apply to data in practice.
The SMC'-based approximation we construct for the distribution of clade duration also appears to provide an excellent fit based on simulations. The corresponding test we develop for detecting chromosomal inversions has excellent performance on simulated ARGs, as well as (with appropriate adjustment) ARGs reconstructed using Relate. Since the test detects long clade duration after adjusting for its age and the local tree topology, and hence specifically detects localised suppression of recombination, it has the power to discriminate inversions from other genealogy-distorting events such as point mutations under balancing selection. This opens up the exciting possibility of detecting inversions using ARGs reconstructed from large-scale genetic datasets.
Following the recent breakthroughs in scalable ARG reconstruction, there has been a flurry of interest in developing ARG-based methods for statistical inference of evolutionary events and parameters. Such methods have the potential to leverage the efficiency with which ARGs store genetic information and avoid the loss of power faced by methods relying on summary statistics. However, structural variants are largely ignored by the currently available methods for ARG storage and reconstruction, while being an important part of the evolution of many species. Developments in this direction would bring us closer to accurately recording and reconstructing genealogies, paving the way to more powerful genealogy-based inference.

Data availability statement
Data sharing is not applicable to this article as no new data were generated or analysed.

Acknowledgements
otherwise, as the recombination time is chosen uniformly at random along the length of the branch. The conditional density of the coalescence time is for u > s and 0 otherwise. Conditional on the coalescence time u, the probability that the coalescence point is on β is otherwise.
Letting k = n(s), so that T k is the first coalescence time just above time s, and the summands of the second term are i Thus,
The number of lineages in set B(β) at time s can be written as where 1(·) is the indicator function. Then conditional on the recombination point not being on a branch in the set B(β), the density of the recombination event time is otherwise.

A6 Proof of Proposition 2.5 A6.1 Probability of no change in total branch length
The probability that the recombination event does not result in a change in total branch length is Q 1 (k) + Q 2 (k, k, n(t ↑ (β)) + 1, 0, 1) , using (2.7). This is equal to the probability derived by Deng et al. (2021, Theorem 1).

A6.2 Probability that change in total branch length is negative
We have Integrating over the recombination time,

A6.3 Probability that change in total branch length is positive
Similarly, the probability that the change in branch length is positive is given by Thus, Q 2 (k, n(t ↑ (b)), 2, 1, 0) + Q 3 (k) .

A7 Proof of Proposition 2.6
We first calculate the density of change in total branch length conditional on the recombination point R = (b, s), then marginalise out the recombination time and branch.

A7.1 Density of change in total branch length conditional on recombination point
Suppose that the recombination point is on branch b at time s. Then given the coalescence time u, the change in total branch length Let l = n(u), so T l is the time of the first coalescence event just above time u. Through conditioning on the coalescence point not being on branch b, and a change of variable in (A1), for s−t ↑ (b) < c < 0 Finally, for c > T 2 − t ↑ (b), Note that e (k−1)s P 3 gives the probability that the coalescence event happens above T 2 . The conditional density of C is thus

A7.2 Density of change in total branch length
Marginalising out the position and time of the recombination point, we have Thus,

A8 Proof of Proposition 2.7
The probability of a negative change in tree height, conditional on the location and time of the recombination point, is Integrating over the recombination time gives Summing over the branches and multiplying by the corresponding probability, the unconditional probability that the change in tree height is negative is thus The probability of no change in tree height, conditional on the recombination point, is The probability of no change in tree height is thus Finally, the probability of a positive change in height, conditional on the recombination point, is The probability that the change in tree height is positive is

A9 Proof of Theorem 2.2
Conditional on T , the recombination point is chosen uniformly along the branches, so Conditional on the recombination point being in G and letting the clade MRCA time be t ↓ (g) = T m , the density of the recombination time is for s ≤ T m 0 otherwise, and similarly otherwise.

A10 Conditioning on branch having at least one mutation
Suppose that mutations occur as a Poisson process along the branches with constant rate θ, and the recombination rate at position x is ρ(x)/2. Conditional on the left endpoint of the given branch d ← , let D be its right endpoint, which using (2.23) has the density Let t be the time-length and M the number of mutations on the branch. Then the conditional distribution of D is given by We have Assuming that the recombination map is piecewise constant, we split the part of the genome to the right of d ← into portions where the recombination rate is constant between the (ordered) breakpoints d ← =: w 0 < w 1 < w 2 < . . . < w k < . . ., adding an extra breakpoint w k := d → . Then we can write where, by integrating, Similarly, xii Combining, we have (A3) Note that in the limit θ → ∞, this reduces to 1 − exp (−Λ(b)), as expected (since this effectively removes the conditioning).
For the case where the recombination rate is constant along the genome, with ρ(x) = ρ and λ(x) = λ, we obtain The p-values for each branch of a simulated ARG can now be computed by evaluating the cdf (A3) for the given values of d ← and d → , and the Q-Q plot can again be constructed by plotting these against the corresponding quantiles of the uniform distribution. However, this potentially requires summing over a large number of increments of the recombination map. Instead, we propose to approximate (A3) by We find this to be a very close match to the exact distribution based on simulations with human-like parameters, while being very fast to compute.
A11 Change in total branch length and the probability that a branch is disrupted, following a recombination event For a given branch b i that exists in local trees T (1) , . . . , T (k) , let These quantities are calculated for a uniform random sample of 1 000 branches from an ARG simulated with stdpopsim (n = 100, Chr21 with HapMapII GRCh38 recombination map, N e = 10 000, constant population size model); the corresponding histograms are shown in Figure A2. These suggest that the quantity P T (b i disrupted) · L T (0) stays relatively conserved following each recombination event, even if individually P T (b i disrupted) and L T (0) may vary more significantly. This is similar to the findings of Deng et al. (2021, Figure 5).  Figure A6: Histograms of (observed) expected number of mutations per branch, calculated as (d → (b) − d ← (b)) · t(b) · µ for each branch b in simulated and reconstructed ARGs (same as those in Figure 9). Note log scale on the x-axis.
xvii A16 Q-Q and p-value plots (simulated ARG without inversion) Observed p-values Figure A8: Top row: Q-Q plot (left panel) and p-value plot (right panel) for ARG reconstructed using Relate from data simulated using SLiM with one inversion under balancing selection (as described in Section 3.6.1). Blue (resp. green) points show values calculated after (resp. before) applying the adjustments described in Section 3.6.2; red points correspond to clades with p-value below the Bonferroni-corrected significance threshold. Bottom row: same using SLiM simulation without inversions (and otherwise same parameters).