Site Frequency Spectrum of the Bolthausen-Sznitman Coalescent

We derive explicit formulas for the first two moments of the site frequency spectrum $(SFS_{n,b})_{1\leq b\leq n-1}$ of the Bolthausen-Sznitman coalescent, along with precise and efficient approximations that remain accurate even for small sample sizes $n$. These results provide new $L_2$-asymptotics for some values of $b=o(n)$. We also study the length of internal branches carrying $b>n/2$ individuals. In this case we obtain the distribution function and a convergence in law. Our results rely on the random recursive tree construction of the Bolthausen-Sznitman coalescent.


Introduction
The Bolthausen-Sznitman coalescent is an exchangeable coalescent with multiple collisions that has recently gained attention in the theoretical population genetics literature. It has been described as the limit process of the genealogies of different population evolution models, including models that contemplate the effect of natural selection [15,16]. It has also been proposed as a new null model for the genealogies of rapidly adapting populations, such as pathogen microbial populations, and other populations that show departures from Kingman's null model [1,13].
A measure of the genetic diversity in a present-day sample of a population is often used in population genetics to infer its evolutionary past and the forces at play in its dynamics. The Site Frequency Spectrum (SFS) is a well-known theoretical model of the genetic diversity present in a population. It assumes that neutral mutations arrive in the population as a Poisson process and that each arriving mutation falls on a different site of the genome (infinite sites model), in contrast to the Allele Frequency Spectrum, in which mutations are assumed to fall on the same site but create a new allele every time (infinite alleles model). Given the close relation between the Site Frequency Spectrum and the whole structure of the underlying genealogical tree, it can be used as a model selection tool for the evolutionary dynamics of a population [3,10,4].
In this work we give explicit expressions for the first and second moments of the whole Site Frequency Spectrum $(SFS_{n,b})_{1\le b<n}$ of the Bolthausen-Sznitman coalescent, which to our knowledge were known only for Kingman's coalescent until now [5]. Here $SFS_{n,b}$ denotes the number of mutations shared by $b$ individuals in the sample of size $n$. For the expectation we obtain the formula
$$\mathbb{E}[SFS_{n,b}] = \theta\, n \int_0^1 \frac{\Gamma(b-p)}{\Gamma(b+1)}\,\frac{\Gamma(n-b+p)}{\Gamma(n-b+1)}\,\frac{dp}{\Gamma(1-p)\Gamma(1+p)},$$
where $\theta$ denotes the mutation rate. For larger values of $n$, numerical problems may occur in the calculation of this integral due to the exorbitant growth of the Gamma function. Moreover, this formula gives no insight into the shape of the expected site frequency spectrum. For this purpose approximations are helpful. A first approximation, resting on Stirling's formula, reads for $2 \le b \le n-1$ where $f_1$ is a convex, non-monotone function on $(0, 1)$ defined by
$$f_1(u) := \int_0^1 u^{-p-1}(1-u)^{p-1}\,\frac{\sin(\pi p)}{\pi p}\,dp. \qquad (2)$$
We remark that this integral may be reduced to the (complex) exponential integral $\mathrm{Ei}(\cdot)$. These formulas show that the shape of the Site Frequency Spectrum, restricted to the range $2 \le b < n$, is explained essentially by one function not depending on the population size $n$. Our approximations also update those given in [13] for families with frequencies close to 0 and 1, since $f_1(u)\sim(u\log u)^{-2}$ as $u\to0$ and $f_1(u)\sim((u-1)\log(1-u))^{-1}$ as $u\to1$; see equations (30) and (31) below. The case $b = 1$ is not covered by (1) and has to be treated separately, which reflects the dominance of external branches in the Bolthausen-Sznitman coalescent. See Theorem 3.4 for a complete summary. The above approximation is also accurate from a numerical point of view. Only for $b = 2$ do we encounter an enlarged relative error, which nevertheless remains below 10 percent for $n \ge 8$. If a more precise result is desired, then the following refined approximation may be applied for $2 \le b \le n$: with a positive function $g_1$ on $(0, 1)$ given by With this formula the relative error remains below 1 percent for $b = 2$ and $n \ge 10$, below 0.5 percent for $b = 2$ and $n \ge 150$, and below 0.3 percent for $b \ge 3$ and $n \ge 10$. Thus this approximation appears well-suited for practical purposes. Figure 1 illustrates its precision in the cases $n = 5, 20, 35$ and $\theta = 1$. For $b = 1$ the approximation formula corresponding to (1) reads which is an immediate consequence of Stirling's approximation. It is precise for small $n$ and requires no further correction as in the case $b \ge 2$.
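The numerical difficulty caused by the growth of the Gamma function can be sidestepped by working with logarithms. The following is a minimal sketch, not the authors' implementation: it assumes that the expected spectrum has the integral form $\theta\, n \int_0^1 \frac{\Gamma(b-p)}{\Gamma(b+1)}\frac{\Gamma(n-b+p)}{\Gamma(n-b+1)}\frac{dp}{\Gamma(1-p)\Gamma(1+p)}$, which is what the Chinese Restaurant probabilities of the RRT construction give. The function name `expected_sfs`, the midpoint rule and the default number of steps are illustrative choices.

```python
import math


def expected_sfs(n, b, theta=1.0, steps=4000):
    """Midpoint-rule estimate of E[SFS_{n,b}] for 1 <= b <= n-1.

    Assumed integrand (see the lead-in):
        Gamma(b-p)/Gamma(b+1) * Gamma(n-b+p)/Gamma(n-b+1)
        / (Gamma(1-p) * Gamma(1+p)),   integrated over p in (0, 1).
    """
    if not 1 <= b <= n - 1:
        raise ValueError("need 1 <= b <= n-1")
    # log of 1 / (Gamma(b+1) * Gamma(n-b+1)); does not depend on p
    log_const = -math.lgamma(b + 1) - math.lgamma(n - b + 1)
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) * h  # midpoints stay strictly inside (0, 1)
        log_gammas = math.lgamma(b - p) + math.lgamma(n - b + p) + log_const
        # Euler's reflection formula: 1/(Gamma(1-p)Gamma(1+p)) = sin(pi p)/(pi p)
        total += math.exp(log_gammas) * math.sin(math.pi * p) / (math.pi * p)
    return theta * n * total * h


if __name__ == "__main__":
    for b in range(1, 5):
        print(b, expected_sfs(5, b))
```

For $n = 3$ the integrand is linear in $p$, so the values can be checked by hand against the jump rates of the 3-coalescent: the expected external length is $9/4$ and the expected length carrying two individuals is $3/4$.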
We also study the asymptotic behavior of the second moments which, together with the above asymptotics for the first moment, leads to $L_2$ convergences such as $\frac{\log n}{n}\,SFS_{n,1} \to \theta$. These generalize and strengthen the results in [2] for the Bolthausen-Sznitman coalescent. We also provide the joint distribution function of the branch lengths of large families, i.e. families of size at least half the total population size, and their marginal distribution function. These results are useful to obtain the marginal distribution function of the Site Frequency Spectrum and a sampling formula for the half of the vector corresponding to large family sizes, although we do not present such tedious computations here.
Asymptotic results for related functionals of the Bolthausen-Sznitman coalescent have been derived by studying the block count chain of the coalescent through a coupling with a random walk, as in [8] and [9], where asymptotics for the total number of jumps and for the total, internal, and external branch lengths of the Bolthausen-Sznitman coalescent are described. These results give the asymptotic behaviour of the total number of mutations present in the population, the number of mutations present in a single individual, and the number of mutations present in at least 2 individuals. Also, a Markov chain approximation of the initial steps of the process was developed in [2], where asymptotics for the total tree length and the Site Frequency Spectrum of small families were derived for a class of Λ-coalescents containing the Bolthausen-Sznitman coalescent.
Progress has also been made for finite coalescents, including general coalescent processes. The finite Bolthausen-Sznitman coalescent has been studied through the spectral decomposition of its jump rate matrix in [11], where the authors use it to derive explicit expressions for the transition probabilities and the Green's matrix of this coalescent, and also of the Kingman coalescent. The spectral decomposition of the jump rate matrix of a general coalescent, including coalescents with multiple mergers, is also used in [17], where an expression for the expected Site Frequency Spectrum is given in terms of matrix operations, which in the case of the Bolthausen-Sznitman coalescent results in an algorithm requiring on the order of $n^2$ computations. In [7] another expression in terms of matrix operations is given for this and other functionals of general coalescent processes, both in expected value (and higher moments) and in distribution; these expressions, however, are deduced from the theory of phase-type distributions, in particular distributions of rewards constructed on top of coalescent processes, and also require extensive computations for large population sizes.
Our method, mainly based on the Random Recursive Tree construction of the Bolthausen-Sznitman coalescent given in [6], gives easy-to-compute expressions for the first and second moments of the Site Frequency Spectrum of this particular coalescent. This combinatorial construction not only allows us to study the bottom but also the top of the tree thus providing an additional insight into the past of the population and large families, both asymptotically and for any fixed population size.
In Section 2 we lay out the basic intuitions that compose the bulk of our method, including the Random Recursive Tree construction of the Bolthausen-Sznitman coalescent and the derivation of the first moment of the Site Frequency Spectrum for the infinite coalescent as a first application (Corollary 2.2). In Section 3 we present our results on the first and second moments of the branch lengths (Theorem 3.1) and of the Site Frequency Spectrum (Corollary 3.2) for any fixed family size and initial population. We then use these expressions to obtain asymptotic approximations of these moments as the initial population goes to infinity (Theorems 3.4 and 3.5), which lead to $L_2$ convergence results on the SFS (Corollary 3.6). In Section 4 we restrict ourselves to the case of large family sizes and present the joint and marginal distribution functions of their branch lengths (Theorems 4.1 and 4.3), along with a limit in law result (Corollary 4.2). Section 5 provides explanations for approximations (1) and (3). Finally, in Sections 6 and 7 we provide detailed proofs of our results.

Preliminaries
Consider the Bolthausen-Sznitman coalescent $(\Pi_\infty(t))_{t\ge0}$ with values in $\mathcal{P}_\infty$, the space of partitions of $\mathbb{N}$, and the ranked coalescent $(|\Pi_\infty(t)|^{\downarrow})_{t\ge0}$, with values in the space of mass partitions $\mathcal{P}_{[0,1]}$, made of the asymptotic frequencies of $\Pi_\infty(t)$ reordered in a non-increasing way. In what follows we present the Random Recursive Tree (RRT) construction of the Bolthausen-Sznitman coalescent given by Goldschmidt and Martin in [6]; then we follow the argument given in the same paper to establish that
$$|\Pi_\infty(t)|^{\downarrow} \overset{d}{=} PD(e^{-t}, 0), \qquad (5)$$
where $PD(\alpha, \theta)$ is the $(\alpha, \theta)$-Poisson-Dirichlet distribution.
Briefly, the construction of the Bolthausen-Sznitman coalescent in terms of Random Recursive Trees proceeds as follows. We work on the set of recursive trees whose labeled nodes form a partition $\pi$ of $[n] := \{1, \dots, n\}$, where the ordering of the nodes that confers the term "recursive" is given by ordering the blocks of $\pi$ according to their least elements. A cutting-merge procedure is defined on the set of recursive trees of this form with a marked edge: it consists of cutting the marked edge and merging all the labels in the subtree below with the node above, thus creating a new recursive tree whose labels form a new (coarser) partition of $[n]$ (see Figure 2). With this operation in mind we consider a RRT with labels $\{1\}, \dots, \{n\}$, say $T$, and attach an independent standard exponential variable to each of its edges. Then, for each time $t > 0$ we retrieve the partition of $[n]$ obtained by performing a cutting-merge procedure on all the edges of $T$ whose exponential variable is less than $t$. This gives a stochastic process $(\Pi_n(t))_{t\ge0}$ with values in the set of partitions of $[n]$ that can be proven to be the $n$-Bolthausen-Sznitman coalescent.
Figure 2: On the left, an example of a recursive tree whose labels constitute a partition of $\{1, \dots, 10\}$. On the right, the resulting recursive tree after a cutting-merge procedure performed on the marked edge (dashed line) of the first tree.
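The construction just described can be sketched in a few lines of code. This is an illustrative sketch only: the function names `sample_rrt` and `partition_at` are hypothetical, and instead of literally cutting edges one by one the code uses the equivalent observation that a node joins the block of its parent as soon as the parent is not a block root or the edge to the parent carries a mark below $t$.

```python
import random


def sample_rrt(n, rng):
    """Return (parent, edge) dicts for a random recursive tree on nodes 1..n."""
    parent, edge = {}, {}
    for v in range(2, n + 1):
        parent[v] = rng.randint(1, v - 1)  # uniform choice among earlier nodes
        edge[v] = rng.expovariate(1.0)     # standard exponential edge mark
    return parent, edge


def partition_at(n, parent, edge, t):
    """Blocks of Pi_n(t): cut every edge whose exponential mark is below t."""
    rep = {1: 1}  # rep[v] = root of the block containing v at time t
    for v in range(2, n + 1):
        u = parent[v]
        if rep[u] != u or edge[v] < t:
            rep[v] = rep[u]  # v is absorbed into the block of its parent
        else:
            rep[v] = v       # uncut edge to a block root: v roots a new block
    blocks = {}
    for v in range(1, n + 1):
        blocks.setdefault(rep[v], []).append(v)
    return sorted(blocks.values())


if __name__ == "__main__":
    rng = random.Random(2024)
    parent, edge = sample_rrt(10, rng)
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t, partition_at(10, parent, edge, t))
```

Evaluating `partition_at` on one fixed tree for increasing values of `t` gives one trajectory of the coalescent: the partition starts from singletons and only coarsens.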
The fact that $|\Pi_\infty(t)|^{\downarrow} \overset{d}{=} PD(e^{-t}, 0)$ now follows readily. To see this, consider the construction of $T$ where nodes arrive sequentially and each arriving node attaches to any of the previous nodes with equal probability. Considering also their exponential edges and having in mind the cutting-merge procedure, we see that for any fixed time $t$, and assuming that $b-1$ nodes have arrived and formed $k$ blocks of sizes $s_1, \dots, s_k$ in $\Pi_{b-1}(t)$, the next arriving node, node $\{b\}$, will form a new block in $\Pi_b(t)$ if and only if it attaches to one of the roots of the sub-trees of $T$ that form the said $k$ blocks and, furthermore, its exponential edge is greater than $t$; this occurs with probability $k e^{-t}/(b-1)$. On the other hand, in order for $\{b\}$ to join the $j$th block of size $s_j$ it must either attach to the root of the sub-tree of $T$ that builds this block with an exponential edge less than $t$, which happens with probability $(1-e^{-t})/(b-1)$, or attach to any other node of the said sub-tree, which happens with probability $(s_j-1)/(b-1)$; thus, the probability of joining the $j$th block is $(s_j - e^{-t})/(b-1)$. We recognize in these expressions the probabilities that define the Chinese Restaurant Process with parameters $\alpha = e^{-t}$ and $\theta = 0$.
Figure 3: Schematic representation of passing from $\Pi_n(t)$ to $\Pi_{n+1}(t)$ for fixed $t$, by adding a new node (blue) to a RRT. Solid lines and dotted lines represent edges whose exponential variables are greater than $t$ and less than or equal to $t$, respectively. In this case at time $t$ there are four subtrees rooted at $R_1$, $R_2$, $R_3$, and $R_4$, which form the blocks that constitute $\Pi_n(t)$; these blocks are also the tables of a Chinese Restaurant Process. In case (i) the new node will be included in the block formed by $R_2$ at time $t$, irrespective of whether its exponential edge is greater than $t$ or not. In case (ii) the new node forms part of the block rooted at $R_4$ because its exponential edge is less than $t$. Finally, in case (iii) the new node is a new root of a subtree that will form an additional block of $\Pi_{n+1}(t)$ (i.e. the new node opens a new table in the Chinese Restaurant Process).
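The three cases of Figure 3 translate into the seating rule of a Chinese Restaurant Process with $\alpha = e^{-t}$ and $\theta = 0$. The sketch below spells out that rule; the helper name `crp_step` is hypothetical, and the probabilities are the ones obtained by adding up the attachment possibilities in the argument above (root with a long edge for a new block, root with a short edge or any non-root node for an existing block).

```python
import math


def crp_step(block_sizes, t):
    """Seating probabilities for the next arriving node at fixed time t.

    block_sizes: sizes s_1, ..., s_k of the current blocks.
    Returns (probability of opening a new block, list of joining probabilities).
    """
    alpha = math.exp(-t)
    m = sum(block_sizes)   # nodes already present (b - 1 in the text)
    k = len(block_sizes)
    p_new = k * alpha / m  # attach to one of the k roots with an edge above t
    p_join = [(s - alpha) / m for s in block_sizes]
    return p_new, p_join


if __name__ == "__main__":
    p_new, p_join = crp_step([3, 1, 2], t=0.5)
    print(p_new, p_join, p_new + sum(p_join))
```

At $t = 0$ every block is a singleton and the new-block probability is 1, while for large $t$ the parameter $\alpha$ vanishes and the node joins an existing block almost surely, with probability proportional to the block size.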
We now provide two straightforward applications of the RRT construction described above which nonetheless contain the essential intuitions underlying the forthcoming proofs.

Site Frequency Spectrum in the infinite coalescent
For the first application consider a subset $I \subset (0, 1)$ and define $(C_I(t))_{t\ge0}$ to be the process counting the blocks of $\Pi_\infty(t)$ with asymptotic frequencies in $I$. Then $\ell_I := \int_0^\infty C_I(t)\,dt$ gives the total branch length of families with size frequencies in $I$ in the infinite coalescent. Our first theorem is a simple corollary of the equality in law (5).
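In particular, note that if in the infinite sites model with mutation rate $\theta$ we define $SFS_I$ to be the number of mutations shared by a proportion $u$ of individuals with $u$ ranging in $I$, then by conditioning on $\ell_I$ we get $\mathbb{E}[SFS_I] = \theta\,\mathbb{E}[\ell_I]$.
Proof of Theorem 2.1. Since $\mathbb{E}[\ell_I] = \int_0^\infty \mathbb{E}[C_I(t)]\,dt$, it only remains to compute $\mathbb{E}[C_I(t)]$ and simplify the expressions, but this is a straightforward consequence of Equation (6) in [14], which states that if $\varrho = (a_1, a_2, \dots)$ is $PD(\alpha, \theta)$ distributed and $f : \mathbb{R} \to \mathbb{R}$ is a function, then
$$\mathbb{E}\Big[\sum_{i\ge1} f(a_i)\Big] = \frac{\Gamma(\theta+1)}{\Gamma(\theta+\alpha)\Gamma(1-\alpha)} \int_0^1 f(u)\,u^{-\alpha-1}(1-u)^{\alpha+\theta-1}\,du.$$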
Using Euler's reflection formula, setting $p = e^{-t}$ in the above expression and integrating over $[0, \infty)$, we finish the proof.

Time to the absorption
In this section we prove a useful lemma for the upcoming proofs, but a first consequence of this lemma gives the distribution function of the time to absorption, A n , in the n-coalescent, a result already proved in [12].
Here Be stands for the Beta function and $\Psi$ for the digamma function where $\gamma$ stands for the Euler-Mascheroni constant.
Also, for independent trees $T_1$ and $T_2$ of respective sizes $n_1$ and $n_2$, we have
The proof of (10) follows the same lines as in [12], where the law of the time to absorption of the Bolthausen-Sznitman coalescent is derived, since this time is the maximum of the exponential edges attached to the root of a RRT. That is, for a RRT $T$ of size $n$,
$$\mathbb{P}(A_n \le s) = \mathbb{P}(M(T) \le s) = \frac{\Gamma(n - e^{-s})}{\Gamma(n)\,\Gamma(1 - e^{-s})}, \qquad (12)$$
and, as $n \to \infty$, $A_n - \log\log n \to -\log E$ in distribution, where $E$ is a standard exponential random variable. The latter convergence in distribution was elegantly proved in [6] using a construction of random recursive trees in continuous time, whereas in this case it follows from Stirling's approximation to the Gamma functions appearing in (12).
On the other hand, the equality (11) will be used in the computation of the distribution function of branch lengths with large family sizes presented in Section 4.
Proof of Lemma 2.3. Let $E_2, \dots, E_n$ be the exponential edges associated to the nodes of $T$. For the proof of (9) we consider the event $\{m(T) > s\}$. This event occurs when, in the recursive construction of $T$ along with the exponential edges, the $i$th node ($2 \le i \le n$) does not attach to $\{1\}$ whenever $E_i < s$; this happens with probability $1 - \frac{1-e^{-s}}{i-1}$. Thus, considering the $n$ nodes, we obtain $\mathbb{P}(m(T) > s) = \prod_{i=2}^{n}\bigl(1 - \frac{1-e^{-s}}{i-1}\bigr)$.
For (10) we instead build the tree such that the $i$th node does not attach to $\{1\}$ whenever $E_i > s$; this happens with probability $1 - \frac{e^{-s}}{i-1}$. Thus we obtain $\mathbb{P}(M(T) \le s) = \prod_{i=2}^{n}\bigl(1 - \frac{e^{-s}}{i-1}\bigr)$.
and by changing the variable $p = e^{-x}$ we obtain (11).
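The product appearing in the argument for (10) is easy to evaluate and to compare with a closed form in terms of Gamma functions. The sketch below is an illustration under stated assumptions: the Gamma expression is our own rewriting of the product $\prod_{i=2}^{n}(1 - e^{-s}/(i-1))$ and is not quoted from the lemma, and both function names are hypothetical.

```python
import math


def absorption_cdf_product(n, s):
    """P(M(T) <= s): no node attaches to the root with an edge above s."""
    q = math.exp(-s)
    prob = 1.0
    for i in range(2, n + 1):
        prob *= 1.0 - q / (i - 1)
    return prob


def absorption_cdf_gamma(n, s):
    """Same quantity as Gamma(n - q) / (Gamma(n) * Gamma(1 - q)), q = exp(-s)."""
    q = math.exp(-s)
    if q >= 1.0:  # s = 0: the maximum is almost surely positive
        return 0.0
    return math.exp(math.lgamma(n - q) - math.lgamma(n) - math.lgamma(1.0 - q))


if __name__ == "__main__":
    for s in (0.5, 1.0, 2.0, 4.0):
        print(s, absorption_cdf_product(50, s), absorption_cdf_gamma(50, s))
```

The log-Gamma version avoids a loop over all nodes, which matters when the distribution function is needed for very large $n$.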

Moments of the Site Frequency Spectrum
By a simple adaptation of our previous notation for branch lengths in the infinite coalescent ($C_I$ and $\ell_I$), in the finite case we define for $1 \le b \le n-1$ the process $(C_{n,b}(t))_{t\ge0}$ and the random variables $\ell_{n,b}$, where $C_{n,b}(t)$ is the number of blocks of size $b$ in $\Pi_n(t)$, and $\ell_{n,b} := \int_0^\infty C_{n,b}(t)\,dt$. We now provide explicit expressions for $\mathbb{E}[\ell_{n,b}]$ and $\mathbb{E}[\ell_{n,b_1}\ell_{n,b_2}]$; for this we define the functions
As before, we may define $SFS_{n,b}$ as the number of mutations shared by $b$ individuals in the $n$-coalescent. By conditioning on the value of the associated branch lengths we get
We also characterize the asymptotic behavior of the functions $L_1$, $L_2$ and $L_3$ as $n \to \infty$, which in turn gives asymptotic approximations for the first and second moments of the branch lengths and of the SFS. For this we recall the function $f_1$ defined in (2) and also define for $0 < u_1 < u_2 < 1$, $dp_2\,dp_1$.
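The branch lengths $\ell_{n,b}$ can also be simulated directly from the RRT construction, which gives a simple way to sanity-check moment formulas by Monte Carlo. The following is a rough sketch rather than an efficient simulator: the function name `branch_lengths` is hypothetical, and the partition is recomputed from scratch on every interval between consecutive edge marks.

```python
import random


def branch_lengths(n, rng):
    """Simulate one n-coalescent (n >= 2) from a random recursive tree.

    Returns ({b: l_{n,b} for b = 1..n-1}, time to absorption).
    """
    parent = {v: rng.randint(1, v - 1) for v in range(2, n + 1)}
    edge = {v: rng.expovariate(1.0) for v in range(2, n + 1)}

    def block_sizes(t):
        # Edges with mark <= t are cut, so the result is valid on [t, next mark).
        rep, size = {1: 1}, {1: 1}
        for v in range(2, n + 1):
            u = parent[v]
            rep[v] = rep[u] if (rep[u] != u or edge[v] <= t) else v
            size[rep[v]] = size.get(rep[v], 0) + 1
        return list(size.values())

    lengths = {b: 0.0 for b in range(1, n)}
    prev = 0.0
    for t in sorted(edge.values()):
        sizes = block_sizes(prev)
        if len(sizes) == 1:  # already absorbed at time prev
            break
        for b in sizes:
            lengths[b] += t - prev  # each block of size b contributes t - prev
        prev = t
    return lengths, prev


if __name__ == "__main__":
    rng = random.Random(0)
    reps = 5000
    mean_external = sum(branch_lengths(10, rng)[0][1] for _ in range(reps)) / reps
    print("Monte Carlo estimate of E[l_{10,1}]:", mean_external)
```

Since block sizes always add up to $n$ before absorption, each run satisfies $\sum_b b\,\ell_{n,b} = n A_n$, where $A_n$ is the time to absorption; this identity is a convenient check on any implementation. Under the infinite sites model, a sample of $SFS_{n,b}$ is then obtained by drawing a Poisson variable with mean $\theta\,\ell_{n,b}$.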
Remark. The above lemma does not cover the cases $b_1 = 1$ or $b_1 = b_2$ for $L_2$, nor the cases $b_1 = 1$, $b_2 = 1$, $n = b_1 + b_2$ or $b_1 \vee (n - b_2) \to \infty$ for $L_3$. However, using the same techniques we also obtain asymptotics in these cases, which are used in Theorem 3.5 below.
The proof of the above lemma also gives asymptotic expressions for the functions f 1 , f 2 and f 3 , leading to straightforward asymptotics for the expectation and covariance of SF S. The complete picture for the first moment is given in the next result.
as it is defined in (7).
Case (i) and case (ii) for fixed $b$ also follow from Theorem 4 in [2]. Cases (ii) and (iv) give an update to the approximation of the SFS for small and large families made in [13].
Figure caption (fragment): The gray line is the asymptotic approximation given in Theorem 3.4 (iii). The red (resp. yellow) line is given by Theorem 3.4 (ii) (resp. (iv)).
In the same spirit and using the same techniques we now provide the complete picture for the second moments. In what follows we use the notation $f(n) \sim g(n)$ to denote that $f(n)/g(n) \to 1$ as $n \to \infty$.
Theorem 3.5. The covariance function has the following asymptotics as $n$ goes to infinity, in each of the following cases:
These approximations follow from the asymptotics for $L_1$, $L_2$, and $L_3$ substituted in the covariance formula given in Corollary 3.2. For the sake of simplicity we do not provide the explicit computations. We only treat the case where the expected value $\mathbb{E}[SFS_{n,b}]$ diverges; then an application of Chebyshev's inequality allows us to prove the following weak law of large numbers with $L_2$-convergence, which generalizes and strengthens results on the Bolthausen-Sznitman coalescent derived in [2].
Corollary 3.6. Suppose that $b/n \to 0$ in such a way that $\mathbb{E}[SFS_{n,b}] \to \infty$, or equivalently that $b = o(\sqrt{n}/\log n)$. Then we have the following
In view of Theorem 3.4 this means that for $b = 1$, $\frac{\log n}{n}\,SFS_{n,1} \to \theta$,

Distribution of the Family-Sized Branch Lengths
In this section we discuss the particular case of ℓ n,b when b > n/2. In this case we are able to provide an explicit formula for the distribution function of the length of the coalescent of order b. This leads to convergence in law results, but also to the law of SF S n,b . Observe that in this case, for all t ≥ 0, C n,b (t) ∈ {0, 1} and ℓ n,b is just the time during which the block of size b survives before coalescing with other blocks (if it ever exists, otherwise obviously ℓ n,b = 0). We first find an expression for the distribution function of ℓ n,b .
Theorem 4.1. Suppose that $n/2 < b < n$. For any $s \ge 0$,
From the derived distribution of $\ell_{n,b}$ in Theorem 4.1 we obtain that, conditioned on $\ell_{n,b} > 0$, the variable $(\log n)\,\ell_{n,b}$ has a limiting distribution. Furthermore,
We now give the joint distribution of the branch lengths for large families, i.e. the joint distribution of the vector $(\ell_{n,b})_{b>n/2}$. For this we introduce the following events: for any collection of integers $b = (b_1, \dots, b_m)$ such that $n/2 < b_1 < b_2 < \dots < b_m < n$, and any collection of nonnegative numbers $s = (s_1, \dots, s_m)$, define the event that is, the event that a block of size $b_1$ exists for a time larger than $s_1$, that this block then merges with some other blocks of total size exactly $b_2 - b_1$, that this new block exists for a time larger than $s_2$, and so on, until the last merge of the growing block occurs with the remaining blocks of total size exactly $n - b_m$.
Theorem 4.3. For $b = (b_1, \dots, b_m)$ and $s = (s_1, \dots, s_m)$ as above, we have (25) where $(m : 1) := (m, m-1, \dots, 1)$ and $\langle\cdot,\cdot\rangle$ is the usual inner product in Euclidean space.
By conditioning on (ℓ n,b ) b>n/2 and using equation (26) one can obtain a sampling formula for the vector (SF S n,b ) b>n/2 , although the computations are rather convoluted and we do not present them here.

The approximations
Here we derive the approximations given in the Introduction. From Stirling's approximation we have the well-known formula $\Gamma(m + c)/\Gamma(m) \approx m^c$. Its application requires some care, since we shall apply this approximation also for small values of $m$, down to $m = 1$. It is known, and easily confirmed by computer, that the approximation is particularly accurate within the range $0 \le c \le 1$. Thus we use for $p \in (0, 1)$ and $b \ge 2$ the approximations Also, by Euler's reflection formula, $\Gamma(1-p)\Gamma(1+p) = \pi p/\sin(\pi p)$. Inserting these formulas into the expression (15) for the expected SFS we obtain It turns out that this approximation overestimates the expected SFS, which can be somewhat counterbalanced by replacing the scaling factor $n/(n-1)^2$ by $1/(n-1)$. This yields our first approximation (1). For the second approximation (3) we apply the expansion see [18]. Again this approximation is particularly accurate for $0 \le c \le 1$, leading for $p \in (0, 1)$ and $b \ge 2$ to
Using this approximation in the expression for the expected SFS we get for $b \ge 2$ with the function $g_1$ as defined in (4). This integral can be evaluated by elementary means, yielding formula (3).
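The claim that $\Gamma(m+c)/\Gamma(m) \approx m^c$ is accurate for $0 \le c \le 1$, even for small $m$, is indeed easy to confirm by computer. The sketch below measures the relative error on a grid of values of $c$; the function names and the grid size are illustrative choices.

```python
import math


def gamma_ratio_rel_error(m, c):
    """Relative error of the approximation Gamma(m + c) / Gamma(m) ~ m**c."""
    exact = math.exp(math.lgamma(m + c) - math.lgamma(m))
    return abs(m ** c - exact) / exact


def worst_error(m, grid=200):
    """Largest relative error over c = 0, 1/grid, ..., 1."""
    return max(gamma_ratio_rel_error(m, k / grid) for k in range(grid + 1))


if __name__ == "__main__":
    for m in (1, 2, 5, 10, 100):
        print(m, worst_error(m))
```

The approximation is exact at $c = 0$ and $c = 1$, and the worst case inside the interval shrinks as $m$ grows, which is consistent with the larger relative error reported for $b = 2$ in the Introduction.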

Proofs of Section 3
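As in the infinite coalescent case, the proof of Theorem 3.1 begins with the definition (14) and by noting that $\mathbb{E}[\ell_{n,b}] = \int_0^\infty \mathbb{E}[C_{n,b}(t)]\,dt$ and, by exchangeability, that $\mathbb{E}[C_{n,b}(t)] = \binom{n}{b}\,\mathbb{P}(\{1, \dots, b\} \in \Pi_n(t))$.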
Thus, using (8), the fact that $|\Pi_\infty(t)|^{\downarrow} =: (A_1, A_2, \dots) \overset{d}{=} PD(e^{-t}, 0)$, and writing $\Pi_n$ as $\Pi_{\infty|n}$, we obtain Finally, by changing the variable $p = e^{-t}$, we obtain (15).
Now we use the random tree construction of the $n$-Bolthausen-Sznitman coalescent in order to compute the second moments of $\ell_{n,b}$.
Proof of Theorem 3.1 (second moments). Let $1 \le b_1 \le b_2 \le n-1$, and let $\mathcal{B}_1$, $\mathcal{B}_2$ be the collections of all possible blocks of sizes $b_1$ and $b_2$ respectively in a partition of $[n]$. Then We now compute $\mathbb{P}(B_1 \in \Pi_n(t_1), B_2 \in \Pi_n(t_2))$ by cases.
where this probability is of course 0 if $b_1 + b_2 > n$. Now suppose that $t_1 \le t_2$. In terms of the RRT construction of the Bolthausen-Sznitman coalescent, the event is characterized by a RRT with exponential edges, say $E_2, \dots, E_n$, constructed as follows. For $i \in \{1, \dots, b_1 - 1\}$ the node $\{i+1\}$ arrives to the tree along with $E_{i+1}$, with the imposed restriction that it may not attach to $\{1\}$ and have $E_{i+1} > t_1$ at the same time; the excluded event has probability $e^{-t_1}/i$. This ensures that $\{i+1\}$ coalesces with $\{1\}$ before time $t_1$ for all $i < b_1$, thus creating the block $\{1, \dots, b_1\}$ by time $t_1$. After $\{1\}, \dots, \{b_1\}$ have arrived, the node $\{b_1+1\}$ must attach to $\{1\}$ and $E_{b_1+1}$ must be greater than $t_2$, which occurs with probability $e^{-t_2}/b_1$; the node $\{b_1+1\}$ will be the root of a sub-tree formed with the nodes $\{b_1+2\}, \dots, \{b_1+b_2\}$, which will build the block $\{b_1+1, \dots, b_1+b_2\}$ at time $t_2$. Thus, for each $i \in \{1, \dots, b_2-1\}$ the node $\{b_1+i+1\}$ must arrive and attach to any of $\{b_1+1\}, \dots, \{b_1+i\}$, which occurs with probability $i/(b_1+i)$, and, furthermore, conditional on this event, it may not attach to $\{b_1+1\}$ and have $E_{b_1+i+1} > t_2$ at the same time; the excluded event has probability $e^{-t_2}/i$. Finally, if $n - b_1 - b_2 > 0$, for $i \in \{0, \dots, n-b_1-b_2-1\}$ the node $\{b_1+b_2+i+1\}$ must either attach to any of $\{b_1+b_2+j\}$, $1 \le j \le i$, or attach to $\{1\}$ or $\{b_1+1\}$ and have $E_{b_1+b_2+i+1} > t_1$ or $E_{b_1+b_2+i+1} > t_2$ respectively; this occurs with probability $(e^{-t_1}+e^{-t_2}+i)/(b_1+b_2+i)$. Putting all together we obtain where the last product is set to 1 if $n - b_2 - b_1 = 0$.
On the other hand, if $t_2 < t_1$, by exchangeability we may instead compute Assuming that $t_1 \le t_2$ and using the same rationale as before we obtain where the product in the middle is set to
From the previous computations, and summing over the corresponding cases, we see that if $b_1 + b_2 \le n$ then, changing the variable $p = e^{-t}$, the integral in (27) is given by $\Gamma(p_2+1)$ $dp_2\,dp_1$ $p_1$ whereas if $b_1 + b_2 > n$ the first summand in the above expression is set to zero. Rearranging terms we obtain (16).
Thus, using Euler's reflection formula to write $\Gamma(1-p)\Gamma(1+p)$ as $\pi p/\sin(\pi p)$ in the definition of $L_1$, we get Thus, for every $\epsilon > 0$ there is a $b_0 \in \mathbb{N}$ such that for large enough $n \in \mathbb{N}$ we have It remains to study the approximation as $n \to \infty$ in the cases where $n - b$ or $b$ remains constant. In the first case, when $n - b = c$, we have $b \to \infty$ as $n \to \infty$ and, by Stirling's approximation and dominated convergence, substituting $p = y/\log b$, on the one hand and on the other hand, because $b \to \infty$, Similarly for the second case, if $b \ge 2$ is fixed, we have $n - b \to \infty$ as $n \to \infty$. Thus, with $1 - p = y/\log n$, Thus $L_1(n, b) \sim (b-1)\,n^{-2} f_1((b-1)/(n-1))/b$, which extends (28) to $b < b_0$. Thus we proved (19).
For the proof of (20), we substitute 1 for $b$ and perform similar computations, which proves (20).
Proof of Lemma 3.3 (asymptotics for $L_2$ and $L_3$). The arguments here are similar to those in the proof of the asymptotics for $L_1$, so we avoid repeating similar and tedious computations and only lay out the first steps of the proof. By Stirling's approximation applied to the integrands appearing in $L_2$ and $L_3$, we obtain, for $b_2 - b_1 > 0$, Similar to the analysis in the proof of (19), to obtain (21) it remains to study the cases where at least one of $b_1$, $b_2 - b_1$, or $n - b_2$ remains constant, whereas for (22) the cases of interest are where one of $b_1$, $b_2$,
Proof of Theorem 3.4. We first derive the asymptotic behavior of the function $f_1$. We have For the proof note that for $u < 1/2$ we have $(1-u)^{p-1} \le 2$. Therefore dominated convergence implies for $u \downarrow 0$ which we obtain again by means of dominated convergence in the limit $u \uparrow 1$ as follows: Also, by another application of Stirling's approximation and for $b = o(n)$ Combining (33) and (34) with Theorem 3.4 (i) and (ii) and letting $\eta \to 2$ we obtain Using this estimate together with (32) and with Theorem 3.1, Corollary 3.2 yields Because of our assumption $\mathbb{E}[SFS_{n,b}] \to \infty$ our claim is proved.

Proofs of Section 4
Proof of Theorem 4.1. Note that since $b > n/2$, and by the exchangeability of $\Pi_n$, we have: where $L$ is the Lebesgue measure, and $L(\{t : \{1, \dots, b\} \in \Pi_n(t)\})$ gives the time that the block $\{1, \dots, b\}$ exists in the Bolthausen-Sznitman coalescent starting with $n$ individuals.
We now describe the event $\{L(\{t : \{1, \dots, b\} \in \Pi_n(t)\}) > s\}$ in terms of the RRT construction of the Bolthausen-Sznitman coalescent. Let $G$ be the event that the nodes $\{1\}, \{2\}, \dots, \{b\}$ and $\{1\}, \{b+1\}, \dots, \{n\}$ form two sub-trees, say $T_1$ and $T_2$, rooted at $\{1\}$; i.e. Indeed, observe that by the cutting-merge procedure $T \notin G$ if and only if any block of $\Pi_n$ that contains all of $\{1, \dots, b\}$ also contains some $j \in \{b+1, \dots, n\}$. On the other hand, on the event $\{T \in G\}$, the random variable $M(T_1)$ is just the time at which the block $\{1, \dots, b\}$ appears in $\Pi_n$, while $m(T_2)$ is the time at which it coalesces with some other block in $T_2$. Furthermore, observe that conditioned on $\{T \in G\}$, $T_1$ and $T_2$ are two independent RRTs of sizes $b$ and $n - b + 1$ respectively. Thus, by Lemma 2.
Proof of Corollary 4.2. Observe that, uniformly for $p \in (0, 1)$, we have thus, substituting in (24) and also using Stirling's approximation and Euler's reflection formula, we obtain $\mathbb{P}(\ell_{n,b} > 0) \sim$
Proof of Theorem 4.3. Letting $\ell_\pi := L(\{t : \pi \in \Pi_n(t)\})$ for any subset $\pi \subset [n]$, by exchangeability of $\Pi_n(t)$ we have Recall that $M(T_{b_1})$ is defined as the maximum of the exponential edges associated to the root of $T_{b_1}$. Letting $b_{m+1} := n$, and also letting $E_b$, $1 \le b \le n$, be the exponential variable associated to $b$, we have where the product above is the probability that $T$ is structured in such a way that $\{b_1+1\}$ attaches to $\{1\}$ and is the root of a subtree formed with $\{b_1+1, \dots, b_2\}$, that $\{b_2+1\}$ attaches to $\{1\}$ and is the root of a subtree formed with $\{b_2+1, \dots, b_3\}$, and so forth. Using the independence of the exponential variables we obtain and, recursively, $\{\ell_{n,b+i} = 0\}$.
Acknowledgements. Alejandro and Arno would like to thank Geronimo Uribe Bravo for his fruitful suggestions on the second moment method.