Topological linkage disequilibrium calculated from coalescent genealogies

Johannes Wirtz, Martina Rauscher, Thomas Wiehe
doi: https://doi.org/10.1101/286393
Johannes Wirtz
Institut für Genetik, Universität zu Köln
Martina Rauscher
Institut für Genetik, Universität zu Köln
Thomas Wiehe
Institut für Genetik, Universität zu Köln
  • For correspondence: twiehe@uni-koeln.de

Abstract

We revisit the classical concept of two-locus linkage disequilibrium (LD) and introduce a novel way of looking at haplotypes. In contrast to defining haplotypes as allele combinations at two marker loci, we concentrate on the clustering of sampled chromosomes induced by their coalescent genealogy. The root of a binary coalescent tree defines two clusters of chromosomes. At two different loci this assignment may differ as a result of recombination. We show that the number of chromosomes shared among clusters at two different loci, measured by the squared correlation, constitutes a natural measure of LD. We call this topological LD (tLD) since it is induced by the topology of the coalescent tree. We find that it decays more slowly with distance between loci than conventional LD. Furthermore, tLD has a smaller coefficient of variation, which should render it more accurate for any kind of mapping purposes than conventional LD. We conclude with a practical application to the LCT region in human populations.

1. Introduction

Maintenance and decay of haplotype structure and allelic association are of prominent interest when studying the evolutionary dynamics of recombining chromosomes. One way of quantifying this association is the well-known concept of linkage disequilibrium (LD), reviewed in several articles, for instance [1, 2, 3]. LD measures the degree of statistical dependence among alleles at two or more loci and is commonly defined via a combination of haplotype frequencies. In the simplest non-trivial case, two alleles at each of two loci can be distinguished. In the framework of the polarized infinite sites model [4], two allele classes, derived and ancestral, can be distinguished for each polymorphic site. The derived alleles can be traced back to a single mutation event, marking the origin of the derived class at this locus. This event may lie on any branch of the underlying coalescent tree [5] and divides the leaves into two classes: those belonging to the subtree rooted by the mutation, and those belonging to the complement. In this work, we use the most recent common ancestor, i.e. the root of the entire coalescent tree, rather than a single mutation event, as the class marker; it divides the leaves into a ‘left’ (from the root) and a ‘right’ class. Along a chromosome, tree topology may change as a consequence of recombination and thereby re-shuffle leaves among the two classes. This gives rise to an extended definition of linkage disequilibrium: considering two coalescent trees along a recombining chromosome, we call the correlation between left and right class members topological linkage disequilibrium, tLD for short. While tLD is a theoretically simple concept, coalescent topology is in practice generally unknown and has to be estimated from polymorphism data. Still, tLD has a number of interesting properties.
For instance, tLD decays more slowly with distance than conventional LD, because only a subset of recombination events affects tree topology at the root. This is of practical interest when measuring long-range LD and searching for possible interactions between chromosomally distant loci. To estimate tree topology, several SNPs, on the order of ten or more, are required. However, this joint consideration of SNPs, rather than calculating LD for all SNP pairs, introduces a smoothing effect compared to conventional LD. In fact, the coefficient of variation (CoV) of tLD is much smaller than that of LD, especially for short distances between loci. For larger distances, the CoV of tLD is almost constant and only slightly larger than 1, while the CoV of conventional LD is around 3, indicating a much higher dispersion relative to tLD (Figure 6).

2. Materials and Methods

2.1 Linkage Disequilibrium

Consider a population of constant size 2N chromosomes and let α, β be two loci with alleles a, A and b, B with allele frequencies p(a), p(A), p(b) and p(B). Let the four haplotypes ab, aB, Ab and AB have frequencies p(ab) = x1, p(aB) = x2, p(Ab) = x3 and p(AB) = x4. Two-locus linkage disequilibrium is Dα,β = x1 x4 − x2 x3.

This can also be written as Dα,β = x1 − (x1 + x2)(x1 + x3) = p(ab) − p(a) p(b). A configuration of x1, …, x4 such that Dα,β = 0 is called linkage equilibrium. In this case, all haplotype frequencies are identical to the product of the involved allele frequencies. In practice, allele and haplotype frequencies in the population must usually be estimated from sample frequencies. In what follows, we do not distinguish between samples and populations and view all frequencies as population frequencies.

Since LD depends on allele frequencies, several standardizations have been introduced. One is Pearson’s correlation coefficient of the allelic association, rα,β = Dα,β / √(p(a) p(A) p(b) p(B)), which can take values between −1 and 1, where the sign depends on the labeling of alleles. Furthermore, Embedded Image and Embedded Image is the χ2-value of the allelic association. An alternative, frequently used, standardization is D′α,β = Dα,β / Dmax, known as “Lewontin’s D′” [2], where Dmax is the largest absolute value of Dα,β compatible with the given allele frequencies. All three quantities, rα,β, r²α,β and D′α,β, have been used to estimate recombination rates, divergence times between species, or to statistically test the neutral evolution hypothesis [6, 7, 8]. Here, we focus on r²α,β.
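As a small illustration of these quantities, the following Python sketch computes D from the four haplotype frequencies, together with r² and the absolute value of Lewontin's D′. The function name `ld_stats` is hypothetical, and the formulas used for r² and D′ are the standard textbook definitions, assumed here rather than taken from this manuscript.

```python
def ld_stats(x1, x2, x3, x4):
    """Return (D, r2, |D'|) for haplotype frequencies of ab, aB, Ab, AB.

    Assumes the four frequencies sum to 1 and both loci are polymorphic.
    """
    pa, pb = x1 + x2, x1 + x3            # allele frequencies p(a), p(b)
    pA, pB = 1.0 - pa, 1.0 - pb          # p(A), p(B)
    D = x1 * x4 - x2 * x3                # two-locus linkage disequilibrium
    r2 = D * D / (pa * pA * pb * pB)     # squared correlation
    # Standard choice of D_max, depending on the sign of D (assumption).
    if D >= 0:
        d_max = min(pa * pB, pA * pb)
    else:
        d_max = min(pa * pb, pA * pB)
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    return D, r2, d_prime


if __name__ == "__main__":
    print(ld_stats(0.4, 0.1, 0.2, 0.3))
```

With only haplotypes ab and AB present at equal frequency, the sketch returns r² = 1 and |D′| = 1, and it returns D = 0 when all four haplotypes have frequency 1/4.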

Let c be the recombination probability per generation per chromosome between loci α and β. The random variable Embedded Image changes due to the action of drift and recombination.

We consider first its expectation with respect to c. When α = β, we have x1 = p(a) = p(b) and x4 = p(A) = p(B), and therefore r²α,β = 1.

A fortiori, two identical loci do not recombine, i.e. c = 0. However, this may still hold also if α ≠ β, for instance in non-recombining chromosomes. In this case LD may take any value Embedded Image and, on average, remains at this value, regardless of time and initial allele frequency distribution, i.e. Embedded Image

Let xi (D) denote haplotype frequency (LD) in the current generation and Embedded Image denote haplotype frequency (LD) in the next generation. We have Embedded Image where η1 = η4 = -1 and η2 = η3 = +1, and therefore Embedded Image expected LD decreases at rate c per generation for any value c > 0. Since, on average, allele frequencies do not change across generations, we have Embedded Image where the last approximation holds for small c. As c increases Embedded Image decreases. If alleles from both loci are assembled into haplotypes completely independently from the previous generation (i.e., c = 1, as the theoretical maximum), the expected value in this limiting case is Embedded Image as originally derived by Haldane [9]. We state

Lemma 1.

Consider a sample of n ∈ ℕ chromosomes with two loci α, β and with fixed allele frequencies f(a) = s, f(A) = n − s at α and f(b) = u, f(B) = n − u at β. Assume that chromosomes with allele a and allele b are chosen uniformly and independently, and that the assignment of a-alleles at α is independent from b-alleles at β. Then,

  (a) 𝔼(r²) = 1/(n − 1).

  (b) This also holds if the allele frequencies s and u are randomly chosen according to discrete distributions pα, pβ on {1, …, n − 1}.

  (c) In the case of fixed allele frequencies (as in (a)) one has Embedded Image where κ is the fourth standardized moment (kurtosis) of a hypergeometric random variable.

Proof. In the scenario described in Lemma 1 (a), the number of individuals of type (a, b) is distributed as a hypergeometric H(n, s, u) random variable X. Thus, the expected squared correlation can be expressed by the formula Embedded Image

The denominator of the term Embedded Image is independent of k and may be extracted from the summation. The remainder of the summation can then be written as Embedded Image such that the initial equation can be transformed to Embedded Image

To prove Lemma 1 (b), it suffices to note that the same expectation with respect to the distributions of s and u can be expressed by Embedded Image

Since both distributions sum to 1, the result remains unchanged. Finally, we obtain Lemma 1 (c) by first writing Embedded Image

Then, Embedded Image which we can identify as Embedded Image where X is the hypergeometric random variable mentioned above, Embedded Image is its expectation and Embedded Image its variance. Therefore, the last expression is Embedded Image, and κX is the kurtosis of X.
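The statement of Lemma 1 (a) can be checked numerically. The sketch below sums the squared correlation over the hypergeometric distribution of the number X of (a, b) chromosomes, using exact rational arithmetic. The function name `expected_r2` is hypothetical; the expression for r² in terms of n, s, u and k is the usual squared correlation of two binary indicators, written out here as an assumption.

```python
from fractions import Fraction
from math import comb


def expected_r2(n, s, u):
    """Exact E(r^2) when X ~ Hypergeometric(n, s, u) counts (a, b) chromosomes.

    n: sample size, s: number of a-alleles, u: number of b-alleles,
    with 1 <= s, u <= n - 1.
    """
    denom = Fraction(s * (n - s) * u * (n - u))
    total = Fraction(0)
    for k in range(max(0, s + u - n), min(s, u) + 1):
        # P(X = k) for the hypergeometric distribution
        prob = Fraction(comb(s, k) * comb(n - s, u - k), comb(n, u))
        # r^2 = (n k - s u)^2 / (s (n - s) u (n - u))
        total += prob * Fraction((n * k - s * u) ** 2) / denom
    return total


if __name__ == "__main__":
    print(expected_r2(12, 5, 4))
```

For any admissible s and u the result is 1/(n − 1), which follows from the variance of the hypergeometric distribution, s u (n − s)(n − u) / (n²(n − 1)).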

Determining Embedded Image with respect to c for the non-limiting case is nontrivial. Sved [3] proposed an approximate solution by relating Embedded Image to the conditional probability of linked identity by descent, denoted by a parameter Q. He obtained the approximation Embedded Image where the last approximation holds for c << 1 (see also [10]).

Despite the later attempts (e.g. [11]) to improve this formula, the unsatisfactory discrepancy between eqs (6) and (7) still persists.

2.2. Topological Linkage

A slightly modified concept of linkage disequilibrium can be defined very intuitively in the framework of coalescent theory. Consider a sample of n recombining chromosomes of unit length and with recombination rate ρ = 2Nc > 0 between the ends of the chromosome. The genealogical history of such a sample can be modelled by the ancestral recombination graph (ARG) An [12]. The projection of the ARG on any point γ ∈ [0, 1] yields a Kingman coalescent Gγ. Trees at different positions γ and γ’ may be different due to changes of tree topology by recombination. In fact, (Gγ)γ∈[0,1] can be viewed as a (non-Markovian) stochastic process on the set of Kingman coalescent genealogies with state changes caused by recombination events. The genealogy Gγ persists for some nonempty interval [a, b) containing γ, with expected length Embedded Image. We call S = [a, b) a segment of the chromosome and denote its genealogy by GS.

As a Markovian approximation of the ARG, the so-called Sequential Markov Coalescent (SMC) was introduced in [13]. In this construction, An, the unified history of coalescence and recombination events of the sample, is omitted. Instead, genealogies change along the chromosome by uniformly choosing a branch of the current genealogy, removing the subtree below and re-attaching it somewhere else in the tree. This prune-regraft operation will be described in more detail below.

The SMC-construction is considered a reasonably accurate approximation of the ARG process, with the advantage that computation is much easier. Under the SMC-construction (Gγ)γ∈[0,1] becomes a continuous-time Markov chain, where the distance between changes in Gγ is the average segment length Embedded Image, as in the ARG. For the remainder of this manuscript we assume that (Gγ)γ∈[0,1] is a realization of the SMC.

Any genealogy GS extracted from (Gγ)γ∈[0,1] and valid on some segment S naturally provides a separation of the sample into two disjoint sets: Let S1 contain all sample members found on the “left-hand” side, and S2 all those on the “right-hand” side of the root node of GS. Note that “left” and “right” can also be interpreted as two different alleles, originating by a point mutation along one of the root branches. Moving from segment S to another segment U along the sequence (Gγ)γ∈[0,1], the genealogical tree GU may differ from tree GS as a result of recombination. As a consequence, also the left (U1) and right (U2) descendants below the root of GU may differ from S1 and S2. Correlation can be measured in a way analogous to conventional linkage disequilibrium: Let Embedded Image and Embedded Image

Then we call Embedded Image

topological linkage disequilibrium of the segments S and U, in short tLD.

The choices of “left” and “right” are arbitrary, as is the naming of alleles, and do not affect the value of tLD. Note that tLD equals 1 if and only if S1 = U1 or S1 = U2, where segments S and U may be different.

Such a configuration is called complete linkage.
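To make the definition concrete, the following Python sketch computes tLD from the left clusters of two segments. It assumes that tLD is the squared Pearson correlation between the two left/right membership indicators, in direct analogy to r² for alleles; the function name `tld` and the representation of clusters as Python sets are hypothetical choices for this example.

```python
def tld(s1, u1, n):
    """Squared correlation between left/right groupings at two segments.

    s1, u1: iterables with the sample members on the left side of the root
            at segments S and U (both must be proper, nonempty subsets).
    n:      total sample size.
    """
    s1, u1 = set(s1), set(u1)
    k = len(s1 & u1)                 # members on the left in both trees
    s, u = len(s1), len(u1)          # left cluster sizes |S1|, |U1|
    return (n * k - s * u) ** 2 / (s * (n - s) * u * (n - u))


if __name__ == "__main__":
    left_S = {0, 1, 2, 3, 4}
    left_U = {0, 1, 2, 5}
    print(tld(left_S, left_U, 12))
```

Swapping the labels at one segment, i.e. passing the complement of `u1`, leaves the value unchanged, and identical or complementary groupings give the value 1 (complete linkage).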

tLD combines the concepts of LD and coalescent. Although tLD and LD have similar properties, they differ, for instance, in their rate of decay with distance. While any recombination event and point mutation may affect LD, tLD is affected only by topological changes at the root of a coalescent tree. We explore this in more detail now.

3. Theory

3.1. Limit behaviour of tLD in the SMC

Disregard for a moment branch lengths of individual coalescent trees Gγ of the SMC and consider only their branching pattern. Such trees are also called coalescent tree topologies (see [14] and Figure 3). The set 𝒯n of coalescent tree topologies of n chromosomes has size |𝒯n| = n!(n − 1)! 2^(1−n) [15]. Omitting branch lengths, but introducing integer-labellings of internal tree nodes which respect the time-order of these branchings, Kingman’s coalescent induces the uniform distribution on 𝒯n. Note that the sequence of topologies, (Tγ)γ∈[0,1] with Tγ ∈ 𝒯n for γ ∈ [0, 1] that is induced by the SMC, is a finite-state Markov chain.
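The size of 𝒯n can be recovered by a short counting argument: going backward in time, a step with k lineages can merge any of k(k − 1)/2 pairs, and the product of these counts over k = 2, …, n equals n!(n − 1)!/2^(n−1). The Python sketch below compares both expressions; the function names are hypothetical.

```python
from math import comb, factorial


def count_by_formula(n):
    """Number of coalescent tree topologies, n!(n-1)!/2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def count_by_pair_choices(n):
    """Same count as a product over coalescent steps with k = n, ..., 2 lineages."""
    total = 1
    for k in range(2, n + 1):
        total *= comb(k, 2)          # number of pairs that can coalesce
    return total


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, count_by_formula(n), count_by_pair_choices(n))
```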

Figure 1:

Alignment of 2N haplotypes i1, …, i2N, indicated by vertical lines, with two bi-allelic loci. Frequencies given below the parentheses.

Figure 2:

Example of the topological association of twelve chromosomes 1 − 12 in two random coalescent trees. The above configuration yields an r2-coefficient of 0.04375.

Figure 3:

Left: Example of a Coalescent tree topology (T1). It is a binary tree equipped with an integer labelling representing the order of coalescent events backward in time, and identifiers (A-D) for the chromosomes at the bottom. Middle, Right: A srARG of size 4, represented by its associated topologies T2 and T3. A subtree of T2 is selected for pruning (p) and is regrafted (r) at a layer equal or smaller than where T2 was pruned. Note that the prune-regraft operation affects the internal labelling of T3.

The k-th layer of a coalescent tree topology T is the tree slice in which it has k branches (Figure 3). In particular, the 1-st layer is the (imaginary) slice with only one branch, which extends from the root back into the past. Treating each segment of a branch from the original coalescent genealogy extending between two layers k, k + 1 as a separate branch in the associated coalescent tree topology, each topology of size n has exactly Embedded Image branches. Under the SMC, Tγ switches states at recombination sites γ according to a prune-regraft operation: select a branch at layer k, say, to place a pruning site on the tree; cut the underlying subtree; select a branch at a layer less than k to place a re-grafting site; re-attach the cut subtree at the regrafting site. We call such a transition a single recombination SMC, for short srSMC. Assuming an ‘infinite recombination site’ model, the complete SMC can be viewed as a sequence of srSMCs.

Consider a random srSMC on n chromosomes and extract the topologies T1 and T2 of the two coalescent genealogies G1 and G2. T1 and T2 are still uniformly distributed, but they are not independent. Given a random topology T1, assume a pruning event in layer k at a randomly chosen branch bk. Assume the branch bj for regrafting to be chosen uniformly from the Embedded Image possible branches. A srSMC can be viewed as a triplet (T1, bk, bj). To obtain the conditional distribution P(T2 | T1 = T) one has to average over all possible pruning and re-grafting sites. The probability of pruning in layer k > 1 is given by (see [14]) Embedded Image where Embedded Image is the (n − 1)-st harmonic number, keeping in mind that this probability does depend on the duration of the layer, i.e. on the coalescent branch lengths. The prune-regraft operation describes how states in the Markov chain (Tγ)γ∈[0,1] are changed. In particular, the set of all Aldous moves on a topology Tγ (see [16]) is a subset of the possible transformations Tγ can undergo. It follows that (Tγ)γ∈[0,1] is recurrent and aperiodic. We now consider the leftmost and rightmost chromosomal segments S = [0, γ1] and U = [γ2, 1], γ1 ≤ γ2, along the interval [0, 1]. We have the following result on the limiting behaviour of the squared correlation Embedded Image.

Theorem 1.

Given segments S and U with topologies TS and TU and the topological groupings (S1, S2) and (U1, U2). Then the expected tLD of S and U converges to 1/(n − 1) for ρ → ∞.

Proof. (Tγ)γ∈[0,1] is a Markov chain with stationary distribution Embedded Image for all coalescent tree topologies T. Since it is recurrent, there exists an integer M ∈ℕ (mixing time), depending only on the sample size n, such that for ϵ > 0 Embedded Image

given there are M or more state changes in (Tγ)γ∈[0,1]. By choosing ρ sufficiently large, it is possible to ensure that the probability of M or more changes is close to 1. Therefore, Pr(TU = T|TS) may be brought arbitrarily close to Pr*(T).

Let ρ → ∞ and consider the random variable kρ = |S1 ∩ U1|, the number of individuals that are on the left of both trees TS and TU under the SMC with recombination rate ρ. As ρ → ∞, kρ converges in distribution to a random variable X which, given |S1| and |U1|, is distributed Embedded Image. |S1| and |U1| themselves are uniformly distributed on {1, …, n − 1}.

Therefore, Lemma 1 is applicable to the limiting r.v. X. Convergence in distribution of the kρ implies convergence of Embedded Image to Embedded Image.

Over large distances, we therefore recover the limiting value predicted by Haldane [9], see eq (6). Note that by item (c) of Lemma 1 we can also obtain an exact expression of the variance of tLD in this situation.
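The uniform distribution of the left cluster size used in the proof can be illustrated by simulation. The Python sketch below builds random coalescent topologies by repeatedly merging two lineages chosen uniformly at random, stops when two lineages remain, and records the size of one of the two root subtrees. Labelling one root subtree as "left" uniformly at random, and the function names, are assumptions of this example.

```python
import random
from collections import Counter


def root_left_size(n, rng):
    """Size of the 'left' root subtree in a random coalescent topology (n >= 2)."""
    sizes = [1] * n                          # each lineage starts with one leaf
    while len(sizes) > 2:
        i, j = rng.sample(range(len(sizes)), 2)
        merged = sizes[i] + sizes[j]
        for idx in sorted((i, j), reverse=True):
            sizes.pop(idx)                   # remove the two merged lineages
        sizes.append(merged)
    return rng.choice(sizes)                 # arbitrary choice of the left side


def left_size_frequencies(n, reps, seed=1):
    """Empirical distribution of the left cluster size over reps topologies."""
    rng = random.Random(seed)
    counts = Counter(root_left_size(n, rng) for _ in range(reps))
    return {k: counts[k] / reps for k in range(1, n)}


if __name__ == "__main__":
    print(left_size_frequencies(6, 20000))
```

For n = 6 all five possible sizes should appear with frequency close to 1/5.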

3.2. Decline of tLD with distance

Regarding the expected decline of tLD along the chromosome, we state the following

Lemma 2.

Let T1, T2 be the two topologies resulting from a srSMC. The probability of a topological change, i.e. breaking of complete linkage, between T1 and T2 is asymptotically Embedded Image.

Proof. Recombination events that have an effect on tLD can be subdivided into two groups: events which shift a non-root branch above the root or events which move a branch from the left to the right (or vice versa) root-subtree without changing the root. We call the latter switching events. The probability of root-changing events is Embedded Image [14]. To calculate the probability Pr(switch) assume w.l.o.g. that a branch is moved from left to right. Suppose pruning takes place in layer k. The right side has 1 ≤ r ≤ k − 1 branches, where each number r has probability 1/(k − 1). The probability of selecting a branch on the left for pruning is Embedded Image, and some arithmetic leads to the probability of selecting a branch on the right for regrafting, which is Embedded Image when averaged over all k-sized coalescent tree topologies. This needs to be multiplied by Embedded Image (compare equation 9), and then summed over all levels k. Therefore, we obtain Embedded Image where the factor 2 accounts for the two possibilities, switching from left to right or vice versa. After some simplifications, this can be rewritten as Embedded Image

The series in the numerator converges to (4π² − 33)/9 < 1 and hn ∼ log(n), which concludes the proof.

Knowing the proportion of recombination events contributing to the decay of Embedded Image, we calculate now the expected proportion of chromosomes affected by a switching event between two segments. Denote by LS,U the number of chromosomes whose assignment to either the left or right class has not been affected by switching. Note that the quantity LS,U is similar to Q in [3]. However, we are here interested in a joint probability of identity by descent, and not in a probability conditioned on identity at one of the loci. LS,U = n means perfect linkage and LS,U < n means some switching took place (Figure 4). Lemma 1 is not readily applicable here, because its conditions require that LS,U vanishes.

Figure 4:

A single recombination event moves a branch of the left root-subtree to the right side. In the resulting tree, chromosomes marked by red dots remain in the same left/right-grouping as before recombination took place. Their number is LS,U.

Lemma 3.

𝔼 (LS,U) declines geometrically with rate approximately Embedded Image per recombination event.

Proof. According to Lemma 2 about 1/3 of all recombination events move some individuals from one side of the tree to the other. To determine how many are moved, it suffices to note that recombination events are distributed uniformly over the branches of a given genealogy T in the SMC, such that the size of the subtree Tr below a recombination event is distributed according to the neutral frequency spectrum [17]. Thus, Embedded Image

The expected proportion of chromosomes affected by a recombination event is the expectation of the above distribution divided by n, i.e. Embedded Image, where γ is Euler’s constant. We can combine those two results in a recursion formula for 𝔼 (LS,U). Let U1, U2 be two neighboring segments, then we have Embedded Image

Iterating this formula with initial value Embedded Image shows that Embedded Image with cS,U representing the number of recombination events encountered moving along the chromosome from S to U, which depends only on ρ.
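The geometric decline can be sketched numerically. The Python code below assumes that (i) a fraction of about 1/3 of recombination events are switching events, as stated in the proof, (ii) the subtree size below a recombination event follows the neutral frequency spectrum with probabilities proportional to 1/k for k = 1, …, n − 1, and (iii) the recursion starts at LS,U = n (perfect linkage). The function names and the default `switch_prob` are hypothetical choices for this illustration.

```python
def harmonic(m):
    """m-th harmonic number."""
    return sum(1.0 / k for k in range(1, m + 1))


def expected_subtree_fraction(n):
    """Expected proportion of the sample below a random recombination event."""
    h = harmonic(n - 1)
    # Neutral frequency spectrum: P(size = k) = (1/k) / h_{n-1}, k = 1..n-1
    mean_size = sum(k * (1.0 / k) / h for k in range(1, n))
    return mean_size / n


def expected_unswitched(n, events, switch_prob=1.0 / 3.0):
    """E(L) after 0, 1, ..., events recombination events (geometric decline)."""
    rate = switch_prob * expected_subtree_fraction(n)
    return [n * (1.0 - rate) ** i for i in range(events + 1)]


if __name__ == "__main__":
    print(expected_subtree_fraction(100))
    print(expected_unswitched(100, 5))
```

Under these assumptions the per-event rate is (n − 1)/(3 n h_{n−1}), which for large n is close to 1/(3 log n), the rate quoted in the caption of Figure 9.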

3.3. Numerical approximation of Embedded Image

The above results about the decline of the parameter LS,U suggest an approximation scheme for the expectation of tLD with respect to the number of recombination events separating two segments. LS,U can be written as LS,U = l + m, where l (m, respectively) is the number of LS,U-chromosomes on the left (right) side of both trees. There are p = |S1| − l additional individuals on the left side of TS and q = |U1| − l on the left side of TU. See Figure 5 for an example.

Figure 5:

At segment S, the sample members are grouped by tree topology into left (white) and right (black). Moving from S to U, a number of LS,U = l + m members remains unaffected by recombination, while the rest of the members might switch sides.

To calculate Embedded Image, we need to determine how many of those additional p chromosomes on the left side of TS are also on the left side of TU by chance. We assume that this number is given by a hypergeometric distribution. This is an approximation, because in both ARG and SMC the number of possible left-right configurations is smaller than the number of all hypergeometrically possible configurations. However, this restriction becomes negligible with increasing distance between S and U, i.e. with increasing number of recombination events. Let k denote the number of individuals which end up on the left side of TS and TU by this hypergeometric assignment, such that there are k + l chromosomes in total that are on the left side of both trees.

Under these assumptions, the expected tLD is Embedded Image

It is useful to note that the term Embedded Image is the second moment of a hypergeometric variable with parameters n − LS,U, p, q and with expectation Embedded Image. We calculate therefore Embedded Image and use this to rewrite the original expression as Embedded Image

This expression can be simplified. The middle term of the summation above vanishes because of symmetry; the first summand contains the variance of a Hyp(n − l − m, p, q) random variable divided by some constant terms. Thus, we arrive at Embedded Image

In this form, the contribution that arises from the hypergeometric random assignment, labelled HYP, and the remaining parameter-specific (PSP) terms are separated. This decomposition is useful in two ways. First, under the above assumptions, an upper bound can be obtained for Embedded Image at least if LS,U is small in relation to n (see Lemma 4). Second, by averaging over all configurations it is possible to approximate Embedded Image regardless of the tree topologies at segments S and U. Since the size |S1| of the left side of TS is uniform on {1, …, n − 1}, we start by choosing |S1| randomly according to the uniform distribution. The LS,U-sized portion of individuals not having undergone recombination when going from S to U is then subdivided into l individuals which are on the left side both in TS and TU, and m individuals which are on the right side in both trees by choosing hypergeometrically from the assignment at S, which implicitly determines the parameters p, l and m. The number q of additional individuals on the left side of TU is determined by drawing uniformly from {1, …, n − LS,U}. These calculations are easily performed by computational algebra. Note that the explicit calculation of SNP-based LD is much harder because the sizes of the classes are not uniform. The resulting approximation of the expected tLD with respect to LS,U has to be scaled with respect to the expected decay of LS,U itself. Assuming that recombination events are uniformly distributed across a chromosome, the approximation of Embedded Image can be expressed in terms of physical distance in base pairs, given a constant recombination rate per bp.

3.4. An upper bound to Embedded Image

Lemma 4.

Assume that the parameter LS,U is small in relation to the sample size n. Then, we have:

  1. The expectation of the parameter-specific contribution (“PSP”) in equation 11 is of order Embedded Image.

  2. The proposed approximation of Embedded Image is bounded from above by Embedded Image, where i denotes the number of recombination events separating S and U and cn is some constant.

Proof. Under the assumption of LS,U < n, it is possible to write Embedded Image. Furthermore, l · (n - p - q - l) ∈ [-l · n, l · n]. This allows us to rewrite the numerator of the parameter-specific term in the following way: Embedded Image

The last statement is true because l/n ≤LS,U /n.

Under the assumptions from section 3.3, and if LS,U is small in comparison to n, then the sizes |S1| and |U1| of the left sides of the genealogies are approximately independent and uniformly distributed on {1, …, n –1}. Thus, the expectation of the denominator in Embedded Image converges to Embedded Image as LS,U /n becomes small. This term is of order 𝒪(log(n)2), allowing us to conclude that Embedded Image establishing claim (a).

To show (b), we recall Hölder’s inequality Embedded Image for a nonnegative random variable X. Let Embedded Image, assuming exactly i recombination events between S and U. The maximal value of this random variable is 1 (no decline at all). The expectation of X given the number of recombination events i between S and U has been calculated in section 3.2 as Embedded Image. Thus, Embedded Image, and therefore Embedded Image with cn ∈ ℝ depending on n and of maximally (squared) logarithmic growth. The expectation of the hypergeometric contribution is 0 for LS,U = n and converges to Embedded Image from below for LS,U /n → 0, finishing the proof.

The above calculations hold in the setting assumed in 3.3 and are an approximation of the true SMC. Furthermore, since the term Embedded Image is not bounded from above, this upper bound is only of relevance for large n.

4. Simulations and Application

We use the program ms [18] to generate samples of n chromosomes with recombination rate r and mutation rate θ. Computing tLD requires two steps. First, select two disjoint loci (sub-intervals of [0, 1]). Second, determine the top-most left and right root sub-tree clusters for both loci. We consider two possibilities: (a) use the true tree structure provided by the coalescent simulations (option “-T” selected) or (b) estimate the clusters from SNP data. In the latter case, this is achieved by a two-means clustering approach: We first determine the two most diverged haplotypes (‘antipodes’) and then assign the remaining haplotypes to either of the antipodes based on minimal Hamming distance. The estimated clusters agree well with the true (simulated) clusters if sufficient SNPs are available for estimation. A minimum number of 10 SNPs gives good results (M. Rauscher, unpubl. data; see Figure 7). This is confirmed by the excellent agreement of the summary statistics (average and variance) of tLD determined from actual and estimated clusters (see Figure 6) and by the very good agreement of individual LD estimates (see Figure 7). Hence, true and estimated values of tLD agree in general. Importantly, this observation does not depend much on distance. However, the correlations depend heavily on the relation between mutation and recombination rates.
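The two-means clustering step can be sketched as follows. The Python code below is a minimal illustration and not the implementation used for the results reported here: haplotypes are assumed to be given as equal-length 0/1 strings for one locus, the antipodes are found by brute force over all pairs, and ties (several maximally distant pairs, or a haplotype equally distant from both antipodes) are resolved in favour of the first candidate. All function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def two_means_antipodes(haplotypes):
    """Split haplotype indices into two clusters around the two antipodes."""
    # Step 1: the two most diverged haplotypes ('antipodes').
    i, j = max(
        combinations(range(len(haplotypes)), 2),
        key=lambda p: hamming(haplotypes[p[0]], haplotypes[p[1]]),
    )
    # Step 2: assign every haplotype to the closer antipode.
    left, right = set(), set()
    for idx, hap in enumerate(haplotypes):
        if hamming(hap, haplotypes[i]) <= hamming(hap, haplotypes[j]):
            left.add(idx)
        else:
            right.add(idx)
    return left, right


if __name__ == "__main__":
    haps = ["0000000000", "0000000001", "0000000011",
            "1111111111", "1111111110", "1111111100"]
    print(two_means_antipodes(haps))
```

The two returned index sets can then be compared between loci in the same way as the true root-subtree clusters.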

Figure 6:

tLD vs. SNP-LD. Average (top left), variance (top right), coefficient of variation (bottom left) and index of dispersion (bottom right) of r². Data from a single simulation run performed with the program ms [18]. Parameter settings: ms 200 1 -t 100 -r 100 1000 -T. For a population size of N = 10^4 and a recombination rate of 1 cM/Mb the simulated region corresponds to 0.25 cM or 250 kb physical distance. Red: tLD calculated from the actual coalescent trees (i.e., using the trees obtained by setting the parameter -T). Orange: tLD calculated from estimated tree topology (see text). Blue: Conventional LD calculated from SNP pairs. Coefficient of variation: σ/µ; index of dispersion: σ²/µ.

Figure 7:

Heatmaps of true tLD calculated from tree topologies and tLD calculated from estimated tree topologies, performed on the same simulated dataset used in Figure 6

Comparing tLD with conventional LD in simulated data, we find that tLD declines comparably slowly and smoothly, although the variance is large. In contrast, conventional LD vanishes very quickly over similar distances (Figure 6). Although average LD is small, its variance is high, in particular for short distances. In this regime the variance of tLD is much smaller relative to the average than that of conventional LD (coefficient of variation σ/µ shown in Figure 6). These observations are theoretically supported by Lemma 2.
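The summary statistics compared in Figure 6 are straightforward to compute from a list of r² values at a given distance. The sketch below uses the definitions from the figure caption (coefficient of variation σ/µ, index of dispersion σ²/µ); the function name `dispersion_summary` and the use of the population variance rather than the sample variance are assumptions of this example.

```python
from math import sqrt
from statistics import mean, pvariance


def dispersion_summary(values):
    """Mean, variance, coefficient of variation and index of dispersion."""
    mu = mean(values)
    var = pvariance(values, mu)          # population variance (assumption)
    return {
        "mean": mu,
        "variance": var,
        "cov": sqrt(var) / mu,           # sigma / mu
        "index_of_dispersion": var / mu  # sigma^2 / mu
    }


if __name__ == "__main__":
    print(dispersion_summary([0.1, 0.2, 0.3]))
```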

In practice tree topology is unknown. To estimate tree topology at the tree root, we apply the two-means clustering approach described above. We use the human 1k genomes data [19] to determine tLD across the LCT region on chromosome 2 in the CEU (Central Europe) and YRI (Yoruba) populations. We estimate tLD using a window size of 5kb per locus. Most of these windows contain 10 or more SNPs. We find a strongly elevated level of tLD in the CEU population compared to YRI. As can be seen from the heat maps, there is a much higher, and a longer-ranging level of correlation to be observed for tLD than for conventional LD (Figure 8).

Figure 8:

Heatmaps of conventional (left) and topological (right) LD generated for the LCT and neighbouring regions of the CEU (top) and YRI (bottom) subsamples of the human 1k genomes project [19].

Figure 9:

Left: Embedded Image in a sample of size n = 100 with respect to LS,U declining at rate 1/(3 log(100)) with respect to the number of recombination events encountered in the sample. Right, Dots: Average tLD between pairs of segments with respect to their distance in units of 2500 bp (segment size: 5000 bp), estimated on the region extending over approx. 5.51 · 10^6 to 5.52 · 10^6 on chromosome 5 of the human 1k genomes data (CEU). Blue Line: Least-squares fit of expected tLD, estimating Embedded Image at ≈ 1.116 cM/Mb.

5. Discussion

We have introduced the concept of topological linkage disequilibrium (tLD). Like conventional LD, it is a means to analyse and quantify chromosomal linkage. In contrast to the latter, tLD can in principle be calculated for arbitrary pairs of loci, even if they are not polymorphic, because the concept hinges on coalescent tree topology, or subtree clusters, not on allele pairs. In practice, however, tree topology is usually not directly observable and needs to be estimated from polymorphism data. A moderate number of SNPs, on the order of ten, turns out to be sufficient to determine the two root-subtree clusters (assuming a binary coalescent tree) quite reliably (Figure 7). Given SNP data, we use a 2-means clustering approach, described in more detail above. Even with a single SNP, the correct root-subtree clusters may be recovered. This happens with probability roughly equal to (1 + 1/2)/h_{n−1}, where h_{n−1} = 1 + 1/2 + ... + 1/(n − 1) is the (n − 1)-th harmonic number, i.e. with the probability that the SNP lands on either one of the two root branches. This is about 1/2 for n = 12 and about 1/4 for n = 227. Alternatively, one may use other kinds of markers, for instance microsatellites [20] or structural variants, if available, to estimate the root-subtree clusters.
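
The two numerical values quoted above can be checked directly. The short sketch below evaluates (1 + 1/2)/h_{n−1}, reading h_{n−1} as the harmonic number; the function names are hypothetical.

```python
def harmonic(m):
    """Harmonic number h_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))

def prob_single_snp_recovers_root(n):
    """Approximate probability (1 + 1/2) / h_{n-1} for a sample of size n."""
    return 1.5 / harmonic(n - 1)

p12 = prob_single_snp_recovers_root(12)    # about 0.497
p227 = prob_single_snp_recovers_root(227)  # about 0.250
```

Since h_{n−1} grows like log n, this probability falls off only logarithmically with sample size.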

As a function of distance between loci (or markers), average tLD and conventional LD behave qualitatively similarly. Both decay toward the same limit, as expressed in Haldane’s equation (see eq. 6). However, tLD decays more slowly, and is less dispersed, than conventional LD (Figure 6). The coefficient of variation of tLD is only slightly larger than one, almost independently of the distance between loci, and only about a third of that of conventional LD. Conversely, the inverse of the coefficient of variation, in signal processing also called ‘signal to noise ratio’, is about three times as high for tLD as for conventional LD.
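
For reference, the two summary statistics used in this comparison can be computed from replicate values at a fixed distance as sketched below. The function names and the input values are hypothetical and are not taken from the simulations; the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (sigma / mu)."""
    values = np.asarray(values, dtype=float)
    return float(values.std(ddof=1) / values.mean())

def signal_to_noise(values):
    """Inverse of the coefficient of variation (mu / sigma)."""
    return 1.0 / coefficient_of_variation(values)

# Hypothetical LD values from three replicates at one distance.
replicates = [0.1, 0.2, 0.3]
cv = coefficient_of_variation(replicates)
snr = signal_to_noise(replicates)
```

A coefficient of variation three times smaller corresponds to a signal-to-noise ratio three times larger, which is the relation stated in the paragraph above.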

The concept of tLD is embedded in coalescent theory. The first moment of tLD and its limiting behaviour can be approximated analytically by a simple function, using arguments derived from properties of the (Kingman) coalescent. A currently open problem is to integrate the concept of tLD into ancestral recombination graph (ARG) theory. Here, we have resorted to its Markovian approximation, the SMC, to derive the presented results. We hypothesize that similar results hold for the ARG in terms of limiting behaviour. In the ARG setting, the decay of tLD is possibly somewhat slower than in the SMC setting, because in the ARG a genealogy can revert to an earlier one with higher probability than in the SMC.

Given that the various multi-genome resequencing projects now furnish whole-genome sequences with sample sizes of several hundred or more, the concept of tLD should prove useful for analyses of linkage disequilibrium and gene interactions in such studies.

Acknowledgments

This work was supported by the German Research Foundation (DFG-SPP1590).

Footnotes

  • Email addresses: wirtzj0{at}uni-koeln.de (Johannes Wirtz), martina.rauscher{at}uni-koeln.de (Martina Rauscher)

References

  [1] M. Slatkin. Linkage disequilibrium–understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet., 9(6):477–485, 2008. doi:10.1038/nrg2361.
  [2] R. C. Lewontin. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 49(1):49–67, 1964.
  [3] J. A. Sved. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theoretical Population Biology, 2(2):125–141, 1971.
  [4] M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.
  [5] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. doi:10.2307/3213548.
  [6] J. A. Sved, A. F. McRae, and P. M. Visscher. Divergence between human populations estimated from linkage disequilibrium. Am. J. Hum. Genet., 83(6):737–743, 2008.
  [7] S. Ennis. Linkage disequilibrium as a tool for detecting signatures of natural selection. Methods Mol. Biol., 376:59–70, 2007. doi:10.1007/978-1-59745-389-9_5.
  [8] B. P. McEvoy, et al. Human population dispersal “Out of Africa” estimated from linkage disequilibrium and allele frequencies of SNPs. Genome Res., 21(6):821–829, 2011. doi:10.1101/gr.119636.110.
  [9] J. B. S. Haldane. The mean and variance of χ² when used as a test of homogeneity, when expectations are small. Biometrika, 31(3):3–4, 1940. doi:10.2307/2332614.
  [10] S. Boitard and P. Loisel. Probability distribution of haplotype frequencies under the two-locus Wright-Fisher model by diffusion approximation. Theoretical Population Biology, 71(3):380–391, 2007.
  [11] U. Ober, et al. The expected linkage disequilibrium in finite populations revisited. ArXiv e-prints, 2013.
  [12] R. C. Griffiths and P. Marjoram. An ancestral recombination graph, pages 257–270. Springer, 1997. ISBN 0-387-94944-5. doi:10.1007/978-1-4757-2609-1_16.
  [13] G. A. McVean and N. J. Cardin. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360(1459):1387–1393, 2005. doi:10.1098/rstb.2005.1673.
  [14] L. Ferretti, F. Disanto, and T. Wiehe. The effect of single recombination events on coalescent tree height and shape. PLoS One, 8(4), 2013. doi:10.1371/journal.pone.0060123.
  [15] F. Murtagh. Counting dendrograms: A survey. Discrete Applied Mathematics, 7(2):191–199, 1984. ISSN 0166-218X. doi:10.1016/0166-218X(84)90066-0.
  [16] D. J. Aldous. Mixing time for a Markov chain on cladograms. Combinatorics, Probability and Computing, 9(3):191–204, 2000.
  [17] J. Wakeley. Coalescent Theory, an Introduction. Roberts and Company, Greenwood Village, CO, 2005. ISBN 978-0974707754.
  [18] R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.
  [19] A. Auton, et al. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015. doi:10.1038/nature15393.
  [20] H. Li and T. Wiehe. Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation. PLoS Comput. Biol., 9(5):e1003060, 2013. doi:10.1371/journal.pcbi.1003060.
Posted March 22, 2018.
Subject Area

  • Evolutionary Biology