## Abstract

Estimating admixture histories is crucial for understanding the genetic diversity we see in present-day populations. Existing allele frequency or phylogeny-based methods are excellent for inferring the existence of admixture or its proportions, but have less power for estimating admixture times. Recently introduced approaches for estimating these times use spatial information from admixed chromosomes, such as the local ancestry or the decay of admixture linkage disequilibrium (ALD). One popular method, implemented in the programs ALDER and ROLLOFF, uses two-locus ALD to infer the time of a single admixture event, but is only able to estimate the time of the most recent admixture event based on this summary statistic. We derive analytical expressions for the expected ALD in a three-locus system and provide a new statistical method based on these results that is able to resolve more complicated admixture histories. Using simulations, we show how this new statistic behaves on a range of admixture histories. As an example, we also apply our method to the Colombian and Mexican samples from the 1000 Genomes project.

## Introduction

There are many methods for inferring the presence of admixture, e.g. methods using simple summary statistics detecting deviations from phylogenetic symmetry [1–3] and methods estimating admixture proportions using programs such as Structure [4], Admixture [5] or RFmix [6]. However, there has been less research on estimating admixture times, possibly because such methods require data which were unavailable until the advent of high-throughput next generation sequencing. Some recently developed methods use the inferred local ancestry of sequences to construct admixture tract length distributions, such as [7–9]. Over time, recombination is expected to decrease the average lengths of admixture tracts. The length distribution of admixture tracts is therefore informative about the time since admixture. Much of the theory relating to tract lengths is based on Fisher’s famous theory of junctions [10] and subsequent work, such as [11–20]. For example, [21] first discussed the length distribution of tracts descended from a single ancestor. These results informed later analyses of admixture tract length distribution, such as references [7–9]. Gravel [8] also implemented the software program TRACTS, which estimates admixture histories by fitting the tract length distribution, obtained by local ancestry inference, to an exponential approximation.

Another approach, which we will follow in this paper, is based on the decay of admixture linkage disequilibrium (ALD). In well-mixed, genetically isolated human populations, linkage disequilibrium decays to zero on a scale of tenths of centiMorgans. However, when an admixed population is founded, it begins with a large amount of linkage disequilibrium, which is a result of the allele frequency differences between the source populations. This occurs even if the LD in the source populations themselves is negligible. The linkage disequilibrium in the admixed population then fluctuates in the generations after its founding, decreasing as a result of drift and recombination, or increasing because of additional waves of migration. From the LD present in a modern-day admixed population, it is possible to make inferences about the population’s admixture history. This insight was first used in the program ROLLOFF [22] and was later extended by ALDER [23].

These two methods use the fact that if an admixed population takes in no additional migrants after the founding generation, the LD present in the population is expected to decay exponentially as a function of distance. The rate constant of this exponential decay is proportional to the age of the founding admixture pulse and so can be used as an estimator. ROLLOFF and ALDER are well suited for inferring the time of the admixture event when the population’s admixture history can be approximated as a single pulse. However, it can be important to estimate parameters for admixture histories involving multiple pulses, such as estimating the date of Native American admixture in Rapa Nui [24] or determining migration patterns in the Americas [25]. In these instances the expected decay of LD will become a mixture of exponentials. ROLLOFF and ALDER have limited resolution, as they can usually only infer the date of the most recent migration wave [22], or reject the hypothesis of a single pulse admixture [23].
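As a minimal sketch of the single-pulse idea (not the ROLLOFF or ALDER implementation), the decay rate of a noise-free exponential curve can be recovered with a log-linear least-squares fit. The function name, the amplitude 0.03, the distance grid, and the use of Morgans as the distance unit (so that the rate equals the number of generations) are illustrative assumptions.

```python
import numpy as np

def estimate_pulse_time(d, ld):
    """Fit log(ld) = log(a) - T * d by least squares and return T.

    Assumes ld is strictly positive and follows a single exponential.
    """
    slope, _intercept = np.polyfit(d, np.log(ld), 1)
    return -slope

# Hypothetical noise-free curve for a pulse 12 generations ago.
d = np.linspace(0.005, 0.2, 40)      # genetic distances in Morgans (assumed grid)
true_T = 12.0
ld = 0.03 * np.exp(-true_T * d)      # the amplitude is an arbitrary choice
t_hat = estimate_pulse_time(d, ld)
```

With real data the curve is noisy and may contain an additive offset, so a non-linear fit would be needed; the sketch only illustrates why the decay rate carries the timing information.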

ROLLOFF and ALDER use the information contained in pairs of sites by examining the two-locus linkage disequilibrium between them. Here we extend the theory underlying the methods in ROLLOFF and ALDER to three loci by considering three-locus LD. There are two ways of measuring the linkage between *n* loci. Bennett [26] defines *n*-locus linkage in a way that maintains a geometric decrease of LD each generation as a result of recombination, which is an important property of two-locus linkage disequilibrium. Slatkin [27] defines *n*-locus LD to be the *n*-way covariance, analogously to the property of two-locus LD as the covariance in allele frequency between pairs of loci. For two and three loci, these two definitions coincide, but for four or more loci, they do not.

In this paper, we will use Bennett and Slatkin’s definition of three-locus LD to examine the decay of ALD for three sites as a function of the genetic distance between them. We derive an equation that describes the decay of three-locus LD under an admixture history with multiple waves of migration. We then compare the results of coalescent simulations to this equation, and develop some guidelines for when admixture histories more complex than a single pulse can be resolved. Finally, we apply our method to the Colombian and Mexican samples in the 1000 Genomes data set, using the Yoruba samples as a reference. Fitting a two-pulse model to data, we estimate admixture histories for the two populations which are qualitatively consistent with the results reported in [25].

## Model

We use a random union of gametes admixture model as described in [28], which is an extension of the mechanistic admixture model formulated by [29]. In this model, two or more source populations contribute migrants to form an admixed population consisting of 2*N* haploid individuals. Each generation in the admixed population is formed through the recombination of randomly selected individuals from the previous generation, with some individuals potentially replaced by migrants from the source populations. For simplicity, we consider a model with only two source populations. Furthermore, the first source population only contributes migrants in the founding generation, *T*. The second source population contributes migrants in the founding generation and possibly in one or more generations thereafter. In generation *i*, for *i* = *T* − 1, …, 0 (before the present), a fraction *m*_{i} of the admixed population is replaced by individuals from the second source population.
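To make the bookkeeping of this model concrete, the following sketch computes the expected fraction of ancestry from the second source population after a sequence of replacement events, ignoring drift. The function name and the example values are hypothetical; the only ingredient taken from the model is that a fraction *m*_{i} of the population is replaced in generation *i*.

```python
def second_source_fraction(m):
    """Expected second-source ancestry after applying the replacement
    fractions in m, listed in time order: m_T, m_{T-1}, ..., m_0."""
    first = 1.0
    for m_i in m:
        first *= 1.0 - m_i   # first-source ancestry survives with prob. 1 - m_i
    return 1.0 - first

# Two pulses: M1 at the founding generation, M2 at a later generation,
# and no migration in the generations in between.
M1, M2 = 0.3, 0.2
total = second_source_fraction([M1, 0.0, 0.0, M2])
```

For two pulses this reduces to *M*_{2} + *M*_{1}(1 − *M*_{2}), the total admixture proportion that appears in the simulation results.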

## Linkage Disequilibrium and Local Ancestry

ROLLOFF and ALDER use the standard two-locus measure of LD between a SNP at position *x* and another SNP at position *y*, which is a genetic distance *d* to the right,

$$D_2(d) = \operatorname{cov}(H_x, H_y), \tag{1}$$

where *H*_{x} and *H*_{y} represent the haplotypes or genotypes of an admixed chromosome at positions *x* and *y*. In the case of haplotype data, *H*_{i,x} = 1 if the *i*^{th} sample is carrying the derived allele at the SNP at position *x*, and is otherwise 0. Alternatively, for genotype data, *H*_{i,x} takes on values from {0, 1/2, 1} depending on the number of copies of the derived allele the *i*^{th} sample is carrying at SNP position *x*. We consider an additional site at position *z*, which is located a further genetic distance *d*′ to the right of *y*. The three-locus LD, as defined by [26] and [27], is given by

$$D_3(d, d') = \operatorname{cov}(H_x, H_y, H_z) = E\big[(H_x - E[H_x])(H_y - E[H_y])(H_z - E[H_z])\big]. \tag{2}$$

The LD in an admixed population depends on the genetic differentiation between the source populations and its admixture history. Let *A*_{x} represent the local ancestry at position *x*, with *A*_{x} = 1 if *x* is inherited from an ancestor in the first source population, and *A*_{x} = 0 if *x* is inherited from the second source population. We can compute *D*_{3} in terms of the three-point covariance function of *A*_{x} and so separate out the effects of allele frequencies and local ancestry. Let *H*_{x} = *f*_{x} + *δ*_{x}*A*_{x}, where *f*_{x} is the allele frequency of locus *x* in the first source population and *δ*_{x} is the difference of the allele frequencies of locus *x* in the two source populations. We now make the assumption that the allele frequencies in the source populations are known and fixed. Equation 2 then becomes

$$D_3(d, d') = \delta_x \delta_y \delta_z \operatorname{cov}(A_x, A_y, A_z). \tag{3}$$

A similar argument shows that *D*_{2}(*d*) is proportional to the two-point covariance function of the local ancestry.
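The two LD measures used in this section are ordinary two-way and three-way central moments, so they can be computed directly from a sample matrix. The sketch below assumes a hypothetical `H` with one row per sample and one column per site; the function names are illustrative and not taken from any existing program.

```python
import numpy as np

def ld_two_locus(H, x, y):
    """Empirical two-locus LD: covariance of columns x and y of H."""
    hx = H[:, x] - H[:, x].mean()
    hy = H[:, y] - H[:, y].mean()
    return float(np.mean(hx * hy))

def ld_three_locus(H, x, y, z):
    """Empirical three-locus LD: three-way covariance of columns x, y, z."""
    hx, hy, hz = (H[:, s] - H[:, s].mean() for s in (x, y, z))
    return float(np.mean(hx * hy * hz))

# Toy haplotype matrices (rows: samples, columns: sites x, y, z).
H_pairs = np.array([[1, 1, 1],
                    [1, 1, 0],
                    [0, 0, 1],
                    [0, 0, 0]], dtype=float)
H_block = np.array([[1, 1, 1],
                    [0, 0, 0],
                    [0, 0, 0],
                    [0, 0, 0]], dtype=float)

d2 = ld_two_locus(H_pairs, 0, 1)         # sites 0 and 1 are perfectly linked
d3_zero = ld_three_locus(H_pairs, 0, 1, 2)
d3_block = ld_three_locus(H_block, 0, 1, 2)
```

In `H_block` all three sites are perfectly correlated with frequency *p* = 1/4, and the three-way covariance equals *p*(1 − *p*)(1 − 2*p*), which vanishes at *p* = 1/2.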

## Local Ancestry Covariance Functions

From the above section we see that we can describe the three-point admixture LD in terms of covariances of local ancestry at the three points. We now expand the covariance in equation 3 into its component expectations to get

$$\operatorname{cov}(A_x, A_y, A_z) = E[A_x A_y A_z] - E[A_x]E[A_y A_z] - E[A_y]E[A_x A_z] - E[A_z]E[A_x A_y] + 2E[A_x]E[A_y]E[A_z].$$

Each one of these expectations on the right-hand side is the probability that one or more sites are inherited from an ancestor from the first source population. We organize these products of probabilities in a column vector:

$$\mathbf{v}_3 = \big(E[A_x A_y A_z],\; E[A_x]E[A_y A_z],\; E[A_y]E[A_x A_z],\; E[A_z]E[A_x A_y],\; E[A_x]E[A_y]E[A_z]\big)',$$

so that cov(*A*_{x}, *A*_{y}, *A*_{z}) = (1, −1, −1, −1, 2)**v**_{3}. There is one entry in **v**_{3} for each of the five ways in which the three markers at positions *x*, *y*, and *z* can be arranged on one or more chromosomes. In the founding generation *T*, this column vector is given by **v**_{3}^{(T)} = (1 − *m*_{T}, (1 − *m*_{T})^{2}, (1 − *m*_{T})^{2}, (1 − *m*_{T})^{2}, (1 − *m*_{T})^{3})′. The probabilities for subsequent generations can be found by left-multiplying drift, recombination, and migration matrices:

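The matrices **D**_{i}, **L**, and **U** account for the effects of migration, drift, and recombination, respectively. The migration matrix is a diagonal matrix given by

$$\mathbf{D}_i = \operatorname{diag}\big(1 - m_i,\; (1 - m_i)^2,\; (1 - m_i)^2,\; (1 - m_i)^2,\; (1 - m_i)^3\big).$$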

Its entries are the probabilities that one, two, or three chromosomes in the admixed population will not be replaced by chromosomes from the second source population in generation *i*. The lower triangular drift matrix
gives the standard Wright-Fisher drift transition probabilities between the states as a function of the population size 2*N*. Finally, the upper triangular recombination matrix is determined by the recombination rates between the three sites:

The covariance function is then given by

We can obtain an analogous equation for cov(*A*_{x}, *A*_{y}), involving the migration, drift, and recombination matrices for two loci:

In some cases, equation 4 simplifies further. In a one-pulse migration model, in which *m*_{T} = *M* and *m*_{i} is thereafter 0, the **D**_{i}’s become identity matrices, and we get the closed-form expression

This is because (1, −1, −1, −1, 2) is a left eigenvector of both **L** and **U**, with corresponding eigenvalues (1 − 1/2*N*)(1 − 2/2*N*) and exp(−*d* − *d*′). Note that when *M* = 0, the covariance function will be identically 0. Another case is a two-pulse model in which we ignore the effects of genetic drift. In this model, admixture only occurs *T* and *T*_{2} generations before the present, so that *m*_{T} = *M*_{1}, *m*_{T_{2}} = *M*_{2}, and all other *m*_{i}’s are 0. Making the substitution *T*_{1} = *T* − *T*_{2}, the right-hand side of equation 4 becomes

The corresponding expression for the two-point covariance function is a mixture of two exponentials.
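The eigenvector argument can be checked numerically. The sketch below builds one parameterization of the three matrices that is consistent with the properties stated in this section; it is an assumption, not necessarily the form used by the authors. The assumed state order is (*xyz* together), (*x* | *yz*), (*y* | *xz*), (*z* | *xy*), (*x* | *y* | *z*); the drift matrix uses exact Wright-Fisher probabilities for 2*N* haploids; the recombination matrix assumes independent breaks in the two intervals with no-break probabilities exp(−*d*) and exp(−*d*′); and the order in which the matrices are applied each generation is also an assumption. All function names are hypothetical.

```python
import numpy as np

W = np.array([1.0, -1.0, -1.0, -1.0, 2.0])   # contrast vector from the text

def migration_matrix(m):
    q = 1.0 - m                               # chance a chromosome is not replaced
    return np.diag([q, q**2, q**2, q**2, q**3])

def drift_matrix(two_n):
    a = 1.0 / two_n                           # chance two lineages share a parent
    L = np.eye(5) * (1.0 - a)
    L[0, 0] = 1.0
    L[1:4, 0] = a                             # two chromosomes merge into one
    L[4] = [a * a, a * (1 - a), a * (1 - a), a * (1 - a), (1 - a) * (1 - 2 * a)]
    return L

def recombination_matrix(d, dp):
    ex, ey = np.exp(-d), np.exp(-dp)          # no-break probabilities (assumed)
    U = np.zeros((5, 5))
    U[0] = [ex * ey, (1 - ex) * ey, 0.0, ex * (1 - ey), (1 - ex) * (1 - ey)]
    U[1, [1, 4]] = [ey, 1 - ey]               # x | yz: y and z may separate
    U[2, [2, 4]] = [ex * ey, 1 - ex * ey]     # y | xz: x and z span d + d'
    U[3, [3, 4]] = [ex, 1 - ex]               # z | xy: x and y may separate
    U[4, 4] = 1.0
    return U

def cov3(m_founding, later_m, two_n, d, dp):
    """Three-point ancestry covariance; later_m lists m_{T-1}, ..., m_0."""
    q = 1.0 - m_founding
    v = np.array([q, q**2, q**2, q**2, q**3])
    step = drift_matrix(two_n) @ recombination_matrix(d, dp)
    for m in later_m:
        v = migration_matrix(m) @ step @ v    # assumed order of operations
    return float(W @ v)

# One-pulse check against the product of eigenvalues.
T, M, two_n, d, dp = 10, 0.3, 1000, 0.02, 0.03
lam = (1 - 1 / two_n) * (1 - 2 / two_n) * np.exp(-(d + dp))
one_pulse = cov3(M, [0.0] * T, two_n, d, dp)
closed_form = M * (1 - M) * (2 * M - 1) * lam**T
```

Under these conventions the amplitude factor is (1 − *M*) − 3(1 − *M*)^{2} + 2(1 − *M*)^{3} = *M*(1 − *M*)(2*M* − 1), which changes sign at *M* = 1/2; the overall sign depends on which source population is coded as *A* = 1.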

## Weighted Linkage Disequilibrium

As [23] noted, we cannot use the LD in the admixed population directly, because the allele frequency differences in the source populations can be of either sign. As in [23], we solve this problem by computing the product of the values of the three-point linkage disequilibrium coefficient with the product of the allele frequency differences. Using equation 3 we obtain

$$E\big[D_3(d, d')\,\delta_x \delta_y \delta_z\big] = E\big[\delta_x^2 \delta_y^2 \delta_z^2\big]\,\operatorname{cov}(A_x, A_y, A_z),$$

because the local ancestry in the admixed sample is independent of the allele frequencies in the admixed population. For inference purposes, we estimate this function by averaging over triples of SNPs which are separated by distances of approximately *d* and *d*′. The LD term is estimated from the admixed population, while the *δ*’s are estimated from reference populations which are closely related to the two source populations. Note that neither this approach nor previous approaches (e.g., [23]) take into account genetic drift in the source populations after the time of admixture; that is, both assume that the allele frequencies in the ancestral source populations are well approximated by the allele frequencies in the extant populations.

We arrange the data from the admixed samples in an *n* × *S*_{n} matrix **H**, where *n* is the number of admixed haplotypes/genotypes, and *S*_{n} is the number of markers in the sample. Similarly, we arrange the data from the two source populations into two matrices, **F** and **G**, which are of size *n*_{1} × *S*_{n} and *n*_{2} × *S*_{n}, where *n*_{1} and *n*_{2} are the numbers of samples from each of the source populations. For ease of notation, we assume that the positions are given in units which make the unit interval equal to the desired bin width.

For a given *d* and *d*′ the SNP triples we use in the estimator for the weighted LD are

Let *f*_{x} be the empirical allele frequency in the admixed population. An estimator of the weighted three-point linkage disequilibrium coefficient is then
where
and similarly for and .
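For orientation, a brute-force version of such an estimator can be sketched as follows. This is an assumed form, not necessarily the authors' estimator: it takes the weights to be the differences in mean allele frequency between the two reference matrices, averages the weighted three-way covariance over SNP triples whose spacings fall within half a bin of (*d*, *d*′), and normalizes by the number of triples. The function name, the `tol` parameter, and the toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def weighted_ld3_naive(H, F, G, pos, d, dp, tol=0.5):
    """Average of delta_x * delta_y * delta_z * D3(x, y, z) over triples
    x < y < z with pos[y] - pos[x] ~ d and pos[z] - pos[y] ~ dp.

    pos must be sorted and expressed in bin-width units (assumption).
    """
    delta = F.mean(axis=0) - G.mean(axis=0)   # assumed sign convention
    Hc = H - H.mean(axis=0)                   # center on admixed frequencies
    total, count = 0.0, 0
    for x, y, z in combinations(range(H.shape[1]), 3):
        if abs(pos[y] - pos[x] - d) < tol and abs(pos[z] - pos[y] - dp) < tol:
            d3 = np.mean(Hc[:, x] * Hc[:, y] * Hc[:, z])
            total += delta[x] * delta[y] * delta[z] * d3
            count += 1
    return total / count if count else float("nan")

# Toy data: three sites one bin apart, references fixed for opposite alleles.
H = np.array([[1, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]], dtype=float)
F = np.ones((2, 3))
G = np.zeros((2, 3))
pos = np.array([0.0, 1.0, 2.0])
a_hat = weighted_ld3_naive(H, F, G, pos, 1, 1)
```

The loop visits every triple of sites, which is what makes a direct computation cubic in the number of segregating sites.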

## Algorithm

Directly computing *â* over the set (*d*, *d*′) ∈ {0,1,…,*P*}^{2} would be cubic in the number of segregating sites. However, by using the fast Fourier transform (FFT) technique introduced in ALDER [23], we can approximate *â* with an algorithm whose time complexity is instead linear in the number of segregating sites.

First, rearrange *â* to get
and define sequences *b*_{i}[*d*] and *c*[*d*] by binning the data and then doubling the length by padding with *P* zeros,

We can approximate |*S*[*d*, *d*′]| and the *n* sums in the numerator of *â* in terms of convolutions of these sequences:

These convolutions can be efficiently computed with an FFT, since under a two-dimensional discrete Fourier transform from (*d*, *d*′)-space to (*j*, *k*)-space,
where *B*_{i} is the one-dimensional discrete Fourier transform of *b*_{i} and, for *j* > 0, *B*_{i}[−*j*] is the *j*^{th}-to-last element of *B*_{i}. Summing over *i* and taking the inverse discrete Fourier transform, we can approximate the numerator of *â*. We apply the same method to *c* to approximate the denominator of *â*.
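The core of the FFT step can be illustrated with a generic triple-correlation identity. The sketch below is not the authors' code: it assumes a single real sequence `c` of per-bin values, indexes the result by the offsets (*d*_{1}, *d*_{2}) = (*d*, *d* + *d*′) from the first site, and uses the fact that the two-dimensional DFT of the sum over *x* of *c*[*x*] *c*[*x* + *d*_{1}] *c*[*x* + *d*_{2}] equals *C*[*j*] *C*[*k*] conj(*C*[*j* + *k*]) for real input. The function names are hypothetical.

```python
import numpy as np

def triple_correlation_fft(c):
    """t[d1, d2] = sum_x c[x] * c[x + d1] * c[x + d2] for 0 <= d1, d2 < P,
    computed with FFTs after zero-padding c to length 2P."""
    P = len(c)
    n = 2 * P
    padded = np.zeros(n)
    padded[:P] = c                        # padding prevents circular wrap-around
    C = np.fft.fft(padded)
    j = np.arange(n)
    jk = (j[:, None] + j[None, :]) % n    # index of C[j + k]
    spectrum = C[:, None] * C[None, :] * np.conj(C[jk])
    t = np.fft.ifft2(spectrum).real
    return t[:P, :P]

def triple_correlation_naive(c):
    """Direct O(P^3) evaluation of the same sums, for checking."""
    P = len(c)
    out = np.zeros((P, P))
    for d1 in range(P):
        for d2 in range(P):
            for x in range(P):
                if x + d1 < P and x + d2 < P:
                    out[d1, d2] += c[x] * c[x + d1] * c[x + d2]
    return out

c = np.array([2.0, 0.0, 1.0, 3.0, 1.0, 0.0])   # hypothetical per-bin values
fast = triple_correlation_fft(c)
slow = triple_correlation_naive(c)
```

Applied to bin counts, sums of this form approximate the number of SNP triples at each pair of spacings; applied to per-sample binned data and summed over samples, the same identity gives the numerator sums.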

The time complexities for the binning and the FFTs are *O*(*S*_{n}) and *O*(*P*^{2} log(*P*)). Of these two, the first term will dominate, because *P*, the number of bins, is much less than *S*_{n}, the number of segregating sites.

When samples from only one source population are available, it is still possible to estimate the weighted admixture linkage disequilibrium by using the difference in allele frequencies between that source population and the admixed population as a proxy for the difference in allele frequencies between the sampled source population and the missing one [23, 30].

When using only the admixed population itself as a reference population, the method described above will be biased if the same samples are used to estimate both the linkage disequilibrium coefficients and the weights (*δ*_{x}, *δ*_{y}, and *δ*_{z}). We cannot efficiently compute polyache statistics as in [23]. At the cost of some power, we instead adopt the approach of [30] and separate the admixed population into two equal-sized groups. We then use one group to estimate the weights, and the other group to estimate linkage disequilibrium coefficients, and vice versa. This gives two unbiased estimates for the numerator of *â*, which we then average.

## Fitting the Two-Pulse Model

We fit equation 6 to the estimates of the weighted LD using non-linear least squares, with two modifications. We added a proportionality constant to account for the expected square allele frequency difference between the source populations. We also subtracted out an affine term in the weighted LD which is due to population substructure [23]. We estimated this by computing the three-way covariance between triples of chromosomes. We use the jackknife to obtain confidence intervals for the resulting estimates by leaving out each chromosome in turn and refitting on the data for the remaining chromosomes.
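The leave-one-chromosome-out jackknife can be sketched generically as follows. This is a plain delete-one jackknife under the assumption that each chromosome contributes one block of data and that `estimator` refits the model on the remaining blocks; the function name and the toy per-chromosome values are hypothetical, and the sample mean stands in for the non-linear least-squares fit.

```python
import numpy as np

def jackknife(estimator, blocks):
    """Delete-one jackknife: returns the mean of the leave-one-out
    estimates and the jackknife standard error."""
    n = len(blocks)
    loo = np.array([estimator(blocks[:i] + blocks[i + 1:]) for i in range(n)])
    center = loo.mean(axis=0)
    se = np.sqrt((n - 1) / n * np.sum((loo - center) ** 2, axis=0))
    return center, se

# Hypothetical per-chromosome summaries for 20 chromosomes.
values = [11.2, 12.9, 10.4, 13.1, 12.0, 11.7, 12.5, 10.9, 13.4, 11.1,
          12.2, 12.8, 11.5, 10.8, 12.6, 13.0, 11.9, 12.1, 11.3, 12.4]
center, se = jackknife(lambda b: float(np.mean(b)), values)
```

For the sample mean, the jackknife standard error coincides with the usual standard error of the mean, which gives a simple sanity check before substituting a model fit for the estimator.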

## Simulations and Data

We used the program macs [31] to generate two source populations which diverged 4000 generations ago and a coalescent simulation to generate an admixed population from the two source populations according to two-pulse and constant admixture models. We sampled 50 diploid individuals from the admixed and two source populations, each consisting of 20 chromosomes of length 1 Morgan. The effective population size was 2*N* = 1000 for the admixed population and two source populations. Using a two pulse model, we varied the migration probabilities and timings for each pulse to examine the accuracy of equation 6. We also simulated data for a model with a constant rate of admixture each generation, and compared this to the predictions made by equation 4.

We computed the weighted LD for the Mexican and Colombian populations in the first phase of the 1000 Genomes data set. These consisted of 66 individuals from Los Angeles and 60 individuals from Medellin, respectively. We used the 88 Yoruba samples as a reference population. We computed the weighted LD on the genotypes to avoid effects of phasing errors.

## Patterns of 3-locus LD

We first evaluate the accuracy of the equations developed in this paper by comparing the analytical results to simulated data (Figures 1–3). We find there is generally a close match between our equations and the simulated data under both the two-pulse admixture scenarios (Figures 1 and 2) and constant-admixture scenarios (Figure 3). The exception is when the total admixture proportion *M*_{2} + *M*_{1}(1 − *M*_{2}) is close to 0.5. As the total admixture proportion increases above 0.5, the contours for equation 2 flip from being concave down to concave up. This transition can be seen by comparing the upper left side of Figure 2 to its lower right. At this threshold, the contours of the estimated weighted LD depend on the actual admixture fractions of the samples, which may differ from the expectation as a result of genetic drift. This mismatch between theory and simulations is most evident in Figure 2, for *m*_{1} = 0.1, *m*_{2} = 0.4 and *m*_{1} = 0.2, *m*_{2} = 0.4.

Under a continuous admixture scenario, the shape of the weighted LD surface depends on both the duration and total amount of admixture. When the duration is short, the weighted LD surfaces are indistinguishable from the weighted LD surfaces produced by one pulse of migration. As the duration increases, the contours of the weighted LD surface become more curved. The contours are concave up when the total proportion is greater than 50% and concave down when it is less. When the total proportion is exactly 50%, the amplitude of the weighted LD surface is much smaller than the sampling error.

For two-pulse models, the effects of the second pulse of migration only become evident when the temporal spacing between the pulses is large enough (*T*_{1} > *T*_{2}). Otherwise, the resulting weighted LD surface cannot be distinguished from the weighted LD surface produced by one pulse of admixture. As in the case of continuous admixture, the concavity of the surface contours is determined by the total admixture proportion.

## Comparison to two-locus LD measures

We compared the simulation results to the two-locus weighted LD calculated by ALDER (Figure 4). The information used in estimating admixture times in ALDER is the slope of the log-scaled LD curves. Notice (Figure 4) that the slopes are somewhat similar for admixture models with identical values of the most recent admixture event (*T*_{2}). Hence, when two admixture events have occurred, estimates of admixture times tend to be weighted towards the most recent event. Generally, it would be very difficult to estimate parameters of a model with more than one admixture event based on the shape of the admixture LD decay curve. In contrast, there is a quite clear change in the pattern of three-locus LD as long as the time between the two admixture events is sufficiently large (Figure 1).
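The pull towards the most recent event can be seen in a small numerical sketch of a two-exponential mixture. The amplitudes (equal weights of 0.5), the pulse times, and the distance grid in Morgans are arbitrary assumptions chosen for illustration; they are not taken from the simulations in this paper.

```python
import numpy as np

def local_decay_rate(d, ld):
    """Negative slope of log(ld) with respect to d at each grid point."""
    return -np.gradient(np.log(ld), d)

d = np.linspace(0.01, 1.0, 200)               # genetic distance in Morgans
T_old, T_recent = 40.0, 5.0                   # hypothetical pulse times
ld = 0.5 * np.exp(-T_old * d) + 0.5 * np.exp(-T_recent * d)
rates = local_decay_rate(d, ld)
```

At short distances the local rate lies between the two pulse times, while at longer distances the slowly decaying term dominates and the rate approaches the time of the recent pulse, so a single-exponential fit carries little information about the older event.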

## Accuracy of parameter estimates

We next evaluate the utility of the method for estimating admixture times. The qualitative similarities between one pulse and two pulse admixture scenarios seen in the previous simulations under some parameter settings will naturally affect the estimates. As shown in Figure 5, when the spacing between the two pulses is small relative to their age, the median of the estimates of the timing of the second pulse is close to the true value, but the interquartile range is large. Moreover, the best fit often lies on a boundary of the parameter space which is equivalent to a one pulse admixture model. When the spacing between the pulses is larger, the estimates for the timing of the older pulse become more precise.

## 1000 Genomes

To illustrate the utility of the method we computed weighted LD surfaces for Mexican and Colombian samples from the 1000 Genomes consortium previously analyzed for similar purposes by [25]. For the Mexican samples, [25] found a small but consistent amount of African ancestry, which appeared in the population 15 generations ago, with continuing contributions from European and Native American populations since that date, but no African migration. In fitting a two-pulse model to the Mexican weighted LD surface (Figure 6), we estimated that the two pulses occurred 12.3 ± 3.3 and 9.9 ± 2.7 generations ago. These confidence intervals overlap, and so we cannot reject a one-pulse admixture history. This is not quite consistent with the constant migration model that [25] found, but as we have seen from simulations, it is hard to distinguish a constant migration model from a one-pulse model when the duration of the migration is short.

The weighted LD surface for the Colombian samples is shown in Figure 7. From this, we estimated two pulses of non-Yoruba migration at 11.8 ± 1.2 and 2.64 ± 0.08 generations before the present. [25] also inferred two pulses of admixture, corresponding to 3 and 9 generations ago. The weighted LD surface of the Colombian samples has contours which are strongly concave up, in contrast to those of the Mexican samples.

## Discussion

The method presented here is an extension of previously published methods for using weighted two-locus LD to estimate admixture times. The new method uses more information in the data because it compares triples of SNPs instead of pairs. This gives the method the ability to infer admixture histories more complex than a one-pulse model. However, this comes at the price of greater estimation variances. ALDER and ROLLOFF make estimates from just tens of samples, while our method requires hundreds of samples. Part of this difference can be attributed to the fact that ALDER and ROLLOFF make inferences over a smaller class of models, but the main reason is that the two-locus methods estimate second moments of the data, while we estimate third moments. The variances of these estimates are both inversely proportional to the sample size, but the constants for estimating third moments are larger. As data become more readily available, this disadvantage should diminish.

We also note that the theory developed in this paper might be useful for purposes other than estimating admixture times. In particular, it can be used to test hypotheses regarding the spatial distribution of introgressed fragments in the genome, without relying on particular inferences of admixture tracts. It can also naturally be extended to include selection, opening up the possibility for model-based tests of selection acting on the distribution of admixture tracts.