Abstract
Geographic patterns of genetic variation within modern populations, produced by complex histories of migration, can be difficult to infer and visually summarize. A general consequence of geographically limited dispersal is that samples from nearby locations tend to be more closely related than samples from distant locations, and so genetic covariance often recapitulates geographic proximity. We use genome-wide polymorphism data to build “geogenetic maps”, which, when applied to stationary populations, produces a map of the geographic positions of the populations, but with distances distorted to reflect historical rates of gene flow. In the underlying model, allele frequency covariance is a decreasing function of geogenetic distance, and nonlocal gene flow such as admixture can be identified as anomalously strong covariance over long distances. This admixture is explicitly co-estimated and depicted as arrows, from the source of admixture to the recipient, on the geogenetic map. We demonstrate the utility of this method on a circum-Tibetan sampling of the greenish warbler (Phylloscopus trochiloides), in which we find evidence for gene flow between the adjacent, terminal populations of the ring species. We also analyze a global sampling of human populations, for which we largely recover the geography of the sampling, with support for significant histories of admixture in many samples. This new tool for understanding and visualizing patterns of population structure is implemented in a Bayesian framework in the program SpaceMix.
Author Summary
In this paper, we introduce a statistical method for inferring, for a set of sequenced samples, a map in which the distances between population locations reflect genetic, rather than geographic, proximity. Two populations that are sampled at distant locations but that are genetically similar (perhaps one was recently founded by a colonization event from the other) may have inferred locations that are nearby, while two populations that are sampled close together, but that are genetically dissimilar (e.g., are separated by a barrier), may have inferred locations that are farther apart. The result is a “geogenetic” map in which the distances between populations are effective distances, indicative of the way that populations perceive the distances between themselves: the “organism’s-eye view” of the world. Added to this, “admixture” can be thought of as the outcome of unusually long-distance gene flow; it results in relatedness between populations that is anomalously high given the distance that separates them. We depict the effect of admixture using arrows, from a source of admixture to its target, on the inferred map. The inferred geogenetic map is an intuitive and information-rich visual summary of patterns of population structure.
Introduction
There are many different methods to learn how population structure and demo-graphic processes have left their mark on patterns of genetic variation within and between populations. Model-based approaches focus on developing a detailed view of the migrational history of a small number of populations, often assuming one or a small number of large, randomly mating populations (i.e. little or no geographic structure). There has been considerable recent progress in this area, using a variety of summaries such as the allele frequency spectrum [Gutenkunst et al., 2009, Bhaskar et al., 2014, Excoffier et al., 2013], or approximations to the coalescent applied to sequence data [Paul et al., 2011, Li and Durbin, 2011].
Other approaches are designed only to visualize patterns of genetic relatedness and population structure, without using a particular population genetic model. Such methods can deal with many populations or individuals as the unit of analysis. Examples of this second set of methods include clustering methods [Pritchard et al., 2000, Alexander et al., 2009, Lawson et al., 2012] and reduced dimensionality representations of the data, such as Principal Components Analysis (PCA; e.g. [Luca et al., 1994, Patterson et al., 2006, Price et al., 2006]).
A third set of methods that describe relatedness between populations by constructing a “population phylogeny” was pioneered by Cavalli-Sforza and Edwards [1967], as were methods to test whether a tree is a good model of population history [Cavalli-Sforza and Piazza, 1975] (see [Felsenstein, 1982] for a review). Tree-based approaches are appealing because trees are easy to visualize and explain, but the underlying assumptions (unstructured populations that split at discrete points in time) rarely hold true.
Recently, there has been a resurgence of interest in these tree-based methods. Some use population trees as a null model to test and quantify the signal of admixture between samples Reich et al. [2009]. Others, such as TreeMix [Pickrell and Pritchard, 2012] and MixMapper [Lipson et al., 2013], visualize population relationships using a directed acyclic graph; for instance, TreeMix connects branches in a population tree with additional edges to explain excess covariance between groups of populations.
There has also been renewed interest in methods for dimensionality reduction for the visualization of patterns of genetic variation [Patterson et al., 2006], especially PCA (also pioneered by Cavalli-Sforza [Menozzi et al., 1978]). Examining such low-dimensional visual summaries has become an indispensible step in the analysis of modern genomic datasets of thousands of loci typed in tens or hundreds of samples. Generally, these visualizations are constructed by plotting the first few eigenvectors of the covariance matrix of normalized allele frequencies against each other.
Both PCA and tree-based methods are valuable as genetic inference and visualization tools, but both also suffer from serious limitations. Because gene flow is frequently pervasive, patterns of relatedness between samples may often be only poorly represented by a tree-based model. PCA is more flexible, as it assumes no explicit model of population-genetic processes, simply describing the axes of greatest variance in the average coalescent times between pairs of samples [McVean, 2009]. This allows PCA to describe more geographically continuous relationships: PCA applied to human populations within continents often shows a close correspondence to geographic locations [e.g. Novembre et al., 2008, Wang et al., 2012]. However, the interpretation of PCA is more difficult, as the results can be strongly affected by the size and design of sampling, and the linearity and orthogonality requirements of the PC axes can lead to counterintuitive results [Novembre and Stephens, 2008, Francois et al., 2010].
What is desired, then, is a method for inferring and visualizing patterns of population differentiation that can recapitulate complex, non-hierarchical structures, while also admitting simple and intuitive interpretation. Since gene flow and population movements are often constrained by geography, it is natural to base such a method in a geographic framework. There is a rich history of population genetics theory for populations distributed in continuous space [Malécot, 1975, Nagylaki, 1978, Felsenstein, 1975, Barton et al., 2002], as well as exciting new developments in the field [e.g. Petkova et al., 2014]. The pattern of increasing genetic differentiation with geographic distance was termed “Isolation by Distance” by Wright [1943], and is ubiquitous in natural populations [Meirmans, 2012]. Descriptive models of such patterns rely only on the weak assumption that an individual’s mating opportunities are spatially limited by dispersal; a large set of models, ranging from equilibrium migration-drift models to non-equilibrium models, such as recent spatial expansions of populations, give rise to the empirical pattern of isolation by distance.
In this paper, we present a statistical framework for studying the spatial distribution of genetic variation and genetic admixture based on a flexible parameterization of the relationship between genetic and geographic distances. Within this framework, the pattern of genetic relatedness between the samples is represented by a map, in which inferred distances between samples are proportional to their genetic differentiation, and long distance relatedness (in excess of that predicted by the map) is modeled as genetic admixture. These ‘geogenetic’ maps are simple, intuitive, low-dimensional summaries of population structure, and provide a natural framework for the inference and visualization of spatial patterns of genetic variation and the signature of genetic admixture. The implementation of this method, SpaceMix, is available at https://github.com/gbradburd/SpaceMix.
Results
Data
The genetic data we model consist of allele counts at L unlinked, bi-allelic single nucleotide polymorphisms (SNPs), sampled across K populations. After arbitrarily choosing an allele to count at each locus, denote the number of counted alleles at locus ℓ in population k as Ck,ℓ, and the total number of alleles observed as Sk,ℓ. The sample frequency at locus ℓ in population k is . Although we will refer to “populations”, each could consist of a single individual (Sk,ℓ = 2 for a diploid). We will depict results as coordinates on a map; however, the method does not require user-specified sampling locations.
We first compute standardized sample allele frequencies at locus ℓ in population k, by where is the sample allele frequency at locus ℓ in population k, and is the average of the K sample allele frequencies, weighted by mean population size. This normalization is widely used [e.g. Nicholson et al., 2002, Patterson et al., 2006]; mean-centering makes the result invariant to choice of which allele to count at each locus, and dividing by makes each locus have roughly unit variance in some sense.
We work with the empirical covariance matrix of these standardized sample allele frequencies, calculated across loci, namely, . Using the sample mean to mean-center has implications on their covariance structure, discussed in the Methods (“The standardized sample covariance”). For clarity, here we proceed as if were instead an unobserved, global mean allele frequency at locus ℓ.
Spatial Covariance Model
We wish to model the distribution of alleles among populations as the result of a spatial process, in which migration moves genes locally on an unobserved landsape. Migration homogenizes those differences between populations that arise through genetic drift; populations with higher levels of historical or ongoing migration share more of their demographic history, and so have more strongly correlated allele frequencies.
We assume that the standardized sample frequencies are generated independently at each locus by a spatial process, and so have mean zero and a covariance matrix determined by the pairwise geographic distances between samples. To build the geogenetic map, we arbitrarily choose a simple and flexible parametric form for the covariance matrix in which covariance between allele frequencies decays exponentially with a power of their distance [Diggle et al., 1998, Wasser et al., 2004, Bradburd et al., 2013]: the covariance between standardized population allele frequencies (i.e. values) between populations i and j is assumed to be, for i ≠ j, where Di,j is the geogenetic distance between populations i and j, α0 controls the within-population variance (or the covariance when distance between points is 0, known as a “sill” in the geospatial literature), α1 controls the rate of the decay of covariance per unit pairwise distance, and α2 determines the shape of that decay.
Within population variance may vary across samples due to either noise from a finite sample size or demographic history unique to that sample (e.g., bottlenecks or endogamy). To accommodate this heterogeneity we introduce population-specific variance terms, resulting in the covariance matrix for standardized sample frequencies where δi,j = 1 if i = j and is 0 otherwise, ηk is a nonnegative sample-specific variance term (nugget) to account for variance specific to population k that is not accounted for by the spatial model, and is the mean sample size across all loci in population k, so that accounts for the variance introduced by sampling within the population.
The distribution of the sample covariance matrix is not known in general, but the central limit theorem implies that if the number of loci is large, it will be close to Wishart. Therefore, we assume that is Wishart distributed with degrees of freedom equal to the number of loci (L) used and mean equal to the parametric form Ω given in equation (3). We denote this by
Note that if the standardized sample frequencies are Gaussian, then the sample covariance matrix is a sufficient statistic, so that calculating the likelihood of is the same as calculating the likelihood of the data up to a constant. Handily, it also means that once the sample covariance matrix has been calculated, all other computations do not scale with the number of loci, making the method scalable to genome size datasets.
Location Inference
Non-equilibrium processes like long distance admixture, colonization, or population expansion events will distort the relationship between covariance and distance across the range, as will barriers to dispersal on the landscape. To accommodate these heterogeneous processes we infer the locations of populations on a map that reflects genetic, rather than geographic, proximity. To generate this map, we treat populations’ locations (i.e. coordinates in the geoge-netic map) as parameters that we estimate with a Bayesian inference procedure (described in the Methods). These location parameters for each population are denoted by G, and determine the matrix of pairwise geogenetic distances between populations, D(G), which together with the parameters and η determine the parametric covariance matrix Ω (given by equation (3)). We acknowledge this dependence by writing .
The prior distributions on the parameters that control the shape and scale of the decay of covariance with distance ( and η) are given in the Methods. The priors on the geogenetic locations, G, are independent across populations; because the observed locations naturally inform the prior for populations locations, we use a very weak prior on population k’s location parameter (Gk) that is centered around the observed location. This prior on geogenetic locations also encourages the resulting inferred geogenetic map to be anchored in the observed locations and to represent (informally) the minimum distortion to geographic space necessary to satisfy the constraints placed by genetic similarities of populations. In practice, we also compare results to those produced using random locations as the “observed” locations, and can change the variance on the spatial priors to ascertain the effect of the prior on inference.
We then write the posterior probability of the parameters as where P() denotes the various priors, and the constant of proportionality is the normalization constant.
We then use a Markov chain Monte Carlo algorithm to estimate the posterior distribution on the parameters as described in more detail in the Methods.
Illustrated with Simulations
We first apply the method to several scenarios simulated using the coalescent simulator ms [Hudson, 2002]. Each scenario is simulated using a stepping stone model in which populations are arranged on a grid with symmetric migration to nearest neighbors (eight neigbors, including diagonals) with 10 haploid individuals sampled from every other population at 10,000 unlinked loci (for details on all simulations, see Methods and Appendix). The basic scenario is shown in Figure 1a, which is then embellished in various ways. In the SpaceMix analysis of each simulated dataset, we treat population locations as unknown parameters to be estimated as part of the model, and center the priors on each population’s location at a random point. The resulting geogenetic maps are produced from the parameters having maximum posterior probability. Since overall translation, rotation, and scale are nuisance parameters, we present inferred locations after a Procrustes transformation (best-fit rotation, translation, and dilation) to match the coordinates used to simulate the data.
The lattice scenarios, illustrated in Figures 1 and 2, are: homogeneous migration rates across the grid; a longitudinal barrier across the center of the grid; a series of recent expansion events; and an admixture event between opposite corners of the lattice. In the simple lattice scenario with homogeneous migration rates (Figures 1a and 1d), SpaceMix recovers the lattice structure used to simulate the data (i.e., populations correctly choose their nearest neighbors). After adding a longitudinal barrier to dispersal across which migration rates are reduced by a factor of 5 (Figure 1b), the two halves of the map are pushed farther away from one another, reflecting the decreased gene flow between them.
In the expansion scenario, in which all populations in the last five columns of the grid have expanded simultaneously in the immediate past from the nearest population in their row (Figure 1c), the daughter populations of the expansion event cluster with their parent populations, reflecting the higher relatedness (per unit of geographic separation) between them. In all scenarios, populations at the corners of the lattice are pulled in somewhat because these have the least amount of data informing their relative placements. In Figure S1, we show the relationship between genetic covariance, geographic distance, and inferred geogenetic distance for these simulations.
We next simulated a long-distance admixture event on the same grid, by sampling half of the alleles of each individual in the northeast corner population from the southwest corner population (Figure 2a). We then ran a SpaceMix analysis in which the locations of these populations were estimated (Figure 2b). The admixture creates excess covariance over anomalously long distances, which is clearly difficult to accommodate with a two-dimensional geogenetic map. Figure 2b shows the torturous lengths to which the method goes to fit a good geogenetic map: the admixed population 30 is between population 1, the source of its admixture, and populations 24, 25, and 29, the nearest neighbors to the location of its non-admixed portion. However, this warping of space is difficult to interpret, and would be even more so in empirical data for which a researcher does not know the true demographic history.
Inference of Spatial Admixture
To incorporate recent admixture, we allow each allele sampled in population k to have a probability wk (0 ≤ wk ≤ 0.5) of being sampled from location which we refer to as population k’s source of admixture, and a probability 1-wk of being sampled from location Gk. With no nugget, each allele would be sampled independently, but the nugget introduces correlations between the alleles sampled in each population.
With this addition, the parametric covariance matrix before given by (3) becomes a function of all the pairwise spatial covariances between the locations of populations i and j and the points from which they draw admixture (illustrated in Figure 3); now, we model the covariance between and , for each ℓ, as where D is the 2k × 2k matrix of pairwise distances between all inferred locations and sources of admixture, and for readability, we denote, e.g., F(D(Gi, G*j)), as F(Di,j*). The spatial covariance, F(D), is as given in equation (2), and we reintroduce the nugget, ηk, and the sample size effect, , for each population as above in Eqn. (3).
We proceed in our inference procedure as before, but now with the locations of the sources of admixture and the admixture proportions to infer. The likelihood of the sample covariance matrix is exactly as before in (4), except with Ω replaced by Ω*. The posterior probability of these parameters can be expressed as a function of this parametric admixed covariance, Ω*, as specified by the parameters w, G*, , and η, and the inferred locations, G. We place a weak spatial prior on the sources of admixture, G* around the centroid of the observed locations. The admixture proportions, w, are capped at 0.5, to ensure identifiability, and are heavily weighted towards small values to be conservative with respect to admixture inference. These priors are detailed in Table 2.
The models described above may be used in various combinations. In the simplest model, populations do not choose their own locations, nor are they allowed to draw admixture; the only parameters to be estimated are those of the spatial covariance function given in equation (2), and the population-specific variance terms (ηi). In the most complex model, population locations, the locations of their sources of admixture, and the proportions of admixture are all estimated jointly in addition to the parameters of the spatial covariance function and the population specific variances. Users may wish to employ the more constrained models (e.g., fixing the locations or admixture proportions for some or all samples) in a model selection framework to test specific hypotheses.
Allowing admixture gives sensible results for the scenario of Figure 2a: in the resulting map, the only population that draws substantial admixture is the one that is actually admixed, and it draws admixture (95% CI: 0.36 - 0.40) from the correct location (Figure 2c).
A more subtle simulated admixture scenario, with admixture proportion of 10% across a geographic barrier, is shown Figure 4a. The resulting SpaceMix map (Figure 4b), separates the east and west sides of the grid to accommodate the effect of the barrier, and the admixed population (population 23) chooses admixture from very close to its true source (population 13), and in close to the correct amount .
Another difficult scenario is shown in Figure 4c, where 40% admixture has occurred between two populations immediately adjacent to each other on either side of a barrier. Here, the admixed population 18 is correctly identified as admixed, but it explains its intermediate genetic relationships by taking a location close to its true admixture source (population 13), and drawing admixture (95% CI: 0.04 - 0.14) from a location on the far margin of the half of the grid on its own side of the barrier. Because there is no sampled intervening population between admixed population 18 and its source of admixture 13, there is nothing to stop 18 from explaining its higher covariance with 13 via its chosen location G(18) rather than via that of its source of admixture G*(18). In each of these scenarios, the estimated admixture proportion is less than that used to simulate the data. This is due to the stringent prior we place against admixture. We discuss these examples further in the Methods.
Empirical Applications
To demonstrate the applications of this novel method, we analyzed population genomic data from two systems: the greenish warbler ring species complex, and a global sampling of contemporary human populations. Maps showing our sampling in these two systems are given in Figure 5, and information on the specific samples included is given in the Supplementary Materials, Tables S1 and S2. For all analyses presented below, we used random ‘observed’ locations as the priors on population locations. The geogenetic maps shown here were maximum a posteriori estimates (over all parameters). For clarity and ease of interpretation, we then present a full Procrustes superimposition of the inferred population locations (G) and their sources of admixture (G*), using the observed latitude and longitude of the samples/individuals to give a reference position and orientation.
Greenish Warblers
The greenish warbler (Phylloscopus trochiloides) species complex is broadly distributed around the Tibetan plateau, and exhibits gradients around the ring in a range of phenotypes including song, as well as in allele frequencies [Ticehurst, 1938, Irwin et al., 2001, 2005]. At the northern end of the ring in central Siberia, where the eastern and western arms of population expansion meet, there are discontinuities in call and morphology, as well as reproductive isolation and a genetic discontinuity [Irwin et al., 2001, 2008]. It is proposed that the species complex represents a ring species, in which selection and/or drift, acting in the populations as they spread northward on either side of the Tibetan plateau, have led to the evolution of reproductive isolation between the terminal forms.
The question of whether it fits the most strict definition of a ring species focuses on whether gene flow along the margins of the plateau has truly been continuous throughout the history of the expansion or if, alternatively, discontinuities in migration around the species complex’s range have facilitated periods of differentiation in genotype or phenotype without gene flow [Mayr, 1942, 1970, Coyne, 2004] (see Wake and Schneider [1998] for discussion). Alcaide et al. [2014] have suggested that the greenish warbler species complex constitutes a ‘broken’ ring species, in which historical discontinuities in gene flow have facilitated the evolution of reproductive isolation between adjacent forms.
To investigate this question, we applied SpaceMix to the dataset from Alcaide et al. [2014], consisting of 95 individuals sampled at 22 distinct locations and sequenced at 2,334 SNPs, of which 2,247 were bi-allelic and retained for SpaceMix runs. The libraries were prepared using a genotype-by-sequencing protocol and were run on an Illumina HiSeq 2000 with a paired-end sequencing protocol [Alcaide et al., 2014].
We first ran SpaceMix on the population dataset, with no admixture. The resulting inferred map (Figure 6a) largely recapitulates the geography of the sampled populations around the ring. The Turkish population (TU, Phylloscopus trochiloides ssp. nitidus) clusters with the populations in the subspecies ludlowi, due to its recent expansion, but also chooses a relatively high nugget parameter (see Figure S5a), reflecting the population history it does not share with its ludlowi neighbors. In the North, where the twin waves of expansion around the Tibetan Plateau are hypothesized to meet, the inferred geogenetic distance between populations from opposite sides of the ring was much greater than their observed geographic separation, reflecting the reproductive isolation between these adjacent forms (see Figure S2).
We then ran the method allowing admixture (Figure 6b). The only population sample with appreciable admixture is the Stolby sample (ST; w = 0:19, 95% credible interval: 0.146-0.238). This sample is known to be composed of an equal mixture of eastern plumbeitarsus and western viridanus individuals [Alcaide et al., 2014]. Multiple runs agreed well on the level of admixture of the Stolby sample (see Figure S3). What does vary across runs is whether the Stolby sample chooses to locate itself by the viridanus cluster and draw admixture from near the plumbeitarsus cluster or vise versa; however, this is to be expected given the 50/50 nature of the sample’s makeup (Figure S3). The somewhat intermediate position of the Stolby sample, and its non-50/50 admixture proportion, likely partially reflect the influence of the priors (Figure S3).
We repeated these analyses (with and without admixture) on an individual level (Figure 7). No individual drew appreciable admixture (see Figure S11 for admixture proportions), and so we discuss the results with admixture (those without admixture are nearly identical, see Figure S8). As with the analysis on multi-sample populations, the results approximately mirror the geography of the individuals.
There are, however, a number of obvious departures in the individual geogenetic map from the population map. The most obvious is that the location of a pair of nitidus samples (in purple) is very far from the rest of the samples. These individuals appear to be fairly close relatives: in the population-level analysis of Figure S5a, this increase in shared ancestry was accounted for by a large nugget for the nitidus population; but in the individual-level analysis, they cannot share a nugget parameter, and must therefore choose a location close to each other and far from the rest of the samples. The same phenomenon seems to be at work in determining the locations of a pair of individuals, one identified as P. t. ludlowi (Lud-MN3), one as P. t. trochiloides (Tro-LN11), as they also show an unusually low pairwise sequence divergence (see Figure S13).
The split between viridanus and plumbeitarsus individuals (blue and red, respectively), in the north at the contact zone of the two waves of expansion, is clearer now than in the population-based analysis, as individuals from the Stolby population have moved to near their respective clusters. Despite the fact that viridanus and plumbeitarsus individuals have moved away from each other in our geogenetic map, they are still closer to each other than we would expect if all gene flow between the two was mediated by the southern populations, in which case we would expect the populations to form a line, with viridanus at one end and plumbeitarsus at the other. This horseshoe, with viridanus and plumbeitarsus at its tips, is steady within and among runs of the MCMC and choice of position priors (see Figures S9a–S9c).
Is this biologically meaningful? A similar horseshoe shape appears when a principal components (PC) analysis is conducted and individuals are plotted on the first two PCs [see Figure S7 and Alcaide et al., 2014]. However, as discussed by Novembre and Stephens [2008], such patterns in PC analysis can arise for somewhat unintuitive reasons. If populations are simulated under a one dimensional stepping stone model, then plotting individuals on the first two PCs results in a horseshoe (e.g. see Figure S19b) not because of gene flow connecting the tips, but rather because of the orthogonality requirement of PCs (see Novembre and Stephens [2008] for more discussion). In contrast, when SpaceMix is applied to one dimensional stepping stone data, the placement of samples is consistent with a line (see Figures S19c, S19d). The proximity of viridanus and plumbeitarsus in geogenetic space may be due to gene flow between the tips of the horseshoe north of the Tibetan Plateau. This conclusion is in agreement with that of Alcaide et al. [2014], who observed evidence of hybridization between viridanus and plumbeitarsus using assignment methods.
The SpaceMix map also diverges from the observed map in the distribution of individuals from the subspecies ludlowi (in green). These samples were taken from seven sampling locations along the southwest margin of the Tibetan Plateau, but, in the SpaceMix analysis, they partition into two main clusters, one near the trochiloides cluster, and one near the viridanus cluster. This break between samples from the same subspecies, which is concordant with the findings of Alcaide et al. [2014], makes the ludlowi cluster unusual compared to the estimated spatial distributions of the other subspecies (see Figure S12), and suggests a break in historic or current gene flow.
Human Populations
Human population structure is a complex product of the forces of migration and drift acting on both local and global scales, patterned by geography [Novembre et al., 2008, Ralph and Coop, 2013], time [Skoglund et al., 2012, 2014], admixture [Hellenthal et al., 2014], landscape and environment [Beall et al., 2010, Bigham et al., 2010, Bradburd et al., 2013], and shaped by culture [Reich et al., 2009, Atzmon et al., 2010, Moorjani et al., 2011]. To visualize the patterns these processes have induced, we create a geogenetic map for a worldwide sample of modern human populations. In doing so we fully acknowledge that human history at these geographic scales has many aspects that are not well captured by isolation distance or simple admixture models. To simplify the discussion of our results, we talk about samples’ locations and those of their sources of admixture, but of course both reflect the compounding of drift and gene flow over many historical processes. We therefore urge caution in the interpretation our results, and view them as a simplistic but rich visualization of patterns of population structure in humans.
We used a random subset of 10,000 SNPs from the SNP dataset of Hellenthal et al. [2014], which is comprised of 1,490 individuals from 95 population samples (see Figure 5b for map of sampling), as well as the latitude and longitude attributed to each sample. We ran two sets of SpaceMix analyses: in the first, we estimated population sample locations, and in the second, we also allowed admixture. We note that few of the putative admixture events that we report have escaped the notice of previous investigators, which is unsurprising given the depth of recent attention on human admixture studies, particularly on the subset of HGDP samples [see Rosenberg et al., 2002, Li et al., 2008, Loh et al., 2013, Patterson et al., 2012, Hellenthal et al., 2014, for various global analyses]. However, what is novel here is the ability to visualize these admixture events in a geographic context, and that these admixture signals stand out against a null model of isolation by distance (rather than tree-based models).
When population samples choose only their own locations, the map roughly recapitulates the geography of the samples (Figure 8a), a result that holds nicely when we zoom in on the more heavily sampled area of Eurasia (Figure 8b). We see that samples both in the Americas and in Oceania cluster close to the East Asian samples, but that the two clusters are on opposite sides. The proximity of these groups to the East Asians represents the fact that both groups share an ancestral population in the relatively recent past with East Eurasian populations, but the two expansions occurred independently. As in our simulations (Figure 1f) population expansions/bottlenecks have distorted the relationship between geographic and geogenetic distance. In the SpaceMix analysis of the human genetic data, the scale of inferred inter-population sample distance within Africa is much greater than that between any other group (see Figure S15), and the slope of the relationship between observed and estimated geographic distances between populations on each continent decays with distance from Africa. This pattern is consistent with a history of human colonization events characterized by serial bottlenecks [Harpending and Rogers, 2000, Prugnolle et al., 2005, Ramachandran et al., 2005] following an out-of-Africa expansion, and subsequent expansions into Western Eurasia, East Asia, the Americas, and Oceania [but see Pickrell and Reich, 2014, for a discussion of other models].
To investigate possible patterns of admixture further, we ran a SpaceMix analysis with admixture (results shown in Figures 9 and 10). The biggest change between the geogenetic map of human populations inferred with admixture and that without is the positioning of African samples with respect to the rest of the world. The relatively large geogenetic distances between these groups reflects the fact that Eurasian, North African, Oceanian, and American populations all share relatively large amounts of drift (population history) not shared with the Sub-Saharan African samples. The inclusion of admixture allows samples that fall intermediate between Sub-Saharan Africa and North Africa and the Middle East to move closer to one or the other, which, in turn, allows each of those major clusters to move relatively farther apart. The Ethiopian and Ethiopian Jewish samples move to be closer to the Sub-Saharan samples than the of the North African samples, but draw substantial amounts of admixture (~40%) from close to where the Egyptian sample has positioned itself in the the Middle East cluster, as do the Sandawe [Hodgson et al., 2014, Pickrell et al., 2012]. The SanKhomani draw admixture from near Syria, which may reflect multiple distinct geographic sources of admixture as discussed by [Hellenthal et al., 2014] and [Pickrell et al., 2014]. Interestingly the Bantu South African sample, though it moves to join the other Bantu samples, draws admixture from close to the San populations. This is consistent with previous signals of the expansion of Bantu-speaking peoples into southern Africa [Pickrell et al., 2012, Schlebusch et al., 2012, Pickrell et al., 2014, Hellenthal et al., 2014].
The majority of North African samples (Egyptian, Tunisian, Morocan, Mozabite) move to join the Middle Eastern samples (positioning in rough accord with their sampling location along North Africa), and draw admixture from near the Ethiopian samples. All of the Middle Eastern samples draw admixture from close to the location chosen by the Ethiopian samples and where most of the North African samples draw admixture from, representing the complex history of North African-Middle Eastern gene flow [Henn et al., 2012, Hellenthal et al., 2014].
A number of other population samples draw admixture from Africa. The Sindhi, Makrani, and Brahui draw admixture from close to the location of the Bantu samples [Hellenthal et al., 2014], and the Balochi and Kalash draw admixture from some distance away from African population samples. Of the European samples, the Spanish and the East and West Sicilian samples all draw small amounts of admixture from close to the Ethiopian samples, presumably reflecting a North African ancestry component [Moorjani et al., 2011, Botigu et al., 2013].
The other dominant signal of admixture is between East and West Eurasia [a signal documented by many authors: Rosenberg et al., 2002, Li et al., 2008, Xu and Jin, 2008, Hellenthal et al., 2014]. The majority of samples maintain their relative positions within each of these groups; however, several of the populations that chose locations intermediate between eastern and western Eurasia (in the SpaceMix analysis without admixture) now move towards one side and draw admixture from the other. The Uzbekistani and Hazara samples move to be closer to the East Asian samples, while drawing a substantial admixture proportion from close to where the Georgian and Armenian samples have located themselves, while conversely the Uygur sample moves to be close to the Burusho, Kalash, and Pathan samples.
The Tu sample (with a geogenetic location in East Asia) draws a small amount of ancestry from close to where the Uygur have positioned themselves. The Chuvash move close to Russian and Lithuanian samples, drawing admixture from close to the Yakut; the Turkish sample also draws a smaller amount of admixture from there. There are several other East-West connections: the Russian and Adygei samples have admixture from a location “north” of the East Asian samples, and the Cambodia sample draws admixture from close to the Eygptian sample [Pickrell and Pritchard, 2012, Hellenthal et al., 2014].
There are also a number of samples that draw admixture from locations that are not immediately interpretable. For example, the Hadza and Bantu Kenyan samples draw admixture from somewhat close to India, and the Xibo and Yakut from close to “northwest” of Europe. The Pathan samples draw admixture from a location far from any other samples’ locations, but close to where the India samples also draws admixture from. The Myanmar and the Burusho samples both draw admixture far from the locations estimated for other samples as well.
There are a number of possible explanations for these results. As we only allow a single admixture arrow for each sample, populations with multiple, geographically distinct sources of admixture may be choosing admixture locations that average over those sources. This may be the case for the Hadza and Bantu Keynan samples [Hellenthal et al., 2014]. A second possibility is that the relatively harsh prior on admixture proportion forces samples to choose lower proportions of admixture from locations that overshoot their true sources; this may explain the Xibo and Yakut admixture locations. A final explanation is that good proxies for the sources of admixture may not be included in our sampling, either because of of the limited geographic sampling of current day populations, or because of old admixture events from populations that are no longer extant. The admixture into the Indian and Pathan samples (whose admixture source also clusters with the Indian Jew samples in some MCMC runs) may be an example of this; Reich et al. [2009] and Moorjani et al. [2013] have hypothesized that many populations from the Indian subcontinent may be descended from an admixture event involving an ancestral Southern Indian population not well represented by our samples.
In Figures S17 and S18, we show the results of other independent MCMC analyses on these data. The broad-scale patterns and results discussed above are consistent across these runs. However, as is to be expected, there is significant heterogeneity in the exact layout of sample and admixture locations. For example, there is some play, among MCMC runs, in the internal orientation of the African locations with respect to Eurasia. Samples that draw a significant amount of admixture, such as the central Asian populations (Uygur, Hazara and Uzbekistani), switch their location with that of their source of admixture (as was also seen across MCMC runs in the warbler data analysis). Similarly the Ethiopian and Ethiopian Jew samples choose locations, in some MCMC runs, close to the other North African samples, and draw admixture from near the Sub-Saharan samples (as do the other North African samples).
Discussion
In this paper we have presented a statistical framework for modeling the geography of population structure from genomic sequencing data. We have demonstrated that the method, SpaceMix, is able to accurately present patterns of population structure in a variety of simulated scenarios, which included the effects of spatially heterogeneous migration, population expansion, and population admixture. In empirical applications of SpaceMix, we have largely recovered previously estimated population relationships in a circum-Tibetan sample of greenish warblers and in a global sample of human populations, while also providing a novel way to depict these relationships. The geogenetic maps SpaceMix generates serve as simple, intuitive, and information-rich summaries of patterns of population structure. SpaceMix combines the advantages of other methods for inferring and illustrating patterns of population structure, using model-based inference to infer population relationships (like TreeMix [Pickrell and Pritchard, 2012], and MixMapper [Lipson et al., 2013]), and producing powerful visualizations of genetic structure on a map (like PCA [Patterson et al., 2006] and SPA [Yang et al., 2014]).
The patterns of genetic variation observed in modern populations are the product of a complex history of demographic processes. We choose to model those patterns as the outcome of a spatial process with geographically determined migration, and we have included statistical elements to accommodate deviations from spatial expectations. However, the true history of a sample of real individuals is vastly more complex than any low-dimensional summary, and, as with any summary of population genetic data, SpaceMix results should be interpreted with this in mind. Furthermore, our “admixture" events are shorthands for demographic relationships that occurred over possibly substantial lengths of time and regions of the globe; approximating this by a single point in space is certainly an over simpliflication. Aspects of population history that are better described as a population phylogeny may be difficult to interpret using SpaceMix, and may be better suited to visualization with hierarchical clustering-based methods [Pritchard et al., 2000] or TreeMix/MixMapper-like methods [Pickrell and Pritchard, 2012, Lipson et al., 2013]. There is obviously no one best approach to studying and visualizing population structure; investigators should employ a range of appropriate methods to identify those that provide useful insight.
SpaceMix offers much of the flexibility of PCA - like PCA, it is well suited to describing population structure in a continuous fashion - but it also has a number of advantages over PCA. When isolation by distance holds, the first (one or) two PCs often correspond to some simple rotation of latitude and longitude; however, these first two PCs explain a relatively small part of the total variance of the data. Due to the linearity of PCs, many higher order PCs have to correspond to higher order functions of positional information under isolation by distance, and are therefore needed to explain the bulk of the variance [Novembre and Stephens, 2008]. These higher order PCs can be hard to interpret in empirical data (see discussion in the warbler section). The recently introduced SPA approach [Yang et al., 2012], since it assumes allele frequencies are monotonically increasing in a given direction, may also suffer from this problem (although we note that PCA and SPA both have significant speed advantages over SpaceMix).
In comparison, if isolation by distance holds then (nearly) all of the variance will be captured by SpaceMix in the inferred geogenetic positions (to the extent to which the parametric form of the covariance is flexible enough to capture the empirical decay of covariance with distance). The application of SpaceMix to humans nicely illustrates the utility of our approach: the first two PCs of this dataset resemble a boomerang (Figure S16), with its arms corresponding to the Africa/Non-Africa split and the spread of populations across Eurasia. In contrast, while the SpaceMix geogenetic map is dominated by the genetic drift induced by migration out of Africa, it also captures much more detail than is contained in the first two PCs (e.g., Figure 9b). This comparison is also nicely illustrated by the example in Figure S19.
An advantage of PCA is that it can explain more complex patterns of population structure by allowing up to K different axes. Although SpaceMix can easily be extended to more than two dimensions, simply by allowing Gi to describe the location of a sample in d dimensions, the interpretation and visualization of these higher dimensions would prove difficult, and so for the moment we stick with two dimensions. On the other hand, SpaceMix can describe in two dimensions patterns that PCA, due to the constraints of linearity, would need more to describe.
Another strong advantage of SpaceMix over current methods is the introduction of admixture arrows. Although PCA can be interpreted in light of simple admixture events [McVean, 2009], and Yang et al. [2012, 2014] can locate the recent, spatially admixed ancestry of out of sample individuals, neither approach explicitly models admixture between multiple geographically distant locations, as SpaceMix does. Assignment methods are designed to deal with many admixed samples [Pritchard et al., 2000], but they have no null spatial model for testing admixture. We feel that an isolation by distance null model is often more appropriate for testing for admixture, especially when there is geographically dense sampling. SpaceMix offers a useful tool to understand and visualize spatial patterns of genetic relatedness when many samples are admixed.
As currently implemented, SpaceMix allows each population to have only a single source of admixture, but some modern populations draw substantial proportions of their ancestry from more than two geographically distant regions. In such cases the inferred source of admixture may fall between the true locations of the parental populations. Although it is statistically and computationally feasible to allow each population to choose more than one source of admixture, we were concerned about both the identifiability and the interpretability of such a model, and have not implemented it. However, there may be empirical datasets in which such a modeling scheme is required to effectively map patterns of population structure. In addition, we have assumed that only single populations are admixed, when in fact it is likely that particular admixture events may affect multiple samples.
One concern is that the multiple admixed samples (from a single admixture event) may simply choose to cluster close to each other, and not need to draw admixture from elsewhere due to the fact that their frequencies are well described by their proximity to other admixed populations. Along these lines, it is noticeable that many of our European samples draw little admixture from elsewhere [also noted by Hellenthal et al., 2014, using a different approach], despite evidence of substantial admixture [Lazaridis et al., 2014]. This may reflect the fact that all of the European samples are affected by the admixture events, and are relatively over-represented in our sample. However, this may also simply reflect the fact that the admixture is ancient, and the ancient individuals used as proxy sources of these events are not well represented by our extant sampling. Reassuringly, we see multiple cases where similarly admixed populations (Central Asians, Middle Eastern, and North African) populations are separately identified as admixed. This suggests that geogenetic clustering (in lieu of drawing admixture) of populations that share similar histories of admixture is not a huge concern (at least in some cases). The method could in theory be modified to allow geogenetically proximal populations to draw from the same admixture event; however, this may be difficult to make fully automated.
In this paper we have focused on the covariance among alleles at the same locus, but linkage disequilibrium (LD; covariance of alleles among loci) holds rich information about the timing and source of admixture events [e.g. Chakraborty and Weiss, 1988, Moorjani et al., 2013, Hellenthal et al., 2014, Gravel, 2012] as well as information about isolation by distance [Ralph and Coop, 2013]. Just as population graph approaches have been extended to incorporate information from LD [Loh et al., 2013], a spatial covariance approach could be informed by LD. A null model inspired by models of LD under isolation by distance models [De and Durrett, 2007, Barton et al., 2013] could be fitted, allowing the covariance among alleles to decay with their geographic distance and the recombination distance between the loci. In such a framework, sources and time-scales of admixture could be identified through unusually long-distance LD between geographically separated populations.
The landscape of allele frequencies on which the location of populations that were the source of population’s admixture are estimated is entirely informed by the placement of other modern samples, even though the admixture events may have occurred many generations ago. This immediately leads to the caveat that, instead of “location of the parental population”, we should refer to the “location of the closest descendants of the parental population”. The increased sequencing of ancient DNA [see Pickrell and Reich, 2014, for a recent review] promises an interesting way forward on that front, and it will also be exciting to learn where ancient individuals fall on modern maps, as well as how the inclusion of ancient individuals changes the configuration of those maps [Skoglund et al., 2014]. The inclusion of ancient DNA samples in the analyzed sample offers a way to get better representation of the ancestral populations from which the ancestors of modern samples received their admixture. However, it is also possible to model genetic drift as a spatiotemporal process, in which covariance in allele frequencies decays with distance in both space and in time. We are currently exploring using ancient DNA samples as ‘fossil calibrations’ on allele frequency landscapes at points in the past, so that modern day samples may draw admixture from coordinates estimated in spacetime.
Methods
Here we describe in more detail the algorithm we use to estimate the posterior distribution defined by (7) of the population locations, G, their sources of admixture, G*, their admixture proportions, w, their independent drift parameters, η, and the parameters of the model of isolation by distance, . First, we give the exact form of the covariance matrix we use, and then describe the Markov chain Monte Carlo algorithm that samples parameter values from the posterior distribution.
The standardized sample covariance
As motivation, consider several randomly mating (Wright-Fisher) populations that all split from an ancestral population in which a neutral allele is present at frequency ϵℓ, and then subsequently exchange migrants. Since the allele is neutral, the mean change in its frequency in each population after t generations is zero, and if t is much smaller than the population size (so the frequencies remain close to ϵℓ), the variance is proportional to . Conveniently, additional variance introduced by binomial sampling of alleles is also proportional to . It would then be natural to consider the covariance matrix of since these standardized allele frequencies would be independent if the loci are unlinked, and would have mean zero and variance independent of the sample sizes or allele frequencies. The central limit theorem would then imply that in the limit of a large number of loci, the sample covariance matrix XT X is Wishart with degrees of freedom equal to the number of loci and mean determined by the pattern of migration.
Although the conditions are not strictly met, these theoretical considerations indicate that such a normalization may be a reasonable thing to do, even after substituting the empirical mean allele frequency in place of ϵℓ, which is what we do to define in equation (1). Recall that the sample allele frequency at locus ℓ in population k is given by , where Ck,ℓ is the number of (arbitrarily chosen) counted alleles, and Sk,ℓ is the total number of sampled alleles. As sample size may vary across loci, we first calculate , the mean sample size in population k, as . We then compute the global mean allele frequency at locus ℓ as
If sample size were constant across all loci in each population, this would be equivalent to defining the variance-normalized sample frequencies and writing where T is the mean centering matrix whose elements are given by where δi,j = 1 if i = j and is 0 otherwise. If the covariance matrix of Y is Ω*, then the covariance matrix of would be TTΩ*T. Since allowing T to vary by locus would be computationally infeasible, we make one final assumption, that the covariance matrix of the standardized frequencies at each locus is given by TTΩ*T. This makes it inadvisable to include loci for which there are large differences in sample sizes across populations. This mean centering acts to to reduce the covariance among populations in compared to , and can induce negative covariance between more unrelated populations (as, across loci, they are often on opposite sides of the mean).
Additionally, the covariance matrix of the standardized frequencies has rank K-1 rather than K, and so the corresponding Wishart distribution is singular. To circumvent this problem we compute the likelihood of a (K-1)-dimensional projection of the data. Any projection would do; we choose a projection matrix ψ by dropping the last column of the orthogonal matrix in the QR decomposition of T, and compute the likelihood of the empirical covariance matrix of allele frequencies as
Markov chain Monte Carlo Inference Procedure
The inference algorithm described here may be used to estimate the parameters with any of these held fixed, for instance: (1) population locations are fixed, and they do not draw any admixture; (2) populations may choose their own locations, but not admixture; (3) populations may draw admixture, but their own locations are fixed; or (4) populations may both choose their own locations and draw admixture. The free parameters for each of options are given in Table 1.
Below, we outline the inference procedure for the most parameter-rich model (inference on both population locations, their sources of admixture, and the proportions in which they draw admixture, in addition to inference of the parameters of the spatial covariance function). A table of all parameters, their descriptions, and their priors is given in Table 2.
We now specify in detail the Markov chain Monte Carlo algorithm we use to sample from the posterior distribution on the parameters, for Bayesian inference.
We assume that the user has specified the following data:
the allelic count data, C, from K population over L variant loci, where Ck,ℓ gives the number of observations of a given allele at locus ℓ in population k.
the sample size data, S, from K population over L variant loci, where Sk,ℓ gives the total number of alleles typed at locus ℓ in population k.
It is not necessary, but a user may also specify
the geographic sampling locations, G(obs), from each of the K populations, where Gk(obs) gives the longitude and latitude of the kth sampled individual(s).
The geographic location data may be missing, or generated randomly, for some or all of the samples; if so, the spatial priors on estimated population locations, G, and their sources of admixture, G* will not be tethered to the true map.
Initiating the MCMC
We then calculate the standardized sample covariance matrix as described in the section “The standardized sample covariance” above, as well as , the mean sample size across loci for each population. Armed with the standardized sample covariance, the geographic sampling locations, and the inverse mean sample sizes across samples , we embark upon the analysis.
To initiate the chain, we specify starting values for each parameter. We draw initial values for α0, α1, α2, η and w randomly from their priors. We initiate G at user-specified geographic sampling locations and G* at randomly drawn, uniformly distributed values of latitude and longitude in the observed range of both axes.
Overview of MCMC procedure
We use a Metropolis-Hastings update algorithm. In each iteration of the MCMC, one of our current set of parameters is randomly chosen to be updated by proposing a new value. In the cases of {η, w, G, G*}, where each population has its own parameter, a single population, k is randomly selected and only its parameter value (e.g. ηk) is chosen to be updated. Below we outline the proposal distributions for each parameter. This gives us a proposed update to our set of parameters Θ′, which differs from Θ at only one entry.
The set of locations of populations and their sources of admixture specify a pairwise geographic distance matrix D, which, given the current and η parameters, gives the admixed covariance matrix described in (6), Ω*. The likelihood of the two sets of parameters Θ and Θ′, calculated with (12) and the priors of Table 2, combine to give the Metropolis-Hastings ratio, R, the probability of accepting the proposed parameter values Θ′:
Note that all of our moves, described below, are symmetric, so the Hastings ratio is always 1. If we accept our proposed move, Θ is replaced by Θ′ and this is recorded, otherwise Θ′ is discarded and we remain at Θ.
Updates for , η, and w
We propose updates to the values of the , η and w parameters via a symmetric normal density with mean zero and variance given by a tuning parameter specific to that parameter. For example, , where and σα0 is the tuning parameter for α0. For η and w, each of which consists of K parameters, each parameter receives its own independent tuning parameter. If the proposed move takes the parameter outside the range of its prior, the move is rejected and we do not move in that iteration of the MCMC.
Updates for geographic coordinates G and G*
Updates to the location parameters, G and G*, are somewhat more complicated due to the curvature of the Earth. Implementing updates via a symmetric normal density on estimated latitude and longitude directly would have the drawback of a) being naive to the continuity of the spherical manifold and b) vary the actual distance of the proposed move as a function of the current lat/long parameter values (e.g., a 1° change in longitude at the equator is a larger distance than at the North Pole).
Instead, we propose a bearing and a distance traveled, and, given these two quantities and a starting position, calculate the latitude and longitude of the proposed update to the location. For example, in an update to the location of population i, Gi, we propose a distance traveled ∆Gi, where, e.g., , and a bearing, γ, where γ ~ U(0, 2π). Then we use the following equations to calculate the latitude and longitude of the proposed location: and where latitude and longitude are given in radians and are taken mod 2π. As with η and w, each population’s location and admixture source location parameters have their own tuning parameters.
Adaptive Metropolis-within-Gibbs proposal mechanism
We use an adaptive Metropolis-within-Gibbs proposal mechanism on each parameter [Roberts and Rosenthal, 2009, Rosenthal, 2011]. This mechanism attempts to maintain an acceptance proportion for each parameter as close as possible to 0.44 [optimal for one-dimensional proposal mechanisms; Roberts et al., 1997, Roberts and Rosen-thal, 2001]. We implement this mechanism by creating, for each variable i, an associated variable ζi, which gives the log of the standard deviation of the normal distribution from which parameter value updates are proposed. As outlined above, in the cases of {η, w, G, G*}, for which each population receives a free parameter, each population gets its own value of ζ.
When we start our MCMC, ζi for all parameters is initiated at a value of 0, which gives a proposal distribution variance of 1. We then proceed to track the acceptance rate, ri for each parameter in windows of 50 MCMC iterations, and, after the nth set of 50 iterations, we adjust ζi by an “adaption amount”, which is added to ζi if the acceptance proportion in the nth set of 50 iterations (ri(n)) has been above 0.44, and subtracted from ζi if not. The magnitude of the adaption amount is a decreasing function of the index n, so that updates to ζi proceed as follows:
We choose to cap the maximum adaption amount at 20 (which is the equivalent of capping the variance of the proposal distribution at 4.85 × 108) to avoid proposal distributions that offer absurdly large or small updates. This procedure, also referred to as “auto-tuning”, results in acceptance rate plots like those shown in Figure S20, and in more efficient mixing and decreased autocorrelation time of parameter estimates in the MCMC.
Simulations
We ran our simulations using a coalescent framework in the program ms (Hudson). Briey, we simulated populations on a lattice, with nearest neighbor (separated by a distance of 1) migration rate mi,j, as well as migration on the diagonal of the unit square at rate . For each locus in the dataset, we used the -s option to specify a single segregating site, and then we simulated 10,000 loci independently, which were subsequently conglomerated into a single dataset for each scenario. For all simulations, except the “Populations on a line” scenario (Figure S19), we sampled only every other population, and, from each population, we sampled 10 haplotypes (corresponding to 5 diploid individuals). In the “Populations on a line” scenario, we simulated no intervening populations, such that every population was sampled.
To simulate a barrier event, we divided the migration rate between neighbors separated by the longitudinal barrier by a factor of 5. To simulate an expansion event, we used the -ej option to move all lineages from each daughter population to its parent population at a very recent point in the past. For admixture events, we used the -es and -ej options to first (again, going backward in time) split the admixed population into itself and a new subpopulation of index k + 1, and second, to move all lineages in the (kth + 1) into the source of admixture. Forward in time, this procedure corresponds to cloning the population that is the source of admixture, then merging it, in some admixture proportion, with the (now) admixed population. The command line arguments used to call ms for a single locus for each simulation are included in the Appendix.
Intuition on identifiability of admixture parameters
A natural concern is whether all of the parameters we infer are separately identifiable, most notably whether population locations, admixture locations, and proportions can be estimated. That is, if a population has received some admixture from another population, what is to stop it from simply moving toward that population in geogenetic space to satisfy its increased resemblance to that population, rather than choosing admixture from that location? We do not provide a formal proof, but here build and illustrate some relevant intuition.
Admixture is identifiable in our model because there are covariance relationships among populations that cannot simply be satisfied by shifting population locations around (as demonstrated by the tortured nature of Figure 2b). Consider the simple spatial admixture scenario shown in Figure 11. Populations A-D are arrayed along a line, but there is recent admixture from D into B (such that 40% of the alleles assigned to B are sampled from location D). The lines show the expected covariance under isolation by distance that each population (A, C, or D, as indicated by line color) has with a putative population at a given distance. The dots show the admixed covariance between B and the three other populations, as well as B’s variance with itself (B-B) as specified by equation (6), with no nugget or sampling effect.
Due to its admixture from D, B has lower covariance with A than expected given its distance, somewhat higher covariance with C, and much higher covariance with D. In addition, the variance of B is lower than that of the other three populations, which each have variance 1: the value of the covariance when the distance is zero. This lower variance results from the fact that the frequencies at B represent a mixture of the frequency at D and the frequency at B before the admixture.
Now, using this example scenario, let us return to the concern posed above: that admixture location and population location are not identifiable. For the sake of simplicity, assume that we hold the locations of A, C, and D constant, as well as the decay of covariance with distance (as could be the case if A-D are part of a larger analysis). The covariance relationships of B to the other populations cannot be simply satisfied by moving B towards D, as B would then have a covariance with C that is higher, and a covariance with A that is lower, than that we actually observe.
Introducing admixture into the model allows it to satisfy all of these conditions: it can draw ancestry from D but keep part of its resemblance to A, it avoids B having to move closer to C, and it explains B’s low variance. Even in the absence of a sample from population C, B is better described as a linear mixture of a population close to A and D. However, there are specific scenarios in which a limited sampling scheme (both in size and location), can lead to tradeoffs in the likelihood between estimated population location and that of its source of admixture.
The analyses depicted in Figures 2c, 4b, and 4c, give examples of these tradeoffs. In each, the inferred admixture proportions in the admixed populations are less than those used to simulate the data, and the admixed populations are able to explain the high covariance they have with their sources of admixture via their inferred location, rather than just via their inferred source of admixture and admixture proportion. The reason they choose to do explain their anomalous covariance with their inferred location, rather than with their admixture source, is that we place a very harsh prior against admixture inference (Table 2). The prior is designed to make inference conservative with respect to admixture, but it has the side effect of skewing the posterior probability toward lower admixture proportions.
Empirical Applications
Below, we describe the specifics of our analyses of the greenish warbler dataset and the global human dataset. The analysis procedure for each dataset is given here:
For each analysis,
Five independent chains were run for 5×106 MCMC iterations each in which populations were allowed to choose their own locations (but no admixture). Population locations were initiated at the origin (i.e. - at iteration 1 of the MCMC, Gi = (0, 0)), and all other parameters were drawn randomly from their priors at the start of each chain.
The chain with the highest posterior probability at the end of the analysis was selected and identified as the “Best Short Run”.
A chain was initiated from the parameter values in the last iteration of the Best Short Run. Because inference of admixture proportion and location was not allowed in the five initial runs, admixture proportions were initiated at 0 and admixture locations, G* were initiated at the origin. This chain (the “Long Run”) was run for 108 iterations, and sampled every 105 iterations for a total of 1000 draws from the posterior.
For each dataset, we ran two analyses using the observed population locations as the prior on G. Then, to assess the potential influence of the spatial prior on population locations, we ran one analysis in which the observed locations were replaced with random, uniformly distributed locations between the observed minima and maxima of latitude and longitude. For the warbler dataset, we repeated this analysis procedure, treating each sequenced individual as its own population. For clarity and ease of interpretation, we present a full Procrustes superimposition of the inferred population locations (G) and their sources of admixture (G*), using the observed latitude and longitude of the populations/individuals (G) to give a reference position and orientation. As results were generally consistent across multiple runs for each dataset regardless of the prior employed, we (unless stated otherwise) present only the results from the ‘random’ prior analyses.
Finally, we compared the SpaceMix map to a map derived from a Principal Components Analysis (Patterson and Reich 2006). For this analysis, we calculated the eigendecomposition of the mean-centered allelic covariance matrix, then plotted individuals’ coordinates on the first two eigenvectors (e.g. Novembre et al 2008). For consistency of presentation, we show the full Procrustes superimposition of the PC coordinate space around the geographic sampling locations.
Acknowledgements
We would like to gratefully acknowledge the Coop Lab, Yaniv Brandvain, Razib Khan, Dan Runcie, Brad Shaffer, Marjorie Weber, and Will Wetzel for helpful discussions. We also thank Miguel Alcaide, Cristian Capelli, Garrett Hellenthal, and Darren Irwin for sharing data. This work was supported in part by NSF grant #1262645 (DBI) to PR and GC and NSF grant # 1148897 to GB (GRFP).