Geographic patterns of human allele frequency variation: a variant-centric perspective

A key challenge in human genetics is to describe and understand the distribution of human genetic variation. Often genetic variation is described by showing relationships among populations or individuals, in each case drawing inferences over a large number of variants. Here, we present an alternative representation of human genetic variation that reveals the relative abundance of different allele frequency patterns across populations. This approach allows viewers to easily see several features of human genetic structure: (1) most variants are rare and geographically localized, (2) variants that are common in a single geographic region are more likely to be shared across the globe than to be private to that region, and (3) where two individuals differ, it is most often due to variants that are common globally, regardless of whether the individuals are from the same region or different regions. To guide interpretation of the results, we also apply the visualization to contrasting theoretical scenarios with varying levels of divergence and gene flow. Our variant-centric visualization clarifies the major geographic patterns of human variation and can be used to help correct potential misconceptions about the extent and nature of genetic differentiation among populations.


Introduction
Understanding human genetic variation, including its origins and its consequences, is one of the long-standing challenges of human biology. A first step is to learn the fundamental aspects of how human genomes vary within and between populations. For example, how often do variants have an allele at high frequency in one narrow region of the world that is absent everywhere else? For answering many applied questions, we need to know how many variants show any particular geographic pattern in their allele frequencies.

In order to answer such questions, one needs to measure the frequencies of many alleles around the world without the ascertainment biases that affect genotyping arrays and other [...] provide these data, and thus present an opportunity for new perspectives on human variation.

However, large genetic data sets present a visualization challenge: how does one show the allele frequency patterns of millions of variants? Plotting a joint site frequency spectrum (SFS) is one approach that efficiently summarizes allele frequencies and can be carried out for data from two or three populations (Gutenkunst et al. 2009). For more than three populations, one must resort to showing multiple combinations of two- or three-population SFSs. This representation becomes unwieldy to interpret for more than three populations and cannot represent information about the joint distribution of allele frequencies across all populations. Thus, we need visualizations that intuitively summarize allele frequency variation across several populations.

New visualization techniques also have the potential to improve population genetics education [...] misconceptions can arise from observing how direct-to-consumer genetic ancestry tests apportion ancestry to broad continental regions.
One may mistakenly surmise from the output of these methods that most human alleles must be sharply divided among regional groups, such that each allele is common in one continental region and absent in all others. Similarly, one might mistakenly conclude that two humans from different regions of the world differ mainly due to alleles that are restricted to each region. Such misconceptions can impact researchers and the broader public alike. All of these misconceptions can potentially be avoided with visualizations of population genetic data that make typical allele frequency patterns more transparent.

Here, we develop a new representation of population genetic data and apply it to the New York Genome Center deep coverage sequencing data (see URLs) from the 1000 Genomes Project (1KGP) samples (1000 Genomes Project Consortium et al. 2015). In essence, our approach represents a multi-population joint SFS with coarsely binned allele frequencies. It trades precision in frequency for the ability to show several populations on the same plot. Overall, we aimed to create a visualization that is easily understandable and useful for pedagogy. As we will show, the visualizations reveal with relative ease many known important features of human genetic variation and evolutionary history.

This work follows in the spirit of Rosenberg (2011), who used an earlier dataset of microsatellite variation to create an approachable demonstration of the major features in the geographic distribution of human genetic variation (as well as earlier related papers such as Lewontin 1972; Witherspoon et al. 2007 [...]

Figure 1 shows the allele frequency of each variant (rows) in each of the 26 populations of the 1KGP (columns, see Supplemental Table 1 for labels).
As a convention throughout this paper, we use darker shades of blue to represent higher allele frequency, and we keep track of the globally minor allele, i.e., the rarer (< 50% frequency) allele within the full sample. The figure shows that variants seem to fall into a few major descriptive categories: variants with alleles that are localized to single populations and rare within them, and variants with alleles that are found across all 26 populations and are common among them.

Figure 1. Frequencies of the globally minor allele across 26 populations from the 1KGP for 100 randomly chosen variants from Chromosome 22. Note that the allele frequency bin spacing is nonlinear to capture variation at low as well as high frequencies.
To investigate whether such patterns hold genome-wide, we devise a scheme that allows us to represent the > 90 million single-nucleotide variants (SNVs) in the genome-wide data (see schematic, Figure 2). First, we follow the 1KGP study in grouping the samples from the 26 populations into five geographical ancestry groups: African (AFR), European (EUR), South Asian (SAS), East Asian (EAS), and Admixed American (AMR) (Figure 2A, Box 1). For clarity, we modify the original 1KGP groupings slightly for this project (by including several samples from the Americas in the AMR grouping, see Box 1). While human population structure can be dissected at much finer scales than these groups (e.g., Leslie et al. 2015; Novembre and Peter 2016), the regional groupings we use are a practical and instructive starting point: as we will show, several key features of human evolutionary history become apparent, and many misconceptions about human differentiation can be addressed efficiently with this coarse approach (see Discussion). As any such groupings are necessarily arbitrary, we also show results without using regional groupings to calculate frequencies (see section Finer-scale resolution of variant distributions below).

To represent the geographic distributions of alleles compactly, we give every variant a five-letter code according to its allele frequencies across regions (Figure 2A) [...] region, we code the allele's frequency as 'u', 'R', or 'C', based on whether the allele is "(u)ndetected," "(R)are," or "(C)ommon" (Figure 2B). Finally, we concatenate the allele's regional frequency codes in the fixed (and arbitrary) order: AFR, EUR, SAS, EAS, and AMR. This procedure generates a "geographic distribution code" for each variant. For example, the code 'CCCCC' represents a variant that is common across every region, while 'uuRuu' represents a variant that is rare in South Asia and unobserved elsewhere (Figure 2C).
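The coding step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy frequencies, and the handling of a frequency exactly at the cutoff are assumptions made for illustration, with the 5% rare/common cutoff taken from the text.

```python
from collections import Counter

# Fixed region order used in the text: AFR, EUR, SAS, EAS, AMR.
REGIONS = ("AFR", "EUR", "SAS", "EAS", "AMR")
COMMON_THRESHOLD = 0.05  # 5% cutoff between "rare" and "common"


def region_letter(freq: float) -> str:
    """Code one regional frequency of the globally minor allele."""
    if freq <= 0.0:
        return "u"  # allele undetected in this region's sample
    if freq > COMMON_THRESHOLD:
        return "C"  # common: frequency above the cutoff
    return "R"      # rare: observed, but at or below the cutoff (assumed tie rule)


def geo_code(freqs: dict) -> str:
    """Concatenate the regional letters in the fixed region order."""
    return "".join(region_letter(freqs[r]) for r in REGIONS)


# Hypothetical (made-up) regional frequencies for four variants.
variants = [
    {"AFR": 0.30, "EUR": 0.25, "SAS": 0.28, "EAS": 0.22, "AMR": 0.27},
    {"AFR": 0.0, "EUR": 0.0, "SAS": 0.004, "EAS": 0.0, "AMR": 0.0},
    {"AFR": 0.01, "EUR": 0.0, "SAS": 0.0, "EAS": 0.0, "AMR": 0.002},
    {"AFR": 0.0, "EUR": 0.0, "SAS": 0.001, "EAS": 0.0, "AMR": 0.0},
]

codes = [geo_code(v) for v in variants]
tally = Counter(codes)

# Rank codes from most to least abundant, as a stacked display would.
for code, count in tally.most_common():
    print(code, count / len(codes))
```

Keeping the region order in a single tuple means every code is built the same way, so codes from different variants can be compared position by position.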
This scheme requires a few choices. To distinguish between "rare" and "common" alleles, we used a threshold of 5% frequency. For comparison, we also show results using a 1% frequency threshold (Figure S1A). For 96.6% of variants in the dataset with high-quality ancestral allele calls (Box 1), the globally minor allele is the derived (younger) allele, and for comparison we also produced results tracking the derived rather than the globally minor allele (Figure S1C). Neither changing the frequency threshold to 1% nor tracking the derived allele meaningfully affects the basic observations that follow.

Next, we coded all ~92 million biallelic SNVs in the dataset and tabulated the proportions of each geographic distribution code. We display the codes in a vertical stack from the most abundant code at the bottom to the least abundant at the top, with the height of each code proportional to its abundance, so that the cumulative proportions of the rank-ordered codes are easily readable (Figure 3).

The distribution of codes is heavily concentrated, with 85% of variants falling into just eight codes out of the 242 ($3^5 - 1$) that are possible. Of the top eight codes, the top four codes represent rare variants that are localized in a single region. The fifth most abundant code, 'RuuuR', represents rare variants found in Africa and the Admixed Americas (which includes African American individuals, for example). The sixth code is another set of localized rare variants ('uRuuu', i.e., variants rare in EUR). The seventh code is 'CCCCC' or "globally common variants." The eighth most abundant category, 'uRuuR', represents rare variants found in Europe and the Admixed Americas. Conspicuously infrequent in the distribution are variants that are common in only one region outside of Africa and absent in others (e.g., 'uCuuu', 'uuCuu', 'uuuCu', 'uuuuC').
Instead, when a variant is found to be common (>5% allele frequency) in one population, the modal pattern (37.3%) is that it is common across the five regions ('CCCCC'). Further, 63% of variants common in at least one region are also globally widespread, in the sense of being found across all five regions. This number rises to 82% for variants common in at least one region outside of Africa (Figures S2 and S3).

Singleton variants (alleles found in a single individual) are the most abundant type of variant in human genetic data and are necessarily found in just one geographic region. To focus on the distributions of non-singleton variants, we removed singletons and retallied the relative abundance of patterns (Figure 3C). Removing singletons reduces the absolute number of variants observed to about 48% of the full set (91,784,637 vs. 44,290,364). Without singletons, we see more clearly the abundance of patterns that have rare variants shared between two or more regions (codes with two 'R's and three 'u's, such as 'uuRRu' or 'RRuuu').

Figure 3. [...] We observe variants at ~3.1% of the measurable sites in the reference human genome (GRCh38). A measurable site is one at which it is possible to detect variation with current sequencing technologies (currently approximately 2.9 Gb out of 3.1 Gb in the human genome; see URLs). B and C: The relative abundance of different geographic distributions for 1KGP variants, (B) including singletons, and (C) excluding singletons. In panels B and C, the right-hand rectangles show the number and percentage of variants that fall within the corresponding geographic code on the left-hand side; distribution patterns are sorted by their abundance, from bottom to top. See Figure 2 for an explanation of the 5-letter 'u', 'R', 'C' codes. The proportion of the genome with variants that have a given geographic distribution code can be calculated from the data above (for example, with the 'Ruuuu' code, as 17% × 3.1% = 0.53%).

[...] (AMR).
Differing from the 1KGP, we include in the "Admixed in the Americas" (AMR) regional grouping the following populations: "Americans of African Ancestry in SW USA", "African-Caribbeans in Barbados (ACB)", and the "Utah Residents (CEPH) with Northern and Western European Ancestry". We chose this grouping because it is a more straightforward representation of current human geography. We note challenges and caveats of these alternate decisions in the Discussion. Supplemental Table 1 provides a full list of the 26 populations and the grouping into five regions. Figure 7 and Figure S7 provide a complementary view to Figure 2 where the analysis is not based on the five groupings, but instead all 26 populations.

In Figure 5, we present results for variants differing between pairs of individuals from the Simons Genome Diversity Project (SGDP). We include only autosomal biallelic SNVs for variants that pass "filter level 1", which is the filtering procedure used for the majority of analyses by Mallick et al. (2016) (see URLs).

In Figure 6 we present results for variants found on five commercially available genotyping arrays: the Affymetrix 6.0 (Affy6) genotyping array, the Affymetrix [...]

[...] Human generation time (Fenner 2005) gives a scaled divergence time (the split time in generations divided by twice the population size) of 0.05. We compare this scenario with a scaled divergence time of 0.5, corresponding to a deeper divergence of approximately 600,000 years ago.

Figure 4A shows the expected patterns in a sample of 100 individuals from each population for deep divergence (scaled divergence time of 0.5), shallow divergence (scaled divergence time of 0.05) without admixture, and shallow divergence with admixture (admixture coefficient of 0.02). The shallow divergence model with or without admixture reproduces the preponderance of 'Ru' and 'CC' mutations seen in the data, while the deep divergence model shows many more 'Cu' and many fewer 'CC' mutations.
The case with admixture shows a slight increase in variant sharing ('RR' alleles increase from 1.3% of variants to 4.2%; 'RC' and 'CR' alleles increase from 6% to 10%; 'CC' alleles comprise 23% in both cases).

We can understand the relationship between the split time and geographic distri- [...] These new mutations will be private to one population ('Ru' or 'Cu') and the overwhelming majority will go extinct before reaching detectable frequencies. Conditional on non-extinction, the expected frequency of a neutral mutation increases linearly with time (see Appendix B). As a result, the frequencies of new mutations since the split will mostly be contained in a triangular envelope whose upper bound grows in proportion to the scaled time since the split (Figure 4B). For recent divergence, the new mutations will be assigned code 'Ru' or 'uR', while in deeply diverged populations they may be categorized as 'Cu' or 'uC'.

[...] the SGDP data to avoid ascertainment biases that might arise from looking at individuals within the same dataset we use to measure allele frequencies. Figure 5 shows a representative subset with 6 pairs chosen from 3 populations (Figure S6 shows a larger set of examples). For each pair we see some variants that were undiscovered in the 1KGP data (denoted $S_u$ in the figure). These account for 17-20% of each set of pairwise SNVs and are likely rare variants. We see that the variants that differ between each pair of individuals are typically globally widespread (i.e., codes with no 'u's, with proportions out of the total S varying from 54% to 76% for the pairs in Figure 5). The observation of mostly globally common variants in pairwise comparisons may seem counterintuitive considering the abundance of rare, localized variants overall. However, precisely because rare variants are rare, they are not often carried by either individual in a pair.
Instead, pairs of individuals mostly differ because one of them carries a common variant that the other does not; and as Figure 3 already showed, common variants in any single location are often common throughout the world (also see Figures 7 and S1).

From the example pairwise comparisons (Figure 5 and Figure S6), one also observes evidence for higher diversity in Africa, which is typically interpreted in terms of founder effects reducing diversity outside of Africa ([...]

Figure 5. The abundance of geographic distribution codes for different pairs of individuals from the SGDP dataset. Above each plot we show the total number of variants that differ between the two individuals (S) and the number that were unobserved completely in the 1KGP data ($S_u$). Across the bottom we show the proportion of variants with globally widespread alleles for each pair. We calculate this as the fraction of variants with no 'u' encodings over the total number of variants (S). (Note: by doing so, we make the assumption that if a variant is not found in the 1KGP data it is not globally widespread.)

[Figure 5, panel B: Geographic distributions of pairwise SNVs for pairs of individuals from the Simons Genome Diversity Project. Vertical axis: cumulative fraction of 1000G variants. Example panel: Han/Han, S = 3,358,497, $S_u$ = 577,429 (17%).]

The geographic distributions of variants typed on genotyping arrays
Targeted genotyping arrays are a cost-effective alternative to whole-genome sequencing. The geographic distribution of the variants on genotyping arrays affects genotype imputation and genetic risk prediction ([...]).

Figure 6 shows the geographic distributions of biallelic SNVs included on five popular array products in the 1KGP data. In stark contrast with the SNVs identified by whole-genome sequencing (Figure 3B) [...] imputation accuracy, leading to greater power to map population-specific disease risk.

Finer-scale resolution of variant distributions
While the use of 5 regional groupings above allows us to describe variant distributions compactly with a 5-digit encoding, the basic principle of grouping allele frequencies can be extended to build a 26-digit encoding for the 1KGP variants. Doing so, we find a pattern consistent with Figure 2B, in that the majority of variants are seen to be rare and geographically localized (1 'R', and the remainder 'u's), and when a variant is common in any one population, it is typically common across the full set of populations (Figure 7, pattern with all 'C's). This view reveals that the 5-digit encodings with 1 'R' and 4 'u's are often due to variants that are rare even within a single population. This is not unexpected given that many of them are singletons. When we remove singletons (Figure S7), we again see more clearly rare allele sharing indicative of recent gene flow, though at finer-scale resolution.

Figure 7. A finer-scale summary of geographic distributions in human SNVs from the 1KGP.
This plot is analogous to Figure 3B, but rather than calculating frequencies within the 5 regional groupings, we compute them within each of the 26 1KGP populations. The total number of variants represented is the same as in Figure 3B (S = 91,784,367). See Figure 2 for an explanation of the 'u', 'R', 'C' codes.

Discussion
By encoding the geographic distributions of the ~92 million biallelic SNVs in the 1KGP data and tallying their abundances, we have provided a new visualization of human genetic diversity. We term our figures "GeoVar" plots as they help reveal the geographic distribution of [...]

A goal of our work was to build a visualization that can help correct common misconceptions about human genetic variation. First, because many existing methods to describe population structure emphasize between-group or between-individual differentiation, they can convey a misleading impression of "deep" divergence between populations when it may not exist. Comparing Figure 1 to outputs of models with "deep" or "shallow" divergence can help teach how patterns of human variation are consistent with shallow divergence and the Recent African Origins model (Box 2). Second, because personal ancestry tests can identify ancestry to broad continental regions, it is possible to incorrectly conclude that human alleles are typically found exclusively in a single region and at high frequency within that region (e.g., patterns such as 'uuCuu'). As our figures show, this is not the case. Rather, it should be kept in mind that most fine-scale personal ancestry tests work using genotyping arrays and combining evidence from subtle fluctuations in the allele frequencies of many common variants (Novembre and Peter 2016). Finally, another related misconception is that two humans from different regions of the world differ mainly due to alleles that are typical of each region. As we show in Figure 5, most of the variants that differ between two individuals are variants with alleles that are globally widespread.

Our method requires computing allele frequencies within predefined groupings.
Grouping and labeling strategies vary between genetic studies and are determined by the goals and constraints of a particular study (Race, Ethnicity, and Genetics Working Group 2005; Panofsky and Bliss 2017; Mathieson and Scally 2020). While we chose deliberately coarse grouping schemes to address the misconceptions described above, the key facts we derive about human genetic variation are robust and appear in finer-grained 26-population versions of the plot (Figure 7). We recommend that any application of the GeoVar approach be interpreted with the choice of groupings in mind.

The visualization method developed here is also useful for comparing the geographic distributions of different subsets of variants (e.g., Figures 5 and 6). For example, when applied to the list of variants targeted by a genotyping array (Figure 6), the approach quickly reveals the relative balance of common versus rare variants and the geographical patterns of those variants.

Interpreting the results of this visualization approach does have some caveats. First, we estimate the frequency of alleles from samples of local populations. We expect that as sample sizes increase many alleles called as unobserved 'u' will be reclassified as rare 'R'. The average sample size across all of our geographic regions is approximately 500 individuals (AFR: 504, EUR: 404, SAS: 489, EAS: 504, AMR: 603). Assuming regions are internally well-mixed, we have ~80% power to detect alleles with a frequency of ~0.2% in a region (Figure S4). For alleles with lower frequencies, we would require larger sample sizes to ensure similar detection power. An implication is that in larger samples, we should observe more rare variant sharing. Thus, we expect the figures here to underrepresent the levels of rare variant sharing between human populations.
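The detection-power statement can be approximated with a short calculation. The sketch below assumes a simple binomial model (diploid individuals, chromosomes drawn independently from a well-mixed region); it is not necessarily the calculation behind Figure S4, and the function name is hypothetical.

```python
def detection_power(freq: float, n_individuals: int) -> float:
    """Probability of observing at least one copy of an allele.

    Assumes 2 * n_individuals independent chromosomes (diploid,
    well-mixed region), each carrying the allele with probability freq.
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - freq) ** n_chromosomes


# Regional sample sizes reported in the text.
region_sizes = {"AFR": 504, "EUR": 404, "SAS": 489, "EAS": 504, "AMR": 603}

for region, n in region_sizes.items():
    print(region, round(detection_power(0.002, n), 3))
```

Under these assumptions the smallest region (EUR, 404 individuals) gives a power of about 0.80 at a frequency of 0.2%, and the larger regions give somewhat higher values.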
A second caveat is that our encoding groups a wide range of variants into the "(C)ommon" category (i.e., all variants where the frequency of the globally minor allele is greater than 5%). For some applications, such as population screening for carriers, it may be enough to know that a variant falls in the "rare" or "common" bins we have described, and more detail is inconsequential. For other applications, the detailed fluctuations in allele frequency across populations are relevant, for example, differences in allele frequencies at common variants (Figure S5 [...]

Third, one must interpret our results with the sampling design of the 1KGP study in mind. In particular, the 1KGP filtered for individuals of a single ethnicity within each locale. However, in our current cosmopolitan world, the genetic diversity in any location or broad-based sampling project will be considerably higher than implied by the geographic groupings above. For example, the UK Biobank, while predominantly of European ancestry, has representation of individuals from each of the five regions used here (Bycroft et al. 2018). The 1KGP also sampled South Asian ancestry from multiple locations outside of South Asia, and whether those individuals show excess allele sharing due to recent admixture in those contexts is unclear. While we expect overall similar patterns to those seen here using emerging alternative datasets (Bergström et al. 2019), there may be subtle differences due to sampling and study design considerations.

Despite these caveats, the results of the visualizations provided here help reinforce the conclusions of a long history of empirical studies in human genetics (Lewontin 1972 [...] abundance of localized rare variants and broadly shared common variants, with a paucity of private, locally common variants. Together these are footprints of the recent common ancestry of all human groups.
As a consequence, human individuals most often differ from one another due to common variants that are found across the globe. Finally, though not examined explicitly above, the large abundance of rare variants observed here is another key feature of human variation and a consequence of recent human population growth (Slatkin and [...] instance, due to ascertainment bias in arrays (Figure 6) and power considerations, common variants are often found in genome-wide association studies of disease traits (Manolio et al. 2009 [...]

Supplementary Figure 2. A: Top 10 categories when conditioning on the variant being "common" (MAF > 5%) in at least one population. Conditioned on a variant being common in a single region, 37.3% of variants are categorized as "globally common" or "CCCCC". B: The proportion of variants that fall within the "globally common" or "CCCCC" geographic distribution code conditional on the variant being common (MAF > 5%) in the specific continental group.

Supplementary Figure 3. A: The number of variants that fall within a given geographic distribution code conditional on the variant being "globally widespread", i.e., a category that has no unobserved ("u") codes. We note that 55.6% of variants conditioned on being globally widespread are also globally common ("CCCCC"). In terms of absolute numbers, variants that are common in at least one population (S = 9,958,838) that are also globally widespread (S = 6,322,767) comprise ~63% of the total when conditioning on being common in at least one population. When conditioning on variants common only in regions outside Africa (S = 7,544,648), the percentage of globally widespread variants (S = 6,179,781) increases to ~82%. B: The proportion of variants that fall within a "globally present" category, defined as categories that contain no unobserved ("u") codes, conditional on the variant being common (MAF > 5%) in the specific continental group.
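The two conditional percentages quoted in this caption follow directly from the reported counts. The short check below uses only those four counts; the variable names are chosen here for illustration.

```python
# Counts reported in the caption above.
common_in_any_region = 9_958_838
widespread_and_common = 6_322_767
common_outside_africa = 7_544_648
widespread_and_common_outside_africa = 6_179_781

# Share of globally widespread variants within each conditioning set.
share_all = widespread_and_common / common_in_any_region
share_non_afr = widespread_and_common_outside_africa / common_outside_africa

print(f"{share_all:.1%} {share_non_afr:.1%}")
```

Rounded to whole percentages, the two shares match the ~63% and ~82% stated in the caption.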

Sampling probabilities
The abundances of two-population distribution codes are a simple transformation of the cumulative distribution function (CDF) of the joint allele counts $(K_1, K_2)$. Conditioning on allele frequencies at time $t$, but before admixture, the CDF is given by [...] For $n$ randomly sampled haploid individuals from each population, and admixture coefficient $a$, we have: [...] Writing $P_n^{(k)}(x_1, x_2)$ for the binomial cumulative distribution function $P\{K_i \le k \mid x_1, x_2\}$, and substituting (5) into (4) yields: [...] where the inner product now represents the double integral weighted by $p(x_1)p(x_2)$.

Numerical integration
We compute the integrals in (6) by two-dimensional Gauss-Jacobi quadrature. The left argument of the inner product is a polynomial of degree $n$ in both $x_1$ and $x_2$. As a result, we can choose $m = 2n$, so that $\langle P_n^{(k_1)} P_n^{(k_2)}, R_{2n} \rangle = 0$ due to the orthogonality of the Jacobi polynomials. Because $S_{2n}$ is also a polynomial, the integrand is a polynomial of degree $4n$. Thus, fixed-order tensor-product Gauss-Jacobi quadrature is guaranteed to yield the exact integral with $4n^2$ evaluations of the integrand.
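The exactness property of fixed-order tensor-product Gaussian quadrature can be seen in a small pure-Python sketch. As a simplifying assumption it uses the three-point Gauss-Legendre rule on $[-1, 1]$, the special case of Gauss-Jacobi with a constant weight function, rather than the Jacobi weights and the interval used in the analysis above. The test integrands are arbitrary monomials chosen for illustration.

```python
import math

# Three-point Gauss-Legendre rule on [-1, 1] (Gauss-Jacobi with a constant
# weight). An m-point Gauss rule is exact for polynomials of degree <= 2m - 1,
# so this rule is exact up to degree 5 in each variable.
NODES = (-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5))
WEIGHTS = (5 / 9, 8 / 9, 5 / 9)


def tensor_quadrature(f):
    """Integrate f(x1, x2) over [-1, 1]^2 with m * m = 9 evaluations."""
    return sum(
        w1 * w2 * f(x1, x2)
        for x1, w1 in zip(NODES, WEIGHTS)
        for x2, w2 in zip(NODES, WEIGHTS)
    )


# Degree 4 in x1 and degree 2 in x2: within the exactness range.
approx_low = tensor_quadrature(lambda x1, x2: x1**4 * x2**2)
exact_low = (2 / 5) * (2 / 3)

# Degree 6 in x1: beyond the exactness range of a three-point rule.
approx_high = tensor_quadrature(lambda x1, x2: x1**6)
exact_high = (2 / 7) * 2

print(approx_low, exact_low)
print(approx_high, exact_high)
```

The first pair agrees to floating-point precision, while the second does not, which is why the quadrature order has to be matched to the polynomial degree of the integrand.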
For short times and rare alleles (i.e., $t, p \ll 1$), we can use the approximation $p(1 - p) \approx p$ to get a simpler diffusion equation: [...] We can solve this equation in closed form to find the time-dependent extinction probability, [...] where we have replaced the boundary condition (10) with $\lim_{p \to \infty} \wp(p, t) = 1$. For $t \ll 2p$, this probability is exponentially small, while for $t > 2p$ it behaves like $1 - 2p/t$ (Fig. 4C). We can use (12) to find the expected frequency of a new mutation conditional on its survival to time $t$. By the law of total probability we have [...] where in the last equality we used the fact that for a new neutral mutation $E[X(t)] = p = 1/2N$. Thus, to leading order in $1/N$, we have $E[X(t) \mid X(t) > 0] \sim t/2$.
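A quick numerical check illustrates the final step. It rests on an assumption: the extinction probability is taken to be $e^{-2p/t}$, a closed form chosen here because it matches the two limits quoted above (exponentially small for $t \ll 2p$, approximately $1 - 2p/t$ for $t > 2p$). The function names and the value of $N$ are illustrative only.

```python
import math


def survival_probability(p: float, t: float) -> float:
    """1 - exp(-2p/t): survival probability under the assumed closed form."""
    # -expm1(-a) equals 1 - exp(-a) and stays accurate for small a.
    return -math.expm1(-2.0 * p / t)


def conditional_mean_frequency(p: float, t: float) -> float:
    """E[X(t) | X(t) > 0] = E[X(t)] / P(survival), with E[X(t)] = p (neutrality)."""
    return p / survival_probability(p, t)


N = 10_000           # hypothetical population size
p0 = 1 / (2 * N)     # initial frequency of a new mutation

for t in (0.001, 0.01, 0.05):
    print(t, conditional_mean_frequency(p0, t), t / 2)
```

For each value of $t$ the conditional mean is close to $t/2$, with a small positive correction of order $p$ that vanishes as $N$ grows.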