Population histories of the United States revealed through fine-scale migration and haplotype analysis

The population of the United States is shaped by centuries of migration, isolation, growth, and admixture between ancestors of global origins. Here, we assemble a comprehensive view of recent population history by studying the ancestry and population structure of over 32,000 individuals in the US using genetic, ancestral birth origin, and geographic data from the National Geographic Genographic Project. We identify migration routes and barriers that reflect historical demographic events. We also uncover the spatial patterns of relatedness in subpopulations through the combination of haplotype clustering, ancestral birth origin analysis, and local ancestry inference. Examples of these patterns include substantial substructure and heterogeneity in Hispanics/Latinos, isolation-by-distance in African Americans, elevated levels of relatedness and homozygosity in Asian immigrants, and fine-scale structure in individuals of European descent. Taken together, our results provide detailed insights into the genetic structure and demographic history of the diverse US population.

cohort at >90% confidence. A total of 3,028 individuals (9.3% of total) did not meet the classification threshold (Figure 1C; Table S1). The inability to classify these individuals may be due to the complex and variable admixture profiles of certain populations such as Hispanics/Latinos.

Regional differences in genetic ancestry correspond to historical demographic trends. We evaluated the distributions of classified individuals across the four designated US Census regions: South, Northeast, Midwest, and West (Table S1)

Similarly, we also find a clade of individuals with Bangladesh ancestral origins that is similar to the 1000 Genomes Project Bengali from Bangladesh (BEB). Interestingly, we also inferred a small, but genetically distinct 'Girmitiyas' clade (N = 12; blue branch in Figure 2B). While the small sample size makes it difficult to accurately assess this clade, we note that many former British colonies (e.g. Trinidad and Tobago, Fiji, Barbados, Guyana) are represented in the ancestral origins of these individuals. We therefore hypothesize that these individuals may potentially be descendants of Girmitiyas, indentured Indian laborers brought to those former colonies.³²

Population differentiation and migration rate inference across the United States
Understanding the relationship between genetics and geography can provide insights into demographic history. Previous analyses of this relationship in the US population have primarily compared data aggregated at the state or regional level.⁷,⁹ Such approaches, however, do not capture the fine-scale patterns of genetic similarity that are not influenced by discrete administrative boundaries. We therefore sought to infer continuous population structure across space with the estimating effective migration surfaces (EEMS) method.³³ EEMS statistically measures effective migration rates by overlaying a dense grid of evenly spaced demes and calculating genetic differentiation (i.e. resistance distance) between neighboring demes. Higher rates of migration are inferred in locations where genetic similarity is high (colored in blue), while lower rates of migration are inferred in locations where genetic similarity is low (these locations are also referred to as migration barriers and colored in dark orange). We applied EEMS to genetically classified Europeans, African Americans, and Hispanics/Latinos across the contiguous 48 states. We excluded East Asians and South Asians due to low sample density.

Genetic and geographic diversity is greatest amongst Hispanic/Latino haplotype clusters. We identified a total of five Hispanic-related clusters. The largest of these clusters (n=810) is strongly associated with south Florida (OR = 10.4; p = 2.5e-25; Figure 4, Table S3)

Ricans characterize a substantial proportion of another Hispanic/Latino cluster associated with Florida (OR > 4), as well as New York City (OR > 5). Unlike the other Hispanic clusters, the Puerto Rican cluster shares the same branch on the FST tree as the African American clusters (Figure S8), likely due to relatively high proportions of African ancestry (mean = 11.2%, sd = 9.0%) among Puerto Ricans.
In addition to providing a DNA sample, participants also provided geographic location (postal code), and, optionally, family history information in the form of ancestral birth origin and ethnicity (up to grandparental level) when they registered on the Genographic Project website to track and access the results of their DNA sample. All data were de-identified prior to research access. We limited our study to those individuals who provided valid geographic location in the United States. Approximately 75% of individuals provided complete pedigrees and family history data

clusters of discrete populations.²⁴ We performed fineSTRUCTURE analysis separately for the two populations. The first part of the fineSTRUCTURE framework uses ChromoPainter to measure shared ancestry between individuals and estimate a coancestry matrix. This matrix is then used in fineSTRUCTURE's clustering and tree-building algorithm to hierarchically cluster individuals from fine levels of structuring to broader levels. We first applied ChromoPainter to phased genotypes to estimate the number of contiguous segments (chunks) shared and the total amount of genome (in cM) shared between each pair of individuals within each population, as well as the normalization parameter (c). Using the coancestry matrix and normalization parameter, we then ran fineSTRUCTURE with 2 million Markov Chain Monte Carlo (MCMC) iterations, of which 1 million were "burn-in" iterations, sampling every 2,000 iterations. Finally, we used fineSTRUCTURE to infer a hierarchical tree using 100,000 hill-climbing moves. We used the scripts accompanying the fineSTRUCTURE software as well as the ape package in R to visualize the coancestry matrix and dendrogram results.

To examine the properties of the inferred clusters, we sought to examine structure at both broad and fine scales. There is no definitively correct level of the dendrogram to pick for examination. We examined clades at various levels of the tree and assessed broad structure at the levels in which clades had a sufficient number of individuals (on average 50 or more samples). We further used a combination of PCA and analysis of ancestral origins to assess and define these clades. Some of the clusters are small but genetically distinct, as evidenced by the branch length and height of the split (i.e. Girmitiyas, Bangladesh), and they were therefore kept as separate clades.

Unlike traditional PCA, PCA using the coancestry matrix (i.e. chunk counts matrix) can better discern fine-scale population structure and provide greater interpretability.²⁴ We performed PCA on the chunk counts matrix using the Python library scikit-learn. Individual markers are colored and labelled based on their respective grouping.

Estimating Effective Migration Surfaces
We estimated migration and diversity relative to geographic distance using the estimating effective migration surfaces (EEMS) method for Genographic individuals that were classified under African, European, and Native American ancestries.³³ We excluded East Asian and South Asian ancestries due to low sample size and density. We used unrelated individuals with available postal code data. We first computed pairwise genetic dissimilarities with the EEMS

Unlike other IBD detection algorithms, IBDseq does not rely on phased genotype data and is less susceptible to switch errors in phasing that ca

Detection of IBD Clusters
While fineSTRUCTURE can identify population structure in admixed cohorts using haplotype similarity,²³ fineSTRUCTURE does not scale to large sample sizes and is not recommended for samples >10,000.²⁴ We therefore sought to identify clusters of related individuals in the haplotype network using the Louvain Method implemented in the igraph package for R. The Louvain Method is a greedy iterative algorithm that assigns vertices of a graph into clusters to optimize modularity (a measure of the density of edges within a community relative to edges between communities).³⁶ The Louvain Method begins by first assigning each node to its own community and then adds node i to a neighboring community j. It then calculates the change in modularity and places i in the community that maximizes modularity. The algorithm repeats this continuously and terminates when no vertices can be reassigned.
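The greedy reassignment step described above can be illustrated with a minimal sketch in plain Python. It is not the igraph implementation used in the study: it assumes an unweighted, undirected toy graph, recomputes modularity from scratch for every tentative move, and covers only the local reassignment phase (the full Louvain Method additionally collapses communities into super-vertices and repeats). The function names `modularity` and `local_moving` are illustrative only.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    inside = {}   # number of edges with both ends in community c
    degree = {}   # summed vertex degree of community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def local_moving(edges):
    """Greedy local reassignment only; returns {vertex: community label}."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = {n: n for n in nodes}          # every vertex starts alone
    improved = True
    while improved:                            # stop when no vertex moves
        improved = False
        for i in nodes:
            current = community[i]
            best_c, best_q = current, modularity(edges, community)
            candidates = {community[j] for j in neighbors[i]} - {current}
            for c in sorted(candidates):
                community[i] = c               # tentative move
                q = modularity(edges, community)
                if q > best_q + 1e-12:         # keep strict improvements only
                    best_c, best_q = c, q
            community[i] = best_c
            if best_c != current:
                improved = True
    return community


# Toy graph (assumed for illustration): two triangles joined by edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = local_moving(edges)
groups = {}
for vertex, label in community.items():
    groups.setdefault(label, set()).add(vertex)
print(sorted(sorted(g) for g in groups.values()))
print(round(modularity(edges, community), 4))
```

On this toy graph the sketch groups each triangle into its own community. Recomputing modularity for every candidate move is deliberately naive; production implementations update the change in modularity incrementally.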

We partitioned the haplotype network into clusters by recursively applying the Louvain Method within subcommunities. At the highest level, we took the full, unpartitioned haplotype graph and identified a set of subcommunities. We isolated the vertices within each subcommunity, keeping only the edges between those vertices to create separate new networks. We then applied the Louvain Method to the new subgraphs. We repeated this process up to four levels. We combined subcommunities with low genetic divergence based on FST values < 0.0001.
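The recursion over induced subgraphs can be sketched as follows. This is a hedged illustration rather than the study's code: the graph is a small made-up weighted network, and `split_on_weak_edges` is a deliberately simple stand-in detector (it removes the lightest edges and returns connected components), not the Louvain Method. Any community detector with the same input and output could be passed as `detect`. The FST-based merging step is not shown.

```python
def build_graph(weighted_edges):
    """Adjacency dict {u: {v: weight}} for an undirected weighted graph."""
    adj = {}
    for u, v, w in weighted_edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def induced_subgraph(adj, vertices):
    """Keep only the given vertices and the edges between them."""
    keep = set(vertices)
    return {v: {u: w for u, w in adj[v].items() if u in keep}
            for v in adj if v in keep}


def connected_components(adj):
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in adj[v] if u not in comp)
        seen |= comp
        components.append(comp)
    return components


def split_on_weak_edges(adj):
    """Stand-in detector (NOT Louvain): drop lightest edges, take components."""
    weights = [w for v in adj for w in adj[v].values()]
    if not weights or min(weights) == max(weights):
        return [set(adj)]                      # nothing sensible to split
    cutoff = min(weights)
    pruned = {v: {u: w for u, w in adj[v].items() if w > cutoff} for v in adj}
    return connected_components(pruned)


def recursive_partition(adj, detect, max_levels=4, level=1):
    """Apply `detect`, then re-apply it inside each subcommunity."""
    communities = detect(adj)
    if len(communities) <= 1:
        return [frozenset(adj)]                # could not be split further
    if level == max_levels:
        return [frozenset(c) for c in communities]
    leaves = []
    for c in communities:
        sub = induced_subgraph(adj, c)         # edges within c only
        leaves.extend(recursive_partition(sub, detect, max_levels, level + 1))
    return leaves


# Hypothetical network; weights stand for pairwise haplotype sharing.
graph = build_graph([("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
                     ("c", "d", 1), ("d", "e", 3), ("e", "f", 2)])
clusters = recursive_partition(graph, split_on_weak_edges, max_levels=4)
print(sorted(sorted(c) for c in clusters))
```

The `max_levels` argument mirrors the four-level limit stated above; recursion also stops early for any subgraph the detector returns as a single community.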

Annotation of IBD Clusters
We used a combination of ancestral birth origins and self-reported ethnicities to discern demographic characteristics of each cluster. For each cluster, we quantified the proportion of each birth origin (i.e. country of origin) amongst all four grandparents, treating each grandparent's origin equally. We used these proportions to inform population labels. Clusters in which a single non-US birth origin was present in high proportion were labeled with that country. In cases where multiple non-US birth locations existed in approximately equally high proportions, we assigned a label representing the broader region (e.g. Eastern Europeans for Poland, Lithuania, Ukraine, and Slovakia; East Asia for Japan, China). For certain clusters, annotations could not be easily discerned from birth origin data. In these cases, we relied on self-reported ethnicities to label the clusters, as these populations were found to be less associated with a non-US country (e.g. Ashkenazi Jews) or had resided in the US for generations (African Americans, Acadians).
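A minimal sketch of this labeling logic is given below. The text does not state numeric cutoffs, so the `high` and `margin` thresholds, the `region_of` lookup, and the example clusters are all assumptions made for illustration. A return value of `None` stands for the case where birth origins are uninformative and self-reported ethnicity would be consulted instead.

```python
from collections import Counter


def origin_proportions(cluster):
    """Share of each birth origin, weighting every grandparent equally.

    `cluster` is a list of individuals, each a tuple of four grandparent
    birth countries (None where unknown).
    """
    counts = Counter(origin for person in cluster
                     for origin in person if origin is not None)
    total = sum(counts.values())
    return {o: n / total for o, n in counts.items()} if total else {}


def label_cluster(cluster, region_of, high=0.4, margin=0.1,
                  home="United States"):
    """Return a country label, a region label, or None (no clear origin).

    `high` and `margin` are hypothetical thresholds, not values from the
    study: `high` is the minimum share counted as a high proportion, and
    origins within `margin` of the top share count as roughly equal.
    """
    props = origin_proportions(cluster)
    foreign = {o: p for o, p in props.items() if o != home}
    if not foreign:
        return None
    top = max(foreign.values())
    leaders = sorted(o for o, p in foreign.items() if top - p <= margin)
    if len(leaders) == 1:                      # one dominant country
        return leaders[0] if top >= high else None
    regions = {region_of.get(o) for o in leaders}
    combined = sum(foreign[o] for o in leaders)
    if combined >= high and len(regions) == 1 and None not in regions:
        return regions.pop()                   # several countries, one region
    return None


# Hypothetical region lookup and clusters, for illustration only.
region_of = {"Poland": "Eastern Europeans", "Lithuania": "Eastern Europeans",
             "Ukraine": "Eastern Europeans", "Japan": "East Asia",
             "China": "East Asia"}
italian = [("Italy",) * 4, ("Italy", "Italy", "United States", "United States")]
eastern = [("Poland", "Poland", "Lithuania", "Lithuania"),
           ("Ukraine", "Ukraine", "Poland", "Lithuania")]
domestic = [("United States",) * 4]

print(label_cluster(italian, region_of))
print(label_cluster(eastern, region_of))
print(label_cluster(domestic, region_of))
```

Returning `None` rather than a guessed label keeps the fallback to self-reported ethnicity explicit for clusters such as those with predominantly US-born grandparents.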

Runs of Homozygosity
We used PLINK v1.90b3.39 to infer runs of homozygosity with a window of 25 SNPs.⁵⁶ We calculated the cumulative runs of homozygosity (cROH) size by summing the lengths of homozygous segments.

We thank the National Geographic Genographic Project participants who consented to research for making this study possible. We also thank Gregory Vilshansky for helping organize and manage the data for the Genographic Project.