A harmonized public resource of deeply sequenced diverse human genomes

Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high quality set of 4,094 whole genomes from HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also demonstrate substantial added value from this dataset compared to the prior versions of the component resources, typically combined via liftover and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared to previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.


Supplemental materials
Data harmonization Genotype data Meta-data Table S1 | Harmonization of HGDP and 1000 Genomes Project meta-data project labels.Table S2 | Genetic outliers identified in analysis of global and subcontinental PCA.Table S3 | Final sample counts.QC Meta-data Summaries Figure S1 | Coverage across the 1kGP and HGDP. Figure S2 | Coverage across 1kGP and HGDP by population.Table S4 | Coverage as well as SNV and SV statistics by population.Structural variants (SVs) Figure S3 | Dosage and sex ploidy of HGDP samples and batching strategy.Figure S4 | SV callset and quality evaluation results.S11 | Variants absent in HGDP+1kGP that were identified in comparison datasets depending on whether variants passed QC in each dataset.Table S12 | Variant counts and percentages of comparison datasets.

Analysis tutorials
Figure S20 | PCA shrinkage analysis to determine acceptable levels of missingness before ancestry resolution becomes too low to accurately assign population labels.Table S13 | Shrinkage analysis matches and no classification numbers by SNP missingness in the test dataset, as shown in Figure S20.References Table S1 | Harmonization of HGDP and 1000 Genomes Project meta-data project labels.These labels are referred to as geographical/genetic regions throughout this manuscript.The sample sizes included here are pre-QC.Post-QC numbers are shown in Table S3.After combining region data, we used principal components analysis (PCA) to identify ancestry outliers within regions.We identified outliers as described in Table S2 and provide final sample counts in Table S3.
Table S2 | Genetic outliers identified in analysis of global and subcontinental PCA.
Within subcontinental PCA biplots, outliers were identified when one to three samples defined most of an entire PC among PCs 1-6.

HGDP01271 MID Mozabite
Table S3 | Final sample counts.Note: hard filtering was performed as in gnomAD v3 with modifications as described below in Quality Control.The first row of the "Total" column includes a "synthetic diploid" QC sample (CHM; Complete Hydatidiform Mole) described previously (Li et al. 2018)  A) Coverage in both datasets is uniformly above 30X, with an average of 33X coverage across the harmonized dataset.The coverage of the HGDP genomes is more variable than in 1kGP, as expected based on a variety of technical differences such as multiple sequencing batches, PCR+ vs PCR-free, and older cell lines in HGDP compared to 1kGP.The differences in project coverages also impacts the distribution of coverage statistics by Geographical region given their tally by project (Table S4).Regional abbreviations are as described in Table S1.OCE is excluded from this plot as it is represented by only two populations.Mean coverage across the different regions is 33X with coverage consistently above 30X for all regions.

Table S4 | Coverage as well as SNV and SV statistics by population.
Coverage was computed across the genome as part of the gnomAD project.Relatedness was inferred using KING-robust.Because number of variants and singleton counts per individual are sensitive to sample size imbalances, they were tallied using a downsampled version of the dataset in which each population was randomly downsampled to match the smallest population (i.e. 6 individuals per population), then SNVs were removed if they were not polymorphic in the downsampled dataset.Given the more pronounced impact of batch effects on structure variant (SV) calling and the number of batches present within and between datasets, the number of SVs per individual were calculated across the full dataset, not in the downsampled dataset.

Population genetic comparisons
The breakdown of ancestry and population structure by ADMIXTURE is similar to that identified in global PCA, with K=2 highlighting structure in the AFR, K=3 highlighting structure in the EAS, K=4 highlighting structure in the EUR and CSA, K=5 highlighting structure in the AMR, K=6 highlighting structure in the OCE, K=7 highlighting structure in the MID, and subsequent values of K highlighting structure within meta-data labels (Figure S7).

Figure S7 | ADMIXTURE analysis of the HGDP and 1kGP resource.
We ran ADMIXTURE with values of K=2 through K=10 across populations and harmonized geographical/genetic regions.Each row of bar plots shows the breakdown of regional substructure as K increases, where K is the number of genetic ancestry components fit in that run.For example, when K=2, AFR separates from the rest of the populations as the most distinct population due to high levels of genetic diversity.
When K=3 EUR separates from the rest, and so on.We chose the best fit value of K to be K=6 based on a reduction in the rate of change of 5-fold cross validation error as shown in Figure S8.

Figure S8 | 5-fold cross-validation error across ADMIXTURE runs.
We selected K=6 as the point at which cross-validation error leveled out.As described in the ADMIXTURE manual, the cross-validation error enables users to identify the value of K for which the model has best predictive accuracy, as determined by "holding out" data points.It partitions observed genotypes into 5 roughly equally sized folds, masks genotypes for each fold, then predicts the genotypes.

Quality control
Our sample QC procedure was mostly the same as in gnomAD, but differed slightly.Specifically, because whole populations were removed from gnomad 'fail_' filters, we did not filter on the basis of these, which were used in gnomAD v3.1.The clearest example of filters that failed was the fail_n_snp_residual filter, as shown in Figure S19.
Figure S19 | Example of a filter that was included in gnomAD v3.1 but excluded from this project.
The "fail_n_snp_residual" filter, which regresses out principal components from the number of SNPs in an effort to identify technical outliers, would have excluded whole continental groups and populations in this resource because these groups are distinct from the majority of individuals in gnomAD.

Table S9 | gnomAD Sample and Variant Quality Control Steps
This table contains the steps conducted by the gnomAD team on the jointly called HGDP+1kGP dataset.These QC steps were conducted prior to the additional quality control steps we conducted as outlined in the methods section.Samples and variants that passed gnomAD's QC were labeled as PASS.For additional details on the QC steps conducted, see the gnomAD blogpost on the release of the HGDP+1kGP dataset (https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#sample-and-variant-quality-control).

Sample QC Steps
Hard Filtering Using sample QC metrics computed using Hail's sample_qc() module on all autosomal bi-allelic SNVs the following filters were applied: • Number of SNVs: < 2.4M or > 3.75M • Number of singletons: >100k • Ratio of heterozygous to homozygous variants: > 3.3 Hard filtering using BAM-level metrics was performed when such metrics were available.Removed samples with: • Contamination: > 5% • Chimeras: > 5% • Median insert size: < 250 Sex Inference Used a rough F-stat cutoff of 0.5 to split samples into XX and XY categories.Final X and Y ploidy cutoffs were determined from the means and standard deviations of those XX and XY distributions.Sex was assigned based on the following cutoffs: Ancestry inference 30 principal components (PCs) were computed using principal components analysis (PCA) using Hail's hwer_normalized_pca() function.76,419 high quality variants were selected for these analyses using the following selection criteria: 1. Lifting over all sites from gnomAD v2.1 to GRCh38 2. Add ~5k sites widely used for quality control of GWAS data defined by Shaun Purcell and lifted them over to GRCh38 3. From the two above sets, selected all bi-allelic SNVs with an Inbreeding coefficient > -0.25 (no excess of heterozygotes) The 30 PCs were visually inspected to determine which PCs were capturing variants explained by the available "known" population labels.The first 16 PCs were chosen as best capturing the global ancestry variation and then a random forest classifier was trained using those PCs as features on the samples with "known" population labels.Ancestry labels were then assigned to all samples for which the random forests model, and all remaining samples were given a population label of "other" (oth) Sample QC metric outlier filtering Computed sample QC metrics using Hail's sample_q() module and regressed out the first 8 ancestry assignment PCs.Samples which fell outside the 4 median absolute deviations (MADs) from the median for the following metrics: • Number of snps

Table S10 | Filters Applied to Comparison Datasets
For each dataset, the filter name is written along with the number of variants removed by that filter.Phase3 1kGP is not included in this table as there were no filters listed for that dataset.

Dataset
Filter Counts were generated by taking the disjoint subset of the pre-and post-QC HGDP+1kGP dataset with both the pre-and post-QC versions of the comparison dataset.For the comparison datasets, QC is defined as applying the filters supplied by the datasets as in Table S10 as well as removing the X and Y chromosome from the dataset.For HGDP+1kGP, QC is defined as applying gnomAD sample, variant, and genotype QC filters, as well as removing PCA outliers.For the gnomAD dataset only, the post-QC version of HGDP+1kGP includes PCA outliers since gnomAD did not filter these ancestry outliers.Note that for Phase 3 1kGP, since there were no QC filters to apply, the pre-and post-QC numbers only refer to pre and post removal of the X and Y chromosomes.

Analysis tutorials
To show examples of how to use the individual-level data in a cloud-computing environment, we have created a series of tutorials in iPython notebooks that make use of Hail.These tutorials show how to merge datasets, apply sample and variant QC, run ancestry analysis via PCA and visualization, generate summary statistics of genomes by population, compute and plot population divergence statistics via F ST and F 2 statistics, and intersect external datasets with this dataset and infer ancestry information using project meta-data.The organization of these notebooks is outlined in Figure 6.
Figure S20 | PCA shrinkage analysis to determine acceptable levels of missingness before ancestry resolution becomes too low to accurately assign population labels.We started with a set of SNPs that were used in other PCA (e.g. Figure 2), which had undergone minor allele frequency filtering, missingness filtering, and LD pruning.We randomly selected 80% of samples (N=2,720) to train the random forest with corresponding meta-data labels as usual and held out 20% of samples as a test dataset (N=680).After filtering out monomorphic sites from the training dataset once samples were divided, we retained 200,403 variants which were used to train the random forest.We randomly downsampled SNPs in the test dataset to include 50%, 80%, 90%, 95%, 99%, 99.9%, and 100% of SNPs in the training dataset.These plots show the corresponding projected PCs in the test dataset, showing the extent to which shrinkage affects analyses.Table S13 shows rates of unclassified individuals by SNP missingness in the test dataset.

Figure S1 |
Figure S1 | Coverage across the 1kGP and HGDP.A) Coverage in both datasets is uniformly above 30X, with an average of 33X coverage across the harmonized dataset.The coverage of the HGDP genomes is more variable than in 1kGP, as expected based on a variety of technical differences such as multiple sequencing batches, PCR+ vs PCR-free, and older cell lines in HGDP compared to 1kGP.The differences in project coverages also impacts the distribution of coverage statistics by Geographical region given their tally by project (TableS4).The overall coverage distributions by population are shown in FigureS2.B) Over 95% of bases are covered over 10X, and over 90% of bases are covered over 20X in HGDP+1kGP.
Figure S1 | Coverage across the 1kGP and HGDP.A) Coverage in both datasets is uniformly above 30X, with an average of 33X coverage across the harmonized dataset.The coverage of the HGDP genomes is more variable than in 1kGP, as expected based on a variety of technical differences such as multiple sequencing batches, PCR+ vs PCR-free, and older cell lines in HGDP compared to 1kGP.The differences in project coverages also impacts the distribution of coverage statistics by Geographical region given their tally by project (TableS4).The overall coverage distributions by population are shown in FigureS2.B) Over 95% of bases are covered over 10X, and over 90% of bases are covered over 20X in HGDP+1kGP.

Figure S2 |
Figure S2 | Coverage across 1kGP and HGDP by population.Regional abbreviations are as described in TableS1.OCE is excluded from this plot as it is represented by only two populations.Mean coverage across the different regions is 33X with coverage consistently above 30X for all regions.

Figure S3 |
Figure S3 | Dosage and sex ploidy of HGDP samples and batching strategy.A) Distribution of dosage scores across HGDP samples.We used the previously developed whole genome dosage model (Collins et al 2020) to quantify non-uniform distribution of sequencing coverage.The dosage scores corresponded predominantly to PCR-amplified (PCR+) and PCR-free (PCR-) library protocols.B) Samples ranked by dosage score.C) Distribution of chrX copy number across HGDP samples.D) Batching strategy for SV calling.HGDP samples were first split by their PCR status and chrX ploidy.PCR-samples were then ranked by their sequencing depth from low to high, and split into four sub batches of equivalent sizes.Male and female batches with matched coverage quantiles are combined to form the final batches.E) Workflow of SV discovery from the HGDP and 1KGP genomes.The HGDP and 1KGP samples have been processed separately through the first steps of GATK-SV, including raw SV discovery, batching SVs across each batch and initial filtering of SVs using the "FilterBatch" method in GATK-SV.The filtered SVs were then merged across HGDP and 1KGP to form a non-redundant set of SV loci, systematically genotype across both HGDP and 1KGP samples, and processed through downstream steps of GATK-SV (see https://github.com/broadinstitute/gatk-sv for details).

Figure S4 |
Figure S4 | SV callset and quality evaluation results.A) Count of SV sites across 4,151 HGDP and 1kGP samples by variant type.B) Count of SVs per genome by variant type.C) Count of SV sites by allele frequency.D) Inheritance of SVs calculated in 100 pather-mother-child trio families.Proband Inheritance Rate -proportion of SVs in children's genome that were inherited from either parents; Paternal Inheritance Rate -proportion of SVs in children's genome that were shared by paternal genome; Maternal Inheritance Rate -proportion of SVs in children's genome that were shared by maternal genome; Parental Transmission Rate -proportion of SVs in parents' genome that were transmitted into children's genome; Trans.Rate (Paternal) -proportion of SVs in paternal genome that were transmitted into children's genome; Trans.Rate (Maternal) -proportion of SVs in maternal genome that were transmitted into children's genome.E) Correlation of allele frequencies.F) Hardy-Weinberg Equilibrium distribution of SVs across all samples.Each point is a single biallelic autosomal SV projected onto HWE ternary axes corresponding to its ratio of homozygous reference (0/0), heterozygous (0/1), and homozygous alternate (1/1) genotypes across all samples in the indicated population.The distance of a point to a vertex indicates the fraction of samples with that genotype.Deviation from HWE was assessed using a chi-square goodness-of-fit test with one degree of freedom, and points are colored based on their P-value.Green points are SVs within bounds defined for HWE based on the number of sites documented in each population, and purple points are SVs outside of these P-value bounds.The proportion of SVs corresponding to each P-value cutoff is provided at the right of each panel.Plots were generated using the "HardyWeinberg" package in R.

Figure S5 |
Figure S5 | Mean count of SVs versus SNVs by project, region, and number of individuals.Top line shows a fitted regression line to the 1000 Genomes Project points, and bottom line is fitted to HGDP points.A larger number of SVs are present in the 1000 Genomes Project data, which was explored more fully in Figure S6.

Figure S6 |
Figure S6 | SV breakdown in count by class across HGDP and 1kGP (HGSV).Per genome SV counts by study and PCR status (A,C), and population (B).Per genome SV counts are also broken down by SV type, including deletions, duplications, multi-allelic CNVs, insertions, inversions, and complex SVs in D).

Figure S9 |
Figure S9 | PCA biplots and densities globally.A) Map shows where all samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S10 |
Figure S10 | Subcontinental PCA in AFR populations.A) Map shows where all AFR samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S11 |
Figure S11 | Subcontinental PCA in CSA populations.A) Map shows where all CSA samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S12 |
Figure S12 | Subcontinental PCA in EAS populations.A) Map shows where all EAS samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.16

Figure S13 |
Figure S13 | Subcontinental PCA in EUR populations.A) Map shows where all EUR samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S14 |
Figure S14 | Subcontinental PCA in AMR populations.A) Map shows where all AMR samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.Filled circles indicate populations in the 1000 Genomes Project, while filled triangles indicate populations in HGDP.Population codes are as in TableS1.C) Density plot of PCA biplots of PCs 1-4.17

Figure S15 |
Figure S15 | Subcontinental PCA in MID populations.A) Map shows where all MID samples in analyses are from.Palestinian and Druze have the same geographical coordinates.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.All MID populations are from HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S16 |
Figure S16 | Subcontinental PCA in OCE populations.A) Map shows where all OCE samples in analyses are from.B) PCA biplots of PCs 1-4.PCA outliers were removed prior to this analysis.All OCE populations are from HGDP.Population codes are as in Table S1.C) Density plot of PCA biplots of PCs 1-4.

Figure S17 |
Figure S17 | HGDP+1kGP ancestry labels applied to the Gambian Genome Variation (GGV) Project.A) PCs 1 and 2 of all HGDP+1kGP samples with GGV projected into the same PC space, with each reference population colored and the GGV samples shown in grey.B) The same PCs with the reference data shown in grey and the GGV samples showing the assigned ancestry-all AFR.

Figure S18 |
Figure S18 | Dendrogram of the pairwise F ST heatmap between populations colored by geographical/genetic regions.Populations largely cluster by region with a few exceptions.MID and three AMR populations for example are interspersed among other regions.

Table S5 |
Sex chromosome aneuploidies in the HGDP samples.FigureS5| Mean count of SVs versus SNVs by project, region, and number of individuals.TableS6| SV calls by external support from the HGSV study.FigureS6| SV breakdown in count by class across HGDP and 1kGP (HGSV).
| Example of a filter that was included in gnomAD v3.1 but excluded from this project.Table S9 | gnomAD Sample and Variant Quality Control Steps Dataset Comparisons Table S10 | Filters Applied to Comparison Datasets Table . It was removed during initial QC together with 31 samples that failed gnomAD's sample QC hard filters and two contaminated samples, and is excluded from the sample count of the initial dataset reported in the manuscript.

Table S8 | Pearson's correlation and Mantel tests results with and without waypoints.
Waypoints and calculations are described further in Methods ("F ST versus geographical distance").
MADs above the median number of singletons and over 4 MADs above the median ratio of heterozygous/homozygous variants call ratio.

Table S11 | Variants absent in HGDP+1kGP that were identified in comparison datasets depending on whether variants passed QC in each dataset.
The numbers in this table are the counts of variants which are only in the comparison datasets, listed in the first column.

Table S12 | Variant counts and percentages of comparison datasets.
The number of variants in are calculated for each minor allele frequency (MAF) bin.The MAF is represented as the MAF in the HGDP+1kGPd dataset.Variants with a MAF of 0% are not in the HGDP+1kGP dataset.For each comparison dataset there are counts for the number of variants in the comparison dataset only, in both the comparison dataset and HGDP+1kGP, and in HGDP+1kGP only.For each dataset at the bottom of each total count is the percentage of variants in each category previously described.

Table S13 | Shrinkage analysis matches and no classification numbers by SNP missingness in the test dataset, as shown in Figure S20.
There were no mismatched labels assigned.Fraction of SNPs in test dataset out of training dataset