Abstract
A map of approximately independent linkage disequilibrium (LD) blocks has many uses in statistical genetics. Current LD block maps are based on sparse recombination maps and only available for GRCh37 (hg19) and prior genome assemblies. We generated LD blocks for European (EUR) ancestry populations using a new large recombination map in GRCh38.
1 Introduction
With an increasing number of large-scale genome-wide association studies (GWAS) relying on meta-analysis, many newly developed statistical methods circumvent the need for individual-level data and instead require only GWAS summary statistics. These methods often rely on an external reference panel such as the 1000 Genomes Project (The 1000 Genomes, 2015) to model population-specific linkage disequilibrium (LD) patterns. Further, approaches to study the local genetic architecture utilize population-specific maps of approximately independent LD blocks (Shi et al., 2019), and methods to build these blocks have been described previously (Berisa and Pickrell, 2016). However, current blocks were generated based on GRCh37 coordinates, and as more data are mapped to the GRCh38 genome assembly, updated block coordinates are needed. One straightforward solution is to convert existing GRCh37 LD blocks to GRCh38 positions using resources such as liftOver (Kent et al., 2002). However, this approach produces large unmappable regions, because liftOver is designed to map short genomic sequences between genome builds, and longer genomic regions become fragmented and scattered across multiple chromosomes. To overcome this issue, we estimated new approximately independent LD blocks in European ancestry populations using a recently generated genome-wide recombination map based on parent-child pairs from Iceland (Halldorsson et al., 2019).
2 Methods
We applied LDetect (Berisa and Pickrell, 2016) (https://bitbucket.org/nygcresearch/ldetect/src/master/), which utilizes population-specific variants and a recombination map file to generate LD blocks. We used a recent high-resolution recombination map based on more than 115,000 Icelandic individuals (Halldorsson et al., 2019). This genetic map has a higher resolution than the previously used recombination map from HapMap, with a sex-averaged resolution of 683 bp as compared to 1,324 bp for HapMap. Further, the Icelandic recombination map is natively based on GRCh38. We downloaded VCF files from the 1000 Genomes GRCh38 December 2018 biallelic single nucleotide variant (SNV) data (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/) and removed variants with a minor allele frequency < 0.01. We interpolated genetic distances for the remaining variants using the recombination map and a custom R script, which is included in the scripts directory of our GitHub repository. We then partitioned each chromosome by first generating naïve blocks containing 5,000 SNVs each. To ensure that these naïve blocks did not terminate within regions of high LD, we extended the end of each partition until a shrunken covariance estimator (Wen and Stephens, 2010) between the first and last SNV was negligible (S_ij < 1.5 × 10^-8).
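The interpolation and naïve partitioning steps can be illustrated with a minimal Python sketch. This is not the authors' R script: the function names `interpolate_cm` and `naive_partitions` are hypothetical, the map is assumed to be available as sorted physical positions with cumulative genetic distances in cM, and the extension of each partition using the shrunken covariance estimator is not shown.

```python
import numpy as np


def interpolate_cm(map_pos, map_cm, variant_pos):
    """Linearly interpolate cumulative genetic distance (cM) at variant positions.

    Assumes map_pos (bp) is strictly increasing and map_cm holds the
    cumulative cM at those positions. Variants outside the map range
    receive the nearest boundary value.
    """
    map_pos = np.asarray(map_pos, dtype=float)
    map_cm = np.asarray(map_cm, dtype=float)
    if np.any(np.diff(map_pos) <= 0):
        raise ValueError("map positions must be strictly increasing")
    return np.interp(np.asarray(variant_pos, dtype=float), map_pos, map_cm)


def naive_partitions(n_snvs, size=5000):
    """Return (start, end) index pairs, end exclusive, of fixed-size naive blocks."""
    return [(s, min(s + size, n_snvs)) for s in range(0, n_snvs, size)]


# Toy map with three markers; all values are made up for illustration.
map_pos = [1000, 2000, 4000]
map_cm = [0.0, 0.10, 0.30]
variant_cm = interpolate_cm(map_pos, map_cm, [1500, 3000, 5000])

# 12,000 SNVs split into naive blocks of 5,000; the last block is shorter.
parts = naive_partitions(12000)
```

In this toy example the variant at 3,000 bp lies halfway between the markers at 2,000 and 4,000 bp, so it receives the midpoint of their genetic distances, and the variant at 5,000 bp, beyond the last marker, takes the last map value.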
The resulting partitions had a median length of 1.4 Mb with a median overlap of 300 Kb, which allowed us to efficiently compute the partition-specific SNV covariance in parallel. When computing the covariance, we used only European ancestry sub-populations (TSI, IBS, CEU, GBR) to improve consistency with the recombination map, which is based on European ancestry individuals. We computed the covariance minima across each partition and then selected block-specific breakpoints using a low-pass filter followed by a local search (the fourier-ls algorithm). A more detailed description of our methods, including all our code, as well as coordinates for approximately independent LD blocks in BED format, can be found in our GitHub repository (https://github.com/jmacdon/LDblocks_GRCh38).
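The general idea of selecting breakpoints with a low-pass filter and a local search can be sketched as follows. This is an illustrative sketch only, not LDetect's fourier-ls implementation: the helper names (`hann_smooth`, `local_minima`, `refine`, `breakpoints`), the use of a Hann window, and the stopping rule are assumptions, and the input is taken to be a generic one-dimensional per-position LD summary signal.

```python
import numpy as np


def hann_smooth(signal, width):
    """Low-pass filter a 1-D signal by convolving with a normalized Hann window.

    Assumes 3 <= width <= len(signal). Values near the edges are damped
    because np.convolve pads with zeros in "same" mode.
    """
    window = np.hanning(width)
    window /= window.sum()
    return np.convolve(signal, window, mode="same")


def local_minima(values):
    """Indices of interior points strictly below both neighbours."""
    v = np.asarray(values)
    interior = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return np.flatnonzero(interior) + 1


def refine(raw, candidates, radius):
    """Local search: move each candidate to the smallest raw value within +/- radius."""
    raw = np.asarray(raw)
    refined = []
    for c in candidates:
        lo, hi = max(0, c - radius), min(len(raw), c + radius + 1)
        refined.append(lo + int(np.argmin(raw[lo:hi])))
    return sorted(set(refined))


def breakpoints(raw, n_target, radius):
    """Widen the filter until at most n_target minima remain, then refine them."""
    raw = np.asarray(raw, dtype=float)
    for width in range(3, len(raw) + 1, 2):  # odd widths keep the window centred
        minima = local_minima(hann_smooth(raw, width))
        if len(minima) <= n_target:
            return refine(raw, minima, radius)
    return []


# Toy signal: two broad valleys plus a faster oscillation (made-up data).
x = np.arange(400)
raw = np.cos(2 * np.pi * x / 200) + 0.1 * np.sin(x)
bp = breakpoints(raw, n_target=2, radius=10)
```

Smoothing removes spurious small-scale minima so that only the dominant valleys remain, and the local search then moves each candidate back to the lowest point of the unsmoothed signal nearby.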
3 Results
Overall block statistics are presented in Table 1. We generated a total of 1,361 LD blocks, which is fewer than the existing 1,703 GRCh37 European ancestry LD blocks. The GRCh38 blocks were also longer and more variable in length (Figure 1). Compared to the GRCh37 blocks, the GRCh38 blocks show a median 20% increase in both block length and median absolute deviation (MAD). While these updated LD blocks are useful for European ancestry populations, a limitation of this work is the lack of GRCh38 LD blocks for other ancestries. Unfortunately, no high-resolution GRCh38 genetic maps currently exist for other populations. LD blocks for African and Asian populations, based on GRCh37, are available in BED format in the LDetect Bitbucket data repository (https://bitbucket.org/nygcresearch/ldetect-data/src/master/).
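Summary statistics of this kind can be computed directly from a BED file of block coordinates. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the input lines are made-up toy data rather than actual block coordinates, and the MAD is the unscaled median of absolute deviations, which may differ by a constant factor from a scaled version such as the default of R's `mad()`.

```python
import statistics


def block_lengths(bed_lines):
    """Return block lengths (end - start) from BED-style lines.

    Lines whose second field is not an integer (e.g. a header row)
    and blank lines are skipped.
    """
    lengths = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        lengths.append(int(fields[2]) - int(fields[1]))
    return lengths


def median_and_mad(values):
    """Return the median and the unscaled median absolute deviation."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med, mad


# Toy input with a header row; coordinates are made up for illustration.
toy_bed = [
    "chr start stop",
    "chr1 0 1000",
    "chr1 1000 3000",
    "chr1 3000 6000",
]
lengths = block_lengths(toy_bed)
med, mad = median_and_mad(lengths)
```

Applying the same two functions to the GRCh37 and GRCh38 BED files, per chromosome or genome-wide, would allow the block length distributions to be compared on the same footing.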
Table 1. Length of LD blocks per chromosome in European ancestry populations. The long maximum block lengths for chromosomes 1, 9, 12, and 18 are due to inclusion of centromeric regions.
Figure 1. Distribution of block length (log10(bp)) per chromosome for the European genetic ancestry population, by genome build.
Funding
This work was supported by the National Institutes of Health (CA194393), as well as the National Institute of Environmental Health Sciences (P30-ES007033).
Conflict of Interest
None declared.