Abstract
Haplotype blocks are a useful concept in genetics, with applications ranging from the detection of regions under positive selection to statistical methods that rely on dimension reduction. We propose a novel approach (“HaploBlocker”) for defining and inferring haplotype blocks that focuses on linkage rather than on the commonly used population-wide measures of linkage disequilibrium (LD), which fail to identify segments shared by individuals in only a subset of the population. We define a haplotype block as a sequence of alleles that has a predefined minimum frequency in the population; only haplotypes with a similar sequence of alleles are considered to carry that block, which effectively screens a dataset for group-wise identity-by-descent (IBD). Unlike in most other approaches, these blocks are not restricted to shared start or end positions but can overlap or even contain each other. From these haplotype blocks we construct a haplotype library that represents a large proportion of the genetic variability of a population with a limited number of blocks. Our method is implemented in the associated R package HaploBlocker. It provides the flexibility both to optimize the structure of the resulting haplotype library for subsequent analyses (e.g., identification of shared segments between different populations) and to handle datasets of different marker density and genetic diversity. Using haplotype blocks instead of SNPs allows local epistatic interactions to be modelled naturally, and the reduced number of parameters enables a wide variety of new methods for further genomic analyses. We illustrate our methodology with a dataset comprising 501 doubled haploid lines in a European maize landrace genotyped at 501,124 SNPs. With the suggested approach, we identified 2,851 haplotype blocks with an average length of 2,633 SNPs (compared to 27.8 SNPs per block in HaploView) that together represent 94% of the dataset.
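The block definition above (an allele sequence with a minimum frequency, carried by every haplotype with a similar sequence) can be illustrated with a toy sketch. This is not the HaploBlocker algorithm or the R package's interface; the function names, the `max_mismatches` tolerance, the frequency threshold, and the five example haplotypes are all hypothetical and chosen only to show how two blocks can overlap while being carried by different subsets of the population.

```python
from typing import List

def carriers(haplotypes: List[str], start: int, alleles: str,
             max_mismatches: int = 0) -> List[int]:
    """Indices of haplotypes whose alleles at markers start..start+len(alleles)-1
    differ from `alleles` in at most `max_mismatches` positions (hypothetical rule)."""
    end = start + len(alleles)
    found = []
    for idx, hap in enumerate(haplotypes):
        mismatches = sum(a != b for a, b in zip(hap[start:end], alleles))
        if mismatches <= max_mismatches:
            found.append(idx)
    return found

def block_frequency(haplotypes: List[str], start: int, alleles: str,
                    max_mismatches: int = 0) -> float:
    """Share of haplotypes in the population that carry the candidate block."""
    return len(carriers(haplotypes, start, alleles, max_mismatches)) / len(haplotypes)

# Made-up data: five haplotypes, eight biallelic markers coded 0/1.
haplotypes = [
    "00110101",
    "00110101",
    "01111010",
    "11010101",
    "11001010",
]

MIN_FREQUENCY = 0.5  # hypothetical threshold for accepting a block

# Block A covers markers 0-3, block B covers markers 3-7: they overlap at marker 3.
block_a = carriers(haplotypes, 0, "0011", max_mismatches=1)
block_b = carriers(haplotypes, 3, "10101")

for name, members in (("A", block_a), ("B", block_b)):
    freq = len(members) / len(haplotypes)
    print(name, members, freq, freq >= MIN_FREQUENCY)
```

In this toy population, haplotypes 0 and 1 carry both blocks, whereas haplotype 2 carries only block A and haplotype 3 only block B, a pattern that a single set of population-wide, non-overlapping block boundaries could not represent.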
Author summary
Whereas it is quite easy to identify segments of shared DNA between pairs of individuals, the problem becomes far more complex when analyzing a whole population. Especially in livestock and crop populations under strong selection, one can observe long and possibly favourable segments that are segregating at high frequency. We propose an adaptive and flexible approach to identify such segments (“haplotype blocks”). The main conceptual difference from other approaches is that we allow haplotype blocks to overlap, so that patterns shared by only a subset of the population can be mapped adequately. We then select a set of those haplotype blocks that forms a representation of the whole population (“haplotype library”). This haplotype library can be used like a SNP dataset in subsequent genomic analyses, with the advantage of a massive reduction in the number of parameters compared to standard haplotyping approaches. Since many traits targeted in breeding (e.g. grain yield, milk production) are known to be shaped by complex interactions within genomic regions (or even across the whole genome), using haplotype blocks instead of single base pairs provides a natural model for local interactions and enables the use of more complex models that incorporate, for instance, distant interactions between genes.
Footnotes
¤ University of Goettingen, Animal Breeding and Genetics Group, Albrecht-Thaer-Weg 3, 37075 Goettingen, Germany
* torsten.pook{at}uni-goettingen.de