Kinpute: Using identity by descent to improve genotype imputation

Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. Kinpute uses identity by descent information—due to either recent, familial relatedness or distant, unknown ancestors—to alleviate this problem. Kinpute is designed to use a within-study reference panel and works in conjunction with other genotype imputation software to improve the quality of the imputed genotypes. In addition, Kinpute can select an optimal set of study individuals to sequence as the within-study reference panel. Kinpute is an open-source and freely available C++ software package that can be downloaded from https://github.com/markabney/Kinpute/releases.


Introduction
Genotype imputation methods have been a boon to researchers, letting them maximize available resources by computationally inferring the alleles at untyped variants for many individuals. Although a majority of genotypes may be imputed with high accuracy, a substantial fraction will typically be discarded due to low quality. This is particularly true when the study population is not well represented by available reference panels. For instance, indigenous and founder populations, or even isolated populations of European ancestry, can have significantly improved genotype imputation results when population-specific sequence data are included (Deelen et al., 2014; Hou et al., 2017; Mitt et al., 2017; Pistis et al., 2015; Sidore et al., 2015; Zhou et al., 2017).
In studies where there is substantial identity by descent (IBD) between subjects, for instance due to founder effects or familial relatedness, IBD can increase imputation quality (McCarthy et al., 2016). For families with a known pedigree, several methods make use of IBD to perform imputation (Burdick et al., 2006; Chen and Schaid, 2014; Cheung et al., 2013; Livne et al., 2015). Long-range phasing approaches (Kong et al., 2008; Palin et al., 2011; Uricchio et al., 2012) also make use of IBD for imputation.
Our method, Kinpute, does not require pedigree information, can impute genotypes in IBD regions in both closely and distantly related individuals, does not require phase information (and, hence, is robust to phasing errors), and is novel in that it can leverage the accuracy of standard imputation software (Howie et al., 2012; Fuchsberger et al., 2015; Browning and Browning, 2016) by using their output as prior genotype probabilities. Kinpute requires a within-study sequenced reference panel and can select an optimal set of such individuals. If genotype probabilities from a standard imputation method are not available, Kinpute can use allele frequencies to generate prior genotype probabilities.

Methods
The software consists of two components, one which selects an optimal set of individuals to use as a reference panel while leaving the rest as the imputation panel, and another which performs genotype imputation on the imputation panel given sequence data on the reference panel. Both components require prior estimates of IBD sharing. Here, we present a brief description of the two methods. Full details are given in the Supplementary Information.
To select an optimal internal reference panel, we assume knowledge of either kinship coefficients, from a pedigree for instance, or a genetic relationship matrix estimated from genotype data, for every pair of individuals in the study. These coefficients measure the probability of a pair of individuals being IBD at an arbitrary locus. Our method, given a fixed size of the desired reference panel, seeks a division of the sample such that a global measure of the probability that the reference panel individuals are not IBD with the imputation sample is minimized.
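The idea of choosing a fixed-size panel that minimizes a global non-IBD measure can be illustrated with a minimal Python sketch. It is not Kinpute's algorithm, whose objective and search procedure are given in the Supplementary Information. The sketch assumes a symmetric matrix `kin` of pairwise IBD probabilities, treats pairs as independent, and uses a simple greedy search; the function names and the score itself are hypothetical.

```python
def non_ibd_score(kin, panel):
    """Illustrative global score (an assumption, not Kinpute's objective):
    the sum, over individuals outside the panel, of the probability of
    sharing no IBD with any panel member, treating pairs as independent."""
    members = set(panel)
    total = 0.0
    for i in range(len(kin)):
        if i in members:
            continue
        p_none = 1.0
        for r in panel:
            p_none *= 1.0 - kin[i][r]
        total += p_none
    return total


def greedy_panel(kin, size):
    """Greedily add the individual that most reduces the score.
    Ties are broken in favour of the lowest index."""
    panel = []
    candidates = set(range(len(kin)))
    for _ in range(size):
        best = min(sorted(candidates),
                   key=lambda c: non_ibd_score(kin, panel + [c]))
        panel.append(best)
        candidates.remove(best)
    return panel


# Toy example: individuals 0 and 1 are related, as are 2 and 3.
kin = [
    [0.50, 0.25, 0.00, 0.00],
    [0.25, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.25],
    [0.00, 0.00, 0.25, 0.50],
]
panel = greedy_panel(kin, 2)
```

In this toy example the greedy search picks one individual from each related pair, so that every remaining individual has a relative in the panel. A greedy search is not guaranteed to find the global optimum.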
The imputation method is designed to work in conjunction with other genotype imputation methods, e.g. IMPUTE2 (Howie et al., 2012), minimac (Fuchsberger et al., 2015) or Beagle (Browning and Browning, 2016), that output genotype probabilities at specific markers. These genotype probabilities are used as prior probabilities for our method.
In the absence of genotype probabilities from other methods, Kinpute will compute prior probabilities from the allele frequencies in the reference panel.
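As a minimal sketch of how allele frequencies can yield prior genotype probabilities, the Python example below estimates the alternate allele frequency by simple allele counting in the reference panel and converts it to genotype probabilities under Hardy-Weinberg proportions. Both the counting estimator, which ignores relatedness among panel members, and the Hardy-Weinberg assumption are ours for illustration; the text does not specify how Kinpute performs this step, and the function names are hypothetical.

```python
def allele_frequency(ref_genotypes):
    """Alternate-allele frequency from reference-panel genotypes coded as
    alternate-allele counts (0, 1 or 2). Naive counting; relatedness among
    panel members is ignored (an assumption of this sketch)."""
    return sum(ref_genotypes) / (2 * len(ref_genotypes))


def hwe_prior(p):
    """Prior probabilities for genotypes (0, 1, 2) under Hardy-Weinberg
    proportions, given alternate-allele frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


# Four reference individuals at one SNP.
freq = allele_frequency([0, 1, 1, 2])
prior = hwe_prior(freq)
```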
Our approach to using IBD to enhance imputation is to use the framework set of markers to compute IBD probabilities, across the genome, for a non-sequenced individual with every reference panel individual. In addition, we use IBD probabilities within the refer-

Results
We applied our method to a human founder population in which 98 individuals were sequenced. Details on the subjects are given in the Supplementary Information. Of the 98 individuals, 50 were selected as an internal reference panel while imputation was done in the remaining 48. The sequence data of the 48 were hidden from imputation and used as the "ground truth" against which to compare the imputed genotypes. IBD probabilities between all pairs of the 98 individuals were computed from genotype array data using the IBDLD software package (Han and Abney, 2011, 2013), and IMPUTE2 was used with the 1000 Genomes data combined with the internal reference panel to compute initial imputed genotype probabilities. These probabilities were used as prior probabilities for our Kinpute method.
To assess the imputation improvement provided by Kinpute, we stratify the genotypes by genotype and SNP metrics. SNPs are stratified by IMPUTE2 Info score (> 0.4 or < 0.4), minor allele frequency (≤ 0.02 or > 0.02), and whether the SNP was shared (i.e. in both reference panels) or private (i.e. only in the population-specific reference panel). Minor allele frequency was computed from the pooled reference panels for shared SNPs, or from the population-specific reference panel for private SNPs. In addition, each genotype was stratified based on whether the IMPUTE2 results had high certainty (i.e. the maximum probability for a genotype was at least 0.9) or low certainty (i.e. the complement of high certainty). Within each stratification bin, we compute imputation quality metrics using all genotypes in that bin. In Table 1 we show the genotype dosage R² between imputed dosage and the true genotype from both IMPUTE2 and Kinpute. The additional quality metrics, concordance, heterozygote sensitivity, and heterozygote positive predictive value, are given in the Supplementary Information.
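The per-genotype quantities used in this evaluation can be sketched in a few lines of Python. The sketch assumes that each imputed genotype is a triple of probabilities for 0, 1 and 2 copies of the alternate allele, that dosage is the expected allele count, and that dosage R² is the squared Pearson correlation between dosage and true genotype. These are the usual definitions but are assumptions here, and the function names are hypothetical.

```python
def dosage(probs):
    """Expected alternate-allele count from probabilities (P0, P1, P2)."""
    return probs[1] + 2.0 * probs[2]


def is_high_certainty(probs, threshold=0.9):
    """High certainty: the maximum genotype probability is at least 0.9."""
    return max(probs) >= threshold


def dosage_r2(imputed_probs, true_genotypes):
    """Squared Pearson correlation between imputed dosage and true
    genotype, pooled over all genotypes in one stratification bin."""
    x = [dosage(p) for p in imputed_probs]
    y = [float(g) for g in true_genotypes]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return float("nan")  # undefined when either side has no variance
    return sxy * sxy / (sxx * syy)


# Perfectly imputed toy bin: dosages 0, 1, 2 against truth 0, 1, 2.
probs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
truth = [0, 1, 2]
r2 = dosage_r2(probs, truth)
```

Pooling all genotypes in a bin, rather than averaging per-SNP values, follows the statement that the metrics are computed using all genotypes in that bin.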
In all but one stratification level, where neither method does well, Kinpute improves on the imputation quality of IMPUTE2. For genotypes that are of low certainty, in particular, Kinpute can substantially increase R², especially for SNPs that have a low Info score.

Conclusions
We have shown that in the presence of IBD, Kinpute can significantly improve the quality of imputed sequence data over imputation methods such as IMPUTE2. Note that Kinpute should be viewed as a supplement, as opposed to an alternative, to standard imputation methods such as IMPUTE2, Beagle or minimac. In the absence of IBD, Kinpute returns the supplied prior probabilities. When IBD is present, however, and particularly when the external reference panel does not represent the study population well, Kinpute provides an additional useful tool to maximize the use of a study's sequence data.