TY  - JOUR
T1  - Scaling probabilistic models of genetic variation to millions of humans
JF  - bioRxiv
DO  - 10.1101/013227
SP  - 013227
AU  - Prem Gopalan
AU  - Wei Hao
AU  - David M. Blei
AU  - John D. Storey
Y1  - 2014/01/01
UR  - http://biorxiv.org/content/early/2014/12/24/013227.abstract
N2  - A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis. Software: TeraStructure is available for download at https://github.com/premgopalan/terastructure. Funding: This research was supported in part by NIH grant R01 HG006448 and ONR grant N00014-12-1-0764.
ER  - 