PT - JOURNAL ARTICLE
AU - Peter Pfaffelhuber
AU - Franziska Grundner-Culemann
AU - Veronika Lipphardt
AU - Franz Baumdicker
TI - How to choose sets of ancestry informative markers: A supervised feature selection approach
AID - 10.1101/759464
DP - 2019 Jan 01
TA - bioRxiv
PG - 759464
4099 - http://biorxiv.org/content/early/2019/09/08/759464.short
4100 - http://biorxiv.org/content/early/2019/09/08/759464.full
AB - Inference of the Biogeographical Ancestry (BGA) of a person or trace relies on three ingredients: (1) a reference database of DNA samples including BGA information; (2) a statistical clustering method; (3) a set of loci which segregate depending on geographical location, i.e. a set of so-called Ancestry Informative Markers (AIMs). We used the theory of feature selection from statistical learning to obtain AIMsets for BGA inference. Using simulations, we show that this learning procedure works in various cases and outperforms ad hoc methods that choose AIMs based on statistics such as FST or informativeness. Applying our method to data from the 1000 Genomes Project (excluding Admixed Americans), we identified an AIMset of 17 SNPs, which partly overlaps with existing ones. For continental BGA, the AIMset outperforms existing AIMsets on the 1000 Genomes dataset and gives a vanishing misclassification error.