RT Journal Article
SR Electronic
T1 Minor allele frequency thresholds dramatically affect population structure inference with genomic datasets
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 188623
DO 10.1101/188623
A1 Ethan Linck
A1 C.J. Battey
YR 2018
UL http://biorxiv.org/content/early/2018/10/21/188623.abstract
AB One common method of minimizing errors in large DNA sequence datasets is to drop variable sites with a minor allele frequency below some specified threshold. Though widespread, this procedure has the potential to alter downstream population genetic inferences and has received relatively little rigorous analysis. Here we use simulations and an empirical SNP dataset to demonstrate the impacts of minor allele frequency (MAF) thresholds on inference of population structure. We find that model-based inference of population structure is confounded when singletons are included in the alignment, and that both model-based and multivariate analyses infer less distinct clusters when more stringent MAF cutoffs are applied. We propose that this behavior is caused by the combination of a drop in the total size of the data matrix and by correlations between allele frequencies and mutational age. We recommend a set of best practices for applying MAF filters in studies seeking to describe population structure with genomic data.