Abstract
Motivation Genotype datasets typically contain a large number of single nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual’s origin or membership in a cultural group, dimensionality reduction techniques are routinely deployed. However, inherent technical difficulties such as missing or noisy data need to be accounted for when analyzing a lower-dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. To date, no stability assessment technique for genotype data exists that can estimate this uncertainty.
Results Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. In addition to this bootstrap-based stability estimation, Pandora offers a sliding-window stability estimation for whole-genome data. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques.
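The idea of a bootstrap-based stability score can be illustrated with a minimal sketch. This is not Pandora's implementation or API: the function names (`pca_embedding`, `procrustes_similarity`, `bootstrap_stability`), the choice of PCA as the embedding, and the Procrustes-based similarity measure are all assumptions made for illustration. The sketch resamples SNP columns with replacement, embeds each replicate, and averages the pairwise similarity of the resulting embeddings.

```python
import numpy as np


def pca_embedding(genotypes, n_components=2):
    """Project individuals (rows) onto the top principal components."""
    centered = genotypes - genotypes.mean(axis=0)
    # Rows of u * s are the coordinates of the individuals in PC space.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def procrustes_similarity(a, b):
    """Similarity in [0, 1] after optimal translation, scaling, and
    rotation/reflection of one embedding onto the other."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # The sum of singular values of a^T b is the best achievable overlap.
    singular_values = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(min(singular_values.sum(), 1.0))


def bootstrap_stability(genotypes, n_bootstraps=20, n_components=2, seed=0):
    """Average pairwise similarity of embeddings of bootstrapped SNP sets.

    Hypothetical helper: an assumed scoring scheme, not Pandora's own.
    """
    rng = np.random.default_rng(seed)
    n_snps = genotypes.shape[1]
    embeddings = []
    for _ in range(n_bootstraps):
        # Resample SNPs (columns) with replacement; individuals stay fixed.
        columns = rng.integers(0, n_snps, size=n_snps)
        embeddings.append(pca_embedding(genotypes[:, columns], n_components))
    similarities = [
        procrustes_similarity(embeddings[i], embeddings[j])
        for i in range(n_bootstraps)
        for j in range(i + 1, n_bootstraps)
    ]
    return float(np.mean(similarities))


# Toy data (simulated here, not from the paper): 60 individuals, 500 SNPs,
# genotypes coded as 0, 1, or 2 copies of an allele.
rng = np.random.default_rng(42)
# Two populations with clearly different allele frequencies.
structured = np.vstack([
    rng.binomial(2, 0.2, size=(30, 500)),
    rng.binomial(2, 0.8, size=(30, 500)),
]).astype(float)
# A single population without any structure.
noise = rng.binomial(2, 0.5, size=(60, 500)).astype(float)

structured_score = bootstrap_stability(structured)
noise_score = bootstrap_stability(noise)
print(f"structured: {structured_score:.2f}, unstructured: {noise_score:.2f}")
```

In this toy setting, the dataset with clear population structure should yield embeddings that agree closely across bootstrap replicates, whereas the unstructured dataset should score lower because its leading components are driven by sampling noise.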
Availability and Implementation Pandora is available on GitHub at https://github.com/tschuelia/Pandora.
Contact julia.haag{at}h-its.org
Supplementary information All Python scripts and data to reproduce our results are available on GitHub at https://github.com/tschuelia/PandoraPaper.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
We shortened the manuscript and focused on a technical explanation of the presented tool. We moved all detailed analyses of empirical datasets to the supplementary material.