TY  - JOUR
T1  - A Collection of 2,280 Public Domain (CC0) Curated Human Genotypes
JF  - bioRxiv
DO  - 10.1101/127241
SP  - 127241
AU  - Richard J. Shaw
AU  - Manuel Corpas
Y1  - 2017/01/01
UR  - http://biorxiv.org/content/early/2017/04/14/127241.abstract
N2  - Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks with which individuals donate their data. Having an aggregated genome dataset with unrestricted use analogous to public domain licensing is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform (https://repositive.io/?23andMe), an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allow free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.
ER  - 