A Collection of 2,280 Public Domain (CC0) Curated Human Genotypes

Cheap sequencing has driven the proliferation of big human genome data aggregation consortiums, providing extensive reference datasets for genome research. These datasets, however, may come with restrictive terms of use, conditioned by the consent frameworks within which individuals donate their data. Having an aggregated genome dataset with unrestricted use, analogous to public domain licensing, is therefore unusually rare. Yet public domain data is tremendously useful because it allows freedom to perform research with it. This comes with the price of donors surrendering their privacy and accepting the associated risks derived from publishing personal data. Using the Repositive platform (https://repositive.io/?23andMe), an indexing service for human genome datasets, we aggregated all deposited files in public data sources under a CC0 license from 23andMe, a leading Direct-to-Consumer genetic testing service. After downloading 3,137 genotypes, we filtered out those that were incomplete, corrupt or duplicated, ending up with a dataset of 2,280 curated files, each one corresponding to a unique individual. Although the size of this dataset is modest compared to current major genome data aggregation projects, its full access and licensing terms, which allows free reuse without attribution, make it a useful reference pool for validation purposes and control experiments.


Background & Summary
The availability of personal genome Direct-to-Consumer (DTC) tests has been fuelled by companies like 23andMe, which makes it easy for users to access their personal genotype data for an affordable price. Despite recent improvements in cost and availability of Next Generation Sequencing (NGS) technology for DTC use, array genotyping is still a viable option [1], particularly when users want to have a whole genome screen of known Single Nucleotide Variations (SNVs) for a cheap price. For about $100 one can buy a DTC test and have a satisfying state-of-the-art analysis of ancestry and genetic health risk [2]. 23andMe has been able to capitalise on the curiosity of many customers from around the world, generating one of the most extensive private genome data ecosystems in the world [3]. The genotype data aggregated from 23andMe customers, combined with phenotype information from questionnaires, has already been proven to be an effective way to discover new markers, e.g. [4][5][6].
Using the genotype data from 23andMe as a gateway for personal/recreational genomics also has other advantages. The format of the different SNP array versions (although the number of tested SNVs may vary) remains constant and relatively easy to handle: the size of the genotype files is in the order of tens of MB, which is more manageable than the GB sizes of Next Generation Sequencing (NGS) files. 23andMe genotype files are also a useful proxy for understanding the individual's main genetic features.
In this study we use the Repositive platform [7] (https://repositive.io/?23andMe), which indexes human genomic datasets from all major and known genome data repositories, to gather the greatest possible open access 23andMe set of individual genotypes. We took only 23andMe genotypes that have been deposited in public archives and have no restriction of use, i.e. a CC0 license (Table 1). These archives include (in order of greatest contribution): openSNP [8], the Personal Genome Project [9], Open Humans [10], Genomes Unzipped [11], the Corpasome [12] and Stephen Keating's data source [13]. Any genotype from these data sources can be copied, modified, distributed and used, even for commercial purposes, all without having to ask permission. Stephen Keating CC0 http://stevenkeating.info/main.html Table 1: Summary of licenses and their provenance for all data used for this study. All of them are in the public domain (CC0) [14]. This means that the data referenced in this study can be used, distributed or modified, even for commercial purposes, without needing to ask permission to the individuals from whom this data originated.
We downloaded every open access public domain 23andMe genotype available in every public repository currently indexed by Repositive, curating and bundling them into a single dataset. This amounted to 3,137 genotype files that, after curation, were reduced to 2,280 genotypes.

Data Records
For version 1 and 2 of their SNP chip 23andMe used a customised Illumina Hap550+ array. Version 3 is based on a customised Illumina OmniExpress+ array and version 4 uses a custom Illumina HumanOmniExpress-24 format chip. The downloaded raw data file is a zipped text file 5 MB to 30 MB in size in bed format [22], with a header describing the date in which the file was downloaded and the human reference assembly build used as coordinates for the mapping of the SNVs tested. The file itself provides the SNPdb ID for each SNV [23], the chromosome and position in the mapped assembly and the genotype, including both alleles. Figure 2 shows a snippet of the 23andMe genotype bed file from one of us downloaded on "Tue Dec 13 09:12:17 2016".

Data Availability
The 2,280 curated set of links is available: https://drive.google.com/file/d/0B8yXU9SkT3ftR3pIbW81cDhrc2s/view The data can be downloaded using the python script provided at the link supplied earlier.

Technical Validation
One common problem when aggregating genome data from different data sources is duplication. In this case, users wishing to share their genotype data may have submitted their 23andMe file to more than one open access repository.
We created a Principal Component Analysis (PCA) to cluster the 23andMe data from the 2,402 collected 23andMe individuals with more than 500k SNPs mapped to GRCh37 ( Table 2). The resulting PCA clusters are shown in Figure 2. From the PCA, we were able to identify 141 very clear duplications. There were some further cases of very similar (but not identical) values but there were no obvious cases of siblings or other relatives from the metadata associated to these 23andMe files.