SNP Selection and Concordance in Consumer Genetics Testing

Direct-to-consumer (DTC) genetic testing for predicting health risks and a variety of other phenotypes has been extensively discussed, as have the associated privacy and ethical concerns. Much less attention has been paid to what most people actually use DTC testing for: ancestry determination. Furthermore, comparisons of the platforms used by different companies, and of how each company has chosen SNPs to address questions of health and ancestry, have not been broadly reported. When SNPs across three genotyping platforms are compared, only 16-18% of SNPs with reported genotypes are shared across all platforms. Only 110,051 of the more than 600,000 SNPs are called on all three panels examined (Ancestry, 23andMe and MyHeritage). SNPs genotyped on all platforms are highly concordant, with only two SNPs having discordant calls. When the SNPs unique to a single panel are examined, it is apparent that each company has its own strategy for choosing SNPs: the unique SNPs differ in frequency, ethnic selectivity, and chromosomal location. Because each company separates the world into different, overlapping geographical regions, an exact comparison of ancestry results is impossible. Once the ways the regions overlap are factored in, however, the vendors generate congruent results for the major contributors to ancestry.


Introduction
The recent approval of 23andMe's test for three BRCA1/BRCA2 mutations (www.fda.gov/NewsEvents/Newsroom/PressAnnouncements/ucm599560.htm) has sparked renewed discussion of the impact of such tests on individual health.
The benefits of knowing whether one has a pathogenic variant that predisposes to disease are countered by the disadvantages of a potentially inaccurate test and/or a lack of understanding of the test's meaning by the individual ordering it.
These issues have been extensively discussed elsewhere (1-8) and will not be addressed here in detail. While many people are interested in the health information that can potentially be gleaned from genome-wide SNP data, most actually purchase the test to help them understand their family history. Even the names of many DTC companies (Ancestry, MyHeritage, FamilyTreeDNA) reinforce the idea that the primary use for the data is family history. However, the issues raised by the health tests have diminished the attention paid to how the family history data are actually generated and to the potential differences among providers.
There are three required components for determining ancestral origins via DNA testing.
The first component, the algorithms for determining ancestry, has been described at a high level, but the methods are mainly proprietary and have sparked patent battles (www.wired.com/story/23andme--sues--ancestry/). As a result, the methods cannot be readily compared. The second component, the set of reference populations, is similarly proprietary. Each company has its own set of samples representing populations around the world. How accurately these samples reflect a particular area cannot be determined without access to them and to how they were selected. The third component, the SNPs used for the analysis, can however be compared across platforms. The SNPs chosen for the three panels are quite dissimilar, so they can be contrasted to determine differences in strategy based on publicly available information.

Methods
Ancestry, 23andMe, and MyHeritage all offer similar DTC tests that employ the Illumina Infinium microarray (https://www.illumina.com/products/by-type/microarray--kits/infinium--iselect--custom--genotyping.html), which provides genotyping data for a custom list of around 700,000 SNPs. Each company chooses its own set of SNPs for genotyping. The data returned to the consumer include both chromosomal locations and, when available, dbSNP numbers, allowing easy comparison of SNPs and genotype calls. Where there were no dbSNP identifiers, chromosomal locations were used instead. In addition to comparing genotype files, information was obtained from dbSNP (www.ncbi.nlm.nih.gov/projects/SNP/), the hg19 human reference genome (www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/?report=full), the gnomAD sequence database (gnomad.broadinstitute.org/) and the BLAT tool for comparing short DNA sequences (http://genome.ucsc.edu/cgi--bin/hgBlat?command=start).
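The file comparison described above can be illustrated with a minimal Python sketch. It is not the procedure actually used for this study, and several things in it are assumptions: that each raw-data row has the form identifier, chromosome, position, then one or two genotype columns; that no-calls are written with "-" or "0"; and that the separator is a tab or a comma. The function names (normalize, parse_lines, compare) are hypothetical. Strand flips and single-letter hemizygous calls are not handled, so a real comparison would need additional care.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def normalize(genotype: str) -> Optional[str]:
    """Return the alleles in sorted order (so 'GA' equals 'AG'), or None for a no-call."""
    g = genotype.strip().upper()
    # Assumption: no-calls are encoded with '-' or '0' characters.
    if not g or "-" in g or "0" in g:
        return None
    return "".join(sorted(g))


def parse_lines(lines: Iterable[str], sep: str = "\t") -> Dict[str, str]:
    """Parse rows of 'id, chromosome, position, genotype...' into {key: genotype}.

    The key is the dbSNP identifier when one is present, otherwise 'chrom:pos'.
    """
    calls: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = [f.strip('"') for f in line.split(sep)]
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # skip the header row and malformed rows
        snp_id, chrom, pos = fields[0], fields[1], fields[2]
        genotype = normalize("".join(fields[3:]))
        if genotype is None:
            continue  # only SNPs with a reported genotype are kept
        key = snp_id if snp_id.startswith("rs") else f"{chrom}:{pos}"
        calls[key] = genotype
    return calls


def compare(*panels: Dict[str, str]) -> Tuple[Set[str], List[str]]:
    """Return (SNPs called on every panel, sorted list of those with discordant calls)."""
    shared = set(panels[0]).intersection(*panels[1:])
    discordant = sorted(k for k in shared if len({p[k] for p in panels}) > 1)
    return shared, discordant
```

Given one parsed dictionary per vendor, compare(a, b, c) returns the set of SNPs genotyped on all panels and the subset whose calls disagree, which correspond to the shared-SNP and discordance counts reported here.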

Results
Three vendors with similar products, Ancestry, 23andMe, and MyHeritage, were chosen to perform genotyping and interpretation. The MyHeritage dataset was also sent to two other vendors, FamilyTreeDNA and Gencove, for independent interpretation. Each platform reports over 600,000 SNP genotypes, so on average there is a SNP about every 5000 bp. Because accurate SNP genotyping requires well-behaved DNA, an even distribution of SNPs across the whole genome is not possible.
If the GC content of the DNA is too extreme, either high or low, discrimination of SNPs is not possible. Also, if a SNP is embedded in a region that is too similar to other genomic regions, the signals from the different regions will interfere with each other, leading to potentially erroneous calls. Despite these sequence limitations, no vendor has a gap between SNPs of >50 Mb. 23andMe has the lowest number of gaps >1 Mb (19) while MyHeritage has the highest (27).
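The spacing and gap figures above can be reproduced in outline with the following Python sketch. It rests on stated assumptions rather than on the study's actual procedure: a genome length of roughly 3.1 billion bp is used as an approximation for hg19, a gap is defined simply as the distance between adjacent genotyped SNPs on the same chromosome, and the function names (mean_spacing, find_gaps) are hypothetical. Regions before the first or after the last SNP on a chromosome are not counted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_spacing(genome_bp: float, n_snps: int) -> float:
    """Average distance between SNPs if they were spread evenly over the genome."""
    return genome_bp / n_snps


def find_gaps(
    snps: Iterable[Tuple[str, int]], min_gap: int = 1_000_000
) -> List[Tuple[str, int, int, int]]:
    """Return (chrom, left_pos, right_pos, length) for each gap longer than min_gap.

    snps is an iterable of (chromosome, position) pairs in any order.
    """
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    gaps = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        # Compare each SNP with its nearest neighbour on the same chromosome.
        for left, right in zip(positions, positions[1:]):
            if right - left > min_gap:
                gaps.append((chrom, left, right, right - left))
    return gaps


# Approximate genome size (assumption) divided by the reported SNP count.
print(round(mean_spacing(3_100_000_000, 600_000)))
```

Calling find_gaps with min_gap=50_000_000 would test the >50 Mb statement, and the default threshold counts gaps >1 Mb for a given vendor's SNP list.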
As shown in Table 1 reports 4301. Each of these chromosomes (X, Y, mito) has unique attributes with respect to determining ancestry along maternal and paternal lines.
To compare SNP choices, SNPs were categorized in two ways: by the overall frequency of the minor allele, which was sometimes the reference allele (less than 1%, 1-5%, 5-20% and greater than 20%), and by the difference in frequency between the highest frequency population and the lowest frequency population (less than 2-fold, 2-5-fold and greater than 5-fold). The vendors differed in how their SNPs populated these categories. The most common category for Ancestry was greater than 20% minor allele frequency and a 2-5-fold difference between populations. The most common category for MyHeritage was 5-20% minor allele frequency and >5-fold discrimination. The most common category for 23andMe was the high population difference category (>5-fold) but with lower overall frequency (1-5%). For every vendor, at least 50% of the chosen SNPs had high population differences, but this choice was most extreme with 23andMe (79%). This came at the expense of picking very low frequency SNPs (32% with less than 1% frequency), while Ancestry and MyHeritage chose far fewer SNPs in that frequency range (8% and 3%, respectively). This fits well with the heterozygosity rates observed in the DNA analyzed here because the less common SNPs will frequently be homozygous reference.
Comparison of SNP properties is relatively straightforward because of the common nomenclature and public data. Comparison of ancestry determination is more problematic because each vendor has separated the world into different geographical sectors. Thus, it is generally impossible to directly compare the detailed results at a country-level resolution. An attempt has been made to make the regional definitions equivalent in Table 4, but some quantitative variation is undoubtedly due to different borders for the locations of ancestral reference populations.
While any of these are possible, the low levels and inconsistencies across analyses suggest they are more likely noise and not real.
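The two-way SNP categorization described above, by minor allele frequency and by fold difference between populations, can be sketched in Python as follows. This is only an illustration under stated assumptions: per-population allele frequencies are taken as given (for example from gnomAD), values falling exactly on a bin boundary are assigned as shown in the code because the text does not specify this, a SNP absent from the lowest frequency population is treated as >5-fold, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def maf_bin(maf: float) -> str:
    """Assign a minor allele frequency to one of the four frequency classes."""
    if maf < 0.01:
        return "<1%"
    if maf < 0.05:
        return "1-5%"
    if maf <= 0.20:
        return "5-20%"
    return ">20%"


def fold_bin(pop_freqs: Sequence[float]) -> str:
    """Classify the ratio of the highest to the lowest population frequency."""
    hi, lo = max(pop_freqs), min(pop_freqs)
    if hi == 0:
        return "<2-fold"  # allele absent everywhere: no population difference
    fold = float("inf") if lo == 0 else hi / lo
    if fold < 2:
        return "<2-fold"
    if fold <= 5:
        return "2-5-fold"
    return ">5-fold"


def categorize_snp(alt_freq: float, pop_alt_freqs: Sequence[float]) -> Tuple[str, str]:
    """Return (frequency class, fold class) for one SNP.

    If the alternate allele is the major allele overall, frequencies are
    flipped so that both classes describe the minor allele.
    """
    if alt_freq > 0.5:
        alt_freq = 1 - alt_freq
        pop_alt_freqs = [1 - p for p in pop_alt_freqs]
    return maf_bin(alt_freq), fold_bin(pop_alt_freqs)


def tally(snps: Iterable[Tuple[float, List[float]]]) -> Counter:
    """Count how many SNPs fall into each (frequency class, fold class) pair."""
    return Counter(categorize_snp(f, pops) for f, pops in snps)
```

Applying tally to each vendor's unique SNPs and taking the most common key would give that vendor's dominant category, in the sense used in the text.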

Discussion
Most discussions of DTC genetic testing have focused on disease diagnosis or ethical/privacy issues. Only rarely has there been a discussion of the data quality and its use for ancestry testing. Previous reports that compared genotyping results examined panels of very different sizes (13,14), so the comparison was necessarily limited in scope. While they found high concordance, too few SNPs were compared to draw strong conclusions. One widely reported study indicated that "40% of variants in a variety of genes reported in DTC raw data were false positives". The DNA-based ancestry analyses, in which six companies provided results from three sets of data, yield similar conclusions. The primary problem in making a quantitative comparison among vendors lies in the fuzziness of the geographical boundaries that each provides. Not surprisingly, the reference populations originate from large areas with borders that have varied over time. Even if the birth location for current DNA donors is known precisely, the birth locations for their ancestors will be known with less certainty. As a result, each company defines the borders for its analysis differently.
With these data, all vendors agree the DNA is of primarily Irish/English/Northwestern European origin. This DNA-derived ancestry is consistent with documentary evidence obtained independently. The major differences among vendors are contributions of less than 3%. These could simply be noise, errors in genotyping/haplotyping, or the result of less than perfect reference populations, or they may reflect real contributions from distant, undocumented ancestors.
Higher resolution studies and/or better reference populations would be needed to clarify these discrepancies.
Using all three vendors for genotyping and three more for interpretation provided more confidence than using a single vendor would have. With multiple analyses, it is easier to see where the uncertainties lie in both percentages and geography. The combined results support the documentary evidence of approximately 75% Irish, 19% English, 3% Scottish, and 3% Swedish ancestry. There is still some room for uncertainty at the <3% level that will require better tests and reference populations to resolve.