Abstract
Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement investigators have started to exploit some of these databases to find distant familial relatives, which can lead to a complete re-identification. Here, we leveraged genomic data of 600,000 individuals tested with consumer genomics to investigate the power of such long-range familial searches. We project that half of the searches with European-descent individuals will result in a third cousin or closer match and will provide a search space small enough to permit re-identification using common demographic identifiers. Moreover, in the near future, virtually any European-descent US person could be implicated by this technique. We propose a potential mitigation strategy based on cryptographic signatures that can resolve the issue and discuss policy implications for human subject research.
Main Text
Consumer genomics has gained tremendous popularity in the last few years1. As of today, more than 15 million people have taken direct-to-consumer (DTC) autosomal genetic tests out of curiosity, with about 7 million kits sold in 2017 alone2. Nearly all major DTC providers use dense genotyping arrays that probe ~700,000 SNPs in the genome of each participant, and most DTC providers allow participants to download their raw genotype files in a textual format. This option has led to the advent of third-party services, such as DNA.Land and GEDmatch, that allow participants to upload their raw genetic data in order to get further analysis (Table 1)3.
DNA matching is one of the most popular features in consumer genomics. This feature harnesses the dense autosomal genotypes to find identity-by-descent (IBD) segments, which are indicative of a shared ancestor. Previous studies have shown that this technique has virtually perfect accuracy in finding close relatives and good accuracy in finding distant relatives, such as 2nd or 3rd cousins4–6, providing the option for long-range familial searches. This feature has led to many “success stories” in the genetic genealogy community, including the reunions of Holocaust survivors with relatives when they thought they had no living family left, reunions of adoptees with their biological families, and investigations of potential abductions of babies7.
From a technical and regulatory perspective, the consumer genomics tools are far more powerful for familial searches than traditional forensic techniques. Forensic familial searches use a small set of ~20 autosomal STR regions that were standardized for traditional fingerprinting8,9. This set is not sufficient to detect IBD matches, and forensic techniques instead have to rely on partial allelic matches as a means to identify relatives. Due to the small size of the STR panel, the forensic technique is limited to finding only 1st to 2nd degree relatives and can suffer from false positives9. In addition, familial searches in forensic databases have remained a highly controversial investigative tool from an ethical and societal perspective10. As a result, federal legislation permits the FBI to conduct familial searches for investigating severe crimes but allows individual states to develop their own policies11. Some states, such as California, have placed certain restrictions on conducting familial searches, whereas other states, such as Maryland, have prohibited the practice completely. However, these limitations primarily affect searches of state-owned forensic databases and do not explicitly restrict the use of crime scene samples with civilian DNA databases, such as consumer genomics databases, yielding a much less stringent regulatory route for such searches.
In the last few months, law enforcement has started to exploit third-party consumer genetic services for long-range familial searches. This route to re-identify individuals has been predicted before12, but became practical only recently with the rapid increase of consumer genomics database sizes. In a recent high-profile case, the FBI used a long-range familial search to trace the Golden State Killer13,14. Based on public information, the investigators obtained a dense genome-wide genotyping profile of a crime scene sample of the perpetrator, either by an array or by whole genome sequencing, without the cooperation of a DTC provider. They rendered the profile to look like an ordinary dataset from a consumer genomics provider and uploaded it to GEDmatch, a third-party website with approximately 1 million samples that offers long-range familial searches using IBD matches. The GEDmatch search identified a 3rd degree cousin of the perpetrator13. The FBI team, which consisted of five genealogists, took four months to trace the identity of the perpetrator, which was confirmed by a standard forensic test. In a few other less notable cases over the past few months, private DNA detectives from the “DNA Doe Project” used long-range familial searches with third-party services to identify the remains of “Lyle Stevik”, with an estimated time of a few hundred hours of work, and of the “Buckskin Girl”, in a few hours of work15. Recently, a forensic DNA company announced that it had set up a division that will use such long-range familial searches and had uploaded 100 cold cases to third-party DTC services16. Finally, as we were authoring this manuscript, the Snohomish County Sheriff’s Office announced that they solved a cold case from 1987 of a double murder using another long-range familial search on GEDmatch17.
In this case, the investigators found two relatives: a paternal half-first cousin, once removed and a maternal first cousin, once removed. Using genealogical searches, this profile quickly led to one sibship of three sisters and a brother. As the perpetrator was a male, the investigators focused on the brother who was confirmed to be a person of interest with a traditional forensic DNA test.
We took an empirical approach to investigate the probability of a long-range familial search to re-identify an individual. To this end, we analyzed a dataset of 600,000 individuals who were tested with a DTC provider and consented to this type of research (Supplementary Methods). About 85% of all individuals showed European heritage as their main DNA ethnicity (Supplementary figure 1; Supplementary table 1), similar to previous studies with DTC individuals18. We searched for IBD segments among the 180 billion potential pairs of individuals. Overall, these individuals formed a dense IBD network, with over 1 billion pairs having at least a single IBD region longer than 6cM. We derived a subset of these pairs that included only those with at least two IBD segments, which increased the chance of correctly inferring genealogical relationships. Next, we removed pairs with IBD segments greater than 700cM (approximately first cousin and closer relationships) to circumvent potential ascertainment biases due to close relatives sending in their kits together. Finally, we analyzed the number of individuals with at least one IBD segment between 30cM and 600cM (Supplementary Methods). The low end of our range corresponds to 4th cousins; the high end corresponds to 2nd cousins based on a large-scale crowdsourcing project19.
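As a quick check on the scale of this pairwise comparison, the number of unordered pairs among 600,000 individuals can be computed directly. The sketch below is illustrative only: the variable names are hypothetical, and the 1 billion figure is used as the lower bound quoted above rather than an exact count.

```python
from math import comb

# Number of individuals in the dataset, as stated in the text.
n_individuals = 600_000

# Unordered pairs: n * (n - 1) / 2, i.e. roughly 180 billion.
total_pairs = comb(n_individuals, 2)

# "Over 1 billion pairs" is treated here as a lower bound (assumption).
ibd_pairs_lower_bound = 1_000_000_000
share = ibd_pairs_lower_bound / total_pairs

print(f"{total_pairs:,} potential pairs")
print(f"at least {share:.2%} of pairs share an IBD region longer than 6cM")
```

Under these figures, a little over half a percent of all potential pairs carry at least one detectable IBD region.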
Our results show that long-range familial searches have a good probability of returning a relative for a database size of 600,000 individuals (Figure 1A). We found that 46% of the searches will result in an IBD match of at least 100cM, which typically corresponds to a third cousin or closer relative, similar to the Golden State Killer case. Interestingly, these results, which rely on a partial genetic database, are considerably higher than surname inference from the Y-chromosome, which is another re-identification tactic using genetic genealogy data20. Moreover, long-range familial searches allow direct re-identification of females and not just males. In 10% of the searches with our data of 600,000 individuals, the top match will have an IBD segment of at least 300cM, which corresponds to a second cousin or a closer relative. To further validate our results, we also manually performed 30 random long-range familial searches in GEDmatch, which has approximately 1M individuals in its database. The results were consistent: the top match shared >30cM in over 90% of the searches, >100cM in 75% of the searches, and >300cM in 10% of the searches (Figure 1A). Since most individuals in these databases are US Caucasians, these results are likely to be relevant to this ethnic group.
More broadly, we expect long-range familial searches to return a match to virtually anyone with genetic databases that cover even a small fraction of the target population. This assertion relies on a population genetics model that takes into account the probability of sharing at least two IBD segments of >6cM and assumes population growth rates seen in the last 200 years in the Western world [a recent blog post by Doc & Coop conducted a similar analysis for GEDmatch database size21] (Supplementary Methods). This model has multiple simplifying assumptions, such as no population structure, no inbreeding, and random sampling of participants. However, we found that the model gave a good approximation of our empirical analysis, predicting that 44% of the searches will return at least a third cousin match compared to the observed rate of 46% for >80cM for north Europeans in our data. If considering a US Caucasian target, similar to the Golden State Killer, our model predicts that a database with ~5 million individuals (2% of this ethnic group) has a 3rd cousin match for virtually any person in this ethnic group. With databases of this size, over 90% of the searches will return more than one 3rd cousin, which can greatly improve triangulation, and ~70% of searches will return a 2nd cousin or a closer relative. Notably, consumer genomics grows at exponential rates, and covering 2% of US Caucasians is within reach for some third-party websites in the near future.
Next, we examined the ability to narrow down the suspect list after finding a match in a long-range familial search. We assumed a case where a long-range familial search returned a bona fide match to a 3rd cousin or genetically equivalent relative of >100cM. Furthermore, we considered a scenario where the sex of the person of interest is known, their age can be estimated within a 10-year interval, and the location of residence can be estimated within a radius of 100 miles (approximately the land area of the state of Maine). We used extensive genealogical records of population-scale family trees22 to analyze whether basic demographic information has the power to quickly prune this search space (Supplementary Methods).
We found that the suspect list can be quickly pruned using simple demographic information. We predict that a match in this scenario has a search space of ~850 relatives on average (Figure 2A). Our simulations suggest that geographic data will exclude on average 57% of the list (Figure 2B and Supplementary table 2). Next, age at a 10-year interval is expected to exclude another 91% of the remaining relatives (Figure 2C), leading to 33 individuals on average. Finally, sex information will halve the list to around 16-17 individuals on average, a search space that is small enough for manual inspection. In research projects, the HIPAA privacy law permits the release of the year of birth, which is an even more powerful identifier (Figure 2D). Our analysis shows that age at a single-year resolution together with geography (<100 miles) and sex is expected to return 1-2 individuals. Figure 2E summarizes the entire process.
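The arithmetic behind this pruning can be sketched as follows. The function and variable names are hypothetical, the filters are applied to averages only, and treating the sex filter as an exact halving is a simplifying assumption; the starting size and exclusion rates are the averages quoted above.

```python
def prune_search_space(initial, geo_excluded, age_excluded, sex_kept=0.5):
    """Apply the three demographic filters in sequence to an average list size.

    geo_excluded and age_excluded are the fractions of the remaining list
    that each filter removes; sex_kept is the fraction kept by the sex filter
    (assumed to be one half).
    """
    after_geo = initial * (1 - geo_excluded)
    after_age = after_geo * (1 - age_excluded)
    after_sex = after_age * sex_kept
    return after_geo, after_age, after_sex


# Averages quoted in the text: ~850 relatives, 57% excluded by geography,
# 91% of the remainder excluded by a 10-year age interval.
after_geo, after_age, after_sex = prune_search_space(850, 0.57, 0.91)

print(f"after geography: {after_geo:.1f}")
print(f"after age:       {after_age:.1f}")
print(f"after sex:       {after_sex:.1f}")
```

The intermediate values land at roughly 33 individuals after the age filter and between 16 and 17 after the sex filter, in line with the averages reported above.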
Taken together, our lines of analysis show that long-range familial searches have the potential to re-identify substantial numbers of US Caucasian individuals. The main barrier is not finding a match or pruning the search space to trace the person of interest. Rather, successfully tracing an individual simply depends on the accessibility of genealogical data, their accuracy, and the determination of the investigators. Indeed, policymakers and the general public might be in favor of such enhanced forensic capabilities for solving horrendous crimes. However, we caution that the open nature of these services means that the very same technique can be exploited to identify genomic data of research subjects or for counter-espionage activities by foreign adversaries.
We propose a technical measure that can mitigate some of the risks and restore control to data custodians (Supplementary figure 2). The collection and processing methods of traditional DTC providers are geared towards a large amount of saliva or buccal cells and not to the minute quantities of DNA and the variety of tissues of origin common to crime scene evidence. Therefore, forensic long-range familial searches have so far used special labs to develop the raw genotype data and had to render the data to mimic the format of regular DTC providers in order to upload it to third-party services. In our proposal, DTC providers will add cryptographic signatures to the header of the text file containing the raw data available to their customers. Each supplier will use a secret private key for signing the data and will make the public key available at a known Internet address. This way, third-party services will be able to authenticate that a raw genotyping file was created by a valid DTC supplier without any modification and to distinguish between valid sources and questionable sources. In case of a failure to validate the file, the third-party service can either reject the file, allow the DNA profile to be found but not to initiate a search, or quarantine the file until some reassurance about its origin is provided. Of course, on a case-by-case basis, third-party services can cooperate with law enforcement and allow the search, as opposed to the current situation in which such searches are conducted unilaterally. Similarly, this approach can also prevent exploiting long-range familial searches to re-identify research subjects.
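The sign-and-verify flow described above can be illustrated with a minimal sketch. It is not the released code and does not reproduce its signature scheme: it uses toy textbook RSA over a SHA-256 digest, built from two publicly known Mersenne primes, so that it runs with the Python standard library alone. It offers no security and a real deployment would rely on a vetted cryptographic library. The header format `# signature: <hex>`, the function names, and the example genotype line are all hypothetical.

```python
import hashlib

# TOY key material (assumption, insecure): two publicly known Mersenne primes.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                               # public modulus
E = 65537                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))       # private exponent (Python 3.8+)

HEADER_PREFIX = "# signature: "         # hypothetical header format


def _digest(body: str) -> int:
    """SHA-256 digest of the file body as an integer (always smaller than N)."""
    return int.from_bytes(hashlib.sha256(body.encode("utf-8")).digest(), "big")


def sign_file(body: str) -> str:
    """Supplier side: prepend a signature header line to the raw genotype text."""
    signature = pow(_digest(body), D, N)
    return f"{HEADER_PREFIX}{signature:x}\n{body}"


def verify_file(text: str) -> bool:
    """Third-party side: check the header against the body using (N, E) only."""
    header, _, body = text.partition("\n")
    if not header.startswith(HEADER_PREFIX):
        return False
    try:
        signature = int(header[len(HEADER_PREFIX):], 16)
    except ValueError:
        return False
    return pow(signature, E, N) == _digest(body)


# Made-up two-line genotype file for illustration.
raw = "# rsid\tchromosome\tposition\tgenotype\nrs0000001\t1\t1000\tAG\n"
signed = sign_file(raw)
print(verify_file(signed))                      # True
print(verify_file(signed.replace("AG", "AA")))  # False: body was modified
print(verify_file(raw))                         # False: no signature header
```

In this sketch the third-party service needs only the public values `N` and `E`, which a supplier would publish at its known address. A file that fails `verify_file` would then be rejected, made findable but unable to initiate a search, or quarantined, as outlined above.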
To facilitate the adoption of our proposal, we provide source code (under the free MIT license) that can sign and verify the raw genotype files and relies on a well-known digital signature scheme23. Importantly, our software does not assert or recommend any list of suppliers. Any lab that produces raw genotype files is welcome to use this scheme, and any third-party provider should decide independently which list of suppliers they want to support. We believe that this technical approach, if adopted, can quickly mitigate some of the risks compared to legislation, which usually takes a considerable amount of time24.
The rise of long-range familial searches also has implications for human subject research. The U.S. Department of Health and Human Services has recently rejected proposals to include whole genome sequencing as identifiable information in the Revised Common Rule but implemented a mechanism to evaluate the scope of identifiable private information based on new technological developments25. The growing success of long-range familial searches shows that even simpler types of genotypic information, such as genome-wide genotyping arrays, can be used to identify individuals with high success rates. These rates will grow in the near future due to the immense interest in consumer genomics. These developments will further challenge the status quo regarding the identifiability of DNA data of human subjects and may require the development of new policy measures to further protect these datasets.
Conflict of interest statement
Y.E. and T.S. are MyHeritage employees. Y.E. is also a consultant of ArcBio. S.C. is a consultant of MyHeritage. I.P. holds equity in 23andMe. When multiple companies are mentioned in this manuscript, we listed them in lexicographic order.
Acknowledgments
We thank G. Japhet for his contributions to the cryptographic signature scheme and Y. Naveh for valuable comments. Y.E. holds a Burroughs Wellcome Fund Career Award at the Scientific Interface. Y.E. conceived the idea for this study. Y.E. and T.S. conducted the analysis of matches using the MyHeritage and the Geni.com data. S.C. and I.P. developed the theoretical framework to estimate the number of matches. Y.E. and T.S. adapted the code for the cryptographic signatures. Y.E., T.S., S.C., and I.P. wrote the manuscript. The code for the cryptographic signatures is available at https://github.com/erlichya/signature.