bioRxiv
New Results

PLIGHT: A tool to assess privacy risk by inferring identifying characteristics from sparse, noisy genotypes

Prashant S. Emani, Gamze Gürsoy, Andrew Miranker, Mark B. Gerstein
doi: https://doi.org/10.1101/2021.07.18.452853
Prashant S. Emani (1, 2)
Gamze Gürsoy (1, 2)
Andrew Miranker (2)
Mark B. Gerstein (1, 2, 3, 4); for correspondence: pi@gersteinlab.org
1 Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
2 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
3 Department of Computer Science, Yale University, New Haven, CT 06520, USA
4 Department of Statistics & Data Science, Yale University, New Haven, CT 06520, USA

Abstract

The leakage of identifying information in genetic and omics data has been established in many studies, with single nucleotide polymorphisms (SNPs) shown to carry a strong risk of reidentification for individuals and their genetic relatives. While the ability of thousands or hundreds of thousands of SNPs (especially rare ones) to identify individuals has been demonstrated, here we sought to measure the informativeness of even a sparse set of tens of noisy, common SNPs from an individual, putting the genotype-based privacy leakage from that individual on a quantitative footing. We present a computational tool, PLIGHT (“Privacy Leakage by Inference across Genotypic HMM Trajectories”), that employs a population-genetics-based Hidden Markov Model of recombination and mutation to find piecewise matches of a sparse query set of SNPs to a reference genotype panel. Given the ready availability of auxiliary sources of noisy genotype data (such as acquiring small samples of environmental DNA or learning about someone’s Mendelian diseases and physical characteristics), inference on sparse data becomes a genuine concern. We explore cases where query individuals are either known to be in databases or not, and consider both simulated “mosaics” of genotypes (i.e., genotypes stitched together from diploid segments sampled from two or more source individuals) and actual genotype data obtained from swabs of coffee cups used by a known individual. Our findings are as follows: (1) Even 10 common SNPs (minor allele frequency > 0.05) are often sufficient to identify individuals in conventional genomic databases. (2) We are able to identify first-order relatives (parents, children and siblings) of query individuals with 20-30 common SNPs. (3) We find some potential for leakage of phenotypic information, in a simulated attack that combines polygenic risk scores (PRSs) of the piecewise genotypic matches.
We also found, for simulated mosaics of two individuals, that 20 common SNPs were often sufficient to find the correct identities of both component individuals. Finally, applying PLIGHT to coffee-cup-derived SNPs, we find that our tool is able to identify the individual (when present in the reference database) using as few as 30 SNPs; alternatively, when the individual is not present in the reference database, we reconstruct possible genomes for the individual based on just 30-90 query SNPs by piecewise matching to the reference haplotype database. In this way, we are able to perform a small degree of imputation of unobserved query SNPs. Overall, the tool could be used to determine the value of selectively masking released SNPs, in a way that is agnostic to any explicit assumptions about underlying population membership or allele frequencies.
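
To make the idea of piecewise matching concrete, the following is a minimal sketch, not PLIGHT's implementation. It assumes a haploid simplification (the tool itself works on genotypes), a uniform switch probability standing in for recombination, and a single per-site error rate standing in for mutation and genotyping noise. The function name `viterbi_copying_path`, the parameter values, and the toy panel are all hypothetical and chosen only for illustration. The sketch runs the Viterbi algorithm over a copying HMM whose hidden state at each SNP is the reference haplotype being copied.

```python
import numpy as np


def viterbi_copying_path(query, panel, switch_prob=0.01, error_rate=0.01):
    """Most likely sequence of copied reference haplotypes (illustrative only).

    query: length-M sequence of observed alleles (0/1) at M sparse SNPs.
    panel: (N, M) array of alleles for N reference haplotypes at the same SNPs.
    switch_prob and error_rate are assumed values, not taken from the paper.
    """
    query = np.asarray(query)
    panel = np.asarray(panel)
    n_haps, n_snps = panel.shape

    # Emission: a copied allele matches the query unless an error occurred.
    log_emit = np.where(panel == query[None, :],
                        np.log(1.0 - error_rate), np.log(error_rate))

    # Transition: keep copying the same haplotype, or jump uniformly at random.
    log_stay = np.log(1.0 - switch_prob + switch_prob / n_haps)
    log_switch = np.log(switch_prob / n_haps)

    score = -np.log(n_haps) + log_emit[:, 0]  # uniform prior over haplotypes
    backptr = np.zeros((n_snps, n_haps), dtype=int)
    states = np.arange(n_haps)

    for j in range(1, n_snps):
        # The best predecessor for a jump is the same for every target state,
        # so each step costs O(N) rather than O(N^2).
        best_prev = int(np.argmax(score))
        via_switch = score[best_prev] + log_switch
        via_stay = score + log_stay
        use_stay = via_stay >= via_switch
        backptr[j] = np.where(use_stay, states, best_prev)
        score = np.where(use_stay, via_stay, via_switch) + log_emit[:, j]

    # Trace back from the best final state to recover the piecewise match.
    path = [int(np.argmax(score))]
    for j in range(n_snps - 1, 0, -1):
        path.append(int(backptr[j, path[-1]]))
    return path[::-1]


# Toy panel of three haplotypes at 12 SNPs (made-up data).
panel = np.array([
    [0] * 12,
    [1, 0] * 6,
    [1] * 12,
])
# A "mosaic" query: first half copied from haplotype 0, second half from 2.
query = [0] * 6 + [1] * 6
print(viterbi_copying_path(query, panel))
```

On the toy data, the returned path labels each SNP with the index of the reference haplotype it is most plausibly copied from, so a change of label along the path marks a boundary between mosaic segments. Extending the hidden state to pairs of haplotypes would bring the sketch closer to a genotype-level model, at the cost of a state space that grows quadratically with panel size.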

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/gersteinlab/PLIGHT

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted July 19, 2021.