Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

SPDI: Data Model for Variants and Applications at NCBI

J. Bradley Holmes, Eric Moyer, View ORCID ProfileLon Phan, Donna Maglott, Brandi L. Kattman
doi: https://doi.org/10.1101/537449
J. Bradley Holmes
1National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Eric Moyer
1National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Lon Phan
1National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Lon Phan
Donna Maglott
1National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Brandi L. Kattman
1National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Preview PDF
Loading

Abstract

Motivation Normalizing diverse representations of sequence variants is critical to the elucidation of the genetic basis of disease and biological function. NCBI has long wrestled with integrating data from multiple submitters to build databases such as dbSNP and ClinVar. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies and duplications that complicate analysis. Current tools are not robust enough to manage variants in different formats and different reference sequence coordinates.

Results The SPDI (pronounced “speedy”) data model defines variants as a sequence of 4 operations: start at the boundary before the first position in the sequence S, advance P positions, delete D positions, then insert the sequence in the string I, giving the data model its name, SPDI. The SPDI model can thus be applied to both nucleotide and protein variants, but the services discussed here are limited to nucleotide. Current services convert representations between HGVS, VCF, and SPDI and provide two forms of normalization. The first, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the “Contextual Allele” for any input. The SPDI name, with its four operations, defines exactly the reference subsequence potentially affected by the variant, even in low complexity regions such as homopolymer and dinucleotide sequence repeats. The second level of normalization depends on an alignment dataset (ADS). SPDI services perform remapping (AKA lift-over) of variants from the input reference sequence to return a list of all equivalent Contextual Alleles based on the transcript or genomic sequences that were aligned. One of these contextual alleles is selected to represent all, usually that based on the latest genomic assembly such as GRCh38 and is designated as the unique “Canonical Allele”. ADS includes alignments between non-assembly RefSeq sequences (prefixed NM, NR, NG), as well inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs) and this allow for robust remapping and normalization of variants across sequences and assembly versions.

Availability and implementation The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0/

Contact holmesbr{at}ncbi.nlm.nih.gov

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.
Back to top
PreviousNext
Posted January 31, 2019.
Download PDF
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
SPDI: Data Model for Variants and Applications at NCBI
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
SPDI: Data Model for Variants and Applications at NCBI
J. Bradley Holmes, Eric Moyer, Lon Phan, Donna Maglott, Brandi L. Kattman
bioRxiv 537449; doi: https://doi.org/10.1101/537449
Digg logo Reddit logo Twitter logo Facebook logo Google logo LinkedIn logo Mendeley logo
Citation Tools
SPDI: Data Model for Variants and Applications at NCBI
J. Bradley Holmes, Eric Moyer, Lon Phan, Donna Maglott, Brandi L. Kattman
bioRxiv 537449; doi: https://doi.org/10.1101/537449

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Genomics
Subject Areas
All Articles
  • Animal Behavior and Cognition (3479)
  • Biochemistry (7318)
  • Bioengineering (5296)
  • Bioinformatics (20197)
  • Biophysics (9976)
  • Cancer Biology (7703)
  • Cell Biology (11250)
  • Clinical Trials (138)
  • Developmental Biology (6418)
  • Ecology (9916)
  • Epidemiology (2065)
  • Evolutionary Biology (13280)
  • Genetics (9352)
  • Genomics (12554)
  • Immunology (7674)
  • Microbiology (18939)
  • Molecular Biology (7417)
  • Neuroscience (40893)
  • Paleontology (298)
  • Pathology (1226)
  • Pharmacology and Toxicology (2126)
  • Physiology (3140)
  • Plant Biology (6838)
  • Scientific Communication and Education (1270)
  • Synthetic Biology (1891)
  • Systems Biology (5296)
  • Zoology (1085)