RT Journal Article
SR Electronic
T1 SPDI: Data Model for Variants and Applications at NCBI
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 537449
DO 10.1101/537449
A1 J. Bradley Holmes
A1 Eric Moyer
A1 Lon Phan
A1 Donna Maglott
A1 Brandi L. Kattman
YR 2019
UL http://biorxiv.org/content/early/2019/01/31/537449.abstract
AB Motivation: Normalizing diverse representations of sequence variants is critical to the elucidation of the genetic basis of disease and biological function. NCBI has long wrestled with integrating data from multiple submitters to build databases such as dbSNP and ClinVar. Inconsistent representation of variants among variant callers, local databases, and tools results in discrepancies and duplications that complicate analysis. Current tools are not robust enough to manage variants in different formats and different reference sequence coordinates. Results: The SPDI (pronounced “speedy”) data model defines variants as a sequence of four operations: start at the boundary before the first position in the sequence S, advance P positions, delete D positions, then insert the sequence in the string I, giving the data model its name, SPDI. The SPDI model can thus be applied to both nucleotide and protein variants, but the services discussed here are limited to nucleotide variants. Current services convert representations between HGVS, VCF, and SPDI and provide two forms of normalization. The first, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the “Contextual Allele” for any input. The SPDI name, with its four operations, defines exactly the reference subsequence potentially affected by the variant, even in low-complexity regions such as homopolymer and dinucleotide sequence repeats. The second level of normalization depends on an alignment dataset (ADS). SPDI services perform remapping (also known as lift-over) of variants from the input reference sequence to return a list of all equivalent Contextual Alleles based on the transcript or genomic sequences that were aligned. One of these Contextual Alleles is selected to represent all of them, usually the one based on the latest genomic assembly, such as GRCh38, and is designated as the unique “Canonical Allele”.
ADS includes alignments between non-assembly RefSeq sequences (prefixed NM, NR, NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), and this allows for robust remapping and normalization of variants across sequences and assembly versions. Availability and implementation: The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0/ Contact: holmesbr{at}ncbi.nlm.nih.gov
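The four SPDI operations described in the abstract can be illustrated with a minimal Python sketch. It is not the NCBI implementation or service API: the `Spdi` class, the `apply_spdi` function, the `toy_seq` identifier, and the eight-base reference are all hypothetical, and the sketch assumes that P counts from 0 at the boundary before the first base and that D is given as a number of positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    """Hypothetical container for the four SPDI fields (illustration only)."""
    sequence_id: str      # S: identifier of the reference sequence
    position: int         # P: positions to advance from the start boundary
    deletion_length: int  # D: number of reference positions to delete
    insertion: str        # I: sequence to insert at that point


def apply_spdi(reference: str, variant: Spdi) -> str:
    """Apply the advance / delete / insert operations to a reference string."""
    start = variant.position
    end = start + variant.deletion_length
    if start < 0 or variant.deletion_length < 0 or end > len(reference):
        raise ValueError("variant does not fit on the reference sequence")
    # Keep everything before the boundary, insert I, skip the D deleted bases.
    return reference[:start] + variant.insertion + reference[end:]


# Toy reference with a TTT homopolymer at positions 3..5 (assumed data).
reference = "ACGTTTGA"

substitution = apply_spdi(reference, Spdi("toy_seq", 2, 1, "A"))
insertion = apply_spdi(reference, Spdi("toy_seq", 3, 0, "T"))

# Deleting any one T of the homopolymer yields the same resulting sequence.
deletions = {apply_spdi(reference, Spdi("toy_seq", p, 1, "")) for p in (3, 4, 5)}
```

In this toy case the three single-base deletions at positions 3, 4, and 5 all produce the same sequence, which is the kind of ambiguity in repeat regions that the normalization described in the abstract is meant to resolve into a single representation.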