Variantkey: A Reversible Numerical Representation of Human Genetic Variants

Human genetic variants are usually represented by four values with variable length: chromosome, position, reference and alternate alleles. There is no guarantee that these components are represented in a consistent way across different data sources, and processing variant-based data can be inefficient because four different comparison operations are needed for each variant, three of which are string comparisons. Working with strings, in contrast to numbers, poses extra challenges on computer memory allocation and data-representation. Existing variant identifiers do not typically represent every possible variant we may be interested in, nor they are directly reversible. To overcome these limitations, VariantKey, a novel reversible numerical encoding schema for human genetic variants, is presented here alongside a multi-language open-source software implementation (http://github.com/genomicspls/variantkey). VariantKey represents variants as single 64 bit numeric entities, while preserving the ability to be searched and sorted by chromosome and position. The individual components of short variants can be directly read back from the VariantKey, while long variants are supported with a fast lookup table.


Introduction
There is no standard guideline for consistent representation of genetic variants. For a given genome assembly, different classes of genetic variants are typically represented in the Variant Call Format (VCF) [1] as a combination of four components: chromosome name, base position, reference allele and alternate allele(s). The alternate field can contain comma-separated strings for multiallelic variants but they can be decomposed into biallelic ones. Since there is no guarantee that the variant components are represented in a consistent way across different data sources, a normalization step is also required. Even with normalized variants, basic operations can be inefficient due to the necessity to perform four different comparison operations for each variant, three of which are string comparisons. Working with strings poses extra challenges on storage, memory allocation and data-representation due to their variable-length nature. Existing variant identifiers don't provide a solution to this general problem because they are typically assigned, they require an expensive lookup operation to be retrieved or reversed and they don't represent every possible variant we may be interested in. In particular, the dbSNP reference identifier (rs#, "rs ID" or "rs tag") [2] can't uniquely identify a variant in a given reference genome.
To overcome these limitations, VariantKey, a novel reversible numerical encoding schema for human genetic variants, is presented here. VariantKey represents variants as a single 64 bit numeric entity, while preserving the ability to be * https://www.genomicsplc.com  The individual components of short variants can be directly read  back from the VariantKey, while long variants are supported with a fast lookup table. A reference implementation of VariantKey and utility functions is provided as a multi-language software library (https://github.com/Genomicsplc/variantkey) released under the open source MIT license. For portability and performance, the core source code is written in header-only C programming language in a way that is also compatible with C++. Full wrappers for Go, Python, R and a basic implementation in Javascript are also provided. The source code is documented and fully covered by unit tests.

Human Genetic Variant Definition
A genetic variant is a difference from the reference DNA nucleotide sequence. In this context, the human genetic variant for a given genome assembly is defined as the set of four components compatible with the VCF format: • CHROMchromosome: An identifier from the reference genome. It only has 26 valid values: autosomes from 1 to 22, the sex chromosomes X=23 and Y=24, mitochondria MT=25 and a symbol NA=0 to indicate an invalid value.

Decomposition
In the Variant Call Format (VCF) the alternate field can contain comma-separated strings for multiallelic variants, while in this context we only consider biallelic variants to allow for allelic comparisons between different data sets. In VCF files the decomposition from multiallelic to biallelic variants can be performed using the vt software tool [4] with the command: The "-s" option (smart decomposition) splits up INFO and GENOTYPE fields that have number counts of R and A appropriately [5].

Normalization
A normalization step is required to ensure a consistent and unambiguous representation of variants. As shown in Figure 1, there are multiple ways to represent the same variant, but only one can be considered "normalized" as defined by [6]:

Normalization Function
Individual biallelic variants can be normalized using the normalize_variant function provided by the variantkey library. The normalize_variant function first checks if the reference allele matches the genome reference. The match is considered valid and consistent if there is a perfect letter-by-letter match, and valid but not consistent if one or more letter matches an equivalent one. Equivalent nucleotide letters are defined in Table 1.
If the reference allele is not valid, the normalize_variant function tries to find a reference match with one of the following variant transformations: • swap the reference and alternate alleles -sometimes it is not clear which one is the reference and which one is the alternate allele. • flip the alleles letters (use the complement letters) -sometimes the alleles refers to the other DNA strand.
• swap and flip.
Note that the swap and flip processes can lead to false positive cases, especially when considering Single Nucleotide Polymorphisms (SNP). The return code of the normalize_variant function can be used to discriminate or discard variants that are not consistent.
If the variant doesn't match the genome reference, then the original variant is returned with an error code.
If both alleles have length 1, the normalization is complete and the variant is returned. Otherwise, a custom implementation of the vt normalization algorithm [6] is applied: not T (V comes after T and U) A C G B N aNy base (not a gap) A C G T N Table 1: Single-letter nucleotide codes [9] • while break, do if any of the alelles is empty and the position is greater than zero, then * extend both alleles one letter to the left using the nucleotide in the corresponding genome reference position; else * if both alleles end with the same letter and they have length 2 or more, then · truncate the rightmost letter of each allele; * else · break (exit the while loop); • while both alleles start with the same letter and have length 2 or more, do truncate leftmost letter of each allele; The genome reference binary file (fasta.bin), used by the normalize_variant function, can be obtained from a FASTA file using the resources/tools/fastabin.sh script. This script extracts only the first 25 sequences for chromosomes 1 to 22, X, Y and MT. Precompiled versions can be downloaded from: https://sourceforge.net/ projects/variantkey/files/.

Normalized VariantKey
The variantkey library provides the normalized_variantkey function that returns the VariantKey of the normalized variant. This function should be used instead of the variantkey function if the input variant is not normalized.

POS Encoding
The reference position in the chromosome is encoded as unsigned integer number with the first base having position 0. This section is 28 bit long, so it can store up to 2 28 = 268, 435, 456 symbols, enough to contain the maximum position found in the largest human chromosome.

REF+ALT Encoding
The reference and alternate alleles are encoded in 31 bit.
This section allow two different type of encodings:

Non-reversible encoding
If the total number of nucleotides between REF and ALT is more then 11, or if any of the alleles contains nucleotide letters other than base A, C, G and T, then the LSB (least significant bit) is set to 1 and the remaining 30 bit are filled

Reversible encoding
If the total number of nucleotides between REF and ALT is 11 or less, and they only contain base letters A, C, G and T, then the LSB is set to 0 and the remaining 30 bit are used as follows: • The reversible encoding covers 99.635% of the variants in the normalized dbSNP VCF file GRCh37.p13.b150.

VariantKey Properties
• It can be encoded and decoded on-the-fly.
• Sorting by VariantKey is equivalent of sorting by CHROM and POS.
• The 64 bit VariantKey can be exported as a single 16 character hexadecimal string.
• Sorting the hexadecimal representation of VariantKey in alphabetical order is equivalent of sorting the Vari-antKey numerically. • Comparing two variants by VariantKey only requires comparing two 64 bit numbers, a very well optimized operation in current computer architectures. In contrast, comparing two normalized variants in VCF format requires comparing one numbers and three strings. • VariantKey can be used as a main database key to index data by "variant". This simplify common searching, merging and filtering operations.
• All types of database joins between two data sets (inner, left, right and full) can be easily performed using the VariantKey as index. • When CHROM, REF and ALT are the only strings in a table, replacing them with VariantKey allows to work with numeric only tables with obvious advantages. This also allows to represent the data in a compact binary format where each column uses a fixed number of bit or bytes, with the ability to perform a quick binary search on the first sorted column.

Example Application
A direct application of the VariantKey representation is the ability to be used in binary lookup table files.
The variantkey library provides tools to create and use binary lookup table files to convert rsID to VariantKey and vice-versa: • rsvk.bin Lookup table to retrieve VariantKey from rsID. This binary file can be generated by the resources/tools/rsvk.sh script from a TSV file. This can also be in Apache Arrow file format with a single RecordBatch, or Feather format. The first column must contain the rsID sorted in ascending order.

• vkrs.bin
Lookup table to retrieve rsID from VariantKey. This binary file can be generated by the resources/tools/vkrs.sh script from a TSV file. This can also be in Apache Arrow file format with a single RecordBatch, or Feather format. The first column must contain the VariantKey sorted in ascending order.
Software benchmarks on the variantkey library, performed using a laptop with 16GB of RAM and Intel core i7 7th gen CPU, shows that it is possible to retrieve from a file with 400M items, a VariantKey from rsID or viceversa in about 48ns, or more than 20M items per second.

Conclusions
Existing human genetic variant representation formats are not designed to maximize computational performance and do not provide a general solution to efficiently represent any arbitrary variant. VariantKey, a new reversible numerical encoding schema for human genetic variants is proposed here, with a publicly available implementation. This new computer-friendly format is expected to improve the performance of large-scale computation of variant-based genetic data.