VCFdbR: A method for expressing biobank-scale Variant Call Format data in a SQLite database using R

As exome and whole-genome sequencing cohorts grow in size, the data they produce strains the limits of current tools and data structures. The Variant Call Format (VCF) was originally created as part of the 1,000 Genomes project. Flexible and concise enough to describe the genetic variations of thousands of samples in a single flat file, the VCF has become the standard for communicating the results of large-scale sequencing experiments. Because of its static and text-based structure, VCFs remain cumbersome to parse and filter in an interactive way, even with the aid of indexing. Iterating on previous concepts, we propose here a pipeline for converting VCFs to simple SQLite databases, which allow for rapid searching and filtering of genetic variants while minimizing memory overhead. Code can be found at https://github.com/tkoomar/VCFdbR

The Variant Call Format is a well-specified and flexible way to describe genetic variants 2 identified by sequencing, along with annotations describing those variants. These 3 annotations are a particularly powerful feature of the VCF, and set it apart from more 4 efficient formats of expressing genetic variants, like PLINK and BGEN. Unfortunately, 5 despite advances in processing VCFs [1] as a flat text file, searching and filtering 6 variants in a VCF based on its annotations is slow and cumbersome. 7 When studying genetics of larger cohorts, it is often necessary to begin with 8 exploratory analyses -which require rapid iteration and changes to filtering schemes. 9 Data from these studies are often so large that they cannot be read directly into RAM, 10 and writing subsets of variants to a hard disk repeatedly is slow and introduces 11 overhead that does not scale well. Therefore, faster methods of identifying variants of 12 interest from sequencing data are needed, such as databases or datastores.

13
For example, the GenomicsDB format, a datastore developed by GATK team and 14 Intel [3], is fast, scaleable and efficient. However, it is currently only widely used for the 15 merging of gVCFs produced by GATK, and much work is still needed to make it a 16 standard that can replace the VCF. The Hail data-analysis library [8] does a good job 17 of implementing GenomicsDB, but attempts to be an end-to-end solution. This requires 18 learning its own interface and is not compatible with the tools many computational 19 geneticists already use -particularly R [7] and Bioconductor [2] packages.

20
The Gemini database [6] is a somewhat more flexible alternative that takes the 21 approach of converting a VCF to an SQLite database that can be manipulated with any 22 1/5 suitable interface. SQLite is a single-file database that does not require a server, making 23 it suitable for working on high performance computing clusters that are available and 24 widely used by computational geneticists in academic institutions. However, Gemini 25 compresses genotypes to save space, which slows down queries. Users can opt not to 26 compress genotypes with Gemini, but this creates 1 column per sample (i.e. 'wide' 27 format), which can rapidly exceed the column limit of an SQLite database. Additionally, 28 this is not scaleable to very large cohorts which might result in databases which violate 29 file system limits on the size of single file.

30
Many researchers would benefit from a method for converting a VCF to an SQLite 31 database that is suitable for large cohorts. This requires genotypes to be stored in the 32 more database-friendly 'long' format as well as an option to save genotypes to external 33 files to avoid hitting single-file size limits. Such a method should also avoid operations 34 that require writing intermediate files to disk and the requirement to learn bespoke 35 interfaces. The R language is widely used by computational geneticists, and has the 36 excellent general-purpose SQL database interface provided by the dbplyr package [9]. 37 Additionally, R is under-served in this realm, as both Gemini and Hail utilize 38 Python-based interfaces. For these reasons, we propose a relatively simple pipeline for 39 converting a VCF to an SQLite database, which we call VCFdbR.

40
The VCFdb Table Schema   41 Overview 42 SQLite is a server-less, single-file relational database composed of one or more 'tables', 43 which may be linked by shared 'keys'. In a VCFdb created by VCFdbR, this 'key' is a 44 integer that uniquely identifies each variant in the VCF. Specific columns within a given 45 SQLite table may also be indexed, so that they may be searched rapidly.      Additional some processing time.

116
The header of the VCF is then parsed and written to the miscellaneous tables of the 117 database. For each chunk, the VCF's INFO field is then parsed and searched for a CSQ 118 4/5 field. Then, the genotypes are parsed, converting them to 'long' format. Once the data 119 INFO, CSQ and FORMAT are fully parsed, they are appended to the corresponding 120 database tables (variant_info, variant_impact and variant_geno, respectively). If 121 operating in File-GT mode, the parsed FORMAT fields are written to the file system 122 instead of the database.

123
After all VCF chunks have been processed, the database tables are indexed.

124
Indexing always includes the variant_id, and new indexes can be added by the user at 125 any time. Finally, the GenomicRanges representation of the database is written to disk 126 and the gene_map table is created (if the input VCF had VEP annotations).