Vcflib and tools for processing the VCF variant call format

Since its introduction in 2011 the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies — as well as in somatic and germline mutation studies. VCF can present single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called against a reference genome. Here we present over 125 useful and much used free and open source software tools and libraries, part of vcflib tools and bio-vcf. We also highlight cyvcf2, hts-nim and slivar tools. Application is typically in the comparison, filtering, normalisation, smoothing, annotation, statistics, visualisation and exporting of variants. Our tools run daily and invisibly in pipelines and countless shell scripts. Our tools are part of a wider bioinformatics ecosystem and we consider it very important to make these tools available as free and open source software to all bioinformaticians so they can be deployed through software distributions, such as Debian, GNU Guix and Bioconda. vcflib, for example, was installed over 40,000 times and bio-vcf was installed over 15,000 times through Bioconda by December 2020. We shortly discuss the design of VCF, lessons learnt, and how we can address more complex variation that can not easily be represented by the VCF format. All source code is published under free and open source software licenses and can be downloaded and installed from https://github.com/vcflib. Author summary Most bioinformatics workflows deal with DNA/RNA variations that are typically represented in the variant call format (VCF) — a file format that describes mutations (SNP and MNP), insertions and deletions (INDEL) against a reference genome. Here we present a wide range of free and open source software tools that are used in biomedical sequencing workflows around the world today.

From its introduction in 2011 the VCF variant call format has become pervasive in 2 bioinformatics sequencing workflows [1][2][3]. VCF is one of the important file formats in analysis following the Unix philosophy of small utilities as explained in the 'small tools 22 manifesto' [9]. Development of these tools was often driven by the need to transform 23 VCF into other formats, to digest information, to address quality control, and to 24 compute statistics. The vcflib toolkit contains both a library and collection of 25 executable programs for transforming VCF files consisting of over 30,000 lines of 26 source code written in the C ++ programming language for performance. vcflib also 27 comes with a toolkit for population genetics: the Genotype Phenotype Association 28 Toolkit (GPAT). Even though vcflib was not previously published, Google scholar 29 shows 132 citations for 2020 alone -pointing to the github source code repository. 30 Next to vcflib, we present bio-vcf, a parallelized domain-specific language (DSL) 31 for convenient querying and transforming VCF; and we discuss cyvcf2 [10], 32 slivar [11], and hts-nim [12] as useful tools and libraries for VCF processing. 33 Tools and libraries 34 2.1 vcflib C ++ tools and libraries 35 At its core, vcflib provides C ++ tools and a library application programmers 36 interface (API) to plain text and compressed VCF files. A collection of 83 command 37 line utilities is provided, as well as 44 scripts (Table 1 lists a selection). Most of these 38 tools are designed to be strung together: piping the output of one program into the 39 next, thereby preventing the creation of intermediate files, parallelize processing, and 40 reducing the number IO operations. For example, the vcflib vcfjoincalls script 41 includes the following pipeline (where vt is a variant normalization tool [13]; see 42 Table 1 for the other individual steps by vcflib): 43 single nucleotide variant (SNV), an insertion, a deletion or a structural variant with 48 rich annotation [1]. In a VCF line, fields are separated by the TAB character. Fields subsubfields and so on: effectively projecting a 'tree' datastructure onto a single line.

54
The advantage is that it is easy to view a VCF file and it is almost trivial to write a 55 basic VCF parser and it is easy to add information to VCF, sometimes leading to 56 unwieldy nested annotations. An evolving VCF 'standard' is tracked by the 57 samtools/htslib project [17] and later amendments are particularly focused on more  The vcflib API describes a class vcflib::VariantCallFile to manage the 67 reading of VCF files, and vcflib::Variant to describe the information contained in a 68 single VCF record. The API provides iterators that are used in every included tool.

69
For every record the tree-type hierarchy ( Fig. 1   Object Notation (JSON) using bio-vcf -(a) the line-based VCF record makes use of separators to split tab-delimited fields into subfields. Subfields are split with characters , =:; / and so on. This splitting effectively projects a 'tree-like' datastructure that can also be represented as (b) a JSON record. JSON is used as a common data exchange format for databases and web-services. This example was generated with (c) the bio-vcf tool using a template [16]. bio-vcf transform data to any textual format, including RDF, HTML, XML etc. See also the bio-vcf section widely used descriptive statistics in population and evolutionary genetics (Fig 2). Fst 96 is defined as the correlation between randomly sampled gametes relative to the total 97 drawn from the same subpopulation [19]. wcFst is Weir & Cockerham's Fst for two 98 populations [20]. pFst is a probabilistic approach for detecting differences in allele 99 frequencies between two populations and bFst is a Bayesian approach. bFst accounts 100 for genotype uncertainty in the model using genotype likelihoods. For a more detailed 101 description see [21]. The likelihood function has been modified to use genotype pVst calculates Vst to test the difference in copy numbers at each SV between two where Vt is the overall variance of copy number and Vs  Perl/Ruby-style regular expression or regex [24]), and where the tumor bcount of the 142 ALT allele is larger than 4. The letter 'r' represents a record or line in a VCF file and 143 the letter 's' stands for each sample in a record.

144
The naming of variables, such as s.dp and r.tumor.bcount, is inferred from the 145 VCF file itself, so if a VCF has different naming conventions they are picked up 146 automatically.

147
bio-vcf typically reads from the terminal STDIN and writes to STDOUT.

148
The following full command line invocation reads VCF files and filters for chromosomes 1-9 where the quality (r.qual) is larger than 50. It also checks for non-empty samples where the sample read depth is larger than 20. For each selected record with --eval it outputs a BED record (the default output is the VCF record itself, useful for filtering):

183
Similar to bio-vcf, slivar allows users to specify simple expressions for filtering and 184 annotation [11]. Whereas bio-vcf uses Ruby to supply the DSL, slivar uses 185 Javascript. slivar has built-in pedigree support for the samples so that, for example, 186 a single expression can be applied to every trio (mother, father, child) to identify de 187 novo variants:  biopython [25], bioperl [26], bioruby [27] and R's CRAN [28], contain VCF parsers 201 that may be useful. But a first point of call may be vcflib itself as it is also a C ++ programming library and in addition to being an integral part of the vcflib tools 203 mentioned here is used by, for example, the freebayes variant caller [5].

204
Of particular interest is the fast cyvcf2 library that was started in 2016 with 205 htslib [17] bindings and is actively maintained by co-author Pedersen today [10].

206
Similar to bio-vcf it presents a DSL-type language that can be used in Python 207 programming. Meanwhile hts-nim [12] contains bindings for the Nim programming 208 language with similar syntax and functionality. For example a nim bin counter (part 209 of hts-nim-tools/vcf-check [12]) can be written as: Nim is a statically typed compiled language that looks very similar to Python and, 220 because it transpiles to C, Nim has a much faster runtime and can link without 221 overheads against C libraries such as vcflib and htslib.

222
Finally, another tool of interest, by the same author, is vcfanno; written in the Go 223 language and allows annotations of a VCF with any number of INFO fields from any 224 number of VCFs or BED files. vcfanno uses a simple conf file to allow the user to 225 specify the source annotation files and fields and how they will be added to the info of 226 the query VCF [29].

228
Ten years is a long time in bioinformatics and the VCF file format is starting to show 229 its age. Not only is the VCF format redundant and bloated with duplication of data, a 230 more important concern is that the VCF format does not accommodate interesting 231 complex genomic variations, such as complex and nested variants, such as 232 superbubbles, ultrabubbles, and cacti [30][31][32]. An even more important shortcoming of 233 VCF is that it always depends on a single reference genome, resulting in variant calling 234 bias and missing out on variation not represented in the reference [32]. One solution is 235 to work with multiple reference genomes, but comparing VCF files from different 236 reference genomes is challenging -even for different versions of one reference genome. 237 To address such challenges the authors are actively working on pangenome 238 approaches that store variation in a pangenome graph format, e.g. [32][33][34][35][36][37].

239
Pangenomes can incorporate multiple individuals and multiple reference genomes. analysis, such as population genotyping, more powerful with improved results, e.g. [32]. 250 The VCF file format has become a crucial part of almost all sequencing workflows 251 today. The design and presentation of the VCF file format can set the norm for 252 designing future file formats [2,3], but we can also learn from its mistakes. In this paper we wrote VCF 'standard' consistently between quotes because, even though 254 there exists a standardization effort -now at VCFv4.3 [3] -VCF is flexible by 255 design, alternative VCF standards are introduced (e.g. [38]) and most tools take 256 liberties when it comes to producing VCF files. Therefore all VCF parsers have to 257 take a flexible approach towards digesting input data and ignore input data that is not 258 understood. 259 We recognise that the success of a file format requires a crucial focus on having an 260 early 'standard' that is both easy to understand and flexible enough to grow, in line 261 with the success of other bioinformatics file formats, such as SAM/BAM [3] and 262 GFF [39]. Biology is a fast moving field and it is impossible to predict how a file 263 format is going to be used in a (near) future. The downside of such flexibility is that 264 older software may not support features that were added later. One of the weakest 265 aspects of the VCF format is its metadata: next to ad hoc metadata in records (see 266 Fig.1), the header record requires specialized parsing and ignores existing ontologies.

267
Also for the VCF records, robust validation, error checking and correctness 268 checking is virtually impossible. Great attention should therefore be paid to any 269 amendments to an earlier standard to keep backward compatibility when possible.

270
VCF and many other formats in bioinformatics use layered character separators as a 271 grammar for defining a tree structure of data (see Fig. 1). programming languages, such as cyvcf2 and hts-nim, as well as vcflib and bio-vcf. 287 A wide range of solutions exist for VCF processing that make use of these three 288 approaches and functional overlap is found between vcflib, bio-vcf, cyvcf2, the 289 original vcftools [2], bcftools [7] and the existing Bio* programming libraries, such as 290 biopython [25], bioruby [27] and biojava [41]. vcftools and bcftools provide annotation, 291 merging, normalization and filtering capabilities that complement functionality and 292 can be combined in workflows with vcflib and bio-vcf. These solutions together 293 provide a comprehensive and scalable way of dealing with VCF data and every single 294 tool represents a significant investment in research and software development.

295
Therefore, before writing a new parser from scratch, we strongly suggest to first study 296 the existing solutions. In the rare case a new tool is required it may be an idea to 297 merge that with existing projects so everyone can benefit.

298
Once software is written, it is important software development and maintenance 299 continues. In the biomedical sciences it is a clear risk for projects to get abandoned 300 once the original author moves on to another job or other interests; partly due to a 301 lack of scientific recognition, attribution and reward [42]. We note that with the pyvcf 302 project, for example, this has happened twice and the github contribution tracker 303 shows no more contributions by a project owner. This means no one is merging 304 changes back into the main code repository and the code is essentially unmaintained.

305
vcflib, bio-vcf and cyvcf2, in contrast, show a continuous adoption of code 306 contributions thanks to the original authors encouraging others to take ownership and 307 even release versions of the software. We also recognise the importance of creating 308 small tools that can interact with each other following the Unix philosophy.

309
For overall adoption of software solutions it is important the tools and 310 documentation get packaged by software distributions, such as Bioconda [43],

311
Debian [44] and GNU Guix [45,46]. Bioconda downloads are a good estimation of 312 relative popularity because they tend to represent actual installations. vcflib, for pangenomes and reference based approaches we keep updating our tools and libraries. 319 We intend to add more documentation and regression tests. One recent innovation in 320 vcflib is the generation of Unix 'man' pages from markdown pages and the --help 321 output from every individual tool and script that also doubles as regression tests. This 322 documentation is always up to date because when it goes out of sync the embedded 323 tests fail. We think it will also be useful to add tool descriptions in the common 324 workflow language (CWL) format [46][47][48]. CWL definitions allow easy sharing of tool 325 components between sequencing workflows. The scenario will be to write a CWL tool 326 definition that can be converted to documentation and running tests. upstream and downstream analysis workflows. We predict that pangenome approaches 332 will play an increasingly important role in sequence analysis and, at the same time,

333
VCF processing tools will remain in sequencing workflows for the forseeable future.