Compression for population genetic data through finite-state entropy

We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of sample ordering in population-level genetic data. We identify conditional exchangeability of these data, recommending finite state entropy algorithms as an arithmetic code naturally suited to population genetic data. We show between 10% and 40% speed and size improvements over dictionary compression methods for population genetic data such as Zstd and Zlib in computation and and decompression tasks. We provide a prototype for genome-wide association study with finite state entropy compression demonstrating significant space saving and speed comparable to the state-of-the-art.


Introduction
A full description of the mapping from genotype to phenotype in humans will be a centerpiece of science, and will lead to an understanding of the human condition, and will lead to advances in medicine. Genome-wide association study (GWAS) is currently the main tool used to create, and revise draft versions of this genotype-to-phenotype mapping. The pursuit of such studies is a world-wide effort conducted and considered by labs in almost every university and also in biotechnology and pharmaceutical companies (Visscher et al., 2017). The computational resources required for GWAS are large, and so improving the efficiency of the manipulation and storage of genetic data and of conducting GWAS will increase the pace and accessibility of GWAS research, and also allow more efficient use of public and private funding.
To this end, researchers have made improvements to the compression of genetic data, including support for the cutting-edge Zlib (Gailly and Adler, 2004) and Zstd (Collet and Skibiński, 2019) libraries in the popular bgen and ped formats (Band and Marchini, 2018;Chang et al., 2015). This has allowed cheaper dissemination and storage of large-scale consortia such as the UK Biobank (Band and Marchini, 2018;Bycroft et al., 2018). The bgen and the new pgen formats have also been extended to provide variable or low bitrates for lossy compression of dosages for large genetic datasets (Band and Marchini, 2018;Chang et al., 2019).
In this work, we further improve the compression of population-level ge-  Figure 1: Computational efficiency of encoding a single variant simulated for one million subjects with a given minor allele frequency (MAF) using Zstd or finite state entropy (fse), with 120 replicates per condition. Top) Runtime of encoding in seconds (y-axis) and one standard deviation. Bottom) Size of encoding in megabytes (y-axis). Improved performance of fse over Zstd is demonstrated.
netic data using ideas stemming from recent advances in source coding (Shannon, 1948) including finite state entropy (Collet, 2019) and asymmetric codes (Duda, 2013). These recent advances are arithmetic codes (Huffman, 1952) for compressing streams of conditionally independent symbols and show improved performance in size and speed of population genetic data compression tasks.

Conditional independence in genetics
File formats for population level genetic data such as bgen and plink2 's ped and pgen formats store genotypes in compressed variant blocks (Chang et al., 2015), in which a bitstring representation of the genotypes for all subjects are compressed separately for each marker (i.e., they are stored in a variant-major form). This organisation is optimal for the purpose of fast GWAS. However, the samples collected in population-level genetic data are usually exchangeable, or partially exchangeable (Orbanz and Teh, 2010).
This exchangeability arises because genotypes for subjects are stored in a fixed order, and this order is identical for all variants. When a study is conducted on a homogeneous population, this fixed order does not provide information about the genotypes, leading to conditional independence among the genotypes at a fixed variant (i.e., exchangeability). The independence is conditioned on the minor allele frequency (a.k.a. the MAF), or other variant specific measures. If a pedigree, or a genetic similarity matrix (Patterson et al., 2006), or population indicators is also considered, then the genotypes are jointly exchangeable (i.e., the joint distribution on the genotypes is still exchangeable and invariant under permutation, as long as the permutation is also applied to the pedigree, the genetic similarity matrix, or the population indicators).
For stratified or related populations, often the fixed order still provides no information about the genotypes (the subjects may be presented in an order given by sorted and random subject identification numbers) But even if a heterogeneous sample is considered in which the order indicates subpopulation identity in a block structure (leading to partial exchangeability), conditional independence still exists among all subjects in each subpopulation, and the number of blocks in the structure is often small compared to the number of samples. These considerations imply that a bitstring representation of the ordered genotypes is well approximated by independent and identically distributed draws from a fixed law: Here X is the bitstring for a marker typed for N subjects and X i is the bitstring for subject i and L is a discrete law on floating point numbers or vectors of floating point numbers (the floating points are relevant for variants that include dosage information along with discrete genotypes). For example, if X involves hard-called genotypes with MAF p under the Hardy-Weinberg equilibrium (a.k.a. HWE), then X i is a bitstring representation of the set {0, 1, 2} and L is the law described by a discrete distribution taking the value 0 with probability (1 − p) 2 and 1 with probability 2p(1 − p) and 2 with probability p 2 (here the level of X i is the number of minor alleles). In other situations, such as for data in which dosages or genotype likelihoods or phase are recorded, L and the support of X i can be suitably modified to still present data in the form (1).
Population-level genetic file formats often use Zstd , Zlib or gzip or similar 'dictionary' style compression methods for compression of their variant blocks. These methods work by building a dictionary of strings that occur commonly in the uncompressed bitstring and then replacing those aspects of the bitstring with keys into a dictionary (Sayood, 2012). However, due to the exchangeability or partial exchangeability of bitstrings of the form (1), for a dictionary value to be viable, many of the permutations of the dictionary value around the boundaries of the bitstring representation of a single genotype must appear in the dataset. We call this the 'exploding dictionary' problem.
Here, when we say a dictionary value is viable we loosely mean that the value occurs often enough in the uncompressed data that its inclusion in the dictionary decreases the size of the compressed bitstring. Roughly speaking, the independence displayed in (1) reduces the extent to which dictionary values are viable. On the other hand, source coding theory (Shannon, 1948) is designed to describe and compress bitstrings that display independence such as (1). Our method, which we refer to as agent (arithmetic codes for genetic data), exploits this exchangeability and independence by specifying the compression for genotype blocks with such source coding theory.

Preliminary demonstration
In Figure 1 we consider simulation replicates involving half a million samples and 1 variant, with MAF varying from 0.01 to 0.5, and we assume HWE and a random subject ordering (i.e., full exchangeability). We use reference implementations for both fse and Zstd (Collet, 2019;Collet and Skibiński, 2019). We show improved performance in both space and runtime for the arithmetic code at all MAF levels. The largest runtime improvements are seen at high MAF (Figure 1 Top) and the largest size improvements are seen at low MAF ( Figure 1 Bottom). Notably, we find an apparently constant relationship between runtime and MAF in the fse arithmetic code. Theoretically, the runtime must be increasing with the size of the compressed variant block (even as it appears constant), as this block must be written to RAM (random access memory) or disc, and also the size of the finite state machine used by fse must also increase with MAF. But however, Figure 1 indicates that these increases may be amortized. The fse method improves upon Zstd in runtime by 8% to 34% and in size by 52% to 26%, as the MAF is varied.

Related work
The variant-major organisation of file formats in population-level genetics allows random access to variants, which is required for optimised GWAS and PHEWAS (phenotype-wide association study). In another line of recent work, researchers have used linkage between adjacent variants to decrease compression size using trees, and Burrows-Wheeler transformations (Adjeroh et al., 2002;Kelleher et al., 2018). We do not consider a comparison with these works, as they do not allow random access.
We also note that recent advances in compressing genomes ( which also involves arithmetic codes. The fse algorithm has also been used in this setting (Holmes, 2016). However, these works are not relevant for the context-free nature of the variants (such as single-nucleotide polymorphisms, or SNPs) considered in population-level genetic data. To our knowledge this manuscript is the first work to highlight the exchangeable or partially exchangeable nature of population genetic data and recognise that algorithmic codes (including fse) are most appropriate for compression of variant-major stores.
In Section 2, we review the fse and ans methods (summarizing Duda 2013 and Collet 2019). In Section 3, we provide runtime and filesize results for the agent tool on the Thousand Genomes Project (The 1000 Genomes Project Consortium, 2015) and simulated genomes.

Approach
Finite state entropy (fse) is a member of a family of general asymmetric numerical systems (ans) known as tabled asymmetric numeral systems (tans), described in Duda 2013. These methods combine the advantageous compression ratio of arithmetic coding with the favourable speed of Huffman coding. To fully understand the justification and advantages of fse and tans methods in general, we briefly summarize the developments leading to the state-of-the-art performance of these methods.

Tree-based coding and the integer bit problem
The original compression algorithm described in Shannon 1948 (Shannon-Fano coding) represents symbols by a prefix-free set of variable-length bitstrings.
The assignment of symbols to bitstrings is determined by constructing a binary tree from the top-down, attempting to evenly partition the current set of symbols according to total frequency into two subsets at each node of the tree. Improvements on this greedy tree-based coding led to Huffman coding.
The state-of-the-art in compression speed remains Huffman coding (Collet, 2019), which is described in (Huffman, 1952) with the binary tree constructed in a bottom-up manner which takes into consideration the frequency of each symbol. This non-greedy method of tree-building was found to be an improvement over Shannon-Fano coding, resulting in a closer, but yet still imperfect, approximation to Shannon entropy. Since this construction operates at the symbol level, Huffman coding schemes cannot break a one-bitper-symbol lower bound. Additionally, Huffman coding schemes require that a symbol be represented using an integer number of bits. Together, these limitations result in an imperfect approximation of the Shannon entropy, and therefore a limited compression ratio. While workarounds such as grouping multiple symbols together allows one to arbitrarily approach Shannon entropy, this results in a large alphabet size and thus a larger tree. These limitations apply in general to all symbol-level coding schemes, which can be considered as stateless (since there is no memory of previous symbols). Better compression ratios can only be obtained through the use of stateful or message-level coding schemes (Duda, 2013).

The dictionary problem in genetics
Dictionary compression methods are one approach towards stateful compression schemes. The Zlib (Gailly and Adler, 2004) and Zstd (Collet and Skibiński, 2019) methods used for the bgen or ped formats (Band and Marchini, 2018;Chang et al., 2015) are examples of dictionary methods. These methods generate the dictionary in an online fashion during the compression process, which decreases the compression rate. As the dictionary increases in size, queries into the dictionary become increasingly computationally intensive. While previously compiled dictionaries can be used in such methods to avoid online dictionary building, the compressed size of datasets which do not resemble those used to compile the dictionary will remain suboptimal, which is an issue given the diversity of genetic datasets. While dictionary methods provide a stateful compression scheme which operates above the symbol-level, these issues suggest the possibility of better performance from stateful and message-level coding schemes which do not use dictionaries. Indeed, Zstd itself uses fse to compress its dictionary, and the approach we adopt will simply omit the dictionary and use fse directly.

Arithmetic coding and the interval problem
Arithmetic coding methods are stateful, message-level, and dictionary-less coding schemes. Rather than assigning symbols to variable-length bitstrings, as in Huffman coding, or assigning sequences of symbols to bitstrings, as in dictionary methods, arithmetic coding methods assign entire messages to real numbers, with the goal of minimizing the length of the binary representation of the assigned number.
We briefly describe the high-level operation of a real-valued unit interval arithmetic encoder. Such an arithmetic coder maintains two state variables which represent a subinterval within the unit interval. Compression is achieved by a bijection between all possible messages and the real numbers within the unit interval, with messages of higher probability being assigned real numbers with shorter binary representations. Given an alphabet Σ, a symbol probability distribution P(s) defined over all symbols s ∈ Σ, and a message σ = σ 1 · · · σ n ∈ Σ * , such a real number with short binary representation is found by contracting the current subinterval over n iterations such that the size of the i + 1 subinterval (as a proportion of the immediately preceding

Solutions to the interval problem
Asymmetric numerical systems provide a speed improvement over arithmetic coding by reducing two real state variables into one integer state variable and by replacing the representation of symbols using subintervals with one using uniformly distributed subsets which partition the integers. This replacement of real intervals with integer partitions and elimination of a state variable preserves the advantageous compression ratio of arithmetic coding while operating at a higher speed.
The additional speed improvement of tans methods (Duda, 2013) such as fse (Collet, 2019) arises from the implementation of the compression algorithm as a table (hence the "tabled" in tans) recording the transitions of a finite-state machine (hence the "finite state" in fse). This is a computationally efficient technique which has also found ubiquitous usage in fast string matching algorithms, such as modern regular expression engines. The speed of fse approaches Huffman coding, yet without the compression ratio issues. For these reasons, fse was chosen to be the compression method of choice for agent over all of the aforementioned techniques.

Methods
We conduct a series of experiments to explore the speed and size of the

Experiment 1: Partially exchangeable data
We examine the speed and compression ratio of our agent method in the partially exchangeable case. We consider data from the Thousand Genomes are ordered in a block structure such that the subjects in each population are ordered contiguously (i.e., all subjects from the first population appear first in the ordering, followed by all subjects from the second population, and so on). For each of these two presentations, we consider a task in which an uncompressed bgen file is recompressed using fse or Zstd compression algorithms. In each case, all of the Thousand Genomes Project data is presented in a single bgen file (i.e., with all chromosomes concatenated).
So, for timings and compression sizes of bgen with fse, results are obtained by timing a version of qctool that has been modified to allow compression according to the fse algorithm.

Experiment 2: Larger studies
Considering the excellent performance of plink2 in terms of speed and size displayed in Experiment 1 (enumerated in Table 1), we note that our qctool +fse patch outperforms the reference qctool +Zstd code. The reduced performance of qctool +fse with respect to plink2 could be due to qctool overhead that is addressed by plink2 and also small sample size considerations (plink2  Figure 3: Performance of plink2 versus agent in a task in which between 10,000 and 80,000 simulated samples are considered. Simulation is based on extending the Thousand Genomes Project to create a biologically plausible sample of chromosome 22 on many more subjects using the hapgen software. The agent method outperforms plink2 for all conditions, but extrapolation of resulting file size indicates improved performance of plink2 under small sample sizes (as indicated by Table 1 columns 4 and 5).

Experiment 3: Compressing dosages
Modern GWAS often rely on genotype imputation methods in which variants that are not typed in a study sample are imputed through use of dense reference We consider a simulation study in which dosages are compressed. We simulate 490,000 haplotypes under the coalescent with recombination with 100,000 segregating sites and 1,000 recombination sites over a number of basepairs roughly equivalent to chromosome 22 using the ms software (Hudson, 2002). We use 10,000 of the haplotypes as a reference panel, and we combine the remaining 240,000 haplotypes into diploid genotypes forming a study panel. For details on study and reference panels in imputation we refer to Howie et al. 2009. We vary the amount of imputation over the four conditions, and for each condition we remove X% of the markers for X% of the samples in the study panel, and then impute the removed genotypes using the reference panel. We vary X over the range 20%, 40%, 60% and 80%. The same set of markers is removed for each selected sample leading to the blockwise study/reference paradigm. We conduct the imputation using the impute2 software. The impute2 software was run on 80 equally sized and partially overlapping chunks. A linkage disequilibrium plot is shown in Figure 2 for one of the chunks from this experiment.
In Figure 4, we consider a task in which chunks from the above described simulation are compressed with Zstd (using plink2 ) or with fse (using agent).
The size of the compressed dataset and the speed of the compression are improved by agent under all imputation conditions. We note that compression using arithmetic codes is particularly recommended for dosage data stored in fixed point numerical representations with low bit rates. This is due to the highly skewed distributions over the genotypes that tend to result from imputation, as illustrated in Appendix A (these histograms are found from the default example provided with the impute2 software). High skew allows arithmetic codes to efficiently assign codewords.

Experiment 4: Genome-wide association study
To demonstrate the advantages of the fse methods displayed in (Left) Ten independent restarts are considered for a dataset with 512 phenotypes. (Right) The number of phenotypes is varied. In each condition, plink2.00 is only slightly faster than agent. software.
We perform two examinations using 30/80 of the chunks for the simulations developed in Experiment 3. First, we consider 10 repeats for the 20% imputation condition and 512 phenotypes drawn independently from a standard Gaussian distribution (i.e.: N = 120, 000 samples, D = 512 phenotypes and M = 19, 922 of which 20% were imputed). We show box plots for the runtimes of the 10 repeats in Figure 5 (Left) and note that plink2 is only slightly faster than agent in this condition. In plink2, compression may not be done on dosages, which may explain the improved plink2 runtimes. Both pieces of software are optimized with Advanced Vector Extension (AVX2) instructions. Second, we consider the same genetic datasets and vary the number of phenotypes between 512 and 2048 (again, with independent standard Gaussian distributions). We perform one GWAS for each setting of the imputation proportion, yielding 5 runtimes for each method for each phenotype condition and these runtimes are plotted on Figure 5 (Right). This shows that the scalings of the runtime of agent and plink2.00 as the number of phenotypes is varied are similar.

Discussion
We reviewed existing compression algorithms and their theoretical advantages and disadvantages for population genetic data storage, including the state-ofthe-art ans and tans methods and fse compression. We noted the opposing compression ratio-speed tradeoffs in arithmetic coding versus Huffman coding and the ratio and speed issues with using existing dictionary methods such as Zlib and Zstd for population genetic data. We selected fse for its inheritance of the high compression ratio of arithmetic coding and the high compression speed of Huffman coding combined with the elimination of large and computationally costly dictionaries.
We compared compression time and compressed size for fse compression with existing methods of population genetic data storage over three experiments to test the effects of exchangeability, large number of samples and imputed dosages. In the fourth experiment we apply our agent tool using fse to a genome-wide association study and compare the time taken with the existing plink2 tool.
In Experiment 1, a modified qctool with fse support compresses faster and smaller than qctool 's existing Zstd implementation for both fully and partially exchangeable data. In Experiment 2, our tool agent using fse compresses faster and smaller than the existing plink2 tool on simulated chromosome 22 data with 1,055,452 variants over 20,000 to 80,000 samples (a typical dataset size in genetic analysis). In Experiment 3, agent compresses faster and smaller than plink2 on imputed dosages for 240,000 samples over imputation percentages of 20%, 40%, 60% and 80%.
In Experiment 4, we add a GWAS feature to agent and show that its speed is only slightly slower than plink2.00. Since fse compression on population genetic data is faster than dictionary methods (as shown in Experiments 1 to 3 and Figure 1), the speed improvement in plink2.00 is likely due to the high quality of the design of plink2.00 (i.e., it's likely that plink2.00 speed would be improved should fse be adopted by plink2.00 and also it's likely that agent's speed could be improved through improvements to the design). We also note that the relative speed between agent and plink2.00 We apply agent to a genome-wide association study in a fourth experiment.
With the recent rise of genome-wide association studies and the resulting needs for computational resources in both time and space domains, the improvements brought by agent will increase the speed and feasibility of population genome-wide association studies and reduce the cost of such studies for both public and private sector research. Ultimately, agent will assist in the speedy delivery of the results of such studies, which have realworld outcomes in the form of new medical treatments, diagnostics, and a better understanding of the human genetic code.