DiGeST: Distributed Computing for Scalable Gene and Variant Ranking with Hadoop/Spark

Background The advent of next-generation sequencing technologies has opened new avenues for clinical genomics research. In particular, as sequencing costs continue to decrease, an ever-growing number of clinical genomics institutes now rely on DNA sequencing studies at varying scales - genome, exome, mendeliome - for uncovering disease-associated variants or genes, in both rare and non-rare diseases. A common methodology for identifying such variants or genes is to rely on genetic association studies (GAS), that test whether allele or genotype frequencies differ between two groups of individuals, usually diseased subjects and healthy controls. Current bioinformatics tools for performing GAS are designed to run on standalone machines, and do not scale well with the increasing size of study designs and the search for multi-locus genetic associations. More efficient distributed and scalable data analysis solutions are needed to address this challenge. Results We developed a Big Data solution stack for distributing computations in genetic association studies, that address both single and multi-locus associations. The proposed stack, called DiGeST (Distributed Gene/variant Scoring Tool) is divided in two main components: a Hadoop/Spark high-performance computing back-end for efficient data storage and distributed computing, and a Web front-end providing users with a rich set of options to filter, compare and explore exome data from different sample populations. Using exome data from the 1000 Genomes Project, we show that our distributed implementation smoothly scales with computing resources. We make the resulting software stack Open-Source, and provide virtualisation scripts to run the complete environment both on standalone machine or Hadoop-based cluster. Conclusions Hadoop/Spark provides a powerful and well-suited distributed computing framework for genetic association studies. Our work illustrates the flexibility, ease of use and scalability of the framework, and more generally advocates for its wider adoption in bioinformatics pipelines.

from the 1000 Genomes Project, we show that our distributed implementation smoothly scales with computing resources. We make the resulting software stack Open-Source, and provide virtualisation scripts to run the complete environment both on standalone machine or Hadoop-based cluster. Conclusions: Hadoop/Spark provides a powerful and well-suited distributed computing framework for genetic association studies. Our work illustrates the flexibility, ease of use and scalability of the framework, and more generally advocates for its wider adoption in bioinformatics pipelines.

Background
The goal of most DNA sequencing studies is to identify causal single-nucleotide variations (SNVs) in patients with Mendelian diseases, and more recently combinations of variants in digenic or oligogenic diseases [6,17,15]. There however exists millions of mutations in any individual genome, and identifying which ones are disease-causing remains a largely open problem. A common approach to tackle this issue is to rely on Genetic Association Studies (GAS), which are based on the principle that genotypes can be compared directly, using casecontrol designs [31,34,36,10].
Case-control designs for GAS generally involve a set of M variants v i , 1 ≤ i ≤ M , and a set of N samples s j , 1 ≤ j ≤ N . Samples are divided in two groups G 0 and G 1 , for the control and case groups, respectively. Then, an M -by-N genotype matrix is expressed as G = {z i,j } 1≤i≤M,1≤j≤N where z i,j ∈ {0, 1, 2} is the zygosity of sample s j for variant v i , and the values {0, 1, 2} code for homozygous reference, heterozygous, and homozygous alternative, respectively. An example of genotype matrix is given in Fig. 1 for a set of 7 samples and 5 variants over 2 genes. Given a genotype matrix, a GAS aims at finding the sequencing regions that most differ between the case and control groups. This is generally a two step process where, for any given region of interest (for example, variant, gene, or set of genes), the difference between the case and control genotypic data is first quantified by a score and some statistical measure of significance. The scores are then ordered by decreasing order of importance, so that genomic regions of highest difference between the two groups are identified [31,32,4,65].
The standard scoring approach in GAS consists in summing, for each variant, the allele counts for both the case and control groups, and to compute a standard Fisher's test or χ 2 statistic for assessing the significance of the difference [31]. Many variations of this baseline approach have been designed. In particular, methods such as CAST [48], CMC [35], WSS [40], KBAC [37], SKAT [70], RareCover (RC) [2] and others [57,50,49,22] have been proposed to 'collapse' or aggregate sets of variants within a region, which is usually a gene, in order to increase the statistical power of the tests. Other efforts have been targeted at further extending traditional GAS to gene-gene interaction studies (Genetic Interaction Studies -GIS) [4,41,9,66,45,53,39], in order to address phenomena such as epistasis, now widely accepted as an important contributor to genetic variation in complex diseases.
Score statistics computations often require substantial computational resources. This computational burden becomes especially significant when statistical tests rely on permutations, or when interactions between loci (GIS) are involved. As an example, computing all pairwise scores for a set of one million variants requires on the order of a trillion comparisons. The standard tools for GAS (PLINK [58], VariantTools [67], SnpSift [5], R Bioconductor [16] or AssotesteR [23], gNOME [31]) are however designed to run on standalone machines, and do not scale well as the size and complexity of genetic study designs increase.
Few solutions have been designed to address this challenge, that mostly rely on dedicated hardware devices such as Graphical Processing Units (GPUs) [72,71,25,18], or Field-Programmable Gate Array (FPGAs) [68,20]. While these solutions greatly speed up computation times, their use is in practice hindered by the need to acquire specialised and expensive hardware, whose programming is based on low-level and difficult to debug programming languages.
In this paper, we argue that the recently developed Spark cluster computing framework, coupled with a Hadoop cluster back-end, provides a well-suited distributed environment for performing large scale genetic association studies. Besides providing a fast and scalable computing framework, Spark features a rich and high-level application programming interface (API) particularly wellsuited for manipulating data such as variant data. On the other hand, the Hadoop framework is designed to operate on off-the-shelf computer hardware, provides a distributed and fault tolerant data storage back-end (Hadoop Distributed File System -HDFS), and a robust resource management platform for parallelising Spark jobs.
The contributions of this paper are several. First, we show that Spark can be used to efficiently extract variant data from a distributed database using standard SQL syntax, and to generate genotype matrices as distributed collec-tions of objects. We then detail how scoring can be performed in a distributed way, for genomic regions ranging from variant to genes to pairs or variants or genes. Third, we experimentally show that the proposed workflow, called Di-GeST (Distributed Gene/Variant Scoring tool), efficiently scales to case control designs involving thousands of samples and billions of scorings, and effectively distributes computations on the available computing resources. Finally, we provide a Web based front end to interact with DiGeST, allowing a user to easily define filtering criteria, create control and case groups, and explore scoring results. All the tools and data are open source and available from [30].
The paper is structured as follows. We first outline the DiGeST workflow, and describe its main components: data filtering, genotype matrix creation, and scoring. We then present the command line and Web interfaces to run DiGeST. Finally, we illustrate its scalability on datasets involving millions of variants and thousands of samples, and discuss the benefits and limits of Hadoop/Spark for genetic association studies.

Implementation
DiGeST performs scoring in two stages. First, variants of interest are extracted (filtered) from a variant database, and grouped in an intermediary genotype matrix whose entries are the genotypes of all filtered variant/sample pairs. Second, scoring is applied on the genotype matrix, which returns a sorted scoring matrix containing, for each region of interest (variant, gene, pair of variants/genes), the score together with additional statistics (group scores, p-values, ...). Both stages are run in a distributed way thanks to the Spark/Hadoop computing frameworks. The workflow is summarised in Fig. 2, and detailed in the following sections.

Variant database
The variant database serves as input to the DiGeST workflow, and stores variant data as a single flat table similar to the standard VCF (Variant Call Format) structure [12]. A minimum of seven fields are required for downstream processing with DiGeST. The first field contains the sample identifier. The next four fields (chromosome, position, reference, alternative) uniquely identify the variant by its position and DNA sequence change. The sixth field contains the zygosity of the sample for the given variant, and is represented by 2 (homozygous alternative), 1 (heterozygous alternative) or 0 (homozygous reference). Note that these first six fields are also part of the required fields of VCF files. A seventh column additionally contains the gene symbol, following the HGNC gene nomenclature [19], which is needed by DiGeST for performing ranking along genes.
The database may contain additional fields, which can be used during the filtering stage. Such fields can include, for example, annotations from genomic databases such as 1000 Genomes Project [8], dbSNP [63], dbNSFP [38], SnpEFF [13], that can be obtained using standard annotation software, see [55] for a review.
Given that a single individual genome has 3-4 millions variants, the variant database can become a very large data structure when thousands of samples are included. DiGeST relies on the Apache Parquet format [46,42] for storing the variant database. Apache Parquet is a columnar data storage format designed to support very efficient compression and encoding schemes. In particular, Parquet allows gzip and snappy compression. While gzip has better compression accuracy (about twice as much as snappy), it is much slower to decompress and compress (about 5 times). As the storage costs continue to decrease, snappy usually provides a better option by significantly reducing the data processing time, allowing end-users to reach their results faster. Besides compression, Parquet also allows files to be split and stored on a distributed file system, and data to be queried from the files using SQL like syntax. Such properties make the format much more suitable for variant filtering than the CSV or VCF formats.

Filtering stage
The filtering is the first processing stage of the DiGeST workflow, and consists in extracting variants of interest from the variant database, and creating the genotype matrix as a resilient distributed dataset (RDD). RDDs are Spark's underlying distributed data structures, split in partitions, which allow data to be stored and processed in a distributed manner. They consist in collections of items, usually tuples or (key, value) pairs. The sequence of transformations from the database extraction to the genotype matrix creation is summarised in Fig. 3.
The filters are user-defined sets of conditions aimed at selecting variants and samples to include in the control and case groups. They are expressed using a standard SQL syntax, with conditions on the set of fields (columns) available in the variant database. Queries taken by DiGeST to create the case and control groups are typically of the form The set of samples to include in the case and control group is defined by conditioning the Sample ID field. Additional conditions can be specified by extending the SQL WHERE clause. These may include, for example, conditions on gene symbols to restrict the variant subset to a gene panel, or conditions on quality control or variant deleteriousness annotations if these are available.
SQL queries for the control and case groups are executed using the Spark SQLContext object, which provides the entry point to all Spark SQL functionalities. The SQLContext returns two RDDs which, after merging, form a unique RDD that contains all variants from the two groups as a distributed collection of tuples (Sample ID, chr, pos, ref, alt, zygosity, Gene Symbol). Finally, in order to group variant data by variant IDs, these tuples are rearranged in (key, value) pairs, where the key consists in a first tuple (Gene Symbol, chr, pos, ref, alt) that uniquely identifies a variant, and the value is a second tuple (Sample ID, zygosity) that identifies the zygosity value for a given sample and variant in the genotype matrix. The transformation is done by applying a createKey VariantGene(variantT uple) function, see Pseudocode 1.  Spark's application programming interface (API) provides all the high level operators needed to express the transformations summarised in Fig. 3. SQL queries are executed using the sqlContext object. Resulting RDDs are then merged using the union operator, transformed in (key, value) pairs using the map operator, and grouped by variants using the groupByKey operator, as summarised in Pseudocode 2. Since most values in a genotype matrix are sparse, only entries coding for heterozygous and homozygous alternatives (1 or 2, respectively) are effectively stored.
Pseudocode 2: Filtering and genotype matrix creation Input : SQL queries sqlCase and sqlControl for case and controls, and optionally number of partitions P Output: A Spark RDD genotypeM atrix 2 caseV ariants RDD ← sqlContext.sql(sqlCase).rdd 3 controlV ariants RDD ← sqlContext.sql(sqlControl).rdd 4 mergedV ariants RDD ← caseV ariants RDD.union(controlV ariants RDD) 5 genotypeM atrix RDD ← mergedV ariants RDD.map(createKey VariantGene).groupByKey(P ) The number of partitions in which the RDD is split when extracting the data from the Parquet database is by default determined by Spark, and depends on the block size of the file system (typically 256MB on HDFS file systems). It may be desirable to increase the number of partitions for better parallelisation, and this can be specified as an optional argument P during the grouping operation (Pseudocode 2, line 5).

Scoring stage
The scoring stage essentially consists in restructuring the genotype matrix so that the genomic regions to score are grouped together (e.g., genes, pair of variants...), and in applying the scoring function to each group. We outline in the following how Spark can efficiently perform such groupings and compute corresponding scores in a distributed way, for variant, genes and 2−way interactions (pairs of variants/genes). We then show that the proposed approaches can be generalised to the scoring of larger genomic regions (for example sets of genes), and k−way interactions.
1) Variant scoring : The Spark implementation for variant scoring is almost straightforward, since the genotype matrix groups genotype data by variant keys. Let us denote by S(P i ) a scoring function that takes as input a partition P i of the genotype matrix, and returns a list of (variant key, (score, p value)) tuples, one for each variant key in P i . An example of Pseudocode for such a function is given in Pseudocode 3, which implements the basic allelic sum score together with a Fisher's test for statistical significance [31]. Global variables are used to provide the IDs of samples belonging to the case and control groups.
The overall scoring is distributed by applying the scoring function S(P i ) to each partition P i of the distributed genotype matrix by means of the Spark map operator. Scores are then sorted the using the takeOrdered operator. Finally, sorted variant keys, together with the scores and p-values, are retrieved as a table in the form of a Spark dataf rame. A high-level summary of the Spark processing pipeline is provided in Fig. 4.
It is worth noting that the first stage (scoring) does not require any exchange of data (shuffling), while the sorting does, which is represented in the diagram using a thick arrow.
2) Gene scoring : The workflow for gene scoring is essentially the same as for variant scoring, with an added preprocessing step that changes the grouping of the genotype matrix in order to group variant keys belonging to the same gene. The change of grouping is done by using the Spark groupBy operator, which takes care of reorganising the genotype matrix along gene keys. The resulting workflow is illustrated in Fig. 5. The change of grouping may require data to be shuffled, which is represented by the thick arrow between the grouping and scoring steps.  Within each partition, tuples (variant key,list(Sample Id, zygosity)) are scored and transformed in tuples (variant key,(score, p value)). Variant keys whose scores are the highest are then returned as a Spark dataf rame using the distributed takeOrdered operator. The scoring of the partitions follows the same logic as for the scoring of variants. The Pseudocode 3 only needs to be adapted in order to perform scoring along gene keys instead of variant keys.
3) Pair of variants scoring : The scoring of pairs of variants requires the genotype matrix partitions to be paired in order to compute the scores for any pair of variants. The set of P partitions is therefore first transformed in a set of P (P +1) 2 paired partitions, each of which contains a pair of partitions (P i , P j ), j ≥ i. An efficient approach to perform this pairing is to build a list listP airsT oCreate of index pairs (i, j), for all the pairs of partitions that need  Index pairs are then grouped together, resulting in an RDD made of all the pairs of partitions ((i, j), (P i , P j )), j ≥ i. Each partition is finally scored, and scores are sorted and returned as a data frame. A summary of the sequence of transformation is given in Fig. 6.
The partition scoring now takes a pair of partition (P i , P j ) as input, and returns the scores and p-values for possible pairs of variants in P i and P j .

4) Extensions to larger genomic regions or k-way interactions :
The proposed pipelines for gene and pair of variants scoring can be gen- Figure 6: Pair of variant scoring: All P (P +1)/2 pairs of partitions are generated from the initial set of P partitions. Scoring is then performed for each partition pair, and scores are finally ranked and retrieved as a data frame.
S(P 1 , P P ) S(P P , P P ) 1) Pair partitions (map/groupBy) 2) Score partition (map) 3) Sort scores (takeOrdered) eralised to larger genomic regions (e.g., sets of genes) and k−way interactions (k > 2). The grouping of variants across multiple genes may be achieved by making the gene sets as keys, and by relying on the groupBy operator before the partition scoring, as was done for the gene scoring. The computation of scores involving k−way interactions (k > 2) is a generalisation of the k = 2 case. It can be achieved by modifying Pseudocode 4 so that partitionP airsList takes not only pairs of partitions, but any tuples of partition indices. The generic scoring pipeline is summarised in Fig. 7, and consists in four main steps that sequentially performs variant grouping, partition interactions, scoring and sorting. Steps 1 (change grouping) and 2 (k-way interactions) are optional preprocessing steps. The former allows to group variants across genes or sets of genes for aggregative scorings. The latter allows to assess interactions between genomic regions for multi-loci scorings. Steps 3 (scoring) and 4 (sorting) are common to all scoring workflows, and perform the actual scoring and ranking, respectively.

1) Change grouping 2) k-way interactions 3) Scoring 4) Sorting
Key Score P-value key28 237 2.10 -7 key397 218 4.10 -6 key51 208 3.10 -6 The k-way interaction step increases the number of partitions from P to an order O( P ! (P −k)! ), that is, all possible combinations of k partitions. For other steps, the number of partitions remains by default the same. It may however be changed programatically as an argument to the groupBy operator, or by explicitly using the repartition operator.
It is finally worth emphasising that all these steps can be expressed very concisely thanks to Spark's grouping, mapping and sorting operators, which abstract the complexity of the distributed computing backend.

User interface
DiGeST scoring pipeline may be called either from the command line, using a Spark submission script, or from a user-friendly Web front end (see online demo at http://bridgeiris.ulb.ac.be/digest). We detail both options below.

Command line
The command line is the most direct way to run the DiGeST scoring pipeline. The parameters for an analysis are provided by means of a configuration file jobArguments.conf. The results are returned in two files: a CSV file containing the rankings, and a JSON file containing metadata about the analysis. This is illustrated in Fig. 8. The configuration file must specify the following parameters: • analysisName: The name for the analysis • scale: Scale of the analysis, either 'variant' or 'gene' • scope: Scope of the analysis , either 'univariate' or 'bivariate' • sqlControl and sqlCase: The SQL queries for selecting variants for the control and case populations • pathVariants: Path to the variant dataframe, in Parquet format.
The configuration file is passed as a parameter to the DiGeST Spark Python script 'digest.py', which returns two files after completion: • analysisName metadata.json : Contains the same as jobArguments.conf, plus the total runtime for the analysis • analysisName ranking.csv: A CSV file containing the variants/genes ranked according to their scores.

Shiny R Web front end
The text files used as input and outputs of the DiGeST command line interface can be cumbersome to manage, especially for experiments involving a large number of genes, samples or filtering criteria. We therefore designed a Web application that allows on the one hand to create filters and groups, and generate the jobsArguments.conf, and on the other hand to explore the ranking results in a friendly way. The Web application is developed with Shiny R [60], and provides four main tools for interactively managing DiGeST data.  [13] is either high or moderate is given in Fig. 9. The set of variables proposed by the filtering tool are those included in the variant dataframe. In our design, we preprocessed the exome data from the 1000 Genomes Project using the Highlander filtering tool [24], and included 35 annotations fields from the 1000 Genomes Project [8], dbSNP [63], dbNSFP [38], and SnpEFF [13].
The filtering tool allows to save a given set of filtering conditions. The conditions are converted in a SQL syntax, that can either be used to feed the jobsArgument.conf file, or to interactively query the variant dataframe from the Parquet files (using the Apply filter button). In the latter case, a subset of 1000 variants are retrieved from the variant dataframe, and provided to the user as a table than can be either explored from the Web interface, or downloaded as a CSV. The query from Fig. 9, for example matches around 5.5 million variants. A snapshot of the table provided to the user is given in Fig. 10. The filtering tool is available both for phenotypic and variant data, under the 'Phenotype manager' and 'Gene and variant filtering manager' (see Fig. 9), respectively.
2) DiGeST launcher : The DiGeST launcher allows to create the jobsArguments.conf file, and start an analysis. Control and case groups are selected from the set of previously saved filters. The user may then select the scale (variant or gene) and scope (univariate or bivariate) for the scoring, and give a name to the analysis. The jobsArguments.conf file is created once pressing the start button, and provided to the Spark submission script.
3) Results browser : The DiGeST pipeline returns the rankings in the form of a CSV file, whose rows contain the variant/gene or pair of variants/genes IDs, together with the scores and a set of metadata (p-values, number of variants, scores for individual groups). The results browser allows to open the CSV of an analysis, and to display the results in an interactive table where results may be reordered according to the columns. Depending on the scope of the analysis  [19]), or pair of variants IDs or genes symbols. A hyperlink connects gene symbols to their page on the OMIM (Online Mendelian Inheritance in Man) Web site [21]. An example of the result table is given in Fig. 12 for a digenic ranking.  In the result browser, the pivot table is particularly useful to visualise how variants are distributed among samples of the case and control groups. An example is given in Fig. 13, where the number of variants in each gene of the first pair (IGLV3-12 and NCAPH2) is given for each sample of the two groups (European and Asian subset of samples from the 1000 genome project).

Results
We ran DiGeST on the 1000 Genomes Project phase 3 (1000GPp3) exome data [8], containing SNPs data for 2504 individuals, and a total of around 2.7 million variants. This section presents the data preparation and scalability results for univariate and bivariate rankings, both at the variant and gene scales.

Data preparation
Exome variants were filtered and annotated using Highlander [24], resulting in a total of about 260 million annotated variants for all 2504 individuals (samples). Each sample has around 100000 variants, see Fig. 14, left. The second peak around 120000 variants per sample correspond to samples with African origin. The data covers 33168 genes (using UCSC hg19 genome assembly), and the median number of variants per gene is 47. A few genes (41) have more than 1000 variant per genes. The number of variants per genes is reported in Fig. 14, right (genes with more than 1000 variants are not represented for clarity reason). Data was converted to the Parquet format, and the database size on disk after conversion was slightly less than 40GB.

Cluster back-end
We ran experiments on an in-house cluster consisting of 10 nodes, each with 24 cores, 128GB RAM, 4TB HDFS disk space, and 10Gb/s Ethernet connection. The Hadoop Yarn resource negociator system from Cloudera Hadoop Distribution 5.7.1 was used as the back-end for Spark 2.0. In all the experiments, Spark executors were allocated 2GB of RAM. The requested computing resources ranged from one executor, 2GB of RAM (single executor experiments) to one hundred executors and 200GB of RAM. All analyses were repeated five times, which was deemed sufficient given that computation times were very stable across runs.

Variant rankings
We assessed the computation times for variant rankings by varying the number of variants from 10k to 2.8 millions, for a population of 2504 samples (1252 samples in the control and case populations). Scalability results are reported in Fig.  15, for a number of executors varying from 1 to 50. The left panel reports the genotype matrix creation times, and the right panel the corresponding variant scoring times. The genotype matrix creation times were proportional to the number of variants included in the analysis (Fig. 15, left panel). The bottleneck for the creation time is the reading from the HDFS file system, which was on average of 8MB/s for our cluster. Given that the complete set of data was around 40GB, it took around 5000 seconds to create the whole genotype matrix with one executor. Thanks to the distributed file system, the loading times could be decreased by increasing the number of Spark executors. Loading times with 2 and 10 executors were reduced to 2500 and 530 seconds, providing almost proportional speed-ups of 1.9 and 9.2, respectively. Achievable speed-ups were however also bounded by the network bandwidth, which at a certain point becomes the main bottleneck. With 50 executors, the shuffling of data between executors reached its maximum capacity of 10Gb/s. The loading time was reduced was reduced to 150s, that is, a speed-up of only 30.
The computation times for scoring (Fig. 15, right panel) were also proportional to the number of variants. With one executor, the completion time for scoring all 2.7 million variants across the 2504 samples was of around 160s. The bottleneck were the disk access times, to load the 1.5GB sparse genotype matrix at a speed of again around 8MB/s. Increasing the number of executors to 10 and 50 reduced the computation times to 19s and 7s, providing speed-ups of 8.3 and 23, respectively. Given the relatively short computation times, Spark's overhead becomes non negligible, therefore making these speedups sublinear in the number of executors. However, it is still remarkable that such speed-ups can be obtained given the short execution times of the task, and illustrate Spark's ability to still provide speed-ups even for very short tasks.
Finally, it is worth noting that the standard deviations across all runs are very tight, on average less than one second. Only for one experiment (variant scoring, 2.7M variants, 2 executors) is the standard deviation noticeable on the graph (Fig. 15, right panel).

Scalability of Gene rankings
The main difference between the gene and variant ranking pipelines is an added step that groups variants by genes after the genotype matrix is created. The genotype matrix creation times are therefore the same as in Fig. 15, left panel. We report below the processing times for the gene grouping step (Fig. 16, left  panel), and the gene scoring step (Fig. 16, right panel)   The interpretation of the results is essentially similar to the scoring of variants: processing times follow a linear trend as the number of genes increases, and are bounded by disk access times. Hence, the scoring times are very comparable to those obtained for variant scoring. Grouping times took slightly longer (about twice the time), since shuffling occurs to group variants by genes. Processing times were also very stable across runs, with standard deviations less than a second, hence not visible on the graphs.

Scalability of variant pair rankings
Scalability results are reported in Fig. 17, for a number of executors varying from 1 to 100. The left panel reports the computation times as the number of variants increases from 10k to 100k for a population size of 2504 samples, and the right panel reports the computation times as the population size increases from 100 to 2504 samples, for 25k variants. A timeout of 3600 seconds was set so that any run taking more than one hour was prematurely stopped. Spark's ability to fully use high numbers of executors becomes apparent when billions of scores are computed. Thus, the computation times for 100k variants (≈ 5 billion pairs) takes just one hour (3610±30 seconds) with 50 executors, and half an hour (1864±22 seconds) with 100 executors. With only one executor, computations take slightly more than 2 days (not reported on the chart for clarity reason). Speed-ups gained by using 50 and 100 executors are therefore of 48 and 95, respectively, which reflects the low overhead caused by Spark's framework.
Finally, the trends observed in the right panel (increase of the sample size for a fixed number of variants) are sublinear, reflecting the sparsity property of the genotype matrix.

Discussion and research perspectives
With DiGeST, we showed that Hadoop and Spark provide well-suited and complementary computing frameworks for performing genomic association studies.
A compelling advantage of the frameworks is that most of the complexity related to the distributed infrastructure is abstracted by the HDFS storage system and the MapReduce programming model. In particular, the latter provides all the high-level operators required to group, filter, transform or sort variant data. This enables to express in a programmatically very concise way a wide variety of scoring schema, including aggregative or multi-loci scorings.
Our experimental results further illustrated the ability of both frameworks to efficiently distribute the loading and processing of large amounts of data, and to scale almost proportionally with the available computing resources. The main bottlenecks in execution times were found to be related to hardware limits, in particular disk access and network bandwidth. We plan to address in the future work a theoretical analysis of the tradeoffs between communication and parallelism, using models such as proposed in [61,1].
Hadoop/Spark is designed to operate on commodity machines, and it should therefore be kept in mind that alternative computing frameworks such as High Performance Computing solutions, or those based on dedicated hardware such as GPUs [72,71,25,18] and FPGAs [68,20], still retain the edge in terms of raw computational capacity. Rather, the main benefits of Hadoop/Spark lie in its robustness to node failure, and its ability to provide an abstraction of the distributed infrastructure. Both these characteristics considerably speed up the development time of a distributed application.
More specifically focusing on genomic association studies, we aim at extending DiGeST with a wider range of scoring functions. DiGeST currently only implements the basic association test for a disease trait by comparing allele frequencies between cases and controls, and applying a Fisher's test. Extension to other standard GAS tests such as the Cochran-Armitage test and Hotelling's T(2) statistic [58] are expected to require only slight variations in the partition scoring step of DiGeST pipeline. Extensions to aggregative tests such as burden tests (e.g., CMC [35], or CAST [48], WSS [40], KBAC [37]) or variancecomponent tests (e.g., SKAT [70]) could on the other hand be implemented by relying on the variant grouping feature of DiGeST.
Beyond scoring methods based on statistical tests, advanced computational techniques based on Machine Learning (ML) have more recently been showed to provide significant improvements to variant or gene prioritisation tasks [27,64,59,47,69,14,3]. A more general avenue for our research aims at investigating the integration of such techniques within the DiGeST pipeline.

Conclusions
In the landscape of currently available open source distributed computing frameworks, the work presented in this article supports that Hadoop and Spark bring effective complementary solutions for DNA sequencing association studies. On the one hand, Hadoop is a robust and now well-established framework for faulttolerant distributed data storage and resource management system. On the other hand, Spark provides a fast and light-weight computing engine, with a rich range of high-level functions for filtering, grouping, and transforming data in an efficient way.
Relying on Hadoop/Spark, we developed DiGeST, a distributed gene and variant scoring tool, and showed its ability to provide scalability to DNA sequencing case/control studies. We coupled this distributed back-end with a user-friendly Web front-end based on R Shiny, allowing users to easily filter variant data, and explore scoring results. All tools are made open-source, and can be reused efficiently thanks to virtualisation scripts made available from the project web site.