Coordinate-based mapping of tabular data enables fast and scalable queries

Motivation: Biologists commonly store data in tabular form, with observations as rows, attributes as columns, and measurements as values. Due to advances in high-throughput technologies, the sizes of tabular datasets are increasing; some datasets contain millions of rows or columns. To work effectively with such data, researchers must be able to efficiently extract subsets of the data (using filters to select specific rows and retrieving specific columns). However, existing methodologies for querying tabular data either do not scale adequately to large datasets or require specialized tools for processing. We sought a methodology that would overcome these challenges and that could be applied to an existing, text-based format.

Results: In a systematic benchmark, we tested 10 techniques for querying simulated, tabular datasets. These techniques included a delimiter-splitting method, the Python pandas module, regular expressions, object serialization, the awk utility, and string-based indexing. We found that storing the data in fixed-width formats provided excellent performance for extracting data subsets. Because columns have the same width on every row, we could pre-calculate column and row coordinates and quickly extract the relevant data from the files. Memory mapping led to additional performance gains. A limitation of fixed-width files is the increased storage required for buffer characters. Compression algorithms help to mitigate this limitation at a cost of reduced query speeds. Lastly, we used this methodology to transpose tabular files that were hundreds of gigabytes in size, without creating temporary files. We propose coordinate-based, fixed-width storage as a fast, scalable methodology for querying tabular biological data.

Contact: stephen_piccolo@byu.edu

Introduction

Biological data, and their corresponding annotations, are often stored in tabular text files. The ways that objects, attributes, and values are oriented within these files differ across data repositories 29-31, but values are commonly oriented in rows and columns and are separated by tab characters, comma characters, or some other delimiter.

Typically, to parse such data, researchers write custom scripts or use software packages that facilitate parsing 32-34. To perform selection, the code must extract all values from the column(s) to be used as filtering criteria. If data values are delimited by tab characters, for example, the code must identify the positions of tab characters and extract data at the relevant positions for each row. However, because data values may vary in length, the positions of tab characters may differ from row to row and must be re-identified for each row, thus slowing execution. After identifying rows that match the selection criteria, the researcher may then wish to project the data. When parsing a tab-delimited file, the code must again identify the positions of tab characters for each row and extract values at the relevant positions. In this methodology, the code parses the data row by row, thus minimizing memory consumption. Alternatively, the entire file could be parsed into an in-memory data structure; this methodology may increase the efficiency of selection and projection, but many datasets are too large to fit in memory. Additionally, if a researcher wishes to use only a few columns for selection or projection, it is inefficient to read the entire file into memory. Hybrid solutions exist, such as the pandas module for Python 35.

In evaluating methodologies that could handle such data, we envisioned scenarios in which data files are created once and then queried many times. Public repositories like TCGA, LINCS, and UK Biobank cater to these scenarios; after the data have been prepared, they are stored on web servers, enabling researchers to download and query the data. Because the files are written only once, it is less important to optimize speeds for writing the files, and it is unnecessary to support concurrent writing by multiple agents. In contrast, it is highly preferable that researchers can query the data quickly and flexibly. With this context in mind, we sought a solution that would meet the following criteria:

• Handle datasets larger than what can fit into memory on modern personal computers.

• Support selection based on data in any column.

• Store the data in a portable format that can be transported across systems without custom tools or specialized expertise.

• Store the data in a space-efficient manner (while preferring fast speeds over reduced storage).

• Can be created, indexed, and queried in a non-proprietary 43, programming-language-agnostic, and platform-independent manner.

• Can transpose rows and columns without reading all the data into memory and without creating temporary files.

• Can represent missing values explicitly.

In our quest to identify a solution that would address these criteria, we considered a variety of binary-based solutions. These included relational databases, NoSQL databases, HDF5, and the Apache Parquet format 44. With each solution, we faced limitations. For example, the SQLite relational database has a limit of 32,767 columns 45. NoSQL databases provide many options for structuring the data, but we failed to identify an approach that would provide adequate query speeds and storage sizes. The HDF5 format is designed primarily for numerical data, whereas we sought the ability to handle other data types as well. As a columnar storage solution, Parquet was efficient at projection; however, it was ill-suited to selection. Ultimately, we focused on text-based solutions, performing a benchmark analysis of 10 different techniques for parsing tabular data. As described below, we chose one of these techniques and refined it further. We found that this technique addresses each of the above criteria while remaining human readable, fast, and scalable.

Methods
In an initial round of benchmarks, we used Python scripts to generate tabular text files in which 10% of the columns contained categorical values (randomly generated, two-character alphabetical sequences) and 90% of the columns contained numerical values (ranging between 0.0 and 1.0).

First, we used these scripts to generate relatively small files, containing 100 columns and 1,000 rows. After verifying functionality, we generated two types of large files that represent dimensions that will be seen increasingly in biological research: 1) "tall" files containing 1 million rows and 1,000 columns, and 2) "wide" files containing 1,000 rows and 1 million columns. Each of these files contained a total of 1 billion data points (approximately 10 GB in size). For each set of dimensions, we saved the data in four different formats:

• tsv. We separated each value on each row with tabs (tab-separated-value format).

• msgpack. We used the MessagePack format 46 to serialize each row of data as a list object.

• flags. In an attempt to make it faster to access elements at a given column index, we specified the index of each element within each row of data and embedded these indices within the file, prior to each datum.

• fwf. The width of each column corresponded to the data value with the largest number of characters in that column (fixed-width format). We also added a buffer character between columns.
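As a minimal illustration of the fwf format (a sketch rather than our exact benchmark script; the file names and the single-space buffer character are hypothetical), a TSV file can be converted to fixed-width form by padding each column to the width of its longest value:

    def tsv_to_fwf(tsv_path, fwf_path, buffer_char=" "):
        # First pass: find the width of the longest value in each column.
        widths = []
        with open(tsv_path) as tsv_file:
            for line in tsv_file:
                values = line.rstrip("\n").split("\t")
                if not widths:
                    widths = [0] * len(values)
                for i, value in enumerate(values):
                    widths[i] = max(widths[i], len(value))

        # Second pass: pad each value to its column's width, joining
        # columns with a single buffer character.
        with open(tsv_path) as tsv_file, open(fwf_path, "w") as fwf_file:
            for line in tsv_file:
                values = line.rstrip("\n").split("\t")
                padded = [v.ljust(widths[i]) for i, v in enumerate(values)]
                fwf_file.write(buffer_char.join(padded) + "\n")

    tsv_to_fwf("data.tsv", "data.fwf")  # hypothetical file names

Because every row in the resulting file has an identical length, column and row coordinates can be pre-calculated, which is what enables the query techniques evaluated below.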

In this phase, we evaluated 10 techniques for projecting the data. Different techniques used different versions of the input data (see below). We coded each technique to select the first column and every hundredth column thereafter. Each script saved the selected columns to a tab-delimited text file. We then used a script to verify that the output was correct.

• delimiter-split. We used TSV files as input, split each line on tab characters, and extracted values at the specified indices.

• pandas. We applied the read_csv function from the Python pandas module to the TSV files.

• reg-ex-quant. We used regular expressions to quantify the tab characters that preceded each specified index and then extracted those values using capturing groups.

• reg-ex-tab. We used regular expressions to map non-capturing groups to indices that should be ignored and capturing groups to indices that should be extracted.

• msgpack. We deserialized each row from the MessagePack-serialized files and extracted values at the specified indices.

• flags. For the flag files, we identified the position of each specified flag and then extracted the characters after it until another flag was reached.

• awk. We applied the awk command-line utility to the TSV files 47. This Unix-based tool provides extensive support for parsing text files.

• gawk. This is the GNU implementation of awk.

• nawk. This is another implementation of awk ("new awk").

• fixed-width. We used the header line in the file to identify the starting and ending positions of each column and then extracted characters at those positions for the specified columns. In a subsequent refinement, we removed the extraneous buffer characters between the columns. This reduced file sizes; however, we retained a nonessential newline character at the end of each line to make the files more readable.
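The sketch below conveys the coordinate-based idea behind the fixed-width technique. It is a simplified illustration rather than our benchmark implementation: it assumes the per-column widths are already known (for example, from an index created when the file was written) and that columns are separated by a one-character buffer; all names are hypothetical.

    import mmap

    def column_coords(widths, buffer_len=1):
        # Convert column widths into (start, end) character positions
        # within a line.
        coords, start = [], 0
        for width in widths:
            coords.append((start, start + width))
            start += width + buffer_len
        return coords

    def project_columns(fwf_path, widths, column_indices, out_path):
        coords = column_coords(widths)
        # Every line has the same length: columns + buffers + newline.
        line_len = sum(widths) + (len(widths) - 1) + 1
        wanted = [coords[i] for i in column_indices]
        with open(fwf_path, "rb") as fwf_file, open(out_path, "w") as out_file:
            mm = mmap.mmap(fwf_file.fileno(), 0, access=mmap.ACCESS_READ)
            num_rows = len(mm) // line_len
            for row in range(num_rows):
                offset = row * line_len
                values = [mm[offset + s : offset + e].decode().rstrip()
                          for s, e in wanted]
                out_file.write("\t".join(values) + "\n")
            mm.close()

Because the byte range of any cell can be computed arithmetically, only the requested columns are ever read; memory mapping lets the operating system page in just those regions.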

All of our code, along with a bash script to execute the benchmarks, can be found at

Results
First, we evaluated methods for projecting tabular text files that contained 1 billion data points.

These files had either a "tall" or "wide" orientation. The tall files simulate scenarios in which researchers collect 1,000 data points for 1 million patients (or another object type). The wide files simulate scenarios in which researchers collect 1 million data points for 1,000 patients. As high-throughput data-generation technologies advance and as researchers combine individual datasets into aggregate ones, such scenarios will become increasingly common.

Commonly, biology researchers store data in tab- or comma-delimited files and parse such files using the delimiter-split method. Thus, we considered the performance of this approach to be a baseline. For the tall files, our scripts extracted every hundredth column in 21.72 and 17.76 seconds with and without memory mapping, respectively. All but two of the competing methods outperformed this approach (Figure 1). In contrast, on the wide files, performance slowed considerably for all methods. The baseline method extracted every hundredth column in 31.38 and 27.44 seconds. The pandas method and both regular-expression methods performed worse than the baseline, and their performance was dramatically worse than it was on the tall files. The poor performance of pandas is perhaps surprising, given the package's popularity among data scientists 52.

The fixed-width method performed best overall on the tall file, projecting the data in only 5.34 seconds with memory mapping; however, its performance was mediocre on the wide file. We hypothesized that a few adjustments to the file format and our algorithmic approach might improve the performance substantially (see Methods). In addition, we implemented a chunking scheme in which we parsed 1,000 rows of input data at a time before writing to the output file. After these adjustments, we projected every hundredth column from the tall file in 3.70 seconds. For the wide file, we projected the data in 3.43 seconds, only 10% of the time required by the original approach. Given these results, we focused on this method and evaluated its performance further.

We wanted a method that would excel at both projection and selection. Therefore, we performed selection on data from one column with categorical data and one column with numerical data.
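Selection can reuse the same coordinate arithmetic. As a minimal sketch (continuing the hypothetical projection example above, not our exact code), the filter column's characters are read for each row, and only the indices of matching rows are retained for subsequent projection:

    def select_rows(mm, line_len, coord, predicate, num_rows):
        # Read only the filter column's characters on each row and
        # keep the indices of rows that satisfy the predicate.
        start, end = coord
        matches = []
        for row in range(num_rows):
            offset = row * line_len
            value = mm[offset + start : offset + end].decode().rstrip()
            if predicate(value):
                matches.append(row)
        return matches

    # Hypothetical usage: a categorical and a numerical filter.
    # rows = select_rows(mm, line_len, coords[2], lambda v: v == "AB", num_rows)
    # rows = select_rows(mm, line_len, coords[5], lambda v: float(v) > 0.5, num_rows)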

When performing the initial benchmarks, we stored the data in four tabular formats (see Methods). The flags and fixed-width files were larger than the other formats, especially for the wide files (see Figure 2). To enable indexing, these formats require extra text within the files. We considered ways to reduce this extra storage requirement while still supporting fast query times. We tested four compression algorithms: gzip, bzip2, lzma, and snappy. We compressed the text files line by line. After compression, the lengths of the lines varied, so we saved the starting position of each row to a serialized dictionary. Compression times differed considerably across the methods (Figure 3); as its name implies, snappy was extremely fast. In contrast, snappy-compressed files were approximately twice as large as files compressed using the other algorithms.

We then applied our methodology to simulated genotype files of increasing dimensions. Query times were nearly constant for all file sizes except the largest, for which queries completed in 0.06 seconds. The largest file (500,000 rows by 500,000 columns) contained a total of 250 billion data points and was 465 GB in size (Figure 4A). Although extracting 10 rows and 10 columns does not reflect a real-world scenario, it illustrates the promise of performing these operations quickly on extremely large files.
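The following sketch illustrates the line-by-line compression scheme described above; zlib and pickle stand in for the various algorithms and serialization choices we tested, and the file names are hypothetical. Each line is compressed independently, and each row's starting byte position (plus, for convenience, its compressed length) is saved to a serialized dictionary so that individual rows can be decompressed on demand:

    import pickle
    import zlib

    def compress_lines(text_path, compressed_path, index_path):
        offsets = {}  # row index -> (start byte, compressed length)
        position = 0
        with open(text_path, "rb") as src, open(compressed_path, "wb") as dst:
            for row, line in enumerate(src):
                compressed = zlib.compress(line, 1)  # level 1 favors speed
                offsets[row] = (position, len(compressed))
                dst.write(compressed)
                position += len(compressed)
        with open(index_path, "wb") as index_file:
            pickle.dump(offsets, index_file)

    def read_row(compressed_path, index_path, row):
        with open(index_path, "rb") as index_file:
            offsets = pickle.load(index_file)
        start, length = offsets[row]
        with open(compressed_path, "rb") as src:
            src.seek(start)
            return zlib.decompress(src.read(length)).decode()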

As a final test, we transposed the simulated genotype files. To our knowledge, no tool exists for transposing tabular files without reading the full dataset into memory or writing temporary files.

Our fixed-width approach successfully transposed each of the simulated genotype files, including the largest, without writing temporary files. Time and memory usage did increase as file sizes increased, but in a nonlinear fashion (Figure 4B).
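To convey how a fixed-width file can be transposed without temporary files, here is a simplified sketch. It assumes, for brevity, that every cell has the same width and that buffer characters have been removed (per-column widths generalize straightforwardly); the names and parameters are hypothetical. Both files are memory mapped, so the operating system pages data in and out rather than holding the full dataset in memory:

    import mmap

    def transpose_fwf(in_path, out_path, num_rows, num_cols, cell_width):
        in_line = num_cols * cell_width + 1   # +1 for the newline
        out_line = num_rows * cell_width + 1

        # Pre-size the output file so it can be memory mapped for writing.
        with open(out_path, "wb") as dst:
            dst.truncate(num_cols * out_line)

        with open(in_path, "rb") as src, open(out_path, "r+b") as dst:
            src_mm = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
            dst_mm = mmap.mmap(dst.fileno(), 0, access=mmap.ACCESS_WRITE)
            for r in range(num_rows):
                for c in range(num_cols):
                    # Cell (r, c) of the input becomes cell (c, r) of the output.
                    cell = src_mm[r * in_line + c * cell_width :
                                  r * in_line + (c + 1) * cell_width]
                    dst_mm[c * out_line + r * cell_width :
                           c * out_line + (r + 1) * cell_width] = cell
            for c in range(num_cols):
                dst_mm[(c + 1) * out_line - 1 : (c + 1) * out_line] = b"\n"
            dst_mm.flush()
            src_mm.close()
            dst_mm.close()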

Discussion
Our goal was to evaluate techniques for storing and querying tabular data. Such data are used widely in biology research. We sought to identify a methodology that would reduce query times and overcome limitations of existing approaches. In addition, we sought a solution that would provide the flexibility and readability of plain text plus some benefits of binary formats. Our results suggest that fixed-width formatting, memory mapping, and coordinate-based indices can meet these needs for many research scenarios, especially those that prioritize fast reading over fast writing and that can accept larger file sizes as a compromise for faster querying. In this study, we simulated data for which all values in a given column are the same width; in practice, data values in a given column often vary in length. In these scenarios, extra buffer characters are needed. More research is necessary to evaluate how much these buffer characters would increase file sizes in practice, but we predict that query speeds would be affected only minimally. One possibility for mitigating the effect of larger file sizes is to compress the whole file using a standard compression scheme (e.g., gzip) before it is placed on a web server; distributing the file would then require less disk space on the server and less network bandwidth during file transfer, and the researcher could decompress the file locally before querying it.

The results of time-based benchmarks must be interpreted with caution because execution times vary from one computer system to another. Additionally, we performed these benchmarks on hardware whose performance exceeds that of many computer systems currently used for biology research. However, we are confident in our conclusions about the relative performance of the methods we evaluated.

This study describes a proof of concept rather than a production-ready tool. One limitation of our current approach is that data types are not stored explicitly for each attribute; they must be inferred from the data. However, we believe our methodology's performance in these benchmarks merits further development.

The simplicity of our methodology is one of its strong points. It should be possible to implement this approach in any programming language and operating system that support reading and writing text files as well as memory mapping. We do not intend this to be "yet another file format" for bioinformaticians to deal with; rather, we describe it as a methodology for extending an existing format. In addition, our approach could facilitate translation among formats and data orientations.

Figure 1 caption (fragment): ...were structured and how they were parsed (see Methods). awk, gawk, and nawk did not support memory mapping. Tall files consisted of 1 million rows and 1,000 columns; "wide" files had 1,000 rows and 1 million columns. Each file included a mixture of categorical (10%) and numerical (90%) attributes. Note that the x-axis scales differ for the two panels.

Figure 3 caption (fragment): ...to alter the level of compression; we used levels 1 and 9, which are indicated in parentheses. Although the snappy compression algorithm was much faster than the other algorithms, these speeds were 20-50X slower than without compression.