Abstract
Background Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, and only rarely do the creators of these tools robustly test them for correct handling of input and output. This causes interoperability problems between different tools that, at best, waste time and frustrate users. At worst, interoperability issues could lead to undetected errors in scientific results.
Methods We sought (1) to assess the interoperability of a wide range of bioinformatics software using a shared genomics file format and (2) to provide a simple, reproducible method for enhancing interoperability. As a focus, we selected the popular Browser Extensible Data (BED) file format for genomic interval data. Based on the file format’s original documentation, we created a formal specification. We developed a new verification system, Acidbio (https://github.com/hoffmangroup/acidbio), which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases, potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the BED format. We also used a fuzzing approach to automatically perform additional testing.
Results Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite.
Discussion Acidbio makes it easy to assess interoperability of software using the BED format, and therefore to identify areas for improvement in individual software packages. Applying our approach to other file formats would increase the reliability of bioinformatics software and data.
1 Introduction
1.1 File format interoperability
For your latest research project, you have constructed a pipeline from multiple published bioinformatics tools. Each tool works well with the author’s data, but you run into errors with your data. The author’s data and your data have slight differences in file metadata and data formatting, which lead to the errors. As a result, you must spend time manually editing your data files and intermediate outputs to conform to each tool’s expectations. Meanwhile, ensuring interoperability between software tools that parse the data file format could have prevented your frustration.
Scientific software developed by academics often suffers from software engineering deficiencies1, which can lead to the scenario described above. These deficiencies include problems with deployment2, maintenance3, robustness4, and documentation5. Software engineering flaws may hinder fulfilling the Findable, Accessible, Interoperable, and Reusable (FAIR) principles for scientific data management6, especially the guidelines on interoperability and reusability. Software engineering flaws may also affect web services that parse bioinformatics file formats, which may have vulnerabilities to attacks such as malicious code injections in input files7.
One key difficulty arises from interoperability of specialized file formats used for scientific data. Often, creators specify such formats informally, or not at all, leaving users and developers to guess the details of critical components or edge cases. Rare standardization efforts such as those of the Global Alliance for Genomics and Health (GA4GH)8 have developed a few formal specifications. These include the sequence alignment/map (SAM), BAM, CRAM, and variant call format (VCF) file formats9.
Interoperability problems can also arise from flaws within the software itself. Developers can, however, address some of these problems through simple solutions such as checklists. For example, Bioconda10 recipes require adequate tests and a stable source code uniform resource locator (URL)11. Bioconductor12 also has guidelines for package submission regarding code style, performance, and testing13. Simple checklists can greatly improve software quality, even for programmers and researchers who lack formal software engineering training.
Software testing recommendations and standard test suites can aid researchers and developers. Extensive test suites for common standards, such as TeX’s trip test14 or the Web Standards Project Acid test suite15, exercise independent implementations by focusing on edge cases. In a bioinformatics context, tools that parse the VCF format16 can use simulated VCF files with known behavior to test software correctness17.
Here, we tackled the bioinformatics software engineering problem of file format interoperability, specifically focusing on the plain-text whitespace-delimited Browser Extensible Data (BED) format18. We chose to use the BED file format because of its simplicity and its popularity. First, we developed a formal specification for the BED format as a comprehensive specification did not exist. Second, we quantified the degree to which a wide variety of bioinformatics software varied in their processing of this file format. In particular, we tested bioinformatics software input validation, checking input data for correct formatting. To facilitate this work, we created Acidbio (https://github.com/hoffmangroup/acidbio), a system for automated testing and certification of bioinformatics file format interoperability.
1.2 The BED file format
The BED format describes genomic intervals in plain text. Each BED file consists of a number of lines, each with 3 to 12 whitespace-delimited fields. The mandatory first three fields (chrom, chromStart, and chromEnd) define an interval on a chromosome. The optional last nine fields provide additional information about the interval such as a name, score, strand, and aesthetic features used by the University of California, Santa Cruz (UCSC) Genome Browser19. The optional fields have binding order—all fields preceding the last field used must contain values.
BED variants distinguish BED files based on their number of fields. BEDn denotes a file with only the first n fields. For example, a BED4 file has the chrom, chromStart, chromEnd, and name fields. BED3 to BED9, along with BED12, represent the 8 standard BED variants.
BEDn+m denotes a file with the first n fields followed by m custom-defined fields supplied by the user. The custom-defined fields can contain many types of plain-text data. BEDn+m files act as custom BED files. Currently, no in-band information exists to supply information about a BED file’s fields. A BED parser must infer the fields present in a BED file.
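To make the field layout concrete, the sketch below shows how a parser might split one tab-delimited BED line into the three mandatory fields and up to nine optional fields. This is a minimal illustration under the assumption of tab delimiters only, not a complete validator.

```python
# Minimal sketch: split one tab-delimited BED line into mandatory and optional
# fields. Real parsers must also handle space delimiters, blank lines, and the
# many other constraints in the specification.
OPTIONAL_FIELDS = ["name", "score", "strand", "thickStart", "thickEnd",
                   "itemRgb", "blockCount", "blockSizes", "blockStarts"]

def parse_bed_line(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED requires at least chrom, chromStart, and chromEnd")
    record = {
        "chrom": fields[0],
        "chromStart": int(fields[1]),
        "chromEnd": int(fields[2]),
    }
    # Optional fields bind in order: a later field may appear only if all
    # earlier optional fields are also present.
    for name, value in zip(OPTIONAL_FIELDS, fields[3:12]):
        record[name] = value
    return record

print(parse_bed_line("chr1\t250000\t250500\tfeature1\t500\t+"))
```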
The file conversion tool bedToBigBed18, developed by the UCSC Genome Browser team20, has served as the de facto file validation tool for the BED format. The BED format appears deceptively simple, and without careful consideration of the specification, a developer may miss unexpected flexibility or rigidity in some fields.
2 Results
2.1 A new formal specification addresses ambiguities in the BED format
Despite existing for almost two decades, the BED format until recently lacked a formal specification similar to the SAM21 or VCF16 specification. The UCSC Genome Browser Data File Formats Frequently Asked Questions (https://genome.ucsc.edu/FAQ/FAQformat.html) specified some details, but lacked technical details that other formal specifications clearly define.
Through the GA4GH standards process8, we established a specification of the BED format (https://github.com/samtools/hts-specs/blob/master/BEDv1.pdf). The new specification defines each BED field and its permitted numerical range or valid character patterns. It also provides semantics surrounding whitespace, sorting, and default field values. The specification formalizes missing details and captures the existing use of the BED format. During its development, we solicited input from relevant stakeholders, including the UCSC Genome Browser team, the File Formats subgroup within the GA4GH Large Scale Genomics work stream (https://www.ga4gh.org/work_stream/large-scale-genomics/), and the public through GitHub comments (https://github.com/samtools/hts-specs/pull/570).
2.2 Most existing tools perform poorly on a BED test suite
To measure the ability of BED parsers to accept good input and reject bad input, we used the new specification to develop an Acidbio test suite of 92 expected pass and expected fail BED files. The expected pass test cases conform to our specification: for these cases, we expect tools to return a zero exit code and not output any error or warning messages. The expected fail test cases do not conform to our specification: for these cases, we expect tools to return a non-zero exit code or output an error or warning message. The test cases cover the definitions of fields and the structure of the BED file, across all BED variants from BED3 to BED12. The BED3 test cases represent the core of our test suite, as all BED files must have the first three fields.
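For illustration, the hypothetical BED3 records below show the two kinds of test cases; the "#" annotations only label the examples, and the actual test files live in the Acidbio repository.

```
# expected pass: a well-formed BED3 record
chr1	250000	250500

# expected fail: a negative chromStart violates the specification
chr1	-250000	250500
```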
The BED format does not contain in-band information on whether a file uses BED fields only or also has custom fields. A parser might assume that for BED files with 4 to 12 fields, all the fields represent standard BED fields. In this case, the parser should validate the fields according to the file format rules.
Alternatively, a parser might treat fields 4 through 12 as custom data. A tool designed to handle arbitrary custom BED files may not validate the optional BED fields. This means the tool may not fail on the expected fail test cases. The expected pass test cases, however, should all work even for non-specified custom data. Also, this flexibility does not apply to mandatory fields 1 through 3, as their definition cannot change.
We examined the behavior of tools, expecting strict validation of standard BED4 through BED12 files. This provides more informative results than permitting the whole range of behavior one might expect for custom data. Unexpected results in the optional fields indicate the need for better means for interchange of metadata on these fields.
Using our test suite, we assessed 80 Bioconda packages that support the BED format as input (Figure 1). In some packages, we assessed multiple tools, making 99 tools in total. For each tool, we calculated its performance on each BED variant by taking the number of tests that behaved as expected divided by the number of tests for the BED variant. Of the 99 tools, only 26 achieved ≥ 70% expected results for BED3 tests. Averaged for tests across all BED variants, 51 tools achieved ≥ 50% expected results. We have deposited full results on Zenodo (https://doi.org/10.5281/zenodo.5784787). Beyond the possibility of expecting custom BED files, we attributed unexpected results to several causes described below.
2.3 Existing tools parse BED files in different ways
All tools have distinct purposes, causing them to parse the BED format in different ways and focus on varying aspects of BED files. Different purposes mean some test cases may never arise in the expected usage of the tool. We have identified a few groups of tools that have similar behaviors, which cause poor performance on the test suite.
Tools that require a specific BED variant
Some tools require a specific number of fields in the input BED file. For example, slncky22 requires a BED12 file. This causes all BED3 to BED11 inputs to raise an error.
Tools that only validate a subset of BED fields
Many tools use the BED format only for interchange of genomic intervals in the first three fields. Some of these tools will accept any BEDn file and perform no validation after the first three fields. For example, many tools ignore fields that describe aesthetic features only for genomic browser display, such as thickStart, thickEnd, and itemRgb. A tool such as bedtools23 that mainly operates on genomic intervals would incorrectly succeed on an expected fail BED9 test case.
File converters
Some tools convert the BED format to a different file format, without performing any validation. Some file converters use a garbage-in-garbage-out approach, going from invalid input in BED format to invalid output in some other format. For example, bioconvert bed2wiggle24 fails as expected on most expected fail test cases, but still produces output retaining the input file errors. Using a garbage-in-garbage-out approach may make debugging complex pipelines more difficult. Raising warnings during file conversion helps debugging, as the user can narrow down the source of the error to steps before file conversion.
Tools that use another library for BED parsing
Some tools call an external library to perform operations on BED files. If the main tool does not perform extra error checking of its own, it can only detect the same errors that the external library finds. For example, intervene25 uses bedtools as a dependency, which results in their similar patterns of performance.
2.4 Ambiguous format specification makes uniform behavior more difficult
The previous absence of a formal specification for the BED format also influenced test performance. Our formal specification and the behavior of the reference implementation bedToBigBed conflict with the expectations of tool developers in many ways.
Definition of whitespace
Many BED files use tabs to delimit fields. The BED format, however, also accepts spaces to delimit fields, if the fields themselves contain no spaces20. Of the 99 tools examined, 60 reject space-delimited BED files allowed by the specification (Table 1, “other-fully_space_delimited.bed”). Also, the BED format permits blank lines, though 37 tools do not accept them (Table 1, “other-space_between_lines.bed”).
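The difference often comes down to how a parser tokenizes each line. A small sketch, assuming the simple string-splitting behavior many parsers rely on:

```python
line_tabs = "chr1\t250000\t250500"
line_spaces = "chr1 250000 250500"   # allowed when no field itself contains a space

print(line_spaces.split("\t"))  # tab-only splitting: ['chr1 250000 250500'], one "field"
print(line_tabs.split())        # any-whitespace splitting: ['chr1', '250000', '250500']
print(line_spaces.split())      # ['chr1', '250000', '250500']
# Caveat: split() would also break a tab-delimited name field that contains spaces.
```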
Expanded definition of fields
The BED format imposes strict limits on certain fields, and some generators do not respect these limits. For example, the specification defines score as an integer value between 0 and 1000, inclusive. Some tools use the score as a p-value, which violates the integer definition. To allow tools to repurpose the nine optional fields, one can treat these tools as BEDn+m parsers, with custom definitions for the remaining fields. Nonetheless, repurposing field names, such as score, with different definitions can confuse parsers that will misinterpret the data and use it incorrectly.
Conflict between our formal specification and bedToBigBed
We used the de facto file validator bedToBigBed to inform the design of our test suite. Without a formal specification, however, uncertainty surrounding specific edge cases arose when bedToBigBed disagreed with our understanding of correct behavior.
Our formal specification disagreed with bedToBigBed in three instances. First, bedToBigBed accepted a BED7 file with thickStart less than chromStart. Second, bedToBigBed accepted a BED12 file with the length of the blockSizes or blockStarts list greater than blockCount. Third, bedToBigBed accepted BED11 files while our specification disallowed BED11.
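Once the specification states these constraints explicitly, the first two become simple checks. A sketch, assuming the relevant fields have already been parsed into integers and lists:

```python
def check_thick_and_blocks(rec):
    """Check constraints that the specification states but bedToBigBed did not enforce."""
    problems = []
    if rec["thickStart"] < rec["chromStart"]:
        problems.append("thickStart must not precede chromStart")
    if len(rec["blockSizes"]) != rec["blockCount"] or len(rec["blockStarts"]) != rec["blockCount"]:
        problems.append("blockSizes and blockStarts must each list blockCount entries")
    return problems

example = {
    "chromStart": 250000, "thickStart": 249000,              # thickStart precedes chromStart
    "blockCount": 2,
    "blockSizes": [100, 200, 300], "blockStarts": [0, 300],  # blockSizes list too long
}
print(check_thick_and_blocks(example))
```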
2.5 Software engineering deficiencies lead to poor performance on the test suite
Beyond differences in tool design and the previously informal specification of the file format, we can also attribute poor test performance to problems in software engineering.
Silently accepting invalid input
Tools should alert users on input errors, allowing them to check whether they have made an error. In some cases, developers prefer to skip an invalid data point and continue. In this case, the tool should at least provide a warning message describing the skipped line. Otherwise, an error could slip past the user and affect their results. In our test suite, a warning message counts as rejecting the input, so tools that warn about invalid lines perform better on the expected fail test cases.
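When skipping invalid lines is the preferred behavior, emitting a warning per skipped line keeps the problem visible. A minimal sketch of this lenient-but-noisy approach:

```python
import sys

def read_bed_leniently(path):
    """Yield (chrom, start, end), warning rather than staying silent on skipped lines."""
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
                print(f"warning: skipping malformed line {line_number}: {line.rstrip()}",
                      file=sys.stderr)
                continue
            yield fields[0], int(fields[1]), int(fields[2])
```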
Errors in BED file generators can easily slip past users. When a downstream tool raises an error on bad input, this reduces the time before someone discovers the problem with the upstream generator.
Insufficient testing
While some of our test cases cover formatting issues that can hinder interoperability, others represent “can’t happen” scenarios that, uncaught, pose logic bombs for a software tool. For example, all tools should reject negative start positions (Figure 2, “02-negative-start.bed”), but 48/99 tools accepted a test case that has negative starts. Given the limited resources and incentives to publish in academic software engineering, developers need a simpler way to avoid obvious problems than manually developing test cases.
2.6 No relationship between package performance and downloads found
We observed little correlation between a package’s number of downloads and its performance on the test suite (Figure 3). Many packages have a similar number of downloads, which we attribute to packages having specific purposes that make them useful to a few users. However, very highly downloaded packages such as bedtools23 and the UCSC Genome Browser tool suite83 have better performance than other tools.
2.7 Automated fuzzing can detect errors that a manually designed test suite does not
Differential testing107 using files generated from a grammar-based fuzzer108 can discover new errors not found by the test suite. A grammar-based fuzzer automatically generates files based on a defined structure of the file format.
We found one example of unexpected behavior in bedtools coverage23 where coverage raised an error but bedToBigBed did not. Since bedtools coverage requires two input files, we generated two files using the fuzzer (Table 2) and validated them using bedToBigBed. On the generated files, bedtools coverage exited with exit status 1 and error message “Error: line number 1 of file 2.bed has 4 fields, but 0 were expected.” Our manually designed test suite did not catch this error—we only uncovered it due to the use of fuzzing.
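A sketch of the differential-testing loop follows, assuming two command-line tools whose invocations (placeholders here, not the exact commands we ran) signal rejection through a non-zero exit status:

```python
import subprocess

# Placeholder commands: substitute the real invocations of the two tools under
# comparison, for example a reference validator and the tool being tested.
COMMAND_A = ["validator_a", "{bed}"]
COMMAND_B = ["validator_b", "{bed}"]

def accepts(command_template, bed_path):
    command = [arg.format(bed=bed_path) for arg in command_template]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0

def differential_test(generated_files):
    """Report fuzzer-generated files on which the two tools disagree."""
    for bed_path in generated_files:
        verdict_a = accepts(COMMAND_A, bed_path)
        verdict_b = accepts(COMMAND_B, bed_path)
        if verdict_a != verdict_b:
            print(f"disagreement on {bed_path}: "
                  f"A {'accepts' if verdict_a else 'rejects'}, "
                  f"B {'accepts' if verdict_b else 'rejects'}")
```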
2.8 BED badge indicates conformance with the BED format
We designed badges that developers can display in a tool’s documentation to clearly indicate the file types used and indicate the tool’s performance on the test suite (Figure 4). The badges reassure users that the software underwent thorough testing. The availability of such badges encourages developers to perform input validation.
Acidbio includes steps to produce a BED badge. We recommend that developers display a BED badge if their software conforms to the BED formal specification.
3 Methods
3.1 The Acidbio test system
We developed the Acidbio test system, which automatically runs a number of bioinformatics tools on a test suite (Figure 5). To determine an actual success or failure, we consider the exit status and outputs to standard output and error. We deem that a tool accepts the input when it returns a successful exit status and prints no error or warning messages.
We identified error and warning messages by manually running the tools. We had to identify these error and warning messages manually because some tools logged errors without returning a non-zero exit code or logged issues in the BED file through warnings instead of errors.
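A sketch of that decision, assuming a tool-specific list of error and warning phrases collected through the manual inspection described above:

```python
import subprocess

# Phrases identified by manual inspection; this list is illustrative only.
ALERT_PHRASES = ["error", "warning", "invalid"]

def tool_accepts(command):
    """A tool accepts the input only if it exits successfully and prints no
    error or warning message to standard output or standard error."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return False
    combined = (result.stdout + result.stderr).lower()
    return not any(phrase in combined for phrase in ALERT_PHRASES)
```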
To provide Acidbio with details on how to run each tool, we created a YAML Ain’t Markup Language (YAML) configuration file that stored each tool’s command-line usage (Figure 6). The YAML file also stored the locations of the additional files needed to run each tool and each tool’s Conda environment.
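A hypothetical entry in the spirit of such a configuration appears below; the key names are illustrative assumptions rather than Acidbio’s exact schema, and the tool name and paths are made up.

```yaml
# Illustrative only: consult the Acidbio documentation for the real schema.
mytool:
  conda-environment: mytool-env
  command: "mytool --regions {bed} --reference {fasta} --out {out}"
  secondary-files:
    fasta: data/hg38.fa
```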
3.2 Tool discovery
To identify tools to test, we used Bioconda10, a repository that contains thousands of bioinformatics software packages. Each package contains one or more tools. We only included Bioconda packages with tools that have a command-line interface, as opposed to add-on modules executed within another program, and use the BED format as input. This excluded the numerous R, Bioconductor12, and Perl packages that have no command-line interface.
For packages that contain multiple tools, we selected a smaller set of subtools to test. We identified candidate packages by manually examining the documentation of over 1000 packages to determine whether each matched our criteria. We had to examine documentation manually because Bioconda has no structured metadata on each package’s input file formats. This process yielded 80 packages, with 99 tools total.
Some tools use the BED format as the primary input file, such as a mandatory argument. Examples include bedtools23 and high-throughput sequencing toolkits such as ngs-bits98. These tools generally perform calculations using the intervals found in the BED file.
Other tools use the BED format as a secondary input file, such as an optional argument. Tools that use BED as a secondary input file generally use it to define genomic intervals of interest for data in another file format, such as SAM. Of the packages we tested, 60 used the BED format as the primary input file, and 20 used it as a secondary input file.
After collecting a list of all the possible packages that we could test, we then attempted to install each package and run the tools. We excluded packages that we could not install or could not run without error on any input files. We found no cases where a package contained both working and broken tools.
3.3 Test suite
We created a test suite that contains tests for each BEDn format, covering various edge cases drawn from our BED specification. The test suite contains both expected pass test cases (Table 3) and expected fail test cases (Table 4). Tests include validating ranges for numeric fields, character sets for alphanumeric fields, and data formatting for fields such as itemRgb or the block definitions.
We manually generated the test cases, designing them to make sense for all the tools tested. We used genomic intervals between positions 250000 and 260000 since one might find them in both chromosomes and non-chromosome scaffolds. Each test case varies based on the criteria tested. Some criteria only require a deviation in one field in one feature to generate a test case. For example, to test a score greater than 1000, only a single feature had a score greater than 1000. Other criteria required deviation in multiple features to generate a test case. For example, to test that the parser accepts strand “.”, we set all features to strand “.”.
We built tests upon one another: we repeated a test case for all BED variants with additional fields added. As an example, a test case in BED5 testing a negative score gets repeated in testing the BED6 through BED12 variants.
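A sketch of how one such case could be propagated across variants, using placeholder values for the added fields (the real test files are hand-crafted):

```python
# Propagate a single negative-score test case from BED5 through BED12.
BASE = ["chr1", "250000", "250500", "feature1", "-5"]      # BED5 with an invalid score
EXTRA = ["+", "250000", "250500", "0", "1", "500", "0"]    # strand through blockStarts

def variant(n_fields):
    assert 5 <= n_fields <= 12
    return "\t".join((BASE + EXTRA)[:n_fields])

for n in (5, 6, 9, 12):   # skip BED10 and BED11, which the specification prohibits
    print(f"BED{n}:", variant(n))
```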
For tools that use BED as a secondary file format, we collected test files for their non-BED primary file formats. For each of these file formats, we sourced an example file from the creators of the format or from a repository such as a FASTA for GRCh38/hg38109 from the UCSC Genome Browser (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/). We edited non-BED files to ensure that their ranges matched the BED test cases. We also validated the collected non-BED files with a file validator, when possible.
Since the new formal BED specification prohibits BED10 and BED11, we considered all BED10 and BED11 tests expected fail, even if the test case fell under expected success for other BED variants.
3.4 Fuzzing
We used a fuzzing approach110 to automatically generate test cases beyond our manually designed test suite (Figure 7). We created an ANother Tool for Language Recognition 4 (ANTLR4) grammar111 to define the structure of the BED format and the possible values for each field. Then, we used a file generator that builds a file based on our grammar. We tested the tools using grammar-based fuzzing, with grammarinator112 as the file generator.
To introduce further variation into the BED file, we created an ANTLR4 meta-grammar that defines possible ANTLR4 BED grammars. The meta-grammar produces variation by allowing the BED grammar to vary on the structure or definition of fields. For example, the meta-grammar may produce a BED grammar that only allows tabs as the whitespace, or it may produce a BED grammar that allows both tabs and spaces. By varying the BED grammar produced, the user can test different combinations of field definitions and BED file structure that a single BED grammar cannot achieve.
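The real grammars are written in ANTLR4; as a drastically simplified illustration of the idea only, the toy generator below expands a hand-written set of production rules into random BED3 lines.

```python
import random

# Toy grammar-based generator: a stand-in for the ANTLR4 grammar plus
# grammarinator pipeline, expanding production rules at random.
RULES = {
    "line":  [["chrom", "sep", "start", "sep", "end"]],
    "chrom": [["chr1"], ["chr2"], ["chrX"]],
    "sep":   [["\t"]],            # a meta-grammar could also permit a space here
    "start": [["250000"], ["255000"]],
    "end":   [["259000"], ["260000"]],
}

def expand(symbol):
    if symbol not in RULES:       # terminal symbol: emit as-is
        return symbol
    return "".join(expand(part) for part in random.choice(RULES[symbol]))

for _ in range(3):
    print(expand("line"))
```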
3.5 Availability
Acidbio and the BED test suite are available at GitHub (https://github.com/hoffmangroup/acidbio) and deposited in Zenodo (https://doi.org/10.5281/zenodo.5784763). Results of each package on each test case and the scripts used to generate figures are available at Zenodo (https://doi.org/10.5281/zenodo.5784787).
4 Discussion
4.1 Use in software development
Acidbio can help researchers and programmers test their tools to improve the robustness and interoperability of their code. Acidbio can serve a similar function to the Web Standards Project Acid test suite15 designed to improve interoperability of web browsers. When the Web Standards Project created the Acid tests, many web browsers had poor compliance with existing web standards. Over time, browsers such as Opera113 and Internet Explorer114 began to achieve perfect performance on the Acid tests and interoperability improved. Similarly, we intend Acidbio to make it easier for developers to create bioinformatics software that more easily interoperates with other software.
To test new tools, developers need only create a short configuration YAML file to describe their tool’s command line interface, and run the Acidbio test harness. From the test results, a programmer may identify edge cases they missed and fix them before distributing their software. Once fixed, the programmer can put a BED badge in a software’s documentation to indicate that it interoperates with the BED format. Editors or reviewers of papers describing tools can use the test suite to verify the software’s quality. Package repository managers can also use the test suite to verify the quality of submitted packages.
4.2 The utility of a formal specification
The interpretation of a standard can turn into a matter of opinion. While formalizing the standard with a specification can help improve interoperability, the only way to truly ensure agreement on expected behavior involves further formalization through a formal grammar or including test cases in the standard. A deterministic grammar or test suite removes potential for misunderstandings about standard conformance.
4.3 Postel’s law
Postel’s law, “be conservative in what you do, be liberal in what you accept from others”115, originally referred to how software sends and accepts messages over the internet. Adherence to Postel’s law helped the internet to succeed: leniency in accepting data without strict validation helped more organizations implement internet software116.
As seen here, many software tools have taken a liberal approach to accepting BED files. This seemingly increases the utility of these tools. By liberally accepting input, however, tools encourage BED producers to take a lackadaisical approach to correctness and interoperability, which leaves the format open to misuse117. Programmers may unwittingly create software that generates incorrect BED files if they only supply their output to downstream consumers with a liberal approach to validation. This results in technical debt, where problems lie undiscovered until after the developers complete the project, or years later, when they become much harder to fix.
The developers of the Extensible Markup Language (XML) format purposefully rejected Postel’s law, deciding that malformed XML files would raise fatal errors118. They did this because this approach encourages producers of the file format to strongly conform to the specification. A strict validation approach reduces opportunities for parsers to misunderstand input and prevents common errors from becoming accepted.
The lack of a strict validation approach for previous HyperText Markup Language (HTML) implementations led to a morass of incompatible and poorly described HTML file formats. This greatly increased the complexity of, and the potential for bugs in, web browsers that could actually handle the existing base of web pages. Despite the existence of formal HTML specifications, web browsers had to create special “quirks modes” to handle HTML files that did not satisfy these specifications119.
The history of HTML and XML should inform file validation behavior in bioinformatics software. While one may not want to raise fatal errors for each non-conforming file, BED parsers must at least provide warnings when encountering them. Users can easily ignore warnings, however, or miss them in a stream of irrelevant and voluminous diagnostic information. To ensure that users notice problems with file formats and that programmers fix upstream generators, parsers must take a strict validation “warnings are errors” approach and refuse to parse invalid files.
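In code, the strict approach amounts to raising an error where the lenient reader sketched earlier would merely warn. A minimal sketch:

```python
class BedFormatError(ValueError):
    """Raised on the first invalid line instead of printing a warning."""

def read_bed_strictly(path):
    records = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
                raise BedFormatError(f"invalid BED line {line_number}: {line.rstrip()}")
            records.append((fields[0], int(fields[1]), int(fields[2])))
    return records
```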
4.4 Application to other bioinformatics file formats
Users and developers can apply the same methodology developed here to test other bioinformatics file formats for conformance. Establishing a common interface to parse a file format will improve interoperability of bioinformatics software and move closer to FAIR6 goals. For binary file formats or software written in languages with weak memory safety, testing and interoperability become even more important.
Computational tools described in scholarly papers often undergo precious little testing. The existence of test systems such as Acidbio makes it easy to test whether a tool interoperates well with other software. We recommend that when such a test system exists, journal editors, reviewers, and software repository managers ensure that the tool achieves good performance on the test suite prior to acceptance. After acceptance, managers can indicate which file formats the package uses as input and output to make searching for tools easier. Developers can also add badges similar to the BED badge to indicate software’s conformance to the relevant specification.
4.5 BED metadata
Tools parse BED files in the absence of in-band information embedded within the file. The lack of in-band information may lead to difficulties parsing BED files. For example, a tool cannot determine whether a BED file has custom fields without in-band information. With such metadata, tools can easily determine whether the input file has the fields it needs.
A header section at the beginning of a BED file can provide metadata to make parsing of BED files easier. The header can define the file’s BED variant and specify information such as the genome assembly used. For custom BEDn+m files, the header can define the custom fields, similar to the INFO lines in the VCF meta-information section. Such a header would supply file metadata directly within the file, allowing parsers to read BED files more easily. Future versions of the GA4GH BED specification may add such metadata.
4.6 Limitations of the testing approach
Our testing approach applies the same BED files and secondary files to all the tools, except tools that use BAM input. Given the diversity of tools that use the BAM format, we could not find a single BAM file with data relevant to all tools. Instead, we used two different BAM files to avoid tools raising logical errors on our test cases.
Our testing approach only considers whether a BED parser accepts valid input and rejects invalid input. It does not consider correctness of the output. Developers can validate output file format using a file validation tool. For BED files, one can use bedToBigBed18 for file validation, keeping in mind the edge cases discussed above where its behavior differs from the GA4GH BED specification. Testing for correctness of analyses represents a much more difficult problem that one cannot trivially address.
The fuzzing approach also has some limitations. The quality of the generated test cases relies on the file generator to cover a wide range of possible BED files. For a grammar-based fuzzing approach, the grammar would have to describe all possible variations in a file, which becomes difficult for more complex file formats. Another potential issue with file generation arises if the generator has too few methods to vary its output files, generating files that do not cover enough cases. Machine learning or other approaches that inform future file generation from past unexpected behavior can address this issue120.
Other fuzzing approaches, such as mutation-based fuzzing, may not work in a bioinformatics context. Mutation-based fuzzers randomly modify existing files by adding random or nonsense characters. These fuzzers would not create diverse BED files, and the mutations would likely create invalid and meaningless BED files. Such malformed inputs can, however, expose security vulnerabilities in a parser. A security-oriented fuzzer such as American Fuzzy Lop121 can detect these vulnerabilities. Security-oriented fuzzers will produce test cases that can have nonsense data such as non-ASCII characters, which tests the tool’s ability to handle unexpected data.
Competing interests
The authors declare no competing interests.
Author contributions
Conceptualization, D.D. and M.M.H.; Data curation, Y.N.; Formal analysis, Y.N.; Funding acquisition, M.M.H.; Investigation, Y.N., D.D., and E.G.R.; Methodology, Y.N. and M.M.H.; Project administration, M.M.H.; Resources, M.M.H.; Software, Y.N.; Supervision, D.D., E.G.R., and M.M.H.; Validation, Y.N.; Visualization, Y.N.; Writing — original draft, Y.N.; Writing — review & editing, Y.N., D.D., E.G.R., and M.M.H.
Acknowledgments
We thank Carl Virtanen (0000-0002-2174-846X) and Zhibin Lu (0000-0001-6281-1413) at the University Health Network High-Performance Computing Centre and Bioinformatics Core for technical assistance, Michael Hicks (University of Maryland, College Park; 0000-0002-2759-9223) and Leonidas Lampropoulos (University of Maryland, College Park; 0000-0003-0269-9815) for helpful discussions, and W. James Kent and the UCSC Genome Browser team for creating the BED format. This work was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H.).