Abstract
Reproducing experiments is vital to science. Being able to replicate, validate and extend previous work also speeds new research projects. Reproducing computational biology experiments, which are scripted, should be straightforward. But reproducing such work remains challenging and time consuming. In the ideal world we would be able to quickly and easily rewind to the precise computing environment where results were generated. We would then be able to reproduce the original analysis or perform new analyses. We introduce a process termed “continuous analysis” which provides inherent reproducibility to computational research at a minimal cost to the researcher. Continuous analysis combines Docker, a container service similar to virtual machines, with continuous integration, a popular software development technique, to automatically re-run computational analysis whenever relevant changes are made to the source code. This allows results to be reproduced quickly, accurately and without needing to contact the original authors. Continuous analysis also provides an audit trail for analyses that use data with sharing restrictions. This allows reviewers, editors, and readers to verify reproducibility without manually downloading and rerunning any code. Example configurations are available at our online repository (https://github.com/greenelab/continuous_analysis).
The Current State of Reproducibility
Leading scientific journals have targeted reproducibility to increase readers’ confidence in results and reduce retractions1–5. In a recent survey, 90% of researchers acknowledged a reproducibility crisis6. Research that uses computational protocols should be particularly amenable to reproducible workflows because all of the steps are scripted into a machine-readable format. But written descriptions of computational approaches can be difficult to understand and may lack required parameters. Even when results can be reproduced, the process often requires a substantial time investment and help from the original authors. Garijo et al.7 estimated it would take 280 hours for a non-expert to reproduce a paper describing a computational construction of a drug-target network for Mycobacterium tuberculosis8. These are the good scenarios: the results behind most computational papers are not readily reproducible7,9–11.
The practice of “open science” has been discussed as a means to aid reproducibility3,12. In open science the data and source code are shared. Sharing can also extend to intermediate results and project planning13. Sharing data and source code is currently necessary but not sufficient to make research reproducible. Even when code and data are shared, it remains difficult to reproduce results due to differing computing environments, operating systems, library dependencies etc. It is common to use one or more open source libraries on a project, and research code quickly becomes dependent on old versions of these libraries as software advances14. These old or broken dependencies make it difficult for readers and reviewers to recreate the environment of the original researchers, whether to validate or extend their work.
An example of where sharing data does not automatically make science reproducible occurs in the most standard of places: differential gene expression analysis. Such analyses are routine. Our understanding of the genome, including transcriptome annotations, have improved and updated probe set definitions are available15. Analyses relying on unspecified probe set definitions cannot be reproduced using current definitions.
We analyzed the fifteen most recently published papers that cite Dai et al., a common source for custom chip description files (CustomCDF), that were accessible at our institution16–31. We identified these manuscripts using Web of Science on May 31, 2016. We recorded the number of papers that cited a version of CustomCDF, as well as which version was cited. We expect this analysis to provide an upper bound on reproducible work: these papers specifically cited the source of their CDFs. Of these fifteen papers, nine (60%) specified which version they used. These nine used versions 11, 15, 16, 17, 18, and 19 of the BrainArray CustomCDF.
This initial analysis was performed based on article recency without regard to article impact. To determine the extent to which this issue affects high impact papers, we performed a parallel evaluation for the ten most cited papers32–41 that cite Dai et al. We determined the ten most cited papers using Web of Science on May 31, 2016. Of these ten papers, one38 (10%) specified which version of the CustomCDF was used. That paper used version 11 of the BrainArray CustomCDF.
We sought to determine which versions were currently in use in the field. We asked three individuals who performed microarray analysis recently, and we accessed and evaluated two cluster systems used for processing data. We found that each individual had one of the three most recently released versions installed (18, 19, and 20), and versions 18 and 19 were installed on cluster systems.
To evaluate the impact of differing CDF versions, we downloaded a recently published public gene expression dataset (GEO Series Ascension number GSE47664). This experiment examined differential expression between normal HeLa cells and HeLa cells with TIA1 and TIAR knocked down42. We performed a parallel analysis using each of the three versions that we found installed on machines that we could access (18, 19, and 20). Each version identifies a different number of significantly altered genes (Figure 1A), demonstrating the challenge of reproducible analysis. We simulated a parallel analysis of differential expression using Docker containers on mismatched machines43. This specifies the CDF version and produces the same number and set of differentially expressed genes for a given version across machines (v18 example in Figure 1B). Had continuous analysis been used for papers citing the BrainArray CustomCDF their computational results would be easily replicated.
Continuous Analysis in Computational Workflows
We developed continuous analysis to produce a verifiable end-to-end run of computational research with minimal start-up costs. In contrast with the status quo, continuous analysis preserves the computing environment and maintains the versions of dependencies. We described the benefits of containerized approaches above, but maintaining, running and distributing Docker images manually would become time consuming. Integrating Docker into a continuous scientific analysis pipeline meets three criteria: (1) anyone can re-run code in a computing environment matching the original authors (Supplemental Figure 1); (2) readers and reviewers can follow exactly what was done in an “audit” fashion without running code (Supplemental Figure 2 & 3); and (3) the solution imposes zero to minimal cost in terms of time and money on the researcher, depending on their current research process.
Continuous analysis extends continuous integration44, a common practice in software development and deployment. Continuous integration is a software development workflow that triggers an automated build process whenever developers check their code in to a source control repository. This automated build process runs test scripts if they exist. These tests can catch bugs introduced into software. Software that passes tests is automatically deployed to remote servers.
For continuous analysis (Figure 2), we repurpose these services in order to run computational analyses, update figures, and publish changes to online repositories whenever relevant changes are made to the source code. When an author is ready to release code or publish their work they can export the most recent continuous integration run. Because this process generates results in a clean and clearly defined computing environment without manual intervention, reviewers can be confident that the analyses are reproducible.
In each project we maintain dependencies with the free open-source software tool Docker45. Docker defines an “image” that allows users to download and run a container, a minimalist virtual machine with a predefined computing environment. Docker images can be several gigabytes in size, but once downloaded can be started in a matter of seconds and has minimal overhead14. In addition, Docker images can be easily tagged to coincide with software releases and paper revisions. At the time of submission, authors can run the `docker savè command to export a static file that can be uploaded to services such as Figshare or Zenodo to receive a DOI. For example, we have uploaded our continuous analysis environment for the examples in this paper46.
To set up continuous analysis, a researcher needs to do three things. First they must create a Dockerfile, which specifies a list of dependencies. Second, they need to connect a continuous integration service to their version control system and provide the commands to run their analysis. Finally, they need to commit and push changes to their version control system. Many researchers already perform the first and third tasks in their standard workflow.
The continuous integration system will automatically rerun the specified analysis with each change, precisely matching the source code and results. It can also be set to listen and run only when changes are marked in a specific way, e.g. by committing to a specific ‘staging’ branch. For the first project, this process can be put into place in less than a day. For subsequent projects, this can be done in under an hour.
Setting up Continuous Analysis
We have created a GitHub repository with instructions for paid, local, and cloud-based continuous analysis setups47. These are fully detailed in the supplementary materials and online repository. Here we describe how continuous analysis can be setup using the free and open source Drone software on a researcher’s personal computer and connected to the GitHub version control service. This setup is free to users.
Install Docker on the computer.
Pull the Drone image via docker:
sudo docker pull drone/drone:latest
Create a new application in GitHub (Figure 3).
Add a webhook to the GitHub project (Figure 4). This will notify the continuous integration server of any updates pushed to the repository.
Create a configuration file on the Drone computer at /etc/drone/dronerc filling in the client information provided by GitHub
REMOTE_DRIVER=github
REMOTE_CONFIG=https://github.com?client_id=….&client_secret=….
Run the drone container
sudo docker run drone/drone:latest
Continuous analysis can be performed with dozens of full service providers or a private installation on a local machine, cluster or cloud service47. Full service providers can be set up in minutes but may have computational resource limits or monthly fees. Private installations require configuration but can scale to a local cluster or cloud service to match the computational complexity of all walks of research. With free, open-source continuous integration software48, computing resources are the only associated costs.
Using Continuous Analysis
After setup, running continuous analysis is simple and fits into existing research workflows that use source control systems. We have used continuous analysis in our own work49. We have also prepared three example repositories (detailed in supplemental materials):
An example demonstrating the setup of continuous analysis with a wide variety of services and configurations (highlighted below).
An easy to follow basic phylogeny tree building example, combining sequence alignment using MAFFT50, format conversion using EMBOSS Seqret51, and tree calculation and drawing using.
An RNA expression analysis workflow examining organoid models of pancreatic cancer in mice based on work from Boj et al.52 using details and source code published by Balli53. This example shows the ability of continuous analysis to scale to large computations. This example uses kallisto54, limma55,56, and sleuth57 to analyze 150GB of gene expression data and approximately 480 million reads.
To demonstrate the setup process and different configurations of continuous analysis we show a simple example of continuous analysis with kallisto. The recently published software tool kallisto quantifies transcript abundance in RNA-seq data. Our example re-runs the examples provided in kallisto with each commit to a repository.
Add a script file to re-run custom analysis. For Drone, this is a .drone.yml file that specifies commands to run each step of the analysis. An example configuration is available in the continuous analysis GitHub repository as well as the supplemental materials.
Commit changes to the source control repository.
Push changes to GitHub.
The configured continuous integration service automatically runs the specified script. We configured this to rerun the analysis, regenerate the figures, and commit updated versions to the repository. The service provides a complete audit log of what was run in the clean continuous integration environment (Figure 5). By generating and pushing updated figures, this process also generates a complete change log for each result (Figure 6). Interactive development tools, such as Jupyter58,59, RMarkdown60,61 and Sweave62 can be incorporated to present the code and analysis in a logical graphical manner. For example, we recently used Jupyter with continuous analysis in our own publication63 and corresponding repository49.
In summary, continuous analysis provides the results of a verifiable end-toend run in a “clean” environment. Because continuous analysis runs automatically in the background, no transition is needed between the exploration and publication phases of a scientific project. The audit trail provided by continuous analysis allows reviewers and editors to provide sound judgment on reproducibility without a large time commitment. If readers or reviewers would like to re-run the code on their own (e.g. to change a parameter and evaluate the impact on results), they can easily do so with the Docker container containing the final computing environment and intermediate results. Version control systems provide the capability to watch for updates. Readers can “star” or “watch” a repository on services such as Github, Gitlab, and Bitbucket to be automatically notified of changes and updated runs. Wide adoption of these systems throughout the publication process could allow reviewers and editors to automatically be notified of updated results.
Continuous analysis provides an audit trail for reproducible analyses of closed data
Continuous analysis can be even more powerful when working with closed data that cannot be released. Without continuous analysis, reproducing computational analyses based on closed data is dependent on the original authors completely and exactly describing each step, a process that may be an afterthought and relegated to extended methods. Readers must then diligently follow complex written instructions without intermediate confirmation they are on the right track. The containers produced during continuous analysis include a matching environment for replication as well as intermediate results. This allows readers to determine where their results diverge from the original work and to determine whether divergence is due to software-based or data-based differences.
Best practices with continuous analysis
We suggest a development workflow where continuous analysis runs only on a single branch (Supplemental Figure 4). Researchers can push to this branch when they believe they are ready for a release to avoid running the full process during incomplete updates. If the updates to this branch succeed, the changes are then automatically carried over to the master or production branch and released. We recommend exporting both the before and after processing Docker images and uploading to an archival service like Figshare or Zenodo. The archived images can then be cited to guide readers to the version used in the manuscript46. For convenience, the images can also be shared through the Docker Hub registry.
It may currently be impractical to use continuous analysis for generic preprocessing steps involving very large data or analyses requiring particularly high computational costs. In particular, steps that take days to run or incur substantial costs in computational resources may not be amenable with existing providers64. One day, continuous analysis systems specifically designed for scientific workflows may facilitate reproducible workflows in these settings. For now, researchers may need to use discretion when preprocessing via continuous analysis, as it may be computationally intractable to reanalyze after each commit to a staging branch. Researchers may elect to run only the final workflow through this process, or may elect to employ continuous analysis after standard but computationally expensive preprocessing steps are completed.
For small datasets and less intensive computational workflows it is easiest to use a full service continuous integration service. These services have the smallest setup times. With private data or when data size and computational complexity scale it becomes necessary to setup a local privately hosted continuous integration server. Cluster or cloud based continuous integration servers can handle the largest workflows.
The impact of reproducible computational research
Reproducibility can have wide-reaching benefits for the advancement of science. For authors, easily reproducible work is a sign of quality and credibility. Continuous analysis addresses the reproducibility of computationally analyses in the narrow sense: generating the same results from the same inputs. It does not solve reproducibility in the broader sense: how robust results are to parameter settings, starting conditions and partitions in the data. Continuous analysis lays the groundwork needed to address reproducibility and robustness of findings in the broad sense.
Acknowledgements
This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to CSG (GBMF 4552) and supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health. We would like to thank David Balli for providing the RNA-seq analysis design, Katie Siewert for providing the phylogenetic analysis design, and Alex Whan for contributing a Travis-CI implementation.