Reproducible Computational Workflows with Continuous Analysis

Reproducing experiments is vital to science. Being able to replicate, validate and extend previous work also speeds new research projects. Reproducing computational biology experiments, which are scripted, should be straightforward. But reproducing such work remains challenging and time consuming. In the ideal world we would be able to quickly and easily rewind to the precise computing environment where results were generated. We would then be able to reproduce the original analysis or perform new analyses. We introduce a process termed “continuous analysis” which provides inherent reproducibility to computational research at a minimal cost to the researcher. Continuous analysis combines Docker, a container service similar to virtual machines, with continuous integration, a popular software development technique, to automatically re-run computational analysis whenever relevant changes are made to the source code. This allows results to be reproduced quickly, accurately and without needing to contact the original authors. Continuous analysis also provides an audit trail for analyses that use data with sharing restrictions. This allows reviewers, editors, and readers to verify reproducibility without manually downloading and rerunning any code. Example configurations are available at our online repository (https://github.com/greenelab/continuous_analysis).

reviewers to recreate the environment of the original researchers, whether to validate or extend their work.
An example of where sharing data does not automatically make science reproducible occurs in the most standard of places: differential gene expression analysis. Such analyses are routine. Our understanding of the genome, including transcriptome annotations, have improved and updated probe set definitions are available 15 . Analyses relying on unspecified probe set definitions cannot be reproduced using current definitions.
We analyzed the fifteen most recently published papers that cite Dai et al., a common source for custom chip description files (CustomCDF), that were accessible at our institution [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31] . We identified these manuscripts using Web of Science on May 31, 2016. We recorded the number of papers that cited a version of CustomCDF, as well as which version was cited. We expect this analysis to provide an upper bound on reproducible work: these papers specifically cited the source of their CDFs. Of these fifteen papers, nine (60%) specified which version they used. These nine used versions 11,15,16,17,18, and 19 of the BrainArray CustomCDF.
This initial analysis was performed based on article recency without regard to article impact. To determine the extent to which this issue affects high impact papers, we performed a parallel evaluation for the ten most cited papers 32-41 that cite Dai et al. We determined the ten most cited papers using Web of Science on May 31, 2016. Of these ten papers, one 38 (10%) specified which version of the CustomCDF was used. That paper used version 11 of the BrainArray CustomCDF.
We sought to determine which versions were currently in use in the field. We asked three individuals who performed microarray analysis recently, and we accessed and evaluated two cluster systems used for processing data. We found that each individual had one of the three most recently released versions installed (18, 19, and 20), and versions 18 and 19 were installed on cluster systems.
To evaluate the impact of differing CDF versions, we downloaded a recently published public gene expression dataset (GEO Series Ascension number GSE47664). This experiment examined differential expression between normal HeLa cells and HeLa cells with TIA1 and TIAR knocked down 42 . We performed a parallel analysis using each of the three versions that we found installed on machines that we could access (18, 19, and 20). Each version identifies a different number of significantly altered genes ( Figure 1A), demonstrating the challenge of reproducible analysis. We simulated a parallel analysis of differential expression using Docker containers on mismatched machines 43 . This specifies the CDF version and produces the same number and set of differentially expressed genes for a given version across machines (v18 example in Figure 1B). Had continuous analysis been used for papers citing the BrainArray CustomCDF their computational results would be easily replicated. The status quo requires a reader or reviewer to find and install specific versions of dependencies. These dependencies can become difficult to find and may become incompatible with newer versions of other software packages. Different versions of packages identify different numbers of significantly differentially expressed genes from the same source code and data. B.) Containers define a computing environment that captures dependencies. In container-based systems, the results are the same regardless of the host system.

Continuous Analysis in Computational Workflows
We developed continuous analysis to produce a verifiable end-to-end run of computational research with minimal start-up costs. In contrast with the status quo, continuous analysis preserves the computing environment and maintains the versions of dependencies. We described the benefits of containerized approaches above, but maintaining, running and distributing Docker images manually would become time consuming. Integrating Docker into a continuous scientific analysis pipeline meets three criteria: (1) anyone can re-run code in a computing environment matching the original authors (Supplemental  The researcher pushes code that includes a script capable of running the analyses from start to finish. The continuous integration provider runs the latest version of code in the specified Docker environment without manual intervention. This generates a Docker container with intermediate results that allows anyone to rerun analysis in the same environment, produces updated figures, and stores logs describing everything that occurred. Example configurations are available in the supplementary materials as well as our online repository (https://github.com/greenelab/continuous_analysis). Because code is run in an independent, reproducible computing environment and produces detailed logs of what was executed, this practice reduces or eliminates the need for reviewers to re-run code to verify reproducibility. minimal cost in terms of time and money on the researcher, depending on their current research process.
Continuous analysis extends continuous integration 44 , a common practice in software development and deployment. Continuous integration is a software development workflow that triggers an automated build process whenever developers check their code in to a source control repository. This automated build process runs test scripts if they exist. These tests can catch bugs introduced into software. Software that passes tests is automatically deployed to remote servers.
For continuous analysis (Figure 2), we repurpose these services in order to run computational analyses, update figures, and publish changes to online repositories whenever relevant changes are made to the source code. When an author is ready to release code or publish their work they can export the most recent continuous integration run. Because this process generates results in a clean and clearly defined computing environment without manual intervention, reviewers can be confident that the analyses are reproducible.
In each project we maintain dependencies with the free open-source software tool Docker 45 . Docker defines an "image" that allows users to download and run a container, a minimalist virtual machine with a predefined computing environment. Docker images can be several gigabytes in size, but once downloaded can be started in a matter of seconds and has minimal overhead 14 . In addition, Docker images can be easily tagged to coincide with software releases and paper revisions. At the time of submission, authors can run the `docker save` command to export a static file that can be uploaded to services such as Figshare or Zenodo to receive a DOI. For example, we have uploaded our continuous analysis environment for the examples in this paper 46 .
To set up continuous analysis, a researcher needs to do three things. First they must create a Dockerfile, which specifies a list of dependencies. Second, they need to connect a continuous integration service to their version control system and provide the commands to run their analysis. Finally, they need to commit and push changes to their version control system. Many researchers already perform the first and third tasks in their standard workflow. The continuous integration system will automatically rerun the specified analysis with each change, precisely matching the source code and results. It can also be set to listen and run only when changes are marked in a specific way, e.g. by committing to a specific 'staging' branch. For the first project, this process can be put into place in less than a day. For subsequent projects, this can be done in under an hour.

Setting up Continuous Analysis
We have created a GitHub repository with instructions for paid, local, and cloud-based continuous analysis setups 47 . These are fully detailed in the supplementary materials and online repository. Here we describe how continuous analysis can be setup using the free and open source Drone software on a researcher's personal computer and connected to the GitHub version control service. This setup is free to users.  Continuous analysis can be performed with dozens of full service providers or a private installation on a local machine, cluster or cloud service 47 . Full service providers can be set up in minutes but may have computational resource limits or monthly fees. Private installations require configuration but can scale to a local cluster or cloud service to match the computational complexity of all walks of research. With free, open-source continuous integration software 48 , computing resources are the only associated costs.

Using Continuous Analysis
After setup, running continuous analysis is simple and fits into existing research workflows that use source control systems. We have used continuous analysis in our own work 49 . We have also prepared three example repositories (detailed in supplemental materials): 1. An example demonstrating the setup of continuous analysis with a wide variety of services and configurations (highlighted below). 2. An easy to follow basic phylogeny tree building example, combining sequence alignment using MAFFT 50 , format conversion using EMBOSS Seqret 51 , and tree calculation and drawing using. 3. An RNA expression analysis workflow examining organoid models of pancreatic cancer in mice based on work from Boj et al. 52 using details and source code published by Balli 53 . This example shows the ability of continuous analysis to scale to large computations. This example uses kallisto 54 , limma 55,56 , and sleuth 57 to analyze 150GB of gene expression data and approximately 480 million reads.
To demonstrate the setup process and different configurations of continuous analysis we show a simple example of continuous analysis with kallisto. The recently published software tool kallisto quantifies transcript abundance in RNAseq data. Our example re-runs the examples provided in kallisto with each commit to a repository.
1. Add a script file to re-run custom analysis. For Drone, this is a .drone.yml file that specifies commands to run each step of the analysis. An example configuration is available in the continuous analysis GitHub repository as well as the supplemental materials. 2. Commit changes to the source control repository. 3. Push changes to GitHub. The configured continuous integration service automatically runs the specified script. We configured this to rerun the analysis, regenerate the figures, and commit updated versions to the repository. The service provides a complete audit log of what was run in the clean continuous integration environment ( Figure 5). By generating and pushing updated figures, this process also generates a complete change log for each result ( Figure 6). Interactive development tools, such as Jupyter 58,59 , RMarkdown 60,61 and Sweave 62 can be incorporated to present the code and analysis in a logical graphical manner. For example, we recently used Jupyter with continuous analysis in our own publication 63 and corresponding repository 49 . In summary, continuous analysis provides the results of a verifiable end-toend run in a "clean" environment. Because continuous analysis runs automatically in the background, no transition is needed between the exploration and publication phases of a scientific project. The audit trail provided by continuous analysis allows reviewers and editors to provide sound judgment on reproducibility without a large time commitment. If readers or reviewers would like to re-run the code on their own (e.g. to change a parameter and evaluate the impact on results), they can easily do so with the Docker container containing the final computing environment and intermediate results. Version control systems provide the capability to watch for updates. Readers can "star" or "watch" a repository on services such as Github, Gitlab, and Bitbucket to be automatically notified of changes and updated runs. Wide adoption of these systems throughout the publication process could allow reviewers and editors to automatically be notified of updated results.
computational analyses based on closed data is dependent on the original authors completely and exactly describing each step, a process that may be an afterthought and relegated to extended methods. Readers must then diligently follow complex written instructions without intermediate confirmation they are on the right track. The containers produced during continuous analysis include a matching environment for replication as well as intermediate results. This allows readers to determine where their results diverge from the original work and to determine whether divergence is due to software-based or data-based differences.

Best practices with continuous analysis
We suggest a development workflow where continuous analysis runs only on a single branch (Supplemental Figure 4). Researchers can push to this branch when they believe they are ready for a release to avoid running the full process during incomplete updates. If the updates to this branch succeed, the changes are then automatically carried over to the master or production branch and released. We recommend exporting both the before and after processing Docker images and uploading to an archival service like Figshare or Zenodo. The archived images can then be cited to guide readers to the version used in the manuscript 46 . For convenience, the images can also be shared through the Docker Hub registry.
It may currently be impractical to use continuous analysis for generic preprocessing steps involving very large data or analyses requiring particularly high computational costs. In particular, steps that take days to run or incur substantial costs in computational resources may not be amenable with existing providers 64 . One day, continuous analysis systems specifically designed for scientific workflows may facilitate reproducible workflows in these settings. For now, researchers may need to use discretion when preprocessing via continuous analysis, as it may be computationally intractable to reanalyze after each commit to a staging branch. Researchers may elect to run only the final workflow through this process, or may elect to employ continuous analysis after standard but computationally expensive preprocessing steps are completed.
For small datasets and less intensive computational workflows it is easiest to use a full service continuous integration service. These services have the smallest setup times. With private data or when data size and computational complexity scale it becomes necessary to setup a local privately hosted continuous integration server. Cluster or cloud based continuous integration servers can handle the largest workflows.

The impact of reproducible computational research
Reproducibility can have wide-reaching benefits for the advancement of science. For authors, easily reproducible work is a sign of quality and credibility. Continuous analysis addresses the reproducibility of computationally analyses in the narrow sense: generating the same results from the same inputs. It does not solve reproducibility in the broader sense: how robust results are to parameter settings, starting conditions and partitions in the data. Continuous analysis lays the groundwork needed to address reproducibility and robustness of findings in the broad sense.