Abstract
Interactive notebooks can make bioinformatics data analyses more transparent, accessible and reusable. However, creating notebooks requires computer programming expertise. Here we introduce BioJupies, a web server that enables automated creation, storage, and deployment of Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports to analyze and visualize their own raw sequencing files, their gene expression tables, or fetch data from >5,500 published studies containing >250,000 preprocessed RNA-seq samples. Generated notebooks have executable code of the entire pipeline, rich narrative text, interactive data visualizations, and differential expression and enrichment analyses. The notebooks are permanently stored in the cloud and made available online through a persistent URL. The notebooks are downloadable, customizable, and can run within a Docker container. By providing an intuitive user interface for notebook generation for RNA-seq data analysis, starting from the raw reads, all the way to a complete interactive and reproducible report, BioJupies is a useful resource for experimental and computational biologists. BioJupies is freely available as a web-based application from: http://biojupies.cloud and as a Chrome extension from the Chrome Web Store.
Introduction
RNA-sequencing1 is a widely applied experimental method to study the biological molecular mechanisms of cells and tissues in human and model organisms. Currently, experimental biologists that perform RNA-seq experiments are experiencing a bottleneck. The raw read FASTQ files, which are relatively large (>1 GB), need to be first aligned to the reference genome before they can be further analyzed and visualized to gain biological insights. The alignment step is challenging because it is computationally demanding, typically requiring specialized hardware and software. Recently, we have developed a cost effective cloud-based alignment pipeline that enabled us to align >250,000 publicly available RNA-seq samples from the Sequence Read Archive (SRA)2. Here, we describe a service that enables users to upload and align their own FASTQ files, and then obtain an interactive report with complete analysis of their data delivered as a Jupyter Notebook3.
Inspired by the paradigm of literate programming4, data analysis interactive notebook environments such as Jupyter Notebooks, R-markdown5, knitr6, Observable (https://beta.observablehq.com), or Zeppelin (https://zeppelin.apache.org) have been rapidly gaining traction in computational biomedical research and other data intensive scientific fields. The concept of an interactive notebook is not new. Computing software platforms such as MATLAB and Mathematica deployed notebook style analysis pipelines for many years. However, the availability of open-source and free interactive notebooks that can run and execute in any browser make these new notebook technologies transformative.
Interactive notebooks enable the generation of executable documents that contain source code, data analyses and visualizations, and rich narrative markup text. By combining all the necessary information to rerun data analysis pipelines, and by producing reports that enable rapid interpretation of experimental results, interactive notebooks can be considered a new form of publication. Hence, interactive notebooks can transform how experimental results are exchanged in biomedical research. In this direction, several academic journals now support the Jupyter Notebook as a legitimate component of a publication, for example, the journal F1000 Research9, or as an acceptable format type to submit articles, for example, the journal Data Science10. However, generating interactive notebooks requires high level of computer programming expertise which is uncommon among experimental biologists.
The analysis of RNA-sequencing data, and the processing of large datasets produced by other omics technologies, typically requires the chaining of several bioinformatics tools into a computational pipeline. In the pipeline, the output of one tool serves as the input to the next tool. In order to enable experimental biologists to execute bioinformatics pipelines, including those developed to process and analyze RNA-seq data, several platforms have been developed, for example, GenomeSpace11, Galaxy12, and GenePattern13. These software platforms provide access to workflows that can run on scalable computing resources through a web interface. Hence, users with limited computational expertise can launch computationally intensive data analysis jobs with these platforms. Galaxy and GenePattern have recently integrated Jupyter14, 15. In addition to these platforms, several interactive web applications for analysis of RNA-seq data have been developed. Most of these platforms are implemented with the the R Shiny toolkit16. Instead of relying on the integration of external computational tools, these R Shiny applications analyze uploaded data by executing R code on the server side, and then displaying the results directly in the browser. Examples of such applications include iDEP17, START18, VisRseq19, ASAP20, DEApp21, and IRIS22. However, none of these tools currently allow for the automated generation of reusable and publishable reports that contain research narratives in a single interactive notebook. In addition, these platforms do not provide users with the ability to upload the raw sequencing files for processing in the cloud.
BioJupies is a web-based server application that automatically generates customized Jupyter Notebooks for analysis of RNA-seq data. BioJupies allows users to rapidly generate tailored, reusable reports from their raw or processed sequencing data, as well as fetch RNA-seq data from >5,500 studies that contain >250,000 RNA-seq samples published in the Gene Expression Omnibus (GEO)23. To enable such fetching, we preprocessed all the human and mouse samples profiled by the major sequencing platforms by aligning them to the reference genome via the ARCHS4 cloud computing architecture2. Most importantly, BioJupies enables users to upload their own FASTQ files for alignment and downstream analysis in the cloud. BioJupies is freely available from http://biojupies.cloud. BioJupies is open source on GitHub at https://github.com/MaayanLab/biojupies available for forking and contributing plug-ins to enhance the system.
Methods
The BioJupies interactive website
The BioJupies website is hosted on a web server running in a Docker24 container as a Ubuntu image that is pulled from Docker Hub. It is served using a Python-Flask framework with an nginx HTTP server, uWSGI, and a MySQL database. The front-end of the website is built using HTML5 Bootstrap, CSS3, and JavaScript. When a user requests the generation of a notebook, the website builds a JSON-formatted ‘notebook configuration’ string that contains information about the selected datasets and plug-ins. The website first queries the MySQL database to find whether a notebook with the same configuration was previously generated. If a notebook is found, the link to it is displayed on the final results page. Otherwise, the notebook configuration JSON is sent to the notebook generator server through an HTTP POST request. The server uses this information to generate a notebook. Once the notebook has been generated, the server returns a notebook link to the client for display.
Programmatic generation of Jupyter Notebooks
The Jupyter Notebook generator server uses the information in the JSON-formatted ‘notebook configuration’ string to generate a Jupyter Notebook with the Python library nbformat. The server executes the notebook using the Python library nbconvert and then saves the notebook in a file encoded into the ipynb format. The file is subsequently uploaded to a Google Cloud storage bucket using the google-cloud Python library. The statically rendered notebook is made available on a persistent and public URL using Jupyter nbviewer. The Docker container that encapsulates the running server contains a copy of the BioJupies plug-in library. The BioJupies plug-in library contains all the scripts necessary to download, normalize and analyze the RNA-seq data.
The BioJupies plug-in library
The BioJupies plug-in library is a modular set of Python and R scripts which are used to download, normalize, and analyze RNA-seq datasets for notebook generation using BioJupies. The library consists of a set of core scripts responsible for loading the RNA-seq data into the pandas DataFrames25, scripts for normalizing the data, and scripts for performing the differential gene expression analysis, currently implemented with limma26 or the Characteristic Direction27 methods. In addition to these, the library contains a broad range of data analysis tools organized in a plug-in architecture. Each plug-in can be used to analyze the data, and to embed interactive visualizations of the results in the output Jupyter Notebook report. Currently, the library contains 14 plug-ins divided into 4 categories (Table 1). The plug-in library is expected to be updated regularly and grow as we incorporate more computational tools submitted by developers who wish to have their tools integrated within the BioJupies framework. The BioJupies plug-in library is openly available from GitHub at: https://github.com/MaayanLab/biojupies-plugins.
The initial BioJupies toolbox includes 14 plug-ins for the analysis of RNA-seq data, divided into four categories.
Processed RNA-seq datasets
The pre-processed RNA-seq datasets available for analysis from BioJupies are directly extracted from the ARCHS4 database2. First, gene-level count matrices stored in the h5 format were downloaded for mouse and human from the ARCHS4 web-site (human_matrix.h5 v2 and mouse_matrix.h5 v2). Gene counts and metadata were subsequently extracted using the Python h5py library and packaged into 5,780 individual HDF5 files, one for each GEO Series (GSE) and GEO Platform (GPL) pair. Data packages were subsequently uploaded to a Google Cloud storage bucket using the google-cloud library and made available for download through a public URL.
Processing of user-submitted RNA-seq data
Processed RNA-seq datasets are submitted from the upload page using the Dropzone JavaScript library. Uploaded files are combined with sample metadata into an HDF5 data package28 using the h5py Python library and subsequently uploaded to a Google Cloud storage bucket with the google-cloud library. Raw sequencing files are submitted directly to an Amazon Web Services (AWS) S3 cloud storage bucket from the raw sequencing data upload page. Once the data is uploaded, gene expression is quantified using the ARCHS4 RNA-seq processing pipeline2, which runs in parallel on the AWS cloud. The core component of the processing pipeline is the alignment of the raw mRNA reads to a reference genome. This process is encapsulated in deployable Docker containers that performs the alignment step with Kallisto29. Once gene counts have been quantified, the data is combined with sample metadata into an HDF5 package similarly to the way the processed datasets are made accessible to BioJupies for further analysis. The system allocates resources dynamically based on need to optimize cost. The use of the efficient aligner Kallisto with the optimized ARCHS4 RNA-seq processing pipeline keeps the cost negligible and the service scalable.
The BioJupies Chrome Extension
The BioJupies Chrome extension is freely available from the Chrome Web Store at: https://chrome.google.com/webstore/detail/biojupies-generator/picalhhlpcihonibabfigihelpmpadel?hl=en-US. The Chrome extension enhances the search results pages of the GEO website at: https://www.ncbi.nlm.nih.gov/geo. When a user is visiting a search results page on GEO, the BioJupies Chrome extension extracts the accession IDs of the query returned GEO datasets. The extension then queries the BioJupies MySQL database to identify which search-returned datasets have been processed by ARCHS42. For those datasets that have been processed by ARCHS4, the BioJupies Chrome extension embeds BioJupies buttons near the matching entries. When these buttons are clicked by the user, a user interface is deployed. The user interface leads the user through steps to generate the ‘notebook configuration’ JSON string. Once all information is gathered, the ‘notebook configuration’ JSON string is sent to the notebook generation server through an HTTP POST request to generate a Jupyter Notebook. Once the notebook is ready, a link to it is displayed by the Chrome extension.
Results
The BioJupies website
The BioJupies website enables users to generate customized, interactive Jupyter Notebooks containing analyses of RNA-seq data through an intuitive user interface. The notebooks contain executable Python code, interactive visualizations, and rich annotations that provide detailed explanations of the results. The notebooks also provide details and references about the methods used to perform the analyses. The notebooks are made available to users through a persistent sharable URL. The automatically generated notebooks can be downloaded and modified on the user’s local computer through the execution of a Docker image that contains all the data, tools and source code needed to rerun analyses.
The process of notebook generation consists of three steps. First, the user selects the data they wish to analyze through a web interface. Users can either upload their own raw or processed RNA-seq data (Supplementary Video S1), or select from over 5,500 ready-to-analyze datasets published in GEO and processed by ARCHS4. Next, the user selects from an array of computational plug-ins to analyze the data (Supplementary Video S2). Finally, the user can customize the notebook by modifying optional parameters, changing the notebook’s title, and adding metadata tags from resources such as the Disease Ontology30, the Drug Ontology31, and the Uber-anatomy Ontology32. Once this process is complete, a notebook generation job is launched by the BioJupies notebook generator server. The BioJupies notebook generator creates and executes a Jupyter Notebook file with the uploaded, or fetched, dataset and selected plug-ins. When the process is completed, the generated notebook is uploaded to a cloud storage bucket and is made available through a persistent URL (Supplementary Video S3). If the user approves, the persistent URL is made publicly and automatically available on the Datasets2Tools repository33. A schematic representation of the entire BioJupies workflow shows how the various components of BioJupies come together (Fig. 1).
The user starts by uploading RNA-seq data to the BioJupies website (http://biojupies.cloud), or by selecting from thousands of publicly available datasets. If raw FASTQ files are provided, expression levels for each gene are quantified using a cloud-based alignment pipeline. The user subsequently selects the tools and parameters to apply to analyze the data. Finally, a server generates a Jupyter Notebook with the desired settings and returns a report to the user through a persistent URL.
Uploading RNA-seq Data
BioJupies enables the generation of Jupyter Notebooks from RNA-seq data in both raw and processed forms. In case of processed RNA-seq data, the user uploads numeric gene counts in a tabular format. This can be an Excel spreadsheet or a comma-separated text file. In addition, metadata that describes the samples can be uploaded in separate Excel spreadsheets, or as comma-separated text files, following the provided examples. In the case of raw RNA-seq data, the user is provided with a user interface that enables them to upload FASTQ files directly to an AWS S3 cloud storage bucket through an HTML form (Supplementary Video S1). The user is required to specify the organism, and whether the RNA-seq data was generated using single-end or paired-end sequencing. Once this information is collected, gene expression levels for each gene are quantified by launching parallel alignment jobs in the cloud using the ARCHS4 pipeline. The pipeline employs the kallisto aligner for the alignment step29. We have benchmarked the kallisto aligner with other aligners and found it to produce comparable read quality at a significant lower cost2. Once the alignment process is complete, which may take up to 10 minutes, sample counts are merged to generate a gene count matrix. The gene count matrix is subsequently combined with the uploaded metadata. From that point on, the user follows the same steps to generate notebooks as with fetched data, or processed uploaded data (gene counts matrix), by selecting the computational tools they wish to employ (Supplementary Video S2).
Analyzing Published Datasets
In addition to allowing analysis of user-submitted data, BioJupies provides direct access to data from >5,500 published GEO studies. These studies contain >250,000 processed samples from human or mouse, profiled across hundreds of different tissue types, cell lines and biological conditions. The BioJupies website enables users to fetch the data into BioJupies from these studies using a search engine that supports text-based queries. The search engine has filters, for example, by specifying a range of number of samples per study, filtering by organism, and filtering by a publication date range. The datasets available for fetching are updated regularly as more studies are published in GEO.
The BioJupies Plug-in Library
All the source code that is used to analyze the RNA-seq data and generate the plots is available on GitHub at a public repository: https://github.com/MaayanLab/biojupies-plugins. Currently, the BioJupies plug-in library consists of 14 computational plug-ins subdivided into four categories: exploratory data analysis, differential expression analysis, enrichment analysis, and small molecule query (Table 1). These plug-ins analyze the data and display interactive visualizations such as three-dimensional scatter plots, bar charts, and clustered heatmaps with enrichment analysis features (Supplementary Video S3). The plug-in’s modular design of BioJupies provides the mechanism needed for other bioinformatics tool developers to integrate their tools with BioJupies. Suggestions for plug-ins can be submitted to the BioJupies web-site or through GitHub.
The BioJupies Chrome Extension
BioJupies is also provided as a Chrome extension. The Chrome extension is freely available from the Chrome Web Store at: https://chrome.google.com/webstore/detail/biojupies-generator/picalhhlpcjhonibabfigihelpmpadel?hl=en-US. The Chrome extension detects BioJupies available processed datasets returned from GEO search pages. The BioJupies Chrome extension embeds buttons near each entry from returned results of GEO studies that were aligned by us. Clicking on the embedded button, a popup window is triggered to provide an interactive interface to capture the settings to generate the Jupyter Notebook directly from the selected GEO dataset. The BioJupies Chrome extension entry forms follow the same content as in the BioJupies website. The BioJupies Chrome extension launches a notebook generation job and displays the permanent link to the user once the process is completed (Supplementary Video S4).
Discussion
The automated generation of Jupyter Notebooks for RNA-seq data analysis lowers the point-of-entry for researchers with no programming background. With BioJupies users can rapidly extract knowledge from their own RNA-seq data, or from already published studies. Furthermore, the notebooks generated can be easily shared with collaborators, rerun and customized using Docker containers to further refine the analysis. This makes it attractive to both experimentalists and programmers. The pipeline is cost effective and scalable, so it can accommodate a large user base. However, if hundreds of users utilize the service daily, the costs can start mounting to levels that are not sustainable by us. In such case, technologies that transfer some of the cloud computing costs to the users may be needed.
It may appear that the RNA-seq Jupyter Notebooks make the RNA-seq data, and the tools applied for the analysis, more findable, accessible, interoperable and reusable (FAIR)34. However, currently, the notebooks and the datasets analyzed by BioJupies fail many of the FAIR guidelines and metrics. For example, the notebook is not citable, uploaded RNA-seq data is not indexed in an established repository, metadata describing the samples and experimental conditions are not required, and unique identifiers are not enforced. Future work will enhance the process to better comply the BioJupies generated notebooks with the FAIR principles. One way to increase FAIRness is to encode the BioJupies pipelines in standard workflow languages such as Common Workflow Language (CWL)35 or Workflow Description Language (WDL)36. This approach may facilitate the more rapid adoption of other data analysis pipelines that can be supported by the BioJupies framework. Although BioJupies was developed to create notebooks for RNA-seq analysis, pipelines to process other data types should be possible to implement. This can be achieved through the plug-in architecture and coded workflows. One feature that is currently missing from BioJupies is the launch of live notebooks in the cloud. To achieve this, Kubernetes37 can be utilized to directly deploy BioJupies in the cloud. Hence, BioJupies can be extended in many ways. While there are many ways to extend BioJupies, at its present form, the BioJupies application can facilitate rapid and in-depth data analysis for many investigators.
Acknowledgements
This work is supported by NIH grants U54-HL127624 (LINCS-DCIC), U24-CA224260 (IDG-KMC), and OT3-OD025467 (NIH Data Commons).