Parallel and scalable workflow for the analysis of Oxford Nanopore direct RNA sequencing datasets

The direct RNA sequencing platform offered by Oxford Nanopore Technologies allows for direct measurement of RNA molecules without the need for conversion to complementary DNA, fragmentation or amplification. As such, it is in principle capable of detecting any given RNA modification present in the molecule being sequenced, as well as providing polyA tail length estimations at the level of individual RNA molecules. Although this technology has been publicly available since 2017, the complexity of the raw Nanopore data, together with the lack of systematic and reproducible pipelines, has greatly hindered access to this technology for the general user. Here we address this problem by providing a fully benchmarked workflow for the analysis of direct RNA sequencing reads, termed MasterOfPores. The pipeline converts raw current intensities into multiple types of processed data, providing metrics of the quality of the run, quality-filtering, base-calling and mapping. The output of the pipeline can in turn be used to compute per-gene counts, RNA modifications, and prediction of polyA tail length and RNA isoforms. The software is written using the NextFlow framework for parallelization and portability, and relies on Linux containers such as Docker and Singularity for achieving better reproducibility. The MasterOfPores workflow can be executed on any Unix-compatible OS on a computer, cluster or cloud without the need to install any additional software or dependencies, and is freely available on GitHub (https://github.com/biocorecrg/master_of_pores). This workflow will significantly simplify the analysis of nanopore direct RNA sequencing data by non-bioinformatics experts, thus boosting the understanding of the (epi)transcriptome with single molecule resolution.


INTRODUCTION
Next generation sequencing (NGS) technologies have revolutionized our understanding of the cell and its biology. However, NGS technologies are heavily limited by their inability to sequence long reads, thus requiring complex bioinformatic algorithms to reassemble the DNA pieces into a full genome or transcriptome. Moreover, NGS technologies require a PCR amplification step, and as such, they are typically blind to DNA or RNA modifications.
The field of epitranscriptomics, which studies the biological role of RNA modifications, has experienced exponential growth in the last few years. Systematic efforts coupling antibody immunoprecipitation or chemical treatment with next-generation sequencing (NGS) have revealed that RNA modifications are much more widespread than originally thought, are reversible (Jia et al., 2011), and can play major regulatory roles in determining cellular fate (Batista et al., 2014), differentiation (Furlan et al., 2019;Lee et al., 2019;Lin et al., 2017) and sex determination (Haussmann et al., 2016;Kan et al., 2017;Lence et al., 2016), among others. However, the lack of selective antibodies and/or chemical treatments that are specific for a given modification has largely hindered our understanding of this pivotal regulatory layer, limiting our ability to produce genome-wide maps for 95% of the currently known RNA modifications (Jonkhout et al., 2017).
Third-generation sequencing (TGS) platforms, such as the one offered by Oxford Nanopore Technologies (ONT), allow for direct measurement of both DNA and RNA molecules without prior fragmentation or amplification (Brown and Clarke, 2016), thus putting no limit on the length of DNA or RNA molecule that can be sequenced. In the past few years, ONT technology has revolutionized the fields of genomics and (epi)transcriptomics, by showing its wide range of applications in genome assembly, study of structural variations within genomes (Cretu Stancu et al., 2017), 3' poly(A) tail length estimation (Krause et al., 2019), accurate transcriptome profiling (Bolisetty et al., 2015), identification of novel isoforms (Byrne et al., 2017;Križanovic et al., 2018) and direct identification of DNA and RNA modifications (Carlsen et al., 2014;Garalde et al.;Liu et al., 2019;Simpson et al., 2017). Thus, not only does this technology overcome many of the limitations of short-read sequencing, but importantly, it can also directly measure RNA and DNA modifications in their native molecules. Although ONT can potentially address many problems that NGS technologies cannot, the lack of proper standardized pipelines for the analysis of ONT output has greatly limited its reach to the scientific community.
To overcome these limitations, workflow management systems together with Linux containers offer an efficient solution to analyze large-scale datasets in a highly reproducible, scalable and parallelizable manner. In the last year, several workflows to analyze nanopore data have become available; to analyze on a cluster using 100 nodes, each one with 8 CPUs, and ~1 hour or less on a single GPU (see Table 1 for detailed metrics). Moreover, the pipeline can also be run on the cloud (see section "Running on AWS").
MasterOfPores simplifies the analysis of direct RNA sequencing data by providing a containerised pipeline implemented in the NextFlow framework. Importantly, this approach spares the user the heavy lifting of installing dependencies, and is thus simple and accessible to any researcher without bioinformatics expertise. We expect that our workflow will greatly facilitate the access of the community to Nanopore direct RNA sequencing.

Overview of the MasterOfPores workflow
Workflow management systems together with Linux containers offer a solution to efficiently analyse large-scale datasets in a highly reproducible, scalable and parallelizable manner. During the last years an increasing interest in the field has led to the development of different programs such as
The MasterOfPores workflow includes all steps needed to process raw FAST5 files produced by Nanopore direct RNA sequencing, allowing users a choice among different algorithms (Figure 1, see also Figure S1):
i) Read base-calling with the algorithm of choice, using Albacore (https://nanoporetech.com) or Guppy (https://nanoporetech.com). This step can be run in parallel and the user can decide the number of files to be processed in a single job by using the parameter --granularity.
ii) Filtering of the resulting fastq files using Nanofilt (De Coster et al., 2018). This step is optional and can be run in parallel.
vi) Final report of the data processing using MultiQC (https://github.com/ewels/MultiQC), which combines the individual quality controls performed in the previous steps with global run statistics.
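The effect of the --granularity parameter described in step i) can be pictured as grouping the input files into fixed-size batches, each of which becomes one parallel job. The following is only a minimal sketch of that idea, not the pipeline's actual implementation; the function name `batch_files` and the file names are hypothetical.

```python
from typing import List, Sequence


def batch_files(files: Sequence[str], granularity: int) -> List[List[str]]:
    """Group input files into batches of at most `granularity` files.

    Each batch stands for the set of files handled by one job.
    """
    if granularity < 1:
        raise ValueError("granularity must be a positive integer")
    return [list(files[i:i + granularity])
            for i in range(0, len(files), granularity)]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    fast5_files = [f"reads_{n}.fast5" for n in range(10)]
    for job_id, job in enumerate(batch_files(fast5_files, granularity=4)):
        print(job_id, job)
```

A small batch size gives more, shorter jobs (better suited to a large cluster), whereas a large batch size gives fewer, longer jobs with less scheduling overhead.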

Running MasterOfPores: installation, input, parameters and output
To run MasterOfPores, the following steps are required: options. If these are not specified by the user, the workflow will be run with default parameter settings (Table 2). The final report includes five different types of metrics: (i) General statistics of the input, including the total number of reads, GC content and number of identical base-called sequences; (ii) Per-read statistics of the input data, including scatterplots of the average read length versus sequence identity, the histogram of read lengths, and the correlation between read quality and identity; (iii) Alignment statistics, including the total number of mapped reads, the total number of mapped bases, the average length of mapped reads, and the mean sequence identity; (iv) Quality filtering statistics, including the number of filtered reads, median Q-score and read length, compared to those observed in all sequenced reads; and (v) Per-read analysis of biases, including information on duplicated reads, over-represented reads and possible adapter sequences (Figure 2).
The final outputs of the pipeline include:
- Basecalled fast5 files within the "fast5_files" folder.
- Aligned reads in BAM files within the "aln" folder.

Running MasterOfPores on the cloud (AWS Batch and AWS EC2)
Nanopore sequencing allows for real-time sequencing of samples. While GridION devices come with built-in GPUs that allow live base-calling, smaller MinION devices have no built-in CPU or GPU.
Thus, the user has to connect the MinION to a computer with sufficient CPU/GPU capabilities, or run base-calling after the sequencing. In all these contexts, the possibility of running the MasterOfPores pipeline on the cloud presents a useful alternative.
Amazon Web Services (AWS) Batch is a computing service that enables users to submit jobs to a cloud-based user-defined infrastructure, which can be easily set up via either code-based definitions or a web-based interface. Computation nodes can be allocated in advance or according to resource availability. Cloud infrastructure can also be deployed or dismantled on demand using automation tools, such as CloudFormation or Terraform.
Here we show that the MasterOfPores pipeline can be successfully implemented on the cloud, and provide the Terraform script for running MasterOfPores on the AWS Batch CPU environments, available in the GitHub repository (https://biocorecrg.github.io/master_of_pores/). To run the pipeline using the AWS Batch, the users only need to change a few parameters related to their accounts in a configuration file. The pipeline can be run from either a local workstation or an Amazon EC2 entrypoint instance initiated for this purpose (we recommend the latter). Data to be analysed can be uploaded to an Amazon S3 storage bucket.
Similarly, we also tested whether our pipeline could be run on Amazon Web Services (AWS) Elastic Compute Cloud (EC2), which is one of the most popular cloud services (Table S1). Unlike with AWS Batch, to run any workflow on AWS EC2 the user must first create an Amazon Machine Image (AMI). The AMI can be created using the same instructions as provided in File S1, starting from the
We used up to 100 nodes with 8 CPUs for testing the base-calling in CPU mode and 1 node with 1 GPU card for testing the base-calling in GPU mode (Table 1).
The MasterOfPores pipeline was run using guppy version 3.1.5 as the base-caller and minimap2 version 2.17 as the mapping algorithm. Reads were filtered by running nanofilt with the options "-q 0 --headcrop 5 --tailcrop 3 --readtype 1D". Filtered reads were mapped to the yeast SK1 fasta genome.
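To make the filtering settings above concrete, the sketch below illustrates what a minimum-quality threshold combined with head and tail cropping does to a single read. It is a simplified illustration under stated assumptions and not the filtering tool's own code: the function names are hypothetical, the mean quality is assumed to be computed from per-base error probabilities, and the quality check is assumed to be applied before cropping.

```python
import math
from typing import List, Optional, Tuple


def mean_qscore(phred_scores: List[int]) -> float:
    """Mean read quality, averaged over error probabilities.

    Averaging probabilities (rather than Phred values directly) means
    that low-quality bases weigh more heavily on the result.
    """
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_and_crop(seq: str, quals: List[int], min_q: float = 0.0,
                    headcrop: int = 5, tailcrop: int = 3
                    ) -> Optional[Tuple[str, List[int]]]:
    """Return the cropped read, or None if the read is discarded."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not quals or mean_qscore(quals) < min_q:
        return None  # read fails the quality threshold
    end = len(seq) - tailcrop
    if end <= headcrop:
        return None  # nothing left after cropping
    return seq[headcrop:end], quals[headcrop:end]


if __name__ == "__main__":
    # Hypothetical 15-nt read with uniform quality, for illustration.
    read = "ACGTACGTACGTACG"
    print(filter_and_crop(read, [10] * len(read)))
```

With a threshold of 0 no read is rejected on quality alone, so under these assumptions the settings above amount to trimming 5 bases from the start and 3 bases from the end of every read.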
Specifically, the command that was executed to run the pipeline with these settings was:

Benchmarking the time used for the analysis of S.cerevisiae polyA(+) RNA
Here we have tested the pipeline using both CPU and GPU computing. Specifically, we ran the pipeline on the following configurations: (i) a single CPU node (e.g., emulating the computing time on a single laptop); (ii) a CPU cluster with 100 nodes; (iii) a single mid-range GPU card (RTX2080); and (iv) a single high-end GPU card (GTX1080 Ti).
We found that the computing time required to run the pipeline on a single GPU card was significantly lower than the running time in parallel on a high-performance CPU cluster with 100 nodes, 8 cores per node (Table 1, see also Table S1). Moreover, we found that the computing time depends strongly on the GPU card used (the base-calling step was ~2X faster for GTX1080 Ti than for RTX2080).

Reporting resources used for the analysis of S. cerevisiae polyA(+) RNA
Taking advantage of the NextFlow reporting functions, the pipeline can produce detailed reports on the time and resources consumed by each process (Figure 3), in addition to the output files (bam, fastq) and final report (html), if the workflow is executed with the parameters -with-report (formatted report) or -with-trace (plain text report). Running the base-calling on each multi-fast5 file in parallel on our dataset showed that the most memory-intensive tasks (about 5 Gbytes) were the mapping step (using minimap2) and the quality control step (using Nanoplot) (Table 3), while the most CPU-intensive and time-consuming step (~80 min) was the base-calling (using Guppy) (Table 4).
Finally, we should note that the latest (19.10.0) version of NextFlow allows the user to control the execution of a pipeline remotely. To enable this feature, the user needs to log in to the https://tower.nf/ website developed by the NextFlow authors and retrieve a token for communicating with the pipeline.
To do so, the user should set this token as an environment variable and run the pipeline as follows:
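As an illustration of how the plain text trace report can be summarised, the sketch below finds the peak memory used by each process. It assumes a tab-separated trace with `name` and `peak_rss` columns and human-readable memory values such as "5.1 GB"; the exact columns and formatting depend on the NextFlow version and configuration, and the sample rows are hypothetical values, not measurements from this study.

```python
import csv
import io
from typing import Dict

# Binary multipliers for human-readable memory strings (assumed format).
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def parse_memory(text: str) -> float:
    """Convert a string such as '5.1 GB' into a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit.upper()]


def peak_memory_by_process(trace_text: str) -> Dict[str, float]:
    """Return the highest peak_rss (in bytes) observed for each process."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    peaks: Dict[str, float] = {}
    for row in reader:
        if row["peak_rss"].strip() in ("", "-"):
            continue  # skip tasks without a recorded memory value
        process = row["name"].split(" (")[0]  # drop the task tag, e.g. "(1)"
        peaks[process] = max(peaks.get(process, 0.0),
                             parse_memory(row["peak_rss"]))
    return peaks


# Hypothetical trace content, for illustration only.
SAMPLE_TRACE = (
    "name\trealtime\tpeak_rss\n"
    "baseCalling (1)\t1h 20m\t1.2 GB\n"
    "baseCalling (2)\t1h 18m\t1.3 GB\n"
    "mapping (1)\t12m 4s\t5.1 GB\n"
    "multiQC\t35s\t250 MB\n"
)

if __name__ == "__main__":
    peaks = peak_memory_by_process(SAMPLE_TRACE)
    for process, peak in sorted(peaks.items(), key=lambda kv: -kv[1]):
        print(f"{process}\t{peak / 1024 ** 3:.2f} GB")
```

In practice the trace text would be read from the file written by the workflow rather than from an inline string.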

DISCUSSION
The direct RNA sequencing technology developed by Oxford Nanopore Technologies (ONT) makes it possible to sequence native RNA molecules, allowing the (epi)transcriptome to be investigated at an unprecedented resolution, in full-length RNA molecules and in their native context. Although the direct RNA sequencing library preparation kit was made available in April 2017, only a modest number of researchers have started to adopt this new technology, partly due to the complexity of analyzing the resulting raw FAST5 data. Moreover, even in those cases when specific software and tools have been made available, users typically experience many difficulties in installing dependencies and running the software. To overcome these issues and facilitate the data analysis of direct RNA sequencing for the general user, we propose the use of NextFlow workflows.
Specifically, we propose the use of the MasterOfPores workflow for the analysis of direct RNA sequencing datasets, which is a containerised pipeline implemented in the NextFlow framework.
MasterOfPores can handle both single- and multi-FAST5 reads as input, is highly customizable by the user (Table 2) and produces informative detailed reports on both the FAST5 data processing and analysis (MultiQC report, Figure 2) as well as on the computing resources used to perform each step (NextFlow report, see Figure 3). Thus, the current outputs of the MasterOfPores workflow include: (i) base-called FAST5 files, (ii) base-called fastq file, (iii) mapping BAM file, (iv) MultiQC report, and (v) NextFlow report. In the future we plan to integrate within the MasterOfPores workflow the software for the downstream analyses of direct RNA sequencing datasets, including polyA tail length estimation (using Nanopolish (Workman et al., 2018) and tailfindr (Krause et al., 2019)), per-transcript isoform quantification and differential expression analysis (using Flair), and the analysis of RNA modifications (using Tombo (Stoiber et al., 2017) and EpiNano (Liu et al., 2019)).
The process of Nanopore read base-calling, that is, converting ion current changes into the sequence of RNA/DNA bases, has significantly improved during the last few years, mainly due to the adoption of deep learning approaches, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are currently the most commonly used strategies for base-calling. The adoption of RNN- and CNN-based base-calling algorithms led to a dramatic improvement in base-calling accuracy. However, this came at the expense of a higher computational cost: only 5-10 reads per second can be base-called on 1 CPU core using the latest versions of the base-calling algorithms.
The use of graphic processing units (GPUs) can greatly accelerate certain CPU-intensive computational tasks, allowing 50-500 reads to be processed per second (Table S1). We therefore developed our pipeline for both CPU and GPU computing. Moreover, we provide the GPU-enabled docker image and detailed information on how to set up GPU computing (see section: "Running MasterOfPores"). We encourage users to adopt GPU computing for the analysis of Nanopore sequencing data whenever possible, as this option is more time- and cost-efficient.
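The practical impact of these throughput figures can be illustrated with a back-of-the-envelope estimate. The sketch below uses the rates quoted above (5-10 reads per second on one CPU core, 50-500 reads per second on a GPU); the run size of one million reads is a hypothetical value chosen for illustration, and the estimate ignores I/O and scheduling overhead.

```python
def basecalling_hours(n_reads: int, reads_per_second: float) -> float:
    """Estimated base-calling time in hours at a constant throughput."""
    if reads_per_second <= 0:
        raise ValueError("reads_per_second must be positive")
    return n_reads / reads_per_second / 3600


if __name__ == "__main__":
    n_reads = 1_000_000  # hypothetical run size
    scenarios = [
        ("1 CPU core, 5 reads/s", 5),
        ("1 CPU core, 10 reads/s", 10),
        ("1 GPU, 50 reads/s", 50),
        ("1 GPU, 500 reads/s", 500),
    ]
    for label, rate in scenarios:
        print(f"{label}: {basecalling_hours(n_reads, rate):.1f} h")
```

Under these assumptions, a run of this size would take roughly 28-56 hours on a single CPU core, compared with roughly 0.6-5.6 hours on a single GPU.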

Code availability
The pipeline is publicly available at https://github.com/biocorecrg/master_of_pores under an MIT license. The example input data as well as expected outputs are included in the GitHub repository.
Detailed information on program versions used can be found in the GitHub repository.

Availability of Dockerfiles and Docker images
The pipeline uses software that is embedded within Docker containers. Dockerfiles are available in the GitHub repository (https://github.com/biocorecrg/master_of_pores/tree/master/docker/

Integration of base-calling algorithms in the Docker images
Due to the terms and conditions that users agree to when purchasing Nanopore products, we are not allowed to distribute Nanopore software (binaries or in packaged form like docker images). While the original version of the MasterOfPores pipeline includes both guppy and albacore, we are not legally allowed to distribute it with the binaries. Therefore, here we only make available a version where the binaries must be downloaded and placed into a specific folder by the user. We expect that future versions of MasterOfPores will include this software within the docker image once this issue is solved.

CPU and GPU computing time and resources
The MasterOfPores workflow was tested both locally (using either CPU or GPU), as well as in the cloud (AWS). Computing times for each mode are shown in