Abstract
Summary Since its introduction, RNA-seq technology has been used extensively in studies of pathogenic bacteria to identify and quantify differences in gene expression across multiple samples from bacteria exposed to different conditions. With some exceptions, the current tools for assessing gene expression have been designed around the structures of eukaryotic genes. There are a few stand-alone tools designed for prokaryotes, and they require improvement. A well-defined pipeline for prokaryotes that includes all the necessary tools for quality control, determination of differential gene expression, downstream pathway analysis, and normalization of data collected in extreme biological conditions is still lacking. Here we describe ProkSeq, a user-friendly, fully automated RNA-seq data analysis pipeline designed for prokaryotes. ProkSeq provides a wide variety of options for analysing differential expression, normalizing expression data, and visualizing data and results, and it produces publication-quality figures.
Availability and implementation ProkSeq is implemented in Python and is published under the ISC open source license. The tool and a detailed user manual are hosted at Docker: https://hub.docker.com/repository/docker/snandids/prokseq-v2.1; Anaconda: https://anaconda.org/snandiDS/prokseq; and GitHub: https://github.com/snandiDS/prokseq.
Motivation
Advances in massively parallel sequencing and a dramatic reduction in sequencing costs have made deep sequencing of RNA (RNA-seq) a primary tool for identifying and quantifying RNA transcripts. Today RNA-seq is widely used to analyse bacterial gene expression in studies that aim, for example, to identify drug targets or predict novel gene regulatory mechanisms. Such studies often require profound knowledge of both computational data handling and biology. There are some stand-alone pipelines and tools that require only moderate knowledge of bioinformatics (Delhomme, et al., 2012; Prieto and Barrios, 2019), but these are not designed for analyses of bacterial gene expression.
Prokaryotic RNA-seq analysis is challenging because most available RNA-seq packages assume the input data reflect eukaryotic gene structures, which in many aspects differ from those of prokaryotes (Johnson, et al., 2016). Bacterial transcripts do not have introns and are not alternatively spliced; therefore, using an aligner developed to consider splice junctions often increases falsely assigned reads in the genome (Magoc, et al., 2013). Moreover, unlike in eukaryotes, under specific stresses the expression of almost all prokaryotic genes can change (Creecy and Conway, 2015). Furthermore, quality trimming, adapter removal, and normalization of skewed data are often required for prokaryotic data due to variations in experimental setups, the presence and overexpression of plasmid genes, and differences in RNA-seq protocols (Magoc, et al., 2013; McClure, et al., 2013).
Although there are a few software packages available for prokaryotes that can facilitate the analysis of RNA-seq data, such as SPARTA (Johnson, et al., 2016), EdgePRO (Magoc, et al., 2013), and RockHopper (McClure, et al., 2013), all require substantial knowledge of data handling. Therefore, to reduce human intervention in conducting RNA-seq data analysis for prokaryotes, we developed ProkSeq, a fully automated command-line workflow. ProkSeq integrates various available tools and built-in functions written in Python. ProkSeq processes RNA-seq data from quality control steps to pathway enrichment analysis of differentially expressed genes. It provides a wide variety of options for differential expression, normalized expression, and visualization, and produces publication-quality figures. Reduced human intervention makes ProkSeq less time-consuming than the sequential application of separate tools, which often requires reformatting data between steps. In addition, the multithreading feature of ProkSeq further shortens the run time.
Implementation
ProkSeq runs in a Linux-based command-line environment and depends on user-defined parameters and sample files. Since it integrates several tools, default parameters for the packages are set in the parameter file. However, the user has the flexibility to change the settings in the parameter file for any desired analysis. The sample file provides the names of the fastq files to be included in the analysis, and also defines the experimental classes, such as treatment and control samples. ProkSeq first checks the quality of reads and filters out low-quality reads using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) and AfterQC (Chen, et al., 2017). It maps the reads to the reference genome using bowtie2 (Langmead and Salzberg, 2012) with its default parameters for both single-end and paired-end reads. ProkSeq then generates a report on the alignment quality for each library, as both figures and text, providing information about coverage uniformity, read distribution along protein coding sequences and 5’ and 3’ UTR regions, as well as the read duplication rate and strand specificity, generated by RSeQC (Wang, et al., 2012) and other built-in functions. Total reads per gene are calculated with featureCounts (Liao, et al., 2014), which provides a high efficiency of read assignments across the genome. ProkSeq also calculates normalized expression values for each gene, in the form of transcripts per million (TPM) (Wagner, et al., 2012) and counts per million (CPM) (Wagner, et al., 2012). The formulas by which these are calculated are explained in the supplementary methods (S1). Furthermore, we have integrated salmon (Patro, et al., 2017) to quantify the expression of transcripts using a bias-aware algorithm that substantially improves the accuracy and reliability of subsequent differential expression analysis.
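As an illustration of the two normalized expression measures mentioned above, the following minimal Python sketch computes CPM and TPM using their commonly used textbook definitions. It is not taken from the ProkSeq source code; the exact formulas used by the pipeline are those given in the supplementary methods (S1), and the function names, counts, and gene lengths here are hypothetical.

```python
# Minimal sketch of the commonly used CPM and TPM definitions.
# Assumption: these are the textbook formulas, not code from ProkSeq itself.

def cpm(counts):
    """Counts per million: scale raw counts by the library size."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no assigned reads")
    return [c / total * 1e6 for c in counts]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    # Reads per kilobase of gene length for every gene.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    if scale == 0:
        raise ValueError("library has no assigned reads")
    return [r / scale * 1e6 for r in rates]


# Hypothetical read counts and gene lengths (bp) for three genes.
example_counts = [10, 30, 60]
example_lengths = [1000, 3000, 2000]

print(cpm(example_counts))
print(tpm(example_counts, example_lengths))
```

In this toy example the first two genes have different raw counts but the same read density per kilobase, so they receive the same TPM while their CPM values differ.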
ProkSeq integrates several tools for differential expression analysis, such as DESeq2 (Love, et al., 2014), edgeR (Robinson, et al., 2010), and NOISeq (Tarazona, et al., 2015). For downstream analysis of differentially expressed genes, ProkSeq performs GO enrichment and pathway enrichment by integrating clusterProfiler (Yu, et al., 2012). Reports on pre- and post-alignment quality statistics and graphical visualization are created in PDF and HTML formats. One important unique feature of ProkSeq is the integration of well-established normalization methods for skewed data (Creecy and Conway, 2015; Zhu, et al., 2019). Furthermore, the package generates a single-nucleotide resolution wiggle file for visualization in any genome browser. ProkSeq generates graphics and publication-ready figures at every step of data analysis to give the user more confidence in and understanding of their data. The methods are described in detail in the supplementary methods (S1).
Sample data sets and results
ProkSeq is distributed with an example data set. The data set contains paired-end reads from Yersinia pseudotuberculosis YPIII (data unpublished) and compares control samples with bile-treated samples. The following files are bundled in exampleFiles.tar.gz:
Sample files (sampleCtrl_1.R1/R2.fq, sampleCtrl_2.R1/R2.fq, sampleCtrl_3.R1/R2.fq, sampleTreat_1.R1/R2.fq, sampleTreat_2.R1/R2.fq, sampleTreat_3.R1/R2.fq),
Sample description files (samples.bowtie.PEsample, samples.bowtie.SEsample, samples.salmon.PEsample, samples.salmon.Sesample),
Parameter definition files (param.input.bowtie and param.input.salmon)
Annotation files (oldAnnotationGFF.bed, oldAnnotationGFF.gtf),
Transcript file (orf_coding_all.fasta), and
Genome file (SequenceChromosome.fasta).
To run the pipeline, one can follow the instructions in https://github.com/snandiDS/prokseq/blob/master/README.md.
We strongly recommend using Docker to run the pipeline. However, a manual installation of ProkSeq is also straightforward:
Step 1: Create a working directory and change into it.
mkdir prokseq
cd prokseq
Step 2: Download the package from GitHub or install it using conda.
Step 3: Untar depend.tar. The depend folder contains all the required external packages, and most of their binaries are stored in this folder.
Step 4: Untar exampleFiles.tar.gz.
Step 5: Install the following R packages:
DESeq2
edgeR
NOISeq
clusterProfiler
apeglm
ggplot2
Once all the dependencies and R packages are in place and the example files are untarred, change the PATH entries in the parameter definition file. Each PATH variable in the parameter definition file should point to the corresponding package.
For example:
If the directory ‘prokseq’ was created in /home/user/ in Step 1, and the pypy package is in the depend folder inside /home/user/prokseq/, then specify the path as below, and do likewise for all the other packages in the parameter definition file.
# Specify the path to pypy required for running afterqc
PATH PYPY /home/user/prokseq/depend/pypy2.7-v7.2.0-linux64/bin
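Because a wrong PATH entry is an easy mistake to make at this step, a short pre-flight check can be useful. The Python sketch below reads entries of the form shown above and reports those whose directory does not exist. It assumes that every path entry follows the three-field "PATH NAME directory" layout of the single example and that lines starting with # are comments; the real parameter file contains other kinds of entries, and the helper names are hypothetical, not part of ProkSeq.

```python
import os

# Example taken from the text above; a real parameter file has more entries.
EXAMPLE = """\
# Specify the path to pypy required for running afterqc
PATH PYPY /home/user/prokseq/depend/pypy2.7-v7.2.0-linux64/bin
"""


def parse_path_entries(text):
    """Collect 'PATH NAME directory' lines into a {NAME: directory} dict.

    Assumption: the layout is inferred from the single example above.
    """
    paths = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0] == "PATH":
            paths[fields[1]] = fields[2]
    return paths


def missing_paths(paths):
    """Return the entries whose directory does not exist on this machine."""
    return {name: d for name, d in paths.items() if not os.path.isdir(d)}


entries = parse_path_entries(EXAMPLE)
for name, directory in missing_paths(entries).items():
    print(f"{name}: directory not found: {directory}")
```

To check an actual file, the text can be read with open("param.input.bowtie").read() and passed to parse_path_entries in place of EXAMPLE.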
The following syntax can be used to run ProkSeq:
Usage: pipeline-v2.8.py [options] arg
Options:
-h, --help Show this help message
-s SAMPLE_FILE_NAME, --sample=SAMPLE_FILE_NAME
-p PARAMETER_FILE_NAME, --param=PARAMETER_FILE_NAME
-n NUMBER OF PROCESSORS, --numproc=NUMBER OF PROCESSORS
In the example run, the script is run with the PE (paired-end) samples described in samples.bowtie.PEsample and with the parameters defined in param.input.bowtie, using four processors. The program first checks whether all the packages are available and prints a report on the screen. Users can decide whether to proceed based on the report. The results of the pipeline are stored in the Output folder, which contains all the alignment files, statistics, and pre- and post-filter QC reports, as well as a result folder. With the example data, the pipeline produces plots for various steps, some of which are shown in figure 2. Details can be found in the GitHub repository of ProkSeq, where a sample run folder has been included.
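The example run described above can be assembled from the documented -s, -p, and -n options. The Python sketch below only builds and prints the command line. The helper name is hypothetical, and the location of pipeline-v2.8.py relative to the working directory is an assumption that should be adjusted to the actual installation.

```python
import shlex
import sys


def build_prokseq_command(sample_file, param_file, numproc,
                          script="pipeline-v2.8.py"):
    """Build the argument list for one ProkSeq run.

    Assumption: the script path defaults to the current directory;
    adjust it to wherever the pipeline script is installed.
    """
    return [
        sys.executable, script,
        "-s", sample_file,     # sample description file
        "-p", param_file,      # parameter definition file
        "-n", str(numproc),    # number of processors
    ]


cmd = build_prokseq_command("samples.bowtie.PEsample", "param.input.bowtie", 4)
print(shlex.join(cmd))

# To launch the run from Python instead of only printing the command:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.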
Discussion
ProkSeq, which is designed to be used by biologists without significant competence in bioinformatics, will provide new opportunities to discover unique events in transcription dynamics. RNA-seq data can provide much more information than simply the differential expression of known coding sequences. Exploring RNA-seq reads to single-nucleotide resolution across the genome can provide information about biological events other than gene expression. ProkSeq offers easy access to genome-wide visualization of RNA-seq data. Visualization of read mapping will reveal expression from unannotated genomic regions and intergenic regions, including 5’ and 3’ UTRs, which is of great interest in relation to novel transcriptional and translational regulation. Other tools for revealing this type of information that are available today (Table 1) usually require substantial competence in bioinformatics and provide only some of the options available in ProkSeq. Furthermore, integration of salmon in the process gives the user one of the most up-to-date methods of estimating transcript abundance. Salmon uses a realistic model of RNA-seq data that takes into account not only experimental attributes but also biases commonly observed in RNA-seq data. Users can quickly extract transcript abundance and subsequent differential expression data by opting to use salmon.
ProkSeq provides an option for batch effect identification with normalization. An essential difference between eukaryotes and prokaryotes, which can cause problems when prokaryotic gene expression is analysed using tools optimized for eukaryotic cells, is the relative number of differentially expressed genes. Tools such as DESeq2, edgeR, and Limma (Dillies, et al., 2013) are designed with the assumption that most genes are invariant, which is the case in eukaryotes. But in prokaryotes, the expression of the majority of genes is altered under specific stress conditions (Berghoff, et al., 2017; Creecy and Conway, 2015). To address this bias, ProkSeq normalizes the data at the level of nucleotide base count, making the data comparable across samples. ProkSeq provides two normalization options that can handle differential expression analyses of this type of data, which are described in detail in the supplementary methods (S1).
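The effect of the "most genes are invariant" assumption can be seen in a toy calculation. The Python sketch below is a simplified re-implementation of the median-of-ratios size-factor idea used by DESeq2; it is neither the DESeq2 code nor ProkSeq's own normalization, which is described in the supplementary methods (S1). The counts are invented, and both libraries are assumed to have been sequenced with the same efficiency.

```python
import math
from statistics import median


def median_of_ratios_size_factors(samples):
    """Size factor per sample: median over genes of count / geometric mean.

    Simplified sketch of the median-of-ratios idea; all counts must be > 0.
    """
    n_genes = len(samples[0])
    geo_means = [
        math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in range(n_genes)
    ]
    return [
        median(s[g] / geo_means[g] for g in range(n_genes))
        for s in samples
    ]


# Invented counts: 3 of 5 genes are truly doubled in the treated sample.
control = [100, 200, 300, 400, 500]
treated = [200, 400, 600, 400, 500]

sf_control, sf_treated = median_of_ratios_size_factors([control, treated])
norm_control = [c / sf_control for c in control]
norm_treated = [c / sf_treated for c in treated]

print("size factors:", sf_control, sf_treated)
print("normalized control:", norm_control)
print("normalized treated:", norm_treated)
```

Because the changed genes form the majority, the median ratio tracks them instead of the unchanged genes. After normalization the three induced genes look unchanged and the two unchanged genes look repressed, which is the kind of distortion the paragraph above describes.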
ProkSeq has been designed to meet biologists' need to analyse RNA-seq data in a reliable and time-efficient way. The built-in automatic sequential handling of the data from differential gene expression analysis to downstream functional analyses allows researchers to focus on complex biological mechanisms instead of tackling bioinformatics obstacles. The flexibility that comes with built-in options for certain steps and the visualisation of mapped reads across genomes open a path to new discoveries in gene regulation as well as in RNA biology.
Acknowledgement
The authors thank Rikki Frederiksen and Chayan Kumar for testing and feedback, and Dr Nicolas DelHomme from Umea Plant Science Centre for critical reading. The work has been supported by funding from the Knut and Alice Wallenberg Foundation (2016.0063), the Swedish Research Council (2018-02855), and the Medical Faculty at Umea University.