tRNAnalysis: A flexible pre-processing and next-generation sequencing data analysis pipeline for transfer RNA

Many tools have been developed to analyse small RNA sequencing data; however, accurately processing reads that align to small RNAs remains challenging because of their short read length. Most pipelines have been developed with miRNA analysis in mind, and there are currently very few workflows focused on the analysis of transfer RNAs (tRNAs). Moreover, these workflows are low throughput, difficult to install and lack sufficient visualisation to make the output interpretable. To address these issues, we have built a comprehensive and customisable small RNA-seq data analysis pipeline, with emphasis on the analysis of tRNAs. The pipeline takes as input a fastq file of small RNA sequencing reads and performs successive steps of mapping and alignment to transposable elements, gene transcripts, miRNAs, snRNAs, rRNA and tRNAs. Subsequent steps generate summary statistics on reads of tRNA origin, which are then visualised in an HTML report. Unlike other low-throughput analysis tools currently available, our high-throughput method allows the simultaneous analysis of multiple samples and scales with the number of input files. tRNAnalysis is run from the command line and is implemented predominantly in Python and R. The source code is available at https://github.com/Acribbs/tRNAnalysis.

and tertiary structure, making library preparation difficult [1]. Therefore, efficient library preparation methods must be employed to overcome the rigid structure of tRNA, which usually limits the use of standard library preparation methods for sequencing [2].
With respect to data analysis, the challenges lie in overcoming the reverse transcription errors introduced by chemical modifications and in accurately mapping reads to tRNA genomic regions, given their multiple identical and near-identical genomic loci [3]. Most mapping strategies for gene expression analysis report only read alignments with unique best matches and thus discard reads mapping to tRNAs altogether. As a consequence, specialist mapping strategies to accurately map tRNAs have been proposed [4,5]. Specifically, Hoffmann et al. (2018) have recently proposed a two-pass mapping strategy that first maps reads to a tRNA-masked genome and then aligns the reads left unmapped directly to merged tRNA clusters [3].
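As a rough illustration of the two-pass idea described above, the following Python sketch masks tRNA loci in a toy genome, tries each read against the masked genome first, and only then tries the remaining reads against tRNA cluster sequences. It is not the implementation used by tRNAnalysis or by Hoffmann et al.: exact substring search stands in for a real aligner, and all sequences, coordinates and names (`mask_regions`, `two_pass_map`, `cluster1`) are hypothetical.

```python
# Toy illustration of a two-pass mapping strategy.
# Assumption: exact substring search stands in for a real read aligner,
# and all sequences, coordinates and cluster names are invented.


def mask_regions(genome, regions):
    """Replace each half-open (start, end) tRNA interval with 'N' characters."""
    masked = list(genome)
    for start, end in regions:
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


def two_pass_map(reads, genome, trna_regions, trna_clusters):
    """Assign each read to 'genome', a tRNA cluster name, or 'unmapped'."""
    masked_genome = mask_regions(genome, trna_regions)
    assignments = {}
    for read in reads:
        # Pass 1: try the tRNA-masked genome first.
        if read in masked_genome:
            assignments[read] = "genome"
            continue
        # Pass 2: reads left unmapped are tried against the tRNA clusters.
        hit = next(
            (name for name, seq in trna_clusters.items() if read in seq),
            None,
        )
        assignments[read] = hit if hit is not None else "unmapped"
    return assignments


# Hypothetical toy data: one tRNA locus at positions 8..20 of the genome.
genome = "ACGTACGT" + "GGGGCCCCAAAA" + "TTTTGCGC"
trna_regions = [(8, 20)]
trna_clusters = {"cluster1": "GGGGCCCCAAAA"}
reads = ["GTACG", "GCCCCA", "TTGCG", "CCCCCCC"]

result = two_pass_map(reads, genome, trna_regions, trna_clusters)
```

In this toy example the reads taken from the flanking sequence are assigned in the first pass, the read from the masked locus is only recovered in the second pass, and a read matching neither reference stays unmapped.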
While computational pipelines have been developed in the past for small RNA-seq data analysis [6], there are currently few small RNA-seq pipelines that focus on tRNA data analysis. tRNAnalysis is a pipeline written using the CGAT-core workflow manager [9]. It runs from a single launch command, while remaining modular enough that any individual task within the pipeline can be run on its own. tRNAnalysis implements best-practice mapping strategies to allow accurate mapping of tRNA reads [3]. The pipeline is optimised to process all input fastq files in parallel, produces detailed logging information, and can be run locally or on a high-performance cluster. tRNAnalysis can be installed using Conda, and a Docker image is also provided with all of the software and packages installed. Users can therefore plug and play the pipeline without having to install numerous dependencies. Finally, we provide a user-friendly HTML report to visualise qualitative and quantitative outputs from the pipeline.

Methods
tRNAnalysis automates and integrates best-practice small RNA sequencing analysis, allowing for automatic cluster submission and parallelisation. The pipeline wraps standard software for each of its functions and provides appropriate default settings, with the option to customise pipeline configuration parameters and job resources as required. The workflow is written predominantly in Python and R, using CGAT-ruffus decorators [10] and CGAT-core [9] as the workflow management system. The pipeline runs from a single command-line interface, and the main steps in the analysis are: • Read pre-processing and quality assessment.
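The per-sample parallelisation described above can be pictured with a minimal Python sketch that uses only the standard library. It does not use the CGAT-core or ruffus APIs and is not how tRNAnalysis is implemented; the step names, file names and functions (`sample_name`, `process_sample`, `run_all`) are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: hypothetical step names, not the pipeline's real task names.
STEPS = ["quality_assessment", "mapping", "trna_summary"]


def sample_name(fastq_path):
    """Derive a sample name by stripping common fastq file suffixes."""
    name = Path(fastq_path).name
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def process_sample(fastq_path):
    """Placeholder per-sample task: list the ordered steps it would run."""
    sample = sample_name(fastq_path)
    return sample, [f"{sample}:{step}" for step in STEPS]


def run_all(fastq_paths, max_workers=4):
    """Process every input file independently and collect results by sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_sample, fastq_paths))


# Hypothetical input files; nothing is read from disk in this sketch.
results = run_all(["data/cells_R1.fastq.gz", "data/vesicles.fq"])
```

Because each sample is handled by an independent task, adding input files adds tasks rather than lengthening any single one, which is the property a workflow manager exploits when it submits jobs to a cluster.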

Mapping of reads
The input for mapping is a collection of pre-processed fastq files and a list of

Results
To allow users to familiarise themselves with the functionality of tRNAnalysis, we have included a tutorial example that can be accessed in the GitHub documentation.
However, in order to validate the accuracy of our methods, we have used our pipeline to analyse previously generated datasets from Chiou et al. (2018) and recover their main finding, namely that specific tRNA fragments are enriched in extracellular vesicles and released by activated T cells (Figure 1). Furthermore, using our workflow we were able to present a more detailed evaluation of the different tRNA types, with mapped reads collapsed across all tRNAs (Figure 1A), by codon (not shown) and by amino acid (Figure 1B). Our pipeline also allows the user to quickly evaluate the relative proportion of tRNA fragments (Figure 1C) and tRNA halves (Figure 1D) in a sample. Finally, we demonstrated that tRNAnalysis can be used to accurately quantify tRNA differential expression between cells and extracellular vesicles (Figure 2).
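The collapsing of read counts by codon and by amino acid mentioned above can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline's own code: it assumes tRNA gene names of the form `tRNA-<AminoAcid>-<Anticodon>-<n>-<m>`, uses the anticodon embedded in the name as the codon-level key, and the counts and function names (`collapse_counts`, `proportions`) are invented.

```python
from collections import Counter


def collapse_counts(read_counts):
    """Sum per-gene read counts by anticodon and by amino acid.

    Assumption: names look like 'tRNA-<AminoAcid>-<Anticodon>-<n>-<m>'.
    """
    by_anticodon, by_amino_acid = Counter(), Counter()
    for name, count in read_counts.items():
        _, amino_acid, anticodon = name.split("-")[:3]
        by_anticodon[f"{amino_acid}-{anticodon}"] += count
        by_amino_acid[amino_acid] += count
    return by_anticodon, by_amino_acid


def proportions(counter):
    """Convert summed counts into fractions of the total."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counter.items()}


# Hypothetical read counts for four tRNA genes in one sample.
counts = {
    "tRNA-Ala-AGC-1-1": 10,
    "tRNA-Ala-AGC-2-1": 5,
    "tRNA-Ala-CGC-1-1": 5,
    "tRNA-Gly-GCC-1-1": 20,
}
by_anticodon, by_amino_acid = collapse_counts(counts)
amino_acid_fractions = proportions(by_amino_acid)
```

Collapsing in this way sidesteps the ambiguity between identical gene copies, since reads that cannot be assigned to a single locus can still be counted at the anticodon or amino acid level.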

Discussion
We have developed a one-stop pipeline for the processing and analysis of small RNA-seq data, with a particular emphasis on the analysis of reads aligning to tRNAs.
tRNAnalysis is extensible and robust, and can be run locally or scaled up and run on an HPC cluster. Additionally, the pipeline is portable since it is distributed as a Conda package and a Docker image. Currently, tRNAnalysis supports data generated from all major RNA-seq platforms and implements best-practice mapping of tRNA reads [3]. The pipeline encourages a reproducible research approach by implementing exploratory data analysis in the R statistical programming language, with the analysis initially generated by the pipeline. Wrapping prototyped R Markdown analysis files allows the rapid reproduction of analyses and the reuse of code across multiple tRNA datasets. To illustrate a simple use case of tRNAnalysis, we have included an example small RNA-seq dataset and tutorial, which can be accessed from within the GitHub repository (https://github.com/Acribbs/tRNAnalysis). tRNAnalysis is currently being used to process data in numerous ongoing projects and is under active development to meet the demands of the burgeoning small RNA sequencing field.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.