Seqpac: A New Framework for small RNA analysis in R using Sequence-Based Counts

Small RNA sequencing (sRNA-seq) has become important for studying regulatory mechanisms in many cellular processes. Data analysis remains challenging, mainly because each class of sRNA—such as miRNA, piRNA, tRNA- and rRNA-derived fragments (tRFs/rRFs)—needs special considerations. Analysis therefore involves complex workflows across multiple programming languages, which can produce research bottlenecks and transparency issues. To make analysis of sRNA more accessible and transparent we present seqpac: a tool for advanced group-based analysis of sRNA completely integrated in R. This opens advanced sRNA analysis for Windows users—from adaptor trimming to visualization. Seqpac provides a framework of functions for analyzing a PAC object, which contains 3 standardized tables: sample phenotypic information (P), sequence annotations (A), and a counts table with unique sequences across the experiment (C). By applying a sequence-based counting strategy that maintains the integrity of the fastq sequence, seqpac increases flexibility and transparency compared to other workflows. It also contains an innovative targeting system allowing sequence counts to be summarized and visualized across sample groups and sequence classifications. Reanalyzing published data, we show that seqpac’s fastq trimming performs equal to standard software outside R and demonstrate how sequence-based counting detects previously unreported bias. Applying seqpac to new experimental data, we discovered a novel rRF that was down-regulated by RNA pol I inhibition (anticancer treatment), and up-regulated in previously published data from tumor positive patients. Seqpac is available on github (https://github.com/Danis102/seqpac), runs on multiple platforms (Windows/Linux/Mac), and is provided with a step-by-step vignette on how to analyze sRNA-seq data.

adaptor sequences. Concatemer (chimeric) adaptors are found in small quantity in 317 most experiments, and are technical constructs where an incomplete adaptor 318 associates with a complete adaptor during synthesis (34). 319 320 2.3 The low-level evidence filter 321 The make_counts function contains a low-level filtering module, here called an 322 evidence filter. In default settings, it simply filters sequences that fails to replicate 323 across two independent samples. Even in small experiments, such as the Kang et al. 324 dataset, such filtering dramatically increases performance by reducing noise from 325 extremely rare transcripts/degradation products ( Figure 2C). Our experience is that 326 such evidence filter often results in less than half the sequence diversity (number of 327 unique read sequences; lower bars Figure  To illustrate this further, true sequence diversity-that can be replicated and is not 331 due to technical bias-should expect to rise when sRNA from two species is mixed, 332 which is also the case in the Kang et al. dataset (percentages in Figure 2C). 333 Nonetheless, the evidence filter in make_counts can both be disabled (e.g. in single 334 sample/replicate experiments) or intensified (e.g. to increase performance in very 335 large datasets). In addition, confirming our initial observation that seqpac's make_trim 336 function was better in identifying adaptor artifacts, such as concatemer adaptors, 337 make_trim identified more replicable unique sequences passing the evidence filter 338 than the popular cutadapt/fastq_quality_filter workflow ( Figure 2D). 339 340

Annotating sequence with seqpac 341
Seqpac provides two ways to annotate a sequence in a PAC object. Firstly, the 342 reannotation workflow (Figure 3: step 1-3) aligns the trimmed read sequences in the 343 PAC against reference sequences, for example a reference genome, sRNA 344 database, or sequences from another experiment, such as the results from a piwi 345 pull-down. This is done using the reannotation family of functions: map_reanno, 346 import_reanno, add_reanno and simplify_reanno. Seqpac also provides a 'backdoor 347 function', PAC_mapper, that quickly calls the reannotation workflow for mapping the 348 sequences in the PAC object (see Results 6.1, 7.2). 349 Secondly, after aligning a PAC object to a reference genome, the genomic 351 coordinates of PAC sequences can be overlapped with coordinates of already to the adaptor trimming, map_reanno calls Bowtie either internally or externally, 361 through the Rbowtie package (35) or a system call, respectively. The function can 362 parse either seqpac standard or user provided options to Bowtie. It also calls a 363 secondary function, import_reanno, which controls the import options from the Bowtie 364 output files. Options involve for example whether coordinates and fasta sequence 365 names should be reported, or only hit-or-no-hit. This is convenient for large repetitive 366 sRNA references that may generate massive files if everything is reported (e.g. 367 pirBase for humans and flies). 368

369
The map_reanno function runs multiple align/import cycles ( Figure 3: step 1). After 370 each cycle, imported data are saved as Rdata files, and only sequences without an 371 alignment to any of the references will proceed to the next cycle. Each proceeding 372 cycle allows for one additional mismatch until the user-defined max mismatches (or 373 the Bowtie limit of 3 mismatches) has been reached. Reannotating only no-hit 374 sequences in proceeding cycles not only guarantees that only the best hits are 375 reported. Since system demands per sequence increases with each added 376 mismatch, it also significantly increases performance as only the minimum number of 377 sequences are aligned in each mismatch cycle. Importantly, if a sequence aligns to 378 two references, both references will be reported for that cycle. Thus, unlike feature-379 based counting where such multimapping issues must be resolved already when 380 reads are counted, users of seqpac can decide to discriminate between annotations 381 at any stage in the analysis. 382 383 3.3 Annotating a PAC object using the add_reanno and simplify_reanno Next, using the add_reanno function the Rdata files from each mismatch cycle is 385 read into R and organized into a reanno object (Figure 3:  recommend that analysis is performed on a class-by-class basis as far as possible. 403 Nonetheless, hierarchical discrimination is often the only option to resolve some 404 issues with pseudoreplication when multiple classes of sRNA are simultaneously 405 analyzed. This is because the same sequence sometimes appears in multiple 406 reference databases, and therefore obtains multiple classifications, such as both 407 piRNA and miRNA. The purpose of the simplify_reanno function is therefore to 408 hierarchically discriminate between search matches generated by the add_reanno 409 function (Figure 3: step 3). 410 411 Importantly, since the seqpac workflow introduces simplified hierarchical 412 classifications late in the annotation process, users can quickly set alternative 413 hierarchies by just reapplying the simplify_reanno function. Unlike feature-based 414 counting, the seqpac workflow therefore makes it easier to observe the effects of 415 changes to the hierarchy. In addition, since seqpac maintains sequence integrity, 416 users may at any time blast candidate sequences at their favorite genome browsers, to verify that the correct classification was made and to get additional information 418 about the candidate. 419 420 3.4 Annotating genomic coordinates using PAC_gtf 421 When the reannotation workflow runs using the import='genome' mode, the reference 422 coordinates for each PAC sequence will be imported into the reanno object and later 423 added to the PAC annotation table. These coordinates can be parsed to the PAC_gtf 424 function as an alternative way to obtain PAC sequence annotations (Figure 3: step 4). 425 This function uses gtf/gff formatted files that contains coordinates of genomic 426 features and are available at many popular databases, such as Ensembl (37) Figure 2). This involved parallel mapping to both the human (hg38) 436 and fruit fly (dm6) genomes, as well as species specific versions of mirBase (miRNA) 437 (40), pirBase (piRNA) (41), GtRNAdb (tRNA) (42) and ensembl (many types of 438 ncRNA) databases (37). The hierarchy was set to rRNA > tRNA > miRNA > 439 snoRNA > snRNA > lncRNA > piRNA, indicating that rRNA was most prioritized and 440 piRNA was least prioritized. Mapping was carried out allowing for up to 3 441

mismatches. 442 443
As expected, the test clearly discriminated between human and fly samples in terms 444 of genome alignment, and correctly accounted for the expected genomic ratios when 445 samples from these two species had been mixed ( Figure 4A). The human proportion 446 of the dataset was more affected by perfect matching, which is expected due to more 447 outbreeding in the population, but both species gain almost 100% 'mappability' when 448 mismatches were allowed. The fruit fly proportion of the dataset was strongly 449 enriched with an rRNA sized to 30 nt ( Figure 4B). Blasting this sequence showed that 450 it was identical to the complete 2S rRNA subunit. This was expected since Kang et al.
did not to report of any method that depletes 2S rRNA prior to library construction, 452 which is commonly done in fruit fly experiments (43,44). The human proportion of the 453 dataset was instead enriched with miRNA with the expected size of 22 nt ( Figure 4B). 454 There was also a T bias in the expected range between 22-25 that may indicate 455 piRNA ( Figure 4C). Nonetheless, the proportion of piRNA classification was lower 456 than the T bias ( Figure 4B/C), which suggests that some piRNA may have been 457 classified as miRNA given that miRNA was prioritized in the hierarchy. 458 459 4.1 Subsetting and grouping data using targeting objects 460 Seqpac applies an innovative strategy for extracting sample groups and sequence 461 classifications for filtering, plotting and statistical purposes. This involves small 462 targeting objects constructed as a list with two inputs ( Figure 5). The first being a 463 character string naming a target column in a specific if an 'anno_target=' input option is available then columns in the annotation table can 469 be targeted. The second entry of a targeting object is often order sensitive. Thus, if 470 users want the sample groups to appear in a specific order in a graph, they only need 471 to provide that order in the second entry of the pheno_target object ( Figure 5). 472

473
As an example, when using the PAC_pie to generate the pie charts in Figure 4A  In a few cases, seqpac functions use targeting objects for other seqpac objects, such 481 as a PAC summary table (see Results 5.3). While the principle of these objects is 482 similar to the pheno_target and anno_target objects, they may have differences that 483 are carefully described in the manual to each function. 484

Overview preprocessing, summarization and statistical analysis 486
With or without advanced annotations, PAC objects can be filtered (PAC_filter,487 PAC_filtsep), normalized (PAC_norm), and summarized (PAC_summarize) using 488 seqpac internal functions. More advanced statistical wrappers immediately 489 compatible with PAC objects are also available (e.g. PAC_deseq, PAC_pca). Anno tables using the pheno_target and anno_target options (see Results 4.1). A 495 filter that extracts sequences that have reached a percent coverage over a certain 496 threshold is also available for both raw and normalized counts. This can for example 497 be used for the popular '20 counts in 50% of samples' filter. PAC_filter can also plot a 498 graph that shows the impact on the data at different thresholds. Conveniently, seqpac 499 provides a separate function PAC_filtersep, that extracts sequences reaching a 500 coverage threshold within sample groups. The output can directly be used to 501 construct Wenn-diagrams, for example visualizing the sequence overlap that reach 502 100 cpm within two sample groups. It can also be applied for more advanced filters, 503 like removing read sequences that do not reach 20 counts across all samples within 504 a group. 505 506

Normalize, summarize and statistical analysis 507
While the standard structure of a PAC list object contains three tables-Pheno, Anno 508 and Counts-it may hold any number of objects as long as they do not have the 509 same names as the standard objects, just like a regular list. There are, however, two 510 more standard objects that are added to the PAC object later in the analysis: the 511 norm list containing normalized counts tables, and the summary list that contains 512 summarized tables ( Figure 6). It is easy to visualize these objects as two separate 513 'folders' within a PAC object. 514

515
PAC_norm provides a few common normalization methods, like the simple 516 reads/counts per million that standardize each sample against their total counts. It 517 currently also maintains a wrapper for the rlog and vst functions of the DESeq2 518 package (29), that automatically will prepare the PAC counts table for a transformation blinded against experimental groups. Users are, however, encouraged 520 to provide their own normalization tables. As long as the table contains the same 521 sequence (row) and sample (column) names as the Counts table, and are stored in 522 the norm list ('folder') of the PAC object, seqpac functions with a norm input option 523 will automatically search the norm folder for a matching name. 524 525 PAC_summary generates simple group summaries, like means, standard deviations, 526 standard errors, percent group differences and log2 fold changes. It can be applied to 527 both raw counts, as well as normalized counts by naming a table in the PAC norm 528 list/folder using the norm input option. The grouping of samples is controlled by a 529 pheno_target object. PAC_summary does not maintain an anno_target option since 530 summaries over annotations would result in loss of sequence integrity (= feature-531 based counts). Summarizing data across both phenotype and annotation is instead 532 handled by individual functions, or by subdividing the whole PAC file using the 533 PAC_filter function prior to running PAC_summary. 534 535 For more advanced statistical analysis seqpac provides a convenient function, 536 PAC_deseq, that allow users to import a PAC object into DESeq2 (29). This function 537 automatically generates a report containing organized top tables, volcano-plots and 538 p-value distribution histograms. Further, seqpac contains the PAC_pca function that 539 performs a principle component analysis (PCA) with aid of the FactoMineR and 540 factoextra packages (45,46). This function returns scatter plots of the main 541 components annotated using either a pheno_target or anno_target. Lastly, 542 PAC_saturation performs and plots the results of a sequence saturation analysis. 543 This is often used for checking that satisfactory sequencing depths have been 544 reached, where few new sequences are predicted given a hypothetical increase in 545 the sequencing depth. 546 547

Advanced classification and visualization 548
In the quick reference presented in Table 1 a selection of visualization functions is 549 briefly presented. In common for most of them are the option to use pheno_target 550 and/or anno_target objects for grouping and ordering different plots (as described in 551 Figure 5). Seqpac plots are primarily generated using the ggplot2 package (47) and 552 outputs are often saved as lists with summarized data and graphs. As with the other 553 seqpac functions, outputs are described in detail in the functions' manuals. 554 555 Since seqpac's reannotation workflow provides a powerful and quick pipeline for 556 sequence annotation, we have also included a 'back-door' function, PAC_mapper. 557 This function is ideal for detailed mapping of smaller fasta references, such as a list 558 of tRNAs, the 45S pre-rRNA subunit, the mitochondrial genome, or simply a specific 559 genomic region download as a fasta from a genome browser. Conveniently, if a 560 Bowtie index is missing for a fasta reference, PAC_mapper will automatically 561 generate that index, making the alignment of a new fasta reference highly efficient. 562 The output of PAC_mapper is a map object, which is simply a list where each entry 563 refers to a specific sequence in the fasta reference and where the coordinates of all 564 PAC sequences that mapped the reference sequence is reported (e.g. the mapping 565 coordinates of PAC sequences aligning to a specific tRNA). This map object along 566 with the original PAC object can then be fed to the PAC_covplot function to generate 567 coverage plots across the fasta reference, as exemplified in Figure 8. As we have 568 illustrated before, such coverage plots are well suited for characterizing tRNA and 569 rRNA fragmentation (8,44,48), as well as mitochondrial RNA (48). to where the alignment starts or ends in the reference sequence. This is done by 579 either defining different ranges (e.g. classifying a fragment as 5' if it starts within the 580 first 3 nt of a tRNA), or a percentage zone (e.g. classifying a fragment as a half if it 581 ends or starts within 45-55% of a tRNA). Even better, map_rangetype may use ss 582 files, which is a format commonly used for storing information about secondary 583 structures such as tRNA loops. Thus, using this option, users can classify fragments 584 in relation to for example cleavage within a specific loop. We used this strategy to 585 identify a diet-sensitive tRNA derived fragment in human sperm, that we called nuclear internal T-loop tRNA derived RNA (nitRNA)(48). The PAC_trna function plots 587 range-classified tRNAs mimicking some graphs presented in that paper. Secondly, the extra-cellular vesicles from SCC4 and SCC154 failed almost 645 completely to align with known human sRNAs ( Figure 7B). Since seqpac maintains 646 sequence integrity, we blasted a small selection of these non-annotating sequences 647 at NCBI (53). The result strongly indicated that most reads originated from the 648 Mycoplasma hyorhinis genome. Since this is a common contaminant in cell cultures 649 (54), we ran this Mycoplasma genome in parallel to the human genome in the 650 seqpac's reannotaion workflow, thereby picking the best possible alignment from 651 either of them. This showed that all vesicle samples from SCC4 and SCC154-the 652 same samples that explained one of the main components in the PCA-suffered 653 severely from Mycoplasma contamination ( Figure 7C).

655
Now, critics may argue that a feature-based counting strategy should have corrected 656 for this contamination, since reads that fail to align against the human genome will 657 automatically be removed prior to counting. Thus, Mycoplasmic reads should not 658 have affected the results, since normalization of the counts was made after their 659 removal. We, therefore, used seqpac to detect novel sRNA originating from pre-rRNA, 687 hypothesizing that inhibiting RNA pol I would result in fewer rRFs. For this we conducted a small experiment by exposing HeLa cells-which originates from 689 cervical cancer cells-to BMH21. The exposure time was set to 60 min, and as 690 control we used DMSO. RNA from purified cells was then prepared for sRNA-seq, 691 which resulted in fastq files with 75 nt reads. From this raw data we generated an 692 annotated and filtered PAC object using only seqpac functions. To better understand the relevance of these changes we summed the cpm of all 715 fragments mapping to each peak and performed a non-parametric Mann-Whitney U 716 test. For this analysis we also included a third group of samples that had been 717 exposed to BMH21 for 12 hours, to explore if any of the effects of BMH21 were 718 amplified following long-term exposure. Astonishingly, after 12 h exposure, fragments 719 of Peak 1 had almost completely disappeared ( Figure 8F). This was not due to an 720 experimental failure since Peak 2 and Peak 3a were unaffected by the long-term 721 treatment ( Figure 8G-H). In fact, Peak 4 fragments even showed a significant up-regulation ( Figure 8I). Thus, the effects observed in Peak 1 and Peak 4 were 723 amplified by long-term exposure, but in two opposite directions. 724 725 For the Peak 2 and Peak 3 rRNA fragments we have previously observed similar 726 fragments in human sperm (8), and similar fragments located to the 5' ends of the 727 5.8S and 28S subunits in fruit fly embryos (44). This is also true for the 3' fragments 728 of the 28S subunit (Peak 4), even though we never have observed such expression 729 levels as we see in the HeLa cells. To our knowledge, however, highly expressed 730 sRNA fragments from the Peak 1 region-in the 5' external transcribed spacers 731 (ETS)-have never been described. To understand the 5' ETS rRF better, we 732 performed a multi-species blast of the main sequence at NCBI to identify similar 733 GenBank entries. This showed many alignments to ribosomal precursors in humans, 734 one identical sequence in the Chimpanzee, and a few similar sequences in the 735 Figure S1). Thus, this 5' ETS rRF has only evolved in our 736 closest relatives. Here we presented a novel and innovative bioinformatic tool-seqpac-that makes 750 advanced sRNA analysis from genome-scale sequencing data more accessible and 751 transparent. The workflow is completely integrated with R, from trimming the adaptor 752 sequences to generating plots. We showed that seqpac's trimming function performs 753 as well as, or even better, than trimming using standard tools outside R. We further 754 presented the PAC object, which builds a framework of phenotypic information (P) and sequence annotations (A) around a table based on sequence counts (C). Using 756 published data we showed that a sequence-based counting strategy-in contrast to 757 feature-based counting that is more commonly used-diminishes the risk of mistakes 758 in downstream analysis. We demonstrated the strength of maintaining sequence 759 integrity to enable re-annotation of sequences across species and classes of sRNA at 760 any point in the analysis. Lastly, we showed how seqpac can be used for sRNA 761 discovery in cancer research by the discovery of a novel rRNA derived fragment 762 (rRF) that were down-regulated by anti-cancer treatment in vitro and up-regulated in 763 tumors of cervical cancer patients. 764

765
Seqpac is available at github (https://github.com/Danis102/seqpac). As the whole 766 workflow, from adaptor trimming to mapping and plotting, are integrated in R it runs 767 on common computer platforms, including Windows, Mac and Linux. It comes with a 768 complete collection of function manuals and a vignette that guides the user in how to 769 apply the default workflow using a fastq test dataset that are included with the 770 package. R scripts that we used to generate many of the results presented in this 771 paper are available in Supplementary file S1. 772 773 It must be emphasized that seqpac is primarily designed for sRNA sequence 774 analysis. This means that it does not currently supports paired-end sequencing, 775 which is commonly applied for long RNA sequencing. Paired-end sequencing is not 776 required for most sRNA applications where the target sequence lengths seldom 777 exceed 75 nt. As we have demonstrated in this paper, too short reads-as those 778 generated using the 50-cycle flow cell kits available for MiSeq, NextSeq1000 and 779 HiSeq2500/3000/4000-should be avoided. Without some excessive sequence in 780 which the 3' adaptor can be detected, it is difficult to reliably discriminate medium 781 length sRNA (such our novel 5' ETS rRF) from unintentionally included longer RNA. 782

783
We see, however, many advantages to use sequence-based counting also in long 784 RNA sequence analysis, for example to easily extract sequences annotating to a 785 candidate mRNA and check for possible genetic variants. Coverage plots, similar to 786 what we describe for the 45S pre-rRNA ( Figure 8B) would also be applicable for 787 mRNA coverage to visualize splice variants and intronic transcription. Even though we hope to develop long RNA analysis in future updates of seqpac, there are 789 currently a few technical constraints that needs to be resolved. 790

791
As mentioned, paired-end reads are not supported either for trimming or counting. In 792 addition, while Bowtie (20) is still the most popular aligner for sRNA, it does not 793 support indel mapping. While this is not a great problem if sequence integrity is 794 maintained and candidate sequences subsequently can be blasted to detect any slip 795 through, this problem are slightly more announced in samples differing much from 796 their reference genomes, such as cancer cell-lines. A likely reason for Bowtie's 797 popularity in sRNA community is because it is reliable with short sequence 798 alignments. For instance, we initially tried to integrate the Rsubreads package (61) in 799 seqpac's workflow, which applies a highly efficient 'seed-and-vote' mapping 800 algorithm. However, for certain read lengths we consistently experienced failure to 801 correctly vote for the best alignment, possibly as a consequence that too few seeds 802 were covering the read. We will off-course explore more efficient alternatives to 803 Bowtie in the future. 804 805 By using the sequence-based approach of seqpac, we have discovered a novel 806 rRNA derived sRNA (rRF) in the 5' ETS of 45S pre-rRNA. This rRF responds 807 negatively to anticancer treatment and are up-regulated in tumors. The scope of our 808 study was not to dwell deep into the mechanism and clinical potential of this rRF. To 809 our knowledge, however, this fragment has not been described before, and from our 810 experience sRNA in the 5' ETS of pre-rRNAs are rare. This, together with the insight 811 that the sequence is relatively unique to humans (with only some homology in 812 Chimpanzees and Gorillas), makes it a good target for future studies on biomarkers 813 in cancer treatment and diagnosis. In our HeLa cell experiment, the main fragment 814 was 61 nt, which indicates a unique fragment given that we had a maximum read 815 length of 75 nt. Even though the methods used in Tong et al. (25) and Xu et al. 816 (26,27) were restricted to a maximum read length of 50 nt, we found traces of this 817 fragment in the pile of fragments with unverifiable length of ≥ 50 nt. It must be 818 emphasized, however, that we tried to validate the 5' ETS rRF in yet another dataset, 819 Snoek et al.

PAC_gtf tibble, rtracklayer, GenomicRanges
Overlaps genomic coordinates of PAC sequences with features of a gtf file.

PAC_mapper
Backdoor to the reanno workflow for fast mapping of PAC resulting in a map object.

PAC analysis
Filtering and normalization.

PAC_filter
-Subsets data by target objects or coverage thresholds.
(preprocessing) PAC_filtersep -Extracts sequences reaching a threshold within groups of a pheno_target object.

PAC_norm DESeq2
Normalize a raw counts table and saves it in the PAC$norm 'folder'.

PAC analysis
Performs statistical analyses and visualizations.

PAC_pca FactoMineR, factoextra
Performs principal component analysis and plots the results.

PAC_saturation foreach, ggplot2
Performs a sequence saturation analysis and plots the results.

PAC analysis
Generates graphs and saves processed data summarized over both phenotype and annotations.
All functions are described in detail in the manual for each function (e.g. '?make_counts' in the R terminal) and are exemplified in the seqpac vignette; 'vignette("seqpac")' in R terminal).