Package to annotate and prioritize putative oncogenic RNA fusions

Background Gene fusion events are a significant source of somatic variation across adult and pediatric cancers and have provided some of the most effective clinically relevant therapeutic targets, yet computational algorithms for fusion detection from RNA sequencing data show low overlap of predictions across methods. In addition, events such as polymerase read-throughs, mis-mapping due to gene homology, and fusions occurring in healthy normal tissue require stringent filtering, making it difficult for researchers and clinicians to discern gene fusions that might be true underlying oncogenic drivers of a tumor and, in some cases, appropriate targets for therapy.
Results Here, we present annoFuse, an R package developed to annotate and identify biologically-relevant expressed gene fusions, along with highlighting recurrent novel fusions in a given cohort. We applied annoFuse to STAR-Fusion and Arriba results for 1028 pediatric brain tumor samples provided as part of the Open Pediatric Brain Tumor Atlas (OpenPBTA) Project. First, we used FusionAnnotator to identify and filter “red flag” fusions found in healthy tissues or in gene homology databases. Using annoFuse, we filtered out fusions known to be artifactual and retained high-quality fusion calls supported by at least one junction read; to reduce false positives from background noise, we removed calls whose spanning fragment support exceeded the junction read count by more than 10 reads. Second, we prioritized and captured known and putative oncogenic driver fusions: those previously reported in TCGA, or those containing gene partners that are known oncogenes, tumor suppressor genes, or COSMIC genes. Finally, using annoFuse, we determined recurrent fusions across the cohort and recurrently-fused genes within each histology.
Conclusions annoFuse provides a standardized filtering and annotation method for gene fusion calls from STAR-Fusion and Arriba by merging, filtering and prioritizing putative oncogenic fusions across large cancer datasets, as demonstrated here with the OpenPBTA dataset. We are expanding the package to be widely-applicable to other fusion algorithms, adding functionalities, and expect annoFuse to provide researchers a method for quickly evaluating and prioritizing fusions in patient tumors.

DREAM SMC-RNA Challenge, respectively (16). In a 2019 assessment of 23 fusion algorithms for cancer biology, both Arriba and STAR-Fusion ranked in the top three fastest and most accurate tools (17). annoFuse utilizes a four-step process (Figure 1), implemented as flexible functions for merging, filtering, and prioritizing fusion calls from multiple fusion calling algorithms.

RNA Expression and Fusion Calls
Currently, annoFuse is compatible with fusion calls generated from Arriba v1.1.0 (18) and/or STAR-Fusion 1.5.0 (13). Both tools utilize aligned BAM and chimeric SAM files from STAR as inputs, and STAR-Fusion calls are annotated with GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz, which is provided in the STAR-Fusion release. Arriba should be provided with strandedness information, or set to auto-detection for poly-A enriched libraries. Additionally, the blacklist file, blacklist_hg38_GRCh38_2018-11-04.tsv.gz, contained in the Arriba release tarballs, should be used to remove recurrent fusion artifacts and transcripts present in healthy tissue. An expression matrix with FPKM or TPM values is also required; the matrix should have a column "GeneSymbol" following the same gene naming convention as found in the fusion calls.

Fusion Call Preprocessing
Because STAR-Fusion runs FusionAnnotator as its final step, we require that all fusion calls be annotated with FusionAnnotator v. 0.2.0 so that they contain the additional column, "annots". Finally, fusion calls for all samples should be merged into a single TSV file with an additional column, "tumor_id", which will enable artifact filtering, annotation, fusion prioritization, and determination of recurrence.
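The merging requirement described above could be sketched as follows. This is a minimal illustration in Python rather than part of annoFuse, which is an R package; the function name merge_fusion_calls, the sample identifiers, and the example rows are hypothetical. Only the "annots" and "tumor_id" column names come from the text.

```python
import csv
import io


def merge_fusion_calls(per_sample_tsv):
    """Merge per-sample TSV text into one list of rows with a tumor_id column.

    per_sample_tsv maps a (hypothetical) sample identifier to the TSV text
    of that sample's FusionAnnotator-annotated fusion calls.
    """
    merged = []
    for tumor_id, tsv_text in per_sample_tsv.items():
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        for row in reader:
            row["tumor_id"] = tumor_id  # record which sample the call came from
            merged.append(row)
    return merged


# Hypothetical two-sample input; gene names and annotations are illustrative.
calls = {
    "sample_A": "FusionName\tannots\nKIAA1549--BRAF\t[Mitelman]\n",
    "sample_B": "FusionName\tannots\nGENE1--GENE2\t[BodyMap]\n",
}
merged = merge_fusion_calls(calls)
```

Keeping the sample identifier on every row is what later allows recurrence to be counted per sample rather than per call.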

Step 1: Fusion Standardization
To obtain a standardized format for fusion calls from multiple callers, we use the fusion_standardization function to convert caller-specific output files to the standardizedFusionCalls format defined in the package README. Because fusion_standardization puts calls from multiple callers into a common format, users are free to annotate their calls with other databases in the annots column, which can then be used for filtering.
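The idea of standardization can be sketched as a mapping from caller-specific fields to one shared record. The sketch below is in Python and is not the annoFuse implementation: the output keys are hypothetical stand-ins for the standardizedFusionCalls format defined in the package README, and the input column names are assumptions about typical STAR-Fusion and Arriba output that should be checked against the caller versions in use.

```python
def standardize(row, caller):
    """Convert one caller-specific row (a dict) into a common record.

    Column names are assumptions for illustration; the output schema is
    hypothetical and simpler than the format defined in the package README.
    """
    if caller == "STARFUSION":
        gene1, gene2 = row["#FusionName"].split("--")
        junction = int(row["JunctionReadCount"])
        spanning = int(row["SpanningFragCount"])
    elif caller == "ARRIBA":
        gene1, gene2 = row["#gene1"], row["gene2"]
        # Assumption: split reads play the role of junction reads and
        # discordant mates the role of spanning fragments.
        junction = int(row["split_reads1"]) + int(row["split_reads2"])
        spanning = int(row["discordant_mates"])
    else:
        raise ValueError(f"unsupported caller: {caller}")
    return {
        "fusion": f"{gene1}--{gene2}",
        "junction_reads": junction,
        "spanning_frags": spanning,
        "annots": row.get("annots", ""),
        "caller": caller,
    }


star_row = {"#FusionName": "KIAA1549--BRAF", "JunctionReadCount": "12",
            "SpanningFragCount": "7", "annots": "[Mitelman]"}
arriba_row = {"#gene1": "KIAA1549", "gene2": "BRAF", "split_reads1": "5",
              "split_reads2": "6", "discordant_mates": "4"}
standardized = [standardize(star_row, "STARFUSION"),
                standardize(arriba_row, "ARRIBA")]
```

Once both callers share one record layout, every later filter can be written once instead of once per caller.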

Step 2: Fusion Filtering
Events such as polymerase read-throughs, mis-mapping due to gene homology, and fusions occurring in healthy normal tissue confound the detection of true recurrent fusions and, in some cases, produce false positive calls involving oncogenes, tumor suppressor genes, or kinases.
In this step, we filter the standardized fusion calls to remove artifacts and false positives (Table 2) using the function fusion_filtering_QC. The parameters are flexible to allow users to annotate and filter the fusions with a priori knowledge of their call set. For example, since the calls are pre-annotated with FusionAnnotator, the user can remove fusions flagged as red flags by any of the following databases: GTEx_recurrent_STARF2019, HGNC_GENEFAM, DGD_PARALOGS, Greger_Normal, Babiceanu_Normal, BodyMap, and ConjoinG. This is done using the parameter artifact_filter = "GTEx_recurrent_STARF2019 | DGD_PARALOGS | Normal | BodyMap | ConjoinG". Of note, we decided not to remove genes annotated in HGNC_GENEFAM, as this database contains multiple oncogenes and their removal resulted in missed true fusions in our validation truth set. Read-throughs annotated by any algorithm can also be removed at this step by using the parameter "readthroughFilter=TRUE". During validation, we observed that the real oncogenic fusion P2RY8-CRLF2 (19,20) was annotated as a read-through in acute lymphoblastic leukemia samples. We therefore implemented a condition such that if a fusion is annotated as a read-through but is present in the Mitelman cancer fusion database, we scavenge it back as a true positive call.
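The artifact filter and the read-through rescue described above can be sketched as a pattern match on the annots column. This Python sketch is an illustration under stated assumptions and not the code of fusion_filtering_QC: the record keys (annots, is_readthrough) are hypothetical, the pattern is written without spaces because spaces would be matched literally in a regular expression, and applying the artifact filter before the rescue is an assumption about ordering.

```python
import re

# Pattern from the text, written without spaces around "|".
ARTIFACT_FILTER = "GTEx_recurrent_STARF2019|DGD_PARALOGS|Normal|BodyMap|ConjoinG"


def keep_call(call, artifact_filter=ARTIFACT_FILTER, readthrough_filter=True):
    """Return True if a call survives artifact and read-through filtering."""
    annots = call["annots"]
    if re.search(artifact_filter, annots):
        return False  # red-flag annotation found
    if readthrough_filter and call["is_readthrough"]:
        # Rescue read-throughs that are known cancer fusions.
        return "Mitelman" in annots
    return True


# Hypothetical annotated calls.
rescued = {"fusion": "P2RY8--CRLF2", "annots": "[Mitelman]", "is_readthrough": True}
readthrough = {"fusion": "GENE1--GENE2", "annots": "[]", "is_readthrough": True}
normal = {"fusion": "GENE3--GENE4", "annots": "[Greger_Normal]", "is_readthrough": False}
family = {"fusion": "GENE5--GENE6", "annots": "[HGNC_GENEFAM]", "is_readthrough": False}
results = [keep_call(c) for c in (rescued, readthrough, normal, family)]
```

Note that the single token "Normal" covers both Greger_Normal and Babiceanu_Normal, and that HGNC_GENEFAM is deliberately absent from the pattern.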
This function also allows users to filter out fusions predicted to be artifactual while retaining high-quality fusion calls, by requiring junction read support of ≥ 1 (default) and an excess of spanning fragment support over the junction read count of < 10 reads (default), as disproportionate spanning fragment support indicates false positive calls (18). Finally, if both genes of the fusion are deemed not expressed (< 1 FPKM, default), the fusion transcript calls can be removed using the function expressionFilterFusion.
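The read-support and expression thresholds above can be written as two small predicates. This is a Python sketch of one reading of the text, not annoFuse's code: it assumes "spanning fragment support compared to the junction read count" means the difference between the two counts, and the function and parameter names are hypothetical.

```python
def passes_read_support(junction_reads, spanning_frags,
                        min_junction=1, max_spanning_excess=10):
    """Keep calls with enough junction reads and no disproportionate
    spanning fragment support (assumed to mean spanning minus junction)."""
    return (junction_reads >= min_junction
            and (spanning_frags - junction_reads) < max_spanning_excess)


def is_expressed(gene1, gene2, fpkm, cutoff=1.0):
    """Keep a fusion unless both partner genes fall below the cutoff.

    fpkm maps gene symbol to expression; missing genes count as 0.
    """
    return fpkm.get(gene1, 0.0) >= cutoff or fpkm.get(gene2, 0.0) >= cutoff


# Hypothetical expression values for one sample.
fpkm = {"KIAA1549": 8.2, "BRAF": 15.0, "GENE1": 0.2, "GENE2": 0.0}
support_ok = passes_read_support(junction_reads=12, spanning_frags=7)
support_bad = passes_read_support(junction_reads=1, spanning_frags=25)
expressed = is_expressed("KIAA1549", "BRAF", fpkm)
silent = is_expressed("GENE1", "GENE2", fpkm)
```

Requiring only one expressed partner is the permissive choice: a fusion is dropped on expression grounds only when neither gene shows evidence of transcription.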

Step 3: Fusion Annotation
The annotateFusionCalls function annotates standardized fusion calls and performs customizable fusion annotation based on user gene lists as input. As a default setting, we provide lists of, and annotate gene partners as, oncogenes, tumor suppressor genes, and oncogenic fusions.
The optional ZscoredAnnotation function provides z-scored expression values computed against a user-supplied matrix, such as GTEx or the cohort itself, so that samples with and without the fusion can be compared to look for over- or under-expression of fused genes relative to normal, using a zscoreFilter. A cutoff of 2 (default) is set to annotate any score > 2 standard deviations away from the median as differentially-expressed. Researchers can then use this information to decide whether to perform additional downstream filtering.
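A z-score annotation of this kind can be sketched as below. This Python sketch is an assumption-laden illustration, not the ZscoredAnnotation function: it uses the conventional z-score (distance from the reference mean in units of the sample standard deviation), and the function name and the output labels are hypothetical.

```python
import statistics


def zscore_annotation(value, reference, cutoff=2.0):
    """Z-score one gene's expression against reference expression values
    and label it using a symmetric cutoff."""
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)  # sample standard deviation
    z = (value - mean) / sd
    if z > cutoff:
        label = "over-expressed"
    elif z < -cutoff:
        label = "under-expressed"
    else:
        label = "no change"
    return z, label


# Hypothetical reference expression for one gene (e.g. normal tissue).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
z_high, label_high = zscore_annotation(10.0, reference)
z_mid, label_mid = zscore_annotation(3.0, reference)
```

The label is an annotation only; whether to filter on it is left to the user, as the text notes.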

Step 4: Project-Specific Filtering
Each study often requires that additional downstream analyses be performed once high-quality annotated fusion calls are obtained. We developed functions to enable analyses at the cohort (or project) level and/or the group level (e.g., histologies), designed to remove cohort-specific artifactual calls while retaining high-confidence fusion calls. The function called_by_n_callers annotates the number of algorithms that detected each fusion. We retained fusions with genes not annotated with the gene lists above (e.g., oncogene) that were detected by both algorithms as in-frame or frameshift, as these could represent novel fusions. At the group level, we add groupcount_fusion_calls (default ≥ 1) to remove fusions that are present in more than one type of cancer. At the sample level, fusion_multifused detects fusions in which one gene partner is detected with multiple partners (default ≥ 5), and we remove these as potential false positives.
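Two of the counts described above, the number of callers per fusion and the number of partners per gene within a sample, can be sketched with grouped sets. This Python sketch is illustrative and does not reproduce called_by_n_callers or fusion_multifused; the record keys and function names are hypothetical.

```python
from collections import defaultdict


def count_callers(calls):
    """Number of distinct algorithms reporting each (sample, fusion) pair."""
    callers = defaultdict(set)
    for call in calls:
        callers[(call["tumor_id"], call["fusion"])].add(call["caller"])
    return {key: len(names) for key, names in callers.items()}


def multifused_genes(calls, min_partners=5):
    """(sample, gene) pairs in which the gene has many distinct partners."""
    partners = defaultdict(set)
    for call in calls:
        gene1, gene2 = call["fusion"].split("--")
        partners[(call["tumor_id"], gene1)].add(gene2)
        partners[(call["tumor_id"], gene2)].add(gene1)
    return {key for key, found in partners.items() if len(found) >= min_partners}


# Hypothetical calls: one fusion seen by both callers, one promiscuous gene.
calls = [
    {"tumor_id": "s1", "fusion": "KIAA1549--BRAF", "caller": "STARFUSION"},
    {"tumor_id": "s1", "fusion": "KIAA1549--BRAF", "caller": "ARRIBA"},
]
calls += [{"tumor_id": "s2", "fusion": f"GENEX--PARTNER{i}", "caller": "ARRIBA"}
          for i in range(5)]
n_callers = count_callers(calls)
promiscuous = multifused_genes(calls)
```

Using sets means repeated calls of the same fusion by one algorithm, or of the same partner, are counted once.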
Separately, the function fusion_driver retains only fusions in which a gene partner was annotated as a tumor suppressor gene, oncogene, kinase, transcription factor, and/or the fusion was previously found in TCGA. This enables annoFuse to scavenge back potential oncogenic fusions which may have otherwise been filtered. Both sets of fusions are then merged into a final set of putative oncogenic fusions. Finally, samplecount_fusion_call identifies fusions recurrently called in (default ≥ 2) samples within each group.
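Recurrence within a group can be sketched as counting the distinct samples that carry each fusion. The following Python sketch is not samplecount_fusion_call itself; the record keys, group labels, and function name are hypothetical, and only the default threshold of two samples comes from the text.

```python
from collections import defaultdict


def recurrent_fusions(calls, min_samples=2):
    """Fusions seen in at least min_samples distinct samples per group."""
    samples = defaultdict(set)
    for call in calls:
        samples[(call["group"], call["fusion"])].add(call["tumor_id"])
    return {key: len(ids) for key, ids in samples.items()
            if len(ids) >= min_samples}


# Hypothetical cohort: one recurrent fusion and one private fusion.
calls = [
    {"group": "histology_A", "fusion": "KIAA1549--BRAF", "tumor_id": "s1"},
    {"group": "histology_A", "fusion": "KIAA1549--BRAF", "tumor_id": "s2"},
    {"group": "histology_A", "fusion": "KIAA1549--BRAF", "tumor_id": "s2"},
    {"group": "histology_B", "fusion": "GENE1--GENE2", "tumor_id": "s3"},
]
recurrent = recurrent_fusions(calls)
```

Counting sample identifiers rather than rows prevents a fusion reported by two algorithms in one sample from appearing recurrent.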

Visualization
Quick visualization of filtered and annotated fusion calls can provide information useful for review and downstream analysis. We provide the function plotSummary, which plots the distribution of intra-chromosomal and inter-chromosomal fusions, the number of in-frame and frameshift calls per algorithm, and the distribution of gene biotypes, kinase groups, and oncogenic annotations. If project-specific filtering is utilized, barplots displaying recurrent fusions and recurrently-fused genes can be generated using plotRecurrentFusion and plotRecurrentFusedGene, respectively.

Technical validation of annoFuse
Few gene fusion "truth" sets exist, and they comprise simulated data or synthetic fusions spiked into breast cancer cell lines or total RNA (16,17,21). We therefore utilized a recent study in which fusions were called and high-confidence fusions reported in 244 patient-derived xenograft models from the Pediatric Preclinical Testing Consortium (PPTC) (22). A set of 27 fusions was molecularly validated from acute lymphoblastic leukemia (ALL) models in the PPTC dataset and comprises a "truth" set. Table 3 describes the performance of annoFuse, in which we achieved 100% accuracy in calling true positive fusions and an average 96% accuracy of high-confidence fusions as defined in (22). Interestingly, only 114/166 total fusions were detected using STAR-Fusion and Arriba (23/27 within the "truth" set), implying that gold standard algorithms alone still fail to capture the full landscape of gene fusions and that additional algorithms should be integrated into our workflow. Of the 114 fusions we detected, 110 were retained as putative oncogenic fusions using annoFuse. The four fusions annoFuse did not retain were removed with the "readthrough" filter, which can be turned off as an option.
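For orientation, the counts quoted above can be converted to percentages with a few lines of Python. This is plain arithmetic on the numbers in the text, with a hypothetical helper name; it makes no claim about how the figures in Table 3 were computed.

```python
def percent(part, total):
    """Percentage of total represented by part, rounded to one decimal."""
    return round(100 * part / total, 1)


detected_overall = percent(114, 166)   # fusions detected by the two callers
detected_truth = percent(23, 27)       # validated fusions that were detected
retained = percent(110, 114)           # detected fusions kept by annoFuse
```

In round terms, the two callers recovered about 69% of all reported fusions and about 85% of the validated set, and annoFuse retained about 96% of what was detected.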

Case study with annoFuse using OpenPBTA
As proof of concept, we utilized RNA expression generated by STAR-RSEM (23) and Following fusion standardization, annotation, and filtering, we applied project-specific filtering to the OpenPBTA RNA-Seq cohort (n = 1,028 biospecimens from n = 943 patients). The number of in-frame and frameshift fusions per algorithm was roughly equivalent within the STAR-Fusion and Arriba fusion calls (Figure 2B). Figure 2C depicts the density of genes categorized by gene biotype (biological type), and as expected from biologically-functional fusions, the majority of gene partners are classified as protein-coding. The majority of gene partners were annotated as tyrosine kinase (TK) or tyrosine kinase-like (TKL) (Figure 2D). In Figure 2E, the user can explore the biological and oncogenic relevance of the fusions across histologies. Here, we note that in most histologies, the most prevalent gene partners were classified as oncogenes and the least prevalent as tumor suppressor genes. Notably, many 3' fusion partners within low-grade astrocytic tumors are kinases, which follows expectations listed below.
Following project-specific filtering, we observed KIAA1549-BRAF as the most recurrent in-frame fusion in our cohort (n = 109/943), which was expected as KIAA1549-BRAF  Figure 3A). In addition to recurrent fusions, we also detect recurrently-fused genes to account for partner promiscuity. This enables us to see a broader picture of gene fusions, specifically within diffuse astrocytic and oligodendroglial tumors, in which we see fusions prevalent in ST7, MET, FYN, REV3L, AUTS2, and ROS1, and meningiomas, in which NF2 fusions are common (Figure 3B).
The few openly-available fusion annotation and prioritization tools (Table 1)  prioritization. Therefore, we leverage the algorithm-agnostic capabilities of FusionAnnotator to pre-annotate fusion input from STAR-Fusion and Arriba.
By integrating FusionAnnotator with the functionality of the current gold standard algorithms STAR-Fusion and Arriba, we were able to improve the aforementioned tools' capabilities by meeting the current demands of the research community. We provide the user with flexible filtering parameters and envision that annoFuse will be used to quickly filter sequencing artifacts and false positives, as well as to further annotate fusions for additional biological functionality (e.g., kinases, transcription factors, oncogenes, tumor suppressor genes) to increase the signal-to-noise ratio in a cohort of fusion calls. Users can opt to simply annotate and filter artifacts or use annoFuse to functionally prioritize fusions as putative oncogenic drivers. During the prioritization steps, we filter based on genes with cancer relevance (see biological functionality list above) and perform analysis of fusion and fused-gene recurrence to create a stringently-filtered, prioritized list of fusions likely to have oncogenic potential.
As an additional feature, we plan to add expression-based comparison of genes between fused samples, normal, and within a histology or cohort. We acknowledge that protein domain annotation and retention is very important for prioritizing fusion calls and as such, we are working to add functionality from the algorithm-agnostic AGFusion tool in the near future. Likewise, we would like to integrate the recent FusionPathway tool, which is also algorithm agnostic, but depends on protein domain annotation to perform GSEA for oncogenic association. We plan to add additional fusion algorithms currently used by the community, such as deFuse, FusionCatcher, and SOAPfuse, to further increase the applicability of annoFuse. Future features could also include assessment of domain retention, combined with linkage to drug databases to predict fusion-directed targeting strategies.

Conclusions
Gene fusions provide a unique mutational context in cancer in which two functionally distinct genes are combined to function as a new biological entity. Despite showing great promise as diagnostic, prognostic, and therapeutic targets, gene fusions have not yet seen accelerated translation into the oncology clinic. This has been partly due to limited translation of the large number of bioinformatically-derived fusion results into biologically meaningful information. In our efforts to address this, we introduce annoFuse, an R Package to annotate and prioritize putative oncogenic RNA fusions, providing a range of functionalities to filter and annotate fusion calls from multiple algorithms. We include a cancer-specific workflow to find recurrent, oncogenic fusions from large cohorts containing multiple cancer histologies. The multi-algorithm filtering and annotation steps within annoFuse enable users to integrate calls from multiple algorithms to improve high-confidence, consensus fusion calling. The lack of concordance among algorithms, as well as variable accuracy with fusion truth sets (1,17), adds analytical complexity for researchers and clinicians aiming to prioritize research or therapies based on fusion findings. Through annoFuse, we add algorithm flexibility and integration to identify recurrent fusions and/or recurrently-fused genes as novel oncogenic drivers. We expect annoFuse to be broadly applicable to cancer datasets, to help researchers better inform preclinical studies targeting novel, putative oncogenic fusions, and ultimately to aid in the rational design of therapeutic modulators of gene fusions in cancer.

Availability and requirements
Project name: annoFuse: an R Package to annotate and prioritize putative oncogenic RNA fusions

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data are available by download from the Gabriella Miller Kids First Data Resource Center with a data access agreement through the Children's Brain Tumor Tissue Consortium.

Competing interests
The authors declare no competing interests.

Recurrence: filters out non-recurrent fusions in genes not annotated as putative oncogenic.

Funding
Annotation lists are also described.