Abstract
In a recent publication, Li et al. introduced LeafCutter, a new method for detecting and quantifying differential RNA splicing from RNA-Seq data. In that work, Li et al. first compared LeafCutter to existing methods, then used it for a study of splicing variations and sQTL analysis across a large set of GTEx samples. While the study was elaborate and comprehensive, we want to highlight several issues with the comparative analysis performed by Li et al. We argue these issues created an inaccurate and misleading representation of other tools, namely MAJIQ and rMATS. More broadly, we believe the points we raise regarding the comparative analysis by Li et al. are representative of general issues we all, as authors, editors, and reviewers, face and must address in the current era of fast-paced genomics and computational research.
The issues we identified are all concentrated in Fig 3 of Li et al. [1]. We stress these do not invalidate the comprehensive GTEx analysis performed by the authors, for which they should be congratulated. These issues relate only to the comparison to other software and can be summarized by the following points:
Usage of outdated software
Li et al. compared LeafCutter to three different algorithms for differential splicing analysis from RNA-Seq data: MAJIQ[2], rMATS[3] and cufflinks2[4]. Importantly, the versions of the software used were not stated. We found the versions used (MAJIQ 0.9.2, released Jan 2016, and rMATS 3.2.5) were outdated, a consequence of when the authors set up the comparative analysis for their paper submission on March 31st, 2017 (Li et al. personal communication, see supplementary for more details). Consequently, the major software re-implementations included in rMATS 4.0 (released May 1st 2017) and MAJIQ 1.0 (released May 10th 2017) were not reflected in the Li et al. BioRxiv manuscript posted Sep 7th 2017, or in the paper published on Dec 11th 2017. The usage of the outdated software packages led the authors to conclude that “the identification of differential splicing across groups in large studies [is] impractically slow with rMATS or MAJIQ”. To support this claim, the authors plotted time and memory usage, where both rMATS and MAJIQ were found to consume over 10GB of memory and take over 50 hours (Fig. 3b, Supplementary Fig. 3b in Li et al.). However, in our testing, the May 2017 versions required only a few hundred MB of memory per node, similar to LeafCutter, and while MAJIQ 1.0 was still significantly slower, rMATS was as fast as LeafCutter (see Figure 1a).
Figure 1: (a) Running time for each algorithm when comparing groups of different sizes. (b) The Intra-to-Inter Ratio (IIR) when using 3–7 GTEx samples per tissue group (skeletal muscle). The IIR, serving as a proxy for false discovery, represents the ratio between the number of differential events reported when comparing biological replicates of the same tissue (putative false positives), and the number of events reported when comparing similarly sized groups from different conditions (here skeletal muscle and cerebellum; see main text and supplementary for details). (c) The original ROC plots from Li et al. for evaluating each method’s accuracy, with the correct execution of MAJIQ superimposed on them (blue line). The blue line was derived using scripts supplied by Li et al. for their data generation. (d) Evaluation using “realistic” synthetic datasets: each synthetic sample is created to match a real sample in terms of gene expression and a lower bound on transcriptome complexity. This simulation does include de-novo events (not captured by rMATS) and intron retention events (not modeled by LeafCutter). All datasets involve 3 biological replicates per group. Each method was evaluated using its own definition of alternative splicing events, so events are not directly comparable between methods. Positive events were defined as those with |E[∆Ψ]| ≥ 20%, and negative events were defined as those with a small difference between the groups of |E[∆Ψ]| ≤ 5%. (e) Reproducibility ratio (RR) plots for differentially spliced events between cerebellum and heart GTEx samples (n = 5 per group, as in Li et al.). The end of each line marks the point in the graph matching the number of events reported as significantly changing (RR(NA), see main text and supplementary). Events detected are not directly comparable, as each algorithm uses a different definition of splicing events. (f) Evaluation of accuracy using RT-PCR experiments from [2]. Both algorithms were used to quantify Ψ using RNA-seq from [6], and RT-PCR on RNA from matching liver tissue was used for validation.
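To make the IIR metric in panel (b) concrete, below is a minimal sketch of how such a ratio can be computed; the function and event lists are illustrative placeholders, since each tool defines and reports its own events (see supplementary for the exact pipeline used here).

```python
def intra_to_inter_ratio(intra_events, inter_events):
    """IIR: (# events called significant when comparing replicate groups of the
    SAME tissue, i.e. putative false positives) divided by (# events called
    significant when comparing groups from DIFFERENT tissues)."""
    if not inter_events:
        raise ValueError("no inter-tissue events reported; IIR is undefined")
    return len(set(intra_events)) / len(set(inter_events))

# Toy usage: 12 muscle-vs-muscle events against 1200 muscle-vs-cerebellum events
# gives IIR = 0.01 (lower is better).
print(intra_to_inter_ratio(list(range(12)), list(range(1200))))
```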
Reproducibility and correct execution
In order to assess LeafCutter’s accuracy, Li et al. performed two main types of tests. In the first, represented by Fig 3b in the published paper, Li et al. assessed the distribution of p-values reported by the software. For this, the authors compared two groups of RNA-Seq samples which were both comprised of equal proportions of samples from two conditions (tissues). As both groups included an equal mix of the same conditions, the software was expected to produce p-values that follow the null. This figure creates the false impression, also stated in the main text, that MAJIQ’s output is not well calibrated. In fact, the authors erroneously chose to use a different output type, the posterior probability for a predefined magnitude of inclusion change C (denoted P(∆Ψ > C)), as a proxy for p-values. This usage is wrong, as p-values are derived from a null model, which is not used by MAJIQ. Instead, as evident from the shape of the graph, the posterior probability produced by MAJIQ can be thought of as a soft proxy for a step function. Such an ideal step function would assign probability 1 to all events with a true ∆Ψ > C and probability 0 to the rest. Furthermore, when we perform such a test to assess putative false positives using GTEx samples and the setup discussed below, we find MAJIQ protected against false positives significantly better than LeafCutter (Figure 1b).
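To illustrate why a posterior probability of the form P(∆Ψ > C) behaves like a soft step function rather than a p-value, consider the following toy sketch. It is not MAJIQ’s actual model; it only shows, under simple assumed posteriors, how the reported quantity concentrates near 0 or 1 rather than following a uniform null distribution.

```python
import numpy as np

def posterior_prob_change(delta_psi_samples, C=0.2):
    """Posterior probability that |delta-PSI| exceeds C, estimated from
    posterior samples (illustrative only, not MAJIQ's inference)."""
    return float(np.mean(np.abs(np.asarray(delta_psi_samples)) > C))

rng = np.random.default_rng(0)
no_change   = rng.normal(0.00, 0.02, 5000)   # event with essentially no change
real_change = rng.normal(0.35, 0.05, 5000)   # event with a true ~35% shift

print(posterior_prob_change(no_change))      # close to 0
print(posterior_prob_change(real_change))    # close to 1
```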
The second measure of performance employed by Li et al. is a receiver operating characteristic (ROC) curve, assessing the accuracy of calling differential splicing from synthetically generated RNA-Seq data. The authors use this analysis to conclude that MAJIQ severely under-performs in all but the most extreme cases. This analysis seemed incorrect to us, since the graphs show MAJIQ is not able to retrieve many events (MAJIQ’s purple line saturates quickly; see Figure 1c). As the data and scripts were not available at publication, we contacted the authors for this information and were able to repeat the analyses. We found the authors did not use MAJIQ as intended based on the user’s manual and the analyses in [2, 5]: in order to plot such ROC curves, one needs to rank events by the absolute expected inclusion change (|E[∆Ψ]|) and have all events reported. Reporting all events is done using the --show-all flag. Running the pipeline supplied to us by Li et al. and superimposing the results on the original graphs, we find there is no significant difference in this test between LeafCutter and MAJIQ (Figure 1c, blue line). However, as we detail below, on more realistic synthetic data as well as real data, we find MAJIQ outperforms both LeafCutter and rMATS by a variety of metrics.
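For completeness, a minimal sketch of the ranking we describe above: with all events reported, each event is ordered by its absolute expected inclusion change |E[∆Ψ]| and the ROC is traced along that ranking. This is a generic illustration, not the authors’ pipeline or MAJIQ’s internal code.

```python
import numpy as np

def roc_points(abs_expected_dpsi, is_positive):
    """Rank events by |E[delta-PSI]| (largest first) and return (FPR, TPR)
    points along the ranking."""
    order  = np.argsort(-np.asarray(abs_expected_dpsi, dtype=float))
    labels = np.asarray(is_positive, dtype=bool)[order]
    tpr = np.cumsum(labels)  / max(labels.sum(), 1)
    fpr = np.cumsum(~labels) / max((~labels).sum(), 1)
    return fpr, tpr
```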
Realistic evaluation metrics
We found the synthetic data used to evaluate the software to be unrealistic. Among the issues we noted are the use of uniform expression of isoforms; spiking a single isoform’s expression level, which does not translate to any specific splicing change (∆Ψ); using a fixed number (5) of isoforms per gene in the scripts we were supplied with; using only a small, non-random set of 200 genes; and avoiding fluctuations between individuals. The last point is especially relevant given the heterogeneous nature of the GTEx dataset. The Li et al. analysis also avoided any PSI- or isoform-specific measures of accuracy and instead assessed ROC at the gene level (spiked yes/no). In what we consider to be more realistic synthetic data mimicking biological replicates, produced using the procedure described in [5], we find MAJIQ outperforms LeafCutter (see Figure 1d). A more complete description of the issues we found in the synthetic data can be found in the supplementary material.
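The spiking issue can be seen with a simple worked example: the inclusion change (∆Ψ) produced by spiking one isoform’s expression depends entirely on the expression of the remaining isoforms, so a fixed spike does not correspond to any fixed splicing change. The numbers below are arbitrary toy values, not values from the Li et al. simulation.

```python
import numpy as np

def psi(expr):
    """Relative isoform abundance (PSI-like ratio) from expression values."""
    expr = np.asarray(expr, dtype=float)
    return expr / expr.sum()

for baseline in ([10, 10, 10, 10, 10],    # uniform expression over 5 isoforms
                 [10, 90, 30, 5, 5]):     # non-uniform expression, same gene
    spiked = list(baseline)
    spiked[0] *= 3                        # identical 3x spike of isoform 1
    dpsi = psi(spiked)[0] - psi(baseline)[0]
    print(f"baseline {baseline}: delta-PSI of spiked isoform = {dpsi:.2f}")
# The same 3x spike yields delta-PSI ~0.23 in the first case but only ~0.12 in
# the second, i.e. the spike itself does not define a specific splicing change.
```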
A metric we found missing was the reproducibility of the identified significantly changing events when biological replicates are used to repeat the analysis. Here, too, we found MAJIQ’s results to be significantly more reproducible than LeafCutter’s: when using two groups of GTEx cerebellum and skeletal muscle samples, MAJIQ achieved consistently higher reproducibility irrespective of the number of events reported (Figure 1e). MAJIQ’s improved reproducibility was maintained when using biological replicates (data not shown) and when restricting LeafCutter to use a more conservative additional filter of ∆Ψ > 20% (compare light and dark orange lines in Figure 1b,d). This additional filter is similar to MAJIQ’s settings and is commonly used in the RNA biology field for defining significant splicing changes.
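For reference, a simplified sketch of the reproducibility ratio (RR) used in Figure 1e, following the definition in [2, 5]: events from two independent comparisons of the same tissue pair are ranked by each method’s own score, and RR(n) is the fraction of the top-n events from the first comparison that are reproduced among the top-n of the second. The end of each line in the figure corresponds to RR evaluated at the number of events the method itself reports as significantly changing.

```python
def reproducibility_ratio(ranked_events_a, ranked_events_b, n):
    """Fraction of the top-n events from comparison A that also appear in the
    top-n of comparison B (simplified RR; event IDs are method-specific)."""
    top_a = set(ranked_events_a[:n])
    top_b = set(ranked_events_b[:n])
    return len(top_a & top_b) / float(n)
```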
The evaluations in the Li et al. analysis also did not include any experimental validation or accuracy measure by RT-PCR. While RT-PCR can suffer from biases as well, careful execution in triplicates is considered the gold standard in the RNA field. To make such experiments accessible, we and others have made datasets of this kind readily available online [2, 5]. Using those datasets, we found LeafCutter to be significantly less accurate when compared to MAJIQ (R² of 0.821 vs 0.936 in Figure 1f) and when compared to rMATS. Moreover, we believe these results highlight an inherent issue in LeafCutter’s output: while useful for sQTL detection (the use case for which LeafCutter was originally designed), LeafCutter’s intron clusters are a mathematical construction. The clusters do not correspond directly to a biological entity or to ratios of isoforms which necessarily add up to one. As such, it is not clear how LeafCutter’s output should be translated to actual inclusion values, or be used for primer design.
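The accuracy in panel (f) is summarized by a squared correlation between the Ψ estimated from RNA-seq and the Ψ measured by RT-PCR for the same events; a minimal sketch of that computation (assuming R² here denotes the squared Pearson correlation) is:

```python
import numpy as np

def r_squared(psi_rnaseq, psi_rtpcr):
    """Squared Pearson correlation between RNA-seq and RT-PCR PSI estimates."""
    r = np.corrcoef(psi_rnaseq, psi_rtpcr)[0, 1]
    return r ** 2
```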
Finally, there are several important elements of LeafCutter which were not discussed in the Li et al. comparative analysis. These include the limited granularity of intron clusters, which can cover large portions of genes (see the number of events reported in Figure 1d as a crude indication of this issue), and the fact that intron retention (IR) is not modeled by LeafCutter. The latter has been shown to be of particular importance in the brain, a focus of Li et al.’s analysis. For example, in a preliminary analysis of brain GTEx samples we found almost 10% of the differentially spliced events to involve differential IR (data not shown).
Conclusions
In conclusion, our evaluation supports the Li et al. assertion that LeafCutter is an efficient method for differential splicing and particularly sQTL analysis, for which it was originally constructed. We note that our analysis does not imply purposeful misrepresentation of competing software (see time-line in supplementary), nor does it invalidate the comprehensive analysis of GTEx samples performed by the authors. Nonetheless, we conclude that Li et al. misrepresented other software and their relative performance, namely rMATS and MAJIQ. This misrepresentation was the result of using outdated software, a lack of proper documentation (software versions, scripts), incorrect usage, and what we view as inadequate evaluation criteria.
Nature Research journals have been strong advocates for reproducible science (cf. [7, 8]). We believe that adherence to relevant reproducibility guidelines could have helped prevent many of the issues we identified. But beyond the need to follow reproducibility guidelines, other important questions arise. For example, what evaluation criteria should we use? Which software version should we run? What happens if a competing software package is updated during review? Are we obliged to change the manuscript then? Are we also obliged to address pre-prints? Should reviewers be allowed to use pre-prints to scrutinize a manuscript or a grant application? Can authors use pre-prints as the basis for scientific claims in other papers? And what is the role of the editor in such cases? We have struggled with these questions ourselves. For example, we faced reviewers of grants and papers who scrutinized our work and that of collaborators based on this inaccurate representation of MAJIQ, even when it was only a pre-print. We published improvements to the MAJIQ algorithm, released as a pre-print back in January 2017, but we chose not to discuss those algorithmic enhancements here, as they were only formally published in December 2017. As another example, during the preparation of our Norton et al. manuscript [5], competing methods released updates. We decided to delay submission, rerun all comparative analyses, and consequently removed specific claims. How should such software changes be handled by reviewers and editors? As it stands, we are painfully aware that MAJIQ itself represents a moving target, as it is actively being developed. For example, the current version (1.1, released March 1st 2018) is much faster than the previous 1.0 release, and the new version (MAJIQ 2.0, manuscript in preparation) compares well in running time to LeafCutter while still retaining MAJIQ’s added features (see Figure 1a). Indeed, we found that even more recently published splicing analysis methods such as SUPPA2[9] and Whippet[10] also used outdated MAJIQ software (ver 0.9, released Feb 2016) or did not report versions. Such situations of constantly evolving software are not uncommon in genomics, with methods such as rMATS[11, 3], Salmon[12], EdgeR[13], and DESeq[14, 15] serving as good examples. In such settings, the decision of which version to use (and documenting it) can critically affect results.
Finally, we want to point out an important take-home message for our community. In this fast-moving field of genomics, with evolving standards and procedures, mistakes are bound to happen even as we strive to minimize them. The question, then, is how do we handle mistakes? In preparing this response, we found the online discussion around the MAJIQ vs LeafCutter analysis, including personal communications with the authors, to be helpful and constructive. The authors responded to our queries and supplied their scripts; other researchers even suggested using this case to write a guide for software comparisons. We believe this sort of constructive response sets an excellent example, and we hope we as a community will keep this positive approach in the future, for the benefit of all.
Addendum
We would like to thank Li et al. for their email correspondence, for supplying their original scripts, for providing feedback on this manuscript, and for sending us the time-line included in the supplementary information.
All scripts and data used to generate the results presented here are available at: https://bitbucket.org/biociphers/vaquero_norton_2018/
Competing Interests: The authors declare that they have no competing financial interests.