Abstract
Motivation Computational tools used for genomic analyses are becoming increasingly sophisticated and complex. While these applications often provide more accurate results than their predecessors, they introduce a new problem: a large number of tunable parameters. Choosing the wrong parameter values for an application may lead to significant results being overlooked or false results being reported.
Results We take some first steps towards generating a truly automated genomic analysis pipeline by developing a method for automatically choosing input-specific parameter values for reference-based transcript assembly. We apply the parameter advising framework, first developed for multiple sequence alignment, to optimize parameter choices for the Scallop transcript assembler. In doing so, we provide the first method for finding advisor sets for applications with large numbers of tunable parameters. By choosing parameter values for each input, the area under the curve (AUC) when comparing assembled transcripts to a reference transcriptome is increased by 28.9% over using only the default parameter choices on 1595 RNA-Seq samples in the Sequence Read Archive. This approach is general, and when applied to StringTie it increases AUC by 13.1% on a set of 65 RNA-Seq experiments from ENCODE.
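As a rough illustration of the per-input selection described above, the following is a minimal sketch of a parameter advisor, not the implementation in the repository. It assumes the advisor set is a short list of candidate parameter settings, that the assembler is run once per candidate, and that the candidate with the highest accuracy score is kept. The names `advise`, `run_assembler`, `estimate_accuracy`, and the parameter `min_coverage` are hypothetical placeholders; `estimate_accuracy` stands in for whatever accuracy measure is used, such as the AUC against a reference transcriptome.

```python
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, float]


def advise(
    sample: Any,
    advisor_set: List[Params],
    run_assembler: Callable[[Any, Params], Any],
    estimate_accuracy: Callable[[Any], float],
) -> Tuple[Params, float]:
    """Run the assembler once per candidate setting and keep the best-scoring one."""
    if not advisor_set:
        raise ValueError("advisor set must contain at least one parameter setting")
    best_params, best_score = None, float("-inf")
    for params in advisor_set:
        assembly = run_assembler(sample, params)  # one assembler run per candidate
        score = estimate_accuracy(assembly)       # e.g. AUC against a reference
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-ins (assumptions for illustration only): each "sample" has a
# preferred value of a hypothetical parameter, and accuracy falls off with
# distance from that value.
def toy_assembler(sample: Dict[str, float], params: Params) -> float:
    return abs(params["min_coverage"] - sample["preferred_coverage"])


def toy_accuracy(assembly: float) -> float:
    return -assembly


ADVISOR_SET: List[Params] = [
    {"min_coverage": 1.0},  # stands in for a default setting
    {"min_coverage": 2.5},
    {"min_coverage": 5.0},
]

if __name__ == "__main__":
    for sample in ({"preferred_coverage": 3.0}, {"preferred_coverage": 4.5}):
        print(sample, "->", advise(sample, ADVISOR_SET, toy_assembler, toy_accuracy))
```

In this sketch different inputs select different members of the same advisor set, which is the behavior the advising framework relies on; a real advisor would replace the toy functions with actual assembler runs and an accuracy measure.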
Availability Parameter advisors for both Scallop and StringTie are available on GitHub (https://github.com/Kingsford-Group/scallopadvising). Tools to perform the coordinate ascent procedure are also available in the repository, though this step is not necessary to apply advising to a new dataset.
Footnotes
deblasio{at}cs.cmu.edu,carlk{at}cs.cmu.edu