TY  - JOUR
T1  - Framework for determining accuracy of RNA sequencing data for gene expression profiling of single samples
JF  - bioRxiv
DO  - 10.1101/716829
SP  - 716829
AU  - Holly C. Beale
AU  - Jacquelyn M. Roger
AU  - Matthew A. Cattle
AU  - Liam T. McKay
AU  - Katrina Learned
AU  - A. Geoffrey Lyle
AU  - Ellen T. Kephart
AU  - Rob Currie
AU  - Du Linh Lam
AU  - Lauren Sanders
AU  - Jacob Pfeil
AU  - John Vivian
AU  - Isabel Bjork
AU  - Sofie R. Salama
AU  - David Haussler
AU  - Olena M. Vaske
Y1  - 2019/01/01
UR  - http://biorxiv.org/content/early/2019/07/30/716829.abstract
N2  - Background: The clinical value of identifying aberrant gene expression in tumors is becoming increasingly evident. In order for multi-gene expression analysis to achieve wider adoption and eventually be developed as a Clinical Laboratory Improvement Amendments (CLIA)-approved test, the input sample requirements, sensitivity, specificity and reference ranges must be quantified. Methods: We analyzed paired-end Illumina RNA sequencing (RNA-Seq) data from 1088 tumor samples from 29 projects. We categorized reads based on where and how well they map to the genome, as well as their PCR duplicate status. We subsampled 5 deeply sequenced samples, identified exceptionally highly expressed genes and samples with similar gene expression profiles. Results: We addressed variability in RNA-Seq dataset composition by defining reference ranges for four types of reads found in sequencing data: unmapped (0-13%); mapped duplicate (2-66%); mapped non exonic (0-26%) and mapped, exonic, non-duplicate (MEND, 27-76%). With 20 million MEND reads, we detected over-expressed genes (“up-outlier” genes) with a median sensitivity of 96.1% and specificity of 99.8%; sample similarity had 96.6% sensitivity and 100.0% specificity. Conclusions: This strategy for measuring RNA-Seq data content and identifying thresholds could be applied to a clinical test of a single sample, specifying minimum inputs and defining the sensitivity and specificity. We estimate that a sample sequenced to the depth of 70 million total reads will typically have sufficient data for accurate gene expression analysis.
ER  - 