RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays

John C. Marioni (1,6), Christopher E. Mason (2,3,6), Shrikant M. Mane (4), Matthew Stephens (1,5,7), and Yoav Gilad (1,7)

  1. Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA;
  2. Program on Neurogenetics, Yale University School of Medicine, New Haven, Connecticut 06520, USA;
  3. Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520, USA;
  4. Keck Biotechnology Laboratory, New Haven, Connecticut 06511, USA;
  5. Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA
  6. These authors contributed equally to this work.

Abstract

Ultra-high-throughput sequencing is emerging as an attractive alternative to microarrays for genotyping, analysis of methylation patterns, and identification of transcription factor binding sites. Here, we describe an application of the Illumina sequencing (formerly Solexa sequencing) platform to study mRNA expression levels. Our goals were to estimate technical variance associated with Illumina sequencing in this context and to compare its ability to identify differentially expressed genes with existing array technologies. To do so, we estimated gene expression differences between liver and kidney RNA samples using multiple sequencing replicates, and compared the sequencing data to results obtained from Affymetrix arrays using the same RNA samples. We find that the Illumina sequencing data are highly replicable, with relatively little technical variation, and thus, for many purposes, it may suffice to sequence each mRNA sample only once (i.e., using one lane). The information in a single lane of Illumina sequencing data appears comparable to that in a single array in enabling identification of differentially expressed genes, while allowing for additional analyses such as detection of low-expressed genes, alternative splice variants, and novel transcripts. Based on our observations, we propose an empirical protocol and a statistical framework for the analysis of gene expression using ultra-high-throughput sequencing technology.

Footnotes

  • 7 Corresponding authors.

    7 E-mail gilad{at}uchicago.edu; fax (773) 834-8470.

    7 E-mail mstephens{at}uchicago.edu; fax (773) 834-8470.

  • [Supplemental material is available online at www.genome.org. Raw microarray CEL files have been deposited in the GEO database with accession number GSE11045. The raw Illumina sequencing data are available in the NCBI short read archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) with accession number SRA000299. A summary of the mapped reads and of the processed microarray data is available at http://giladlab.uchicago.edu/data.html.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.079558.108.

    • Received April 8, 2008.
    • Accepted June 6, 2008.
  • Freely available online through the Genome Research Open Access option.
