Review
Whole-genome expression analysis: challenges beyond clustering

https://doi.org/10.1016/S0959-440X(00)00212-8

Abstract

Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena — from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.

Introduction

The enthusiasm about microarray expression data analysis in the bioinformatics community has been remarkable. The peer-reviewed conference proceedings in the field have often provided the initial presentation of new methods, including the early application of clustering [1], linear decomposition [2] and algorithms to discern genetic networks [3•,4•,5•]. (All references to the Pacific Symposium on Biocomputing can be found at http://www.smi.stanford.edu/projects/helix/psb-online/.) The public release of expression data sets [6,7,8] created a de facto set of benchmarks for analysis by the bioinformatics community. There remains a risk, however, that the community has tuned these algorithms to perform well on this small set of training examples and that the algorithms will not perform well on entirely new data sets. Thus, the continued release of data from different groups using different detailed methods, and even measurements from redundant experiments, will be critical [9].

In a typical array experiment, many genes (frequently all known) in an organism are assayed under multiple conditions. The data can be represented as a matrix in which the rows are genes and the columns are conditions. These conditions may be different time points during a biological process, such as the yeast cell cycle [7,8] and Drosophila development [10], or they can be different tissue samples with some common phenotype, such as tissue type or malignancy. Although the amount of data generated in an expression experiment is tremendous, this is not yet a data-rich analytical task by statistical standards. The complexity of genomic systems, with N genes and thus N² potential pairwise interactions (not to mention higher order interactions), is even larger than the expression data sets and thus the ratio of data to unknown variables is still small. The major initial efforts at clustering and linear decomposition (such as principal components analysis) not only assist humans in understanding the data, but also demonstrate that the amount of independent new information may be much smaller than the number of raw data points suggests [2,11]. (Some microarray analysis tools are available at http://classify.stanford.edu/.)
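The ratio of data to unknowns described above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the original review: the function name and the example sizes (6000 genes, 80 conditions) are assumptions chosen only for illustration.

```python
def data_to_unknowns(n_genes: int, n_conditions: int) -> dict:
    """Compare the size of a genes x conditions matrix with the
    N^2 potential pairwise interactions among N genes."""
    measurements = n_genes * n_conditions   # entries in the expression matrix
    pairwise = n_genes ** 2                 # N^2 potential pairwise interactions
    return {
        "measurements": measurements,
        "pairwise_interactions": pairwise,
        "ratio": measurements / pairwise,   # data points per unknown
    }


# Hypothetical sizes, for illustration only.
summary = data_to_unknowns(n_genes=6000, n_conditions=80)
print(summary)
```

With these assumed sizes there are 480,000 measurements against 36,000,000 potential pairwise interactions, so each unknown is covered by far less than one data point, even before higher order interactions are considered.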

Whole-genome expression data affect structural biology by providing valuable functional information about when and where a protein is expressed, when it is degraded and with which other proteins it may interact. Early work has surveyed the ability of expression data to yield clues about common sequential/structural motifs for regulatory elements (as reviewed below). It also addresses issues such as protein localization or the justifiability of predicting function using ‘guilt-by-association’ techniques, whereby similar expression may be a component of the association (as reviewed below). Jansen and Gerstein [12] have analyzed the sequential and structural features of highly expressed genes and found biases (more alanine/glycine, less asparagine, shorter sequences, more TIM barrels) in a group of highly expressed proteins.

Although not the main focus of this review, there has been a welcome emphasis on maximizing the reproducibility and analyzability of microarray experiments [9,13,14,15]. The ‘fold difference’ is widely used as a quantitative measure of differential expression: it is the ratio of the expression in the cells of interest to that in the control cells. Genes expressed at low levels require higher fold differences in order to rise above the noise [16•,17], and duplicate measurements of identical experiments can be very valuable for reducing noise and simplifying subsequent analysis [9]. There are also emerging methods for assigning confidence to differentially expressed genes [18]. The noise in expression data can confound analyses, and rank data are often more robust than absolute measured values because of the variation in methods for subtracting out background noise and quantifying expression levels [19,20]. Methods have emerged for imputing missing data in incomplete data sets (O Troyanskaya et al., unpublished data; see Now in press [78]). Finally, aneuploidy (and therefore the number of copies of a gene in the effective genome) has been shown to affect the expression level of a gene, either confounding the analysis or providing insight into the mechanisms of abnormal biology [21,22].
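To illustrate the fold difference and why ranks can be more robust than absolute values, here is a minimal sketch. The function names and the toy intensities are assumptions for illustration; the constant background subtraction stands in for just one of the many correction schemes in use.

```python
import math


def fold_difference(sample: float, control: float) -> float:
    """Ratio of expression in the cells of interest to the control cells."""
    if control <= 0:
        raise ValueError("control expression must be positive")
    return sample / control


def ranks(values):
    """Rank values from 1 (lowest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


# Hypothetical raw intensities and a constant background of 20 units.
raw = [120.0, 45.0, 300.0, 45.0]
corrected = [value - 20.0 for value in raw]

print(fold_difference(300.0, 75.0), math.log2(fold_difference(300.0, 75.0)))
print(ranks(raw), ranks(corrected))
```

The absolute values, and therefore the fold differences, change with the background correction, whereas the ranks are unchanged by any correction that preserves the ordering of the intensities.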

The remainder of our review is organized around the result of a hierarchical clustering of the literature, in which the word counts are the features of articles used to cluster them, as shown in Fig. 1.
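As a rough illustration of clustering articles by word counts, the sketch below builds bag-of-words vectors and merges them by average linkage. The review does not state which distance or linkage was used, so the cosine distance, the average linkage and the four toy "articles" are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
import math


def word_counts(text):
    """Word-count feature vector for one article."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def average_linkage(docs):
    """Agglomerate documents; return the merges as pairs of index tuples."""
    vectors = [word_counts(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        def dist(pair):
            # Mean distance over all cross-cluster document pairs.
            a, b = pair
            total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
            return total / (len(a) * len(b))

        a, b = min(combinations(clusters, 2), key=dist)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Hypothetical article texts, for illustration only.
toy_articles = [
    "clustering of expression data with hierarchical clustering",
    "self-organizing maps for clustering expression data",
    "reverse engineering of genetic networks",
    "bayesian models of genetic networks",
]
for left, right in average_linkage(toy_articles):
    print(left, "+", right)
```

In this toy case the two network-related texts merge first and the two clustering-related texts merge second, which is the kind of grouping that can then be used to organize sections.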

Section snippets

A breadth of applications in biology and medicine

The number and diversity of microarray expression data measurements in the literature are impressive, and reports now appear in speciality journals in both biology and medicine. Initial data sets are often reported as genome-scale ‘reviews’ of a specific process, with subsequent analysis focusing on particular biological questions. Many reports, however, compare only a single pair of conditions and these are more difficult to evaluate because not all the differences between the two conditions

Clustering

The early uses of hierarchical clustering and SOMs on expression data provided a focal point for the introduction of alternative clustering methods. As with BLAST, clustering has become a basic tool for biologists in the field of expression analysis. Although there is a mature statistical literature about clustering, microarray data have sparked the development of multiple new methods. The initial excitement generated by the papers using hierarchical clustering [1,35] and SOMs (which arrange

Moving beyond clustering

After clustering is applied to an expression data set, we can examine those genes that cluster together and assign a function or value to the cluster. This approach may discover new associations, but in general rediscovers known associations and typically does not take full advantage of knowledge about known transcription factors, regulatory elements, sequence or structure information, or assigned gene functions. For example, there is interest in using information from genes with a common

New directions: the reconstruction of genetic networks

A reductionist approach to studying model systems and isolating individual components is clearly the pillar upon which most biological knowledge rests. However, the understanding of interacting systems, for which approximations about isolation and crosstalk (normally made to simplify the systems) can no longer be made, constitutes a major challenge. Initial efforts in the representation and ‘reverse engineering’ of cellular networks containing genes, their regulators and their downstream

Conclusions

For some time, there were more review articles about the promises and problems with whole-genome expression analysis than there were primary research reports using the methods or introducing new analytic techniques. This imbalance is now being corrected and the community is thinking seriously about ways in which whole-genome expression data can be integrated with other biological knowledge to maximize its impact. The next few years may show major progress in our ability to understand the ways

Update

Clustering methods are now more routinely being evaluated with respect to criteria such as robustness, computational cost, clarity of cluster definitions and reproducibility. A useful report by Yeung et al. [74•] introduces a leave-one-out type approach for testing cluster methods by evaluating their ability to predict the gene associations in a ‘left out’ data set. Herrero et al. [75] report and evaluate a self-organizing tree algorithm (SOTA) that shares features with SOMs, but imposes a

Acknowledgements

We would like to thank Josh Stuart, Olga Troyanskaya, Patrick Sutphin, Teri Klein and David Botstein for useful conversations. This work is supported by the Burroughs Wellcome Fund and grants from NIH GM-61374, LM-06422, GM-07365 and NSF DBI-9600637.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

References (78)

  • Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to...
  • Koza JR, Mydlower JD, Lanza G, Yu J, Keanne MA: Reverse engineering of metabolic pathways from observed data using...
  • van Someren EP et al.: Linear modeling of genetic networks from experimental data. ISMB (2000)
  • Hartemink AJ, Gifford DK, Jaakkola TS, Young RA: Using graphical models and genomic expression data to statistically...
  • DeRisi JL et al.: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science (1997)
  • Spellman PT et al.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell (1998)
  • Butte A, Ye J, Niederfellner G, Rett K, Häring H, White M, Kohane I: Determining significant fold differences in gene...
  • White KP et al.: Microarray analysis of Drosophila development during metamorphosis. Science (1999)
  • Holter NS et al.: Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA (2000)
  • Jansen R et al.: Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res (2000)
  • Kane MD et al.: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res (2000)
  • Talaat AM et al.: Genome-directed primers for selective labeling of bacterial transcripts for DNA microarray analysis. Nat Biotechnol (2000)
  • Sengupta R, Tompa M: Quality control in manufacturing oligo arrays: a combinatorial design approach. Pac Symp Biocomput...
  • Tsien CL, Libermann TA, Gu X, Kohane IS: On the reporting of fold differences. Pac Symp Biocomput 2001:496-507. This...
  • Claverie JM: Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet (1999)
  • Manduchi E et al.: Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics (2000)
  • Park P, Pagano M, Bonetti M: A nonparametric scoring algorithm for identifying informative genes from microarray data....
  • Raychaudhuri S et al.: Pattern recognition of genomic features with microarrays: site typing of Mycobacterium tuberculosis strains. ISMB (2000)
  • Klus G, Song A, Schick A, Wahde M, Szallasi Z: Mutual information analysis as a tool to assess the role of aneuploidy...
  • Hughes TR et al.: Widespread aneuploidy revealed by DNA microarray expression profiling. Nat Genet (2000)
  • Golub TR et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
  • Alizadeh A et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
  • Bittner M et al.: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature (2000)
  • Ross DT et al.: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet (2000)
  • Ben-Dor A et al.: Clustering gene expression patterns. J Comput Biol (1999)
  • Scherf U et al.: A gene expression database for the molecular pharmacology of cancer. Nat Genet (2000)
  • Roberts CJ et al.: Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science (2000)
  • Iyer VR et al.: The transcriptional program in the response of human fibroblasts to serum. Science (1999)
  • Coller HA et al.: Expression analysis with oligonucleotide microarrays reveals that MYC regulates genes involved in growth, cell cycle, signaling, and adhesion. Proc Natl Acad Sci USA (2000)