Whole-genome expression analysis: challenges beyond clustering

doi:10.1016/S0959-440X(00)00212-8

Current Opinion in Structural Biology

Volume 11, Issue 3, 1 June 2001, Pages 340-347

https://doi.org/10.1016/S0959-440X(00)00212-8

Abstract

Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena — from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.

Introduction

The enthusiasm about microarray expression data analysis in the bioinformatics community has been remarkable. The peer-reviewed conference proceedings in the field have often provided the initial presentation of new methods, including the early application of clustering [1], linear decomposition [2] and algorithms to discern genetic networks [3•,4•,5•]. (All references to the Pacific Symposium on Biocomputing can be found at http://www.smi.stanford.edu/projects/helix/psb-online/) The public release of expression data sets [6,7,8] created a de facto set of benchmarks for analysis by the bioinformatics community. There remains a risk, however, that the community has tuned these algorithms to perform well on this small set of training examples and that the algorithms will not perform well on entirely new data sets. Thus, the continued release of data from different groups using different detailed methods, and even measurements from redundant experiments, will be critical [9].

In a typical array experiment, many genes (frequently all known) in an organism are assayed under multiple conditions. The data can be represented as a matrix in which the rows are genes and the columns are conditions. These conditions may be different time points during a biological process, such as the yeast cell cycle [7,8] and Drosophila development [10], or they can be different tissue samples with some common phenotype, such as tissue type or malignancy. Although the amount of data generated in an expression experiment is tremendous, this is not yet a data-rich analytical task by statistical standards. The complexity of genomic systems, with N genes and thus N² potential pairwise interactions (not to mention higher order interactions), is even larger than the expression data sets and thus the ratio of data to unknown variables is still small. The major initial efforts at clustering and linear decomposition (such as principal components analysis) not only assist humans in understanding the data, but also demonstrate that the amount of independent new information may be much smaller than the number of raw data points suggests [2,11]. (Some microarray analysis tools are available at http://classify.stanford.edu/)
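The matrix representation and the data-to-unknowns argument can be made concrete with a small sketch. The gene names, expression values and the genome-scale figures (6000 genes, 80 conditions) below are illustrative assumptions, not numbers taken from the review.

```python
# Illustrative gene-by-condition matrix: rows are genes, columns are conditions.
# Gene names and values are made up for this sketch.
expression = {
    "gene_a": [1.0, 2.1, 3.9, 2.0],
    "gene_b": [0.9, 2.0, 4.1, 1.8],
    "gene_c": [3.0, 1.0, 0.5, 2.9],
}


def data_to_unknowns(n_genes, n_conditions):
    """Return (measurements, pairwise interactions, ratio) for an N x M matrix."""
    measurements = n_genes * n_conditions
    pairwise = n_genes ** 2  # N^2 potential pairwise interactions
    return measurements, pairwise, measurements / pairwise


n_genes = len(expression)
n_conditions = len(next(iter(expression.values())))

print(data_to_unknowns(n_genes, n_conditions))  # toy matrix above
print(data_to_unknowns(6000, 80))               # hypothetical genome-scale experiment
```

Because the ratio reduces to M/N (conditions over genes), it stays small whenever the number of genes greatly exceeds the number of conditions, which is the situation the paragraph describes.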

Whole-genome expression data affect structural biology by providing valuable functional information about when and where a protein is expressed, when it is degraded and with which other proteins it may interact. Early work has surveyed the ability of expression data to yield clues about common sequential/structural motifs for regulatory elements (as reviewed below). It also addresses issues such as protein localization or the justifiability of predicting function using ‘guilt-by-association’ techniques, whereby similar expression may be a component of the association (as reviewed below). Jansen and Gerstein [12] have analyzed the sequential and structural features of highly expressed genes and found biases (more alanine/glycine, less asparagine, shorter sequences, more TIM barrels) in a group of highly expressed proteins.

Although not the main focus of this review, there has been a welcome emphasis on maximizing the reproducibility and analyzability of microarray experiments [9,13,14,15]. The ‘fold difference’ is widely used as a quantitative measure of differential expression. The fold difference is the ratio of the expression in the cells of interest to that in the control cells. Genes expressed at low levels require higher fold differences in order to rise above the noise [16•,17] and duplicate measurements of identical experiments can be very valuable for reducing noise and simplifying subsequent analysis [9]. There are also emerging methods for assigning confidence to differentially expressed genes [18]. The noise in expression data can confound analyses, and rank data are often more robust than absolute measured values because of the variation in methods for subtracting out background noise and quantifying expression levels [19,20]. Methods have emerged for imputing missing data in incomplete data sets (O Troyanskaya et al., unpublished data; see Now in press [78]). Finally, aneuploidy (and therefore the number of copies of a gene in the effective genome) has been shown to affect the expression level of a gene, either confounding the analysis or providing insight into the mechanisms of abnormal biology [21,22].
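A minimal sketch may help show why ranks can be more robust than absolute values. It assumes a constant background offset of 50 units and made-up intensities; neither comes from the cited work, and ties are not given averaged ranks here.

```python
def fold_difference(sample, control):
    """Ratio of expression in the cells of interest to expression in control cells."""
    return sample / control


def ranks(values):
    """Rank values from 1 (lowest) to n (highest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Hypothetical raw intensities for four genes in one sample.
raw = [120.0, 450.0, 80.0, 900.0]
# The same intensities after subtracting an assumed constant background of 50.
background_subtracted = [v - 50.0 for v in raw]

# Ratios between two measurements change with the background correction...
print(fold_difference(raw[1], raw[0]))
print(fold_difference(background_subtracted[1], background_subtracted[0]))

# ...whereas the rank order of the genes is unchanged.
print(ranks(raw))
print(ranks(background_subtracted))
```

Any monotonic correction leaves the ranks intact, while ratios depend on exactly how the background was estimated, so low-intensity genes are affected most.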

The remainder of our review is organized around the result of a hierarchical clustering of the literature, in which the word counts are the features of articles used to cluster them, as shown in Fig. 1.
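The idea of clustering articles by their word counts can be sketched as follows. The review does not state which distance measure or linkage was used, so cosine distance and average linkage are assumptions here, and the four short texts are invented stand-ins for article abstracts.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words feature vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def hierarchical_clustering(texts):
    """Agglomerative clustering with average linkage; returns the merge history."""
    vectors = [word_counts(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Average distance over all cross-cluster pairs of articles.
                total = sum(
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Invented stand-ins for article abstracts.
texts = [
    "clustering expression data clustering",
    "hierarchical clustering expression data",
    "genetic network reconstruction",
    "network models genetic interactions",
]
for left, right, distance in hierarchical_clustering(texts):
    print(left, right, round(distance, 3))
```

The merge history is the information a dendrogram would display: articles sharing vocabulary join early, and unrelated groups join last.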

Section snippets

A breadth of applications in biology and medicine

The number and diversity of microarray expression data measurements in the literature are impressive, and reports now appear in speciality journals in both biology and medicine. Initial data sets are often reported as genome-scale ‘reviews’ of a specific process, with subsequent analysis focusing on particular biological questions. Many reports, however, compare only a single pair of conditions and these are more difficult to evaluate because not all the differences between the two conditions

Clustering

The early uses of hierarchical clustering and SOMs on expression data provided a focal point for the introduction of alternative clustering methods. As with BLAST, clustering has become a basic tool for biologists in the field of expression analysis. Although there is a mature statistical literature about clustering, microarray data has sparked the development of multiple new methods. The initial excitement generated by the papers using hierarchical clustering [1,35] and SOMs (which arrange

Moving beyond clustering

After clustering is applied to an expression data set, we can examine those genes that cluster together and assign a function or value to the cluster. This approach may discover new associations, but in general rediscovers known associations and typically does not take full advantage of knowledge about known transcription factors, regulatory elements, sequence or structure information, or assigned gene functions. For example, there is interest in using information from genes with a common

New directions: the reconstruction of genetic networks

A reductionist approach to studying model systems and isolating individual components is clearly the pillar upon which most biological knowledge rests. However, the understanding of interacting systems, for which approximations about isolation and crosstalk (normally made to simplify the systems) can no longer be made, constitutes a major challenge. Initial efforts in the representation and ‘reverse engineering’ of cellular networks containing genes, their regulators and their downstream

Conclusions

For some time, there were more review articles about the promises and problems with whole-genome expression analysis than there were primary research reports using the methods or introducing new analytic techniques. This imbalance is now being corrected and the community is thinking seriously about ways in which whole-genome expression data can be integrated with other biological knowledge to maximize its impact. The next few years may show major progress in our ability to understand the ways

Update

Clustering methods are now more routinely being evaluated with respect to criteria such as robustness, computational cost, clarity of cluster definitions and reproducibility. A useful report by Yeung et al. [74•] introduces a leave-one-out type approach for testing cluster methods by evaluating their ability to predict the gene associations in a ‘left out’ data set. Herrero et al. [75] report and evaluate a self-organizing tree algorithm (SOTA) that shares features with SOMs, but imposes a

Acknowledgements

We would like to thank Josh Stuart, Olga Troyanskaya, Patrick Sutphin, Teri Klein and David Botstein for useful conversations. This work is supported by the Burroughs Wellcome Fund and grants from NIH GM-61374, LM-06422, GM-07365 and NSF DBI-9600637.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest
•• of outstanding interest

References (78)

R.J Cho et al.
A genome-wide transcriptional analysis of the mitotic cell cycle
Mol Cell
(1998)
F.C Holstege et al.
Dissecting the regulatory circuitry of a eukaryotic genome
Cell
(1998)
T.R Hughes et al.
Functional discovery via a compendium of expression profiles
Cell
(2000)
P Toronen et al.
Analysis of gene expression data using self-organizing maps
FEBS Lett
(1999)
J van Helden et al.
Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies
J Mol Biol
(1998)
T.D Schneider et al.
Information content of binding sites on nucleotide sequences
J Mol Biol
(1986)
J.D Hughes et al.
Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae
J Mol Biol
(2000)
C.A Wilson et al.
Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores
J Mol Biol
(2000)
A Drawid et al.
A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome
J Mol Biol
(2000)
Michaels GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R: Cluster analysis and data visualization of large-scale...
Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to...
Koza JR, Mydlower JD, Lanza G, Yu J, Keanne MA: Reverse engineering of metabolic pathways from observed data using...
E.P van Someren et al.
Linear modeling of genetic networks from experimental data
Ismb
(2000)
Hartemink AJ, Gifford DK, Jaakkola TS, Young RA: Using graphical models and genomic expression data to statistically...
J.L DeRisi et al.
Exploring the metabolic and genetic control of gene expression on a genomic scale
Science
(1997)
P.T Spellman et al.
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization
Mol Biol Cell
(1998)
Butte A, Ye J, Niederfellner G, Rett K, Häring H, White M, Kohane I: Determining significant fold differences in gene...
K.P White et al.
Microarray analysis of Drosophila development during metamorphosis
Science
(1999)
N.S Holter et al.
Fundamental patterns underlying gene expression profiles: simplicity from complexity
Proc Natl Acad Sci USA
(2000)
R Jansen et al.
Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins
Nucleic Acids Res
(2000)
M.D Kane et al.
Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays
Nucleic Acids Res
(2000)
A.M Talaat et al.
Genome-directed primers for selective labeling of bacterial transcripts for DNA microarray analysis
Nat Biotechnol
(2000)
Sengupta R, Tompa M: Quality control in manufacturing oligo arrays: a combinatorial design approach. Pac Symp Biocomput...
Tsien CL, Libermann TA, Gu X, Kohane IS: On the reporting of fold differences. Pac Symp Biocomput 2001:496-507. This...
J.M Claverie
Computational methods for the identification of differential and coordinated gene expression
Hum Mol Genet
(1999)
E Manduchi et al.
Generation of patterns from gene expression data by assigning confidence to differentially expressed genes
Bioinformatics
(2000)
Park P, Pagano M, Bonetti M: A nonparametric scoring algorithm for identifying informative genes from microarray data....
S Raychaudhuri et al.
Pattern recognition of genomic features with microarrays: site typing of Mycobacterium tuberculosis strains
Ismb
(2000)
Klus G, Song A, Schick A, Wahde M, Szallasi Z: Mutual information analysis as a tool to assess the role of aneuploidy...
T.R Hughes et al.
Widespread aneuploidy revealed by DNA microarray expression profiling
Nat Genet
(2000)
T.R Golub et al.
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
Science
(1999)
A Alizadeh et al.
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
Nature
(2000)
M Bittner et al.
Molecular classification of cutaneous malignant melanoma by gene expression profiling
Nature
(2000)
D.T Ross et al.
Systematic variation in gene expression patterns in human cancer cell lines
Nat Genet
(2000)
A Ben-Dor et al.
Clustering gene expression patterns
J Comput Biol
(1999)
U Scherf et al.
A gene expression database for the molecular pharmacology of cancer
Nat Genet
(2000)
C.J Roberts et al.
Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles
Science
(2000)
V.R Iyer et al.
The transcriptional program in the response of human fibroblasts to serum
Science
(1999)
H.A Coller et al.
Expression analysis with oligonucleotide microarrays reveals that MYC regulates genes involved in growth, cell cycle, signaling, and adhesion
Proc Natl Acad Sci USA
(2000)
