Proceedings of the National Academy of Sciences of the United States of America. 2002 Sep 23;99(20):12509–12511. doi: 10.1073/pnas.212532499

Extracting functional information from microarrays: A challenge for functional genomics

Michael Q. Zhang
PMCID: PMC130487  PMID: 12271149

The advent of the human and model organism genome projects has provided an increasingly complete list of genes that code for the building blocks of life on Earth. Deciphering the functions of all these genes has proven to be no easy task. The availability of mountains of transcriptional profiling data from modern large-scale gene-expression technologies such as serial analysis of gene expression (SAGE) (1), oligonucleotide arrays (2), and cDNA microarrays (3) represents a tremendous windfall for computational biologists, many of whom have migrated from other fields. One article appearing in this issue of PNAS (4) introduces a novel computational approach, shortest path (SP) analysis, to assign gene functions in a transitive fashion along a correlation linkage path terminated by two known genes belonging to the same functional category.


Currently the most popular way to identify interesting genes and their functions is to perform cluster analysis on the relative expression pattern changes (Fig. 1A) in typical microarray experiments that survey a range of conditions (reviewed in ref. 5). The fundamental premise of the clustering approach is that genes having similar expression profiles across a set of conditions (cellular processes, responses, phenotypes, etc.) may share similar functions (6). Obviously the word “function” is too general to be precise and quantitative and too broad to be specific and meaningful. Genes whose products have the same function (say, phosphorylating other proteins) do not necessarily share a similar transcriptional pattern. Conversely, genes having different functions can have similar expression profiles simply by chance or through stochastic fluctuations. Although many potential caveats exist, large numbers of functionally related genes do show very similar expression patterns under a relevant set of conditions, especially genes that are coregulated by common transcription factors or whose products are components of a larger complex; this is why simple clustering of genes with similar expression patterns can be used to assign putative functions to unknown genes via “guilt-by-association” arguments (e.g., refs. 7 and 8). Several clustering techniques such as hierarchical clustering (9), K-means (10), and self-organizing maps (SOM) (11) have been adopted from other fields and applied widely to microarray data analyses. Successful as it is, clustering cannot reveal functional relations among genes whose expression patterns show very little correlation (they may be related by a time delay, for instance) (Fig. 1B).
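The time-delay caveat can be made concrete with a small numerical sketch. The profiles below are hypothetical sine waves, not data from any of the cited studies, and Pearson correlation is assumed as the similarity measure. A scaled copy of a profile correlates perfectly with the original, whereas the same profile delayed by a quarter period shows essentially no correlation, so a correlation-based clustering would keep the two apart.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical periodic profile sampled at 8 time points (one full period).
g_a = [math.sin(2 * math.pi * t / 8) for t in range(8)]
# Same shape, different amplitude and baseline.
g_b = [2 * v + 1 for v in g_a]
# Same shape, delayed by a quarter period (2 time points).
g_c = [math.sin(2 * math.pi * (t - 2) / 8) for t in range(8)]

r_ab = pearson(g_a, g_b)  # close to 1: the two genes would cluster together
r_ac = pearson(g_a, g_c)  # close to 0: the delayed gene would be missed
print(round(r_ab, 3), round(abs(r_ac), 3))
```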

Figure 1.


Relations among different concepts in the SP-analysis method. (A) Expression profile matrix (table). t = (t1,t2,… ) is the experimental condition index; in this example it indicates a set of time points. (B) Expression profiles (patterns). g1 and g4 are not strongly correlated directly, but both are strongly correlated with the correlated set (gx,g2). gx,g2 are the transitive genes interpolating the two terminal genes along SP1 (see C and D); similarly, gy is the transitive gene interpolating g1 and g5 along SP2. (C) GO biological process tree. The Ps are process annotations for genes at a particular node. A gene may belong to more than one node (“multiple-function,” such as g2). (D) Expression profile space. gx is on the short path SP1 terminated by the known genes g1, g2, and g4 and hence is assigned a function of P1,1,1,1 (level L0) according to the GO tree in C; gy is on SP2 terminated by g1,g5 and is assigned a function of P1,1,1 (level L1). g1 is shared by both SPs and may be involved in both processes, which means the processes represented by SP1 and SP2 actually crosstalk with each other. The linked gene network can be formed by the subgraph SP1+SP2.

A major goal of microarray data analyses is to identify genes that interact with each other in a particular cellular process (or pathway) where not every player has a similar transcriptional profile. The crucial aspect of the approach of Zhou et al. (4) is to extend the coexpression concept to a more general “transitive coexpression,” which seems to be an important characteristic of many biological processes: Two genes involved in the same process may not be strongly correlated in expression directly, but both can be strongly correlated with the same set of other genes. Another widely recognized point is that functional annotations should really be incorporated early in the data analysis. Not surprisingly, the starting point of the Zhou et al. work is the exploitation of the controlled vocabulary tree in the biological process categories of gene ontology (GO) (ref. 12; Fig. 1C).
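Transitive coexpression can be illustrated with a toy example. The sketch below uses four hypothetical sine-wave profiles, each shifted by one time point relative to the previous one, with Pearson correlation and an assumed cutoff of 0.8 for a "strong" link; none of these choices is taken from Zhou et al. The two end genes are uncorrelated with each other, yet a chain of strongly correlated neighbors connects them.

```python
import math
from collections import deque

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def shifted_profile(shift, n=12):
    """Hypothetical periodic profile delayed by `shift` time points."""
    return [math.sin(2 * math.pi * (t - shift) / n) for t in range(n)]

profiles = {"g1": shifted_profile(0), "gx": shifted_profile(1),
            "g2": shifted_profile(2), "g4": shifted_profile(3)}

THRESHOLD = 0.8  # assumed cutoff for a "strong" correlation

def strong_links(profiles, threshold):
    """Undirected graph linking gene pairs with |r| >= threshold."""
    genes = sorted(profiles)
    links = {g: set() for g in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if abs(pearson(profiles[a], profiles[b])) >= threshold:
                links[a].add(b)
                links[b].add(a)
    return links

def connecting_chain(links, start, goal):
    """Breadth-first search for a chain of strong links from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        if gene == goal:
            chain = []
            while gene is not None:
                chain.append(gene)
                gene = previous[gene]
            return chain[::-1]
        for neighbor in sorted(links[gene]):
            if neighbor not in previous:
                previous[neighbor] = gene
                queue.append(neighbor)
    return None  # no chain of strong links exists

links = strong_links(profiles, THRESHOLD)
chain = connecting_chain(links, "g1", "g4")
print("g1 and g4 directly linked:", "g4" in links["g1"])
print("chain:", chain)
```

Here the breadth-first search only establishes that a chain exists; it ignores how strong each link is, which is where a weighted shortest path becomes useful.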

In essence, this SP-analysis method starts from a pair of genes belonging to the same biological process category and the same major cellular compartment (mitochondrial, cytoplasmic, or nuclear), according to GO, and constructs the SP through a chain of pairwise, strongly correlated genes, with a distance function that further contracts the strongly correlated genes. Unknown transitive genes on the SP are assigned the function of the “lowest common ancestor” of all the process subcategories corresponding to the known genes on the same SP (Fig. 1 B–D). To define a sufficiently specific gene function, the total SP length must be very short, and this lowest ancestral node must be at least four levels below the root of the GO tree. In particular, if all the known genes are in the same node, the lowest common ancestor is the starting terminal gene process category itself (level L0 assignment); if they are in different nodes but all share a direct parent with the terminal genes, this parent node will be identified as the lowest common ancestor (level L1 assignment) (Fig. 1 C and D).
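The procedure just described can be sketched in a few lines. Everything specific in the example is an assumption made for illustration: the correlation values, the distance d = (1 − |r|)^2 (the exact distance function of Zhou et al. is not given here; this form is chosen only because raising to a power shrinks the distances between strongly correlated genes), the abstract node names n0 to n5b standing in for a GO process tree, and the gene annotations. The check on total SP length is omitted because no threshold is stated in the text.

```python
import heapq

# Hypothetical pairwise correlations between genes.
CORRELATION = {("g1", "gx"): 0.95, ("gx", "g2"): 0.90, ("g2", "g4"): 0.92,
               ("g1", "g3"): 0.70, ("g3", "g4"): 0.70}
# Hypothetical process tree (child -> parent); n0 is the root.
PARENT = {"n0": None, "n1": "n0", "n2": "n1", "n3": "n2", "n4": "n3",
          "n5a": "n4", "n5b": "n4"}
# Known genes and their process nodes; gx is the unknown gene.
ANNOTATION = {"g1": "n5a", "g2": "n5a", "g4": "n5a", "g3": "n5b"}

def shortest_path(correlation, start, goal, power=2):
    """Dijkstra's algorithm with the assumed distance (1 - |r|) ** power."""
    graph = {}
    for (a, b), r in correlation.items():
        d = (1.0 - abs(r)) ** power
        graph.setdefault(a, []).append((b, d))
        graph.setdefault(b, []).append((a, d))
    best, previous, heap = {start: 0.0}, {}, [(0.0, start)]
    while heap:
        dist, gene = heapq.heappop(heap)
        if gene == goal:
            break
        if dist > best.get(gene, float("inf")):
            continue  # stale heap entry
        for neighbor, d in graph.get(gene, []):
            candidate = dist + d
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = gene
                heapq.heappush(heap, (candidate, neighbor))
    if goal not in best:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1], best[goal]

def ancestors(node):
    """The node itself followed by its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def lowest_common_ancestor(nodes):
    """Deepest node that is an ancestor of (or equal to) every given node."""
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        others = set(ancestors(node))
        common = [n for n in common if n in others]
    return common[0]

def assign_functions(path, min_depth=4):
    """Give unknown genes on the path the LCA of the known genes' nodes."""
    known = [ANNOTATION[g] for g in path if g in ANNOTATION]
    lca = lowest_common_ancestor(known)
    if len(ancestors(lca)) - 1 < min_depth:
        return {}  # ancestor too close to the root to be informative
    return {g: lca for g in path if g not in ANNOTATION}

path, length = shortest_path(CORRELATION, "g1", "g4")
assignment = assign_functions(path)
print(path, assignment)
```

In this toy setup the three-link path through gx and g2 is shorter than the two-link path through g3, because each of its links is much stronger. All known genes on it sit in the same node, so gx receives that node (the L0 case); had the known genes been split between n5a and n5b, their shared parent n4 would have been assigned instead (the L1 case).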

To test the validity of their SP method, Zhou et al. (4) applied it to the analysis of the Saccharomyces cerevisiae gene-expression profiles of the Rosetta compendium (13), which measured the response of 300 gene-deletion and drug-treatment experiments. First, they used only the known genes (≈1,300 that have GO cellular process and localization annotations). The SP method was able to successfully call 64/84% (cytoplasm), 59/69% (mitochondria), and 39/51% (nuclear) transitive genes at the L0/L1 levels, and these results are highly significant as shown by further permutation tests. Encouraged by the benchmark tests, they extended the graphs of SPs of known genes to an additional ≈3,300 unknown ORFs and were able to assign functions (i.e., cellular process categories) to 146 ORFs that include 75 high-confidence predictions (a gene-function assignment is highly confident if the gene is the only unknown gene on the SP). Because a gene may belong to several SPs, it can receive multiple function assignments. One may choose not to make a prediction on an unknown gene if known genes on the SP fail to have as consistent an annotation as the terminal genes. Like many computational biologists, Zhou et al. spent a tremendous amount of effort trying to substantiate the biological content of their findings through extensive literature searches. Among the 75 high-confidence annotations, 24 were found in the yeast proteome database (YPD, www.proteome.com), and 20 (83%) were confirmed by YPD-documented experiments. More encouragingly, their computational results seem to be able to correct some database annotation errors after closer scrutiny.
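The bookkeeping behind multiple assignments and high-confidence calls can be sketched as follows. The paths, node labels, and gene names are hypothetical; only the rule itself (an assignment is high confidence when the gene is the only unknown gene on its SP) comes from the description above.

```python
from collections import defaultdict

KNOWN = {"g1", "g2", "g4", "g5"}  # hypothetical annotated genes

# Hypothetical shortest paths, each with the process node assigned to it.
PATHS = [
    (["g1", "gx", "g2", "g4"], "n5a"),
    (["g1", "gy", "gz", "g5"], "n4"),
    (["g2", "gx", "g5"], "n4"),
]

def collect_predictions(paths, known):
    """Gather every assignment per unknown gene and flag high-confidence ones."""
    predictions = defaultdict(set)
    high_confidence = set()
    for path, process in paths:
        unknown = [g for g in path if g not in known]
        for gene in unknown:
            predictions[gene].add(process)
        if len(unknown) == 1:
            # The only unknown gene on this path: a high-confidence call.
            high_confidence.add((unknown[0], process))
    return dict(predictions), high_confidence

predictions, high_confidence = collect_predictions(PATHS, KNOWN)
print(predictions)
print(high_confidence)
```

In this toy input gx lies on two paths and therefore collects two assignments, both high confidence, whereas gy and gz share a path and so receive only ordinary predictions.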

As stated by the authors, the strength of their method is to use the SP to link “transitive coexpressed” genes even if some of the genes (especially the terminal genes) on the SP do not have directly correlated expression profiles. A further advantage is exemplified by the “active incorporation of biological annotation into the knowledge discovery process.” But the conceptual significance actually lies at a much deeper level. For example, one could also ask: If two known targets of a transcription factor are taken as the terminal nodes, could more targets along the SP be identified analogously? If not with the SP defined by this particular distance function, then perhaps with some other SP defined by a more appropriate distance function. In general, one could argue that, to a certain extent, the goal of all microarray data analyses is to identify a functionally linked subnetwork hidden in the expression profiles. Suppose we view the expression profile space as consisting of clouds of points (genes). If we connect all genes (assuming we know every gene function) involved in a particular part of a cellular process (say, cell-cycle progression), we would trace out a subnetwork path. We could do the same for a different process and would get another path. The intersection would define gene(s) that are involved in both processes. If the two processes are so linked, we could actually trace out a connected subnetwork (Fig. 1D). Conversely, discovering such hidden functional linkages (paths, subnetworks, etc.) activated by response or process variables (such as time shift in the cell-cycle process) would be the central task. The expression space does not have to be limited to relative mRNA density changes at different times or conditions; it could also include proteome information, localization variables, and tissue and developmental parameters. It is actually nontrivial to find the right metric function that defines distance relations appropriate to the cellular processes of interest and allows investigators to construct SP links capable of tracing out the functional subnetworks. Although the particular distance function and related SPs of Zhou et al. (4) may not be sufficient for identifying all types of processes, the general methodology does represent a significant extension of our microarray data analysis repertoire beyond cluster analysis.
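The idea of intersecting paths and merging them into a subnetwork can be written down directly. The two paths below mirror SP1 and SP2 of Fig. 1D; the helper names are hypothetical, and the sketch assumes each path is already available as an ordered list of genes.

```python
SP1 = ["g1", "gx", "g2", "g4"]
SP2 = ["g1", "gy", "g5"]

def path_edges(path):
    """Undirected edges between consecutive genes on a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def is_connected(edges):
    """True if all genes touched by the edges form a single component."""
    adjacency = {}
    for edge in edges:
        a, b = tuple(edge)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)

# Genes on both paths are candidates for crosstalk between the processes.
shared = set(SP1) & set(SP2)
# The union of the two paths is the linked subnetwork SP1+SP2.
network = path_edges(SP1) | path_edges(SP2)
print(shared, len(network), is_connected(network))
```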

It is not clear how far one can take this empirical SP approach. If the two terminal genes are multifunctional, will there more likely be a single SP with multifunctional transitive genes or multiple SPs with largely single-functional transitive genes on each SP? It is more likely that the incomplete knowledge of the existing GO tree and the current resolution for most microarray data will prevent us from getting the answers to such questions. But the real key for understanding transcriptional profiles and gene-regulation networks is to link expression pattern to transcription factor-binding sites (cis-regulatory elements). Recent advances in computational (refs. 14 and 15; reviewed in ref. 16) and experimental (17, 18) technologies have opened up real opportunities for annotating gene functions not only at the phenomenological levels but also at the mechanistic levels.

Acknowledgments

I thank G. X. Chen, N. Banerjee, and H. J. Yuan for critical comments on the manuscript. The Zhang lab is supported by National Institutes of Health grants.

Footnotes

See companion article on page 12783.

References

1. Velculescu V E, Zhang L, Vogelstein B, Kinzler K W. Science. 1995;270:484–487. doi: 10.1126/science.270.5235.484.
2. Lockhart D J, Dong H, Byrne M C, Follettie M T, Gallo M V, Chee M S, Mittmann M, Wang C, Kobayashi M, Horton H, Brown E L. Nat Biotechnol. 1996;14:1675–1680. doi: 10.1038/nbt1296-1675.
3. Schena M, Shalon D, Davis R W, Brown P O. Science. 1995;270:467–470. doi: 10.1126/science.270.5235.467.
4. Zhou X, Kao M-C J, Wong W H. Proc Natl Acad Sci USA. 2002;99:12783–12788. doi: 10.1073/pnas.192159399.
5. Quackenbush J. Nat Rev Genet. 2001;2:418–427. doi: 10.1038/35076576.
6. Zhu J, Zhang M Q. Pac Symp Biocomput. 1999;5:476–487.
7. Wen X, Fuhrman S, Michaels G S, Carr D B, Smith S, Barker J L, Somogyi R. Proc Natl Acad Sci USA. 1998;95:334–339. doi: 10.1073/pnas.95.1.334.
8. Spellman P T, Sherlock G, Zhang M Q, Iyer V R, Anders K, Eisen M B, Brown P O, Botstein D, Futcher B. Mol Biol Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273.
9. Eisen M B, Spellman P T, Brown P O, Botstein D. Proc Natl Acad Sci USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
10. Tavazoie S, Hughes J D, Campbell M J, Cho R J, Church G M. Nat Genet. 1999;22:281–285. doi: 10.1038/10343.
11. Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, Coller H, Loh M L, Downing J R, Caligiuri M A, Bloomfield C D, Lander E S. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
12. Ashburner M, Ball C A, Blake J A, Botstein D, Butler H, Cherry J M, Davis A P, Dolinski K, Dwight S S, Eppig J T, et al. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
13. Hughes T R, Marton M J, Jones A R, Roberts C J, Stoughton R, Armour C D, Bennett H A, Coffey E, Dai H, He Y D, et al. Cell. 2000;102:109–126. doi: 10.1016/s0092-8674(00)00015-5.
14. Markstein M, Markstein P, Markstein V, Levine M S. Proc Natl Acad Sci USA. 2002;99:763–768. doi: 10.1073/pnas.012591199.
15. Berman B P, Nibu Y, Pfeiffer B D, Tomancak P, Celniker S E, Levine M, Rubin G M, Eisen M B. Proc Natl Acad Sci USA. 2002;99:757–762. doi: 10.1073/pnas.231608898.
16. Michelson A M. Proc Natl Acad Sci USA. 2002;99:546–548. doi: 10.1073/pnas.032685999.
17. Ren B, Robert F, Wyrick J J, Aparicio O, Jennings E G, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Science. 2000;290:2306–2309. doi: 10.1126/science.290.5500.2306.
18. Iyer V R, Horak C E, Scafe C S, Botstein D, Snyder M, Brown P O. Nature (London). 2001;409:533–538. doi: 10.1038/35054095.
