dChip: Gene function enrichment analysis

 

Gene function enrichment analysis at clustering

 

After a list of genes is obtained by “Compare samples” or “Filter genes”, or selected as a blue gene cluster branch in the clustering image, we can use the “Tools/Gene Function Enrichment” dialog (“Tools/Classify Genes” in versions before 3/31/07) to search for function enrichment in these genes using Gene Ontology or other annotation terms. Gene groups have header lines such as “Found 15 GeneOntology 'response to external stimulus' genes in a 120 annotated genes (genome-wide: 1068/7734, p-value: 0.661181)”. The p-values are calculated in the same way as for the significant gene clusters (see the clustering section below). Here 120 is the number of genes in the input gene list that have Gene Ontology annotation, which may be fewer than the actual number of genes in the list. Note that “Tools/Gene Function Enrichment” (formerly “Tools/Classify Genes”) assesses enrichment in the whole gene list, whereas clustering assesses every gene cluster with at least 4 annotated genes. The former therefore gives fewer significant gene groups than the latter.

 

Significant p-values, as defined in the “Tools/Options/Clustering” dialog, are suffixed by stars (“***”) in the output file. [Versions before 3/31/07: one may also check the “Only report significant results” box to output only gene groups with significant p-values.] Additional data columns of the “gene list file”, such as expression values or fold changes, are copied into the output “classified file”.

 

[This paragraph is obsolete. In V3/31/07+, the function terms of redundant probe sets for the same gene are counted only once at “Tools/Gene Function Enrichment”, at gene clustering, and when reading the gene information file at “Open group”, whether or not “Tools/Options/Mask redundant probe set” is checked.] To prevent multiple probe sets for the same gene from biasing the functional significance computation, it is best to check “Analysis/Open group/Options/Analysis/Mask redundant probe sets” to exclude redundant probe sets (identified by LocusLink ID) from a gene list. This can also be done at “Tools/Options/Analysis/Mask redundant probe sets”, but “Analysis/Open group” should then be redone, since the array background information on gene annotation is computed after the “gene information file” is read in.
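As an illustration of what masking redundant probe sets amounts to, here is a minimal Python sketch. It is not dChip's implementation. It assumes each probe set is a (probe ID, LocusLink ID) pair, that the first probe set seen for a gene is the one kept, and that probe sets without a LocusLink ID are left in the list; the function name and the example IDs are hypothetical.

```python
def mask_redundant_probe_sets(probe_sets):
    """Keep one probe set per gene ID (assumption: the first one seen).

    probe_sets: list of (probe_id, gene_id) pairs; gene_id may be None
    when the probe set has no LocusLink ID (assumption: such entries are kept).
    """
    seen_genes = set()
    kept = []
    for probe_id, gene_id in probe_sets:
        if gene_id is None:
            kept.append((probe_id, gene_id))
        elif gene_id not in seen_genes:
            seen_genes.add(gene_id)
            kept.append((probe_id, gene_id))
    return kept


# Hypothetical gene list: probe_A and probe_B map to the same gene.
example = [
    ("probe_A", "1001"),
    ("probe_B", "1001"),
    ("probe_C", "2002"),
    ("probe_D", None),
]
masked = mask_redundant_probe_sets(example)
```

With the example above, probe_B is dropped, so the gene with ID 1001 contributes its function terms only once to the enrichment counts.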

 

In V3/31/07+, the enriched clusters are reported in the analysis view in a tab-delimited format that is easy to view or to copy into Excel:

Gene annotation enrichement analysis

C1: number of genes in a cluster having this annotation term
C2: number of annotated genes in this clster
C3: number of all genes having this annotation term
C4: number of all annotated genes
P-value: binomial approximated p-value for hypergeometric distribution

***Gene Ontology***

C1    C2     C3     C4      P-value     TermName
16    322    128    7186    0.000269    receptor signaling protein activity
60    322    526    7186    0           defense response
54    322    485    7186    0           immune response
75    322    803    7186    0           response to external stimulus
65    322    575    7186    0           response to biotic stimulus
30    322    321    7186    0.000143    response to pest/pathogen/parasite
5     322    9      7186    0.000062    immunoglobulin binding
70    322    987    7186    0.00006     organismal physiological process
79    322    967    7186    0           response to stimulus

Reported significant: 9, Expected false positive: 1 (736 term assessed for enrichment at p-value threshold 0.001000)
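
The two summary quantities in this report can be sketched in Python as follows. This is an illustration under stated assumptions, not dChip's code. It assumes the “binomial approximated” p-value is the upper tail of a Binomial(C2, C3/C4) distribution at C1, and that the expected number of false positives is the number of assessed terms multiplied by the p-value threshold (736 × 0.001 = 0.736, which rounds to the reported 1). The function names are hypothetical, and the computed p-value may not reproduce the reported digits exactly.

```python
from math import comb


def binomial_tail(c1, c2, c3, c4):
    """P(X >= c1) for X ~ Binomial(c2, c3 / c4).

    Assumption: each of the c2 annotated genes in the cluster is treated as
    independently carrying the term with the genome-wide frequency c3 / c4.
    """
    p = c3 / c4
    return sum(
        comb(c2, i) * p**i * (1 - p) ** (c2 - i)
        for i in range(c1, c2 + 1)
    )


def expected_false_positives(n_terms, threshold):
    """Expected number of terms passing the threshold by chance alone."""
    return n_terms * threshold


# First row of the table: receptor signaling protein activity.
p_value = binomial_tail(16, 322, 128, 7186)

# 736 terms assessed at threshold 0.001.
efp = expected_false_positives(736, 0.001)
```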

 

Gene function enrichment analysis at clustering

 

The colored region on the right side of (or below) the hierarchical clustering picture represents the “functional category classification”, with different colors representing distinct functional descriptions (use “Control+Click” to change the color of the functional blocks, although the changes cannot be saved; use Shift+Left/Right to change the region's width or to hide it). This information is stored in the “Gene information file” and comes from Affymetrix array annotation files (e.g. based on the NCBI Entrez Gene database), which classify a gene according to molecular function, biological process and cellular component using GeneOntology terms.

 

After hierarchical clustering is performed on genes, dChip searches all branches with at least 4 functionally annotated genes to assess whether a local cluster is enriched in genes having a particular function. Such an assessment has been used in Tavazoie et al. 1999 and Cho et al. 2001 for K-means or supervised clusters. Here dChip systematically assesses the significance of all functional categories in all branches of the hierarchical clustering tree. In the cluster figure below (data from Armstrong 02), the blue gene cluster is enriched in genes having the GO term "central nervous system development". Inspecting the cluster figure and the gene names on the right reveals the genes with this GO term in blue, as well as the other genes in this cluster. We can also see in which samples these genes are highly expressed.
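
The branch search described above can be sketched in Python. This is a minimal illustration, not dChip's implementation. It assumes the clustering tree is represented as nested 2-tuples with gene names as leaves, and that the annotated genes are given as a set; the function names and the toy tree are hypothetical.

```python
def leaves(node):
    """Return the gene names under a tree node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def candidate_branches(tree, annotated, min_annotated=4):
    """Yield the gene list of every branch with at least min_annotated annotated genes."""
    stack = [tree]
    while stack:
        node = stack.pop()
        genes = leaves(node)
        if sum(g in annotated for g in genes) >= min_annotated:
            yield genes
        if not isinstance(node, str):
            stack.extend(node)  # visit both child branches


# Toy tree with six genes, five of them annotated.
tree = ((("g1", "g2"), ("g3", "g4")), ("g5", "g6"))
annotated = {"g1", "g2", "g3", "g4", "g5"}
branches = list(candidate_branches(tree, annotated))
```

In this toy example only the whole tree and the four-gene branch qualify; each qualifying branch would then be tested against every functional term.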

 

Click an icon on the left for a gene annotation term (below the “Clustering” icon) to highlight a functionally enriched gene cluster in blue in the clustering branch. We formulate the problem as: “there are m annotated genes on the chip, of which n have a certain function; if we randomly select k genes, what is the probability that x or more of the k genes have this function?” This probability is the p-value, and it indicates how surprising (or significant) it is to see this many genes of a certain function in a cluster of k genes. Because this is sampling without replacement from a finite population, the hypergeometric distribution is used. In this clustering picture, the blue cluster has 49 functionally annotated genes (genes without annotation are not counted), of which 6 are chaperone genes; considering that there are altogether 61 chaperone genes among the 5009 functionally annotated genes on the array, this cluster is significantly enriched in chaperone genes. A p-value of 9.84e-013 is calculated (in the picture the p-value was obtained using the normal approximation; the current version uses the exact hypergeometric distribution). The p-value information at the bottom is obtained by moving the mouse over the colored blocks on the right side (see below for details).
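
The formulation above can be written as a short Python sketch using the document's symbols m, n, k and x. This is an illustration of the exact hypergeometric upper tail, not dChip's own code, and the function name is hypothetical. Since the 9.84e-013 value quoted above came from the normal approximation, the exact tail computed here will generally not equal it.

```python
from math import comb


def hypergeom_tail(x, k, n, m):
    """P(X >= x): probability that x or more of k genes drawn without
    replacement from m annotated genes have a function shared by n of the m.
    """
    upper = min(k, n)
    favorable = sum(comb(n, i) * comb(m - n, k - i) for i in range(x, upper + 1))
    return favorable / comb(m, k)


# Chaperone example from the text: 6 of 49 annotated genes in the cluster,
# 61 chaperone genes among 5009 annotated genes on the array.
p_chaperone = hypergeom_tail(6, 49, 61, 5009)
```

Exact integer arithmetic is used for the binomial coefficients, and the division is performed only once at the end, which avoids overflow for arrays with thousands of annotated genes.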

 

P-values smaller than 0.005 (0.001 in V1.3+) are considered significant by default, and the corresponding functional clusters are represented by “Functional cluster” icons below the “Clustering” icon. This p-value threshold can be set to another value in the “Tools/Options/Clustering” dialog. Sometimes different “Functional cluster” icons represent the same cluster of genes, because that cluster is enriched in several different functional terms.

 

If a larger cluster is called significant and it contains smaller significant clusters, the smaller clusters will not be reported. For example, the entire tree (all the genes used for clustering) is considered the largest cluster; if it is abundant in cell cycle genes relative to all the annotated genes, dChip will not report whether cell cycle genes are relatively enriched in one tree branch (a smaller cluster). One way around this is to make the p-value threshold in “Tools/Options/Clustering” smaller (since small branches tend to have more significant p-values), so that the smaller clusters are reported rather than the large ones. Another way to look more closely at sub-clusters is to click to highlight a branch, use “Clustering/Selected branch/Export Data” to export a gene list for the branch, and then go to “Tools/Gene function enrichment analysis” to examine the significance of all functions in this local branch. Finally, one can click to highlight a branch and use the “mouse-over” method described in the next paragraph to check the significance of functional terms interactively.
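
The reporting rule for nested clusters can be sketched as follows. This is a hedged illustration only: it assumes the rule is applied separately for each functional term (as the cell cycle example suggests), and that clusters are compared as sets of genes. The function name and the example data are hypothetical.

```python
def report_outermost(significant):
    """Keep only significant clusters not nested in a larger one for the same term.

    significant: list of (term, frozenset_of_genes) pairs that passed the
    p-value threshold. Assumption: nesting is checked per term.
    """
    reported = []
    for term, genes in significant:
        nested = any(
            other_term == term and genes < other_genes  # strict subset
            for other_term, other_genes in significant
        )
        if not nested:
            reported.append((term, genes))
    return reported


# Hypothetical significant clusters: the small "cell cycle" cluster lies
# inside the large one and is therefore suppressed.
significant = [
    ("cell cycle", frozenset({"a", "b", "c", "d", "e", "f"})),
    ("cell cycle", frozenset({"a", "b", "c"})),
    ("chaperone", frozenset({"a", "b", "c"})),
]
reported = report_outermost(significant)
```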

 

In the clustering picture, use the Shift+Left/Right keys to change the width of the annotation columns, and press Shift+Left multiple times to hide these columns. Moving the mouse over the colored region on the right side of the clustering picture displays the following information in the status bar at the bottom: the functional category name for the current color, the number of genes belonging to this category on the array, the total number of genes having annotation on the array, the number of genes belonging to this category in the currently highlighted cluster, the number of genes having annotation in the highlighted cluster, and the p-value for seeing this many genes with this function in the highlighted cluster. Clicking the function bars on the right side of the clustering data selects the corresponding function as the “current function” and colors all the genes having this function in blue. The “current function” is also set when selecting the “functional cluster” icons in the left pane.

 

If the HG_U95AV2 or MG_U74AV2 gene information file is used, the GeneOntology and ProteinDomain terms are indexed by GeneOntology and Pfam (InterPro for 12/14/02+ gene information files) database IDs. Right-click a colored functional term to go to the website of the GeneOntology or Pfam entry. Also, since not all genes have been mapped to LocusLink, and not all LocusLink entries have GO and protein domain terms, one may see white blank rows in the GeneOntology or ProteinDomain column.

 

(Page since 3/31/07)