dChip: Gene function enrichment analysis
Gene function enrichment analysis at clustering
After a list of genes is obtained by “Compare samples”/“Filter genes”, or
selected as a blue gene clustering branch in the clustering image, we can use
the "Tools/Gene Function Enrichment" dialog (“Tools/Classify Genes” in versions
before 3/31/07) to search for function enrichment in these genes using Gene
Ontology or other annotation terms. Gene groups have header lines such as
“Found 15 GeneOntology 'response to external stimulus' genes in a 120 annotated
genes (genome-wide: 1068/7734, p-value: 0.661181)”. The p-values are calculated
in the same way as for the significant gene clusters. Here 120 is the number of
genes in the input gene list that have Gene Ontology annotation, so it may be
smaller than the total number of genes in the list. Note that "Tools/Gene
Function Enrichment" (formerly "Tools/Classify Genes") assesses enrichment in
the gene list as a whole, whereas at clustering every gene cluster with at
least 4 annotated genes is assessed. Thus the former gives fewer significant
gene groups than the latter.
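The four counts behind such a header line can be illustrated with a minimal sketch. It assumes a hypothetical in-memory mapping from gene ID to a set of annotation terms; the function name, gene IDs and data layout are illustrative and are not dChip's own file formats or code.

```python
def term_counts(gene_list, annotations, term):
    """Return (in_list_with_term, annotated_in_list, array_with_term, annotated_on_array).

    `annotations` maps gene ID -> set of terms for genes on the array.
    Genes that are missing from the map, or mapped to an empty set,
    are treated as unannotated and are not counted.
    """
    annotated_in_list = [g for g in set(gene_list) if annotations.get(g)]
    in_list_with_term = sum(1 for g in annotated_in_list if term in annotations[g])
    array_with_term = sum(1 for terms in annotations.values() if term in terms)
    annotated_on_array = sum(1 for terms in annotations.values() if terms)
    return in_list_with_term, len(annotated_in_list), array_with_term, annotated_on_array


# Toy data with hypothetical gene IDs.
annotations = {
    "g1": {"immune response"},
    "g2": {"immune response", "defense response"},
    "g3": {"cell cycle"},
    "g4": {"defense response"},
    "g5": set(),  # on the array, but without any annotation term
}
gene_list = ["g1", "g2", "g3", "g5", "g6"]  # g6 has no annotation entry at all

counts = term_counts(gene_list, annotations, "immune response")
print(counts)
```

In this toy case the list has five genes but only three annotated ones, which mirrors why the "120 annotated genes" in the header line can be fewer than the number of genes in the input list.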
Significant p-values, as defined in the “Tools/Options/Clustering” dialog, are
suffixed by stars (“***”) in the output file. [Versions before 3/31/07: one may
also check the “Only report significant results” box to output only gene groups
with significant p-values.] Additional data columns of the “gene list file”,
such as expression values or fold changes, are copied into the output
“classified file”.
[This paragraph is obsolete: in V3/31/07+, the function terms of redundant
probe sets for the same gene are counted only once at "Tools/Gene Function
Enrichment" or gene clustering, and when reading the gene information file at
"Open group", whether or not "Tools/Options/Mask redundant probe set" is
checked.] To prevent multiple probe sets for the same gene from biasing the
functional significance computation, it is best to check “Analysis/Open
group/Options/Analysis/Mask redundant probe sets” to exclude the redundant
probe sets (identified by LocusLink ID) from a gene list. This can also be done
at “Tools/Options/Analysis/Mask redundant probe sets”, but “Analysis/Open
group” should then be redone, since the array background information on gene
annotation is computed after reading in the “gene information file”.
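The idea of counting each gene only once can be sketched as follows. This is an illustration under stated assumptions, not dChip's implementation: probe sets are given as hypothetical (probe set ID, gene ID) pairs, the first probe set seen for a gene is the one kept, and probe sets without a gene ID are all retained. Which probe set dChip actually keeps is not stated in this page.

```python
def mask_redundant_probe_sets(probe_sets):
    """Keep one probe set per gene ID (the first one encountered).

    `probe_sets` is a list of (probe_set_id, gene_id) pairs; gene_id is
    None when the probe set has not been mapped to a gene.
    """
    seen_genes = set()
    kept = []
    for probe_set_id, gene_id in probe_sets:
        if gene_id is None:
            # Unmapped probe sets cannot be matched to each other; keep them.
            kept.append((probe_set_id, gene_id))
        elif gene_id not in seen_genes:
            seen_genes.add(gene_id)
            kept.append((probe_set_id, gene_id))
    return kept


# Hypothetical probe set and gene IDs.
probe_sets = [
    ("ps_001", "1017"),
    ("ps_002", "1017"),  # second probe set for the same gene: masked
    ("ps_003", "7157"),
    ("ps_004", None),    # unmapped: kept
]
print(mask_redundant_probe_sets(probe_sets))
```

Without such masking, a gene represented by several probe sets would contribute its function terms several times to the enrichment counts.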
In V3/31/07+, the enriched clusters are reported in the analysis view in a
tab-delimited format that can easily be viewed or copied into Excel:
Gene annotation enrichment analysis
C1: number of genes in a cluster having this annotation term
C2: number of annotated genes in this cluster
C3: number of all genes having this annotation term
C4: number of all annotated genes
P-value: binomial approximated p-value for hypergeometric distribution

***Gene Ontology***
C1   C2    C3    C4     P-value    TermName
16   322   128   7186   0.000269   receptor signaling protein activity
60   322   526   7186   0          defense response
54   322   485   7186   0          immune response
75   322   803   7186   0          response to external stimulus
65   322   575   7186   0          response to biotic stimulus
30   322   321   7186   0.000143   response to pest/pathogen/parasite
5    322   9     7186   0.000062   immunoglobulin binding
70   322   987   7186   0.00006    organismal physiological process
79   322   967   7186   0          response to stimulus
Reported significant: 9, Expected false positive: 1 (736 terms assessed for enrichment at p-value threshold 0.001000)
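One plausible reading of the "Expected false positive" figure in the summary line is the number of terms assessed multiplied by the p-value threshold, rounded to a whole number. The page does not state the formula, so the sketch below is an assumption that merely reproduces the reported numbers; the function name is hypothetical.

```python
def expected_false_positives(n_terms_assessed, p_threshold):
    """Expected number of terms passing the threshold by chance alone,
    assuming each assessed term has probability p_threshold of doing so."""
    return n_terms_assessed * p_threshold


# Values taken from the summary line of the report above.
efp = expected_false_positives(736, 0.001)
print(efp, round(efp))
```

Under this reading, 9 reported terms against roughly 1 expected by chance suggests that most of the reported terms are not chance findings.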
Gene function enrichment analysis at clustering
The colored region to the right of (or below) the hierarchical clustering
picture represents “functional category classification”, with different colors
representing distinct functional descriptions (use “Control+Click” to change
the color of the functional blocks; the changes cannot be saved; use
Shift+Left/Right to change the width of the region or hide it). This
information is stored in the “Gene information file” and comes from Affymetrix
array annotation files (e.g. based on the NCBI Entrez Gene database), which
classify a gene according to molecular function, biological process and
cellular component using GeneOntology terms.
After hierarchical clustering is performed on genes, dChip searches all
branches with at least 4 functionally annotated genes to assess whether a local
cluster is enriched in genes having a particular function. Such assessment has
been used in Tavazoie et al. 1999 and Cho et al. 2001 for K-means or supervised
clusters. Here dChip systematically assesses the significance of all functional
categories in all branches of the hierarchical clustering tree. In the cluster
figure below (data from Armstrong 02), the blue gene cluster is enriched in
genes having the GO term "central nervous system development". Inspecting the
cluster figure and the gene names on the right reveals the genes with this GO
term, shown in blue, as well as the other genes in this cluster. We can also
see in which samples these genes are highly expressed.

Click an icon on the left for a gene annotation term (below the “Clustering”
icon) to highlight a functionally enriched gene cluster in blue in the
clustering branch. We formulate the problem as “there are m annotated genes on
the chip, of which n have a certain function; if we randomly select k genes,
what is the probability that x or more genes (of the k genes) have this
function?”; this probability is the p-value and indicates our surprise (or the
significance) at seeing this many genes of a certain function occurring in a
cluster of k genes. This is sampling without replacement from a finite
population, so the hypergeometric distribution is used. In this clustering
picture, the blue cluster has 49 functionally annotated genes (genes without
annotation are not counted), of which 6 are chaperone genes; considering that
there are altogether 61 chaperone genes among the 5009 functionally annotated
genes on the array, this cluster is significantly enriched in chaperone genes.
A p-value of 9.84e-013 is shown (in the picture the p-value was obtained using
the normal approximation; the current version uses the exact hypergeometric
distribution). The p-value information at the bottom is obtained by mousing
over the colored blocks on the right side (see below for details).
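The calculation for the chaperone example (m = 5009, n = 61, k = 49, x = 6) can be sketched with the Python standard library. The exact tail follows the formulation above. The normal approximation shown here (normal approximation to the binomial, without continuity correction) is an assumption about how the older value may have been computed, and the function names are illustrative rather than dChip's own.

```python
from math import comb, erfc, sqrt


def hypergeom_tail(m, n, k, x):
    """Exact P(X >= x): k genes drawn without replacement from m annotated
    genes, of which n have the function of interest."""
    total = comb(m, k)
    upper = min(n, k)
    favorable = sum(comb(n, i) * comb(m - n, k - i) for i in range(x, upper + 1))
    return favorable / total


def normal_approx_tail(m, n, k, x):
    """Upper-tail normal approximation to the binomial with p = n / m
    (assumed form of the approximation; no continuity correction)."""
    p = n / m
    mean = k * p
    sd = sqrt(k * p * (1 - p))
    z = (x - mean) / sd
    return 0.5 * erfc(z / sqrt(2))


exact = hypergeom_tail(5009, 61, 49, 6)
approx = normal_approx_tail(5009, 61, 49, 6)
print(f"exact hypergeometric tail: {exact:.3g}")
print(f"normal approximation:      {approx:.3g}")
```

With these inputs the normal approximation comes out on the order of 1e-12, the same order as the value shown in the picture, while the exact tail is several orders of magnitude larger though still well below the default thresholds. The gap is expected here because the expected count (about 0.6 genes) is far too small for a normal approximation to be accurate in the tail.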
P-values smaller than 0.005 (0.001 in V1.3+) are considered significant by
default, and the corresponding functional clusters are represented by
“Functional cluster” icons below the “Clustering” icon. This p-value threshold
can be set to another value in the “Tools/Options/Clustering” dialog. Sometimes
different “Functional cluster” icons represent the same cluster of genes,
because the cluster is enriched in several different functional terms.
If a larger cluster is called significant and it contains smaller significant
clusters, the smaller clusters will not be reported. For example, the entire
tree (all the genes used for clustering) is considered the largest cluster; if
it is abundant in cell cycle genes relative to all the annotated genes, dChip
will not report whether cell cycle genes are relatively enriched in one tree
branch (a smaller cluster). One way around this is to make the p-value
threshold in “Tools/Options/Clustering” smaller (since small branches tend to
have more significant p-values), so that the smaller clusters are reported
instead of the large ones. Another way to look more closely at sub-clusters is
to click to highlight a branch, use “Clustering/Selected branch/Export Data” to
export a gene list for the branch, and then go to “Tools/Gene Function
Enrichment” to examine the significance of all functions in this local branch.
Finally, one can click to highlight a branch and use the “mouse-over” method
described in the next paragraph to interactively check the significance of
functional terms.
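The reporting rule and the threshold workaround can be illustrated with a small sketch for a single functional term. It assumes a top-down scan over a tree given as nested tuples of hypothetical gene IDs, with array-wide counts m and n supplied by the caller; dChip's actual traversal order and data structures are not described in this page.

```python
from math import comb


def tail(m, n, k, x):
    """Exact hypergeometric P(X >= x)."""
    return sum(comb(n, i) * comb(m - n, k - i)
               for i in range(x, min(n, k) + 1)) / comb(m, k)


def leaves(node):
    """All gene IDs under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [g for child in node for g in leaves(child)]


def significant_branches(node, has_term, annotated, m, n,
                         threshold=0.001, min_annotated=4):
    """Report a branch if it is significant and do not descend into it, so
    smaller significant clusters inside it are not reported."""
    if isinstance(node, str):
        return []
    genes = [g for g in leaves(node) if g in annotated]
    k = len(genes)
    x = sum(1 for g in genes if g in has_term)
    if k >= min_annotated and x > 0 and tail(m, n, k, x) < threshold:
        return [(sorted(leaves(node)), k, x)]
    found = []
    for child in node:
        found += significant_branches(child, has_term, annotated, m, n,
                                      threshold, min_annotated)
    return found


# Toy tree; array-wide counts (m = 5000 annotated, n = 50 with the term)
# are hypothetical.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
annotated = set("abcdefgh")
has_term = set("abcd")

print(significant_branches(tree, has_term, annotated, 5000, 50))
print(significant_branches(tree, has_term, annotated, 5000, 50, threshold=1e-7))
```

At the default threshold the whole tree is reported and the four-gene branch inside it is hidden; with the much smaller threshold the whole tree no longer passes, and the scan reports the smaller branch instead.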
In the clustering picture, use the Shift+Left/Right keys to change the width of
the annotation columns, and press Shift+Left multiple times to hide these
columns. Mousing over the colored region on the right side of the clustering
picture displays the following information in the status bar at the bottom: the
functional category name for the current color, the number of genes belonging
to this category on the array, the total number of genes having annotation on
the array, the number of genes belonging to this category in the currently
highlighted cluster, the number of genes having annotation in the highlighted
cluster, and the p-value for seeing this many genes with this function in the
highlighted cluster. Clicking a function bar on the right side of the
clustering data selects the corresponding function as the “current function”
and colors all the genes having this function in blue. The “current function”
is also set when selecting a “functional cluster” icon in the left pane.
If the HG_U95AV2 or MG_U74AV2 gene information file is used, the GeneOntology
and ProteinDomain terms are indexed by GeneOntology and Pfam (InterPro for
12/14/02+ gene information files) database IDs. Right-click a colored
functional term to go to the corresponding GeneOntology or Pfam entry on the
web. Also, since not all genes have been mapped to LocusLink, and not all
LocusLink entries have GO and protein domain terms, one may see blank white
rows in the GeneOntology or ProteinDomain column.
(Page since 3/31/07)