dChip: Hierarchical Clustering and Enrichment Analysis

 

Clustering image                                  Selected gene branch                             Significant sample cluster

Save clustering tree                              Export clustering image                         Remove irrelevant genes                                   

 

After obtaining model-based expression values, we can perform high-level analysis such as hierarchical clustering (Eisen et al. 1998). Unsupervised sample clustering using genes obtained by “Analysis/Filter genes” can be used to identify novel sample clusters and their associated “signature genes”, to check data quality by seeing whether replicate samples or samples under similar conditions cluster together (and, if not, to consider possible reasons), and to identify unexpected clustering (e.g. samples generated on the same date or in the same lab clustering together). Select the menu “Analysis/Hierarchical clustering”:

 

A “gene list file” is a tab-delimited text file with probe set name in the first column of each line. It can be generated by “Analysis/Filter genes”, “Analysis/Compare samples” or “Tools/Gene list file”. It may also be a “Tree file” saved by the “Clustering/Save tree” function so that an existing tree structure saved before can be used. dChip will use genes in the file for clustering.
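As a rough illustration of the file layout described above, the following Python sketch reads the probe set names from the first column of a tab-delimited file. The function name `read_gene_list`, the example probe set names and the assumption that the file has no header line are ours, not part of dChip.

```python
import csv
import io

def read_gene_list(handle):
    """Return the probe set names found in the first column of a tab-delimited file."""
    names = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and lines whose first column is empty.
        if row and row[0].strip():
            names.append(row[0].strip())
    return names

# Hypothetical file content: probe set name first, other columns ignored.
example = io.StringIO("1000_at\tdescription A\n1001_at\tdescription B\n\n1002_f_at\tdescription C\n")
probe_sets = read_gene_list(example)
print(probe_sets)
```

Only the first column matters for this purpose, so any further columns written by the filtering or comparison functions are ignored here.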

 

One may check “Tools/Options/Analysis/Mask redundant probe sets” to exclude redundant probe sets (those having the same LocusLink ID) from a gene list and keep only the first occurring probe set, since multiple probe sets for the same gene tend to bias the results of sample clustering and of the search for functionally significant gene clusters. However, if replicate probe sets are each selected by some filtering or comparison criteria and cluster closely together, this is a good indication that the selected gene list is meaningful. On the other hand, if a selected gene list seems to contain genes unrelated to each other (e.g. few replicate probe sets), we may doubt its validity; often an FDR assessment by permutation then yields a similar number of genes, which supports this suspicion. The same reasoning extends to probe sets for genes in the same gene family or the same pathway.
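The masking rule (keep only the first occurring probe set for each LocusLink ID) can be sketched in Python as follows. This is an illustration only: the function name, the dictionary of IDs and the choice to keep probe sets that have no LocusLink ID are assumptions, not a description of dChip's internals.

```python
def mask_redundant(probe_sets, locus_ids):
    """Keep the first probe set seen for each LocusLink ID, preserving list order."""
    seen = set()
    kept = []
    for probe_set in probe_sets:
        locus = locus_ids.get(probe_set)
        if locus is None:
            # Assumption: probe sets without an annotated ID are never masked.
            kept.append(probe_set)
        elif locus not in seen:
            seen.add(locus)
            kept.append(probe_set)
    return kept

# Hypothetical probe sets and IDs, for illustration only.
gene_list = ["100_at", "101_s_at", "102_at", "103_at"]
ids = {"100_at": 5594, "101_s_at": 5594, "102_at": 7075}
masked = mask_redundant(gene_list, ids)
print(masked)
```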

 

The samples used for clustering are either all the arrays, or the samples in the “Array list file” if it is specified. When a “Filter genes” gene list is used for clustering, it is often desirable to use the same “Array list file” that was used in filtering genes for both gene clustering and sample clustering. This is an unsupervised sample clustering, since the genes are selected by large variation across samples and the sample group information is not used. When one specifies a “Compare samples” gene list generated using only a subset of samples, it is often desirable to specify and order only the relevant samples in the “Array list file” and view them without sample clustering. In this case the main interest lies in viewing the genes obtained by comparison, and one can often get good sample clustering since the genes are selected using the sample group information. It is also interesting to cluster the samples used for selecting genes together with samples not used for selecting genes (e.g. samples from an independent study); one can then predict the group membership of the latter samples.

 

The default clustering algorithm for genes is as follows. The distance between two genes is defined as 1 - r, where r is the Pearson correlation coefficient between the standardized expression values (scaled to mean 0 and standard deviation 1) of the two genes across the samples used. The two genes with the smallest distance are merged first into a super-gene, connected by branches whose length represents their distance, and are then excluded from subsequent merging events. The expression values of the newly formed super-gene are the average of the standardized expression values of the two genes across samples (centroid linkage). Then the next pair of genes (or super-genes) with the smallest distance is chosen to merge, and the process continues until n - 1 merging events have joined all n genes. A similar procedure is used to cluster samples. These standardization and clustering methods follow Golub et al. 1999 and Eisen et al. 1998. Centroid linkage can produce branch inversion when the distance between two clusters is smaller than the height of either cluster; in that case dChip sets the distance to the larger of the two heights. This prevents branch inversion in the visualization, but further distance computation is still based on the averaged profile.
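The merging procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not dChip's code: the function names are ours, the standard deviation is the population form, the merged profile is the plain average of its two children, and the correction for branch inversion is left out.

```python
from statistics import mean, pstdev

def standardize(row):
    """Scale one gene's values to mean 0 and standard deviation 1."""
    m, s = mean(row), pstdev(row)
    return [(x - m) / s for x in row]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def centroid_cluster(rows):
    """Return the merge events as (node_a, node_b, distance, new_node_id)."""
    profiles = {i: standardize(r) for i, r in enumerate(rows)}
    merges = []
    next_id = len(rows)
    while len(profiles) > 1:
        ids = sorted(profiles)
        # Pick the pair of current nodes with the smallest 1 - r distance.
        a, b = min(
            ((i, j) for k, i in enumerate(ids) for j in ids[k + 1:]),
            key=lambda p: 1 - pearson(profiles[p[0]], profiles[p[1]]),
        )
        dist = 1 - pearson(profiles[a], profiles[b])
        # The super-gene profile is the average of its two children.
        merged = [(x + y) / 2 for x, y in zip(profiles.pop(a), profiles.pop(b))]
        profiles[next_id] = merged
        merges.append((a, b, dist, next_id))
        next_id += 1
    return merges

# Four hypothetical genes measured in four samples.
genes = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1.5], [8, 6, 4, 1]]
merges = centroid_cluster(genes)
for event in merges:
    print(event)
```

With these four genes, three merge events are produced: the two perfectly correlated rising genes are joined first, then the two falling genes, and finally the two resulting super-genes.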

 

One may choose the alternative “Distance metric” 1 - |r| (where r is the correlation coefficient) as the distance measure. This is useful if we want negatively correlated genes to be clustered together. The “Average linkage” method can also be specified, where the distance between two gene clusters (super-genes) is the average of all pair-wise distances between a gene in one cluster and a gene in the other. Tao Shi has observed that dChip produces the same clustering result as the R function hclust (using 1 - correlation matrix of row-wise standardized expression values) when the average linkage is used, but not when the centroid linkage is used.
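To make the two options concrete, here is a small Python sketch of the 1 - |r| distance and of the average-linkage distance between two clusters. The function names and the `absolute` flag are illustrative assumptions and do not correspond to dChip option names.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def distance(a, b, absolute=False):
    """1 - r by default, or 1 - |r| when absolute is True."""
    r = pearson(a, b)
    return 1 - abs(r) if absolute else 1 - r

def average_linkage(cluster_a, cluster_b, absolute=False):
    """Average of all pair-wise distances between genes of two different clusters."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    return sum(distance(x, y, absolute) for x, y in pairs) / len(pairs)

rising = [1, 2, 3, 4]
falling = [4, 3, 2, 1]
print(distance(rising, falling))                 # far apart under 1 - r
print(distance(rising, falling, absolute=True))  # close under 1 - |r|
print(average_linkage([rising], [falling, [1, 3, 2, 4]]))
```

A perfectly negatively correlated pair has the largest possible distance under 1 - r but a distance of about zero under 1 - |r|, which is why the second metric groups such genes together.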

 

Click “Options” (or “Tools/Options/Clustering”) to specify additional clustering options:

 

 

We can choose to cluster samples as well as genes. Uncheck the “Cluster genes” button to cluster samples without clustering genes; this is useful if genes need to be kept in a particular order when clustering samples. The option “Only draw lines for standard separator” (moved to the “Tools/Array list file” dialog for V1.2+) is discussed in the section “Array list file”.

 

Before clustering, the expression values for a gene across all samples are standardized (linearly scaled) to have mean 0 and standard deviation 1, and these standardized values are used to calculate correlations between genes and between samples and serve as the basis for merging nodes. If the scale of the data is already adjusted, one may choose not to standardize a gene’s expression values across samples by unchecking the “Standardize rows” option. By default the samples are clustered using the row-wise standardized or un-standardized values. One can check “Standardize columns” to standardize the raw expression data column-wise for sample clustering. Since the raw expression values are comparable row-wise but not column-wise, the column-wise standardization may not be meaningful when different genes have different magnitudes of expression values. A user is advised to cluster samples both with and without “Standardize columns” checked to judge which option yields the more reasonable sample clustering.
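The difference between the two standardization directions can be seen on a small matrix. The Python sketch below is illustrative only; the function names are ours and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def standardize(values):
    """Linearly scale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def standardize_rows(matrix):
    """Standardize each gene (row) across samples."""
    return [standardize(row) for row in matrix]

def standardize_columns(matrix):
    """Standardize each sample (column) across genes."""
    columns = [standardize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

# Three hypothetical genes (rows) in three samples (columns).
raw = [[100, 200, 300], [10, 30, 20], [5000, 4000, 6000]]
by_row = standardize_rows(raw)
by_column = standardize_columns(raw)
print(by_row)
print(by_column)
```

In the column-wise result the gene with values in the thousands dominates every sample, which illustrates why column-wise standardization may not be meaningful when genes differ widely in magnitude.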

 

If “Tools/Analysis/Treat outlier expression as missing values” is checked, expression values called “array-outlier” are ignored when computing correlations, and their data points are displayed as black (Blue/Red coloring) or white (Green/Red coloring) boxes in the clustering picture.

 

If the number of genes is large (e.g. 10,000), dChip may report “out of memory” or perform slowly, since storing all the pair-wise distances requires too much memory and may cause virtual-memory swapping. The solution is to uncheck the “Tools/Options/Clustering/Pre-calculate distances” button to calculate the pair-wise distances between genes on the fly.
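A back-of-the-envelope calculation shows why pre-calculated distances become expensive. The Python sketch below counts the gene pairs and converts the count into megabytes; the figure of 4 bytes per stored distance and the assumption that each pair is stored once are ours, since the storage format is not stated here.

```python
def pairwise_distance_count(n_genes):
    """Number of distinct gene pairs, n(n-1)/2."""
    return n_genes * (n_genes - 1) // 2

def distance_memory_mb(n_genes, bytes_per_value=4):
    """Approximate storage for all pair-wise distances, in megabytes (10^6 bytes)."""
    return pairwise_distance_count(n_genes) * bytes_per_value / 1e6

for n in (191, 1000, 10000):
    print(n, pairwise_distance_count(n), round(distance_memory_mb(n), 2))
```

Under these assumptions 10,000 genes need about 50 million distances, or roughly 200 MB, while the 191 genes of the example output need only 18,145.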

 

Click “OK” to start clustering, and select “Analysis/Stop Analysis” or press “ESC” to stop the ongoing analysis. When the analysis finishes, the clustering picture is displayed immediately. Click the “Analysis” icon on the left to view the analysis output, such as the following:

 

{Hierarchical clustering

  Treat 24 arrays as 24 experiments

 

  Read in genes listed in file D:\array\out\iglehart filtered gene.xls...

    Found 191 genes

 

  Begin clustering...

    Calcuate distance 190

    Merge event 189

    Calcuate distance 20

    Merge event 16

 

  Finding significant functional clusters...

    Found 6 chaperone genes in a 49-cluster (all: 61/5009, PValue: 9.84e-013)

    Found 10 structural protein genes in a 47-cluster (all: 361/5009, PValue: 9.58e-005)

    Found 8 extracellular genes in a 29-cluster (all: 400/5009, PValue: 4.93e-005)

 

Finished in 00 hours 01 minutes}

 

Here 191 genes are selected for clustering. dChip also automatically searches for functionally significant clusters in the resulting clustering tree.

 

Clustering image

 

Click the “Clustering” icon on the left to display the “Clustering View” (Data courtesy of Dan Tang):

 

 

In the clustering picture each row represents a gene and each column represents a sample. The gene clustering tree is on the left, and genes close to each other have high similarity in their standardized expression values across the 24 samples. The sample clustering tree is on the top. Click anywhere in the right pane to activate the “Cluster View”. Arrow keys enlarge or reduce the size of the clustering picture, Control+Arrow keys change the size of the clustering trees, and Shift+Arrow keys adjust other aspects such as the height of sample information blocks. Use the option “Tools/Options/Clustering/Sample names always visible” to keep the sample names visible at the top when scrolling vertically.

 

At the bottom of the clustering picture is the color scale: red represents an expression level above the mean expression of a gene across all samples, white represents the mean expression, and blue represents expression lower than the mean. Since the expression levels for each gene are standardized to have mean 0 and standard deviation 1, the standardized expression values most likely fall within [-3, 3]. By default, dChip uses pure white to represent 0, pure red to represent 3 or higher, and pure blue to represent -3 or lower (Golub et al. 1999). This “displaying range of standardized values” (3) can be changed in the “Tools/Options/Clustering” dialog. Select menu “Tools/Options/Clustering/Use traditional red/black/green coloring scheme” to use the coloring scheme adopted by the TreeView software (Eisen et al. 1998). The height of the color scale can be adjusted by Shift+Up/Down arrows.
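The default blue/white/red scale can be mimicked with a simple mapping from a standardized value to an RGB triple. The Python sketch below assumes linear interpolation between the end points, which the text does not specify; the function name and the `display_range` parameter are illustrative.

```python
def value_to_rgb(z, display_range=3.0):
    """Map a standardized value to (R, G, B): blue below 0, white at 0, red above 0."""
    # Values beyond the display range are drawn in the pure end color.
    t = max(-1.0, min(1.0, z / display_range))
    fade = int(round(255 * (1 - abs(t))))
    if t >= 0:
        return (255, fade, fade)
    return (fade, fade, 255)

for z in (-4, -3, 0, 3, 5):
    print(z, value_to_rgb(z))
```

Changing `display_range` plays the role of the “displaying range of standardized values” option: a smaller range saturates the colors sooner.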

 

Clicking inside the clustering picture highlights a data point with a surrounding blinking square:

 

 

The array name, probe set name with absolute call, the standardized value with original expression value and standard error for this probe set are displayed in the status bar on the bottom. Zooming in with arrow keys when a data point is highlighted will always place this data point in the center of the viewable picture area. To deselect an active data point, press “ESC” key or select the menu “Clustering/Selected Branches/Clear”. Gene names or descriptions are displayed on the right side; use the button “Tools/Options/Clustering/Gene descriptions from LocusLink when available” to toggle between displaying only Affymetrix descriptions or a mixture of LocusLink name (when available) and Affymetrix descriptions. If the gene descriptions are truncated on the right, use “Control+Right Arrow” to widen the gene clustering tree as well as the gene description area.

 

If an Internet browser is properly set up, the menu “View/Online Database” will start the browser to access the database page for the current probe set. Linking to online resources may not work on some computers. In that case, check "Tools/Options/Analysis/Show online link dialog" to show a dialog containing the web address, which is also automatically copied to the clipboard; one can then manually paste it into the address bar of the Internet browser.

 

One can also right-click a non-gene node in the clustering tree to exchange the positions of its two branches, in order to interactively adjust the ordering of genes and samples in the clustering picture. This changes the visual perception of the clustering picture but not the clustering result. The original order is determined so that the tighter child cluster (the one with the smaller distance at its final merging event) is on the top. There is also research work on determining “optimal” orders of leaf nodes in a clustering by some criterion (e.g. a gene’s peak expression time during a time course), but in principle all the 2^(N-1) orderings of a tree with N leaves are equivalent.
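The count 2^(N-1) follows because a binary tree with N leaves has N - 1 internal nodes, and the two branches of each can be exchanged independently. The following Python sketch enumerates the leaf orders of a small tree to illustrate this; the nested-tuple representation of the tree is our own and is unrelated to the dChip tree file format.

```python
def leaf_orders(tree):
    """List every leaf order obtainable by swapping branches at internal nodes."""
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orders = []
    for l in leaf_orders(left):
        for r in leaf_orders(right):
            orders.append(l + r)  # branches in their current positions
            orders.append(r + l)  # branches exchanged
    return orders

# A hypothetical tree with N = 4 leaves and therefore 3 internal nodes.
tree = ((("g1", "g2"), "g3"), "g4")
orders = leaf_orders(tree)
print(len(orders))
for order in orders:
    print(order)
```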

 

Select “View/Next View” or press “Enter” to go to other views such as “PM/MM Data” or “CEL Image” to look at the probe-level data of the current expression data point, and press “Enter” again to come back to the clustering picture. This is useful if one observes unusual data points in the clustering picture (such as large negative expression values) and wants to check the probe-level data. For example, sometimes we may see bright strips of genes with very high or low expression values in particular samples. We should usually be cautious about such samples, since this may imply that normalization did not make such an array comparable to the others. It is then good to click an extreme red/blue data point in the “Clustering view” and use “View/PM/MM data” to check the probe-level data. This way we can confirm whether a high- or low-valued data point is real or due to noise or an outlier.

 

Selected gene branch

 

Clicking any node in the gene or sample clustering tree will highlight the corresponding clustering branch in blue. Some items in the “Clustering” menu are activated only after this. Checking “Tools/Options/Clustering/Averaged gene profile pattern” will display a profile plot for the selected gene cluster:

 

 

The Y-axis has the same range as the color scale ([-3, 3] by default) on the bottom of the clustering picture. The value of the profile curve for each sample is the average of the standardized expression values of all selected genes in this sample (standardization is a linear scaling for each gene so its expression values across all samples have mean 0 and standard deviation 1). The error bar extends 1 standard deviation (of the selected genes’ standardized expression values in a sample) on both sides. A shorter error bar indicates tighter clustering of genes at the corresponding sample. In V1.3, if only a single gene is selected in the branch, the Y-axis range of the profile plot is from 0 to the maximal raw expression data of this gene across the samples. Thus the relative fold changes of this gene across samples can be inferred from the plot.
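The profile curve and its error bars can be computed as in the Python sketch below, given the standardized values of the selected genes. The function name is ours, and the use of the population standard deviation is an assumption, since the text does not say which form is used.

```python
from statistics import mean, pstdev

def cluster_profile(standardized_rows):
    """Return (mean, standard deviation) per sample for the selected genes."""
    columns = list(zip(*standardized_rows))
    return [(mean(col), pstdev(col)) for col in columns]

# Three hypothetical selected genes, already standardized, in four samples.
selected = [
    [-1.2, -0.4, 0.3, 1.3],
    [-1.0, -0.6, 0.5, 1.1],
    [-1.4, -0.2, 0.4, 1.2],
]
for sample_index, (center, spread) in enumerate(cluster_profile(selected)):
    # The error bar would extend from center - spread to center + spread.
    print(sample_index, round(center, 3), round(spread, 3))
```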

 

In the menu “Clustering”, we can clear the selected genes or select all genes, as well as delete the selected gene clusters and redo clustering using the rest of genes. Select “Clustering/Export branch image” (check "BMP" format) to export or copy the clustering image of the selected main gene cluster outlined by blue lines; however, the sample clustering tree is not attached to this image.

 

Select the “Clustering/Export branch data” menu to export the raw or gene-wise standardized expression data of the selected branch. In the latter case the data of the averaged profile will also be exported. If no sample branches are selected, expression values for all samples will be exported; otherwise only the data of the selected sample branch will be exported. The exported file can be used as the “gene list file” in the “Analysis/Hierarchical Clustering” dialog to perform clustering using only this subset of genes. In V1.3+, if “Clustering/Export branch data/Cut the tree at the height of current branch and export all branches” is checked, one may export gene expression data grouped in clusters. These clusters are obtained by cutting the gene clustering tree at the height of the selected blue branch.

 

Use Control+Click to select and color multiple gene or sample clusters. The multiple colored clusters can be exported (sample branches only) or deleted (gene branches only) by the functions in the menu “Clustering”. In contrast, a plain click selects the main gene cluster (outlined by blue lines), which is the one used by the functions described above and when resampling clusters.

 

The "Clustering/Similar Profile" function can search for and export genes with high positive or negative correlations across samples with the currently highlighted gene or gene branch. The resultant list can be used as the "gene list file" in the "Analysis/Hierarchical Clustering" dialog to view these genes.

 

dChip also provides a resampling method to assess the reliability of clusters by using the standard errors for expression values (Li and Wong 2001b page 6, section “Standard errors help to assess clustering results”). We resample each expression value from a Normal distribution with mean equal to the estimated expression value and standard deviation equal to the attached standard error. Click a tree branch to highlight a gene or sample cluster in blue, then select menu “Clustering/Resample once” to resample all the data points and redo the clustering; select “Clustering/Go to original” to go back to the original clustering.
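One round of resampling can be sketched in Python as follows: each value is redrawn from a Normal distribution centered at the estimate with the standard error as its standard deviation. The function name, the data and the explicit random generator are illustrative assumptions; the clustering itself would then be rerun on the resampled matrix.

```python
import random

def resample(values, std_errors, rng):
    """Draw each expression value from Normal(mean=value, sd=standard error)."""
    return [
        [rng.gauss(v, se) for v, se in zip(value_row, se_row)]
        for value_row, se_row in zip(values, std_errors)
    ]

# Two hypothetical genes in three samples, with attached standard errors.
values = [[520.0, 610.0, 480.0], [35.0, 90.0, 60.0]]
std_errors = [[20.0, 25.0, 18.0], [15.0, 30.0, 22.0]]
rng = random.Random(2001)  # fixed seed so the run is repeatable
resampled = resample(values, std_errors, rng)
print(resampled)
```

Values with large standard errors relative to their level move more between rounds, so clusters that depend on them are less likely to reappear after resampling.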

 

Significant sample cluster

 

Similarly, during sample clustering, the sample information specified in the “Sample information file” is used to find sample clusters enriched in samples of a certain description (Data courtesy of Andrea Richardson):

 

 

The sample cluster p-values are calculated with regard to the samples in the “Array list file” (or, if no “Array list file” is specified, all the arrays in the group). By default, clusters with p-value < 0.05 are reported; the threshold can be set at “Tools/Options/Clustering”. Note that the p-values for gene clusters are calculated with regard to all the genes on the array, since a gene list is filtered or otherwise selected from them; for sample clusters, however, the samples not in the “Array list file” may not be of interest.
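This page does not state which test produces these p-values. As one common way to score such enrichment, the Python sketch below computes a hypergeometric tail probability; treating this as the test is purely an assumption made for illustration, and the sketch is not claimed to reproduce the p-values in the example output above. The function and parameter names are ours.

```python
from math import comb

def enrichment_p_value(hits, cluster_size, category_size, total):
    """P(at least `hits` category members in a random cluster of `cluster_size`)."""
    upper = min(cluster_size, category_size)
    favorable = sum(
        comb(category_size, i) * comb(total - category_size, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(total, cluster_size)

# Hypothetical sample cluster: 5 of its 6 samples carry a label that
# 8 of the 24 samples in the array list share.
print(enrichment_p_value(hits=5, cluster_size=6, category_size=8, total=24))
```

The choice of `total` is where the distinction made above enters: the number of samples in the “Array list file” for sample clusters, and the number of genes on the array for gene clusters.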

 

Currently, each sample description column in the “sample information file” is limited to 10 discrete categories. The color of the sample category boxes can be set by “Control+Click”, and the height of the sample information blocks can be adjusted by Shift+Up/Down arrows.

 

[New] In the clustering or chromosome view, set “Options/Clustering/Number of letters shown for sample information” to be greater than 1 to display 1 or more letters above samples, overlaying on the color blocks representing sample categories.

Save clustering tree

Select the menu “Clustering/Save Tree” to save the structural information of the clustering tree. The file can be used as “Analysis/Hierarchical Clustering/Gene list or tree file” later to avoid repeating the clustering computation. In the tree file, each gene or super-gene has the IDs of its parent and children, its weight (how many genes are included in this branch), and the distance at which its two children were merged. The sample tree information follows the gene tree information. Thus it is also possible to perform the clustering in other software such as S-PLUS or R, export the result into a file in the dChip tree file format (apply “Clustering/Save Tree” once to see the format), and then view the clustering in dChip after reading in the data using “Open group” or “Get external data”.

Export clustering image

The clustering image can be exported by the “View/Export Image” menu. However, sometimes “View/Export image” does not produce any output file, or the exported image is altered or incomplete. This is an unfixed bug. One may try one of the following to get around the problem:

· Save the file as a different format (JPG, BMP or EMF)

· Use the Arrow or Control+Arrow keys to make the image smaller and then run “View/Export image”.

· Use the "PrintScreen" key (at the upper-right corner of the keyboard) to copy the whole screen image and then paste into the Microsoft Paint software; if necessary do this several times to compose the whole image.

· Change to a different Windows platform or a PC with larger memory.

 

Another known problem on Windows 95/98 is that after zooming in or out the clustering image many times, the clustering tree may disappear. One may restart dChip in this case, or upgrade the Windows system to a newer version.

 

Remove irrelevant genes from clustering analysis

 

Often it is desirable to exclude some genes from the clustering analysis. For example, MHC and immunoglobulin family genes vary for reasons that are irrelevant to the experiments and analysis of clonal B-cell populations (Bradley Messmer, pers. comm.).

 

One can make a gene list file (a text file containing one probe set name on each line) of, for example, collagen genes, by using “Tools/Gene list file/By keywords” or a literature search; then, in the “Analysis/Filter genes” or “Analysis/Compare samples” dialog, specify it as the “Filter on gene list: excluding …” file. The resultant filtering or comparison gene list will not contain these genes.

 

To remove a list of probe sets completely from the array, specify the gene list as the “Analysis/Open group/other information/probe mask file” so that the CDF.bin file no longer contains these genes.

 

Lastly, if an undesired gene cluster is seen in the clustering picture, one can click to highlight this cluster and use “Cluster/Delete selected gene” to exclude these genes so that they do not affect the sample clustering.