dChip: Chromosome regions and clustering using SNP data

 

Export averaged region data                           Cluster samples                      

 

Specify and view chromosome regions

 

A chromosome region file may be specified at "Analysis/Chromosome" to view only the markers in these regions. First make or download a text-format chromosome region file (in a format similar to a cytoband file). For example, cancer gene census hg17.xls uses the known cancer genes as regions (suggested by Rameen Beroukhim, based on the Cancer Gene Census database, 9/21/05 version). One can also specify a cytoband file to view and cluster cytobands as regions. As another example, this region file will display the SNP or tiling probe sets that fall in the corresponding regions, so that we can focus on regions of various sizes containing the RB gene.

 

[V4/16/07+] After using "Analysis/Chromosome" to obtain inferred LOH and copy number data, select "Chromosome/Region & Clustering":

 

Reference gene or refFlat files can be used to regard each gene as a region, by checking "Use refGene file as region file". Each gene's transcription starting and ending site will be extended by 5 Kb to define a gene's chromosome region.
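
The 5 Kb extension described above can be illustrated with a minimal Python sketch. This is not dChip's own code: the function name, the tuple layout, the example coordinates and the clamping at position 0 are all assumptions made for illustration.

```python
FLANK_BP = 5_000  # 5 Kb added on each side of the transcript, as described above


def gene_to_region(name, chrom, tx_start, tx_end, flank=FLANK_BP):
    """Return (name, chrom, start, end) with the transcript extended by `flank` bp.

    Clamping the start at 0 is an assumption of this sketch, not documented behavior.
    """
    start = max(0, tx_start - flank)
    end = tx_end + flank
    return (name, chrom, start, end)


# Hypothetical gene record; the coordinates are made up for illustration.
region = gene_to_region("GENE_A", "chr13", 47_000_000, 47_180_000)
print(region)
```

Each gene in a reference gene file would yield one such region, which then plays the same role as a line in a chromosome region file.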

 

SNP markers in a region will be averaged to obtain the region's data of inferred copy number (log2 ratio) and probability of LOH, and these data can be used to filter regions. Only the regions containing more than one marker or passing the filtering criteria will be displayed or used in chromosome region clustering.
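
As a rough illustration of the averaging step, the sketch below computes the mean value of the markers that fall inside one region for one sample. The data layout (a list of `(chrom, position, value)` tuples) and the function name are hypothetical; the values could be inferred log2 ratios or LOH probabilities.

```python
from statistics import mean


def region_average(markers, chrom, start, end):
    """Average the values of markers located in [start, end] on `chrom`.

    markers: list of (chrom, position, value) tuples for one sample.
    Returns (number_of_markers, mean_value), with mean_value None if no marker falls inside.
    """
    values = [v for c, pos, v in markers if c == chrom and start <= pos <= end]
    return len(values), (mean(values) if values else None)


# Made-up marker values for one sample.
markers = [
    ("chr13", 100, 0.2),
    ("chr13", 200, 0.4),
    ("chr13", 900, -1.0),
    ("chr5", 150, 0.9),
]
n_markers, avg = region_average(markers, "chr13", 50, 250)
print(n_markers, avg)
```

The marker count returned alongside the mean makes it straightforward to apply a minimum-marker rule or a value-based filter afterwards.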

 

[Versions before 4/15/07] The chromosome region file should be specified at "Analysis/Chromosome". Specify a refFlat file at "Analysis/Chromosome/Reference gene file", and check "In refFlat file format". If "Use as region file" is checked, genes will be the display unit instead of chromosomes. The Home and End keys move to another gene; on the left, black bands represent exon regions and white bands represent intron regions. Only the SNPs or genes in the exon-level region are displayed in the view. This view is more useful for the latest SNP, exon or tiling arrays with more markers. (Suggested by Bill Sellers, Rameen Beroukhim and Tom Look, data courtesy of Rani George)

 

Export averaged region data

 

[V8/12/07+] Inferred copy number and probability (LOH) can be averaged over the SNPs in a chromosome region. The averaged region data can be used for region filtering or exported. Filtering criterion A or B in the dialog above is used when the current chromosome view is copy number or LOH, respectively. In this example exported region data file, the reference gene file specified at "Analysis/Chromosome" is used as the chromosome region file, and the column named "% Sample satisfying A or B" can be sorted to identify genes with a high percentage of copy number alterations across samples.
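
The sketch below shows one way a "% Sample satisfying A or B" style column could be computed and sorted. The actual criteria A and B are set in the dChip dialog and are not reproduced here; the predicate used (averaged value of at least 0.5), the gene names and the data are hypothetical.

```python
def percent_samples_satisfying(region_values, predicate):
    """Percentage of samples whose averaged region value satisfies `predicate`."""
    hits = sum(1 for v in region_values if predicate(v))
    return 100.0 * hits / len(region_values)


def is_altered(value):
    # Hypothetical stand-in for criterion A or B; the threshold is an assumption.
    return value >= 0.5


# Made-up averaged region data: one list of per-sample values per gene.
region_data = {
    "GENE_A": [0.1, 0.8, 0.9, 0.7],
    "GENE_B": [0.0, 0.1, 0.6, 0.0],
}

table = [(gene, percent_samples_satisfying(vals, is_altered))
         for gene, vals in region_data.items()]
table.sort(key=lambda row: row[1], reverse=True)  # most frequently altered first
for gene, pct in table:
    print(gene, pct)
```

Sorting the exported file by this column in a spreadsheet achieves the same ranking.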

 

Cluster samples using SNP data

 

LOH or copy number data from SNP array can be used to cluster tumor samples (Garraway et al. 2005; Janne et al. 2004; Koed et al. 2005; Lieberfarb et al. 2003; Lin et al. 2004) or SNP markers (Girard et al. 2000).

 

[V11/6/05+] See Lin et al. 2004 and Janne et al. 2004 for references. To perform sample clustering, first run "Analysis/Chromosome"; then, in the “Chromosome” view, select the menu items “Chromosome/Show all” and “Chromosome/Clustering”. Set “Options/chromosome/min, max, threshold” to 0, 0.5, 0.25 initially, and use the Shift+Left/Right keys to adjust the red threshold line, so that samples are clustered using the chromosome regions whose LOH score exceeds the threshold.


In the clustering figure below, the LOH score is plotted in blue on the right side of the LOH data picture. A high LOH score indicates that many samples have LOH events in the nearby region. As the score threshold line (in red) is adjusted, the markers or genes in the chromosome regions with an LOH score exceeding the threshold are colored blue. Only the SNP markers in the regions with an LOH score above the threshold are used for sample clustering. The distance between two samples is defined as the average absolute difference of the Probability (LOH) in the two samples over these markers. Average linkage is used during hierarchical clustering. Intuitively, if two samples often have LOH or retention together in the selected chromosome regions, they will cluster closely.
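
The distance and linkage described above can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not dChip's code: the LOH scores are taken as given inputs, and the sample names, data values and function names are hypothetical.

```python
def select_markers(loh_score, threshold):
    """Indices of markers whose LOH score exceeds the threshold."""
    return [i for i, s in enumerate(loh_score) if s > threshold]


def loh_distance(a, b, marker_idx):
    """Average absolute difference of Probability(LOH) over the selected markers."""
    return sum(abs(a[i] - b[i]) for i in marker_idx) / len(marker_idx)


def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    dist maps frozenset({x, y}) to the distance between samples x and y.
    Returns the merges in order as (left_members, right_members, height).
    """
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[frozenset((x, y))]
                         for x in clusters[i] for y in clusters[j]]
                d = sum(pairs) / len(pairs)  # mean over all cross-cluster pairs
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up data: one LOH score per marker, one Probability(LOH) vector per sample.
loh_score = [0.1, 0.6, 0.7, 0.2]
prob_loh = {
    "S1": [0.0, 0.9, 1.0, 0.0],
    "S2": [1.0, 0.8, 0.9, 1.0],
    "S3": [0.0, 0.1, 0.0, 0.0],
}

idx = select_markers(loh_score, threshold=0.25)
names = sorted(prob_loh)
dist = {frozenset((x, y)): loh_distance(prob_loh[x], prob_loh[y], idx)
        for x in names for y in names if x < y}
merges = average_linkage(names, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy data, S1 and S2 differ strongly at the two low-score markers, but those markers are excluded by the threshold, so the two samples still merge first.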

 

 

By changing the LOH score threshold from 0.01 to 1.00, we can quickly step through 100 sample clustering trees (equivalent to gene filtering in the clustering analysis of expression arrays). The sample type information at the top can be hidden at first (use the Shift+Up Arrow key in the non-proportional view) until, at a particular threshold, good separation is seen in the sample clustering tree at the top. The sample information can then be brought up to correlate with the clustering results. Alternatively, sample information can be used to determine the LOH score threshold used for clustering. If, at a particular threshold, the sample clustering agrees well with known cancer types or other clinical variables, the chromosome regions selected by this threshold (and thus used in the clustering) may contain LOH differences that distinguish the sample subgroups.
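
The threshold sweep can be pictured with the short sketch below, which lists how many markers would enter the clustering at each of the 100 thresholds. The step size of 0.01 is inferred from the range and count given above; the LOH scores and function names are hypothetical.

```python
def thresholds(n=100):
    """Thresholds 0.01, 0.02, ..., 1.00 (assumed step of 1/n)."""
    return [k / n for k in range(1, n + 1)]


def markers_above(loh_score, threshold):
    """Indices of markers whose LOH score exceeds the threshold."""
    return [i for i, s in enumerate(loh_score) if s > threshold]


# Made-up LOH scores for four markers.
loh_score = [0.05, 0.30, 0.62, 0.90]

sweep = [(t, len(markers_above(loh_score, t))) for t in thresholds()]
for t, n_selected in sweep[::20]:
    print(t, n_selected)
```

At high thresholds few or no markers remain, so a script that clusters at every threshold would need to skip the settings where the selected marker set is empty.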

 

At the SNP genotype view, LOH view and copy number view, selecting “Chromosome/Cluster samples” will perform sample clustering using the data in that particular view. The distance metric between two samples is: 

 

Clustering samples by genotype can suggest that pairs of samples derive from the same ancestor: SNB-19/U251, NCI_Adr/OVCAR-8, M14/MDA-MB-435.

 

(Updated 8/12/07)