dChip: Sample classification by Linear Discriminant Analysis

 


 

LDA classification

 

Linear discriminant analysis (LDA) is a classical statistical approach for classifying samples of unknown class, based on training samples of known class. LDA has previously been applied to sample classification of microarray data; see Hakak et al. 01 for examples. Related links: Dudoit and Speed 02, LDA introduction.

 

dChip contains a simple LDA function that requires R to be installed (the LDA function works with R 1.6 but not R 1.7). Select the “Analysis/LDA Classification” menu to specify the sample classes and a list of genes.

 

 

The gene list file contains the genes used as features; a sample's feature vector consists of the expression values of these genes in that sample. If we want to use the three sample groups above to classify other, unknown samples, it is reasonable to use the “Analysis/Compare samples” function to obtain genes that are differentially expressed in pairwise comparisons between the groups. A simpler way is to use the “Analysis/Filter genes” function to obtain genes with large variation across all samples, but doing so may give less predictive power.
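The simpler, variation-based selection can be sketched as follows. This is only an illustration, not dChip's own filter: the function name, the coefficient-of-variation criterion (standard deviation divided by mean), the 0.5 threshold and the expression values are all assumptions made for the example.

```python
from statistics import mean, pstdev

def filter_genes_by_variation(expr, min_cv=0.5):
    """Return the genes whose coefficient of variation (SD / mean) across
    all samples is at least min_cv. Class labels are not used."""
    kept = []
    for gene, values in expr.items():
        m = mean(values)
        if m > 0 and pstdev(values) / m >= min_cv:
            kept.append(gene)
    return kept

# Hypothetical expression table: gene -> values across four samples.
expr = {
    "geneA": [100.0, 105.0, 98.0, 102.0],  # nearly constant
    "geneB": [50.0, 400.0, 60.0, 380.0],   # varies strongly
    "geneC": [10.0, 30.0, 12.0, 28.0],     # varies moderately
}

selected = filter_genes_by_variation(expr, min_cv=0.5)
print(selected)
```

Because this kind of filter ignores the class labels, it can keep genes that vary for reasons unrelated to the classes, which is one way to see why it may predict less well than genes chosen by comparing the groups.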

 

The samples used in the LDA analysis can be specified at “Tools/Array list file”. The known sample classes must be specified: select the samples belonging to the same class in the left “Sample” listbox and then click the “Add class” button to add a known class. Use the “Delete last” button to delete the most recently added class. Samples not added to any class are regarded as “unknown” samples, and their class labels are predicted after LDA is performed. The “LDA result file” stores the sample class specification as well as the classification result, so a class specification stored in an earlier “LDA result file” can be loaded by clicking the “Use” button.

 

Clicking the “OK” button starts R and calls its lda and predict.lda functions to train LDA on the known classes and predict the class labels of the unknown samples. The output is similar to the following (click the “Analysis” icon on the left to view the output):

 

{Sample classification by Linear Discriminant Analysis

  Treat 21 arrays as 21 experiments

 

  Found 3 classes in file D:\array\out\iglehart lda result.xls

 

  Read in genes listed in file D:\array\out\iglehart filtered gene.xls...

    Found 179 genes

 

  Obtaining data for 179 genes and 21 samples...

    Gene 150

 

  Writing LDA result in D:\array\out\iglehart lda result.xls...

 

 

  Prediction rate for samples with known class: 0.71 (10/14)

LDA finished}
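The training and prediction steps behind a report like the one above can be sketched in a few lines. This is a minimal illustration of textbook LDA (shared pooled covariance, priors equal to the training class proportions), not R's implementation and not dChip's code. The function names and the two-feature toy data are hypothetical, and the sketch assumes the pooled covariance matrix is invertible, which requires at least as many training samples (minus the number of classes) as features.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def train_lda(samples, labels):
    """Estimate class means, priors (class proportions) and pooled covariance."""
    classes = sorted(set(labels))
    p = len(samples[0])
    means, priors = {}, {}
    pooled = [[0.0] * p for _ in range(p)]
    for k in classes:
        rows = [x for x, y in zip(samples, labels) if y == k]
        mu = [sum(col) / len(rows) for col in zip(*rows)]
        means[k], priors[k] = mu, len(rows) / len(samples)
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for i in range(p):
                for j in range(p):
                    pooled[i][j] += d[i] * d[j]
    dof = len(samples) - len(classes)
    pooled = [[v / dof for v in row] for row in pooled]
    return classes, means, priors, pooled

def predict_lda(model, x):
    """Return the class with the largest linear discriminant score for x."""
    classes, means, priors, pooled = model
    def score(k):
        w = solve(pooled, means[k])  # pooled^-1 * mean_k
        return (sum(wi * xi for wi, xi in zip(w, x))
                - 0.5 * sum(wi * mi for wi, mi in zip(w, means[k]))
                + math.log(priors[k]))
    return max(classes, key=score)

# Hypothetical training samples (two features each) with known classes.
samples = [[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0],
           [6.0, 5.0], [7.0, 6.0], [7.0, 4.0], [8.0, 5.0]]
labels = ["class1"] * 4 + ["class2"] * 4

model = train_lda(samples, labels)
correct = sum(predict_lda(model, x) == y for x, y in zip(samples, labels))
rate = correct / len(samples)  # prediction rate on the samples with known class
print("Prediction rate for samples with known class:", rate)
print("Unknown sample [3.0, 3.0] ->", predict_lda(model, [3.0, 3.0]))
```

With many more genes than samples, as in the 179-gene example, the pooled covariance in this sketch would be singular, so it is meant only to show the logic on a small feature set.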

 

LD1 and LD2 are the first two linear discriminants. They map the samples with known class from the n-dimensional space (n is the number of genes) to the plane in such a way that the ratio of the between-group variance to the within-group variance is maximized. If there are only 2 known classes, only LD1 is meaningful, and LD2 is arbitrarily set to the order of the samples for visualization purposes. In the output above, LDA correctly predicted 71% (10 of 14) of the samples with known classes. A different gene list may give a different prediction rate. Note that cross-validation is not used here, and the prior probabilities are the class proportions among the training samples.
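The variance ratio can be written out as follows. The notation is an assumption introduced here, not taken from dChip: x_{ki} is the expression vector of sample i in class k, n_k is the size of class k, \bar{x}_k is the class mean and \bar{x} is the overall mean of the training samples.

```latex
\begin{aligned}
S_W &= \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^{T}
      && \text{(within-group scatter)} \\
S_B &= \sum_{k=1}^{K} n_k \, (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{T}
      && \text{(between-group scatter)} \\
w_1 &= \arg\max_{w} \; \frac{w^{T} S_B \, w}{w^{T} S_W \, w},
      \qquad \mathrm{LD1}(x) = w_1^{T} x
\end{aligned}
```

LD2 maximizes the same ratio subject to w_2^T S_W w_1 = 0. Since S_B is built from K class means around one overall mean, its rank is at most K - 1, so with K = 2 classes only one direction has nonzero between-group variance, which is why only LD1 is meaningful in that case.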

 

An “LDA” icon is added to the Navigation View on the left side; clicking the icon brings up the scatter plot of LD1 versus LD2 (a similar R plot is also generated).

 

 

The colors of the first four sample classes are blue, red, green and cyan; colors for further classes are generated randomly. The gray points represent unknown samples (labeled as 0 in the R plot), and their predicted class labels are indicated by a smaller rectangle inside. Use the Arrow or Control+Arrow keys to zoom and the Enter key to switch to other views. Select “View/Export Image” to export the picture.

 

Classification performance and cross-validation

 

Applying a trained classifier to its own training samples usually yields better performance than applying it to test (unknown) samples. The training samples determine which genes (or gene sets) best distinguish the training classes, so these genes are weighted more heavily in the trained classifier and the training samples are classified well.

 

An informal alternative is to first filter genes (Analysis/Filter genes) using the training samples, without using the class information, and then cluster all the samples using this set of genes. The sample clustering shows whether the classes separate and where the test samples fall. This gives a baseline that a properly trained classifier should at least match.

 

[V6/16/08+] At “Analysis/Classify samples”, you can specify sample groups and then perform leave-one-out cross-validation. Each time, one sample is excluded, and genes are filtered using “Analysis/ANOVA” (specify a factor there) or “Analysis/Filter genes” (specify the filtering criteria there and perform the filtering once before this step). The filtered genes are used to train an LDA classifier, which then predicts the class of the left-out sample. This filtering and prediction is repeated for every sample to obtain the cross-validation accuracy.
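The leave-one-out loop can be sketched as below. Two things in this sketch are assumptions rather than dChip's behavior: genes are ranked by a one-way ANOVA F statistic computed on the training fold, and a nearest-centroid rule is swapped in for LDA to keep the example short. All function names and the toy data are hypothetical. The point of the sketch is that gene selection is repeated inside every fold, so the left-out sample never influences which genes are chosen.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across labelled samples."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2:
        return 0.0
    grand = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))

def select_genes(samples, labels, top_k):
    """Indices of the top_k genes by F statistic, using the given samples only."""
    n_genes = len(samples[0])
    scores = [f_statistic([s[g] for s in samples], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_k]

def nearest_centroid(train, labels, genes, x):
    """Stand-in classifier (not LDA): pick the class whose mean is closest to x."""
    best, best_dist = None, float("inf")
    for k in sorted(set(labels)):
        rows = [s for s, y in zip(train, labels) if y == k]
        dist = sum((x[g] - sum(r[g] for r in rows) / len(rows)) ** 2 for g in genes)
        if dist < best_dist:
            best, best_dist = k, dist
    return best

def loocv_accuracy(samples, labels, top_k):
    """Leave each sample out once; filter genes and train without it."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, top_k)  # sample i is not used
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical data: 6 samples x 3 genes; only gene 0 separates the classes.
samples = [[1.0, 5.0, 3.0], [1.2, 4.0, 3.5], [0.9, 6.0, 2.5],
           [3.0, 5.5, 3.2], [3.1, 4.5, 2.8], [2.9, 5.0, 3.1]]
labels = ["A", "A", "A", "B", "B", "B"]

accuracy = loocv_accuracy(samples, labels, top_k=1)
print("Leave-one-out accuracy:", accuracy)
```

Selecting genes once on all samples and only then cross-validating the classifier would let the left-out sample leak into the gene list and tend to overstate the accuracy.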

 

Principal Component Analysis

 

Principal Component Analysis (PCA, Alter et al. 2000, Raychaudhuri et al. 2000) is a useful way to explore the naturally arising sample classes based on the expression profile. Check the button “Perform Principal Component Analysis instead” in the “Analysis/LDA Classification” dialog to perform PCA. R does not need to be installed for PCA.

 

The sample and gene specifications are similar to those for LDA, and all the samples in the left-side “Sample” box are used for PCA. The first two principal components (PCs) are plotted. The class labels are used to color the samples but do not enter the PCA, and samples of unknown class are colored black. Each PC is a linear combination of the expression values of all genes in the gene list. In effect, PCA maps the samples from N-dimensional space (N is the number of genes) to two dimensions, choosing the directions along which the samples vary, and therefore spread out, the most.
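The projection onto the first two PCs can be sketched with power iteration on the gene covariance matrix. This is one simple way to compute principal components and is not necessarily how dChip does it; the function names and the expression values are hypothetical, and for long gene lists a library routine based on the singular value decomposition would be the practical choice.

```python
import math

def center(samples):
    """Subtract each gene's mean across samples."""
    means = [sum(col) / len(samples) for col in zip(*samples)]
    return [[v - m for v, m in zip(row, means)] for row in samples]

def covariance(centered):
    """Gene-by-gene sample covariance matrix of centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenvector(matrix, iterations=500):
    """Unit eigenvector of the largest eigenvalue, found by power iteration."""
    p = len(matrix)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def first_two_pcs(samples):
    """Return PC1, PC2 (gene weights) and each sample's (PC1, PC2) scores."""
    x = center(samples)
    cov = covariance(x)
    p = len(cov)
    pc1 = leading_eigenvector(cov)
    lam1 = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(p) for j in range(p))
    # Deflation: remove the variance explained by PC1, then find the next direction.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(p)]
                for i in range(p)]
    pc2 = leading_eigenvector(deflated)
    scores = [(sum(a * b for a, b in zip(row, pc1)),
               sum(a * b for a, b in zip(row, pc2))) for row in x]
    return pc1, pc2, scores

# Hypothetical data: 4 samples x 3 genes; genes 0 and 1 vary together.
samples = [[2.0, 1.9, 0.5], [4.0, 4.1, 0.4], [6.0, 5.8, 0.6], [8.0, 8.2, 0.5]]

pc1, pc2, scores = first_two_pcs(samples)
for s in scores:
    print(s)  # coordinates of one sample in the PC1/PC2 plane
```

Class labels appear nowhere in this computation, matching the point that they are used only to color the plotted samples.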