LDA classification
See Hakak et al. 01 for examples. Linear discriminant analysis (LDA) is a
classical statistical approach for classifying samples of unknown class, based
on training samples with known classes. LDA has previously been applied to
sample classification of microarray data. Related links: Dudoit and Speed 02,
LDA introduction.
dChip contains a simple LDA analysis function that requires R to be installed
(the LDA function works with R 1.6 but not R 1.7). Select the
“Analysis/LDA Classification” menu to specify the sample classes and a list of
genes:

The gene list file contains the genes used as features; the expression values
of these genes in a sample form the feature vector of that sample. If we want
to use the three sample groups above to classify other unknown samples, it is
reasonable to use the “Analysis/Compare samples” function to obtain genes that
are differentially expressed in pairwise comparisons between the groups. A
simpler way is to use the “Analysis/Filter genes” function to obtain genes with
large variation across all samples, but doing this may not give as good
predictive power.
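As an illustration of the simpler, class-blind filtering idea, the sketch below ranks genes by their variance across all samples and keeps the top few. It is not dChip's own filter: the function name top_variable_genes, the use of plain variance as the variation measure, and the toy expression values are all assumptions made for this example.

```python
from statistics import pvariance

def top_variable_genes(expr, k):
    """Return the k gene names with the largest variance across samples.

    expr maps a gene name to its list of expression values, one per sample.
    Class labels are not used, so this is an unsupervised filter.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:k]

# Hypothetical expression values for three genes in four samples.
expr = {
    "geneA": [5.0, 5.1, 4.9, 5.0],   # almost constant
    "geneB": [1.0, 9.0, 2.0, 8.0],   # varies strongly
    "geneC": [3.0, 4.0, 3.5, 6.0],   # varies moderately
}
selected = top_variable_genes(expr, 2)
print(selected)
```

Because the class labels never enter the ranking, a gene can vary a lot without separating the classes, which is why such a list may predict less well than one built from sample comparisons.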
The samples used in the LDA analysis can be specified at “Tools/Array list
file”. The known sample classes need to be specified: select the samples
belonging to the same class in the left “Sample” listbox and then click the
“Add class” button to add a known class. Use the “Delete last” button to delete
the most recently added sample class. Samples not added to any class are
regarded as “unknown” samples, and their class labels are predicted after LDA
is performed. The “LDA result file” stores the LDA sample class specification
as well as the classification result, so a previous sample class specification
stored in an “LDA result file” can be loaded by clicking the “Use” button.
Clicking the “OK” button starts R and calls its lda and predict.lda functions
to perform the LDA training on the known classes and predict the class labels
of the unknown samples. The output is similar to the following (click the
“Analysis” icon on the left to view the output):
{Sample classification by Linear Discriminant Analysis
Treat 21 arrays as 21 experiments
Found 3 classes in file D:\array\out\iglehart lda result.xls
Read in genes listed in file D:\array\out\iglehart filtered gene.xls...
Found 179 genes
Obtaining data for 179 genes and 21 samples...
Gene 150
Writing LDA result in D:\array\out\iglehart lda result.xls...

Prediction rate for samples with known class: 0.71 (10/14)
LDA finished}
LD1 and LD2 are the first two linear discriminants. They map the samples with
known class from the n-dimensional space (n is the number of genes) to the
plane in such a way that the ratio of the between-group variance to the
within-group variance is maximized. If there are only 2 known classes, then
only LD1 is meaningful, and LD2 is arbitrarily set to the order of the samples
for visualization purposes. In the output above, LDA correctly predicted 71%
(10 of 14) of the samples with known classes. Selecting other gene lists may
produce different predictive power. (Note that cross-validation is not used
here, and the prior is the class proportion in the training samples.)
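To make the variance ratio concrete, the sketch below computes the between-group to within-group sum-of-squares ratio for samples projected onto a candidate direction. It only evaluates the criterion that a discriminant such as LD1 maximizes; it does not find the maximizing direction and is not the R lda routine. The names fisher_ratio and project, the two directions, and the toy data are assumptions for illustration.

```python
def project(sample, direction):
    """Score of one sample along a direction (a weighted sum of its genes)."""
    return sum(x * w for x, w in zip(sample, direction))

def fisher_ratio(scores, labels):
    """Between-group sum of squares divided by within-group sum of squares."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    grand_mean = sum(scores) / len(scores)
    between = 0.0
    within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        between += len(values) * (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in values)
    return between / within

# Hypothetical data: 6 samples, 2 genes; gene 1 separates the classes, gene 2 does not.
samples = [[1.0, 2.0], [1.2, 3.5], [0.8, 5.0],
           [3.0, 2.2], [3.1, 3.4], [2.9, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]

ratio_gene1 = fisher_ratio([project(s, [1.0, 0.0]) for s in samples], labels)
ratio_gene2 = fisher_ratio([project(s, [0.0, 1.0]) for s in samples], labels)
print(ratio_gene1, ratio_gene2)
```

A direction with a large ratio pulls the class means apart relative to the spread inside each class, which is what makes the classes look separated in the LD1 versus LD2 plot.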
An “LDA” icon is added to the Navigation View on the left side; clicking the
icon brings up the scatter plot of LD1 versus LD2 (a similar R plot is also
generated):

The colors of the first four sample classes are blue, red, green and cyan;
further colors are generated randomly. The gray points represent unknown
samples (labeled as 0 in the R plot), and their predicted class labels are
indicated by a smaller rectangle inside. Use the Arrow or Control+Arrow keys to
zoom and the Enter key to switch to other views. Select “View/Export Image” to
export the picture.
Classification performance and cross-validation
Applying a trained classifier to its own training samples usually yields better performance than applying it to the test (unknown) samples. The training samples determine which genes (or gene sets) best distinguish the training classes, so these genes are weighted more heavily in the trained classifier, which therefore classifies the training samples well.
One may try an informal way to do classification: first filter genes (Analysis/Filter genes) using the training samples (without using class information), then cluster all the samples using this set of genes. From such sample clustering you may see whether the two classes separate and how the test samples are assigned. This gives a baseline that a properly trained classifier should at least match.
[V6/16/08+]
At “Analysis/Classify samples”, you can specify sample groups and then perform
leave-one-out cross-validation. Each time, one sample is excluded and genes
are filtered using “Analysis/ANOVA” (specify a factor there) or
“Analysis/Filter genes” (specify the filtering criteria there and perform the
filtering once before this step). The filtered genes are used to train a
classifier using LDA, and the left-out sample is then predicted. This filtering
and prediction is repeated for every sample to obtain the cross-validation
accuracy.
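The loop structure of leave-one-out cross-validation can be sketched as follows. The important point is that the gene filtering is repeated inside the loop without the left-out sample. This is only an outline under stated assumptions: a nearest-centroid rule stands in for LDA to keep the example short, the filter is a simple top-k variance ranking, and the names select_genes, nearest_centroid and loocv_accuracy as well as the data are hypothetical.

```python
from statistics import pvariance

def select_genes(train, k):
    """Indices of the k genes with the largest variance in the training samples."""
    n_genes = len(train[0])
    variances = [pvariance([s[g] for s in train]) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: variances[g], reverse=True)[:k]

def nearest_centroid(train, train_labels, genes, sample):
    """Stand-in classifier (not LDA): pick the class with the closest centroid."""
    best_class, best_dist = None, float("inf")
    for cls in sorted(set(train_labels)):
        members = [s for s, lab in zip(train, train_labels) if lab == cls]
        dist = 0.0
        for g in genes:
            centroid = sum(m[g] for m in members) / len(members)
            dist += (sample[g] - centroid) ** 2
        if dist < best_dist:
            best_class, best_dist = cls, dist
    return best_class

def loocv_accuracy(samples, labels, k):
    """Leave each sample out once, refilter genes, retrain, and predict it."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, k)  # filtering never sees sample i
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical data: 6 samples, 3 genes; only the first gene differs by class.
samples = [[1.0, 5.0, 2.0], [1.2, 5.1, 2.1], [0.9, 4.9, 1.9],
           [4.0, 5.0, 2.0], [4.2, 5.2, 2.2], [3.9, 4.8, 1.8]]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(samples, labels, k=1)
print(accuracy)
```

Filtering once on all samples and then cross-validating only the classifier would let the left-out sample influence the gene list, which tends to make the accuracy estimate too optimistic.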
Principal Component Analysis (PCA; Alter et al. 2000, Raychaudhuri et al. 2000)
is a useful way to explore the naturally arising sample classes based on the
expression profiles. Check the “Perform Principal Component Analysis instead”
button in the “Analysis/LDA Classification” dialog to perform PCA. R does not
need to be installed for PCA.
The sample and gene specifications are similar to those of the LDA analysis,
and all the samples in the left-side “Sample” box are used for PCA. The first
two principal components (PCs) are plotted. The class labels are used to color
the samples but do not enter the PCA, and the samples of unknown class are
colored black. Each PC is a linear combination of the expression values of all
genes in the gene list. So in effect PCA maps the samples from the
N-dimensional space (N is the number of genes) to two dimensions, retaining as
much of the variance among the samples as possible.
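For readers who want to see the mapping spelled out, here is a minimal sketch that finds the first two principal components by power iteration on the gene covariance matrix and projects each sample onto them. This is one textbook way to compute PCA, not necessarily how dChip does it internally. All function names and the small data set are assumptions, and the fixed iteration count assumes the leading eigenvalues are well separated.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def covariance(samples):
    """Center each gene; return the centered data and the gene covariance matrix."""
    n, p = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    x = [[s[j] - means[j] for j in range(p)] for s in samples]
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)]
           for i in range(p)]
    return x, cov

def leading_eigvec(m, iters=500):
    """Power iteration: unit eigenvector for the largest eigenvalue of m."""
    v = [1.0 / (i + 1) for i in range(len(m))]  # arbitrary nonzero start
    for _ in range(iters):
        w = matvec(m, v)
        norm = dot(w, w) ** 0.5
        v = [c / norm for c in w]
    return v

def first_two_pcs(samples):
    x, cov = covariance(samples)
    pc1 = leading_eigvec(cov)
    lam1 = dot(pc1, matvec(cov, pc1))  # variance captured by PC1
    p = len(pc1)
    # Remove the PC1 direction so the next power iteration finds PC2.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(p)]
                for i in range(p)]
    pc2 = leading_eigvec(deflated)
    scores = [(dot(row, pc1), dot(row, pc2)) for row in x]
    return pc1, pc2, scores

# Hypothetical data: 5 samples (rows) by 3 genes (columns).
samples = [[2.0, 1.0, 0.5], [4.0, 2.1, 0.4], [6.0, 2.9, 0.7],
           [8.0, 4.2, 0.5], [10.0, 4.8, 0.9]]
pc1, pc2, scores = first_two_pcs(samples)
for point in scores:
    print(point)
```

Each tuple in scores is the position of one sample in the PC1 versus PC2 plane. The class labels play no part in the computation, matching the note above that they are used only for coloring.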