dChip: Filter genes
We are often interested in genes that vary widely across samples or are
present in most samples. These genes can be used for unsupervised gene or
sample clustering, so that the clustering results are not affected by noise
from absent or unchanging genes. Select the menu “Analysis/Filter genes” to
open the filtering dialog.

The dialog provides several criteria for filtering genes.
The current “Array list file” specifies
the samples used in the filtering, and “pooling replicate arrays”
is performed before the filtering if needed.
Criterion (1) requires that the ratio of the
standard deviation to the mean of a gene’s expression values across all
samples be greater than a certain threshold. This ratio is also known as the
coefficient of variation (CV). The more variable a gene is across samples, the
larger the ratio. However, if a gene is mostly absent across samples, the ratio
can be large because the mean is small; in this case we can additionally apply
criterion (2). The default upper limit of 1000 is large enough that it is usually
satisfied, but it can be lowered to obtain genes that are not variable across samples.
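The CV criterion can be sketched in a few lines of Python. This is only an illustration, not dChip's own code: the function names, the example gene names and values, the use of the sample standard deviation (n - 1 denominator), and the strict inequalities are all assumptions made here.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return standard deviation / mean for one gene across samples."""
    m = mean(values)
    if m == 0:
        # Assumption: treat an undefined ratio as infinite, so the gene
        # fails any finite upper limit.
        return float("inf")
    return stdev(values) / m  # sample standard deviation (assumed)

def filter_by_cv(expression, lower, upper=1000.0):
    """Keep genes with lower < CV < upper (strictness is an assumption)."""
    return [gene for gene, values in expression.items()
            if lower < coefficient_of_variation(values) < upper]

# Hypothetical expression values for three genes across four samples.
expression = {
    "gene_flat": [100.0, 102.0, 98.0, 101.0],
    "gene_variable": [50.0, 400.0, 80.0, 900.0],
    "gene_absent": [2.0, 30.0, 1.0, 3.0],
}
selected = filter_by_cv(expression, lower=0.5)
```

Here "gene_absent" passes the CV filter even though its expression is low everywhere, because its small mean inflates the ratio. This is the situation in which criterion (2) is useful as an additional filter.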
If a group of samples contains outlier samples
that drive gene filtering and clustering (e.g. samples 29 and
11 in the left figure below; data courtesy of Tao Lu and Bruce Yankner), we can
log-transform the expression
values by checking "Open group/Options/Log x transform".
Afterwards, gene filtering criterion (1) can be set to use "Standard
deviation (for logged data)" instead of CV, taking advantage of the
variance-stabilizing property of the log transformation (right figure below).
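As a rough sketch of why the logged standard deviation is less sensitive to outliers, the snippet below compares CV on the raw scale with the standard deviation on the log scale. The base-2 logarithm, the function name, and the example values are assumptions for illustration; the base dChip uses for "Log x transform" is not stated here.

```python
import math
from statistics import mean, stdev

def log_sd(values):
    """Standard deviation of log2-transformed values (base 2 is assumed)."""
    return stdev([math.log2(v) for v in values])

# Hypothetical gene: flat in three samples, 16-fold higher in one outlier.
values = [100.0, 100.0, 100.0, 1600.0]

raw_cv = stdev(values) / mean(values)  # dominated by the single outlier
logged_sd = log_sd(values)             # outlier contributes only 4 log2 units
```

On the log scale the outlier sample differs from the others by 4 units rather than by 1500, so one extreme sample has less influence on which genes pass the filter.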

Criterion (2) requires that a gene be called
“Present” in more than a specified proportion of the arrays in the array list
file. Please see the “Handling replicate
arrays” section for criterion (3). Criterion (4) selects genes whose
expression values are larger than a threshold in more than a specified
percentage of samples. If the expression values have been log-transformed by
checking "Open group/Options/Log x transform", the expression value
threshold should also be specified on the log scale. Also note that criterion (2)
is applied at the array level (replicate arrays are not pooled), whereas
criterion (4) is applied at the sample level (replicate arrays are pooled to
compute the mean expression for each sample).
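The difference between the array-level criterion (2) and the sample-level criterion (4) can be illustrated as follows. This is a minimal sketch under stated assumptions: detection calls are encoded as "P", "M" and "A", replicates are pooled with a simple unweighted mean, and all names and values are hypothetical.

```python
from statistics import mean

def present_fraction(calls):
    """Fraction of arrays with a 'P' (Present) call; array level."""
    return sum(call == "P" for call in calls) / len(calls)

def pool_replicates(values, sample_of_array):
    """Average the arrays that belong to the same sample (unweighted mean)."""
    groups = {}
    for value, sample in zip(values, sample_of_array):
        groups.setdefault(sample, []).append(value)
    return {sample: mean(vals) for sample, vals in groups.items()}

def fraction_above(values, threshold):
    """Fraction of values strictly larger than the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Hypothetical gene measured on four arrays; the first two are replicates.
sample_of_array = ["s1", "s1", "s2", "s3"]
calls = ["P", "A", "P", "P"]
values = [30.0, 10.0, 50.0, 5.0]

frac_present = present_fraction(calls)             # criterion (2), per array
pooled = pool_replicates(values, sample_of_array)  # one value per sample
frac_expressed = fraction_above(list(pooled.values()), threshold=15.0)  # criterion (4)
```

With a 50% cutoff, this gene passes the expression criterion at the sample level (2 of 3 pooled samples exceed 15), although only 2 of the 4 individual arrays exceed the same threshold.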
The filtering can be restricted to
an existing gene list or its complement set if a “Gene list file” (a
tab-delimited text file in which the first column of each row is the probe set
name) is specified via the “Filter on gene list” button. Click the button
repeatedly to switch between “using all genes”, “using gene list” and
“excluding gene list”.
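The three button states correspond to simple set operations on probe set names. The sketch below assumes a gene list file with no header row and uses hypothetical probe set names and mode labels; a real file may contain a header line that would need to be skipped.

```python
import csv
import io

def read_gene_list(handle):
    """Collect the first column of each non-empty tab-delimited row."""
    return {row[0] for row in csv.reader(handle, delimiter="\t")
            if row and row[0]}

def restrict(genes, gene_list, mode):
    """Apply one of three modes: 'all', 'using' or 'excluding'."""
    if mode == "all":
        return list(genes)
    if mode == "using":
        return [g for g in genes if g in gene_list]
    if mode == "excluding":
        return [g for g in genes if g not in gene_list]
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical gene list file contents (probe set name, then annotation).
gene_list_text = "probe_1_at\tannotation A\nprobe_3_at\tannotation C\n"
gene_list = read_gene_list(io.StringIO(gene_list_text))

all_genes = ["probe_1_at", "probe_2_at", "probe_3_at", "probe_4_at"]
using = restrict(all_genes, gene_list, "using")
excluding = restrict(all_genes, gene_list, "excluding")
```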
By default the “Analysis/Filter
genes” and “Analysis/Compare samples” functions ignore the Affymetrix control
genes (probe set names starting with “AFFX-”), since their changes are
generally not interesting. To include Affymetrix control genes in the filtering
or comparison, click “More options/Analysis” or
select the menu “Tools/Options/Analysis/”, and uncheck “Omit Affymetrix
control probe set at filtering or comparison”.
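The effect of this option amounts to a prefix test on the probe set name. A minimal sketch follows; the function name and its `omit` flag are hypothetical, and the probe set names are only illustrative.

```python
def omit_affx_controls(probe_sets, omit=True):
    """Drop probe sets whose names start with 'AFFX-' unless omit is False."""
    if not omit:
        return list(probe_sets)
    return [name for name in probe_sets if not name.startswith("AFFX-")]

probe_sets = ["AFFX-BioB-5_at", "1007_s_at", "AFFX-CreX-3_at", "1053_at"]
kept = omit_affx_controls(probe_sets)                    # default behaviour
kept_all = omit_affx_controls(probe_sets, omit=False)    # option unchecked
```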
The genes satisfying the filtering criteria will be
exported to the “Gene list file” specified at the “Filtered gene list” button.
This gene list file can be used in analysis functions such as hierarchical
clustering, or in “Analysis/Model-based expression/Export”
to export expression data for these genes. Once the gene list file is saved,
its directory and name are stored in the Windows clipboard automatically, so one
can use Control+V to paste the file name into other places
such as “Analysis/Hierarchical clustering”.
Other gene filtering functions
ANOVA filtering
can use sample category information to select genes that vary among sample
groups in a supervised way.
[Use version 10/1/05+] We can also use the "Tools/Percentile filtering"
function to select genes by their fold change between a high and a low percentile
across samples. The filtered gene list can be clustered and viewed as usual.
This gene list file also contains the percentile-standardized data, in which the
two percentiles for each gene are linearly scaled to the +/-
display range (specified at “Options/Clustering”). To view the
percentile-standardized data, first use “Get external data” to read in this
exported data file, and then run “Analysis/Clustering” with “Options/Standardize
rows” unchecked.
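The percentile fold change and the linear scaling can be sketched as below. This is an illustration under assumptions, not dChip's implementation: the percentile is computed with linear interpolation between order statistics, the low and high percentiles map to minus and plus the display range, values outside the two percentiles are not clipped, and the display range of 3 and all names are hypothetical.

```python
def percentile(values, p):
    """Percentile p (0 to 100) with linear interpolation (assumed method)."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def percentile_fold_change(values, low_p, high_p):
    """Ratio of the high percentile to the low percentile (positive data)."""
    return percentile(values, high_p) / percentile(values, low_p)

def percentile_standardize(values, low_p, high_p, display_range):
    """Map the low percentile to -display_range and the high to +display_range."""
    low = percentile(values, low_p)
    high = percentile(values, high_p)
    return [-display_range + 2 * display_range * (v - low) / (high - low)
            for v in values]

# Hypothetical expression values of one gene across five samples.
values = [10.0, 20.0, 40.0, 80.0, 160.0]
fold = percentile_fold_change(values, 25, 75)
scaled = percentile_standardize(values, 25, 75, display_range=3.0)
```

In this example the 25th and 75th percentiles are 20 and 80, giving a fold change of 4; after scaling they land exactly on -3 and +3, while the two extreme samples fall outside the display range.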