
By default no checkbox option is checked. If needed click the “Options” button to specify other expression computation options . For example, use “PM-only model” or not perform the outlier detection procedure. The options displayed below are default. In version 8/20/06+, at the "Options/Model" dialog, select "Background subtraction" as "Mismatch probes" for PM/MM difference model and "5th percentile of region" for PM-only model. "Average" is for "Average difference" method.

Click OK to proceed and get an output similar to this:
{Model-based expression
Obtain
model-based expression values...
Probe set
7128
Saving expressions into dChip files...
Read in
existing DCP file
...
Read in
existing DCP file
Writing
array summary file D:\out\dan arrays.xls
Check CEL
images for quality information
Finished in 00 hours 03 minutes}
The “array summary file” has an icon in the left pane, either click the icon or open the file by Excel:
|
Number |
Array |
File Name |
Median
Intensity |
P call % |
Array
outlier % |
Single
outlier % |
Warning |
|
1 |
130a |
…… |
702 |
20.4236 |
5.34437 |
0.538363 |
* |
|
2 |
130b |
|
728 |
28.2929 |
1.78146 |
0.207117 |
|
|
3 |
130c |
|
658 |
25.9644 |
1.61313 |
0.258187 |
|
|
4 |
130d |
|
557 |
24.5616 |
2.04797 |
0.223431 |
|
|
5 |
CA-C |
|
1510 |
28.7698 |
0.60317 |
0.100721 |
|
|
6 |
CA-H |
|
2154 |
30.7336 |
0.617197 |
0.158884 |
|
|
7 |
CA-HR |
|
1817 |
35.7133 |
0.490952 |
0.107814 |
|
|
8 |
CA-R |
|
1560 |
25.3472 |
0.462898 |
0.0985935 |
|
|
9 |
PC-C subA |
|
1979 |
37.9857 |
0.50498 |
0.11278 |
|
|
10 |
PC-H subA |
|
1509 |
32.5431 |
0.645252 |
0.100721 |
|
Some statistics from model-fitting are reported here (in this example 24 arrays are used, including the previous 4 arrays): the percentage of probe sets called “Array-outlier” in one array, the percentage of “probe pairs” called “Single-outlier” in one array.
The computed expression values for one array are saved in the corresponding DCP file, which also contain both unnormalized and normalized probe level data for this array. If one would like to perform several variation of the analysis (e.g. normalization using different baseline array, compute expression values with different options) and keep the results under these different analysis schemes, one can specify alternative Working Directory at “Open Group” ("Tools/options/analysis" and check "Search and save DCP files in the working directory") to save the DCP files containing new results in a separate directory. To load the results of different analysis schemes, the corresponding Working Directory needs to be specified at “Open group”.
For a group of arrays, “Analysis/Normalize” and “Analysis/Model-based expression” only need to be performed once, unless outlier arrays are identified or some arrays are not of the interest at a later time (even then one can use “Tools/Array list file” to exclude these arrays). The next time one “Open group”, the normalized CEL values and the calculated expression values will be also be read in, and the high-level analysis such as clustering can be directly performed. The lower-right corner of dChip has “Normalized” and “Modelled” to indicate these two procedures have been performed.
Select “Analysis/Model-based Expression/Options/Method used: Average Difference” to use Average Difference method to compute expression values or signal. No outlier detection is performed, and the standard error is the standard deviation of PM/MM differences divided by the square root of the number of probe used. This method is faster than model-based method to obtain expression values.
Negative average differences or model-based expression indexes (MBEI) reflect that the signals are not specific (MM > PM and this is more likely to be due to cross-hybridization), and they are not biologically meaningful and can not be taken log transformation. MBEI are computed using non-crosshybridizing and probe pairs with consistent positive PM/MM difference by iterative fitting method, thus resulting fewer negative expression values.
The downstream “Clustering” or “Compare samples” analysis use the negative values as they are, except when calculating fold changes negative expressions are truncate to 1. One may also truncate the expression values right after the MBEI is calculated (may need to check “Ignore existing calculated expressions” to redo the “Model-based expression calculation”). Small or negative expression values can be truncated to a fixed value (e.g. 0), or to a certain percentile (e.g. 10%) of the all the expression values that are called “Absent”. The standard errors are set to 0 for the truncated expression values, and this will affect later “confidence interval of fold change” and “resample clustering” functions.
Check “Apply log x transform” to log transform the expression values (In V1.3+, check “Tools/Options/Analysis/ Log x transform expression values” and redo “Analysis/Open group” if needed). This leads to compressed dynamic range of the expression values and easier interpreting of fold changes in both directions. The standard errors are set to 0 for the log-transformed expression values. A “Logged” indicator at the lower-right corner indicates that the log-transformed expression values are being used. In V1.3+, only the original expression values are stored in the DCP files, and they are log transformed at the “Analysis/Open group” step.
The base of log transformation can be set for convenient interpretation. For example, a log-transformed value of 2 refers to the original expression 4 when the base is 2 and 100 when the base is 10. Since log10(x) = log2(x) * log10(2), the log ratio between two samples differ by a constant for two bases: log10(x/y) = log10(x) - log10(y) = log2(x/y) * log10(2). Some threshold values at "Compare samples" and "Filter genes" should be set in the corresponding log-transfromed scale (e.g. set log10(2) = 0.301 as the threshold for a 2 fold change when the base is 10).
The Affymetrix MAS 5 software has “Signal” as all-positive expression values. In V1.1+, the PM-only model (set at “Tools/Options/Model”) will also generate all-positive MBEI. By default dChip does not log-transform or truncate. Once may vary these settings and compare the results on his own data.
(Discussion thread) Related links: Array outlier image library (Ben Bolstad), SmudgeMinder (John Weinstein lab)
When fitting the model to a probe set, dChip uses the outlier detection algorithm to detect array-outliers for this probe set. The array-outliers are the arrays whose probe response pattern for this probe set is different from the consensus probe response pattern seen in most arrays. After the model is fitted for all probe sets by “Analysis/Model-based expression”, the percentage of probe sets called as array-outlier in one array is reported as “array outlier percentage”. If this percentage exceeds 5% in an array, the array is called as an “outlier array”. For such “outlier arrays”, A “*” is shown in the “Warning” column of “array summary file”, and the corresponding array icon in the left pane become dark blue, indicating potential image contamination or sample hybridization problem of these arrays:

A user is encouraged to click image icons to visually check the image after model-based expression calculation. These images will have “array-outlier” (white bars) and “single-outlier” (pink dots) superimposed (menu “Image/Array Outlier” to hide or show outlier bars on the image):


Array-outliers may be due to regional image contamination such as the scratches in the right side of the above array. Thus the expression values calculated for the affected probe sets should not be trusted. Probe sets identified as “array-outliers” are tagged and whose expression value can be treated as missing data in clustering or exported data file (by checking “Tools/Options/Analysis/Treating outlier expression as missing values”). The array-outliers are also attached with large standard errors (in fact this is why they are identified as array-outliers). In the down-stream analysis, the standard errors are used as measurement accuracy and the array-outliers are down-weighted. The examples are down-weighting unreliable expression values (array-outliers) when pooling replicates, or producing wider confidence interval of unreliable fold changes (see Compare samples). Single-outliers (such as image-spikes) have been replaced by imputed values in the model-fitting and their adverse effects have been eliminated.
The computed expression values and standard errors are stored in DCP files and can be exported into a data file. Select “Tools/Export data/Expression value” (“Analysis/Model-based expression/Export” for Version 1.0):

Arrays displayed are ordered by “Array list file”; click or Control/Shift+Click to select arrays whose expression values will be exported. One can choose to export the absolute calls and standard errors, or to export expression values called as “Array-outlier” to be blank for other software to treat them as missing data (set at “More options”). A “Gene list file” can be used to export expression values for a subset of probe sets. This file should be a tab-delimited text file with first column as probe set names. A user can also export the data in GCT format for input to GeneCluster 2.0 software by checking “GCT format for GeneCluster”. If a sample information file is specified at “AnalysisOpen group”, the first sample information column will be used to construct a CLS file for GeneCluster to use.
Expression data file exported by dChip follows the probe set order in CDF file, thus may not be the same as the order in gene information file. To export additional information, one can specify “Open group/Other information/gene information file” and check “Tools/Options/Analysis/Additional gene information” before exporting the data. In V1.3+, Check “Tools/Export data/Expression value/Include header information” to include information such as the modeling method, baseline array into the exported expression data files.
The output file has tab-delimited text format with genes on rows and samples on columns, and can be imported into other analysis software as well as dChip itself for further analysis (see file format for more details). The output files have the first column probe set name and the second column gene names. If the second column is not desired (e.g. for dChip “Get external data”), one can delete it after exporting, or clear “Gene information file” in “Analysis/Open Group/Other information” dialog when reading in array data, and then export data. If modified in other software such as Excel, make sure to save as “tab-delimited text” format.
An output file can be read in R by command such as “my.data =read.table("c:\\tmp\\expression.xls",sep='\t',header=T,row.names=1)”. dChip output files have a tab character in the end of each line, and this leads to a last NA column after the data is read into R. One can either delete the column in R, or open the file in Excel and save as tab-delimited text file to eliminate the ending tab before importing the file into R. In addition, when reading data files containing single quotes (e.g. 5’) in gene descriptions into R, one needs to use parameter quote = "" for the read.table() function (observed by Douglas Bates, see R mailing list).
To output the outliers identified by the model, one may use “Tools/Export Data/Expression value” (Check “Tools/Options/Analysis/Treat array outlier as missing expression value”) for array-outliers (empty grid in the exported file), “Tools/Export Data/Probe Sensitivity Index” for probe outliers (exclude=1 in the exported file), and “Image/Export CEL” for array and single outliers (the [OUTLIER] section in the end of CEL file)
Due to many MM (mismatch) > PM (perfect match)
signals and negative Average Difference or MBEI expression values, researchers
have explored the possibility of PM-only analysis (Naef
et al. 2001, Li and Wong 2001b) or
PM-only arrays (Zhou and Abagyan 2002). The
following algorithm uses only PM probes to calculate all-positive expression
values. Both dChip PM/MM difference model and PM-only model use the outlier detection algorithm to eliminate potential
cross-hybridizing probes. The difference model also subtracts away the
cross-hybridization signals equally bound to the paired PM and MM probes, while
the PM-only model absorbs some background signals in the expression values.
Therefore the difference model can be more sensitive to expression changes at
low concentration levels. The initial goal of PM only model is to get
positive-only expression levels. In the latest dChip using
“Options/Model/Truncate negative PM/MM difference to 0” will achieve the same
for PM/MM difference model [In dChip 9/6/06+, this option is removed. Negative
differences are always truncated to a small value of 1, leading to only
positive expression (theta) and phi estimates].
1. Background subtraction.
Background signals are the signals observed when the target gene does not express. We divide the whole array into 10 by 10 regions as in Affymetrix Inc. (1999), and calculate the 5th percentile of PM and MM values in a region as the local background of the sector. Then the background values are subtracted from all the PM and MM values in this region. Small negative values may be obtained and they are set to 0; the error introduced by setting to 0 is at the comparable level of the background noise and signal variation, which are accounted by the error term e below. Note that the normalization is based on the original PM and MM data, and the background subtraction is performed after normalization.
2. Apply multiplicative model to the background-adjusted PM values:
PMij = qi * fj + ε
qi is the model-based expression index (MBEI) of the target gene in array i, fj is the probe sensitivity index (PSI) of probe j in a probe set (see Li and Wong 2001a for details).
Specify different methods through the
“Analysis/Model-based expression/Options” dialog or use the menu “Data/Next
model” to switch between models.
In V1.3+, one may specify “Tools/Options/Analysis/PM-only array” to read in CEL files of PM-only arrays. dChip treats each PM probe as its own MM probe, leading to zero PM/MM difference. As a result, only the PM-Only model can be applied. Also one needs to supply Affy TXT files if absolute calls are needed since dChip always computes ‘A’ calls for PM/MM difference of 0.
The PM/MM difference model or PM-only model is
followed by the outlier detection algorithm described below. The model-fitting
and outlier-detection are iterated up to 30 times to reach a converged set of
array, probe and single outliers, or one of the cycling outlier sets at the
round 30 (Li and Wong 2001a). Usually it can
converge in 5-10 rounds, and the round information is displayed in the grid (1,
3) of Data View (“Rd”). If outliers are excluded
or imputed during the iteration, the dChip fitted q and f values are not
exactly the same as those obtained from S-PLUS or BioConductor
affy package.
1. Check image spikes (set by
“Tools/Options/Model/Treat image spikes as single outlier”).
Image spikes are identified using absolute value of
data points (PM/MM differences or adjusted PM values). The 80th
percentile of the absolute values of the data points in one array (probe)
multiplied by 3 is called the array-wise (probe-wise) value threshold. As with
other parameters used below, “80” and “3” are selected empirically to make the
algorithm successfully identify outliers confirmed in the dChip PM/MM data view
for a large number of array sets/probe sets. If a data point has its absolute
value exceeding both the relevant array-wise value threshold and the probe-wise
value threshold, this data point is called as an image spike. This algorithm is
similar to the super-scoring method used by MAS4 (Affymetrix
1999) for identifying outliers. Image spike checking is performed only once
before the “model fitting/outlier detection” iteration.
The identified image spikes are treated as missing
data during model-fitting, but used as they are during outlier detection; this
induces the image spikes to be identified and handled as single outliers
eventually. Note that dChip does not check for saturated probes, but these
probes may be identified as image spikes and treated as single outliers as
above.
2. Single outlier detection.
Single outliers are identified using absolute residuals. The 80th percentile of the absolute residuals of the data points in one array (probe) multiplied by 3 is called the array-wise (probe-wise) residual threshold. If a data point has its absolute residual exceeding both the corresponding array-wise residual threshold and the probe-wise residual threshold, it is called as a single outlier. Due to the implementation of finding percentiles in dChip (e.g. 80% percentile is SortedVector[int(0.8*Len)], with SortedVector[0…Len-1]), it is required to have >=6 arrays and probes for the model to call single outliers. Single outliers usually dominate the displaying range in the “Data view” and make the other data points difficult to see; one may use “Data/Outlier in Range” to toggle the option.
3. Probe and array outlier detection.
For a probe pair (or probe for PM-only model), if
its model-based standard error (std(fj)) is 3 times larger than the median standard error
of all probe pairs, it is called a probe-outlier. In addition, if a probe pair
is too active while most others are not (fj2 > (∑fj2 )* 0.9), it may
be due to single-probe cross-hybridization in a probe set and is called a
probe-outlier. Finally in the PM/MM difference model, probe pairs with negative
fj are called as
probe outlier; this happens when in a probe pair MM is consistently larger than
PM across arrays and thus it is more likely to be due to cross-hybridization.
The number of probe outliers is restricted to be at most 50% of all probes. The
similar criteria are used to identify array-outliers.
4. Iteration with model fitting and outlier detection
After the three types of outliers are identified, the model is refitted using the subset of arrays and probes not called as outlier, with the data value of single outliers regarded as missing. Afterwards the q’s for array-outliers are computed using local regression passing the origin point, regarding f’s as known and using only f’s of non-probe-outliers. f’s for probe-outliers are computed in a similar manner. The outlier detection and model-fitting is iterated until the set of outliers does not change or a limit of 30 rounds has been reached. In the final fitted model, the sum square of f’s for non-outlier probes are constrained to be equal to the number of non-outlier probes, with other f’s fitted through the local regression.
In implementation the single outliers, array outliers, and probe outliers are identified cyclically in this order, whenever a particular outlier set is found to be different from the current outlier set before the model fitting, the model is refit. An real order could be: model fitting -> check single outlier, not found any -> check array outlier, found 2 -> model fitting without these 2 arrays -> check probe outlier, found 1 -> model fitting without these 2 arrays and 1 probe -> check single outlier, found 3 -> model fitting without these 2 arrays, 1 probe, and regard these 3 data points as missing and impute -> check array outlier, found 3 -> model fitting without these 3 arrays, 1 probe, and regard the 3 data points as missing and impute -> check probe outlier, found 2 -> model fitting without these 3 arrays and 2 probes, and regard the 3 data points as missing and impute -> check single outlier, found the same 3 as before, no refitting -> check array outlier, found the same 3 as before, no refitting -> check probe outlier, found the same 2 as before, no refitting -> the outlier detection converges.
This outlier detection algorithm improves on that in the version 1.0 by checking for image spikes: some times it is not sufficient to rely on model itself to identify outliers; an image spike may distort the model fitting in a way that the spike does not manifest itself as either single, array or probe outlier. Previously image spikes may make the whole probe set become array-outlier (see example). Now single outliers are identified in a more reasonable manner (pink dots; data courtesy of Nikhil Munshi):

One may change
“Tools/Options/Model/Check array outlier”, “Check probe outlier” and “Check
single outlier” to perform or not perform a particular outlier detection
[V1.2+]. Please see tissue
effect for their use and related outlier detection options.
The main distinction of low-level analysis is that dChip uses Invariant set normalization method and Model-based expression indexes (MBEI). Thus the expression values and high-level analysis