dChip: Calculate model-based expression values

 

Truncate or log transform                    Array outlier and outlier array            Tissue effect

Probe sensitivity index file                 Export expression values                    PM-only model

Outlier detection algorithm                 Compare with MAS

 

The model-based expression values in grid (2, 2) of “PM/MM Data View” are calculated on the fly and not stored. To fit the model for all the probe sets and store the results into DCP files, select the menu “Analysis/Model-based Expression”:

 

By default no checkbox option is checked. If needed click the “Options” button to specify other expression computation options . For example, use “PM-only model” or not perform the outlier detection procedure. The options displayed below are default. In version 8/20/06+, at the "Options/Model" dialog, select "Background subtraction" as "Mismatch probes" for PM/MM difference model and "5th percentile of region" for PM-only model. "Average" is for "Average difference" method.

 

 

Click OK to proceed and get an output similar to this:

 

{Model-based expression

  Obtain model-based expression values...

    Probe set 7128

  Saving expressions into dChip files...

    Read in existing DCP file

      ...  

    Read in existing DCP file

 

  Writing array summary file D:\out\dan arrays.xls

 

  Check CEL images for quality information

Finished in 00 hours 03 minutes}

 

The “array summary file” has an icon in the left pane, either click the icon or open the file by Excel:

 

Number

Array

File Name

Median Intensity

P call %

Array outlier %

Single outlier %

Warning

1

130a

……

702

20.4236

5.34437

0.538363

*

2

130b

 

728

28.2929

1.78146

0.207117

 

3

130c

 

658

25.9644

1.61313

0.258187

 

4

130d

 

557

24.5616

2.04797

0.223431

 

5

CA-C

 

1510

28.7698

0.60317

0.100721

 

6

CA-H

 

2154

30.7336

0.617197

0.158884

 

7

CA-HR

 

1817

35.7133

0.490952

0.107814

 

8

CA-R

 

1560

25.3472

0.462898

0.0985935

 

9

PC-C subA

 

1979

37.9857

0.50498

0.11278

 

10

PC-H subA

 

1509

32.5431

0.645252

0.100721

 

 

Some statistics from model-fitting are reported here (in this example 24 arrays are used, including the previous 4 arrays): the percentage of probe sets called “Array-outlier” in one array, the percentage of “probe pairs” called “Single-outlier” in one array.

 

The computed expression values for one array are saved in the corresponding DCP file, which also contain both unnormalized and normalized probe level data for this array. If one would like to perform several variation of the analysis (e.g. normalization using different baseline array, compute expression values with different options) and keep the results under these different analysis schemes, one can specify alternative Working Directory at “Open Group” ("Tools/options/analysis" and check "Search and save DCP files in the working directory") to save the DCP files containing new results in a separate directory. To load the results of different analysis schemes, the corresponding Working Directory needs to be specified at “Open group”.

 

For a group of arrays, “Analysis/Normalize” and “Analysis/Model-based expression” only need to be performed once, unless outlier arrays are identified or some arrays are not of the interest at a later time (even then one can use “Tools/Array list file” to exclude these arrays). The next time one “Open group”, the normalized CEL values and the calculated expression values will be also be read in, and the high-level analysis such as clustering can be directly performed. The lower-right corner of dChip has “Normalized” and “Modelled” to indicate these two procedures have been performed.

 

Select “Analysis/Model-based Expression/Options/Method used: Average Difference” to use Average Difference method to compute expression values or signal. No outlier detection is performed, and the standard error is the standard deviation of PM/MM differences divided by the square root of the number of probe used. This method is faster than model-based method to obtain expression values.

 

Truncate or log transform expression values

 

Negative average differences or model-based expression indexes (MBEI) reflect that the signals are not specific (MM > PM and this is more likely to be due to cross-hybridization), and they are not biologically meaningful and can not be taken log transformation. MBEI are computed using non-crosshybridizing and probe pairs with consistent positive PM/MM difference by iterative fitting method, thus resulting fewer negative expression values.

 

The downstream “Clustering” or “Compare samples” analysis use the negative values as they are, except when calculating fold changes negative expressions are truncate to 1. One may also truncate the expression values right after the MBEI is calculated (may need to check “Ignore existing calculated expressions” to redo the “Model-based expression calculation”). Small or negative expression values can be truncated to a fixed value (e.g. 0), or to a certain percentile (e.g. 10%) of the all the expression values that are called “Absent”. The standard errors are set to 0 for the truncated expression values, and this will affect later “confidence interval of fold change” and “resample clustering” functions.

 

Check “Apply log x transform” to log transform the expression values (In V1.3+, check “Tools/Options/Analysis/ Log x transform expression values” and redo “Analysis/Open group” if needed). This leads to compressed dynamic range of the expression values and easier interpreting of fold changes in both directions. The standard errors are set to 0 for the log-transformed expression values. A “Logged” indicator at the lower-right corner indicates that the log-transformed expression values are being used. In V1.3+, only the original expression values are stored in the DCP files, and they are log transformed at the “Analysis/Open group” step.

 

The base of log transformation can be set for convenient interpretation. For example, a log-transformed value of 2 refers to the original expression 4 when the base is 2 and 100 when the base is 10. Since log10(x) = log2(x) * log10(2), the log ratio between two samples differ by a constant for two bases: log10(x/y) = log10(x) - log10(y) = log2(x/y) * log10(2). Some threshold values at "Compare samples" and "Filter genes" should be set in the corresponding log-transfromed scale (e.g. set log10(2) = 0.301 as the threshold for a 2 fold change when the base is 10).

 

The Affymetrix MAS 5 software has “Signal” as all-positive expression values. In V1.1+, the PM-only model (set at “Tools/Options/Model”) will also generate all-positive MBEI. By default dChip does not log-transform or truncate. Once may vary these settings and compare the results on his own data.

 

Array outlier and outlier array

 

(Discussion thread)     Related links: Array outlier image library (Ben Bolstad), SmudgeMinder (John Weinstein lab)

 

When fitting the model to a probe set, dChip uses the outlier detection algorithm to detect array-outliers for this probe set. The array-outliers are the arrays whose probe response pattern for this probe set is different from the consensus probe response pattern seen in most arrays. After the model is fitted for all probe sets by “Analysis/Model-based expression”, the percentage of probe sets called as array-outlier in one array is reported as “array outlier percentage”. If this percentage exceeds 5% in an array, the array is called as an “outlier array”. For such “outlier arrays”, A “*” is shown in the “Warning” column of “array summary file”, and the corresponding array icon in the left pane become dark blue, indicating potential image contamination or sample hybridization problem of these arrays:

 

 

A user is encouraged to click image icons to visually check the image after model-based expression calculation. These images will have “array-outlier” (white bars) and “single-outlier” (pink dots) superimposed (menu “Image/Array Outlier” to hide or show outlier bars on the image):

 

          

 

Array-outliers may be due to regional image contamination such as the scratches in the right side of the above array. Thus the expression values calculated for the affected probe sets should not be trusted. Probe sets identified as “array-outliers” are tagged and whose expression value can be treated as missing data in clustering or exported data file (by checking “Tools/Options/Analysis/Treating outlier expression as missing values”). The array-outliers are also attached with large standard errors (in fact this is why they are identified as array-outliers). In the down-stream analysis, the standard errors are used as measurement accuracy and the array-outliers are down-weighted. The examples are down-weighting unreliable expression values (array-outliers) when pooling replicates, or producing wider confidence interval of unreliable fold changes (see Compare samples). Single-outliers (such as image-spikes) have been replaced by imputed values in the model-fitting and their adverse effects have been eliminated.

Affymetrix MAS software uses the percentage of probe sets called present (usually 30%-50%) in an array, 5’ to 3’ ratio (to validate the IVT procedure) and the signal to background ratio as quality control assessment. dChip cross-references one array with other arrays through modeling approach to identify problematic arrays. Having fewer outliers is a sign of better array quality and more consistency with other arrays. Arrays with large number of array/single outliers (> 5%) deserves special attention. This can be due image contamination or that the sample was contaminated at some experimental stages or a wrong chip was hybridized to the sample. A visually clean chip may have a lot of array-outlier.

It is recommended to discard arrays with large number of array-outliers (> 15%). To do this we can either physically remove the CEL or DCP files from the directory and then redo the normalization and mdoel-based expression using only the good arrays, or de-select these arrays in the “Tools/Array list file” to exclude them from the high-level analysis. The “array-outliers” do not affect the expression index calculation in other good arrays since they are excluded during iterative model fitting. But the first approach is safer since an “outlier array” may be used as the baseline array for normalization (in this case it’s better to redo normalization and MBEI using another baseline array). If the image contamination is severe in one array but using the array data is still desirable, The image contamination correction may be used to alleviate the situation.

Tissue effect

Sometime large number of array-outliers may due to biological differences or treatment effects. For example, a probe set may have different probe patterns in different tissues due to differential cross-hybridization (a cross-hybridizing gene is present in one array set but not in the other), alternative splicing/SNPs, or array batch manufacture differences (also see the section “Probe-sensitivity indexes are stable across tissues types” in Li and Wong 2001b). When fitting the model using arrays having different responding patterns, the final probe-sensitive indexes (f's) will be mainly determined by the tissue having the majority of arrays, or a compromise of the patterns in several tissues. In the former case the arrays for the tissues hybridized to a minority of arrays may have many array-outliers and be identified as outlier-arrays. Another example is that a treatment causes some genes to be highly expressed and have the probe response pattern but the gene is absent in the control samples. If only a few treatments arrays are analyzed with a large number of control arrays, the treatment arrays may be identified outlier-arrays. In such cases, one may try first combine them to see if for many probe sets the two tissues give different patterns, then make decisions whether to combine them. Also see normalizing different tissues.

Although the real biology effect identified as array-outliers generally have the expression values correctly large, in such cases one may turn-off the array outlier detection at “Tools/Options/Model/Check array outlier” before “Analysis/Model-based expression”, so that real probe response patterns in a minority of arrays are not regarded as array outliers and used for fitting f values [V1.2+]. Alternatively, one can check “Tools/Options/Model/Do not call all replicate arrays as array outlier" and then specify an array list file with replicate separators to discard array outliers called in all replicates of a tissue type, since this is the real biology effect [V1.3+]. After the expression values are computed the array list file can be unselected or specified differently.

Probe sensitivity index file

Sometimes it is desirable to save the probe sensitivity indexes (PSI, or f’s) for all probe sets after the model fitting. For example, a researcher may have only a few arrays (< 5) to train the PSI parameters and wants to benefit from the PSI parameters trained on a large number of arrays having the same chip type and hybridized to similar tissue/cell lines from other sources. This is also useful when expression values of an initial batch of arrays are obtained and stored in a database, and we only want the subsequent incoming arrays to be comparable to the arrays in the database without altering the expression values of the initial batch. In such situation we can train the new arrays using the PSI file of the first batch of arrays (the original baseline array should be included in the 2nd dChip session and new arrays are normalized to this baseline array).

 

By default no PSI file is used and probe sensitivity indexes are fitted from the current arrays. Click “Options” in the “Analysis/Model-based expression” dialog to specify the usage and name of PSI files. “Write” option will save the probe sensitivity indexes after the model fitting is performed on all probe sets; at a later time, the PSI file can be used to fit the expression values for other arrays by selecting the “Read and use” option. The “Tools/Export probe set” dialog can export PSI (f) values and their standard errors into text format.

 

Slightly different expression values may occur for a few probe sets if we use the PSI file to get expression values of the original arrays, due to the non-convergence of outlier detection iterations. When sharing PSI files with other researchers please note chip type, tissue/cell line description, and number of arrays used.

 

Export expression values

 

The computed expression values and standard errors are stored in DCP files and can be exported into a data file.  Select “Tools/Export data/Expression value” (“Analysis/Model-based expression/Export” for Version 1.0):

 

 

Arrays displayed are ordered by “Array list file”; click or Control/Shift+Click  to select arrays whose expression values will be exported. One can choose to export the absolute calls and standard errors, or to export expression values called as “Array-outlier” to be blank for other software to treat them as missing data (set at “More options”). A “Gene list file” can be used to export expression values for a subset of probe sets. This file should be a tab-delimited text file with first column as probe set names. A user can also export the data in GCT format for input to GeneCluster 2.0 software by checking “GCT format for GeneCluster”. If a sample information file is specified at “AnalysisOpen group”, the first sample information column will be used to construct a CLS file for GeneCluster to use.

 

Expression data file exported by dChip follows the probe set order in CDF file, thus may not be the same as the order in gene information file. To export additional information, one can specify “Open group/Other information/gene information file” and check “Tools/Options/Analysis/Additional gene information” before exporting the data. In V1.3+, Check “Tools/Export data/Expression value/Include header information” to include information such as the modeling method, baseline array into the exported expression data files.

 

The output file has tab-delimited text format with genes on rows and samples on columns, and can be imported into other analysis software as well as dChip itself for further analysis (see file format for more details). The output files have the first column probe set name and the second column gene names. If the second column is not desired (e.g. for dChip “Get external data”), one can delete it after exporting, or clear “Gene information file” in “Analysis/Open Group/Other information” dialog when reading in array data, and then export data. If modified in other software such as Excel, make sure to save as “tab-delimited text” format.

 

An output file can be read in R by command such as “my.data =read.table("c:\\tmp\\expression.xls",sep='\t',header=T,row.names=1)”. dChip output files have a tab character in the end of each line, and this leads to a last NA column after the data is read into R. One can either delete the column in R, or open the file in Excel and save as tab-delimited text file to eliminate the ending tab before importing the file into R. In addition, when reading data files containing single quotes (e.g. 5’) in gene descriptions into R, one needs to use parameter quote = "" for the read.table() function (observed by Douglas Bates, see R mailing list).

 

To output the outliers identified by the model, one may use “Tools/Export Data/Expression value” (Check “Tools/Options/Analysis/Treat array outlier as missing expression value”) for array-outliers (empty grid in the exported file), “Tools/Export Data/Probe Sensitivity Index” for probe outliers (exclude=1 in the exported file), and “Image/Export CEL” for array and single outliers (the [OUTLIER] section in the end of CEL file)

 

PM-only model

 

Due to many MM (mismatch) > PM (perfect match) signals and negative Average Difference or MBEI expression values, researchers have explored the possibility of PM-only analysis (Naef et al. 2001, Li and Wong 2001b) or PM-only arrays (Zhou and Abagyan 2002). The following algorithm uses only PM probes to calculate all-positive expression values. Both dChip PM/MM difference model and PM-only model use the outlier detection algorithm to eliminate potential cross-hybridizing probes. The difference model also subtracts away the cross-hybridization signals equally bound to the paired PM and MM probes, while the PM-only model absorbs some background signals in the expression values. Therefore the difference model can be more sensitive to expression changes at low concentration levels. The initial goal of PM only model is to get positive-only expression levels. In the latest dChip using “Options/Model/Truncate negative PM/MM difference to 0” will achieve the same for PM/MM difference model [In dChip 9/6/06+, this option is removed. Negative differences are always truncated to a small value of 1, leading to only positive expression (theta) and phi estimates].

 

1. Background subtraction.

Background signals are the signals observed when the target gene does not express. We divide the whole array into 10 by 10 regions as in Affymetrix Inc. (1999), and calculate the 5th percentile of PM and MM values in a region as the local background of the sector. Then the background values are subtracted from all the PM and MM values in this region. Small negative values may be obtained and they are set to 0; the error introduced by setting to 0 is at the comparable level of the background noise and signal variation, which are accounted by the error term e below. Note that the normalization is based on the original PM and MM data, and the background subtraction is performed after normalization.

 

2. Apply multiplicative model to the background-adjusted PM values:

                                                           PMij = qi * fj + ε

qi is the model-based expression index (MBEI) of the target gene in array i, fj is the probe sensitivity index (PSI) of probe j in a probe set (see Li and Wong 2001a for details).

 

Specify different methods through the “Analysis/Model-based expression/Options” dialog or use the menu “Data/Next model” to switch between models.

 

In V1.3+, one may specify “Tools/Options/Analysis/PM-only array” to read in CEL files of PM-only arrays. dChip treats each PM probe as its own MM probe, leading to zero PM/MM difference. As a result, only the PM-Only model can be applied. Also one needs to supply Affy TXT files if absolute calls are needed since dChip always computes ‘A’ calls for PM/MM difference of 0.

 

Outlier detection algorithm

 

The PM/MM difference model or PM-only model is followed by the outlier detection algorithm described below. The model-fitting and outlier-detection are iterated up to 30 times to reach a converged set of array, probe and single outliers, or one of the cycling outlier sets at the round 30 (Li and Wong 2001a). Usually it can converge in 5-10 rounds, and the round information is displayed in the grid (1, 3) of Data View (“Rd”). If outliers are excluded or imputed during the iteration, the dChip fitted q and f values are not exactly the same as those obtained from S-PLUS or BioConductor affy package.

 

1. Check image spikes (set by “Tools/Options/Model/Treat image spikes as single outlier”).

Image spikes are identified using absolute value of data points (PM/MM differences or adjusted PM values). The 80th percentile of the absolute values of the data points in one array (probe) multiplied by 3 is called the array-wise (probe-wise) value threshold. As with other parameters used below, “80” and “3” are selected empirically to make the algorithm successfully identify outliers confirmed in the dChip PM/MM data view for a large number of array sets/probe sets. If a data point has its absolute value exceeding both the relevant array-wise value threshold and the probe-wise value threshold, this data point is called as an image spike. This algorithm is similar to the super-scoring method used by MAS4 (Affymetrix 1999) for identifying outliers. Image spike checking is performed only once before the “model fitting/outlier detection” iteration.

 

The identified image spikes are treated as missing data during model-fitting, but used as they are during outlier detection; this induces the image spikes to be identified and handled as single outliers eventually. Note that dChip does not check for saturated probes, but these probes may be identified as image spikes and treated as single outliers as above.

 

2. Single outlier detection.

Single outliers are identified using absolute residuals. The 80th percentile of the absolute residuals of the data points in one array (probe) multiplied by 3 is called the array-wise (probe-wise) residual threshold. If a data point has its absolute residual exceeding both the corresponding array-wise residual threshold and the probe-wise residual threshold, it is called as a single outlier. Due to the implementation of finding percentiles in dChip (e.g. 80% percentile is SortedVector[int(0.8*Len)], with SortedVector[0…Len-1]), it is required to have >=6 arrays and probes for the model to call single outliers. Single outliers usually dominate the displaying range in the “Data view” and make the other data points difficult to see; one may use “Data/Outlier in Range” to toggle the option.

 

3. Probe and array outlier detection.

For a probe pair (or probe for PM-only model), if its model-based standard error (std(fj)) is 3 times larger than the median standard error of all probe pairs, it is called a probe-outlier. In addition, if a probe pair is too active while most others are not (fj2  > (∑fj2 )* 0.9), it may be due to single-probe cross-hybridization in a probe set and is called a probe-outlier. Finally in the PM/MM difference model, probe pairs with negative fj are called as probe outlier; this happens when in a probe pair MM is consistently larger than PM across arrays and thus it is more likely to be due to cross-hybridization. The number of probe outliers is restricted to be at most 50% of all probes. The similar criteria are used to identify array-outliers.

 

4. Iteration with model fitting and outlier detection

After the three types of outliers are identified, the model is refitted using the subset of arrays and probes not called as outlier, with the data value of single outliers regarded as missing. Afterwards the q’s for array-outliers are computed using local regression passing the origin point, regarding f’s as known and using only f’s of non-probe-outliers. f’s for probe-outliers are computed in a similar manner. The outlier detection and model-fitting is iterated until the set of outliers does not change or a limit of 30 rounds has been reached. In the final fitted model, the sum square of f’s for non-outlier probes are constrained to be equal to the number of non-outlier probes, with other f’s fitted through the local regression.

 

In implementation the single outliers, array outliers, and probe outliers are identified cyclically in this order, whenever a particular outlier set is found to be different from the current outlier set before the model fitting, the model is refit. An real order could be: model fitting -> check single outlier, not found any -> check array outlier, found 2 -> model fitting without these 2 arrays -> check probe outlier, found 1 -> model fitting without these 2 arrays and 1 probe -> check single outlier, found 3 -> model fitting without these 2 arrays, 1 probe, and regard these 3 data points as missing and impute -> check array outlier, found 3 -> model fitting without these 3 arrays, 1 probe, and regard the 3 data points as missing and impute -> check probe outlier, found 2 -> model fitting without these 3 arrays and 2 probes, and regard the 3 data points as missing and impute -> check single outlier, found the same 3 as before, no refitting -> check array outlier, found the same 3 as before, no refitting -> check probe outlier, found the same 2 as before, no refitting -> the outlier detection converges.

 

This outlier detection algorithm improves on that in the version 1.0 by checking for image spikes: some times it is not sufficient to rely on model itself to identify outliers; an image spike may distort the model fitting in a way that the spike does not manifest itself as either single, array or probe outlier. Previously image spikes may make the whole probe set become array-outlier (see example). Now single outliers are identified in a more reasonable manner (pink dots; data courtesy of Nikhil Munshi):

 

 

One may change “Tools/Options/Model/Check array outlier”, “Check probe outlier” and “Check single outlier” to perform or not perform a particular outlier detection [V1.2+]. Please see tissue effect for their use and related outlier detection options.

 

Comparing dChip with MAS

 

The main distinction of low-level analysis is that dChip uses Invariant set normalization method and Model-based expression indexes (MBEI). Thus the expression values and high-level analysis