Invariant Set normalization Normalization plot Normalize different tissue types
Related resources: Affymetrix Data Analysis Fundamentals, Low-level analysis workshop: 2001, 2003, Probe-level analysis (John Weinstein lab)
Scanned images may differ in overall brightness, so normalization is generally needed to bring the arrays to a comparable brightness level. dChip needs to normalize arrays at the PM and MM probe level before computing model-based expression levels. This is because MAS analyzes one array at a time, so the Signal or Average Difference of different arrays can be scaled or normalized after they are computed, whereas the model-based method assumes the arrays are already at comparable brightness. However, there is a danger of introducing artifacts if normalization is not done reasonably (one can check the normalization curve at "Image/Normalization plot"). If the amplification and scanning steps are well controlled and the median intensities of the arrays in the "Array summary file" are close, one may skip the normalization step and perform "Model-based expression" directly.
The PM/MM Data View (see below) is best viewed after normalization. Otherwise, during animation it is common to find probe sets whose MM curves jump up and down from array to array, indicating different overall background brightness of the arrays.
Select menu item “Analysis/Normalize & Model” to bring up the normalization dialog:

By default, an array with median overall intensity is chosen as the baseline array, against which the other arrays are normalized at the probe intensity level. A different array can be specified as the baseline in the dialog if checking the array images or outlier images shows that the default baseline array has problems (due to contamination or bad hybridization). If normalization has already been performed, dChip will use the existing normalized CEL intensities in the DCP files (a “Normalized” mark will be shown in the lower right corner). [Obsolete: Check “Ignore normalized data” to have dChip always do normalization.] Click “OK” to perform normalization. The output reports how many PM probes are selected into the “Invariant set” to determine the normalization curve, and the median probe intensity before and after normalization (data courtesy of Leighton Stein):
Normalize arrays using baseline: 1C...
Baseline chip
Accessing 'C:\tmp\leighton\1C.dcp' (file format 3)
1C
Accessing 'C:\tmp\leighton\1C.dcp' (file format 3)
Do not normalize the baseline array (Median intensity 135)
1E
Accessing 'C:\tmp\leighton\1E.dcp' (file format 3)
Searching Invariant-set: 9434
Normalizing CEL: 280000
Median intensity: 134 -> 131
3C
Accessing 'C:\tmp\leighton\3C.dcp' (file format 3)
Searching Invariant-set: 1567
Normalizing CEL: 280000
Median intensity: 75 -> 146
...
Calculating background...
Finished in 0 minutes 9 seconds
Because expression indexes depend on the overall brightness of the baseline array, the “Analysis/Model-based expression” step will give different expression indexes if a different baseline array is used. However, the fold change of the same gene between two arrays will not be affected dramatically, since its expression levels in both arrays are adjusted to the same baseline array.
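The default baseline choice (the array with median overall intensity) can be sketched roughly as follows. This is only an illustration: the function name `pick_baseline`, the toy intensities, and the tie-breaking rule for an even number of arrays are assumptions, not dChip's actual code.

```python
from statistics import median

def pick_baseline(arrays):
    """Return the name of the array whose median probe intensity is the
    middle one among all arrays.

    arrays: dict mapping array name -> list of probe intensities.
    Assumption: with an even number of arrays the lower-middle one is taken.
    """
    ranked = sorted(arrays, key=lambda name: median(arrays[name]))
    return ranked[(len(ranked) - 1) // 2]

# Hypothetical toy data (not from any real CEL file).
arrays = {
    "A": [90, 120, 150],   # median 120
    "B": [60, 75, 95],     # median 75
    "C": [130, 160, 210],  # median 160
}
baseline = pick_baseline(arrays)
print(baseline)
```

In dChip itself the baseline can still be overridden in the dialog when the default array looks contaminated or badly hybridized.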
The Invariant Set Normalization method is used (see Li and Wong 2001b, page 9, section “Normalization of arrays based on an ‘invariant set’” for more details). It chooses a subset of PM probes with small within-subset rank difference in the two arrays to serve as the basis for fitting a normalization curve. The fitted curve is the running median curve in the scatterplot of probe intensities of the two arrays (with the baseline array on the Y-axis and the array to be normalized on the X-axis). In the prototype S-PLUS implementation, cubic B-spline smoothing was also tried: lines(smooth.spline(normalize$V1, normalize$V3,spar=.005),col=7); both approaches are intended to fit a reasonably smooth curve through the scatterplot of the two arrays (smoothing parameters are chosen empirically by visual inspection). When fitting the running median curve at the two ends, 5% of the “invariant” points are used to fit a ray with one end fixed (Version 1.0 uses 1/300 of the “invariant” points); this makes the high-end normalization relationship smoother and more robust. The final running median curve is a piecewise linear curve.
The normalization transformation is then applied to all the points (probes) of the array on the X-axis (the baseline array on the Y-axis is not changed). To get the normalized value of a point with a particular intensity on the X-axis, we draw an imaginary vertical line through this point and take the Y-axis value of its intersection with the fitted curve as the normalized value. Normalized probe values beyond the range [0, 65535] are truncated at the boundaries. The overall effect of normalization is a rotation and straightening of the scatterplot, so that it is better centered around the diagonal line y = x.
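As a rough illustration of the idea, the sketch below selects probes whose ranks are similar in the two arrays and then computes running-median knots through them. It is a deliberately simplified, single-pass version: the published method is described in Li and Wong 2001b, and the fixed threshold `max_rank_diff`, the `window` size, and all function names here are assumptions made for the example. The special ray fitting at the two ends is not reproduced.

```python
from statistics import median

def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def invariant_set(x, y, max_rank_diff=0.05):
    """Indices of probes whose rank differs by at most max_rank_diff
    (as a fraction of the number of probes) between arrays x and y."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    return [i for i in range(n) if abs(rx[i] - ry[i]) / n <= max_rank_diff]

def running_median_knots(x, y, idx, window=3):
    """Knots of a piecewise-linear running-median curve through the
    invariant points, with x = array to normalize, y = baseline."""
    pts = sorted((x[i], y[i]) for i in idx)
    knots = []
    for start in range(len(pts) - window + 1):
        chunk = pts[start:start + window]
        knots.append((median(p[0] for p in chunk),
                      median(p[1] for p in chunk)))
    return knots

# Hypothetical toy intensities: the third probe changes rank sharply.
x = [10, 20, 30, 40, 50, 60]        # array to be normalized
y = [22, 41, 300, 79, 102, 118]     # baseline array
idx = invariant_set(x, y, max_rank_diff=0.2)
knots = running_median_knots(x, y, idx, window=3)
print(idx, knots)
```

The third probe is excluded because its rank moves from the middle of one array to the top of the other, which is the kind of probe an invariant set is meant to leave out.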
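Reading a normalized value off a piecewise linear curve amounts to linear interpolation between neighbouring knots, followed by truncation to [0, 65535]. The sketch below assumes the curve is given as a sorted list of (x, y) knots and that values outside the knot range are extrapolated along the end segments; the extrapolation rule and the name `normalize_value` are assumptions for illustration, not the exact end-ray fitting described above.

```python
def normalize_value(v, knots, lo=0.0, hi=65535.0):
    """Map intensity v through a piecewise-linear curve.

    knots: list of (x, y) pairs sorted by x, with at least two distinct
    x values (x = array to normalize, y = baseline).
    """
    if v <= knots[0][0]:
        # Below the first knot: extend the first segment (assumption).
        (x0, y0), (x1, y1) = knots[0], knots[1]
    elif v >= knots[-1][0]:
        # Above the last knot: extend the last segment (assumption).
        (x0, y0), (x1, y1) = knots[-2], knots[-1]
    else:
        # Find the segment whose x-range contains v.
        for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
            if x0 <= v <= x1:
                break
    y = y0 + (v - x0) * (y1 - y0) / (x1 - x0)
    # Truncate to the valid intensity range.
    return min(hi, max(lo, y))

# Hypothetical knots for illustration.
knots = [(20, 41), (40, 79), (50, 102)]
print(normalize_value(30, knots))      # between the first two knots
print(normalize_value(100000, knots))  # far above the range: truncated
```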
[Version 8/7/06+] If "Options/Normalization/Use selected probes: From probe set file" is selected, one can specify a list of probe set names (listed in rows) that are more likely to have stable expression values across different tissue types, e.g. some known housekeeping genes, or the 100 or so normalization genes in HG-U133 arrays. The probes of these probe sets will be used to fit the normalization curve. This function is mainly for normalizing very different tissue types. For similar or identical tissue types, this file is not needed since there are usually enough data points to determine the normalization relationship.
After "Open group" with probe data loaded in memory, one can use “Image/Normalization plot” to view the normalization scatterplot between the current array and the baseline array. For versions before 10/12/05, the R software must be installed. These plots can also be viewed during "Analysis/Normalize" by checking “View normalization plot”.
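Restricting the curve fit to a user-supplied list of probe sets can be pictured as a simple filter. The sketch assumes a plain-text file with one probe set name per line and probes stored as (probe set name, array intensity, baseline intensity) tuples; the file layout, the data layout, and the names `set_A`, `set_B`, `set_C` are all hypothetical.

```python
def load_probe_set_names(text):
    """Parse probe set names, one per non-empty line (assumed layout)."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def select_probes(probes, names):
    """Keep only probes that belong to one of the listed probe sets."""
    return [p for p in probes if p[0] in names]

# Hypothetical file contents and probe data.
file_text = "set_A\nset_C\n"
probes = [
    ("set_A", 100, 110),
    ("set_B", 200, 90),
    ("set_C", 300, 320),
]
selected = select_probes(probes, load_probe_set_names(file_text))
print(selected)
```

Only the selected probes would then be used to fit the normalization curve, while the fitted curve is still applied to every probe on the array.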

The current array can be selected by clicking a "CEL image" icon on the left pane. By default the array with the median brightness (defined as median probe intensity) is used as the baseline array, but other arrays can be selected. On clicking OK, the invariant set is computed and scatterplots in original and log scale (M-A plots) are displayed (data courtesy of Ariel Rabinovic):

In the plot, each point represents the PM or MM probe values in the two arrays. The blue line is the diagonal line y=x, the red circles are the probes selected in the “Invariant set”, and the green curve is the running median normalization curve based on the “Invariant set”. Usually a deviation between the blue line and the green curve indicates the need for normalization (that is, one array is brighter than the other). In the upper-right plot, the baseline array is plotted against the normalized probe values (on the X-axis), where we usually expect the invariant set to largely overlap with the diagonal line.
[Version before 10/12/05] When the Invariant Set is small or the high-end points are sparse, the running-median curve used by “Analysis/Normalize” may not perform well (this can be checked in the normalization plot). One may normalize such arrays individually using the smoothing spline method. This requires that the baseline array has already been selected (recorded in the DCP file) by performing “Analysis/Normalize” once. Then, for individual arrays, run “Image/Normalization Plot” and check the “Use smoothing spline to normalize” option to fit a smoothing spline through the “Invariant Set” points as the normalization curve. If this gives a reasonable result, one can also check “Use and save normalized values” to save the normalized values based on the smoothing spline in the DCP file. Afterwards, “Analysis/Model-based expression” can be performed as usual.
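For the log-scale (M-A) view, each probe is conventionally summarized by M, the difference of the log intensities, and A, their average. The sketch below uses base-2 logs and takes M as baseline minus array; the log base and the sign convention are assumptions, since the text does not state which dChip uses.

```python
import math

def ma_values(x, y):
    """Return (M, A) for one probe.

    x: intensity in the array being normalized, y: intensity in the
    baseline array. Both must be positive.
    """
    lx, ly = math.log2(x), math.log2(y)
    return ly - lx, (lx + ly) / 2

# A probe four times brighter on the baseline gives M = 2.
print(ma_values(2, 8))
# Equal intensities give M = 0, i.e. a point on the y = x diagonal.
print(ma_values(16, 16))
```

In this view a well-normalized pair of arrays shows the invariant set scattered around M = 0, which corresponds to the diagonal y = x in the original-scale plot.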


