On starting, dChip enters the “Analysis View”, which can also be accessed at any time by clicking the “Analysis” icon on the left pane or selecting the menu “View/Analysis”:

The “Analysis View” displays information such as the status of analysis processes, and colors error messages in red. Small Excel files and exported images are inserted into the “Analysis View” for convenience (uncheck “Tools/Options/Analysis/Insert Excel and Image outputs…” to disable this function). The analysis output can be saved into a Word file by “Analysis/Save”, or selected and copied by “Analysis/Copy”. If a problem occurs, it is helpful to attach these outputs to an email to us for diagnosis.
Prepare data. dChip analysis is based on a group of array data files a researcher generates, either in DAT or CEL format. All the arrays to be used in a single analysis should be of the same chip type. The current limit on the number of arrays is 400. To read in the data, select the menu “Analysis/Open Group”:

Type in a group name in the “Group name” drop-down list, or click the button with the down-arrow to select a previous group. The group settings such as file names and gene filtering parameters will be saved in a group configuration (“group name”.ini) file under the same directory as the “dchip.exe” file. Click the “Delete” button to erase the settings for the specified group.
Click the “Data directory” button to choose the directory containing the data (DAT, CEL, or DCP) files to be analyzed. Alternatively, a “Data file list” can be used when the data files are stored in several directories or when individual data files need to be specified. A data file list is a text file (make sure it has the .txt extension) in which each row contains a directory name or an individual data file name. If a list is specified, dChip will search only these directories and use only these files. The following is an example “data file list”:
E:\Affy data\dan\CA-H.cel
E:\Affy data\dan\CA-HR.cel
E:\Affy data\zugen
E:\Affy data\dan\PC-C.cel
E:\Affy data\dan\PC-H.cel
E:\Affy data\dan\PC-R.cel
E:\Affy data\dan2
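The way such a list might be expanded into individual files can be sketched as follows. This is a minimal illustration, not dChip's own code: the function name `expand_data_file_list`, the extension set, and the choice to look only at files directly inside a listed directory (no subdirectories) are all assumptions.

```python
import os
import tempfile

# Assumed set of data file extensions, taken from the file types named above.
DATA_EXTENSIONS = {".cel", ".dat", ".dcp"}


def expand_data_file_list(list_path):
    """Return the data files named by a list file, one entry per row.

    A row naming a directory contributes every data file directly inside it
    (assumption: subdirectories are not searched); any other row is taken
    to be an individual data file.
    """
    found = []
    with open(list_path, encoding="utf-8") as handle:
        for row in handle:
            entry = row.strip()
            if not entry:
                continue
            if os.path.isdir(entry):
                for name in sorted(os.listdir(entry)):
                    if os.path.splitext(name)[1].lower() in DATA_EXTENSIONS:
                        found.append(os.path.join(entry, name))
            else:
                found.append(entry)
    return found


# Small demonstration in a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    batch_dir = os.path.join(root, "zugen")
    os.mkdir(batch_dir)
    for name in ("a.CEL", "b.cel", "notes.txt"):
        open(os.path.join(batch_dir, name), "w").close()
    single_file = os.path.join(root, "CA-H.cel")
    open(single_file, "w").close()
    list_path = os.path.join(root, "data_file_list.txt")
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write(single_file + "\n" + batch_dir + "\n")
    expanded = [os.path.basename(p) for p in expand_data_file_list(list_path)]
```

In this demonstration the individually listed file comes first, followed by the two data files found in the listed directory; the `.txt` file in that directory is skipped.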
Next, specify whether you want to read in DAT or CEL files. CEL files are in text format and contain intensity values for PM and MM features; they are converted from pixel-level DAT files by MAS. dChip can also extract PM/MM probe values from DAT files using the MAS algorithm (75th percentile, excluding 1 outer layer of the pixels of each feature; Affymetrix 1999), and save them in DCP format (which can be exported to CEL format by “Image/Export CEL”). The MAS algorithm first determines the pixel-level coordinates of the corner features (see the header information of a CEL or DAT file), then linearly interpolates the coordinates of all the features. This works well even if the scanned array image is slightly skewed. However, if the coordinates of the corner features are determined incorrectly (e.g. page 4 of Li and Wong 2001a; this can be checked with the dChip outlier image), one may use the MAS software to correct the corner points of the DAT file and regenerate the CEL file. For more recent higher-density arrays with 712 rows and columns of features, dChip-converted CEL files may give different probe values from MAS. This may be due to the smaller feature size and smaller number of pixels per feature in these arrays; in such cases using MAS CEL files is recommended.
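The two steps described above (interpolating feature positions from the corners, then summarizing each feature's pixels) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the corners are treated as the centers of the four corner features, the interpolation is bilinear, and the percentile uses linear interpolation between closest ranks, which may differ from the definition MAS or dChip uses.

```python
def feature_center(corners, row, col, n_rows, n_cols):
    """Bilinearly interpolate a feature's pixel position from four corners.

    corners = (top_left, top_right, bottom_left, bottom_right), each an
    (x, y) pixel coordinate. Assumes n_rows > 1 and n_cols > 1.
    """
    top_left, top_right, bottom_left, bottom_right = corners
    u = col / (n_cols - 1)  # horizontal fraction across the array
    v = row / (n_rows - 1)  # vertical fraction down the array
    weights = ((1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v)
    points = (top_left, top_right, bottom_left, bottom_right)
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return x, y


def feature_intensity(pixels, border=1, percentile=75):
    """Summarize one feature's pixel block after trimming outer layers."""
    inner = [row[border:len(row) - border]
             for row in pixels[border:len(pixels) - border]]
    values = sorted(v for row in inner for v in row)
    if not values:
        raise ValueError("feature is too small to trim")
    # Assumed percentile definition: linear interpolation between ranks.
    position = (len(values) - 1) * percentile / 100
    lower = int(position)
    upper = min(lower + 1, len(values) - 1)
    return values[lower] + (values[upper] - values[lower]) * (position - lower)


# A slightly skewed 11 x 11 grid: the middle feature lands between corners.
center = feature_center(((0, 0), (100, 10), (5, 100), (105, 110)), 5, 5, 11, 11)

# A 5 x 5 pixel block whose outer layer is ignored; the inner 3 x 3 holds 1..9.
block = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 3, 0],
    [0, 4, 5, 6, 0],
    [0, 7, 8, 9, 0],
    [0, 0, 0, 0, 0],
]
summary = feature_intensity(block)
```

Because the interpolation follows the corner positions, a skewed scan shifts every feature position consistently, which is why wrongly located corners affect all probe values at once.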
If the “Ignore existing DCP file” checkbox is unchecked, dChip will search for existing DCP (dChip data) files with binary CEL data for faster access; these DCP files are saved in the same directory as the CEL or DAT files the first time you open a group, and have the same file names as the CEL or DAT files except for the “DCP” extension. They contain unnormalized and normalized probe cell intensities and also the model-based expression values to be explained later. By default the normalized values are used when available, but we can check “Use unnormalized data” to use unnormalized CEL data (e.g. for viewing purposes in DataView). If “Ignore existing DCP file” is checked, dChip will always extract data from CEL or DAT files. In V1.2+, check “Tools/Options/Analysis/Search and save DCP file in the Working directory” to store DCP files in a different place from the CEL files. This way we may perform different analyses (e.g. normalization using a different baseline array, or computing expression values with different options) and store the results in DCP files under different directories while maintaining a single copy of the CEL files.
dChip considers CEL, DAT or DCP files with the same time tag (found in the header line of a CEL file) to be identical files, and reads in only one of them. If the CEL files of two different samples happen to have the same time tag, one may change the time tag manually in the header of one of the CEL files (open the CEL file in a text editor), e.g. change “01/18/ 1 13:43:17” to “01/18/ 1 13:43:18”. If files don't have time tags, uncheck "Open group/Options/Check CEL file time tag" to read them all. (Discussion thread)
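The effect of the time tag check can be illustrated with a small sketch. The function name and the rule that the first file seen for a tag is the one kept are assumptions; the source does not say which of the identical files dChip reads. A missing time tag is represented here by an empty string, so untagged files collide with each other unless the check is turned off.

```python
def select_files_by_time_tag(files, check_time_tag=True):
    """Return the file names to read, given (file name, time tag) pairs.

    With the check on, only the first file seen for each time tag is kept
    (assumption). With the check off, every file is kept.
    """
    if not check_time_tag:
        return [name for name, _ in files]
    seen = set()
    kept = []
    for name, time_tag in files:
        if time_tag in seen:
            continue
        seen.add(time_tag)
        kept.append(name)
    return kept


tagged = [
    ("sample1.CEL", "01/18/ 1 13:43:17"),
    ("sample1.DCP", "01/18/ 1 13:43:17"),  # same tag: treated as identical
    ("sample2.CEL", "01/18/ 1 13:43:18"),
]
untagged = [("x.CEL", ""), ("y.CEL", "")]

kept_tagged = select_files_by_time_tag(tagged)
kept_untagged_checked = select_files_by_time_tag(untagged)
kept_untagged_unchecked = select_files_by_time_tag(untagged, check_time_tag=False)
```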
More general analysis options can be selected by clicking the “Options” button (or “Tools/Options/Analysis”). For example, a “Working directory” can be specified as the default directory for dChip to output analysis files.
Uncheck "Analysis/Open group/Options/Load probe data in memory" to skip loading probe data, so that a large dataset containing many arrays or large array types (e.g. 100K SNP array) can be loaded faster. Then do normalization and model-based expression computation the same way as before. However, the CEL image and PM/MM data views are not available, since they use probe-level data.
[V2/1/07+] At "Open group", specify "Suffix of TXT/CHP file" as ".chp" to read expression or SNP calls from CHP files. This avoids converting CHP files to TXT files to use in dChip. Related link: Parsing a genotyping CHP File.
dChip will also look for MAS’s analysis result TXT files (a text output of the CHP file, one for each DAT or CEL file) for Presence calls (P/A). The TXT file should have the same file name as the corresponding DAT or CEL file but with the extension “.txt”, and its header line should contain "...Probe Set... Avg Diff…Abs Call..." or "...Probe Set...Signal…Detection...". If TXT files are not found, dChip will compute the Presence calls using a simplified version of the MAS4 algorithm (Affymetrix 1999); that is, the same decision matrix is used for making calls, but the background is calculated on the whole chip instead of on 16 sectors. In one comparison the calls generated by dChip had 93% agreement with MAS’s calls. We encourage users to obtain TXT files to allow dChip to use MAS’s calls. The presence calls read or computed are saved in DCP files, and will be used when DCP files are loaded in future dChip sessions.
When a TXT file has any probe set not in the CDF file, dChip will ignore the TXT file and compute its own P/A calls for all probe sets. One can disable this behavior by checking “Tools/Options/Analysis/Allow TXT files to contain probe sets not in the CDF file”, so that dChip ignores the unknown probe sets (e.g. those masked by the “Probe set mask file”), but still uses the P/A calls of the known probe sets.
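A quick way to check whether a TXT file's header line matches one of the two patterns above is sketched below. The function names and the return labels are hypothetical, the example header lines are made up for illustration, and the check assumes the column names must appear in the order shown.

```python
def contains_in_order(line, parts):
    """Return True if every part occurs in line, in the given order."""
    position = 0
    for part in parts:
        index = line.find(part, position)
        if index < 0:
            return False
        position = index + len(part)
    return True


def txt_header_kind(header_line):
    """Classify a TXT header line by the columns it carries, or None."""
    if contains_in_order(header_line, ("Probe Set", "Avg Diff", "Abs Call")):
        return "avg_diff"
    if contains_in_order(header_line, ("Probe Set", "Signal", "Detection")):
        return "signal"
    return None


# Hypothetical header lines, tab-delimited.
kind_a = txt_header_kind("Probe Set Name\tAvg Diff\tAbs Call")
kind_b = txt_header_kind("Probe Set Name\tSignal\tDetection")
kind_c = txt_header_kind("Gene\tValue")
```

A header that matches neither pattern returns `None`, which corresponds to the case where the Presence calls would have to be computed rather than read.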
Check the “Read in MAS5 Signal” option to read in the MAS5 “Signal” from the MAS5 analysis result TXT files (one for each DAT or CEL file). Note that “Analysis/Normalize” still operates at the CEL level, and “Analysis/Model-based expression” will overwrite the MAS5 Signal values. Thus it is better to scale the MAS5 Signals to the same target intensity beforehand (e.g. in MAS5), so that one can proceed to high-level analysis without these two steps in dChip. A “Modeled” sign will show up in the lower-right corner, only to indicate that the expression values are ready. The MAS5 Signal values in dChip can also be exported by “Tools/Export data/Expression value” into a file similar to a MAS5 Pivot table, and then used as follows. [V1.2+]
A MAS5 Pivot table containing the Signal values of multiple arrays cannot be used with "Analysis/Open group". One may modify the Pivot table slightly, so it can be read in dChip by the "Get external data" function without CEL values. At this point, the “Analysis/Normalize” function can be applied to normalize the data at the expression value level. Afterwards, the high-level analysis can be applied without the “Analysis/Model-based expression” step.
After completing the “Analysis/Open group/Data files” dialog, click the “Other information” tab at the top, and another dialog will be shown:

The CDF (Chip Description File) file for a particular array type can be obtained from the Affymetrix library file website (Support/Library files; download and unzip the library file of a specific array type to get its CDF file), or from Affymetrix GCOS software (“library” directory). A user can open a CEL file (of text format) to confirm the chip type information from the header (in the “DatHeader=” line, e.g. HG_U95Av2.1sq). dChip will also convert the CDF file into binary format for faster access next time. Checking “Ignore existing .cdf.bin file” will force dChip to re-extract information from the text format CDF file even if a binary version already exists.
Sometimes it is desirable to reorganize the probes of an array type into custom probe sets and compute expression values for them. dChip can read in a custom CDF file in a dChip-specific CDF format. Such a file can have a name such as “HG-U133_Plus_2_dchip_example.cdf”, where “_dchip_” in the file name identifies the dChip-specific CDF format, the string before it identifies the original array type so that dChip can search for the corresponding CEL files, and the string after it can be anything for the user’s own identification. The header sections [CDF] and [Chip] are copied from the standard CDF file, but “NumberOfUnits” is changed to the number of probe sets that the custom CDF file contains. Only the probes in the custom CDF file will then be used for normalization and expression computation. Such a custom CDF file can also be constructed for Nimblegen arrays and used in dChip to analyze correctly formatted CEL files for Nimblegen arrays.
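The naming convention can be made concrete with a short sketch that splits a custom CDF file name into its two parts. The function name is hypothetical, and treating the “_dchip_” marker as case-sensitive is an assumption.

```python
import os


def parse_custom_cdf_name(file_name):
    """Split a custom CDF file name into (array type, user label).

    Returns None when the name does not follow the "_dchip_" convention.
    """
    stem, extension = os.path.splitext(os.path.basename(file_name))
    if extension.lower() != ".cdf" or "_dchip_" not in stem:
        return None
    array_type, _, label = stem.partition("_dchip_")
    return array_type, label


custom = parse_custom_cdf_name("HG-U133_Plus_2_dchip_example.cdf")
standard = parse_custom_cdf_name("HG-U133_Plus_2.cdf")
```

For the example name in the text, the array type is `HG-U133_Plus_2` and the user label is `example`; a standard CDF name yields `None`.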
It is optional but recommended to specify a “Gene information file” for the current array type.
If the CEL data file names are not informative, we can specify alternative names for them. A “Sample information file” is a tab-delimited text file; if it is edited in Excel, make sure to save it in text format by “File/Save As/Save as type: Text (Tab delimited)”. The first (header) line is required. The first two columns are also required: they are the array file names (without the directory name and the .CEL or .DCP extension; one can copy the “Array” column from the “Array summary file” to the “Array name” column here) and the corresponding sample names. The sample names should be different for each array, and should also be different from any array name; a sample name can be left blank, in which case it is the same as the array name. The remaining columns are optional descriptions of sample properties using discrete words or numbers. Here is an example file:
| Array name | Sample name | Grade | Marker 1 | Marker 2 | Marker 3 |
| --- | --- | --- | --- | --- | --- |
| LG2000102601AA | N1 | II | FALSE | positive | positive |
| LG2000102602AA | N3 | III | FALSE | negative | positive |
| LG2000102603AA | N4 | II | FALSE | positive | low-positive |
| LG2000102604AA | N5 | III | FALSE | negative | negative |
| LG2000102605AA | N6 | II | FALSE | positive | positive |
| LG2000102606AA | N7 | III | FALSE | negative | negative |
| LG2000102607AA | N8 | III | FALSE | negative | negative |
| LG2000102608AA | N9 | III | TRUE | negative | negative |
| LG2000102609AA | N10 | II | FALSE | positive | positive |
| LG2000102610AA | N11 | III | FALSE | negative | negative |
| LG2000102611AA | N12 | III | TRUE | negative | negative |
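Before loading such a file, the naming rules stated above can be checked with a small script. This is a sketch only: the function name is hypothetical and the checks reflect the rules as described here, not dChip's actual validation. The second data row below leaves the sample name blank on purpose to show the fallback to the array name.

```python
import csv
import io


def read_sample_names(text):
    """Map array names to sample names from tab-delimited text.

    Enforces the stated rules: a header line with at least two columns,
    unique sample names, and no sample name equal to another array's name.
    A blank sample name falls back to the array name.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows or len(rows[0]) < 2:
        raise ValueError("a header line with at least two columns is required")
    names = {}
    for row in rows[1:]:
        array = row[0].strip()
        sample = row[1].strip() if len(row) > 1 else ""
        names[array] = sample or array
    samples = list(names.values())
    if len(set(samples)) != len(samples):
        raise ValueError("sample names must differ between arrays")
    clashes = [s for a, s in names.items() if s != a and s in names]
    if clashes:
        raise ValueError("sample names must differ from array names")
    return names


example_text = (
    "Array name\tSample name\tGrade\n"
    "LG2000102601AA\tN1\tII\n"
    "LG2000102602AA\t\tIII\n"
)
sample_names = read_sample_names(example_text)
```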
Using a “sample information file” is highly recommended. It is very useful in later functions such as finding enriched sample clusters and selecting samples by category, and it facilitates the visual assessment of the sample clustering better than textual sample names do. As an example, if a sample name “14c1” refers to “day 14, pair one, control sample”, we can create three sample information columns called “Day”, “Pair” and “Treatment”, and this sample has the values “14”, “1” and “C” for the three columns.
One may add a numerical column to the “sample information file”. The column header needs to contain “(numeric)”, for example, “Time(numeric)”. Such a continuous variable will be standardized and displayed at the top of the clustering picture.
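What standardizing a numeric column means can be sketched as follows. The exact formula dChip uses is not stated here; this sketch assumes the usual z-score (subtract the mean, divide by the sample standard deviation), and the function names are hypothetical.

```python
import statistics


def numeric_column_indexes(header):
    """Return the indexes of columns whose header contains "(numeric)"."""
    return [i for i, name in enumerate(header) if "(numeric)" in name]


def standardize(values):
    """Return z-scores (assumed definition: mean 0, sample SD 1)."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # needs at least two values
    if spread == 0:
        return [0.0 for _ in values]
    return [(value - mean) / spread for value in values]


header = ["Array name", "Sample name", "Time(numeric)"]
indexes = numeric_column_indexes(header)
scores = standardize([1.0, 2.0, 3.0])
```

With this definition the three values 1, 2 and 3 become -1, 0 and 1, so columns measured on different scales become comparable in the display.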
In the “Analysis/Open Group/Other information” dialog, we can specify an Affymetrix “probe set mask file” (*.msk file; or a tab-delimited text file with the first column of each line being the probe set name) to exclude some probe sets from the analysis. These probe sets will be handled as if they do not exist on the chip, by marking their CELs as “QC” (quality control). Thus their CELs are not used for the CEL-level normalization, and they do not enter any downstream analysis. Make sure to check “Ignore existing DCP file” and “Ignore existing cdf.bin file” to re-extract the CEL and CDF files whenever you specify a probe set mask file or change its content. If no TXT files exist, dChip will re-compute the present/absent calls as well, excluding the masked probes.
When using the probe set mask file, be sure to extract DCP files anew from the CEL and TXT files (by checking “Ignore existing DCP file” and unchecking “Data file type/DCP file”). This is because an existing DCP file has a format corresponding to the original CDF file, and thus cannot be used with the new CDF file in which some probe sets are masked. By the same token, do not combine, in the same group, DCP files made with the original CDF file and DCP files made with the masked CDF file.
[V2/23/05+] The probe set mask file can also list individual probes to mask them out from the CDF file. Such probes may, for example, have been found by other means to cross-hybridize. An example file is HG-U133A mask file.txt (edited in Excel but saved as a tab-delimited text file), where in the 2nd column, “all”, an empty value, or “1-x” (x is the maximal number of probe pairs in the probe set) means to mask out the whole probe set, and a probe number between “|” (such as “|10|”) is used to mask out individual probes. The probe numbers start from 1 and follow the same order as in the CDF files. Make sure not to have spaces in a line (e.g. use “1|2|5” instead of “1 | 2 | 5”). Alternatively, a custom CDF file may be used for similar purposes.
[V2/28/07+] If the probe set mask file has "Including" in the first line, the file will be regarded as a probe set inclusion file. Only the probe sets in this file will be included in the cdf.bin file and used in downstream analysis, including normalization. This is useful if only a subset of the probe sets have targets in the hybridization cocktail.
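The second-column syntax can be illustrated with a small parser. This is a sketch under assumptions: the function name is hypothetical, “all” is matched case-insensitively, and any range other than the full “1-x” is treated as unsupported because the text does not describe partial ranges.

```python
def parse_mask_spec(spec, probe_count):
    """Return the set of probe numbers (1-based) masked by a 2nd-column value.

    "all", an empty value, or "1-x" (x = probe_count) masks the whole probe
    set; values such as "|10|" or "1|2|5" mask individual probes.
    """
    spec = spec.strip()
    if " " in spec:
        raise ValueError("spaces are not allowed in a mask specification")
    whole_set = set(range(1, probe_count + 1))
    if spec == "" or spec.lower() == "all" or spec == f"1-{probe_count}":
        return whole_set
    # int() raises ValueError for anything that is not a plain probe number.
    numbers = {int(token) for token in spec.split("|") if token}
    if not numbers <= whole_set:
        raise ValueError("probe number outside the probe set")
    return numbers


single = parse_mask_spec("|10|", 16)
several = parse_mask_spec("1|2|5", 11)
whole_by_word = parse_mask_spec("all", 3)
whole_by_range = parse_mask_spec("1-11", 11)
```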
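The difference between the mask and inclusion behavior can be sketched on a list of probe set names. The function name and the probe set names are hypothetical, the sketch handles only whole probe sets (the 2nd column is ignored), and it assumes that in an inclusion file the first line is a header only, with the probe sets starting on the second line.

```python
def apply_probe_set_file(probe_sets, lines):
    """Return the probe sets that remain after applying a mask/inclusion file.

    If the first line contains "Including", only the listed probe sets are
    kept; otherwise the listed probe sets are removed.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    inclusion = bool(rows) and "Including" in rows[0]
    listed_rows = rows[1:] if inclusion else rows
    listed = {row.split("\t")[0] for row in listed_rows}
    if inclusion:
        return [name for name in probe_sets if name in listed]
    return [name for name in probe_sets if name not in listed]


all_sets = ["A_at", "B_at", "C_at"]  # hypothetical probe set names
kept_by_inclusion = apply_probe_set_file(all_sets, ["Including\n", "B_at\n"])
kept_by_mask = apply_probe_set_file(all_sets, ["B_at\n"])
```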
Click “OK” after filling in the file names. dChip reads in the CDF file, CEL files and other information files, and constructs icons for viewing array images and probe-level PM/MM data:

Probe intensities from CEL files and presence calls are loaded into memory and also saved into DCP files. Later on, the normalized probe values and model-based expression values will also be saved into DCP files, so that these two steps generally need to be performed only once.
An “array summary file” is saved after all arrays are read in. Clicking the Excel icon in the left pane starts Excel to view this file (if Excel does not start, go to the directory and double-click the file to view it). As with most other files exported by dChip, this file is a tab-delimited text file with the “xls” extension, so that it can be opened easily in Excel. At this stage the summary file looks like:
| Number | Array | File Name | Median Intensity | P call % |
| --- | --- | --- | --- | --- |
| 1 | 130a | E:\Affy data\dan\130\130a.CEL | 702 | 20.4236 |
| 2 | 130b | E:\Affy data\dan\130\130b.CEL | 728 | 28.2929 |
| 3 | 130c | E:\Affy data\dan\130\130c.CEL | 658 | 25.9644 |
| 4 | 130d | E:\Affy data\dan\130\130d.CEL | 557 | 24.5616 |
The last two columns are the median probe intensity and the “P call %” (the percentage of probe sets called “Present” in an array). The median intensity is computed from unnormalized probe values by selecting one probe for every 5 by 5 region on the array. This may give a slightly different result from using all probes to compute the median. This example uses older arrays (from before early 2001), which often have high intensity values and signal saturation. Affymetrix has since adjusted the scanner settings so that signal intensities are about 1/10 of what they were before.
Some arrays may have an unusually low “P call %” (< 10%). One may check the “Image View” for potential problems. Good images have a dark background and bright foreground signals, while problematic sample preparation, hybridization or scanning may result in high background noise that overwhelms the real signals. Sometimes it is necessary to exclude these arrays from further analysis, by moving the files out of the data directory and redoing “Analysis/Open group”.
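Why the subsampled median can differ from the full median is easy to see in a small sketch. The function name is hypothetical, and picking the top-left probe of each 5 by 5 region is an assumption; the text does not say which probe in a region is selected.

```python
import statistics


def subsampled_median(intensities, step=5):
    """Median of one probe per step-by-step region (top-left probe assumed)."""
    sample = [row[col]
              for row in intensities[::step]
              for col in range(0, len(row), step)]
    return statistics.median(sample)


# A made-up 10 x 10 grid whose values increase from 0 to 99.
grid = [[row * 10 + col for col in range(10)] for row in range(10)]
fast_median = subsampled_median(grid)
full_median = statistics.median(value for row in grid for value in row)
```

On this deliberately tiny and non-random grid the subsample holds only four values, so the two medians differ widely; on a real array with many thousands of regions the difference is expected to be small, as the text says.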
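Arrays with a low “P call %” can also be picked out of the array summary file with a few lines of script. This sketch assumes the file has exactly the column headers shown in the example table above (“Array” and “P call %”); the function name is hypothetical.

```python
import csv
import io


def low_p_call_arrays(summary_text, threshold=10.0):
    """Return the names of arrays whose "P call %" is below the threshold."""
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    return [row["Array"] for row in reader
            if float(row["P call %"]) < threshold]


# The example summary table from the text, rebuilt as tab-delimited text.
rows = [
    ["Number", "Array", "File Name", "Median Intensity", "P call %"],
    ["1", "130a", r"E:\Affy data\dan\130\130a.CEL", "702", "20.4236"],
    ["2", "130b", r"E:\Affy data\dan\130\130b.CEL", "728", "28.2929"],
    ["3", "130c", r"E:\Affy data\dan\130\130c.CEL", "658", "25.9644"],
    ["4", "130d", r"E:\Affy data\dan\130\130d.CEL", "557", "24.5616"],
]
summary_text = "\n".join("\t".join(row) for row in rows) + "\n"

flagged_default = low_p_call_arrays(summary_text)
flagged_strict = low_p_call_arrays(summary_text, threshold=25.0)
```

None of the four example arrays falls below the 10% level; raising the threshold to 25 is done here only to show the function returning a non-empty list.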