dChip: Prepare data

 

Add new arrays                       File format                  Example method description

 

To use dChip, the user needs to provide Affymetrix array data files (in CEL or DAT format, or a public dataset from NCBI GEO, the Broad Institute or the Affymetrix data resource center) and the CDF file (Chip Description File). CEL files contain the summarized probe-level data (perfect match and mismatch probes) of Affymetrix arrays. dChip uses the CEL files for normalization and for computing model-based expression values from the probe-level data.

 

[Note: dChip versions from 5/26/05 onward can read binary CEL files directly, so the following conversion is not needed.] You may use this Affymetrix CEL file conversion tool to convert the latest binary-format (Version 4) CEL files to text-format (Version 3) CEL files that dChip can read. The Affymetrix conversion tool converts all CEL files in a directory from version 4 to version 3 and leaves the file names unchanged. You can compare the CEL file size before (4 MB) and after (10 MB) to make sure the conversion is done. Make sure the CEL files are not read-only (e.g. when copied from a CD) before conversion. If the conversion fails but you have DAT files, you can ask dChip to read them instead of the CEL files. To convert the file format automatically, the GDAC file SDK may help.
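Besides comparing file sizes, one may inspect the first bytes of a CEL file to tell the two formats apart. The following is a minimal sketch, not part of dChip. It assumes that a text-format CEL file begins with a `[CEL]` section header and that a binary Version 4 file begins with the little-endian 32-bit integers 64 and 4; these assumptions should be verified against the Affymetrix file format documentation. The function names are hypothetical.

```python
import struct

def cel_format(header: bytes) -> str:
    """Guess the CEL flavour from the first bytes of a file.

    Assumption: text CEL files start with "[CEL]"; binary Version 4
    files start with the int32 values 64 (magic) and 4 (version).
    """
    if header.lstrip().startswith(b"[CEL]"):
        return "text (version 3)"
    if len(header) >= 8:
        magic, version = struct.unpack("<ii", header[:8])
        if magic == 64 and version == 4:
            return "binary (version 4)"
    return "unknown"

def cel_file_format(path: str) -> str:
    """Read only the first few bytes of the file and classify it."""
    with open(path, "rb") as fh:
        return cel_format(fh.read(16))
```

Running `cel_file_format` on every CEL file in a directory before and after conversion shows which files were actually converted.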

 

To read in cDNA array data, one may make an external data file in which every two columns hold the green and red channel intensities of one array (e.g. obtained from a GenePix GPR file), and then read it into dChip by “Analysis/Get external file”. Normalization and high-level analysis may then be performed.
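The two-columns-per-array layout can be sketched as follows. This is only an illustration: the header names, the leading probe-name column and the function names are assumptions, and the exact layout that “Analysis/Get external file” expects is not specified here, so the output should be checked against a file that dChip accepts.

```python
import csv
import io

def interleave_channels(probe_names, arrays):
    """Build rows with a green and a red column for each array.

    arrays: list of (array_name, green_values, red_values) tuples.
    The header names below are hypothetical.
    """
    header = ["probe"]
    for name, green, red in arrays:
        if len(green) != len(probe_names) or len(red) != len(probe_names):
            raise ValueError(f"array {name!r} does not match the probe list")
        header += [f"{name}_green", f"{name}_red"]
    rows = [header]
    for i, probe in enumerate(probe_names):
        row = [probe]
        for _, green, red in arrays:
            row += [green[i], red[i]]
        rows.append(row)
    return rows

def to_tab_delimited(rows):
    """Serialize the rows as tab-delimited text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For example, `to_tab_delimited(interleave_channels(["g1", "g2"], [("a1", [1, 2], [3, 4])]))` yields one header line followed by one line per probe.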

What arrays to combine as a group

Generally we want to combine as many arrays as possible that are hybridized to similar tissues or cell lines in a single group. More arrays increase the chance of selecting well-behaved probes for expression calculation, and the probe response pattern can be learned better from the data. It is also desirable that these arrays are generated by the same core facility or person within a short period, to minimize experimental variation. The arrays do not need to be replicates.

For better model fitting and outlier detection, a dChip group should contain more than 5 arrays. The target gene also needs to be present in several arrays; otherwise the probe set shows only random noise around 0 in all arrays. The expression values are then correctly close to 0, but the model does not help much in outlier detection.

To perform analysis of different tissues, cell types or very distinct conditions, please see normalizing different tissues, and tissue effect.

 

Add new arrays

 

If a group of arrays has been analyzed and new arrays come in, one can combine the old and new arrays as a new group and redo the normalization (if needed, check “Ignore the normalized data” to make sure all arrays are normalized to the same baseline array) and the model-based expression. If the expression values of the old arrays should not change, one can use the probe sensitivity index file of the old group to analyze the new arrays.

 

For the old and new groups of arrays above, or whenever two groups of arrays are analyzed separately, the resulting expression indexes are generally not comparable for the same probe set. This is because the arrays in each group may be normalized to a different baseline array (thus having different overall signal brightness), and the model-based expression indexes for the same probe set may be based on different subsets of probe pairs in the two groups. The latter is due to probe outlier detection; probe sensitivities usually vary, and including or excluding a probe may yield large differences in expression values. However, the relative expression between samples in terms of fold change should remain similar. For example, one scheme estimates the expression values as 100 and 200 for two samples, and another scheme gives 500 and 1000; the two are equivalent, since the expression values for one probe set are relative measures of the underlying mRNA concentrations.
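The equivalence of the two schemes in the example can be checked with a short calculation. The function name is hypothetical and the values are those from the text.

```python
def fold_change(baseline: float, experiment: float) -> float:
    """Ratio of the expression value in the experiment to the baseline."""
    return experiment / baseline

scheme_a = (100.0, 200.0)   # expression values from one analysis
scheme_b = (500.0, 1000.0)  # same two samples, analyzed in another group

# The absolute values differ by a factor of 5, but the fold change agrees.
same_fold_change = fold_change(*scheme_a) == fold_change(*scheme_b)
```

Both schemes give a fold change of 2, so comparisons within one group remain meaningful even though values from the two groups should not be mixed directly.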

Input and output file format

Most files that dChip reads and writes are tab-delimited text files, such as the Expression data file, the Gene information file and the Gene list file (which has the probe set name in the first column of each row). They may have an XLS extension so that Excel opens them easily. One may change the extension of such files to TXT and open them in a text editor, or edit them in Excel (make sure to save in “tab-delimited text” format).

 

Sometimes the exported files may not appear to have an extension in dialogs or in Windows Explorer. To show the file extensions, open “Windows Explorer”, select “Tools/Folder options/View” and uncheck “Hide file extensions for known file types”.

 

dChip also exports binary DCP, CDF.BIN and PSI files for faster processing; their formats are available as C++ code on request. In versions after Feb. 06, the DCP and CDF.BIN file format was changed from format 3 to format 4 to accommodate an unlimited number of probe sets, so older DCP and CDF.BIN files cannot be used with the latest dChip. You may either re-process the CEL and TXT files, or use “Tools/Export expression data” to export expression values and calls and then use “Analysis/Get external data” to read the data back into a new dChip version.

 

The format of the Expression data file is as follows. The first column is the probe set name. The columns “gene”, “Acession”, “LocusLink” and “Description” are optional: “gene” may occur without the other three, but “Acession”, “LocusLink” and “Description” always appear together and only when the “gene” column is present. After these columns come the “expression value, call, and standard error” columns for every sample, named sample_name, “sample_name call” and “sample_name SE”. The “call” and “standard error” columns are optional.
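The column rules above can be sketched as a header parser. This is an illustration rather than dChip's own reader: it assumes the header is a single tab-delimited line, accepts both the “Acession” spelling used here and “Accession”, and treats any column that is not the “call” or “SE” column of the preceding sample as the start of a new sample. The function name is hypothetical.

```python
# Annotation column names; both spellings of the accession column are
# accepted because the exact header spelling is an assumption.
ANNOTATION_COLUMNS = ("gene", "Acession", "Accession", "LocusLink", "Description")

def parse_expression_header(header_line: str):
    """Return (probe_column, annotation_columns, samples).

    samples maps each sample name to the 0-based column indexes of its
    expression value and, when present, its call and standard error.
    """
    cols = header_line.rstrip("\r\n").split("\t")
    probe_col, rest = cols[0], cols[1:]
    annotations = []
    while rest and rest[0] in ANNOTATION_COLUMNS:
        annotations.append(rest.pop(0))
    samples = {}
    current = None
    for idx, name in enumerate(rest, start=1 + len(annotations)):
        if current is not None and name == f"{current} call":
            samples[current]["call"] = idx
        elif current is not None and name == f"{current} SE":
            samples[current]["se"] = idx
        else:
            current = name
            samples[current] = {"value": idx}
    return probe_col, annotations, samples
```

The returned indexes can then be used to pick the expression value, call and standard error out of each data row after splitting it on tabs.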

 

Example method description using dChip

 

Array normalization, expression value calculation and clustering analysis were performed using DNA-Chip Analyzer (www.dchip.org; Li & Wong 2001a). The Invariant Set Normalization method (Li & Wong 2001b) was used to normalize arrays at the probe cell level to make them comparable, and the model-based method (Li & Wong 2001b) was used for probe selection and for computing expression values. Standard errors were attached to these expression levels as a measure of accuracy and were subsequently used to compute 90% confidence intervals of fold changes in two-sample or two-group comparisons (Li & Wong 2001b). The lower confidence bounds of the fold changes were conservative estimates of the real fold changes. Genes whose expression increased or decreased more than 2-fold after treatment (by the lower confidence bound) were selected for further study.

 

Hierarchical clustering analysis is used to group genes with the same expression pattern (Li and Wong 2003). A gene is selected for clustering if (1) its expression values in the 20 samples have a coefficient of variation (standard deviation / mean) between 0.5 and 10, and (2) it is called “Present” by the MAS5 (or GCOS or dChip) software in more than 5 samples. The expression values of a gene across the 20 samples are then standardized by linear transformation to have mean 0 and standard deviation 1, and the distance between two genes is defined as 1 - r, where r is the standard correlation coefficient between the 20 standardized values of the two genes. The two genes with the smallest distance are first merged into a super-gene, connected by branches whose length represents their distance, and removed from future merging. The expression level of the newly formed super-gene in each sample is the average of the standardized expression levels of the two genes (average-linkage). Then the next pair of genes (or super-genes) with the smallest distance is chosen to merge, and the process is repeated until all genes are merged into one cluster. The dendrogram in Figure X illustrates the final clustering tree, where genes close to each other have high similarity in their standardized expression values across the 20 samples.
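The filtering, standardization and merging steps described above can be sketched in plain Python. This is a naive illustration under stated assumptions, not dChip's implementation: the sample standard deviation is used, “Present” calls are assumed to be encoded as "P", a constant profile is given correlation 0, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev

def passes_filter(values, calls, cv_min=0.5, cv_max=10.0, min_present=5):
    """Keep a gene if its CV is in range and it is Present often enough."""
    m = mean(values)
    if m <= 0:
        return False
    cv = stdev(values) / m
    n_present = sum(1 for c in calls if c == "P")  # "P" is an assumed label
    return cv_min <= cv <= cv_max and n_present > min_present

def standardize(values):
    """Linear transformation to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Standard (Pearson) correlation coefficient; 0.0 if a profile is flat."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cluster(profiles):
    """Merge genes pairwise by smallest distance 1 - r.

    profiles: dict mapping gene name -> standardized values.
    Returns a nested tuple (left, right, distance); leaves are gene names.
    """
    active = [(name, list(vals)) for name, vals in profiles.items()]
    while len(active) > 1:
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                d = 1 - correlation(active[i][1], active[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (tree_i, prof_i), (tree_j, prof_j) = active[i], active[j]
        # The super-gene profile is the per-sample average of the two members.
        merged = ((tree_i, tree_j, d), [(a + b) / 2 for a, b in zip(prof_i, prof_j)])
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
    return active[0][0]
```

Each merge recomputes all pairwise distances, which is adequate for a few hundred genes; a larger gene list would call for an established clustering library.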