dChip: Reading external data files

 

Combine external data file                  Extract subset of data file                    Read cDNA array data

 

If no DAT/CEL files are available for Affymetrix array data, dChip can still read in a tab-delimited data file (in text format) with expression value, absolute call and standard error (SE) data as columns:

 

probe set         130a      130a call   130a SE   130b      130b call   130b SE
AFFX-BioB-5_at    3348.89   P           281.398   3825.92   P           225.898
AFFX-BioB-M_at    3478.22   P           400.583   6778.75   P           273.612
AFFX-BioB-3_at    2322.84   P           180.836   3437.77   P           158.029
AFFX-BioC-5_at    7837.85   P           628.778   7590.25   P           402.236
AFFX-BioC-3_at    5887.03   P           501.962   6473.87   P           316.34
AFFX-BioDn-5_at   4416.52   P           711.782   8313.93   P           556.247
AFFX-BioDn-3_at   16049.3   P           1870.28   18681.5   P           1048.73
AFFX-CreX-5_at    24904.8   P           1728.4    29241.8   P           1095.15
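
To make the column layout concrete, here is a minimal Python sketch that parses a file of this shape outside dChip. The function name `read_external_data` and its `has_call`/`has_se` flags are illustrative assumptions, not part of dChip, and the sketch assumes there are no blank (missing) cells.

```python
import csv
import io


def read_external_data(handle, has_call=True, has_se=True):
    """Parse a tab-delimited table: probe set name first, then for each array
    a signal column optionally followed by a call column and an SE column.
    (Hypothetical helper; assumes no blank cells.)"""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    width = 1 + int(has_call) + int(has_se)  # number of columns per array
    arrays = header[1::width]                # array names head the signal columns
    data = {}
    for row in reader:
        if not row:
            continue
        probe, cells = row[0], row[1:]
        per_array = {}
        for i, name in enumerate(arrays):
            block = cells[i * width:(i + 1) * width]
            record = {"signal": float(block[0])}
            pos = 1
            if has_call:
                record["call"] = block[pos]
                pos += 1
            if has_se:
                record["se"] = float(block[pos])
            per_array[name] = record
        data[probe] = per_array
    return arrays, data


# Two rows taken from the table above.
sample = (
    "probe set\t130a\t130a call\t130a SE\t130b\t130b call\t130b SE\n"
    "AFFX-BioB-5_at\t3348.89\tP\t281.398\t3825.92\tP\t225.898\n"
    "AFFX-BioB-M_at\t3478.22\tP\t400.583\t6778.75\tP\t273.612\n"
)
arrays, data = read_external_data(io.StringIO(sample))
print(arrays)
print(data["AFFX-BioB-5_at"]["130b"])
```

For a file that holds signal values only, the same sketch would be called with `has_call=False, has_se=False`.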

 

The absolute call and SE columns are optional and can be specified in the “Analysis/Get External Data” dialog.

 

The first row of the data file should contain array names and the first column should contain gene names. Data files exported by dChip may have additional columns of gene annotations starting from the 2nd column (such as “gene, Accession, LocusLink, Description”). To read such files with “Analysis/Get External Data”, delete these columns in Excel and save the file as a tab-delimited text file. Alternatively, specify "Skip column 2 to x" to ignore columns 2 to x.
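
If Excel is not convenient, the annotation columns can also be removed with a short script. The following Python sketch drops columns 2 to x from a tab-delimited file; the helper name `skip_columns` and the annotation values in the sample are made-up placeholders, not dChip output.

```python
import csv
import io


def skip_columns(in_handle, out_handle, last_skipped):
    """Copy a tab-delimited table, dropping columns 2..last_skipped
    (1-based, inclusive). Hypothetical helper for illustration."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row[:1] + row[last_skipped:])


# Placeholder annotation values; only the signal values come from the table above.
exported = (
    "probe set\tgene\tAccession\tLocusLink\tDescription\t130a\t130b\n"
    "AFFX-BioB-5_at\tg1\tacc1\tll1\tdesc1\t3348.89\t3825.92\n"
)
out = io.StringIO()
skip_columns(io.StringIO(exported), out, last_skipped=5)
print(out.getvalue())
```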

 

If there are missing values in the external data file, leave them blank in Excel and then save as tab-delimited text files so that “Get external data” will regard blank cells as missing values. If there may be any missing values in the last column, add an additional pseudo last column with all values of “1” to make the data read correctly. Afterwards use “Array list file” to specify only the real samples that will be used in the analysis.
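
The pseudo last column can also be appended by a script rather than by hand. This is a minimal Python sketch under the assumption that the file is tab-delimited with a header row; the helper name `add_pseudo_column` and the column name "pseudo" are illustrative choices.

```python
import csv
import io


def add_pseudo_column(in_handle, out_handle, name="pseudo"):
    """Append a last column filled with "1" so that blank cells in the
    former last column are no longer at the end of a line."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow(header + [name])
    for row in reader:
        # Restore trailing blank cells that a tool may have trimmed.
        row = row + [""] * (len(header) - len(row))
        writer.writerow(row + ["1"])


# The 130b value of the first probe set is left blank to mimic a missing value.
with_gap = (
    "probe set\t130a\t130b\n"
    "AFFX-BioB-5_at\t3348.89\t\n"
    "AFFX-BioB-M_at\t3478.22\t6778.75\n"
)
padded = io.StringIO()
add_pseudo_column(io.StringIO(with_gap), padded)
print(padded.getvalue())
```

The added column is then excluded from analysis by listing only the real samples in the array list file, as described above.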

 

The “Get External Data/Other information” tab prompts the user to read in a “gene information file” or a “sample information file”; these files are the same as those used for data read in by “Analysis/Open group”. However, if the sample (column) names in the external data file are already meaningful, the “array name” and “sample name” columns of the sample information file can both be identical to the sample names in the first line of the external data file.

 

Click “OK” to read in the data file. If successful, the “Modeled” indicator will appear in the lower-right corner to indicate that the expression data is available for high-level analysis. The “Normalized” indicator will not be shown; if the data has already been normalized, one can proceed directly to high-level analysis.

 

If the data has not been normalized beforehand, one can then use “Analysis/Normalize” to normalize the expression values using the Invariant Set Normalization (ISN) method (Version 1.0 uses a simplified ISN method with a fixed rank difference threshold of 50 and no iteration). The standard error attached to an expression value will be scaled by the ratio of the expression values before and after normalization. Check the “Show scatter-plot…” option to show normalization scatter-plots when normalizing (this requires an installation of R).
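
The SE scaling can be illustrated with a small Python sketch. It assumes the ratio is taken as normalized value over original value, so that the relative standard error is preserved; the text does not spell out the direction, and this is not dChip's own code. The normalized value used in the example is hypothetical.

```python
import math


def scale_se(signal_before, signal_after, se_before):
    """Scale a standard error by the ratio signal_after / signal_before.
    Assumption: the ratio is after/before, keeping SE/signal constant."""
    if signal_before == 0:
        # Ratio undefined; this sketch simply leaves the SE unchanged.
        return se_before
    return se_before * (signal_after / signal_before)


# Signal and SE from the table above; 3683.779 is a hypothetical normalized value.
se_after = scale_se(3348.89, 3683.779, 281.398)
print(se_after)
print(math.isclose(se_after / 3683.779, 281.398 / 3348.89))
```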

 

Afterwards, the high-level analysis can be applied without the “Analysis/Model-based expression” step (since no CEL values are read in). For example, the “Tools/Array list file” function can be used to pool replicate arrays, and “Analysis/Hierarchical Clustering” and “Analysis/Compare Samples” can be performed as usual.

Combine external data file

Many researchers have made their data available in Excel or tab-delimited text format but not as the original CEL or DAT files. To pool several datasets together for analysis, one can follow steps similar to steps 3-7 of the section “Combine different array types”. Note that the two external data files should have matched rows of probe sets and consistent columns (e.g. both with or both without the absolute call column).
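
The matched-rows requirement can be checked before combining. The following Python sketch joins two tab-delimited files side by side and stops if the probe sets differ in any row; the helper name `combine` is an assumption, and the sketch does not verify that both files use the same column layout (with or without call columns).

```python
import csv
import io


def combine(file_a, file_b, out):
    """Join two external data files column-wise after checking that
    every data row names the same probe set in both files."""
    rows_a = list(csv.reader(file_a, delimiter="\t"))
    rows_b = list(csv.reader(file_b, delimiter="\t"))
    if len(rows_a) != len(rows_b):
        raise ValueError("files have different numbers of rows")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for n, (a, b) in enumerate(zip(rows_a, rows_b), start=1):
        if n > 1 and a[0] != b[0]:  # row 1 is the header
            raise ValueError(f"row {n}: probe set {a[0]!r} != {b[0]!r}")
        writer.writerow(a + b[1:])  # keep a single probe set column


# Signal-only files built from the table above.
first = "probe set\t130a\nAFFX-BioB-5_at\t3348.89\nAFFX-BioB-M_at\t3478.22\n"
second = "probe set\t130b\nAFFX-BioB-5_at\t3825.92\nAFFX-BioB-M_at\t6778.75\n"
merged = io.StringIO()
combine(io.StringIO(first), io.StringIO(second), merged)
print(merged.getvalue())
```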

Excel can only save an external data file with at most 65,536 rows and 256 columns. Two software packages that may be tried to overcome this limit are GS-Calc and Quattro Pro (free trial).

Extract subset of data file

[Analysis example, Version 9/28/06+] Here we show how to extract a subset of data from a larger data file, with possible reordering of rows and columns. This can be useful when subsetting 500K HapMap genotype calls to make a reference call file for LOH inference. First use "Get external data" to read in "Sty_HapMap270_brlmm_call.txt" (uncheck "Has both signal and call" and check "SNP data"), then specify an array list file containing the 60 CEPH parent samples. Then go to "Tools/Export expression data", select all samples and specify a probe set list file with the same ordering as the probe sets in the CDF or genome information file. Affymetrix library files contain a PSI file (example, unzip it), which can be used as the probe set list file (but check "Use probe set name in the 2nd column"). Uncheck "Has both signal and call" if making a reference genotype file. The exported file can then be used as "Analysis/Chromosome/Options/Reference genotype file".
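
The effect of this export (selecting some samples and reordering probe sets) can be pictured with a minimal Python sketch. It is not how dChip performs the export; the helper name `extract_subset`, the SNP identifiers, the sample names and the genotype calls are all made-up placeholders.

```python
import csv
import io


def extract_subset(in_handle, out_handle, probe_order, sample_order):
    """Write only the listed samples (columns) and probe sets (rows),
    in the order given by the two lists."""
    reader = csv.reader(in_handle, delimiter="\t")
    header = next(reader)
    col = {name: i for i, name in enumerate(header)}
    rows = {row[0]: row for row in reader if row}
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow([header[0]] + list(sample_order))
    for probe in probe_order:
        row = rows[probe]
        writer.writerow([probe] + [row[col[s]] for s in sample_order])


# Placeholder call file: three SNPs, three samples.
calls = (
    "probe set\ts1\ts2\ts3\n"
    "SNP_1\tAA\tAB\tBB\n"
    "SNP_2\tAB\tAB\tAA\n"
    "SNP_3\tBB\tAA\tAB\n"
)
subset = io.StringIO()
extract_subset(io.StringIO(calls), subset,
               probe_order=["SNP_3", "SNP_1"], sample_order=["s3", "s1"])
print(subset.getvalue())
```

Here `probe_order` plays the role of the probe set list file and `sample_order` the role of the array list file.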

Read cDNA array data

To read in cDNA array or other two-channel microarray data, one may put background-adjusted green and red channel signals for one array as two columns in the external data file, and do normalization as above when needed. Alternatively, log ratios in one cDNA array can be put in one column of an external data file, and one can then directly go to high-level analysis such as clustering.
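
For the log-ratio option, a minimal Python sketch of the per-spot calculation follows. It assumes base-2 logarithms of red over green, a common convention that the text does not specify, and returns None for non-positive signals so that such spots can be written as blank (missing) cells. The function name and the example intensities are illustrative.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for background-adjusted signals,
    or None when either signal is not positive."""
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


# Made-up background-adjusted intensities for three spots.
spots = [(800.0, 200.0), (150.0, 150.0), (-12.0, 90.0)]
ratios = [log_ratio(r, g) for r, g in spots]
print(ratios)
```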