Common probe set file Combine different array types Combine sub-arrays for analysis
Combine HG_U95A and V2 CEL files
Often researchers generate data for different human chip types: Hu6800, HG_U95A or HC_G110 and want to combine them for clustering or comparative analysis. More interestingly, a mouse or rat model for a human disease is constructed and array data are obtained from both patient samples and animal models. Consistent gene expression changes across human and animal models cross-validate the results.
Since dChip only normalizes and calculates model-based expression for a group of arrays belonging to the same chip type, we need additional steps to combine the data for different chip types or tissues. Firstly “common probe set files” are constructed for the purpose. To combine sub-chips for the same series (e.g. Mu11KsubA and Mu11KsubB), see “Combine sub-chips for analysis”.
Related resources: CrossChip, Chip Comparer
A “common probe set file” can be downloaded or prepared by dChip. It contains pairs of probe sets representing the same gene in the two chip types. [Old description: For the same species (e.g. human), the probe set pairs are linked by matching the accession number or LocusLink ID. For example, there are 5987 probe set pairs representing the same gene between the HG_U95AV2 chip type and the HU6800 chip type (see “common u95av2_hu6800.xls”, Affymetrix control probe sets excluded). Here one U95AV2 probe set may match to several Hu6800 probe sets and vice versa. For different species, the orthologous and homologous probe set pairs are obtained from NetAffx.]
7/10/07: Chip Comparer can find paired probe sets between array types using Gene IDs and orthologous gene relationships. Its output file can be sorted and modified for use as a common probe set file.
Affymetrix also provides “array comparison files” between arrays of the same organism (“best match” or “good match” files; a free account is needed to download them). The probe set columns of the two arrays can be copied and saved in text format to use as a dChip common probe set file. Note that some comparison files (e.g. U95 to Human Genome U133 Plus) are small and may cover only the probe sets that are on the U133 Plus array but not on the U133 array. One can combine the information from such a file with “Human Genome U95 to Human Genome U133”, since the U133 array is a subset of the U133 Plus array. These files may also contain probe sets from the U95 B-E arrays, which may cause dChip to misalign the data when only U95A data are exported. Ideally, the common probe set file should include only probe sets that are on the arrays whose data one is exporting. (Common probe set file for HG-U95Av2 vs. HG-U133_plus_2 array).
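As a rough illustration of restricting a common probe set file to the probe sets on the array being exported, the following Python sketch filters a list of probe set pairs. The function name restrict_pairs, the in-memory pair list and the probe set names are all made up for this example; they are not part of dChip or of the Affymetrix files.

```python
def restrict_pairs(pairs, probe_sets_on_array, column=0):
    """Keep only the pairs whose probe set in `column` is on the array.

    pairs: list of (probe_set_1, probe_set_2) tuples, one per row of a
    two-column common probe set file (an assumed in-memory form).
    probe_sets_on_array: probe set names present on the exported array.
    """
    on_array = set(probe_sets_on_array)
    return [pair for pair in pairs if pair[column] in on_array]


# Made-up probe set names, for illustration only.
pairs = [("p1_at", "q1_at"), ("p2_at", "q2_at"), ("p9_at", "q3_at")]
u95a_probe_sets = ["p1_at", "p2_at", "p3_at"]

kept = restrict_pairs(pairs, u95a_probe_sets)
for first, second in kept:
    print(f"{first}\t{second}")
```

Here the pair starting with p9_at is dropped because that probe set is not on the (hypothetical) array, so every remaining row can be matched during export.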
Here are older downloadable files. Right-click and select “Save target as” to download the needed common probe set files to use with dChip, and save the file in text format if modified in Excel:
common u95av2_hu6800.xls common u95a_hu6800.xls common u95av2_u95a.xls common u95av2_g110.xls
common HG-U133A_HG-U95Av2.txt (4/9/03, matched by LocusLink ID in NetAffx database)
These files are obtained by the new method. Please cite Liu et al. 2003 if you use these files:
HG-U95Av2 vs. MG-U74Av2: 12/16/02 HG-U95Av2 vs. RG-U34A: 12/16/02 MG-U74Av2 vs. RG-U34A: 12/16/02
HG-U133A vs. MG-U74Av2: 12/16/02 HG-U133A vs. RG-U34A: 12/16/02
Note that NetAffx CSV files don’t contain probe set pairs between array types of the same species. But the common probe set files for the same species can be obtained by matching Accession number or LocusLink ID in Microsoft Access.
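The matching by LocusLink ID described above is essentially a join on a shared gene identifier. The following Python sketch shows one way to do such a join outside Microsoft Access. It assumes the annotations have already been reduced to (probe set name, gene ID) tuples; the function name pair_probe_sets and the sample data are hypothetical, and this is not necessarily how the downloadable files were produced.

```python
from collections import defaultdict


def pair_probe_sets(annot_a, annot_b):
    """Pair probe sets from two chip types that share a gene ID.

    annot_a, annot_b: iterables of (probe_set_name, gene_id) tuples.
    Returns a list of (probe_set_a, probe_set_b) pairs. Many-to-many
    matches are kept, and probe sets with an empty gene ID are skipped.
    """
    by_gene = defaultdict(list)
    for probe_set, gene_id in annot_b:
        if gene_id:
            by_gene[gene_id].append(probe_set)

    pairs = []
    for probe_set, gene_id in annot_a:
        for partner in by_gene.get(gene_id, []):
            pairs.append((probe_set, partner))
    return pairs


# Made-up annotations (not real Affymetrix data).
chip_a = [("a1_at", "101"), ("a2_at", "102"), ("a3_at", "")]
chip_b = [("b1_at", "101"), ("b2_at", "101"), ("b3_at", "103")]

common = pair_probe_sets(chip_a, chip_b)
for first, second in common:
    print(f"{first}\t{second}")
```

Keeping many-to-many matches mirrors the note above that one probe set on one chip type may match several probe sets on the other.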
[Discussion thread] The following steps are used to combine the data from a set of U95A arrays and another set of Hu6800 arrays. The user is assumed to already have some familiarity with dChip.
1. Export data. Open a group of U95A arrays. In the “Export data” dialog, choose the “Gene list file” to be “common u95a_hu6800.xls”, uncheck “Use the probe set name in the 2nd column”, select arrays to be exported, check “Has absolute call” and “Has standard error” if needed, and click OK to export the data for the common probe sets (assume the output file is out_u95.xls).
2. Export data. Open a group of Hu6800 arrays. In the “Export data” dialog, choose the “Gene list file” to be “common u95a_hu6800.xls”, check “Use the probe set name in the 2nd column”, select arrays to be exported, check “Has absolute call” and “Has standard error” if needed, and click OK to export the data for the common probe sets (assume the output file is out_6800.xls).
3. Combine data. Open out_u95.xls and out_6800.xls in Excel. Copy all the data in out_6800.xls except the first “probe set” column and the “gene, Accession, LocusLink, Description” columns (if there are any), then move the cursor in out_u95.xls to the first row of the first blank column to the right of the data and paste. Delete the “gene, Accession, LocusLink, Description” columns in out_u95.xls if there are any. Save out_u95.xls in text (tab delimited) format, clicking “Yes” to dismiss the warning message.
4. Read in the combined data. Select “Analysis/Get External Data”, choose “data file” to be out_u95.xls, “Other information/gene information file” to be “HG_U95A gene info.xls” (since the probe set names are from U95A chip), and check “has absolute call” and “has standard error” if needed. Click “OK” to read in the data.
5. Mark the boundary between the two data sets. Select “Tools/Array list file” dialog, select all the U95A arrays (may use Control+Click or Shift+Click) in the “All arrays” listbox and click “Add array”, then click “Add standardize separator”, then select all the Hu6800 arrays and click “Add array”. Click the “Save & Exit” button to save the “array list file”.
6. Normalization. Before performing “Analysis/Filter genes” or “Analysis/Compare Samples”, one may want to use “Analysis/Normalize” to scale the arrays (columns of the data table) to have the same median first. This is performed at the expression value level using the same “Invariant set normalization” method, instead of the CEL-level normalization performed before the model-based expression calculation.
7. Within-chip-type standardization for each gene during clustering. In the “Analysis/Hierarchical Clustering” dialog (“Tools/Array list file” in V1.2+), uncheck “only draw lines for standardize separator” (in V1.3, check “Use standardize separators”). As a result, the expression values for each gene in the U95A arrays and the Hu6800 arrays will be standardized (scaled to have mean 0 and standard deviation 1) within each chip type when performing “Filter genes” and “Hierarchical clustering”. When the number of arrays of each chip type is large (> 10), even if we do not normalize the data of different chip types, we expect the standardized expression values to be comparable between the U95A and Hu6800 arrays, and co-regulated genes to have high correlation based on such within-chip-type standardized values.
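Steps 3 and 7 above can be sketched in Python as follows. This is only an illustration of the idea, not dChip's implementation: the function names are invented, the tables are assumed to be lists of rows with the probe set name in the first column, and the sample standard deviation is used because the source does not say which estimator dChip applies.

```python
import statistics


def combine_columns(table_a, table_b):
    """Paste the data columns of table_b to the right of table_a.

    Each table is a list of rows; the first row is the header and the
    first column of every row is the probe set name. Rows are assumed
    to be in the same order in both tables (both exports used the same
    common probe set file), so only the row counts are checked.
    """
    if len(table_a) != len(table_b):
        raise ValueError("tables have different numbers of rows")
    return [row_a + row_b[1:] for row_a, row_b in zip(table_a, table_b)]


def standardize_within_blocks(values, block_sizes):
    """Scale each block of one gene's values to mean 0 and SD 1.

    block_sizes gives the number of arrays of each chip type, in column
    order; each block needs at least two values.
    """
    out, start = [], 0
    for size in block_sizes:
        block = values[start:start + size]
        mean = statistics.mean(block)
        sd = statistics.stdev(block)  # sample SD: an assumption
        out.extend((v - mean) / sd if sd > 0 else 0.0 for v in block)
        start += size
    return out


# Made-up data: three U95A arrays and two Hu6800 arrays, one gene.
u95 = [["probe set", "u1", "u2", "u3"], ["g1_at", 100.0, 200.0, 300.0]]
hu6800 = [["probe set", "h1", "h2"], ["x1_at", 10.0, 30.0]]

combined = combine_columns(u95, hu6800)
z_scores = standardize_within_blocks(combined[1][1:], [3, 2])
print(combined[0])
print(z_scores)
```

The block sizes play the role of the “standardize separator” in the array list file: they tell the standardization where one chip type ends and the next begins.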
If only a few samples of each array type are combined, adjusting for batch effects is difficult. Alternatively, one may work in Excel to scale the arrays to have the same average expression value across all genes and manually divide two columns to get fold changes between samples. Pay attention to the absolute values of the two samples: the fold change is most reliable when one expression value is small (< 50) and the other is large (> 500), or when both expression values are large.
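The Excel procedure described above amounts to a rescaling followed by a ratio. A minimal Python sketch is given below. The function names are invented, expression values are assumed to be positive, and reusing 500 as the cutoff for “both values are large” is an assumption made for this example rather than a dChip rule.

```python
def scale_to_common_mean(columns):
    """Scale each array (a list of expression values) so that all
    arrays have the same mean as the first array."""
    target = sum(columns[0]) / len(columns[0])
    return [[v * target / (sum(col) / len(col)) for v in col]
            for col in columns]


def fold_change(a, b, small=50.0, large=500.0):
    """Return (b / a, reliable) for two positive expression values.

    `reliable` is True when one value is below `small` and the other
    above `large`, or when both values are above `large`.
    """
    low, high = min(a, b), max(a, b)
    reliable = (low < small and high > large) or low > large
    return b / a, reliable


# Made-up values for two arrays measured on two genes.
scaled = scale_to_common_mean([[100.0, 300.0], [50.0, 150.0]])
print(scaled)

print(fold_change(40.0, 800.0))    # small vs. large
print(fold_change(60.0, 90.0))     # both moderate
print(fold_change(600.0, 1200.0))  # both large
```

The reliability flag is only a reminder to look at the absolute values; it does not replace inspecting the two samples directly.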
Combine sub-arrays for analysis
[Discussion thread for combining SNP sub arrays]
[Version 4/10/06+] We can combine two sub-arrays at "Open group" without using external data files, as follows. First analyze the data of each sub-array separately through normalization and model-based signal calculation to obtain DCP files. Then at "Analysis/Open group/Other information", specify the 2nd sub-array CDF file at "Subarray CDF" and uncheck "Open group/Options/Load probe data in memory". dChip assumes that the CEL/DCP file name of the 2nd sub-array is the same as that of the 1st sub-array, except with "_2" at the end. For example, 01X_298B_x.dcp and 01x_298B_x_2.dcp will match each other. After "Open group", the genome information files containing the probe sets of both sub-arrays should be used at "Analysis/Chromosome". For the group name at "Open group", using the group name of the first sub-array is fine, since this step just reads in additional signal and call values from the 2nd sub-array without altering DCP files. [V10/17/07+] The same sample information file used for the first sub-array can be used for the second sub-array, since dChip will add "_2" to the array names in the sample information file and try to match them to the array names in a group.
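The "_2" naming convention above can be checked before opening a group with a short Python sketch like the one below. The function name match_subarray_files is invented, and the case-insensitive comparison is an assumption suggested by the example pair (01X... and 01x...); dChip's actual matching rules may differ.

```python
import os


def match_subarray_files(first_files, second_files):
    """Map each 1st sub-array file to its "_2" partner, or None.

    File names are compared without extension and without regard to
    case (an assumption based on the example in the text).
    """
    second_by_stem = {
        os.path.splitext(name)[0].lower(): name for name in second_files
    }
    matches = {}
    for name in first_files:
        stem = os.path.splitext(name)[0].lower()
        matches[name] = second_by_stem.get(stem + "_2")
    return matches


# The first pair is the example from the text; the second file name
# is made up and deliberately has no partner.
first = ["01X_298B_x.dcp", "02X_300B_x.dcp"]
second = ["01x_298B_x_2.dcp"]

matches = match_subarray_files(first, second)
for name, partner in matches.items():
    print(name, "->", partner)
```

Files that map to None are the ones to rename before opening the group.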
[Obsolete method] Since arrays for each sub-array type are normalized among themselves, the expression values for each probe set are comparable across samples. We can row-wise combine the probe sets of the sub-chips using the following steps.
1. Open a group of subA arrays (clear the “Gene Information File” in the “Analysis/Open Group/other information” dialog so that the output has no gene descriptions, or delete the gene description columns in the export file to conform to the “Get external data” format). In the “Export data” dialog, select arrays to be exported, check “has absolute call” and “has standard error”, and click OK to export the data (assuming the output file is 11kA.xls).
2. Do the same for a group of subB arrays, but make sure the columns (arrays) correspond to 11kA.xls (assuming the output file is 11kB.xls).
[New] For the second sub-array, you can check “Tools/Export expression value/Append to this file” to append the output data to an existing data file of the first sub-array. The array list file used should be the same or have the same array ordering for the existing file and the data to be exported. Afterwards, open the file in Excel and delete the “gene, Accession, LocusLink, Description” columns if there is any, and save as text file.
3. Open 11kA.xls and 11kB.xls, copy all the data in 11kB.xls except the first “array name” row, and paste it into 11kA.xls starting from the first blank row below the data. Delete the “gene, Accession, LocusLink, Description” columns if there are any. Save 11kA.xls in Text (tab delimited) format, clicking “Yes” to dismiss the warning message.
4. Close groups in dChip. Select “Analysis/Get External Data”, choose “data file” to be 11kA.xls, check “has absolute call” and “has standard error”. The data will be read in.
5. Do clustering or comparing samples as usual. Currently the maximum number of genes dChip 1.1 can read in is 23000 (65000 in dChip 1.3). So you may need to first use “Compare Samples”, “Filter genes” or “Clustering/Export Selected” to get a subset of gene names. Then use such files as “Gene list file” in the “Tools/Export Data/Expression values” dialog to export only the data for the filtered genes for each sub chip type, and then combine data files for dChip to read in.
6. Gene information files for the two sub-chips can be combined row-wise (save in tab-delimited text format) and specified at “Analysis/Get external data/Other information/Gene information file”. A better way is to use the “Tools/Make gene information” function, which can accept a combined CSV file of the sub-arrays made with a Windows “Command Prompt” command such as “copy HG-U133?_annot.csv HG-U133.csv” (do not replace “?” by A or B; “?” represents any character in the command). Directly copying and pasting the two CSV files in a text editor or Excel can change the CSV file format and should not be done.
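The row-wise combination in steps 1 to 3 of the obsolete method can be sketched in Python as follows. The function name combine_rows and the sample tables are made up, and the tables are assumed to be lists of rows with the array names in the first row. Only the number of columns is checked, since the array names of the two sub-arrays may differ even when the samples correspond.

```python
def combine_rows(table_a, table_b):
    """Append the data rows of table_b below those of table_a.

    The first row of each table holds the array names. The header of
    table_b is dropped, and the columns of the two tables are assumed
    to refer to the same samples in the same order.
    """
    if len(table_a[0]) != len(table_b[0]):
        raise ValueError("sub-array tables have different column counts")
    return table_a + table_b[1:]


# Made-up expression tables for two sub-arrays, two samples each.
sub_a = [["probe set", "s1_A", "s2_A"],
         ["a1_at", 120.0, 95.0],
         ["a2_at", 40.0, 310.0]]
sub_b = [["probe set", "s1_B", "s2_B"],
         ["b1_at", 75.0, 80.0]]

merged = combine_rows(sub_a, sub_b)
for row in merged:
    print("\t".join(str(cell) for cell in row))
```

The merged table keeps the header of the first sub-array, which matches the instruction to copy everything from the second file except its first “array name” row.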
Combine HG_U95A and V2 arrays at the CEL file level
There are 26 HG_U95A-only probe sets and 25 HG_U95AV2-only probe sets. The remaining 12600 common probe sets presumably have the same probe sequences and chip locations (confirmed by visually checking dChip “Image View”, “Data View” and CDF files; also see Affymetrix’s description). When the CDF file “HG_U95AV2.cdf” and a probe set mask file “hg_u95av2 probe set mask.txt” (save the file under exactly this name) are specified in the “Analysis/Open group/Other information” dialog, dChip will also read in HG_U95A arrays. The subsequent normalization and model-based expression procedures are performed as usual, but only on the 12600 common probe sets.
(Updated 10/17/07)