dChip: Gene information file

 

Make information file | Custom gene information file | Make "gene list file" by annotation

 

dChip “Gene information files” can be prepared by dChip (older files can be downloaded) and are specified in the “Open group/Other information” dialog. They are tab-delimited text files (with an .XLS extension for easy opening in Excel) that provide annotation information for probe sets. They have the following columns: Probe set name, Identifier (Accession number), LocusLink ID, Gene name, Gene Ontology terms, Protein domain terms, Pathway terms, Chromosome terms, and Gene description. These annotation terms allow dChip to find significant gene clusters or classify a gene list.
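Because the file is plain tab-delimited text, it can also be inspected outside Excel. The following is a minimal Python sketch, not part of dChip: the function name `read_gene_info`, the header strings, and the sample rows are all hypothetical placeholders, and the only assumption about the file is that the first line is a header and the first column holds the probe set name.

```python
import csv
import io

def read_gene_info(handle):
    """Read a tab-delimited gene information file.

    Returns (header, table) where table maps the first column
    (assumed to be the probe set name) to a dict of column -> value.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so that every header column has a value
        row = row + [""] * (len(header) - len(row))
        table[row[0]] = dict(zip(header, row))
    return header, table

# Hypothetical sample; real header names and contents will differ
sample = (
    "Probe set\tIdentifier\tLocusLink\tName\tDescription\n"
    "probe_1\tACC1\t1\tGENE_A\tfirst placeholder gene\n"
    "probe_2\tACC2\t2\tGENE_B\n"
)
header, genes = read_gene_info(io.StringIO(sample))
print(genes["probe_1"]["Name"])
```

For a file on disk, the same function can be called on `open(path, newline="")`.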

 

Downloadable gene information files (unzip before use; each was made from the latest CSV file and GO files available at the date shown): HG_U133 Plus 2.0 (08/05, 03/07), HG_U133A 2.0 (03/07), MOE430 2.0 (09/05), Drosophila_2 (12/06), Rat RAE230A, U34A (04/07)

 

Make information file

 

See Zhong et al. 03 for reference (also see the ChipInfo software). In March 2003, NetAffx (Liu et al. 03) made the Annotation CSV files available online. This allows individual users to convert the quarterly updated NetAffx annotation files to dChip information files promptly. This function does not work for SNP arrays, whose SNP or genome information files are here. In dChip, one can select “Tools/Make information file”:

 

 

The input files need to be downloaded to the local computer. Download and unzip the Annotation CSV file for the desired array type (a free NetAffx account is needed); use the CSV file as it is, without re-saving it in Excel in a different format. Also download the three Gene Ontology (GO) structure files: function.ontology, process.ontology, component.ontology (save them in text format with names such as “function.ontology.txt”). If these latest GO files make dChip crash, unzip and use these older files.

 

To make common probe set files, a NetAffx Ortholog CSV file must be downloaded and specified; in this case, do not check "Gene information file" or "Genome information file".

 

On clicking OK, dChip will parse the NetAffx annotation files to generate the specified Gene information file, Genome information file and Common probe set file. If GO files are specified, the GO graphs are traced upward so that all the parent GO terms of a gene’s GO annotation terms are also associated with this gene. The 2000 most frequently occurring GO terms in this array type (the older limit is 1200) are indexed in the associated “gene info Gene Ontology.xls” file and used as GO annotation terms in the output gene information file. These terms are typically associated with more than 10 probe sets and are more useful for identifying functional significance. The same limit applies to other annotation categories such as “Protein domain”. For each gene, at most 300 GO terms are recorded in the gene information file.
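The parent tracing and frequency cut-off described above can be illustrated with a small Python sketch. It is not dChip's implementation: the function names, the `parents` mapping (child term to its direct parents), the tie-breaking by term name, and the choice of which terms to keep when a gene exceeds the per-gene limit are all assumptions made for illustration, and the term labels are placeholders.

```python
from collections import Counter

def ancestors(term, parents, cache):
    """Return all ancestors of `term` in a DAG given as child -> parents."""
    if term in cache:
        return cache[term]
    found = set()
    for parent in parents.get(term, ()):
        found.add(parent)
        found |= ancestors(parent, parents, cache)
    cache[term] = found
    return found

def annotate(gene_terms, parents, max_terms=2000, max_per_gene=300):
    """Expand each gene's terms with their ancestors, then keep only the
    `max_terms` most frequent terms and at most `max_per_gene` per gene."""
    cache = {}
    expanded = {}
    for gene, terms in gene_terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents, cache)
        expanded[gene] = full
    counts = Counter(t for terms in expanded.values() for t in terms)
    # Most frequent first; ties broken by name (an assumption of this sketch)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    indexed = ranked[:max_terms]
    per_gene = {
        gene: [t for t in indexed if t in terms][:max_per_gene]
        for gene, terms in expanded.items()
    }
    return indexed, per_gene

# Placeholder ontology: T_a1 -> T_a -> T_root, T_b -> T_root
parents = {"T_a1": ["T_a"], "T_a": ["T_root"], "T_b": ["T_root"]}
gene_terms = {"g1": ["T_a1"], "g2": ["T_b"], "g3": ["T_a"]}
indexed, per_gene = annotate(gene_terms, parents, max_terms=2)
print(indexed)
print(per_gene)
```

With the limit set to 2 in this toy example, only the two terms shared by the most genes survive, which mirrors why the indexed terms tend to be associated with many probe sets.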

 

The CSV file has a “Genome Version” column that gives the genome assembly. Based on it, the corresponding cytoband file and refgene files can be made to agree with the genome assembly of the genome information file.

 

[V3/24/07+] Gene symbols will be added in the gene name column of the gene information file. [Obsolete] If gene symbols are preferred over gene names in the analysis, one can use Excel to open both the annotation CSV file and the gene information file generated above, and then copy the “Gene Symbol” column of the CSV file (except the header line) to the "Name" column of the gene information file. The rows of the two files contain matching probe sets, but check that there is no misalignment. Finally, save the gene information file in tab-delimited text format.

 

For custom arrays without a NetAffx CSV file, Affymetrix will usually supply some gene information. One can copy and paste in Excel to make the format the same as that of the gene information file: keep the header line the same, leave unneeded columns empty, and then save in text format.

 

For the latest CSV files, the InterPro protein domain information may not be available. One can make gene info files using both the old and the new CSV files (if needed, after combining the A and B CSV files with a DOS command). Then copy the protein domain column from the old gene info file to the new gene info file, and also use the old protein domain index file. Make sure the rows containing probe set names match between the two gene info files.
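As an alternative to copying the column in Excel, the rows can be matched by probe set name in a script, which avoids the misalignment risk. The Python sketch below is only an illustration under assumptions: the function name `copy_column` and the column names "Probe set" and "Protein domain" are hypothetical and must be replaced by the actual header strings of the files at hand.

```python
def copy_column(new_rows, old_rows, key, column):
    """Fill `column` in new_rows from old_rows, matching rows on `key`.

    Rows are dicts (column name -> value). Returns (merged_rows, missing),
    where `missing` lists keys of new rows that have no match in old_rows.
    """
    old_by_key = {row[key]: row.get(column, "") for row in old_rows}
    merged, missing = [], []
    for row in new_rows:
        row = dict(row)  # do not modify the caller's data
        if row[key] in old_by_key:
            row[column] = old_by_key[row[key]]
        else:
            missing.append(row[key])
        merged.append(row)
    return merged, missing

# Placeholder rows; real files have more columns
old = [
    {"Probe set": "p1", "Protein domain": "D1"},
    {"Probe set": "p2", "Protein domain": "D2"},
]
new = [
    {"Probe set": "p1", "Protein domain": ""},
    {"Probe set": "p3", "Protein domain": ""},
]
merged, missing = copy_column(new, old, "Probe set", "Protein domain")
print(merged)
print(missing)
```

A non-empty `missing` list signals probe sets that exist only in the new file and therefore keep an empty protein domain entry.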

 

Custom gene information file

 

The four annotation columns are optional: Gene Ontology, Protein domain, Pathway, and Chromosome terms. One can delete one or more of them in Excel and save the file in tab-delimited text format. One may also replace one of these columns with custom annotation terms from a controlled vocabulary. The current limits are 4 annotation columns and 1300 different annotation terms per column. For this example gene information file, keep the first line unchanged, but change the content of the "FirstOfFunction" column to the gene annotation of your choice, and the first column to your feature/gene names. Leave other columns empty when they do not apply. If needed, you can add up to 3 other annotation columns between "FirstOfFunction" and "Description". Finally, save the file in text format. Such a custom gene information file may be used with an external data set that is not from the Affymetrix platform.
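Before saving a custom file, the two limits stated above can be checked with a short script. This Python sketch is hypothetical and not part of dChip: the function name `check_limits` and the input layout (column name to a mapping of gene to a list of terms) are assumptions, chosen so that the sketch does not have to guess how terms are separated inside a cell.

```python
MAX_COLUMNS = 4              # limit on annotation columns stated in the text
MAX_TERMS_PER_COLUMN = 1300  # limit on distinct terms per column stated in the text

def check_limits(annotations):
    """Return a list of human-readable problems (empty if none).

    `annotations` maps column name -> {gene name -> list of terms}.
    """
    problems = []
    if len(annotations) > MAX_COLUMNS:
        problems.append(
            f"{len(annotations)} annotation columns (limit {MAX_COLUMNS})"
        )
    for column, per_gene in annotations.items():
        distinct = {term for terms in per_gene.values() for term in terms}
        if len(distinct) > MAX_TERMS_PER_COLUMN:
            problems.append(
                f"column {column!r}: {len(distinct)} distinct terms "
                f"(limit {MAX_TERMS_PER_COLUMN})"
            )
    return problems

# Placeholder annotation with one custom column
custom = {"FirstOfFunction": {"gene_1": ["class A", "class B"], "gene_2": ["class A"]}}
print(check_limits(custom))
```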

 

If additional or project-dependent gene information is desired, one can use “Tools/More gene information” to read in a gene information file and use it with priority over the main gene info file specified in “Analysis/Open group”. The file can be edited in Excel but must be saved as a “tab-delimited text file”, and its header line should be the same as that of the original “Gene information file” specified at “Open group”. It may contain only the subset of genes whose information is to be modified or added. Non-blank entries in the file will overwrite the corresponding entries in the original gene information file.
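The overwrite rule (non-blank entries take priority, blank entries leave the original untouched) can be sketched in Python as follows. This is an illustration only: the function name `overlay` and the table layout (probe set name to a dict of column values) are hypothetical, and ignoring probe sets that are absent from the main file is an assumption of this sketch, since the text does not say how dChip treats them.

```python
def overlay(main, extra):
    """Return a copy of `main` in which non-blank cells of `extra` win.

    Both tables map probe set name -> {column name -> value}.
    """
    merged = {probe: dict(row) for probe, row in main.items()}
    for probe, row in extra.items():
        if probe not in merged:
            continue  # assumption: rows unknown to the main file are ignored
        for column, value in row.items():
            if value.strip():  # blank cells never overwrite
                merged[probe][column] = value
    return merged

# Placeholder tables
main = {
    "p1": {"Name": "A", "Description": "old text"},
    "p2": {"Name": "B", "Description": "kept"},
}
extra = {"p1": {"Name": "", "Description": "new text"}}
merged = overlay(main, extra)
print(merged["p1"])
```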

 

Make “gene list file” by annotation terms or keywords

 

Sometimes it is desirable to focus attention on a set of genes with a particular function or keyword. For example, in a cell-cycle experiment, we may want to look at the expression value changes of cell cycle genes, and cluster samples using only these genes. If a “Gene information file” is used at “Open Group” or “Get external data”, we can use “Tools/Gene list file/By Annotation” (“By GeneOntology” in V1.2-) to output a list of genes that belong to a specific Gene Ontology, protein domain or pathway category:

 

 

In V1.3+, one can select multiple terms to get the union or intersection of the genes belonging to these categories, and apply the “Filter genes” function immediately using this gene list as the input gene list.
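The union and intersection of several categories correspond to ordinary set operations. A minimal Python sketch, with a hypothetical function name and a placeholder `term_to_genes` mapping (annotation term to the genes carrying it), could look like this:

```python
def genes_for_terms(term_to_genes, terms, mode="union"):
    """Combine the gene sets of the selected terms by union or intersection."""
    sets = [set(term_to_genes.get(term, ())) for term in terms]
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")

# Placeholder categories and genes
term_to_genes = {
    "term X": ["g1", "g2", "g3"],
    "term Y": ["g2", "g3", "g4"],
}
print(sorted(genes_for_terms(term_to_genes, ["term X", "term Y"], "union")))
print(sorted(genes_for_terms(term_to_genes, ["term X", "term Y"], "intersection")))
```

The union gives every gene in at least one selected category, while the intersection keeps only genes annotated with all of them.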

 

Similarly, one can use “Tools/Gene list file/By keywords” to obtain a list of genes with a particular keyword in the gene name or description. The specified keywords can be wildcard strings. For example, “receptor * kinase” matches “receptor tyrosine kinase”, and both “receptor [1-9]” and “receptor ?” match “receptor 4” (more matching patterns).
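The wildcard examples above behave like shell-style patterns, so they can be tried out with Python's standard `fnmatch` module. This is a stand-in for experimentation, not dChip's own matcher: whether dChip is case-sensitive and whether a pattern may match in the middle of a longer description are not stated here, so the sketch assumes case-sensitive matching anywhere in the text. The function name `keyword_match` is hypothetical.

```python
from fnmatch import fnmatchcase

def keyword_match(text, pattern):
    """True if the wildcard `pattern` occurs anywhere in `text`.

    `*` matches any run of characters, `?` a single character,
    and `[1-9]` a single character from the given range.
    """
    # fnmatchcase matches the whole string, so allow extra text on both sides
    return fnmatchcase(text, "*" + pattern + "*")

print(keyword_match("receptor tyrosine kinase", "receptor * kinase"))
print(keyword_match("receptor 4", "receptor [1-9]"))
print(keyword_match("receptor 4", "receptor ?"))
```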

 

The obtained gene list file can be used as the “Gene list file” in the “Analysis/Hierarchical clustering” dialog for clustering analysis and in the “Tools/Export data/Expression value” dialog for exporting expression values for only these genes. In addition, it can be specified as the “Analysis/Open group/Other information/Probe set mask file” to eliminate these genes from the downstream analysis (also see Remove irrelevant genes from clustering analysis).