Posters

 

The posters will be presented at the Conference Reception at Element Cafe of the Conference Center at Harvard Medical from 7:00pm to 9:30pm on Wednesday, October 10, 2007 (* indicates the presenter).

 

1. Speed Up Genome-Wide Association Studies by Random Walk Algorithms

2. Gene ranking in survival analysis assessment of prognostic value of genome expression experiments

3. Inference from Genome-Wide Association Studies using a Novel Markov Model

4. Survival Analysis With Large Dimensional Genomic Data And Possible Pathways

5. Peak detection of SELDI measurements for identifying protein biomarkers

6. Haplotype-based Association analysis via variance component score test

7. Whole-genome association studies have substantially increased power in admixed populations

8. Genetic interactions in a cross of outbred S. cerevisiae strains: Snapshots along the speciation trail

9. On Combining Triads and Unrelated Subjects Data in Assessing Genetic Association with Disease Risk: An Application to Testicular Cancer

10. Joint modeling of functional SELDI-TOF mass spectrometry proteomic

11. Micromanipulation of the Extracellular Matrix: From Cancer Treatment to Stem Cell Differentiation

12. A New Multilevel Nonlinear Mixture Dirichlet Model for Expressed Sequence Tag Data

13. SimuGeno: simulation software for genome wide case-control association studies

14. Model selection in meta-analytical framework for prototype-based clustering

15. Potential genes associated with BMI and interacting with dietary factors in Taiwanese population

16. Incorporating biological information in genome-wide association studies

17. Genome-wide predictors of plasma folate and homocysteine

18. Detecting unusual genotypic patterns in a single subject

19. A model-based approach for clustering genome-wide conserved non-coding elements

20. Analysis of cytotoxicity dose-response curves for pharmacogenomic studies using Bayesian hierarchical nonlinear models

21. Exploratory analysis of correlated structure across multiple datasets using multiple co-inertia analysis

22. Bayesian modeling of complex traits

23. A Quantitative Framework for Discovering Gene-Gene Interactions Associated with Complex Diseases

24. A systems-based evaluation of toxicogenomic microarray data: how toxicants alter gene pathways and functional gene categories

25. Indexing regulatory potential of genome loci to gene expression for candidate evaluation

 

1. Speed Up Genome-Wide Association Studies by Random Walk Algorithms   

Di Zhao          

Louisiana Tech University

 

With the recent development of genotyping technology, genome-wide association studies of genetic variation and disease in populations have become feasible through the analysis of SNPs and sample labels. The Monte Carlo method is currently the standard method for evaluating association. However, because of the extremely large number of SNPs across the whole human genome and the number of samples from disease and control populations, the computation becomes difficult or even impossible. Sampling methods, such as importance sampling, are effective in decreasing the time complexity of the standard Monte Carlo method (SMC). In this paper, we develop a sampling method, named random walk algorithms (RWA), for genome-wide association studies. Theoretical analyses show that, compared with SMC, RWA significantly decreases the time complexity from O(lmn) to approximately O(mn). Computational experiments show that RWA is approximately 10^3 to 10^5 times faster than SMC.
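For orientation, the standard Monte Carlo baseline can be sketched as a permutation test. This is only an illustration of the baseline and not the authors' RWA. The abstract does not define l, m, and n, so the sketch assumes l permutations, m SNPs, and n samples, and it uses a hypothetical association score (the absolute difference in mean allele count between cases and controls).

```python
import random

def smc_pvalues(genotypes, labels, n_perm=1000, seed=0):
    """Per-SNP empirical p-values from a standard Monte Carlo permutation test.

    genotypes: list of n rows, each a list of m allele counts (0, 1, 2).
    labels:    list of n values, 1 for cases and 0 for controls.
    Cost is O(l * m * n) for l = n_perm (assumed meaning of l, m, n).
    """
    rng = random.Random(seed)
    n, m = len(genotypes), len(genotypes[0])

    def score(y):
        # Hypothetical statistic: |mean allele count in cases - in controls|.
        n_case = sum(y)
        n_ctrl = n - n_case
        stats = []
        for j in range(m):
            total = sum(genotypes[i][j] for i in range(n))
            case_sum = sum(genotypes[i][j] for i in range(n) if y[i] == 1)
            stats.append(abs(case_sum / n_case - (total - case_sum) / n_ctrl))
        return stats

    observed = score(labels)
    exceed = [0] * m
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and genotypes
        for j, s in enumerate(score(shuffled)):
            if s >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps every p-value strictly positive.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

Every permutation rescans all m SNPs over all n samples, which is where the factor l in the O(lmn) cost comes from under the assumed reading of the symbols.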

 

2. Gene ranking in survival analysis assessment of prognostic value of genome expression experiments

 

Giulia Tonini*, Annibale Biggeri, Michela Baccini  

Università di Firenze and CSPO, Firenze

 

An aim of this work was to investigate the performance of gene expression in predicting survival. We develop a statistical approach that considers the net contribution of each gene after adjusting for other known prognostic factors. Moreover, we extend our approach to the effect of each gene given all other investigated genes. Several approaches can be found in the literature. The Significance Analysis of Microarrays (SAM) has been generalized to censored survival data, but these procedures are usually based on a marginal approach: they investigate the effect of gene expression without taking into account other potential prognostic variables (e.g., age, gender, stage of disease), and they assess the effect of each gene separately. We applied a two-step procedure. First, we considered the problem of ranking genes in order of their prognostic value. Since we were interested in the net contribution of differential gene expression to survival, we based the ranking on a score test statistic adjusted for other relevant prognostic factors (characteristics of the person and of the tumor). Second, we specified a penalized Cox regression model that considered the 40 top-ranked genes obtained from the previous step. The two-step procedure was motivated by the need to reduce the computational burden of the penalized regression approach. We exemplify our approach using a real dataset on colon cancer collected by the Pharmacology Department of the University of Florence and the Tuscany Cancer Institute (ITT).

 

3. Inference from Genome-Wide Association Studies using a Novel Markov Model      

 

F.J. Hosking*, J.A.C. Sterne, G. Davey Smith, P.J. Green     

University of Bristol, UK

 

We propose a Bayesian modelling approach to the analysis of genome-wide association studies (GWAS) based on single nucleotide polymorphism (SNP) data. Our model combines various aspects of k-means clustering, hidden Markov models (HMMs) and logistic regression into a fully Bayesian model. It is fitted using the Markov chain Monte Carlo (MCMC) stochastic simulation method, with Metropolis-Hastings update steps. The approach is flexible, both in allowing different types of genetic models, and because it can be easily extended while remaining computationally feasible due to the use of fast algorithms for HMMs. It allows for inference primarily on the location of the causal locus but also on other parameters of interest. The model is used here to analyse three data sets, using both synthetic and real disease phenotypes with real SNP data, and shows promising results.                                                                       

 

4. Survival Analysis With Large Dimensional Genomic Data And Possible Pathways    

 

Yi Li, David Engler*

HSPH                         

 

Use of microarray technology often leads to high-dimensional, low-sample-size data settings. Over the past several years, a variety of approaches have been proposed for variable selection in this context. However, only a few are applicable to time-to-event data where censoring is present. Among standard variable selection methods, the elastic net penalization approach has been shown both to have good predictive accuracy and to be computationally efficient. In this paper, an adaptation of the elastic net penalized estimation approach is presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. A key advantage of this method is its capability of preserving the "grouping effects", which is ideal for modeling unobserved pathways. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to be an improvement over the LARS-based (Efron et al., 2004) LASSO variable selection method in terms of predictive performance when identification of highly correlated variables is of interest and in settings where a high degree of censoring is present.
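To make the penalized Cox criterion concrete, the sketch below evaluates a Cox partial log-likelihood with an elastic net penalty subtracted. It is a minimal illustration under stated assumptions, not the authors' estimator: the parametrization (lam1 on the L1 norm, lam2 on the squared L2 norm), the function name, and the Breslow-style treatment of tied times are all choices made here for the example.

```python
import math

def penalized_cox_loglik(beta, X, time, event, lam1, lam2):
    """Cox partial log-likelihood minus an elastic net penalty.

    beta:  list of p coefficients.
    X:     list of n rows, each with p covariates (e.g. gene expression).
    time:  list of n follow-up times.
    event: list of n indicators, 1 = event observed, 0 = censored.
    lam1, lam2: assumed penalty weights on sum|b| and sum b^2.
    """
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    loglik = 0.0
    for i in range(len(X)):
        if event[i] == 1:
            # Risk set: subjects still under observation at time[i].
            risk = sum(math.exp(eta[j]) for j in range(len(X)) if time[j] >= time[i])
            loglik += eta[i] - math.log(risk)
    l1 = sum(abs(b) for b in beta)   # encourages sparsity
    l2 = sum(b * b for b in beta)    # pulls correlated covariates together
    return loglik - lam1 * l1 - lam2 * l2
```

The squared L2 term is what tends to give highly correlated covariates similar coefficients, which is the grouping effect mentioned above; setting lam2 to zero reduces the criterion to a LASSO-type penalty.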

 

5. Peak detection of SELDI measurements for identifying protein biomarkers  

 

Chuen Seng Tan*, Alexander Ploner, Andreas Quandt, Janne Lehti, Maria Pernemalm, Rolf Lewensohn, Yudi Pawitan

Department of Medical Epidemiology and Biostatistics, Karolinska Institutet and Center for Molecular Epidemiology, National University of Singapore                         

Aim: Protein expression profiling data from the surface-enhanced laser desorption and ionization (SELDI) technology are used to discover biomarkers for clinical diagnosis, prognosis and therapy prediction. The pre-processing of the raw data, however, is still problematic. We aim to develop a peak detection method with better specificity than standard methods. Methods: Scientists inspect individual spectra visually and laboriously to verify that the peaks identified by the standard method are real. Motivated by this multi-spectral practice, we investigate an analytical approach that reduces the data to a single spectrum of F-statistics capturing significant variability between spectra. To account for multiple testing, we use a false discovery rate criterion to identify potentially interesting proteins. To annotate the peaks, we extract a peak template from all spectra via principal component analysis. Finally, with the template, we estimate the amplitude and location of the peak in each spectrum with the least-squares method and refine the estimate of the amplitude via a mixture model. Results: We show that our approach has better operating characteristics than several existing methods and gives more accurate peak annotations than the standard method. Conclusion: We find that our approach alleviates the main problems in the preprocessing of SELDI-TOF spectra.
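The abstract does not say which false discovery rate criterion is applied to the spectrum of F-statistics. As one common possibility, and purely as an assumption, the Benjamini-Hochberg step-up procedure could be applied to the p-value at each m/z position, as in this minimal sketch.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected at FDR level q (Benjamini-Hochberg).

    pvalues: one p-value per m/z position (assumed to come from the F-statistics).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    # Reject the k smallest p-values; return their original positions.
    return sorted(order[:k])
```

The returned positions would be the candidate peak locations passed on to the template-based annotation step.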

                                                           

6. Haplotype-based Association analysis via variance component score test      

 

Jung-Ying Tzeng*, Daowen Zhang

NC State University                           

 

Haplotypes provide a more informative format of polymorphisms for genetic association analysis than single SNPs, but the practical efficacy of haplotype association analysis faces a trade-off between the benefits of modeling abundant variation and the cost of the extra degrees of freedom. To reduce the degrees of freedom, several strategies have been considered in the literature, including (1) clustering evolutionarily close haplotypes, (2) modeling the level of haplotype sharing, and (3) smoothing haplotype effects by introducing a correlation structure on the effects of similar haplotypes and studying the variance components for association. While the first two strategies enjoy a fair extent of power gain, empirical evidence has shown that variance component (VC) methods may exhibit only similar or lower power than the standard haplotype regression method, even when there are many haplotypes (Schaid 2004). In this study, we report possible reasons for this lack of power and show how the power of the VC strategy can be improved. We construct a score test based on the restricted maximum likelihood (REML) function of the variance components and identify its non-typical limiting distribution. Through simulation, we demonstrate the validity of the test and the power improvement over the standard haplotype regression method. With suitable choices of the correlation structure, the proposed method can be directly applied to unphased genotypic data. Our method can be applied under a general model framework, and is computationally efficient and easy to implement. The broad coverage and the fast and easy implementation of our method make the VC strategy a powerful tool for haplotype analysis, even in modern genome-wide association studies.

                                                                                                           

7. Whole-genome association studies have substantially increased power in admixed populations    

 

Alkes L. Price*, Simon R. Myers, Nick Patterson and David Reich

Harvard Medical School & Broad Institute of MIT and Harvard                  

 

Whole-genome association studies (WGAS) are a powerful way to identify common variants conferring disease risk. WGAS have been carried out almost exclusively in populations of European ancestry, with little or no representation of underserved populations such as Latinos and African Americans. This may be due in part to the technical challenges posed by admixed populations, but we and others have now developed methods that enable fully powered WGAS in these populations. We set out to investigate the power of WGAS in admixed populations, and found that there is a major gain in power and efficiency.  These populations offer more power on average because (a) multiple ancestral populations provide increased genetic variation, and (b) there is no noise introduced from controls in the admixture association component of the overall signal.

 

Using the empirical distribution of ancestry proportions in Latino Americans and the empirical joint distribution of European and Native American allele frequencies, we calculated how many Latino samples would be required to achieve power comparable to genotyping 1,000 European cases and 1,000 European controls in a whole-genome scan.  We found that 10% fewer samples, or 900 cases and 900 controls, were required for random markers.  For markers in the top 10% of frequency differentiation between Europeans and Native Americans, which might drive differences in prevalence of diseases such as type 2 diabetes, only 660 cases and 660 controls were required.  (These calculations assume that the causal variant has been genotyped, and do not account for the advantage of increased linkage disequilibrium within chromosomal segments of Native American ancestry.) Sample for sample, our results indicate that admixed populations are substantially more powerful for identifying disease variants, even for variants with only average differences in frequency across populations. For these reasons we suggest that researchers should specifically choose to study admixed populations in preference to unadmixed populations for any WGAS for which samples from both admixed and unadmixed populations are available.

                       

8. Genetic interactions in a cross of outbred S. cerevisiae strains:  Snapshots along the speciation trail          

 

Joseph Mellor, Najaf A. Shah, John W. Rodgers, John L. Hartman 4th, Frederick P. Roth*  

Harvard Medical School, Dept of Biol. Chem & Molec. Pharm., Dana-Farber Cancer Inst., Center for Cancer Systems Biology               

 

Sometimes the phenotypic effects of one mutation depend on another.  This phenomenon defines genetic interaction.  Isolated populations tend to diverge, each acquiring a constellation of mutations that are selected for mutual compatibility and fitness within their environment.  Individuals from diverged populations may be crossed to reveal incompatibilities and other genetic interactions.  We analyzed a previously described cross of two outbred S. cerevisiae strains, from which both parents and 121 segregants were genotyped at ~3,000 loci.  A search for significantly depleted ditypes at a false discovery rate of 0.05 revealed 5 genetic interactions.  We also measured the exponential growth rate of parental strains and segregants to reveal genetic loci and genetic interactions associated with fitness.  Loci associated with these genetic interactions represent sites of selection, with examples that both promote and reduce compatibility between these diverged strains.  In addition, specific genetic interactions were found which depended on environmental conditions.

 

                                                                                                                                   

9. On Combining Triads and Unrelated Subjects Data in Assessing Genetic Association with Disease Risk: An Application to Testicular Cancer

 

Li Hsu*, Jacqueline R. Starr, Yingye Zheng, Stephen M. Schwartz

Fred Hutchinson Cancer Research Center

 

Combining data collected from different sources is a cost-effective and time-efficient approach for enhancing the power to detect weak-to-modest genetic effects or gene-gene or gene-environment interactions. However, combining data across studies becomes complicated when data are collected under different study designs, such as family-based and unrelated individual-based (e.g., population-based case-control design). In this paper, we describe a general method that permits the joint estimation of effects on disease risk of genes, environmental factors, and gene-environment interactions under a hybrid design that includes cases, parents of cases, and unrelated individuals. We provide both asymptotic theory and statistical inference. Extensive simulation experiments demonstrate that the proposed estimation and inferential methods perform well in realistic settings. We illustrate the method by an application to a study of testicular cancer.

                                                                                                                       

10. Joint modeling of functional SELDI-TOF mass spectrometry proteomic

 

Jaroslaw Harezlak*, Xihong Lin, Shan Jiang, Mike Wu, Mike Wang & David Christiani     

Harvard School of Public Health                   

 

Cancer biomarkers play an important role in disease diagnosis. The emerging field of proteomics presents a great opportunity for candidate biomarker identification, because we can simultaneously measure and analyze a large number of proteins. In this poster, we provide a unified statistical approach to modeling, feature extraction and risk prediction from high-dimensional mass spectrometry data. The task is challenging due to several factors, including non-stationarity and correlation of the error terms, high-dimensional data and variable selection. We propose to use local polynomial kernel regression smoothing with adaptive bandwidth selection for MS modeling, the significant zero-downcrossing idea for feature selection, a warping algorithm for feature alignment and penalized logistic regression for disease risk estimation. We also describe a cross-validation procedure that takes into account the uncertainty in the processing steps. We apply our method to the lung cancer SELDI-TOF MS data set, and find a predictive model with good specificity and sensitivity.

                                                           

11. Micromanipulation of the Extracellular Matrix: From Cancer Treatment to Stem Cell Differentiation          

 

Asad Moten*, Shefa Moten    

Massachusetts Institute of Technology, NASA, Harvard University              

 

Cells are inherently sensitive to local mesoscale and microscale patterns of chemistry and topography. Recent research has investigated how surface mechanics might dictate cell behavior, affecting both cell function and differentiation. This study proposes that alterations of the underlying matrix can dictate cell behavior and function. Through micro-orienting the substrate and micro-patterning protein, we were able to engineer a structural and biological backbone to regulate cellular behavior. The techniques used in this study to control cell behavior include micro-orienting of the ECM, electrospinning a 3D fibrous PMMA scaffold with the optimum diameter and orientation, micro-patterning of fibronectin (Fn) for cellular attachment, applying magnetic/electric fields to fibronectin, creating a novel suicide circuit for tumor-targeting bacteria, and altering the stiffness of the surface to initiate cellular differentiation. Furthermore, cancer was studied and regulated at the single-cell level through the alteration of the ECM mechanics. In addition, we determined that by manipulating the numerous mechanical forces acting upon the cells, the genetic and morphological changes can lead to the activation of the genes necessary to synthesize the ECM components for differentiation. The use of microscale structuring to restore tissue architecture and dictate cell behavior has several important implications for tissue engineering, cancer treatment, and stem cell differentiation.

                                                                                                                       

12. A New Multilevel Nonlinear Mixture Dirichlet Model for Expressed Sequence Tag Data   

 

Fang Yu*, Ming-Hui Chen, Lynn Kuo, Wanling Yang

University of Nebraska                      

 

ESTs (Expressed Sequence Tags) are usually one-pass sequencing reads of cloned cDNAs derived from a certain tissue. A characteristic of EST data is that the total number of observed tags is relatively small compared to the number of unique tags, and many unique tags have a zero count (under-representation) while a few unique tags have huge counts (over-representation). To fit such data, we propose a new multi-level nonlinear mixture Dirichlet model for the expression levels of each gene from multiple tissues of multiple tissue types. A Bayesian criterion-based measure is developed for comparing the proposed model to other existing models. The advantages of our model are that (1) it resolves the issues caused by over-representation and under-representation, (2) it allows information to be borrowed from different libraries of the same tissue type, and (3) it provides direct measures of gene-level expression. We also develop a novel computational algorithm and gene selection criterion for detecting genes with different expression across different tissue types. A real EST dataset is analyzed to illustrate the proposed methodology.

                                                                                                                                   

13. SimuGeno: simulation software for genome wide case-control association studies  

 

Youfang Liu*, Mike Weale

Bioinformatics Research Center, North Carolina State University                

           

Genome-wide association testing has become a powerful and important tool in the study of complex genetic disease. Many novel methods for testing association have been developed. One key issue is how to evaluate the power of each method under realistic settings. Simulation is an efficient way to evaluate the ability of novel methods to detect disease markers, but it needs to be tied to the linkage disequilibrium (LD) properties of real datasets. SimuGeno is a package for simulating large-scale genomic data for case-control association studies, using real data as a starting point. It can take HapMap data or any other real data sets as input. Causal loci can be either random or user-defined. As an intermediate step, SimuGeno relies on phased haplotype data. We provide two options for the simulation of case and control datasets: (1) bootstrapping of haplotypes; and (2) a method proposed by Dudbridge (2006), which generates new haplotypes based on the recombination rate and a random mating assumption. In this second method, two chromosomes are randomly selected, grouped in pairs, and gametes are constructed using HapMap recombination maps. New genotypes are constructed by random union of gametes. When we simulate whole-genome data containing about 500k SNPs (the currently popular size for genome-wide association studies), computation time becomes a problem. To solve this problem, SimuGeno undertakes what we call causal region simulation. The rationale behind causal region simulation is that only the SNPs close to causal loci will show differences between cases and controls. In causal region simulation, SimuGeno selects, for each causal SNP, a causal region with that causal SNP located at its center. The edges of each region are determined either by physical distance or by the location of recombination hotspots. Simulation takes place within these regions only.
Finally, the newly constructed causal regions are plugged back into the original chromosomal data to create new chromosomes. SimuGeno uses the logistic regression model to determine the disease status: Logit[Pr(D|genotype)] = β0 + β1*G1 + β2*G2 + ... + βi*Gi, where G stands for the genotype coded as 0, 1, 2 and i stands for the number of disease loci. The logistic model allows a single causal locus, multiple causal loci and gene-gene interactions among them, and thus allows for complex disease simulation. SimuGeno simulated data maintain allele frequencies and LD structure that are similar to the original data. As an example, we applied SimuGeno to HapMap CEU chromosome 21 and 22 data and found that the simulated data do indeed have a very similar allele frequency pattern and LD structure compared to the original HapMap data.
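The haplotype bootstrap and the logistic disease model described above can be sketched in a few lines. This is an illustrative toy and not SimuGeno's code: the function names, the `causal` index list, and the coefficient values are assumptions made for the example, and interaction terms are left out.

```python
import math
import random

def bootstrap_genotypes(haplotypes, n, rng):
    """Option (1): draw two haplotypes with replacement per individual
    and add their alleles to obtain genotypes coded 0, 1, 2."""
    out = []
    for _ in range(n):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        out.append([a + b for a, b in zip(h1, h2)])
    return out

def disease_probability(genotype, causal, beta0, betas):
    """Pr(D | genotype) with logit = beta0 + sum of beta_k * G_k over causal loci."""
    eta = beta0 + sum(b * genotype[j] for j, b in zip(causal, betas))
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_status(genotype, causal, beta0, betas, rng):
    """Draw case (1) or control (0) status from the logistic model."""
    return 1 if rng.random() < disease_probability(genotype, causal, beta0, betas) else 0
```

A hypothetical run would bootstrap genotypes from phased haplotypes, call `simulate_status` on each, and keep sampling until the desired numbers of cases and controls are reached.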

           

14. Model selection in meta-analytical framework for prototype-based clustering        

 

Benjamin Haibe-Kains*, Christos Sotiriou and Gianluca Bontempi

Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium

 

Introduction: The use of dimension reduction methods in microarray analysis is justified by the characteristics of the data, such as the high feature-to-sample ratio, the high correlation of coexpressed genes and the high level of noise due to complex technology. Clustering analysis is widely used to perform dimension reduction while keeping the new features interpretable. This method consists of replacing a cluster of correlated genes by a cluster centroid (called a feature), and can be used in biologically driven microarray analysis. We aimed to use a priori biological knowledge efficiently to improve clustering methodology for dimension reduction. Methods: We introduced a new method called prototype-based clustering to identify genes that are coexpressed with one prototype, i.e., one gene representative of a biological process of interest. For each gene to cluster, we fitted univariate and multivariate linear models with the prototypes playing the role of explanatory variables. We compared these models based on their leave-one-out cross-validation error computed by the PRESS statistic. Using Friedman's test, the models exhibiting the lowest errors were selected to identify the cluster of the gene. This method was used in a meta-analytical framework in order to combine model selection from different datasets. Results: We applied our method to two public microarray datasets of untreated breast cancer (BC) patients. We used hallmarks of BC involving various biological processes such as estrogen receptor, her2/neu signaling, proliferation, tumor invasion, immune response, angiogenesis, and apoptosis as prototype genes. We reduced the number of variables from ~20,000 to seven while keeping valuable information 1) to robustly define BC molecular subtypes based on estrogen receptor and her2/neu features and 2) to investigate the impact of the seven features on clinical outcome. These two questions were addressed using fifteen public microarray datasets (~2100 patients).
Conclusions: The use of prototype-based clustering allowed for efficient reduction of the dimensionality of microarray data by focusing on targeted biological processes. We successfully applied this method to BC samples in order to gain new insights into BC biology.
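The leave-one-out error used to compare models can be computed without refitting, via the PRESS statistic. The sketch below covers only the univariate case (one prototype as the explanatory variable) and simply picks the prototype with the lowest PRESS; the abstract's selection step uses Friedman's test across models, which is not reproduced here. Function names and the toy data in the usage note are hypothetical.

```python
def press_simple(x, y):
    """PRESS for the model y = a + b*x, using the leverage shortcut
    e_i / (1 - h_i) for each leave-one-out residual."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    press = 0.0
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of this observation
        press += ((yi - (a + b * xi)) / (1.0 - h)) ** 2
    return press

def best_prototype(gene, prototypes):
    """Name of the prototype whose univariate model has the lowest PRESS.
    Simplification: plain minimum instead of Friedman's test."""
    return min(prototypes, key=lambda name: press_simple(prototypes[name], gene))
```

For example, with made-up expression values, `best_prototype([1, 2, 3, 4], {"ESR1": [1, 2, 3, 4.1], "ERBB2": [4, 1, 3, 2]})` assigns the gene to the prototype it tracks most closely.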

 

15. Potential genes associated with BMI and interacting with dietary factors in Taiwanese population           

 

Wen-Harn Pan*, Chao-Chi Liang, Shing-Hong Chen, Shao-Yuan Chuang

Institute of Biomedical Sciences, Academia Sinica               

 

Background: Hundreds of potential obesity genes have been documented. However, only a few dozen have been replicated more than three times in human association studies. The aim of this study was to find influential obesity candidate genes, and the major ones that interact with dietary factors, in the Taiwanese population. Materials and methods: This study was conducted within the Cardiovascular Disease Risk Factor Two-township Study (CVDFACTS), using a nested case-control design. All 285 obese subjects (BMI: body mass index, 27kg/m2 persistently) of the cohort were included (100%) and 285 overweight subjects (24kg/m2<BMI 27kg/m2 persistently) were randomly selected (53%). We obtained 554 age- and sex-group-matched controls with normal BMI (34%) and chose 15 SNPs in 12 genes: ADRB2 (Arg16Gly, Gln27Glu), ESR1 A+51193T, FABP2 Ala54Thr, LEP A-2548G, LEPR Gln223Arg, PLIN G+11842A, PPARD T-87C, PPARG (G-82362A, Pro12Ala, G+28752A), TNFA G-308A, TNFB G+252A, UCP2 Ala55Val, and UCP3 C-2078T. Each SNP met at least one of the following criteria: (1) it was reported to be directly associated with obesity in at least three studies and was previously found to be related to morbid obesity in our laboratory, or (2) it interacted with environmental factors in its association with obesity. Dietary information was assessed by a validated food frequency questionnaire. Associations with genetic variants, nutrient parameters or gene-nutrient interactions were assessed by linear regression models with BMI as the dependent variable, adjusted for potential confounders. Results: ADRB2 Arg16Gly, PPARG G-82362A and FABP2 Ala54Thr were gene variants strongly associated with BMI variation, the latter two significant only in men (p=0.0319 for ADRB2 Arg16Gly, p=0.0105 for PPARG G-82362A and p=0.0058 for FABP2 Ala54Thr). Total energy intake and fat intake (% of energy) were two dietary factors associated with elevated BMI (p=0.0187 and p=0.0011, respectively).
With regard to gene-diet interactions, we found that total energy intake was associated with BMI for AA homozygotes at the LEP -2548 locus but not for their counterparts (p for interaction=0.0464). Furthermore, BMI was associated with dietary % fat intake for UCP2 Val55 or UCP3 T-55 variant carriers, but not for their counterparts (p for interaction=0.0004 and 0.0037, respectively). When all the aforementioned significant correlates were put in one multivariate regression model, the model explained 6% of the BMI variation in our population. Conclusions: We have constructed a statistical model for predicting BMI that combines genetic and environmental effects. With this approach, we may be able to substantially improve the prediction of BMI or obesity when more candidate variants are considered.

 

16. Incorporating biological information in genome-wide association studies      

 

Darlene R. Goldstein*, Kunlin Zhang, Anthony C. Davison, Olivier Martin, Brian J. Stephenson, C. Victor Jongeneel and Amalio Telenti

École Polytechnique Fédérale de Lausanne (EPFL)

 

Whole genome association studies represent a promising approach to identifying disease genes, but successful genome-wide association studies depend on a very dense map of markers. The advent of very high throughput genotyping technologies has made such studies feasible. However, difficulties in the analysis of large-scale genotype data persist. These include very high test multiplicity and problems related to small effects of contributing loci, leading to potentially high false positive rates where noise drowns out the signal. We are exploring approaches to increase the signal by incorporating biological information into the analysis. Our strategies include use of Gene Ontology information and functional annotation of markers included in the study. Each strategy results in a marker list along with corresponding significance levels. These approaches are described and their performance assessed based on reproducibility of results in an independent cohort.

 

17. Genome-wide predictors of plasma folate and homocysteine 

 

Aditi Hazra*, Peter Kraft, Edward L. Giovannucci, David J. Hunter

Harvard School of Public Health                   

 

Background: Suboptimal nutrition, including low folate consumption, contributes to 30%-35% of cancer occurrence worldwide, including colorectal cancer (CRC) and breast cancer, particularly among those who consume moderate to high levels of alcohol, which interferes with one-carbon metabolism. Although the metabolite levels are heritable, we can explain <10% of the genetic variance in plasma folate and plasma homocysteine levels. Methods: We propose to conduct a cost-effective multistage analysis of genome-wide data on plasma folate and homocysteine levels among 1,699 women in the National Cancer Institute-sponsored Cancer Genetic Markers of Susceptibility (CGEMS) project, perform 2.25 million SNP imputations based on haplotype structure, and combine these data with participants in the National Heart, Lung, and Blood Institute's Framingham SNP-Health Association Resource (SHARe) study. We will replicate a subset of the highest-ranking SNPs from the hypotheses generated in the initial analysis in 960 individuals and explore associations of the most promising SNPs with risk of CRC and adenoma (CRA) in nested case-control studies in the NHS. Results: Preliminary analysis of the combined CGEMS cases and controls suggests promising associations between plasma homocysteine and a variant on chromosome 2 (trend test p-value 1.09 x 10^-6) and between plasma folate and a variant on chromosome 3 (trend test p-value 1.34 x 10^-6). We observed the expected associations of plasma homocysteine levels with MTHFR Ala222Val (p=5.36 x 10^-4) and BHMT Arg239Gln (p=0.01), an nsSNP previously associated with CRC. The fact that known gene-metabolite associations were nominally significant but did not achieve genome-wide significance highlights the importance of replicating the top associations seen in CGEMS by pooled analysis with the SHARe scan and replication.

Conclusion: This study represents the first genome-wide association study of plasma folate and homocysteine. These data will provide an integrated research paradigm that synthesizes genome-wide analytic approaches and metabolic profiling to offer an informative platform for understanding gene-nutrient relationships and predicting their association with CRC and CRA risk. Our empirical characterization of genetic variation in plasma levels of folate and homocysteine may offer insight for future cancer prevention strategies and elucidate which population subgroups may benefit most from additional folate intake or supplementation.                

 

18. Detecting unusual genotypic patterns in a single subject        

 

Silviu-Alin Bacanu*, Matthew R. Nelson, Li Li, Margaret G. Ehm

GlaxoSmithKline                    

 

Many adverse drug reactions (ADRs) are rare and may only be observed in a small number of cases. Consequently, the limited number of cases for pharmacogenetic studies presents a challenge for analysis and inference using traditional methods. Under certain circumstances, we may even wish to make inferences about possible genetic causes within a single case. Even for larger case sample sizes, if we suspect the ADRs are genetically heterogeneous, it may be more appropriate to attempt case-specific genetic inferences.  In such instances, instead of aggregating the information at each marker (usually single nucleotide polymorphisms, or SNPs) among cases, we choose to aggregate the genotypic information among different adjacent markers for each case and obtain a statistic capturing the likelihood of those patterns against a reference control set. The distribution of this statistic under the null hypothesis is estimated by computing the same statistic for each control relative to the patterns found in the remaining controls. An appropriate p-value comparing the case statistic to the control statistics is obtained using a newly developed method for bounding tail probabilities (p-values) for any distribution. Using this method, we estimate the power to detect genetic aberrations such as deletions and loss of heterozygosity (LOH), among others. We apply this method to investigate whether any such unusual genotypic patterns occur in a study involving one ADR case suspected of having LOH in a drug-metabolizing gene.
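The leave-one-out null distribution described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: genotypes are coded 0/1/2, the window statistic is a sum of negative log smoothed genotype frequencies, and the p-value is a plain rank-based empirical p-value standing in for the authors' tail-probability bounding method, which the abstract does not specify. The function names are hypothetical.

```python
import math
from collections import Counter


def pattern_score(subject, reference, start, width):
    """Sum of -log(smoothed genotype frequency) over a window of adjacent markers.

    Larger scores mean the subject's genotypes are rarer in the reference set.
    """
    score = 0.0
    for m in range(start, start + width):
        counts = Counter(ref[m] for ref in reference)
        # Add-one smoothing over the three assumed genotype codes 0, 1, 2
        freq = (counts[subject[m]] + 1) / (len(reference) + 3)
        score += -math.log(freq)
    return score


def single_case_pvalue(case, controls, start, width):
    """Empirical p-value of one case against a leave-one-out control null.

    Each control is scored against the remaining controls; the case is scored
    against all controls. This is a simple rank-based stand-in, not the
    bounding method mentioned in the abstract.
    """
    case_score = pattern_score(case, controls, start, width)
    null_scores = []
    for i, ctrl in enumerate(controls):
        others = controls[:i] + controls[i + 1:]
        null_scores.append(pattern_score(ctrl, others, start, width))
    n_extreme = sum(s >= case_score for s in null_scores)
    return (n_extreme + 1) / (len(null_scores) + 1)


# Toy data: 20 mostly heterozygous controls and a case with a homozygous run
controls = [[0 if m == i % 5 else 1 for m in range(5)] for i in range(20)]
case = [0, 0, 0, 0, 0]
print(single_case_pvalue(case, controls, start=0, width=5))
```

With only 20 controls the smallest attainable p-value in this sketch is 1/21, which shows why a sharper tail bound is attractive when the reference set is small.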

 

19. A model-based approach for clustering genome-wide conserved non-coding elements           

 

Zhaohui Qin*, Gordon Robertson, Misha Bilenky, Steven Jones

Department of Biostatistics, University of Michigan                          

 

Recent databases and datasets of computationally identified conserved non-coding elements create new opportunities for studying relationships among functional and regulatory genomic elements. For motifs identified by multi-species comparative genomics approaches, clustering can be performed to identify groups of similar motifs. Traditional distance-based approaches such as hierarchical clustering face scalability challenges when clustering genome-wide sets of mammalian conserved elements, because the distance matrices are large. Model-based clustering is an alternative approach for such large sets of conserved elements. We developed motifOrganizer, a model-based motif clustering algorithm specifically designed to cluster very large sets of conserved elements collected from genome-wide searches. This new algorithm also allows motifs of different widths to be grouped in the same cluster. A simulation study demonstrated that the algorithm maintains high accuracy under different settings. We also tested real datasets from TRANSFAC, JASPAR and cisRED and demonstrated that motifOrganizer is able to scale up to analyze motifs collected from genome-wide searches on mammalian species.

 

20. Analysis of cytotoxicity dose-response curves for pharmacogenomic studies using Bayesian hierarchical nonlinear models

 

B.L. Fridley*, D. Schaid, R. Weinshilboum, L. Wang

Mayo Clinic                           

 

Pharmacogenomic research has recently incorporated cell-based model systems.  Hypotheses generated with the cell-based model system are then tested in individuals treated with the agent (translational medicine).  Currently, the genomic relationship with drug cytotoxicity is investigated either by analyzing cytotoxicity one drug concentration at a time or by fitting dose-response curves to the cytotoxicity endpoints, from which summary endpoints (e.g., GI50) are used as the phenotype in the analysis. A more complete analysis of the impact of genetic variation on the entire dose-response curve might better reflect the true relationship between drug response and genetic variation and lead to insight into the pharmacogenomics of a particular drug/agent. A large number of statistical methods have been developed to evaluate non-linear dose-response curves in order to model how response curve parameters are influenced by subject-specific characteristics, but few efforts have been made to tie these types of models to genomic studies. We will illustrate the use of Bayesian nonlinear hierarchical models for the analysis of pharmacogenomic data with concentration-effect endpoints. Concentration-effect endpoints are any measurable cellular phenotype that is related to drug concentration, one example being cytotoxicity. The model will be illustrated using data from a study of the pharmacogenomics of gemcitabine, collected by the Mayo-NIH PGRN on the Coriell Human Variation Panel comprising 203 cell lines, and using simulated data.
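To make the contrast between a single summary endpoint and the whole curve concrete, the following minimal sketch uses a four-parameter logistic curve, a common choice for dose-response data but an assumption here, since the abstract does not name the curve family. The additive genotype effect on the log midpoint, and all function and parameter names, are hypothetical illustrations rather than the authors' model; no Bayesian fitting is shown.

```python
import math


def hill_response(conc, top, bottom, midpoint, slope):
    """Four-parameter logistic curve: response falls from top to bottom as conc rises."""
    return bottom + (top - bottom) / (1.0 + (conc / midpoint) ** slope)


def concentration_for_response(resp, top, bottom, midpoint, slope):
    """Invert the curve: concentration at which the response equals resp.

    Requires bottom < resp < top.
    """
    ratio = (top - bottom) / (resp - bottom) - 1.0
    return midpoint * ratio ** (1.0 / slope)


def midpoint_for_genotype(genotype, log_mid0, beta):
    """Hypothetical genotype effect: log(midpoint) shifts by beta per allele copy."""
    return math.exp(log_mid0 + beta * genotype)


# Assumed curve parameters for illustration only (response in % of untreated cells)
top, bottom, slope = 100.0, 10.0, 1.5
for genotype in (0, 1, 2):
    mid = midpoint_for_genotype(genotype, log_mid0=math.log(2.0), beta=0.4)
    gi50 = concentration_for_response(50.0, top, bottom, mid, slope)
    print(genotype, round(mid, 3), round(gi50, 3))
```

In this sketch GI50 is just one point read off each curve, whereas a hierarchical model would let genotype act on any of the curve parameters (top, bottom, midpoint, slope) across all cell lines at once.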

 

21. Exploratory analysis of correlated structure across multiple datasets using multiple co-inertia analysis     

 

Purvis LBC*, Culhane AC, Rubio R, Yeatman TJ, Quackenbush J

Harvard School of Public Health, Dana Farber Cancer Institute, University of Vermont                   

 

While much focus has been placed on methods for feature selection or supervised analysis of "omics" data, relatively less attention has been given to tools for exploratory data analysis. Studies that use a high throughput omics approach are frequently hypothesis-seeking, data-driven studies, and thus tools for visualization and exploratory analysis of these complex, high-dimensional data are essential. Based on a covariance optimization criterion, multiple co-inertia analysis (MCOA) enables the simultaneous ordination of multiple datasets. MCOA is a symmetric coupling technique that concurrently optimizes global variability (inertia) within each dataset and the squared correlation between the principal structures of each table. As a result, MCOA maximizes co-structure through an optimized composite function of correlation between table ordinates and within-table covariance. MCOA axes will thus be highly correlated and explain much of the variance within each dataset. Breast cancer is a significant cause of mortality and morbidity, accounting for nearly 1 in 3 cancers diagnosed in US women. Breast cancer is a complex disease requiring multiple genetic alterations, and research indicates that genetic mutations occur in the histologically normal breast tissue adjacent to the tumor in nearly 60% of patients, suggesting that such pre-cancerous changes might be diagnostically useful. Here we present an analysis of gene expression profiles designed to search for evidence of such alterations. Five tissue samples were collected from each breast cancer patient: one from the primary tumor and four from adjacent tissue collected at 1, 2, 3, and 4 cm from the tumor. Microarray expression profiles were generated for each sample. We describe the application of MCOA to this dataset and demonstrate its ability to effectively visualize intrinsic correlated structure, providing a useful starting point for subsequent supervised analyses.
                                          

 

22. Bayesian modeling of complex traits.

 

Paola Sebastiani

Department of Biostatistics, Boston University

 

Discovering the genetic basis of common diseases is one of the major challenges of the biomedical sciences, and one that has been limited in the past to candidate gene studies. Nowadays, advances in high throughput technology can provide genetic information on almost the entire genome and open the opportunity to discover the genetic basis of complex disease. However, a challenge of this discovery process is building complex models with many variables from massive amounts of data. This talk will describe some of the issues related to modeling complex traits, the new challenges posed by the analysis of genome-wide data, and the feasibility of network modeling. In particular, we will present a hierarchical and modular approach to screening genome-wide genotype data that incorporates quality control filters, linkage disequilibrium, physical distance and gene ontology to build prognostic models of complex traits. We will use examples of genetic dissection of complex traits using data from a cohort study of the complications of sickle cell anemia and data from a cohort study of exceptional longevity.

 

23. A Quantitative Framework for Discovering Gene-Gene Interactions Associated with Complex Diseases

 

Indika Rajapakse* and Lue Ping Zhao

Fred Hutchinson Cancer Research Center                  

 

Understanding gene-gene interactions is essential for characterizing the genetic basis of most complex disease phenotypes, from cancer to coronary heart disease to diabetes.  While many ongoing genome-wide studies focus on disease association with single genes (a single SNP or a haplotype of multiple SNPs from the same gene/region), the next wave of analytic effort utilizing data sets from these genome-wide studies will target gene-gene interactions and their association with complex diseases.  In anticipation of this new wave, several research groups have been actively developing methods for characterization of all possible gene-gene interactions.  Building upon existing work, we have identified a general quantitative framework that encompasses many existing models used to characterize gene-gene interactions.  Moreover, the simplicity and generality of this framework allow us to appreciate the quantitative penetrance of gene-gene interactions in complex diseases, as well as to generalize methods of characterizing two-gene interactions to three-way or higher-order interactions.  In addition, from the saturated version of this framework, we have derived explicit calculations for evaluating two- or three-way interactions.  The proposed framework is illustrated using data from a myeloid leukemia study on regions of chromosome 6.
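As a rough illustration of what a saturated two-locus model can look like, the sketch below parameterizes four penetrances on the logit scale and reads the interaction off as a contrast. This is one standard parameterization chosen for illustration, not necessarily the authors' framework; the binary carrier coding, the logit link, and all names are assumptions.

```python
import math


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def saturated_two_locus(pen):
    """Solve logit(pen[(i, j)]) = b0 + b1*i + b2*j + b12*i*j for i, j in {0, 1}.

    pen maps (carrier status at locus A, carrier status at locus B) to the
    disease penetrance. b12 is the interaction term: the departure of the
    double-carrier log odds from the sum of the two single-locus effects.
    """
    b0 = logit(pen[(0, 0)])
    b1 = logit(pen[(1, 0)]) - b0
    b2 = logit(pen[(0, 1)]) - b0
    b12 = logit(pen[(1, 1)]) - b0 - b1 - b2
    return b0, b1, b2, b12


# Made-up penetrances for illustration only
penetrance = {(0, 0): 0.05, (1, 0): 0.08, (0, 1): 0.07, (1, 1): 0.30}
b0, b1, b2, b12 = saturated_two_locus(penetrance)
print(round(b12, 3))
```

A three-way version would add a third index and a three-factor term, solved from the eight penetrances in the same way.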

 

24. A systems-based evaluation of toxicogenomic microarray data:  how toxicants alter gene pathways and functional gene categories

 

Xiaozhong Yu, William C. Griffith, Eric M. Vigoren*, Kristina Hanspers, James F. Dillman III, Hansel Ong, Melinda A. Vredevoogd, and Elaine M. Faustman

 

Institute for Risk Analysis and Risk Communication, Dept. of Environmental and Occupational Health Sciences, University of Washington; GenMAPP Development Team, Bioinformatics Research Associate/Conklin Lab Gladstone Institute of Cardiovascular Disease/UCSF; Cell and Molecular Biology Branch, U.S. Army Medical Research Institute of Chemical Defense

 

A major challenge in the interpretation of microarray results is the laborious task of functional interpretation, linking potentially interrelated alterations in gene expression to conventional toxicological endpoints. Researchers have gathered annotations from databases and have identified characteristics of extracted sets of genes. The Gene Ontology (GO) consortium initiated the standardization of annotation terms, making them applicable across different organisms. Gene ontology and pathway mapping are both powerful approaches used to generate a global view of the biological processes and cellular components affected by toxicants. However, this one-dimensional analysis does not allow for quantitative or qualitative evaluation of possible dose- or time-dependent genomic relationships. To make the best use of toxicogenomic data for risk assessment, we propose that comparisons across toxicant doses and time evaluate how gene pathways, rather than single genes, are altered. We propose a systems-based approach to integrate the raw gene expression data of genes associated with specific functional categories derived from the GO database. We developed a program (GO-Quant) to estimate the average expression values (intensity or ratio) for significantly altered genes, by functional gene category, based on MAPPFinder results. To demonstrate its application, we applied this approach to a previously published dose- and time-dependent toxicogenomic dataset (J.F. Dillman et al., 2005, Chem. Res. Toxicol. 18:28-34). The results of our analysis indicate that the systems approach can describe quantitatively the degree to which functional gene systems change across dose or time, providing quantitative estimates for each specific functional GO term. Supported by NIEHS U10 ES 11387, EPA R826886, NIEHS 1PO1ES09601, R01-ES10613, NIEHS P50 ES012762, NSF OCE-0434087, NIEHS 5 P30 ES07033, USDOE DE-FG02-03ER63674.
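The category-level averaging described for GO-Quant can be sketched in a few lines. This is not the GO-Quant program itself: the input layout (one list of log-ratios per gene, ordered by dose), the gene names, and the GO term labels are all hypothetical, and the set of significantly altered genes is assumed to have been determined upstream.

```python
from statistics import mean


def category_averages(expression, categories, significant):
    """Average expression per functional category and condition.

    expression:  gene -> list of values, one per dose or time point
    categories:  category label -> list of member genes
    significant: set of genes already flagged as significantly altered
    Categories with no significant member genes are omitted.
    """
    result = {}
    for term, genes in categories.items():
        kept = [g for g in genes if g in significant and g in expression]
        if not kept:
            continue
        n_conditions = len(expression[kept[0]])
        result[term] = [
            mean(expression[g][c] for g in kept) for c in range(n_conditions)
        ]
    return result


# Toy log-ratios at three doses; all names below are made up
expression = {
    "geneA": [0.0, 1.0, 2.0],
    "geneB": [0.0, 3.0, 4.0],
    "geneC": [0.0, -1.0, -2.0],
    "geneD": [5.0, 5.0, 5.0],
}
categories = {
    "GO:hypothetical_1": ["geneA", "geneB", "geneD"],
    "GO:hypothetical_2": ["geneC"],
    "GO:hypothetical_3": ["geneD"],
}
significant = {"geneA", "geneB", "geneC"}
print(category_averages(expression, categories, significant))
```

Each category then has one value per dose or time point, which is what allows dose- or time-dependent trends to be compared across functional categories instead of gene by gene.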

 

25. Indexing regulatory potential of genome loci to gene expression for candidate evaluation.

 

Oliver Hofmann, Chris Maher, Adele Kruger, Vlad Bajic and Winston Hide*

 

South African National Bioinformatics Institute, Private Bag x17 University of the Western Cape, Bellville, 7925, South Africa.

Biostatistics department, Harvard School of Public Health, 655 Huntington Avenue Boston, MA  02115

 

Several exciting new studies have been published as a result of genome-wide interrogation of tens of thousands of unrelated individuals. The scale of these studies is necessitated by the statistical power required to derive meaningful results. Among the genomic loci these studies identify as being of interest for disease association, some are associated with coding regions of genes, others lie in non-coding regions, and others lie in gene deserts, which contain no known associated functional gene. Clearly, our understanding of biological function can contribute a great deal to reducing the search space. In order to assess the value of SNPs or the candidate regions they are associated with, it is useful to be able to characterise any associated transcript expression at or near that position. The value of a candidate can be assessed in the context of known expression in terms of anatomy, time, species, cell type, or pathology (eVOC). We have been leveraging our ability to assess loci by interrogating the genome for consistently described gene expression information using the eVOC gene expression vocabulary and biomart (www.evocontology.org, www.biomart.org). Given the degree to which this human gene expression data is available, a better understanding of its regulation would contribute substantially to its value. We have embarked upon a process of mapping regulation potential to gene expression states and disease. We are developing an index of gene regulation, linked with eVOC, with the goal of grouping promoter elements by their common expression potential. In this manner, unknown regions of the genome can be compared against the index for similarities in potential expression. We will explore an encoding approach for regulation potential relevant to disease loci.