The posters will be
presented at the Conference Reception
at Element Cafe of the Conference
Center at Harvard Medical from 7:00pm to 9:30pm on Wednesday, October 10,
2007 (* indicates the presenter).
1. Speed Up Genome-Wide Association Studies by
Random Walk Algorithms
2. Gene ranking in survival analysis assessment
of prognostic value of genome expression experiments
3. Inference from Genome-Wide Association Studies
using a Novel Markov Model
4. Survival Analysis With Large Dimensional
Genomic Data And Possible Pathways
5. Peak detection of SELDI measurements for
identifying protein biomarkers
6. Haplotype-based Association analysis via
variance component score test
7. Whole-genome association studies have
substantially increased power in admixed populations
10. Joint modeling of functional SELDI-TOF mass
spectrometry proteomic
12. A New Multilevel Nonlinear Mixture Dirichlet
Model for Expressed Sequence Tag Data
13. SimuGeno: simulation software for genome wide
case-control association studies
14. Model selection in meta-analytical framework
for prototype-based clustering
15. Potential genes associated with BMI and
interacting with dietary factors in Taiwanese population
16. Incorporating biological information in
genome-wide association studies
17. Genome-wide predictors of plasma folate and
homocysteine
18. Detecting unusual genotypic patterns in a
single subject
19. A model-based approach for clustering
genome-wide conserved non-coding elements
22. Bayesian modeling of complex traits.
23. A Quantitative Framework for Discovering
Gene-Gene Interactions Associated with Complex Diseases
25. Indexing regulatory potential of genome loci
to gene expression for candidate evaluation.
Di Zhao
Louisiana Tech University
With recent developments in genotyping technology, genome-wide association
studies of genetic variation and disease in populations have become feasible
through the analysis of SNPs and sample labels. The Monte Carlo method is
currently the standard method for evaluating association. However, because of
the extremely large number of SNPs across the whole human genome and the number
of samples from disease and control populations, the computation becomes
difficult or even infeasible. Sampling methods, such as importance sampling,
are effective at reducing the time complexity of the standard Monte Carlo
method (SMC). In this paper, we develop a sampling method, named random walk
algorithms (RWA), for genome-wide association studies. Theoretical analyses
show that, compared with SMC, RWA significantly reduces time complexity, from
O(lmn) to approximately O(mn). Computational experiments show that RWA is
approximately 10^3 to 10^5 times faster than SMC.
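As a point of reference for the standard Monte Carlo baseline, the sketch below estimates a permutation p-value for a single SNP by shuffling case/control labels. It is not the authors' random walk algorithm, which the abstract does not describe. The test statistic (absolute difference in mean allele count) and the function name `mc_permutation_pvalue` are assumptions chosen purely for illustration.

```python
import random


def mc_permutation_pvalue(genotypes, labels, n_perm=1000, seed=0):
    """Monte Carlo permutation p-value for one SNP.

    genotypes: minor-allele counts (0, 1, or 2), one per sample.
    labels: 1 for case, 0 for control, one per sample.
    The statistic (absolute difference in mean allele count between
    cases and controls) is an illustrative assumption.
    """
    rng = random.Random(seed)

    def statistic(lab):
        cases = [g for g, y in zip(genotypes, lab) if y == 1]
        controls = [g for g, y in zip(genotypes, lab) if y == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-label association
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Each permutation touches every sample, so repeating this over many SNPs and many permutations grows with the product of the three counts. That is one possible reading of the O(lmn) figure quoted above, although the abstract does not define l, m, or n.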
Giulia Tonini*,
Annibale Biggeri, Michela Baccini
Università di Firenze and CSPO, Firenze
An aim of this work was to investigate the performance of gene expression in
predicting survival. We developed a statistical approach that considers the net
contribution of each gene after adjusting for other known prognostic factors.
We then extended the approach to the effect of each gene given all other
investigated genes. Several approaches can be found in the literature. The
Significance Analysis of Microarrays (SAM) has been generalized to censored
survival data, but these procedures are usually based on a marginal approach:
they investigate the effect of gene expression without taking into account
other potential prognostic variables (e.g., age, gender, stage of disease), and
they assess the effect of each gene separately. We applied a two-step
procedure. First, we considered the problem of ranking genes in order of their
prognostic value. Since we were interested in the net contribution of
differential gene expression to survival, we based the ranking on a score test
statistic adjusted for other relevant prognostic factors (patient and tumor
characteristics). Second, we specified a penalized Cox regression model
considering the 40 top-ranked genes obtained from the first step. The two-step
procedure was motivated by the need to reduce the computational burden of the
penalized regression approach. We exemplify our approach using a real dataset
on colon cancer collected by the Pharmacology Department of the University of
Florence and the Tuscany Cancer Institute (ITT).
F.J. Hosking*, J.A.C. Sterne, G. Davey Smith, P.J.
Green
University of Bristol, UK
We propose a Bayesian modelling approach to the analysis of genome-wide
association studies (GWAS) based on single nucleotide polymorphism (SNP) data.
Our model combines various aspects of k-means clustering, hidden Markov models
(HMMs) and logistic regression into a fully Bayesian
model. It is fitted using the Markov chain Monte Carlo (MCMC) stochastic
simulation method, with Metropolis-Hastings update steps. The approach is
flexible, both in allowing different types of genetic models, and because it
can be easily extended while remaining computationally feasible due to the use
of fast algorithms for HMMs. It allows for inference
primarily on the location of the causal locus but also on other parameters of
interest. The model is used here to analyse three
data sets, using both synthetic and real disease phenotypes with real SNP data,
and shows promising results.
Yi Li, David Engler*
HSPH
Use of microarray technology
often leads to high-dimensional, low-sample-size data settings. Over the
past several years, a variety of approaches have been proposed
for variable selection in this context. However, only a few are applicable to
time-to-event data where censoring is present. Among standard variable
selection methods shown both to have good predictive accuracy and to be
computationally efficient is the elastic net penalization approach. In this
paper, adaptation of the elastic net penalized estimation approach is presented
for variable selection both under the Cox proportional hazards model and under
an accelerated failure time (AFT) model.
A key advantage of this method is its capability
of preserving "grouping effects", ideal for modeling unobserved
pathways. Assessment of the two methods is conducted through simulation studies
and through analysis of microarray data obtained from a set of patients with
diffuse large B-cell lymphoma where survival time
is of interest. The approaches are shown to be an improvement over the
LARS-based (Efron et al., 2004) LASSO variable
selection method in terms of predictive performance when identification of
highly correlated variables is of interest and in settings where a high degree
of censoring is present.
Chuen Seng
Tan*, Alexander Ploner, Andreas Quandt,
Janne Lehti, Maria Pernemalm, Rolf Lewensohn, Yudi Pawitan
Department of Medical
Epidemiology and Biostatistics, Karolinska Institutet and Center for Molecular Epidemiology, National University
of Singapore
Aim: Protein expression
profiling data from the surface-enhanced laser desorption
and ionization (SELDI) technology is used to discover biomarkers for clinical
diagnosis, prognosis and therapy prediction. The pre-processing of the raw data
however is still problematic. We aim to develop a peak detection method with
better specificity than standard methods. Methods: Scientists inspect
individual spectra visually and laboriously to verify that the peaks identified
by the standard method are real. Motivated by this multi-spectral practice, we
investigate an analytical approach that reduces the data to a single spectrum
of F-statistics capturing significant variability between spectra. To account
for multiple testing, we use a false discovery rate criterion to identify
potentially interesting proteins. To annotate the peaks, we extract a peak
template from all spectra via principal component analysis. Finally, with
the template, we estimate the amplitude and location of the peak in each
spectrum with the least-squares method and refine the estimation of the
amplitude via the mixture model. Results: We show that our approach has better
operating characteristics than several existing methods and gives more accurate
peak annotations than the standard method. Conclusion:
We find that our approach
alleviates the main problems in the preprocessing of SELDI-TOF spectra.
Jung-Ying Tzeng*,
Daowen Zhang
NC State University
Haplotypes provide a more
informative format of polymorphisms for genetic association analysis than
single SNPs, but the practical efficacy of haplotype
association analysis faces a trade-off between the benefits of modeling
abundant variation and the cost of the extra degrees of freedom. To reduce the
degrees of freedom, several strategies have been considered in the literature, including (1)
clustering evolutionarily close haplotypes, (2) modeling the level of haplotype
sharing, and (3) smoothing haplotype effects by introducing a correlation
structure on the effects of similar haplotypes and studying the variance
components for association. While the first two strategies enjoy a fair extent
of power gain, empirical evidence suggests that variance component (VC)
methods may exhibit only similar or less power than the standard haplotype
regression method, even when there are many haplotypes (Schaid
2004). In this study, we report possible reasons for this loss of power
and show how the power of the VC strategy can be improved. We
construct a score test based on the restricted maximum likelihood (REML)
function of the variance components and identify its non-typical limiting
distribution. Through simulation, we demonstrate the validity of the test and
the power improvement over the standard haplotype regression method. With
suitable choices of the correlation structure, the proposed method can be
directly applied to unphased genotypic data. Our
method can be applied under a general model framework, and is computationally
efficient and easy to implement. The broad coverage and the fast-and-easy
implementation of our method make the VC strategy a powerful tool for haplotype
analysis even in the modern genome-wide association studies.
Alkes L. Price*, Simon R. Myers,
Nick Patterson and David Reich
Harvard Medical School &
Broad Institute of MIT and Harvard
Whole-genome association
studies (WGAS) are a powerful way to identify common variants conferring
disease risk. WGAS have been carried out almost exclusively in populations of
European ancestry, with little or no representation of underserved populations
such as Latinos and African Americans. This may be due in part to the technical
challenges posed by admixed populations, but we and others have now developed
methods that enable fully powered WGAS in these populations. We set out to
investigate the power of WGAS in admixed populations, and found that there is a
major gain in power and efficiency.
These populations offer more power on average because (a) multiple
ancestral populations provide increased genetic variation, and (b) there is no
noise introduced from controls in the admixture association component of the
overall signal.
Using the empirical
distribution of ancestry proportions in Latino Americans and the empirical
joint distribution of European and Native American allele frequencies, we
calculated how many Latino samples would be required to achieve power
comparable to genotyping 1,000 European cases and 1,000 European controls in a
whole-genome scan. We found that 10%
fewer samples, or 900 cases and 900 controls, were required for random
markers. For markers in the top 10% of
frequency differentiation between Europeans and Native Americans, which might
drive differences in prevalence of diseases such as type 2
diabetes, only 660 cases and 660 controls were required. (These calculations assume that the causal
variant has been genotyped, and do not account for the advantage of increased
linkage disequilibrium within chromosomal segments of Native American
ancestry.) Sample for sample, our results indicate that admixed populations are
substantially more powerful for identifying disease variants, even for variants
with only average differences in frequency across populations. For these
reasons we suggest that researchers should specifically choose to study admixed
populations in preference to unadmixed populations
for any WGAS for which samples from both admixed and unadmixed
populations are available.
Joseph Mellor, Najaf A. Shah, John W. Rodgers, John L. Hartman 4th,
Frederick P. Roth*
Harvard
Medical School, Dept of Biol. Chem & Molec. Pharm., Dana-Farber Cancer Inst.,
Center for Cancer Systems Biology
Sometimes the phenotypic
effects of one mutation depend on another.
This phenomenon defines genetic interaction. Isolated populations tend to diverge, each
acquiring a constellation of mutations that are selected for mutual
compatibility and fitness within their environment. Individuals from diverged populations may be
crossed to reveal incompatibilities and other genetic interactions. We analyzed a previously described cross of
two outbred S. cerevisiae
strains, from which both parents and 121 segregants
were genotyped at ~3,000 loci. A search for significantly depleted ditypes
at a false discovery rate of 0.05 revealed 5 genetic interactions. We also measured the exponential growth rate
of parental strains and segregants to reveal genetic
loci and genetic interactions associated with fitness. Loci associated with these genetic
interactions represent sites of selection, with examples that both promote and
reduce compatibility between these diverged strains. In addition, specific genetic interactions
were found which depended on environmental conditions.
Li Hsu*, Jacqueline R. Starr,
Yingye Zheng, Stephen M.
Schwartz
Fred Hutchinson Cancer
Research Center
Combining data collected from
different sources is a cost-effective and time-efficient approach for enhancing
the power to detect weak-to-modest genetic effects or gene-gene or
gene-environment interactions. However, combining data across studies becomes
complicated when data are collected under different study designs, such as
family-based and unrelated individual-based (e.g., population-based
case-control design). In this paper, we describe a general method that permits
the joint estimation of effects on disease risk of genes, environmental
factors, and gene-environment interactions under a hybrid design that includes
cases, parents of cases, and unrelated individuals. We provide both asymptotic
theory and statistical inference. Extensive simulation experiments demonstrate
that the proposed estimation and inferential methods perform well in realistic
settings. We illustrate the method by an application to a study of testicular
cancer.
Jaroslaw Harezlak*,
Xihong Lin, Shan Jiang, Mike Wu, Mike Wang &
David Christiani
Harvard School of Public
Health
Cancer biomarkers play an
important role in disease diagnosis. The emerging field of proteomics
presents a great opportunity for candidate biomarker identification, because we
can simultaneously measure and analyze a large number of proteins. In this
poster, we provide a unified statistical approach to modeling, feature
extraction and risk prediction from the high-dimensional mass spectrometry data.
The task is challenging due to several factors, including non-stationarity and correlation of the error terms,
high-dimensional data and variable selection. We propose to use a local
polynomial regression kernel smoothing with adaptive bandwidth selection for MS
modeling, significant zero-downcrossing idea for
feature selection, warping algorithm for feature alignment and penalized
logistic regression for disease risk estimation. We also describe a
cross-validation procedure taking into account the uncertainty in the
processing steps. We apply our method to the lung cancer SELDI-TOF MS data set,
and find a predictive model with good specificity and sensitivity.
Asad Moten*,
Shefa Moten
Massachusetts Institute of
Technology, NASA, Harvard University
Cells are inherently
sensitive to local mesoscale and microscale
patterns of chemistry and topography. Recent research has investigated how
surface mechanics might dictate cell behavior, affecting both cell function and
differentiation. This study proposes that alterations of the underlying matrix
can dictate cell behavior and function. Through micro-orienting the substrate
and micro-patterning protein, we were able to engineer a structural and
biological backbone to regulate cellular behavior. The techniques involved in
this study to control cell behavior include the micro-orienting of the ECM, electrospinning a 3D fibrous PMMA scaffold with the optimum
diameter and orientation, micro-patterning of fibronectin (Fn) for cellular
attachment, applying magnetic/electric fields to fibronectin,
creating a novel suicide circuit for tumor targeting bacteria, and altering the
stiffness of the surface to initiate cellular differentiation. Furthermore,
cancer was studied and regulated at the single cell level through the
alteration of the ECM mechanics. In addition, we determined that by
manipulating the numerous mechanical forces acting upon the cells, the genetic
and morphological changes can lead to the activation of the necessary genes to
synthesize the ECM components for differentiation. The use of microscale structuring to restore tissue architecture and
dictate cell behavior has several important implications for tissue
engineering, cancer treatment, and stem cell differentiation.
Fang Yu*, Ming-Hui Chen, Lynn Kuo, Wanling Yang
University of Nebraska
ESTs (Expressed Sequence Tags)
are usually one-pass sequencing reads of cloned cDNAs
derived from a certain tissue. A characteristic of EST data is that the
total number of observed tags is relatively small compared to the number of
unique tags: many unique tags have a zero count (under-representation)
while a few unique tags have huge counts (over-representation). To fit such
data, we propose a new multilevel nonlinear mixture Dirichlet
model for the expression levels of each gene from multiple tissues of multiple
tissue types. A Bayesian criterion-based measure is developed for comparing
the proposed model to other existing models. The advantages of our model are that
(1) it resolves the issues caused by over-representation and under-representation,
(2) it allows information to be borrowed from different
libraries of the same tissue type, and (3) it provides direct measures of
gene-level expression. We also develop a novel computational algorithm and a gene
selection criterion for detecting genes with different expression across
different tissue types. A real EST dataset is analyzed to illustrate the
proposed methodology.
Youfang Liu*, Mike Weale
Bioinformatics Research
Center, North Carolina State University
Genome-wide association
testing has become a powerful and important tool in the study of complex
genetic disease. Many novel methods for testing association have been
developed. One key issue is how to evaluate the power of each method under
realistic settings. Simulation is an efficient way to evaluate the ability of
novel methods to detect the disease markers, but needs to be tied to the
linkage disequilibrium (LD) properties of real datasets. SimuGeno
is a package to simulate large scale genomic data for case-control association
studies, using real data as a starting point. It can take HapMap
data or any other real data sets as input.
Causal loci can be either random or user-defined. As an intermediate step, SimuGeno
relies on phased haplotype data. Two
options we provide for the simulation of case and control datasets are: (1)
bootstrapping of haplotypes; and (2) a method proposed by Dudbridge
(2006), which generates new haplotypes based on the recombination rate and
random mating assumption. In this second method, two chromosomes are randomly
selected, grouped in pairs, and gametes are constructed using HapMap recombination maps. New genotypes are constructed by
random union of gametes. When we simulate whole genome data
containing about 500k SNPs (the current popular size
for genome-wide association studies), computational time will be a problem.
To solve this problem, SimuGeno undertakes what we
call causal region simulation. The rationale behind this causal region
simulation is that only the SNPs close to causal loci
will show differences between cases and controls. In causal region simulation, SimuGeno selects, for each causal SNP, a causal region with
that causal SNP located at its center. The edges of each region are determined
either by physical distance or by the location of recombination hotspots.
Simulation takes place within these regions only. Finally, the newly
constructed causal regions are plugged back into the original chromosomal data
to create new chromosomes. SimuGeno uses a logistic
regression model to determine disease status: Logit[Pr(D|genotype)]
= β0 + β1*G1 + β2*G2 + β3*G3 + ... + βi*Gi, where G
stands for the genotype, coded as 0, 1, or 2, and i stands
for the number of disease loci. The logistic model allows a single causal locus,
multiple causal loci, and gene-gene interactions among them, and thus allows for
complex disease simulation. SimuGeno simulated data
maintain allele frequencies and LD structure that are similar to the original
data. As an example, we applied SimuGeno to HapMap CEU chromosome 21 and 22 data and found that the
simulated data do indeed have a very similar allele frequency pattern and LD
structure compared to the original HapMap data.
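The logistic disease model described above can be sketched in a few lines. This is a minimal illustration, not SimuGeno's code: the function names `disease_probability` and `simulate_status`, and the coefficient values in the usage note, are hypothetical.

```python
import math
import random


def disease_probability(genotypes, beta0, betas):
    """Pr(D | genotype) under logit(p) = beta0 + sum_k beta_k * G_k.

    genotypes: genotype codes (0, 1, or 2) at the causal loci.
    beta0, betas: intercept and per-locus log-odds effects (illustrative).
    """
    eta = beta0 + sum(b * g for b, g in zip(betas, genotypes))
    # Numerically stable inverse logit for large positive or negative eta.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def simulate_status(genotypes, beta0, betas, rng):
    """Draw case (1) or control (0) status for one simulated individual."""
    p = disease_probability(genotypes, beta0, betas)
    return 1 if rng.random() < p else 0
```

For example, `simulate_status([2, 1], -3.0, [0.5, 0.7], random.Random(42))` draws a status for an individual carrying two risk alleles at the first locus and one at the second. A gene-gene interaction could be represented by appending a product term such as G1*G2 to `genotypes` with its own coefficient, which is one common way to encode interactions and is assumed here rather than taken from the abstract.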
Benjamin Haibe-Kains*,
Christos Sotiriou and Gianluca Bontempi
Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
Introduction: The use of dimension reduction methods in microarray analysis is
justified by the characteristics of the data such as the high feature-to-sample
ratio, the high correlation of coexpressed genes and
the high level of noise due to complex technology. Clustering analysis is
widely used to perform dimension reduction, keeping the new features
interpretable. This method consists of replacing a cluster of correlated genes
by a cluster centroid (called a feature), and can be
used in biologically driven microarray analysis. We aimed at efficiently using
a priori biological knowledge to improve clustering methodology for dimension
reduction. Methods: We introduced a
new method called prototype-based clustering to identify genes that are coexpressed with one prototype, i.e.,
one gene representative of a biological process of interest. For each gene to
cluster, we fitted univariate and multivariate linear models with the
prototypes which play the role of explanatory variables. We compared these
models based on their leave-one-out cross-validation error computed by the
PRESS statistic. Using Friedman's test, the models exhibiting the lowest errors
were selected to identify the cluster of the gene. This method was used in a
meta-analytical framework in order to combine model
selection from different datasets. Results:
We applied our method to two public microarray datasets of breast cancer (BC)
untreated patients. We used hallmarks of BC involving various biological
processes such as estrogen receptor, her2/neu signaling, proliferation, tumor
invasion, immune response, angiogenesis, and apoptosis as prototype genes. We
reduced the number of variables from ~20,000 to seven while
keeping valuable information (1) to robustly define BC molecular subtypes based
on estrogen receptor and her2/neu features and (2) to investigate the impact of
the seven features on clinical outcome. These two questions were
addressed using fifteen public microarray datasets (~2100 patients). Conclusions: The use of prototype-based
clustering allowed for efficient reduction of the dimensionality of microarray
data by focusing on targeted biological processes. We successfully applied this method to BC samples in
order to gain new insights into BC biology.
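The model-comparison step in the Methods above rests on the PRESS statistic, which gives the leave-one-out cross-validation error of a linear model without refitting it. The sketch below covers only the univariate case (one prototype as the explanatory variable). The function names are hypothetical, and picking the lowest PRESS directly is a simplification: the abstract selects models with Friedman's test, which is not reproduced here.

```python
def press_univariate(x, y):
    """PRESS (leave-one-out squared prediction error) for y ~ a + b*x.

    Uses the closed form e_i / (1 - h_ii), so no model is refit.
    Requires a non-constant x and no single point with leverage 1.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    press = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (intercept + slope * xi)
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        press += (residual / (1.0 - leverage)) ** 2
    return press


def best_prototype(gene_expr, prototypes):
    """Name of the prototype whose univariate model has the lowest PRESS.

    prototypes: dict mapping a prototype name to its expression values.
    """
    return min(
        prototypes,
        key=lambda name: press_univariate(prototypes[name], gene_expr),
    )
```

A gene would then be assigned to the cluster of the prototype returned by `best_prototype`, given expression values measured on the same samples.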
Wen-Harn Pan*, Chao-Chi
Liang, Shing-Hong Chen, Shao-Yuan Chuang
Institute of Biomedical
Sciences, Academia Sinica
Background: Hundreds of potential obesity genes have been
documented. However, only a few dozen have been replicated more than three
times in human association studies. The aim of this study was to find
influential obesity candidate genes, and the major ones interacting with
dietary factors, in the Taiwanese population. Materials
and methods: This study was conducted within the Cardiovascular Disease Risk Factor
Two-township Study (CVDFACTS), using a nested case-control design. All 285
obese subjects (BMI: body mass index, 27kg/m2 persistently) of the cohort were
included (100%) and 285 overweight subjects (24kg/m2<BMI 27kg/m2
persistently) were randomly selected (53%). We obtained 554 normal-BMI controls
matched by age-sex group (34%) and chose 15 SNPs in
12 genes: ADRB2 (Arg16Gly, Gln27Glu), ESR1 A+51193T, FABP2 Ala54Thr, LEP
A-2548G, LEPR Gln223Arg, PLIN G+11842A, PPARD T-87C, PPARG (G-82362A, Pro12Ala,
G+28752A), TNFA G-308A, TNFB G+252A, UCP2 Ala55Val, and UCP3 C-2078T. They
conformed to at least one of the following criteria: (1) the variant was reported to
be directly associated with obesity in at least three studies and was previously
found to relate to morbid obesity in our laboratory, or (2) it
interacted with environmental factors in its association with obesity. Dietary
information was assessed by a validated food frequency questionnaire.
Associations with genetic variants, nutrient parameters, or gene-nutrient
interactions were assessed by linear regression models with BMI as the
dependent variable, adjusting for potential confounders. Results: ADRB2 Arg16Gly, PPARG G-82362A and FABP2 Ala54Thr were
gene variants strongly associated with BMI variation, the latter two being
significant only in men (pADRB2 Arg16Gly=0.0319, pPPARG
G-82362A=0.0105 and pFABP2 Ala54Thr =0.0058). Total energy intake and fat
intake (% of energy) were two dietary factors associated with elevated BMI (p=0.0187
and p=0.0011, respectively). With regard to gene-diet interactions, we found
that total energy intake was associated with BMI for AA homozygote in LEP -2548
locus but not for its counterparts (p for interaction=0.0464). Furthermore, BMI
was associated with dietary % fat intake for UCP2 Val55, or UCP3 T-55 variant
carriers, but not for their counterparts (p for interaction=0.0004 and 0.0037,
respectively). When all the aforementioned significant correlates were entered into one
multivariate regression model, the model explained 6% of the variation in BMI in our
population. Conclusions: We have
constructed a statistical model for predicting BMI, combining the genetic and
environmental effects. With this approach, we may be able to substantially
increase the predictivity of BMI or obesity, when
more candidate variants are considered.
Darlene R. Goldstein*, Kunlin Zhang, Anthony C. Davison, Olivier Martin, Brian J.
Stephenson, C. Victor Jongeneel and Amalio Telenti
École Polytechnique Fédérale de Lausanne (EPFL)
Whole genome association
studies represent a promising approach to identifying disease genes, but
successful genome-wide association studies depend on a very dense map of
markers. The advent of
very high throughput genotyping technologies has made such studies
feasible. However, difficulties in the
analysis of large scale genotype data persist.
These include very high test multiplicity and problems related to small
effects of contributing loci, leading to potentially high false positive rates
where noise drowns out the signal. We are exploring approaches to increase the
signal by incorporating biological information into the analysis. Our strategies include use of Gene Ontology
information and functional annotation of markers included in the study. Each strategy results in a marker list along
with corresponding significance levels.
These approaches are described and their performance assessed based on
reproducibility of results in an independent cohort.
Aditi Hazra*,
Peter Kraft, Edward L. Giovannucci, David J. Hunter
Harvard School of Public
Health
Background: Suboptimal nutrition, including low folate
consumption, contributes to 30%-35% of cancer occurrence worldwide, including
colorectal cancer (CRC) and breast cancer particularly among those who consume
moderate to high levels of alcohol, which interferes with one-carbon
metabolism. Although the metabolite levels are heritable, we can explain
<10% of the genetic variance in plasma folate and
plasma homocysteine levels. Methods: We propose to conduct a cost-effective multistage analysis
of genome-wide data on plasma folate and homocysteine levels among 1,699 women in the National
Cancer Institute sponsored Cancer Genetic Markers of Susceptibility (CGEMS)
project, perform 2.25 million SNP imputations based on haplotype structure and
combine these data with participants in the National Heart, Lung, and Blood
Institute's Framingham SNP-Health Association Resource (SHARe)
study. We will replicate a subset of the highest-ranking SNPs
from the hypotheses generated in the initial analysis in 960 individuals and
explore associations of the most promising SNPs with
risk of CRC and adenoma (CRA) in nested case-control studies in the NHS. Results: Preliminary analysis of the
combined CGEMS cases and controls suggests promising associations between plasma homocysteine and a variant on chromosome 2 (trend test
p-value 1.09 x 10^-6) and between plasma folate and a
variant on chromosome 3 (trend test p-value 1.34 x 10^-6). We observed the
expected associations between plasma homocysteine levels
and MTHFR Ala222Val (p = 5.36 x 10^-4) and BHMT Arg239Gln (p = 0.01), an nsSNP
previously associated with CRC. The fact that known gene-metabolite
associations were nominally significant but did not achieve genome-wide
significance highlights the importance of replicating the top associations seen
in CGEMS by pooled analysis with the SHARe scan and
replication.
Conclusion: This study
represents the first genome-wide association study of plasma folate and homocysteine. These
data will provide an integrated research paradigm that synthesizes genome-wide
analytic approaches and metabolic profiling to offer an informative platform
for understanding gene-nutrient relationships and predicting their association
with CRC and CRA risk. Our empirical characterization of genetic variation in
plasma levels of folate and homocysteine
may offer insight for future cancer prevention strategies and elucidate which
population subgroups may benefit most from additional folate
intake or supplementation.
Silviu-Alin Bacanu*,
Matthew R. Nelson, Li Li, Margaret G. Ehm
GlaxoSmithKline
Many adverse drug reactions (ADRs) are rare, and may only be observed in a small number
of cases. Consequently, the limited number of cases for pharmacogenetic
studies presents a challenge for analysis and inference using traditional
methods. Under certain circumstances, we may even wish to make inferences about
possible genetic causes within a single case. Even for larger case sample
sizes, if we suspect the ADRs are genetically
heterogeneous, it may be more appropriate to attempt case-specific genetic
inferences. In such instances, instead
of aggregating the information at each marker (usually single nucleotide
polymorphisms, or SNPs) among cases, we choose to
aggregate the genotypic information among different adjacent markers for each
case and obtain a statistic capturing the likelihood of those patterns against
a reference control set. The distribution of this statistic under the null
hypothesis is estimated by computing the same statistic for each control
relative to the pattern found in the remaining controls. An appropriate p-value
comparing the case statistics to the control statistics is obtained using
a newly developed method for bounding tail probabilities (p-values) for any
distribution. Using this method, we estimate the power to detect genetic
aberrations such as deletions and loss of heterozygosity (LOH), among others. We
apply this method to investigate whether any such unusual genotypic patterns occur
in a study involving one ADR case suspected of having LOH in a drug-metabolizing
gene.
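The leave-one-out construction of the null distribution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: genotypes are coded 0, 1, 2, the window statistic is a smoothed negative log-likelihood that treats adjacent markers as independent, and a plain empirical p-value stands in for the tail-probability bound mentioned in the abstract. All function names are hypothetical.

```python
import math


def genotype_freqs(reference, n_markers, pseudocount=0.5):
    """Smoothed per-marker genotype frequencies (genotypes coded 0, 1, 2)."""
    freqs = []
    for m in range(n_markers):
        counts = [pseudocount] * 3
        for subject in reference:
            counts[subject[m]] += 1
        total = sum(counts)
        freqs.append([c / total for c in counts])
    return freqs


def window_statistic(subject, reference, start, width):
    """Negative log-likelihood of a subject's genotypes in a marker window,
    scored against a reference set (assumption: markers are independent)."""
    freqs = genotype_freqs(reference, len(subject))
    return -sum(math.log(freqs[m][subject[m]])
                for m in range(start, start + width))


def empirical_p_value(case, controls, start, width):
    """Compare the case statistic with leave-one-out control statistics."""
    case_stat = window_statistic(case, controls, start, width)
    # Null distribution: each control is scored against the remaining controls.
    null_stats = [
        window_statistic(ctrl, controls[:i] + controls[i + 1:], start, width)
        for i, ctrl in enumerate(controls)
    ]
    exceed = sum(s >= case_stat for s in null_stats)
    # Add-one correction keeps the estimate away from zero (a stand-in for
    # the bounding method in the abstract, which is not reproduced here).
    return (exceed + 1) / (len(null_stats) + 1)
```

With n controls, the smallest p-value this empirical estimate can return is 1/(n+1), which is one reason a sharper tail bound is attractive when the control set is small.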
Zhaohui Qin*, Gordon Robertson, Misha Bilenky, Steven Jones
Department of Biostatistics,
University of Michigan
Recent databases and datasets
of computationally identified conserved non-coding elements create new
opportunities for studying relationships among functional and regulatory
genomic elements. For motifs identified by multi-species comparative genomics
approaches, clustering can be performed to identify groups of similar motifs.
Traditional distance-based approaches such as hierarchical clustering
scale poorly when clustering genome-wide sets of mammalian
conserved elements, because the distance matrices are large. Model-based clustering
is an alternative approach for such large sets of conserved elements. We
developed motifOrganizer, a model-based motif
clustering algorithm designed specifically to cluster
very large sets of conserved elements collected from genome-wide searches. This
new algorithm also allows motifs of different widths to be grouped in the same
cluster. A simulation study demonstrated that this algorithm maintains high
accuracy under different settings. We also tested real datasets from TRANSFAC,
JASPAR and cisRED and demonstrated that motifOrganizer can scale up to analyze motifs
collected from genome-wide searches on mammalian species.
B.L. Fridley*, D. Schaid, R. Weinshilboum, L. Wang
Mayo Clinic
Pharmacogenomic research has recently
incorporated cell-based model systems.
Hypotheses generated with the cell-based model system are then tested in
individuals treated with the agent (translational medicine). Currently, the genomic
relationship with drug cytotoxicity is investigated either by
analyzing cytotoxicity one drug concentration at a time, or by
fitting dose-response curves to the cytotoxicity
endpoints and using summary endpoints (e.g., GI50) as the phenotype
in the analysis. A more complete analysis of the impact of genetic variation on
the entire dose-response curve might better reflect the true relationship
between drug response and genetic variation and yield insight into the
pharmacogenomics of a particular
drug/agent. A large number of statistical methods have been developed to
evaluate non-linear dose-response curves in order to model how response curve
parameters are influenced by subject-specific characteristics, but few efforts
have been made to tie these types of models to genomic studies. We will
illustrate the use of Bayesian nonlinear hierarchical models for analysis of pharmacogenomic data with concentration-effect endpoints.
Concentration-effect endpoints are any measurable cellular phenotype that is
related to drug concentration, one example being cytotoxicity.
The model will be illustrated with simulated data and with data from a study of the
pharmacogenomics of gemcitabine, collected by the Mayo-NIH PGRN on the Coriell Human
Variation Panel of 203 cell lines.
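To make the idea of tying curve parameters to genotype concrete, here is a minimal sketch. The abstract does not specify the curve or the hierarchy, so everything here is an assumption: a four-parameter logistic curve, a midpoint concentration that plays the role of a GI50-like summary, and an additive SNP effect on the log midpoint plus a cell-line deviation. No fitting or prior specification is shown, and the function names are hypothetical.

```python
import math


def four_param_logistic(conc, bottom, top, log_gi50, slope):
    """Response at a drug concentration under a four-parameter logistic curve.
    The response is halfway between top and bottom when conc == exp(log_gi50)."""
    return bottom + (top - bottom) / (1.0 + (conc / math.exp(log_gi50)) ** slope)


def subject_log_gi50(base_log_gi50, genotype_effect, genotype, cell_line_effect=0.0):
    """Assumed hierarchical mean structure for one cell line:
    population value + additive SNP effect (genotype coded 0, 1, 2)
    + a cell-line-specific deviation (a random effect in a full model)."""
    return base_log_gi50 + genotype_effect * genotype + cell_line_effect
```

Under this sketch, a negative `genotype_effect` lowers the midpoint concentration for carriers, so the whole curve shifts toward greater sensitivity instead of changing the response at a single concentration only.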
Purvis LBC*, Culhane AC, Rubio R, Yeatman TJ,
Quackenbush J
Harvard School of Public
Health, Dana-Farber Cancer Institute, University of Vermont
While much focus has been
placed on methods for feature selection or supervised analysis of "omics" data, relatively little attention has been given
to tools for exploratory data analysis. Studies that use a high-throughput omics approach are frequently hypothesis-seeking,
data-driven studies, and thus tools for visualization and exploratory analysis
of these complex, high-dimensional data are essential. Based on a covariance
optimization criterion, multiple co-inertia analysis (MCOA) enables the
simultaneous ordination of multiple datasets. MCOA is a symmetric coupling technique
that concurrently optimizes global variability (inertia) within each dataset
and the squared correlation between the principal structures of each table. As
a result, MCOA maximizes co-structure through an optimized composite function
of correlation between table ordinates and within table covariance. MCOA axes
will thus be highly correlated and explain much of the variance within each
dataset. Breast cancer is a significant cause of mortality and morbidity,
accounting for nearly 1 in 3 cancers diagnosed in US women. Breast cancer is a
complex disease requiring multiple genetic alterations, and research indicates
that genetic mutations occur in the histologically
normal breast tissue adjacent to the tumor in nearly 60% of patients,
suggesting that such pre-cancerous changes might be diagnostically useful. Here
we present an analysis of gene expression profiles designed to search for
evidence of such alterations. Five tissue samples were collected from breast
cancer patients: one from the primary tumor and four from adjacent tissue collected at 1,
2, 3, and 4 cm from the tumor. Microarray expression profiles were generated
for each sample. We describe the application of MCOA to this dataset and
demonstrate its ability to effectively visualize intrinsic correlated
structure, providing a useful starting point for subsequent supervised
analyses.
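The covariance criterion behind co-inertia methods can be illustrated with the simpler two-table case. The sketch below is not MCOA itself, which handles several tables at once. It assumes two tables with the same samples in rows and features in columns, and finds one unit-length axis per table so that the covariance between the two sets of sample scores is maximal. It uses plain power iteration on the cross-covariance matrix. The function names are hypothetical.

```python
import math


def center_columns(table):
    """Subtract each column mean (rows are samples, columns are features)."""
    n = len(table)
    means = [sum(col) / n for col in zip(*table)]
    return [[value - m for value, m in zip(row, means)] for row in table]


def cross_covariance(x, y):
    """Feature-by-feature covariance between two centered tables."""
    n = len(x)
    return [[sum(x[i][j] * y[i][k] for i in range(n)) / n
             for k in range(len(y[0]))]
            for j in range(len(x[0]))]


def normalize(vector):
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def first_coinertia_axes(x, y, iterations=200):
    """Leading pair of axes (u, v) maximizing cov(Xu, Yv) with |u| = |v| = 1.
    Assumes the all-ones start vector is not orthogonal to the leading axis."""
    c = cross_covariance(center_columns(x), center_columns(y))
    p, q = len(c), len(c[0])
    u = normalize([1.0] * p)
    for _ in range(iterations):
        v = normalize([sum(c[j][k] * u[j] for j in range(p)) for k in range(q)])
        u = normalize([sum(c[j][k] * v[k] for k in range(q)) for j in range(p)])
    covariance = sum(u[j] * c[j][k] * v[k] for j in range(p) for k in range(q))
    return u, v, covariance
```

The returned covariance is the largest singular value of the cross-covariance matrix, so no other pair of unit-length axes gives the two tables more shared structure along a single dimension.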
Paola Sebastiani
Department of Biostatistics, Boston University
Discovering the genetic basis of common diseases is one of the major challenges of
biomedical science, and one that has been limited in the past
to candidate gene studies. Advances in high-throughput technology can now
provide genetic information on almost the entire genome and open the
opportunity to discover the genetic basis of complex disease. However, a
challenge of this discovery process is building complex models with many
variables from massive amounts of data. This talk will describe some of the
issues related to modeling complex traits, the new challenges posed by the
analysis of genome-wide data, and the feasibility of network modeling. In
particular, we will present a hierarchical and modular approach to screen
genome-wide genotype data that incorporates quality control filters, linkage
disequilibrium, physical distance and gene ontology to build prognostic models
of complex traits. We will illustrate the approach with examples of genetic dissection of complex traits
using data from a cohort study of the complications of sickle cell anemia, and
data from a cohort study of exceptional longevity.
Indika Rajapakse*
and Lue Ping Zhao
Fred Hutchinson Cancer
Research Center
Understanding gene-gene
interactions is essential for characterizing the genetic basis of most complex
disease phenotypes, from cancer to coronary heart disease to diabetes. While many ongoing genome-wide studies focus
on disease association with single genes (a single SNP or a haplotype of multiple
SNPs from the same gene/region), the next wave of
analytic effort utilizing data sets from these genome-wide studies will target
gene-gene interactions and their association with complex diseases. In anticipation of this new wave, several
research groups have been actively developing methods for characterization of
all possible gene-gene interactions.
Building upon existing work, we have identified a general quantitative
framework that encompasses many existing models used to characterize gene-gene
interactions. Moreover, the simplicity
and generality of this framework allow us to appreciate quantitative
penetrance of gene-gene interactions in complex diseases, as well as to
generalize methods of characterizing two-gene interactions to three-way or
higher order interactions. In addition,
from the saturated version of this framework, we have derived explicit
calculations for evaluating two- or three-way interactions. The proposed framework is illustrated using
data from a myeloid leukemia study on regions of chromosome 6.
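As an illustration of what an explicit calculation from a saturated model can look like, the sketch below uses a standard saturated logistic model for two binary loci in a case-control table. It is not necessarily the authors' framework. The assumptions are: each locus is reduced to a carrier indicator (0 or 1), every cell has nonzero case and control counts, and the interaction is measured as a departure from multiplicative odds. In a saturated model the fitted log odds equal the observed ones, so the coefficients follow directly from cell counts. The function names are hypothetical.

```python
import math


def log_odds(cases, controls):
    """Observed log odds of disease in one genotype cell (counts must be > 0)."""
    return math.log(cases / controls)


def saturated_two_locus(cells):
    """cells[(g1, g2)] = (cases, controls) for carrier indicators g1, g2 in {0, 1}.
    Returns the coefficients of logit P = b0 + b1*g1 + b2*g2 + b12*g1*g2."""
    l = {genotype: log_odds(*counts) for genotype, counts in cells.items()}
    baseline = l[(0, 0)]
    return {
        "baseline": baseline,
        "locus1": l[(1, 0)] - baseline,
        "locus2": l[(0, 1)] - baseline,
        # Interaction contrast: zero when the two odds ratios multiply.
        "interaction": l[(1, 1)] - l[(1, 0)] - l[(0, 1)] + l[(0, 0)],
    }
```

The same contrast pattern extends to three loci by taking the difference of two-locus interaction contrasts across the levels of the third locus.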
Xiaozhong Yu, William C. Griffith,
Eric M. Vigoren*, Kristina Hanspers,
James F. Dillman III, Hansel Ong,
Melinda A. Vredevoogd, and Elaine M. Faustman
Institute for Risk Analysis
and Risk Communication, Dept. of Environmental and Occupational Health
Sciences, University of Washington; GenMAPP
Development Team, Bioinformatics Research Associate/Conklin Lab Gladstone Institute
of Cardiovascular Disease/UCSF; Cell and Molecular Biology Branch, U.S. Army
Medical Research Institute of Chemical Defense
A major challenge in the
interpretation of microarray results is the laborious task of functional interpretation,
linking potentially interrelated alterations in gene expression to conventional
toxicological endpoints. Researchers have gathered annotations from databases
and have identified characteristics of extracted sets of genes. The Gene
Ontology (GO) consortium initiated the standardization of annotation terms,
making them applicable for different organisms. Gene ontology and pathway
mapping are both powerful approaches used to generate a global view of
biological processes and cellular components affected by toxicants. However,
this one-dimensional analysis does not allow for quantitative or qualitative
evaluation of possible dose- or time-dependent genomic relationships. To make
the best use of toxicogenomic data for risk
assessment, we propose that comparisons across toxicant doses and time evaluate
how gene pathways, rather than single genes, are altered. We propose a
systems-based approach to integrate the raw gene expression data of genes
associated with specific functional categories derived from the GO database. We
developed a program (GO-Quant) to estimate the average expression values
(intensity or ratio) for significantly altered genes, by functional gene
category, based on MAPPFinder results. To demonstrate
its application, we applied this approach to a previously published dose- and
time-dependent toxicogenomic dataset (J.F. Dillman et al., 2005, Chem. Res. Toxicol.
18:28-34). The results of our analysis indicate that the systems approach can
describe quantitatively the degree to which functional gene systems change
across dose or time, providing quantitative estimates for each specific
functional GO term. Supported by NIEHS U10 ES 11387, EPA R826886, NIEHS
1PO1ES09601, R01-ES10613, NIEHS P50 ES012762, NSF OCE-0434087, NIEHS 5 P30
ES07033, USDOE DE-FG02-03ER63674.
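The averaging step described for GO-Quant can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the GO-Quant program: the gene-to-term mapping and the list of significantly altered genes are taken as given inputs (in the abstract they come from MAPPFinder results), expression values are stored per gene and per dose or time condition, and the function name, gene names, and term labels are hypothetical.

```python
from collections import defaultdict


def average_by_category(expression, gene_to_terms, significant_genes):
    """Mean expression value per functional category and condition.

    expression:        {gene: {condition: value}} (intensity or ratio)
    gene_to_terms:     {gene: [category, ...]} (e.g. GO terms)
    significant_genes: genes flagged as significantly altered
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for gene in significant_genes:
        for term in gene_to_terms.get(gene, []):
            for condition, value in expression[gene].items():
                sums[term][condition] += value
                counts[term][condition] += 1
    # Only genes that are both significant and annotated contribute.
    return {
        term: {cond: sums[term][cond] / counts[term][cond] for cond in sums[term]}
        for term in sums
    }
```

Because the result is indexed by category and then by condition, the values for one category can be read off in dose or time order to see how that functional group changes as a whole.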
Oliver Hofmann, Chris Maher,
Adele Kruger, Vlad Bajic
and Winston Hide*
South African National
Bioinformatics Institute, Private Bag x17 University of the Western Cape,
Bellville, 7925, South Africa.
Biostatistics department,
Harvard School of Public Health, 655 Huntington Avenue Boston, MA 02115
Several exciting new studies
have been published as a result of genome-wide interrogation of tens of
thousands of unrelated individuals. The scale of these studies is dictated
by the statistical power required to derive meaningful
results. Among the genomic loci of interest for disease association,
some may be associated with coding regions of genes, others
lie in non-coding regions, and others in gene deserts,
which contain no known functional gene. Clearly, our understanding
of biological function can contribute a great deal to a reduction in the search
space. To assess the value of SNPs or the
candidate regions they are associated with, it is useful to be able to characterise any transcript expression at or near
that position. The value of a candidate can be assessed in the context of
known expression in terms of anatomy, time, species, cell type, or pathology (eVOC). We have been
assessing loci by interrogating the genome for consistently
described gene expression information using the eVOC
gene expression vocabulary and BioMart (www.evocontology.org, www.biomart.org). Given the degree to
which this human gene expression data is available, a major contribution to its
value is a better understanding of its regulation. We have embarked upon a
process of mapping regulation potential to gene expression states and
disease. We are developing an index of gene regulation, linked with eVOC, with the goal of grouping promoter
elements by their common expression potential. In this manner, unknown regions
of the genome can be compared against the index for similarities in potential
expression. We will explore an encoding approach for regulation potential
relevant to disease loci.