THE FBAT (Family Based Association Test) PROGRAM:
Testing for Linkage and Association with Family Data
Xin Xu, Steve Horvath and Nan Laird
horvath@mednet.ucla.edu, laird@hsph.harvard.edu, xin_xu@harvard.edu
3/8/02
Web address: www.biostat.harvard.edu/~fbat/default.html
TABLE OF CONTENTS
4.5 genotype pedigree-id marker1 [marker2 ...]
4.13 setafftrait aff_t unaff_t unknown_t
4.15 viewmarker marker [pedigree_id]
4.16 viewstat [-s] [-e,-o] marker
5.1 Loading the Pedigree File and Running FBAT
5.3 Commands Useful for Debugging
FBAT is a software package
for computing Family Based Tests of Association. There are two steps involved in constructing a test of
association: 1) defining a test statistic that reflects association between a
phenotype or trait (T) and a marker value (X) and 2) defining the distribution
of the test statistic under the null hypothesis. FBAT uses the generic form
$$S = \sum T \cdot X$$
as a test statistic, where summation is over all offspring in all families in
the data set. The actual test
results will differ depending upon how the user specifies T and X, and how the
distribution of S is determined. Here
we present a brief explanation of the ideas underlying the choice of T, X and
the distribution of S; for more details, see Laird et al. (2000), Rabinowitz and
Laird (2000) and Horvath et al.
(2001b).
The data set used by FBAT can consist of pedigrees, nuclear families or a combination of the two. If pedigrees are present in the data set, each one is decomposed into individual nuclear families that are treated as distinct in the calculations, except in the calculation of the empirical variance (see below).
The default null hypothesis tested by FBAT is H0: no linkage and no association between the marker and any gene influencing the trait. To avoid biases due to population stratification, mis-specification of the trait distribution, and/or selection based on the phenotype, the distribution of S under H0 is calculated using the distribution of offspring genotype, conditional on the trait T, and on the parental genotype. If either or both parental genotypes are unknown, FBAT uses the distribution of offspring genotype conditional on T and on the sufficient statistics for the unobserved parental genotype. For more details, see Rabinowitz and Laird (2000).
This conditional distribution under the null hypothesis is used to compute E(S) and Var(S); S is then normalized in the usual way to obtain large sample test statistics. Specifically, if X is a scalar summary of an individual's genotype, then
$$Z = \frac{S - E(S)}{\sqrt{\mathrm{Var}(S)}}$$
is approximately N(0,1). FBAT gives the value of Z and a 2-sided p-value based on this normal approximation. When X is a vector, the statistic
$$\chi^2 = (S - E(S))' \, \mathrm{Var}(S)^{-1} \, (S - E(S))$$
has an approximate $\chi^2$ distribution with degrees of freedom equal to the rank of Var(S). FBAT gives the value of $\chi^2$ and its one-sided p-value based on the asymptotic $\chi^2$ distribution. The calculation of the moments of S is described in the technical report "Inside FBAT", available from the FBAT web page.
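As a worked illustration of the scalar normalization above, the following Python sketch computes Z and the 2-sided normal p-value from S, E(S) and Var(S). The function name `z_test` is hypothetical and not part of FBAT. The input values are taken from the a2m allele 1 line of the sample fbat output in the examples section of this manual, so the result can be compared with the Z and P columns printed there.

```python
import math

def z_test(s, e_s, var_s):
    """Return (Z, two-sided p-value) under the N(0,1) approximation."""
    z = (s - e_s) / math.sqrt(var_s)
    # P(|N(0,1)| > |z|), written with the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# S, E(S), Var(S) for a2m allele 1 in the manual's sample output
z, p = z_test(112.000, 126.133, 14.900)
print(f"Z = {z:.3f}, P = {p:.6f}")
```

Small differences in the last digit are possible because the inputs are the rounded values printed by fbat.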
With more than one sibling
in a family, the distribution of offspring genotypes depends upon whether or not
one assumes linkage is present between the marker and a gene influencing the
trait. In the case where linkage is
present and the null hypothesis is H0: linkage but no association, sibling
genotypes will be correlated, as will the genotypes of different nuclear
families derived from one pedigree. In
this case, the empirical variance option should be used.
With this option, Var(S) is computed empirically without making
assumptions about the recombination parameter, or degree of correlation between
multiple sibs in a family. If some of the nuclear families come from a common
pedigree, then FBAT sums S-E(S) over all nuclear families in the pedigree to
yield a single contribution for each pedigree.
This means that the empirical variance option gives the correct variance
asymptotically regardless of pedigree structure and null hypothesis.
See Lake et al. (2000) for details.
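The pooling step described above can be sketched in Python as follows. This is only an illustration under a stated assumption: the manual says that S-E(S) is summed over the nuclear families of each pedigree, but it does not spell out the variance estimator itself, so the sum of squared pedigree-level contributions used here is an assumption (see Lake et al. for the actual method). The function name and the input numbers are hypothetical.

```python
from collections import defaultdict

def empirical_variance(contributions):
    """contributions: iterable of (pedigree_id, S - E(S)) per nuclear family."""
    per_pedigree = defaultdict(float)
    for ped_id, u in contributions:
        per_pedigree[ped_id] += u  # one summed contribution per pedigree
    # Assumed estimator: sum of squared pedigree-level contributions
    return sum(u * u for u in per_pedigree.values())

# Hypothetical values: pedigree p1 contains two nuclear families
families = [("p1", 0.5), ("p1", -1.0), ("p2", 1.5), ("p3", -0.5)]
print(empirical_variance(families))
```

Because the two p1 families are pooled before squaring, any correlation between them is reflected in the estimate without modelling it explicitly.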
In some nuclear families,
the distribution of offspring genotypes may be degenerate, after conditioning on
parental genotype or on the sufficient statistic for parental genotypes.
In such cases, these families are not informative and will not contribute
to the test statistic. For example,
if both parents are homozygous, then the conditional genotype distribution of
any offspring is degenerate. If
both parents' genotypes are unknown, and all offspring have the same genotype,
then the distribution of any offspring's genotype, conditional on the
sufficient statistic for parental genotypes, is a point mass at the single
observed genotype. In some
cases, the conditional genotype distribution is not degenerate, but there can
still be no contribution to the test statistic.
For example, consider a family with two affected offspring with genotypes
AA and Aa and no parents. The genotype distribution is not degenerate; under the
null, each sibling retains their original genotype with probability ½ and
the genotypes are interchanged with probability ½ .
If X counts the number of A alleles among the affecteds, the contribution
of the two affected siblings to S is always three, its mean is always three and
its variance is zero. It is not
necessary to exclude such families in the data set because their contribution to
S-E(S) and Var(S) will be zero. The
count of informative families, which is given as part of the fbat output,
includes only families with a non-zero contribution to the test statistic.
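The two-sibling example above can be checked by direct enumeration. The Python sketch below lists the two equally likely outcomes (genotypes kept or interchanged), scores X as the number of A alleles, and uses T = 1 for both affected siblings. The helper name `count_a` is hypothetical.

```python
from statistics import mean, pvariance

def count_a(genotype):
    """Additive score X: number of A alleles in a genotype string such as 'Aa'."""
    return genotype.count("A")

# Two equally likely outcomes: genotypes kept as observed, or interchanged
outcomes = [("AA", "Aa"), ("Aa", "AA")]
traits = (1, 1)  # both siblings are affected, so T = 1 for each

s_values = [sum(t * count_a(g) for t, g in zip(traits, sibs)) for sibs in outcomes]
print(s_values, mean(s_values), pvariance(s_values))
```

Every outcome gives the same value of S, so S-E(S) and Var(S) are both zero and the family is not counted as informative.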
The main issues to be
addressed in using FBAT are how to specify X and how to specify T.
Consider X first; it is controlled by two FBAT options: model and mode.
The model command allows one to specify a recessive, dominant, additive
or genotype model for how the gene acts on the phenotype.
The default is the additive model; several studies have shown that
the additive model performs well even when the true genetic model is not
additive. See, for example, Knapp
(1999), Tu et al. (2000) and Horvath et al. (2001b). Note that this choice only
affects the power of the test. Choosing
the 'wrong' model does not invalidate the test under the null; rather, it may
reduce power under the alternative. The
genotype model treats each distinct genotype as a separate allele. The mode
option is relevant if the marker has more than two alleles or the genotype model
is used. FBAT allows two
strategies: each allele is tested separately, resulting in multiple, single
degree-of-freedom tests, or all alleles are compared simultaneously to their
null expectation in one test with multiple degrees of freedom.
In the latter case, X is a vector and the $\chi^2$
version of the test (given above) is used.
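To make the role of the model option concrete, here is a minimal Python sketch of how a genotype might be scored for one allele under the additive, dominant and recessive models. The codings are the conventional ones and are an assumption here; the manual refers to Schaid (1996) for the exact scoring schemes. The function name `score` is hypothetical, and the letters a, d and r mirror the options of the model command.

```python
def score(genotype, allele, model="a"):
    """Score X for one allele; genotype is a pair of allele codes, e.g. (3, 4)."""
    copies = sum(1 for a in genotype if a == allele)
    if model == "a":   # additive: number of copies of the allele
        return copies
    if model == "d":   # dominant: at least one copy
        return int(copies >= 1)
    if model == "r":   # recessive: two copies
        return int(copies == 2)
    raise ValueError(f"unknown model: {model}")

for g in [(4, 4), (3, 4), (3, 3)]:
    print(g, [score(g, 4, m) for m in ("a", "d", "r")])
```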
The second main issue is how
to specify the trait T. In general
T will be some function of the phenotype; dichotomous, measured and age-at-onset
data can be used. FBAT can handle
any kind of trait, but it is up to the user to specify T, and this can have a
substantial impact on power of these tests.
First note that if T=0 for a subject, then this subject contributes nothing to S, E(S), or Var(S), i.e., they do not contribute to the value of the test. Such individuals only help to determine the distribution of their siblings' genotypes in the case where parents are missing genotype data. When using the affection status variable in the pedigree file, subjects coded as 0 (missing phenotype) will have T=0 for the default trait. When using a phenotype file (see below), subjects with unknown phenotype should be coded '-'. This ensures that any T computed for this subject will be zero.
When the trait is
dichotomous, affected or not, the usual approach has been to look at allele
transmissions from parents to affected offspring only (Spielman et al., 1993).
This can be achieved by setting T=1 for affected subjects, and T=0 for
all others; this is the default trait coding for FBAT.
The theory of score tests (Laird et al., 2000; Whittaker and Lewis, 1998) suggests using T = Y - $\mu$, where Y is a dichotomous indicator of affection status and $\mu$ is the disease prevalence. With a rare disease, $\mu$ is nearly zero, and the result is close to the default coding. When $\mu$ is 0.5, an affected subject has T = 0.5 and an unaffected subject has T = -0.5, i.e., they receive equal weight but have different signs. One can use the setafftrait command in FBAT to specify values of T based on the affection status variable in the pedigree file. If disease prevalence is not known for the population, one can use the -o option in FBAT, which chooses $\mu$ to minimize Var(S). See the examples section of the FBAT documentation and also Lunetta et al. (2000). The use of -o does not always result in a more significant p-value, especially in the case where the test result is already significant using the default affection status. The -o option only minimizes the variance of S under the null hypothesis and should perform best for small departures from the null. In general, using the -o option will give a result similar to using the sample mean to estimate $\mu$.
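The offset coding for a dichotomous trait can be illustrated with a short Python sketch. It maps the affection status codes of the pedigree file (2 = affected, 1 = unaffected, 0 = unknown) to T = Y minus the prevalence. The function name `offset_trait` is hypothetical, and the prevalence values in the loop are arbitrary illustrations.

```python
def offset_trait(aff, mu):
    """Map pedigree-file affection status to T = Y - mu (0 for unknown)."""
    if aff == 2:      # affected: Y = 1
        return 1.0 - mu
    if aff == 1:      # unaffected: Y = 0
        return 0.0 - mu
    return 0.0        # unknown status contributes nothing to S

for mu in (0.0, 0.05, 0.5):
    print(mu, [offset_trait(a, mu) for a in (2, 1, 0)])
```

With a prevalence of 0 this reproduces the default coding (1 for affected, 0 otherwise). The same three values could be supplied to setafftrait as aff_t, unaff_t and unknown_t.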
FBAT can also handle
measured traits by specifying T in a phenotype file.
How T should be defined depends both on the sample
design and on objectives of the association test.
If the measured phenotype data are available for affected subjects only,
and the determination of affection status depends strongly upon the value of
measured phenotype, then there is little to be gained by using the measured
phenotype. Alternatively, if variation in the measured phenotype among
affecteds is of interest, then a test designed as a contrast between high and
low values of the measured phenotype is potentially powerful for detecting an
association between the marker and the phenotype. In this case, T should again be defined by centering the phenotype, T = Y - $\mu$, where $\mu$ is the mean phenotype in the population; depending upon the sample, $\mu$ may be estimated from the sample mean, or alternatively from external data. The -o option described above can also be used.
When there are multiple correlated traits available, it may be desirable to test them simultaneously in a single test using the multivariate test statistic described in Lange et al. (2002). This can be accomplished using the trait command.
There have been several
possibilities suggested for handling age-at-onset data.
Horvath et al. (2001b) describe two methods based on a score test using a
proportional hazards model with an exponential age-at-onset distribution.
Mokliatchouk et al. (2001) use a score test based on Cox regression.
The FBAT package also can be
used to implement the sibship disequilibrium test using the sdt command (Horvath
and Laird, 1998). The SDT is
designed to detect both linkage in the presence of association and association
in the presence of linkage (linkage disequilibrium) when dealing with a
dichotomous trait. The test does
not require parental data, but requires discordant sibships with at least one
affected and one unaffected sibling.
The FBAT package, consisting of the executable program and a user's manual, is currently available in compressed form for MacOS/PowerPC, Windows, and Solaris/Sparc platforms.
FBAT for MacOS requires PowerPC processors. The package is compressed by StuffIt and named "FBATxxx.sit.hqx", where xxx is the version number. To install, just expand it with StuffIt Expander™.
The package for Solaris/Sparc is named "FBATxxx.tar.Z", where xxx is the version number. To install, just type "zcat FBATxxx.tar.Z | tar xvf -".
FBAT for Windows requires Windows 3.1, Windows 95, Windows 98 or Windows NT. The package is zipped and named "FBATxxx.zip", where xxx is the version number. To install, just unzip it using any zip utility.
Two
types of files are used by FBAT. A pedigree file defines the pedigree structure,
affection status, and genotype information. A phenotype file defines any other
traits for each subject. Both files are in text format with one line for each
individual; variables are separated by a blank space or a tab. Be aware that
while a continuous run of blank spaces is regarded as a single separator, each
tab is treated as a separator. So there will be k+1 fields for an entry with k
tabs. Generally, blank spaces as field separators are recommended.
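The separator rule above can be pictured with a small Python sketch. It is an illustration only, not FBAT's actual parser: how FBAT treats lines that mix tabs and spaces is not stated in this manual, so splitting on tabs first and then on runs of spaces is an assumption. The function name `split_fields` is hypothetical.

```python
import re

def split_fields(line):
    """Split on tabs first (each tab separates), then on runs of spaces."""
    fields = []
    for chunk in line.rstrip("\n").split("\t"):
        parts = re.split(r" +", chunk.strip(" "))
        fields.extend(parts)  # an empty chunk yields one empty field
    return fields

print(split_fields("a  b   c"))   # a run of spaces acts as one separator
print(split_fields("a\t\tb"))     # two tabs give three fields, one of them empty
```

The empty field produced by consecutive tabs is the reason blank spaces are the safer choice of separator.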
First line: names of all markers, in the order of the genotype data
Remaining lines: pid id fid mid sex aff A11 A12 A21 A22 ...
pid: pedigree ID
id: individual ID
fid: father ID. Use 0 (zero) for founders or marry-ins (parents not specified) in a pedigree
mid: mother ID. Use 0 (zero) for founders or marry-ins (parents not specified) in a pedigree
sex: 1 = male, 2 = female
aff: affection status. 2 = affected, 1 = unaffected, 0 = unknown
Aij: allele j of marker i (j = 1, 2; i = 1, 2, ...). Alleles are represented by positive integers. Use 0 (zero) for missing alleles.
All IDs and marker names are strings of any characters other than blank space, tab, newline, and carriage return. The maximum lengths for IDs and marker names are 11 and 15 characters, respectively. A maximum of 39 alleles is allowed for each marker.
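The pedigree file layout described above can be read with a few lines of Python. This is a minimal sketch, not FBAT's own reader: it assumes space-separated fields, and the function name `parse_pedigree` and the record keys are hypothetical. The two sample rows are taken from the test.ped excerpt in the examples section of this manual.

```python
def parse_pedigree(text):
    """Parse pedigree-file text into (marker_names, list of person records)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    markers = lines[0].split()
    people = []
    for ln in lines[1:]:
        f = ln.split()
        pid, ind, fid, mid = f[:4]
        sex, aff = int(f[4]), int(f[5])
        alleles = [int(a) for a in f[6:]]
        if len(alleles) != 2 * len(markers):
            raise ValueError(f"{pid}/{ind}: expected {2 * len(markers)} alleles")
        # Pair up the alleles per marker; 0 marks a missing allele
        geno = {m: (alleles[2 * i], alleles[2 * i + 1]) for i, m in enumerate(markers)}
        people.append({"pid": pid, "id": ind, "fid": fid, "mid": mid,
                       "sex": sex, "aff": aff, "genotypes": geno})
    return markers, people

sample = """a2m apoe
501051 10070 90020 90019 1 2 1 1 3 4
501051 10137 90020 90019 1 1 1 1 3 4
"""
markers, people = parse_pedigree(sample)
print(markers, people[0]["genotypes"])
```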
First line: names of all traits in the phenotype file
Remaining lines: pid id trait_1 trait_2 ...
Use a single hyphen ("-") for missing traits. The order of the subject entries is not important. The set of individuals defined in the phenotype file need not be the same as that in the pedigree file (e.g., you may omit all parents in the phenotype file). However, for each individual appearing in both files, his or her IDs must be consistent. Data on any individuals in the phenotype file who do not appear in the pedigree file will be ignored.
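A phenotype file in the layout above can be read in the same spirit. The Python sketch below is an illustration only; the function name `parse_phenotypes` is hypothetical, and missing traits (a single hyphen) are stored as None. The first two data rows come from the adh.phe excerpt in the examples section; the third row, with missing values, is invented for illustration.

```python
def parse_phenotypes(text):
    """Return (trait_names, {(pid, id): {trait: float or None}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    traits = lines[0].split()
    table = {}
    for ln in lines[1:]:
        f = ln.split()
        # A single hyphen marks a missing trait value
        values = [None if v == "-" else float(v) for v in f[2:]]
        table[(f[0], f[1])] = dict(zip(traits, values))
    return traits, table

sample = """v1 v2 centerv1 centerv2
2000 200002 8 9 -4.8 -10.9
2001 200103 13 21 .2 1.1
2001 200199 - 9 - -10.9
"""
traits, table = parse_phenotypes(sample)
print(table[("2001", "200199")])
```

Keying the table on the (pid, id) pair makes it easy to ignore phenotype rows whose subject does not appear in the pedigree file.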
The general syntax for every command used in this program is
command [option1,option2,...]... arguments...,
where [option1,option2,...]... are optional arguments. In the descriptions given below, all acceptable options for an argument are listed within brackets and separated by commas.
Commands and options are case sensitive. A partial command name may be used to specify the command as long as it is unambiguous.
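The partial-name rule can be sketched in Python as a prefix lookup. This is not FBAT's implementation: the command list below contains only command names that appear in this manual, and the rule that an exact name wins over a longer command sharing the same prefix (for example mode versus model) is an assumption. The function name `resolve` is hypothetical.

```python
COMMANDS = ["displayp", "fbat", "genotype", "load", "minsize", "mode", "model",
            "sdt", "setafftrait", "trait", "viewmarker", "viewstat"]

def resolve(name, commands=COMMANDS):
    """Return the command matching a full or unambiguous partial name."""
    if name in commands:          # assumption: an exact name always wins
        return name
    # Case-sensitive prefix match, as commands are case sensitive
    matches = [c for c in commands if c.startswith(name)]
    if len(matches) == 1:
        return matches[0]
    raise ValueError(f"unknown or ambiguous command {name!r}: {matches}")

print(resolve("viewm"))
print(resolve("mode"))
```

A prefix such as "view" matches both viewmarker and viewstat and is therefore rejected in this sketch.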
Display
a one-line description for the specified command. If no command is specified, a
listing and descriptions of all available commands is given.
Output
a sample estimate of allele frequencies for the specified marker. If no marker
is specified, allele frequencies for all
the markers are produced. The allele frequencies are estimated using the founder
genotype data only. An EM algorithm is used to incorporate families with
incomplete founder genotype data. Note:
The estimated allele frequencies are not used by fbat.
Selectively
display the test results with p-value equal to or less than the specified
p_value. The default p_value is 1.0 (display all results). If no p_value
argument is specified, the current p_value is given.
Compute
the test statistic(s) and p-value(s) for the specified marker (or all markers if
no marker is specified) using the current trait, test mode (bi-allelic or
multi-allelic), and association model.
Option -e: compute the test statistic using the empirical variance, as described in Lake, Blacker and Laird (2000). This option should be used when testing for association in an area of known linkage and data from multiple sibs in a family or multiple families in a pedigree are used.
Option -o: use T = (current trait - $\mu$) in the test construction. The value of $\mu$ is chosen to minimize the variance of the test statistic. Subjects with unknown phenotype (coded '-' in the phenotype file) have T set to zero. This option works for both quantitative and qualitative traits. Note that when this option is used for qualitative traits, phenotype data for both affected and unaffected offspring are required. If the data set contains only affected offspring, or offspring with only a narrow range of values in the phenotype file, this option should not be used. The -o option can only be used with the bi-allelic mode.
Output of the fbat command: When using the bi-allelic mode (see 4.9) of testing, the results are displayed with one line for each marker allele using the format:
marker_name allele_name afreq fam# S E(S) Var(S) Z P
where afreq is the allele frequency; fam# is the number of nuclear families informative for that allele; S is the test statistic; E(S) and Var(S) are the expected value and variance of the test statistic under H0; Z and P are the Z statistic (S normalized using E(S) and Var(S)) and its corresponding 2-sided p-value (P(|X| > |Z|), where X is N(0,1)).
When using the multi-allelic mode of testing, the result is displayed with one line for each marker using the format:
marker_name number_of_alleles degrees_of_freedom chi-square p_value
When using -o, an additional column is given for the value of the optimizing nuisance parameter $\mu$.
Display
the raw genotype data for the specified pedigree and marker(s).
Output format:
Id father-id mother-id allele1_1 allele1_2 [allele2_1 allele2_2 ...]
If father-id and mother-id refer to founders or marry-ins, their IDs will be set to zero in the display.
Read
in data from a pedigree or a trait file. The file name can be either an absolute
path
name or a relative path name from your current directory.
The options [ped,phe]
are
not necessary when the specified file_name ends with a corresponding extension
(.ped
for pedigree file, .phe for trait file).
Start
logging all inputs and outputs into log_file or toggle the logging status. A
log_file must be specified before you can toggle the logging status. If no
option is specified, the current logging status is displayed. You can save only
particular parts of a session by toggling the logging status.
Specify the minimum number of informative families necessary to compute the test statistics. The default value is 10. If size_value is not specified, the current value is displayed.
Specify
bi-allelic (b) or multi-allelic (m) tests. The default is bi-allelic. If no
option is specified, the current mode is displayed.
Specify the association model to be additive (a), dominant (d), recessive (r) or genotype (g). The default is the additive model. If no option is specified, the current model is displayed. For a description of the marker scoring schemes under these models, see Schaid (1996). The genotype coding essentially constructs a multi-allelic marker, with each possible genotype corresponding to an allele.
Exit
the program.
Compute
the SDT test statistic(s) and p-value(s) for the specified marker (or all
markers if no marker is specified) using the current trait and test mode.
When using the bi-allelic mode of testing, the results are displayed as one line for each marker allele using the format:
marker_name allele_name B C p-exact
where B and C are counts of families with a higher (B) or lower (C) frequency of the specified allele among affected offspring compared to unaffected offspring in the family. The p-value is computed exactly using the binomial distribution.
When using the multi-allelic mode of testing, the result is displayed in one line for each marker using the format:
marker_name degrees_of_freedom chi-square p-value
where the p-value uses the asymptotic chi-square distribution.
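For the bi-allelic case, an exact binomial p-value for the counts B and C can be sketched in Python as a sign test. The manual does not state whether the reported p-value is one-sided or two-sided, so the two-sided form below, with success probability 1/2 under the null, is an assumption. The function name `sign_test_p` and the example counts are hypothetical.

```python
from math import comb

def sign_test_p(b, c):
    """Two-sided exact binomial p-value for B vs C with success probability 1/2."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as extreme in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(8, 2))
print(sign_test_p(5, 5))
```

Families in which the allele frequency is the same among affected and unaffected offspring contribute to neither B nor C and do not enter the calculation.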
Set trait values for affected (aff_t), unaffected (unaff_t), and subjects with unknown affection status (unknown_t). Here affected, unaffected, and trait-unknown offspring are defined using the affection status variable from the pedigree file:
affection status 2 means affected
affection status 1 means unaffected
affection status 0 means unknown
Specify the trait to use for computing the test statistics. Specifying more than one trait will provide a multivariate test with multiple degrees of freedom (Lange et al., 2002). If no trait is specified, all available traits are displayed with the current trait denoted by **. The default trait uses the affection status variable defined in the pedigree file, recoded as 1 if affection status is 2 (affected), and zero otherwise (affection status is unaffected or unknown). The name of the default trait is affection.
Displays
detailed information about the marker genotype distribution (under H0: no
linkage) among offspring in each nuclear family of the named pedigree.
If pedigree_id is not specified, the marker distributions are displayed
for all nuclear families in all pedigrees.
These distributions are obtained via a table lookup using Tables 1-3 of
Rabinowitz and Laird (2000). If the
family is not informative, the marker genotype distribution has probability 1
for the observed data and the output for that family is suppressed.
Output format:
First line: **** ped_id {fgt,mgt} ==> {ogt..}
where fgt and mgt are the observed genotypes (in letters) for father and mother in the nuclear family, and ogt.. lists the set of observed genotypes among the offspring, one entry for each unique genotype. If either parent is missing, the genotype entry is replaced by a blank. The genotype is expressed generically using the letters A, B, C or D. The letters indicate how the nuclear family is mapped to one of the corresponding entries in Tables 1-3, except when genotype data are available for both parents. Tables 1-3 omit this case, since here the usual Mendelian rules are used to determine genotype probabilities for the offspring.
Remaining lines: The sample space of possible offspring genotypes, the probability of each point in the sample space, and the joint probability of each possible pair of genotypes.
Displays detailed information about the FBAT statistics for the specified markers, including S, E(S) and Var(S). By default, information on the statistics is displayed separately for each informative family, followed by a summary over all the families. In viewstat, S will be displayed as a vector, with one entry for each allele; hence E(S) is also a vector, and Var(S) is a matrix, unless -o is used.
Option "-s" suppresses the family-specific information and displays only the summary part. Option "-e" computes and displays the variance of S using the empirical variance. Option "-o" displays S, E(S) and Var(S) when the trait is determined using the -o option. Since the value of $\mu$ is different for each allele, only the scalar values of Var(S) for each allele are displayed. The -e and -o options cannot be used together.
To get started, the user must call the program and load a pedigree data file. This file follows the standard pedigree file format as used by genehunter or mapmaker. See section 3 for details. For example, the first few lines of a data file named test.ped are:
a2m apoe
501051 10070 90020 90019 1 2 1 1 3 4
501051 10137 90020 90019 1 1 1 1 3 4
501151 10018 90039 90040 2 2 1 1 3 4
501151 10031 90039 90040 2 2 1 2 4 4
501151 10196 90039 90040 2 1 1 2 4 4
501151 10237 90039 90040 1 1 1 1 3 4
None of the families in this data set have parental genotype information; all of them have both affected and unaffected offspring. To read this data file (which resides in the same directory as fbat), use the load command:
>> load test.ped
read in: 2 markers from 104 pedigrees (104 nuclear families,375 persons)
Once you have read in a pedigree file successfully, you may use the default settings to get a TDT-type test statistic. The relevant default settings are:
Command    Default
displayp   1.0 (all test results are printed out regardless of p-value)
minsize    10 informative families are required
model      additive
mode       bi-allelic (test one allele against all others at a marker)
trait      dichotomous affection status
You obtain your test statistics by simply typing:
>>fbat
trait affection; model additive; test bi-allelic; minsize 10, p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
a2m 1 0.829 44 112.000 126.133 14.900 -3.661 0.000251
a2m 2 0.171 44 64.000 49.867 14.900 3.661 0.000251
apoe 2 0.040 11 6.000 12.418 3.298 -3.534 0.000410
apoe 3 0.547 68 130.000 147.334 30.133 -3.158 0.001589
apoe 4 0.412 67 160.000 136.248 31.116 4.258 0.000021
Total number of test(s): 5
Since we are using bi-allelic tests, the program gives a line of output for each allele, two for the a2m gene and three for the apoe gene. Because there are only two alleles for a2m, the Z statistics are identical apart from sign.
The
empirical variance option is useful when the marker in question is known to be
linked to a gene underlying the trait. In
this case the null hypothesis is H0: no association.
In most instances the test results will be similar, unless there are many
nuclear families which arise from a single pedigree.
Whenever the -e option is used with fbat, the empirical variance is used to compute the test statistic:
>> fbat -e
trait affection; model additive; test bi-allelic; minsize 10; p 1.00000
Marker Allele afreq fam# S E(S) Var(S) Z P
a2m 1 0.829 38 94.000 108.133 15.235 -3.621 0.000294
a2m 2 0.171 38 56.000 41.867 15.235 3.621 0.000294
apoe 2 0.040 10 5.000 11.418 4.912 -2.896 0.003782
apoe 3 0.547 59 107.000 124.334 28.858 -3.227 0.001252
apoe 4 0.412 57 140.000 116.248 32.232 4.184 0.000029
Total number of test(s): 5
Note that the results using the empirical variance are slightly more conservative, but do not differ markedly. This is what we would generally expect, unless there are a few very large pedigrees that contribute most of the informative nuclear families.
Here we consider a few more advanced commands: mode, model and loading phenotype files.
MODE To perform a multi-allelic test, where all alleles are tested simultaneously against their null expectation, use the mode command, followed by the fbat command as usual:
>> mode m
current mode multi-allelic
>> fbat
trait affection; model additive; test multi-allelic; minsize 10; p 1.000000
Marker Allele# DF CHISQ P
a2m 2 1 13.405701 0.000251
apoe 3 2 25.392769 0.000003
Total number of test(s): 2
When
using the additive model, the degrees of freedom of the chi-square is one less
than the number of alleles. With
other models the degrees-of-freedom is generally equal to the number of alleles.
Note that the multi-allelic test value for any bi-allelic marker is the
same as the square of either of the two Z values computed in the bi-allelic
mode, i.e., the tests are equivalent. If
any allele has fewer than minsize informative families, all offspring with that
allele are excluded from the test.
To change back to bi-allelic mode, use "mode b".
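The equivalence between the two modes for a bi-allelic marker can be checked numerically from the sample output in this section. The Python sketch below squares the Z statistic recomputed from the printed S, E(S) and Var(S) for a2m allele 1, and evaluates upper-tail chi-square probabilities for the printed CHISQ values. The function name `chi2_sf` is hypothetical, and it only covers 1 and 2 degrees of freedom, where closed forms exist.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 or 2")

z = (112.000 - 126.133) / math.sqrt(14.900)   # a2m allele 1, bi-allelic run
print(f"Z^2 = {z * z:.3f}")                    # compare with CHISQ 13.405701
print(f"P(a2m)  = {chi2_sf(13.405701, 1):.6f}")
print(f"P(apoe) = {chi2_sf(25.392769, 2):.6f}")
```

The squared Z agrees with the printed chi-square up to the rounding of the inputs, and the tail probabilities can be compared with the P column of the multi-allelic output.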
MODEL The default model is additive, so that the statistics count the number of copies of an allele that offspring have. We can switch to a dominant, recessive or a genotype model using the model command. For example, with the dominant model, we count the number of offspring with any copy of a particular allele. For the test.ped data, we find:
>> model d
current genetic model is dominant
>> fbat
trait affection; model dominant; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
a2m 1 0.829 9 *****
a2m 2 0.171 41 49.000 37.650 10.744 3.463 0.000535
apoe 2 0.040 10 4.000 10.084 3.076 -3.469 0.000522
apoe 3 0.547 38 36.000 45.501 11.199 -2.839 0.004523
apoe 4 0.412 46 69.000 56.993 13.969 3.212 0.001316
Total number of test(s): 4
Notice that no test has been computed for the a2m 1 allele because there are fewer than 10 informative families for the dominant model.
Unlike the additive model, when using multi-allelic tests and the dominant (or recessive) model, the degrees-of-freedom will generally be equal to the number of alleles when all alleles have sufficient informative families:
>> model d
current genetic model is dominant
>> fbat apoe
trait affection; model dominant; test multi-allelic; minsize 10; p 1.000000
Marker Allele# DF CHISQ P
apoe 3 3 23.555276 0.000031
Total number of test(s): 1
The
genotype model compares observed to expected numbers of genotypes for each
possible combination. Continuing
with test.ped,
>> model g
current genetic model is genotype
>> mode b
current test mode is bi-allelic
>> fbat
trait affection; model genotype; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
a2m 1/1 0.658 41 33.000 44.350 10.744 -3.463 0.000535
a2m 1/2 0.342 44 46.000 37.433 11.590 2.517 0.011852
a2m 2/2 0.000 9 *****
apoe 2/2 0.006 1 *****
apoe 2/3 0.036 9 *****
apoe 3/3 0.201 46 32.000 39.833 13.557 -2.127 0.033384
apoe 2/4 0.034 4 *****
apoe 3/4 0.656 64 64.000 61.828 18.673 0.503 0.615230
apoe 4/4 0.067 39 48.000 36.254 11.320 3.491 0.000481
Total number of test(s): 5
In this data set only three genotypes of apoe have at least 10 informative families. With the multi-allelic mode, all offspring with any other genotype are excluded from the chi-square test:
>> mode m
current test mode is multi-allelic
>> fbat
trait affection; model genotype; test multi-allelic; minsize 10; p 1.000000
Marker Allele DF CHISQ P
a2m 2 2 13.485486 0.001179
apoe 3 3 25.258375 0.000014
Total number of test(s): 2
LOADING A PHENOTYPE FILE To use a quantitative trait, we use another data set that has measured scores on two quantitative tests and genotype data on 5 markers. First we must load the pedigree file, and then the phenotype file. The first few lines of each file are:
Pedigree file (adh.ped):
DAT DRD4 DRD2 DBH NAR
2000 200003 0 0 1 1 0 0 4 4 77 77 256 260 115 115
2000 200002 0 0 2 1 480 440 2 4 77 79 256 276 113 113
2000 200006 200003 200002 2 1 0 0 4 4 77 77 256 260 113 115
2000 200007 200003 200002 2 1 440 440 4 4 77 77 256 276 113 115
2000 200004 200003 200002 2 1 440 440 4 4 77 77 260 276 113 115
2001 200103 0 0 1 2 480 480 4 4 79 79 258 258 113 113
2001 200102 0 0 2 1 480 440 4 5 77 79 260 274 113 113
2001 200104 200103 200102 2 2 480 480 4 5 77 79 258 260 113 113
2001 200101 200103 200102 2 2 480 480 4 5 79 79 258 260 113 113
Phenotype file (adh.phe):
v1 v2 centerv1 centerv2
2000 200002 8 9 -4.8 -10.9
2000 200006 8 11 -4.8 -8.9
2000 200003 8 17 -4.8 -2.9
2000 200007 8 9 -4.8 -10.9
2000 200004 8 9 -4.8 -10.9
2001 200103 13 21 .2 1.1
2001 200102 8 9 -4.8 -10.9
2001 200104 10 19 -2.8 -.9
2001 200101 13 13 .2 -6.9
The
trait file has two variables, v1 and v2; the last two columns of this file
contain mean centered versions of these variables.
We enter the data using the load command (must load adh.ped first).
>> load adh.ped
read in: 5 markers from 44 pedigrees (43 nuclear families,166 persons)
mendelian error: locus DBH, pedigree 2003 [200303,200302]
mendelian error: locus DRD2, pedigree 2016 [201603,201602]
mendelian error: locus DRD2, pedigree 2057 [205703,205702]
mendelian error: locus DBH, pedigree 2057 [205703,205702]
mendelian error: locus DRD2, pedigree 2059 [205903,205902]
A total of 5 mendelian errors have been found
genotypes of families with mendelian error have been reset to 0
>> load adh.phe
4 quantitative traits have been successfully read
166 persons have been phenotyped
Notice that fbat identified Mendelian inconsistencies at one or two loci in 4 different pedigrees. For each of those families, the genotypes at the markers with Mendelian inconsistencies have been set to zero.
The trait command tells us what traits are now available and which one (marked with **) is in use.
>> trait
affection** v1 v2 centerv1 centerv2
We
change to another trait using the trait command:
>> trait centerv2
affection v1 v2 centerv1 centerv2**
>> fbat DAT DRD4
trait centerv2; model additive; test by allele; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
DAT 440 0.272 22 -57.400 -22.450 455.107 -1.638 0.101362
DAT 480 0.717 23 -45.700 -89.700 483.015 2.002 0.045281
DAT 520 0.011 2 ****
DRD4 2 0.091 14 7.400 -0.700 255.865 0.506 0.612587
DRD4 3 0.051 6 *****
DRD4 4 0.652 23 -100.700 -83.200 513.540 -0.772 0.439974
DRD4 5 0.023 4 *****
DRD4 6 0.011 0 *****
DRD4 7 0.154 18 -0.900 -17.100 224.990 1.080 0.280131
DRD4 8 0.017 3 *****
In this case, using the additive model and a mean-centered quantitative trait
gives a test identical to that proposed by Rabinowitz (1997) when all parental
genotype data are present. Here v2 has been centered by subtracting the sample
mean of v2 from every subject's value, so the trait is a residual; this
explains the negative values. One could also adjust for other covariates
thought to influence v2, such as age and sex, and use the residuals from the
regression equation. Alternatively, we may center the trait using external
norms or the -o option of fbat.
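Each row of the fbat output can be checked by hand from its S, E(S) and Var(S) columns. The sketch below assumes that Z is the standardized statistic (S - E(S)) / sqrt(Var(S)) and that P is the two-sided normal p-value, which is consistent with the numbers printed above; the function name is ours, not part of FBAT.

```python
import math

def z_and_p(s, e_s, var_s):
    """Z = (S - E(S)) / sqrt(Var(S)) and its two-sided normal p-value."""
    z = (s - e_s) / math.sqrt(var_s)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# DAT allele 440 row from the centerv2 run
z, p = z_and_p(-57.400, -22.450, 455.107)
print(f"Z = {z:.3f}, P = {p:.4f}")
```

For this row the result agrees with the reported Z of -1.638 and P of about 0.101.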
MULTIVARIATE TESTS
FBAT can also test multiple traits simultaneously using a multivariate version
of the test statistic.
To implement this option, list multiple traits on the trait command line.
To illustrate, we use data from an asthma study that looks at a polymorphism
in IL13. The sample contains only asthmatic offspring and their parents, but
there is interest in a large number of phenotypes that characterize asthma
severity and symptoms.
>> load il13.ped
read in: 1 markers from 635 pedigrees (636 nuclear families,1963 persons)
>> load fbat.phe
22 quantitative traits have been successfully read
666 persons have been phenotyped
warning: 4 persons are not in any pedigrees
>> trait
affection** toteos logige npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean
>> trait toteos
affection toteos** logige npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean
>> fbat
trait toteos; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
il13 30 0.776 313 -5.963 21.662 103.881 -2.710 0.006721
il13 31 0.224 313 52.428 24.803 103.881 2.710 0.006721
Total number of test(s): 2
>> trait logige
affection toteos logige** npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean
>> fbat
trait logige; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
il13 30 0.776 314 -10.594 8.968 100.687 -1.950 0.051235
il13 31 0.224 314 44.623 25.061 100.687 1.950 0.051235
Total number of test(s): 2
>> trait npos
affection toteos logige npos** sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean
>> fbat
trait npos; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
il13 30 0.776 313 37.236 32.767 99.805 0.447 0.654615
il13 31 0.224 313 21.167 25.636 99.805 -0.447 0.654615
Total number of test(s): 2
>> trait toteos logige npos
affection toteos** logige** npos** sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean
>> fbat
3 traits selected: toteos logige npos
model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# DF CHISQ P
il13 30 0.776 314 3 12.209 0.006702
il13 31 0.224 314 3 12.209 0.006702
>> quit
In
this example, the test including all three endpoints has virtually the same
p-value as the smallest of the three p-values observed in the univariate tests.
If any subject is missing a phenotype that is included in the variable
list, that subject is excluded from the test.
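The multivariate p-value can be recovered from the CHISQ and DF columns. The sketch below assumes the statistic is referred to a chi-square distribution with DF degrees of freedom and uses the closed-form upper tail for 3 degrees of freedom, so only the standard library is needed; the function name is ours.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of
    freedom (closed form available for odd degrees of freedom)."""
    tail = math.erfc(math.sqrt(x / 2.0))
    return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_multi = chi2_sf_3df(12.209)                    # CHISQ for il13, 3 traits
p_univariate = [0.006721, 0.051235, 0.654615]    # toteos, logige, npos
print(f"multivariate P        = {p_multi:.6f}")
print(f"smallest univariate P = {min(p_univariate):.6f}")
```

The computed value is close to the reported 0.006702, and comparable to the smallest univariate p-value (0.006721).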
THE -o OPTION
With -o, the value subtracted from each trait (the offset) is chosen to
minimize the variance of the test statistic. The following illustrates the
use of -o with v2:
affection v1 v2** centerv1 centerv2
>> fbat DAT
trait v2; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
DAT 440 0.272 22 460.000 485.000 4158.000 -0.388 0.698237
DAT 480 0.717 23 1009.000 965.000 4565.500 0.651 0.514923
DAT 520 0.011 2 *****
>> fbat -o DAT
trait v2; model additive; test bi-allelic; minsize 10; p 1.000
Marker Allele afreq fam# S E(S) Var(S) Z P Offset
DAT 440 0.272 22 -23.721 10.581 437.070 -1.641 0.100845 18.604651
DAT 480 0.717 23 8.913 -35.087 470.804 2.028 0.042577 18.869565
DAT 520 0.011 2 *****
Notice that the offset differs for each allele, since the test is carried out
allele by allele, but in each case the offset is close to the mean of v2, so
the result of using -o is similar to that of using centerv2. Using v2 without
centering gives a very different, and not meaningful, result, since there are
many unaffected offspring in the data set and a wide range of v2 values.
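One way to see why such an offset lands near the trait mean, and why it changes from allele to allele, is the following sketch. It is not FBAT's code. It assumes that the offset is chosen to make Var(S) as small as possible and that offspring contribute independently, so that Var(S) = sum over offspring of (T - offset)^2 * Var(X); the minimizer is then a weighted mean of T with weights Var(X). All names and numbers are hypothetical.

```python
def variance_minimizing_offset(traits, marker_vars):
    """Offset mu minimizing sum((t - mu)**2 * v): a weighted mean of the
    trait, with weights v = null variance of each offspring's marker score."""
    total_weight = sum(marker_vars)
    return sum(t * v for t, v in zip(traits, marker_vars)) / total_weight

def var_s(traits, marker_vars, mu):
    """Var(S) under the independence assumption stated in the lead-in."""
    return sum((t - mu) ** 2 * v for t, v in zip(traits, marker_vars))

# Hypothetical trait values and null variances of the allele count
traits = [12.0, 25.0, 18.0, 30.0]
marker_vars = [0.5, 0.25, 0.5, 0.0]   # 0.0: uninformative offspring, no weight

mu = variance_minimizing_offset(traits, marker_vars)
plain_mean = sum(traits) / len(traits)
print(mu, plain_mean)  # 17.0 21.25
```

Because the weights depend on which offspring are informative for a given allele, a different allele generally yields a different offset under this reading.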
The -o option can also be used with qualitative data. First we view the
results of using affection status in this data set:
>> fbat DAT
trait affection; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
DAT 440 0.272 16 12.000 14.500 6.250 -1.000 0.317311
DAT 480 0.717 16 30.000 26.500 6.750 1.347 0.177932
DAT 520 0.011 1 *****
Using fbat with affection status and -o gives:
>> fbat -o
trait affection; model additive; test bi-allelic; minsize 10; p 1.000
Marker Allele afreq fam# S E(S) Var(S) Z P Offset
DAT 440 0.272 21 -2.535 0.256 2.616 -1.725 0.084469 0.581395
DAT 480 0.717 22 0.652 -2.848 2.788 2.096 0.036071 0.586957
DAT 520 0.011 2 *****
There are now more informative families, since this data set includes
families with only unaffected offspring, as well as families with both
affected and unaffected offspring; once an offset is subtracted, unaffected
offspring have a non-zero trait value and contribute to the statistic.
USING THE SETAFFTRAIT COMMAND
We can use this command to set the values of T that correspond to affected,
unaffected, and unknown phenotypes. Here we use it to see what happens when
only the unaffected subjects in the data set are used:
>> setafftrait 0 1 0
conver affection status to trait 0.000000(aff), 1.000000(unaff), 0.000000(unknown)
>> fbat DAT
trait affection; model additive; test bi-allelic; minsize 10; p 1.000000
Marker Allele afreq fam# S E(S) Var(S) Z P
DAT 440 0.272 12 13.000 10.000 4.500 1.414 0.157299
DAT 480 0.717 13 20.000 23.500 4.750 -1.606 0.108294
DAT 520 0.011 1 *****
In this data set, using either the affected or the unaffected offspring alone
gives no indication of significance, but using both suggests that allele
DAT 480 may be positively associated with affection status.
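The effect of the three setafftrait values on the statistic S (the sum of T times X over offspring) can be sketched as follows. The affection status codes (2 = affected, 1 = unaffected, 0 = unknown) follow the usual pedigree-file convention and are an assumption here, as are the function name and the data.

```python
def s_statistic(status, allele_counts, aff_t, unaff_t, unknown_t):
    """S = sum of T * X over offspring, with T set from affection status.
    Assumed status coding: 2 = affected, 1 = unaffected, 0 = unknown."""
    coding = {2: aff_t, 1: unaff_t, 0: unknown_t}
    return sum(coding[s] * x for s, x in zip(status, allele_counts))

# Hypothetical offspring: affection status and count of the tested allele
status = [2, 2, 1, 1, 0]
allele_counts = [2, 1, 0, 1, 2]

s_affected_only = s_statistic(status, allele_counts, 1, 0, 0)    # T = 1 for affected only
s_unaffected_only = s_statistic(status, allele_counts, 0, 1, 0)  # as in setafftrait 0 1 0
print(s_affected_only, s_unaffected_only)  # 3 1
```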
5.3 Commands Useful for Debugging
In this section, we consider 3 commands that are helpful in understanding how the statistics are computed and in debugging. These commands are genotype, viewmarker and viewstat.
GENOTYPE To assure yourself that the data are loaded correctly, you can ask to see the data for any pedigree (or family) by typing:
>> genotype 501051 a2m apoe
                       a2m   apoe
10070  90020  90019    1 1   3 4
90020  0      0        0 0   0 0
90019  0      0        0 0   0 0
10137  90020  90019    1 1   3 4
The output lists the IDs and marker data for every individual in the family. In this case, both parents (90019 and 90020) have no record in the data file; their genotypes and the IDs of their parents have been set to zero.
VIEWMARKER Viewmarker provides data at the nuclear family level on the
distribution of offspring genotypes. Using viewmarker with test.ped gives, for
the first few families:
>> viewmarker apoe
**** 501151 {,} ==> {AB,BB}
Possible sibs' genotypes g[] = 4/4 3/4
Probability of each genotype Pg[] =
0.500000 0.500000
Probability of each pair of genotypes Pgg[][] =
0.166667 0.333333
0.333333 0.166667
**** 501221 {,} ==> {AA,AB}
Possible sibs' genotypes g[] = 4/4 4/3
Probability of each genotype Pg[] =
0.666667 0.333333
Probability of each pair of genotypes Pgg[][] =
0.333333 0.333333
0.333333 0.000000
There is no entry for 501051 because that family has all of its mass
concentrated on a single genotype. For 501151, there were 2 genotypes among
the offspring, one 3/4 and one 4/4, and no parental genotype data. The AB,BB
notation means that Table 3, line 2 of Rabinowitz and Laird (2000) is used to
assign a probability distribution to the four offspring in this family. All
the mass is concentrated on these two genotypes, with zero mass on all other
genotypes.
It is instructive to use viewmarker when we have a family where both parents
have genotype data. With parents, the distribution is computed using Mendelian
laws. Returning to the adh.ped file, we have:
>> viewmarker DRD4
**** 2000 {AA,AB} ==> {AA}
Possible sibs' genotypes g[] = 3/3 3/1
Probability of each genotype Pg[] =
0.5000 0.5000
Probability of each pair of genotypes Pgg[][] =
0.2500 0.2500
0.2500 0.2500
Here the parents are 3/3 and 3/1, and they have only 3/3 offspring. Since the
parental genotypes are known, the offspring distribution under H0 follows
Mendel's laws: each offspring is either 3/3 or 3/1, with probability 0.5 each.
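The distribution shown by viewmarker for this family can be reproduced by enumerating the alleles each parent can transmit. This is a sketch of Mendelian segregation only, not FBAT's implementation; the function name is ours, and sibs are treated as independent given the parental genotypes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_distribution(father, mother):
    """Genotype probabilities for one offspring when each parent transmits
    either of its two alleles with probability 1/2."""
    counts = Counter(tuple(sorted((a, b), reverse=True))
                     for a, b in product(father, mother))
    return {g: Fraction(n, 4) for g, n in counts.items()}

pg = offspring_distribution((3, 3), (3, 1))
# Joint distribution for a pair of sibs, independent given the parents
pgg = {(g1, g2): p1 * p2 for g1, p1 in pg.items() for g2, p2 in pg.items()}

for genotype, prob in pg.items():
    print(f"{genotype[0]}/{genotype[1]}", float(prob))
```

Each genotype has probability 0.5 and each ordered pair of sib genotypes has probability 0.25, matching Pg[] and Pgg[][] above.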
VIEWSTAT Viewstat provides data on the contribution of each family to S,
E(S) and Var(S). Viewstat also will provide additional detail on the summary
statistics, which are relevant when using multi-allelic tests.
Using viewstat with test.ped gives:
>> viewstat apoe
Alle S E(S) Var(S)
ped 501151 (90039,90040)
2 0.0000 0.0000 0.0000 0.0000 0.0000
3 1.0000 1.0000 0.0000 0.3333 -0.3333
4 3.0000 3.0000 0.0000 -0.3333 0.3333
ped 501221 (90109,90108)
2 0.0000 0.0000 0.0000 0.0000 0.0000
3 0.0000 0.6667 0.0000 0.2222 -0.0000
4 2.0000 2.0000 0.0000 -0.0000 -0.0000
.
.
.
total family count = 104; informative family count = 72
Alle Fam# S E(S) Var(S)
2 11 6.0000 12.4178 3.2985 -1.1575 -2.1409
3 68 130.0000 147.3345 -1.1575 30.1325 -28.9750
4 67 160.0000 136.2477 -2.1409 -28.9750 31.1159
eigenvector:
0.8164 0.5774 0.7134
-0.4192 0.5774 0.7134
-0.3972 0.5774 0.7134
diagonal vector:
4.9344
-0.0000
59.6125
pseudo inverse v:
0.1351 -0.0692 0.0405
-0.0692 0.0439 0.0405
-0.0659 0.0254 0.0405
rank = 2
chisq = 25.3928
Again, there is no entry for 501051. Since we use the default settings, the
statistic counts the number of alleles among affected offspring. For family
501151, there are no 2 alleles, one 3 allele, and three 4 alleles among
affected offspring. E(S) and Var(S) are computed from the distribution given
by viewmarker. By chance, S equals E(S). Notice that Var(S) is a 3x3 matrix.
The first row and column are zero since this family has zero probability of
carrying a 2 allele. If you use the -s option, you obtain only the summary
over all families. The summary E(S) and the diagonal of Var(S) are what you
obtain from fbat using bi-allelic tests. The rest of the output refers to
multi-allelic tests: the eigenvectors, eigenvalues and pseudo-inverse of
Var(S) are used to compute the chi-square test and its degrees of freedom
(the rank of Var(S)).
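The chisq value in the summary can be checked without an eigen-decomposition. Because every offspring carries exactly two alleles, the deviations S - E(S) sum to zero and each row of Var(S) sums to zero, so Var(S) has rank 2 here. In that situation the pseudo-inverse quadratic form equals the ordinary quadratic form obtained after dropping any one allele. The sketch below uses that shortcut on the summary numbers printed by viewstat; it is a hand check in plain Python, not the computation FBAT performs, and the function name is ours.

```python
# Summary statistics printed by viewstat for apoe alleles 2, 3, 4
s = [6.0, 130.0, 160.0]
e_s = [12.4178, 147.3345, 136.2477]
var_s = [[3.2985, -1.1575, -2.1409],
         [-1.1575, 30.1325, -28.9750],
         [-2.1409, -28.9750, 31.1159]]

def chisq_three_alleles(s, e_s, var_s):
    """Quadratic form U' V^-1 U on the last two alleles (first one dropped),
    where U = S - E(S) and V is the matching 2x2 block of Var(S)."""
    u1, u2 = s[1] - e_s[1], s[2] - e_s[2]
    a, b, d = var_s[1][1], var_s[1][2], var_s[2][2]
    det = a * d - b * b
    return (u1 * u1 * d - 2.0 * u1 * u2 * b + u2 * u2 * a) / det

chisq = chisq_three_alleles(s, e_s, var_s)
print(f"chisq = {chisq:.3f} on 2 degrees of freedom")
```

Up to rounding in the printed inputs, the result agrees with the reported chisq of 25.3928.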
One can also use -o with viewstat to obtain the individual family
contributions to Var(S) when this option is used with fbat.
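The per-family rows reported by viewstat can be reproduced from the viewmarker distribution. The sketch below does this for family 501151 at apoe, using Pg[] and Pgg[][] as printed earlier. It assumes that two offspring contribute to the statistic (which is what S = 0, 1, 3 implies) and that Pgg[][] is the joint distribution of an ordered pair of sibs; the function names are ours.

```python
# viewmarker output for family 501151 (apoe)
genotypes = [(4, 4), (3, 4)]
pg = [0.5, 0.5]
pgg = [[1 / 6, 1 / 3],
       [1 / 3, 1 / 6]]
n_sibs = 2  # assumption: two offspring contribute, as implied by S = (0, 1, 3)

def count(genotype, allele):
    """Number of copies of `allele` in a genotype."""
    return sum(1 for a in genotype if a == allele)

def family_moments(alleles):
    """E(S) per allele and the covariance matrix of S for one family."""
    k = len(genotypes)
    e = {a: n_sibs * sum(p * count(g, a) for g, p in zip(genotypes, pg))
         for a in alleles}
    cov = {}
    for a in alleles:
        for b in alleles:
            # same-sib term E[Xa * Xb] and sib-pair term E[Xa(sib1) * Xb(sib2)]
            single = sum(p * count(g, a) * count(g, b)
                         for g, p in zip(genotypes, pg))
            pair = sum(pgg[i][j] * count(genotypes[i], a) * count(genotypes[j], b)
                       for i in range(k) for j in range(k))
            second_moment = n_sibs * single + n_sibs * (n_sibs - 1) * pair
            cov[(a, b)] = second_moment - e[a] * e[b]
    return e, cov

e, cov = family_moments([2, 3, 4])
print(e)
print(round(cov[(3, 3)], 4), round(cov[(3, 4)], 4), round(cov[(4, 4)], 4))
```

Under these assumptions the values agree with the viewstat rows for ped 501151: E(S) of 0, 1 and 3, and Var(S) entries of 0.3333 and -0.3333.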
Horvath, S. and Laird, N.M. (1998) "A discordant-sibship test for disequilibrium and linkage: No need for parental data." Amer J Hum Gen 63:1886-97.
Horvath, S., Wei, E., Xu, X., Palmer, L., and Baur, M. (2001a) "Family-based association test method I: Age of onset traits and covariates." Genetic Epi (Suppl 1) 19:36-42.
Horvath, S., Xu, X., and Laird, N. (2001b) "The family based association test method: strategies for studying general genotype-phenotype associations." Euro J Hum Gen 9:301-306.
Knapp, M. (1999) "A Note on Power Approximation for the Transmission/Disequilibrium Test." Amer J Hum Gen 64:1177-1185.
Laird, N., Horvath, S. and Xu, X. (2000) "Implementing a unified approach to family based tests of association." Genetic Epi 19(Suppl 1):S36-S42.
Lake, S., Blacker, D. and Laird, N. (2001) "Family based tests in the presence of association." Amer J Hum Gen 67:1515-1525.
Lange, C., Silverman, E., Weiss, S., Xu, X., and Laird, N.M. (2002) "A Multivariate Family-Based Test using Generalized Estimating Equations: FBAT-GEE." Under revision; available from the authors.
Lunetta, K.L., Farone, S.V., Biederman, J., and Laird, N.M. (2000) "Family based tests of association and linkage using unaffected sibs, covariates and interactions." Amer J Hum Gen 66:605-614.
Mokliatchouk, O., Blacker, D. and Rabinowitz, D. (2001) "Association tests for traits with variable age at onset." Human Heredity 51:46-53.
Rabinowitz, D. and Laird, N.M. (2000) "A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information." Human Heredity 50(4):227-233.
Schaid, D.J. (1996) "General score tests for associations of genetic markers with disease using cases and their parents." Genetic Epi 13:423-49.
Tu, I.P., Balise, R.R. and Whittemore, A.S. (2000) "Detection of disease genes by use of family data II. Application to nuclear families." Amer J Hum Gen 66:1341-1350.
Whittaker, J., and Lewis, C. (1998) "The effect of family structure on linkage tests using allelic association." Amer J Hum Gen 63:889-897.