THE FBAT (Family Based Association Test) PROGRAM:

                 Testing for Linkage and Association with Family Data

                               Xin Xu, Steve Horvath and Nan Laird 

                                     horvath@mednet.ucla.edu, laird@hsph.harvard.edu, xin_xu@harvard.edu                                                                                                                                                

                                                         3/8/02

               Web address: www.biostat.harvard.edu/~fbat/default.html

 

                                  TABLE OF CONTENTS 

1. INTRODUCTION TO FBAT

2. INSTALLING THE PROGRAM

2.1 MacOS

2.2 Solaris/Sparc

2.3 Windows

3. INPUT FILES

3.1 Format for Pedigree File

3.2 Format for Phenotype File

4. COMMANDS

4.1 ? [command]

4.2 afreq [marker...]

4.3 displayp [p_value]

4.4 fbat [-e,-o] [marker...]

4.5 genotype pedigree-id marker1 [marker2...]

4.6 load [ped,phe] file_name

4.7 log [log_file,on,off]

4.8 minsize [size_value]

4.9 mode [b,m]

4.10 model [a,d,r,g]

4.11 quit

4.12 sdt [marker]

4.13 setafftrait aff_t unaff_t unknown_t

4.14 trait [trait_name]

4.15 viewmarker marker [pedigree_id]

4.16 viewstat [-s] [-e,-o] marker

5. GETTING STARTED IN FBAT

5.1 Loading the Pedigree File and Running FBAT

5.2 More Advanced Commands

5.3 Commands Useful for Debugging

6. REFERENCES

1. INTRODUCTION

FBAT is a software package for computing Family Based Tests of Association.  There are two steps involved in constructing a test of association: 1) defining a test statistic that reflects association between a phenotype or trait (T) and a marker value (X) and 2) defining the distribution of the test statistic under the null hypothesis.  FBAT uses the generic form S = ΣTX as a test statistic, where the summation is over all offspring in all families in the data set.  The actual test results will differ depending upon how the user specifies T and X, and how the distribution of S is determined.  Here we present a brief explanation of the ideas underlying the choice of T, X and the distribution of S; for more details, see Laird et al. (2000), Rabinowitz and Laird (2000) and Horvath et al. (2001b).

 

The data set used by FBAT can consist of pedigrees, nuclear families or a combination of the two. If pedigrees are present in the data set, each one is decomposed into individual nuclear families that are treated as distinct in the calculations, except in the calculation of the empirical variance (see below).  

 

The default null hypothesis tested by FBAT is H0: no linkage and no association between the marker and any gene influencing the trait.  To avoid biases due to population stratification, mis-specification of the trait distribution, and/or selection based on the phenotype, the distribution of S under H0 is calculated using the distribution of offspring genotype, conditional on the trait T, and on the parental genotype.  If either or both parental genotypes are unknown, FBAT uses the distribution of offspring genotype conditional on T and on the sufficient statistics for the unobserved parental genotype. For more details, see Rabinowitz and Laird (2000).

 

This conditional distribution under the null hypothesis is used to compute E(S) and Var(S); S is then normalized in the usual way to obtain large sample test statistics.  Specifically, if X is a scalar summary of an individual's genotype, then

 

                                              Z = (S - E(S)) / sqrt(Var(S))

 

is approximately N(0,1).  FBAT gives the value of Z and a 2-sided p-value based on this normal approximation.  When X is a vector, the statistic

                                 

                                             χ² = (S - E(S))' Var(S)^-1 (S - E(S))

 

has an approximate χ² distribution with degrees of freedom equal to the rank of Var(S).  FBAT gives the value of χ² and its one-sided p-value based on the asymptotic χ² distribution. The calculation of the moments of S is described in the technical report "Inside FBAT", available from the FBAT web page.
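
As a quick arithmetic check of the scalar normalization above, the following Python sketch recomputes Z and the two-sided p-value from S, E(S) and Var(S).  The function name z_test is hypothetical and is not part of FBAT; the input values are copied from the a2m allele 1 row of the sample fbat output in section 5.1.

```python
import math

def z_test(s, e_s, var_s):
    """Return (Z, two-sided p-value) using the normal approximation."""
    z = (s - e_s) / math.sqrt(var_s)
    # Two-sided tail probability of a standard normal: P(|N(0,1)| > |z|)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# S, E(S) and Var(S) from the a2m allele 1 row of the section 5.1 output
z, p = z_test(112.000, 126.133, 14.900)
print(f"Z = {z:.3f}, P = {p:.6f}")
```

The printed values can be compared with the Z and P columns of the same output row.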

 

With more than one sibling in a family, the distribution of offspring genotypes depends upon whether or not one assumes linkage is present between the marker and a gene influencing the trait.  In the case where linkage is present and the null hypothesis is H0: linkage but no association, sibling genotypes will be correlated, as will the genotypes of different nuclear families derived from one pedigree.  In this case, the empirical variance option should be used.  With this option, Var(S) is computed empirically without making assumptions about the recombination parameter, or degree of correlation between multiple sibs in a family. If some of the nuclear families come from a common pedigree, then FBAT sums S-E(S) over all nuclear families in the pedigree to yield a single contribution for each pedigree.  This means that the empirical variance option gives the correct variance asymptotically regardless of pedigree structure and null hypothesis.  See Lake et al. (2000) for details.

 

In some nuclear families, the distribution of offspring genotypes may be degenerate after conditioning on parental genotype or on the sufficient statistic for parental genotypes.  In such cases, these families are not informative and will not contribute to the test statistic.  For example, if both parents are homozygous, then the conditional genotype distribution of any offspring is degenerate.  If both parents' genotypes are unknown, and all offspring have the same genotype, then the distribution of any offspring's genotype, conditional on the sufficient statistic for parental genotypes, is a point mass at the single observed genotype.  In some cases, the conditional genotype distribution is not degenerate, but there can still be no contribution to the test statistic.  For example, consider a family with two affected offspring with genotypes AA and Aa and no parents. The genotype distribution is not degenerate; under the null, each sibling retains their original genotype with probability ½ and the genotypes are interchanged with probability ½.  If X counts the number of A alleles among the affecteds, the contribution of the two affected siblings to S is always three, its mean is always three and its variance is zero.  It is not necessary to exclude such families from the data set because their contribution to S-E(S) and Var(S) will be zero.  The count of informative families, which is given as part of the fbat output, includes only families with a non-zero contribution to the test statistic.
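
The two-sibling example can be verified by direct enumeration.  The Python sketch below is only an illustration of the argument, not FBAT code: it assumes the additive coding (X = number of A alleles), T = 1 for both affected siblings, and the two equally likely assignments of the observed genotypes described above.

```python
from itertools import permutations

def a_count(genotype):
    """X for the additive coding: number of A alleles in a genotype string."""
    return genotype.count("A")

observed = ("AA", "Aa")   # genotypes of the two affected siblings
traits = (1, 1)           # T = 1 for each affected sibling

# Under the null, the two orderings of the observed genotypes are equally likely
outcomes = [sum(t * a_count(g) for t, g in zip(traits, perm))
            for perm in permutations(observed)]
prob = 1.0 / len(outcomes)

mean_s = sum(prob * s for s in outcomes)
var_s = sum(prob * (s - mean_s) ** 2 for s in outcomes)
print(outcomes, mean_s, var_s)   # prints: [3, 3] 3.0 0.0
```

Changing traits to (1, 0), i.e. one affected and one unaffected sibling, makes S vary between the two assignments, so the same family would then contribute to the test.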

 

 

The main issues to be addressed in using FBAT are how to specify X and how to specify T.  Consider X first; it is controlled by two FBAT options: model and mode.  The model command allows one to specify a recessive, dominant, additive or genotype model for how the gene acts on the phenotype.  The default is the additive model; several studies have shown that the additive model performs well even when the true genetic model is not additive.  See, for example, Knapp (1999), Tu et al. (2000) and Horvath et al. (2001b). Note that this choice only affects the power of the test.  Choosing the 'wrong' model does not invalidate the test under the null; rather, it may reduce power under the alternative.  The genotype model treats each distinct genotype as a separate allele. The mode option is relevant if the marker has more than two alleles or the genotype model is used.  FBAT allows two strategies: each allele is tested separately, resulting in multiple single degree-of-freedom tests, or all alleles are compared simultaneously to their null expectation in one test with multiple degrees of freedom.  In the latter case, X is a vector and the chi-square (χ²) version of the test (given above) is used.

 

The second main issue is how to specify the trait T.  In general T will be some function of the phenotype; dichotomous, measured and age-at-onset data can be used.  FBAT can handle any kind of trait, but it is up to the user to specify T, and this can have a substantial impact on the power of these tests.  First note that if T=0 for a subject, then this subject contributes nothing to S, E(S), or Var(S), i.e., they do not contribute to the value of the test.  Such individuals only help to determine the distribution of their siblings' genotypes in the case where parents are missing genotype data.  When using the affection status variable in the pedigree file, subjects coded as 0 (missing phenotype) will have T=0 for the default trait. When using a phenotype file (see below), subjects with unknown phenotype should be coded '-'.  This ensures that any T computed for this subject will be zero.

 

When the trait is dichotomous (affected or not), the usual approach has been to look at allele transmissions from parents to affected offspring only (Spielman et al., 1993).  This can be achieved by setting T=1 for affected subjects, and T=0 for all others; this is the default trait coding for FBAT.

 

The theory of score tests (Laird et al., 2000; Whittaker and Lewis, 1998) suggests using T=Y-m, where Y is a dichotomous indicator of affection status, and m is the disease prevalence.  With a rare disease, m is nearly zero, and the result is close to the default coding.  When m is 0.5, an affected subject has T=0.5 and an unaffected subject has T=-0.5, i.e., they receive equal weight, but have different signs.  One can use the setafftrait command in FBAT to specify values of T based on the affection status variable in the pedigree file.  If disease prevalence is not known for the population, one can use the -o option in FBAT, which chooses m to minimize Var(S).  See the examples section of the FBAT documentation and also Lunetta et al. (2000).
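
The offset coding T=Y-m can be tabulated for a few prevalence values with a short Python sketch.  The function name affection_trait is hypothetical, and the prevalence values are illustrative only; the mapping of affection status codes follows the pedigree file convention (2 = affected, 1 = unaffected, 0 = unknown).

```python
def affection_trait(aff_status, prevalence):
    """Map pedigree-file affection status to T = Y - m.

    aff_status: 2 = affected, 1 = unaffected, 0 = unknown.
    prevalence: assumed disease prevalence m (supplied by the user).
    """
    if aff_status == 0:
        return 0.0          # unknown phenotype contributes nothing to the test
    y = 1.0 if aff_status == 2 else 0.0
    return y - prevalence

# Illustrative prevalences: m = 0 reproduces the default coding (1, 0, 0)
for m in (0.0, 0.1, 0.5):
    print(m, affection_trait(2, m), affection_trait(1, m), affection_trait(0, m))
```

The three values printed for each m are, in order, the aff_t, unaff_t and unknown_t arguments that one would supply to setafftrait to obtain the same coding.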

 

The use of -o does not always result in a more significant p-value, especially in the case where the test result is already significant using the default affection status. The -o option only minimizes the variance of S under the null hypothesis and should perform best for small departures from the null.  In general, using the -o option will give a result similar to using the sample mean to estimate m.

 

FBAT can also handle measured traits by specifying T in a phenotype file.  How T should be defined depends both on the sample design and on the objectives of the association test.  If the measured phenotype data are available for affected subjects only, and the determination of affection status depends strongly upon the value of the measured phenotype, then there is little to be gained by using the measured phenotype.  Alternatively, if variation in the measured phenotype among affecteds is of interest, then a test designed as a contrast between high and low values of the measured phenotype is potentially powerful for detecting an association between the marker and the phenotype.  In this case, T should again be defined by centering the phenotype, T = Y-m, where m is the mean phenotype in the population; depending upon the sample, m may be estimated from the sample mean, or alternatively from external data.  The -o option described above can also be used.

 

When there are multiple correlated traits available, it may be desirable to test them simultaneously in a single test using the multivariate test statistic described in Lange et al. (2002).  This can be accomplished using the trait command.

 

Several possibilities have been suggested for handling age-at-onset data.  Horvath et al. (2001b) describe two methods based on a score test using a proportional hazards model with an exponential age-at-onset distribution.  Mokliatchouk et al. (2001) use a score test based on Cox regression.

In principle, it is straightforward to adjust the phenotypes for covariates, by incorporating them into the estimate of m.  Instead of estimating a single m for the entire sample, estimate the mean of Y as a function of the covariates, and use Ti=Yi-mi as the trait, where i indexes the individuals in the sample.  See Lunetta et al. (2000) and Horvath et al. (2000a) for examples.
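
As a minimal sketch of the covariate adjustment Ti=Yi-mi, the Python example below estimates mi as the sample mean of Y within each level of a single categorical covariate.  This is one simple choice made for illustration (a regression model would play the same role for continuous covariates); the function name and the data are hypothetical, and setting T to zero for missing phenotypes follows the convention described earlier in this section.

```python
from collections import defaultdict

def covariate_adjusted_traits(records):
    """records: list of (y, group) pairs, with y=None for a missing phenotype.

    Returns T_i = Y_i - m_i, where m_i is the sample mean of Y within the
    individual's covariate group; missing phenotypes give T_i = 0.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for y, group in records:
        if y is not None:
            sums[group] += y
            counts[group] += 1
    means = {g: sums[g] / counts[g] for g in counts}
    return [0.0 if y is None else y - means[g] for y, g in records]

# Hypothetical data: phenotype value and sex (1 = male, 2 = female)
data = [(10.0, 1), (14.0, 1), (20.0, 2), (22.0, 2), (None, 2)]
print(covariate_adjusted_traits(data))   # prints: [-2.0, 2.0, -1.0, 1.0, 0.0]
```

The adjusted values would then be written to a phenotype file as an additional trait column.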

 

The FBAT package can also be used to implement the sibship disequilibrium test using the sdt command (Horvath and Laird, 1998).  The SDT is designed to detect both linkage in the presence of association and association in the presence of linkage (linkage disequilibrium) when dealing with a dichotomous trait.  The test does not require parental data, but requires discordant sibships with at least one affected and one unaffected sibling.

2. INSTALLING THE PROGRAM

The FBAT package, consisting of the executable program and a user's manual, is currently available in compressed forms for MacOS/PowerPC, Windows, and Solaris/Sparc platforms.

2.1 MacOS

FBAT for MacOS requires PowerPC processors. The package is compressed with StuffIt and named "FBATxxx.sit.hqx" where xxx is the version number. To install, just expand it with StuffIt Expander™.

 

2.2 Solaris/Sparc

The package is named "FBATxxx.tar.Z" where xxx is the version number. To install, just type "zcat FBATxxx.tar.Z | tar xvf -".

 

2.3 Windows

FBAT for Windows requires Windows 3.1, Windows 95, Windows 98 or Windows NT. The package is zipped and named "FBATxxx.zip" where xxx is the version number. To install, just unzip it using any zip utility.

 

3. INPUT FILES

Two types of files are used by FBAT. A pedigree file defines the pedigree structure, affection status, and genotype information. A phenotype file defines any other traits for each subject. Both files are in text format with one line for each individual; variables are separated by a blank space or a tab. Be aware that while a continuous run of blank spaces is regarded as a single separator, each tab is treated as a separator. So there will be k+1 fields for an entry with k tabs. Generally, blank spaces as field separators are recommended.
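
The separator rules can be illustrated with a small Python sketch.  The function split_fields is hypothetical and only mimics the behavior described above; in particular, it assumes that a line uses either blank spaces or tabs, because the treatment of lines that mix the two is not specified here.

```python
def split_fields(line):
    """Split one input line into fields following the separator rules above.

    Assumption: a line uses either spaces or tabs, not a mixture; the manual
    does not say how mixed separators are handled.
    """
    line = line.rstrip("\r\n")
    if "\t" in line:
        return line.split("\t")   # every tab separates; empty fields are kept
    return line.split()           # runs of blanks collapse into one separator

print(split_fields("2000 200002     8   9"))   # 4 fields
print(split_fields("2000\t200002\t\t9"))       # 3 tabs -> 4 fields, one empty
```

The second call shows why blank spaces are the safer choice: two adjacent tabs silently create an empty field.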

 

3.1 Format for Pedigree File

First line:          names of all markers in the sequence of the genotype data

Remaining lines:  pid id fid mid sex aff A11 A12 A21 A22 ...

 

pid:    pedigree ID

id:     individual ID

fid:    father ID. Use 0 (zero) for founders or marry-ins (parents not specified) in a pedigree

mid:    mother ID. Use 0 (zero) for founders or marry-ins (parents not specified) in a pedigree

sex:    1 = male, 2 = female

aff:    affection status. 2 = affected, 1 = unaffected, 0 = unknown

Aij:    allele j of marker i (j=1,2; i=1,2,...). Alleles are represented by positive integers. Use 0 (zero) for missing alleles.

 

All IDs and marker names are strings of any characters other than blank space, tab, newline, and carriage return. The maximum lengths for IDs and marker names are 11 and 15 characters, respectively. A maximum of 39 alleles is allowed for each marker.
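
The pedigree line layout can be checked before loading with a short Python sketch such as the one below.  It is an illustration under stated assumptions, not the parser used by FBAT: the function name is hypothetical, fields are assumed to be separated by blank spaces, and rejecting sex codes other than 1 and 2 is an assumption based on the list above.  The sample line is taken from the test.ped excerpt in section 5.1.

```python
def parse_pedigree_line(line, marker_names):
    """Parse one individual line of a pedigree file into a dict.

    Raises ValueError when the line violates a rule stated in section 3.1.
    """
    fields = line.split()
    expected = 6 + 2 * len(marker_names)   # 6 fixed columns + 2 alleles per marker
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, found {len(fields)}")
    pid, ind_id, fid, mid = fields[:4]
    for value in (pid, ind_id, fid, mid):
        if len(value) > 11:
            raise ValueError(f"ID longer than 11 characters: {value}")
    sex, aff = int(fields[4]), int(fields[5])
    if sex not in (1, 2):                  # assumption: no other sex codes
        raise ValueError(f"sex must be 1 or 2, found {sex}")
    if aff not in (0, 1, 2):
        raise ValueError(f"affection status must be 0, 1 or 2, found {aff}")
    alleles = [int(a) for a in fields[6:]]
    if any(a < 0 for a in alleles):
        raise ValueError("alleles must be positive integers, or 0 if missing")
    genotypes = {name: (alleles[2 * i], alleles[2 * i + 1])
                 for i, name in enumerate(marker_names)}
    return {"pid": pid, "id": ind_id, "fid": fid, "mid": mid,
            "sex": sex, "aff": aff, "genotypes": genotypes}

markers = "a2m apoe".split()               # first line of test.ped
print(parse_pedigree_line("501051 10070 90020 90019 1 2 1 1 3 4", markers))
```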

 

3.2 Format for Phenotype File

First line:                names of all traits in the phenotype file       

                         

Remaining lines:  pid id trait_1 trait_2 ...

 

Use a single hyphen ("-") for missing traits. The order of the subject entries is not important. The set of individuals defined in the phenotype file need not be the same as that in the pedigree file (e.g. you may omit all parents in the phenotype file). However, for each individual appearing in both files, the pedigree and individual IDs must be consistent.  Data on any individuals in the phenotype file who do not appear in the pedigree file will be ignored.
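
A phenotype file in this format can be read with a few lines of Python.  The sketch below is illustrative only: the function name is hypothetical, fields are assumed to be separated by blank spaces, and the sample text is adapted from the adh.phe excerpt in section 5.2, with two values replaced by "-" to show the missing-value handling.

```python
def parse_phenotype_file(text):
    """Parse phenotype-file text into {(pid, id): {trait_name: value or None}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    trait_names = lines[0].split()
    table = {}
    for ln in lines[1:]:
        fields = ln.split()
        pid, ind_id, values = fields[0], fields[1], fields[2:]
        if len(values) != len(trait_names):
            raise ValueError(f"wrong number of traits for {pid} {ind_id}")
        # A single hyphen marks a missing trait value
        table[(pid, ind_id)] = {
            name: None if v == "-" else float(v)
            for name, v in zip(trait_names, values)
        }
    return table

# Adapted sample: the "-" entries are inserted here for illustration
sample = """v1 v2 centerv1 centerv2
2000 200002 8 9 -4.8 -10.9
2000 200006 8 - -4.8 -
"""
for key, traits in parse_phenotype_file(sample).items():
    print(key, traits)
```

Note that negative values such as -4.8 are not confused with the missing-value code, since only a field consisting of a single hyphen is treated as missing.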

 

4. COMMANDS

The general syntax for every command used in this program is

command [option1,option2,...]... arguments...,

where [option1,option2,...]... are optional arguments. In the descriptions given below, all acceptable options for an argument are listed within brackets and separated by commas.

Commands and options are case sensitive. A partial command name may be used to specify the command as long as it is unambiguous.
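
The partial-name rule can be pictured with the following Python sketch.  It is not FBAT's own implementation: the resolve_command function is hypothetical, the command list is taken from the table of contents, and giving an exact match priority over longer commands with the same prefix is an assumption.

```python
COMMANDS = ["?", "afreq", "displayp", "fbat", "genotype", "load", "log",
            "minsize", "mode", "model", "quit", "sdt", "setafftrait",
            "trait", "viewmarker", "viewstat"]

def resolve_command(name):
    """Return the full command for an exact or unambiguous partial name, else None.

    Assumption: an exact match (e.g. "mode") takes priority over longer
    commands sharing the prefix (e.g. "model"); comparison is case sensitive.
    """
    if name in COMMANDS:
        return name
    matches = [c for c in COMMANDS if c.startswith(name)]
    return matches[0] if len(matches) == 1 else None

for typed in ("q", "gen", "lo", "view", "mode", "FBAT"):
    print(typed, "->", resolve_command(typed))
```

Under these assumptions "q" and "gen" resolve to quit and genotype, while "lo" (load or log) and "view" (viewmarker or viewstat) are ambiguous and "FBAT" fails because of case sensitivity.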

 

4.1 ? [command]

Display a one-line description for the specified command. If no command is specified, a listing and descriptions of all available commands is given.

 

4.2 afreq [marker...]

Output a sample estimate of allele frequencies for the specified marker. If no marker is specified, allele frequencies for all the markers are produced. The allele frequencies are estimated using the founder genotype data only. An EM algorithm is used to incorporate families with incomplete founder genotype data.  Note:  The estimated allele frequencies are not used by fbat.

 

4.3 displayp [p_value]

Selectively display the test results with p-value equal to or less than the specified p_value. The default p_value is 1.0 (display all results). If no p_value argument is specified, the current p_value is given.

 

4.4 fbat [-e,-o] [marker...]

Compute the test statistic(s) and p-value(s) for the specified marker (or all markers if no marker is specified) using the current trait, test mode (bi-allelic or multi-allelic), and association model.

 

Option -e: compute the test statistic using the empirical variance, as described in Lake, Blacker and Laird (2001).  This option should be used when testing for association in an area of known linkage and data from multiple sibs in a family or multiple families in a pedigree are used.

 

Option -o: use T = (current trait - m) in the test construction.  The value of m is chosen to minimize the variance of the test statistic. Subjects with phenotype unknown (coded '-' in the phenotype file) have T set to zero.  This option works for both quantitative and qualitative traits.  Note that when this option is used for qualitative traits, phenotype data for both affected and unaffected offspring are required.  If the data set contains only affected offspring, or offspring with only a narrow range of values in the phenotype file, this option should not be used.

 

The -o option can only be used with the bi-allelic mode.

 

Output of the fbat command: When using the bi-allelic mode (see 4.9) of testing, the results are displayed with one line for each marker allele using the format:

 

   marker_name  allele_name  afreq  fam#  S  E(S)  Var(S)  Z  P

 

where afreq is the allele frequency; fam# is the number of nuclear families informative for that allele; S is the test statistic; E(S) and Var(S) are the expected value and variance of the test statistic under H0; Z and P are the Z statistic (S normalized using E(S) and Var(S)) and its corresponding two-sided p-value (P(|X|>|Z|), where X is N(0,1)).

 

When using the multi-allelic mode of testing, the result is displayed with one line for each marker using the format:

 

   marker_name  number_of_alleles  degrees_of_freedom  chi-square  p_value.

 

When using -o, an additional column is given for the value of the optimizing nuisance parameter m.

4.5 genotype pedigree-id marker1 [marker2...]

Display the raw genotype data for the specified pedigree and marker(s).

Output format:

 

   Id  father-id  mother-id  allele1_1  allele1_2  [allele2_1  allele2_2...]

 

If father-id and mother-id refer to founders or marry-ins, their IDs will be set to zero in the display.

 

4.6 load [ped,phe] file_name

Read in data from a pedigree or a phenotype (trait) file. The file name can be either an absolute path name or a relative path name from your current directory.  The options [ped,phe] are not necessary when the specified file_name ends with a corresponding extension (.ped for a pedigree file, .phe for a phenotype file).

 

4.7 log [log_file,on,off]

Start logging all inputs and outputs into log_file or toggle the logging status. A log_file must be specified before you can toggle the logging status. If no option is specified, the current logging status is displayed. You can save only particular parts of a session by toggling the logging status.

 

4.8 minsize [size_value]

Specify the minimum number of informative families necessary to compute the test statistics. The default value is 10. If size_value is not specified, the current value is displayed.

 

4.9 mode [b,m]

Specify bi-allelic (b) or multi-allelic (m) tests. The default is bi-allelic. If no option is specified, the current mode is displayed.

 

4.10 model [a,d,r,g]

Specify the association model to be additive (a), dominant (d), recessive (r) or genotype(g). The default is the additive model. If no option is specified, the current model is displayed.  For a description of the marker scoring schemes under these models, see Schaid (1996).  The genotype coding essentially constructs a multiallelic marker, with each possible genotype corresponding to an allele.

 

4.11 quit

Exit the program.

 

4.12 sdt [marker]

Compute the SDT test statistic(s) and p-value(s) for the specified marker (or all markers if no marker is specified) using the current trait and test mode.

 

When using the bi-allelic mode of testing, the results are displayed as one line for each marker allele using the format:

 

   marker_name   allele_name  B  C  p-exact

 

where B and C are counts of families with a higher (B) or lower (C)  frequency of the specified allele among affected offspring compared to unaffected offspring in the family.  The p-value is computed exactly using the Binomial distribution.
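
For intuition, an exact binomial p-value for the counts B and C can be computed as in the Python sketch below.  This is not taken from the FBAT source: it assumes a two-sided sign test with success probability 1/2, with the two-sided value defined as twice the smaller tail (capped at 1).  The counts in the example are hypothetical.

```python
from math import comb

def sign_test_p(b, c):
    """Two-sided exact binomial p-value for B vs C with success probability 1/2.

    Assumption: the two-sided value is taken as twice the smaller tail,
    capped at 1; the exact convention used by FBAT is not stated here.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(sign_test_p(8, 2))   # hypothetical counts; prints 0.109375
```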

 

When using the multi-allelic mode of testing, the result is displayed in one line for each marker using the format:

 

   marker_name   degrees_of_freedom   chi-square   p-value

 

where the p-value uses the asymptotic chi-square distribution.

 

4.13 setafftrait aff_t unaff_t unknown_t

Set trait values for affected (aff_t), unaffected (unaff_t), and subjects with unknown affection status (unknown_t).  Here affected, unaffected, and trait unknown offspring are defined using the affection status variable from the pedigree file:

affection status 2 means affected

affection status 1 means unaffected

affection status 0 means unknown

The trait value of "unknown" should always remain zero; changing the values for affected and unaffected allows unaffected subjects to contribute to the analysis.  The default values are (1,0,0; see 4.14 below).  See Lunetta et al. (2000).

4.14 trait [trait_name]

Specify the trait to use for computing the test statistics.  Specifying more than one trait will provide a multivariate test with multiple degrees of freedom (Lange et al., 2002). If no trait is specified, all available traits are displayed with the current trait denoted by **.  The default trait uses the affection status variable defined in the pedigree file, recoded as 1 if affection status is 2 (affected), and zero otherwise (affection status is unaffected or unknown).  The name of the default trait is affection.

4.15 viewmarker marker [pedigree_id] 

Displays detailed information about the marker genotype distribution (under H0: no linkage) among offspring in each nuclear family of the named pedigree.  If pedigree_id is not specified, the marker distributions are displayed for all nuclear families in all pedigrees.  These distributions are obtained via a table lookup using Tables 1-3 of Rabinowitz and Laird (2000).  If the family is not informative, the marker genotype distribution has probability 1 for the observed data and the output for that family is suppressed.

 

Output format:

First line:

          **** ped_id {fgt,mgt} ==> {ogt..}

where fgt and mgt are the observed genotypes (in letters) for father and mother in the nuclear family, and ogt.. lists the set of observed genotypes among the offspring, one entry for each unique genotype. If either parent is missing, the genotype entry is replaced by a blank. The genotype is expressed generically using the letters A, B, C or D.  The letters indicate how the nuclear family is mapped to one of the corresponding entries in Tables 1-3, except when genotype data are available for both parents.  Tables 1-3 omit this case, since here the usual Mendelian rules are used to determine genotype probabilities for the offspring.

Remaining lines:

      The sample space of possible offspring genotypes, the probability of 

      each point in the sample space, and the joint probability of each 

      possible pair of genotypes.

 

4.16 viewstat [-s] [-e,-o] marker

Displays detailed information about the FBAT statistics for the specified marker, including S, E(S) and Var(S). By default, information on the statistics is displayed separately for each informative family, followed by a summary over all the families. In viewstat, S will be displayed as a vector, with one entry for each allele; hence E(S) is also a vector, and Var(S) is a matrix, unless -o is used.

 

Option "-s" suppresses the family specific information and displays only the summary part. Option "-e" computes and displays the variance of S using the empirical variance.  Option "-o" displays S, E(S) and Var(S) when the trait is determined using the -o option.  Since the value of m is different for each allele, only the scalar values of Var(S) for each allele are displayed. Only one of the -e and -o options can be used at a time.

 

 

5. GETTING STARTED IN FBAT

 

5.1 Loading the Pedigree File and Running FBAT

To get started, the user must call the program and load a pedigree data file.  This file follows the standard pedigree file format as used by genehunter or mapmaker. See section 3 for details.  For example, the first few lines of a data file named test.ped are:

 

a2m apoe

501051 10070 90020 90019 1 2 1 1 3 4

501051 10137 90020 90019 1 1 1 1 3 4

501151 10018 90039 90040 2 2 1 1 3 4

501151 10031 90039 90040 2 2 1 2 4 4

501151 10196 90039 90040 2 1 1 2 4 4

501151 10237 90039 90040 1 1 1 1 3 4

 

None of the families in this data set have parental genotype information; all of them have both affected and unaffected offspring.  To read this data file (which resides in the same directory as fbat), use the load command:

 

>> load test.ped

 

read in: 2 markers from 104 pedigrees (104 nuclear families,375 persons)

 

Once you have read in a pedigree file successfully, you may use the default settings to get a TDT-type test statistic.  The relevant default settings are:

Command     Default

displayp    1.0 (all test results are printed out regardless of p-value)

minsize     10 informative families are required

model       additive

mode        bi-allelic (test one allele against all others at a marker)

trait       dichotomous affection status

You obtain your test statistics by simply typing:

>> fbat

trait affection; model additive; test bi-allelic; minsize 10, p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

a2m          1  0.829   44    112.000    126.133        14.900  -3.661 0.000251

a2m          2  0.171   44     64.000     49.867        14.900   3.661 0.000251

apoe         2  0.040   11      6.000     12.418         3.298  -3.534 0.000410

apoe         3  0.547   68    130.000    147.334        30.133  -3.158 0.001589

apoe         4  0.412   67    160.000    136.248        31.116   4.258 0.000021

Total number of test(s): 5

Since we are using bi-allelic tests, the program gives a line of output for each allele, two for the a2m gene and three for the apoe gene.  Because there are only two alleles for a2m, the Z statistics are identical apart from sign.
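
When the output has been captured in a log file, the result rows can be post-processed with a script.  The Python sketch below is one hypothetical way to do so and is not part of FBAT; it assumes the bi-allelic column layout shown above (Marker, Allele, afreq, fam#, S, E(S), Var(S), Z, P) and treats a row ending in ***** as a test that was not computed.

```python
def parse_fbat_rows(text):
    """Parse bi-allelic fbat result rows into dicts; skipped tests get None stats."""
    keys = ("S", "E(S)", "Var(S)", "Z", "P")
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Result rows have 4 leading columns followed by 5 statistics or "*****"
        if len(fields) == 9:
            try:
                stats = dict(zip(keys, map(float, fields[4:])))
            except ValueError:
                continue          # header or other non-numeric line
        elif len(fields) == 5 and fields[4] == "*****":
            stats = dict.fromkeys(keys)
        else:
            continue
        rows.append({"marker": fields[0], "allele": fields[1],
                     "afreq": float(fields[2]), "fam": int(fields[3]), **stats})
    return rows

log_text = """Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P
a2m          1  0.829   44    112.000    126.133        14.900  -3.661 0.000251
apoe         3  0.547   68    130.000    147.334        30.133  -3.158 0.001589
Total number of test(s): 5
"""
# Keep rows with P <= 0.001, similar in spirit to the displayp setting
for row in parse_fbat_rows(log_text):
    if row["P"] is not None and row["P"] <= 0.001:
        print(row["marker"], row["allele"], row["Z"], row["P"])
```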

 

The empirical variance option is useful when the marker in question is known to be linked to a gene underlying the trait.  In this case the null hypothesis is H0: no association.  In most instances the test results will be similar, unless there are many nuclear families which arise from a single pedigree.  Whenever the -e option is used with fbat, the empirical variance is used to compute the test statistic:

 

>> fbat -e

trait affection; model additive; test bi-allelic; minsize 10; p 1.00000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

a2m          1  0.829   38     94.000    108.133        15.235  -3.621 0.000294

a2m          2  0.171   38     56.000     41.867        15.235   3.621 0.000294

apoe         2  0.040   10      5.000     11.418         4.912  -2.896 0.003782

apoe         3  0.547   59    107.000    124.334        28.858  -3.227 0.001252

apoe         4  0.412   57    140.000    116.248        32.232   4.184 0.000029

Total number of test(s): 5

 

Note that the results using the empirical variance are slightly more conservative, but do not differ markedly.  This is what we would generally expect, unless there are a few very large pedigrees that contribute most of the informative nuclear families.

5.2 More Advanced Commands

 

Here we consider a few more advanced commands: mode, model and loading phenotype files.

MODE   To perform a multi-allelic test, in which all alleles are tested simultaneously, use the mode command, followed by the fbat command as usual:

 

>> mode m

current mode multi-allelic

>> fbat

trait affection; model additive; test multi-allelic; minsize 10; p 1.000000

Marker   Allele#    DF       CHISQ           P

a2m            2     1   13.405701    0.000251

apoe           3     2   25.392769    0.000003

Total number of test(s): 2

 

When using the additive model, the degrees of freedom of the chi-square is one less than the number of alleles.  With other models the degrees-of-freedom is generally equal to the number of alleles.  Note that the multi-allelic test value for any bi-allelic marker is the same as the square of either of the two Z values computed in the bi-allelic mode, i.e., the tests are equivalent.  If any allele has fewer than minsize informative families, all offspring with that allele are excluded from the test.
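
The p-values in the multi-allelic output can be reproduced from the chi-square value and its degrees of freedom.  The Python sketch below is an independent check, not FBAT code: the function chi2_sf is hypothetical and evaluates the upper tail of the chi-square distribution for integer degrees of freedom using the standard recurrence for the regularized incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    else:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q

# CHISQ and DF values from the multi-allelic output above
print(f"{chi2_sf(13.405701, 1):.6f}")   # a2m,  DF 1
print(f"{chi2_sf(25.392769, 2):.6f}")   # apoe, DF 2
```

For a2m the value 13.405701 is the square of the bi-allelic Z statistic (3.661), so the one degree-of-freedom chi-square p-value coincides with the two-sided normal p-value.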

 

To change back to bi-allelic mode, use "mode b".

MODEL  By default, FBAT uses an additive model, so that the statistics count the number of copies of an allele that each offspring has.  We can switch to a dominant, recessive or genotype model using the model command.  For example, with the dominant model, we count the number of offspring with any copy of a particular allele.  For the test.ped data, we find:

 

>> model d

current genetic model is dominant

>> fbat

trait affection; model dominant; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

a2m          1  0.829    9     *****

a2m          2  0.171   41     49.000     37.650        10.744   3.463 0.000535

apoe         2  0.040   10      4.000     10.084         3.076  -3.469 0.000522

apoe         3  0.547   38     36.000     45.501        11.199  -2.839 0.004523

apoe         4  0.412   46     69.000     56.993        13.969   3.212 0.001316

Total number of test(s): 4

Notice that no test has been computed for the a2m 1 allele because there are fewer than 10 informative families for the dominant model.

Unlike the additive model, when using multi-allelic tests and the dominant (or recessive) model, the degrees-of-freedom will generally be equal to the number of alleles when all alleles have sufficient informative families:

 

>> model d

current genetic model is dominant

>> fbat apoe

trait affection; model dominant; test multi-allelic; minsize 10; p 1.000000

Marker   Allele#    DF       CHISQ           P

apoe           3     3   23.555276    0.000031

Total number of test(s): 1

 

The genotype model compares observed to expected numbers of genotypes for each possible combination.  Continuing with test.ped,

 

>> model g

current genetic model is genotype

>> mode b

current test mode is bi-allelic

>> fbat

trait affection; model genotype; test bi-allelic; minsize 10; p 1.000000

Marker    Allele  afreq  fam#        S       E(S)        Var(S)       Z        P

a2m          1/1  0.658   41     33.000     44.350        10.744  -3.463 0.000535

a2m          1/2  0.342   44     46.000     37.433        11.590   2.517 0.011852

a2m          2/2  0.000    9      *****

apoe         2/2  0.006    1      *****

apoe         2/3  0.036    9      *****

apoe         3/3  0.201   46     32.000     39.833        13.557  -2.127 0.033384

apoe         2/4  0.034    4      *****

apoe         3/4  0.656   64     64.000     61.828        18.673   0.503 0.615230

apoe         4/4  0.067   39     48.000     36.254        11.320   3.491 0.000481

Total number of test(s): 5

 

In this data set, only three genotypes of apoe (3/3, 3/4 and 4/4) have at least 10 informative families.  In multi-allelic mode, all offspring with any other genotype are excluded from the chi-square test:

 

>> mode m

current test mode is multi-allelic

>> fbat

trait affection; model genotype; test multi-allelic; minsize 10; p 1.000000

Marker  Allele  DF    CHISQ           P

a2m            2     2   13.485486    0.001179

apoe           3     3   25.258375    0.000014

Total number of test(s): 2
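To summarize how the four models differ, the sketch below shows one plausible way to turn an offspring genotype into the marker score X that enters the statistic.  It is illustrative Python rather than FBAT code: the function names are invented, the additive and dominant rules follow the description in this section, and the recessive rule (score 1 only with two copies) is the usual convention and is assumed here.

```python
def marker_score(genotype, allele, model="a"):
    """Score X for one offspring genotype (a pair of alleles) with respect to `allele`."""
    copies = sum(1 for a in genotype if a == allele)
    if model == "a":        # additive: number of copies (0, 1 or 2)
        return copies
    if model == "d":        # dominant: 1 if at least one copy is present
        return int(copies >= 1)
    if model == "r":        # recessive (assumed convention): 1 only with two copies
        return int(copies == 2)
    raise ValueError(f"unknown model: {model!r}")

def genotype_score(genotype, target):
    """Genotype model: 1 if the unordered genotype equals `target`, else 0."""
    return int(sorted(genotype) == sorted(target))

# An apoe 3/4 offspring scored for allele 4 under each model.
for model in ("a", "d", "r"):
    print(model, marker_score((3, 4), 4, model))
print("g", genotype_score((4, 3), (3, 4)))
```

Under the genotype model each distinct genotype gets its own 0/1 score, which is why the bi-allelic output lists one row per genotype rather than one row per allele.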

 

 

 

LOADING A PHENOTYPE FILE  To use a quantitative trait, we use another data set that has measured scores on two quantitative tests and genotype data on 5 markers.  First we must load in the pedigree file, and then the phenotype file.  The first few lines of each file are:

 

Pedigree file (adh.ped):

DAT DRD4 DRD2 DBH NAR

2000 200003 0 0 1 1 0 0 4 4 77 77 256 260 115 115

2000 200002 0 0 2 1 480 440 2 4 77 79 256 276 113 113

2000 200006 200003 200002 2 1 0 0 4 4 77 77 256 260 113 115

2000 200007 200003 200002 2 1 440 440 4 4 77 77 256 276 113 115

2000 200004 200003 200002 2 1 440 440 4 4 77 77 260 276 113 115

2001 200103 0 0 1 2 480 480 4 4 79 79 258 258 113 113

2001 200102 0 0 2 1 480 440 4 5 77 79 260 274 113 113

2001 200104 200103 200102 2 2 480 480 4 5 77 79 258 260 113 113

2001 200101 200103 200102 2 2 480 480 4 5 79 79 258 260 113 113

 

Phenotype file (adh.phe):

     v1 v2 centerv1 centerv2

     2000     200002          8          9       -4.8      -10.9

     2000     200006          8         11       -4.8       -8.9

     2000     200003          8         17       -4.8       -2.9

     2000     200007          8          9       -4.8      -10.9

     2000     200004          8          9       -4.8      -10.9

     2001     200103         13         21         .2        1.1

     2001     200102          8          9       -4.8      -10.9

     2001     200104         10         19       -2.8        -.9

     2001     200101         13         13         .2       -6.9
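For readers who want to inspect such files outside FBAT, the following Python sketch parses one pedigree-file row.  It assumes the column order described in Section 3.1 (pedigree id, individual id, father id, mother id, sex, affection status, then two alleles per marker) and a first line that lists the marker names; parse_ped_line is a hypothetical helper, not an FBAT function.

```python
def parse_ped_line(line, markers):
    """Split one pedigree-file row into ids, sex, affection and per-marker genotypes."""
    fields = line.split()
    expected = 6 + 2 * len(markers)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    ped, ind, father, mother, sex, affection = fields[:6]
    alleles = fields[6:]
    # Two consecutive allele columns per marker, in marker-header order.
    genotypes = {m: (alleles[2 * i], alleles[2 * i + 1]) for i, m in enumerate(markers)}
    return {
        "pedigree": ped, "id": ind, "father": father, "mother": mother,
        "sex": sex, "affection": affection, "genotypes": genotypes,
    }

markers = "DAT DRD4 DRD2 DBH NAR".split()
row = parse_ped_line("2000 200002 0 0 2 1 480 440 2 4 77 79 256 276 113 113", markers)
print(row["id"], row["genotypes"]["DRD2"])
```

Alleles are kept as strings so that codes such as 0 (missing) are passed through unchanged.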

 

The trait file has two variables, v1 and v2; the last two columns of this file contain mean centered versions of these variables.  We enter the data using the load command (must load adh.ped first).

 

>> load adh.ped

read in: 5 markers from 44 pedigrees (43 nuclear families,166 persons)

mendelian error: locus DBH, pedigree 2003 [200303,200302]

mendelian error: locus DRD2, pedigree 2016 [201603,201602]

mendelian error: locus DRD2, pedigree 2057 [205703,205702]

mendelian error: locus DBH, pedigree 2057 [205703,205702]

mendelian error: locus DRD2, pedigree 2059 [205903,205902]

A total of 5 mendelian errors have been found

genotypes of families with mendelian error have been reset to 0

 

>> load adh.phe

4 quantitative traits have been successfully read

166 persons have been phenotyped

 

Notice that FBAT identified Mendelian inconsistencies at one or two loci in each of 4 pedigrees (2003, 2016, 2057 and 2059).  For each marker with an inconsistency, the genotypes of all members of that family have been set to zero.
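The kind of check that produces these messages can be sketched for the simplest case of one child with two genotyped parents.  The Python below is only an illustration under that assumption; FBAT's own check works on whole families and may differ.  The function name is made up, and 0 is treated as the missing-allele code.

```python
def is_mendelian_consistent(child, father, mother):
    """True if `child` can be formed from one paternal and one maternal allele.

    Each genotype is a pair of allele codes; any genotype containing 0 is
    treated as missing and is not checked.
    """
    if 0 in child or 0 in father or 0 in mother:
        return True
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)

# DRD2 in pedigree 2000: parents 77/77 and 77/79, child 77/77.
print(is_mendelian_consistent((77, 77), (77, 77), (77, 79)))
# A 79/79 child would need a 79 allele from each parent, which 77/77 cannot supply.
print(is_mendelian_consistent((79, 79), (77, 77), (77, 79)))
```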

 

 

The trait command tells us which traits are now available and which one (marked with **) is in use.

 

>> trait

affection** v1 v2 centerv1 centerv2

 

We change to another trait using the trait command:

 

>> trait centerv2

affection v1 v2 centerv1 centerv2**

 

>> fbat DAT DRD4

trait centerv2; model additive; test by allele; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

DAT        440  0.272   22    -57.400    -22.450       455.107  -1.638 0.101362

DAT        480  0.717   23    -45.700    -89.700       483.015   2.002 0.045281

DAT        520  0.011    2     ****

DRD4         2  0.091   14      7.400     -0.700       255.865   0.506 0.612587

DRD4         3  0.051    6     *****

DRD4         4  0.652   23   -100.700    -83.200       513.540  -0.772 0.439974

DRD4         5  0.023    4     *****

DRD4         6  0.011    0     *****

DRD4         7  0.154   18     -0.900    -17.100       224.990   1.080 0.280131

DRD4         8  0.017    3     *****

 

In this case, using the additive model and a mean-centered quantitative trait gives a test identical to that proposed by Rabinowitz (1997) when all parental genotype data are present.  Here v2 has been centered by subtracting the sample mean of v2 from each subject's value, so the trait values are residuals; this explains the negative values.  One could also adjust for other covariates thought to influence v2, such as age and sex, and use the residuals from the regression equation.  Alternatively, we may center the trait using external norms or by using the -o option of fbat.
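Mean centering of the kind used to build centerv1 and centerv2 can be done before the phenotype file is written.  The Python sketch below is an illustration only: center is a hypothetical helper, None stands in for a missing value, and the short list of values is made up rather than taken from adh.phe.

```python
def center(values):
    """Subtract the mean of the non-missing values; missing entries (None) stay missing."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    centered = [None if v is None else v - mean for v in values]
    return centered, mean

# Made-up trait values, one of them missing.
centered, mean = center([9.0, 11.0, None, 17.0, 23.0])
print(mean, centered)
```

In the adh.phe listing, a v2 of 9 is paired with a centerv2 of -10.9, which suggests that the sample mean of v2 in the full file is about 19.9.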

 

MULTIVARIATE TESTS

 

FBAT can also test multiple traits simultaneously using a multivariate version of the test statistic.  To use this option, list multiple traits on the trait command line.  To illustrate, we use data from an asthma study that looks at a polymorphism in IL13.  The sample contains only asthmatic offspring and their parents, but there is interest in a large number of phenotypes that characterize asthma severity and symptoms.

 

 

>> load il13.ped

read in: 1 markers from 635 pedigrees (636 nuclear families,1963 persons)

>> load fbat.phe

22 quantitative traits have been successfully read

666 persons have been phenotyped

warning: 4 persons are not in any pedigrees

>> trait

affection** toteos logige npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean

>> trait toteos

affection toteos** logige npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean

>> fbat

trait toteos; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

il13        30  0.776  313     -5.963     21.662       103.881  -2.710 0.006721

il13        31  0.224  313     52.428     24.803       103.881   2.710 0.006721

Total number of test(s): 2

>> trait logige

affection toteos logige** npos sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean

>> fbat

trait logige; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

il13        30  0.776  314    -10.594      8.968       100.687  -1.950 0.051235

il13        31  0.224  314     44.623     25.061       100.687   1.950 0.051235

Total number of test(s): 2

>> trait npos

affection toteos logige npos** sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean

>> fbat

trait npos; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

il13        30  0.776  313     37.236     32.767        99.805   0.447 0.654615

il13        31  0.224  313     21.167     25.636        99.805  -0.447 0.654615

Total number of test(s): 2

>> trait  toteos logige npos

affection toteos** logige** npos** sxcmean limactiv AH_30 AH_21 AH_22 AH_23 posfevpp posfvcpp ampfmean pmpfmean dpfmean lnpc20 VARMean bdabs fevbd bdpred pfbd sxamean examean

>> fbat

3 traits selected: toteos logige npos

model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#       DF       CHISQ           P

il13        30  0.776  314        3      12.209    0.006702

il13        31  0.224  314        3      12.209    0.006702

>> quit

 

In this example, the test including all three endpoints has virtually the same p-value as the smallest of the three p-values observed in the univariate tests.  If any subject is missing a phenotype that is included in the variable list, that subject is excluded from the test.
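The exclusion rule in the last sentence amounts to a complete-case filter over the selected traits.  A minimal Python sketch follows; the subject ids and values are invented, None is used as a stand-in for a missing phenotype, and complete_cases is a hypothetical helper rather than anything FBAT exposes.

```python
def complete_cases(phenotypes, traits):
    """Keep only subjects with a non-missing value for every trait in `traits`."""
    return {
        subject: record
        for subject, record in phenotypes.items()
        if all(record.get(t) is not None for t in traits)
    }

# Invented records; subject "s2" lacks logige and would be dropped from a joint test.
phenotypes = {
    "s1": {"toteos": 2.1, "logige": 5.3, "npos": 4},
    "s2": {"toteos": 1.7, "logige": None, "npos": 2},
    "s3": {"toteos": 2.9, "logige": 6.0, "npos": 0},
}
kept = complete_cases(phenotypes, ["toteos", "logige", "npos"])
print(sorted(kept))
```

This is one reason the fam# column of a multivariate run can differ from the univariate runs on the same traits.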

 

THE -o OPTION

With -o, the value (offset) subtracted from each trait is chosen to minimize the variance of the test statistic.  The following illustrates the use of -o with v2:

 

>> trait v2

affection v1 v2** centerv1 centerv2

>> fbat DAT

trait v2; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z       P

DAT        440  0.272   22    460.000    485.000      4158.000  -0.388   0.698237

DAT        480  0.717   23   1009.000    965.000      4565.500   0.6510  0.514923 

DAT        520  0.011    2      *****

>> fbat -o DAT

trait v2; model additive; test bi-allelic; minsize 10; p 1.000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P      Offset

DAT        440  0.272   22    -23.721     10.581       437.070  -1.641  0.100845     18.604651

DAT        480  0.717   23      8.913    -35.087       470.804   2.028  0.042577     18.869565

DAT        520  0.011    2      *****

 

Notice that the offset differs between alleles because it is estimated separately for each allele; in each case it is close to the mean of v2, so the result of using -o is similar to that of using centerv2.  Using the uncentered v2 gives a very different result that is not meaningful, since there are many unaffected offspring in the data set and v2 has a wide range.
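The effect of an offset on S can be worked out by hand.  Because S sums T times X over offspring, replacing T by T minus an offset changes S to S minus the offset times the sum of X.  The Python sketch below back-solves the sum of X for DAT allele 440 from the two printed S values and then predicts S under mean centering.  It assumes that the same offspring contribute in both runs and that the sample mean of v2 is 19.9, a value inferred from the adh.phe listing (v2 = 9 corresponds to centerv2 = -10.9).

```python
def shifted_statistic(s_raw, sum_x, offset):
    """S computed with trait T - offset, given S for the raw trait and the sum of X."""
    return s_raw - offset * sum_x

# DAT allele 440: S with raw v2, and S and Offset from the -o run, as printed above.
s_raw, s_offset, offset = 460.000, -23.721, 18.604651
sum_x = (s_raw - s_offset) / offset               # back-solved sum of marker scores
s_centered = shifted_statistic(s_raw, sum_x, 19.9)
print(round(sum_x, 3), round(s_centered, 3))
```

The back-solved sum should come out very close to 26, and the predicted S close to -57.4, which is the value FBAT reported for this allele with the centerv2 trait.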

 

The -o option can also be used with qualitative data.  First we view the results of using affection status in this data set:

 

>> fbat DAT

trait affection; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

DAT        440  0.272   16     12.000     14.500         6.250  -1.000 0.317311

DAT        480  0.717   16     30.000     26.500         6.750   1.347 0.177932

DAT        520  0.011    1      *****

 

Using fbat with affection status and -o gives:

 

>> fbat -o

trait affection; model additive; test bi-allelic; minsize 10; p 1.000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P        Offset

DAT        440  0.272   21     -2.535      0.256         2.616  -1.725 0.084469      0.581395

DAT        480  0.717   22      0.652     -2.848         2.788   2.096 0.036071      0.586957

DAT        520  0.011    2      *****

 

 

 

There are now more informative families in the data set, since this data set includes families with only unaffected offspring, as well as families with both affected and unaffected offspring.

 

USING THE SETAFFTRAIT COMMAND.  We can use this command to set values of T to correspond to affected, unaffected, and unknown phenotype.  We use this to see what happens if we use only the unaffected subjects in the data set:

 

>> setafftrait 0 1 0

conver affection status to trait 0.000000(aff), 1.000000(unaff), 0.000000(unknown)

>> fbat DAT

trait affection; model additive; test bi-allelic; minsize 10; p 1.000000

Marker  Allele  afreq fam#          S       E(S)        Var(S)       Z        P

DAT        440  0.272   12     13.000     10.000         4.500   1.414 0.157299

DAT        480  0.717   13     20.000     23.500         4.750  -1.606 0.108294

DAT        520  0.011    1      *****

 

 

 

In this data set, using either the affected or the unaffected offspring alone gives no indication of significance, but using both suggests that allele 480 of DAT may be positively associated with affection status.
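The recoding performed by setafftrait can be pictured as a simple lookup from affection status to the trait value T.  The Python sketch below is illustrative only: it assumes the pedigree-file codes 2 (affected), 1 (unaffected) and 0 (unknown) described in Section 3.1, and the function name is invented.

```python
def affection_to_trait(status, aff_t, unaff_t, unknown_t):
    """Map a pedigree-file affection code to the trait value T."""
    mapping = {2: aff_t, 1: unaff_t, 0: unknown_t}   # assumed coding: 2/1/0
    return mapping[status]

# Equivalent of "setafftrait 0 1 0": only unaffected offspring get a nonzero T.
statuses = [2, 1, 1, 0, 2]
print([affection_to_trait(s, 0, 1, 0) for s in statuses])
```

Since each offspring contributes T times X to S, subjects mapped to T = 0 add nothing to the statistic, which is how this command restricts the test to the unaffected offspring.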

 

 

5.3 Commands Useful for Debugging.

In this section, we consider 3 commands that are helpful in understanding how the statistics are computed and in debugging.  These commands are genotype, viewmarker and viewstat.

GENOTYPE  To assure yourself that the data are loaded correctly, you can ask to see the data for any pedigree (or family) by typing:

>> genotype 501051 a2m apoe

a2m apoe

10070 90020 90019 1 1 3 4

90020 0 0 0 0 0 0

90019 0 0 0 0 0 0

10137 90020 90019 1 1 3 4

The output lists the IDs and marker data for every individual in the family.  In this case, both parents (90019 and 90020) have no record in the data file; their genotypes and the IDs of their parents have been set to zero.

 

VIEWMARKER  Viewmarker provides data at the nuclear family level on the distribution of offspring genotypes.  Using viewmarker with test.ped gives, for the first few families:

 

>> viewmarker apoe

****  501151    {,} ==> {AB,BB}  

Possible sibs' genotypes g[] = 4/4  3/4 

Probability of each genotype Pg[] =

0.500000      0.500000   

Probability of each pair of genotypes Pgg[][] =

0.166667      0.333333   

0.333333      0.166667   

 

****  501221    {,} ==> {AA,AB}  

Possible sibs' genotypes g[] = 4/4  4/3 

Probability of each genotype Pg[] =

0.666667      0.333333   

Probability of each pair of genotypes Pgg[][] =

0.333333      0.333333   

0.333333      0.000000   

 

There is no entry for 501051 because that family has all of its probability mass concentrated on a single genotype.  For 501151, there were 2 genotypes among the offspring, one 3/4 and one 4/4, and no parental genotype data.  The AB,BB notation means that Table 3, line 2 of Rabinowitz and Laird (2000) is used to assign a probability distribution to the four offspring in this family.  All the mass is concentrated on these two genotypes, with zero mass on all other genotypes.

 

It is instructive to use viewmarker when we have a family where both parents have genotype data.  With parents, the distribution is computed using Mendelian laws.  Returning to the adh.ped file, we have

 

>> viewmarker DRD4

****  2000    {AA,AB} ==> {AA}  

Possible sibs' genotypes g[] = 3/3  3/1 

Probability of each genotype Pg[] =

0.5000      0.5000     

Probability of each pair of genotypes Pgg[][] =

0.2500      0.2500     

0.2500      0.2500     

 

Here the parents are 3/3 and 3/1, and they have only 3/3 offspring.  Since the parental genotypes are known, the offspring distribution under H0 follows Mendel's laws: each offspring is 3/3 or 3/1 with probability 0.5 each, independently of its siblings.
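When both parental genotypes are known, the distribution shown by viewmarker can be reproduced by enumerating the four equally likely combinations of one paternal and one maternal allele.  The Python sketch below does this for the 3/3 by 3/1 mating; offspring_distribution is a hypothetical helper, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_distribution(father, mother):
    """Mendelian distribution of unordered offspring genotypes for known parents."""
    counts = Counter(tuple(sorted(pair)) for pair in product(father, mother))
    total = sum(counts.values())          # always 4 allele combinations
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

dist = offspring_distribution((3, 3), (3, 1))
for genotype, prob in sorted(dist.items()):
    print(genotype, prob)

# Given the parents, sibs are independent, so each pair probability is a product.
print(dist[(3, 3)] * dist[(1, 3)])
```

Each genotype has probability 1/2 and each pair of genotypes has probability 1/4, in line with the Pg[] and Pgg[][] values printed for pedigree 2000.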

 

 

VIEWSTAT  Viewstat provides data on the contribution of each family to S, E(S) and Var(S).  Viewstat will also provide additional detail on the summary statistics, which is relevant when using multi-allelic tests.

 

Using viewstat with test.ped gives:

 

>> viewstat apoe

Alle S         E(S)      Var(S)

ped 501151 (90039,90040)

2    0.0000    0.0000    0.0000    0.0000    0.0000   

3    1.0000    1.0000    0.0000    0.3333    -0.3333  

4    3.0000    3.0000    0.0000    -0.3333   0.3333   

ped 501221  (90109,90108)

2    0.0000    0.0000    0.0000    0.0000    0.0000   

3    0.0000    0.6667    0.0000    0.2222    -0.0000  

4    2.0000    2.0000    0.0000    -0.0000   -0.0000  

    .

    .

    .

 

total family count = 104; informative family count = 72

Alle Fam# S         E(S)      Var(S)

2    11   6.0000    12.4178   3.2985    -1.1575   -2.1409  

3    68   130.0000  147.3345  -1.1575   30.1325   -28.9750 

4    67   160.0000  136.2477  -2.1409   -28.9750  31.1159  

 

eigenvector:

0.8164      0.5774      0.7134

-0.4192      0.5774      0.7134

-0.3972      0.5774      0.7134

 

diagonal vector:

4.9344

      -0.0000

            59.6125

 

pseudo inverse v:

0.1351      -0.0692      0.0405

-0.0692      0.0439      0.0405

-0.0659      0.0254      0.0405

 

rank = 2

chisq = 25.3928
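The summary block above can be cross-checked by hand.  The rows of the summary Var(S) sum to approximately zero, so the matrix has rank 2, and dropping one allele leaves a full-rank 2x2 block.  The Python sketch below computes the quadratic form in S - E(S) with the inverse of that block, which should give the same value as the pseudo-inverse form whenever the deviations also sum to zero.  This is a shortcut chosen for illustration, not necessarily how FBAT computes the statistic, and the function name is invented.

```python
def chisq_two_alleles(s, e, v):
    """Quadratic form (S - E)' V^-1 (S - E) for a symmetric 2x2 block V."""
    u0, u1 = s[0] - e[0], s[1] - e[1]
    a, b, d = v[0][0], v[0][1], v[1][1]
    det = a * d - b * b
    # Inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    return (d * u0 * u0 - 2.0 * b * u0 * u1 + a * u1 * u1) / det

# Alleles 2 and 3 of apoe, copied from the viewstat summary (allele 4 dropped).
s = [6.0000, 130.0000]
e = [12.4178, 147.3345]
v = [[3.2985, -1.1575],
     [-1.1575, 30.1325]]
print(round(chisq_two_alleles(s, e, v), 4))
```

The result should agree with the printed chisq of 25.3928 to about three decimal places; the small difference comes from the four-decimal rounding of the inputs.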


 

Again, there is no entry for 501051.  Since we use the default settings, the statistic counts the number of alleles among affected offspring.  For family 501151, there are no 2 alleles, one 3 allele, and three 4 alleles among affected offspring.  E(S) and Var(S) are computed from the distribution given by viewmarker.  By chance, S equals E(S) for this family.  Notice that Var(S) is a 3x3 matrix.  Its first row and column are zero since this family has zero probability of carrying a 2 allele.  If you use the -s option, you obtain only the summary over all families.  The summary E(S) and the diagonal of Var(S) are what you obtain from fbat using bi-allelic tests.  The rest of the output refers to multi-allelic tests: the eigenvectors, eigenvalues and pseudo-inverse of Var(S) are used to compute the chi-square test and its degrees of freedom (the rank of Var(S)).
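The per-family rows can be reproduced from the viewmarker distribution.  The Python sketch below does so for family 501151 under two assumptions that are ours rather than the manual's: that two affected offspring (each with T = 1) contribute to S, and that Pgg[][] is the joint probability of the genotypes of an ordered pair of sibs.  The helper names are invented.

```python
def allele_count(genotype, allele):
    """Number of copies of `allele` in a genotype (additive coding)."""
    return sum(1 for a in genotype if a == allele)

def family_moments(genotypes, pg, pgg, alleles, n_offspring):
    """E(S) and Var(S) for one family with `n_offspring` contributing sibs, T = 1."""
    n = n_offspring
    idx = range(len(genotypes))
    x = {a: [allele_count(g, a) for g in genotypes] for a in alleles}
    mean = [n * sum(pg[i] * x[a][i] for i in idx) for a in alleles]
    var = []
    for ia, a in enumerate(alleles):
        row = []
        for ib, b in enumerate(alleles):
            # Same-sib term uses Pg; different-sib term uses the pair table Pgg.
            same = sum(pg[i] * x[a][i] * x[b][i] for i in idx)
            cross = sum(pgg[i][j] * x[a][i] * x[b][j] for i in idx for j in idx)
            row.append(n * same + n * (n - 1) * cross - mean[ia] * mean[ib])
        var.append(row)
    return mean, var

# Family 501151: possible genotypes 4/4 and 3/4, with Pg and Pgg from viewmarker.
genotypes = [(4, 4), (3, 4)]
pg = [0.5, 0.5]
pgg = [[1 / 6, 1 / 3], [1 / 3, 1 / 6]]
mean, var = family_moments(genotypes, pg, pgg, alleles=[2, 3, 4], n_offspring=2)
print([round(m, 4) for m in mean])
for row in var:
    print([round(c, 4) for c in row])
```

Under these assumptions the sketch gives E(S) of 0, 1 and 3 for alleles 2, 3 and 4, variances of 0.3333 for alleles 3 and 4 with covariance -0.3333, and a zero row and column for allele 2, matching the viewstat rows for this pedigree.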

 

One can also use -o with viewstat to obtain individual family contributions to Var(S) when this option is used with fbat.

 

6. REFERENCES

Horvath, S. and Laird, N.M. (1998) "A discordant-sibship test for disequilibrium and linkage: No need for parental data." Amer J Hum Gen 63: 1886-97.

 

Horvath, S., Wei, E., Xu, X., Palmer, L., and Baur, M. (2001a) "Family-based association test method I: Age of onset traits and covariates." Genetic Epi (Suppl 1) 19:36-42.

 

Horvath, S., Xu, X., and Laird, N. (2001b) "The family based association test method: strategies for studying general genotype-phenotype associations." Euro J Hum Gen 9: 301-306.

 

Knapp, M. (1999) "A Note on Power Approximation for the Transmission/Disequilibrium Test." Amer J Hum Gen 64: 1177-1185.

 

Laird, N., Horvath, S. and Xu, X. (2000) "Implementing a unified approach to family based tests of association." Genetic Epi 19(Suppl 1): S36-S42.

 

Lake, S., Blacker, D. and Laird, N. (2001) "Family based tests in the presence of association." Amer J Hum Gen 67:1515-1525.

 

Lange, C., Silverman, E., Weiss, S., Xu, X., Laird, N.M. (2002) "A Multivariate Family-Based Test using Generalized Estimating Equations: FBAT-GEE."  Under revision; available from the authors.

 

Lunetta, K.L., Faraone, S.V., Biederman, J., and Laird, N.M. (2000) "Family based tests of association and linkage using unaffected sibs, covariates and interactions."  Amer J Hum Gen 66: 605-614.

 

Mokliatchouk, O., Blacker, D. and Rabinowitz, D. (2001) "Association tests for traits with variable age at onset."  Human Heredity 51: 46-53.

 

Rabinowitz, D. and Laird, N.M. (2000) "A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information."  Human Heredity 50(4):227-233.

 

Schaid, D.J. (1996) "General score tests for associations of genetic markers with disease using cases and their parents." Genetic Epi 13:423-49.

 

Tu, I.P., Balise, R.R. and Whittemore, A.S. (2000) "Detection of disease genes by use of family data II. Application to nuclear families." Amer J Hum Gen 66:1341-1350.

 

Whittaker, J., and Lewis, C. (1998) "The effect of family structure on linkage tests using allelic association."  Amer J Hum Gen 63:889-897.