Conditional linkage methods--searching for modifier genes in a large Amish pedigree with known Von Willebrand disease major gene modification

University of Iowa
Iowa Research Online
Theses and Dissertations
Spring 2009

Conditional linkage methods--searching for modifier genes in a large Amish pedigree with known Von Willebrand disease major gene modification

Diana Lee Abbott
University of Iowa

Copyright 2009 Diana Lee Abbott

This dissertation is available at Iowa Research Online.

Recommended Citation
Abbott, Diana Lee. "Conditional linkage methods--searching for modifier genes in a large Amish pedigree with known Von Willebrand disease major gene modification." PhD (Doctor of Philosophy) thesis, University of Iowa, 2009.

Part of the Other Genetics and Genomics Commons

CONDITIONAL LINKAGE METHODS--SEARCHING FOR MODIFIER GENES IN A LARGE AMISH PEDIGREE WITH KNOWN VON WILLEBRAND DISEASE MAJOR GENE MUTATION

by
Diana Lee Abbott

An Abstract

Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Statistical Genetics in the Graduate College of The University of Iowa

May 2009

Thesis Supervisors: Associate Professor Kai Wang
Professor Trudy Burns

ABSTRACT

Von Willebrand Disease (VWD) is the most common bleeding disorder. In addition to known major genes, genetic modifiers, such as ABO blood group, affect quantitative outcome measures for VWD. To study genetic modification of VWD, an 854-member Amish pedigree with established linkage of VWD to a known mutation in the Von Willebrand Factor (VWF) gene on chromosome 12 was used. Phenotypic information and genotypic data consisting of VWF mutation status and a genome screen of markers were available for 385 pedigree members. Genetic modifiers of the VWF mutation were investigated using known and new conditional linkage methods that search for modifier genes of a major gene with known mutation.

The MCMC-based program LOKI (Heath, 1997) was used to conduct multipoint linkage analysis of VWD outcome measures while controlling for the known VWF mutation. Adjustment for the mutation did not eliminate the linkage signal on chromosome 12 in the same location as the VWF mutation. Evidence for quantitative trait loci (QTLs) was also found on six other chromosomes.

S_mod, a score statistic that detects evidence of a genetic modifier conditional on linkage to a major gene, was developed for sib pair data. To limit the modifier gene main effect, S_mod was developed so that the variance due to the modifier locus was bounded above by the variance of the interaction between major gene and modifier gene. The performance of S_mod was compared to other published score statistics. Power to detect linkage to the modifier locus depended on major gene and modifier gene risk allele frequencies, the relative contribution of the major gene main effect to the interaction effect, and the upper bound on the modifier gene main effect.

The Amish pedigree was broken up into sib pair data and analyzed using S_mod and other score statistics. Using these statistics, the strongest evidence for QTLs for VWD was also found on chromosome 12 in the region of the VWF mutation. Combined with the LOKI results, further analysis will help determine whether intragenic modification is occurring or whether linkage disequilibrium between the mutation and analyzed markers is driving results.


Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Diana Lee Abbott has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Statistical Genetics at the May 2009 graduation.

Thesis Committee:
Kai Wang, Thesis Supervisor
Trudy Burns, Thesis Supervisor
Jorge Di Paola
Jian Huang
Brian Smith

To my Sweet Boys
Mark and Thomas

ACKNOWLEDGMENTS

I would like to thank Dr. Jorge Di Paola for allowing me to work with the Amish pedigree data, Dr. Kai Wang for thoroughly guiding me on methods development, Dr. Trudy Burns for keeping me mindful of the big picture and providing insightful ways to improve the details of my research, and Drs. Jian Huang and Brian Smith for serving on my committee.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. BACKGROUND INFORMATION
   Genetics Background
   Molecular Biology Basics
   Types of Genetic Diseases
   Genetic Material in Families
   Model-Based Linkage Analysis
   MCMC Methods for Complex Likelihoods
   Quantitative Trait Analysis
   Analyzing Quantitative Traits with MCMC Methods
   LOKI and MCMC Segregation and Linkage Analysis
   Modifier Genes
   Literature Review of Conditional Linkage Methods

2. SEARCHING FOR MODIFIER GENES IN A LARGE, AMISH PEDIGREE USING EXISTING LINKAGE ANALYSIS METHODS
   Von Willebrand Disease
   The Coagulation Process
   Types of VWD
   Von Willebrand Disease: A Complex Genetic Disorder With Reduced Penetrance and Variable Expressivity
   The Amish Pedigree Data
   Amish Pedigree Analysis Aims
   Two-point Linkage Analysis of VWD
   VWD Multipoint Linkage Analyses with LOKI
   CBC Multipoint Linkage Analyses with LOKI
   Conclusions

3. STATISTICAL DEVELOPMENT OF A NEW CONDITIONAL LINKAGE METHOD FOR QUANTITATIVE TRAITS
   Motivation for Development of a Conditional Linkage Method
   Brief Review of Conditional Linkage Methods
   Theory and Development
   Sib Pair Data
   Score Statistic and Asymptotic Distribution
   Summary

4. SIMULATION MODELS AND ANALYSIS OF THE NEW CONDITIONAL LINKAGE METHOD
   Simulation Generating Models
   Simulation Sib Pair Data
   Analysis Models
   Test Statistic Performance Comparison Criteria
   Results for the Independent Sib Pair Simulated Data
   Amish Pedigree Simulated Sib Pair Data
   Discussion of Results
   Conditional Linkage Analysis of Observed Amish Pedigree Data

5. CONDITIONAL LINKAGE METHODS: GENERALIZATIONS AND FUTURE WORK
   Research Summary
   Amish Pedigree Conditional Linkage Analysis Results and Implications
   Future Efforts

REFERENCES

LIST OF TABLES

Table

Genotypic Effects and Quantitative Trait Values by Single-Locus Genotype
Bayes Factor Interpretation Criteria
Disease Model in which Individuals are Affected Only if They Possess the Risk Genotypes of Both the Target Gene, A A, and the Modifier Gene, B B
Disease Model Illustrating the Suppression of a Risk Genotype of the Target Gene, A A, by a Beneficial Genotype of the Modifier Gene, B B
Disease Model in which the Inheritance Pattern of the Target Gene, A, Differs According to an Individual's Genotype at the Modifier Gene, B
Disease Model in which Disease Severity as Modeled by the Target Gene, A, is Increased Due to the Effect of the Modifier Gene, B
Disease Model in which Disease Severity as Modeled by the Target Gene, A, is Decreased Due to the Effect of the Modifier Gene, B
Examples of Target Genes Acted Upon by Modifier Genes, Type of Modifying Effect, and Modified Phenotype
The Coagulation Process
Penetrance Vector for Linkage Analysis
Counts of VWF Mutation Status by Sex
Mean and Standard Deviation of Age and VWF Outcome Measures by Sex, VWF Mutation Status, and ABO Blood Group
LOKI Multipoint QTL Linkage Analysis Results by VWD Outcome Measure and Chromosome
Mean and Standard Deviation of HB, HCT, MCHC, and MCV by Sex and VWF Mutation Status
Mean and Standard Deviation of Platelet, RBC, RDW, and WBC by Sex and VWF Mutation Status
LOKI QTL Linkage Analysis Results by CBC Measure and Chromosome
Joint Classification of VWD Status and Mutation Status in Amish Pedigree Members Based on Bleeding Score (BS) Levels of at Least
Joint Classification of VWD Status and Mutation Status in Amish Pedigree Members Based on Bleeding Score (BS) Levels of at Least
3.3. Joint Classification of VWD Status and Mutation Status in Amish Pedigree Members Based on RCO Levels Below
Joint Classification of VWD Status and Mutation Status in Amish Pedigree Members Based on RCO Levels Below
Phenotypic Value by Combined Major Locus and Modifier Locus Genotype
Covariance Between Sib Phenotypic Values, Conditioned on IBD Sharing
Constrained Covariance Between Siblings, Conditioned on IBD Sharing
Asymptotic Distribution and Critical Values for S_mod Given Fully Informative Marker Information
Model 1: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. In this case, b = c = η; d =
Model 2: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. In this case, b = c = η; d = ( 5 )η
Model 3: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. In this case, b = c = η; d = - η
Model 4: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. M=
Model 5: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. M=
Model 6: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. M=
Model 7: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. M=
Model 8: Parameter Values for the Generating Model in which Both Major and Modifier Gene Risk Allele Frequencies are 0.5. M=
Mean, Standard Deviation, and Correlation of the 0,000 Replicates Simulated Under the 8 Generating Models. M =.0 for Models 4-8; h = 0.6; f(g ) = 0.5; f(g ) = 0.5; n = 00 Sib Pairs
Asymptotic Distributions and Critical Values for Sibship Data for Test Statistics S, S, and T
Asymptotic Distribution and Critical Values for S_mod for Sibship Data Given Fully Informative Marker Information
Rejection Rates of Test Statistics Under the Null Hypothesis; n = 00 Sib Pairs, 0,000 Replications, M=0.3, Models 1-4, f(g )= 0.5; f(g )=
4.3. Rejection Rates of Test Statistics Under the Null Hypothesis; n = 00 Sib Pairs, 0,000 Replications, M=0.3, Models 5-8, f(g )= 0.5; f(g )=
Mean and Standard Deviation of the Difference Between the Nominal Significance Level and the Observed Type I Error Rate for S, S, T, and S_mod Across Generating Models
Amish Pedigree Simulated Data Empirically Determined Critical Values for S, S, T, and S_mod
Kolmogorov-Smirnov Distribution Test p-values Testing Whether the Test Statistic Distributions are the Same Independent of the Modifier Gene Risk Allele Frequency in the Amish Pedigree Structure Simulated Data; M=
FVIII, VWF:Ag, and RCO Summary Measures in the Amish Pedigree Data
Chromosome, Marker, and cM Location where there is Significant Evidence of a FVIII Modifier QTL of the Amish VWD Mutation at 12p13.3. Crude Model
Chromosome, Marker, and cM Location where there is Significant Evidence of a FVIII Modifier QTL of the Amish VWD Mutation at 12p13.3. Age- and Sex-Adjusted Model
Chromosome, Marker, and cM Location where there is Significant Evidence of a VWF:Ag Modifier QTL of the Amish VWD Mutation at 12p13.3. Crude Model
Chromosome, Marker, and cM Location where there is Significant Evidence of a VWF:Ag Modifier QTL of the Amish VWD Mutation at 12p13.3. Age- and Sex-Adjusted Model
Chromosome, Marker, and cM Location where there is Significant Evidence of a RCO Modifier QTL of the Amish VWD Mutation at 12p13.3. Crude Model
Chromosome, Marker, and cM Location where there is Significant Evidence of a RCO Modifier QTL of the Amish VWD Mutation at 12p13.3. Age- and Sex-Adjusted Model
FVIII, VWF:Ag, and RCO Mean Values by ABO Status for Individuals in the Amish Data Who Lack the Amish Mutation
FVIII, VWF:Ag, and RCO Mean Values by ABO Status for Individuals in the Amish Data Who Have the Amish Mutation
Results of a Two-point Linkage Analysis Conducted on the Amish Pedigree Using Mutation Status to Define Liability Classes

LIST OF FIGURES

Figure

Structure of DNA
DNA to RNA to Protein
Simple Depiction of Meiosis as Outlined by Mendel's Law of Segregation
Crossing Over Event During Meiosis
Amish Pedigree
VWD Two-point Linkage Results for Chromosome
VWD Chromosome Multipoint Linkage Analysis Results
VWD Chromosome 5 Multipoint Linkage Analysis Results
VWD Chromosome 6 Multipoint Linkage Analysis Results
VWD Chromosome 7 Multipoint Linkage Analysis Results
VWD Chromosome 9 Multipoint Linkage Analysis Results
VWD Chromosome 7 Multipoint Linkage Analysis Results
VWD Chromosome Multipoint Linkage Analysis Results
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 00 Sibling Pairs,,000 Replications, Models
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 00 Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 00 Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 00 Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Independent Sibling Pairs,,000 Replications, Models
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Independent Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Independent Sibling Pairs,,000 Replications, Models 4-8, M=
4.8. Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Independent Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of Test Statistics Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 4-8, M=
Rejection Rates of S_mod Under the Alternative Hypothesis; n = 683 Amish Pedigree Sibling Pairs,,000 Replications, Models 8, M=0, 0.3,, 5,

CHAPTER 1
BACKGROUND INFORMATION

1.1 Genetics Background

Gregor Mendel's experiments from 1856 to 1863 planted vital seeds of genetic understanding. While his general discoveries, known as Mendel's Laws of Heredity, had little influence during his own lifetime, his ideas were rediscovered and cultivated during the early 20th century. Today we enjoy the copious fruits of Mendel's initial work. Much like his experiments, which began with the cross-hybridization of a few pea plants and grew to include over 28,000 plants, the field of genetics has also experienced exponential growth.

Subsequent genetic breakthroughs include the discovery in 1944 by Avery, MacLeod, and McCarty that hereditary information is carried by the DNA molecule and the findings of Watson and Crick in 1953 demonstrating the chemical structure and double-stranded nature of DNA (Avery et al., 1944; Watson and Crick, 1953). Combined with recent advances in DNA sequencing technology, these seminal ideas have driven the explosion of current genetic knowledge. The completion of the full human genome sequence by the Human Genome Project and Celera Genomics in April 2003 signifies the enormity of this explosion in genetic understanding (Collins et al., 2003; Pennisi, 2003).

1.1.1 Molecular Biology Basics

Because molecular and cellular processes largely determine human health and disease, an understanding of the principles of molecular biology can strengthen one's understanding of a disease process. The basic principles of molecular biology are therefore outlined below (Gelehrter et al., 1997; Schena, 1999; Butte, 2003; Schena, 2003; Kohane et al., 2003; National Human Genome Research Institute, 2004).

Integrated cellular activity drives the operation of the human body; collections of differentiated cells form tissues, while coordinated groups of tissues form organs. DNA, or deoxyribonucleic acid, resides within the nucleus of every cell and houses all instructions necessary for managing the coordinated activity of cells. DNA consists of two strands of nitrogen bases attached to the sugars of a sugar-phosphate backbone (see Figure 1.1). The four nitrogen bases, or nucleotides, are adenine (A), thymine (T), guanine (G), and cytosine (C). Because adenine binds to thymine and cytosine binds to guanine, the nucleotide sequences of the two strands are complementary. The two strands forming DNA wind together to form a double helix. The human genome, the complete sequence of human DNA, comprises approximately 3.3 billion base pairs (bp).

Figure 1.1. Structure of DNA. (Muskopf, 2008)
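
As a small illustration of the base-pairing rule described above (A with T, C with G), the following Python sketch derives the complementary strand for a short sequence. It is not part of the thesis; the function name `complement_strand` and the example sequence are hypothetical and chosen only for illustration.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand.

    Hypothetical helper for illustration; raises KeyError on any
    character other than A, T, C, or G.
    """
    return "".join(PAIRING[base] for base in strand.upper())


if __name__ == "__main__":
    # Illustrative sequence only, not taken from the thesis data.
    print(complement_strand("ATGCCGTA"))  # TACGGCAT
```

Applying the function twice returns the original sequence, which is one way to see that the two strands carry the same information.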

Through an elaborate coiling system, human DNA is organized into 22 pairs of autosomal chromosomes and 1 pair of sex chromosomes. Each chromosome contains many genes, the basic units of heredity, which are specific DNA sequences that encode instructions for making proteins. Proteins in turn perform multiple cellular functions. The human genome contains 20,000 to 25,000 genes, whose coding sequences account for a mere 2 percent of the DNA sequence. The remaining 98 percent of the sequence contains noncoding regions, whose definitive functions remain to be elucidated; possible functions include maintenance of chromosomal structural integrity and management of the location, timing, and amount of protein production.

The central dogma of molecular biology explains that genetic information flows from DNA to RNA to protein (see Figure 1.2). DNA and RNA, or ribonucleic acid, are similar in that they both consist of nitrogen bases attached to a sugar-phosphate backbone; however, there are several key differences between them. RNA is single-stranded while DNA is double-stranded. Furthermore, RNA contains ribose sugars instead of the deoxyribose sugars contained in DNA. While adenine, cytosine, and guanine nucleotides are present in both DNA and RNA, RNA contains uracil in place of thymine. Lastly, DNA serves one main function, to store genetic information, while RNA serves three main functions performed by the three different types of RNA: messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA).

Without having to leave the nucleus, DNA sends its encoding information through intermediary mRNA molecules to ribosomes in the cytoplasm for protein production. During the process of transcription, a section of the DNA unwinds to expose the bases of the DNA strands. RNA nucleotides bind to their complementary bases, with uracil binding to adenine, adenine to thymine, guanine to cytosine, and cytosine to guanine. The resulting mRNA is complementary to its DNA template. Specific signals present in the DNA sequence indicate where transcription should begin and end. After

transcription, the newly produced mRNA travels from the nucleus to ribosomes where translation, the process of protein synthesis, takes place.

Figure 1.2. DNA to RNA to Protein. (Muskopf, 2008)

During protein synthesis, the mRNA sequence is read three nucleotides at a time. In all, there are 64 possible triplet sequences, or codons. Three of these are stop codons, which signal the end of the mRNA message. The remaining 61 triplet sequences each code for one of only twenty amino acids. Consequently, most amino acids are encoded by more than one codon, and degeneracy is thus built into the genetic code. A protein is formed when several amino acids are linked together in a polypeptide chain.

Each transfer RNA molecule contains a set of three bases called an anticodon, which is complementary to an mRNA codon. Additionally, each tRNA molecule contains an acceptor site for the binding of its corresponding amino acid. During the process of translation, the tRNA delivers the appropriate amino acids, which are found in the cytoplasm, to the mRNA at the ribosome. During protein synthesis, ribosomal RNA molecules serve catalytic and structural functions. Upon arrival at the ribosome, mRNA

is essentially immobilized by being enclosed within two subunits of the ribosome apparatus. rRNA molecules subsequently help bind the mRNA and tRNA to the ribosome. Following protein synthesis, the protein takes on a three-dimensional form as determined by the amino acid sequence. After folding, proteins perform a variety of biochemical activities, including involvement in gene regulation, energy production, response to the environment, metabolism, cell structure, and DNA replication. An individual's health relies on the proper production and coordinated activity of proteins. Consequently, an abnormality present at any stage of protein synthesis can initiate a disease process, perhaps by increasing an individual's underlying susceptibility to disease.

1.1.2 Types of Genetic Diseases

A locus is the specific physical location of a gene, and an allele is one of the alternative DNA sequences found at a genetic locus. Because an individual normally has two copies of each different chromosome, each individual will have a pair of alleles at each genetic locus. This pair of DNA variants is an individual's genotype. A homozygous genotype occurs when both alleles at a locus are the same, while a heterozygous genotype occurs when the two alleles differ. The observable physical characteristics resulting from the expression of an individual's genotype are his or her phenotype. The resulting disease trait can be either qualitative or quantitative; a qualitative trait separates individuals into discrete classes, while a quantitative trait is a continuously varying characteristic.

Based on the number of genes involved, genetically determined diseases can be classified into three types: chromosomal, single gene, and multifactorial. Chromosomal diseases, such as Down's syndrome, result from the addition or deletion of all or part of a chromosome. Single gene defects, including sickle cell anemia and cystic fibrosis, result from mutations in a single gene and are inherited in the simple manner described by

Mendel. (However, pleiotropic effects of the single gene can result in multiple phenotypic effects, thus complicating the clinical picture of disease.) Multifactorial or complex diseases, such as diabetes mellitus and cancer, result from contributions of the environment and/or the effects of multiple genes.

Despite being the most common of human genetic disorders, multifactorial diseases are the most difficult to understand because of their complexity. Nevertheless, multifactorial traits can be classified into three categories: normal characteristics with continuous variation, common single congenital malformations, and common disorders of adult life (Burns, 2005). For normal characteristics with continuous variation, an affected individual is one whose trait value is an extreme variant of the normal range. For common single congenital malformations, the trait can be thought of as having an underlying continuous variation in liability for disease. Individuals whose liability exceeds a threshold value will express the abnormal phenotype. For common disorders of adult life, environmental factors are thought to play a considerably larger role, thus accounting for the typical delay in disease development until adulthood, when individuals have had more opportunity for exposure to certain environments. Due to the confluence of factors determining development of disease, certain genetic profiles will only lead to an increased susceptibility to disease. In the absence of important environmental exposures or other background genes, an individual who has an increased susceptibility to disease may never become affected.

Due to the complexity of multifactorial diseases, the underlying genetic and environmental factors are often unknown. Nevertheless, models such as polygenic and oligogenic models help describe how various factors can combine to create the resulting phenotypes. Under a polygenic model, multiple loci contribute small additive effects to the continuous trait underlying the disease. As a result, individuals with very different genotypes at the involved loci could still exhibit the same phenotype. Under an oligogenic model, the number of involved loci is considerably smaller. Few genes play a

role in the overall phenotype, but some of the involved loci contribute relatively large effects. In all models, the expression of the phenotype can be modified by environmental exposures or genetic modifiers (which are explained further in Section 1.4).

1.1.3 Genetic Material in Families

Even for simple diseases, searching for genes that contribute to disease development can be a complicated process. Understanding some of the statistical methods used in the search for disease genes first requires an understanding of how genetic material is transmitted from generation to generation.

Meiosis is the process by which genetic material is replicated to form the gametes, the male sperm and the female egg. Except for individuals with certain chromosomal abnormalities, each individual has 23 pairs of chromosomes, having received one copy of each chromosome from his or her mother and one copy from his or her father. (The two chromosome members of each pair are known as homologous chromosomes.) At the completion of meiosis, each gamete that is formed is haploid in number; that is, instead of having 23 pairs of chromosomes, one maternal and one paternal in origin, each gamete has only one copy of each of the 23 chromosomes (see Figure 1.3). This is in accordance with Mendel's first law, the Law of Segregation, which states that the two members of a gene pair segregate from each other into the gametes, with half of the gametes containing one member of the pair and the other half containing the other member. Since the parental origin of the chromosome copy passed on to the egg or sperm is entirely random, there are 2^23 (about 8.4 million) possible unique gametes that can be formed in this process. When the haploid male sperm fertilizes the haploid female egg, the complement of 23 pairs of chromosomes is restored in the resulting zygote.

The Law of Segregation actually underestimates the amount of variation inherent in the meiotic process. During meiosis, in a process known as crossing over, paired homologous chromosomes exchange homologous segments of DNA. The result is that

Figure 1.3. Simple Depiction of Meiosis as Outlined by Mendel's Law of Segregation. (Muskopf, 2008)

each chromosome transmitted to a gamete is typically a hybrid composed of smaller segments that alternate in origin between the maternally derived and paternally derived chromosomes in the original cell (see Figure 1.4).

Figure 1.4. Crossing Over Event During Meiosis. (Muskopf, 2008)

In his second law, the Law of Independent Assortment, Mendel hypothesized that during gamete formation, the segregation of alleles of one gene is independent of the segregation of alleles of another gene. However, Mendel did not account for genes that are located close together on the same chromosome. Thus, violations of Mendel's second law do occur, and they provide the foundation for linkage analysis.

1.2 Model-Based Linkage Analysis

When two loci are located so close together on a chromosome that their assortment is non-independent, the two loci are said to be linked. In model-based linkage analysis, the two loci being investigated for evidence of linkage are typically a hypothetical disease locus and a marker locus, where the marker has a detectable physical location on a chromosome, exhibits genetic variation, and can be monitored in terms of presence and inheritance (Burns, 2005). Because the location of the disease gene is

unknown, observed disease phenotypes are used to infer individuals' disease genotypes. In order to map phenotypes to genotypes, a genetic model detailing the mode of inheritance, gene frequencies, and the penetrance of each genotype must be specified. Family data are collected, and the likelihood of observing the data is calculated assuming both linkage and no linkage.

In forming the likelihood, offspring in a family are first classified as recombinant or non-recombinant individuals. A child inherits one maternally derived and one paternally derived haplotype (a combination of alleles located closely together that tend to be inherited together) from its parents. To be considered a non-recombinant individual, the child's maternally derived haplotype must have inferred disease and marker genotypes that exactly match one of the mother's original haplotypes, and the paternally derived haplotype must have inferred disease and marker genotypes that exactly match one of the father's original haplotypes. Otherwise, the individual is considered a recombinant individual; that is, he or she inherits a haplotype that has a different combination of alleles than either of the corresponding parental haplotypes. (Note that a recombination event between two loci will only be observed when an odd number of meiotic crossover events occurs between the two loci.) For a particular offspring, recombination events are counted separately for the meioses of each parent. When a parent is not doubly heterozygous (where doubly heterozygous means having two different alleles or DNA variants at both the marker and disease loci), recombination events in the offspring cannot be scored for that parent's transmitted haplotypes.

The recombination fraction θ, defined as the probability that a recombination event will occur between the two loci, can be estimated by dividing the number of recombination events observed by the total number of scored gametes. Under the hypothesis of no linkage, θ is 0.5 (because when the two loci are unlinked, we would expect, by chance, a recombination event to occur between them half the time). Under the hypothesis of complete linkage (i.e., disease gene and marker gene always segregate

together), θ is 0 (because when two loci are always inherited together, a recombination event between them will never occur). Incomplete linkage occurs when θ falls between the extreme values of 0 and 0.5.

Uncertainty regarding which chromosomes alleles reside on (i.e., maternal or paternal in origin) results in a more complicated likelihood. In this situation, the likelihood is computed as the sum over all possible values of X, where X refers to all unobserved data in a family or pedigree. Thus, the likelihood takes the following form (Thompson, 1996):

$$P_\theta(Y) = \sum_X P_\theta(Y, X) = \sum_X P_\theta(Y \mid X)\, P_\theta(X),$$

where Y refers to the observed data. For a particular individual, the observed data depend only upon his or her genotype, and genotypes are transmitted from parents to child according to Mendelian segregation probabilities. Letting $X_j$ represent full genotype information for both chromosomes of individual j and letting $m_j$ and $f_j$ indicate the parents of individual j, the components of the previous likelihood function can be broken down as follows (Thompson, 1996):

$$P_\theta(Y \mid X) = \prod_{\text{observed } i} P_\theta(Y_i \mid X_i) \quad \text{and} \quad P_\theta(X) = \prod_{\text{founders } j} P_\theta(X_j) \prod_{\text{nonfounders } j} P_\theta(X_j \mid X_{m_j}, X_{f_j}),$$

where founders are individuals in the pedigree without parents who appear in the pedigree and nonfounders are individuals in the pedigree whose parents also appear in the pedigree.

When calculating the likelihood of the observed data under the hypothesis of linkage, the likelihood can be calculated for θ values ranging from 0 to 0.5. Under the null hypothesis of no linkage, the likelihood is calculated using θ = 0.5. The LOD score is defined as the logarithm of the odds, where the odds in this case are actually a likelihood ratio. The LOD score can be computed for different values of θ by taking the base-10 logarithm of the ratio between the likelihood of the data at the specified θ value and the

likelihood of the data at θ = 0.5. In order to find the value of θ that is most likely given the observed data, the LOD score is maximized with respect to θ. By convention, maximum LOD scores of at least 3 are considered strong evidence for linkage, while maximum LOD scores below -2 are considered evidence against linkage. The value of θ for which the LOD is maximized indicates how close together the disease and marker loci likely are. The entire human genome is roughly 3500 centimorgans (cM) long, where the cM is an additive unit of genetic distance. Unlike cM units, θ values are not additive due to the possibility of double crossovers between two loci, which make the two loci appear closer together than they actually are. Nevertheless, a θ value of 0.01 (i.e., 1% recombination) roughly corresponds to 1 cM.

Results of a linkage analysis should be interpreted with great care. For example, large LODs with accompanying small θ values for a marker do not mean that the marker locus and the disease locus are one and the same. LOD scores merely provide evidence that the marker locus and the disease locus are located very close together. Because linkage analysis is based on statistical reasoning, it is always possible to obtain a large LOD score for a marker merely by chance. When a linkage analysis yields a large LOD score and small θ value for a marker, the location of the disease gene can potentially be narrowed down by conducting linkage analyses with markers more tightly spaced in the implicated region. It is unusual for investigators to test just one marker for linkage to a disease gene; instead, multipoint linkage analyses that take into account the information for multiple markers at a time are usually conducted.

.. MCMC Methods for Complex Likelihoods

As pedigrees grow large and/or the number of markers being tested grows large, likelihood calculations for linkage analysis become quite complicated.
Two algorithms, the Elston-Stewart algorithm and the Lander-Green algorithm, are utilized in the field of statistical genetics to deal with large datasets. Computation time for the Elston-Stewart

algorithm is linear with respect to pedigree size and grows exponentially with the number of markers analyzed. Thus, the Elston-Stewart algorithm is best suited to analyzing large pedigrees with a small number of genetic markers. For the Lander-Green algorithm, computation time is linear with respect to the number of markers analyzed and grows exponentially with pedigree size. Consequently, the Lander-Green algorithm is best suited to analyzing smaller pedigrees with a large number of markers. For datasets consisting of small pedigrees and a few markers, either algorithm is adequate. When a dataset consists of a large extended pedigree with many markers, however, neither algorithm is computationally tractable.

When exact likelihood calculations are infeasible due to the complexity of the data, Markov Chain Monte Carlo (MCMC) methods can be utilized for likelihood estimation. In the Bayesian realm, investigation of appropriate summary measures from a posterior distribution can reveal pertinent information about the unknown parameters given the data. (For example, posterior means can serve as point estimates for the unknown parameters.) However, estimating these posterior measures typically involves evaluating complicated integrals of a function with respect to the posterior distribution. Since the likelihood is a sum over a large space of unobserved data values (X), MCMC integration involves sampling from some probability distribution on this space. In other words, MCMC uses iterative sampling from a Markov chain to evaluate complicated sums and integrals. A series of Markov chain states is built up, and once convergence is achieved, subsequent states can be treated as draws from the target probability distribution. For Bayesian analyses, this asymptotic probability distribution is the posterior distribution (Muller, 2006).

MCMC methods consist of two components: Monte Carlo estimation and the Markov Chain.
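The two components can be seen together in a toy example. The Python sketch below is not taken from the thesis or from any linkage software: it runs a Metropolis sampler with a symmetric proposal on a made-up five-state target distribution (the Markov chain component) and then averages the visited states to estimate the target mean (the Monte Carlo component). The weights, chain length, number of discarded early iterations, and seed are arbitrary assumptions chosen for illustration.

```python
import random


def metropolis_chain(weights, n_iter, seed=0):
    """Random-walk Metropolis on states 0..len(weights)-1.

    `weights` are unnormalized target probabilities (hypothetical values).
    """
    rng = random.Random(seed)
    state = 0
    visited = []
    for _ in range(n_iter):
        proposal = state + rng.choice((-1, 1))  # symmetric proposal
        if 0 <= proposal < len(weights):
            # Accept with probability min(1, target(proposal) / target(state)).
            if rng.random() < min(1.0, weights[proposal] / weights[state]):
                state = proposal
        # Proposals outside the state space are rejected; the chain stays put.
        visited.append(state)
    return visited


weights = [1, 2, 4, 2, 1]          # made-up target, known only up to a constant
chain = metropolis_chain(weights, n_iter=100_000)
kept = chain[1_000:]               # drop early iterations before averaging

mcmc_mean = sum(kept) / len(kept)  # average over the retained chain states
exact_mean = sum(x * w for x, w in enumerate(weights)) / sum(weights)
print(f"MCMC estimate = {mcmc_mean:.3f}, exact mean = {exact_mean:.3f}")
```

Here the exact mean is available for comparison only because the state space is tiny; in pedigree problems the sum over unobserved data cannot be enumerated, which is the reason for sampling in the first place.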
Monte Carlo techniques can be employed to estimate intractable sums, integrals, or expectations. For example, consider Monte Carlo estimation of the sum $\sum_x g(x)$. Note that the sum may be rewritten as:

$$\sum_x g(x) = \sum_x \frac{g(x)}{P(X = x)}\, P(X = x) = E_P\!\left[\frac{g(X)}{P(X)}\right],$$

where P(·) is some probability distribution over the space of values of X that assigns positive probability to every value x of X for which g(x) is strictly positive. If $X^{(1)}, X^{(2)}, \ldots, X^{(N)}$ are simulated from P(·), then an unbiased estimator of the sum is

$$\frac{1}{N} \sum_{\tau=1}^{N} \frac{g(X^{(\tau)})}{P(X^{(\tau)})}.$$

However, an unbiased estimator is not necessarily a good estimator; to be a good estimator, it must also have small variance. Selecting an appropriate sampling distribution reduces the Monte Carlo variance. Sampling from the probability distribution $P^*(X) = P_\theta(X \mid Y) \propto P_\theta(X, Y)$ minimizes the Monte Carlo variance (Thompson, 2000).

The Markov Chain component of MCMC methods refers to how dependent realizations can be generated from the target probability distribution. The key to MCMC methods is to define Markov chains appropriately so that the ergodic averages (i.e., averages over the successive states of the chain, which converge regardless of the initial conditions) will approximate the desired posterior expectations. Standard methods ensuring proper MCMC simulation exist, including Gibbs sampling, Metropolis-Hastings, and reversible jump (Muller, 2006).

.3 Quantitative Trait Analysis

The theory behind linkage methods for detecting quantitative trait loci (QTLs) is that individuals with similar trait values will have a higher than expected amount of sharing of genetic material identical by descent (IBD) near the gene or genes that contribute to the quantitative trait. The simplest quantitative trait model is the single locus model in which only one gene contributes to the quantitative trait. For this QTL, genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ have genotypic effects $-a$, $d$, and $a$, respectively (See

Table.), where a refers to the additive effects (the linear effect of the alleles at the locus) and d refers to dominance effects (interaction between the two alleles at the locus). Then the quantitative trait value $Q_i$ for individual i can be written as

$$Q_i = \mu + g_i + e_i,$$

where μ is the overall population mean, $g_i$ is the genotypic effect at the major locus, and $e_i$ is the individual-specific environmental or residual variation, which is assumed to be normally distributed. Furthermore, this simple model assumes no interaction between genotypic effects and the environment. Assuming no correlation between genotype and environment, the overall trait variance can be decomposed as $\sigma^2 = \sigma_g^2 + \sigma_e^2$. The genetic variance $\sigma_g^2$ can also be decomposed into additive and dominance components, $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$, and can be expressed in terms of gene frequencies p and q and additive and dominance effects a and d:

$$\sigma_a^2 = 2pq\,\bigl(a - d(p - q)\bigr)^2; \qquad \sigma_d^2 = 4p^2q^2d^2.$$

Extensions to this simple model include the addition of covariates, multiple major genes, effects of polygenes (multiple genes of small effect that are unlinked to the major gene), and interaction effect terms.

Table.. Genotypic Effects and Quantitative Trait Values by Single-Locus Genotype.

Genotype                          $A_1A_1$           $A_1A_2$           $A_2A_2$
Genotypic effect                  $-a$               $d$                $a$
Quantitative trait value $Q_i$    $\mu - a + e_i$    $\mu + d + e_i$    $\mu + a + e_i$

Several quantitative trait analysis methods exist and are briefly summarized here. Haseman-Elston regression (Haseman and Elston, 1972) methods investigate a function of trait sharing for sibling pair data (i.e., the squared difference in trait values for each pair) and perform a regression of the trait sharing values on IBD sharing values, where

IBD means identical by descent and indicates that the individuals under consideration have inherited identical copies of a gene segregating from a common ancestor in the defined pedigree. In the absence of linkage, sibling pairs are expected to share 0, 1, and 2 alleles at a locus IBD ¼, ½, and ¼ of the time, respectively. A nonzero regression coefficient is evidence of linkage between the quantitative trait and the locus being investigated. Since its introduction in 1972, Haseman and Elston's original regression method has been examined and extended. For example, Amos and Elston used other relative pairs (Amos and Elston, 1989), Wright concluded that using the squared difference discarded useful information (Wright, 1997), and along with Drigalenko, he suggested using both the trait sum and trait difference in the regression framework (Drigalenko, 1998). Elston et al. incorporated these extensions and used the trait product in a regression model while also allowing for use of multiple pairs from the same sibship (Elston et al., 2000).

Variance components methods fit a multivariate normal model to quantitative trait values for family members. Since the covariance between relatives can be expressed in terms of the number of alleles shared IBD, the covariance between relatives can be tested for evidence of linkage between a marker and the trait. Under a hypothesis of linkage, the correlation coefficient between relative pairs who share 2 alleles IBD at a locus should be larger than the correlation coefficient between relative pairs sharing only 1 allele IBD, which in turn should be larger than the correlation coefficient between relative pairs sharing no alleles IBD. Amos describes a test for linkage in which asymptotic theory is applied to conduct a likelihood ratio test between unconstrained (estimating $\sigma_a^2$, $\sigma_e^2$, $\sigma_d^2$, and θ) and constrained (constraining both $\sigma_a^2$ and $\sigma_d^2$ to be zero) versions of the likelihood (Amos, 1994).
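Both ideas above, the Haseman-Elston regression of squared trait differences on IBD sharing and the ordering of sib-pair trait correlations by IBD class, can be checked on simulated data. The Python sketch below is an illustration under strong assumptions and is not the method used to analyze the Amish pedigree: it simulates independent sib pairs at a single additive diallelic QTL (d = 0), treats IBD status at the QTL as known exactly, and uses arbitrary parameter values, a fixed seed, and hypothetical function names.

```python
import random


def simulate_sib_pair(rng, p=0.5, a=1.0, sigma_e=0.7):
    """Return (ibd_count, trait_sib1, trait_sib2) for one simulated sib pair.

    Assumed model: additive diallelic QTL; genotypes carrying 0, 1, or 2
    copies of the trait-increasing allele (frequency p) have effects -a, 0, +a.
    """
    ibd = 0
    copies = [0, 0]
    for _ in range(2):  # one pass per parent
        parent = [int(rng.random() < p), int(rng.random() < p)]
        picks = [rng.randrange(2), rng.randrange(2)]  # allele passed to each sib
        if picks[0] == picks[1]:
            ibd += 1  # both sibs received the same parental copy
        for sib in range(2):
            copies[sib] += parent[picks[sib]]
    traits = [a * (c - 1) + rng.gauss(0.0, sigma_e) for c in copies]
    return ibd, traits[0], traits[1]


def mean(xs):
    return sum(xs) / len(xs)


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2009)
pairs = [simulate_sib_pair(rng) for _ in range(8000)]

# Haseman-Elston: regress squared trait difference on proportion of alleles IBD.
ibd_proportion = [ibd / 2 for ibd, _, _ in pairs]
squared_diff = [(q1 - q2) ** 2 for _, q1, q2 in pairs]
he_slope = slope(ibd_proportion, squared_diff)

# Variance components intuition: sib correlation within each IBD class.
corr_by_ibd = {}
for k in (0, 1, 2):
    q1s = [q1 for ibd, q1, _ in pairs if ibd == k]
    q2s = [q2 for ibd, _, q2 in pairs if ibd == k]
    corr_by_ibd[k] = correlation(q1s, q2s)

print(f"Haseman-Elston slope: {he_slope:.2f}")
print({k: round(r, 2) for k, r in corr_by_ibd.items()})
```

With a QTL of this size, the fitted slope should be negative and the correlations should increase with the number of alleles shared IBD. In practice IBD sharing is estimated from marker data rather than observed, which weakens both signals.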
Due to the potential number of IBD configurations in a family, the form of the correct likelihood is complicated. A number of subsequent authors described extensions of Amos' methods for more complex models, such as the addition of gene-environment interaction effects (Towne et al., 1997), gene-gene interaction effects (Stern et al., 1996 and Mitchell et al., 1997), and multivariate traits (Almasy et al., 1997).

The literature establishes that variance components methods have higher power than Haseman-Elston regression when the normal trait model is close to being correct (Forrest, 2001 and Tang and Siegmund, 2001). Furthermore, variance components methods are more easily applied to larger pedigrees, which have greater power for detecting linkage. However, type I error rates are not robust to violations of the normality assumption (Allison et al., 1999; Sham et al., 2000; Tang and Siegmund, 2001).

.3. Analyzing Quantitative Traits with MCMC Methods

The quantitative trait analysis methods described above focus on testing a hypothesis of linkage one locus at a time. Bayesian methods, however, can address the larger goal of simultaneously testing for multiple QTLs influencing the same trait without having to specify a priori the number of QTLs expected (which the researcher will never really know beforehand). In statistical genetics, MCMC methods have proven to be powerful tools for mapping genes underlying quantitative traits (Satagopan et al., 1996; Uimari and Hoeschele, 1997; Heath, 1997; Lee and Thomas, 2000; Yi and Xu, 2000). Rather than enumerating all of the possible unobserved genotypes, MCMC approaches are based on statistical sampling methods that obtain realizations of possible unobserved genotypes, haplotypes, and other latent variables from the underlying sample space. Utilizing MCMC approaches, linkage analyses can be conducted with any number of marker loci, multiple trait loci, and pedigrees of any size and complexity. Furthermore, joint linkage and oligogenic segregation analyses that employ MCMC methods also allow estimation of the total number of underlying trait loci and covariate effects in addition to model parameters for each locus under investigation (Wijsman, 00).

To conduct Bayesian MCMC analyses, prior distributions must be specified.
Examples of these prior distributions include distributions for founder genotypes of both

marker and trait loci, quantitative trait locus (QTL) locations, number of QTLs in the model, and variance of genotypic effects. Informative prior distributions incorporate various pieces of information; for example, prior distributions for marker loci take into account marker allele frequencies, Hardy-Weinberg equilibrium assumptions, linkage equilibrium between loci, and Mendelian transmission probabilities (Wijsman, 00). When it is uncertain what form the prior distribution should take, an uninformative uniform prior distribution can be assumed.

For each MCMC iteration, prior distributions, combined with genotypes of other pedigree members, are used to determine possible multilocus genotypes for each individual in the pedigree. Two basic steps are performed at each iteration. First, the prior distribution is used to propose a new value for the latent variable (i.e., the variable that is not directly observed but is inferred). Second, the new value is either accepted or rejected; if it is rejected, the value of the latent variable remains unchanged from its value in the previous iteration. Use of the Metropolis-Hastings algorithm retains two separate steps in this process, while use of the Gibbs sampler combines them into one (Wijsman, 00).

Useful estimates for variables of interest can only be obtained when the MCMC sampler successfully moves around the sample space (i.e., mixes well); that is, possible genotypes must be proposed and accepted from the entire possible sample space. It may take a while for the sampler to move into the part of the sample space where the latent variable values with the highest probability are proposed. Consequently, initial iterations serving as burn-in iterations may be ignored or discarded. While traditional linkage analyses yield LOD scores or p-values as outcomes, Bayesian MCMC methods yield estimates of posterior distributions of parameter values.
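The contrast drawn above between Metropolis-Hastings (propose, then accept or reject) and the Gibbs sampler (one combined step) can be sketched for a single latent QTL genotype. The Python example below shows one Gibbs-style update, which draws the genotype directly from its full conditional distribution. It is a minimal illustration under stated assumptions, not LOKI's or any published sampler's implementation: the individual is treated as a founder with a Hardy-Weinberg prior, the trait follows the single-locus model with normal residuals, all other parameters are held fixed, and every name and number is hypothetical.

```python
import math
import random


def genotype_full_conditional(trait, mu, a, d, sigma_e, freq_a2):
    """P(genotype | trait, parameters) for a founder at a diallelic QTL.

    Assumes a Hardy-Weinberg prior and normal residuals (illustrative only).
    """
    freq_a1 = 1.0 - freq_a2
    prior = {
        "A1A1": freq_a1 ** 2,
        "A1A2": 2 * freq_a1 * freq_a2,
        "A2A2": freq_a2 ** 2,
    }
    effect = {"A1A1": -a, "A1A2": d, "A2A2": a}
    unnormalized = {}
    for genotype in prior:
        z = (trait - (mu + effect[genotype])) / sigma_e
        # The normal density's constant factor cancels when normalizing.
        unnormalized[genotype] = prior[genotype] * math.exp(-0.5 * z * z)
    total = sum(unnormalized.values())
    return {g: w / total for g, w in unnormalized.items()}


def gibbs_update(rng, conditional):
    """Draw the new genotype straight from its full conditional (no reject step)."""
    genotypes = list(conditional)
    return rng.choices(genotypes, weights=[conditional[g] for g in genotypes])[0]


rng = random.Random(1)
conditional = genotype_full_conditional(trait=1.0, mu=0.0, a=1.0, d=0.0,
                                        sigma_e=0.5, freq_a2=0.5)
new_genotype = gibbs_update(rng, conditional)
print(conditional, new_genotype)
```

For a nonfounder, the Hardy-Weinberg prior would be replaced by Mendelian transmission probabilities given the parents' current genotypes, and the conditional would also account for the individual's offspring; the sketch omits both.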
For linkage analysis, the typical quantity of interest is the posterior distribution of trait gene locations, also known as the posterior probability of linkage. For each interval, the fraction of all iterations for which one or more QTLs are placed in

that interval serves as an estimate of this posterior probability of linkage. Additionally, MCMC methods allow estimation of parameters other than linkage locations.

.3. LOKI and MCMC Segregation and Linkage Analysis

Heath developed a method to simultaneously perform segregation and multipoint linkage analysis of quantitative traits using oligogenic models (Heath, 1997). His method is implemented in the software program LOKI and, unlike many other QTL analysis methods, can be used to analyze very large, even complex, pedigree structures. The method avoids misspecification of model parameters by jointly estimating QTL number, position, and effects. Heath models a quantitative trait as being controlled by a variable number k of diallelic QTLs and allows for covariate terms as needed; however, interaction terms are not accounted for in the model. For QTL j, the genotypic effect $G_j$ has the value $-a_j$, $d_j$, or $a_j$ depending upon the individual's genotype at the locus. Then the oligogenic model for quantitative trait Q is as follows:

$$Q = \mu + X\beta + \sum_{j=1}^{k} G_j + e,$$

where μ is the overall mean, β is an (m × 1) vector of covariate effects, and X is an (n × m) incidence matrix for the covariate effects. The data Y include quantitative trait and covariate values and the marker data. In the model, the prior probability for each QTL is specified so that prior to analysis, each QTL is equally likely to be on any chromosome (anywhere along the chromosome) or to be unlinked to the trait. The posterior distribution of all variables is expressed as $p(k, Z, M, \beta, \lambda, \delta, \eta, \alpha, \sigma_e^2, \mu \mid Y)$, where Z and M are complete, phased genotypes of all QTLs and markers, δ keeps track of which QTLs are currently linked, λ is a vector of the QTL map positions for linked QTLs, η retains allele frequencies for QTLs and markers, and $\sigma_e^2$ is the residual variance. Both $a_j$ and $d_j$ are assigned normal prior distributions each with mean 0 and precision τ, which is set to a constant value depending on the