Journal of Biomedical Science 2009, 16:52

Journal of Biomedical Science

This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon.

A novel tool for individual haplotype inference using mixed data

Journal of Biomedical Science 2009, 16:52 doi:10.1186/

Chen-Pang Lin (michael@ibms.sinica.edu.tw)
Cathy SJ Fann (csjfann@ibms.sinica.edu.tw)

Article type: Research
Submission date: 23 February 2009
Acceptance date: 2 June 2009
Publication date: 2 June 2009

This peer-reviewed article was published immediately upon acceptance. It can be downloaded, printed and distributed freely for any purposes (see copyright notice below). Articles in Journal of Biomedical Science are listed in PubMed and archived at PubMed Central.

2009 Lin and Fann, licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A novel tool for individual haplotype inference using mixed data

Chen-Pang Lin (1), Cathy S. J. Fann (1,2)

(1) Institute of Public Health, National Yang-Ming University, Taipei, Taiwan
(2) Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan

Author: Chenpang Lin, Institute of Public Health, National Yang-Ming University, 155, Sec. 2, Linong Street, Taipei 112, Taiwan. Email: michael@ibms.sinica.edu.tw

Correspondence: Cathy S. J. Fann, PhD, Institute of Biomedical Sciences, Academia Sinica, 128, Academia Road, Section 2, Nankang, Taipei 115, Taiwan. E-mail: csjfann@ibms.sinica.edu.tw

Abstract

Background
In many studies, researchers recruit samples consisting of both independent trios and unrelated individuals. However, most currently available haplotype inference methods do not cope well with such mixed data sets.

Methods
We propose a general and simple methodology, a mixture of weighted multinomial (MIXMUL) approach, that combines separate haplotype information from unrelated individuals and independent trios to infer haplotypes at the individual level.

Results
The new MIXMUL procedure improves on existing methods in that it can accurately estimate haplotype frequencies from mixed data sets and output the probable haplotype pairs in the optimal reconstruction for all subjects that contributed to the estimation. Simulation results showed that MIXMUL competes well with the EM-based method FAMHAP under a few assumed scenarios.

Conclusions
The results showed that MIXMUL provides haplotype frequency estimates of similar accuracy to those obtained from FAMHAP and, in addition, outputs the probable haplotype pairs in the optimal reconstruction for all subjects that contributed to the estimation. If the available data consist of a combination of unrelated individuals and independent trios, the MIXMUL procedure can be used to estimate haplotype frequencies accurately and to output the most likely reconstructed haplotype pair of each subject in the estimation.

Background

Since the completion of the International HapMap Project, millions of single nucleotide polymorphisms (SNPs) and a large amount of haplotype information have been deposited in public databases for studies in population genetics, evolutionary genetics, and complex disease gene mapping. Several studies have demonstrated that haplotypes can provide more power than single markers in detecting associations [1]. However, haplotype information usually cannot be obtained directly from unphased genotype data. Haplotypes can be determined using molecular experimental techniques, but such approaches are still expensive and labor intensive. Therefore, statistical methods are used to infer haplotypes from genotype data, provided that the estimation is accurate.

The population-based case-control design is commonly used in genetic association studies: unrelated cases and controls are collected and compared with respect to the frequencies of some haplotypes. An advantage of this design is its convenience, since recruiting unrelated individuals is both time- and cost-effective. A potential disadvantage is population stratification, which may produce an excess of false-positive results. To avoid a spurious association confounded by population stratification, family-based designs that use relatives of the cases as controls have been proposed. The trio design is the simplest family design, in which both parents of the affected subject are included as family controls. When genotype data for parents are not available, as in studies of late-onset diseases, unaffected siblings can be included instead. Recruitment is the primary disadvantage of family designs, since it usually requires more time and money [2].

A few studies have drawn attention to association studies that use both family-based and population-based controls [3, 4]. One motivation for this type of study is to supplement case-parent trios with additional unrelated controls, if available, to ensure sufficient power to detect association, since parental controls may be hard to recruit, especially for late-onset diseases.

Several statistical and computational approaches have been developed to infer haplotype phase from genotype data of unrelated individuals or of independent trios, but most programs cannot deal with family-based and population-based controls at the same time. Becker and Knapp [5] proposed the program FAMHAP, which calculates maximum likelihood estimates of haplotype frequencies from general nuclear families via the EM algorithm. One feature of this program is that it can estimate haplotype frequencies from data sets consisting of a combination of unrelated individuals and nuclear families. Nevertheless, it cannot output the most likely haplotype pair of each subject.

In this work, we propose a new procedure based on PHASE [6, 7], the MIXture of weighted MULtinomial (MIXMUL) approach, which handles mixed data sets to estimate haplotype frequencies and to reconstruct the most likely haplotype pair of each subject that contributed to the estimation. We evaluated the MIXMUL procedure with respect to the accuracy of haplotype frequency estimation for the combined data sets. We also considered a few factors, including genotyping error and the extent of linkage disequilibrium. The new MIXMUL procedure competes well with the likelihood-based method FAMHAP of Becker & Knapp, which is also applicable to mixed data sets. While FAMHAP outputs only haplotype frequency estimates, the MIXMUL procedure additionally provides a list of the most probable haplotype pairs for every subject in the mixed data set.

Methods

MIXture of weighted MULtinomial (MIXMUL) approach

We assumed Hardy-Weinberg equilibrium for haplotypes in all subjects throughout this study and considered a sample consisting of $n_1$ unrelated individuals and $n_2$ independent trios. For each subject in the mixed sample, we observed $q$ SNPs with alleles 1 and 2 in a specific region of the genome, giving $Q$ possible haplotypes, with $Q \le 2^q$. Let $H = (H_1, H_2, \ldots, H_Q)$ denote the $Q$ possible haplotypes, and let the vector $\theta = (\theta_1, \theta_2, \ldots, \theta_Q)$ describe the unknown haplotype frequencies, with $\sum_{j=1}^{Q} \theta_j = 1$.

Assuming that the haplotypes from the $n_1$ unrelated individuals followed a multinomial distribution, the model was defined as

$$P(x \mid \alpha) = \frac{(2n_1)!}{\prod_{i=1}^{Q} x_i!} \prod_{i=1}^{Q} \alpha_i^{x_i},$$

where $x_i$ is the number of times that haplotype $H_i$ occurs, $x = (x_1, x_2, \ldots, x_Q)$ is the vector of counts for all haplotypes, with $\sum_{i=1}^{Q} x_i = 2n_1$ being the total number of haplotypes, $\alpha_i$ is the mean haplotype frequency for haplotype $i$, and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_Q)$ is the vector of haplotype frequencies.

Because only the parents of each trio contribute to the estimation, the haplotypes from the founders of the $n_2$ independent trios also followed a multinomial distribution, defined as

$$P(y \mid \beta) = \frac{(4n_2)!}{\prod_{j=1}^{Q} y_j!} \prod_{j=1}^{Q} \beta_j^{y_j},$$

where $\beta_j$ is the mean haplotype frequency for haplotype $j$, $\beta = (\beta_1, \beta_2, \ldots, \beta_Q)$ is the vector of haplotype frequencies, $y_j$ is the number of times haplotype $H_j$ occurs within the $n_2$ trios, and $y = (y_1, y_2, \ldots, y_Q)$ is the vector of counts for all haplotypes.

With a weight parameter $\lambda$, we combined these two sets of haplotype frequencies, $\alpha$ and $\beta$. The distribution specified by the mixture of weighted multinomial model is

$$P(z \mid \lambda, \alpha, \beta) \propto \prod_{i=1}^{Q} \left[(1-\lambda)\alpha_i + \lambda\beta_i\right]^{z_i},$$

where $z_i$ is the total number of times haplotype $H_i$ occurs and $z = (z_1, z_2, \ldots, z_Q)$ is the vector of counts for all haplotypes. Thus, $z_i$ is the sum of the number of times that haplotype $H_i$ occurs in the $n_1$ unrelated individuals and in the $n_2$ independent trios, i.e. $z_i = x_i + y_i$. Clearly, the multinomial with haplotype frequency vector $(1-\lambda)\alpha + \lambda\beta$ specifies the same distribution [8].

Simulation study

We examined the performance of the proposed MIXMUL procedure via simulation studies. The simulations were conducted under settings where combining the data is suitable. We generated three dense multi-locus genotype data sets using the program SNaP [9], based on the haplotype frequency distributions of three authentic data sets. The first simulated data set was based on the five SNPs within the N-acetyltransferase 2 gene (NAT2) described by Xu et al. [10]. These five SNPs lie within the 850-bp fragment of NAT2 that was sequenced in each of the 8 individuals to resolve the haplotypes on both chromosomes of each individual. The second simulated data set was based on the eight-SNP haplotype frequencies within the gene ARHGDIB on chromosome 12, which were identified from 44 unrelated individuals [11]. The third haplotype frequency distribution was based on the ten SNPs from the original Oxford ACE data described by Zhang et al. [12]. The ACE data set contained genotypes of 666 individuals at ten SNPs in strong linkage disequilibrium (LD), spanning a very short region (26 kb) within the gene ACE.

First, we used PHASE to obtain the count of each haplotype in the best reconstruction for the unrelated individuals, i.e. $x = (x_1, x_2, \ldots, x_Q)$. We then obtained an estimated haplotype frequency vector $\hat{\alpha} = (\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_Q)$ by

$$\hat{\alpha}_i = \frac{x_i}{\sum_{i=1}^{Q} x_i}.$$

Separately, we used PHASE to obtain the count of each haplotype for the independent trios, forming the vector of counts for all haplotypes in these trio families, $y = (y_1, y_2, \ldots, y_Q)$. A vector of estimated haplotype frequencies $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_Q)$ was obtained by

$$\hat{\beta}_i = \frac{y_i}{\sum_{i=1}^{Q} y_i}.$$

We then used the proposed MIXMUL procedure to calculate the estimated mixture haplotype frequency vector $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_Q)$ as

$$\hat{\theta} = (1-\hat{\lambda})\hat{\alpha} + \hat{\lambda}\hat{\beta}.$$

The weight parameter $\lambda$ was estimated by the maximum likelihood method to obtain the estimator $\hat{\lambda}$. Given the weight estimate $\hat{\lambda}$, a set of estimated haplotype frequencies was obtained by MIXMUL. Based on this set of estimated frequencies for the mixed data set, the MIXMUL procedure can further output the best reconstructed haplotype pair of each subject that contributed to the estimation.

Comparing MIXMUL with FAMHAP

The MIXMUL procedure was developed to handle mixed data sets consisting of unrelated individuals and independent trios for haplotype frequency estimation and haplotype inference. In the MIXMUL approach, the weight parameter estimate $\hat{\lambda}$ was obtained by maximizing the likelihood function of a mixture of two independent multinomial distributions in order to estimate haplotype frequencies for the mixed data sets. We examined the accuracy of haplotype frequency estimation of the proposed MIXMUL procedure and compared its performance with that of the EM-based FAMHAP proposed by Becker & Knapp, which can also estimate haplotype frequencies for mixed data sets. The mixed data sets consisted of 20 unrelated individuals and 5, 10, 15, 20, 25, or 30 independent trios, respectively. Because only the parents of each trio contribute to the estimation, these combinations correspond to a total number of subjects in the estimation of 30, 40, 50, 60, 70, and 80, respectively.

To evaluate the performance of MIXMUL and FAMHAP in the presence of genotyping error, we used the ARHGDIB data to generate simulated data with genotyping error at levels of 0.025, 0.05, and 0.1. The error rates range from [value missing] to 0.1, as suggested by Tintle et al. [13] and Cheng and Lin [14]. To assess the performance of MIXMUL when the extent of LD is weak, we simulated a data set based on eight SNPs with 5 equally frequent haplotypes and average $D'$ = [value missing] for comparison with FAMHAP. In addition, we used the ARHGDIB data (average $D'$ = 0.905) and simulated 30 unrelated individuals and 10 independent trios, respectively, to evaluate the estimation accuracy of rare haplotype frequencies (haplotype frequency < 10%) given the same number of subjects.

Measurement of accuracy

To evaluate the quality of haplotype frequency estimation, the indices $I_F$ and $I_H$ proposed by Excoffier and Slatkin [15] were used. The first is a similarity measure, $I_F$, which describes how close the estimated haplotype frequencies are to the actual frequencies. It is defined as one minus half of the sum of absolute differences between the true and estimated haplotype frequencies, i.e.,

$$I_F = 1 - \frac{1}{2}\sum_{i=1}^{Q} \left|\hat{\theta}_i - \theta_i\right|,$$

where $\hat{\theta}_i$ is the estimated frequency of the $i$-th haplotype and $\theta_i$ is the true haplotype frequency. $I_F$ ranges between 0 and 1 (a value of 1 indicates that the actual and estimated frequencies are identical).

The second measure, $I_H$, was used to quantify the effectiveness of computational algorithms for haplotype reconstruction. $I_H$ compares the number of haplotypes in a sample with the number of haplotypes detected by an algorithm. In a sample with $N$ subjects, the frequency of every true haplotype has to be greater than or equal to $1/(2N)$, which can be used as a lower-bound threshold for determining the existence of a haplotype. Based on this,

$$I_H = \frac{2\,(k_{\mathrm{true}} - k_{\mathrm{missed}})}{k_{\mathrm{true}} + k_{\mathrm{found}}},$$

where $k_{\mathrm{true}}$ is the number of true haplotypes, $k_{\mathrm{found}}$ is the number of identified haplotypes with frequencies above the threshold value, and $k_{\mathrm{missed}}$ is the number of true haplotypes not identified. The measure $I_H$ also varies between 0 (when none of the true haplotypes is identified) and 1 (when the haplotypes identified are exactly the same as the true haplotypes).
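The estimation step and the two accuracy indices described in this Methods section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the golden-section search used to maximize the likelihood kernel over the weight, the toy counts, and the reading of "true" and "found" haplotypes as those with frequency at or above 1/(2N) are all assumptions made for illustration. The haplotype counts are taken as given, standing in for the PHASE output.

```python
import math


def relative_frequencies(counts):
    """Turn haplotype counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def mixture_log_likelihood(lam, z, alpha, beta):
    """Kernel of the mixture model: sum_i z_i * log((1 - lam) * alpha_i + lam * beta_i)."""
    total = 0.0
    for z_i, a_i, b_i in zip(z, alpha, beta):
        if z_i == 0:
            continue  # unobserved haplotypes contribute nothing
        p_i = (1.0 - lam) * a_i + lam * b_i
        if p_i <= 0.0:
            return float("-inf")
        total += z_i * math.log(p_i)
    return total


def estimate_weight(z, alpha, beta, tol=1e-10):
    """Maximize the kernel over lam in [0, 1] by golden-section search.

    The kernel is concave in lam, so a one-dimensional search is enough.
    (Assumed optimizer; the source only says maximum likelihood was used.)
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if mixture_log_likelihood(left, z, alpha, beta) > mixture_log_likelihood(right, z, alpha, beta):
            hi = right
        else:
            lo = left
    return (lo + hi) / 2.0


def mixture_frequencies(lam, alpha, beta):
    """theta_hat = (1 - lam) * alpha_hat + lam * beta_hat."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(alpha, beta)]


def similarity_index(theta_true, theta_hat):
    """I_F = 1 - 0.5 * sum_i |theta_hat_i - theta_i|."""
    return 1.0 - 0.5 * sum(abs(h - t) for t, h in zip(theta_true, theta_hat))


def haplotype_index(theta_true, theta_hat, n_subjects):
    """I_H = 2 * (k_true - k_missed) / (k_true + k_found), threshold 1 / (2N)."""
    threshold = 1.0 / (2 * n_subjects)
    true_set = {i for i, t in enumerate(theta_true) if t >= threshold}
    found_set = {i for i, h in enumerate(theta_hat) if h >= threshold}
    k_missed = len(true_set - found_set)
    return 2.0 * (len(true_set) - k_missed) / (len(true_set) + len(found_set))


# Toy data (hypothetical): 10 unrelated individuals and 10 trios (20 parents).
x = [12, 6, 2, 0]   # haplotype counts among unrelated individuals, sum = 2 * 10
y = [10, 25, 3, 2]  # haplotype counts among trio parents, sum = 4 * 10
z = [x_i + y_i for x_i, y_i in zip(x, y)]

alpha_hat = relative_frequencies(x)
beta_hat = relative_frequencies(y)
lambda_hat = estimate_weight(z, alpha_hat, beta_hat)
theta_hat = mixture_frequencies(lambda_hat, alpha_hat, beta_hat)

theta_true = [0.4, 0.5, 0.1, 0.0]  # hypothetical true frequencies
print(lambda_hat, theta_hat)
print(similarity_index(theta_true, theta_hat))
print(haplotype_index(theta_true, theta_hat, n_subjects=30))
```

In this sketch the plug-in frequencies are built from the same counts that enter the likelihood, so the maximizing weight lands at the share of trio-founder haplotypes in the pooled sample (here 40 of 60), and the mixture estimate coincides with the pooled relative frequencies. Whether the MIXMUL software computes the weight in exactly this way is not stated in the text.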

Results

The plots in Figure 1 compare the accuracy of haplotype frequency estimation between MIXMUL and FAMHAP using the NAT2 data (panels a and b), the ARHGDIB data (panels c and d), and the ACE data (panels e and f). The results were consistent with the notion that MIXMUL has accuracy comparable to that of the EM-based program FAMHAP for haplotype frequency estimation. Moreover, MIXMUL can output the most likely reconstructed haplotype pair for every subject in the estimation, while FAMHAP outputs only overall estimated haplotype frequencies.

Genotyping error is known to severely affect the performance of haplotype frequency estimation algorithms. Because the results were similar for each level of genotyping error, Figure 2 shows only the results for a genotyping error rate of 0.05. They indicated that in most cases the accuracy of MIXMUL was slightly better than that of FAMHAP.

The extent of linkage disequilibrium (LD) between SNP markers also has an important effect on haplotype-inference accuracy. We evaluated the performance of MIXMUL under a scenario of weak LD. Plots of the accuracy measures for the simulated data set are shown in Figure 3. MIXMUL performed well compared with FAMHAP even when the extent of LD was weak.

Most haplotype inference methods can estimate common haplotypes (haplotype frequency > 10%) very accurately when the LD across the constituent loci is strong, but they perform less well in estimating rare haplotypes (haplotype frequency < 10%). Thus, we used the ARHGDIB data and focused on the estimation accuracy of rare haplotype frequencies given the same number of subjects. Our results showed that the measure $I_F$ for 30 unrelated individuals and for 10 independent trios was [value missing] and [value missing], respectively. Indeed, for rare haplotype inference, taking family information into account increases the accuracy of frequency estimation.

Discussion

Haplotype inference for a large number of tightly linked markers through close relatives has drawn much attention in recent years, and several novel methods and programs have been developed. However, researchers are likely to collect samples consisting of both independent trios and unrelated individuals. The MIXMUL procedure proposed here can be conveniently and efficiently used to handle such data sets for haplotype inference. It also outputs the most likely reconstructed haplotype pair of each subject that contributed to the estimation.

The current version of the PHASE program (version 2.1) can reconstruct haplotypes from population-based or trio-based genotype data separately, but it cannot handle mixed data sets consisting of both. The FAMHAP program, on the other hand, is able to deal with mixed data sets, but it does not provide haplotype inference at the individual level; it provides only overall haplotype frequencies. Based on the outputs of the PHASE program, we proposed the MIXMUL procedure, which deals with mixed data sets and infers the most probable haplotypes at the individual level by using a weighted function.

Becker and Knapp proposed the program FAMHAP, which calculates maximum likelihood estimates of haplotype frequencies from general nuclear families with an arbitrary number of children via the EM algorithm. This program can estimate haplotype frequencies from data sets consisting of a combination of unrelated individuals and nuclear families. In our simulations, we compared several factors that affect haplotype frequency estimation, including the number of SNP markers, the genotyping error rate, and the extent of LD. On the basis of our results, the accuracy of haplotype frequency estimation for MIXMUL competes well with that of FAMHAP when all these factors are considered. Furthermore, MIXMUL provides not only accurate haplotype frequency estimates but also the most likely reconstructed haplotype pair of each subject.

As suggested by other studies [5, 16, 17], including family information improves the accuracy of haplotype estimation. We examined whether adding families to unrelated individual data could improve the accuracy of haplotype frequency estimation with MIXMUL. Our results showed that taking family information, or partial family information, into account did improve the accuracy of haplotype frequency estimation. However, the accuracy obtained with more than 60 unrelated individuals was almost the same as that obtained with trio or mixed data.

A systematic approach was not used in this study to address these issues, because true haplotype frequencies (derived from experiments) were more suitable for comparison with the inferred haplotype frequencies using $I_F$ and $I_H$ calculated from MIXMUL and FAMHAP. Thus, we selected three published data sets available online in which the true haplotype frequencies were derived from experiments. We carried out simulations based on these known frequencies; the number of SNPs ranged from 5 to 10, with the linkage disequilibrium coefficient ($D'$) ranging from [value missing] to [value missing]. The genotyping error rates were 0.025, 0.05 and 0.1, as suggested by other studies [13, 14]. With this approach, we obtained approximate information about the impact of these factors on accuracy in the comparison between MIXMUL and FAMHAP.

Incomplete trios (one parent, one child) in mixed data sets can be analyzed by MIXMUL by setting one parent in a trio as missing. We used the ARHGDIB data to conduct 100 simulations to evaluate the impact of using incomplete trios. The results showed that when only incomplete trios were used in the mixed data sets, the accuracy of haplotype frequency estimation was lower than that obtained with complete trios (two parents, one child), because less family information was available. For example, given 10 unrelated individuals and 5 incomplete trios, $I_F$ was [value missing], and it was [value missing] if these trios were complete; $I_H$ was [value missing] for the former and [value missing] for the latter. Given 5 unrelated individuals and 40 incomplete trios, $I_F$ was [value missing], and it was [value missing] if these trios were complete; $I_H$ was [value missing] for the former and [value missing] for the latter. The differences in the accuracy measures between incomplete and complete trios became smaller as the number of subjects used in the estimation increased.

Conclusions

Haplotypes capture LD information in chromosomal regions descended from ancestral chromosomes. Such information is of considerable interest in population genetics and genetic epidemiology studies. With widespread application of new generations of genotyping techniques, especially high-density SNP arrays, the human genome will eventually be unlocked by linking haplotype information to biomarker and phenotypic data.

In current association studies, researchers are likely to recruit mixed samples consisting of independent trios and unrelated individuals. However, most existing methods for haplotype inference and frequency estimation cannot cope with such mixed data sets. Although the EM-based FAMHAP of Becker & Knapp can deal with this kind of data, it cannot reconstruct haplotype pairs at the individual level. Therefore, in this study we developed the MIXMUL procedure, based on the program PHASE, to deal with mixed data sets. According to our results, MIXMUL provides haplotype frequency estimates as accurate as those of FAMHAP and additionally outputs the probable haplotype pairs in the optimal reconstruction for every subject that contributed to the estimation. If the available data consist of a combination of unrelated individuals and independent trios, the proposed MIXMUL procedure can be used to obtain accurate haplotype frequency estimates in the mixed sample and to output the most likely reconstructed haplotype pair of each subject in the estimation for further haplotype-level association analysis. The MIXMUL procedure is available for download.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CL carried out this study and drafted the manuscript. CSJF conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This project was partially funded by two grants from the Taiwan National Science Council (NSC B and NSC B MY2).

References

1. Browning BL, Browning SR: Efficient multilocus association mapping for whole genome association studies using localized haplotype clustering. Genet Epidemiol 2007, 31.
2. Laird NM, Lange C: Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet 2006, 7(5).
3. Nagelkerke NJ, Hoebee B, Teunis P, Kimman TG: Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet 2004, 12(11).
4. Kazeem GR, Farrall M: Integrating case-control and TDT studies. Ann Hum Genet 2005, 69.
5. Becker T, Knapp M: Maximum-likelihood estimation of haplotype frequencies in nuclear families. Genet Epidemiol 2004, 27.
6. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001, 68.
7. Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing data imputation. Am J Hum Genet 2005, 76.
8. Yantis S, Meyer DE, Smith JE: Analyses of multinomial mixture distributions: new tests for stochastic models of cognition and action. Psychol Bull 1991, 110(2).
9. Nothnagel M: Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by supervised learning methods. Am J Hum Genet 2002, 71 (Suppl.)(4).
10. Xu CF, Lewis K, Cantone KL, Khan P, Donnelly C, White N, Crocker N, Boyd PR, Zaykin DV, Purvis IJ: Effectiveness of computational methods in haplotype prediction. Hum Genet 2002, 110.
11. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES: Linkage disequilibrium in the human genome. Nature 2001, 411.
12. Zhang K, Zhao H: A comparison of several methods for haplotype frequency estimation and haplotype reconstruction for tightly linked markers from general pedigrees. Genet Epidemiol 2006, 30.
13. Tintle NL, Ahn K, Mendell NR, Gordon D, Finch SJ: Characteristics of replicated single-nucleotide polymorphism genotypes from COGA: Affymetrix and Center for Inherited Disease Research. BMC Genet 2005, 6 (Suppl 1).
14. Cheng KF, Lin WJ: Simultaneously correcting for population stratification and for genotyping error in case-control association studies. Am J Hum Genet 2007, 81(4).
15. Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995, 12.
16. Becker T, Knapp M: Efficiency of haplotype frequency estimation when nuclear family information is included. Hum Hered 2002, 54.
17. Schaid DJ: Relative efficiency of ambiguous vs. directly measured haplotype frequencies. Genet Epidemiol 2002, 23.

Figure legends

Figure 1. Performance comparisons of MIXMUL and FAMHAP
The upper panels (a and b) show the average measures of accuracy based on the NAT2 data. The middle panels (c and d) show the average measures of accuracy based on the ARHGDIB data. The lower panels (e and f) show the average measures of accuracy based on the ACE data. The left panels (a, c and e) and the right panels (b, d and f) show the similarity indices $I_F$ and $I_H$, respectively, between the estimated and the actual haplotype frequencies.

Figure 2. Performance comparison when genotyping error rate is 0.05
The graph shows the accuracy indices $I_F$ and $I_H$ for MIXMUL and FAMHAP based on the ARHGDIB data, given a genotyping error rate of 0.05.

Figure 3. Performance comparison when LD extent is weak
The graph shows the accuracy indices $I_F$ and $I_H$ for MIXMUL and FAMHAP based on the simulated data when the extent of LD is weak (average $D'$ = [value missing]).
