An efficient haplotyping method with DNA pools


© 2002 Oxford University Press

Ester Inbar 1, Benjamin Yakir 2 and Ariel Darvasi 1,3,*

1 Life Sciences Institute, Givat-Ram, The Hebrew University of Jerusalem, 2 Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91904, Israel and 3 IDgene Pharmaceuticals Ltd, Bet Offer, Nahum Hafzadi 5, Jerusalem 91344, Israel

Received April 16, 2002; Revised and Accepted June 7, 2002

ABSTRACT

Determination of haplotype frequencies (the joint distribution of genetic markers) in large population samples is a powerful tool for association studies. This is due to the greater extent of polymorphism of haplotypes, since any two bi-allelic single nucleotide polymorphisms (SNPs) generate a potential four-allele genetic marker. Therefore, a haplotype may capture a given functional polymorphism with higher statistical power than its SNP components. The statistical estimation of haplotype frequencies, usually employed in linkage disequilibrium studies, requires individual genotyping for each SNP in the haplotype, thus making it an expensive process. In this study, we describe a new method for direct measurement of haplotype frequencies in DNA pools by allele-specific, long-range haplotype amplification. The proposed method allows the efficient determination of haplotypes composed of two SNPs in close vicinity (up to 20 kb).

INTRODUCTION

Haplotype analysis is becoming a common tool in association studies. The joint distribution of adjacent markers in a population, which is actually the haplotype frequency, represents the correlation between those markers. Several recent studies showed that haplotypes, if used as genetic markers, have higher statistical power than individual markers (1,2). A major difficulty in using haplotypes as genetic markers lies in determining the haplotype phase for individuals who are heterozygous for more than one marker. There are several approaches to overcome this difficulty.
However, for an approach to be practical, it must meet low-cost and high-throughput requirements. Only such approaches can potentially be used in studies with large samples and a large number of genetic markers. Haplotype phase can be established by genotyping family members in order to infer parental chromosomes. This, however, requires the recruitment and genotyping of relatives, who may not be available or may be expensive to recruit. Methods for chromosomal isolation (3), even though advantageous for long-range haplotypes, are currently applicable only to very small sample sizes due to the high costs involved. Therefore, the most common approach presently used in association studies is the statistical estimation of the frequencies of the various haplotype phases. In this approach, an algorithm estimates the most likely haplotype frequencies, given the genotype distribution in a sample (4,5). Unfortunately, obtaining the primary data necessary for the statistical estimation requires individual genotyping of all markers included in the haplotype, which is expensive and time-consuming.

In this study, we describe a new method for haplotyping using DNA pools. Our approach is based on allele-specific PCR amplification from pooled DNA samples and quantitative genotyping of the PCR products.

MATERIALS AND METHODS

DNA samples and SNP genotyping

DNA was extracted from blood samples using the Nucleon BACC kit (Amersham). Two SNP markers were selected from the APOE gene sequence, denoted SNP888 and SNP988 according to Martin et al. (6). Primers were designed using the Primer3 program (Whitehead Institute for Biomedical Research, primer3_www.cgi). Long PCR amplification was performed using the Expand 20 kb plus PCR System (Roche Molecular Biochemicals), according to the manufacturer's instructions. Quantitative SNP genotyping was performed by Pyrosequencing™, according to the manufacturer's instructions.
This methodology provides accurate estimates, as sources of error such as preferential amplification and allele drop-out are rare (7). Each PCR amplification for the quantification reaction was repeated three times. The mean of the three measurements was taken as the quantitative result.

Pool assembly

Individual samples were genotyped at both SNPs. Individuals who were homozygous for at least one SNP were selected to assemble pools with known haplotype frequencies (Table 1). Optical density measurements for each individual sample were carried out in six replicates using a μQuant spectrophotometer (Bio-Tek Instruments). The DNA samples were then diluted to a set concentration of 10 ng/μl. These DNA samples were mixed in appropriate ratios to generate several pools, each with different known haplotype frequencies.

*To whom correspondence should be addressed at: Life Sciences Institute, Givat-Ram, The Hebrew University of Jerusalem, Jerusalem 91904, Israel. Email: arield@cc.huji.ac.il

Nucleic Acids Research, 2002, Vol. 30 No. 15 e76

Table 1. Haplotype composition of the DNA template samples

DNA template   Sample size   Haplotype X888-X988 and frequency (%)
Pools
  P1           6             T-T 25; T-C 25; C-T 25; C-C 25
  P2           10            T-T 45; T-C 10; C-T 35; C-C 10
  P3           8 (a)         T-T 10; T-C 10; C-T 45; C-C 35
  P4           12            T-T 42; T-C 4; C-T 33; C-C 21
Individuals
  I1           1             T-T 50; C-T 50
  I2           1             T-C 50
  I3           1             T-T 50; T-C 50
  I4           1             C-T 50
  I5           1             T-T 50; C-C 50

(a) Pool 3 was assembled from an equal amount of DNA from each of seven different individuals and a triple amount of DNA from an eighth individual.

Estimating haplotype frequencies: experimental procedure

The key component of our technique is the measurement of the allele frequencies at SNP 1, given the allele at SNP 2. This is carried out through selective amplification of a DNA segment that includes both SNPs, according to a principle reported for individual haplotype genotyping, in the context of a bi-allelic Alu deletion, by Michalatos-Beloin et al. (8). In each of two alternate reactions, we use an allele-specific forward primer for one of the two alleles of SNP 2 (the 3′ nucleotide of each primer is one of the two polymorphic nucleotides) and a common reverse primer, located beyond SNP 1 (Fig. 1). The template to be amplified is a DNA pool. The distribution of SNP 1 in each of the reaction products and the distribution of SNP 2 in the original (unamplified) pool are then quantified.

We recommend using a double heterozygous individual as a control template. Any individual can be regarded as a natural pool of two haplotypes with known frequencies. A double heterozygote is a useful control for various stages of the process: the specificity of the allele-specific primers, the validity of the long-range PCR and the quantitative genotyping reaction at both SNPs.

Estimating haplotype frequencies: statistical procedures

Consider a pair of markers included in a haplotype.
The term 'haplotype frequency' is synonymous with the term 'joint distribution of the markers', i.e. the joint distribution of a pair of bi-allelic markers is described by a table and the haplotype frequencies are the entries in that table. There are three degrees of freedom in the determination of the entries to the table (since the entries sum to 1). Thus, it is sufficient to measure three independent parameters in order to reconstruct the complete table, e.g. the marginal distribution of SNP 2, the conditional distribution of SNP 1 given one allele at SNP 2 and the conditional distribution of SNP 1 given the other allele at SNP 2. Since the joint distribution can be expressed by the conditional distribution via the equation:

\[
\text{joint distribution} = (\text{conditional distribution}) \times (\text{marginal distribution})
\]

it is possible to calculate the joint distribution of both SNPs from these three measurements using equations 1–4:

\begin{align}
p(A,B) &= p(A \mid B)\, p(B) \tag{1} \\
p(a,B) &= [1 - p(A \mid B)]\, p(B) \tag{2} \\
p(A,b) &= p(A \mid b)\, [1 - p(B)] \tag{3} \\
p(a,b) &= [1 - p(A \mid b)]\, [1 - p(B)] \tag{4}
\end{align}

where A,a and B,b are the alleles for SNP 1 and SNP 2, respectively. Measurement of the marginal distributions is attained by direct quantitative genotyping. Measurement of conditional distributions is enabled by quantitative genotyping of selectively amplified samples, as described above.

RESULTS AND DISCUSSION

The three parameters measured were $p(T_{988})$ (the marginal distribution of SNP988), $p(T_{888} \mid C_{988})$ (the conditional distribution of SNP888 given allele C at SNP988) and $p(T_{888} \mid T_{988})$ (the conditional distribution of SNP888 given allele T at SNP988), shown in Table 2. Our aim was to reach a highly specific amplification of the segment containing both SNPs, in a manner that discriminates between segments with allele C at SNP988 and segments with allele T. We performed both amplifications on a number of DNA template samples, as shown in Table 1, and calculated the haplotype frequencies using equations 1–4.
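As a rough illustration of equations 1–4, the following Python sketch rebuilds the four haplotype frequencies from the three measured quantities. The function name `haplotype_frequencies` and its argument names are illustrative assumptions and do not come from the original work. Here A and B stand for allele T at SNP888 and allele T at SNP988, respectively. The example inputs are derived from the expected composition of pool P2 and of a double heterozygote in Table 1; they are not measured values.

```python
def haplotype_frequencies(p_B, p_A_given_B, p_A_given_b):
    """Joint distribution of two bi-allelic SNPs, following equations 1-4.

    p_B          -- marginal frequency of allele B at SNP 2 in the pool
    p_A_given_B  -- frequency of allele A at SNP 1 among B-carrying amplicons
    p_A_given_b  -- frequency of allele A at SNP 1 among b-carrying amplicons
    (All names are hypothetical; they only mirror the symbols in the text.)
    """
    for name, value in (("p_B", p_B),
                        ("p_A_given_B", p_A_given_B),
                        ("p_A_given_b", p_A_given_b)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    return {
        ("A", "B"): p_A_given_B * p_B,                    # equation 1
        ("a", "B"): (1.0 - p_A_given_B) * p_B,            # equation 2
        ("A", "b"): p_A_given_b * (1.0 - p_B),            # equation 3
        ("a", "b"): (1.0 - p_A_given_b) * (1.0 - p_B),    # equation 4
    }


# Expected (not measured) inputs for pool P2 in Table 1:
# p(T988) = 0.45 + 0.35, p(T888|T988) = 0.45 / 0.80, p(T888|C988) = 0.10 / 0.20
pool_p2 = haplotype_frequencies(0.80, 0.5625, 0.50)

# A double heterozygote carrying only T-T and C-C haplotypes
double_het = haplotype_frequencies(0.50, 1.0, 0.0)
```

In this sketch the key `("A", "b")` corresponds to the T-C haplotype in the X888-X988 notation of Table 1. Because the four entries are products of a conditional and a marginal probability, they always sum to 1, so measurement error in any of the three inputs is redistributed among the haplotypes rather than producing an invalid table.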
We then compared the estimated and expected frequencies for each template. The method's ability to identify and measure haplotypes is well illustrated by comparing the results for I5 (a double heterozygote) and for P1. Both templates have the same allele frequency at each marker (the marginal distributions in Table 3). Following the discriminating amplifications, however, it is evident that I5 presents only two haplotypes (T888-T988 and C888-C988), while P1 is composed equally of all four possible haplotypes. When comparing estimated and expected haplotype frequencies for all templates, similar results were obtained (Table 3). The difference between the estimated and expected value ranged from 0 to 0.04, with an average of for the individuals and for the pools. This error rate is not significantly higher than that of estimating standard allele frequencies of SNPs in DNA pools (9,10). It should be noted that the pools were constructed using a small number of individuals (six to twelve) and, consequently, our results suffer from the unavoidable inaccuracies associated with the assembly of small pools. These inaccuracies will be avoided in practical experiments, where the pools will usually consist of a relatively large number of individuals (11). In addition, a small number of replications can increase accuracy further.

Thus the accuracy achieved with the described technology should allow its reasonable use for various haplotyping purposes. These include primarily: (i) rapid assessment of common haplotypes across the genome in different populations; (ii) association analysis with haplotypes to identify the genetic basis of complex traits. In this context, the method suggested combines the advantages of case-control genetic association, haplotype analysis and DNA pooling for linkage disequilibrium mapping.

Figure 1. Estimating haplotype frequencies of two markers, SNP 1 and SNP 2, using DNA pooling. (Step 1) Measurement of the allele frequencies of SNP 2 [probability $p(B)$]. (Step 2) Design of two allele-specific forward primers ending at the polymorphic base of SNP 2 and a common reverse primer beyond SNP 1. (Step 3) Two separate PCR amplifications are carried out, resulting in alternative amplicons, each carrying a different allele of SNP 2. (Step 4) Quantitative genotyping of the PCR products from step 3 for measurement of allele frequencies at SNP 1 [probabilities $p(A \mid B)$ and $p(A \mid b)$].

Table 2. Quantitative genotyping results (columns: template, $p(T_{988})$, $p(T_{888} \mid T_{988})$, $p(T_{888} \mid C_{988})$; rows: pools P1 to P4 and individuals I1 to I5)

We have described the method for haplotypes composed of two SNPs.
However, it can also be used for haplotyping pairs of polymorphisms where only one of them is a SNP and the other is any kind of polymorphism that can be genotyped quantitatively, e.g. microsatellite, Ins/Del. The method can also be extended to haplotypes comprising more than two markers through a series of nested PCR amplifications.

It should be noted that the long-range, allele-specific PCR required for this method is technically more difficult than a standard PCR and may need some optimization to reach the required high specificity. The final result depends on the discrimination achieved. Yield reduction caused by a high GC content or low quality of the DNA will not have a significant effect on the results as long as the specificity of the primers ensures a proportional amplification of segments containing each allele of SNP 1. Additionally, only a very low concentration of DNA is needed for the quantitative genotyping of that SNP in the next step. Low proofreading ability in the amplification is not crucial either, since the only amplified nucleotides that could affect the results are the two SNPs.

The most important factor is the allele-specific primer design. Since these primers have to end at a specific nucleotide and have to fit a certain melting temperature, not every pair of SNPs will have a suitable sequence for allele-specific primers. The achievement of highly discriminating primers at the SNP is essential for this method to work. It should be mentioned, though, that the allele-specific primers can be located on either of the two SNPs. Furthermore, since any three independent parameters can be employed, it is sufficient to have only one good allele-specific primer. The other two parameters can be, for example, the allele probabilities of both SNPs in the original pool.

The presented method uses DNA pools for haplotype frequency estimation and therefore significantly reduces the number of genotyping reactions necessary for haplotyping.
For example, determining haplotype frequencies of a two-SNP haplotype in a sample population of 100 individuals would normally require 200 SNP genotyping reactions. Note that even when individual genotyping is performed, a statistical algorithm is applied, which also introduces some error. In contrast, with the method suggested here, only five reactions are needed: two allele-specific amplifications and three genotyping reactions. If the sample size is increased to, say, 1000, the number of reactions under individual genotyping increases accordingly, whereas under the method presented here the number of reactions remains fixed. Therefore, this method may provide an efficient solution to the growing need for haplotype data collection.

Table 3. Estimated and expected joint distributions of SNP988 and SNP888 in the template samples

ACKNOWLEDGEMENTS

This study was supported by the FIRST foundation of the Israeli Academy of Science.

REFERENCES

1. Akey,J., Jin,L. and Xiong,M. (2001) Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet., 9, 291–
2. Zollner,S. and von Haeseler,A. (2000) A coalescent approach to study linkage disequilibrium between single-nucleotide polymorphisms. Am. J. Hum. Genet., 66, 615–
3. Rasko,J.E., Battini,J.L., Kruglyak,L., Cox,D.R. and Miller,A.D. (2000) Precise gene localization by phenotypic assay of radiation hybrid cells. Proc. Natl Acad. Sci. USA, 97, 7388–
4. Fallin,D. and Schork,N.J. (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet., 67, 947–
5. Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921–
6. Martin,E.R., Lai,E.H., Gilbert,J.R., Rogala,A.R., Afshari,A.J., Riley,J., Finch,K.L., Stevens,J.F., Livak,K.J., Slotterbeck,B.D., Slifer,S.H., Warren,L.L., Conneally,P.M., Schmechel,D.E., Purvis,I., Pericak-Vance,M.A., Roses,A.D. and Vance,J.M. (2000) SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am. J. Hum. Genet., 67, 383–
7. Neve,B., Froguel,P., Corset,L., Vaillant,E., Vatin,V. and Boutin,P. (2002) Rapid SNP allele frequency determination in genomic DNA pools by pyrosequencing. Biotechniques, 32, 1138–
8. Michalatos-Beloin,S., Tishkoff,S.A., Bentley,K.L., Kidd,K.K. and Ruano,G. (1996) Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR. Nucleic Acids Res., 24, 4841–4843.

9. Germer,S., Holland,M.J. and Higuchi,R. (2000) High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res., 10, 258–
10. Giordano,M., Mellai,M., Hoogendoorn,B. and Momigliano-Richiardi,P. (2001) Determination of SNP allele frequencies in pooled DNAs by primer extension genotyping and denaturing high-performance liquid chromatography. J. Biochem. Biophys. Methods, 47, 101–
11. Jawaid,A., Bader,J.S., Purcell,S., Cherny,S.S. and Sham,P. (2002) Optimal selection strategies for QTL mapping using pooled DNA samples. Eur. J. Hum. Genet., 10, 125–132.