Tag SNP Selection. Jingwu He, Kelly Westbrooks and Alexander Zelikovsky. Abstract


Linear Reduction Method for Predictive and Informative Tag SNP Selection

Jingwu He, Kelly Westbrooks and Alexander Zelikovsky
Department of Computer Science, Georgia State University, Atlanta, GA

A preliminary version of this paper appeared in He et al. (2004).

Abstract

Constructing a complete human haplotype map is helpful when associating complex diseases with their related SNPs. Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. It is therefore desirable to reduce the number of SNPs that must be sequenced to a small number of informative representatives called tag SNPs. In this paper, we propose a new linear algebra-based method for selecting and using tag SNPs. Our method is purely combinatorial and can be combined with linkage disequilibrium (LD) and block-based methods. We measure the quality of our tag SNP selection algorithm by comparing actual SNPs with SNPs predicted from selected linearly independent tag SNPs. For example, our experiments show that for long haplotypes (> SNPs), knowing only 0.4% of all SNPs, our method predicts an unknown haplotype with 98% accuracy when the prediction is based on a sample of 10% of the population. Comparison with existing predictive tagging

methods of Halldorsson et al. (2004) and Zhang et al. (2004) shows that our method achieves better accuracy using fewer tag SNPs.

Keywords: Single nucleotide polymorphism, tag SNP, linear independence.

1 Introduction

Genome-wide SNP scans for disease association tests are still infeasible. To decrease SNP genotyping cost, it is attractive to sequence only a small number of SNPs, called tag SNPs, and then infer the rest of the SNPs (or certain suspicious SNPs) from the sequenced tag SNPs. An interesting open problem is to find optimal subsets of tag SNPs.

Another motivation for tagging is to identify informative SNPs, e.g., SNPs that are responsible for genetic diseases, thus reducing the noise introduced by irrelevant SNPs. Since the SNPs responsible for complex diseases are unknown, the tag SNPs should allow for the reconstruction of all (or almost all) SNPs. Note that a complete, 100% correct reconstruction is impossible because a single mutation may spoil an otherwise reliable reconstruction. The assumption is that the genotyped tag SNPs carry sufficient statistical power for identifying disease associations.

An established way of selecting tag SNPs is based on linkage disequilibrium (LD), in which the entire SNP sequence is partitioned into blocks, i.e., contiguous SNP segments within which the number of different haplotypes is comparatively small (see Avi-Itzhak et al., 2003; Carlson et al., 2004; Judson et al., 2002; Patil et al., 2001; Zhang et al., 2004). Due to the low diversity within a block, the SNPs are highly correlated and a very small number of tag SNPs can predict the values of all other SNPs. In Clark (2003), one can find a valuable discussion of the tasks and

limitations of LD approaches.

We further assume that the population is given as a set of haplotypes, where each haplotype is a sequence of SNPs. Below we formulate two different objectives for tag SNP selection. The first (traditional) formulation assumes that the entire population is given and asks for the minimum lossless information, i.e., the minimum number of tags.

Informative Tag SNP Selection Problem. Given the full pattern of all haplotypes, find the minimum set of tag SNPs that distinguishes any two different haplotypes.

The second formulation assumes that the available data represents only a small sample of the entire population and that the solution should also include a method to predict the SNPs which are not tagged. This formulation is similar to a data compression problem (see Bafna et al. (2003)).

Predictive Tag SNP Selection Problem. Given the full pattern of all haplotypes in a small population sample, find the minimum number of tag SNPs and a method for reconstructing each haplotype in the entire population from these tags.

This paper focuses on the predictive tag SNP selection problem. We suggest the application of a new linear reduction method which has already been successfully applied to the haplotype inference problem in He et al. (2004). The main idea behind the new method is to choose linearly independent SNPs as tags. To measure the quality of our approach, we directly count the number of correct predictions rather than introduce intermediate objectives.

Our experimental results show that we need significantly fewer tag SNPs than LD-based methods to fully predict the entire set of haplotypes. For example, we randomly choose 100 haplotypes out of a simulated population of 1000 haplotypes each with SNPs, then we extract 100 linearly independent SNPs as tags. Given the values of the tag SNPs of any haplotype h out of the 900 haplotypes which

did not participate in tag SNP selection, we can reconstruct all SNPs of h with an average error of 2%. This is in contrast with LD-based methods, which require more tag SNPs to achieve the same prediction accuracy (see Bafna et al., 2003).

Our contributions include:

- Formulation of the Predictive Tag SNP Selection Problem.
- A linear reduction method for tagging that applies standard Gauss-Jordan elimination.
- Experimental verification of our linear reduction method on simulated and real data. For example, our experiments show that for long haplotypes (> SNPs), knowing only 0.4% of all SNPs, our method predicts an unknown haplotype with 98% accuracy when the prediction is based on a sample of 10% of the population. Comparison with existing predictive tagging methods shows that our method achieves better accuracy using fewer tag SNPs.

The rest of the paper is organized as follows. In the next section we summarize the previous work on haplotype tagging problems. In Section 3 we give a formal description of the informative and predictive tag SNP selection problems. Section 4 formally introduces the linear reduction method and gives detailed descriptions of the implementation of the linear reduction algorithms. Section 5 presents an empirical study of our suggested methods as well as a comparison with the methods of Halldorsson et al. (2004) and Zhang et al. (2004) on real data.

2 Previous Work

Previous research on tag SNP selection has explored both lossless and lossy methods. Lossless methods select a set of tag SNPs that capture 100% of the haplotypic variation in the sample population. Lossy methods typically select fewer tags than lossless methods, but with some tolerated amount of information loss.

Avi-Itzhak et al. (2003) presented a method for selecting tags which can be used in both a lossless and a lossy manner. The central idea behind both variants is to eliminate the tags that contribute the least to the Shannon entropy of the haplotype set. First, identical columns and complementary columns are eliminated; then columns that do not reduce the number of unique rows are eliminated. They note that selecting a maximal linearly independent set of column vectors would miss opportunities to eliminate complementary SNPs, and illustrate this with the 2-by-2 identity matrix. (In Section 4 we show how to adjust linear reduction to avoid such an example.) Their lossless method reduces the number of SNPs describing the haplotype diversity within an African-American and a Caucasian population by 25% and 36%, respectively.

Zhang et al. (2004) introduced a block-based dynamic programming algorithm for haplotype inference that is capable of reconstructing 90% of the original data using only 35% of SNPs as tags. They used the partition-ligation expectation maximization algorithm of Qin et al. (2002) for haplotype inference and, as a result, provided a method of performing association studies directly on genotype data.

Sebastiani et al. (2003) described a lossless method called BEST (Best Enumeration of SNP Tags) for identifying a minimal set of tag SNPs from haplotype data. BEST selects tags by determining if a candidate tag is a boolean function of SNPs already chosen as tags. The BEST method selected 14% of SNPs as tags from an African-American population and 10% from a European-American population by considering individual genes, each ranging from 5 to 229 SNPs in length. However, its effectiveness on a genome-wide scale is still unproven. According to their method, 95% of the tags selected from the European-American population were also selected from the African-American population, which provides evidence for a genetic bottleneck event that occurred long ago as hominids migrated out of Africa to settle Europe and Asia.

Halldorsson et al. (2004) defined an informativeness measure of how well a set of tags describes a haplotype sample. Both the informativeness measure and their tag SNP selection method consider a graph whose vertices are SNPs; an edge is placed between two SNPs if one SNP can be used to reliably predict the other. Their method seeks the set of SNPs that maximizes the informativeness measure on the haplotype data. The method can achieve prediction rates of 90% based on only 20% of SNPs. Halldorsson's method differs from the others in that it is a block-free method. Block-based methods are restricted to identifying tags only within local contiguous sequences of SNPs where the haplotype diversity is low. Block-free methods have the capability to identify tags across an entire genome. Like Halldorsson's method, the linear reduction method we propose is a block-free method.

Our tagging problem formulations and the above approaches do not take haplotype frequency into account when selecting tag SNPs. For a discussion of how haplotype frequency affects tag SNP selection, see (Stram et al., 2003; Chapman et al., 2003; Forton et al., 2005).

3 Haplotype Tagging Problem Formulations

Assume that there is a population $P$ of haplotype vectors. The input to the tag SNP selection problem is a set of $n$ haplotype vectors $H = \{h_i \mid i = 1,\dots,n\}$, each having $m$ coordinates (positions), $h_i = (h_{i,1},\dots,h_{i,m})$. Traditionally, each $h_{i,j} \in \{0,1\}$ corresponds to a biallelic (taking only two values) SNP. Each of the $n$ haplotype vectors corresponds to a haplotype drawn from the population $P$, and each of the $m$ positions corresponds to a SNP site in a haplotype. The tag SNPs are $k$ position-sites $t_1, t_2, \dots, t_k$, $t_i \in \{1,\dots,m\}$, which are characteristic to all haplotypes.

The traditional tag SNP selection problem is to select a minimal set of SNPs which are able to distinguish the haplotypic variations in a population without any information loss. The problem can be formulated as follows:

Informative Tag SNP Selection Problem (ITTS). Given a set of $n$ haplotype vectors $H$ on $m$ sites, find $k$ tag sites $t_1,\dots,t_k$ such that for any two haplotypes $h, h' \in H$, if $h \ne h'$ then $h(t_i) \ne h'(t_i)$ for some $i = 1,\dots,k$.

One can also reconstruct an entire (preferably unique) haplotype $h \in P$ from its vector of tag SNP values $h_k = (h_{t_1},\dots,h_{t_k})$. In order to formally describe the reconstruction, we introduce the notion of a reconstruction function, which is a vector-function $f = (f_1,\dots,f_m)$, where $f_j = f_j(x_1,\dots,x_k)$ is a $k$-variable function equal to the $j$-th site of the predicted haplotype $\hat{h} = f(h_k)$. Obviously, $f_{t_j}(h_k) = h_{t_j}$. Now we are ready to formulate the predictive tag SNP selection problem as follows:

Statistical Predictive Tag SNP Selection and Haplotype Reconstruction Problem (STTS). Given a set of $n$ haplotype vectors $H$ on $m$ sites and $k < m$, find $k$ tag sites $t_1,\dots,t_k$ and a reconstruction function $f = (f_1,\dots,f_m)$ such that for any haplotype $h \in P$, the expected Hamming distance between $h$ and its prediction $\hat{h}$ is minimized.

Note that the STTS problem is a natural generalization of the informative tag SNP selection

problem. For example, the block approach of Zhang et al. (2004) selects tag SNPs which distinguish 90% of the haplotypes within the block, where blocks are defined based on LD. The STTS formulation allows for the use of all tag SNPs in reconstructing the unknown haplotype.

The standard experimental way of validating a solution method $M$ for STTS is to run $M$ on a sample $H$ randomly selected out of the population $P$ and check the average accuracy rate of prediction over all haplotypes in $P \setminus H$. In order to get more trustworthy results, the reported results should be averaged over multiple random choices of $H$. Naturally, the larger the set $H$, the more accuracy can be achieved. Therefore, we give a new optimization problem formulation.

Optimum Predictive Tag SNP Selection and Haplotype Reconstruction Problem (OTTS). Given a population as a set $P$ of $p$ haplotypes on $m$ sites, a population sample $H \subseteq P$ of $n$ haplotypes and an integer $k < m$, find $k$ tag sites $t_1,\dots,t_k$ and a reconstruction function $f = (f_1,\dots,f_m)$ such that the average Hamming distance between any haplotype $h \in P \setminus H$ and the predicted haplotype $\hat{h} = f(h_k)$ is minimized.

4 Linear Reduction of SNPs and Haplotypes

In this section we first introduce linear dependency of SNPs and haplotypes and the haplotype rank. We prove that the haplotype rank depends on the number of recombination hotspots rather than the number of different haplotypes. Then we describe our basic linear reduction method for tagging based on the sample haplotype matrix, and the reconstruction of haplotypes from tags. Finally, we show how to adjust the basic linear reduction method when the required number of tags is greater or smaller than the linear rank of the given sample haplotype matrix.
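To make the OTTS objective of Section 3 concrete before turning to the method itself, the quantity being minimized can be sketched as follows. This is only an illustration: the function names, the toy population, and the toy reconstruction function `f` are assumptions made for this example and are not taken from the paper.

```python
def hamming(u, v):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(u, v))


def average_prediction_error(population, sample_indices, tags, reconstruct):
    """Average per-SNP Hamming error over the haplotypes in P \\ H.

    population     -- list of haplotypes (lists of SNP values), the set P
    sample_indices -- indices of the haplotypes that form the sample H
    tags           -- 0-based positions t_1..t_k of the tag SNPs
    reconstruct    -- function f mapping k tag values to a full haplotype
    """
    held_out = [h for i, h in enumerate(population) if i not in sample_indices]
    errors = sum(hamming(h, reconstruct([h[t] for t in tags])) for h in held_out)
    return errors / (len(held_out) * len(population[0]))


# Toy data (hypothetical): in the sample, SNP 2 always equals SNP 0,
# so the toy reconstruction copies tag 0 into position 2.
P = [[0, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 1]]
H_idx = {0, 1}
tags = [0, 1]
f = lambda t: [t[0], t[1], t[0]]

error = average_prediction_error(P, H_idx, tags, f)
```

In this toy case one of the two held-out haplotypes breaks the dependency seen in the sample, so one SNP out of six is mispredicted. Averaging this value over several random choices of the sample corresponds to the validation procedure described above.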

4.1 Linear Dependency of SNPs

Typically, in genetic sequences derived from human haplotypes (see Patil et al., 2001), the number of sites is much larger than the number of individuals. Because of this disproportion, many columns corresponding to SNP sites are similar. Indeed, as noted in Patil et al. (2001), the number of equivalent sites in real data is considerably large. The 0-1 column-site $s_i$ is equivalent to the site $s_j$ if either $s_i$ and $s_j$ are the same, $s_i = s_j$, or $s_i$ is complementary to $s_j$ (i.e., $s_i$ becomes $s_j$ after each 0 is replaced with 1 and each 1 is replaced with 0). It is common to keep only one site out of several equivalent sites since they do not carry any additional information (see Patil et al., 2001).

In general, if one column-site can be restored from several other columns, then it can be dropped without loss of information. In this paper we consider the restoration of one column-site using a linear combination of other column-sites. As noted in Avi-Itzhak et al. (2003), one cannot straightforwardly apply linear combinations of column-sites since complementary columns, although equivalent, are in general linearly independent. But one can easily overcome this obstacle by replacing 0's with $-1$'s (see He et al., 2004). From now on we will change the SNP notation: $-1$ corresponds to the wild type and $1$ corresponds to the mutation. The advantage of the $(-1, 1)$-notation is that two sites are equivalent if and only if they are collinear (i.e., linearly dependent). Our tagging method is based on keeping only linearly independent SNPs as tags.

One can also explore linear dependency of rows-haplotypes rather than columns-SNPs. Then linear dependency in the $(-1, 1)$-notation can be used for the classification of recombinations. Assume that in the given population all recombinations happen at a limited number of hotspots. Assume further that each hotspot occupies a DNA segment between two consecutive SNPs.
If initially there are only two haplotypes $a$ and $b$, then by repeatedly recombining $a$ and $b$ at $g$ different hotspots, one can potentially obtain as many as $2^{g+1}$ different haplotypes. Indeed, let $a = a_1 a_2 \dots a_{g+1}$ and $b = b_1 b_2 \dots b_{g+1}$, where $a_1$ (respectively, $b_1$) is the segment of $a$ (resp. $b$) from the first SNP to the last SNP before the first hotspot, $a_i$ (respectively, $b_i$) is the segment between the $(i-1)$-st and $i$-th hotspots, and $a_{g+1}$ (respectively, $b_{g+1}$) is the segment from the last hotspot to the last SNP. Then any haplotype $h$ obtained by recombination of $a$ and $b$ can be partitioned into $g + 1$ segments, each coming either from $a$ or from $b$, i.e., $h = h_1 \dots h_{g+1}$ where $h_i = a_i$ or $h_i = b_i$. On the other hand, the number of linearly independent recombinations of two haplotypes is at most $g + 2$, which is much smaller than $2^{g+1}$.

Theorem 1. Let $H$ be a set of haplotypes obtained from two haplotypes by recombination events at $g$ hotspots. Then the number of linearly independent rows-haplotypes is at most $g + 2$, i.e., the linear rank of $H$ satisfies $\mathrm{rank}(H) \le g + 2$.

Proof. Let the initial two haplotypes be $a$ and $b$, and let the $g$ hotspots partition them into substrings as follows: $a = a_1 a_2 \dots a_{g+1}$ and $b = b_1 b_2 \dots b_{g+1}$. Consider the set of $g + 2$ vectors which consists of the vector $a$ and the vectors $b'_i$, each having all substrings (except the $i$-th substring) equal to 0 and the $i$-th substring equal to $b_i - a_i$, i.e., $b'_i = 0 \dots (b_i - a_i) \dots 0$, $i = 1,\dots,g+1$. Any recombination haplotype vector $h = h_1 h_2 \dots h_{g+1}$ can be represented as

$$h = a + \sum_{i \,:\, h_i = b_i} b'_i$$

The proof of the following theorem is similar.
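As a quick numerical illustration of Theorem 1 (before the generalization stated next), the sketch below enumerates all recombinants of two small founder haplotypes and computes their rank exactly. The founder haplotypes, the segment boundaries, and the helper name `rank` are assumptions chosen for this example; they do not come from the paper's data.

```python
from fractions import Fraction
from itertools import product


def rank(matrix):
    """Rank of a matrix over the rationals via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        # Look for a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Two founder haplotypes in (-1, 1) notation, split into g + 1 = 3 segments
# by g = 2 hotspots (an illustrative choice of segments).
a_segments = [[1, -1], [1, 1], [-1, 1]]
b_segments = [[-1, -1], [1, -1], [1, 1]]
g = len(a_segments) - 1

# Every recombinant takes each segment either from a or from b.
recombinants = []
for choice in product((0, 1), repeat=g + 1):
    h = []
    for seg_a, seg_b, pick_b in zip(a_segments, b_segments, choice):
        h.extend(seg_b if pick_b else seg_a)
    recombinants.append(h)

n_distinct = len({tuple(h) for h in recombinants})
haplotype_rank = rank(recombinants)
```

With these founders there are $2^{g+1} = 8$ distinct recombinants, while the rank equals $g + 2 = 4$, so the bound of Theorem 1 is attained. Exact rational arithmetic (`fractions.Fraction`) is used here only to avoid floating-point tolerance issues in the rank computation.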

Theorem 2. Let $H$ be a set of haplotypes obtained from $l$ different haplotypes by recombination events at $g$ hotspots. Then the number of linearly independent rows-haplotypes is at most $(g+1)(l-1)+1$, i.e., the linear rank of $H$ satisfies $\mathrm{rank}(H) \le (g+1)(l-1)+1$.

The above theorems show that tagging of a haplotype population consisting of recombinations of a limited number of haplotypes can be efficiently reduced to tagging of a small number of linearly independent population representatives.

4.2 The Basic Linear Reduction Method for Tagging

Our basic linear reduction method for tagging assumes that if there is a linear dependency between certain SNPs in the given sample $H$, then the same dependency is likely to hold for these SNPs in the entire population $P$. Our extensive experimental study shows that this assumption is true for simulated and real data (see Section 5). Based on this assumption, we suggest (i) finding linear dependencies in the sample, (ii) extracting linearly independent SNPs and using them as tags, and (iii) reconstructing the values of non-tag SNPs based on the values of tag SNPs and the linear dependencies found in the sample $H$.

Formally, our basic linear reduction method for tagging consists of the following steps:

- From the sample haplotype matrix $H$, extract the maximum number $r = \mathrm{rank}(H)$ of linearly independent columns-SNPs $T(H) = \{H_{t_1},\dots,H_{t_r}\}$ forming a basis of the columns-SNPs of $H$. The columns-SNPs in $T(H)$ form the set of tag SNPs.
- For each column-SNP $H_j$, $j = 1,\dots,m$, in $H$, find the unique representation of $H_j$ as a linear combination of tag SNPs:

$$H_j = \sum_{i=1}^{r} \alpha_{i,j} H_{t_i}$$

  For example, if $H_j$ is a tag, i.e., $H_j = H_{t_i}$, then $\alpha_{i,j} = 1$ and $\alpha_{i',j} = 0$ for $i' \ne i$.
- Output the positions $\{t_1,\dots,t_r\}$ of the tag SNPs of $T(H)$ and the matrix $F = (\alpha_{i,j})$ of coefficients of the linear combinations.

The suggested linear reduction method can be implemented very efficiently. Applying $O(n^2 m)$ Gauss-Jordan elimination, we can transform the $n \times m$ matrix $H$ into the reduced row echelon form $R$, which will have exactly $r = \mathrm{rank}(H)$ nonzero rows. The $r$ tag SNPs, formed by the linearly independent column-sites corresponding to the nonzero rows, can be easily found from $R$. Let $F$ be the matrix $R$ in which zero rows are dropped, so $F$ is an $r \times m$ matrix. Then for any haplotype $h$ with the tag SNP values $h_r$, the predicted reconstruction $\hat{h} = f(h_r)$ equals

$$\hat{h} = h_r F$$

One cannot guarantee that all the values of $\hat{h}$ are either $1$ or $-1$. Therefore we postprocess $\hat{h}$ as follows: if the value of an SNP in $\hat{h}$ is negative, we set it to $-1$; otherwise we set it to $1$.

The haplotype information is spread all over the haplotype length, and the first $r$ linearly independent columns do not necessarily give the best choice of tags. Finally, we compare the following variations of the initial method: (i) Linear Reduction (LR), where the SNPs are processed in the order in which they appear in $H$; (ii) Randomized Linear Reduction (RLR), which is LR where $H$ is preprocessed by randomly permuting its columns-SNPs; and (iii) RLR with postprocessing (RLRP), which is RLR where unresolved SNPs are reconstructed using the postprocessing specified above.
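The steps of the basic linear reduction method in Section 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy sample are hypothetical, exact rational arithmetic replaces whatever numeric representation the original code used, and tag positions are 0-based. It corresponds to the LR variant with sign postprocessing; the random column permutation of RLR is omitted.

```python
from fractions import Fraction


def reduced_row_echelon(matrix):
    """Gauss-Jordan elimination over the rationals.

    Returns (F, tags): the reduced row echelon form with zero rows dropped,
    and the pivot column indices, which serve as tag SNP positions.
    """
    rows = [[Fraction(x) for x in row] for row in matrix]
    tags = []
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # column c depends linearly on the tags found so far
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        tags.append(c)
        r += 1
        if r == len(rows):
            break
    return rows[:r], tags


def reconstruct(tag_values, F):
    """Predict a haplotype as h_r * F, then round each entry to -1 or 1."""
    m = len(F[0])
    raw = [sum(t * row[j] for t, row in zip(tag_values, F)) for j in range(m)]
    return [-1 if value < 0 else 1 for value in raw]


# Toy sample in (-1, 1) notation: SNP 2 is complementary to SNP 1,
# and SNP 3 is identical to SNP 0.
H = [[1, 1, -1, 1],
     [1, -1, 1, 1],
     [-1, 1, -1, -1]]

F, tags = reduced_row_echelon(H)          # tags are SNPs 0 and 1
predicted = reconstruct([-1, -1], F)      # haplotype not present in the sample
```

In this toy sample the rank is 2, so two tags suffice, and every sample haplotype is reproduced exactly from its two tag values. The unseen tag vector `[-1, -1]` is expanded using the dependencies learned from the sample (SNP 2 is the negation of SNP 1, SNP 3 copies SNP 0).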

Input: The sample haplotype matrix $H$ with rows-haplotypes and columns-SNPs, and the number of required tags $k$.
Output: The set of $k$ positions of tag columns-SNPs $t_1,\dots,t_k$ and the set of reconstructing matrices $F$.
1. Find the linear rank $r$ of the matrix $H$, i.e., the number of linearly independent rows (or columns).
2. If $k \le r$, then
   - sort all rows-haplotypes $h$ of $H$ in ascending order of $d$, where $d$ is the sum of Hamming distances from $h$ to all other haplotypes;
   - reduce $H$ to the first $k$ linearly independent rows;
   - find the reduced row echelon form $R$ of $H$, set $F = \{R\}$, and select the set of $k$ tags consisting of linearly independent columns-SNPs in $R$.
   If $k > r$, then
   - find the reduced row echelon form $R$ of $H$, and select the set of $r$ tags consisting of linearly independent columns-SNPs in $R$;
   - select $k - r$ additional tags among the columns-SNPs in $R$ with the largest number of non-zero entries;
   - find the reduced row echelon forms $R_i$ of $H_i$, $i = 1,\dots,k-r$, where $H_i$ is obtained from $H$ by placing the $i$-th additional tag column-SNP in the first position, and set $F = \{R_1, R_2, \dots, R_{k-r+1}\}$.
3. Output the set $F$ and the set of $k$ tag positions.
Figure 1: The Tagging Algorithm RLRP(k).

4.3 Linear Reduction Algorithms with the Required Number of Tags

When the required number of tags $k$ is specified, it may not necessarily coincide with the linear rank of the sample matrix $H$. Figures 1 and 2 show how to adjust RLRP = RLRP(k) for the required number of tags $k$.

When the required number of tags $k$ is less than the linear rank of $H$, we suggest reducing the sample to $k$ linearly independent haplotypes. We found that it is better to choose the most representative haplotypes, i.e., haplotypes that can predict all others with the least number of

errors. When the required number of tags $k$ is more than the linear rank $r$ of $H$, we suggest adding more SNPs to the initial $r$ tags and forming $k - r + 1$ different reconstruction matrices corresponding to $k - r + 1$ different $r$-subsets of the $k$ tags. In the reconstruction phase, we aggregate the information from all $k - r + 1$ reconstructions, each based on a different tag subset. The aggregation is done by voting: the value $1$ (respectively, $-1$) is assigned if the majority of the $k - r + 1$ reconstructions suggests $1$ (respectively, $-1$).

Input: The set of reconstruction matrices $F$ and a haplotype $h_k$ restricted to $k$ tag SNP values.
Output: The predicted full haplotype $\hat{h}$.
1. If $F$ consists of a single reconstructing matrix, $F = \{R\}$, then reconstruct $\hat{h} = h_k R$.
   If $F$ consists of several reconstructing matrices, $F = \{R_1, R_2, \dots, R_{k-r+1}\}$, then $\hat{h}$ is reconstructed from $k - r + 1$ reconstructions, each based on its own $r$ tags forming the tag vector $h^i_r$ and the corresponding reconstructing matrix $R_i$, as follows:

   $$\hat{h} = \sum_{i=1}^{k-r+1} h^i_r R_i$$

2. Postprocess $\hat{h}$ as follows: if the value of an SNP in $\hat{h}$ is negative, set it to $-1$; otherwise set it to $1$.
Figure 2: The Reconstructing Algorithm RLRP(k).

5 Experimental Results

5.1 The Data Sets

Our algorithms are evaluated on simulated and real data. The simulated data is generated using ms of Hudson (1990), a well-known haplotype generator based on the coalescent model of

SNP sequence evolution. Given as input the number of haplotypes desired, the number of SNPs desired, and the recombination rate, the ms generator emits a haplotype population with those characteristics. In our tests, we generate four haplotype samples of sizes 300, 500, 1000, and 2000 with SNP sites and a recombination rate of 40.

The data set collected by Daly et al. (2001) is derived from the 616 kilobase region of human Chromosome 5q31 that may contain a genetic variant responsible for Crohn's disease, obtained by genotyping 103 SNPs for 129 trios. We use both the parent and children haplotype data sets obtained by applying the trio phasing of Brinza et al. (2005) to the Daly data.

The data set of Patil et al. (2001) consists of the first 1,000 of 24,047 SNPs typed on 20 haploid copies of human Chromosome 21. This subset was found to be highly representative of the entire data set. The data set of Clark et al. (1998) comes from 71 individuals typed at 88 SNPs in the human lipoprotein lipase (LPL) gene. The phased haplotypes are known in this data set. We compare our linear reduction method with the methods of Halldorsson et al. (2004) and Zhang et al. (2004) on these two data sets.

5.2 Sample Based Predictive Tagging Problem

For a population of 1000 haplotypes with 25K SNPs (see Figure 3), we randomly choose a population sample (of size 50 to 500) as the training data, then apply the linear reduction method described in Section 4.2 to reconstruct the rest of the unknown haplotypes. In this case, the number of tag SNPs is always close to the size of the sample. We report the error as the average difference, in percent, between the reconstructed and real haplotypes.

The next two plots (see Figures 4 and 5) are devoted to experiments with the real data set. We use both parent and children haplotype data sets

of Daly et al. (2001) phased by the methods of Brinza et al. (2005). We report the error depending on the size of the population sample; the number of tag SNPs is always less than the size of the sample and comes to 60 for 100 haplotypes. The results are averaged over 10 random draws from the set of all haplotypes. The last plot (see Figure 6) compares the error rate of the RLRP method for the same sample size while the population grows from 300 to 2000. As one can see, the error rate does not change with the population size.

[Figure 3 plot: total error % (y-axis) versus sample population size (x-axis) for LR, RLR, and RLRP.]
Figure 3: The total number of errors in % of the total number of SNPs depending on the size of the sample population for the three algorithms LR, RLR, and RLRP. Results from the simulated data with 25K sites and a haplotype population of 1000.

5.3 Comparison With Other Tagging Methods

We have applied leave-one-out cross-validation to evaluate the quality of the solution given by the tag SNP selection. For each haplotype in our data set, we apply RLRP to the rest of the data to select the required number of tag SNPs and compute the reconstruction matrix. The haplotype left out is reconstructed based on the SNPs that are tagged and the reconstruction matrix. The reconstructed haplotype is then compared with the left-out haplotype and the number of errors

is recorded. The average number of errors in reconstruction over all haplotypes is used as a measure of the overall accuracy of the tagging method on the data set.

[Figure 4 plot: total error % (y-axis) versus sample population size (x-axis) for LR, RLR, and RLRP.]
Figure 4: The total number of errors in % of the total number of SNPs depending on the size of the sample population for the three algorithms LR, RLR, and RLRP. Results from the 258 children haplotypes with 103 SNPs phased by the methods presented by Brinza et al. (2005).

The methods of Zhang et al. (2004) and Halldorsson et al. (2004) impute a SNP based on the tag SNPs in the same block or neighborhood. Therefore, if there is no tag SNP in the block or neighborhood, these methods do not make any prediction. The RLRP method reconstructs each SNP based on the values of all tag SNPs, which may possibly be far away. Figure 7 compares RLRP with the methods of Zhang et al. (2004) and Halldorsson et al. (2004). For the LPL and Patil et al. data, when 10% of SNPs are used as tags, RLRP reaches 80% accuracy, while previous methods reach 20% accuracy.
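The leave-one-out protocol used in this comparison can be sketched as follows. To keep the example short and self-contained, it uses the basic linear reduction of Section 4.2 (with as many tags as the rank of the training matrix) in place of RLRP(k); that substitution, the function names, and the toy data are assumptions made for illustration only.

```python
from fractions import Fraction


def reduced_row_echelon(matrix):
    """Exact Gauss-Jordan elimination; returns (nonzero RREF rows, pivot columns)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    tags = []
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        tags.append(c)
        r += 1
        if r == len(rows):
            break
    return rows[:r], tags


def reconstruct(tag_values, F):
    """Predict a haplotype as h_r * F, then round each entry to -1 or 1."""
    raw = [sum(t * row[j] for t, row in zip(tag_values, F))
           for j in range(len(F[0]))]
    return [-1 if value < 0 else 1 for value in raw]


def leave_one_out_error(haplotypes):
    """Fraction of SNPs mispredicted when each haplotype is held out in turn."""
    errors = 0
    for i, held_out in enumerate(haplotypes):
        training = haplotypes[:i] + haplotypes[i + 1:]
        F, tags = reduced_row_echelon(training)   # select tags on the rest
        predicted = reconstruct([held_out[t] for t in tags], F)
        errors += sum(p != x for p, x in zip(predicted, held_out))
    return errors / (len(haplotypes) * len(haplotypes[0]))


# Toy data in (-1, 1) notation: SNP 2 = -SNP 1 and SNP 3 = SNP 0 everywhere,
# except in one added haplotype that breaks the first dependency.
consistent = [[1, 1, -1, 1], [1, -1, 1, 1], [-1, 1, -1, -1], [-1, -1, 1, -1]]
with_outlier = consistent + [[1, 1, 1, 1]]

error_consistent = leave_one_out_error(consistent)
error_with_outlier = leave_one_out_error(with_outlier)
```

On the consistent toy data every held-out haplotype is recovered exactly. With the outlier added, the only mistake is made when the outlier itself is held out, because the dependency learned from the other haplotypes does not hold for it.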

[Figure 5 plot: total error % (y-axis) versus sample population size (x-axis) for LR, RLR, and RLRP.]
Figure 5: The total number of errors in % of the total number of SNPs depending on the size of the sample population for the three algorithms LR, RLR, and RLRP. Results from the 516 parent haplotypes with 103 SNPs phased by the methods presented by Brinza et al. (2005).

6 Conclusions & Future Work

We have suggested a linear reduction method based on standard Gauss-Jordan elimination. Our experiments show that the linear reduction method reliably (error rate below 2%) recovers all SNPs based upon a very small portion of tag SNPs (e.g., 100 tags out of a total of 25K SNPs) while sampling below 10% of the population. In our future work, we will apply tag selection to genotype data. We will also explore different possibilities of combining our methods with blocks and will apply linear reduction to recover missing SNP values.

[Figure 6 plot: total error % (y-axis) versus size of the population sample (x-axis) for RLRP with p = 300, 500, 1000, and 2000.]
Figure 6: Simulated data with sites and different haplotype population sizes. The total number of errors in % of the total number of SNPs depending on the size of the sample population for populations of 300, 500, 1000 and 2000 haplotypes.

References

Avi-Itzhak, H.I., Su, X. and de la Vega, F.M. (2003) Selection of minimum subsets of single nucleotide polymorphism to capture haplotype block diversity, Proceedings of Pacific Symposium on Biocomputing, Vol. 8, pp.

Bafna, V., Halldorsson, B.V., Schwartz, R.S., Clark, A.G. and Istrail, S. (2003) Haplotypes and informative SNP selection algorithms: don't block out information, Proceedings of the Seventh International Conference on Research in Computational Molecular Biology, pp.

Brinza, D., He, J., Mao, W. and Zelikovsky, A. (2005) Phasing and Missing data recovery in Family, International Workshop on Bioinformatics Research and Applications (IWBRA 2005), to appear.

Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. and Nickerson, D.A. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium, American Journal of Human Genetics, Vol. 74, No. 1, pp.

[Figure 7 plots: accuracy (y-axis) versus number of tag SNPs (x-axis) for Halldorsson et al., Zhang et al., and RLRP; two panels.]
Figure 7: The x-axis shows the number of SNPs typed, and the y-axis shows the fraction of SNPs correctly imputed in a leave-one-out experiment. (Left) Results from the LPL data set. (Right) Results from the first 1000 SNPs of the Chromosome 21 data set.

Clark, A., Weiss, K., Nickerson, D., Taylor, S., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E., et al. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase, American Journal of Human Genetics, Vol. 63, pp.

Clark, A. (2003) Finding genes underlying risk of complex disease by linkage disequilibrium mapping, Current Opinion in Genetics & Development, Vol. 13, No. 3, pp.

Chapman, J.M., Cooper, J.D., Todd, J.A. and Clayton, D.G. (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power, Human Heredity, Vol. 56, pp.

Daly, M., Rioux, J., Schaffner, S., Hudson, T. and Lander, E. (2001) High resolution haplotype structure in the human genome, Nature Genetics, Vol. 29, pp.

Eskin, E., Halperin, E. and Karp, R. (2003) Efficient reconstruction of haplotype structure via perfect phylogeny, Journal of Bioinformatics and Computational Biology, Vol. 1, No. 1, pp.

Forton, J., Kwiatkowski, D., Rockett, K., Luoni, G., Kimber, M. and Hull, J. (2005) Accuracy of Haplotype Reconstruction from Haplotype-Tagging Single-Nucleotide Polymorphisms, American Journal of Human Genetics, Vol. 76, pp.

Halldorsson, B.V., Bafna, V., Lippert, R., Schwartz, R., de la Vega, F.M., Clark, A.G. and Istrail, S. (2004) Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies, Genome Research, Vol. 14, pp.

He, J. and Zelikovsky, A. (2004) Linear Reduction Methods for Tag SNP Selection, Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology (EMBC 04), pp.

He, J. and Zelikovsky, A. (2004) Linear Reduction for Haplotype Inference, Proceedings of the Workshop on Algorithms in Bioinformatics (WABI 04), Vol. 3240, pp.

Hudson, R. (1990) Gene genealogies and the coalescent process, Oxford Survey of Evolutionary Biology, Vol. 7, pp.

Judson, R., Salisbury, B., Schneider, J., Windemuth, A. and Stephens, J.C. (2002) How many SNPs does a genome-wide haplotype map require?, Pharmacogenomics, Vol. 3, pp.

Patil, N., Berno, A., Hinds, D., Barrett, W., Doshi, J., Hacker, C., Kautzer, C., Lee, D., Marjoribanks, C., McDonough, D., Nguyen, B., Norris, M., Sheehan, J., Shen, N., Stern, D., Stokowski, R., Thomas, D., Trulson, M., Vyas, K., Frazer, K., Fodor, S. and Cox, D. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome, Science, Vol. 294, pp.

Qin, Z., Niu, T. and Liu, J. (2002) Partitioning-Ligation-Expectation-Maximization algorithm for haplotype inference with single-nucleotide polymorphisms, American Journal of Human Genetics, Vol. 71, pp.

Sebastiani, P., Lazarus, R., Weiss, S., Kunkel, L., Kohane, I. and Ramoni, M. (2003) Minimal haplotype tagging, Proceedings of the National Academy of Sciences, Vol. 100, pp.

Stram, D., Haiman, C., Hirschhorn, J., Altshuler, D., Kolonel, L., Henderson, B. and Pike, M. (2003) Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study, Human Heredity, Vol. 55, pp.

Zhang, K., Qin, Z., Liu, J., Chen, T., Waterman, M. and Sun, F. (2004) Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies, Genome Research, Vol. 14, pp.