VARIANT DETECTION USING NEXT GENERATION SEQUENCING DATA YOON SOO PYON. For the Degree of Doctor of Philosophy


VARIANT DETECTION USING NEXT GENERATION SEQUENCING DATA

by

YOON SOO PYON

Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Dissertation Advisor: Jing Li

Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
January, 2013

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of Yoon Soo Pyon, Ph.D. candidate for the degree*.

Jing Li (signed) (chair of the committee)
Thomas LaFramboise
Mehmet Koyuturk
Xiang Zhang

7/3/2012 (date)

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

1. Introduction
   1.1 Overview of Next Generation Sequencing Technologies
   1.2 Genomic Structural Variation
   1.3 Single Nucleotide Polymorphisms
2. Structural Variation Detection Using Next Generation Sequencing Data
   2.1 Literature Review
   2.2 Project Statement
   2.3 Proposed Methods
       2.3.1 Outline of Algorithms
       2.3.2 Candidate generation and feature collection for deletions
       2.3.3 Candidate generation and feature collection for inversions
       2.3.4 Definition of singleton feature for inversions
       2.3.5 Model based clustering
   2.4 Results
       2.4.1 Experimental design
       2.4.2 Generating simulation data
       2.4.3 Result using different alignment tools
       2.4.4 Membership probabilities
       2.4.5 Genotyping results
       2.4.6 Comparison between different read length data

       2.4.7 Comparison with other approaches
       2.4.8 Inversion events analysis
   2.5 Discussion
3. Genotype and SNP Calling Using Next Generation Sequencing Data
   3.1 Literature Review
   3.2 Project Statement
   3.3 Naïve Bayes classifier for SNP discovery and calling
   3.4 Allele ambiguity due to strand issue
   3.5 Results
       3.5.1 Experimental design
       3.5.2 Genotype accuracy
       3.5.3 Mendelian inconsistency analysis
       3.5.4 Scalability with unrelated family member
4. Conclusion and Discussion
Appendix
Bibliography

List of Tables

Table 2.1 Classification of read pairs
Table 2.2 Mapping results and SV calling results of SVMiner using different alignment tools and different parameter thresholds
Table 2.3 Comparison of SVMiner and MoDIL on two datasets
Table 2.4 Insert size analysis
Table 2.5 Prediction performance comparison of deletion results among different approaches
Table 2.6 The predicted inversion events
Table 3.1 The number of variant loci identified by some methods
Table 3.2 The number of loci after outer joining results from 3 methods, Illumina Omni 2.5 genotype results and dbSNP information
Table 3.3 The counts of loci in certain combinations of methods having Mendelian error
Table S2.1 Comparison of different methods and their ability to capture the validated inversions reported by Kidd et al.
Table S3.1 Genotype calling accuracy for P
Table S3.2 Genotype calling accuracy for P
Table S3.3 Genotype calling accuracy for P
Table S3.4 Mendelian error rates among GATK, SAMtools, SOAPsnp and intsnp
Table S3.5 Mendelian error rates for inconsistent genotype loci among GATK, SAMtools, SOAPsnp and intsnp
Table S3.6 Genotype calling accuracy for P

List of Figures

Figure 1.1 Structural variants and the read pairs that support them
Figure 2.1 Deletion examples
Figure 2.2 Optional filtering processes for inversion candidates
Figure 2.3 Inversion examples
Figure 2.4 Mapping quality comparisons between same read ID (randomly selected 2 million paired reads) mapped in SRR
Figure 2.5 Predicted deletion events and their uncertainty scores
Figure 2.6 SVMiner and MoDIL comparison
Figure 2.7 Two events that are predicted by SVMiner and MoDIL but with different genotype calling
Figure 2.8 Overlaps between shorter read dataset and longer read dataset
Figure 2.9 IGV view of events only detected by longer read data
Figure 2.10 IGV view of events only detected by shorter read data
Figure 2.11 Event size comparisons
Figure 2.12 Inversion results
Figure 3.1 Minor Allele Frequency distributions of known SNP loci obtained from dbSNP
Figure 3.2 Training data generation
Figure 3.3 Frequencies of genotypes in lower minor allele frequency (0.0 ~ 0.05) bins for P5098 sample
Figure 3.4 Pedigree of family suffering speech disorder
Figure 3.5 Genotype calling accuracies comparison
Figure 3.6 Genotype calling accuracies comparison in conflict genotype loci

Figure 3.7 Mendelian error rates among different methods
Figure 3.8 Genotype calling accuracies of P209 predicted with own family members and unrelated family members
Supplementary Figure S2.1 The distributions of the two features for the three types of events
Supplementary Figure S2.2 Membership probability distribution of predicted normal events

Acknowledgements

First and foremost, I would like to thank my advisor Dr. Jing Li for his continuous support during my Ph.D. study. Without his immense knowledge and guidance, this dissertation would not have come to fruition. He showed me what true mentorship is throughout my entire Ph.D. study. I am also grateful to my committee member Dr. Thomas LaFramboise. He always gave me valuable advice for both research and life. I will never meet an advisor like him again. My sincere thanks also go to my other committee members: Dr. Mehmet Koyuturk and Dr. Xiang Zhang. I am thankful to my fellow lab mates: Matthew Hayes, who worked with me on the structural variation detection project; Dr. Wenhui Wang, who worked with me on rare variant detection using pooled DNA sequence data; Robert Shields, Sunah Song, Amin Assareh, Sen Yang, Ke Hu, and Matthew Wollerman; and former lab members Dr. Xin Li, Dr. Yixuan Chen and Dr. Xiaolin Yin. Last but not least, I would like to thank my wife Bonghyun and my parents from the bottom of my heart. Without their endless love and sacrifice, I might not have finished my Ph.D. study.

VARIANT DETECTION USING NEXT GENERATION SEQUENCING DATA

Abstract

by YOON SOO PYON

Genetic variants, including single nucleotide polymorphisms (SNPs) and genomic structural variations (SVs), not only contribute to human diversity but are also important to human health, because some of them have been shown to trigger diseases such as cancer, obesity and diabetes. With the recent development of cost-effective next generation sequencing (NGS) technologies, it has become possible to identify novel variants at high resolution and to identify some copy-neutral variants, such as inversions, which cannot be detected using microarray-based technologies. However, the enormous amounts of raw sequence data generated by NGS technologies pose great challenges for data analysis. Efficient computational algorithms and tools to analyze these data are in great need.

In this dissertation, we propose effective computational methods to identify genomic structural variations (deletions and inversions) and to infer accurate genotypes from multiple genotype and SNP calling algorithms using NGS. For structural variant detection, we propose a model-based clustering approach utilizing a set of features defined for each type of SV event. Our method, termed SVMiner, not only provides a probability score for each candidate, but also predicts the heterozygosity of genomic deletions. Extensive experiments on genome-wide deep sequencing data have demonstrated that SVMiner is robust against the variability of a single cluster feature, and it performs well when classifying validated SV events with accentuated features. To improve SNP calling results, we propose a Naïve Bayes based approach, which combines prior information of known polymorphic sites and population-specific allele frequencies with SNP calling results from multiple programs to obtain more accurate SNP genotypes. Results show that our approach has higher genotype calling accuracy than individual algorithms.

Chapter 1

Introduction

During the last few years, cost-effective next generation sequencing (NGS) technologies have revolutionized biological and genetic studies. However, with DNA sequencing becoming faster and cheaper, the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and, especially, analyze the data. Therefore, the development of more efficient and more accurate data analysis tools is necessary. One area of recent interest and development is the identification of genomic variants, from single nucleotide polymorphisms (SNPs) to larger structural variants (SVs).

1.1 Overview of Next Generation Sequencing Technologies

Over the past two decades, Sanger sequencing dominated the genome field and accomplished monumental projects such as the completion of the first human reference genome. Despite its high accuracy (as high as %) and long read length (~1000bp), the expensive sequencing cost of this so-called first generation sequencing technology has limited the ability of individual labs to pursue deep sequencing analysis of whole genomes [1]. The advent of much-needed cost-effective NGS has provided many opportunities for various applications, including whole genome resequencing, variant detection using whole or targeted genome sequencing, RNA-seq, ChIP-seq, methyl-seq, and DNase-seq.

Because NGS is still not as cheap as array-based technologies such as array CGH or SNP genotyping arrays, these array-based methods will keep playing a role in the near future for identifying copy number variations and/or genotyping individuals at previously known loci. However, NGS is better suited for novel variant discovery for both SNPs and SVs. Furthermore, NGS methods can detect copy-neutral variants such as inversions. Unlike array-based methods that rely on probe intensities, NGS does not have the oversaturation problem at very high copy counts, and it can potentially provide single-nucleotide resolution.

1.2 Genomic Structural Variation

Genomic structural variation refers to the rearrangement, duplication, deletion, or insertion of DNA in the human genome [2-5]. Recent studies have shown that in addition to SNPs, genomic structural alterations may be responsible for much of the phenotypic variation that currently exists among human populations. These genomic structural variants are ubiquitous, and some are likely responsible for increased susceptibility to diseases such as cancer [6-8], as well as other conditions such as obesity and infertility [8, 9]. Given the impact of structural variation on individuals, locating and mapping these regions can provide a better understanding of the genetic mechanisms that define phenotypic differences and those that can potentially trigger diseases and disorders [5]. Among the platforms used to detect structural variation are array comparative genomic hybridization (aCGH), SNP genotyping arrays, and high throughput next generation sequencing (NGS). Sequencing technologies (e.g., Illumina Genome Analyzer, ABI SOLiD, and Roche's 454) not only allow for genome-wide screening of structural

variants at the nucleotide level [5, 10, 11], but also allow for detection of different types of structural variation (SV) events using paired-end sequencing (for a review, see [12]). To detect variation using paired-end NGS data, millions of paired sequence reads of roughly known lengths are first generated from a donor genome. For a given read pair, the distance from a read to its mate is known as the insert size, and these paired reads are computationally mapped to specific locations on a reference genome. When a pair of reads does not overlap with any structural variant, the mapped distance between them should be approximately the same as the library insert size, and the two reads should have opposite orientations (i.e., the left end mapped to the + strand and the right end mapped to the - strand for Illumina technologies). On the other hand, when the pair does overlap a structural variant, the mapping distance will deviate from the insert size or its orientation will differ from that of a normal pair of reads, or both, depending on the type of structural variant. One advantage of using paired-end sequence data is that it allows for detection of many types of SVs. Read pairs with normally mapped distances and correct orientations are known as concordant pairs; otherwise, the pairs are discordant and suggest a possible variant. The relationships between discordant pairs and three types of possible structural variants are shown in Figure 1.1.

Figure 1.1 Structural variants and the read pairs that support them
Pairs from a donor genome D have distinct characteristics when mapped to a reference genome R.
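The concordant/discordant distinction described above can be sketched in a few lines. The following is an illustrative example, not code from the dissertation; the class, function names, and the sample coordinates are all hypothetical, and the library insert size is estimated here simply as the mean and standard deviation of mapped distances from a sample of pairs.

```python
# Illustrative sketch: classifying a mapped read pair as concordant or
# discordant. All names and numbers are hypothetical.
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class ReadPair:
    left_pos: int      # leftmost mapping coordinate on the reference
    right_pos: int     # rightmost mapping coordinate of the mate
    orientation: str   # "FR", "RF", "FF", or "RR"

def mapped_distance(p: ReadPair) -> int:
    return p.right_pos - p.left_pos

def is_concordant(p: ReadPair, mu: float, sigma: float, x: float = 3.0) -> bool:
    """A pair is concordant when its reads face each other (FR for
    Illumina) and its mapped distance lies within mu +/- x*sigma."""
    d = mapped_distance(p)
    return p.orientation == "FR" and (mu - x * sigma) <= d <= (mu + x * sigma)

# The library insert size is typically estimated from the mapped
# distances of a large sample of pairs:
sample = [ReadPair(100, 300, "FR"), ReadPair(500, 720, "FR"),
          ReadPair(900, 1090, "FR")]
distances = [mapped_distance(p) for p in sample]
mu, sigma = mean(distances), stdev(distances)
```

A pair failing either the distance check or the orientation check would be flagged as discordant and passed on to variant-calling logic.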

1.3 Single Nucleotide Polymorphisms

Single nucleotide polymorphisms (SNPs), the most common type of genetic variation in humans, refer to single nucleotide mutations occurring naturally in human populations. It is estimated that SNPs appear every 300 base pairs on average and that roughly 10 million SNPs exist in the human genome. Although most SNPs have no effect on health and development, some of them are associated with traits related to human health. Researchers have discovered the impact of SNPs on the development of some diseases [13]. Further research on SNPs may help to predict an individual's response to some drugs, susceptibility to environmental factors, and risk of developing particular diseases. SNPs have been identified mainly through international collaborations such as the HapMap projects. With the recent rapid development of cost-effective high-throughput sequencing technologies, a higher resolution map of human genome variation is being cataloged [2, 14].

The rest of this dissertation is organized as follows. In the next chapter, we introduce our algorithm, termed SVMiner, which identifies structural variation using NGS data. In chapters 2.1 and 2.2, we review existing algorithms for structural variation detection and describe our methods at a high level. The subsequent chapters explain the details of the methods, including candidate generation for deletions/inversions and model-based clustering of the structural variants. Various results of SVMiner and comparison analyses with other approaches are provided in chapter 2.4. In chapter 3, we propose a method to infer accurate genotypes from multiple genotype and SNP calling algorithms using NGS. After we briefly review existing SNP/genotype calling methods in chapter 3.1, we

mention the intuition behind the problem in chapter 3.2. Then we present a Naïve Bayes classifier based method in chapter 3.3. In chapter 3.5, the results from our combining method and the individual results of other approaches are compared. Finally, we close this dissertation by summarizing the proposed methods and discussing future work.

Chapter 2

Structural Variation Detection Using Next Generation Sequencing Data

2.1 Literature Review

Most existing computational approaches to identify structural variations were developed based on mapping distance and/or orientation (for a review, see [12]). A few groups [15-17] provided the first studies in which a human genome was sequenced using next generation technology and structural and other genetic variation was called, mainly using their in-house tools. Hormozdiari et al. [18] presented three algorithms for detecting structural variation in a donor sequence using NGS technology, based on two different formulations. Their first formulation is called the Maximum Parsimony Structural Variation (MPSV) problem, the goal of which is to compute a unique mapping for each discordant read pair in the reference genome such that the total number of implied structural variants is minimized. Two algorithms (named VariationHunter_unweighted and VariationHunter_weighted, or VHU and VHW for short) were developed, both of which were based on an approximation algorithm for the Set-Cover problem. Their second formulation aims to calculate the probability of each event, and an iterative algorithm was proposed (named VariationHunter_Probabilistic, or VHP). They further developed their algorithms for transposons [19]. Lee et al. [20] presented an algorithm called MoDIL, which primarily focuses on detecting insertions

and deletions of small sizes (20~50bp). The main idea is to compare the distribution of insert sizes in the library to the distribution of the observed mapped distances, utilizing an iterative Kolmogorov-Smirnov (KS) test. MoDIL can determine the heterozygosity of deletions by observing the distribution of mapped distances within a variant cluster, though it does not consider read depth for this process. Chen et al. [21] presented two algorithms for detecting structural variation using NGS technology. The first algorithm, BreakDancerMax, can potentially detect large deletions, large insertions, inversions, and translocations. Their second algorithm, BreakDancerMini, focuses on small indels that are between base pairs in length. BreakDancerMax identifies SV events by grouping together overlapping discordant pairs of a certain type, and a region is called a structural variant based on a confidence score that is a function of the number of discordant pairs, the coverage of the reads, and the size of the region defined by the discordant pairs. The BreakDancerMini algorithm, which is in principle similar to the MoDIL algorithm [20], locates small indels by dividing the genome into non-overlapping windows and comparing the distribution of mapped distances for read pairs in the window with the distribution of mapped distances in the entire genome using the KS test.

2.2 Project Statement

In most cases, it is not enough to call a structural variant based only on mapping distances and orientations. First of all, the distribution of the insert size is usually estimated. In many cases, it is difficult to distinguish whether a deviation of the mapping distance from the mean insert size is due to a structural variant or due to the fact that the insert of this particular pair happens to be far from/close to the mean. In addition, sequencing errors and mapping uncertainties all add to the complexity of identifying

structural variants. The aforementioned challenges are what motivate the existing approaches [18, 20-22] that address this problem. In this dissertation, we present a model-based approach for SV detection named SVMiner. SVMiner uses as input the mapping of DNA paired-end reads to a reference sequence, which can be obtained using existing mapping tools (e.g., MAQ [23] and BWA [24]). It consists of four steps, namely: a) classification of paired-end reads into different categories according to their separation distances and orientations; b) candidate SV generation by grouping overlapped read pairs of the same type; c) feature collection based on the types of SV candidates; and d) a model-based clustering approach to separate candidates into final predicted events or normal regions. SVMiner has three distinct features: 1) in addition to separation distances and orientations, SVMiner automatically collects new features for different types of events; 2) SVMiner further utilizes these new features for classification and can predict heterozygosity for deletion events; 3) the model-based clustering approach also reports membership probabilities, which can be interpreted as confidence scores. We implement the approach and apply it to two recently generated datasets [15] as well as a simulated dataset. In addition, we also apply the method to a genome of the same individual, but using longer read length data. In our experiments, we investigate the effects of the mapping tool, insert size, and read length, and assess the method's capability in reporting heterozygous deletions and event membership probabilities. We compare our results with those from the original papers, experimentally validated events of the same sample using a different technology [25], results from the 1000 Genomes Project [2, 26], as well as prediction results of three popular programs for the detection of SVs using paired-end reads: BreakDancerMax [21], VariationHunter [18],

and MoDIL [20]. Although the ground truth of structural variant events is generally not available for real data, we construct several highly reliable event sets based on: 1) the reported results from the original papers; 2) predictions that overlap with events from the 1000 Genomes Project; and 3) events predicted by at least two independent algorithms. Based on these reliable event sets, SVMiner significantly outperforms all three approaches in detecting deletions. Furthermore, it achieves much higher accuracy than MoDIL in predicting heterozygous deletions. Simulation results on deletions also show that SVMiner has the best overall performance. For inversions, the overlaps among different programs are low, indicating that it is much harder to detect inversions for all the programs tested.

2.3 Proposed Methods

2.3.1 Outline of algorithms

SVMiner takes as input a set of alignment results and outputs a list of likely SV events with probability scores. It is worth mentioning that there exist many different mapping algorithms with different characteristics, and in principle, mapping results from any algorithm can be used in our approach. In our implementation, SVMiner can take alignment results from MAQ [23] and BWA [24], or any other mapping tool that supports the BAM format. As previously stated, SVMiner consists of four steps. Firstly, the insert size and its standard deviation are estimated from the data, and each pair is classified based on either 1) its separation distance relative to the insert size or 2) its orientation. Table 2.1

summarizes all the different types of read pairs. In this study, we mainly focus on deletion and inversion events; thus only discordant pairs for deletions and inversions will be studied. Secondly, SVMiner generates SV candidates by grouping overlapped discordant read pairs of the same type and similar lengths. Thirdly, for each type of candidate SV cluster, we define a feature space to be used in step 4 by our model-based clustering method. Finally, we use a model-based clustering approach to separate candidates into predicted variants or predicted regions of no variation. The first two steps of our algorithm are shared by many existing approaches. Our major contribution is to infer variant heterozygosity, to develop a procedure to automatically collect different features for different types of SV events (step 3) and to assign membership probabilities to each candidate SV (step 4). Feature spaces are different for different types of events, but are mainly based on the number of discordant pairs, the coverage of normal pairs, and the number of singletons (i.e., only one read of a pair being mapped to the reference), with the aim to distinguish real events from false ones and to distinguish heterozygous events from homozygous ones. The exact definitions of the features for deletions and inversions will be discussed in the subsequent sections.

Table 2.1 Classification of read pairs
The relationship between paired-end reads and structural variants according to a partition based on their separation distances and orientations. The meanings of the symbols: F: forward read; R: reverse read; d: separation distance; µ: insert size (mean); σ: standard deviation; x: a user-defined parameter; N: normal; D: deletion; I: insertion; Iv: inversion; Ta: tandem duplication; Tra: intrachromosomal translocation; Tre: interchromosomal translocation; C: other complex structural event.

  Orientation | d < µ-x*σ | µ-x*σ <= d <= µ+x*σ | µ+x*σ <= d | Different chromosomes
  FR          | I         | N                   | D, Tra     | Tre
  RF          | Ta        | Ta                  | Ta, Tra    |
  FF          | Iv        | Iv                  | Iv         | C
  RR          | Iv        | Iv                  | Iv         | C
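The Table 2.1 lookup can be expressed as a small function. This is a sketch rather than SVMiner's code, and the cell-to-event mapping below follows the table as reconstructed above, so it should be treated as illustrative; the function name and return convention (a list of plausible event types) are assumptions.

```python
# Illustrative lookup for Table 2.1: given a pair's orientation, mapped
# distance d, insert-size stats (mu, sigma, x), and whether both reads
# map to the same chromosome, return the candidate event types.
def classify_pair(orientation, d, mu, sigma, x=3.0, same_chrom=True):
    if not same_chrom:
        # reads on different chromosomes
        return {"FR": ["Tre"],          # interchromosomal translocation
                "RF": [],               # no type listed in the table
                "FF": ["C"],            # complex event
                "RR": ["C"]}[orientation]
    if orientation in ("FF", "RR"):
        return ["Iv"]                   # same-strand pairs suggest inversion
    if orientation == "RF":
        # tandem duplication; a long RF pair may also be a translocation
        return ["Ta", "Tra"] if d > mu + x * sigma else ["Ta"]
    # FR pairs: the mapped distance decides
    if d < mu - x * sigma:
        return ["I"]                    # reads too close: insertion
    if d > mu + x * sigma:
        return ["D", "Tra"]             # reads too far: deletion or translocation
    return ["N"]                        # concordant
```

In SVMiner's pipeline, only the deletion-type (long FR) and inversion-type (FF/RR) outputs of such a classification are carried forward to candidate generation.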

Once a feature set is defined and collected for each candidate SV, each candidate is viewed as one observation generated from a mixture distribution. Each candidate SV can be predicted as either a true variant or a false event (i.e., a region of no variation). For a predicted true event, there are also two possibilities: a homozygous event, where both homologous chromosomes/haplotypes bear the same type of SV, or a heterozygous event, where only one chromosome/haplotype has the event. We construct features such that the different types of events (homozygous, heterozygous and normal) exhibit different characteristics and represent different mixture components. Therefore, in our framework, the number of mixture components, G, is naturally defined as 3, although in some cases (such as inversions), the data is so noisy that one cannot really distinguish heterozygous events from homozygous events. In such cases, we consider that a candidate SV can be predicted as either a true event or a normal region (i.e., G=2). For our model-based approach, we adopt the one described in [27]. More details of the approach can be found in the following sections.

2.3.2 Candidate generation and feature collection for deletions

After being mapped, pairs with high mapping quality (score ≥ 30) and with a separation distance greater than the insert size plus x standard deviations are discordant pairs for candidate deletion events, where the insert size is approximated using the mean of the mapped distances and x = 3 or 4 for the two datasets we analyzed. This was to be consistent with the analyses in the original papers. To generate candidate SV events, overlapped discordant deletion pairs are clustered based on their lengths and positions.
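This grouping step can be sketched roughly as follows. The sketch is a simplified illustration, not SVMiner's implementation: pairs are greedy-assigned to the first cluster in which they overlap every existing member and have a similar length, and the function names, the 50% length tolerance, and the greedy strategy are all assumptions.

```python
# Simplified sketch of clustering discordant deletion pairs into
# candidate events. A pair is a (start, end) interval on the reference.
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def similar_length(a, b, tol=0.5):
    # mapped lengths must agree within a fractional tolerance (assumed 50%)
    la, lb = a[1] - a[0], b[1] - b[0]
    return abs(la - lb) <= tol * max(la, lb)

def cluster_pairs(pairs, min_support=3):
    clusters = []
    for p in sorted(pairs):
        placed = False
        for c in clusters:
            # mutual-overlap requirement: p must agree with EVERY member
            if all(overlaps(p, q) and similar_length(p, q) for q in c):
                c.append(p)
                placed = True
                break
        if not placed:
            clusters.append([p])
    # only sufficiently supported clusters become candidate deletions
    return [c for c in clusters if len(c) >= min_support]
```

The `min_support` cutoff mirrors the "at least three discordant pairs" filter described below, and a stray pair that overlaps one member but not the rest is rejected by the `all(...)` check.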

More specifically, we require that overlapping discordant pairs have similar mapped distances. This is necessary because discordant pairs from the same deletion event are likely to have similar sizes. In addition, they should also have similar positions. Therefore, we require that every read pair in a cluster overlap with all other read pairs within the same cluster. This requirement removes read pairs from a cluster that could, for example, overlap with one pair but not with the others. Such an aberrant read pair may have been mapped by chance. Furthermore, only pairs with a mapped distance of less than 100 kilobases (kb) are considered, because it is possible that a larger-sized pair is an error. SVMiner can, however, detect deletions larger than 100 kb by adjusting this maximum allowed pair size. Finally, only clusters with at least three¹ discordant pairs are considered as candidate deletion events. We refer to this step as the filtering step. For each candidate deletion event, we define and collect two features for classification: 1) the number of discordant pairs in the candidate cluster and 2) the average depth of concordant pair reads within the cluster region, based on the following observations. If the event is indeed a homozygous deletion (i.e., the region on both chromosomes/haplotypes is deleted), one would expect the number of discordant pairs to be close to the expected coverage of the region. At the same time, within the deleted region, the depth of reads should be very close to 0. On the other hand, if the event is a heterozygous deletion, one would expect the number of discordant pairs as well as the depth of concordant pairs to be close to half of the expected depth in the region. In an unaffected region, a high depth of concordant pairs and a low depth of discordant pairs is expected.
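The two features and the three expected signatures can be made concrete with a toy sketch. Note that this is illustrative only: SVMiner classifies candidates with model-based clustering over these features, not with hard thresholds, so the cutoff fractions, `expected_depth`, and function names below are all assumptions introduced for the example.

```python
# Illustrative only: the two deletion features and a naive hard-threshold
# reading of the signatures described above (SVMiner itself uses
# model-based clustering instead of these assumed cutoffs).
def deletion_features(n_discordant, concordant_depths):
    """Feature 1: discordant-pair count; feature 2: average concordant depth
    over the per-base depths inside the cluster region."""
    avg_depth = sum(concordant_depths) / len(concordant_depths)
    return n_discordant, avg_depth

def naive_genotype(n_discordant, avg_depth, expected_depth):
    # homozygous deletion: discordant support ~ full coverage, depth ~ 0
    if n_discordant >= 0.8 * expected_depth and avg_depth < 0.1 * expected_depth:
        return "homozygous deletion"
    # heterozygous deletion: both features near half the expected depth
    if n_discordant >= 0.3 * expected_depth and avg_depth < 0.7 * expected_depth:
        return "heterozygous deletion"
    # unaffected region: high concordant depth, few discordant pairs
    return "no event"
```

For instance, at 30x expected coverage, a cluster with 28 discordant pairs and near-zero concordant depth looks homozygous, while one with roughly 15 of each looks heterozygous.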
¹ The minimum number of discordant pairs per cluster is provided as a user-defined parameter in SVMiner.

Similar observations have been made by other researchers (e.g., [15]), but few approaches have automatically collected these features based on these observations to

incorporate them into a formal model. Figure 2.1 shows the distributions of concordant pairs and discordant pairs around a homozygous deletion (Figure 2.1a) and a heterozygous deletion (Figure 2.1b) from a real dataset from [15], which are consistent with these observations. The two features are then used by the model-based clustering algorithm, which predicts three possible outcomes for deletion events: 1) homozygous deletion, 2) heterozygous deletion, and 3) no event.

Figure 2.1 Deletion examples
(a) A homozygous deletion and (b) a heterozygous deletion on chromosome X of a NA. The abnormally spaced discordant read pairs are represented as red lines, while the concordant pairs are seen as blue lines. The deletion event in (b) has occurred on only one haplotype, thus explaining the presence of concordant pairs within the affected region.

2.3.3 Candidate generation and feature collection for inversions

Candidate inversion events can be determined in a way similar to the aforementioned method for determining candidate deletions. First, only inverted pairs with the same orientation and with high mapping quality (score ≥ 30) are retained. To form a cluster, we require that there be at least three² overlapping pairs discordant by orientation. We also enforce additional criteria based on characteristics of inversion events.

² The minimum number of discordant pairs per cluster is provided as a user-defined parameter in SVMiner.

For an

inverted discordant pair with forward-forward (FF) orientation, it should theoretically have a larger mapped distance than other FF discordant pairs in the cluster with higher genomic positions. This property also holds for any reverse-reverse (RR) oriented pair. If a read pair p is the first FF pair in the cluster (i.e., the pair with the lowest start position), the distance between the start position of the left read of p and the start position of the innermost left read of another FF pair should be less than the mean insert size. This relation also holds for the right FF reads and for the RR pairs. Also, if both FF and RR pairs exist in a cluster, the left reads of all FF pairs should be located before the left reads of all RR oriented pairs. Lastly, all inverted pairs with mapped distances greater than 500 kb (provided as a parameter in the program) are discarded. This step is referred to as the filtering step for inversions. These additional requirements for inversions are illustrated in Figure 2.2. Possible inversion events are portrayed in Figure 2.3.

Figure 2.2 Optional filtering processes for inversion candidates
1) The insert size of the outermost pair, IS_1, is always greater than or equal to the insert size of the innermost pair, IS_n: IS_1 >= IS_n.
2) The maximum distance between the left reads of an FF read group is less than or equal to the mean insert size; the same holds for the right reads of the FF read group, and likewise for the RR read group: D_{1,n} <= mean insert size.
3) If both an FF read group and an RR read group exist, the reads of the FF group should have smaller genomic coordinates: Pos1 <= Pos2.

Figure 2.3 Inversion examples
A possible inversion in (a) chromosome 1 of NA18507 and (b) chromosome X of NA. The FF discordant pairs are red and the RR pairs are green. Singleton reads are shown in dark blue and light blue, corresponding to forward reads and reverse reads, respectively. Note that (b) lacks RR discordant pairs. Singleton reads tend to accumulate near the inversion breakpoints because their mates do not map to the reference sequence due to potential overlap with the breakpoints.

For each candidate inverted cluster, we define two intuitive features to be used by the model-based clustering method. The first feature is defined as

    max(N(FF), N(RR)),

where N(FF) is the number of FF reads that form the cluster and N(RR) is the number of RR reads³. Theoretically, both FF and RR discordant pairs should support an inversion event, and their number should be close to the average read depth in the region. Occasionally, however, there may be only RR or only FF read pairs in a cluster that support a true inversion. For example, Bentley et al. [15] provided an example of a validated inversion event that is supported only by FF pairs, which is also shown in Figure 2.3(b). The lack of alternate discordant reads may be explained by lower read coverage in the region where the other type of pairs is expected, or it may be caused by the deletion from the reference of nearby repeated regions. The second feature we define is related to the coverage of singleton reads near the cluster, which are reads whose mate does not map to the reference sequence (for short reads, ~35 bp), or whose mate has been split (for longer reads, ~ bp). For many SV events, including inversions, it is expected that the reads that cover the breakpoints in the donor genome will either 1) not map to the reference, or 2) for split read alignment, have disjoint segments of the read aligned to distinct locations on the reference genome. As a result, the mates of the unmapped/split reads are termed singleton reads here. The presence of singleton reads near inversion boundaries was also described by [15]. Figure 2.3 shows the tendency of singleton reads to accumulate at inversion breakpoint boundaries. This feature is basically defined by measuring the number of such singleton reads that cover areas near the expected breakpoints. Details can be found in the section defining the singleton feature below. The two features are collected automatically for each candidate cluster and then used by the model-based clustering algorithm.
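A rough sketch of computing both inversion features is shown below. It is illustrative, not SVMiner's code: the window bounds anticipate the singleton-feature definition given in the section that follows, singletons are modeled simply as (position, strand) tuples, and the function name, inclusive boundaries, and per-window counting are all assumptions of the example.

```python
# Illustrative sketch of the two inversion-cluster features. `ff_pairs`
# and `rr_pairs` are the discordant pairs supporting the cluster;
# `singletons` is a list of (position, strand) tuples for reads whose
# mates are unmapped or split.
def inversion_features(ff_pairs, rr_pairs, singletons,
                       left_start, right_start, alpha=300):
    # Feature 1: a cluster may lack one orientation entirely, so take
    # the larger of the two supporting counts.
    f1 = max(len(ff_pairs), len(rr_pairs))

    # Feature 2: singleton reads near the expected breakpoints.
    # Four windows around the left/right read start positions; the
    # strand sets and the 1.5*alpha extension follow the text.
    windows = [(left_start - alpha, left_start, {"+"}),
               (left_start, left_start + alpha, {"+", "-"}),
               (right_start - alpha, right_start, {"+"}),
               (right_start, right_start + int(1.5 * alpha), {"+", "-"})]
    f2 = sum(1 for pos, strand in singletons
             for lo, hi, ok in windows
             if lo <= pos <= hi and strand in ok)
    return f1, f2
```

For example, a cluster with three FF pairs, one RR pair, and two singletons near the assumed breakpoints would yield features (3, 2); a singleton falling in two overlapping windows would be counted once per window in this simplified version.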
Like deletion events, theoretically, inversions can also occur 3 A pair is described as FF if both its reads are mapped to the forward strand in the reference. Similarly, a pair is described as RR if both its reads are mapped to the reverse strand. 16

on either haplotype. However, the high variability in discordant-pair coverage obfuscates the genotyping of inversions (e.g., not every inversion event has both FF and RR reads). Therefore, only two possible outcomes are considered for candidate inversions: predicted true inversions and normal regions.

2.3.4 Definition of the singleton feature for inversions

We define this feature by measuring the number of singleton reads that cover areas near the expected breakpoints. For all FF pairs, we define four windows, and for each base pair within each window we measure the number of singleton reads with quality score >= 30 that cover it. The first window is the interval [StartPosition(Left) - α, StartPosition(Left)], where Left is the left read of the pair and α is a window size. For this window, we only consider singletons mapped to the (+) strand, and we use α = 300 to capture an interval slightly larger than the expected insert size. The second window is [StartPosition(Left), StartPosition(Left) + α]; for this window, we consider both (+) and (-) singletons, because we suspect the left breakpoint of the inversion is located near this window, and there could be (+) or (-) singletons whose mates do not map because of the breakpoint. The third window is [StartPosition(Right) - α, StartPosition(Right)], where Right is the right read. The fourth window is [StartPosition(Right), StartPosition(Right) + 1.5α]. We extend this window to increase the chances of including the right inversion breakpoint, since the spans of FF pairs are unlikely to flank it. Since possible inversion breakpoints can be located within the second and fourth windows, only forward (+) singleton reads of the first and third windows are informative, while both (+) and (-) singleton reads

from the second and the fourth windows are informative. If both FF and RR discordant pairs are present in the cluster, it is easier to find both breakpoints; however, the lack of one type of inverted pair hinders our ability to find both breakpoints, hence the larger size of the fourth window. For all RR pairs in the cluster, we define four windows in a similar fashion, but for these pairs the breakpoints can only be located within the first or third windows. Having defined a total of four windows for each discordant pair, we define our second feature by taking the maximum, over all windows and all read pairs, of the number of singletons that cover each base pair in a window. Symbolically, this feature is

    max_{W, b in W} (the number of singletons that overlap b),

where W ranges over the set of all windows defined for all read pairs in the cluster and b is a base pair located within W.

2.3.5 Model based clustering

We assume that the observed data points are generated from different component distributions. More specifically, each candidate event x_i is a two-dimensional vector in our application, representing the values of the two features for that type of event. Let f_k(x_i | μ_k, Σ_k) denote the probability density of the observation x_i under mixture component k, which is assumed to be a normal distribution with mean μ_k and covariance Σ_k:

    x_i ~ N(μ_k, Σ_k).

For a vector of observations x, the log likelihood function is defined as

    l(τ, μ, Σ | x, z) = Σ_{i} Σ_{k=1}^{G} z_ik log[ τ_k f_k(x_i | μ_k, Σ_k) ],    (2)

where τ_k is the mixture probability of component k, G is the number of components, and z_ik is an indicator variable that is 1 if x_i belongs to component k and 0 otherwise. The maximum likelihood parameters for Equation (2) are computed via the maximization (M) step of the expectation-maximization (EM) algorithm [28]. In the E-step, the expected value of z_ik is calculated as

    ẑ_ik = τ̂_k f_k(x_i | μ̂_k, Σ̂_k) / Σ_{j=1}^{G} τ̂_j f_j(x_i | μ̂_j, Σ̂_j),

where τ̂_k is the component probability estimated in the M step, and μ̂_k and Σ̂_k are the mean and covariance matrix of component k learned in the M step. These learned parameters are updated on successive iterations of the E and M steps. To classify an observation, the component that maximizes the converged membership probability becomes the label assigned to the observation after the iterations terminate. The distribution of each component can be further refined by specifying a smaller number of free variables in the covariance matrix Σ_k, which is equivalent to specifying the geometric characteristics of each component. Different models can be evaluated by calculating the Bayesian Information Criterion (BIC) [29] value of each model given the current dataset. One can refer to Fraley and Raftery [27] for a detailed discussion of the different models. In our implementation, we have tried a few different

initialization methods (e.g., assigning class labels randomly or using different thresholds), and no differences have been observed between initializations. A fixed threshold is used to determine convergence.

2.4 Results

2.4.1 Experimental design

We evaluated our method on three real datasets as well as on simulated datasets, for both deletion events and inversion events, and mainly compared our approach with three existing algorithms, i.e., MoDIL, BreakDancer, and VariationHunter. The three real datasets are whole-genome data of an African male (NA18507) with read lengths of 36-41 bp and an insert size of ~200 bp [15] (dataset I); the same sample (NA18507) with different read lengths (~100 bp) and a different insert size (~500 bp) (dataset II); and chromosome X data of a Caucasian female (NA07340) with an insert length of 130 bp [15] (dataset III). The reads for these datasets were generated on the Illumina Genome Analyzer II and downloaded from the NCBI short read archive (NA18507) and the European Bioinformatics Institute (NA07340). For dataset III, the original study [15] also predicted homozygous and heterozygous deletions by manual examination, which was therefore used to assess the capability of SVMiner in detecting heterozygous deletions. The simulated datasets were generated following the procedure presented in Section 2.4.2 and were used to test the functionality of the software and to compare with other tools. For deletions, we evaluated the performance of SVMiner by examining its results using different mapping tools, the effect of using membership

probabilities as a reliability measure, its predictions of heterozygosity, and the effect of different insert sizes and read lengths. We compared its results with the results reported in the original papers as well as with results from the three existing programs, namely BreakDancer, VariationHunter and MoDIL. The three programs were chosen not only for their functionality, popularity and availability, but also because they have been used in studying dataset I. Not all programs were tested on all datasets or all parameter settings, because 1) programs may have special requirements that make some experiments unnecessary or infeasible (e.g., VariationHunter uses its own mapping tool, and some other programs do not distinguish heterozygous deletions from homozygous ones); and 2) due to the huge amount of sequence data, computational resources limit the number of possible combinations. When comparing the results of two programs on a dataset, or comparing predictions with ground truth, we examine the overlaps between predicted events: two events are considered the same if they have at least 50% reciprocal overlap with each other.

2.4.2 Generating simulated data

Using chromosome 9 of the human reference genome, we generated two homologous chromosomes (i.e., a diploid) and randomly embedded 190 non-overlapping deletions and 202 inversions of varying lengths (ranging from 100 bp to 75 kbp). Of the 190 embedded deletions, 94 were homozygous and 96 were heterozygous. The two chromosomes were then sheared into reads of length 36. The average insert length between a pair of reads is 200 bp with a standard deviation of 20, which closely

mimics the real data at hand. The point mutation rate is 0.1% and the sequencing error rate is 0.5%. For both SVMiner and BreakDancerMax, MAQ was used as the alignment tool. The latest version of VariationHunter uses its own mapping tool, mrfast, which can provide information about reads mapped to multiple positions; however, mrfast requires significantly more memory and is slower than MAQ. After alignment, the mean read depth is around 31x.

2.4.3 Results using different alignment tools

Structural variation prediction critically depends on the quality of the mapping results. Therefore, we first tested SVMiner on dataset I using two alignment tools: BWA and MAQ. MAQ provides three types of mapping quality scores for paired-end reads: a single-end mapping quality score for each read, a mapping quality score (Q) for each pair, and an alternative mapping quality score (AltQ) for each pair [23]. The single-end score of a read is a measure of its mapping quality, defined as -10 log10(P), where P is the probability that the read is erroneously mapped. AltQ is defined as the minimum of the two single-end mapping quality scores. The mapping quality score Q of a pair is defined differently for different types of read pairs. For properly mapped pairs (i.e., uniquely mapped pairs with correct orientations and a proper mapping distance), it is the sum of the single-end mapping quality scores of both ends. For all other cases (including discordant pairs and pairs with multiple mapping positions), it is defined as the minimum of the two single-end mapping quality scores, the same as AltQ in such cases. BWA, on the other hand, provides only one mapping quality score for each pair, which is close to the definition of the per-pair mapping quality score in MAQ. Although the quality scores of the two programs are conceptually

related, they do not always correlate well (Figure 2.4). To retain reliably mapped reads, a threshold is normally used to filter out low-quality reads. We first investigated the effect of both Q and AltQ for MAQ, and of the mapping quality score for BWA, on the mapping results. Around 96.5% (MAQ) and 98.1% (BWA) of the total 3.77 billion reads were mapped to the genome (Table 2.2). Among them, BWA has ~72.4% of mapped read pairs with quality score >= 30, and MAQ has about 76.3% of read pairs with mapping quality score Q >= 30; however, using the alternative quality score of MAQ, only 49.5% of mapped read pairs have AltQ >= 30. The number of discordant pairs also varies significantly (from 414K to 797K) across these thresholds (Table 2.2).

Table 2.2 Mapping results and SV calling results of SVMiner using different alignment tools and different parameter thresholds

  Tool        # reads  # mapped  # mapped (score >= 30)  # discordant pairs for deletions (shared with BWA)  # homozygous / heterozygous / normal calls
  BWA         3.77b    3.697b    2.677b                  0.671m                                              3505 / 2774 / 2348
  MAQ (Q)     3.77b    3.637b    2.774b                  0.797m (0.457m)                                     2361 / 2848 / 2576
  MAQ (AltQ)  3.77b    3.637b    1.802b                  0.414m (0.346m)                                     3372 / 1879 / --
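The quality thresholds above are Phred-scaled error probabilities (Q = -10 log10 P, as in MAQ's single-end score). A small illustrative sketch of the conversion (function names are ours, not MAQ's):

```python
import math

def phred_quality(p_error):
    """Phred-scaled quality: Q = -10 * log10(P), where P is the
    probability that the read is mapped erroneously."""
    return -10.0 * math.log10(p_error)

def error_probability(q):
    """Inverse conversion: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)
```

So the score >= 30 filter applied to both aligners corresponds to retaining pairs whose estimated mapping error probability is at most 1 in 1000.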

Therefore, with different mapping tools and different (or even the same) quality-control thresholds, the mapped results, which are the inputs to our program SVMiner, differ, and the results of SVMiner are expected to differ as well. Nevertheless, a significant majority of the called events are the same. For example, 4384 deletions are common between the results using BWA with quality scores >= 30 (6279 deletions in total) and the results using MAQ with quality scores Q >= 30 (5209 deletions in total). For MAQ with alternative mapping quality score AltQ >= 30, our approach identified 5251 deletions. The numbers of total candidate events and the numbers of predicted homozygous deletions, heterozygous deletions and normal events for the three thresholds of the two alignment tools are shown in Table 2.2. Although the numbers of initial candidates based on MAQ under the two thresholds differ considerably, the numbers of final predicted events are close (Table 2.2). However, the numbers of homozygous/heterozygous events are quite different (Table 2.2). This is mainly because, when using the mapping quality score Q, more concordant read pairs are retained that may not be of high quality and would be removed under AltQ, resulting in more concordant pairs within candidate regions; SVMiner therefore predicted more heterozygous events. Although similar numbers of pairs were aligned to the reference genome by BWA and by MAQ with Q >= 30, a considerable number of discordant read pairs are not shared (Table 2.2). In addition to algorithmic differences, another possible reason is that the two mapping tools use different error models to calculate mapping probabilities, so the two quality scores cannot easily be mapped to each other. Even when varying the quality score parameter for both methods, no significant improvement was made in identifying more common events between them. As noted by

Li et al. [24], the actual quality score of MAQ tends to be underestimated, while the quality score of BWA tends to be overestimated, although the concept of the mapping quality score is the same for both alignment tools. The results of SVMiner are inevitably affected by the results of the alignment tools. Nevertheless, the inconsistency mostly occurs for events with weak signals. From this point of view, SV prediction approaches that provide reliability scores, such as the SVMiner approach proposed here, are desirable, because one can use the reliability score to filter out low-quality events. We decided to use the mapping results from BWA for the remaining experiments, mainly because of its efficiency.

Figure 2.4 Mapping quality comparison for the same read IDs (2 million randomly selected paired reads) mapped in SRR

2.4.4 Membership probabilities

The membership probabilities provided by SVMiner for each predicted event can be used as reliability scores to rank and filter predicted events. Figure 2.5(a) displays the classification results on dataset I, and Figure 2.5(b) shows the corresponding prediction uncertainty score of each data point, defined as one minus the maximum of the three membership probabilities. The general shape of the clusters suggests that SVMiner's results are robust to the variability among feature values within the same class; consequently, classification results tend to be robust against differing levels of sequence coverage. It is also apparent that a candidate variant with accentuated features is very likely to be predicted as a true variant, as can be seen from the uncertainty plot: points of higher uncertainty tend to accumulate near the cluster boundaries, especially near the origin. Because data points near the origin have low concordant-pair depth and low numbers of discordant pairs, they have moderate probabilities of belonging to any class; in particular, it is difficult to separate heterozygous predictions from normal predictions in these regions (Supplementary Figure S2.1). Otherwise, Supplementary Figure S2.1 shows that the feature distributions of the three different types of events generally agree with expectations. We further examined the membership probability distributions of predicted normal events. In general, the membership probabilities of data points with weak feature values are lower than those of data points with prominent features (Supplementary Figure S2.2). This suggests that the reported membership probabilities can be used to distinguish high-quality events from events with high uncertainty: a simple threshold on the maximum membership probability can be adopted to return only events with high confidence.
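The membership probabilities and uncertainty scores above are the converged E-step quantities of the model-based clustering of Section 2.3.5. The following is a minimal EM sketch for intuition only; it is a simplified stand-in (deterministic initialization, unconstrained covariances), not SVMiner's actual implementation, which follows the Fraley and Raftery model family:

```python
import numpy as np

def gaussian_density(X, mu, cov):
    """Multivariate normal density f_k(x | mu_k, Sigma_k) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff))

def em_cluster(X, G=3, iters=200, tol=1e-6):
    """Fit a G-component Gaussian mixture by EM; return membership probs z."""
    n, d = X.shape
    tau = np.full(G, 1.0 / G)                       # mixture weights
    order = np.argsort(X[:, 0])                     # deterministic init:
    mu = X[order[np.linspace(0, n - 1, G).astype(int)]]  # spread-out rows
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(G)])
    prev_ll = -np.inf
    for _ in range(iters):
        # E-step: z_ik = tau_k f_k(x_i) / sum_j tau_j f_j(x_i)
        dens = np.stack([tau[k] * gaussian_density(X, mu[k], cov[k])
                         for k in range(G)], axis=1) + 1e-300
        z = dens / dens.sum(axis=1, keepdims=True)
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:                      # converged
            break
        prev_ll = ll
        # M-step: re-estimate tau, mu, cov from the soft assignments
        nk = z.sum(axis=0)
        tau = nk / n
        mu = (z.T @ X) / nk[:, None]
        for k in range(G):
            diff = X - mu[k]
            cov[k] = (z[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return z

def uncertainty(z):
    """Uncertainty of each call, as in Figure 2.5(b): 1 - max_k z_ik."""
    return 1.0 - z.max(axis=1)
```

Thresholding `z.max(axis=1)` (equivalently, `uncertainty(z)`) then yields the high-confidence subset described above.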

Figure 2.5 Predicted deletion events and their uncertainty scores. (a) The clustering results for the candidate deletions of dataset I (NA18507). (b) Plot of uncertainty for the predicted results from the same dataset. Uncertainty is defined as 1 - z, where z is the greatest of the class membership probabilities after clustering. Larger, darker circles indicate lower membership probabilities for the given data point. The highest degrees of uncertainty are around the cluster boundaries and around the origin, where the data features are not prominent enough to allow a more confident prediction. Data points with prominent features (i.e., high concordant-pair depth and/or high numbers of discordant pairs) are generally classified with high confidence.

2.4.5 Genotyping results

Among the other methods considered, MoDIL [20] is the only tool that can predict heterozygous deletions. However, MoDIL and SVMiner are not directly comparable, in the sense that MoDIL was originally designed to detect small to medium sized deletions. Nevertheless, to evaluate the performance of SVMiner in predicting heterozygous deletions, we compared the results of SVMiner and MoDIL on three different datasets: genome-wide data of specimen NA18507 (dataset I), chromosome X data of specimen NA07340 (dataset III), and a simulated dataset. For dataset I, we directly obtained the

results from the authors of [20] (who used mrfast as the mapping tool). Because MoDIL focuses mostly on detecting small to medium sized indels (20~50 bp), only a small portion (6.3%) of their 10,397 reported high-confidence deletions have sizes greater than 50 bp and are directly comparable with our results. Among them, 380 deletions called by MoDIL overlap (50% reciprocally) with 380 events predicted by SVMiner. Figure 2.6 illustrates the numbers of overlaps for the two different types of deletion events. Among the overlapping events, the two programs show a high degree of consistency. For example, for homozygous deletions, 88.2% of SVMiner's predictions overlapped with MoDIL's homozygous predictions, while 96.0% of MoDIL's predictions overlapped with SVMiner's. For heterozygous deletions, 85.3% of SVMiner's predictions overlapped with those of MoDIL, while 64% of MoDIL's predictions overlapped with SVMiner's. SVMiner and MoDIL had a higher degree of mutual overlap for homozygous deletions than for heterozygous deletions, which is expected because heterozygous events are more difficult to distinguish from normal events.

Figure 2.6 SVMiner and MoDIL comparison. (a) Homozygous deletions; (b) heterozygous deletions.
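The 50% reciprocal-overlap rule used to match calls here (and throughout Section 2.4) can be sketched as a small predicate; the function name is illustrative:

```python
def reciprocal_overlap(a, b, frac=0.5):
    """True if intervals a = (start, end) and b overlap by at least
    `frac` of the length of BOTH intervals (reciprocal overlap)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return False
    return inter >= frac * (a[1] - a[0]) and inter >= frac * (b[1] - b[0])
```

Note the reciprocity: a small call fully contained in a much larger one does not match, because the shared bases cover less than half of the larger event.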

One should notice that MoDIL may not precisely predict the two breakpoints of an event. The result file of MoDIL gives both the size of an event and its start/end positions (i.e., breakpoints); however, the distance between the two breakpoints is in general much greater than the event size, which may affect the values of the features on which MoDIL bases its predictions. We examined some of the calls on which SVMiner and MoDIL are inconsistent, and they often disagree on the two breakpoints. Figure 2.7(a) shows an example of an event called by SVMiner as a homozygous deletion but by MoDIL as a heterozygous deletion. According to the breakpoints of SVMiner (dashed lines), there are apparently no concordant pairs within the deletion region; therefore, it concluded that this event is a homozygous deletion. However, according to the predicted breakpoints of MoDIL (dotted lines), this region includes enough concordant pairs for the program to predict a heterozygous deletion (Figure 2.7(b)). Another possible source of inconsistency is that the two programs used different mapping tools, which is known to affect the calling results, as shown earlier.

Figure 2.7 An event predicted by both SVMiner and MoDIL but with different genotype calls. On the left (a), the mapped read pairs along the chromosome: blue indicates concordant pairs, and red represents discordant pairs. Dashed lines are the breakpoints called by SVMiner and dotted lines are the breakpoints called by MoDIL. The event was predicted as a homozygous deletion by SVMiner, but as a heterozygous deletion by MoDIL. On the right (b), we reconstructed the signals defined by MoDIL based on its breakpoint information, i.e., the distribution of separation distances of mapped read pairs within the breakpoints.

Dataset III was originally studied by Bentley et al. [15], who manually separated the 77 predicted deletion events into 49 homozygous and 28 heterozygous deletions; these were regarded as the ground truth in comparing SVMiner and MoDIL. Running both programs on this dataset, SVMiner predicted 55 homozygous deletions and 42 heterozygous deletions, while MoDIL predicted 12 homozygous deletions and 55 heterozygous deletions. A predicted event is treated as a true positive if the region has at least 50% reciprocal overlap with one of Bentley's events. Although 30 out of 67 predictions made by MoDIL overlap with true events, the number of correctly predicted genotypes is low, with 5 correctly predicted homozygous genotypes and 9 correctly predicted heterozygous genotypes. In contrast, 69 out of 97 predictions made by SVMiner overlapped with the true events. In particular, SVMiner correctly predicted 41 homozygous events and 21 heterozygous events. Overall, SVMiner significantly outperformed MoDIL on this dataset (Table 2.3). The F_scores of SVMiner range from 0.60 to 0.79 when evaluating genotype predictions separately or jointly, whereas the F_scores of MoDIL range from 0.16 to 0.42.

Table 2.3 Comparison of SVMiner and MoDIL on two datasets

  Combined
  Data       # real events  Method  # predictions  # true positives*  Precision  Recall  F-score
  chrX       77             SVM     97             69                 0.71       0.90    0.79
  chrX       77             MoDIL   67             30                 0.45       0.39    0.42
  Simulated  190            SVM     153            153                1.00       0.81    0.89
  Simulated  190            MoDIL   1123           140                0.12       0.74    0.21

  Homozygous
  Data       # real events  Method  # predictions  # true positives  Precision  Recall  F-score
  chrX       49             SVM     55             41                0.75       0.84    0.79
  chrX       49             MoDIL   12             5                 0.42       0.10    0.16
  Simulated  94             SVM     77             75                0.97       0.80    0.88
  Simulated  94             MoDIL   3              3                 1.00       0.03    0.06

  Heterozygous
  Data       # real events  Method  # predictions  # true positives  Precision  Recall  F-score
  chrX       28             SVM     42             21                0.50       0.75    0.60
  chrX       28             MoDIL   55             9                 0.16       0.32    0.22
  Simulated  96             SVM     76             75                0.99       0.78    0.87
  Simulated  96             MoDIL   134            73                0.54       0.76    0.63

  * For MoDIL, only the 137 high-confidence predictions that overlap real events were considered for the homozygous/heterozygous breakdown; its F-score would be extremely low if all predicted events were considered.

In addition to the two real datasets, we also compared SVMiner and MoDIL on a simulated dataset. The data generation procedure was described in Section 2.4.2. Briefly, 190 deletions (94 homozygous and 96 heterozygous) of varying sizes (ranging from 100 bp to 75 kbp) were embedded in one chromosome. In total, MoDIL identified 1123 deletions, including many small deletions that are false positives. Among them, only 140 predictions (137 with high confidence) overlapped with 101 of the embedded deletions. Of the 137 high-confidence predictions, 3 were predicted as homozygous and 134 as heterozygous (Table 2.3). All 3

homozygous predictions overlapped with embedded homozygous deletions. However, only 73 of the 134 heterozygous deletion predictions overlapped with embedded heterozygous deletions, while the remaining 61 overlapped with embedded homozygous deletions. In contrast, SVMiner predicted a total of 153 deletions, all of which overlapped with embedded deletions, yielding a precision of 100%. In terms of homozygosity, out of the 77 predicted homozygous deletions, 75 overlapped with embedded homozygous deletions and the other two overlapped with embedded heterozygous events; of the 76 predicted heterozygous deletions, 75 overlapped with embedded heterozygous deletions and the other one overlapped with an embedded homozygous deletion. On this simulated dataset, MoDIL not only predicted many false positives, but also showed a strong tendency to misclassify true homozygous deletions as heterozygous ones. A precision and recall comparison of SVMiner and MoDIL can be found in Table 2.3.

2.4.6 Comparison between different read length data

With further developments in technology, the length of short reads has increased over the years on many platforms. We therefore evaluated how experimental parameters such as read length and library insert size may affect the discovery of structural variants. The sample analyzed earlier (NA18507) has been sequenced several times on Illumina platforms, and we considered two such datasets in our experiments. In addition to dataset I, which consists of paired-end short reads with 36~41 bp read lengths and a 200 bp average insert size (SRA000271), we tested SVMiner on a second dataset consisting of 100 bp paired-end reads with 500 bp inserts (SRX016231).

From this longer-read data (dataset II), SVMiner identified 6236 deletion events: 2550 homozygous and 3686 heterozygous deletions. The overall overlap rate between the results from the two datasets is around 44% (Figure 2.8(a)). Intuitively, this overlap rate is not high, which may be due to differences between the datasets generated on different platforms, or to limitations of our algorithm. We further tested another program, BreakDancerMax, on both datasets; it turns out that its overlap rate (34%-37%) is even lower (Figure 2.8(b)). In fact, the level of consistency between the two approaches on the same dataset is much higher than the level of consistency of either algorithm across the two datasets. For example, on the short-read dataset, the fraction of overlapping events predicted by the two approaches is around 84%.

Figure 2.8 Overlaps between the shorter-read dataset and the longer-read dataset for (a) SVMiner and (b) BreakDancerMax.
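The overlap rates in Figure 2.8 can be computed along these lines (an illustrative sketch in which events are (start, end) tuples on one chromosome and matching uses the 50% reciprocal-overlap rule):

```python
def _reciprocal(a, b, frac=0.5):
    """Reciprocal-overlap predicate for intervals a = (start, end) and b."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    return inter > 0 and inter >= frac * (a[1] - a[0]) and inter >= frac * (b[1] - b[0])

def overlap_rate(calls_a, calls_b, frac=0.5):
    """Fraction of events in calls_a matched, by >= frac reciprocal
    overlap, with some event in calls_b."""
    matched = sum(1 for a in calls_a if any(_reciprocal(a, b, frac) for b in calls_b))
    return matched / len(calls_a)
```

Because the rate is normalized by the first call set, `overlap_rate(a, b)` and `overlap_rate(b, a)` generally differ, which is why ranges such as 34%-37% are quoted for BreakDancerMax.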

After visual inspection of some alignment regions using IGV [22], we concluded that the two datasets have somewhat different mapping results for discordant reads. For example, Figures 2.9 and 2.10 provide examples of events detectable with one type of data but not the other. We also found several instances where the mapping qualities of reads in a possible deletion area are all zero in one dataset, while the other dataset has plenty of high-quality discordant reads. One possible explanation is that, in addition to the large deletions, there are small indels that affect the mapping quality score differently for reads of different lengths. Another possibility is that smaller deletions may not be detectable using longer reads with a large insert size. Indeed, when examining event size distributions, many small events were detected only in the shorter-read data (Figure 2.11), because for longer-insert datasets it is more difficult to detect smaller deletions due to fluctuations in read-pair insert lengths. Overall, our results illustrate that the capability of any computational method in detecting deletion events is constrained by the technological platform and the available data. For paired-end sequence data, read length and insert size may affect the final set of detectable structural variants. To isolate the effect of insert size from that of read length, we simulated three libraries differing only in insert size (200, 500, 5k) and tested SVMiner on these datasets with the same set of 190 embedded deletion events. Results show that small to medium sized deletions may not be detected using data with a large insert size (Table 2.4).
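The `_3sd`/`_4sd` suffixes in Table 2.4 refer to the standard-deviation cutoff for flagging a pair as deletion-discordant. A hedged sketch of that rule (the exact SVMiner criterion may differ in details):

```python
def is_deletion_discordant(insert_size, mean_insert, sd_insert, k=4):
    """Flag a properly oriented pair whose mapped insert size exceeds
    mean + k * sd as supporting a deletion (k = 3 or 4 in Table 2.4)."""
    return insert_size > mean_insert + k * sd_insert
```

Under this rule, a heterozygous or homozygous deletion of size D stretches the apparent insert by roughly D, so deletions much smaller than k * sd are invisible to the insert-size signal; with a 5 kb library the standard deviation (and hence the blind spot) is far larger than with a 200 bp library, consistent with the recall drop in Table 2.4.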

Figure 2.9 IGV view of an event detected only in the longer-read data. The upper panel shows the shorter-read data and the lower panel the longer-read data. Brown reads denote discordant read pairs.

Figure 2.10 IGV view of an event detected only in the shorter-read data. The upper panel shows the shorter-read data and the lower panel the longer-read data. Brown reads denote discordant read pairs.

Table 2.4 Insert size analysis. There were 190 embedded deletions in total. Each simulated dataset is annotated with two numbers: the average library insert size (in base pairs) and the standard-deviation threshold for calling a pair discordant. As seen in the table, SVMiner's performance decreases as the average library insert size increases. This is to be expected, as it is harder for large-insert pairs to detect small to medium sized events.

  Dataset    Precision  Recall
  200bp_4sd  153/153    153/190
  200bp_3sd  155/--     155/190
  500bp_4sd  158/--     158/190
  500bp_3sd  145/--     145/190
  5kb_4sd    35/52      35/190
  5kb_3sd    37/63      37/190

Figure 2.11 Event size comparisons. The red and green boxes show the size distributions of events detected in the longer-read and shorter-read datasets, respectively; the blue box shows the size distribution of events detected only in the longer-read dataset, and the orange box that of events detected only in the shorter-read dataset.

2.4.7 Comparison with other approaches

Many approaches have been developed for calling structural variants. In addition to MoDIL, we chose to compare our results mainly with BreakDancerMax (BDM) and

VariationHunter (VHW) on specimen NA18507, which has recently been studied intensively by several groups [15, 18, 20, 21, 25]. We compared our results to theirs, either by running their methods ourselves on the same datasets or by comparing their published results to ours. Because no ground truth is available for this dataset, it is difficult to compare the algorithms directly. We therefore constructed three sets of highly reliable deletion events, and compared the three approaches by treating each reliable set as ground truth. First, the 1000 Genomes Project (1KG) has recently released a set of high-quality population-level structural variant polymorphisms based on results predicted by many different algorithms [2, 26, 30]. If any algorithm tested here detects a deletion in our sample that overlaps an event in the 1KG set, the event is very likely true. Therefore, we first calculated the intersection of the 1KG set with the predicted events from each of the three methods, and then took the union of these three intersections as the set of true events (denoted R_I), with the understanding that this final set is most likely a proper subset of the true events in this sample. The second set (R_II) consists of all events detected by at least two of the three algorithms, based on the belief that an event detected by at least two algorithms is more likely to be true. The third set (R_III) consists of the validated events from Kidd et al. [25]. One should notice, however, that Kidd et al. used a very different technology (large fosmid clones), which is only effective for large events; therefore, the total number of deletions in R_III is only 155.
Note that the three variants of VariationHunter are not independent; only one of them (VariationHunterWeighted) was used to construct the reliable dataset R_II, and therefore only results from VariationHunterWeighted are considered in this subsection. Given the constructed sets of reliable events, we assessed the precision (correct predictions out of all predictions) and recall (correct predictions out of all true events) of each method, and compared their F_scores

(the harmonic mean of precision and recall). Because neither BreakDancerMax nor VariationHunterWeighted can separate heterozygous deletions from homozygous deletions, we combined the two groups and investigated the overall prediction performance, which is given in Table 2.5. The total numbers of deletion events reported by the three methods vary noticeably, ranging from ~6200 to ~8900. In addition to differences among the algorithms themselves, other factors, such as differences in mapping tools and limitations on the maximum sizes of events, may also contribute to the discrepancy. Based on R_I and R_II, SVMiner significantly outperformed both BreakDancerMax and VariationHunterWeighted with the highest F_scores (SVMiner: 0.61 and 0.87; BreakDancerMax: 0.54 and 0.65; VariationHunterWeighted: 0.53 and 0.71; more details can be found in Table 2.5). SVMiner and BreakDancerMax had similar numbers of predictions, but BreakDancerMax had lower precision and lower recall on both R_I and R_II. VariationHunterWeighted predicted many more events and therefore had better recall, but it suffered from low precision. Because the total number of events in R_III is very small, the number of overlaps for any approach is small (11 to 13), and R_III cannot be used to distinguish the three methods. We also compared the three methods on simulated data with known ground truth. As mentioned earlier when comparing SVMiner and MoDIL, SVMiner predicted a total of 153 deletions, all of which overlap the 190 embedded deletion events, yielding a precision of 100% and a recall of 80.5%. BreakDancerMax detected 147 events, all of which are embedded events. VariationHunter predicted 185 events, of which 177 were embedded events. Overall, for the simulated data, SVMiner is slightly better than BreakDancerMax,

while VariationHunter reports some false positives, but with a gain in recall in this case.

Table 2.5 Prediction performance comparison of deletion results among different approaches
The three constructed sets of reliable events are: R_I) the intersection of events from the 1KG project and events predicted by any method; R_II) events predicted by at least two methods; R_III) events from Kidd et al. The three approaches are SVMiner (SVM), BreakDancerMax (BDM) and VariationHunterWeighted (VHW).

Constructed set (#)   Method   # predictions   # true positives   Precision   Recall   F-score
R_I (3412)            SVM                                                               0.61
                      BDM                                                               0.54
                      VHW                                                               0.53
R_II (5504)           SVM                                                               0.87
                      BDM                                                               0.65
                      VHW                                                               0.71
R_III (155)           SVM
                      BDM
                      VHW

2.4.8 Inversion events analysis

To test the capability of SVMiner in detecting inversions, we ran the program on two datasets: dataset I and a simulated dataset. For dataset I, the total number of predicted inversion events is much smaller than the number of deletions. SVMiner initially identified 263 candidates, of which 148 were predicted as inversions after the clustering phase of our algorithm. Since the original paper by Bentley et al. did not report inversion results and coordinates of predicted inversions were not available for the BreakDancer algorithm, we could not compare our findings with theirs. The numbers of inversions predicted by the three variations of VariationHunter vary greatly (from 181 to 504). The

overlaps of our predicted events with those by VariationHunter are low (Table 2.6), which reflects that inversions are much harder to capture. The three variants of VariationHunter have large numbers of overlaps because these three algorithms are not independent and share a common component. Neither SVMiner nor VariationHunter detected many of the events reported by Kidd et al. We further investigated the impact on sensitivity caused by 1) the mapping ambiguity, 2) the filtering step described in the Materials and Methods section, and 3) the final classification step. None of these steps significantly affected the number of overlaps, which suggests that the discrepancy between the results by SVMiner (as well as VariationHunter) and the results from Kidd et al. is primarily due to the differences in the technologies used in each study. The capability of computational approaches is limited by the technologies that generate the data. Figure 2.12 shows classification results by SVMiner, which illustrate that events with high singleton coverage and/or large numbers of discordant pairs will likely be classified as inversions, and events that cluster near the origin tend to be labeled as normal events (i.e., not inversions). The results are in line with expectations because true events should have a high level of singleton coverage and/or large numbers of discordant pairs, while regions of no variation will likely not have inverted discordant pairs or accumulations of singleton reads. To further evaluate the performance of the proposed algorithm, we compared the results of three algorithms (SVMiner, BreakDancerMax and VariationHunter) on a simulated dataset. SVMiner predicted a total of 189 inversions, and 170 of them overlapped with the 202 embedded events (precision: 89.9%, recall: 84.2%, F-score: 87.0%). BreakDancerMax predicted 214 events and 148 of them overlapped with embedded inversions (precision: 69.2%, recall: 73.3%, F-score: 71.2%).
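The precision, recall, and F-score figures quoted above can be reproduced with a small helper function; the overlap counts below are taken directly from the simulated inversion results in the text.

```python
def prf(n_pred, n_true, n_correct):
    """Precision, recall, and their harmonic mean (F-score)."""
    precision = n_correct / n_pred
    recall = n_correct / n_true
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# SVMiner on the simulated inversion dataset: 189 predictions,
# 170 of which overlap the 202 embedded events.
p, r, f = prf(189, 202, 170)  # ≈ 0.899, 0.842, 0.870
```

The same helper applied to BreakDancerMax's counts (214, 202, 148) reproduces its reported 69.2% / 73.3% / 71.2%.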

VariationHunter only predicted 41 inversions, of which 34 overlapped with the true events (precision: 82.9%, recall: 16.8%, F-score: 28.0%). We made every effort to run VariationHunter correctly and used the same configuration as the one used for the real data, but it is unclear why its recall was so low. Overall, SVMiner performed the best on this dataset.

Table 2.6 The predicted inversion events
The overlaps among predicted inversion events by SVMiner, three versions of VariationHunter, and the events detected using a different technology (Kidd et al.). The total number of events by each approach is provided in the parentheses.

              Kidd (82)   VHW (504)   VHU (433)   VHP (181)   SVMiner (148)
Kidd (82)
VHW (504)                             141
VHU (433)

Figure 2.12 Inversion results
The results of the predicted inversions for the NA18507 specimen. The blue points are predicted inversions and the red points are predicted false positives.
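The separation shown in Figure 2.12, where events with high singleton coverage and/or many inverted discordant pairs fall into the inversion cluster and events near the origin into the normal cluster, can be illustrated with a toy two-cluster sketch. It uses plain 2-means on synthetic feature values, a deliberate simplification of SVMiner's model-based clustering; the feature values and cluster centers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D features per candidate region (hypothetical values):
# column 0 = singleton coverage, column 1 = number of discordant pairs.
normal = rng.normal([1.0, 1.0], 0.5, size=(100, 2))
inversions = rng.normal([8.0, 10.0], 1.5, size=(40, 2))
X = np.vstack([normal, inversions])

def two_means(X, iters=50):
    """Cluster X into two groups; a crude stand-in for a 2-component mixture."""
    centers = np.array([X.min(axis=0), X.max(axis=0)], dtype=float)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

labels, centers = two_means(X)
# Candidates far from the origin land in one cluster ("inversions"),
# candidates near the origin in the other ("normal events").
```

In SVMiner itself the clustering is model-based rather than distance-based, so each candidate receives a membership probability instead of a hard label.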

2.5 Discussion

Our results show that SVMiner is well suited for detecting both homozygous and heterozygous deletion events. It is more difficult to assess the efficacy of inversion detection, but SVMiner's detection of the inversion events in the simulated dataset lends credence to the choice of feature values for detecting inversions within the clustering framework. Of the methods we compared to SVMiner, only MoDIL had the ability to detect heterozygous deletions. Furthermore, not only did SVMiner clearly outperform MoDIL on both the chromosome X dataset and the simulated dataset, but the time needed for the analysis was much shorter for SVMiner. Since MoDIL is not suited for detecting large deletions, it is possible that its performance suffered because of this limitation. Regarding the performance of SVMiner on NA18507, the results suggest that one can have high confidence in SVMiner's predictions. Compared to BreakDancer and VariationHunter, SVMiner had the highest F-scores when compared against events from the 1000 Genomes Project or events detected by at least two approaches. Since SVMiner utilizes a model-based clustering framework, it bases its predictions on patterns exhibited by the input dataset, thus freeing the user from any need to manually adjust parameters based on cluster characteristics. Variant cluster signatures may differ across datasets, and SVMiner is well suited to take advantage of such variation. For the comparison of long- and short-read datasets with different insert lengths, we found that around 40% of the events detected by SVMiner using the two datasets were in common. The overlap of deletion events detected by BreakDancer from these two datasets was even

lower (~30%). This suggests that although both programs are capable of detecting deletions using different read lengths and insert sizes, the set of events that can potentially be detected relies heavily on the input datasets. The comparisons with events from Kidd et al. further demonstrated the limitations of different technologies. Unlike SNP calling platforms and algorithms, which can normally achieve very high accuracy, structural variation calling still requires further development in technologies and improvement in algorithms. A more complete catalogue of common structural variants with more accurate breakpoint information in different populations will also greatly help the heterozygosity predictions. Future work will investigate other features for the refinement of the breakpoints of SVs at the nucleotide level. One of the contributions of our approach is that it can genotype simple structural variation events (i.e., deletions), so we will also investigate ways to predict the heterozygosity of events other than deletions. Similar attempts have recently been made by other researchers. For example, Hormozdiari et al. [19] extended their VariationHunter algorithm to identify transposon events [31], in which they also took into account the diploid structure of the human genome. It is known that other factors, such as local G+C content and local repetitive structures [15], can greatly affect local read mappability. Therefore, even given a specific type of structural variation (e.g., a heterozygous deletion), the expected value of a particular feature (e.g., normal read depth) may vary, which in turn may affect our final predictions. We will investigate approaches to account for these factors. Furthermore, the mixture model in our approach assumes normal distributions for all features, which may not hold in practice. The effect of this violation of the normality assumption on the final predictions was

not thoroughly investigated in this study. By considering the relationships between the expected value of read depths and local sequence characteristics (e.g., G+C content), it is possible that we can derive or transform the raw values of our features to better fit the normality assumption. Lastly, we will extend our approach to detect structural variants not described in this paper, such as tandem duplications [32] and translocations [7].

Chapter 3
Genotype and SNP calling using Next Generation Sequencing Data

3.1 Literature Review

SNP calling refers to the process of identifying polymorphic sites where at least one nucleotide allele differs from a reference base, and genotype calling is the process of determining the genotype of each individual at each such site. For sequence-based SNP analysis, these two processes are highly connected and are sometimes performed at the same time. SNP and genotype calling using NGS is one of the most actively researched areas (for a review, see [33]). Early NGS studies relied mainly on the counts of reference alleles and non-reference alleles, and simple algorithms based on hard filtering were used. Filtering criteria included read coverage and base quality scores. For example, if a site has at least five reads with Phred quality scores greater than 20, and the ratio of allele counts between the reference base and the non-reference base is between 20% and 80%, then it is called as heterozygous; otherwise it is called as homozygous. While this straightforward method performs well when the sequencing coverage is high (>20X), for moderate or low coverage data, genotype calling based on fixed cutoffs can lead to under-calling of heterozygous genotypes and loss of information. Another disadvantage of this type of genotype calling is that it typically does not provide measures of uncertainty in the genotype inference. For these reasons, several approaches based on probabilistic

frameworks have been proposed, which utilize information about error probabilities from base calling and/or alignment. Many of the available SNP and genotype calling procedures use some form of the basic Bayes' theorem,

P(T_i | D) = P(T_i) P(D | T_i) / P(D),

where P(T_i | D) is the posterior probability of event T_i given the observed data D, P(T_i) are the prior probabilities, and P(D | T_i) are the conditional probabilities. The posterior probability is then used to determine the final genotype at the site. Among other differences, different methods use different sets of events for T_i in the Bayesian framework. For example, SAMtools [34] partitions these events into 3 categories: homozygous reference (non-variant), heterozygous genotype, and homozygous alternative. Genome Analysis Tool Kit (GATK) [35] and SOAPsnp [36] divide the events into all 10 possible genotypes. Therefore, these different methods define slightly different conditional probabilities and prior probabilities under a similar framework. All these probabilistic methods provide a measure of statistical uncertainty when calling genotypes or SNPs, which is their major advantage.

3.2 Project Statement

The aforementioned genotype and SNP calling methods showed fairly good genotype calling accuracies on an example dataset we have (Supplementary Tables S3.1 ~ S3.3).
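As an illustration of the Bayesian framework described in the literature review above, the following toy three-class caller (the SAMtools-style partition into homozygous reference, heterozygous, and homozygous alternative) computes the posterior over genotypes from a pileup of base calls under a single uniform error rate. The error model and priors are simplifications invented for this sketch, not the model of any specific tool.

```python
def genotype_posterior(bases, ref, alt, err=0.01, priors=(0.998, 0.001, 0.001)):
    """Posterior P(G | reads) for G in {hom-ref, het, hom-alt}."""
    def p_base(b, g):
        p_ref = 1 - err if b == ref else err
        p_alt = 1 - err if b == alt else err
        if g == "hom-ref":
            return p_ref
        if g == "hom-alt":
            return p_alt
        return 0.5 * (p_ref + p_alt)  # het: read drawn from either allele

    genotypes = ("hom-ref", "het", "hom-alt")
    post = {}
    for g, prior in zip(genotypes, priors):
        like = prior
        for b in bases:
            like *= p_base(b, g)
        post[g] = like
    z = sum(post.values())  # the P(D) normalizer from Bayes' theorem
    return {g: v / z for g, v in post.items()}

# A site covered by 4 reference and 4 alternative reads:
calls = genotype_posterior("AAAAGGGG", ref="A", alt="G")
```

With a balanced pileup like this, the heterozygous class dominates the posterior, which is exactly the behavior a fixed-cutoff caller can miss at low coverage.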

However, we also found that there were significant differences in SNP calling rates and genotype calls. For example, while GATK made about 48,000 SNP calls, SAMtools called about 35,000 SNPs for one sample. Moreover, the fraction of inconsistent genotype calls on this dataset was as high as 19%, which cannot simply be ignored. These significant differences in SNP and genotype calling among different methods prompted us to compare the performance of each method and to devise new approaches to further improve SNP calling. Many institutions have been involved in collaborative efforts to identify variants, such as the 1000 Genomes Project [2]. Different datasets have been generated on different platforms, and different SNP call sets have been produced using different methods. Naturally, the consensus calls from different methods are in general of higher quality than any individual call set. Therefore, the 1000 Genomes Project only reported consensus results obtained by taking the intersections of the call sets from different methods [2]. While such a strategy is expected to yield high specificity, it will also lead to low sensitivity. Another intuitive way of combining multiple results is to use a majority voting scheme. However, one drawback of the basic majority voting method is that it arbitrarily resolves ties. As an alternative, here we propose an integration method, named intsnp (for integrating SNP results), which is based on the naïve Bayes classification method. The basic assumption here is that we have a set of sequence-based SNP calling algorithms that are able to report SNPs and genotypes for any given input sequences. We also have some training data where we know the ground-truth genotypes of some loci from other means (e.g., we use genotypes from SNP arrays in this study). Based on the training data and some prior knowledge (e.g., population-specific SNP frequencies), we will be

able to train a naïve Bayes classifier to automatically determine the final genotypes from the multiple calls of the different methods. Once the model is trained, it can then be used for novel variant loci and for other datasets without results from SNP arrays; in both cases, no ground truth is known. By doing so, the suggested approach is expected to report more confident and accurate genotypes than any single method. Because SNP array technology is very mature and results from existing arrays are of extremely high quality, there should be few if any issues with taking them as training data. Furthermore, the dataset we use in this project also contains family data, which adds another level of quality assurance. The learned model can be applied to other datasets as long as the samples are from the same population and have been sequenced on the same platform. In our implementation, we use three widely used genotype and SNP calling algorithms (i.e., GATK, SOAPsnp and SAMtools) and a real dataset that consists of two family trios. Each sample has been genotyped using the Illumina Omni 2.5M array and has undergone exome sequencing. As a byproduct, our study will also compare the performance of the three algorithms, which has not been done previously. The result of such a comparison will also be useful for the research community.

3.3 Naïve Bayes classifier for SNP discovery and calling

Despite its simplicity, the naïve Bayes classifier has shown strong performance on a variety of complex real-world problems. For the naïve Bayes classifier, attributes are assumed to be mutually independent of each other within a given class. Therefore a naïve

Bayes classifier is fully defined by the conditional probabilities of each attribute given the class. Genotype calling is often treated as a three-label (AA, AB and BB) classification problem. In our study, we further divide this into a ten-class model. The set of class labels G = {AA, AC, AG, AT, CC, CG, CT, GG, GT, TT} consists of all ten possible biallelic genotypes. The attribute set M = {M_1, M_2, ..., M_n} consists of the genotypes called by methods 1 to n. Then, assuming conditional independence of the attributes, the posterior probability of a particular genotype G in G can be written as:

P(G | M_1, M_2, ..., M_n) = (1/Z) P(G) ∏_{i=1}^{n} P(M_i | G),  (1)

where P(G) is the prior probability of genotype G, P(M_i | G) is the conditional probability that method i calls genotype M_i given the true genotype G, and Z = P(M_1, M_2, ..., M_n) is a scaling factor that depends only on M_1, M_2, ..., M_n. Since Z is constant, it does not affect the final calling results. The final classification result can be obtained based on the maximum a posteriori principle:

classify(M_1, M_2, ..., M_n) = argmax_G P(G) ∏_{i=1}^{n} P(M_i | G).  (2)

As mentioned earlier, the naïve Bayes classifier is trained on known genotypes obtained from SNP genotyping arrays, and prior genotype probabilities are obtained from population-specific genotype frequencies in the dbSNP database. For SNP loci with no genotype frequencies in dbSNP, allele frequencies are used to calculate genotype frequencies assuming Hardy-Weinberg equilibrium. Therefore, the prior probabilities are calculated as:

P(AA) = p^2,  P(AB) = 2pq,  P(BB) = q^2,

where the major allele (the one with frequency greater than 0.5) and the minor allele are denoted as A and B, respectively, and their frequencies are denoted by p and q, respectively. Typically, the frequencies of both alleles are available in dbSNP (e.g., A and C, with C: 0.467). In our implementation, we actually consider all 10 possible genotype combinations by assigning genotypes involving the other two alleles a very small probability (say, 0.001). When population-specific allele frequencies are not available, the average allele frequencies over all populations are used. The conditional probability P(M_i | G) is empirically calculated from the training data, which consists of all loci from the SNP genotyping array and the genotype calling results of the different algorithms on these loci. This training data is first divided into 16 bins according to genotype. Here we use 16 bins because we separate each heterozygous genotype into two bins depending on which allele is the major allele. Theoretically, these partitions should be sufficient to obtain P(M_i | G). However, it is known that SNP calling methods have different performance for loci with different allele frequencies. Therefore, we further partition each bin into 10 smaller bins according to the minor allele frequencies. Because the distribution of minor allele frequencies roughly follows a uniform distribution (except for loci with very small frequencies, Figure 3.1), the size of the frequency intervals of all the bins is set to 0.05, i.e., the 10 intervals are (0, 0.05], (0.05, 0.1], ..., (0.45, 0.50]. Finally, the training data

consists of 16 × 10 = 160 bins in total (Figure 3.2). The conditional probability of each bin is calculated from the training data.

Figure 3.1 Minor allele frequency distributions of known SNP loci obtained from dbSNP

3.4 Allele ambiguity due to strand issue

It is known that the human reference genome assembly may have changed after a SNP was discovered. For example, for some SNPs, the strand information of the regions harboring these SNPs was flipped sometime after the SNPs were discovered. The allele types of those SNPs may then not be consistent with the forward strand of the current reference. If the alleles are A/C, A/G, C/T, or T/G, genotypes can easily be distinguished from their flipped forms. However, A/T and G/C SNPs can be problematic due to the ambiguity. Often this ambiguity makes it difficult or impossible to know which strand has actually been used. Furthermore, allele frequency information may also

Figure 3.2 Training data generation
Genotype results from the different algorithms, known genotypes obtained from the SNP genotyping array, and allele/genotype frequency information from dbSNP are combined and divided into 16 × 10 bins according to allelic combination and its frequencies.
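A minimal sketch of the classifier defined by equations (1) and (2), with Hardy-Weinberg priors and conditional probabilities estimated from training pairs, is shown below. For clarity it conditions only on the true genotype (the full method further bins by major/minor allele identity and minor allele frequency, as described above); the function names and toy data are illustrative, not part of the actual intsnp implementation.

```python
from collections import defaultdict

def hwe_prior(g, major, minor, p, q, eps=0.001):
    """Hardy-Weinberg priors p^2, 2pq, q^2; tiny mass for other genotypes."""
    if g == major + major:
        return p * p
    if g in (major + minor, minor + major):
        return 2 * p * q
    if g == minor + minor:
        return q * q
    return eps

def train_conditionals(pairs, n_methods, smooth=1e-3):
    """Estimate P(method i calls m | true genotype g) from (truth, calls) pairs."""
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(n_methods)]
    totals = [defaultdict(float) for _ in range(n_methods)]
    for g, calls in pairs:
        for i, m in enumerate(calls):
            counts[i][g][m] += 1.0
            totals[i][g] += 1.0
    return lambda i, m, g: (counts[i][g][m] + smooth) / (totals[i][g] + smooth)

def classify(calls, cond, major, minor, p, q):
    """MAP genotype: argmax_g P(g) * prod_i P(M_i | g), as in equation (2)."""
    candidates = (major + major, major + minor, minor + minor)
    def score(g):
        s = hwe_prior(g, major, minor, p, q)
        for i, m in enumerate(calls):
            s *= cond(i, m, g)
        return s
    return max(candidates, key=score)

# Toy training data: three callers, the second one under-calling heterozygotes.
pairs = ([("AA", ["AA", "AA", "AA"])] * 90 +
         [("AG", ["AG", "AA", "AG"])] * 20 +
         [("GG", ["GG", "GG", "AG"])] * 5)
cond = train_conditionals(pairs, 3)
call = classify(["AG", "AA", "AG"], cond, major="A", minor="G", p=0.8, q=0.2)
```

Because the training data show that the second caller systematically reports AA at true heterozygous sites, the classifier still outputs AG for the call pattern (AG, AA, AG), where plain majority voting would also pick AG but could not resolve a 1-1-1 tie.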

be problematic in this case. Because rigorous verification of the allele types and frequencies of all SNPs was impractical, some of the public databases (e.g., dbSNP) might contain erroneous alleles and/or frequencies, which can cause errors in downstream analyses [37]. Due to this strand issue on A/T and G/C genotypes, HapMap3 [14] excluded some conflicting calls from Illumina platforms when they were A/T or G/C genotypes. Because of this, Illumina devised an A/B allele coding and the TOP/BOT strand definition based on the actual or contextual sequence of each individual SNP. Illumina GenomeStudio, the software package from Illumina for SNP arrays and Illumina NGS, reports TOP/BOT and A/B genotypes as its default genotypes. However, the A/B genotypes cannot be directly compared with SNP calling results from other methods based on NGS data. Therefore, in this study, we use the Allele1 Plus and Allele1 Minus columns in the final report generated by Illumina GenomeStudio as the true genotypes. However, when we use this encoding, there are some inconsistencies between these genotypes and the allele frequencies in dbSNP for A/T and G/C SNPs. For example, from one individual (P5098), we take all the loci with dbSNP minor allele frequencies less than 0.05 and separate them into 12 possible combinations according to their allele combinations (Figure 3.3). Ideally, the majority of these loci will follow Hardy-Weinberg equilibrium and will have extremely high frequencies for homozygous major alleles, low frequencies for heterozygous genotypes, and extremely low frequencies for homozygous minor alleles. While many of these frequencies follow this trend, the frequency distributions at loci with alleles A/T and G/C do not. This shows that there are still some issues with genotypes called by Illumina when loci have A/T and G/C alleles.

There is no easy fix for this problem; therefore, we choose to ignore loci with dbSNP alleles A/T or G/C in this study. These filtered loci constitute about 5.3% of all loci.

Figure 3.3 Frequencies of different genotypes at loci with minor allele frequency less than 0.05 for individual P5098
On the x-axis (the 12 allele combinations A/C, A/G, A/T, C/A, C/G, C/T, G/A, G/C, G/T, T/A, T/C, T/G), the allele before "/" is the major allele and the one after it is the minor allele. The blue, red and green colors represent the genotype frequencies of the homozygous major allele, heterozygous, and homozygous minor allele genotypes, respectively.

3.5 Results

3.5.1 Experimental design

The suggested method is applied to an exome sequencing dataset of a Caucasian family trio (Figure 3.4), which was obtained from our collaborators. Exome targets were captured using the Agilent SureSelect Human All Exon 50Mb bait kit and sequenced with the Illumina Genome Analyzer II. All three individuals have been genotyped using the Illumina Omni 2.5 array, and these genotypes are assumed to be the ground truth. The exome target regions contain 87,287

SNPs from roughly 2.4 million loci on the Illumina Omni 2.5 array. After removing the loci with missing genotypes in these samples, we obtain about 83,000 loci. The raw sequences are aligned against genome build 37 (hg19) using BWA [24] (or SOAP2 [38] for SOAPsnp). Local realignment and quality score recalibration are performed using the Genome Analysis Toolkit (GATK). Final SNPs are called by GATK, SAMtools and SOAPsnp. These sequence-based SNP calling algorithms report genotypes only if some reads contain mutated alleles. Therefore, the loci reported by these algorithms are either homozygous mutant or heterozygous genotypes, which differs from the array data, which also contains loci with homozygous reference genotypes. All other loci are assumed to be homozygous reference genotypes for all three approaches.

Figure 3.4 A family trio used in this study (P5097, P5098, P1139)

3.5.2 Genotype accuracy

The total numbers of variant sites range from 35,000 to 49,000 for the different individuals and calling methods using their default parameters (Table 3.1). For all three individuals, GATK reported the largest number of SNPs while SAMtools reported the smallest. The difference between GATK and SOAPsnp is relatively small, but the difference between SAMtools and GATK or SOAPsnp is substantial. When we

further partition these loci into known loci (i.e., those also on the SNP array) vs. novel loci (those not on the SNP array), it is obvious that most of the differences are at novel loci. Because the primary purpose of sequence-based SNP calling and discovery is to find novel variant loci, the results in Table 3.1 indicate that these algorithms have a significant proportion of disagreement in calling new variant loci, and better calling algorithms are greatly needed. When incorporating all SNP loci from the Illumina Omni 2.5 array, the total number of analyzed loci becomes about 117,000 to 126,000 per individual (Table 3.2). Of these loci, about 63% to 70% have matching genotypes on the Illumina Omni 2.5 array. In other words, true genotypes are available for these 63% to 70% of the total loci. Therefore, we could directly assess the genotype calling accuracy of these methods for these known polymorphic sites. As a performance measurement metric, taking the genotypes called from the Illumina Omni 2.5 array as the true genotypes, we defined genotype calling accuracy as:

calling accuracy = (# correctly called genotypes) / (# loci with known true genotypes).

The remaining loci (~30% to 37%) cannot be directly assessed because no ground truth is available. Most of them will be novel SNPs, but some of them are false positives. In the next subsection, we indirectly measure the performance of the different algorithms on these loci by checking their Mendelian consistency rates among the trios. For our own algorithm, we performed several types of analysis in this project. First, we took all the known SNP sites as training data as well as testing data. Second, we performed 5-fold cross-validation on the training data. Third, for the novel SNPs with no

Table 3.1 The number of variant loci identified by the three methods
A SNP is regarded as known if the locus is on the Illumina Omni array, and novel otherwise.

                     P5097                       P5098                       P1139
            known   novel   total      known   novel   total      known   novel   total
GATK       18,978  29,562  48,630     18,799  30,534  49,333     18,937  26,263  45,200
SAMtools   16,494  18,775  35,269     17,378  27,046  44,424     17,104  18,697  35,801
SOAPsnp    18,635  27,336  45,971     18,474  27,581  46,055     18,697  25,923  44,620

Table 3.2 The number of SNP loci after combining all results from the three methods and the Illumina Omni 2.5 genotypes
These loci are further partitioned into categories with and without dbSNP information, and consistent or inconsistent calls.

                                         P5097                       P5098                        P1139
                                  known   novel    total     known   novel    total      known   novel    total
dbSNP info. available            78,000  19,089   97,089    77,721  26,751  104,472     77,871  19,500   97,371
dbSNP info. unavailable           5,089  20,526   25,615     5,059  16,724   21,783      5,067  14,998   20,065
Called genotype is consistent(1) 80,055  19,228   99,283    80,709  22,058  102,767     80,619  20,287  100,906
Called genotype is inconsistent(2) 3,034 20,387   23,421     2,071  21,417   23,488      2,319  14,211   16,530
Total                            83,089  39,615  122,704    82,780  43,475  126,255     82,938  34,498  117,436

(1) The genotype called by GATK, SAMtools and SOAPsnp is the same.
(2) At least one method called a different genotype from the other methods.

ground truth, we examined the Mendelian consistency among the trio. Finally, using this trio as the training dataset, we tested the performance of the algorithm on a different trio. Prior information on allele frequencies was obtained from the dbSNP database. About 79% to 83% of the total loci have allele frequency information available. For the remaining loci (including novel predictions), we assume that they are all rare variants and use the corresponding bins with the lowest minor allele frequency for training and testing. In the first experiment, intsnp was trained on all known loci from all three samples. Then the genotypes of these loci were predicted. When we measured the genotype calling accuracy, we divided the test set into homozygous reference alleles, heterozygous genotypes, and homozygous alternative alleles according to the given true genotypes, and compared the calling accuracy of GATK, SAMtools, SOAPsnp, majority voting, and intsnp for the trio (Figure 3.5). We observed that for the variant loci (heterozygous and homozygous variant alleles), intsnp showed slightly higher accuracy than each individual method and the majority voting method. GATK also performs very well, slightly better than the majority voting method. SOAPsnp is slightly worse than the majority voting method. All of these are significantly better than SAMtools. For homozygous reference alleles, all methods have similar accuracy (>99.8%). More detailed results, including raw numbers, are presented in Supplementary Tables S3.1 to S3.3 in the Appendix. Because for the three sequence-based algorithms we treated no-calls as homozygous reference alleles, lower sensitivity of SNP calling could lead to higher calling rates of homozygous reference genotypes. For example, SAMtools has the highest genotype calling accuracy for homozygous reference genotypes, mainly because it made the smallest number of variant SNP calls. Therefore, in addition to the accuracy

measure, more attention should be paid to the false positive rate of variant calls, the false negative rate at variant loci, and inconsistent genotypes at variant loci. However, since not all of these can be evaluated using the data available in this project (e.g., the false positive rate), we mainly focus on accuracy in this study. In order to measure the robustness of the suggested method, we also performed 5-fold cross-validation. All the loci from the Illumina Omni 2.5 array in the three samples were mixed together and randomly partitioned into five subsets. The genotypes from the three samples were equally allocated to these five subsets. The 5-fold cross-validation results (Figure 3.5) are very close to the previous results using all training data as the test data. Because for the majority of all loci the three algorithms actually have the same genotype calls, and because when the three algorithms have the same calls intsnp almost always reports the same result 6, it makes more sense to examine the performance of intsnp on loci where the three have reported different genotypes. The total number of these conflicting loci varies from 2,000 to 3,000 for the different individuals. On variant loci, intsnp showed further improvements over all the methods (Figure 3.6, Supplementary Tables S3.1 to S3.3). It correctly reported more heterozygous genotypes and homozygous genotypes with variant alleles. However, it reported a smaller number of homozygous genotypes with reference alleles. Overall, on the known sites, although intsnp performs slightly better than each individual method, some of the differences are not very significant, especially in comparison with GATK, which itself is already very accurate on these known sites.

6 There may be one or two loci per sample for which intsnp reported different genotypes because of strong contributions from the prior information (i.e., allele frequencies). In all these cases, intsnp actually reported the correct genotypes.
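The genotype calling accuracy used in these comparisons, with loci not reported by a sequence-based caller defaulted to the homozygous reference genotype (as described earlier), can be sketched as follows; the locus names and genotypes are made up for illustration.

```python
def calling_accuracy(truth, calls, ref_genotype):
    """Fraction of loci with known truth whose (possibly defaulted) call matches.

    truth:        {locus: true genotype from the array}
    calls:        {locus: genotype reported by the caller}
    ref_genotype: {locus: homozygous reference genotype at that locus}
    """
    correct = 0
    for locus, true_g in truth.items():
        called = calls.get(locus, ref_genotype[locus])  # no-call -> hom-ref
        if called == true_g:
            correct += 1
    return correct / len(truth)

truth = {"rs1": "AA", "rs2": "AG", "rs3": "GG"}
ref = {"rs1": "AA", "rs2": "AA", "rs3": "AA"}
calls = {"rs2": "AG"}          # the caller reported only one variant site
acc = calling_accuracy(truth, calls, ref)  # rs1 correct by default, rs3 missed
```

This default-to-reference convention is exactly why a low-sensitivity caller such as SAMtools scores well on homozygous reference loci in the comparisons above.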

Furthermore, the number of inconsistent calls on these known sites is very small (2,000 to 3,000 loci out of the ~83,000 known loci). In contrast, about 86% to 92% of inconsistent genotypes are found at loci where the true genotypes are unknown. It is expected that intsnp should be able to make more improvements on these novel sites. We examine its performance by checking the Mendelian consistency rate on these loci as an indirect measure in the next subsection.

(a)

(b)

(c)

Figure 3.5 Genotype calling accuracy comparison on the three individuals: (a) P5097, (b) P5098, (c) P1139
Accuracies are shown for HOM-REF, HET, HOM-ALT, and all variant loci, comparing GATK, SAMtools, SOAPsnp, majority voting, intsnp, and intsnp with 5-fold cross-validation.

(a)

(b)

(c)

Figure 3.6 Genotype calling accuracy comparison at conflicting genotype loci: (a) P5097, (b) P5098, (c) P1139
Accuracies are shown for HOM-REF, HET, HOM-ALT, and all variant loci, comparing GATK, SAMtools, SOAPsnp, majority voting, and intsnp.

3.5.3 Mendelian inconsistency analysis

Mendelian inconsistency refers to the situation in which a child has an allele that could not have been inherited from either of his/her biological parents according to Mendel's law. For example, if the genotype of a child is AA and the parents have AA and BB genotypes, then this violates Mendelian inheritance. Since Mendelian inconsistency caused by de novo mutation is typically very rare, Mendelian errors are mainly due to genotyping errors when the family relationship is known to be accurate. Among all the algorithms, SAMtools has significantly higher Mendelian error rates (Figure 3.7). Although the differences among the other three algorithms (GATK, SOAPsnp, intsnp) are small, intsnp has the lowest Mendelian error rates for both known loci and novel loci. Since higher Mendelian error rates imply higher genotyping error rates, this result implies that

intsnp has better genotype calling across all the loci. Another striking observation is that the Mendelian error rates of all methods on novel loci are much higher (5 to 20 times) than those on known loci (supplementary tables S3.4 and S3.5). Because the primary purpose of sequence-based SNP discovery is to find novel SNPs, this result indicates that much work remains to be done in sequence-based SNP calling.

Figure 3.7 Mendelian error rates among the different methods. [Bar chart of Mendelian error rates for GATK, SAMtools, SOAPsnp, and intsnp on loci with known genotypes, loci with unknown genotypes, and all loci.]

We also clustered the novel loci with Mendelian errors into categories according to which approaches reported the errors (Table 3.3). The first row (A) gives the number of novel loci at which no method reported a Mendelian error, although at least one method reported genotypes for at least one family member. For rows B to O, at least one method again called a genotype in at least one family member that differed from the other methods, and some but not all methods reported Mendelian errors. In the last row (P), all methods reported Mendelian errors. The total number of such loci is 35,362. Rows B, C, E, and I correspond to loci at which only one method reports Mendelian errors

while the other methods do not. At such loci, the genotypes called by the method with the Mendelian errors must include some incorrect calls; the number of these loci is therefore a lower bound on the number of incorrect genotypes from that method. From this table, we can see that our method (row B) has the smallest number of loci with Mendelian errors (71), while the other methods have many more (GATK: 684; SAMtools: 2969; SOAPsnp: 1512).

Table 3.3 The counts of loci at which the different approaches reported Mendelian errors. A dot (".") denotes that the method has no Mendelian error at the locus; "X" denotes that it does. The last row gives the total number of Mendelian errors for each method.

  Row    GATK   SAMtools   SOAPsnp   intsnp   # loci
  A       .        .          .        .         —
  B       .        .          .        X         71
  C       .        .          X        .       1512
  D       .        .          X        X        705
  E       .        X          .        .       2969
  F       .        X          .        X        163
  G       .        X          X        .        154
  H       .        X          X        X        172
  I       X        .          .        .        684
  J       X        .          .        X        538
  K       X        .          X        .        318
  L       X        .          X        X        368
  M       X        X          .        .         —
  N       X        X          .        X       1207
  O       X        X          X        .         11
  P       X        X          X        X        187
  Total   —        —          —        —         —
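The two analyses above — checking a trio for Mendelian consistency, then grouping loci by which callers report errors as in Table 3.3 — can be sketched in a few lines of Python. This is an illustrative reconstruction, not the dissertation's actual implementation: the genotype encoding (allele pairs) and the per-locus error-flag dictionaries are assumptions made for the example.

```python
from collections import Counter
from itertools import product

METHODS = ["GATK", "SAMtools", "SOAPsnp", "intsnp"]

def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype can be formed by taking one allele
    from each parent (allele order does not matter)."""
    target = tuple(sorted(child))
    return any(tuple(sorted((a, b))) == target
               for a, b in product(father, mother))

def error_pattern(errors):
    """Encode which methods report a Mendelian error at a locus as a
    string of '.' (no error) and 'X' (error), in METHODS order,
    mirroring the rows of Table 3.3."""
    return "".join("X" if errors.get(m, False) else "." for m in METHODS)

# The example from the text: a child AA with parents AA and BB is
# inconsistent, because the child must inherit one B allele.
print(is_mendelian_consistent(("A", "A"), ("A", "A"), ("B", "B")))  # False

# Hypothetical per-locus error flags, counted by pattern:
loci = [{"intsnp": True}, {"SAMtools": True}, {"intsnp": True}]
counts = Counter(error_pattern(e) for e in loci)
print(counts["...X"])  # 2 -> row B (only intsnp has errors)
print(counts[".X.."])  # 1 -> row E (only SAMtools has errors)
```

With real data, each entry of `loci` would record, for one novel locus, which callers produced a Mendelian-inconsistent trio, and the resulting `Counter` reproduces the row counts of Table 3.3.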

3.5.4 Test on unrelated individuals

So far, the training and testing datasets we analyzed were from the same individuals. In practice, however, many sequencing projects may not have matching SNP array data, so it is important to test the performance of the program on a different testing dataset. In our next experiment, we trained the model using data from one family but tested its performance on an unrelated individual (P209). We also tested the program on this individual using genotypes from its own family members as training data. Results (Figure 3.8) show that the two procedures have almost the same calling accuracy at variant loci. Furthermore, when trained on unrelated individuals, the accuracy of intsnp at variant loci was still higher than that of each individual calling method. These trends are more pronounced when we focus on the conflict loci (Figure 3.8b). For homozygous genotypes with reference alleles, the other methods show higher accuracy, similar to the previous analyses; once again, this is mainly because our treatment of homozygous reference alleles is biased toward methods with low sensitivity at variant loci. Overall, the proposed method can be trained on one dataset and then applied to new sequence datasets generated on the same platform for samples from the same population. More detailed results can be found in supplementary table S.
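The train-on-one-dataset, apply-to-another workflow described above can be sketched as a toy naive Bayes combiner. This is a simplified illustration, not the actual intsnp model: it treats caller outputs as categorical features with a small smoothing constant, and the training calls, prior, and genotype labels below are hypothetical.

```python
import math
from collections import defaultdict

GENOTYPES = ["HOM-REF", "HET", "HOM-ALT"]

def train(training_calls, true_genotypes, eps=1e-6):
    """Estimate P(caller reports call c | true genotype g) by counting
    over loci whose true genotypes are known (e.g. from a SNP array)."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for calls, g in zip(training_calls, true_genotypes):
        totals[g] += 1.0
        for caller, c in calls.items():
            counts[(caller, g, c)] += 1.0
    def likelihood(caller, c, g):
        # eps-smoothed conditional probability, so unseen combinations
        # get a small but nonzero likelihood.
        return (counts[(caller, g, c)] + eps) / (totals[g] + eps * len(GENOTYPES))
    return likelihood

def combine(prior, likelihood, calls):
    """Naive Bayes: pick the genotype maximizing log prior plus the sum
    of per-caller log-likelihoods (callers treated as independent)."""
    def score(g):
        return math.log(prior[g]) + sum(
            math.log(likelihood(caller, c, g)) for caller, c in calls.items())
    return max(GENOTYPES, key=score)

# Hypothetical training loci with known truth, then application to a
# locus from a different sample:
train_calls = [{"GATK": "HET", "SAMtools": "HET"},
               {"GATK": "HOM-REF", "SAMtools": "HOM-REF"},
               {"GATK": "HET", "SAMtools": "HOM-REF"}]
truth = ["HET", "HOM-REF", "HET"]
prior = {"HOM-REF": 0.8, "HET": 0.15, "HOM-ALT": 0.05}  # hypothetical
lik = train(train_calls, truth)
print(combine(prior, lik, {"GATK": "HET", "SAMtools": "HET"}))  # HET
```

Once `lik` has been fitted on one family, `combine` can be applied unchanged to calls from an unrelated individual, which is exactly the cross-dataset setting tested in this subsection.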

Figure 3.8 Comparison of genotype calling accuracies of the different algorithms on sample P209, with intsnp trained on two different training datasets: (a) all known loci; (b) known loci with conflicting genotypes from the individual approaches. [Bar charts comparing GATK, SAMtools, SOAPsnp, majority voting, intsnp trained on a different family, and intsnp trained on the same family, across HOM-REF, HET, HOM-ALT, and all variant loci.]

3.6 Conclusion and Discussion

In this thesis, we have introduced a naïve Bayes classifier-based SNP calling method, intsnp, which combines the genotype calls of multiple existing methods to achieve more accurate results. Population-specific allele frequency information is used as the prior, and genotypes of known polymorphic loci obtained from the Illumina Omni 2.5 array are used as training data. Results showed that this combined approach achieved better genotype calling accuracy than each individual method; it also outperformed the simple majority-voting method at variant loci.

A significant portion of the variant loci reported by sequence-based SNP calling algorithms are not on existing SNP arrays, and most of the inconsistency among SNP calling algorithms actually occurs at these novel sites. However, because no ground truth is available at these sites, we could not directly assess the genotype calling accuracy of the different approaches on this subset. As an alternative, we used Mendelian inconsistency as an indirect measure of the performance of the different algorithms. Our method showed the lowest Mendelian error rates both at loci where the true genotypes are known and at loci where they are unknown. Finally, we investigated whether our method can be trained on one dataset and applied to a different one; on a limited dataset with two different families, the results showed that it can. With the falling cost of next generation sequencing technologies, it is expected that more sequence data will become available, much of it without genotype results from SNP

arrays. The proposed method can contribute better genotype calls on such data by taking advantage of existing SNP calling algorithms.

There are many ways the proposed work can be extended. At known loci, all the approaches have very high accuracy, and the majority of the disagreements among individual algorithms occur at novel loci; however, the investigation of this issue in this study is limited. Furthermore, each individual approach also provides a reliability score for each called genotype, and an integration method should be able to take these scores into consideration. When family members are available, joint analysis of the family data may provide better calling results. These and other issues will be investigated in future work.

Appendix

Supplementary Figure S2.1 The distributions of the two features for the three types of events (top: predicted normal events; middle: predicted heterozygous deletions; bottom: predicted homozygous deletions). Events predicted as normal generally have low discordant-pair support and relatively high concordant-pair depth. Heterozygous regions typically have significant support from both concordant and discordant pairs. Homozygous deletions are expected to have high discordant-pair support and very low concordant-pair support. The results generally agree with these expectations. At the same time, it is apparent that there are also normal events with few concordant pairs, heterozygous deletions with few concordant and few discordant pairs, and homozygous deletions with few discordant pairs. These are data points near the origin and/or the class boundaries, which normally have high uncertainty values associated with their classification, as shown in Figure 2.5(b).

Supplementary Figure S2.2 Membership probability distributions of predicted normal events with average concordant depth less than 6 (left) and greater than or equal to 6 (right). On the left, the membership probabilities are generally less than 0.90; on the right, they are generally higher. When a data point has low average concordant depth, there is greater uncertainty about whether the low coverage is due to (1) an actual deletion or (2) some other source, such as high G+C content. The membership probabilities in general capture this uncertainty well and can be used to select a subset of more reliable predictions.
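Selecting a subset of more reliable predictions from membership probabilities, as described above, amounts to a simple threshold filter. The event records and the 0.95 cutoff below are hypothetical choices for illustration; the dissertation does not prescribe a specific threshold.

```python
def reliable_predictions(events, min_prob=0.95):
    """Keep only events whose maximum cluster-membership probability
    meets the threshold; uncertain points near the class boundaries
    or the origin are filtered out."""
    return [e for e in events if max(e["membership"]) >= min_prob]

# Hypothetical events with membership probabilities over the three
# classes (normal, heterozygous deletion, homozygous deletion):
events = [
    {"id": "ev1", "membership": (0.98, 0.01, 0.01)},  # confident normal
    {"id": "ev2", "membership": (0.40, 0.35, 0.25)},  # near a boundary
    {"id": "ev3", "membership": (0.02, 0.97, 0.01)},  # confident het deletion
]
print([e["id"] for e in reliable_predictions(events)])  # ['ev1', 'ev3']
```

Raising `min_prob` trades call volume for reliability, which is the selection behavior the membership probabilities are meant to support.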