Evaluation of Paired-end Sequencing Strategies for Detection of Genome Rearrangements in Cancer


Ali Bashir, Stanislav Volik, Colin Collins, Vineet Bafna, Benjamin J. Raphael

Univ. of California, San Diego, Bioinformatics Graduate Program; Univ. of California, San Francisco, Comprehensive Cancer Center; Univ. of California, San Diego, Dept. of Computer Science and Engineering; Brown Univ., Dept. of Computer Science and Center for Computational Molecular Biology

June, 28

Abstract

Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy-neutral rearrangements such as inversions and translocations, which are common in cancer and can produce novel fusion genes. We address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely, using both theoretical models and simulation. We derive a formula for the probability of a fusion gene given a set of observed paired-end sequences and use this formula to compute fusion gene probabilities in several breast cancer samples. We find that we are able to accurately predict fusion genes in these samples with a relatively small number of fragments of large sizes. We further demonstrate how the ability to detect fusion genes depends on the distribution of lengths of human genes, and evaluate how different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by new massively parallel sequencing technologies.

Introduction

Cancer is a disease driven by selection for somatic mutations. These mutations range from single nucleotide changes to large-scale chromosomal aberrations such as deletions, duplications, inversions and translocations. While many such mutations have been cataloged in cancer cells via cytogenetics, gene resequencing, and array-based techniques (i.e. comparative genomic hybridization), there is now great interest in using genome sequencing to provide a comprehensive understanding of mutations in tumor genomes. One such sequencing initiative, The Cancer Genome Atlas, focuses sequencing efforts in the pilot phase on point mutations in coding regions. However, the pilot phase of this initiative largely ignores copy-neutral genome rearrangements including translocations and inversions. Such rearrangements can create novel fusion genes, as observed in leukemias, lymphomas, and sarcomas [, 2, 3]. The canonical example of a fusion gene is BCR-ABL, which results from a characteristic translocation (termed the Philadelphia chromosome) in many patients with chronic myelogenous leukemia (CML) [3]. The advent of Gleevec, a drug targeted to the BCR-ABL fusion gene, has proven successful in treatment of CML patients [4], invigorating the search for other fusion genes which might provide tumor-specific biomarkers or drug targets.

Until recently, it was generally believed that translocations and resulting gene fusions were solely recurrent in hematological disorders and sarcomas, with few suggesting that such recurrent events were prevalent across all tumor types, including solid tumors [5, 6]. This view has been increasingly challenged with the discovery of a fusion between the TMPRSS2 gene and several members of the ERG protein family in prostate cancer [7] and the EML4-ALK fusion in lung cancer [8]. These studies raise the question of what other recurrent rearrangements remain to be discovered and emphasize the importance of developing a systematic, genome-wide approach for detecting fusion genes and other large-scale rearrangements.

One strategy for high-resolution identification of rearrangements involves paired-end sequencing of clones, or other fragments of genomic DNA, from tumor samples. The resulting end-sequence pairs, or paired reads, are mapped back to the reference human genome sequence. If the mapped locations of the ends of a clone are invalid (i.e. have abnormal distance or orientation) then a genomic rearrangement is suggested (see Figure 1 and Methods). This strategy was initially described in the End Sequence Profiling approach [9] and is currently also used to assess genetic structural variation []. An innovative approach utilizing SAGE-like sequencing of concatenated short paired-end tags successfully identified fusion transcripts in cDNA libraries [].

Present and forthcoming next-generation DNA sequencers hold promise for extremely high-throughput sequencing of paired-end reads. For example, the Illumina Genome Analyzer will soon be able to produce millions of paired reads of approximately 3bp from fragments of length 5-bp [2], while the SOLiD system from Applied Biosystems promises 25bp reads from each end of size-selected DNA fragments of many sizes [3]. Similar strategies coupling the generation of paired-end tags with 454 sequencing have also been described [4, 5].

Whole genome paired-end sequencing approaches allow for a genome-wide survey of all potential fusion genes and other rearrangements in a tumor. This approach holds several advantages over transcript or protein profiling in cancer studies. First, discovery of fusion genes using mRNA expression [7], cDNA sequencing, or mass spectrometry [6] depends on the fusion genes being transcribed under the specific cellular conditions present in the sample at the time of the assay. These conditions might be different from those experienced by the cells during tumor development.
Second, measurement of fusions at the DNA sequence level focuses on gene fusions due to genomic rearrangements and thus is less impeded by splicing artifacts or trans-splicing [7]. Finally, genome sequencing can identify more subtle regulatory fusions that result when the promoter of one gene is fused to the coding region of another gene, as is the case with the c-myc oncogene fusion with the immunoglobulin gene promoter in Burkitt's lymphoma [8].

In this paper, we address a number of theoretical and practical considerations for assessing tumor genome organization using paired-end sequencing approaches. We are largely concerned with detecting rearrangement breakpoints, where a pair of non-adjacent coordinates in the reference genome are adjacent (i.e. fused) in the tumor genome. In particular, we extend this idea of a breakpoint to examine the ability to detect fusion genes. If a rearrangement in the tumor genome is identified by detecting a clone with end sequences mapping to distant locations, does this rearrangement lead to formation of a fusion gene? Obviously, sequencing the clone will answer this question, but this requires additional effort and cost and, in the case of most next-generation sequencing technologies (which do not archive the genome in a clone library), may be problematic. We show how to use evidence from multiple clones and the empirical distribution of clone sizes to derive a formula for the probability of fusion between a pair of genomic regions (e.g. genes) given the set of all mapped clones. Such estimates are useful for prioritizing follow-up experiments to validate fusion genes. In a test experiment on the MCF7 cell line, 3,2 pairs of genes were found near clones with aberrantly mapping end-sequences. However, our analysis revealed only 8 pairs of genes with a high probability² of fusion, of which six were tested and five experimentally confirmed.
The advent of high-throughput sequencing strategies raises important questions with respect to experimental design for understanding tumor genome organization. Obviously, sequencing more clones improves the probability of detecting fusion genes and breakpoints. However, even with the latest sequencing technologies, it would be neither practical nor cost-effective to shotgun sequence and assemble the genomes of thousands of tumor samples. Thus, it is important to maximize the probability of detecting fusion genes with the least amount of sequencing. This probability depends on multiple factors including the number and length of end-sequenced clones, the length of genes that are fused, and possible errors in breakpoint localization. Here, we derive (theoretically and empirically) several formulae which elucidate the trade-offs in experimental design of both current and next-generation sequencing technologies. Using our probability calculations and simulations, we find that even with current paired-end technology we can obtain extremely high probability of breakpoint detection with a very low number of reads. For example, more than 9% of all breakpoints can be detected with paired-end sequencing of less than, clones (Table 2). Additionally, next-generation sequencers can potentially detect rearrangements with a greater than 99% probability and localize the breakpoints of these rearrangements to less than 3bp in a single run of the machine (Table 2).

¹ A number of next-generation sequencing technologies do not use traditional cloning vectors to amplify DNA. For the sake of simplicity we will use the term clone to refer to any contiguous fragment that is sequenced from both ends.
² Probability greater than .5

Figure 1: Schematic of Breakpoint Calculation. (a) The endpoints of a clone C from the tumor genome map to locations $x_C$ and $y_C$ (joined by an arc) on the reference genome that are inconsistent with C being a contiguous piece of the reference genome. This configuration indicates the presence of a breakpoint (a, b) whose coordinates fuse at ζ in the tumor genome. (b) The coordinates (a, b) of the breakpoint are unknown but lie within the trapezoid described by equation (1). The observed length of the clone is given by $L_C = (a - x_C) + (b - y_C)$. The rectangle U × V describes the breakpoints that lead to a fusion between genes U and V.

Results

Computing probability of a fusion gene

We consider the tumor genome as a rearranged version of the reference human genome and assume that there exists a mapping between coordinates of the two genomes. The reference genome is described by a single interval of length G; i.e. we concatenate multiple chromosomes into a single coordinate system. Define a breakpoint (a, b) as a pair of non-adjacent coordinates a and b in the reference genome that are adjacent in the tumor genome. Correspondingly, we define the fusion point³ as the coordinate ζ in the tumor genome such that the point a maps to ζ and the point b maps to ζ + 1. Consider a clone C containing ζ. If the breakpoints a and b are far apart (e.g. on different chromosomes) then the endpoints of C will map to two locations $x_C$ and $y_C$ on the reference genome that are inconsistent with C being a contiguous fragment of the reference genome (Figure 1a). In this case, we say that $(x_C, y_C)$ is an invalid pair [2].

Observing an invalid pair $(x_C, y_C)$ does not identify the breakpoint (a, b) exactly. However, if we know that the length of the clone C lies within the range $[L_{\min}, L_{\max}]$, and we assume that (i) only a single breakpoint is contained in a clone, and (ii) $a > x_C$ and $b > y_C$ (without loss of generality; see Methods), then the breakpoints (a, b) that are consistent with $(x_C, y_C)$ must satisfy

$$L_{\min} \le (a - x_C) + (b - y_C) \le L_{\max}. \qquad (1)$$

If we plot an invalid pair $(x_C, y_C)$ as a point in the two-dimensional space G × G, then the breakpoints (a, b) satisfying the above equation define a trapezoid⁴ (Figure 1). If multiple clones contain the same fusion point ζ, then the corresponding breakpoint (a, b) lies within the intersection I of the trapezoids corresponding to the clones. Conversely, we will assume that if the trapezoids defined by several invalid pairs intersect, then they share a common breakpoint. We call a set of clones whose trapezoids have non-empty intersection a cluster.
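As an illustration of inequality (1) and of the cluster definition, the following minimal Python sketch tests whether a candidate breakpoint is consistent with an invalid pair and whether the trapezoids of several invalid pairs share at least one point. The class and function names, the use of integer coordinates, and the example values are assumptions made for illustration only; they are not taken from the paper's Methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvalidPair:
    """Mapped end positions (x_C, y_C) of one clone; hypothetical container."""
    x: int
    y: int


def consistent(pair, a, b, l_min, l_max):
    """True if breakpoint (a, b) satisfies a > x_C, b > y_C and inequality (1)."""
    if a <= pair.x or b <= pair.y:
        return False
    return l_min <= (a - pair.x) + (b - pair.y) <= l_max


def cluster_is_nonempty(pairs, l_min, l_max):
    """True if the trapezoids of all pairs share at least one integer point.

    Every trapezoid bounds a + b to an interval, so the intersection is
    non-empty exactly when those intervals overlap and the smallest
    admissible a + b (with a > max x_C, b > max y_C) fits below the
    common upper bound.
    """
    a_lo = max(p.x for p in pairs) + 1
    b_lo = max(p.y for p in pairs) + 1
    s_lo = max(l_min + p.x + p.y for p in pairs)
    s_hi = min(l_max + p.x + p.y for p in pairs)
    return s_lo <= s_hi and a_lo + b_lo <= s_hi


# Toy coordinates (assumed values, not data from the study).
p = InvalidPair(100, 1000)
q = InvalidPair(120, 1010)
far = InvalidPair(100, 5000)
print(consistent(p, 150, 1050, 50, 200))
print(cluster_is_nonempty([p, q], 50, 200))
print(cluster_is_nonempty([p, far], 50, 200))
```

The check assumes a single shared length range for all clones in the library; a per-clone range would only change how `s_lo` and `s_hi` are computed.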
Figure 2 displays a cluster of six clones from the MCF7 cell line. As the number of clones that are end-sequenced increases, more clones will contain the same fusion point and more clusters will be formed. Thus, the area of I will decrease, and correspondingly the uncertainty in the location of the fusion point decreases.

Given a set of clones, we are interested in computing the probability that the cancer genome contains a fusion gene, i.e. a fusion of two different genes from the reference genome. A gene defines an interval U = [s, t] where s is the 5′ transcription start site and t is the 3′ transcription termination site. Consider two genes with intervals U and V. The two genes are fused if there exists a breakpoint (u, v) that lies in U × V. This breakpoint is detected if (u, v) lies in I. An approximate probability for a fusion gene is the fraction of I that lies within the rectangle U × V. We obtain a better estimate of the probability of fusion by considering the empirical distribution of clone lengths (see Methods). The exact probability of the gene fusion is given by the probability mass that lies within the intersection of I and the rectangle U × V defined by the pair of genes. An efficient algorithm for computing these probabilities is given in Methods.

Fusion Predictions in Breast Cancer

We made predictions of fusion genes for the MCF7, BT474 and SKBR3 breast cancer cell lines as well as two primary tumors using data from end sequence profiling of these samples [2, 22]. Approximately 7Mb of end-sequence was derived from these 5 samples, 29Mb (corresponding to .47 clonal coverage) coming from the MCF7 cell line. Across all samples, a total of,4 invalid pairs were obtained. These formed 99 clusters, 95 of which contained more than one clone. The empirical distribution of clone lengths for the MCF7 library is shown in Supp. Fig I. We use the distribution in each clone library to compute the probabilities shown in Table 1.
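The approximate probability described in the previous section, the fraction of the intersection I that falls inside the gene rectangle U × V, can be sketched by enumerating integer breakpoints. This is a brute-force illustration that weights every point of I equally; it is not the efficient algorithm or the clone-length-weighted estimate referred to in Methods, and the function name and toy coordinates are assumptions.

```python
def fusion_fraction(pairs, l_min, l_max, gene_u, gene_v):
    """Fraction of integer points of I that lie in the rectangle U x V.

    pairs  : list of (x_C, y_C) invalid pairs forming one cluster
    gene_u : (start, end) of gene U on the a axis, inclusive
    gene_v : (start, end) of gene V on the b axis, inclusive
    Every point of I is weighted equally (uniform assumption).
    """
    a_lo = max(x for x, _ in pairs) + 1          # a > every x_C
    b_lo = max(y for _, y in pairs) + 1          # b > every y_C
    s_lo = max(l_min + x + y for x, y in pairs)  # tightest lower bound on a + b
    s_hi = min(l_max + x + y for x, y in pairs)  # tightest upper bound on a + b

    total = inside = 0
    for a in range(a_lo, s_hi - b_lo + 1):
        for b in range(max(b_lo, s_lo - a), s_hi - a + 1):
            total += 1
            if gene_u[0] <= a <= gene_u[1] and gene_v[0] <= b <= gene_v[1]:
                inside += 1
    return inside / total if total else 0.0


# Toy cluster of two invalid pairs (assumed values, small scale for speed).
cluster = [(0, 0), (20, 10)]
print(fusion_fraction(cluster, 50, 100, gene_u=(30, 60), gene_v=(20, 50)))
```

At genomic scale the double loop would be replaced by a closed-form area computation; the enumeration is kept here only because it makes the definition of the fraction explicit.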
³ In the genome rearrangement literature, a fusion point is also called a breakpoint [9].
⁴ A triangle when $L_{\min} = 0$.

Figure 2: Prediction of a Fusion between the NTNG and BCAS genes. The rectangle indicates the possible locations of a breakpoint on chromosomes and 2 that would result in a fusion between NTNG and BCAS. Each trapezoid indicates possible locations for a breakpoint consistent with an invalid pair. Assuming that all clones contain the same breakpoint, this breakpoint must lie in the intersection of the trapezoids (shaded region). Approximately 69% of this shaded region intersects the fusion gene rectangle (darkly shaded region), giving a probability of fusion of approximately .69. The empirical distribution of clone sizes reveals that not all clone sizes are equally likely (e.g. extremely long or short clones are rare). Using this additional information, our improved estimate for the probability of fusion is >.99.

Figure 3: (a) Probability of fusion vs. product of gene lengths (log scale) involved in the fusion event, indicating higher fusion probabilities for larger gene pairs. Larger circles indicate gene pairs experimentally validated by further sequencing. Darkly shaded circles indicate fusions confirmed by clone sequencing; yellow circles indicate predicted fusions with negative sequencing results. Smaller points indicate untested predictions, with blue circles indicating cluster sizes of 2 or more clones and red circles indicating singletons. (b) The number of fusion genes in ChimerDB [23] plotted as a function of the product of gene lengths (log scale) in the fusion.

Table 1: Fusion Probability Predictions and Sequencing Results for Clusters in Breast Cancer

Start Gene | End Gene | Fusion Probability | Cluster Size | Sequencing Supporting Fusion | Cell Line / Primary Tumor
ASTN2 PTPRG 2 Yes MCF7
BCAS4 BCAS3 2 Yes MCF7
KCND3 PPME .99 2 Yes MCF7
NTNG BCAS .99 6 Yes MCF7
BCAS3 ATXN Yes MCF7
ZFP64 PHACTR No BT474
CT2 HUMAN UBE2G2 .88 No Breast
VAPB ZNFNA3 .842* 3 Yes BT474
BMP7 EYA No MCF7
KCNH7 TDGF .25 No Breast
SULF2 TBX No MCF7
NACAL NCOA No MCF7
MRPL45 TBCD3C .5 No BT474
U NP No Breast
RBBP9 ITGB2 .5 No Breast
Y SYNPR <. 4 No MCF7
PRR TMEM49 <. 9 No MCF7
BMP7 Q96TB <. 3 No MCF7

Ranked List of Fusion Gene Predictions in Breast Cancer Cell Lines and Primary Tumors. The gene order shown indicates start and end positions with respect to transcription. Elements labeled as Not Tested are, as yet, not sequenced. Though additional clones have been sequenced, they did not overlap any putative fusion region. It should be noted that those clones were also negative for gene fusion. * VAPB/ZNFNA3 has low probability; however, there are many pairs of genes with low probability of fusion in this region. The probability that any one of these gene pairs fuse is >.3. All clones in a cluster are non-redundant (the same clones do not reappear multiple times in a cluster). Where noted, a single clone contained more than two chromosomal segments, i.e. the clone is not a simple fusion of two genomic loci.

Table 1 shows the results of our predictions for fully sequenced BACs across multiple breast cancer cell lines and primary tumors, sorted according to fusion probability. We have successfully validated a number of these highest ranked predictions (Table 1). Experimental validation in this context involves sequencing of an entire clone, thus exactly confirming the breakpoint (and point of gene fusion) if present (see Methods). In some cases multiple breakpoints were found in a clone; we ensure that the breakpoint associated with each gene in the fusion disrupts the corresponding gene. Such multiple rearranged regions have been observed to still form fusion transcripts, as in the case of BCAS4/BCAS3 [, 24]. An example, and explanation, of a high-scoring prediction (NTNG/BCAS) can be seen in Figure 2. The strong correspondence between fusion probability prediction and subsequent sequencing validation of the breakpoint in Table 1 illustrates the predictive power of our method. Table 1 also indicates the power of the technique in predicting negative results.
In addition to gene pairs with high predicted fusion probabilities, it shows results on all remaining sequenced clones for which fusion information was available. Only one of these clones (below 5% fusion probability) resulted in positive results for fusion genes (VAPB/ZNFNA3). The negative results mostly follow the computational predictions, as only one of the clones had a probability above 5% for gene fusion. A notable result from sequencing, as noted earlier, is the observation that certain clones undergo multiple rearrangements (see Table 1). This results in more than two chromosomal segments, with respect to the reference genome, being present within a single clone.

Detecting fusion points

Consider an idealized model in which R clones, each of fixed length L, are picked uniformly at random from a tumor genome⁵ of length G, end-sequenced, and the ends mapped to the reference genome. The fraction of clones uniquely mapped, f, provides one with a revised set of N = fR clones. The fraction of clones uniquely mapped varies significantly from one technology to the next. Our ESP study had 9% of reads mappable with 58% uniquely mapped [22]. 454 sequencing of structural variants in the human genome reported 63% mapping of sequences with recognizable linker sequences, corresponding to mapping of 4% of all reads [5]. It should be noted that the 454 reads have significantly greater sequenced end lengths (average 9bp) [5] and quality compared to other next-generation technologies (average 2 3bp) [3, 2], so lower mapping fidelity would be expected for such technologies. A fusion point, ζ, on the tumor genome is detected if a uniquely mapped clone contains it (see Figure 4).

⁵ For these calculations the tumor genome size, G, will correspond to the diploid genome size, 6 6 bp for human.
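The idealized model above can be simulated directly. The sketch below draws R clone start positions uniformly at random, keeps each clone with probability f (standing in for unique mapping), and reports how often a fixed fusion point is contained in at least one retained clone. The function name, the placement of ζ at the genome midpoint, and all parameter values are illustrative assumptions, scaled down so the simulation runs quickly.

```python
import random


def simulate_detection(r_clones, clone_len, genome_len, unique_frac,
                       trials=1000, seed=0):
    """Monte Carlo estimate of the probability that a fusion point is detected.

    A fusion point counts as detected in a trial if at least one uniquely
    mapped clone contains it. Unique mapping is modeled as an independent
    coin flip with probability unique_frac (an assumption of this sketch).
    """
    rng = random.Random(seed)
    zeta = genome_len // 2  # fixed fusion point, away from the genome ends
    detected = 0
    for _ in range(trials):
        for _ in range(r_clones):
            start = rng.randrange(genome_len - clone_len + 1)
            contains = start <= zeta < start + clone_len
            if contains and rng.random() < unique_frac:
                detected += 1
                break  # one containing clone is enough for detection
    return detected / trials


# Scaled-down toy genome (assumed values): N = fR = 250 clones of 2 kb on 1 Mb.
print(simulate_detection(500, 2000, 1_000_000, 0.5))
```

Treating unique mapping as independent of genomic position is a simplification; in practice mappability varies along the genome, which this sketch does not attempt to model.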

Figure 4: Schematic of a Breakpoint Region. A fusion point ζ on the tumor genome contained in multiple clones. The leftmost and rightmost clones determine the breakpoint region $\Theta_\zeta$ in which the fusion point can occur.

Using the Clarke-Carbon formula [25, 26] (see Methods), the probability $P_\zeta$ of detection is given by

$$P_\zeta \approx 1 - e^{-c} \qquad (2)$$

where c = NL/G is the clone coverage. A single clone localizes a fusion point to within L bp, while multiple clones containing a fusion point will localize the fusion point more precisely. Define the breakpoint region, $\Theta_\zeta$, as the interval determined by the intersection of all clones that contain ζ. Therefore $\Theta_\zeta$ defines the localization of ζ, or the uncertainty in mapping ζ. To localize a fusion point to within L, we only need a single clone overlapping ζ. We show (see Methods) that

$$\Pr(|\Theta_\zeta| = L) \approx L\, e^{-c} \left(1 - e^{-N/G}\right). \qquad (3)$$

Furthermore, we show that for s < L,

$$\Pr(|\Theta_\zeta| = s) \approx s\, e^{-Ns/G} \left(1 - e^{-N/G}\right)^2. \qquad (4)$$

These equations allow us to estimate the expected length of $\Theta_\zeta$, conditioned on ζ being covered⁶ (see Methods for full derivation and closed form solution, equation 24), as

$$E(|\Theta_\zeta| \mid \zeta \text{ is covered}) \approx \left(\frac{1 - e^{-N/G}}{1 - e^{-c}}\right) \left(L^2 e^{-c} + \sum_{s=1}^{L-1} s^2 e^{-Ns/G} \left(1 - e^{-N/G}\right)\right). \qquad (5)$$

We evaluated the error in this approximation by simulation (see Supplemental Methods for descriptions of all simulations). Supp. II shows that approximation (5) very closely models the average observed $|\Theta_\zeta|$. The relative error between the average observed length of the breakpoint region and approximation (5) was .2. We also assess the effect of different clone sizes, L, and number of clones, N, on the expected length of the breakpoint region, $E(|\Theta_\zeta|)$, around a specific fusion point, ζ. In Figure 5 we see the expected relationship that an increase in the number of reads, N, decreases the uncertainty in localization ($|\Theta_\zeta|$). Interestingly, we note that 4kb clones are most advantageous for detection when $|\Theta_\zeta|$ = 4kb. A similar effect is observed for 5kb and 2kb. This implies that the choice of clone lengths impacts one's ability to detect fusions of a specific size.

Evaluation of Sequencing Strategies

Formulas (2) and (5) provide a framework for examining a variety of sequencing parameters (L, N, c). In Table 2, and graphically in Supp. III and IV, we assess the effect of using different clone sizes and varying numbers of paired reads on the ability to detect and localize a fusion point (also see Supp. V). One can see that a distinct trade-off exists between detection, in which larger clones hold a distinct advantage, and localization, in which smaller clones are advantageous. Though larger clones (e.g. BACs of 5kb) appear more pragmatic for sequencing projects using a smaller number of paired reads, the advent of low-cost, highly parallel sequencing of small clones could soon yield extremely high resolution (smaller $\Theta_\zeta$) and coverage of fusion points.

⁶ Otherwise, $\Theta_\zeta$ is not defined.
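Formulas (2) through (5) are straightforward to evaluate numerically. The sketch below computes the detection probability and the conditional expected breakpoint-region length by direct summation rather than by the closed form given in Methods. The function names are hypothetical, and the genome size of 6e9 bp and the clone parameters are assumed values chosen only to exercise the formulas.

```python
import math


def detection_probability(n_clones, clone_len, genome_len):
    """Equation (2): P = 1 - exp(-c) with clone coverage c = N * L / G."""
    c = n_clones * clone_len / genome_len
    return -math.expm1(-c)


def region_length_pmf(s, n_clones, clone_len, genome_len):
    """Equations (3) and (4): approximate Pr(|Theta| = s) for 1 <= s <= L."""
    rate = n_clones / genome_len
    hit = -math.expm1(-rate)  # 1 - exp(-N/G), computed stably
    if s == clone_len:
        return clone_len * math.exp(-rate * clone_len) * hit
    if 1 <= s < clone_len:
        return s * math.exp(-rate * s) * hit ** 2
    return 0.0


def expected_region_length(n_clones, clone_len, genome_len):
    """Equation (5): E(|Theta| given covered), by direct summation over s."""
    weighted = sum(
        s * region_length_pmf(s, n_clones, clone_len, genome_len)
        for s in range(1, clone_len + 1)
    )
    return weighted / detection_probability(n_clones, clone_len, genome_len)


# Assumed example: 2 kb clones on a 6e9 bp genome at two read depths.
G = 6_000_000_000
for n in (100_000, 10_000_000):
    p = detection_probability(n, 2000, G)
    e = expected_region_length(n, 2000, G)
    print(f"N={n}: P_detect={p:.4f}, E|Theta|={e:.1f} bp")
```

Direct summation costs O(L) per evaluation, which is acceptable for a handful of parameter settings; a parameter sweep over large clone sizes would be better served by the closed form.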

Figure 5: Probability of localizing a fusion point to an interval of a given length (probability versus $|\Theta_\zeta|$; curves for 5kb, 4kb and 2kb clones at three read counts). A fusion point is localized to length s if the corresponding breakpoint region has length s or less. Note that when s exceeds the clone length L, only a single clone contains the fusion point; thus, the probability of localization is the fixed probability of a clone containing the fusion point at a given clonal coverage. Note that a fixed clone length is assumed here. Using a distribution of clone lengths would create a less abrupt transition.

Table 2: Breakpoint Detection and Localization for Different Sequencing Strategies

Clone Size (L): kb, kb, 2kb, 2kb, kb, kb, 4kb, 4kb, 5kb, 5kb, 5kb
Paired Reads (N):
Clone Coverage (c): 3.3X, .33X, 3.3X, .66X, 6.7X, 3.3X, 26.7X, .33X, 25X, 5X, .6X
E(|Θζ|), Pζ: >.99 .5 > >.99 .8 > >
E(|Θζ|), Pζ: >.99 .5 >

The probability Pζ of detecting a fusion point and the expected length E(|Θζ|) of a breakpoint region under various clone sizes (L) and number of end-sequenced clones (N). Pζ, and E(|Θζ|), correspond to the probability for, and expected size of a breakpoint region in the case of, two clones spanning ζ. The small clone sizes (kb, 2kb) and large number of reads represent what one might achieve in a single run with new technologies (assuming perfect mapping of end sequences). The last value for the 5kb clones represents our current status on the MCF7 cell line.

Figure 6: Distribution of gene sizes for different sets of genes (percentage of total genes in set versus gene length in kb, log scale). Legend: All Genes (39,368), Kinases (62), Transcription Factors (562), ChimerDB (3), Random Fusion Genes (2). All genes: the known genes track in the UCSC Genome Browser [27]. Kinases: selected from the KinBase database [28]. Transcription factors: selected from the AmiGO database according to the GO term transcription factor activity [29]. ChimerDB: fusion genes in cancer extracted from the ChimerDB database [23]. Random Fusion Genes: a set of 2 genes involved in random fusion events. Random fusion events were formed by inducing random breakpoints, and selecting such events if they formed a fusion gene. Note that the gene sizes are on a log scale, and the number of genes from each set used to derive each distribution is shown in the legend.

Fusion gene probabilities

Figure 5 indicates that to achieve a desired breakpoint localization with a fixed number of reads, some clone sizes are more effective than others. Thus, Figure 5 implies that our ability to accurately assess fusion genes depends not only on the sequencing parameters but also on the sizes of genes involved in the fusion. There is considerable variation in sizes of human genes, as seen in Figure 6. When considering all known transcripts [27], the median gene size is approximately 2kb and the mean is approximately 4kb. However, this does not correspond to the sizes of known fusion genes in cancer. Examination of ChimerDB [23], a database of fusion sequences derived from mRNA, EST, literature and database searches, shows a clear bias towards larger genes (Figure 6), with a median gene size of 4kb and a mean gene size of 9kb. It is tempting to speculate on the reasons for this bias. There is the potential for ascertainment bias, as larger genes would be easier to localize via cytogenetic techniques.
Additionally, random breakage would seem to favor larger genes, as the probability of a breakpoint disrupting a large gene would be greater than for a small gene. To assess this possibility, we simulated random fusion events by inducing random breakpoints in the genome. If a breakpoint formed a fusion gene we recorded the size of each fusion partner, as seen in Figure 6. It is interesting to note that these random fusion events result in much larger genes than observed in the normal genome or ChimerDB (median and mean gene sizes of 55kb and 284kb, respectively). Though further investigations are needed, this could indicate that the observed fusion genes are selected for functional reasons. We also examined the distribution of transcription factors and kinases, which are members of multiple fusion genes (Figure 6). Interestingly, kinases correspond much more closely to the ChimerDB distribution whereas transcription factors correspond better with gene sizes observed across the entire genome.

Figure 7: (a) The number of paired reads necessary to detect fusion genes with fusion probability greater than .5 as a function of gene size for different clone lengths. The vertical lines indicate median (2kb) and mean (4kb) sizes for all known genes as well as the median (4kb) and mean (9kb) sizes for ChimerDB genes. (b) The number of paired reads necessary to detect fusion genes with fusion probability greater than .5 as a function of clone size for different fusion gene sizes (log scale in both axes). Each point in these plots is the average over different fusion genes and different simulations of clones from the genome. Thus, each datapoint represents the average value of 4 simulations. In each simulation, a pair of genes was chosen such that the area of the resulting gene rectangle (U × V) was equal to the square of the indicated fusion gene size. A breakpoint was selected for the gene pair uniformly in the rectangle U × V.

The variation in gene sizes for different classes of genes (Figure 6) suggests that one consider a wide range of gene sizes when assessing our ability to detect fusion genes. Figure 7 shows the number of clones necessary to achieve at least a .5 fusion probability for a random gene pair of specified size, across a variety of different clone sizes. It should be noted that the breakpoint could exist at any position within either gene. Smaller clone sizes clearly hold a distinct advantage in the probability of detecting fusion genes when compared across equal clonal coverage, while large clones tend to perform better when comparing across equivalent reads (Figure 7a). This is not surprising, as a significantly higher number of paired reads is required to achieve the same coverage with smaller clones. For 2kb clones 75 times the number of paired reads is needed to yield equivalent clone coverage with respect to 5kb clones. It is important to note, however, that next-generation sequencing technologies can potentially generate a high number of paired reads for small clone sizes at very low cost (see Discussion).

Less obvious is the relationship between fusion gene size and detectability of fusion events (Figure 7b). Note that larger clones create larger trapezoids. This increases the probability that the trapezoid intersects with the gene rectangle (better detection), but may also reduce the calculated fusion probability, since only a small fraction of the larger trapezoid would overlap the gene rectangle (akin to creating a larger breakpoint region around a breakpoint). Notice that there appears to be a distinct correlation between the size of genes involved in a fusion and the size of the clone being used to sequence the genome.
When the resulting fused gene is 4kb in length, the average fusion probability is significantly greater when using the same number of 4kb clones compared to 2kb clones, largely due to the greater genomic coverage provided by the larger clones. However, 4kb clones also perform nearly as well as 5kb clones (Figure 7b), because of their improved breakpoint localization (Figure 5). Once the fusion gene size is increased to 5kb, a significant benefit is again seen by moving to the larger clone size. In addition to supporting a bias for identification of larger gene fusion events by large clones, these results suggest that when investigating specific fusion events, designing one's clone library to reflect the size of the event could be beneficial in detecting it with high probability. However, it is important to note that the 5kb clones consistently show less variance in predicted probabilities (Supp. VI). This makes them a more reliable source for performing whole genome assessments and comparisons across multiple tumor samples, especially with a small number of paired reads.

Effects of Errors

There are numerous sources of error in paired-end sequencing approaches. The method of computing fusion probabilities from end-sequences makes it possible that a pair of genes near a breakpoint may exhibit a high probability of fusion but not actually fuse. This is more likely in the case of large clone sizes and low clone coverage. A number of other factors, including experimental artifacts, genome assembly errors or mismapping of end sequences, can affect the prediction of fusion genes. In particular, experimental artifacts can lead to false identification of fusion genes. A major source of experimental artifacts is chimeric clones, produced when two non-contiguous regions of DNA are joined together during the cloning procedure. Approximately 2% of clones in modern BAC libraries are chimeric [2, 5], and rates for other vectors are roughly similar.
The type and rate of experimental artifacts for new genome amplification and sequencing strategies remain an open question. In order to assess the rate of false positive predictions of fusion genes in the presence of errors, we simulated

Figure 8: Sensitivity and Specificity of Fusion Gene Predictions. (a) Number of false positive (FP) and true positive (TP) fusion gene predictions for a simulated genome with translocations and, paired reads. Each curve represents the average of 5 simulations with clones of a fixed size (2kb, 4kb, 5kb clones). The minimum fusion probability threshold for indicating that a fusion gene was predicted was decreased from >.95 (leftmost point) to > (rightmost point) in increments of .5, and the number of true and false predictions was determined. For all figures 9 true fusion genes were present in the rearranged genome. These 9 events were not selected for; rather, they were a result of the random rearrangement of the genome. (b), paired reads. (c),, paired reads. (d),, paired reads.

random rearrangement events with the percent of chimeric clones equal to % (Figure 8). For each curve, the points represent True Positives (fusion pairs correctly labeled as such) vs. False Positives (non-fusing pairs incorrectly labeled as fusing) for various fusion probability thresholds. As expected, with low numbers of paired reads 5kb clones obtain the most True Positives (Figure 8a,b), while with a high number of paired reads smaller clone sizes are better (4kb clones in Figure 8c). Very high coverage is needed before very small clones (2kb) become effective (Figure 8d). However, 2kb clones show almost no false positives at appropriate thresholds, with little (if any) increase in true positives when reducing the probability threshold (Figure 8d).

Finally, we examined the effect of chimeric clones on our ability to identify breakpoints from invalid clusters. Obviously, when only a single isolated invalid pair exists, we cannot determine whether it arose via a chimeric event or through a true rearrangement.
However, a cluster of invalid pairs is highly unlikely to arise from chimeric clones [2] (as seen in Figure 9). In most cases, no clusters formed from chimeric clones are ever observed. Even under high coverage (X) and a very high percentage of chimeric clones (5% of all paired reads), in 8% of studies no chimeric clusters would be observed. This result increases the confidence in the assumption that clusters of two or more invalid pairs indicate true rearrangement events. When considering an equal number of chimeric clones for different clone lengths, the probability of observing a chimeric cluster is much lower for smaller clone sizes.

Figure 9: Probability of observing at least one chimeric cluster vs. the percent of chimeric clones, shown for several levels of clonal coverage. These probabilities were computed using Equation 27, with clone size L = 5kb, and confirmed by simulation. Other clone sizes yield virtually identical probabilities at the same clonal coverage.

Discussion

We provided a computational framework to evaluate paired-end sequencing strategies for detection of rearrangements in cancer genomes via. Using our probability calculations and simulations, we find that even with current paired-end technology we can obtain extremely high probability of breakpoint detection with a very low number of reads. For example, more than 9% of all breakpoints can be detected with paired-end sequencing of less than, clones (Table 2). Additionally, next-generation sequencers can potentially detect rearrangements with a greater than 99% probability and localize the breakpoints of these rearrangements to less than 3bp in a single run of the machine (Table 2). Even when only a fraction (f.5) of the reads map uniquely, similar detection levels can be achieved by simply doubling the amount of sequencing.

We derived formulae that provide estimates of the probability of detecting rearrangement breakpoints and localizing them precisely. For a genome of length G with N mapped paired reads from clones of size L, the detection probability is a function of the clonal coverage (c = NL/G). Thus, increasing L means that fewer clones are needed to maintain the same probability of detecting a fusion. Traditionally, clone size L was dictated by efficiency considerations with available cloning vectors (e.g. plasmids 2kb, fosmids 4kb, and BACs 5kb). However, paired-end tag approaches coupled with new sequencing technologies permit paired-end sequencing from a larger range of clone sizes.
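The relationship c = NL/G stated above can be made concrete with a small sketch. The function names and the numeric values below are illustrative assumptions, not figures taken from this paper; the code only rearranges the coverage definition to show how the required number of clones falls as clone size grows.

```python
import math


def clonal_coverage(n_reads: int, clone_len: float, genome_len: float) -> float:
    """Clonal coverage c = N * L / G, as defined in the text."""
    return n_reads * clone_len / genome_len


def reads_for_coverage(target_c: float, clone_len: float, genome_len: float) -> int:
    """Smallest number of mapped paired reads N with N * L / G >= target_c."""
    return math.ceil(target_c * genome_len / clone_len)


# Hypothetical inputs chosen only for illustration (not values from the paper).
GENOME_LEN = 3e9  # assumed genome length G in bp
for clone_len in (2_000, 40_000, 150_000):  # assumed clone sizes L in bp
    n = reads_for_coverage(5.0, clone_len, GENOME_LEN)
    print(f"L = {clone_len:>7} bp -> N = {n} paired reads for c = 5")
```

Holding c fixed, N scales as 1/L, which is the sense in which larger clones need fewer paired reads for the same detection probability.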
On the other hand, the distribution of breakpoint localization depends separately on clone size L, number of mapped reads N, and genome size G.

The natural question for the practitioner is: what sequencing strategy maximizes information about rearrangements in the cancer genome for minimum cost? Providing a definitive answer to the question is complicated by three major issues. First, the goal of maximizing information about rearrangements in cancer genomes requires further specification. If one is interested in detecting as many rearrangement breakpoints (and thus fusion genes) as possible with the minimum amount of sequencing (i.e. maximizing the coverage c), then for a fixed number of paired reads, larger clones give higher probabilities of detection than smaller clones, a consequence that follows immediately from the breakpoint detection probability (Equation 2). However, if one is interested in precise localization of breakpoint regions, then smaller clones will provide better localization when the breakpoint is detected (Figure 8, V). If one wants to validate the breakpoint via PCR, then clearly localization must be less than the length amplified via PCR, typically less than a few kilobases. Our results show the less obvious correlation between clone size and the probability of detecting breakpoint regions of a specific size and, thus, fusion genes involving partner genes of a specific size. Figure 5 shows that, at an equal number of paired reads, different clone sizes obtain better probabilities for resolving fusion points to an interval of fixed length. Moreover, Figure 7b shows that these results readily translate to the probability of detecting fusion genes with partner genes of a given size. Such trade-offs permeate nearly all aspects of sequencing strategy evaluation. Suppose that paired-end sequences could be obtained for any clone length.
Then the choice of optimal clone size depends on the length of fusion genes that the researcher wants to have the greatest ability to localize, which might depend on a prior belief about the model of rearrangement in cancer. For example, if one wants to be able to localize fusion genes whose size is approximately the length of an average human gene

(4kb), then the optimal clone size is 4kb. However, under the hypothesis that the breaks in the genome that lead to fusion genes are distributed uniformly on the genome, larger fusion genes would be expected and thus larger clone sizes would be optimal. Note that in many cases, one is interested not in breakpoints in a single tumor, but in recurrent breakpoints across multiple samples. For this reason, approaches like Primer Approximation Multiplex PCR (PAMP) that assay for variable genomic lesions (such as deletions) in a patient population [3] are appropriate, and precise localization of breakpoints is unnecessary. Moreover, if rearrangement breakpoints are in highly repetitive regions, it might be difficult to map ends too close to the breakpoints, and thus larger clone sizes are appropriate. On the other hand, if there are multiple rearrangements clustered in a small genomic interval, large clones would miss some of these rearrangements, as observed with the multiple breakpoints found in some of our BACs.

The second major difficulty is that the sequencing parameters cannot be set arbitrarily, but are limited by the technology chosen. Some of these technologies are not yet commercially available, and the information about their constraints is unknown. Moreover, the field is developing rapidly and any claims stated about read lengths, sequencing error rates, etc. are undergoing continual revision.

Finally, the complexity of cancer genomes at the sequence level is unknown. Cytogenetics has demonstrated [2, 3] extensive polyploidy in tumor cells, and more recent ESP studies and shotgun sequencing will naturally sample more from amplified regions. Although we did not explicitly simulate amplifications, it is clear that the probability of detecting amplified translocations is directly proportional to their amplification status. However, it is difficult to calibrate the genome length parameter in our simulations.
For this reason, pilot studies will be needed to assess the extent of amplification and heterogeneity in these samples.

Mapping Reads

Our analysis focused on several key parameters including number of paired reads, clone length, and percent of chimeric clones. Some of these parameters are adjustable while others (e.g. error rate) are fixed by the choice of a given technology. However, our model and simulations did not describe all factors that might influence the choice of a given technology. In particular, we avoided the issue of the mapping of reads to the reference genome. Our results give some indication of the promise of these new technologies, but at present there are many unknowns in the number and size of mapped paired-end sequences (see below) that one can expect. Clearly, pilot studies are needed to address this issue. As next-generation technologies mature they will inevitably provide highly reliable and precise mappings of fusion points and detection of fusion genes, allowing for the systematic analysis of genomic events of many cancer samples.

Different sequencing technologies produce reads of varying length and quality, which can have a dramatic effect on the ability to map paired reads. On one extreme, conventional paired-end sequencing of cloned genomic fragments employed by current ESP studies [9, 2] yields high quality reads exceeding 5 bp. This enables the majority of reads outside of repeats and segmental duplications to be uniquely and accurately mapped to the reference genome. For example, 492 out of 983 clones (58%) in the MCF7 study, using 5bp from each end, mapped uniquely [22], while 4% of paired reads using bp of pyrosequencing from each end mapped uniquely [5]. Newer sequencing technologies such as Illumina and ABI have even shorter reads (2 to 3 bp) and higher error rates [3, 2], and the ditag approach sequences only 8 2 base pairs from each end of the genomic fragment [].
These shorter reads will be much more difficult to map, particularly when analyzing rearrangements. Unlike resequencing studies, where one can use the additional information that the end sequences are close together, detection of rearrangements specifically requires the accurate mapping of end sequences from distant locations on the genome. It would be informative to model the effect of different read accuracies and lengths on our ability to accurately resolve breakpoints.

Organization of Cancer Genomes

An additional limitation is that the simulation makes certain simplifying assumptions about the character of cancer genomes. Most notably, we assume that the size of the cancer genome (equal to the parameter G above) is known. Since many cancer genomes, particularly solid tumors, have extensive aneuploidy, the actual size of a given genome might differ greatly from normal cells [32]. Additionally, certain regions are highly amplified and can have complex structures due to different mechanisms of duplication [33, 2]. Finally, genomic heterogeneity, particularly in primary tumor samples, reduces the effective coverage and thus the probability of detecting rearrangement breakpoints. Even a genomic lesion that is an important checkpoint in cancer progression

might be difficult to detect in an admixed sample containing normal cells and cells from earlier developmental stages of the tumor. It is nearly impossible to determine how these factors will affect cancer sequencing without further pilot studies. Such studies promise to provide substantial new information about the extent of ploidy changes and heterogeneity.

Extensions and Applications

Our formula for the probability of a fusion gene is readily extended to fusions of any genomic feature. For example, regulatory fusions that result from the fusion of the promoter of one gene to the coding region of another gene can be assessed. Similarly, it should be noted that predicting the probability of detection for amplified breakpoints, $P_\zeta$ (under the assumption of constant genome size), naturally follows from the current model. As the amplification, a, increases, the probability of detection correspondingly increases, following $1 - (1 - P_\zeta)^a$.

Additionally, a number of techniques based on the accurate resolution of genomic lesions can be used in combination with, or as extensions of, this approach. Array CGH can provide high resolution of breakpoints involved in deletion and duplication events at average resolutions of less than kb [34, 35]. When this information overlaps paired-end sequencing data (such as the case with an amplified translocation like BCAS4/BCAS3), one can potentially improve the resolution of the breakpoint interval defined by a paired-end sequencing approach. As another example, Primer Approximation Multiplex PCR (PAMP) attempts to assay for variable genomic lesions (such as deletions) in a patient population [3]. The success of this technique relies on establishing the variance in genomic boundaries of a genomic event, so that appropriate primers spanning the region can be designed. Our approach provides an effective method by which to derive rough approximations for boundaries of such events.
Combining this approach with such experimental techniques makes it a powerful tool for both designing clinical diagnostics and identifying therapeutic targets.

Methods

Mapping and clustering of end sequences

We assume that each clone C is end sequenced and the ends are mapped uniquely to the reference human genome sequence. Thus, each clone C corresponds to a pair $(x_C, y_C)$ of locations in the human genome where the end sequences map. In addition, an end sequence may map to either DNA strand, and so each mapped end has a sign (+ or −) indicating the mapped strand. We call such a signed pair an end sequence pair (ES pair). If we know that the length $L_C$ of the clone C lies within the range $[L_{\min}, L_{\max}]$, then for most clones the distance between elements of the corresponding ES pair will lie in this range and the ends will have opposite, convergent orientations: i.e. an ES pair of the form $(+x, -(x + L_C))$. Following [2] we call such ES pairs valid pairs because these indicate no rearrangement in the tumor genome. We can use the distribution of the distance $y - x$ between the ends of valid pairs to define an empirical distribution for the clone lengths (cf. Supp. I). If a pair $(x_C, y_C)$ has ends with non-convergent orientation, or a distance $y - x$ that is greater than $L_{\max}$ or smaller than $L_{\min}$, we say that $(x_C, y_C)$ is an invalid pair. The set of breakpoints (a, b) that are consistent with the invalid pair $(x_C, y_C)$ is determined by the inequalities [36]

$$L_{\min} \le \operatorname{sign}(x_C)(a - x_C) + \operatorname{sign}(y_C)(b - y_C) \le L_{\max}. \tag{6}$$

Throughout the paper, we assume (without loss of generality) that $\operatorname{sign}(x_C) = \operatorname{sign}(y_C) = +$, so that $a \ge x_C$ and $b \ge y_C$.

Validating fusion predictions by sequencing

Clones containing predicted fusion genes were draft sequenced (X coverage) by subcloning into 3kb plasmids as described in [22].
Assembly of these sequences and alignment to the reference human genome identified either the precise fusion point, or identified a plasmid containing the fusion point, thus localizing the breakpoint to 3kb.
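The valid/invalid classification and the consistency condition of Equation 6, described under "Mapping and clustering of end sequences", can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ESPair` type, the function names, and the coordinates in the usage lines are hypothetical, and the requirement that each signed offset be non-negative generalizes the stated a ≥ x_C, b ≥ y_C convention to other signs as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ESPair:
    """Hypothetical container for an end sequence pair (ES pair)."""
    x: int       # mapped position of the first end
    y: int       # mapped position of the second end
    sign_x: int  # +1 or -1, strand of the first end
    sign_y: int  # +1 or -1, strand of the second end


def is_valid(pair: ESPair, l_min: int, l_max: int) -> bool:
    """Valid pair: convergent orientation (+, -) and distance y - x in [l_min, l_max]."""
    convergent = pair.sign_x == 1 and pair.sign_y == -1
    return convergent and l_min <= pair.y - pair.x <= l_max


def breakpoint_consistent(pair: ESPair, a: int, b: int, l_min: int, l_max: int) -> bool:
    """Check Equation 6 for a candidate breakpoint (a, b).

    Assumption: each signed offset must be non-negative, which reduces to
    a >= x and b >= y when both signs are +.
    """
    da = pair.sign_x * (a - pair.x)
    db = pair.sign_y * (b - pair.y)
    return da >= 0 and db >= 0 and l_min <= da + db <= l_max


# Illustrative coordinates only.
normal = ESPair(x=1_000, y=151_000, sign_x=1, sign_y=-1)
rearranged = ESPair(x=1_000, y=5_000_000, sign_x=1, sign_y=1)
print(is_valid(normal, 100_000, 200_000))      # convergent, distance in range
print(is_valid(rearranged, 100_000, 200_000))  # non-convergent and too far apart
print(breakpoint_consistent(rearranged, 51_000, 5_060_000, 100_000, 200_000))
```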

Computing fusion probability

Define $C_{(a,b)}$ as the event that a clone C from the tumor with corresponding invalid pair $(x_C, y_C)$ overlaps a breakpoint (a, b) of a reference genome. Assume w.l.o.g. that the invalid pair $(x_C, y_C)$ is oriented such that $a \ge x_C$ and $b \ge y_C$. Given the breakpoint (a, b), the length $L_C$ of the clone is fixed to be

$$l_C(a, b) = (a - x_C) + (b - y_C) = (a + b) - (x_C + y_C). \tag{7}$$

Therefore, the event $C_{(a,b)}$ implies the event that $L_C = l_C(a, b)$, allowing us to express the probability of occurrence of breakpoint (a, b) in terms of the distribution on the lengths of clones. Denote $N_C[s]$ as the number of discrete breakpoints (a, b) such that $a \ge x_C$, $b \ge y_C$, and $a + b = s$. Then

$$\Pr(C_{(a,b)}) = \Pr\big(C_{(a,b)} \cap (L_C = l_C(a, b))\big) \tag{8}$$
$$= \Pr\big(C_{(a,b)} \mid L_C = l_C(a, b)\big)\, \Pr\big(L_C = l_C(a, b)\big) \tag{9}$$
$$= \frac{1}{N_C[a + b]}\, \Pr\big(L_C = l_C(a, b)\big), \tag{10}$$

where the last equality follows from Equation 7 and the assumption that all breakpoints are equally likely.

Consider a pair of genes spanning genomic intervals U, V. If exactly one of the genes is on the + strand and the other is on the − strand, then the probability of the genes fusing, given C, is

$$\Pr\Big(\bigcup_{(a,b) \in U \times V} C_{(a,b)}\Big) = \sum_{(a,b) \in U \times V} \frac{\Pr\big(L_C = l_C(a, b)\big)}{N_C[a + b]}. \tag{11}$$

Otherwise, if the genes are both on the same strand, then an in-frame fusion transcript is impossible (see footnote 7) and the probability is 0. In the simple case, assume that the clone lengths are uniformly distributed over the range $[L_{\min}, L_{\max}]$, so that

$$\Pr\big(L_C = l_C(a, b)\big) = \begin{cases} \dfrac{1}{L_{\max} - L_{\min}} & \text{if } L_{\min} \le l_C(a, b) \le L_{\max}, \\ 0 & \text{otherwise.} \end{cases}$$

In this case, Equation 11 refers to the fraction of the trapezoid (Equation ) that intersects with U × V. A more realistic distribution of the clone lengths is empirically estimated (Supp. I), and used to compute $\Pr(L_C = l_C(a, b))$.

Next, we extend the equations to include the case when a set $\{C^{(1)}, C^{(2)}, \ldots\}$ of multiple clones overlap the breakpoint (a, b). Define C to be the event that all clones overlap the same breakpoint.
Then

$$C = \bigcup_{(a,b)} C_{(a,b)}, \qquad \text{where} \qquad C_{(a,b)} = \bigcap_j C^{(j)}_{(a,b)}$$

is the event that all clones $C^{(j)}$ overlap the breakpoint (a, b). Thus, the probability of (a, b) being the breakpoint given that all clones overlap it is given by

$$\Pr(C_{(a,b)} \mid C) = \frac{\Pr(C_{(a,b)} \cap C)}{\Pr(C)} \tag{12}$$
$$= \frac{\Pr(C_{(a,b)})}{\Pr(C)} \tag{13}$$
$$= \frac{\prod_j \Pr\big(C^{(j)}_{(a,b)}\big)}{\sum_{(a,b)} \prod_j \Pr\big(C^{(j)}_{(a,b)}\big)}. \tag{14}$$

Here, Equation 13 follows from the fact that $C_{(a,b)}$ implies C, and Equation 14 follows from the independence of clones. This allows us to compute the probability of genes U, V fusing as

$$\Pr\Big(\bigcup_{(a,b) \in U \times V} C_{(a,b)} \,\Big|\, C\Big) = \frac{\sum_{(a,b) \in U \times V} \prod_j \Pr\big(C^{(j)}_{(a,b)}\big)}{\sum_{(a,b)} \prod_j \Pr\big(C^{(j)}_{(a,b)}\big)}. \tag{15}$$

Footnote 7: The analysis of the orientation of genes is analogous in the cases of invalid pairs with other signs.
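The fusion probability of Equation 15 can be evaluated by brute force on a small discrete grid. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses the uniform clone-length model from the text, assumes all invalid pairs are oriented so that a ≥ x and b ≥ y, takes N_C[a + b] to be the number of integer points on the anti-diagonal (l + 1 for clone length l), and does not check gene strand. The function names and the toy coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def clone_prob(a: int, b: int, x: int, y: int, l_min: int, l_max: int) -> float:
    """Pr(C_(a,b)) for one clone, Equation 10 with uniform clone lengths."""
    if a < x or b < y:
        return 0.0
    length = (a - x) + (b - y)  # Equation 7
    if not (l_min <= length <= l_max):
        return 0.0
    n_points = length + 1  # assumed N_C[a + b]: integer (a, b) with this sum
    return 1.0 / ((l_max - l_min) * n_points)


def fusion_probability(clones: List[Tuple[int, int]], u: Interval, v: Interval,
                       l_min: int, l_max: int) -> float:
    """Equation 15: mass of breakpoints in U x V over mass of all breakpoints."""
    a_lo = max(x for x, _ in clones)
    a_hi = min(x for x, _ in clones) + l_max
    b_lo = max(y for _, y in clones)
    b_hi = min(y for _, y in clones) + l_max
    inside = total = 0.0
    for a in range(a_lo, a_hi + 1):
        for b in range(b_lo, b_hi + 1):
            p = 1.0
            for x, y in clones:  # product over independent clones (Equation 14)
                p *= clone_prob(a, b, x, y, l_min, l_max)
                if p == 0.0:
                    break
            total += p
            if u[0] <= a <= u[1] and v[0] <= b <= v[1]:
                inside += p
    if total == 0.0:
        raise ValueError("no breakpoint is consistent with all clones")
    return inside / total


# Toy coordinates for illustration only.
print(fusion_probability([(0, 0), (5, 3)], u=(0, 12), v=(0, 30), l_min=10, l_max=20))
```

The normalizing constant 1/(L_max − L_min) cancels in the ratio, so only the support of the length distribution and the 1/N_C weights affect the result under the uniform model.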
