Evaluation of Paired-end Sequencing Strategies for Detection of Genome Rearrangements in Cancer

Size: px
Start display at page:

Download "Evaluation of Paired-end Sequencing Strategies for Detection of Genome Rearrangements in Cancer"

Transcription

1 Evaluation of Paired-end Sequencing Strategies for Detection of Genome Rearrangements in Cancer Ali Bashir Stanislav Volik Colin Collins Vineet Bafna Benjamin J. Raphael June, 28 Abstract Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy neutral rearrangements such as inversions and translocations, which are common in cancer, and can produce novel fusion genes. We address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely using both theoretical models and simulation. We derive a formula for the probability of a fusion gene given a set of observed paired-end sequences and use this formula to compute fusion gene probabilities in several breast cancer samples. We find that we are able to accurately predict fusion genes in these samples with a relatively small number of fragments of large sizes. We further demonstrate how the ability to detect fusion genes depends on the distribution of lengths of human genes, and evaluate how different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by new massively parallel sequencing technologies. Introduction Cancer is a disease driven by selection for somatic mutations. These mutations range from single nucleotide changes to large-scale chromosomal aberrations such as deletion, duplications, inversions and translocations. While many such mutations have been cataloged in cancer cells via cytogenetics, gene resequencing, and arraybased techniques (i.e. comparative genomic hybridization) there is now great interest in using genome sequencing to provide a comprehensive understanding of mutations in tumor genomes. One such sequencing initiative, The Cancer Genome Atlas ( focuses sequencing efforts in the pilot phase on point mutations in coding regions. However, the pilot phase of this initiative largely ignores copy neutral genome rearrangements including translocations and inversions. Such rearrangements can create novel fusion genes, as observed in leukemias, lymphomas, and sarcomas [, 2, 3]. The canonical example of a fusion gene is BCR-ABL that results from a characteristic translocation (termed the Philadelphia chromosome ) in many patients with chronic myelogenous leukemia (CML) [3]. The advent of Gleevec, a drug targeted to the BCR-ABL fusion gene, has proven successful in treatment of CML patients [4], invigorating the search for other fusion genes which might provide tumor-specific biomarkers or drug targets. Until recently, it is was generally believed that translocations and resulting gene fusions were solely recurrent in hematological disorders and sarcomas, with few suggesting that such recurrent events were prevalent across all tumor types, including solid tumors [5, 6]. This view has been increasingly challenged with the discovery of a fusion between the TMPRSS2 gene and several members of the ERG protein family in prostate cancer [7] and the EML4-ALK fusion in lung cancer [8]. These studies raise the question of what other recurrent rearrangements remain to be discovered and emphasize the importance of developing a systematic, genome-wide approach for detecting fusion genes and other Univ. of California, San Diego, Bioinformatics Graduate Program Univ. of California, San Francisco, Comprehensive Cancer Center. Univ. of California, San Diego, Dept. of Computer Science and Engineering Brown Univ., Dept. of Computer Science and Center for Computational Molecular Biology

2 large scale rearrangements. One strategy for high-resolution identification of rearrangements involves paired-end sequencing of clones, or other fragments of genomic DNA, from tumor samples. The resulting end-sequence pairs, or paired reads, are mapped back to the reference human genome sequence. If the mapped locations of the ends of a clone are invalid (i.e. have abnormal distance or orientation) then a genomic rearrangement is suggested (See Figure and Methods). This strategy was initially described in the End Sequence Profiling approach [9] and is currently also used to assess genetic structural variation []. An innovative approach utilizing SAGE-like sequencing of concatenated short paired-end tags successfully identified fusion transcripts in cdna libraries []. Present and forthcoming next-generation DNA sequencers hold promise for extremely high-throughput sequencing of paired-end reads. For example, the Illumina Genome Analyzer will soon be able to produce millions of paired reads of approximately 3bp from fragments of length 5-bp [2], while the SOLiD system from Applied Biosystems promises 25bp reads from each end of size selected DNA fragment of many sizes [3]. Similar strategies coupling the generation of paired-end tags with 454 sequencing have also been described [4, 5]. Whole genome paired-end sequencing approaches allow for a genome-wide survey of all potential fusion genes and other rearrangements in a tumor. This approach holds several advantages over transcript or protein profiling in cancer studies. First, discovery of fusion genes using mrna expression [7], cdna sequencing, or mass spectrometry [6] depends on the fusion genes being transcribed under the specific cellular conditions present in the sample at the time of the assay. These conditions might be different than those experienced by the cells during tumor development. Second, measurement of fusions at the DNA sequence level focuses on gene fusions due to genomic rearrangements and thus is less impeded by splicing artifacts or trans splicing [7]. Finally, genome sequencing can identify more subtle regulatory fusions that result when the promoter of one gene is fused to the coding region of another gene, as in the case with with the c-myc oncogene fusion with the immunoglobin gene promoter in Burkitt s lymphoma [8]. In this paper, we address a number of theoretical and practical considerations for assessing tumor genome organization using paired-end sequencing approaches. We are largely concerned with detecting rearrangement breakpoints, where a pair of non-adjacent coordinates in the reference genome are adjacent (i.e. fused) in the tumor genome. In particular, we extend this idea of a breakpoint to examine the ability to detect fusion genes. If a rearrangement in the tumor genome is identified by detecting a clone with end sequences mapping to distant locations, does this rearrangement lead to formation of a fusion gene? Obviously, sequencing the clone will answer this question, but this requires additional effort/cost and (in the case of most next-generation sequencing technologies which do not archive the genome in a clone library) may be problematic. We show how to use evidence from multiple clones and the empirical distribution of clone sizes to derive a formula for the probability of fusion between a pair of genomic regions (e.g. genes) given the set of all mapped clones. Such estimates are useful for prioritizing follow-up experiments to validate fusion genes. In a test experiment on the MCF7 cell-line, 3,2 pairs of genes were found near clone with aberrantly mapping end-sequences. However, our analysis revealed only 8 pairs of genes with a high probability 2 of fusion, of which six were tested and five experimentally confirmed. The advent of high throughput sequencing strategies raise important questions with respect to experimental design for understanding tumor genome organization. Obviously, sequencing more clones improves the probability of detecting fusion genes and breakpoints. However, even with the latest sequencing technologies, it would not be practical nor cost effective to shotgun sequence and assemble the genomes of thousands of tumor samples. Thus, it is important to optimize to maximize the probability of detecting fusion genes with the least amount of sequencing. This probability depends on multiple factors including the number and length of end-sequenced clones, the length of genes that are fused, and possible errors in breakpoint localization. Here, we derive (theoretically and empirically) many formulae which elucidate the trade-offs in experimental design of both current and next-generation sequencing technologies. Using our probability calculations and simulations, we find that even with current paired-end technology we can obtain extremely high probability of breakpoint detection with a very low number of reads. For example, more than 9% of all breakpoints can be detected with paired-end sequencing of less than, clones (Table 2). Additionally, next-generation sequencers can potentially detect rearrangements with a greater than 99% probability and localize the breakpoints of these rearrangements to less than 3bp in a single run of the machine (Table 2). A number of next-generation sequencing technologies do not use traditional cloning vectors to amplify DNA. For the sake of simplicity we will use the term clone to refer to any contiguous fragment that is sequenced from both ends. 2 Probability greater than.5 2

3 Fused Gene Pair Tumor Genome Normal Genome x C Gene U a y C Gene V b (a) L= (a-x C ) + (b-y C ) Genomic Position Gene V b y C Clone size = L Max Clone size = L Min Clone size = (b) x C a Gene U Genomic Position Figure : Schematic of Breakpoint Calculation. (a) The endpoints of a clone C from the tumor genome map to locations x C and y C (joined by an arc) on the reference genome that are inconsistent with C being a contiguous piece of the reference genome. This configurations indicates the presence of a breakpoint (a, b) that fuse at ζ in the tumor genome. (b) The coordinates (a, b) of the breakpoint are unknown but lie with the trapezoid described by equation (). The observed length of the clone is given by L C = (a x C) + (b y C). The rectangle U V describes the breakpoints that lead to a fusion between genes U and V. 3

4 Results Computing probability of a fusion gene We consider the tumor genome as a rearranged version of the reference human genome and assume that there exists a mapping between coordinates of the two genomes. The reference genome is described by a single interval of length G; i.e. we concatenate multiple chromosomes into a single coordinate system. Define a breakpoint (a, b) as a pair of non-adjacent coordinates a and b in the reference genome that are adjacent in the tumor genome. Correspondingly, we define the fusion point 3 as the coordinate ζ in the tumor genome such that the point a maps to ζ and the point b maps to ζ +. Consider a clone C containing ζ. If the breakpoints a and b are far apart (e.g. on different chromosomes) then the endpoints of C will map to two locations x C and y C on the reference genome that are inconsistent with C being a contiguous fragment of the reference genome (Figure a). In this case, we say that (x C, y C ) is an invalid pair [2]. Observing an invalid pair (x C, y C ) does not identify the breakpoint (a, b) exactly. However, if we know that the length of the clone C lies within the range [L min, L max ], and we assume that: (i) only a single breakpoint is contained in a clone; and (ii) a > x C and b > y C (without loss of generality: See Methods); then breakpoint (a, b) that are consistent with (x C, y C ) must satisfy L min (a x C ) + (b y C ) L max. () If we plot an invalid pair (x C, y C ) as a point in the two dimensional space G G then the breakpoints (a, b) satisfying the above equation define a trapezoid 4 (Figure ). If multiple clones contain the same fusion point ζ, then the corresponding breakpoint (a, b) lies within the intersection I of the trapezoids corresponding to the clones. Conversely, we will assume that if the trapezoids defined by several invalid pairs intersect, then they share a common breakpoint. We call a set of clones whose trapezoids have non-empty intersection a cluster. Figure 2 displays a cluster of six clones from the MCF7 cell line. As the number of clones that are end-sequenced increases, more clones will contain the same fusion point and more clusters will be formed. Thus, the area of I will decrease, and correspondingly the uncertainty in the location of the fusion point decreases. Given a set of clones, we are interested in computing the probability that the cancer genome contains a fusion gene, i.e. a fusion of two different genes from the reference genome. A gene defines an interval U = [s, t] where s is the 5 transcription start site and t is the 3 transcription termination site. Consider two genes with intervals U and V. The two genes are fused if there exists a breakpoint (u, v) that lies in U V. This breakpoint is detected if (u, v) lies in I. An approximate probability for a fusion gene is the fraction of I that lies within the rectangle U V. We obtain a better estimate of the probability of fusion by considering the empirical distribution of clone lengths and provide a (See Methods). The exact probability of the gene fusion is given by the probability mass that lies within the intersection of I and the rectangle U V defined by the pair of genes. An efficient algorithm for computing these probabilities is given in Methods. Fusion Predictions in Breast Cancer We made predictions of fusion genes for the MCF7, BT474, SKBR3 breast cancer cell lines as well as two primary tumors using data from end sequence profiling of these samples [2, 22]. Approximately, 7Mb of end-sequence was derived from these 5 samples, 29Mb (corresponding to.47 clonal coverage) coming from the MCF7 cell line. Across all samples, a total of,4 invalid pairs were obtained. These formed 99 clusters, 95 of which contained more than one clone. The empirical distribution of clone lengths for the MCF7 library library is shown in Supp. Fig I. We use the distribution in each clone library to compute the probabilities shown in Tables. Table shows the results of our predictions for fully sequenced BACs across multiple breast cancer cell lines and primary tumors, sorted according to fusion probability. We have successfully validated a number of these highest ranked predictions (Table ). Experimental validation in this context involves sequencing of an entire clone, thus exactly confirming the breakpoint (and point of gene fusion) if present (See Methods). In some cases multiple breakpoints were found in a clone, we ensure that the breakpoint associated to each gene in the fusion disrupts the corresponding 3 In the genome rearrangement literature, a fusion point is also called a breakpoint [9]. 4 A triangle when L min = 4

5 BCAS Genomic Position (Chromosome 2) 5.23 x Genomic Position (Chromosome ) x 8 NTNG Figure 2: Prediction of a Fusion between the NTNG and BCAS genes. The rectangle indicates the possible locations of a breakpoint on chromosomes and 2 that would result in a fusion between NTNG and BCAS. Each trapezoid indicates possible locations for a breakpoint consistent with an invalid pair. Assuming that all clones contain the same breakpoint, this breakpoint must lie in the intersection of the trapezoids (shaded region). Approximately 69% of this shaded region intersects (darkly shaded region) the fusion gene rectangle, giving a probability of fusion of approximately.69. The empirical distribution of clone sizes reveals that not all clone sizes are equally likely (e.g. extremely long or short clones are rare). Using this additional information, our improved estimate for the probability of fusion is >.99. 5

6 .8 Probability of Fusion (a) Product of Gene Lengths (log scale) (b) Count in ChimerDB Product of Gene Lengths (log scale) Figure 3: (a) Probability of Fusion vs. product of gene lengths involved in the fusion event indicates higher fusion probabilities for larger gene pairs. Larger circles indicate gene pairs experimentally validated by further sequencing. Darkly shaded circles indicate fusions confirmed by clone sequencing; yellow circles indicate predicted fusions with negative sequencing results. Smaller points indicate untested predictions with blue circles indicating cluster sizes of 2 or more clones and red circles indicating singletons. (b) The number of fusion genes in chimerdb [23] plotted as a function of the product of gene lengths in the fusion. 6

7 Table : Fusion Probability Predictions and Sequencing Results for Clusters in Breast Cancer Start End Fusion Cluster Sequencing Supporting Cell Line / Gene Gene Probability Size Fusion Primary Tumor ASTN2 PTPRG 2 Yes MCF7 BCAS4 BCAS3 2 Yes MCF7 KCND3 PPME.99 2 Yes MCF7 NTNG BCAS.99 6 Yes MCF7 BCAS3 ATXN Yes MCF7 ZFP64 PHACTR No BT474 CT2 HUMAN UBE2G2.88 No Breast VAPB ZNFNA3.842* 3 Yes BT474 BMP7 EYA No MCF7 KCNH7 TDGF.25 No Breast SULF2 TBX No MCF7 NACAL NCOA No MCF7 MRPL45 TBCD3C.5 No BT474 U NP No Breast RBBP9 ITGB2.5 No Breast Y SYNPR <. 4 No MCF7 PRR TMEM49 <. 9 No MCF7 BMP7 Q96TB <. 3 No MCF7 Ranked List of Fusion Genes Predictions in Breast Cancer Cell Lines and Primary Tumors The gene order shown indicates start and end positions with respect to transcription. Elements labeled as Not Tested are, as yet, not sequenced. Though additional clones have been sequenced they did not overlap any putative fusion region. It should be noted that those clones were also negative for gene fusion. VAPB/ZNFNA3 has low probability, however there are many pairs of genes with low probability of fusion in this region. The probability that any one of these gene pairs fuse is >.3. All clones in a cluster are non-redundant (the same clones do not reappear multiple times in a cluster). indicates that a single clone contained more than two chromosomal segments, i.e. the clone is not a simple fusion of two genomic loci. gene. Such multiple rearranged regions have been observed to still form fusion transcripts as in the case of BCAS4/BCAS3 [, 24]. An example, and explanation, of a high-scoring prediction (NTNG/BCAS) can be seen in Figure 2. The strong correspondence between fusion probability prediction and subsequent sequencing validation of the breakpoint in Table illustrates the predicative power of our method. Table also indicates the power of the technique in predicting negative results. In addition to gene pairs with highly predcited fusion probabilities it shows results on all remaining sequenced clones for which fusion information was available. Only one of these clones (below 5% fusion probability) resulted in positive results for fusion genes (VAPB/ZNFNA3). The negative results mostly follow the computational predictions, as only one of the clones had a probability above 5% for gene fusion. A notable result from sequencing, as noted earlier, is the observation that certain clones undergo multiple rearrangements (see Table ). This results in greater than two chromosomal segments, with respect to the reference genome, present within a single clone. Detecting fusion points Consider an idealized model in which R clones, each of fixed length L are picked uniformly at random from a tumor genome 5 of length G, end-sequenced, and ends mapped to the reference genome. The fraction of clones uniquely mapped, f, provides one with a revised set of clones N = fr. The fraction of clones uniquely mapped varies significantly between technologies to the next. Our ESP study had 9% of reads mappable with 58% uniquely mapped [22]. 454 sequencing of structural variants in the human genome reported 63% mapping of sequences with recognizable linker sequences, corresponding to mapping of 4% mapping of all reads [5]. It should be noted that the 454 reads are of significantly higher sequenced end lengths (average 9bp) [5] and quality compared to other next gene technologies (average 2 3bp) [3, 2], so lower mapping fidelity would be expected for such technologies. A fusion point, ζ, on the tumor genome is detected if a uniquely mapped clone contains it (See Figure 4). 5 For these calculations the tumor genome size, G, will correspond to the diploid genome size, 6 6 bp for human. 7

8 C l C r Genome Figure 4: Schematic of a Breakpoint Region. A fusion point ζ on the tumor genome contained in multiple clones. The leftmost and rightmost clones determine the breakpoint region Θ ζ in which the fusion point can occur. Using the Clarke-Carbon formula [25, 26] (See Methods), the probability P ζ of detection is given by P ζ e c (2) where c = NL/G is the clone coverage. A single clone localizes a fusion point to within L bp, while multiple clones containing a fusion point will localize the fusion point more precisely. Define the breakpoint region, Θ ζ, as the interval determined by the intersection of all clones that contain ζ. Therefore Θ ζ defines the localization of ζ, or the uncertainty in mapping ζ. To localize a fusion point to within L, we only need a single clone overlapping ζ. We show (See Methods) that Furthermore, we show that for s < L, Pr( Θ ζ = L) Le c ( e N G ). (3) P r[( Θ ζ = s)] se Ns G ( e N G ) 2 These equations allow us to estimate the the expected length of Θ ζ, conditioned on ζ being covered 6 (See Methods for full derivation and closed form solution, equation 24) as ( ) ( ) e N L G E( Θ ζ ζ is covered) e c L 2 e c + s 2 e Ns G ( e N G ) (5) We evaluated the error in this approximation by simulation (See Supplemental Methods for descriptions of all simulations). Supp. II shows that approximation (5) very closely models the average obsered Θ ζ. The relative error between the average observed length of the breakpoint region and approximation (5) was.2. We also assess the effect of different clone sizes, L, and number of clones, N, on the expected length of the breakpoint region, E( Θ ζ ), around a specific fusion point, ζ. In Figure 5 we see the obvious correlation that an increase in the number of reads, N, decreases the uncertainty in localization ( Θ ζ ). Interestingly, we note that 4kb clones are most advantageous for detection when Θ ζ = 4kb. A similar effect is observed for 5kb and 2kb. This implies that the choice of clone-lengths impacts one s ability to detect fusions of a specific size. Evaluation of Sequencing Strategies Formulas (2) and (5) provide a framework for examining a variety of sequencing parameters (L, N, c). In Table 2 and graphically in Supp. III and IV, we assess the effect of using different clone sizes and varying numbers of paired reads on the ability to detect and localize a fusion point (also see Supp. V). One can see that a distinct trade-off exists between detection, in which larger clones hold a distinct advantage, and localization, in which case smaller clones are advantageous. Though larger size clones (e.g. BACs of 5kb) appear more pragmatic for sequencing projects using a smaller number of paired reads, the advent of low cost, highly parallel sequencing of small clones could soon yield extremely high resolution (smaller Θ ζ ) and coverage of fusion points. 6 Otherwise, Θ ζ is not defined s= (4) 8

9 Probability.8 5kb Clones, k Reads 4kb Clones, k Reads 2kb Clones, k Reads 5kb Clones, k Reads 4kb Clones, k Reads 2kb Clones, k Reads 5kb Clones, M Reads 4kb Clones, M Reads 2kb Clones, M Reads Θζ x Figure 5: Probability of localizing a fusion point to an interval of a given length. A fusion point is localized to length s if the corresponding breakpoint point region has length s or less. Note that when s exceeds the clone length L, only a single clone contains the fusion point; thus, the probability of localization is the fixed probability of a clone containing the fusion point at a given clonal coverage. Note that a fixed clone length is assumed here. Using a distribution of clone lengths would create a less abrupt transition. Table 2: Breakpoint Detection and Localization for Different Sequencing Strategies Clone Size(L) kb kb 2kb 2kb kb kb 4kb 4kb 5kb 5kb 5kb Paired Reads(N) Clone Coverage(c) 3.3X.33X 3.3X.66X 6.7X 3.3X 26.7X.33X 25X 5X.6X E( Θζ ) Pζ >.99.5 > >.99.8 > > E( Θζ ) Pζ >.99.5 > The probability Pζ of detecting a fusion point and the expected length E( Θζ ) of a breakpoint region under various clone sizes (L) and number of end-sequenced clones (N ). Pζ, and E( Θζ ), correspond to the probability for, and expected size of a breakpoint region in the case of, two clones spanning ζ. The small clone sizes (kb, 2kb) and large number of reads represent what one might achieve in a single run with new technologies (assuming perfect mapping of end sequences). The last value for the 5kb clones represent our current status on the MCF7 cell line. 9

10 Percentage of Total Genes in Set Gene Length in kb (Log Scale) All Genes (39,368) Kinases (62) Transcription Factors (562) ChimerDB (3) Random Fusion Genes (2) Figure 6: Distribution of gene sizes for different sets of genes. All genes: The known genes track in the UCSC Genome Browser [27]. Kinases: Selected from the KinBase database [28]. Transcription factors: Selected from the AmiGO database according to the GO term transcription factor activity [29]. ChimerDB: Fusion genes in cancer extracted from the chimerdb database [23]. Random Fusion Genes: A set of 2 genes involved in random fusion events. Random Fusion events were formed by inducing random breakpoints, and selecting such events if they formed a fusion gene. Note that the gene sizes are on a log scale, and the number of genes from each set used to derive each distribution is shown in the legend. Fusion gene probabilities Figure 5 indicates that to achieve a desired breakpoint localization with a fixed number of reads, some clone sizes are more effective than others. Thus, Figure 5 implies that our ability to accurately assess fusion genes depends not only on the sequencing parameters but also on the sizes of genes involved in the fusion. There is considerable variation in sizes of human genes, as seen in Figure 6. When considering all known transcripts [27], the median gene size is approximately 2kb and the mean is approximately 4kb. However, this does not correspond to the sizes of known fusion genes in cancer. Examination of chimerdb [23], a database of fusion sequences derived from mrna, EST, literature and database searches, shows a clear bias towards larger genes (Figure 6), with a median gene size of 4kb and a mean gene size of 9kb. It is tempting to speculate on the reasons for this bias. There is the potential for ascertainment bias as larger genes would be easier to localize via cytogenetic techniques. Additionally, random breakage would seem to favor larger genes as the probability of a breakpoint disrupting a large gene would be greater than for a small gene. To assess this possibility, we simulated random fusion events by inducing random breakpoints in the genome. If a breakpoint formed a fusion gene we recorded the size of each fusion partner as seen in Figure 6. It is interesting to note that these random fusion events results in much larger genes than observed in the normal genome or chimerdb (median and mean gene sizes of 55kb and 284kb, respectively). Though further investigations are needed, this could indicate that the observed fusion genes are selected for functional reasons. We also examined the distribution of transcription factors and kinases which are members of multiple fusion genes, (Figure 6). Interestingly, kinases correspond much more closely to the chimerdb distribution whereas transcription factors correspond better with gene sizes observed across the entire genome. The variation in gene sizes for different classes of genes (Figure 6) suggests that one consider a wide range of gene sizes when assessing our ability to detect fusion genes. Figure 7 shows the number of clones necessary to achieve at least a.5 fusion probability for a random gene pair of specified size, across a variety of different clone sizes. It should be noted that the breakpoint could exist at any position within either gene. Smaller clone sizes clearly hold a distinct advantage in the probability of detecting fusion genes when compared across equal clonal coverage while large clones tend to perform better when comparing across equivalent reads (Figure 7a). This is not surprising, as a significantly higher number of paired reads is required to achieve the same coverage with smaller clones. For 2kb clones 75 times the number of paired reads is needed to yield equivalent clone coverage

11 Paired Rads Fusion Probability > Number of Clones Fusion Probability >.5 2kb fusion 4kb fusion kb fusion 5kb fusion (a) Fusion Gene Size (a) Clone Size Figure 7: (a) The number of paired reads necessary to detect fusion genes with fusion probability greater than.5 as a function of gene size for different clone lengths. The vertical lines indicate median (2kb) and mean (4kb) sizes for all known genes as well as the median (4kb) and mean (9kb) sizes for chimerdb genes. (b) The number of paired reads necessary to detect fusion genes with fusion probability greater than.5 as a function of clone size for different fusion genes sizes (log scale in both axes). Each point in these plots is the average over different fusion genes and and different simulations of clones from the genome. Thus, each datapoint represents the average value of 4 simulations. In each simulation, a pair of genes was chosen such that area of the resulting gene rectangle (U V ) was equal to the square of the indicated fusion gene size. A breakpoint was selected for the gene pair uniformly in the rectangle U V ). with respect to 5kb clones. It is important to note, however, the next-generation sequencing technologies can potentially generate a high number of paired reads for small clone sizes at very low cost (See Discussion). Less obvious is the relationship between fusion gene size and detectability of fusion events (Figure 7b). Note that larger clones create larger trapezoids. This increases the probability that the trapezoid intersects with the gene rectangle (better detection), but may, thus, reduce the calculated fusion probability since only a small fraction of the larger trapezoid would overlap the gene rectangle (akin to creating a larger breakpoint region around a breakpoint). Notice that there appears to be a distinct correlation between the size of genes involved in a fusion and the size of the clone being used to sequence the genome. When the resulting fused gene is 4kb in length, the average fusion probability is significantly greater when using the same number of 4kb clones compared to 2kb clones, largely due to the greater genomic coverage provided by the larger clones. However, 4kb clones also perform nearly as well as 5kb clones (Figure 7b), because of their improved breakpoint localization (Figure 5). Once the fusion gene size is increased to 5kb, once again, a significant benefit is seen by moving to the larger clone size. In addition to supporting bias for identification of larger gene fusion events by large clones, these results suggest that when investigating specific fusion events, designing one s clone library to reflect the size of the event could be beneficial in detecting it with high probability. However, it is important to note that the 5kb clones consistently show less variance in predicted probabilities (Supp. VI). This makes them a more reliable source for performing whole genome assessments and comparisons across multiple tumor samples, especially with a small number of paired reads. Effects of Errors There are numerous sources of error in paired-end sequencing approaches. The method of computing fusion probabilities from end-sequences makes it possible that a pair of genes near a breakpoint may exhibit a high probability of fusion but not actually fuse. This is more likely in the case of large clone sizes and low clone coverage. A number of other factors, including experimental artifacts, genome assembly errors or mismapping of end sequences, can affect the prediction of fusion genes. In particular, experimental artifacts can lead to false identification of fusion genes. A major source of experimental artifacts is chimeric clones produced when two non-contiguous regions of DNA are joined together during the cloning procedure. Approximately 2% of clones in modern BAC libraries are chimeric [2, 5], and rates for other vectors are roughly similar. The type and rate of experimental artifacts for new genome amplification and sequencing strategies is still an open question. In order to assess the rate of false positive predictions of fusion genes in the presence of errors, we simulated

12 TP TP 5 (a).5 2kb Clones 4kb Clones 5kb Clones FP (b) 2kb Clones 4kb Clones 5kb Clones FP TP TP (c) 5 2kb Clones 4kb Clones 5kb Clones FP (d) 5 2kb Clones 4kb Clones 5kb Clones.5.5 FP Figure 8: Sensitivity and Specificity of Fusion Gene Predictions (a) Number of false positive (FP) and true positive (TP) fusion gene predictions for a simulated genome with translocations and, paired reads. Each curve represents the average of 5 simulations with clones of a fixed size (2kb, 4kb, 5kb clones). The minimum fusion probability threshold for indicating that a fusion gene was predicted was decreased from >.95 (leftmost point) to > (rightmost point) in increments.5. and the number of true and false predictions was determined. For all figures 9 true fusion genes were present in the rearranged genome. These 9 events were not selected for, rather they were a result of the random rearrangment of the genome. (b), paired reads. (c),, paired reads. (d),, paired reads. random rearrangement events with the percent of chimeric clones equal to % (Figure 8). For each curve, the points represent True Positives (fusion pairs correctly labeled as such) vs. False Positives (non-fusing pairs incorrectly labeled as fusing) for various fusion probability thresholds. As expected, with low numbers of paired reads 5kb clones obtain the most True Positives (Figure 8a,b), while with high number of paired reads smaller clone sizes are better (4kb clones in Figure 8c). Very high coverages are needed before very small clones, 2kb, can become effective (Figure 8d). However, they show almost no false positives at appropriate thresholds, with little (if any) increase in true positives when reducing the probability threshold (Figure 8d). Finally, we examined the effect of chimeric clones on our ability to identify breakpoints from invalid clusters. Obviously, when only a single isolated invalid pair exists we cannot determine whether it arose via a chimeric event or through a true rearrangement. However, a cluster of invalid pairs is highly unlikely to arise from chimeric clones [2] as seen in Figure 9). In most cases, no clusters formed from chimeric clones are ever observed. Even under high coverage (X) and a very high percentage of chimeric clones (5% of all paired reads) in 8% of studies no chimeric clusters would be observed. This result increases the confidence in the assumption that clusters of two or more invalid pairs indicate true rearrangement events. When considering an equal number of chimeric clones for different clone lengths, the probability of observing a chimeric cluster is much lower for smaller clone sizes. 2

13 Probability of Observing a Chimeric Cluster X Clonal Coverage 2X Clonal Coverage 5X Clonal Coverage 6 X Clonal Coverage Percent Chimeric Clones Figure 9: Probability of observing at least one chimeric cluster vs. the percent of chimeric clones. These probabilities were computed using Equation 27, with clone size L = 5kb and confirmed by simulation. Other clone sizes yield virtually identical probabilities at the same clonal coverage. Discussion We provided a computational framework to evaluate paired-end sequencing strategies for detection of rearrangements in cancer genomes via. Using our probability calculations and simulations, we find that even with current paired-end technology we can obtain extremely high probability of breakpoint detection with a very low number of reads. For example, more than 9% of all breakpoints can be detected with paired-end sequencing of less than, clones (Table 2). Additionally, next-generation sequencers can potentially detect rearrangements with a greater than 99% probability and localize the breakpoints of these rearrangements to less than 3bp in a single run of the machine (Table 2). Even when only a fraction (f.5) of the reads map uniquely, similar detection levels can be achieved by simply doubling the amount of sequencing. We derived formulae that provide estimates of the probability of detecting rearrangement breakpoints and localizing them precisely. For a genome of length G with N mapped paired reads from clones of size L, the detection probability is a function of the of clonal coverage (c = NL G ). Thus, increasing L means that fewer clones are needed to maintain the same probability of detecting a fusion. Traditionally, clone size L was dictated by efficiency considerations with available cloning vectors (e.g. plasmids 2kb, fosmids 4kb, and BACs 5kb). However, paired-end tag approaches coupled with new sequencing technolgies permit paired-end sequencing from a larger range of clone sizes. On the other hand, the distribution of breakpoint localization depends separately on clone size L, number of mapped reads N, and genome size G. The natural question for the practioner is: what sequencing strategy maximizes information about rearrangements in the cancer genome for minimum cost? Providing a definitive answer to the question is complicated by three major issues. First, the goal of maximizing information about rearrangments in cancer genomes requires further specification. If one is interested in detecting as many rearrangement breakpoints (and thus fusion genes) as possible with the minimum amount of sequencing (i.e. maximizing the coverage c), then for a fixed number of paired reads, larger clones give higher probabilities of detection than smaller clones, a consequence that is follows immediately from the breakpoint detection probability (Equation 2). However, if one is interested in precise localization of breakpoint regions, then smaller clones will provide better localization when the breakpoint is detected (Figure 8, V). If one wants to validate the breakpoint via PCR, then clearly localization must be less than the length amplified via PCR, typically less than a few kilobases. Our results show the less obvious correlation between clone size and the probability of detecting breakpoint regions of a specific size and, thus, fusion genes involving partner genes of a specific size. Figure 5 shows that at equal number of paired reads different clone sizes obtain better probabilities for resolving fusion points to an interval of fixed length. Moreover, Figure 7b shows that these results readily translate to the probability of detecting fusion genes with partner genes of a given size. Such trade-offs permeate nearly all aspects of sequence strategy evaluation. Suppose that paired-end sequences could be obtained for any clone length. Then the choice of optimal clone size depends on the length of fusion genes that the researcher wants to have the greatest ability to localize, which might depend on a prior belief about the model of rearrangement in cancer. For example, if one wants to be able to localize fusion genes whose size is approximately the length of an average human gene 3

14 (4kb), then the optimal clone size is 4kb. However, under the hypothesis that the breaks in the genome that lead to fusion genes are distributed unifomly on the genome, larger fusion genes would be expected and thus larger clone sizes would be optimal. Note that in many cases, one is interested not in breakpoints in a single tumor, but recurrent breakpoints across multiple samples. For this reason, approaches like Primer Approximation Multiplex PCR (PAMP) that assay for variable genomic lesions (such as deletions) in a patient population [3] are appropriate, and precise localization of breakpoints is unnecessary. Moreover, if rearrangement breakpoints are in highly repetitive regions, it might be difficult to map ends too close to the breakpoints, and thus larger clone sizes are appropriate. On the other hand, if there are multiple rearrangements clustered in a small genomic interval, large clones would miss some of these rearrangements, as observed with the multiple breakpoints found in some of our BACs. The second major difficulty is that the sequencing parameters cannot be set arbitrarily, but are limited by the technology chosen. Some of these technologies are not yet commercially available, and the information about their constraints is unknown. Moreover, the field is developing rapidly and any claims stated about read lengths, sequencing error rates, etc. are undergoing continual revision. Finally, the complexity of cancer genomes at the sequence level is unknown. Cytogenetics has demonstrated [2, 3] extensive polyploidy in tumor cells, and more recent ESP studies and shotgun sequencing will naturally sample more from amplified regions. Although we did not explicity simulate amplifications, it is clear that the probability of detecting amplified translocations is directly proportional to their amplification status. However, it is difficult to calibrate the genome length parameter in our simulations. For this reason, pilot studies will be needed to assess the extent of amplification and heterogeneity in these samples. Mapping Reads Our analysis focused on several key parameters including number of paired reads, clone length, and percent of chimeric clones. Some of these parameters are adjustable while others (e.g. error rate) are fixed by the choice of a given technology. However, our model and simulations did not describe all factors that might influence the choice of a given technology. In particular, we avoided the issue of the mapping of reads to the reference genome. Our results give some indication of the promise of these new technologies, but at present there are many unknowns in the number and size of mapped paired-end sequences (See below) that one can expect. Clearly, pilot studies are needed to address this issue. As next-generation technologies mature they will inevitably provide highly reliable and precise mappings of fusion points and detection of fusion genes, allowing for the systematic analysis of genomic events of many cancer samples. Different sequencing technologies produce reads of varying length and quality which can have a dramatic effect on the ability to map paired reads. On one extreme, conventional paired-end sequencing of cloned genomic fragments employed by current ESP studies [9, 2], yields high quality reads exceeding 5 bp. This enables the majority of reads outside of repeats and segmental duplications to be uniquely and accurately mapped to the reference genome. For example, 492 out of 983 clones (58%) in the MCF7 study, using 5bp from each end, mapped uniquely [22], while 4% of paired reads using bp of pyrosequencing from each end, mapped uniquely [5]. Newer sequencing technologies such as Illumina and ABI have even shorter reads (2 to 3 bp) and higher error rates [3, 2], and the ditag approach sequences only 8 2 base pairs from each end of the genomic fragment []. These shorter reads will be much more difficult to map, particulary when analyzing rearrangements. Unlike resequencing studies when one can use additional information that the end sequences are close together, detection of rearrangements specifically requires the accurate mapping of end sequences from distant locations on the genome. It would be informative to model the effect of different read accuracies and lengths on our ability to accurately resolve breakpoints. Organization of Cancer Genomes An additional limitation is that the simulation makes certain simplifying assumptions about the character of cancer genomes. Most notably, we assume that the size of the cancer genome (equal to the parameter G above) is known. Since many cancer genomes, particularly solid tumors, have extensive aneuploidy the actual size of a given genome might differ greatly from normal cells [32]. Additionally, certain regions are highly amplified and can have complex structures due to different mechanisms of duplication [33, 2]. Finally, genomic heterogeneity, particularly in primary tumor samples, reduces the effective coverage and thus the probability of detecting rearrangement breakpoints. Even a genomic lesion that is an important checkpoint in a cancer progression, 4

15 might be difficult to detect in an admixed sample containing normal cells and cells from earlier developmental stages of the tumor. It is nearly impossible to determine how these factors will affect cancer sequencing without further pilot studies. Such studies promise to provide lots of new information about the extent of ploidy changes and heterogeneity. Extensions and Applications Our formula for the probability of a fusion genes is readily extended to fusions of any genomic feature. For example, regulatory fusions. that result from the fusion of the promoter of one gene to the coding region of another gene can be assessed. Similarly, it should be noted that predicting the probability of detection for amplified breakpoints, P ζ, (under the assumption of constant genome size) naturally follows from the current model. As amplification, a, increases the probability of detection corresponding increases, following ( ( P ζ ) a ). Additionally, a number of techniques based on the accurate resolution of genomic lesions can be used in combination with, or extensions of, this approach. Array CGH, can provide high resolution of breakpoints involved in deletion and duplication events at average resolutions of less than kb [34, 35]. When this information overlaps paired-end sequencing data (such as the case with an amplified translocation like BCAS4/BCAS3) one can potentially improve the resolution of the breakpoint interval defined by a paired-end sequencing approach. Another example, Primer Approximation Multiplex PCR (PAMP) attempts to assay for variable genomic lesions (such as deletions) in a patient population [3]. The success of this technique relies on establishing the variance in genomic boundaries of a genomic event, so that appropriate primers spanning the region can be designed. Our approach provides an effective method by which to derive rough approximations for boundaries of such events. Combining this approach with such experimental techniques makes it a powerful tool for both designing clinical diagnostics and identifying therapeutic targets. Methods Mapping and clustering of end sequences We assume that each clone C is end sequenced and the ends are mapped uniquely to the reference human genome sequence. Thus, each clone C corresponds to a pair (x C, y C ) of locations in the human genome where the end sequences map. In addition, an end sequence may map to either DNA strand, and so each mapped end has a sign (+ or ) indicating the mapped strand. We call such a signed pair an end sequence pair (ES pair). If we know that the length L C of the clone C lies within the range [L min, L max ], then for most clones the distance between elements of the corresponding ES pair will lie in this range and the ends will have opposite, convergent orientations: i.e. an ES pair of the form (+x, (x + L C )). Following [2] we call such ES pairs valid pairs because these indicate no rearrangement in the tumor genome. We can use the distribution of distance y x between the ends of valid pairs to define an empirical distribution for the clone lengths (cf. Supp. I). If a pair (x C, y C ) has ends with non-convergent orientation or whose distance y x is greater than L max or smaller than L min, we say that (x C, y C ) is an invalid pair. The set of breakpoints (a, b) that are consistent with the invalid pair (x C, y C ) is determined by the inequalities [36] L min sign(x C )(a x C ) + sign(y C )(b y C ) L max. (6) Throughout the paper, we assume (without loss of generality) that sign(x C ) = sign(y C ) = + so that a x C and b y C. Validating fusion predictions by sequencing Clones contained predicted fusion genes were draft sequenced (X coverage) by subcloning into 3kb plasmids as described in [22]. Assembly of these sequences and alignment to the reference human genome identified either the precise fusion point, or identified a plasmid containing the fusion point thus localizing the breakpoint to 3kb. 5

16 Computing fusion probability Define C (a,b) as the event that a clone C from the tumor with corresponding invalid pair (x C, y C ) overlaps a breakpoint (a, b) of a reference genome. Assume w.l.o.g. that the invalid pair (x C, y C ) is oriented such that a x C and b y C. Given the breakpoint (a, b), the length L C of the clone is fixed to be l C (a, b) = (a x C ) + (b y C ) = (a + b) (x C + y C ) (7) Therefore, the event C (a,b) implies the event that L C = l C (a, b), allowing us to express the probability of occurrence of breakpoint (a, b) in terms of the distribution on the lengths of clones. Denote N C [s] as the number of discrete breakpoints (a, b) such that a x C, b y C, and a + b = s. Then Pr(C (a,b) ) = Pr(C (a,b) (L C = l C (a, b))) (8) = Pr(C (a,b) L C = l C (a, b)) Pr(L C = l C (a, b)) (9) = N C [a + b] Pr(L C = l C (a, b)), () where the last equality follows from Equation 7 and the assumption that all breakpoints are equally likely. Consider a pair of genes spanning genomic intervals U, V. If exactly one of the genes is on the + strand and the other is on the - strand, then the probability of genes fusing, given C is Pr( (a,b) U V C (a,b) ) = (a,b) U V Pr(L C = l C (a, b)). () N C [a + b] Otherwise, if the genes are both on the same strand then an in-frame fusion transcript is impossible 7 and the probability is. In the simple case, assume that the clone lengths were uniformly distributed over the range [L min, L max ], so that Pr(L C = l C (a, b)) = { L max L min if L min l C (a, b) L max, otherwise. In this case, Equation refers to the fraction of the trapezoid (Equation ) that intersects with U V. A more realistic distribution of the clone lengths is empirically estimated (Supp. I), and used to compute P r(l C = l C (a, b)). Next, we extend the equations to include the case when a set {C (), C (2),...} of multiple clones overlap the breakpoint (a, b). Define C to be the event that all clones overlaps the same breakpoint. Then where C = (a,b) C (a,b) C (a,b) = j C (j) (a,b) is the event that all clones C (j) overlap the breakpoint (a, b). Thus, the probability of (a, b) being the breakpoint given that all clones overlap it is given by Pr(C (a,b) C) = Pr(C (a,b) C) Pr(C) = Pr(C (a,b)) Pr(C) j Pr(C(j) = (a,b) (a,b) ) (2) (3) j Pr(C(j) (a,b) ) (4) Here, equation 3 follows from the fact that C (a,b) implies C, and Equation 4 follows from the independence of clones. This allows us to compute the probability of genes U, V fusing as (a,b) U V j Pr( (a,b) U V C (a,b) C) = Pr(C(j) (a,b) ) (a,b) j Pr(C(j) (a,b) ) (5) 7 The analysis of the orientation of genes is analogous in the cases of invalid pairs with other signs. 6

Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer

Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer Ali Bashir 1 *, Stanislav Volik 2, Colin Collins 2, Vineet Bafna 3, Benjamin J. Raphael 4 * 1 Bioinformatics

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

Understanding Accuracy in SMRT Sequencing

Understanding Accuracy in SMRT Sequencing Understanding Accuracy in SMRT Sequencing Jonas Korlach, Chief Scientific Officer, Pacific Biosciences Introduction Single Molecule, Real-Time (SMRT ) DNA sequencing achieves highly accurate sequencing

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

Comparative Genomic Hybridization

Comparative Genomic Hybridization Comparative Genomic Hybridization Srikesh G. Arunajadai Division of Biostatistics University of California Berkeley PH 296 Presentation Fall 2002 December 9 th 2002 OUTLINE CGH Introduction Methodology,

More information

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping BENG 183 Trey Ideker Genome Assembly and Physical Mapping Reasons for sequencing Complete genome sequencing!!! Resequencing (Confirmatory) E.g., short regions containing single nucleotide polymorphisms

More information

Sample to Insight. Dr. Bhagyashree S. Birla NGS Field Application Scientist

Sample to Insight. Dr. Bhagyashree S. Birla NGS Field Application Scientist Dr. Bhagyashree S. Birla NGS Field Application Scientist bhagyashree.birla@qiagen.com NGS spans a broad range of applications DNA Applications Human ID Liquid biopsy Biomarker discovery Inherited and somatic

More information

Welcome to the NGS webinar series

Welcome to the NGS webinar series Welcome to the NGS webinar series Webinar 1 NGS: Introduction to technology, and applications NGS Technology Webinar 2 Targeted NGS for Cancer Research NGS in cancer Webinar 3 NGS: Data analysis for genetic

More information

Pharmacogenetics: A SNPshot of the Future. Ani Khondkaryan Genomics, Bioinformatics, and Medicine Spring 2001

Pharmacogenetics: A SNPshot of the Future. Ani Khondkaryan Genomics, Bioinformatics, and Medicine Spring 2001 Pharmacogenetics: A SNPshot of the Future Ani Khondkaryan Genomics, Bioinformatics, and Medicine Spring 2001 1 I. What is pharmacogenetics? It is the study of how genetic variation affects drug response

More information

Mate-pair library data improves genome assembly

Mate-pair library data improves genome assembly De Novo Sequencing on the Ion Torrent PGM APPLICATION NOTE Mate-pair library data improves genome assembly Highly accurate PGM data allows for de Novo Sequencing and Assembly For a draft assembly, generate

More information

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7 Computational requirements

More information

02 Agenda Item 03 Agenda Item

02 Agenda Item 03 Agenda Item 01 Agenda Item 02 Agenda Item 03 Agenda Item SOLiD 3 System: Applications Overview April 12th, 2010 Jennifer Stover Field Application Specialist - SOLiD Applications Workflow for SOLiD Application Application

More information

Biol 478/595 Intro to Bioinformatics

Biol 478/595 Intro to Bioinformatics Biol 478/595 Intro to Bioinformatics September M 1 Labor Day 4 W 3 MG Database Searching Ch. 6 5 F 5 MG Database Searching Hw1 6 M 8 MG Scoring Matrices Ch 3 and Ch 4 7 W 10 MG Pairwise Alignment 8 F 12

More information

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS. !! www.clutchprep.com CONCEPT: OVERVIEW OF GENOMICS Genomics is the study of genomes in their entirety Bioinformatics is the analysis of the information content of genomes - Genes, regulatory sequences,

More information

PRIMER SELECTION METHODS FOR DETECTION OF GENOMIC INVERSIONS AND DELETIONS VIA PAMP

PRIMER SELECTION METHODS FOR DETECTION OF GENOMIC INVERSIONS AND DELETIONS VIA PAMP 1 PRIMER SELECTION METHODS FOR DETECTION OF GENOMIC INVERSIONS AND DELETIONS VIA PAMP B. DASGUPTA Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607-7053 E-mail: dasgupta@cs.uic.edu

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Molecular Biology: DNA sequencing

Molecular Biology: DNA sequencing Molecular Biology: DNA sequencing Author: Prof Marinda Oosthuizen Licensed under a Creative Commons Attribution license. SEQUENCING OF LARGE TEMPLATES As we have seen, we can obtain up to 800 nucleotides

More information

Solutions will be posted on the web.

Solutions will be posted on the web. MIT Biology Department 7.012: Introductory Biology - Fall 2004 Instructors: Professor Eric Lander, Professor Robert A. Weinberg, Dr. Claudette Gardel NAME TA SEC 7.012 Problem Set 7 FRIDAY December 3,

More information

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility 2018 ABRF Meeting Satellite Workshop 4 Bridging the Gap: Isolation to Translation (Single Cell RNA-Seq) Sunday, April 22 Basics of RNA-Seq (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly,

More information

SEQUENCING. M Ataei, PhD. Feb 2016

SEQUENCING. M Ataei, PhD. Feb 2016 CLINICAL NEXT GENERATION SEQUENCING M Ataei, PhD Tehran Medical Genetics Laboratory Feb 2016 Overview 2 Background NGS in non-invasive prenatal diagnosis (NIPD) 3 Background Background 4 In the 1970s,

More information

Towards detection of minimal residual disease in multiple myeloma through circulating tumour DNA sequence analysis

Towards detection of minimal residual disease in multiple myeloma through circulating tumour DNA sequence analysis Towards detection of minimal residual disease in multiple myeloma through circulating tumour DNA sequence analysis Trevor Pugh, PhD, FACMG Princess Margaret Cancer Centre, University Health Network Dept.

More information

Structural variation. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

Structural variation. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona Structural variation Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona Genetic variation How much genetic variation is there between individuals? What type of variants

More information

CSE182-L16. LW statistics/assembly

CSE182-L16. LW statistics/assembly CSE182-L16 LW statistics/assembly Silly Quiz Who are these people, and what is the occasion? Genome Sequencing and Assembly Sequencing A break at T is shown here. Measuring the lengths using electrophoresis

More information

DNA Sequencing and Assembly

DNA Sequencing and Assembly DNA Sequencing and Assembly CS 262 Lecture Notes, Winter 2016 February 2nd, 2016 Scribe: Mark Berger Abstract In this lecture, we survey a variety of different sequencing technologies, including their

More information

Bionano Solve Theory of Operation: Variant Annotation Pipeline

Bionano Solve Theory of Operation: Variant Annotation Pipeline Bionano Solve Theory of Operation: Variant Annotation Pipeline Document Number: 30190 Document Revision: B For Research Use Only. Not for use in diagnostic procedures. Copyright 2018 Bionano Genomics,

More information

Chapter 14: Genes in Action

Chapter 14: Genes in Action Chapter 14: Genes in Action Section 1: Mutation and Genetic Change Mutation: Nondisjuction: a failure of homologous chromosomes to separate during meiosis I or the failure of sister chromatids to separate

More information

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences. Bio4342 Exercise 1 Answers: Detecting and Interpreting Genetic Homology (Answers prepared by Wilson Leung) Question 1: Low complexity DNA can be described as sequences that consist primarily of one or

More information

Get to Know Your DNA. Every Single Fragment.

Get to Know Your DNA. Every Single Fragment. HaloPlex HS NGS Target Enrichment System Get to Know Your DNA. Every Single Fragment. High sensitivity detection of rare variants using molecular barcodes How Does Molecular Barcoding Work? HaloPlex HS

More information

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae Schefkind 1 Adam Schefkind Bio 434W 03/08/2014 Finishing of Fosmid 1042D14 Abstract Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae genomic DNA. Through a comprehensive analysis of forward-

More information

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014 Single Nucleotide Variant Analysis H3ABioNet May 14, 2014 Outline What are SNPs and SNVs? How do we identify them? How do we call them? SAMTools GATK VCF File Format Let s call variants! Single Nucleotide

More information

Next Generation Sequencing. Target Enrichment

Next Generation Sequencing. Target Enrichment Next Generation Sequencing Target Enrichment Next Generation Sequencing Your Partner in Every Step from Sample to Data NGS: Revolutionizing Genetic Analysis with Single-Molecule Resolution Next generation

More information

Lecture 7. Next-generation sequencing technologies

Lecture 7. Next-generation sequencing technologies Lecture 7 Next-generation sequencing technologies Next-generation sequencing technologies General principles of short-read NGS Construct a library of fragments Generate clonal template populations Massively

More information

Fusion Gene Analysis. ncounter Elements. Molecules That Count WHITE PAPER. v1.0 OCTOBER 2014 NanoString Technologies, Inc.

Fusion Gene Analysis. ncounter Elements. Molecules That Count WHITE PAPER. v1.0 OCTOBER 2014 NanoString Technologies, Inc. Fusion Gene Analysis NanoString Technologies, Inc. Fusion Gene Analysis ncounter Elements v1.0 OCTOBER 2014 NanoString Technologies, Inc., Seattle WA 98109 FOR RESEARCH USE ONLY. Not for use in diagnostic

More information

Structural variation analysis using NGS sequencing

Structural variation analysis using NGS sequencing Structural variation analysis using NGS sequencing Victor Guryev NBIC NGS taskforce meeting April 15th, 2011 Scale of genomic variants Scale 1 bp 10 bp 100 bp 1 kb 10 kb 100 kb 1 Mb Variants SNPs Short

More information

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome See the Difference With a commitment to your peace of mind, Life Technologies provides a portfolio of robust and scalable

More information

Targeted RNA sequencing reveals the deep complexity of the human transcriptome.

Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Tim R. Mercer 1, Daniel J. Gerhardt 2, Marcel E. Dinger 1, Joanna Crawford 1, Cole Trapnell 3, Jeffrey A. Jeddeloh 2,4, John

More information

Microarrays & Gene Expression Analysis

Microarrays & Gene Expression Analysis Microarrays & Gene Expression Analysis Contents DNA microarray technique Why measure gene expression Clustering algorithms Relation to Cancer SAGE SBH Sequencing By Hybridization DNA Microarrays 1. Developed

More information

QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd

QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd 1 Our current NGS & Bioinformatics Platform 2 Our NGS workflow and applications 3 QIAGEN s

More information

Comparing a few SNP calling algorithms using low-coverage sequencing data

Comparing a few SNP calling algorithms using low-coverage sequencing data Yu and Sun BMC Bioinformatics 2013, 14:274 RESEARCH ARTICLE Open Access Comparing a few SNP calling algorithms using low-coverage sequencing data Xiaoqing Yu 1 and Shuying Sun 1,2* Abstract Background:

More information

Cancer Genetics Solutions

Cancer Genetics Solutions Cancer Genetics Solutions Cancer Genetics Solutions Pushing the Boundaries in Cancer Genetics Cancer is a formidable foe that presents significant challenges. The complexity of this disease can be daunting

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Chapter 15 Gene Technologies and Human Applications

Chapter 15 Gene Technologies and Human Applications Chapter Outline Chapter 15 Gene Technologies and Human Applications Section 1: The Human Genome KEY IDEAS > Why is the Human Genome Project so important? > How do genomics and gene technologies affect

More information

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013

Introduction to RNA-Seq. David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Introduction to RNA-Seq David Wood Winter School in Mathematics and Computational Biology July 1, 2013 Abundance RNA is... Diverse Dynamic Central DNA rrna Epigenetics trna RNA mrna Time Protein Abundance

More information

PV92 PCR Bio Informatics

PV92 PCR Bio Informatics Purpose of PCR Chromosome 16 PV92 PV92 PCR Bio Informatics Alu insert, PV92 locus, chromosome 16 Introduce the polymerase chain reaction (PCR) technique Apply PCR to population genetics Directly measure

More information

Long-range gene regulation

Long-range gene regulation Long-range gene regulation Short Course in Medical Genetics Melbourne, June 2011 Question Why should long-range regulation of gene expression be of interest to Clinical Scientists and Pathologists interested

More information

HaloPlex HS. Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D.

HaloPlex HS. Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D. HaloPlex HS Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D. Sr. Global Product Manager Diagnostics & Genomics Group Agilent Technologies For Research Use Only. Not for Use in Diagnostic

More information

Motivation From Protein to Gene

Motivation From Protein to Gene MOLECULAR BIOLOGY 2003-4 Topic B Recombinant DNA -principles and tools Construct a library - what for, how Major techniques +principles Bioinformatics - in brief Chapter 7 (MCB) 1 Motivation From Protein

More information

The Human Genome Project

The Human Genome Project The Human Genome Project The Human Genome Project Began in 1990 The Mission of the HGP: The quest to understand the human genome and the role it plays in both health and disease. The true payoff from the

More information

NGS in Pathology Webinar

NGS in Pathology Webinar NGS in Pathology Webinar NGS Data Analysis March 10 2016 1 Topics for today s presentation 2 Introduction Next Generation Sequencing (NGS) is becoming a common and versatile tool for biological and medical

More information

Representing Errors and Uncertainty in Plasma Proteomics

Representing Errors and Uncertainty in Plasma Proteomics Representing Errors and Uncertainty in Plasma Proteomics David J. States, M.D., Ph.D. University of Michigan Bioinformatics Program Proteomics Alliance for Cancer Genomics vs. Proteomics Genome sequence

More information

Introduction to Microarray Analysis

Introduction to Microarray Analysis Introduction to Microarray Analysis Methods Course: Gene Expression Data Analysis -Day One Rainer Spang Microarrays Highly parallel measurement devices for gene expression levels 1. How does the microarray

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1 CAP 5510-9 BIOINFORMATICS Su-Shing Chen CISE 10/5/2005 Su-Shing Chen, CISE 1 Basic BioTech Processes Hybridization PCR Southern blotting (spot or stain) 10/5/2005 Su-Shing Chen, CISE 2 10/5/2005 Su-Shing

More information

NUCLEOTIDE RESOLUTION STRUCTURAL VARIATION DETECTION USING NEXT- GENERATION WHOLE GENOME RESEQUENCING

NUCLEOTIDE RESOLUTION STRUCTURAL VARIATION DETECTION USING NEXT- GENERATION WHOLE GENOME RESEQUENCING NUCLEOTIDE RESOLUTION STRUCTURAL VARIATION DETECTION USING NEXT- GENERATION WHOLE GENOME RESEQUENCING Ken Chen, Ph.D. kchen@genome.wustl.edu The Genome Center, Washington University in St. Louis The path

More information

Chapter 17. PCR the polymerase chain reaction and its many uses. Prepared by Woojoo Choi

Chapter 17. PCR the polymerase chain reaction and its many uses. Prepared by Woojoo Choi Chapter 17. PCR the polymerase chain reaction and its many uses Prepared by Woojoo Choi Polymerase chain reaction 1) Polymerase chain reaction (PCR): artificial amplification of a DNA sequence by repeated

More information

Figure S1. Unrearranged locus. Rearranged locus. Concordant read pairs. Region1. Region2. Cluster of discordant read pairs, bundle

Figure S1. Unrearranged locus. Rearranged locus. Concordant read pairs. Region1. Region2. Cluster of discordant read pairs, bundle Figure S1 a Unrearranged locus Rearranged locus Concordant read pairs Region1 Concordant read pairs Cluster of discordant read pairs, bundle Region2 Concordant read pairs b Physical coverage 5 4 3 2 1

More information

Contact us for more information and a quotation

Contact us for more information and a quotation GenePool Information Sheet #1 Installed Sequencing Technologies in the GenePool The GenePool offers sequencing service on three platforms: Sanger (dideoxy) sequencing on ABI 3730 instruments Illumina SOLEXA

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Finishing of DELE Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina

Finishing of DELE Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina Sarah Swiezy Dr. Elgin, Dr. Shaffer Bio 434W 27 February 2015 Finishing of DELE8596009 Abstract Drosophila elegans has been sequenced using Roche 454 pyrosequencing and Illumina technology. DELE8596009,

More information

CHAPTERS 16 & 17: DNA Technology

CHAPTERS 16 & 17: DNA Technology CHAPTERS 16 & 17: DNA Technology 1. What is the function of restriction enzymes in bacteria? 2. How do bacteria protect their DNA from the effects of the restriction enzymes? 3. How do biologists make

More information

Chapter 5. Structural Genomics

Chapter 5. Structural Genomics Chapter 5. Structural Genomics Contents 5. Structural Genomics 5.1. DNA Sequencing Strategies 5.1.1. Map-based Strategies 5.1.2. Whole Genome Shotgun Sequencing 5.2. Genome Annotation 5.2.1. Using Bioinformatic

More information

Complementary Technologies for Precision Genetic Analysis

Complementary Technologies for Precision Genetic Analysis Complementary NGS, CGH and Workflow Featured Publication Zhu, J. et al. Duplication of C7orf58, WNT16 and FAM3C in an obese female with a t(7;22)(q32.1;q11.2) chromosomal translocation and clinical features

More information

We begin with a high-level overview of sequencing. There are three stages in this process.

We begin with a high-level overview of sequencing. There are three stages in this process. Lecture 11 Sequence Assembly February 10, 1998 Lecturer: Phil Green Notes: Kavita Garg 11.1. Introduction This is the first of two lectures by Phil Green on Sequence Assembly. Yeast and some of the bacterial

More information

Assay Validation Services

Assay Validation Services Overview PierianDx s assay validation services bring clinical genomic tests to market more rapidly through experimental design, sample requirements, analytical pipeline optimization, and criteria tuning.

More information

Alignment and Assembly

Alignment and Assembly Alignment and Assembly Genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which

More information

Chapter 15 The Human Genome Project and Genomics. Chapter 15 Human Heredity by Michael Cummings 2006 Brooks/Cole-Thomson Learning

Chapter 15 The Human Genome Project and Genomics. Chapter 15 Human Heredity by Michael Cummings 2006 Brooks/Cole-Thomson Learning Chapter 15 The Human Genome Project and Genomics Genomics Is the study of all genes in a genome Relies on interconnected databases and software to analyze sequenced genomes and to identify genes Impacts

More information

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter Hot off the press January 27, 2009 06:00 AM Eastern Time llumina Launches Suite of Next-Generation Sequencing

More information

CS262 Lecture 12 Notes Single Cell Sequencing Jan. 11, 2016

CS262 Lecture 12 Notes Single Cell Sequencing Jan. 11, 2016 CS262 Lecture 12 Notes Single Cell Sequencing Jan. 11, 2016 Background A typical human cell consists of ~6 billion base pairs of DNA and ~600 million bases of mrna. It is time-consuming and expensive to

More information

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome Lectures 30 and 31 Genome analysis I. Genome analysis A. two general areas 1. structural 2. functional B. genome projects a status report 1. 1 st sequenced: several viral genomes 2. mitochondria and chloroplasts

More information

Next-Generation Sequencing. Technologies

Next-Generation Sequencing. Technologies Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062

More information

Wu et al., Determination of genetic identity in therapeutic chimeric states. We used two approaches for identifying potentially suitable deletion loci

Wu et al., Determination of genetic identity in therapeutic chimeric states. We used two approaches for identifying potentially suitable deletion loci SUPPLEMENTARY METHODS AND DATA General strategy for identifying deletion loci We used two approaches for identifying potentially suitable deletion loci for PDP-FISH analysis. In the first approach, we

More information

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA Whole Human Genome Sequencing Report This is a technical summary report for PG0002601-DNA Physician and Patient Information Physician name: Vinodh Naraynan Address: Suite 406 222 West Thomas Road Phoenix

More information

Genome Biology and Biotechnology

Genome Biology and Biotechnology Genome Biology and Biotechnology Functional Genomics Prof. M. Zabeau Department of Plant Systems Biology Flanders Interuniversity Institute for Biotechnology (VIB) University of Gent International course

More information

Conditional Random Fields, DNA Sequencing. Armin Pourshafeie. February 10, 2015

Conditional Random Fields, DNA Sequencing. Armin Pourshafeie. February 10, 2015 Conditional Random Fields, DNA Sequencing Armin Pourshafeie February 10, 2015 CRF Continued HMMs represent a distribution for an observed sequence x and a parse P(x, ). However, usually we are interested

More information

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Introduction to metagenome assembly Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014 Sequencing specs* Method Read length Accuracy Million reads Time Cost per M 454

More information

Lecture #1. Introduction to microarray technology

Lecture #1. Introduction to microarray technology Lecture #1 Introduction to microarray technology Outline General purpose Microarray assay concept Basic microarray experimental process cdna/two channel arrays Oligonucleotide arrays Exon arrays Comparing

More information

Authors: Vivek Sharma and Ram Kunwar

Authors: Vivek Sharma and Ram Kunwar Molecular markers types and applications A genetic marker is a gene or known DNA sequence on a chromosome that can be used to identify individuals or species. Why we need Molecular Markers There will be

More information

Genome annotation & EST

Genome annotation & EST Genome annotation & EST What is genome annotation? The process of taking the raw DNA sequence produced by the genome sequence projects and adding the layers of analysis and interpretation necessary

More information

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es Sequencing technologies Jose Blanca COMAV institute bioinf.comav.upv.es Outline Sequencing technologies: Sanger 2nd generation sequencing: 3er generation sequencing: 454 Illumina SOLiD Ion Torrent PacBio

More information

Systematic evaluation of spliced alignment programs for RNA- seq data

Systematic evaluation of spliced alignment programs for RNA- seq data Systematic evaluation of spliced alignment programs for RNA- seq data Pär G. Engström, Tamara Steijger, Botond Sipos, Gregory R. Grant, André Kahles, RGASP Consortium, Gunnar Rätsch, Nick Goldman, Tim

More information

Gene Expression Technology

Gene Expression Technology Gene Expression Technology Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu Gene expression Gene expression is the process by which information from a gene

More information

FFPE in your NGS Study

FFPE in your NGS Study FFPE in your NGS Study Richard Corbett Canada s Michael Smith Genome Sciences Centre Vancouver, British Columbia Dec 6, 2017 Our mandate is to advance knowledge about cancer and other diseases and to use

More information

A Greedy Algorithm for Minimizing the Number of Primers in Multiple PCR Experiments

A Greedy Algorithm for Minimizing the Number of Primers in Multiple PCR Experiments A Greedy Algorithm for Minimizing the Number of Primers in Multiple PCR Experiments Koichiro Doi Hiroshi Imai doi@is.s.u-tokyo.ac.jp imai@is.s.u-tokyo.ac.jp Department of Information Science, Faculty of

More information

measuring gene expression December 5, 2017

measuring gene expression December 5, 2017 measuring gene expression December 5, 2017 transcription a usually short-lived RNA copy of the DNA is created through transcription RNA is exported to the cytoplasm to encode proteins some types of RNA

More information

Genetics Lecture 21 Recombinant DNA

Genetics Lecture 21 Recombinant DNA Genetics Lecture 21 Recombinant DNA Recombinant DNA In 1971, a paper published by Kathleen Danna and Daniel Nathans marked the beginning of the recombinant DNA era. The paper described the isolation of

More information

Genetics and Genomics in Medicine Chapter 3. Questions & Answers

Genetics and Genomics in Medicine Chapter 3. Questions & Answers Genetics and Genomics in Medicine Chapter 3 Multiple Choice Questions Questions & Answers Question 3.1 Which of the following statements, if any, is false? a) Amplifying DNA means making many identical

More information

dbcamplicons pipeline Amplicons

dbcamplicons pipeline Amplicons dbcamplicons pipeline Amplicons Matthew L. Settles Genome Center Bioinformatics Core University of California, Davis settles@ucdavis.edu; bioinformatics.core@ucdavis.edu Microbial community analysis Goal:

More information

Marker types. Potato Association of America Frederiction August 9, Allen Van Deynze

Marker types. Potato Association of America Frederiction August 9, Allen Van Deynze Marker types Potato Association of America Frederiction August 9, 2009 Allen Van Deynze Use of DNA Markers in Breeding Germplasm Analysis Fingerprinting of germplasm Arrangement of diversity (clustering,

More information

Roche Molecular Biochemicals Technical Note No. LC 10/2000

Roche Molecular Biochemicals Technical Note No. LC 10/2000 Roche Molecular Biochemicals Technical Note No. LC 10/2000 LightCycler Overview of LightCycler Quantification Methods 1. General Introduction Introduction Content Definitions This Technical Note will introduce

More information

Genome Sequencing-- Strategies

Genome Sequencing-- Strategies Genome Sequencing-- Strategies Bio 4342 Spring 04 What is a genome? A genome can be defined as the entire DNA content of each nucleated cell in an organism Each organism has one or more chromosomes that

More information

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4 WHITE PAPER Oncomine Comprehensive Assay Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4 Contents Scope and purpose of document...2 Content...2 How Torrent

More information

12/8/09 Comp 590/Comp Fall

12/8/09 Comp 590/Comp Fall 12/8/09 Comp 590/Comp 790-90 Fall 2009 1 One of the first, and simplest models of population genealogies was introduced by Wright (1931) and Fisher (1930). Model emphasizes transmission of genes from one

More information

The Diploid Genome Sequence of an Individual Human

The Diploid Genome Sequence of an Individual Human The Diploid Genome Sequence of an Individual Human Maido Remm Journal Club 12.02.2008 Outline Background (history, assembling strategies) Who was sequenced in previous projects Genome variations in J.

More information

DNA METHYLATION RESEARCH TOOLS

DNA METHYLATION RESEARCH TOOLS SeqCap Epi Enrichment System Revolutionize your epigenomic research DNA METHYLATION RESEARCH TOOLS Methylated DNA The SeqCap Epi System is a set of target enrichment tools for DNA methylation assessment

More information

Nature Methods: doi: /nmeth Supplementary Figure 1. Pilot CrY2H-seq experiments to confirm strain and plasmid functionality.

Nature Methods: doi: /nmeth Supplementary Figure 1. Pilot CrY2H-seq experiments to confirm strain and plasmid functionality. Supplementary Figure 1 Pilot CrY2H-seq experiments to confirm strain and plasmid functionality. (a) RT-PCR on HIS3 positive diploid cell lysate containing known interaction partners AT3G62420 (bzip53)

More information

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8 Bi 8 Lecture 4 DNA approaches: How we know what we know Ellen Rothenberg 14 January 2016 Reading: from Alberts Ch. 8 Central concept: DNA or RNA polymer length as an identifying feature RNA has intrinsically

More information

High-throughput genome scaffolding from in vivo DNA interaction frequency

High-throughput genome scaffolding from in vivo DNA interaction frequency correction notice Nat. Biotechnol. 31, 1143 1147 (213) High-throughput genome scaffolding from in vivo DNA interaction frequency Noam Kaplan & Job Dekker In the version of this supplementary file originally

More information

SureSelect XT HS. Target Enrichment

SureSelect XT HS. Target Enrichment SureSelect XT HS Target Enrichment What Is It? SureSelect XT HS joins the SureSelect library preparation reagent family as Agilent s highest sensitivity hybrid capture-based library prep and target enrichment

More information

Department of Computer Science and Engineering, University of

Department of Computer Science and Engineering, University of Optimizing the BAC-End Strategy for Sequencing the Human Genome Richard M. Karp Ron Shamir y April 4, 999 Abstract The rapid increase in human genome sequencing eort and the emergence of several alternative

More information

Introduction to human genomics and genome informatics

Introduction to human genomics and genome informatics Introduction to human genomics and genome informatics Session 1 Prince of Wales Clinical School Dr Jason Wong ARC Future Fellow Head, Bioinformatics & Integrative Genomics Adult Cancer Program, Lowy Cancer

More information