KLEAT: CLEAVAGE SITE ANALYSIS OF TRANSCRIPTOMES *

INANÇ BIROL, ANTHONY RAYMOND, READMAN CHIU, KA MING NIP, SHAUN D JACKMAN, MAAYAN KREITZMAN, T RODERICK DOCKING, CATHERINE A ENNIS, A GORDON ROBERTSON, ALY KARSAN

Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency
Vancouver, BC, V5Z 4S6, Canada

In eukaryotic cells, alternative cleavage of 3' untranslated regions (UTRs) can affect transcript stability, transport and translation. For polyadenylated (poly(A)) transcripts, cleavage sites can be characterized with short-read sequencing using specialized library construction methods. However, for large-scale cohort studies as well as for clinical sequencing applications, it is desirable to characterize such events using RNA-seq data, as the latter are already widely applied to identify other relevant information, such as mutations, alternative splicing and chimeric transcripts. Here we describe KLEAT, an analysis tool that uses de novo assembly of RNA-seq data to characterize cleavage sites on 3' UTRs. We demonstrate the performance of KLEAT on three cell line RNA-seq libraries constructed and sequenced by the ENCODE project, and assembled using Trans-ABySS. Validating the KLEAT predictions with matched ENCODE RNA-seq and RNA-PET libraries, we show that the tool has over 90% positive predictive value when at least three RNA-seq reads support a poly(A) tail and at least three RNA-PET reads mapping within 100 nucleotides are required as validation. We also compare the performance of KLEAT with other popular RNA-seq analysis pipelines that reconstruct 3' UTR ends, and show that it performs favourably, based on an ROC-like curve.
* This work is supported by Canadian Institutes of Health Research, Genome Canada, Genome British Columbia, British Columbia Cancer Foundation and the National Institutes of Health under Award Number R01HG The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of any of our funding agencies.
1. Introduction

The section of an mRNA transcript that is translated into protein sequence is flanked by 5' and 3' untranslated regions (UTRs). These UTRs play a number of important biological roles. The 3' end of an mRNA molecule (the 3' UTR) helps to regulate its stability and localization, hence the amount of corresponding protein that is produced [1-4]. Over 50% of human genes produce two or more transcript isoforms via alternative polyadenylation (APA) of the 3' UTRs. APA is recognized as playing a role in cancer biology [6-9].

A number of direct sequencing protocols have been developed for characterizing polyadenylated (poly(A)) tails of 3' UTRs and APA [9-15]. A cost-effective alternative to these direct sequencing protocols would be high-throughput transcriptome sequencing (RNA-seq), coupled with a validated bioinformatics pipeline to detect 3' UTR cleavage sites (CS). RNA-seq is a central data type for many studies, including the ENCODE (ENCyclopedia Of DNA Elements) project, whose goal is to identify all functional elements in the human genome sequence. Using various sequencing protocols, an ENCODE study identified over 100,000 transcripts, about 60,000 of which were protein coding, and reported that transcript expression levels span six orders of magnitude. This is remarkable, as it speaks to the sensitivity of the RNA-seq technology. The lower range of the reported expression levels, 10^-2 RPKM in that study, implies that RNA-seq can detect a transcript expressed by 1 in 100 cells. This resolution of RNA-seq data can be leveraged to identify 3' UTR ends of transcripts. An earlier study inferred 3' UTR switching using sudden changes in expression profiles near cleavage sites, but did not utilize the direct evidence of observed poly(A) sequences.

In this report, we introduce KLEAT, a post-processing tool for characterizing 3' UTRs in assembled RNA-seq data through direct observation of poly(A) tails.
While we developed KLEAT as an extension to the Trans-ABySS analysis pipeline [20, 21], it can also accept contigs from other transcriptome assembly tools, as we demonstrate below. It analyses the structures of assembled transcripts for poly(A) tails, filters 3' UTR cleavage site (CS) candidates using several evidence types within RNA-seq reads, and gathers and reports metrics that can be used in downstream post-processing, such as filtering calls by their levels of read support.

2. Methods

The key technology KLEAT uses in detecting 3' UTR ends is de novo transcriptome assembly. Compared to genome assembly, a successful transcriptome assembly has to address some particular challenges. These include robust assembly of transcripts from a wide range of transcript abundance levels, and resolution of transcripts from alternative isoforms and gene families. There are several specialized de novo assembly tools, including Trans-ABySS, Trinity and Oases, that successfully address these challenges.

The KLEAT pipeline (Figure 1) uses Trans-ABySS by default. Using the raw reads and assembled contigs, it performs two levels of alignments in parallel: (1) reads to contigs; and (2) contigs to reference genome. It processes these alignment results to identify tail, bridge, and link evidence (Figure 2), and collates the evidence to predict cleavage sites.
[Figure 1 flowchart elements: Sequence Reads; Annotations; ESTs; Align Reads to Genome; Align Contigs to Genome; Trans-ABySS Assembly; Identify CS Events; Filter & Group CSs; 3' UTR Expression; Align Reads to Contigs; Find Bridges; Read to Contig Alignment; Link Support; Find Best Matching Transcript; Report Events; Contig to Genome Alignment; Find Tails; Contigs on Annotated 3' UTRs]

Fig. 1. Flowchart of the KLEAT pipeline. Two shades of yellow flowchart elements designate raw and external input to the pipeline; blue and grey indicate existing internal and external tools, respectively; green denotes new tools developed specifically for KLEAT.

2.1. Tail

Contig sequences that end in a poly(A) stretch represent high-confidence candidates. We filter these candidates to identify true poly(A) tails by aligning the flagged contigs to a reference genome. Accounting for the direction of transcription, we classify contigs with untemplated poly(A) sequence (a stretch of poly(A) sequence not observed in the reference genome) at their 3' ends as tail type events. For a transcript that is sufficiently abundant, this would be the expected default event type.

[Figure 2 labels: Annotation; Transcripts; RNA-Seq; Contigs; Poly(A); Support; Tail; Bridge; Link]

Fig. 2. Three types of support for detecting cleavage sites using RNA-seq data. The gene annotation (grey) indicates a single 3' UTR isoform, while the sample expresses two APA (red) variants. RNA-seq data capture the presence of these two alternatives with reads that end in poly(A) sequence (red). Contigs with supporting evidence have either a poly(A) tail, an overhanging read that forms a bridge to a poly(A) sequence, or a read that is linked through its pair to a poly(A) sequence.
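The tail test described in Section 2.1 can be illustrated with a small sketch. It assumes that a contig-to-genome alignment has already yielded two strings in transcript orientation: the unaligned (clipped) 3' end of the contig, and the reference bases immediately downstream of the last aligned base. The function names and all thresholds (`min_len`, `min_a`, `max_ref_a`) are hypothetical choices for illustration and are not taken from KLEAT.

```python
# Illustrative sketch only: names and thresholds are assumptions, not KLEAT's code.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, used to bring minus-strand
    sequence into transcript orientation."""
    return seq.translate(_COMPLEMENT)[::-1]


def a_fraction(seq: str) -> float:
    """Fraction of bases in seq that are 'A' (0.0 for an empty string)."""
    return seq.upper().count("A") / len(seq) if seq else 0.0


def is_tail_evidence(clipped: str, genome_downstream: str,
                     min_len: int = 4, min_a: float = 0.9,
                     max_ref_a: float = 0.8) -> bool:
    """True if the clipped 3' end looks like an untemplated poly(A) tail.

    clipped           : unaligned 3' end of the contig, transcript orientation
    genome_downstream : reference bases just past the last aligned base,
                        in the same orientation
    """
    if len(clipped) < min_len or a_fraction(clipped) < min_a:
        return False  # not a poly(A) stretch
    # If the genome is also A-rich here, the stretch is templated.
    ref = genome_downstream[:len(clipped)]
    return a_fraction(ref) < max_ref_a


# A clipped run of A's with no matching run in the genome is a tail.
print(is_tail_evidence("AAAAAAAAAA", "GTCCTGATCG"))
# The same run is rejected when the genome encodes it.
print(is_tail_evidence("AAAAAAAAAA", "AAAAAAAAAAGT"))
```

For a minus-strand transcript, a contig reported in genome orientation would begin with a poly(T) stretch; passing the clipped sequence through `reverse_complement` first brings it into the orientation this check expects.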
2.2. Bridge

Expressed alternative long and short 3' UTRs present alternative paths for contig extensions during de novo assembly. In such cases, if the graph indicates a branch that does not extend to a poly(A) sequence, the alternative branch with the poly(A) sequence is removed by an assembly quality assurance stage within ABySS, in an operation called trimming. While this is a desirable behaviour in general, it creates a particular challenge in assembling contigs with poly(A) tails. However, the information removed during this step can be recovered later by aligning reads to contigs, then assessing the sequences of partial read alignments at the contig edges. When an overhanging read alignment represents an untemplated poly(A) sequence, we infer the presence of a cleavage site. We call such cases bridge type evidence.

2.3. Link

The sequence complexity of 3' UTRs may drop substantially near their 3' ends, where the region is dominated by AU-rich sequence, and this may affect contig extensions due to loss of specificity of read-to-read overlaps. When this happens near a cleavage site, the corresponding contig may fail to present tail type evidence, and may terminate extension before a read with a poly(A) tail can bridge it to beyond the cleavage site. However, the 3' UTR end may be within a typical sequencing fragment length, and if we identify read pairs linking the end of a contig to an untemplated poly(A) sequence, we classify the corresponding contig as having link type evidence.

Some cleavage sites may have supporting evidence from a combination of these evidence types, and even multiple observations from the same evidence type. The latter is partly due to the fuzzy definition of a cleavage site, where the end of a 3' UTR may fluctuate by about ±30 nucleotides (nt) between mRNA molecules of the same transcript species.
Accordingly, we cluster cleavage sites predicted from multiple contigs if they fall within a certain window, label them as being representatives of the same cleavage site, and tally the counts presented by each evidence type in a given cluster to score the strength of our prediction.

We note that read-to-contig alignments performed in the pipeline have unique requirements. Although we demonstrate our results using BWA, an established general-purpose sequence alignment tool, we recognize that detecting cleavage sites is most effective when reads are aligned to contigs with a tool that can handle alignments with overhangs, that is, cases where a read aligns to the end of a contig with its sequence extending beyond the contig boundary. Because many high-throughput general-purpose sequence alignment tools are developed under the explicit or implicit assumption of a reference sequence that is composed of a small number of long contigs (i.e. chromosomes), they may suffer from accuracy and performance issues when the reference sequence is in many short pieces (as in an assembled transcriptome). Alignments near contig or scaffold edges are particularly challenging for general-purpose alignment software. We call this the edge effect, and address it with an FM-index based aligner from the ABySS genome assembly package, offered as an alternative. To provide local alignments, this aligner weighs edge alignments that are shorter but have fewer mismatches more favourably than longer alignments with more mismatches.
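The clustering and tallying step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Candidate` and `Cluster` record types are hypothetical, the 30 nt window is an illustrative default motivated by the ±30 nt fluctuation mentioned earlier (the text only says "a certain window"), and neighbouring sites are chained by single linkage, which may differ from what KLEAT actually does.

```python
# Illustrative sketch only: record types, window and linkage rule are assumptions.
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Candidate:
    chrom: str
    strand: str
    pos: int        # predicted cleavage coordinate
    evidence: str   # "tail", "bridge" or "link"
    reads: int = 1  # number of supporting reads


@dataclass
class Cluster:
    chrom: str
    strand: str
    start: int
    end: int
    support: Counter = field(default_factory=Counter)


def cluster_candidates(candidates, window=30):
    """Merge candidates on the same chromosome and strand whose positions
    lie within `window` nt of the previous one, tallying reads per type."""
    clusters = []
    ordered = sorted(candidates, key=lambda c: (c.chrom, c.strand, c.pos))
    for c in ordered:
        last = clusters[-1] if clusters else None
        if (last is not None and last.chrom == c.chrom
                and last.strand == c.strand and c.pos - last.end <= window):
            last.end = c.pos  # extend the current cluster
        else:
            last = Cluster(c.chrom, c.strand, c.pos, c.pos)
            clusters.append(last)
        last.support[c.evidence] += c.reads
    return clusters


calls = [
    Candidate("chr1", "+", 1000, "tail", 5),
    Candidate("chr1", "+", 1012, "bridge", 2),
    Candidate("chr1", "+", 1500, "link", 1),
    Candidate("chr1", "-", 1005, "tail", 3),
]
for cl in cluster_candidates(calls):
    print(cl.chrom, cl.strand, cl.start, cl.end, dict(cl.support))
```

The per-type counts in `support` can then be summed, or thresholded separately, when filtering calls by read support.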
KLEAT compares putative cleavage sites to annotation and EST databases to characterize and annotate them with other supporting observations, if any. Again using the annotation and EST databases, KLEAT groups, classifies and filters the putative events.

For method validation, we used the RNA-PET protocol as our gold standard. We quantified the concordance between the putative (KLEAT) and real (RNA-PET) cleavage sites (CS) using the following definitions:

false positive: A called CS not within a certain window of an RNA-PET cluster
true positive: An RNA-PET cluster with at least one called CS in a window
false negative: An RNA-PET cluster without a called CS in a window
true negative: Cannot be defined

One way to gauge the performance of a detection tool is to study its receiver operating characteristic (ROC) curve, where a stringency parameter is varied to plot the true positive rate, TPR (the ratio of the true positive count to the total number of events), versus the false positive rate, FPR (the ratio of the false positive count to the total number of negatives). Note that, because true negative is undefined in this context, FPR cannot be defined either. A common practice in such cases is to use the false discovery rate, FDR (defined as the ratio of the false positive count to the total number of calls), as a surrogate for the FPR.

3. Results and Discussion

For validating our method, we employed experimental data collected by the ENCODE project. The ENCODE consortium characterized transcript ends using the RNA-PET protocol, and generated RNA-seq data for some of the same samples. We considered three cell lines (H1-hESC, A549 and MCF-7) for which RNA-PET and RNA-seq data were available (Table 1). For RNA-PET coverage data, we applied an expression level threshold (default, 3 reads), and clustered observations that occurred within a certain distance (default, 100 nt) of each other. We used the resulting clusters as true events.
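The validation definitions above can be made concrete with a short sketch. It handles positions on a single chromosome and strand, treats the 3-read threshold as a minimum number of RNA-PET observations per cluster, and uses a 100 nt matching window; these are one reading of the stated defaults, and the function names are hypothetical.

```python
# Illustrative sketch only: one interpretation of the stated defaults.

def cluster_pet(positions, max_gap=100, min_reads=3):
    """Group RNA-PET end positions separated by at most max_gap nt and
    keep clusters supported by at least min_reads observations."""
    clusters, current = [], []
    for p in sorted(positions):
        if current and p - current[-1] > max_gap:
            if len(current) >= min_reads:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(p)
    if len(current) >= min_reads:
        clusters.append((current[0], current[-1]))
    return clusters


def near(call, cluster, window):
    """True if a called site lies within `window` nt of a cluster span."""
    return cluster[0] - window <= call <= cluster[1] + window


def evaluate(calls, clusters, window=100):
    """Count TP and FN over RNA-PET clusters and FP over calls,
    then derive TPR and FDR as defined in the text."""
    fp = sum(1 for c in calls
             if not any(near(c, cl, window) for cl in clusters))
    tp = sum(1 for cl in clusters
             if any(near(c, cl, window) for c in calls))
    fn = len(clusters) - tp
    tpr = tp / len(clusters) if clusters else 0.0
    fdr = fp / len(calls) if calls else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TPR": tpr, "FDR": fdr}


pet_reads = [100, 120, 150, 1000, 5000, 5050, 5090]
truth = cluster_pet(pet_reads)
print(truth)
print(evaluate([130, 2000], truth))
```

Note the asymmetry in the definitions: true and false negatives are counted over RNA-PET clusters, while false positives are counted over calls, so TP + FP need not equal the number of calls.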
We used Trans-ABySS v1.4.7 [20, 21] to assemble the RNA-seq reads; aligned the assembled transcripts to the human genome reference hg19 using GMAP; and aligned reads back to the assembled contigs using BWA-SW v0.6.2-r126. We processed the results to identify the tail, bridge and link evidence, and clustered the CS calls. We also used Trinity for RNA-seq assembly, and used the same pipeline to identify CS calls. Table 2 summarizes the performance of KLEAT on the assembled transcriptomes (with Trans-ABySS and Trinity contigs) for the three cell lines.

Table 1. Three ENCODE cell lines used for validation. All libraries were prepared using the long polyA+ RNA fraction protocol. RNA samples represent the whole cell transcriptome. RNA-PET reads were sequenced at 2x36 nt. H1-hESC cell line RNA-seq reads were sequenced at 2x78 nt, and A549 and MCF-7 cell lines at 2x76 nt. All data were generated and made publicly available by the ENCODE project.
Columns: Cell Line; # RNA-PET Read Pairs (million); # RNA-seq Read Pairs (million)
Rows: H1-hESC; A549; MCF-7
Table 2. Summary statistics on de novo assembly of the RNA-seq data and KLEAT calls. Assembly figures are for contigs longer than 500 nt. The total number of cleavage sites called by KLEAT, and the average number of alternative polyadenylation sites per gene, are shown in the last two columns. Two sub-columns for the number of cleavage sites and APA per gene represent total and true positive (TP) calls, at a support threshold of three reads.
Columns: Cell Line (H1-hESC, A549, MCF-7); Assembler; # Contigs; N50 (nt); Total Reconstruction (Mnt); # Cleavage Sites (All, TP); APA per gene (All, TP)
879,277 1, ,975 8, ,627 2, ,896 7, ,688 1, ,984 12, ,880 2, ,249 11, ,002,984 1, ,240 13, ,875 2, ,845 11,

Using contigs from either transcriptome assembly tool as input, we observed the number of cleavage site calls to be the lowest for H1-hESC, and highest for MCF-7. The number of APA sites per gene also follows this pattern, ranging from roughly 2.5 to 3.0 APA isoforms, on average. Interestingly, while the fraction of true positive cleavage site calls ranges from 75 to 93%, the average number of APA isoforms per gene is insensitive to filtering for true positives.

We compared the use of contigs from de novo transcriptome assembly to detect cleavage sites with the use of transcripts reconstructed by aligning the reads to a reference genome. We ran the Cufflinks pipeline v2.1.1 on the same dataset, and used the reconstructed 3' UTR ends of predicted transcripts to measure its accuracy in detecting poly(A) tails. Cufflinks takes RNA-seq read alignments to a reference genome as input, and builds those alignments into a parsimonious set of transcripts with or without annotation support. We ran the pipeline with annotation support, allowing for transcript discovery. Figure 3 depicts the ROC curves for these two paradigms for the three cell lines used. Curves closer to the top-left corner indicate better performance.
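An ROC-like curve of the kind described here can be traced by sweeping a stringency parameter and recording an (FDR, TPR) point at each setting. The sketch below assumes that the swept parameter is the read support of each call and that truth clusters are given as (start, end) spans on one chromosome and strand; both the data layout and the function names are hypothetical.

```python
# Illustrative sketch only: data layout and swept parameter are assumptions.

def hits(pos, cluster, window):
    """True if a called position falls within `window` nt of a truth span."""
    return cluster[0] - window <= pos <= cluster[1] + window


def roc_like_points(calls, truth, thresholds, window=100):
    """For each support threshold, return (threshold, FDR, TPR).

    calls : list of (position, read_support) tuples
    truth : list of (start, end) gold-standard clusters
    """
    points = []
    for t in thresholds:
        kept = [pos for pos, support in calls if support >= t]
        tp = sum(1 for cl in truth
                 if any(hits(p, cl, window) for p in kept))
        fp = sum(1 for p in kept
                 if not any(hits(p, cl, window) for cl in truth))
        tpr = tp / len(truth) if truth else 0.0
        fdr = fp / len(kept) if kept else 0.0
        points.append((t, fdr, tpr))
    return points


calls = [(130, 5), (2000, 1), (5040, 2), (9000, 3)]
truth = [(100, 150), (5000, 5090), (7000, 7010)]
for t, fdr, tpr in roc_like_points(calls, truth, [1, 2, 3, 5]):
    print(f"support>={t}: FDR={fdr:.2f} TPR={tpr:.2f}")
```

Plotting TPR against FDR for the returned points gives a curve that can be read like an ROC curve, with the caveat that FDR, unlike FPR, is not guaranteed to change monotonically with the threshold.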
These results suggest that to identify 3' UTR cleavage sites an assembly-first approach (using either Trans-ABySS or Trinity) may be preferred over the alignment-first approach implemented in the Cufflinks pipeline. We note that the Trans-ABySS and Trinity results are similar, while the Trans-ABySS assemblies perform marginally yet consistently better. This may be due to the difference between the total reconstruction figures of the two tools (Table 2).

Although the magnitudes of the reported TPR figures are low (<10%), we note that this reflects our simple but relaxed definition of the ground truth, which makes no distinction between 3' and 5' UTR ends and applies none of the filtering suggested in the ENCODE report. These choices would inflate the denominator of TPR, and lead to underestimates in the reported figures. However, neither of these would change the relative performance of the analysis tools we present.

We also compared the concordance between these three sets of results (Figure 4). Our analysis indicates that Trans-ABySS and Trinity contigs are largely concordant in their reconstruction of cleavage sites, identifying roughly 9,000 to 13,000 and 7,000 to 11,000 true positive calls, respectively (for an RNA-PET threshold of 3 reads), and agreeing on about 80 to 90% of the calls.
In contrast, Cufflinks would identify 7,000 to 10,000 true positive calls with the same RNA-PET threshold, agreeing with the assembly-first results only about 25 to 30% of the time, meaning that it identifies a smaller yet largely distinct set of events.

[Figure 3 panels: H1-hESC; A549; MCF-7]

Fig. 3. Performance of KLEAT on three ENCODE cell lines. Curves represent the true positive rate (TPR) as a function of the false discovery rate (FDR), for KLEAT with Trans-ABySS (blue), KLEAT with Trinity (green) and Cufflinks (red). RNA-PET evidence is considered as the gold standard at two support cutoffs: left column, 3 or more RNA-PET reads; right column, 10 or more RNA-PET reads.
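The text does not state how the agreement percentages between methods were computed. One plausible way to measure such concordance, sketched below, is the fraction of true positive calls from one method that have a call from another method within a fixed window. The 100 nt window, the single-chromosome simplification and the function name are all assumptions made for illustration.

```python
# Illustrative sketch only: the concordance measure itself is an assumption.
from bisect import bisect_left


def fraction_shared(calls_a, calls_b, window=100):
    """Fraction of positions in calls_a that have at least one position
    in calls_b within `window` nt (same chromosome and strand assumed)."""
    if not calls_a:
        return 0.0
    b_sorted = sorted(calls_b)

    def has_match(pos):
        # Index of the first position in calls_b that is >= pos - window.
        i = bisect_left(b_sorted, pos - window)
        return i < len(b_sorted) and b_sorted[i] <= pos + window

    return sum(1 for p in calls_a if has_match(p)) / len(calls_a)


assembly_calls = [100, 500, 900, 2000]
alignment_calls = [120, 880, 5000]
print(fraction_shared(assembly_calls, alignment_calls))
print(fraction_shared(alignment_calls, assembly_calls))
```

The measure is not symmetric: a smaller call set can be largely contained in a larger one while the reverse fraction stays low, so both directions are worth reporting.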
[Figure 4 panels: H1-hESC; A549; MCF-7, each at RNA-PET support thresholds of 3 and 10 reads]

Fig. 4. Concordance between three methods. Blue, green and red sets indicate events detected by KLEAT/Trans-ABySS, KLEAT/Trinity and Cufflinks, respectively, that are supported by at least 3 or 10 RNA-PET reads, as indicated.

We note that in this dataset the Cufflinks pipeline is more sensitive for detecting weakly expressed transcripts. This observation is supported by the performance metrics of the three tools when we increased the RNA-PET threshold from three to 10 reads. At the increased threshold, KLEAT with Trans-ABySS and Trinity loses about 6 to 10% of its true positive calls, while Cufflinks loses about 10 to 33% of such calls. Further supporting this observation, a coverage histogram of the expressed genes in the A549 cell line, as detected by Cufflinks, is depicted in Figure 5. Here the two x-axes represent expression levels in units of average coverage and FPKM on logarithmic scales. Cufflinks reports a major peak for transcripts represented at 1- to 10-fold average coverage, also reconstructing a
large number of transcripts with less than 1-fold coverage, some of which would report cleavage sites also observed by RNA-PET data. In contrast, de novo assembly methods would typically reconstruct transcripts over 10-fold coverage. This apparent difference in target expression levels for transcript reconstruction explains the lack of concordance between KLEAT (in conjunction with Trans-ABySS or Trinity) and Cufflinks, as reported in Figures 3 and 4.

Fig. 5. Gene level coverage histogram for the A549 cell line RNA-seq data, as reported by Cufflinks. The histogram is presented with two logarithmic scales on the x-axis, average coverage (Cov) and FPKM, to show the correspondence between the two units for the sequencing depth of the experimental data.

4. Conclusions

In this study, we introduced KLEAT as an analysis tool for detecting 3' UTR cleavage sites using de novo assembled RNA-seq reads. We validated our method using data from the ENCODE project. We measured the accuracy of KLEAT using two transcriptome assembly tools (Trans-ABySS [20, 21] and Trinity), and compared its performance to results from an alignment-based analysis tool (Cufflinks, the method of choice in the ENCODE study).

Our results demonstrate that one can reliably detect around 10,000 poly(A) tails per sample using RNA-seq data, at a sequencing depth of 70 million read pairs. The depth of sequencing data will certainly affect the number of transcripts observed, hence the number of poly(A) tails detected. Therefore, although we suggest that detecting on the order of 10,000 features will already provide important biological insights for highly expressed transcripts, if one wants to observe more features, one approach might be to sequence a library to greater depth (albeit with diminishing returns).
With sequencing throughput on the Illumina platform pushing beyond 250 million read pairs per lane, experimental design (such as pooling multiple samples per lane) reflects a balance between cost and value, and that balance is determined by the particular experimental goals and the budget of a study.

We also note that overlapping sense/anti-sense gene annotations can potentially confuse the poly(A) tail calls. Using a strand-specific RNA-seq protocol should help mitigate this issue.

Surveying 15 human cell lines, the ENCODE study reports a total of 128,824 poly(A) sites mapping within annotated Gencode transcripts. This observation puts the average number of polyadenylation sites in this dataset at 2.5 per gene. Interestingly, before this landmark publication, the APA multiplicity was estimated to be around 1.1. Our analysis of RNA-seq data from three ENCODE cell lines (APA per gene statistics in Table 2) is in agreement with this ENCODE estimate.

There is a growing appreciation of 3' UTRs, their molecular assembly, mechanistic roles and variants [9-15]. Many of these studies developed novel wet lab techniques to build specialized sequencing libraries, and applied them to interrogate a particular biological condition.
KLEAT offers an alternative analysis method to characterize 3' UTRs and APA from RNA-seq data at nucleotide-scale resolution. We anticipate the tool to be an enabling technology for many applications, including large-scale disease studies and clinical genomics, and to provide added value to the large volume of sequencing data already generated using this data type. KLEAT is available at and is offered free for academic use.

Acknowledgments

The authors thank the Canadian Institutes of Health Research, Genome Canada, Genome British Columbia and British Columbia Cancer Foundation for their generous support. The work is also partially funded by the National Institutes of Health under Award Number R01HG The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of any of our funding agencies. We also thank the ENCODE project for enabling the validation work presented by making their datasets publicly available.

References

[1] J. D. Keene, RNA regulons: coordination of post-transcriptional events, Nature reviews. Genetics, vol. 8, no. 7, pp , Jul,
[2] K. K. To, Z. Zhan, T. Litman, and S. E. Bates, Regulation of ABCG2 expression at the 3' untranslated region of its mRNA through modulation of transcript stability and protein translation by a putative microRNA in the S1 colon cancer cell line, Molecular and cellular biology, vol. 28, no. 17, pp , Sep,
[3] Z. Ji, J. Y. Lee, Z. Pan, B. Jiang, and B. Tian, Progressive lengthening of 3' untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development, Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 17, pp , Apr 28,
[4] S. Muller, L. Rycak, F. Afonso-Grunz, P. Winter, A. M. Zawada, E. Damrath, J. Scheider, J. Schmah, I. Koch, G. Kahl, and B. Rotter, APADB: a database for alternative polyadenylation and microRNA regulation events, Database (Oxford), vol. 2014,
[5] B. Tian, J. Hu, H. Zhang, and C. S.
Lutz, A large-scale analysis of mRNA polyadenylation of human and mouse genes, Nucleic Acids Res, vol. 33, no. 1, pp ,
[6] P. Singh, T. L. Alley, S. M. Wright, S. Kamdar, W. Schott, R. Y. Wilpan, K. D. Mills, and J. H. Graber, Global changes in processing of mRNA 3' untranslated regions characterize clinically distinct cancer subtypes, Cancer Res, vol. 69, no. 24, pp , Dec,
[7] Y. Hamaya, S. Kuriyama, T. Takai, K. Yoshida, T. Yamada, M. Sugimoto, S. Osawa, K. Sugimoto, H. Miyajima, and S. Kanaoka, A distinct expression pattern of the long 3'-untranslated region dicer mRNA and its implications for posttranscriptional regulation in colorectal cancer, Clin Transl Gastroenterol, vol. 3, pp. e17,
[8] E. Devany, X. Zhang, J. Y. Park, B. Tian, and F. E. Kleiman, Positive and negative feedback loops in the p53 and mRNA 3' processing pathways, Proc Natl Acad Sci U S A, vol. 110, no. 9, pp , Feb,
[9] Y. Fu, Y. Sun, Y. Li, J. Li, X. Rao, C. Chen, and A. Xu, Differential genome-wide profiling of tandem 3' UTRs among human breast cancer and normal cells by high-throughput sequencing, Genome Res, vol. 21, no. 5, pp , May, 2011.
[10] P. Ng, C. L. Wei, W. K. Sung, K. P. Chiu, L. Lipovich, C. C. Ang, S. Gupta, A. Shahab, A. Ridwan, C. H. Wong, E. T. Liu, and Y. Ruan, Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation, Nat Methods, vol. 2, no. 2, pp , Feb,
[11] Y. Ruan, H. S. Ooi, S. W. Choo, K. P. Chiu, X. D. Zhao, K. G. Srinivasan, F. Yao, C. Y. Choo, J. Liu, P. Ariyaratne, W. G. Bin, V. A. Kuznetsov, A. Shahab, W. K. Sung, G. Bourque, N. Palanisamy, and C. L. Wei, Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End ditags (PETs), Genome Res, vol. 17, no. 6, pp , Jun,
[12] P. J. Shepard, E. A. Choi, J. Lu, L. A. Flanagan, K. J. Hertel, and Y. Shi, Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq, RNA, vol. 17, no. 4, pp , Apr,
[13] X. Ruan, and Y. Ruan, Genome wide full-length transcript analysis using 5' and 3' paired-end-tag next generation sequencing (RNA-PET), Methods Mol Biol, vol. 809, pp ,
[14] D. Hafez, T. Ni, S. Mukherjee, J. Zhu, and U. Ohler, Genome-wide identification and predictive modeling of tissue-specific alternative polyadenylation, Bioinformatics, vol. 29, no. 13, pp. i108-16, Jul,
[15] R. Minasaki, D. Rudel, and C. R. Eckmann, Increased sensitivity and accuracy of a single-stranded DNA splint-mediated ligation assay (sPAT) reveals poly(A) tail length dynamics of developmentally regulated mRNAs, RNA Biol, vol. 11, no. 2, Feb,
[16] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods, vol. 5, no. 7, pp , Jul,
[17] ENCODE Project Consortium, The ENCODE (ENCyclopedia Of DNA Elements) Project, Science, vol. 306, no. 5696, pp , Oct,
[18] S. Djebali, C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer, J. Lagarde, W. Lin, F. Schlesinger, C. Xue, G. K. Marinov, J. Khatun, B. A. Williams, C. Zaleski, J. Rozowsky, M. Roder, F. Kokocinski, R. F. Abdelhamid, T. Alioto, I. Antoshechkin, M. T. Baer, N. S. Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, X. Chen, J. Chrast, J. Curado, T. Derrien, J. Drenkow, E. Dumais, J. Dumais, R. Duttagupta, E. Falconnet, M. Fastuca, K. Fejes-Toth, P. Ferreira, S. Foissac, M. J. Fullwood, H. Gao, D. Gonzalez, A. Gordon, H. Gunawardena, C. Howald, S. Jha, R. Johnson, P. Kapranov, B. King, C. Kingswood, O. J. Luo, E. Park, K. Persaud, J. B. Preall, P. Ribeca, B. Risk, D. Robyr, M. Sammeth, L. Schaffer, L. H. See, A. Shahab, J. Skancke, A. M. Suzuki, H. Takahashi, H. Tilgner, D. Trout, N. Walters, H. Wang, J. Wrobel, Y. Yu, X. Ruan, Y. Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. E. Antonarakis, G. Hannon, M. C. Giddings, Y. Ruan, B. Wold, P. Carninci, R. Guigo, and T. R. Gingeras, Landscape of transcription in human cells, Nature, vol. 489, no. 7414, pp , Sep 6,
[19] W. Wang, Z. Wei, and H. Li, A change-point model for identifying 3'UTR switching by next-generation RNA sequencing, Bioinformatics, vol. 30, no. 15, pp , Aug 1,
[20] I. Birol, S. D. Jackman, C. B. Nielsen, J. Q. Qian, R. Varhol, G. Stazyk, R. D. Morin, Y. Zhao, M. Hirst, J. E. Schein, D. E. Horsman, J. M. Connors, R. D. Gascoyne, M. A. Marra, and S. J. M. Jones, De novo transcriptome assembly with ABySS, Bioinformatics, vol. 25, no. 21, pp , Nov 1,
[21] G. Robertson, J. Schein, R. Chiu, R. Corbett, M. Field, S. D. Jackman, K. Mungall, S. Lee, H. M. Okada, J. Q. Qian, M. Griffith, A. Raymond, N. Thiessen, T. Cezard, Y. S. Butterfield, R. Newsome, S. K. Chan, R. She, R. Varhol, B. Kamoh, A.-L. Prabhu, A. Tam, Y. Zhao, R. A. Moore, M. Hirst, M. A. Marra, S. J. M. Jones, P. A. Hoodless, and I. Birol, De novo assembly and analysis of RNA-seq data, Nature Methods, vol. 7, no. 11, pp. 909-U62, Nov 2010,
[22] M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman, and A. Regev, Full-length transcriptome assembly from RNA-Seq data without a reference genome, Nat Biotechnol, vol. 29, no. 7, pp , Jul,
[23] M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney, Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels, Bioinformatics, vol. 28, no. 8, pp , Apr 15,
[24] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol, ABySS: A parallel assembler for short read sequence data, Genome Research, vol. 19, no. 6, pp , Jun,
[25] T. Zhang, V. Kruys, G. Huez, and C. Gueydan, AU-rich element-mediated translational control: complexity and multiple activities of trans-activating factors, Biochem Soc Trans, vol. 30, no. Pt 6, pp , Nov,
[26] E. Beaudoing, S. Freier, J. R. Wyatt, J. M. Claverie, and D. Gautheret, Patterns of variant polyadenylation signal usage in human genes, Genome Res, vol. 10, no. 7, pp , Jul,
[27] H. Li, and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics, vol. 25, no. 14, pp , Jul 15,
[28] T. D. Wu, and C. K. Watanabe, GMAP: a genomic mapping and alignment program for mRNA and EST sequences, Bioinformatics, vol. 21, no. 9, pp , May 1,
[29] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nat Biotechnol, vol. 28, no. 5, pp , May,
[30] J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan, G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Derrien, J. Chrast, N. Walters, S. Balasubramanian, B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren, M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigó, and T. J. Hubbard, GENCODE: the reference human genome annotation for The ENCODE Project, Genome Res, vol. 22, no. 9, pp , Sep, 2012.
Supplementary Data for DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Wenxiu Ma 1, Lin Yang 2, Remo Rohs 2, and William Stafford Noble 3 1 Department of Statistics,
QIAGEN s NGS Solutions for Biomarkers NGS & Bioinformatics team QIAGEN (Suzhou) Translational Medicine Co.,Ltd 1 Our current NGS & Bioinformatics Platform 2 Our NGS workflow and applications 3 QIAGEN s
Full-length single-cell RNA-seq applied to a viral human cancer: Applications to HPV expression and splicing analysis in HeLa S3 cells Liang Wu 1, Xiaolong Zhang 1,2, Zhikun Zhao 1,3,4, Ling Wang 5, Bo
Tjaden Genome Biology (2015) 16:1 DOI 10.1186/s13059-014-0572-2 SOFTWARE De novo assembly of bacterial transcriptomes from RNA-seq data Brian Tjaden Open Access Abstract Transcriptome assays are increasingly
Promoter definition by mass genome annotation data: in silico primer extension EMBNET course Bioinformatics of transcriptional regulation Jan 28 2008 Christoph Schmid Regulation of eukaryotic transcription:
Themes: RNA is very versatile! RNA and RNA Processing Chapter 14 RNA-RNA interactions are very important! Prokaryotes and Eukaryotes have many important differences. Messenger RNA (mrna) Carries genetic
Introduction to the UCSC genome browser Dominik Beck NHMRC Peter Doherty and CINSW ECR Fellow, Senior Lecturer Lowy Cancer Research Centre, UNSW and Centre for Health Technology, UTS SYDNEY NSW AUSTRALIA
Augmenting transcriptome assembly combinatorially Prachi Jain 1+, Neeraja M Krishnan 1+ and Binay Panda 1,2 * 1 Ganit Labs, Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology, Bangalore,
OPEN Assessment of transcript reconstruction methods for RNA-seq Tamara Steijger 1, Josep F Abril 2,11, Pär G Engström 1,1,11, Felix Kokocinski 3,11, The RGASP Consortium 4, Tim J Hubbard 3, Roderic Guigó
Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions Joshua N. Burton 1, Andrew Adey 1, Rupali P. Patwardhan 1, Ruolan Qiu 1, Jacob O. Kitzman 1, Jay Shendure 1 1 Department
SCIENCE CHINA Life Sciences SPECIAL TOPIC February 2013 Vol.56 No.2: 143 155 RESEARCH PAPER doi: 10.1007/s11427-013-4442-z Comparative study of de novo assembly and genome-guided assembly strategies for
A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium SEQC/MAQC-III Consortium* 214 Nature America, Inc. All rights reserved.
MICROARRAYS+SEQUENCING The most efficient way to advance genomics research Down to a Science. www.affymetrix.com/downtoascience Affymetrix GeneChip Expression Technology Complementing your Next-Generation
DNA Transcription By: Suzanne Clancy, Ph.D. 2008 Nature Education Citation: Clancy, S. (2008) DNA transcription. Nature Education 1(1) If DNA is a book, then how is it read? Learn more about the DNA transcription
Gene Expression: Transcription The majority of genes are expressed as the proteins they encode. The process occurs in two steps: Transcription = DNA RNA Translation = RNA protein Taken together, they make
L3: Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Autumn School in Bioinformatics Cambridge, September 2017 Where to get help! http://seqanswers.com http://www.biostars.org http://www.bioconductor.org/help/mailing-list
Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang Ruth Howe Bio 434W April 1, 2010 INTRODUCTION De novo annotation is the process by which a finished genomic sequence is searched for
Outline Overview of the GEP annotation projects Annotation of Drosophila Primer January 2018 GEP annotation workflow Practice applying the GEP annotation strategy Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT
Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2017 Anthony Gitter firstname.lastname@example.org These slides, excluding third-party material, are licensed under CC BY-NC
CHAPTER 1 DNA Microarray Technology All living organisms are composed of cells. As a functional unit, each cell can make copies of itself, and this process depends on a proper replication of the genetic
Transcription in Eukaryotes Biology I Hayder A Giha Transcription Transcription is a DNA-directed synthesis of RNA, which is the first step in gene expression. Gene expression, is transformation of the
Introduction to RNA-Seq Monica Britton, Ph.D. Sr. Bioinformatics Analyst March 2015 Workshop Overview of RNA-Seq Activities RNA-Seq Concepts, Terminology, and Work Flows Using Single-End Reads and a Reference
Next-Generation Next-Generation Sequencing Technologies Sequencing Technologies Nicholas E. Navin, Ph.D. MD Anderson Cancer Center Dept. Genetics Dept. Bioinformatics Introduction to Bioinformatics GS011062
Welcome to the NGS webinar series Webinar 1 NGS: Introduction to technology, and applications NGS Technology Webinar 2 Targeted NGS for Cancer Research NGS in cancer Webinar 3 NGS: Data analysis for genetic
SUPPLEMENL MERILS Eh-seq: RISPR epitope tagging hip-seq of DN-binding proteins Daniel Savic, E. hristopher Partridge, Kimberly M. Newberry, Sophia. Smith, Sarah K. Meadows, rian S. Roberts, Mark Mackiewicz,
American Journal of Life Sciences 2016; 4(6): 146-151 http://www.sciencepublishinggroup.com/j/ajls doi: 10.11648/j.ajls.20160406.12 ISSN: 2328-5702 (Print); ISSN: 2328-5737 (Online) Gap Filling for a Human
TECHNICAL NOTE Base Composition of Sequencing Reads of Chromium Single Cell 3 v2 Libraries INTRODUCTION The Chromium Single Cell 3 v2 Protocol (CG00052) produces Single Cell 3 libraries, ready for Illumina
Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project
Microarray Gene Expression Analysis at CNIO Orlando Domínguez Genomics Unit Biotechnology Program, CNIO 8 May 2013 Workflow, from samples to Gene Expression data Experimental design user/gu/ubio Samples
Assessing De-Novo Transcriptome Assemblies Shawn T. O Neil Center for Genome Research and Biocomputing Oregon State University Scott J. Emrich University of Notre Dame 100K Contigs, Perfect 1M Contigs,
MCB Accepted Manuscript Posted Online 28 December 2015 Mol. Cell. Biol. doi:10.1128/mcb.00955-15 Copyright 2015, American Society for Microbiology. All Rights Reserved. 1 2 3 4 Chromatin and RNA Maps Reveal
SUPPLEMENTARY INFORMATION Letters https://doi.org/10.1038/s41588-018-0062-7 In the format provided by the authors and unedited. The human noncoding genome defined by genetic diversity Julia di Iulio 1,5,
Single Cell RNA-seq Data Analysis Ahmed Mahfouz BioSB NGS course 08 November 2017 2 Cell types in the brain Tasic et al. (Nat. Neuroscience 2016) 3 Cellular heterogeneity in the brain Hodges et al. (Human
Measuring transcriptomes with RNA-Seq BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter email@example.com Overview RNA-Seq technology The RNA-Seq quantification problem Generative
How to view Results with Scaffold Proteomics Shared Resource Starting out Download Scaffold from http://www.proteomes oftware.com/proteom e_software_prod_sca ffold_download.html Follow installation instructions
Course Index Introduction Agenda Analysis pipeline Some considerations Introduction Who we are Teachers: Marta Bleda: Computational Biologist and Data Analyst at Department of Medicine, Addenbrooke's Hospital
Next Generation Sequencing: An Overview Cavan Reilly November 13, 2017 Table of contents Next generation sequencing NGS and microarrays Study design Quality assessment Burrows Wheeler transform Next generation
Supplementary Figure 1 Origin use and efficiency are similar among WT, rrm3, pif1-m2, and pif1-m2; rrm3 strains. A. Analysis of fork progression around confirmed and likely origins (from cerevisiae.oridb.org).
Computational Biology I LSM5191 (2003/4) Aylwin Ng, D.Phil Lecture Notes: Transcriptome: Molecular Biology of Gene Expression I Flow of information: DNA to polypeptide DNA Start Exon1 Intron Exon2 Termination
Introductory Next Gen Workshop http://www.illumina.ucr.edu/ http://www.genomics.ucr.edu/ Workshop Objectives Workshop aimed at those who are new to Illumina sequencing and will provide: - a basic overview
Analysing genomes and transcriptomes using Illumina uencing Dr. Heinz Himmelbauer Centre for Genomic Regulation (CRG) Ultrauencing Unit Barcelona The Sequencing Revolution High-Throughput Sequencing 2000
Davidson and Oshlack Genome Biology 2014, 15:410 METHOD Open Access : enabling differential gene expression analysis for de novo assembled transcriptomes Nadia M Davidson 1 and Alicia Oshlack 1,2* Abstract
Soneson and Delorenzi BMC Bioinformatics 213, 14:91 RESEARCH ARTICLE A comparison of methods for differential expression analysis of RNA-seq data Charlotte Soneson 1* and Mauro Delorenzi 1,2 Open Access
High-throughput scale. Desktop simplicity. NextSeq 500 System. Flexible power. Speed and simplicity for whole-genome, exome, and transcriptome sequencing. Harness the power of next-generation sequencing.
Encyclopedic Reference of Genomics and Proteomics in Molecular Medicine The Two-Hybrid System Carolina Vollert & Peter Uetz Institut für Genetik Forschungszentrum Karlsruhe PO Box 3640 D-76021 Karlsruhe
Bioinformatics Advance Access published October 8, 2012 COPE: An accurate k-mer based pair-end reads connection tool to facilitate genome assembly Binghang Liu 1,2,, Jianying Yuan 2,, Siu-Ming Yiu 1,3,
BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 21 2009, pages 2872 2877 doi:10.1093/bioinformatics/btp367 Sequence analysis De novo transcriptome assembly with ABySS Inanç Birol 1,, Shaun D. Jackman 1, Cydney
Bioinformatics Advice on Experimental Design Where do I start? Please refer to the following guide to better plan your experiments for good statistical analysis, best suited for your research needs. Statistics
Integrative Genomics Viewer James T. Robinson 1, Helga Thorvaldsdóttir 1, Wendy Winckler 1, Mitchell Guttman 1,2, Eric S. Lander 1,2,3, Gad Getz 1, and Jill P. Mesirov 1 1 Broad Institute of Massachusetts
Gene Expression Analysis Superior Solutions for any Project Find Your Perfect Match ArrayXS Global Array-to-Go Focussed Comprehensive: detect the whole transcriptome reliably Certified: discover exceptional