De novo assembly and analysis of RNA-seq data


Nature Methods

De novo assembly and analysis of RNA-seq data

Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron Butterfield, Richard Newsome, Simon K Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard Moore, Martin Hirst, Marco A Marra, Steven J M Jones, Pamela A Hoodless & Inanc Birol

Supplementary Figure 1: Schematic of ABySS assembly steps
Supplementary Figure 2: Assembly properties for k values of 26 to 50
Supplementary Figure 3: Partitioning of MAQ-aligned reads relative to Ensembl transcript models
Supplementary Figure 4: Fraction of Ensembl transcripts with at least 80% of exon length covered by unmerged contig alignments, as a function of normalized WTSS coverage threshold
Supplementary Figure 5: Assembled contigs across multiple k-values are merged to obtain a nonredundant set of contigs for analysis
Supplementary Figure 6: Coverage of Ensembl v54 transcripts by contig alignments, as a function of mean read transcript coverage
Supplementary Figure 7: Splice site support for 149,877 Trans-ABySS parent (v1.1.1) contig alignments, considering GT-AG, GC-AG and AT-AC donor-acceptor types
Supplementary Figure 8: Schematic of comparing a transcript model (top) with contig alignments to identify annotated and novel transcripts and transcript structures
Supplementary Figure 9: RT-PCR validation of Insr's 36-bp novel exon
Supplementary Figure 10: Novel UTR candidate for Nlrp6
Supplementary Figure 11: Novel transcript candidate
Supplementary Figure 12: Shank2's contig alignment supports both an RT-PCR-validated 21-bp skipped exon, and a novel, H3K4me3-supported TSS that is upstream of the 5'-most Ensembl TSS
Supplementary Figure 13: Sfrs3: assembly can extend contigs through exons that have low to-genome aligned-read densities
Supplementary Figure 14: Empirical distribution functions for mean normalized read coverage, C, for ENSMUSTs and transcripts with novel retained introns
Supplementary Figure 15: Coverage metrics for known and novel retained introns
Supplementary Figure 16: Schematic of identifying novel short and long 3' UTRs with EJ- and PAM-reads
Supplementary Figure 17: a) PAM-reads identify a novel polyadenylation site in the 3' UTR of Dmgdh b) PAM-reads and contig alignments identify a novel long 3' UTR for Sult3a1
Supplementary Figure 18: Schematic of detecting a fusion gene with a contig alignment
Supplementary Figure 19: Comparisons of gene-level expression metrics for Trans-ABySS and ALEXA-Seq
Supplementary Figure 20: Overview of the transcriptome assembly and analysis pipeline workflow
Supplementary Figure 21: Length-normalized profiles of Burrows-Wheeler Aligner read alignment densities
Supplementary Table 1: Summary of read-to-genome alignments
Supplementary Table 2: Run times for Trans-ABySS, Tophat, Cufflinks and Scripture
Supplementary Table 3: Summary of candidate transcript events that were identified as novel relative to all UCSC, RefSeq, Ensembl and AceView transcript models
Supplementary Table 4: Summary results for identifying annotated and novel polyadenylation sites
Supplementary Note 1:
De novo transcriptome assembly
Issues for de novo and reference-based transcriptome assembly
Comparing de novo and reference-based assembly
Detecting novel polyadenylation sites
Identifying fusion genes
Quantifying gene-level expression
Validating novel transcripts and transcript events
WTSS aligned-read pipeline
Generating splice graph visualizations

Supplementary Figures

Supplementary Figure 1. Schematic of ABySS assembly steps illustrating the origin of main, junction, and bubble contigs, and the manner in which the contig alignments are used for analysis. a) Bubble contig branch pairs (green) typically capture heterozygous SNVs. For each bubble, ABySS writes the higher coverage branch (mid green) into the single end (SE) contig set, and writes the branch pair into the global set of bubble contigs. b) SE contigs are constructed from unambiguous (k-1)-bp overlaps between k-mers. c) Mate pairs identify overlapping contig neighbors, and alternate contig-joining paths may be identified. The shorter, pale blue contig represents a candidate junction contig. Because such a contig typically corresponds to two (k-1) overlaps, it is expected to be (2k-2) bp long in an assembly generated for a k-mer length of k bp. For a given assembly (and so k value), contigs that are at least (2k-2) bp long are expected to be the most informative of transcript structure. Depending on assembly parameters and the strength of supporting mate pair information, one of the two alternate contigs may be joined to the flanking contigs to construct a longer PE contig; however, it is also possible that neither or both alternative paths will be constructed. d) The path containing the longer alternate contig is constructed, with the shorter contig retained as a junction contig. e) Example of possible outcomes for alignments of main (dark blue), junction (light blue), and bubble pair (light and mid-green) contigs to the reference genome. Their alignments are compared with those of two transcript isoforms (gray). The alignment blocks of the main contig support the lower isoform, while the junction contig alignment supports the presence of the upper alternative isoform. The alignment of the bubble contig pair identifies a heterozygous SNV.

Supplementary Figure 2. Assembly properties for k values of 26 to 50. a) Curves show N50 length (the contig length such that contigs of this length or longer contain 50% of the bases of the assembly), the total number of contigs, and the number of contigs longer than 100 bp.
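The N50 definition and the expected (2k-2)-bp junction contig length used in the captions above can be illustrated with a minimal Python sketch. This is not code from ABySS or Trans-ABySS; the function names `n50` and `junction_contig_length` are hypothetical and chosen only for this illustration.

```python
def n50(contig_lengths):
    """Return the N50 length: the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def junction_contig_length(k):
    """Expected length (bp) of a junction contig formed by two (k-1)-bp
    overlaps, for an assembly built with k-mer length k."""
    return 2 * k - 2


if __name__ == "__main__":
    print(n50([100, 80, 60, 40, 20]))
    print([junction_contig_length(k) for k in (26, 50)])
```

For the k range used here (26 to 50), the expected junction contig length therefore ranges from 50 to 98 bp.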

Supplementary Figure 3. Partitioning of BWA-aligned reads relative to Ensembl v54 transcript models.

Supplementary Figure 4. Fraction of Ensembl transcripts with at least 80% of exon length covered by unmerged contig alignments, as a function of normalized WTSS (Supplementary Note) coverage threshold. Results are shown for the 34,400 Ensembl v54 transcripts (corresponding to 19,508 unique gene IDs) that had a nonzero length-normalized WTSS mean coverage. Curves show results for the single longest contig (blue) and for all contigs (green). For single contigs, 64% and 72% of nonzero-coverage transcripts were covered to at least 80% of the exon length for WTSS coverage thresholds of 10 and 20; considering all contig alignments, the percentages were 88% and 92%.
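The "at least 80% of exon length covered" criterion can be sketched as an interval-overlap calculation. The Python below is an illustration under stated assumptions, not the pipeline's implementation: exons are assumed to be non-overlapping, coordinates are assumed to be half-open (start, end) pairs on one chromosome, and the name `covered_fraction` is hypothetical.

```python
def covered_fraction(exons, blocks):
    """Fraction of total exon length overlapped by the union of contig
    alignment blocks. Exons are assumed not to overlap one another."""
    # Merge overlapping or touching alignment blocks so that bases covered
    # by several contigs are counted only once.
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    total = sum(end - start for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for b_start, b_end in merged:
            covered += max(0, min(e_end, b_end) - max(e_start, b_start))
    return covered / total if total else 0.0


if __name__ == "__main__":
    exons = [(0, 100), (200, 300)]
    blocks = [(0, 100), (250, 300), (240, 260)]
    fraction = covered_fraction(exons, blocks)
    print(fraction, fraction >= 0.8)
```

Passing only the blocks of the single longest contig, rather than all blocks, would correspond to the "single longest contig" curve.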

Supplementary Figure 5. Assembled contigs across multiple k-values are merged to obtain a non-redundant set of contigs for analysis. a) The contig merging process is shown schematically for eight hypothetical assemblies (k1, k2, ..., k8). Contig sets from pairs of assemblies with adjacent k values are reciprocally compared. Those contigs having an exact match to a longer contig from the paired assembly are buried. Where contigs are equivalent, the contig from the assembly with the lower k is retained. From the output of this stage, adjacent pairs of contig sets are again merged (e.g. k12 and k34). Merging continues until only one contig set remains. Retained contigs that contain buried contigs are identified as parent contigs. Contigs that are neither buried nor parent are untouched. The merging process is applied to both the main and extended junction contigs. See Fig. 1b.
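The pairwise, hierarchical burying scheme described in this caption can be sketched as follows. This is a simplified illustration, not the Trans-ABySS merging code: it assumes contigs are plain strings, uses exact substring tests in place of contig-to-contig alignment, ignores reverse complements, and the names `merge_pair` and `merge_all` are hypothetical.

```python
def merge_pair(low_k, high_k):
    """Merge two contig sets, where low_k comes from the smaller k.

    A contig is buried when it is an exact substring of a longer contig in
    the other set. When two contigs are identical, the copy from the
    lower-k set is kept.
    """
    kept = []
    for contig in low_k:
        buried = any(len(o) > len(contig) and contig in o for o in high_k)
        if not buried:
            kept.append(contig)
    for contig in high_k:
        # ">=" also buries a high-k contig identical to a low-k contig.
        buried = any(len(o) >= len(contig) and contig in o for o in low_k)
        if not buried:
            kept.append(contig)
    return kept


def merge_all(assemblies):
    """Repeatedly merge adjacent pairs of contig sets (ordered by k)
    until a single non-redundant set remains."""
    sets = [list(s) for s in assemblies]
    while len(sets) > 1:
        merged = [merge_pair(sets[i], sets[i + 1])
                  for i in range(0, len(sets) - 1, 2)]
        if len(sets) % 2:  # an unpaired set is carried to the next round
            merged.append(sets[-1])
        sets = merged
    return sets[0] if sets else []


if __name__ == "__main__":
    assemblies = [["ACGT", "TTTT"], ["ACGTAC", "TTTT"], ["GGGACGTACGG"], ["CC"]]
    print(merge_all(assemblies))
```

In this toy input, "ACGT" is buried by "ACGTAC", which is in turn buried by "GGGACGTACGG" in the second round, while the duplicate "TTTT" survives only as its lower-k copy.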

Supplementary Figure 6. Coverage of Ensembl v54 transcripts by contig alignments, as a function of mean read transcript coverage. Mean transcript read coverage, C, was calculated for each transcript by aligning reads to the NCBI37 reference genome, which had been extended by exon-exon junctions, and normalizing the number of aligned reads for a transcript by the sum of exon lengths in the transcript. Distributions are shown for all transcripts with nonzero read-alignment coverage (gray), and for transcripts with de novo contig alignments (Trans-ABySS, for even-k assemblies) or reference-based contigs (Cufflinks, Scripture) representing at least 80% of the total exon length, either considering all contigs for that transcript (red) or the single longest contig (blue).

Supplementary Figure 7. Splice site support for 149,877 Trans-ABySS parent (v1.1.1) contig alignments, considering GT-AG, GC-AG and AT-AC donor-acceptor types 1. An ss2 contig alignment (97.9%) has at least one alignment intron with both acceptor and donor sites, an ss1 contig alignment (1.8%) has at least one intron with only an acceptor or a donor, and an ss0 contig alignment (0.2%) lacks such support.
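The ss2/ss1/ss0 classes can be sketched in Python as below. This is one possible reading of the caption, not the pipeline's code: it assumes each alignment intron is summarized by its first two and last two genomic bases on the transcript strand, and it treats "only an acceptor or a donor" as meaning that exactly one end matches a dinucleotide from the three listed donor-acceptor types. The name `splice_support` is hypothetical.

```python
# Donor-acceptor dinucleotide pairs considered in the caption.
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
DONORS = {donor for donor, _ in CANONICAL_PAIRS}
ACCEPTORS = {acceptor for _, acceptor in CANONICAL_PAIRS}


def splice_support(introns):
    """Classify a contig alignment from its introns.

    introns: iterable of (donor_dinucleotide, acceptor_dinucleotide).
    Returns "ss2" if any intron has a recognized donor-acceptor pair,
    "ss1" if any intron has only a recognized donor or acceptor,
    and "ss0" otherwise.
    """
    best = 0
    for donor, acceptor in introns:
        if (donor, acceptor) in CANONICAL_PAIRS:
            return "ss2"
        if donor in DONORS or acceptor in ACCEPTORS:
            best = 1
    return f"ss{best}"


if __name__ == "__main__":
    print(splice_support([("GT", "AG")]))
    print(splice_support([("GT", "TT")]))
    print(splice_support([("CC", "TT")]))
```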

Supplementary Figure 8. Schematic of comparing a transcript model (top) with contig alignments to identify annotated and novel transcripts and transcript structures. For each main and extended junction contig we compared coordinates of contig alignment blocks to coordinates of exons in each best-fitting transcript model, considering all mm9 UCSC gene, RefSeq, Ensembl and AceView transcripts. For a full match, edges of all internal blocks and transcript exons match, as do inside edges of the outer or terminal blocks and exons. Because contig ends do not necessarily correlate with transcript ends, outer edges of terminal alignment blocks may not match outer edges of corresponding exons, and so are not considered to represent novel events. A multi-block alignment that matches no known transcript models represents a potential novel transcript (not shown). For schematics for identifying candidate novel short and long 3' UTRs and candidate fusion genes see Supplementary Figs. 16 and 18.
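The "full match" rule in this caption can be sketched as a coordinate comparison. The Python below is an illustrative simplification rather than the pipeline's novelty-detection code: it assumes half-open (start, end) coordinates sorted by position, requires the blocks to correspond one-to-one with a run of consecutive exons, and uses the hypothetical names `is_full_match` and `_matches_window`.

```python
def _matches_window(blocks, window):
    """Compare alignment blocks with an equally long run of exons."""
    last = len(blocks) - 1
    for i, ((b_start, b_end), (e_start, e_end)) in enumerate(zip(blocks, window)):
        if b_end <= e_start or b_start >= e_end:
            return False  # block and exon must overlap
        if i > 0 and b_start != e_start:
            return False  # left edge must match, except on the first block
        if i < last and b_end != e_end:
            return False  # right edge must match, except on the last block
    return True


def is_full_match(blocks, exons):
    """True if the blocks match a run of consecutive exons at all internal
    edges; outer edges of the terminal blocks are not required to match."""
    n = len(blocks)
    if n == 0:
        return False
    return any(_matches_window(blocks, exons[offset:offset + n])
               for offset in range(len(exons) - n + 1))


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]
    print(is_full_match([(150, 200), (300, 400), (500, 550)], exons))
    print(is_full_match([(150, 200), (500, 550)], exons))
```

In the second call the middle exon is absent from the alignment, so the comparison fails; in the caption's terms this would be a candidate skipped exon rather than a full match.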

Supplementary Figure 9. RT-PCR validation of a 36-bp novel exon prediction in the Insr gene, which was subsequently reported in a shorter full-length RIKEN cDNA clone for adult male testis, in a more recent set of known gene transcript models. a) UCSC genome browser mm9 screenshot showing (top to bottom) Tag-seq data (unpublished), H3K4me3 ChIP-seq data 2, exonerate alignments for main contigs, read-alignment pileup, RT-PCR primers (blue arrow) and a range of transcript and other annotations. b,c) Detailed view of the RT-PCR primers on the exons flanking the novel exon. While the pileup coverage is greater than 100 on the flanking exons, the 36-bp novel exon is so much shorter than the 50-bp reads that only two BWA-aligned reads support the novel exon (not shown). d) RT-PCR gel image showing the expected 185-bp product, but not the annotated 149-bp product. e) The approximate alignment coverage for the gene (vertical red line) shown relative to cumulative distributions of transcript coverage for all Ensembl mouse transcripts (gray line) and all contigs whose alignments covered at least 80% of the total exon length of a transcript (see Fig. 1a). The novel exon corresponded to 12 amino acids, and overlapped exons in human and rat RefSeq transcript alignments (not shown). All three contigs in the region contain this exon, suggesting that only one isoform is expressed. Despite the gene being relatively highly expressed (read coverage for flanking exons is ~130-fold), the novel exon is shorter than the 50-bp reads, and so has only two reported read alignments. In contrast, read alignments to the assembled contigs indicate ~90-fold coverage over this detected novel exon (data not shown).

Supplementary Figure 10. Novel UTR prediction for the Nlrp6 gene. a) UCSC genome browser mm9 screenshot showing (top to bottom) Tag-seq data for the positive and negative strands, an H3K4me3 enrichment profile, exonerate alignments for main contigs, read-alignment pileup, RT-PCR primer positions (blue arrow), and a range of transcript annotations. b,c) Details of the RT-PCR primer locations. d) RT-PCR gel image showing the expected 856-bp product. e) The approximate alignment coverage for the two annotated genes (vertical red lines, ~400 and ~2100) relative to distributions shown in Fig. 1a. The evidence for the detected novel UTR on Nlrp6 includes the following. The main H3K4me3 enrichment signal 2 extends across a short UCSC or AceView transcript, while weaker H3K4me3 enrichment is consistent with short UCSC and AceView Nlrp6 transcripts. Numerous shorter and particularly longer contigs suggest that the gene model for Nlrp6 is incomplete, and that transcripts extend between this locus and the main enriched H3K4me3 region. Read coverage is approximately 560 for the Nlrp6 transcripts, and higher (approximately 1175) for the upstream transcripts; consistent with this high expression, there is widespread low-level intergenic or (novel) intronic transcription that is reflected in many unspliced contigs. The longest contig exactly reconstructs the ORF part of the RefSeq transcript. The set of contig alignments at the upper left extends ~148 kb upstream to a very highly expressed (~6500 pileup) cytochrome P450 Cyp2e1.

Supplementary Figure 11. A novel transcript prediction. a) UCSC genome browser mm9 screenshot showing (top to bottom) exonerate alignments for main contigs, PE reads, a read-alignment pileup, RT-PCR primer positions (blue arrow), a range of transcript annotations and mammalian conservation. b,c) Details of the RT-PCR primer locations. d) RT-PCR gel image showing the expected 264-bp product. e) The approximate alignment coverage for the novel transcript (vertical red line, ~31) relative to distributions shown in Fig. 1a.

Supplementary Figure 12. Alignments of contigs representing the Shank2 gene support both an RT-PCR-validated 21-bp skipped exon (red arrow), and a novel, H3K4me3-supported 2 TSS that is upstream of the 5'-most Ensembl TSS. a) mm9 UCSC genome browser view of Shank2 showing (top to bottom) Tag-seq data for the positive strand, an H3K4me3 enrichment profile, exonerate alignments for main contigs, BWA read-alignment pileup, RT-PCR primer positions (blue arrow), and a range of transcript annotations. b) Detail of RT-PCR primers, with a red arrow indicating the skipped exon. c) Detail of the skipped exon. d) RT-PCR gel, showing the 200-bp annotated and 179-bp novel products. e) The vertical red line shows the approximate read alignment coverage for the gene relative to distributions shown in Fig. 1a.

Supplementary Figure 13. Assembly can generate contigs for exons with low read alignment densities. Sfrs3 is a member of the SR splicing factor family, which has 11 and 10 members in human and mouse, respectively 3. In human, SFRS3 shares a splicing pattern with six other family members: a cassette exon that introduces a premature stop codon is skipped in the reference isoform but included in an alternative isoform 3. a) For the mouse Sfrs3 shown, exons overlap chained self-alignment blocks. Consistent with this, aligned-read coverage is low on exons flanking the retained intron; however, de novo assembly generates informative contigs. Contig k values and normalized k-mer coverages are consistent with transcripts having a wide range of expression levels (viz. k45:11.2 vs. k31:3.0). A relatively highly expressed 1629-bp k45:11.2 contig is consistent with the RefSeq reference isoform, while k37:14.3 and k33:17.9 contigs show the retained intron. This gene's retained intron is one of the three known cases shown as red circles in Supplementary Fig. 15. b) A Sircah 4 splice graph representation of the main contig alignments.

Supplementary Figure 14. Empirical distribution functions for mean normalized read coverage, C, for ENSMUSTs and transcripts with novel retained introns. The graph shows 34,400 ENSMUSTs with nonzero coverage (gray), and 181 of the 250 transcripts with novel retained introns (red) that had UCSC gene IDs or ENSMUST IDs. Approximately 75% of transcripts with novel retained introns had mean normalized read coverage that was at or above the 90th percentile coverage for the Ensembl transcripts.

Supplementary Figure 15. Coverage metrics for known and novel retained introns. The axes are the mean read coverage for a retained intron's flanking exons, and the ratio of the mean coverage of the retained intron to the mean coverage of the flanking exons. Contours summarize 5314 retained introns from the mouse ASTD v1.1 database 5. Blue squares show 250 non-redundant novel retained introns from the current work. Lower coverage for the flanking exons and higher intron-to-flanking exon coverage ratios were consistent for three examples of retained introns for SR splicing factor genes, which undergo unproductive splicing as part of a regulatory mechanism 6 (red circles, see also Supplementary Fig. 13). Detailed work may prioritize the retained introns that are associated with less highly expressed genes and have larger coverage ratios (upper left quadrant), while those in the lower right quadrant may be less biologically relevant.

Supplementary Figure 16. Schematic of method for identifying novel short and long 3' UTRs. a) A cDNA with a poly(A) tail. End-junction (EJ) reads and poly(A)-mate (PAM) reads that were generated from the cDNA are identified from the read sequence file. b) 50-bp poly(A) sequences were added to 3' ends of reference transcript sequences (gray). Contig sequences (blue) are expected to terminate in a poly(A) sequence whose length is less than the assembly k; contig sequences were padded with 50-bp poly(A) sequences on their 3' ends and 50-bp poly(T) sequences on their 5' ends. c) The fragment length distribution, i.e. the measured insert length for paired end reads, was determined from distances between mate pairs mapped to contigs (shown here for k=38). d) The distribution of the number of T's in M 50-bp reads. Sequence reads with very high proportions of T are likely to belong to cDNA poly(A) tails (right edge of the graph). e) Aligning the transcript-read (short blue rectangles) from EJ and PAM mate pairs to reference transcript sequences (gray) to confirm annotated 3' UTR ends (e1) and identify novel short 3' UTR ends (e2). (e3) Refining estimates of ends of novel long 3' UTRs by aligning, to contigs (blue), reads that do not map to transcripts.
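Panel d describes flagging reads by their T content. A minimal Python sketch of that idea follows. It is an illustration only: the 49-of-50 threshold is borrowed from the "49/50 T mate" filter named in Supplementary Table 4, and the function names `t_count_distribution` and `is_poly_a_mate` are hypothetical.

```python
from collections import Counter


def t_count_distribution(reads):
    """Histogram of the number of T bases per read (cf. panel d)."""
    return Counter(read.upper().count("T") for read in reads)


def is_poly_a_mate(read, min_t=49):
    """True if the read is almost entirely T, which suggests that it was
    sequenced from a cDNA poly(A) tail. min_t=49 assumes 50-bp reads."""
    return read.upper().count("T") >= min_t


if __name__ == "__main__":
    reads = ["T" * 50, "T" * 49 + "A", "ACGT" * 12 + "AC"]
    print(t_count_distribution(reads))
    print([is_poly_a_mate(r) for r in reads])
```

A read pair in which one mate passes `is_poly_a_mate` and the other aligns to a transcript or contig would be a PAM-read candidate in the sense of panels a and e.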

Supplementary Figure 17. a) PAM-reads identify approximate known and novel polyadenylation sites in the 3' UTR of Dmgdh (Supplementary Fig. 16e1,2). The origin of the insert length distribution (Supplementary Fig. 16c) is located at the left-most edges of signal peaks in the stringent evidence pileup track (second from top), and the shaded rectangles correspond to the width of the peak in the insert length distribution. The predicted novel polyadenylation site (left) is consistent with EST evidence. b) PAM-reads identify three candidate polyadenylation sites in the 3' UTR of Sult3a1 (Supplementary Fig. 16e3). 3' UTRs that are longer than annotated 3' UTRs are supported by contig alignments (horizontal blue bars) and read alignments.

Supplementary Figure 18. Schematic of detecting a fusion gene. a,b) The contig aligns to two genomic regions. The regions may be on different chromosomes, or on one chromosome but separated by a distance that is much longer than the ~200-bp PE insert length (Supplementary Fig. 16a). The contig breakpoint (a, red line) must be supported by reads that align with no mismatches to the contig and cross the breakpoint. The contig alignments may also have mate-pair support from reads aligned to the EEJ-extended genome (b). Annotated transcripts are shown in gray.
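The criteria in this caption can be expressed as a simple filter. The Python below is a hedged sketch, not the pipeline's fusion detection: the `Region` type, the function name `is_fusion_candidate`, and the default thresholds (`min_separation=2000`, `min_reads=1`) are all hypothetical, since the caption states only that the separation must be much longer than the ~200-bp insert length and that breakpoint-spanning reads are required.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One genomic alignment of part of a contig (half-open coordinates)."""
    chrom: str
    start: int
    end: int


def is_fusion_candidate(a, b, spanning_reads, min_separation=2000, min_reads=1):
    """Apply the caption's criteria to a contig that aligns to regions a and b.

    spanning_reads: number of reads that align to the contig with no
    mismatches and cross the breakpoint (counted upstream of this function).
    """
    if spanning_reads < min_reads:
        return False
    if a.chrom != b.chrom:
        return True
    # Same chromosome: require a gap much longer than the PE insert length.
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap >= min_separation


if __name__ == "__main__":
    left = Region("chr1", 100, 200)
    print(is_fusion_candidate(left, Region("chr2", 100, 200), 3))
    print(is_fusion_candidate(left, Region("chr1", 250, 400), 3))
    print(is_fusion_candidate(left, Region("chr1", 50000, 50100), 3))
```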

Supplementary Figure 19. Comparisons of gene-level expression metrics for Trans-ABySS, ALEXA-Seq 7 and a whole transcriptome shotgun sequencing (WTSS) pipeline (Supplementary Note). Results are shown for the 8190 Ensembl mouse genes that had fractional gene-level contig-to-exon coverage of at least 0.8. The Pearson's correlation coefficient was

Supplementary Figure 20. Overview of the transcriptome assembly and analysis pipeline workflow, outlining the steps from initial transcriptome assembly through contig processing to analysis outcomes. Boxes with rounded corners indicate operations, boxes with square corners represent results and blue boxes represent outcome results. a) When a genome sequence is not available, assembly makes contigs available for functional or phylogenetic analyses by methods that are not part of the Trans-ABySS pipeline. b) When a genome sequence is available but gene models have not been annotated, contig alignments to the genome can identify a range of transcript structures, as well as chimeric transcripts and variants like indels and SNVs. c) When transcript models are available for comparison to contig alignments, models can be refined and updated to include transcript variants.

Supplementary Figure 21. Length-normalized profiles of BWA read alignment densities, showing 20th, 50th and 80th quantiles.

Supplementary tables

Supplementary Table 1. Summary of read alignments for 147.1M 50-bp paired end (PE) Illumina reads (7.36 Gb). We retained only aligned reads that had a MAQ mapping quality of at least 10; these had unique genomic alignment positions and few mismatches to the mm9 reference genome sequence or constructed exon-exon junction sequences. Junctions were constructed for consecutive exons from UCSC, RefSeq, Ensembl and AceView transcripts. Read counts relative to genes were calculated using Ensembl v54. Percentages in MAPQ filter columns are relative to Total mapped numbers, and those in Aligned to columns are relative to the number of retained read sequences.

# reads
Total mapped: 136,685,932
MAPQ filter, Filtered: 17,999,… (13.17%)
MAPQ filter, Retained: …,686,768 (86.83%)
Aligned to Exons/EEJ: 91,935,338 (77.46%)
Aligned to Introns: 2,901,894 (2.45%)
Aligned to Intergenic: 7,678,810 (6.47%)
Gb:

Supplementary Table 2. Run times.

Trans-ABySS
Assembly: Using ABySS 1.2.1, assemblies for k=26 to 50 completed in 4.7 hours of wallclock time and 370 CPU-hours using 25 machines, each of which had 8 hyperthreaded cores in two Intel E GHz CPUs, and 16 GB of RAM.
Analysis: Merging a total of 22 million contigs across 25 assemblies completed in about 5-6 hours. Blat alignments completed in about minutes of wallclock time per 1000 contigs. Exonerate alignments completed in about 100 minutes of wallclock time per 1000 contigs. Novelty detection completed in about 5-6 hours wallclock time for 1.2 million alignments.

Tophat/Cufflinks/Scripture
Tophat: This was run as 8 parallel jobs (one per lane of data), each of which took an average of 6.75 hours. Time to sort, sam2bam, merge, was about 4 hours total CPU time. Total Tophat run time: ~60 CPU hours, which was required for both Cufflinks and Scripture.
Cufflinks: 1 job, 12 CPU hours.
Scripture: 24 jobs, ~30 minutes each on average: 12 CPU hours.

Supplementary Table 3. Summary of candidate transcript events that were identified as novel relative to all UCSC, RefSeq, Ensembl and AceView transcript models.

Columns: Event type; Contigs with events a; Unique contig events b; Genes affected

Event types:
Novel exons
Novel skipped exons
Novel introns
Alternative exon splicing
Novel UTRs
Retained introns
Novel transcripts
Novel polyadenylation sites

a Total number of contigs containing novel events relative to annotated transcript models. In some cases multiple contigs identify the same event.
b The number of unique genomic locations represented by the contig events. These identify unique transcript events.

Supplementary Table 4. Summary results for identifying annotated and novel polyadenylation sites. EJ-reads and PAM-reads were mapped to NCBI37 (mm9) UCSC 8, RefSeq 9, Ensembl 10 and AceView 11 transcript models, and to GenBank 12 mRNAs.

a) EJ-read mappings

EJ-reads that mapped to transcript models (Reads, All transcripts):
>50 bp (novel short): 6,505
<= 50 bp (known): 11,060
Unmapped: 200,676
Total: 218,242

EJ-reads that did not map to transcript models (Reads, Contigs):
>50 bp: 13,016
<= 50 bp: 5,221
Unmapped: 182,439
Total: 200,676

b) PAM-read mappings

PAM-reads that mapped to transcript models (Reads, All transcripts):
>300 bp (novel short): 4,424
<= 300 bp (known): 34,699
Unmapped: 10,240
Total: 49,363

PAM-reads that did not map to transcript models (Reads, Contigs):
>300 bp: 327
<= 300 bp: 2,243
Unmapped: 7,670
Total: 10,240

c) Transcripts identified by EJ-reads

Columns: Filter; Known; Novel short; Novel long; Total
All transcripts mapped by EJ-reads: na; 4,667; 8,885; 13,552
Novel short (>50 bp), novel long (<=50 bp): 2,774; 2,664; 2,807; 5,471
Mate read maps within range on same transcript: 2,225 1, ,908
Stretch of T prefix > 10 bp of read
Transcripts with at least 2 EJ-reads of support

d) Transcripts identified by PAM-reads

Columns: Filter; Known; Novel short; Novel long; Total
All transcripts with mapped PAM-reads: na; 7,496; 1,069; 8,565
Novel short (>300 bp), novel long (<=300 bp): 6,672 1, ,450
Has at least 1 PAM-read with a 49/50 T mate
Filtered for high AT content (80%) and antisense: 2,
Transcripts with at least 2 PAM-reads of support
Filtered for reads with genomic mapping
Manually reviewed

Supplementary Note

De novo transcriptome assembly

Non-normalized transcriptome shotgun libraries differ from whole genome shotgun libraries in presenting a very wide range of sequence representations to an assembler. We address expression level differences by using a wide range of k values to assemble contigs that represent cDNAs, then merging the contig lists from independent assemblies into a smaller set of meta-assembly contigs for analysis. Transcriptome shotgun libraries also differ from whole genome shotgun libraries in that many genes express multiple transcript isoforms, and so present multiple correct, overlapping paths to an assembler. In contrast, in genome assembly, a single correct assembly path is expected through any genomic region, with the exception of repetitive and duplicated sequences and those representing haplotypic variation or mutational alterations.

ABySS captures single nucleotide variation within a sample as pairs of short sequences, which are referred to as bubble contigs (Supplementary Fig. 1). The variant with the highest coverage is represented in the assembled contigs, but both variants are written out to a separate file as a bubble contig pair that can be analyzed independently to identify allelic variation within the sample and SNVs relative to known variants. ABySS typically handles heterozygous indel variants by creating a pair of short contigs for each variant in the initial assembly stages (Supplementary Fig. 1b,c). The contig representing a deletion variant is usually composed of sequences of length k-1 flanking the insertion point, and thus is characteristically (2k-2) bp in length. The contig representing an insertion variant is composed of the same (2k-2) bp sequence, plus the additional sequence representing the insertion, and is therefore somewhat longer than the (2k-2) bp deletion variant. We refer to these contigs as junction contigs.
Depending on assembly parameters, individual junction contigs may or may not be incorporated into longer contigs in later stages of the assembly (Supplementary Fig. 1d) (see Methods). As we reported previously 13, in transcriptome assembly these junction contigs also capture exon content differences between transcript isoforms. While results for SNVs and indels are not reported here, our pipeline therefore includes methods for bubble and junction contigs.

Given the above considerations, the Trans-ABySS workflow consists of the following stages: 1) assembling reads into contigs using ABySS, 2) aligning contigs to the reference genome, and 3) analyzing the contig alignments to correlate with known transcript annotations and to identify SNVs, indels, novel transcripts and transcript structures, and gene rearrangements and fusions.

From each assembly, we considered all contigs of length L ≥ (2k-2) bp, and all bubble contigs; summed across all assemblies, there were 9.5 M of the former and 346,787 of the latter. To reduce the number of L ≥ (2k-2) bp contigs analyzed, while maintaining the transcript representation provided by all assemblies, we merged the assemblies by removing ("burying") contigs that were redundant because they were exactly represented within longer ("parent") contigs in another assembly. To accomplish this, we iteratively and reciprocally aligned contigs between pairs of assemblies, removing redundant contigs at each round (Supplementary Fig. 5). The iterative burying process returned a set of 1,200,130 non-redundant contigs (Fig. 1b), which we refer to as the main contig set (Supplementary Fig. 20).

Preliminary analysis showed that a junction contig shorter than (2k-2) bp can be assembled when there are short homologous sequences on either side of the junction. To ensure that such contigs were included in our dataset for analysis, we identified contigs with length L < (2k-2) bp for which mate pair information indicated overlap with a single candidate contig neighbor at each end. To support robust genome alignments for these small contigs, we extended them by adding their two neighboring contig sequences. We refer to these as extended junction contigs (Supplementary Figs. 1, 20). Subsequent merging reduced the 96,019 extended junction contigs across all assemblies to 16,287 contigs for analysis.

Alignments of main and extended junction contigs were compared to structures of known transcript models in order to identify novel transcripts and alternative transcript structures. Alignments for all contigs were used to identify SNVs and indels relative to the genome (data not shown), and candidate fusion genes were identified from the main contig and extended junction contig alignments (Supplementary Fig. 20).

Issues for de novo and reference-based transcriptome assembly

A number of issues pose challenges to both reference-based and de novo assembly approaches. First, the library protocol that we used generated double-stranded cDNA, and so did not retain the strand of the original transcript.
While for spliced contig alignments we inferred the strand of the source transcript from the splice sites in the contig alignments, for some cases confirmation would require orthogonal evidence. It is likely that directional library protocols currently under development will reduce the complexity of such analysis 14. Second, while a de novo approach can be robust to sequence similarity between exons, shared sequences that are highly similar will halt contig extension, with repetitive regions assembling into separate contigs, each of which aligns to multiple locations. Third, aligned-read densities are non-uniform along exons due to multi-mapping and other technical biases. Fourth, isoform reconstruction remains problematic for genes that have multiple expressed isoforms. Although suggested transcript models have been reported for both de novo and reference-based assembly algorithms, complex alternative isoforms cannot be reconstructed reliably, due to short read lengths and short fragment lengths for paired end reads. Also, attempts to use expression levels in inference fail due to both theoretical (under-, over- or ill-defined linear mathematical models) and practical (3'/5' sequence bias, Supplementary Fig. 21) obstacles. Unless one is supplied with reads that associate longer lengths across transcripts, assembly methods can at best report splice diagrams for genes with alternative isoforms.

Comparing de novo and reference-based assembly

We ran TopHat Beta on each of the eight lanes of data separately, then sorted and joined the output .bam-format 19 files into a single merged file, which we used as input into Cufflinks Beta (02 July 2010), and Scripture 21 Beta (22 June 2010). For our TopHat analysis we generated the intron result set by merging the resulting BED-format files from each lane, and accumulating scores for identical introns. Unique introns for the other three tools were generated from exonerate alignments for Trans-ABySS contigs, BED files for Scripture contigs, and GTF files for Cufflinks contigs. We then compared the predicted splice sites to the unique coordinates of all the donor-acceptor pairs in the reference annotations, which corresponded to all nonredundant introns for the union of UCSC, RefSeq, Ensembl and AceView transcript models. A splice site was only considered to match between datasets if the coordinates of the donor-acceptor pair matched exactly. Supplementary Table 2 outlines run times.

TopHat identified alignments for 145,798,588 (78.8%) of 184,915,546 reads. Of the aligned reads, 592,864 (0.4%) were gapped or split alignments; these identified 141,846 unique dinucleotide splice sites, which we compared against the unique coordinates of all the donor-acceptor pairs in UCSC, RefSeq, Ensembl and AceView gene annotations. Methods that use split read alignments may have difficulty in detecting exons that are shorter than the read length, particularly when 50-bp reads are used. For TopHat, every detected splice junction is required to be supported by at least one read that anchors by a user-defined minimum length on either side of a split.
This makes TopHat insensitive to exons shorter than the anchor length, and also less sensitive to relatively short exons, especially when these are in weakly expressed isoforms. Consequently, using the TopHat spliced read alignments as input, we observed that Cufflinks was strongly biased against detecting shorter exons.

To estimate performance differences between contig alignments and spliced read alignments more directly, we compared dinucleotide splice sites detected by Trans-ABySS and TopHat using the splice sites in UCSC gene transcripts as our reference set. We included TopHat because, although the assembly of exons is deferred to the Cufflinks software, the splice sites are reported by TopHat. Sensitivity (SN) and specificity (SP), relative to the reference junctions, are approximate metrics for this comparison (Fig. 2). The SN reported is the fraction of all unique splice sites in the UCSC, RefSeq, Ensembl and AceView transcript models that are detected. SN, as reported, is an underestimate, because the reference set includes splice sites from unexpressed transcripts. The SP reported is the ratio of the number of detected introns that match reference introns to the total number of introns detected. It too is an underestimate, because apparently non-specific predictions include not only false positives, but also true positive exon-exon junctions that are novel relative to the reference intron set.

Detecting novel polyadenylation sites

Alternative polyadenylation sites can affect mRNA stability, translocation and translation [22]. For fission yeast, polyadenylation sites have been identified from single-end read RNA-seq data through reads that aligned at junctions of transcripts and poly(A) tails (end-junction or EJ reads) [23]. In a transcriptome assembly, a contig representing a polyadenylated transcript should terminate in a homopolymer-A sequence whose length approaches k. In our study, the read length was 50 bp, while the merged contig set included contigs from assemblies with 26 ≤ k ≤ 50. Given this, we expect that terminal poly(A) sequences for merged contigs will be shorter than the read length, which could interfere with the EJ-read alignments. We addressed this by adding 50-bp poly(A) and poly(T) sequences to the 3' and 5' ends of each contig, respectively. Similarly, we added 50-bp poly(A) sequences to the 3' end of each reference (e.g. RefSeq) mRNA sequence (Supplementary Fig. 16).

Contigs that are downstream of such a transcript contig in the de Bruijn graph represent the poly(A) tail, but are not incorporated into any particular transcript contig due to the difficulty of assembling simple sequence. Here, as an initial step towards a future graph-based analysis, we identified and annotated novel polyadenylation sites using end-junction (EJ-) and mate-pair (PAM-) reads in paired-end sequence data (Supplementary Fig. 16). An EJ-read spanned a poly(A) start site [23]; a PAM-read had one mate mapped to a poly(A) tail, while the other mate mapped either to an annotated transcript or to a contig sequence.
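The padding step described above (a 50-bp poly(T) run on the 5' end and a 50-bp poly(A) run on the 3' end of each contig, and a 50-bp poly(A) run on the 3' end of each reference mRNA) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the pipeline's code: the function names and the coordinate helper are hypothetical.

```python
# Minimal sketch of the padding step; all names here are hypothetical.

PAD_LEN = 50  # pad length stated in the text (equal to the read length)


def pad_contig(seq: str, pad_len: int = PAD_LEN) -> str:
    """Add a poly(T) run to the 5' end and a poly(A) run to the 3' end."""
    return "T" * pad_len + seq + "A" * pad_len


def pad_reference_mrna(seq: str, pad_len: int = PAD_LEN) -> str:
    """Add a poly(A) run to the 3' end only, as for reference mRNAs."""
    return seq + "A" * pad_len


def to_contig_coordinate(padded_pos: int, pad_len: int = PAD_LEN) -> int:
    """Convert a 0-based position on a padded contig back to the
    0-based position on the original contig (assumed helper)."""
    return padded_pos - pad_len
```

With padding in place, a read that spans a poly(A) start site can align end to end even when the assembled terminal homopolymer is shorter than the read.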
We identified candidate EJ-reads spanning poly(A) start sites as reads whose sequence was prefixed by poly(T) runs that were at least 5 bp long. We identified candidate PAM-reads as those in which 80% to 98% (40 to 49 of 50 nt) of the mate's sequence was T. We used BWA [24] v0.5.4 to map candidate EJ-reads and PAM-reads to known transcript annotations from UCSC, RefSeq, Ensembl and AceView, and to GenBank mRNAs. Files for all of these were downloaded from the UCSC mm9 genome browser [25]. To identify transcripts with candidate novel short 3' UTRs, we used the length distribution for PE reads and the distance from each PAM-read to the end of each transcript (Supplementary Fig. 16b,c and Supplementary Table 4). Specifically, we considered mapping distances from a transcript end of more than 50 bp for EJ-reads, and more than 300 bp for PAM-reads, to mark such cases.
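The two read-selection rules above can be written as simple predicates. The sketch below assumes reads are plain uppercase or lowercase DNA strings; the function names are hypothetical, and integer arithmetic is used for the 80% to 98% bounds to avoid floating-point edge cases.

```python
# Hedged sketch of candidate EJ-/PAM-read selection; names are hypothetical.


def leading_t_run(seq: str) -> int:
    """Length of the poly(T) run at the start of the read."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip("T"))


def is_candidate_ej_read(seq: str, min_run: int = 5) -> bool:
    """Read is prefixed by a poly(T) run of at least min_run bases."""
    return leading_t_run(seq) >= min_run


def is_candidate_pam_mate(seq: str) -> bool:
    """Mate is 80% to 98% T (40 to 49 of 50 nt for 50-bp reads)."""
    n = len(seq)
    t = seq.upper().count("T")
    return n > 0 and 80 * n <= 100 * t <= 98 * n


def trim_poly_t_prefix(seq: str) -> str:
    """Remove the leading poly(T) run before mapping."""
    return seq[leading_t_run(seq):]
```

The upper bound of 98% excludes mates that are entirely T, which carry no information about which transcript the fragment came from.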

To identify candidate novel long 3' UTRs, all EJ-reads and PAM-reads that did not align to annotated transcript sequences were mapped with BWA to ABySS contig sequences. We identified contigs that had EJ-reads mapped to the ends and PAM-reads mapped within 300 bp of a contig end, and mapped the contigs to the mouse mm9 genome to determine the transcript product with the novel elongated 3' UTR. In such cases the contig alignment already suggested the extended 3' UTR, and the PAM-reads refined the estimate of the position of the end of the UTR.

We then filtered candidate polyadenylation sites, as follows. For shortening and lengthening cases using EJ-reads, we required EJ-reads to satisfy two conditions: that they map to the genome or to transcripts only when their poly(T) prefix or poly(A) suffix had been trimmed; and that their mate pair map bp from the opposite strand of the same transcript. We ranked mapping positions of a read, prioritizing positions with the fewest mismatches and then the shortest distance to a transcript end. We then required at least two reads to map to each position. Transcripts from the four annotation sets used were resolved to gene symbols when possible.

For both shortening and lengthening cases using PAM-reads, we required that at least one of these had at least 49 Ts out of 50 bases in the poly(A) tail read. When a PAM-read mapped to more than one genomic location, we ranked mapping positions in the same way as for EJ-reads. To reduce the number of false positives, we rejected transcripts that had one or more 50-bp windows in which at least 80% of the bases were A or T. We then required at least two reads to map to each position.

For the 218,242 potential EJ-reads, requiring at least two reads of support for each transcript event and comparing events to four sets of transcript annotations and to GenBank mRNAs, we confirmed 71 annotated 3' UTR ends, as well as 36 novel short UTRs.
Mapping the unmapped reads to ABySS contigs then identified 22 novel long UTRs (Supplementary Table 4a,c).

For the 49,363 PAM-reads, 39,123 mapped to the transcript models and GenBank mRNAs. By requiring at least two PAM-reads for each event, we confirmed 1277 annotated 3' transcript ends, as well as 20 transcripts with novel short 3' UTRs (Supplementary Fig. 17). Mapping the unmapped reads to contigs then identified 10 transcripts with novel long UTRs (Supplementary Table 4b,d). By combining EJ- and PAM-read singletons, we also confirmed 9 annotated UTRs as well as 6 novel short UTRs.

Overall, we confirmed polyadenylation start sites in 1299 annotated transcripts, inferred 84 novel polyadenylation sites that corresponded to 56 novel short 3' UTRs and, from contig alignments, 32 novel long 3' UTRs (Supplementary Table 4, Supplementary Fig. 17). Relatively few novel events were predicted by both methods; in almost all cases a novel event was predicted by only one of the two methods.
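The filtering rules described in this section (ranking a read's mapping positions by fewest mismatches and then by shortest distance to a transcript end, requiring at least two reads per position, and rejecting A/T-rich transcripts) can be sketched as follows. The `Hit` record, its field names, and the reading of the window rule as "at least 40 of 50 bases" are assumptions made for illustration, not details confirmed by the text.

```python
# Illustrative sketch of the site-filtering rules; the data layout is assumed.
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    transcript: str
    position: int      # candidate polyadenylation position
    mismatches: int
    dist_to_end: int   # distance from the hit to the transcript end


def best_hit(hits):
    """Prefer the fewest mismatches, then the shortest distance to an end."""
    return min(hits, key=lambda h: (h.mismatches, h.dist_to_end))


def supported_sites(hits_per_read, min_reads=2):
    """Count each read once, at its best-ranked position, and keep
    positions supported by at least min_reads reads."""
    counts = Counter()
    for hits in hits_per_read:
        if hits:
            h = best_hit(hits)
            counts[(h.transcript, h.position)] += 1
    return {site: n for site, n in counts.items() if n >= min_reads}


def has_at_rich_window(seq, window=50, min_at=40):
    """True if any window of the given size has at least min_at A/T bases."""
    seq = seq.upper()
    if len(seq) < window:
        return False
    at = sum(base in "AT" for base in seq[:window])
    if at >= min_at:
        return True
    for i in range(window, len(seq)):
        # Slide the window one base: add the new base, drop the oldest.
        at += (seq[i] in "AT") - (seq[i - window] in "AT")
        if at >= min_at:
            return True
    return False
```

Transcripts for which `has_at_rich_window` returns true would be dropped before counting support, since internal A/T-rich stretches can attract poly(A)-derived reads.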

Identifying fusion genes

To identify candidate contigs spanning gene fusion breakpoints we apply filters to identify contigs that align discretely to distinct genomic regions using BLAT (Supplementary Fig. 18). We parse the five top-scoring alignments and perform the corresponding 10 pairwise comparisons. Initially, we discard any contig that contains a single alignment that represents at least 95% of the contig length, because any candidate fusion generated from the relatively short remaining part of the contig is likely to be a false positive. Alignments are subsequently filtered for quality by requiring that alignment identity be at least 95%. To ensure that the entire contig is represented in the alignments and to minimize overlap between alignment pairs, we require that at least 95% of the entire contig length be covered by the alignments, and that no more than 5% of the contig bases, and none of the reference bases, be shared between alignments.

We then filter all candidate fusion alignments. We discard alignments that align to mtDNA or haplotype reference sequences. We reject candidate fusion contigs that are reported as a fusion candidate multiple times. Contig alignments that overlap RepeatMasker RNA repeat elements are also rejected, as are contigs that have fewer than two Bowtie read alignments spanning the candidate breakpoint (Supplementary Fig. 20a). As a final piece of confirmatory evidence we require that the contig alignments be supported by mate-pairs aligned to the EEJ-extended reference genome and that the number of such supporting mate-pairs be within an acceptable range [4, 2000] (Supplementary Fig. 20b).

Quantifying gene-level expression

The Trans-ABySS pipeline includes a general method for determining a contig-based expression metric for gene loci, given a reference genome with transcript annotations.
The approach considers reads aligned to all contigs whose alignment blocks on a reference genome overlap with exons in transcript model annotations. For Ensembl v54 genes, we compare the expression levels predicted by this approach with those from two methods that align reads to a reference genome that has transcript annotations. The first method was ALEXA-Seq [7], whose expression values agree well with those from microarrays and qPCR. The second was a WTSS (whole transcriptome shotgun sequencing, i.e. RNA-Seq) pipeline that extends reference chromosome sequences with exon-exon junction sequences and is used for production-level analysis at the GSC (unpublished). For the 8190 genes with fractional contig-to-exonic coverage of at least 0.8, the expression levels for the two read-alignment methods were highly correlated, with a Pearson's coefficient of r² = . Correlation coefficients between Trans-ABySS and ALEXA-Seq and the WTSS pipeline were and respectively.
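The text does not define "fractional contig-to-exonic coverage" precisely. One plausible reading, sketched below as an assumption, is the fraction of a gene's exonic bases that are overlapped by contig alignment blocks. Intervals are taken to be half-open `(start, end)` pairs on the same chromosome, and the function names are hypothetical.

```python
# Sketch of one possible coverage definition; not the pipeline's own code.


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def exonic_coverage_fraction(exons, contig_blocks):
    """Fraction of exonic bases overlapped by contig alignment blocks."""
    exons = merge_intervals(exons)
    blocks = merge_intervals(contig_blocks)
    total = sum(end - start for start, end in exons)
    if total == 0:
        return 0.0
    covered = 0
    i = j = 0
    # Sweep both sorted, non-overlapping lists in a single pass.
    while i < len(exons) and j < len(blocks):
        start = max(exons[i][0], blocks[j][0])
        end = min(exons[i][1], blocks[j][1])
        if start < end:
            covered += end - start
        # Advance whichever interval finishes first.
        if exons[i][1] < blocks[j][1]:
            i += 1
        else:
            j += 1
    return covered / total
```

Under this reading, a gene would enter the comparison when `exonic_coverage_fraction(exons, blocks) >= 0.8`.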

Validating novel transcripts and transcript events

We generated 50 µl of double-stranded cDNA by reverse transcribing 0.2 µg of DNase-treated RNA from a biological replicate (see Library construction and sequencing, above). We used 1.5 µl of cDNA for each RT-PCR reaction. Primers were designed with PrimerQuest from IDT SciTools [26], BatchPrimer3 [27], or Visual OMP (DNA Software, Ann Arbor MI). Each primer pair was checked against the UCSC mouse mm9 assembly to confirm expected RT-PCR products. The following PCR cycle was repeated 40 times: 95 °C for 30 min, 53 °C for 30 min, and 72 °C for 60 min. For Csnk2a2, Fbrs, Foxn2, Kynu, and novel transcripts 'Event 17' and 'Event 18', primers were hybridized at 55 °C and the reaction was run for 35 cycles. RT-PCR products were resolved on a 1.8% agarose gel.

Product sizes for bands were estimated by a custom Matlab (Mathworks, Natick, MA) program that read an image file corresponding to a gel and a text file specifying ladder fragment sizes and expected mobilities (Supplemental Note). The user participated in lane tracking and, because the shape information for the ladder bands is used for de-noising sample lanes, manually confirmed the automatically identified ladder bands. The user then set a minimum threshold brightness for detecting bands. The program analyzed each sample lane, automatically identifying bands as local profile maxima, calculating a relative profile height at each maximum as an intensity metric, and assigning a product size to each maximum by linearly interpolating a size versus mobility relationship between the ladders. When a peak was saturated by an abundant product, the product size was estimated as the center of the plateau.

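The band-sizing logic described above can be illustrated in a few lines. The original program was written in Matlab and is not reproduced here; the sketch below is a Python approximation under stated assumptions: a lane is a one-dimensional list of intensities, the ladder is a list of `(mobility, size_bp)` pairs, and all function names are hypothetical.

```python
# Approximate sketch of band detection and sizing; not the original program.


def find_band_positions(profile, min_brightness):
    """Return positions of local maxima at or above min_brightness.
    A flat (saturated) peak is reported at the center of its plateau."""
    peaks = []
    i, n = 1, len(profile)
    while i < n - 1:
        if profile[i] >= min_brightness and profile[i] > profile[i - 1]:
            j = i
            while j + 1 < n and profile[j + 1] == profile[i]:
                j += 1  # extend across the plateau
            if j + 1 < n and profile[j + 1] < profile[i]:
                peaks.append((i + j) / 2)
            i = j + 1
        else:
            i += 1
    return peaks


def interpolate_size(mobility, ladder):
    """Linearly interpolate product size between flanking ladder bands."""
    points = sorted(ladder)
    if not points[0][0] <= mobility <= points[-1][0]:
        raise ValueError("mobility lies outside the ladder range")
    for (m0, s0), (m1, s1) in zip(points, points[1:]):
        if m0 <= mobility <= m1:
            if m1 == m0:
                return float(s0)
            return s0 + (s1 - s0) * (mobility - m0) / (m1 - m0)
    return float(points[0][1])  # single-band ladder
```

For example, with ladder bands at mobilities 10 and 20 for 1000-bp and 500-bp fragments, a sample band at mobility 15 would be assigned 750 bp.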
WTSS aligned-read pipeline

Using a whole transcriptome shotgun sequencing pipeline (WTSS, unpublished), we constructed a sequence resource by extending the NCBI37 reference genome with a pool of non-redundant exon-exon junction sequences. The junction sequences were constructed using Ensembl [10], UCSC gene [8], RefSeq [9], AceView [11], and Genscan [28] transcript annotations from the UCSC genome browser [25], by concatenating (read length - 1) nucleotides from each side of each pair of consecutive exons for each transcript, and then eliminating redundant junctions from the pooled set. We aligned the PE reads to the sequence resource using BWA [24] v0.5.4, and manipulated the output .bam-format [19] file to assign reads that had aligned to exon-exon junctions to their absolute genomic positions. Coverage for Ensembl v54 genes was calculated using the subset of mapped reads that had a mapping quality of at least 10. UCSC wig-format and then bigWig-format files were generated using SAMtools, Unix scripts and the UCSC wigToBigWig application, again removing reads with a MAQ mapping quality lower than 10. We determined length-normalized read density profiles along transcripts, from the BWA-aligned .bam file, using custom Java software (Supplementary Fig. 5).

Generating splice graph visualizations

Trans-ABySS contigs were aligned to the NCBI37/mm9 assembly using GMAP [29], and results were written out in GFF3 EST_match format. Sircah [4] was used to associate the contig alignments with genes using annotated gene start and end coordinates, and to draw a splicing diagram for each gene of interest (Supplementary Figure 13).

References

1. Burset, M., Seledtsov, I.A., and Solovyev, V.V., Nucleic Acids Res 28 (21), (2000).
2. Robertson, A.G. et al., Genome Res 18 (12), (2008).
3. Lareau, L.F. et al., Nature 446 (7138), (2007).
4. Harrington, E.D. and Bork, P., Bioinformatics 24 (17), (2008).
5. Koscielny, G. et al., Genomics 93 (3), (2009).
6. Lareau, L.F. et al., Adv Exp Med Biol 623, (2007).
7. Griffith, M. et al., Nature Methods [Epub ahead of print] (2010).
8. Hsu, F. et al., Bioinformatics 22 (9), (2006).
9. Pruitt, K.D., Tatusova, T., and Maglott, D.R., Nucleic Acids Res 35 (Database issue), D61-65 (2007).
10. Hubbard, T.J. et al., Nucleic Acids Res 37 (Database issue), D (2009).
11. Thierry-Mieg, D. and Thierry-Mieg, J., Genome Biol 7 Suppl 1, S (2006).
12. Benson, D.A. et al., Nucleic Acids Res 38 (Database issue), D46-51 (2010).
13. Birol, I. et al., Bioinformatics 25 (21), (2009).
14. Parkhomchuk, D. et al., Nucleic Acids Res 37 (18), e123 (2009).
15. Degner, J.F. et al., Bioinformatics 25 (24), (2009).
16. Hansen, K.D., Brenner, S.E., and Dudoit, S., Nucleic Acids Res 38 (12), e131 (2010).
17. Li, J., Jiang, H., and Wong, W.H., Genome Biol 11 (5), R50 (2010).
18. Trapnell, C., Pachter, L., and Salzberg, S.L., Bioinformatics 25 (9), (2009).
19. Li, H. et al., Bioinformatics 25 (16), (2009).
20. Trapnell, C. et al., Nat Biotechnol 28 (5), (2010).
21. Guttman, M. et al., Nat Biotechnol 28 (5), (2010).
22. Millevoi, S. and Vagner, S., Nucleic Acids Res 38 (9), (2009).
23. Nagalakshmi, U. et al., Science 320 (5881), (2008).
24. Li, H. and Durbin, R., Bioinformatics 25 (14), (2009).
25. Rhead, B. et al., Nucleic Acids Res 38 (Database issue), D (2010).
26. Owczarzy, R. et al., Nucleic Acids Res 36 (Web Server issue), W (2008).
27. You, F.M. et al., BMC Bioinformatics 9, 253 (2008).
28. Burge, C. and Karlin, S., J Mol Biol 268 (1), (1997).
29. Wu, T.D. and Watanabe, C.K., Bioinformatics 21 (9), (2005).


More information
