De novo assembly and analysis of RNA-seq data
Nature Methods

De novo assembly and analysis of RNA-seq data

Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron Butterfield, Richard Newsome, Simon K Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard Moore, Martin Hirst, Marco A Marra, Steven J M Jones, Pamela A Hoodless & Inanc Birol

Supplementary Figure 1. Schematic of ABySS assembly steps
Supplementary Figure 2. Assembly properties for k values of 26 to 50
Supplementary Figure 3. Partitioning of MAQ-aligned reads relative to Ensembl transcript models
Supplementary Figure 4. Fraction of Ensembl transcripts with at least 80% of exon length covered by unmerged contig alignments, as a function of normalized WTSS coverage threshold
Supplementary Figure 5. Assembled contigs across multiple k-values are merged to obtain a nonredundant set of contigs for analysis
Supplementary Figure 6. Coverage of Ensembl v54 transcripts by contig alignments, as a function of mean read transcript coverage
Supplementary Figure 7. Splice site support for 149,877 Trans-ABySS parent (v1.1.1) contig alignments, considering GT-AG, GC-AG and AT-AC donor-acceptor types
Supplementary Figure 8. Schematic of comparing a transcript model (top) with contig alignments to identify annotated and novel transcripts and transcript structures
Supplementary Figure 9. RT-PCR validation of Insr's 36-bp novel exon
Supplementary Figure 10. Novel UTR candidate for Nlrp6
Supplementary Figure 11. Novel transcript candidate
Supplementary Figure 12. Shank2's contig alignment supports both an RT-PCR-validated 21-bp skipped exon, and a novel, H3K4me3-supported TSS that is upstream of the 5'-most Ensembl TSS
Supplementary Figure 13. Sfrs3: assembly can extend contigs through exons that have low to-genome aligned-read densities
Supplementary Figure 14. Empirical distribution functions for mean normalized read coverage, C, for ENSMUSTs and transcripts with novel retained introns
Supplementary Figure 15. Coverage metrics for known and novel retained introns
Supplementary Figure 16. Schematic of identifying novel short and long 3' UTRs with EJ- and PAM-reads
Supplementary Figure 17. a) PAM-reads identify a novel polyadenylation site in the 3' UTR of Dmgdh; b) PAM-reads and contig alignments identify a novel long 3' UTR for Sult3a1
Supplementary Figure 18. Schematic of detecting a fusion gene with a contig alignment
Supplementary Figure 19. Comparisons of gene-level expression metrics for Trans-ABySS and ALEXA-Seq
Supplementary Figure 20. Overview of the transcriptome assembly and analysis pipeline workflow
Supplementary Figure 21. Length-normalized profiles of Burrows-Wheeler Aligner read alignment densities
Supplementary Table 1. Summary of read-to-genome alignments
Supplementary Table 2. Run times for Trans-ABySS, Tophat, Cufflinks and Scripture
Supplementary Table 3. Summary of candidate transcript events that were identified as novel relative to all UCSC, RefSeq, Ensembl and AceView transcript models
Supplementary Table 4. Summary results for identifying annotated and novel polyadenylation sites
Supplementary Note 1.
  De novo transcriptome assembly
  Issues for de novo and reference-based transcriptome assembly
  Comparing de novo and reference-based assembly
  Detecting novel polyadenylation sites
  Identifying fusion genes
  Quantifying gene-level expression
  Validating novel transcripts and transcript events
  WTSS aligned-read pipeline
  Generating splice graph visualizations
Supplementary Figures

Supplementary Figure 1. Schematic of ABySS assembly steps illustrating the origin of main, junction, and bubble contigs, and the manner in which the contig alignments are used for analysis. a) Bubble contig branch pairs (green) typically capture heterozygous SNVs. For each bubble, ABySS writes the higher coverage branch (mid green) into the single end (SE) contig set, and writes the branch pair into the global set of bubble contigs. b) SE contigs are constructed from unambiguous (k-1)-bp overlaps between k-mers. c) Mate pairs identify overlapping contig neighbors, and alternate contig-joining paths may be identified. The shorter, pale blue contig represents a candidate junction contig. Because such a contig typically consists of two (k-1)-bp overlaps, it is expected to be (2k-2) bp long in an assembly generated for a k-mer length of k bp. For a given assembly (and so k value), contigs that are at least (2k-2) bp long are expected to be the most informative of transcript structure. Depending on assembly parameters and the strength of supporting mate pair information, one of the two alternate contigs may be joined to the flanking contigs to construct a longer PE contig; however, it is also possible that neither or both alternative paths will be constructed. d) The path containing the longer alternate contig is constructed, with the shorter contig retained as a junction contig. e) Example of possible outcomes for alignments of main (dark blue), junction (light blue), and bubble pair (light and mid-green) contigs to the reference genome. Comparison of their alignments to that of two transcript isoforms (gray) is shown. The alignment blocks of the main contig support the lower isoform, while the junction contig alignment supports the presence of the upper alternative isoform. The alignment of the bubble contig pair identifies a heterozygous SNV.

Supplementary Figure 2. Assembly properties for k values of 26 to 50. a) Curves show N50 length (the contig length such that contigs of this length or longer contain 50% of the bases of the assembly), the total number of contigs, and the number of contigs longer than 100 bp.
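The N50 definition in the Supplementary Figure 2 caption can be illustrated with a short sketch. This is not code from ABySS or Trans-ABySS; the function name `n50` and the example lengths are made up for illustration, and the sketch uses the common convention of returning the first length at which the running total reaches half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half of all
        # assembled bases is the N50.
        if running >= half:
            return length
    return 0


# Hypothetical contig lengths in bp (not values from the paper).
contig_lengths = [100, 80, 60, 40, 20]
print(n50(contig_lengths))
```

Because the metric depends only on contig lengths, it can be recomputed for each k value from the assembled contig files without any alignment step.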
Supplementary Figure 3. Partitioning of BWA-aligned reads relative to Ensembl v54 transcript models.

Supplementary Figure 4. Fraction of Ensembl transcripts with at least 80% of exon length covered by unmerged contig alignments, as a function of normalized WTSS (Supplementary Note) coverage threshold. Results are shown for the 34,400 Ensembl v54 transcripts (corresponding to 19,508 unique gene IDs) that had a nonzero length-normalized WTSS mean coverage. Curves show results for the single longest contig (blue) and for all contigs (green). For single contigs, 64% and 72% of nonzero-coverage transcripts were covered to at least 80% of the exon length for WTSS coverage thresholds of 10 and 20; considering all contig alignments, the percentages were 88% and 92%.
Supplementary Figure 5. Assembled contigs across multiple k-values are merged to obtain a non-redundant set of contigs for analysis. a) The contig merging process is shown schematically for eight hypothetical assemblies (k1, k2, ..., k8). Contig sets from pairs of assemblies with adjacent k values are reciprocally compared. Those contigs having an exact match to a longer contig from the paired assembly are buried. Where contigs are equivalent, the contig from the assembly with the lower k is retained. From the output of this stage, adjacent pairs of contig sets are again merged (e.g. k12 and k34). Merging continues until only one contig set remains. Retained contigs are identified as parent contigs. Contigs that are neither buried nor parent are untouched. The merging process is applied to both the main and extended junction contigs. See Fig. 1b.
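The pairwise merging scheme in this caption can be sketched in a few lines. This is a simplified illustration and not the Trans-ABySS implementation: it treats "exact match" as plain substring containment, ignores reverse complements, and uses hypothetical function names (`merge_pair`, `merge_all`) and toy sequences.

```python
def merge_pair(lower_k, higher_k):
    """Merge two contig sets, burying contigs contained in the paired set."""
    kept = []
    for contig in lower_k:
        # Buried only by a strictly longer contig from the paired set, so an
        # equivalent (identical) contig from the lower-k set is retained.
        if not any(len(o) > len(contig) and contig in o for o in higher_k):
            kept.append(contig)
    for contig in higher_k:
        # Buried by a longer or identical contig from the lower-k set.
        if not any(len(o) >= len(contig) and contig in o for o in lower_k):
            kept.append(contig)
    return kept


def merge_all(contig_sets):
    """Merge adjacent pairs of contig sets until a single set remains."""
    sets = [list(s) for s in contig_sets]
    while len(sets) > 1:
        merged = [merge_pair(sets[i], sets[i + 1])
                  for i in range(0, len(sets) - 1, 2)]
        if len(sets) % 2:
            merged.append(sets[-1])  # an unpaired set moves to the next round
        sets = merged
    return sets[0] if sets else []


# Four toy assemblies ordered by increasing k (sequences are invented).
assemblies = [
    ["ACGTAC", "TTGA"],
    ["ACGTACGG", "TTGA"],
    ["GGGCCC"],
    ["GGCC", "ACGTACGGT"],
]
print(merge_all(assemblies))
```

A production version would compare contigs by alignment rather than by string search, since a quadratic substring scan does not scale to millions of contigs.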
Supplementary Figure 6. Coverage of Ensembl v54 transcripts by contig alignments, as a function of mean read transcript coverage. Mean transcript read coverage, C, was calculated for each transcript by aligning reads to the NCBI37 reference genome, which had been extended by exon-exon junctions, and normalizing the number of aligned reads for a transcript by the sum of exon lengths in the transcript. Distributions are shown for all transcripts with nonzero read-alignment coverage (gray), and for transcripts with de novo contig alignments (Trans-ABySS, for even-k assemblies) or reference-based contigs (Cufflinks, Scripture) representing at least 80% of the total exon length, either considering all contigs for that transcript (red) or the single longest contig (blue).
Supplementary Figure 7. Splice site support for 149,877 Trans-ABySS parent (v1.1.1) contig alignments, considering GT-AG, GC-AG and AT-AC donor-acceptor types (ref. 1). An ss2 contig alignment (97.9%) has at least one alignment intron with both acceptor and donor sites, an ss1 contig alignment (1.8%) has at least one intron with only an acceptor or donor, and an ss0 contig alignment (0.2%) lacks such support.
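One way to read the ss2/ss1/ss0 categories is as the best level of splice-site support among the introns of a contig alignment. The sketch below follows that reading and is an assumption, not the authors' code: it takes intron sequences already oriented on the transcript strand, treats GT, GC and AT as donor dinucleotides and AG and AC as acceptor dinucleotides, and uses hypothetical function names.

```python
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
DONORS = {donor for donor, _ in CANONICAL_PAIRS}
ACCEPTORS = {acceptor for _, acceptor in CANONICAL_PAIRS}


def intron_support(intron_seq):
    """Return 2, 1 or 0 for one intron sequence on the transcript strand."""
    donor = intron_seq[:2].upper()      # first two bases of the intron
    acceptor = intron_seq[-2:].upper()  # last two bases of the intron
    if (donor, acceptor) in CANONICAL_PAIRS:
        return 2  # recognized donor-acceptor pair
    if donor in DONORS or acceptor in ACCEPTORS:
        return 1  # only one end looks like a splice site
    return 0


def classify_alignment(intron_seqs):
    """Label a contig alignment by its best-supported intron."""
    best = max((intron_support(seq) for seq in intron_seqs), default=0)
    return f"ss{best}"


# Invented intron sequences for illustration.
print(classify_alignment(["GTAAGTCCAG", "CCAAAATT"]))
print(classify_alignment(["GTAAAATT"]))
print(classify_alignment(["CCAAAATT"]))
```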
Supplementary Figure 8. Schematic of comparing a transcript model (top) with contig alignments to identify annotated and novel transcripts and transcript structures. For each main and extended junction contig we compared coordinates of contig alignment blocks to coordinates of exons in each best-fitting transcript model, considering all mm9 UCSC gene, RefSeq, Ensembl and AceView transcripts. For a full match, edges of all internal blocks and transcript exons match, as do inside edges of the outer or terminal blocks and exons. Because contig ends do not necessarily correlate with transcript ends, outer edges of terminal alignment blocks may not match outer edges of corresponding exons, and so are not considered to represent novel events. A multi-block alignment that matches no known transcript models represents a potential novel transcript (not shown). For schematics for identifying candidate novel short and long 3' UTRs and candidate fusion genes see Supplementary Figs. 16 and 18.
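The full-match rule in this caption amounts to comparing intron chains: if every gap between consecutive alignment blocks coincides with a run of consecutive introns in the transcript model, then all internal edges and the inside edges of the terminal blocks match, while the outer edges are ignored. The sketch below takes that interpretation; the function names and coordinates are hypothetical, and coordinates are assumed to be half-open (start, end) pairs on one chromosome and strand.

```python
def introns(blocks):
    """Return the gaps between consecutive (start, end) blocks."""
    ordered = sorted(blocks)
    return [(left[1], right[0]) for left, right in zip(ordered, ordered[1:])]


def is_full_match(contig_blocks, exons):
    """True if the contig's intron chain is a contiguous part of the model's."""
    chain = introns(contig_blocks)
    model = introns(exons)
    if not chain:
        return False  # single-block alignments carry no splice structure
    n = len(chain)
    return any(model[i:i + n] == chain for i in range(len(model) - n + 1))


# Invented coordinates: a four-exon transcript model and two contig alignments.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
matching = [(150, 200), (300, 400), (500, 560)]   # outer edges differ: allowed
skipped_exon = [(150, 200), (500, 560)]           # skips the second exon

print(is_full_match(matching, exons))
print(is_full_match(skipped_exon, exons))
```

An alignment that fails this test against every model is a candidate for a novel event, such as the skipped exon in the second example.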
Supplementary Figure 9. RT-PCR validation of a 36-bp novel exon prediction in the Insr gene, which was subsequently reported in a shorter full-length RIKEN cDNA clone for adult male testis, in a more recent set of known gene transcript models. a) UCSC genome browser mm9 screenshot showing (top to bottom) Tag-seq data (unpublished), H3K4me3 ChIP-seq data (ref. 2), exonerate alignments for main contigs, read-alignment pileup, RT-PCR primers (blue arrow) and a range of transcript and other annotations. b,c) Detailed view of the RT-PCR primers on the exons flanking the novel exon. While the pileup coverage is greater than 100 on the flanking exons, the 36-bp novel exon is so much shorter than the 50-bp reads that only two BWA-aligned reads support the novel exon (not shown). d) RT-PCR gel image showing the expected 185-bp product, but not the annotated 149-bp product. e) The approximate alignment coverage for the gene (vertical red line) shown relative to cumulative distributions of transcript coverage for all Ensembl mouse transcripts (gray line) and all contigs whose alignments covered at least 80% of the total exon length of a transcript (see Fig. 1a). The novel exon corresponded to 12 amino acids, and overlapped exons in human and rat RefSeq transcript alignments (not shown). All three contigs in the region contain this exon, suggesting that only one isoform is expressed. Despite the gene being relatively highly expressed (read coverage for flanking exons is ~130-fold), the novel exon is shorter than the 50-bp reads, and so has only two reported read alignments. In contrast, read alignments to the assembled contigs indicate a ~90-fold coverage over this detected novel exon (data not shown).
Supplementary Figure 10. Novel UTR prediction for the Nlrp6 gene. a) UCSC genome browser mm9 screenshot showing (top to bottom) Tag-seq data for the positive and negative strands, an H3K4me3 enrichment profile, exonerate alignments for main contigs, read-alignment pileup, RT-PCR primer positions (blue arrow), and a range of transcript annotations. b,c) Details of the RT-PCR primer locations. d) RT-PCR gel image showing the expected 856-bp product. e) The approximate alignment coverage for the two annotated genes (vertical red lines, ~400 and ~2100) relative to distributions shown in Fig. 1a. The evidence for the detected novel UTR on Nlrp6 includes the following. The main H3K4me3 enrichment signal (ref. 2) extends across a short UCSC or AceView transcript, while weaker H3K4me3 enrichment is consistent with short UCSC and AceView Nlrp6 transcripts. Numerous shorter and particularly longer contigs suggest that the gene model for Nlrp6 is incomplete, and that transcripts extend between this locus and the main enriched H3K4me3 region. Read coverage is approximately 560 for the Nlrp6 transcripts, and higher (approximately 1175) for the upstream transcripts; consistent with this high expression, there is widespread low-level intergenic or (novel) intronic transcription that is reflected in many unspliced contigs. The longest contig exactly reconstructs the ORF part of the RefSeq transcript. The set of contig alignments at the upper left extends ~148 kb upstream to a very highly expressed (~6500 pileup) cytochrome P450 gene, Cyp2e1.

Supplementary Figure 11. A novel transcript prediction. a) UCSC genome browser mm9 screenshot showing (top to bottom) exonerate alignments for main contigs, PE reads, a read-alignment pileup, RT-PCR primer positions (blue arrow), a range of transcript annotations and mammalian conservation. b,c) Details of the RT-PCR primer locations. d) RT-PCR gel image showing the expected 264-bp product. e) The approximate alignment coverage for the novel transcript (vertical red line, ~31) relative to distributions shown in Fig. 1a.
Supplementary Figure 12. Alignments of contigs representing the Shank2 gene support both an RT-PCR-validated 21-bp skipped exon (red arrow), and a novel, H3K4me3-supported (ref. 2) TSS that is upstream of the 5'-most Ensembl TSS. a) mm9 UCSC genome browser view of Shank2 showing (top to bottom) Tag-seq data for the positive strand, an H3K4me3 enrichment profile, exonerate alignments for main contigs, BWA read-alignment pileup, RT-PCR primer positions (blue arrow), and a range of transcript annotations. b) Detail of RT-PCR primers, with a red arrow indicating the skipped exon. c) Detail of the skipped exon. d) RT-PCR gel, showing the 200-bp annotated and 179-bp novel products. e) The vertical red line shows the approximate read alignment coverage for the gene relative to distributions shown in Fig. 1a.

Supplementary Figure 13. Assembly can generate contigs for exons with low read alignment densities. Sfrs3 is a member of the SR splicing factor family, which has 11 and 10 members in human and mouse, respectively (ref. 3). In human, SFRS3 shares a splicing pattern with six other family members: a cassette exon that introduces a premature stop codon is skipped in the reference isoform but included in an alternative isoform (ref. 3). a) For the mouse Sfrs3 shown, exons overlap chained self-alignment blocks. Consistent with this, aligned-read coverage is low on exons flanking the retained intron; however, de novo assembly generates informative contigs. Contig k values and normalized k-mer coverages are consistent with transcripts having a wide range of expression levels (viz. k45:11.2 vs. k31:3.0). A relatively highly expressed 1629-bp k45:11.2 contig is consistent with the RefSeq reference isoform, while k37:14.3 and k33:17.9 contigs show the retained intron. This gene's retained intron is one of the three known cases shown as red circles in Supplementary Fig. 15. b) A Sircah (ref. 4) splice graph representation of the main contig alignments.
Supplementary Figure 14. Empirical distribution functions for mean normalized read coverage, C, for ENSMUSTs and transcripts with novel retained introns. The graph shows 34,400 ENSMUSTs with nonzero coverage (gray), and 181 of the 250 transcripts with novel retained introns (red) that had UCSC gene IDs or ENSMUST IDs. Approximately 75% of transcripts with novel retained introns had mean normalized read coverage that was at or above the 90th percentile coverage for the Ensembl transcripts.

Supplementary Figure 15. Coverage metrics for known and novel retained introns. The axes are the mean read coverage for a retained intron's flanking exons, and the ratio of the mean coverage of the retained intron to the mean coverage of the flanking exons. Contours summarize 5314 retained introns from the mouse ASTD v1.1 database (ref. 5). Blue squares show 250 non-redundant novel retained introns from the current work. Lower coverage for the flanking exons and higher intron-to-flanking exon coverage ratios were consistent for three examples of retained introns for SR splicing factor genes, which undergo unproductive splicing as part of a regulatory mechanism (ref. 6) (red circles, see also Supplementary Fig. 13). Detailed work may prioritize the retained introns that are associated with less highly expressed genes and have larger coverage ratios (upper left quadrant), while those in the lower right quadrant may be less biologically relevant.
Supplementary Figure 16. Schematic of method for identifying novel short and long 3' UTRs. a) A cDNA with a poly(A) tail. End-junction (EJ) reads and poly(A)-mate (PAM) reads that were generated from the cDNA are identified from the read sequence file. b) 50-bp poly(A) sequences were added to 3' ends of reference transcript sequences (gray). Contig sequences (blue) are expected to terminate in a poly(A) sequence whose length is less than the assembly k; contig sequences were padded with 50-bp poly(A) sequences on their 3' ends and 50-bp poly(T) sequences on their 5' ends. c) The fragment length distribution, i.e. the measured insert length for paired end reads, was determined from distances between mate pairs mapped to contigs (shown here for k=38). d) The distribution of the number of T's in M 50-bp reads. Sequence reads with very high proportions of T are likely to belong to cDNA poly(A) tails (right edge of the graph). e) Aligning the transcript-read (short blue rectangles) from EJ and PAM mate pairs to reference transcript sequences (gray) to confirm annotated 3' UTR ends (e1) and identify novel short 3' UTR ends (e2). (e3) Refining estimates of ends of novel long 3' UTRs by aligning, to contigs (blue), reads that do not map to transcripts.
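Panels b and d of this schematic describe two simple sequence operations: padding sequences with homopolymer runs, and counting T's in a read to flag likely poly(A)-tail mates. A minimal sketch follows. The function names are hypothetical, and the default threshold of 49 T's out of 50 bases is borrowed from the "49/50 T mate" filter named in Supplementary Table 4; it is not claimed to be the only criterion the authors applied.

```python
def pad_contig(seq, pad=50):
    """Add poly(T) to the 5' end and poly(A) to the 3' end of a contig."""
    return "T" * pad + seq + "A" * pad


def pad_reference(seq, pad=50):
    """Add poly(A) to the 3' end of a reference transcript sequence."""
    return seq + "A" * pad


def t_count(read):
    """Number of T bases in a read, ignoring case."""
    return read.upper().count("T")


def is_poly_a_tail_read(read, min_t=49):
    """True if the read is almost entirely T, as expected for a tail mate."""
    return t_count(read) >= min_t


# Invented sequences for illustration.
print(pad_contig("ACGT", pad=3))
print(is_poly_a_tail_read("T" * 49 + "G"))
print(is_poly_a_tail_read("ACGT" * 12 + "AC"))
```

Both strands are padded on contigs because the library was not strand-specific, so a contig may represent either orientation of the source transcript.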
Supplementary Figure 17. a) PAM-reads identify approximate known and novel polyadenylation sites in the 3' UTR of Dmgdh (Supplementary Fig. 16e1,2). The origin of the insert length distribution (Supplementary Fig. 16c) is located at the left-most edges of signal peaks in the stringent evidence pileup track (second from top), and the shaded rectangles correspond to the width of the peak in the insert length distribution. The predicted novel polyadenylation site (left) is consistent with EST evidence. b) PAM-reads identify three candidate polyadenylation sites in the 3' UTR of Sult3a1 (Supplementary Fig. 16e3). 3' UTRs that are longer than annotated 3' UTRs are supported by contig alignments (horizontal blue bars) and read alignments.
Supplementary Figure 18. Schematic of detecting a fusion gene. a,b) The contig aligns to two genomic regions. The regions may be on different chromosomes, or on one chromosome but separated by a distance that is much longer than the ~200-bp PE insert length (Supplementary Fig. 16a). The contig breakpoint (a, red line) must be supported by reads that align with no mismatches to the contig and cross the breakpoint. The contig alignments may also have mate-pair support from reads aligned to the EEJ-extended genome (b). Annotated transcripts are shown in gray.
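The two conditions in this caption (distant alignment regions, and perfect-match reads crossing the contig breakpoint) can be combined into a simple candidate filter. The sketch below is illustrative only: the caption says "much longer than" the ~200-bp insert length without giving a number, so the `min_gap` default of 10,000 bp and the `min_reads` default of 1 are placeholder assumptions, as are the type and function names.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # start of the aligned genomic region
    end: int    # end of the aligned genomic region (exclusive)


def genomic_gap(a, b):
    """Distance between two regions on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_fusion_candidate(a, b, breakpoint_reads, min_gap=10_000, min_reads=1):
    """Apply the distance and breakpoint-support conditions to one contig.

    breakpoint_reads is the number of reads that align to the contig with no
    mismatches and cross the breakpoint; min_gap and min_reads are assumed
    thresholds, not values from the paper.
    """
    far_apart = a.chrom != b.chrom or genomic_gap(a, b) >= min_gap
    return far_apart and breakpoint_reads >= min_reads


# Invented alignments for illustration.
left = Region("chr2", 1_000, 1_400)
right = Region("chr9", 52_000, 52_300)
print(is_fusion_candidate(left, right, breakpoint_reads=3))
```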
Supplementary Figure 19. Comparisons of gene-level expression metrics for Trans-ABySS, ALEXA-Seq (ref. 7) and a whole transcriptome shotgun sequencing (WTSS) pipeline (Supplementary Note). Results are shown for the 8190 Ensembl mouse genes that had fractional gene-level contig-to-exon coverage of at least 0.8. The Pearson's correlation coefficient was
Supplementary Figure 20. Overview of the transcriptome assembly and analysis pipeline workflow, outlining the steps from initial transcriptome assembly, contig processing and analysis outcomes. Boxes with rounded corners indicate operations, boxes with square corners represent results and blue boxes represent outcome results. a) When a genome sequence is not available, assembly makes contigs available for functional or phylogenetic analyses by methods that are not part of the Trans-ABySS pipeline. b) When a genome sequence is available but gene models have not been annotated, contig alignments to the genome can identify a range of transcript structures, as well as chimeric transcripts and variants like indels and SNVs. c) When transcript models are available for comparison to contig alignments, models can be refined and updated to include transcript variants.
Supplementary Figure 21. Length-normalized profiles of BWA read alignment densities, showing 20th, 50th and 80th quantiles.
Supplementary tables

Supplementary Table 1. Summary of read alignments for 147.1M 50-bp paired end (PE) Illumina reads (7.36 Gb). We retained only aligned reads that had a MAQ mapping quality 10; these had unique genomic alignment positions and few mismatches to the mm9 reference genome sequence or constructed exon-exon junction sequences. Junctions were constructed for consecutive exons from UCSC, RefSeq, Ensembl and AceView transcripts. Read counts relative to genes were calculated using Ensembl v54. Percentages in MAPQ filter columns are relative to Total mapped numbers, and those in Aligned to columns are relative to the number of retained read sequences.

           Total mapped   MAPQ filter:        MAPQ filter:         Aligned to:           Aligned to:         Aligned to:
                          Filtered            Retained             Exons/EEJ             Introns             Intergenic
# reads    136,685,932    17,999, (13.17%)    ,686,768 (86.83%)    91,935,338 (77.46%)   2,901,894 (2.45%)   7,678,810 (6.47%)
Gb

Supplementary Table 2. Run times.

Trans-ABySS

Assembly: Using ABySS 1.2.1, assemblies for k=26 to 50 completed in 4.7 hours of wallclock time and 370 CPU-hours using 25 machines, each of which had 8 hyperthreaded cores in two Intel E GHz CPUs, and 16 GB of RAM.

Analysis: Merging a total of 22 million contigs across 25 assemblies completed in about 5-6 hours. Blat alignments completed in about minutes of wallclock time per 1000 contigs. Exonerate alignments completed in about 100 minutes of wallclock time per 1000 contigs. Novelty detection completed in about 5-6 hours wallclock time for 1.2 million alignments.

Tophat/Cufflinks/Scripture

Tophat: This was run as 8 parallel jobs (one per lane of data), each of which took an average of 6.75 hours. Time to sort, sam2bam, merge, was about 4 hours total CPU time. Total Tophat run time: ~60 CPU hours, which was required for both Cufflinks and Scripture.

Cufflinks: 1 job, 12 CPU hours.

Scripture: 24 jobs, ~30 minutes each on average: 12 CPU hours.
Supplementary Table 3. Summary of candidate transcript events that were identified as novel relative to all UCSC, RefSeq, Ensembl and AceView transcript models.

Event type                    Contigs with events (a)    Unique contig events (b)    Genes affected
Novel exons
Novel skipped exons
Novel introns
Alternative exon splicing
Novel UTRs
Retained introns
Novel transcripts
Novel polyadenylation sites

(a) Total number of contigs containing novel events relative to annotated transcript models. In some cases multiple contigs identify the same event.
(b) The number of unique genomic locations represented by the contig events. These identify unique transcript events.
Supplementary Table 4. Summary results for identifying annotated and novel polyadenylation sites. EJ-reads and PAM-reads were mapped to NCBI37 (mm9) UCSC (ref. 8), RefSeq (ref. 9), Ensembl (ref. 10) and AceView (ref. 11) transcript models, and to GenBank (ref. 12) mRNAs.

a) EJ-read mappings

EJ-reads that mapped to transcript models (reads, all transcripts):
  >50 bp (novel short)     6,505
  <=50 bp (known)          11,060
  Unmapped                 200,676
  Total                    218,242

EJ-reads that did not map to transcript models (reads, contigs):
  >50 bp                   13,016
  <=50 bp                  5,221
  Unmapped                 182,439
  Total                    200,676

b) PAM-read mappings

PAM-reads that mapped to transcript models (reads, all transcripts):
  >300 bp (novel short)    4,424
  <=300 bp (known)         34,699
  Unmapped                 10,240
  Total                    49,363

PAM-reads that did not map to transcript models (reads, contigs):
  >300 bp                  327
  <=300 bp                 2,243
  Unmapped                 7,670
  Total                    10,240

c) Transcripts identified by EJ-reads

Filter                                              Known    Novel short    Novel long    Total
All transcripts mapped by EJ-reads                  na       4,667          8,885         13,552
Novel short (>50 bp), novel long (<=50 bp)          2,774    2,664          2,807         5,471
Mate read maps within range on same transcript      2,225 1, ,908
Stretch of T prefix > 10 bp of read
Transcripts with at least 2 EJ-reads of support

d) Transcripts identified by PAM-reads

Filter                                              Known    Novel short    Novel long    Total
All transcripts with mapped PAM-reads               na       7,496          1,069         8,565
Novel short (>300 bp), novel long (<=300 bp)        6,672 1, ,450
Has at least 1 PAM-read with a 49/50 T mate
Filtered for high AT content (80%) and antisense    2,
Transcripts with at least 2 PAM-reads of support
Filtered for reads with genomic mapping
Manually reviewed
Supplementary Note

De novo transcriptome assembly

Non-normalized transcriptome shotgun libraries differ from whole genome shotgun libraries in presenting a very wide range of sequence representations to an assembler. We address expression level differences by using a wide range of k values to assemble contigs that represent cDNAs, then merging the contig lists from independent assemblies into a smaller set of meta-assembly contigs for analysis. Transcriptome shotgun libraries also differ from whole genome shotgun libraries in that many genes express multiple transcript isoforms, and so present multiple correct, overlapping paths to an assembler. In contrast, in genome assembly, a single correct assembly path is expected through any genomic region, with the exception of repetitive and duplicated sequences and those representing haplotypic variation or mutational alterations.

ABySS captures single nucleotide variation within a sample as pairs of short sequences, which are referred to as bubble contigs (Supplementary Fig. 1). The variant with the highest coverage is represented in the assembled contigs, but both variants are written out to a separate file as a bubble contig pair that can be analyzed independently to identify allelic variation within the sample and SNVs relative to known variants.

ABySS typically handles heterozygous indel variants by creating a pair of short contigs for each variant in the initial assembly stages (Supplementary Fig. 1b,c). The contig representing a deletion variant usually consists of sequences of length k-1 flanking the insertion point, and thus is characteristically (2k-2) bp in length. The contig representing an insertion variant consists of the same (2k-2) bp sequence, plus the additional sequence representing the insertion, and is therefore somewhat longer than the (2k-2) bp deletion variant. We refer to these contigs as junction contigs.
Depending on assembly parameters, individual junction contigs may or may not be incorporated into longer contigs in later stages of the assembly (Supplementary Fig. 1d) (see Methods). As we reported previously (ref. 13), in transcriptome assembly these junction contigs also capture exon content differences between transcript isoforms. While results for SNVs and indels are not reported here, our pipeline therefore includes methods for bubble and junction contigs.

Given the above considerations, the Trans-ABySS workflow consists of the following stages: 1) assembling reads into contigs using ABySS, 2) aligning contigs to the reference genome, and 3) analyzing the contig alignments to correlate with known transcript annotations and to identify SNVs, indels, novel transcripts and transcript structures, and gene rearrangements and fusions.

From each assembly, we considered all contigs of length L ≥ (2k-2) bp, and all bubble contigs; summed across all assemblies, there were 9.5 M of the former and 346,787 of the latter. To reduce the number of L ≥ (2k-2) bp contigs analyzed, while maintaining the transcript representation provided by all 25 assemblies, we merged the assemblies by removing ('burying') contigs that were redundant because they were exactly represented within longer ('parent') contigs in another assembly. To accomplish this, we iteratively and reciprocally aligned contigs between pairs of assemblies, removing redundant contigs at each round (Supplementary Fig. 5). The iterative burying process returned a set of 1,200,130 non-redundant contigs (Fig. 1b), which we refer to as the main contig set (Supplementary Fig. 20).

Preliminary analysis showed that a junction contig shorter than (2k-2) bp can be assembled when there are short homologous sequences on either side of the junction. To ensure that such contigs were included in our dataset for analysis, we identified contigs with length L < (2k-2) bp for which mate pair information indicated overlap with a single candidate contig neighbor at each end. To support robust genome alignments for these small contigs, we extended them by adding their two neighboring contig sequences. We refer to these as extended junction contigs (Supplementary Figs. 1, 20). Subsequent merging reduced the 96,019 extended junction contigs across all assemblies to 16,287 contigs for analysis.

Alignments of main and extended junction contigs were compared to structures of known transcript models in order to identify novel transcripts and alternative transcript structures. Alignments for all contigs were used to identify SNVs and indels relative to the genome (data not shown), and candidate fusion genes were identified from the main contig and extended junction contig alignments (Supplementary Fig. 20).

Issues for de novo and reference-based transcriptome assembly

A number of issues pose challenges to both reference-based and de novo assembly approaches. First, the library protocol that we used generated double-stranded cDNA, and so did not retain the strand of the original transcript.
While for spliced contig alignments we inferred the strand of the source transcript from the splice sites in the contig alignments, for some cases confirmation would require orthogonal evidence. It is likely that directional library protocols currently under development will reduce the complexity of such analysis (ref. 14). Second, while a de novo approach can be robust to sequence similarity between exons, shared sequences that are highly similar will halt contig extension, with repetitive regions assembling into separate contigs, each of which aligns to multiple locations. Third, aligned-read densities are non-uniform along exons due to multi-mapping and other technical biases. Fourth, isoform reconstruction remains problematic for genes that have multiple expressed isoforms. Although suggested transcript models have been reported for both de novo and reference-based assembly algorithms, complex alternative isoforms cannot be reconstructed reliably, due to short read lengths and short fragment lengths for paired end reads. Also, attempts to use expression levels in inference fail due to both theoretical (under-, over- or ill-defined linear mathematical models) and practical (3'/5' sequence bias, Supplementary Fig. 21) obstacles. Unless one is supplied with reads that associate longer lengths across transcripts, assembly methods can at best report splice diagrams for genes with alternative isoforms.

Comparing de novo and reference-based assembly

We ran TopHat Beta on each of the eight lanes of data separately, then sorted and joined the output .bam-format (ref. 19) files into a single merged file, which we used as input into Cufflinks Beta (02 July 2010), and Scripture (ref. 21) Beta (22 June 2010). For our TopHat analysis we generated the intron result set by merging the resulting BED-format files from each lane, and accumulating scores for identical introns. Unique introns for the other three tools were generated from exonerate alignments for Trans-ABySS contigs, BED files for Scripture contigs, and GTF files for Cufflinks contigs. We then compared the predicted splice sites to the unique coordinates of all the donor-acceptor pairs in the reference annotations, which corresponded to all nonredundant introns for the union of UCSC, RefSeq, Ensembl and AceView transcript models. A splice site was only considered to match between datasets if the coordinates of the donor-acceptor pair matched exactly. Supplementary Table 2 outlines run times.

TopHat identified alignments for 145,798,588 (78.8%) of 184,915,546 reads. Of the aligned reads, 592,864 (0.4%) were gapped or split alignments; these identified 141,846 unique dinucleotide splice sites, which we compared against the unique coordinates of all the donor-acceptor pairs in UCSC, RefSeq, Ensembl and AceView gene annotations. Methods that use split read alignments may have difficulty in detecting exons that are shorter than the read length, particularly when 50-bp reads are used. For TopHat, every detected splice junction is required to be supported by at least one read that anchors by a user-defined minimum length on either side of a split.
TopHat's anchor requirement makes it insensitive to exons shorter than the anchor length, and also less sensitive to relatively short exons, especially when these are in isoforms that are weakly expressed. Consequently, using the TopHat spliced read alignments as input, we observed that Cufflinks was strongly biased against detecting shorter exons. To estimate performance differences between contig alignments and spliced read alignments more directly, we compared dinucleotide splice sites detected by Trans-ABySS and TopHat using the splice sites in UCSC gene transcripts as our reference set. We included TopHat because, although the assembly of exons is deferred to the Cufflinks software, the splice sites are reported by TopHat. Fig. 2 shows this comparison; sensitivity (SN) and specificity (SP), relative to the reference junctions, are approximate metrics for it. The SN reported is the fraction of all unique splice sites in the UCSC, RefSeq, Ensembl and AceView transcript models that are detected. SN, as reported, is an underestimate, because the reference set includes splice sites from unexpressed transcripts. The SP reported is the ratio of the number of detected introns that match reference introns to the total number of introns detected. It too is an underestimate, because apparently non-specific predictions include not only false positives, but also true positive exon-exon junctions that are novel relative to the reference intron set.

Detecting novel polyadenylation sites

Alternative polyadenylation sites can affect mRNA stability, translocation and translation [22]. For fission yeast, polyadenylation sites have been identified from single-end read RNA-seq data through reads that aligned at junctions of transcripts and poly(A) tails (end-junction or EJ reads) [23]. In a transcriptome assembly, a contig representing a polyadenylated transcript should terminate in a homopolymer-A sequence whose length approaches k. In our study, the read length was 50 bp, while the merged contig set included contigs from assemblies with 26 ≤ k ≤ 50. Given this, we expect that terminal poly(A) sequences for merged contigs will be shorter than the read length, which could interfere with the EJ-read alignments. We addressed this by adding 50-bp poly(A) and poly(T) sequences to the 3' and 5' ends of each contig, respectively. Similarly, we added 50-bp poly(A) sequences to the 3' end of each reference (e.g. RefSeq) mRNA sequence (Supplementary Fig. 16). Contigs that are downstream of such a transcript contig in the de Bruijn graph represent the poly(A) tail, but are not incorporated into any particular transcript contig due to the difficulty of assembling simple sequence. Here, as an initial step towards a future graph-based analysis, we identified and annotated novel polyadenylation sites using end-junction (EJ-) and mate-pair (PAM-) reads in paired-end sequence data (Supplementary Fig. 16). An EJ-read spanned a poly(A) start site [23]; a PAM-read had one mate mapped to a poly(A) tail, while its mate mapped either to an annotated transcript or to a contig sequence.
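The padding step mentioned above, in which homopolymer tails are added so that EJ-reads have sequence to align against, might look like the following minimal sketch. The function names and the `TAIL_LEN` constant are hypothetical; only the 50-bp tail length and the choice of ends come from the text.

```python
TAIL_LEN = 50  # tail length in bp, equal to the read length used in the study


def pad_contig(seq, tail_len=TAIL_LEN):
    """Add poly(T) to the 5' end and poly(A) to the 3' end of a contig."""
    return "T" * tail_len + seq + "A" * tail_len


def pad_reference_mrna(seq, tail_len=TAIL_LEN):
    """Add poly(A) to the 3' end of a reference mRNA sequence."""
    return seq + "A" * tail_len


contig = "ACGTACGT"
padded = pad_contig(contig)
```

A contig is padded on both ends because its strand is not known in advance, whereas a reference mRNA is already in transcript orientation and only needs the 3' tail.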
We identified candidate EJ-reads spanning poly(A) start sites as reads whose sequence was prefixed by poly(T) runs that were at least 5 bp long. We identified candidate PAM-reads as those in which the mate's sequence contained 80% to 98% (40 to 49 of 50 nt) of Ts. We used BWA [24] v0.5.4 to map candidate EJ-reads and PAM-reads to known transcript annotations from UCSC, RefSeq, Ensembl and AceView, and to GenBank mRNAs. Files for all of these were downloaded from the UCSC mm9 genome browser [25]. To identify transcripts with candidate novel short 3' UTRs, we used the length distribution for PE reads and the distance from each PAM-read to the end of each transcript (Supplementary Fig. 16b,c and Supplementary Table 4). Specifically, we took mapping distances from a transcript end that were longer than 50 bp for EJ-reads, or longer than 300 bp for PAM-reads, to mark such cases.
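The candidate-selection rules above translate directly into simple sequence tests. The sketch below is an illustration only: the function names are hypothetical, and the thresholds (a poly(T) prefix of at least 5 bp, 80% to 98% Ts in the mate, and distance cutoffs of 50 bp and 300 bp) are the ones stated in the text. Integer arithmetic is used for the percentage test to avoid floating-point edge effects at exactly 40 or 49 of 50 bases.

```python
import re

# A read qualifies as a candidate EJ-read if it starts with >= 5 Ts.
POLY_T_PREFIX = re.compile(r"^T{5,}")


def is_candidate_ej_read(seq):
    """True if the read sequence is prefixed by a poly(T) run of >= 5 bp."""
    return POLY_T_PREFIX.match(seq) is not None


def is_candidate_pam_mate(seq, min_pct=80, max_pct=98):
    """True if T bases make up min_pct..max_pct percent of the mate."""
    n = len(seq)
    if n == 0:
        return False
    t_count = seq.count("T")
    return min_pct * n <= 100 * t_count <= max_pct * n


def marks_short_utr(distance_to_transcript_end, read_type):
    """True if the mapping distance exceeds the cutoff for this read type."""
    cutoff = {"EJ": 50, "PAM": 300}[read_type]
    return distance_to_transcript_end > cutoff
```

Note that a mate consisting of 50 Ts is excluded by the upper bound, which matches the 40 to 49 of 50 nt range given above.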
To identify candidate novel long 3' UTRs, all EJ-reads and PAM-reads that did not align to annotated transcript sequences were mapped with BWA to ABySS contig sequences. We identified contigs that had EJ-reads mapped to the ends and PAM-reads mapped within 300 bp of a contig end, and mapped the contigs to the mouse mm9 genome to determine the transcript product with the novel elongated 3' UTR. In such cases the contig alignment already suggested the extended 3' UTR, and the PAM-reads refined the estimate of the position of the end of the UTR.

We then filtered candidate polyadenylation sites, as follows. For shortening and lengthening cases using EJ-reads, we required EJ-reads to satisfy two conditions: that they map to the genome or to transcripts only when their poly(T) prefix or poly(A) suffix had been trimmed; and that their mate pair map bp from the opposite strand of the same transcript. We ranked mapping positions of a read, prioritizing positions with the fewest mismatches and then the shortest distance to a transcript end. We then required at least two reads to map to each position. Transcripts from the four annotated sets used were resolved to gene symbols when possible. For both shortening and lengthening cases using PAM-reads, we required that at least one of these had at least 49 Ts out of 50 bases in the poly(A) tail read. When a PAM-read mapped to more than one genomic location, we ranked mapping positions in the same way as for EJ-reads. To reduce the number of false positives, we rejected transcripts that had one or more 50-bp windows in which at least 80% of the bases were A or T. We then required at least two reads to map to each position.

For the 218,242 potential EJ-reads, requiring at least two reads of support for each transcript event and comparing events to four sets of transcript annotations and to GenBank mRNAs, we confirmed 71 annotated 3' UTR ends, as well as 36 novel short UTRs.
Mapping the unmapped reads to ABySS contigs then identified 22 novel long UTRs (Supplementary Table 4a,c). Of the 49,363 PAM-reads, 39,123 mapped to the transcript models and GenBank mRNAs. By requiring at least two PAM-reads for each event, we confirmed 1277 annotated 3' transcript ends, as well as 20 transcripts with novel short 3' UTRs (Supplementary Fig. 17). Mapping the unmapped reads to contigs then identified 10 transcripts with novel long UTRs (Supplementary Table 4b,d). By combining EJ- and PAM-read singletons, we also confirmed 9 annotated UTRs as well as 6 novel short UTRs. Overall, we confirmed polyadenylation start sites in 1299 annotated transcripts, and inferred 84 novel polyadenylation sites that corresponded to 56 novel short 3' UTRs and, from contig alignments, 32 novel long 3' UTRs (Supplementary Table 4, Supplementary Fig. 17). Relatively few novel events were predicted by both methods; in almost all cases a novel event was predicted by only one of the two methods.
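The read-level filters described above for candidate polyadenylation sites (ranking of mapping positions, the two-read support requirement, and rejection of A/T-rich transcripts) can be sketched as follows. This is a hedged illustration rather than the pipeline's code: the hit tuple layout `(position, mismatches, distance_to_transcript_end)` and all function names are hypothetical, and the A/T-window test assumes "at least 80%" of the 50 bases.

```python
from collections import Counter


def best_position(hits):
    """Pick one mapping position for a read.

    hits: list of (position, mismatches, distance_to_transcript_end).
    Fewest mismatches wins; ties go to the shortest distance.
    """
    return min(hits, key=lambda h: (h[1], h[2]))[0]


def supported_positions(read_hits, min_reads=2):
    """Keep positions chosen by at least min_reads reads.

    read_hits: dict mapping read id -> list of hits for that read.
    """
    counts = Counter(best_position(h) for h in read_hits.values() if h)
    return {pos for pos, n in counts.items() if n >= min_reads}


def has_at_rich_window(seq, window=50, min_pct=80):
    """True if any window of the given size has >= min_pct percent A or T."""
    seq = seq.upper()
    if len(seq) < window:
        return False
    at = sum(base in "AT" for base in seq[:window])
    if 100 * at >= min_pct * window:
        return True
    # Slide the window one base at a time, updating the A/T count.
    for i in range(window, len(seq)):
        at += (seq[i] in "AT") - (seq[i - window] in "AT")
        if 100 * at >= min_pct * window:
            return True
    return False


read_hits = {
    "r1": [(1000, 0, 20), (2000, 1, 5)],
    "r2": [(1000, 1, 20)],
    "r3": [(3000, 0, 10)],
}
kept = supported_positions(read_hits)
```

In the toy data, position 1000 is supported by two reads and is kept, while position 3000 has a single read and is dropped.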
Identifying fusion genes

To identify candidate contigs spanning gene fusion breakpoints, we apply filters to identify contigs that align discretely to distinct genomic regions using BLAT (Supplementary Fig. 18). We parse the five top-scoring alignments and perform the corresponding 10 pairwise comparisons. Initially, we discard any contig that contains a single alignment representing at least 95% of the contig length, as any candidate fusion generated from the relatively short remaining part of the contig is likely to be a false positive. Alignments are subsequently filtered for quality by requiring that alignment identity be at least 95%. To ensure that the entire contig is represented in the alignments and to minimize overlap between alignment pairs, we require that at least 95% of the entire contig length be covered by the alignments, and that no more than 5% of the contig bases, and none of the reference bases, be shared between alignments.

We then filter all candidate fusion alignments. We discard alignments to mtDNA or haplotype reference sequences. We reject candidate fusion contigs that are reported as a fusion candidate multiple times. Contig alignments that overlap RepeatMasker RNA repeat elements are also rejected, as are contigs that have fewer than two Bowtie read alignments spanning the candidate breakpoint (Supplementary Fig. 20a). As a final piece of confirmatory evidence we require that the contig alignments be supported by mate-pairs aligned to the exon-exon junction (EEJ)-extended reference genome and that the number of such supporting mate-pairs be within an acceptable range [4, 2000] (Supplementary Fig. 20b).

Quantifying gene-level expression

The Trans-ABySS pipeline includes a general method for determining a contig-based expression metric for gene loci, given a reference genome with transcript annotations.
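As an aside on the fusion-gene section, the alignment-pair filters it describes can be sketched as below. This is a minimal sketch under assumptions: the `Aln` record, its field names and the half-open coordinate convention are hypothetical and do not reproduce BLAT's PSL output; the later filters (mtDNA and haplotype sequences, repeat elements, read and mate-pair support) are not covered.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Aln:
    chrom: str
    t_start: int      # reference start (0-based, half-open)
    t_end: int
    q_start: int      # contig start (0-based, half-open)
    q_end: int
    identity: float   # percent identity
    score: int


def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def fusion_pairs(alns, contig_len, top_n=5):
    """Return alignment pairs that pass the pairwise fusion filters."""
    top = sorted(alns, key=lambda a: a.score, reverse=True)[:top_n]
    # Discard the contig if one alignment alone spans >= 95% of it.
    if any(100 * (a.q_end - a.q_start) >= 95 * contig_len for a in top):
        return []
    pairs = []
    for a, b in combinations(top, 2):  # five alignments give 10 comparisons
        if a.identity < 95.0 or b.identity < 95.0:
            continue
        shared = overlap(a.q_start, a.q_end, b.q_start, b.q_end)
        covered = (a.q_end - a.q_start) + (b.q_end - b.q_start) - shared
        shares_ref = (a.chrom == b.chrom and
                      overlap(a.t_start, a.t_end, b.t_start, b.t_end) > 0)
        if (100 * covered >= 95 * contig_len
                and 100 * shared <= 5 * contig_len
                and not shares_ref):
            pairs.append((a, b))
    return pairs


a = Aln("chr1", 1000, 1500, 0, 500, 99.0, 490)
b = Aln("chr5", 200, 690, 500, 990, 98.0, 480)
d = Aln("chr2", 10, 530, 480, 1000, 90.0, 400)   # fails the identity filter
pairs = fusion_pairs([a, b, d], contig_len=1000)
```

Here only the pair (a, b) survives: together the two alignments cover 990 of 1000 contig bases, share no contig or reference bases, and both exceed 95% identity.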
The method considers reads aligned to all contigs whose alignment blocks on a reference genome overlap with exons in transcript model annotations. For Ensembl v54 genes, we compare the expression levels predicted by this approach with those from two methods that align reads to a reference genome that has transcript annotations. The first method was ALEXA-Seq [7], whose expression values agree well with those from microarrays and qPCR. The second was a WTSS (whole transcriptome shotgun sequencing, i.e. RNA-Seq) pipeline that extends reference chromosome sequences with exon-exon junction sequences and is used for production-level analysis at the GSC (unpublished). For the 8190 genes with fractional contig-to-exonic coverage of at least 0.8, the expression levels for the two read-alignment methods were highly correlated, with a Pearson's coefficient of r2 = Correlation coefficients between Trans-ABySS and ALEXA-Seq and the WTSS pipeline were and respectively.
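The text does not spell out how the fractional contig-to-exonic coverage is computed. One plausible reading, offered here only as an assumption, is the fraction of a gene's exonic bases that are overlapped by contig alignment blocks. The sketch below implements that reading; the function names and the half-open `(start, end)` interval convention are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def exonic_coverage(exons, contig_blocks):
    """Fraction of exonic bases overlapped by contig alignment blocks."""
    exons = merge_intervals(exons)
    blocks = merge_intervals(contig_blocks)
    total = sum(end - start for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for b_start, b_end in blocks:
            covered += max(0, min(e_end, b_end) - max(e_start, b_start))
    return covered / total if total else 0.0


exons = [(100, 200), (300, 400)]
blocks = [(100, 200), (320, 400)]
fraction = exonic_coverage(exons, blocks)
passes = fraction >= 0.8   # threshold used for the 8190 genes in the text
```

Merging the intervals first ensures that overlapping exons from different transcript models, or overlapping contig blocks, are not counted twice.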
Validating novel transcripts and transcript events

We generated 50 µl of double-stranded cDNA by reverse transcribing 0.2 µg of DNase-treated RNA from a biological replicate (see Library construction and sequencing, above). We used 1.5 µl of cDNA for each RT-PCR reaction. Primers were designed with PrimerQuest from IDT SciTools [26], BatchPrimer3 [27], or Visual OMP (DNA Software, Ann Arbor, MI). Each primer pair was checked against the UCSC mouse mm9 assembly to confirm expected RT-PCR products. The following PCR cycle was repeated 40 times: 95 °C for 30 min, 53 °C for 30 min, and 72 °C for 60 min. For Csnk2a2, Fbrs, Foxn2, Kynu, and novel transcripts 'Event 17' and 'Event 18', primers were hybridized at 55 °C and the reaction was run for 35 cycles. RT-PCR products were resolved on a 1.8% agarose gel.

Product sizes for bands were estimated by a custom Matlab (Mathworks, Natick, MA) program that read an image file corresponding to a gel and a text file specifying ladder fragment sizes and expected mobilities (Supplementary Note). The user participated in lane tracking and, because the shape information for the ladder bands is used for de-noising sample lanes, manually confirmed the automatically identified ladder bands. The user then set a minimum threshold brightness for detecting bands. The program analyzed each sample lane, automatically identifying bands as local profile maxima, calculating a relative profile height at each maximum as an intensity metric, and assigning a product size to each maximum by linearly interpolating a size versus mobility relationship between the ladders. When a peak was saturated by an abundant product, the product size was estimated as the center of the plateau.
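The size-assignment step of the gel analysis can be illustrated with a short sketch. The original program was written in Matlab and is not reproduced here; the Python below is an independent illustration of linear interpolation between bracketing ladder bands and of taking the centre of a saturated plateau. The function names, the `(mobility, size_bp)` ladder layout and the toy values are all hypothetical.

```python
from bisect import bisect_left


def interpolate_size(mobility, ladder):
    """Estimate product size by linear interpolation between ladder bands.

    ladder: list of (mobility, size_bp) pairs for the ladder fragments.
    """
    ladder = sorted(ladder)
    mobilities = [m for m, _ in ladder]
    if not mobilities[0] <= mobility <= mobilities[-1]:
        raise ValueError("mobility lies outside the ladder range")
    i = bisect_left(mobilities, mobility)
    if mobilities[i] == mobility:
        return float(ladder[i][1])
    (m0, s0), (m1, s1) = ladder[i - 1], ladder[i]
    return s0 + (s1 - s0) * (mobility - m0) / (m1 - m0)


def plateau_center(profile, saturation):
    """Centre index of the first run of saturated profile values, or None."""
    start = next((i for i, v in enumerate(profile) if v >= saturation), None)
    if start is None:
        return None
    end = start
    while end + 1 < len(profile) and profile[end + 1] >= saturation:
        end += 1
    return (start + end) / 2


ladder = [(10, 1000), (20, 500), (40, 100)]
size = interpolate_size(15, ladder)                      # between 1000 and 500 bp
center = plateau_center([1, 5, 255, 255, 255, 4], 255)   # saturated at 255
```

For a saturated peak, the plateau centre would be used as the band position before its mobility is converted to a size with `interpolate_size`.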
WTSS aligned-read pipeline

Using a whole transcriptome shotgun sequencing pipeline (WTSS, unpublished), we constructed a sequence resource by extending the NCBI37 reference genome with a pool of non-redundant exon-exon junction sequences. The junction sequences were constructed using Ensembl [10], UCSC gene [8], RefSeq [9], AceView [11], and Genscan [28] transcript annotations from the UCSC genome browser [25], by concatenating (read length - 1) nucleotides from each side of each pair of consecutive exons for each transcript, and then eliminating redundant junctions from the pooled set. We aligned the PE reads to the sequence resource using BWA [24] v0.5.4, and manipulated the output .bam-format [19] file to assign reads that had aligned to exon-exon junctions to their absolute genomic positions. Coverage for Ensembl v54 genes was calculated using the subset of mapped reads that had a mapping quality of at least 10. UCSC wig-format and then bigwig-format files were
generated using SAMtools, Unix scripts and the UCSC wigToBigWig application, again removing reads with a MAQ mapping quality lower than 10. We determined length-normalized read density profiles along transcripts, from the BWA-aligned .bam file, using custom Java software (Supplementary Fig. 5).

Generating splice graph visualizations

Trans-ABySS contigs were aligned to the NCBI37/mm9 assembly using GMAP [29], and results were written out in GFF3 EST_match format. Sircah [4] was used to associate the contig alignments with genes using annotated gene start and end coordinates, and to draw a splicing diagram for each gene of interest (Supplementary Figure 13).

References

1. Burset, M., Seledtsov, I.A., and Solovyev, V.V., Nucleic Acids Res 28 (21), (2000).
2. Robertson, A.G. et al., Genome Res 18 (12), (2008).
3. Lareau, L.F. et al., Nature 446 (7138), (2007).
4. Harrington, E.D. and Bork, P., Bioinformatics 24 (17), (2008).
5. Koscielny, G. et al., Genomics 93 (3), (2009).
6. Lareau, L.F. et al., Adv Exp Med Biol 623, (2007).
7. Griffith, M. et al., Nature Methods [Epub ahead of print] (2010).
8. Hsu, F. et al., Bioinformatics 22 (9), (2006).
9. Pruitt, K.D., Tatusova, T., and Maglott, D.R., Nucleic Acids Res 35 (Database issue), D61-65 (2007).
10. Hubbard, T.J. et al., Nucleic Acids Res 37 (Database issue), D (2009).
11. Thierry-Mieg, D. and Thierry-Mieg, J., Genome Biol 7 Suppl 1, S (2006).
12. Benson, D.A. et al., Nucleic Acids Res 38 (Database issue), D46-51 (2010).
13. Birol, I. et al., Bioinformatics 25 (21), (2009).
14. Parkhomchuk, D. et al., Nucleic Acids Res 37 (18), e123 (2009).
15. Degner, J.F. et al., Bioinformatics 25 (24), (2009).
16. Hansen, K.D., Brenner, S.E., and Dudoit, S., Nucleic Acids Res 38 (12), e131 (2010).
17. Li, J., Jiang, H., and Wong, W.H., Genome Biol 11 (5), R50 (2010).
18. Trapnell, C., Pachter, L., and Salzberg, S.L., Bioinformatics 25 (9), (2009).
19. Li, H. et al., Bioinformatics 25 (16), (2009).
20. Trapnell, C. et al., Nat Biotechnol 28 (5), (2010).
21. Guttman, M. et al., Nat Biotechnol 28 (5), (2010).
22. Millevoi, S. and Vagner, S., Nucleic Acids Res 38 (9), (2009).
23. Nagalakshmi, U. et al., Science 320 (5881), (2008).
24. Li, H. and Durbin, R., Bioinformatics 25 (14), (2009).
25. Rhead, B. et al., Nucleic Acids Res 38 (Database issue), D (2010).
26. Owczarzy, R. et al., Nucleic Acids Res 36 (Web Server issue), W (2008).
27. You, F.M. et al., BMC Bioinformatics 9, 253 (2008).
28. Burge, C. and Karlin, S., J Mol Biol 268 (1), (1997).
29. Wu, T.D. and Watanabe, C.K., Bioinformatics 21 (9), (2005).
More informationSUPPLEMENTARY INFORMATION
AS-NMD modulates FLM-dependent thermosensory flowering response in Arabidopsis NATURE PLANTS www.nature.com/natureplants 1 Supplementary Figure 1. Genomic sequence of FLM along with the splice sites. Sequencing
More informationRNA-Seq de novo assembly training
RNA-Seq de novo assembly training Training session aims Give you some keys elements to look at during read quality check. Transcriptome assembly is not completely a strait forward process : Multiple strategies
More informationTranscriptome Assembly, Functional Annotation (and a few other related thoughts)
Transcriptome Assembly, Functional Annotation (and a few other related thoughts) Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 23, 2017 Differential Gene Expression Generalized Workflow File Types
More informationQIAseq Targeted Panel Analysis Plugin USER MANUAL
QIAseq Targeted Panel Analysis Plugin USER MANUAL User manual for QIAseq Targeted Panel Analysis 1.1 Windows, macos and Linux June 18, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej
More informationBarnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies
Barnacle: detecting and characterizing tandem duplications and fusions in transcriptome assemblies The MIT Faculty has made this article openly available. Please share how this access benefits you. Your
More informationGenome 373: Mapping Short Sequence Reads II. Doug Fowler
Genome 373: Mapping Short Sequence Reads II Doug Fowler The final Will be in this room on June 6 th at 8:30a Will be focused on the second half of the course, but will include material from the first half
More informationPerformance comparison of five RNA-seq alignment tools
New Jersey Institute of Technology Digital Commons @ NJIT Theses Theses and Dissertations Spring 2013 Performance comparison of five RNA-seq alignment tools Yuanpeng Lu New Jersey Institute of Technology
More informationRNAseq Differential Gene Expression Analysis Report
RNAseq Differential Gene Expression Analysis Report Customer Name: Institute/Company: Project: NGS Data: Bioinformatics Service: IlluminaHiSeq2500 2x126bp PE Differential gene expression analysis Sample
More informationAnalysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer. Project XX1001. Customer Detail
Analysis of neo-antigens to identify T-cell neo-epitopes in human Head & Neck cancer Project XX Customer Detail Table of Contents. Bioinformatics analysis pipeline...3.. Read quality check. 3.2. Read alignment...3.3.
More informationab initio and Evidence-Based Gene Finding
ab initio and Evidence-Based Gene Finding A basic introduction to annotation Outline What is annotation? ab initio gene finding Genome databases on the web Basics of the UCSC browser Evidence-based gene
More informationAnnotating Fosmid 14p24 of D. Virilis chromosome 4
Lo 1 Annotating Fosmid 14p24 of D. Virilis chromosome 4 Lo, Louis April 20, 2006 Annotation Report Introduction In the first half of Research Explorations in Genomics I finished a 38kb fragment of chromosome
More informationSupplement to: The Genomic Sequence of the Chinese Hamster Ovary (CHO)-K1 cell line
Supplement to: The Genomic Sequence of the Chinese Hamster Ovary (CHO)-K1 cell line Table of Contents SUPPLEMENTARY TEXT:... 2 FILTERING OF RAW READS PRIOR TO ASSEMBLY:... 2 COMPARATIVE ANALYSIS... 2 IMMUNOGENIC
More informationRNA-Seq with the Tuxedo Suite
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop The Basic Tuxedo Suite References Trapnell C, et al. 2009 TopHat: discovering splice junctions with
More informationSequence Analysis 2RNA-Seq
Sequence Analysis 2RNA-Seq Lecture 10 2/21/2018 Instructor : Kritika Karri kkarri@bu.edu Transcriptome Entire set of RNA transcripts in a given cell for a specific developmental stage or physiological
More informationMapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010
Mapping Next Generation Sequence Reads Bingbing Yuan Dec. 2, 2010 1 What happen if reads are not mapped properly? Some data won t be used, thus fewer reads would be aligned. Reads are mapped to the wrong
More informationHigh-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler
High-Throughput Bioinformatics: Re-sequencing and de novo assembly Elena Czeizler 13.11.2015 Sequencing data Current sequencing technologies produce large amounts of data: short reads The outputted sequences
More informationDeep Sequencing technologies
Deep Sequencing technologies Gabriela Salinas 30 October 2017 Transcriptome and Genome Analysis Laboratory http://www.uni-bc.gwdg.de/index.php?id=709 Microarray and Deep-Sequencing Core Facility University
More informationTruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)
tru TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR) Anton Bankevich Center for Algorithmic Biotechnology, SPbSU Sequencing costs 1. Sequencing costs do not follow Moore s law
More informationChang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang
Supplementary Materials for: Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John
More informationArray-Ready Oligo Set for the Rat Genome Version 3.0
Array-Ready Oligo Set for the Rat Genome Version 3.0 We are pleased to announce Version 3.0 of the Rat Genome Oligo Set containing 26,962 longmer probes representing 22,012 genes and 27,044 gene transcripts.
More informationBIOINFORMATICS ORIGINAL PAPER
BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 21 2011, pages 2957 2963 doi:10.1093/bioinformatics/btr507 Genome analysis Advance Access publication September 7, 2011 : fast length adjustment of short reads
More informationCSE 549: RNA-Seq aided gene finding
CSE 549: RNA-Seq aided gene finding Finding Genes We ll break gene finding methods into 3 main categories. ab initio latin from the beginning w/o experimental evidence comparative make use of knowledge
More informationmeasuring gene expression December 11, 2018
measuring gene expression December 11, 2018 Intervening Sequences (introns): how does the cell get rid of them? Splicing!!! Highly conserved ribonucleoprotein complex recognizes intron/exon junctions and
More informationAnnotation of a Drosophila Gene
Annotation of a Drosophila Gene Wilson Leung Last Update: 12/30/2018 Prerequisites Lecture: Annotation of Drosophila Lecture: RNA-Seq Primer BLAST Walkthrough: An Introduction to NCBI BLAST Resources FlyBase:
More informationTranscriptome Assembly and Evaluation, using Sequencing Quality Control (SEQC) Data
Transcriptome Assembly and Evaluation, using Sequencing Quality Control (SEQC) Data Introduction The US Food and Drug Administration (FDA) has coordinated the Sequencing Quality Control project (SEQC/MAQC-III)
More informationShort Read Alignment to a Reference Genome
Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Summer School in Bioinformatics Cambridge, September 2018 Aligning to a reference genome BWA Bowtie2 STAR GEM Pseudo Aligners for RNA-seq
More informationHomework 4. Due in class, Wednesday, November 10, 2004
1 GCB 535 / CIS 535 Fall 2004 Homework 4 Due in class, Wednesday, November 10, 2004 Comparative genomics 1. (6 pts) In Loots s paper (http://www.seas.upenn.edu/~cis535/lab/sciences-loots.pdf), the authors
More informationGenome annotation & EST
Genome annotation & EST What is genome annotation? The process of taking the raw DNA sequence produced by the genome sequence projects and adding the layers of analysis and interpretation necessary
More informationIntroduction to RNA-Seq
Introduction to RNA-Seq Monica Britton, Ph.D. Bioinformatics Analyst September 2014 Workshop Overview of Today s Activities Morning RNA-Seq Concepts, Terminology, and Work Flows Two-Condition Differential
More informationA Novel Approach to Clustering and Assembly of Large-Scale Roche 454 Transcriptome Data for Gene Validation and Alternative Splicing Analysis
A Novel Approach to Clustering and Assembly of Large-Scale Roche 454 Transcriptome Data for Gene Validation and Alternative Splicing Analysis Vitoantonio Bevilacqua 1,3,*, Fabio Stroppa 1, Stefano Saladino
More informationIntroduction to transcriptome analysis using High Throughput Sequencing technologies. D. Puthier 2012
Introduction to transcriptome analysis using High Throughput Sequencing technologies D. Puthier 2012 A typical RNA-Seq experiment Library construction Protocol variations Fragmentation methods RNA: nebulization,
More informationFigure 1. FasterDB SEARCH PAGE corresponding to human WNK1 gene. In the search page, gene searching, in the mouse or human genome, can be done: 1- By
1 2 3 Figure 1. FasterD SERCH PGE corresponding to human WNK1 gene. In the search page, gene searching, in the mouse or human genome, can be done: 1- y keywords (ENSEML ID, HUGO gene name, synonyms or
More informationC3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère
C3BI VARIANTS CALLING November 2016 Pierre Lechat Stéphane Descorps-Declère General Workflow (GATK) software websites software bwa picard samtools GATK IGV tablet vcftools website http://bio-bwa.sourceforge.net/
More informationReading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction
Lecture 8 Reading Lecture 8: 96-110 Lecture 9: 111-120 DNA Libraries Definition Types Construction 142 DNA Libraries A DNA library is a collection of clones of genomic fragments or cdnas from a certain
More information