RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop
The Basic Tuxedo Suite References
  Trapnell C, et al. 2009 TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
  Trapnell C, et al. 2010 Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
  Kim D, et al. 2013 TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
  Roberts A, et al. 2011 Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
  Roberts A, et al. 2011 Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
  Cufflinks assembles transcripts
  Cuffdiff identifies differential expression of genes/transcripts/promoters
  Trapnell C, et al. 2013 Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology
Alignment and Differential Expression
  Read set(s) -> TopHat -> bam file(s)
  bam file(s) + Existing annotation (GTF) -> Cuffdiff -> Toptables, etc.
  We followed these steps with the single-end reads
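The two steps in this workflow can be sketched as command lines. The Python helpers below only build the argument lists and do not run anything. All file names, the index prefix, the condition labels and the thread count are hypothetical placeholders. The options used (-o, -p, -G for TopHat; -o, -p, -L for Cuffdiff) should be checked against the manuals for the installed versions, and passing the GTF to TopHat with -G is an assumption about how the alignment was run.

```python
from typing import Dict, List, Optional


def tophat_command(index_prefix: str, reads: str, out_dir: str,
                   gtf: Optional[str] = None, threads: int = 4) -> List[str]:
    """Build a TopHat call for one single-end read set."""
    cmd = ["tophat", "-o", out_dir, "-p", str(threads)]
    if gtf is not None:
        cmd += ["-G", gtf]  # optional: known gene models to guide alignment
    return cmd + [index_prefix, reads]


def cuffdiff_command(gtf: str, groups: Dict[str, List[str]], out_dir: str,
                     threads: int = 4) -> List[str]:
    """Build a Cuffdiff call; replicate BAMs of one condition are comma-joined."""
    labels = ",".join(groups)  # dicts keep insertion order
    bam_args = [",".join(bams) for bams in groups.values()]
    return ["cuffdiff", "-o", out_dir, "-p", str(threads),
            "-L", labels, gtf] + bam_args


if __name__ == "__main__":
    # Hypothetical sample names; TopHat writes accepted_hits.bam into out_dir.
    samples = {"control": ["c1", "c2"], "treated": ["t1", "t2"]}
    groups = {}
    for condition, names in samples.items():
        for name in names:
            print(" ".join(tophat_command("genome_index", f"{name}.fastq",
                                          f"tophat_{name}", gtf="genes.gtf")))
        groups[condition] = [f"tophat_{n}/accepted_hits.bam" for n in names]
    print(" ".join(cuffdiff_command("genes.gtf", groups, "cuffdiff_out")))
```

Printing the commands first makes it easy to review them before handing the lists to a job scheduler or to subprocess.run.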
But, do we have all the genes?
  For organisms with sequenced genomes, gene models are stored in GTF files
  Assumptions:
    The GTF file contains annotation for ALL transcripts and genes
    All splice sites, start/stop codons, etc. are correct
  Are these assumptions correct for every sequenced organism?
  RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
  The method used depends on how much sequence information there is for the organism
Gene Construction (Alignment) vs. Assembly
  Genome-Sequenced Organisms: gene construction (alignment)
  Novel or Non-Model Organisms: assembly (Trinity software)
  Haas and Zody (2010) Nat. Biotech. 28:421-3
Gene / Transcriptome Construction
  Annotation can be improved even for well-annotated model organisms
    Identify all expressed exons
    Combine expressed exons into genes
    Find all splice variants for a gene
    Discover novel transcripts
  For newly sequenced organisms
    Validate ab initio annotation
    Compare different annotation sets
  Can assist in finding some types of contamination
    Reconstruction of rRNA genes
    Genomic/mitochondrial DNA in RNA library preps
Reference Annotation Based Transcript (RABT) Assembly
  Read set(s) -> TopHat -> bam file(s)
  bam file(s) + Existing annotation (GTF) [optional] -> Cufflinks -> Read-set specific GTF(s)
  Read-set specific GTF(s) -> Cuffmerge -> Merged GTF
  Merged GTF -> Cuffcompare -> Final assembly (GTF and stats)
  bam file(s) + Merged GTF -> Cuffdiff -> Toptables, etc.
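The assembly steps in this workflow might look like the following. This is a sketch that only builds argument lists. The file names and output directories are hypothetical. The options shown (cufflinks -g for annotation-guided assembly, cuffmerge -g/-s, cuffcompare -r/-o) are taken from the Cufflinks manual as I understand it and should be verified against the installed version.

```python
from typing import List, Optional


def cufflinks_command(bam: str, out_dir: str, guide_gtf: Optional[str] = None,
                      threads: int = 4) -> List[str]:
    """Assemble transcripts for one read set; -g switches on RABT assembly."""
    cmd = ["cufflinks", "-o", out_dir, "-p", str(threads)]
    if guide_gtf is not None:
        cmd += ["-g", guide_gtf]  # the existing annotation is optional
    return cmd + [bam]


def assembly_list_text(gtf_paths: List[str]) -> str:
    """Cuffmerge reads a text file with one assembly GTF path per line."""
    return "\n".join(gtf_paths) + "\n"


def cuffmerge_command(assembly_list: str, out_dir: str,
                      ref_gtf: Optional[str] = None,
                      genome_fasta: Optional[str] = None,
                      threads: int = 4) -> List[str]:
    """Merge the read-set specific GTFs into one merged GTF."""
    cmd = ["cuffmerge", "-o", out_dir, "-p", str(threads)]
    if ref_gtf is not None:
        cmd += ["-g", ref_gtf]
    if genome_fasta is not None:
        cmd += ["-s", genome_fasta]
    return cmd + [assembly_list]


def cuffcompare_command(merged_gtf: str, ref_gtf: str, out_prefix: str) -> List[str]:
    """Compare the merged assembly with the reference annotation."""
    return ["cuffcompare", "-r", ref_gtf, "-o", out_prefix, merged_gtf]


if __name__ == "__main__":
    samples = ["s1", "s2"]  # hypothetical read sets
    for s in samples:
        print(" ".join(cufflinks_command(f"tophat_{s}/accepted_hits.bam",
                                         f"cufflinks_{s}", guide_gtf="genes.gtf")))
    print(assembly_list_text([f"cufflinks_{s}/transcripts.gtf" for s in samples]), end="")
    print(" ".join(cuffmerge_command("assemblies.txt", "merged_asm",
                                     ref_gtf="genes.gtf", genome_fasta="genome.fa")))
    print(" ".join(cuffcompare_command("merged_asm/merged.gtf", "genes.gtf", "cmp")))
```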
TopHat Spliced Alignment to a Genome
Reference Annotation Based Transcript (RABT) Assembly
Cufflinks Identification of Incompatible Fragments Incompatible alignment
Cufflinks Minimum Paths to Transcripts
Cufflinks Abundance Estimation
Merging Cufflinks Assemblies
So Now We've Explored These Tools
We've Used Other Software in Conjunction
  HTSeq-count -> Raw Counts -> edgeR
  (But HTSeq-count and edgeR are independent)
And Then Came Some Extensions
Modules Introduced in 2014
  Cuffquant
    Improves efficiency of running multiple samples
    Stores data in compressed .cxb format, which can later be analyzed with Cuffdiff or Cuffnorm
  Cuffnorm
    Generates tables of expression values that are normalized for library size
    Tables are used as input to Monocle
  Monocle
    Used to analyze single-cell expression data
    Trapnell, et al., 2014, Nat. Biotech. 32:381
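To make "normalized for library size" concrete, here is a toy median-of-ratios scaling in plain Python. This is an illustration of the general idea only, not Cuffnorm's code; the choice of method, the gene names and the counts are all assumptions made for the example.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[float]]) -> List[float]:
    """Return one scaling factor per sample (median of ratios to the geometric mean)."""
    # Only genes observed in every sample contribute to the factors.
    rows = [row for row in counts.values() if all(c > 0 for c in row)]
    if not rows:
        raise ValueError("need at least one gene with nonzero counts in every sample")
    n_samples = len(rows[0])
    geo_means = [math.exp(sum(math.log(c) for c in row) / n_samples) for row in rows]
    return [median(row[j] / g for row, g in zip(rows, geo_means))
            for j in range(n_samples)]


def normalize(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return {gene: [c / f for c, f in zip(row, factors)]
            for gene, row in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts: sample 2 was sequenced about twice as deeply as sample 1.
    counts = {"geneA": [10, 20], "geneB": [100, 200],
              "geneC": [5, 10], "geneD": [0, 3]}
    print(size_factors(counts))
    print(normalize(counts))
```

After scaling, genes whose raw counts differ only because of sequencing depth (geneA to geneC here) end up with equal values in both samples.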
But Software Continues to Evolve
  HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
    Kim et al., 2015, Nat. Methods
    Planned to be TopHat3
    Faster than other aligners
    More accurate on simulated reads
But Software Continues to Evolve
  StringTie
    Pertea et al., 2015, Nat. Biotech.
    Probable successor to Cufflinks2
    Assembles more transcripts (based on simulated reads)
  Ballgown
    Frazee et al., 2015, Nat. Biotech.
    Bioconductor R package
    Probable successor to Cuffdiff2
    Includes useful Tablemaker preprocessor
A New Potential Game-Changer (2015)
  Kallisto ("Near-Optimal RNA-Seq Quantification")
    Bray et al. (http://arxiv.org/abs/1505.02710)
    Extremely fast; uses pseudo-alignment based on k-mers and de Bruijn graphs
  [Figure panels: Speed, Accuracy]
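The idea of k-mer based pseudo-alignment can be illustrated with a toy example. The sketch below is not Kallisto's implementation: it uses a plain hash from k-mers to transcript names instead of a de Bruijn graph, ignores reverse complements, and works on made-up sequences. It only shows the core step of intersecting the transcript sets of a read's k-mers to find the transcripts the read is compatible with, without computing a base-level alignment.

```python
from collections import defaultdict
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return dict(index)


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Intersect the transcript sets of all indexed k-mers in the read."""
    compatible = None
    for km in kmers(read, k):
        hits = index.get(km)
        if hits is None:
            continue  # k-mer not in the index (e.g. sequencing error): skip it
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible if compatible else set()


if __name__ == "__main__":
    # Two hypothetical isoforms sharing the first exon "ACGTTGCA".
    transcripts = {"t1": "ACGTTGCAGGATCC", "t2": "ACGTTGCATTAGGC"}
    index = build_index(transcripts, k=5)
    for read in ["CGTTGCA", "TGCAGGAT", "GTTGCATTAG"]:
        print(read, sorted(pseudoalign(read, index, k=5)))
```

A read from the shared exon stays compatible with both isoforms, while reads that cross into an isoform-specific region are narrowed to one transcript. Quantification then works with these compatibility sets instead of alignments.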
A Few Words About Bacterial RNA-Seq
Eukaryotic and Bacterial Gene Structures are Different
  Eukaryotes
    Gene structure includes introns and exons
    Splicing, polyadenylation
    Each mRNA is a discrete molecule when translated
  Bacteria / Prokaryotes
    Individual genes and groups of genes in operons
    Generally, no splicing, no poly(A)
    One mRNA can contain coding sequences for multiple proteins
Bacterial RNA-Seq Considerations
  rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules
  Splicing-aware aligners (such as TopHat) may not be useful
  Reads from polycistronic mRNA may overlap two genes
    How would HTSeq-count handle this?
  Compare genome alignments with transcriptome alignments
    Some aligners, such as bwa-mem, will report secondary alignments
    Transcriptome alignments can be used to generate a counts table for edgeR
  Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/rockhopper/)
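One way to think about the HTSeq-count question is to re-implement a simplified version of its "union" overlap rule. The sketch below is a toy model, not HTSeq code: it ignores strand, uses half-open integer coordinates, and the gene names and positions are hypothetical. Under this rule, a read that touches two genes of an operon is counted as ambiguous rather than assigned to either gene.

```python
from collections import Counter
from typing import Dict, Tuple


def assign_read(start: int, end: int, genes: Dict[str, Tuple[int, int]]) -> str:
    """Assign a read [start, end) to a gene using a simplified union rule."""
    overlapping = sorted(name for name, (g_start, g_end) in genes.items()
                         if start < g_end and g_start < end)
    if not overlapping:
        return "__no_feature"
    if len(overlapping) > 1:
        return "__ambiguous"  # read overlaps more than one gene
    return overlapping[0]


if __name__ == "__main__":
    # Two adjacent genes in a hypothetical operon.
    genes = {"geneA": (100, 400), "geneB": (410, 900)}
    reads = [(150, 250), (350, 450), (500, 600), (1000, 1100)]
    counts = Counter(assign_read(s, e, genes) for s, e in reads)
    for feature, n in sorted(counts.items()):
        print(feature, n)
```

In this example the read spanning positions 350 to 450 is lost to the gene-level counts, which is one reason to compare genome-based counts with transcriptome-based counts for bacterial data.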