RNA-Seq Workshop AChemS 2017 Sunil K Sukumaran Monell Chemical Senses Center Philadelphia

Benefits & downsides of RNA-Seq. Benefits: high resolution, sensitivity and large dynamic range; independent of prior knowledge (in contrast to predesigned probes in microarray analysis); unravels previously inaccessible complexities. Downsides: data analysis is not straightforward and methods continue to evolve; cost.

Types of RNA-Seq analysis: gene expression analysis; single-cell RNA-Seq (scRNA-seq); small RNA-Seq (miRNA-seq); analysis of RNA-protein/RNA-RNA interactions.

Goals of typical RNA-Seq analysis: identify expressed genes and transcripts; quantify gene expression in different conditions or tissues (differential expression); identify novel transcripts and genes (de novo assembly), including alternative splicing, novel transcribed genes, and transcriptomes from non-model organisms.

Comparison of sequencing platforms: 1st, 2nd and 3rd generation.

Overview of Illumina RNA-Seq https://www.slideshare.net/ueb52/uebuat-bioinformatics-course-session-23-vhir-barcelona

Sequencing strategies Which library preparation protocol to use? How many replicates? What is the optimal library size (sequencing depth)? Paired end or single end? Which data analysis pipeline to use?

Not all types of RNA encode information. The bulk (~95%) of cellular RNA is rRNA and tRNA. http://finchtalk.blogspot.com/2009/05/small-rnas-get-smaller.html

Quality and quantity of input RNA. High quality RNA is preferred, but is often not available. Needle biopsies, laser microdissection and formalin-fixed, paraffin-embedded samples yield low integrity RNA. The amount of RNA may be low by necessity or by design (e.g. scRNA-seq).

mRNA has to be selectively enriched. Approaches shown: polyA selection, RNase H, and Ribo-Zero. (Figure legend: symbol = magnetic bead.)

Stranded libraries are better! Stranded libraries preserve information on the strand of origin of the transcript. This is helpful when overlapping antisense transcripts occur in a genomic region (~19% of genes in the human genome!), e.g. the mouse Gng13 and Chtf18 genes.

How many replicates? Considerations include: technical variability of the RNA-Seq protocol; the intrinsic biological variability; the desired statistical power. Multiple samples can be sequenced in the same lane (multiplexing). Prepare all replicate libraries at once, to avoid batch effects.

Sequencing mode and length. Paired end is preferred for de novo transcriptome assembly and isoform level analysis. Single end sequencing is sufficient for gene expression studies. Illumina sequencer read lengths vary from 50 to 150 bp. Longer read length = better mappability.

Library size. Only a subset of the genome is transcribed. The dynamic range of gene expression is huge. Reliable detection of genes expressed at lower levels needs a bigger library size. scRNA-seq needs lower depth. Tools such as Scotty and RNASeqPower can help calculate the optimum library size and number of replicates based on pilot data. The ENCODE consortium guidelines: http://encodeproject.org/encode/experiment_guidelines.html

RNA-Seq Library preparation

Library specific index sequences allow pooling multiple libraries. ~6 libraries are pooled per lane for typical RNA-Seq. Hundreds of libraries are pooled for scRNA-seq.
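As a rough illustration of how index sequences let pooled libraries be separated again after sequencing, here is a minimal Python sketch. The index sequences, sample names and reads are hypothetical, and matching is exact only; production demultiplexers typically also tolerate sequencing errors in the index, which this sketch does not attempt.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group (index_sequence, read_sequence) pairs by sample name.

    Reads whose index matches no known library are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

# Hypothetical 6-base indexes and sample names, for illustration only.
index_to_sample = {"ATCACG": "taste_rep1", "CGATGT": "taste_rep2"}
reads = [
    ("ATCACG", "GATTTGGG"),
    ("CGATGT", "TTCAAAGC"),
    ("ATCACG", "AGTATCGA"),
    ("NNNNNN", "TCAAATAG"),  # unreadable index
]
pooled = demultiplex(reads, index_to_sample)
for sample, sequences in pooled.items():
    print(sample, len(sequences))
```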

Digital RNA-Seq uses barcodes to correct PCR bias (Proc Natl Acad Sci U S A. 2012 Jan 24;109(4):1347-52). Particularly useful when many cycles of PCR amplification are used (e.g. scRNA-seq).
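The idea of barcode-based correction can be sketched in a few lines of Python: reads that share the same gene, mapping position and molecular barcode are assumed to be PCR copies of one original molecule and are counted once. This is a simplified sketch, not the method of the cited paper; the gene names are borrowed from an earlier slide, while the positions and barcodes are made up, and barcode sequencing errors are ignored.

```python
from collections import defaultdict

def count_unique_molecules(alignments):
    """Count distinct (position, barcode) combinations per gene.

    `alignments` is an iterable of (gene, position, barcode) tuples.
    Reads with identical position and barcode are treated as PCR
    duplicates of a single original molecule.
    """
    seen = defaultdict(set)
    for gene, position, barcode in alignments:
        seen[gene].add((position, barcode))
    return {gene: len(molecules) for gene, molecules in seen.items()}

# Hypothetical alignments: the first two reads are PCR duplicates.
alignments = [
    ("Gng13", 100, "ACGT"),
    ("Gng13", 100, "ACGT"),
    ("Gng13", 100, "TTAG"),
    ("Chtf18", 250, "ACGT"),
]
molecule_counts = count_unique_molecules(alignments)
print(molecule_counts)
```

In this toy input Gng13 has three reads but only two distinct barcodes at that position, so it is counted as two molecules.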

Illumina Sequencing

From sequence to biological insights
FASTQ files: QC by FastQC/R
Read mapping: to genome/transcriptome/de novo
Expression quantification: summarize read counts (EM/union of exons)
QC by RSeQC
Differential expression analysis: gene/transcript level
Functional interpretation: enriched pathways/GO terms, integration with other data
Biological insights & hypothesis

FASTQ file format. FASTQ format is used by modern sequencers. It bundles a FASTA sequence and its quality data, four lines per read:
Line 1: sequence identifier (starts with @)
Line 2: raw sequence
Line 3: separator (+), optionally followed by the sequence identifier again
Line 4: quality values for the sequence, one character per base (! = lowest, ~ = highest)
Example:
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
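To make the four-line layout concrete, the following Python sketch parses the example record and converts each quality character to a numeric score. It assumes the Phred+33 encoding (score = ASCII code minus 33), which is consistent with "!" being the lowest and "~" the highest character; older instruments used other offsets, so check the encoding of your own files. The function name parse_fastq is illustrative only.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, phred_scores) from FASTQ lines.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ")
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores

record_lines = [
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
identifier, sequence, scores = next(parse_fastq(record_lines))
print(identifier, len(sequence), min(scores), max(scores))
```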

Sequencing QC, using FastQC: basic information (total reads, sequence length, etc.); per base sequence quality; overrepresented sequences; GC content; duplication level; etc.

FastQC report http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Per base sequence quality

Overrepresented Sequences Adapter

Challenges in RNA-Seq read alignment. How to correctly align short reads to the parent gene? Theoretically, the chance of a 100 bp read occurring more than once in a genome is infinitesimally small (there are 4^100 ≈ 1.6 × 10^60 possible 100 bp sequences, compared to a mammalian genome size of ~3 × 10^9 bp). But repeat elements, such as conserved regions in gene families and overlapping antisense genes, abound in the genome. About 1/3 of RNA-Seq reads span exon-exon junctions!
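The numbers on this slide can be checked with a few lines of Python. The calculation assumes, as the slide does, that all four bases are equally likely and independent at each position, which is exactly the assumption that real genomes with repeats violate.

```python
# Number of distinct sequences of length 100 over the alphabet A, C, G, T.
possible_reads = 4 ** 100

# Approximate size of a mammalian genome in base pairs (from the slide).
genome_size = 3 * 10 ** 9

# How many times larger the sequence space is than the genome.
ratio = possible_reads / genome_size

print(f"possible 100 bp sequences: {possible_reads:.1e}")
print(f"sequence space / genome size: {ratio:.1e}")
```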

Are you awake?

The shredded book analogy for short read alignment (adapted from a lecture by Michael Schatz, JHU)

Nature Methods 10, 1165-1166 (2013)

de Bruijn graph assembly. Repeat sequences make correct reconstruction and quantification difficult.
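A toy Python sketch shows why repeats are a problem in a de Bruijn graph. Each read is cut into overlapping k-mers, and each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The reads and the choice of k = 4 are invented for illustration; real assemblers use much larger k and also handle sequencing errors and coverage.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Two hypothetical reads that share the 3-mer "ACG".
graph = de_bruijn_graph(["TACGA", "CACGT"], k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The shared node ACG has two incoming and two outgoing edges, so the graph alone cannot tell whether the original sequences were TACGA and CACGT or TACGT and CACGA.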

Overview of read mapping and transcript identification
Model organisms (reference sequence available): RNA-Seq reads
  Align to genome with a splice-aware mapper (TopHat, STAR, HISAT)
    With GTF: analyze known transcripts, by union of exons (FeatureCounts) or EM algorithm (Cufflinks, RSEM)
    W/O GTF: discover novel transcripts; identify all transcripts (StringTie, Cufflinks)
  Align to transcriptome with an ungapped mapper (BWA, Bowtie): analyze known transcripts (RSEM, Kallisto)
Non-model organisms (poor or no reference sequence): RNA-Seq reads
  De novo assembler (Trinity)
  Align to de novo transcriptome (BWA, Bowtie)
  Analyze (RSEM, Kallisto)
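The "union of exons" approach to counting can be sketched as follows: the exons of all annotated isoforms of a gene are merged into one set of non-overlapping intervals, and any read falling inside them is counted once for the gene. This is a minimal Python sketch of the concept, not the algorithm of FeatureCounts or any other tool; the coordinates are hypothetical, intervals are half-open (start, end), and each read is reduced to a single position.

```python
def union_of_exons(exons):
    """Merge overlapping or touching (start, end) exon intervals."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def count_reads(read_positions, intervals):
    """Count read positions that fall inside any interval."""
    return sum(
        any(start <= pos < end for start, end in intervals)
        for pos in read_positions
    )

# Two hypothetical isoforms of one gene.
isoform_a = [(100, 200), (400, 500)]
isoform_b = [(150, 250), (400, 500)]
gene_model = union_of_exons(isoform_a + isoform_b)
read_count = count_reads([120, 220, 300, 450], gene_model)
print(gene_model, read_count)
```

The read at position 300 falls in an intron of both isoforms and is not counted. Gene-level counts of this kind cannot say which isoform a read came from, which is where EM-based methods come in.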

Alignment and annotation files. SAM is a text based file format for storing sequences aligned to a reference sequence. It consists of an optional header section (lines starting with @, describing e.g. the reference sequences and programs used) and a mandatory alignment section. Each line of the alignment section has 11 mandatory fields specifying alignment information. BAM files are compressed, binary forms of SAM files. GTF, GFF and BED files contain annotations of features such as the coordinates of genes, transcripts and exons.
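The 11 mandatory SAM fields can be illustrated with a small Python parser. The field names and the two flag bits used here (0x4 for unmapped, 0x10 for reverse strand) follow the SAM specification; the alignment line itself is made up for illustration. For real data a dedicated library or samtools is a better choice than hand-written parsing.

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("SAM alignment lines need at least 11 fields")
    record = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    # Decode two commonly used bits of the bitwise FLAG field.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record

# Hypothetical alignment: an 8-base read on the reverse strand of chr19.
line = "read1\t16\tchr19\t1000\t255\t8M\t*\t0\t0\tGATTTGGG\t!''*(((("
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```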

Genome browsers. Web based: UCSC. Desktop: IGV. (Figure: Sashimi plot.)

Taste cell and tissue isolation for RNA-Seq analysis. (Figure, panels A-C: before cutting and after cutting. Cell types: Type III salt, Type III sour, T1R3GFP (sweet/umami), GustGFP (mostly bitter), GADGFP (Type III), Lgr5GFP (stem). Tissues: circumvallate, fungiform.)

Mapping QC: percentage of reads properly mapped or uniquely mapped; among the mapped reads, the percentage of reads in exon, intron, and intergenic regions; splice junctions; 5' or 3' bias; etc. Popular software includes RSeQC and RNA-SeQC.

Read mapping to gene features

Splice junction saturation

Taste cells express many novel isoforms and genes. 42%-45% of the splice junctions in taste libraries are either completely or partially novel. But these novel splice junctions were rarely used (<5%). Taste and olfactory tissues are barely represented in public gene annotation efforts.

Gene body coverage. (Figure: normalized read mapping intensity, 0-100, plotted against normalized distance along transcript, 5'->3' (%); bulk taste libraries vs. single cell libraries.)

Motivation for re-annotating the taste transcriptome. Not all transcripts are fully annotated, even in human and mouse. Transcriptomes are annotated from well studied tissues by RefSeq and Gencode. The 3' and 5' UTRs of genes are poorly annotated. This causes problems for 3' end sequencing and is especially problematic for scRNA-seq.

Strategies vary for model organisms

And non-model organisms

Methods for transcriptome assembly: reference-based assembly; de novo assembly. Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671-682

Transcriptome assembly when reference genome is available https://galaxyproject.org/tutorials/rb_rnaseq/

When reference genome and transcriptome are available (Bioinformatics (2011) 27 (17): 2325-2329). Reference annotation based transcriptome assembly (RABT assembly) leverages existing gene annotations for discovering novel transcripts. Appropriate for model organisms.

Strategy for RABT assembly of taste RNA-Seq data. Reference annotation based transcriptome assembly of taste bud libraries was done using the Cufflinks and StringTie packages. Results from the two workflows were combined. Non-coding transcripts, pre-mRNA and transcripts containing premature stop codons were removed. Potential coding transcripts were functionally annotated. More info: Poster # 520

Many novel genes and isoforms of known genes were identified in the taste buds
Transcript type                   De novo gene annotations
Identical to known                111512*
Novel intronic                    115
Novel isoforms of known genes     50110
Novel intergenic transcripts      1649
Novel antisense transcripts       303
*Out of a total of 111706 transcripts in Gencode M7

Improved bitter taste receptor gene annotations. Blue = de novo model, red = RefSeq model. 23/35 Tas2r genes are multi-exonic. Ten of them were verified by RT-PCR using cDNA from taste tissue.

Novel isoform of known genes: e.g. Chromogranin A

Improved mouse OR gene annotations (panels A and B). 913 (73.1%) OR and 246 (45.9%) VR genes had extended gene models. The de novo models are more sensitive at detecting OR gene expression (B). A: From PLoS Genet. 2014 Sep 4;10(9):e1004593. B: From Scientific Reports 5, Article number: 18178 (2015) doi:10.1038/srep18178

Thanks for your attention! ssukumaran@monell.org Many figures and slides in this presentation came from publications, presentations, web pages, etc. I am grateful to the authors for making them available.