Long and short/small RNA-seq data analysis GEF5, 4.9.2015 Sami Heikkinen, PhD, Docent
Topics
1. RNA-seq in a nutshell
2. Long vs short/small RNA-seq
3. Bioinformatic analysis work flows
GEF5 / Heikkinen S / 4.9.2015
RNA-seq in a nutshell
RNA-seq project work flow (from bench to bed side)
Planning
- Define problem
- Consult bioinformatician (et al)!
- Get ethical permits
- Select/get samples (groups, N)
- Define sequencing strategy
Execution
- Extract RNA
- Generate sequencing libraries
- Sequence
Bioinformatics
- Perform QC
- Preprocess, align, analyze / test
- Summarize, visualize
- Interpret!
Bed side
- Individualized treatment? Genetic risk? etc!
Next Generation Sequencing (NGS)
= deep sequencing (Fin: syväsekvensointi)
- counting applications: the count of reads aligning to a genomic location matters (the most)
  - e.g. ChIP-seq, RNA-seq, many others
- qualitative applications: the sequence itself matters
  - e.g. whole genome / targeted / exome sequencing
Nature Reviews Genetics, Metzker 2010; Annu. Rev. Anal. Chem., Mardis 2013
Sample barcoding
- each sample's sequencing library carries its own barcode, e.g. ATCACG
- allows for multiplexing: take advantage of modern high-capacity sequencers
  - ~200 million reads per run on old Illumina HiSeq
  - recent versions up to 20x that!
  - typically up to 48 or even 96 barcodes
[Figure: barcoded sequencing libraries for Sample 1 and Sample 2 (linkers & adapters flanking each DNA fragment) are sequenced together; all reads are then de-barcoded back into Sample 1 and Sample 2]
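The de-barcoding step on the barcoding slide can be sketched in a few lines. This is a minimal illustration, not the sequencer's own demultiplexing software: it assumes each read arrives as a (barcode, sequence) pair and that barcodes must match exactly. ATCACG is the example barcode from the slide; the second barcode, the read sequences and the function name `demultiplex` are made up for illustration.

```python
from collections import defaultdict

# Barcode-to-sample table. ATCACG comes from the slide; CGATGT is a
# hypothetical second barcode added only for this example.
BARCODES = {"ATCACG": "Sample 1", "CGATGT": "Sample 2"}


def demultiplex(reads, barcodes):
    """Group (barcode, sequence) pairs by sample.

    Reads whose barcode is not in the table are collected under
    'undetermined' instead of being silently dropped.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcodes.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Made-up pooled reads from one multiplexed run.
pooled_reads = [
    ("ATCACG", "GGCTTAACGT"),
    ("CGATGT", "TTGACCGTAA"),
    ("ATCACG", "ACGTACGTTT"),
    ("NNNNNN", "CCCCCCCCCC"),  # unreadable barcode
]

demultiplexed = demultiplex(pooled_reads, BARCODES)
for sample, sequences in demultiplexed.items():
    print(sample, len(sequences))
```

Real demultiplexers usually tolerate one mismatch in the barcode; exact matching is used here only to keep the sketch short.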
Read length
- most counting applications: 50 bp
- genome sequencing: 100-600 bp
- long RNA-seq
  - only measure gene expression levels (etc)? 50 bp OK
  - interested in alternative splicing? Need 100+ bp!
- short/small RNA-seq
  - e.g. mature miRNAs are ~22 nt; ~40 bp reads
[Figure: sequence reads, sequenced fragments and the target mRNA; 50 bp vs 100 bp reads over Exon 1, Exon 2 and Exon 3]
Single or paired-end?
- single end
  - most counting applications, including typical long RNA-seq
- paired-end
  - helps in alignment
  - alternative splicing in RNA-seq
  - genome sequencing
  - higher cost, longer sequencer run times
[Figure: single-end reads vs paired-end read pairs from a sequenced fragment, aligned to the genome; 50 bp and 100 bp read pairs over a target mRNA with Exon 1, alternative exon 2 and Exon 3]
Sequencing depth
- in RNA-seq, more depth = more reliability (for lower expressed genes)

                Low expressed gene      Higher expressed gene
                Sample 1   Sample 2     Sample 1   Sample 2
N reads         6          4            60         40
4 * N reads     24         16           240        160

  - 6 vs 4 reads at depth N: random result?!
- long RNA-seq on mammalian-size transcriptome
  - gene expression: need 10-40 million single end reads per sample → multiplex ~6-12x
  - gene expression + alternatively spliced mRNA isoforms: need 100 million paired-end reads per sample → no/low multiplexing
- short/small RNA-seq
  - need 2-3 million single end, 40 bp reads per sample → use lower capacity sequencer, and multiplex e.g. 12x
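One way to see why 6 vs 4 reads may be a random result while 240 vs 160 is not: if read counts are treated as Poisson-distributed (an assumption made here for illustration; it ignores biological variation between samples), the relative counting error of a count is 1 / sqrt(count). The sketch below applies this to the counts on the depth slide. The function name `relative_counting_error` is hypothetical.

```python
import math


def relative_counting_error(count):
    """Poisson coefficient of variation (sd / mean) of a read count."""
    return 1.0 / math.sqrt(count)


# Read counts from the slide: (gene, depth, Sample 1, Sample 2).
observations = [
    ("low expressed gene", "N reads", 6, 4),
    ("higher expressed gene", "N reads", 60, 40),
    ("low expressed gene", "4 * N reads", 24, 16),
    ("higher expressed gene", "4 * N reads", 240, 160),
]

for gene, depth, sample1, sample2 in observations:
    print(
        f"{gene:22s} {depth:12s} "
        f"Sample 1: {sample1:3d} (+/- {relative_counting_error(sample1):.0%})  "
        f"Sample 2: {sample2:3d} (+/- {relative_counting_error(sample2):.0%})"
    )
```

Under this assumption, quadrupling the depth halves the relative error, which matters most for the low expressed gene.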
Replicates
- RNA-seq, too, suffers from the inherent variation in e.g. gene expression levels between individuals
  - need samples from many individuals per group
  - probably at the very least tens
  - the smaller the expected difference, the bigger the N must be
  - power calculations?
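As a rough illustration of "the smaller the expected difference, the bigger the N must be", the sketch below uses the textbook normal-approximation formula for comparing two group means, n = 2 * ((z_(1-alpha/2) + z_power) / d)^2, where d is the expected difference divided by the between-individual standard deviation. This is a simplification chosen for this example, not the method of the slides: real RNA-seq power calculations typically use count models and account for testing many genes. The function name and default values are assumptions.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate N per group for a two-sided, two-group comparison.

    effect_size: expected difference in means divided by the standard
    deviation between individuals (standardized difference).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Halving the expected difference roughly quadruples the required N.
for d in (1.0, 0.5, 0.25):
    print(f"standardized difference {d}: N per group = {samples_per_group(d)}")
```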
Long vs short RNA-seq
Long RNA-seq
Target
- any RNA present in the extracted RNA sample
  - messenger RNAs (mRNAs)
  - long non-coding RNAs (lncRNAs), processed pseudogenes etc
- typical min length: ~200 bp
- all expressed isoforms included

Short/small RNA-seq
Target
- small non-coding RNAs (sncRNAs)
  - microRNAs (miRNAs)
  - PIWI-interacting RNAs (piRNAs)
  - small nucleolar RNAs (snoRNAs)
- utilizes chemical properties at the ends of small RNAs
[Figure: e.g. genome, mRNA (5' to 3'), protein]
Long RNA-seq
Starting material
- total RNA
- mRNA only sample
Complexity
- e.g. 19797 known protein coding genes (through 79795 transcripts)
- variable tissue specificity
- variable lengths
→ high complexity

Short/small RNA-seq
Starting material
- total RNA THAT MUST also include the <50 bp small RNA species; RNA extraction method matters!
- small RNA only sample
Complexity
- e.g. 2588 known mature miRNAs
- higher tissue specificity
- very short
→ low complexity
RNA-seq analysis work flows
Long RNA-seq data analysis work-flow
- Download: raw data on server → raw data locally; initial QC (fastqc)
- Preprocess (QC with fastqc)
  - trim adapter and Q-filter (Trimmomatic)
  - trim 3' A_n, i.e. poly-A (homerTools trim)
- Decontaminate: align to rRNA + chrM + etc (bowtie2 or tophat2, using a bowtie2 index); unaligned reads continue; QC (fastqc)
- Alignment: align and index against transcriptome + genome from public data (tophat2 & samtools, using a bowtie2 index); align, sort, index, and visualize as .tdf (igvtools); QC (fastqc)
  - cufflinks? transcriptome (re-)annotation (cuffcompare)
- Quantitate: gene expressions (cuffquant)
- Analyze
  - test (cuffdiff) → DEG results
  - export (cuffnorm) → normalized gene expression & counts
  - pathway analysis, associations to clinical data etc
RNA-seq pipeline architecture
- Input (folder) and pipeline_settings.txt are read by the master Unix shell script
- the master script calls one run script per step; each step writes filename(s).suffix and log.txt to its own output folder
  - run_fastqc.sh → fastqc
  - run_homertools.sh → homertools trim
  - run_trimmomatic.sh → trimmomatic
  - etc.sh → etc
  - reporting.sh(s) → summary.txt
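The architecture on this slide (a master script that calls one run script per step, each with its own output folder and log.txt) can be sketched as follows. The slides describe Unix shell scripts; Python is used here only for a compact, self-contained illustration. The function names `run_step` and `run_pipeline` are hypothetical, and the demo commands are placeholders that just print a message; in a real pipeline they would be the run_*.sh scripts.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(name, command, output_root):
    """Run one step in its own output folder and save its output to log.txt."""
    step_dir = Path(output_root) / name
    step_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, cwd=step_dir, capture_output=True, text=True)
    (step_dir / "log.txt").write_text(result.stdout + result.stderr)
    return result.returncode


def run_pipeline(steps, output_root):
    """Run steps in order; stop at the first failure.

    Returns the names of the steps that finished successfully.
    """
    completed = []
    for name, command in steps:
        if run_step(name, command, output_root) != 0:
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder commands standing in for run_fastqc.sh, run_trimmomatic.sh etc.
    demo_steps = [
        ("fastqc", [sys.executable, "-c", "print('fastqc placeholder')"]),
        ("trimmomatic", [sys.executable, "-c", "print('trimmomatic placeholder')"]),
    ]
    with tempfile.TemporaryDirectory() as output_root:
        print(run_pipeline(demo_steps, output_root))
```

Keeping one folder and one log per step makes it easy to see which step failed and to rerun from that point.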
Some data formats and types
- Raw sequence data (.fastq.gz)
- FastQC quality control (.html)
- Aligned reads (.sam, .bam, indexed and sorted .bam)
- Visualization in genome browser (.tdf, .bigbed, .bigwig) etc
- Gene expression test results (from cuffdiff, tab-delimited .txt)
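To make the raw data format concrete: a FASTQ file stores each read as four lines (header starting with @, sequence, a + separator, and one quality character per base). The sketch below reads such records and computes the mean base quality, assuming the common Phred+33 quality encoding. The two records and the function names are made up for illustration; for a real .fastq.gz file the handle could come from `gzip.open(path, "rt")`.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")

records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, mean_phred(quality))
```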
Pilot RNA-seq sample from human blood
[Figure: read count across processing steps]
[Figure: Tissue Specific Expression Analysis (TSEA) of the top 1000 expressed genes]
Pilot RNA-seq sample from human blood
Small RNA-seq data analysis work-flow
- Raw data on server → initial QC → preprocessing
  - NOTE: with e.g. MiSeq (Mediteknia), adapter clipping is done already on the sequencer: 40 bp read → e.g. 22 bp
- Align, QC, sort, index, visualize: each index yields QC, aligned reads (visualized) and unaligned reads
  - mature miRNA index
  - hairpin miRNA index
  - sncRNA index
  - piRNA index
  - decontaminate
- Quantitate (DESeq2 dependencies)
- Annotate (DESeq2 dependencies)
- Test (DESeq2) against phenotype data
  - groups, clin chem, histology, disease risk etc
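The note on adapter clipping (a 40 bp read shrinking to e.g. 22 bp) can be illustrated with a small sketch: because a mature miRNA is shorter than the read, the read runs into the 3' adapter, which has to be removed before alignment. This is not the sequencer's own clipping routine. The insert and adapter sequences below are placeholders chosen for this example, and `clip_adapter` with its `min_overlap` parameter is a hypothetical helper that only handles exact matches.

```python
def clip_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter, or a truncated adapter prefix, from a read."""
    # Case 1: the full adapter is present somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends partway through the adapter; try the longest
    # adapter prefix first and require at least min_overlap bases.
    for length in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:-length]
    return read


# Placeholder sequences for illustration only.
insert = "TAGCTTATCAGACTGATGTTGA"   # 22 nt small RNA insert
adapter = "TGGAATTCTCGGGTGCCAAGG"   # assumed 3' adapter sequence

read = (insert + adapter)[:40]      # the sequencer reads only 40 bp
clipped = clip_adapter(read, adapter)
print(len(read), "->", len(clipped), clipped)
```

Dedicated trimming tools also tolerate sequencing errors in the adapter; exact matching keeps the sketch short.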
Thank you! uef.fi