RNA-Seq analysis workshop. Zhangjun Fei


Outline
- Background of RNA-Seq
- Application of RNA-Seq (what RNA-Seq can do)
- Available sequencing platforms and strategies, and which one to choose
- RNA-Seq data analysis
  - Read processing and quality assessment
  - De novo assembly
  - Alignment to reference genome/transcriptome
  - Differentially expressed gene identification

Milestones of transcriptome analysis

Year  Milestone
1965  Sequence of the first RNA molecule determined
1977  Development of the Northern blot technique and the Sanger sequencing method
1989  Reports of RT-PCR experiments for transcriptome analysis
1991  First high-throughput EST sequencing study
1992  Introduction of Differential Display for the discovery of differentially expressed genes
1995  Reports of the microarray and Serial Analysis of Gene Expression (SAGE) methods
1996  Suppression subtractive hybridization reported
2005  First next-generation sequencing technology (Roche/454) introduced to the market
2006  First transcriptome sequencing studies using a next-generation technology (Roche/454)

New sequencing technologies
- Next generation sequencing
  - Illumina (HiSeq, NovaSeq)
  - Roche/454
  - Ion Torrent (Ion Proton)
  - ABI/SOLiD
  - Helicos
- Third generation sequencing
  - Pacific Biosciences
  - Oxford Nanopore
  - Complete Genomics
- Desktop sequencer
  - Ion Torrent PGM
  - Illumina MiSeq
  - 454 GS Junior

RNA-Seq applications

RNA-Seq applications
- Accelerating gene discovery and gene family expansion
- Improving genome annotation: identifying novel genes and gene models
- Identifying tissue/condition-specific alternative splicing events

RNA-Seq applications: Alternative splicing
Short reads can't provide the complete structure of an isoform.

RNA-Seq applications: PacBio long reads (Sequel System)

RNA-Seq applications: PacBio long reads, error correction

RNA-Seq applications
Each sample needs four libraries with different insert sizes: 1-2K, 2-3K, 3-5K, >5K

RNA-Seq applications: SNP and SSR marker identification (facilitating breeding)
SNP discovery in RNA-Seq is more challenging than in DNA:
- Varying levels of coverage depth
- False discovery around splicing junctions due to incorrect mapping

RNA-Seq applications: Phylogenetic relationship, population structure
Xu et al. (2017) Draft genome of spinach Spinacia oleracea and transcriptome diversity of 120 Spinacia accessions. Nature Communications

RNA-Seq applications: Selective sweep
Xu et al. (2017) Draft genome of spinach Spinacia oleracea and transcriptome diversity of 120 Spinacia accessions. Nature Communications

RNA-Seq applications: Expression QTL (eQTL) network
A melon RIL population (Nurit Katzir, unpublished)

RNA-Seq applications: Mutant gene cloning (BSA RNA-Seq)
[Figure: cross of white fruit x yellow fruit through F1, F2 and F3; RNA-Seq of the white pool and the yellow pool for SNPs and DE genes; 132 of 189 SNPs in this region]
Feder et al. (2015) A Kelch domain-containing F-box coding gene negatively regulates flavonoid accumulation in Cucumis melo L. Plant Physiol 169:1714-1726

RNA-Seq applications: GWAS
Distribution of mapped markers associated with the erucic acid trait

RNA-Seq applications: Genomic imprinting and allele-specific expression

RNA-Seq applications: Regulatory mode of gene expression in F1 hybrids
[Figure: C. maxima (Rimu) x C. moschata (Rifu) and their interspecific hybrid, Shintosa; tissues: fruit, root, leaf, stem. Provided by Nabil Elrouby]
62-80% trans, 13-24% cis

RNA-Seq applications
[Figure: expression in Cma, F1 and Cmo across root, fruit, leaf and stem for Response to heat (GO:0009408), Carotenoid biosynthesis (GO:0016117), Defense response (GO:0006952) and Photosynthesis (GO:0015979)]
Genes exhibiting dominant and transgressive expression patterns in Shintosa are enriched with those involved in defense response, response to heat, carotenoid biosynthesis and photosynthesis.

RNA-Seq applications: Non-coding RNAs (lncRNAs, lincRNAs)

RNA-Seq applications: Gene fusion

RNA-Seq applications: Gene expression profiling

RNA-Seq vs microarray
Problems of microarrays:
- Cross-hybridization
- Stable probe secondary structures
- High background (e.g., nonspecific hybridization)
- Limited dynamic range (e.g., nonlinear and saturable hybridization kinetics)
RNA-Seq (digital expression analysis):
- Allows direct enumeration of transcript molecules
- Digital expression data are absolute, so data can be directly compared across different experiments and laboratories without the need for extensive internal controls or other experimental manipulation
- Provides an open system that allows detection of previously uncharacterized transcripts, as well as rare transcripts

RNA-Seq applications: Summary
- Accelerating gene discovery and gene family expansion
- Improving genome annotation: identifying novel genes and gene models
- Identifying tissue/condition-specific alternative splicing events
- SNP and SSR marker identification
- Phylogenetic relationship, population structure, selective sweep
- Expression QTL analysis
- Mutant gene cloning (BSA RNA-Seq)
- Genome (transcriptome)-wide association study
- Genomic imprinting and allele-specific expression analysis
- Identifying non-coding RNAs (lncRNAs, lincRNAs)
- Identifying gene fusion events
- Gene expression profiling analysis

Sequencing platforms and strategies

Sequencing platforms
- Next generation sequencing
  - Illumina (HiSeq, NovaSeq)
  - Ion Torrent (Ion Proton)
  - ABI/SOLiD
  - Roche/454
  - Helicos
- Third generation sequencing
  - Pacific Biosciences
  - Oxford Nanopore
  - Complete Genomics
- Desktop sequencer
  - Ion Torrent PGM
  - Illumina MiSeq
  - Illumina NextSeq
  - 454 GS Junior

Sequencing platforms

Illumina HiSeq 2000/2500
- High-output mode (200-300M reads/read pairs per lane)
  - Single-end, 50, 100 bp
  - Paired-end, 2 x 125 bp
  - Run time: 2-11 days
- Rapid run mode (150-200M reads/read pairs per lane)
  - Single-end, 50, 100, 150 bp
  - Paired-end, 2 x 100 bp
  - Paired-end, 2 x 150 bp
  - Paired-end, 2 x 200 bp
  - Paired-end, 2 x 250 bp
  - Run time: 7-40 hours

Illumina MiSeq
- 50 bp sequencing kit
- 150 bp sequencing kit (e.g. 2 x 75 bp)
- 300 bp sequencing kit (e.g. 2 x 150 bp)
- 500 bp sequencing kit (e.g. 2 x 250 bp)
- 600 bp sequencing kit (e.g. 2 x 300 bp)
- Run time: 5-65 hours

http://www.biotech.cornell.edu/brc/genomics/services/price-list

Sequencing platforms
- Single-end or paired-end
  - For gene expression analysis with a reference genome, single-end is enough
  - For de novo assembly, genome annotation and alternative splicing identification, it's better to use paired-end
- Strand-specific or non-strand-specific
  - Always choose strand-specific RNA-Seq if possible

Strand-specific RNA sequencing
- More accurately determine the expression level
- Significantly reduce false positives in identifying alternatively spliced transcripts
- Identify antisense transcripts: another level of gene regulation in important biological processes
- Determine the transcribed strand of non-coding RNAs (e.g. lincRNAs)

Strand-specific RNA-Seq library construction

High-throughput ssRNA-Seq
- Up to 96 libraries in two days
- Paired-end compatible multiplexing

Strand-specific RNA sequencing
Strand-specific sequencing can produce more accurate digital gene expression data than conventional Illumina RNA-Seq.


Strand-specific RNA sequencing: Antisense transcripts
- cis-natural antisense transcripts (cis-NATs)
  - 1,340 cis-NAT pairs in Arabidopsis (Wang et al., 2005)
  - 687 cis-NAT pairs in rice (Osato et al., 2003)
- trans-natural antisense transcripts (trans-NATs)
  - 1,320 trans-NAT pairs in Arabidopsis (Wang et al., 2006)
- Function
  - Alternative splicing
  - RNA editing
  - DNA methylation
  - Genomic imprinting
  - X-chromosome inactivation

Strand-specific RNA sequencing: Antisense transcripts
[Figure: LEFL2040O15, 1394 reads and 259 reads; LEFL2002DC06, 389 reads and 1189 reads]

Strand-specific RNA sequencing: lincRNA (determine the sense strand)

RNA-Seq strategies: Sequencing depth and number of biological replicates
Most frequently asked questions:
- How many samples should I multiplex in one lane?
- How many reads should I generate for each of my samples?
The answer:
- Depends on budget ($$$)
- Depends on the quality of the library and the reads (rRNA, tRNA, organelle and adaptor contamination)
Number of biological replicates for expression calls: at least three
Effects of read numbers on expression calls:
- Mature green fruit library (22M reads)
- Randomly select 0.1-0.9M and 1-22M reads from the library and calculate gene expression for each dataset (20 different randomizations)

RNA-Seq (multiplexing)
Correlation (r) for randomly selected subsets of the mature green fruit library (22M reads):

Reads  r
0.1M   0.8682
1M     0.9867
2M     0.9934
3M     0.9957
5M     0.9976
10M    0.9992
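The subsampling experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each read is represented by the ID of the gene it was assigned to, the toy library and gene names are made up, and Pearson correlation on raw counts stands in for whatever expression measure the original analysis used.

```python
import random
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def subsample_correlation(read_genes, n_reads, seed=0):
    """Correlate per-gene counts of a random subsample with the full library.

    read_genes: one gene ID per read (hypothetical input format).
    """
    rng = random.Random(seed)
    full = Counter(read_genes)
    sub = Counter(rng.sample(read_genes, n_reads))
    genes = sorted(full)
    return pearson([full[g] for g in genes], [sub[g] for g in genes])


# Toy library with a wide dynamic range (made-up numbers).
toy_reads = ["geneA"] * 5000 + ["geneB"] * 500 + ["geneC"] * 50 + ["geneD"] * 5
for n in (100, 1000, len(toy_reads)):
    print(n, round(subsample_correlation(toy_reads, n), 4))
```

Repeating the call with different `seed` values would mimic the multiple randomizations described on the slide.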


Common problems in RNA-Seq experimental design
- No bioinformatics expert involved in the experimental design. Flaws in the design can cause serious problems for downstream data analysis.
- No biological replicates. Currently most journals require at least three biological replicates.
- Biological replicates collected at different times or in different places.
- For biotic/abiotic stress experiments, no mock control: all treatments are compared to non-treated samples (time 0). This confounds the treatment effect with circadian clock genes and with genes differentially expressed due to different environmental factors and developmental stages.
- Direct comparison of different genotypes with totally different genetic backgrounds. Genes can be differentially expressed due to other phenotypes, not the one of interest.

RNA-Seq data analysis

Read processing: Read quality control (FastQC)


Read processing
Remove adaptors and all possible contaminations: rRNA, tRNA, organelle (chloroplast and mitochondrion) RNAs, virus, low-quality sequences
[Figure: Arabidopsis 25S ribosomal RNA vs GenBank nr protein database]

Read processing
- Remove contaminated sequences
  - Align reads to an rRNA and organelle sequence database (bowtie or BWA)
  - These reads affect RPKM values if not removed
- Trim adaptor and low-quality sequences
  - FASTX-Toolkit
  - AdapterRemoval
  - Trimmomatic
  - Cutadapt
  - Condetri
  - ERNE-filter
  - Prinseq
  - SolexaQA-bwa
  - Sickle
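To make the trimming step concrete, here is a minimal Python sketch of what such tools do to a single read. It is not how any of the listed programs is implemented: it assumes Phred+33 quality encoding, looks only for an exact adaptor match, and the function name, thresholds and example adaptor string are illustrative choices.

```python
def trim_read(seq, qual, adapter, min_quality=20, min_length=25):
    """Trim an adaptor and low-quality 3' bases from one read.

    seq, qual: read sequence and its Phred+33 quality string (same length).
    Returns (seq, qual) after trimming, or None if the read becomes too short.
    """
    # 1) Cut the read at the first exact occurrence of the adaptor.
    #    Real trimmers also handle mismatches and partial adaptors.
    idx = seq.find(adapter)
    if idx != -1:
        seq, qual = seq[:idx], qual[:idx]

    # 2) Remove bases from the 3' end while their quality is below threshold.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_quality:
        end -= 1
    seq, qual = seq[:end], qual[:end]

    # 3) Discard reads that are too short to be mapped reliably.
    if len(seq) < min_length:
        return None
    return seq, qual


# Example: 10 bp insert (last two bases low quality) followed by adaptor sequence.
example_adapter = "AGATCGGAAGAGC"
read_seq = "ACGTACGTAC" + example_adapter
read_qual = "IIIIIIII##" + "I" * len(example_adapter)
print(trim_read(read_seq, read_qual, example_adapter, min_length=5))
```

In practice one of the dedicated tools above is preferable, since they handle paired-end reads, partial adaptors and sequencing errors.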


RNA-Seq data analysis: De novo transcriptome assembly
- Long reads (454/Sanger): overlap-layout-consensus strategy
- Short reads (Illumina): de Bruijn graph approach
Martin & Wang, 2011
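As a rough illustration of the de Bruijn graph idea, the sketch below splits reads into k-mers, links each (k-1)-mer prefix to its suffix, and walks the graph while the path is unambiguous. The reads, the value of k and the function names are made up for the example; real assemblers additionally deal with sequencing errors, coverage differences and branching caused by isoforms and paralogs.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop on cycles
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCAA.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(toy_reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

Because overlaps are encoded implicitly through shared k-mers, no pairwise read comparison is needed, which is why this approach scales to large numbers of short reads.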

De novo transcriptome assembly: Long reads (454/Sanger)
- CAP3 (http://seq.cs.iastate.edu/cap3.html)
- TGICL/CAP3 (http://compbio.dfci.harvard.edu/tgi/software/)
- MIRA (http://www.chevreux.org/projects_mira.html)
- Newbler (-cdna)
- Phrap (http://www.phrap.org/)
Two major problems in existing EST assembly programs and unigene databases:
1) A large portion of different transcripts (mainly alternatively spliced transcripts and paralogs) are incorrectly assembled into the same transcripts: type I error (false positives)
2) A large portion of nearly identical sequences are not assembled into one transcript: type II error (false negatives)

Example of type I assembly error (paralog)
- In DFCI Tomato Gene Index, AW218649 is a member of TC237370
- Sequence identity between AW218649 and TC237370: 91.5%
- AW218649 is aligned to tomato chromosome 4
- TC237370 is aligned to tomato chromosome 11

Example of type I assembly error (alternative splicing) In DFCI Tomato Gene Index, U95008 is a member of TC226520

Example of type II assembly error In DFCI Tomato Gene Index, two unigenes, TC219875 and TC221582, are identical

iAssembler
http://bioinfo.bti.cornell.edu/tool/iassembler/
- Iterative assemblies (assembly of assemblies) using MIRA and CAP3 (four cycles of MIRA followed by one cycle of CAP3) reduce the errors in which nearly identical sequences are not assembled
- Further assembly error identification:
  1) Comparing unigene sequences against themselves to identify nearly identical sequences (type II errors)
  2) Aligning EST sequences to their corresponding unigene sequences to identify mis-assembled ESTs (type I errors)
- Both type I and type II assembly errors are corrected automatically by the program
- Unigene base errors are then corrected based on the resulting SAM files

De novo transcriptome assembly: Short reads (Illumina)
- Trinity
- Trans-ABySS
- Oases/Velvet
- SOAPdenovo-Trans

De novo transcriptome assembly: Reference-guided de novo assembly
- Cufflinks
- IsoLasso
- Scripture
- Traph
- StringTie

De novo transcriptome assembly: Trinity

De novo transcriptome assembly: Post-processing of de novo assemblies
- Remove contaminations (bacteria, virus, fungus)
- Remove assembly errors (mainly redundancy)
- Remove errors caused by library preparation (incomplete digestion of the dUTP-containing 2nd strand during strand-specific RNA-Seq library construction)

De novo transcriptome assembly: Remove contamination
[Figure: blastx and blastn]

De novo transcriptome assembly: Remove contamination
- DeconSeq
- SeqClean

De novo transcriptome assembly: Remove type II assembly error (redundancy)
- iAssembler

De novo transcriptome assembly: Remove transcripts derived from incomplete 2nd strand digestion

Gene ID            Length  Antisense  Sense
UN22492            1504    97         48138
comp38294_c0_seq1  526     10822      103    (removed)
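The filtering shown in the table can be sketched as below. The slides do not state the exact cutoff that was used, so the rule here (remove a transcript when more than half of its reads map in the antisense orientation) and the parameter name `max_antisense_fraction` are assumptions made purely for illustration.

```python
def filter_antisense_artifacts(transcripts, max_antisense_fraction=0.5):
    """Split transcripts into kept and removed lists by antisense read fraction.

    transcripts: iterable of (gene_id, antisense_reads, sense_reads).
    The 0.5 default is an assumed threshold, not taken from the slides.
    """
    kept, removed = [], []
    for gene_id, antisense, sense in transcripts:
        total = antisense + sense
        fraction = antisense / total if total else 0.0
        if fraction > max_antisense_fraction:
            removed.append(gene_id)
        else:
            kept.append(gene_id)
    return kept, removed


# Read counts from the two example transcripts in the table above.
table = [
    ("UN22492", 97, 48138),
    ("comp38294_c0_seq1", 10822, 103),
]
kept, removed = filter_antisense_artifacts(table)
print("kept:", kept)
print("removed:", removed)
```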

De novo transcriptome assembly: High number of assembled transcripts
- Alternative splicing
- Non-coding RNAs
- Incomplete coverage of full-length transcripts
- DFCI gene index

RNA-Seq data analysis: Alignment
- Align reads to a reference genome: TopHat, HISAT, STAR
- Align reads to a reference transcriptome: bowtie, BWA
If you have a reference genome, it's not a good idea to align the reads to the predicted CDS or cDNA, due to the incomplete prediction of UTRs and alternative splicing.

RNA-Seq data analysis: Visualization tools
- Integrative Genomics Viewer (IGV)

RNA-Seq data analysis: Read counting and normalization
- Read counting
  - htseq-count
  - samtools (samtools view -c)
- Normalization
  - RPKM: reads per kilobase of exon model per million mapped reads
  - FPKM: fragments per kilobase of exon model per million mapped reads
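The RPKM definition translates directly into a formula: read count divided by exon length in kilobases and by total mapped reads in millions. The short Python sketch below applies it to made-up counts; the function and variable names are illustrative, and FPKM is the same calculation with fragments (read pairs) counted instead of reads.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a 10M-read library.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)

# Applying it to a small table of (count, exon length) pairs.
toy_counts = {"geneA": (500, 2000), "geneB": (500, 4000)}
toy_total = 10_000_000
toy_rpkm = {g: rpkm(c, length, toy_total) for g, (c, length) in toy_counts.items()}
print(toy_rpkm)
```

Two genes with the same read count but different lengths get different RPKM values, which is the point of the per-kilobase normalization.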

RNA-Seq data analysis: Quality control of biological replicates
- Sample correlation matrix
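A sample correlation matrix can be computed as in the following sketch, which uses Pearson correlation on a tiny made-up expression table. The sample names and values are hypothetical, and in practice the input would usually be normalized (often log-transformed) expression values for all genes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Return {sample: {sample: r}} for all pairs of samples.

    samples: dict mapping sample name to a list of expression values,
    with genes in the same order for every sample.
    """
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Toy expression values for four genes (made-up numbers).
toy_samples = {
    "ctrl_1": [10, 200, 30, 5],
    "ctrl_2": [12, 190, 33, 4],
    "treat_1": [100, 20, 35, 50],
}
matrix = correlation_matrix(toy_samples)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in toy_samples])
```

Replicates of the same condition are expected to correlate more strongly with each other than with samples from other conditions; a replicate that does not is a candidate outlier.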

RNA-Seq data analysis: Differentially expressed gene detection
- Pair-wise comparison
  - DESeq
  - edgeR
- Time course data
  - edgeR
  - First, data transformation using the getVarianceStabilizedData function in DESeq (to get a normal distribution); then DE gene identification using F tests in limma
- Multiple test correction
  - False Discovery Rate (FDR)
  - q value
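For the multiple test correction step, one widely used FDR procedure is Benjamini-Hochberg; the slides do not say which procedure was applied, so its use here is an assumption. The sketch below computes adjusted p-values for a made-up list of raw p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print([round(p, 4) for p in adj_p])

# Genes called differentially expressed at FDR < 0.05.
significant = [i for i, p in enumerate(adj_p) if p < 0.05]
print(significant)
```

DESeq and edgeR report adjusted p-values of this kind in their result tables, so a hand-written correction is normally only needed for custom tests.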
