RNA-Seq analysis workshop


Zhangjun Fei
Boyce Thompson Institute for Plant Research
USDA Robert W. Holley Center for Agriculture and Health
Cornell University

Outline
  Background of RNA-Seq
  Applications of RNA-Seq (what can RNA-Seq do?)
  Available sequencing platforms and strategies, and which one to choose
  RNA-Seq data analysis
    Read processing and quality assessment
    De novo assembly
    Alignment to reference genome/transcriptome
    Differentially expressed gene identification

Milestones of transcriptome analysis
  Year  Milestone
  1965  Sequence of the first RNA molecule determined
  1977  Development of the Northern blot technique and the Sanger sequencing method
  1989  Reports of RT-PCR experiments for transcriptome analysis
  1991  First high-throughput EST sequencing study
  1992  Introduction of Differential Display for the discovery of differentially expressed genes
  1995  Reports of the microarray and Serial Analysis of Gene Expression (SAGE) methods
  1996  Suppression subtractive hybridization reported
  2005  First next-generation sequencing technology (Roche/454) introduced to the market
  2006  First transcriptome sequencing studies using a next-generation technology (Roche/454)

New sequencing technologies
  Next generation sequencing
    Illumina (HiSeq 2000/2500)
    Roche/454
    Ion Torrent (Ion Proton)
    ABI/SOLiD
    Helicos
  Third generation sequencing
    Pacific Biosciences
    Oxford Nanopore
    Complete Genomics
  Desktop sequencers
    Ion Torrent PGM
    Illumina MiSeq
    454 GS Junior

RNA-Seq applications

RNA-Seq applications
  Accelerating gene discovery and gene family expansion
  Improving genome annotation: identifying novel genes and gene models
  Identifying tissue/condition-specific alternative splicing events

RNA-Seq applications: alternative splicing
  Short reads cannot provide the complete structure of an isoform

RNA-Seq applications: PacBio long reads

RNA-Seq applications: PacBio long-read error correction

RNA-Seq applications
  Each sample needs four libraries with different insert sizes: 1-2K, 2-3K, 3-5K, >5K


RNA-Seq applications
                  Cell 1       Cell 2
  No. reads       86,126       80,543
  Total bases     527,933,678  476,348,201
  Average length  6,129        5,914

RNA-Seq applications
  SNP and SSR marker identification (facilitating breeding)
  SNP discovery in RNA-Seq is more challenging than in DNA:
    Varying levels of coverage depth
    False discovery around splicing junctions due to incorrect mapping

RNA-Seq applications
  Phylogenetic relationship, population structure, selective sweep
  [Figure: tree diagram with numbered samples]

RNA-Seq applications Expression QTL Distribution of SNPs (blue) and differentially expressed (DE) genes in IL10-1

RNA-Seq applications
  Mutant gene cloning (BSA RNA-Seq)
    Cross: white fruit x yellow fruit (F1, F2, F3)
    White pool and yellow pool: RNA-Seq, SNPs and DE genes
    132 of 189 SNPs in this region
  Feder et al. (2015) A Kelch domain-containing F-box coding gene negatively regulates flavonoid accumulation in Cucumis melo L. Plant Physiol 169:1714-1726

RNA-Seq applications GWAS Distribution of mapped markers associating with the erucic acid trait

RNA-Seq applications Genomic imprinting and allele specific expression

RNA-Seq applications: non-coding RNAs (lncRNAs, lincRNAs, ...)

RNA-Seq applications: gene fusion

RNA-Seq applications: gene expression profiling

RNA-Seq vs microarray
  Problems of microarrays
    Cross-hybridization
    Stable probe secondary structures
    High background (e.g., nonspecific hybridization)
    Limited dynamic range (e.g., nonlinear and saturable hybridization kinetics)
  RNA-Seq (digital expression analysis)
    Allows direct enumeration of transcript molecules
    Digital expression data are absolute, so data can be directly compared across different experiments and laboratories without the need for extensive internal controls or other experimental manipulation
    Provides an open system that allows detection of previously uncharacterized transcripts, as well as rare transcripts


RNA-Seq applications: summary
  Accelerating gene discovery and gene family expansion
  Improving genome annotation: identifying novel genes and gene models
  Identifying tissue/condition-specific alternative splicing events
  SNP and SSR marker identification
  Phylogenetic relationship, population structure, selective sweep
  Expression QTL analysis
  Mutant gene cloning (BSA RNA-Seq)
  Genome (transcriptome)-wide association study
  Genomic imprinting and allele-specific expression analysis
  Identifying non-coding RNAs (lncRNAs, lincRNAs, ...)
  Identifying gene fusion events
  Gene expression profiling analysis

Sequencing platforms and strategies

Sequencing platforms
  Next generation sequencing
    Illumina (HiSeq 2000/2500)
    Ion Torrent (Ion Proton)
    ABI/SOLiD
    Roche/454
    Helicos
  Third generation sequencing
    Pacific Biosciences
    Oxford Nanopore
    Complete Genomics
  Desktop sequencers
    Ion Torrent PGM
    Illumina MiSeq
    Illumina NextSeq
    454 GS Junior

Sequencing platforms
  Illumina HiSeq 2000/2500
    High-output mode (150-200M reads/read pairs per lane)
      Single-end, 50 or 100 bp
      Paired-end, 2 x 125 bp
      Run time: 2-11 days
    Rapid run mode (100-150M reads/read pairs per lane)
      Single-end, 50, 100, or 150 bp
      Paired-end, 2 x 100 bp
      Paired-end, 2 x 150 bp
      Paired-end, 2 x 200 bp
      Paired-end, 2 x 250 bp
      Run time: 7-40 hours
  Illumina MiSeq
    50 bp sequencing kit
    150 bp sequencing kit (e.g. 2 x 75 bp)
    300 bp sequencing kit (e.g. 2 x 150 bp)
    500 bp sequencing kit (e.g. 2 x 250 bp)
    600 bp sequencing kit (e.g. 2 x 300 bp)
    Run time: 5-65 hours
  http://www.biotech.cornell.edu/brc/genomics/services/price-list

Sequencing platforms
  Single-end or paired-end
    For gene expression analysis with a reference genome, single-end is enough
    For de novo assembly, genome annotation, and alternative splicing identification, it is better to use paired-end
  Strand-specific or non-strand-specific
    Always choose strand-specific RNA-Seq if possible

Strand-specific RNA sequencing
  More accurately determines expression levels
  Significantly reduces false positives in identifying alternatively spliced transcripts
  Identifies antisense transcripts: another level of gene regulation in important biological processes
  Determines the transcribed strand of non-coding RNAs (e.g. lincRNAs)

Strand-specific RNA-Seq library construction

High-throughput ssRNA-Seq
  Up to 96 libraries in two days
  Paired-end compatible multiplexing

Strand-specific RNA sequencing
  Strand-specific sequencing can produce more accurate digital gene expression data than conventional Illumina RNA-Seq.


Strand-specific RNA sequencing: antisense transcripts
  cis-natural antisense transcripts (cis-NATs)
    1,340 cis-NAT pairs in Arabidopsis (Wang et al., 2005)
    687 cis-NAT pairs in rice (Osato et al., 2003)
  trans-natural antisense transcripts (trans-NATs)
    1,320 trans-NAT pairs in Arabidopsis (Wang et al., 2006)
  Function
    Alternative splicing
    RNA editing
    DNA methylation
    Genomic imprinting
    X-chromosome inactivation

Strand-specific RNA sequencing: antisense transcripts
  LEFL2040O15: 1394 reads, 259 reads
  LEFL2002DC06: 389 reads, 1189 reads

Strand-specific RNA sequencing: lincRNA (determine the sense strand)

RNA-Seq strategies: sequencing depth and number of biological replicates
  Most frequently asked questions
    "How many samples should I multiplex in one lane?" or "How many reads should I generate for each of my samples?"
    Depends on budget ($$$)
    Depends on the quality of the library and the reads (rRNA, tRNA, organelle, and adaptor contamination)
  Number of biological replicates for expression calls
    At least three
  Effects of read numbers on expression calls
    Mature green fruit library (22M reads)
    Randomly select 0.1-0.9M and 1-22M reads from the library and calculate gene expression for each dataset (20 different randomizations)
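The subsampling experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions: each read is represented by the ID of the gene it maps to, expression is measured as reads per million, and the function names (pearson, subsample_counts, correlation_with_full) are hypothetical, not part of any tool named in the workshop.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def subsample_counts(read_genes, n_reads, rng):
    """Pick n_reads reads at random (without replacement) and count reads per gene."""
    counts = {}
    for gene in rng.sample(read_genes, n_reads):
        counts[gene] = counts.get(gene, 0) + 1
    return counts


def correlation_with_full(read_genes, n_reads, seed=0):
    """Correlate expression from a random subsample with the full library.

    read_genes: list with one gene ID per mapped read (assumed input format).
    """
    rng = random.Random(seed)
    genes = sorted(set(read_genes))
    full = {g: 0 for g in genes}
    for g in read_genes:
        full[g] += 1
    sub = subsample_counts(read_genes, n_reads, rng)
    # Express both datasets as reads per million mapped reads.
    full_rpm = [full[g] * 1e6 / len(read_genes) for g in genes]
    sub_rpm = [sub.get(g, 0) * 1e6 / n_reads for g in genes]
    return pearson(full_rpm, sub_rpm)
```

Calling correlation_with_full with several seeds for each read number would mimic the repeated randomizations; the gene-level quantification used on the slides (for example RPKM) may differ from this simplified measure.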

RNA-Seq (multiplexing)
  Mature green fruit, 22M reads
  Reads sampled  r
  0.1M           0.8682
  1M             0.9867
  2M             0.9934
  3M             0.9957
  5M             0.9976
  10M            0.9992
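As a rough guide to the "how many samples per lane" question, the arithmetic can be sketched as follows. The lane yield comes from the HiSeq figures given earlier (150-200M reads per lane in high-output mode); the per-sample target and the percentage of usable reads are assumptions chosen purely for illustration, and samples_per_lane is a hypothetical helper.

```python
def samples_per_lane(lane_reads, reads_per_sample, usable_percent=100):
    """Number of libraries that fit in one lane at the requested depth.

    lane_reads:        total reads produced by the lane
    reads_per_sample:  usable reads wanted for each library
    usable_percent:    assumed share of reads left after removing rRNA,
                       organelle, adaptor, and low-quality reads
    """
    usable_reads = lane_reads * usable_percent // 100
    return usable_reads // reads_per_sample


# A 150M-read lane, 10M usable reads per sample, 80% of reads assumed usable.
print(samples_per_lane(150_000_000, 10_000_000, usable_percent=80))  # 12
```

The right target depth depends on the experiment; the correlation values above only describe one tomato fruit library.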


RNA-Seq data analysis

Read processing: read quality control (FastQC)


Read processing
  Remove adaptors and all possible contaminations: rRNA, tRNA, organelle (chloroplast and mitochondrion) RNAs, virus, low-quality sequences
  Arabidopsis 25S ribosomal RNA vs GenBank nr protein database

Read processing
  Remove contaminating sequences
    Align reads to an rRNA and organelle sequence database (Bowtie or BWA)
    Contaminating reads affect RPKM values if not removed
  Trim adaptor and low-quality sequences
    FASTX-Toolkit
    AdapterRemoval
    Trimmomatic
    Cutadapt
    ConDeTri
    ERNE-filter
    PRINSEQ
    SolexaQA-bwa
    Sickle
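To make the trimming step concrete, here is a minimal sketch of 3' adaptor removal followed by 3' quality trimming. It does not reproduce the algorithm of any tool listed above: adaptor matching is exact (no mismatches), qualities are assumed to be Phred+33 encoded, and the thresholds (min_overlap, min_q, min_len) are illustrative assumptions.

```python
def trim_adapter(seq, qual, adapter, min_overlap=5):
    """Cut the read at the first exact adaptor match, or at a partial
    adaptor (at least min_overlap bases) hanging off the 3' end."""
    pos = seq.find(adapter)
    if pos == -1:
        for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def process_read(seq, qual, adapter, min_len=25):
    """Apply both trimming steps; return None if the read becomes too short."""
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality(seq, qual)
    return (seq, qual) if len(seq) >= min_len else None
```

Production trimmers also tolerate mismatches in the adaptor and often use sliding-window quality statistics, so this sketch is best read as an outline of the idea.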


RNA-Seq data analysis: de novo transcriptome assembly
  Long reads (454/Sanger): overlap-layout-consensus strategy
  Short reads (Illumina): de Bruijn graph approach
  (Martin & Wang, 2011)
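The de Bruijn graph approach for short reads can be illustrated with a toy example. This sketch assumes error-free reads from a single strand and only follows unambiguous paths; real assemblers additionally handle sequencing errors, reverse complements, coverage, and branching caused by isoforms and paralogs. The function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
print(walk_contig(graph, "ATG"))  # ATGGCGTGCAAT
```

The three overlapping reads are merged into one contig without ever computing pairwise overlaps, which is why this strategy scales to very large numbers of short reads.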

De novo transcriptome assembly: long reads (454/Sanger)
  CAP3 (http://seq.cs.iastate.edu/cap3.html)
  TGICL/CAP3 (http://compbio.dfci.harvard.edu/tgi/software/)
  MIRA (http://www.chevreux.org/projects_mira.html)
  Newbler (-cdna)
  Phrap (http://www.phrap.org/)
  Two major problems in existing EST assembly programs and unigene databases:
    1) A large portion of different transcripts (mainly alternatively spliced transcripts and paralogs) are incorrectly assembled into the same transcripts: type I error (false positives)
    2) A large portion of nearly identical sequences are not assembled into one transcript: type II error (false negatives)

Example of type I assembly error (paralog)
  In the DFCI Tomato Gene Index, AW218649 is a member of TC237370
  Sequence identity between AW218649 and TC237370: 91.5%
  AW218649 is aligned to tomato chromosome 4
  TC237370 is aligned to tomato chromosome 11

Example of type I assembly error (alternative splicing) In DFCI Tomato Gene Index, U95008 is a member of TC226520

Example of type II assembly error In DFCI Tomato Gene Index, two unigenes, TC219875 and TC221582, are identical

iAssembler (http://bioinfo.bti.cornell.edu/tool/iassembler/)
  Iterative assemblies (assembly of assemblies) using MIRA and CAP3 (four cycles of MIRA followed by one cycle of CAP3) reduce type II errors, in which nearly identical sequences are not assembled
  Further assembly error identification
    1) Comparing unigene sequences against themselves to identify nearly identical sequences (type II errors)
    2) Aligning EST sequences to their corresponding unigene sequences to identify mis-assembled ESTs (type I errors)
  Both type I and type II assembly errors are corrected automatically by the program
  Unigene base errors are then corrected based on the resulting SAM files

iAssembler performance
  Test data: a curated Arabidopsis EST dataset that contains only ESTs that can be perfectly aligned to the TAIR10 cDNAs
  "Perfectly aligned" means that the sequences were aligned to Arabidopsis cDNAs over their entire lengths

De novo transcriptome assembly: short reads (Illumina)
  Trinity
  Trans-ABySS
  Oases/Velvet
  SOAPdenovo-Trans

De novo transcriptome assembly: reference-guided assembly
  Cufflinks
  IsoLasso
  Scripture
  Traph
  StringTie

De novo transcriptome assembly Trinity

De novo transcriptome assembly: post-processing of de novo assemblies
  Remove contaminations (bacteria, virus, fungus, ...)
  Remove assembly errors (mainly redundancy)
  Remove errors caused by library preparation (incomplete digestion of the dUTP-containing 2nd strand during strand-specific RNA-Seq library construction)

De novo transcriptome assembly: remove contamination
  blastx
  blastn

De novo transcriptome assembly: remove contamination
  DeconSeq
  SeqClean

De novo transcriptome assembly: remove type II assembly errors (redundancy)
  iAssembler

De novo transcriptome assembly: remove transcripts derived from incomplete 2nd-strand digestion
  Gene ID            Length  Antisense  Sense
  UN22492            1504    97         48138
  comp38294_c0_seq1  526     10822      103    (removed)
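One way to automate this filter is to compare sense and antisense read counts for each assembled transcript and flag those dominated by antisense reads. The sketch below is an assumption-laden illustration: the slides do not state the criterion actually used, and the 10-fold ratio threshold and the function name are hypothetical.

```python
def flag_antisense_artifacts(transcripts, min_ratio=10.0):
    """Return names of transcripts whose antisense read count is at least
    min_ratio times the sense read count (threshold is an assumption).

    transcripts: iterable of (name, sense_reads, antisense_reads)
    """
    flagged = []
    for name, sense, antisense in transcripts:
        # max(sense, 1) avoids flagging transcripts with no reads at all.
        if antisense >= min_ratio * max(sense, 1):
            flagged.append(name)
    return flagged


# Read counts taken from the table above.
table = [
    ("UN22492", 48138, 97),
    ("comp38294_c0_seq1", 103, 10822),
]
print(flag_antisense_artifacts(table))  # ['comp38294_c0_seq1']
```

Genuine antisense transcripts exist, so flagged sequences would normally be checked against the sense transcript they overlap before removal.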

De novo transcriptome assembly: high number of assembled transcripts
  Alternative splicing
  Non-coding RNAs
  Incomplete coverage of full-length transcripts
  DFCI gene index

RNA-Seq data analysis: alignment
  Align reads to a reference genome
    TopHat
    HISAT
  Align reads to a reference transcriptome
    Bowtie
    BWA
  If you have a reference genome, it is not a good idea to align the reads to the predicted CDS or cDNA, because the prediction of UTRs and alternative splicing is incomplete

RNA-Seq data analysis: visualization tools
  Integrative Genomics Viewer (IGV)

RNA-Seq data analysis: read counting and normalization
  Read counting
    htseq-count
    samtools (samtools view -c)
  Normalization
    RPKM: reads per kilobase of exon model per million mapped reads
    FPKM: fragments per kilobase of exon model per million mapped reads
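The RPKM definition translates directly into a one-line calculation. The sketch below assumes the read count, total exon length, and total number of mapped reads are already known (for example from htseq-count output); the function name and the example numbers are illustrative. FPKM uses the same formula with fragments (read pairs) in place of reads.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = reads / (exon length in kb) / (mapped reads in millions)
         = reads * 1e9 / (exon length in bp * total mapped reads)
    """
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# 500 reads on a gene with 2,000 bp of exons in a library of 20M mapped reads.
print(rpkm(500, 2_000, 20_000_000))  # 12.5
```

Because the total number of mapped reads is in the denominator, reads from rRNA or organelle contamination that are left in the library lower the RPKM of every other gene.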

RNA-Seq data analysis: quality control of biological replicates
  Sample correlation matrix
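A sample correlation matrix can be computed from a table of expression values with one column per sample. The sketch below assumes Pearson correlation on log2(x + 1)-transformed values, which is a common choice but not one stated on the slides; the function names and the input layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(expr):
    """Pairwise correlations between samples.

    expr: dict mapping sample name -> list of expression values
          (e.g. RPKM), with genes in the same order for every sample.
    Returns the sample names and a square matrix of correlations.
    """
    names = list(expr)
    # The log transform keeps a few highly expressed genes from dominating.
    logged = {s: [math.log2(v + 1) for v in expr[s]] for s in names}
    matrix = [[pearson(logged[a], logged[b]) for b in names] for a in names]
    return names, matrix
```

Replicates of the same condition are expected to correlate more strongly with each other than with samples from other conditions; a replicate that does not is a candidate for closer inspection.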

RNA-Seq data analysis: differentially expressed gene detection
  Pair-wise comparison
    DESeq
    edgeR
  Time course data
    First transform the data using the getVarianceStabilizedData function in DESeq (to approximate a normal distribution), then identify DE genes using F tests in limma
  Multiple test correction
    False Discovery Rate (FDR)
    q value
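For the multiple test correction step, one widely used way to control the FDR is the Benjamini-Hochberg procedure. The slides do not say which procedure was applied, so the sketch below is offered only as an example of how raw p-values become FDR-adjusted values; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(pvals)
# Genes with an adjusted value below the chosen cutoff are called DE.
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

With many thousands of genes tested at once, filtering on raw p-values alone would produce a large number of false positives, which is why the adjusted values are used for the final DE gene list.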
