RNA-Seq analysis workshop Zhangjun Fei Boyce Thompson Institute for Plant Research USDA Robert W. Holley Center for Agriculture and Health Cornell University
Outline
Background of RNA-Seq
Application of RNA-Seq (what can RNA-Seq do?)
Available sequencing platforms and strategies, and which one to choose
RNA-Seq data analysis
  Read processing and quality assessment
  De novo assembly
  Alignment to reference genome/transcriptome
  Differentially expressed gene identification
Milestones of transcriptome analysis
Year  Milestone
1965  Sequence of the first RNA molecule determined
1977  Development of the Northern blot technique and the Sanger sequencing method
1989  Reports of RT-PCR experiments for transcriptome analysis
1991  First high-throughput EST sequencing study
1992  Introduction of Differential Display for the discovery of differentially expressed genes
1995  Reports of the microarray and Serial Analysis of Gene Expression (SAGE) methods
1996  Suppression subtractive hybridization reported
2005  First next-generation sequencing technology (Roche/454) introduced to the market
2006  First transcriptome sequencing studies using a next-generation technology (Roche/454)
New sequencing technologies Next generation sequencing Illumina (HiSeq 2000/2500) Roche/454 Ion Torrent (Ion Proton) ABI/SOLiD Helicos Third generation sequencing Pacific Biosciences Oxford Nanopore Complete Genomics Desktop sequencer Ion Torrent PGM Illumina MiSeq 454 GS Junior
RNA-Seq applications
RNA-Seq applications Accelerating gene discovery and gene family expansion Improving genome annotation: identifying novel genes and gene models Identifying tissue/condition-specific alternative splicing events
RNA-Seq applications Alternative splicing Short reads cannot provide the complete structure of an isoform
RNA-Seq applications PacBio long reads
RNA-Seq applications PacBio long reads error correction
RNA-Seq applications Each sample needs four libraries with different insert sizes: 1-2 kb, 2-3 kb, 3-5 kb and >5 kb
RNA-Seq applications
                     Cell 1        Cell 2
No. of reads         86,126        80,543
Total bases          527,933,678   476,348,201
Average length (bp)  6,129         5,914
RNA-Seq applications
SNP and SSR marker identification to facilitate breeding
SNP discovery from RNA-Seq data is more challenging than from DNA sequencing data:
  Coverage depth varies among transcripts
  Incorrect read mapping around splice junctions causes false discoveries
RNA-Seq applications Phylogenetic relationship, population structure, selective sweep
[Figure: phylogenetic tree of the sequenced accessions; tip labels omitted]
RNA-Seq applications Expression QTL Distribution of SNPs (blue) and differentially expressed (DE) genes in IL10-1
RNA-Seq applications Mutant gene cloning (BSA RNA-Seq)
[Figure: white fruit x yellow fruit cross (F1, F2, F3); RNA-Seq of the white pool and the yellow pool; SNPs and DE genes; 132 of 189 SNPs in this region]
Feder et al. (2015) A Kelch domain-containing F-box coding gene negatively regulates flavonoid accumulation in Cucumis melo L. Plant Physiol 169:1714-1726
RNA-Seq applications GWAS Distribution of mapped markers associating with the erucic acid trait
RNA-Seq applications Genomic imprinting and allele specific expression
RNA-Seq applications Non-coding RNAs (lncRNAs, lincRNAs)
RNA-Seq applications Gene fusion
RNA-Seq applications Gene expression profiling
RNA-Seq vs microarray
Problems of microarrays
  Cross-hybridization
  Stable probe secondary structures
  High background (e.g., nonspecific hybridization)
  Limited dynamic range (e.g., nonlinear and saturable hybridization kinetics)
RNA-Seq (digital expression analysis)
  Allows direct enumeration of transcript molecules
  Digital expression data are absolute, so data can be directly compared across different experiments and laboratories without the need for extensive internal controls or other experimental manipulation
  Provides an open system that allows detection of previously uncharacterized transcripts, as well as rare transcripts
RNA-Seq applications Summary
Accelerating gene discovery and gene family expansion
Improving genome annotation: identifying novel genes and gene models
Identifying tissue/condition-specific alternative splicing events
SNP and SSR marker identification
Phylogenetic relationship, population structure, selective sweep
Expression QTL analysis
Mutant gene cloning (BSA RNA-Seq)
Genome (transcriptome)-wide association study
Genomic imprinting and allele-specific expression analysis
Identifying non-coding RNAs (lncRNAs, lincRNAs)
Identifying gene fusion events
Gene expression profiling analysis
Sequencing platforms and strategies
Sequencing platforms Next generation sequencing Illumina (HiSeq 2000/2500) Ion Torrent (Ion Proton) ABI/SOLiD Roche/454 Helicos Third generation sequencing Pacific Biosciences Oxford Nanopore Complete Genomics Desktop sequencer Ion Torrent PGM Illumina MiSeq Illumina NextSeq 454 GS Junior
Sequencing platforms
Illumina HiSeq 2000/2500
  High-output mode (150-200M reads/read pairs per lane)
    Single-end: 50, 100 bp
    Paired-end: 2 x 125 bp
    Run time: 2-11 days
  Rapid run mode (100-150M reads/read pairs per lane)
    Single-end: 50, 100, 150 bp
    Paired-end: 2 x 100 bp, 2 x 150 bp, 2 x 200 bp, 2 x 250 bp
    Run time: 7-40 hours
Illumina MiSeq
  50 bp sequencing kit
  150 bp sequencing kit (e.g. 2 x 75 bp)
  300 bp sequencing kit (e.g. 2 x 150 bp)
  500 bp sequencing kit (e.g. 2 x 250 bp)
  600 bp sequencing kit (e.g. 2 x 300 bp)
  Run time: 5-65 hours
http://www.biotech.cornell.edu/brc/genomics/services/price-list
Sequencing platforms
Single-end or paired-end
  For gene expression analysis with a reference genome, single-end is enough
  For de novo assembly, genome annotation and alternative splicing identification, it is better to use paired-end
Strand-specific or non-strand-specific
  Always choose strand-specific RNA-Seq if possible
Strand-specific RNA sequencing
  Determines expression levels more accurately
  Significantly reduces false positives in identifying alternatively spliced transcripts
  Identifies antisense transcripts, another level of gene regulation in important biological processes
  Determines the transcribed strand of non-coding RNAs (e.g. lincRNAs)
Strand-specific RNA-Seq library construction
High-throughput ssRNA-Seq Up to 96 libraries in two days Paired-end compatible multiplexing
Strand specific RNA sequencing Strand-specific sequencing can produce more accurate digital gene expression data when compared to the conventional Illumina RNA-Seq.
Strand specific RNA sequencing
Antisense transcripts
  cis-natural antisense transcripts (cis-NATs)
    1,340 cis-NAT pairs in Arabidopsis (Wang et al., 2005)
    687 cis-NAT pairs in rice (Osato et al., 2003)
  trans-natural antisense transcripts (trans-NATs)
    1,320 trans-NAT pairs in Arabidopsis (Wang et al., 2006)
  Functions: alternative splicing, RNA editing, DNA methylation, genomic imprinting, X-chromosome inactivation
Strand specific RNA sequencing Antisense transcript LEFL2040O15 1394 reads 259 reads LEFL2002DC06 389 reads 1189 reads
Strand specific RNA sequencing lincRNA (determine the sense strand)
RNA-Seq strategies
Sequencing depth and number of biological replicates
Most frequently asked questions: How many samples should I multiplex in one lane? How many reads should I generate for each of my samples?
  Depends on the budget
  Depends on the quality of the library and the reads (rRNA, tRNA, organelle and adaptor contamination)
Number of biological replicates for expression calls: at least three
Effects of read numbers on expression calls
  Mature green fruit library (22M reads)
  Randomly select 0.1-0.9M and 1-22M reads from the library and calculate gene expression for each dataset (20 different randomizations)
RNA-Seq (multiplexing)
Correlation (r) with the mature green fruit library (22M reads), by number of randomly selected reads
Reads  r
0.1M   0.8682
1M     0.9867
2M     0.9934
3M     0.9957
5M     0.9976
10M    0.9992
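The subsampling experiment can be illustrated with a small simulation. This is only a sketch under stated assumptions: the library, gene names and abundances below are made up, each read is reduced to the gene it maps to, and raw counts are compared with Pearson's r rather than with the normalized expression values used in the workshop.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def subsample_counts(read_genes, n_reads, rng):
    """Draw n_reads reads without replacement and count reads per gene."""
    counts = {}
    for gene in rng.sample(read_genes, n_reads):
        counts[gene] = counts.get(gene, 0) + 1
    return counts

# Toy library (hypothetical): each read is represented by the gene it maps to.
rng = random.Random(0)
true_abundance = {f"gene{i}": (i + 1) ** 2 for i in range(50)}
genes = list(true_abundance)
library = rng.choices(genes, weights=[true_abundance[g] for g in genes], k=20000)

# Per-gene counts from the complete library serve as the reference.
full = subsample_counts(library, len(library), rng)
full_vec = [full.get(g, 0) for g in genes]

# Correlate each subsample with the complete library.
results = {}
for depth in (200, 2000, 10000):
    sub = subsample_counts(library, depth, rng)
    sub_vec = [sub.get(g, 0) for g in genes]
    results[depth] = pearson(sub_vec, full_vec)
    print(f"{depth:>6} reads: r = {results[depth]:.4f}")
```

Repeating the loop with different seeds would mimic the 20 randomizations mentioned above.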
RNA-Seq data analysis
Read processing Read quality control (FastQC)
Read processing Remove adaptors and all possible contaminants: rRNA, tRNA, organelle (chloroplast and mitochondrion) RNAs, viruses and low-quality sequences Arabidopsis 25S ribosomal RNA vs GenBank nr protein database
Read processing
Remove contaminating sequences
  Align reads to an rRNA and organelle sequence database (Bowtie or BWA)
  Contaminating reads affect RPKM values if not removed
Trim adaptor and low-quality sequences
  FASTX-Toolkit, AdapterRemoval, Trimmomatic, Cutadapt, ConDeTri, ERNE-filter, PRINSEQ, SolexaQA-bwa, Sickle
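To make the trimming step concrete, here is a minimal pure-Python sketch of 3' adaptor clipping followed by 3' quality trimming. It does not reproduce the algorithm of any tool listed above. The function names, the Phred cutoff of 20, the minimum overlap of 5 bases and the minimum read length of 20 are assumptions chosen for illustration, and only exact adaptor matches are handled.

```python
def clip_adapter(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adaptor: a full match anywhere, or a partial match at the read end."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for an adaptor prefix hanging off the 3' end of the read.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def process_read(seq, qual, adapter, min_len=20):
    """Return the cleaned (seq, qual) pair, or None if the read becomes too short."""
    seq, qual = clip_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality(seq, qual)
    return (seq, qual) if len(seq) >= min_len else None

# Example read: 24 bases of insert followed by the first 8 bases of the adaptor.
adapter = "AGATCGGAAGAGC"
read_seq = "ACGT" * 6 + "AGATCGGA"
read_qual = "I" * 22 + "##" + "I" * 8   # Phred+33: 'I' = Q40, '#' = Q2
print(process_read(read_seq, read_qual, adapter))
```

For real data, one of the dedicated tools above is the better choice, since they also handle mismatches, paired-end reads and compressed input.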
RNA-Seq data analysis
De novo transcriptome assembly
Long reads (454/Sanger): overlap-layout-consensus strategy
Short reads (Illumina): de Bruijn graph approach
Martin & Wang, 2011
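The de Bruijn graph idea can be sketched in a few lines. The example below is a toy illustration, not the implementation used by any assembler: the reads are made up and error-free, k is set to 5, and the contig is built by following only unambiguous paths, with no handling of sequencing errors, coverage or reverse complements.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Follow edges from start while the path is unambiguous (one successor)."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Three overlapping, error-free reads from a hypothetical 16-base transcript.
reads = ["ATGGCGTGCA", "GCGTGCAATC", "TGCAATCGGA"]
graph = build_de_bruijn(reads, k=5)
contig = extend_contig(graph, "ATGG")
print(contig)
```

Branch points in the graph (nodes with more than one successor) are where alternative splicing, paralogs and sequencing errors show up, which is why real assemblers need much more elaborate path resolution.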
De novo transcriptome assembly
Long reads (454/Sanger)
  CAP3 (http://seq.cs.iastate.edu/cap3.html)
  TGICL/CAP3 (http://compbio.dfci.harvard.edu/tgi/software/)
  MIRA (http://www.chevreux.org/projects_mira.html)
  Newbler (-cdna)
  Phrap (http://www.phrap.org/)
Two major problems in existing EST assembly programs and unigene databases:
  1) A large portion of different transcripts (mainly alternatively spliced transcripts and paralogs) are incorrectly assembled into the same transcripts: type I error (false positives)
  2) A large portion of nearly identical sequences are not assembled into one transcript: type II error (false negatives)
Example of type I assembly error (paralog)
In DFCI Tomato Gene Index, AW218649 is a member of TC237370
Sequence identity between AW218649 and TC237370: 91.5%
AW218649 is aligned to tomato chromosome 4
TC237370 is aligned to tomato chromosome 11
Example of type I assembly error (alternative splicing) In DFCI Tomato Gene Index, U95008 is a member of TC226520
Example of type II assembly error In DFCI Tomato Gene Index, two unigenes, TC219875 and TC221582, are identical
iAssembler http://bioinfo.bti.cornell.edu/tool/iassembler/
Iterative assemblies (assembly of assemblies) using MIRA and CAP3 (four cycles of MIRA followed by one cycle of CAP3) reduce the errors in which nearly identical sequences are not assembled
Further assembly error identification
  1) Comparing unigene sequences against themselves to identify nearly identical sequences (type II errors)
  2) Aligning EST sequences to their corresponding unigene sequences to identify mis-assembled ESTs (type I errors)
Both type I and type II assembly errors are corrected automatically by the program
Unigene base errors are then corrected based on the resulting SAM files
iAssembler performance A curated Arabidopsis EST dataset, which contains only ESTs that can be perfectly aligned to the TAIR10 cDNAs ("perfectly aligned" means that the sequences were aligned to Arabidopsis cDNAs over their entire lengths)
De novo transcriptome assembly Short reads (Illumina) Trinity Trans-ABySS Oases/Velvet SOAPdenovo-Trans
De novo transcriptome assembly Reference-guided de novo assembly Cufflinks IsoLasso Scripture Traph StringTie
De novo transcriptome assembly Trinity
De novo transcriptome assembly
Post-processing of de novo assemblies
  Remove contaminants (bacteria, viruses, fungi)
  Remove assembly errors (mainly redundancy)
  Remove errors caused by library preparation (incomplete digestion of the dUTP-containing second strand during strand-specific RNA-Seq library construction)
De novo transcriptome assembly Remove contamination blastx blastn
De novo transcriptome assembly Remove contamination DeconSeq SeqClean
De novo transcriptome assembly Remove type II assembly error (redundancy) iAssembler
De novo transcriptome assembly
Remove transcripts derived from incomplete digestion of the second strand
Gene ID            Length  Antisense  Sense  Status
UN22492            1504    97         48138
comp38294_c0_seq1  526     10822      103    removed
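A filter of this kind can be sketched as follows, using the two transcripts from the table. The slide does not state the cutoff that was applied, so the 10-fold antisense-to-sense ratio below is purely an assumption, as are the function and variable names.

```python
def flag_antisense_artifacts(transcripts, min_ratio=10.0):
    """Return names of transcripts whose antisense count dominates the sense count.

    transcripts: iterable of (name, sense_count, antisense_count).
    min_ratio:   hypothetical cutoff; a transcript is flagged when
                 antisense >= min_ratio * sense.
    """
    flagged = []
    for name, sense, antisense in transcripts:
        # max(sense, 1) avoids a zero threshold for transcripts with no sense reads.
        if antisense >= min_ratio * max(sense, 1):
            flagged.append(name)
    return flagged

# (name, sense count, antisense count) taken from the table above.
transcripts = [
    ("UN22492", 48138, 97),
    ("comp38294_c0_seq1", 103, 10822),
]
flagged = flag_antisense_artifacts(transcripts)
print(flagged)
```

In a strand-specific library, a genuine transcript should be covered mostly by sense reads, so a transcript covered almost entirely by antisense reads is a candidate artifact.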
De novo transcriptome assembly High number of assembled transcripts Alternative splicing Non-coding RNAs Incomplete coverage of full length transcripts DFCI gene index
RNA-Seq data analysis
Alignment
Align reads to a reference genome: TopHat, HISAT
Align reads to a reference transcriptome: Bowtie, BWA
If you have a reference genome, it is not a good idea to align the reads to the predicted CDS or cDNA, because the prediction of UTRs and alternative splicing is incomplete
RNA-Seq data analysis Visualization tools Integrative Genomics Viewer (IGV)
RNA-Seq data analysis
Read counting and normalization
Read counting: htseq-count, samtools (samtools view -c)
Normalization
  RPKM: reads per kilobase of exon model per million mapped reads
  FPKM: fragments per kilobase of exon model per million mapped reads
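The RPKM definition translates directly into a formula: read count, divided by exon length in kilobases, divided by total mapped reads in millions. The sketch below uses made-up numbers; FPKM is computed the same way with fragment (read pair) counts in place of read counts.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2,000 bp exon model, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Because the total number of mapped reads sits in the denominator, rRNA or organelle reads left in the data inflate it and lower the RPKM of every gene.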
RNA-Seq data analysis Quality control biological replicates Sample correlation matrix
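A sample correlation matrix is simply the pairwise correlation of expression values between samples. The sketch below is a minimal pure-Python version with made-up sample names and values; the choice of Pearson correlation on untransformed values is an assumption, and in practice values are often log-transformed first.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(expr):
    """expr maps sample name -> list of expression values (same gene order)."""
    names = list(expr)
    return {a: {b: pearson(expr[a], expr[b]) for b in names} for a in names}

# Hypothetical expression values for five genes in three samples.
expr = {
    "control_rep1": [10.0, 200.0, 35.0, 0.5, 80.0],
    "control_rep2": [12.0, 190.0, 30.0, 0.7, 85.0],
    "treated_rep1": [90.0, 20.0, 33.0, 40.0, 5.0],
}
matrix = correlation_matrix(expr)
for a in expr:
    print(a, " ".join(f"{matrix[a][b]:7.3f}" for b in expr))
```

Biological replicates of the same condition should correlate more strongly with each other than with samples from other conditions; a replicate that does not is worth inspecting before differential expression analysis.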
RNA-Seq data analysis
Differentially expressed gene detection
Pair-wise comparison: DESeq, edgeR
Time course data: first transform the data using the getVarianceStabilizedData function in DESeq (to approximate a normal distribution), then identify DE genes using F tests in limma
Multiple test correction: False Discovery Rate (FDR), q value
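As an illustration of multiple test correction, the sketch below implements the Benjamini-Hochberg procedure, one widely used way to control the FDR. The slide does not say which FDR method is meant, so the choice of Benjamini-Hochberg is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)
print([round(p, 4) for p in adjusted])
```

Genes whose adjusted p-value falls below the chosen FDR threshold (0.05 is a common choice) are reported as differentially expressed.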