Differential gene expression analysis using RNA-seq

Applied Bioinformatics Core, August 2017
Friederike Dündar with Luce Skrabanek & Ceyda Durmaz
https://abc.med.cornell.edu/

Day 1: Introduction to high-throughput sequencing [many general concepts!]
1. RNA isolation & library preparation
2. Illumina's sequencing by synthesis
3. Raw sequencing reads: download, quality control
4. Experimental design

RNA-seq is popular, but still developing
"RNA-seq is not a mature technology. It is undergoing rapid evolution of biochemistry of sample preparation; of sequencing platforms; of computational pipelines; and of subsequent analysis methods that include statistical treatments and transcript model building." (ENCODE consortium)
Reuter et al. (2015). Mol Cell. Goodwin, McPherson & McCombie (2016). Nat Rev Gen, 17(6), 333-351.

Analysis paralysis
- basically no generally accepted standard reference
- myriad tools!
- highly complex & specialized pipelines
"The (...) flexibility and seemingly infinite set of options (...) have hindered its path to the clinic. (...) The fixed nature of probe sets with microarrays or qRT-PCR offer an accelerated path (...) without the lure of the latest and newest analysis methods." Byron et al. (2016). Nat Rev Genetics.

What to expect from the class
- Sample type & quality
- Library preparation: poly-A enrichment vs. ribo minus; strand information
- Sequencing: read length; PE vs. SR; sequencing errors
- Biological question: expression quantification; alternative splicing; de novo assembly needed; mRNAs, small RNAs
- Experimental design: controls; no. of replicates; randomization
- Bioinformatics: aligner; normalization; DE analysis strategy
NOT COVERED: novel transcript discovery, transcriptome assembly, alternative splicing analysis (see the course notes for references to useful reviews)

RNA-seq workflow overview
- Total RNA extraction (from cells)
- mRNA enrichment
- Fragmentation (RNA fragments)
- Library preparation (cDNA with adapters)
- Sequencing: cluster generation, sequencing by synthesis, image acquisition
- Bioinformatics

Quality control of RNA extraction
- check the 28S:18S rRNA ratio
- avoid degraded RNA (junk)

RNA-seq library preparation (QC!)
- RNA extraction
- rRNA depletion/mRNA enrichment: poly(A) enrichment or ribo-depletion
- fragmentation
- classical Illumina protocol (unstranded): random priming and reverse transcription; second strand synthesis; end repair, A-addition, adapter ligation; PCR
- dUTP stranded library preparation: as above, with the second strand marked by U; end repair, A-addition, adapter ligation; PCR
- sequential ligation of two different adapters: 3' adapter ligation; 5' adapter ligation; reverse transcription; PCR
Van Dijk et al. (2014). Experimental Cell Research, 322(1), 12-20. doi:10.1016/j.yexcr.2014.01.008

RNA-seq workflow overview
cells -> total RNA extraction -> RNA fragments -> cDNA with adapters -> sequencing (flowcell with primers)
Image from http://informatics.fas.harvard.edu/test-tutorial-page/

Cluster generation
- bridge amplification
- denaturation
- cluster generation
- removal of complementary strands -> identical fragment copies remain
Image from http://informatics.fas.harvard.edu/test-tutorial-page/

Sequencing by synthesis (image from Illumina)
Labelled dNTPs:
1. extend 1st base
2. read
3. deblock
Repeat for 50-100 bp, then generate base calls.

Typical biases of Illumina sequencing
- sequencing errors: miscalled bases
- PCR artifacts (library preparation)
- duplicates (due to low amounts of starting material)
- length bias
- GC bias
- sample-specific problems! (RNA-seq-specific)
Figure from Love et al. (2016). Nat Biotech, 34(12). More details & refs in course notes (esp. Table 6).

General sources of biases (not inherently sample-specific)
- issues with the reference: CNV, mappability
- inappropriate data processing: inclusion of multi-mapped reads, exclusion of multi-mapped reads

RAW SEQUENCING READS Let the data wrangling begin!

Bioinformatics workflow of RNA-seq analysis
1. Images (.tif)
2. Base calling & demultiplexing (Bustard/RTA/OLB, CASAVA) -> raw reads (.fastq), checked with FastQC
3. Mapping (STAR) -> aligned reads (.sam/.bam)
4. Counting (featureCounts) -> read count table (.txt)
5. Normalizing (DESeq2, edgeR) -> normalized read count table (.Robj)
6. DE test & multiple testing correction (DESeq2, edgeR, limma) -> list of fold changes & statistical values (.Robj, .txt)
7. Filtering (customized scripts) -> downstream analyses on DE genes

Where are all the reads?
Sequence Read Archive (SRA): the main repository for publicly available DNA and RNA sequencing data, of which three instances are maintained world-wide:
- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
- DDBJ: http://www.ddbj.nig.ac.jp/intro-e.html
- ENA: https://www.ebi.ac.uk/ena/

Let's download!
- We will work with a data set submitted by Gierlinski et al.
- They deposited the sequence files with SRA.
- We will retrieve it via ENA (https://www.ebi.ac.uk/ena/), accession number: ERP004763
Course notes @ https://chagall.med.cornell.edu
See Section 2 (Raw Data) for download instructions etc.
Commands: ls, mkdir, wget, cut, grep, awk

FASTQ file format = FASTA + quality scores
1 read = 4 lines!
1  @ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
2  GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
3  +
4  @7<DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA 5?B@D
Line 1: @Read ID and sequencing run information
Line 2: sequence
Line 3: + (additional description possible)
Line 4: quality scores
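The four-line record layout can be sketched as a small reader. This is a minimal illustration, not the parser used in the course: the names `FastqRecord` and `read_fastq` are made up here, and the toy record is invented rather than taken from the data set.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str      # line 1 without the leading "@"
    sequence: str     # line 2
    description: str  # line 3 without the leading "+" (often empty)
    quality: str      # line 4, one ASCII symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per block of four lines (hypothetical helper)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header.strip())
        yield FastqRecord(header.rstrip("\n")[1:], sequence, plus[1:], quality)


# Toy record, invented for illustration only.
example = io.StringIO("@read1 toy\nGATC\n+\nIIH#\n")
records = list(read_fastq(example))
print(records[0].read_id, records[0].sequence, records[0].quality)
```

The length check reflects the rule that every base has exactly one quality symbol.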

Base quality score
@ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
+
@7<DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA 5?B@D
1. base error probability p, e.g. 10^-4
2. Phred score = -10 x log10(p), e.g. 40
3. turn the score into an ASCII symbol to get the FASTQ score, e.g. ( which is ASCII character 40 when no offset is added; with the Phred+33 offset, score 40 is written as I
ASCII table: http://www.ascii-code.com/
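The conversion chain (error probability, Phred score, ASCII symbol) can be written out as a few one-line functions. This is a sketch under the assumption of a plain additive offset; the function names are illustrative, and the default offset of 33 is the Sanger / Illumina 1.8+ convention.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Phred score = -10 x log10(p)."""
    return -10 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Inverse of the Phred formula: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_to_symbol(q: int, offset: int = 33) -> str:
    """Encode a Phred score as one ASCII character (score + offset)."""
    return chr(q + offset)


def symbol_to_phred(symbol: str, offset: int = 33) -> int:
    """Decode one ASCII character back to a Phred score."""
    return ord(symbol) - offset


q = round(phred_from_error_probability(1e-4))
print(q, phred_to_symbol(q), phred_to_symbol(q, offset=0))
```

An offset is needed at all because ASCII codes below 33 are non-printing characters, so low scores could not otherwise be stored as visible symbols.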

Base quality scores
- each base has a certain error probability (p)
- Phred score = -10 x log10(p)
- Phred scores are ASCII-encoded, e.g., ! COULD represent Phred score 33 (its ASCII code); which score a symbol actually represents depends on the encoding offset (with Phred+33, ! is score 0)
ASCII symbols 33 to 126:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Encodings (ASCII ranges follow from offset plus score range):
- S - Sanger: Phred+33, raw reads typically (0, 40), ASCII 33-73
- X - Solexa: Solexa+64, raw reads typically (-5, 40), ASCII 59-104
- I - Illumina 1.3+: Phred+64, raw reads typically (0, 40), ASCII 64-104
- J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), ASCII 67-104, with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
- L - Illumina 1.8+: Phred+33, raw reads typically (0, 41), ASCII 33-74
Also see Table 2 in the course notes. Image from https://en.wikipedia.org/wiki/fastq_format
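Because the same symbol means different scores under different encodings, the offset has to be known before decoding. The sketch below decodes a quality string for a given offset and guesses the offset from the symbols seen. The guessing rule is a simple heuristic derived from the ranges listed above, not the method any particular tool uses, and it cannot tell Solexa+64 from Phred+64.

```python
from typing import Iterable, List, Optional


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(symbol) - offset for symbol in quality]


def guess_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Heuristic guess of the encoding offset (33 or 64).

    Symbols below ASCII 59 occur only in +33 encodings; symbols above
    ASCII 74 occur only in +64 encodings. Returns None if the observed
    symbols fit both. Raises ValueError on empty input.
    """
    codes = [ord(symbol) for quality in quality_strings for symbol in quality]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None


# Toy quality strings, invented for illustration.
print(guess_offset(["IIH#"]), decode_quality("IIH#", 33))
print(guess_offset(["hhhh"]), decode_quality("hhhh", 64))
print(guess_offset(["@ABC"]))
```

In practice the guess becomes more reliable the more reads are inspected, since a handful of high-quality reads may only contain symbols from the overlapping range.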

Quality control of raw reads: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc
"FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis."
The main functions of FastQC are:
- Import of data from BAM, SAM or FastQ files (any variant)
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML based permanent report
- Offline operation to allow automated generation of reports without running the interactive application
Not specific for RNA-seq data!
$ mat/software/fastqc/fastqc
$ mat/software/anaconda2/bin/multiqc
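Running both tools from a script could look roughly like the sketch below. It assumes that `fastqc` and `multiqc` are on the PATH (the course machines use the installation paths shown above instead). The file and directory names are placeholders, and the helper functions are invented for this example. Only the `--outdir` option of each tool is used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def fastqc_command(fastq_files: Sequence[str], outdir: str) -> List[str]:
    """Build a FastQC call that writes its reports to outdir."""
    return ["fastqc", "--outdir", str(outdir), *map(str, fastq_files)]


def multiqc_command(report_dir: str, outdir: str) -> List[str]:
    """Build a MultiQC call that summarizes the reports found in report_dir."""
    return ["multiqc", str(report_dir), "--outdir", str(outdir)]


def run_if_available(command: List[str]) -> bool:
    """Run the command only if its executable is found; report whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


def run_qc(fastq_files: Sequence[str], outdir: str) -> bool:
    """Create the output directory, then run FastQC followed by MultiQC."""
    Path(outdir).mkdir(parents=True, exist_ok=True)  # FastQC expects it to exist
    return (run_if_available(fastqc_command(fastq_files, outdir))
            and run_if_available(multiqc_command(outdir, outdir)))


# Placeholder file name; nothing is executed here, the command is only built.
command = fastqc_command(["sample_1.fastq.gz"], "qc_reports")
print(" ".join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems with unusual file names.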

EXPERIMENTAL DESIGN How to avoid spurious signals and drowning in noise

How deep is deep enough?
For DGE (logFC ~ 2) in mammals: 20-50 mio SR, 75 bp
Goals that require more, longer, and possibly paired-end reads:
- quantification of lowly expressed genes
- identification of genes with small changes between conditions
- investigation of alternative splicing/isoform quantification
- identification of novel transcripts, chimeric transcripts
- de novo transcriptome assembly

Why do we need replicates?
Goal: Identify differences in expression for every gene, and differences should preferably be due to our experiment, not noise!
"Samples are our windows to the population, and their statistics are used to estimate those of the population." Martin Krzywinski & Naomi Altman, doi:10.1038/nmeth.2613

Invest in replicates!
Recommended: 6 biological replicates per condition for DGE of strongly changing genes (logFC >= 2) [based on insights from the fairly simple yeast transcriptome]
[Figure: log2 counts of Gene X (axis range 10.16 to 10.26) for condition 1 and condition 2]
Gierliński et al. (2015). Bioinformatics, 31(22), 3625-3630. & Schurch et al. (2016) RNA.
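One way to see why replicates matter is to simulate how much the estimated mean expression of a single gene varies between repeated experiments, depending on the number of replicates. The sketch below is a deliberate simplification. It draws log2 counts from a normal distribution, whereas RNA-seq counts are usually modelled differently. The mean of 10.2 and standard deviation of 0.03 are invented values, not estimates from the Gierliński data.

```python
import math
import random
import statistics


def spread_of_estimated_mean(n_replicates: int,
                             true_mean: float = 10.2,
                             sd: float = 0.03,
                             n_experiments: int = 2000,
                             seed: int = 1) -> float:
    """Standard deviation of the sample mean across simulated experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        replicate_values = [rng.gauss(true_mean, sd) for _ in range(n_replicates)]
        means.append(statistics.mean(replicate_values))
    return statistics.stdev(means)


for n in (2, 3, 6, 12):
    simulated = spread_of_estimated_mean(n)
    expected = 0.03 / math.sqrt(n)  # standard error of the mean: sd / sqrt(n)
    print(f"{n:2d} replicates: simulated {simulated:.4f}, theory {expected:.4f}")
```

The spread shrinks with the square root of the number of replicates, so small differences between conditions only become distinguishable from noise once enough replicates are available.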

Replicates
- Technical replicates: the same RNA extraction, processed in repeated library preps and/or sequencing lanes
- Biological replicates: RNA from an independent growth of cells/tissue, each with its own RNA extraction, library prep, and sequencing lanes
Also see course notes and Blainey et al. (2014) Nature Methods, 11(9) 879-880.

Batch effects can happen everywhere
"Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms." Lin, Lin, and Snyder (2014). PNAS 111:48
"Once we accounted for the batch effect [i.e., mouse and human samples being sequenced on two different machines] (...), the comparative gene expression data no longer clustered by species, and instead, we observed a clear tendency for clustering by tissue." Gilad & Mizrahi-Man (2015). F1000Research 4:121

ENCODE's* study design was not optimal
Tissue was confounded with (at least):
- sequencer
- sex
- age
- tissue handling
Human data: deceased organ donors. Mouse data: 10-week-old littermates.
A very good read (including the reviews and comments that discuss many scientific as well as ethical issues): https://f1000research.com/articles/4-121/v1
* not just ENCODE: see e.g. Leek et al. (2010) Nat Rev Gen 11(10) 733-739 or Jaffe & Irizarry (2014) Genome Biol 15(R31) 1-9

Avoiding bias
- Completely randomized design
- Restricted randomized design
- Blocked & randomized design
Block what you can, randomize what you cannot.
- What factors are of interest?
- Which ones might introduce noise?
- Which nuisance factors do you absolutely need to account for?
Krzywinski & Altman (2014) Nature Methods 11(7)
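"Block what you can, randomize what you cannot" can be sketched for a common situation: samples have to be processed in several batches (e.g., library prep days). In the sketch below each batch is treated as a block that receives samples from every condition, and chance decides which sample lands in which batch and in what order. The function name, sample IDs, and condition labels are invented for illustration. This is one simple scheme, not the procedure from the cited paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, str]  # (sample_id, condition)


def blocked_randomized_batches(samples: Sequence[Sample],
                               n_batches: int,
                               seed: int = 0) -> List[List[Sample]]:
    """Spread each condition evenly over the batches, in random order."""
    rng = random.Random(seed)
    by_condition: Dict[str, List[str]] = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    batches: List[List[Sample]] = [[] for _ in range(n_batches)]
    for condition in sorted(by_condition):
        sample_ids = by_condition[condition]
        rng.shuffle(sample_ids)            # randomize which sample goes where
        start = rng.randrange(n_batches)   # randomize the first batch used
        for i, sample_id in enumerate(sample_ids):
            batches[(start + i) % n_batches].append((sample_id, condition))

    for batch in batches:
        rng.shuffle(batch)                 # randomize processing order in a batch
    return batches


# Hypothetical experiment: 3 control and 3 treated samples, 3 processing batches.
samples = [(f"ctrl_{i}", "control") for i in range(1, 4)] + \
          [(f"trt_{i}", "treated") for i in range(1, 4)]
design = blocked_randomized_batches(samples, n_batches=3)
for number, batch in enumerate(design, start=1):
    print(f"batch {number}: {batch}")
```

Because every batch contains both conditions, a batch effect shifts control and treated samples alike and is not confounded with the condition of interest.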

Typical RNA-seq set-up
- keep the technical nuisance factors (harvest date, RNA extraction kit, sequencing date, ...) to a minimum
- cover only as much of the biological variation as needed (just keep possible restrictions about your conclusions in mind for later)
- make sure the sequencing core multiplexes all samples!
Auer & Doerge (2010). Genetics, 185(2), 405-16.

Summary Day 1
- RNA-seq analysis is not a completely solved issue, but DE analysis on a gene level is decently mature and the field seems to gravitate towards some sort of standard
- no analysis tool can enforce (or replace!) common sense and knowledge about the biology behind the experiment: crap in, crap out
- more replicates are often better investments than more reads
- FastQC and MultiQC are great tools to detect possible technical nuisance factors