Differential gene expression analysis using RNA-seq

Applied Bioinformatics Core, August 2017
Friederike Dündar with Luce Skrabanek & Ceyda Durmaz
https://abc.med.cornell.edu/

Day 1: Introduction to high-throughput sequencing [many general concepts!]
1. RNA isolation & library preparation
2. Illumina's sequencing by synthesis
3. raw sequencing reads
   - download
   - quality control
4. experimental design

RNA-seq is popular, but still developing
"RNA-seq is not a mature technology. It is undergoing rapid evolution of biochemistry of sample preparation; of sequencing platforms; of computational pipelines; and of subsequent analysis methods that include statistical treatments and transcript model building." (ENCODE consortium)
Reuter et al. (2015). Mol Cell.
Goodwin, McPherson & McCombie (2016). Nat Rev Genet, 17(6), 333-351.

Analysis paralysis
- basically no generally accepted standard reference
- myriad tools!
- highly complex & specialized pipelines
"The [...] flexibility and seemingly infinite set of options [...] have hindered its path to the clinic. [...] The fixed nature of probe sets with microarrays or qRT-PCR offer an accelerated path [...] without the lure of the latest and newest analysis methods."
Byron et al. (2016). Nat Rev Genetics.

What to expect from the class
- Sample type & quality
- Library preparation: poly-A enrichment vs. ribo-minus; strand information
- Sequencing: read length; PE (paired-end) vs. SR (single-read); sequencing errors
- Biological question: expression quantification; alternative splicing; de novo assembly needed; mRNAs, small RNAs, ...
- Experimental design: controls; no. of replicates; randomization
- Bioinformatics: aligner; normalization; DE analysis strategy
NOT COVERED: novel transcript discovery, transcriptome assembly, alternative splicing analysis (see the course notes for references to useful reviews)

RNA-seq workflow overview
cells → total RNA extraction → mRNA enrichment → fragmentation (RNA fragments) → library preparation (cDNA with adapters) → sequencing (cluster generation, sequencing by synthesis, image acquisition) → bioinformatics

Quality control of RNA extraction
- check the 28S:18S rRNA ratio
- avoid degraded RNA (junk)

RNA-seq library preparation (QC!)
- RNA extraction
- rRNA depletion/mRNA enrichment: poly(A) enrichment or ribo-depletion
- fragmentation
- then one of three protocols:
  - classical Illumina protocol (unstranded): random priming and reverse transcription → second strand synthesis → end repair, A-addition, adapter ligation → PCR
  - dUTP stranded library preparation: random priming and reverse transcription → second strand synthesis with dUTP (strand marked with U) → end repair, A-addition, adapter ligation → PCR
  - sequential ligation of two different adapters: 3' adapter ligation → 5' adapter ligation → reverse transcription → PCR
Van Dijk et al. (2014). Experimental Cell Research, 322(1), 12-20. doi:10.1016/j.yexcr.2014.01.008

RNA-seq workflow overview: cells → total RNA extraction → RNA fragments → cDNA with adapters → sequencing on a flowcell with primers
(figure: http://informatics.fas.harvard.edu/test-tutorial-page/)

Cluster generation
- bridge amplification
- denaturation
- cluster generation
- removal of complementary strands → identical fragment copies remain
(figure: http://informatics.fas.harvard.edu/test-tutorial-page/)

Sequencing by synthesis (image from Illumina)
labelled dNTPs
1. extend 1st base
2. read
3. deblock
repeat for 50-100 bp → generate base calls

Typical biases of Illumina sequencing
- sequencing errors, miscalled bases
- PCR artifacts (library preparation)
- duplicates (due to low amounts of starting material)
- length bias
- GC bias
- sample-specific problems!
- RNA-seq-specific biases (figure from Love et al. (2016). Nat Biotech, 34(12))
More details & refs in course notes (esp. Table 6).

General sources of biases (not inherently sample-specific)
- issues with the reference: CNV, mappability
- inappropriate data processing: inclusion of multi-mapped reads, exclusion of multi-mapped reads

RAW SEQUENCING READS Let the data wrangling begin!

Bioinformatics workflow of RNA-seq analysis
Images (.tif)
→ Base calling & demultiplexing (Bustard/RTA/OLB, CASAVA)
→ Raw reads (.fastq); quality control with FastQC
→ Mapping (STAR)
→ Aligned reads (.sam/.bam)
→ Counting (featureCounts)
→ Read count table (.txt)
→ Normalizing (DESeq2, edgeR)
→ Normalized read count table (.Robj)
→ DE test & multiple testing correction (DESeq2, edgeR, limma)
→ List of fold changes & statistical values (.Robj, .txt)
→ Filtering (customized scripts)
→ Downstream analyses on DE genes

Where are all the reads?
The Sequence Read Archive (SRA) is the main repository for publicly available DNA and RNA sequencing data, of which three instances are maintained world-wide:
- GenBank (NCBI): http://www.ncbi.nlm.nih.gov/genbank/
- DDBJ: http://www.ddbj.nig.ac.jp/intro-e.html
- ENA: https://www.ebi.ac.uk/ena/

Let's download!
- We will work with a data set submitted by Gierliński et al.
- they deposited the sequence files with SRA
- we will retrieve it via ENA (https://www.ebi.ac.uk/ena/)
- accession number: ERP004763
Course notes @ https://chagall.med.cornell.edu; see Section 2 (Raw Data) for download instructions etc.
Useful commands: ls, mkdir, wget, cut, grep, awk

FASTQ file format = FASTA + quality scores
1 read = 4 lines!
1. @Read ID and sequencing run information: @ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
2. sequence: GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
3. + (additional description possible): +
4. quality scores: @7<DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA 5?B@D
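The four-lines-per-read layout can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an established FASTQ parser: the names `FastqRecord` and `parse_fastq` are made up for this example, the toy read is invented, and the sketch assumes that sequence and quality are not wrapped across several lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str      # line 1 without the leading "@"
    sequence: str     # line 2
    description: str  # line 3 without the leading "+" (often empty)
    quality: str      # line 4, one ASCII symbol per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped records)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got: {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Made-up toy read, only to exercise the parser.
TOY_FASTQ = "@toy_read_1 made-up example\nGATCGGAAGA\n+\nIIIIIHHHGG\n"
records = list(parse_fastq(io.StringIO(TOY_FASTQ)))
for record in records:
    print(record.read_id, len(record.sequence), "bases")
```

The length check between line 2 and line 4 is a cheap way to notice truncated or corrupted downloads early.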

Base quality score
@ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
+
@7<DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA 5?B@D
base error probability p, e.g. 10^-4 → Phred score = -10 x log10(p), e.g. 40 → turn score into ASCII symbol → FASTQ score, e.g. ( (the symbol with ASCII code 40; real FASTQ encodings add an offset, so with Phred+33 a score of 40 is written as I)
ASCII table: http://www.ascii-code.com/
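The conversion chain from error probability to Phred score to ASCII symbol can be written out as a small sketch. The function names are invented for this illustration, and the default offset of 33 is an assumption (the Sanger / Illumina 1.8+ convention); older Illumina files use 64.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Phred score = -10 x log10(p)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Inverse of the Phred formula: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode a Phred score as one ASCII symbol (offset 33 assumed by default)."""
    return chr(q + offset)


def char_to_phred(symbol: str, offset: int = 33) -> int:
    """Decode one ASCII symbol back into a Phred score."""
    return ord(symbol) - offset


q = round(error_prob_to_phred(1e-4))  # error probability of 1 in 10,000
symbol = phred_to_char(q)             # Phred+33 encoding
print(q, symbol)
```

Decoding a quality symbol with the wrong offset shifts every score by 31, which is why knowing the encoding of a file matters before any quality-based filtering.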

Base quality scores
- each base has a certain error probability (p)
- Phred score = -10 x log10(p)
- Phred scores are ASCII-encoded, e.g., "!" COULD represent Phred score 33; which score a symbol actually stands for depends on the encoding offset
Printable ASCII characters (codes 33-126):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Encodings:
S - Sanger: Phred+33, ASCII 33-73, raw reads typically (0, 40)
X - Solexa: Solexa+64, ASCII 59-104, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, ASCII 64-104, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, ASCII 67-104, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (ASCII 66, B)
L - Illumina 1.8+: Phred+33, ASCII 33-74, raw reads typically (0, 41)
also see Table 2 in the course notes
image from https://en.wikipedia.org/wiki/fastq_format
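Because the same symbol means different scores under Phred+33 and Phred+64, a common first step is to guess the offset from the symbols that occur in a file. The following is a heuristic sketch based only on the ASCII ranges listed above; the function names and the toy quality strings are made up, and the heuristic deliberately returns `None` when the observed symbols fit both schemes.

```python
from typing import Iterable, List, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the encoding offset (33 or 64) from the observed symbols.

    Heuristic: symbols below ASCII 59 only occur in +33 encodings, and
    symbols above ASCII 74 only occur in +64 encodings. Anything in
    between is ambiguous, so None is returned.
    """
    codes = [ord(symbol) for quality in quality_strings for symbol in quality]
    if not codes:
        raise ValueError("no quality symbols given")
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None  # all symbols fall in the overlap of both schemes


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a quality string into a list of per-base scores."""
    return [ord(symbol) - offset for symbol in quality]


# Made-up quality strings, one per encoding family.
print(guess_phred_offset(["!''*((((***+"]))  # contains symbols below ASCII 59
print(guess_phred_offset(["hhhhggggfff"]))   # contains symbols above ASCII 74
print(guess_phred_offset(["IIIIHHHH"]))      # ambiguous on its own
```

In practice the guess becomes more reliable the more reads are inspected, since low-quality bases near read ends tend to reveal the lower end of the range.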

Quality control of raw reads: FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc
"FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis."
The main functions of FastQC are:
- Import of data from BAM, SAM or FastQ files (any variant)
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML based permanent report
- Offline operation to allow automated generation of reports without running the interactive application
not specific for RNA-seq data!
$ mat/software/fastqc/fastqc
$ mat/software/anaconda2/bin/multiqc

EXPERIMENTAL DESIGN How to avoid spurious signals and drowning in noise

How deep is deep enough?
for DGE (logFC ~ 2) in mammals: 20-50 million SR, 75 bp
Goals that require more, longer, and possibly paired-end reads:
- quantification of lowly expressed genes
- identification of genes with small changes between conditions
- investigation of alternative splicing/isoform quantification
- identification of novel transcripts, chimeric transcripts
- de novo transcriptome assembly

Why do we need replicates?
Goal: Identify differences in expression for every gene... and differences should preferably be due to our experiment, not noise!
"Samples are our windows to the population, and their statistics are used to estimate those of the population." Martin Krzywinski & Naomi Altman, doi:10.1038/nmeth.2613

Invest in replicates!
recommended: 6 biological replicates per condition for DGE of strongly changing genes (logFC >= 2) [based on insights from the fairly simple yeast transcriptome]
[Figure: log2 counts of Gene X in condition 1 vs. condition 2]
Gierliński et al. (2015). Bioinformatics, 31(22), 3625-3630. & Schurch et al. (2016). RNA.
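One way to see why replicates pay off is the standard error of a group mean, which shrinks with the square root of the number of replicates. The sketch below is a textbook illustration rather than the analysis from the cited papers; the standard deviation of 0.5 log2 units is a purely hypothetical value chosen for the example.

```python
import math


def standard_error_of_mean(sd: float, n_replicates: int) -> float:
    """Standard error of a group mean: sd / sqrt(n)."""
    if n_replicates < 1:
        raise ValueError("need at least one replicate")
    return sd / math.sqrt(n_replicates)


ASSUMED_SD = 0.5  # hypothetical replicate-to-replicate spread (log2 counts)

for n in (2, 3, 6, 12):
    se = standard_error_of_mean(ASSUMED_SD, n)
    print(f"{n:2d} replicates: standard error of the mean = {se:.3f}")
```

Under this simple model, quadrupling the number of replicates halves the uncertainty of the estimated mean expression, independent of how deeply each sample was sequenced.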

Replicates
- Technical replicates: one RNA extraction → several library preps → several sequencing lanes each
- Biological replicates: RNA from an independent growth of cells/tissue; each replicate has its own RNA extraction → library prep → sequencing lanes
also see course notes and Blainey et al. (2014) Nature Methods, 11(9), 879-880.

Batch effects can happen everywhere
"Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms." Lin, Lin, and Snyder (2014). PNAS 111:48
"Once we accounted for the batch effect [i.e., mouse and human samples being sequenced on two different machines] [...], the comparative gene expression data no longer clustered by species, and instead, we observed a clear tendency for clustering by tissue." Gilad & Mizrahi-Man (2015). F1000Research 4:121

ENCODE's* study design was not optimal
Tissue was confounded with (at least):
- sequencer
- sex
- age
- tissue handling
human data: deceased organ donors
mouse data: 10-week-old littermates
A very good read (including the reviews and comments that discuss many scientific as well as ethical issues): https://f1000research.com/articles/4-121/v1
* not just ENCODE: see e.g. Leek et al. (2010) Nat Rev Gen 11(10) 733-739 or Jaffe & Irizarry (2014) Genome Biol 15(R31) 1-9

Avoiding bias
- Completely randomized design
- Restricted randomized design
- Blocked & randomized design
(figure label: WEIGHT)
Block what you can, randomize what you cannot.
- What factors are of interest?
- Which ones might introduce noise?
- Which nuisance factors do you absolutely need to account for?
Krzywinski & Altman (2014) Nature Methods 11(7)
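"Block what you can, randomize what you cannot" can be sketched for a common case: spreading samples over processing batches (for example library prep days). In this minimal sketch every condition is split evenly across batches, so batch is not confounded with condition, and chance alone decides which individual sample lands in which batch. The function name, the sample names and the conditions WT/KO are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def blocked_randomized_assignment(
    samples: List[Tuple[str, str]], n_batches: int, seed: int = 0
) -> Dict[str, int]:
    """Assign (sample_id, condition) pairs to batches 0..n_batches-1.

    Samples of each condition are shuffled and then dealt out in turn,
    so every batch receives a balanced share of every condition.
    """
    rng = random.Random(seed)  # fixed seed keeps the design reproducible
    by_condition: Dict[str, List[str]] = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment: Dict[str, int] = {}
    for condition in sorted(by_condition):
        sample_ids = list(by_condition[condition])
        rng.shuffle(sample_ids)  # randomize within the condition
        for position, sample_id in enumerate(sample_ids):
            assignment[sample_id] = position % n_batches
    return assignment


# Hypothetical experiment: 6 WT and 6 KO samples, 2 batches.
samples = [(f"WT_{i}", "WT") for i in range(1, 7)]
samples += [(f"KO_{i}", "KO") for i in range(1, 7)]
assignment = blocked_randomized_assignment(samples, n_batches=2, seed=42)

condition_of = dict(samples)
per_batch = Counter((condition_of[s], batch) for s, batch in assignment.items())
for (condition, batch), count in sorted(per_batch.items()):
    print(f"batch {batch}: {count} x {condition}")
```

Recording the seed together with the sample sheet makes it possible to document later how the assignment was produced.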

Typical RNA-seq set-up
- keep the technical nuisance factors (harvest date, RNA extraction kit, sequencing date, ...) to a minimum
- cover only as much of the biological variation as needed (just keep possible restrictions about your conclusions in mind for later)
- Make sure the sequencing core multiplexes all samples!
Auer & Doerge (2010). Genetics, 185(2), 405-16.

Summary Day 1
- RNA-seq analysis is not a completely solved issue, but DE analysis on a gene level is decently mature and the field seems to gravitate towards some sort of standard
- no analysis tool can enforce (or replace!) common sense and knowledge about the biology behind the experiment
- crap in, crap out
- more replicates are often better investments than more reads
- FastQC and MultiQC are great tools to detect possible technical nuisance factors