Gene Expression analysis with RNA-Seq data


Gene Expression analysis with RNA-Seq data C3BI Hands-on NGS course November 24th 2016 Frédéric Lemoine

Plan 1. 2. Quality Control 3. Read Mapping 4. Gene Expression Analysis 5. Splicing/Transcript Analysis 6. Other Analyses 7. Visualization 10/12/15 Institut Pasteur - C3BI Hands-on NGS course - RNA-Seq 2

Sequencing: Reminder
High-throughput sequencing (HTS), also called next-generation sequencing (NGS), is a set of methods developed in 2005 that produce millions of sequences in a single run at low cost.
Example (figure): «reads» aligned along a genome. Coverage: 10.

Sequencers (Illumina)

                  NextSeq 500        HiSeq 4000           X Ten
Max output        120 Gb             1500 Gb              1800 Gb
Max read number   800 M              4 to 5 B             6 B
Max read length   2x150 bp           2x150 bp             2x150 bp
Run time          29 h               1 to 3.5 days        < 3 days
Throughput        9 exomes per run   12 genomes per run   > 18,000 human genomes per year

Sequencing: Applications
DNA-Seq: DNA sequencing
ChIP-Seq: study of protein/DNA interactions
CLIP-Seq: study of protein/RNA interactions
RNA-Seq: RNA sequencing


RNA-Seq: Definition
RNA-Seq reveals the presence and quantity of RNA in a biological sample at a given moment in time.
Figure: «reads» mapped along a chromosome and its genes, showing exonic reads, junction reads, and the resulting coverage.
Qualitative + quantitative!

RNA-Seq: Data types
An Illumina flowcell has 8 sequencing lanes (numbered 1 to 8). Each lane yields about 250 x 10^6 reads, i.e. about 2 x 10^9 reads per flowcell.

RNA-Seq: Data types
A small modification of the library preparation makes it possible to read both ends (forward and reverse) of each fragment. The reads are delivered as FastQ files.

RNA-Seq: Data types

FASTQ format:
@SRR527545.1 1 length=76
GTCGATGATGCCTGCTAAACTGCAGCTTGACGTACTGCGGACCCTGCAGTCCAGCGCTCGTCATGGAACGCAAACG
+
HHHHHHHHHHHHFGHHHHHHFHHGHHHGHGHEEHHHHHEFFHHHFHHHHBHHHEHFHAH?CEDCBFEFFFFAFDF9

FASTA format:
>SRR527545.1 1 length=76
GTCGATGATGCCTGCTAAACTGCAGCTTGACGTACTGCGGACCCTGCAGTCCAGCGCTCGTCATGGAACGCAAACG

SFF (Standard Flowgram Format): binary format for 454 reads.

Colorspace (SOLiD), CSFASTQ:
@0711.1 2_34_121_F3
T11332321002210131011131332200002000120000200001000
+
64;;9:;>+0*&:*.*1-.5($2$3&$570*$575&$9966$5835'665
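
The relationship between the FASTQ and FASTA layouts above can be sketched in a few lines of Python: read four-line FASTQ records, then drop the quality line to obtain FASTA. The names `FastqRecord`, `parse_fastq` and `to_fasta` are hypothetical helpers, and the two-read input is made up for illustration; this is a sketch assuming exactly four lines per read, not a replacement for a dedicated library.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # bases
    quality: str   # one quality character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text that uses exactly four lines per read."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at row {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


def to_fasta(record: FastqRecord) -> str:
    """Render one record in FASTA layout (the qualities are discarded)."""
    return f">{record.name}\n{record.sequence}"


# Made-up input: two short reads.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "FFHH"]
records = list(parse_fastq(example))
fasta_text = "\n".join(to_fasta(r) for r in records)
```

The conversion is one-way: FASTA keeps the identifier and the bases but loses the per-base qualities.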

RNA-Seq: FASTQ
Each record holds the read ID, the read sequence, and the per-base quality scores:
@HWI-ST946:381:C2ABHACXX:1:1101:1154:2156 1:N:0:GCCAAT
GGAAAACATATTCACCCAAGACCTGT
+
@@@DADDDFFHHFE?E@FEHGIIIIF

Read ID details (column: description)
HWI-ST946: unique identifier of the sequencer
381: project (run) identifier
C2ABHACXX: flowcell identifier
1: lane number in the flowcell
1101: tile number in the lane
1154: X coordinate of the read cluster in the tile
2156: Y coordinate of the read cluster in the tile
1: read number in the pair (1 or 2); only for paired-end/mate-pair sequencing
N: filter flag, «Y» if the read was filtered out (it failed the quality filter), «N» otherwise
0: 0 when no control bit is activated
GCCAAT: index sequence, used when several samples are multiplexed
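
As an illustration of the read ID layout above, the following Python sketch splits the example header into named fields. The function name `parse_illumina_header` and the dictionary keys are hypothetical; the sketch assumes the header has exactly the two colon-separated parts shown on the slide, and it keeps the filter flag as the raw letter.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style FASTQ header into its fields."""
    name, _, description = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read_number, filter_flag, control_bits, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_number": int(read_number),
        "filter_flag": filter_flag,      # raw "Y" or "N"
        "control_bits": int(control_bits),
        "index": index,
    }


fields = parse_illumina_header(
    "@HWI-ST946:381:C2ABHACXX:1:1101:1154:2156 1:N:0:GCCAAT"
)
```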


RNA-Seq: FASTQ scores

Phred quality score   Probability of incorrect base identification   Base identification precision
10                    1/10                                           90%
20                    1/100                                          99%
30                    1/1000                                         99.9%
40                    1/10000                                        99.99%
50                    1/100000                                       99.999%
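
The table follows from the Phred definition, Q = -10 log10(P), where P is the probability that the base is wrong. A small Python sketch of the conversion is given below; the function names are hypothetical, and the ASCII offset of 33 used to decode quality characters is an assumption (it is the Sanger-style encoding, and other offsets exist in older files).

```python
def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into Phred scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


scores = decode_quality("@@@DADDDFF")           # first 10 characters of a quality line
error_probs = [phred_to_error_prob(q) for q in scores]
```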

RNA-Seq: Applications
Measure gene expression
Measure alternative splicing
Detect expressed mutations
Gene annotation (new exons)
Detect fusion transcripts

RNA-Seq: Whole pipeline
Raw data, preprocessing, mapping, quality control; then the downstream analyses: expression, splicing (exon, splicing patterns, transcript), fusion transcripts, SNPs; and visualization.


RNA-Seq: Quality control
Sequence quality controls
Mapping quality controls
We use FastQC and Picard Tools.

Sequence quality: FastQC
Documentation: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/help/

Sequence quality: Per base
A warning is issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25. A failure is raised if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
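
The per-base rule above can be written as a small decision function. This is only a sketch of the thresholds quoted on the slide, not FastQC's own code; `per_base_quality_status` and `module_status` are hypothetical names, and the quartile and median values are assumed to have been computed beforehand for each read position.

```python
def per_base_quality_status(lower_quartile: float, median: float) -> str:
    """Apply the warning/failure thresholds to one read position."""
    if lower_quartile < 5 or median < 20:
        return "fail"
    if lower_quartile < 10 or median < 25:
        return "warn"
    return "pass"


def module_status(positions: list) -> str:
    """Worst status over all positions, given (lower_quartile, median) pairs."""
    statuses = {per_base_quality_status(lq, med) for lq, med in positions}
    for level in ("fail", "warn"):
        if level in statuses:
            return level
    return "pass"


# Made-up values for three read positions.
status = module_status([(30, 36), (22, 31), (8, 27)])
```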

Sequence quality: Per sequence (figure panels: S1, S2)
A warning is raised if the most frequently observed mean quality is below 27, which equates to a 0.2% error rate. An error is raised if the most frequently observed mean quality is below 20, which equates to a 1% error rate.

Sequence quality: GC content per sequence (figure panels: S1, S2)
A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads. This module indicates a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads.

Sequence quality: N content (figure panels: S2, S3)
This module raises a warning if any position shows an N content of >5%, and an error if any position shows an N content of >20%.

Sequence quality: Overrepresented sequences (figure panels: S1, S2)
This module issues a warning if any sequence is found to represent more than 0.1% of the total, and an error if any sequence is found to represent more than 1% of the total.

Sequence quality: Nucleotide content (figure panels: S1, S2)
This module issues a warning if the difference between A and T, or between G and C, is greater than 10% at any position. It fails if the difference between A and T, or between G and C, is greater than 20% at any position.

Quality control: Mapping (statistics from STAR)

Sample  Uniquely mapped  Mapped to too  Mapped to      Unmapped:  Unmapped:
        reads            many loci      multiple loci  too short  other
s1      79.92%           4.93%          2.54%          11.97%     0.64%
s2      83.57%           6.73%          3.38%          5.29%      1.03%
s3      86.68%           4.10%          2.32%          6.38%      0.52%
s4      86.86%           5.21%          2.84%          4.29%      0.79%
s5      85.59%           2.92%          2.08%          8.98%      0.44%
s6      82.33%           5.82%          3.45%          7.36%      1.02%
s7      87.99%           3.28%          2.04%          6.27%      0.42%
s8      81.81%           3.80%          2.24%          11.67%     0.48%
s9      82.51%           4.17%          2.43%          10.32%     0.56%
s10     87.88%           4.31%          2.58%          4.52%      0.70%
s11     77.76%           4.96%          2.41%          14.16%     0.70%
s12     77.89%           11.45%         3.85%          4.65%      2.16%
s13     85.67%           4.13%          2.42%          7.22%      0.56%
s14     81.28%           7.21%          3.68%          6.54%      1.28%
s15     84.76%           5.39%          2.92%          6.11%      0.82%
s16     69.37%           3.57%          2.67%          24.05%     0.34%
s17     92.53%           2.88%          2.12%          1.90%      0.58%

Quality control: Transcript coverage (Picard Tools)
Figure: coverage along transcripts for samples S1 to S17.

Quality control: Read localization
Reads can fall in genes (exons, introns, UTRs), in intergenic regions, or on rRNAs.

Quality control: Read localization (% of bases, Picard Tools)

Sample  RIBOSOMAL  CODING  UTR  INTRONIC  INTERGENIC  MRNA  USABLE
s1      0%         79%     15%  2%        3%          95%   76%
s2      0%         76%     18%  2%        4%          93%   80%
s3      0%         83%     13%  2%        2%          96%   84%
s4      0%         80%     15%  2%        3%          95%   84%
s5      0%         84%     13%  2%        2%          97%   82%
s6      0%         76%     17%  2%        4%          93%   78%
s7      0%         84%     12%  2%        2%          97%   85%
s8      0%         83%     13%  2%        2%          96%   78%
s9      0%         81%     14%  2%        2%          96%   79%
s10     0%         83%     13%  2%        3%          96%   85%
s11     0%         83%     13%  2%        3%          95%   74%
s12     0%         73%     19%  3%        5%          92%   74%
s13     0%         82%     14%  2%        2%          96%   82%
s14     0%         74%     19%  3%        5%          93%   77%
s15     0%         78%     16%  2%        3%          95%   81%
s16     0%         75%     19%  2%        4%          94%   62%
s17     0%         87%     10%  1%        2%          97%   91%



Mapping RNA-Seq

Mapping RNA-Seq: the difficulty is splice junctions!

Mapping RNA-Seq: TopHat
Figure: TopHat pipeline (Trapnell et al., Bioinformatics, 2009).

Mapping RNA-Seq: STAR
1) Search for maximal mappable prefixes (MMP) using a suffix array (SA)
2) Alignment clustering
(Dobin et al., Bioinformatics, 2013)

SAM Format
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information. (*)

Sequence ID                             Flag  Chr    Position   Map Qual  Cigar  Paired-end info
HWI-ST1136:196:HS113:4:1101:4333:28021  163   chr2   217279469  255       100M   = 217279487 117
HWI-ST1136:196:HS113:4:1101:4333:28021  83    chr2   217279487  255       99M1S  = 217279469 -117
HWI-ST1136:196:HS113:4:1101:4320:28039  163   chr11  65271253   255       100M   = 65271335 182
HWI-ST1136:196:HS113:4:1101:4320:28039  83    chr11  65271335   255       100M   = 65271253 -182
HWI-ST1136:196:HS113:4:1101:4274:28047  99    chr4   763497     255       100M   = 763607 210
HWI-ST1136:196:HS113:4:1101:4274:28047  147   chr4   763607     255       100M   = 763497 -210
HWI-ST1136:196:HS113:4:1101:4333:28054  99    chr17  74433086   255       100M   = 74433100 114
HWI-ST1136:196:HS113:4:1101:4333:28054  147   chr17  74433100   255       100M   = 74433086 -114
HWI-ST1136:196:HS113:4:1101:4353:28065  99    chr11  62293812   255       100M   = 62293909 197
HWI-ST1136:196:HS113:4:1101:4353:28065  147   chr11  62293909   255       100M   = 62293812 -197
...

(*) https://samtools.github.io/hts-specs/samv1.pdf

SAM Format (continued)
The remaining columns of each alignment line hold the read sequence, the base qualities and the optional tags (sequences and qualities are shown truncated):

Sequence                     Base qualities                   Optional tags
AGAGAATCGACAAAAGGCTCTGGCCCG  CCCFFFFFHHHHHJJJIJIJJJJJJJIJJJB  NH:i:1 HI:i:1 AS:i:197 nm:i:0
TCTGGCCCGCAGAGCTGAGAAGTTATT  DDDDDBDBDCDDDDDEDDDEDDCCAACDEEE  NH:i:1 HI:i:1 AS:i:197 nm:i:0
AACGAATGTAACTTTAAGGCAGGAAAG  CCCFFFFFHHHHHJJJJJJJJJJIJJJIIII  NH:i:1 HI:i:1 AS:i:198 nm:i:0
ATAGAGGCCCTCTAAATAAGGAATAAA  DDDDDDDFFFDDHHHHHJIIGJJJIJIGGCJ  NH:i:1 HI:i:1 AS:i:198 nm:i:0
CCTGAGATGTGCGTAGCCTCCGTGTAA  CCCFFFFFHHHHHJJJJJJJJJIJIJJJJJJ  NH:i:1 HI:i:1 AS:i:198 nm:i:0
ACCCAGCCTTTACCAGCAGCGTACGGC  ADDDDDDCDDDCDDDDDDDDDDDFFFFHHHH  NH:i:1 HI:i:1 AS:i:198 nm:i:0
GCTGGCATGGTGGTGGGCACCCATAAT  CCCFFFFDHHFHHHGIJIJJJJJJJJJJIJJ  NH:i:1 HI:i:1 AS:i:198 nm:i:0
GGGCACCCATAATCCTAGCTGCTCAGG  DDDBCDCDDDDDCDDDDDDEEECCCFFFEHH  NH:i:1 HI:i:1 AS:i:198 nm:i:0
GCCCTTTCAACTTTCCCTCTGGTCCTT  CCCFFFFFHHHHHJJIJJIJJJGIJJJJJJJ  NH:i:1 HI:i:1 AS:i:196 nm:i:1
CACATCCCCATCTGGGCCCTCTCCTTT  DDDDDDDDDCBDDDDDDDDCDEFFFFFHHHH  NH:i:1 HI:i:1 AS:i:196 nm:i:1

(*) https://samtools.github.io/hts-specs/samv1.pdf


SAM Fields: FLAG
FLAG: combination of bitwise flags. Each bit has its own meaning.
Example: decimal flag value 83 = 64 + 16 + 2 + 1

Bit value:  2048  1024  512  256  128  64  32  16  8  4  2  1
Bit:           0     0    0    0    0   1   0   1  0  0  1  1
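
Decoding a FLAG amounts to testing each bit in turn. The sketch below does this in Python with the bit meanings defined in the SAM specification; the function name `decode_flag` and the short label strings are hypothetical choices made for this example.

```python
# Bit values and their meanings, as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAG_BITS if flag & bit]


meanings_83 = decode_flag(83)
meanings_163 = decode_flag(163)
```

Flags 83 and 163 typically appear together, as the two mates of a properly paired fragment.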

SAM Fields: «Explain FLAG» tool
https://broadinstitute.github.io/picard/explain-flags.html


SAM Fields: Mapping quality
Higher mapping quality = more unique.
$\mathrm{MAPQ} = -10 \log_{10}(p)$
where $p$ is the estimated probability that the alignment does not correspond to the read's true point of origin.
A mapping quality below 10 indicates a chance greater than 1 in 10 that the read truly originated elsewhere.
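
A short numeric sketch of this relation in Python, converting between a mapping quality and the probability that the alignment is wrong. The function names are hypothetical, and rounding to the nearest integer is an assumption made here for readability; aligners differ in how they estimate and cap the value.

```python
import math


def mapq_from_error_prob(p: float) -> int:
    """Mapping quality for a probability p that the alignment is wrong."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return round(-10 * math.log10(p))


def error_prob_from_mapq(mapq: float) -> float:
    """Probability that the alignment is wrong, for a given mapping quality."""
    return 10 ** (-mapq / 10)


q_for_one_in_ten = mapq_from_error_prob(0.1)
p_for_q30 = error_prob_from_mapq(30)
```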


SAM Fields: CIGAR
String representation of the alignment.
Example: 52M36890N45M3S
Read against the reference (chr20): 52 aligned bases (M), a skipped region on the reference such as an intron (N), 45 aligned bases (M), then 3 soft-clipped bases (S).
(Table: all CIGAR operations.)
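
To make the reading of a CIGAR string concrete, the following Python sketch splits it into (length, operation) pairs and derives how many bases it covers on the read and on the reference. The helper names are hypothetical; the sets of operations that consume the read or the reference follow the SAM specification.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def read_length(ops: list) -> int:
    """Bases of the read described by the CIGAR (M, I, S, =, X)."""
    return sum(length for length, op in ops if op in "MIS=X")


def reference_span(ops: list) -> int:
    """Bases covered on the reference (M, D, N, =, X)."""
    return sum(length for length, op in ops if op in "MDN=X")


ops = parse_cigar("52M36890N45M3S")
```

For a spliced read, the reference span is much larger than the read length because the skipped region (N) counts on the reference but not on the read.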


SAM Fields: Paired-end information
Figure: the first read and its mate read aligned on the reference (chr20).
3 fields:
1) Chromosome: «=» if the paired read maps on the same chromosome, the name of the other chromosome otherwise
2) Position: mapping position of the paired read
3) Template length: number of bases from the leftmost mapped base to the rightmost mapped base


SAM Fields: Optional tags
Examples of optional tags:
NM:i: edit distance between the read and the reference (the number of mismatches, for example)
RG:Z: read group, for example when several samples are merged into one BAM file
NH:i: number of reported alignments for the read
XS:A:+/-: strand, if the RNA-Seq library is stranded
AS:i: alignment score
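
Optional tags all share the TAG:TYPE:VALUE layout, so they are easy to read programmatically. The Python sketch below parses the tag columns of one of the example alignment lines; `parse_optional_tags` is a hypothetical helper, and only the integer (`i`) and float (`f`) types are converted, the others being kept as text.

```python
def parse_optional_tags(fields: list) -> dict:
    """Parse SAM optional fields of the form TAG:TYPE:VALUE."""
    tags = {}
    for field in fields:
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":
            tags[tag] = int(value)
        elif value_type == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # A (character), Z (string), etc. kept as text
    return tags


tags = parse_optional_tags("NH:i:1 HI:i:1 AS:i:197 nm:i:0".split())
is_uniquely_mapped = tags["NH"] == 1
```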

SAM Fields: Optional tags
Alignment score AS:i (e.g. Bowtie)
The higher the score, the more similar the read sequence is to the reference sequence it is aligned to. The score is calculated by subtracting penalties for each difference (mismatch, gap, etc.) and, in local alignment mode, adding bonuses for each match.
Example:
ACGCGATCGGACTACCATCTAGCATCGACTGCGCATAC
ACGCGATCGGACTAGCATCTAGCAT--ACTGCGCATAC

Type        #   Score
Matches     49  98
Mismatches  1   -5
Gap         2   -11 (= -5 -3 -3)
Total           82
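
The scoring scheme can be sketched as a function over two aligned strings. The values used here (bonus of 2 per match, penalty of 5 per mismatch, 5 to open a gap plus 3 per gap position) are assumptions read off the per-type scores in the example table, not settings taken from a particular aligner; `score_alignment` is a hypothetical name. Applied to the two strings printed in the example, it only scores the 38 columns visible there, not the 49 matches of the table.

```python
def score_alignment(seq_a: str, seq_b: str, match: int = 2, mismatch: int = 5,
                    gap_open: int = 5, gap_extend: int = 3) -> int:
    """Score two aligned sequences of equal length, with '-' marking gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            if not in_gap:          # first position of a new gap
                score -= gap_open
                in_gap = True
            score -= gap_extend     # every gap position pays the extension
        else:
            in_gap = False
            score += match if a == b else -mismatch
    return score


visible_score = score_alignment(
    "ACGCGATCGGACTACCATCTAGCATCGACTGCGCATAC",
    "ACGCGATCGGACTAGCATCTAGCAT--ACTGCGCATAC",
)
```

For these 38 columns (35 matches, 1 mismatch, one gap of length 2) the function returns 70 - 5 - 11 = 54. Adjacent gaps in the two sequences are treated as a single gap, which is a simplification.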

BAM Format
BAM: binary SAM (BGZF compressed).
BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is gunzip compatible, in the sense that a compliant gunzip utility can decompress a BGZF compressed file. (*)
A BAM file can be:
1) Sorted by coordinates (or read names)
2) Indexed (.bai file)

Converting SAM to BAM
SAMtools (*) is a suite of programs for interacting with high-throughput sequencing data.
SAM to BAM conversion:
> samtools view -Shb -o sample.bam sample.sam
Sorting a BAM file:
> samtools sort -o sample_sorted.bam sample.bam
(*) http://www.htslib.org/

Indexing BAM
Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through all the alignments. A BAM file must be sorted by reference ID and then by leftmost coordinate before indexing.
Indexing creates a .bai file next to each indexed BAM file. Each index uses virtual file offsets into the BGZF file.
And after? A sorted BAM file with its index makes it possible to answer queries such as "which reads overlap chrX:start-end?" without decompressing and traversing the whole BAM file.
Indexing a BAM file:
> samtools index sample_sorted.bam


RNA-Seq: Differential expression
Two main steps:
1) Counting reads on genes
2) Statistical analysis

RNA-Seq: Differential expression, counting reads on genes
Figure: a gene model (exons, introns, junctions) with the reads counted on it.


GFF Format
GFF (General Feature Format) lines are based on the Sanger GFF2 specification. GFF lines have nine required fields that must be tab-separated. (*)
It is a text file with the following fields:
1. seqname - The name of the sequence (chromosome/scaffold)
2. source - The program that generated this feature
3. feature - Type of feature ("CDS", "start_codon", "stop_codon", "exon")
4. start - Starting position of the feature in the sequence (starts at 1)
5. end - Ending position of the feature (inclusive)
6. score - Score between 0 and 1000 (or "." if no value)
7. strand - '+', '-', or '.'
8. frame - If coding exon, frame should be 0-2: reading frame of the first base
9. group - All lines with the same group are linked together into a single item
GTF format: refinement of the GFF format.
(*) https://genome.ucsc.edu/faq/faqformat.html#format3
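
The nine fields map naturally onto a small record type. The Python sketch below parses one tab-separated GFF line; `GffFeature`, `parse_gff_line` and `feature_length` are hypothetical names, the example line is made up, and "." is converted to `None` for the score and frame columns.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # 1-based
    end: int                 # inclusive
    score: Optional[float]   # None when the column is "."
    strand: str
    frame: Optional[int]     # None when the column is "."
    group: str


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-separated GFF line into its nine fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, group = columns
    return GffFeature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        group,
    )


def feature_length(feature: GffFeature) -> int:
    """Length in bases; both start and end are included."""
    return feature.end - feature.start + 1


# Made-up line for illustration.
exon = parse_gff_line("chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene1")
```

Because coordinates are 1-based and the end is inclusive, the length is end - start + 1.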

GTF Format
GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. (*)
Example:
chr9 hg38_refgene stop_codon 133255666 133255668 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133255669 133256356 0.000000 - 1 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133255176 133256356 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133257409 133257542 0.000000 - 1 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133257409 133257542 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133258097 133258132 0.000000 - 1 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133258097 133258132 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133259819 133259866 0.000000 - 1 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133259819 133259866 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133261318 133261374 0.000000 - 1 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133261318 133261374 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133262099 133262168 0.000000 - 2 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133262099 133262168 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene CDS 133275162 133275189 0.000000 - 0 gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene start_codon 133275187 133275189 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
chr9 hg38_refgene exon 133275162 133275214 0.000000 - . gene_id NM_020469; transcript_id NM_020469;
(*) https://genome.ucsc.edu/faq/faqformat.html#format3
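
The attribute list in the ninth GTF column can be turned into a dictionary with a few lines of Python. This is a sketch under simple assumptions: attributes are separated by semicolons, the key and value are separated by a space, and values may or may not be wrapped in double quotes (quoted values are common in GTF files). A value that itself contains a semicolon is not handled. `parse_gtf_attributes` is a hypothetical helper name.

```python
def parse_gtf_attributes(text: str) -> dict:
    """Parse a GTF attribute column such as 'gene_id X; transcript_id Y;'."""
    attributes = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


attrs = parse_gtf_attributes("gene_id NM_020469; transcript_id NM_020469;")
quoted = parse_gtf_attributes('gene_id "NM_020469"; transcript_id "NM_020469";')
```

Grouping exon lines by `gene_id` is what lets a counting tool report one value per gene.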

RNA-Seq: Differential expression
Subread featureCounts: http://subread.sourceforge.net/
Inputs: gene model (GTF file) and alignments (BAM file)
Output of featureCounts: one count per gene (G1 count1, G2 count2, ..., Gn countn)
http://www.ncbi.nlm.nih.gov/pubmed/24227677

RNA-Seq: Differential expression
Subread featureCounts: http://subread.sourceforge.net/
Options:
-t gene : feature type to count
-s 1 : stranded
-a gtf file : annotation file
-o output file : count file
"A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature."
http://www.ncbi.nlm.nih.gov/pubmed/24227677
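The overlap rule quoted above can be written down directly. The following sketch is not featureCounts code: it treats each read as a single 1-based inclusive interval (ignoring spliced alignments), and the function names and file names (`annotation.gtf`, `counts.txt`, `sample.bam`) are hypothetical. The command list only assembles the four options shown on the slide.

```python
def bases_overlap(read_start, read_end, feat_start, feat_end):
    """True if at least one base is shared (1-based, inclusive coordinates)."""
    return read_start <= feat_end and feat_start <= read_end


def fragment_overlaps(reads, feature):
    """Paired-end rule: the fragment overlaps if any of its reads overlaps.

    `reads` is a list of (start, end) tuples, `feature` a (start, end) tuple.
    """
    feat_start, feat_end = feature
    return any(bases_overlap(s, e, feat_start, feat_end) for s, e in reads)


# The options from the slide assembled into an argument list.
# File names are placeholders; the list could be passed to subprocess.run.
command = [
    "featureCounts",
    "-t", "gene",            # feature type to count
    "-s", "1",               # stranded
    "-a", "annotation.gtf",  # annotation file
    "-o", "counts.txt",      # count file
    "sample.bam",
]
```

For example, `fragment_overlaps([(100, 140), (300, 350)], (140, 200))` is true because the first read shares exactly one base (position 140) with the feature.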

RNA-Seq: Differential expression
Statistical analysis: DESeq2 (see next presentation)

RNA-Seq: Whole pipeline
Pipeline stages: Raw data; Preprocessing; Mapping; Quality control; Expression; Splicing; Fusion transcripts; SNPs; Visualization
Splicing levels: Exon; Splicing patterns; Transcript

Splicing analysis
Three levels of analysis:
- Exon
- Patterns
- Full transcript

Splicing analysis: Exon level
We search for exons that are differentially included in their genes.
Tool: DEXSeq (*)
(*) http://www.ncbi.nlm.nih.gov/pubmed/22722343

Splicing Analysis: DEXSeq
Counts per exon
Counts are modeled by a negative binomial distribution: one model per gene
http://www.ncbi.nlm.nih.gov/pubmed/20835245
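As a reminder of what a negative binomial model for counts implies, here is a small sketch of the distribution parametrised by a mean `mu` and a dispersion `alpha`, so that the variance is `mu + alpha * mu**2`. This parametrisation and the numeric values are assumptions chosen for illustration; the code only evaluates the distribution and does not reproduce the per-gene model fitting done by DEXSeq.

```python
import math


def nb_log_pmf(k, mu, alpha):
    """Log-probability of observing count k under NB(mean=mu, dispersion=alpha).

    Requires mu > 0 and alpha > 0. With r = 1 / alpha:
    P(k) = Gamma(k + r) / (k! * Gamma(r)) * (r / (r + mu))**r * (mu / (r + mu))**k
    """
    r = 1.0 / alpha
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )


def nb_variance(mu, alpha):
    """Variance of the negative binomial: Poisson part plus overdispersion."""
    return mu + alpha * mu ** 2


# Hypothetical exon with an expected count of 50 and dispersion 0.1.
mu, alpha = 50.0, 0.1
probabilities = [math.exp(nb_log_pmf(k, mu, alpha)) for k in range(2000)]
mean_check = sum(k * p for k, p in enumerate(probabilities))
```

With these values the variance is 300, six times the Poisson variance of 50, which is the extra variability between replicates that the negative binomial is meant to absorb.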

Splicing analysis: Pattern level
Different kinds of splicing patterns:
- Alternative first exons
- Alternative last exons
- Cassette exons
- Mutually exclusive exons
- Intron retentions
- Alternative acceptor splice sites
- Alternative donor splice sites

Splicing analysis: Pattern level
1. Assign each read to a pattern
2. Compare counts between samples
3. Sort and list regulated patterns
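The three steps can be illustrated with a toy example. Everything here is hypothetical: the pattern labels, the counts, and the use of a log2 ratio with a pseudocount as the ranking score. A real analysis would also normalise for library size and apply a statistical test, which this sketch deliberately leaves out.

```python
import math
from collections import Counter


def log2_ratio(count_a, count_b, pseudocount=1.0):
    """log2 of (sample B / sample A), with a pseudocount to avoid division by zero."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


# Step 1 (assumed already done): reads assigned to patterns, stored as counts.
sample_a = Counter({"cassette_exon:geneX": 31, "intron_retention:geneY": 15, "alt_donor:geneZ": 7})
sample_b = Counter({"cassette_exon:geneX": 7, "intron_retention:geneY": 15, "alt_donor:geneZ": 15})

# Step 2: compare counts between the two samples for every pattern seen.
patterns = sorted(set(sample_a) | set(sample_b))
ratios = {p: log2_ratio(sample_a[p], sample_b[p]) for p in patterns}

# Step 3: sort so the most strongly changed patterns come first.
ranked = sorted(patterns, key=lambda p: abs(ratios[p]), reverse=True)
```

A `Counter` returns 0 for patterns that are absent from one sample, so patterns seen in only one condition are still ranked.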

Splicing analysis: Pattern level
Griffith et al., Nature Methods, 2010

Splicing analysis: Transcript level
Transcript assembly: Cufflinks, Trinity, KisSplice, etc.
Analysis: Cuffdiff, RSEM, EBSeq, BitSeq, etc.

Splicing analysis: Cufflinks
Assembly with Cufflinks
Output:
- transcripts.gtf
- isoforms.fpkm_tracking
- genes.fpkm_tracking
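The tracking files report expression in FPKM (fragments per kilobase of transcript per million mapped fragments). The basic definition is sketched below with hypothetical numbers; Cufflinks itself estimates abundances with a statistical model, so this formula shows the unit, not the tool's estimation procedure.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_count / (length in kb) / (library size in millions)
    = fragment_count * 1e9 / (length in bp * total mapped fragments)
    """
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2000 bp long, 20 million mapped fragments.
value = fpkm(500, 2000, 20_000_000)
```

Here 500 fragments on a 2 kb transcript give 250 fragments per kilobase, and dividing by 20 million mapped fragments (in millions) gives 12.5 FPKM.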

Splicing analysis: Cufflinks
Cuffcompare:
- Compares assembled transcripts with reference annotations
- Tracks transcripts between different replicates
- Output: output.stats, output.combined.gtf, output.tracking
Cuffdiff:
- Finds significant differences in transcript expression, splicing, and promoter usage
- Output: FPKM tracking files, count tracking files, read group tracking files, differential expression tests, differential splicing tests (splicing.diff)

RNA-Seq: Whole pipeline
Pipeline stages: Raw data; Preprocessing; Mapping; Quality control; Expression; Splicing; Fusion transcripts; SNPs; Visualization
Splicing levels: Exon; Splicing patterns; Transcript

Integrative Genomics Viewer
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
https://www.broadinstitute.org/igv/
Helga Thorvaldsdóttir, James T. Robinson, Jill P. Mesirov. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics 14, 178-192 (2013).

Integrative Genomics Viewer: Basic usage
Screenshot labels: Organism; Chromosome; Genome positions; Data tracks

Integrative Genomics Viewer: Basic usage
Screenshot labels: Gene annotations; Read coverage; Read alignments

Integrative Genomics Viewer: Basic usage
Screenshot labels: Gene annotations; SNV C->A (coverage: 73; %: 96%)

Integrative Genomics Viewer: Basic usage: Mapping information

Integrative Genomics Viewer: Basic usage: Annotations

Integrative Genomics Viewer: Advanced usage: Importing a new genome

Integrative Genomics Viewer: Advanced usage: Importing datasets

Integrative Genomics Viewer: Advanced usage: igvtools - Computing coverage data

Integrative Genomics Viewer: RNA-Seq

RNA-Seq
Lots of applications:
- Variant detection
- Gene expression
- Splicing analysis
- Fusion transcripts
Difficult analysis: there are no fully detailed recommendations for all analyses