How to deal with your RNA-seq data?

1 How to deal with your RNA-seq data? Rachel Legendre, Thibault Dayris, Adrien Pain, Claire Toffano-Nioche, Hugo Varet École de bioinformatique AVIESAN-IFB Rachel Legendre Bioinformatics 27/11/2018

2 Summary
- 01 Bioinformatics: Quality control, Mapping, Counting
- Statistics: Experimental design, Normalization, modelling and troubleshooting
- 03 Statistics: Exploratory data analysis
- 04 Practice: Differential analysis with SARTools
- Advanced practice: Gene Sets Analysis methods
- Bioinformatics: Transcriptome de novo assembly

3 Bioinformatics Introduction and prerequisites

4 Raw NGS data: Instrument, Flowcell, Intensities

5 Data storage
Text file of 100 to 150 GB per lane. Let's compare:
- War and Peace by Leo Tolstoy: 1817 pages, 6 cm thick, 4 MB
- 1 lane: times "War and Peace", 45 million pages, 1.5 km (5 Eiffel towers)
- 8 lanes per flow cell
- 1 TB of raw data / week / sequencer
- Times 2 for paired-end
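The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 100 GB lane (the lower bound given on the slide), decimal units (1 GB = 1000 MB) and a rounded Eiffel Tower height of 300 m. The tower height and all variable names are assumptions, not values from the slides.

```python
# Back-of-the-envelope check of the "War and Peace" comparison.
LANE_SIZE_MB = 100 * 1000   # assumption: 100 GB lane, decimal units
BOOK_SIZE_MB = 4            # from the slide
BOOK_PAGES = 1817           # from the slide
BOOK_THICKNESS_CM = 6       # from the slide
EIFFEL_TOWER_M = 300        # assumption: rounded height of the tower

copies = LANE_SIZE_MB // BOOK_SIZE_MB          # how many books fit in one lane
pages = copies * BOOK_PAGES                    # total number of pages
stack_m = copies * BOOK_THICKNESS_CM / 100     # height of the stacked books, in metres
towers = stack_m / EIFFEL_TOWER_M              # same height, in Eiffel towers

print(f"{copies} copies, {pages / 1e6:.1f} million pages, "
      f"{stack_m / 1000:.1f} km, {towers:.0f} Eiffel towers")
```

Under these assumptions the result agrees with the figures on the slide (about 45 million pages, 1.5 km, 5 towers).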

6 RNA-seq applications «Transcriptome analysis provides information about the identity and quantity of all RNA molecules in one cell or a population of cells»

7 RNA-seq: Why? How?
Ask the right questions before library preparation and sequencing.
Prokaryotes:
- I can't find a ribo-depletion kit for my organism: design the oligos yourself
- I want to identify antisense RNA: directional protocol (standard)
- I'm interested in transposons: longer read sequencing, paired-end sequencing
Eukaryotes:
- I want coding genes only: polyA strategy
- I want non-coding genes also: ribo-depletion
- I'm interested in small RNA profiling: use a specific protocol
- I'm interested in isoforms: paired-end sequencing, long read technologies

8 RNA-seq: Why? How?
Regardless of your organism:
- Complexity of your genome and the biological question: paired-end or single-end, read length?
- Sequencing depth (multiplexing rate)
- More biological replicates rather than more sequencing depth
- Stranded RNA-seq protocol to assign reads to a particular strand

9 RNA-seq: Why? How?
Regardless of your organism:
- Complexity of your genome and the biological question: paired-end or single-end, read length?
- Sequencing depth (multiplexing rate)
- More biological replicates rather than more sequencing depth
- Stranded RNA-seq protocol to assign reads to a particular strand
For a successful experiment, it is imperative to involve bioinformaticians and biostatisticians before RNA extraction begins.

10 Prerequisites
- RNA sample: DNase treatment; quantity (adapted protocol); quality (RNA integrity number > 7); stored at -80 °C
- Reference genome: complete genomic sequence in FASTA format
- Annotation file: all features (genes, CDS, introns, UTRs) of the genome in GFF format

11 Where to find the genome and the annotation? Common databases, specific databases

12 FastQC: explore quality scores
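FastQC builds its quality plots from the per-base quality characters stored in the FASTQ file. The sketch below shows how such characters decode into Phred scores and error probabilities, assuming the Phred+33 encoding (Sanger / Illumina 1.8+). The function names and the example quality string are illustrative and do not come from the slides.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-phred / 10)


# Hypothetical quality line of a 9-base read: good start, poor last base.
quality_line = "IIIIHHGF#"
scores = phred_scores(quality_line)
mean_quality = sum(scores) / len(scores)

for score in sorted(set(scores), reverse=True):
    print(f"Q{score}: error probability {error_probability(score):.4f}")
print(f"mean quality: {mean_quality:.1f}")
```

A score of Q20 corresponds to one expected error in 100 bases, and Q30 to one in 1000.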

13 FastQC: explore quality scores

14 FastQC: explore quality scores. Duplication levels are systematically high in RNA-seq: why?

15 How to screen for contamination?
Different levels:
- Ribosomal contamination from the same organism: align reads against the ribosomal RNA sequences with a dedicated mapper

16 How to screen for contamination?
Different levels:
- Ribosomal contamination from the same organism
- RNA contamination from another organism: use dedicated or derived tools such as fastq_screen or kraken

17 How to screen for contamination?
Different levels:
- Ribosomal contamination from the same organism
- RNA contamination from another organism
- DNA contamination: DNase treatment can be ineffective, allowing DNA to make it through into the final library. As soon as you visualise your reads against an annotated genome, the presence of DNA is normally fairly apparent as a consistent background of reads over the whole genome.

18 Bioinformatics From mapping to counting

19 RNA-seq mapping specificity
Mapping on genome or transcriptome?
- The transcriptome is currently not characterised well enough to serve as a suitable reference for RNA-Seq
- Mapping to a genome is more objective and repeatable
- Mapping to the genome gives more information on gene isoforms
- Take into account reads that come from exon-exon junctions
Cole Trapnell & Steven L Salzberg. Nature Biotechnology 27, (2009)

20 Mapping timeline From

21 Choose the right mapper. Which one is the best mapper?

22 Choose the right mapper. Which one is the best mapper? Which mapper should I use based on my data and my analysis?

23 Choose the right mapper
Depends on:
- Detection of splicing events: STAR, minimap2, Hisat2
- Length of reads:
  - Very short reads (<50): Bowtie1
  - Up to 1000kb: BWA-SW, bowtie2
  - Long reads: Minimap2
- Allowing gaps in the alignment: STAR, BWA, Bowtie2
Common situations: choose a widely used and well-maintained mapper

24 Known biases in RNA-seq
- Intron coverage: if many reads align to introns, this is indicative of incomplete poly(A) enrichment or abundant presence of immature transcripts.
- Intergenic reads: if a significant portion of reads is aligned outside of annotated gene sequences, this may suggest genomic DNA contamination (or abundant non-coding transcripts).
- 3' bias: over-representation of 3' portions of transcripts indicates RNA degradation.
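As an illustration of the 3' bias described above, the following sketch compares the mean coverage at the 3' end of a transcript with the mean coverage at its 5' end. The ratio, the 20 % window and the coverage vectors are illustrative assumptions rather than a metric defined in the slides; dedicated QC tools compute their own, more robust statistics.

```python
def three_prime_bias(coverage, window_fraction=0.2):
    """Ratio of mean 3'-end coverage to mean 5'-end coverage.

    `coverage` lists the read depth at each transcript position, ordered 5' to 3'.
    A ratio well above 1 suggests over-representation of the 3' end.
    """
    window = max(1, int(len(coverage) * window_fraction))
    five_prime_mean = sum(coverage[:window]) / window
    three_prime_mean = sum(coverage[-window:]) / window
    if five_prime_mean == 0:
        return float("inf")
    return three_prime_mean / five_prime_mean


# Hypothetical coverage profiles over a 100-position transcript.
uniform_coverage = [10] * 100             # intact RNA: flat profile
degraded_coverage = list(range(1, 101))   # degraded RNA: depth rises toward the 3' end

uniform_ratio = three_prime_bias(uniform_coverage)
degraded_ratio = three_prime_bias(degraded_coverage)
print(f"uniform: {uniform_ratio:.2f}, degraded: {degraded_ratio:.2f}")
```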

25 Mapping QC on RNA-seq
- Percentage of mapped reads along the genome: Human/Mouse: 70 to 90 %; Prokaryotic: more than 90 %
- Uniformity of read coverage on exons and the mapped strand
- Low rate of multiple mapping
- Low rate of ribosomal RNA
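The mapping percentages above can be turned into a simple sanity check. This is a minimal sketch: the thresholds are the rough guides from the slide, while the group labels, function names and read counts are hypothetical.

```python
# Rough lower bounds on the percentage of mapped reads, taken from the slide.
EXPECTED_MIN_RATE = {
    "human_mouse": 70.0,   # slide: 70 to 90 %
    "prokaryote": 90.0,    # slide: more than 90 %
}


def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads that were mapped."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def rate_is_plausible(rate, organism_group):
    """True if the mapping rate reaches the rough lower bound for this group."""
    return rate >= EXPECTED_MIN_RATE[organism_group]


# Hypothetical sample: 17.2 million mapped reads out of 20 million.
rate = mapping_rate(17_200_000, 20_000_000)
print(f"{rate:.1f} % mapped")
print("plausible for human/mouse:", rate_is_plausible(rate, "human_mouse"))
print("plausible for a prokaryote:", rate_is_plausible(rate, "prokaryote"))
```

A rate below the guide value is a prompt to look for contamination or a wrong reference, not proof of a failed experiment.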

26 Mapping QC on RNA-seq
- Common: Samtools (flagstat), Bamtools (stats), Picardtools (CollectRnaSeqMetrics), RSeQC
- Human and mouse: RNA-SeQC, Qualimap

27 RNA-seq experiment
- Organism: Arabidopsis thaliana, a plant and model organism. Genome and annotation are available in TAIR10, the Arabidopsis database.
- Dataset: 3 biological replicates, paired-end sequencing. Characterization of the function of the protein arginine methyltransferase AtPRMT5 during de novo shoot regeneration in Arabidopsis, by knocking out AtPRMT5.

28 Quantify the number of reads on each gene
When counting reads, make sure you know how the program handles the following:
- overlap size (full read vs. partial overlap)
- multimapping reads
- reads overlapping multiple genomic features of the same kind
- reads overlapping introns
Two popular tools: HTSeq-count, featureCounts
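The choices listed above can be made concrete with a toy counter. The sketch below is a simplified illustration, not the implementation of HTSeq-count or featureCounts: it assumes half-open coordinates on a single chromosome, counts any partial overlap with an exon, skips multimapping reads, and sets aside reads that overlap several genes. The gene models, reads and category labels are hypothetical.

```python
from collections import Counter


def overlaps(read_span, exon_span):
    """True if two half-open (start, end) intervals share at least one base."""
    return read_span[0] < exon_span[1] and exon_span[0] < read_span[1]


def count_reads(reads, genes):
    """Assign each read to at most one gene and tally the special cases."""
    counts = Counter()
    for start, end, n_hits in reads:
        if n_hits > 1:                       # multimapping read: not counted
            counts["__multimapped"] += 1
            continue
        hit_genes = {
            name for name, exons in genes.items()
            if any(overlaps((start, end), exon) for exon in exons)
        }
        if not hit_genes:                    # intronic or intergenic read
            counts["__no_feature"] += 1
        elif len(hit_genes) > 1:             # overlaps several genes
            counts["__ambiguous"] += 1
        else:
            counts[hit_genes.pop()] += 1
    return counts


# Hypothetical gene models: exon intervals per gene.
genes = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
# Hypothetical reads: (start, end, number of alignments reported by the mapper).
reads = [
    (120, 170, 1),   # fully inside an exon of geneA
    (190, 240, 1),   # partial overlap with geneA
    (385, 395, 1),   # overlaps geneA and geneB
    (210, 260, 1),   # falls in an intron of geneA
    (420, 470, 3),   # multimapping read
    (450, 490, 1),   # inside geneB
]
counts = count_reads(reads, genes)
print(dict(counts))
```

Changing any one rule (for example requiring full containment instead of partial overlap) changes the counts, which is why the counting mode should be checked before comparing results across tools.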

29 Practice
- Connect to the cluster: ssh <LOGIN>@core.cluster.france-bioinformatique.fr
- Change directory: cd /shared/projects/ebai2018_<login>
- Copy the script template to your home: cp /shared/home/rlegendre/tp_rna/runme.sh .
- Also available here :
- Follow the commands in the runme

30 Bioinformatics Long Read sequencing for isoform detection

31 Transcriptome reconstruction: an inextricable problem? Korf I. Genomics: the state of the art in RNA-seq analysis. Nat Methods Dec

32 Long Reads, a Solution to the RNA-Seq Problem? With a gene 2 kb long, can PE100 sequencing tell you with certainty whether a read comes from one isoform or another?
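The question above can be explored with a small calculation. The sketch below estimates which fraction of PE100 read pairs drawn from a 2 kb transcript actually touch the one exon that distinguishes it from another isoform. The 300-base fragment length, the position of the alternative exon and the uniform sampling of fragments are assumptions made for illustration only.

```python
def mate_spans(fragment_start, fragment_len, read_len):
    """Half-open spans of the two mates sequenced from each end of a fragment."""
    fragment_end = fragment_start + fragment_len
    return [
        (fragment_start, fragment_start + read_len),
        (fragment_end - read_len, fragment_end),
    ]


def overlaps(span_a, span_b):
    """True if two half-open (start, end) intervals share at least one base."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def informative_fraction(transcript_len, exon_span, fragment_len=300, read_len=100):
    """Fraction of fragments with at least one mate overlapping `exon_span`.

    Fragments are assumed to start uniformly at every possible position.
    """
    starts = range(transcript_len - fragment_len + 1)
    informative = sum(
        any(overlaps(mate, exon_span) for mate in mate_spans(start, fragment_len, read_len))
        for start in starts
    )
    return informative / len(starts)


# Hypothetical isoform: 2 kb transcript whose only distinguishing exon spans 800-900.
alternative_exon = (800, 900)
short_read_fraction = informative_fraction(2000, alternative_exon)
# A single read covering the whole transcript always sees the exon.
long_read_fraction = informative_fraction(2000, alternative_exon,
                                          fragment_len=2000, read_len=2000)
print(f"PE100: {short_read_fraction:.1%} informative, "
      f"full-length read: {long_read_fraction:.0%}")
```

Under these assumptions only a minority of short read pairs carry any information about the isoform of origin, whereas a full-length read always does.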

33 Long Reads, a Solution to the RNA-Seq Problem?

34 The 2018 winning technologies. Today at 17:45

35 PacBio: Universal SMRTBell template

36 PacBio: from polymerase to CCS reads
Subreads, CCS
- Subreads (purple and gold) are separated by adapter sequences (blue)
- >= 2 full passes required for CCS
- Both adapters must be detected for a read to be identified as full pass
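The full-pass rule above can be sketched in a few lines. This is a simplified illustration, not PacBio's CCS implementation: it assumes the adapter positions along the polymerase read are already known, and the coordinates and function names are hypothetical.

```python
MIN_FULL_PASSES = 2  # from the slide: >= 2 full passes required for CCS


def split_subreads(read_len, adapter_spans):
    """Split a polymerase read at its adapters.

    Returns (start, end, is_full_pass) tuples. A subread is a full pass only
    when an adapter was detected on both of its sides.
    """
    subreads = []
    cursor = 0
    adapter_on_left = False
    for adapter_start, adapter_end in sorted(adapter_spans):
        if adapter_start > cursor:
            # An adapter closes this subread on the right.
            subreads.append((cursor, adapter_start, adapter_on_left))
        cursor = adapter_end
        adapter_on_left = True
    if cursor < read_len:
        # The last subread runs to the end of the read: no adapter on its right.
        subreads.append((cursor, read_len, False))
    return subreads


def ccs_eligible(subreads):
    """True if the read has enough full passes to build a CCS read."""
    full_passes = sum(1 for _, _, is_full in subreads if is_full)
    return full_passes >= MIN_FULL_PASSES


# Hypothetical 5 kb polymerase read with four detected adapters.
subreads = split_subreads(5000, [(900, 945), (2100, 2145), (3300, 3345), (4500, 4545)])
full_pass_count = sum(1 for _, _, is_full in subreads if is_full)
print(f"{len(subreads)} subreads, {full_pass_count} full passes, "
      f"CCS eligible: {ccs_eligible(subreads)}")
```

The first and last subreads are partial passes because only one of their flanking adapters was sequenced.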

37 Isoseq3 workflow
- Raw data
- CCS reads
- Full length reads, Non Full length reads
- Cluster
- Consensus
- Low quality, full length polished isoforms; High quality, full length polished isoforms (>Q20)

38 Isoseq3: from CCS to HQ reads

39 Isoseq3: an example of HQ isoform. With Isoseq, we can identify one full-length sequenced transcript for one annotation

40 PacBio: limitations
The sensitivity of the Iso-Seq method is limited by the following factors:
(1) the selection of full-length transcripts is not complete, so not all Reads of Insert represent full-length transcripts
(2) very long transcripts are likely not sequenced in full due to the sequencing length limit
(3) high-quality reads (CCS reads) can be generated only if the target cDNA is short enough to be sequenced by multiple passes.

41 Bioinformatics Visualize your data

42 Visualize alignments
Which format?
- BAM
- BigWig, BedGraph (base-by-base scores)
- BED, GFF (feature-by-feature data)
Which tools?
- Browsers: IGV, Artemis, UCSC Genome Browser, SeqMonk
- Snapshots: Deeptools, ngs.plot

43 Visualize alignments Go to AT4G