RNA-Seq Module 2 From QC to differential gene expression.

1 RNA-Seq Module 2 From QC to differential gene expression. Ying Zhang Ph.D, Informatics Analyst Research Informatics Support System (RISS) MSI Apr. 24, 2012

2 RNA-Seq Tutorials
Tutorial 1: Introductory (Mar. 28 & Apr. 19). RNA-Seq experiment design and analysis. Instruction on individual software will be provided in other tutorials.
Tutorial 2: Introductory (Apr. 3 & Apr. 24). Analysis of RNA-Seq data using TopHat and Cufflinks.
Tutorial 3: Intermediate (May 23). Advanced RNA-Seq analysis topics and troubleshooting.

3 Tutorial Outline
- Review: key definitions and concepts
- Pre-processing of RNA-Seq data: QC and data cleaning
- Applications of RNA-Seq
  - Identification of differential gene expression, using TopHat, Cufflinks and Cuffdiff
  - Definition of the transcriptome: transcriptome assembly with / without a reference genome
  - Comparison of transcriptomes
  - Identification of novel transcripts

4 RNA-Seq Definitions & Concepts

5 Key definitions (I)
- SE: single-end sequencing
- PE: paired-end sequencing
- Mate-pair sequencing
[Diagram: Sample -> mRNA isolation -> fragmentation (size: ~200 bp) -> library preparation -> sequence fragment end(s), by SE or PE sequencing. For mate-pair sequencing: fragmentation (size: ~2000 bp) -> circularization -> fragmentation -> sequence fragment end(s).]

6 Key definitions (II)
Fragment size selection
- Only fragments with a size of around 200 bp are sequenced, in order to reduce sequencing bias.
Sequencing depth is the average read coverage of the target sequences.
- Sequencing depth = total number of reads X read length / estimated target sequence length
- Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M X 50 bp / 5 Mb = 10 X
Library type and sequencing depth by application (ENCODE RNA-Seq guidelines):
Application                              Library type    Sequencing depth
De novo assembly of transcriptome        PE, Mated PE    Extensive (> 50 X)
Refine gene model                        PE, SE          Extensive
Differential gene expression             PE              Moderate (10 X ~ 30 X)
Identification of structural variants    PE              Extensive
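The depth formula above can be checked with a few lines of Python. This is a minimal sketch; the function name `sequencing_depth` is made up for illustration and is not part of any tool used in the tutorial.

```python
def sequencing_depth(num_reads, read_length_bp, target_length_bp):
    """Average coverage = total sequenced bases / estimated target length."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return num_reads * read_length_bp / target_length_bp


# Slide example: 1 million 50 bp reads over a 5 Mb transcriptome.
depth = sequencing_depth(1_000_000, 50, 5_000_000)
print(f"{depth:.0f}X")  # prints "10X"
```

The same function can be turned around to estimate how many reads are needed for a target depth, for example 10 X to 30 X for differential gene expression.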

7 Key definitions (III): Phred (quality) score
The Phred score (Q) is the log transformation of the error rate (P) at each base-calling position:
Q = -10 log10(P)
Scores are encoded using ASCII codes. Sanger standard: ASCII code = Phred score + 33, covering Phred scores 0-93.
- Phred score 30 ~ 1 error per 1000 nucleotides
- Phred score 20 ~ 1 error per 100 nucleotides
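As an illustration of the definition above, the following sketch converts between Phred scores and error probabilities and decodes a Sanger-encoded quality string. The helper names are hypothetical; the offset of 33 is the standard Sanger encoding.

```python
import math

SANGER_OFFSET = 33  # Sanger standard: ASCII code = Phred score + 33


def phred_to_error_prob(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=SANGER_OFFSET):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


print(phred_to_error_prob(30))  # about 0.001, i.e. 1 error per 1000 bases
print(decode_quality("I5"))     # [40, 20]
```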

8 NGS File formats

9 File formats in NGS (I)
[Diagram: CASAVA software -> fastq -> mapping -> SAM/BAM -> assembly -> GTF]

10 File formats in NGS (II) - FASTQ and FASTQ_flt (MSI)
CASAVA: Illumina software package for base calling.
FASTQ format: a text format that stores sequence and quality information, 4 lines per sequence:
1. Read ID (header)
2. Sequence
3. +
4. Quality score
The CASAVA 1.8 header line includes the machine ID, the read pair #, the QC filter flag (Y = bad, N = good) and the barcode.
Example (the beginning of the header line is missing):
... 1:N:0:AGATC
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=
FASTQ_flt data: FASTQ files processed by MSI to remove reads with QC flag Y.
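The filtering that produces FASTQ_flt data can be sketched as below. This is not the MSI pipeline itself, only an illustration that assumes CASAVA 1.8 headers of the form `@<read id> <read>:<is filtered>:<control number>:<index>`; the read IDs in the example are invented.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (4 lines per record)."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines in fours.
    for header, sequence, _plus, quality in zip(it, it, it, it):
        yield FastqRecord(header, sequence, quality)


def passes_filter(header: str) -> bool:
    """Return False when the CASAVA 1.8 filter flag is 'Y' (bad read)."""
    parts = header.split(" ", 1)
    if len(parts) < 2:
        return True  # no CASAVA 1.8 comment field, nothing to filter on
    fields = parts[1].split(":")
    return len(fields) < 2 or fields[1] != "Y"


# Hypothetical two-read input: the second read is flagged as bad.
example = [
    "@INSTR:1:FC:1:1:100:200 1:N:0:AGATC", "ACGT", "+", "IIII",
    "@INSTR:1:FC:1:1:100:201 1:Y:0:AGATC", "TTTT", "+", "####",
]
kept = [rec for rec in parse_fastq(example) if passes_filter(rec.header)]
print(len(kept))  # 1
```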

11 File formats in NGS (III)
[Diagram: CASAVA software -> fastq -> mapping -> SAM/BAM]
SAM/BAM format: sequence alignment format.
- SAM: text format
- BAM: binary version of SAM
- Bitwise flag field: indicates whether a read is mapped or not, paired or not, etc.
SAM/BAM is the standard format for mapped reads and can be used by almost all NGS tools, e.g. assemblers, viewers and quantifiers.
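The bitwise flag field can be decoded with a small lookup table. The bit values below follow the SAM specification; the short names and the function `decode_flag` are chosen here for illustration.

```python
# Bit values from the SAM specification; the names are informal labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(4))
# ['unmapped']
```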

12 File formats in NGS (IV)
[Diagram: CASAVA software -> fastq -> mapping -> SAM/BAM -> assembly -> GTF]
GTF format: Gene Transfer Format.
- Widely used format for annotated genomes and transcriptomes
- Downloadable from major browser sites, e.g. UCSC, Ensembl, NCBI
- Illumina also provides a set of annotated genomes: iGenomes
- Available through Galaxy and the command line
GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
Example (numeric fields and the end of the transcript ID are missing): chr1 unknown exon ... gene_id "Xkr4"; transcript_id "NM_ ;
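A GTF line can be split into the nine columns listed above with a short parser. This is a minimal sketch; the coordinates and IDs in the example line are invented and do not come from a real annotation.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: key "value"; key "value";
    record["attributes"] = dict(
        re.findall(r'(\S+) "([^"]*)";', record["attributes"])
    )
    return record


# Hypothetical line for illustration only.
line = 'chr1\tunknown\texon\t100\t200\t.\t-\t.\tgene_id "GeneX"; transcript_id "TX_1";'
rec = parse_gtf_line(line)
print(rec["feature"], rec["start"], rec["end"], rec["attributes"]["gene_id"])
# exon 100 200 GeneX
```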

13 Steps in RNA-Seq Data Analysis
Step 1: Quality Control. Tool: FastQC. File type: fastq
Step 2: Data prepping. Tools: Filter/Trimmer/Converter. File type: fastqsanger
Step 3: Map Reads to Reference Genome/Transcriptome. Tool: TopHat. File type: bam/sam
Step 4: Assemble Transcriptome. Tool: Cufflinks. File types: gtf; fpkm
Other applications: De novo Assembly; Refine gene models
Identify Differentially Expressed Genes. Tool: Cuffdiff. File types: fpkm; diff

14 Step 1 Quality control of the input data Step 2 Data prepping

15 Quality control of the raw reads
Goal: to determine the quality of the sequencing process.
Recommended program: fastqc, available both in Galaxy and on the Linux platform.
Checklist of read quality:
- File format: Basic Statistics
- Reliability of base calling: Per base sequence quality; Per sequence quality score
- Contamination: Per sequence GC content; Overrepresented sequences

16 Data prepping (I)
Data prepping is NOT NEEDED if:
- The data is in the right format
- Read quality is good: Phred score per base and per sequence >= 20 (better if >= 30)
- No contamination is detected
- Paired reads are synchronized (bad mapping efficiency of PE reads is symptomatic of desynchronization)

17 FASTQ format: BAD vs. GOOD
Wrong Fastq Format (CASAVA, header ending in 1:N:0:AGATC):
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
Right Fastq Format (CASAVA 1.7):
TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT
+
=1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

18 Data prepping Needed (I)
The data format is incorrect:
- paired reads are not indicated as RNAME/1, RNAME/2
- the quality score is not in Sanger/Illumina 1.9 format
Action: change the format, using edit attributes, fastq groomer, or header line converter.
Note: if you are using Galaxy to analyze your data, changing the file name WILL NOT change the file format.

19 Distribution of Phred Score in reads
[Figure panels: Bad; Trimming needed; Good]

20 Data prepping Needed (II)
The data contains bad reads: the quality score of whole reads, or of part of the reads, is < 20.
Action: remove the low-quality reads or bases, using fastq filter and fastq trimmer.
- fastq filter: removes entire reads.
- fastq column trimmer: uniformly removes the same nucleotide positions from all reads.
- fastq quality trimmer: removes all nucleotide positions with low quality.
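The idea behind filtering and quality trimming can be sketched in Python. This is not the code of the Galaxy tools; it is a simplified illustration that assumes Sanger-encoded qualities (offset 33), trims only from the 3' end, and filters on the mean read quality.

```python
def trim_low_quality_3prime(sequence, quality, min_q=20, offset=33):
    """Drop bases from the 3' end until the last base has Phred >= min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(quality, min_mean_q=20, offset=33):
    """Filter rule (an assumption for this sketch): keep reads whose mean Phred is >= min_mean_q."""
    if not quality:
        return False
    mean_q = sum(ord(ch) - offset for ch in quality) / len(quality)
    return mean_q >= min_mean_q


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Sanger format.
print(trim_low_quality_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
print(keep_read("IIII"), keep_read("####"))         # True False
```

Because trimming and filtering can remove reads from only one file of a pair, paired-end data should be re-synchronized afterwards.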

21 Example of bad data: sequence contamination

22 Data prepping Needed (III) Adapter sequences are detected Action: remove the adapter sequences, using CutAdapt

23 Data prepping Needed (IV)
The data is out of sync: the forward and reverse reads are not arranged in the same order.
Action: synchronize the files, using fastq interlacer and fastq de-interlacer.
Note: the synchronization check and correction should be the last step in data prepping, because the earlier prepping steps can de-synchronize PE data.
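A synchronization check amounts to comparing read names in the forward and reverse files, position by position. The sketch below is an illustration, not the Galaxy interlacer; it assumes header lines in either the RNAME/1, RNAME/2 style or the CASAVA 1.8 style, and the read names are invented.

```python
def read_name(header):
    """Strip the leading '@', any comment field, and a trailing /1 or /2."""
    name = header[1:] if header.startswith("@") else header
    name = name.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def is_synchronized(forward_headers, reverse_headers):
    """True when both files list the same reads in the same order."""
    fwd = [read_name(h) for h in forward_headers]
    rev = [read_name(h) for h in reverse_headers]
    return fwd == rev


forward = ["@r1/1", "@r2/1"]
reverse = ["@r1/2", "@r2/2"]
print(is_synchronized(forward, reverse))      # True
print(is_synchronized(forward[1:], reverse))  # False: r1 was filtered from one file
```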

24 Summary - Data prepping
Data prepping is NEEDED if:
- The data format is incorrect
- The data contains bad reads
- Adapter sequences are detected
- The data is out of sync, meaning the forward and reverse reads are not in the same order

25 Summary: Galaxy Tools for pre-processing
1. Fastq Groomer: convert quality scores to Sanger standard format. Necessary for data generated with CASAVA 1.7 or earlier; not needed for CASAVA 1.8 and above.
2. Convert read header format from 1.8 to 1.7
3. Fastq filter: removal of low-quality reads
4. Fastq Trimmer: removal of low-quality end bases
5. Cutadapt: cut adapter sequences
6. Synchronization: Fastq Interlacer/De-Interlacer. Critical for PE data analysis.

26 Applications of RNA-Seq

27
1. Evaluation of a tissue's transcriptome: what is the composition of the transcriptome?
2. Comparative analysis of two or more transcriptomes: how do the transcriptomes of two or more species compare?
3. Differential gene expression: which genes are differentially regulated in two or more conditions? (This tutorial)

28 Differential Gene Expression (DGE) Two Scenarios

29
1. DGE - Non-discovery mode: DGE without detection of novel transcripts
2. DGE - Discovery mode: DGE with detection of novel transcripts

30 1. DGE - Non-discovery mode
[Workflow diagram] For each condition (Condition 1, Condition 2): fastq -> Quality Control (fastqc) -> Map Reads to Reference sequence or genome (TopHat) -> Mapped Reads (bam/sam). The mapped reads of sample 1 and sample 2 are then compared, using a Pre-defined Annotation: Identify Differential Expression (Cuffdiff) -> fpkm; diff.

31 2. DGE - Discovery mode
[Workflow diagram] Data Prepping (fastq) -> Map Reads to Reference sequence or genome (TopHat; SAM/BAM) -> Assemble sample transcriptome with discovery of novel transcripts (Cufflinks; gtf) -> Merge sample transcriptomes into one (Cuffcompare* / Cuffmerge; gtf) -> Identify Differential Expression (Cuffdiff; fpkm; diff)
* Only available in Galaxy

32 RNA-Seq analytical tool: Tuxedo
- A mapper: Bowtie. Maps short reads to the reference genome.
- A splice junction aligner: TopHat. Uses Bowtie to align short reads to a reference genome or sequence. It infers and estimates the splicing sites.
- A transcriptome assembler: Cufflinks, with cuffcompare (comparing transcriptomes), cuffmerge (merging transcriptomes) and cuffdiff (identifying differentially expressed genes).
- A visualization (R) package: cummeRbund.
[Diagram: Bowtie -> TopHat (bam/sam) -> Cufflinks (gtf; fpkm) -> Cuffmerge / Cuffcompare -> Cuffdiff (diff; fpkm) -> cummeRbund]
Reference: Trapnell C. et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols

33 Quantify Expression Abundance (Cufflinks): FPKM
[Diagram: Sample 1 -> mRNA isolation -> fragmentation -> RNA to cDNA -> paired-end sequencing -> map reads to the reference genome/transcriptome -> calculate transcript abundance. For Gene A and Gene B in each sample, the diagram tabulates: # of fragments (paired reads); # of fragments per kilobase of exon; # of fragments per kilobase of exon per million mapped reads (FPKM).]
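The FPKM definition on this slide corresponds to a simple normalization by gene length and library size. The sketch below shows only that arithmetic with invented counts; Cufflinks itself estimates abundances with a statistical model over isoforms, which is not reproduced here.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical numbers: 500 fragments on a 2 kb gene, 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))   # 25.0
# Same fragment count on a gene twice as long gives half the FPKM.
print(fpkm(500, 4000, 10_000_000))   # 12.5
```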

34 DGE Non discovery mode

35 DGE without detection of novel transcripts
Approach: treat RNA-Seq as a high-resolution microarray.
Most appropriate for:
- Quick identification and analysis of differentially expressed genes
- Systems with an annotated reference genome, such as human and mouse
Limitations specific to this analysis:
- Reads are mapped only to previously known transcripts
- Differential expression is tested only for previously known transcripts
Programs to use:
1. TopHat: the mapper
2. Cuffdiff: the tester

36 Why choose TopHat as the mapper?
Mapping RNA-Seq reads to the genome is a big challenge: RNA-Seq reads are continuous, but the genomic sequence is discontinuous (exons are separated by introns).
Initially, genome aligners (such as BWA) treated introns as gaps.
[Figure: BWA alignment at Dgcr2]
Reference: Trapnell C et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics

37 INTRONs shouldn't be treated as GAPs.
[Figure: BWA alignment at Dgcr2]
If there were no introns, reads should continuously cover the splice junctions.
[Figure: exon coverage at Dgcr2]

38 TopHat: the splice junction aligner
- It became intuitive to incorporate splicing information into the mapping process.
- Later, it became necessary to build splicing junctions ab initio, because the set of known junctions is incomplete.
- Splicing signals: donor and acceptor sites.
So TopHat was developed.
[Figure: TopHat vs. BWA alignment at Dgcr2]
Reference: Trapnell C et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics

39 Basis of TopHat
Step 1: mapping
- Direct un-spliced, un-paired mapping (using bowtie)
Step 2: building splicing junctions (+/- using ref juncs)
- Assemble contiguous coverage islands
- Identify possible splice donors and acceptors
- Predict possible splicing junctions
Step 3: 2nd mapping
- Uses bowtie again in closure search (finding splice junctions with mapping support)
Output: mapping results (SAM/BAM file) and junctions.bed. Only the SAM/BAM file will be used by Cufflinks and Cuffdiff.
Courtesy of Dr. Kevin Silverstein

40 TopHat General considerations
Q1: Is the project in human or mouse? TopHat is optimized for the human and mouse genomes.
A: Yes. Action: nothing to change; the default parameters can be used.
A: No. Action: the default parameters cannot be used. Input all species-specific parameters, e.g. gene-model related parameters such as intron length.

41 TopHat General considerations
Q2: Is the library paired-end?
A: Yes. Action: set the parameters for the mean inner distance between paired reads (-r) and the standard deviation of the inner distance (--mate-std-dev).
inner distance = fragment length (220) - 2 X read length
A: No. Action: nothing to change.
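The inner-distance rule above is easy to get wrong by a factor of the read length, so a short calculation helps. In this sketch the 50 bp read length and the standard deviation of 20 are assumed values for illustration; `-r` and `--mate-std-dev` are the TopHat options named on the slide.

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance = fragment length - 2 x read length."""
    return fragment_length - 2 * read_length


def tophat_paired_end_args(fragment_length, read_length, std_dev):
    """Build the paired-end portion of a TopHat command line."""
    inner = mate_inner_distance(fragment_length, read_length)
    return ["-r", str(inner), "--mate-std-dev", str(std_dev)]


# Slide fragment length of 220 bp; read length and std dev are assumptions.
print(mate_inner_distance(220, 50))         # 120
print(tophat_paired_end_args(220, 50, 20))  # ['-r', '120', '--mate-std-dev', '20']
```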

42 TopHat options - Non-discovery analytical approach
Q3: What are the parameters to select?
1. Select Full parameter list.
2. Select Yes for the option to Use own junctions.
3. Select Yes for the option to Use gene annotation model, AND provide a known annotation (gtf file).
4. Select Yes for the option to Only look for supplied junctions.

43 Assessing mapping efficiency
A: Review the mapping statistics.
1. % of reads mapped, % of reads properly paired
2. Use Samtools and Picard tools:
- flagstat: line 3 and line 7
- For TopHat, first filter the BAM file on a MAPQ value of 255 (tool: Filter SAM or BAM files on FLAG MAPQ RG LN or by region)
- SAM/BAM Alignment Summary Metrics
3. Estimate the insertion size: Insertion size metrics
Recommendations: for human and mouse, good mapping will result in
- >= 80% mapping percentage
- >= 70% paired reads
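The two percentages can also be computed directly from the FLAG column of a SAM file. The sketch below is an illustration rather than a replacement for flagstat or Picard; it counts only primary alignments and applies the 80% / 70% thresholds recommended above. The list of flags is invented.

```python
def mapping_summary(flags):
    """Compute % mapped and % properly paired from SAM FLAG values."""
    # Skip secondary (0x100) and supplementary (0x800) alignments.
    primary = [f for f in flags if not f & 0x900]
    total = len(primary)
    if total == 0:
        return {"total": 0, "pct_mapped": 0.0, "pct_proper_pair": 0.0}
    mapped = sum(1 for f in primary if not f & 0x4)
    proper = sum(1 for f in primary if f & 0x2 and not f & 0x4)
    return {
        "total": total,
        "pct_mapped": 100.0 * mapped / total,
        "pct_proper_pair": 100.0 * proper / total,
    }


def mapping_is_good(summary, min_mapped=80.0, min_paired=70.0):
    """Apply the slide's recommendation for human and mouse data."""
    return (summary["pct_mapped"] >= min_mapped
            and summary["pct_proper_pair"] >= min_paired)


# Hypothetical flags: two proper pairs (99/147) and one unmapped read (4).
summary = mapping_summary([99, 147, 99, 147, 4])
print(summary["pct_mapped"], summary["pct_proper_pair"])  # 80.0 80.0
print(mapping_is_good(summary))                            # True
```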

44 Mapping visualization: Integrative Genomics Viewer (IGV)

45 Run IGV locally to view multiple tracks
Direction to install IGV:
[Figure: IGV tracks for the Healthy and Cancer samples]

46 DGE Workflow - Non-discovery mode
[Workflow diagram] For each condition (Condition 1, Condition 2): fastq -> Quality Control (fastqc) -> Map Reads to Reference sequence or genome (TopHat) -> Mapped Reads (bam/sam). The mapped reads of sample 1 and sample 2 are then compared, using a Pre-defined Annotation: Identify Differential Expression (Cuffdiff) -> fpkm; diff.

47 Cuffdiff Facts
Cuffdiff:
- Quantifies the gene expression abundance
- Performs statistical evaluation of the differential expression
Considerations on handling tail data:
- Exclude the low-expressed genes to remove transcription artifacts. Set the parameter for Min Alignment Count.
[Figure: density plot of global gene expression; x-axis: log10(fpkm)]

48 Cuffdiff Facts
Cuffdiff:
- Quantifies the gene expression abundance
- Performs statistical evaluation of the differential expression
Considerations on handling tail data:
- Exclude the highly expressed genes, such as some house-keeping genes. Set yes to Perform quartile normalization.
[Figure: density plot of global gene expression; x-axis: log10(fpkm)]

49 Cuffdiff output
[Figure: Cuffdiff output for Healthy_sample and Cancer_sample]

50 Post-analysis processing and iterations
Check for non-biological variation
- Also known as technical variation, or within-group variation.
- This type of variation is detected among samples of the same group.
Sources of technical variation:
- Batch effect: how were the samples collected and processed? Were the samples processed as groups, and if so, what was the grouping?
- Non-synchronized cell cultures: were all the cells from the same genetic background and growth phase?
- Use technical replicates rather than biological replicates
Detection of non-biological variation:
- PCA analysis; or
- MDS analysis; or
- Unsupervised clustering analysis of FPKM values

51 Steps in PCA analysis
Construct the multiple-variable matrix, e.g. a table of FPKM values with one row per transcript (gene1, gene2, ...) and one column per sample (Sample A, Sample V, Sample O, Sample E, Sample I, Sample U).
Group 1 (A, V, O); Group 2 (E, I, U)
[Figure: PCA plot of the six samples along the principal components, starting with PC1]

52 DGE Discovery mode

53
[Workflow diagram]
- QC with fastqc: CONDITION A (SampA.fq) and CONDITION B (SampB.fq)
- Alignment with TopHat, using Reference Index Genome.fa: SampA.bam, SampB.bam
- Assemble with Cufflinks, using Reference Gene Annotation Genes.gtf: SampA.gtf, SampB.gtf (Discovery Phase)
- Merge assemblies with Cuffcompare: merged.gtf
- Quantitation and differential expression with cuffdiff: gene_exp.diff; isoform_exp.diff;
- Visualization with cummeRbund
- Store Results

54 DGE with detection of novel transcripts
- Novel transcripts will be assembled and tested for differential expression.
- Potential identification of new splicing variants.
- Key advantage (over microarray): not limited by previous knowledge; extends current knowledge banks.
Programs used:
1. TopHat: the mapper
2. Cufflinks: the assembler
3. Cuffdiff: the tester

55 TopHat - Best practice in Discovery Mode
Same as before, but TopHat needs to be run at least TWICE in order to identify the splicing junctions reliably and consistently.
1. The first run generates a full list of junctions.
2. The second run applies the full junction file to all the samples to keep the mapping consistent.
The TWO-STEP running of TopHat:
1. Run TopHat as before.
2. Re-run TopHat with a list of junctions (see settings in the next slide).

56 TopHat options - Discovery analytical approach
1. Combine the sample junctions.bed files into one using Concatenate.
2. Turn on Full parameter list.
3. Turn on (set yes to) the option for Use own junctions.
4. Provide the junctions file (bed file).
5. Turn on the option for Use Closure Search.
6. Turn on Use Microexon Search.

57 Considerations for Cufflinks

58 Cufflinks facts
- Optimized for human and mouse genomes
- Uses a parsimonious method to assemble the transcripts, with or without a known annotation
- Can estimate the transcript abundances. FPKM: # of Fragments Per Kilobase of exon model per Million mapped fragments
- Can estimate the fragment length distribution (not available in Galaxy)
- Output file: GTF file

59 Cufflinks General considerations
Q1: Is the project in human or mouse? Cufflinks is optimized for the human and mouse genomes.
A: Yes. Action: nothing to change.
A: No. Action: the default parameters cannot be used. Input all species-specific parameters, e.g. gene-model related parameters such as intron length.

60 Cufflinks General considerations
Q2: Do you want to use a known annotation in the transcriptome assembly and also report the novel transcripts assembled?
A: Yes. Action: use the option Use Reference Annotation; select Use Reference Annotation as Guide.
A: No. Action: nothing to change.

61 Cufflinks General considerations
Q3: Can I pool samples as one input to Cufflinks?
A: No, because some isoforms might be lost this way.
- An isoform may be called from only one sample, due to uncontrollable variation in sample preparation.
- Cufflinks only reports isoforms above a certain abundance threshold (10% of the major transcript).
- A rare isoform is diluted in the pooled samples, so it may go missing from the assembly.
[Table: Isoform A (FPKM), Isoform B (FPKM) and whether Isoform B is called, for each sample and for the pooled data. Called in one sample: Yes; in the other sample: No; in the pooled data: No.]
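The dilution effect can be reproduced with a toy calculation. The FPKM values below are invented, pooling is approximated by adding the abundances, and `called_isoforms` is a hypothetical helper that applies only the 10% rule mentioned above, not the full Cufflinks assembly logic.

```python
def called_isoforms(fpkm_by_isoform, min_fraction=0.10):
    """Return isoforms whose abundance is at least min_fraction of the major isoform."""
    major = max(fpkm_by_isoform.values())
    return {
        name for name, value in fpkm_by_isoform.items()
        if value > 0 and value >= min_fraction * major
    }


# Invented abundances: isoform B is rare and present only in sample 1.
sample1 = {"A": 10.0, "B": 2.0}
sample2 = {"A": 90.0, "B": 0.0}
# Simplifying assumption: pooling adds the abundances of the two samples.
pooled = {name: sample1[name] + sample2[name] for name in sample1}

print("B" in called_isoforms(sample1))  # True  (2 is 20% of 10)
print("B" in called_isoforms(sample2))  # False
print("B" in called_isoforms(pooled))   # False (2 is 2% of 100)
```

Assembling each sample separately and merging the assemblies afterwards keeps isoform B in the final annotation.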

62 Cuffcompare Facts
Cuffcompare
- Compares multiple transcriptomes and reports the similarity between them.
- Available in Galaxy.
Cuffmerge
- A new function implemented in the Cufflinks package.
- Its purpose is to remove assembly artifacts.
- Available using command line tools.

63 Follow the same instructions as in DGE non-discovery mode to run Cuffdiff and the post-processing.

64 Reproducibility and the value of Workflow

65 Analysis strategy in Workflow
A Workflow is a sequential collection of Galaxy operations that completes an analysis.

66 Create a Workflow
- From scratch
- From current history
- Edit existing workflow

67 Share/Publish/Use Workflow

68 Tutorial optional material
Evaluation of transcriptome: Two Scenarios

69
1. De novo assembly of transcriptome: assemble the transcriptome without a reference transcriptome/genome
2. Reference-guided assembly of transcriptome

70 Key definitions
- Short reads
- Contigs = consensus of overlapping reads
- Scaffolds = contigs + known-length gaps (the gap lengths can be estimated by mate-pair sequencing)
- Draft transcriptome/genome = a collection of non-ordered scaffolds

71 De novo assembly of the transcriptome
[Workflow diagram] Samples (RNA-Seq; fastq) -> Pre-processing: QC and Data cleaning (fastq) -> De novo Assembly of Transcriptome (Trans-ABySS*)
* Only one assembler is shown in this diagram, to illustrate the concept of assembly. To construct a reliable transcriptome, multiple assemblers should be used to generate a consensus assembly.

72 Trans-ABySS Facts
ABySS is a de novo, parallel sequence assembler designed for short reads.
- Works on single-end and paired-end reads.
- Is a de Bruijn graph assembler.
- It takes two steps: (1) use all possible k-mers from the reads to build the initial contigs; (2) use mate-pair information to extend the contigs.
Trans-ABySS is a pipeline for analyzing ABySS-assembled contigs from RNA-Seq data.
- Uses several k-mer lengths.
- Availability: command line
- Homepage:
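To make the first assembly step more concrete, the sketch below enumerates k-mers from reads and links them into a toy de Bruijn graph. It illustrates the general technique only and is not how ABySS is implemented; the reads and the value of k are invented.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


print(kmers("ACGTA", 3))  # ['ACG', 'CGT', 'GTA']

# Two overlapping toy reads share the k-mer CGT, so their paths join.
graph = de_bruijn_edges(["ACGT", "CGTA"], 3)
print(dict(graph))  # {'AC': {'CG'}, 'CG': {'GT'}, 'GT': {'TA'}}
```

Walking the unbranched path AC -> CG -> GT -> TA spells the contig ACGTA, which is longer than either read. Running the assembly with several k-mer lengths, as Trans-ABySS does, trades sensitivity for specificity.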

73 Reference-guided assembly of transcriptome (also known as transcriptome reconstruction)
[Workflow diagram] Samples (RNA-Seq; fastq) -> Pre-processing: QC and Data cleaning (fastq) -> Map short reads to reference genome (TopHat; BAM/SAM), with a Known Annotation -> Assemble transcriptome from mapped reads (Cufflinks; GTF)

74 If you choose TopHat and Cufflinks as the assembler, follow the instructions in DGE - Discovery mode.

75 Specific Notes for Prokaryote samples
Cufflinks developer: "We don't recommend assembling bacteria transcripts using Cufflinks at first. If you are working on a new bacteria genome, consider a computational gene finding application such as Glimmer."
So for a bacterial transcriptome:
- If the genome is available, do genome annotation first, then reconstruct the transcriptome.
- If the genome is not available, try de novo assembly, followed by gene annotation.

76 Summary: Hybrid method on transcriptome assembly
Reference: Martin J. et al. (2011) Next-generation transcriptome assembly. Nature Reviews Genetics

77 Comparative Study of Transcriptomes

78 Q: How can I compare different transcriptomes?
Inputs: Sample 1.gtf, Sample 2.gtf, Sample 3.gtf, ..., Sample N.gtf
- Cuffcompare: compare individual transcriptomes
- Generic tools: Operate on Genomic Intervals

79 Galaxy Tools for pre-processing
- Cuffcompare
- Operate on Genomic Intervals

80 Downstream visualization and analysis
Will be covered in Tutorial Module 3.
- IGV: Integrative Genomics Viewer
- IPA: Ingenuity Pathway Analysis
- Other analysis packages (R): ArrayExpressHTS, cummeRbund

81 Discussion and Questions?
Get Support at MSI:
- General questions: subject line: RISS:
- Galaxy questions: subject line: Galaxy:
