RNA-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

Size: px

Start display at page:

Download "RNA-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland"

Percival Gibbs
5 years ago
Views:

1 RNA-seq data analysis with Chipster Eija Korpelainen CSC IT Center for Science, Finland

2 What will I learn? How to operate the Chipster software Short introduction to RNA-seq Analyzing RNA-seq data Central concepts Analysis steps File formats How to design experiments

3 Introduction to Chipster

4 Chipster Provides an easy access to over 350 analysis tools Command line tools R/Bioconductor packages Free, open source software What can I do with Chipster? analyze and integrate high-throughput data visualize data efficiently share analysis sessions save and share automatic workflows

5 Technical aspects Client-server system Enough CPU and memory for large analysis jobs Centralized maintenance Easy to install Client uses Java Web Start Server available as a virtual machine

8 Mode of operation Select: data tool category tool run visualize

9 Job manager You can run many analysis jobs at the same time Use Job manager to view status cancel jobs view time view parameters

10 Analysis history is saved automatically -you can add tool source code to reports if needed

11 Analysis sessions In order to continue your work later, you have to save the analysis session. Session includes all the files, their relationships and metadata (what tool and parameters were used to produce each file). Session is saved into a single.zip file. You can save it locally on your computer or in the cloud (in Chipster v3.6.4) Session files allow you to continue the analysis on another computer, or share it with a colleague. You can have multiple analysis sessions saved separately, and combine them later if needed.

12 Workflow panel Shows the relationships of the files You can move the boxes around, and zoom in and out. Several files can be selected by keeping the Ctrl key down Right clicking on the data file allows you to Save an individual result file ( Export ) Delete Link to another data file Save workflow

13 Workflow reusing and sharing your analysis pipeline You can save your analysis steps as a reusable automatic macro, which you can apply to another dataset When you save a workflow, all the analysis steps and their parameters are saved as a script file, which you can share with other users

Saving and using workflows Select the starting point for your workflow Select Workflow/ Save starting from selected Save the workflow file on your computer

14 Saving and using workflows Select the starting point for your workflow Select Workflow/ Save starting from selected Save the workflow file on your computer with a meaningful name Don t change the ending (.bsh) To run a workflow, select Workflow->Open and run Workflow->Run recent (if you saved the workflow recently).

15 Analysis tool overview 150 NGS tools for 140 microarray tools for RNA-seq gene expression mirna-seq mirna expression exome/genome-seq protein expression ChIP-seq acgh FAIRE/DNase-seq SNP MeDIP-seq integration of different data CNA-seq Metagenomics (16S rrna) 60 tools for sequence analysis BLAST, EMBOSS, MAFFT Phylip

16 Tools for QC, processing and mapping FastQC PRINSEQ FastX TagCleaner Trimmomatic Bowtie TopHat BWA Picard SAMtools BEDTools

17 RNA-seq tools Quality control RseQC Counting HTSeq express Transcript discovery Cufflinks Differential expression edger DESeq Cuffdiff DEXSeq Pathway analysis ConsensusPathDB

18 mirna-seq tools Differential expression edger DESeq Retrieve target genes PicTar mirbase TargetScan miranda Pathway analysis for targets GO KEGG Correlate mirna and target expression

19 Exome/genome-seq tools Variant calling Samtools, bcftools Variant filtering and reporting VCFtools Variant annotation AnnotateVariant (Bioconductor) Variant Effect Predictor

Detect motifs, match to JASPAR MotIV, rgadem Dimont

20 ChIP-seq and DNase-seq tools Peak detection MACS F-seq Peak filtering P-value, no of reads, length Detect motifs, match to JASPAR MotIV, rgadem Dimont Retrieve nearby genes Pathway analysis GO, ConsensusPathDB

21 MeDIP-seq tools Detect methylation, compare two conditions MEDIPS

22 CNA-seq tools Count reads in bins Correct for GC content Segment and call CNA Filter for mappability Plot profiles Group comparisons Clustering Detect genes in CNA GO enrichment Integrate with expression

23 Metagenomics / 16 S rrna tools Taxonomy assignment with Mothur package Align reads to 16 S rrna template Filter alignment for empty columns Keep unique aligned reads Precluster aligned reads Remove chimeric reads Classify reads to taxonomic units Statistical analyses using R Compare diversity or abundance between groups using several ANOVA-type of analyses

24 Visualizing the data Data visualization panel Maximize and redraw for better viewing Detach = open in a separate window, allows you to view several images at the same time Two types of visualizations 1. Interactive visualizations produced by the client program Select the visualization method from the pulldown menu Save by right clicking on the image 2. Static images produced by analysis tools Select from Analysis tools/ Visualisation View by double clicking on the image file Save by right clicking on the file name and choosing Export

Interactive visualizations by the client Genome browser Spreadsheet Histogram Venn diagram Scatterplot 3D scatterplot Volcano plot Expression profiles

25 Interactive visualizations by the client Genome browser Spreadsheet Histogram Venn diagram Scatterplot 3D scatterplot Volcano plot Expression profiles Clustered profiles Hierarchical clustering SOM clustering Available actions: Select genes and create a gene list Change titles, colors etc Zoom in/out

27 Static images produced by R/Bioconductor Dispersion plot MA plot MDS plot Box plot Histogram Heatmap Idiogram Chromosomal position Correlogram Dendrogram K-means clustering SOM-clustering etc

28 Options for importing data to Chipster Import files/import folder Note that you can select several files by keeping the Ctrl key down Compressed files (.gz) are ok FASTQ, BAM, BED, VCF and GTF files are recognized automatically SAM/BAM and BED files can be sorted at the import stage or later Import from URL Utilities / Download file from URL directly to server Import from SRA database Utilities / Retrieve FASTQ or BAM files from SRA Import from Ensembl database Utilities / Retrieve data for a given organism in Ensembl Open an analysis session (made by you or somebody else) Files / Open session

29 Problems? Send us a support request -request includes the error message and link to analysis session (optional)

30 Acknowledgements to Chipster users and contibutors

31 More info Chipster tutorials in YouTube

32 Introduction to RNA-seq

33 What can I investigate with RNA-seq? Differential expression Isoform switching New genes and isoforms New transcriptomes Variants Allele-specific expression Etc etc

34 Is RNA-seq better than microarrays? + Wider detection range + Can detect new genes and isoforms - Data analysis is not as established as for microarrays - Data is voluminous

35 How was your data produced?

36 RNA-seq data analysis

37 RNA-seq data analysis: typical steps Align reads to reference genome Match alignment positions with known gene positions Gene A Gene B Count how many reads each gene has A = 6 B = 11

38 RNA-seq data analysis: file formats Fastq (raw reads) BAM (aligned reads) Fasta (ref. genome) Gene A Gene B GTF (ref. annotation) A = 6 B = 11 tsv (count table)

39 RNA-seq data analysis workflow

40 RNA-seq data analysis workflow today

41 RNA-seq data analysis workflow Quality control of raw reads Preprocessing if needed Alignment (=mapping) to reference genome Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

42 What and why? Potential problems low confidence bases, Ns sequence specific bias, GC bias adapters sequence contamination Knowing about potential problems in your data allows you to correct for them before you spend a lot of time on analysis take them into account when interpreting results

43 Software packages for quality control FastQC FastX PRINSEQ TagCleaner...

44 Raw reads: FASTQ file format Four lines per read: Line 1 begins with a '@' character and is followed by a sequence identifier. Line 2 is the sequence. Line 3 begins with a '+' character and can be followed by the sequence identifier. Line 4 encodes the quality values for the sequence, encoded with a single ASCII character for brevity. name GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + read name!''*((((***+))%%%++)(%%%%).1***-+*''))**55ccf>>>>>>ccccccc65

45 Base qualities If the quality of a base is 20, the probability that it is wrong is Phred quality score Q = -10 * log 10 (probability that the base is wrong) T C A G T A C T C G Sanger encoding: numbers are shown as ASCII characters so that 33 is added to the Phred score E.g. 39 is encoded as H, the 72nd ASCII character (39+33 = 72) Note that older Illumina data uses different encoding Illumina1.3: add 64 to Phred Illumina : add 64 to Phred, ASCII 66 B means that the whole read segment has low quality

46 Base quality encoding systems

47 Per position base quality (FastQC) good ok bad

48 Per position base quality (FastQC)

49 Per position sequence content (FastQC)

50 Per position sequence content (FastQC) Sequence specific bias: Correct sequence but biased location, typical for Illumina RNA-seq data

51 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Manipulation of alignment files Alignment level quality control Quantitation Experiment level quality control Visualization of reads and results in genomic context Differential expression analysis

52 Filtering vs trimming Filtering removes the entire read Trimming removes only the bad quality bases It can remove the entire read, if all bases are bad Trimming makes reads shorter This might not be optimal for some applications Paired end data: the matching order of the reads in the two files has to be preserved If a read is removed, its pair has to removed as well

53 What base quality threshold should be used? No consensus yet Trade-off between having good quality reads and having enough sequence

54 Software packages for preprocessing FastX PRINSEQ TagCleaner Trimmomatic Cutadapt TrimGalore!...

55 PRINSEQ filtering options in Chipster Base quality scores Minimum quality score per base Mean read quality Ambiguous bases Maximum count/ percentage of Ns that a read is allowed to have Low complexity DUST (score > 7), entropy (score < 70) Length Minimum length of a read Duplicates Exact, reverse complement, or 5 /3 duplicates Copes with paired end data

56 PRINSEQ trimming options in Chipster Trim based on quality scores Minimum quality, look one base at a time Minimum (mean) quality in a sliding window From 3 or 5 end Trim x bases from left/ right Trim to length x Trim polya/t tails Minimum number of A/Ts From left or right Minimum read length after trimming Copes with paired end data

57 Trimmomatic options in Chipster Adapters Minimum quality Per base, one base at a time or in a sliding window, from 3 or 5 end Per base adaptive quality trimming (balance length and errors) Minimum (mean) read quality Trim x bases from left/ right Minimum read length after trimming Copes with paired end data

58 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Manipulation of alignment files Alignment level quality control Quantitation Experiment level quality control Visualization of reads and results in genomic context Differential expression analysis

59 Alignment to reference genome/transcriptome Goal is to find out where a read originated from Challenge: variants, sequencing errors, repetitive sequence Mapping to transcriptome allows you to count hits to known transcripts genome allows you to find new genes and transcripts Many organisms have introns, so RNA-seq reads map to genome non-contiguously spliced alignments needed Difficult because sequence signals at splice sites are limited and introns can be thousands of bases long

60 Tens of aligners are available

61 Splice-aware aligners TopHat (uses Bowtie) STAR GSNAP RUM MapSplice... Nature methods 2013 (10:1185)

62 Mapping quality Confidence in read s point of origin Depends on many things, including uniqueness of the aligned region in the genome length of alignment number of mismatches and gaps Expressed in Phred scores, like base qualities Q = -10 * log 10 (probability that mapping location is wrong) TopHat mapping qualities 50 = unique mapping 3 = maps to 2 locations 1 = maps to 3-4 locations 0 = maps to 5 or more locations

63 Bowtie2 Fast and memory efficient aligner Can make gapped alignments (= can handle indels) Cannot make spliced alignments, but is used by TopHat2 which can Two alignment modes: End-to-end Local Read is aligned over its entire length Maximum alignment score = 0, deduct penalty for each mismatch (less for low quality base), N, gap opening and gap extension Read ends don t need to align, if this maximizes the alignment score Add bonus to alignment score for each match Reference (genome) is indexed to speed up the alignment process

TopHat2 Relatively fast and memory efficient spliced aligner Performs several alignment steps Uses Bowtie2 end-to-end mode for aligning Low

64 TopHat2 Relatively fast and memory efficient spliced aligner Performs several alignment steps Uses Bowtie2 end-to-end mode for aligning Low tolerance for mismatches If annotation (GTF file) is available, builds a virtual transcriptome and aligns reads to that first Kim et al, Genome Biology 2013

65 TopHat2 spliced alignment steps Kim et al, Genome Biology 2013

67 File format for mapped reads: BAM/SAM SAM (Sequence Alignment/Map) is a tab-delimited text file containing read alignment data. BAM is a binary form of SAM. optional header section alignment section: one line per read, containing 11 mandatory fields, followed by optional tags

68 Fields in BAM/SAM files read name flag 272 reference name 1 position mapping quality 0 CIGAR mate name * mate position 0 insert size 0 HWI-EAS229_1:2:40:1280:283 49M6183N26M sequence AGGGCCGATCTTGGTGCCATCCAGGGGGCCTCTACAAGGAT AATCTGACCTGCTGAAGATGTCTCCAGAGACCTT base qualities ECC@EEF@EB:EECFEECCCBEEEE;>5;2FBB@FBFEEFCF@F FFFCEFFFFEE>FFEFC=@A;@>1@6.+5/5 tags MD:Z:75 NH:i:7 AS:i:-8 XS:A:-

69 CIGAR string M = match or mismatch I = insertion D = deletion N = intron (in RNA-seq read alignments) S = soft clip (ignore these bases) H = hard clip (ignore and remove these bases) VN:1.3 SN:ref LN:45 r ref M2I4M1D3M = TTAGATAAAGGATACTG * The corresponding alignment Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT r001 TTAGATAAAGGATA*CTG

70 Flag field in BAM Read s flag number is a sum of values E.g. 4 = unmapped, 1024 = duplicate Explained in detail at You can interpret them at

71 Alignment level quality checks How many reads mapped to the reference? How many of them mapped uniquely? How many pairs mapped? How many pairs mapped concordantly? Mapping quality distribution? Example report from Samtools Flagstat command in total (QC-passed reads + QC-failed reads) duplicates mapped (100.00%:-nan%) paired in sequencing read read properly paired (80.74%:-nan%) with itself and mate mapped singletons (15.02%:-nan%) with mate mapped to a different chr with mate mapped to a different chr (mapq>=5)

72 BAM index file (.bai) BAM files can be sorted by chromosomal coordinates and indexed for efficient retrieval of reads for a given region. The index file must have a matching name. (e.g. reads.bam and reads.bam.bai) Genome browser requires both BAM and the index file. The alignment tools in Chipster automatically produce sorted and indexed BAMs. When you import BAM files, Chipster asks if you would like to preproces them (convert SAM to BAM, sort and index BAM).

73 Manipulating BAM files (SAMtools, Picard) Convert SAM to BAM, sort and index BAM Preprocessing when importing SAM/BAM, runs on your computer. The tool available in the Utilities category runs on the server. Index BAM Count alignments in BAM How many alignments does the BAM contain. Includes an optional mapping quality filter. Count alignments per chromosome in BAM Count alignment statistics for BAM Collect multiple metrics for BAM Retrieve alignments for a given chromosome/region Makes a subset of BAM, e.g. chr1: , inc quality filter.

74 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

75 Annotation-based quality metrics Saturation of sequencing depth Would more sequencing detect more genes and splice junctions? Read distribution between different genomic features Exonic, intronic, intergenic regions Coding, 3 and 5 UTR exons Protein coding genes, pseudogenes, rrna, mirna, etc Is read coverage uniform along transcripts? Biases introduced in library construction and sequencing polya capture and polyt priming can cause 3 bias random primers can cause sequence-specific bias GC-rich and GC-poor regions can be under-sampled Genomic regions have different mappabilities (uniqueness)

76 Quality control programs for aligned reads RseQC (available in Chipster) RNA-seqQC Qualimap Picards s CollectRnaSeqMetrics

Quality assessment with RseQC Checks coverage uniformity, saturation of sequencing depth, novelty of splice junctions, read distribution between different

77 Quality assessment with RseQC Checks coverage uniformity, saturation of sequencing depth, novelty of splice junctions, read distribution between different genomic regions, etc. Takes a BAM file and a BED file The files need to have the same chromosome naming (chr1 or 1) You can obtain BED files from the UCSC Table browser

78 QC tables by RseQC

79 Region file formats: BED 5 obligatory columns: chr, start, end, name, score 0-based, like BAM

80 Obtain BED from UCSC Table Browser

81 UCSC Table Browser part II

82 Chromosome naming UCSC BED vs Chipster Chromosome names in UCSC BED files have the prefix chr e.g. chr1 Chipster files are Ensembl-based and don t have the prefix Remove the prefix in UCSC BED files (chr1 1) Utilities / Modify text. Use the following parameters: Operation = Replace text Search string = chr Input file format = BED

83 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

84 Software for counting aligned reads per genomic features (genes/exons/transcripts) HTSeq Cufflinks BEDTools Qualimap express...

85 HTSeq count Given a BAM file and a list of genomic features (e.g. genes), counts how many reads map to each feature. For RNA-seq the features are typically genes, where each gene is considered as the union of all its exons. Also exons can be considered as features, e.g., in order to check for alternative splicing. Features need to be supplied in GTF file Note that GTF and BAM must use the same chromosome naming 3 modes to handle reads which overlap several genes Union (default) Intersection-strict Intersection-nonempty

86 HTSeq count modes

87 Unique, ambiguous, concordant, stranded? read Ambiguous read A read A Not unique read A _1 read A _2 read A _2 read A _1 Concordant 5 read 3 Stranded

88 GTF file format 9 obligatory columns: chr, source, name, start, end, score, strand, frame, attribute 1-based For HTSeq to work, all exons of a gene must have the same gene_id Use GTFs from Ensembl, avoid UCSC

89 Estimating gene expression at gene level - the isoform switching problem Trapnell et al. Nature Biotechnology 2013

90 Combine individual count files to a count table and describe the experiment Select all the count files and run Utilities / Define NGS experiment This creates also a phenodata file, where you can describe experimental groups, time, pairing etc Use numbers (e.g. 1 = control, 2 = cancer) Description column allows you to define sample names for visualizations If you import ready-made count table, you can create a phenodata file using the tool Utilities / Preprocess count table"

91 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

92 Experiment level quality control Getting an overview of similarities and dissimilarities between samples allows you to check Do the experimental groups separate from each other? Is there a confounding factor (e.g. batch effect) that should be taken into account in the statistical analysis? Are there sample outliers that should be removed? Several methods available MDS (multidimentional scaling) PCA (principal component analysis) Clustering

93 MDS plot by edger Distances correspond to the logfc or biological coefficient of variation (BCV) between each pair of samples Calculated using 500 most heterogenous genes (that have largest tagwise dispersion treating all samples as one group)

94 PCA plot by DESeq2 The first two principal components, calculated after variance stabilizing transformation Indicates the proportion of variance explained by each component If PC2 explains only a small percentage of variance, it can be ignored

95 Sample heatmap by DESeq2 Euclidean distances between the samples, calculated after variance stabilizing transformation

96 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Manipulation of alignment files Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

97 Things to take into account Biological replicates are important! Normalization is required in order to compare expression between samples Different library sizes RNA composition bias caused by sampling approach Model has to account for overdispersion in biological replicates negative binomial distribution Raw counts are needed to assess measurement precision Counts are the the units of evidence for expression Multiple testing problem

99 Software packages for DE analysis edger DESeq DEXSeq Cuffdiff Ballgown SAMseq NOIseq Limma + voom, limma + vst...

100

101 Comments from comparisons Methods based on negative binomial modeling have improved specificity and sensitivity as well as good control of false positive errors Cuffdiff performance has reduced sensitivity and specificity. We postulate that the source of this is related to the normalization procedure that attempts to account for both alternative isoform expression and length of transcripts

102 Differential gene expression analysis Normalization Dispersion estimation Log fold change estimation Statistical testing Filtering Multiple testing correction

103 Differential expression analysis: Normalization

104 Normalization For comparing gene expression between genes in one sample, normalize for Gene length Gene GC content For comparing gene expression between (groups of) samples, normalize for Library size (number of reads obtained) RNA composition effect

105 FPKM and TC are ineffective and should be definitely abandoned in the context of differential analysis In the presence of high count genes, only DESeq and TMM (edger) are able to maintain a reasonable false positive rate without any loss of power

106 RPKM and FPKM Reads (or fragments) per kilobase per million mapped reads. 20 kb transcript has 400 counts, library size is 20 million reads RPKM = (400/20) / 20 = kb transcript has 10 counts, library size is 20 million reads RPKM = (10/0.5) / 20 = 1 Normalizes for gene length and library size Can be used only for reporting expression values, not for testing differential expression In DE analysis raw counts are needed to assess the measurement precision correctly

107 Normalization by edger and DESeq Aim to make normalized counts for non-differentially expressed genes similar between samples Do not aim to adjust count distributions between samples Assume that Most genes are not differentially expressed Differentially expressed genes are divided equally between up- and down-regulation Do not transform data, but use normalization factors within statistical testing

108 Normalization by edger and DESeq how? DESeq(2) Take geometric mean of gene s counts across all samples Divide gene s counts in a sample by the geometric mean Take median of these ratios sample s normalization factor (applied to read counts) edger Select as reference the sample whose upper quartile is closest to the mean upper quartile Log ratio of gene s counts in sample vs reference M value Take weighted trimmed mean of M-values (TMM) normalization factor (applied to library sizes) Trim: Exclude genes with high counts or large differences in expression Weights are from the delta method on binomial data

109 Differential expression analysis: Dispersion estimation

110 Dispersion When comparing gene s expression levels between groups, it is important to know also its within-group variability Dispersion = (BCV) 2 BCV = gene s biological coefficient of variation E.g. if gene s expression typically differs from replicate to replicate by 20% (so BCV = 0.2), then this gene s dispersion is = 0.04 Note that the variability seen in counts is a sum of 2 things: Sample-to-sample variation (dispersion) Uncertainty in measuring expression by counting reads

111 How to estimate dispersion reliably? RNA-seq experiments typically have only few replicates it is difficult to estimate within-group variance reliably Solution: pool information across genes which are expressed at similar level assumes that genes of similar average expression strength have similar dispersion Different approaches edger DESeq2

112 Dispersion estimation by DESeq2 Estimates genewise dispersions using maximum likelihood Fits a curve to capture the dependence of these estimates on the average expression strength Shrinks genewise values towards the curve using an empirical Bayes approach The amount of shrinkage depends on several things including sample size Genes with high gene-wise dispersion estimates are dispersion outliers (blue circles above the cloud) and they are not shrunk

113 Dispersion estimation by edger Estimates common dispersion for all genes using a conditional maximum likelihood approach Trended dispersion: takes binned common dispersion and abundance, and fits a curve though these binned values Final dispersion: shrinks gene-wise dispersions towards the trended one uses an empirical Bayes strategy and a weighted likelihood approach genes that are consistent between replicates are ranked more highly

114 Differential expression analysis: Statistical testing

115 Generalized linear models Model the expression of each gene as a linear combination of explanatory factors (eg. group, time, patient) y = a + (b. group) + (c. time) + (d. patient) + e y = gene s expression a, b, c and d = parameters estimated from the data a = intercept (expression when factors are at reference level) e = error term Generalized linear model allows the expression value distribution to be different from normal distribution Negative binomial distribution used for count data

116 Statistical testing DESeq and edger Two group comparisons Exact test for negative binomial distribution. Multifactor experiments DESeq2 Generalized linear model (GLM) likelyhood ratio test. Shrinks log fold change estimates toward zero using an empirical Bayes method Shrinkage is stronger when counts are low, dispersion is high, or there are only a few samples GLM based Wald test for significance Shrunken estimate of log fold change is divided by its standard error and the resulting z statistic is compared to a standard normal distribution

117 Fold change shrinkage by DESeq2

118 Multiple testing correction We tests thousands of genes, so it is possible that some genes get good p-values just by chance To control this problem of false positives, p-values need to be corrected for multiple testing Several methods are available, the most popular one is the Benjamini-Hochberg correction (BH) largest p-value is not corrected second largest p = (p *n)/ (n-1) third largest p = (p * n)/(n-2) smallest p = (p * n)/(n- n+1) = p * n The adjusted p-value is FDR (false discovery rate)

119 Filtering Reduces the severity of multiple testing correction by removing some genes (makes n smaller) Filter out genes which have little chance of showing evidence for significant differential expression genes which are not expressed genes which are expressed at very low level (low counts are unreliable) Should be independent do not use information on what group the sample belongs to DESeq2 selects filtering threshold automatically

120 edger result table logfc = log2 fold change logcpm = average log2 counts per million Pvalue = raw p-value FDR = false discovery rate (Benjamini-Hochberg adjusted p- value)

121 DESeq2 result table basemean = mean of counts (divided by size factors) taken over all samples log2foldchange = log2 of the ratio meanb/meana lfcse = standard error of log2 fold change stat = Wald statistic pvalue = raw p-value padj = Benjamini-Hochberg adjusted p-value

122 RNA-seq data analysis workflow Quality control of raw reads Preprocessing (trimming / filtering) if needed Alignment (=mapping) to reference genome Alignment level quality control Quantitation Experiment level quality control Differential expression analysis Visualization of reads and results in genomic context

123 Software packages for visualization Chipster genome browser IGV UCSC genome browser... Differences in memory consumption, interactivity, annotations, navigation,...

124 Chipster Genome Browser Integrated with Chipster analysis environment Automatic sorting and indexing of BAM, BED and GTF files Automatic coverage calculation (total and strand-specific) Zoom in to nucleotide level Highlight variants Jump to locations using BED, GTF, VCF and tsv files View details of selected BED, GTF and VCF features Several views (reads, coverage profile, density graph)

125

126

127

128 Design of experiments

129 When planning an experiment, consider Technical vs. biological replicates Sample pairing Pooling Reference samples Sequencing decisions Number of reads per sample Read length (longer is better) Paired end or single end (PE is better) Stranded or unstranded (stranded is better) Note that sequencing capacity depends on the device (how many cycles, how many reads per flow cell, how many lanes in the flow cell, etc).

130 Technical vs. biological replicates Biological replicates are separate individuals/samples Necessary for a properly controlled experiment Technical replicates are repeated sequencing runs using the same RNA isolate or sample Waste of resources? Can cause unnecessary variance reduction increases number of false positives Avoid mixing of biological and technical replicates!

131 Technical vs. biological replicates Dummy example: Let s measure the average height of a Finnish male. Biological replicates: different individuals Technical replicates: measuring same individual with different measuring tape How many replicates would be enough?

132 Technical vs. biological replicates Distinction between technical and biological replicates is fuzzy. Where do we stand with cell lines?

Replicate number Publication quality data needs at least 3

This can be sufficient for cell-cultures and/or test animals,

numbers: Cell cultures / test animals: 3 is minimum, 4-5 OK, >7

133 Replicate number Publication quality data needs at least 3 biological replicates per sample group. This can be sufficient for cell-cultures and/or test animals, but usually too few for in vivo experiment More reasonable numbers: Cell cultures / test animals: 3 is minimum, 4-5 OK, >7 excellent Patients: 3 is minimum, OK, >50 good Power analysis can be used to estimate sample sizes

134 Paired samples Use of matched samples reduces variance, as individual variation can be tackled using a matched control Pre vs. post treatment samples Tumor vs. normal samples from the same patient Tissue pairs Example: 6 patients, 2 samples from each. Enough resources to sequence only 6 samples. Which option do you choose? Pre treatment samples P1 P2 P3 P4 P5 P6 P1 P2 P3 P4 P5 P6 VS Post treatment samples P1 P2 P3 P4 P5 P6 P1 P2 P3 P4 P5 P6

135 Pooling Where possible, measure each sample on its own. If this is not possible (too expensive or not enough material), samples can be pooled to reduce variance Careful with the concentrations!!! Cell culture:

136 Pooling Make pools as similar as possible Avoid pooling of similar kinds of samples into one pool F F F M M M F M F M F M VS

137 Pooling Something to consider: What if some of your samples are outliers, or have a contamination? Cell culture:

138 Reference samples Don t compare apples to oranges! Cancer sample vs. normal sample Blood? Same tissue, healthy? Similar tissue? Healthy tissue from another patient? Cell line?

139 How many reads per sample do I need? Some recommendations available Differential expression M reads (less for small genomes) Allele specific expression M Alternative splicing M De novo assembly >100 M More reads or more replicates? Heterogeneous sample more reads needed E.g. in tumor samples or when there is a doubt of contamination A highly expressed transcript may hoard all the resources

140 Example Differential expression profiling We want min. 30 M reads/sample 30 samples: 30 mice 10 treated with saline 10 treated with drug 1 10 treated with drug 2 Illumina NextSeq 500 device (output 400 M reads) 400 M reads / 30 samples = 13 M reads/sample: not enough. Do we want to decrease number of replicates & increase number of reads, or keep the replicates and deal with lower number of reads? Do we have enough money to have 2 runs? If so, how should we form the batches?

141 More info Manuals of individual analysis tools Varying quality Biostars SEQanswers

RNA-seq data analysis with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

RNA-seq data analysis with Chipster Eija Korpelainen CSC IT Center for Science, Finland chipster@csc.fi What will I learn? 1. What you can do with Chipster and how to operate it 2. What RNA-seq can be