
Systems Genetics Chapter 11 bi190-2013 h/0

Chapter 11: Gene Expression

The availability of an annotated genome sequence enables massively parallel analysis of gene expression: the expression of all genes in an organism can be measured in one experiment. In this chapter we discuss the key aspects of such gene expression analysis. The genome sequence also enables analysis of cis-regulation, also discussed in this chapter.

There are many methods of examining gene expression. A method of choice is to sequence libraries of cDNA made from a population of mRNA. The sequencing reads can be assembled to deduce a transcriptome, the set of transcribed sequences, which includes mRNAs and non-coding RNAs. The reads can also be mapped onto a genome assembly to identify the genes expressed. Note that mRNA measurements reflect the extent of steady-state mRNA accumulation, not transcription: the level of an mRNA is a function of both its rate of synthesis and its rate of degradation.

Table of gene expression measurement methods

Method        Features
RNA-seq       comprehensive
Microarray    less sensitive than RNA-seq but more robust analysis pipelines
NanoString    expensive but quantitative
qRT-PCR       standard but typically not high throughput

RNA-seq reads can be quantified by a simple metric, FPKM

To normalize the number of reads as a function of gene size, a common metric is RPKM, reads per kilobase of gene model per million mapped reads. For paired-end reads the comparable measure is FPKM, fragments per kilobase of gene model per million mapped reads. Transcripts per million (TPM) is another metric.

Figure SGF-1137. FPKM.

Variation in results

Technical - differences due to measurement
Biological - differences in samples

We want to measure the level of expression of each gene under a set of conditions. Such a genome-wide gene expression analysis gives results such as those shown in Figure [SGF-1336].

Figure [SGF-1336]. Generalized gene expression experiment

RNA-seq

An RNA-seq experiment involves a series of steps that, in brief, obtain a count of the relative number of transcripts in a sample. mRNA is extracted from a sample. The population of mRNA is converted to a population of cDNA, and that population of cDNA is analyzed. In a common platform, DNA sequence is obtained with a read length of at least 50 nucleotides. The reads are then mapped to the relevant genome and assigned to gene models. The number of reads is normalized to the length of the gene model, since a longer gene will have more reads (Figure ). The read counts are also normalized to the total number of reads. There is error associated with mapping reads, for example when a read maps to more than one location in the genome. The sensitivity is greater than that of microarrays.

Analysis of data

Data processing results in a table of genes with read counts per sample (e.g., Table ).

Table. Simple gene expression analysis example

Gene    Condition 1,    Condition 1,    Condition 2,    Condition 2,
        sample 1        sample 2        sample 1        sample 2
A       9               12              13              8
B       11              10              11              9
C       8               13              2               1
D       0               0               1               0
E       2               1               1               0
F       123             198             34              21
G       0               1               49              62

For one sample of one condition we simply have a list of genes. For two samples from the same condition we have a better sense of accuracy, both of the relative measurements and of the presence of gene expression. To compare two conditions, it is better to have replicates. The number of replicates is driven by cost and statistics. Two is much better than one; three is significantly better than two; four is better than three; five is slightly better than four; and so forth.

In this example, genes A, B, D and E don't change between conditions. C and F are higher under condition 1; G is lower.

Modern analysis programs such as DESeq take the total number of reads per sample into account but work from the raw read counts. The software estimates variance for each gene and across all data. For example, the coefficient of variation is the standard deviation divided by the mean, so it is normalized. The variance between samples is the sum of the sample-to-sample variation in expression (dispersion) and the uncertainty in determining a concentration by counting reads (called shot noise or Poisson noise). DESeq performs a negative binomial test to get a p-value.

The negative binomial (or Pascal) distribution allows the variance and the mean to be different. It describes the number of successes in a sequence of independent trials before a specified number of failures (denoted r) occurs. In NB(r, p), p is the probability of success and r is the number of failures. For the negative binomial:

Mean: $\frac{pr}{1-p}$
Variance: $\frac{pr}{(1-p)^2}$

With k the number of successes, the probability mass function is

$$P(K = k) = \binom{k + r - 1}{k} \, p^k (1-p)^r$$

By contrast, the Poisson distribution has mean = variance.
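The mean and variance stated for the negative binomial can be checked numerically. The following is a minimal sketch in plain Python, not the DESeq implementation; the function names `nb_pmf` and `nb_moments` and the values r = 5, p = 0.6 are assumptions chosen only for illustration. It sums the probability mass function over a long but finite range of k and compares the result with pr/(1-p) and pr/(1-p)^2.

```python
import math


def nb_pmf(k, r, p):
    """P(K = k): k successes before the r-th failure, success probability p."""
    return math.comb(k + r - 1, k) * p**k * (1 - p) ** r


def nb_moments(r, p, k_max=1000):
    """Mean and variance obtained by summing the pmf over k = 0..k_max.

    k_max is an illustrative truncation point; the neglected tail is
    vanishingly small for the parameter values used below.
    """
    probs = [nb_pmf(k, r, p) for k in range(k_max + 1)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


r, p = 5, 0.6  # hypothetical parameter values
mean, var = nb_moments(r, p)
print("mean:    ", mean, "formula:", p * r / (1 - p))
print("variance:", var, "formula:", p * r / (1 - p) ** 2)
```

Because the variance equals the mean divided by (1-p), it always exceeds the mean, which is what lets the negative binomial absorb sample-to-sample dispersion that a Poisson model cannot.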

Figure of negative binomial distribution with various p and r values, from http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0cdkqfjab&url=http%3a%2f%2fwww.stat.washington.edu%2fpeter%2f341%2fgeoemtric%2520and%2520negative%2520binomial.pdf&ei=rwwzuejxoae9igkphodida&usg=afqjcngyv5f3utj7masdywlngfxkf9mscw&sig2=p9o9xm5pnogo0mqpql9yrg&bvm=bv.46751780,d.cge

Comparison of two measurements and multiple hypothesis testing

The Bonferroni correction is conservative and corrects significance values by the number of tests N:

$$\alpha_{\text{Bonferroni}} = \frac{\alpha}{N}$$

FDR

The Bonferroni correction allows us to know the probability that a particular observation might come about by chance. In most screens, we are interested in finding many true positives, but can tolerate the fact that some are falsely positive. We can control the False Discovery Rate (FDR) by a simple procedure. If α is our significance cutoff, then the expected number of false positives among N observations is αN, and the proportion of false positives is αN divided by the number of observed positives (POS):

$$\mathrm{FDR} = \frac{\alpha N}{\mathrm{POS}}$$

Rearranging, we have

$$\alpha = \mathrm{FDR} \cdot \frac{\mathrm{POS}}{N}$$

where POS/N is the proportion of observed positives. For FDR = 0.1, α = 0.1 · POS/N.

Table SGT---. Choice of α at FDR = 0.1

Proportion positive    α
0.5                    0.05
0.1                    0.01
0.01                   0.001
0.001                  0.0001

As you can see, the rarer the positives observed, the lower the p-value has to be to control false discoveries. If we test 10,000 genes and find 10% positive, then to have a 10% FDR we use α = 0.01 as the cutoff. The Bonferroni-corrected cutoff in this situation would be 0.05/N = 0.000005, and we would throw out many true positives.
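The cutoffs in the table and the Bonferroni comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch of the relation α = FDR · POS/N as given in the text, not a full multiple-testing procedure; the function names `alpha_for_fdr` and `bonferroni_alpha` are illustrative assumptions.

```python
def alpha_for_fdr(fdr, proportion_positive):
    """Significance cutoff from alpha = FDR * POS / N.

    proportion_positive is POS / N, the fraction of tests observed positive.
    """
    return fdr * proportion_positive


def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected cutoff: alpha divided by the number of tests."""
    return alpha / n_tests


# Reproduce the table of cutoffs at FDR = 0.1.
for prop in (0.5, 0.1, 0.01, 0.001):
    print(f"proportion positive {prop:<6} alpha {alpha_for_fdr(0.1, prop):.4g}")

# The 10,000-gene example: 10% positive at a 10% FDR versus Bonferroni.
n_genes = 10_000
fdr_cutoff = alpha_for_fdr(0.1, 0.1)
bonf_cutoff = bonferroni_alpha(0.05, n_genes)
print(f"FDR cutoff {fdr_cutoff:.4g}, Bonferroni cutoff {bonf_cutoff:.4g}")
```

In the 10,000-gene example the FDR-based cutoff (0.01) is 2000 times less stringent than the Bonferroni cutoff (0.000005), which is why far more genes survive it.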

Spike-ins allow absolute quantification

By adding a known number of specific mRNAs to the sample before processing, one can obtain a good estimate of the absolute number of transcripts.

Enrichment analysis

After you have identified sets of genes associated with a particular biological condition (genotype, environment, experimental treatment), you want to explore common features of the gene set. One standard way to do this is to look for Gene Ontology term enrichment. Suppose you have a simple ontology (Fig. [SGF-1273]) with 5 genes annotated to each of six nodes, for a total of 30 genes. We test for enrichment at each node where there are two or more genes in our potentially enriched set. We thus test node D for enrichment: 3 of the 4 genes identified in our experiment are annotated to node D. A standard statistical test for such a case uses the hypergeometric distribution, which models sampling without replacement and is thus a bit more complicated than the binomial distribution. For 3 of 4 from 5 of 30, the probability of obtaining such a result is 0.009, which is highly significant.
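The 3-of-4 from 5-of-30 example can be checked directly by counting combinations. The sketch below is a minimal illustration using only the Python standard library; the function and argument names (`hypergeom_pmf`, `enrichment_p`, `hits`, `sample_size`, `annotated`, `population`) are assumptions introduced here, not names from any enrichment tool.

```python
import math


def hypergeom_pmf(hits, sample_size, annotated, population):
    """Probability of exactly `hits` annotated genes in a sample drawn
    without replacement from `population` genes, `annotated` of which
    carry the annotation."""
    ways_hits = math.comb(annotated, hits)
    ways_misses = math.comb(population - annotated, sample_size - hits)
    return ways_hits * ways_misses / math.comb(population, sample_size)


def enrichment_p(hits, sample_size, annotated, population):
    """Probability of observing at least `hits` annotated genes."""
    upper = min(sample_size, annotated)
    return sum(
        hypergeom_pmf(i, sample_size, annotated, population)
        for i in range(hits, upper + 1)
    )


# Node D: 3 of the 4 identified genes, 5 of 30 genes annotated to the node.
print("exactly 3 of 4: ", hypergeom_pmf(3, 4, 5, 30))
print("at least 3 of 4:", enrichment_p(3, 4, 5, 30))
```

The exact probability is 250/27405 and the at-least probability is 255/27405; both round to the 0.009 quoted above.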

Figure [SGF-1273]. Enrichment example. 3 of 4 genes in experiment are annotated to node D.

Is a set of genes significantly enriched in a particular characteristic?

Hypergeometric Distribution

The hypergeometric distribution differs from the binomial distribution in that it models sampling without replacement. To apply it to Gene Ontology annotation, we define:

G = number of genes in population
g = number of genes in population with a particular annotation
G - g = number of genes in population without that annotation
F = number of genes in sample
f = number of genes in sample with that annotation
F - f = number of genes in sample without that annotation

We thus have as the probability of getting exactly f of F given g of G:

$$q_f = \frac{\binom{g}{f}\binom{G-g}{F-f}}{\binom{G}{F}}$$

This is derived in the following way. There are $\binom{g}{f}$ ways of getting exactly f successes. There are $\binom{G-g}{F-f}$ ways of getting exactly F - f failures. There are $\binom{G}{F}$ ways of sampling F from the total population G.

The probability of getting at least f of F given g of G is 1 - P(getting < f), and thus:

$$P = 1 - \sum_{i=0}^{f-1} q_i = 1 - \sum_{i=0}^{f-1} \frac{\binom{g}{i}\binom{G-g}{F-i}}{\binom{G}{F}}$$

An example calculation is as follows.

f    F    g     G      p
1    1    1     20     0.05
1    1    50    100    0.5
2    2    10    100    0.009
1    2    10    100    0.2
3    4    25    100    0.047

For analysis of Gene Ontology annotations, each node in the GO is a test. We can exclude any test that involves a node with fewer than 2 gene annotations, since such nodes will never be significant. To use a Bonferroni correction, one can use the total number of nodes to which the provided list of genes is annotated, either directly or indirectly, excluding any nodes that are annotated only once, since these cannot be overrepresented (multiply p by the number of tests).
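As a check on the example calculations, the row with f = 1, F = 2, g = 10, G = 100 can be worked by hand. The only term in the sum is the probability of drawing no annotated genes, written here as q_0 (exactly zero annotated genes in the sample):

```latex
\begin{aligned}
P &= 1 - q_0
   = 1 - \frac{\binom{10}{0}\binom{90}{2}}{\binom{100}{2}} \\
  &= 1 - \frac{1 \cdot 4005}{4950}
   \approx 0.19
\end{aligned}
```

This agrees with the tabulated value of 0.2 after rounding to one significant figure.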

There are a number of web-accessible enrichment tools, such as AmiGO from the Gene Ontology Consortium: http://amigo.geneontology.org/cgi-bin/amigo/term_enrichment?session_id=
