RNA Expression Time Course Analysis

RNA Expression Time Course Analysis April 23, 2014

Introduction

The increased efficiency and reduced per-read cost of next generation sequencing (NGS) have opened new and exciting opportunities for data analysis. Rather than comparing the change in expression in a small number of samples, sometimes without replicates, researchers can now effectively compare expression profiles of a large number of samples, complete with replicates. However, these comparisons are not trivial. They require fairly complex data management and data handling tools that can transform the read-based data into a data matrix suitable for sophisticated data analysis algorithms. In this document we take a set of 12 NGS data sets from the first 24 hours of Drosophila embryonic development and demonstrate how to conduct time course analysis using peer-reviewed statistical algorithms.

Analysis Steps

Any expression analysis experiment using NGS data (time course analysis is one kind of expression analysis) is done in 6 steps:

1. Quality check of the raw data delivered by core facilities or contract research organizations (CROs).
2. Mapping data to the reference genome.
3. Quantification of gene expression. This may also include quantification of transcripts (isoforms) and other annotated genomic elements, possibly including non-coding RNAs.
4. Variance reduction (also called transformation).
5. Normalization.
6. Differential expression analysis.

After step 3 the data have been converted from a list of reads found in raw data files into a table-like format, with rows representing genomic elements (genes, transcripts, etc.) and columns representing samples. Each cell in this table contains a number that represents the relative abundance of the particular element in a sample. For example, rows corresponding to genes polaa and polab may look like Table 1. This kind of table is also called a data matrix.
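As a minimal sketch of how per-read gene assignments can be turned into such a data matrix, the following example counts reads per gene in each sample. The sample names and read assignments are hypothetical, and plain Python dictionaries stand in for whatever output a real quantification tool produces.

```python
from collections import Counter


def build_data_matrix(assignments):
    """Turn {sample: [gene id per read, ...]} into (samples, {gene: [count per sample]})."""
    counts = {sample: Counter(genes) for sample, genes in assignments.items()}
    samples = sorted(counts)
    genes = sorted({gene for per_sample in counts.values() for gene in per_sample})
    # A Counter returns 0 for genes with no reads in a sample.
    matrix = {gene: [counts[sample][gene] for sample in samples] for gene in genes}
    return samples, matrix


# Hypothetical read-to-gene assignments for three samples.
assignments = {
    "SampleA": ["polaa", "polaa", "polab"],
    "SampleB": ["polab"],
    "SampleC": ["polaa"],
}

samples, matrix = build_data_matrix(assignments)
for gene, row in matrix.items():
    print(gene, row)
```

Each row of the result corresponds to one genomic element and each position in the row to one sample, which is the layout used in the rest of this document.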

Tag      Sample A   Sample B   Sample C   Sample D
polaa    10         12         9          0
polab    15         9          0          11

Table 1: Example Data Matrix

After one or more intensity tables have been generated, the data must be transformed and normalized so that they have a distribution appropriate for statistical algorithms. Most of these algorithms are sensitive to the shape of the data and require the data to have a Normal distribution. This requires steps 4, 5, and 6. Step 6 is where the appropriate statistical algorithms are finally applied.

Experiment

Drosophila melanogaster (common fruit fly) is one of the best studied and most commonly used model organisms. Fruit flies are easy to handle and breed, and large quantities of molecular information are readily available. The eggs are approximately 0.5 millimeter long and hatch after 12-15 hours [1, 2]. The resulting larvae will grow for approximately 48 hours.

The model organism Encyclopedia of DNA Elements (modENCODE) project [3] was created by the National Human Genome Research Institute (NHGRI) to identify all sequence-based functional elements of C. elegans and Drosophila melanogaster. The modENCODE project generated 12 sets of Drosophila melanogaster embryonic development data by sequencing RNA collected every 2 hours over a 24 hour period. The data, produced using a SOLiD 3 sequencing device, are fully accessible from the DNA Data Bank of Japan (DDBJ) [4].

We downloaded the data as FASTQ files, each containing roughly 5 gigabases in 50 nt reads, mapped them to the Ensembl Drosophila genome using software for splice junction mapping, quantified gene expression, and analyzed the data matrix using time course-specific algorithms. The results show that, in spite of the insufficient sequencing depth, our analysis was able to identify the probable hatching time as well as differentially expressed genes and non-coding elements. All analyses were performed using cloud-deployed Lumenogix NGS.
The data were mapped to Drosophila melanogaster genome release 5.25, published February 2010, which was downloaded from Ensembl [5]. Mapping was performed using TopHat [6] splice junction mapping software version 1.2 (Figures 2, 3). Internally, TopHat utilizes Bowtie [7] version 0.12.7. Since TopHat does not handle SOLiD colorspace data, the files were converted to nucleotide format with standard quality encoding using the SOLiD2Std.pl script from the Corona Lite toolkit distributed by Life Technologies. Gene expression quantification was performed using htseq-count [8]. The resulting data matrix was transformed using the variance reduction algorithm from the Bioconductor [9] DESeq [10] package, then normalized using the quantile normalization algorithm from Limma [11]. Following transformation and normalization, the data were analyzed using the Bioconductor Timecourse [12] package.

Experimental Setup

The first data analysis decisions must be made before any data are generated. The researchers must decide on the length of each read and whether the sequencing will be done in a paired or single configuration. Clearly, generating longer reads and paired reads will provide more information. This, however, must be balanced against budgetary restrictions: longer reads and paired reads cost more. In some cases, it might come down to a choice of more replicates vs. longer and/or paired reads. As a rule, if the researcher is looking for gene expression analysis and is not particularly interested in alternative splicing, a shorter single-read configuration should be adequate. However, if the experiment is done on a well-annotated species (such as human or mouse) and the researcher is interested in evaluating the expression of individual isoforms, then longer paired-end reads are preferred. In practice this means one of two possible configurations:

1. For gene expression analysis: 50 nucleotide single reads
2. For analysis of transcript isoforms: 100 nucleotide paired reads

This decision must be made before libraries are prepared and will have a significant impact on all data analysis.

Quality Check

Data produced by NGS devices can be of varying quality depending on many factors, including the quality of the material used, library construction, and the sequencing itself.
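To make the notion of per-base read quality concrete, the sketch below decodes FASTQ quality strings into Phred scores and averages them per read and per read position. It assumes the common Phred+33 ASCII encoding, and the quality strings are invented for illustration; they are not taken from the data set analyzed here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Average Phred score over all bases of one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


def per_position_mean(quality_strings, offset=33):
    """Average Phred score at each read position across equal-length reads."""
    columns = zip(*(phred_scores(q, offset) for q in quality_strings))
    return [sum(col) / len(col) for col in columns]


# Hypothetical quality strings for two 8-base reads ('I' is Q40, '5' is Q20).
qualities = ["IIII5555", "II555555"]

print(mean_quality(qualities[0]))
print(per_position_mean(qualities))
```

A per-position summary of this kind is the idea behind a per-base quality plot, although a dedicated QC tool reports considerably more than a mean.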
While quality checks (QC) should be done at every step of this process, once the data have been delivered it is also possible to evaluate the quality of each sample using unbiased computational tools. Within Lumenogix NGS, we utilize FastQC [8], a popular application that evaluates the quality of the data based on multiple metrics.

Figure 1: Per base data quality

In case of poor quality data (e.g., Figure 1), two approaches can be taken:

1. Filter the data prior to analysis. This is the preferred approach when high data quality is required (SNP analysis and de novo assembly). In this case, filtering means removal of low-quality reads from the data set.
2. Perform the analysis with the complete data sets. This is the approach taken in most expression analysis experiments, especially when the data quality is consistent between samples.

Mapping Reads to the Genome

Several excellent tools exist for mapping RNA reads to genomes. A special feature of these aligners is the ability to handle splice junctions. Since we are mapping spliced RNA transcripts to the genome, the algorithm must be aware of and handle situations where a read is aligned to two (or more) locations with significant gaps in between. Within Lumenogix NGS we utilize TopHat [6], one of the popular algorithms for mapping RNA-Seq data. The algorithm requires the reference genome to be established and allows users to fine-tune parameters (Figure 2). Please note that in this example we are handling single-end reads.

Figure 2: Lumenogix NGS TopHat submission screen

After the mapping is complete, the algorithm generates files that contain alignment information for each read. In most cases, these files are in SAM or BAM format [13]. SAM is a text-based, human-readable format, while BAM is a compressed binary representation of SAM. These files are very large, so in most cases only the BAM format is retained. This is acceptable since it is relatively easy to convert from BAM to SAM using Samtools [13].

After mapping, additional statistical information is obtained, including the count and percentage of reads that were successfully aligned. However, it should be stated that for TopHat this statistic is inaccurate; this is a known TopHat issue. In addition to the alignment information for each read, TopHat also produces information on identified splice junctions (Figure 3). Both the SAM/BAM formats and the BED [14] format used to store splice junctions are well known and may be used in various tools.
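As a rough illustration of how alignment counts can be derived directly from SAM records, the sketch below counts distinct read names that have at least one mapped record. It relies only on the SAM column layout (read name in column 1, bitwise FLAG in column 2, with bit 0x4 marking an unmapped read). The records are invented, and this is not how TopHat computes its own summary.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: the read itself is unmapped


def mapped_read_names(sam_lines):
    """Return (names with a mapped record, all names seen) from SAM text lines."""
    mapped, seen = set(), set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & UNMAPPED_FLAG:
            mapped.add(name)
    return mapped, seen


# Hypothetical records: r1 aligns twice (one secondary alignment), r2 is unmapped.
sam_lines = [
    "@HD\tVN:1.0",
    "r1\t0\t2L\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r1\t256\t3R\t500\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

mapped, seen = mapped_read_names(sam_lines)
print(f"{len(mapped)} of {len(seen)} reads mapped ({100 * len(mapped) / len(seen):.1f}%)")
```

Counting distinct read names rather than alignment records avoids counting a read more than once when it aligns to several locations. Whether unmapped reads appear in the alignment file at all depends on the aligner's output, so in practice the denominator may have to come from the number of reads in the input FASTQ file.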

Figure 3: Lumenogix NGS TopHat results screen

Quantification

Once the reads have been aligned to the reference genome, the information can be used to quantify expression of genes and transcripts. The most popular quantification approach is to use one of the normalizing quantification algorithms, such as Cufflinks [6]. Cufflinks is an extremely popular algorithm and was designed to be well integrated with TopHat. In this case, normalizing means that the intensity values are scaled based on the length of the transcript. Quantification is done by counting the reads that align to a particular transcript, and longer transcripts have an advantage: more reads will align to longer transcripts than to shorter ones. Therefore, modern normalizing quantification algorithms report intensity in Fragments Per Kilobase of exon per Million fragments mapped (FPKM) [6]. This is essentially intensity normalized by transcript length. After computing intensity by transcript, the values are aggregated into gene-based intensities.

This is a very good way to quantify intensities when comparing expression of transcripts within a sample. However, in our example we are comparing gene expression between different samples. When comparing gene A in sample 1 to gene A in sample 2 (or transcript A.p1 in sample 1 to transcript A.p1 in sample 2), the lengths of these elements are the same, so length normalization adds nothing to the comparison. Furthermore, normalizing algorithms occasionally run into trouble when they are unable to unambiguously identify a transcript, resulting in lost data (Figure 4).

Figure 4: Lumenogix NGS scatter plot of FPKM data

Another way to do quantification is the raw hit count approach, such as htseq-count [8]. This type of algorithm does not normalize the data based on transcript length and calculates gene expression directly (not by aggregating transcript expression). For maximum flexibility, Lumenogix NGS offers both normalizing and raw hit-count quantification algorithms. Once the quantification has been performed (using either method), a data matrix (meaning a table of intensities) can be constructed and the distribution (or shape) of the data visualized using standard statistical tools (Figures 5 and 6).

Figure 5: Lumenogix NGS boxplot of intensities

Variance Reduction

The purpose of the variance reduction step (also called transformation or scaling) is to transform the data into a form that can be used in statistical analysis, meaning into a Normal distribution. Different algorithms may be used, depending on how the data have been quantified. For FPKM (normalized) data, a standard log base 2 (log2) transformation appears to work well. For counting data that were produced using raw hit-count quantification, a Bioconductor-based application called DESeq [10] provides a specialized variance reduction function (Figure 7).
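To make these quantities concrete, the following sketch computes FPKM from a raw fragment count and then applies a log2 transformation. The formula follows the definition of FPKM (fragments per kilobase of transcript per million mapped fragments). The pseudocount of 1 added before taking the logarithm, which avoids the logarithm of zero, and all numeric values are assumptions made for illustration. This is not the DESeq variance reduction function.

```python
import math


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_transform(value, pseudocount=1.0):
    """log2 with a pseudocount so that zero intensities stay defined (assumed choice)."""
    return math.log2(value + pseudocount)


# Hypothetical example: two transcripts in a sample with 10 million mapped fragments.
total = 10_000_000
long_transcript = fpkm(500, 2000, total)   # twice the reads, twice the length
short_transcript = fpkm(250, 1000, total)

print(long_transcript, short_transcript)
print(log2_transform(long_transcript))
```

The two hypothetical transcripts receive the same FPKM even though the longer one collected twice as many fragments, which is the length correction described above.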

Figure 6: Lumenogix NGS samples classification using unsupervised hierarchical clustering

Figure 7: Lumenogix NGS samples counting data transformed using DESeq variance reduction algorithm

Normalization

Normalization algorithms remove differences between samples that are due to technical (not biological) variation. Many different algorithms are available, including, for example, the popular quantile algorithm implemented in the Limma [11] package (Figure 8). The final shape of the data matrix prior to expression analysis should be a reasonable approximation of the Normal distribution (Figure 9).

Figure 8: Lumenogix NGS samples counting data normalized using quantile normalization algorithm

Figure 9: Lumenogix NGS density plot of the data matrix after variance reduction and normalization

Expression Analysis

Once the data matrix has been constructed, transformed and normalized, it is ready for expression analysis. Specifically, the goals of the pre-processing are:

1. Convert NGS reads data into intensity data.
2. Transform the resulting data matrix to achieve some approximation of the Normal distribution.
3. Normalize the data to remove technical (but not biological) differences from the samples.

Once all these goals have been achieved, the actual expression analysis is relatively simple. A large number of existing algorithms can handle the constructed data matrix. In this experiment we applied the Timecourse [12] algorithm from Bioconductor. The algorithm required us to classify the samples based on time points (Figure 10). The result of this analysis is a list of genes ranked by Tau, a calculated value that estimates the odds of differential expression (Figure 11). It is not entirely accurate to refer to the resulting grid as a gene grid: this analysis can detect any kind of transcribed, annotated genomic element, including non-coding RNA. In this particular example, one of the differentially expressed elements was a small nucleolar RNA, Me28S-A992 (Figure 12).

Conclusion

Existing algorithms for analyzing microarray gene expression data are widely accepted. Here we applied the Bioconductor [9] Timecourse [12] algorithm to variance-stabilized [10] and normalized [11] gene expression data to demonstrate that these tools can also be used for processing expression data derived from sequencing reads. This approach offers a standardized analysis pipeline for expression data generated by high-speed sequencing technologies.
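Returning to the normalization step of the pipeline, the idea behind quantile normalization can be sketched in a few lines: every sample is forced onto the same reference distribution, taken as the mean of the sorted values across samples. This is a simplified illustration with invented values. Ties are broken by position rather than averaged, so its output may differ from that of the Limma implementation.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns (one list of values per sample)."""
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across all samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_rows)
    ]
    normalized = []
    for col in columns:
        # Row indices ordered from the smallest to the largest value in this sample.
        order = sorted(range(n_rows), key=lambda i: col[i])
        new_col = [0.0] * n_rows
        for rank, row_index in enumerate(order):
            new_col[row_index] = reference[rank]
        normalized.append(new_col)
    return normalized


# Hypothetical transformed intensities for three genes in two samples.
sample_a = [5.0, 2.0, 3.0]
sample_b = [4.0, 1.0, 6.0]

normalized = quantile_normalize([sample_a, sample_b])
print(normalized)
```

After normalization both samples contain exactly the same set of values, so any remaining difference between them lies in which genes hold which rank.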

Figure 10: Lumenogix NGS time course sample classification

Figure 11: Lumenogix NGS time course gene grid

Figure 12: Lumenogix NGS expression of snoRNA:Me28S-A992 over time

[1] M. Demerec, Biology of Drosophila. Wiley, 1950.

[2] V. Hartenstein, Atlas of Drosophila Development. Cold Spring Harbor Laboratory Press, 1st ed., Jan. 1995.

[3] S. E. Celniker, L. A. L. Dillon, M. B. Gerstein, K. C. Gunsalus, S. Henikoff, G. H. Karpen, M. Kellis, E. C. Lai, J. D. Lieb, D. M. MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K. P. White, R. H. Waterston, and modENCODE Consortium, Unlocking the secrets of the genome, Nature, vol. 459, pp. 927-930, June 2009. PMID: 19536255 PMCID: PMC2843545.

[4] Y. Kodama, E. Kaminuma, S. Saruhashi, K. Ikeo, H. Sugawara, Y. Tateno, and Y. Nakamura, Biological databases at DNA Data Bank of Japan in the era of next-generation sequencing technologies, Advances in Experimental Medicine and Biology, vol. 680, pp. 125-135, 2010. PMID: 20865494.

[5] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek, Ensembl 2009, Nucleic Acids Research, vol. 37, pp. D690-697, Jan. 2009. PMID: 19033362 PMCID: PMC2686571.

[6] C. Trapnell, L. Pachter, and S. L. Salzberg, TopHat: discovering splice junctions with RNA-Seq, Bioinformatics (Oxford, England), vol. 25, pp. 1105-1111, May 2009. PMID: 19289445 PMCID: PMC2672628.

[7] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology, vol. 10, no. 3, p. R25, 2009. PMID: 19261174 PMCID: PMC2690996.

[8] C. Popp, W. Dean, S. Feng, S. J. Cokus, S. Andrews, M. Pellegrini, S. E. Jacobsen, and W. Reik, Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency, Nature, vol. 463, pp. 1101-1105, Feb. 2010. PMID: 20098412 PMCID: PMC2965733.

[9] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang, Bioconductor: open software development for computational biology and bioinformatics, Genome Biology, vol. 5, no. 10, p. R80, 2004. PMID: 15461798 PMCID: PMC545600.

[10] S. Anders and W. Huber, Differential expression analysis for sequence count data, Genome Biology, vol. 11, no. 10, p. R106, 2010. PMID: 20979621 PMCID: PMC3218662.

[11] G. K. Smyth, Linear models and empirical Bayes methods for assessing differential expression in microarray experiments, Statistical Applications in Genetics and Molecular Biology, vol. 3, p. Article3, 2004. PMID: 16646809.

[12] Y. C. Tai and T. P. Speed, On gene ranking using replicated microarray time course data, Biometrics, vol. 65, pp. 40-51, Mar. 2009. PMID: 18537947.

[13] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools, Bioinformatics (Oxford, England), vol. 25, pp. 2078-2079, Aug. 2009. PMID: 19505943 PMCID: PMC2723002.

[14] A. R. Quinlan and I. M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics (Oxford, England), vol. 26, pp. 841-842, Mar. 2010. PMID: 20110278 PMCID: PMC2832824.