RNA Expression Time Course Analysis
April 23, 2014
Introduction

The increased efficiency and reduced per-read cost of next generation sequencing (NGS) have opened new and exciting opportunities for data analysis. Rather than comparing the change in expression in a small number of samples, sometimes without replicates, researchers can now effectively compare expression profiles of a large number of samples, complete with replicates. However, these comparisons are not trivial. They require fairly complex data management and data handling tools that can transform the read-based data into a data matrix suitable for analysis with sophisticated statistical algorithms. In this document we take a set of 12 NGS data sets from the first 24 hours of Drosophila embryonic development and demonstrate how to conduct a time course analysis using peer-reviewed statistical algorithms.

Analysis Steps

Any kind of expression analysis experiment using NGS data (time course analysis is one kind) is done in six steps:

1. Quality check of the raw data delivered by core facilities or contract research organizations (CROs).
2. Mapping the data to a reference genome.
3. Quantification of gene expression. This may also include quantification of transcripts (isoforms) and other annotated genomic elements, possibly including non-coding RNAs.
4. Variance reduction (also called transformation).
5. Normalization.
6. Differential expression analysis.

After step 3 the data have been converted from a list of reads in raw data files into a table-like format, with rows containing genomic elements (genes, transcripts, etc.) and columns containing samples. Each cell in this table contains a number that represents the relative abundance of a particular element in a sample. For example, rows corresponding to genes polaA and polaB may look like Table 1. This kind of table is also called a data matrix.
Tag      Sample A  Sample B  Sample C  Sample D
polaA    10        12        9         0
polaB    15        9         0         11

Table 1: Example Data Matrix

After one or more intensity tables have been generated, the data must be transformed and normalized so they have a distribution appropriate for statistical algorithms. Most of these algorithms are sensitive to the shape of the data, requiring a Normal distribution. Steps 4 and 5 accomplish this; in step 6 the appropriate statistical algorithms are finally applied.

Experiment

Drosophila melanogaster (the common fruit fly) is one of the best studied and most commonly used model organisms. The flies are easy to handle and breed, and large quantities of molecular information are readily available. The eggs are approximately 0.5 millimeters long and hatch after 12-15 hours [1, 2]. The resulting larvae grow for approximately 48 hours. The model organism ENCyclopedia Of DNA Elements (modENCODE) project [3] was created by the National Human Genome Research Institute (NHGRI) to identify all sequence-based functional elements of C. elegans and Drosophila melanogaster. The modENCODE project generated 12 sets of Drosophila melanogaster embryonic development data by sequencing RNA collected every 2 hours over a 24 hour period. The data, produced using a SOLiD 3 sequencing device, are fully accessible from the DNA Data Bank of Japan (DDBJ) [4]. We downloaded the data as FASTQ files, each containing roughly 5 gigabases in 50 nt reads, mapped them to the Ensembl Drosophila genome using splice junction mapping software, quantified gene expression, and analyzed the data matrix using time course-specific algorithms. The results show that, in spite of insufficient sequencing depth, our analysis was able to identify the probable hatching time as well as differentially expressed genes and non-coding elements. All analyses were performed using cloud-deployed Lumenogix NGS.
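The central object throughout the pipeline described above is the data matrix of Table 1. As a purely illustrative sketch (the names mirror Table 1 and are not part of any pipeline), such a matrix can be represented as a nested mapping from element to per-sample counts:

```python
# Illustrative sketch of the Table 1 data matrix as a nested dict.
# Gene and sample names are taken from Table 1; values are raw counts.
data_matrix = {
    "polaA": {"SampleA": 10, "SampleB": 12, "SampleC": 9, "SampleD": 0},
    "polaB": {"SampleA": 15, "SampleB": 9, "SampleC": 0, "SampleD": 11},
}

def expression(tag, sample):
    """Look up the relative abundance of one genomic element in one sample."""
    return data_matrix[tag][sample]

print(expression("polaA", "SampleB"))  # 12
```

In practice the matrix is held in a statistical environment rather than a plain dictionary, but the row/column semantics are the same.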
The data were mapped to Drosophila melanogaster genome release 5.25, published February 2010, which was downloaded from Ensembl [5]. Mapping was performed using the TopHat [6] splice junction mapping software, version 1.2 (Figures 2, 3). Internally, TopHat utilizes Bowtie [7] version 0.12.7. Since TopHat does not handle SOLiD colorspace
data, the files were converted to nucleotide format with standard quality encoding using the SOLiD2Std.pl script from the Corona Lite toolkit distributed by Life Technologies. Gene expression quantification was performed using the htseq-count [8] software. The resulting data matrix was transformed using the variance reduction algorithm from the Bioconductor [9] DESeq [10] package, then normalized using the quantile normalization algorithm from Limma [11]. Following transformation and normalization, the data were analyzed using the Bioconductor Timecourse [12] package.

Experimental Setup

The first data analysis decisions must be made before any data are generated. The researchers must decide on the length of each read and whether the sequencing will be done in paired-end or single-end configuration. Clearly, generating longer reads and paired reads will provide more information. This, however, must be balanced against budgetary restrictions: longer reads and paired reads cost more. In some cases, it may come down to a choice of more replicates vs. longer and/or paired reads. As a rule, if the researcher wants gene-level expression analysis and is not particularly interested in alternative splicing, a shorter single-read configuration should be adequate. However, if the experiment is done on a well-annotated species (such as human or mouse) and the researcher is interested in evaluating expression of individual isoforms, then longer paired-end reads are preferred. In practice this means one of two configurations:

1. For gene expression analysis: 50 nucleotide single reads
2. For analysis of transcript isoforms: 100 nucleotide paired-end reads

This decision must be made before libraries are prepared and will have a significant impact on all data analysis.

Quality Check

Data produced by NGS devices can be of varying quality depending on many factors, including the quality of the starting material, library construction, and the sequencing itself.
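Per-base quality is recorded in the FASTQ files themselves as Phred scores. As a minimal sketch of how such a score summarizes a read (assuming Sanger-style +33 ASCII encoding, the standard encoding produced by the conversion step above; this is an illustration, not the FastQC implementation):

```python
def mean_phred_quality(quality_line, offset=33):
    """Average Phred score of one FASTQ quality string.

    Assumes Sanger-style encoding: each character's ASCII code minus
    `offset` (33) is the Phred score of that base. QC tools aggregate
    such per-base scores across all reads in a sample.
    """
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)

# 'I' is ASCII 73, i.e. Phred 40 under Sanger encoding.
print(mean_phred_quality("IIII"))  # 40.0
```

A read-level threshold on this mean is one simple basis for the quality filtering discussed below; production tools use richer per-position statistics.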
While quality checks (QC) should be done at every step of this process, once the data have been delivered it is possible to evaluate the quality of each sample using unbiased computational tools. Within Lumenogix NGS, we utilize FastQC [8], a popular application that evaluates the quality of the data based on
Figure 1: Per base data quality

multiple metrics. In the case of poor quality data (e.g., Figure 1), two approaches can be taken:

1. Filter the data prior to analysis. This is the preferred approach when high data quality is required (e.g., SNP analysis and de novo assembly). In this case, filtering means removal of low-quality reads from the data set.
2. Perform the analysis with the complete data sets. This is the approach taken in most expression analysis experiments, especially when data quality is consistent between samples.

Mapping Reads to the Genome

Several excellent tools exist for mapping RNA reads to genomes. A special feature of these aligners is the ability to handle splice junctions. Since we are mapping spliced RNA transcripts to a genome, the algorithm must be aware of, and handle, situations where a read aligns to two (or more) locations with significant gaps in between. Within Lumenogix NGS we utilize TopHat [6], one of the most popular algorithms for mapping RNA-Seq data. The algorithm
Figure 2: Lumenogix NGS TopHat submission screen

requires the reference genome to be established and allows users to fine-tune parameters (Figure 2). Note that in this example we are handling single-end reads. After the mapping is complete, the algorithm generates files that contain alignment information for each read. In most cases, these files will be in SAM or BAM format [13]. Note that SAM is a text-based, human-readable format, while BAM is a compressed binary representation of SAM. These files are very large, so in most cases only the BAM format is retained. This is acceptable, since it is relatively easy to convert from BAM to SAM using Samtools [13]. After mapping, additional statistical information becomes available, including the count and percentage of reads that were successfully aligned. It should be noted, however, that for TopHat this statistic is inaccurate; this is a known TopHat issue. In addition to the alignment information for each read, TopHat also produces information on identified splice junctions (Figure 3). Both SAM/BAM and the BED [14] format used to store splice junctions are well known and can be used with various tools.
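Because SAM is text-based, simple alignment statistics can be recomputed independently of the aligner's own summary. A minimal sketch using only the mandatory FLAG field (bit 0x4 of the FLAG marks an unmapped read, per the SAM specification; the toy records below are invented for illustration):

```python
def mapped_fraction(sam_lines):
    """Fraction of records whose SAM FLAG lacks bit 0x4 (read unmapped).

    Header lines (starting with '@') are skipped; FLAG is the second
    tab-separated column of each alignment record.
    """
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[1])
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0

# Two toy records: one mapped (flag 0), one unmapped (flag 4).
sam = [
    "@HD\tVN:1.0",
    "read1\t0\tchr2L\t100\t255\t50M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(mapped_fraction(sam))  # 0.5
```

On real data, `samtools flagstat` computes this and related statistics directly from BAM; the sketch above only illustrates what is being counted.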
Figure 3: Lumenogix NGS TopHat results screen

Quantification

Once the reads have been aligned to the reference genome, the information can be used to quantify expression of genes and transcripts. The most popular approach is to use one of the normalizing quantification algorithms, such as Cufflinks [6]. Cufflinks is an extremely popular algorithm and was designed to integrate well with TopHat. In this context, normalizing means that the intensity values are scaled by transcript length. This adjusts for the advantage that longer transcripts have over shorter ones: quantification is done by counting the reads that align to a particular transcript, and more reads will align to longer transcripts than to shorter ones. Therefore, modern normalizing quantification algorithms report intensity in Fragments Per Kilobase of exon per Million fragments mapped (FPKM) [6], which is essentially intensity normalized by transcript length. After computing intensity per transcript, the values are aggregated into gene-based intensity calculations. This is a very good way to quantify intensities when comparing expression of transcripts within samples. However, in our example we are comparing gene expression across samples. When comparing gene A in sample
1 to gene A in sample 2 (or transcript A.p1 in sample 1 to transcript A.p1 in sample 2), the lengths of these elements are actually the same. Furthermore, normalizing algorithms occasionally run into trouble when they are unable to unambiguously identify a transcript, resulting in lost data (Figure 4). Another way to do quantification is the raw hit-count approach, as implemented in htseq-count [8]. This type of algorithm does not normalize the data by transcript length, and it calculates gene expression directly (not by aggregating transcript expression). For maximum flexibility, Lumenogix NGS offers both normalizing and raw hit-count quantification algorithms. Once the quantification has been performed (using either method), a data matrix (a table of intensities) can be constructed and the distribution (or shape) of the data visualized using standard statistical tools (Figures 5 and 6).

Figure 4: Lumenogix NGS scatter plot of FPKM data

Figure 5: Lumenogix NGS boxplot of intensities

Variance Reduction

The purpose of the variance reduction step (also called transformation or scaling) is to transform the data into a form that can be used in statistical analysis, i.e., into a Normal distribution. Different algorithms may be used, depending on how the data have been quantified. For FPKM (normalized) data, a standard log base 2 (log2) transformation appears to work well. For count data produced using raw hit-count quantification, the Bioconductor-based DESeq [10] package provides a specialized variance reduction function (Figure 7).
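The FPKM normalization and the subsequent log2 variance reduction can be sketched in a few lines. This is a simplified illustration of the two formulas, not the Cufflinks or DESeq implementation; the pseudocount of 1 is an assumption added here so that zero counts stay finite:

```python
import math

def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped.

    Scales a raw count by transcript length (in kilobases) and by
    sequencing depth (in millions of mapped fragments).
    """
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)

def log2_transform(value, pseudocount=1):
    """Simple variance-reducing transform; the pseudocount avoids log2(0)."""
    return math.log2(value + pseudocount)

# 500 fragments on a 2 kb transcript, 10 million mapped fragments total:
x = fpkm(500, 2_000, 10_000_000)
print(x)                  # 25.0
print(log2_transform(x))  # log2(26), roughly 4.70
```

Note how the length term drops out of any within-gene, between-sample comparison: gene A has the same length in every sample, which is exactly the argument made above for preferring raw hit counts in this experiment.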
Figure 6: Lumenogix NGS sample classification using unsupervised hierarchical clustering

Figure 7: Lumenogix NGS count data transformed using the DESeq variance reduction algorithm

Normalization

Normalization algorithms remove differences between samples that are due to technical (not biological) variation. Many different algorithms are available, including the popular quantile algorithm implemented in the Limma [11] package (Figure 8). The final shape of the data matrix prior to expression analysis should be a reasonable approximation of the Normal distribution (Figure 9).

Expression Analysis

Once the data matrix has been constructed, transformed, and normalized, it is ready for expression analysis. Specifically, the goals of the pre-processing are:

1. Convert NGS read data into intensity data.
2. Transform the resulting data matrix to achieve some approximation of a Normal distribution.
3. Normalize the data to remove technical (but not biological) differences between the samples.

Once all these goals have been achieved, the actual expression analysis is relatively simple. A large number of existing algorithms can handle the
constructed data matrix. In this experiment we applied the Timecourse [12] algorithm from Bioconductor. The algorithm required us to classify the samples by time point (Figure 10). The result of this analysis is a list of genes ranked by Tau, a calculated value that estimates the odds of differential expression (Figure 11). It is not entirely accurate to refer to the resulting grid as a gene grid: the analysis can detect any kind of transcribed, annotated genomic element, including non-coding RNA. In this particular example, one of the differentially expressed elements was the small nucleolar RNA Me28S-A992 (Figure 12).

Figure 8: Lumenogix NGS count data normalized using the quantile normalization algorithm

Figure 9: Lumenogix NGS density plot of the data matrix after variance reduction and normalization

Conclusion

Existing algorithms for analyzing microarray gene expression data are widely accepted. Here we applied the Bioconductor [9] Timecourse [12] algorithm to variance-stabilized [10] and normalized [11] gene expression data to demonstrate that these tools can also be used to process expression data derived from sequencing reads. This approach offers a standardized analysis pipeline for expression data generated by high-speed sequencing technologies.
Figure 10: Lumenogix NGS time course sample classification

Figure 11: Lumenogix NGS time course gene grid
Figure 12: Lumenogix NGS expression of snoRNA:Me28S-A992 over time

References

[1] M. Demerec, Biology of Drosophila. Wiley, 1950.

[2] V. Hartenstein, Atlas of Drosophila Development. Cold Spring Harbor Laboratory Press, 1st ed., Jan. 1995.

[3] S. E. Celniker, L. A. L. Dillon, M. B. Gerstein, K. C. Gunsalus, S. Henikoff, G. H. Karpen, M. Kellis, E. C. Lai, J. D. Lieb, D. M. MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K. P. White, R. H. Waterston, and the modENCODE Consortium, "Unlocking the secrets of the genome," Nature, vol. 459, pp. 927-930, June 2009. PMID: 19536255; PMCID: PMC2843545.

[4] Y. Kodama, E. Kaminuma, S. Saruhashi, K. Ikeo, H. Sugawara, Y. Tateno, and Y. Nakamura, "Biological databases at DNA Data Bank of Japan in the era of next-generation sequencing technologies," Advances in Experimental Medicine and Biology, vol. 680, pp. 125-135, 2010. PMID: 20865494.

[5] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek, "Ensembl 2009," Nucleic Acids Research, vol. 37, pp. D690-697, Jan. 2009. PMID: 19033362; PMCID: PMC2686571.

[6] C. Trapnell, L. Pachter, and S. L. Salzberg, "TopHat: discovering splice junctions with RNA-Seq," Bioinformatics, vol. 25, pp. 1105-1111, May 2009. PMID: 19289445; PMCID: PMC2672628.

[7] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, p. R25, 2009. PMID: 19261174; PMCID: PMC2690996.

[8] C. Popp, W. Dean, S. Feng, S. J. Cokus, S. Andrews, M. Pellegrini, S. E. Jacobsen, and W. Reik, "Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency," Nature, vol. 463, pp. 1101-1105, Feb. 2010. PMID: 20098412; PMCID: PMC2965733.

[9] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang, and J. Zhang, "Bioconductor: open software development for computational biology and bioinformatics," Genome Biology, vol. 5, no. 10, p. R80, 2004. PMID: 15461798; PMCID: PMC545600.

[10] S. Anders and W. Huber, "Differential expression analysis for sequence count data," Genome Biology, vol. 11, no. 10, p. R106, 2010. PMID: 20979621; PMCID: PMC3218662.

[11] G. K. Smyth, "Linear models and empirical Bayes methods for assessing differential expression in microarray experiments," Statistical Applications in Genetics and Molecular Biology, vol. 3, p. Article 3, 2004. PMID: 16646809.

[12] Y. C. Tai and T. P. Speed, "On gene ranking using replicated microarray time course data," Biometrics, vol. 65, pp. 40-51, Mar. 2009. PMID: 18537947.

[13] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and the 1000 Genome Project Data Processing Subgroup, "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, pp. 2078-2079, Aug. 2009. PMID: 19505943; PMCID: PMC2723002.

[14] A. R. Quinlan and I. M. Hall, "BEDTools: a flexible suite of utilities for comparing genomic features," Bioinformatics, vol. 26, pp. 841-842, Mar. 2010. PMID: 20110278; PMCID: PMC2832824.