ISOFORM ABUNDANCE INFERENCE PROVIDES A MORE ACCURATE ESTIMATION OF GENE EXPRESSION LEVELS IN RNA-SEQ

Journal of Bioinformatics and Computational Biology, Vol. 8, Suppl. 1 (2010)
© The Authors
DOI: /S

XI WANG, ZHENGPENG WU, and XUEGONG ZHANG

MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing, P. R. China
wang-xi05@mails.tsinghua.edu.cn
wuzhengpeng99@mails.tsinghua.edu.cn
zhangxg@tsinghua.edu.cn

Received 15 July 2010
Revised 30 August 2010
Accepted 10 September 2010

Owing to its unprecedented resolution and detailed information, RNA-seq technology based on next-generation high-throughput sequencing significantly boosts the ability to study transcriptomes. Estimating genes' transcript abundance levels, or gene expression levels, has always been an important question in research on transcriptional regulation and gene function. Building on the concept of Reads Per Kilobase per Million reads (RPKM), two strategies are currently used to estimate gene expression levels: computing RPKM on union-intersection genes (UI-based) and summing up inferred isoform abundances (isoform-based). The two strategies produce different estimates. In this paper, we made the first attempt to compare the performance of the two strategies through a series of simulation studies. Our results showed that the isoform-based strategy not only gives more accurate estimates but also has less uncertainty than the UI-based strategy. Taking into account the non-uniformity of the read distribution further reduces the estimation errors of the isoform-based strategy. We applied both strategies to real RNA-seq datasets of technical replicates and found that the isoform-based strategy also performs better there. For a more accurate estimation of gene expression levels from RNA-seq data, even if the abundance levels of isoforms are not of interest, it is still better to first infer the isoform abundances and sum them up to obtain the expression level of the gene as a whole.

Keywords: RNA-seq; gene expression level; estimation error; estimation uncertainty.

These authors contributed equally to this work. Corresponding author.

1. Background

Measuring gene expression levels has great importance in biological research. A gene's expression level is closely related to its functionality, so estimating
gene expression is a key step for describing many biological processes. Based on next-generation high-throughput sequencing technologies, which significantly reduce the cost of sequencing, RNA-seq finds extensive applications in studies of the transcriptome [1,2].

Although RNA-seq data provide the potential to measure how genes are expressed in an isoform-specific manner, several intrinsic and external reasons limit the wide use of isoform-specific information. Because of the slight differences between isoform structures in many genes and the existing sequencing biases [3,4], isoform abundance estimation has larger uncertainty than gene expression level estimation [5]. Large uncertainties result in large variances in differential expression analysis and therefore reduce the detection power. In addition, there are genes whose isoform abundances are not identifiable because of the intrinsic gene structure [6]. Meanwhile, the current lack of isoform-based biological knowledge still limits its application. The most popular biological knowledge databases, such as Gene Ontology (GO) [7] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], are organized according to gene annotations instead of isoform annotations. Summarizing the expression levels of genes from RNA-seq data is therefore particularly useful in current analysis pipelines, and several recent RNA-seq studies estimated gene expression levels, rather than isoform expression levels, for differential expression analysis [1,9,10].

For earlier techniques such as microarrays, many algorithms focused on estimating gene expression values by summarizing probe-level data (e.g. Refs. 11 and 12). To quantify transcript levels in RNA-seq, the concept of reads per kilobase of exon model per million reads (RPKM) was first proposed [1]. RPKM measures the read density in a genic region of interest by normalizing the read count in the corresponding exonic regions against the summed exon length (or gene length) and the total number of reads in the measurement (or sequencing depth). Under ideal conditions, where sequenced reads are sampled uniformly at random from transcripts and no alternative transcripts derive from the same genic region, RPKM reflects the actual transcript abundance levels well. In alternatively spliced genes, which comprise 92-94% of human genes [13], however, this concept is sometimes misused by ignoring the fact that different isoforms may have different lengths, which results in a projective normalization method. A recent paper showed that the projective normalization method underestimates gene expression levels to varying degrees [14].

In contrast, several candidate methods exist for unbiased estimation of gene expression levels, including a union-intersection gene (UI-based) method and an isoform-based strategy. The UI-based method takes the union of the constitutive exons of a gene as the UI genic region and computes the RPKM only within the UI gene [10]. Using a UI gene model avoids the length differences introduced by multiple isoforms. The isoform-based method adopts more sophisticated statistical models and bases the gene expression estimate on the sum of inferred isoform expression levels [15]. Under the assumption that the sequenced
reads are sampled independently and uniformly from measured transcripts, it is easy to model the distribution of exon read counts as a Poisson distribution [1]. Based on this, Jiang et al. [5] proposed a maximum likelihood method to infer isoform expression levels from RNA-seq data. We call the model used by Jiang et al. the uniform read distribution (URD) model. Within a similar framework, we have proposed a non-uniform read distribution (N-URD) model, which considers the empirical distribution of read counts and gives significantly more accurate isoform expression inference [16]. We implement both models to show the performance of the isoform-based strategy for gene expression level estimation.

Gene expression estimation is extremely important for biological research, but there is no consensus on the choice of estimation strategy [10]. The aim of this paper is therefore to compare the two categories of methods mentioned above. We used two criteria to evaluate how each strategy performs. Estimation error measures the difference between the estimated value and the given value, and estimation uncertainty is measured by the length of the interval that contains 95% of the posterior probability of the estimated value. Through a series of simulation studies, we compared the estimation error and the estimation uncertainty of the two strategies. Because both strategies rely on known gene annotations, we also explored their performance with incomplete annotations. In both investigations, the results show that a significant gain in the performance of gene expression level estimation (both smaller estimation error and smaller uncertainty) can be achieved by using the isoform-based strategy rather than the UI-based strategy. Lastly, we applied the methods to real RNA-seq technical replicate data, which also illustrated the advantage of the isoform-based method.

2. Methods

In this section we describe the two unbiased gene expression estimation strategies (UI-based and isoform-based) and the detailed experimental designs.

2.1. Notations

Assume a gene $g$ has $n$ exons of lengths $(l_1, \ldots, l_n)$ and $m$ isoforms with expression levels in RPKM units $(\theta_1, \ldots, \theta_m)$, or $\Theta = \{\theta_i \mid i = 1, \ldots, m\}$, in an experiment. From RNA-seq data, we have a set of observations $(x_1, \ldots, x_n)$ on this gene, with $x_i$ denoting the number of reads mapped to the $i$th exon of the gene $g$ (see Fig. 1(a) for an example). We use an indicator matrix $(a_{ij})_{n \times m}$ to represent the gene structure (an example is shown in Fig. 1(b)), where $a_{ij} = 1$ or $a_{ij} = 0$ indicates that the $i$th exon is included in or excluded from the $j$th isoform.

2.2. UI-based gene expression estimation

Bullard et al. defined a union-intersection gene (UI gene) to be the greatest common part of multiple isoforms within a gene [10]. The main idea of UI genes is to
simplify the gene structure, and therefore it excludes the reads from exons that do not appear in all isoforms. Within the UI gene region, RPKM [17] is employed to calculate the expression values of genes. Thus, the estimate of the gene expression level is

$$\hat{\theta}_g = \frac{\sum_{i \in UI} x_i}{\sum_{i \in UI} l_i}, \tag{1}$$

where $UI$ denotes the set of exons in the UI gene region, that is, $UI = \{\, i \mid \prod_{j=1}^{m} a_{ij} \neq 0 \,\}$. The sketch map of the UI-based gene expression method is shown in Fig. 1(c). The advantage of the UI-based method is that it avoids the impact of complex gene structures; the disadvantage is that it loses the information supplied by the discarded exons. Because a UI region may not exist for some genes, the UI-based method is sometimes not applicable. We excluded these genes in our experiments.

Fig. 1. The two unbiased methods for gene expression level estimation. (a) The observation in RNA-seq data is the read counts for each exon in a gene. (b) The gene structure shows a gene with three isoforms and four exons. (c) The sketch map of the UI-based gene expression method. (d) The sketch map of the isoform-based gene expression method.

2.3. Isoform-based gene expression estimation

Unlike the UI-based strategy, the isoform-based methods make use of all sequenced reads that are mapped to genic regions. The sketch map of the isoform-based gene expression strategy is shown in Fig. 1(d). The strategy first estimates the isoform-level expression values $(\hat{\theta}_1, \ldots, \hat{\theta}_m)$ from the observed read counts and known gene structures
(such as from the RefSeq gene annotation), and then sums them up to get the gene-level expression, that is,

$$\hat{\theta}_g = \sum_{j=1}^{m} \hat{\theta}_j. \tag{2}$$

For genes with isoforms whose abundance levels are not identifiable, we can simplify the gene structure by merging the unidentifiable isoforms into a new pseudo-isoform. This makes the isoform-based strategy applicable to all genes.

For the inference of isoform-level expression values, a key step is to build proper statistical models. The URD model proposed by Jiang et al. [5] is one of the most effective models for the isoform expression inference problem in RNA-seq. Based on a similar framework, and with the aim of improving the expression estimation accuracy at the isoform level, we have previously proposed an N-URD model that introduces empirical information about the non-uniform read distribution [16]. In this study, we use both the URD and N-URD models to investigate the performance of the isoform-based gene expression estimation strategy.

Here we briefly introduce the URD and N-URD models. In the URD model, each observation $x_i$ is assumed to be a random variable following a Poisson distribution with parameter $\lambda_i$. The $\lambda_i$ for the $i$th exon is $\lambda_i = l_i w \sum_{j=1}^{m} a_{ij}\theta_j$, where $w$ is the total number of mapped reads in the RNA-seq data and $a_{ij}$ are the elements of the gene structure indicator matrix. Thus the log-likelihood function for the $i$th exon is

$$\log L(\Theta \mid x_i) = \log\left( \frac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!} \right), \tag{3}$$

where $L(\cdot)$ denotes the likelihood function. Assuming the independence of the $x_i$, for a gene $g$ the joint log-likelihood function of all its exons can be written as

$$\log L(\Theta \mid x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log\left( \frac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!} \right). \tag{4}$$

Substituting $\lambda_i = l_i w \sum_{j=1}^{m} a_{ij}\theta_j$ for each exon, we have

$$\log L(\Theta \mid x_1, x_2, \ldots, x_n) = -w \sum_{i=1}^{n} \sum_{j=1}^{m} l_i a_{ij} \theta_j + \sum_{i=1}^{n} x_i \log\left( l_i w \sum_{j=1}^{m} a_{ij} \theta_j \right) - \sum_{i=1}^{n} \log(x_i!). \tag{5}$$

Maximizing the above log-likelihood function gives the inferred isoform expression levels. Because the log-likelihood is concave in $\Theta$, this is a convex optimization problem, and a gradient-based method can be used to find the solution [5].

The N-URD model substitutes the indicator matrix $(a_{ij})$ with a weighted indicator matrix $(b_{ij})$ whose elements are nonnegative real numbers calculated from a given
RNA-seq dataset and depict the non-uniformity of read distributions. Rewriting Eq. (5), the modified log-likelihood function is

$$\log L(\Theta \mid x_1, x_2, \ldots, x_n) = -w \sum_{i=1}^{n} \sum_{j=1}^{m} l_i b_{ij} \theta_j + \sum_{i=1}^{n} x_i \log\left( l_i w \sum_{j=1}^{m} b_{ij} \theta_j \right) - \sum_{i=1}^{n} \log(x_i!). \tag{6}$$

Unlike the 0-1 indicator matrix $(a_{ij})$, the weighted indicator matrix $(b_{ij})$ not only represents the gene structure information but also gives weights to the non-zero elements according to the bias tendency of the corresponding exons. A typical global bias tendency can be learnt from data within single-isoform genes. Notably, changing the indicator matrix to the weighted indicator matrix does not change the convexity of the optimization problem, so the maximum of the log-likelihood function can still be found easily.

2.4. Simulation data

We designed a series of simulation experiments to compare the performance of the investigated methods. The simulated data were generated in several steps. We first fixed the number of exons and the number of isoforms for a gene, that is, the size of the structure matrix. Then we sampled each $a_{ij}$ from a 0-1 random variable that takes the value 1 with 80% probability. This probability is deduced from the RefSeq gene annotation. Next, we sampled the lengths of exons randomly from the lengths of exons in the RefSeq gene annotation. Having the gene structure with exon lengths, for each isoform $I_j$ in a simulated gene, we randomly sampled a single-isoform gene $I_r$ from real RNA-seq data and let $I_j$ have the same RPKM as $I_r$, with a similar read count distribution. This gives every exon $i$ in isoform $j$ an RPKM value $RPKM_{ij}$. Taking $\lambda_i = l_i \sum_{j=1}^{m} RPKM_{ij}$ as the parameter, the read count for exon $i$ can be sampled from the corresponding Poisson model. We studied genes in which the number of isoforms varied from 2 to 5 and the number of exons varied from 6 to 13. For each setting, we generated 1,000 different gene structures randomly.
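As a rough illustration of the procedure above, the following Python sketch simulates one gene and then applies both estimators to it. It is not the authors' code, and all function names are hypothetical. Several things are assumptions made only for this sketch: the exon-length pool `EXON_LENGTHS_KB`, the uniform range used for isoform RPKMs, the depth of 5 million reads, the use of the URD rate $\lambda_i = l_i w \sum_j a_{ij}\theta_j$ in place of exon-specific $RPKM_{ij}$ values borrowed from real single-isoform genes, the redrawing of isoforms that would contain no exon, and the division of Eq. (1) by the sequencing depth so that both estimates are on the RPKM scale. The likelihood of Eq. (5) is maximized with an EM-style multiplicative update, swapped in here for the gradient method mentioned in the text because it keeps every $\theta_j$ nonnegative without a step size.

```python
import math
import random

# Hypothetical pool of exon lengths in kilobases (the paper samples from RefSeq).
EXON_LENGTHS_KB = [0.08, 0.12, 0.15, 0.20, 0.30]


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for modest lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_gene(n_exons, n_isoforms, rng, depth=5.0):
    """Return (a, lengths, theta, counts) for one simulated gene."""
    a = [[0] * n_isoforms for _ in range(n_exons)]
    for j in range(n_isoforms):
        column = [0] * n_exons
        while sum(column) == 0:  # assumption: every isoform keeps at least one exon
            column = [1 if rng.random() < 0.8 else 0 for _ in range(n_exons)]
        for i in range(n_exons):
            a[i][j] = column[i]
    lengths = [rng.choice(EXON_LENGTHS_KB) for _ in range(n_exons)]
    theta = [rng.uniform(2.0, 20.0) for _ in range(n_isoforms)]  # assumed RPKM range
    counts = []
    for i in range(n_exons):
        lam = lengths[i] * depth * sum(a[i][j] * theta[j] for j in range(n_isoforms))
        counts.append(sample_poisson(lam, rng))
    return a, lengths, theta, counts


def ui_estimate(a, lengths, counts, depth):
    """Eq. (1), divided by depth: read density over exons shared by all isoforms."""
    ui = [i for i, row in enumerate(a) if all(row)]
    if not ui:
        return None  # no UI region, so the UI-based method does not apply
    return sum(counts[i] for i in ui) / (depth * sum(lengths[i] for i in ui))


def urd_estimate(a, lengths, counts, depth, n_iter=2000):
    """Maximize Eq. (5) by EM-style multiplicative updates; returns isoform levels."""
    n, m = len(a), len(a[0])
    c = [[lengths[i] * depth * a[i][j] for j in range(m)] for i in range(n)]
    col_sum = [sum(c[i][j] for i in range(n)) for j in range(m)]
    theta = [1.0] * m
    for _ in range(n_iter):
        lam = [sum(c[i][j] * theta[j] for j in range(m)) for i in range(n)]
        theta = [
            theta[j] / col_sum[j]
            * sum(c[i][j] * counts[i] / lam[i]
                  for i in range(n) if c[i][j] > 0 and lam[i] > 0)
            for j in range(m)
        ]
    return theta


rng = random.Random(0)
a, lengths, theta, counts = simulate_gene(n_exons=8, n_isoforms=3, rng=rng)
print("given gene level:      ", sum(theta))
print("UI-based estimate:     ", ui_estimate(a, lengths, counts, 5.0))
print("isoform-based estimate:", sum(urd_estimate(a, lengths, counts, 5.0)))
```

Repeating the last four lines over many random genes, and recording the difference between each estimate and `sum(theta)`, would give a small-scale analogue of the comparison described in this section.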
Although the annotation of the human genome is fairly comprehensive, some isoforms of some genes remain unannotated, so we need to investigate the impact of incomplete annotation. For this, we generated another set of simulated data as described above, but randomly removed one isoform from the gene structure when estimating expression levels. As before, we repeated each setting of the artificially incomplete gene structure 1,000 times in the experiments.

2.5. RNA-seq datasets

We employed the real transcriptome RNA-seq dataset of Marioni et al. [9], which also provided the information to generate our simulation data. The
dataset was composed of about 120 million reads from human liver and kidney tissues. We downloaded the reads from the Sequence Read Archive (SRA). The read length of the dataset was 32 bp. For brevity we refer to the dataset as the Marioni data. We used SeqMap [18] to map reads to the human genome assembly UCSC hg18 (NCBI build 36), allowing up to two mismatches.

3. Results

We investigated the performance of the UI-based and isoform-based strategies on the simulated datasets and on real data. For the isoform-based strategy, we adopted both the original URD model and the improved N-URD model for isoform-level expression estimation. For brevity, we use URD and N-URD to denote the isoform-based strategy embedded with the URD model and the N-URD model, respectively.

3.1. Two criteria for performance comparison

Two criteria were used to evaluate the performance of the two strategies and to facilitate the comparison between them. Estimation error is of great biological importance. Correctly estimating the expression level of each gene gives an accurate gene expression profile, which is the key to reliable downstream functional analyses. For example, accurately estimating gene expression levels across the whole transcriptome was a primary step of a recent study that modeled gene expression based on an integrated analysis of ChIP-seq and RNA-seq data [19]. The other criterion, estimation uncertainty, is also essential in bioinformatics analyses. For example, when detecting differentially expressed genes from the RNA-seq data of two samples, the detection power decreases as the estimation uncertainty grows, because signals can be lost in the background noise caused by the estimation uncertainty.

The estimation error is defined as the difference between the estimated gene expression value and the given value, which is the sum of the given RPKM values of all isoforms. This calculation of the given gene expression value is fair to both strategies, because the two should report the same value if there is no randomness or bias in read generation. With random sampling and/or substantial sequencing biases, however, the main estimation error and uncertainty of the isoform-based strategy are introduced during the step of isoform expression inference, while those of the UI-based method stem from discarding a part of the informative reads. To make the estimation errors comparable between highly and lowly expressed genes, we normalized the difference by dividing it by the gene expression level, which gives the relative estimation error.

We further compared the estimation uncertainty, which is defined as the length of the 95% confidence interval of the estimated gene expression [5]. Using the optimization solution as the mean and the inverse Fisher information matrix as the
covariance matrix, the posterior distribution is approximated by a multivariate normal distribution. Using the approximated posterior distribution, we can sample a large number of points and estimate the 95% confidence interval. We take the length of the interval as the measure of the estimation uncertainty. As with the relative estimation error, we also investigated the relative estimation uncertainty, which equals the estimation uncertainty divided by the given gene expression level.

3.2. Results using complete annotation

Figures 2 and 3 summarize the relative estimation errors and uncertainties of the two strategies on the simulation data generated from the Marioni data, using complete annotation with different parameter settings $(m, n)$. Compared with UI, URD removes 53-76% of the error of the gene expression estimate, and N-URD goes further, removing 67-87% of the error.

Fig. 2. Relative estimation errors for complete annotation. Shown are the comparisons of the relative estimation errors of UI, URD and N-URD on the simulated data (based on the Marioni data) with complete annotation. Within each group of experiments, we fixed the number of isoforms as shown in the title and varied the number of exons from 6 to 13. For each setting, we repeated the experiments 1,000 times and calculated the mean of the estimation errors.

With the increase of the number
of isoforms in the genes, the relative estimation error becomes larger. For a fixed number of isoforms, the estimation error becomes smaller as the number of exons increases, because more exons supply more information. The improvement of URD and N-URD over UI becomes larger as the number of isoforms increases, because UI loses more information when faced with more isoforms. The improvement in estimation error achieved by the isoform-based strategy over the UI-based strategy is statistically significant, as shown by paired single-sided t-tests. The p-values are listed in Table 1. Besides demonstrating the advantage of the isoform-based strategy, the listed p-values also indicate that N-URD performs significantly better than URD.

Fig. 3. Relative estimation uncertainties for complete annotation. Shown are the comparisons of the relative uncertainties of UI, URD and N-URD from the same experiments as shown in Fig. 2.

Next, we consider the uncertainty of these methods. From Fig. 3, we can see that both URD and N-URD have a smaller uncertainty in their estimates than UI. The uncertainties of URD and N-URD are very close. The paired single-sided t-test p-values listed in Table 2 also indicate a significant improvement of URD and N-URD over UI in estimation uncertainty, but similar performance between URD and N-URD.
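The paired single-sided test used for these comparisons can be sketched as follows. This is a minimal stdlib-only illustration, not the authors' analysis code: the function name is hypothetical, the example error values are made up, and the p-value comes from the normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as the 1,000 repetitions per setting.

```python
import math
from statistics import mean, stdev


def paired_one_sided_t(errors_a, errors_b):
    """Test H1: mean(errors_a - errors_b) > 0 on paired samples.

    Returns the t statistic and an approximate p-value taken from the
    standard normal upper tail (large-sample stand-in for the t distribution).
    """
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    # erfc keeps precision in the far tail, where 1 - cdf would round to zero.
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))
    return t_stat, p_value


# Made-up relative errors for the same five simulated genes under two methods.
errors_ui = [0.30, 0.25, 0.40, 0.35, 0.28]
errors_urd = [0.10, 0.12, 0.15, 0.11, 0.09]
t_stat, p_value = paired_one_sided_t(errors_ui, errors_urd)
print("t statistic:", t_stat)
print("approximate p-value for H1: E(UI) > E(URD):", p_value)
```

Pairing matters here because both methods are run on the same simulated genes, so the test is applied to the per-gene differences rather than to two independent samples.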

Table 1. Statistical significance of the improvement in estimation error using complete annotation. Listed are paired single-sided t-test p-values. H1 denotes the alternative hypothesis in each hypothesis test, and the corresponding null hypothesis (H0) is the equality form of H1. E(METHOD) means the estimation error by the METHOD. The variables n and m denote the number of exons and isoforms, respectively.
[Table body: p-values for H1: E(UI) > E(URD), E(UI) > E(N-URD) and E(URD) > E(N-URD), with rows m = 2 to 5 and columns n = 6 to 13; the individual entries are not legible in this transcription.]

Table 2. Statistical significance of the improvement in estimation uncertainty using complete annotation. Listed are paired single-sided t-test p-values. H1 denotes the alternative hypothesis in each hypothesis test, and the corresponding null hypothesis (H0) is the equality form of H1. U(METHOD) means the estimation uncertainty by the METHOD. The variables n and m denote the number of exons and isoforms, respectively.
[Table body: p-values for H1: U(UI) > U(URD), U(UI) > U(N-URD) and U(URD) > U(N-URD), with rows m = 2 to 5 and columns n = 6 to 13; the individual entries are not legible in this transcription.]

3.3. Results using incomplete annotation

Using the incomplete annotation described in Sec. 2, we performed a group of similar experiments. We found that URD removes 55-77% of the estimation error of UI, and N-URD removes 56-85% of the error of UI. Although the improvement is slightly smaller than in the complete annotation simulation, URD and N-URD still reduce the error by a large degree. Meanwhile, they achieve
smaller estimation uncertainty. Figures 4 and 5 summarize these results. As shown in Tables 3 and 4, paired single-sided t-tests were carried out to assess the statistical significance of the improvement. The listed p-values indicate that the isoform-based strategy significantly reduces the estimation uncertainty relative to the UI-based strategy, whereas URD performs similarly to N-URD.

Overall, compared with the UI-based method, the two isoform-based methods achieve significant improvement in estimating gene expression levels with both complete and incomplete annotations. Taking into account the non-uniformity of the read distribution in RNA-seq data further reduces the estimation errors of the isoform-based strategy. As a result, we recommend the isoform-based method for gene expression estimation in RNA-seq.

Fig. 4. Relative estimation errors for incomplete annotation. Shown are the comparisons of the relative estimation errors of UI, URD and N-URD on the simulated data (based on the Marioni data) using incomplete annotation. We used similar settings as the experiments shown in Fig. 2.

Fig. 5. Relative estimation uncertainties for incomplete annotation. Shown are the comparisons of the relative uncertainties of UI, URD and N-URD from the same experiments as shown in Fig. 4.

Table 3. Statistical significance of the improvement in estimation error using incomplete annotation. This table is similar to Table 1.
[Table body: p-values for H1: E(UI) > E(URD), E(UI) > E(N-URD) and E(URD) > E(N-URD), with rows m = 2 to 5 and columns n = 6 to 13; the individual entries are not legible in this transcription.]

Table 4. Statistical significance of the improvement in estimation uncertainty using incomplete annotation. This table is similar to Table 2.
[Table body: p-values for H1: U(UI) > U(URD), U(UI) > U(N-URD) and U(URD) > U(N-URD), with rows m = 2 to 5 and columns n = 6 to 13; the individual entries are not legible in this transcription.]

3.4. Applications to real RNA-seq data

For the studies on the real transcriptome, we investigated the consistency of these methods on technical replicates. The estimation consistency on technical replicate data mainly reflects the estimation uncertainty that we investigated on the simulation data. Since no ground truth is available for the real data, the estimation error was not assessed here.

We applied UI, URD and N-URD to the technical replicate datasets of transcriptome RNA-seq data from Marioni et al. [9]. Because the technical replicates are from the same biological sample, the estimated gene expression levels should be highly consistent. For each tissue, we estimated the gene expression levels for two technical replicates and normalized the mean of the expression levels of each replicate to 1. Then we calculated the differences and examined them for each method. Because the mean of the expression levels in each replicate is normalized to 1, the mean of the differences is necessarily zero, so Table 5 lists only the variances of the differences for the two tissues using the different methods. For each tissue, UI always gives the largest variance of differences, while URD and N-URD give smaller variances. These results indicate that the isoform-based strategy tends to be more stable than the UI-based strategy, which also supports the conclusions from the simulation studies.

Table 5. The comparison results in real RNA-seq data. The table summarizes the comparisons of gene expression level estimation consistency by different methods in RNA-seq technical replicate data. Each cell represents the variance of the relative estimation differences in the replicate data.
[Table body: columns Kidney and Liver; rows UI, URD and N-URD; the individual entries are not legible in this transcription.]
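The replicate comparison described in this section can be sketched in a few lines of Python. The function name and the example values below are hypothetical, and the choice of the population variance (rather than the sample variance) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pvariance


def replicate_difference_variance(rep1, rep2):
    """Variance of per-gene differences after scaling each replicate to mean 1.

    rep1 and rep2 hold gene expression estimates for the same genes, in the
    same order, from two technical replicates.
    """
    m1, m2 = mean(rep1), mean(rep2)
    diffs = [x / m1 - y / m2 for x, y in zip(rep1, rep2)]
    # The mean of diffs is zero by construction, so only the variance is informative.
    return pvariance(diffs)


# Made-up expression estimates for four genes in two technical replicates.
replicate_1 = [12.0, 3.5, 40.2, 7.9]
replicate_2 = [11.4, 3.9, 41.0, 7.1]
print("variance of differences:", replicate_difference_variance(replicate_1, replicate_2))
```

Running this once per estimation method on the same pair of replicates gives one cell per method and tissue, in the layout of Table 5; a smaller value indicates more consistent estimates.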

4. Discussion

The gene expression level, or the abundance of transcripts originating from a functional genomic region, is the key concept in studying transcriptional regulation. RT-PCR, microarrays, SAGE [20], and the revolutionary RNA-seq technology have all made their own contributions to profiling gene expression levels. Estimating gene expression levels from the data generated by these experimental technologies in an unbiased and precise manner remains an important and challenging issue in bioinformatics studies. Although RNA-seq data provide digital information and directly give the global map of transcribed fragments, correctly profiling gene expression cannot be done in a straightforward manner. Researchers have recently begun to realize that the most commonly used projective normalization method underestimates, to various degrees, the gene expression levels of most multi-isoform genes. On this basis, this article tries to answer the question of which of the two existing unbiased strategies, the UI-based or the isoform-based, is preferred.

In this report, we conducted a comprehensive simulation study comparing the performance of the two strategies for gene expression level estimation. The series of simulation experiments indicated a significant advantage of the isoform-based methods (with URD and N-URD embedded) in both estimation accuracy and stability, which is further demonstrated by the comparison of gene expression estimates in the real RNA-seq technical replicate data. Intuitively, researchers may tend to think that the UI-based method, rather than the isoform-based methods, would have the more robust performance because of the large uncertainties in isoform abundance inference. Note, however, that the estimates of the individual isoforms of a gene are correlated, so summing up the isoform levels to estimate the gene expression level may reduce the uncertainty. This is consistent with the observations in the previous study [5]. On the other hand, the UI-based method can take into account only the reads falling in the constitutive exons when estimating the abundance, leaving a large part of the information unused. In addition, although this happens in only a small number of cases, the UI-based method does not work where no constitutive exon exists. Thus, we may conclude that even if one is not interested in the isoform abundance, one should first infer the isoform abundances and sum them up to estimate the gene expression level. This conclusion should help the community reach a consensus on this problem.

There is still room for improvement towards a more accurate estimation of gene expression. Although attention has been paid recently to the non-uniform read distribution in the inference of isoform abundance [15,16,21], more effort is still needed to incorporate this kind of information efficiently. In addition, because isoform annotation is incomplete, ab initio reconstruction of isoforms from RNA-seq data may provide more precise and stable estimation. Refining the solution to this problem will strongly benefit downstream analysis. For example, identifying differentially expressed genes between samples requires not only correct estimates of gene expression levels but also small estimation uncertainties.
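The remark above that summing correlated isoform estimates can reduce uncertainty can be illustrated numerically with the uncertainty measure of Sec. 3.1. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names and the toy two-isoform gene are hypothetical, the 2x2 inverse restricts it to genes with two isoforms, and the interval length is computed analytically from the normal approximation instead of by sampling points from it (the constraint that expression levels are nonnegative is ignored).

```python
import math


def urd_fisher_information(a, lengths, depth, theta):
    """Fisher information of the URD Poisson model: I[j][k] = sum_i c_ij c_ik / lambda_i,
    with c_ij = l_i * w * a_ij and lambda_i = sum_j c_ij * theta_j."""
    n, m = len(a), len(theta)
    c = [[lengths[i] * depth * a[i][j] for j in range(m)] for i in range(n)]
    lam = [sum(c[i][j] * theta[j] for j in range(m)) for i in range(n)]
    return [[sum(c[i][j] * c[i][k] / lam[i] for i in range(n) if lam[i] > 0)
             for k in range(m)] for j in range(m)]


def invert_2x2(mat):
    """Inverse of a 2x2 matrix (enough for a two-isoform gene)."""
    (p, q), (r, s) = mat
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def gene_interval_length(cov, z=1.96):
    """Length of the central 95% interval of sum(theta) under the normal approximation."""
    var_sum = sum(sum(row) for row in cov)  # Var(sum) = sum of all covariance entries
    return 2.0 * z * math.sqrt(var_sum)


# Toy gene: exon 0 is shared by both isoforms, exons 1 and 2 are isoform-specific.
a = [[1, 1], [1, 0], [0, 1]]
lengths, depth, theta = [1.0, 1.0, 1.0], 1.0, [3.0, 5.0]
cov = invert_2x2(urd_fisher_information(a, lengths, depth, theta))
print("covariance between isoform estimates:", cov[0][1])
print("variance of the summed estimate:     ", sum(map(sum, cov)))
print("sum of the two isoform variances:    ", cov[0][0] + cov[1][1])
print("95% interval length for the gene:    ", gene_interval_length(cov))
```

In this toy gene the covariance between the two isoform estimates is negative, so the variance of their sum is smaller than the two isoform variances added together, which is the effect the discussion refers to.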

Acknowledgments

The authors would like to thank Dr. Lior Pachter for his helpful discussion, and Dr. Greg Vatcher for his great assistance on the language of the manuscript. This work is supported in part by the NSFC grants ( , and ).

References

1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods 5.
2. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M, The transcriptional landscape of the yeast genome defined by RNA sequencing, Science 320.
3. Li J, Jiang H, Wong WH, Modeling non-uniformity in short-read rates in RNA-Seq data, Genome Biol 11:R50.
4. Hansen KD, Brenner SE, Dudoit S, Biases in Illumina transcriptome sequencing caused by random hexamer priming, Nucleic Acids Res 38:e131.
5. Jiang H, Wong WH, Statistical inferences for isoform expression in RNA-Seq, Bioinformatics 25.
6. Hiller D, Jiang H, Xu W, Wong WH, Identifiability of isoform deconvolution from junction arrays and RNA-Seq, Bioinformatics 25.
7. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G, Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium, Nat Genet 25:25-29.
8. Kanehisa M, Goto S, KEGG: Kyoto encyclopedia of genes and genomes, Nucleic Acids Res 28:27-30.
9. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y, RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Res 18.
10. Bullard JH, Purdom E, Hansen KD, Dudoit S, Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments, BMC Bioinformatics 11:94.
11. Li C, Wong WH, Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, Proc Natl Acad Sci USA 98:31-36.
12. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4.
13. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB, Alternative isoform regulation in human tissue transcriptomes, Nature 456.
14. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nat Biotechnol 28.
15. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN, RNA-Seq gene expression estimation with read mapping uncertainty, Bioinformatics 26.
16. Wu Z, Wang X, Zhang X, Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq, To be published.

16 192 X. Wang, Z. Wu & X. Zhang 17. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods 5: , Jiang H, Wong WH, SeqMap: Mapping massive amount of oligonucleotides to the genome, Bioinformatics 24: , Ouyang Z, Zhou Q, Wong WH, ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells, Proc Natl Acad Sci USA 106: , Velculescu VE, Zhang L, Vogelstein B, Kinzler KW, Serial analysis of gene-expression, Science 270: , Howard BE, Heber S, Towards reliable isoform quantification using RNA-SEQ data, BMC Bioinformatics 11(Suppl 3):S6, 2010 Xi Wang received his B.E. degree in Automation in 2005 from Harbin Institute of Technology, Harbin, China. He is now a Ph.D. candidate in Bioinformatics at MOE Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, and also Department of Automation, Tsinghua University. His research interest includes machine leaning, data mining for bioinformatics, DNA sequence analysis and ChIP-seq/RNA-seq data analyses. Zhengpeng Wu received his B.Sc. degree in Automatic Control in 2004 from Tsinghua University, Beijing, China. He is now a Ph.D. candidate at the Department of Automation, Tsinghua University, and also at MOE Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology. His research interests include pattern recognition, statistical learning theory, statistics and computational genomics. Xuegong Zhang received his B.Sc. degree in 1989 and Ph.D. degree in 1994, both from Tsinghua University, Beijing. He is now Professor of Pattern Recognition and Bioinformatics at Tsinghua University, and Director of the Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology.


More information

Viewing the Proteome from Oligopeptides and Prediction of Protein Function

Viewing the Proteome from Oligopeptides and Prediction of Protein Function 74 Genome Informatics 6(2): 74 82 (25) Viewing the Proteome from Oligopeptides and Prediction of Protein Function Hisayuki Horai,2 Kouichi Doi Hirofumi Doi,2 hisayu-h@is.naist.ac.jp doy@is.naist.ac.jp

More information

Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate

Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate Liu et al. BMC Bioinformatics (15) 1:33 DOI 1.11/s159-15-75- METHODOLOGY ARTICLE Open Access Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate Xuejun Liu

More information

Background and Normalization:

Background and Normalization: Background and Normalization: Investigating the effects of preprocessing on gene expression estimates Ben Bolstad Group in Biostatistics University of California, Berkeley bolstad@stat.berkeley.edu http://www.stat.berkeley.edu/~bolstad

More information

RNA standards v May

RNA standards v May Standards, Guidelines and Best Practices for RNA-Seq: 2010/2011 I. Introduction: Sequence based assays of transcriptomes (RNA-seq) are in wide use because of their favorable properties for quantification,

More information

Application of Decision Trees in Mining High-Value Credit Card Customers

Application of Decision Trees in Mining High-Value Credit Card Customers Application of Decision Trees in Mining High-Value Credit Card Customers Jian Wang Bo Yuan Wenhuang Liu Graduate School at Shenzhen, Tsinghua University, Shenzhen 8, P.R. China E-mail: gregret24@gmail.com,

More information

Uncovering differentially expressed pathways with protein interaction and gene expression data

Uncovering differentially expressed pathways with protein interaction and gene expression data The Second International Symposium on Optimization and Systems Biology (OSB 08) Lijiang, China, October 31 November 3, 2008 Copyright 2008 ORSC & APORC, pp. 74 82 Uncovering differentially expressed pathways

More information

Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data

Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data Vol 1 no 1 2005 Pages 1 5 Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data Ozgun Babur 1 1 Center for Bioinformatics, Computer Engineering Department, Bilkent University,

More information

Computational approaches to the analysis of RNA-seq data

Computational approaches to the analysis of RNA-seq data I519 Introduction to Bioinformatics, 2013 Computational approaches to the analysis of RNA-seq data Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Next-generation gap http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.f.268.html

More information

Quan9fying with sequencing. Week 14, Lecture 27. RNA-Seq concepts. RNA-Seq goals 12/1/ Unknown transcriptome. 2. Known transcriptome

Quan9fying with sequencing. Week 14, Lecture 27. RNA-Seq concepts. RNA-Seq goals 12/1/ Unknown transcriptome. 2. Known transcriptome 2015 - BMMB 852D: Applied Bioinforma9cs Quan9fying with sequencing Week 14, Lecture 27 István Albert Biochemistry and Molecular Biology and Bioinforma9cs Consul9ng Center Penn State Samples consists of

More information

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Team members: David Moskowitz and Emily Tsang Background Transcription factors

More information