ISOFORM ABUNDANCE INFERENCE PROVIDES A MORE ACCURATE ESTIMATION OF GENE EXPRESSION LEVELS IN RNA-SEQ

Journal of Bioinformatics and Computational Biology
Vol. 8, Suppl. 1 (2010) 177-192
© The Authors
DOI: 10.1142/S0219720010005178

XI WANG, ZHENGPENG WU and XUEGONG ZHANG

MOE Key Laboratory of Bioinformatics and Bioinformatics Division
TNLIST/Department of Automation, Tsinghua University
Beijing 100084, P. R. China
wang-xi05@mails.tsinghua.edu.cn
wuzhengpeng99@mails.tsinghua.edu.cn
zhangxg@tsinghua.edu.cn

These authors contributed equally to this work.
Corresponding author.

Received 15 July 2010
Revised 30 August 2010
Accepted 10 September 2010

Due to its unprecedented high resolution and detailed information, RNA-seq technology based on next-generation high-throughput sequencing significantly boosts the ability to study transcriptomes. The estimation of genes' transcript abundance levels, or gene expression levels, has always been an important question in research on transcriptional regulation and gene functions. On the basis of the concept of Reads Per Kilo-base per Million reads (RPKM), taking the union-intersection genes (UI-based) and summing up inferred isoform abundance (isoform-based) are the two current strategies to estimate gene expression levels, but they produce different estimates. In this paper, we made the first attempt to compare the two strategies' performances through a series of simulation studies. Our results showed that the isoform-based method not only gives more accurate estimation but also has less uncertainty than the UI-based strategy. When the non-uniformity of read distribution is taken into account, the isoform-based method can further reduce estimation errors. We applied both strategies to real RNA-seq datasets of technical replicates, and found that the isoform-based strategy also displays better performance. For a more accurate estimation of gene expression levels from RNA-seq data, even if the abundance levels of isoforms are not of interest, it is still better to first infer the isoform abundance and sum them up to get the expression level of a gene as a whole.

Keywords: RNA-seq; gene expression level; estimation error; estimation uncertainty.

1. Background

Measuring gene expression levels has great importance in biological research. A gene's expression level is closely related to its functionality, so estimating gene expression is a key step for describing many biological processes. Based on next-generation high-throughput sequencing technologies, which significantly reduce the cost of sequencing, RNA-seq finds extensive applications in studies of the transcriptome [1,2].

Although RNA-seq data provide the potential to measure how genes are expressed in an isoform-specific manner, several intrinsic and external reasons limit the wide use of isoform-specific information. Due to the slight differences between isoform structures in many genes and the existing sequencing biases [3,4], isoform abundance estimation has larger uncertainty than gene expression level estimation [5]. The large uncertainties result in large variances in differential expression analysis and therefore reduce the detection power. Besides, there are some genes for which the isoform abundance is not identifiable because of the intrinsic gene structure [6]. Meanwhile, the current lack of isoform-based biological knowledge still limits its application. The most popular biological knowledge databases, such as Gene Ontology (GO) [7] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], are organized according to gene annotations instead of isoform annotations. Summarizing the expression levels of genes from RNA-seq data is particularly useful in current analysis pipelines, and several recent RNA-seq studies estimated gene expression levels, rather than isoform expression levels, for differential expression analysis [1,9,10].

For earlier techniques such as microarray analysis, many algorithms focused on the estimation of gene expression values by summarizing probe-level data (e.g. Refs. 11 and 12). To quantify transcript levels in RNA-seq, the concept of reads per kilobase of exon model per million reads (RPKM) was first proposed [1]. RPKM measures the read density in a genic region of interest by normalizing the read count in the corresponding exonic regions against the summed exon length (or gene length) and the total number of reads in the measurement (or sequencing depth). In ideal conditions, where sequenced reads are sampled uniformly at random from transcripts and no alternative transcripts are derived from an identical genic region, RPKM reflects the actual transcript abundance levels well. In alternatively spliced genes, which comprise 92-94% of human genes [13], people sometimes misuse this concept by ignoring the fact that different isoforms may be of different lengths, which results in a projective normalization method. A recent paper has shown that the projective normalization method underestimates gene expression levels to varying degrees [14].

In contrast, for an unbiased estimation of gene expression levels, there exist several other candidate methods, including a union-intersection gene (UI-based) method and an isoform-based strategy. The UI-based method takes the union of the constitutive exons of a gene as the UI genic region and computes the RPKM only in the UI gene [10]. The use of a UI gene model avoids the length differences introduced by multiple isoforms. The isoform-based method adopts more sophisticated statistical models and bases the gene expression estimation on the summation of inferred isoform expression levels [15].

Under the assumption that the sequenced reads are sampled independently and uniformly from measured transcripts, it is easy to model the distribution of exon read counts as a Poisson distribution [1]. Based on this, Jiang and Wong [5] proposed a maximum likelihood estimation method to infer isoform expression levels from RNA-seq data. We call the model used by Jiang and Wong the uniform read distribution (URD) model. With a similar framework, we have proposed a non-uniform read distribution (N-URD) model, which considers the empirical distribution of read counts and gives significantly more accurate isoform expression inference [16]. We implement both models to show the performance of the isoform-based strategy for gene expression level estimation.

Gene expression estimation is extremely important for biological research, but there is no consensus on the choice of estimation strategy [10]. The aim of this paper is therefore to compare the two categories of methods mentioned above. We used two criteria to evaluate how an individual strategy performs. Estimation error measures the difference between the estimated value and the given value, and the degree of estimation uncertainty is measured by the length of the interval which contains 95% of the posterior probability of the estimated value. Through a series of simulation studies, we compared the estimation error and the estimation uncertainty of the two strategies. Considering that both strategies investigated here rely on known gene annotations, we also explored their performances in the case of incomplete annotations. In both investigations, the results show that a significant gain in the performance of gene expression level estimation (both less estimation error and less uncertainty) can be achieved by using the isoform-based strategy rather than the UI-based strategy.
Lastly, we applied these two methods to real RNA-seq technical replicate data, which also illustrated the advantage of the isoform-based method.

2. Methods

In this section we describe the two unbiased gene expression estimation strategies (the UI-based and isoform-based strategies) and the detailed experiment designs.

2.1. Notations

Assume a gene $g$ has $n$ exons of lengths $(l_1, \ldots, l_n)$ and $m$ isoforms with expression levels in RPKM units $(\theta_1, \ldots, \theta_m)$, or $\theta = \{\theta_i \mid i = 1, \ldots, m\}$, in an experiment. From RNA-seq data, we have a set of observations $(x_1, \ldots, x_n)$ on this gene, with $x_i$ denoting the number of reads mapped to the $i$th exon of the gene $g$ (see Fig. 1(a) for an example). We use an indicator matrix $(a_{ij})_{n \times m}$ to represent the gene structure (an example is shown in Fig. 1(b)), where $a_{ij} = 1$ or $a_{ij} = 0$ indicates that the $i$th exon is included in or excluded from the $j$th isoform.

2.2. UI-based gene expression estimation

Bullard et al. defined a union-intersection gene (UI gene) to be the greatest common part of multiple isoforms within a gene [10]. The main idea of UI genes is to simplify the gene structure, and therefore the method excludes the reads from exons which do not appear in all isoforms. Within the UI gene region, RPKM [17] is employed to calculate the expression values of genes. Thus, the estimate of the gene expression level is

$$\hat{\theta}_g = \frac{\sum_{i \in UI} x_i}{\sum_{i \in UI} l_i}, \tag{1}$$

where $UI$ denotes the aggregate of exons in the UI gene region, that is, $UI = \{ i \mid \prod_{j=1}^{m} a_{ij} \neq 0 \}$. The sketch map of the UI-based gene expression method is shown in Fig. 1(c).

Fig. 1. The two unbiased methods for gene expression level estimation. (a) The observation in RNA-seq data is the read counts for each exon in a gene. (b) The gene structure shows a gene with three isoforms and four exons. (c) The sketch map of the UI-based gene expression method. (d) The sketch map of the isoform-based gene expression method.

The advantage of the UI-based method is that it avoids the impact of complex gene structures; the disadvantage is that it loses the information supplied by the abandoned exons. Because the UI region may not exist for some genes, the UI-based method is sometimes not applicable. We excluded these genes in our experiments.

2.3. Isoform-based gene expression estimation

Unlike the UI-based strategy, the isoform-based methods make use of all sequenced reads that are mapped to genic regions. The sketch map of the isoform-based gene expression strategy is shown in Fig. 1(d). It first estimates the isoform-level expression values $(\hat{\theta}_1, \ldots, \hat{\theta}_m)$ from the observed read counts and known gene structures (such as from RefSeq gene annotation), and then sums them up to get the gene-level expression, that is,

$$\hat{\theta}_g = \sum_{j=1}^{m} \hat{\theta}_j. \tag{2}$$

In genes with isoforms whose abundance levels are not identifiable, we can simplify the gene structure by merging the unidentifiable isoforms to form a new pseudo isoform. This makes the isoform-based strategy applicable to all genes.

For the inference of isoform-level expression values, a key step is to build proper statistical models. The URD model proposed by Jiang and Wong [5] is one of the most effective models for solving the isoform expression inference problem in RNA-seq. Based on a similar framework, with the aim of improving the expression estimation accuracy at the isoform level, we have previously proposed an N-URD model that introduces empirical information about non-uniform read distribution [16]. In this study, we use both the URD and N-URD models to investigate the performance of the isoform-based gene expression estimation strategy.

Here we briefly introduce the URD and N-URD models. In the URD model, each observation $x_i$ is assumed to be a random variable following a Poisson distribution with parameter $\lambda_i$. The $\lambda_i$ for the $i$th exon is $\lambda_i = l_i w \sum_{j=1}^{m} a_{ij} \theta_j$, where $w$ is the total number of mapped reads in the RNA-seq data and $a_{ij}$ are the elements of the gene structure indicator matrix. Thus the corresponding log-likelihood function for the $i$th exon is

$$\log L(\theta \mid x_i) = \log\left( \frac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!} \right), \tag{3}$$

where $L(\cdot)$ denotes the likelihood function. Assuming the independence of the $x_i$'s, for a gene $g$ the joint log-likelihood function of all its exons can be written as

$$\log L(\theta \mid x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log\left( \frac{e^{-\lambda_i} \lambda_i^{x_i}}{x_i!} \right). \tag{4}$$

Considering $\lambda_i = l_i w \sum_{j=1}^{m} a_{ij} \theta_j$ for each exon, we have

$$\log L(\theta \mid x_1, x_2, \ldots, x_n) = -w \sum_{i=1}^{n} \sum_{j=1}^{m} l_i a_{ij} \theta_j + \sum_{i=1}^{n} x_i \log\left( l_i w \sum_{j=1}^{m} a_{ij} \theta_j \right) - \sum_{i=1}^{n} \log(x_i!). \tag{5}$$

Maximizing the above log-likelihood function gives the inferred isoform expression levels. Due to the convexity of this optimization problem, a gradient descent method can be used to find the solution [5].

The N-URD model substitutes the indicator matrix $(a_{ij})$ with a weighted indicator matrix $(b_{ij})$ whose elements are nonnegative real numbers calculated from a given RNA-seq dataset and depict the non-uniformity of read distributions. Rewriting Eq. (5), the modified log-likelihood function is

$$\log L(\theta \mid x_1, x_2, \ldots, x_n) = -w \sum_{i=1}^{n} \sum_{j=1}^{m} l_i b_{ij} \theta_j + \sum_{i=1}^{n} x_i \log\left( l_i w \sum_{j=1}^{m} b_{ij} \theta_j \right) - \sum_{i=1}^{n} \log(x_i!). \tag{6}$$

Unlike the 0-1 indicator matrix $(a_{ij})$, the weighted indicator matrix $(b_{ij})$ not only represents the gene structure information, but also gives weights to the non-zero elements according to the bias tendency of the corresponding exons. A typical global bias tendency can be learnt from data within single-isoform genes. Notably, changing the indicator matrix to the weighted indicator matrix does not change the convexity of the optimization problem, so it is still easy to find the maximum of the log-likelihood function.

2.4. Simulation data

We designed a series of simulation experiments to compare the performances of the investigated methods. The simulated data generation includes several steps. We first fixed the number of exons and the number of isoforms for a gene, that is, the size of the structure matrix. Then we sampled the $a_{ij}$ using a 0-1 random variable which takes a value of 1 with 80% probability. This probability is deduced from the RefSeq gene annotation. Next, we sampled the lengths of exons randomly from the lengths of exons in the RefSeq gene annotation. Having the gene structure with exon lengths, for each isoform $I_j$ in a simulated gene, we randomly sampled a single-isoform gene $I_r$ from real RNA-seq data, and let $I_j$ have the same RPKM as $I_r$ with a similar read count distribution. This gives every exon $i$ in isoform $j$ an RPKM value $RPKM_{ij}$. Taking $\lambda_i = l_i \sum_{j=1}^{m} RPKM_{ij}$ as the parameter, the read count for exon $i$ can be sampled from the corresponding Poisson model. We studied genes where the number of isoforms varied from 2 to 5 and the number of exons varied from 6 to 13. For each setting, we generated 1,000 different gene structures randomly.
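The estimators of Secs. 2.2 and 2.3 and the simulation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: exon lengths and isoform RPKM values are drawn from hypothetical uniform ranges instead of RefSeq and the Marioni data, reads follow the uniform (URD) model, lengths are in kilobases and `w` is in millions of reads, the UI estimate divides by `w` so that it is in RPKM units (Eq. (1) as printed shows no `w`), and the likelihood of Eq. (5) is maximized with an EM-style multiplicative update rather than the gradient method cited in the text. All function names are made up for this sketch.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method, in chunks of at most 30."""
    total = 0
    while lam > 0:
        step = min(lam, 30.0)  # keeps exp(-step) far from underflow
        limit, p = math.exp(-step), rng.random()
        while p > limit:
            total += 1
            p *= rng.random()
        lam -= step
    return total


def simulate_gene(rng, n_exons, n_isoforms, w=10.0):
    """Simulate one gene under the uniform model (value ranges are hypothetical)."""
    # a[i][j] = 1 with probability 0.8, the value the text deduces from RefSeq.
    a = [[int(rng.random() < 0.8) for _ in range(n_isoforms)] for _ in range(n_exons)]
    lengths = [rng.uniform(0.05, 1.0) for _ in range(n_exons)]   # exon lengths, kb
    theta = [rng.uniform(1.0, 50.0) for _ in range(n_isoforms)]  # isoform RPKM
    # lambda_i = l_i * w * sum_j a_ij * theta_j
    lam = [l * w * sum(aij * t for aij, t in zip(row, theta))
           for l, row in zip(lengths, a)]
    return a, lengths, theta, [sample_poisson(rng, v) for v in lam]


def ui_estimate(a, lengths, x, w):
    """UI-based estimate: read density over exons present in every isoform."""
    ui = [i for i, row in enumerate(a) if all(row)]
    if not ui:
        return None  # no constitutive exon, so the UI-based method does not apply
    return sum(x[i] for i in ui) / (w * sum(lengths[i] for i in ui))


def isoform_estimate(a, lengths, x, w, iters=1000):
    """Isoform-based estimate: maximize Eq. (5) by EM updates, then sum (Eq. (2))."""
    n, m = len(a), len(a[0])
    c = [[lengths[i] * w * a[i][j] for j in range(m)] for i in range(n)]
    col = [sum(c[i][j] for i in range(n)) for j in range(m)]
    theta = [1.0] * m
    for _ in range(iters):
        mu = [sum(c[i][j] * theta[j] for j in range(m)) for i in range(n)]
        theta = [
            theta[j] / col[j]
            * sum(c[i][j] * x[i] / mu[i] for i in range(n) if mu[i] > 0)
            if col[j] > 0 else 0.0
            for j in range(m)
        ]
    return theta, sum(theta)


if __name__ == "__main__":
    rng = random.Random(0)
    a, lengths, theta, x = simulate_gene(rng, n_exons=8, n_isoforms=3)
    print("given gene level:", sum(theta))
    print("UI-based estimate:", ui_estimate(a, lengths, x, 10.0))
    print("isoform-based estimate:", isoform_estimate(a, lengths, x, 10.0)[1])
```

With noise-free counts both estimators recover the given gene-level value; they differ once Poisson noise is added, because the UI-based estimate uses only the constitutive exons while the isoform-based estimate uses every exon.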
Although the annotation of the human genome is fairly comprehensive, there are still some undiscovered isoforms for some genes, so we need to investigate the impact of incomplete annotation. For this, we generated another set of simulated data as described above, but randomly removed one isoform from the gene structure when estimating expression levels. Similarly, we repeated each setting of the artificially incomplete gene structure 1,000 times in the experiments.

2.5. RNA-seq datasets

We employed the real transcriptome RNA-seq dataset of Marioni et al. [9], which also provided the information to generate our simulation data. The dataset was composed of about 120 million reads from human liver and kidney tissues. We downloaded the reads from the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/traces/sra). The read length of the dataset was 32 bp. For brevity we refer to the dataset as the Marioni data. We used SeqMap [18] to map reads to the human genome assembly UCSC hg18 (or NCBI build 36), allowing up to two mismatches.

3. Results

We investigated the performances of the UI-based and isoform-based strategies on the simulated datasets and real data. For the isoform-based strategy, we adopted both the original URD model and the improved N-URD model for isoform-level expression estimation. For brevity, we use URD and N-URD to denote the isoform-based strategy embedded with the URD model and the N-URD model, respectively.

3.1. Two criteria for performance comparison

Two criteria were used to evaluate the performances of the two strategies and to facilitate the comparison between them. Estimation error is of great biological importance. Correctly estimating the expression level of each gene gives an accurate gene expression profile, which is the key to reliable downstream functional analyses. For example, accurately estimating gene expression levels in the whole transcriptome was a primary step of a recent study modeling gene expression based on an integrated analysis of ChIP-seq and RNA-seq data [19]. The other criterion, estimation uncertainty, is also essential in bioinformatics analyses. For example, in detecting differentially expressed genes from two samples' RNA-seq data, the detection power decreases as the estimation uncertainty grows, because signals can be lost in the background noise caused by the estimation uncertainty.
The estimation error is defined as the difference between the estimated gene expression value and the given value, which is the sum of the given RPKM values of all isoforms. The calculation of the given gene expression value is fair for the two strategies, because the two should report the same value if there is no randomness or bias in read generation. But due to random sampling and/or substantial sequencing biases, the main estimation error and uncertainty of the isoform-based strategy are introduced during the step of isoform expression inference, while those of the UI-based method stem from discarding a part of the informative reads. In order to make the estimation errors comparable between highly and lowly expressed genes, we normalized the difference by dividing it by the gene expression level, which gives the relative estimation error.

We further compared the estimation uncertainty, which is defined as the length of the 95% confidence interval of the estimated gene expression [5]. Using the optimization solution as the mean and the inverse Fisher information matrix as the covariance matrix, the posterior distribution is approximated by a multivariate normal distribution. Using the approximated posterior distribution, we can sample a large number of points and estimate the 95% confidence interval. We take the length of the interval as the measure of the estimation uncertainty. As with the relative estimation error, we also investigated the relative estimation uncertainty, which equals the estimation uncertainty divided by the given gene expression level.

3.2. Results using complete annotation

Figures 2 and 3 summarize the relative estimation errors and uncertainties of the two strategies on the simulation data generated from the Marioni data, using complete annotation with different parameter (m, n) settings. Compared to UI, URD reduces the error of the gene expression estimation by 53-76%, and N-URD reduces it further, by 67-87%. As the number of isoforms in a gene increases, the relative estimation error becomes larger. For a fixed number of isoforms, the estimation error becomes smaller as the number of exons increases, because more exons supply more information. The improvement of URD and N-URD over UI becomes larger as the number of isoforms increases, because UI loses more information when faced with more isoforms.

Fig. 2. Relative estimation errors for complete annotation. Shown are the comparisons of the relative estimation errors of UI, URD and N-URD on the simulated data (based on the Marioni data) with complete annotation. Within each group of experiments, we fixed the number of isoforms as shown in the title and varied the number of exons from 6 to 13. For each setting, we repeated the experiments 1,000 times and calculated the mean of the estimation errors.

Fig. 3. Relative estimation uncertainties for complete annotation. Shown are the comparisons of the relative uncertainties of UI, URD and N-URD from the same experiments as shown in Fig. 2.

The improvement in estimation error achieved by the isoform-based strategy over the UI-based strategy is statistically significant, as shown by paired single-sided t-tests. The p-values are listed in Table 1. Besides demonstrating the advantage of the isoform-based strategy, the listed p-values also indicate that N-URD performs significantly better than URD.

Next, we consider the uncertainty of these methods. From Fig. 3, we can see that both URD and N-URD have a smaller uncertainty in their estimation than UI. The uncertainties of URD and N-URD are very close. The paired single-sided t-test p-values listed in Table 2 also indicate a significant improvement of URD and N-URD over UI in estimation uncertainty, but similar performance between URD and N-URD.
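A paired single-sided test of the kind reported in Tables 1 and 2 can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and the p-value uses a normal approximation to the t distribution (an assumption made here to stay within the Python standard library, reasonable only for large samples such as the 1,000 repetitions per setting), so it will not reproduce the exact p-values of the tables.

```python
import math
from statistics import mean, stdev


def paired_one_sided_test(first, second):
    """Test H1: mean(first - second) > 0 on paired samples.

    Returns (t, p). The statistic t is the usual paired t statistic; p is the
    upper-tail probability under a standard normal approximation.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [a - b for a, b in zip(first, second)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # P(Z > t) for a standard normal Z, written with erfc for tail accuracy.
    p = 0.5 * math.erfc(t / math.sqrt(2.0))
    return t, p


if __name__ == "__main__":
    # Hypothetical relative errors of two methods on the same six genes.
    errors_ui = [0.30, 0.25, 0.40, 0.35, 0.28, 0.33]
    errors_iso = [0.10, 0.12, 0.15, 0.11, 0.14, 0.13]
    print(paired_one_sided_test(errors_ui, errors_iso))
```

In a one-sided test of this form, a p-value close to 1 (as in the U(URD) > U(N-URD) rows of Table 2) arises when the mean paired difference points in the direction opposite to H1.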

Table 1. Statistical significance of the improvement in estimation error using complete annotation. Listed are paired single-sided t-test p-values. H1 denotes the alternative hypothesis in each hypothesis test, and the corresponding null hypothesis (H0) is the equality form of H1. E(METHOD) means the estimation error by the METHOD. The variables n and m denote the number of exons and isoforms, respectively.

H1                 m  n=6       n=7       n=8       n=9       n=10      n=11      n=12      n=13
E(UI) > E(URD)     2  6.7e-126  7.4e-125  1.0e-131  1.1e-139  2.7e-139  6.0e-154  7.0e-159  5.5e-172
                   3  4.2e-174  1.2e-174  4.8e-181  7.5e-200  4.0e-169  1.7e-200  5.8e-211  2.9e-214
                   4  7.3e-225  3.7e-211  1.8e-218  2.6e-200  6.0e-229  5.8e-220  8.7e-211  1.8e-225
                   5  6.7e-269  9.9e-236  2.3e-224  2.9e-248  4.2e-227  3.4e-239  4.3e-231  4.0e-246
E(UI) > E(N-URD)   2  8.0e-158  2.2e-148  5.8e-155  2.7e-166  4.1e-161  1.2e-171  1.3e-179  3.2e-193
                   3  4.4e-215  1.1e-213  7.4e-214  7.0e-231  4.8e-196  4.3e-233  3.7e-244  1.3e-249
                   4  2.6e-260  1.1e-250  6.5e-252  7.4e-240  4.1e-265  2.9e-256  1.4e-241  2.7e-263
                   5  1.4e-313  3.3e-279  1.0e-256  4.2e-292  7.9e-266  1.5e-279  1.3e-272  2.4e-283
E(URD) > E(N-URD)  2  5.0e-43   1.6e-36   4.8e-37   8.2e-40   8.5e-37   2.0e-32   3.1e-36   5.0e-40
                   3  5.7e-67   8.5e-60   1.3e-60   7.9e-65   4.6e-54   5.2e-58   3.5e-55   2.9e-67
                   4  6.3e-62   1.9e-71   1.5e-63   5.9e-74   8.4e-66   6.8e-70   9.3e-66   1.8e-79
                   5  4.4e-80   4.5e-75   4.9e-74   2.9e-79   1.9e-72   9.7e-72   8.0e-77   3.1e-78

Table 2. Statistical significance of the improvement in estimation uncertainty using complete annotation. Listed are paired single-sided t-test p-values. H1 denotes the alternative hypothesis in each hypothesis test, and the corresponding null hypothesis (H0) is the equality form of H1. U(METHOD) means the estimation uncertainty by the METHOD. The variables n and m denote the number of exons and isoforms, respectively.

H1                 m  n=6       n=7       n=8       n=9       n=10      n=11      n=12      n=13
U(UI) > U(URD)     2  4.2e-61   5.5e-64   1.0e-83   3.8e-91   1.0e-105  3.9e-129  6.9e-136  9.1e-123
                   3  6.9e-121  3.3e-138  9.5e-119  4.9e-165  7.7e-123  4.3e-159  4.6e-173  1.8e-190
                   4  1.7e-213  1.4e-193  1.1e-179  1.6e-210  1.4e-197  5.9e-203  2.7e-208  1.8e-233
                   5  3.8e-94   5.3e-69   7.5e-107  2.0e-66   3.1e-99   4.2e-126  4.7e-127  2.3e-122
U(UI) > U(N-URD)   2  1.1e-46   3.7e-50   1.7e-66   2.3e-77   2.9e-91   2.2e-113  7.6e-117  6.9e-113
                   3  3.9e-77   8.0e-88   6.0e-87   2.5e-129  3.2e-103  9.1e-127  6.0e-145  3.3e-160
                   4  6.9e-124  1.0e-124  2.8e-122  2.0e-147  8.8e-151  5.8e-151  5.9e-171  5.9e-194
                   5  5.0e-144  9.4e-133  2.3e-150  2.0e-163  1.3e-149  3.5e-178  4.3e-200  1.6e-204
U(URD) > U(N-URD)  2  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   3  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   4  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   5  1.0e+00   1.0e+00   1.0e+00   8.7e-01   1.0e+00   1.0e+00   1.0e+00   1.0e+00

3.3. Results using incomplete annotation

Using the incomplete annotation as described in Sec. 2, we performed a group of similar experiments. We found that URD reduces the estimation error of UI by 55-77%, and N-URD further reduces the error of UI, by 56-85%. Although the improvement is slightly smaller than that in the complete annotation simulation, URD and N-URD still reduce the error by a large degree. Meanwhile, they achieve smaller estimation uncertainty. Figures 4 and 5 summarize these results. As shown in Tables 3 and 4, paired single-sided t-tests were carried out to assess the statistical significance of the improvement. The listed p-values indicate that the isoform-based strategy significantly reduced the estimation uncertainty relative to the UI-based strategy, but that URD performs similarly to N-URD.

Overall, compared with the UI-based method, the two isoform-based methods achieve significant improvement in estimating gene expression levels using both complete and incomplete annotations. Taking into account the non-uniformity of read distribution in RNA-seq data, the isoform-based strategy further reduces the estimation errors. As a result, we recommend the isoform-based method for gene expression estimation in RNA-seq.

Fig. 4. Relative estimation errors for incomplete annotation. Shown are the comparisons of the relative estimation errors of UI, URD and N-URD on the simulated data (based on the Marioni data) using incomplete annotation. We used similar settings to the experiments shown in Fig. 2.

Fig. 5. Relative estimation uncertainties for incomplete annotation. Shown are the comparisons of the relative uncertainties of UI, URD and N-URD from the same experiments as shown in Fig. 4.

Table 3. Statistical significance of the improvement in estimation error using incomplete annotation. This table is similar to Table 1.

H1                 m  n=6       n=7       n=8       n=9       n=10      n=11      n=12      n=13
E(UI) > E(URD)     2  1.0e-123  1.2e-106  6.7e-128  6.7e-119  5.5e-119  2.5e-122  5.0e-139  7.8e-136
                   3  5.6e-203  1.4e-191  2.1e-196  7.1e-197  9.7e-179  7.5e-197  4.9e-176  2.5e-198
                   4  4.3e-261  2.9e-225  1.2e-227  5.7e-215  7.7e-206  1.1e-230  1.2e-212  4.5e-213
                   5  2.7e-268  1.4e-260  4.4e-249  1.0e-253  3.9e-228  1.7e-237  8.2e-233  1.4e-265
E(UI) > E(N-URD)   2  8.3e-137  1.8e-122  1.5e-143  8.8e-133  5.1e-132  8.6e-131  2.1e-153  4.9e-147
                   3  1.8e-229  2.1e-218  9.5e-224  1.6e-214  7.5e-198  2.4e-220  1.2e-192  4.8e-219
                   4  4.4e-294  1.2e-252  1.6e-254  1.0e-239  5.1e-236  2.8e-257  7.3e-239  1.4e-237
                   5  6.0e-298  1.0e-288  1.8e-279  4.1e-290  7.5e-263  2.0e-265  5.0e-260  2.0e-296
E(URD) > E(N-URD)  2  9.2e-24   4.6e-29   1.0e-24   2.1e-28   3.2e-23   2.9e-16   3.0e-19   1.3e-18
                   3  3.5e-42   9.7e-41   8.0e-38   1.5e-35   3.4e-37   5.1e-44   5.9e-37   3.3e-36
                   4  2.7e-48   1.3e-48   7.9e-53   4.6e-52   5.8e-58   1.5e-57   1.6e-51   5.0e-50
                   5  1.3e-54   1.3e-63   1.3e-63   8.2e-67   3.8e-66   7.8e-52   1.1e-55   4.0e-56

Table 4. Statistical significance of the improvement in estimation uncertainty using incomplete annotation. This table is similar to Table 2.

H1                 m  n=6       n=7       n=8       n=9       n=10      n=11      n=12      n=13
U(UI) > U(URD)     2  4.0e-144  5.4e-145  2.6e-166  1.2e-180  1.5e-199  9.8e-193  7.4e-218  3.1e-224
                   3  1.3e-203  1.4e-200  4.4e-205  2.8e-208  5.0e-197  6.7e-235  4.0e-214  3.6e-241
                   4  5.6e-259  3.1e-239  5.1e-246  1.5e-235  7.4e-235  7.9e-251  6.7e-254  1.3e-259
                   5  2.8e-118  3.2e-102  2.2e-130  6.5e-127  1.9e-140  2.5e-140  1.1e-145  6.8e-171
U(UI) > U(N-URD)   2  3.7e-134  4.8e-135  2.8e-156  1.5e-171  2.9e-192  1.7e-185  4.9e-208  1.1e-216
                   3  6.6e-174  4.4e-173  5.4e-175  4.1e-183  1.1e-177  6.2e-212  5.1e-195  1.4e-223
                   4  1.4e-199  6.8e-191  2.2e-200  3.2e-190  2.3e-199  2.7e-216  2.5e-225  7.9e-230
                   5  8.9e-210  3.3e-208  1.4e-211  1.8e-223  9.3e-218  2.1e-222  3.2e-235  4.0e-233
U(URD) > U(N-URD)  2  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   3  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   4  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00
                   5  1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00   1.0e+00

3.4. Applications to real RNA-seq data

For the studies on the real transcriptome, we investigated the consistency of these methods on technical replicates. The estimation consistency on technical replicate data mainly reflects the estimation uncertainty that we investigated on the simulation data. Since there are no true answers available for the real data, the estimation error was not assessed here.

We applied UI, URD and N-URD to the technical replicate datasets of transcriptome RNA-seq data from Marioni et al. [9]. Because the technical replicates are from the same biological sample, the estimated gene expression levels should be highly consistent. For each tissue, we estimated the gene expression levels for two technical replicates and normalized the mean of the expression levels for each replicate to be 1. Then we calculated the differences between the replicates and examined them for each method. As we normalized the mean of the expression levels in each replicate to be 1, the mean of the differences is necessarily zero.
So Table 5 only lists the variances of the differences for the two tissues using the different methods. We can see for each tissue that UI always gives the largest variance of differences, while URD and N-URD give smaller variances. These results indicate that the isoform-based strategy tends to be more stable than the UI-based strategy. This also supports the conclusions from the previous simulation studies.

Table 5. The comparison results in real RNA-seq data. The table summarizes the comparisons of gene expression level estimation consistency by different methods in RNA-seq technical replicate data. Each cell represents the variance of the relative estimation differences in the replicate data.

Method   Kidney   Liver
UI       0.0168   0.0206
URD      0.0151   0.0181
N-URD    0.0151   0.0181
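The consistency measure behind Table 5 can be sketched as follows, following the description in the text: normalize each replicate to mean 1, take per-gene differences, and report their variance. The function name and the example values are hypothetical, and the use of the population variance is an assumption, since the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def replicate_difference_variance(rep1, rep2):
    """Return (mean, variance) of per-gene differences between two replicates.

    Each replicate is first rescaled so that its mean expression level is 1,
    which forces the mean of the differences to zero up to rounding.
    """
    if len(rep1) != len(rep2) or not rep1:
        raise ValueError("replicates must be non-empty and of equal length")
    m1, m2 = mean(rep1), mean(rep2)
    if m1 == 0 or m2 == 0:
        raise ValueError("mean expression level must be non-zero")
    diffs = [a / m1 - b / m2 for a, b in zip(rep1, rep2)]
    return mean(diffs), pvariance(diffs)


if __name__ == "__main__":
    # Hypothetical gene-level estimates for the same four genes in two replicates.
    replicate_a = [10.0, 20.0, 30.0, 40.0]
    replicate_b = [11.0, 19.0, 33.0, 37.0]
    print(replicate_difference_variance(replicate_a, replicate_b))
```

A smaller variance means that the two replicates agree more closely after normalization, which is how the table is read in the text.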

4. Discussion

The gene expression level, or the abundance of transcripts that originate from a functional genomic region, is the key concept in studying transcription regulation. RT-PCR, microarrays, SAGE [20], and the revolutionary RNA-seq technology have all made their own contributions to profiling gene expression levels. Estimating gene expression levels from the data generated by these experimental technologies in an unbiased and precise manner remains an important and challenging issue in bioinformatics studies. Although RNA-seq data provide digital information and directly give the global map of transcribed fragments, correctly profiling gene expression cannot be done in a straightforward manner. Recently researchers have begun to realize that the most commonly used projective normalization method underestimates, to various degrees, the gene expression levels of most multi-isoform genes. On this basis, this article tries to answer the question of which of the two existing unbiased strategies, the UI-based and the isoform-based, is preferred.

In this report, we conducted a comprehensive simulation study comparing the performances of the two strategies for gene expression level estimation. The series of simulation experiments indicated a significant advantage of the isoform-based methods (with URD and N-URD embedded) in both estimation accuracy and stability, which is further demonstrated by the comparison of gene expression estimation in the real RNA-seq technical replicate data. Intuitively, researchers may tend to think that the UI-based method, rather than the isoform-based methods, would have the more robust performance because of the large uncertainties in isoform abundance inference. Note, however, that a correlation exists among the individual isoforms of a gene, so summing up the isoform levels to estimate the gene expression level may reduce the uncertainty. This is consistent with the observations in a previous study [5]. On the other hand, for the UI-based method, only the reads falling in the constitutive exons can be taken into account to estimate the abundance, leaving a large part of the information unused. Besides, although only in a small number of cases, the UI-based method does not work where no constitutive exon exists. Thus, we may conclude that even if one is not interested in the isoform abundance, one should first infer the isoform abundance and sum them up to estimate the gene expression levels. This conclusion should help the community reach a consensus on this problem.

There is still room to improve the accuracy of gene expression estimation. Although attention has been paid recently to non-uniform read distribution in the inference of isoform abundance [15,16,21], more effort is still needed to incorporate this kind of information efficiently. In addition, because isoform annotation is incomplete, ab initio reconstruction of isoforms from RNA-seq data may provide more precise and stable estimation. Progress on this problem will strongly benefit downstream analysis. For example, identification of differentially expressed genes from samples requires not only correct estimates of gene expression levels but also small estimation uncertainties.
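The remark in this discussion that correlation among the isoform estimates can reduce the uncertainty of their sum can be made concrete with the standard variance identity for a sum of random variables. The sketch below writes $\Sigma$ for the covariance matrix of the isoform estimates (for example the inverse Fisher information used in Sec. 3.1) and $\mathbf{1}$ for the all-ones vector; both symbols are introduced here for illustration.

```latex
\mathrm{Var}(\hat{\theta}_g)
  = \mathrm{Var}\Big( \sum_{j=1}^{m} \hat{\theta}_j \Big)
  = \sum_{j=1}^{m} \mathrm{Var}(\hat{\theta}_j)
    + 2 \sum_{1 \le j < k \le m} \mathrm{Cov}(\hat{\theta}_j, \hat{\theta}_k)
  = \mathbf{1}^{\top} \Sigma \, \mathbf{1}
```

When the covariances are negative, which one would expect when isoforms share exons and so compete to explain the same reads (an assumption stated here, not a result reported in the text), the gene-level variance is smaller than the sum of the isoform-level variances.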

Acknowledgments

The authors would like to thank Dr. Lior Pachter for his helpful discussion, and Dr. Greg Vatcher for his great assistance on the language of the manuscript. This work is supported in part by the NSFC grants (30625012, 60721003 and 60702002).

References

1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods 5:621–628, 2008.
2. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M, The transcriptional landscape of the yeast genome defined by RNA sequencing, Science 320:1344–1349, 2008.
3. Li J, Jiang H, Wong WH, Modeling non-uniformity in short-read rates in RNA-Seq data, Genome Biol 11:R50, 2010.
4. Hansen KD, Brenner SE, Dudoit S, Biases in Illumina transcriptome sequencing caused by random hexamer priming, Nucleic Acids Res 38:e131, 2010.
5. Jiang H, Wong WH, Statistical inferences for isoform expression in RNA-Seq, Bioinformatics 25:1026–1032, 2009.
6. Hiller D, Jiang H, Xu W, Wong WH, Identifiability of isoform deconvolution from junction arrays and RNA-Seq, Bioinformatics 25:3056–3059, 2009.
7. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G, Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium, Nat Genet 25:25–29, 2000.
8. Kanehisa M, Goto S, KEGG: Kyoto encyclopedia of genes and genomes, Nucleic Acids Res 28:27–30, 2000.
9. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y, RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Res 18:1509–1517, 2008.
10. Bullard JH, Purdom E, Hansen KD, Dudoit S, Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments, BMC Bioinformatics 11:94, 2010.
11. Li C, Wong WH, Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, Proc Natl Acad Sci USA 98:31–36, 2001.
12. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4:249–264, 2003.
13. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB, Alternative isoform regulation in human tissue transcriptomes, Nature 456:470–476, 2008.
14. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nat Biotechnol 28:511–515, 2010.
15. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN, RNA-Seq gene expression estimation with read mapping uncertainty, Bioinformatics 26:493–500, 2010.
16. Wu Z, Wang X, Zhang X, Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq, 2010. To be published.

17. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods 5:621–628, 2008.
18. Jiang H, Wong WH, SeqMap: Mapping massive amount of oligonucleotides to the genome, Bioinformatics 24:2395–2396, 2008.
19. Ouyang Z, Zhou Q, Wong WH, ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells, Proc Natl Acad Sci USA 106:21521–21526, 2009.
20. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW, Serial analysis of gene-expression, Science 270:484–487, 1995.
21. Howard BE, Heber S, Towards reliable isoform quantification using RNA-SEQ data, BMC Bioinformatics 11(Suppl 3):S6, 2010.

Xi Wang received his B.E. degree in Automation in 2005 from Harbin Institute of Technology, Harbin, China. He is now a Ph.D. candidate in Bioinformatics at MOE Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology, and also Department of Automation, Tsinghua University. His research interests include machine learning, data mining for bioinformatics, DNA sequence analysis and ChIP-seq/RNA-seq data analyses.

Zhengpeng Wu received his B.Sc. degree in Automatic Control in 2004 from Tsinghua University, Beijing, China. He is now a Ph.D. candidate at the Department of Automation, Tsinghua University, and also at MOE Key Laboratory of Bioinformatics and Bioinformatics Division, Tsinghua National Laboratory of Information Science and Technology. His research interests include pattern recognition, statistical learning theory, statistics and computational genomics.

Xuegong Zhang received his B.Sc. degree in 1989 and Ph.D. degree in 1994, both from Tsinghua University, Beijing. He is now Professor of Pattern Recognition and Bioinformatics at Tsinghua University, and Director of the Bioinformatics Division, Tsinghua National Laboratory for Information Science and Technology.