Heterogeneity of Variance in Gene Expression Microarray Data


David M. Rocke
Department of Applied Science and Division of Biostatistics
University of California, Davis
March 15, 2003

Abstract

Motivation: One important problem in the analysis of gene expression microarray data is that the variation in expression under constant conditions is not stable from gene to gene. Recently, variance-stabilizing transformations have been developed that can remove the systematic dependence of the variance on the mean, but it appears that there is still considerable variance heterogeneity that can interfere with global analysis of expression data.

Results: We develop a method consisting of a variance-stabilizing data transformation followed by empirical Bayes estimation of gene-specific variances. The method is more powerful than using data from each gene alone, but does not suffer from the bias caused by the use of global error models.

Availability: R code will be available from the author by e-mail or on the website.

Contact: dmrocke@ucdavis.edu

1. Introduction

Consider a set of microarray experiments of n arrays each with p genes. For each gene considered separately we entertain a statistical model which is linear in a set of factors or variables that are attached to the arrays, so that the statistical model is common to all genes. Assume that the expression data have been transformed so that the variances neither increase nor decrease systematically with the mean expression of the gene.

Given a statistical hypothesis framed within the linear model for each gene, there is almost always an exact or approximate F-test, in which the numerator can be calculated from the cell means of the data for a particular gene, and the denominator (if the test is conducted in isolation for each gene) is a function of the deviations of the data from the cell mean (Kerr 2003).

An alternative approach with variance-stabilized data is to obtain the numerator of the test from the particular gene, but obtain the denominator from a global error model (in this case, constant variance). This increases the power of the tests considerably because the variance estimates will be based on thousands of points, not just a few. However, it introduces possible biases if the variances are not truly homogeneous (Kerr 2003; Kerr, Martin, and Churchill 2000). A compromise between power and bias may be obtained by using variance estimates for the denominators of the F-test that are themselves a compromise between the gene-specific variance and the global variance.

2. A Motivating Example

We consider an experiment in which cell lines in four conditions are to be compared. There are two observations for each of the four conditions, each consisting of an Affymetrix U95A GeneChip for one sample. For the sake of illustration, we will consider the MAS 4.0 average difference summary, one main advantage of which is that it does not artificially compress the low-level data.
One goal of the analysis is to determine which genes are differentially expressed among the four conditions. If we consider only one gene, a standard approach would be to perform a one-way analysis of variance (ANOVA). However, that analysis assumes that the variance is the same at the different levels. In the case of microarray data, there is a strong dependence of the variance on the mean, as is shown for these data by Figures 1 and 2, which plot the difference of the replicates in each gene-by-group combination against their sum.

This type of variability can be removed by the generalized log (glog) transform introduced independently by Durbin et al. (2002), Hawkins (2002), Huber et al. (2002), and Munson (2001), and further developed in Durbin and Rocke (2003a; 2003b), Geller et al. (2003) and Rocke and Durbin (2003a; 2003b). Figure 3 shows the same sum/difference data after transforming by the glog with a parameter of $\lambda = 1225$ estimated by maximum likelihood (Durbin and Rocke 2003a).
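The text does not print the glog formula itself. As a minimal sketch, the block below assumes the form ln(z + sqrt(z^2 + λ)), which is one commonly used parametrization in the glog literature cited above; the function name `glog` and the sample intensities are illustrative only, and Python is used here rather than the author's R code.

```python
import math


def glog(z, lam):
    """Generalized log transform, assuming the form ln(z + sqrt(z^2 + lam)).

    This parametrization is an assumption; the paper does not state the formula.
    """
    return math.log(z + math.sqrt(z * z + lam))


LAMBDA = 1225.0  # transformation parameter quoted in the text for this data set

# Illustrative intensities only (not data from the paper).
sample_values = [-20.0, 0.0, 35.0, 1000.0, 50000.0]

for z in sample_values:
    line = f"z = {z:9.1f}   glog(z) = {glog(z, LAMBDA):.4f}"
    if z > 0:
        # For large z the transform behaves like ln(2z), i.e. an ordinary log.
        line += f"   ln(2z) = {math.log(2.0 * z):.4f}"
    print(line)
```

Under this assumed form the transform is defined for zero and negative background-corrected values, is roughly linear near zero, and approaches an ordinary logarithm at high intensities, which is the behavior needed to flatten the mean-variance relationship at both ends of the intensity range.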

Table 1: Number of genes out of 12,625 significant at the 5% level for three methods of estimating the MSE in a microarray experiment. Column 2 is the raw p-values with test-wise error rate (TWER) 5%. Column 3 gives the family-wise error rate (FWER) using the Bonferroni inequality, and column 4 is the set of genes nominated as significant by the false-discovery-rate (FDR) method of Benjamini and Hochberg (1995; see also Reiner et al. 2003).

MSE Source       TWER        FWER        FDR
Gene-Specific    [missing]   1           18
Global           [missing]   [missing]   [missing]
Posterior        [missing]   [missing]   [missing]

At this point, one could reasonably perform an ANOVA for gene i using the model

$$z_{ijk} = \beta_j + \epsilon_{jk}, \tag{2.1}$$

where the $z_{ijk}$ are additively-normalized, glog-transformed expression values. In this way, we obtain 12,625 F-tests of the null hypothesis of equal expression for all groups, in which we compare the mean square for groups from gene i ($\mathrm{MSG}_i$) to the mean square for error from gene i ($\mathrm{MSE}_i$) by referring the ratio $\mathrm{MSG}_i/\mathrm{MSE}_i$ to an F distribution with 3 and 4 degrees of freedom. This procedure should be valid, and after an adjustment for multiplicity, the results could be used directly. Figure 4 gives a histogram of the 12,625 p-values, showing that certainly some of them represent real effects. The first line of figures in Table 1 shows that, at the 5% level, 1 gene is significant using the Bonferroni method, and 18 are significant using the FDR method of Benjamini and Hochberg (1995; see also Reiner et al. 2003).

A possible objection to this procedure is that we are losing power by not employing information from other genes. If we employ the perspective of Kerr, Martin, and Churchill (2000), we could estimate the model

$$z_{ijk} = \mu_i + n_k + \beta_{ij} + \epsilon_{ijk}, \tag{2.2}$$

where the $z_{ijk}$ are glog-transformed (unnormalized) expression values, the normalization is part of the ANOVA (the $n_k$ terms), and the group effects are in the gene-by-group interaction terms $\beta_{ij}$ (Kerr 2003).
This analysis gives another mean square for error that we could use as a denominator, in which case the F-statistics for each gene separately would have 3 and 50,493 df. Figure 5 shows the histogram of the p-values using this method. The excess of very small F-statistics (that is, of p-values near 1) is a sign that the model is incorrect. In this case, the assumption that all genes have the same MSE is almost certainly false. Using an average MSE when a smaller or larger one would be more appropriate leads to an excess of p-values at both ends. In the second line of figures in Table 1, the number of genes nominated as significant is much greater for each of the three methods than when the gene-specific MSE is used. It is likely that some of these are mistakes, arising when a large true gene-specific MSE is coupled with an average MSE in the denominator instead of an unbiased gene-specific MSE estimate.

The average value over all 12,625 genes of the MSE is [value missing], which is also the residual MSE from the global model. If the 4 df estimates from each gene had the distribution predicted from normality and constant true variance, the variance of these MSE estimates across genes would be $2\sigma^4/\nu = (0.1017)^2/2 \approx 0.0052$. Instead, it is [value missing], nearly 10 times the size it should be. Of the two simple explanations for this, nonnormality and heterogeneity of variance, the latter is the simpler possibility. We now proceed to account for this situation using a standard empirical Bayes estimate for the individual gene MSE.

3. The Modeling Setup

Given p genes indexed by i, suppose that the true variance of the effect of interest for gene i is $\sigma_i^2$. For each i we obtain a $\nu$ degree-of-freedom estimate $s_i^2$ of $\sigma_i^2$. We will work in the Gaussian framework for convenience, in which case we may assume that $s_i^2$ has a gamma distribution with parameters $\tau$ (the mean) and $a = \nu/2$ (the shape parameter). Again for simplicity, we treat the case where $\nu$ is constant across genes. Though the case where $\nu$ varies is not conceptually more difficult, the computations are more complex.

We model these individual values $\sigma_i^2 = \tau_i$ as random with an inverse gamma distribution with parameters $\alpha$ and $\eta = \alpha\beta$. Note that $\eta$ is the mean of the inverse of $\tau$ (the reciprocal variance $1/\tau$ is sometimes called the precision). With this as a prior distribution, and an observed value $x$ of $s_i^2$, the posterior distribution for $\tau$ is proportional to

$$\frac{e^{-1/(\tau\tilde{\beta})}}{\tau^{\nu/2+\alpha+1}} \tag{3.1}$$

where

$$\tilde{\beta} = \frac{2}{x\nu + 2\alpha/\eta}. \tag{3.2}$$

Thus, the posterior distribution is inverse gamma, like the prior, with parameters (marked here with a tilde)

$$\tilde{\alpha} = \nu/2 + \alpha \tag{3.3}$$

$$\tilde{\beta} = \frac{2}{x\nu + 2\alpha/\eta} \tag{3.4}$$

$$\tilde{\eta} = \tilde{\alpha}\tilde{\beta} = \frac{\nu + 2\alpha}{x\nu + 2\alpha/\eta}. \tag{3.5}$$

Also,

$$\frac{1}{\tilde{\eta}} = \frac{x\nu + 2\alpha/\eta}{\nu + 2\alpha} \tag{3.6}$$

$$\phantom{\frac{1}{\tilde{\eta}}} = x\left(\frac{\nu}{\nu + 2\alpha}\right) + \frac{1}{\eta}\left(\frac{2\alpha}{\nu + 2\alpha}\right). \tag{3.7}$$
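As a numerical check on equations (3.3) to (3.7), the following minimal sketch computes the posterior parameters and confirms that the reciprocal of the posterior η equals the degrees-of-freedom-weighted average of the observed variance and the prior center 1/η. The function names, the `_post` suffixes, and the input values are illustrative assumptions, not taken from the paper.

```python
def posterior_params(x, nu, alpha, eta):
    """Posterior inverse gamma parameters from equations (3.3)-(3.5).

    x     : observed variance estimate s_i^2
    nu    : degrees of freedom of x
    alpha : prior shape parameter
    eta   : prior mean precision (eta = alpha * beta)
    """
    alpha_post = nu / 2.0 + alpha
    beta_post = 2.0 / (x * nu + 2.0 * alpha / eta)
    eta_post = alpha_post * beta_post
    return alpha_post, beta_post, eta_post


def shrunken_variance(x, nu, alpha, eta):
    """Weighted average on the right-hand side of equation (3.7)."""
    total_df = nu + 2.0 * alpha
    return x * (nu / total_df) + (1.0 / eta) * (2.0 * alpha / total_df)


# Hypothetical inputs chosen only to exercise the formulas.
x, nu, alpha, eta = 0.25, 4, 2.0, 10.0

alpha_post, beta_post, eta_post = posterior_params(x, nu, alpha, eta)
weighted = shrunken_variance(x, nu, alpha, eta)

print("posterior alpha:", alpha_post)
print("posterior beta: ", beta_post)
print("1 / posterior eta:", 1.0 / eta_post)
print("weighted average: ", weighted)
```

With these inputs the prior carries 2α = 4 equivalent degrees of freedom, the same as the data, so the shrunken variance falls halfway between the observed 0.25 and the prior center 0.1.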

Now $x$ here is an observed value of $s_i^2$, and $1/\eta$ is the reciprocal of the mean prior precision, which is thus an estimate of the center of the prior distribution for $\tau_i = \sigma_i^2$. Also, $\nu$ is the degrees of freedom of $s_i^2$ and $2\alpha$ is the equivalent degrees of freedom of the prior. Thus, the posterior estimate of the variance used here will be a weighted average of the individual variance and the prior mean reciprocal precision, each weighted by its degrees of freedom.

This method of estimation of a variance using an inverse gamma conjugate prior is completely standard (Carlin and Louis 2000; Gelman et al. 1995), and has been used previously in a microarray context by Baldi and Long (2001). The first two references give more detail on the derivation of the posterior in this case.

4. Empirical Estimation of the Prior

To complete the empirical Bayes estimation procedure, we need to specify how we estimate the parameters of the prior from the ensemble of variances. If each observed variance $s_i^2$ has a gamma distribution $F_i$ with parameters $\tau$ and $a = \nu/2$, and if the prior distribution $G$ of $\tau$ is inverse gamma with parameters $\alpha$ and $\beta$, then

$$E(s_i^2) = \frac{1}{\beta(\alpha-1)}, \qquad V(s_i^2) = \frac{2(\alpha-1)/\nu + 1}{\beta^2(\alpha-1)^2(\alpha-2)}. \tag{4.1}$$

If an ensemble of variances has mean $M$ and variance $V$, then a method of moments estimate of $\alpha$ and $\beta$ is given by solving

$$M = \frac{1}{\beta(\alpha-1)}, \qquad V = \frac{2(\alpha-1)/\nu + 1}{\beta^2(\alpha-1)^2(\alpha-2)} \tag{4.2}$$

for $\alpha$ and $\beta$. This leads to

$$\hat{\alpha} = \frac{M^2(1 - 2/\nu) + 2V}{V - 2M^2/\nu}, \qquad \hat{\beta} = \frac{1}{M(\hat{\alpha} - 1)} \tag{4.3}$$

as method-of-moments estimates. If the variances were homogeneous, then we would have $V \approx 2M^2/\nu$, so the denominator of $\hat{\alpha}$ would be near zero. If either the denominator or the numerator of $\hat{\alpha}$ is negative, that is presumably a sign that there is not an important amount of heterogeneity in the variances. However, usually both will be bounded well away from zero.
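The method-of-moments step can be sketched as a round trip: compute the ensemble mean and variance implied by a known prior using (4.1), then recover the prior parameters with (4.3). This is a minimal illustration with hypothetical function names and parameter values; returning `None` when the numerator or denominator is not positive is one possible way to flag the no-heterogeneity case and is an assumption of this sketch.

```python
def ensemble_moments(alpha, beta, nu):
    """Mean and variance of the observed variances s_i^2, equation (4.1).

    Requires alpha > 2 so that the variance is finite.
    """
    mean = 1.0 / (beta * (alpha - 1.0))
    var = (2.0 * (alpha - 1.0) / nu + 1.0) / (
        beta ** 2 * (alpha - 1.0) ** 2 * (alpha - 2.0)
    )
    return mean, var


def moment_estimates(M, V, nu):
    """Method-of-moments estimates (alpha_hat, beta_hat), equation (4.3).

    Returns None when the numerator or denominator is not positive, which the
    text reads as a sign of little variance heterogeneity.
    """
    numerator = M * M * (1.0 - 2.0 / nu) + 2.0 * V
    denominator = V - 2.0 * M * M / nu
    if numerator <= 0.0 or denominator <= 0.0:
        return None
    alpha_hat = numerator / denominator
    beta_hat = 1.0 / (M * (alpha_hat - 1.0))
    return alpha_hat, beta_hat


# Hypothetical prior, chosen only for the round-trip check.
true_alpha, true_beta, nu = 3.0, 2.0, 4

M, V = ensemble_moments(true_alpha, true_beta, nu)
recovered = moment_estimates(M, V, nu)
print("M =", M, " V =", V)
print("recovered (alpha, beta):", recovered)

# Homogeneous variances give V = 2 M^2 / nu, so the denominator vanishes.
print("homogeneous case:", moment_estimates(M, 2.0 * M * M / nu, nu))
```

In practice M and V would be the sample mean and sample variance of the gene-specific MSE values rather than values computed from a known prior.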

5. The Example Continued

For the example data set, the mean of the 12,625 values of the residual MSE is [value missing] and the variance of the same collection is [value missing]. Using (4.3), we obtain $\hat{\alpha} =$ [value missing], $\hat{\beta} =$ [value missing], $\hat{\eta} =$ [value missing], and $1/\hat{\eta} =$ [value missing]. The degrees of freedom of the prior is $2\hat{\alpha} = 4.615$, so for each gene i, we obtain an 8.6 df MSE estimate by taking a weighted average of the 4 df MSE from the ANOVA of that gene (with weight 4/8.6) and the prior best estimate (with weight 4.6/8.6). Figure 6 shows the histogram of the p-values obtained by this method, which shows no sign of distortion at the high p-value end.

Comparing the three methods shown in Table 1, we see that the global MSE estimate rejects the most genes, but Figure 5 shows that these rejections cannot be trusted. The posterior best-estimate MSE identifies a much larger number of genes as differentially expressed than the 4 df gene-specific MSEs do, without apparent signs of problems with maintaining the size of the tests.

6. Concluding Remarks

Bayesian and empirical Bayesian methods are frequently proposed for the analysis of microarray data (for example, Baldi and Long 2001; Broët et al. 2002; Efron et al. 2002; Ibrahim et al. 2002; Newton et al. 2001, 2003; Theilhaber et al. 2001). What is proposed here is a sort of minimal empirical Bayesian approach. We do not need to put a prior distribution on the mean expression across genes or on the probability of positive expression, since this is handled by the multiplicity-adjusted F-tests. Our approach resembles most closely the treatment in Baldi and Long (2001). However, their use of the log transform resulted in substantial dependence of the variance on the mean, whereas by use of the glog transform, we have removed at least most of this dependence. This makes the Bayesian model fit the data better than in their case.

We have written code in the R language (Ihaka and Gentleman 1996) that implements many of the required calculations in standard situations.
They will be available from the author by e-mail or on the website.

Acknowledgements

The research reported in this paper was supported by grants from the National Science Foundation (ACI and DMS) and the National Institute of Environmental Health Sciences, National Institutes of Health (P43 ES04699).
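The weighted-average calculation described in Section 5 can be sketched for a single gene as follows. This is a hedged illustration in Python, not the author's R code: the function names, the expression values, and the prior center `PRIOR_CENTER` are hypothetical, since the fitted value of the prior best estimate is not available in the text. Only the prior degrees of freedom (4.615) and the 4 groups by 2 replicates layout come from the paper.

```python
def one_way_mean_squares(groups):
    """One-way ANOVA mean squares for one gene.

    groups: list of lists, one list of replicate values per condition.
    Returns (MSG, MSE, df_groups, df_error).
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_groups = 0.0
    ss_error = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_groups += len(g) * (mean_g - grand_mean) ** 2
        ss_error += sum((value - mean_g) ** 2 for value in g)
    df_groups = len(groups) - 1
    df_error = n_total - len(groups)
    return ss_groups / df_groups, ss_error / df_error, df_groups, df_error


def posterior_mse(mse_gene, df_error, prior_center, prior_df):
    """Degrees-of-freedom-weighted average of gene MSE and prior center."""
    total_df = df_error + prior_df
    shrunk = (df_error * mse_gene + prior_df * prior_center) / total_df
    return shrunk, total_df


PRIOR_DF = 4.615      # 2 * alpha-hat, as quoted in Section 5
PRIOR_CENTER = 0.08   # hypothetical stand-in for the prior best estimate

# Hypothetical glog-scale values: 4 conditions, 2 replicates each.
gene = [[7.1, 7.3], [7.9, 8.1], [7.2, 7.0], [8.8, 9.0]]

msg, mse, dfg, dfe = one_way_mean_squares(gene)
shrunk, total_df = posterior_mse(mse, dfe, PRIOR_CENTER, PRIOR_DF)

print(f"gene-specific: F = {msg / mse:.2f} on {dfg} and {dfe} df")
print(f"posterior:     F = {msg / shrunk:.2f} on {dfg} and {total_df:.1f} df")
```

Turning either ratio into a p-value requires the F distribution function with the stated degrees of freedom, which is omitted here to keep the sketch free of external libraries. In this made-up gene the replicate spread is unusually small, so shrinking toward the prior center enlarges the denominator and gives a more conservative statistic.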

Appendix: The Gamma and Inverse Gamma Distributions

The gamma distribution with parameters $\alpha$ and $\beta$ has density

$$f_X(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}. \tag{A.1}$$

The first two moments are given by

$$E(X) = \alpha\beta = \tau \tag{A.2}$$

$$V(X) = \alpha\beta^2 = \tau^2/\alpha. \tag{A.3}$$

The inverse gamma distribution with parameters $\alpha$ and $\beta$ is the distribution of $Y = 1/X$, where $X$ is gamma distributed with parameters $\alpha$ and $\beta$. The density of $Y$ is

$$f_Y(y) = \frac{e^{-1/(y\beta)}}{\Gamma(\alpha)\beta^{\alpha} y^{\alpha+1}}. \tag{A.4}$$

The first two moments are given by

$$E(Y) = \frac{1}{\beta(\alpha-1)} \tag{A.5}$$

$$V(Y) = \frac{1}{\beta^2(\alpha-1)^2(\alpha-2)}. \tag{A.6}$$

We will re-parametrize in terms of $\alpha$ and $\eta = \alpha\beta$, which is the mean of the reciprocal of the inverse gamma variate. We then have that the density is

$$f_Y(y) = \frac{e^{-\alpha/(y\eta)}}{\Gamma(\alpha)(\eta/\alpha)^{\alpha} y^{\alpha+1}}. \tag{A.7}$$

The first two moments are given in this parametrization by

$$E(Y) = \frac{\alpha}{\eta(\alpha-1)} \tag{A.8}$$

$$V(Y) = \frac{\alpha^2}{\eta^2(\alpha-1)^2(\alpha-2)}. \tag{A.9}$$

References

Baldi, P. and Long, A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics, 17,

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate, Journal of the Royal Statistical Society, Series B, 57,

Broët, P., Richardson, S., and Radvanyi, F. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments, Journal of Computational Biology, 9,

Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for Data Analysis, Second Edition, New York: Chapman and Hall.

Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics, 18, S105-S110.

Durbin, B. and Rocke, D.M. (2003a) Estimation of transformation parameters for microarray data, Bioinformatics, in press.

Durbin, B. and Rocke, D.M. (2003b) Exact and approximate variance-stabilizing transformations for two-color microarrays, submitted for publication.

Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2002) Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association, 96,

Geller, S.C., Gregg, J.P., Hagerman, P.J., and Rocke, D.M. (2003) Transformation and normalization of oligonucleotide microarray data, submitted for publication.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, New York: Chapman and Hall.

Hawkins, D.M. (2002) Diagnostics for conformity of paired quantitative measurements, Statistics in Medicine, 21,

Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., and Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: A SAFER approach, GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics, 18, S96-S104.

Ibrahim, J.G., Chen, M.-H., and Gray, R.J. (2002) Bayesian models for gene expression with microarray data, Journal of the American Statistical Association, 97,

Ihaka, R. and Gentleman, R. (1996) R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics, 5,

Kerr, M.K. (2003) Linear models for microarray data analysis: Hidden similarity and differences, University of Washington Biostatistics Working Paper 190.

Kerr, M.K., Martin, M., and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data, Journal of Computational Biology, 7,

Munson, P. (2001) A consistency test for determining the significance of gene expression changes on replicate samples and two convenient variance-stabilizing transformations, GeneLogic Workshop on Low Level Analysis of Affymetrix GeneChip Data.

Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui, K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data, Journal of Computational Biology, 8,

Newton, M.A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2003) Detecting differential gene expression with a semiparametric hierarchical mixture model, manuscript.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures, Bioinformatics, 19,

Rocke, D. and Durbin, B. (2001) A model for measurement error for gene expression arrays, Journal of Computational Biology, 8,

Rocke, D. and Durbin, B. (2003) Approximate variance-stabilizing transformations for gene-expression microarray data, Bioinformatics, in press.

Theilhaber, J., Bushnell, S., Jackson, A., and Fuchs, R. (2001) Bayesian estimation of fold changes in the analysis of gene expression: The PFOLD algorithm, Journal of Computational Biology, 8,

List of Figures

1. Absolute difference in replicates versus the sum for the 12,625 × 4 gene-by-group combinations.
2. Absolute difference in replicates versus the rank of the sum for the 12,625 × 4 gene-by-group combinations.
3. Absolute difference in replicates versus the rank of the sum for the 12,625 × 4 gene-by-group combinations after transformation by the glog with $\lambda = 1225$.
4. Histogram of p-values for 12,625 F-tests using gene-specific MSE.
5. Histogram of p-values for 12,625 F-tests using global MSE.
6. Histogram of p-values for 12,625 F-tests using posterior best-estimate MSE.

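[Figure 1: Raw Data. Horizontal axis: Sum. Vertical axis: Difference.]

[Figure 2: Raw Data. Horizontal axis: Rank of Sum. Vertical axis: Difference.]

[Figure 3: Glog of Data. Horizontal axis: Rank of Sum. Vertical axis: Difference.]

[Figure 4: Histogram of Gene-Specific p-values. Horizontal axis: Raw p-values. Vertical axis: Frequency.]

[Figure 5: Histogram of Global p-values. Horizontal axis: Raw p-values. Vertical axis: Frequency.]

[Figure 6: Histogram of Posterior p-values. Horizontal axis: Raw p-values. Vertical axis: Frequency.]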
