Gene expression data analysis in clinical cancer research


Philippe Broët
INSERM U47 and Faculty of Medicine Paris-Sud, 16 Avenue Paul Vaillant Couturier, Villejuif, France
broet@vjf.inserm.fr

This work was carried out with Sylvia Richardson and Alex Lewin, Department of Public Health, Imperial College, Norfolk Place, London W2 1PG, United Kingdom.

Summary: In association studies that use transcriptome-oriented biotechnologies, where the goal is to identify genes whose expression changes are correlated with a bio-clinical factor, one of the major problems is identifying these genes while accounting for the multiplicity of the comparisons performed. The two main criteria used in multiple comparison procedures are the FWER (Family-Wise Error Rate) and the FDR (False Discovery Rate). Numerous procedures currently exist for controlling (or estimating) these error criteria. Nevertheless, these procedures only partially meet the needs of clinical cancer research. In this context we present a method based on Bayesian mixture models that makes it possible to compute the FDR for any set of genes. An example based on real breast cancer data is presented.

Keywords: Bayesian mixture model, Clinical research, FDR, Microarray data analysis, Oncology.

1. Introduction

Transcriptome-oriented biotechnologies allow researchers to analyse the expression of thousands of mRNAs comparatively and in parallel. Typically, these data consist of gene expression measurements taken under various experimental or biological conditions, which can provide information on the complex transcriptional activity of the biological system under study (Schena, 2000). In parallel with the rapid development of this genomic technology, research into ways of interpreting the vast and rich body of generated data has become an active area. The interest in this new challenge for biostatisticians is underscored by the increasing number of articles recently published in the scientific literature.

From well-designed experiments, research scientists pose questions related to comparison, prediction and clustering problems. For class comparison, the aim is to select relevant genes based on the relationship between their expression measurements and a response variable. For class prediction, the main interest is in deriving predictors defined from a linear or non-linear combination of gene expression measurements. For class discovery, the major objective is to find new sub-classes of a disease entity that could help future clinical and fundamental research.

For class comparison, research into ways of identifying gene expression changes in microarray experiments while taking false conclusions into account has become an active area. Up to now, statistical procedures have mostly relied on the multiple comparisons framework in order to control false positive conclusions (Hochberg and Tamhane, 1987). In this framework, two quantities have been considered: the familywise error rate (FWER) and the false discovery rate (FDR). The FWER, which is the oldest criterion considered in multiple comparisons, is defined as the probability of at least one false positive conclusion over all the true null hypotheses (a null hypothesis corresponds to the lack of a relationship between the gene expression measurement and a response variable). The most classical methods are the Bonferroni and Šidák methods (Hochberg and Tamhane, 1987). However, as argued by Benjamini and Hochberg (1995), controlling the FWER in multiple testing settings may not always be appropriate. As an alternative and less stringent concept of error control they introduced the false discovery rate (FDR). The FDR is the expected proportion of erroneously rejected null hypotheses among the rejected ones. The main interest of the FDR is that it is an appealing error criterion which leads to more powerful procedures than those relying on the FWER. Moreover, the FDR seems well suited to genomic and post-genomic biotechnologies, which are mostly used for exploratory data analysis and screening. Based on this concept, they initially developed a step-up procedure which, under the hypothesis of independence, controls the FDR at a prespecified value (Benjamini and Hochberg, 1995). An extension to the case of dependent tests has also been proposed (Benjamini and Yekutieli, 2001). In the same spirit, seminal work has been done on non-parametric estimation of the FDR, or of the pFDR as defined by Storey (Storey, 2001) (for some key contributions, see Storey and Tibshirani, 2003; Tusher et al., 2001; Efron et al., 2001). A drawback of these latter procedures is that they only focus on protecting against false positive conclusions.
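To make the contrast between the two error criteria concrete, here is a minimal Python sketch of the Bonferroni rule (FWER control) and the Benjamini and Hochberg step-up rule (FDR control under independence). The function names, the default level `alpha=0.05` and the use of NumPy are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def bonferroni(pvalues, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the FWER at level alpha)."""
    p = np.asarray(pvalues, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up rule: find the largest rank k with p_(k) <= k * alpha / m
    and reject the k hypotheses with the smallest p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices sorting p ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank meeting the bound
        rejected[order[:k + 1]] = True           # reject everything up to rank k
    return rejected
```

With the four p-values 0.04, 0.02, 0.035 and 0.03, Bonferroni at level 0.05 rejects nothing (the threshold is 0.0125), whereas the step-up rule rejects all four because the third and fourth ordered p-values fall below their thresholds. This illustrates why FDR-based procedures are more powerful in a screening context.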
However, in the exploratory and screening context of most microarray data analysis, investigators may be seriously concerned that such methods do not take false negatives into account and lead to the discarding of too large a proportion of meaningful experimental information. Indeed, a large gene expression variation does not necessarily translate into a major role in the biological process studied, and vice versa. This is especially true for microarray experiments in oncology, where the top genes (based on p-values or gene statistics) are not necessarily key genes, whereas other interesting genes (related to a biological pathway or a drug target) may exhibit smaller transcriptional variations. In this setting, finite mixture modelling offers a flexible framework (see the numerous illustrations in McLachlan et al., 2000) and allows for inferences obtained from a frequentist or a Bayesian approach (for a few examples, see Pan et al., 2003; Broët et al., 2002). In this work we present a fully Bayesian mixture model that pays particular attention to the modelling of the alternative hypothesis in order to obtain good estimates of the FDR and of its dual quantity, the FNR, as defined by Genovese and Wasserman (2002). Moreover, it allows us to estimate the FDR and FNR for any subset of genes, a feature that cannot be obtained from classical approaches, which only consider monotone rejection regions. We illustrate the approach by reanalysing a breast cancer dataset (Hedenfalk et al., 2001), where the aim is to select relevant genes in a multi-class response experiment comparing BRCA1-related, BRCA2-related and sporadic cancers.

2. Bayesian mixture modelling approach

2.1 Gene-based statistic

In this subsection, we define a gene-based statistic for multi-class response experiments. In the following, let $X_{ijk}$ denote the measurement from the $i$-th gene ($i = 1, \ldots, I$), in the $j$-th sample ($j = 1, \ldots, J_k$) belonging to the $k$-th class ($k = 1, \ldots, K$), and let $N = \sum_k J_k$ be the total number of samples. The gene-based statistic $D_i$ used in our proposed model-based approach is a transformation of the gene statistic $F_i$, which under $H_0$ (corresponding to truly unmodified expression) follows a Fisher distribution, denoted $F_{K-1,\,N-K}$, with $(K-1)$ and $(N-K)$ degrees of freedom:

$$D_i = \left[\left(1 - \frac{2}{9(N-K)}\right) F_i^{1/3} - \left(1 - \frac{2}{9(K-1)}\right)\right] \left[\frac{2}{9(N-K)}\, F_i^{2/3} + \frac{2}{9(K-1)}\right]^{-1/2}$$

This transformation normalizes the distribution of the $F_i$ (Johnson and Kotz, 1970). Under $H_0$, $D_i$ is approximately distributed as a standard normal distribution, while otherwise $D_i$ has a more complex, non-centred distribution. Note that the non-centred $D_i$ values summarize different gene expression changes across the conditions. Thus, the marginal distribution of $D_i$ is a mixture of distributions related to modified and unmodified gene expression measurements over the different classes.

2.2 Model

Our purpose is to model the mixture distribution of $D_i$ and to estimate, for each gene, the posterior probability of belonging to the null component (representing no difference over the different classes), conditional on the observed data. Our modelling approach assumes that the marginal density of $D_i$ can be written as:

$$f(d_i) = \sum_{g=0}^{G} w_g\, f(d_i \mid \mu_g, \sigma_g^2)$$

where the $f(\cdot \mid \mu_g, \sigma_g^2)$ are Gaussian densities, with unknown parameters $(\mu_g, \sigma_g^2)$ for the $g$-th component density in the mixture. The quantities $w_g$ are the mixing proportions, with $0 \le w_g \le 1$ and $\sum_{g=0}^{G} w_g = 1$. Here, we define $g = 0$ to be the unmodified component, having no expression change over the different conditions. This component has a centred normal distribution. The number of modified components $G$ in the mixture is treated as unknown, since the alternative is expected to have a complex distribution summarizing various patterns of gene expression. The prior distribution for $G$ is a Poisson distribution with parameter $m$, with $m$ chosen small so as to encourage a parsimonious number of fitted components.
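As an illustration of the gene-based statistic defined above, the following minimal Python sketch maps F statistics to the approximately standard normal scores D_i. The function name `f_to_normal_score` and its argument names are hypothetical, and the F statistics are assumed to have been computed beforehand by a one-way analysis of variance over the K classes.

```python
import numpy as np

def f_to_normal_score(f_stat, n_samples, n_classes):
    """Transform F statistics with (K-1, N-K) degrees of freedom into scores
    that are approximately N(0, 1) under the null hypothesis."""
    f = np.asarray(f_stat, dtype=float)
    a = 2.0 / (9.0 * (n_samples - n_classes))   # term in N - K (denominator df)
    b = 2.0 / (9.0 * (n_classes - 1))           # term in K - 1 (numerator df)
    cube_root = np.cbrt(f)
    numerator = (1.0 - a) * cube_root - (1.0 - b)
    denominator = np.sqrt(a * cube_root ** 2 + b)
    return numerator / denominator
```

For instance, with K = 3 classes and a hypothetical total of N = 21 samples, F statistics simulated under the null hypothesis give scores whose mean is close to 0 and whose standard deviation is close to 1, while large F values are mapped to large positive scores.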
The mean parameter for the unmodified component, $\mu_0$, was set to 0, and we impose that $\mu_g > 0$ for the modified components. (Recall that under $H_0$ the $F_i$ follow Fisher distributions $F_{K-1,\,N-K}$, whereas under the alternative they follow noncentral Fisher distributions $F_{K-1,\,N-K}(\eta)$, where $\eta$ is the noncentrality parameter.) The prior distributions specify that the $\mu_g$ ($g \neq 0$), $\sigma_g^2$ and $w_g$ are all drawn independently, with uniform, gamma and Dirichlet priors respectively. As usual for mixture models, we introduce $L_i$, an unobserved (latent) categorical variable taking the values $0, \ldots, G$ with probabilities $w_0, \ldots, w_G$, respectively (McLachlan et al., 2000). Thus, $L_i \neq 0$ indicates that gene $i$ does not belong to the null component. A joint posterior distribution for all unknowns is formed. Inference is then undertaken by simulating realizations from the resulting posterior distribution using a reversible-jump Metropolis-Hastings algorithm similar to the one used in Broët et al. (2002) and Richardson and Green (1997). The full output of the Bayesian analysis includes information on the posterior distribution of $G$ as well as our main quantities of interest, the posterior probabilities $p_{0i} = P(L_i = 0 \mid \text{data})$ for each gene. The $p_{0i}$ are estimated within the algorithm by counting the number of iterations in which $L_i = 0$ and dividing by the length of the simulation run. Note that these probabilities are integrated over the range of normal mixtures (with different $G$) that are used to fit the marginal density of $D_i$, a unique feature of our model. From these posterior probabilities we can obtain model-based estimates of the observed false discovery and non-discovery rates, conditional on the data.
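The counting estimate of the posterior probabilities p_0i can be sketched as follows. This is only an illustration of the counting step, not of the reversible-jump sampler itself: it assumes that the sampler's latent allocations have been stored in a hypothetical array `allocations` with one row per iteration and one column per gene, and the optional `burn_in` argument is an added convenience that the paper does not mention.

```python
import numpy as np

def posterior_null_probability(allocations, burn_in=0):
    """Estimate p_0i as the fraction of retained iterations in which
    gene i is allocated to the null component (label 0)."""
    draws = np.asarray(allocations)[burn_in:]
    return (draws == 0).mean(axis=0)

# Toy run: 4 iterations, 3 genes. The labels may range over a different
# number of components G in each iteration; only the label 0 matters here.
toy_allocations = np.array([
    [0, 1, 2],
    [0, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
])
p0 = posterior_null_probability(toy_allocations)   # 0.75, 0.25, 0.5
```

Because only membership of component 0 is counted, the estimate is automatically averaged over the different values of G visited by the sampler.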

2.3 The analysis of the Hedenfalk breast cancer dataset

Dataset

We analyzed the cDNA microarray dataset publicly available from the breast cancer study conducted by Hedenfalk et al. (2001). That study examined breast cancer tissues from patients with BRCA1-related cancer, BRCA2-related cancer, and sporadic cases of breast cancer in order to determine global gene-expression patterns in these three classes of tumors. The initial dataset consists of gene expression ratios, each derived from the fluorescent intensities of a tumor sample divided by those of a common reference sample. For each gene, a log-expression ratio was available. Here, we focus on the subset of 471 genes having a nominal denomination (ESTs and unknown genes were excluded). We consider each log-ratio measurement to be an additive sum of four terms: (i) a gene effect, (ii) a differential effect between the tumor sample and the reference sample co-hybridized on a given array, (iii) a gene × cell line interaction effect that reflects differential gene expression among the three tumor classes specific to each gene, and (iv) an error term. As the term of interest is the interaction term, we estimate it through a classical analysis of variance model. In practice, row and column effects are subtracted.

Results

The mixture integrated over different numbers of components provides a good semi-parametric fit to the gene-based statistics. This dataset appears to have a large number of differentiated genes (the Bayes estimate for the proportion of truly modified genes is 48%). The Bayes rule with the mixture model would give us a list of 995 genes, which is too many for practical purposes. Considering the genes ordered by $p_{0i}$, our method gives FDR estimates of 1.6% and 6.1% for lists of 96 or 384 genes (corresponding to classical 96- or 384-well plates), whereas the FNR estimates are 39% and 31.6%, respectively.
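The text does not spell out how the FDR and FNR estimates are computed from the p_0i. A natural choice, assumed here purely for illustration, is to average p_0i over the selected genes for the FDR and to average 1 - p_0i over the non-selected genes for the FNR. The following Python sketch applies this to a list of the k smallest p_0i and to an arbitrary subset; all function and variable names are hypothetical.

```python
import numpy as np

def fdr_fnr_for_subset(p0, selected):
    """Assumed estimators: FDR = mean of p_0i over the selected genes,
    FNR = mean of (1 - p_0i) over the genes that were not selected."""
    p0 = np.asarray(p0, dtype=float)
    mask = np.zeros(p0.size, dtype=bool)
    mask[np.asarray(selected, dtype=int)] = True
    fdr = p0[mask].mean() if mask.any() else 0.0
    fnr = (1.0 - p0[~mask]).mean() if (~mask).any() else 0.0
    return fdr, fnr

def top_k_by_posterior(p0, k):
    """Indices of the k genes with the smallest posterior null probability."""
    return np.argsort(np.asarray(p0, dtype=float))[:k]

# Toy posterior null probabilities for five genes (made-up values).
toy_p0 = [0.01, 0.05, 0.9, 0.4, 0.8]
fdr_top2, fnr_top2 = fdr_fnr_for_subset(toy_p0, top_k_by_posterior(toy_p0, 2))
fdr_set, fnr_set = fdr_fnr_for_subset(toy_p0, [1, 3])   # arbitrary gene set
```

Since the estimators only need the individual p_0i, the selected set does not have to be a list of top-ranked genes: any set defined in advance, for example by biological function, can be passed as `selected`.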
In contrast, if the investigator is interested in studying a biological function, the FDR and FNR can be obtained from the individual $p_{0i}$. As an example, we consider three subsets of genes defined by their known classical biological functions: apoptosis, cyclins and cell cycle regulation, and cytoskeleton. This gave lists of 6, 1 and 5 genes of interest, respectively. Estimates of the FDR were 85% for apoptosis, 10% for cyclins and cell cycle regulation, and 87% for cytoskeleton. These results suggest that, in contrast to the other biological functions considered, gene expression in the cyclins and cell cycle regulation pathway differs across the three tumor classes, and they may lead the investigator to focus preferentially on genes involved in the cell cycle.

3. Discussion

Our fully Bayesian normal mixture model gives flexibility, since the number of components is treated as an unknown parameter, and it can be considered a parsimonious, semi-parametric representation of a complex mixture density. In this context, a mixture model-based approach such as the one presented here seems well suited to multi-class comparison experiments. Estimates of the FDR and FNR obtained using our mixture model are generally accurate over a range of cases. When there is substantial overlap between truly modified and unmodified gene profiles, the estimates outperform those obtained from classical nonparametric approaches (such as Storey's q-value, 2003). Moreover, our approach gives an estimate of the individual posterior probability that a gene belongs to the null component, integrated over all the possible mixture models. This makes it possible to estimate the FDR and FNR for any subset of genes, a feature that cannot be obtained from classical nonparametric approaches (such as Storey's q-value or SAM; Tusher et al., 2001).

We applied the model to a cDNA microarray dataset from a breast cancer study. When comparing, for example, three subsets of genes defined by their biological functions, our results suggested that transcriptional expression for genes involved in the kinase and cell cycle pathway differs between BRCA1, BRCA2 and sporadic tumors. In summary, we think this modelling approach gives an efficient way of obtaining the FDR and FNR and of analyzing subsets of genes that are particularly relevant in clinical cancer research.

References

Benjamini, Y., Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57.
Benjamini, Y., Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29.
Broët, P., Richardson, S., Radvanyi, F. (2002) Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol., 9.
Efron, B., Tibshirani, R., Storey, J., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96.
Genovese, C., Wasserman, L. (2002) Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B, 64.
Hedenfalk, I., Duggan, D., Chen, Y. et al. (2001) Gene-expression profiles in hereditary breast cancer. N Engl J Med, 344.
Hochberg, Y., Tamhane, A. (1987) Multiple comparison procedures. Wiley, New York.
Johnson, N.L., Kotz, S. (1970) Continuous univariate distributions. Vol. 2, Wiley, New York.
McLachlan, G., Peel, D. (2000) Finite mixture models. Wiley, New York.
Pan, W., Lin, J., Le, C. (2003) A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics, 3, 117-124.
Richardson, S., Green, P.J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59.
Schena, M. (2000) Microarray Biochip Technology. Eaton.
Storey, J.D. (2001) A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64.
Storey, J.D., Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA, 100.
QVALUE: The manual. jstorey/qvalue/manual.pdf
Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Tusher, V., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98.
