Lab Rotation Report. Re-analysis of Molecular Features in Predicting Survival in Follicular Lymphoma


Ray S. Lin
Biomedical Informatics Training Program, Stanford
June 24, 2006

ABSTRACT

The findings in the article "Prediction of survival in follicular lymphoma based on molecular features of tumor infiltrating cells" by Dave and Wright et al., published in NEJM in 2004 [1], have been found to be fragile. This project investigated whether Wright's findings can be reproduced in three experiments: 1) using the original training and test sets, 2) swapping the training and test sets, and 3) randomly dividing the data into training and test sets 100 times. Different experimental settings were examined, but we were not able to reproduce Wright's findings in any of them.

In the 1st experiment, we found some pairs of gene clusters that were significant in both training and test sets; however, the most significant pair in the training set was not significant in the test set. In the 2nd experiment, none of the pairs that consisted of one poor gene cluster and one good gene cluster were found to be significant in the test set. In the 3rd experiment, the empirical distribution of p values in the test sets was generated from 100 trials on randomly split training and test sets. The distribution is close to uniform, which indicates that the significant results Wright obtained were likely due to chance.

In summary, based on Wright's dataset and the analytical procedure, it is not likely that there exist gene clusters (or pairs of gene clusters) that strongly predict survival in follicular lymphoma patients.

1. INTRODUCTION

The finding in the article "Prediction of survival in follicular lymphoma based on molecular features of tumor infiltrating cells" by Dave and Wright et al., published in NEJM in 2004 [1], has been found to be fragile. Robert Tibshirani reported that the finding was not reproducible with the same model-building procedure if the training and test sets were swapped [2]. This project investigated whether Wright's finding can be reproduced in three experiments: 1) using the original training and test sets, 2) swapping the training and test sets, and 3) randomly dividing the data into training and test sets 100 times.

2. METHODS

The original dataset was obtained from Wright. The dataset contains gene expression measurements of 44,928 genes on 191 patients. Four of the 191 patients did not have survival data. We performed the following three experiments: 1) reproducing Wright's results based on the original training and test sets; 2) a new analysis after swapping the training and test sets; 3) 100 new analyses, each of which randomly divides the whole dataset into a training and a test set, selects the most significant pair of gene clusters found in the training set, and computes its p value in the test set.

2.1. Reproducing Wright's results

The analysis consists of the following steps, according to the description in Dave's article and the supplementary appendix.

(a) Dividing patients into training and test sets

The whole dataset was divided into training and test sets based on the split performed by George Wright. There were 95 patients (93 with survival data) in the training set and 96 (94 with survival data) in the test set.

(b) Filtering the genes

The 44,928 genes were filtered based on two criteria computed in the training set: 1) Wald p < 0.1 in a univariate Cox model and 2) median expression > 6. The remaining genes were categorized into poor genes that predict poor survival (Wald score > 0) and good genes that predict good survival (Wald score < 0). It was not clear whether the filtering procedure was done based on data of all patients (i.e., n=95) or only patients with survival data (i.e., n=93). Therefore, both scenarios were examined.

(c) Clustering the genes

The poor genes and the good genes were clustered separately with the XCluster software [3]. The clustering result was further filtered by the joining correlation (> 0.5) and the cluster size (> 24 and < 51).

Four datasets (G1 to G4) were derived for further analysis. G1 and G2 were derived from the list of genes that appear in the .gtr and .ctr files (the output of the clustering software) provided by Wright. The files contain 1,569 poor genes, but the files containing good genes are not available. These poor genes were identified in the original dataset. Dataset G1 contains these 1,569 genes for all 95 patients, whereas G2 contains the same set of genes but only for the patients with survival data (n=93). Datasets G3 and G4 were derived from the original dataset by performing the filtering procedure listed above. G3 was derived based on all 95 patients, whereas G4 was based on the 93 patients with survival data.

(d) Fitting the gene clusters to Cox models

The gene clusters obtained from G3 and G4 were further analyzed with Cox models. For each gene cluster, the expression of the cluster was computed as the mean expression of its constituent genes. Cox models were used to analyze the expression of the individual gene clusters and of pairs of these clusters. For each pair, the p values of Wald tests in the training and test sets were reported. The clusters obtained from G1 and G2 could not be analyzed because they did not contain the clusters for good genes.

2.2. Swapping training and test sets

In this analysis, the training and test sets were swapped. There were 96 patients (94 with survival data) in the new training set and 95 patients (93 with survival data) in the test set. The analysis was performed following steps (b) to (d) described in Sec. 2.1. This analysis was based on dataset G4. In other words, it includes patients without survival data.
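The filtering rules in steps (b) to (d) of Sec. 2.1 can be illustrated with a minimal Python sketch. Python is used here only for illustration; the original analysis was an R script that called XCluster. The sketch assumes the Wald scores of the univariate Cox models and the cluster memberships have already been computed elsewhere. The function names, the dictionary layout and the toy values are hypothetical and are not taken from Wright's or the author's code.

```python
from statistics import NormalDist, fmean, median


def wald_p_value(z):
    """Two-sided p value of a Wald score under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def filter_genes(wald_z, expression, p_cutoff=0.1, median_cutoff=6.0):
    """Step (b): keep genes with Wald p < p_cutoff and median expression
    > median_cutoff, then split them into poor (z > 0) and good (z < 0)."""
    poor, good = [], []
    for gene, z in wald_z.items():
        if wald_p_value(z) >= p_cutoff:
            continue
        if median(expression[gene]) <= median_cutoff:
            continue
        (poor if z > 0 else good).append(gene)
    return poor, good


def filter_clusters(clusters, min_corr=0.5, min_size=24, max_size=51):
    """Step (c): keep clusters with joining correlation > 0.5 and
    size > 24 and < 51."""
    return [
        c for c in clusters
        if c["correlation"] > min_corr and min_size < len(c["genes"]) < max_size
    ]


def cluster_expression(genes, expression):
    """Step (d): per-patient mean expression over the genes of one cluster."""
    return [fmean(values) for values in zip(*(expression[g] for g in genes))]


# Toy data (hypothetical): Wald scores and training-set expression values.
wald_z = {"g1": 2.1, "g2": -1.9, "g3": 0.4, "g4": 2.5}
expression = {
    "g1": [7.0, 8.1, 6.5],
    "g2": [6.2, 7.4, 9.0],
    "g3": [8.0, 8.5, 9.1],
    "g4": [5.0, 5.5, 7.0],
}
poor, good = filter_genes(wald_z, expression)

clusters = [
    {"genes": [f"p{i}" for i in range(25)], "correlation": 0.6},  # kept
    {"genes": [f"q{i}" for i in range(24)], "correlation": 0.9},  # too small
    {"genes": [f"r{i}" for i in range(30)], "correlation": 0.5},  # not > 0.5
]
kept = filter_clusters(clusters)
means = cluster_expression(["g1", "g2"], expression)
```

In the toy data, g3 fails the Wald filter and g4 fails the median-expression filter, so only g1 (poor) and g2 (good) remain. The two-sided cutoff p < 0.1 corresponds to an absolute Wald score above about 1.645, the value quoted in the appendix.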

2.3. Randomly dividing training and test sets

The whole dataset was randomly divided into training and test sets of roughly equal size 100 times. In each random split, steps (b) to (d) described in Sec. 2.1 were performed. The pair of gene clusters with the most significant Wald test in the training set was selected, and its p value in the test set was computed. The empirical distribution of these test-set p values was examined. This analysis was based on dataset G4, which includes patients without survival data.

3. RESULTS

The following subsections summarize the results of the three experiments.

3.1. Reproducing Wright's results

Table 1 summarizes the number of poor and good genes after gene filtering. In Wright's analysis, it was not clear whether the filtering procedure was done based on data of all patients (i.e., n=95) or only patients with survival data (i.e., n=93). Both scenarios were examined in this project, and the first scenario (i.e., n=95) produced the same number of genes as the supplementary appendix published on the NEJM website [4]. However, the data provided by Wright was different from their appendix, and neither scenario in our analysis could reproduce Wright's result. This finding is the same as the one obtained by Rob Tibshirani [2].

Four datasets (G1 to G4) were created for analysis as described in Sec. 2.1. Their characteristics are summarized in Table 2. Table 3 summarizes the number of clusters obtained in the different datasets with different centering and scaling methods. None of the four datasets generated the same number of clusters as Wright's analysis.

Figure 1 shows the p values in training and test sets based on dataset G3. Panels A and B show the p values of pairs of gene clusters in two-variable Cox models. Some pairs reached significant p values at the 0.05 level in both training and test sets. However, the most significant ones in the training set were not significant in the test set. Panels C and D show the p values of poor and good gene clusters in univariate Cox models. Only one poor gene cluster and none of the good gene clusters was significant in both training and test sets.

3.2. Swapping training and test sets

Figure 2 shows the Cox model p values in training and test sets based on dataset G4. Several pairs reached significant p values at the 0.05 level in both training and test sets (Panel A). However, when considering only the pairs consisting of one poor gene cluster and one good gene cluster, none of the pairs was significant in the test set (Panel B). Similarly, in univariate Cox models, none of the poor gene clusters (Panel C) and none of the good gene clusters (Panel D) was significant in the test set, regardless of their p values in the training set.

3.3. Randomly dividing training and test sets

Table 4 summarizes the empirical distribution of Cox model p values in training and test sets, and Figure 3 shows the cumulative distribution of the p values in the test set. The distribution of p values in the test set is close to uniform. Only 6% of the p values in the test set were less than 0.05, although all the p values were very significant in the training set.

4. DISCUSSION

The first goal of this project was to reproduce Wright's results. However, as presented in Sec. 3, none of our analyses produced the same result as Wright's. It would be helpful if we could obtain their .cdt (not .ctr) files, which contain the gene clusters with the expression measurements for all the genes on all the patients. In that case, we could compare our .cdt files against theirs and might be able to identify the cause of the different results.

The results in Sec. 3.1 show that if we follow Wright's procedure and pick the pair of gene clusters with the most significant p value in the training set, this pair would not be significant in the test set. Even though there are some pairs that are significant in both training and test sets, we would not be able to identify them by just looking at their p values in the training set.

After swapping training and test sets, none of the pairs that consist of one poor gene cluster and one good gene cluster were significant in the test set. This finding is the same as Rob's result [2]. Notice that the plots are not exactly the same. This may be because Rob's analysis included only the patients with survival data (n=187), whereas this project included all the patients (n=191).

In the 3rd experiment, the empirical distribution of p values in the test set is close to uniform, although the mean and median are around 0.42 (not 0.5). The p values in the test sets were significant (< 0.05) in only 6 of the 100 trials. This shows that if we follow Wright's procedure, the significant results obtained in the test set were likely only due to chance. It is not likely that there are pairs of gene clusters that strongly predict survival.

In summary, it is unfortunate that we could not reproduce Wright's findings. This could be due to nuances in the parameters of the analysis. We experimented with different parameters, but none produced exactly the same results as Wright's. In the 1st experiment, we found some pairs of gene clusters that were significant in both training and test sets; however, the most significant pair in the training set was not significant in the test set. In the 2nd experiment, none of the pairs that consisted of one poor gene cluster and one good gene cluster were found to be significant in the test set. In the 3rd experiment, the empirical distribution of p values in the test sets was generated from 100 trials on randomly split training and test sets. The distribution is close to uniform, which indicates that the significant results Wright obtained were likely due to chance.
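As a rough check on the figure of 6 significant trials out of 100: if the test-set p values were exactly uniform and the 100 trials were independent, the number of trials with p < 0.05 would follow a Binomial(100, 0.05) distribution with mean 5. Independence is only an approximation here, since every random split reuses the same 191 patients. The Python sketch below (Python is used for illustration only, and the function name is hypothetical) computes the probability of seeing 6 or more such trials under that assumption.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed null model: 100 independent trials, each with a 5% chance of p < 0.05.
expected_hits = 100 * 0.05
tail = binomial_tail(6, 100, 0.05)
print(f"expected hits: {expected_hits:.0f}, P(6 or more hits): {tail:.3f}")
```

Under these assumptions the tail probability is roughly 0.38, so 6 significant trials out of 100 is about what chance alone would produce.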
Based on Wright's dataset and the analytical procedure, we were not able to find significant gene clusters (or pairs of gene clusters) that predict survival in follicular lymphoma patients.

ACKNOWLEDGEMENTS

I would like to thank Dr. Trevor Hastie and Dr. Rob Tibshirani for their guidance.

REFERENCES

1. Dave, et al., Prediction of survival in follicular lymphoma based on molecular features of tumor infiltrating cells. NEJM, (21).
2. Tibshirani, R., Re-analysis of Dave et al, NEJM Nov 18.
3. Sherlock, G., XCluster.
4. Dave, et al., Supplementary appendix to prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. NEJM.

TABLES

Table 1. The numbers of poor and good genes after gene filtering in different datasets.

Patients in the tr. set            #poor genes   #good genes   Reported by
?                                  1,568         1,731         Supp appendix
?                                  1,569         -*            Wright
all patients (n=95)                1,568         1,731         Ray
patients with surv. data (n=93)    1,565         1,730         Ray

*: Data in this cell were not available.
?: In Wright's analysis, it was not clear whether patients without survival data were included or not.

Table 2. Characteristics of the four datasets (G1 to G4), including the numbers of poor and good genes after gene filtering.

Dataset   #patients   #poor genes   #good genes   Derived from
G1        95          1,569         -*            Wright's clusters
G2        93          1,569         -*            Wright's clusters
G3        95          1,568         1,731         original dataset
G4        93          1,565         1,730         original dataset

*: Data in this cell were not available.

Table 3. The numbers of clusters obtained in different datasets with different centering and scaling methods.

Dataset    centering   scaling   #clusters (poor genes)   #clusters (good genes)
G1         mean        none      92                       -*
G1         mean        var       96                       -*
G1         median      none      84                       -*
G1         median      var       79                       -*
G2         mean        none      78                       -*
G2         mean        var       87                       -*
G2         median      none      83                       -*
G2         median      var       76                       -*
G3         mean        none
G3         mean        var
G3         median      none
G3         median      var
G4         mean        none
G4         mean        var
G4         median      none
G4         median      var
Wright's   median      ?         71                       -*

*: Data in these cells were not available.
?: Wright did not report scaling.

Table 4. The empirical distribution of Cox model p values in training and test sets in two-variable Cox models.

               Min   1st Qu.   Median   Mean   3rd Qu.   Max
Training set
Test set

FIGURES

Figure 1. Cox model p values in training and test sets based on dataset G3. (Panels A, B, C and D; panel title: One poor + one good.)

Figure 2. Cox model p values in training and test sets based on dataset G4. (Panels A, B, C and D; panel title: One poor + one good.)

Figure 3. The empirical cumulative distribution function of p values in two-variable Cox models in the test set. (Plot title: Empirical CDF of test set p-value; axes: P value, Empirical CDF.)

APPENDIX

Pseudo-code of R script

1. Read in the data from files
2. Combine the clinical data and gene expression data
3. Divide the whole dataset into training and test sets
4. Filter the genes
   i. Compute Wald tests (univariate Cox) for each gene based on the training set
   ii. Filter genes based on Wald test p < 0.1 (the absolute value of the Wald score > 1.645)
   iii. Filter genes based on the median expression in the training set
5. Identify poor genes (Wald score > 0) and good genes (Wald score < 0)
6. Cluster the poor genes
   i. Transform the dataset into the format for XCluster
   ii. Call XCluster (parameters = centered by median, no scaling)
   iii. Filter the clusters based on joining correlation cutoff > 0.5
   iv. Filter the clusters based on cluster size (> 24 and < 51)
   v. Map XCluster's gene id back to the id in the original dataset
7. Cluster the good genes
   i. Transform the dataset into the format for XCluster
   ii. Call XCluster (parameters = centered by median, no scaling)
   iii. Filter the clusters based on joining correlation cutoff > 0.5
   iv. Filter the clusters based on cluster size (> 24 and < 51)
   v. Map XCluster's gene id back to the id in the original dataset
8. For each patient and each gene cluster, compute its expression as the mean expression of its constituent genes
9. Fit a two-variable Cox model to every pair of one poor-gene cluster and one good-gene cluster
10. Plot the p values of Wald tests in training vs. test sets
11. In the training set, select the most significant pair of one poor-gene cluster and one good-gene cluster
12. Report the p value of this pair in the test set
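Steps 3, 11 and 12 of the pseudo-code above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's R script. The `p_value` callable is a hypothetical stand-in for fitting the two-variable Cox model and returning its Wald p value, and the toy lookup table replaces real model fits. Splitting the data into two halves is also an assumption; the report only says the sets were of roughly equal size.

```python
import random


def random_split(patient_ids, seed):
    """Step 3: shuffle the patients and cut them into two halves
    of roughly equal size."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def best_pair_test_p(pairs, p_value, train, test):
    """Steps 11 and 12: pick the pair with the smallest training-set
    p value and report that pair's p value in the test set."""
    best = min(pairs, key=lambda pair: p_value(pair, train))
    return best, p_value(best, test)


# 191 patients, as in the dataset; the seed is arbitrary.
train_ids, test_ids = random_split(range(191), seed=0)

# Toy p values (hypothetical) in place of real Cox model fits.
toy_p = {
    ("P1", "G1"): {"train": 0.001, "test": 0.40},
    ("P2", "G1"): {"train": 0.03, "test": 0.02},
}


def lookup_p(pair, subset_name):
    """Stand-in for a Cox fit: read the p value from the toy table."""
    return toy_p[pair][subset_name]


best, test_p = best_pair_test_p(list(toy_p), lookup_p, "train", "test")
```

The toy table mirrors the pattern reported for the 1st experiment: the pair that looks best in the training set is not the one that holds up in the test set, so selecting on the training p value alone misses it.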
