Metalearning for gene expression data classification


Bruno F. de Souza and André de Carvalho
ICMC-University of Sao Paulo, Sao Carlos, Brazil

Carlos Soares
LIAAD-INESC Porto LA/Fac. de Economia, Universidade do Porto, Portugal

Abstract

Machine Learning techniques have been widely applied to the problem of class prediction in microarray data. Nevertheless, current approaches to selecting appropriate methods for this task are often unsatisfactory in many ways, motivating the development of tools to automate the process. In this context, the authors introduce the use of metalearning in the specific domain of gene expression classification. Experiments with the KNN-ranking method for algorithm recommendation applied to 49 datasets yielded successful results.

1. Introduction

With the completion of a number of genomic sequencing programs, a wealth of biological data has become available, allowing an unprecedented set of opportunities to better understand the processes that drive living systems [7]. An area that could greatly benefit from post-genomic research is health care, through the identification of genetic artifacts that could be related to pathological states. In fact, the fight against cancer has already achieved promising advances, mainly due to the introduction of new wide-scale gene expression analysis technologies, such as microarrays [21]. Microarrays are hybridization-based methods that allow the monitoring of the expression levels of thousands of genes simultaneously [23]. This enables the measurement of the levels of mRNA molecules inside a cell and, consequently, of the proteins being produced. Therefore, the role of the genes of a cell at a given moment, and under given circumstances, can be better understood by assessing their expression levels.
The authors acknowledge the support from FAPESP, CNPq and FCT (project Triana - POCT/TRA/61001/ and Programa de Financiamento Plurianual de Unidades de I&D).

In order to acquire qualitatively interesting information from microarray experiments, one usually employs computational tools [28]. Specifically, Machine Learning (ML) algorithms [17] have been widely considered, mainly due to their ability to automatically extract patterns from data. One of the most promising ML tasks in this context is supervised classification [26]. Basically, it is used to identify the class membership of a sample based on its gene expression profile (e.g. normal versus cancerous tissues). One of the earliest applications of ML to the classification of microarray data was the successful design of a classifier system to distinguish between patients with related types of leukemia [10]. Following this approach, various studies confirmed the applicability of ML in gene expression domains [9, 20, 16, 15, 25, 12]. An analysis of their results indicates that no single algorithm performs better than the others in all cases. This situation is related to the no-free-lunch theorem [22] and it emphasizes the need for a careful selection of the algorithm to be used on each specific problem. The current rules of thumb for algorithm selection rely on costly trial-and-error procedures or expert advice [5]. Neither approach may be satisfactory to the end user, typically a biologist or clinician, who intends to analyze microarray data more directly and cost-effectively. So, in practice, the choice of the ML algorithm is basically determined by the familiarity of the user with the algorithm rather than by the particularities of the data and of the algorithms themselves. This may lead to suboptimal results, compromising the whole experimental setup. Therefore, a system that is able to automatically predict the performance of algorithms on new problems is highly desirable. One approach for that is metalearning [5].
The term generally refers to techniques that exploit expertise acquired in the process of applying ML algorithms in order to increase the quality of results obtained in future applications [5]. This work focuses on a particular perspective of metalearning, which is concerned with inducing metamodels that relate the characteristics of the problems to the performance of ML algorithms. The metamodels are then used to support algorithm selection for new problems. Metalearning has been successfully applied for algorithm recommendation on sets of diverse classification problems [5]. However, it has never been tested on problems of a single, specific domain. Therefore, the goal of this work is to test whether it is possible to successfully apply metalearning to problems of a single domain. The domain chosen for this study is the classification of gene expression data. Not only is this an important ML application, as argued earlier, but it also has a number of idiosyncrasies, such as the morphology of the data. The metalearning algorithm used here is the KNN ranking method [5]. This method is particularly useful for algorithm recommendation because it generates a ranking of the algorithms for a given dataset, based on the expected performance of those algorithms.

This document is organized as follows. Section 2 presents an overview of the application of ML algorithms to the classification of gene expression data. It also provides quantitative arguments to justify the use of metalearning in this domain. In Section 3, the general architecture of the metalearning system employed is explained. Section 4 discusses the experimental results obtained. Finally, Section 5 draws the conclusions of this work and points out future research directions.

2. Gene expression data classification

Traditional methods for cancer classification rely essentially on the tumor's morphological appearance [10] and on the tissue of its origin [24]. However, there is no assurance that similar tumors will have the same clinical development and, therefore, will demand a similar course of action. To permit a deeper understanding of the tissues being analyzed and to try to achieve a finer distinction of cancers, Golub et al. [10] applied microarrays to define a genetic portrait of tissues from two types of acute leukemia, AML and ALL. With a simple weighted voting scheme, the authors were able to correctly classify most of the samples. That seminal work introduced the ML community to gene expression data classification.
In this context, tissue samples $x_i$ are multidimensional observations represented by $m$ genes $x_{i,j}$ and have an associated class $y_i$ (e.g. the presence or absence of a disease). The task of an ML technique is to learn a discriminant function $f(x_i)$, induced from a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, such that it is able to relate $x_i$ to the corresponding $y_i$ and to exploit this relation to classify previously unseen tissue samples. An interesting point to note is the usual disproportion between the very large number of genes and the small number of tissue samples. Within this framework, many studies in the literature have demonstrated the efficacy of ML. Two of those studies are discussed next.

Nutt et al. have studied the feasibility of using gene expression data to classify high-grade gliomas [19]. They proposed an approach based on a k-Nearest Neighbors (KNN) classifier that was able to discriminate high-grade, nonclassic glial tumors objectively and reproducibly, outperforming the naive histopathology-based classification. The dataset used consists of 50 expression profiles obtained from Affymetrix high-density oligonucleotide microarrays containing probes for about genes. In an effort to improve the understanding of the molecular basis of Papillary Renal Cell Carcinoma (PRCC), Yang et al. [30] have studied the gene expression profiles of 34 cases of PRCC from an Affymetrix array with probe sets. Using unsupervised analyses, they were able to identify two highly correlated distinct molecular subclasses with morphological correlation. Through the application of a Prediction Analysis of Microarrays (PAM) classifier over the samples of the two subclasses, the authors were able to achieve a very good cross-validation accuracy when considering a subset of genetic markers.

A broader view on the matter is provided by recent articles.
Larrañaga et al. [14] present an extensive review of the application of machine learning methods in bioinformatics, with a section devoted to class prediction. They discuss issues such as how to assess and compare the performance of ML algorithms, the problem of feature selection and the most representative classification paradigms, with examples of their application. Asyali et al. [1] focused exclusively on gene expression data and provided a more critical survey of ML methods in this context, along with the implications of the findings. In their work, key points of microarray analysis, e.g. preprocessing and classifier design, were covered and examined. This is important since, according to the authors, there has been a dramatic increase in studies related to gene expression profile classification over recent years. Such interest raises the issue of which type of classifier should be applied, an issue that, as argued next, should be carefully addressed.

In current practice, the most used ML algorithms for gene expression classification are [1]: KNN, SVMs, Decision Trees (such as CART), PAM, Neural Networks, FLDA, DLDA and DQDA. Comparative studies that inspect the performance of different algorithms over a range of problems provide some support to users who need to decide which one to use. Dudoit et al. [9] compared the performance of LDA, DLDA, DQDA, Weighted Vote Scheme, KNN, trees and tree-based ensembles on three microarray datasets. Their main conclusion is that simple methods such as DLDA and KNN perform very well in comparison to more sophisticated methods such as tree-based ensembles. Romualdi et al. [20] studied the performance of DLDA, trees, Neural Networks, SVMs, KNN and PAM on two datasets. They were unable to obtain evidence that any one of those methods performs better than the others. Man et al. [16] included in their comparison KNN, PCA+LDA, PLS-DA, Neural Networks, random forests and SVM.
Based on experiments with 6 datasets, they concluded that PLS-DA and SVM presented the best results. In a very comprehensive study, using 21 classification methods (including most of the previous approaches) applied to 7 datasets, Lee et al. [15] claimed that no classifier is systematically better than the others. Statnikov et al. [25] compared 3 multi-class classification methods, namely multi-class SVMs, KNN and Neural Networks, on 11 datasets and concluded that SVMs outperformed their competitors. Finally, Huang et al. [12] compared the performance of 5 statistical methods (PLS, penalized PLS, LASSO, PAM and random forests) on 2 datasets and concluded that the algorithms obtain similar results.

As a whole, the aforementioned studies suggest that there is no obvious winning algorithm. Although some methods, such as SVMs, do present a tendency to perform well, none of them is the best on all datasets. In the present work, the authors further investigated this hypothesis. One way to achieve this is to analyze the relative performance of some ML algorithms on various microarray datasets. For each dataset, one can construct a ranking of algorithms based on estimates of their performance. Here it is assumed that better algorithms are ranked higher (i.e., they are assigned ranks closer to 1). The distribution of the ranks of an algorithm over the datasets gives an indication of how well it performs in comparison to the others. Figure 1 presents the distribution of ranks for the seven algorithms and 49 datasets used in this work (Section 4). Each bar indicates how many times a given algorithm was ranked in each of the 7 possible positions (footnote 1), represented by different levels of gray. The figure confirms the previous observation that there is no clear winner, although a few algorithms tend to perform well.

Figure 1. Distribution of rankings.

Footnote 1: One-unit intervals are represented, because the mean rank is assigned to tied algorithms.
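The per-dataset rankings and their distribution can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy error rates are hypothetical, lower error is assumed to mean better performance, and tied algorithms share the mean of the ranks they occupy.

```python
from collections import Counter

def rank_with_ties(errors):
    """Rank algorithms by error rate (lowest error gets rank 1).

    Algorithms with exactly equal error rates share the mean of the
    rank positions they occupy.
    """
    order = sorted(range(len(errors)), key=lambda j: errors[j])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next algorithm has the same error.
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for p in range(pos, end + 1):
            ranks[order[p]] = mean_rank
        pos = end + 1
    return ranks

def rank_distribution(error_table):
    """For each algorithm, count how often each rank occurs over the datasets."""
    counts = [Counter() for _ in error_table[0]]
    for errors in error_table:
        for j, r in enumerate(rank_with_ties(errors)):
            counts[j][r] += 1
    return counts

# Hypothetical error rates: one row per dataset, one column per algorithm.
toy_errors = [
    [0.10, 0.20, 0.20],
    [0.30, 0.25, 0.40],
]
toy_ranks = [rank_with_ties(row) for row in toy_errors]
toy_counts = rank_distribution(toy_errors)
```

In the first toy dataset the second and third algorithms are tied for positions 2 and 3, so both receive rank 2.5. Counting such ranks per algorithm over all datasets gives the kind of histogram that Figure 1 summarizes.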
3. Metalearning

As shown in the previous section, the data analyst must carefully select which algorithm to use on each problem, in order to obtain satisfactory results. Running an algorithm on a dataset is time consuming, especially when complex tasks with a large volume of data are involved, as is often the case in bioinformatics. Therefore, selecting the algorithm by trying out all alternatives is generally not a viable option. An alternative approach consists of using a learning algorithm to model the relation between the characteristics of learning problems (e.g., number of examples) and the relative performance of a set of algorithms [5]. Here, the authors refer to this approach as metalearning because one is learning about the performance of learning algorithms. Metalearning models can be used to predict the relative performance of the set of algorithms on a new dataset based on the characteristics of the dataset and without actually running any of the algorithms. This approach involves three steps: (1) the generation of metadata; (2) the induction of a metalearning model by applying a learning algorithm to the metadata; and (3) the application of the metamodel to support the selection of the algorithms to be used on new datasets. Next, the authors summarize these steps; for a more thorough description, the reader is referred to [5] and references therein.

Metadata

In this context, metadata are data that describe the (relative) performance of the selected algorithms on a set of datasets which were already processed with those algorithms. They consist of a set of meta-examples, each one representing one dataset. Each meta-example consists of attributes and a target. Datasets for metalearning are usually obtained from repositories. The attributes, which are also known as metafeatures, are measures that characterize the datasets. These measures represent general properties of the data which are expected to affect the performance of the algorithms.
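As a rough illustration of what such measures look like in practice, the sketch below computes a handful of simple metafeatures, such as the number of examples and the entropy of the class distribution. The function name, the dictionary keys, the choice of natural logarithms and the toy data are all assumptions made for this example, not part of the original system.

```python
import math
from collections import Counter

def simple_metafeatures(X, y):
    """Compute a few general and information-theoretic metafeatures.

    X is a list of samples (each a list of numeric attribute values) and
    y is the list of class labels, one per sample.
    """
    n_examples = len(X)
    n_features = len(X[0])
    class_counts = Counter(y)
    n_classes = len(class_counts)

    # Shannon entropy of the class distribution, in bits.
    entropy = -sum(
        (c / n_examples) * math.log2(c / n_examples)
        for c in class_counts.values()
    )
    # Dividing by log2(n_classes) maps the entropy to the interval [0, 1].
    normalized_entropy = entropy / math.log2(n_classes) if n_classes > 1 else 0.0

    return {
        "log_n_examples": math.log(n_examples),
        "log_n_features": math.log(n_features),
        "log_n_classes": math.log(n_classes),
        "normalized_class_entropy": normalized_entropy,
    }

# Hypothetical toy dataset: 4 tissue samples, 3 genes, 2 balanced classes.
toy_X = [[0.1, 1.2, -0.3], [0.4, 0.9, 0.0], [1.5, -0.2, 0.7], [1.1, 0.1, 0.5]]
toy_y = ["normal", "normal", "tumor", "tumor"]
toy_meta = simple_metafeatures(toy_X, toy_y)
```

Each dataset is thus reduced to a short numeric vector, which is what allows a distance-based metalearner to compare datasets with very different numbers of samples and genes.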
A few examples of commonly used metafeatures are the number of examples, the proportion of symbolic attributes, class entropy and the mean correlation between attributes. These are examples of what are usually referred to as general, statistical and information-theoretic metafeatures [5]. The target represents the relative performance of the algorithms on the dataset. Many metalearning approaches to the problem of algorithm recommendation handle it as a supervised classification task. The recommendation provided to the user consists of a single algorithm and the target variable is, thus, a nominal attribute containing the algorithm that achieved the best performance on the corresponding dataset. However, this is not the most adequate form of recommendation for this problem. It does not provide any further guidance when the user is not satisfied with the results obtained with the recommended algorithm. Although, as stated earlier, executing all the algorithms is not a viable strategy, it is often the case that the available computational resources are sufficient to run more than one of the available algorithms. If the recommendation indicates the order in which the algorithms should be executed, then the user can execute as many as possible, thus increasing the probability that a satisfactory result is obtained. Therefore, the problem of algorithm recommendation should be tackled as a ranking task [5], which is discussed in the following section.

KNN-ranking Method

The metadata, as presented in the previous section, consist of a set of meta-examples that are described by a set of metafeatures and by a target consisting of a ranking of the ML algorithms, which is referred to as the target ranking. This learning problem is similar to the problem of supervised classification. The difference is that, given a new example described by the values of the attributes, the objective in classification is to predict the class it belongs to, while the objective in ranking is to predict the order of the classes as applicable to that example. An algorithm that has previously been adapted for learning rankings and applied to the metalearning problem with successful results is the k-Nearest Neighbors (KNN) algorithm [5]. The difference between ranking and classification lies only in the target. Therefore, any common distance function (e.g. the Euclidean distance considered here) can be used by KNN to measure the similarity between examples. After selecting k neighbors, the corresponding target rankings must be aggregated to generate a prediction. In classification, this is achieved by predicting the most frequent class among the selected examples. For rankings, a simple approach is to aggregate the k target rankings with the Average Ranks (AR) method [5].
Let $R_{i,j}$ be the rank of base-algorithm $a_j$ ($j = 1, \ldots, n$) on dataset $i$, where $n$ is the number of algorithms. The average rank for each $a_j$ is:

$$\bar{R}_j = \frac{\sum_{i=1}^{k} R_{i,j}}{k}$$

The final ranking is obtained by ordering the average ranks and assigning ranks to the algorithms accordingly.

Evaluation and Application

The metamodel can then be used to support the data analyst in selecting the algorithm to use on a new dataset. To do this, it is first necessary to compute the metafeatures for the new dataset, after which the ranking of the algorithms can be predicted using the KNN method. However, to convince data analysts to apply a metalearning approach in practice, it is necessary to produce evidence that it is able to generate accurate predictions. One approach is to use Leave-one-out Cross Validation (LOOCV), which consists of iteratively, for each meta-example, computing the accuracy of the predicted ranking using a metamodel obtained on all the remaining meta-examples [5]. To measure ranking accuracy, the authors have used Spearman's Rank Correlation Coefficient, $r_S$, which is given by the expression:

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} (R(X_i) - R(Y_i))^2}{n^3 - n}$$

where $X$ and $Y$ are two sets of $n$ values and $R(X_i)$ represents the rank of element $i$ in the series $X$. The coefficient simply evaluates the monotonicity of two sets of values, i.e., whether their variations are related. The value of 1 represents perfect agreement, and -1 perfect disagreement (i.e., the rankings are inverted). A correlation of 0 means that the rankings are not related, which would be the expected score of the random ranking method [5]. To determine whether the accuracy of some particular recommended ranking can be regarded as high or not, a baseline method is required. In machine learning, simple prediction strategies are usually employed to set a baseline for more complex methods. For instance, a baseline commonly used in classification is the most frequent class in the dataset, referred to as the default class.
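The pieces described so far (Euclidean nearest neighbors over metafeatures, Average Ranks aggregation, Spearman's coefficient and LOOCV) fit together roughly as in the following sketch. It is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: all function names and the toy metadata are hypothetical, ties are resolved with mean ranks, and the simple $n^3 - n$ form of Spearman's coefficient is used as written above, without any tie correction.

```python
import math

def to_ranks(scores):
    """Rank scores so that the lowest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for p in range(pos, end + 1):
            ranks[order[p]] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks

def average_ranks(rankings):
    """Average Ranks (AR): average each algorithm's rank, then rank the averages."""
    n_algorithms = len(rankings[0])
    means = [sum(r[j] for r in rankings) / len(rankings) for j in range(n_algorithms)]
    return to_ranks(means)

def spearman(ranking_a, ranking_b):
    """Spearman's rank correlation between two rankings of the same n items."""
    n = len(ranking_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranking_a, ranking_b))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)

def knn_ranking(query, metafeatures, rankings, k):
    """Predict a ranking by applying AR to the k nearest meta-examples."""
    nearest = sorted(
        range(len(metafeatures)),
        key=lambda i: math.dist(query, metafeatures[i]),  # Euclidean distance
    )[:k]
    return average_ranks([rankings[i] for i in nearest])

def loocv_accuracy(metafeatures, rankings, k):
    """Mean Spearman correlation of predicted vs. target rankings under LOOCV."""
    scores = []
    for i in range(len(metafeatures)):
        train_x = metafeatures[:i] + metafeatures[i + 1:]
        train_r = rankings[:i] + rankings[i + 1:]
        predicted = knn_ranking(metafeatures[i], train_x, train_r, k)
        scores.append(spearman(predicted, rankings[i]))
    return sum(scores) / len(scores)

# Hypothetical metadata: 4 datasets, 2 metafeatures, rankings of 3 algorithms.
toy_metafeatures = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_rankings = [[1, 2, 3], [1, 2, 3], [3, 2, 1], [3, 2, 1]]

toy_prediction = knn_ranking([0.05, 0.0], toy_metafeatures, toy_rankings, k=2)
toy_loocv = loocv_accuracy(toy_metafeatures, toy_rankings, k=1)
# AR over all target rankings, usable as a simple reference ranking.
toy_default = average_ranks(toy_rankings)
```

In this toy metadata the two clusters of datasets have opposite target rankings, so a small k recovers each dataset's ranking exactly, while averaging over all four datasets leaves the three algorithms tied. Real metadata would of course be far less clear-cut.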
The baseline is typically obtained by summarizing the values of the target variable for all the examples in the dataset. In ranking, a similar approach consists of applying the Average Ranks (AR) method to all the target rankings in the metadata. The ranking obtained is called the default ranking.

4. Experimental results

Datasets

The metadata employed in this work came from 49 publicly available microarray datasets. They are related to disease diagnosis. Mainly, the task is either discriminating between normal and tumor cases or between different types of tumor. They present very diverse characteristics concerning the number of examples, the number of genes and the number of classes. Due to space constraints, the datasets are not described here, but full descriptions can be retrieved from br\ bferes. Two preprocessing operations were performed. As some datasets presented missing values, imputation was done using the Least Square Adaptation method [2], following the recommendation of Brock et al. [6]. Additionally, all attributes were normalized to have mean 0 and variance 1. This is first done for the training data and then the test data are rescaled accordingly.

ML algorithms

Based on the comparative studies of ML algorithms for gene expression classification presented in Section 2, seven classifiers were selected mainly according to two criteria: performance and training time. They are relatively fast to train and present adequate error rates on at least some datasets. The methods are: Diagonal Linear Discriminant Analysis (DLDA) [9], Diagonal Quadratic Discriminant Analysis (DQDA) [9], Prediction Analysis of Microarrays (PAM) [27], the 3-Nearest Neighbor (3-NN) [8], Support Vector Machines (SVM) [29] with linear and RBF kernels and Penalized Discriminant Analysis (PDA) [11]. An important issue within the metalearning framework is the estimation of the performance of the classifiers. In the context of gene expression data, this is the subject of ongoing research, with no widely accepted methodology [4, 13]. In the present work, the .632+ estimator is used, following the suggestion of Braga-Neto and Dougherty [4].

Metafeatures

Although the datasets used are different from traditional classification datasets, ten metafeatures originally developed for that kind of dataset in the StatLog project were used [5]. As some of those may not be suitably applicable due to the high dimensionality of the data, their calculation was preceded by a data reduction step. Here, Partial Least Squares (PLS) was employed, mainly due to its good results in a number of microarray studies (see [3] and references therein) and to its low computational cost. The number of PLS components considered was 3, which seems to be adequate for preserving discrimination power in the context of expression data [18]. The measures are given next. More information can be found in [5].

1. Log of number of examples
2. Log of number of features
3. Log of number of classes
4. Mean absolute skewness
5. Mean kurtosis
6. Geometric mean ratio of the standard deviations of individual populations to the pooled standard deviations
7. First canonical correlation
8. Proportion of total variation explained by the first canonical correlation
9. Normalized class entropy
10. Average absolute correlation between continuous attributes, per class

Results

The experiments conducted in this work followed the evaluation and application guidelines presented in Section 3. The main results obtained here are illustrated in Figure 2. The points represent the mean ranking accuracy over the 49 datasets according to the LOOCV approach, varying the number k of nearest neighbors from 1 to 20. The smallest values of k present the lowest performance (69.8% mean accuracy). Accuracy then increases with k until it reaches a saturation point, where it remains basically constant, and then degrades gracefully. Here, k = 4 and k = 5 give similar results (both roughly 78.3% mean accuracy). In any case, the KNN ranking method clearly outperforms the default ranking (dashed line; 59.9% mean accuracy), generating rankings that are, on average, more correlated with the ideal rankings. This indicates that metalearning can be successfully applied to recommend algorithms for gene expression analysis.

Figure 2. Ranking accuracy of KNN.

Additionally, these results are somewhat different from the ones reported on algorithm recommendation experiments with general classification datasets [5]. In that work, the best results were achieved with a very small k (1 or 2) and the KNN ranking method quickly became worse than the default ranking. This may be explained by the fact that, being from the same application domain, the datasets used here are more homogeneous. Therefore, the KNN algorithm has a smoother behavior with varying k.

5. Conclusions

This paper presents an empirical analysis of the performance of a metalearning method on the problem of recommending learning algorithms for gene expression classification. Metalearning has been successfully applied to general classification problems. However, it had never been applied to a restricted domain, such as gene expression data.
The results presented here show that it is possible to use metalearning to recommend classifiers for gene expression data. It was observed that the behavior of the metalearning algorithm is actually smoother than when applied to general classification problems. An approach based on the KNN algorithm was employed, mainly because it was previously applied to other metalearning problems with successful results [5]. In the future, it is necessary to investigate whether better results can be obtained with different methods.

Here, a set of general metafeatures was used to characterize datasets. However, gene expression classification datasets differ significantly from most other classification problems, namely in terms of their morphology. Therefore, it is expected that better metalearning models can be obtained using metafeatures that are specifically designed for this application domain. Additionally, it has been shown that, although ranking accuracy is an important criterion in the evaluation of metalearning systems during their development, the data analyst is interested in the quality of the results obtained by the selected classifiers [5]. In the case of gene expression analysis, the data analysts are interested not only in the accuracy of the models but also in their interpretability. The authors plan to address these issues in future work.

References

[1] M. H. Asyali, D. Colak, O. Demirkaya, and M. S. Inan. Gene expression profile classification: A review. Current Bioinformatics, 1(1):55-73.
[2] T. H. Bø, B. Dysvik, and I. Jonassen. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3):e34.
[3] A.-L. Boulesteix and K. Strimmer. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1):32-44.
[4] U. M. Braga-Neto and E. R. Dougherty. Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3).
[5] P. Brazdil, C. Soares, and J. da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3).
[6] G. N. Brock, J. R. Shaffer, R. E. Blakesley, M. J. Lotz, and G. C. Tseng. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics, 9(12).
[7] F. S. Collins, E. D. Green, A. E. Guttmacher, and M. S. Guyer. A vision for the future of genomics research. Nature, 422, April.
[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition.
[9] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87.
[10] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, October.
[11] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Ann. Statist., 23:73-102.
[12] X. Huang, W. Pan, S. Grindle, X. Han, Y. Chen, S. J. Park, L. W. Miller, and J. Hall. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics, 6:205.
[13] W. Jiang and R. Simon. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification. Statistics in Medicine, 26.
[14] P. Larrañaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armañanzas, G. Santafé, A. Pérez, and V. Robles. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86-112.
[15] J. W. Lee, J. B. Lee, M. Park, and S. H. Song. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4).
[16] M. Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying expression data. J Biopharm Stat., 14(4).
[17] R. S. Michalski, J. Carbonell, and T. M. Mitchell. Machine Learning: an Artificial Intelligence Approach. Morgan Kaufmann Publishers, Inc.
[18] D. V. Nguyen and D. M. Rocke. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18(9).
[19] C. L. Nutt, D. R. Mani, R. A. Betensky, and P. Tamayo. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res., 63(7):1602-7.
[20] C. Romualdi, S. Campanaro, D. Campagna, B. Celegato, N. Cannata, S. Toppo, G. Valle, and G. Lanfranchi. Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Human Molecular Genetics, 12(8).
[21] G. Russo, C. Zegar, and A. Giordano. Advantages and limitations of microarray technology in human cancer. Oncogene, 22(42), September.
[22] C. Schaffer. A conservation law for generalization performance. In ICML.
[23] M. Schena. DNA Microarrays: A Practical Approach. Practical Approach Series. Oxford University Press, Oxford, England, 1st edition.
[24] D. K. Slonim, P. Tamayo, J. P. Mesirov, T. R. Golub, and E. S. Lander. Class prediction and discovery using gene expression data. In RECOMB.
[25] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5).
[26] A. Tarca, R. Romero, and S. Draghici. Analysis of microarray experiments of gene expression profiling. Am J Obstet Gynecol., 195(2):373-88, August.
[27] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99(10), May.
[28] B. Tjaden and J. Cohen. A survey of computational methods used in microarray data interpretation. Applied Mycology and Biotechnology, Volume 6: Bioinformatics:1-18.
[29] V. N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc.
[30] X. J. Yang, M.-H. Tan, H. L. Kim, et al. A molecular classification of papillary renal cell carcinoma. Cancer Research, 65(13), July 2005.
