Metalearning for gene expression data classification


Bruno F. de Souza and André de Carvalho
ICMC-University of Sao Paulo, Sao Carlos, Brazil

Carlos Soares
LIAAD-INESC Porto LA/Fac. de Economia, Universidade do Porto, Portugal

Abstract

Machine Learning techniques have been widely applied to the problem of class prediction in microarray data. Nevertheless, current approaches to selecting appropriate methods for this task are often unsatisfactory in many ways, which creates a need for tools that automate the process. In this context, the authors introduce the use of metalearning in the specific domain of gene expression classification. Experiments with the KNN-ranking method for algorithm recommendation applied to 49 datasets yielded successful results.

1. Introduction

With the completion of a number of genomic sequencing programs, a wealth of biological data has become available, creating unprecedented opportunities to better understand the processes that govern living systems [7]. An area that could greatly benefit from post-genomic research is health care, through the identification of genetic artifacts that may be related to pathological states. In fact, the fight against cancer has already achieved promising advances, mainly due to the introduction of new large-scale gene expression analysis technologies, such as microarrays [21]. Microarrays are hybridization-based methods that allow the expression levels of thousands of genes to be monitored simultaneously [23]. This enables the measurement of the levels of mRNA molecules inside a cell and, consequently, of the proteins being produced. Therefore, the role of the genes of a cell at a given moment, and under given circumstances, can be better understood by assessing their expression levels.
Acknowledgments: The authors acknowledge the support from FAPESP, CNPq and FCT (project Triana - POCT/TRA/61001/ and Programa de Financiamento Plurianual de Unidades de I&D).

In order to acquire qualitatively interesting information from microarray experiments, one usually employs computational tools [28]. Specifically, Machine Learning (ML) algorithms [17] have been widely considered, mainly due to their ability to automatically extract patterns from data. One of the most promising ML tasks in this context is supervised classification [26]. Basically, it is used to identify the class membership of a sample based on its gene expression profile (e.g. normal versus cancerous tissues). One of the earliest applications of ML to the classification of microarray data was the successful design of a classifier to distinguish between patients with related types of leukemia [10]. Following this approach, various studies confirmed the applicability of ML in gene expression domains [9, 20, 16, 15, 25, 12]. An analysis of their results indicates that no single algorithm performs better than the others in all cases. This situation is related to the no-free-lunch theorem [22] and it emphasizes the need for a careful selection of the algorithm to be used on each specific problem.

The current rules of thumb for algorithm selection rely on costly trial-and-error procedures or expert advice [5]. Neither approach may be satisfactory to the end user, typically a biologist or clinician, who intends to analyze microarray data more directly and cost-effectively. So, in practice, the choice of ML algorithm is basically determined by the familiarity of the user with the algorithm rather than by the particularities of the data and of the algorithms themselves. This may lead to suboptimal results, compromising the whole experimental setup. Therefore, a system that is able to automatically predict the performance of algorithms on new problems is highly desirable. One approach for that is metalearning [5].
The term generally refers to techniques that exploit expertise acquired in the process of applying ML algorithms in order to increase the quality of results obtained in future applications [5]. This work focuses on a particular perspective of metalearning, which is concerned with inducing metamodels that relate the characteristics of the problems to the performance of ML algorithms. The metamodels are then used to support algorithm selection for new problems.

Metalearning has been successfully applied for algorithm recommendation on sets of diverse classification problems [5]. However, it has never been tested on problems of a single, specific domain. Therefore, the goal of this work is to test whether it is possible to successfully apply metalearning to problems of a single domain. The domain chosen for this study is the classification of gene expression data. Not only is this an important ML application, as argued earlier, but it also has a number of idiosyncrasies, such as the morphology of the data. The metalearning algorithm used here is the KNN ranking method [5]. This method is particularly useful for algorithm recommendation because it generates a ranking of the algorithms for a given dataset, based on the expected performance of those algorithms.

This document is organized as follows. Section 2 presents an overview of the application of ML algorithms to the classification of gene expression data. It also provides quantitative arguments to justify the use of metalearning in this domain. In Section 3, the general architecture of the metalearning system employed is explained. Section 4 discusses the experimental results obtained. Finally, Section 5 draws the conclusions of this work and points out future research directions.

2. Gene expression data classification

Traditional methods for cancer classification rely essentially on the tumor's morphological appearance [10] and on the tissue of its origin [24]. However, there is no assurance that similar tumors will have the same clinical development and, therefore, will demand a similar course of action. To permit a deeper understanding of the tissues being analyzed and to try to achieve a finer distinction of cancers, Golub et al. [10] applied microarrays to define a genetic portrait of tissues from two types of acute leukemia, AML and ALL. With a simple weighted voting scheme, the authors were able to correctly classify most of the samples. That seminal work introduced the ML community to gene expression data classification.
In this context, tissue samples $\mathbf{x}_i$ are multidimensional observations represented by $m$ genes $x_{i,j}$ and have an associated class $y_i$ (e.g. the presence or absence of a disease). The task of an ML technique is to learn a discriminant function $f(\mathbf{x}_i)$, induced from a training set $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, such that it is able to relate $\mathbf{x}_i$ to the corresponding $y_i$ and to exploit this relation to classify previously unseen tissue samples. An interesting point to note is the usual disproportion between the very large number of genes and the small number of tissue samples. Many studies in the literature have demonstrated the efficacy of ML within this framework. Two of those studies are discussed next.

Nutt et al. have studied the feasibility of using gene expression data to classify high-grade gliomas [19]. They proposed an approach based on a k-Nearest Neighbors (KNN) classifier that was able to discriminate high-grade, nonclassic glial tumors objectively and reproducibly, outperforming the naive histopathology-based classification. The dataset used consists of 50 expression profiles obtained from Affymetrix high-density oligonucleotide microarrays containing probes for about genes.

In an effort to improve the understanding of the molecular basis of Papillary Renal Cell Carcinoma (PRCC), Yang et al. [30] have studied the gene expression profiles of 34 cases of PRCC from an Affymetrix array with probe sets. Using unsupervised analyses, they were able to identify two highly correlated distinct molecular subclasses with morphological correlation. Through the application of a Prediction Analysis of Microarrays (PAM) classifier over the samples of the two subclasses, the authors were able to achieve a very good cross-validation accuracy when considering a subset of genetic markers.

A broader view on the matter is provided by recent articles.
Larrañaga et al. [14] present an extensive review of the application of machine learning methods in bioinformatics, with a section devoted to class prediction. They discuss issues such as how to assess and compare the performance of ML algorithms, the problem of feature selection and the most representative classification paradigms, with examples of their application. Asyali et al. [1] focused exclusively on gene expression data and provided a more critical survey of ML methods in this context, along with the implications of the findings. In their work, key points of microarray analysis, e.g. preprocessing and classifier design, were covered and examined. This is important since, according to the authors, there has been a dramatic increase in studies related to gene expression profile classification over the last years. Such interest raises the issue of which type of classifier should be applied, an issue which, as argued next, should be carefully addressed.

In current practice, the most used ML algorithms for gene expression classification are [1]: KNN, SVMs, Decision Trees (such as CART), PAM, Neural Networks, FLDA, DLDA and DQDA. Comparative studies, which inspect the performance of different algorithms over a range of problems, provide some support to users who need to decide which one to use. Dudoit et al. [9] compared the performance of LDA, DLDA, DQDA, Weighted Vote Scheme, KNN, trees and tree-based ensembles on three microarray datasets. Their main conclusion is that simple methods such as DLDA and KNN perform very well in comparison to more sophisticated methods such as tree-based ensembles. Romualdi et al. [20] studied the performance of DLDA, trees, Neural Networks, SVMs, KNN and PAM on two datasets. They were unable to obtain evidence that one of those methods performs better than the others. Man et al. [16] included in their comparison KNN, PCA+LDA, PLS-DA, Neural Networks, random forests and SVM.
Based on experiments with 6 datasets, they concluded that PLS-DA and SVM presented the best results. In a very comprehensive study, using 21 classification methods (including most of the previous approaches) applied to 7 datasets, Lee et al. [15] claimed that no classifier is systematically better than the others. Statnikov et al. [25] compared 3 multi-class classification methods, namely multi-class SVMs, KNN and Neural Networks, on 11 datasets and concluded that SVMs outperformed their competitors. Finally, Huang et al. [12] compared the performance of 5 statistical methods (PLS, penalized PLS, LASSO, PAM and random forests) on 2 datasets and concluded that the algorithms obtain similar results.

As a whole, the aforementioned studies suggest that there is no obvious winning algorithm. Although some methods, such as SVMs, do present a tendency to perform well, none of them is the best on all datasets. In the present work, the authors further investigated this hypothesis. One way to achieve this is to analyze the relative performance of some ML algorithms on various microarray datasets. For each dataset, one can construct a ranking of algorithms based on estimates of their performance. Here it is assumed that better algorithms are ranked higher (i.e., they are assigned ranks closer to 1). The distribution of the ranks of an algorithm over the datasets gives an indication of how well it performs in comparison to the others.

Figure 1 presents the distribution of ranks for the seven algorithms and 49 datasets used in this work (Section 4). Each bar indicates how many times a given algorithm was ranked in each of the 7 possible positions (see footnote 1), represented by different levels of gray. The figure confirms the previous observation that there is no clear winner, although a few algorithms tend to perform well.

Figure 1. Distribution of rankings.

Footnote 1. One-unit intervals are represented, because the mean rank is assigned to tied algorithms.
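The rank distribution described above can be illustrated with a minimal sketch. It assumes that each algorithm's performance on a dataset is summarized by an error estimate (lower is better) and that tied algorithms receive the mean of the ranks they span. The function names and the error values are hypothetical and are not taken from the paper's experiments.

```python
from collections import Counter


def rank_with_ties(errors):
    """Rank algorithms by error (rank 1 = lowest error).

    Tied algorithms all receive the mean of the positions they occupy.
    """
    order = sorted(range(len(errors)), key=lambda j: errors[j])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next algorithm has the same error.
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2.0
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks


def rank_distribution(error_table):
    """Count, for every algorithm, how often it obtained each rank.

    error_table maps a dataset name to a list of errors, one per algorithm.
    """
    n_algorithms = len(next(iter(error_table.values())))
    distribution = [Counter() for _ in range(n_algorithms)]
    for errors in error_table.values():
        for j, rank in enumerate(rank_with_ties(errors)):
            distribution[j][rank] += 1
    return distribution


# Hypothetical error estimates for three algorithms on three datasets.
errors = {
    "d1": [0.10, 0.20, 0.20],
    "d2": [0.30, 0.10, 0.20],
    "d3": [0.05, 0.15, 0.25],
}
dist = rank_distribution(errors)
for j, counts in enumerate(dist):
    print(f"algorithm {j}: {dict(counts)}")
```

Plotting one bar per algorithm from these counts would give a chart of the same kind as Figure 1. The fractional rank values produced by ties are the reason such a plot has to bin ranks into intervals.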
3. Metalearning

As shown in the previous section, the data analyst must carefully select which algorithm to use on each problem in order to obtain satisfactory results. Running an algorithm on a dataset is time consuming, especially when complex tasks with a large volume of data are involved, as is often the case in bioinformatics. Therefore, selecting the algorithm by trying out all alternatives is generally not a viable option.

An alternative approach consists of using a learning algorithm to model the relation between the characteristics of learning problems (e.g., number of examples) and the relative performance of a set of algorithms [5]. Here, the authors refer to this approach as metalearning because one is learning about the performance of learning algorithms. Metalearning models can be used to predict the relative performance of the set of algorithms on a new dataset based on the characteristics of the dataset and without actually running any of the algorithms.

This approach involves three steps: (1) the generation of metadata; (2) the induction of a metalearning model by applying a learning algorithm to the metadata; and (3) the application of the metamodel to support the selection of the algorithms to be used on new datasets. Next, the authors summarize these steps; for a more thorough description, the reader is referred to [5] and references therein.

Metadata

In this context, metadata are data that describe the (relative) performance of the selected algorithms on a set of datasets which were already processed with those algorithms. They consist of a set of meta-examples, each one representing one dataset. Each meta-example consists of attributes and a target. Datasets for metalearning are usually obtained from repositories. The attributes, which are also known as metafeatures, are measures that characterize the datasets. These measures represent general properties of the data which are expected to affect the performance of the algorithms.
A few examples of commonly used metafeatures are the number of examples, the proportion of symbolic attributes, class entropy and the mean correlation between attributes. These are examples of what are usually referred to as general, statistical and information-theoretic metafeatures [5].

The target represents the relative performance of the algorithms on the dataset. Many metalearning approaches to the problem of algorithm recommendation handle it as a supervised classification task. The recommendation provided to the user consists of a single algorithm and the target variable is, thus, a nominal attribute containing the algorithm that achieved the best performance on the corresponding dataset. However, this is not the most adequate form of recommendation for this problem. It does not provide any further guidance when the user is not satisfied with the results obtained with the recommended algorithm. Although, as stated earlier, executing all the algorithms is not a viable strategy, it is often the case that the available computational resources are sufficient to run more than one of the available algorithms. If the recommendation indicates the order in which the algorithms should be executed, then the user can execute as many as possible, thus increasing the probability that a satisfactory result is obtained. Therefore, the problem of algorithm recommendation should be tackled as a ranking task [5], which is discussed in the following section.

KNN-ranking Method

The metadata, as presented in the previous section, consist of a set of meta-examples that are described by a set of metafeatures and by a target consisting of a ranking of the ML algorithms, which is referred to as the target ranking. This learning problem is similar to the problem of supervised classification. The difference is that, given a new example described by the values of the attributes, the objective in classification is to predict the class it belongs to, while the objective in ranking is to predict the order of the classes as applicable to that example.

An algorithm that has previously been adapted for learning rankings and applied to the metalearning problem with successful results is the k-Nearest Neighbors (KNN) algorithm [5]. The difference between ranking and classification lies only in the target. Therefore, any common distance function (e.g. the Euclidean distance considered here) can be used by KNN to measure the similarity between examples. After selecting k neighbors, the corresponding target rankings must be aggregated to generate a prediction. In classification, this is achieved by predicting the most frequent class among the selected examples. In ranking, a simple approach is to aggregate the k target rankings with the Average Ranks (AR) method [5].
Let $R_{i,j}$ be the rank of base-algorithm $a_j$ ($j = 1, \ldots, n$) on dataset $i$, where $n$ is the number of algorithms and $i = 1, \ldots, k$ indexes the selected neighbors. The average rank for each $a_j$ is:

$$\bar{R}_j = \frac{\sum_{i=1}^{k} R_{i,j}}{k}$$

The final ranking is obtained by ordering the average ranks and assigning ranks to the algorithms accordingly.

Evaluation and Application

The metamodel can then be used to support the data analyst in selecting the algorithm to use on a new dataset. To do this, it is first necessary to compute the metafeatures for the new dataset; the ranking of the algorithms can then be predicted using the KNN method. However, to convince data analysts to apply a metalearning approach in practice, it is necessary to produce evidence that it is able to generate accurate predictions. One approach is to use Leave-one-out Cross Validation (LOOCV), which consists of iteratively, for each meta-example, computing the accuracy of the ranking predicted by a metamodel obtained from all the remaining meta-examples [5]. To measure ranking accuracy, the authors have used Spearman's Rank Correlation Coefficient, $r_S$, which is given by the expression:

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} \left(R(X_i) - R(Y_i)\right)^2}{n^3 - n}$$

where $X$ and $Y$ are two sets of $n$ values and $R(X_i)$ represents the rank of element $i$ in the series $X$. The coefficient simply evaluates the monotonicity of two sets of values, i.e., whether their variations are related. The value of 1 represents perfect agreement, and -1 perfect disagreement (i.e., the rankings are inverted). A correlation of 0 means that the rankings are not related, which would be the expected score of the random ranking method [5].

To determine whether the accuracy of some particular recommended ranking can be regarded as high or not, a baseline method is required. In machine learning, simple prediction strategies are usually employed to set a baseline for more complex methods. For instance, a baseline commonly used in classification is the most frequent class in the dataset, referred to as the default class.
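The KNN-ranking prediction, the Average Ranks aggregation, Spearman's coefficient and the LOOCV protocol can be put together in a short sketch. This is only an illustration under stated assumptions: the function names, the two-dimensional metafeature vectors and the target rankings are hypothetical, ties among average ranks are resolved with mean ranks, and the Spearman expression is the simplified form given in the text, which is exact only when there are no ties.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two metafeature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_ranks(values):
    """Rank values in ascending order (rank 1 = smallest), mean rank for ties."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for t in range(pos, end + 1):
            ranks[order[t]] = (pos + 1 + end + 1) / 2.0
        pos = end + 1
    return ranks


def average_ranks(rankings):
    """AR method: average each algorithm's rank, then rank the averages."""
    k = len(rankings)
    n = len(rankings[0])
    averages = [sum(r[j] for r in rankings) / k for j in range(n)]
    return to_ranks(averages)


def knn_ranking(query, meta_x, meta_rankings, k):
    """Predict a ranking for `query` from its k nearest meta-examples."""
    nearest = sorted(range(len(meta_x)),
                     key=lambda i: euclidean(query, meta_x[i]))[:k]
    return average_ranks([meta_rankings[i] for i in nearest])


def spearman(r1, r2):
    """Spearman's coefficient between two rankings of the same n items."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n ** 3 - n)


def loocv_accuracy(meta_x, meta_rankings, k):
    """Mean Spearman correlation of KNN-ranking under leave-one-out."""
    scores = []
    for i in range(len(meta_x)):
        rest_x = meta_x[:i] + meta_x[i + 1:]
        rest_r = meta_rankings[:i] + meta_rankings[i + 1:]
        predicted = knn_ranking(meta_x[i], rest_x, rest_r, k)
        scores.append(spearman(predicted, meta_rankings[i]))
    return sum(scores) / len(scores)


# Hypothetical metadata: 4 datasets, 2 metafeatures, 3 algorithms.
meta_x = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
meta_rankings = [[1, 2, 3], [1, 3, 2], [3, 2, 1], [3, 1, 2]]

print(knn_ranking([0.05, 0.0], meta_x, meta_rankings, k=2))
print(loocv_accuracy(meta_x, meta_rankings, k=1))
```

In practice the metafeatures would typically be rescaled before computing distances, so that no single measure dominates the Euclidean distance. That step is omitted here for brevity.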
The baseline is typically obtained by summarizing the values of the target variable for all the examples in the dataset. In ranking, a similar approach consists of applying the Average Ranks (AR) method to all the target rankings in the metadata. The ranking obtained is called the default ranking.

4. Experimental results

Datasets

The metadata employed in this work came from 49 publicly available microarray datasets. They are related to disease diagnosis. Mainly, the task is discriminating either between normal and tumor cases or between different types of tumor. The datasets present very diverse characteristics concerning the number of examples, the number of genes and the number of classes. Due to space constraints, the datasets are not described here, but full descriptions can be retrieved from br\ bferes.

Two preprocessing operations were performed. As some datasets presented missing values, imputation was done using the Least Square Adaptation method [2], following the recommendation of Brock et al. [6]. Additionally, all attributes were normalized to have mean 0 and variance 1. The normalization parameters are first computed on the training data, and the test data are then rescaled accordingly.

ML algorithms

Based on the comparative studies of ML algorithms for gene expression classification presented in Section 2, seven classifiers were selected mainly according to two criteria, performance and training time: they are relatively fast to train and present adequate error rates on at least some datasets. The methods are: Diagonal Linear Discriminant Analysis (DLDA) [9], Diagonal Quadratic Discriminant Analysis (DQDA) [9], Prediction Analysis of Microarrays (PAM) [27], the 3-Nearest Neighbor (3-NN) [8], Support Vector Machines (SVM) [29] with linear and RBF kernels, and Penalized Discriminant Analysis (PDA) [11].

An important issue within the metalearning framework is the estimation of the performance of the classifiers. In the context of gene expression data, this is the subject of ongoing research, with no widely accepted methodology [4, 13]. In the present work, the .632+ estimator is used, following the suggestion of Braga-Neto and Dougherty [4].

Metafeatures

Although the datasets used are different from traditional classification datasets, ten metafeatures originally developed for that kind of dataset in the StatLog project were used [5]. As some of those may not be suitably applicable due to the high dimensionality of the data, their calculation was preceded by a data reduction step. Here, Partial Least Squares (PLS) was employed, mainly due to its good results in a number of microarray studies (see [3] and references therein) and to its low computational cost. The number of PLS components considered was 3, which seems to be adequate for preserving discrimination power in the context of expression data [18]. The measures are given next. More information can be found in [5].

1. Log of number of examples
2. Log of number of features
3. Log of number of classes
4. Mean absolute skewness
5. Mean kurtosis
6. Geometric mean ratio of the standard deviations of individual populations to the pooled standard deviations
7. First canonical correlation
8. Proportion of total variation explained by the first canonical correlation
9. Normalized class entropy
10. Average absolute correlation between continuous attributes, per class

Results

The experiments conducted in this work followed the evaluation and application guidelines presented in Section 3. The main results obtained here are illustrated in Figure 2. The points represent the mean ranking accuracy over the 49 datasets according to the LOOCV approach, varying the number k of nearest neighbors from 1 to 20. The smallest values of k present the lowest performance (69.8% mean accuracy). Accuracy then increases with k until it reaches a saturation point, remains roughly constant, and then gradually drops. Here, k = 4 and k = 5 give similar results (both roughly 78.3% mean accuracy). In any case, the KNN ranking method clearly outperforms the default ranking (dashed line; 59.9% mean accuracy), generating rankings that are, on average, more correlated with the ideal rankings. This indicates that metalearning can be successfully applied to recommend algorithms for gene expression analysis.

Figure 2. Ranking accuracy of KNN.

Additionally, these results are somewhat different from the ones reported on algorithm recommendation experiments with general classification datasets [5]. In that work, the best results were achieved with a very small k (1 or 2) and the KNN ranking method quickly became worse than the default ranking. This may be explained by the fact that, being from the same application domain, the datasets used here are more homogeneous. Therefore, the KNN algorithm has a smoother behavior with varying k.

5. Conclusions

This paper presents an empirical analysis of the performance of a metalearning method on the problem of recommending learning algorithms for gene expression classification. Metalearning has been successfully applied to general classification problems. However, it had never been applied to a restricted domain, such as gene expression data.
The results presented here show that it is possible to use metalearning to recommend classifiers for gene expression data. It was observed that the behavior of the metalearning algorithm is actually smoother than when applied to general classification problems.

An approach based on the KNN algorithm was employed, mainly because it was previously applied to other metalearning problems with successful results [5]. In the future, it is necessary to investigate whether better results can be obtained with different methods.

Here, a set of general metafeatures was used to characterize datasets. However, gene expression classification datasets differ significantly from most other classification problems, namely in terms of their morphology. Therefore, it is expected that better metalearning models can be obtained using metafeatures that are specifically designed for this application domain.

Additionally, it has been shown that, although ranking accuracy is an important criterion in the evaluation of metalearning systems during their development, the data analyst is interested in the quality of the results obtained by the selected classifiers [5]. In the case of gene expression analysis, data analysts are interested not only in the accuracy of the models but also in their interpretability. The authors plan to address these issues in their future work.

References

[1] M. H. Asyali, D. Colak, O. Demirkaya, and M. S. Inan. Gene expression profile classification: A review. Current Bioinformatics, 1(1):55-73.
[2] T. H. Bø, B. Dysvik, and I. Jonassen. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3):e34.
[3] A.-L. Boulesteix and K. Strimmer. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1):32-44.
[4] U. M. Braga-Neto and E. R. Dougherty. Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3).
[5] P. Brazdil, C. Soares, and J. da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3).
[6] G. N. Brock, J. R. Shaffer, R. E. Blakesley, M. J. Lotz, and G. C. Tseng. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics, 9(12).
[7] F. S. Collins, E. D. Green, A. E. Guttmacher, and M. S. Guyer. A vision for the future of genomics research. Nature, 422, April.
[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition.
[9] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87.
[10] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, October.
[11] T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. Ann. Statist., 23:73-102.
[12] X. Huang, W. Pan, S. Grindle, X. Han, Y. Chen, S. J. Park, L. W. Miller, and J. Hall. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics, 6:205.
[13] W. Jiang and R. Simon. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification. Statistics in Medicine, 26.
[14] P. Larrañaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armañanzas, G. Santafé, A. Pérez, and V. Robles. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86-112.
[15] J. W. Lee, J. B. Lee, M. Park, and S. H. Song. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4).
[16] M. Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying expression data. J Biopharm Stat., 14(4).
[17] R. S. Michalski, J. Carbonell, and T. M. Mitchell. Machine Learning: an Artificial Intelligence Approach. Morgan Kaufmann Publishers, Inc.
[18] D. V. Nguyen and D. M. Rocke. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18(9).
[19] C. L. Nutt, D. R. Mani, R. A. Betensky, and P. Tamayo. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res., 63(7):1602-7.
[20] C. Romualdi, S. Campanaro, D. Campagna, B. Celegato, N. Cannata, S. Toppo, G. Valle, and G. Lanfranchi. Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Human Molecular Genetics, 12(8).
[21] G. Russo, C. Zegar, and A. Giordano. Advantages and limitations of microarray technology in human cancer. Oncogene, 22(42), September.
[22] C. Schaffer. A conservation law for generalization performance. In ICML.
[23] M. Schena. DNA Microarrays: A Practical Approach. Practical Approach Series. Oxford University Press, Oxford, England, 1st edition.
[24] D. K. Slonim, P. Tamayo, J. P. Mesirov, T. R. Golub, and E. S. Lander. Class prediction and discovery using gene expression data. In RECOMB.
[25] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5).
[26] A. Tarca, R. Romero, and S. Draghici. Analysis of microarray experiments of gene expression profiling. Am J Obstet Gynecol., 195(2):373-88, August.
[27] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99(10), May.
[28] B. Tjaden and J. Cohen. A survey of computational methods used in microarray data interpretation. Applied Mycology and Biotechnology, Volume 6: Bioinformatics:1-18.
[29] V. N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc.
[30] X. J. Yang, M.-H. Tan, H. L. Kim, et al. A molecular classification of papillary renal cell carcinoma. Cancer Research, 65(13), July 2005.


Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data Pablo A. Jaskowiak and Ricardo J. G. B. Campello Department of Computer Sciences University

More information

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista Graphical Methods for Classifier Performance Evaluation Maria Carolina Monard and Gustavo E. A. P. A. Batista University of São Paulo USP Institute of Mathematics and Computer Science ICMC Department of

More information

A Hybrid Approach for Gene Selection and Classification using Support Vector Machine

A Hybrid Approach for Gene Selection and Classification using Support Vector Machine The International Arab Journal of Information Technology, Vol. 1, No. 6A, 015 695 A Hybrid Approach for Gene Selection and Classification using Support Vector Machine Jaison Bennet 1, Chilambuchelvan Ganaprakasam

More information

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005 Biological question Experimental design Microarray experiment Failed Lecture : Discrimination (cont) Quality Measurement Image analysis Preprocessing Jane Fridlyand Pass Normalization Sample/Condition

More information

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology Lecture 2: Microarray analysis Genome wide measurement of gene transcription using DNA microarray Bruce Alberts, et al., Molecular Biology

More information

Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking

Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking Raneel Kumar, Krishnil Chand, Sunil Pranit Lal School of Computing, Information, and Mathematical Sciences University

More information

A Genetic Algorithm Approach to DNA Microarrays Analysis of Pancreatic Cancer

A Genetic Algorithm Approach to DNA Microarrays Analysis of Pancreatic Cancer A Genetic Algorithm Approach to DNA Microarrays Analysis of Pancreatic Cancer Nicolae Teodor MELITA 1, Stefan HOLBAN 2 1 Politehnica University of Timisoara, Faculty of Automation and Computers, Bd. V.

More information

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Final Project Report Alexander Herrmann Advised by Dr. Andrew Gentles December

More information

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING Abbas Heiat, College of Business, Montana State University, Billings, MT 59102, aheiat@msubillings.edu ABSTRACT The purpose of this study is to investigate

More information

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA advanced analysis of gene expression microarray data aidong zhang State University of New York at Buffalo, USA World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI Contents

More information

SELECTING GENES WITH DISSIMILAR DISCRIMINATION STRENGTH FOR SAMPLE CLASS PREDICTION

SELECTING GENES WITH DISSIMILAR DISCRIMINATION STRENGTH FOR SAMPLE CLASS PREDICTION July 3, 26 6:34 WSPC/Trim Size: in x 8.5in for Proceedings eegs apbc27 SELECTING GENES WITH DISSIMILAR DISCRIMINATION STRENGTH FOR SAMPLE CLASS PREDICTION Zhipeng Cai, Randy Goebel, Mohammad R. Salavatipour,

More information

Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition

Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition Hieu Trung Huynh 1, Jung-Ja Kim 2 and Yonggwan Won 1 1 Department of Computer Engineering,

More information

BIOINFORMATICS THE MACHINE LEARNING APPROACH

BIOINFORMATICS THE MACHINE LEARNING APPROACH 88 Proceedings of the 4 th International Conference on Informatics and Information Technology BIOINFORMATICS THE MACHINE LEARNING APPROACH A. Madevska-Bogdanova Inst, Informatics, Fac. Natural Sc. and

More information

A Genetic Approach for Gene Selection on Microarray Expression Data

A Genetic Approach for Gene Selection on Microarray Expression Data A Genetic Approach for Gene Selection on Microarray Expression Data Yong-Hyuk Kim 1, Su-Yeon Lee 2, and Byung-Ro Moon 1 1 School of Computer Science & Engineering, Seoul National University Shillim-dong,

More information

Our view on cdna chip analysis from engineering informatics standpoint

Our view on cdna chip analysis from engineering informatics standpoint Our view on cdna chip analysis from engineering informatics standpoint Chonghun Han, Sungwoo Kwon Intelligent Process System Lab Department of Chemical Engineering Pohang University of Science and Technology

More information

Hybrid Intelligent Systems for DNA Microarray Data Analysis

Hybrid Intelligent Systems for DNA Microarray Data Analysis Hybrid Intelligent Systems for DNA Microarray Data Analysis November 27, 2007 Sung-Bae Cho Computer Science Department, Yonsei University Soft Computing Lab What do I think with Bioinformatics? Biological

More information

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter VizX Labs, LLC Seattle, WA 98119 Abstract Oligonucleotide microarrays were used to study

More information

APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION. A Thesis. Presented to

APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION. A Thesis. Presented to APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 16 Reader s reaction to Dimension Reduction for Classification with Gene Expression Microarray Data by Dai et al

More information

Application of Decision Trees in Mining High-Value Credit Card Customers

Application of Decision Trees in Mining High-Value Credit Card Customers Application of Decision Trees in Mining High-Value Credit Card Customers Jian Wang Bo Yuan Wenhuang Liu Graduate School at Shenzhen, Tsinghua University, Shenzhen 8, P.R. China E-mail: gregret24@gmail.com,

More information

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer Abstract More often than not sales forecasting in modern companies is poorly implemented despite the wealth of data that is readily available

More information

Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis

Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis Bagged Ensembles of Support Vector Machines for Gene Expression Data Analysis Giorgio Valentini INFM, Istituto Nazionale di Fisica della Materia, DSI, Dip. di Scienze dell Informazione Università degli

More information

Classifying Gene Expression Data using an Evolutionary Algorithm

Classifying Gene Expression Data using an Evolutionary Algorithm Classifying Gene Expression Data using an Evolutionary Algorithm Thanyaluk Jirapech-umpai E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science School of Informatics University of Edinburgh

More information

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data Predictive Genomics, Biology, Medicine Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data Ex. Find mean m and standard deviation s for

More information

A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET

A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET 1 J.JEYACHIDRA, M.PUNITHAVALLI, 1 Research Scholar, Department of Computer Science and Applications,

More information

Revealing Predictive Gene Clusters with Supervised Algorithms

Revealing Predictive Gene Clusters with Supervised Algorithms DSC 23 Working Papers (Draft Versions) http://www.ci.tuwien.ac.at/conferences/dsc-23/ Revealing Predictive Gene Clusters with Supervised Algorithms Marcel Dettling Seminar für Statistik ETH Zürich CH-892

More information

Bioinformatics : Gene Expression Data Analysis

Bioinformatics : Gene Expression Data Analysis 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering What is Bioinformatics Broad Definition The study of how information technologies are used

More information

Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development

Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos

More information

DNA Gene Expression Classification with Ensemble Classifiers Optimized by Speciated Genetic Algorithm

DNA Gene Expression Classification with Ensemble Classifiers Optimized by Speciated Genetic Algorithm DNA Gene Expression Classification with Ensemble Classifiers Optimized by Speciated Genetic Algorithm Kyung-Joong Kim and Sung-Bae Cho Department of Computer Science, Yonsei University, 134 Shinchon-dong,

More information

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics Jianlin Cheng, PhD Computer Science Department and Informatics Institute University of

More information

An Empirical Study of Univariate and GA-Based Feature Selection in Binary Classification with Microarray Data

An Empirical Study of Univariate and GA-Based Feature Selection in Binary Classification with Microarray Data An Empirical Study of Univariate and GA-Based Feature Selection in Binary Classification with Microarray Data Mike Lecocke and Kenneth Hess 2nd March 2005 Abstract Motivation: Feature subset selection

More information

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE CHAPTER1 ROAD TO STATISTICAL BIOINFORMATICS Jae K. Lee Department of Public Health Science, University of Virginia, Charlottesville, Virginia, USA There has been a great explosion of biological data and

More information

MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE

MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE Wala Abedalkhader and Noora Abdulrahman Department of Engineering Systems and Management, Masdar Institute of Science and Technology, Abu Dhabi, United

More information

Feature Selection of Gene Expression Data for Cancer Classification: A Review

Feature Selection of Gene Expression Data for Cancer Classification: A Review Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 50 (2015 ) 52 57 2nd International Symposium on Big Data and Cloud Computing (ISBCC 15) Feature Selection of Gene Expression

More information

Gene Expression Data Analysis

Gene Expression Data Analysis Gene Expression Data Analysis Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu BMIF 310, Fall 2009 Gene expression technologies (summary) Hybridization-based

More information

Weighted Top Score Pair Method for Gene Selection and Classification

Weighted Top Score Pair Method for Gene Selection and Classification Weighted Top Score Pair Method for Gene Selection and Classification Huaien Luo 1, Yuliansa Sudibyo 2,, Lance D. Miller 1, and R. Krishna Murthy Karuturi 1, 1 Genome Institute of Singapore, Singapore 2

More information

A Comparative Study of Microarray Data Analysis for Cancer Classification

A Comparative Study of Microarray Data Analysis for Cancer Classification A Comparative Study of Microarray Data Analysis for Cancer Classification Kshipra Chitode Research Student Government College of Engineering Aurangabad, India Meghana Nagori Asst. Professor, CSE Dept Government

More information

Machine Learning Models for Classification of Lung Cancer and Selection of Genomic Markers Using Array Gene Expression Data

Machine Learning Models for Classification of Lung Cancer and Selection of Genomic Markers Using Array Gene Expression Data Machine Learning Models for Classification of Lung Cancer and Selection of Genomic Markers Using Array Gene Expression Data C.F. Aliferis 1, I. Tsamardinos 1, P.P. Massion 2, A. Statnikov 1, N. Fananapazir

More information

An Efficient and Effective Immune Based Classifier

An Efficient and Effective Immune Based Classifier Journal of Computer Science 7 (2): 148-153, 2011 ISSN 1549-3636 2011 Science Publications An Efficient and Effective Immune Based Classifier Shahram Golzari, Shyamala Doraisamy, Md Nasir Sulaiman and Nur

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Functional Genomics: Microarray Data Analysis Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Outline Introduction Working with microarray data Normalization Analysis

More information

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT ANALYTICAL MODEL DEVELOPMENT AGENDA Enterprise Miner: Analytical Model Development The session looks at: - Supervised and Unsupervised Modelling - Classification

More information

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest 1. Introduction Reddit is a social media website where users submit content to a public forum, and other

More information

Analysis of a Proposed Universal Fingerprint Microarray

Analysis of a Proposed Universal Fingerprint Microarray Analysis of a Proposed Universal Fingerprint Microarray Michael Doran, Raffaella Settimi, Daniela Raicu, Jacob Furst School of CTI, DePaul University, Chicago, IL Mathew Schipma, Darrell Chandler Bio-detection

More information

Gene Selection in Cancer Classification using PSO/SVM and GA/SVM Hybrid Algorithms

Gene Selection in Cancer Classification using PSO/SVM and GA/SVM Hybrid Algorithms Laboratoire d Informatique Fondamentale de Lille Gene Selection in Cancer Classification using PSO/SVM and GA/SVM Hybrid Algorithms Enrique Alba, José GarcíaNieto, Laetitia Jourdan and ElGhazali Talbi

More information

ARTIFICIAL IMMUNE SYSTEM CLASSIFICATION OF MULTIPLE- CLASS PROBLEMS

ARTIFICIAL IMMUNE SYSTEM CLASSIFICATION OF MULTIPLE- CLASS PROBLEMS 1 ARTIFICIAL IMMUNE SYSTEM CLASSIFICATION OF MULTIPLE- CLASS PROBLEMS DONALD E. GOODMAN, JR. Mississippi State University Department of Psychology Mississippi State, Mississippi LOIS C. BOGGESS Mississippi

More information

A comparison of Multiple Biomarker Selection Algorithms for Early Screening of Ovarian Cancer

A comparison of Multiple Biomarker Selection Algorithms for Early Screening of Ovarian Cancer A comparison of Multiple Biomarker Selection Algorithms for Early Screening of Ovarian Cancer Yu-Seop Kim 1,3, Jong-Dae Kim 1,3, Min-Ki Jang 2,3, Chan-Young Park 1,3, and Hye-Jung Song 1,3 1 Dept. of Ubiquitous

More information

Reliable classification of two-class cancer data using evolutionary algorithms

Reliable classification of two-class cancer data using evolutionary algorithms BioSystems 72 (23) 111 129 Reliable classification of two-class cancer data using evolutionary algorithms Kalyanmoy Deb, A. Raji Reddy Kanpur Genetic Algorithms Laboratory (KanGAL), Indian Institute of

More information

Introduction to Bioinformatics. Fabian Hoti 6.10.

Introduction to Bioinformatics. Fabian Hoti 6.10. Introduction to Bioinformatics Fabian Hoti 6.10. Analysis of Microarray Data Introduction Different types of microarrays Experiment Design Data Normalization Feature selection/extraction Clustering Introduction

More information

Classification of DNA Sequences Using Convolutional Neural Network Approach

Classification of DNA Sequences Using Convolutional Neural Network Approach UTM Computing Proceedings Innovations in Computing Technology and Applications Volume 2 Year: 2017 ISBN: 978-967-0194-95-0 1 Classification of DNA Sequences Using Convolutional Neural Network Approach

More information

Microarrays & Gene Expression Analysis

Microarrays & Gene Expression Analysis Microarrays & Gene Expression Analysis Contents DNA microarray technique Why measure gene expression Clustering algorithms Relation to Cancer SAGE SBH Sequencing By Hybridization DNA Microarrays 1. Developed

More information

Machine Learning Methods for Microarray Data Analysis

Machine Learning Methods for Microarray Data Analysis Harvard-MIT Division of Health Sciences and Technology HST.512: Genomic Medicine Prof. Marco F. Ramoni Machine Learning Methods for Microarray Data Analysis Marco F. Ramoni Children s Hospital Informatics

More information

Lymphoma Cancer Classification Using Genetic Programming with SNR Features

Lymphoma Cancer Classification Using Genetic Programming with SNR Features Lymphoma Cancer Classification Using Genetic Programming with SNR Features JinHyuk Hong and SungBae Cho Dept. of Computer Science, Yonsei University, 134 Shinchondong, Sudaemoonku, Seoul 120749, Korea

More information

Modeling gene expression data via positive Boolean functions

Modeling gene expression data via positive Boolean functions Modeling gene expression data via positive Boolean functions Francesca Ruffino 1, Marco Muselli 2, Giorgio Valentini 1 1 DSI, Dipartimento di Scienze dell Informazione, Università degli Studi di Milano,

More information

Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data

Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data 2011 International Conference on Information and Electronics Engineering IPCSIT vol.6 (2011) (2011) IACSIT Press, Singapore Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression

More information

Machine Learning in Computational Biology CSC 2431

Machine Learning in Computational Biology CSC 2431 Machine Learning in Computational Biology CSC 2431 Lecture 9: Combining biological datasets Instructor: Anna Goldenberg What kind of data integration is there? What kind of data integration is there? SNPs

More information

Data Mining and Applications in Genomics

Data Mining and Applications in Genomics Data Mining and Applications in Genomics Lecture Notes in Electrical Engineering Volume 25 For other titles published in this series, go to www.springer.com/series/7818 Sio-Iong Ao Data Mining and Applications

More information

Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm

Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm Finding Regularity in Protein Secondary Structures using a Cluster-based Genetic Algorithm Yen-Wei Chu 1,3, Chuen-Tsai Sun 3, Chung-Yuan Huang 2,3 1) Department of Information Management 2) Department

More information

Microarray gene expression ranking with Z-score for Cancer Classification

Microarray gene expression ranking with Z-score for Cancer Classification Microarray gene expression ranking with Z-score for Cancer Classification M.Yasodha, Research Scholar Government Arts College, Coimbatore, Tamil Nadu, India Dr P Ponmuthuramalingam Head and Associate Professor

More information

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology. G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY Methods or systems for genetic

More information

A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data

A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data Thesis by Heba Abusamra In Partial Fulfillment of the Requirements For the Degree of Master of Science King

More information

Molecular Diagnosis Tumor classification by SVM and PAM

Molecular Diagnosis Tumor classification by SVM and PAM Molecular Diagnosis Tumor classification by SVM and PAM Florian Markowetz and Rainer Spang Practical DNA Microarray Analysis Berlin, Nov 2003 Max-Planck-Institute for Molecular Genetics Dept. Computational

More information

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 7, Issue 11, May 2018

ISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 7, Issue 11, May 2018 Application of Machine Learning to Immune Disease Prediction Kuan-Hui Lin, Yuh-Jyh Hu College of Computer Science, National Chiao Tung University, Hsinchu, Taiwan Abstract The intrusion of viruses, germs

More information

Particle Swarm Feature Selection for Microarray Leukemia Classification

Particle Swarm Feature Selection for Microarray Leukemia Classification 2 (2017) 1-8 Progress in Energy and Environment Journal homepage: http://www.akademiabaru.com/progee.html ISSN: 2600-7762 Particle Swarm Feature Selection for Microarray Leukemia Classification Research

More information

Amit Kumar Nandanwar A.P. CSE Department, VNS College, Bhopal, India

Amit Kumar Nandanwar A.P. CSE Department, VNS College, Bhopal, India Volume 6, Issue 4, April 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Support Classifier

More information

A Literature Review of Predicting Cancer Disease Using Modified ID3 Algorithm

A Literature Review of Predicting Cancer Disease Using Modified ID3 Algorithm A Literature Review of Predicting Cancer Disease Using Modified ID3 Algorithm Mr.A.Deivendran 1, Ms.K.Yemuna Rane M.Sc., M.Phil 2., 1 M.Phil Research Scholar, Dept of Computer Science, Kongunadu Arts and

More information

SAS Microarray Solution for the Analysis of Microarray Data. Susanne Schwenke, Schering AG Dr. Richardus Vonk, Schering AG

SAS Microarray Solution for the Analysis of Microarray Data. Susanne Schwenke, Schering AG Dr. Richardus Vonk, Schering AG for the Analysis of Microarray Data Susanne Schwenke, Schering AG Dr. Richardus Vonk, Schering AG Overview Challenges in Microarray Data Analysis Software for Microarray Data Analysis SAS Scientific Discovery

More information

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Team members: David Moskowitz and Emily Tsang Background Transcription factors

More information

Improving Credit Card Fraud Detection using a Meta- Classification Strategy

Improving Credit Card Fraud Detection using a Meta- Classification Strategy Improving Credit Card Fraud Detection using a Meta- Classification Strategy Joseph Pun, Yuri Lawryshyn Department of Applied Chemistry and Engineering, University of Toronto Toronto ABSTRACT One of the

More information

Immune Network based Ensembles

Immune Network based Ensembles Immune Network based Ensembles Nicolás García-Pedrajas 1 and Colin Fyfe 2 1- Dept. of Computing and Numerical Analysis University of Córdoba (SPAIN) e-mail: npedrajas@uco.es 2- the Dept. of Computing University

More information

Tumor Gene Characteristics Selection Method Based on Multi-Agent

Tumor Gene Characteristics Selection Method Based on Multi-Agent Send Orders for Reprints to reprints@benthamscience.ae The Open Cybernetics & Systemics Journal, 2015, 9, 2513-2518 2513 Open Access Tumor Gene Characteristics Selection Method Based on Multi-Agent Yang

More information

Generation of Comprehensible Hypotheses from Gene Expression Data

Generation of Comprehensible Hypotheses from Gene Expression Data Generation of Comprehensible Hypotheses from Gene Expression Data Yuan Jiang, Ming Li, and Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {jiangy,lim,zhouzh}@lamda.nju.edu.cn

More information

AUTOMATIC CANCER DIAGNOSTIC DECISION SUPPORT SYSTEM FOR GENE EXPRESSION DOMAIN. Alexander Statnikov. Thesis. Submitted to the Faculty of the

AUTOMATIC CANCER DIAGNOSTIC DECISION SUPPORT SYSTEM FOR GENE EXPRESSION DOMAIN. Alexander Statnikov. Thesis. Submitted to the Faculty of the AUTOMATIC CANCER DIAGNOSTIC DECISION SUPPORT SYSTEM FOR GENE EXPRESSION DOMAIN By Alexander Statnikov Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment

More information

Support Vector Machines (SVMs) for the classification of microarray data. Basel Computational Biology Conference, March 2004 Guido Steiner
