A MODIFIED ARTIFICIAL BEE COLONY ALGORITHM FOR GENE SELECTION IN CLASSIFYING CANCER


M.Sc. Engg. Thesis

A MODIFIED ARTIFICIAL BEE COLONY ALGORITHM FOR GENE SELECTION IN CLASSIFYING CANCER

by Johra Muhammad Moosa

Submitted to the Department of Computer Science and Engineering in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology (BUET)
Dhaka 1000

August 2015

The thesis titled "A MODIFIED ARTIFICIAL BEE COLONY ALGORITHM FOR GENE SELECTION IN CLASSIFYING CANCER", submitted by Johra Muhammad Moosa, Roll No P, Session April 2012, to the Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering and approved as to its style and contents. Examination held on August 17.

Board of Examiners

1. Dr. M. Kaykobad, Professor, Department of Computer Science and Engineering, BUET, Dhaka (Chairman, Supervisor)
2. Dr. Mohammad Mahfuzul Islam, Professor and Head, Department of Computer Science and Engineering, BUET, Dhaka (Member, Ex-Officio)
3. Dr. Md. Mostofa Akbar, Professor, Department of Computer Science and Engineering, BUET, Dhaka (Member)
4. Rifat Shahriyar, Associate Professor, Department of Computer Science and Engineering, BUET, Dhaka (Member)
5. Dr. Mohammad Rashedur Rahman, Associate Professor, Department of Electrical and Computer Engineering, North South University, Dhaka (Member, External)

Candidate's Declaration

It is hereby declared that the work titled "A modified artificial bee colony algorithm for gene selection in classifying cancer" is the outcome of research carried out by me under the supervision of Dr. M. Kaykobad, in the Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka. It is also declared that neither this thesis nor any part of it has been submitted elsewhere for the award of any degree or diploma.

Johra Muhammad Moosa
Candidate

Acknowledgment

First of all I would like to thank my supervisor, Dr. M. Kaykobad, for assisting me throughout the thesis. Without his continuous supervision, guidance, and advice it would not have been possible to complete this work. I am especially grateful to him for giving us his time whenever we needed it, and for always providing continuous support and motivation in our effort. I would like to take this opportunity to thank Dr. M. Sohel Rahman for introducing me to the amazingly interesting and diverse world of gene selection and bioinformatics. I am indebted to him for his kind support and encouragement at times of disappointment. The work was done under Dr. Sohel Rahman's supervision, and in his absence Dr. Kaykobad took over. I also want to thank the other members of my thesis committee, Dr. Md. Mostofa Akbar and Dr. Rifat Shahriyar, and especially the external member, Dr. Mohammad Rashedur Rahman, for their valuable suggestions. Last but not least, I am grateful to my guardians, family, and friends for their patience, cooperation, and inspiration during this period.

Abstract

The development of cancer diagnostic models from microarray data has become a topic of great interest in bioinformatics and medicine. Only a small number of genes, compared to the total number explored, possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it; besides improving the performance of cancer classification, it can also cut down the time and cost of medical diagnosis. This study presents a modified Artificial Bee Colony (ABC) algorithm to select a minimum number of genes that are deemed significant for cancer while improving predictive accuracy. The search equation of ABC is said to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating pheromone, one of the major components of the Ant Colony Optimization (ACO) algorithm, and by introducing a new operation in which successive bees communicate to share their findings. The proposed algorithm is evaluated on a suite of ten publicly available datasets after its parameters are systematically tuned on one of them. The obtained results are compared with other works that used the same datasets, and the performance of the proposed method proves superior: it provides gene subsets that lead to more accurate classification while the number of selected genes is smaller. The proposed modified ABC algorithm could conceivably be applied to problems in other areas.

Contents

Board of Examiners
Candidate's Declaration
Acknowledgment
Abstract

1 Introduction
   1.1 Gene Selection
   1.2 About Cancer and Microarrays
   1.3 Motivation
       1.3.1 Motivation Behind Gene Selection
       1.3.2 Motivation Behind Applying ABC Algorithm for Gene Selection
       1.3.3 Motivation Behind Improving ABC Algorithm
   1.4 Challenges
   1.5 Organization

2 Background
   2.1 Gene Selection Methods for Cancer Classification
       Filter Methods
       Wrapper Methods
       Embedded Methods
   2.2 Support Vector Machine (SVM)
   2.3 Leave-One-Out Cross-Validation (LOOCV)
   2.4 Preprocessing
       Kruskal-Wallis Rank Sum Test
       F-test
   2.5 Swarm Intelligence
       Ant Colony Optimization (ACO)
       Stagnation
       Artificial Bee Colony Algorithm (ABC)
       Improvements of ABC Algorithm
       Particle Swarm Optimization (PSO)
   Local Search
       Hill Climbing (HC)
       Steepest Ascent Hill Climbing with Replacement (SAHCR)
       Simulated Annealing (SA)
   Selection Procedure
       Tournament Selection (TS)
       Fitness-Proportionate Selection (FPS)
       Stochastic Universal Sampling (SUS)
   Summary

3 Gene Selection for Cancer Classification
   Problem Formulation
   Preprocessing Stage
       Normalization
       Prefiltering
       Preselection of Genes
   Search Method for Gene Selection
       Genetic Algorithm
       ACO Algorithm
       Basic ABC Algorithm
   Modified ABC Algorithm
       Food Source Positions
       Pheromone Initialization
       Employed Bee Phase
       Onlooker Bee Phase
       Scout Bee
       Local Search
       Communication Operator
       Neighborhood Operator
       Tweak Operator
       Fitness
   Pseudocode for the Modified ABC Algorithm
   Summary

4 Experimental Results and Discussion
   4.1 Datasets
   4.2 Parameter Tuning
       Probability of Applying the Communication Operator, r
       Use of Pheromone, uph
       Probability of Local Search, probls
       Neighborhood Operator Destruction Size, nd
       Pheromone Persistence Factor, ρ
       Weight of Accuracy in Fitness, w
       Population Size, PS
       Prefiltering Method
       Selection Method at the Onlooker Bee Stage
       Kernel Method for SVM
       Inertia Weight (w) Update Approach
       Local Search
           Local Search at the Employed Bee Stage
           Local Search at the Onlooker Bee Stage
           Hill Climbing
           Simulated Annealing
           Steepest Ascent Hill Climbing with Replacement
       Percentage of Genes to be Selected from Prefiltering Step, th_n
       Threshold to Select Genes from Prefiltering Step, th_p
       Weight of Individual Bee in Pheromone Deposition Equation, c_o
       Maximum Number of Algorithm Iterations, MAX_ITER
       Number of Trials without Improvement, limit
       Optimized Parameter Values
   4.3 Performance of Different Evolutionary Algorithms
       4.3.1 Genetic Algorithm
       4.3.2 Artificial Bee Colony
       4.3.3 Ant Colony Optimization
   4.4 Comparative Study
       Comparison with Different Metaheuristics
       Comparison with Existing Methods
   4.5 Further Tuning of Parameters
       Second Parameter Settings
       Third Parameter Settings
   4.6 Summary

5 Conclusion
   Future Works
   Contribution Summary

List of Figures

3.1 The flowchart of the modified Artificial Bee Colony Algorithm
Obtained accuracy with different values of probls
Selected gene size with different values of probls
Obtained accuracy with different values of w
Selected gene size with different values of w
Obtained accuracy with different values of limit
Selected gene size with different values of limit
Distribution of classification accuracy for the datasets (a) 9_Tumors; (b) 11_Tumors
Distribution of the number of times the selected gene size falls in a specific range: (a) 9_Tumors; (b) 11_Tumors; (c) Brain_Tumor1; (d) Brain_Tumor2; (e) Leukemia1; (f) Leukemia2; (g) DLBCL; (h) Lung_Cancer; (i) Prostate_Tumor; (j) SRBCT

List of Tables

4.1 Description of the datasets used for experimental evaluation
4.2 Attributes of the datasets used for experimental evaluation
4.3 Default parameter values for tuning
4.4 Performance outcome for different values of parameter r
4.5 Performance outcome for different values of parameter uph
4.6 Performance outcome for different values of parameter probls
4.7 Performance outcome for different values of parameter nd
4.8 Performance outcome for different values of parameter ρ
4.9 Performance outcome for different values of parameter w
4.10 Performance outcome for different values of parameter PS
4.11 Performance outcome for different values of parameter Prefiltering Method
4.12 Performance outcome for different values of parameter Selection Method
4.13 Performance outcome for different values of parameter Kernel
4.14 Performance outcome for different values of parameter Inertia weight update equation
4.15 Performance outcome for different local search methods in employed bee and onlooker bee stage
4.16 Performance outcome for different iteration counts for Hill Climbing in employed bee stage
4.17 Performance outcome for different iteration counts for Simulated Annealing in employed bee stage
4.18 Performance outcome for different temperature (t) values for Simulated Annealing in employed bee stage
4.19 Performance outcome for different values of the parameter schedule for Simulated Annealing in employed bee stage
4.20 Performance outcome for different iteration counts for Steepest Ascent Hill Climbing with Replacement in onlooker bee stage
4.21 Performance outcome for different values of the parameter tweak for Steepest Ascent Hill Climbing with Replacement
4.22 Performance outcome for percentage of genes selected in prefiltering stage, th_n
4.23 Performance outcome for threshold of p-value in prefiltering stage, th_p
4.24 Performance outcome for different values of c
4.25 Performance outcome for different values of parameter MAX_ITER
4.26 Performance outcome for different values of parameter limit
4.27 Optimized parameter values after tuning
4.28 Comparative experimental results of the best subsets produced by mABC using default and optimized parameter settings for different datasets
4.29 Default values for tuning GA parameters
4.30 Performance outcome for different values of PS for GA
4.31 Performance outcome for different values of MAX_ITER for GA
4.32 Performance outcome for different values of r for GA
4.33 Performance outcome for different values of m for GA
4.34 Parameter values for tuning ABC parameters
4.35 Performance of Artificial Bee Colony algorithm in gene selection
4.36 Default values for tuning ACO parameters
4.37 Performance outcome for different values of PS for ACO
4.38 Performance outcome for different values of MAX_ITER for ACO
4.39 Comparative experimental results of the best subsets produced by mABC and other evolutionary methods for the dataset 9_Tumors
4.40 Comparative experimental results of the best subsets produced by mABC and other methods for different datasets
4.41 Second proposed parameter values after tuning
4.42 Comparative experimental results of the best subsets produced by mABC using default, optimized, and second parameter settings for different datasets
4.43 Performance outcome for different values of parameter Selection Method
4.44 Third proposed parameter values after tuning
4.45 Comparative experimental results of the best subsets produced by mABC using optimized, second, and third parameter settings for different datasets
4.46 All the proposed parameter values after tuning

List of Algorithms

1 Steps of the Ant Colony Optimization
2 Steps of the Artificial Bee Colony Algorithm
3 Artificial Bee Colony Algorithm
4 Steps of the Particle Swarm Optimization (PSO) Algorithm
5 HillClimbing(S)
6 SteepestAscentHillClimbingWithReplacement(S)
7 SimulatedAnnealing(S)
8 TournamentSelection()
9 FitnessProportionateSelection()
10 StochasticUniversalSampling(N_s)
11 Genetic Algorithm for Gene Selection
12 Ant Colony Optimization Algorithm
13 Artificial Bee Colony Algorithm
14 InitRandom(S_i)
15 UpdateBest(S_i)
16 Communicate(i)
17 Steps of the modified Artificial Bee Colony Algorithm
18 Modified Artificial Bee Colony Algorithm

Chapter 1

Introduction

Gene expression data represent the state of a cell at the molecular level and are therefore considered to have great potential as a medical diagnosis tool. Gene expression analysis has been researched in depth for more than a decade [184]. Owing to recent advances in microarray technology, scientists are now able to measure the expression levels of a large number of genes simultaneously in biological organisms [22,76,205]. Selecting the most relevant and informative genes for certain phenotypes is an important aspect of gene expression analysis. In this thesis we address the problem of gene selection for classifying cancer using a modified artificial bee colony algorithm. Section 1.1 introduces gene selection and our proposed approach to solving the problem for cancer classification. A brief description of cancer and of the microarray technique used to profile cancerous genes is presented in Section 1.2. The motivations of the thesis are explained from different points of view in Section 1.3. Section 1.4 explores the challenges we have faced in this work. Finally, the organization of the thesis is given in Section 1.5.

1.1 Gene Selection

Gene selection for cancer classification has become one of the most important research topics in the biomedical field. However, microarray data pose a severe challenge for computational techniques. Available training datasets for cancer classification generally have a fairly small sample size compared to the number of genes involved and consist of multiple classes. Dimension reduction techniques are therefore needed to identify a small set of genes and achieve better learning performance. From the perspective of machine learning, gene selection can be considered a feature selection problem that aims to find a small subset of features carrying the most discriminative information for the target.

The classification of gene expression data samples involves feature selection and classifier design. Noisy, irrelevant, and misleading attributes complicate the classification task, as they can contain random correlations. A reliable method for selecting genes relevant to sample classification is needed in order to increase classification accuracy and keep the resulting models comprehensible. The task of gene selection is known as feature selection in the artificial intelligence domain. Feature selection starts from class-labeled data and attempts to determine which features best distinguish among the classes; here, the genes are the features that describe the cell. The goal is to select a minimum subset of features that achieves maximum classification performance and to discard features with little or no effect. The selected features can then be used to classify unknown data. Feature selection can thus be considered a principal preprocessing tool when solving classification problems [33,246]. Theoretically, feature selection problems are NP-hard [30,236]. An exhaustive search is impossible, as the computational time and cost would be prohibitively large [42]. Gene selection methods can be divided into three categories [81]: filter, wrapper (or hybrid), and embedded methods. Detailed reviews of gene selection methods can be found in [1,81,95,143,159,201]. A gene selection method is categorized as a filter method if it is carried out independently of a classification procedure. Due to their lower computational time and cost, most gene selection techniques in the early era of microarray analysis were filter methods. Many filters provide a feature ranking rather than an explicit best feature subset.
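To make the idea of classifier-independent filtering concrete, the following toy sketch (an illustration only, not the ranking statistic used in this thesis) scores each gene by a simple signal-to-noise ratio between two classes and keeps the top-ranked genes. The dataset, gene count, and score function are all hypothetical.

```python
import statistics

def snr_score(values_a, values_b):
    """Signal-to-noise criterion for one gene:
    |mean_a - mean_b| / (std_a + std_b)."""
    mu_a, mu_b = statistics.mean(values_a), statistics.mean(values_b)
    sd_a, sd_b = statistics.pstdev(values_a), statistics.pstdev(values_b)
    return abs(mu_a - mu_b) / (sd_a + sd_b + 1e-12)

def filter_rank(expressions, labels, top_k):
    """Rank genes independently of any classifier and keep the top_k.
    expressions: list of samples, each a list of per-gene values."""
    n_genes = len(expressions[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(expressions, labels) if y == 0]
        b = [s[g] for s, y in zip(expressions, labels) if y == 1]
        scores.append((snr_score(a, b), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top_k]]

# Toy two-class data: gene 0 separates the classes, gene 1 is noise.
X = [[5.0, 1.0], [5.1, 0.2], [0.9, 0.8], [1.1, 0.3]]
y = [0, 0, 1, 1]
print(filter_rank(X, y, top_k=1))  # → [0]
```

Note how each gene is scored in isolation: the ranking ignores interactions between genes, which is exactly the weakness of filter methods discussed next.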
Filter methods generally rely on a relevance measure to assess the importance of genes from the data, ignoring the effect of the selected feature subset on classifier performance; this may result in the inclusion of irrelevant and noisy genes in the subset. Research shows that genes in a cell interact with one another, rather than acting independently, to carry out biological processes and molecular functions [157]. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be applied to larger problems. While filter techniques assess genes independently, a wrapper or hybrid method embeds gene selection within a classification algorithm. In wrapper methods [138] a search is conducted in the space of gene subsets, evaluating the fitness of each candidate subset; fitness is determined by training the chosen classifier only on the candidate subset and estimating its classification accuracy. Hybrid methods usually obtain better predictive accuracy than filter methods [108,129,169,269], since genes are selected by considering and optimizing the correlations among them. Therefore, several hybrid

methods have been implemented to select informative genes for binary and multi-class cancer classification in recent years [36-39, 134, 151, 152, 156, 209]. Embedded methods, specific to a given learning method, include the interaction with the classification model while being less computationally expensive than wrapper methods. Recently, many diverse population-based methods have been developed to investigate gene expression data and select small subsets of informative genes for cancer classification. In this thesis we propose a modified artificial bee colony algorithm to select genes for cancer classification. The Artificial Bee Colony (ABC) algorithm [121], proposed by Karaboga in 2005, is one of the most recent swarm intelligence based optimization techniques; it simulates the foraging behavior of a honey bee swarm. The search equation of ABC is said to be good at exploration but poor at exploitation [149,173]. To overcome this limitation we modify the algorithm by incorporating pheromone, one of the major components of the Ant Colony Optimization (ACO) algorithm [58,218], and by introducing a new operation in which successive bees communicate to share their findings. Even though researchers have been unable to establish whether such communication involves actual information transfer, it is known that the foraging decisions of outgoing workers, and their probability of finding a recently discovered food source, are influenced by these interactions [18,69,75,102,195,196]. Indeed, there is notable evidence that in harvester ants the management of foraging activity is guided by ant encounters [49,86,89,204]. Even a single encounter may provide information, such as the magnitude of the colony's foraging activity, and thereby influence the probability of food collection [87,88,233]. This modified artificial bee colony algorithm is used to select genes for cancer classification.
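As an illustration of the pheromone idea, the fragment below sketches one way pheromone could bias the onlooker bees' choice of food sources: pheromone persists across iterations (evaporating at a rate RHO) and is deposited in proportion to fitness, and an onlooker picks a source with probability proportional to fitness times pheromone rather than fitness alone. The constant RHO and the update rule are assumptions for illustration, not the exact equations of the proposed method (those are given in Chapter 3).

```python
import random

RHO = 0.9  # pheromone persistence factor (assumed value for this sketch)

def update_pheromone(pheromone, fitnesses):
    """Evaporate, then deposit in proportion to each source's share of
    total fitness, so pheromone accumulates fitness history over time."""
    total = sum(fitnesses) or 1.0
    return [RHO * p + (1 - RHO) * (f / total)
            for p, f in zip(pheromone, fitnesses)]

def select_source(fitnesses, pheromone):
    """Onlooker choice weighted by fitness * pheromone, instead of the
    plain fitness-proportionate rule of basic ABC."""
    weights = [f * p for f, p in zip(fitnesses, pheromone)]
    return random.choices(range(len(fitnesses)), weights=weights, k=1)[0]

fit = [0.2, 0.8, 0.5]        # current fitness of three food sources
ph = [1.0, 1.0, 1.0]         # initial pheromone levels
ph = update_pheromone(ph, fit)
idx = select_source(fit, ph)
print(idx)  # index of the chosen food source
```

Because pheromone decays only gradually, a source that was fit in earlier iterations keeps some attraction even after a bad step, which is the "memory" that basic ABC lacks.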
The goal is to select a minimum number of genes deemed significant for cancer while improving predictive accuracy. First, the expression levels of each gene are normalized to [0, 1]; normalization is important to ensure that each gene contributes equally to the distance computations used in classification. The normalized gene expression data are then preprocessed by statistical methods, including the Kruskal-Wallis test [139,244,245], to filter out a large number of redundant, noisy genes; the performance of different statistical prefiltering methods is also evaluated. To improve the performance of the artificial bee colony algorithm, pheromone, a main component of ant colony optimization, is incorporated, along with a new operator that simulates the communication of successive ants on an ant trail. The idea of incorporating pheromone is to keep track of the fitness of previous iterations. The proposed modified artificial bee colony algorithm is then applied to the small number of relevant genes selected in the preprocessing step. The widely used state-of-the-art classification method Support Vector Machine (SVM) [11,77,103,221] with LOOCV [37,169] serves as the

evaluator of the fitness function, together with the number of selected genes. The proposed method is evaluated on ten public datasets. Some other metaheuristics and evolutionary algorithms, including the basic Artificial Bee Colony algorithm, are also applied to the problem, and these selection methods are compared with the newly proposed one. We believe that the modified artificial bee colony algorithm presented in this thesis is a significant contribution, worthy of further study.

1.2 About Cancer and Microarrays

Our body is made up of trillions of living cells. Each cell contains genes that define its functions, actions, and characteristics. Mutations can happen by chance when a cell is dividing; they can be caused by the processes of life inside the cell, or by agents from outside the body, such as the chemicals in tobacco smoke. Some people inherit faults in particular genes that make them more likely to develop cancer. All cancers begin in cells, when cells in a part of the body start to grow out of control. Cancer cell growth differs from normal cell growth. Mistakes in genes, either inherited from parents or arising from damage during a person's life, contribute to abnormal cell growth. Genes get damaged every day, and cells are very good at repairing them, but over time the damage may build up. Once cells start growing too fast, they are more likely to pick up further mutations and less likely to be able to repair the damaged genes. Mistakes that are not fixed become mutations maintained through subsequent cell divisions, which may eventually lead to cancer. Cancer is thus a genetic disease caused by changes to the genes that control the way our cells function, especially how they grow and divide. Many of the genes that contribute to the development of cancer fall into broad categories, but a cancer is usually caused by multiple changes to several different genes.
In general, cancer cells have more genetic changes, such as mutations in DNA, than normal cells. Some of these changes may have nothing to do with the cancer; they may be the result of the cancer rather than its cause. The classification of cancers has been dominated by the fields of histology and histopathology, which leverage morphological markers for accurate identification of a tumor type. For some types of cancer, these methods are unable to distinguish between subclasses, and traditional methods give less accurate diagnoses. Here gene expression profiles come to the aid. In a particular type of cell or tissue, only a small subset of an organism's genomic DNA will be expressed as mRNAs at any given time. The unique pattern of gene expression for a given cell or tissue is referred

to as its molecular signature. For example, the expression of genes in skin cells is very different from that in blood cells. Gene expression profiling is a technique used in molecular biology to query the expression of thousands of genes simultaneously. In the context of cancer, gene expression profiling has been used to classify tumors more accurately, and the information it provides often has an impact on predicting the patient's clinical outcome. If we want to study a particular type of cancer or develop a diagnostic model, we need to find the genes of that particular cancerous cell. The gene expression of every cell is unique, so it can serve as a molecular fingerprint for cancer classification. A more powerful result of gene expression profiling is the ability to further classify tumors into subtypes having distinct biological properties and impacts on prognosis. There are many technologies used to measure gene expression, such as microarrays, serial analysis of gene expression (SAGE), RT-PCR, and northern blots. Microarray analysis can provide quantitative gene expression information allowing for the generation of a molecular signature, each unique to a particular class of tumor [198]. This allows for reliable identification of tumor type based on gene expression. Microarray analysis has yet to be widely accepted for diagnosis and classification of human cancers, despite the exponential increase in microarray studies reported in the literature. Among the several methods available, a few refined approaches have evolved for the analysis of microarray data for cancer diagnosis, including class comparison, class prediction, and class discovery. Current cancer research primarily uses DNA microarrays. Microarray technology has become one of the indispensable tools that many biologists use to monitor genome-wide expression levels of genes in a given organism.
A microarray is typically a glass slide onto which DNA molecules are fixed in an orderly manner at specific locations called spots (or features). A microarray may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. Since its development in the mid-1990s, DNA microarray technology has revealed a great deal about the genetic factors involved in a number of diseases, including multiple forms of cancer. Microarrays are beginning to take an important place in clinical oncology practice. Although the main potential success of microarrays is related to the evaluation of patients' prognosis, microarrays also improve current clinical diagnostics, help discover new diagnostic markers, and identify new taxonomic classes of tumors. To reach its full potential in cancer diagnosis and classification, microarray technology needs improvement of its ancillary technologies, such as the development of new microarray platforms and of statistics and software for analysis and data

mining. Thus, microarray data can provide valuable results for a variety of gene expression profile problems and contribute to advances in clinical medicine. Researchers have found that looking at the patterns of a number of different genes at the same time (referred to as gene expression profiling) can help predict the behavior of cancer. Over years of research, scientists have found genes and gene subsets correlated with cancer. Our objective is to help find more such genes.

1.3 Motivation

The motivation of this thesis can be elaborated from three different perspectives. The problem of gene selection is itself a notable research topic: addressing it helps researchers find genes with significant effects on certain phenotypes. The application of bee swarm based methods to select informative genes has not yet been studied in the literature, so investigating the behavior of such algorithms for gene selection is also one of our goals. ABC is the most studied bee swarm based algorithm in the literature and has great potential for solving various optimization problems, yet some modifications to its original structure are necessary in order to significantly improve its performance. Thus, this thesis is also driven by our urge to contribute to the improvement of the currently emerging swarm based algorithm, ABC.

1.3.1 Motivation Behind Gene Selection

Traditional cancer predictors diagnose poorly, which can lead to patient suffering due to needless side effects. Research shows that about 70-80% of breast cancer patients receive chemotherapy unnecessarily [90,91]. The development of effective therapies depends on accurate diagnosis. If we want to study a particular type of cancer or develop a diagnostic model, we need to find the genes of that particular cancerous cell. Here gene expression profiles come to the aid. The gene expression of every cell is unique, so it can serve as a molecular fingerprint for cancer classification.
After generating the gene expression profile from a cancerous cell, we analyze this high-dimensional data to understand the underlying mechanism of cancer and the genes responsible for it. Cancer diagnosis is an important application domain of gene expression profiles, with a promising future in clinical medicine. Experimental data from microarrays consisting of gene expression profiles are regularly gathered, and the amount of information contained in microarray data is becoming gigantic. Since the volume of data is growing exponentially,

the analysis becomes more challenging. Evaluating these huge gene expression datasets and extracting useful information by selecting highly discriminative genes has become a topic of substantial research interest. Gene expression data are high-dimensional and noisy; most of the genes observed in a microarray may therefore be irrelevant to the analysis, and using all the genes may hinder classifier performance by masking the contribution of the relevant ones [14,153,154,174,217,224]. A more complete understanding of gene function, regulation, and interactions can be developed using large-scale data provided by gene expression profiling techniques [26]. The genes that best describe a disorder might be found by examining gene expression levels; this might even help in interpreting how a cancer develops. Microarray data are now used in medical applications, since it is possible to predict the treatment of human diseases by analyzing gene expression data [22,28,114,158,182]. Selecting a small subset of genes is practically important for building diagnostic tests. Diagnostic models built from gene expression data provide more accurate, resource-efficient, and repeatable diagnoses than traditional histopathology [85], and a small subset also simplifies verifying the relevance of the selected genes. As a result, the selection of discriminatory genes is necessary to improve accuracy and to decrease computational time and cost [231].

1.3.2 Motivation Behind Applying ABC Algorithm for Gene Selection

ABC is a swarm intelligence algorithm proposed by Karaboga in 2005, inspired by the behavior of honey bees. It has only a few control parameters, i.e., population size, limit, and maximum cycle number [123]. It is simple, flexible, and robust [193,212], with fast convergence speed, and it can easily be hybridized with other optimization algorithms.
ABC has therefore become more attractive than many other optimization algorithms. However, the application of ABC to gene selection has not yet been studied, while PSO is well studied: PSO and several of its variants and hybrids for gene selection have been studied over time, and the obtained results show that PSO-related algorithms perform satisfactorily for gene selection [36,37,39,152,169,209]. Several studies have compared ABC or its variants with PSO or its variants on different categories of problems and reported good results [118,130,178,234,268]. A recent study has shown that ABC performs significantly better than, or at least comparably to, other swarm intelligence algorithms [123]. PSO carries out a global search in the beginning stage and a local search in the ending stage, whereas ABC has the advantage of conducting both a global search and a local search in each iteration [273,274]; as a result the probability of finding the

optimum is significantly increased, which effectively avoids local optima to a large extent. Since PSO already performs well in gene selection, we expect ABC to show significantly better performance.

1.3.3 Motivation Behind Improving ABC Algorithm

In the last two decades, computational researchers have been increasingly interested in the natural sciences, and especially biology, as a source of modeling paradigms. Many research areas are massively influenced by the behavior of various biological entities and phenomena; this gave birth to most population-based metaheuristics. The ABC algorithm [121], proposed by Karaboga in 2005, is a swarm optimization approach inspired by the intelligent foraging behavior of a honey-bee swarm. The algorithm has the advantages of sheer simplicity, high flexibility, fast convergence, and strong robustness, and can be used for solving multidimensional and multimodal optimization problems [44,119,126]. Excellent performance has been reported for ABC on a considerable number of problems [2,44,119,126,130,220]. According to the applications discussed, the ABC algorithm appears to perform well. However, similar to other population-based algorithms, ABC still has insufficiencies, such as slower convergence on some unimodal problems and a tendency to get trapped in local optima on some complex multimodal problems [125]. It is well known that for population-based algorithms both the exploration and the exploitation abilities are necessary. The exploration ability refers to investigating various unknown regions of the solution space to discover the global optimum, while the exploitation ability refers to applying the knowledge of previous good solutions to find better ones.
The exploration ability and the exploitation ability contradict each other, so the two should be well balanced to achieve good performance on optimization problems. The solution search equation of ABC is significantly influenced by a random quantity, which helps exploration at the cost of exploitation of the search space. In general, the ABC algorithm works well at finding better solutions of the objective function. However, the original design of the onlooker bee movement only considers the relation between the employed bee selected by roulette wheel selection and one selected randomly. The search equation of ABC is reported to be good at exploration but poor at exploitation [173]. Therefore, it is not strong enough to maximize the exploitation capacity.
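For concreteness, the solution search equation discussed above can be sketched as follows. This is a minimal sketch of the canonical ABC candidate-generation step reported in the literature, not the modified variant proposed in this thesis; the function and parameter names are our own:

```python
import random

def abc_candidate(population, i, dims, lower, upper):
    """Generate a candidate food source from solution i using the
    canonical ABC search equation
        v[i][j] = x[i][j] + phi * (x[i][j] - x[k][j]),  phi ~ U(-1, 1),
    where k is a randomly chosen neighbour different from i and only one
    randomly chosen dimension j is perturbed.  The random factor phi is
    the quantity that drives exploration at the cost of exploitation."""
    k = random.choice([s for s in range(len(population)) if s != i])
    j = random.randrange(dims)
    phi = random.uniform(-1.0, 1.0)
    candidate = list(population[i])
    candidate[j] = population[i][j] + phi * (population[i][j] - population[k][j])
    candidate[j] = min(max(candidate[j], lower), upper)  # clamp to the search bounds
    return candidate
```

In the full algorithm the candidate replaces solution i only if it improves the fitness (a greedy selection).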

Although ABC has great potential, it was clear to the scientific community that some modifications to the original structure were still necessary in order to significantly improve its performance. Thus, this thesis includes our effort to contribute to the improvement of this currently emerging swarm-based algorithm, ABC.

1.4 Challenges

Gene expression data is characterized by high dimension, high noise, multi-class categorization, and small sample size, which makes the selection of relevant genes challenging for researchers [152]. The sample size is likely to remain small at least for the near future due to the expense of microarray sample collection [63]. Selecting small subsets of relevant genes involved in different types of cancer therefore remains a challenge. Yet experimental data from microarrays is regularly gathered, and the amount of information contained in microarray data is becoming gigantic. Since the volume of data is growing exponentially, the analysis becomes more challenging. Evaluating these huge gene expression data and extracting useful information by selecting highly discriminative genes has now become a significant research interest. Theoretically, feature (gene) selection problems are NP-hard. Performing an exhaustive search is impossible, as the computational time and cost would be excessive [42]. Recently, many diverse population-based methods have been developed for investigating gene expression data to select small subsets of informative genes for cancer classification. But the application of bee colony algorithms to gene selection has not yet been studied. As ABC is applied for the first time to gene selection, we need to analyze the parameter behavior and tune the algorithm accordingly. We have no prior guidance for designing our method, and no study is available on how the parameters behave and how the algorithm evolves for this problem. Studying the parameter behavior to improve the algorithm's performance is therefore a matter of great challenge.
1.5 Organization

The rest of the chapters are organized as follows. In Chapter 2, we describe the basic concepts of gene selection, cancer classification, search methods, prefiltering methods, selection methods, and local search methods. In Chapter ??, we present a brief literature review. The chapter contains a review of existing noteworthy gene selection methods, improvements of the ABC algorithm proposed over time, and the utilization of different classification and filter

methods for gene selection in the literature. Chapter 3 contains the detailed description of the proposed approach. Different options for prefiltering methods, search methods, selection methods, and local search methods are considered for comparison in our thesis. The application of such methods in our proposed algorithm is also presented in this chapter. Experimental results and discussions about parameter tuning and performance comparison are presented in Chapter 4. Finally, Chapter 5 contains the concluding remarks, which include future research directions, outcomes, and a summary of this thesis.

Chapter 2

Background

This chapter presents the ideas necessary to comprehend the topics covered in this thesis. Related algorithms and methods employed in this work are described, along with their advantages and disadvantages. We also formally present the problem of gene selection addressed in this thesis, as well as previous works on different topics related to it. Gene selection for cancer classification has become one of the most important research topics in the field of biomedical science. Different approaches to solve the problem of gene selection have been proposed over time, and many of them have achieved satisfactory results. Section 2.1 presents some such methods. Sections 2.2 and 2.3 include the definition and utilization of SVM and LOOCV, respectively, in different gene selection methods. A discussion of different preprocessing methods is present in Section 2.4. Section 2.5 presents an introduction to some popular swarm intelligence based methods. Among them, the Artificial Bee Colony (ABC) algorithm is one of the most recent nature-inspired optimization algorithms, based on the intelligent foraging behavior of honey bee swarms. Many improvements of the algorithm have been proposed in recent times. That section also presents a brief discussion of the method and of the improvements of ABC found in the literature.

2.1 Gene Selection Methods for Cancer Classification

Gene expression data represent the state of a cell at the molecular level. As a result, they are considered to have great potential as a medical diagnosis tool. Gene expression analysis has been researched in depth for more than a decade [184]. Because of recent advancements in microarray technology, scientists are now able to measure large-scale gene expression levels in biological organisms simultaneously [22,76,205]. The output of the microarray

technology is gene expression data. To produce gene expression data, thousands of gene sequences are placed in known locations on a glass slide called a gene chip. A sample containing control DNA or RNA, which does not have any chromosomal abnormality, is placed in contact with the gene chip. Complementary base pairing between the sample and the gene sequences on the chip emits fluorescent light, which is measured. Genes that are expressed in the sample are determined by the location of the fluorescent spots on the chip, and the gene expression level is proportional to the relative signal intensity. The genes that best describe a disorder might be found by examining the gene expression levels; this might even help in interpreting how a cancer is developing. Microarray data is now being used in medical applications, since it is possible to predict the treatment of human diseases by analyzing gene expression data [22, 28, 114, 158, 182]. The development of cancer diagnostic models utilizing microarray data is of great interest in bioinformatics and medicine. The selection of relevant genes enables researchers to obtain significant insight into the genetic nature of the disease and the mechanisms responsible for it [96, 248]. After gene selection, typical classification techniques can be applied to the microarray data. The classification of gene expression data samples involves feature selection and classifier design. Noisy, irrelevant, and misleading attributes make the classification task complicated, as they can contain random correlations. A reliable method for selecting relevant genes for sample classification is needed in order to increase classification accuracy and to avoid incomprehensibility. The task of gene selection is known as feature selection in the artificial intelligence domain. Feature selection takes class-labeled data and attempts to determine which features best distinguish among the classes. Here the genes are considered the features that describe the cell.
The goal is to select a minimum subset of features that achieves maximum classification performance and to discard the features with little or no effect. These selected features can then be used to classify unknown data. Feature selection can thus be considered a principal preprocessing tool when solving classification problems [33, 246]. Theoretically, feature selection problems are NP-hard [30,236]. Performing an exhaustive search is impossible, as the computational time and cost would be excessive [42]. The input and output of a gene selection method can be defined as follows: Input: G = {G_1, G_2, ..., G_n}, a vector of vectors, where n is the number of genes and G_i = {g_{i,1}, g_{i,2}, ..., g_{i,N}} is the vector of gene expressions of all the samples for the i-th gene, N being the sample size. So, g_{i,j} is the expression level of the i-th gene in the j-th sample.

Output: R = {R_1, R_2, ..., R_m}, the indices of the genes selected in the optimal subset, where m is the number of selected genes. Gene selection methods can be divided into three categories [81]: filter methods, wrapper or hybrid methods, and embedded methods. Detailed reviews of gene selection methods can be found in [1, 81, 95, 143, 159, 201].

2.1.1 Filter Methods

A gene selection method is categorized as a filter method if it is carried out independently from a classification procedure. In the filter approach, instead of searching the feature space, selection is done based on statistical properties. This approach is often denoted as univariate gene selection, whose advantages are its simplicity and interpretability. Due to lower computational time and cost, most gene selection techniques in the early era of microarray analysis used the filter method. Many filters provide a feature ranking rather than an explicit best feature subset. The top-ranking features are chosen manually or via cross-validation [50, 80, 137], while the remaining low-ranking features are eliminated. Bayesian Network [82], t-test [190], Information Gain (IG) and Signal-to-Noise Ratio (SNR) [85,249], Euclidean Distance [32, 107], etc., are examples of filter methods that are usually considered individual gene-ranking methods. Filter methods generally rely on a relevance measure to evaluate the significance of genes from the data, ignoring the effects of the selected feature subset on the performance of the classifier. So they ignore the interaction with the classifier and correlations with other genes, which may result in the inclusion of irrelevant and noisy genes in a gene subset. Research shows that genes in a cell do not act independently; they interact with one another to complete certain biological processes or to implement certain molecular functions [157].
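As a concrete illustration of filter-style ranking, the following sketch scores each gene by a simple two-class Signal-to-Noise Ratio, |mu_1 - mu_0| / (s_0 + s_1), and keeps the top-ranked genes. The function name and the small smoothing constant are our own choices; real filters may use the t-test, IG, or other criteria instead:

```python
import math

def snr_rank(genes, labels, top_k):
    """Rank genes by a two-class Signal-to-Noise Ratio and return the
    indices of the top_k genes.  genes[i] holds the expression levels of
    gene i across all samples; labels[j] is 0 or 1 for sample j."""
    def stats(values):
        m = sum(values) / len(values)
        s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
        return m, s

    scores = []
    for i, g in enumerate(genes):
        c0 = [g[j] for j, y in enumerate(labels) if y == 0]
        c1 = [g[j] for j, y in enumerate(labels) if y == 1]
        m0, s0 = stats(c0)
        m1, s1 = stats(c1)
        # small constant guards against zero within-class variance
        scores.append((abs(m1 - m0) / (s0 + s1 + 1e-12), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]
```

Each gene receives a single score, exactly the univariate scheme described above, and the interaction between genes is ignored.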
Notably, filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problem instances [67, 152, 219, 244, 245, 267, 277].

2.1.2 Wrapper Methods

A wrapper or hybrid method implements a gene selection method within a classification algorithm. A typical wrapper method [138] contains two components: the search scheme and the evaluation procedure. The search is conducted in the space of genes, evaluating the goodness of each found gene subset. Fitness is determined by training the specific classifier to be used only with the found gene subset and then approximating the accuracy percentage

of the classifier. Hybrid methods usually obtain better predictive accuracy estimation than filter methods [108, 129, 169, 269], since the genes are selected by considering and optimizing the correlations among genes. Therefore, several hybrid methods have been implemented to select informative genes for binary and multi-class cancer classification in recent times [36–39, 134, 151, 152, 156, 209]. However, their computational cost must be taken into account [129]. Recently, many diverse population-based methods have been developed for investigating gene expression data to select a small subset of informative genes for cancer classification. Over time, a number of variants and hybrids of Particle Swarm Optimization (PSO) have been proposed to solve the gene selection problem. The Combat Genetic Algorithm (CGA) [72, 74] has been embedded within Binary Particle Swarm Optimization (BPSO) in [37], where it serves as a local optimizer at each iteration to improve the solutions of the BPSO. The algorithm has succeeded in achieving high classification accuracy, albeit at the cost of an unacceptably large size of the selected gene set. Although both PSO and CGA perform well as global optimizers, the proposed algorithm has failed to obtain satisfactory results because it does not consider minimization of the selected gene set size as an objective. A hybridization of BPSO and a Genetic Algorithm (GA) has also been presented in [152]; however, its performance is not satisfactory enough. The incorporation of Tabu Search into PSO as a local improvement procedure, to maintain population diversity and prevent steering toward misleading local optima, is discussed by Shen et al. in [209]. The accuracy obtained by their hybrid algorithm is sufficient, but they did not provide any discussion about the number of genes selected. Again, BPSO has been embedded in Tabu Search (TS) by Chuang et al.
in [39] to prevent TS from getting trapped in local optima, which helps in achieving satisfactory accuracy for some of the datasets. However, to attain that accuracy, their algorithm needs to select a prohibitively high number of genes. An improved binary particle swarm optimization (IBPSO) is proposed in [36], which achieves good accuracy for some of the datasets but, again, selects a high number of genes. It resets the gbest value if it remains unchanged for a predetermined number of iterations; as a result, the algorithm fails to identify small gene subsets and selects more genes. Recently, Mohamad et al. [169] have claimed to enhance the original BPSO algorithm by minimizing the probability of a gene being selected, resulting in the selection of only the most informative genes. They have obtained good classification accuracy with a low number of selected genes for some of the datasets. But the number of iterations needed to achieve the target

accuracy is higher than ours, as will be reported in Chapter 4 (Experimental Results and Discussion). A simple modified ant colony optimization (ACO) algorithm is proposed by Yu et al. in [267]. With each gene they associate two pheromone components rather than a single one: one component captures the effect of selecting the gene, while the other captures the effect of not selecting it. The algorithm is evaluated using five datasets. Their algorithm is able to select a small number of genes, and its accuracy is also reasonable. The random forest algorithm for classifying microarray data [52] has obtained good accuracy for some datasets but not for all. Notably, the number of genes selected by the random forest classification algorithm in [52] has been found to be high for some of the datasets. A new variable importance measure based on the difference of proximity matrices has been proposed for gene selection using random forest classification by Zhou et al. in [276]. Although it fails to achieve the highest (100%) accuracy for any dataset, their algorithm is able to select a small number of genes and achieves satisfactory accuracy for all the datasets. Recently, Debnath and Kurita have proposed an evolutionary SVM classifier that adds features in each generation, according to the error-bound values for the SVM classifier and the frequency of occurrence of the gene features, to produce a subset of potentially informative genes [45].

2.1.3 Embedded Methods

The third class of feature selection approaches is embedded methods. Embedded gene selection methods have been proposed recently by Guyon et al. [94,96]. The difference between embedded methods and other feature selection methods is that the search mechanism is incorporated into the classifier model. Similar to wrapper methods, embedded methods are therefore specific to a given learning algorithm.
Embedded methods have the advantage that they include the interaction with the classification model while being less computationally expensive than wrapper methods. As a result, several embedded methods for gene selection have been developed recently [96, 98, 148, 279]. Guyon et al. [96] utilized Support Vector Machine methods based on Recursive Feature Elimination (RFE) for gene selection in binary classification problems. Zhu et al. [279] introduced a novel Markov blanket embedded genetic algorithm (MBEGA) for the gene selection problem. To quickly improve the solution and fine-tune the search, the proposed embedded Markov blanket based memetic operators add or delete features (genes) from a genetic algorithm (GA) solution. Embedded gene selection with two algorithms based on

EasyEnsemble, named EGSEE (Embedded Gene Selection for EasyEnsemble) and EGSIEE (Embedded Gene Selection for Individuals of EasyEnsemble), for gene selection on imbalanced microarray data is proposed in [148]. The method achieved improved prediction ability in terms of several factors, including the true positive ratio. Hernandez et al. [98] have presented a genetic embedded method for gene selection and classification of microarray data. The proposed method is composed of a preselection phase according to a filtering criterion and a genetic search phase to determine the best gene subset for classification.

2.2 Support Vector Machine (SVM)

Gene expression profiles are becoming increasingly promising as a medical diagnosis tool, as they represent the state of a cell at the molecular level. The classification of microarray data could be of great help in the discovery of hidden patterns in expression profiles and opens the possibility of proficient diagnosis of cancer and other complex diseases. In order to maximize the benefits of this technology, researchers are constantly trying to develop and apply the most accurate decision support algorithms, which can then be used to create gene expression patient profiles. Various studies suggest that the support vector machine (SVM) achieves the best classification and computational performance among the reliable and popular techniques for classifying microarray gene expression data [43, 81, 237]. The Support Vector Machine is a state-of-the-art classification method based on Statistical Learning Theory [238], introduced by Boser, Guyon, and Vapnik in 1992 [19]. Supervised learning, also called prediction or discrimination, involves developing algorithms that assign samples to a priori defined categories. These algorithms are typically inferred from a training dataset and then tested on an independent test dataset to evaluate their accuracy.
Support vector machines are a group of related supervised learning methods used for classification and regression. The simplest type of support vector machine is the linear two-class classifier, which tries to find a straight line that separates two-dimensional data. It takes a binary-labeled training data set (i.e., a data set with positive and negative examples) as input. The SVM then creates a decision boundary between the two groups and selects the most relevant examples involved in the decision process, which are known as the support vectors. However, infinitely many such hyperplanes can be obtained by small perturbations of a given solution. The optimal one is the maximal margin separating hyperplane: the optimal separating hyperplane maximizes the distance between the plane and the nearest point of any of the classes. The construction of the hyperplane is possible as long as the data is linearly separable. The replacement of the dot product by a nonlinear kernel function [238] yields a nonlinear mapping

into a higher dimensional feature space [96]. A kernel can be viewed as a similarity function: it takes two inputs and outputs how similar they are. There are four basic kernels: linear, polynomial, radial basis function (RBF), and sigmoid [186]. If a separating hyperplane in the transformed feature space is found, it can correspond to a nonlinear decision boundary in the input space. If there is noise or inconsistent data, a perfectly separating hyperplane may not exist. To solve this problem, the soft margin extension is used, where some data points of one class are allowed to appear on the other side of the boundary. Soft margin SVMs [41] attempt to separate the training set with a minimal number of errors. For problems involving more than two classes there are several possible approaches. Multi-class problems can be solved with multi-class extensions of SVMs [253, 254], but these are computationally expensive, so the feasible alternative is to convert a two-class classifier into a multi-class classifier. One of the standard methods for doing so is the one-against-all or one-vs.-rest (OvR) approach, where one SVM is constructed per class: for each class, the SVM classifier is trained for that class against the rest of the classes. An input is classified according to which classifier produces the maximum discriminant function value. In spite of its simplicity, it remains the method of choice in most cases [194]. Another approach, known as one-vs.-one (OvO) or one-against-one [136], builds one SVM for each pair of classes. An input is classified according to maximum voting, where each SVM votes for one class. As the training process is quicker, this approach is said to be more practical according to the paper by Hsu and Lin [104]. LIBSVM [27] makes use of this approach to solve multi-class problems in its implementation.
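The one-vs.-rest scheme described above can be sketched generically as follows. A hypothetical nearest-centroid scorer stands in for the binary SVM here, since any binary classifier exposing a real-valued discriminant function fits the same pattern; all names are our own:

```python
class CentroidBinary:
    """Stand-in for a binary SVM: the decision value is the squared
    distance to the negative centroid minus the squared distance to the
    positive centroid (higher means more confidently positive)."""
    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.cp = [sum(c) / len(pos) for c in zip(*pos)]
        self.cn = [sum(c) / len(neg) for c in zip(*neg)]
        return self

    def decision(self, x):
        dp = sum((a - b) ** 2 for a, b in zip(x, self.cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, self.cn))
        return dn - dp

def ovr_fit(X, y, classes, make_clf):
    """One-vs.-rest: train one binary classifier per class, that class
    (label 1) against all the others (label 0)."""
    return {c: make_clf().fit(X, [1 if t == c else 0 for t in y])
            for c in classes}

def ovr_predict(models, x):
    """Classify by whichever classifier yields the maximum discriminant."""
    return max(models, key=lambda c: models[c].decision(x))
```

Replacing `CentroidBinary` with a soft-margin SVM recovers the OvR construction the text describes.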
The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. Uninformed choices may result in an extreme reduction of performance [103]. Tuning an SVM is more of an art than an exact science; the selection of a specific kernel and relevant parameters is usually achieved empirically. One of the weaknesses of SVM is its sensitivity to noise: a relatively small number of mislabeled examples can remarkably decrease the performance. Yet support vector machines have demonstrated superior performance in terms of accuracy in classifying high-dimensional sparse data and flexibility in modeling diverse sources of data [206]. As a result, the SVM classifier is becoming increasingly popular in many areas of bioinformatics, including classification using microarray data. Several papers have reported good results on gene selection using SVM [77,96,145,192,252,270]. The SVM-based classifier is less sensitive to the curse of dimensionality and more robust when only a small number of high-dimensional gene expression samples is available than non-SVM classifiers, which makes it superior [215]. SVM performs comparatively better as a classifier than other methods in most cases [144]. As a

classifier for both binary-class and multi-class gene selection methods, the use of SVM is present in [25,39,48,52,77,79,112,152,156,159,162,169,177,192,202,203,214,215,219,235,243,245,247,259,263,271]. SVMs lend themselves particularly well to the analysis of wide patterns of gene expression data. Different methods of gene selection integrating the Support Vector Machine are also found in the literature [3, 65, 96, 225, 276]. Noteworthy implementations of SVM include SVM light [117], LIBSVM [27], mySVM [200], and BSVM [105, 106]. We have included LIBSVM as the implementation of SVM; details on LIBSVM can be found in [27,103]. Reviews of and introductions to SVM can be found in [24, 93].

2.3 Leave-One-Out Cross-Validation (LOOCV)

Attaining high training accuracy (i.e., a classifier that accurately predicts training data whose class labels are indeed known) is not useful by itself. Data used for gene selection must be distinct from the data used for calculating the predictive accuracy of the classifier [68]. Using the cross-validation accuracy calculated within the gene (feature) selection process as an estimate of the prediction error causes selection bias [3]. To address this problem, the data set is usually split into two parts: a training data set and a test data set, the latter being treated as unknown. The fitness of the learned classifier is then measured on test data that has not been used to train it. The performance on an independent data set is reflected more precisely by the prediction accuracy acquired from the unknown test data set. An improved version of this process is known as cross-validation, which is believed to be a good method for selecting a subset of features [23]. Cross-validation is almost unbiased when carried out properly and when the training set and validation set are from the same population.
One of the most common forms of cross-validation is k-fold cross-validation, where the data set is first divided into k subsets of equal size. Each subset in turn is used as the test set to evaluate the classifier accuracy, while the other k − 1 subsets are put together to form a training set; the process is thus carried out k times, and the average error across all k trials is computed. Every data point is included in a training set k − 1 times and in a test set exactly once. Consequently, each instance of the whole training set is predicted once. Leave-one-out cross-validation (LOOCV) is at one extremity of k-fold cross-validation, where k is chosen as the total number of examples. For a dataset with N examples, N

experiments are performed. For each experiment, the classifier is learned on N − 1 examples and tested on the remaining example. So every example is left out once and a prediction is made for that example. The average error is computed by counting the number of misclassifications and is used to evaluate the model. The beauty of leave-one-out cross-validation is that it is deterministic: regardless of how many times it is run, it will generate the same result, so repetition is not needed. It is thus possible to execute only one leave-one-out cross-validation cycle instead of two nested ones. Leave-one-out cross-validation gives a nearly unbiased estimator [116]. In this thesis, the LOOCV method is utilized in the fitness calculation. With the help of repeated resampling, cross-validation allows models to be tested using the full training set. As a result, overfitting can be avoided by maximizing the total number of samples used for testing. Overfitting occurs when the same data is used both to learn the model and to assess its fitness. When CV is used, the best-fit model will generally include only a subset of the truly informative features. Two commonly used methods in gene selection are support vector machines (SVM) [43] as a classification system and cross-validation (CV) [137] as a validation tool [164]. Both methods are closely connected. Typically, while using SVM there is a tendency to increase the data dimensionality, as SVM maps data into higher dimensions; besides, CV is preferred in scenarios with relatively few data points compared to the dimensionality. Small sample size remains one of the key features of costly clinical investigations of microarray data, as a result of the scarcity of tissue samples and the fact that microarray studies have limited funding; additionally, they are rather extravagant in terms of time and required reagents.
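The LOOCV procedure can be sketched as follows; the `fit`/`predict` pair is a hypothetical stand-in for whatever classifier (e.g., an SVM) is being evaluated:

```python
def loocv_error(X, y, fit, predict):
    """Leave-one-out cross-validation: for each of the N samples, train
    on the other N - 1 samples and predict the held-out one; the error
    rate is the fraction of misclassified held-out samples."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)
```

Because every split is fixed by the data itself, the routine is deterministic, which is exactly why a single LOOCV cycle suffices.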
Resampling methods, such as cross-validation, are commonly used to estimate model performance when a large test data set cannot be held out or easily acquired. It is reckoned that LOOCV performs well when the sample size is small [37, 170, 171, 189]. Several studies on gene selection achieved good results using LOOCV as the validation tool with SVM as the evaluator [25, 112, 156, 169, 214, 219, 259, 263, 271]. To assess algorithm performance, KNN with LOOCV is also practiced in the literature [37, 39, 214, 263, 271].

2.4 Preprocessing

The advancement of DNA microarray technology now allows scientists to monitor and measure thousands of gene expression levels simultaneously in a single experiment [76]. Gene

expression profiles, which represent the state of a cell at the molecular level, have great potential as a medical diagnosis tool. But compared to the number of genes involved, which often exceeds several thousand, the available training data sets generally have a fairly small sample size for classification. The inclusion of irrelevant, noisy, or redundant genes decreases the quality of classification. Also, the huge number of genes causes great computational complexity in wrapper methods when searching for significant genes. Hence, high dimensionality may cause considerable problems in microarray data analysis. Before applying other search methods, it is thus prudent to reduce the gene subset space by preselecting a smaller number of informative genes based on filtering criteria. Filter-based methods rank the features as a preprocessing step prior to the learning algorithm and select those features with high ranking scores. The reduced space is then searched to identify even smaller subsets of relevant genes which are able to classify unknown test samples with high accuracy. Several filter methods that can be used to preprocess data have been proposed in the literature. These include Signal-to-Noise Ratio (SNR) and Information Gain (IG) [85, 249], t-test [190], Bayesian Network [82], Euclidean Distance [32, 107], the Kruskal-Wallis nonparametric analysis of variance (ANOVA) algorithm [40,139,211], the F-test (the ratio of between-group variance to within-group variance) [15, 175], the BW ratio [66], etc. Some of these filter methods perform feature ranking rather than feature selection, so they are sometimes combined with search methods when one needs to find the appropriate number of features to select. The input and output of the preprocessing stage are: Input: G = {G_1, G_2, ..., G_n}, a vector of vectors, where n is the number of genes and G_i = {g_{i,1}, g_{i,2}, ..., g_{i,N}} is the vector of gene expressions for the i-th gene, N being the sample size.
So, g_{i,j} is the expression level of the i-th gene in the j-th sample. Output: R = {R_1, R_2, ..., R_n}, the ranking of the genes based on the statistical method, where n is the number of genes, R_i ∈ {1, 2, ..., n}, and R_i ≠ R_j for any 1 ≤ i, j ≤ n with i ≠ j. The filter-based gene ranking techniques are usually used to preselect the differentially expressed genes from the original gene space, even though those differentially expressed genes are not always tumor-related ones, due to the noise in the dataset. The main idea is to assign each gene a single score that denotes its significance according to a certain scoring criterion. The use of filter methods as a preprocessing step for wrapper methods allows a wrapper to be used on larger problems [67, 98, 152, 162, 219, 244, 245, 267].

Li et al. [152] applied their proposed hybrid PSO/GA after preprocessing the gene expression data with the Wilcoxon rank sum test [48]. The paper formed the crude gene subset by selecting the 40 top-ranked genes from the preprocessing step. Su et al. [219] also used the Wilcoxon rank sum test, preselecting the 100 top-ranked genes from the original gene expression data. The genes are ranked and selected using Kruskal-Wallis and F-score separately in [67]; the study also discussed the advantages and limitations of each ranking scheme. An initial informative gene subset of size 300 is selected using the Kruskal-Wallis rank sum test in [244]. Wang et al. [245] applied the Kruskal-Wallis rank sum test to rank all the genes; their proposed method selected the 200 top-ranked genes as initial informative genes that contain complete classification information. The top 1,000 genes are preselected for each dataset in [25]. Mohamad et al. [169] select the top 500 genes using a gain ratio technique. The application of different test statistics which can handle heterogeneity of variances is discussed in [29]. Mallika et al. [162] use ANOVA p-values for individual gene ranking and pairwise gene ranking. Among many alternatives, in this thesis the Kruskal-Wallis test [40,139,211] and the F-test [15,175] are adopted individually to rank the genes. The top-ranked genes are then selected and fed into the proposed modified ABC. A comparison between these methods is shown in Chapter 4.

2.4.1 Kruskal-Wallis Rank Sum Test

The Kruskal-Wallis rank sum test (named after William Kruskal and W. Allen Wallis) is an extension of the Mann-Whitney U or Wilcoxon rank sum test [146, 255] for comparing two or more independent samples that may have different sample sizes [40,139,211]. It compares several populations on the basis of independent random samples from each population, determining whether the samples belong to the same distribution.
Its null hypothesis states that the populations from which the samples are drawn have the same median. The assumptions of the Kruskal-Wallis test are that, within each sample, the observations are independent and identically distributed, and that the samples are independent of each other. The test makes no assumptions about the distribution of the data (e.g., normality or equality of variances) [83, 99]. Gene selection methods based on the Wilcoxon or Kruskal-Wallis rank sum test were reported to perform very well in gene expression profile based tumor classification in extensive comparison studies [20,144]. According to the results in [48], assumptions about the data distribution often do not hold for gene expression data. The Kruskal-Wallis test is in fact very convenient for microarray data because it does not require

strong distributional assumptions [47], it works well on small samples [73], it is suited for multiclass problems, and its p-values can be calculated analytically. The test is utilized to determine a p-value for each gene. The genes are then sorted in increasing order of the p-values: the lower the p-value of a gene, the higher the rank of the gene. The steps of the Kruskal-Wallis test are specified below:

Step 1 For each gene expression vector G_i,

Step 1.a We rank all gene expression levels across all classes. Any tied values are assigned the average of the ranks they would have received had they not been tied.

Step 1.b We calculate the test statistic K_i for the gene expression vector G_i of the i-th gene, which is given by Eq. 2.1 below:

K_i = \frac{12}{N(N+1)} \sum_{k=1}^{C} n_k \left( \bar{r}_k - \frac{N+1}{2} \right)^2   (2.1)

Here, for the i-th gene, N is the sample size, C is the number of different classes, n_k is the number of expression levels that are from class k, and \bar{r}_k is the mean of the ranks of all expression level measurements for class k.

Step 1.c If ties are found while ranking the data of the i-th gene, a correction for ties must be applied. For this correction, K_i is divided by

1 - \frac{\sum_{i=1}^{T} (t_i^3 - t_i)}{N^3 - N},

where T is the number of groups of different tied ranks and t_i is the number of ties within group i.

Step 1.d Finally, the p-value for the i-th gene, p_i, is approximated by Pr(\chi^2_{C-1} \ge K_i), where \chi^2_{C-1} denotes a chi-square random variable with C-1 degrees of freedom. To compute the p-values, the necessary functions of the already implemented package from nl/trac/prom/browser/packages/timestamps/trunk/src/edu/northwestern/at/utils/math/ are incorporated in our method.

Step 2 After the p-values of all the genes are calculated, we rank each gene G_i according to p_i. The lower the p-value of a gene, the higher is its ranking.
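As an illustration, the ranking produced by Steps 1 and 2 can be sketched in a few lines; the call to scipy.stats.kruskal applies the tie correction of Step 1.c and the chi-square approximation of Step 1.d internally. The toy expression matrix and class labels below are hypothetical, not taken from the thesis datasets.

```python
# Sketch of Kruskal-Wallis gene ranking: one p-value per gene,
# genes sorted by increasing p-value (Eq. 2.1 is computed by scipy).
import numpy as np
from scipy.stats import kruskal

def rank_genes_kruskal(X, y):
    """X: (n_genes, n_samples) expression matrix; y: class label per sample.
    Returns gene indices sorted by increasing p-value (best gene first)."""
    classes = np.unique(y)
    pvals = []
    for g in X:  # one expression vector G_i per gene
        groups = [g[y == c] for c in classes]
        _, p = kruskal(*groups)
        pvals.append(p)
    return np.argsort(pvals)  # lower p-value => higher rank

X = np.array([
    [5, 1, 4, 2, 6, 3, 2, 6, 1, 5, 3, 4],  # gene 0: no class structure
    [3, 4, 2, 6, 1, 5, 4, 2, 6, 1, 5, 3],  # gene 1: no class structure
    [1, 2, 3, 1, 2, 3, 7, 8, 9, 7, 8, 9],  # gene 2: clearly differential
])
y = np.array([0] * 6 + [1] * 6)
order = rank_genes_kruskal(X, y)
print(order[0])  # gene 2 receives the smallest p-value
```

The same loop scales directly to a full microarray matrix; only the number of preselected top-ranked genes changes.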

The Kruskal-Wallis test is used as a preprocessing step in many gene selection algorithms [67, 244, 245]. It is utilized to rank and preselect genes in the two-stage gene selection algorithm proposed by Duncan et al. [67]; in that method, the number of genes selected from the ranked genes is optimized by cross-validation on the training set. Wang et al. [245] applied the Kruskal-Wallis rank sum test to rank all the genes for gene reduction, and the results of their study indicate that gene ranking with the Kruskal-Wallis rank sum test is very effective. To select an initial informative subset of tumor-related genes, the Kruskal-Wallis rank sum test is utilized in [244]. Besides its application in the prefiltering stage, the use of Kruskal-Wallis for gene selection itself is also well studied [29, 142]. Chen et al. [29] studied the application of different test statistics, including Kruskal-Wallis, for gene selection. Lan et al. [142] applied Kruskal-Wallis to rank the genes; finally, the top-ranked genes are selected as features for the target classifier. The proposed filter is claimed to be suitable as a preprocessing step for an arbitrary classification algorithm. Like many other non-parametric tests, Kruskal-Wallis uses data ranks rather than raw values to calculate the statistic. By ranking the data, some information about the magnitude of the differences between scores is lost. For this reason we have also applied a parametric method, the F-test, separately from Kruskal-Wallis to prefilter the genes, though replacing the original scores with ranks does not necessarily lead to a performance reduction; it can result in better performance at best and a slight degradation at worst.

F-test

Another approach to identify the genes that are correlated with the target classes in gene expression data is to use the F-test [15, 175]. The F-test is one of the most widely used supervised feature selection methods.
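As an illustration of this kind of filter, a per-gene F-score (the ratio of between-class variance to within-class variance, as defined in Eq. 2.2 below) might be computed as follows; the toy expression matrix is a made-up example.

```python
# Sketch of per-gene Fisher score (F-score) ranking: higher score => higher rank.
import numpy as np

def fisher_scores(X, y):
    """X: (n_genes, n_samples) expression matrix; y: class label per sample.
    Returns one F-score per gene."""
    classes = np.unique(y)
    mu = X.mean(axis=1)                 # overall mean per gene
    num = np.zeros(X.shape[0])          # between-class variance terms
    den = np.zeros(X.shape[0])          # within-class variance terms
    for c in classes:
        Xc = X[:, y == c]
        n_k = Xc.shape[1]
        num += n_k * (Xc.mean(axis=1) - mu) ** 2
        den += n_k * Xc.var(axis=1)     # population variance (sigma^2)
    return num / den

X = np.array([
    [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],   # gene 0: no class structure
    [1.0, 1.5, 2.0, 7.0, 7.5, 8.0],   # gene 1: strongly differential
])
y = np.array([0, 0, 0, 1, 1, 1])
scores = fisher_scores(X, y)
order = np.argsort(-scores)           # sort by decreasing F-score
print(order[0])  # gene 1 ranks first
```

Gene 0 has identical class means, so its between-class numerator (and hence its score) is zero, while gene 1 receives a large score.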
The key idea is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible. It uses variation among means to estimate variation among individual measurements. The F-score of a gene is the ratio of the between-group variance to the within-group variance, where each class label forms a group. However, the F-score (Fisher score) is computed independently for each gene, which may lead to a suboptimal subset of features. The higher the Fisher score of a gene, the higher its ranking. Generally, the F-test is sensitive to non-normality [21, 163]. Thus the preferred test to use with microarray data

is the Kruskal-Wallis test rather than the F-test, since it is non-parametric. The steps to compute the F-score are given below:

Step 1 For each gene expression vector G_i, we compute the Fisher score (i.e., F-score). The Fisher score for the i-th gene is given by Eq. 2.2 below:

F_i = \frac{\sum_{k=1}^{C} n_k (\mu_k^i - \mu^i)^2}{\sum_{k=1}^{C} n_k (\sigma_k^i)^2}   (2.2)

Here, for the i-th gene, \mu^i is the mean of all the gene expression levels corresponding to the i-th gene, \mu_k^i and \sigma_k^i are the mean and standard deviation of the k-th class respectively, C is the number of classes, and n_k is the number of samples associated with the k-th class.

Step 2 After computing the Fisher score for each gene, the genes are sorted according to the F-score. The higher the F-score of a gene, the higher is its rank.

Use of the F-test, either as an auxiliary tool in gene selection [5,67] or as a stand-alone gene selection tool [92], is practised in the literature. Duncan et al. [67] used the F-test as one of the ranking schemes to preselect the genes. A privacy-preserving algorithm for gene selection using the F-criterion is proposed in [92]; the proposed method can also be used in other feature selection problems. Au et al. [5] implemented the F-test as a criterion function in their proposed algorithm to solve the problem of gene selection. The F-test has been claimed to be effective for determining the discriminative power of genes [147]. In [25] the top 1,000 genes are preselected from each dataset according to Fisher's ratio; to guide the search, their method evaluated the discriminative power of features independently according to the Fisher criterion. The total number of genes in the input dataset is reduced to a smaller subset using the F-score in [202].

2.5 Swarm Intelligence

In recent years, swarm intelligence systems have become increasingly attractive to researchers working on metaheuristics.
The term swarm is used for an aggregation of animal societies, such as fish schools, bird flocks, echolocating bat swarms, and insect colonies such as ants, termites, and bees, performing self-organized collective behavior. Such systems consist of simple interrelated agents that are able to communicate with one another and to interact

with their environment. The individual agents behave without supervision, and each of these agents has a stochastic behavior arising from its perception of the neighborhood. Swarms use their environment and resources effectively through collective intelligence. In swarm intelligence the focus is on simulating biological behaviour [70] rather than modeling it. The main reason behind the development of many efficient swarm-based optimization algorithms is the collaborative learning behavior of social colonies. Several studies [51, 131, 191, 241] have shown that algorithms based on swarm intelligence have great potential to find solutions of real-world optimization problems. The swarm algorithms derived in recent years include Ant Colony Optimization (ACO) [51], Particle Swarm Optimization (PSO) [131], Bacterial Foraging Optimization (BFO) [183], the Bat Algorithm [265], the Artificial Bee Colony Algorithm (ABC) [121], Cat Swarm Optimization (CSO) [34, 35], etc.

Ant Colony Optimization (ACO)

Ant Colony Optimization (ACO) [53, 55, 57, 59, 60], inspired by the foraging behavior of real ant colonies, is a recently developed population-based approach. ACO is one of the nature-inspired metaheuristics [17, 84, 101], which can be used to obtain good enough solutions to hard Combinatorial Optimization (CO) problems in a reasonable amount of computation time. ACO algorithms are stochastic search procedures. One of their main ideas is the indirect communication among the individuals of a colony of simple agents, called (artificial) ants, which iteratively construct candidate solutions to a combinatorial optimization problem. The ants' solution construction is guided by heuristic information about the problem instance being solved and by (artificial) pheromone trails, which real ants use as a communication medium [46] to exchange information on the quality of a solution component. When searching for food, ants initially explore the area surrounding their nest in a random manner.
As soon as an ant finds a food source, it evaluates the quantity and the quality of the food and carries some of it back to the nest. During the return trip, the ant deposits a chemical pheromone trail on the ground [46]. Foragers can sense the pheromone trails and follow the path to food discovered by other ants. The quantity of pheromone deposited, which may depend on the quantity and quality of the food, guides other ants to the food source. Accordingly, this indirect communication via pheromone trails enables the ants to find the shortest paths between their nest and food sources [46].

The (artificial) pheromone trails are a kind of distributed numeric information [56] which is exploited by the ants to reflect the experience accumulated while solving a particular problem. The pheromone model is used to probabilistically sample the search space. The pheromone values are updated using previously generated solutions. The update is designed to concentrate the search in regions of the search space containing high-quality solutions. Solution components which are part of better solutions, or are used by many ants, receive a higher amount of pheromone and hence are more likely to be used by the ants in future iterations of the algorithm. One of the important features of the ACO algorithm is the reinforcement of solution components, which is determined by the solution quality; it indirectly assumes that good solution components fabricate good solutions. However, to avoid the search getting stuck, all pheromone trails are decreased by a factor before being reinforced. Because of evaporation, the pheromones will disappear over time unless they are reinforced by more ants. The main steps of the ACO algorithm can be enumerated as below.

1 initialize pheromone
2 repeat
3     construct ant solutions
4     apply local search (optional)
5     update pheromone
6 until the termination condition is satisfied;
Algorithm 1: Steps of the Ant Colony Optimization

The essential trait of the ACO algorithm is the combination of a priori information about the composition of a promising solution with a posteriori information about the composition of previously obtained good solutions. This technique was initially proposed by Marco Dorigo in 1992 in his PhD thesis [53, 59]. Various algorithmic techniques have been inspired by the behavior of ants; among them, ant colony optimization is the most successful and best-known. The domain of application of the ACO algorithm is vast.
Since the proposal of the first ACO algorithms in the early 1990s, the field of ACO has attracted the attention of researchers, and nowadays a large number of experimental and theoretical research results are available. ACO algorithms have obtained good performance on theoretical problems, which has made them appealing for applications in industrial settings. In principle, ACO algorithms can be applied to any combinatorial optimization (CO) problem by defining solution components on which the ants deposit pheromone to iteratively construct candidate solutions [57, 61, 62]. For a detailed overview of ACO algorithms please refer to [54, 55, 61, 62].
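As a toy illustration of Algorithm 1, the construct/evaporate/reinforce loop might look as follows on a small symmetric TSP instance; the parameter names (alpha, beta, rho, Q) and the 4-city distance matrix are illustrative assumptions, not taken from this thesis.

```python
# Minimal ACO sketch: ants build tours biased by pheromone (tau) and a
# 1/distance heuristic; pheromone evaporates, then good tours reinforce it.
import random

D = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 3],
     [10, 4, 3, 0]]
n = len(D)
tau = [[1.0] * n for _ in range(n)]            # pheromone trails
alpha, beta, rho, Q, n_ants = 1.0, 2.0, 0.5, 1.0, 8

def construct_tour(rng):
    tour = [rng.randrange(n)]
    while len(tour) < n:
        i = tour[-1]
        cand = [j for j in range(n) if j not in tour]
        # choice probability ∝ pheromone^alpha * heuristic^beta
        w = [tau[i][j] ** alpha * (1.0 / D[i][j]) ** beta for j in cand]
        tour.append(rng.choices(cand, weights=w)[0])
    return tour

def tour_length(tour):
    return sum(D[tour[k]][tour[(k + 1) % n]] for k in range(n))

rng = random.Random(42)
best = None
for _ in range(50):
    tours = [construct_tour(rng) for _ in range(n_ants)]
    for i in range(n):                          # evaporation
        for j in range(n):
            tau[i][j] *= (1.0 - rho)
    for t in tours:                             # quality-weighted reinforcement
        for k in range(n):
            i, j = t[k], t[(k + 1) % n]
            tau[i][j] += Q / tour_length(t)
            tau[j][i] = tau[i][j]
        if best is None or tour_length(t) < tour_length(best):
            best = t
print(tour_length(best))  # 18, the optimal cycle 0-1-3-2-0
```

Evaporation (line `tau[i][j] *= (1.0 - rho)`) is what prevents the stagnation behavior discussed below from locking the colony onto an early tour.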

Stagnation

Stagnation is the situation in ant colony algorithms where all ants take the same path and thus generate the same solution. The algorithm may find a good solution at an early stage of the search process, but all ants quickly converge to that single solution, after which the algorithm is unable to improve it. Thus, the ability to improve the solution is lost. This is a common problem that all ACO algorithms suffer from, regardless of the application domain.

Artificial Bee Colony Algorithm (ABC)

The intelligent behaviors of bee swarms have recently inspired researchers to propose several new algorithms [12, 64, 121, 160, 185, 250, 264]. The Artificial Bee Colony (ABC) algorithm is one of the most recent nature-inspired optimization algorithms, based on the intelligent foraging behavior of a honey bee swarm. The ABC algorithm was proposed by Karaboga [121] in 2005 and further developed in [127]. It is motivated by the intelligent behavior of honeybee swarms finding nectar and sharing the information of food sources with each other [122]. It consists of three essential components: food source positions, nectar amounts, and three honey-bee classes (employed, onlooker, and scout bees). A possible solution to the optimization problem is represented by the position of a food source, which the artificial bees modify over time. The fitness of the solution is determined by the nectar amount associated with the food source. Artificial bees attempt to discover the food sources with high nectar amounts and, finally, the one with the highest nectar. In the ABC algorithm, foraging honey bees are categorized into three groups, namely employed bees, onlooker bees, and scout bees. The algorithm incorporates division of labor by introducing these three different types of bees.
The ABC algorithm in fact employs four different selection processes:

- A global selection process used by the artificial onlooker bees for discovering promising regions.
- A local selection process, carried out in a region by the artificial employed bees and the onlookers depending on local information, to find a neighboring food position.
- A local selection process called greedy selection, carried out by all bees: if the nectar amount of the candidate source is better than that of the present one, the bee forgets the present one and memorizes the candidate source; otherwise the bee keeps the present one in memory.

- A random selection process carried out by scouts.

The main steps of the ABC algorithm can be described as follows.

1 initialize population
2 repeat
3     send the employed bees onto the food sources and evaluate their fitness (nectar amounts)
4     send the onlooker bees to the food sources depending on their fitness (nectar amounts)
5     send the scout bees to the solution space for randomly discovering new food sources
6     memorize the best solution found so far
7 until the stopping criteria are met;
Algorithm 2: Steps of the Artificial Bee Colony Algorithm

Each category of honey bee symbolizes one particular operation for generating new candidate solutions. Employed bees exploit the food sources; they bring nectar from different food sources to their hive. Onlooker bees wait in the hive for the information on food sources to be shared by the employed bees, and search for a food source based on that information. An employed bee whose food source has been exhausted becomes a scout, and its solution is abandoned [121]. The scout bees then search randomly for new food sources near the hive without using any experience. After a scout finds a new food source, it becomes an employed bee again. Every scout is an explorer without any guidance while looking for new food, i.e., a scout may find any kind of food source. Therefore, a scout might sometimes accidentally discover a rich, entirely unknown food source. The position of a food source is a possible solution to the optimization problem, and the nectar amount of the food source represents the quality of that solution. The bees act as operators over the food sources, trying to find the best one among them. The onlookers and employed bees carry out the exploitation process in the search space, and the scouts control the exploration process. Half of the colony consists of employed bees, and the other half consists of onlooker bees.
In the basic form, the number of employed bees is equal to the number of food sources (solutions); thus each employed bee is associated with one and only one food source. Bee colonies can quickly and precisely adjust their searching pattern in time and space according to changing nectar sources. The ABC algorithm is presented in Algorithm 3. During the last decade, several algorithms have been developed based on different intelligent behaviors of honey bee swarms [12, 64, 121, 160, 185, 250, 264]. Among those, the artificial bee colony (ABC) algorithm is the one which has been most widely studied and

1 initialize population
2 repeat
3     send the employed bees onto the food sources and evaluate their fitness (nectar amounts)
4     for each employed bee do
5         produce a new solution and determine its fitness
6         apply greedy selection between the new solution and the current solution
7     end
8     evaluate the probability values of the food sources
9     for each onlooker bee do
10         select a food source depending on its fitness
11         produce a new solution and calculate its fitness
12         apply greedy selection between the new solution and the current solution
13     end
14     abandon a position if the food source is exhausted by the bees
15     send the scout bees to the solution space to randomly discover new food sources for the abandoned positions
16     memorize the best food source found so far
17 until the stopping criteria are met;
Algorithm 3: Artificial Bee Colony Algorithm

applied to solve real-world problems, so far. Comprehensive studies on ABC and other bee swarm algorithms can be found in [16, 123, 124, 128]. The algorithm has the advantages of sheer simplicity, high flexibility, fast convergence, and strong robustness, and can be used for solving multidimensional and multimodal optimization problems [44,119,126]. Since the ABC algorithm was proposed in 2005, it has been applied in many research fields, such as the flow shop scheduling problem [179,180], the parallel machines scheduling problem [197], the knapsack problem [115], the traveling salesman problem [181], the quadratic minimum spanning tree problem [220], multiobjective optimization [2, 178], the generalized assignment problem [10], neural network training [125] and synthesis [141], data clustering [262], image processing [257], MR brain image classification [274], coupled ladder networks [172], wireless sensor networks [268], vehicle routing [222], nurse rostering [232], computer intrusion detection [275], live virtual machine migration [258], etc.
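A minimal sketch of Algorithm 3 for a continuous minimization problem (the sphere function) could read as follows; the colony size, abandonment limit, and cycle count are illustrative assumptions, not values from this thesis.

```python
# ABC sketch: employed/onlooker bees perturb one dimension of a food source
# (v_ij = x_ij + phi * (x_ij - x_kj)) with greedy selection; exhausted
# sources (trial counter past the limit) are replaced by scouts.
import random

def sphere(x):
    return sum(v * v for v in x)

def abc_minimize(f, dim=2, bounds=(-5.0, 5.0), n_sources=10, limit=20,
                 max_cycles=200, seed=1):
    rng = random.Random(seed)
    lo, hi = bounds
    new = lambda: [rng.uniform(lo, hi) for _ in range(dim)]
    X = [new() for _ in range(n_sources)]       # food source positions
    fit = [f(x) for x in X]
    trials = [0] * n_sources                    # abandonment counters
    best = min(X, key=f)

    def neighbor(i):
        k = rng.choice([s for s in range(n_sources) if s != i])
        j = rng.randrange(dim)
        v = X[i][:]
        v[j] += rng.uniform(-1, 1) * (X[i][j] - X[k][j])
        return v

    def greedy(i, v):                           # greedy selection process
        nonlocal best
        fv = f(v)
        if fv < fit[i]:
            X[i], fit[i], trials[i] = v, fv, 0
            if fv < f(best):
                best = v
        else:
            trials[i] += 1

    for _ in range(max_cycles):
        for i in range(n_sources):              # employed bee phase
            greedy(i, neighbor(i))
        weights = [1.0 / (1.0 + fi) for fi in fit]
        for _ in range(n_sources):              # onlooker bee phase
            i = rng.choices(range(n_sources), weights=weights)[0]
            greedy(i, neighbor(i))
        for i in range(n_sources):              # scout bee phase
            if trials[i] > limit:
                X[i] = new()
                fit[i] = f(X[i])
                trials[i] = 0
    return best

best = abc_minimize(sphere)
```

The onlooker weights implement the fitness-proportional (global) selection process, while the `limit` counter realizes the abandonment rule that turns an employed bee into a scout.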
Studies [123, 127] have indicated that ABC algorithms have a high search capability to find good solutions efficiently. Besides, excellent performance has been reported by ABC for a considerable number of problems [2, 130, 220]. Karaboga and Basturk [127] tested ABC on five multidimensional numerical benchmark functions and compared its performance with that of differential evolution (DE), particle swarm optimization (PSO) and an evolutionary algorithm

(EA). The study concluded that ABC gets out of a local minimum more efficiently for multivariable and multimodal function optimization, and outperformed DE, PSO and EA.

Improvements of the ABC Algorithm

Researchers have observed that ABC may occasionally stop proceeding toward the global optimum even though the population has not converged to a local optimum [123]. Several studies [149, 173, 278] show that the solution search equation of the ABC algorithm is good at exploration but poor at exploitation. For population-based algorithms, both the exploration and the exploitation abilities are necessary. The exploration ability refers to the ability to investigate various unknown regions of the solution space to discover the global optimum, while the exploitation ability refers to the ability to apply the knowledge of previous good solutions to find better solutions. The two abilities conflict with each other, so they should be well balanced to achieve good performance on optimization problems. Though ABC has been shown to perform satisfactorily, modifications to the original structure are still necessary in order to significantly improve its performance. As a result, several improvements of ABC have been proposed over time. Baykasoglu et al. [10] incorporated shift neighborhood searches and a greedy randomized adaptive search heuristic into the ABC algorithm and applied it to the generalized assignment problem. Pan et al. [179] proposed a discrete artificial bee colony (DABC) algorithm with a variant of the iterated greedy algorithm, using a total weighted earliness and tardiness penalties criterion. Li et al. [150] used a hybrid Pareto-based ABC algorithm to solve flexible job shop scheduling problems; in their algorithm, each food source is represented by two vectors, namely the machine assignment and the operation scheduling. Wu et al.
[256] combined harmony search (HS) and the ABC algorithm to construct a hybrid algorithm. Comparison results show that the hybrid algorithm outperforms ABC, HS, and other heuristic algorithms. Kang et al. [120] proposed a Hooke-Jeeves Artificial Bee Colony algorithm (HJABC) for numerical optimization; HJABC integrates a new local search procedure based on the Hooke-Jeeves method (HJ) [100] with the basic ABC. An Opposition-Based Lévy Flight ABC is developed in [207], where a Lévy flight based random-walk local search is proposed and incorporated into ABC to find the global optimum. Szeto et al. [222] proposed an enhanced ABC algorithm; its performance was tested on two sets of standard benchmark instances, and simulation results show that the new algorithm outperforms the original ABC and several other existing algorithms. Chaotic Search ABC (CABC) is

introduced in [261] to address the premature convergence issue of ABC by increasing the number of scouts and by rational use of the global optimal value and chaotic search. A scaled chaotic ABC (SCABC) method is proposed in [274] based on a fitness scaling strategy and chaotic theory. Based on the Rossler attractor of chaotic theory, a novel chaotic artificial bee colony (CABC) is developed in [272] in order to improve the performance of ABC. An improved artificial bee colony (IABC) algorithm is proposed in [155] to improve the optimization ability of ABC; the paper introduces an improved solution search equation in the employed and scout bee phases using the best and the worst individuals of the population. In addition, the initial population is generated by the piecewise Logistic equation, which employs chaotic systems to enhance the global convergence. Inspired by differential evolution (DE), an improved solution search equation is proposed in [78]. In order to make full use of, and balance, the exploration of the solution search equation of ABC and the exploitation of the proposed solution search equation, a selective probability is introduced. In addition, to enhance the global convergence, both chaotic systems and opposition-based learning methods are employed when producing the initial population. Kang et al. [119] proposed a Rosenbrock ABC (RABC) algorithm which combines Rosenbrock's rotational direction method with the original ABC. There are two alternating phases in RABC: the exploration phase realized by ABC and the exploitation phase completed by the Rosenbrock method. Tsai et al. [234] introduced the Newtonian law of universal gravitation into the onlooker phase of the basic ABC algorithm, in which onlookers are selected based on a roulette wheel to maximize the exploitation capacity of the solutions in this phase; the strategy is named Interactive ABC (IABC).
The IABC introduced the concept of universal gravitation into the consideration of the affection between the employed bees and the onlooker bees. The onlooker bee phase is altered by biasing the search direction towards a random bee according to its fitness. Zhu and Kwong [278] utilized the search information of the global best solution to guide the search of ABC so as to improve its exploitation capacity; the idea is to apply the knowledge of previous good solutions to find better solutions. Reported results show that the new approach achieves better results than the original ABC algorithm. Banharnsakun et al. [7] modified the search pattern of the onlooker bees such that the solution direction is biased toward the best-so-far position; therefore, the new candidate solutions are similar to the current best solution. Li et al. [149] proposed an improved ABC algorithm called I-ABC, in which the best-so-far solution, an inertia weight, and acceleration coefficients are introduced to modify the search process. The proposed method is claimed to have an extremely fast convergence speed. Gbest-guided position update equations are introduced in the Expedited Artificial Bee Colony (EABC) [111]. Jadon et

al. [110] proposed an improved ABC named ABC with Global and Local Neighborhoods (ABCGLN), which concentrates on setting a trade-off between exploration and exploitation and thereby increases the convergence rate of ABC. In the proposed strategy, a new position update equation for the employed bees is introduced, where each employed bee is updated using the best solutions in its local and global neighborhoods as well as random members from these neighborhoods. With a motivation to balance the exploration and exploitation capabilities of ABC, Bansal et al. [8] presented a self-adaptive version of ABC named SAABC. In this adaptive version, to give potential solutions more time to improve themselves, the parameter limit of ABC is modified self-adaptively based on the current fitness values of the solutions. This setting of limit makes low-fit solutions less stable, which helps in exploration. Also, to enhance the exploration, the number of scout bees is increased. To achieve an improved ABC-based approach with better global exploration and local exploitation ability, a novel heuristic approach, PSABC, is introduced in [258]. The method utilizes the binary search idea and the Boltzmann selection policy to achieve uniform random initialization, and thus to give the whole PSABC approach a better global search potential and capacity at the very beginning. To obtain more efficient food positions, two new mechanisms for the movements of scout bees are introduced in [208]. In the first method, the scout bee follows a non-linear (quadratic) interpolated path, while in the second one the scout bee follows a Gaussian movement. The first variant is named QABC, while the second is named GABC. Numerical results and statistical analysis on benchmark unconstrained, constrained and real-life engineering design problems indicate that the proposed modifications enhance the performance of ABC.
In order to improve the exploitation capability of ABC, a new search pattern for both employed and onlooker bees is proposed in [260]. In the new approach, some of the best solutions are utilized to accelerate the convergence speed: a solution pool is constructed by storing some of the best solutions of the current swarm, and new candidate solutions are generated by searching the neighborhood of solutions randomly chosen from the pool. Kumar et al. [181] added crossover operators to ABC, as these operators have a better exploration property. Based on the research of entomologists, a new ABC algorithm combining chemical and behavioral communication was developed in [115]; it introduces a novel communication mechanism among bees. In order to obtain better coverage and a faster convergence speed, a modified ABC algorithm introducing a forgetting and neighbor factor (FNF) in the onlooker bee phase and backward learning in the scout bee phase is proposed in [268].

Bansal et al. [9] introduced Memetic ABC (MeABC) in order to balance the diversity and convergence capability of ABC. A new local search phase is integrated with the basic ABC to exploit the search space identified by the best individual in the swarm. In the proposed phase, ABC works as a local search algorithm in which the step size required to update the best solution is controlled by the Golden Section Search (GSS) [135] approach. In the memetic search phase, new solutions are generated in the neighborhood of the best solution depending on a newly introduced parameter, the perturbation rate. Kumar et al. [140] also proposed a memetic search strategy to be used in place of the employed bee and onlooker bee phases: a crossover operator is applied to two randomly selected parents from the current swarm, generating two new offspring; the worst parent is replaced by the best offspring, while the other parent remains the same. The experimental results show that the proposed algorithm performs better than the basic ABC without crossover in terms of efficiency and accuracy. An improved onlooker bee phase, using a local search strategy inspired by memetic algorithms to balance the diversity and convergence capability of ABC, is proposed in [140]. The proposed algorithm is named Improved Onlooker Bee Phase in ABC (IOABC); the onlooker bee phase is improved by introducing a modified GSS [135] process, and the algorithm modifies the search range of the GSS process and the solution update equation in order to balance intensification and diversification of the local search. Rodriguez et al. [197] combined two significant elements with the basic scheme. Firstly, after producing neighboring food sources (in both the employed and onlooker bee phases), a local search is applied with a predefined probability to further improve the quality of the solutions.
Secondly, a new neighborhood operator based on the iterated greedy constructive-destructive procedure [109,199] is proposed. For further discussion please refer to the available reviews on ABC [240].

Particle Swarm Optimization (PSO)

Particle swarm optimization (PSO) is a population-based stochastic global optimization technique originating from artificial life and evolutionary computation, which was introduced in 1995 by Kennedy and Eberhart [131]. It simulates the social behavior of organisms, such as birds in a flock and fish in a school. PSO carries out a search based on a population of particles, each with an individual position and velocity, and each particle represents a potential solution in the search space. All the particles have fitness values, which are evaluated by a fitness function to be optimized. During movement, each particle adjusts its position by changing its velocity

according to its own experience and the knowledge gained by the swarm as a whole to find the best solution. Individual particles rely on simple rules to produce complex social behaviors, and achieve better performance through sharing information and constant interaction. The algorithm completes the optimization by following the personal best solution of each particle and the global best value of the whole swarm. The main steps of PSO can be described as below.

1 initialize population
2 repeat
3     calculate fitness of the particles
4     tweak the best particles in the swarm
5     save the best solution found so far by a particle as its personal best (pbest)
6     save the best solution found so far by the swarm as the global best (gbest)
7     calculate the velocities of the particles
8     update the particle positions
9 until the requirements are met;
Algorithm 4: Steps of the Particle Swarm Optimization (PSO) Algorithm

PSO has been successfully applied in many areas, including function optimization [71, 251], artificial neural network training [166], fuzzy system control [31,188], business optimization [266], scheduling problems [113], and other application problems [71]. A comprehensive survey of PSO algorithms and their applications can be found in [71, 251]. PSO was originally developed to solve optimization problems with continuous-valued spaces, but many optimization problems occur in a discrete space where the domain of the variables is finite, such as decision making, the traveling salesman problem, and scheduling and routing. To extend the real-valued version of PSO to a binary or discrete space, Kennedy and Eberhart proposed a binary PSO (BPSO) method [132] in 1997.

Local Search

In computer science, local search is a metaheuristic method for solving computationally hard optimization problems. Local search can be used on problems that can be formulated as finding a solution maximizing/minimizing a criterion among a number of candidate solutions.
Local search algorithms move from solution to solution in the space of candidate solutions (the search space) by applying local changes, until a solution deemed optimal is found or a time bound has elapsed. Local search methods can find a local optimum (a solution that cannot be improved by considering any neighboring configuration).

To explore nearby food sources, the basic ABC algorithm applies a neighborhood operator to the current food source. In our algorithm, we apply the neighborhood operator followed by a local search to produce a new food position from the current one. In the employed bee and onlooker bee stages, local search is applied with probability probls to increase the exploitation ability [197]. As local search procedures, Hill Climbing (HC), Simulated Annealing (SA), and Steepest Ascent Hill Climbing with Replacement (SAHCR) are reviewed. Depending upon the choice, HillClimbing(S), SimulatedAnnealing(S), or SteepestAscentHillClimbingWithReplacement(S) is called from the method LocalSearch(S).

Hill Climbing (HC)

Hill climbing is an optimization technique which belongs to the family of local search methods. The algorithm, starting from an arbitrary solution, iteratively tests new candidate solutions in the region of the current solution, and adopts a new one if it is better. This enables climbing up the hill until a local optimum is reached. The method does not require knowledge of the strength or direction of the gradient. Hill climbing is good for finding a local optimum, but it is not guaranteed to find the global optimum. To find a new candidate solution we apply a random tweak to the current solution. The pseudocode is given in Algorithm 5.

1 repeat
2     R = Tweak(S)
3     if fitness(R) > fitness(S) then
4         S = R
5     end
6 until the stopping criteria are met;
7 return S
Algorithm 5: HillClimbing(S)

Steepest Ascent Hill Climbing with Replacement (SAHCR)

This method samples all around the original candidate solution by tweaking it n_t times. The best outcome of the tweaks is taken as the new candidate solution. The current candidate solution is replaced by the new one, rather than selecting the better of the new candidate solution and the current solution. The best solution found so far is saved in a separate variable. The pseudocode is given in Algorithm 6.
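The hill-climbing loop of Algorithm 5 can be sketched in Python as follows. The bit-flip tweak is a hypothetical neighborhood move chosen to match the binary gene-subset encoding used later; the thesis's actual Tweak operator may differ.

```python
import random

def tweak(s):
    """Flip one random bit of a binary solution (a hypothetical
    neighborhood move for gene-subset encodings)."""
    r = s[:]
    i = random.randrange(len(r))
    r[i] = 1 - r[i]
    return r

def hill_climbing(s, fitness, iterations=200):
    """Algorithm 5: keep the tweaked candidate only if it is better."""
    for _ in range(iterations):
        r = tweak(s)
        if fitness(r) > fitness(s):
            s = r
    return s

# toy usage: fitness counts ones, so HC climbs toward the all-ones vector
random.seed(0)
best = hill_climbing([0] * 8, sum)
```

Since a worse candidate is never accepted, the fitness of the current solution never decreases, which is exactly why plain hill climbing can get trapped in a local optimum.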

1 best = S
2 repeat
3     R = Tweak(S)
4     for n_t - 1 times do
5         W = Tweak(S)
6         if fitness(W) > fitness(R) then
7             R = W
8         end
9     end
10    S = R
11    if fitness(S) > fitness(best) then
12        best = S
13    end
14 until the stopping criteria are met;
15 return best
Algorithm 6: SteepestAscentHillClimbingWithReplacement(S)

Simulated Annealing (SA)

Annealing is a process in metallurgy where molten metals are cooled slowly so that they reach a low-energy state in which they are very strong. Simulated annealing is an analogous optimization method for locating a good approximation to the global optimum. It is typically described in terms of thermodynamics. Simulated annealing is a process in which the temperature is reduced slowly: starting from mostly exploring by random walk at high temperature, the algorithm eventually performs plain hill climbing as the temperature approaches zero. The random movement corresponds to high temperature. Simulated annealing injects randomness to jump out of local optima. At each iteration the algorithm selects the new candidate solution probabilistically, so the algorithm may sometimes go downhill. The pseudocode is given in Algorithm 7.

2.7 Selection Procedure

In the onlooker bee phase, an employed bee is selected using a selection procedure for further exploitation. As has been mentioned above, tournament selection, fitness-proportionate selection, and stochastic universal sampling have been applied individually as the selection procedure.
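The SAHCR procedure of Algorithm 6 can be sketched in Python as below. As before, the bit-flip tweak is an illustrative stand-in for the thesis's neighborhood move; the loop bound and n_t value are arbitrary choices for the demo.

```python
import random

def tweak(s):
    """Flip one random bit (hypothetical binary neighborhood move)."""
    r = s[:]
    i = random.randrange(len(r))
    r[i] = 1 - r[i]
    return r

def sahcr(s, fitness, n_t=5, iterations=100):
    """Algorithm 6: sample n_t tweaks, take the best of them as the new
    current solution (replacing unconditionally), and track the best
    solution seen so far in a separate variable."""
    best = s
    for _ in range(iterations):
        r = tweak(s)
        for _ in range(n_t - 1):
            w = tweak(s)
            if fitness(w) > fitness(r):
                r = w
        s = r                      # replace even if worse than before
        if fitness(s) > fitness(best):
            best = s
    return best

random.seed(0)
best = sahcr([0] * 8, sum)
```

The unconditional replacement lets the search walk downhill occasionally, which is why the best solution must be stored separately and returned at the end.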

1 initialize t
2 best = S
3 repeat
4     R = Tweak(S)
5     r = a random number in the range [0, 1]
6     if fitness(R) > fitness(S) or r < e^((fitness(R) - fitness(S)) / t) then
7         S = R
8     end
9     decrease t according to the cooling schedule
10    if fitness(S) > fitness(best) then
11        best = S
12    end
13 until the stopping criteria are met;
14 return best
Algorithm 7: SimulatedAnnealing(S)

2.7.1 Tournament Selection (TS)

In this method the fittest individual is selected among t individuals picked from the population at random with replacement [161], where t >= 1. It is simple to implement and easy to understand. The selection pressure of the method varies directly with the tournament size: as the number of competitors increases, the selection pressure increases. So the selection pressure can easily be adjusted by changing the tournament size. If the tournament size is larger, weak individuals have a smaller chance of being selected. The pseudocode is given in Algorithm 8.

1 Best = individual picked at random
2 for i from 2 to t do
3     Next = individual picked at random
4     if fitness(Next) > fitness(Best) then
5         Best = Next
6     end
7 end
8 return Best
Algorithm 8: TournamentSelection()
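Tournament selection as in Algorithm 8 is short enough to sketch directly. This is a minimal illustration, assuming a list-based population and a maximization fitness; the tournament size default mirrors the t = 7 used later in the thesis.

```python
import random

def tournament_selection(population, fitness, t=7):
    """Algorithm 8: pick t individuals uniformly at random with
    replacement and return the fittest of them."""
    best = random.choice(population)
    for _ in range(t - 1):
        nxt = random.choice(population)
        if fitness(nxt) > fitness(best):
            best = nxt
    return best

# toy usage: the fittest individual [1, 1] should win most tournaments
random.seed(42)
pop = [[0, 0], [0, 1], [1, 1]]
winner = tournament_selection(pop, sum, t=7)
```

With t = 7 over a population of three, the chance that the fittest individual is never drawn is (2/3)^7, about 6%, which illustrates how a larger tournament raises the selection pressure.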

2.7.2 Fitness-Proportionate Selection (FPS)

In this approach, individuals are selected in proportion to their fitness [161]. Thus, if an individual has a higher fitness, its probability of being selected is higher. In Fitness-Proportionate Selection, which is also known as Roulette Wheel Selection, even the fittest individual may never be selected. The analogy to a roulette wheel can be envisaged by imagining a roulette wheel in which each candidate solution represents a pocket on the wheel; the sizes of the pockets are proportionate to the selection probabilities of the solutions. Selecting N individuals from the population is equivalent to playing N games on the roulette wheel, as each candidate is drawn independently. The pseudocode is given in Algorithm 9.

1 f_1 = fitness(p_1)
2 for i from 2 to N do
3     f_i = fitness(p_i) + f_{i-1}
4 end
5 r = random number in the range [0, f_N]
6 for i from 2 to N do
7     if f_{i-1} < r <= f_i then
8         return p_i
9     end
10 end
11 return p_1
Algorithm 9: FitnessProportionateSelection()

2.7.3 Stochastic Universal Sampling (SUS)

Stochastic Universal Sampling (SUS) is a variant of Fitness-Proportionate Selection. It is a technique used in genetic algorithms for selecting potentially useful solutions for recombination, and was introduced by James Baker [6]. SUS is a development of fitness-proportionate selection (FPS) which exhibits no bias and minimal spread. Where FPS chooses several solutions from the population by repeated random sampling, SUS uses a single random value to sample all of the solutions by choosing them at evenly spaced intervals. This gives weaker members of the population (according to their fitness) a chance to be chosen and thus reduces the unfair nature of fitness-proportionate selection methods. In SUS, selection is done in a fitness-proportionate way but biased so that fit individuals always get picked at least once. This is known as a low-variance resampling algorithm.
The pseudocode is given in Algorithm 10.
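The evenly spaced sampling described above can be sketched in Python as follows. This is an illustrative implementation, assuming a maximization fitness and non-negative fitness values; function and variable names are our own.

```python
import random

def sus(population, fitness, n_s):
    """Stochastic Universal Sampling: build the cumulative fitness array,
    draw one random offset in [0, total/n_s], and pick n_s individuals at
    evenly spaced points along the roulette wheel."""
    cumulative = []
    total = 0.0
    for p in population:
        total += fitness(p)
        cumulative.append(total)
    step = total / n_s
    r = random.uniform(0, step)
    chosen = []
    index = 0
    for _ in range(n_s):
        while cumulative[index] < r:
            index += 1
        chosen.append(population[index])
        r += step               # advance by one evenly spaced interval
    return chosen

# toy usage: 'c' holds 80% of the total fitness, so it gets 4 of 5 slots
random.seed(3)
pop = [('a', 1.0), ('b', 1.0), ('c', 8.0)]
picked = sus(pop, lambda p: p[1], n_s=5)
```

Note how the single random draw fixes all n_s picks at once: an individual holding a fraction q of the total fitness is chosen either floor(q * n_s) or ceil(q * n_s) times, which is the "minimal spread" property.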

1 f_1 = fitness(p_1)
2 index = 1
3 for i from 2 to N do
4     f_i = fitness(p_i) + f_{i-1}
5 end
6 r = random number in the range [0, f_N / N_s]
7 for i from 1 to N_s do
8     while f_index < r do
9         index = index + 1
10    end
11    r = r + f_N / N_s
12    in_i = index
13 end
14 q = random integer in the range [1, N_s]
15 return p_{in_q}
Algorithm 10: StochasticUniversalSampling(N_s)

Other methods like the roulette wheel can perform badly when one member of the population has a very large fitness in comparison with the other members. SUS starts from a small random number and chooses the subsequent candidates from the rest of the population, not allowing the fittest members to saturate the candidate space.

2.8 Summary

In this chapter, we have presented the preliminaries required to understand the following chapters. A description and the algorithm of each method applied in our work are also included, along with the benefits and drawbacks of each method with respect to gene selection for classifying cancer. A formal definition of the problem handled in this thesis is also provided in this chapter. We have also discussed some previous research works regarding gene selection for cancer classification, improvements of ABC, etc. Different approaches, including many evolutionary methods, to solve the problem of gene selection can be found in the literature. Here we have reviewed some such methods which obtained satisfactory performance and analyzed their performance with logical reasoning. The utilization of popular methods, e.g., SVM, LOOCV, and prefiltering methods, in gene selection is also presented in this chapter. Finally, we have listed the improvements of the ABC algorithm available in the literature. In the next chapter we will describe the steps of the proposed method in detail.

Chapter 3

Gene Selection for Cancer Classification

A detailed discussion of the proposed gene selection approach is presented in this chapter. The problem formulation is given in Section 3.1. Section 3.2 presents the preprocessing steps performed on the dataset before applying the search method. The proposed modified ABC algorithm and the other evolutionary algorithms applied as search methods for gene selection are discussed in Section 3.3.

3.1 Problem Formulation

Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnostic tool. But compared to the number of genes involved, which often exceeds several thousands, available training datasets generally have a fairly small sample size for classification. The inclusion of irrelevant, noisy, or redundant genes decreases the quality of classification. To overcome this problem, one approach in practice is to search for the informative genes after applying a filter beforehand. The use of prefiltering makes it possible to get rid of the majority of the noisy genes. Consequently, the underlying method to search for the informative genes becomes easier and more efficient with respect to time and cost. Finally, to evaluate the fitness of the selected gene subset a classifier is utilized. The selected genes are used as features to classify the testing samples. The inputs and outputs of the method are:

Input: G = {G_1, G_2, ..., G_n}, a vector of vectors consisting of the gene expression vectors

for all the samples, where n is the number of genes and G_i = {g_{i,1}, g_{i,2}, ..., g_{i,N}} is the vector of gene expressions of the i-th gene over all the samples, N being the sample size. So g_{i,j} is the expression level of the i-th gene in the j-th sample.

Output: R = {R_1, R_2, ..., R_m}, the indices of the genes selected in the optimal subset, where m is the size of the selected gene subset.

The gene selection method starts with a preprocessing step followed by a gene selection step. Finally, the classification is done. The three main steps of the gene selection method are listed below.

Preprocessing
Gene selection
Classification

In what follows, we will describe these steps in detail.

3.2 Preprocessing Stage

To make the experimental data suitable for our algorithm and to help the algorithm run faster, a preprocessing step is incorporated. The preprocessing step contains the following two stages:

Normalization
Prefiltering

Normalization

Normalizing the data ensures the allocation of equal weight to each variable by the fitness measure. Without normalization, the variable with the largest scale will dominate the fitness measure [223]. Therefore, normalization reduces the training error, thereby improving the accuracy for the classification problem [103]. The expression levels of each gene are normalized at this step to [0, 1] using the standard procedure shown in Eq. 3.1 below.

x = lower + (upper - lower) * (value - value_min) / (value_max - value_min)    (3.1)

Here, among all the expression levels of the gene in consideration, value_max is the maximum original value of that gene, value_min is the minimum original value of that gene, upper (lower) is 1 (0), and x is the normalized expression level. So after normalization, for every gene value_max will be 1 and value_min will be 0.

Prefiltering

Gene expression data are characteristically high-dimensional. The huge number of genes causes great computational complexity in wrapper methods when searching for significant genes. Before applying other search methods it is thus prudent to reduce the gene subset space by preselecting a smaller number of informative genes based on some filtering criteria. Several filter methods have been proposed in the literature which can be used to preprocess data. These include Signal-to-Noise Ratio (SNR) and Information Gain (IG) [85,249], the t-test [190], Bayesian networks [82], the Kruskal-Wallis non-parametric analysis of variance (ANOVA) algorithm [40,139,211], the F-test (ratio of between-group variance to within-group variance) [15,175], the BW ratio [66], Euclidean distance [32,107], etc. After the prefilter stage, we get a ranking of the genes based on the applied statistical method. The inputs and outputs of the preprocessing stage are:

Input: G = {G_1, G_2, ..., G_n}, a vector of vectors, where n is the number of genes and G_i = {g_{i,1}, g_{i,2}, ..., g_{i,N}} is the vector of gene expressions of the i-th gene, N being the sample size. So g_{i,j} is the expression level of the i-th gene in the j-th sample.

Output: R = {R_1, R_2, ..., R_n}, the ranking of the genes based on the statistical method, where n is the number of genes, R_i is in {1, 2, ..., n}, and R_i != R_j for any 1 <= i, j <= n with i != j.

Because of the nature of gene expression data, the selected statistical method should be able to deal with high-dimensional, small-sample-sized data. Depending on the assumptions made about the data characteristics, two types of filtering methods exist, namely parametric and non-parametric.
Both types of filtering techniques have been employed individually in our proposed algorithm for the sake of comparison. Among many alternatives, in our work we have incorporated the following methods individually to rank the genes.

Kruskal-Wallis: The steps of the Kruskal-Wallis test are discussed in Chapter 2. It is a non-parametric method.

F-test: The steps of performing the F-test are given in Chapter 2. It is a parametric method.

The experimental results comparing these two methods are presented later.

Preselection of Genes

The top-ranked genes from the prefiltering step enter the next phase. After the genes are ranked according to the statistical method in use, we need to calculate the number of genes to nominate for the next stage. There are two ways to determine the number of genes to be selected in this stage.

Select according to p: In this approach we predetermine a threshold and select all the genes whose statistics calculated by Kruskal-Wallis (F-test) are below (above) the threshold. This approach generally tends to select a comparatively large number of genes [245]. To determine a suitable threshold value we have conducted systematic parameter tuning in the range [0, 1]. The analysis is presented later.

Select according to n: Another approach is to select a predetermined number of top-ranked genes. The number of genes selected from the ranked genes can be either fixed or optimized by cross-validation on the training set. EPSO [169] empirically determined a fixed number (500) and used it for all the datasets. Several other works in the literature also used this approach to preselect genes [25,67,152,244,267,277]. But the problem with this approach is that different datasets have different sizes, so a fixed value might not be optimal for all the datasets; determining one value that is good for all the datasets is not possible. Therefore, in this work we select a percentage of the top-ranked genes, so that the number of genes selected depends on the original size of the dataset. For example, when the percentage is set to 0.1, only the top 10% of the ranked genes are supplied to the next stage. We have systematically tuned this parameter in the range [0, 1]. The experimental findings are presented later.
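The preprocessing pipeline described above (normalize, rank by a filter statistic, keep the top fraction) can be sketched in Python as below. The F-score here is a simplified between/within variance ratio without degree-of-freedom corrections, which is sufficient for ranking but is not the exact test statistic; all function names are our own.

```python
def normalize(values, lower=0.0, upper=1.0):
    """Min-max normalization of one gene's expression levels (Eq. 3.1)."""
    vmin, vmax = min(values), max(values)
    return [lower + (upper - lower) * (v - vmin) / (vmax - vmin)
            for v in values]

def f_score(expr, labels):
    """Simplified F-like score for one gene: between-group over
    within-group sum of squares (degrees of freedom omitted)."""
    classes = set(labels)
    overall = sum(expr) / len(expr)
    between = within = 0.0
    for c in classes:
        group = [x for x, y in zip(expr, labels) if y == c]
        mean = sum(group) / len(group)
        between += len(group) * (mean - overall) ** 2
        within += sum((x - mean) ** 2 for x in group)
    return between / within if within > 0 else float('inf')

def preselect(genes, labels, fraction=0.1):
    """Rank genes by the statistic and keep the top fraction."""
    ranked = sorted(range(len(genes)),
                    key=lambda i: f_score(genes[i], labels),
                    reverse=True)
    keep = max(1, int(fraction * len(genes)))
    return ranked[:keep]

# toy usage: gene 0 separates the two classes, gene 1 is pure noise
genes = [[0.1, 0.2, 0.9, 1.0], [0.5, 0.9, 0.4, 0.8]]
labels = [0, 0, 1, 1]
top = preselect(genes, labels, fraction=0.5)
```

For Kruskal-Wallis the same `preselect` scaffolding applies with the ranking key swapped for the rank-based H statistic.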
3.3 Search Method for Gene Selection

After the preprocessing step only the most informative genes are left. They are now fed to the search method to further select a smaller subset of genes significant to cancer. As the search method we have proposed the modified artificial bee colony algorithm, as described

below. Other evolutionary algorithms such as ACO, ABC, and GA are also considered as search methods. A comparison between the different metaheuristics will be presented later.

Genetic Algorithm

The genetic algorithm is iterated for MAX_ITER times. The genetic algorithm utilized as the search method for gene selection is given in Algorithm 11. The algorithm has two parameters, named the crossover ratio (r) and the mutation ratio (m). The mutation probability (or ratio) is a measure of the likelihood of applying the mutation operator. The crossover probability indicates the proportion of couples picked for crossover. As the selection operator for crossover, the Tournament Selection method (as described in Section 2.7.1) is used. The fitness function is defined later in this chapter.

ACO Algorithm

The ACO algorithm is iterated for MAX_ITER times. The ACO algorithm utilized as the search method for gene selection is given in Algorithm 12. Initially the pheromones for all components (genes) are set to 0. For pheromone deposition, the contributions of an individual and of gbest are considered. As the local search method, SAHCR is used. The value of ρ is set to 0.8. The parameter values of ACO for gene selection are given in a later table. The fitness function is defined later in this chapter.

Basic ABC Algorithm

In ABC the colony consists of an equal number of employed bees and onlooker bees. In the basic form, the number of employed bees equals the number of food sources (solutions); thus each employed bee is associated with one and only one food source. Bee colonies can quickly and precisely adjust their searching pattern in time and space according to changing nectar sources. Excellent performance has been exhibited by the ABC algorithm on a considerable number of problems [2,44,119,126,130,220]. The pseudocode of the ABC algorithm utilized as the search method is presented in Algorithm 13. The ABC search algorithm is iterated for MAX_ITER times. The fitness function is defined later in this chapter.

// initialization
1 for i = 1 to N do
2     initrandom(S_i);
3 end
4 repeat
5     Q = NULL;
6     for i = 1 to r * PS / 2 do
7         p_1 = Selection();
8         p_2 = Selection();
          // crossover
9         k = index of a gene selected randomly;
10        for j = 1 to k do
11            c_1.x_j = p_1.x_j;
12            c_2.x_j = p_2.x_j;
13        end
14        for j = k + 1 to n do
15            c_1.x_j = p_2.x_j;
16            c_2.x_j = p_1.x_j;
17        end
18        Q.add(c_1);
19        Q.add(c_2);
20    end
21    for i = r * PS / 2 + 1 to PS / 2 do
22        p_1 = Selection();
23        p_2 = Selection();
24        c_1 = p_1;
25        c_2 = p_2;
26        Q.add(c_1);
27        Q.add(c_2);
28    end
      // mutation
29    for i = 1 to m * PS do
30        k = index of an individual selected randomly from Q;
31        Tweak(Q_k);
32    end
33    for i = 1 to PS do
34        if fitness(Q_i) > fitness(gbest) then
35            gbest = Q_i;
36        end
37    end
38    S = Q;
39 until the stopping criteria are met;
40 Gene subset corresponding to gbest is the optimal subset found by the algorithm
Algorithm 11: Genetic Algorithm for Gene Selection
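The crossover and mutation operators used inside Algorithm 11 can be sketched in Python as follows. This is an illustrative single-point crossover on binary chromosomes, not the thesis's exact implementation; names are our own.

```python
import random

def single_point_crossover(p1, p2):
    """Single-point crossover as in Algorithm 11: copy genes up to a
    random cut point k from one parent and the rest from the other."""
    k = random.randrange(1, len(p1))
    c1 = p1[:k] + p2[k:]
    c2 = p2[:k] + p1[k:]
    return c1, c2

def mutate(s, n_flips=1):
    """Random tweak used as the mutation operator: flip bits in place."""
    for _ in range(n_flips):
        i = random.randrange(len(s))
        s[i] = 1 - s[i]

random.seed(7)
a, b = [1, 1, 1, 1], [0, 0, 0, 0]
c1, c2 = single_point_crossover(a, b)
```

A useful sanity check on single-point crossover is that the two children jointly contain exactly the same alleles as the two parents at every position.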

// initialization
1 for i = 1 to n do
2     p_i = 0.0;
3 end
4 for i = 1 to N do
5     initrandom(S_i);
6 end
7 repeat
8     for i = 1 to PS do
9         S_i = LocalSearch(S_i);
10        if fitness(S_i) > fitness(gbest) then
11            gbest = S_i;
12        end
13        Lay Pheromone;
14    end
15    Evaporate Pheromone;
16 until the stopping criteria are met;
17 Gene subset corresponding to gbest is the optimal subset found by the algorithm
Algorithm 12: Ant Colony Optimization Algorithm

Modified ABC Algorithm

Like all other evolutionary optimization approaches, ABC also has some drawbacks. In general, the ABC algorithm works well at finding better solutions of the objective function. However, the original design of the onlooker bee's movement only considers the relation between the employed bee, which is selected by roulette wheel selection, and one selected randomly. The search equation of ABC is reported to be good at exploration but poor at exploitation [173]; it is therefore not strong enough to maximize the exploitation capacity. As a result, several improvements of ABC have been proposed over time [8,9,78,110,111,115,140,181,207,208,234,258,260,261,268,272,274,275]. In the employed bee and onlooker bee phases, new solutions are produced by means of a neighborhood operator. In order to enhance the exploitation capability of ABC, a local search method is applied, with a certain probability, to the solution obtained by the neighborhood operator in [197]. To overcome the limitations of the ABC algorithm, in addition to the approach followed in [197], we have further modified it by incorporating two new components. Firstly, we have incorporated the concept of pheromone, which is one of the major components of the Ant Colony Optimization (ACO) algorithm [58,218]. Secondly, we have introduced and plugged in a new operation named the Communication Operation, in which successive bees communicate with each other to share their results. Briefly speaking, the pheromone helps minimizing

// initialization
1 for i = 1 to N do
2     initrandom(S_i);
3 end
4 repeat
      // Employed Bee Phase
5     for i = 1 to PS do
          // produce a new solution using the neighborhood operator
6         E = Neighbor(S_i);
7         if fitness(E) > fitness(S_i) then
8             S_i = E;
9         end
10        UpdateBest(S_i);
11    end
      // Onlooker Bee Phase
12    for i = 1 to PS do
          // select a bee index using the selection procedure
13        j = Selection();
          // produce a new solution from the selected bee using the neighborhood operator
14        O = Neighbor(S_j);
15        if fitness(O) > fitness(S_j) then
16            S_j = O;
17        end
18        UpdateBest(S_j);
19    end
      // Scout Bee Phase
20    for i = 1 to PS do
21        if trial_i > limit then
22            initrandom(S_i);
23            UpdateBest(S_i);
24        end
25    end
26 until the stopping criteria are met;
27 Gene subset corresponding to gbest is the optimal subset found by the algorithm
Algorithm 13: Artificial Bee Colony Algorithm
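The greedy selection used in the employed bee phase of Algorithm 13 can be sketched in Python as below. The multi-bit-flip neighborhood operator is a hypothetical stand-in for the thesis's Neighbor operator, chosen only to show the mechanics on binary food positions.

```python
import random

def neighbor(s):
    """A hypothetical binary neighborhood operator: flip a few random
    bits of the food position to reach a nearby gene subset."""
    r = s[:]
    for _ in range(random.randint(1, 3)):
        i = random.randrange(len(r))
        r[i] = 1 - r[i]
    return r

def employed_bee_step(solutions, fitness):
    """Employed bee phase of Algorithm 13: greedy selection between each
    food position and its neighbor (keep the fitter of the two)."""
    for i, s in enumerate(solutions):
        e = neighbor(s)
        if fitness(e) > fitness(s):
            solutions[i] = e
    return solutions

random.seed(0)
sols = [[random.randint(0, 1) for _ in range(10)] for _ in range(4)]
before = [sum(s) for s in sols]
after = [sum(s) for s in employed_bee_step(sols, sum)]
```

Because greedy selection never accepts a worse neighbor, the fitness of every food source is non-decreasing across the phase.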

the number of selected genes, while the Communication Operation improves the accuracy. The steps of the proposed modified ABC are described next. The algorithm is iterated for MAX_ITER times. Each iteration yields a global best solution, gbest. Finally, the gbest of the last iteration, i.e., the gbest with maximum fitness, is the output of a single run. It is worth mentioning that finding a solution with 100% accuracy is not set as the stopping criterion, as further iterations can find a smaller subset with the same accuracy. Ideally, a gene subset containing only one gene with 100% accuracy is the best possible solution found by any algorithm. The fitness function is defined later in this chapter. The flowchart of the proposed gene selection method is given in Figure 3.1.

Food Source Positions

The position of the food source for the i-th bee S_i is represented by the vector X_i = {x_i^1, x_i^2, ..., x_i^n}, where n is the gene size or dimension of the data, x_i^d is in {0, 1}, i = 1, 2, ..., m (m is the population size), and d = 1, 2, ..., n. Here, x_i^d = 1 represents that the corresponding gene is selected, while x_i^d = 0 means that the corresponding gene is not selected in the subset.

Pheromone

We have incorporated the concept of pheromone (borrowed from ACO) into the ABC algorithm as a guide for exploitation. ACO algorithms are stochastic search procedures. The ants' solution construction is guided by heuristic information about the problem instance being solved and by (artificial) pheromone trails, which real ants use as a communication medium [46] to exchange information on the quality of a solution component. Pheromones are considered long-term memory which helps in selecting the most crucial genes. The quantity of pheromone deposited, which may depend on the quantity and quality of the food, guides other ants to the food source. Accordingly, the indirect communication via pheromone trails enables the ants to find shortest paths between their nest and food sources [46].
A gene subset carrying significant information will occur more frequently. Thus the genes in that subset will get reinforced simultaneously, which ensures the formation of a potential gene subset. The idea of using pheromone is to keep track of the components that are supposed to be good because they were part of a good solution in previous iterations. Because this information is kept, fewer iterations are needed to achieve a target accuracy; thus, computational time is also reduced.

Figure 3.1: The flowchart of the modified Artificial Bee Colony Algorithm. [The figure depicts the full pipeline: parameter setting, dataset normalization, statistical preselection of top-ranked genes, and the modified ABC loop with its employed bee, onlooker bee, and scout bee phases, including the neighborhood operator, local search with probability probls, greedy selection, the Communication Operator, pheromone laying and evaporation, pbest/gbest updates, and final evaluation of the gene subset corresponding to gbest using SVM with LOOCV.]

Pheromone Update

The (artificial) pheromone trails are a kind of distributed numeric information [56] which is modified by the ants to reflect their experience accumulated while solving a particular problem. The pheromone values are updated using previously generated solutions. The update is designed to concentrate the search in regions of the search space containing high-quality solutions. Solution components which are part of better solutions, or are used by many ants, will receive a higher amount of pheromone and, hence, will be more likely to be used by the ants in future iterations of the algorithm. This indirectly assumes that good solution components construct good solutions. However, to avoid the search getting stuck, all pheromone trails are decreased by a factor before getting reinforced again. This mimics the natural phenomenon that, because of evaporation, pheromone disappears over time unless it is revitalized by more ants. The idea of incorporating pheromone is to keep track of the fitness of previous iterations. The pheromone trails for all the components are represented by the vector P = {p_1, p_2, ..., p_n}, where p_i is the pheromone corresponding to the i-th gene and n is the total number of genes. To update the pheromone p_i corresponding to the i-th gene, two steps are followed:

Pheromone deposition
Pheromone evaporation

After each update step, if the pheromone value becomes greater (less) than tmax (tmin), then the value of the pheromone is set to tmax (tmin). The use of tmax and tmin was introduced in the Max-Min Ant System (MMAS) presented in [218] to avoid stagnation. The value of tmin is set to 0 and is kept the same throughout. But the value of tmax is updated whenever a new global best solution is found, as given in Eq. 3.6.

Pheromone deposition

After each iteration the bees acquire new information and update their knowledge of the local and global best locations.
The best position found so far by the i-th bee is known as pbest_i, and the best position found so far by all the bees, i.e., the population, is known as gbest. After each bee completes its tasks in an iteration, pheromone laying is done. A bee deposits pheromone using its knowledge of food locations gained so far. To lay pheromone, the i-th bee uses its current location (X_i), the best location found by the bee so far (pbest_i), and the best location found so far by all the bees (gbest). This idea is adopted from the Particle Swarm Optimization (PSO) metaheuristic [131], where the local and global best locations are used to update the velocity of the current particle. We have also used the current position in pheromone laying to ensure enough exploration, although in MMAS [218] only the current best

solution is used to update the pheromone. Only the components which are selected in the corresponding solutions get reinforced. Hence, the pheromone deposition by the i-th bee utilizes the proposed Eq. 3.2 as follows:

p_d(t+1) = p_d(t) * w + (r_0 * c_0 * f_i * x_i^d) + (r_1 * c_1 * pf_i * x_{pbest_i}^d) + (r_2 * c_2 * gf * x_gbest^d)    (3.2)

Here, d = 1, 2, ..., n (n is the number of genes), w is the inertia weight, f_i is the fitness of X_i, pf_i is the fitness of pbest_i, gf is the fitness of gbest, x_i^d is the selection of the d-th gene in X_i, x_{pbest_i}^d is the selection of the d-th gene in pbest_i, x_gbest^d is the selection of the d-th gene in gbest, c_0, c_1, and c_2 determine the contributions of f_i, pf_i, and gf respectively, and r_0, r_1, r_2 are random values in the range [0, 1] sampled from a uniform distribution. Here we have c_0 + c_1 + c_2 = 1 and c_1 = c_2, so the individual best and the global best influence the pheromone deposition equally. The value of c_0 is set from experimental results presented later. The inertia weight is considered to ensure that the contributions of the global best and the individual best are weighed more in later iterations, when they contain meaningful values. To update the value of the inertia weight w, two different approaches have been considered. One approach updates the weight so that an initial large value is decreased nonlinearly to a small value, as described in [169], so that exploitation does not become the absolute majority.

w(t+1) = (w(t) - 0.4) * (MAX_ITER - iter) / MAX_ITER    (3.3)

Here, MAX_ITER is the maximum number of iterations and iter is the current iteration. Another approach is to update the value randomly [97].

w = (1 + r_5) / 2    (3.4)

Here, r_5 is a random value in the range [0, 1], sampled from a uniform distribution. A performance evaluation of each of these two approaches is presented later.

Pheromone evaporation

At the end of each iteration, pheromones are evaporated to some extent. The equation for pheromone evaporation is given by Eq. 3.5:

p_i(t+1) = p_i(t) * ρ    (3.5)

Here, (1 - ρ) is the pheromone evaporation coefficient, p_i is the pheromone corresponding to the i-th gene, and n is the total number of genes. Finally, note that the value of tmax is updated whenever a new gbest is found. The rationale for such a change is as follows. Over time, as the fitness of gbest increases, it also contributes more to the pheromone deposition, which may lead the pheromone values of some of the frequent genes to reach tmax. At that point, the algorithm will fail to store further knowledge about those particular genes. So we need to update the value of tmax after a new gbest is found. This is done using Eq. 3.6 below.

tmax(g+1) = tmax(g) * (1 + ρ * gf)    (3.6)

Here, tmax(g) represents the value of tmax when the g-th global best is found by the algorithm.

Initialization

The pheromone for all the genes is initialized to tmax. For all the bees, food positions are selected randomly. To initialize the i-th bee, the function initrandom(S_i), given in Algorithm 14, is used. Here we have used a modified sigmoid function, introduced in [169], to increase the probability of the bits in a particle position being zero. It allows the components with high pheromone values to get selected. The function is given in Eq. 3.7 below, where x >= 0 and sigmoid(x) is in [0, 1].

sigmoid(x) = 1 / (1 + e^x)    (3.7)
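The pheromone deposition (Eq. 3.2), evaporation (Eq. 3.5), and tmax update (Eq. 3.6) can be sketched in Python as below. This is an illustrative sketch with our own names, assuming each solution is a (binary vector, fitness) pair and c1 = c2 = (1 - c0)/2 as stated above; clamping to [tmin, tmax] is omitted for brevity.

```python
import random

def deposit(pheromone, bee, pbest, gbest, w=0.9, c0=0.2):
    """Pheromone deposition following Eq. 3.2; c1 = c2 = (1 - c0)/2 so
    the personal and global bests contribute equally. Each argument
    pairs a binary selection vector with its fitness."""
    c1 = c2 = (1.0 - c0) / 2.0
    x, f = bee
    xp, pf = pbest
    xg, gf = gbest
    for d in range(len(pheromone)):
        r0, r1, r2 = (random.random() for _ in range(3))
        pheromone[d] = (pheromone[d] * w
                        + r0 * c0 * f * x[d]
                        + r1 * c1 * pf * xp[d]
                        + r2 * c2 * gf * xg[d])
    return pheromone

def evaporate(pheromone, rho=0.8):
    """Pheromone evaporation (Eq. 3.5): scale every trail by rho."""
    return [p * rho for p in pheromone]

def update_tmax(tmax, gf, rho=0.8):
    """Raise the pheromone ceiling when a new gbest is found (Eq. 3.6)."""
    return tmax * (1.0 + rho * gf)

random.seed(0)
ph = [1.0, 1.0, 1.0]
ph = deposit(ph, ([1, 0, 1], 0.6), ([1, 0, 0], 0.7), ([1, 1, 0], 0.9))
ph = evaporate(ph)
```

Note that only components with x[d] = 1 in the respective solutions receive a deposit, matching the statement that only selected genes get reinforced.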

1  for j = 1 to n do
2    r_3 = a random number in the range [0, 1]
3    if r_3 > sigmoid(p_j) then
4      x_i^j = 1
5    else
6      x_i^j = 0
7    end
8  end
Algorithm 14: initRandom(S_i)

Employed Bee Phase

To determine a new food position, the neighborhood operator is applied to the current food position. Then local search is applied, with probability probls, to the new food position to obtain a better position by exploitation. As local search procedures, Hill Climbing (HC), Simulated Annealing (SA), and Steepest Ascent Hill Climbing with Replacement (SAHCR) are considered. Then greedy selection is applied between the newly found neighbor and the current food position. The performance of and comparison among the different local search methods are discussed in Section 4.2. In each iteration the values of gbest and pbest_i are updated using Algorithm 15.

1  if fitness(S_i) > fitness(pbest_i) then
2    pbest_i = S_i;
3  end
4  if fitness(S_i) > fitness(gbest) then
5    gbest = S_i;
6  end
Algorithm 15: UpdateBest(S_i)

Onlooker Bee Phase

At first a food source is selected according to the goodness of the source using a selection procedure. As the selection procedure, tournament selection, fitness-proportionate selection, and stochastic universal sampling have been applied individually, and the results are discussed in Section 4.2. To determine a new food position, the neighborhood operator is applied to the food position of the selected bee. Then local search is applied with probability probls to exploit the food position. As local search methods, Hill Climbing, Simulated Annealing, and Steepest Ascent Hill Climbing with Replacement are

compared. Then greedy selection is applied between the newly found neighbor and the current food position. In each iteration the values of gbest and pbest_i are updated using Algorithm 15.

Selection Procedure

In the onlooker bee phase, an employed bee is selected using a selection procedure for further exploitation. As mentioned above, tournament selection, fitness-proportionate selection, and stochastic universal sampling have been applied individually as the selection procedure.

Tournament Selection

A brief description of the Tournament Selection method has been given earlier. In this method the fittest individual is selected among t individuals picked from the population at random with replacement [161]. The value of t is set to 7 in our algorithm.

Fitness-Proportionate Selection

A brief description of this selection method has also been given earlier. In this approach, individuals are selected in proportion to their fitness [161]; thus, if an individual has a higher fitness, its probability of being selected is higher. In Fitness-Proportionate Selection, which is also known as Roulette Wheel Selection, even the fittest individual may never be selected. The basic ABC incorporates the roulette wheel, or fitness-proportionate, selection scheme.

Stochastic Universal Sampling

One variant of Fitness-Proportionate Selection is Stochastic Universal Sampling (SUS), proposed by James Baker in [6]. In SUS, selection is done in a fitness-proportionate way, but biased so that fit individuals always get picked at least once. This is known as a low-variance resampling algorithm. The method has now become popular in venues beyond evolutionary computation [161]. A brief description of the method has been given earlier.

Scout Bee

If the fitness of a bee remains the same for a predefined number (limit) of iterations, then it abandons its food position and becomes a scout bee.
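As an illustration, the three selection procedures described above might be sketched as follows. This is a minimal sketch under our own naming, not the thesis implementation; all three operate on a list of fitness values and return selected indices.

```python
import random

def tournament_select(fitness, t=7):
    # Pick t indices uniformly at random WITH replacement and return
    # the index of the fittest one (t = 7 in the thesis).
    picks = [random.randrange(len(fitness)) for _ in range(t)]
    return max(picks, key=lambda i: fitness[i])

def roulette_select(fitness):
    # Fitness-proportionate (roulette wheel) selection: an individual's
    # chance of being chosen is fitness[i] / sum(fitness).
    r, acc = random.random() * sum(fitness), 0.0
    for i, f in enumerate(fitness):
        acc += f
        if r < acc:
            return i
    return len(fitness) - 1

def sus_select(fitness, k):
    # Stochastic universal sampling: one random offset, then k evenly
    # spaced pointers over the cumulative fitness, giving low variance.
    step = sum(fitness) / k
    start = random.random() * step
    picks, cum, i = [], fitness[0], 0
    for j in range(k):
        pointer = start + j * step
        while cum < pointer:
            i += 1
            cum += fitness[i]
        picks.append(i)
    return picks
```

Tournament selection only compares fitness values, so it is insensitive to their scale, whereas the two proportionate schemes assume non-negative fitness; this holds here since the fitness of Eq. 3.10 lies in [0, 1].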
In basic ABC, it is assumed that only one source can be exhausted in each cycle, and hence only one employed bee can become a scout. In our modified approach we have removed this restriction. The scout bees are assigned to new food positions randomly. While determining the components of a new food position, the solution components with higher pheromone values have a higher probability of being selected. The value of limit is experimentally tuned and discussed in Chapter 4. The variable

trial_i contains the number of times the fitness has remained unchanged consecutively for the i-th bee. The procedure initRandom(S_i) used to assign new food positions to scout bees is given in Algorithm 14. In each iteration the values of gbest and pbest_i are updated using Algorithm 15.

Local Search

To explore nearby food sources, the basic ABC algorithm applies a neighborhood operator to the current food source. In our algorithm, however, we have applied local search to produce a new food position from the current one. In the employed bee and onlooker bee stages, local search is applied with probability probls to increase the exploitation ability [197]. The value of probls is experimentally tuned in Section 4.2. As already mentioned, Hill Climbing (HC), Simulated Annealing (SA), and Steepest Ascent Hill Climbing with Replacement (SAHCR) have been employed as the local search procedures. Depending upon the choice, HillClimbing(S), SimulatedAnnealing(S), or SteepestAscentHillClimbingWithReplacement(S) is called from the method LocalSearch(S). The performance assessment of the different local searches is discussed in Section ??.

Hill Climbing

Hill climbing is an optimization technique that belongs to the family of local search methods. The pseudocode is given in Algorithm 5.

Simulated Annealing

Simulated annealing is a probabilistic optimization method for locating a good approximation to the global optimum. It is typically described in terms of thermodynamics. At each iteration the algorithm selects the new candidate solution probabilistically, so the algorithm may sometimes go downhill. The pseudocode is given in Algorithm 7.

Steepest Ascent Hill Climbing with Replacement

This method samples all around the original candidate solution by tweaking it n times. The best outcome of the tweaks is considered as the new candidate solution.
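A minimal sketch of the hill-climbing idea over a gene-selection bit vector follows. This is an illustration under our own naming, not the thesis code; the single-bit tweak mirrors the Tweak operator described later, and the iteration count stands in for the hc_iter parameter.

```python
import random

def tweak(x):
    # Flip the selection bit of one uniformly chosen gene.
    y = x[:]
    j = random.randrange(len(y))
    y[j] = 1 - y[j]
    return y

def hill_climb(x, fitness, iters=12):
    # Plain hill climbing: repeatedly tweak the current solution and
    # keep the tweak only when it strictly improves the fitness.
    best, best_f = x[:], fitness(x)
    for _ in range(iters):
        cand = tweak(best)
        cand_f = fitness(cand)
        if cand_f > best_f:
            best, best_f = cand, cand_f
    return best
```

Because a tweak is kept only on improvement, the returned solution is never worse than the input; SAHCR differs by taking the best of several tweaks per step, and SA by occasionally accepting worse moves.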
Its pseudocode has also been given earlier.

Communication Operator

We have incorporated a new operator simulating the communication between the ants in a trail. Even though researchers have been unable to establish whether such communication indeed involves information transfer, it is known that the foraging decisions of outgoing workers, and their probability of finding a recently discovered food source, are influenced by the

interactions [18, 69, 75, 102, 195, 196]. In fact, there is a large body of evidence emphasizing the role of ant encounters in the regulation of foraging activity, particularly for harvester ants [18, 49, 86, 89, 204]. Even the mere occurrence of an encounter may provide information, such as the magnitude of the colony's foraging activity, and may therefore influence the probability of food collection in ants [87, 88, 233].

At each step, bees gain knowledge about different components and store their findings by depositing pheromone. After a bee gains new knowledge about the solution components, it shares its findings with its successor. So an employed bee gets insight into which components are currently exhibiting excellent performance; thus a bee obtains an idea about food sources from its predecessor. A gene is selected in the current bee if it is selected in its predecessor and its pheromone level is greater than a threshold level. With probability r_4 the following communication operator (Eqs. 3.8 and 3.9) is applied to each employed bee. The value of r_4 is experimentally tuned and the results are presented in Section 4.2.

x_i^d = x_{i-1}^d * z_{p_d}    (3.8)

where, for the i-th bee, i > 1, d = 1, 2, ..., n (n is the number of genes), and

z_{p_d} = 1 if p_d > tmax/2, and 0 otherwise    (3.9)

The procedure Communicate(i), which applies the communication operator to the i-th bee, is presented in Algorithm 16.

1  for d = 1 to n do
2    if p_d > tmax/2 then
3      z_{p_d} = 1
4    else
5      z_{p_d} = 0
6    end
7    x_i^d = x_{i-1}^d * z_{p_d}
8  end
Algorithm 16: Communicate(i)
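The operator of Eqs. 3.8 and 3.9 can be sketched in Python as below. This is a minimal sketch, not the thesis code; the predecessor's bit vector and the pheromone list are passed in explicitly.

```python
def communicate(x_prev, p, tmax):
    # Eqs. 3.8-3.9: a gene stays selected only if it is selected in the
    # predecessor bee AND its pheromone exceeds the threshold tmax / 2.
    return [xd * (1 if pd > tmax / 2 else 0)
            for xd, pd in zip(x_prev, p)]
```

For example, communicate([1, 1, 0, 1], [5.0, 1.0, 5.0, 3.0], 5.0) keeps only genes 0 and 3: gene 1 falls below the 2.5 threshold, and gene 2 was not selected by the predecessor.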

Neighborhood Operator

In the solution we need the informative genes to be selected, so we discard the uninformative ones from the solution; in this way we obtain a small set of informative genes. To find a nearby food position, we first find the genes that are selected in the current position. A number of the selected genes (at least one) are dropped from the current solution; we get rid of the genes that appear less promising. If the current solution has zero selected genes, then we instead select a possibly informative gene. The parameter nd determines the percentage of selected genes to be removed. The value of nd is experimentally tuned in Section 4.2.

Let X_e = {0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0} be a candidate solution with gene size n = 20 and ten selected genes. So if nd = 0.3, we will randomly pick 3 genes that are currently selected in the candidate solution X_e and change them to 0. Let the indices 2, 8, and 15 be the randomly selected ones. The nearby food position X_e^n of the current candidate solution X_e, found after applying the neighborhood operator, will then be X_e^n = {0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0} (changes are shown in boldface font). Please note that we adopt zero-based indexing.

Tweak Operator

The tweak operation is performed by the method Tweak(S). Here, one of the genes is picked randomly and the selection of that gene is flipped: if the gene is selected, after the tweak it will be dropped, and vice versa. For example, let X_e = {0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0} be a candidate solution with gene size n = 20 and ten selected genes. Suppose index 6 is randomly selected.
So the tweaked food position X_e^t of the current candidate solution X_e, found after applying the tweak operator, will be X_e^t = {0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0} (changes are shown in boldface font). Please note that we adopt zero-based indexing.

Fitness

Our fitness function has been designed to consider both the classification accuracy and the number of selected genes. The higher the accuracy of an individual, the higher is its fitness. On the other hand, a small number of selected genes yields a good solution, so if the percentage of genes that are not selected is higher, the fitness will be higher. The value (n - ns_i)/n gives the percentage of genes that are not selected in S_i. The tradeoff between the weight of accuracy and the selected gene size is given by w_1: a higher value of w_1 means accuracy is prioritized more

than the selected gene size. So, finally, the fitness of the i-th bee S_i is determined according to Eq. 3.10:

fitness(S_i) = w_1 * accuracy(X_i) + (1 - w_1) * (n - ns_i) / n    (3.10)

Here, w_1 sets the tradeoff between the importance of accuracy and the selected gene size, X_i is the food position corresponding to S_i, accuracy(X_i) is the LOOCV (Leave-One-Out Cross-Validation) classification accuracy using an SVM (to be discussed shortly), and ns_i is the number of currently selected genes in S_i.

Accuracy

To assess the fitness of a food position we need the classification accuracy of the gene subset. The predictive accuracy of a gene subset obtained from the modified ABC is calculated by an SVM with LOOCV (Leave-One-Out Cross-Validation); the higher the LOOCV classification accuracy, the better the gene subset. The SVM is very robust with sparse and noisy data, and has been found suitable for classifying high-dimensional, small-sample-sized data [103, 242]. The SVM is also reported to perform well in gene selection for cancer classification [81, 144]. Among many SVM implementations [27, 105, 117, 200], we have incorporated LIBSVM [27] into our application to calculate the accuracy. The SVM is basically a linear two-class classifier. For a multi-class SVM, we have utilized the OVO ("one versus one") approach, which is adopted in LIBSVM [27].

Replacing the dot product by a nonlinear kernel function [238] yields a nonlinear mapping into a higher-dimensional feature space [96]. A kernel can be viewed as a similarity function: it takes two inputs and outputs how similar they are. There are four basic kernels for SVM: linear, polynomial, radial basis function (RBF), and sigmoid [186]. The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C; uninformed choices may result in an extreme reduction of performance [103]. Tuning an SVM is more of an art than an exact science.
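Eq. 3.10 can be sketched directly. This is a minimal illustration, not the thesis code; in practice the accuracy argument would come from the SVM/LOOCV evaluation described above.

```python
def fitness_value(accuracy, x, w1=0.85):
    # Eq. 3.10: weighted sum of the classification accuracy and the
    # fraction of genes left OUT of the subset (x is the bit vector,
    # so sum(x) is ns_i, the number of selected genes).
    n, ns = len(x), sum(x)
    return w1 * accuracy + (1 - w1) * (n - ns) / n
```

For example, a perfectly accurate classifier that uses no genes scores 1.0, while with w1 = 0.5 an accuracy of 0.9 with 4 of 10 genes selected gives 0.45 + 0.5 * 0.6 = 0.75. The default w1 = 0.85 here reflects the tuned value reported in Chapter 4.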
Selection of a specific kernel and the relevant parameters can be achieved empirically. For the SVM, the penalty factor C is set to 2000, and Gamma is set as adopted in [152]. The use of the linear and RBF kernels and their parameter tuning is discussed in Section 4.2. A widely used SVM tool, LIBSVM [27], is incorporated with our algorithm to calculate the LOOCV accuracy. There are two steps involved in using LIBSVM: first, the training dataset is used to obtain a model; second, the model is used to predict the labels of the testing dataset.
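The train/predict cycle above, repeated once per held-out sample, is exactly what LOOCV does. Below is a minimal sketch, not the thesis code: train_fn and predict_fn are hypothetical stand-ins for the LIBSVM training and prediction calls, and the demo classifier simply predicts the majority training label.

```python
from collections import Counter

def loocv_accuracy(X, y, train_fn, predict_fn):
    # Leave-one-out cross validation: for N samples run N experiments,
    # each training on N-1 samples and testing on the single held-out
    # one; the fraction of correct predictions is the LOOCV accuracy.
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = train_fn(X_train, y_train)
        correct += int(predict_fn(model, X[i]) == y[i])
    return correct / len(X)

# Toy stand-in for an SVM: the "model" is just the majority class.
def train_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model
```

Because LOOCV is deterministic given a deterministic classifier, repeated evaluations of the same gene subset yield the same accuracy, which is why the thesis notes that repetition is not needed.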

Cross-validation is believed to be a good method for selecting a subset of features [23]. Leave-one-out cross-validation (LOOCV) is the extreme case of k-fold cross-validation in which k is chosen as the total number of examples. For a dataset with N examples, N experiments are performed: in each experiment the classifier learns on N - 1 examples and is tested on the one remaining example. In other words, a single observation from the original sample is selected as the validation data, the remaining observations serve as the training data, and this process is repeated so that each observation in the sample is used exactly once as validation data. The average error is computed from the number of misclassifications and is used to evaluate the model. The beauty of leave-one-out cross-validation is that, regardless of how many times it is repeated, it will produce the same result each time; thus repetition is not needed.

Pseudocode for the Modified ABC Algorithm

Finally, the steps of our modified ABC algorithm are given in Algorithm 17 and the pseudocode is given in Algorithm 18. A flowchart of the proposed gene selection method based on Algorithm 18 is also provided.

Summary

Gene selection for cancer classification has become one of the most studied research topics in the biomedical field. To address the problem we have introduced a modified artificial bee colony algorithm; the application of the artificial bee colony algorithm to gene selection had not been studied before. We have also considered the application of some other evolutionary algorithms, including basic ABC, ACO, and GA. In this chapter the proposed method has been described in detail, and the application of the evolutionary algorithms under consideration has been discussed. Future research could be directed towards further investigation of the behavior of the different parameters.

1   initialize the whole population by selecting all the genes
2   initialize the pheromones to tmax
3   repeat
4     for each employed bee do
5       produce a new solution using the neighborhood operator
6       apply local search to the newly produced solution with probability probls
7       evaluate its fitness
8       apply greedy selection between the new solution and the current solution
9       apply the communication operator with probability r_4
10      lay pheromone
11      update pbest and gbest
12    end
13    evaluate the probability values of the food sources
14    for each onlooker bee do
15      select a food source depending on its fitness using a selection procedure
16      produce a new solution using the neighborhood operator
17      apply local search to the newly produced solution with probability probls
18      calculate its fitness
19      apply greedy selection between the new solution and the current solution
20      lay pheromone
21      update pbest and gbest
22    end
23    abandon the food positions that are exhausted by the bees
24    for each abandoned position do
25      appoint a scout bee
26      send the scout bee into the solution space to discover a new food source randomly
27      lay pheromone
28      update pbest and gbest
29    end
30    evaporate pheromone
31  until the stopping criteria are met;
32  the gene subset corresponding to gbest is the optimal subset found by the algorithm
Algorithm 17: Steps of the modified Artificial Bee Colony Algorithm

// Initialization
1   for i = 1 to n do
2     p_i = tmax;
3   end
4   for i = 1 to PS do
5     initRandom(S_i);
6   end
7   repeat
      // Employed bee phase
8     for i = 1 to PS do
        // produce a new solution using the neighborhood operator
9       E' = Neighbor(S_i);
        // apply local search with probability probls
10      E' = LocalSearch(E');
11      if fitness(E') > fitness(S_i) then
12        S_i = E';
13      end
14      Communicate(i);
15      UpdateBest(S_i);
16      Lay Pheromone;
17    end
      // Onlooker bee phase
18    for i = 1 to PS do
        // select a bee index using the selection procedure
19      j = Selection();
        // produce a new solution from the selected bee using the neighborhood operator
20      O' = Neighbor(S_j);
        // apply local search with probability probls
21      O' = LocalSearch(O');
22      if fitness(O') > fitness(S_j) then
23        S_j = O';
24      end
25      UpdateBest(S_j);
26      Lay Pheromone;
27    end
      // Scout bee phase
28    for i = 1 to PS do
29      if trial_i > limit then
30        initRandom(S_i);
31        UpdateBest(S_i);
32        Lay Pheromone;
33      end
34    end
35    Evaporate Pheromone;
36  until the stopping criteria are met;
37  the gene subset corresponding to gbest is the optimal subset found by the algorithm
Algorithm 18: Modified Artificial Bee Colony Algorithm
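The pheromone bookkeeping that "Lay Pheromone" and "Evaporate Pheromone" refer to (Eqs. 3.2, 3.5, and 3.6) can be sketched as below. This is a minimal illustration under our own naming, not the thesis code; the random factors r_0, r_1, r_2 are drawn inside the deposit step, and the default c_0 value is an assumption.

```python
import random

def deposit_pheromone(p, w, x_i, f_i, x_pbest, pf_i, x_gbest, gf,
                      c0=0.2, c1=0.4, c2=0.4):
    # Eq. 3.2: reinforce the pheromone of every gene selected by the
    # bee, its personal best, and the global best; c0 + c1 + c2 = 1
    # and c1 = c2 so both bests contribute equally (c0 is assumed).
    r0, r1, r2 = random.random(), random.random(), random.random()
    return [p[d] * w
            + r0 * c0 * f_i * x_i[d]
            + r1 * c1 * pf_i * x_pbest[d]
            + r2 * c2 * gf * x_gbest[d]
            for d in range(len(p))]

def evaporate(p, rho):
    # Eq. 3.5: retain the fraction rho of each gene's pheromone;
    # (1 - rho) is the evaporation coefficient.
    return [pd * rho for pd in p]

def update_tmax(tmax, rho, gf):
    # Eq. 3.6: raise the pheromone ceiling when a new global best with
    # fitness gf is found, so frequent genes can keep accumulating.
    return tmax * (1 + rho * gf)
```

A gene that no current solution selects receives no reinforcement, so its pheromone shrinks through evaporation, and uninformative genes gradually lose their influence on initialization and scouting.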

Chapter 4

Experimental Results and Discussion

In this chapter we discuss the experimental settings and results. The algorithm is iterated MAX_ITER times to obtain an optimal gene subset. The gene subset is then classified using an SVM with LOOCV to find the accuracy of the subset, which gives the performance outcome of a single run. To measure the performance of our approach and to tune the parameters, the algorithm is run multiple times (at least 15 times); the average accuracy, along with the number of selected genes over all the runs for a single parameter combination, represents the performance of that parameter combination.

Two types of experimental results are presented. First, different algorithm parameters are tuned to enhance the performance of the algorithm using one of the datasets, and the contributions of the different parameters are analyzed. Then, using the optimal parameter combination, the performance is evaluated on ten publicly available datasets; a comparison with previous methods that used the same datasets is also presented. In all cases the optimal results (maximum accuracy and minimum selected gene size) are highlighted using boldface font.

A description of the datasets is given in Section 4.1. Section 4.2 explains the systematic parameter tuning and the behavior of the different parameters of the algorithm; the optimized parameter values and the reasoning behind their selection are also discussed in detail. The performance outcome of utilizing ACO, ABC, and GA as search methods for gene selection is reviewed in Section 4.3. A comparative study between the different metaheuristics and existing gene selection methods is presented in Section 4.4. Additional tuning of parameters to obtain better performance is illustrated in Section 4.5; two different parameter settings are presented in that section.

4.1 Datasets

A brief description of the datasets is presented in Table 4.1, and Table 4.2 lists their attribute summary. The datasets contain both binary and multi-class high-dimensional data. The online supplement to the datasets [215] used in this thesis is available online. The datasets are distributed as Matlab data files (.mat); each file contains a matrix whose columns consist of the diagnosis (first column) and the genes, and whose rows are the samples.

4.2 Parameter Tuning

Different parameters have been experimentally tested with different values. For each parameter value we ran the algorithm at least 15 times. During these runs, all the parameter values except the parameter being tuned are set to their defaults, which are given in Table 4.3. The average and standard deviation over all the runs represent the performance of a specific value. The default parameter values were set from intuition and preliminary runs. The parameter value with the maximum average fitness has been picked as the optimal parameter value unless stated otherwise. A full factorial combination of parameter values is avoided under the assumption that the parameters are not interdependent.

For tuning the parameters, one of the ten datasets, 9_Tumors, is selected. The reason for selecting this dataset is that it is a multiclass dataset and its gene size (5,726) is neither too large nor too small compared to the others (please refer to Table 4.2); it also has a moderate sample size (60). The analysis of the tuned parameters is discussed in the subsequent sections.

Probability of Applying the Communication Operator, r_4

To analyze this parameter, values from {0.0, 0.1, 0.3, 0.5, 0.7, 1.0} have been used. The value 0.0 means running the algorithm without applying the communication operator; the value 1.0 means applying the operator in every iteration. The experimental results are presented in Table 4.4.
From the experimental data we can see that the use of the communication operator increases the accuracy, while the number of selected genes tends to increase a little. With an increase in the probability of applying the communication operator there is also an increase in accuracy. So this experiment indicates that the communication

Table 4.1: Description of the datasets used for experimental evaluation

9_Tumors: Oligonucleotide microarray gene expression profiles for the chemosensitivity profiles of 232 chemical compounds [216]

11_Tumors: Transcript profiles of 11 common human tumors, covering carcinomas of the prostate, breast, colorectum, lung, liver, gastroesophagus, pancreas, ovary, kidney, and bladder/ureter [219]

Brain_Tumor1: Medulloblastomas, including primitive neuroectodermal tumors (PNETs), atypical teratoid/rhabdoid tumors (AT/RTs), malignant gliomas, and medulloblastomas activated by the Sonic Hedgehog (SHH) pathway [187]

Brain_Tumor2: Transcript profiles of four malignant gliomas: classic glioblastoma, non-classic glioblastoma, classic oligodendroglioma, and non-classic oligodendroglioma [176]

DLBCL: DNA microarray gene expression profiles of diffuse large B-cell lymphoma (DLBCL), in which the DLBCL can be identified as cured versus fatal or refractory disease [210]

Leukemia1: DNA microarray gene expression profiles of acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL) B-cell and T-cell [85]

Leukemia2: Gene expression profiles of a chromosomal translocation, distinguishing mixed-lineage leukemia (MLL), acute lymphoblastic leukemia (ALL), and acute myelogenous leukemia (AML) [4]

Lung_Cancer: Oligonucleotide microarray transcript profiles of lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas, and normal lung tissue [13]

Prostate_Tumor: cDNA microarray gene expression profiles of prostate tumors.
Based on MUC1 and AZGP1 gene expression, the prostate cancer can be distinguished as a subtype associated with an elevated risk of recurrence or with a decreased risk of recurrence [213]

SRBCT: cDNA microarray gene expression profiles of small, round blue-cell tumors, including neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS) [133]

Table 4.2: Attributes of the datasets used for experimental evaluation

Name of the dataset | Sample size | Number of genes | Number of classes
9_Tumors | 60 | 5,726 | 9
11_Tumors | 174 | 12,533 | 11
Brain_Tumor1 | 90 | 5,920 | 5
Brain_Tumor2 | 50 | 10,367 | 4
DLBCL | 77 | 5,469 | 2
Leukemia1 | 72 | 5,327 | 3
Leukemia2 | 72 | 11,225 | 3
Lung_Cancer | 203 | 12,600 | 5
Prostate_Tumor | 102 | 10,509 | 2
SRBCT | 83 | 2,308 | 4

operator indeed guides the bees in finding solutions with higher accuracy. After analyzing the results in Table 4.4, we have decided to use 0.5 as the value of r_4 in our final experiments.

Use of Pheromone, uph

To understand the contribution of the pheromone to our algorithm, we need to observe the algorithm's performance both with and without it. Table 4.5 presents the performance of the algorithm with and without using pheromone. The accuracy remains almost unchanged with the use of pheromone, but from the experimental results we can see an enormous reduction in the number of selected genes. The reason behind this improvement is that the pheromone emphasizes the informative genes found in previous iterations; many more iterations would be needed to reach the target fitness if we relied mostly on random exploration.

Probability of Local Search, probls

Values from 0 to 1, with step size 0.1, are tested to tune this parameter. The results are listed in Table 4.6. The relation between the change in accuracy and probls is presented in Fig. 4.1, and the relation between the change in selected gene size and probls is exhibited in Fig. 4.2. From the experimental values it is observed that the accuracy of the algorithm increases with the probability of employing local search, while the selected gene size tends to decrease. The probability value 0.7 has been used as the optimized value in our algorithm to ensure that too much exploitation

Table 4.3: Default parameter values for tuning

probls = 0.4 (probability of local search in the employed and onlooker bee stages)
ρ = 0.8 (pheromone persistence factor)
w = 1.4 (inertia weight)
w_1 (weight of accuracy in the fitness equation)
th_n = 0.1 (percentage of genes to be selected in the preprocessing step)
hc_iter = 12 (number of iterations for HC)
sahc_iter = 12 (number of iterations for SAHCR)
sahc_tweak = 5 (number of tweaks for SAHCR)
sa_iter = 12 (number of iterations for SA)
t = 10 (value of the temperature t for SA)
schedule = 0.5 (value of the schedule for SA)
tmax = 5 (maximum pheromone value)
tmin = 0 (minimum pheromone value)
c_0 (weight of an individual in the pheromone update)
MAX_ITER = 20 (number of iterations)
limit = 5 (iterations to determine an exhausted food source)
nd = 0.02 (percentage of genes to be removed in the neighborhood operation)
PS = 20 (population size)
r_4 (probability of performing the communication operation)
ls_e = SAHCR (local search in the employed bee stage)
ls_o = SA (local search in the onlooker bee stage)
st = Tournament Selection (selection procedure used in the onlooker bee stage)
kernel = Linear (kernel to be used in the SVM)
wt = Equation 3.3 (inertia weight update equation)
uph = True (whether to use pheromone or not)
prefilter = Kruskal-Wallis (prefiltering method)

Table 4.4: Performance outcome for different values of parameter r_4

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)

Table 4.5: Performance outcome for different values of parameter uph

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)
FALSE
TRUE

is not done, despite the fact that the value 1.0 gives the highest accuracy. Note also that more frequent application of the local search increases the running time.

Table 4.6: Performance outcome for different values of parameter probls

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)

Neighborhood Operator Destruction Size, nd

For tuning the parameter nd, values from {0.001, 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.05, 0.06, 0.08, 0.1, 0.15, 0.2} have been considered. The experimental results are presented in Table 4.7. The accuracy seems to improve with higher values of nd. This is because a large value of nd means that a large number of genes will be removed from the individual, which also increases the probability of removing more noisy genes. But a high value of nd results in a high number of selected genes: when seeking a neighbor, both too small and too large a jump at a time fail to find a potential neighbor, so if nd is too large, finding a potential candidate after reducing the gene subset becomes less frequent. As a result the algorithm fails to achieve a small gene subset. The chosen value of nd demonstrates a good enough accuracy with a tolerable gene set size among all the values considered for the parameter.

Figure 4.1: Obtained accuracy with different values of probls

Figure 4.2: Selected gene size with different values of probls

Table 4.7: Performance outcome for different values of parameter nd

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)

Pheromone Persistence Factor, ρ

The amount of pheromone to be retained from previous iterations is determined by ρ; accordingly, (1 - ρ) is known as the pheromone evaporation coefficient. In this experiment, values of ρ from 0 to 1 with step size 0.1 are tested. The performance for the different values of ρ is presented in Table 4.8. The experimental outcome shows that an increase in the value of ρ results in an increase in both accuracy (up to ρ = 0.8) and selected gene set size. In each iteration, (1 - ρ) × 100% of the current pheromone is evaporated, so an increase in ρ means less evaporation, and the pheromone containing the history of the previous iterations is retained longer. Good solution components therefore keep high pheromone values for more iterations, as previous knowledge is remembered for a much longer period of time, and during gene selection they are more likely to be selected. Hence the bees' ability to use their experience while selecting genes also strengthens, which results in higher accuracy. At the same time, the redundant and noisy genes that get selected also contribute for a longer period of time, which results in a higher selected gene size. The value 0.8 is set as the tuned value for ρ as it shows the highest accuracy. When no pheromone is evaporated (ρ = 1.0), the algorithm selects the lowest number of genes, as the information about all the components is stored and thus the most informative genes are detected; but for this value the obtained accuracy is also the lowest, because of stagnation. The definition of

stagnation has been given earlier.

Table 4.8: Performance outcome for different values of parameter ρ

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)

Weight of Accuracy in Fitness, w_1

The tradeoff between accuracy and the number of selected genes is controlled by the parameter w_1. In the fitness equation, w_1 is the weight of the accuracy, whereas (1 - w_1) is the weight of the number of selected genes. To tune the parameter w_1, values from {0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9} are considered. The results of using the different values are given in Table 4.9. Fig. 4.3 shows the change in accuracy with respect to the change in w_1. The graph shows an increasing trend of accuracy, reaching its peak at the value 0.85 of w_1; for further values the accuracy remains almost the same. An increase in w_1 results in an increase in accuracy, which is expected, because an increase in w_1 means accuracy is weighted more. Fig. 4.4 shows the behavior of the selected gene size with respect to the change in w_1. An increase in w_1 also means that the selected gene size weighs less in the fitness function, so for higher values of w_1 the number of selected genes is also high. From the graphs (Figs. 4.3 and 4.4) it is confirmed that the algorithm performs best in the range 0.7 to 0.85 considering both constraints. The value 0.85 is selected as the optimized value as it gives the highest accuracy.

Population Size, PS

Values in the range 10 to 50, with step size 5, are considered for tuning the parameter PS. Experimental results using the different values are given in Table 4.10. A notable

Table 4.9: Performance outcome for different values of parameter w_1

Values | Accuracy (Avg., S.D.) | No. of selected genes (Avg., S.D.)

Figure 4.3: Obtained accuracy with different values of w_1

Figure 4.4: Selected gene size with different values of w1 increase in PS shows only a negligible increase in accuracy. A slight reduction of the selected gene set size is also noticed up to a value of 30. For population size 30, the number of selected genes is noticeably small, but the value increases again for higher population sizes. For population size 40 the highest accuracy is achieved. Population size is nevertheless kept at 25, which shows an acceptable level of accuracy; a larger PS would only increase running time without significantly improving the solution. Table 4.10: Performance outcome for different values of parameter PS Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. Prefiltering Method As mentioned earlier, we have considered two prefiltering methods: one parametric (F-test) and the other non-parametric (Kruskal-Wallis). The performance of

application of these two methods in our approach is reported in Table 4.11. The Kruskal-Wallis method shows better performance in terms of accuracy. This is expected because Kruskal-Wallis is a nonparametric method which is known to be suitable for gene selection [47, 73]. On the other hand, the F-test seems to filter out the redundant genes more effectively, resulting in a relatively smaller selected gene set (at least for the dataset we used for parameter tuning, i.e., 9_Tumors). The performance of the filtering methods depends on whether or not the dataset follows a normal distribution, so a further study may consider choosing the prefiltering method based on a normality check of the dataset. We have used the Kruskal-Wallis method in our final experiments. Table 4.11: Performance outcome for different values of parameter Prefiltering Method Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. F-test Kruskal-Wallis Selection Method at the Onlooker Bee Stage As the selection procedure at the onlooker bee stage, Tournament Selection and Fitness-Proportionate Selection have been reviewed. The experimental results are presented in Table 4.12. Fitness-proportionate selection yields a smaller selected gene set, whereas tournament selection provides slightly higher classification accuracy. We have used tournament selection in our final experiments. Table 4.12: Performance outcome for different values of parameter Selection Method Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. Fitness Proportionate Selection Tournament Selection Kernel Method for SVM As the kernel method for SVM, the use of both linear and RBF kernels has been examined. The experimental results are presented in Table 4.13. The outcome exhibits better performance for the linear kernel in terms of both accuracy and the number of selected genes. This is because the linear kernel is suitable for high-dimensional, small-sample-sized data [103,142].
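As a concrete illustration of the evaluation step, the sketch below scores a candidate gene subset with a cross-validated linear-kernel SVM and folds the subset size into a w1-weighted fitness. The size term (the fraction of genes excluded) and the function name are our assumptions; the thesis's own fitness equation gives the exact form.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def subset_fitness(X, y, mask, w1=0.85, cv=3):
    """w1-weighted fitness: accuracy of a linear-kernel SVM on the masked
    genes, plus (1 - w1) times the fraction of genes excluded."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():          # an empty subset classifies nothing useful
        return 0.0
    acc = cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=cv).mean()
    reduction = 1.0 - mask.sum() / mask.size
    return w1 * acc + (1.0 - w1) * reduction
```

Swapping `kernel="linear"` for `kernel="rbf"` reproduces the comparison in Table 4.13.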

Also, the linear kernel is reported to have shown better performance than the RBF kernel for gene selection [45]. Notably, however, one study suggests that the choice of kernel does not affect performance in most cases [144]. Table 4.13: Performance outcome for different values of parameter Kernel Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. RBF Linear Inertia Weight (w) Update Approach For updating the inertia weight w, we have discussed two approaches (i.e., Equation 3.3 and Equation 3.4). The experimental outcomes using Eq. 3.3 and Eq. 3.4 for updating the inertia weight are reported in Table 4.14. The random update method, i.e., Eq. 3.4, gives better performance with respect to selected gene size, while Eq. 3.3 gives better accuracy. We have decided to use Eq. 3.3 to update the inertia weight in our final experiments. Table 4.14: Performance outcome for different inertia weight update equations Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. Eq. 3.3 Eq. 3.4 Local Search A local search procedure is used in both the employed bee and onlooker bee stages. As local search procedures, Hill Climbing (HC), Simulated Annealing (SA), and Steepest Ascent Hill Climbing with Replacement (SAHCR) are proposed for both stages. Results employing different local search methods in these stages are reported in Table 4.15. For all the experiments related to local search methods we have kept the value of probls at 1.0 unless stated otherwise.
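For concreteness, the tournament selection adopted earlier for the onlooker bee stage can be sketched as follows; the tournament size k is our illustrative assumption, as the thesis does not fix it here.

```python
import random

def tournament_select(population, fitnesses, k=2):
    """Pick k candidates uniformly at random and return the index of the
    fittest one, so selection pressure grows with k."""
    contenders = random.sample(range(len(population)), k)
    return max(contenders, key=lambda i: fitnesses[i])
```

Fitness-proportionate selection would instead sample indices with probability proportional to `fitnesses`, e.g. via `random.choices(range(n), weights=fitnesses)`.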

Table 4.15: Performance outcome for different local search methods in the employed bee and onlooker bee stages Local search Accuracy No. of selected gene Employed Bee Onlooker Bee Avg. S. D. Avg. S. D. HC HC HC SA HC SAHCR SAHCR HC SAHCR SA SAHCR SAHCR SA SAHCR SA SA SA HC Local Search at the Employed Bee Stage The performance of Hill Climbing as the local search procedure at the employed bee stage, with HC, SAHCR, or SA as the local search method at the onlooker bee stage, has been examined. Hill climbing shows satisfactory performance in this stage when HC or SAHCR is run at the onlooker bee stage; the result is poor when SA is run as the local search method for the onlooker bee stage. Hill Climbing only exploits a potential solution, and its upward movement gets lost in the random walk of SA at the onlooker bee stage. Hill climbing is likely to get stuck in a local optimum, so when HC is used for the employed bee stage, the possibility of the employed bee getting stuck in a local optimum is higher. As a result, we need a combination of further exploitation and exploration by the onlooker bee when HC is applied in the employed bee stage. SA at the employed bee stage performs well only with SAHCR at the onlooker bee stage. Since SA needs its parameters tuned to perform well, the choice of SA parameters can have a significant impact on the method's effectiveness. In this experiment we used SA parameter values without tuning, so it failed to exhibit good performance. Despite that, SAHCR at the onlooker bee stage found good solutions from the SA outcome of the employed bees. Use of SAHCR at the employed bee stage gives satisfactory performance irrespective of the local search method employed at the onlooker bee stage. SAHCR tweaks multiple times in each iteration, which allows it to explore enough to find a good solution; consequently, applying SAHCR increases running time compared to SA and HC.
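The exploit-only behaviour of Hill Climbing described above can be sketched over a binary gene mask as follows; the single-bit tweak operator and names are illustrative, not the thesis's exact code.

```python
import numpy as np

def hill_climb(mask, fitness_fn, iters=14, seed=None):
    """Exploit-only search: flip one random gene in or out per iteration
    and keep the move only if it strictly improves the fitness."""
    rng = np.random.default_rng(seed)
    best = mask.copy()
    best_f = fitness_fn(best)
    for _ in range(iters):
        cand = best.copy()
        cand[rng.integers(cand.size)] ^= True  # single-bit tweak
        cand_f = fitness_fn(cand)
        if cand_f > best_f:                    # strictly uphill only
            best, best_f = cand, cand_f
    return best, best_f
```

Because it never accepts a worse move, this routine exploits the current slope but cannot escape a local optimum, which is exactly the limitation discussed above.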
Use of SA as the local search at the onlooker bee stage with SAHCR at the employed bee stage

shows comparatively poor performance. This can be attributed to the fact that a potential solution found by SAHCR might get lost in the initial random walk of SA. Thus in this stage we need an algorithm that helps the employed bee land on a potentially good slope so that, when the onlooker bee exploits and explores further, it can achieve a good solution. Local Search at the Onlooker Bee Stage HC at the onlooker bee stage shows poor performance. HC will provide a good solution in this stage only if it starts from a promising slope supplied by the employed bee stage; because of its lack of exploration capability, HC performs poorly here. SA at the onlooker bee stage also performs poorly. This can be attributed to the fact that SA sometimes goes downhill and initially performs mostly a random walk. At the onlooker bee stage we need to improve the solution already found by the employed bee, so onlooker bees are expected to exploit more than explore. Use of SA here might instead degrade a potential solution: the individual arriving from the employed bee stage is expected to already be on the slope of a possibly good solution, but because of its initial random walk, SA at this stage makes the probability of the onlooker bee losing that progress high. This mostly results in poor performance. SAHCR performs really well in this stage because, besides exploitation, it also does enough exploration to find a good solution. In fact, the algorithm performs best when SAHCR is applied in both stages, but using SAHCR in both stages increases the running time. The performance of SAHCR at the onlooker bee stage with either HC or SA at the employed bee stage seems quite satisfactory.
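A sketch of SAHCR as we understand it from the text: sample several tweaks per iteration, always move to the best tweak (even when that is downhill), and separately remember the best solution seen. The tweak operator and bookkeeping are our own illustration.

```python
import numpy as np

def sahcr(mask, fitness_fn, iters=12, n_tweaks=9, seed=None):
    """Steepest ascent with replacement: the current point always moves to
    the best of n_tweaks neighbours, while the best-so-far solution is
    kept aside and returned at the end."""
    rng = np.random.default_rng(seed)
    cur = mask.copy()
    best, best_f = cur.copy(), fitness_fn(cur)
    for _ in range(iters):
        scored = []
        for _ in range(n_tweaks):
            cand = cur.copy()
            cand[rng.integers(cand.size)] ^= True  # single-bit tweak
            scored.append((fitness_fn(cand), cand))
        cur_f, cur = max(scored, key=lambda s: s[0])
        if cur_f > best_f:
            best, best_f = cur.copy(), cur_f
    return best, best_f
```

The repeated tweaking per iteration is what gives SAHCR its extra exploration, and also its extra running time relative to HC and SA.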
So we have selected SAHCR as the local search method at the onlooker bee stage and SA at the employed bee stage to ensure both exploration and exploitation. Hill Climbing For hill climbing the only parameter tuned is the iteration count, with HC applied at the employed bee stage and SAHCR at the onlooker bee stage. Iteration counts from 10 to 20 with step size 2 have been considered. Results of using different iteration counts for HC at the employed bee stage are presented in Table 4.16. The experimental results show an increase of selected gene size with increasing iteration count, while accuracy remains stable; it reaches its highest at the value of 14. For this experiment the probability of local search is set to 1.

Table 4.16: Performance outcome for different iteration counts for Hill Climbing in the employed bee stage Values Accuracy No. of selected gene Avg. S. D. Avg. S. D. Simulated Annealing To tune the SA parameters we applied SA at the employed bee stage and SAHCR at the onlooker bee stage. To assess the contribution of the iteration count in SA, values from 10 to 20 with step size 2 have been considered. The experiments are performed for both the default (0.4) and the highest (1.0) value of the probability of local search. The results are given in Table 4.17. Accuracy increases with increased iteration count and reaches its peak at the value of 14 (20) for probability of local search 1.0 (0.4). When probls is 1.0, further increases in the iteration count cause a slight decrease in accuracy. The number of selected genes also tends to decrease with higher iteration counts. For both accuracy and the number of selected genes the value of 14 shows good performance and is set as the optimized value for this parameter. Table 4.17: Performance outcome for different iteration counts for Simulated Annealing in the employed bee stage probls 1.0 probls 0.4 Values Accuracy # selected gene Accuracy # selected gene Avg. S. D. Avg. S. D. Avg. S. D. Avg. S. D. For simulated annealing the parameter temperature t is tuned using values from {1, 3, 5, 7, 10, 12, 15}. Experiments are performed for both the default (0.4) and the highest (1.0) value of the probability of local search. The performance for different values of

t for SA at the employed bee stage is given in Table 4.18. Increasing the value of t causes a slight degradation in accuracy, while the number of selected genes increases with t. Initially t is set to a high number, which causes the algorithm to perform a random walk in the search space. A higher value of t therefore means that we random walk for a longer period of time, accepting every newly created solution regardless of how good it is. As a result the local search becomes poor at exploitation, yielding comparatively inferior solutions. In SA, t decreases slowly, eventually to 0, at which point the algorithm is doing just Hill Climbing. Table 4.18: Performance outcome for different temperature (t) values for Simulated Annealing in the employed bee stage probls 1.0 probls 0.4 Values Accuracy # selected gene Accuracy # selected gene Avg. S. D. Avg. S. D. Avg. S. D. Avg. S. D. The rate at which we decrease t is called the algorithm's schedule. The parameter schedule of simulated annealing is tuned using values from 0.1 to 1 with step size 0.1. Experiments are performed for both the highest (1.0) and the default (0.4) probability of local search. The results of using different values for schedule are presented in Table 4.19. The accuracy remains almost the same as the value of schedule changes. We consider the value 0.5 as the optimized value for this parameter. Steepest Ascent Hill Climbing with Replacement Values from 10 to 20 with step size 2 have been evaluated as the iteration count of SAHCR. For this experiment SAHCR is set as the local search method in the onlooker bee stage and SA in the employed bee stage. The results of tuning the iteration count of SAHCR are presented in Table 4.20. The results show an increase of accuracy with higher iteration counts. The value 12 is set as the optimized value as it shows acceptable accuracy; the value is kept small because an increased iteration count increases the algorithm's running time.
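The acceptance and cooling behaviour described above can be sketched as follows. We assume the standard Metropolis acceptance rule and, from the tuned values (t = 5, schedule = 0.5, tmin = 0), a subtractive cooling step; the actual update used in the thesis may differ.

```python
import math
import random

def sa_accept(delta, t):
    """Metropolis rule: always accept an improvement (delta >= 0); accept a
    worse move with probability exp(delta / t). At t = 0 this degenerates
    to plain hill climbing, as noted in the text."""
    if delta >= 0:
        return True
    if t <= 0:
        return False
    return random.random() < math.exp(delta / t)

def cool(t, schedule=0.5, tmin=0.0):
    """Lower the temperature by `schedule` each iteration, never below tmin."""
    return max(t - schedule, tmin)
```

With a high t almost every downhill move is accepted (the random-walk phase); as t approaches tmin the search becomes purely uphill.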
To review the effect of applying different numbers of tweaks, the values from {5, 6, 7, 8, 9, 10, 12, 15, 20} are considered. The experimental outcomes are listed in Table 4.21.

Table 4.19: Performance outcome for different values of the parameter schedule for Simulated Annealing in the employed bee stage probls 1.0 probls 0.4 Values Accuracy # selected gene Accuracy # selected gene Avg. S. D. Avg. S. D. Avg. S. D. Avg. S. D. Results of this experiment show an improvement of accuracy with more tweaking. The selected gene size also tends to decrease with a higher number of tweaks at both stages. The value 9 is taken as the final value for the number of tweaks in SAHCR. For all the experiments related to the different parameters of SAHCR, both the highest (1.0) and the default values of probls are considered. Table 4.20: Performance outcome for different iteration counts for Steepest Ascent Hill Climbing with Replacement in the onlooker bee stage probls 1.0 probls 0.4 Values Accuracy # selected gene Accuracy # selected gene Avg. S. D. Avg. S. D. Avg. S. D. Avg. S. D. Percentage of Genes to be Selected from Prefiltering Step, th_n After the filtering technique is applied, the genes are ranked and a percentage of the top-ranked genes is selected to be passed on to the next stage. The parameter th_n determines the percentage of genes to be selected.

Table 4.21: Performance outcome for different values of the parameter tweak for Steepest Ascent Hill Climbing with Replacement probls 1.0 probls 0.4 Values Accuracy # selected gene Accuracy # selected gene Avg. S. D. Avg. S. D. Avg. S. D. Avg. S. D. The values for tuning th_n are taken from {0.004, 0.01, 0.02, 0.03, 0.04, 0.045, 0.05, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.1, 0.2} when the Kruskal-Wallis test is utilized as the prefiltering technique. The results are presented in Table 4.22. We can see that too small a value of th_n causes the filtering technique to discard most of the genes, including the informative ones, which results in poor predictive accuracy. For this dataset, the value 0.03 (171 genes selected from the prefiltering stage) gives the highest accuracy. For higher values of th_n the accuracy remains almost the same; in fact it reduces a little for high values of th_n. The task of the preprocessing step is to discard the irrelevant genes, so a higher value of th_n gives lower accuracy because of the presence of noisy genes. Again, too low a value of th_n means informative genes are discarded in the prefiltering step. The experimental results also support the finding [14,153,154,174,217,224] that the use of all the genes potentially hampers classifier performance. Table 4.22 also reports the number of genes selected from the preprocessing stage. A slightly larger value is selected as the optimal one even though 0.03 gives the best accuracy, because choosing 0.03 might risk discarding informative genes for other datasets. Threshold to Select Genes from Prefiltering Step, th_p The p-value for each gene is computed and then all the genes are sorted according to p-value in the filtering step. To select the top-ranked genes from Kruskal-Wallis (the F-test), we need to fix a threshold and select all the genes having p-value less (greater) than the threshold.
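The two prefiltering criteria (top fraction th_n, p-value threshold th_p) can be sketched together as follows; scipy's `kruskal` is used for the nonparametric ranking, and the helper names are ours.

```python
import numpy as np
from scipy.stats import kruskal

def gene_pvalues(X, y):
    """Kruskal-Wallis p-value of every gene across the class groups."""
    classes = np.unique(y)
    return np.array([kruskal(*[X[y == c, g] for c in classes]).pvalue
                     for g in range(X.shape[1])])

def select_top_fraction(pvals, th_n):
    """th_n criterion: keep the top th_n fraction of genes by p-value."""
    k = max(1, int(round(th_n * pvals.size)))
    return np.argsort(pvals)[:k]

def select_by_pvalue(pvals, th_p):
    """th_p criterion: keep every gene whose p-value falls below th_p."""
    return np.flatnonzero(pvals < th_p)
```

For the parametric variant, `scipy.stats.f_oneway` would replace `kruskal`, with the comparison direction reversed if the statistic rather than the p-value were thresholded.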
The values from {0.0005, 0.001, 0.002, 0.003, 0.005, 0.007, 0.009, 0.01, 0.02, 0.025, 0.03, 0.04,

Table 4.22: Performance outcome for percentage of genes selected in prefiltering stage, th_n Values # genes selected from prefiltering stage Accuracy No. of selected gene Avg. S.D. Avg. S.D. } are taken into consideration to tune the parameter th_p, and Kruskal-Wallis was set as the prefiltering method. The results of the experiments are reported in Table 4.23. An increase in th_p shows a rapid increase in the selected gene size. The highest accuracy is attained at the value of (150 genes selected from the prefiltering stage). For further values of th_p a slight decline in the accuracy is noticed. This happens because, as th_p increases, fewer genes are filtered out, which results in the selection of noisy genes for the next stage; the algorithm thus exhibits unsatisfactory performance. Weight of Individual Bee in Pheromone Deposition Equation, c0 The contributions of the current bee S_i, the individual best pbest_i, and the global best gbest in pheromone laying are determined by c0, c1, and c2 respectively. To find a suitable value for the parameter c0, values from {0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} are considered. The contributions of pbest_i and gbest are kept the same, i.e., c1 = c2. Some experiments are also performed for the values {0.7, 0.6, 0.5, 0.4, 0.3, 0.2} of c0 while keeping the contribution of pbest_i, i.e., c1, at 0. The value of c2 is calculated so that

Table 4.23: Performance outcome for threshold of p-value in prefiltering stage, th_p Values # genes selected from prefiltering stage Accuracy No. of selected gene Avg. S.D. Avg. S.D. c0 + c1 + c2 = 1. The results are presented in Table 4.24. From the observed results we find that too low a value of c0 results in both low accuracy and a small selected gene set. The gene subset remains small because the best solutions contribute more to the pheromone, so only a small number of components receive high pheromone values and get selected; but such limited exploration may cause the algorithm to get stuck at a local optimum. Conversely, too high a value of c0 masks the contribution of good solution components obtained from the personal and global best solutions, which also results in lower accuracy and a larger selected gene set. The value 0.6 is set as the optimized value for c0 as it shows good results. Maximum Number of Algorithm Iterations, MAX_ITER This parameter gives the maximum number of times the modified ABC will be run to obtain a single solution. To tune the parameter, values are taken from the range 10 to 50 with step size 10. The experimental results are presented in Table 4.25. From the outcome no significant correlation can be found: the accuracy remains almost the same, and although changes in the selected gene size can be noticed as MAX_ITER varies, no clear relation is found. The experimental data also show that the average number of iterations the algorithm needs to reach its final result remains almost the same for the various values of MAX_ITER, so we can conclude that the algorithm converges very quickly. The value 20 has been chosen for MAX_ITER because it shows

Table 4.24: Performance outcome for different values of c0 c0 c1 c2 Accuracy No. of selected gene Average S. D. Average S. D. the best performance. Notably, a higher value would only increase running time without improving the accuracy. Table 4.25: Performance outcome for different values of parameter MAX_ITER Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. Average number of iterations needed to reach the final solution Number of Trials without Improvement, limit A food source which cannot be improved within a predetermined number (limit) of trials is abandoned by its employed bee, and the employed bee is converted to a scout. The variable trial_i keeps track of the number of times the fitness remains unchanged for the i-th bee S_i. To tune the parameter limit, the values {5, 7, 10, 12, 15, 20, 25, 30, 35, 40, 100, 200, 500, 800, ∞} are

used. The value ∞ for the parameter limit means that no food sources are abandoned by the employed bees; in other words, there are no scout bees when limit is set to ∞. The results are reported in Table 4.26. An increase in limit exhibits an improvement of performance in terms of both accuracy (Fig. 4.5) and the number of selected genes (Fig. 4.6). The accuracy is highest when the value of limit is 100. From the graph presented in Fig. 4.5 an increasing trend for accuracy is visible, but when limit is very high the accuracy remains almost the same. For the number of selected genes, the graph in Fig. 4.6 shows that when limit is very low the selected gene size is very high; otherwise a decreasing trend is visible. A small value of limit means that the individuals are randomly re-initialized more frequently, so a potential solution may be discarded before the individual gets a chance to exhaust it amply [8]; a good solution might thus be lost midway. Hence a lower value of limit results in a reduction in accuracy. An increase in limit also tends to increase the average number of iterations the algorithm needs to reach its final solution. This might be because a larger value of limit means less random exploration capability, so the algorithm needs more time to reach the final solution. When limit is either too small or too large, the results obtained by our algorithm are worse than those produced by moderate values of limit. The results therefore show that a proper frequency of new solution production has a useful effect on the solution fitness, allowing enough exploration to improve the search ability of the algorithm. However, the balance between exploration and exploitation is disturbed when limit is either too small or too large, which produces worse solutions and costs much more running time. The obtained accuracy is highest for the limit value 100, but a high value of limit may result in less exploration.
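The abandonment rule can be sketched as below; the list-based bookkeeping and init_fn are our own illustration of the trial_i counters described above.

```python
def scout_phase(bees, fitnesses, trials, limit, init_fn):
    """Replace any food source whose trial counter has exceeded `limit`
    with a fresh random solution (its employed bee becomes a scout)."""
    for i in range(len(bees)):
        if trials[i] > limit:
            bees[i] = init_fn()    # random re-initialisation
            fitnesses[i] = None    # must be re-evaluated next cycle
            trials[i] = 0
    return bees, fitnesses, trials
```

Setting `limit` to infinity makes the condition unreachable, which is exactly the no-scout configuration discussed above.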
Thus, we recommend limit = 35 after considering the experimental outcomes. Optimized Parameter Values The optimized parameter values are listed in Table 4.27. We have conducted our experiments on the optimized parameter values listed in Table 4.27 as well as on the default parameter values listed in Table 4.3. The comparison between the results obtained with the default and the optimized parameter values for different datasets is presented in Table 4.28. For both parameter settings the best, average, standard deviation (S.D.), and worst results are reported. In all cases the optimized values give better results in terms of both accuracy and the number of selected genes. For the optimized parameter values we can conclude that the algorithm performs consistently across all datasets, based on the standard deviation for accuracy (maximum 0.01) and for the number of selected genes (maximum 5.64). Our algorithm

Table 4.26: Performance outcome for different values of parameter limit Values Accuracy No. of selected gene Avg. S.D. Avg. S.D. Average number of iterations needed to reach the final solution Figure 4.5: Obtained accuracy with different values of limit

Figure 4.6: Selected gene size with different values of limit in fact has achieved satisfactory accuracy even for the default parameter settings, albeit with a high standard deviation for the number of selected genes in most cases. The main reason for the high standard deviation in the selected gene size under the default parameter setting can be attributed to the high default value of c0 and the low default value of limit. Fig. 4.7 shows the distribution of the obtained accuracy under the optimized parameter settings for the datasets 9_Tumors and 11_Tumors; for all other datasets our method obtained 100% accuracy in all runs. The horizontal axis represents the accuracy and the vertical axis the percentage of runs in which the corresponding accuracy is obtained. Similarly, Fig. 4.8 shows the distribution of the selected gene size under the optimized parameter settings for all the datasets. The horizontal axis represents the selected gene size and the vertical axis the percentage of runs in which the corresponding gene size is obtained. 4.3 Performance of Different Evolutionary Algorithms Different evolutionary algorithms can be considered as the search method for gene selection. In this section we present the gene selection performance obtained with different evolutionary algorithms, including the genetic algorithm, ACO, and ABC.

Table 4.27: Optimized parameter values after tuning Parameter Optimized value probls 0.7 ρ 0.8 w 1.4 w th_n sahc iter 12 sahc tweak 9 sa iter 14 t 5 schedule 0.5 tmax 5 tmin 0 c MAX_ITER 20 limit 35 nd PS 25 r ls e SA ls o SAHCR selection method Tournament Selection kernel Linear wt Equation 3.3 uph True prefilter Kruskal-Wallis Figure 4.7: Distribution of classification accuracy for the dataset (a) 9_Tumors; (b) 11_Tumors

Table 4.28: Comparative experimental results of the best subsets produced by mABC using default and optimized parameter settings for different datasets Dataset Name Evaluation Criteria Default value Optimized value Best Avg. S. D. Worst Best Avg. S. D. Worst 9_Tumors Accuracy # Genes 11_Tumors Accuracy # Genes Brain_Tumor1 Accuracy # Genes Brain_Tumor2 Accuracy # Genes DLBCL Accuracy # Genes Leukemia1 Accuracy # Genes Leukemia2 Accuracy # Genes Lung_Cancer Accuracy # Genes Prostate_Tumor Accuracy # Genes SRBCT Accuracy # Genes

Figure 4.8: Distribution of the number of times the selected gene size falls in a specific range for (a) 9_Tumors; (b) 11_Tumors; (c) Brain_Tumor1; (d) Brain_Tumor2; (e) Leukemia1; (f) Leukemia2; (g) DLBCL; (h) Lung_Cancer; (i) Prostate_Tumor; (j) SRBCT