A Comprehensive Fuzzy-Based Framework for Cancer Microarray Data Gene Expression Analysis

Zhenyu Wang, Computing Laboratory, Oxford University, Oxford, OX1 3QD, UK
Vasile Palade, Computing Laboratory, Oxford University, Oxford, OX1 3QD, UK

I. ABSTRACT

In this paper, we employ fuzzy techniques for analysing cancer microarray gene expression data. A fuzzy-based ensemble model and a comprehensive fuzzy-based framework for cancer microarray data analysis are proposed. Our methods were tested on three benchmark microarray cancer data sets, namely the Leukemia Cancer Data Set, the Colon Cancer Data Set and the Lymphoma Cancer Data Set. Compared to other traditional statistical and machine learning models, the new approach can efficiently tackle several key problems in cancer microarray gene expression data analysis, including highly correlated genes, high dimensionality, highly noisy data, and the further explanation of the diagnosis results.

II. INTRODUCTION

The gene expression profiles obtained from particular microarray experiments have recently been used for cancer classification [1] [2]. This approach promises better therapeutic assistance to cancer patients by diagnosing cancer types with improved accuracy [2]. However, the amount of data produced by this new technology is usually too large to be analysed manually. Hence, the need to analyse microarray data automatically offers an opportunity for Machine Learning (ML) methods to have a significant impact on cancer research. Many machine learning methods, such as Self-Organizing Maps (SOMs) [3], Support Vector Machines (SVMs) [4], Multi-Layer Perceptrons (MLPs) [5] [6], K Nearest Neighbour (KNN) [7], and Decision Trees (DTs) [6], have been successfully applied to classify different tissues. But most of these methods are black-box methods, which can hardly bring out the hidden information in the microarray data.
Meanwhile, all those models cannot deal with missing data, which is very likely to occur in most microarray data sets. Fuzzy-rule-based models can not only provide good classification results, but can also be easily explained and interpreted in human-understandable terms by using fuzzy rules. This provides researchers with an insight into the developed models. Fuzzy systems map numerical data (input/output pairs) onto human linguistic terms, which offers very good capabilities for dealing with noisy and missing data. However, similar to most widely used methods, fuzzy-based models also face the following open difficulties in this application. First of all, compared to some other benchmark problems in machine learning, microarray data sets may be problematic. The number of features (genes), usually in the range of 2,000-30,000, is much larger than the number of examples (usually in the range of 0-200). But not all of these genes are needed for classification. Most genes do not influence the performance of the classification task. Taking such genes into account during classification increases the dimension of the classification problem, poses computational difficulties and introduces unnecessary noise into the process. A major goal of diagnostic research is to develop diagnostic procedures based on inexpensive microarrays that have enough probes to detect certain diseases. This requires the selection of genes that are highly related to the particular classification problem, i.e., the informative genes. This process is called Gene Selection (GS) [7], and corresponds to feature selection in machine learning in general. Secondly, most current gene selection methods cannot efficiently solve the following two problems: (a) how to estimate the numerical noisy data; (b) how to reduce the redundancy of the selected genes. We may often end up with many highly correlated genes.
Highly correlated genes are normally from the same pathway, with similar biological meaning. These genes not only bring additional computational cost, but also lead to the same misclassifications. To address the above problems, we propose a novel Fuzzy Gene Selection (FGS) system in our model. The FGS system first classifies similar genes into clusters by using the Fuzzy C-Means Clustering (FCMC) method, then evolves the fuzzy membership functions to represent the properties of the data. The rank of the different genes in each cluster can be determined by analysing the relationship between the trained membership functions. By using this method, we can replace highly correlated genes by genes from other pathways, and also operate on all data at a fuzzy level. Thirdly, defining the fuzzy rules and membership functions requires a lot of prior knowledge. This is usually obtained from human experts, which is not an easy task, especially in the case of large amounts of gene expression data. In this paper, we apply Hybrid Neuro-Fuzzy (NF) models, which combine the learning ability of neural systems and the transparency of fuzzy systems, and can automatically generate and adjust the membership functions and linguistic rules directly from the data. Finally, fuzzy-rule-based methods suffer from some well-known limitations in dealing with high-dimensional data. Although some fuzzy-rule-based applications for microarray analysis have already been presented [8], all these reported systems are small models and only perform well on limited, simple data sets. Because large rule-based models imply huge computational costs, they are sometimes unacceptable in practice. In order to overcome the inherent weaknesses of individual NF models, a Neuro-Fuzzy Ensemble (NFE) model is developed in this paper.

©2007 IEEE

The rest of this paper is organized as follows. The main structure of our fuzzy-based cancer microarray data classification model is described in Section III. Section IV introduces an important part of our model, the Fuzzy Gene Selection method. The NFE model is detailed in Section V. Experimental results and analytical work are presented in Section VI. Some conclusions are drawn in Section VII.

III. FUZZY-BASED CANCER MICROARRAY GENE EXPRESSION DATA ANALYSIS MODEL

Generally, microarray experiments can be divided into two types. One focuses on time series data, which contain the gene expression data of various genes during the time span of an experiment. The other type of microarray experiment consists of gene expression data of various genes taken from different tissue samples or under different experimental conditions. Different conditions can be used to answer questions such as which genes are changed under certain conditions. Meanwhile, different tissues under the same experimental conditions are helpful for the classification of different types of tissues.

TABLE I
A TYPICAL GENE EXPRESSION MATRIX X, WHERE ROWS REPRESENT SAMPLES OBTAINED UNDER DIFFERENT EXPERIMENTAL CONDITIONS AND COLUMNS REPRESENT GENES

              Gene 1   ...   Gene m  | Class
Sample 1      g_11     ...   g_1m    |  +1
Sample 2      g_21     ...   g_2m    |  -1
...
Sample n-1    ...      ...   ...     |  ...
Sample n      g_n1     ...   g_nm    |  -1
The data from a series of n such experiments can be represented as an n x m gene expression matrix (see Table I), where each row represents a sample that consists of m genes from one experiment. Each sample belongs to a certain class (cancer/no-cancer). Most analyses of a cancer microarray data set X focus on the following two aspects:

Analysis among columns. Let the given gene expression data be denoted as

D = {G_1, ..., G_i, ..., G_m}   (1)

where each vector G_i = (g_1, ..., g_n) denotes the different expression levels of the gene at a certain position, G_i, over several repeated experiments (n is the number of experiments). Most analysis techniques used in this part are unsupervised learning methods. Cluster analysis or other statistical methods can be adopted to find out the relationships among different G_i's or the importance of genes, i.e., the so-called Gene Clustering (GC) [9] and Gene Selection (GS) [7]. Normally, GC and GS analyses are used to provide researchers with an overall picture of the microarray data. Also, good GC and GS results can be used as a good starting point for further data analysis.

Fig. 1. The general scheme of a cancer microarray data classification model.

Analysis among rows. Let us denote the given gene expression data as

D = {(S_1, t_1), ..., (S_i, t_i), ..., (S_n, t_n)}   (2)

where an input vector S_i = (g_1, ..., g_m) denotes a gene expression sample, m is the number of genes in this pattern, and t_i represents which class the pattern belongs to (cancer or no-cancer). The most common analysis is to first choose m' genes out of m according to a certain GS algorithm. Then, select n' patterns with the m' most informative/important genes to train the classifier, and leave n - n' patterns (with m' genes) out to test the performance of the trained model. This method can be used to help researchers classify whether a pattern belongs to the cancer or non-cancer class.
Therefore, the microarray cancer classification problem can be formulated as a combinatorial optimization problem with two main objectives: minimizing the number of selected genes and maximizing the classification accuracy. A typical cancer microarray data classification system is shown in Figure 1.
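The row-and-column formulation above — an n x m expression matrix, selection of m' gene columns, and a train/test split over samples — can be sketched with toy data. All values, indices and helper names below are hypothetical, chosen only to illustrate the bookkeeping:

```python
import random

random.seed(0)

# Toy gene expression matrix X: n samples (rows) x m genes (columns),
# with a class label t_i in {+1, -1} (cancer / non-cancer) per sample.
n, m = 8, 6
X = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
t = [1, 1, 1, 1, -1, -1, -1, -1]

def select_genes(X, gene_indices):
    """Keep only the m' columns named in gene_indices (gene selection)."""
    return [[row[j] for j in gene_indices] for row in X]

# Choose m' = 3 genes (arbitrary indices here; a GS algorithm would pick
# informative ones), then split the n samples into n' training patterns
# and n - n' test patterns.
informative = [0, 2, 5]
X_sel = select_genes(X, informative)
n_train = 6
X_train, t_train = X_sel[:n_train], t[:n_train]
X_test, t_test = X_sel[n_train:], t[n_train:]
```

The classifier is then trained on the n' patterns and evaluated on the held-out n - n' patterns, each pattern reduced to its m' selected genes.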

Our fuzzy-based classifiers stand midway between the two options presented above. The model can classify different patterns as other classifiers do, whilst it can also represent the relationships between different genes, by using fuzzy "IF-THEN" rules and FCMC, and correlate them to the final classification results (see Figure 2).

Fig. 2. Proposed fuzzy-based cancer microarray data classification approach.

It is well known that a large set of gene expression features will not only significantly increase the computational cost and slow down the learning process, but will also decrease the classification accuracy due to the phenomenon known as the curse of dimensionality, in which the risk of over-fitting increases as the number of selected genes grows [10]. More importantly, by using a small subset of genes, we can not only get a better diagnostic accuracy, but also get an opportunity to further analyse the nature of the disease and the genetic mechanisms responsible for it. Another problem here is highly correlated genes. A better way is to group the genes with similar profiles, or the genes from the same pathway, and then select informative genes from these different groups to avoid redundancy.

Therefore, the main difference between our model and other classification models lies in Block A in Figure 1. Instead of the original gene selection part, we divide our proposed fuzzy gene selection system into three steps (see Figure 2):

Step 1: Classify all genes into different groups by using the Fuzzy C-Means Clustering (FCMC) method.
Step 2: Select the informative genes from each group by using the Evolving Fuzzy Gene Selection (EFGS) method.
Step 3: Determine the final gene selection results according to a certain statistical method.

In order to reduce the computational cost, we use an ensemble of individual NF networks to replace a single large classification system.

IV. THE FUZZY GENE SELECTION SYSTEM

A. Fuzzy C-Means Clustering

FCMC partitions the data set into clusters with a certain degree of membership, where similar data are assigned to the same cluster whereas dissimilar data should belong to different clusters. As with other clustering methods, it is based on the minimization of the following objective function:

J = Σ_{i=1}^{N} Σ_{j=1}^{C} u_{ij}^m ||x_i - c_j||²   (3)

where m > 1, u_{ij} is the degree of membership of pattern x_i to cluster j, c_j is the center of cluster j, C is the number of clusters, and N is the number of patterns in the data set. The algorithm is detailed below:

1) Initialize the membership degree matrix [u_{ij}] (every u_{ij} is a random number between 0 and 1) and set k = 0;
2) At each step k, calculate the new cluster centers c_j according to the matrix U_k [9]:

c_j = (Σ_{i=1}^{N} u_{ij}^m x_i) / (Σ_{i=1}^{N} u_{ij}^m)   (4)

3) Update the matrix U_k [9]:

u_{ij} = 1 / (Σ_{l=1}^{C} (||x_i - c_j|| / ||x_i - c_l||)^{2/(m-1)})   (5)

4) If |J(k + 1) - J(k)| < ε, then stop, where ε is a user-defined small number; otherwise go to step 2).

B. The Evolving Fuzzy Gene Selection Method

In this section, we introduce a novel Evolving Fuzzy-based Gene Selection (EFGS) approach. The method first combines evolutionary programming and fuzzy clustering methods to adjust the membership functions to represent the properties of each gene vector G_i; then, it uses the relationship between the trained membership functions to determine the rank of the different genes.
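The FCMC iteration of Section IV-A (initialize U, update the centers, update the memberships, stop when J barely changes) can be sketched as follows. This is a minimal 1-D illustration written for this text, not the authors' implementation; the data points are made up:

```python
import random

def fuzzy_c_means(xs, C, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means for 1-D points xs; returns (centers, U).

    U[i][j] is the degree of membership of point i in cluster j.
    """
    rng = random.Random(seed)
    N = len(xs)
    # Step 1: random membership matrix, each row normalised to sum to 1.
    U = []
    for _ in range(N):
        row = [rng.random() for _ in range(C)]
        s = sum(row)
        U.append([u / s for u in row])
    prev_J = float("inf")
    for _ in range(max_iter):
        # Step 2 (Eq. 4): c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        centers = []
        for j in range(C):
            num = sum((U[i][j] ** m) * xs[i] for i in range(N))
            den = sum((U[i][j] ** m) for i in range(N))
            centers.append(num / den)
        # Step 3 (Eq. 5): u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        for i in range(N):
            d = [abs(xs[i] - c) + 1e-12 for c in centers]  # guard /0
            for j in range(C):
                U[i][j] = 1.0 / sum(
                    (d[j] / d[l]) ** (2.0 / (m - 1.0)) for l in range(C))
        # Step 4 (Eq. 3): stop when the objective J changes by < eps.
        J = sum((U[i][j] ** m) * (xs[i] - centers[j]) ** 2
                for i in range(N) for j in range(C))
        if abs(prev_J - J) < eps:
            break
        prev_J = J
    return centers, U

centers, U = fuzzy_c_means([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], C=2)
```

On these two well-separated groups the centers converge near 0.15 and 5.0, and each point's membership row sums to one, as the update rule guarantees.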

Fig. 3. Three genes and eight samples. Samples 1-4 belong to one class, samples 5-8 belong to another class.

The theoretical foundation behind this method follows the classic strategy: the separation between two classes of expression data is proportional to the distance between their means [2]. Furthermore, this distance is normalized by the standard deviation of the classes. A large standard deviation implies that there are points in the group far away from the mean value, and that the separation would not be strong. For example, Figure 3 shows that the selected gene on the top left is unlikely to predict well because the means of the two classes are quite close; this gene cannot give us enough power to distinguish between classes. The means for the top right and bottom genes are the same, but the bottom one has less variation around the mean, and is therefore likely to be a better gene for classification. The whole algorithm for the EFGS can be summarized as presented below. Each input gene vector G_i, ∀i ∈ {1, ..., n}, is described by two linguistic values with the membership functions set to be Gaussian, as follows:

f(x, σ, μ) = e^{-(x - μ)² / (2σ²)}   (6)

where μ and σ are the mean and the standard deviation of the Gaussian function, respectively.

1) Generate an initial population of individuals and set k = 0. Each individual is represented by a real-valued vector [σ_{j1}, μ_{j1}, σ_{j2}, μ_{j2}], ∀j ∈ {1, ..., w}, where μ_{j1} is the mean of Gaussian membership function 1, σ_{j1} is the standard deviation of Gaussian membership function 1, μ_{j2} is the mean of Gaussian membership function 2, and σ_{j2} is the standard deviation of Gaussian membership function 2. The initial values for σ_{j1}, μ_{j1}, σ_{j2}, μ_{j2} can be set as follows:

μ_{j1} = γ_1 μ_D,  μ_{j2} = γ_2 μ_D,   (7)
σ_{j1} = τ_1 σ_D,  σ_{j2} = τ_2 σ_D,   (8)

where γ_1, γ_2, τ_1, τ_2 are user-defined parameters, and μ_D, σ_D are the mean and the standard deviation of the whole gene expression data set D, respectively. Some initial membership functions we used for the colon data are shown in Figure 4.

Fig. 4. Initial membership functions of individual i.

2) Each individual [σ_{j1}, μ_{j1}, σ_{j2}, μ_{j2}], ∀j ∈ {1, ..., w}, creates a single offspring [σ'_{j1}, μ'_{j1}, σ'_{j2}, μ'_{j2}] as given below:

σ'_{j1} = σ_{j1} + A_j N(0, 1),  σ'_{j2} = σ_{j2} + A_j N(0, 1),   (9)
μ'_{j1} = μ_{j1} + A_j N(0, 1),  μ'_{j2} = μ_{j2} + A_j N(0, 1),   (10)
A'_j = A_j exp(τ N(0, 1) + τ_n N_j(0, 1)),   (11)

where N(0, 1) denotes a normally distributed one-dimensional random variable with mean zero and variance one. σ', μ' and A' are the parameters of the new offspring after mutation. A is a strategy parameter in self-adaptive evolutionary algorithms [11]. The values τ and τ_n are usually set to (√(2√n))⁻¹ and (√(2n))⁻¹, respectively.

3) Use a simple clustering strategy to classify all genes. The only difference between our clustering method and the classical FCMC method is that the membership function we adopt here is a Gaussian function.
4) Calculate the fitness value fit for each individual by testing how many patterns are correctly classified.
5) Conduct pairwise comparisons over the union of parents [σ_{j1}, μ_{j1}, σ_{j2}, μ_{j2}] and offspring [σ'_{j1}, μ'_{j1}, σ'_{j2}, μ'_{j2}]. For each individual, q opponents are chosen uniformly at random from all parents and offspring. For each comparison, if the individual's fitness is not smaller than that of the opponent, it receives a "win". Select the μ individuals with the largest numbers of wins to form the next generation.
6) Stop if the predefined halting criterion is satisfied and select the top fitted individual to represent the final set of membership functions; otherwise, set k = k + 1 and go to Step 3).

Figure 5 shows two trained membership functions after the evolving process has stopped. From this figure, it can be seen that the distance between the centers of the two trained membership functions hints at the distance between the means of the two classes of expression data, and the widths of the trained membership functions hint at the standard deviations of the classes.
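The Gaussian membership function (Eq. 6) and the self-adaptive mutation step (Eqs. 9-11) can be sketched as below. The parameter values, the individual layout, and the positivity guard on the widths are our own assumptions for illustration, not taken from the paper:

```python
import math
import random

rng = random.Random(1)

def gaussian_mf(x, sigma, mu):
    """Gaussian membership function of Eq. 6: exp(-(x - mu)^2 / (2 sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def mutate(ind, n):
    """Create one EP-style offspring from [sigma1, mu1, sigma2, mu2, A]:
    perturb each membership parameter with A*N(0,1), then self-adapt A."""
    sigma1, mu1, sigma2, mu2, A = ind
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # usual EP settings [11]
    tau_n = 1.0 / math.sqrt(2.0 * n)
    child = [sigma1 + A * rng.gauss(0, 1),
             mu1 + A * rng.gauss(0, 1),
             sigma2 + A * rng.gauss(0, 1),
             mu2 + A * rng.gauss(0, 1),
             A * math.exp(tau * rng.gauss(0, 1) + tau_n * rng.gauss(0, 1))]
    # Our own guard (not in the paper): keep the widths strictly positive.
    child[0] = abs(child[0]) + 1e-6
    child[2] = abs(child[2]) + 1e-6
    return child

parent = [1.0, 0.0, 1.0, 3.0, 0.5]          # [sigma1, mu1, sigma2, mu2, A]
offspring = mutate(parent, n=4)
membership = gaussian_mf(0.3, parent[0], parent[1])  # degree of 0.3 in MF 1
```

A full EFGS run would wrap this mutation inside the tournament-selection loop of steps 3)-6) and score each individual by its classification fitness.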

Fig. 5. Trained membership functions of the top selected individual.

According to the theoretical foundation described before, a good pattern implies that the two membership functions should be far from each other, and that the width of each membership function should be small, as shown in Equation 12:

F_score(G_i, c) = |C_1 - C_2| / (W_1 + W_2)   (12)

F_score is the score of each named gene, where c is the class vector (i.e., the last column in Table I), G_i is the gene expression vector, C_1 and C_2 are the centers of the two trained Gaussian membership functions, which correspond to μ_1 and μ_2, and W_1 and W_2 represent the widths of the two trained Gaussian membership functions, which can be replaced by σ_1 and σ_2.

C. Final Selected Gene Subset

Once we have done the FCMC, we know that all the genes in one cluster show similar profiles and might be involved in the same pathway. Generally, we believe that highly correlated genes have a similar biological explanation. We prefer our final decision model to consist of features with different meanings. Therefore, we would like to use more uncorrelated (but still informative) genes instead of the highly correlated top genes. But because of the nature of microarray gene expression data, sometimes there are several pathways involved in the perturbation, while one pathway usually has a major influence. Therefore, we also propose a new mechanism to determine the balance between redundancy and diversity in the selected gene subset. The final selected gene subset is determined by selecting a certain number θ of top ranked genes from each cluster. The value of θ can be set according to some statistical method, for example:

θ = T × F_score(j) / Σ_D F_score(D)   (13)

where T is the maximum number of selected genes, F_score(j) is the sum of the scores of all genes in cluster j, and Σ_D F_score(D) is the sum of the gene scores over the whole data space D.
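Equations 12 and 13 — scoring a gene by the separation of its two trained membership functions, then allotting each cluster a share of the final subset — can be sketched as follows. The cluster contents, centers and widths are hypothetical, and rounding of the per-cluster quota is our own choice:

```python
# Eq. 12: large distance between centers, small widths -> better separation.
def f_score(c1, c2, w1, w2):
    return abs(c1 - c2) / (w1 + w2)

# Hypothetical trained (center, width) pairs for genes in two clusters.
clusters = {
    "A": {"g1": f_score(0.2, 3.1, 0.4, 0.5), "g2": f_score(1.0, 1.2, 0.9, 1.1)},
    "B": {"g3": f_score(0.0, 2.0, 0.5, 0.5), "g4": f_score(0.5, 0.6, 1.0, 1.0)},
}

def per_cluster_quota(clusters, T):
    """Eq. 13, sketched: theta_j = T * Fscore(j) / Fscore(D), so clusters
    with a higher total score contribute more genes to the final subset."""
    total = sum(sum(c.values()) for c in clusters.values())
    return {name: round(T * sum(c.values()) / total)
            for name, c in clusters.items()}

quota = per_cluster_quota(clusters, T=3)
selected = []
for name, genes in clusters.items():
    ranked = sorted(genes, key=genes.get, reverse=True)
    selected += ranked[:quota[name]]    # top theta_j genes of cluster j
```

Here cluster A (one strongly separated gene) earns a quota of two genes and cluster B one, so the final subset mixes genes from both pathways rather than taking only the globally top-ranked, highly correlated ones.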
By doing this, if there are several pathways involved in the perturbation, but one pathway has the major influence, we will probably select more genes from this pathway [12].

V. NEURO-FUZZY CLASSIFIER ENSEMBLE

Different from other nonlinear classification systems, fuzzy systems are rule-based systems; even a small number of inputs will normally generate a large number of potential rules. Not all of the potential rules are useful for the final classification results, and a small number of the most useful rules can indeed be selected at some point. However, this initially large number of potential rules brings a large computational cost. For this reason, in practical applications, the feasible number of selected genes is limited. But a collection of well-distributed, sufficient, and accurately measured input genes is the basic requirement for obtaining an accurate model. When data sets require a relatively large number of genes to represent their properties, we need to design strategies that enable the model to accept more inputs at less computational cost. Recommended approaches include: evaluating and selecting rules, deleting antecedents, deleting fuzzy sets, etc.

A. Ensemble Learning

Our approach to this problem is to construct a Neuro-Fuzzy Ensemble (NFE) model by combining several individual NF models that are trained on the same data but using different subsets of genes [13], so that the overall model can finally work with a relatively large number of genes (see Table II). A better generalization ability is obtained through the nature of ensemble learning itself. Meanwhile, the NFE can relieve the trial-and-error process of tuning architectures. ANFIS is a Sugeno-like fuzzy system with a five-layered network structure. The back-propagation algorithm is used to train the membership functions, while the least squares estimation (LSE) algorithm determines the coefficients of the linear parameters in the consequent part of the rules.
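The ensemble construction just described — individual models trained on different gene subsets, combined by majority voting — can be sketched as below. The threshold "classifiers" are stand-ins for trained ANFIS members, used only to make the example runnable:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted by the individual classifiers
    for one pattern; ties go to the first most-common label."""
    return Counter(predictions).most_common(1)[0][0]

class Ensemble:
    """Each member sees only its own subset of genes (one subset per model)."""
    def __init__(self, models, gene_subsets):
        self.models = models
        self.gene_subsets = gene_subsets

    def predict(self, sample):
        votes = [model([sample[g] for g in genes])
                 for model, genes in zip(self.models, self.gene_subsets)]
        return majority_vote(votes)

# Three stand-in "classifiers" (a threshold on one gene each); a real NFE
# would plug trained ANFIS members in here instead.
models = [lambda x: "cancer" if x[0] > 0.5 else "normal",
          lambda x: "cancer" if x[0] > 0.2 else "normal",
          lambda x: "normal"]
ens = Ensemble(models, gene_subsets=[[0], [1], [2]])
label = ens.predict({0: 0.9, 1: 0.8, 2: 0.1})  # two of three vote "cancer"
```

Because each member only ever evaluates rules over its own small gene subset, the rule explosion stays local to each member while the ensemble as a whole covers the union of all subsets.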
By combining with our FGS system, we try to make sure that the inputs of one individual NF model are from different clusters. The output combination strategy of our NFE model is Majority Voting. The main structure of our NFE is shown in Figure 6.

VI. EXPERIMENTAL STUDY

A. Microarray Cancer Data Sets

In this study, the proposed models are tested on three cancer microarray gene expression data sets: the leukemia cancer data set, the colon cancer data set and the lymphoma cancer data set.

1) Colon Cancer Data Set: This data set contains 62 samples: 40 tumor samples and 22 normal samples. From the about 6000 genes represented in each sample in the original data set, only 2000 genes were selected. The data set is publicly available online.

Fig. 6. The main structure of the NFE: n individual ANFIS classifiers in the ensemble, each having R inputs, so that the overall ensemble model can use R × n genes. The output of the ensemble is taken by simple majority voting (MV).

TABLE II
COMPUTATIONAL COST COMPARISON BETWEEN INDIVIDUAL AND ENSEMBLE NF MODELS. WE COMPARE THE NUMBER OF RULES AND PARAMETERS OF INDIVIDUAL NF AND NFE MODELS. EACH INPUT OF THE NF MODELS HAS 3 MEMBERSHIP FUNCTIONS. IN THIS COMPARISON, THE NFE CONTAINS TWO INDIVIDUAL NF MODELS. NoG DENOTES THE NUMBER OF SELECTED GENES, NoR DENOTES THE NUMBER OF RULES, NoP DENOTES THE NUMBER OF PARAMETERS NEEDED TO BE UPDATED IN EACH EPOCH.

2) Leukemia Cancer Data Set: This data set contains 72 samples. All samples can be divided into two subtypes: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). The expression levels of 7129 genes were reported. The data set is publicly available online.

3) Lymphoma Cancer Data Set: This data set contains 47 samples. The B-cell diffuse large cell lymphoma (B-DLCL) data set includes two subtypes: germinal center B cell-like DLCL and activated B cell-like DLCL. The expression levels of 4026 genes were reported. 24 samples are germinal center B-like DLCL and 23 samples are activated B cell-like DLCL.
The data set is publicly available online.

TABLE III
TOP SEVEN RANKED COLON GENES SELECTED FROM DIFFERENT CLUSTERS BY USING THE FGS SYSTEM

Cluster A:
H22688 — UBIQUITIN (HUMAN)
T61602 — ACIDIC RIBOSOMAL PROTEIN
H3908  — 40S RIBOSOMAL PROTEIN S11
       — TRANSFORMING GROWTH FACTOR BETA 2 PRECURSOR
T5157  — 40S RIBOSOMAL PROTEIN S2
       — PHOTOSYSTEM II KD REACTION CENTRE PROTEIN
T63508 — FERRITIN HEAVY CHAIN (HUMAN)

Cluster B:
M2288  — Human bone morphogenetic protein 1 mRNA
R726   — CHOLINE KINASE
M2325  — Human Ca2+-activated neutral protease large subunit mRNA
R67358 — MAP KINASE PHOSPHATASE
H09665 — LAMIN B RECEPTOR
X80692 — H. sapiens ERK3 mRNA
X1702  — Human mRNA for hematopoietic proteoglycan core protein

Cluster C:
H65823 — (HUMAN)
D31883 — Human mRNA (KIAA0059) for ORF
H7738  — 5-LIPOXYGENASE ACTIVATING PROTEIN (Macaca mulatta)
X67325 — H. sapiens p27 mRNA
X5163  — TROPONIN I, CARDIAC MUSCLE
T52015 — ELONGATION FACTOR 1-GAMMA
Z23115 — H. sapiens bcl-xl mRNA

B. Experimental Setup and Results

The LOOCV accuracy is strongly recommended by other researchers as an evaluation measure for microarray data classification performance. In order to compare with their work, this strategy is also adopted in our study. We use three important criteria for the empirical evaluation of the performance of our models:
* Number of selected genes;
* Predictive accuracy on the selected genes;
* Extracted knowledge from the trained models.

Each variable is described by three membership functions for both the NF and NFE models, and the initial shape of the membership functions is a bell-shaped function. There are 5 individual NF networks in our NFE model, and each NF model has 4 inputs. The output of the ensemble is obtained by using Majority Voting (MV). In the FGS system, we first classified all genes into three clusters, then selected a total of twenty genes according to Equation 13. The seven genes with the highest F_score from each cluster are listed in Table III. The gene selection results before (left) and after (right) clustering are compared in Figure 7.
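The LOOCV protocol used for evaluation can be sketched as follows: train on n - 1 patterns, test on the held-out one, and repeat for every pattern. A 1-nearest-neighbour classifier stands in for the NFE here, and the data are toy values:

```python
def loocv_accuracy(X, t, train_and_predict):
    """Leave-one-out cross-validation: for each pattern i, train on the
    other n-1 patterns, predict pattern i, and report the mean accuracy."""
    correct = 0
    n = len(X)
    for i in range(n):
        X_train = X[:i] + X[i + 1:]
        t_train = t[:i] + t[i + 1:]
        pred = train_and_predict(X_train, t_train, X[i])
        correct += (pred == t[i])
    return correct / n

# A 1-nearest-neighbour stand-in classifier (not the NFE itself).
def nn_classifier(X_train, t_train, x):
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return t_train[dists.index(min(dists))]

X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
t = [-1, -1, -1, 1, 1, 1]
acc = loocv_accuracy(X, t, nn_classifier)
```

With only 62-72 samples per data set, LOOCV makes nearly all patterns available for training in each fold, which is why it is the accepted measure for microarray classification studies.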
The performance of our NFE model was compared with that of single NF models and some other reported approaches, using the same training and testing strategies (see Table IV). Our NFE models obtained better results on the Colon cancer data set and the Lymphoma data set, and similar results on Leukemia, but both the NF and NFE models use fewer genes compared with the other approaches.

The performance of the NFE model is much better than that of single NF models on the three cancer data sets. From rows two and three of Table IV, we can see that our FGS system obtains results similar to standard gene selection methods, such as SNR, when using the same classifiers. But, different from many other standard statistical gene selection methods, our FGS system can directly deal with noisy and incomplete data by combining with the EFGS model.

TABLE IV
COMPARISON OF THE CLASSIFICATION PERFORMANCE OF DIFFERENT CLASSIFIERS AND GENE SELECTION METHODS ON THE LEUKEMIA CANCER DATA SET, COLON CANCER DATA SET, AND LYMPHOMA CANCER DATA SET. CLASSIFIERS COMPARED: ANFIS, NFE, SVM [4], SVM [6], KNN, C4.5; GENE SELECTION METHODS: SNR, FGS, IG, EA, ReliefF [14] [15].

Fig. 8. Trained membership functions of the NFE model on the Colon Cancer Data Set.

Fig. 7. The 'x' marks represent the top 9 selected genes from three different data sets. The first and third figures show the gene selection results given by traditional methods; the other figures show the gene selection results given by our FGS system. The top two figures are the gene selection results for the colon cancer data set; the traditional system selects all genes from the right two clusters. The bottom two figures are the gene selection results for the leukemia cancer data set; we can see that most of the selected genes are from one cluster. Traditional methods are very likely to select highly correlated genes, while the FGS system selects features from a larger space.

C. Fuzzy Knowledge Discovery

Different from black-box approaches, NF-based models can extract some useful knowledge from large gene expression data sets, for example, adjusted membership functions (Figure 8), trained fuzzy rules (Table V) and fuzzy decision surfaces (see Figure 9). All this knowledge can be presented in a human-understandable form. This is very attractive for researchers in the area, as they can better understand the data or explain how the results were obtained. Meanwhile, NF-based models can also easily incorporate prior knowledge, which helps obtain more refined models and shortens the training process. Combined with the clustering results, the number of rules can be further reduced. For example, in Table VI, gene M63391 and gene M6872 are from the same pathway, so we can treat rule A and rule B as repeated rules and, therefore, delete one of them from the rule base.

TABLE V
FIVE RULES SELECTED FROM A SINGLE NF MODEL IN THE ENSEMBLE ON THE COLON CANCER DATA SET. THERE ARE TWO MEMBERSHIP FUNCTIONS FOR EACH VARIABLE.

No. | Description of Rule
1 | If (M2325 is small) and (X5163 is small) then (output is Cancer)
2 | If (M2325 is small) and (X5163 is medium) then (output is Cancer)
3 | If (M2325 is small) and (X5163 is large) then (output is Normal)
7 | If (M2325 is large) and (X5163 is small) then (output is Cancer)
9 | If (M2325 is large) and (X5163 is large) then (output is Normal)

TABLE VI
TWO SIMILAR RULES ON THE COLON CANCER DATA SET.

No. | Description of Rule
A | If (M2325 is small) and (X5163 is small) then (output is Cancer)
B | If (X80692 is small) and (X5163 is small) then (output is Cancer)

By this analysis, the number of trained rules for the Colon Data Set can be reduced to 293.

Fig. 9. The fuzzy decision surface of the trained models when the number of selected genes = 2 and the gene selection method = IG.

VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced a novel fuzzy-based system for cancer microarray gene expression data analysis, covering both the gene selection and classification tasks. The FGS system directly addresses two major problems of traditional gene selection methods. The NFE method makes the fuzzy-rule-based approach more feasible for microarray gene expression data analysis. The performance obtained by our model is competitive. But there are still many issues that need to be considered in future research. For example, all tested problems are binary classification problems, so we should address how to extend our model to multi-category classification problems. Fuzzy association rules offer us an insight into the interactions between genes in a human-readable form. Fuzzy rules can reveal biologically relevant associations between different genes, and between genes and their environment.
But this depends heavily on the size of the data set. Microarray gene data sets are usually very large, and how to understand a large number of rules becomes another open problem. A single NF model can be easily explained and interpreted by users, while an ensemble of several NF models is more difficult to understand. The balance between classification accuracy and model interpretability should be further explored. The performance of the NFE can be further enhanced by using other ensemble training techniques, i.e., bagging and boosting. In addition, our methods offer good potential to deal with highly noisy/missing data, which also needs to be further investigated in future research.

REFERENCES

[1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Natl. Acad. Sci. USA, vol. 96, no. 12, 1999.
[2] D. K. Slonim, P. Tamayo, J. P. Mesirov, T. R. Golub, and E. S. Lander, "Class prediction and discovery using gene expression data," in RECOMB, 2000.
[3] T. Kohonen, Ed., Self-Organizing Maps. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
[4] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, 2000.
[5] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, 2001.
[6] C. Shi and L. Chen, "Feature dimension reduction for microarray data analysis using locally linear embedding," in APBC, 2005.
[7] L. Li, C. R. Weinberg, T. A. Darden, and L. G. Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method," Bioinformatics, vol. 17, 2001.
[8] H. Ressom, R. Reynolds, and R. S. Varghese, "Increasing the efficiency of fuzzy logic-based gene expression data analysis," Physiol. Genomics, vol. 13, 2003.
[9] W. Shannon, R. Culverhouse, and J. Duncan, "Analyzing microarray data using cluster analysis," Pharmacogenomics, 2003.
[10] M. Xiong, W. Li, J. Zhao, L. Jin, and E. Boerwinkle, "Feature (gene) selection in gene expression-based tumor classification," Molecular Genetics and Metabolism, 2001.
[11] X. Yao, Y. Liu, and G. Lin, "Evolutionary programming made faster," IEEE Trans. on Evolutionary Computation, vol. 3, no. 2, 1999.
[12] J. Jaeger, R. Sengupta, and W. L. Ruzzo, "Improved gene selection for classification of microarrays," in Pacific Symposium on Biocomputing, 2003.
[13] Z. Wang, V. Palade, and Y. Xu, "Neuro-fuzzy ensemble approach for microarray cancer gene expression data analysis," in Proc. of the Second International Symposium on Evolving Fuzzy Systems, 2006.
[14] T. Jirapech-Umpai and S. Aitken, "Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes," BMC Bioinformatics, vol. 6, 2005.
[15] L. Yu and H. Liu, "Redundancy based feature selection for microarray data," Technical Report, Department of Computer Science and Engineering, Arizona State University.
Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the ga/knn method," Bioinformatics, vol. 17, pp , [8] H. Ressom, R. Reynolds, and R. S. Varghese, "Increasing the efficiency of fuzzy logic-based gene expression data analysis," Physiol. Genomics, vol. 13, pp , [9] W. Shannon, R. Culverhouse, and J. Duncan, "Analyzing microarray data using cluster analysis," Pharmacogenomics, vol., pp. 1-51, [10] M. Xiong, W. Li, J. Zhao, L. Jin, and E. Boerwinkle, "Feature (gene) selection in gene expression-based tumor classification," Molecular Genetics and Metabolism, pp , [11] X. Yao, Y Liu, and G. Liu, "Evolutionary programming made faster," IEEE Trans. on Evolutionary Computation, vol. 3, no. 2, pp , [12] J. Jaeger, R. Sengupta, and W. L. Ruzzo, "Improved gene selection for classification of microarrays," in Pacific Symposium on Biocomputing, 2003, pp [13] Z. Wang, V. Palade, and Y. Xu, "Neuro-fuzzy ensemble approach for microarray cancer gene expression data analysis," in Proc. of the Second International Symposium on Evolving Fuzzy System, 2006, pp [1] T. Jirapech-Umpai and S. Aitken, "Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes," Bioinformatics, vol. 6, pp , [15] L. Yu and H. Liu, "Redundancy based feature selection for microarray data," Department of Computer Science and Engineering Arizona State University," Technical Report, /07/$ IEEE 1010