Applying a Statistical Technique for the Discovery of Differentially Expressed Genes in Microarray Data


ABEER M. MAHMOUD, BASMA A. MAHER, EL-SAYED M. EL-HORBATY AND ABDEL-BADEEH M. SALEM
Dept. of Computer Science, Faculty of Computer and Information Science, Ain Shams University, Cairo, EGYPT.
abeer_f13@yahoo.com, abmsalem@yahoo.com, eng.basma.ali.cs@gmail.com and sayed.horbaty@yahoo.com

Abstract: - The discovery of differentially expressed genes (DEGs) in microarray data that are relevant to a specific target annotation is a rich research area. This discovery task is important not only to physicians diagnosing patients but also to drug companies seeking knowledge of drug-sensitive genes and of the chemical substructures associated with a specific genetic response. Although machine learning techniques have reported reasonable success on this task in the literature, the discovery of DEGs remains challenging, and new algorithms keep emerging as alternatives to existing ones. Furthermore, there is no doubt that building an accurate and cost-effective classifier depends on outstanding success in finding the DEGs. Therefore, this paper highlights this discovery process and its effect on classification accuracy using a machine learning technique. We applied the statistical T-test (TS) feature selection technique and the K-nearest neighbor (KNN) classifier to the lymphoma data set, to extract the DEGs and to analyze the effect of these genes on classifier accuracy, respectively. In addition, a comparison of our results with related work from the literature is discussed.

Key-Words: - Gene expression analysis, machine learning, gene selection.

1 Introduction
Microarray technology has transformed modern biological research. The goal of molecular biology is to understand the regulatory mechanisms that govern protein synthesis and activity. All the cells in an organism carry the same set of genes, yet their protein synthesis can differ because of regulation.
Protein synthesis is regulated by control mechanisms at different stages: 1) transcription, 2) RNA splicing, 3) translation and 4) post-translational modifications [1]. Microarray techniques provide a platform on which one can measure the expression levels of thousands of genes under hundreds of different conditions. In practice, there is high redundancy in microarray data, and numerous genes contain inappropriate information for precise classification of diseases or phenotypes [2]. Therefore, the amount of data generated by this technology presents a challenge for biologists to analyze [3]. The main types of microarray data analysis are gene selection, clustering, and classification. Gene selection is the process of finding the genes most strongly related to a particular class [4]. The benefit of this process is to reduce not only dimensionality but also the danger that irrelevant genes degrade the classification process. Clustering falls into two categories: one-way clustering and two-way clustering. Methods of the first category group either genes with similar behavior or samples with similar gene expression, while two-way clustering methods cluster genes and samples simultaneously [4,5]. Classification is applied to discriminate diseases or to predict outcomes based on gene expression patterns, and perhaps even to identify the best treatment for a given genetic signature [4,6]. The broad use of machine learning techniques across the different areas of bioinformatics has reported success in resolving biological problems, because such techniques facilitate analyzing these data sets and extracting the important hidden knowledge. Cancer classification based on microarray data is one example of this type of analysis.
Therefore, classifying such high-dimensional data increases the need for an efficient gene selection technique as a very important pre-classification step, reducing time and effort and achieving high classification accuracy.

The rest of the paper is organized as follows. Section 2 presents a brief review of related work. Section 3 details the implemented gene selection technique for the discovery of DEGs and the machine-learning-based classifier used in our experiments, with the necessary mathematical formulation. Section 4 discusses the implementation and comparative results. Section 5 concludes the paper.

2 Related Work
Gene expression analysis and gene selection research have achieved considerable advances, and a variety of classification and gene selection techniques have been proposed. Roberto Ruiz et al. [7] proposed a new heuristic for choosing relevant gene subsets to use in the classification task; their method was based on the statistical significance of appending a gene from a ranked list to the final subset. The efficiency of their technique was established through widespread comparisons with other representative heuristics. Their approach demonstrated excellent performance at recognizing relevant genes, and also with respect to computational cost. J. Zhang and H. Deng [8] chose their reduced set of genes by first carrying out a gene pre-selection using a univariate criterion function and then estimating the upper bound of the Bayes error to filter out redundant genes from those remaining after the pre-selection step. To validate their system they used two classifiers, K-nearest neighbor (KNN) and support vector machine (SVM), and five databases. Chu & Wang [9] proposed a novel radial basis function (RBF) neural network for cancer classification using the expression of very few genes. Zhang et al. [10] presented a method based on the Extreme Learning Machine (ELM) algorithm for multicategory classification in cancer diagnosis with microarray data. Wang et al. [11] proposed an approach to cancer classification using the expression of very few genes; it involves two steps.
The first step is the selection of important genes, in which a scoring method such as the T-test or Class Separability is used for gene ranking. In the second step, the classification accuracy of gene combinations is evaluated using a fine classifier, with a divide-and-conquer approach used to attain good accuracy. Mallika & Saravanan [12] developed an efficient statistical-model-based classification algorithm for classifying cancer gene expression data with minimal gene subsets. This model used a classical statistical technique for ranking the genes, and two different classifiers were used for gene selection and prediction. Bharathi & Natarajan [13] presented a gene selection scheme based on ANOVA and used the support vector machine for the classification process. Osareh & Shadgar [14] combined SVM, K-nearest neighbors and probabilistic neural network classifiers with the signal-to-noise ratio for feature ranking, applied to the classification of benign and malignant breast tumors. Z. Ghorai et al. [15] offered a nonparallel plane proximal classifier (NPPC) ensemble for cancer classification based on microarray gene expression profiles, introducing a hybrid computer-aided diagnosis (CAD) framework based on filter and wrapper methods. Dina et al. [16] introduced three hybrid classification systems, called MGS-SVM, MGS-KNN and MGS-LDA, respectively, and also proposed a gene selection technique named MGS-CM. Using these methods they achieved reasonable classification accuracy, but only on limited data sets.

Figure 1. The proposed scenario for achieving accurate classification in microarray data using the KNN classifier.

3 Applied Methodology
The well-known T-test selection technique was applied to the lymphoma database as an initial proposed scenario for extracting its DEGs. Before that phase of our proposed approach to knowledge discovery in microarray data, however, data pre-processing is urgently needed (removing noise and imputing the missing values in the original database using the KNN impute method). Based on the T-test values, the genes with the highest values were selected. Finally, the KNN classifier was trained on the selected genes and then tested over the same number of selected genes, and the classifier accuracy was reported so that conclusions could be drawn. Figure 1 shows a flowchart of the order in which these steps are applied.

3.1 T-test statistics (TS) for DEGs
Gene selection techniques fall into three categories: marginal filters, wrappers and embedded methods. Marginal filter approaches are individual feature-ranking methods. In a wrapper method, a classifier is usually built and employed as the evaluation criterion. If the criterion is derived from the intrinsic properties of a classifier, the corresponding feature selection method is categorized as an embedded approach [17]. Filter methods are distinguished from the other two types by being powerful, easy to implement and stand-alone, so they can be combined with any classifier. They work by giving each gene a score according to a specific criterion and choosing the subset of genes above or below a specified threshold; thus, they remove irrelevant genes according to general characteristics of the data [16,18]. Filter techniques are further divided into parametric and non-parametric tests. Parametric tests measure a specific property of the gene, while non-parametric tests measure a degree of relation between each gene and the class. Gene selection techniques can also be divided into univariate and multivariate techniques.
Univariate techniques evaluate the importance of each gene individually, while multivariate techniques base their evaluation on a subset of genes [16,19]. Many gene selection techniques have been developed to reduce the number of genes in microarray datasets, so as to reach accurate classification with the smallest number of genes; this reduces computational time and cost. Examples of gene selection techniques widely applied to microarray data are Mean Difference (MD), Signal-to-Noise Ratio (SNR), F(x) score (FS), Fisher discriminant criterion (FC), T-test statistics (TS), Entropy (E), Correlation Coefficient (CC), Euclidean distance (ED), and Class Separability (CS) [20].

The T-test statistic is a very well-known ranking gene selection technique which is widely used by many researchers. The TS starts by calculating the mean difference and then normalizing it, as illustrated in equations (1) and (2). The T-test measures the difference between two Gaussian distributions; the P-values, which define the significance of the difference, are then computed, and a threshold on the P-values is used to determine a set of informative genes [11,16,21]. For gene $i$ with two classes of sizes $n_1$ and $n_2$:

$MD_i = \bar{x}_{i1} - \bar{x}_{i2}$   (1)

$TS_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{\sqrt{s_{i1}^2/n_1 + s_{i2}^2/n_2}}$   (2)

The standard T-test is only applicable to measuring the difference between two groups. Therefore, when the number of classes is more than two, the standard T-test must be modified. In this case, the T-test is used to calculate the degree of difference between one specific class and the centroid of all the classes. Hence, the definition of TS for gene $i$ can be described by equations (3) to (7) [16,24]:

$TS_i = \max\{\,|y_{ik}|\ ;\ k = 1, 2, \ldots, K\,\}$   (3)

$y_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k\, s_i}$   (4)

$\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$   (5)

$\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n$   (6)

$s_i^2 = \dfrac{1}{n-K} \sum_{k} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2, \qquad m_k = \sqrt{1/n_k + 1/n}$   (7)

Here $\max\{|y_{ik}|\ ;\ k = 1, 2, \ldots, K\}$ is the maximum over all classes; $C_k$ refers to class $k$, which includes $n_k$ samples; $x_{ij}$ is the expression value of gene $i$ in sample $j$; $\bar{x}_{ik}$ is the mean expression value in class $k$ for gene $i$; $n$ is the total number of samples; $\bar{x}_i$ is the overall mean expression value for gene $i$; and $s_i$ is the pooled within-class standard deviation for gene $i$.

3.2 The KNN Classifier
Gene expression classification is the process of assigning new gene expression samples to a predefined class. After reducing the number of genes, we attempt to classify the data set. Examples of classification techniques widely applied to microarray data are the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Fuzzy Neural Network (FNN), and Linear Discriminant Analysis (LDA) [20]. Although it is a simple technique, KNN shows outstanding performance in many cases when classifying microarray gene expression data combined with a gene selection technique. KNN is known as a lazy technique, as it depends on calculating a distance between a test sample and all the training data. Therefore, three key elements are essential for using KNN: (1) a set of training data, (2) a group of labels for the training data (identifying the class of each data entry), and (3) the value of K, which decides the number of nearest neighbours considered [16,22].

Table 1. A sample from the lymphoma dataset (columns: No., Gene ID, Name, and the expression values for the DLCL, FL and CLL classes), covering GENE3129X (autocrine motility factor receptor), GENE3126X (2B catalytic subunit), GENE3072X (APC), GENE3067X (probable ATP) and GENE4006X (SRC-like adapter protein).

Table 2. The lymphoma dataset with empty spots (columns: No., Gene ID, Name, and the expression values of four DLCL and two FL samples), covering GENE3069X, GENE2584X, GENE3070X, GENE1843X, GENE3166X and GENE3165X.

Table 3. The pre-processed lymphoma dataset
(the same genes as in Table 2, after imputing the missing values).

Table 4. The twenty most informative genes based on their T-test values, in rank order: Gene 708X, Gene 653X, Gene 537X, Gene 699X, Gene 706X, Gene 2391X, Gene 704X, Gene 707X, Gene 710X, Gene 709X, Gene 712X, Gene 622X, Gene 540X, Gene 711X, Gene 639X, Gene 833X, Gene 769X, Gene 771X, Gene 2374X, Gene 695X.
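The TS ranking described in Section 3.1 can be sketched in code. The following is an illustrative NumPy sketch of equations (3) to (7), not the authors' original implementation; the function name and the genes-by-samples array layout are our own assumptions:

```python
import numpy as np

def ts_scores(X, labels):
    """Multiclass T-test statistic (TS) per gene.

    X: (n_genes, n_samples) expression matrix; labels: length-n_samples class ids.
    Returns TS_i = max_k |x_bar_ik - x_bar_i| / (m_k * s_i) for each gene i.
    """
    classes = np.unique(labels)
    n, K = X.shape[1], len(classes)
    overall_mean = X.mean(axis=1)                      # x_bar_i
    ss = np.zeros(X.shape[0])                          # accumulates within-class sum of squares
    class_means, n_k = [], []
    for c in classes:
        Xk = X[:, labels == c]
        n_k.append(Xk.shape[1])
        mu_k = Xk.mean(axis=1)                         # x_bar_ik
        class_means.append(mu_k)
        ss += ((Xk - mu_k[:, None]) ** 2).sum(axis=1)
    s = np.sqrt(ss / (n - K))                          # pooled within-class std s_i
    s[s == 0] = np.finfo(float).eps                    # guard against constant genes
    scores = np.empty((X.shape[0], K))
    for k in range(K):
        m_k = np.sqrt(1.0 / n_k[k] + 1.0 / n)
        scores[:, k] = np.abs(class_means[k] - overall_mean) / (m_k * s)
    return scores.max(axis=1)                          # TS_i, eq. (3)
```

Genes are then ranked by descending TS score, and the top-ranked ones are retained as DEGs.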

Table 5. A comparison of our T-test result with two other results

No.  Our T-test (all data)   Feng & Lipo [24]      V. Bhuvaneswari & Vanitha [23]
1    Gene 708X (2807)        Gene 1610X (3756)     Gene 1943X (2114)
2    Gene 653X (2738)        Gene 708X (2807)      Gene 880X (2966)
3    Gene 537X (2676)        Gene 1622X (3765)     Gene 324X (2253)
4    Gene 699X (2816)        Gene 1641X (3791)     Gene 1557X (3898)
5    Gene 706X (2809)        Gene 3320X (1271)     Gene 2231X (893)
6    Gene 2391X (768)        Gene 707X (2808)      Gene 289X (1707)
7    Gene 704X (2811)        Gene 653X (2738)      Gene 1792X (3601)
8    Gene 707X (2808)        Gene 1636X (3796)     Gene 910X (2555)
9    Gene 710X (2805)        Gene 2391X (768)      Gene 272X (2656)
10   Gene 709X (2806)        Gene 2403X (759)      Gene 692X (2822)
11   Gene 712X (2803)        Gene 1644X (3788)
12   Gene 622X (2735)        Gene 3275X (1225)
13   Gene 540X (2673)        Gene 642X (2749)
14   Gene 711X (2804)        Gene 706X (2809)
15   Gene 639X (2752)        Gene 1643X (3789)
16   Gene 833X (2627)        Gene 2395X (680)
17   Gene 769X (2909)        Gene 537X (2676)
18   Gene 771X (2911)        Gene 709X (2806)
19   Gene 2374X (742)        Gene 2307X (856)
20   Gene 695X (2798)        Gene 2389X (2389)

The main idea of the KNN classifier is to assign to a new sample the class label to which the majority of the chosen number of neighbours belongs. KNN can calculate its distances in different ways, but Euclidean distance is the most popular. Also, to achieve the highest classification accuracy, it is advisable to try different values of K.

4 Experimental Results & Discussion
4.1 The Data Set
We used the lymphoma data set from the Lymphoma/Leukemia Molecular Profiling Project (LLMPP) webpage [ 1.cdt]. A sample from the data is shown in Table 1. This dataset contains 4026 genes and 62 samples: 42 samples derived from diffuse large B-cell lymphoma (DLBCL), 9 samples from follicular lymphoma (FL), and 11 samples from chronic lymphocytic leukemia (CLL). Pre-processing is the process of removing the noisy data and filtering out the necessary information.
The downloaded lymphoma dataset contains noisy and inconsistent data. Reading the description available for the lymphoma data set, we found noisy and inconsistent entries, including some additional unnecessary columns. After a careful study of the columns needed for our work, which are Gene ID, Name and Class Label (DLCL, FL, CLL), we removed the unnecessary data. We also found many cell values equal to zero; although we worried about the effect of such values on the classifier, we kept these zeros as they are for the time being. Another problem we encountered in pre-processing the original lymphoma dataset is the treatment of missing attribute values (meaning empty strings, not zero values). We therefore imputed these missing values using the KNN impute technique (in Matlab), which replaces such data with the value from the nearest neighbouring column; if that value is also missing, it goes further to the next nearest column, and so on. Table 2 and Table 3 show a sample of the data before and after the treatment of missing attribute values, respectively.

4.2 The Discovery of DEGs
Gene ranking simplifies gene expression tests to include only a very small number of genes rather than thousands. We rank the genes using the T-test, based on their statistical score. Finding the informative genes greatly reduces the computational burden and the noise arising from irrelevant genes [23]. The T-scores of the genes are sorted and the genes with the highest T-scores are ranked; for example, Gene 1 means the gene ranked first, as shown in Table 4. The genes with the highest scores are retained as informative DEGs.
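The imputation step above can be sketched as follows. This is a minimal, illustrative NumPy analogue of MATLAB's knnimpute, not the exact routine used in the paper; the function name and details are our own. It fills each missing entry from the nearest gene row, falling back to farther rows when the nearest one is also missing in that sample:

```python
import numpy as np

def knn_impute(X, k=1):
    """Fill NaN entries of a (genes x samples) matrix from nearest gene rows,
    ranked by RMS distance over mutually observed samples."""
    X = np.array(X, dtype=float)
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance from gene i to every other gene over shared observed samples.
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
                dists.append((d, j))
        dists.sort()
        for col in np.where(missing)[0]:
            # Walk outward to the nearest gene that is observed in this sample.
            for _, j in dists:
                if not np.isnan(X[j, col]):
                    X[i, col] = X[j, col]
                    break
    return X
```

A production pipeline would more likely use a library implementation (e.g. a KNN imputer), but the walk-to-the-next-nearest-neighbour fallback mirrors the behaviour described above.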

Table 6. The testing classification accuracy for the three cases. The table lists, for each number of top genes selected (from 100 down to 1), the classification accuracy on the 31, 25 and 15 test samples. For example, with 5 genes the accuracies are 100%, 96% and 100%; with 4 genes, 100%, 96% and 93.33%; and with 3 genes, 100%, 96% and 93.33%.

Figure 2. The testing classification accuracy for the three cases (testing accuracy versus number of genes for Cases 1, 2 and 3).

A comparison of our DEGs with two recent scientific papers, selected as being of main interest from our research perspective, is shown in Table 5. Genes (1), (2), (3), (5), (6), (8) and (10) in our result also appear in Feng & Lipo [24], but in a different order (as their genes (2), (7), (17), (14), (9), (6) and (18), respectively). We believe that the difference in gene order is due to different pre-processing schemes.

4.3 Applying the KNN Classifier
The lymphoma data set is divided randomly into three cases of different training and testing subsets, to study the effect of various data analysis parameters on the classification accuracy. These parameters are the training and testing percentages (with k = 1 nearest neighbour) and the number of DEGs. The three cases are described separately below.
Case 1: The data set is divided randomly into 50% (31) samples for training and 50% (31) samples for testing.
Case 2: The data set is divided randomly into 60% (37) samples for training and 40% (25) samples for testing.
Case 3: The data set is divided randomly into 75% (47) samples for training and 25% (15) samples for testing.
In each case, the number of DEGs is varied from 100 down to 1; the selected genes are given to the KNN classifier and the classification accuracy is recorded. Table 6 and Figure 2 show the exact results for the three cases.
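The evaluation loop of the three cases can be sketched as follows; an illustrative NumPy sketch (function names and signatures are our own) of 1-nearest-neighbour classification over varying numbers of top-ranked genes:

```python
import numpy as np

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """1-nearest-neighbour accuracy with Euclidean distance (k = 1,
    matching the setting used in the three cases)."""
    correct = 0
    for x, y in zip(X_test, y_test):
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        if y_train[np.argmin(d)] == y:
            correct += 1
    return correct / len(y_test)

def accuracy_vs_num_genes(X, y, ranked_genes, train_idx, test_idx, gene_counts):
    """For each top-N cut of the ranked gene list, train/test 1-NN and
    record accuracy. X is (n_samples, n_genes); ranked_genes is the
    gene index order produced by the TS ranking."""
    results = {}
    for n_genes in gene_counts:
        keep = ranked_genes[:n_genes]
        results[n_genes] = one_nn_accuracy(
            X[np.ix_(train_idx, keep)], y[train_idx],
            X[np.ix_(test_idx, keep)], y[test_idx])
    return results
```

Running this with gene_counts from 100 down to 1 over the three random splits reproduces the structure of Table 6.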

Figure 2 shows that in Case 1 the classification accuracy is 100% even with only three DEGs, and decreases with two or one DEGs. The figure also shows that the best classification accuracy across different DEG counts occurs when the training and testing subsets are divided as in Case 1. In addition, the experimental results show that with 40 or more DEGs, the training/testing split has little effect on the classification accuracy, which is constant at 100% in all three cases. Finally, using only three genes in Case 1 gave the best classification accuracy together with the best data reduction (very few genes, only 3).

5 Conclusion & Future Work
Recent technological advances have led to an exponential increase in biological data. This large amount of biological data increases the use of machine learning techniques to extract hidden knowledge from it. Classifying cancer (microarray expression) datasets into predefined classes is divided into two main steps to reach accurate classification. The first step is implementing an effective gene selection technique to reduce the number of genes involved in the classification process, reducing time and effort and hence increasing classifier efficiency and accuracy. The second step is tuning a powerful classifier to achieve accurate classification of new, unclassified samples. High-dimensional input and small sample size are the two main problems that have triggered the discovery process in microarray data. This paper presented the discovery of DEGs using the T-test statistical technique and applied a KNN classifier to a lymphoma cancer database. The results of our ongoing gene expression research show that the T-test successfully reported DEGs and reduced the data size, thereby reducing the complexity of the data analysis.
Also, KNN achieved 100% accuracy in diagnosing lymphoma cancer when dividing the original data randomly into 50% training and 50% testing, with 3 DEGs supplied to the KNN classifier. Our future work intends to apply other machine learning techniques to the discovery of DEGs, to use different classifiers, and to use other gene expression data sets.

References
[1] T.D.Pham and D.I.Crane, Analysis of Microarray Gene Expression Data, Bentham Science Ltd, 2006.
[2] A.Osareh and B.Shadgar, "Classification and Diagnostic Prediction of Cancers using Gene Microarray Data Analysis", J. of Applied Sciences, vol. 9, no. 3.
[3] S.Kim, Y.Tak and L.Tari, "Mining Gene Expression Profiles with Biological Prior Knowledge", Proc. of the IEEE Life Science Systems & Applications Workshop, pp. 1-2, 2006.
[4] G.Tzanis, C.Berberidis and I.Vlahavas, "Biological Data Mining", Encyclopedia of Database Technologies and Applications.
[5] R.Tibshirani, et al., "Clustering methods for the analysis of DNA microarray data", Dept. of Statistics, Stanford University.
[6] G.Shapiro and P.Tamayo, "Microarray Data Mining: Facing the Challenges", SIGKDD Explorations, vol. 5, no. 2, 2003, pp. 1-5.
[7] R.Ruiz, J.C.Riquelme and J.S.Aguilar-Ruiz, "Incremental Wrapper-Based Gene Selection from Microarray Data for Cancer Classification", Pattern Recognition, vol. 39, no. 12.
[8] J.Zhang and H.Deng, "Gene selection for classification of microarray data based on the Bayes error", BMC Bioinformatics, vol. 8, no. 1, 2007.
[9] F.Chu and L.Wang, "Applying RBF Neural Networks to Cancer Classification Based on Gene Expressions", Int. Joint Conf. on Neural Networks, 2006.
[10] Zhang, et al., "Multicategory Classification Using an Extreme Learning Machine for Microarray Gene Expression Cancer Diagnosis", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.
4, no. 3, 2007.
[11] L.Wang, F.Chu and W.Xie, "Accurate cancer classification using expressions of very few genes", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 1, pp. 40-53, 2007.
[12] M.Rangasamy and S.Venketraman, "An Efficient Statistical Model Based Classification Algorithm for Classifying Cancer Gene Expression Data with Minimal Gene Subsets", Int. J. of Cyber Society & Education, vol. 2, no. 2, 2009.
[13] A.Bharathi and A.M.Natarajan, "Cancer Classification of Bioinformatics Data Using

ANOVA", Int. J. of Computer Theory & Engineering, vol. 2, no. 3, 2010.
[14] A.Osareh and B.Shadgar, "Machine learning techniques to diagnose breast cancer", 5th Int. Symposium on Health Informatics and Bioinformatics (HIBIT), 2010.
[15] Z.Ghorai, et al., "Cancer Classification from Gene Expression Data by NPPC Ensemble", IEEE/ACM Transactions on Computational Biology & Bioinformatics, vol. 8, no. 3, 2011.
[16] Dina A. Salem, Rania Ahmed and Hesham A. Ali, "MGS-CM: A Multiple Scoring Gene Selection Technique for Cancer Classification using Microarrays", International J. of Computer Applications, vol. 36, no. 6, 2011.
[17] I.Guyon and A.Elisseeff, "An Introduction to Variable and Feature Selection", J. of Machine Learning Research, vol. 3, 2003.
[18] Y.Wang, et al., "Gene selection from microarray data for cancer classification", Computational Biology and Chemistry, vol. 29, no. 1, 2005.
[19] C.Lai, et al., "A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets", BMC Bioinformatics, vol. 7, 2006.
[20] Abeer M. Mohamed, et al., "Analysis of Machine Learning Techniques for Gene Selection and Classification of Microarray Data", Proc. of the 6th IEEE Int. Conf. on Information Technology, Cloud Computing.
[21] L.Deng, J.Ma and D.Lee, "A Rank Sum Test Method for Informative Gene Discovery", Proc. of the 10th ACM SIGKDD Int. Conf. (KDD'04), 2004.
[22] X.Wu, et al., "Top 10 algorithms in data mining", Knowledge and Information Systems, vol. 14, pp. 1-37.
[23] V.Bhuvaneswari and Vanitha, "Classification of Microarray Gene Expression Data by Gene Combinations using Fuzzy Logic (MGC-FL)", International Journal of Computer Science, Engineering and Applications (IJCSEA), vol. 2, no. 4, August.
[24] F.Chu and L.Wang, "Applications of Support Vector Machines to Cancer Classification with Microarray Data", International Journal of Neural Systems, vol. 15, no. 6, 2005.