Statistical Analysis of Gene Expression Data Using Biclustering Coherent Column


Volume 114 No. 9 2017, 447-454. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu

Vengatesan K 1, R. P. Singh 1, S. B. Mahajan 2, P. Sanjeevikumar 2

1 Department of Computer Engineering, Sri Satya Sai University of Technology and Medical Science, India, vengicse2005@gmail.com, vc@sssutms.co.in
2 Department of Electrical and Electronics Engineering, University of Johannesburg, Auckland Park, South Africa, sagar25.mahajan@gmail.com, sanjeevi_12@yahoo.co.in

Abstract

This article presents a statistical analysis of gene expression data using Biclustering Coherent Column (BCC). Biclustering is one of the main techniques for grouping data, and it draws on different statistical methods for the analysis of gene expression data. Results are presented in matrix and experimental views to demonstrate the biclustering model on different sets of gene samples. The BCC methodology is adopted to identify the functions of genes effectively and to classify them with data mining algorithms. BCC was tested with a Protein-Protein Interaction network and a Metabolic Pathway Map to construct pairs of genes. BCC finds applications in information retrieval, text mining, dimensionality reduction and various real-time algorithms.

Key Words and Phrases: Biclustering Coherent Column (BCC), Gene, Clustering, Protein-Protein Interaction, Metabolic Pathway Map


1 Introduction

Clustering is the grouping of data according to some similarity, with the aim of discovering patterns. Patterns are generally represented by either vector or scalar measurements. In the vector representation, each pattern is a point in multidimensional space, and the scalar components of the vector are called features or attributes [1]. A pattern can describe an abstract notion or a physical object, and its features may be qualitative or quantitative. For example, an object may be characterized by three features such as height, weight and color. Clustering is an ordinary activity performed all the time by the human brain, which is able to group the objects it perceives in the environment according to some resemblance criteria [2].

Fig.1. Representation of the Biclusters (a) One Bicluster (b) Row and column of Bicluster (c) Checked Bicluster (d) Overlapped Bicluster with hierarchical structures (e) Arbitrarily positioned overlapping Bicluster

Different forms of bicluster representation are given in Fig.1. A single bicluster is shown in Fig.1(a), row and column biclusters in Fig.1(b), checked biclusters in Fig.1(c), overlapped biclusters with hierarchical structures in Fig.1(d) and arbitrarily positioned overlapping biclusters in Fig.1(e) [3].

A cluster is a group of related items, also referred to as a pattern. The main task is to determine, using statistical techniques, which items are related to each other under certain conditions and which are dissimilar. Because clustering is a form of unsupervised learning, it has no predefined classes or categories [4]. In supervised classification, by contrast, a set of labeled patterns is given, and these patterns are used to learn class descriptions that can then be applied to new patterns.
Learning falls into two categories, supervised and unsupervised. Clustering is unsupervised: it works on unlabeled patterns, which must be grouped on the basis of similarity [5]. There are several reasons why unsupervised methods are significant. Collecting and labeling a large set of patterns can be very expensive. Grouping a large data set into a small number of groups according to its attributes also reduces the time needed to process it. Data clustering is applied in many fields of activity to form groups of various kinds, according to different factors [6]. Cluster analysis has found applications in disciplines as diverse as engineering, biology, psychology, archaeology, geology, economics, information retrieval and remote sensing [7]. It has been used with great success in pattern recognition, image processing, market research, spatial data analysis, time series analysis and a new generation of web applications such as document classification and weblog data clustering [8].

Beyond the identification of differentially expressed genes, clustering genes from multiple experiments into groups with similar expression patterns is required for further function annotation and diagnostic classification. Clustering can be applied to the rows (genes) and/or the columns (samples/arrays) of an expression data matrix. Genes with similar profiles are grouped into one cluster, which suggests that genes of unknown function can be assigned to clusters according to their functionality [9].

The clusters or groups that are identified may be exclusive, so that every instance belongs to only one group. They may be overlapping, meaning that one instance may fall into several clusters. They may be probabilistic, whereby an instance belongs to each group with a certain assigned probability. Sometimes they are hierarchical, such that a crude division of the instances into groups at a high level is further refined at finer levels. Furthermore, different formulations lead to different algorithms.
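As an illustration of clustering either the rows (genes) or the columns (samples) of an expression matrix, the following minimal sketch uses average-linkage hierarchical clustering from SciPy. The choice of method, the helper name `cluster_axis` and the toy matrix are assumptions made for this example only; they are not the procedure or the data used in this article.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_axis(X, n_clusters, axis=0):
    """Cluster the rows (axis=0) or the columns (axis=1) of X.

    Hypothetical helper: average-linkage hierarchical clustering with
    Euclidean distance, cut so that at most n_clusters groups remain.
    """
    data = X if axis == 0 else X.T
    Z = linkage(data, method="average", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy expression matrix (invented values): 4 genes x 4 samples.
X = np.array([
    [1.0, 1.1, 5.0, 5.2],
    [0.9, 1.0, 5.1, 5.0],
    [8.0, 8.1, 2.0, 2.1],
    [8.2, 7.9, 2.2, 1.9],
])

gene_labels = cluster_axis(X, n_clusters=2, axis=0)    # one label per gene
sample_labels = cluster_axis(X, n_clusters=2, axis=1)  # one label per sample
print("gene clusters:  ", gene_labels)
print("sample clusters:", sample_labels)
```

Clustering rows and columns separately, as here, yields exclusive groups that span the whole matrix. Biclustering differs in that it looks for subsets of genes that behave similarly only under a subset of conditions.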
Different clustering algorithms have been proposed to group genes by functionality for large families of gene expression data [10]. A wide range of methods has been proposed for the analysis of gene expression data, including hierarchical clustering, self-organizing maps and k-means approaches. Many of the proposed algorithms have been reported to be successful, but no single algorithm has emerged as the method of choice. Most of the algorithms are based on heuristic methods, and the issues of determining the correct number of clusters and choosing the best algorithm have yet to be solved. The clustering algorithms proposed in the literature can be split into two main classes: Parametric Clustering and Non-Parametric Clustering [11].

2 Biclustering Coherent Column (BCC)

Clustering is a process that divides a given data set into homogeneous groups based on given features, such that related objects are kept in the same group while dissimilar objects are placed in different groups.

Clustering is the most important unsupervised learning problem. It deals with finding structure in a collection of unlabeled data. Fig.2 shows the formation of biclusters with constraints. A divide-and-conquer algorithm splits the matrix into smaller parts and also finds the overlapping sub-matrices. First, the columns are divided into two sets, CU and CV, which correspond to the sub-matrices U and V. The rows are then re-sorted with respect to the conditions in CU. The resulting sets of genes GU, GW and GV, in combination with CU and CV, define the sub-matrices U and V, which are decomposed recursively.

Fig.2. Formation of Biclustering

The goal of this kind of algorithm is to solve an optimization problem so as to satisfy the optimality criterion imposed by the model, which often means minimizing a cost function. This type of method usually includes some assumptions about the underlying data structure.

Fig.3. Matrix view of sample (a) 11 genes using BCC (b) 15 genes using BCC

Fig.3(a) shows the matrix view of the sample of 11 genes under Biclustering Coherent Column (BCC); the selected genes are shown in a rectangle. Fig.3(b) shows the matrix view of the sample of 15 genes under BCC, with the selected genes again shown in a rectangle. Fig.4 shows the expression analysis of the 11 genes using BCC, in which the x-axis represents the different experimental conditions and the y-axis represents the gene samples. Fig.5 shows the expression analysis of the 15 genes using BCC, with the same axes. Fig.6 shows the matrix view of the sample of 100 genes under BCC; the selected genes are shown in a rectangle.
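The column and row partition described for Fig.2 can be illustrated with a minimal sketch of a single divide step. It assumes a binarized expression matrix (1 = gene responds under a condition) and a rule in which one template gene determines CU. The article specifies neither, so the function name `partition_step`, the template-row choice and the toy matrix are hypothetical.

```python
def partition_step(matrix, template_row=0):
    """One divide step on a binary gene-by-condition matrix (assumed layout).

    CU: conditions under which the template gene responds; CV: the rest.
    GU: genes responding only under CU; GV: only under CV; GW: under both.
    Genes that respond under no condition are left out of all three sets.
    """
    n_cols = len(matrix[0])
    cu = [j for j in range(n_cols) if matrix[template_row][j]]
    cv = [j for j in range(n_cols) if not matrix[template_row][j]]

    gu, gw, gv = [], [], []
    for i, row in enumerate(matrix):
        in_u = any(row[j] for j in cu)
        in_v = any(row[j] for j in cv)
        if in_u and in_v:
            gw.append(i)
        elif in_u:
            gu.append(i)
        elif in_v:
            gv.append(i)
    return {"CU": cu, "CV": cv, "GU": gu, "GW": gw, "GV": gv}

# Toy binarized matrix (invented values): 5 genes x 4 conditions.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
parts = partition_step(toy)
print(parts)
```

Under these assumptions, the genes in GW are shared by both sub-matrices, which is how overlapping biclusters can arise. A full algorithm would apply the same step recursively to U and V until no further split is possible.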


Fig.7 shows the expression analysis of the 100 genes using Biclustering Coherent Column (BCC), in which the x-axis represents the different experimental conditions and the y-axis represents the gene samples.

Fig.4. Expression view of sample 11 genes using Biclustering Coherent Column (BCC)

Fig.5. Expression view of sample 15 genes using Biclustering Coherent Column (BCC)

Fig.6. Matrix view of sample 100 genes using BCC

3 Result and Discussion

The performance of the proposed Biclustering Coherent Column (BCC) algorithm is compared with OPSM (Order Preserving Submatrix Algorithm), ISA (Iterative Signature Algorithm), SAMA and BiMax in constructing pairs of genes. The biological significance of the biclusters is assessed in two ways: with a metabolic pathway map (MPM) for Arabidopsis thaliana and with a protein-protein interaction (PPI) network. Fig.8(a) shows gene pair construction with the metabolic pathway map for BiMax, ISA, SAMA, OPSM and BCC. Fig.8(b) shows gene pair construction with protein-protein interaction for the same algorithms. It is observed that BCC gives the best results.

Fig.7. Expression view of sample 100 genes using Biclustering Coherent Column (BCC) (axes: Conditions, Gene Expressions)

Fig.8. Gene pair construction using (a) Metabolic Pathway Map (b) Protein-Protein Interaction

4 Conclusion

Biclustering Coherent Column (BCC) is an efficient method for finding overlapped biclusters, that is, sub-matrices or unique patterns, in gene expression data. The proposed model was tested against other methods, namely OPSM (Order Preserving Submatrix Algorithm), ISA (Iterative Signature Algorithm), SAMA and BiMax, in constructing pairs of genes, and it produced better results than the other methods. A data mining based algorithm such as BCC is an efficient way to identify overlapped genes in gene expression data and is suitable for use with both Protein-Protein Interaction and the Metabolic Pathway Map to construct pairs of genes. The analysis methods introduced in this article can also be used to analyze a variety of other biclustering algorithms, in terms of both the target bicluster patterns and the search strategy.

References

[1] K. Vengatesan, S. Selvarajan: The performance Analysis of Microarray Data using Occurrence Clustering. International Journal of Mathematical Science and Engineering, Vol. 3 (2), pp (2014).
[2] Ben Dor A., Shamir R. and Yakhini Z.: Clustering gene expression patterns. Journal of Computational Biology, 6(3/4), pp , (1999).
[3] Halkidi M., Batistakis Y. and Vazirgiannis M.: On Clustering Validation Techniques. Journal of Intelligent Information Systems, vol. 17 (2/3), pp , (2001).
[4] Li Wentian: Zipf's Law in Importance of Genes for Cancer Classification Using Microarray Data. Second Conference of Chinese Bioinformatics Society, (2002).
[5] E. Domany: Cluster analysis of gene expression data. Journal of Statistical Physics, vol. 110, pp , (2003).
[6] B. Chandra and M. Gupta: An efficient statistical feature selection for classification of gene expression data. Journal of Biomedical Informatics, vol. 44, pp , (2011).
[7] Vengatesan K. and S. Selvarajan: Improved T-Cluster based scheme for combination gene scale expression data. International Conference on Radar, Communication and Computing (ICRCC), pp , IEEE (2012).
[8] Kalaivanan M. and K. Vengatesan: Recommendation system based on statistical analysis of ranking from user. International Conference on Information Communication and Embedded Systems (ICICES), pp , IEEE, (2013).
[9] W. Au, K. Chan, A. Wong and Y. Wang: Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2 (2), pp , (2005).
[10] C. Ding and H. Peng: Minimum redundancy feature selection from microarray gene expression data. Proceedings of IEEE Computational Systems Bioinformatics (2003).
[11] Fadhl M. Al-Akwaa: Analysis of Gene Expression Data Using Biclustering Algorithms. Published by INTECH, pp. 51-66,
