International Journal of Engineering & Technology IJET-IJENS Vol:14 No:01 9


Analysis on Clustering Method for HMM-Based Exon Controller of DNA Plasmodium falciparum for Performance Improvement

Alfred Pakpahan 1, Suhartati Agoes 2, Binti Solihah 3
1 Department of Biology, Faculty of Dentistry, Trisakti University
2 Electrical Engineering Department, Faculty of Industrial Technology, Trisakti University
3 Informatic Technology Department, Faculty of Industrial Technology, Trisakti University
Trisakti University, Jalan Kyai Tapa Grogol Jakarta 11440, Indonesia
1 alfred@trisakti.ac.id, 2 sagoes@trisakti.ac.id, 3 binti76@yahoo.com

Abstract-- The performance of a Hidden Markov Model (HMM)-based exon controller for Plasmodium falciparum deoxyribonucleic acid (DNA) can be improved by applying clustering methods to the data used in HMM training and testing. Coding sequence (CDS) data of Plasmodium falciparum DNA are used as input during training to establish the model. The resulting model is then tested with sequence data, and its level of familiarity with data containing a certain number of exons is calculated. Various numbers of states can be implemented in the HMM structure to obtain an optimal value of the model's performance measure, the Correlation Coefficient (CC). This research also identifies the similarity between the protein products predicted by the HMM models and the original products using the Open Reading Frame (ORF), and identifies patterns of insertion and deletion in the predicted products in relation to exon length. The simulation results indicate that increasing the number of states in the model does not increase the model's performance linearly, whereas applying clustering to HMM training and testing increases the CC value with a relatively short simulation processing time.

Index Terms-- clustering, CC, HMM, Plasmodium falciparum DNA, CDS

1. INTRODUCTION

The objective of this study is to control the exons of deoxyribonucleic acid (DNA) in the coding sequence (CDS) so that the protein produced after transcription and translation is unchanged, with no indication of changes to the resulting protein. The exon controlling process is similar to gene finding techniques. As mentioned in [1,2], there are two classes of gene prediction methods: sequence similarity search and ab initio gene finding (gene structure and signal-based search). The limitation of the first approach, as mentioned in [1], is that only half of the genes being discovered have significant homology to genes in the databases. For the ab initio approach, several algorithms have been developed, such as dynamic programming, neural networks, Markov models, and Hidden Markov Models. The most successful programs are based on the Hidden Markov Model [1,2].

One method that can be used to control DNA exons is the Hidden Markov Model (HMM), whose parameters include the number of states, the state transition values, the state emission values, and the algorithms used for training and testing, namely the Baum-Welch and Viterbi algorithms. In this study, an HMM is implemented to control exons through simulation trials in the MATLAB programming environment, and the performance of the developed model is expressed by the Correlation Coefficient (CC). The accuracy of the model in controlling exons is indicated by the CC value. Among the approaches that have been used to increase the CC value are adding HMM states up to a certain number [3,4,5] and classifying the training data by the number of exons in the CDS [6]. Raising the CC by adding states costs training time, follows a logarithmic tendency, and makes the search for a state composition difficult. Developing models by clustering does increase the CC, but it is constrained by the limited training data.
Therefore it is necessary to identify other ways to optimize the model. The goals of this study are to identify the relationship between the CC value and the similarity of the predicted protein product to the original product, to identify insertions and deletions in the model predictions compared with the original CDS, and then to apply Fuzzy C-Means clustering to the training data to improve the performance of the existing model. The clustering result is used to obtain a model that is specific to the characteristics of the data. The trial results show an increase in the CC value compared with previous test results for the same model structure.

2. HIDDEN MARKOV MODEL TO CONTROL EXONS

The Hidden Markov Model (HMM) is a stochastic model in which a signal (here, the DNA signal) is modeled as a Markov chain of states, together with a finite observation process modeled on

Markov chains. The outline of the method covers the Markov chain of the HMM, the HMM elements, the basic HMM problems, and the solutions to those problems [5,6,7,8].

Fig. 1. HMM models for Plasmodium falciparum

In a Markov chain, the probability of transition from one state to another is determined by the number of states and the HMM topology used. The Markov chain in the HMM method follows the structure of the exons and introns in the DNA, in which exons and introns must alternate. This structure also affects the transition values, which are assigned randomly and therefore affect the performance of the resulting model. Models are compared by decoding the estimated state sequence (testing) against the original state sequence, according to the number of states used in designing the model structure. The HMM model structure developed is shown in Figure 1.

During the state definition phase, the coding sequence is divided into three areas: the first exon, the intron area, and the other exons. Each area is populated with a number of states defined in the first stage. The first exon bases (Be1) are modeled by the states of the first exon area. The intron bases (beginning with the GT bases, followed by the intron bases Bi, and ending with the AG bases) are located in the introns of the CDS. Finally, the exon bases Be are modeled by a number of states and end with one of the three stop codons in the other exons of the CDS. Structurally, the position of a base within an exon determines its state number in the model [3]. For the first exon, the bases after a certain sequence number all share the same state. Likewise, bases with the same sequence number in an exon or intron other than the first exon share the same state number.
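The decoding step described above, in which an estimated state sequence is recovered for comparison with the original one, can be illustrated with a minimal Viterbi sketch. This is written in Python rather than the MATLAB environment used in the study, and it is not the authors' model: the two states (E for exon, I for intron) and all probability values below are toy assumptions chosen only to make the example run, whereas the paper's models use many states per area.

```python
import math

# Toy two-state model; every number here is an illustrative assumption.
STATES = ("E", "I")
START_P = {"E": 0.9, "I": 0.1}
TRANS_P = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT_P = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable state path for a non-empty base string."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # best predecessor of state s for this step
            best_prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[best_prev]
                            + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # trace the best path backwards from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(path)


print(viterbi("GCGCGCGC" + "AT" * 8))
```

Log-probabilities are used so that long sequences do not underflow. The decoded string of E and I labels is the kind of per-base output that can then be compared with the annotated exon positions.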
HMM model accuracy is expressed by the CC value, formulated as in equation (1):

$$CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (1)$$

where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.

3. RESEARCH METHODOLOGY

The methodology used in this study is experimental simulation of various HMM models using MATLAB on a personal computer with the Bioinformatics Toolbox, which can convert DNA sequence data into a digital sequence and allows various model structures to be simulated computationally.

3.1. Model development

The model structure is developed from the basic structure of the exons contained in the coding sequence (CDS), so the number of states is increased in accordance with the alternating locations of exons and introns. Testing is done by simulating HMM structures designed according to the basic structure of the exons in the CDS. In this study states are added to the model randomly, which may affect the determination of the random state transition values. Simulations are also performed to calculate the total number of states according to the shape of the model structure. The HMM training process uses the Viterbi algorithm, and HMM testing uses the Viterbi and Baum-Welch algorithms, to obtain one of the performance parameters of the model, the CC value.

The design and implementation using the HMM take as input DNA sequences totaling 152 genes in the genome of

Plasmodium falciparum in GenBank format, taken from the existing online database, with sequence lengths of at least 684 base pairs (bp) and a maximum of bp. The HMM method for controlling DNA exons uses a model whose basic structure follows the location of exons in the CDS: an intron lies between two exons, so the basic structure can be described as in Figure 2.

EXON - INTRON - EXON
Fig. 2. The basic structure of the model based on the structure of DNA

The model is developed by randomly increasing the number of states in the exons and introns. The transition probability values for each state are also assigned randomly, with respect to the optimal CC value. Testing is done by trial and error on the state transition values of the model to obtain optimal performance. The general structure of the model developed with the HMM method for controlling DNA exons is shown in Figure 3.

Fig. 3. Structure Model Development with HMM method

3.2. Clustering Method

At this stage, the simulation uses fuzzy c-means to group data by similarity of characteristics. The clustering results are then used to build new models, which are tested to identify whether the CC increases compared with the old model.

Similarity analysis of protein products is based on the Open Reading Frame (ORF). The ORF is more a computational concept than a biological one: it is defined as the region bounded by a start codon, a set of codons that can be translated, and a stop codon. Several aspects need to be considered in determining the ORF [8]. First, each codon consists of three nucleotides (a triplet) that encode one amino acid, so the reading frame can start at position 1, position 2, or position 3.
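The ORF definition above (a start codon, translatable codons, a stop codon, read from position 1, 2, or 3) can be sketched as a simple forward-strand scan. This is a minimal Python illustration, not the MATLAB procedure used in the study. The function name `find_orfs`, the 0-based half-open coordinates, and the default `min_codons=100` (taken from the 100 amino acid length criterion the paper gives for ORFs) are assumptions of this sketch.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Scan the three forward reading frames for ORFs.

    Returns a list of (frame, start, end) tuples, where frame is 1, 2 or 3
    and start/end are 0-based, end-exclusive indices that include the stop
    codon. An ORF is kept only if it has at least min_codons codons before
    the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs


# Tiny hypothetical sequence: the only ORF starts at index 2 (frame 3).
print(find_orfs("CCATGGCTGCTGCTTAAG", min_codons=2))
```

Only the forward strand is scanned here; handling the reverse complement would double the number of frames and is left out to keep the sketch short.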
The second aspect to consider is ORF length: an ORF that encodes a protein has a length of more than 100 amino acids, or 300 nucleotides. Although the ORF is a computational concept, its use is considered sufficient here because the testing data consist of genes whose positions are already known.

The similarity analysis aims to identify how similar the original protein products are to the protein products obtained from exon control. It is needed because the CC does not state how similar the produced protein product is, or how closely the predicted product resembles the original protein product for a given CC value.

DNA mutation occurs in three types: substitution, that is, the change of one letter of the DNA sequence (point mutation), insertion, and deletion. In real conditions, substitution is the most common. At this stage, insertions and deletions in the model test results are identified to assess how precisely the model can control the exons. An insertion occurs when a base or set of bases is inserted into the DNA chain, while a deletion occurs when one or more bases are missing or truncated from the DNA chain [8].

In this study, insertions and deletions in the DNA strands are shown to see how effective the model is in controlling exons, in the sense that the positions of exons and introns are retained as in the original. This is done by juxtaposing exons to identify the presence or absence of an insertion or deletion of one or more bases in the exon prediction results. The function takes as input the DNA strand, the original exon positions, and the predicted exon positions. Its output is stored in a .txt file in which each line contains the position number on the DNA strand, the original exon base, the sign '-', and the predicted exon base.
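The juxtaposition described above can be sketched as follows. This is a hedged Python illustration, not the authors' MATLAB function: the name `compare_exons` and the representation of exon positions as 0-based, end-exclusive (start, end) pairs are assumptions. Each output line follows the stated layout (position number, original exon base, '-', predicted exon base), so a base present only on the right is counted as an insertion and a base present only on the left as a deletion.

```python
def compare_exons(dna, original_exons, predicted_exons):
    """Juxtapose original and predicted exon bases position by position.

    original_exons and predicted_exons are lists of (start, end) pairs,
    0-based and end-exclusive (an assumption of this sketch).
    Returns (lines, insertions, deletions).
    """
    def exon_mask(intervals):
        mask = [False] * len(dna)
        for start, end in intervals:
            for i in range(start, end):
                mask[i] = True
        return mask

    in_original = exon_mask(original_exons)
    in_predicted = exon_mask(predicted_exons)

    lines, insertions, deletions = [], 0, 0
    for i, base in enumerate(dna):
        if not (in_original[i] or in_predicted[i]):
            continue  # intron in both annotations: nothing to report
        left = base if in_original[i] else ""
        right = base if in_predicted[i] else ""
        if not in_original[i]:
            insertions += 1  # base appears only in the predicted exon
        if not in_predicted[i]:
            deletions += 1   # base is missing from the predicted exon
        lines.append(f"{i + 1} {left}-{right}")
    return lines, insertions, deletions


# Hypothetical 10-base strand with two original and two predicted exons.
lines, n_ins, n_del = compare_exons("ATGGTAAGCC", [(0, 3), (7, 10)], [(0, 4), (8, 10)])
print("\n".join(lines))
print(n_ins, n_del)
```

Writing `lines` to a text file with one entry per line would reproduce the stored output format described in the text.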
This analysis is implemented in MATLAB. The inputs are the DNA strand, the first set of exon positions (the exon positions of the test data), and the second set of exon positions (the predicted exon positions). The outputs of this function are the number of insertions and the number of deletions.

Fuzzy c-means (FCM) is a data clustering technique that is robust to ambiguity [9]. One example of applying FCM to ambiguous data is image segmentation [6], in which the cell nucleus classification results from FCM are used as one feature of an Artificial Neural Network. In this study, FCM is applied to the model-building data to classify the CDS data by their characteristics: the length of the first exon and the length of the second exon. The intron length is known from the region between the two exons. These characteristic features were chosen based on the way states are formed in the HMM models, which determines the initial coefficients used in the transition matrix. The state formation used in this study causes several states to have a much larger base population than other states [2]. This difference in base population affects the transition matrix of the model: the self-transition value is made much larger for states with a large population. The data used in this process are CDS data with two exons. Ambiguity in the CDS data is mainly caused by variations in the lengths of the exons and introns.
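As an illustration of the clustering step, the following is a minimal fuzzy c-means sketch in Python, not the MATLAB routine used in the study. The feature vectors (first exon length, second exon length, intron length) are hypothetical values, not the paper's data. The fuzzifier m = 2, the farthest-point initialisation of the cluster centres, and the use of unscaled features are all assumptions of this sketch.

```python
import math


def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    """Cluster points with fuzzy c-means; returns (centers, memberships)."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [list(points[0])]
    while len(centers) < n_clusters:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))

    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists]
            )
        # Centre update: mean of the points weighted by u_ik^m
        new_centers = []
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Hypothetical (exon 1 length, exon 2 length, intron length) triples in bp.
data = [(100, 900, 150), (120, 950, 160), (110, 920, 140),
        (800, 2500, 600), (850, 2600, 640), (820, 2400, 610)]
centers, u = fuzzy_c_means(data)
labels = [max(range(len(row)), key=row.__getitem__) for row in u]
print(labels)
```

Each row of the membership matrix sums to one, and a hard label is obtained by taking the cluster with the largest membership. In practice the features might be rescaled first so that the longest length does not dominate the distance.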

4. SIMULATION RESULTS AND ANALYSIS

The HMM for exon control is implemented in the MATLAB programming environment, and the exon controller model is built in two stages: training and testing. When clustering is applied to testing only, testing is done in two stages: first the model is tested with general sequences containing any number of exons, and then with sequences containing a certain number of exons. When clustering is applied to both training and testing, only DNA sequences with two exons are used. The data used for testing are Plasmodium falciparum sequences with CDS in GenBank format. Data used in the training process must meet four criteria: the CDS is complete, the sequence contains no unknown elements, it is not a pseudogene, and it contains only one CDS.

4.1. Results of development model

The sequence data are classified by the number of exons in each sequence, as shown in Table I. The most numerous group is sequences containing two exons, and the smallest group is sequences containing 10 exons.

Table I. The number of DNA sequences by number of exons (columns: No; number of exons in the sequence; total DNA sequences; final row: Total)

The test results are expressed as the CC values in Table II, where the CC values for a given number of model states at each number of exons appear in the same column.

Table II. Results of testing the general HMM models with test data grouped by number of exons (columns: number of states; CC under the Viterbi (Vit) and Baum-Welch (BW) algorithms for each number of exons; final row: Mean)

In Table II, each row gives the CC values obtained with the Baum-Welch and Viterbi algorithms for a model structure with a given number of states.
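For reference, the CC of equation (1) can be computed from per-base labels as in the minimal Python sketch below. The formula is taken here in its standard form, (TP·TN − FP·FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN). The 'E'/'I' label strings, the function name, and the choice to return 0.0 when the denominator is zero are assumptions of this sketch, not details given in the paper.

```python
import math


def correlation_coefficient(actual, predicted, positive="E"):
    """Compute CC from two equal-length label strings.

    A base is a positive when it is labelled as exon ('E' by default).
    """
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention assumed here for the undefined case
    return (tp * tn - fp * fn) / denominator


# One exon base is missed by the prediction: TP=3, FN=1, TN=4, FP=0.
print(round(correlation_coefficient("EEEEIIII", "EEEIIIII"), 4))
```

A perfect prediction gives a CC of 1, and a fully inverted prediction gives -1.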
In general, the model development results show that the CC value obtained with the Viterbi algorithm is better than the CC value obtained with the Baum-Welch algorithm. Relating the amount of training data to the CC values obtained, the less data used for training, the better the results when testing with the same data, because the task is easy. Generalization, however, becomes worse, and the system may fail to generalize.

4.2. Clustering Method Result

The results of testing and analysis at each stage of the clustering method based on the HMM model structure are described below. The triplet identification results of the model are shown in Figure 4. In identifying the start codon and the stop codon, stop codons were found at varying lengths. On this basis, the last stop codon is chosen [8].

Fig. 4. Example of a triplet identification results

Insertions and deletions in the model results are identified by aligning the original exons with the predicted exons of a sequence, as shown in Figure 5, where:
- a-a means that a is recognized as an exon base in both the original DNA strand and the predicted DNA strand.
- -a means there is a base insertion in the exon prediction result.
- a- means there is a base deletion in the exon prediction result.

Fig. 5. Examples of insertion and deletion of exon bases

Clustering results for some of the data used in building the HMM model can be seen in Table III.

Table III. Example of clustering results for some training data (columns: No; 1st exon length; 2nd exon length; intron length; classification result)

Of the 69 data sequences used in the experiment, the classification results indicate that 57 belong to cluster 1 and 12 belong to cluster 0. The 69 data items are then used as training data for the HMM-based exon controller model. The CC value obtained using all 69 data items is whereas, after the 12 data items are excluded, the CC value obtained is This increase in the CC value is larger than the increase obtained by adding states to the model as done in [3], which is equal to Another performance improvement concerns training time: forming a model with 20 states takes no more than 0.2 hours, whereas a model with 100 states takes more than 5 hours.

5. RESULTS AND DISCUSSION

Models built from training data and tested on sequences with a certain number of exons show variations in the CC value. The HMM-based DNA exon controller can be developed into model structures for DNA sequences with any number of exons.
FCM is effective enough to classify the data used to build the HMM-based exon controller model. Test results show that a model built with training data from a single cluster (i.e. cluster 1) yields a greater increase in CC than adding states to the model. In the HMM-based exon controller models, the number of insertions in exons is smaller than the number of deletions, because the model can keep intron positions as introns. The large number of deletions, however, suggests that many bases at exon positions are moved to intron positions.

ACKNOWLEDGEMENTS

The authors thank the Directorate of Higher Education, Ministry of Education, Government of Indonesia, for the research grant that has been given so that this

research can be conducted, and the Trisakti University Research Institute for its support.

REFERENCES

[1] Wang, Z., Chen, Y., Li, Y., A Brief Review of Computational Gene Prediction Method, Geno. Prot. Bioinfo., Vol. 2, No
[2] Mathe, C., Sagot, M.F., Schiex, Rouzhe, C., Survey and Summary: Current Methods of Gene Prediction, Their Strengths and Weaknesses, Nucleic Acids Research, Vol. 30, No. 19, Oxford University Press,
[3] Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, James D. Watson, Biologi molekuler sel: Mengenal sel, 2nd ed. (translation), PT Gramedia Pustaka Utama, pp ,
[4] Solihah, B., Agoes, S., Pakpahan, A., Optimization Structure of Hidden Markov Model for Plasmodium falciparum Gene Prediction, (IJEIT-Online),
[5] Malcolm J. Gardner, The genome of the malaria parasite, Current Opinion in Genetics and Development 9, (1999).
[6] Tapas Kanungo, Hidden Markov Model, Center for Automation Research, University of Maryland, /software /hmmtut.pdf, November
[7] Andrew W. Moore, Hidden Markov Models, School of Computer Science, Carnegie-Mellon University, 15 November 2005.
[8] Jose Renau, Hidden Markov Models: Fundamentals and Applications to Bioinformatics, ~shoukat/ipdps2007-boa.pdf, July 2005.
[9] Tulyakov, Introduction to Hidden Markov Models,
[10] Gopal, S., Haake, A., Jones, R.P., Tymann, P., 2009, Bioinformatics: A Computing Perspective, McGraw Hill, New York.
[11] Yang, Y., Huang, S., 2007, Image Segmentation by Fuzzy c-means Clustering Algorithm with a Novel Penalty Term, Computing and Informatics, Vol. 26, page
[12] Amnah, Solihah, B., Shofiati, R., 2012, Segmentasi inti sel menggunakan Jaringan syaraf tiruan, Proceeding Seminar Konferensi Nasional Sistem Informasi, Februari 2012, Bali.