AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM

Manish Kumar
Department of Computer Science and Engineering, Indian School of Mines, Dhanbad-826004, Jharkhand, India.
*Corresponding Author: Manish Kumar, e-mail: manishkumar@cse.ism.ac.in

ABSTRACT
One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic problem in multiple sequence alignment is to determine the most biologically plausible alignment of protein or DNA sequences. In this paper, an alignment method that uses a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to study the evolution of the population and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, BAliBASE, by comparing its results with those of other alignment algorithms: SAGA, CLUSTAL W, MSA-GA and MSA-GA w/prealign. Experiments on a wide range of datasets show that the proposed algorithm achieves better alignment quality (in terms of score) than the previously proposed algorithms in most test cases.

KEYWORDS: Multiple Sequence Alignment; Genetic Algorithm; Crossover Operator; Mutation Operator.

INTRODUCTION
A multiple sequence alignment (MSA) (Hamidi et al., 2013) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA (Auyeung and Melcher, 2005). Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins. Sequence alignments are used extensively for improving the prediction of the secondary and tertiary structure of protein and RNA sequences, which is useful in drug design, and for estimating the distance between organisms.
In MSA, the emphasis is on finding an optimal alignment for a group of sequences. Several techniques have been applied in past research, ranging from traditional methods such as dynamic programming to widely used stochastic optimization methods such as genetic algorithms (GAs) (Peng et al., 2011), hidden Markov models (HMMs) (Eddy, 1998) and simulated annealing (Kirkpatrick et al., 1983). MSA problems are solved using several different classes of methods, such as classical, progressive (Kupis and Mandziuk, 2007) and iterative algorithms (Mohsen et al., 2007). These algorithms follow either global or local alignment strategies (Changjin and Tewfik, 2009). In global alignments, sequences are aligned over their whole length. By contrast, local alignments identify regions of similarity within subsequences. Local alignments are often preferable, but can be more difficult to compute because of the additional challenge of identifying the regions of similarity. A general global alignment technique is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which is based on dynamic programming. The Smith-Waterman algorithm is a general local alignment method that is also based on dynamic programming. The dynamic programming (DP) approach (Zhimin and Zhong, 2013) is good at finding the optimal alignment of two sequences. However, the complexity of this method grows significantly for three or more sequences. Note that MSA is a combinatorial (NP-hard) problem (Kececioglu and Starrett, 2004), for which the computational effort becomes prohibitive with a large number of sequences. The progressive alignment algorithm (a tree-based algorithm) proposed by Feng and Doolittle (Feng and Doolittle, 1987) repeatedly applies the method of Needleman and Wunsch in order to obtain an MSA and to construct an evolutionary tree (Bhattacharjee et al., 2006) that depicts the relationships between the sequences. Progressive alignment algorithms align sequences according to the branching order of a guide tree.
The difficulty with these methods is that they usually converge to local optima (Naznin et al., 2012). To overcome this limitation, an iterative or stochastic procedure is recommended. In this study, genetic algorithms (Pengfei et al., 2010) are considered for experimental analysis. The main advantage of using a GA for the MSA problem is that no problem-specific algorithm has to be provided; only a fitness function is needed to evaluate the quality of candidate solutions. In addition, since a GA is an implicitly parallel technique, it can be implemented very effectively on powerful parallel computers to solve exceptionally demanding large-scale problems. In the proposed method, our main objective is to align multiple protein sequences by using genetic algorithms. Because the alignment of protein sequences will remain an important application for the foreseeable future, we have developed two new genetic operators that differ from the traditional genetic operators, and with their help we have tried to solve the alignment problem for protein sequences. With the presented approach, we are able to align the protein sequences for most of the test cases (datasets), as can be observed from the obtained results.

Volume 4, Issue 3 (2015). ISSN: 2319-4731 (p); 2319-5037 (e). 2015 DAMA International. All rights reserved.

MATERIALS AND METHODS
1. Representation and Initial Generation
In the proposed approach, the initial population is generated randomly. First, the longest sequence is determined. The sequences of each individual are then filled with gap symbols until they reach the length of the longest sequence plus a random number of gaps between 0 and 25% of the length of the longest sequence. These gaps are placed at random positions in the sequences. After the population has been initialized, the solutions are recombined and mutated to produce new individuals over a defined number of generations (iterations), which is 50 in this experimental study.

2. Scoring Function
In order to evaluate the fitness of a sequence alignment, the sum-of-pairs method (SPM) is used in this paper.

Sum of Pair Method (SPM)
With SPM, the fitness of a multiple sequence alignment is determined using equations (1a) and (1b). In equation (1a), $S$ is the cost of the multiple alignment, $L$ is the length (number of columns) of the alignment, $S_l$ is the cost of the $l$-th of the $L$ columns, $N$ is the number of sequences, $A_i$ ($A_j$) is the aligned sequence $i$ ($j$) and $\mathrm{cost}(A_i, A_j)$ is the alignment score between the two aligned sequences $A_i$ and $A_j$. When $A_i \neq$ '-' and $A_j \neq$ '-', $\mathrm{cost}(A_i, A_j)$ is determined from the PAM 250 matrix, a mutation probability matrix. The cost function also includes the costs of insertions/deletions, using a model with affine gap penalties as shown in (1b), where $G$ is the gap penalty, $g$ is the cost of opening a gap, $x$ is the cost of extending the gap by one and $n$ is the length of the gap.
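The initialization and scoring steps described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the author's C implementation: `toy_cost` is a stand-in for a PAM 250 lookup, the gap costs `g=3` and `x=1` are arbitrary, and subtracting one affine penalty per gap run in each row is one possible reading of how (1a) and (1b) are combined. All function names are hypothetical.

```python
import random

GAP = "-"

def pad_with_gaps(seq, target_len, rng):
    """Insert gaps at random positions until seq has target_len symbols."""
    row = list(seq)
    while len(row) < target_len:
        row.insert(rng.randint(0, len(row)), GAP)
    return "".join(row)

def random_individual(seqs, rng, max_extra=0.25):
    """Build one candidate alignment: all rows padded to a common length."""
    longest = max(len(s) for s in seqs)
    # Extra columns: between 0 and 25% of the longest sequence.
    extra = rng.randint(0, int(max_extra * longest))
    return [pad_with_gaps(s, longest + extra, rng) for s in seqs]

def toy_cost(a, b):
    """Assumed stand-in for a PAM 250 lookup (NOT the real matrix)."""
    return 2 if a == b else -1

def gap_penalty(row, g, x):
    """Sum of G = g + n*x over every run of n consecutive gaps in row."""
    total, run = 0, 0
    for ch in row + "X":  # sentinel residue closes a trailing gap run
        if ch == GAP:
            run += 1
        elif run:
            total += g + run * x
            run = 0
    return total

def sp_score(alignment, cost=toy_cost, g=3, x=1):
    """Column-wise sum of pairs, O(N^2 L), minus affine gap penalties."""
    n, length = len(alignment), len(alignment[0])
    score = 0
    for l in range(length):              # every column
        for i in range(n - 1):           # every pair of rows i < j
            for j in range(i + 1, n):
                a, b = alignment[i][l], alignment[j][l]
                if a != GAP and b != GAP:
                    score += cost(a, b)
    # Assumption: gaps are charged once per gap run in each row.
    return score - sum(gap_penalty(row, g, x) for row in alignment)
```

A population would then be a list of such individuals, for example `[random_individual(seqs, random.Random(seed)) for seed in range(100)]`, each ranked by `sp_score`.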
In this way, the fitness of a multiple sequence alignment is calculated. The complexity of this function is $O(N^2 L)$.

$$S = \sum_{l=1}^{L} S_l, \quad \text{where} \quad S_l = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{cost}(A_i, A_j) \tag{1a}$$

$$G = g + nx \tag{1b}$$

The score is calculated by scoring all pairwise comparisons between the residues in each column of the alignment and adding the scores together. This score acts as the measure of fitness of the population at each generation. The score of each column for the given sequences is calculated from the PAM 250 matrix.

3. Selection Strategies Description
3.1 Child Generation
In order to generate a child population of 100 individuals in every generation, two genetic operators, crossover and mutation, are used in this experimental study. They are described in detail below.

3.2 Crossover Operator
The crossover operator first chooses a column at random in the parent alignments and defines a cut point there. Then, by interchanging the corresponding parts of the parents, it forms two new offspring (children). During this operation, gaps may be added to the resulting offspring.

[Figure 1. One point crossover: parent alignment 1 and parent alignment 2 are cut and recombined into child alignment 1 and child alignment 2.]
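A one-point crossover of this kind can be sketched as follows. The paper does not spell out how the cut in the second parent is chosen or where the extra gaps go, so this sketch makes two assumptions: the second parent is cut row by row after the same number of residues as the first parent (so that every child row still spells its original sequence), and padding gaps are inserted at the junction. Both parents are assumed to align the same sequences in the same row order; all names are hypothetical.

```python
import random

GAP = "-"

def split_by_residues(row, k):
    """Split row so that the left part contains exactly k residues."""
    seen = 0
    for pos, ch in enumerate(row):
        if seen == k:
            return row[:pos], row[pos:]
        if ch != GAP:
            seen += 1
    return row, ""

def one_point_crossover(parent1, parent2, cut):
    """Cut parent1 at column `cut` and parent2 at the matching residues."""
    lefts1, rights1, lefts2, rights2 = [], [], [], []
    for row1, row2 in zip(parent1, parent2):
        left1, right1 = row1[:cut], row1[cut:]
        k = len(left1) - left1.count(GAP)  # residues left of the cut
        left2, right2 = split_by_residues(row2, k)
        lefts1.append(left1)
        rights1.append(right1)
        lefts2.append(left2)
        rights2.append(right2)
    # The pieces taken from parent2 are ragged; pad with gaps at the
    # junction so that all rows of a child have the same length.
    w_right = max(len(r) for r in rights2)
    w_left = max(len(l) for l in lefts2)
    child1 = [l + GAP * (w_right - len(r)) + r
              for l, r in zip(lefts1, rights2)]
    child2 = [l + GAP * (w_left - len(l)) + r
              for l, r in zip(lefts2, rights1)]
    return child1, child2

def random_crossover(parent1, parent2, rng):
    """Pick the cut column at random, as described in the text."""
    cut = rng.randint(1, len(parent1[0]) - 1)
    return one_point_crossover(parent1, parent2, cut)
```

For example, `one_point_crossover(["AC-GT", "A-CGT"], ["-ACGT", "ACG-T"], 2)` yields the children `["AC--GT", "A-CG-T"]` and `["-AC-GT", "A--CGT"]`; removing the gaps from any child row gives back `ACGT`.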

4. Mutation
Mutation is a divergence operation. It is intended to occasionally move one or more members of the population out of a local minimum/maximum and potentially discover a better minimum/maximum. The mutation used here is order changing: two positions are selected at random and their values are exchanged, e.g. (1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7).

5. New Generation
For the next generation, we have implemented a 60-40% parent-child selection scheme based on fitness scores: 60% of the parent population and 40% of the child population are used to produce the next population. Other combinations, such as 40-60% or 50-50% parent-child populations, were also tried, but they did not improve the overall quality of the solution and were therefore not adopted.

RESULTS
The main objective of this research work is to observe the role of the proposed crossover and mutation operators in solving the MSA problem for protein sequences, in terms of the quality and scores of the aligned sequences. Here, the quality of an alignment is judged by the score it obtains. The experiments were performed with a genetic algorithm implemented in C on an Intel Core 2 Duo processor (2.53 GHz) with 2 GB RAM, running Linux. To evaluate the proposed approach, the algorithm was executed for 50 independent runs (iterations) on 14 datasets. Since the fitness score depends on the level of similarity among the residues in the sequences, the scores can be either positive or negative. Note that if the residues of the compared sequences are similar, only a small number of gaps ('-') is needed to align the sequences properly. If, on the other hand, the majority of the residues are dissimilar, a large number of gaps is needed to align the sequences.
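The order-changing mutation of Section 4 and the 60-40 parent-child scheme of Section 5 can be sketched as follows. The paper does not state which elements of an alignment are exchanged by the mutation, so the sketch operates on a generic list, as in the paper's numeric example. Reading the 60-40 scheme as "keep the fittest 60% of the parents and the fittest 40% of the children" is likewise an assumption, and all names are hypothetical.

```python
import random

def swap_positions(items, i, j):
    """Return a copy of items with the values at positions i and j exchanged."""
    mutated = list(items)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

def order_changing_mutation(items, rng):
    """Exchange the values at two distinct, randomly chosen positions."""
    i, j = rng.sample(range(len(items)), 2)
    return swap_positions(items, i, j)

def next_generation(parents, children, fitness, parent_share=0.6):
    """Assumed 60-40 scheme: best 60% of parents plus best 40% of children."""
    size = len(parents)
    n_parents = round(parent_share * size)
    best_parents = sorted(parents, key=fitness, reverse=True)[:n_parents]
    best_children = sorted(children, key=fitness, reverse=True)[:size - n_parents]
    return best_parents + best_children
```

With 0-based positions 1 and 6, `swap_positions([1, 2, 3, 4, 5, 6, 8, 9, 7], 1, 6)` reproduces the example from Section 4, `[1, 8, 3, 4, 5, 6, 2, 9, 7]`. For a population of 100, `next_generation` keeps 60 parents and 40 children, so the population size stays constant.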
Performance of the Proposed Method with Ref. 1
The 14 datasets of Reference 1 shown in Table 1 differ in length and in the sequences they contain. The proposed approach was compared, in terms of BAliscore, with CLUSTAL W, MSA-GA, MSA-GA w/prealign and SAGA. The comparison shows that, out of 14 test cases, the proposed method outperformed the other methods in 11 test cases; in the remaining three test cases (1idy, 1ldg and 1amk), another method obtained a higher score.

Table 1: Experimental results with Reference 1 datasets of BAliBASE 2.0

Dataset        CLUSTAL W  MSA-GA  MSA-GA w/prealign  SAGA   Proposed method
1idy           0.500      0.427   0.438              0.342  0.452
1ar5A          0.946      0.812   0.946              0.971  0.986
1ad2           0.773      0.821   0.845              0.917  0.962
kinase         0.479      0.443   0.405              0.862  0.981
1krn           0.895      0.908   0.895              0.993  0.995
2myr           0.296      0.212   0.302              0.285  0.621
1ycc           0.643      0.650   0.653              0.837  0.898
3cyr           0.767      0.772   0.789              0.908  0.958
1taq           0.826      0.525   0.826              0.931  0.984
1ldg           0.895      0.895   0.922              0.989  0.752
1fieA          0.932      0.843   0.942              0.947  0.985
1sesA          0.913      0.620   0.913              0.954  0.994
2fxb           0.985      0.941   0.985              0.951  0.989
1amk           0.945      0.965   0.959              0.997  0.752
Average score  0.771      0.702   0.772              0.846  0.848

www.sciencejournal.in

[Figure 2. Bar graph comparison of scores (BAliscore) between the proposed and other methods (CLUSTAL W, MSA-GA, MSA-GA w/prealign, SAGA) over Ref. 1. Datasets: 1-3cyr, 2-1taq, 3-1ldg, 4-1fieA, 5-1sesA, 6-2fxb, 7-1amk.]

[Figure 3. Bar graph comparison of scores (BAliscore) between the proposed and other methods (CLUSTAL W, MSA-GA, MSA-GA w/prealign, SAGA) over Ref. 1. Datasets: 1-1idy, 2-1ar5A, 3-1ad2, 4-kinase, 5-1krn, 6-2myr, 7-1ycc.]

DISCUSSIONS
In this paper, a novel approach that uses a genetic algorithm to perform multiple sequence alignment has been developed. The objective of this study is to validate the efficacy of the proposed approach and to assess it by comparison with other commonly used MSA algorithms over different datasets. To evaluate the efficiency and feasibility of the proposed approach, benchmark datasets from BAliBASE 2.0 were used, because most of the methods discussed in this paper use BAliBASE datasets to assess the quality of multiple sequence alignments. Compared with the other methods, the proposed method improves the overall quality of the alignment, as can be observed from the scores on the different datasets. It was also observed that the proposed method gives unsatisfactory results in some test cases. The conclusion that can be drawn is that the approach proposed in this paper obtains very promising protein sequence alignments that surpass previously published results in most of the cases.

REFERENCES
Auyeung A. and Melcher U. (2005). Evaluations of protein sequence alignments using structural information. Int. Con. Info. Tech. Coding Computing. 2:748-749.
Bhattacharjee A., Sultana K.Z. and Shams Z. (2006). Dynamic and Parallel Approaches to Optimal Evolutionary Tree Construction. Canadian Con. Electrical Computer Engineering. 119-122.
Changjin H. and Tewfik A.H. (2009). Heuristic Reusable Dynamic Programming: Efficient Updates of Local Sequence Alignment. IEEE/ACM Transactions Computational Biol. Bioinfo. 6(4):570-582.

Eddy S. (1998). Profile hidden Markov models. Bioinformatics. 14:755-763.
Feng D.F. and Doolittle R.F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25(4):351-360.
Hamidi S., Naghibzadeh M. and Sadri J. (2013). Protein multiple sequence alignment based on secondary structure similarity. Int. Con. Advances Computing, Communications Info. 1224-1229.
Kececioglu J. and Starrett D. (2004). Aligning alignments exactly. RECOMB.
Kirkpatrick S., Gelatt C.D. Jr. and Vecchi M.P. (1983). Optimization by simulated annealing. Science. 220:671-680.
Kupis P. and Mandziuk J. (2007). Evolutionary-Progressive Method for Multiple Sequence Alignment. IEEE Symposium Computational Intelligence Bioinfo. Computational Biol. 291-297.
Mohsen B., Balaji P., Devavrat S. and Mayank S. (2007). Iterative Scheduling Algorithms. IEEE INFOCOM proceedings.
Naznin F., Sarker R. and Essam D. (2012). Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment. IEEE Transactions on Evolutionary Computation. 16(5):615-631.
Needleman S.B. and Wunsch C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3):443-453.
Peng Y., Dong C. and Zheng H. (2011). Research on Genetic Algorithm Based on Pyramid Model. 2nd International Symposium on Intelligence Information Processing Trusted Computing. 83-86.
Pengfei G., Xuezhi Wa. and Yingshi H. (2010). The enhanced genetic algorithms for the optimization design. 3rd Int. Con. Biomedical Engineering Info. 7:2990-2994.
Zhimin Z.H. and Zhong W.C. (2013). Dynamic Programming For Protein Sequence Alignment. International J. BioSci. Bio Tech. 5(2).