AN IMPROVED ALGORITHM FOR MULTIPLE SEQUENCE ALIGNMENT OF PROTEIN SEQUENCES USING GENETIC ALGORITHM

Manish Kumar
Department of Computer Science and Engineering, Indian School of Mines, Dhanbad-826004, Jharkhand, India.
*Corresponding Author: Manish Kumar, e-mail: manishkumar@cse.ism.ac.in

ABSTRACT
One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic problem in multiple sequence alignment is to determine the most biologically plausible alignment of protein or DNA sequences. In this paper, an alignment method that uses a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to study the population evolution and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, BAliBASE, by comparing the obtained results with those of other alignment algorithms, namely SAGA, CLUSTAL W, MSA-GA and MSA-GA w/prealign. Experiments on a range of datasets show that the proposed algorithm achieves higher alignment scores than these previously proposed algorithms in most test cases.

KEYWORDS: Multiple Sequence Alignment; Genetic Algorithm; Crossover Operator; Mutation Operator.

INTRODUCTION
A multiple sequence alignment (MSA) (Hamidi et al., 2013) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA (Auyeung and Melcher, 2005). Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins. Sequence alignments are used extensively for improving the prediction of the secondary and tertiary structure of protein and RNA sequences, which supports drug design, and for estimating the distance between organisms.
In MSA, the emphasis is on finding an optimal alignment for a group of sequences. Several techniques have been applied in past research, from traditional methods such as dynamic programming to widely used stochastic optimization methods such as genetic algorithms (GAs) (Peng et al., 2011), hidden Markov models (HMMs) (Eddy, 1998) and simulated annealing (Kirkpatrick et al., 1983). MSA problems are solved using several classes of methods, such as classical, progressive (Kupis and Mandziuk, 2007) and iterative algorithms (Mohsen et al., 2007). These algorithms follow either global or local alignment (Changjin and Tewfik, 2009) strategies. In global alignments, sequences are aligned over their whole length. By contrast, local alignments identify regions of similarity within subsequences. Local alignments are often preferable, but can be more difficult to compute because of the additional challenge of identifying the regions of similarity.

A general global alignment technique is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which is based on dynamic programming. The Smith-Waterman algorithm is a general local alignment method that is also based on dynamic programming. The dynamic programming (DP) approach (Zhimin and Zhong, 2013) finds the optimal alignment for two sequences. However, its complexity grows significantly for three or more sequences: MSA is an NP-hard combinatorial problem (Kececioglu and Starrett, 2004), so the computational effort becomes prohibitive for a large number of sequences. The progressive alignment algorithm (a tree-based algorithm) proposed by Feng and Doolittle (Feng and Doolittle, 1987) iteratively applies the method of Needleman and Wunsch to obtain an MSA and to construct an evolutionary tree (Bhattacharjee et al., 2006) that depicts the relationship between sequences. Progressive alignment algorithms align sequences according to the branching order of a guide tree.
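As an illustration of the pairwise dynamic programming step mentioned above, the following minimal sketch computes a Needleman-Wunsch global alignment score for two sequences. The match, mismatch and linear gap values are illustrative assumptions, not parameters taken from this paper, and the function name is hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (score only, no traceback).

    Scoring values are illustrative assumptions: +1 match, -1 mismatch,
    -2 per gap position (linear gap penalty).
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("AC", "AGC"))  # 0: two matches and one gap
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is why the same idea becomes impractical when extended directly to many sequences.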
The difficulty with these methods is that they usually converge to local optima (Naznin et al., 2012). To overcome this limitation, an iterative or stochastic procedure is recommended. In this study, genetic algorithms (Pengfei et al., 2010) are considered for experimental analysis. The main advantage of using a GA for the MSA problem is that no problem-specific algorithm is needed; only a fitness function is required to evaluate the quality of candidate solutions. Also, since a GA is an implicitly parallel technique, it can be implemented effectively on parallel computers to solve demanding large-scale problems.

In the proposed method, our main objective is to align multiple protein sequences by using genetic algorithms. Because protein sequence alignment will remain an important application for the foreseeable future, we have developed two new genetic operators that differ from the traditional genetic operators, and we use them to solve the alignment problem for protein sequences. The presented approach aligns the protein sequences for most of the test cases (datasets), as the obtained results show.

Volume 4, Issue 3 (2015). ISSN: 2319-4731 (p); 2319-5037 (e). © 2015 DAMA International. All rights reserved.

MATERIALS AND METHODS

1. Representation and Initial Generation
In the proposed approach, the initial population is generated randomly. First, the longest sequence is determined. Each sequence in an individual is then filled with gap symbols until it reaches the length of the longest sequence plus a random number of gaps between 0 and 25% of the length of the longest loaded sequence. These gaps are placed at random positions in the sequences. After the population has been initialized, the solutions are recombined and mutated to produce new individuals over a defined number of generations (iterations), which is 50 in this experimental study.

2. Scoring Function
To evaluate the fitness of a sequence alignment, the sum-of-pairs method (SPM) is used in this paper.

Sum-of-Pairs Method (SPM)
With SPM, the fitness of a multiple sequence alignment is determined using equations (1a) and (1b). In equation (1a), $S$ is the cost of the multiple alignment, $L$ is the length (number of columns) of the alignment, $S_l$ is the cost of the $l$-th column, $N$ is the number of sequences, $A_i$ ($A_j$) is the aligned sequence $i$ ($j$), and $\mathrm{cost}(A_i, A_j)$ is the alignment score between the two aligned sequences $A_i$ and $A_j$. When $A_i \neq$ '-' and $A_j \neq$ '-', $\mathrm{cost}(A_i, A_j)$ is taken from the PAM 250 matrix, a mutation probability matrix. The cost function also includes the cost of insertions/deletions, modelled with affine gap penalties as shown in (1b), where $G$ is the gap penalty, $g$ is the cost of opening a gap, $x$ is the cost of extending the gap by one position and $n$ is the length of the gap.
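The sum-of-pairs scoring with affine gap penalties described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: `toy_substitution` is a hypothetical stand-in for a PAM 250 lookup, the values `gap_open=8` and `gap_extend=1` are made up for the example, and the handling of columns where both sequences have a gap (ignored for that pair) is an assumption because the text does not specify it.

```python
from itertools import combinations

GAP = "-"


def toy_substitution(a, b):
    # Hypothetical stand-in for a PAM 250 lookup: +2 if identical, else -1.
    return 2 if a == b else -1


def pair_score(row_a, row_b, sub=toy_substitution, gap_open=8, gap_extend=1):
    """Score two aligned rows; a gap run of length n costs g + n*x."""
    score = 0
    run_owner = None  # which row (0 or 1) owns the gap run in progress
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue  # assumption: gap/gap columns are ignored for this pair
        if a != GAP and b != GAP:
            score += sub(a, b)
            run_owner = None
        else:
            owner = 0 if a == GAP else 1
            if owner != run_owner:
                score -= gap_open  # g, charged once per gap run
                run_owner = owner
            score -= gap_extend    # x, charged for every gap position
    return score


def sum_of_pairs(alignment, **kwargs):
    """Sum the pairwise scores over all pairs of rows in the alignment."""
    return sum(pair_score(a, b, **kwargs) for a, b in combinations(alignment, 2))


print(sum_of_pairs(["AC-T", "ACGT", "A--T"]))  # -14 with the toy parameters
```

Summing over row pairs first and columns second gives the same total as summing column costs, and makes the affine penalty for a run of gaps easy to track per pair.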
In this way, the fitness of a multiple sequence alignment is calculated. The complexity of this function is $O(N^2 L)$.

$$S = \sum_{l=1}^{L} S_l, \quad \text{where } S_l = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{cost}(A_i, A_j) \qquad (1a)$$

$$G = g + nx \qquad (1b)$$

The score is calculated by scoring all pairwise comparisons between the residues in each column of an alignment and adding the scores together. This score serves as the measure of fitness of the population at each generation. The score for each column of the given sequences is calculated from the PAM 250 matrix.

3. Selection Strategies Description

3.1 Child Generation
To generate a child population of 100 individuals in every generation, two genetic operators, crossover and mutation, are used in the experimental study. They are described in detail below.

3.2 Crossover Operator
The crossover operator first chooses a column at random in the parent alignments and defines a cut point there. Then, by interchanging the parts of the parents on either side of the cut, it forms two new offspring (children). Gaps may be added to the resulting offspring to complete this operation.

Figure 1. One-point crossover (parent alignments 1 and 2, child alignments 1 and 2).
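One way to realize the one-point crossover described above is sketched below. The details are assumptions in the style of SAGA-like operators, since the text does not spell them out: parent 1 is cut at a column, each row of parent 2 is cut after the same number of residues, and gaps are inserted at the junction so that all rows keep equal length. The function names are hypothetical.

```python
import random

GAP = "-"


def split_by_residues(row, n_residues):
    """Split an aligned row right after its n-th residue (gaps do not count)."""
    seen = 0
    for pos, ch in enumerate(row):
        if seen == n_residues:
            return row[:pos], row[pos:]
        if ch != GAP:
            seen += 1
    return row, ""


def one_point_crossover(parent1, parent2, cut=None, rng=random):
    """Build one child: left part from parent1, right part from parent2.

    Assumes both parents align the same sequences in the same row order
    and that parent1 has at least two columns.
    """
    if cut is None:
        cut = rng.randint(1, len(parent1[0]) - 1)  # random cut column
    lefts, rights = [], []
    for row1, row2 in zip(parent1, parent2):
        left = row1[:cut]
        n_res = len(left.replace(GAP, ""))  # residues already used
        _, right = split_by_residues(row2, n_res)
        lefts.append(left)
        rights.append(right)
    width = max(len(r) for r in rights)
    # Pad with gaps at the junction so every row has the same length.
    return [l + GAP * (width - len(r)) + r for l, r in zip(lefts, rights)]


p1 = ["AC-GT", "A-CGT"]
p2 = ["-ACGT", "ACG-T"]
print(one_point_crossover(p1, p2, cut=2))  # ['AC--GT', 'A-CG-T']
```

Calling the function a second time with the parents swapped yields the second child. Cutting parent 2 by residue count rather than by column is what keeps every child row equal to its original sequence once gaps are removed.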

4. Mutation
Mutation is a divergence operation. It is intended to occasionally move one or more members of a population out of a local minimum/maximum and potentially discover a better one. The mutation used here is order changing: two positions are selected at random and their contents are exchanged, for example (1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7).

5. New Generation
For the next generation, we implemented a 60-40% parent-child selection scheme based on fitness score: 60% of the next population is taken from the parent population and 40% from the child population. Other combinations, such as 40-60% and 50-50% parent-child, were also tried, but they did not improve the overall quality of the solution and were therefore not adopted.

RESULTS
The main objective of this research work is to observe the role of the proposed crossover and mutation operators in solving the MSA problem for protein sequences, in terms of the quality and scores of the aligned sequences. Here, the quality of an alignment is judged by the score it obtains. The experiments were performed with a genetic algorithm implemented in C on an Intel Core 2 Duo processor (2.53 GHz) with 2 GB RAM running Linux. To evaluate the proposed approach, the algorithm was executed for 50 independent runs (iterations) on 14 datasets. Because the fitness score depends on the level of similarity among the residues in the sequences, the scores can be either positive or negative. Note that if the residues of the compared sequences are similar, only a small number of gaps ('-') is needed to align the sequences properly. If, on the other hand, the majority of the residues are dissimilar, a large number of gaps is needed.
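The order-changing mutation and the 60-40% parent-child replacement described in Sections 4 and 5 can be sketched as follows. This is a minimal illustration with assumptions: the swap acts on a generic list as in the numeric example (how it is constrained for alignment rows is not stated in the text), "based on fitness" is read as taking the fittest individuals from each group, and the function names and the `positions` parameter are hypothetical.

```python
import random


def swap_mutation(individual, positions=None, rng=random):
    """Exchange the contents of two positions and return a new list."""
    if positions is None:
        positions = rng.sample(range(len(individual)), 2)  # two distinct indices
    i, j = positions
    mutated = list(individual)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def next_generation(parents, children, fitness, size=100, parent_share=0.6):
    """Keep the fittest 60% from the parents and the fittest 40% from the children.

    Assumption: higher fitness is better and selection is a plain truncation.
    """
    n_parents = round(size * parent_share)
    best_parents = sorted(parents, key=fitness, reverse=True)[:n_parents]
    best_children = sorted(children, key=fitness, reverse=True)[:size - n_parents]
    return best_parents + best_children


# Reproduces the example from the text by swapping positions 1 and 6 (0-based).
print(swap_mutation([1, 2, 3, 4, 5, 6, 8, 9, 7], positions=(1, 6)))
# [1, 8, 3, 4, 5, 6, 2, 9, 7]
```

Changing `parent_share` to 0.4 or 0.5 gives the alternative 40-60% and 50-50% schemes mentioned in the text.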
Performance of the Proposed Method with Ref. 1
The 14 datasets of Reference 1 shown in Table 1 are of different lengths and contain different sequences. To assess the proposed method in terms of BAliscore, it was compared with CLUSTAL W, MSA-GA, MSA-GA w/prealign and SAGA. The comparison shows that, out of 14 test cases, the proposed method outperformed the other methods in 11 test cases, and in three test cases the proposed method's solutions were very close to the best.

Table 1: Experimental results with Reference 1 (Ref. 1) datasets of BAliBASE 2.0

NAME OF DATASETS   CLUSTAL W   MSA-GA   MSA-GA W/PREALIGN   SAGA    PROPOSED METHOD
1idy               0.500       0.427    0.438               0.342   0.452
1ar5A              0.946       0.812    0.946               0.971   0.986
1ad2               0.773       0.821    0.845               0.917   0.962
kinase             0.479       0.443    0.405               0.862   0.981
1krn               0.895       0.908    0.895               0.993   0.995
2myr               0.296       0.212    0.302               0.285   0.621
1ycc               0.643       0.650    0.653               0.837   0.898
3cyr               0.767       0.772    0.789               0.908   0.958
1taq               0.826       0.525    0.826               0.931   0.984
1ldg               0.895       0.895    0.922               0.989   0.752
1fieA              0.932       0.843    0.942               0.947   0.985
1sesA              0.913       0.620    0.913               0.954   0.994
2fxb               0.985       0.941    0.985               0.951   0.989
1amk               0.945       0.965    0.959               0.997   0.752
Average score      0.771       0.702    0.772               0.846   0.848

www.sciencejournal.in

Figure 2. Bar graph comparison of BAliscore between the proposed and other methods (CLUSTAL W, MSA-GA, MSA-GA w/prealign, SAGA) over Ref. 1, for datasets 1: 3cyr, 2: 1taq, 3: 1ldg, 4: 1fieA, 5: 1sesA, 6: 2fxb, 7: 1amk.

Figure 3. Bar graph comparison of BAliscore between the proposed and other methods (CLUSTAL W, MSA-GA, MSA-GA w/prealign, SAGA) over Ref. 1, for datasets 1: 1idy, 2: 1ar5A, 3: 1ad2, 4: kinase, 5: 1krn, 6: 2myr, 7: 1ycc.

DISCUSSIONS
In this paper, a novel approach has been developed that uses a genetic algorithm to perform multiple sequence alignment. The objective of this study is to validate the efficacy of the proposed approach and to assess it by comparing it with other commonly used MSA algorithms over different datasets. To evaluate the efficiency and feasibility of the proposed approach, benchmark datasets from BAliBASE 2.0 are used, because most of the methods discussed in this paper use BAliBASE datasets to assess the quality of multiple sequence alignments. Compared with the other methods, the proposed method improves the overall quality of the alignment, as can be observed from the scores on the different datasets. It was also observed that the proposed method gives unsatisfactory results in some test cases. In conclusion, the approach proposed in this paper obtains very promising protein sequence alignments that surpass previously published results in most of the cases.

REFERENCES
Auyeung A. and Melcher U. (2005). Evaluations of protein sequence alignments using structural information. Int. Con. Info. Tech. Coding Computing. 2:748-49.
Bhattacharjee A., Sultana K.Z. and Shams Z. (2006). Dynamic and Parallel Approaches to Optimal Evolutionary Tree Construction. Canadian Con. Electrical Computer Engineering. 119-122.
Changjin H. and Tewfik A.H. (2009). Heuristic Reusable Dynamic Programming: Efficient Updates of Local Sequence Alignment. IEEE/ACM Transactions Computational Biol. Bioinfo. 6(4):570-82.

Eddy S. (1998). Profile hidden Markov models. Bioinformatics. 14:755-63.
Feng D.F. and Doolittle R.F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25(4):351-60.
Hamidi S., Naghibzadeh M. and Sadri J. (2013). Protein multiple sequence alignment based on secondary structure similarity. Int. Con. Advances Computing, Communications Info. 1224-1229.
Kececioglu J. and Starrett D. (2004). Aligning alignments exactly. RECOMB.
Kirkpatrick S., Gelatt C.D. and Vecchi M.P. (1983). Optimization by simulated annealing. Science. 220:671-80.
Kupis P. and Mandziuk J. (2007). Evolutionary-Progressive Method for Multiple Sequence Alignment. IEEE Symposium Computational Intelligence Bioinfo. Computational Biol. 291-297.
Mohsen B., Balaji P., Devavrat S. and Mayank S. (2007). Iterative Scheduling Algorithms. IEEE INFOCOM proceedings.
Naznin F., Sarker R. and Essam D. (2012). Progressive Alignment Method Using Genetic Algorithm for Multiple Sequence Alignment. IEEE Transactions on Evolutionary Computation. 16(5):615-631.
Needleman S.B. and Wunsch C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3):443-53.
Peng Y., Dong C. and Zheng H. (2011). Research on Genetic Algorithm Based on Pyramid Model. 2nd International Symposium on Intelligence Information Processing Trusted Computing. 83-86.
Pengfei G., Xuezhi Wa. and Yingshi H. (2010). The enhanced genetic algorithms for the optimization design. 3rd Int. Con. Biomedical Engineering Info. 7:2990-2994.
Zhimin Z.H. and Zhong W.C. (2013). Dynamic Programming For Protein Sequence Alignment. International J. BioSci. Bio Tech. 5(2).