A REVIEW OF ARTIFICIAL INTELLIGENCE TECHNIQUES APPLIED TO PROTEIN STRUCTURE PREDICTION


A REVIEW OF ARTIFICIAL INTELLIGENCE TECHNIQUES APPLIED TO PROTEIN STRUCTURE PREDICTION Jiang Ye B.Sc., University of Ottawa, 2003 A PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in the School of Computing © Jiang Ye 2007 SIMON FRASER UNIVERSITY Spring 2007 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL Name: Degree: Title of project: Jiang Ye Master of Science A Review of Artificial Intelligence Techniques Applied to Protein Structure Prediction Examining Committee: Dr. Diana Cukierman Chair Dr. Veronica Dahl, Senior Supervisor Dr. Kay C. Wiese, Supervisor Dr. Alma Barranco-Mendoza, Examiner, Assistant Professor of Computing Science, Trinity Western University, Langley Date Approved:

SIMON FRASER UNIVERSITY LIBRARY DECLARATION OF PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the "Institutional Repository" link of the SFU Library website) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission. Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence. The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive. Simon Fraser University Library Burnaby, BC, Canada Revised: Fall 2006

Abstract

Protein structure prediction (PSP) is a significant yet difficult problem that attracts attention from both the biology and computing worlds. The problem is to predict a protein's native structure from its primary sequence using computational means. It remains largely unsolved because no comprehensive theory of protein folding is available and a global search of the conformational space is intractable. This is why AI techniques have been effective in tackling some aspects of this problem. This survey report reviews biologically inspired AI techniques that have been applied to the PSP problem. We focus on evolutionary computation and ANNs. Evolutionary computation is used as a population-based search technique, mainly in the ab initio prediction approach. ANNs are most successful in secondary structure prediction, learning meaningful relations between primary sequence and secondary structure from datasets. The report also reviews a new generative encoding scheme, L-systems, to capture protein structure on lattice models. Keywords: Protein structure prediction, evolutionary computation, artificial neural networks, L-systems.

Acknowledgments My sincere gratitude goes to Dr. Veronica Dahl for the initiation of the project, for her guidance, support and always being there to listen and to give advice. It has been my greatest pleasure getting to know her and learning from her during my graduate studies. I would also like to thank Dr. Kay C. Wiese for his constructive suggestions and comments. Special thanks to my parents for their unconditional love and to my son for always being a caring boy.

Contents

Approval ii
Abstract iii
Acknowledgments iv
Contents v

1 Introduction
  Protein Structure
    Protein Basics
    Protein Structure Hierarchy
    Experimental Methods
    Algorithmic Processing of Evolution
    Protein Structure Databases
    Evaluation of Prediction Methods 9

2 Problem Overview
  The Significance
  The Challenges
  Representation of Protein Structure
    All-atom Model
    Simplified Models
    HP Lattice Model
  Potential Energy Functions
  Measure of Prediction Accuracy 18
  Related Problems 18

3 Prediction Approaches
  Overview
  Knowledge-based Prediction
    Homology (Comparative) Modeling
    Fold Recognition (Threading)
  Ab initio Prediction
    Dynamic Modeling
    Energy Minimization
  Structural Features Prediction
    Secondary Structure Prediction 28

4 AI Techniques for PSP
  Evolutionary Computation
    Introduction to Evolutionary Algorithms
    Evolutionary Algorithms for PSP
    Discussion
  L-systems
    Introduction to L-systems
    L-system-based Encoding for Protein Structure
    Discussion
  Artificial Neural Networks
    Introduction to ANNs
    A Basic ANN Scheme for Predicting Structural Features
    Secondary Structure Prediction
    Other Structural Features Prediction
    Discussion
  Summary 76

Bibliography

Chapter 1 Introduction

"Biology easily has 500 years of exciting problems for computer science." (Donald Knuth, 2001)

Protein structure prediction (PSP) is definitely one such problem. Proteins form the very basis of life. They perform a variety of essential functions in organisms, from replication of the genetic code to transporting oxygen, from making up our cell skeleton to catalyzing the chemical reactions that make life possible. Proteins are formed by joining amino acids into a linear chain. In water, the solvent environment in cells, the chain folds up into a unique three-dimensional structure. Determining this structure is the key to understanding how proteins work, and thus essential for our understanding of biological processes and our ability to enhance the quality and span of our lives. Currently, the structures of fewer than 30,000 proteins have been determined through experimental methods [98]. This contrasts with the more than a million protein sequences known as a result of the explosion of genome sequencing projects. The sequence-structure gap has dramatically increased. Since the costly and time-consuming experimental methods for structure determination cannot keep pace with sequencing speed, we need effective computational tools able to translate a sequence into an accurate structure. Unfortunately, despite the growth of computing power and several decades of research effort, the problem of predicting protein structure from sequence remains largely unsolved and has therefore been the "holy grail" of computational biology for many years. The main reason for this, as indicated in [59], is that no comprehensive theory of protein folding is available and a global search of the conformational space of proteins is intractable. The bright side, however, is that more and more research attention has been drawn to this problem and there have been some promising results in some aspects of this

problem. The structure prediction community is growing rapidly. In the first CASP (Critical Assessment of Structure Prediction) contest in 1994, 35 research groups submitted 100 predictions for 33 protein targets, while in CASP6 in 2004, there were 230 groups submitting more than 41,000 predictions for the 76 targets. The databases holding various kinds of protein structural information, and the web servers and programs for the prediction task, have also grown greatly in number. There are three major categories of approaches to this problem: 'knowledge-based', 'ab initio', and structural feature prediction. Some knowledge-based methods have achieved rather accurate prediction for a limited number of proteins. Approaches in the third category predict structural features such as secondary structure and are very useful in the general prediction problem. All three categories of approaches have been attracting active research. This report is intended to give an overview of computational approaches to the PSP problem. The focus is on applications of some interesting biologically inspired AI techniques. The role of computers has been dramatically enhanced in all areas of biological and medical research with the exponential growth of biological data. At the same time, biological systems have been inspiring computing science advances with new concepts, including genetic algorithms, artificial neural networks, artificial life, DNA computing... When humans try to solve problems, it is always exciting to look at Nature's amazing solutions. "When looking for the most powerful natural problem solver, there are two rather straightforward candidates: the human brain (that created 'the wheel, New York, wars and so on'), and the evolutionary process (that created the human brain)" [26]. Trying to design problem solvers based on human brains leads to the field of neuro-computing. The evolutionary process forms the basis of evolutionary computing.
In this report, we will examine and review how evolutionary algorithms and artificial neural networks are applied in PSP. We will also introduce Lindenmayer systems as a novel protein structure representation scheme. Although not as powerful as evolutionary computation and ANNs, L-systems, also inspired by natural systems, have found many applications in the computing world and have recently been applied to encode lattice protein conformations. This report has four chapters. Chapter one provides a basic introduction to protein structure and the important resources in the protein structure prediction research field. Chapter two serves as a problem overview, discussing several issues in the problem domain. Chapter three is a general introduction to the various approaches to the problem; it paints the big picture of protein structure prediction and helps to show where individual computational techniques or methods fit. Chapter four is the main chapter

of this report. It reviews and analyzes three biologically inspired computing techniques, evolutionary algorithms, L-systems, and ANNs, and their applications to the PSP problem. The report concludes with a short summary.

1.1 Protein Structure

This section introduces basic ideas about proteins, protein structure, and the current experimental methods used to determine protein structure.

Protein Basics

A protein is a chain of amino acids, also referred to as residues. A single amino acid, shown in the diagram below, always has: a central carbon atom Cα, an amino group -NH2, a carboxyl group -COOH, a hydrogen -H, and a chemical group or side chain -R.

Figure 1.1: Single amino acid structure

There are 20 different amino acids commonly found in proteins, each coded by one English letter. For example, the amino acid cysteine is coded as 'C'. All 20 amino acids have the same general structure shown in Figure 1.1, but their side chains (Rs) vary in composition and structure, and thus in properties like size, shape, and charge. It is the side chain that determines the identity of a particular amino acid. One useful classification divides the amino acids into two kinds: the polar (or hydrophilic) amino acids have side chains that interact with water, while those of the hydrophobic amino acids do not. Amino acids can be linked when the carboxyl group of one amino acid reacts with the amino group of the next, releasing a water molecule and forming a peptide bond between the two amino acids, as shown in Figure 1.2. Using peptide bonds, long sequences of amino acids (polypeptide chains, or proteins) are generated. Most proteins are a few hundred residues long, although some are shorter than 100 residues or longer than 1000 residues. Relative to the

Figure 1.2: Two amino acids reacting to form a peptide bond

side chains, the sequence of three repeating groups, amino group, Cα atom, and carboxyl group, is called the protein backbone or main chain. The two ends of a polypeptide chain are chemically different: the end carrying the amino group is the N-terminal, and that carrying the carboxyl group is the C-terminal. Conventionally, the amino acid sequence of a protein is always presented in the N-to-C direction. The peptide bond itself (the CO-NH group, indicated with a rectangle) is planar, but there is flexibility for rotation around the N-Cα bond and around the Cα-C bond, forming two dihedral angles, φ and ψ, on either side of the Cα atom, as shown in Figure 1.3.

Figure 1.3: Dihedral angles of the backbone

These two dihedral angles are the main degrees of freedom in forming the 3-d polypeptide chain. Although the values of the dihedral angles are restricted to small regions in natural proteins (valid values are specified by the so-called Ramachandran plot), it is this freedom that allows a protein to fold into a specific three-dimensional structure, or conformation.
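Since the (φ, ψ) pairs are the main degrees of freedom, a backbone conformation can be stored simply as one angle pair per residue. A minimal sketch follows; the angle values are illustrative, and the crude helix test (its name and tolerance are assumptions) is a toy approximation, not a real Ramachandran-region check:

```python
# A backbone conformation as one (phi, psi) dihedral-angle pair per
# residue, in degrees. The values are illustrative only; an ideal
# alpha-helix is often quoted near (-57, -47).
conformation = [(-57.0, -47.0), (-60.0, -45.0), (-120.0, 130.0)]

def is_roughly_helical(phi, psi, tol=30.0):
    """Crude test: within `tol` degrees of the canonical helix angles.
    A toy approximation, not a real Ramachandran-region check."""
    return abs(phi - (-57.0)) <= tol and abs(psi - (-47.0)) <= tol

print([is_roughly_helical(phi, psi) for phi, psi in conformation])
# [True, True, False]
```

Representations based on such angle lists reappear in Chapter 2, where dihedral angles are one way off-lattice models encode structure.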

Protein Structure Hierarchy

Conventionally, protein structure is described at four levels: primary, secondary, tertiary, and quaternary structure.

• Primary: the ordered sequence of amino acid residues. Formally, it can be modeled as a string over a finite alphabet Σ where |Σ| = 20 (there are 20 amino acids). Protein sequences differ in length from 30 to over 30,000 amino acids, but most are a few hundred residues long.

• Secondary: the local arrangement of amino acids in a short range of the protein chain. Only main chain atoms are involved in secondary structure. There are two main secondary structure patterns: the α-helix (H) and the β-sheet (E). They may be connected by loop regions or coils (C). An α-helix is a tightly coiled, rod-like structure. It is built up from one continuous region in the protein sequence through the formation of hydrogen bonds between the C=O group of the residue in position i and the NH group of residue i + 4. A β-sheet is formed by two or more β-strands hydrogen bonded side by side. A β-strand is just a fragment of consecutive residues, but the different β-strands forming a pleated β-sheet are usually distant in sequence. Coils have no fixed regular shape. At a slightly higher level, we can define motifs or supersecondary structures, which are commonly found secondary-structure arrangements such as helix-loop-helix. Every amino acid in the sequence belongs to one of the three structural types (sometimes a finer classification of eight structural types is used), thus protein secondary structure can be flattened and represented by a string over the alphabet {H, E, C} with the same length as the primary structure. Take an example from [59]; a fragment of protein primary and secondary structure is as follows:

...PYELAMSPTIMCKDNWMALEMLT...  ← Primary structure
...CCHHHHCEEEEEEEEHHHHHCCC...
← Secondary structure

• Tertiary: the three-dimensional conformation that results from secondary structures folding together. Interactions of amino acid side chains are the predominant drivers of tertiary structure [2]. The tertiary structure into which a protein naturally folds is also known as its native structure. Normally, the interior of a (folded) protein molecule tends to

be hydrophobic, while the exterior is largely composed of hydrophilic residues, which are able to bond with water molecules. This gives a protein greater water solubility.

• Quaternary: results from the interactions of multiple independent polypeptide chains. This level of structure will not be discussed in this report.

Experimental Methods

Protein structures are determined by two main experimental methods: X-ray crystallography and nuclear magnetic resonance (NMR). In X-ray crystallography, the target protein must first be isolated and highly purified. Then a series of procedures is required to grow a crystal, which is then exposed to X-rays. From the recorded diffraction pattern, the 3-d structure can be solved. X-ray crystallography depends on successfully obtaining protein crystals, which is sometimes a major difficulty (some proteins do not crystallize). It often takes months to solve even a single protein structure by X-ray methods, although recently this process has been sped up by high-throughput techniques. Another drawback of X-ray crystallography is that the crystallization process may cause the protein to assume a structure other than its native conformation. In the second method, NMR, the 3-d structure is constructed from pairwise distances estimated by exciting nuclei and measuring the coupling effects on their neighboring nuclei. Generally, NMR has been successful only for small proteins, and its resolution is poorer than that of X-ray crystallography. Protein tertiary structures solved by X-ray crystallography or NMR are deposited in the Protein Data Bank [98] and are used to evaluate how accurate computer prediction models are. But it should be noted that both X-ray and NMR are indirect methods with their own limitations. Protein structures solved by them may not represent the native, active conformation of the protein.
For the time being, they are the best data available for testing computer models. In the future, however, with more understanding of proteins' native conformations, the standards by which predicted models are judged may change.

Algorithmic Processing of Evolution

Homology is an important concept in protein structure prediction. It is defined as similarity in structure, physiology, development and evolution of organisms based upon common genetic factors [7]. Evolution at the molecular level is commonly modeled as a process in which currently observed sequences have diverged from a common ancestor sequence. This process involves events such as mutations, deletions, and insertions of amino acids in a sequence, and selection of those variants having environmental advantages. In general, 3-d structures, and hence functions, are more conserved than sequences [14, 73]. Usually, two proteins are considered homologous when they have identical amino acid residues in a significant number of positions, resulting in similar structures; i.e., the essential fold of the two proteins is identical, while details such as additional loop regions may vary. However, it is frequently found that two proteins with low sequence identity can also have similar structures. Sequence similarity can be detected by optimal alignment algorithms, which usually employ dynamic programming techniques [55, 83]. If a pair-wise alignment shows sequence identity above some threshold, it is generally assumed that the two sequences have diverged from the same ancestor, and therefore they are likely to share a similar structure. If it is below the threshold, there are two possibilities: either the two proteins have diverged from the same ancestor (but their sequences are too divergent for their homology to be detectable), or the two proteins are unrelated. There are also multiple alignment algorithms for comparing multiple sequences. The evolutionary information in a multiple alignment of N sequences and L positions can be expressed using a profile, a 20 × L Position-Specific Scoring Matrix (PSSM) that lists the frequencies of each amino acid in each position.
This evolutionary information is often exploited when designing computational tools. Suppose we want to predict the structure of a protein sequence s. Besides exploiting the information contained in s directly, if we can find a set of sequences similar to s, this set can be regarded as containing more structural information than s itself. The success of the most effective predictive systems is largely based upon this empirical argument and on their ability to process the information provided by multiple alignments of similar sequences.
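The 20-row-by-L-column profile idea can be sketched in a few lines. The following builds a toy frequency profile from a hypothetical gapless alignment (the sequences and function name are made up for illustration); real PSSMs additionally use log-odds scores against background amino acid frequencies:

```python
from collections import Counter

# One-letter amino acid alphabet (20 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# A tiny, hypothetical gapless multiple alignment (3 sequences, L = 4).
alignment = ["MKVL", "MKIL", "MRVL"]

def frequency_profile(seqs):
    """Return, for each column, the fraction of each amino acid there."""
    length = len(seqs[0])
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        profile.append({aa: counts[aa] / len(seqs) for aa in ALPHABET})
    return profile

profile = frequency_profile(alignment)
print(round(profile[1]["K"], 2))  # fraction of K in column 2: prints 0.67
```

Converting such frequencies into log-odds scores against background frequencies is what turns this table into a scoring matrix usable by tools like PSI-BLAST.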

Protein Structure Databases

The development of computational tools is undoubtedly crucial in increasing prediction accuracy. Another very important factor is the growth of protein sequence and structure databases. Except for some ab initio approaches, most structure prediction methods depend on detecting homologies with structures already in the databases. Thus, the more protein structures deposited in databases, the more likely we can predict a novel protein structure accurately. In a broader sense, sizable databases of sequences and structures provide the raw data of evolution. It is the use of evolutionary information, and the finding of patterns within it, that has pushed forward the field of bioinformatics and, subsequently, the sub-field of protein structure prediction. The most important protein structure database is the Protein Data Bank (PDB) [98]. It has existed for three decades and is the primary database containing all experimentally determined biological macro-molecular structures, mainly proteins. The PDB is updated frequently; as of December 2006, about 37,300 protein structures had been deposited in it. The availability of this large quantity of protein structures allows many analytical studies to be carried out. There are also several structure classification databases derived from the PDB, two of which are SCOP [100] and CATH [96]. Both are hierarchical databases of protein structure. SCOP (the Structural Classification Of Proteins) divides the world of protein structures to reflect both structural and evolutionary relatedness. The major levels in its hierarchy are family, superfamily and fold. For example, proteins placed in the same family are clearly evolutionarily related, while proteins in the same fold category may not have a common evolutionary origin but share some structural similarity. CATH clusters proteins at four major levels: Class, Architecture, Topology and Homologous superfamily.
It provides a slightly different view on clustering structures. Both databases are widely used in structure prediction, in particular in the fold recognition approach. In addition to structure databases, protein sequence databases, e.g. GenBank and SwissProt, are also important in structure prediction. With the recent enormous growth of these databases, powerful sequence alignment tools such as PSI-BLAST can detect extremely remote homologous relationships between proteins. The evolutionary information detected is valuable in structure prediction.

Evaluation of Prediction Methods

A large number of approaches or methods have been applied to the PSP problem, so there is a need to evaluate the effectiveness of different prediction methods. In this section, we briefly introduce three world-wide experimental competitions in the protein structure prediction field.

CASP

Critical Assessment of Structure Prediction (CASP) is a world-wide protein structure prediction contest initiated in 1994. It has been held every two years since then, and the most recent one was CASP6 in December 2004. During each prediction season, CASP provides participants with the amino acid sequences of proteins whose structures are close to being determined experimentally but not yet released to the public. The participants then work on blind prediction of the structures of these target proteins and submit structure models generated by computer programs (often the models are produced by a combination of computer programs and human intervention). CASP assessors then compare the predicted models with the experimentally determined structures. Each CASP contest concludes with a meeting to discuss the results. Work in protein structure prediction is very complex and computationally intensive. CASP provides the PSP research community with an assessment of the various approaches and a critical review of the field. With the growth of the field, the number of participants and the extent of prediction have greatly increased. In the first CASP contest, 35 research groups submitted 100 predictions for 33 protein targets, while ten years later, in CASP6, 230 groups submitted more than 41,000 predictions for the 76 targets [93]. Figure 1.4 shows two prediction results for protein TM0919: a good prediction and a not-so-accurate prediction. Despite the enormous value of the CASP experiments, they do have some limitations.
In [28], some limitations are discussed: the assessment is carried out by humans and thus bears the issue of subjectivity; the number of targets is relatively small, so the results may not always be significant; the assessments cover only proteins determined in a period of about four months every two years; and users cannot always reproduce CASP predictions, because the computer programs or the required human expertise are often not available.

CAFASP

In contrast to CASP, in which human intervention is allowed in the prediction process,

Figure 1.4: Crystal structure of TM0919, one of the 76 target proteins of CASP6. (b) Comparison of a successful prediction (red) for TM0919 with the crystal structure. (c) Comparison of a less successful prediction. (The image was taken from [93].)

Critical Assessment of Fully Automated Structure Prediction (CAFASP) aims to assess the performance of fully automatic structure prediction servers. Thus, what is measured is the capability of the computer program itself, rather than the capability of prediction groups aided by the programs. This is of significance to biologists who simply want to choose a better prediction tool. The benefits of an assessment of fully automated methods are listed in [50]. First, nonspecialist users can choose the best method to use on their prediction targets. Second, users can evaluate and better interpret the results they obtain from the various prediction programs. And last, fully automated predictions are reproducible, unlike the cases where human intervention is part of the model-building process. The CAFASP results demonstrated that although in most cases human intervention resulted in better predictions, several programs could already independently produce reasonable models.

LiveBench

Like CAFASP, LiveBench evaluates automatic servers only, but it is carried out in a continuous fashion and uses a larger number of prediction targets. Each week the Protein Data Bank is checked for new entries. Proteins with low sequence similarity to other proteins of known structure are chosen as prediction targets and are immediately submitted via the Internet to the participating servers. After a few months, a large collection

of prediction targets is thus obtained, and the predicted models can be evaluated.

Chapter 2 Problem Overview

In this chapter, we discuss several issues that will help in understanding the problem domain.

2.1 The Significance

Knowledge about the structure of a protein is essential in understanding its biological function. It helps us to understand substrate and ligand binding, devise intelligent protein engineering experiments with improved specificity and stability, perform structure-based drug design, and design novel proteins. Thus, being able to predict the 3-d structure of a protein from its amino acid sequence would greatly benefit molecular biology research. It would provide educated guesses about the function of newly discovered proteins without the time and cost required to perform X-ray crystallography and NMR. Indeed, if structure prediction were good enough, it might remove the need for lab experiments altogether. At the least, in many situations, even a crude or approximate model can greatly help experimental determination of protein structure. Thus, even though most current approaches cannot yet produce accurate results, prediction of structures is of great value. Structure prediction is also important for the progress of protein engineering, as it would enable changes to be made in the amino acid sequence with some expectation of how the change will affect the structure. On the other hand, the study of the protein structure prediction problem drives the development of computing techniques; e.g., the problem on simplified models is a good test problem for developing and evaluating evolutionary algorithms.

2.2 The Challenges

Protein structure prediction is a very difficult problem; we have not even come close to solving it. In [79], David Searls outlined some major challenges around the problem: The physical basis of protein structural stability is not fully understood. Although Anfinsen [1] experimentally showed that the primary sequence plus thermodynamic principles should suffice to completely account for the native structure of a protein, what exactly those principles are and the best way to apply them are still not certain. The search space of the problem is huge because of the vast range of possible conformations of even relatively short polypeptides. The primary sequence may not fully specify the tertiary structure: "There are no rules without exception in biology." To illustrate the second challenge, take a small protein of 100 amino acids as an example. Even with a very modest estimate of three possible structural arrangements per amino acid, the total number of conformations for this small protein is 3^100 ≈ 5 × 10^47, a number far beyond the computing capability of modern computers. While the first and third challenges need to be addressed primarily by the biophysicists and biochemists who study and model protein folding processes, the second one is a rich source of interesting and challenging computational problems in the AI field: e.g., what are some intelligent ways to explore the conformation space? Why is Nature so efficient and accurate with respect to protein folding, and what can we learn from it?

2.3 Representation of Protein Structure

When we approach this computational problem, probably the first task is to represent protein structure in the problem space. Because protein structure can be specified at different levels of the hierarchy, and each level may be viewed at different levels of detail, there are various ways of representing protein structures.
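To make the size of the search space concrete, the arithmetic can be reproduced directly (assuming, as the text does, a deliberately modest three arrangements per residue):

```python
# Back-of-the-envelope size of the conformational space for a small
# protein, under the text's modest assumption of 3 possible structural
# arrangements per residue.
n_residues = 100
states_per_residue = 3

conformations = states_per_residue ** n_residues
print(f"about {conformations:.2e} conformations")
# about 5.15e+47 conformations
```

Even enumerating a billion conformations per second, a machine would need on the order of 10^31 years to cover this space, which is why exhaustive search is a non-starter and heuristic search techniques are needed.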
(Footnote: Despite some newer discoveries, e.g. chaperones, a special type of protein whose function is to assist other proteins in achieving proper folding, Anfinsen's argument largely remains valid in the PSP community and is the fundamental principle underlying all prediction methods.)

Meanwhile, due to the complexity of the

problem, in practice, further simplified or restrained models are often used to accommodate limited computing resources. Roughly, we can divide protein structure representations into two categories: the all-atom model and simplified models. Choosing a suitable representation not only makes the problem space explicit but may help to find solutions more efficiently and effectively.

All-atom Model

In the Protein Data Bank [98], protein structures are represented by lists of 3-d coordinates of all atoms in a protein. Although an accurate all-atom model is desirable for structure prediction, it incurs too large a computational overhead even for very small proteins. Besides, it is difficult to identify similar sub-structures across different proteins using an all-atom coordinate representation; consequently, it is difficult to carry out generalization and abstraction. Thus, for the PSP problem, various simplified models and representations are used.

Simplified Models

Since the all-atom model is not feasible, at least currently, it is attractive to explore simplified structure models to see if they are good enough to at least allow approximate solutions, which are useful either directly or as initial models for further improvement. Simplified models range from very abstract models, such as the HP lattice model (see 2.3.3), to almost realistic models in which proteins are represented by a geometric description of the main-chain atoms and a rotamer library of side chains. Roughly, simplified models can be classified into lattice models and off-lattice models. Lattice models adopt a lattice environment, a grid in which structural elements are positioned only at grid intersections, whereas off-lattice models position structural elements in a continuous space. Lattice models make two simplifications.
Each amino acid is modeled as a single "bead", without considering the different atoms within the amino acid; and the beads are restricted to a rigid lattice, rather than being able to take any position in space. In a legal conformation of a lattice model, each residue occupies one vertex of the lattice, and residues that are adjacent in the sequence must be adjacent in the lattice; a legal conformation is therefore a self-avoiding path on the lattice. These lattices may be two-dimensional, e.g. square or triangular, or three-dimensional, e.g. cubic or diamond.
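The self-avoiding-path constraint is easy to state in code. The sketch below (the helper names are our own, not from any published implementation) decodes a relative-move string on the 2-d square lattice and checks legality:

```python
# Sketch: decode a relative-move string into lattice coordinates and check
# that the conformation is a legal self-avoiding path on the 2-d square lattice.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def decode_conformation(moves):
    """Return the list of lattice vertices visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def is_self_avoiding(path):
    """A legal conformation occupies each lattice vertex at most once."""
    return len(set(path)) == len(path)

print(is_self_avoiding(decode_conformation("RRUL")))  # True: no vertex reused
print(is_self_avoiding(decode_conformation("RULD")))  # False: returns to origin
```

Checking candidate conformations for legality in this way is a basic building block of the search techniques discussed later.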

There is debate about the 'physical reality' of a lattice protein. For example, reference [35] addressed this issue and suggested that simplified lattice models do not contain the biological information necessary to solve the protein structure prediction problem. Other researchers, for example N. Krasnogor in [48], think that simple lattice models can capture many global aspects of protein structures and that, besides their inexpensiveness to use, they make it possible to design test problems for which the best conformational structure is known (for small protein sequences).

Off-lattice models represent protein structure in various ways. Depending on the level of detail at which each model represents the polypeptide chain composition, there are models considering:
- individual residues, often represented by the central Cα atoms;
- all backbone atoms;
- backbone atoms and side-chain centroids;
- all heavy atoms.

Depending on how the positions of structure units are represented, there are models:
- using dihedral angles;
- directly addressing coordinates, either absolute or relative;
- using a distance matrix.²

These different representations carry more or less information about the protein structure. Some structure prediction approaches use multiple representations and move among them for different purposes.

2.3.3 HP Lattice Model

The 2-d HP lattice model is perhaps the simplest lattice protein model. It was proposed by Dill [25] and is widely studied for ab initio prediction. This is the main model we use in Chapter four, where we discuss evolutionary algorithms and Lindenmayer generative encoding systems. In this model, the 20 amino acids are classified into only two classes: hydrophobic (H) and hydrophilic (P), according to their interaction with water molecules. Thus a protein sequence s is reduced to s ∈ {H, P}+.
² For a protein sequence of n residues, the corresponding distance matrix contains n × n entries, each representing the distance between the Cα atoms of a pair of residues.

In addition, the sequence is assumed to be embedded

in a certain 2-d lattice. The free energy of a conformation is inversely proportional to the number of H-H contacts. An H-H contact refers to a hydrophobic non-local bond; it occurs when two H-residues occupy adjacent vertices in the lattice but are not consecutive in the sequence. Thus, the more H-H contacts there are, the lower the free energy of the conformation. Forming the lowest-energy conformation results in the H-residues forming a hydrophobic core surrounded by the P-residues that interface with the environment. This concept is normally quantified by assigning a value e = -1 to every H-H contact and trying to maximize the total number of H-H contacts. The following figure shows examples of two HP lattice models. H-residues are represented by dark circles and P-residues by white circles; H-H contacts are highlighted (only in (b)) with curved lines.

Figure 2.1: HP models in (a) square lattice and (b) triangular lattice.

The embedding of an HP sequence in a lattice may be represented in two ways: the location of each residue on the lattice is specified either independently, or relative to the previous residue. In the latter case, the structure is specified as a sequence of moves (e.g. up, down, left, right) taken on the lattice from one residue to the next. Although the degrees of freedom, and thus the amount of computation, are greatly reduced for the HP lattice model, it has been shown that the PSP problem for the HP model is NP-hard on both the 2-d square lattice [19] and the 3-d cubic lattice [6]. This justifies the use of intelligent search techniques, e.g. evolutionary algorithms, to tackle this problem.

2.4 Potential Energy Functions

During the prediction process, we need energy functions that indicate which conformations of a protein are better or worse. Clearly, energy functions are very important to the prediction result. A poorly defined energy function may render an energy

hyper-surface that has little correlation with a protein's true conformation. An energy function is needed in almost all computational approaches. A wide variety of energy functions have been used in protein structure prediction, ranging from the very simple hydrophobic potential of the HP lattice protein to energy models based on more detailed molecular mechanics, such as the CHARMM (Chemistry at HARvard Macromolecular Mechanics) package [95]. Current energy functions can be roughly classified into three categories: physical potential functions, mean force potentials, and simplified potentials.

Physical potential functions take into account the bonded and non-bonded potentials between atoms, such as torsion (bonded) and electrostatics (non-bonded), and typically have the form

E(R) = E_bonded(R) + E_non-bonded(R),

where R is the vector representing the conformation of the protein, typically in Cartesian coordinates or torsion angles. A popular example in this category is CHARMM; the v.27 CHARMM energy function adds up seven energy terms [95].

Mean force potentials are derived from databases of known protein structures. They can be based on statistics of the frequencies of contacts between amino acids or, at a finer level, between functional groups. For example, the amino acid pair R and D is frequently found to occur a short distance apart relative to random expectation, which indicates that such an interaction is favorable. Mean force potentials are quite successful in the fold recognition approach, but generally they are not accurate enough, due to their crude representation and their statistical nature.

Simplified empirical energy functions are often tied to simplified protein models, e.g. the hydrophobic potential of the simple HP lattice protein. In these potentials, the emphasis is on computational efficiency and ease of use rather than accuracy.

The potential energy function is a very important factor in the accuracy of structure prediction.
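The simplest of these, the hydrophobic contact potential of the HP lattice model (Section 2.3.3), can be written down in a few lines. The sketch below is a minimal illustration with helper names of our own:

```python
# Toy HP-model energy: e = -1 for every pair of non-consecutive H residues
# occupying adjacent vertices of the 2-d square lattice.
def hp_energy(sequence, path):
    """sequence: string over {'H','P'}; path: one lattice vertex per residue."""
    index = {vertex: i for i, vertex in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for neighbor in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = index.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

# A 4-residue chain folded into a unit square: H(0,0) P(1,0) P(1,1) H(0,1).
# The two H residues are lattice-adjacent but not consecutive: one H-H contact.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```

Because each adjacent pair of vertices differs by +1 in exactly one coordinate, checking only the "right" and "up" neighbors counts every contact exactly once.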
An energy function must be sufficiently close to the right potential for the native state; otherwise, the lowest energy state will not correlate with the native conformation. The development of energy functions is a very active research area, and new models are frequently published and tested. Often these new functions combine atomic forces with statistical properties taken from observed protein structures. Currently, potential functions are still not accurate enough.

2.5 Measure of Prediction Accuracy

How do we measure the accuracy of a predicted result, assuming we know the real native structure of the protein? In the literature, the most popular metric is the 'root mean square deviation' (RMSD). It measures the average distance between corresponding atoms after the predicted and the real structures have been optimally superimposed on each other. This distance is usually measured in Ångström (Å), the unit of length equal to 10^-10 m. RMSD is given by the formula

RMSD(a, b) = sqrt( (1/n) Σᵢ ‖r_ai − r_bi‖² ),

where r_ai and r_bi are the positions of atom i of structures a and b, respectively, and n is the number of atoms compared. In general, a prediction with an RMSD of about 6 Å is considered non-random but not useful, RMSDs of 4-6 Å are meaningful but not accurate, and RMSDs below 4 Å are considered good [33]. Of course, the required accuracy also depends on the purpose of the prediction. For example, identifying the overall fold in order to understand the function of a given protein requires less precision than designing an inhibitor for a protein.

RMSD is widely used for structure comparison. The major problem with this metric is that the two structures have to be appropriately superimposed. Finding the best superposition is itself a hard problem, and the best alignment does not always yield the minimal RMSD. When all equivalent parts of the proteins cannot be simultaneously superimposed, RMSD is not a good measure. Another problem with the RMSD metric is that the significance of an RMSD value depends on the size and type of the protein. A further metric for measuring the accuracy of a predicted structure is the Distance Matrix Error (DME), but we do not discuss it here.

2.6 Related Problems

The field of protein structure prediction has grown and diversified greatly since the first attempts. Initially, researchers focused on understanding physical and chemical principles and using these principles to simulate the folding process and so obtain the protein structure.
While this has not yet yielded a solution, with more and more experimental data available, researchers have tried to derive empirical rules from the data and to predict new protein structures accordingly. On the other hand, because protein structure can be viewed at different levels, and different types

of proteins possess different structural features, the general problem of structure prediction may be simplified or varied to address different prediction tasks. Here I briefly introduce three closely related problems, which may help in understanding the PSP problem.

• Protein folding

The PSP problem attempts to predict the native structure of a protein given its primary structure, while the protein folding problem consists in predicting the folding process, or the pathways taken to reach the native structure. Both problems explore protein structure, but with complementary aims. Studies of protein folding are mainly concerned with fundamental physicochemical principles and less concerned with producing accurate 3-d structure models. A solution to the protein folding problem would provide a solution to the PSP problem, but knowing the final structure does not solve the folding problem; in this sense, the protein folding problem is more complex than the PSP problem. Progress on the protein folding problem definitely helps the PSP problem, because a better understanding of the physicochemical principles of protein folding will help in developing more appropriate energy functions for PSP.

When talking about protein folding, we cannot ignore a distributed computing project: folding@home [97]. It was launched on October 1, 2000, and is managed by the Pande Group at Stanford University. It is designed to perform the intensive computations of protein folding simulation. As of February 2006, more than 210,000 CPUs world-wide were actively participating, with a total of over 1,600,000 CPUs registered with the project.

• Secondary structure prediction

Predicting protein secondary structure is sometimes considered a sub-problem of PSP, although it can stand on its own. The term 'protein structure prediction' in early research (back in the 80s) often actually referred to secondary structure prediction.³
Given a protein sequence, if the secondary structure is known, the 3-d structure problem becomes one of arranging the known secondary structure elements into the correct 3-d structure.

³ In old literature, sometimes the two terms are not distinguished.

Other uses of secondary structure prediction include fold recognition, genome annotation, and predicting regions of a protein that are likely to undergo

structural changes. In the next chapter, we will address this very important task using ANNs.

• Protein design problem

This problem is to identify the amino acid sequences that fold into a given native conformation; thus, it can be considered the inverse problem of PSP. Unlike PSP, which has only one desired solution (the native structure), the inverse problem is likely to have many solutions, because it has been recognized that different protein sequences may fold into very similar structures. For example, it was reported in [49] that two non-homologous proteins, the third domain of ovomucoid and the C-terminal fragment of the ribosomal L7/L12 protein, have very similar structures while possessing completely different sequences. The protein design problem on simplified lattice models has also been shown to be NP-hard [63]. This problem is attracting active research, and researchers are asking whether the (partial) success of the various AI techniques that have been applied to the PSP problem can be replicated in the inverse problem.

Chapter 3

Prediction Approaches Overview

Many computational techniques have been employed for the PSP problem, to name a few: artificial neural networks, evolutionary computation, and Monte Carlo search techniques. To see the big picture of where and how these individual techniques are applied in the landscape of protein structure prediction, it is useful to introduce the two main categories of approaches, namely knowledge-based prediction and ab initio prediction.

Knowledge-based approaches rely on the existence and detection of a homologous protein with known structure that serves as a template to model the target protein structure. Overall, it is estimated that knowledge-based approaches can be applied to less than half of novel proteins [74]. In many cases, given a novel protein sequence, there is no homologous protein with known structure available in existing databases. Its structure then has to be modeled ab initio, meaning that we have to make a direct prediction based on the sequence alone, plus known physico-chemical principles. Ab initio structure prediction is arguably more useful than knowledge-based prediction because it can be applied more generally, but currently ab initio prediction is very difficult and less successful.

Both knowledge-based and ab initio approaches try to predict a 3-d model of protein structure directly, although sometimes only in simplified models. A third category of approaches to protein structure prediction focuses on predicting intermediate structures or values of structural features, such as secondary structure, residue distances, or contact maps, which are important information for aiding 3-d prediction. Compared with a full 3-d model, predicting these 1-d or 2-d features is more tractable, and various AI techniques have been applied with good results.

This chapter will only cover the general ideas of how the above-mentioned approaches work

to find protein native structures. It is organized by the different categories of approaches. Analysis of selected AI techniques involved in these approaches will follow in the next chapter. PSP is a complex problem, and the classification of approaches is not itself the focus of this chapter; others might reasonably classify some approaches differently, as many overlap or share characteristics. Our intent is to structure the presentation to give the big picture for the individual techniques discussed in Chapter four.

3.1 Knowledge-based Prediction

Homology (comparative) modeling and fold recognition (threading) are the two major knowledge-based approaches. In these approaches, we do not have to care about the folding mechanics of a protein; we make use of the large amount of available sequence and structure data, comparing, analyzing, and inferring from it. This is an example of a scientific problem that can be (partially) solved in practice without first obtaining a complete understanding of the protein folding process as it occurs in nature. The major difference between comparative modeling and threading lies in whether a homologue of the target protein can be found through mere sequence alignments. If sequence comparison cannot find the template, the threading approach has to be tried.

3.1.1 Homology (Comparative) Modeling

Based on the principle that significant sequence similarity implies similarity in 3-d structure, homology modeling first identifies a protein evolutionarily related to the target protein through sequence alignment, then builds the 3-d model of the target protein using the known structure of the related protein as a template. The basic assumption of homology modeling is that the target and the template have identical backbones; the task is then to correctly place the side chains of the target and to build the loop regions.
To build side chains, molecular dynamics simulations or other techniques can be applied. In more detail, homology modeling comprises the following four steps.

1. Select the template. This is facilitated by searching databases through programs like BLAST, FASTA, etc. If no such template exists, homology modeling is not applicable and other approaches need to be used.

2. Construct a sequence alignment of the target protein and the template protein. The aim of this step is to match each residue in the target sequence to its corresponding residue in the template structure, allowing for insertions and deletions.

3. Build the model based on the target-template alignment. When the sequence alignment is good, use the template structure directly as the target structure, replacing the side chains of the residues that differ; a subsequent optimization step then takes care of the side-chain interactions. When the target-template sequence similarity is low, first build the backbone, then place the side chains, and finally optimize the entire structure. Some of these techniques need a large amount of computational time and user expertise.

4. Refine the model. Additional adjustments may be needed. Various methods exist for this optimization stage, such as packing and energy calculations.

The accuracy of homology modeling clearly depends on the degree of target-template sequence identity. With high levels of identity (70%), homology-derived models can be as accurate as experimentally derived ones. But if the identity is only about 30% or less, the model built on the alignment would probably be completely wrong. So far, comparative modeling is still the most accurate approach to solving PSP, but it is limited by the absolute need for a related template structure.

3.1.2 Fold Recognition (Threading)

If a highly similar sequence with known structure cannot be found, a new protein may still be structurally similar to some protein with known structure; in this case, the two proteins are said to be remote homologues. Fold recognition aims at identifying the remote homologue from a collection of candidate folds. If such a fold template exists, threading is used to produce a sequence-structure alignment between the target sequence and the template structure, rather than the mere sequence alignment used in homology modeling.
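As a toy illustration of the sequence-identity thresholds that govern the choice between homology modeling and threading (the sequences, and the helper itself, are invented for illustration; real pipelines use BLAST or FASTA alignments):

```python
# Percent identity between two aligned sequences of equal length;
# gaps '-' count as positions but never as matches.
def percent_identity(seq_a, seq_b):
    assert len(seq_a) == len(seq_b)
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * matches / len(seq_a)

target = "MKTAYIAKQR"    # hypothetical aligned target sequence
template = "MKTA-IAKHR"  # hypothetical aligned template sequence
print(percent_identity(target, template))  # 80.0
```

At 80% identity a homology-derived model could be expected to be accurate; near or below about 30%, threading against a fold library would be the appropriate fallback.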
In actual operation, the two tasks are usually handled together: given a collection of potential fold templates, the query sequence is threaded onto each known structure template. This is followed by an assessment of how well the query sequence fits each structure template (sequence-structure compatibility), using some scoring function. Threading can be no-gap or gapped, where gapped threading allows gaps in the match of the sequence to

fold. The scoring function can be based either on amino acid structural propensities [32] or on mean-force (statistical) potentials [42]. To speed up the process, other techniques have also been proposed. In [43], profile-based sequence alignments are used to align the query sequence and the sequence of the candidate template; feed-forward neural networks are then used to score the structural similarity of the two proteins. Kernel methods have also been applied to detect remote homology, with good results [43].

The various fold recognition methods generally share four components:
• A library of possible structural templates.
• A scoring function that distinguishes better threadings from worse ones.
• An efficient algorithm that searches all possible alignments of the target sequence with every possible fold in the library. Computing the optimal gapped alignment is an NP-complete problem if the scoring function takes pair interactions into consideration; in such cases, approximations or heuristics need to be used.
• A method to assess significance when selecting the best template candidate.

The success of this approach also depends on the degree of similarity between the known and modeled structures.

3.2 Ab initio Prediction

Ab initio prediction approaches are those that do not rely on known 3-d structures; rather, they are based on Anfinsen's "thermodynamic hypothesis" [1], which asserts that the native structure of a protein corresponds to its minimum free energy state. Accordingly, many ab initio prediction methods are formulated as optimization problems and are computationally intensive. If this category of methods works, it can identify not only the in vivo structures of natural proteins but also the structures of arbitrary polypeptides in arbitrary environments. Ab initio prediction is therefore significant not only for new proteins that cannot be modeled with knowledge-based methods, but also for drug design.
However, compared with knowledge-based methods, ab initio prediction is less successful, and the models produced are not very useful yet, being limited to short proteins and coarse models. Within the ab initio prediction category, there are roughly four major approaches: dynamic modeling, energy minimization, specific protein structure prediction, and other approaches.

In this section we briefly discuss dynamic modeling and energy minimization. Specific protein structure prediction refers to structure prediction for some specific types of proteins, e.g. transmembrane proteins, and generally needs more specific domain knowledge. "Other approaches" refers to hybrid approaches or those that are hard to classify. Many of them achieve good prediction results, e.g. the building block approach [88] and the Rosetta program, but like specific protein structure prediction they lack generality; we do not cover them here. Most research on ab initio approaches focuses on improving the energy function and the search techniques to achieve faster or more accurate prediction; examples can be found in [85, 57]. More about energy functions and search techniques will be discussed in Chapter four.

3.2.1 Dynamic Modeling

Dynamic modeling uses molecular dynamics (MD) simulation to obtain the protein native structure. Assuming our description of all forces at the atomic level is accurate, given any conformational state of a protein system, we should be able to calculate the forces that the atoms in the system exert on each other and where each atom is moving. Following the trajectory of the system, eventually the system will come to rest at its lowest energy state, which corresponds to the native conformation of the protein. However, there are two problems with this approach. First, we do not have an accurate description of all forces at the atomic level; approximate models are available, e.g. empirical potentials or quantum-mechanical formulas, but they are not accurate enough. Second, dynamic modeling often runs up against the limits of computational power. In the dynamic system, while one atom moves under the influence of all the other atoms, the other atoms are also in motion; in other words, the force fields are constantly changing.
Thus, we need to constantly recalculate the forces between each pair of atoms, and their positions, at very small timesteps. In principle, this requires n² calculations per timestep, where n is the number of atoms in the protein and its surrounding environment. Because the timestep must be chosen small enough to avoid discretization errors (usually on the order of 10^-14 s, which is the same timescale as bond formation) and the number of timesteps, and thus the simulated time, must be large enough to capture the effect, the calculation becomes huge. In fact, the need to recalculate the forces is the main bottleneck of this method. So far, we can only simulate a very short span of this dynamic process, on the order of nanoseconds,

which is far from enough, because proteins fold on much longer timescales [82]. Normally, dynamic modeling simulations require a full atomic description of the protein and a detailed energy function.

3.2.2 Energy Minimization

Because it is believed that the native state of a protein corresponds to its minimum free energy state, if we can find the minimum energy state on the energy landscape, we can obtain the native conformation. The energy landscape of a protein is the variation of its free energy as a function of its conformation, owing to the interactions between the amino acid residues. As shown in Figure 3.1, this energy landscape usually has a funneled shape which leads towards the native state. For a realistic-sized protein, the energy landscape is very complicated, because it has many parameters and an enormous number of local minima.

Figure 3.1: A hypothetical energy landscape exhibiting a folding funnel

In general, energy minimization approaches comprise the following three components [58]:
- a representation of the protein geometry;
- a potential energy function that can distinguish between favorable and non-favorable structures;
- a search technique to explore the conformational space.

In each of these components, large approximations are required because of the complexity of the problem; different computational approaches differ in which simplifications are made. A brief discussion of each component follows. More details about protein representation and energy functions can be found in Chapters two and four.
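To make the cost of the energy-function component concrete: even the simplest pairwise non-bonded energy term scales quadratically with the number of atoms. A toy Lennard-Jones sum (the parameters are invented for illustration, not real force-field values):

```python
import math

# Toy Lennard-Jones potential summed over all atom pairs: O(n^2) work per
# evaluation, which is why all-atom energies are expensive inside a search loop.
EPSILON, SIGMA = 1.0, 1.0  # illustrative parameters only

def lj_energy(coords):
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])  # Python 3.8+
            energy += 4 * EPSILON * ((SIGMA / r) ** 12 - (SIGMA / r) ** 6)
    return energy

# Two atoms at the LJ minimum distance r = 2^(1/6) * sigma give energy -epsilon.
print(round(lj_energy([(0.0, 0.0, 0.0), (2 ** (1 / 6), 0.0, 0.0)]), 6))  # -1.0
```

The same quadratic blow-up is what makes the force recalculation the bottleneck of dynamic modeling above.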

For protein representation, because an all-atom model of a protein is computationally expensive, a simplified protein representation is often adopted. Simplifications include methods using one or a few atoms per residue, as well as lattice representations of proteins. Computational analyses of the PSP problem have shown that it is intractable even on the simplest 2-d HP lattice models [6]. Simplified models cannot actually give 3-d structure predictions of proteins, but they are inexpensive to use while capturing many global aspects of protein structures; thus, current research in the energy minimization approach mainly focuses on simplified models.

Formulating a good energy function is always important, yet difficult. Approximate energy functions include atom-based potentials from molecular mechanics packages such as CHARMM [51] or AMBER [17], statistical potentials of mean force derived from many known protein structures, and simplified potentials based on chemical intuition.

Given an energy function, many intelligent search techniques have been applied to improve the sampling and the convergence of the search, such as Monte Carlo methods, simulated annealing, and evolutionary computation. Take the Monte Carlo method as an example. To minimize a given energy function, take a small conformational step and calculate the free energy of the new conformation. If the free energy is reduced compared to the old conformation (i.e. a downhill move), the new conformation is accepted and the search continues from there. If the free energy increases, a nondeterministic decision is made: the new conformation is accepted only if the Metropolis test is positive. These search methods are sometimes coupled with the use of other structural information, or with multiple processors, to achieve better results.
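The Metropolis step just described can be sketched generically; in the sketch below the move generator and energy function are placeholders, not a real conformational sampler:

```python
import math
import random

def metropolis_accept(delta_e, temperature):
    """Accept downhill moves always; accept uphill moves with
    probability exp(-delta_e / T) (the Metropolis criterion)."""
    if delta_e <= 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)

def monte_carlo_minimize(initial, propose_move, energy, steps=1000, temperature=1.0):
    """Generic Monte Carlo search: propose a small step, keep it if the
    Metropolis test passes, and remember the best state seen."""
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = propose_move(current)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best

# Toy usage: minimize x^2 with random unit steps.
best, e = monte_carlo_minimize(5.0,
                               lambda x: x + random.choice((-1.0, 1.0)),
                               lambda x: x * x,
                               steps=2000, temperature=0.1)
print(e <= 25.0)  # True: never worse than the starting energy
```

In a real PSP setting, `propose_move` would perturb a conformation (e.g. change one move in an HP move string) and `energy` would be one of the potential functions of Section 2.4.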
For example, an interesting approach [81] uses a Monte Carlo optimization of a statistical energy function to assemble the whole protein model from relatively short building blocks; these candidate blocks are obtained from known protein structures using energetic, geometric, or sequence-similarity filters. While the energy function issue needs to be addressed primarily by biochemists, the search for an optimal or near-optimal solution attracts the research attention of computing scientists.

To conclude, to investigate protein structure prediction with the energy minimization approach, we need to pick a representation of the protein geometry, an appropriate energy function, and a search technique. Not all combinations of these choices work well. For example, the commercial energy packages CHARMM and AMBER are not suitable as fitness functions for evolutionary algorithms, partly because they examine atomic interactions and the energy computation per generation is too expensive. General problems with

this category of approaches are: expensive calculation; energy functions that sometimes lack a strong physical basis; and searches that may not converge to the correct result. Another aspect to note is that, since most research in this approach focuses on simplified models, the results are more conceptual than working 3-d structure models.

3.3 Structural Features Prediction

Structural features of a protein include secondary structure, inter-residue distances, disulfide bond formation, etc. Structural feature prediction maps these structurally measurable features onto an amino acid sequence. These structural elements can be used to provide constraints for tertiary structure prediction methods or as part of the prediction process; for example, the results of secondary structure prediction have been integrated into many tertiary structure prediction approaches. Compared with predicting a complete 3-d structure, structural feature prediction is smaller in scale and difficulty, and these features contribute more and more significantly to the final goal of predicting the full tertiary structure.

When predicting structural features, statistical or empirical approaches are normally adopted. Examples of sequences and their corresponding known structural features are collected from existing databases. Techniques from statistics or AI are then used to derive meaningful relationships, which may take the form of a neural network, a set of rules, or an analytical relationship. These are then applied to sequences of unknown structure to predict their structural features. Among the many techniques applied, artificial neural networks are more recent and more successful, and will be discussed in more detail in Chapter four.

3.3.1 Secondary Structure Prediction

Secondary structure is a very important feature when examining tertiary structure.
If the secondary structure of a given protein sequence is known, the 3-d problem becomes one of arranging the known secondary structure elements into the correct 3-d structure; in this sense, secondary structure prediction can be considered a sub-problem of PSP. The prediction problem can be viewed as classifying each amino acid in a sequence into one of three classes of secondary structure: H (helix), B (strand), and C (coil). Among the many different techniques used in secondary structure prediction, ANNs have proven successful. One of the first attempts to achieve over 70% prediction accuracy was PHD [70], using a sliding window and a standard 3-layer neural network that was trained

on a carefully selected set of proteins. ANNs have also been used successfully in PSI-PRED [43] and in [64]. ANN-based methods can now achieve a Q3 (discussed in Section 4.3.3) accuracy of almost 80%. More discussion of neural networks applied to secondary structure prediction is provided in Section 4.4. Recently, kernel methods have also been applied and perform well in accuracy.
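The sliding-window input encoding that these networks share can be sketched as follows; the window size and one-hot encoding here are illustrative choices, not the published PHD or PSI-PRED settings:

```python
# Encode each residue with a window of w neighbors on each side; positions
# beyond the chain ends get a dedicated "empty" symbol. Each window becomes
# one input vector for a classifier predicting H / B / C at the center.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "-"  # '-' marks positions outside the chain

def one_hot(symbol):
    return [1 if symbol == a else 0 for a in ALPHABET]

def windows(sequence, w=2):
    """Yield one flattened input vector per residue (window length 2w+1)."""
    padded = "-" * w + sequence + "-" * w
    for i in range(len(sequence)):
        window = padded[i:i + 2 * w + 1]
        yield [bit for symbol in window for bit in one_hot(symbol)]

vectors = list(windows("MKTAYI", w=2))
print(len(vectors), len(vectors[0]))  # 6 vectors, each 5 * 21 = 105 inputs
```

A network then learns the mapping from each window vector to the secondary-structure class of the central residue; real systems replace the raw one-hot encoding with evolutionary profiles.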

Chapter 4

AI Techniques for PSP

While Chapter three gives a big picture of the various approaches to the PSP problem, this chapter focuses on selected AI techniques involved in those approaches. As discussed in the previous chapters, protein structure prediction is a very complex problem and we do not fully understand the search space, so we cannot address it fully analytically. This is why many AI techniques have long been applied to it. Among them, I am particularly interested in those inspired by biological systems, especially evolutionary computation, artificial neural networks, and Lindenmayer systems.

When humans try to solve problems, looking at Nature's solutions has always been a source of inspiration. Two powerful natural problem solvers are the human brain and the evolutionary process [26]. Trying to design problem solvers based on the human brain leads to the field of neuro-computing; the evolutionary process forms the basis of evolutionary computing. Although not as powerful as ANNs and evolutionary computing, Lindenmayer systems, also biologically inspired, have found many applications in the computing world.

For the PSP problem, evolutionary computation is used as a population-based search technique, mainly in the ab initio prediction approach. It represents an intelligent way of searching for an optimal solution and has general applicability: whenever there is some reasonable method for scoring candidate solutions to a problem, evolutionary computation can be applied. Lindenmayer systems, as a novel generative encoding scheme for capturing protein structure in the lattice model, have been tested in evolutionary algorithms, but further research is needed to investigate their applicability to the PSP problem. Artificial neural networks are most successful in secondary structure prediction. For humans, a large memory of stored examples can serve as the basis for intelligent inference; for the PSP problem, ANNs infer meaningful relations

between primary sequences and secondary structures from selected datasets.

4.1 Evolutionary Computation

Evolution and intelligence are closely related. Evolutionary Computation (EC) is considered a subfield of Computational Intelligence by the IEEE Computational Intelligence Society. If a system can adapt its behavior and evolve itself to meet certain goals in certain environments, it is an intelligent system. By imitating the evolutionary process on computers, EC mimics the intelligence associated with the problem-solving capabilities of evolution. In real life, evolution creates very robust organisms; on computers, EC often produces good solutions to hard problems.

Broadly speaking, EC refers to any biologically inspired, population-based search technique that involves iterative development of candidate solutions, such as the ant colony optimization technique. Narrowly speaking, EC refers to Evolutionary Algorithms (EAs), a family of computational models inspired by Darwin's theory of evolution. EAs solve hard computational problems by simulating the evolutionary processes of inheritance, mutation, recombination and selection to evolve a good solution to a problem. As such, an EA represents an intelligent way of searching for a near-optimal solution.

For ab initio prediction of protein structure, even for the simple HP lattice model, the problem is proven to be computationally intractable. Consequently, there is much interest in effective techniques that can discover reasonably good solutions within an acceptable time. Evolutionary computation was first applied to the PSP problem in the early 1990s, with noticeable success. Beyond the PSP problem, the basic technique is both broadly applicable and easily tailored to many bioinformatics problems; [33] is a good reference for evolutionary computation in bioinformatics in general.
In the late 1990s and early 2000s, some researchers began to seek multi-objective evolutionary approaches to the PSP problem, as in [20, 24]. In this section, we first give an introduction to EAs in general. Then we discuss how EAs have been used in the various approaches to the PSP problem. Finally, we discuss some important issues that arise from applying EAs to the PSP problem.

Introduction to Evolutionary Algorithms

How does an evolutionary algorithm work? Generally, an EA manipulates a population of individuals, each representing a single possible solution to the problem under investigation. The EA starts with an initial population of n randomly generated solutions, and a fitness value is calculated for each solution using the problem's fitness function. Individuals with better fitness scores represent better solutions to the problem. After this initialization, the main iterative cycle of the algorithm begins. Using certain variation operators, the n individuals in the current population produce a number of children, which are then assigned fitness scores as well. Then, according to some selection criteria, a new population of n individuals is selected from the current population and their children. This new population becomes the current population, and the iterative cycle is repeated until some condition is met. The generic framework can be summarized as follows:

1. Initialize a population of candidate solutions and evaluate each of them.
2. Select some of the population to be parents.
3. Apply variation operators to the parents to produce children.
4. Evaluate the children and include them in the population.
5. Repeat from step 2 until some termination condition is met.

While the basic computational framework is quite simple, it is the design and implementation details that significantly affect the performance of EAs. There are no general guidelines for choosing a specific design or implementation for different problems; recent theory suggests the search for an "all-purpose" algorithm may be fruitless [26]. Thus the choice of implementation is often based on experience or on trial and error. Some important factors determining performance are the representation of individuals, the variation and selection operators, and fitness evaluation.
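The five-step framework above can be sketched concretely. The following is a minimal, illustrative sketch (not taken from any of the surveyed systems): a generational EA over fixed-length bit strings, with the toy "OneMax" problem (maximize the number of 1-bits) as the fitness function. All parameter values and operator choices are assumptions for illustration only.

```python
import random

def evolve(fitness, length=20, pop_size=30, generations=100,
           mutation_rate=0.05, seed=0):
    """Generational EA over fixed-length bit strings, following the
    five-step framework above. All parameter values are illustrative."""
    rng = random.Random(seed)
    # Step 1: initialize a random population (evaluation happens on
    # demand, with the fitness function used as a sort key below).
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: parent selection -- here, truncation to the fitter half.
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        # Step 3: variation -- one-point crossover plus bit-flip mutation.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            children.append([bit ^ 1 if rng.random() < mutation_rate else bit
                             for bit in child])
        # Step 4: evaluate the children and merge; survivor selection
        # keeps the best pop_size individuals (elitist replacement).
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        # Step 5: loop until the generation budget is exhausted.
    return max(pop, key=fitness)

# Usage: maximize the number of 1-bits ("OneMax"), using sum as fitness.
best = evolve(sum)
```

With these (arbitrary) settings the elitist loop reliably drives the population toward the all-ones string; swapping in a different fitness function, representation, or selection scheme changes the behavior, which is exactly the design space discussed above.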
Representation

In evolutionary computing, representation is the translation of the problem space into encodings that can be used for evolution, i.e., a way to represent individual candidate solutions in a manner

that can be manipulated by evolutionary operators. Some commonly used representations are: binary representations (Gray coding can be used to ensure that consecutive integers always have Hamming distance one); integer representations; real-valued or floating-point representations; and permutation representations.

Traditionally, different types of EAs have been associated with different representations. For example, Genetic Algorithms (GAs), the most widely known type of EAs, often use fixed-length binary strings, while the finite state machine representation is often associated with Evolutionary Programming. But there is no restriction on what representation to use in a particular problem or algorithm. For example, since binary encoding is inappropriate for many problems, current GAs also use non-binary representations such as integer strings, or even more general representations such as tree and matrix structures. Thus, the best strategy is to choose a representation to suit the problem under investigation and then choose variation operators to suit the representation. Selection operators use only fitness information, so they are independent of the representation.

Not only individuals need to be represented; so does the population. Different types of population representation can be seen in the literature. Two popular ones are the single population and structured populations. In a single population, any individual may be mated with any other. In a structured population, the population is decentralized into many sub-populations, and thus the algorithm is decentralized. Greater performance is often achieved using structured populations, but the implementation complexity is also greater.

Variation

Variation operators act on one or two parent individuals to produce offspring. They create the necessary diversity in the population and heavily influence how effectively the algorithm explores the search space.
Two types of variation operators are mutation and recombination. Mutation can be viewed as single-parent reproduction: a new individual is created by a random, slight change to one parent; thus mutation is always stochastic. Recombination, also called "crossover" in evolutionary computing, can be viewed as two-parent (or multi-parent) reproduction: each pair of selected parents is recombined to produce (a pair of) children. When designing variation operators, it is obvious that they have to match the given representation; e.g., binary and real-valued representations have

different variation techniques applied to them. For specific problems, standard operators can be considered, but it may be more beneficial to design operators that take advantage of domain knowledge.

Variation operators often have probability rates associated with them. These probabilities are parameters of the algorithm and must be set beforehand; in practice, we often need to tune them to find reasonable settings for the problem under investigation. A very small mutation rate may lead to premature convergence to a local optimum, while a mutation rate that is too high may lead to the loss of good solutions. There are theoretical, but not yet practical, upper and lower bounds that can help guide the tuning of these parameters.

Selection

As in natural selection, the selection operator in evolutionary computing applies evolutionary pressure and is responsible for driving the improvement of the population. As opposed to variation operators, which act on individuals, selection operators work at the population level. In EAs, selection is based on fitness scores and is applied either when choosing individuals to breed children (parent selection) or when choosing individuals to form a new population (survivor selection). There are different selection methods, and selection can be deterministic or probabilistic. Because selection considers only fitness information, it works independently of the actual representation; therefore, selection methods are universally applicable across problems and representations. Popular and well-studied selection methods include roulette wheel selection and tournament selection. In roulette wheel parent selection, each individual is assigned a sector of a roulette wheel proportional to its fitness, and the wheel is spun to select a parent.
In tournament selection, by contrast, global knowledge of the fitness of the population is not required; instead, only an ordering relation that can rank any two individuals is needed. Tournament selection thus looks at relative rather than absolute fitness. Most selection schemes are designed to allow a small portion of less fit solutions to be selected, which helps maintain the diversity of the population and prevents premature convergence to a local optimum.
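The two selection methods just described can be sketched in a few lines. The following is an illustrative sketch (not code from the surveyed studies); note how the tournament variant needs only an ordering relation, while the roulette variant needs non-negative absolute fitness values.

```python
import random

def roulette_select(population, fitness, rng):
    """Roulette wheel (fitness-proportionate) parent selection.
    Assumes non-negative fitness values. Illustrative sketch only."""
    total = sum(fitness(ind) for ind in population)
    spin = rng.uniform(0, total)
    running = 0.0
    for ind in population:
        running += fitness(ind)
        if running >= spin:
            return ind
    return population[-1]  # guard against floating-point round-off

def tournament_select(population, better_than, rng, k=2):
    """Tournament selection: needs only an ordering relation between
    two individuals (better_than), not absolute fitness values."""
    best = None
    for ind in rng.sample(population, k):
        if best is None or better_than(ind, best):
            best = ind
    return best
```

For a toy population of integers whose fitness is the value itself, repeated selection with either method picks the fittest individual far more often than the least fit one, while still occasionally admitting weaker individuals, which is the diversity-preserving behavior described above.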

Fitness evaluation

The fitness of each individual is evaluated by a fitness function. A fitness function can be viewed as a particular type of objective function that quantifies the goodness of a solution, or in terms of a fitness landscape that shows the fitness of each possible individual. An ideal fitness function correlates closely with the algorithm's goal, yet can be computed quickly. Speed of execution is very important, since the evolutionary cycle must be iterated many, many times before producing a usable result for a non-trivial problem.

EA variants

Some specific versions of EAs often addressed in the literature are listed as follows.

- Genetic algorithm (GA) - Initially proposed as an adaptive search technique [38], the GA is the most widely known type of EA. Typically, candidate solutions are represented by binary strings called chromosomes. The operators used in GAs reflect those found in natural reproduction, namely mutation and crossover.

- Evolution strategy - Individuals are often represented as tuples of real values which, compared to GAs, are closer to the natural problem representation. The main variation operator is mutation, usually introduced as Gaussian perturbations. Evolution strategies have been successfully applied to many engineering applications.

- Evolutionary programming - Looks at evolving computer programs. Fogel [34] proposed using the processes present in natural evolution to design intelligent agents, these agents taking the form of computer programs, which in turn were represented as finite state automata. These agents could then be used for prediction, control, or perhaps classification tasks.

- Genetic programming - Individuals take the form of computer programs, and their fitness is measured by their ability to solve a computational problem.
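The evolution-strategy variant, real-valued individuals with Gaussian mutation, can be illustrated by the simplest member of the family, a (1+1)-ES. The sketch below is a toy illustration under assumed parameters (fixed step size, no self-adaptation), not a production ES.

```python
import random

def one_plus_one_es(objective, x0, sigma=0.5, iterations=2000, seed=0):
    """(1+1) evolution strategy: a single real-valued individual,
    Gaussian mutation, and elitist survivor selection (keep the child
    only if it is no worse). Parameter values are illustrative."""
    rng = random.Random(seed)
    x, fx = list(x0), objective(x0)
    for _ in range(iterations):
        # Mutation: Gaussian perturbation of every coordinate.
        child = [xi + rng.gauss(0.0, sigma) for xi in x]
        fc = objective(child)
        if fc <= fx:
            x, fx = child, fc
    return x, fx

# Usage: minimize the sphere function sum(x_i^2); the optimum is the origin.
best, value = one_plus_one_es(lambda v: sum(t * t for t in v), [3.0, -2.0])
```

Because the survivor selection is elitist, the objective value never worsens; practical evolution strategies additionally adapt sigma during the run, which this sketch omits for brevity.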
These variants of EAs share the same underlying framework but differ in the nature of the problems to which they are typically applied and in their implementation details. Given their similarity, implementation details such as representations and variation operators are often borrowed from one type of EA by another, so there is no

clear distinction between them. In the literature, however, GAs are used much more frequently than other EAs to solve computational biology problems.

Evolutionary Algorithms for PSP

Evolutionary algorithms were first applied to the PSP problem in the early 1990s, when Dandekar and Argos conducted a series of studies [21, 22, 23]. Since then, many researchers have used EC techniques in various approaches to the problem. Most commonly, EAs are applied in the ab initio prediction approach, using genetic algorithms. EAs have also been applied to secondary structure prediction; one example can be found in [91], where a GA was used to supervise an artificial neural network predicting secondary structures.

In this section, we mainly discuss EAs applied in the ab initio prediction approach, in which the PSP problem is cast as an optimization problem: the conformational space is searched for the structure with the lowest free energy. As discussed in the general settings of EAs above, designing an EA for the ab initio PSP problem involves decisions on the following major issues:

- a protein representation;
- mutation and recombination operators for effective exploration of the conformation space;
- individual selection policies;
- a molecular interaction model (energy function) with which individual fitness will be measured.

In the following sections, we discuss these issues and survey common practices for dealing with them. Of course, for a full specification of an EA for the PSP problem, other issues, for instance the population size, termination criteria, and the probability rates of mutation and crossover, have to be considered and specified to produce an executable EA. We will briefly discuss some of these issues in the "Discussion" section, but our discussion largely focuses on the major issues, mostly at the conceptual rather than the executable level.

Representation

In the literature, EAs have been applied to both off-lattice and lattice models. For each type of model, the structure representations can be further categorized as shown in Figure 4.1.

Figure 4.1: Classification of (direct) structure representations (off-lattice: dihedral angles, distance matrix; lattice: Cartesian coordinates, absolute direction, relative direction)

All these representations use a direct encoding of the folded chain, i.e., they directly describe how each amino acid (or other structural unit) along the protein chain is arranged in space. Recently, some researchers have proposed a completely different representation scheme for lattice proteins: L-systems [27]. L-systems do not encode protein structures directly, but they can generate directly encoded structures; thus they are a generative encoding scheme. They will be discussed in the next section. Here we give an overview of the direct representations that can be used with EAs.

Off-lattice representation

For EAs on off-lattice protein models, an individual solution can be encoded by the dihedral angle representation. Because the main degrees of freedom determining a protein's 3-d conformation are the two dihedral angles φ and ψ on either side of each Cα atom, a protein conformation can be represented as a vector of these angle pairs along the main chain: [(φ1, ψ1), (φ2, ψ2), ..., (φn, ψn)]. This representation can easily be converted to Cartesian coordinates of the Cα atoms; the conversion formula can be found in [35]. The dihedral angle representation also has the advantage of preserving well-predicted local segments, since local fragments of the structure are encoded contiguously: e.g., when the crossover operator is applied, well-predicted secondary structure segments are more likely to be kept and inherited by the next generation. The values of these dihedral angles can be real numbers.
Alternatively, because the dihedral angles are found to be restricted to certain ranges of values, they can be discretized, and each discrete dihedral angle can be encoded as an integer, or as a bit string as in [22]. In practice, the ranges of these angles can be further bounded through preprocessing to further reduce

the size of the conformational space.

Another type of off-lattice representation that can be used with EAs was given in [62], which introduced a distance matrix representation of residue positions. A distance matrix contains the distance for every residue pair, and the Cartesian coordinates can be inferred from it.

Lattice representation

For EAs on lattice models, an individual structure can be represented using Cartesian coordinates [89] or, more commonly, an internal coordinates representation [60, 18, 77].

In the Cartesian coordinates representation, each vertex in the lattice has a set of coordinates; thus a protein conformation on a 2-d lattice can be encoded as a vector of coordinates [(x1, y1), (x2, y2), ..., (xn, yn)], where (xi, yi) are the Cartesian coordinates of the vertex occupied by the ith amino acid. A 3-d lattice requires three coordinates for each amino acid.

In the internal coordinates representation, the location of each amino acid is specified in terms of the previous one in the protein sequence. Thus, a protein conformation can be represented by a direction list expressing a sequence of moves. Obviously, this representation depends on the particular lattice topology considered. The internal coordinates representation can be further classified into two major schemes: absolute and relative. The absolute scheme, as studied in [48], uses an absolute direction reference system, and the moves are specified with respect to it. Take the 2-d square lattice as an example (the extension to other lattices is straightforward): the four absolute directions North, South, East and West can be naturally chosen as the reference system. Using this reference system, a conformation can be expressed as a sequence S ∈ {N, S, E, W}^(n-1), where n is the length of the protein sequence (the location of the first amino acid is fixed).
Thus, the very simple 6-residue conformation shown in Figure 4.2(a) below can be expressed as S_absolute = ENESE. In the relative direction scheme [60, 77], the reference system is not fixed; each move is specified relative to the direction of the previous move, rather than relative to the absolute axes defined by the lattice. Again taking the 2-d square lattice as an example, three directions, Forward, Right-turn and Left-turn, are enough to specify each new move relative to the previous one; thus a conformation can be expressed as a sequence S ∈ {F, R, L}^(n-1) (the first move is always Forward). The example structure in Figure 4.2(a) is then expressed, in this reference system, as S_relative = FLRRL.
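The two direction encodings can be made concrete with a small decoder. The sketch below is illustrative code (not from the surveyed studies) for the 2-d square lattice; with east assumed as the initial heading, decoding the absolute string ENESE and the relative string FLRRL from the Figure 4.2(a) example yields the same six lattice coordinates.

```python
def decode_absolute(moves):
    """Decode an absolute-direction string (N/S/E/W) into lattice
    coordinates; the first residue is fixed at the origin."""
    step = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    coords = [(0, 0)]
    for m in moves:
        x, y = coords[-1]
        dx, dy = step[m]
        coords.append((x + dx, y + dy))
    return coords

def decode_relative(moves):
    """Decode a relative-direction string (F/L/R); the initial heading
    is assumed to be east, so the first move (always F) goes east."""
    heading = (1, 0)
    coords = [(0, 0)]
    for m in moves:
        if m == "L":                        # rotate heading 90 degrees left
            heading = (-heading[1], heading[0])
        elif m == "R":                      # rotate heading 90 degrees right
            heading = (heading[1], -heading[0])
        x, y = coords[-1]
        coords.append((x + heading[0], y + heading[1]))
    return coords

def self_avoiding(coords):
    """A valid lattice conformation occupies each site at most once."""
    return len(set(coords)) == len(coords)
```

The `self_avoiding` check illustrates the validity condition discussed next: an absolute string such as ENWS revisits the origin and is rejected, whereas the relative alphabet has no "back" move and so can never produce an immediate reversal.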

Figure 4.2: (a): A very simple 6-residue conformation is represented in absolute direction as ENESE, and in relative direction as FLRRL. (b) and (c) show two possible arrangements after a point mutation at the 3rd residue position.

This relative direction representation scheme has the advantage of guaranteeing that all solutions are at least 1-step self-avoiding, since there is no "back" move. Self-avoidance (no clash between chain elements) is the basic condition for a valid lattice conformation. A comparative study [48] showed that this representation scheme is almost always better than the absolute encoding of directions for the square and cubic lattices. One problem when using these representations is that some mechanism needs to be in place to ensure the encoded structure is collision-free, which means the representation has to observe geometrical constraints to be valid. More discussion of general constraint handling in EAs for PSP is given later (see "Other design issues").

Variation operators

When designing variation operators, it is obvious that they have to match the protein representation. We first discuss variation operators of EAs on 2-d lattice models. An early study of the use of EAs on the 2-d square lattice model was [89], in which genetic algorithms were investigated and protein conformations were encoded as actual lattice coordinates. In this study, mutations were implemented as a rotation of the structure around a randomly selected coordinate. Unlike most GAs applied to other problems, in which the mutation rate is kept low, they found that a higher mutation rate is beneficial for protein structure prediction on simple lattice models. Crossover was implemented by swapping a pair of selected parent

structures at randomly selected cutting points. On a square lattice, there are three possible orientations in which two fragment structures can be joined; all three possibilities were tested in order to find a valid, collision-free one. In the study, a quality control mechanism was introduced into the recombination process by requiring the fitness value of the child conformation to be no worse than the average fitness of its parents. This was implemented by performing a Metropolis test comparing the energy of the child to the average energy of its parents; if the child conformation was rejected, new parents had to be selected. This study also demonstrated that the performance of the EA approach, at least on simple models, was better than Monte Carlo based approaches.

If protein conformations are encoded not as actual lattice coordinates but using internal coordinates, the effect of mutation operators depends on the specific representation used. Consider the effect of a one-point mutation on the structure in Figure 4.2(a). We know from the previous section that, using the relative direction representation, this 6-residue lattice conformation is S_relative = FLRRL. A mutation of the 3rd position value could produce either S'_relative = FLFRL or S''_relative = FLLRL, shown in Figure 4.2(b) and (c) respectively. However, if the structure in (a) is expressed using the absolute direction representation as S_absolute = ENESE, then to produce the same conformations as in (b) and (c), the 3rd position value and both values beyond it have to be mutated; the corresponding representations are S'_absolute = ENNEN and S''_absolute = ENWNW respectively. We can see from this example that a one-point mutation in the relative direction representation produces a rotation effect in the structure at the mutated point.
To produce the same effect in the absolute direction representation, a multiple-point mutation is needed: all the position values from the mutation point onward must be simultaneously mutated to produce the same change in the structure. Conversely, a one-point mutation in an absolute direction representation leaves the orientation of the rest of the structure unchanged; to achieve the same effect in the relative representation, changes at two subsequent position values are needed.

As for the crossover operation, most studies use a cut-and-paste type, but reference [68] presented an interesting deviation. They investigated GAs on lattice-based models. Mutation was introduced as a Monte Carlo step, where each move changed the local arrangement of a short (2-8 residue) segment of the protein chain. The crossover operation was performed by averaging two selected parents: first the parents were superimposed on each other to ensure a common frame of reference, and then the locations of corresponding

structural elements in each parent were averaged to produce a child structure lying midway between the two parents. A refitting step was then required to place the child structure back onto lattice coordinates. In the study, this new implementation of the GA was compared to Monte Carlo search and to a standard GA: it was shown to be more effective than standard GA implementations, and the superiority of both GA methods over MC search was also demonstrated.

The above discussion concerns lattice proteins. For the dihedral angle off-lattice representation, a simple way to introduce a mutation is to change the value of a single dihedral angle. This can be done in two ways: allowing only a small change in the value, or allowing a completely random reassignment of the dihedral angle values of a single amino acid. As in the relative direction representation for lattice models, one change in a dihedral value may have a large effect on the overall structure, because it rotates the entire arm of the structure beyond the mutated dihedral angle, which may cause collisions between many atoms. The crossover operator is mostly implemented as a cut-and-paste operation over the lists of dihedral angles, as in [22]; thus the child structure will contain part of each parent's structure. As with mutation, this may also lead to collisions. Since detecting collisions in off-lattice models is much more difficult than in lattice models, almost every implementation needs to carefully address this issue and come up with a way to handle it. Even when the child structure resulting from the crossover operator contains no collision, it may have another problem: being too open (not compact enough to be globular) and thus not likely to be a good candidate for further modification. To overcome these problems, many implementations include explicit quality control procedures that are applied after the variation operators.
These procedures may include several rounds of energy minimization to relieve collisions, loose conformations, etc. While some ordinary implementations of variation operators are shared by many studies, the manner and order in which they are applied differs in each actual algorithm. Beyond the regular operators mentioned above, many special operators have been devised in the literature. We have already given the example of [68], in which a Cartesian-space operator is used for recombination in a GA. Two more examples follow. In [77], a specially devised operator named "partial optimization" was employed on lattice proteins. The idea of this operator is to randomly select two non-consecutive residues of the protein, fix their positions in the lattice, and then place the intermediate residues by evaluating all their possible placements. The conformation that gives the

best fitness is kept. The number of intermediate residues to be permuted is a user-defined parameter called the partial optimization size. Another example is the rotation operator designed in [48]: actually a mutation operator, it flips part of the folded chain along a certain symmetry axis.

Fitness functions

It cannot be over-emphasized how important the fitness function is to the prediction result. The fitness of each solution must be an accurate reflection of the problem, or else the evolutionary process will find the right solution to the wrong problem. Defining an appropriate fitness function can be challenging in any evolutionary algorithm. In almost all EA approaches to the PSP problem, the fitness function adopts some form of potential energy function. This makes the design of the EA fitness function easier, because many energy functions already exist, but it also makes it hard to distinguish the performance of the energy function from that of the EA itself. The wide variety of energy functions that have been used in EAs ranges from the hydrophobic potential of the HP lattice model to much more detailed energy models such as CHARMM (see Section 2.4). Because it is very easy to incorporate and modify energy functions as fitness functions within the EA framework, many researchers develop their own energy function terms to suit their specific needs; thus the energy functions used in EAs are very varied. In this section, we survey some typical energy functions used in EAs, with an emphasis on the simple HP model. Further discussion of the dilemma of energy functions used as fitness functions is given in the Discussion section "More on energy functions".
For lattice models, the simplest energy function is that of the HP model, in which every direct hydrophobic-hydrophobic (HH) amino acid contact is rewarded, as shown in Table 4.1:

Table 4.1: Energy potential pij for the HP evaluation function

        |  H    P
     ---+---------
      H | -1    0
      P |  0    0

The optimal structure is the one with the greatest number of HH contacts for a given protein

sequence. Figure 2.1(b) shows a sequence embedded in a triangular lattice with HH contacts highlighted by curved lines. Given that each HH contact has a value of -1, as specified in Table 4.1, the conformation in Figure 2.1(b) has an energy of -4.

Many EAs working on the HP lattice model use this simple energy potential to measure the fitness of individual solutions, yet it is too coarse in some cases. For instance, examine the two conformations in Figure 4.3:

Figure 4.3: (a) and (b) are different conformations but have equal energy values.

Conformation (a) is obviously closer to forming the optimal conformation than (b). But because only direct HH contacts are rewarded, these two conformations have equal energy values as judged by the simple energy function of Table 4.1. In other words, this function cannot effectively distinguish between some individual solutions in an EA, and thus causes many plateaus in the energy landscape that trap the search.

There are ways to avoid the trap. One remedy is to augment the energy function with a distance-dependent HH potential, as proposed in [48]. Since the distances between amino acids form a countable set, it is possible to construct a distance-dependent potential that preserves the ranking of the conformations in the standard HP model while enabling a finer level of distinction between conformations with the same number of HH contacts. For example, if dij is the distance between two hydrophobic amino acids Hi and Hj, reference [48] gives a modified energy potential in which NH is the number of hydrophobic amino acids in the sequence, and k = 4 for the square lattice and k = 5 for the triangular and cubic lattices. It was suggested that this modified energy formulation is especially effective for hybrid EAs that use a local search method.
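The standard HP fitness evaluation of Table 4.1 can be sketched directly. The function below is an illustrative sketch (not code from any surveyed implementation) for the 2-d square lattice: every pair of hydrophobic residues that occupy adjacent lattice sites, and are not consecutive in the chain, contributes -1.

```python
def hp_energy(sequence, coords):
    """HP-model energy on the 2-d square lattice, per Table 4.1: each
    topological HH contact contributes -1. Illustrative sketch, assuming
    sequence[i] is 'H' or 'P' and coords[i] is the site of residue i."""
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):      # skip chain neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # adjacent sites
                    energy -= 1
    return energy

# Usage: a U-shaped 4-residue chain where the first and last residues,
# both hydrophobic, form one HH contact.
assert hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]) == -1
```

Because the score depends only on the count of HH contacts, two quite different conformations can receive the same energy, which is exactly the plateau problem illustrated by Figure 4.3 and motivates the distance-dependent refinement of [48].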

Another remedy was proposed in [77]. The concept of the "radius of gyration" (RG) is used to estimate the compactness of a set of amino acids: the more compact a conformation is, the smaller its radius of gyration. By integrating the RG into the fitness function, the fitness landscape can be changed so that, among conformations with the same number of HH bonds, the more compact ones are rewarded, bringing the evaluation closer to reality.

The above simple energy function for the HP model can be extended in various ways, either to fit more complicated lattice models or to account for more detailed energy terms. In [69], the charge property of amino acids is taken into consideration: amino acids are classified into four types (hydrophobic, positively charged, negatively charged, or neutral) rather than just two classes, and the energy potential table is expanded to 4 x 4 accordingly. In addition, different degrees of polarity or hydrophobicity for different amino acids can be used to make the energy function more detailed, in the hope of yielding conformations closer to the native ones; examples of such functions can be found in [35].

For off-lattice models, a very simple energy function is an adaptation of the lattice HP function to off-lattice environments. The energy function can simply take into account the distance between interacting residues, which can be calculated using the empirical mean distance between consecutive residues in proteins¹. An optimal interaction potential, equivalent to the lattice interaction potential for neighboring hydrophobic residues, occurs at unit distance. Smaller distances are penalized to enforce steric constraints, i.e., to avoid residue clashes.
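The radius-of-gyration compactness measure mentioned above is straightforward to compute. The sketch below is an illustrative implementation (not the code of [77]) for 2-d conformations: the root-mean-square distance of the residues from their centroid.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of residues from their centroid;
    smaller values indicate a more compact conformation. Illustrative
    sketch of the compactness measure discussed above."""
    n = len(coords)
    cx = sum(x for x, _ in coords) / n
    cy = sum(y for _, y in coords) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                         for x, y in coords) / n)

# Usage: a compact 2x2 square scores lower than a fully stretched chain.
compact = radius_of_gyration([(0, 0), (1, 0), (1, 1), (0, 1)])
stretched = radius_of_gyration([(0, 0), (1, 0), (2, 0), (3, 0)])
```

A fitness term built from this value can then break ties between conformations that have the same number of HH contacts, favoring the more globular one.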
In [35], one version of the calculation of the total energy is provided, in which the total energy E of a conformation is a sum of pairwise terms eij, the energy potential between residues i and j; here dij is the distance between residues i and j, γ and ε are constant parameters, and pij is the interaction potential according to Table 4.1.

For the dihedral angle off-lattice representation, the total energy is generally calculated as the sum of several energy potentials. The typical form is like the equation shown in

¹This distance is roughly 3.8 Å and can be set as the unit distance. The distance between a pair of interacting residues can be calculated using this distance and angular values.

section 2.4, in which various bonded or non-bonded potentials are calculated. The popular CHARMM force field is in this category of energy functions. Another example is that used in [22], in which small helical proteins are successfully folded using a GA. The fitness (energy) function took into account the effect of bad clashes, secondary structure formation, tertiary structure formation, hydrophobic burial, and hydrogen bonding. Normally, energy functions in this category are linear sums of several energy terms. But in the interesting energy function used in [61], the terms were normalized and then multiplied rather than added. In this way, it ensures that all the terms have reasonable values, since even one bad term can significantly affect the total score.

One more special type of energy function adopted in EAs for PSP uses empirically derived contact potentials for amino acid interactions. A contact potential describes the energy between two residues close enough to each other (typically 5-6.5Å). In [53], a contact potential was determined empirically for all pairs of amino acid types using 1168 known structures. These potentials are then used in a function similar to that in Section 2.4 to calculate the total energy.

Other design issues

Prevention of premature convergence on undesired solutions: These undesired solutions are often local minima. It is common that, during successive generations, one or very few solutions take over the population. Once this happens, the rate of evolution drops dramatically: crossover becomes meaningless, and advances are achieved only by mutations, at a very slow rate. Several approaches have been suggested to avoid this situation.
These include temporarily increasing the rate of mutations until the diversity of the population is regained; isolating unrelated sub-populations and allowing them to interact with each other whenever a given sub-population becomes frozen; and rejecting new solutions if they are too similar to solutions that already exist in the population.

Geometrical constraints: Like many practical problems, the PSP problem is constrained. Two types of constraints need to be enforced to define a feasible conformation: the connectivity of the chain and a collision-free conformation. Many implementations use internal coordinates representations to implicitly handle the first constraint (the off-lattice dihedral angle representation is actually a kind

of internal coordinates representation). As for the second constraint: for off-lattice models, it means some torsion-angle ranges are not allowed and residues should not collide; for lattice models, it means the conformational path has to form a self-avoiding walk on the lattice. Thus, not all possible individuals represent valid solutions. From one perspective, this provides extra information the EA can use to narrow down the search space; from another perspective, it adds extra dimension(s) to the already high-dimensional problem and thus may make the search more difficult to handle. Generally speaking, constraint handling in EAs is not straightforward, because the variation operators (mutation and recombination) are typically "blind" to constraints. That is, there is no guarantee that even if the parents satisfy some constraints, the offspring will satisfy them as well. In [26], some ways of handling constraints in EAs at the conceptual level are introduced as follows:

- Use penalty functions to reduce the fitness of infeasible solutions; the fitness may be reduced in proportion to the number of constraints violated, or to the distance from the feasible region.

- Use mechanisms that take infeasible solutions and "repair" them to the closest feasible one.

- Use a specific alphabet for the problem representation, plus suitable initialization, recombination, and mutation operators, such that the feasibility of a solution is always ensured.

These constraint handling methods have all been employed in various EAs for the PSP problem. In [44] and [77], penalty functions are used to measure the extent to which the constraints are violated. Infeasible solutions are allowed, but they are assigned a lower fitness value due to the penalizing term. In [18], an alternative was explored: a repair procedure maps infeasible solutions to feasible conformations, and the evolutionary operators are designed such that they are closed in feasible space.
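A minimal sketch of the penalty-function approach on a lattice model, where a self-intersecting conformation keeps a fitness value but a worse one (the penalty weight is an illustrative assumption):

```python
from collections import Counter

def collision_penalty(coords, weight=2.0):
    """Penalty proportional to how many times lattice vertices are multiply occupied."""
    counts = Counter(coords)
    clashes = sum(c - 1 for c in counts.values() if c > 1)
    return weight * clashes

def penalized_fitness(raw_fitness, coords):
    """Infeasible conformations are allowed but demoted by the penalizing term."""
    return raw_fitness - collision_penalty(coords)
```

A repair-based alternative would instead map the colliding walk to the nearest self-avoiding one before evaluation.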
There are also other techniques designed for particular representations. For example, to address the collision-free constraint on lattice models with the absolute coordinates representation, one simple way is to mark lattice vertices as free or occupied.

Human intervention in EAs: How much human intervention will be involved in assisting the algorithm? This is a question for any EA. One can choose to only preset some

probability parameters and leave all other aspects of the evolving process to random decisions, or incorporate more domain knowledge to guide and assist the algorithm. For the PSP problem, in practice, domain knowledge is often incorporated in the algorithm to improve the prediction accuracy. One way is to first predict secondary or super-secondary structures, then use the results as constraints during the EA search; e.g., rather than choosing crossover points totally randomly, the EA can choose some hot spots selected on the basis of preserving secondary structure. Another way is to include experimentally derived structural information, such as the existence of S-S bonds and conserved hydrophobic residues, in the prediction scheme to improve the prediction quality. For example, in [5], distance constraints derived from NMR experiments were used to help a genetic algorithm calculate protein structure.

Discussion

In this section, we discuss some general and conceptual issues raised by using EC for the PSP problem.

Suitability of EAs for the PSP problem

Evolutionary computation, according to [33], is both an effective and a computationally efficient search strategy. It has the advantages of ease of use, general applicability, and success in finding good solutions for difficult high-dimensional problems. In particular, EAs are useful when: 1) the problem search space is large, complex or poorly understood; 2) domain knowledge is scarce or difficult to encode to narrow the search space; 3) no mathematical analysis is available; or 4) traditional search methods fail. Except for the 2nd case, the PSP problem falls into all the other cases. Besides, many studies have demonstrated that, as a general search method, the EA does show superiority over other methods like Monte Carlo search. This suggests that the PSP problem is well suited to EAs.
This is interesting, since an EA works at the population level, i.e., many individuals mix and interact to evolve a good individual, while a protein molecule folds individually at the single-molecule level, not by mixing different proteins at the population level. [90] gave an explanation and suggested an interesting view of EAs as being compatible with the protein folding pathway: although EAs do not simulate the actual folding pathway of a single molecule, we can refer to the many solutions in the EA system not as different molecules but as different conformations of the same molecule. Each individual

solution can be considered as a point on the folding pathway of the single molecule, and it examines and evolves itself using the variation and selection operators.

Adaptive and dynamic nature of EAs

Evolutionary computation, by nature, is a dynamic and adaptive process. Thus, when applying EAs to practical problems, this nature should be given enough consideration, on three levels.

First, the essence of the EA's adaptive nature should be taken into consideration when modeling the problem. Initially, the GA, the most popular form of EA, was conceived by Holland as a means of studying adaptive behavior, as suggested by the title of the book in which he put together his early research, "Adaptation in Natural and Artificial Systems". In later studies, however, maybe because EAs generally perform well in searching for optimal solutions, they have largely been considered as optimization methods. In fact, there are many ways to view EAs, as pointed out in [26]: not only as problem solvers, but also as a basis for competent machine learning, as creative computational models, or as a guiding philosophy. To date, EAs have been applied to the PSP problem only as an optimization search tool. Perhaps future research on the PSP problem will model the problem differently and combine macro-level evolution and micro-level protein folding in a creative way.

On the second level, when we consider the EA as an effective search tool for the PSP problem, we should bear in mind that EAs are adaptive and there is no best EA across all problems [33]. The PSP problem can be formulated differently or can be focused on different types of proteins. Thus algorithm components should be developed in such a way that they are tuned to the formulation at hand, rather than simply forcing the problem into a particular version of an EA.
On the third level, setting algorithm parameters for a particular EA, it is suggested in [26] that using rigid parameters that do not change their values during the run of the EA goes against its adaptive and dynamic nature. Globally, there are two major forms of setting parameter values: parameter tuning and parameter control. Parameter tuning is the commonly practised approach in which values of parameters (population size, mutation rate, etc.) are set before the run of the algorithm and remain fixed during the run. However, a run of an EA is an intrinsically dynamic, adaptive process. It is intuitively obvious, and has been empirically and theoretically demonstrated in [26], that different values of parameters might be optimal at different stages of the evolutionary process. For instance, large mutation

steps can be good in the early generations, helping the full exploration of the search space, while small mutation steps may be needed in the late generations to locate the desired global optimum. Thus we need dynamic parameter control. For the mutation problem, e.g., one possible solution is to allow a range of dynamic mutations, from small to large, during the evolutionary process and let the EA control its own parameters. This leads to the idea of self-adaptation. Self-adaptation can be done by associating each individual with an additional vector that provides instructions on how best to mutate it; it is also natural to use two EAs: one for problem solving and another for tuning the first. But there is not much research in this line for the PSP problem yet.

Variation of EAs applied in PSP

Among the variants of EAs, genetic algorithms are still the predominant EA used in the PSP problem. But it was pointed out in [33] that crossover, the main variation operator in GAs, is largely ineffective for protein structure prediction, and that other variants, especially evolution strategies, which emphasize mutation, should be more extensively investigated. In the literature, memetic algorithms have also been applied to the PSP problem. The memetic algorithm refers to a hybrid evolutionary algorithm approach that uses a standard EA in conjunction with local search. The additional localized searches conducted in a memetic algorithm generally result in a significant improvement in the fitness of the best solution found.

Another research direction is the multi-objective formulation of the PSP problem. Historically, ab initio prediction has been approached as a single-objective optimization problem. Recently, however, some researchers have reformulated it as a multi-objective optimization problem.
An early study is [24], in which a multi-objective evolutionary algorithm (MOfmGA) was used for the structure prediction of two small proteins (5 and 14 residues, respectively). Using this idea, Cutello investigated medium-size proteins (46-70 residues) with promising results, and further conjectured, with partial experimental verification, that the PSP problem is better modeled as a multi-objective optimization problem [20]. Their approach considers the local interactions (bond energy) and non-local interactions (non-bond energy) among atoms to be the main forces directing the formation of the protein's native state, and is based on the intuition (or, fact) that the two kinds of interaction are in conflict. This is a typical characteristic of a multi-objective optimization problem.
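The two-objective view can be made concrete with a Pareto-dominance test over (bond energy, non-bond energy) pairs; the numeric tuples in the test below are illustrative, not values from [20]:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better
    in at least one (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population):
    """Keep the conformations whose (bond_energy, nonbond_energy) pair is not
    dominated by any other member of the population."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]
```

A multi-objective EA applies selection pressure toward this front instead of collapsing the two conflicting energies into a single weighted sum.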

More on energy functions

In ab initio structure prediction, the two important aspects of the problem, the energy function that must discriminate between the native structure and many non-native ones, and the search algorithm that must identify the conformation with the lowest energy, are fraught with difficulties [90]. Furthermore, difficulties in each aspect reduce progress in the other. Until we have a search method that enables us to identify the solutions with the lowest energy for a given energy function, we will not be able to determine whether the conformation with the minimal calculated energy coincides with the native conformation. On the other hand, until we develop an optimized energy function, we will not be able to verify that a particular search method is capable of finding the minimum of that specific function. That is, evaluating the performance of the search tool and evaluating the performance of the associated energy function are entangled, and making a distinction between them is hard. This is a dilemma in PSP research. When discussing EAs for PSP, the same problem arises, and to make things worse, in almost all EA implementations the energy function is also used as the fitness function of the EA, making the distinction between the energy function and the search algorithm even more difficult. It was suggested in [90] that, at least for algorithmic design and analysis purposes, it is possible to detach the issues of the search from the issue of the energy function, by using a simple model where the optimal conformation is known through full enumeration of all conformations, or by tailoring the energy function to specifically prefer a given conformation. But there is not much research in this line yet.

Another issue concerning the energy function is that complex energy function models can be parallelized for more efficient calculation.
Parallel computation is often adopted in knowledge-based approaches to the PSP problem, as well as for EAs in ab initio prediction. A significant reduction in convergence time can be achieved either by distributing a single evolving population over a number of machines or by allowing different machines to compute independently evolving populations. Many practical EA implementations for solving PSP have adopted parallel computation. Conceptually, this matches the nature of evolution, because evolution itself is a parallel process.
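The coarse-grained "independently evolving populations" scheme can be sketched sequentially as an island model; the toy objective, mutation operator, and ring migration policy below are illustrative assumptions:

```python
import random

def evolve_island(pop, fitness, mutate, generations=10):
    """One island: a simple elitist loop with mutation-only variation (minimization)."""
    for _ in range(generations):
        offspring = [mutate(ind) for ind in pop]
        pop = sorted(pop + offspring, key=fitness)[:len(pop)]
    return pop

def island_model(islands, fitness, mutate, epochs=5):
    """Subpopulations evolve independently; each epoch, the best individual of
    island i replaces the worst individual of island i+1 (ring migration)."""
    for _ in range(epochs):
        islands = [evolve_island(pop, fitness, mutate) for pop in islands]
        for i, pop in enumerate(islands):
            best = min(pop, key=fitness)
            nxt = islands[(i + 1) % len(islands)]
            nxt[nxt.index(max(nxt, key=fitness))] = best
    return islands
```

In a true parallel setting, each island runs on its own machine and only the migrants cross machine boundaries, which keeps communication costs low.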

Possible future improvements

Despite the conceptual and technical suitability of EAs for the PSP problem, the success of EAs in PSP has been moderate, and most research focuses on lattice models. What kinds of improvements might be made to EA methods to improve their performance?

One obvious aspect is improving the energy function. While this is a common problem for all prediction methods, an interesting possibility to explore within the EA framework is to make a distinction between the fitness function that is used to guide the production of the emerging solution and the energy function that is used to select the final structure. In this way it might be possible to emphasize different aspects of the fitness function at different stages of folding.

Another possibility, as suggested in [90], is to introduce explicit "memory" into the emerging substructure, such that substructures that have been advantageous to the structures that harbored them get some level of immunity from changes. This can be achieved by biasing the selection of crossover points to respect the integrity of successful substructures, or by making mutations less likely in these regions.

It seems that the PSP problem is too difficult for a naive "pure" implementation of EAs. The direction to go is to take advantage of the ability of the EA approach to incorporate various types of considerations when attacking this problem. GAs are still the predominant EA used in the PSP problem, but it was pointed out in [33] that crossover, the primary reproduction mechanism used in GAs, is largely ineffective for protein structure prediction. It was suggested that evolution strategies and evolutionary programming, which place emphasis on mutation as a reproduction mechanism, should be explored for the PSP problem.
Finally, a long-term effort should be made to better integrate the adaptive and dynamic nature of evolutionary computing at the various levels of approaching the PSP problem: in modeling the problem, in developing algorithm components, and in setting algorithm parameters. Both conceptual models and technical implementations need to be explored.

As discussed before, ab initio prediction approaches to the PSP problem often use simplified lattice models to study protein structure. On 2-d or 3-d lattices, the folded structures are usually represented using a direct encoding of the coordinates of every residue on the folded

chain (see Representation). Recently, a few researchers proposed using Lindenmayer systems (L-systems) to capture protein structures [27, 56]. After David Searls laid the ground for using generative grammar in biosequence analysis [78], this is a novel and interesting practice for representing folded protein structures on lattice models. In this section, we will give a short introduction to L-systems, then introduce and discuss the L-system-based encoding for lattice proteins in current research.

Introduction to L-systems

L-systems were developed by Aristid Lindenmayer in the late 1960s. Originally they were devised to provide a formal description of the growth patterns of simple multicellular organisms. Later on, the system was extended to describe higher plants and complex branching structures.

L-systems are commonly defined as a tuple <V, C, w, P>, where V, the variables, is a set of symbols that can be replaced; C, the constants, is a set of symbols that remain fixed; w, the axiom, is a string of symbols from V + C defining the initial state of the system; and P, the productions (or rewriting rules), is a set of rules defining the way variables can be replaced with combinations of constants and other variables. Beyond these terms, we also use alphabet to refer to the set V + C and symbol to refer to any element of V or C.

As an example, Lindenmayer's original L-system for modelling the growth of algae is as follows. Algae consist of cells, each of which can take on one of two values, a or b.

variables: a, b
constants: none
axiom: a
rules: a → ab, b → a

which successively produces: a, ab, aba, abaab, abaababa, ... This pattern of growth fairly closely matched the growth patterns of the algae that Lindenmayer was studying.

An L-system is context-free if each production rule has only one variable on the left.
If a rule refers not only to a single variable but also to a combination of this variable and certain neighbours, it is termed a context-sensitive L-system. An L-system is deterministic if there is exactly one production for each variable. If there are several, and each is chosen with a certain probability during each iteration, then it is a stochastic L-system. Finally, L-systems can be parametric if there are numerical parameters associated with the symbols

or productions. A deterministic context-free L-system is the simplest form of L-system and is commonly called a D0L-system.

Compared with traditional formal language grammars, the major difference lies in the way production rules are applied. In formal languages, productions are applied sequentially, while in L-systems they are applied in parallel, replacing simultaneously all variables in a given word. This difference reflects the biological motivation of L-systems: productions are intended to capture cell divisions in multicellular organisms, where many divisions may occur at the same time. Another difference is that L-systems do not necessarily have non-terminals as traditional grammars do. Variables in some L-systems constitute valid words in the languages of those L-systems. In this case, although they are replaceable, the variables are more like the terminals in traditional grammars.

L-system-based Encoding for Protein Structure

L-systems have been investigated for encoding lattice protein conformations only very recently [27, 56]. In this research, evolutionary algorithms are used as the inference procedure for discovering L-systems that represent target protein structures on simple lattice models. At this stage, the problem they are trying to solve essentially is: given a target structure expressed in "internal coordinates" (see Figure 4.1), how to find an L-system that, once evaluated, would produce the original target structure or a closely matching one. They used EAs to search the space of L-systems and produced promising results for short sequences. However, there is still a long way to go before L-system-based structure representation can be used in the PSP problem or its inverse problem. We will discuss this point in more detail in the discussion section.

Why a grammatical encoding?

As discussed in the section Representation, protein structures on lattice models are usually represented by a direct encoding of the folded chain.
One commonly used direct encoding is "internal coordinates", which represents the structure by a list of moves on the lattice. The moves can be absolute or relative. Under the relative scheme, each move is specified relative to the direction of the previous move. In a 2-d square lattice, e.g., a structure S is encoded as a string S ∈ {Forward, turnRight, turnLeft}^(n-1). See Figure 4.2(a) for an encoding example.
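Decoding such a relative-move string into lattice coordinates is a simple walk; the starting position and initial heading below are arbitrary assumptions (F, L, R abbreviate Forward, turnLeft, turnRight):

```python
def decode_relative(moves, start=(0, 0)):
    """Map a relative-move string over {F, L, R} to 2-d square-lattice coordinates.

    The first residue sits at `start`; the initial heading is +x (an assumption).
    """
    (x, y), (dx, dy) = start, (1, 0)
    coords = [start]
    for m in moves:
        if m == "L":
            dx, dy = -dy, dx   # rotate heading 90 degrees counter-clockwise
        elif m == "R":
            dx, dy = dy, -dx   # rotate heading 90 degrees clockwise
        # "F" keeps the current heading unchanged
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords
```

A move string of length n-1 thus yields the positions of all n residues, which is exactly why the encoding length scales with the protein length.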

However, the string length of the encoded structure is basically the same as that of the protein sequence, making search techniques that use this type of encoding hard to scale. An L-system is a generative, or rule-based, scheme that specifies how to construct the structure rather than directly encoding the structure, and can thus achieve greater scalability. But this raises the question: are lattice protein structures suitable for grammatical encoding? The researchers provide their reasoning in [27], which can be summarized as: proteins exhibit regularity and repeated substructures, which is consistent with the recursive nature of L-systems, where rewriting rules lead to modular, self-similar structures. But the researchers did not investigate to what degree proteins exhibit regularity, and whether the regularity shown in protein structures is enough for them to be modelled by L-systems in general. We will comment on this point further in the Discussion section. Another advantage of using a grammatical encoding is that it is more compact and parts of the encoding are more easily reused. Specifically for evolutionary algorithms, a grammatical encoding of individuals is more suitable for crossover and building-block transfer between individuals.

L-system-based encoding

In this section, we briefly introduce how lattice protein structure is encoded by L-systems, based on the methods discussed in [27] and [56]. The L-system's alphabet will depend on the lattice and coordinate system used. For the square 2-d lattice and relative internal coordinates, the specification of D0L-systems chosen in [27] is: the variable set V = {0, 1, 2, ..., m - 1}, with each of the number elements representing one rewriting rule; the constant set C = {F, L, R}, representing the three moves Forward, Left-turn and Right-turn in relative coordinates; the axiom w can be any string of combinations of characters from V + C.
The number of production rules is the size of the variable set, and each rule takes the form n → w, where n ∈ V and w ∈ (V + C)+, the set of all nonempty words over V + C. An example of an L-system encoding the short lattice protein structure RFRRLLRLRRFRLLRRFR would be as follows, with its derivation process shown in Figure 4.4.

axiom = 31
rules = {0 → 3LL2; 1 → R0RL; 2 → RRF; 3 → RFR1}

Figure 4.4: A derivation process example (successive parallel rewritings of the axiom, followed by a post-processing step, yield RFRRLLRLRRFRLLRRFR)

The maximum lengths of the axiom and rules, as well as the number of rules, are parameters for the inference algorithm (in [27] it was an EA) that depend on the length of the protein.

In further studies [56], knowledge of secondary structures is incorporated in the L-system-based encoding in the form of predesigned production rules. In the HP 2-d square model, a right-oriented α-helix is designed as RRLL (represented by the variable A); a left-oriented α-helix is designed as LLRR (represented by the variable H); a β-sheet is represented as a string of Fs (the maximum number of Fs is 4). Moreover, the L-systems are parametric: there are numerical parameters associated with the symbols. For example, if a structure segment in the relative coordinates encoding is FFFF, then in the parametric L-systems encoding it can be written as F4. As another instance, the 2-d lattice folding RFRRLLRLRRFRLLRRFR in relative coordinates can be rewritten in the parametric L-system as RFARLR2FRHFR. Thus the parametric L-system has only five symbols in its alphabet: {F, R, L, A, H}, and its rules are fixed and implicit compared with the D0L-systems discussed above.
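The parallel rewriting behind Figure 4.4 is easy to implement. The post-processing step below (dropping leftover variable symbols and truncating to the target length) is one plausible reading of the scheme in [27], and the iteration count is chosen by hand:

```python
def derive(axiom, rules, iterations):
    """D0L derivation: every variable in the word is rewritten simultaneously."""
    word = axiom
    for _ in range(iterations):
        # Constants (F, L, R) map to themselves; variables are replaced in parallel.
        word = "".join(rules.get(sym, sym) for sym in word)
    return word

def post_process(word, length):
    """Drop remaining variable symbols and truncate to the target length."""
    moves = "".join(sym for sym in word if sym in "FLR")
    return moves[:length]

rules = {"0": "3LL2", "1": "R0RL", "2": "RRF", "3": "RFR1"}
target = "RFRRLLRLRRFRLLRRFR"
conformation = post_process(derive("31", rules, 3), len(target))
```

With these rules, three parallel rewritings of the axiom "31" already contain the 18-move target as a prefix of their constant symbols.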

Evolving L-system-encoded structures

The ability of the L-system-based encoding to capture protein native conformations in the 2-d HP lattice model can be tested using EAs. Given a target structure in direct encoding, the EA explores the space of L-systems and evolves a set of rules that, once derived, produces a conformation that closely matches the target. The following general description of the EA used to test L-systems is based on [27]. The approach is close to grammatical evolution. Each individual L-system in the population is determined by its axiom and rewriting rules. The maximum number of rules and the string lengths for the axiom and rules are preset as parameters. For initialization, both the axiom and the rules of an individual L-system are randomly generated strings of symbols of the maximum lengths, where each symbol is selected with uniform distribution from the alphabet. The recombination operator resembles uniform crossover, where the rules are interchanged. During the recombination process, if a selected rule in an offspring makes reference to a variable symbol (rule) not defined in the offspring L-system, a repair operator is used to change that variable. The mutation operators are addition, deletion or modification of a single symbol in either the axiom or the rewriting rules of an individual. For selection, linear ranking selection and elitism can be used, and a mate selection strategy that chooses less similar parents can increase population diversity. To evaluate an individual's fitness, its L-system is derived and the Hamming distance is computed between the derived structure and the target structure. During the evolutionary process, an L-system that produces an illegal (not self-avoiding) lattice conformation is allowed, but will not be accepted as a final solution.

Discussion

L-systems are recursive in nature.
This nature makes L-systems well suited to describing fractal-like structures. Are L-systems suitable for describing protein structures? The preliminary research [56] seems to give a positive answer by asserting that "Results confirmed the suitability of the proposed (L-systems) representation". However, experiments have also shown that some protein instances are more difficult than others for evolving an adequate L-system [27], and that instances with high frequencies of α-helices and β-sheets have a clear advantage in their suitability for L-system encoding [56]. These results show that the suitability of the new encoding scheme heavily depends on the occurrence of sub-structures and their regularity.

Although it is known that protein structures indeed exhibit some regularity and repeated sub-structures that can be captured by L-systems², to what degree do protein structures show regularity? And generally, is the level of modularity and repetition within protein structures high enough for L-systems to be suitable for encoding them? Current research has not explicitly addressed these questions yet.

It is also worth noting that, for 2-d lattice proteins, the proposed L-system-based encoding is not independent of direct encodings. This dependence is twofold: the alphabet of the L-systems includes all the symbols used in the direct encoding³; and an L-system needs to be derived to the direct encoding form before the structure it encodes can be evaluated. Also note that a given target structure may have various direct encoding representations and that various distinct L-systems can produce the same direct encoding word [27]. Therefore, if L-systems are actually used in the PSP problem under this scheme, the advantages of using a grammatical encoding have to be weighed against the cost of adding a layer of complication to the encoding system.

The L-systems grammar has been used in many applications of evolutionary algorithms to problems in biology, engineering, and computer graphics. One example of L-systems as a powerful encoding is investigated in [46], where they represent the blood circulation of the human retina. Using L-systems to encode lattice protein conformations, as reviewed here, is very recent research. It is limited to short proteins on the 2-d square lattice model and has not been integrated into any approach to the PSP problem. However, it is a very interesting protein conformation representation scheme, and more research in this line is needed to investigate its possible application to PSP and the inverse PSP problem.
4.3 Artificial Neural Networks

As introduced in Chapter 3, protein structural feature prediction is an important category of PSP. Examples of structural features include secondary structure, residue solvent accessibility, and trans-membrane strands and helices. Although these features do not represent 3-d structure, accurate predictions of them are important steps toward 3-d prediction.

²The experimental analysis in [27] shows that some sub-strings that appear several times in the folded chain (e.g., RFR) are also present as parts of the evolved rules. This supports the idea that L-systems capture the naturally occurring sub-structures in lattice proteins.
³Only applies to absolute internal coordinates.

For instance, predicted secondary structures can be regarded as rigid bodies, simplifying molecular dynamics simulations; or, in ab initio prediction approaches, these predicted features represent additional information that can help guide the conformational search. The prediction of structural features is often modeled as inferring a mapping from input amino acid sequences to some kind of output sequences. The output sequence has the same length as the input sequence, and each symbol appearing in the output sequence describes the structural property of the residue at the same position in the input sequence. This way of modeling the problem enables the application of automatic learning methods, such as artificial neural networks (ANNs). These networks are capable of mapping between protein sequence and structure, of classifying types of structures, and of identifying similar structural features in a database. Neural networks have the advantage of making decisions from a large number of competing variables without an explicit understanding of the problem. This is particularly important for the PSP problem, where the principles governing protein structure formation are complex and not yet fully understood. So far, neural network models are among the most successful approaches in predicting protein structural features, especially in secondary structure prediction.

In this section, we first give an introduction to ANNs and a basic ANN scheme for predicting structural features in general. We then focus on secondary structure prediction, to illustrate and review how ANNs are applied in this important category of prediction. We then briefly introduce a few other types of structural predictions made by ANNs. Further discussion addresses some important issues raised by using ANNs in the PSP problem.

Introduction to ANNs

Artificial neural networks are inspired by the biological neural network, which consists of billions of biological neurons.
Neurons are the basic computing units of the brain. For each neuron, input signals are gathered, then processed and evaluated. If the evaluated result is larger than some threshold, an action potential fires and propagates to become the output signal of this neuron. Before this output signal becomes the input to the next neuron, it undergoes some processing that determines how the signal is transmitted from the output neuron to the next input neuron. This rather simplified model of the biological neuron serves as the basis of the artificial neurons (nodes) from which ANNs are constructed. A simple scheme of a generic ANN node is shown as follows.

Figure 4.5: A generic scheme of an artificial neuron (inputs, weights, threshold function, output)

In this computation scheme, a weight controls how much influence a previous node has on this node. Suppose there are n previous nodes connecting to this node; then a vector x = (x_1, x_2, ..., x_n) represents the n inputs from the corresponding n nodes, and w = (w_1, w_2, ..., w_n) is the corresponding weight vector. The node calculates a weighted linear combination of all the inputs, w · x, then possibly subtracts a threshold, and passes the result through an activation function to produce the output sent to the other nodes connected to it. Activation functions can be of different types; the most commonly used is the sigmoid function F(x) = 1 / (1 + e^(-x)). ANNs can take various architectures. Normally, nodes are arranged into layers. The inter-layer connections can be divided into two kinds: feed-forward and feed-back. A feed-forward network has only unidirectional connections, and signals propagate only forward from the input layer to the output layer. In feed-back networks, a layer can be connected to the next layer or to any of the previous layers, so signals can travel in both directions, causing loops in the network. Feed-back networks are dynamic and very powerful, but can get very complicated. While the connections are hardwired, the weights between nodes can be adjusted by the network during the training process. The idea of network training is to find, or learn, the weights that fit the training data so that the learned network can be applied to new data. There are two learning paradigms: supervised and unsupervised. Supervised learning is the method commonly used in structural features prediction. In supervised learning, the ANN is repeatedly presented with a set of training samples with known results. The task of the ANN is to modify the weights through these samples.
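To make the node computation concrete, the weighted sum, threshold subtraction, and sigmoid activation described above can be sketched in a few lines (an illustrative helper, not part of any method reviewed here):

```python
import math

def neuron(inputs, weights, threshold=0.0):
    """One artificial node: a weighted linear combination of the inputs,
    minus a threshold, passed through the sigmoid F(x) = 1 / (1 + e^-x)."""
    x = sum(i * w for i, w in zip(inputs, weights)) - threshold
    return 1.0 / (1.0 + math.exp(-x))
```

With all-zero input the weighted sum is zero, so the node outputs F(0) = 0.5; strongly positive or negative weighted sums saturate the sigmoid toward 1 or 0.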
The process is as follows: first, the network takes the input values of one sample and works out an output using initially random weights; then, the observed output value is compared with the known

value of that sample, and an error adjustment is back-propagated (see the following subsection) to the weights so that the next time the sample is presented, the observed output is closer to the desired output. This is repeated for all samples in the training set, which constitutes one epoch. The process is then repeated for a second epoch, a third epoch, and so on, until the network manages to reduce the output error for all samples to an acceptably low value. At this point, the training is stopped and all weights are settled; the trained ANN can be used on new data, or, if needed, a test phase can begin to determine the validity, or prediction accuracy, of the network. In unsupervised learning, the network is not presented with the desired output. It must learn the weights without being able to measure its result and minimize its error. In such an unsupervised scheme, nodes compete for the opportunity to update their weights, resulting in self-organization. Generally, unsupervised ANNs are used for finding interesting clusters within the data.

Error minimization

In supervised learning, the weights have to be adjusted so that the error between the desired output and the actual output is reduced. The best-known algorithm for this weight optimization is back-propagation. For node i, the difference between the observed output o_i and the desired output d_i is called the error:

e_i = d_i - o_i

The sum of squared errors is then

E = (1/2) Σ_i (d_i - o_i)²

where i runs over all output nodes. By calculating the gradient of the error function, the adjustment of weight w_ij is

Δw_ij = -η ∂E/∂w_ij

where η is the learning rate. Then each weight is updated as

w_ij ← w_ij + Δw_ij

During this process, the weights are adjusted to minimize the errors. One way of conceiving this error minimization process is to consider each individual weight as a dimension in space.
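The update rule above can be illustrated numerically. The sketch below applies it to a single linear output node (the delta rule), with made-up samples and learning rate; real predictors also propagate the error through hidden layers and activation functions:

```python
def train(samples, eta=0.1, epochs=200):
    """Repeatedly apply w <- w + eta * (d - o) * x, the gradient step for
    E = 1/2 * sum((d - o)^2) with a single linear output node."""
    w = [0.0, 0.0]
    for _ in range(epochs):              # one pass over all samples = one epoch
        for x, d in samples:
            o = w[0] * x[0] + w[1] * x[1]   # observed output
            e = d - o                        # error for this node
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w

# Illustrative samples generated from the target mapping d = 2*x0 - 1*x1.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
           ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
```

After a couple of hundred epochs the learned weights approach (2, -1), the mapping that generated the samples, showing the error being driven toward the minimum of the error surface.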

If we could plot the value of the error for each combination of weights, we would obtain an "error surface" in multidimensional space. In one respect, the objective of network training is to find the lowest point on the error surface. As with searching for the minimum free energy on the energy landscape in ab initio prediction approaches, no algorithm can guarantee to locate the global minimum. In another respect, neural network training should avoid over-training: if an ANN is trained for too many cycles to minimize the errors, it will overfit the training data, leading to larger errors on test data.

4.3.2 A Basic ANN Scheme for Predicting Structural Features

To apply ANNs to structural features prediction, a common approach is a multi-layer (often three-layer) feed-forward network. The following figure provides a basic scheme of these structural predictors.

Figure 4.6: A basic scheme of ANN predictors, adopted from [59]

As the figure shows, the network is moved along the input sequence and computes an output vector encoding the structural class of the amino acid at the current position (Y in the figure). As it is generally assumed that the structural properties of a residue are greatly affected by its local context (neighboring residues), the input is a window of a certain size of residues centered at the currently inspected position. The architectural parameters of the network include

the number of output nodes, the number of hidden layers and nodes, the input encoding, and the window size. The number of output nodes depends on the specific prediction task; for secondary structure prediction, for example, there are often three output nodes, representing the three secondary structures: helix, strand, and coil. The input encoding refers to the encoding of each input amino acid. There are two main types of input encoding: orthogonal and profile-based. Orthogonal encoding takes each amino acid in the sequence as it is and usually encodes it using a binary string. Since there are 20 amino acids, each one is represented by a 20-bit binary string consisting of nineteen 0s and one 1, where the position of the 1 identifies the amino acid. Profile-based encoding uses the 20-dimensional profile extracted from the PSSM of a multiple alignment. More about this type of input and its advantages is discussed in the next section; for PSSMs and multiple sequence alignment, refer to Section 1.2. The input window size controls how much local context information we want to consider in the prediction; it usually takes an odd length so that the amino acid at the center of the window is the prediction target. Ideally, one might expect that the larger the window size, the more information given to the predictor, and hence the better the performance. Unfortunately, an increase in window size also means an increase in possible noise: it is observed that beyond some threshold size, the signal-to-noise ratio decreases.
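A sliding-window orthogonal encoding of this kind can be sketched as follows. This is a hypothetical helper: a pure 20-bit code is used, so positions falling off either end of the sequence are encoded as all-zero vectors rather than with a dedicated 21st gap bit, which is one of several possible conventions:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """20-bit orthogonal encoding: nineteen 0s and a single 1 whose
    position identifies the amino acid."""
    v = [0] * 20
    v[AMINO_ACIDS.index(residue)] = 1
    return v

def window_input(sequence, center, size=17):
    """Concatenate one-hot codes for a window of `size` residues centered
    at `center`; off-sequence positions become all-zero vectors."""
    half = size // 2
    vec = []
    for i in range(center - half, center + half + 1):
        vec += one_hot(sequence[i]) if 0 <= i < len(sequence) else [0] * 20
    return vec
```

With a window of 17 residues this yields 17 × 20 = 340 input values (the 357 cited for [67] later in this chapter comes from the 21-bit variant with a gap bit).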
Typical window sizes range from 9 to 25 residues [92].

4.3.3 Secondary Structure Prediction

The general hypothesis taken when attempting to predict secondary structure (SS) is, firstly, that an amino acid intrinsically has certain conformational preferences due to its chemical properties; secondly, that these preferences may be modulated by the locally surrounding amino acids; and thirdly, that long-range interactions between amino acids may also play a role in forming SS. Various approaches focusing on different factors have been designed to predict an amino acid's secondary structure given the sequence context in which it is placed. Before ANNs were first applied to SS prediction in [67], prediction methods mainly used statistical information, as in [15], or physico-chemical properties of amino acids, as in [66], to investigate amino acids' conformational preferences. These methods make predictions using only information coming from a single residue, and the average accuracy achieved was limited to 60%. Then came many years of fruitful research on ANN-based approaches, which take the local context of each individual amino acid into account and have achieved 80% accuracy with the help of evolutionary profiles. While ANN-based research is still on-going, recently other

techniques, including Hidden Markov Models [50] and Support Vector Machines [41], have been applied to SS prediction, but they have not yet out-performed ANN-based methods in terms of prediction accuracy. In the following subsections, we first introduce the performance measures commonly used in SS prediction, then review the different ANN-based methods applied to this problem. These various ANNs are categorized into four groups: feed-forward networks based on amino acid local interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers.

Performance measures and testing

The performance of prediction methods can be evaluated in terms of four measures: sensitivity, specificity, Matthews' correlation coefficient, and the Segment Overlap score. For the overall sensitivity measure, the most commonly used is the three-state per-residue accuracy Q3. It is defined as the percentage of correctly predicted residues out of the total number of residues, accounting for all three secondary conformational states: helix, strand, and coil. This measure can also be used for a single conformational state, giving three other forms: Q_helix, Q_strand, and Q_coil, which give respectively the percentage of correctly predicted helix, strand, and coil residues. Note that this accuracy measure does not convey many useful types of information; e.g., it does not say where the errors are, or in what way the prediction failed. Nevertheless, it is commonly used to compare the performance of SS predictors. The Q index is based on individual residues: the measure of the prediction of one residue is relatively independent of the measure of the prediction of its neighbors. But secondary structure is composed of a segment, or a collection of segments, of consecutive residues.
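The per-residue measures just defined can be computed directly. A sketch, using H/E/C strings for the three states (illustrative helpers, not any published scoring implementation):

```python
def q3(predicted, observed):
    """Three-state per-residue accuracy: percentage of positions where
    the predicted state (H, E, or C) matches the observed one."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def q_state(predicted, observed, state):
    """Single-state form, e.g. Q_helix: correctly predicted residues of
    one state out of all observed residues of that state."""
    total = sum(o == state for o in observed)
    correct = sum(p == o == state for p, o in zip(predicted, observed))
    return 100.0 * correct / total if total else 0.0
```

For instance, a prediction that gets five of six residues right scores Q3 ≈ 83.3%, regardless of whether the errors fall inside or at the ends of segments, which is exactly the limitation discussed here.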
To reflect the nature of protein structure, measures should concentrate on how well entire secondary structure elements are predicted instead of individual residues. Thus, the SOV (Segment Overlap) measure was proposed by Rost et al. in [71]; on the web site [101], this measure was modified and fully described. Another useful measure of prediction accuracy for each of the three types of secondary structure can be calculated using Matthews' correlation coefficient [67]. For the α-helix, e.g., it is:

Coefficient = (p·n − u·o) / sqrt((p + u)(p + o)(n + u)(n + o))

with p being the number of residues that are true positives (correctly positively predicted), n the number of true negatives, o the number of false positives, and u the number of false negatives. The correlation coefficients range from +1 (totally correlated) to -1 (totally anti-correlated), and the values for the three types of secondary structure can be combined into a single figure by calculating their geometric mean. Moreover, a systematic testing of performance is needed; often it is done by cross-validation. In k-fold cross-validation, the original samples are partitioned into k subsets. Of the k subsets, one is retained as the validation set for testing the model, and the remaining k − 1 subsets are used as training samples. The cross-validation process is repeated k times for each training epoch, with each of the k subsets used exactly once as the validation data. The error on the cross-validation set can then be used to stop the training when it begins to increase. What is a good value for k? According to [76], the exact number k is not important, provided that the test set is representative and comprehensive, and that the cross-validation results are not misused to again change parameters. In [76], the requirements for the cross-validation process are also addressed.

ANNs based on local interactions

The early ANNs are basically feed-forward networks taking into account local interactions of amino acids by means of an input sliding window with orthogonal encoding. The pioneering work was [67], in which 62.7% Q3 accuracy was reported. Their network architecture is very similar to the template given in the previous section: three fully connected layers, with the output layer consisting of three sigmoidal units representing the three SS classes. The input amino acids are encoded by 21-bit binary strings (the 21st bit specifying a gap).
This sparse encoding increases the number of network parameters needed, but it has the advantage of not imposing an artificial ordering on the input data. Other network parameters, including the number of input and hidden nodes and the window size, were experimented with thoroughly in their work. One feasible arrangement is: 357 input nodes, 5 hidden nodes, and 3 output nodes, resulting in 1,808 weights. The 3 output nodes correspond to the 3 types of secondary structure. The 357 input nodes allow for a segment of 17 amino acids, i.e., an input window size of 17. The number of connections, and thus the number of weights, depends mainly on the number of hidden nodes. One interesting point noted in [67] is that the performance of the network is almost independent of the number of hidden nodes. In their work, they experimented with

different numbers of hidden nodes, from 0 to 40; the test results do not show much difference in performance. Although the accuracy reported in [67] was not much of an improvement over other prediction methods, this early work led to subsequent years of successful research on ANNs in SS prediction. This type of ANN, based on single sequences and local windows, seemed to achieve a prediction accuracy of at most 65-69%. Increasing the size of the window does not lead to improvements, due to the overfitting problem associated with large networks. However, some improvement was obtained by cascading the previous architecture with a second network to clean up the output of the lower network; more on this is introduced in the subsection 'ANNs as filters and combining predictors'. Beyond general prediction accuracy, another major difficulty for ANNs based on a window of local context is in predicting β-strands, because β-strands are determined by comparatively long-range interactions. This suggests that about 65% of secondary structure depends on local interactions.

ANNs based on evolutionary information

The next generation of ANNs for SS prediction considers not only the information contained in the local context of the input sequence, but also the information coming from homologous sequences. The rationale behind this approach is that structural features, including secondary structures, within a family of evolutionarily related proteins are more conserved than sequences. This information is processed by first doing a PSI-BLAST search for homologous sequences in databases and making a multiple alignment of them, then extracting a matrix of profiles, the PSSM, indicating the frequency of each amino acid at each position. Each residue is then encoded by the matrix column at the corresponding position, which is a vector of 20 real-number frequencies.
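One way to picture this encoding: each column of the multiple alignment is turned into a 20-dimensional frequency vector, one component per amino acid. The sketch below is a simplification (gaps are ignored, and real PSSMs typically hold log-odds scores rather than raw frequencies):

```python
def profile_column(aligned_residues):
    """One profile-style input vector: the frequency of each of the 20
    amino acids in a single column of the multiple alignment."""
    AA = "ACDEFGHIKLMNPQRSTVWY"
    counts = [0] * 20
    n = 0
    for r in aligned_residues:
        if r in AA:            # skip gap characters and unknown symbols
            counts[AA.index(r)] += 1
            n += 1
    return [c / n for c in counts] if n else counts

# A column holding residues A, A, C, G from four homologous sequences.
col = profile_column("AACG")
```

Unlike one-hot encoding, which is identical for every occurrence of the same residue, this vector reflects how conserved or variable the position is across the protein family.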
PHD [70, 71, 73] was one of the first ANN methods using profile-based inputs and going beyond 70% accuracy, and its authors suggested at the same time that the power of neural networks should be fully exploited for the PSP problem. The PHD system is composed of cascading networks. The first is that of Figure 4.6. A second one
⁴PSI-BLAST is a web-based search tool for identifying biologically relevant sequence similarities in databases. Other local alignment algorithms will also do for this task.

takes as input a window sliding over the previous outputs and refines the output of the first network. A final stage takes a jury decision, averaging the outputs from independently trained models. Although a number of techniques, including early stopping and ensembles of different networks, are used, most of the improvement achieved by PHD seems to result from the use of evolutionary profiles [73]. In [12], it was claimed that the most accurate SS prediction methods would be found using ANNs, and a system was developed involving two neural networks that reached an accuracy of 75%. Another example of an evolutionary profile-based ANN method is PSI-PRED [43], which uses two neural networks to analyze profiles. At present, almost all profile-based ANN predictions achieve accuracies of about 76-78%.

Prediction using recurrent networks

Human brains are recurrent neural networks: networks of neurons with feedback connections. Recurrent networks are considered computationally more powerful than feed-forward networks. For SS prediction, although the formation of SS is mainly driven by local interactions of residues, which justifies the success of feed-forward networks with evolutionary profiles as inputs, many researchers suggest that possible long-range interactions between different regions of a sequence should also be taken into account to further improve prediction accuracy. Thus there has recently been research into recurrent architectures applied to the PSP problem. Recurrent networks permit the state of the hidden (or output) units at the previous time step to be part of the input at the next time step, as shown in Figure 4.7. This provides the network with some memory of previous inputs, and this information can be used when processing current inputs. Recurrent networks are useful for modeling time series data and the acquisition of grammar.
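The recurrent update just described, in which the hidden state at time t receives the hidden state at t−1 as part of its input, can be sketched with scalar states and hand-picked weights (real predictors use vector-valued states and trained weight matrices):

```python
import math

def rnn_step(x_t, h_prev, W=0.5, U=0.5):
    """One recurrent update: h_t = tanh(W*x_t + U*h_{t-1}).
    U carries the memory of previous inputs forward."""
    return math.tanh(W * x_t + U * h_prev)

def run(sequence):
    """Feed a sequence through the recurrent unit, collecting the
    hidden state at each step."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h)
        states.append(h)
    return states
```

Feeding a single nonzero input followed by zeros shows the memory at work: the hidden state stays nonzero after the input has passed, but shrinks at every step. This decay is the same effect that limits how far back gradient-descent training can exploit context, as discussed below for BRNNs.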
One common feature between protein structure and sentence structure is the inherently sequential nature of both: as sentence structure is based on a sequence of characters, protein structure is based on the primary sequence, which begins at the N-terminal and ends at the C-terminal. The other common feature is the possible non-sequential, long-distance dependencies existing in the structure. One example of such dependency in protein structures is the formation of a β-sheet by several strands located far apart along the sequence. Feed-forward networks can hardly capture this long-range dependency; this is why the prediction accuracy for β-sheets using feed-forward networks is generally lower than that for helices.

Figure 4.7: Sketch of a recurrent network (output units(t) and hidden units(t), with hidden units(t−1) fed back as input)

In [4], a bidirectional recurrent neural network (BRNN) architecture was proposed, and it was further refined in [64] to predict protein secondary structure at an accuracy of about 76%. In this architecture, the prediction for the residue at position t is determined by three components. First, there is a central component associated with the local window at position t, as in standard feed-forward networks for SS prediction. The two other components are two similar recurrent networks associated with the central component. These two recurrent networks act as two "wheels" rolling along the protein chain, one from the N-terminal and the other from the C-terminal, exploiting upstream and downstream context in the sequence all the way to the point of prediction. This bidirectional recurrent network is trained with a generalized back-propagation algorithm. But because the algorithm is essentially gradient descent, the error propagation in both the forward and backward chains is subject to exponential decay, so the learning of remote information is not efficient. For SS prediction, the BRNN can use information within about ±15 residues around the residue of interest, and it can hardly discover relevant information contained in more distant portions of the sequence. Nevertheless, the researchers in [64] claim that they have developed new algorithmic ideas that begin to address the problem of long-range dependencies in SS prediction. There is more research based on the BRNN. In [13], segmented-memory recurrent networks were proposed to replace the standard recurrent networks in the BRNN architecture. The idea of segmented memory is based on the observation that when trying to memorize a long sequence, humans tend to break it into smaller segments first and then cascade them to form the final sequence.
Thus it is believed that RNNs are more capable of capturing long-term

dependencies if they have segmented memory and imitate the human way of memorization. The experiment applying this idea to refine the BRNN for SS prediction indicates a moderate improvement in prediction accuracy [13]. In another paper [9], bidirectional recurrent networks are used as filtering networks to correct the output coming from the first-stage prediction by trying to capture valid segments of SS. In this approach, an early stopping mechanism was used to control overfitting during the training process. The experiments showed that this approach reached good accuracy and a very high SOV value. Despite some good results, recurrent networks have not been fully explored for the PSP problem, because most research is based on the bidirectional recurrent architecture proposed in [4], and other network architectures or implementations for the PSP problem can hardly be found in the literature.

ANNs as filters and combining predictors

Besides serving as direct SS prediction methods, ANNs are also used as filters and for combining results from different prediction methods into a consensus meta-predictor. Filtering examines the final predictions to make them more realistic by removing bad predictions. It is now standard in secondary structure prediction and is used in many successful methods. There are various filtering techniques, such as using if-then rewrite rules found through the machine learning method CART [99]. One of the rules specifies:

[!a, *, *, a, c] → c

with a = α-helix, c = coil, * = any, ! = not. This rule says that if the pattern on the left is met in a prediction, then the marked secondary structure on the left is rewritten as the secondary structure on the right of the rule. Thus a predicted SS segment [b, b, b, a, c], after filtering, will be rewritten as [b, b, b, c, c]. The more widely used filtering method in SS prediction is to use ANNs.
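Before turning to ANN filters, the rewrite-rule style of filtering can be sketched for the single rule above (an illustrative implementation; the actual CART-derived rule set in [99] is larger):

```python
def apply_rule(prediction):
    """Apply the rewrite rule [!a, *, *, a, c] -> c: wherever a 5-state
    window starts with anything but helix and ends with helix then coil,
    the helix at the fourth position is rewritten as coil."""
    ss = list(prediction)
    for i in range(len(ss) - 4):
        if ss[i] != 'a' and ss[i + 3] == 'a' and ss[i + 4] == 'c':
            ss[i + 3] = 'c'
    return ''.join(ss)
```

Running it on the segment from the text, "bbbac" becomes "bbbcc", removing the implausible one-residue helix; windows that start with a helix are left untouched.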
As early as [67], a second, structure-structure network was used to filter the outputs of the first, sequence-structure network. The inputs to the second network were a window of vectors resulting from the first network, where each vector holds the frequencies of the three types of SS at a residue position:

Inputs: ...(0.6, 0.1, 0.4)(0.8, 0.2, 0.2)(0.5, 0.6, 0.2)...

The structure-structure network has only three inputs per residue, which allows a much larger window size for the same number of weights than a sequence-structure network, which has to admit 20 inputs per residue. In [67], a 2% improvement in prediction accuracy was reported from using a filtering network. Adding a filtering network has now become a common approach, believed to improve both Q3 and SOV; the best performance so far achieves an accuracy of 78% and a SOV of 73.5% [9]. Generally, the filtering networks are feed-forward, but in [9] the filtering network used was a bidirectional recurrent network (BRNN). This filtering BRNN has a much simpler architecture than one based on a BRNN with profiles as input, yet when tested on the predictions of both ANN and SVM predictors, its performance on the Q3 and SOV indices is equivalent to the latter [9]. ANNs are also found useful in combining predictions from several, or many, networks. In the PHD method [70], for example, a third-level network combined the predictions of 10 separate neural network systems that vary in training data and encoding schemes. The output is the prediction resulting from the arithmetic average of the 10 ensemble predictions. The network also outputs a reliability index that indicates how many of the independently trained networks agree on the prediction. A 2% improvement in predictive performance was reported. The PSI-PRED method also averages the output of up to four separate neural networks to increase prediction accuracy. The study of neural network ensembles, which is closely linked to the development of Bayesian neural networks, is a potential area that may further improve SS prediction accuracy.

Other issues in SS prediction

Limits of accuracy: Currently, the best accuracy of secondary structure prediction is close to 80% [76].
It is arguable whether this accuracy will ever be significantly improved. There are probably three reasons for this doubt: 1) given the 3-d structure of a protein, there is no complete agreement on how to assign SS to each amino acid, especially for amino acids located at the beginning or end of an SS element. This is largely because secondary structure does not represent a clear-cut category of structure in nature; rather, it is a useful piece of terminology. 2) Some regions of SS are not solely determined

by the local sequence, but may also be influenced by long-range interactions. Without a full understanding of tertiary structure, secondary structure cannot be expected to be accurately predicted. This is why some researchers use tertiary structure information in constructing ANNs to predict secondary structure and gain some improvement in accuracy, e.g. [52]; but this approach reverses the objective of SS prediction. 3) Usually in SS prediction, 3 classes of secondary structure are adopted, but the secondary structure database, DSSP, describes 8 structure classes. A mapping from 8 to 3 classes is then needed to reduce the feature space and enable efficient computation. However, by imposing a coarser set of classes, we may impose a limit on the accuracy of SS prediction.

From secondary to tertiary structures: Suppose we could accurately predict the secondary structure of a protein; how would it help in constructing the protein's tertiary structure? This problem is not trivial, because the SS elements do not uniquely define the 3-d structure; other information, such as their relative distances, is needed. There are methods that attempt to derive distance constraints between amino acids on the basis of a multiple sequence alignment of proteins of the same family, like the one discussed earlier, but these methods are not reasonably effective yet. And not only in computational methods: even in NMR experiments, not all atomic distances can be measured, or the uncertainty of the measured value is rather high. Thus, there is still a way to go toward reconstructing tertiary structure from secondary structure.
The encouraging part, however, is that the accuracy of methods for SS prediction has reached a respectable value for further research on this problem.

4.3.4 Other Structural Features Prediction

Besides secondary structure, there are other structural features that can help in understanding and predicting protein tertiary structure, such as residue solvent accessibility, cysteine bonding state, residue long-range contacts, etc. Since for most structural features the prediction problem can be modeled as a mapping problem that relates each residue in the protein sequence to a symbol describing a certain property, it is not surprising that ANNs, as automatic learning methods, find application and success in predicting many other structural features. In this section, we briefly survey a few of them. Residue solvent accessibility (RSA) describes the relative degree to which a residue interacts with solvent molecules. It can be described in several ways; the simplest is a two-state description: residues with greater RSA are considered exposed, and residues with lower RSA are considered buried. ANN methods have long been applied to the prediction

of RSA. As in SS prediction, the first attempts took only the single sequence as input, as in [37]; later, evolutionary profiles were used, as in [72]. Then, in [65], ensembles of bidirectional recurrent neural networks, similar to those employed in SS prediction, were investigated and obtained good performance, showing again the ability of ANNs to exploit structural features. Cysteine is one of the twenty amino acids, and it can occur in either of two forms: oxidized or reduced. Two oxidized cysteines can pair to form a disulphide bridge, a type of covalent bond important for protein folding and stabilization. Identifying oxidized cysteines can thus help predict disulphide bridges, and this problem can be cast as a binary classification task: for each cysteine in a given protein, predict whether it is in a disulphide bridge or not. Both feed-forward and recurrent networks have been applied to this task. The program CYSPRED developed in [29] uses a neural network with no hidden nodes, fed by a window of residue positions centered at the target cysteine; an evolutionary profile is used as the input for each residue position. This method achieved 79% accuracy. In [10], an SVM method and more domain knowledge are investigated; the accuracy achieved is 84%. Based on this method, they further added a global refinement stage using bidirectional recurrent networks and reached 88% accuracy. For predicting long-range contacts of residues, the basic hypothesis is that residues in contact in a protein structure tend to mutate in a covariant manner. Thus, detecting residues mutating in a correlated manner can be taken as an indication of probable physical contact in 3-d. There are various methods for this problem. One approach is to train neural networks using different encoding systems for multiple sequence alignments [30].
For example, each residue pair in the protein sequence can be coded as an input vector containing 210 elements (20 × (20 + 1)/2), representing all the possible unordered couples of residues (each residue couple and its symmetric are coded in the same way), and a single output state can code for contact versus non-contact.

4.3.5 Discussion

What has been described here is the application of ANNs to protein structural features prediction, a subproblem of PSP, and in particular to secondary structure prediction. This category of prediction problem can be described as a mapping problem in which we relate a sequence encoded by an alphabet of twenty letters to a sequence over a certain alphabet representing some structural feature. This way of posing the problem enables the use of

ANNs, as well as other automatic learning techniques, to infer the relationships between sequences and structural features by learning from known cases. ANNs perform quite successfully at this task. The general ideas of how ANNs are applied have been introduced in the previous sections. In this section, we discuss a few more issues; some are specific to protein structural features prediction, while others are general problems of ANNs.

The problem of over-fitting

The purpose of training an ANN is not to learn the training set to the highest degree of accuracy. Rather, the aim is to generate a network that has the ability to generalize to other, unseen data. Thus a network should avoid being over-trained; otherwise, it will fit the training data perfectly while having poor ability to generalize. It is as if we focus so much on particular trees that we miss the forest. Another problem with over-training is that training data normally contain noise. If a network is over-trained, it will learn the noisy details of the training set and is unlikely to be optimal from the perspective of generalization. Some factors have been identified concerning the conditions that make ANNs generalize well. Examples of these factors are: 1) the ratio of network parameters to training examples: this ratio should not be too large; 2) the number of hidden nodes: although structural features prediction does not necessarily require hidden nodes, most ANN designs in the literature for realistic-length proteins have hidden layers, and the number of hidden nodes is often determined experimentally; too few hidden nodes will leave the network unable to learn, but with too many, its generalization will be poor; 3) the number of training iterations.
If there are too few training iterations, the network will be unable to extract important features from the training set; if there are too many, the network will begin to learn details of the training set and fail to abstract general features.

In practice, these factors can be handled in different ways. For example, the popular SS prediction method PHD [70] uses two methods to address the over-fitting problem. One is early stopping. The other is to use ensemble averages, training several networks independently with different input information and learning procedures. Cross-validation techniques (see Section 4.3.3) are also commonly used during training to control over-training. Cross-validation is effective against over-fitting, yet computationally expensive.
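Early stopping can be sketched in a few lines. The toy model below is a one-dimensional logistic regression rather than a secondary-structure network, and the data, learning rate, and patience value are all hypothetical; the point is only the control flow: monitor validation loss each epoch, keep the best weights seen so far, and stop when the loss has not improved for a fixed number of epochs.

```python
# Minimal early-stopping sketch (illustrative; not PHD's actual procedure).
# Training stops when validation loss has not improved for `patience`
# epochs, and the weights from the best validation epoch are returned.
import math
import random

def train_with_early_stopping(train, val, epochs=500, lr=0.5, patience=5):
    """Toy 1-D logistic regression; returns (w, b) at the best val loss."""
    w, b = 0.0, 0.0
    best = (float("inf"), w, b)
    bad_epochs = 0
    for _ in range(epochs):
        for x, y in train:                         # one SGD pass
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
        # cross-entropy loss on the held-out validation set
        val_loss = 0.0
        for x, y in val:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            p = min(max(p, 1e-9), 1 - 1e-9)        # guard log(0)
            val_loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
        if val_loss < best[0]:
            best, bad_epochs = (val_loss, w, b), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # early stop
                break
    return best[1], best[2]

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(60)]
data = [(x, 1 if x > 0 else 0) for x in xs]
w, b = train_with_early_stopping(data[:40], data[40:])
```

Returning the weights from the best validation epoch, rather than the last one, is what distinguishes early stopping from simply truncating training.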

Effects of evolutionary information

The fact that proteins are evolutionarily related affects the application of ANNs to structural features prediction in the following ways. First, evolutionary information has been proven useful in improving prediction accuracy: making use of it during the prediction process yields significant improvements [9]. The evolutionary information mainly takes the form of multiple alignment profiles. Secondly, because evolutionarily related proteins often exhibit very similar secondary structures, during network training we must ensure that no protein homologous to those in the training set is present in the validation and test sets; otherwise the evaluation of the network is bound to be misleading, because the network may "learn" to recognize homologous proteins and give the same answer for them, rather than recognize the features of the sequence. Thus, in practice, the protein sequences in the training, validation, and test sets normally undergo inspection to make sure that no pair shares significant similarity. Usually a threshold of about 25% sequence identity is used for this purpose.

About data sets

The application of ANNs depends on the use of data sets, including training, validation, and test sets. For ANNs used in the PSP problem, some issues concerning these data sets are worth noting. First, as pointed out in [12], increasing the number of non-homologous proteins in the data sets improves prediction accuracy, because more biological information improves the network's ability to discriminate between different types of structures, and the risk of over-fitting is reduced. An example was given in [12]: a 4% improvement in the Q3 index was achieved using a data set of 318 non-homologous protein sequences, compared with Qian and Sejnowski's network [67], which used a data set roughly one third that size.
This suggests that, as the number of solved non-homologous protein structures increases over time, prediction based on larger data sets will become more accurate. Although it is hard to find further evidence for this conjecture in the literature, subsequent work on ANN approaches to SS prediction usually uses larger data sets. For example, the data set in [41], published in 2001, contains 513 protein chains with low similarity, while the data set in [9], published two years later, contains 969 chains and almost 184,000 amino acids.
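The redundancy filtering implied by the ~25% identity threshold can be sketched as below. This is a deliberately naive illustration: real studies compute identity from pairwise alignments (e.g., via database search tools), whereas here identity is simply the fraction of matching positions between equal-length toy sequences, and a greedy pass drops any sequence too similar to one already kept.

```python
# Naive redundancy-filtering sketch for building low-similarity data sets.
# Illustration only: real pipelines use alignment-based sequence identity.
def identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical positions (assumes equal lengths, no gaps)."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / max(len(seq_a), 1)

def filter_redundant(seqs, threshold=0.25):
    """Greedily keep sequences with identity <= threshold to every kept one."""
    kept = []
    for s in seqs:
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept

pool = ["ACDEFGHIKL", "ACDEFGHIKV", "LKIHGFEDCA", "MNPQRSTVWY"]
nonredundant = filter_redundant(pool)  # drops the 90%-identical second entry
```

The second sequence differs from the first at a single position (90% identity), so it is excluded; the remaining three share no identical positions pairwise and all survive the 25% cutoff.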

Secondly, it is not only larger data sets themselves that improve prediction; the data pool from which the data sets are drawn is also growing, and this too contributes to the improvement. The contribution comes from two aspects: first, as discussed before, the use of evolutionary information increases prediction accuracy, and obtaining evolutionary information is directly connected to database size and database search tools; second, larger data pools make it easier to select good-quality protein data for ANN methods. One last issue about ANN data sets is not specific to the PSP problem but is a general problem of the ANN method: an ANN trained on certain data may produce predictions different from those of another ANN trained on different data. This poses problems for prediction accuracy, and it has not been addressed in the PSP problem. PSP researchers do pay attention to choosing data sets, but their attention goes to choosing proteins that are mutually non-homologous rather than to this issue.

Opening the black-box

While ANNs have been used successfully in the PSP problem, one major complaint about ANN predictors, especially from biologists, is that there is no explanation of why a protein structure is predicted as it is. A trained ANN is like a protein folding machine, fed with protein sequences and producing folded structure features. But this machine is a black box, an unknown function of the amino acid sequence. A trained ANN has evidently learned meaningful relationships in the training data, but these relationships are encoded as weight vectors within the network, which are difficult to interpret. Is it possible to see inside the black box? Is it possible to "fit the curve" to the data points and thus empirically derive the corresponding function from sequence to structure?
If this problem is cast as fitting a function to data, many techniques from mathematics and computing science are applicable. Our discussion here, however, focuses on extracting rules from neural networks so that these networks can be more than mere "black boxes". Rule extraction from neural networks has been an active research topic in recent years, and many methods have been proposed [87]. If the feed-forward net used for SS prediction has no hidden layers, the values of the weights chosen by the network during training for each residue type and window location are themselves instructive. But most of the networks used for the PSP problem are multi-layer. For multi-layer networks, or other network types such as recurrent networks, rule-extraction methods vary and depend on network

architecture, training procedure, and activation functions. They can nevertheless be roughly categorized as lying between 'decompositional' and 'pedagogical' approaches, according to [87]. Decompositional approaches 'look inside' the network and analyze the weights between units to extract rules. Some of these approaches require specialized, restricted weight-modification algorithms, while others require specialized network architectures, such as an extra hidden layer of units with staircase activation functions. Pedagogical approaches do not examine the weights inside the black box but extract rules by observing the relationship between the network's inputs and outputs. They are thus general-purpose in nature and can be applied to any feed-forward network architecture. For the PSP problem, the rules extracted from ANN solutions should be applicable to most protein sequences and consistent with the laws of chemistry and physics. Overall, however, there has not been much research along this line yet. In [86], rules are extracted by specific modulation of the training procedure. The attempt did not improve performance, but it showed that the rules extracted from ANNs are more complicated than those obtainable by statistical analysis. Because ANNs outperform other methods in prediction accuracy, it is worthwhile trying to extract rules from black-box ANNs: this improves the comprehensibility of the solutions without losing the accuracy of the black boxes.
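A pedagogical approach can be sketched in miniature. Everything here is hypothetical: the "trained network" is a stand-in function with toy propensities (not real chemistry), and the extraction step simply probes the black box over all 3-residue windows and records, per central residue, the majority output as an if-then rule, never looking at weights.

```python
# Pedagogical rule-extraction sketch: the predictor is treated purely as a
# black box, probed with inputs, and symbolic rules are read off from the
# observed input-output behaviour alone.
from collections import Counter, defaultdict

def black_box_predict(window: str) -> str:
    """Stand-in for a trained ANN: returns H, E, or C for a 3-residue window."""
    helix_formers = set("AELM")    # toy propensities, purely illustrative
    strand_formers = set("VIY")
    center = window[1]
    if center in helix_formers:
        return "H"
    if center in strand_formers:
        return "E"
    return "C"

def extract_rules(predictor, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Probe the black box; keep the majority output per central residue."""
    votes = defaultdict(Counter)
    for left in alphabet:
        for center in alphabet:
            for right in alphabet:
                votes[center][predictor(left + center + right)] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

rules = extract_rules(black_box_predict)
```

Because the procedure only observes inputs and outputs, it would work unchanged for any window-based predictor, which is exactly the general-purpose property claimed for pedagogical methods; decompositional methods, by contrast, would have to inspect the weights directly.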

Chapter 5: Summary

In order to understand the function of a protein, it is important to know its structure. This report has dealt with the determination of protein structure using computational methods, especially AI techniques inspired by biological systems. The structure of a protein may be described at four major levels; protein structure prediction operates primarily at the levels of secondary and tertiary structure. The fundamental principle underlying all the methods is Anfinsen's hypothesis, experimentally justified, that the protein sequence contains sufficient information to specify the final 3-d structure [1]. While the problem remains largely unsolved, researchers have made good progress by resorting to various simplified models and trying various approaches. Some common simplifications are: focusing only on the residues rather than all the atoms in the protein; reducing the number of residue types by grouping residues based on physical properties such as hydrophobicity, as in HP models; and reducing the number of spatial degrees of freedom of the atoms or residues, for instance by restricting residue locations to lattices. Predicting secondary structure can also be seen as simplifying the 3-d problem by projecting the 3-d structure onto a 1-d string of secondary structural assignments, one per residue. The various approaches to the problem can be classified into three categories: knowledge-based, building the structure from knowledge of a good template structure; ab initio, building the structure from scratch using first principles; and structural features prediction. Each category has sub-divisions of approaches. A particular approach is chosen depending on the protein in question and the amount of data available, or on the research interest of the research group. In practice, knowledge-based prediction tools are more successful. Hybrid

approaches also perform well and are becoming a trend in PSP research. Currently, most ab initio methods work on simplified models and at the residue level; strictly speaking, they are therefore not practical full tertiary structure prediction methods. They are nevertheless important in the sense that a true solution to ab initio prediction would permit the rational design of novel proteins with novel functions. AI techniques have been applied in many approaches to the problem. The most notable are evolutionary computation in ab initio prediction and artificial neural networks in predicting secondary structure. In this report, we reviewed and analyzed the applications of three biologically inspired AI techniques to the PSP problem: evolutionary computing, ANNs, and L-systems. For each of these techniques, we presented a general framework for how it can be used for PSP, either directly or by discussing important components of the technique. The rationale for whether and why they are suitable for protein structure prediction was presented. We also discussed and compared significant studies published in recent years. Evolutionary algorithms are effective and generally applicable search techniques for hard problems for which analytical methods or good heuristics are not available. The PSP problem, when formulated as a search for an optimal conformation, is a good candidate for EAs. EAs explore an energy landscape for a minimal-energy conformation, which is believed to correspond to the native state. Three crucial components were addressed: a representation of structure geometry that translates the problem space into encodings usable for evolution; a potential energy function that can distinguish between favorable and unfavorable structures; and the specific variation and selection operators used to explore the conformational space.
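The three components just listed can be illustrated with a toy 2-D HP lattice sketch. This is not any published method: the move-string representation, the contact-counting energy, and the (1+1) mutate-and-select loop are the simplest possible instances of a representation, an energy function, and variation/selection operators.

```python
# Toy 2-D HP lattice model: representation (move string), energy function
# (H-H contacts), and a minimal (1+1) evolutionary search. Illustrative only.
import random

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold(moves):
    """Decode a move string into lattice coordinates; None if self-colliding."""
    pos, coords = (0, 0), [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in coords:                 # violates self-avoidance
            return None
        coords.append(pos)
    return coords

def energy(sequence, coords):
    """-1 for every non-consecutive H-H pair on adjacent lattice sites."""
    e = 0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            if sequence[i] == "H" and sequence[j] == "H":
                dx = abs(coords[i][0] - coords[j][0])
                dy = abs(coords[i][1] - coords[j][1])
                if dx + dy == 1:
                    e -= 1
    return e

def search(sequence, steps=2000, seed=0):
    """(1+1) evolutionary search: mutate one move, keep if not worse."""
    rng = random.Random(seed)
    n = len(sequence) - 1
    best = list("R" * n)                  # fully extended chain: always valid
    best_e = energy(sequence, fold(best))
    for _ in range(steps):
        cand = best[:]
        cand[rng.randrange(n)] = rng.choice("UDLR")
        coords = fold(cand)
        if coords is not None:            # discard self-colliding mutants
            e = energy(sequence, coords)
            if e <= best_e:               # selection: keep equal-or-better
                best, best_e = cand, e
    return "".join(best), best_e

conf, e = search("HPHPPHHPHH")
```

Even this crude operator set finds compact conformations for short sequences; the studies reviewed in Chapter 4 replace each component with far more sophisticated choices (internal coordinates, empirical force fields, crossover and niching), but the division of labour is the same.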
For structure representation and the energy function, large approximations are required because of the complexity of the problem. We addressed this issue and sampled from the research literature how various approximations are handled. Lindenmayer systems were presented as a novel generative encoding scheme to capture protein structure in lattice models. We introduced and analyzed the recent research along this line. L-system-based encoding has been tested in evolutionary algorithms with good preliminary results, but further research is needed to investigate its applicability to the PSP problem. For humans, a large memory of stored examples can serve as the basis for intelligent inference. For the PSP problem, ANNs infer meaningful relations between primary sequence and secondary structure from selected data sets. The learned relationships, although in a hidden

form, are then used to predict the structures of new sequences, with promising results. From the point of view of pattern recognition, secondary structure prediction can be seen as a classification task that assigns to each residue one of three (sometimes more) classes of conformational states. Various kinds of ANNs have been used for this task. We examined feed-forward networks based on local amino acid interactions; feed-forward networks based on evolutionary information; feed-back networks; and ANNs as combining classifiers. Among the many AI techniques that have been applied to the PSP problem, I have sampled only a few of those inspired by biological systems. Nature is still 'smarter' than humans. Maybe eventually we can successfully apply what we have learned from Nature to biological problems themselves?

Bibliography

[1] C.B. Anfinsen. Principles that govern the folding of protein chains. Science, 181: , 
[2] J. Augen. Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine. Addison Wesley, 
[3] P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. The MIT Press, 
[4] P. Baldi, S. Brunak, P. Frasconi, G. Soda and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15: , 
[5] M.J. Bayley, G. Jones, P. Willett and M.P. Williamson. GENFOLD: a genetic algorithm for folding protein structures using NMR restraints. Protein Sci, 7: , 
[6] B. Berger and T. Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J. Comp. Bio., 5: 27-40, 
[7] C. Branden and J. Tooze. Introduction to protein structure. Garland Publishing Inc., 2nd edition, 
[8] R. Casadio, E. Capriotti, M. Compiani, P. Fariselli, I. Jacoboni, P. Luigi, I. Rossi and G. Tasco. Neural networks and the prediction of protein structure. In Artificial intelligence and heuristic methods in bioinformatics, P. Frasconi and R. Shamir (eds.), IOS Press, 
[9] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. A combination of support vector machines and bidirectional recurrent neural networks for protein secondary structure prediction. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 
[10] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Predicting the disulfide bonding state of cysteines with combinations of kernel machines. Journal of VLSI Signal Processing, 35: , 2003.

[11] A. Ceroni, P. Frasconi, A. Passerini and A. Vullo. Cysteine bonding state: local prediction and global refinement using a combination of kernel machines and bidirectional recurrent neural networks. In AI*IA 2003: Advances in Artificial Intelligence, A. Cappelli and F. Turini (eds.), 
[12] J. Chandonia and M. Karplus. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Science, 5: , 
[13] J. Chen and N.S. Chaudhari. Capturing long-term dependencies for protein secondary structure prediction. In Advances in Neural Networks: Lecture Notes in Computer Science, Vol. 3174, Springer Verlag, 
[14] C. Chothia and A. Lesk. Relationship between the divergence of sequence and structure in proteins. EMBO Journal, 5: , 
[15] P.Y. Chou and U.D. Fasman. Prediction of protein conformation. Biochemistry, 13: , 
[16] J. Cohen. Bioinformatics: an introduction for computer scientists. ACM Computing Surveys, 36: , 
[17] W.D. Cornell et al. A second generation force field for the simulation of proteins and nucleic acids. J. Am. Chem. Soc., 117: , 
[18] C. Cotta. Protein structure prediction using evolutionary algorithms hybridized with backtracking. Artificial Neural Nets Problem Solving Methods, Lecture Notes in Computer Science, 2687: 
[19] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, and M. Yannakakis. On the complexity of protein folding. J. Comp. Bio., 5: , 
[20] V. Cutello, G. Narzisi and G. Nicosia. A multi-objective evolutionary approach to the protein structure prediction problem. J. R. Soc. Interface, doi: , 
[21] T. Dandekar and P. Argos. Potential of genetic algorithms in protein folding and protein engineering simulations. Protein Eng., 5: , 
[22] T. Dandekar and P. Argos. Folding the main chain of small proteins with the genetic algorithm. Journal of Molecular Biology, 236: , 
[23] T. Dandekar and P. Argos. Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and extended criteria specific for strand regions. Journal of Molecular Biology, 256: , 
[24] R. Day, J. Zydallis and G. Lamont. Solving the protein structure prediction problem through a multiobjective genetic algorithm. In Proc Computational Nanoscience and Nanotechnology Conference, 2002.

[25] K.A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24: 1501, 
[26] A.E. Eiben and J.E. Smith. Introduction to evolutionary computing. Springer, 
[27] G. Escuela, G. Ochoa and N. Krasnogor. Evolving L-systems to capture protein structure native conformations. In Proc Genetic Programming: 8th European Conference, 74-84, 
[28] V.A. Eyrich, M.A. Marti-Renom, D. Przybylski, M.S. Madhusudhan, A. Fiser, F. Pazos, A. Valencia, A. Sali and B. Rost. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17: , 
[29] P. Fariselli, P. Riccobelli, and R. Casadio. Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36: , 
[30] P. Fariselli and R. Casadio. Neural network based predictor of residue contact in proteins. Protein Engineering, 12: 15-21, 
[31] D. Fischer, D. Baker and J. Moult. We need both computer models and experiments (correspondence). Nature, 409: 558, 
[32] D. Fischer and D. Eisenberg. Fold recognition using sequence derived properties. Protein Science, 5: , 
[33] G.B. Fogel and D.W. Corne. Evolutionary Computation in Bioinformatics. Elsevier, 
[34] D.B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, 
[35] J. Gamalielsson and B. Olsson. Evaluating protein structure prediction models with evolutionary algorithms. In Information Processing with Evolutionary Algorithms, M. Grana, R. Duro, A. d'Anjou and P. Wang (eds.), Springer, 
[36] I. Halperin, B. Ma, H. Wolfson and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47: , 
[37] S.R. Holbrook, S.M. Muskal and S.H. Kim. Predicting surface exposure of amino acids from protein sequence. Protein Engineering, 3: , 
[38] J.H. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 
[39] B. Honig. Protein folding: from the Levinthal paradox to structure prediction. Journal of Molecular Biology, 293: , 1999.

[40] G. Hornby and J. Pollack. The advantages of generative grammatical encodings for physical design. Congress on Evolutionary Computation, 
[41] S. Hua and Z. Sun. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. Journal of Molecular Biology, 308: , 
[42] D.T. Jones, W.R. Taylor and J.M. Thornton. A new approach to protein fold recognition. Nature, 358: 86-89, 
[43] D.T. Jones. GenThreader: an efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287: , 
[44] M. Khimasia and P. Coveney. Protein structure prediction as a hard optimization problem: the genetic algorithm approach. Molecular Simulation, 19: , 
[45] R. King and M. Sternberg. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5: , 
[46] G. Kokai, Z. Toth and R. Vanyi. Modeling blood vessels of the eye with parametric L-systems using evolutionary algorithms. In Proc Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, 
[47] N. Krasnogor, D. Pelta, P.E. Lopez, and E. Canal. Genetic algorithm for the protein folding problem, a critical view. In Proc of Engineering of Intelligent Systems, 
[48] N. Krasnogor, W. Hart, J. Smith and D. Pelta. Protein structure prediction with evolutionary algorithms. In Proc Genetic and Evolutionary Computation Conference, 
[49] D.V. Laurents, S. Subbiah and M. Levitt. Different protein sequences can give rise to highly similar folds through different stabilizing interactions. Protein Science, 3: , 
[50] K. Lin, V. Simossis, W. Taylor and J. Heringa. A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21(2): , 
[51] A.D. MacKerell et al. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B, 102: , 
[52] J. Meiler and D. Baker. Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci, 100(21): , 
[53] S. Miyazawa and R.L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. Journal of Molecular Biology, 256: , 1996.


What s New in Discovery Studio 2.5.5 What s New in Discovery Studio 2.5.5 Discovery Studio takes modeling and simulations to the next level. It brings together the power of validated science on a customizable platform for drug discovery research.

More information

Protein Folding Problem I400: Introduction to Bioinformatics

Protein Folding Problem I400: Introduction to Bioinformatics Protein Folding Problem I400: Introduction to Bioinformatics November 29, 2004 Protein biomolecule, macromolecule more than 50% of the dry weight of cells is proteins polymer of amino acids connected into

More information

Gene Identification in silico

Gene Identification in silico Gene Identification in silico Nita Parekh, IIIT Hyderabad Presented at National Seminar on Bioinformatics and Functional Genomics, at Bioinformatics centre, Pondicherry University, Feb 15 17, 2006. Introduction

More information

2/23/16. Protein-Protein Interactions. Protein Interactions. Protein-Protein Interactions: The Interactome

2/23/16. Protein-Protein Interactions. Protein Interactions. Protein-Protein Interactions: The Interactome Protein-Protein Interactions Protein Interactions A Protein may interact with: Other proteins Nucleic Acids Small molecules Protein-Protein Interactions: The Interactome Experimental methods: Mass Spec,

More information

Assessing a novel approach for predicting local 3D protein structures from sequence

Assessing a novel approach for predicting local 3D protein structures from sequence Assessing a novel approach for predicting local 3D protein structures from sequence Cristina Benros*, Alexandre G. de Brevern, Catherine Etchebest and Serge Hazout Equipe de Bioinformatique Génomique et

More information

MATH 5610, Computational Biology

MATH 5610, Computational Biology MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24 Announcements Error on syllabus Class

More information

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Overview This lecture will

More information

Textbook Reading Guidelines

Textbook Reading Guidelines Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum Last updated: January 16, 2013 Textbook Reading Guidelines Preface: Read the whole preface, and especially: For the students with Life Science

More information

Protein Structure. Protein Structure Tertiary & Quaternary

Protein Structure. Protein Structure Tertiary & Quaternary Lecture 4 Protein Structure Protein Structure Tertiary & Quaternary Dr. Sameh Sarray Hlaoui Primary structure: The linear sequence of amino acids held together by peptide bonds. Secondary structure: The

More information

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid 3D Structure Prediction with Fold Recognition/Threading Michael Tress CNB-CSIC, Madrid MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSY RKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFAL VYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDL

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics If the 19 th century was the century of chemistry and 20 th century was the century of physic, the 21 st century promises to be the century of biology...professor Dr. Satoru

More information

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks

Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian Neural Networks Protein Domain Boundary Prediction from Residue Sequence Alone using Bayesian s DAVID SACHEZ SPIROS H. COURELLIS Department of Computer Science Department of Computer Science California State University

More information

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS*

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS* COMPUTATIONAL METHODS IN SCIENCE AND TECHNOLOGY 9(1-2) 93-100 (2003/2004) Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS* DARIUSZ PLEWCZYNSKI AND LESZEK RYCHLEWSKI BiolnfoBank

More information

Data Mining for Biological Data Analysis

Data Mining for Biological Data Analysis Data Mining for Biological Data Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Data Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Jiawei Han

More information

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction Sequence Analysis '17 -- lecture 16 1. Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction Alpha helix Right-handed helix. H-bond is from the oxygen at i to the nitrogen

More information

CMSE 520 BIOMOLECULAR STRUCTURE, FUNCTION AND DYNAMICS

CMSE 520 BIOMOLECULAR STRUCTURE, FUNCTION AND DYNAMICS CMSE 520 BIOMOLECULAR STRUCTURE, FUNCTION AND DYNAMICS (Computational Structural Biology) OUTLINE Review: Molecular biology Proteins: structure, conformation and function(5 lectures) Generalized coordinates,

More information

BIOLOGY 200 Molecular Biology Students registered for the 9:30AM lecture should NOT attend the 4:30PM lecture.

BIOLOGY 200 Molecular Biology Students registered for the 9:30AM lecture should NOT attend the 4:30PM lecture. BIOLOGY 200 Molecular Biology Students registered for the 9:30AM lecture should NOT attend the 4:30PM lecture. Midterm date change! The midterm will be held on October 19th (likely 6-8PM). Contact Kathy

More information

Genetic Algorithms For Protein Threading

Genetic Algorithms For Protein Threading From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Genetic Algorithms For Protein Threading Jacqueline Yadgari #, Amihood Amir #, Ron Unger* # Department of Mathematics

More information

Poster Project Extended Report: Protein Folding and Computational Techniques Blake Boling. Abstract. Introduction

Poster Project Extended Report: Protein Folding and Computational Techniques Blake Boling. Abstract. Introduction Poster Project Extended Report: Protein Folding and Computational Techniques Blake Boling Abstract One of the goals of biocomputing is to understand how proteins fold so that we may be able to predict

More information

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics A very coarse introduction to bioinformatics In this exercise, you will get a quick primer on how DNA is used to manufacture proteins. You will learn a little bit about how the building blocks of these

More information

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming Lecture 2: Central Dogma of Molecular Biology & Intro to Programming Central Dogma of Molecular Biology Proteins: workhorse molecules of biological systems Proteins are synthesized from the genetic blueprints

More information

Nanobiotechnology. Place: IOP 1 st Meeting Room Time: 9:30-12:00. Reference: Review Papers. Grade: 50% midterm, 50% final.

Nanobiotechnology. Place: IOP 1 st Meeting Room Time: 9:30-12:00. Reference: Review Papers. Grade: 50% midterm, 50% final. Nanobiotechnology Place: IOP 1 st Meeting Room Time: 9:30-12:00 Reference: Review Papers Grade: 50% midterm, 50% final Midterm: 5/15 History Atom Earth, Air, Water Fire SEM: 20-40 nm Silver 66.2% Gold

More information

Introduction to Proteins

Introduction to Proteins Introduction to Proteins Lecture 4 Module I: Molecular Structure & Metabolism Molecular Cell Biology Core Course (GSND5200) Matthew Neiditch - Room E450U ICPH matthew.neiditch@umdnj.edu What is a protein?

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Dr. Taysir Hassan Abdel Hamid Lecturer, Information Systems Department Faculty of Computer and Information Assiut University taysirhs@aun.edu.eg taysir_soliman@hotmail.com

More information

Bioinformatics. ONE Introduction to Biology. Sami Khuri Department of Computer Science San José State University Biology/CS 123A Fall 2012

Bioinformatics. ONE Introduction to Biology. Sami Khuri Department of Computer Science San José State University Biology/CS 123A Fall 2012 Bioinformatics ONE Introduction to Biology Sami Khuri Department of Computer Science San José State University Biology/CS 123A Fall 2012 Biology Review DNA RNA Proteins Central Dogma Transcription Translation

More information

8/21/2014. From Gene to Protein

8/21/2014. From Gene to Protein From Gene to Protein Chapter 17 Objectives Describe the contributions made by Garrod, Beadle, and Tatum to our understanding of the relationship between genes and enzymes Briefly explain how information

More information

6-Foot Mini Toober Activity

6-Foot Mini Toober Activity Big Idea The interaction between the substrate and enzyme is highly specific. Even a slight change in shape of either the substrate or the enzyme may alter the efficient and selective ability of the enzyme

More information

BIOINFORMATICS THE MACHINE LEARNING APPROACH

BIOINFORMATICS THE MACHINE LEARNING APPROACH 88 Proceedings of the 4 th International Conference on Informatics and Information Technology BIOINFORMATICS THE MACHINE LEARNING APPROACH A. Madevska-Bogdanova Inst, Informatics, Fac. Natural Sc. and

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/19/07 CAP5510 1 HMM for Sequence Alignment 2/19/07 CAP5510 2

More information

Protein Structure Databases, cont. 11/09/05

Protein Structure Databases, cont. 11/09/05 11/9/05 Protein Structure Databases (continued) Prediction & Modeling Bioinformatics Seminars Nov 10 Thurs 3:40 Com S Seminar in 223 Atanasoff Computational Epidemiology Armin R. Mikler, Univ. North Texas

More information

Tutorial. Visualize Variants on Protein Structure. Sample to Insight. November 21, 2017

Tutorial. Visualize Variants on Protein Structure. Sample to Insight. November 21, 2017 Visualize Variants on Protein Structure November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com AdvancedGenomicsSupport@qiagen.com

More information

An Overview of Protein Structure Prediction: From Homology to Ab Initio

An Overview of Protein Structure Prediction: From Homology to Ab Initio An Overview of Protein Structure Prediction: From Homology to Ab Initio Final Project For Bioc218, Computational Molecular Biology Zhiyong Zhang Abstract The current status of the protein prediction methods,

More information

Protein Structure/Function Relationships

Protein Structure/Function Relationships Protein Structure/Function Relationships W. M. Grogan, Ph.D. OBJECTIVES 1. Describe and cite examples of fibrous and globular proteins. 2. Describe typical tertiary structural motifs found in proteins.

More information

produces an RNA copy of the coding region of a gene

produces an RNA copy of the coding region of a gene 1. Transcription Gene Expression The expression of a gene into a protein occurs by: 1) Transcription of a gene into RNA produces an RNA copy of the coding region of a gene the RNA transcript may be the

More information

Computers in Biology and Bioinformatics

Computers in Biology and Bioinformatics Computers in Biology and Bioinformatics 1 Biology biology is roughly defined as "the study of life" it is concerned with the characteristics and behaviors of organisms, how species and individuals come

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Outline Central Dogma of Molecular

More information

Big picture and history

Big picture and history Big picture and history (and Computational Biology) CS-5700 / BIO-5323 Outline 1 2 3 4 Outline 1 2 3 4 First to be databased were proteins The development of protein- s (Sanger and Tuppy 1951) led to the

More information

Bioinformatics : Gene Expression Data Analysis

Bioinformatics : Gene Expression Data Analysis 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering What is Bioinformatics Broad Definition The study of how information technologies are used

More information

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Protein Sequence Analysis BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical

More information

Exploring Similarities of Conserved Domains/Motifs

Exploring Similarities of Conserved Domains/Motifs Exploring Similarities of Conserved Domains/Motifs Sotiria Palioura Abstract Traditionally, proteins are represented as amino acid sequences. There are, though, other (potentially more exciting) representations;

More information

BIRKBECK COLLEGE (University of London)

BIRKBECK COLLEGE (University of London) BIRKBECK COLLEGE (University of London) SCHOOL OF BIOLOGICAL SCIENCES M.Sc. EXAMINATION FOR INTERNAL STUDENTS ON: Postgraduate Certificate in Principles of Protein Structure MSc Structural Molecular Biology

More information

Nagahama Institute of Bio-Science and Technology. National Institute of Genetics and SOKENDAI Nagahama Institute of Bio-Science and Technology

Nagahama Institute of Bio-Science and Technology. National Institute of Genetics and SOKENDAI Nagahama Institute of Bio-Science and Technology A Large-scale Batch-learning Self-organizing Map for Function Prediction of Poorly-characterized Proteins Progressively Accumulating in Sequence Databases Project Representative Toshimichi Ikemura Authors

More information

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction

RNA Structure Prediction. Algorithms in Bioinformatics. SIGCSE 2009 RNA Secondary Structure Prediction. Transfer RNA. RNA Structure Prediction Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri RNA Structure Prediction Secondary

More information

CSC 121 Computers and Scientific Thinking

CSC 121 Computers and Scientific Thinking CSC 121 Computers and Scientific Thinking Fall 2005 Computers in Biology and Bioinformatics 1 Biology biology is roughly defined as "the study of life" it is concerned with the characteristics and behaviors

More information

Feature Selection for Predictive Modelling - a Needle in a Haystack Problem

Feature Selection for Predictive Modelling - a Needle in a Haystack Problem Paper AB07 Feature Selection for Predictive Modelling - a Needle in a Haystack Problem Munshi Imran Hossain, Cytel Statistical Software & Services Pvt. Ltd., Pune, India Sudipta Basu, Cytel Statistical

More information

CELL BIOLOGY: DNA. Generalized nucleotide structure: NUCLEOTIDES: Each nucleotide monomer is made up of three linked molecules:

CELL BIOLOGY: DNA. Generalized nucleotide structure: NUCLEOTIDES: Each nucleotide monomer is made up of three linked molecules: BIOLOGY 12 CELL BIOLOGY: DNA NAME: IMPORTANT FACTS: Nucleic acids are organic compounds found in all living cells and viruses. Two classes of nucleic acids: 1. DNA = ; found in the nucleus only. 2. RNA

More information

PROTEINS & NUCLEIC ACIDS

PROTEINS & NUCLEIC ACIDS Chapter 3 Part 2 The Molecules of Cells PROTEINS & NUCLEIC ACIDS Lecture by Dr. Fernando Prince 3.11 Nucleic Acids are the blueprints of life Proteins are the machines of life We have already learned that

More information

Computational methods in bioinformatics: Lecture 1

Computational methods in bioinformatics: Lecture 1 Computational methods in bioinformatics: Lecture 1 Graham J.L. Kemp 2 November 2015 What is biology? Ecosystem Rain forest, desert, fresh water lake, digestive tract of an animal Community All species

More information

COMP364. Introduction to RNA secondary structure prediction. Jérôme Waldispühl School of Computer Science, McGill

COMP364. Introduction to RNA secondary structure prediction. Jérôme Waldispühl School of Computer Science, McGill COMP364 Introduction to RNA secondary structure prediction Jérôme Waldispühl School of Computer Science, McGill RNA world In prebiotic world, RNA thought to have filled two distinct roles: 1. an information

More information

Ligand docking. CS/CME/Biophys/BMI 279 Oct. 22 and 27, 2015 Ron Dror

Ligand docking. CS/CME/Biophys/BMI 279 Oct. 22 and 27, 2015 Ron Dror Ligand docking CS/CME/Biophys/BMI 279 Oct. 22 and 27, 2015 Ron Dror 1 Outline Goals of ligand docking Defining binding affinity (strength) Computing binding affinity: Simplifying the problem Ligand docking

More information

PREDICTION OF PROTEIN TERTIARY STRUCTURE USING CROSS VALIDATION TECHNIQUE

PREDICTION OF PROTEIN TERTIARY STRUCTURE USING CROSS VALIDATION TECHNIQUE PREDICTION OF PROTEIN TERTIARY STRUCTURE USING CROSS VALIDATION TECHNIQUE Manish Kumar 1, Hari Om 2 1,2 Department of CSE, IIT(ISM), Dhanbad, (India) ABSTRACT Owing to the strict relationship between protein

More information

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen Homology Modelling Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen Why are Protein Structures so Interesting? They provide a detailed picture of interesting biological features,

More information

Chemistry 1050 Exam 4 Study Guide

Chemistry 1050 Exam 4 Study Guide Chapter 19 Chemistry 1050 Exam 4 Study Guide 19.1 and 19.2 Know there are 20 common amino acids that can polymerize into proteins. Know why amino acids are called alpha amino acids. Identify the charges

More information

Chemistry 1120 Exam 3 Study Guide

Chemistry 1120 Exam 3 Study Guide Chemistry 1120 Exam 3 Study Guide Chapter 9 9.1 and 9.2 Know there are 20 common amino acids that can polymerize into proteins. Know why amino acids are called alpha amino acids. Identify the charges of

More information

Bi 8 Lecture 7. Ellen Rothenberg 26 January Reading: Ch. 3, pp ; panel 3-1

Bi 8 Lecture 7. Ellen Rothenberg 26 January Reading: Ch. 3, pp ; panel 3-1 Bi 8 Lecture 7 PROTEIN STRUCTURE, Functional analysis, and evolution Ellen Rothenberg 26 January 2016 Reading: Ch. 3, pp. 109-134; panel 3-1 (end with free amine) aromatic, hydrophobic small, hydrophilic

More information

Protein Tertiary Model Assessment Using Granular Machine Learning Techniques

Protein Tertiary Model Assessment Using Granular Machine Learning Techniques Georgia State University ScholarWorks @ Georgia State University Computer Science Dissertations Department of Computer Science 3-21-2012 Protein Tertiary Model Assessment Using Granular Machine Learning

More information

ABSTRACT COMPUTER EVOLUTION OF GENE CIRCUITS FOR CELL- EMBEDDED COMPUTATION, BIOTECHNOLOGY AND AS A MODEL FOR EVOLUTIONARY COMPUTATION

ABSTRACT COMPUTER EVOLUTION OF GENE CIRCUITS FOR CELL- EMBEDDED COMPUTATION, BIOTECHNOLOGY AND AS A MODEL FOR EVOLUTIONARY COMPUTATION ABSTRACT COMPUTER EVOLUTION OF GENE CIRCUITS FOR CELL- EMBEDDED COMPUTATION, BIOTECHNOLOGY AND AS A MODEL FOR EVOLUTIONARY COMPUTATION by Tommaso F. Bersano-Begey Chair: John H. Holland This dissertation

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics http://1.51.212.243/bioinfo.html Dr. rer. nat. Jing Gong Cancer Research Center School of Medicine, Shandong University 2011.10.19 1 Chapter 4 Structure 2 Protein Structure

More information

Nucleic Acids, Proteins, and Enzymes

Nucleic Acids, Proteins, and Enzymes 3 Nucleic Acids, Proteins, and Enzymes Chapter 3 Nucleic Acids, Proteins, and Enzymes Key Concepts 3.1 Nucleic Acids Are Informational Macromolecules 3.2 Proteins Are Polymers with Important Structural

More information

Two Mark question and Answers

Two Mark question and Answers 1. Define Bioinformatics Two Mark question and Answers Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three

More information

Protein 3D Structure Prediction

Protein 3D Structure Prediction Protein 3D Structure Prediction Michael Tress CNIO ?? MREYKLVVLGSGGVGKSALTVQFVQGIFVDE YDPTIEDSYRKQVEVDCQQCMLEILDTAGTE QFTAMRDLYMKNGQGFALVYSITAQSTFNDL QDLREQILRVKDTEDVPMILVGNKCDLEDER VVGKEQGQNLARQWCNCAFLESSAKSKINVN

More information

Predicting Protein Structure and Examining Similarities of Protein Structure by Spectral Analysis Techniques

Predicting Protein Structure and Examining Similarities of Protein Structure by Spectral Analysis Techniques Predicting Protein Structure and Examining Similarities of Protein Structure by Spectral Analysis Techniques, Melanie Abeysundera, Krista Collins, Chris Field Department of Mathematics and Statistics Dalhousie

More information

8. Bionanoscience. - But conventional biochemistry of course deals with the same kind of size scale, so defining bionano is quite difficult...

8. Bionanoscience. - But conventional biochemistry of course deals with the same kind of size scale, so defining bionano is quite difficult... 8. Bionanoscience [Poole-Owens 12; Niemeyer, Mirkin: Nanobiotechnology] The controlled design of biological molecules on the nanometer scale to achieve desired structures or functionalities can be called

More information

Computational gene finding

Computational gene finding Computational gene finding Devika Subramanian Comp 470 Outline (3 lectures) Lec 1 Lec 2 Lec 3 The biological context Markov models and Hidden Markov models Ab-initio methods for gene finding Comparative

More information

VALLIAMMAI ENGINEERING COLLEGE

VALLIAMMAI ENGINEERING COLLEGE VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER BM6005 BIO INFORMATICS Regulation 2013 Academic Year 2018-19 Prepared

More information

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University Bioinformatics: Sequence Analysis COMP 571 Luay Nakhleh, Rice University Course Information Instructor: Luay Nakhleh (nakhleh@rice.edu); office hours by appointment (office: DH 3119) TA: Leo Elworth (DH

More information