Poster Project Extended Report: Protein Folding and Computational Techniques
Blake Boling

Abstract

One of the goals of biocomputing is to understand how proteins fold, so that the folding of an amino acid sequence can be predicted, whether to design novel drugs or to develop therapies for existing diseases caused by misfolding. In principle, it is possible to predict the movement of every atom within a protein molecule in order to predict the protein's native conformation. However, this approach consumes vast amounts of computational power, and the cost grows with the size of the molecule. Furthermore, many other variables must be accounted for, such as the micro-environment of the molecule. Currently, we cannot predict the movement of every atom within a protein, because most proteins are very large molecules and available computational power is limited. The goal of this paper is to introduce some of the computational approaches used to study and predict the conformation of proteins given today's technological limits.

Introduction

All proteins are unbranched linear polymers of amino acid monomers. It is the amino acid sequence and its three-dimensional structure that make each protein unique, physically and functionally (Cohen-Gonsaud et al., 2004). In fact, there is a strict relationship between protein function and structure (Armano et al., 2005). Protein folding is the process in which an amino acid sequence assumes a functional three-dimensional structure by spontaneous folding and coiling. The shape of the molecule is an extremely important component of its function. If the protein is not shaped correctly, it will not function properly. A protein's misconformation (also known as misfolding) can sometimes be the origin of human and animal disease.
Sickle-cell anemia and cystic fibrosis are serious human diseases that can result from the change or loss of a single amino acid within a protein, which leads to a critical change in overall structure (Berg et al., 2002).

Amino acids are linked together by peptide bonds to form polypeptide chains. The sequence of amino acids is specified by the organism's genes. This amino acid sequence is called the primary structure, and it governs the protein's native conformation, in other words, the structure the protein spontaneously attains in its natural environment.

The concept of spontaneity is crucial. There is an enormous difference between calculated folding times and actual folding times. A small protein would take about 1.6 x 10^27 years to find its correct conformation if it had to search through every possible conformation one at a time. However, the actual folding process is virtually instantaneous by comparison. This is called Levinthal's paradox. The amino acid sequence specifies not only the protein's native conformation but also the pathway to attain this state.

A landmark event in biochemistry was the work of Frederick Sanger, who in 1953 determined the amino acid sequence of insulin. His work showed for the first time that a protein has a precisely defined amino acid sequence. The precise amino acid sequence of over 100,000 proteins is now known (Berg et al., 2002). The complementary work of Christian Anfinsen helped to establish a central principle of biochemistry: sequence specifies conformation. He showed that denatured ribonuclease can refold into its native conformation.
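The Levinthal estimate quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: this report does not state the inputs behind the figure, so the values used here (100 residues, 3 conformations per residue, 10^-13 seconds to sample one conformation) are assumptions taken from the usual textbook form of the argument, and the function name is hypothetical.

```python
# Back-of-the-envelope Levinthal estimate.
# All default inputs are assumptions, not values given in this report.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(n_residues=100, states_per_residue=3, seconds_per_state=1e-13):
    """Years needed to visit every conformation once, one after another."""
    n_conformations = states_per_residue ** n_residues
    return n_conformations * seconds_per_state / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{levinthal_years():.1e} years")
```

Under these assumed inputs the result is on the order of 10^27 years, which is why an exhaustive random search cannot be how real proteins find their native conformation.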

The second step in the folding process is the formation of the basic architectural structures: alpha helices and beta sheets. The formation of these structures, together with reversals in the polypeptide chain called beta turns and omega loops, gives the protein its secondary structure. Only after the secondary structure is established can the protein assume its tertiary structure, which might be considered the protein in its folded state. The final assemblage is the quaternary structure, which is the overall three-dimensional arrangement of all the folded polypeptide subunits assembled together.

Hydrogen bonds between amino acids shape the sequence into alpha helices and beta sheets. Covalent bonds between two cysteine residues (disulfide bonds), as well as electrostatic interactions between residue groups, such as hydrogen bonds and van der Waals interactions, are the forces that hold the folded protein in its native conformation. In particular, the electrostatic forces are responsible for guiding the protein to its native conformation. Electrostatic forces also underlie a very important effect that shapes many proteins: hydrophilicity and hydrophobicity. For instance, hydrophobic groups within a protein will turn toward each other to exclude water.

A protein's micro-environment governs its shape as well. For instance, extreme temperature, high solute concentrations, and extreme pH can denature a protein, meaning that the bonds stabilizing the fold are disrupted and secondary and tertiary structure are lost. Furthermore, as mentioned earlier, hydrophilic or hydrophobic groups on a protein help determine its conformation by responding to the aqueous state of the protein's micro-environment.

The misconformation of proteins is the basis of diseases such as Creutzfeldt-Jakob disease (CJD), bovine spongiform encephalopathy (mad cow disease), sickle-cell anemia, cystic fibrosis, and Alzheimer's disease.
CJD and mad cow disease involve misshapen proteins called prions. Prions are brain proteins that cause the diseased state when converted from their normal native conformation (designated PrP^C) to a misshapen version (designated PrP^Sc). Prions are infectious proteins: the conversion of PrP^C to PrP^Sc is self-propagating, meaning that large aggregates of malformed prions can accumulate within brain tissue if one molecule of PrP^Sc is introduced into the body. The role of these aggregates, often called plaques, is not yet fully understood. Alzheimer's disease is also associated with the accumulation of plaques in brain tissue, which in the case of Alzheimer's are referred to as amyloid plaques. It is not known whether these plaques are the cause or a symptom of the disease (Berg et al., 2002).

One of the goals of biocomputing is to predict the native three-dimensional structure of a protein given limited information. A number of factors make protein structure prediction a very difficult task, including:
1) The number of possible structures that a protein may adopt is extremely large, as highlighted by the Levinthal paradox.
2) The physical basis of protein structural stability is not fully understood.
3) The primary sequence may not fully specify the tertiary structure. For example, proteins known as chaperones have the ability to induce proteins to fold in specific ways.
4) Direct simulation of protein folding via methods such as molecular dynamics is not generally reliable, for both practical and theoretical reasons (Samudrala, 2000).
However, the distributed computing project Folding@home is tackling such simulation difficulties. Because of the complexity of the problem, the role of computers in this endeavor is extremely important. Researchers in the traditional fields of X-ray crystallography and nuclear magnetic resonance cannot keep pace with the huge numbers of new genes constantly being discovered.
Genome programs are churning out new sequences by the hundreds every week (Morton, 2001). Determining the structure of the proteins that those genes describe takes far longer and costs far more. The gene must be engineered into a microorganism in the hope that it makes the protein described, and the protein must then be purified and its structure determined via X-ray crystallography or nuclear magnetic resonance. It is hoped that computational techniques can eventually help turn the floods of new genome data into information we can use to make new drugs. In terms of drug design, libraries of chemical compounds are growing rapidly, while the structural, thermodynamic, and dynamic characterization of ligand-macromolecule complexes is still tedious and difficult. In silico methods must be developed in the field of pharmacogenomics, or drug design will be significantly delayed (Cohen-Gonsaud et al., 2004).

Results

A number of factors make protein structure prediction a very difficult task, including the fact that direct simulation of protein folding via methods such as molecular dynamics is not generally reliable, for both practical and theoretical reasons. However, distributed computing projects (e.g. Folding@home) are tackling such simulation difficulties. Traditionally, the main approaches to determining protein structure have been experimental, using X-ray crystallography and nuclear magnetic resonance. Computational techniques used for the prediction of protein folding employ two major approaches: ab initio methods and comparative protein modeling (Cohen-Gonsaud et al., 2004). Furthermore, computational simulations of model proteins attempt to simulate the folding of proteins on an atom-by-atom basis. Distributed computing projects such as Folding@home, Predictor@home, and the Human Proteome Folding Project have dramatically increased the computing power available to researchers without access to supercomputers.

The goal of protein structure prediction is to determine the three-dimensional structure of proteins from their amino acid sequence. One approach is to use comparative protein modeling, also called homology modeling. This approach is knowledge based.
An interesting aspect of comparative protein modeling is that it uses previously solved structures as starting points. Surprisingly, there is a limited set of tertiary structural motifs to which most proteins belong, even though the number of actual proteins is vast. There are probably about 2000 distinct protein folding motifs in nature, which we can use as templates (Wikipedia, 2005). A protein of known sequence but unknown structure can be compared to other proteins of known structure. When a match is found, the known structure of the matching protein is used as a starting point, or model, to help elucidate the structure of the target sequence. There are about 20,000 protein structures in the Protein Data Bank, and about 1,000,000 protein sequences have been deposited into the SwissProt/TrEMBL databases (Pevsner, 2003). Obviously, there is a great need for an efficient approach to structure elucidation. Traditional experimental approaches are far too time consuming.

Comparative modeling consists of five sequential steps: (1) identify related structures, (2) select templates, (3) align the target sequence (unknown structure) with the template structures, (4) build a model for the target, and (5) evaluate the model (Bernasconi and Segre, 2000).

There are two methods of comparative protein modeling: homology modeling and protein threading. Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence with the structure, thus yielding possible three-dimensional models. This amounts to a best-fit search.
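The best-fit idea behind threading can be sketched with a deliberately tiny toy. Everything in this example is an assumption made for illustration and is not taken from any real threading program: templates are reduced to strings of buried ("B") and exposed ("E") positions, the scoring function only rewards hydrophobic residues in buried positions and other residues in exposed positions, no gaps are allowed, and the residue set, template names, and function names are hypothetical.

```python
# Toy threading: score a sequence against a small library of templates.
# Assumption: a template is just a string of 'B' (buried) / 'E' (exposed) sites.
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set


def position_score(residue, environment):
    """+1 if the residue suits the site (hydrophobic in 'B', other in 'E'), else -1."""
    is_hydrophobic = residue in HYDROPHOBIC
    wants_hydrophobic = environment == "B"
    return 1 if is_hydrophobic == wants_hydrophobic else -1


def thread(sequence, template):
    """Best ungapped placement of the sequence on the template as (score, offset)."""
    best = None
    for offset in range(len(template) - len(sequence) + 1):
        window = template[offset:offset + len(sequence)]
        score = sum(position_score(r, e) for r, e in zip(sequence, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best  # None when the template is shorter than the sequence


def rank_templates(sequence, library):
    """Score the sequence against every template, best fit first."""
    results = []
    for name, template in library.items():
        placement = thread(sequence, template)
        if placement is not None:
            results.append((placement[0], name, placement[1]))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    library = {"fold_a": "EEBBEEBB", "fold_b": "BEBEBEBE"}  # hypothetical folds
    for score, name, offset in rank_templates("KKLLDDVV", library):
        print(name, score, offset)
```

Real threading methods use far richer scoring functions and allow gaps in the alignment; the sketch only shows the overall shape of the procedure, which is to score one sequence against many known structures and keep the best fit.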

Homology is a principle based on evolutionary relationships. It is an important concept in bioinformatics, because it is used to predict the function of a gene as well as the structure of an amino acid sequence. For instance, if the function of gene A is known and gene A is homologous to gene B, then we can infer that gene B might have a similar function to gene A. This idea can also be used to predict the structure of a protein if the structure of a homologous protein is known. This technique is called homology modeling. Currently this remains the only way to predict protein structures reliably. It has an accuracy similar to that of a low-resolution experimentally determined structure (Pevsner, 2003).

Homology modeling is facilitated by the fact that the three-dimensional structure of proteins from the same family is more conserved than their primary sequences. Furthermore, a small change in the protein sequence usually results in a small change in its 3D structure. For example, human hemoglobin is homologous to leghemoglobin (hemoglobin in legumes). The amino acid sequences of these two proteins are very different, but the proteins have very similar functions (Pevsner, 2003), and they are also structurally very similar. If proteins are similar at the sequence level, then structural similarity can usually be assumed.

There are several computer programs and web sites that facilitate the comparative modeling process. Some of them are Swiss-Model, CPHModels, SDSC1, FAMS, and ModWeb. These servers accept an amino acid sequence as input from a user and return an all-atom comparative model when possible.

Another approach to protein structure prediction is de novo structure prediction, also called ab initio prediction. This approach attempts to establish the three-dimensional structure of a protein from scratch. In other words, it attempts to predict a protein's structure only from its amino acid sequence and without reference to any other known protein structures.
This approach requires vast computational resources for all but the smallest proteins. It uses stochastic methods to search possible solutions, which are evaluated according to the free energy of the resulting structure. Finding the structure with the lowest free energy is the key element of this approach (Wikipedia, 2005). The thermodynamic hypothesis states that the native conformation of a protein is the one for which the free energy reaches its global minimum. There are two components to ab initio prediction: devising a scoring function and devising a search method to explore the conformational space (Pevsner, 2003). The approach uses an energy function to distinguish native-like structures from nonnative-like structures, and this function must consider interactions between all pairs of atoms in the polypeptide chain (Pevsner, 2003). The number of pairwise interactions grows quadratically with sequence length, and the number of possible conformations grows exponentially. Therefore, this is a computationally intensive and complex process. Furthermore, a computational model using this approach would have to take into consideration all of the variables associated with the molecule's micro-environment. Obviously, the inherent complexity of such a model necessitates simplification, and there are many limits to its practicality. Because of the vast number of possible conformations and influencing variables, the computing power that this approach requires is enormous. It requires the computational power of supercomputers or the resources of a distributed computing network.

A distributed computing project is a vast computing network that employs the unused CPU cycles of personal computers worldwide to analyze scientific data. The two distributed computing projects Folding@home, within Stanford University's Chemistry Department, and Predictor@home, run by the Scripps Research Institute, are designed to perform computationally intensive simulations of protein folding. SETI@home is another interesting distributed computing project, used to analyze data received by the Arecibo radio telescope in the hope of finding extraterrestrial intelligence. Folding@home is the second largest distributed computing project after SETI@home (Wikipedia, 2005). The Human Proteome Folding Project is a part of the World Community Grid, which is the world's largest public computing grid and is run by IBM (Wikipedia, 2005). Like the @home projects, the World Community Grid pools CPU cycles from around the world. However, it is not targeted toward a single project; it supports many different humanitarian projects.

Folding@home and Predictor@home are similar, but their aims differ: Folding@home aims to study the dynamics of protein folding, whereas Predictor@home aims to specify what the final tertiary structure will be. Predictor@home uses BOINC, while Folding@home uses its own infrastructure but is making a transition to BOINC (Wikipedia, 2005). BOINC (Berkeley Open Infrastructure for Network Computing) is a distributed computing infrastructure originally developed for the SETI@home project. It is now being used in other fields such as biocomputing.

Computational simulations of model proteins use highly simplified computer models of proteins called lattice proteins (Wikipedia, 2005). These simplified models are employed because most proteins are too large for current technology to simulate folding on an atom-by-atom basis. A lattice protein is a simplified molecule in which each amino acid is treated as a bead, in other words, a single functional unit. This simplification allows for structural prediction without having to predict the movement of every atom within a protein molecule. The most stable thermodynamic state is predicted based upon the movements of the simplified functional units.
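The two ingredients named above, a scoring function and a search method, can be illustrated on a lattice protein. The sketch below assumes one common simplification that this report does not specify: each bead is either hydrophobic ("H") or polar ("P"), the chain lives on a 2D square grid, and the energy is -1 for every pair of H beads that sit next to each other on the grid without being neighbours along the chain. The search is a plain exhaustive enumeration, usable only for very short chains, and all function names are hypothetical.

```python
# Toy lattice protein: H/P beads on a 2D square grid (assumed model).
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def energy(sequence, coords):
    """Scoring function: -1 per H-H grid contact between non-consecutive beads."""
    position = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = position.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips chain neighbours
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def fold(sequence):
    """Search method: try every self-avoiding walk, keep the lowest energy found."""
    if not sequence:
        return (0, [])
    best = (1, None)  # any real conformation has energy <= 0

    def extend(coords, occupied):
        nonlocal best
        if len(coords) == len(sequence):
            e = energy(sequence, coords)
            if e < best[0]:
                best = (e, list(coords))
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best


if __name__ == "__main__":
    lowest, conformation = fold("HHPPHH")
    print(lowest, conformation)
```

Because the number of walks grows exponentially with chain length, exhaustive enumeration stops being practical very quickly. That is the reason stochastic search methods are used instead, even for simplified models like this one.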
References

Armano G, Mancosu G, Milanesi L, Orro A, Saba M, Vargiu E (2005) A Hybrid Genetic-Neural System for Predicting Protein Secondary Structure. BMC Bioinformatics 6(Suppl 4):S3
Berg JM, Tymoczko JL, Stryer L (2002) Biochemistry, 5th ed. W.H. Freeman, New York
Bernasconi A, Segre AM (2000) Ab Initio Methods for Protein Structure Prediction: A New Technique based on Ramachandran Plots. ERCIM News 43:13-14
Cohen-Gonsaud M, Catherinot V, Labesse G, Douguet D (2004) From Molecular Modeling to Drug Design. In JM Bujnicki, ed, Practical Bioinformatics, Vol 15. Springer, New York
Morton O (2001) Gene Machine. In Wired, Vol 9
Pevsner J (2003) Bioinformatics and Functional Genomics. Wiley-Liss, Inc., Hoboken, N.J.
Samudrala R. Protein Folding and Protein Structure Prediction. (Accessed: 12/7/2005).
Wikipedia. Folding@home. (Accessed: 12/12/2005).
Wikipedia. Lattice protein. (Accessed: 12/12/2005).
Wikipedia. Protein structure prediction. (Accessed: 12/11/2005).
Wikipedia. World Community Grid. (Accessed: 12/12/2005).