Consensus Prediction of Protein Secondary Structures


Consensus Prediction of Protein Secondary Structures
by Zheng Wang
Bachelor of Management Information System, Shandong Economic University, Jinan, P. R. China, 2004
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Computer Science in the Graduate Academic Unit of Computer Science
Supervisor(s): Dr. Patricia Evans, Ph.D., Faculty of Computer Science; Dr. Virendrakumar C. Bhavsar, Ph.D., Faculty of Computer Science
Examining Board: Dr. Eric Aubanel, Ph.D., Faculty of Computer Science; Dr. Bradford Nickerson, Ph.D., Faculty of Computer Science; Dr. Julian Meng, Ph.D., Department of Electrical and Computer Engineering
This thesis is accepted by the Dean of Graduate Studies
THE UNIVERSITY OF NEW BRUNSWICK
July, 2007
© Zheng Wang, 2007

Dedication

I dedicate this thesis to my parents. They raised me, gave me happiness, shaped my character, and provided me with the best education. I dedicate this thesis to my grandmother, a diligent and kind Chinese woman, who made great contributions to the whole family and who left us in 2004 while I was pursuing this degree in Canada. I wish her peace forever.

Abstract

Protein structure prediction is one of the most significant problems in bioinformatics. Several existing tools can predict protein secondary structure, or find protein structural motifs and specific structure segments. However, their results sometimes differ or contradict one another. CISPred is a consensus protein structure prediction system that integrates the results of such tools in order to provide overall consensus predictions of protein secondary structures. The average accuracy of CISPred predictions is 82.6% on a dataset containing 109 CASP sequences, and 89.3% on a dataset containing 1758 sequences.

Acknowledgements

I sincerely thank my supervisors, Dr. Patricia Evans and Dr. Virendra Bhavsar. They imparted their knowledge, directed my research, and provided financial support. Dr. Virendra Bhavsar and Dr. Patricia Evans are professors with profound knowledge and experience, and have been respected mentors. The two years of study and research with them have been one of the best periods of my life. Sili Huang and Lu Yang, system administrators of the Advanced Computational Research Laboratory at the University of New Brunswick, provided a great deal of technical support for the development and testing of CISPred. Special thanks go to Sili Huang, who provided many helpful suggestions on the concurrent implementation of CISPred. I thank my colleagues: Rachita Sharma, a Ph.D. candidate in Computer Science; En Zhang, a Master of Computer Science candidate; Aijazuddin Syed, Master of Computer Science; and Marc Cooper, Master of Computer Science. They began their research in our bioinformatics laboratory earlier than I did, and provided much help and many suggestions for my research. The members of my entire family greatly supported my study in Canada. I appreciate them all.

Table of Contents

Dedication
Abstract
Acknowledgments
Table of Contents
List of Figures

1 Introduction
Motivation
Objective
Organization

2 Background
Protein
Protein Secondary Structure
Secondary Structure Definitions
Secondary Structure Assignments
Protein Secondary Structure Prediction
Overview
PHD, PSIPRED and SSPRO
The Threading Method and THREADER
Comparison of Protein Structure Prediction Tools
Benchmarked Non-redundant Dataset
Protein Motif and Motif Databases
Protein Motif
PATMATMOTIFS and the PROSITE database

3 CISPred: Consensus Integrated Protein Structure Prediction
Overview
System Architecture
Selection of Integrated Tools
THREADER
Sorting THREADER Reports
Clustering THREADER Alignments
Finding Motif Secondary Structures
PSIPRED and SSPRO
Generating Consensus Structure Prediction

4 System Implementation
Overview
System Infrastructure
Concurrent Implementation
Overview
THREADER
Finding Protein Motif Secondary Structures
SSPRO and PSIPRED
Execution Time

5 Experimental Results
Overview
CISPred Testing Results on CASP Sequences
CISPred Testing Results on 1758 Sequences
Selection of CISPred Default Threshold
Comparison of CISPred and Integrated Tools
Overview
Comparison on CASP Sequences
Comparison on 1758 Sequences

6 Conclusions
Conclusion
Thesis Contributions
Future Work

References

A Submission Templates on Cluster
A.1 Submission Template for THREADER
A.2 Submission Template for SSPRO
A.3 Submission Template for PSIPRED
A.4 Submission Template for Finding Motif Structures

Vita

List of Figures

2.1 The general formula of an amino acid
2.2 The amino acid sequence of protein 1AD5
2.3 An α helix in protein 1R7G [26]
2.4 An ideal β strand [43]
2.5 A parallel β sheet in protein 1DIN [26]
2.6 An anti-parallel β sheet in protein 1IC9 [26]
2.7 A β barrel in protein 1BY3 [26]
2.8 The secondary structure of protein 1AD5
2.9 Part of the PDB file of protein 1WCK
2.10 The entire amino acid sequence of protein 1WCK
2.11 Part of the DSSP file of protein 1WCK
2.12 Part of the PDBFINDER entry of protein 1WCK
2.13 Part of a GARNIER [16] prediction report
2.14 Part of a PREDATOR [14] prediction report
2.15 Part of a PSIPRED [24] horizontal prediction report
2.16 Part of a SSPRO [37] prediction report
3.1 CISPred system architecture
3.2 Example of THREADER score report
3.3 Alignment results of THREADER
3.4 Structure segments for a protein motif
3.5 Example of a structure formula result
3.6 Example of a PATMATMOTIFS result
3.7 Example of a PROSITE entry
3.8 The generation of motif structure formulae
3.9 Example of a PSIPRED vertical result
3.10 Example of a SSPRO result
3.11 Example of the information available in one amino acid position
3.12 Average 3-state accuracy of THREADER predictions on 80 random sequences
3.13 Example of a CISPred vertical result in an amino acid position
3.14 Example of a CISPred vertical result
3.15 Example of a CISPred horizontal result
4.1 The system infrastructure of CISPred
4.2 Example of a CISPred queried sequence
4.3 Web page for submitting query sequences to CISPred
4.4 A CISPred web page displaying user jobs
4.5 Overview of the concurrent implementation of CISPred
4.6 Concurrent implementation of THREADER
4.7 The sorting of THREADER reports
4.8 Concurrent finding of motif structures
4.9 The execution time of CISPred
4.10 The speedup of CISPred
5.1 Two chains of protein 1ZD
5.2 Average 3-state accuracy of CISPred on the 109 CASP sequences
5.3 Standard deviation of CISPred on the 109 CASP sequences
5.4 Coefficient of variation of CISPred on the 109 CASP sequences
5.5 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 109 sequences dataset
5.6 Distribution of the 109 CASP sequences predicted by CISPred with 1% 3-state accuracy as interval
5.7 Distribution of the 109 CASP sequences predicted by CISPred with 3% 3-state accuracy as interval
5.8 Distribution of the 109 CASP sequences predicted by CISPred with 5% 3-state accuracy as interval
5.9 Average 3-state accuracy of CISPred predictions on 1758 sequences
5.10 Standard deviation of CISPred predictions on the 1758 sequences
5.11 Coefficient of variation of CISPred predictions on the 1758 sequences
5.12 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 1758 sequences dataset
5.13 3-state accuracies of PSIPRED predictions on the 109 CASP sequences with average Q3 score 0.778, standard deviation 0.084, and coefficient of variation 15.6%
5.14 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on 109 CASP sequences
5.15 3-state accuracies of SSPRO predictions on 109 CASP sequences with an average Q3 score 0.821, standard deviation 0.095, and coefficient of variation 11.6%
5.16 Bar graph showing the distribution of 3-state accuracies of SSPRO predictions on the 109 CASP sequences
5.17 3-state accuracies of CISPred predictions on the 109 CASP sequences when the threshold at which to stop clustering equals
5.18 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions when the threshold equals 0.42 on the 109 CASP sequences
5.19 Bar graph showing the distribution of the 3-state accuracies of predictions of CISPred, PSIPRED, and SSPRO
5.20 Prediction results of CISPred and integrated tools on protein 1WCK
5.21 Amino acid sequences and secondary structure sequences predicted by SSPRO with 100% 3-state accuracy
5.22 3-state accuracy of PSIPRED on 1758 sequences with average of 3-state accuracy 0.789, standard deviation 0.089, and coefficient of variation 11.2%
5.23 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 1757 sequences dataset
5.24 The amino acid sequences for which PSIPRED fails to predict the secondary structures
5.25 3-state accuracy of SSPRO predictions on 1758 sequences with average of 3-state accuracy 0.911, standard deviation 0.101, and coefficient of variation 11.1%
5.26 Bar graph showing the distribution of the 3-state accuracies of SSPRO predictions on the 1758 sequences dataset
5.27 The 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals The average 3-state accuracy, standard deviation, and coefficient of variation of these predictions are 0.893, 0.095, and 10.7% respectively
5.28 The amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR
5.29 Prediction result of SSPRO on protein 1XR
5.30 Prediction result of PSIPRED on protein 1XR
5.31 Alignment result of THREADER on protein 1XR
5.32 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold is
5.33 Bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is
6.1 Summary of experimental results

Chapter 1
Introduction

1.1 Motivation

Protein functions are widely understood to be determined by their specific three-dimensional structures. Experimental techniques such as X-ray crystallography and NMR analysis are not sufficient on their own, and the gap between the number of known tertiary structures and the number of known primary structures is widening; it is therefore necessary to develop approaches that deduce protein structures from their amino acid sequences. Computational prediction of protein secondary structures is one of the efforts needed to narrow this gap. Many tools and algorithms exist for protein secondary structure prediction. Each is based on a specific method, and their results are sometimes not identical and, for some proteins, even contradictory. Researchers therefore need a method that can integrate different prediction tools and make consensus predictions.

1.2 Objective

The main objective of the thesis is to integrate the results of selected protein structure prediction tools and make a consensus protein secondary structure prediction in a position-specific way.

1.3 Organization

Chapter 2 gives background knowledge about proteins, protein structures, and protein motifs. This chapter also briefly introduces several protein structure prediction tools and methods. Chapter 3 presents the architecture and methodology of CISPred, a consensus integrated protein structure prediction system. Chapter 4 presents the concurrent implementation of CISPred. Chapter 5 presents the testing strategy and testing results of CISPred. Chapter 6 presents the contributions of CISPred, and offers some suggestions for future work.

Chapter 2
Background

2.1 Protein

Proteins are large molecules made up of 20 types of amino acids. Each protein molecule is a long and unique chain of these amino acid residues (see footnote 1). These long chains tend to fold into large and complicated structures because of the bonds between their atoms. Once folded into their structures, these long chains are stable.

[Figure 2.1: The general formula of an amino acid: an α-carbon atom (C) bonded to an amino group (H2N), a carboxyl group (COOH), a hydrogen atom (H), and a side chain group (R).]

Footnote 1: In biochemistry and molecular biology, a residue refers to a specific monomer within the polymeric chain of a polysaccharide, protein or nucleic acid [45].

The sequence of atoms along the core of a chain is called the backbone of the

protein. The portions of the amino acids that are not involved in this backbone are called side chains. Figure 2.1 illustrates the general formula of an amino acid, in which R represents one of the 20 different side chains of amino acids. Every protein backbone has a C-terminus and an N-terminus, which are the two ends of the backbone.

EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVD
SLETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVK
HYKIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAW
EIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQH
DKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIE
QRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWS
FGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEER
PTFEYIQSVLDDFYTATESQYQQQP
Figure 2.2: The amino acid sequence of protein 1AD5.

Each of the 20 amino acids can be represented by a 1-letter code or a 3-letter code; for example, the amino acid Alanine is represented by A or Ala, and the amino acid Cysteine is represented by C or Cys. A protein molecule is then represented as a string of 1-letter codes, each of which represents an amino acid. This string of letters is called an amino acid sequence. Figure 2.2 illustrates the amino acid sequence of protein 1AD5 (see footnote 2).

Footnote 2: Protein Data Bank [2] I.D.

2.2 Protein Secondary Structure

2.2.1 Secondary Structure Definitions

Comparison of the 3D structures of many different proteins reveals some regular folding patterns, such as the α helix, β strand, and β sheet. The Dictionary of Protein Secondary Structure (DSSP) [49] defines these patterns and

uses a single letter code to describe each of them. This single letter code is called DSSP code, and is frequently used to describe protein secondary structures. An α helix is a structure formed when the backbone chain of a protein twists around itself, and in which the backbone N-H group in each amino acid forms a hydrogen bond with the C=O group of the amino acid four residues earlier. Figure 2.3 [26] illustrates an α helix in protein 1R7G. An α helix is also called a 4-turn helix, and it is represented by H in DSSP code. If a hydrogen bond is formed between two amino acids that are three residues apart, this is called a 3-turn helix, represented by G; if a hydrogen bond is formed between two amino acids that are five residues apart, this is called a 5-turn helix, represented by I; and if a hydrogen bond is formed between two adjacent amino acids, this is called a hydrogen bonded turn, represented by T.

[Figure 2.3: An α helix in protein 1R7G [26].]

A β strand is illustrated in Figure 2.4 [43], in which the backbone of the protein is folded with successive 120 degree angles.

[Figure 2.4: An ideal β strand [43].]

A β sheet consists of two or more β strands connected by hydrogen bonds, and its minimum length is two amino acid residues. If its length is less than two residues, then it is called a residue in isolated beta-bridge, and is represented by B. Two neighbouring β strands are parallel if they are aligned in the same direction from one terminus (N or C) to the other; this is called a parallel β sheet, as shown in Figure 2.5 [26].

[Figure 2.5: A parallel β sheet in protein 1DIN [26].]

If the two neighbouring β strands are aligned in opposite directions, the structure is called an anti-parallel β sheet, as shown in Figure 2.6 [26].

[Figure 2.6: An anti-parallel β sheet in protein 1IC9 [26].]

A closed β sheet is called a β barrel, which is illustrated in Figure 2.7 [26]. All β strands, β sheets and β barrels are represented by E in DSSP code. In DSSP code, the structures formed by non-hydrogen bonds are called bends, and are represented by S. The random turns and the structures which are not in

any of the above conformations are designated as ' ' (space), which is sometimes also written as C. Usually, the eight secondary structure types defined in the DSSP are reduced to three types based on a 3-state scheme [40]: G and H are taken to be helix (H), E and B are taken to be strand (E), and all of the other structure types are treated as random turns or coil (C). Based on DSSP code, the secondary structure of a protein molecule can be represented by a string of letters, which is called a protein secondary structure sequence. Figure 2.8 illustrates the secondary structure sequence of the protein 1AD5, in which each of the letters, such as H, B and C, represents a particular protein folding pattern.

[Figure 2.7: A β barrel in protein 1BY3 [26].]

CCCEEEESSCBCCCSSSBCCBCTTCEEEEEECCTTEEEEEETTTCCEEEEEGGGEEETT
SGGGSTTEETTCCHHHHHHHHTSTTCCTTCEEEEECSSSTTSEEEEEEEECTTSCEEEE
EEECEECSSSCEESSTTSCBSCHHHHHHHHTTCCSSSSSCCCSBCCCCCCCCCCCTTCS
EECGGGEEEEEEEECCSSEEEEEEEETTTEEEEEEEECTTSSCHHHHHHHHHHHTTCCC
TTBCCEEEEECSSSEEEEEECCTTCBHHHHHTSHHHHTCCHHHHHHHHHHHHHHHHHHH
HTTCCCSCCSTTSEEECTTSCEEECCCCCCCCCCCCCGGGCCHHHHHHCCCCHHHHHHH
HHHHHHHHHTTTCCSSSSCCTHHHHHHHHTTCCCCCCTTSCHHHHHHHHHHTCSSGGGS
CCHHHHHHHHHTTTSCGGGSSCCCC
Figure 2.8: The secondary structure of protein 1AD5.
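The 3-state reduction described above can be sketched in a few lines of Python. This is only an illustration of the mapping stated in the text (G and H to H, E and B to E, everything else to C); the function name `reduce_to_three_state` is a hypothetical helper and is not part of DSSP or of CISPred.

```python
# Mapping from the eight DSSP codes to the 3-state scheme described in the text:
# G and H -> H (helix), E and B -> E (strand), every other code -> C (coil).
THREE_STATE = {"G": "H", "H": "H", "E": "E", "B": "E"}


def reduce_to_three_state(dssp: str) -> str:
    """Reduce an 8-state DSSP string to a 3-state string over H, E and C.

    Any code not listed in THREE_STATE (I, T, S, space, C, ...) becomes C.
    """
    return "".join(THREE_STATE.get(code, "C") for code in dssp)


if __name__ == "__main__":
    # First 20 positions of the secondary structure of protein 1AD5 (Figure 2.8).
    print(reduce_to_three_state("CCCEEEESSCBCCCSSSBCC"))
    # prints CCCEEEECCCECCCCCCECC
```

Note that under this scheme the 5-turn helix (I) and the hydrogen bonded turn (T) are both counted as coil, since only G and H are mapped to helix.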

2.2.2 Secondary Structure Assignments

Protein secondary structures are assigned to amino acid sequences based on the three dimensional orthogonal coordinates of the atoms in proteins. These coordinates are stored in the RCSB (Research Collaboratory for Structural Bioinformatics) Protein Data Bank (PDB) [2], a database that stores information about known proteins such as their amino acid sequences, the methods used to determine them, their atoms, and the three dimensional orthogonal coordinates of each atom in a protein. As of April 2007, the RCSB Protein Data Bank stored information on about 39,261 known proteins. The information on a protein is stored in a PDB file in plain text format. Figure 2.9 illustrates part of the PDB file for protein 1WCK. The lines starting with ATOM list the three dimensional orthogonal coordinates of the atoms in protein 1WCK. Taking the first line starting with ATOM as an example: the N in the third column indicates that the atom is Nitrogen; the GLY in the fourth column indicates that this atom belongs to an amino acid of type GLY, which is Glycine with 1-letter code G; the 80 in the sixth column indicates that this Glycine is the 80th amino acid of the protein 1WCK; and the three values that follow give the orthogonal coordinates X, Y, and Z in angstroms. The protein 1WCK contains 220 amino acids in total, as shown in Figure 2.10, in which the atoms in the underlined amino acids are those included in the PDB file shown in Figure 2.9 and the atoms in the non-underlined amino acids are not included in that PDB file. The PDB file (see Figure 2.9) does not provide the three dimensional orthogonal coordinates of the atoms in the non-underlined amino acids because these amino acids are partially or completely unstructured and do not fold into a stable

state, which is labeled as disordered regions by structural biologists.

[Figure 2.9: Part of the PDB file of protein 1WCK. The excerpt shows the ORIGX and SCALE records; ATOM records 1 to 33 for residues GLY, LEU, GLY, LEU, and PRO of chain A (with alternate locations A and B for the CG, CD1, and CD2 atoms of the first LEU); ATOM records 966 to 975 for a HIS residue; the TER record (TER 976 HIS A 215); and HETATM records 977 to 981 for the atoms AS, O1, O2, C1, and C2 of the CAC group. Numeric columns are not reproduced.]

The DSSP program [49] is a program that assigns secondary structures to amino acid sequences based on the three dimensional coordinates of the atoms in

proteins. The DSSP program reads PDB files such as the one shown in Figure 2.9, assigns a secondary structure type to each of the amino acid positions, and saves the secondary structures in a DSSP file. Figure 2.11 (see footnote 3) illustrates part of the DSSP file of protein 1WCK. The fourth column from the left lists the amino acids of the protein, and the fifth column from the left lists the secondary structure at each amino acid position. PDBFINDER [49] is a database that stores the secondary structures of all protein entries in the Protein Data Bank. Figure 2.12 shows the entry of 1WCK in the PDBFINDER database, in which the line starting with Sequence lists the amino acid sequence of protein 1WCK (without disordered segments), and the line starting with DSSP lists the secondary structure of protein 1WCK. The secondary structures in the PDBFINDER database are assigned by the DSSP program.

Footnote 3: In order to fit on the paper, some unrelated segments or columns in the examples shown in this thesis may be deleted or omitted as indicated by "...".

>1WCK:A PDBID CHAIN SEQUENCE
MAFDPNLVGPTLPPIPPFTLPTGPTGPTGPTGPTGPTGPTGPTGDTGTTGPTGPTGPTGPTGPTGATGL
TGPTGPTGPSGLGLPAGLYAFNSGGISLDLGINDPVPFNTVGSQFGTAISQLDADTFVISETGFYKITV
IANTATASVLGGLTIQVNGVPVPGTGSSLISLGAPIVIQAITQITTTPSLVEVIVTGLGLSLALGTSAS
IIIEKVAHHHHHH
Figure 2.10: The entire amino acid sequence of protein 1WCK.

2.3 Protein Secondary Structure Prediction

2.3.1 Overview

Protein secondary structure prediction methods usually do not distinguish all of the secondary structure types defined in the Dictionary of Protein Secondary Structure, but only consider three structural states. Generally, α helix (H) and 3-turn helix (G) are all treated as Helix, represented by H, β strand (E)

[Figure 2.11: Part of the DSSP file of protein 1WCK. The listing has the columns #, RESIDUE, AA, STRUCTURE, BP1, BP2, ACC, N-H-->O, O-->H-N, N-H-->O, O-->H-N, TCO, KAPPA, ALPHA, PHI, and PSI, and begins at residue 80 (G) of chain A. Numeric columns are not reproduced.]

ID : 1WCK
Header : STRUCTURAL PROTEIN
Date :
Compound : bcla protein
Source : (bacillus anthracis)
Author : S.Rety
Author : S.Salamitou
Author : L.A.Augusto
Author : R.Chaby
Author : F.Lehegarat
Author : A.Lewit-bentley
Exp-Method : X
Resolution : 1.36
R-Factor :
Free-R :
Ref-Prog : REFMAC
HSSP-N-Align : 23
T-Frac-Beta : 0.60
T-Nres-Prot : 136
T-Water-Mols : 190
HET-Groups : 1
Het-Id : 1216
Natom : 5
Name : CACODYLATE ION
Chain : A
Sec-Struc : 136
Beta : 82
B-Bridge : 2
Anti-Hb : 108
Amino-Acids : 136
Substrate : 5
Sequence : GLGLPAGLYAFNSGGISLDLGI... ALGTSASIIIEKVAH
DSSP : CCCCSEEEEEEEEESSCEEECT... CSEEEEEEEEEEEEC
Nalign :
Nindel :
Entropy :
Cons-Weight :
Chain : Z
Water-Mols : 190
Figure 2.12: Part of the PDBFINDER entry of protein 1WCK.

and residue in isolated beta-bridge (B) are all treated as Strand, represented by E, and all of the others are treated as Coil, represented by C or ' ' (space). Correspondingly, the 3-state accuracy score (also called the Q3 score) is used to evaluate prediction accuracy; it is the percentage of residues whose predicted state matches the real structure. Many methods and algorithms have been used to predict protein secondary structures. The early methods usually relied only on linear statistics [16, 5, 15, 17, 34] and stereochemical principles [31]. Subsequently, machine learning algorithms proved to be a successful way to predict protein secondary structures. The successful machine learning algorithms include decision trees [35], neural networks [38, 40, 19, 28, 39, 3, 4, 21], and K-way nearest neighbours [6, 7, 13, 12, 30]. Currently, most of the top prediction tools with prediction accuracy higher than 75%, such as PHD [40], PSIPRED [24] and SSPRO [37], use artificial neural network (ANN) algorithms to make their predictions. It has also been shown that considering evolutionary information, in the form of multiple aligned sequences, can improve prediction accuracy [8]. This is because a multiple sequence alignment reflects the core structure or consensus structure of a whole protein family, which can then be used to predict the structure of proteins that belong to or are related to that family. Multiple sequence alignment is now used quite often in protein secondary structure prediction [33, 36, 14, 9], and is considered a successful method [18, 32]. A recent trend is not to use only one technique to predict protein secondary structures, but to combine several techniques; for example, to combine ANNs and multiple sequence alignment [47, 9, 25], and to combine statistical methods,

homology methods, information theory methods, and artificial neural network algorithms [46]. Besides combining various techniques in one tool, some tools combine other prediction tools to make a consensus prediction. One typical and successful example is JPRED [10], which combines 6 different prediction tools: DSC [27], PHD [40], NNSSP [6], PREDATOR [14], ZPRED [48], and MULPRED (see footnote 4). Each of these tools combines multiple sequence alignment with a specific method; for example, PHD uses jury decision neural networks, NNSSP is based on nearest neighbours, and DSC uses linear discrimination. CISPred integrates two existing prediction tools, SSPRO [37] and PSIPRED [24], which produce predictions with relatively high 3-state accuracy, can be freely downloaded, and are easily integrated. Moreover, CISPred also integrates the protein motif structures database and the threading method, which have not been widely used by existing protein secondary structure prediction tools.

Footnote 4: Barton, 1988, unpublished.

2.3.2 PHD, PSIPRED and SSPRO

Currently, PHD [40], PSIPRED [24] and SSPRO [37] are three of the most successful protein secondary structure prediction tools. As mentioned above, multiple sequence alignments can improve the accuracy of protein secondary structure prediction, and are widely used in protein secondary structure prediction tools. However, generating sequence profiles by multiple sequence alignment is time-consuming. For example, a very successful method, PHD [40], uses a multi-processor computer to generate multiple sequence alignments; therefore, the PHD server [42] cannot be moved to a new site. In 1999, PSIPRED [24], a protein secondary structure prediction system that could be easily ported to any workstation, was created. The approach of PSIPRED is

to use the position-specific scoring matrix of PSI-BLAST, instead of multiple sequence alignments, as the input for a two-stage neural network. According to the experiments conducted by its author, PSIPRED can achieve an average 3-state accuracy between 76.5% and 78.3% on the CASP3 (Critical Assessment of Techniques for Protein Structure Prediction experiment) [29] dataset. The output of PSIPRED [24] gives the confidence score of each of the three secondary structures C, H, and E, respectively, at each amino acid position. The details of PSIPRED output reports are presented in Chapter 3. In 2004, SSPRO [37], a protein secondary structure prediction tool, was created based on an ensemble of 100 1D-RNNs (one dimensional recurrent neural networks), PSI-BLAST-derived profiles (position-specific scoring matrices), and a large non-redundant training set. According to the experiments conducted by its author, SSPRO can achieve a 3-state accuracy of 77%. The details of the prediction results from SSPRO [37] are presented in Chapter 3.

2.3.3 The Threading Method and THREADER

The threading method is an algorithm which can be used to predict protein structures. First, a protein fold library is constructed, which contains protein folds as structural templates. Then a score function is chosen to evaluate any alignment of a queried amino acid sequence with a structural template. The score function usually computes the free energy of the queried sequence in a structural template. The lower the free energy of the queried sequence, the more stable the queried sequence is in this structural template, which also indicates a higher likelihood that this template is the final structure of the queried sequence. Based on the score function, the best alignment of a query sequence with each of the structural templates can be found. Then the most appropriate structural templates with

optimal alignments are selected as the predicted structures. THREADER [23] is a tool which implements the threading method. Its output includes a score report showing the score of the alignment of a queried sequence with each structural template, and an alignment report showing the alignments of the query sequence with each of the structural templates. The details of the score report and alignment report are presented in Chapter 3.

2.3.4 Comparison of Protein Structure Prediction Tools

Because each method and tool has a different approach to prediction, results may differ, and sometimes parts of the results are contradictory. Figures 2.13, 2.14, 2.15, and 2.16 show part of the prediction results for the first 50 amino acids of protein 1AP9, from GARNIER [16], PREDATOR [14], PSIPRED [24], and SSPRO [37], respectively.

[Figure 2.13: Part of a GARNIER [16] prediction report for the sequence QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP, with separate rows for the predicted helix (H), sheet (E), turns (T), and coil (C) positions.]

1 QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP 50
HHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH
Figure 2.14: Part of a PREDATOR [14] prediction report.

These four reports illustrate the differences between the four prediction tools. For example, PREDATOR [14] predicts more α helices (H) than GARNIER [16], and at some positions PREDATOR [14] predicts an α helix (H) where GARNIER [16] predicts a contradictory result, a β sheet (E). PSIPRED [24] and

SSPRO [37] have similar prediction results; both of them predict two series of α helices (H) with some coils (C) in the middle. But the lengths of the two α helix (H) series predicted by PSIPRED [24] and SSPRO [37] are slightly different.

Conf:
Pred: CCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHHHHH
AA: QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
Figure 2.15: Part of a PSIPRED [24] horizontal prediction report.

QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
CCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHH
Figure 2.16: Part of a SSPRO [37] prediction report.

2.3.5 Benchmarked Non-redundant Dataset

As mentioned above, most of the successful methods to predict protein secondary structure use machine learning algorithms. Accordingly, datasets that contain non-redundant protein amino acid sequences are needed for cross-validation. In 1994, Burkhard Rost and Chris Sander provided a dataset, often referred to as dataset RS126 [41], that contains 126 non-redundant protein sequences. The non-redundancy of RS126 means that no two proteins in the dataset share more than 25% sequence identity over a length of more than 80 residues. The RS126 dataset was used as a benchmark dataset by many machine learning algorithms that predict protein secondary structure. In 1999, James Cuff and Geoffrey Barton pointed out that the standard used to determine the non-redundancy of the RS126 dataset, percentage identity, is a poor measure of sequence similarity [8], and they provided a more sophisticated

method. To compute the similarity of two amino acid sequences A and B, their method aligns A and B using a standard dynamic programming algorithm and obtains the alignment score $V$. The order of the amino acids in both sequence A and sequence B is then randomized, and the two randomized sequences are aligned by the same dynamic programming algorithm. This process is usually repeated more than 100 times, and then the mean $\bar{x}$ and the standard deviation $\sigma$ of the alignment scores of the randomized sequences are computed. An SD score, or Z-score, that measures the similarity of the original sequences A and B is computed using Equation 2.1. Sequences with an SD score higher than 5 are considered similar. A dataset that contains 396 sequences, usually named dataset CB396 [8], was constructed using this similarity method. Each sequence in CB396 is not similar to any other sequence in CB396, nor to any sequence in RS126. CB396 is another of the benchmark non-redundant datasets currently used.

$$SD = \frac{V - \bar{x}}{\sigma} \tag{2.1}$$

The similarity of each pair of sequences in dataset RS126 was also measured using the new method mentioned above, and 9 sequences were removed from RS126 in order to make the remaining 117 sequences non-redundant. These 117 sequences from RS126 are combined with CB396 to form a dataset named CB513, which is one of the benchmarked non-redundant datasets currently used.
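The SD score procedure described above can be sketched as follows. This is only a minimal illustration, not the implementation used by Cuff and Barton: the function names, the toy scoring scheme (match, mismatch, and gap values), and the fixed random seed are all assumptions made for the example, and a real implementation would use a substitution matrix.

```python
import random
import statistics


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (toy scoring scheme, assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffled(seq, rng):
    """Return a copy of seq with its residues in random order."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def sd_score(a, b, shuffles=200, seed=0):
    """SD = (V - mean) / sigma, with mean and sigma taken over shuffled alignments."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    v = global_alignment_score(a, b)
    scores = [
        global_alignment_score(shuffled(a, rng), shuffled(b, rng))
        for _ in range(shuffles)
    ]
    mean = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero spread; SD score undefined")
    return (v - mean) / sigma
```

Under the criterion quoted above, a pair of sequences would be treated as similar when `sd_score(a, b)` exceeds 5.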

2.4 Protein Motif and Motif Databases

Protein Motif

Motifs are biologically significant sites and patterns existing in proteins. They can be used to characterize protein families.

PATMATMOTIFS and the PROSITE database

PATMATMOTIFS [1], a protein motif finding tool, compares a given protein sequence to the PROSITE [22] database, which stores information about known protein motifs. In some cases, an unknown protein sequence is only distantly related to known proteins, so it is difficult to determine the features of the unknown protein by overall sequence alignment. By comparing an unknown protein to the PROSITE [22] database, some biologically important patterns, motifs or fingerprints can be found, which can help determine to which family it belongs. Examples of PATMATMOTIFS results and PROSITE entries are presented in Chapter 3.

Chapter 3

CISPred: Consensus Integrated Protein Structure Prediction

3.1 Overview

CISPred, a Consensus Integrated Structure Prediction tool, predicts protein secondary structures by integrating several prediction tools and databases. The tools and databases integrated in CISPred include two protein secondary structure prediction tools, SSPRO [37] and PSIPRED [24]; a protein motif searching tool, PATMATMOTIFS [1]; a motif database, PROSITE; a protein secondary structure database, PDBFINDER [20]; and a threading method tool, THREADER [23].

3.2 System Architecture

Selection of Integrated Tools

The tools selected for integration in CISPred meet several requirements. The integrated tools have a relatively high prediction accuracy. The techniques of

protein secondary structure prediction have improved greatly in the last 20 years. Some early algorithms, such as GARNIER [16], have no more than 65% accuracy. Because the consensus results of CISPred are computed from existing tools, the prediction accuracy of these tools directly influences the accuracy of CISPred. The tools integrated by CISPred can be downloaded and installed on local machines. Currently, there are many successful protein structure prediction servers, but some of them cannot be downloaded and can only be accessed through their web pages. To improve the stability and reduce the execution time of CISPred, the integrated tools are executed on local machines. Therefore, the protein secondary structure prediction tools PSIPRED [24] and SSPRO [37] are selected for integration into CISPred, because both can be downloaded and installed on local machines, and both have relatively high accuracy: PSIPRED has 80.6% accuracy and SSPRO has 80% or better. PSIPRED and SSPRO mainly use ANN techniques. In order to provide consensus predictions, tools which use different prediction approaches are also integrated into CISPred. THREADER [23], which implements the threading method, an approach completely different from neural networks, is integrated into CISPred. Moreover, CISPred also integrates protein motif structural information by finding the structure formulae of protein motifs. In total, CISPred integrates two protein secondary structure prediction tools using neural networks and PSI-BLAST profiles, a tool using the threading method, and protein motif structure formulae.

Figure 3.1 illustrates the system architecture of CISPred. A program running on the web server of CISPred submits queried sequences to the integrated tools, which are executed on a cluster. Motif structure formulae are found by integrating the tool PATMATMOTIFS [1] and two databases: PDBFINDER [20]

Figure 3.1: CISPred system architecture. (Diagram components: a submission program on the web server sends query sequences to the cluster, which runs SSPRO, PSIPRED, THREADER, and the search for motif structure formulae using PATMATMOTIFS with the PROSITE and PDBFinder databases; the THREADER protein folds are clustered; the prediction results and structure formulae feed the consensus prediction program, whose consensus results are returned to a web server program.)

and PROSITE [22]. The results generated by THREADER are clustered before being integrated. When the execution of the integrated tools and the search for motif structure formulae are finished, a program integrates the structure formulae and the results of each integrated tool, and then generates a consensus prediction for each queried sequence. The consensus predictions are then sent to a program running on the web server of CISPred, which delivers the results to CISPred users.

THREADER

Sorting THREADER Reports

THREADER [23] is a tool that implements the threading method. THREADER has a library which contains the structures of 6251 protein folds. A queried amino acid sequence is threaded through each of the protein folds. For each protein fold, the alignment between the amino acid sequence and the protein fold with minimum free energy is selected as the optimum alignment. THREADER provides information about the optimum alignment for each fold in the library. Figure 3.2 shows part of the output report from THREADER, in which column 8 lists the filtered combined energy Z-scores [23] and the rightmost column lists the PDB ID codes of the protein folds. According to the THREADER manual, protein structure predictions should be based on the filtered combined energy Z-scores: the higher the filtered combined energy Z-score, the better the amino acid sequence fits the protein fold, and the higher the probability that the protein fold is the correct predicted structure. Usually, a protein fold with a filtered combined energy Z-score above 3.5 is considered to have a significantly high probability of being the correct predicted structure of a queried sequence. The alignments between a queried amino acid sequence and the protein fold

Figure 3.2: Example of THREADER score report.

are also provided by THREADER, as shown in Figure 3.3. The alignments contain confidence scores, which are integers from 0 to 9 inclusive, at positions with structure types H and E. Each of these confidence scores indicates the likelihood that a structure type is the correct prediction at an amino acid position. However, so that structures with a confidence score of 0 can still be integrated, CISPred raises all of the confidence scores by 1, and the range of these confidence scores therefore becomes 1 to 10 inclusive.

For a queried sequence, THREADER generates a filtered combined energy Z-score and an alignment for each of the 6251 protein folds in its library. By sorting the column listing the filtered combined energy Z-scores, CISPred finds the 20 protein folds with the highest filtered combined energy Z-scores. Usually, the filtered combined energy Z-scores of these 20 folds are all above 3.5; nevertheless, CISPred checks the Z-scores of these 20 folds and eliminates the folds with filtered combined energy Z-scores lower than 3.5. The remaining folds are considered to be highly appropriate predicted structures for the queried sequence. To integrate the most appropriate protein folds into CISPred, the structural segments of these protein folds are clustered, and only the cluster of folds with the highest average confidence score is integrated into CISPred.

Clustering THREADER Alignments

The structural segments of the protein folds with the highest filtered combined energy Z-scores are clustered by a hierarchical clustering algorithm [11]. Initially, each cluster contains the secondary structures of one of these fold segments. The distance between each pair of clusters is computed, and the two clusters with the smallest distance are merged into one cluster. This process continues until the smallest distance between each pair of clusters reaches a threshold. The distance

35 THREADER Protein Sequence Threading Program Build date : Sep Copyright (C) 2002 University College London Portions Copyright (C) 1990 D.T.Jones Registered user: q8x46@unb.ca Reading mean force potential tables... Alignment with 1sb8A0: CCCHHHHHHHHHHHC-CCEEEEEC-CCCHHHHHHHHHHHHC CCEEE MMSRYEELRKELPAQ-PKVWLITG-VAGFIGSNLLETLLKL DQKVV EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGE---WWKARSLATRKEGYIPSNY-VA EEECCCC CCHHHHHHHHHHCCHHHHCCEEEEECCCCCHHHHHHHHCC----- GLDNFAT GHQRNLDEVRSLVSEKQWSNFKFIQGDIRNLDDCNNACAG----- RVDSLETEEWFFKGISRKDAERQLLAPGN--MLGS-FMIRDSETTKGSYSLSVRDYDPRQ CHHHHHHHHHHHC----CCC-EEECCCCCEECC-EEHHHHHHHHHHHHCC CCC AVIPKWTSSMIQG----DDV-YINGDGETSRDF-CYIENTVQANLLAATA GLD PKLIDFSAQIAEGMAFIEQRNYIHRDLRA-ANILVSASLVCKIADFGLARVGAKFPIKWT CCCEEEEEC---CCCCEEHHHHHHHHHHHHHHCCCCCC---CCCEEEC-CCCCCCCCCCC ARNQVYNIA---VGGRTSLNQLFFALRDGLAENGVSYH---REPVYRD-FREGDVRHSLA APE-AINFGSFTIKS--DVWSFGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPEN CCHHHHHHC--CC--CC-CC----CHHHHHHHHHHHHHHHCC--- DISKAAKLL--GY--AP-KY----DVSAGVALAMPWYIMFLK--- CPEELYNIMMRCWKNRPEERPTFEYIQSVLDDFYTATESQYQQQP Percentage Identity = 7.6. Figure 3.3: Alignment results of THREADER. 26
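The fold selection and confidence score adjustment described for the THREADER reports can be sketched as below. This is a hedged illustration rather than CISPred's actual code: the function names, the dictionary input format, and the fold identifiers in the usage note are hypothetical, while the limit of 20 folds, the 3.5 Z-score cutoff, the shift by 1, and the fixed coil score of 5 are taken from the text.

```python
def select_folds(z_scores, top_n=20, z_cutoff=3.5):
    """Keep the top_n folds by filtered combined energy Z-score,
    then drop any whose Z-score is lower than z_cutoff.

    z_scores: dict mapping a fold identifier to its Z-score (assumed format).
    """
    ranked = sorted(z_scores.items(), key=lambda item: item[1], reverse=True)
    return [fold for fold, z in ranked[:top_n] if z >= z_cutoff]


def adjusted_confidence(structure, raw_score):
    """Map a THREADER confidence score (0-9 for H and E) to the 1-10 scale.

    Coil positions carry no THREADER score, so they receive the fixed
    value 5 described in the text.
    """
    if structure == "C":
        return 5
    return int(raw_score) + 1
```

For instance, `select_folds({"foldA": 4.0, "foldB": 3.4, "foldC": 5.1})` keeps `foldC` and `foldA`, in that order, and discards `foldB` because its Z-score is below 3.5.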

between two clusters is computed according to Equation 3.1 [11], in which $C$ and $C'$ are two clusters, $|C|$ and $|C'|$ are the numbers of fold segments in the two clusters respectively, and $d(x, y)$ is the distance between two fold segments, one located in each cluster. The shorter the distance between two clusters, the greater the similarity between them.

$$d_{avg}(C, C') = \frac{1}{|C|\,|C'|} \sum_{x \in C} \sum_{y \in C'} d(x, y) \tag{3.1}$$

The distance between two fold segments is computed according to Equation 3.2, in which $N_{identical}$ represents the number of positions with identical secondary structures, and $N_{total}$ represents the total number of positions in a fold segment.

$$d(x, y) = \frac{N_{identical}}{N_{total}} \tag{3.2}$$

After the clustering stops, the cluster with the highest average confidence score is selected, and the fold segments in that cluster are integrated into CISPred. The average confidence score is computed by dividing the sum of the confidence scores of each structure type ( H, E, and C ) by the total number of positions in that cluster, as shown in Equation 3.3. THREADER only provides confidence scores for the positions with structure types H and E, so the confidence score for the positions with structure C is set to 5, one of the middle integers in the range 1 to 10.

$$C_{avg} = \frac{\sum C_H + \sum C_E + \sum C_C}{N_H + N_E + N_C} \tag{3.3}$$

The threshold of the clustering algorithm influences the number of fold segments integrated into CISPred, and therefore influences the final consensus predictions of CISPred. To illustrate the influence this threshold has on the accuracy of CISPred, several experiments were conducted; they are presented in Chapter 5: Experimental Results.

Finding Motif Secondary Structures

Proteins in one family, or proteins having similar functions, are found to contain some common or similar amino acid segments. These segments can distinguish families of proteins, and are called motifs. A motif can exist in many individual proteins; however, the structures of the motif are conserved and fit a particular structure template. For example, Figure 3.4 illustrates the structures of a motif named BACTERIAL_OPSIN_1 in different individual proteins. The structures of the first five positions and the last three positions are usually H, and for the five positions in the middle the structures can be either H or T. By statistically analyzing all of the structures of an existing motif in different proteins, the structure template, or structure formula, of the motif can be determined, which provides the proportion of each secondary structure type at each of the amino acid positions of the motif. Figure 3.5 shows the structure formula of a motif named PROTEIN_KINASE_ATP.

The structure formula is generated by integrating two databases, PROSITE [22] and PDBFINDER [20], and the motif finding tool PATMATMOTIFS [1]. PATMATMOTIFS finds motifs in queried amino acid sequences and provides the name, length, and start and end positions of these motifs. Figure 3.6 is an example of a PATMATMOTIFS result, which shows that PATMATMOTIFS finds a motif named PROTEIN_KINASE_ATP which is 23 amino acids long, starting at position 190 and ending at position 212. After executing PATMATMOTIFS on a queried sequence, CISPred then searches the PROSITE database for entries

2nd Structure     PDB ID   AA Sequence
HHHHH HHHTH HHH   2BRD     RYADW LFTTP LLL
HHHHH HHHTT THH   2AT9     RYADW LFTTP LLL
HHHHH HHHTH HHH   1XJI     RYADW LFTTP LLL
HHHHH HHHTH HHH   1VJM     RYADW LFTTP LLL
HHHHH TTHHH HHH   1UCQ     RYADW LFTTP LLL
HHHHH HHHTH HHH   1UAZ     RYADW LFTTP LLL
HHHHH TTHHH HHH   1TN5     RYADW LFTTP LLL
HHHHH TTHHH HHH   1TN0     RYADW LFTTP LLL
HHHHH TTHHH HHH   1S54     RYADW LFTTP LLL
HHHHH TTHHH HHH   1S53     RYADW LFTTP LLL
HHHHH TTHHH HHH   1S52     RYADW LFTTP LLL
HHHHH TTHHH HHH   1S51     RYADW LFTTP LLL
HHHHH HHHTH HHH   1R84     RYADW LFTTP LLL
HHHHH HHHTH HHH   1R2N     RYADW LFTTP LLL
HHHHH TTHHH HHH   1QM8     RYADW LFTTP LLL
HHHHH HHHTH HHH   1QKP     RYADW LFTTP LLL
HHHHH HHHTH HHH   1QKO     RYADW LFTTP LLL
HHHHH HHHTH HHH   1QHJ     RYADW LFTTP LLL
HHHHH TTHHH HHH   1Q5I     RYADW LFTTP LLL
HHHHH TTHHH HHH   1PY6     RYADW LFTTP LLL
HHHHH TTHHH HHH   1PXS     RYADW LFTTP LLL
HHHHH TTHHH HHH   1PXR     RYADW LFTTP LLL
HHHHH HHHTH HHH   1P8U     RYADW LFTTP LLL
HHHHH HHHTH HHH   1P8I     RYADW LFTTP LLL
HHHHH HHHTH HHH   1P8H     RYADW LFTTP LLL
HHHHH TTHHH HHH   1O0A     RYADW LFTTP LLL
HHHHH HHHTH HHH   1M0M     RYADW LFTTP LLL
HHHHH HHHTH HHH   1M0L     RYADW LFTTP LLL
HHHHH HHHTH HHH   1M0K     RYADW LFTTP LLL
HHHHH TTHHH HHH   1KME     RYADW LFTTP LLL
HHHHH HHHTH HHH   1KGB     RYADW LFTTP LLL
HHHHH HHHTH HHH   1KG9     RYADW LFTTP LLL
HHHHH HHHTH HHH   1KG8     RYADW LFTTP LLL
HHHHH HHHTH HHH   1JGJ     RYIDW ILTTP LIV
HHHHH HHHTH HHH   1IXF     RYADW LFTTP LLL
HHHHH HHHTH HHH   1IW9     RYADW LFTTP LLL
HHHHH HHHTH HHH   1IW6     RYADW LFTTP LLL
HHHHH HHHTH HHH   1BRR     RYADW LFTTP LLL
HHHHH HHHHH HHH   1BRD     RYADW LFTTP LLL
HHHHH HHHTH HHH   1BM1     RYADW LFTTP LLL
HHHHH HHHHH HHH   1AT9     RYADW LFTTP LLL
THHHH TTTHH HHT   1AP9     RYADW LFTTP LLL
Figure 3.4: Structure segments for a protein motif.

MOTIF PROTEIN_KINASE_ATP
LENGTH 23
START 190
END 212
STR_FORMULA [C:0.11, H:0.00, T:0.00, S:0.01, E:0.87, G:0.01, I:0.00, B:0.00]
STR_FORMULA [C:0.09, H:0.00, T:0.01, S:0.01, E:0.87, G:0.02, I:0.00, B:0.00]
STR_FORMULA [C:0.23, H:0.00, T:0.01, S:0.01, E:0.73, G:0.02, I:0.00, B:0.00]
STR_FORMULA [C:0.63, H:0.00, T:0.01, S:0.04, E:0.28, G:0.03, I:0.00, B:0.01]
STR_FORMULA [C:0.01, H:0.00, T:0.35, S:0.63, E:0.00, G:0.01, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.35, S:0.64, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.20, H:0.00, T:0.01, S:0.19, E:0.59, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.10, H:0.00, T:0.00, S:0.03, E:0.87, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:1.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.99, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.02, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.04, H:0.00, T:0.05, S:0.02, E:0.88, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.04, H:0.00, T:0.01, S:0.01, E:0.93, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.03, H:0.00, T:0.00, S:0.02, E:0.95, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
Figure 3.5: Example of a structure formula result.
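The statistical step that turns motif structure segments such as those in Figure 3.4 into a structure formula such as the one in Figure 3.5 might be sketched as follows. This is an assumed, simplified reading of that step: the function name and the list-of-strings input are hypothetical, all segments are assumed to have equal length, and the variable-length motif positions are not handled.

```python
from collections import Counter

# Structure type letters in the order used by the STR_FORMULA lines.
STRUCTURE_TYPES = "CHTSEGIB"


def structure_formula(segments):
    """Per-position proportion of each structure type.

    segments: equal-length secondary structure strings, one per protein
    in which the motif occurs (hypothetical input format).
    """
    if not segments:
        raise ValueError("at least one segment is required")
    length = len(segments[0])
    if any(len(seg) != length for seg in segments):
        raise ValueError("all segments must have the same length")
    formula = []
    for pos in range(length):
        counts = Counter(seg[pos] for seg in segments)
        formula.append(
            {t: counts.get(t, 0) / len(segments) for t in STRUCTURE_TYPES}
        )
    return formula
```

Each dictionary in the returned list corresponds to one STR_FORMULA line, so a 23-residue motif would yield 23 dictionaries.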

about the motifs PATMATMOTIFS found. Figure 3.7 shows an entry in the PROSITE database which contains information about the motif PROTEIN_KINASE_ATP. The line starting with ID gives the name of the motif. The lines starting with PA give the consensus pattern of the motif, which is the amino acid template of that motif. In a consensus pattern, [ALT] indicates that any one of Alanine (A), Leucine (L) or Threonine (T) may occur at this position; {AM} indicates that any amino acid except Alanine (A) and Methionine (M) may occur at this position; x indicates that any amino acid may occur at this position; x(3) corresponds to x-x-x, which indicates that any three amino acids may occur at this position; and x(2,4) corresponds to x-x, x-x-x or x-x-x-x, which indicates that any two, three or four amino acids may occur at this position. The lines starting with 3D contain the PDB [2] IDs of the proteins in which this motif exists. Based on the PDB IDs of these proteins, CISPred searches the PDBFINDER [20] database and retrieves the secondary structures and the amino acid sequences of these proteins. PATMATMOTIFS is then executed on each of these amino acid sequences in order to locate the position of the motif PROTEIN_KINASE_ATP in each of these proteins. According to the position of the motif provided by PATMATMOTIFS, CISPred finds the secondary structures of the motif in each of these proteins. A statistical analysis is then performed on these secondary structures; it computes the proportion of the occurrence of each secondary structure type and generates the structure formula of this motif. The process of finding the structure formulae of protein motifs is illustrated in Figure 3.8. Figure 3.5 illustrates the structure formula of the motif PROTEIN_KINASE_ATP, and each line therein starting with STR_FORMULA provides the proportion of the occurrence of each structure type at that amino acid position. Since the motif PROTEIN_KINASE_ATP is

23 amino acids long, the structure formula of PROTEIN_KINASE_ATP contains 23 lines, each of which contains the proportion of each structure type.

########################################
# Program: patmatmotifs
# Rundate: Sun Sep 17 16:34:
# Report_format: dbmotif
# Report_file: Pat
########################################
#=======================================
#
# Sequence: SEQUENCE from: 1 to: 438
# HitCount: 2
#
# Full: No
# Prune: Yes
# Data_file: /usr/local/share/emboss/data/prosite/prosite.lines
#
#=======================================
Length = 23
Start = position 190 of sequence
End = position 212 of sequence
Motif = PROTEIN_KINASE_ATP
KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG
Length = 13
Start = position 299 of sequence
End = position 311 of sequence
Motif = PROTEIN_KINASE_TYR
IEQRNYIHRDLRAANILVSASLV
#
#
Figure 3.6: Example of a PATMATMOTIFS result.

The length of some motifs is variable; for example, the consensus pattern of

42 ID AC DT DE PA PA NR NR NR CC CC DR DR DR DR DR DR DR... DR DR DR 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D 3D DO // PROTEIN_KINASE_ATP; PATTERN. PS00107; APR-1990 (CREATED); NOV-1995 (DATA UPDATE); SEP-2006 (INFO UPDATE). Protein kinases ATP-binding region signature. [LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]- x(5,18)-[livmfywcstar]-[aivp]-[livmfagckr]-k. /RELEASE=50.8,234112; /TOTAL=1989(1969); /POSITIVE=1890(1870); /UNKNOWN=1(1); /FALSE_POS=98(98); /FALSE_NEG=373; /PARTIAL=30; /TAXO-RANGE=??EPV; /MAX-REPEAT=2; /VERSION=1; P13368, 7LESS_DROME, T; P20806, 7LESS_DROVI, T; P45894, AAPK1_CAEEL, T; Q13131, AAPK1_HUMAN, T; Q5EG47, AAPK1_MOUSE, T; Q5RDH5, AAPK1_PONPY, T; P54645, AAPK1_RAT, T; Q95ZQ4, AAPK2_CAEEL, T; P54646, AAPK2_HUMAN, T; Q28948, AAPK2_PIG, T; Q5RD00, AAPK2_PONPY, T; Q09137, AAPK2_RAT, T; Q6ZMQ8, AATK_HUMAN, T; Q80YE4, AATK_MOUSE, T; P03949, ABL1_CAEEL, T; P00519, ABL1_HUMAN, T; P00520, ABL1_MOUSE, T; P42684, ABL2_HUMAN, T; P00522, ABL_DROME, T; P10447, ABL_FSVHY, T; P00521, ABL_MLVAB, T; Q9GMA3, VSX1_BOVIN, F; P29944, YCB2_PSEDE, F; P33222, YJFC_ECOLI, F; Q12291, YL063_YEAST, F; Q09371, YS42_CAEEL, F; O32095, YUEF_BACSU, F; P47917, ZRP4_MAIZE, F; Q9PIN2, ZUPT_CAMJE, F; 1A9U; 1AD5; 1AGW; 1APM; 1ATP; 1B6C; 1BKX; 1BL6; 1BL7; 1BLX; 1BMK; 1BX6; 1BYG; 1CKI; 1CKJ; 1CSN; 1CTP; 1DAW; 1DAY; 1DI9; 1DS5; 1E9H; 1EH4; 1ERK; 1F0Q; 1FGI; 1FGK; 1FIN; 1FMK; 1FMO; 1FOT; 1FPU; 1FQ1; 1FVV; 1G3N; 1GAG; 1GJO; 1GNG; 1GOL; 1GY3; 1H1P; 1H1Q; 1H1R; 1H1S; 1H1W; 1H24; 1H25; 1H26; 1H27; 1H28; 1H8F; 1HOW; 1I09; 1I44; 1IAN; 1IAS; 1IEP; 1IG1; 1IR3; 1IRK; 1J1B; 1J1C; 1J3H; 1J91; 1JAM; 1JKK; 1JKL; 1JKS; 1JKT; 1JLU; 1JPA; 1JQH; 1JST; 1JWH; 1K2P; 1K3A; 1K9A; 1KMU; 1KMW; 1KSW; 1KV1; 1KV2; 1KWP; 1L3R; 1LC9; 1LCH; 1LD2; 1LEW; 1LEZ; 1LFN; 1LFR; 1LG3; 1LHX; 1LP4; 1LPU; 1LR4; 1LUF; 1LWP; 1M14; 1M17; 1M2P; 1M2Q; 1M2R; 1M52; 1M7N; 1M7Q; 1MP8; 1MQ4; 1MQB; 1MRU; 1MUO; 1NA7; 1NXK; 1NY3; 1O6K; 1O6L; 1O6Y; 1O9U; 1OB3; 1OEC; 
1OGU; 1OI9; 1OIU; 1OIY; 1OKV; 1OKW; 1OKY; 1OKZ; 1OL1; 1OL2; 1OL5; 1OL6; 1OL7; 1OM1; 1OMW; 1OPJ; 1OPK; 1OPL; 1OUK; 1OUY; 1OVE; 1OZ1; 1P14; 1P38; 1P4F; 1P4O; 1P5E; 1PF6; 1PF8; 1PHK; 1PJK; 1PKD; 1PKG; 1PME; 1PVK; 1PY5; 1PYX; 1Q24; 1Q3D; 1Q3W; 1Q41; 1Q4L; 1Q5K; 1Q61; 1Q62; 1Q8T; 1Q8U; 1Q8W; 1Q8Y; 1Q8Z; 1Q97; 1Q99; 1QCF; 1QL6; 1QMZ; 1QPC; 1QPE; 1QPJ; 1R0E; 1R0P; 1R1W; 1R39; 1R3C; 1RDQ; 1RE8; 1REJ; 1REK; 1RJB; 1RQQ; 1RW8; 1S9I; 1S9J; 1SM2; 1SMH; 1SNU; 1SNX; 1STC; 1SYK; 1SZM; 1T45; 1T46; 1TVO; 1U59; 1U5Q; 1U5R; 1U7E; 1UNL; 1URC; 1UU3; 1UU7; 1UU8; 1UU9; 1UV5; 1UVR; 1UWH; 1UWJ; 1V0B; 1V0P; 1VJY; 1VR2; 1VYW; 1VZO; 1W7H; 1W82; 1W83; 1W84; 1W98; 1WBN; 1WBO; 1WBP; 1WBS; 1WBT; 1WBV; 1WBW; 1WFC; 1WMK; 1WZY; 1X8B; 1XH4; 1XH5; 1XH6; 1XH7; 1XH8; 1XH9; 1XHA; 1XJD; 1XKK; 1XO2; 1XQZ; 1XR1; 1XWS; 1Y57; 1Y6B; 1Y8G; 1YDR; 1YDS; 1YDT; 1YHS; 1YI3; 1YI4; 1YKR; 1YM7; 1YMI; 1YOL; 1YOM; 1YVJ; 1YW2; 1YWR; 1Z57; 1Z5M; 1ZMU; 1ZMW; 1ZOE; 1ZOG; 1ZOH; 1ZRZ; 1ZXE; 1ZY4; 1ZY5; 1ZYC; 1ZYD; 1ZZ2; 1ZZL; 2A19; 2A1A; 2AC3; 2AC5; 2AUH; 2B4S; 2B54; 2B7A; 2B9F; 2B9H; 2B9I; 2B9J; 2BAJ; 2BAK; 2BAL; 2BAQ; 2BCJ; 2BIK; 2BIL; 2BIY; 2BKZ; 2BMC; 2BPM; 2BZH; 2BZI; 2BZJ; 2C1A; 2C1B; 2C30; 2C3I; 2C4G; 2C5N; 2C5O; 2C5P; 2C5T; 2C5X; 2C6D; 2C6E; 2C6T; 2CDZ; 2CHL; 2CPK; 2CSN; 2ERK; 2ERZ; 2ESM; 2ETK; 2ETO; 2ETR; 2EU9; 2EXM; 2F49; 2F4J; 2F57; 2FA2; 2FGI; 2FO0; 2G15; 2HCK; 2PHK; 2SRC; 3ERK; 3LCK; 4ERK; PDOC00100; Figure 3.7: Example of a PROSITE entry. 33
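The consensus pattern notation explained for the PA lines can be illustrated by translating a pattern into a regular expression. The sketch below is an assumption-laden illustration, not how PATMATMOTIFS itself works: the function name is hypothetical, only the elements described in the text ([..], {..}, x, and repeat counts) are handled, and the pattern is written in upper case as in the PROSITE database.

```python
import re

# One pattern element: a residue class, an exclusion class, a single residue
# or x, optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style consensus pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            core = "."                          # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # any amino acid except these
        else:
            core = residue                      # literal residue or [..] class
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


# Consensus pattern of PROTEIN_KINASE_ATP as listed in Figure 3.7.
KINASE_ATP = ("[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-"
              "[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K.")
```

Applied to the sequence fragment shown in Figure 3.6, `re.search(prosite_to_regex(KINASE_ATP), "KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG")` matches the 23-residue segment `LGAGQFGEVWMATYNKHTKVAVK`, which agrees with the motif length reported by PATMATMOTIFS.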

the motif PROTEIN_KINASE_ATP shown in Figure 3.7 contains an x(5,18), which indicates that between five and eighteen amino acids of any type may occur in that position. These variable-length positions do not affect the illustration of the structural template of the motif; therefore, they are ignored by CISPred.

PSIPRED and SSPRO

PSIPRED [24] and SSPRO [37] are the two existing protein secondary structure prediction tools integrated into CISPred. Figure 3.9 illustrates an example of a PSIPRED result: the first column provides the index of each amino acid residue in the queried sequence; the second column provides the amino acid sequence; the third column provides the predicted structure, which is the one with the highest confidence score at each amino acid position; the fourth column provides the confidence scores of the structure type C; the fifth column provides the confidence scores of the structure type H; and the sixth column provides the confidence scores of the structure type E. Figure 3.10 shows a sample result from SSPRO, which contains the ID and description of the queried amino acid sequence, followed by the queried amino acid sequence and the predicted sequence of secondary structures. SSPRO does not provide any confidence scores for its predictions.

PSIPRED and SSPRO are independent and successful prediction tools. They provide complete predictions for all of the amino acid residues of the queried sequence. Therefore, their results and confidence scores are directly integrated into CISPred.

Generating Consensus Structure Prediction

The consensus predictions of CISPred are determined by integrating the fold structures provided by THREADER [23], the predicted structures provided by

Figure 3.8: The generation of motif structure formulae. (Flow: query amino acid sequence; PATMATMOTIFS; motif name; PROSITE database; PDB IDs of all proteins having this motif; PDBFINDER database; secondary structure sequences of the proteins having this motif; PATMATMOTIFS; sequence segments for this motif; program generating the structure formula; motif structure formula; consensus prediction program.)

45 # PSIPRED VFORMAT 1 E C D C I E I E V E V E A E L E Y C D C Y C E C A C I C H C H C E C Q H S H V H L H D H D H F H Y H T H A C T C E C S C Q C Y C Q C Q C Q C P C Figure 3.9: Example of a PSIPRED vertical result. 36

46 PSIPRED [24] and SSPRO [37], and motif structure formulae. CISPred parses the results of SSPRO, PSIPRED, THREADER alignments, and motif structure formulae, and then generates consensus predictions from the first amino acid position to the last, one amino acid position at a time. Figure 3.11 shows an example of the available structures and confidence scores for one amino acid position. The line starting with AA shows the type of queried amino acid in this position, a K (Lysine) in this example. The line starting with SSPRO shows the structure type predicted by SSPRO in this position, a H in this example. The line starting with PSIPRED shows the PSIPRED result in this position, and provides an index of this position in the queried amino acid sequence ( 10 ), the queried amino acid type in this position ( K ), the structure type predicted by PSIPRED in this position ( H ), the confidence score of the structure type C in this position ( ), the confidence score of the structure type H in this position ( ), and the confidence score of the structure type E in this position ( ). The lines starting with THREADER show the THREADER alignments in this position. 
In this example, five alignments are 1AD5:B PDBID CHAIN SEQUENCE EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDSLETEEWFFKGI SRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHYKIRTLDNGGFYISPRSTFSTLQ ELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKT MKPGSMSVEAFLAEANVMKTLQHDKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFS AQIAEGMAFIEQRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSF GILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQSVLDDF YTATESQYQQQP CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHHHEEECCCHHHCCCEECCC CHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCEEEEEEEEEEECCCCCEECCCCCCECCHH HHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCCCECCHHHEEEEEEEECCCCEEEEEEEECCCEEEEEEE ECCCCECHHHHHHHHHHHCCCCCCCECCEEEEECCCCCEEEECCCCCCEHHHHHHCHHHHCCCHHHHHHHH HHHHHHHHHHHHCCCCCCCCCHHHEEECCCCCEEECCCCHHHHCCCCCHHHCCHHHHHHCCCCHHHHHHHH HHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHHHHHHCC CCCCCCCCECCC Figure 3.10: Example of a SSPRO result. 37

chosen for integration into CISPred, and their structures in this position are C, H, E, H and H, with confidence scores of -, 6, 3, 0 and 5, respectively (THREADER does not provide confidence scores for the structure type C, and it uses - to represent the confidence score of the structure type C). However, as previously presented, all of the confidence scores of the structure types E and H are increased by 1 from the original numbers provided by the THREADER alignments, and the confidence scores of the structure type C are arbitrarily set to 5. Therefore, the confidence scores actually considered by CISPred are 5, 7, 4, 1 and 6, respectively. The line starting with Str_F shows the structure formula in this position, and indicates the proportion of the occurrence of each structure type in the position. In the Str_F line, C:0.11 means that 11% of the protein motifs have a structure type C in this position.

For each of the amino acid positions in a queried sequence, CISPred computes a total confidence score $C_{total}$ for each of the three structure types: H, E, and C. The structure type with the highest total confidence score $C_{total}$ is taken as the consensus prediction at that amino acid position. Equation 3.4 shows the computation of the total confidence score $C_{total}$ for the structure type H at an amino acid position.

$$C_{total}^{H} = W_{psi}\,C_{psi}^{H} + W_{ssp}\,C_{ssp}^{H} + W_{sf}\,P_{sf}^{H} + W_{thr}\,C_{thr\_avg}^{H} \tag{3.4}$$

In Equation 3.4, $C_{psi}^{H}$ is the confidence score of the structure type H provided by PSIPRED. The confidence scores provided by PSIPRED are converted to a scale from 0 to 10; therefore, the original confidence scores provided by PSIPRED are multiplied by 10. In the example presented in Figure 3.11, the integrated

AA K
SSPRO H
PSIPRED 10 K H
THREADER - K C
THREADER 6 K H
THREADER 3 K E
THREADER 0 K H
THREADER 5 K H
Str_F [C:0.11, H:0.87, T:0.00, S:0.01, E:0.00, G:0.01, I:0.00, B:0.00]
Figure 3.11: Example of the information available in one amino acid position.

confidence score of the structure type H ($C_{psi}^{H}$) is 0.627 × 10 = 6.27.

$C_{ssp}^{H}$ is the confidence score of the structure type H provided by SSPRO. As previously mentioned, SSPRO does not provide any confidence scores for its predictions. To integrate the predictions of SSPRO and assign them appropriate confidence scores, each of the predictions of SSPRO is considered to have a confidence score of 5, one of the two integers halfway between 1 and 10. In the example illustrated in Figure 3.11, SSPRO predicts an H in that position; therefore, $C_{ssp}^{H}$ in that example is set to 5. If the structure type predicted by SSPRO in that position is not an H, $C_{ssp}^{H}$ is set to 0.

$P_{sf}^{H}$ is the proportion of the structure type H provided by the structure formula at an amino acid position. CISPred uses a 3-state scheme [40], in which G and H are taken to be helix ( H ), E and B are taken to be strand ( E ), and all of the other structure types are considered as coil ( C ). The proportion of the structure type H therefore includes the proportion of the structure type G. Like the confidence scores provided by PSIPRED, the proportions provided by the structure formula are multiplied by 10 in order to change their scale to between 0 and 10. In the example illustrated in Figure 3.11, $P_{sf}^{H} = (0.87 + 0.01) \times 10 = 8.8$.

$C_{thr\_avg}^{H}$ is the average confidence score of the structure type H at an amino acid position of the selected THREADER alignments. As previously mentioned, the fold structures from the THREADER alignments are clustered, and one cluster of fold structures is chosen for integration into CISPred. At each position, CISPred computes an average of the confidence scores of these selected fold structures. Equation 3.5 illustrates the computation of $C_{thr\_avg}^{H}$: $C_{thr}^{H_i}$ is the confidence score of the structure type H at an amino acid position of the $i$-th THREADER alignment predicting H there, and $N_H$ is the number of occurrences of the structure type H at that amino acid position.
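The per-position combination of the four sources of evidence in Equation 3.4 can be sketched as follows. This is a hedged illustration rather than the CISPred source code: the function names, argument formats, and weight keys are assumptions, and positions with no motif or no matching THREADER structure are assumed to contribute 0 for that term.

```python
# 3-state reduction from the text: G and H -> helix, E and B -> strand,
# every other structure type -> coil.
_THREE_STATE = {"H": "H", "G": "H", "E": "E", "B": "E"}


def to_three_state(structure):
    return _THREE_STATE.get(structure, "C")


def total_confidence(state, psipred_conf, sspro_pred, formula, threader, weights):
    """Weighted total confidence score for one 3-state type at one position.

    psipred_conf: dict of PSIPRED confidences (0 to 1) keyed by "C", "H", "E"
    sspro_pred:   structure type predicted by SSPRO at this position
    formula:      dict of structure-formula proportions (empty if no motif)
    threader:     list of (structure, adjusted confidence) pairs
    weights:      dict with keys "psi", "ssp", "sf", "thr" (assumed names)
    """
    c_psi = psipred_conf[state] * 10
    c_ssp = 5 if to_three_state(sspro_pred) == state else 0
    p_sf = 10 * sum(p for t, p in formula.items() if to_three_state(t) == state)
    scores = [c for s, c in threader if to_three_state(s) == state]
    c_thr_avg = sum(scores) / len(scores) if scores else 0.0
    return (weights["psi"] * c_psi + weights["ssp"] * c_ssp
            + weights["sf"] * p_sf + weights["thr"] * c_thr_avg)


def consensus_state(psipred_conf, sspro_pred, formula, threader, weights):
    """Return the 3-state type with the highest total confidence score."""
    return max("HEC", key=lambda s: total_confidence(
        s, psipred_conf, sspro_pred, formula, threader, weights))
```

With the H-related values from Figure 3.11 (a PSIPRED confidence of 0.627, an SSPRO prediction of H, structure formula proportions H:0.87 and G:0.01, and adjusted THREADER scores 7, 1 and 6 for H), the four unweighted terms are 6.27, 5, 8.8 and 14/3. The weights are passed in as parameters because their values depend on the measured accuracy of each tool.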
For the example illustrated in Figure 3.11, $C_{thr\_H_1}$, $C_{thr\_H_2}$, and $C_{thr\_H_3}$ are equal to 7, 1, and 6, respectively; $\sum_{i=1}^{N_H} C_{thr\_H_i}$ equals 7 + 1 + 6 = 14, and $N_H$ equals 3.

$$C_{thr\_avg\_H} = \frac{\sum_{i=1}^{N_H} C_{thr\_H_i}}{N_H} \qquad (3.5)$$

The consensus predictions of CISPred are determined not only by the confidence scores provided by each tool, but also by the overall prediction accuracies of the integrated tools. In Equation 3.4, $W_{psi}$, $W_{ssp}$, and $W_{thr}$ are the weights of PSIPRED, SSPRO, and THREADER, and $W_{sf}$ is the weight for the structure formulae. A weight is a real number from 0 to 1 inclusive that indicates the accuracy rate of the information provided by a tool. The weight of the structure formulae is set to 1 because the proportions they provide are determined by statistical analysis of the real structural data of existing motifs, and are not predicted by algorithms. The weights of PSIPRED, SSPRO, and THREADER are equal to their average 3-state accuracies on a training dataset containing 80 randomly selected amino acid sequences.

3-state accuracy, also called the Q3 score, is used to evaluate the prediction accuracy of secondary structure prediction tools. It considers only the following three states: helix (H), strand (E), and coil (C). 3-state accuracy is defined as the percentage of the amino acid residues that are correctly predicted. Equation 3.6 defines 3-state accuracy, where $N_{correct}$ is the number of residues that are correctly predicted, and $N_{total}$ is the total number of amino acid residues in the sequence.

$$Q_3 = \frac{N_{correct}}{N_{total}} \qquad (3.6)$$

SSPRO and PSIPRED are independent existing protein secondary structure prediction tools, and their outputs are in the format of protein secondary structure sequences. The 3-state accuracy scores of their predictions can be computed

by comparing the predicted structure type in each amino acid position with the real structure type retrieved from the PDBFINDER database [20]. The average 3-state accuracy of SSPRO predictions on the 80 training sequences is 0.937; the average 3-state accuracy of PSIPRED predictions is computed on the same 80 training sequences.

THREADER does not provide predicted secondary structure sequences; instead, it provides the structures of the most appropriate folds, and the number of these folds is influenced by the threshold at which the hierarchical clustering stops. The structure type that has the highest average confidence score is considered to be the structure type predicted by THREADER in an amino acid position. Equation 3.5 illustrates the computation of the average confidence score of the structure type H. Figure 3.12 illustrates the average 3-state accuracy of THREADER predictions on 80 random sequences. The average 3-state accuracy declines as the threshold of the hierarchical clustering rises, which indicates that the more similar the selected folds are, the higher the prediction accuracy of THREADER is. As shown in Figure 3.12, the minimum average 3-state accuracy of THREADER is 0.684. The 3-state accuracy of THREADER is therefore taken to be the value halfway between the maximum accuracy and the minimum accuracy.

Figure 3.12: Average 3-state accuracy of THREADER predictions on 80 random sequences. [Axes: threshold (horizontal), average 3-state accuracy, Q3 score (vertical).]

Based on Equation 3.4, the total confidence score for the structure type H in the amino acid position presented in Figure 3.11 is $C_{total\_H} \approx 21.88$. Similarly, $C_{total\_E} \approx 6.07$ and $C_{total\_C} \approx 4.94$. In the amino acid position presented in Figure 3.11, the maximum among $C_{total\_H}$, $C_{total\_E}$, and $C_{total\_C}$ is $C_{total\_H}$, which approximately equals 21.88; therefore, the structure type H is considered to be the consensus structure type predicted by CISPred in this amino acid position. In the same way, CISPred provides a consensus prediction in each of the amino acid positions of a queried sequence; together, these compose a protein secondary structure sequence that is the consensus prediction for the queried sequence.

$C_{total}$ reaches its maximum limit, $L_{max}$, when the confidence score provided by PSIPRED equals 1.00, the confidence score provided by SSPRO equals 5, the proportion provided by the structure formula equals 1.00, and the average confidence score of the THREADER alignments equals 10. $C_{total}$ reaches its minimum limit, $L_{min}$, when the confidence score provided by PSIPRED equals 0.00, the confidence score provided by SSPRO equals 0, the proportion provided by the structure formula equals 0.00, and the average confidence score of the THREADER alignments equals 0, which occurs when a fold has a gap ("-") aligned in the amino acid position. Therefore, $L_{min} = 0$.

In order to present total confidence scores clearly, the $C_{total}$ of each structure type is converted into a real number from 0 to 1 inclusive. Equation 3.7 illustrates the conversion of the total confidence score of the structure type H, where $C'_{total\_H}$ denotes the converted score.

$$C'_{total\_H} = \frac{C_{total\_H} - L_{min}}{L_{max} - L_{min}} \qquad (3.7)$$

In the example illustrated in Figure 3.11, the converted total confidence score of the structure type H is computed as $C'_{total\_H} = 21.88 / L_{max}$, the

converted total confidence score of the structure type E is computed as $C'_{total\_E} = 6.07 / L_{max}$, and the converted total confidence score of the structure type C is computed as $C'_{total\_C} = 4.94 / L_{max}$.

Figure 3.13 illustrates the consensus prediction of CISPred in the amino acid position illustrated in Figure 3.11, in which 10 is the index of the amino acid position, K is the amino acid type, H is the structure type predicted by CISPred, and the three remaining values are the converted total confidence scores of the structure types C, H, and E, respectively. Figure 3.14 illustrates an example of a CISPred result in vertical format, in which each line is the CISPred prediction result for one amino acid position. Figure 3.15 illustrates an example of a CISPred result in horizontal format, in which the queried sequence is shown in FASTA format followed by the CISPred prediction results.

10 K H

Figure 3.13: Example of a CISPred vertical result in an amino acid position.
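The final steps described above (choosing the structure type with the highest total confidence score, converting the totals with Equation 3.7, and scoring a prediction with Equation 3.6) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the value used for `L_MAX` is a placeholder rather than the limit computed in the thesis, and the two short structure strings are made up for the Q3 example.

```python
def normalize(total, l_min, l_max):
    """Min-max conversion of a total confidence score (Equation 3.7)."""
    return (total - l_min) / (l_max - l_min)


def consensus_state(totals):
    """Return the structure type with the highest total confidence score."""
    return max(totals, key=totals.get)


def q3(predicted, observed):
    """3-state accuracy: share of positions predicted correctly (Equation 3.6)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Total confidence scores of the position in Figure 3.11.
totals = {"H": 21.88, "E": 6.07, "C": 4.94}
L_MIN, L_MAX = 0.0, 30.0  # L_MAX is a placeholder, not the thesis value

print(consensus_state(totals))
converted = {s: normalize(v, L_MIN, L_MAX) for s, v in totals.items()}
print(converted)
print(q3("HHHEECCC", "HHHEEECC"))  # 7 of 8 made-up positions agree
```

Because $L_{min} = 0$, the conversion reduces to dividing each total by $L_{max}$, which preserves the ranking of H, E, and C.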

# CISPred vertical result

Figure 3.14: Example of a CISPred vertical result. [Each line of the listing gives the position index, the amino acid type, the predicted structure type, and the converted total confidence scores, for the first and last positions of the sequence shown in Figure 3.15.]

>1AD5:B PDBID CHAIN SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP
CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHCCEEECCC
CCHCCCEECCCCHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCCEEEEEE
EEEECCCCCEECCCCCCHHHHHHHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCHHHHH
HHHHHHEEEEECCCCEEEEEEEECCCEEEEEEEECCCCCCHHHHHHHHHHHCCCCCCCEE
EEEEEECCCCCEEEECCCCCCEHHHHHHCHCCHCCCHHHHHHHHHHHHHHHHHHHHCCCC
CCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCCCHCCHHHHHHCCCCHHHHEEEEEEEEE
HHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHH
HHHHCCCCCCCCCCECCC

Figure 3.15: Example of a CISPred horizontal result.

Chapter 4 System Implementation

4.1 Overview

From a software engineering perspective, CISPred is a web application: an application that is accessed through a web browser over the Internet. The inputs of CISPred, amino acid sequences, are submitted on a web page of the CISPred website. After the CISPred predictions are finished, the consensus prediction results are sent to the email address of the user. The tools integrated in CISPred are executed concurrently on a 164-processor high performance SUN cluster. The computations for one queried sequence are performed simultaneously on at least 12 processors, which greatly reduces the execution time of CISPred.

4.2 System Infrastructure

Figure 4.1 illustrates the system infrastructure of CISPred. A user of CISPred accesses the CISPred website through a web browser. The CISPred website is constructed in HTML and

CGI (Common Gateway Interface), and runs on a web server, which is a 4-processor SUN computer with the Fully Qualified Domain Name (FQDN) quartet.cs.unb.ca. The user's email address and the queried amino acid sequences in FASTA format (shown in Figure 4.2) are submitted to the CISPred web server through the submission web page of CISPred shown in Figure 4.3. A PERL CGI program running on the CISPred web server parses the queried sequences and submits concurrent jobs to the 164-processor high performance SUN cluster with FQDN chorus.cs.unb.ca. A MySQL database that runs on the 4-processor SUN computer stores the information about queried sequences and their concurrent job IDs. A program that runs on the web server retrieves the IDs of unfinished concurrent jobs and checks the status of these jobs. After the concurrent jobs are finished, a program on the cluster integrates the results of the concurrent jobs and generates the consensus prediction results. The consensus prediction results are sent to the CISPred web server via FTP and then sent to the email address of the user.

Moreover, the CISPred website contains web pages for CISPred administrators to view the usage information of CISPred. The administration web pages are implemented in PERL CGI and list the submission and completion times of each queried task, as well as the IP address and the geographical location of the computer that a CISPred user uses to submit queried sequences. This information is stored in a table in the MySQL database that runs on the 4-processor SUN machine. Figure 4.4 shows one of the administration web pages of CISPred.

Figure 4.1: The system infrastructure of CISPred. [Diagram: users submit queried sequences over HTTP to the web server and database server, which hosts the CISPred web page, the program submitting execution jobs, the program checking job status and sending emails, and the MySQL database (task information, job IDs, job status); the SUN cluster with 160 processors executes the integrated tools and runs the program generating consensus predictions, which are returned to the users.]

>1AD5:B PDBID CHAIN SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP

Figure 4.2: Example of a CISPred queried sequence.
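The parsing step performed on a submission such as the one in Figure 4.2 can be illustrated with a short Python sketch. CISPred implements this step in a PERL CGI program whose code is not shown here, so the following is only an assumed equivalent: the function name `parse_fasta` and the two short example records are hypothetical.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs, one per queried sequence."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up records, used only to exercise the parser.
submission = """>seqA hypothetical
EDIIVV
ALYD
>seqB
MGPTG
"""
for header, sequence in parse_fasta(submission):
    print(header, len(sequence))
```

Each returned pair corresponds to one queried sequence, for which a separate set of concurrent jobs would then be submitted.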

Figure 4.3: Web page for submitting query sequences to CISPred.

Figure 4.4: A CISPred web page displaying user jobs.

4.3 Concurrent Implementation

Overview

The tools integrated in CISPred, THREADER [23], PSIPRED [24], and SSPRO [37], and the finding of motif structure formulae are executed concurrently on a high performance SUN cluster containing 164 processors. Figure 4.5 illustrates an overview of the concurrent implementation of CISPred. A PERL CGI program that runs on the CISPred web server parses the submission file, which may contain more than one queried amino acid sequence in FASTA format. For each of the queried sequences, the PERL CGI program submits several concurrent jobs to the high performance SUN cluster: 10 concurrent jobs of THREADER [23], 1 concurrent job of PSIPRED [24], 1 concurrent job of SSPRO [37], and n concurrent jobs of finding motif structural formulae, where n equals the number of existing motifs in the queried amino acid sequence. After the executions of the concurrent jobs are finished, a program integrates the results of each concurrent job and generates consensus predictions.

THREADER

The default fold library of THREADER contains 6251 protein folds. A queried sequence is threaded through each of the protein folds, and free energy is computed during this process. Figure 4.6 illustrates the concurrent implementation of THREADER. CISPred divides the library into 10 sub-libraries, each of which contains approximately 625 protein folds. A queried sequence is threaded through the folds in the 10 sub-libraries simultaneously on 10 processors. In total, a PERL CGI program that runs on the CISPred web server submits 10 concurrent jobs of THREADER according to the submission template shown in

Figure 4.5: Overview of the concurrent implementation of CISPred. [Diagram: for a query sequence, the concurrent job submission program starts THREADER tasks 1 to 10, motif structure finding tasks 1 to n, the SSPRO task, and the PSIPRED task; their results are passed to the result integration program.]

Figure 4.6: Concurrent implementation of THREADER. [Diagram: for a queried sequence, the concurrent job submission program starts THREADER 1 to THREADER 10, each with its own fold library; results 1 to 10 are passed to the result integration program.]

Appendix A.1. After the executions of these concurrent jobs are finished, 10 alignment reports and 10 score reports are generated by THREADER. An integration program that runs on the high performance SUN cluster sorts each of the score reports. In each of the score reports, the 20 folds with the highest filtered combined energy Z-scores [23] are selected. The integration program gathers the 20 folds from each of the sub-libraries, sorts them by their filtered combined energy Z-scores [23], and selects the top 20 folds overall, as shown in Figure 4.7. These 20 folds are the most appropriate folds among the 6251 protein folds of the THREADER default library. CISPred then checks the filtered combined energy Z-score of each of these 20 folds, and the protein folds with filtered combined energy Z-scores lower than 3.5 are eliminated. The remaining folds are clustered, and only the folds in one cluster are integrated in CISPred.

Figure 4.7: The sorting of THREADER reports. [Diagram: score reports 1 to 10 are each sorted to give the top 20 folds per report, which are sorted again to give the top 20 folds in all reports.]

Finding Protein Motif Secondary Structures

A queried amino acid sequence may contain several motifs. The submission program that runs on the CISPred web server executes PATMATMOTIFS [1] on each

of the queried sequences in order to find the existing motifs in a queried sequence. The finding of the structural formulae of one motif is performed on one processor. By parsing the results of PATMATMOTIFS, the submission program that runs on the CISPred web server retrieves the number of motifs found in a queried sequence and submits the corresponding number of concurrent jobs to the high performance SUN cluster, each of which performs an independent process of finding structural formulae for one motif. Appendix A.4 illustrates the template for submitting concurrent jobs of finding motif structural formulae. Figure 4.8 illustrates the concurrent implementation of finding motif structural formulae. After the concurrent jobs of finding motif structural formulae are finished, the structural formulae are integrated by a program that runs on the high performance SUN cluster.

SSPRO and PSIPRED

PSIPRED [24] and SSPRO [37] are independent existing protein structure prediction tools. Their execution time is within two minutes, much shorter than the execution time of THREADER, which may be up to several hours depending on the length of a queried sequence. PSIPRED and SSPRO are therefore not divided further, and each of them is executed on one processor. The submission PERL CGI program that runs on the CISPred web server submits one job for each of SSPRO and PSIPRED. Appendices A.2 and A.3 illustrate the templates for submitting jobs of SSPRO and PSIPRED. After the executions of the jobs are finished, the integration program that runs on the high performance SUN cluster directly integrates the results of SSPRO and PSIPRED.
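The selection of THREADER folds described in the THREADER subsection above (top 20 folds per score report, top 20 overall, then removal of folds with a Z-score below 3.5) can be sketched in Python as follows. This is a minimal sketch, not the CISPred integration program: representing a score report as a list of (fold name, Z-score) pairs is an assumption, and the fold names and scores in the example are made up.

```python
import heapq

TOP_N = 20      # folds kept per score report and overall
Z_CUTOFF = 3.5  # folds with a lower filtered combined energy Z-score are dropped


def top_folds(report, n=TOP_N):
    """Return the n (fold, z_score) pairs with the highest Z-scores, best first."""
    return heapq.nlargest(n, report, key=lambda item: item[1])


def select_folds(reports, n=TOP_N, cutoff=Z_CUTOFF):
    """Merge the per-sub-library reports and keep the best folds above the cutoff."""
    candidates = [fold for report in reports for fold in top_folds(report, n)]
    best = top_folds(candidates, n)
    return [(name, z) for name, z in best if z >= cutoff]


# Two made-up score reports standing in for the 10 sub-library reports.
reports = [
    [("foldA", 4.2), ("foldB", 1.0)],
    [("foldC", 3.5), ("foldD", 5.1)],
]
print(select_folds(reports, n=3))
```

Taking the top folds of each report before merging gives the same overall top 20 as sorting all folds together, because a fold outside its own report's top 20 cannot be among the overall top 20.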

Figure 4.8: Concurrent finding of motif structures. [Diagram: PATMATMOTIFS is run on the query amino acid sequence, and the concurrent tasks submission program passes the name of each motif found (Motif 1 to Motif n) to its own processor (Processor 1 to Processor n). On each processor, the PROSITE database gives the PDB IDs of all proteins that contain the motif; the PDBFINDER database gives the whole amino acid sequences and secondary structures of those proteins; PATMATMOTIFS extracts the segments of the secondary structure sequences of the motif; and the structure formula generation program produces the structure formula of the motif, which is passed to the consensus prediction program of CISPred.]

4.4 Execution Time

The executions of the integrated tools and the finding of motif structural formulae of CISPred are concurrently implemented on a high performance 160-processor SUN cluster, which greatly reduces the execution time. The PERL CGI program that runs on the CISPred web server submits the concurrent jobs to the high performance SUN cluster sequentially, and the jobs submitted first are executed first. The submission order of concurrent jobs is therefore the same as their execution order; this order is called the job schedule. Sometimes the number of available processors in the high performance cluster is less than the number of concurrent jobs submitted, so some of the concurrent jobs cannot be executed immediately after being submitted and have to wait. Moreover, the execution times of the concurrent jobs differ; for example, the execution time of SSPRO on one queried sequence is usually within 2 minutes, while a THREADER concurrent job usually takes considerably longer. The job schedule therefore influences the total execution time of the concurrent jobs.

Figures 4.9 and 4.10 illustrate the execution time and speedup of CISPred on protein 1AD5, shown in Figure 4.2. "Optimized Schedule" indicates that the concurrent jobs with longer execution times are submitted ahead of the concurrent jobs that take less time. The order of jobs submitted is as follows:

1. Job(s) for finding motif structural formulae (if a queried amino acid sequence contains more than one motif, the jobs for these motifs are submitted in the same order in which the motifs were found in the queried amino acid sequence);
2. The THREADER concurrent jobs;
3. The SSPRO job;
4. The PSIPRED job.

"Non-optimized Schedule" indicates that the concurrent jobs are submitted in a random order. As shown in Figures 4.9 and 4.10, submitting the concurrent jobs that have longer execution times ahead of the concurrent jobs that have shorter execution times reduces the total execution time and improves the speedup of CISPred.

CISPred generates the structure formulae for the motifs found in a queried sequence concurrently; initially, it generated them sequentially. In Figures 4.9 and 4.10, the line with the legend "Optimized Schedule" indicates the execution time and speedup of CISPred when the job schedule is optimized but the finding of structural formulae of existing motifs in a queried sequence is not concurrently implemented. The line with the legend "Non-optimized Schedule" indicates the execution time and speedup of CISPred when the job schedule is not optimized and the finding of structural formulae is not concurrently implemented. The line with the legend "Optimized schedule and concurrent finding of motif structures" indicates the execution time and speedup of CISPred when the job schedule is optimized and the finding of structural formulae is concurrently implemented. As shown in Figure 4.9, when the job schedule is optimized and the finding of structural formulae is concurrently implemented, the execution time of CISPred decreases substantially as the number of processors increases from 1 to 6. As Figure 4.10 shows, under the same conditions, the execution time of CISPred decreases by a factor of almost 8 when the number of processors increases

from 1 to 11.

Figure 4.9: The execution time of CISPred. [Legend: Non-optimized schedule; Optimized schedule; Optimized schedule and concurrent finding of motif structure formulas.]

Figure 4.10: The speedup of CISPred. [Legend: Non-optimized schedule; Optimized schedule; Optimized schedule and concurrent finding of motif structure formulas.]
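The effect of the job schedule discussed in Section 4.4 can be illustrated with a small Python simulation in which jobs start in submission order on whichever processor becomes free first. It is a sketch under stated assumptions: the function names are hypothetical, the cluster's real scheduler is not modelled, and the job durations are made-up values rather than measurements from CISPred.

```python
import heapq


def makespan(durations, processors):
    """Total time when jobs start in submission order on the first free processor."""
    free_at = [0.0] * processors        # time at which each processor becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available processor
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def speedup(durations, processors):
    """Sequential execution time divided by the concurrent execution time."""
    return sum(durations) / makespan(durations, processors)


# Made-up job durations in minutes, short jobs submitted first.
jobs = [2, 2, 20, 20, 20, 60]
longest_first = sorted(jobs, reverse=True)
print(makespan(jobs, 3), makespan(longest_first, 3))
print(round(speedup(jobs, 3), 2), round(speedup(longest_first, 3), 2))
```

With these made-up durations and 3 processors, submitting the longest job first lets the short jobs fill the remaining gaps, so the total execution time drops from 80 to 60 minutes.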

Chapter 5 Experimental Results

5.1 Overview

The purpose of the experiments is to test the prediction accuracy of CISPred, to determine a default threshold to be used in the clustering of protein folds generated from THREADER alignments, and to compare CISPred with other existing protein structure prediction tools. Two test datasets are used in the experiments.

The first test dataset consists of 109 Critical Assessment of Techniques for Protein Structure Prediction (CASP) [29] target amino acid sequences. CASP [29] is an organization that evaluates protein structure prediction methods. Prediction methods provide blind predictions before the structures of the target sequences are observed by experimental methods. The CASP target sequences have a variety of lengths and are newly discovered proteins; therefore, they have been widely used by current prediction tools as a standard test dataset. The 109 CASP sequences used in the experiments are randomly selected from CASP3 (1998), CASP4 (2000), CASP5 (2002), and CASP6 (2004).

The experiments are also performed on a dataset containing 1758 amino acid sequences selected from the PDBFINDER database [20] by the following procedure. The PDBFINDER database contains information, such as the amino acid sequences and the secondary structures, of currently known proteins. The database is in a plain text format, and its entries are listed in alphabetical order of the protein PDB IDs. We select 5000 proteins in alphabetical order, starting from the first protein listed in the PDBFINDER database (PDB ID 100D) and ending with the protein with PDB ID 4HTC. Out of these 5000 protein sequences, the following sequences are deleted:

1. The sequences included in the training dataset of 80 sequences presented in Chapter 3.
2. The sequences that contain illegal characters.
3. Some protein chains with the same PDB ID. Some proteins contain several chains. For example, 1ZD6:A and 1ZD6:B are the two chains of protein 1ZD6. The two chains have very high sequence similarity, as shown in Figure 5.1, and are listed in PDBFINDER as two separate entries. In this situation, only the first chain of the protein is included, and all of the other chains of the same protein are eliminated.

This test dataset (which contains 1758 sequences) is much larger than the CASP dataset (which has 109 sequences) and contains selected amino acid sequences of regular known proteins.

As is the case with most of the other existing protein structure prediction tools, CISPred only predicts three structural states: helix (H), strand (E), and coil (C). The eight secondary structure types defined by DSSP are reduced to these three states based on a 3-state scheme [40]: G and H are taken to

be helix (H), E and B are both taken to be strand (E), and all of the other structure types are considered to be coil (C). 3-state accuracy is used to evaluate prediction accuracy in the experiments. 3-state accuracy is defined as the percentage of the amino acid residues that are correctly predicted, as shown in Equation 3.6.

>1ZD6:A PDBID CHAIN SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE
>1ZD6:B PDBID CHAIN SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE

Figure 5.1: Two chains of protein 1ZD6.

5.2 CISPred Testing Results on CASP Sequences

As presented in Chapter 3, the secondary structure sequences of the protein folds generated from THREADER alignments are clustered, and only the folds in one cluster are integrated by CISPred. The clustering process stops when the distance between the nearest two clusters reaches a threshold. In order to test the prediction accuracy of CISPred at each clustering threshold from 1% to 100%, CISPred is executed 100 times on the two datasets, each time with a different clustering threshold. Figure 5.2 illustrates the average Q3 scores of the 109 CASP sequences at each threshold from 1% to 100%, in steps of 1%.

As shown in Figure 5.2, the average 3-state accuracy of CISPred predictions holds steady over the lowest thresholds and then dips slightly. It reaches its peak, 0.828, when the threshold is 0.40, and declines from its peak to its lowest point, 0.815, as the threshold is raised further. It then increases slightly from the lowest point as the threshold increases beyond 0.66. Overall, the average 3-state accuracy declines sharply once the threshold is raised above 0.40. Moreover, it stays relatively high when the threshold is between 0 and 0.40, and relatively low when the threshold is between 0.66 and 1, which illustrates that the clustering of the protein folds generated from THREADER alignments improves the prediction accuracy of CISPred: when the threshold reaches 1, all of the protein folds are integrated, which is equivalent to integrating all of the protein folds without clustering them.

Figure 5.3 illustrates the standard deviation of the 3-state accuracies of the CISPred predictions with different thresholds on the 109 CASP sequences. Standard deviation measures how widely the values in a distribution are spread. It is computed according to Equation 5.1.

Figure 5.2: Average 3-state accuracy of CISPred on the 109 CASP sequences. [Axes: threshold (horizontal), average 3-state accuracy, Q3 score (vertical).]

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad (5.1)$$

N stands for the number of samples taken, $x_i$ is the value of each sample, and $\bar{x}$ is the average of the sample values.

Figure 5.3: Standard deviation of CISPred on the 109 CASP sequences. [Axes: threshold (horizontal), standard deviation (vertical).]

In statistics, the coefficient of variation [44] also measures the dispersion of a probability distribution. The coefficient of variation is defined by Equation 5.2, in which $\mu$ stands for the arithmetic mean, and $\sigma$ stands for the standard deviation:

$$C_v = \frac{\sigma}{\mu} \qquad (5.2)$$

Figure 5.4 illustrates the coefficient of variation of the 3-state accuracies of the CISPred predictions with different thresholds on the 109 CASP sequences. Both Figures 5.3 and 5.4 show that the 3-state accuracies of the 109 CASP sequences have the highest dispersion when the threshold is 0.60, and the lowest dispersion at a threshold above 0.60. Overall, the standard deviation is roughly 0.1 when the threshold is increased from 0 to 0.62, and declines when the threshold is increased from 0.60 to 1. The coefficient of variation is roughly 12.1% when the threshold is increased from 0 to 0.53 and, after a rise, declines to 11.6% when the threshold is increased from 0.6 to 1. Generally speaking, both the standard deviation and the coefficient of variation stay at lower values when the threshold reaches a relatively higher value. The standard deviation does not exceed 0.102, and the coefficient of variation ranges from 11.6% to 12.4%; neither changes greatly, and both remain at a low level.

Figure 5.4: Coefficient of variation of CISPred on the 109 CASP sequences. [Axes: threshold (horizontal), coefficient of variation (vertical).]
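Equations 5.1 and 5.2 can be checked on a small example with the following Python sketch. The function names are hypothetical, and the three Q3 scores are made-up values used only to show the arithmetic; they are not results from the experiments.

```python
import math


def std_dev(values):
    """Standard deviation with a 1/N factor, as in Equation 5.1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def coefficient_of_variation(values):
    """Standard deviation divided by the arithmetic mean, as in Equation 5.2."""
    mean = sum(values) / len(values)
    return std_dev(values) / mean


# Made-up Q3 scores of three sequences at one clustering threshold.
scores = [0.70, 0.80, 0.90]
print(round(std_dev(scores), 4))
print(round(coefficient_of_variation(scores), 4))
```

Because the coefficient of variation divides by the mean, it allows the dispersion at different thresholds to be compared even when the average accuracy changes.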

Figure 5.5 illustrates the number of sequences that have 3-state accuracies in some specific ranges. It shows that, at every threshold, CISPred predicts at least 20 of the 109 CASP sequences (about 18.3%) with 3-state accuracy higher than 90%, at least 35 sequences (about 32%) with 3-state accuracy higher than 85%, and at least 60 sequences (about 55%) with 3-state accuracy higher than 80%. The number of sequences with higher 3-state accuracies, such as those above 90%, 85%, 80%, and 75%, declines when the threshold is increased from 0.48 to 0.64, while the number of sequences with lower 3-state accuracies, such as 65%, 60%, and 55%, increases as the threshold is increased from 0.48.

Figure 5.5: Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 109 sequences dataset. [Axes: threshold (horizontal), number of sequences (vertical). Legend: Above 90%; Above 85%; Above 80%; Above 75%; Above 70%; Above 65%; Above 60%; Above 55%; Above 50%.]
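The counts plotted in Figure 5.5 correspond to a simple tally of per-sequence Q3 scores against a set of cutoffs, which can be sketched in Python as follows. The function name `count_above` is hypothetical, and the four scores are made-up values rather than CISPred results.

```python
def count_above(q3_scores, cutoffs=(0.90, 0.85, 0.80)):
    """Count, for each cutoff, the sequences whose Q3 score is higher than it."""
    return {cutoff: sum(score > cutoff for score in q3_scores)
            for cutoff in cutoffs}


# Made-up Q3 scores of four sequences at one clustering threshold.
scores_at_threshold = [0.95, 0.86, 0.81, 0.50]
counts = count_above(scores_at_threshold)
print(counts)
print({c: n / len(scores_at_threshold) for c, n in counts.items()})
```

Repeating this tally for every clustering threshold gives one curve per cutoff, as in the figure.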

Figures 5.6, 5.7, and 5.8 illustrate the distributions of the CISPred predictions on the 109 CASP sequences with 1%, 3%, and 5% 3-state accuracy as intervals, respectively. In these three gray-scale graphs, the more sequences are located in an area, the darker the area is. In Figure 5.6, the area with 3-state accuracy of 80% and a threshold from 0% to 58% is the darkest area, which indicates that this area contains the densest distribution of sequences. Figure 5.7 shows that, besides the area with 80% 3-state accuracy and a threshold from 0% to 58%, the area with 90% 3-state accuracy and a threshold from 25% to 45% also contains a relatively high density of sequences. Figure 5.8 shows that the area with 3-state accuracy from 75% to 85% and a threshold from 40% to 44%, 48% to 58%, or 60% to 100% contains a higher density of sequences. Overall, the sequences with 3-state accuracy higher than 90% are mostly generated when the threshold is from 25% to 50%, the sequences with 80% to 85% 3-state accuracy are mostly generated when the threshold is from 70% to 100%, and the sequences with 78% to 80% 3-state accuracy are mostly generated when the threshold is from 0% to 60%.

5.3 CISPred Testing Results on 1758 Sequences

The experiments conducted on the 109 CASP sequences are also performed on a dataset containing 1758 amino acid sequences. The 1758 sequences are selected from the PDBFINDER database [20], which contains information about known proteins, such as amino acid sequences and secondary structure sequences. Figure 5.9 illustrates the average 3-state accuracies of CISPred on the 1758 sequences. It shows that the average 3-state accuracy rises as the threshold is increased from 0 to 0.45, declines as the threshold is increased to 0.62, and then rises again until it reaches its peak. In total, the average 3-state accuracy rises when the threshold is increased from

Figure 5.6: Distribution of the 109 CASP sequences predicted by CISPred with 1% 3-state accuracy as interval. [Axes: threshold in percentage (horizontal), 3-state accuracy in percentage (vertical).]