Traditional approaches to 3-D structure determination

Similar documents
Structural bioinformatics

Virtual bond representation

Protein 3D Structure Prediction

CFSSP: Chou and Fasman Secondary Structure Prediction server

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen

Homology Modelling. Thomas Holberg Blicher NNF Center for Protein Research University of Copenhagen

Why learn sequence database searching? Searching Molecular Databases with BLAST

Dynamic Programming Algorithms

Structure formation and association of biomolecules. Prof. Dr. Martin Zacharias Lehrstuhl für Molekulardynamik (T38) Technische Universität München

Zool 3200: Cell Biology Exam 3 3/6/15

Sequence Databases and database scanning

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Packing of Secondary Structures

MATH 5610, Computational Biology

6-Foot Mini Toober Activity

Supporting Online Material. Av1 and Av2 were isolated and purified under anaerobic conditions according to

BIOINFORMATICS Introduction

Alpha-helices, beta-sheets and U-turns within a protein are stabilized by (hint: two words).

Protein Folding Problem I400: Introduction to Bioinformatics

Introduction to protein structure analysis and prediction

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

Problem Set Unit The base ratios in the DNA and RNA for an onion (Allium cepa) are given below.

Bioinformatics. ONE Introduction to Biology. Sami Khuri Department of Computer Science San José State University Biology/CS 123A Fall 2012

Chapter 8. One-Dimensional Structural Properties of Proteins in the Coarse-Grained CABS Model. Sebastian Kmiecik and Andrzej Kolinski.

Protein Structure Prediction. christian studer , EPFL

11 questions for a total of 120 points

Introduction to Proteins

Residue Contact Prediction for Protein Structure using 2-Norm Distances

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Polypeptide backbone (or the main chain)

Why Use BLAST? David Form - August 15,

Ref: Crystallgraphy made crystal clear, Chapter 6. Preparation of heavy atom derivatives. Heavy-atom data bank:

Fundamentals of Protein Structure

Computer aided modeling of a fructose repressor

Protein NMR II. Lecture 5

Building an Antibody Homology Model

3D Structure Prediction with Fold Recognition/Threading. Michael Tress CNB-CSIC, Madrid

X-ray structures of fructosyl peptide oxidases revealing residues responsible for gating oxygen access in the oxidative half reaction


Overview. Secondary Structure. Tertiary Structure

Protein Bioinformatics Part I: Access to information

MBMB451A Section1 Fall 2008 KEY These questions may have more than one correct answer

Basic Bioinformatics: Homology, Sequence Alignment,

Mutation Rates and Sequence Changes

Lecture for Wednesday. Dr. Prince BIOL 1408

Name: TOC#. Data and Observations: Figure 1: Amino Acid Positions in the Hemoglobin of Some Vertebrates

Sequence Analysis '17 -- lecture Secondary structure 3. Sequence similarity and homology 2. Secondary structure prediction

Ab Initio SERVER PROTOTYPE FOR PREDICTION OF PHOSPHORYLATION SITES IN PROTEINS*

Lecture 11: Gene Prediction

SUPPLEMENTAL FIGURE LEGENDS. Figure S1: Homology alignment of DDR2 amino acid sequence. Shown are

RNA does not adopt the classic B-DNA helix conformation when it forms a self-complementary double helix

Protein Structure Databases, cont. 11/09/05

DNA stands for deoxyribose nucleic acid

ENZYMES AND METABOLIC PATHWAYS

Università degli Studi di Roma La Sapienza

1. DNA replication. (a) Why is DNA replication an essential process?

Ch 10 Molecular Biology of the Gene

Amino Acid Distribution Rules Predict Protein Fold: Protein Grammar for Beta-Strand Sandwich-Like Structures

Cristian Micheletti SISSA (Trieste)

Introduction to Bioinformatics

Proteins the primary biological macromolecules of living organisms

Disease and selection in the human genome 3

Homology Modeling of Mouse orphan G-protein coupled receptors Q99MX9 and G2A

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Genome Sequence Assembly

Homology modeling and structural validation of tissue factor pathway inhibitor

Basic Biology. Gina Cannarozzi. 28th October Basic Biology. Gina. Introduction DNA. Proteins. Central Dogma.

Retrieving and Viewing Protein Structures from the Protein Data Base

What s New in Discovery Studio 2.5.5

36. The double bonds in naturally-occuring fatty acids are usually isomers. A. cis B. trans C. both cis and trans D. D- E. L-

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

1. DNA, RNA structure. 2. DNA replication. 3. Transcription, translation

APPENDIX. Appendix. Table of Contents. Ethics Background. Creating Discussion Ground Rules. Amino Acid Abbreviations and Chemistry Resources

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

ONLINE BIOINFORMATICS RESOURCES

Insight II. Homology. March Scranton Road San Diego, CA / Fax: 858/

ELE4120 Bioinformatics. Tutorial 5

Chimp Sequence Annotation: Region 2_3

Protein Structure Prediction by Constraint Logic Programming

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Bi 8 Lecture 7. Ellen Rothenberg 26 January Reading: Ch. 3, pp ; panel 3-1

Name Date of Data Collection. Class Period Lab Days/Period Teacher

7.014 Quiz II Handout

Supplementary Table 1: List of CH3 domain interface residues in the first chain (A) and

BIL 256 Cell and Molecular Biology Lab Spring, 2007 BACKGROUND INFORMATION I. PROTEIN COMPOSITION AND STRUCTURE: A REVIEW OF THE BASICS

Supplementary Materials. Figure S1 Chemical structures of cisplatin and carboplatin.

Introduction to Bioinformatics

Biomolecules: lecture 6

Proteins Higher Order Structures

BIRKBECK COLLEGE (University of London)

MOLEBIO LAB #3: Electrophoretic Separation of Proteins

Introduction to Molecular Biology

Gene Identification in silico

Genetic Algorithms For Protein Threading

The study of protein secondary structure and stability at equilibrium ABSTRACT

Daily Agenda. Warm Up: Review. Translation Notes Protein Synthesis Practice. Redos

STRUCTURAL BIOLOGY. α/β structures Closed barrels Open twisted sheets Horseshoe folds

Introduction to Bioinformatics

Main-Chain Conformational Tendencies of Amino Acids

Transcription:

Homology Modeling: A Brief Introduction W. Ross Ellington Biology-IMB Traditional approaches to 3-D structure determination X-ray crystallography: Sufficient protein? Natural protein is often heterogeneous Can it be expressed? Requires full length expression construct If so, in soluble state? Often recombinant protein is present as insoluble inclusion bodies Will it crystallize? Perhaps the most time-consuming task Will the crystals be of diffraction quality? Some crystals will simply not diffract How do you solve the phase problem? heavy metals, selenomethionine, molecular replacement 1

NMR spectroscopy: Same problems above with respect to protein availability. Protein must be N 15 and/or C 13 and/or H 2 labeled (bacteria must grow in isotopically enriched media; samples may cost > $2,000 to prepare). Protein must be stable for weeks. Restriction is terms of size of protein (< 25-30 kd). All methods are time and labor intensive but yield atomic level resolution ( 1.5 3 Å) 2

Proliferation of deduced amino acid sequences 1. RTPCR PCR amplification of RT products using redundant gene specific primers (very rapid; generates large amounts of data) 2. cdna libraries Cloning of RT products; when screened yield large numbers of cdna sequences. 3. Genomic DNA sequencing efforts Eukaryotes- (man, mouse, chicken, pufferfish, zebrafish, Ciona [tunicate], sea urchin, fruit fly, C. elegans (nematode) and many unicellular pathogenic species) Gene finder software extracts EXONs and assembles ORFs. Prokaryotes- > 25 genomes have been sequenced 3

Human genome codes for >30,000 proteins! Convert sequences to 3-D structure > 40% sequence identity- HOMOLOGY MODELING (actually extends to much lower degrees of identity) 20-40% sequence identity- THREADING <25% sequence identity- AB INITIO modeling 4

Homology modeling- assume query sequence and homologs share the same basic folds; approach is to pick out the best template from a suite of homologs Homologous proteins- share a common evolutionary origin [ancestor] Orthologs- the same gene present in different species (ca., hemoglobin in fish vs. camels) Paralogs- products of gene duplication events (brain, muscle, sarcomeric mitochondrial and ubiquitous mitochondrial creatine kinases in man) The potential for homology modeling > 500,000 protein sequences are known There are >10,000 experimentally determined 3-D structures of proteins 33% of known protein sequences have similarities to a proteins for which the 3-D structure has been determined Therefore, there is the potential for homology modeling of > 150,000 proteins 5

Find homologs 1. Pairwise comparison: compare target sequence individually with other sequences in a database of sequences; FASTA; BLAST Easy to use, on-line sites; probe with protein (BLASTp, tblastn) or nucleotides (BLASTn) NCBI - http://www.ncbi.nlm.nih.gov/blast/ EMBhttp://www.ch.embnet.org/software/BottomBLAST. html Expasy - http://us.expasy.org/tools/blast/ 6

2. Psi-Blast: constructs multiple sequence alignments of many sequences which are iteratively sampled from a database; position specific scoring matrix is used to sample database for additional homologs; very useful when percent identity is low NCBS, National Center for Biological Sciences, Bangalore http://caps.ncbs.res.in/campass/psi-blast.html 3. 3D template matching: pairwise comparison of the query sequence and a protein of known structure; structuredependent scoring function is used to ascertain whether a query protein adopts or not any one of the library of 3D folds; useful when pairwise or multiple sequence approaches have not identified homologs Imperial College 3D PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/ 7

4. PROPSEARCH uses the amino acid composition instead. In addition, other properties like molecular weight, content of bulky residues, content of small residues, average hydrophobicity, average charge a.s.o. and the content of selected dipeptide-groups are calculated from the sequence as well. 144 such properties are weighted individually and are used as query vector. The weights are then trained on a set of protein families with known structures, using a genetic algorithm. Sequences in the database are transformed into vectors as well, and the euclidian distance between the query and database sequences is calculated. Distances are rank ordered, and sequences with lowest distance are reported on top. (University of Montpelier) http://www.infobiosud.univmontp1.fr/serveur/propsearch/presentatio n.html Template choice 1. Higher the sequence identity, the more likely the template will be suitable 2. Most closely related from a phylogenetic point of view 3. Template environment (solvent, ph, temperature, quaternary structure) 4. Quality of the template structure (resolution and R factor) 8

Quality: 645 Length: 374 Ratio: 1.807 Gaps: 5 Percent Similarity: 50.430 Percent Identity: 40.115 Match display thresholds for the alignment(s): = IDENTITY : = 2. = 1 Alignment of target (query) with template- using CLUSTAL, GAP (GCG), BLAST (NCBI) etc with manual adjustments UrechiscaupoLK.txt x limulusak.pep October 29, 2001 12:42.. 1...MAFQNDAKANF...PDYANHGCVVGRHLNFEMYQRLFGKKTAHGVTV 44.. :.... : :.: :. 1 MVDQATLDKLEAGFKKLQEASDCKSLLKKHLTKDVFDSIKNKKTGMGATL 50 45 DKVIQPSVDNFGNCIGLIAGDEESYEVFKELFDAVINEKHKGFGPNDSQP 94 :. : : :.: 51 LDVIQSGVENLDSGVGIYAPDAESYRTFGPLFDPIIDDYHGGFKLTDKHP 100 95 APDL.DASKLVGGQFDEKYVKSCRIRTGRGIRGLCYPPSCTRGERREVER 143 :..:: : :. : : :. 101 PKEWGDINTLVDLDPGGQFIISTRVRCGRSLQGYPFNPCLTAEQYKEMEE 150 144 VITTALAGLSGDLSGTYYPLSKMTPEQENQLIADHFLFQKPTGHLMVNSA 193 :... : :.. :... 151 KVSSTLSSMEDELKGTYYPLTGMSKATQQQLIDDHFLFKEGDRFLQTANA 200 194 SVRDWPDARGIWHNNEKTFLIWINEEDHMRVISMQKGGNVKAVFERFGRG 243.. : : : :.. :. 201.CRYWPTGRGIFHNDAKTFLVWVNEEDHLRIISMQKGGDLKTVYKRLVTA 249 244 LNAIAEQMKKNGREYMWNQRLGYLCACPSNLGTGLRASVHVQLHQLSKHP 293...: :. :. : :.. 250 VDNIESKL...PFSHDDRFGFLTFCPTNLGTTMRASVHIQLPKLAKDR 294 294 K.FEDIVVALQLQKRGTGGEHTAAVDDVYDISNAARLKKSEREFVQLLID 342.. :. : 295 KVLEDIASKFNLQVRGTRGEHTESEGGVYDISNKRRLGLTEYQAVREMQD 344.. 343 GVKKLIDMEQALEAGKSIDDLIPA 366 :.:. 345 GILEMIKMEKAAA... 357 Look for highly conserved regionsthis helps with the manual adjustments of alignments! MSTSQNK-IFNQGEDCKPSWTSAGADYNEKDTHLEIMGKPVMERWETPYMHLLATIASSK MTAVAGQPIFGSGEDCKPAWTDASADYDATDEHLEIMGKPVMERWETPYMHKLAQVASCK --MSAAQPIFSKGENCKQVWHDANADYNAADTHLEIMGKPVMERWETPYMHSLATVAASK -MTSSAQPIFSKGEDCKPTWHDANAGYNEADTHXEIMGKPVMERWETPYMHSLATVAASK --MTTAQPIFSKGEDCKAGWHDAGAGYNETDTHLEILGKPVMERWETPYMHSLAIVASSK --MSSDK-IFAEGESCKSAWHDATAGYDETDTHLEIMGKPVMERWETPYMHSLATVAASK -SSSAASPLFAPGEDCGPAWRAAPAAYDTSDTHLQILGKPVMERWETPYMHSLAAAAASR. :* **.* * * * *: * * :*:************** ** *:.: 9

Quality: 1415 Length: 419 Ratio: 3.763 Gaps: 2 Percent Similarity: 77.660 Percent Identity: 68.883 Match display thresholds for the alignment(s): = IDENTITY : = 2. = 1 EXAMPLE- Chaetopterus variopedatus (polychaete= marine worm) mitochondrial creatine kinase vs. chicken sarcomeric MiCK (X-ray crystal structure published in 1995 by Fritz-Wolf et al.; PDB- 1crk) correctchaetmick.txt x chickenmicksar.pep February 25, 2002 14:01.. 1...MRLGTSNVHLRYPA 14.: : 1 MAGTFGRLLAGRVTAALFAAAGSGVLTTGYLLNQQNVKATVHEKRKLFPP 50 15 SANYPDLSQHNNIMASNLTPVIYAKLRDKVTPNGVTLNLCIQTGVDNPGH 64..... 51 SADYPDLRKHNNCMAECLTPAIYAKLRDKLTPNGYSLDQCIQTGVDNPGH 100 65 PFIKTVGLVAGDEESYEVFADLFDKCIDERHGGYKPWD.KHPTDLDSTKL 113 : ::.. : 101 PFIKTVGMVAGDEESYEVFAEIFDPVIKARHNGYDPRTMKHHTDLDASKI 150 114 RGGNFDPKYVLSSRVRTGRCIRGLSLPPACSRAERREVERVVVEALNGLQ 163 :. 151 THGQFDERYVLSSRVRTGRSIRGLSLPPACSRAERREVENVVVTALAGLK 200 164 GDLAGKYFPLAKMTDAEQEKLIEDHFLFDKPVSPLLLASGMARDWPDARG 213. :.: : :. :. 201 GDLSGKYYSLTNMSERDQQQLIDDHFLFDKPVSPLLTCAGMARDWPDARG 250 214 IWHNDKKNFLVWVNEEDHTRVISMQKGGNMREVFDRFCNGLQKVENLIQS 263. : : : :... 251 IWHNNDKTFLVWINEEDHTRVISMEKGGNMKRVFERFCRGLKEVERLIKE 300 264 RGWEFMWNEHLGYVLTCPSNLGTGLRAGVHIKIPKLAKDPRLNEVLKKMN 313 : : :..:. : 301 RGWEFMWNERLGYVLTCPSNLGTGLRAGVHVKLPRLSKDPRFPKILENLR 350 314 LQKRGTGGVDTAATGDTFDISNSDRLGKSEVQLVQQVVDGIDLLIQMEKR 363 : : : : : :. : : 351 LQKRGTGGVDTAAVADVYDISNLDRMGRSEVELVQIVIDGVNYLVDCEKK 400. 364 LERGQ..RIDDLIPK... 376 : :: :. 401 LEKGQDIKVPPPLPQFGRK 419 Generate coordinates for SCRs and VRs using template(s0 Web-based 1. Swiss Model- Glaxo Welcome 2. What If- EMBL Programs 1. Modeller- free; Unix, Linux [Rockefeller U.] 2. Insight II ( homology )- Unix [MSI] 10

SwissModel- http://www.expasy.org/swissmod/ What If- http://www.cmbi.kun.nl/whatif/ Modeller- http://www.salilab.org/modeller/modeller.html Insight IIhttp://www.accelrys.com/insight/homology.html PDB file-header HEADER TRANSFERASE 08-MAR-96 1CRK Background information TITLE MITOCHONDRIAL CREATINE KINASE COMPND MOL_ID: 1; COMPND 2 MOLECULE: CREATINE KINASE; COMPND 3 CHAIN: A, B, C, D; COMPND 4 EC: 2.7.3.2; COMPND 5 BIOLOGICAL_UNIT: OCTAMER SOURCE MOL_ID: 1; SOURCE SOURCE SOURCE SOURCE SOURCE SOURCE KEYWDS EXPDTA AUTHOR 2 ORGANISM_SCIENTIFIC: GALLUS GALLUS; 3 ORGANISM_COMMON: CHICKEN; 4 ORGAN: HEART; 5 TISSUE: MUSCLE; 6 CELL: SARCOMER; 7 ORGANELLE: MITOCHONDRIA TRANSFERASE, CREATINE KINASE X-RAY DIFFRACTION K.FRITZ-WOLF,T.SCHNYDER,T.WALLIMANN,W.KABSCH REVDAT 1 07-JUL-97 1CRK 0 JRNL AUTH K.FRITZ-WOLF,T.SCHNYDER,T.WALLIMANN,W.KABSCH JRNL TITL STRUCTURE OF MITOCHONDRIAL CREATINE KINASE JRNL REF NATURE V. 381 341 1996 JRNL REFN ASTM NATUAS UK ISSN 0028-0836 0006 11

PDB file-remarks REMARK 1 Properties of the structure like resolution, error etc. REMARK 2 REMARK 2 RESOLUTION. 3.0 ANGSTROMS. REMARK 3 REMARK 3 REFINEMENT. REMARK 3 PROGRAM : X-PLOR REMARK 3 AUTHORS : BRUNGER REMARK 3 REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3. REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 8. REMARK 3 DATA CUTOFF (SIGMA(F)) : 0. REMARK 3 DATA CUTOFF HIGH (ABS(F)) : NULL REMARK 3 DATA CUTOFF LOW (ABS(F)) : NULL REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 99.6 REMARK 3 NUMBER OF REFLECTIONS : 40730 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 CROSS-VALIDATION METHOD : NULL REMARK 3 FREE R VALUE TEST SET SELECTION : NULL REMARK 3 R VALUE (WORKING SET) : 0.217 REMARK 3 FREE R VALUE : 0.264 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 10. REMARK 3 FREE R VALUE TEST SET COUNT : NULL REMARK 3 ESTIMATED ERROR OF FREE R VALUE : NULL PDB file- helix/sheet HELIX 61 61 ALA D 108 LYS D 110 5 3 HELIX 62 62 ARG D 143 GLY D 159 1 17 HELIX 63 63 GLY D 162 LEU D 164 5 3 HELIX 64 64 GLU D 176 ASP D 184 1 9 HELIX 65 65 PRO D 195 ALA D 200 1 6 HELIX 66 66 MET D 241 ARG D 262 1 22 HELIX 67 67 PRO D 279 ASN D 281 5 3 HELIX 68 68 PRO D 295 LYS D 299 1 5 HELIX 69 69 PHE D 303 LEU D 310 1 8 HELIX 70 70 GLU D 341 GLU D 363 1 23 SHEET 1 A 8 GLY A 166 SER A 170 0 SHEET 2 A 8 GLY A 211 ASN A 215-1 N HIS A 214 O LYS A 167 SHEET 3 A 8 PHE A 220 ILE A 224-1 N ILE A 224 O GLY A 211 SHEET 4 A 8 THR A 230 LYS A 237-1 N ILE A 233 O LEU A 221 SHEET 5 A 8 VAL A 121 ARG A 130-1 N ARG A 130 O THR A 230 Secondary structure- a helices and b sheets 12

PDB file- atomic coordinates of atoms ATOM 2 CA THR A 1 6.191 51.661 0.792 1.00 88.98 C ATOM 3 C THR A 1 7.324 52.686 0.599 1.00 88.03 C ATOM 4 O THR A 1 7.073 53.849 0.258 1.00 86.86 O ATOM 5 CB THR A 1 5.395 51.971 2.106 1.00 89.45 C ATOM 6 OG1 THR A 1 5.690 50.979 3.102 1.00 88.99 O ATOM 7 CG2 THR A 1 3.887 51.965 1.836 1.00 88.95 C ATOM 8 N VAL A 2 8.565 52.223 0.760 1.00 87.75 N ATOM 9 CA VAL A 2 9.753 53.043 0.508 1.00 87.41 C ATOM 10 C VAL A 2 10.316 52.696-0.874 1.00 86.54 C ATOM 11 O VAL A 2 10.591 51.527-1.172 1.00 88.07 O ATOM 12 CB VAL A 2 10.862 52.810 1.586 1.00 87.73 C ATOM 13 CG1 VAL A 2 11.812 54.007 1.635 1.00 87.20 C ATOM 14 CG2 VAL A 2 10.237 52.581 2.958 1.00 86.79 C ATOM 15 N HIS A 3 10.417 53.703-1.736 1.00 84.79 N ATOM 16 CA HIS A 3 10.774 53.482-3.132 1.00 81.24 C ATOM 17 C HIS A 3 11.457 54.697-3.727 1.00 78.71 C ATOM 18 O HIS A 3 10.830 55.729-3.969 1.00 74.98 O ATOM 19 CB HIS A 3 9.528 53.131-3.947 1.00 83.78 C ATOM 20 CG HIS A 3 9.807 52.840-5.390 1.00 84.72 C ATOM 21 ND1 HIS A 3 9.875 51.557-5.893 1.00 85.65 N ATOM 22 CD2 HIS A 3 9.959 53.667-6.452 1.00 84.35 C ATOM 23 CE1 HIS A 3 10.054 51.607-7.202 1.00 85.45 C ATOM 24 NE2 HIS A 3 10.109 52.877-7.565 1.00 85.49 N Three general methods for initial model construction 1. Assembly of rigid bodies- backbone of target is arranged to template positions of a carbons RAPPER (Cambridge University) is a discrete conformational sampling algorithm for restraint-based protein modelling. It has been used for all-atom loop modelling, whole protein modelling under limited (Calpha) restraints, comparative modelling, ab initio structure prediction, structure validation, and experimental structure determination with X-ray and nuclear magnetic resonance spectroscopy http://raven.bioc.cam.ac.uk/ 13

2. Segment matching (coordinate reconstruction)- most hexapeptide segments of protein structure can be clustered into approximately 100 structural classes; subset of atomic positions in template are used as guiding positions to fit segments from the target into these positions; can be used to model main chain and side chain atoms SegMod component available in UCLA s Genemine http://www.bioinformatics.ucla.edu/genemine/ 3. Satisfaction of spatial restraints-model is built on minimizing violations of spatial restraints (constraints) as defined from the template; homology derived restraints (mainchain and sidechain dihedrals, mainchain CA-CA distances, sidechainmainchain distances etc) and stereochemical (bond angles, dihedral angles, non-bonded atom-atom contacts) (MODELLER) A. Sali & T.L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993. A. Fiser, R.K. Do, & A. Sali. Modeling of loops in protein structures, Protein Science 9. 1753-1773, 2000. 14

Initial model will yield a framework of SCRs consisting of the following: 1. Backbone of target protein is arranged to a carbons of the template. 2. Secondary structural elements are present 3. F and Y angles are arranged in SCRs Initial modeling will yield a framework which lacks two important components- (a) VRs [nonconserved loops] and (b) side chains. Quality: 1415 Length: 419 Ratio: 3.763 Gaps: 2 Percent Similarity: 77.660 Percent Identity: 68.883 Match display thresholds for the alignment(s): = IDENTITY : = 2. = 1 correctchaetmick.txt x chickenmicksar.pep February 25, 2002 14:01.. 1...MRLGTSNVHLRYPA 14.: : 1 MAGTFGRLLAGRVTAALFAAAGSGVLTTGYLLNQQNVKATVHEKRKLFPP 50 15 SANYPDLSQHNNIMASNLTPVIYAKLRDKVTPNGVTLNLCIQTGVDNPGH 64..... 51 SADYPDLRKHNNCMAECLTPAIYAKLRDKLTPNGYSLDQCIQTGVDNPGH 100 65 PFIKTVGLVAGDEESYEVFADLFDKCIDERHGGYKPWD.KHPTDLDSTKL 113 : ::.. : 101 PFIKTVGMVAGDEESYEVFAEIFDPVIKARHNGYDPRTMKHHTDLDASKI 150 114 RGGNFDPKYVLSSRVRTGRCIRGLSLPPACSRAERREVERVVVEALNGLQ 163 :. 151 THGQFDERYVLSSRVRTGRSIRGLSLPPACSRAERREVENVVVTALAGLK 200 164 GDLAGKYFPLAKMTDAEQEKLIEDHFLFDKPVSPLLLASGMARDWPDARG 213. :.: : :. :. 201 GDLSGKYYSLTNMSERDQQQLIDDHFLFDKPVSPLLTCAGMARDWPDARG 250 214 IWHNDKKNFLVWVNEEDHTRVISMQKGGNMREVFDRFCNGLQKVENLIQS 263. : : : :... 251 IWHNNDKTFLVWINEEDHTRVISMEKGGNMKRVFERFCRGLKEVERLIKE 300 264 RGWEFMWNEHLGYVLTCPSNLGTGLRAGVHIKIPKLAKDPRLNEVLKKMN 313 : : :..:. : 301 RGWEFMWNERLGYVLTCPSNLGTGLRAGVHVKLPRLSKDPRFPKILENLR 350 314 LQKRGTGGVDTAATGDTFDISNSDRLGKSEVQLVQQVVDGIDLLIQMEKR 363 : : : : : :. : : 351 LQKRGTGGVDTAAVADVYDISNLDRMGRSEVELVQIVIDGVNYLVDCEKK 400. 364 LERGQ..RIDDLIPK... 376 : :: :. 401 LEKGQDIKVPPPLPQFGRK 419 15

Loop modeling- changes in loops usually occur in exposed regions that connect secondary structure elements. No reliable methods are available for constructing loops longer than 5 residues. Two general approaches: 1. Ab initio loop prediction- conformational search or enumeration of conformations in a given environment guided by a scoring or energy function. Also available on the Cambridge University RAVEN site http://raven.bioc.cam.ac.uk/loop2.php 2. Database approaches-consists of finding a portion of the main chain which fits the stem regions of a particular loop. The database of protein folds is searched and fitted within the constraints of the loop sequence, stem structure and energy optimization criteria. Loop database at PKUBIOS (China) http://mdl.ipc.pku.edu.cn/moldes/oldmem/liwz/home/ loop.htm 16

Side chains-defined by C 1, C 2. (dihedral angles) -protein side chains play a key role in molecular recognition and packing of hydrophobic cores of globular proteins -significant correlations exist between C 1, C 2 and F and Y angles -side chain conformations exist in a limited number of canonical shapes (rotamers) -rotamer libraries can be constructed where only 3-50 conformations are taken into account for each side chain -approaches involve iterative sampling of rotamers for each side chain into the target backbone with scoring matrices and attention paid to steric clashes of sidechain with mainchain. Refinement by molecular mechanics -restrain the region of the model that is most likely correct and focus on suspect areas or perform on entire molecule -idealize bond geometry and remove unfavorable non-bonded contacts by protein force fields using AMBER Assisted Model Building with Energy Refinement (http://amber.scripps.edu/) CHARMm Chemistry at HARvard Molecular mechanics (http://www.accelrys.com/insight/charmm.html) GROMOS Molecular Dynamics (http://www.igc.ethz.ch/gromos/) 17

-this essentially is an energy minimization approach; during force field calculations there is a tendency for the structure to drift away from the control structure. This drift can be reduced by -(a) limiting the number of minimization steps and - (b) restraining many of the alpha carbons. Errors in homology models (a) errors in side chain packing- as proteins diverge, the packing of side chains in the core changes, (b) distortions or shifts in correctly aligned regionssequence divergence may result in main chain conformation changes even though overall fold is the same, (c) errors in regions without a template- occurs in segments of a target protein for which there are no equivalents in the template 18

(d) errors due to misalignments- most common error and (e) incorrect templates- is a problem that occurs when using distantly related templates. Evaluating models Explicit approach: overlay the model on the template and evaluate rmsd (root mean square deviation of the a carbons) -experimental rmsd values by X-ray crystallography range from from 0.5 Å for the same protein to 1 Å for proteins with > 50% identity -successful model has < 2 Å rmsd from template (if template has a sequence identity > 60% then the above criterion would be met with a success rate > 70%) -sequence identity of target and template is critical 19

Judging quality of homology models 1. ACCURACY- how well it fits the templates upon which it was built (rmsd deviation) 2. CONFIDENCE FACTOR (Model B-factor)- SWISS MODEL provides a B-factor which is an uncertainty factor; higher the value the lower the amount of actual structural support is present for a particular portion of the model 3. INCORRECTNESS- presence of hydrophobic groups on the surface or polar groups in the interior that do not have hydrogen bonding or ionic bonding capabilities satisfied by their neighbors or steric clashes or unreasonable structural parameters such as bond angles/lengths. 20

4. Ramachandran diagram- plot shows main-chain conformational angles; there are a finite number of F and Y angle combinations as defined in the plot; glycines, which lack side chains, often fall out of allowed regions. Residues other than glycines in a model that are not present in allowed regions are suspect. 5. REASONABLENESS- defined as whether the model is in keeping with expectations for similar proteins. Assessed by summing up the probabilities that each residue should occur in the environment in which it is found in the model. For all PDB models, each of the 20 amino acids has a certain probability of belonging to one of the following classes: solvent-accessible surface, buried polar, exposed nonpolar, helix, sheet or turn. Regions in the model that do not fit expectations are suspect. Threading Energy- a criterion for reasonableness; higher the energy, lower the reasonableness 21

Detection of errors 1.manual inspection 2. checking stereochemistry PROCHECK (www.biochem.ucl.ac.uk) WHATCHECK (www.sander.embl-heidelberg.de) SQUID (www.yorvik.york.ac.uk) 3. statistical analyses of structures based on compiled 3-D features of many proteins VERIFY3D (www.doe-mbi.ucla.edu) ProsaII (www.came.sgb.ac.at) 22

23

24

Quality: 1415 Length: 419 Ratio: 3.763 Gaps: 2 Percent Similarity: 77.660 Percent Identity: 68.883 Match display thresholds for the alignment(s): = IDENTITY : = 2. = 1 correctchaetmick.txt x chickenmicksar.pep February 25, 2002 14:01.. 1...MRLGTSNVHLRYPA 14.: : 1 MAGTFGRLLAGRVTAALFAAAGSGVLTTGYLLNQQNVKATVHEKRKLFPP 50 15 SANYPDLSQHNNIMASNLTPVIYAKLRDKVTPNGVTLNLCIQTGVDNPGH 64..... 51 SADYPDLRKHNNCMAECLTPAIYAKLRDKLTPNGYSLDQCIQTGVDNPGH 100 65 PFIKTVGLVAGDEESYEVFADLFDKCIDERHGGYKPWD.KHPTDLDSTKL 113 : ::.. : 101 PFIKTVGMVAGDEESYEVFAEIFDPVIKARHNGYDPRTMKHHTDLDASKI 150 114 RGGNFDPKYVLSSRVRTGRCIRGLSLPPACSRAERREVERVVVEALNGLQ 163 :. 151 THGQFDERYVLSSRVRTGRSIRGLSLPPACSRAERREVENVVVTALAGLK 200 164 GDLAGKYFPLAKMTDAEQEKLIEDHFLFDKPVSPLLLASGMARDWPDARG 213. :.: : :. :. 201 GDLSGKYYSLTNMSERDQQQLIDDHFLFDKPVSPLLTCAGMARDWPDARG 250 214 IWHNDKKNFLVWVNEEDHTRVISMQKGGNMREVFDRFCNGLQKVENLIQS 263. : : : :... 251 IWHNNDKTFLVWINEEDHTRVISMEKGGNMKRVFERFCRGLKEVERLIKE 300 264 RGWEFMWNEHLGYVLTCPSNLGTGLRAGVHIKIPKLAKDPRLNEVLKKMN 313 : : :..:. : 301 RGWEFMWNERLGYVLTCPSNLGTGLRAGVHVKLPRLSKDPRFPKILENLR 350 314 LQKRGTGGVDTAATGDTFDISNSDRLGKSEVQLVQQVVDGIDLLIQMEKR 363 : : : : : :. : : 351 LQKRGTGGVDTAAVADVYDISNLDRMGRSEVELVQIVIDGVNYLVDCEKK 400. 364 LERGQ..RIDDLIPK... 376 : :: :. 401 LEKGQDIKVPPPLPQFGRK 419 Chicken sarmick 25

26

27

28

29

Where do we go from here? vhigh through-put structure determination efforts will never come close to determining the bulk of structures vhowever, a dictionary of most of the possible folds is a realistic vgiven these folds, homology modeling affords great promise in identification and characterization of structural genes whose products have an unknown function 30