Structural bioinformatics
  Eran Eyal
  Plant Sciences Department, Weizmann Institute of Science

Why structures?
  The 3D representation of a molecule is more informative than its sequence alone.
  Structures reveal properties of the molecule that cannot be detected from the sequence.
  [Diagram: similar sequence, similar structure, similar function]

Source of data: PDB
  http://pdb.weizmann.ac.il/
  http://www.rcsb.org/pdb/
  Contents: crystal structures, NMR models, other
  The PDB database is the main repository for the processing and distribution of 3D biological macromolecular structure data.

PDB content growth
  http://www.rcsb.org/pdb/

X-ray crystallography (data source)
  1. Clone / express / purify
  2. Crystallize
  3. Collect X-ray diffraction data and solve the phase problem
  4. Interpret the electron density map
  5. Coordinates of atoms in the protein molecule

NMR spectroscopy (data source)
  1. Information about spatially close atoms
  2. List of distance constraints + dihedral angle constraints
  3. Multiple models of the protein structure
  4. Coordinates of atoms in the protein molecule

X-ray crystallography vs. NMR
  Property            X-ray crystallography   NMR
  Atomic resolution   Good                    Reasonable
  Hydrogens           Rarely determined       Determined
  Molecule size       No restriction          Small proteins
  Dynamics            Snapshot                Multiple models
  Membrane proteins   Problematic (both)
  Procedure           Very long               Long

What information is included in the PDB? (File format)
  Header section
    Protein description
    Literature
    Data about the experiment
    Sequence
  Coordinate section
    Structure (atomic coordinates)
    Connectivity
  http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html

Example: header section of PDB entry 1IGT
  HEADER IMMUNOGLOBULIN 25-OCT-96 1IGT
  COMPND MOLECULE: IGG2A INTACT ANTIBODY MAB231;
  SOURCE MOUSE (MUS MUSCULUS, STRAIN BALB/C)
  KEYWDS INTACT IMMUNOGLOBULIN V REGION C REGION, IMMUNOGLOBULIN
  EXPDTA X-RAY DIFFRACTION
  AUTHOR L.J.HARRIS,S.B.LARSON,K.W.HASEL,A.MCPHERSON
  REVDAT 1 07-JUL-97 1IGT 0
  JRNL AUTH L.J.HARRIS,S.B.LARSON,K.W.HASEL,A.MCPHERSON
  JRNL TITL REFINED STRUCTURE OF AN INTACT IGG2A MONOCLONAL
  JRNL TITL 2 ANTIBODY
  JRNL REF BIOCHEMISTRY V. 36 1581 1997
  JRNL REFN ASTM BICHAW US ISSN 0006-2960 0033
  REMARK 2 RESOLUTION. 2.8 ANGSTROMS.
  REMARK 470 THE FOLLOWING RESIDUES HAVE MISSING ATOMS (M=MODEL NUMBER;
  REMARK 470 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
  REMARK 470 I=INSERTION CODE):
  REMARK 470 M RES CSSEQI ATOMS
  REMARK 470 LEU A 6 CG CD1 CD2
  REMARK 470 ARG A 8 CG CD NE CZ NH1 NH2
  HELIX 1 1 PRO A 80 ASP A 82 5
  SHEET 1 A 4 LEU A 4 SER A 7 0
  SHEET 2 A 4 ILE A 19 HIS A 24 1 N HIS A 24 O THR A
  SHEET 3 A 4 GLY A 70 ILE A 75 1 N ILE A 75 O ILE A
  SHEET 4 A 4 PHE A 62 SER A 67 1 N SER A 67 O GLY A
  SSBOND 1 CYS A 23 CYS A 88
  CRYST1 65.820 76.770 100.640 88.05 92.35 97.23 P 12

Example: sequence, coordinate and connectivity records
  SEQRES 1 A 214 ASP ILE VAL LEU THR GLN SER PRO SER SER LEU SER
  SEQRES 2 A 214 SER LEU GLY ASP THR ILE THR ILE THR CYS HIS ALA
  SEQRES 3 A 214 GLN ASN ILE ASN VAL TRP LEU SER TRP TYR GLN GLN

          Atom  Atom  Res      Res
          no.   name           no.       X       Y       Z   Occ  Bfact
  ATOM       1  N     ASP  A     1   1.600  85.453  44.624  1.00  43.02
  ATOM       2  CA    ASP  A     1   1.649  84.304  45.569  1.00  38.99

  HET NAG D 1 26
  HETNAM NAG N-ACETYL-D-GLUCOSAMINE
  FORMUL 5 NAG 8(C8 H15 N1 O6)
  HETATM 3568 CA CA 0 12.108 17.156 78.830 1.00 7.31
  HETATM 3569 O HOH 1 12.160 19.496 78.042 1.00 33.27
  HETATM 3570 O HOH 2 23.163 36.984 67.113 1.00 18.80
  HETATM 3571 O HOH 3 10.102 42.843 63.995 1.00 24.28
  HETATM 3572 O HOH 4 22.311 19.282 69.877 1.00 27.58
  CONECT 482 480 3568
  CONECT 509 507 3568
  CONECT 3568 482 509 3799

Visualization: molecular graphics
  What do we need?
    Rotation & translation
    Coloring specific parts of the molecule
    Labeling of residues and atoms
    Geometrical measurements (distances & angles)
    Schematic representation: atoms/bonds/secondary structures, molecular surfaces
    Comparing structures
    Saving pictures
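
The ATOM records and the "geometrical measurements" item above can be tied together with a small sketch. It assumes the fixed-column layout described in the PDB format guide (serial number in columns 7-11, atom name in 13-16, coordinates in 31-54, and so on); the two records are the ones shown on the slide, with the column spacing restored by assumption. The names Atom, parse_atom_line and distance are illustrative and not part of any PDB toolkit.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record (0-based slices of 1-based columns)."""
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
    )


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# The two ATOM records from the slide (column spacing restored by assumption).
RECORDS = [
    "ATOM      1  N   ASP A   1       1.600  85.453  44.624  1.00 43.02",
    "ATOM      2  CA  ASP A   1       1.649  84.304  45.569  1.00 38.99",
]

atoms = [parse_atom_line(r) for r in RECORDS if r.startswith(("ATOM", "HETATM"))]
n, ca = atoms
print(f"{n.name}-{ca.name} distance: {distance(n, ca):.2f} angstrom")
```

The distance between the backbone N and CA atoms of the first residue comes out at roughly 1.5 angstroms, as expected for a covalent bond. Splitting on whitespace instead of slicing columns would be simpler but can fail on real files, where adjacent fields may touch.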

Representation of molecules (1)
  Model                Ball size   Stick size
  Stick model          0           0.2
  Ball & stick         0.4         0.2
  Space-filled model   0.8         0
  Molecular surfaces

Representation of molecules (2)
  Backbone only: connections between C-alpha atoms
  Schematic: helix drawn as a cylinder, strand drawn as an arrow
  Surface

How to search in the PDB?
  The OCA browser, developed at the Weizmann Institute of Science (WIS) by Jaime Prilusky, is the best interface to the PDB.
  Entries can be retrieved by a variety of criteria.
  http://bip.weizmann.ac.il/ocabin/ocamain

Problems in the PDB database
  Missing data
  Quality of data
  Format problems (residue numbers)
  Independence of data is doubtful

Structural analysis of proteins
  Examination of atomic interactions
  Examination of secondary structures
  Cavities
  Buried/exposed regions
  Analysis of ligands

Topics in structural bioinformatics
  Structural alignment
  Structural classification
  Secondary structure prediction
  Structure prediction
  Molecular docking
  Molecular dynamics

Structural alignment: why compare protein structures?
  Structures are more conserved in evolution than sequences.
  Two homologous proteins have the same overall structure.
  Two proteins without detectable sequence similarity may still have the same structure.
  In the twilight zone of sequence similarity, structural alignment may help to determine the relationship between two proteins correctly.
  Structural comparison is therefore a more sensitive method than sequence alignment for determining protein function.

What properties of a protein might be used to detect structural similarity to other proteins?
  Sequence
  Type and number of secondary structures (sheets, helices)
  Structural arrangement of secondary structures
  Structural attributes of individual amino acids
  Distances between amino acids in the protein

Structural classification
  All α
  All β
  α/β
  α+β
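
Returning to the last property listed for structural alignment, the distances between amino acids: one simple way to use it is to compare the intramolecular distance matrices of two structures. Such matrices do not change when a molecule is rotated or translated, so no superposition is needed. The sketch below is an illustration only, not the method of any particular program; it assumes the residues of the two proteins are already paired one-to-one, and the coordinates are toy values invented for the example.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between residue positions (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_matrix_difference(coords_a, coords_b):
    """Mean absolute difference between the two intramolecular distance matrices.

    Assumes residue i of structure A corresponds to residue i of structure B.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(da[i][j] - db[i][j]) for i, j in pairs) / len(pairs)


# Toy coordinates (invented): three residues on a straight line, 3.8 apart.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
# The same shape, rotated by 90 degrees and shifted.
moved = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (1.0, 8.6, 1.0)]
# A different shape: the chain bends at the middle residue.
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

same_shape = distance_matrix_difference(straight, moved)
different_shape = distance_matrix_difference(straight, bent)
print(same_shape, different_shape)
```

The rotated copy gives a difference of essentially zero, while the bent chain gives a clearly positive value. Finding the residue pairing in the first place is the hard part of structural alignment and is not addressed here.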
  The best-known structural classification databases are SCOP and CATH.

Secondary structure prediction
  Predicting tertiary structure from the amino acid sequence is still a very difficult task.
  Predicting more local structural properties is easier.
  Prediction of secondary structures is important and more feasible.
  Prediction of secondary structures is a bridge between the linear information and the 3D structure.
  Programs in this field often employ different types of machine learning approaches.
  Example (sequence, then the predicted state of each residue):
    ACHYTTEKRGGSGTKKREA
    HHHHHHHHOOOOOSSSSSS

Building 3D models of proteins
  Building by homology (homology modelling): alignment with proteins of known structure leads to a structural model.
  Fold recognition (threading): the sequence plus a library of known protein folds leads to a structural model.
  Ab initio
  The sequence: M A A G Y A V L S
  [Figure: the sequence aligned with the sequences of proteins of known structure]

Building by homology
  The sequence: M A A G Y A V L S
  There are millions of proteins but only several thousand different folds.
  If we can find a similar protein with a known structure, we can use the fold of that structure as the basic template for the structure of our protein.
  Result: structural model.
  The positions of loops and side chains are constructed in a second stage.
  Steps:
    1. Find proteins with known structure that are similar to your sequence.
    2. Build an alignment.
    3. Build a structural model.
    4. Check the model.
    5. Finish.
  Construction of loops can be done:
    By using a database of loops, in which the loops are classified according to their length, the geometry of their edges and their sequence.
    Without any use of previous data, using physical and chemical principles.
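
The database approach to loop construction can be sketched as a lookup keyed on the three criteria named above. This is a minimal illustration under stated assumptions: the "geometry of the edges" is reduced to a single distance between the two anchor residues, sequence similarity is a plain count of identical positions, and the library entries are toy values invented for the example, not real loop data. The names Loop and find_loops are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    sequence: str        # loop residues, one-letter codes
    end_distance: float  # distance between the two anchor residues, in angstroms


def identical_positions(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def find_loops(library, target_seq, gap_distance, tolerance=1.0):
    """Return library loops of the right length that span the gap, best sequence match first."""
    candidates = [
        loop for loop in library
        if len(loop.sequence) == len(target_seq)
        and abs(loop.end_distance - gap_distance) <= tolerance
    ]
    return sorted(
        candidates,
        key=lambda loop: identical_positions(loop.sequence, target_seq),
        reverse=True,
    )


# Toy library (invented entries, for illustration only).
LIBRARY = [
    Loop("GDGT", 6.5),
    Loop("GSGT", 6.0),
    Loop("KKRE", 9.0),  # right length, but the ends are too far apart
    Loop("GSG", 5.5),   # wrong length
]

hits = find_loops(LIBRARY, target_seq="GSGS", gap_distance=6.2)
for loop in hits:
    print(loop.sequence, loop.end_distance)
```

Filtering on length and end geometry first and ranking by sequence afterwards mirrors the order in which the classification criteria are listed; a real loop library would describe the anchor geometry with more than one number.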

Several web pages for homology modeling
  COMPOSER    felix.bioccam.ac.uksoftbase.html
  MODELLER    guitar.rockefeller.edu/modeller/modeller.html
  WHAT IF     www.sander.emblheidelberg.de/whatif/
  SWISSMODEL  www.expasy.ch/swissmodel.html

SwissModel
  http://www.expasy.ch/swissmod/swissmodel.html

Modeller
  http://guitar.rockefeller.edu/modeller/about_modeller.shtml
  Advanced program for homology modeling
  Based on distance constraints
  Implemented in several popular modelling packages such as InsightII
  The source is available for Unix platforms at the above URL

Threading (fold recognition)
  The input sequence is threaded onto different folds from a library of known folds.
  Using scoring functions, we get a score for the compatibility between the sequence and each structure.
  A statistically significant score indicates that the input protein adopts a 3D structure similar to that of the examined fold.
  This method is less accurate than homology modelling but can be applied to more cases.
  When the fold of our protein is not represented in the library, this method cannot give a correct solution.
  The most important part is the accuracy of the scoring function, which evaluates the compatibility of a structure and a sequence.
  [Figure: the input sequence is threaded onto a library of folds of known proteins; structural positions are labelled H-bond donor, H-bond acceptor, glycine, hydrophobic; example scores S=2 (Z=1), S=5 (Z=1.5), S=20 (Z=5)]

Web sites for fold recognition
  Profiles:
    3DPSSM    http://www.bmm.icnet.uk/~3dpssm
    Libra I   http://www.ddbj.nig.ac.jp/htmls/email/libra/libra_i.html
    UCLA DOE  http://www.doembi.ucla.edu/people/frsvr/frsvr.html
  Contact potentials:
    123D      http://wwwimmb.ncifcrf.gov/~nicka/123d.html
    Profit    http://lore.came.sbg.ac.at/home.html
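
The S and Z values in the threading figure pair a raw score with a Z-score. A common way to obtain a Z-score is to express each fold's raw score as a number of standard deviations above the mean of the scores over the whole library. The sketch below shows that calculation only; the scoring function itself is not modelled, and the fold names and scores are hypothetical values invented for the example.

```python
import statistics


def z_scores(scores):
    """Convert raw threading scores {fold: S} into Z-scores {fold: Z}.

    Z = (S - mean) / standard deviation, taken over all folds in the library.
    """
    values = list(scores.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return {fold: (s - mean) / spread for fold, s in scores.items()}


# Hypothetical raw scores of one input sequence against six library folds.
raw = {
    "fold_A": 2.0,
    "fold_B": 3.0,
    "fold_C": 2.5,
    "fold_D": 1.5,
    "fold_E": 3.5,
    "fold_F": 20.0,
}

z = z_scores(raw)
best_fold = max(z, key=z.get)
for fold, value in sorted(z.items(), key=lambda item: item[1], reverse=True):
    print(f"{fold}: S={raw[fold]:5.1f}  Z={value:5.2f}")
```

Only fold_F stands well above the rest of the library here, which is the kind of separation a statistically significant score is meant to capture. What Z threshold counts as significant depends on the scoring function and the library, and is not specified in the slides.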

Ab initio methods for modelling
  Of great theoretical interest but not practical.
  The basic idea is to build an empirical function that simulates the real physical forces and the potentials of chemical contacts.
  If we had a perfect function and were able to scan all possible conformations, we would be able to detect the correct fold.

Docking
  Docking: finding the binding orientation of two molecules with known structures.
  According to the molecules involved:
    Protein-ligand docking
    Protein-protein docking
  Specific docking algorithms are usually designed to deal with one of these problems but not both (different contact area, flexibility, level of representation, etc.).
  Local docking
  Global docking
  Why?
    Understanding interactions, the roles of specific amino acids, the design of mutations and changes of activity
    Comparison of affinities of different molecules
    Drug design
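
The idea of docking as a search for the best-scoring orientation can be illustrated with a deliberately small sketch. It is a toy under stated assumptions, not a real docking algorithm: both molecules are rigid point sets, only translations on a coarse grid are sampled (a real program also samples rotations and often flexibility), and the score simply rewards atom pairs in contact and penalises clashes. The thresholds, the penalty and the coordinates are invented for the example.

```python
import itertools
import math


def score_pose(receptor, ligand, contact=4.0, clash=2.0):
    """+1 for every receptor-ligand atom pair within `contact`, -10 for every pair closer than `clash`."""
    score = 0
    for r in receptor:
        for l in ligand:
            d = math.dist(r, l)
            if d < clash:
                score -= 10
            elif d <= contact:
                score += 1
    return score


def translate(coords, shift):
    """Move every atom by the same (dx, dy, dz) shift."""
    return [tuple(c + s for c, s in zip(atom, shift)) for atom in coords]


def scan_translations(receptor, ligand, steps=range(-4, 5)):
    """Try every grid translation of the ligand and return (best score, best shift)."""
    best_score, best_shift = None, None
    for shift in itertools.product(steps, repeat=3):
        s = score_pose(receptor, translate(ligand, shift))
        if best_score is None or s > best_score:
            best_score, best_shift = s, shift
    return best_score, best_shift


# Toy molecules (invented): a two-atom receptor and a one-atom ligand.
receptor = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(0.0, 0.0, 0.0)]

best_score, best_shift = scan_translations(receptor, ligand)
docked = translate(ligand, best_shift)
print(best_score, best_shift)
```

Scanning the whole grid corresponds to global docking; restricting `steps` to a small window around a known binding site would correspond to local docking. The differences between protein-ligand and protein-protein docking mentioned above (contact area, flexibility, level of representation) are exactly what this toy leaves out.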