Bioinformatics & Protein Structural Analysis

1 The molecular structures of proteins are complex and can be defined at various levels. These structures can also be predicted from their amino acid sequences. Protein structure prediction is one of the most widespread fields of research in bioinformatics. Learning Objective: After this Learning Object, the learner will be able to describe protein structural databases and recall the uses of structural databases.

2 Protein Structural Databases The protein structural databases provide a basic search box that takes an identifier of the protein as input. This identifier can be the protein name, a keyword, an ID, an author name, etc. In this example, we take the case of Viral Capsid Proteins. These databases also have advanced search features, which are optional but help make the query very specific. The general options fall into four broad classes: Structural Features, Biology, Sequence Data and Experimental Details.

3 Protein Structural Databases The search for the query protein returned 67 structures in the database that match the criteria given by the user in the search options. The first page of the results shows the titles of all the hits. The user then selects the protein structure of interest to study in detail. Here we select the structure titled HIV CAPSID C-TERMINAL DOMAIN (CAC146) for further study.

4 Protein Structural Databases The summary page shows all the general information pertaining to the basic features of the protein. This includes: 1. Protein Identifier 2. Molecule name, structure weight, polymer type, number of chains, length of the molecule and its classification 3. Source organism and Expression organism 4. Journal, paper and author name

5 Protein Structural Databases The sequence data tab contains all the information related to the amino acid sequence of the protein under consideration: 1. FASTA sequence for all chains in the polypeptide 2. Type of chain such as polypeptide, glyco-peptide, lipo-peptide, etc. 3. Diagrammatic representation of the classification and secondary structure of this chain, assigning residues to helix, sheet or turn
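
The FASTA sequences listed on such a tab are plain text: a header line starting with ">" followed by one or more lines of residues. As a minimal sketch of how they might be read, the Python below parses FASTA text into one sequence per chain. The function name `parse_fasta`, the header names and the residue strings are illustrative assumptions and are not taken from the database entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Made-up two-chain example (not a real database record).
example = """>chainA_example
ACDEFGHIKL
MNPQRSTVWY
>chainB_example
ACDEFGHIKL
"""

chains = parse_fasta(example)
for name, seq in chains.items():
    print(name, len(seq), seq)
```

Each chain ends up as a single string, so its length and composition can be inspected directly.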

6 Protein Structural Databases The sequence similarity tab shows information from comparing the sequence of the protein with other sequences. 1. Option to perform a BLAST search. 2. A list of clusters of proteins is produced. These clusters are formed and ranked based on the resolution of the structures within them. The better the quality (resolution) of a cluster, the higher it is ranked. When the user clicks on a particular cluster, the component proteins within the cluster are displayed along with supporting information.

7 Protein Structural Databases The structural similarity tab shows the information related to comparative studies of the two structures. It establishes equivalences based on 3D conformations of both proteins. The default visualization tool for PDB is Jmol. Structural alignment is covered in more detail in the second part of this animation.

8 Protein Structural Databases This tab provides details of the experimental methodology used to determine the structure. This includes: 1. Crystallization methods, pH, temperature, and other details of the experiment 2. Crystal data (space group, unit cell dimensions) 3. Diffraction source, diffraction protocol and diffraction detectors 4. Data related to resolution and refinement 5. Software, programs and computing utilized.

9 Protein Structural Databases The geometry tab contains the spatial information about the molecule, so that it can be simulated in a virtual environment. This includes: Bond lengths: Number of occurrences and their positions in the chains Bond angles: Number of occurrences and their positions in the chains Dihedral angles: Number of occurrences and their positions in the chains Ramachandran plot, Fold Deviation Scores and other structural details
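
Dihedral angles such as those listed above are derived from the 3D coordinates of four consecutive atoms. The sketch below shows one common way to compute such an angle with the atan2 formulation. The function name `dihedral_angle` and the test coordinates are assumptions for illustration; this is not the database's own implementation.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_angle(p0, p1, p2, p3):
    """Return the dihedral angle in degrees defined by four points.

    The angle is measured about the central bond p1 -> p2.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Illustrative coordinates: a planar cis arrangement and a planar trans one.
cis = dihedral_angle((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral_angle((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(cis, abs(trans))
```

The backbone phi and psi angles plotted in a Ramachandran plot are dihedral angles of this kind, computed over specific backbone atoms.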

10 Protein Structural Databases The biology tab contains information about the significance of the molecule at the biological and cellular level. This includes 1. Molecule type 2. Formula weight 3. Monomers, and linkages 4. Source method 5. Ligands and prosthetic groups 6. Gene detail and Genome information 7. Keywords

11 Protein Structural Databases Data for the same protein from other resources, such as SCOP, CATH and Pfam classification details, are provided in the derived data tab. For more detailed analysis visit iveddata.do?structureid=1aum

12 Uses of structural databases Two given proteins can be structurally aligned to evaluate the similarity between them. The server requires two protein structures or their IDs as input, which are then superimposed and aligned based on their 3D coordinates, bond angles and dihedral angles. A few of the servers available for this are DALI, MAMMOTH, CE/CE-MC, SSAP and ProFit.

13 Uses of structural databases The results are: 1. P-value: the probability of obtaining a structural match at least this good by chance. A P-value < 0.05 indicates significant similarity 2. Raw score: used to compare this match with other similarity matches for the same proteins 3. RMSD (root mean square deviation): a measure of the average distance between the atoms of the superimposed proteins 4. Percentage sequence identity in the alignment
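
To make the RMSD figure concrete, the sketch below computes it for two sets of atom coordinates. It assumes the two structures have already been superimposed and that atoms are paired by position in the lists. Alignment servers perform that superposition and pairing themselves, and this sketch does not. The function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized lists of (x, y, z).

    Assumes the structures are already superimposed and atoms are paired
    by index; no fitting or alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: four atoms, only the last one is displaced (by 2 units).
structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 2.0, 0.0)]
print(rmsd(structure_a, structure_b))
```

A lower RMSD means the paired atoms lie closer together after superposition. Identical coordinate sets give an RMSD of zero.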

14 Uses of structural databases Once the amino acid sequence of the protein is known, its secondary and tertiary structures can be predicted using many prediction algorithms, which utilize information from previously characterized structures. In the secondary structure prediction, 1. h represents Alpha Helix 2. e represents Beta Sheets 3. c represents Coils. Since not all known proteins have been structurally characterized yet, this provides a useful bioinformatics analysis tool for researchers. The various servers for structure prediction are GOR, HNN, PredictProtein, NNPredict and SSpro.
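
A prediction written in the h/e/c notation above is a string with one letter per residue. The sketch below summarizes such a string as the fraction of residues in each state and the list of consecutive segments. The function name `summarize_prediction` and the example string are assumptions; real servers differ in their output formats.

```python
from collections import Counter
from itertools import groupby

# Letter codes as described in the text: h = helix, e = sheet, c = coil.
LABELS = {"h": "alpha helix", "e": "beta sheet", "c": "coil"}


def summarize_prediction(prediction):
    """Return (fractions, segments) for a per-residue h/e/c prediction string."""
    prediction = prediction.lower()
    if not prediction:
        raise ValueError("empty prediction string")
    unknown = set(prediction) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected state letters: {sorted(unknown)}")
    counts = Counter(prediction)
    fractions = {state: counts[state] / len(prediction) for state in LABELS}
    # Collapse runs of identical letters into (state, run length) pairs.
    segments = [(state, len(list(run))) for state, run in groupby(prediction)]
    return fractions, segments


# Made-up 16-residue prediction, for illustration only.
fractions, segments = summarize_prediction("cchhhhhhcceeeecc")
for state, fraction in fractions.items():
    print(f"{LABELS[state]}: {fraction:.3f}")
print(segments)
```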

15 Uses of structural databases Given a particular amino acid sequence, the cellular, molecular and biological processes associated with the sequence can be predicted using functional annotation servers. These processes are represented by a unique set of identifiers called Gene Ontology terms, or GO terms. A GO term can be a word or an alphanumeric identifier, which includes a definition with cited sources and a namespace indicating the domain to which it belongs. The various servers for this include DbAli Annolite, PFP, ProteomeAnalyst, GOPET, SpearMint and ProKnow.
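
The parts of a GO term described above (identifier, name, definition, namespace) can be pictured as a small record. The sketch below is one hypothetical representation. The class name `GOTerm` and its field names are assumptions and do not come from any annotation server. It relies on the GO identifier format of "GO:" followed by seven digits and on the three GO namespaces.

```python
import re
from dataclasses import dataclass

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")
NAMESPACES = {"biological_process", "molecular_function", "cellular_component"}


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for one Gene Ontology term."""
    go_id: str
    name: str
    namespace: str
    definition: str

    def __post_init__(self):
        # Reject identifiers and namespaces that do not follow GO conventions.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a GO identifier: {self.go_id!r}")
        if self.namespace not in NAMESPACES:
            raise ValueError(f"unknown namespace: {self.namespace!r}")


# GO:0008150 is the root term of the biological_process namespace.
term = GOTerm(
    go_id="GO:0008150",
    name="biological_process",
    namespace="biological_process",
    definition="(definition text omitted in this sketch)",
)
print(term.go_id, term.namespace)
```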

16 Protein structural databases 1. Geometry of Protein Structure: Geometry of a protein structure refers to the three dimensional coordinates of its atoms and the angles between their bonds. These are essential to simulate the protein structure on computers. 2. Biology of Protein Structure: Information regarding the biological source of the protein and its metabolic roles within the cell and organism is referred to as the biology of protein structure. 3. SCOP classification: SCOP stands for Structural Classification of Proteins and aims to provide a detailed description of the various structural and evolutionary relationships between all proteins that have been structurally characterized. SCOP Classification can be done at four levels - Class, Fold, Superfamily and Family. 4. CATH classification: CATH stands for Class Architecture Topology and Homologous Superfamily and provides a semi-automatic, hierarchical classification of protein domains. The levels for CATH classification are Class, Architecture, Topology and Homologous Superfamily.
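
The four CATH levels named above are commonly written as a dotted numeric code, one number per level. The sketch below splits such a code into its levels and lists the hierarchy from Class down to Homologous Superfamily. The names `CathCode`, `parse_cath_code` and `hierarchy` are hypothetical, and the code string is used only to illustrate the format.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four levels of a CATH classification code."""
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a dotted code such as '3.40.50.720' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def hierarchy(code):
    """Return the code prefixes from Class down to Homologous Superfamily."""
    parts = [str(n) for n in parse_cath_code(code)]
    return [".".join(parts[: i + 1]) for i in range(4)]


parsed = parse_cath_code("3.40.50.720")
print(parsed)
print(hierarchy("3.40.50.720"))
```

Two domains that share a longer prefix of the code sit closer together in the hierarchy.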

17 Uses of structural databases 1. Protein Structural Alignment: The geometry of two given protein structures can be compared by means of available software tools that analyse their three dimensional similarity to each other. 2. Protein Structure Prediction: The prospective secondary structures of peptides or proteins can be predicted from a given stretch of amino acid residues by using machine learning algorithms. 3. Machine Learning Algorithms: These are computer algorithms that can be trained on a given classified dataset. During training, these programs adjust their parameters in such a way that they can classify new data. The most widely used Machine Learning Algorithms in Bioinformatics are Artificial Neural Networks, Hidden Markov Models, Support Vector Machines, etc. 4. Functional Annotation: For novel proteins that are yet to be characterized, the potential functions can be predicted by techniques such as Homology Modelling, which provides an initial insight into the protein's properties.
