FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

BIOMOLECULES COURSE: COMPUTER PRACTICAL 1

Author of the exercise: Prof. Lloyd Ruddock
Edited by Dr. Leila Tajedin 2017-2018
Assistant: Leila Tajedin (leila.tajedin@oulu.fi)
Course responsible person: Dr. Tuomo Glumoff (tuomo.glumoff@oulu.fi)

CONTENTS

Biochemistry Practical Schedules
Organisation
Computing Practical 1

ORGANISATION

1 Students will work as individuals. Failure to attend a practical will result in a mark of zero being awarded unless there is a good reason for absence which must be notified to Tuomo Glumoff.

2 To achieve good results from these practicals it is important that students should: (i) understand the basic principles underlying each experiment; and (ii) should be able to discern which operations are critical and which are not. It is therefore important that every student READ THE PRACTICAL SHEETS before the class. If you do not understand any point please consult one of the demonstrators - that is what they are primarily there for!

3 Reports must be sent within one week of the practical class to Leila Tajedin (email address above). The maximum length of the report is 6 sides of A4 paper, and the only acceptable format is PDF. Late work will not be marked unless there is due reason for the lateness.

4 Note: Present answers that are concise and logical. There is no need to mention irrelevant points; irrelevant or false statements that dilute the real answer will be marked negatively. Choice of words is also important: consult a dictionary and the scientific literature to check that each word conveys the meaning you intend.

In this practical you will work on three human proteins with the following SwissProt IDs:

1) P15291
2) Q8IVL5
3) Q76MJ5

The write-up for this report is free-format but must, at a minimum, include the following information:

1) The SwissProt ID and the FASTA (canonical) format of the sequences
2) For transmembrane proteins, the possible orientation of the protein in the membrane
3) The signal sequence or transmembrane region which targets the protein to the ER
4) The molecular weight, pI and extinction coefficient of the mature protein
5) The concentration of the protein, in mg/ml, of a solution with an A280 of 1.0 in a 1 cm path length cuvette
6) Any putative post-translational modifications
7) The expression pattern of the protein
8) The SwissProt ID of the mouse homologue of the human protein and the % identity between the full-length mouse and human proteins
9) The possible domains and motifs of the proteins
10) The 3D structure of the proteins

If you answer points 1 to 8 correctly for a normal protein, you will get 70% of the marks. To get the remaining 30%, find out which bioinformatic analysis software and links are useful for points 9 and 10 and write a short handbook, which may take up to two pages of the total report. The UniProt page for each protein contains many links, and several analysis tools are available, some of which can be found at: http://expasy.org/tools/

COMPUTING PRACTICAL 1

BASIC BIOINFORMATICS

Analysis of gene or protein sequences by computer can reveal many significant pieces of information regarding function, expression, cellular localization, structure, post-translational modifications etc. The number and variety of methodologies are continuously growing, and this practical will cover only the very basics of analysis in a couple of simple cases.

As increasing amounts of information are obtained from genome sequencing projects, transcriptional analysis etc., it is increasingly easy to find some information about your protein of interest. However, as the databases get larger and ever more complex, it is also becoming more difficult to find the relevant data. In addition, it must always be remembered that computer-based predictions are only predictions; they are not fact until supporting experimental evidence is found.

In this practical you will use a few of the services available directly or indirectly from the ExPASy (Expert Protein Analysis System) server (http://au.expasy.org/), along with other commonly used bioinformatic analysis software, to:

1) Find a specific human protein sequence
2) Find the signal sequence or transmembrane region which targets the protein to the ER
3) Look for transmembrane regions and their possible orientation in the membrane
4) Use this information to find the molecular weight and pI of the mature protein
5) Look for post-translational modifications
6) Look for information on the expression pattern of the protein
7) Find the mouse homologue of the human protein and calculate the % identity between the mouse and human proteins
8) Look for domains and the 3D structure of the proteins

Completing activities 1 to 7 earns 70% of the mark; activity 8 earns the remaining 30%.
One of the problems associated with bioinformatics is the large number of identifiers used. Each protein may have multiple entries in multiple databases, each with a unique ID code, and the mRNA sequences, genomic sequences etc. for the same protein show an equal diversity of ID codes. In this practical, your starting point will be the SwissProt ID for each of the human proteins.

Some of these proteins are in some way associated with the endoplasmic reticulum (ER), either by having a transmembrane (membrane-spanning) region (and appropriate signals) or, for soluble proteins, by being targeted to the ER by a signal sequence at the N-terminus of the protein. These signal sequences, which are typically 17-26 amino acids long but which may be longer, may either be cleaved by an enzyme called signal peptidase, which then releases a mature protein into the ER, or be non-cleavable, in which case they act as a transmembrane region.

Transmembrane regions are usually helical and around 20 amino acids long, and a single protein may have multiple transmembrane regions, each spanning the membrane. Proteins containing such regions are always orientated in one specific direction. A given ER protein with a single transmembrane region may be orientated with its N-terminus either in the cytoplasm or in the ER, but it will not form a mixed population with some molecules having the N-terminus in the ER and some in the cytoplasm. An ER protein with an even number of transmembrane regions will have both the N-terminus and the C-terminus on the same side of the membrane, i.e. both in the cytoplasm or both in the ER.

In this practical course we will study several proteins that are all well studied, so even the first data page you open will contain a significant amount of information towards answering the questions posed. Please do not take a short cut and simply return this information; do the full analysis. In some cases the answers obtained from the bioinformatics analysis and those listed in the SwissProt entry for the protein differ very significantly.

In this manual a weblink will be underlined, e.g. http://au.expasy.org/, while a menu link will have the menu name in italics and subheadings referred to with ->, e.g. Tools -> word count

You will quickly collect a LOT of information about the protein, and it may be worthwhile having Microsoft Word open and copying and pasting information into it as you go along. You MUST edit this information into a readable, presentable format that answers all of the questions asked (and shows the evidence).

Open Internet Explorer and go to the UniProt homepage http://www.uniprot.org/. Enter the SwissProt ID in the query box and press search (take care to distinguish between the number 0 and the letter O). This should take you to a unique UniProtKB entry; make sure that it is your protein before going further (remember ALL of the examples are human proteins). This entry contains a lot of information about the protein, links to references, other databases, mRNA and genomic DNA sequences, structures (PDB files) etc. You are actively encouraged to explore these links during the practical to see what information can be found in databases about any particular protein.

Whenever possible, all the protein products encoded by one gene in a given species are described in a single UniProtKB/Swiss-Prot entry, including isoforms generated by alternative splicing, alternative promoter usage, and alternative translation initiation. Canonical sequences alone, or canonical and isoform sequences, can be downloaded in FASTA format using the FORMAT tab at the top of the page. Click on FASTA (canonical); this will take you to another page containing the amino acid sequence for the protein, which you will need for other analysis programmes. Copy this sequence into your Microsoft Word file. Make sure that this is your sequence and that you have copied all the amino acids by checking the number of copied amino acids against the length given in UniProt. Notice that the heading in FASTA format starts with > and contains gene/protein information, followed by the amino acid sequence; you will need the amino acid sequence for analysis (amino acids are marked in red box in the picture).

All of the proteins in this exercise are either transmembrane or have a cleavable N-terminal signal sequence which directs them to the endoplasmic reticulum. To find out which your protein has, you will analyse the sequence in a number of programmes. First select your sequence in Word and copy it (right mouse click -> copy). Then go to a series of websites and systematically check your sequence in a number of analysis programmes. Include at least:

PSORT (use the PSORT II sub-programme) http://psort.hgc.jp/
SignalP http://www.cbs.dtu.dk/services/signalp/
TMpred http://www.ch.embnet.org/software/tmpred_form.html
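The length check described above (comparing the number of copied amino acids with the length shown in UniProt) can also be done with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the demo record (accession X00000, a made-up 17-residue sequence) are assumptions for illustration and are not one of the proteins in this practical.

```python
def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    header = lines[0][1:]
    # Join the sequence lines and drop spaces or digits picked up while copying.
    sequence = "".join(ch for ch in "".join(lines[1:]) if ch.isalpha()).upper()
    return header, sequence


# Hypothetical record, used only to show the format.
example = """>sp|X00000|DEMO_HUMAN Hypothetical demo protein
MKTLLLAVAV
LLCSAQA 17
"""

header, seq = parse_fasta(example)
print(header)    # the text after '>'
print(len(seq))  # number of amino acids to compare with UniProt
```

The same cleaned sequence string can then be pasted into the analysis programmes listed above.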

In each case follow the instructions in the programme, pasting your sequence in the query box with edit -> paste. Carefully note the results (different programmes may give different results) and use the consensus result for future analysis. For signal sequences you need to note the signal sequence and whether it is cleavable. For transmembrane proteins you need to note the transmembrane region(s) and the probable orientation in the membrane. PSORT will also give you other information; think about what might be relevant to note down.

Note: If you do not get a consensus result for TM regions from these three programmes then you will have to try others until you get a consensus. Suggested other programmes include:

http://www.cbs.dtu.dk/services/tmhmm/
http://www.enzim.hu/hmmtop/
http://bp.nuap.nagoya-u.ac.jp/sosui/

Then calculate the pI (isoelectric point) and Mw of the mature protein. For proteins with a cleavable signal sequence this will be the full-length protein with the signal sequence removed. For transmembrane proteins it will be the full-length protein. Go to ProtParam http://web.expasy.org/protparam/ and paste in the sequence to be analysed (remember the signal sequence issue). Note down the results for pI, Mw and extinction coefficient. If you had a solution of the mature protein with an absorbance at 280 nm of 1.0 in a 1 cm path length cuvette, what would the concentration of the protein be in mg/ml?

Next examine possible post-translational modifications of the protein. Include at least:

NetNGlyc http://www.cbs.dtu.dk/services/netnglyc/
NetPhos http://www.cbs.dtu.dk/services/netphos/
Sulfinator http://web.expasy.org/sulfinator/

In each case follow the instructions in the programme, pasting your full-length sequence in the query box with edit -> paste. Carefully note the results. N-glycosylation and tyrosine sulfation occur in the ER, and hence for transmembrane proteins only those regions in the ER can be modified.
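For the A280 question earlier in this section, the Beer-Lambert law (A = ε × c × l) links the extinction coefficient and Mw from ProtParam to a concentration in mg/ml. The sketch below is illustrative only: the function names are assumptions, and the Mw and extinction coefficient are made-up round numbers, not values for any of the proteins in this practical.

```python
def mature_sequence(full_sequence, signal_length=0):
    """Remove a cleavable N-terminal signal sequence.

    Use signal_length=0 for transmembrane proteins, where the mature
    protein is the full-length protein.
    """
    return full_sequence[signal_length:]


def concentration_mg_per_ml(a280, molecular_weight, extinction_coefficient, path_cm=1.0):
    """Beer-Lambert law: A = e * c * l, so c (mol/l) = A / (e * l).

    Multiplying by Mw (g/mol) gives g/l, which is numerically equal to mg/ml.
    The extinction coefficient is in M^-1 cm^-1.
    """
    if extinction_coefficient <= 0 or path_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    molar = a280 / (extinction_coefficient * path_cm)
    return molar * molecular_weight


# Hypothetical values: Mw = 40,000 Da, extinction coefficient = 80,000 M^-1 cm^-1.
conc = concentration_mg_per_ml(1.0, 40000.0, 80000.0)
print(f"{conc:.2f} mg/ml")  # 0.50 mg/ml
```

Remember that the Mw and extinction coefficient must both come from the mature sequence, i.e. after the signal sequence has been removed where one is cleaved.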
Hence for transmembrane proteins use the topology you worked out earlier to decide if any potential N-glycosylation sites or tyrosine sulfation sites are accessible from the ER lumen. Other PTM analysis software is available (easily found using Google).

Next find information on the specific expression pattern of your protein. Again there are numerous databases (and numerous access points), but for this practical use GeneCards. Return to the UniProt entry page for your protein and, under Organism-specific databases, left click on the GeneCards link. This should take you through to the GeneCards page for that protein. If it does not, go to the GeneCards home page http://www.genecards.org/ and search using your SwissProt ID as the keyword. This may give multiple hits, but it should be very clear which is the link to your protein.

On the GeneCards page for your protein there is a subheading Transcripts. Click on the link for the UniGene cluster for your protein (the format for the link will be something like Hs.464336) and note down the expression sources for the protein based on cDNA sources. Back in the GeneCards entry, look under the subheading Expression in human tissues to see in which tissue types the protein is most highly expressed; note these down. There may be other useful information and useful links on the rest of the GeneCards page.

Finally, find the mouse homologue of the human protein and calculate the % identity between the mouse and human proteins. There are various programmes for searching databases, but for this practical use http://web.expasy.org/blast/. Several types of search are then possible:

Blastp searches with a protein sequence for other protein sequences
Blastn searches with a nucleotide sequence for other nucleotide sequences
Tblastn searches with a protein sequence against translated nucleotide sequences, i.e. you search with a protein sequence for a nucleotide sequence

You MUST select the appropriate type of database to search, i.e. for blastp, which you will use in this practical, you must select a protein database. Paste your full-length sequence in the box. Select blastp; protein databases -> UniProt (SwissProt/TrEMBL) and use the default settings for the alignment matrix, filters and graphics. Then click Run blast.
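The rule that the search type must match the query and database types can be written down as a small lookup table. This is only a sketch of the three search types listed above; the dictionary and function names are assumptions for illustration.

```python
# (query type, database type) -> search type, as listed in the text above.
BLAST_PROGRAMMES = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast(query_type, database_type):
    """Return the search type for a query/database pair, or raise if none is listed."""
    try:
        return BLAST_PROGRAMMES[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no search type listed for a {query_type} query "
            f"against a {database_type} database"
        )


# In this practical the query is a protein and the database is UniProt (protein).
print(choose_blast("protein", "protein"))  # blastp
```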
Once the search is complete, scroll down the list of similar sequences until you find the mouse (Mus musculus) sequence with the highest identity to your human sequence. Note down its SwissProt ID and the percentage identity. Check whether the alignment covers the whole protein by comparing the number of amino acids in the alignment (the second number given under Identities or Positives) with the number of amino acids in your protein. If the alignment covers the whole protein and contains no X, you need not do the alignment suggested below to get the % identity for the protein.
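The coverage check above can be sketched in Python as follows. The function name, the sample Identities line and the query length of 400 are all assumptions for illustration; only the general "Identities = n/m" layout of BLAST output is relied on.

```python
import re


def hit_covers_query(identities_line, query_length, aligned_text=""):
    """Return (percent identity, True if the hit spans the whole query with no X).

    identities_line: the BLAST result line containing 'Identities = n/m'.
    aligned_text: optionally, the aligned sequences, checked for X residues.
    """
    match = re.search(r"Identities\s*=\s*(\d+)/(\d+)", identities_line)
    if match is None:
        raise ValueError("no 'Identities = n/m' field found")
    identical, aligned = int(match.group(1)), int(match.group(2))
    percent = 100.0 * identical / aligned
    # The second number is the alignment length; it must reach the query length.
    full_length = aligned >= query_length and "X" not in aligned_text.upper()
    return percent, full_length


# Hypothetical result line for a 400-residue query.
line = "Identities = 300/400 (75%), Positives = 340/400 (85%)"
print(hit_covers_query(line, 400))  # (75.0, True)
print(hit_covers_query(line, 420))  # (75.0, False): a full alignment is still needed
```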

To work out the % identity for the whole protein you must get the mouse protein sequence. Left click on the link to go to the Swiss-Prot entry for the mouse protein, then get the sequence without spaces, digits etc. in the same way you got the original human sequence, i.e. via Word. Then go to LALIGN http://www.ch.embnet.org/software/lalign_form.html. Paste the human sequence as the 1st query and the mouse sequence as the 2nd query, and run LALIGN using the default settings. Record the % identity between the proteins and whether this is over the whole protein or not.
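To make clear what the % identity figure means, the sketch below computes it from two already-aligned sequences, with gaps written as "-". It is not a reimplementation of LALIGN: the function name and the toy sequences are assumptions, and the denominator used here (the number of alignment columns) is one common convention that may differ from the one a given programme reports.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    columns = list(zip(aligned_a.upper(), aligned_b.upper()))
    # A column counts as identical only if it holds the same residue, not a gap.
    identical = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * identical / len(columns)


# Toy alignment: 7 columns, 5 identical residues, one gap and one mismatch.
human = "MKT-LLA"
mouse = "MKTALLG"
print(f"{percent_identity(human, mouse):.1f}%")  # 71.4%
```

When recording the result, state which region the figure refers to, since an identity calculated over a partial overlap is not the same as one over the full-length proteins.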