FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

BIOMOLECULES COURSE: COMPUTER PRACTICAL 1

Author of the exercise: Prof. Lloyd Ruddock
Edited by Dr. Leila Tajedin, 2017-2018
Assistant: Leila Tajedin (leila.tajedin@oulu.fi)
Course responsible person: Dr. Tuomo Glumoff (tuomo.glumoff@oulu.fi)

CONTENTS
Biochemistry Practical Schedules
Organisation
Computing Practical 1

ORGANISATION

1 Students will work as individuals. Failure to attend a practical will result in a mark of zero being awarded unless there is a good reason for the absence, which must be notified to Tuomo Glumoff.

2 To achieve good results from these practicals it is important that students (i) understand the basic principles underlying each experiment and (ii) are able to discern which operations are critical and which are not. It is therefore important that every student READ THE PRACTICAL SHEETS before the class. If you do not understand any point, please consult one of the demonstrators; that is what they are primarily there for!

3 Reports must be sent within one week of the practical class to Leila Tajedin (email address above). The maximum length allowed for the report is 6 sides of A4 paper and the acceptable format for the report is PDF. Late work will not be marked unless there is due reason for the lateness.

4 Note: Try to present answers that are concise and logical. There is no need to mention irrelevant points. Irrelevant or false statements that dilute the real answer will be marked negatively. In addition, choice of words is important: consult a dictionary and the scientific literature to clarify the actual meaning of words, so that you are sure each word conveys the meaning you intend.

In this practical you will work on three human proteins with the following SwissProt IDs:
1) P15291
2) Q8IVL5
3) Q76MJ5

The write-up for this report is free-format but at a minimum must include the following information:
1) The SwissProt ID and the FASTA (canonical) format of the sequence
2) For transmembrane proteins, the possible orientation of the protein in the membrane
3) The signal sequence or transmembrane region which targets the protein to the ER
4) The molecular weight, pI and extinction coefficient of the mature protein
5) The concentration of the protein in mg/ml of a solution with an A280 of 1.0 in a 1 cm path length cuvette
6) Any putative post-translational modifications
7) The expression pattern of the protein
8) The SwissProt ID of the mouse homologue of the human protein and the % identity between the full-length mouse and human proteins
9) The possible domains and motifs of the protein
10) An exploration of the 3D structure of the protein

For a normal protein, if you get the answers to the first 8 points correct you will get 70% of the marks. To get the remaining marks (30%), you are expected to find out which bioinformatic analysis software and links are useful for points 9 and 10 and to write a short handbook, which can take up to two pages of the total report. There are many links on the UniProt page for the protein and several analysis programmes are available, some of which can be found at: http://expasy.org/tools/
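For point 5 of the list above, the conversion follows from the Beer-Lambert law, A = ε × c × l, where c is the molar concentration. Since mg/ml is the same as g/l, c (mg/ml) = A × Mw / (ε × l). The short Python sketch below only illustrates this arithmetic; the function name and the example numbers are hypothetical and are not the values for any of the three proteins in this practical.

```python
def concentration_mg_per_ml(a280, ext_coeff, mol_weight, path_cm=1.0):
    """Convert an absorbance at 280 nm into a concentration in mg/ml.

    a280       -- measured absorbance (dimensionless)
    ext_coeff  -- molar extinction coefficient in M^-1 cm^-1
    mol_weight -- molecular weight in g/mol (Da)
    path_cm    -- cuvette path length in cm
    """
    if ext_coeff <= 0 or path_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    molar = a280 / (ext_coeff * path_cm)  # Beer-Lambert: c = A / (epsilon * l), in mol/l
    return molar * mol_weight             # mol/l * g/mol = g/l = mg/ml


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
example = concentration_mg_per_ml(a280=1.0, ext_coeff=50000.0, mol_weight=25000.0)
print(f"{example:.2f} mg/ml")
```

Use the extinction coefficient and molecular weight of the mature protein, as these are the values the report asks for.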

COMPUTING PRACTICAL 1

BASIC BIOINFORMATICS

Analysis of gene or protein sequences by computer can reveal many significant pieces of information regarding function, expression, cellular localization, structure, post-translational modifications etc. The number and types of methodologies are continuously growing and this practical will cover only the very basics of analysis in a couple of simple cases.

As increasing amounts of information are obtained from genome sequencing projects, transcriptional analysis etc., it is increasingly easy to find some information about your protein of interest. However, as the databases get larger and ever more complex, it is also becoming increasingly difficult to find the relevant data. In addition, it must always be remembered that computer-based predictions are only predictions; they are not fact until supporting experimental evidence is found.

In this practical you will use a few of the services available directly or indirectly from the ExPASy (Expert Protein Analysis System) server (http://au.expasy.org/), along with other commonly used bioinformatic analysis software, to:
1) Find a specific human protein sequence.
2) Find the signal sequence or transmembrane region which targets the protein to the ER.
3) Look for transmembrane regions and the possible orientation in the membrane.
4) Use this information to find out the molecular weight and pI of the mature protein.
5) Look for post-translational modifications.
6) Look for information on the expression pattern of the protein.
7) Find the mouse homologue of the human protein and calculate the % identity between the mouse and human proteins.
(By completing these seven activities you will receive 70% of the mark.)
8) Look for domains and the 3D structure of the proteins (30% of the mark).
One of the problems associated with bioinformatics is the large number of identifiers used: each protein may have multiple entries in multiple databases, each with a unique ID code, and the mRNA sequences, genomic sequences etc. for the same protein show an equal diversity of ID codes. In this practical, your starting point will be the SwissProt ID for each of the human proteins.

Some of these proteins are in some way associated with the endoplasmic reticulum (ER), either by having a transmembrane (membrane-spanning) region (and appropriate signals) or, for soluble proteins, by being targeted to the ER by an N-terminal signal sequence. These signal sequences, which are typically 17-26 amino acids long but may be longer, may either be cleaved by an enzyme called signal peptidase, which then releases a mature protein into the ER, or be non-cleavable, in which case they act as a transmembrane region.

Transmembrane regions are usually helical and around 20 amino acids long, and a single protein may have multiple transmembrane regions, each spanning the membrane. Proteins containing such regions are always orientated in one specific direction, i.e. a given ER protein with a single transmembrane region may be orientated with its N-terminus either in the cytoplasm or in the ER, but it will not form a mixed population with some molecules orientated with the N-terminus in the ER and some in the cytoplasm. An ER protein which has an even number of transmembrane regions will have both the N-terminus and the C-terminus on the same side of the membrane, i.e. both in the cytoplasm or both in the ER.

In this practical course we will study several proteins that are all well studied, and so even on the first data page you open there will be a significant amount of information towards answering the questions posed. Please do not take a short cut and simply return this information; do the full analysis. In some cases the answers obtained from the bioinformatics analysis and those listed in the SwissProt entry for the protein are very significantly different.

In this manual a weblink will be underlined, e.g. http://au.expasy.org/, while a menu link will have the menu name in italics and subheadings referred to with ->, e.g.
Tools -> word count

You will quickly collect a LOT of information about the protein and it may be worthwhile having Microsoft Word open and copying and pasting information into it as you go along. You MUST edit this information into a readable, presentable format that answers all of the questions asked (and shows the evidence).

Open Internet Explorer and go to the UniProt homepage http://www.uniprot.org/. Enter the SwissProt ID in the query box and press search (care must be taken to distinguish between the number 0 and the letter O). This should take you to a unique UniProtKB entry; make sure that it is your protein before going further (remember ALL of the examples are human proteins). This entry contains a lot of information about the protein, links to references, other databases, mRNA and genomic DNA sequences, structures (PDB files) etc. You are actively encouraged to explore these links during the practical to see what information can be found in databases about any particular protein.

Whenever possible, all the protein products encoded by one gene in a given species are described in a single UniProtKB/Swiss-Prot entry, including isoforms generated by alternative splicing, alternative promoter usage and alternative translation initiation. Canonical sequences alone, or canonical and isoform sequences, can be downloaded in FASTA format using the FORMAT tab at the top of the page. Click on FASTA (canonical); this will take you to another page containing the amino acid sequence of the protein, which you will need for the other analysis programmes. Copy this sequence into your Microsoft Word file. Make sure that this is your sequence and that you have copied all the amino acids by double-checking the number of copied amino acids against UniProt. Notice that the heading in FASTA format starts with > and contains gene/protein information, followed by the amino acid sequence; you will need only the amino acid sequence for analysis (the amino acids are marked with a red box in the picture).

All of the proteins in this exercise are either transmembrane or have a cleavable N-terminal signal sequence which directs them to the endoplasmic reticulum. To find out which of these your protein has, you will analyse the sequence in a number of programmes. First select your sequence in Word and copy it (right mouse click -> copy). Then go to a series of websites and systematically check your sequence in a number of analysis programmes. Include at least:
PSORT (use the PSORT II sub-programme) http://psort.hgc.jp/
SignalP http://www.cbs.dtu.dk/services/signalp/
TMpred http://www.ch.embnet.org/software/tmpred_form.html
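Before pasting the sequence into the programmes listed above, the length check described earlier can also be done with a few lines of Python instead of by eye. The sketch below assumes the FASTA text has been copied into a string; the function names and the demo record (with the made-up accession X00000) are hypothetical and do not correspond to any of the proteins in this practical.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_single_fasta(text):
    """Split a one-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]                                    # drop the leading '>'
    sequence = "".join(lines[1:]).replace(" ", "").upper()   # join wrapped lines
    return header, sequence


def check_sequence(sequence, expected_length):
    """Return (length_matches, unexpected_characters) for a copied sequence."""
    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    return len(sequence) == expected_length, unexpected


# Made-up record for illustration only.
demo_record = """>sp|X00000|DEMO_HUMAN Hypothetical demo protein
MKTLLLAVAV
GGSAQW
"""

header, sequence = read_single_fasta(demo_record)
print(header)
print(len(sequence), check_sequence(sequence, expected_length=16))
```

The expected length is the number of amino acids shown in the UniProt entry; any stray digits or other unexpected characters from copying show up in the second returned value.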

In each case follow the instructions in the programme, pasting your sequence into the query box (edit -> paste). Carefully note the results (different programmes may give different results) and use the consensus result for future analysis. For signal sequences you need to note the signal sequence and whether it is cleavable. For transmembrane proteins you need to note the transmembrane region(s) and the probable orientation in the membrane. PSORT will also give you other information; think about what might be relevant to note down.

Note: If you do not get a consensus result for TM regions from these three programmes then you will have to try others until you get a consensus. Suggested other programmes include:
http://www.cbs.dtu.dk/services/tmhmm/
http://www.enzim.hu/hmmtop/
http://bp.nuap.nagoya-u.ac.jp/sosui/

Then calculate the pI (isoelectric point) and Mw of the mature protein. For proteins with a cleavable signal sequence this will be the full-length protein with the signal sequence removed. For transmembrane proteins it will be the full-length protein. Go to ProtParam http://web.expasy.org/protparam/ and paste in the sequence to be analysed (remember the signal sequence issue). Note down the results for pI, Mw and extinction coefficient. If you had a solution of the mature protein with an absorbance at 280 nm of 1.0 in a 1 cm path length cuvette, what would the concentration of the protein be in mg/ml?

Next examine possible post-translational modifications of the protein. Include at least:
NetNGlyc http://www.cbs.dtu.dk/services/netnglyc/
NetPhos http://www.cbs.dtu.dk/services/netphos/
Sulfinator http://web.expasy.org/sulfinator/

In each case follow the instructions in the programme, pasting your full-length sequence into the query box (edit -> paste). Carefully note the results. N-glycosylation and tyrosine sulfation occur in the ER, and hence for transmembrane proteins only those regions in the ER can be modified.
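As a rough illustration of restricting candidate sites to ER-lumenal regions, the Python sketch below scans for the N-glycosylation sequon N-X-S/T (where X is any residue except proline) and keeps only the sites that fall inside lumenal segments. This is a plain pattern match and is not the method the prediction server uses, so it is no substitute for the server output; the function names, the test sequence and the lumenal segment are all hypothetical.

```python
import re

# Lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T motifs (X not Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


def lumenal_sites(positions, lumenal_segments):
    """Keep positions lying inside any (start, end) segment, 1-based inclusive."""
    return [
        pos
        for pos in positions
        if any(start <= pos <= end for start, end in lumenal_segments)
    ]


# Made-up sequence; residues 10-23 are assumed (hypothetically) to be lumenal.
demo_sequence = "MNASAAAANPTAAANGTAAANKS"
candidates = find_sequons(demo_sequence)
print(candidates)
print(lumenal_sites(candidates, [(10, 23)]))
```

In the demo sequence the motif N-P-T at position 9 is skipped because the middle residue is proline, and the site at position 2 is dropped because it lies outside the assumed lumenal segment.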
Hence for transmembrane proteins use the topology you worked out earlier to decide whether any potential N-glycosylation sites or tyrosine sulfation sites are accessible from the ER lumen. Other PTM analysis software is available (easily found using Google).

Next find information on the specific expression pattern of your protein. Again there are numerous databases (and numerous access points) but for this practical use GeneCards. Return to the UniProt entry page for your protein and, under Organism-specific databases, left click on the GeneCards link. This should take you through to the GeneCards page for that protein. If it does not, then go to the GeneCards home page http://www.genecards.org/ and search by keyword, typing in your SwissProt ID. This may give multiple hits, but it should be very clear which is the link to your protein.

On the GeneCards page for your protein there is a subheading Transcripts. Click on the link for the Unigene Cluster for your protein (the format for the link will be something like Hs.464336) and note down the expression sources for the protein based on cDNA sources. Back in the GeneCards entry, look under the subheading Expression in human tissues to see in which tissue types the protein is most highly expressed; note these down. There may be other useful information and useful links on the rest of the GeneCards page.

Finally find the mouse homologue of the human protein and calculate the % identity between the mouse and human proteins. There are various programmes for searching databases but for this practical use http://web.expasy.org/blast/. There are then several types of search possible:
Blastp searches with a protein sequence for other protein sequences.
Blastn searches with a nucleotide sequence for other nucleotide sequences.
Tblastn searches with a protein sequence against translated nucleotide sequences, i.e. you search with a protein sequence for a nucleotide sequence.

You MUST select the appropriate type of database to search, i.e. for blastp, which you will use in this practical, you must select a protein database. Paste your full-length sequence into the box. Select blastp; protein databases -> UniProt (SwissProt/TrEMBL) and use the default settings for the alignment matrix, filters and graphics. Then click Run blast.
Once the search is complete, scroll down the list of similar sequences until you find the mouse sequence (Mus musculus) with the highest identity to your human sequence. Note down the SwissProt ID and the percentage identity. If the alignment covers the whole protein (compare the number of amino acids over which the alignment is shown, i.e. the second number under Identities or Positives, with the number of amino acids in your protein) and contains no X characters, then you need not do the alignment suggested below to get the % identity for the protein.

To work out the % identity for the whole protein you must get the mouse protein sequence. Left click on the link to go to the Swiss-Prot entry for the mouse protein and then get the sequence without spaces, digits etc. in the same way as you got the original human sequence, i.e. via Word. Then go to LALIGN http://www.ch.embnet.org/software/lalign_form.html. Paste the human sequence as the 1st Query and the mouse sequence as the 2nd Query and run LALIGN using the default settings. Record the % identity between the proteins and whether this is over the whole protein or not.
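To make clear what the two recorded numbers mean, the Python sketch below computes a % identity and a query coverage from a pair of already aligned sequences. It assumes gaps are written as "-" and takes the number of alignment columns as the denominator; alignment programmes differ in how they define this, so the value may not match the programme's own figure exactly. The function names and the short aligned strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical positions as a percentage of alignment columns (assumed convention)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    columns = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap and res_b == gap:
            continue                      # ignore columns that are gaps in both
        columns += 1
        if res_a == res_b:
            identical += 1
    if columns == 0:
        raise ValueError("alignment contains no residues")
    return 100.0 * identical / columns


def query_coverage(aligned_query, full_query_length, gap="-"):
    """Percentage of the full-length query that appears in the alignment."""
    aligned_residues = sum(1 for res in aligned_query if res != gap)
    return 100.0 * aligned_residues / full_query_length


# Made-up aligned pair for illustration only.
human = "MKT-LLAV"
mouse = "MKTALIAV"
print(percent_identity(human, mouse))
print(query_coverage(human, full_query_length=7))
```

A coverage below 100% means the reported identity applies to only part of the protein, which is the distinction the report should state.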