FUNCTIONAL BIOINFORMATICS

Molecular Biology

PREDICTING THE FUNCTION OF AN UNKNOWN PROTEIN

Suppose you have obtained the amino acid sequence of an unknown protein and wish to find its potential function. One approach is to determine whether proteins with a high level of sequence similarity exist, and which conserved domains they share. To do this, we will use the protein BLAST function, Blastp.

1. Obtain the sequence of a protein of unknown function from Mus musculus from this course's web site under the heading "sequences".
2. Go to the NCBI website and click on Blast. Click on protein BLAST (blastp). Copy the mouse protein sequence and paste it into the Search box.
3. From the dropdown menu Database, select reference proteins (refseq_protein). For this exercise, we will ask whether similar proteins can be found in humans. To do so, type and choose Homo sapiens in the Organism box.
4. Choose "show results in a new window" and then click on the BLAST button.
5. A new window, similar to the one shown below, will be displayed with a colored, diagrammatic representation of the mouse protein sequence showing the locations of any functional or structural domains that are present in the protein.

6. Click on the domain(s) to find out what they are. You will be brought to a new page, as shown below.
7. To find out more about the function of the domain(s), click the [+] symbols on the left. This should give you some clues about the identity of the mouse protein.
8. If you scroll down through the BLAST results, in the original window showing the diagram of domains in the mouse protein, you will see many sequence alignments, one after the other. Each alignment is a sequence comparison between the mouse protein and a human protein that is similar in sequence to it. The first alignment compares the mouse protein with the human protein that is its best match, the second alignment compares it with the second-best match, and so on.
9. To infer the function of a protein about which little is known, one can compare its sequence to those of proteins of known function. If the unknown protein is very similar in sequence to a protein of known function, then there is a good chance that the two proteins have the same function.
10. For your assignment, repeat this exercise with the protein of unknown function from Danio rerio (zebrafish) and compare it to proteins from Homo sapiens (human), Mus musculus (mouse), and Saccharomyces cerevisiae (yeast).
11. Choose and save the best protein match in each case.
12. Align the three proteins found to the initial unknown protein. Obtain the percent identity at the protein level to determine which one shares the highest level of identity.
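As an illustration of step 12, percent identity can be computed directly from two rows of an alignment. The sketch below is a minimal example and not the method used by any particular alignment tool: the function name `percent_identity` and the toy sequences are hypothetical, and the denominator is assumed to be the number of alignment columns (gap columns included). Tools differ in the denominator they use, so values may not match their reports exactly.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two rows of a pairwise alignment.

    Both strings must already be aligned (same length, '-' for gaps).
    Columns where both rows hold a gap are ignored; every other column
    counts toward the denominator (an assumption, see the lead-in).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        columns += 1
        if a == b:
            matches += 1

    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / columns


if __name__ == "__main__":
    # Toy aligned fragments, invented for illustration only.
    query = "MKT-AYIAK"
    hit = "MKTQAYLAK"
    print(f"{percent_identity(query, hit):.2f}% identity")
```

Running the same function on each of the three alignments gives three numbers that can be compared directly, provided all three were produced with the same alignment settings.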

FINDING OPEN READING FRAMES

Sequencing has become so easy that, in recent years, the sequences of complete genomes have been obtained from numerous prokaryotes, eukaryotes, and viruses. These sequences are of little utility unless we can derive their functions, which is the aim of the field of functional genomics. Amongst other things, functional genomics involves the search for and identification of coding sequences: the genes. One of the bioinformatics methods used to this end is the search for open reading frames (ORFs). These typically start with a translation initiation codon (AUG) and end with a translation termination codon (UAG, UGA, or UAA). Genes that code for proteins necessarily contain an ORF. However, not all genes code for proteins, and thus not all genes possess ORFs.

In contrast to many genomes, viral genomes are relatively small and simple, making them quite easy to sequence. Sequencing them allows rapid identification, evolutionary studies, and the discovery of new viruses. Given their simplicity, it should be a relatively simple task to find genes within these sequences. In this exercise, you will search for potential ORFs in sequences obtained from two different RNA segments of a viral genome.

1. Go to the NCBI site and click on the link "Open reading frame finder (ORF finder)" in the menu "Resource List (A-Z)".

2. Copy the sequence Viral1 from the text file "viral genome sequence", available on this course's web page, and paste it into the query box.
3. Click on Submit. A new page similar to the one below will be loaded. This page shows all the open reading frames found in all six possible reading frames. The position and length, in bases, of each ORF are presented graphically and in text form. The symbols [+] and [-] under the heading Strand indicate whether the ORF is on the sequence entered in the query or on its reverse complement, respectively. Start and Stop indicate the base pair positions of the beginning and the end of the open reading frame, respectively. Length (bp aa) indicates the length of the identified ORF in base pairs and in amino acids.
4. To display the amino acid sequence of a given ORF, simply click on the desired ORF in the right panel. The amino acid sequence of the selected ORF is displayed in the left panel. You may select the protein sequence displayed and copy it for your records.
5. To obtain the nucleotide sequence of the selected ORF, click on "Display ORF as" and choose the option "Nucleotide sequence".
6. To obtain the annotated nucleotide sequence of the selected ORF, click on "Display ORF as" and choose the option "CDS translation".
7. To determine the possible function of this protein, and thus of this gene, perform a Blastp search as you did in the previous exercise. This time, do not specify an organism.

8. Obtain the record for the gene with the best match. Obtain the following information from the record:
   - The definition
   - The organism this gene comes from
   - The name of the protein product
   - The gene's name
9. Repeat steps 1-8 with the second viral sequence, "viral2".

USING BLASTX

An alternative way of performing the same task is to use the BlastX function from the NCBI Blast options. BlastX translates a nucleotide query and searches it against a protein database, giving you very similar information.

1. Go to the NCBI website and click on blastx (translated nucleotide → protein). Copy the viral1 sequence and paste it into the Search box.
2. From the dropdown menu Database, select reference proteins (refseq_protein). Do not choose any specific organism.
3. Choose "show results in a new window" and then click on the BLAST button.

4. As before, a new window will be displayed with a colored, diagrammatic representation of the protein sequence showing the locations of any functional or structural domains that are present in the protein.
5. By scrolling further down, the actual alignments, such as the one shown below, will be displayed.
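Both ORF finder and BlastX consider all six reading frames of a nucleotide sequence: three on the strand entered and three on its reverse complement. The following is a minimal sketch of such a six-frame ORF scan, not a reimplementation of either NCBI tool. The function names, the `min_codons` parameter, and the toy sequence are hypothetical. It assumes the standard start codon (ATG in DNA notation, AUG in RNA) and the three standard stop codons, and it reports positions relative to the strand being read, whereas ORF finder reports positions on the query sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns tuples (strand, frame, start, stop, length_nt). Positions are
    1-based and relative to the strand being read (a simplification);
    length_nt includes the stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA notation as well
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG of the ORF being read
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start + 1, i + 3, length))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence invented for illustration only.
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

In this sketch an ORF that starts but never reaches a stop codon is not reported. A real tool may handle partial ORFs, alternative start codons, and other genetic codes differently.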

Interpretation of the alignment:

Length: Indicates the length, in amino acids, of the protein found.
Identities: Indicates the number of amino acids that are the same between the two proteins and the overall percentage of identity.
Positives: Indicates the number of identical amino acids plus the number of conservative amino acid substitutions between the two proteins, and the overall percentage of similarity.
Frame: Indicates which reading frame was used to obtain the best protein alignment.

FINDING SNPS

Single nucleotide polymorphisms (SNPs) are single nucleotide changes within a nucleotide sequence that arise through mutation. These may have a drastic effect on the gene's function and have often been associated with various diseases. Viruses are amongst the fastest evolving organisms. For instance, in the case of the influenza virus, the mutation rate is so high that new vaccines are often developed each year.

The viral3 sequence, in the viral sequence document, represents the same gene as the viral1 sequence but was isolated in a different year. Use the skills you have acquired in bioinformatics to obtain the following information about the viral3 sequence.

Perform a nucleotide alignment to determine which SNPs have been acquired in the viral3 sequence.
Do these SNPs change the reading frame of the longest ORF?
If the SNPs do not change the reading frame, do they change the amino acid(s) coded by the longest ORF?
If amino acid changes are observed, are these conserved, semi-conserved, or nonconserved amino acid changes?
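The SNP questions above can also be explored programmatically once the two sequences are aligned. The sketch below is a hedged illustration only: the function names and toy sequences are hypothetical, it assumes two ungapped nucleotide sequences of equal length that start at the first base of the ORF, and it uses the standard genetic code. A single-base substitution cannot shift the reading frame by itself (only insertions or deletions can), although it may create or remove a start or stop codon. The sketch labels codon changes only as synonymous or nonsynonymous. Deciding whether a change is conserved, semi-conserved, or nonconserved requires a substitution matrix or the symbols of an alignment program, which is not attempted here.

```python
# Standard genetic code, bases in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def find_snps(ref: str, alt: str):
    """List (position, ref_base, alt_base) for every mismatch (1-based)."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, a)
        for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
        if r != a
    ]


def codon_effects(ref_orf: str, alt_orf: str):
    """Compare two in-frame ORFs codon by codon.

    Returns (codon_number, ref_codon, alt_codon, ref_aa, alt_aa, kind)
    for each codon that differs; kind is 'synonymous' or 'nonsynonymous'.
    Codons containing unexpected characters are translated as 'X'.
    """
    if len(ref_orf) != len(alt_orf):
        raise ValueError("ORFs must have the same length")
    effects = []
    for i in range(0, len(ref_orf) - 2, 3):
        ref_codon = ref_orf[i:i + 3].upper()
        alt_codon = alt_orf[i:i + 3].upper()
        if ref_codon != alt_codon:
            ref_aa = CODON_TABLE.get(ref_codon, "X")
            alt_aa = CODON_TABLE.get(alt_codon, "X")
            kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
            effects.append((i // 3 + 1, ref_codon, alt_codon, ref_aa, alt_aa, kind))
    return effects


if __name__ == "__main__":
    # Toy ORFs invented for illustration; not the course sequences.
    reference = "ATGGCTGAATAA"
    variant = "ATGGCCGATTAA"
    print(find_snps(reference, variant))
    print(codon_effects(reference, variant))
```

If the alignment of viral1 and viral3 contains gaps, the sequences are no longer the same length in frame, and the codon-by-codon comparison in this sketch no longer applies without further handling.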