COMPUTER RESOURCES II: Using the computer to analyze data, using the internet, and accessing online databases

Bio 210, Fall 2006
Linda S. Huang, Ph.D.
University of Massachusetts Boston

In the first computer lab, we discussed how using the computer is important to modern-day cell biologists. In this lab, we will use the computer to analyze a DNA sequence that will be provided to you. You will examine this DNA sequence for a protein coding region (often referred to as an open reading frame, or ORF). Once you determine the protein most likely coded for by this piece of DNA, you will take the translated protein sequence and access an online database to figure out what protein it encodes. Finally, you will use an organism-specific database to learn more about your protein.

Before you begin the computer work, your TA will give you a short double-stranded DNA sequence and ask you to translate it into the six possible reading frames, using the codon tables below (see next page). Please complete this assignment first, hand it in to the TA, and then proceed to the next part of the lab.

To begin the computer-assisted translation, get your DNA sequence from your TA. Note that this DNA sequence contains the entire Open Reading Frame (ORF) for a yeast (Saccharomyces cerevisiae) gene. There are a few important details to note:

- The protein encoded by the DNA sequence is greater than 80 amino acids long.
- The DNA sequence your TA has given you does not contain any introns. (Only 5% of S. cerevisiae genes contain introns, unlike human genes where essentially all protein-coding genes contain introns. It is simpler to do the bioinformatics of protein prediction using DNA without introns.)
- This sequence is a single strand of DNA. By convention, the first nucleotide is at the 5' end and the last nucleotide is at the 3' end. The reverse complement of this strand can be easily inferred from the sequence you have.
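As an illustration of how the reverse complement is inferred from a single strand, here is a minimal Python sketch. The function name and the 12-nucleotide example sequence are made up for illustration (this is not your lab sequence), and the sketch assumes the input contains only the letters A, C, G and T.

```python
# Complement of each DNA base (A pairs with T, G pairs with C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Walk the given strand from its 3' end back to its 5' end,
    # swapping each base for its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# Hypothetical example strand, written 5' to 3'.
top_strand = "ATGGCCTTTTAA"
print(reverse_complement(top_strand))  # prints TTAAAAGGCCAT
```

Taking the reverse complement twice returns the original strand, which is a quick way to check the result.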

Step 1: Figure out the protein sequence that your DNA sequence encodes

Since we have told you that your DNA sequence encodes a protein, you should be able to figure out what protein it encodes. You should recall from class that the DNA code is transcribed into mRNA, which is then translated into protein with the help of tRNAs that read the mRNA three nucleotides (one codon) at a time.
The genetic code is provided below (from Figure 7-24 of Essential Cell Biology, ed. 3):

Here is an alternative layout of the genetic code:

You should also recall how the ribosome translates an mRNA molecule (see Figures 7-33, 7-34, and 7-37 in Essential Cell Biology, ed. 3 for review).

One way to translate your DNA sequence is to compare the sequence to the genetic code by hand and determine the proteins that could potentially be encoded by the DNA. There are also computer programs that can do this same thing. However, you should understand conceptually how to translate DNA into protein, just like you should know how to add even though you own a calculator.

A particularly good program to translate DNA into protein is JavaScript DNA Translator 1.1. This program can be found at:

http://www.annular.org/~sdbrown/dna/translator.html

To use this program:

- Copy the sequence that your TA gave you and paste it in the box labeled "Sequence: DNA Only or FASTA format".
- Using the menus, you can choose to receive the output in either a 3-letter Amino Acid representation or a 1-letter Amino Acid representation. Choose 1-letter.
- Be sure to choose 6 reading frames for the Reading Frame.
- The Line Length choice determines how long the lines of your output file are; the default of 60 is fine.
- When you are setting up your translations, UNCHECK the box that says "Display Translations of ORFs of at least".

The bottom selections of your computer screen should look like:

Now, hit the button that says Translate and you will get the output in another window in your browser. (Be patient, it can take a few seconds!)

Examine the output file. At the top, you will see the six-frame translation of your DNA sequence. At the bottom of the window, you will see the six translations listed separately.
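To make the idea of a six-frame translation concrete, here is a minimal Python sketch. It is not the code of JavaScript DNA Translator; the function names and the short example sequence are made up for illustration. It assumes the input contains only A, C, G and T, and it assumes one particular numbering for frames 4-6 (offsets 0, 1 and 2 on the reverse complement), which may not match the numbering a given program uses.

```python
from itertools import product

# Standard genetic code in one-letter form. Codons are ordered
# TTT, TTC, TTA, TTG, TCT, ... GGG; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the first base; ignore leftovers."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str) -> dict:
    """Map frame number (1-6) to its one-letter translation.

    Frames 1-3 read the given strand starting at base 1, 2 or 3.
    Frames 4-6 do the same on the reverse complement (assumed numbering).
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[offset + 4] = translate(rc[offset:])
    return frames


# Hypothetical 12-nucleotide example, not the lab sequence.
for frame, protein in sorted(six_frame_translation("ATGGCCTTTTAA").items()):
    print(frame, protein)
# Frame 1 prints MAF* and frame 4 prints LKGH.
```

Translating a short sequence by hand and comparing it with the output of a sketch like this is a useful check on your own six-frame translation.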

1. Scroll to the bottom of the output window. You will see the six translations in six reading frames, highlighted in yellow. From these translations, determine which is the most likely ORF that encodes your protein. Hint: look for the longest possible ORF, starting with methionine and ending with a stop codon, which is indicated with an asterisk (*). Note the number of the frame that encodes this ORF. Copy this ORF, from methionine to the last amino acid, and paste it into the Word file with your gene, below the gene sequence.

2. Now look at the printout of your output, and find the DNA codon that encodes the first methionine in your ORF (methionine is encoded by AUG in the RNA, which corresponds to ATG in the DNA). In the output window, amino acid letters are positioned above the first nucleotide in each codon. Circle this ATG in your printout. Now return to your Word file with your gene sequence, find that codon in the sequence, and highlight it using BOLD face and 16 point font. Note that if your gene is encoded by reading frame 4, 5, or 6 (on the bottom strand of your DNA), then you have to look closer to the end of your gene's sequence, and look for the codon that is complementary to ATG (read 5' to 3' on the strand you were given, it appears as CAT). (The reason for this is that in the Word file, only one strand is provided as your gene sequence, which the translator program understands as the top strand by default.)

3. Look at your printout again, and find the DNA codon that corresponds to the translation stop signal. These codons are TAA, TAG, or TGA, and will correspond to the asterisk at the end of your ORF. Circle the stop codon in your printout, and return to the Word file with your gene. Find the stop codon there and highlight it by using underline and 16 point font. Again, remember that if your gene is in frame 4, 5, or 6, you have to look near the beginning of your sequence, and search for a codon that is complementary to TAA, TAG, or TGA (read 5' to 3' on the strand you were given, these appear as TTA, CTA, or TCA).
   Again, this is because only the top strand is given as the sequence for your gene in the Word file.

4. Print your amended sequence file.

Step 2: Find out the name of the protein that your DNA encodes.

Now that you have the protein sequence that your DNA encodes, we can use the databases to figure out the name of that protein. As you are probably aware, concerted efforts have been undertaken to determine the full DNA sequences of many different organisms (these are referred to as genome projects, since they aim to determine the composition of the genome of a particular species). The Saccharomyces cerevisiae genome was completed in 1996 and was the first complete eukaryotic genome to be sequenced. Since then, many other genomes have been completed (including the first draft of the human genome, completed in 2001).

The DNA sequences obtained from these genome projects are available in public databases, and many programs exist that can assist you in searching these databases. Additionally, the proteins encoded by these DNA sequences can be predicted in much the same way as you did in Step 1 above; these predicted protein sequences have also been deposited into publicly available databases.
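The prediction approach from Step 1, picking the longest methionine-to-stop stretch among the six frames and then locating it on the strand you were given, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the 15-nucleotide example are hypothetical, the input must contain only A, C, G and T, frames 4-6 are taken to be offsets 0, 1 and 2 on the reverse complement, and real gene-prediction pipelines use more evidence than ORF length alone.

```python
from itertools import product

# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return (frame, protein, start, end) for the longest ATG-to-stop ORF.

    start and end are 0-based positions on the strand that was read
    (the given strand for frames 1-3, its reverse complement for 4-6);
    end is exclusive and includes the stop codon. Returns None if no
    complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for strand_index, strand in enumerate((seq, reverse_complement(seq))):
        for offset in range(3):
            orf_start, protein = None, []
            for i in range(offset, len(strand) - 2, 3):
                aa = CODON_TABLE[strand[i:i + 3]]
                if orf_start is None:
                    if aa == "M":          # first methionine opens the ORF
                        orf_start, protein = i, ["M"]
                elif aa == "*":            # stop codon closes the ORF
                    frame = 3 * strand_index + offset + 1
                    candidate = (frame, "".join(protein), orf_start, i + 3)
                    if best is None or len(candidate[1]) > len(best[1]):
                        best = candidate
                    orf_start = None
                else:
                    protein.append(aa)
    return best


def top_strand_span(seq_len, frame, start, end):
    """Convert ORF positions to positions on the strand you were given."""
    if frame <= 3:
        return start, end
    # On the reverse strand, the ORF's start lies near the END of the
    # given sequence and its stop codon near the BEGINNING.
    return seq_len - end, seq_len - start


# Hypothetical example whose only ORF is on the reverse strand.
dna = "GGCTAGGGTTTCATA"
frame, protein, start, end = longest_orf(dna)
left, right = top_strand_span(len(dna), frame, start, end)
print(frame, protein, dna[left:right])  # prints 5 MKP CTAGGGTTTCAT
```

In the example, the ORF region on the given strand begins with CTA (the reverse complement of the stop codon TAG) and ends with CAT (the reverse complement of ATG), which is the pattern described in Step 1 for frames 4, 5, and 6.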

One of the more commonly used programs to search the sequence databases is BLAST (Basic Local Alignment Search Tool), developed by the National Center for Biotechnology Information (NCBI). BLAST can be used in different ways:

- to compare a DNA sequence to DNA sequences in the databases (blastn).
- to compare an amino acid sequence to protein sequences in the databases (blastp).
- to compare a DNA sequence translated in all six reading frames to all protein sequences in the databases (blastx).

You can access BLAST searches against ALL publicly available databases at:

http://www.ncbi.nlm.nih.gov/blast/

However, since we know that you have a yeast protein, we are instead going to use a portal specifically designed for analyzing genomic information related to S. cerevisiae. This way, when you search with your protein sequence, you will not get every possible match, only those matches that are in the S. cerevisiae genome. You will also be able to more easily obtain other information about your gene.

Point your browser to:

http://www.yeastgenome.org/cgi-bin/blast-sgd.pl

- Paste your translated protein sequence into the box that says "Type or Paste a Query Sequence".
- At the first drop-down menu, choose blastp as the appropriate BLAST program.
- In the second box with options, choose "Open Reading Frames (DNA or Protein)".
- Leave everything else as the default, and push the button to Run WU-BLAST.

A new page will come up giving you your BLAST search results. At the top of the page will be a graphical interface depicting the most significant matches from the S. cerevisiae protein database. For a beginner, a conceptually simpler output can be found in the middle of the page, with results that look like:

The results above are for a blastp search using a protein sequence of 388 amino acids against the database containing the translation of all standard S. cerevisiae ORFs. You can see a list of many sequences that produce high-scoring segment pairs with your protein query sequence. What this list details is, for the first high-scoring segment above:

- the official ORF name in bold (YPR043W)
- the gene name (SMK1; this is also the name of the protein)
- the database ID (SGDID:S000006258)
- a brief description (Chr XVI from 666277-667443)
- the BLAST score, in arbitrary units (2053; note a larger number indicates a more significant match compared to a smaller number)
- the probability that your match was random (3.8e-214; the smaller the number, the less likely the match was random)

You can see that the results are sorted, with the highest-scoring match presented first. If you now click on the probability (also called the E value) above in blue (today it may be pink), you skip to the area on that page which aligns the amino acid sequence matches between your query sequence and the one obtained by BLAST.

Since you know that your DNA encodes an S. cerevisiae protein, your query should be an exact match with the protein identified by blastp. The other proteins on the list have some similarity to your query, but should not be an exact match. For the query above, the best match was the SMK1 gene, encoded by ORF number YPR043W. The alignment looks like this:

4. You should write down the ORF name and gene name of the best match with the protein sequence that you queried. Put this information on a new line BELOW where you pasted the protein sequence.

Step 3: Learn something about the protein you are researching.

Now that you know the name of the protein, use the yeast database to learn something about its characteristics. There are many ways to go about doing this. The simplest method is as follows:

Point your browser to:

www.yeastgenome.org

This is the home page of the Saccharomyces Genome Database (SGD), and it contains a great deal of information about Saccharomyces research. For example, if you were to click on the Virtual Library link on the left (under External Links), you would pull up a page that includes other links to information about yeasts, including Yeast information for the nonspecialist. If you were to click on the BLAST link on the left (under Analysis and Tools), you would arrive at the page you used previously to determine the name of the gene that encodes your protein.

At the top of the page is a Quick Search Box. Enter either the ORF number or the gene name and hit Submit. You should arrive at a summary page that has a lot of information about your gene. The top of the page will look something like this:

Spend some time scrolling around and clicking on the various links, including the tabs at the top of the page (e.g., Locus History, Literature, Phenotype) and the pull-down menus on the left-hand side.

To complete this laboratory exercise, figure out the following information about your protein using SGD, and put this information in the file you've been working on.

5. What chromosome is this gene on?

6. Provide the reference for a scientific paper that includes information about this gene.

7. What is the predicted molecular weight (MW) in Daltons (Da) for your protein?

For extra credit:

8. Does this gene have human homologs, and if so, give the name of one of the human homologs?

Print out your file with the sequence you were given and the answers to questions 1-7 (or 1-8, if you did the extra credit), and give it to your TA before you leave the lab.