BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

Jeremy Buhler
March 15, 2004

In this lab, we'll annotate an interesting piece of the D. melanogaster genome. Along the way, you'll get some practice running command-line BLAST and reading its output. You'll also have to do some interpretation of the results to figure out what is going on.

To begin the lab:

1. Log into the BIO4342 server (goose.wustl.edu).
2. Create a working directory (e.g. lab1) and cd into it.
3. Copy the following files to your working directory:
   jbuhler/lab1seq1.fna
   jbuhler/lab1seq2.fna

The file lab1seq1.fna contains a FASTA-formatted DNA sequence, which represents roughly 100 kilobases from chromosome X of D. melanogaster. The file lab1seq2.fna contains a much smaller subsequence (about 4500 bases) from this region that you will use in Section 2.

Much of this lab consists of questions, which you should answer as you go. You can use your lab notebook or simply open a text file on your laptop to write your answers, so long as you save them somewhere. You should also write down the exact BLAST and RepeatMasker commands you use, so that you can refer to them later if there is any question about how you ran these programs.

1 Finding Interspersed Repeats

Before we go hunting for genes in a sequence, we should first annotate its repetitive elements using RepeatMasker. If you haven't yet done so, run RepeatMasker & more at the command line to see the program's range of possible options.

RepeatMasker can use any of several repeat libraries, depending on what kind of sequence you are annotating. Our sequence is from a fruit fly, so we'll use RepeatMasker's Drosophila repeat library (command line option -dr). Note that you must specify the library: the default is for primates, and you don't want to waste time looking for Alus in your fly sequence! Other options include -rod for rodent-specific repeats or -art for artiodactyl-specific repeats.

By default, RepeatMasker also masks out simple repeats (e.g. dinucleotides and trinucleotides), as well as some so-called low-complexity DNA. Low-complexity sequences may not have highly repetitive structure, but they consist primarily of one or two out of the four possible nucleotides.
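
To make the idea of low-complexity DNA concrete, here is a minimal Python sketch that measures how much of a sequence window is accounted for by its two most common nucleotides. This is only an illustration of the definition above; it is not the filter that RepeatMasker or BLAST actually applies, and the function names, the window size, and the 0.9 threshold are hypothetical choices made for this example.

```python
from collections import Counter
from typing import List


def top_two_fraction(seq: str) -> float:
    """Fraction of A/C/G/T bases in seq covered by its two most common nucleotides."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(2)) / total


def low_complexity_windows(seq: str, window: int = 50, threshold: float = 0.9) -> List[int]:
    """Start offsets of windows dominated by one or two nucleotides.

    window and threshold are illustrative values, not settings of any real tool.
    """
    return [
        start
        for start in range(0, len(seq) - window + 1)
        if top_two_fraction(seq[start:start + window]) >= threshold
    ]
```

A window made up only of A and T scores 1.0, while a window with all four nucleotides in equal amounts scores 0.5, so a threshold near 1 picks out the kind of sequence described above.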

Question: Why might it be a good idea to remove low-complexity DNA from a sequence before running blastn? Why might it be a bad idea to do so before running blastx? (Hint: consider proteins such as collagen with highly regular sequences.)

BLAST automatically does low-complexity filtering of DNA when appropriate, so we'll tell RepeatMasker not to do so by using its -nolow option. Now that we know what to do, let's do it:

    RepeatMasker -dr -nolow lab1seq1.fna

After you run this command, you should have several useful files:

- a copy of the original sequence with its repeats replaced by Ns, in lab1seq1.fna.masked;
- a summary of the repetitive elements found in the sequence, in lab1seq1.fna.tbl;
- a detailed list of repetitive elements found, in lab1seq1.fna.out.

Question: How many repetitive elements does your sequence contain, and what are their types?

In the next section, we will be working with your other sequence, lab1seq2.fna. Go ahead and run RepeatMasker on that sequence now with the same options as above.

Question: What is your result? Given the length of this sequence, would you expect the same result if it had come from, say, a primate?

2 BLASTX: the Gene Hunter

We now have a number of options for how to proceed. We could look for matches to our sequence at either the DNA or the protein level, using any one of several databases. In deciding which comparison tool to use to begin annotating our sequence, we should consider a few factors:

1. How sensitive will the comparison be? Is it likely to find genes or other meaningful features in our sequence?
2. How specific will the matches returned by our tool be? Will they cover the entire region (as might be the case for a Drosophila genomic clone), or will they be confined to specific features of interest?
3. How good is the information associated with any matches we may find? Will we be able to interpret those matches?
4. How long will the tool take to run?

Taking all these factors into consideration, a reasonable first analysis for any organism is to compare the DNA sequence to the Swissprot protein database using blastx. Although blastx is more expensive than most other types of BLAST search, it is both sensitive and specific to coding DNA and so should give us a good picture of potential genes in the sequence without a lot of other clutter.

We could increase our chances of seeing a match by searching against all proteins in Genbank (the so-called nr, or nonredundant, protein database). However, any matches we see in Swissprot will come with lots of information about the protein that matched, while the average quality of information in protein nr hits is often much lower.

N.B.: there are also specialized databases for fruit fly, in particular FlyBase. For the moment, we'll just use the generic databases, but feel free to explore FlyBase on your own.

Let's set up the BLAST command line for this search.

The program to be used is blastall.

- We want to perform a blastx search, so use the option -p blastx.
- We want to produce HTML output, so add the flag -T.
- The -i option specifies the query sequence, in this case lab1seq2.fna.
- The -d option specifies the database, in this case Swissprot. On our server, the path to this database is /db1/swissprot2/swiss.
- We should save the output of BLAST in a file. You can either use UNIX redirection or specify the -o option, followed by an output filename. The output file's name should end in .html so that the web server recognizes it as HTML.

Go ahead and run the BLAST search now, then copy your HTML output file to your public_html directory for viewing.

Question: How many BLAST hits to distinct sequences were returned? What are the best and worst E-values reported? Are the matches with poor E-values consistent with those with better E-values?

When searching a large database, it's good practice to ignore matches with poor E-values. In principle, a match with an E-value less than 1 is unlikely to occur by chance alone. However, you should allow a large margin of safety in interpreting E-values, mainly because the probabilistic model on which they are based is a crude approximation of real biological sequences. As a rule of thumb, you should be suspicious of matches with E-values higher than about 10^-10, and extremely suspicious of matches with E-values above [...]. By default, BLAST reports matches with E-values as high as 10, but you can change this default using the -e option. For example, adding -e 1e-5 to the BLAST command line discards any matches with E-value greater than 10^-5.

Question: Considering only the most reliable matches, what does BLAST say about the content of this sequence? What caveats might you consider in interpreting these results?

3 Interpreting the BLASTX Output

If all went well, you should now have strong BLAST hits to the Swallow protein. So, what do you think?
Does this sequence contain the melanogaster ortholog of Swallow? Is that all it contains? We need to investigate further before deciding how to annotate the sequence.

You can find out more about Swallow on the web. A good place to start is to use information from the Swissprot database, which is hand-curated and has links to many other databases. To access the information for a protein, you need its Swissprot accession string, which is found in the BLAST output and looks something like SWA_DROME. A Swissprot accession string consists of an abbreviated gene name, followed by an abbreviation indicating which organism the particular protein in this entry came from.

To access a Swissprot entry by its accession, go to the Expasy web site (or its U.S. mirror) and enter the accession in the search dialog at the top of the page. If you like, you can also do a keyword search, e.g. Swallow, to find multiple related entries.
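
The two-part structure of a Swissprot accession string can be illustrated with a short Python sketch. It assumes that the gene and organism abbreviations are joined by a single underscore, as in SWA_DROME; the function name split_entry_name is hypothetical and is not part of BLAST or of any Swissprot tool.

```python
from typing import Tuple


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split a string like 'SWA_DROME' into (gene abbreviation, organism abbreviation).

    Assumes exactly one underscore separates the two parts.
    """
    gene, separator, organism = entry_name.strip().partition("_")
    if not separator or not gene or not organism or "_" in organism:
        raise ValueError(f"not in GENE_ORGANISM form: {entry_name!r}")
    return gene, organism
```

Grouping your BLAST hits by the organism part is a quick way to see which species contributed matches to the same gene.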

Question: What does Swallow do? Does your BLAST output match Swallow genes from more than one species, and if so, which species? If you want to talk about this gene in your own work, whom should you cite as having discovered it? (Hint: look at the GenBank record.)

Now that we know a bit more about the candidate matches to our gene, let's take a closer look at the BLAST output. To produce an annotation, we need to verify that the query sequence really does contain the D. melanogaster Swallow gene. In particular, the match should be full-length, including all the coding exons of the gene.

Question: Which orientation is the Swallow gene in relative to your query sequence?

Question: Looking over all the matches to SWA_DROME in your BLAST output, is the entire protein matched? If not, which residues are missing? Are any regions of the protein matched more than once at different places in the query sequence?

You should see a considerable amount of confusion in this BLAST output: missing residues, duplicated residues, etc. As an annotator, your job is to produce order from this chaos. Let's start with the missing residues. Go back to the Swissprot entry for SWA_DROME and find the part of the protein that is not represented by any BLAST hits.

Question: Which amino acids predominate in the missing region? Given that blastx likes to mask low-complexity sequence in the query before a search, do you have a reasonable explanation for why this part of the protein is missing?

Around position 190 of the protein, you will see a string of X's representing masked residues in the query. BLAST apparently decided that the protein in that region was a little too serine/asparagine-rich and so marked it low-complexity. Aligning a residue to an X yields a negative score.

Question: Given that BLAST seems happy enough to include masked residues in its alignments, why didn't it include the missing residues of the protein? (Hint: look at the frames of the matches ending at 77 and beginning at 91.
What happens if you add negative-scoring residue pairs to the end of an alignment?)

Now we need to deal with the duplicated matches. The best way to make sense of the output is to sketch out the relative positions of all the matches to SWA_DROME in the query on a piece of paper. Note which residues of the protein match each part of the query.

Question: How many distinct features seem to be present at this locus? Which one seems most likely to be the true Swallow gene? What might the other matches be, and what biological mechanism might have produced them?

4 Further Exploration at the Genomic Level

To make further progress in determining the right annotation of the sequence, we will pull in additional evidence from the DNA level. To do this, we will use nucleotide sequences derived from Drosophila mRNAs. To detect DNA-to-DNA matches, we'll use blastn instead of blastx.

Our choices for a database against which to test our query include the Genbank nonredundant nucleotide database (also known as nt, not to be confused with the protein nr database), or one of a few EST databases. ESTs are pretty noisy and don't come with easily accessible annotations, so we'll use the nt database. On our server, the path to this database is /db1/nt/nt.

Modify your BLAST search to do blastn against the nucleotide database, and run the modified search against your query. Copy the output to your public_html directory for viewing.

Question: Had your query contained a repetitive element such as a transposon, what would have happened if you had forgotten to repeat-mask the query before running it?

The sequences in the Genbank nucleotide database come from numerous sources, including genomic contigs from genome sequencing projects and mRNAs/cDNAs. A particularly useful class of mRNA entries are the NCBI Refseqs, which come from a curated database of full-length mRNAs for various genes. You can find out more about the Refseq database on the NCBI web site. Refseq matches are recognizable in the BLAST output because they start with the string ref.

Question: What is the best Refseq match to the query? How good is the match to what you think is the true Swallow gene? Based on this alignment, how many exons does the gene have, and roughly where do the introns occur?

Question: How well does the Refseq match the other part of the query? Can you see matches that were not visible at the protein level? Why might this be?

The main question at this point is whether the other set of matches to Swallow outside the likely ortholog indicates a real gene or a pseudogene. Pseudogenes are pretty rare in Drosophila compared to mammals, but they are not unknown. There are at least two types of mutation that strongly suggest that a putative match to a gene might be a pseudogene. One is a stop codon that truncates the protein prematurely in the middle of a coding exon. You can see such internal stop codons in a blastx alignment as star (*) characters.
Another diagnostic mutation is a gap in the middle of an exon that would cause a frameshift. Typically, such gaps are visible only at the DNA level, since a frameshift will terminate a blastx alignment.

Question: Keeping in mind the exon boundaries you inferred above, can you find evidence of premature stop codons and/or frameshift-inducing gaps that would cause you to diagnose a pseudogene adjacent to the Swallow gene? Describe any evidence you find.

5 Summary

Question: Based on all the evidence gathered in this lab, how would you annotate the query sequence? What uncertainties remain? Compose a short paragraph (a few sentences) that you could add to an annotation database summarizing your findings.
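
As a supplement to the pseudogene discussion in Section 4, the two diagnostic signs described there can be sketched in Python. The sketch assumes you have copied the alignment rows out of the BLAST output by hand as plain strings, with '*' marking stop codons in a translated row and '-' marking gaps in a DNA alignment. The function names are hypothetical, and the code does not know where exons begin and end, so you still have to check each reported position against the exon boundaries you inferred.

```python
import re
from typing import List, Tuple


def internal_stop_positions(translated_query: str) -> List[int]:
    """0-based positions of '*' characters that occur before the last character."""
    return [i for i, residue in enumerate(translated_query[:-1]) if residue == "*"]


def frameshift_gaps(aligned_query: str, aligned_subject: str) -> List[Tuple[str, int, int]]:
    """Gap runs whose length is not a multiple of 3 in a pairwise DNA alignment.

    Returns (row name, 0-based start column, gap length) for each such run.
    Whether a gap actually lies inside an exon must be checked separately.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have equal length")
    findings = []
    for name, row in (("query", aligned_query), ("subject", aligned_subject)):
        for match in re.finditer(r"-+", row):
            length = match.end() - match.start()
            if length % 3 != 0:
                findings.append((name, match.start(), length))
    return findings
```

A gap of three bases removes or adds one whole codon and leaves the reading frame intact, which is why only gap lengths that are not multiples of 3 are reported.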