

Bio4342 Exercise 1 Answers: Detecting and Interpreting Genetic Homology (Answers prepared by Wilson Leung)

Question 1: Low complexity DNA can be described as sequences that consist primarily of one or two of the four possible nucleotides. Due to its simple structure, we would expect to find low complexity DNA in the fast-annealing fraction of the genome in a Cot curve. Similarly, low complexity sequences tend to align with each other in a BLAST search. Therefore, while it may not have a highly repetitive structure, low complexity DNA causes the same problem as simple repeats and other repetitious elements: an increase in the number of spurious matches.

The generic BLAST algorithm looks first for short perfect matches and then attempts to extend the alignment in both directions. A blastn search compares a nucleotide query sequence with a nucleotide database. Removing low complexity DNA (and other repetitious elements) decreases the number of spurious matches in a blastn search. It can also significantly decrease the running time of the BLAST search, because there are fewer initial perfect matches that fail to produce significant alignments following the extension phase.

However, it is generally a bad idea to remove low complexity DNA from a sequence before running blastx. In a blastx search, a nucleotide query is translated in all six reading frames and compared against a protein database. While a region may be low complexity at the DNA level, it may be significant to the protein at the amino acid level. For example, proteins such as collagen have highly regular sequences that may appear to be low complexity at the DNA level but are essential for the protein to function properly. Other structural proteins, such as elastin, also have highly regular amino acid sequences (i.e., rich in glycine, valine, alanine, and proline).
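The idea of "primarily one or two out of the four possible nucleotides" in Question 1 can be made concrete with a composition measure. The sketch below scores sliding windows by Shannon entropy and flags those that fall below a cutoff. It is only an illustration: it is not the algorithm used by SEG or by any BLAST filter, and the function names, window size, and threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the base composition of seq."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq, window=20, threshold=1.0):
    """Return (start, end) spans of windows whose entropy is below threshold.

    The window size and threshold are illustrative values, not defaults
    taken from any real filtering program.
    """
    flagged = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            flagged.append((start, start + window))
    return flagged


poly_a = "A" * 20      # a single nucleotide: 0 bits per base
poly_at = "AT" * 10    # two nucleotides in equal amounts: 1 bit per base
mixed = "ACGT" * 5     # all four nucleotides in equal amounts: 2 bits per base
```

A sequence that uses all four bases evenly reaches the maximum of 2 bits per base, so a window dominated by one or two bases stands out with a low score.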
Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

Question 3: There are no known repeats in this 4500 bp sequence (lab1seq2.fna). We would generally not expect the same results from primates, since repetitive elements (particularly Alu) are more prevalent in primate genomes. It should also be noted, however, that transposon-free regions spanning more than 10 kb have been found in both the human and the mouse genomes.

Question 4: There are a total of 24 hits reported by blastx. The best E-value is 1e-148 while the worst E-value is 8.7. The matches with poor E-values were inconsistent with those with better E-values (i.e., 1e-148 and 6e-81). The two hits with the best E-values reported matches to the SWALLOW protein, while the remaining hits report matches to a diverse array of proteins. These poor matches range from the extensin precursor to the human collagen alpha I(IX) chain precursors.
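To put the E-values from Question 4 in context, the sketch below uses the standard relationship between a normalized bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m and n are the query and database lengths. The function name and the lengths are hypothetical, and real BLAST applies effective-length corrections that are ignored here.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score.

    Simplified form E = m * n * 2**(-S'), using raw (not effective) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 4500-base query against two database sizes.
small_db = expect_value(bit_score=40.0, query_len=4500, db_len=1_000_000)
large_db = expect_value(bit_score=40.0, query_len=4500, db_len=2_000_000)
```

With the score held fixed, doubling the database size doubles the E-value, which is why the same alignment can look more or less significant depending on what is searched.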

Question 5: The two most reliable matches, according to blastx, suggest that our sequence shows a high degree of similarity to the SWALLOW proteins in D. melanogaster and D. pseudoobscura. Furthermore, the small E-values for both matches indicate that the probability of obtaining alignments this good or better due to random chance is extremely low (assuming that the evolutionary model used by BLAST is correct).

The potential pitfalls of this interpretation are due to the inductive reasoning that is used in a BLAST search (comparison software -> similar sequence -> conservation -> negative selection -> conserved function). The first problem with our reasoning is in the second step (similar sequence -> conservation). We know that the Expect value (E-value) depends on the size of our search space and our scoring system. If an inappropriate scoring system is used, the BLAST search will produce significant hits with low E-values that are biologically meaningless. A more serious problem with the second step is that sequences may appear similar due to convergent evolution or simply to chance (improbable given the small E-values, but not impossible, in this case).

The third step in our chain of inference is conservation -> negative selection. However, sequences such as inactive transposable elements can remain conserved simply because there has been insufficient time or selective pressure for mutations to accumulate. Another problem with this inference is the possibility of a pseudogene, where our sequence no longer represents a functional copy of the SWALLOW gene but insufficient time has elapsed for the pseudogene to show significant divergence from the real SWALLOW gene. The final step (negative selection -> conserved function) is also problematic because negative selection can still produce similar proteins that have very different functions (examples include the different functions of adh1 and adh2 in yeast).
In addition, homologous sequences can be either orthologous or paralogous.

Question 6: The SWALLOW gene "has a role in localizing bicoid mRNA at the anterior margin of the oocyte during oogenesis, and a poorly characterized role in nuclear divisions in early embryogenesis" according to the Swissprot database. The BLAST output matches SWALLOW genes from two species: Drosophila melanogaster and Drosophila pseudoobscura. According to the GenBank records for SWA_DROME, we should cite Chao, Y.C., Donahue, K.M., Pokrywka, N.J. and Stephenson, E.C. as the group who first characterized this gene. We should cite Huang, Z., Pokrywka, N.J., Yoder, J.H. and Stephenson, E.C. as the individuals who characterized SWA_DROPS.

Question 7: The SWALLOW gene is in the opposite orientation relative to the query sequence.

Question 8: We note that the matches are to the following regions in the subject (SWA_DROME) sequence:

1-77, , , ,

This is most easily visualized by drawing the identified fragments on a map of the full-length protein. We know from our original blastx output that the SWA_DROME SWALLOW protein has a total length of 548 amino acids. Therefore, the entire protein is not matched. The region of the protein that is missing is at residues

Figure 1. Positions of HSPs relative to the SWA_DROME protein in blastx search (with SEG filter)

The basic unit of a BLAST output is the High-scoring Segment Pair (HSP). An HSP denotes the optimal local alignment in a region whose alignment score is above a certain threshold. There are multiple HSPs between our query and SWA_DROME that overlap each other in this BLAST alignment. Regions that showed two hits include , , and Regions that have three hits include (Figure 1). In this case, there has been a partial gene duplication event that leads to these overlapping HSPs. Overlapping HSPs can also occur because BLAST may overextend an alignment.

Question 9: According to the Swissprot database, the region has the sequence QEDEDDYDEDVD. This sequence is rich in both glutamic acid (E) and aspartic acid (D), with a few glutamine, tyrosine, and valine residues. Hence this region is dominated by a few highly charged amino acids. By default, NCBI BLAST automatically filters low complexity regions (using the program SEG for blastx). Given the repeated occurrence of E and D, this region may have been filtered prior to the alignment. To test this hypothesis, blastx is run with the filter turned off. The following hits are found: , , , 1-91, 46-91,

We find that when we turn off the SEG filter, we obtain an alignment to the entire protein (Figure 2).

Figure 2. blastx hits relative to the query sequence (with versus without filtering low complexity regions). Panels: blastx with SEG filter; blastx without SEG filter.
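The compositional bias described in Question 9 can be checked directly by counting residues in the quoted sequence. This is a minimal sketch; the function name is an assumption, and the calculation is a plain residue count, not the scoring that SEG itself performs.

```python
from collections import Counter


def acidic_fraction(peptide):
    """Fraction of residues that are glutamic acid (E) or aspartic acid (D)."""
    counts = Counter(peptide.upper())
    return (counts["E"] + counts["D"]) / len(peptide)


region = "QEDEDDYDEDVD"   # sequence quoted from Swissprot in Question 9
composition = Counter(region)
fraction = acidic_fraction(region)
```

For this 12-residue stretch the count gives 6 D and 3 E, so three quarters of the residues are acidic, which is consistent with the region being flagged as low complexity.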

Question 10: From the blastx results of Question 9, we note that amino acid position 91 represents a cutoff point for the various HSPs. We note that the reading frame for 1-91 is -3 while the reading frame for is -1. This change in reading frame suggests the presence of an intron in this region. In particular, the matches to the protein at 1-91 correspond to in our query sequence while the matches to the protein at correspond to Hence there may be a potential intron that spans from (178 bp).

The reason BLAST did not include residues in the protein alignment is that alignment to a masked base (X) still incurs a score penalty. (See Figure 13 of the Problem Set for an example of an alignment with masked bases.) Since BLAST is looking for optimal local alignments, extending the alignment to include the masked bases would simply lower its score. In other words, adding negative-scoring residue pairs to the end of an alignment produces a worse-scoring alignment, so the optimal local alignment is the one that does not include the masked bases.

Question 11: Two distinct features seem to be present in this locus (Figure 2). Region 1 spans from while region 2 spans from Based on the matches, region 2 would most likely be the true SWALLOW gene since we have matches to most of the full-length protein (Figure 3).

Figure 3. Hits (without SEG filter) in region 2 match to the full-length protein.

Region 1 could arise from a tandem duplication of the real gene. Since a functional copy is still present in the genome (in region 2), the sequences in region 1 can then mutate at a relatively neutral rate. Based on the number of sections that are missing in region 1, this region probably does not encode a functional copy of the SWALLOW protein (Figure 4).

Figure 4. Hits (without SEG filter) in region 1 do not match to full-length protein.
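The argument in Question 10, that extending an alignment into masked residues can only lower its score, can be illustrated with a toy extension step. The sketch below keeps the prefix of an extension with the highest cumulative score. The column scores are hypothetical (positive values for matched residues, -1 for each masked X column), and real BLAST uses an X-drop heuristic and a full substitution matrix, which are not modeled here.

```python
def best_extension(column_scores):
    """Return (length, score) of the highest-scoring prefix of an extension.

    An empty extension (length 0, score 0) is always allowed, so columns that
    only lower the running total are never included.
    """
    best_len, best_score, running = 0, 0, 0
    for length, score in enumerate(column_scores, start=1):
        running += score
        if running > best_score:
            best_len, best_score = length, running
    return best_len, best_score


# Hypothetical extension: three well-matched residues, then four masked columns.
into_mask = [5, 4, 6, -1, -1, -1, -1]
# Same extension, but with a strong match on the far side of the masked stretch.
through_mask = into_mask + [7]
```

In the first case the best extension stops after the third column, before the masked stretch. In the second case the match beyond the mask outweighs the accumulated penalty, so the extension continues through it, which is why masking splits some alignments but not others.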
Question 12: If our query has a repetitive element such as a transposon that is not masked, blastn would report many matches where the only region of similarity is the transposon sequence. It will be more difficult to identify biologically meaningful matches due to the increased noise. In other words, we will have many more false positives in our BLAST results if we forget to mask the repeats within the query sequence.

Question 13: The best refseq match (identified by ref in the accession number) is to the Drosophila melanogaster CG3429-PA (swa) mRNA (ref NM_ ). The hits that show very high sequence identity (100%) to the mRNA sequence map to region 2, the region that we have previously hypothesized to be the real SWALLOW gene in our query (Figure 5). The alignment indicates there are three exons and two introns in our homolog to the SWALLOW gene. The introns are at and in the query, corresponding to and in the protein coding sequence.

Figure 5. blastn alignments between the refseq hit and the query sequence.

Question 14: The refseq hits to region 2 extend to sequences that are outside of the protein matches on both sides of SWALLOW from the previous blastx search (4259 in blastx versus 4372 in blastn; 2347 in blastx versus 2091 in blastn). These extended regions may represent untranslated regions (UTRs) of the mature SWALLOW mRNA transcript. Since the blastx search uses only the amino acid sequence of the SWALLOW protein, it would not have revealed these UTRs. These UTRs can only be found when we search against the mRNA sequence at the nucleotide level using blastn.

We also notice there are some small discrepancies in the potential splice sites identified using blastx against Swissprot and blastn against nt. Since the refseq hit represents the full transcript, we would trust the mRNA more in terms of ascertaining the potential splice sites. In addition, when we examine the blastx alignment (Figure 2) from 2943 to 2347 in the query, we note that blastx might have extended the alignment too far, as there are a few stop codons at the beginning of the alignment (Figure 6).

Figure 6. blastx alignment may have extended too far in region 2

There are also additional matches, in much smaller fragments, that are clustered at the beginning of our query sequence. There are two main explanations that could account for these additional matches in blastn relative to blastx. First, blastx is more sensitive to insertions and deletions (indels) than blastn. An indel in a blastn alignment will only incur a small gap penalty. However, since blastx translates the query into amino acid sequences prior to the alignment, an indel can cause a frameshift mutation that leads to alignments with stop codons. A residue aligned with a stop codon is heavily penalized and will quickly terminate the alignment. Second, random nucleotide matches are more probable than random amino acid sequence matches. Assuming a uniform independent, identically distributed (IID) model, the probability of finding a random nucleotide match of length N is (1/4)^N. However, the probability of finding a random amino acid match of length N is approximately (1/20)^N. Hence we are more likely to detect local regions of similarity (spurious matches) when we compare nucleotide sequences.

Question 15:

Figure 7. Comparison of refseq hits to region 2 (top) versus region 1 (bottom) of query sequence

The blastn refseq hits to region 1 are significantly worse than the hits to region 2. We find multiple (approximately 7) gaps relative to the refseq mRNA sequence (Figure 7). We see some evidence of potential frameshift mutations in this region (a 7-base gap between and a 1-base gap at base 1768 relative to the subject sequence) (Figure 8).
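The random-match argument given for blastn versus blastx can be checked numerically. The sketch below evaluates (1/4)^N and (1/20)^N under the same uniform IID assumption stated in the text; the function name and the example length of 11 are choices made for illustration only.

```python
def random_match_probability(alphabet_size, length):
    """Probability that two random sequences match at `length` consecutive
    positions, assuming a uniform IID model over `alphabet_size` symbols."""
    return (1.0 / alphabet_size) ** length


n = 11  # illustrative match length
p_nucleotide = random_match_probability(4, n)    # (1/4)^N
p_amino_acid = random_match_probability(20, n)   # (1/20)^N
ratio = p_nucleotide / p_amino_acid              # equals 5^N
```

The ratio grows as 5^N, so even for short matches a chance nucleotide match is many orders of magnitude more likely than a chance amino acid match of the same length.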


Figure 8. Gaps in alignments of region 1

We also notice multiple gaps in the blastx alignment. The overall quality of the alignments is much worse in region 1 when compared to region 2. In addition, we notice there are three stop codons (*) in the HSP from 1499 to 456 of the query ( of subject) (Figure 9).

Figure 9. Presence of stop codons (*) indicative of pseudogene in region 1 of our query sequence

Based on the available data, we conclude that region 1 probably represents a pseudogene that is derived from the SWALLOW gene.

Question 16: I would annotate two features in this region: a gene in region 2 and a pseudogene in region 1. The gene in region 2 is probably orthologous to the SWALLOW gene in D. melanogaster and has 3 exons (2 introns). Potential UTRs and a more precise definition of the splice sites in region 2 can be determined using the mRNA blastn alignment. The blastn alignment suggests the presence of 7 bp and 1 bp gaps in region 1 relative to the mRNA sequence. These deletions could cause frameshift mutations. Furthermore, examination of the blastx alignment indicates the presence of stop codons in the reading frame of feature 1. Hence our evidence suggests feature 1 is a pseudogene derived from the SWALLOW gene.

Last Update: 06/15/2006
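The link drawn in Question 16 between the 7 bp and 1 bp gaps, frameshifts, and stop codons can be illustrated with a small translation sketch. A gap shifts the reading frame whenever its length is not a multiple of three. The DNA sequence below is hypothetical and was constructed so that a single-base deletion exposes a stop codon; it is not taken from the SWALLOW locus, and the function names are assumptions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate dna in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def causes_frameshift(gap_length):
    """A gap shifts the reading frame unless its length is a multiple of 3."""
    return gap_length % 3 != 0


intact = "ATGGTAACCGGT"              # hypothetical open reading frame
deleted = intact[:3] + intact[4:]    # same sequence with one base removed
intact_protein = translate(intact)
shifted_protein = translate(deleted)
```

Both gap lengths reported for region 1 (7 and 1) fail the multiple-of-three check, and in the toy example the one-base deletion turns the second codon into a stop (*), the same signature seen in the blastx alignment of region 1.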


Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]
BLAST Exercise: Detecting and Interpreting Genetic Homology
Adapted by W. Leung and SCR Elgin from Detecting and Interpreting Genetic Homology by Dr. J. Buhler
Prerequisites: None
Resources: The BLAST web