Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M.

Size: px

Start display at page:

Download "Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M."

Bertram Lambert
5 years ago
Views:

1 Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M. Brent Prerequisites: A Simple Introduction to NCBI BLAST Resources: The GENSCAN web server is available at The NCBI BLAST web server is available at The UCSC Genome Browser is available at Files for this Tutorial: All the required files are compressed into a single archive: genespseudogenes.tar.gz After the download is complete, double-click the file to expand the archive on Mac OS X. On MS Windows, you can expand the archive using the freeware 7-zip ( Once you expand the archive, the folder genespseudogenespackage contains all the files required for this tutorial. The file chimp_sequence.fasta contains a 19kb sequence from the chimpanzee genome we will analyze in this tutorial. The file chimp_sequence.fasta.masked contains the repeat-masked version of the same sequence. chimp_genscan_predictions.html contains the GENSCAN predictions for our sequence. genscan_peptide1_nr_blp.html and genscan_peptide2_nr_blp.html contains the results of the BLAST searches. Introduction: Recent technological advances have dramatically increased the rate at which we can generate high-quality DNA sequence. However, characterization of features (e.g. genes, repetitive elements, etc) found within these raw sequences remains a slow and labor-intensive process. Various computational methods have been devised to improve the efficiency of sequence annotation. In this tutorial, we will focus on the identification of genes and pseudogenes in a chimpanzee sequence. We will use the programs GENSCAN and BLAST, along with the UCSC Genome Browser to facilitate the annotation process. GENSCAN is a program that predicts the locations of genes in DNA sequences. While potential genes identified by GENSCAN must still be experimentally confirmed, its predictions can nonetheless narrow the scope of the investigation by identifying regions where genes are most likely to be found. GENSCAN uses a probabilistic model to predict locations of promoters, exons, and polyadenylation signals based solely on the input genomic sequence. Discussion of the actual implementation of GENSCAN is beyond the scope of this tutorial. For additional information on how GENSCAN works, please see the paper by Burge et al 1, or visit the GENSCAN website at 1 Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268,

2 Analyzing Our Chimpanzee Sequence with GENSCAN In this lab, we will analyze a 19kb sequence from the chimpanzee genome (chimp_sequence.fasta). Since transposable elements may lead to false GENSCAN predictions (due to the presence of gene-like features such as transposase and reverse-transcriptase), we have masked the interspersed repeats in our sequence using the program RepeatMasker. We will run GENSCAN using this repeat-masked version (chimp_sequence.fasta.masked) of the sequence. Open a new web browser window and navigate to the GENSCAN web server at (Figure 1). Figure 1. The GENSCAN web server at We will customize our analysis by changing the following parameters (Figure 2): 1. Under Organism, select Vertebrate to use the vertebrate prediction parameters 2. Under Sequence name, enter Chimp unknown sequence 3. Under Print options, select Predicted CDS and peptides to display both the Coding DNA Sequence (CDS) and the peptide sequence for each GENSCAN prediction 4. Click Browse and select the repeat-masked sequence chimp_sequence.fasta.masked under the Upload your DNA sequence file field 5. Click Run GENSCAN Alternatively, for class purposes, the GENSCAN prediction is available in the file chimp_genscan_predictions.html. The graphical overview in PDF and postscript formats is available in the files chimp_genscan_predictions.pdf and chimp_genscan_predictions.ps, respectively. 2

Figure 2. Configuring GENSCAN to analyze the unknown chimpanzee sequence Overview of the GENSCAN Results If all goes well, you should see a new web page with the GENSCAN output.

In this case, we see that our unknown_sequence has a total length of 18,773 base pairs with 46.29 CG (Figure 3a).

3 Figure 2. Configuring GENSCAN to analyze the unknown chimpanzee sequence Overview of the GENSCAN Results If all goes well, you should see a new web page with the GENSCAN output. There are four major sections in the GENSCAN output. The GENSCAN output begins with a summary of the program version, input sequence and parameters used to generate the prediction. In this case, we see that our unknown_sequence has a total length of 18,773 base pairs with CG (Figure 3a). Under the Parameter matrix section, we see that the human parameters were used to generate the prediction. Figure 3a. Basic information on the input sequence and the parameters used to generate the GENSCAN output The second part of the GENSCAN output is a table that summarizes the exon coordinates for all the predicted genes within our unknown sequence (Figure 3b). There are also links to images that demarcate the locations of the predicted genes relative to our chimpanzee sequence. Figure 3b. Table listing the exon coordinates for all the GENSCAN predicted genes 3

4 Each row of the table describes the type and location of the feature GENSCAN has identified. Understanding the first six columns of this table is essential to properly interpreting the GENSCAN output. Brief explanations of the first six columns of the table are summarized below. Explanations for all the columns in the table are available at the end of the GENSCAN output. Column Explanation Gn.Ex Unique identifier for the predicted exon. For example: 1.01 denotes the first exon of the first gene prediction Type S Begin End Len There are six types of features: 1. Init = Initial exon 2. Intr = Internal exon 3. Term = Terminal exon 4. Sngl = Single exon gene 5. Prom = Promoter 6. PlyA = Poly-A signal Orientation of the predicted gene relative to the input sequence: + means the predicted gene is in the same orientation - means the predicted gene is in the opposite orientation Coordinate for the start of the feature relative to the input sequence. (Note that if the feature is in the negative strand, the coordinate for Begin is greater than End) Coordinate for the end of the feature relative to the input sequence. (Note that if the feature is in the negative strand, the coordinate for End is less than Begin) Length of the feature Unlike other gene finders (e.g. NSCAN), GENSCAN does not have a separate model for predicting untranslated regions. Hence GENSCAN predictions for the promoter and poly-a regions tend to be less reliable than its predictions for coding exons. Therefore, in general, we will ignore the promoter and poly-a predictions and instead rely on alignments with mrna and Expressed Sequence Tags (ESTs) to annotate these untranslated regions. According to the summary table, GENSCAN predicted two genes in our input sequence: a 3- exon gene that spans from and a 2-exon gene that spans from of our input sequence. We can see an illustration of where the predicted genes are located in our unknown sequence by clicking on the link Click here to view a PDF image of the predicted gene(s) (Figure 3c). 4

5 Figure 3c. Graphical overview of the GENSCAN predictions relative to our unknown sequence The third part of the Genscan output consists of the peptide (amino acid) and coding (nucleotide) sequences for each of the predicted genes (Figure 3d). We will use these sequences to evaluate the validity of the GENSCAN predictions. Figure 3d. Peptide and coding DNA sequences for the first GENSCAN gene prediction 5

The last part of the GENSCAN output consists of explanations for each column in the gene prediction table and an explanation on how GENSCAN calculates the score and probability for each predicted

6 The last part of the GENSCAN output consists of explanations for each column in the gene prediction table and an explanation on how GENSCAN calculates the score and probability for each predicted feature (Figure 3e). Figure 3e. Explanations of all the columns in the gene prediction and details on how GENSCAN calculates the statistical significance of a feature Analysis of Predicted Gene 1: Three exons, 301 amino acids, 906 coding bases blastp of the Predicted Peptide, Searching the non-redundant Protein Database (nr) Now that we have a basic understanding of the GENSCAN output, we would like to determine if the two predictions correspond to real genes in the chimpanzee genome. To get an idea on what the first predicted gene could be, we will run a protein-protein BLAST (blastp) search against the NCBI non-redundant protein database (nr) using the default parameters. The nr database consists of all known and hypothetical protein sequences that have been submitted to GenBank, with groups of essentially identical sequences reduced to a single sequence (hence the term nonredundant ). While BLAST hits produced with the nr database may contain relatively little information to help us interpret them, the names of these hits and the corresponding literature references should nonetheless provide us with a reasonable starting point. The key question we wish to address with our BLAST searches is whether the GENSCAN gene prediction is most likely a gene or a pseudogene. If our predicted only has a single exon (only about 7% of known human genes have a single exon), we would need to be particularly vigilant for signs that the predicted gene is either a retrotransposed pseudogene or a mispredicted fragment of a multi-exon gene. It would not be terribly surprising if GENSCAN missed an exon or split two exons of the same transcript into separate gene predictions. In this case, our first predicted gene has three exons so it is less likely for it to be a pseudogene. 6

Figure 4. Click on protein blast in the main NCBI BLAST web page 1. Select the peptide sequence for the first prediction and copy it onto the clipboard. 2.

Paste the contents of the clipboard into the textbox labeled Enter Query Sequence 4.

7 Figure 4. Click on protein blast in the main NCBI BLAST web page 1. Select the peptide sequence for the first prediction and copy it onto the clipboard. 2. Open a new browser window and navigate to the NCBI BLAST server at ( ), then click on protein blast (Figure 4) 3. Paste the contents of the clipboard into the textbox labeled Enter Query Sequence 4. In the Choose Search Set section, make sure to set the Database field to Nonredundant protein sequences (nr). 5. In the Program Selection section, make sure that blastp (protein-protein BLAST) is selected. Click BLAST (Figure 5). Figure 5. Configuring the blastp search of the first GENSCAN prediction against the nr database 7

Our predicted protein contains a putative conserved domain called Phd_like_Phd that belongs to the TRX superfamily (Figure 6).

8 Alternatively, for class purposes, the BLAST results are available in the file genscan_peptide1_nr_blp.html in the tutorial package. The initial response page from NCBI BLAST shows the results of a search of our predicted protein against the NCBI Conserved Domain Database (CDD). Our predicted protein contains a putative conserved domain called Phd_like_Phd that belongs to the TRX superfamily (Figure 6). The presence of a conserved domain in the predicted protein lends credence to the hypothesis that GENSCAN has identified a putative gene or a pseudogene derived from a real gene in our chimp sequence. As an optional exercise, we can learn more about the Phd_like_Phd domain by clicking on the conserved domain image. Figure 6. NCBI Conserved Domain Search identified a Phd_like_Phd domain in the predicted peptide sequence Once blastp search is complete, a new page will appear displaying all the hits to the nr database. From the graphical overview, we see that there are many full-length high quality hits to sequences in the nr database (Figure 7). Figure 6. Many high quality matches between the predicted protein and sequences in the nr database 8

Scroll down to the table with the list of sequences that produce significant alignments with our predicted protein (Figure 7). There are many hits that are highly significant (e.g. with small E- values).

9 Scroll down to the table with the list of sequences that produce significant alignments with our predicted protein (Figure 7). There are many hits that are highly significant (e.g. with small E- values). However, since the sequences in the nr database come from a variety of sources, we would ideally like to find matches to sequences that have been experimentally confirmed. Accession Number Figure 7. Click on the accession number to retrieve the NCBI record and click on the score to see the alignments We can use the accession number of the matches to quickly assess the reliability of the sequence records. Accession numbers with the prefixes gb, emb or dbj are sequences directly submitted by research scientists so the quality of these submissions will vary. Accession numbers with the prefix ref are sequences that comes from the NCBI RefSeq Database. These RefSeq sequences are curated by the staff at NCBI and are generally much better annotated than a typical GenBank entry. There are two major classes of protein RefSeq hits: sequences that are supported by experimental evidence have the prefix ref NP, while computational predictions without experimental support have the prefix ref XP. Generally, we do not want to rely on (possibly incorrect) computational predictions alone for any part of our annotation; instead we will use matches that are based on more direct evidence (e.g. those with the prefix ref NP). Furthermore, since we are annotating a chimpanzee sequence, we will annotate our feature based on known human genes. The fourth hit shows a RefSeq match to the human phosducin-like protein. Click on the score corresponding to this hit in the table summary to jump to the alignment (Figure 8). 9

Figure 8. BLAST alignment for the human phosducin-like protein (subject) against our predicted protein (query). Note that there are several records of this sequence in the nr database.

10 Figure 8. BLAST alignment for the human phosducin-like protein (subject) against our predicted protein (query). Note that there are several records of this sequence in the nr database. Looking at the top of the alignment, we see that the human version of the phosducin-like protein (PHLP) has a length of 301 amino acids. This is the same length as the gene predicted by GENSCAN, so we have a full-length match. Furthermore, we see from the alignment that the predicted protein from chimpanzee has high sequence identity (98%) and positives (99%) compared to the human protein. Since we expect human and chimp orthologs to have a protein sequence identity of approximately 98%, the evidence we have gathered thus far suggests we have identified the ortholog of the PHLP gene in our chimpanzee sequence. In order to create a gene model in our chimpanzee sequence, we can utilize the exon-by-exon annotation strategy (described in the tutorial A Simple Introduction to BLAST ) to identify the locations of the exons and the putative donor and acceptor splice sites. Creating the complete gene model in the chimpanzee sequence is left as an exercise for the reader. Analysis of Predicted Gene 2: Two exons, 230 amino acids, 693 coding bases blastp of the Second Predicted Peptide, Searching the non-redundant Protein Database (nr) Our previous analysis suggests the first GENSCAN prediction is probably an ortholog of the human PHLP protein. We will now analyze the second GENSCAN prediction to determine if it is a gene or a pseudogene. Just like the first gene prediction, we will begin our analysis with a blastp search against the nr database: 10

1. Navigate back to the browser window with the GENSCAN output. Select the peptide sequence for the second gene prediction and copy it onto the clipboard. 2.

11 1. Navigate back to the browser window with the GENSCAN output. Select the peptide sequence for the second gene prediction and copy it onto the clipboard. 2. Open a new browser window and navigate to the NCBI BLAST server at ( then click on protein blast 3. Paste the contents of the clipboard into the textbox labeled Enter Query Sequence 4. In the Choose Search Set section, make sure to set the Database field to Nonredundant protein sequences (nr) 5. In the Program Selection section, make sure that blastp (protein-protein BLAST) is selected. Click BLAST (Figure 9). Figure 9. Configuring blastp to search the second gene prediction against the nr database Alternatively, for class purposes, the BLAST results are available in the file genscan_peptide2_nr_blp.html in the tutorial package. The initial conserved domain search from the NCBI BLAST service indicates our predicted protein contains part of a conserved filament domain (Figure 10). The partial match to a known domain suggests GENSCAN may have predicted only part of the real gene in our chimpanzee sequence. Figure 10. CDD search identifies part of the filament domain in our second predicted protein 11

High quality matches between sequences in the nr database and our second predicted gene Scroll down to the table with the list of sequences that produce significant alignments.

12 Once the blastp search is complete, a new page will appear displaying all the hits to the nr database. From the graphical overview, we again see that there are many full-length high quality hits to sequences in the nr database (Figure 11). Figure 11. High quality matches between sequences in the nr database and our second predicted gene Scroll down to the table with the list of sequences that produce significant alignments. We will use the human keratin 18 RefSeq hit to annotate our feature. Click on the score corresponding to the fourth blastp hit (ref NP_ ) in the table summary to jump to the alignment section for the human keratin 18 RefSeq (Figure 12). Figure 12. Click on the score corresponding to the human keratin 18 RefSeq hit to see the alignment 12

13 Figure 13. BLAST alignment for the human keratin 18 protein (subject) against our predicted protein (query) shows only a partial match to the human protein Look at the top of the alignment for the human keratin 18 RefSeq hit (Figure 13), the full keratin 18 protein in human has a length of 430 amino acids. However, our predicted gene only has 230 amino acids. There are three potential explanations that could account for this observation. First, GENSCAN may have simply missed part of the real chimp protein. This scenario could happen if GENSCAN failed to identify one or more exons, or called them as part of another gene. We should also consider the possibility that, since the final chimpanzee sequence is not finished to high quality, there may be gaps in the assembly. An alternative hypothesis is that the predicted gene may be a pseudogene that has acquired an in-frame stop codon. The stop codon would cause the GENSCAN prediction to end prematurely. A third hypothesis is that GENSCAN has accurately predicted a protein closely related to the human keratin 18 protein but lacks some part of it, such as one or more functional domains. Based on the alignment, it appears that the GENSCAN prediction matches with the end of the keratin 18 protein. We also notice there are 11 gaps in the aligned region and the sequence homology (79% Identities and 85% Positives) is lower than what we would expect if this were an ortholog of the human keratin 18 gene. Therefore, this prediction is either a paralog of the human keratin 18 protein or it is a pseudogene derived from keratin 18. To determine which of the two hypotheses is the more likely explanation for our predicted gene, we need to learn more about the keratin 18 gene in human. Scroll back up to the table listing all the sequences that produce significant alignments with our predicted protein. Click on the G icon next to the human keratin 18 RefSeq hit (ref NP_ ) (Figure 14). Figure 14. Click on the G icon to retrieve the Entrez Gene record for the human keratin 18 gene 13

14 Figure 15. The Entrez Gene record for the human keratin 18 gene shows that it has seven coding exons. The Entrez Gene record provides us with lots of information about the human keratin 18 (KRT18) gene. In the summary section, we see that KRT18 belongs to the intermediate filament gene family and mutations in this gene have been linked to the disease cryptogenic cirrhosis. In the Genomic regions, transcripts, and products section, we see that the KRT18 genes has two isoforms. Interestingly, unlike our gene prediction that only has two coding exons, both isoforms of KRT18 have seven coding exons. In the Genomic context section, we noted that the gene is found in chromosome 12 in humans and that there is a predicted protein in the opposite strand of KRT18 called LOC However, this locus has not yet been experimentally confirmed (e.g. accession number for this locus has the prefix ref XM). We also found out that KRT8 and eif4b are the genes closest to KRT18 in the human genome. To summarize our findings so far, we learned that our predicted protein shows strong sequence similarity to the human KRT18 protein but it is missing several exons compared to the human protein. We need to decide among three possible explanations for the truncated gene model: GENSCAN created an inaccurate model of the gene, the predicted gene is a pseudogene, or the predicted gene is a paralog that lacks certain functional domains found in the human KRT18. 14

Analyze the Predicted Gene using the UCSC Genome Browser In order to gather additional evidence to determine which of the three hypotheses is the most likely explanation for our predicted gene, we

ucsc.edu) created by the University of California Santa Cruz Genome Bioinformatics Group allow us to quickly navigate to specific regions of a genome and see all the available annotations,

15 Analyze the Predicted Gene using the UCSC Genome Browser In order to gather additional evidence to determine which of the three hypotheses is the most likely explanation for our predicted gene, we need an easy way to visualize a region of the genome in conjunction with all the available computational and biological evidence. The UCSC Genome Browser and the BLAT search service ( created by the University of California Santa Cruz Genome Bioinformatics Group allow us to quickly navigate to specific regions of a genome and see all the available annotations, alignments, and computational predictions for that region. We will use BLAT to identify regions in the human genome that shows sequence homology to our predicted gene and then use the genome browser to examine these regions more closely. Figure 16. Click on BLAT to access the BLAT search service from the UCSC Genome Bioinformatics page 1. Navigate back to the browser window with the GENSCAN output. Select the peptide sequence for the second gene prediction and copy it onto the clipboard. 2. Open a new browser window and navigate to the UCSC Genome Browser at ( Click on BLAT on the left navigation bar (Figure 16). 3. Change the following parameters on the BLAT search page: a. Under Genome, select Human b. Under Assembly, select Mar c. Under Query type, select protein 4. Paste the contents of the clipboard into the textbox 5. Click submit (Figure 17) Figure 17. Configuring BLAT to search our predicted protein against the human genome 15

Figure 18. BLAT search results for the second GENSCAN gene prediction against the human genome Each row in the BLAT Search Results table is a unique hit in the human genome.

16 Figure 18. BLAT search results for the second GENSCAN gene prediction against the human genome Each row in the BLAT Search Results table is a unique hit in the human genome. The BLAT results show that our predicted protein matches with numerous loci in the human genome, which is not surprising since BLAST has previously identified a conserved domain in our predicted gene (Figure 18). The top hit shows a sequence identity of 99.2%, over almost the entire length of the query sequence (3-230). This is within the range of percent sequence identity we expect for a pair of human and chimpanzee orthologs. The next best match has an identity of only 83.9 %, which is significantly below what we expect for a human and chimp orthologs. Click on the browser link for the top hit (on chromosome 9), and the new page will show your sequence as a blue bar under the title Your Sequence from Blat Search. Aligned regions are shown as blocks and lines are used to connect multiple aligned regions. We can change the display options by clicking on the configure button. For additional information on the available configuration options, please refer to the UCSC Browser training materials available at One of the key aspects to using a genome browser effectively is to understand how to modify the browser settings to display only the evidence tracks we wish to see. Most tracks have five different display modes: Display Mode hide pack squish pack full Explanation The track is not displayed in the genome browser view All features (including overlapping features) are collapsed into a single line Each feature is shown separately, but at half the height of the full mode Overlapping features are shown at separate lines, non-overlapping features are shown on the same line Each feature is displayed on a separate line In this case, we will turn on some of the gene prediction and alignment evidence tracks. 16

Sequence Adjust the following tracks to dense mode: Under Gene and Gene Prediction Tracks : UCSC Genes, RefSeq Genes, Ensembl Genes, N-SCAN, SGP Genes, Genscan Genes (Figure 19).

17 Figure 19. Configure the track display options for the UCSC Genome Browser Reinitialize the browser by clicking on the hide all button. Scroll down to the track configuration section and change the display mode for the evidence tracks to the following: Adjust the following track to pack mode: Under Mapping and Sequencing Track: Blat Sequence Adjust the following tracks to dense mode: Under Gene and Gene Prediction Tracks : UCSC Genes, RefSeq Genes, Ensembl Genes, N-SCAN, SGP Genes, Genscan Genes (Figure 19). Under mrna and EST Tracks : Human mrnas, Spliced ESTs, Human ESTs, Other ESTs. Under Variations and Repeats : RepeatMasker Now hit refresh and look at the new image. Zoom out 3X to get a broader view of the region (Figure 20). Figure 20. UCSC genome browser for our region of interest We can see from the genome browser view that there are no known genes in this region according to the UCSC Gene annotation, human RefSeqs and other curated gene lists. The only evidence of a gene-like feature comes from hypothetical gene predictions by programs like SGP Gene and GENSCAN. Both programs predicted a four-exon gene in this region instead of the 17

two-exon gene prediction in our chimp sequence. Interestingly, more sophisticated gene predictors, such as N-SCAN and the Ensembl gene pipeline, do not predict a gene in this region.

To see the EST evidence, change the display mode for the Human ESTs to pack and click Refresh (Figure 21). Figure 21. Five unspliced human ESTs are aligned to this region.

18 two-exon gene prediction in our chimp sequence. Interestingly, more sophisticated gene predictors, such as N-SCAN and the Ensembl gene pipeline, do not predict a gene in this region. We also see some unspliced human ESTs (Expressed Sequence Tags) that are aligned to this region. We should investigate these matches further. To see the EST evidence, change the display mode for the Human ESTs to pack and click Refresh (Figure 21). Figure 21. Five unspliced human ESTs are aligned to this region. These five EST alignments may indicate that a transcript is generated at this locus. However, since our gene prediction contains a conserved filament domain, these short EST matches could also be spurious alignments caused by the conserved domain. The fact that there are no spliced ESTs aligned to this region lends credence to the hypothesis that the unspliced EST matches are simply spurious alignments. Analysis of the Second Gene Prediction using the human KRT18 Protein Based on the evidence we have collected thus far, we should be highly skeptical of the hypothesis that GENSCAN have simply missed part of the real chimp protein in its prediction. We will now evaluate the remaining two hypotheses: the GENSCAN prediction is a pseudogene or the prediction corresponds to a paralog of the KRT18 gene. To determine which of the two hypotheses is more likely, we will use the human KRT18 protein and search it against the human genome to identify all the regions of the human genome that shows sequence homology to this protein. Go back to the blastp result for our second predicted peptide searching against the nr database. Scroll back up to the table listing all the sequences that produce significant alignments with our predicted protein and click on the link ref NP_ to retrieve the GenBank record for this sequence (Figure 22). 18

Change the Display option to FASTA to obtain the protein sequence for KRT18 Under Display, select

Open a new browser window and go back to the BLAT search page.

19 Figure 22. Retrieve the GenBank Record for the human KRT18 protein through the blastp output Figure 23. Change the Display option to FASTA to obtain the protein sequence for KRT18 Under Display, select FASTA (Figure 23). Copy the sequence onto the clipboard. Open a new browser window and go back to the BLAT search page. Paste the sequence into the textbox. Make sure you have selected Human under Genome, selected Mar under Assembly, and selected protein under Query type (Figure 24). Click submit. Figure 24. BLAT search against the human genome using the human KRT18 protein 19

Looking at the BLAT search results, the top hit is the perfect match to chromosome 12 (Figure 25).

20 Figure 25. BLAT search results using the human KRT18 protein shows perfect match to chromosome 12 Figure 26. Alignment between the region we have previously identified on chromosome 9 and the human KRT18 protein shows only 83.1% sequence identity. Looking at the BLAT search results, the top hit is the perfect match to chromosome 12 (Figure 25). This is consistent with the location of the human KRT18 gene we have previously obtained from the Entrez Gene Record. We also notice that the match to human chromosome 9 we have initially identified appeared near the bottom of the list (Figure 26). We previously suspected that our predicted protein is a paralog (arise through a duplication event) of KRT18, so a sequence identity of 83.1% is actually quite good. However, we also see that the alignment to this region actually include all 430 amino acids instead of only to the 230 amino acids in our predicted protein. To understand why GENSCAN would predict a shorter gene in our chimpanzee sequence, we need to examine the region further with the genome browser. Click on the browser link that corresponds to the hit on human chromosome 9 (chr9: ). Zoom out 3X to get a better view of the region. We notice there are four alignment blocks between the human KRT18 gene and this region of chromosome 9. These alignment blocks appeared to be relatively close to each other and are unlikely to correspond to individual exons. Furthermore, we notice that the matches appear to completely overlap the GENSCAN and SGP gene predictions but also extend beyond it in some of the exons (Figure 27). To determine why the two gene finders would predict a shorter gene, we need to examine the alignment. 20

Figure 27. BLAT alignment of KRT18 extends beyond the GENSCAN and SGP predicted genes Go back to the BLAT results page, or resubmit the BLAT search using the human KRT18 peptide sequence from GenBank.

The first two parts of the BLAT alignment consists of the KRT18 protein sequence and the aligned region of the human chromosome 9 by themselves, color-coded according to indicate matches, mismatches

21 Figure 27. BLAT alignment of KRT18 extends beyond the GENSCAN and SGP predicted genes Go back to the BLAT results page, or resubmit the BLAT search using the human KRT18 peptide sequence from GenBank. Click details to examine the alignment for the hit on human chromosome 9 ( ) (Figure 28). The first two parts of the BLAT alignment consists of the KRT18 protein sequence and the aligned region of the human chromosome 9 by themselves, color-coded according to indicate matches, mismatches and gap. Matches are shown as blue upper case letters, gap boundaries for either the query or the genomic sequence are shown in light blue upper case letters while all mismatched bases are in black lowercase letters. As we might expect from the 83.1% sequence identity, there are many differences between the two sequences. However, when we examine some of the mismatched regions in the genomic sequence on chromosome 9, there are a few regions that may encode for a stop codon (e.g. TGA and TAG in the mismatched regions) if they are in the correct reading frame. We will use the side-by-side alignment to determine if these are in fact in-frame stop codons. Figure 28. Presence of putative stop codons (tga, tag) in the human chromosome 7 sequence 21

We see a red X a stop codon in the human sequence aligning to a R (Arginine) in the KRT18 protein. Figure 29.

22 The side-by-side alignment allows us to examine the predicted protein in frame relative to the aligned region of human chromosome 9 (Figure 29). The matching line uses the following coloring scheme: lines for match, green for changes in amino acids that have similar chemical properties, and red for dissimilar amino acids changes. We see a red X a stop codon in the human sequence aligning to a R (Arginine) in the KRT18 protein. Figure 29. Side-By-Side Alignment of KRT18 and human chromosome 9 shows an in-frame stop codon The fact that the protein alignment runs right over this stop codon, with no deterioration in similarity, strongly suggests that our prediction is a recently retrotransposed pseudogene, which has been inactivated by mutation but has not yet diverged significantly from its source gene. To confirm this hypothesis, go back to the BLAT results and click browser to find the place where KRT18 best matches the genome (100% identity at chromosome 12) and zoom out 3X (Figure 30). We see that KRT18 actually has 7 coding exons. Figure 30. Best BLAT match for KRT18 shows that it has 7 coding exons 22

23 Analysis Results for Predicted Gene 2 Based on all the evidence accumulated here, we conclude that, as a cdna, the KRT18 gene was retrotransposed and acquired a stop codon mutation prior to the split of the chimpanzee and human lineages. However, this transposition event appears relatively recently since the pseudogene still retains 83.1% sequence identity to its source protein. But how do we explain all the supporting evidence (e.g. unspliced ESTs and other ESTs) for our prediction, given that it turned out to be a pseudogene? They are almost certainly due to the fact that our pseudogene has high similarity to the functional filament conserved domain. To confirm this hypothesis, we can do pairwise BLAST alignments of the ESTs against the pseudogene and the functional protein. If our hypothesis is correct, the ESTs should be more similar to the functional protein than to the pseudogene. To further establish the age of this pseudogene, we can do a BLAT search using the pseudogene or the KRT18 protein against the mouse or rat genome browsers. If we find the corresponding pseudogene in the syntenic regions of the rodent genomes, we would have evidence that the transposition event occurred prior to the divergence of rodents and primates. Conclusions In this tutorial, we have learned how to run GENSCAN and to critically evaluate the GENSCAN results using BLAST searches and the UCSC Genome Browser. Our conclusion for the first GENSCAN prediction in our chimp sequence is that it is an ortholog of the human phosducinlike protein (PHLP) gene. Our conclusion for the second GENSCAN prediction is that it is likely a pseudogene derived from the human keratin 18 (KRT18) gene. Now that you have seen the basic procedure for identifying and annotating genes using GENSCAN, BLAST, and the UCSC Genome Browser, you can try to annotate other more challenging regions. You can find additional chimpanzee annotation problems on the Genomics Education Partnership web site at under the Annotation Exercises section. Last Update: 07/10/