Identifying Regulatory Regions using Multiple Sequence Alignments

Prerequisites: BLAST Exercise: Detecting and Interpreting Genetic Homology.

Resources: ClustalW, the Saccharomyces Genome Database (SGD), the SCPD database, and the YEASTRACT database.

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: Clustal_Intro.tar.gz

Introduction:

In addition to finding genes in genomic sequences, identifying the regions that regulate gene expression is an important part of understanding a genome. However, identifying regulatory regions within genomic sequences is difficult because these regions are often short, degenerate, and located at variable distances from the genes that they regulate. Consequently, pair-wise comparisons (such as BLAST) often fail to detect these conserved regions.

One useful strategy for detecting functional non-coding regions, such as regulatory regions, is to generate multiple sequence alignments of homologous sequences from closely related species. This technique, also known as phylogenetic footprinting, relies on the fact that the evolutionary constraints on non-functional sequences are significantly weaker than those on functional sequences. Hence we expect functional non-coding sequences to show higher sequence conservation than non-functional sequences. Multiple sequence alignments can greatly increase the sensitivity and specificity of the sequence comparisons and may identify conserved non-coding regions that are undetectable by pair-wise alignments.

The selection of species for a multiple sequence alignment is heavily influenced by the research question. While alignment of sequences from closely related species will identify conserved regions with high sensitivity, alignment of sequences with remote outgroups often improves the specificity in detecting conserved regions. The theoretical principles behind the selection of species used in comparative genome sequence analysis are beyond the scope of this tutorial. For additional information, see the analysis by Eddy [1].

In this tutorial, we will replicate some of the results previously reported by Cliften et al. [2], in which multiple sequence alignments generated by the program ClustalW are used to identify transcription factor binding sites in the Saccharomyces genomes. The species used in the multiple sequence alignments are closely related to S. cerevisiae, including S. mikatae, S. kudriavzevii, and S. bayanus.

[1] Eddy SR. A Model of the Statistical Power of Comparative Genome Sequence Analysis. PLoS Biol. 3(1): e10 (2005).
[2] Cliften P, Sudarsanam P, et al. Finding Functional Features in Saccharomyces Genomes by Phylogenetic Footprinting. Science. 301: (2003).

Overview of ClustalW

ClustalW (Thompson et al., 1994) [3] is a program commonly used to generate multiple sequence alignments. In addition to generating alignments from multiple sequences, it can add sequences to existing alignments, realign portions of existing alignments, generate phylogenetic trees, and merge two alignment profiles. In this tutorial, however, we will only use the basic functionality of the program to align multiple sequences.

While a detailed description of the ClustalW algorithm is beyond the scope of this tutorial, the key idea is to build the multiple sequence alignment from a progressive series of pair-wise alignments. First, similarity scores are calculated for all pairs of sequences we wish to align. These scores are used to construct a dendrogram (also known as a guide tree). Sequences are then aligned in progressively larger groups according to the branching order in this guide tree. The program begins with a pair-wise alignment of the two most similar sequences and generates an alignment profile. The sequence from the next closest species is then aligned to this profile to generate a new profile. This process continues iteratively until all sequences have been aligned. Note that ClustalW uses a heuristic algorithm and does not guarantee that it will report the optimal multiple sequence alignment.

We will use the ClustalW web server from the European Bioinformatics Institute (EBI) to analyze our sequences. The ClustalW web site contains useful documentation that explains the myriad options available in ClustalW.

Finding Open Reading Frames in S. cerevisiae using the Saccharomyces Genome Database

Figure 1. Searching for the record for YDR374C in the Saccharomyces Genome Database

In this tutorial, we will use ClustalW to search for known transcription factor binding sites in the promoter region of the hypothetical YDR374C open reading frame (ORF). We can find additional information on this ORF by searching for YDR374C at the Saccharomyces Genome Database (Figure 1). The summary page for the ORF YDR374C has links to many resources that are available in the Saccharomyces Genome Database (SGD), such as a graphical browser (GBrowse) and the BLAST search service. We can also retrieve the sequence that corresponds to this ORF in different formats (including the genomic DNA, the ORF translation, and 1 kb upstream and downstream of the ORF) under the Retrieve Sequences section (Figure 2).

[3] Thompson JD, Higgins DG, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:

Figure 2. The summary page for the ORF YDR374C at the SGD Database

Before we can run ClustalW to identify transcription factor binding sites, we must first obtain the genomic sequences that correspond to the promoter region of the YDR374C ORF in S. cerevisiae and the orthologous regions from the other Saccharomyces species we wish to align. Since the intergenic regions in most Saccharomyces species are relatively short (the average length is approximately 500 base pairs), we will analyze the entire region between the translation start of YDR374C and its immediate flanking gene.

Click on the link labeled GBrowse in the Resources section to examine the region surrounding YDR374C in the SGD GBrowse. We notice that YDR373W and YDR374W-A are the two ORFs closest to YDR374C (Figure 3). In addition, since the direction of the arrow shows the orientation of each ORF, we see that YDR374C is in the reverse orientation relative to the chromosome. Therefore, since we are interested in the promoter region of YDR374C, we will analyze the region between YDR374W-A and YDR374C (chrIV, a total of 362 base pairs).

We should also be aware that, unlike BLAST, ClustalW will not reverse-complement the input sequences when it generates the alignment. Because the ORF YDR374C is in the reverse orientation, we need to reverse-complement the region first to ensure that the promoter regions of all the input sequences are in the same orientation.

Figure 3. YDR373W and YDR374W-A are the two ORFs immediately flanking the ORF YDR374C (the region of interest is marked in the figure)
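The reverse-complement step described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original tutorial; the function name and the example sequence are hypothetical, and only the unambiguous bases plus N are handled.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from the YDR374C region.
    print(reverse_complement("ATGCCGTTA"))  # TAACGGCAT
```

Applying the function to every sequence that lies on the reverse strand before alignment puts all promoter regions in the same orientation.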

For this tutorial, Cliften et al. have identified the promoter regions (i.e. the regions upstream of the translation start site) of the S. cerevisiae ORF YDR374C and the orthologous regions from S. mikatae, S. kudriavzevii, and S. bayanus. The sequences are available in the file YDR374C.fasta.

Running ClustalW using the ClustalW Web Service at EBI

Navigate to the ClustalW web server at EBI (Figure 4). For the purpose of this tutorial, we will use the default settings for our ClustalW search.

Figure 4. ClustalW web server at EBI

Change the Alignment Title to MSA_YDR374C and open the file YDR374C.fasta in a text editor (e.g. WordPad on MS Windows, TextEdit on the Mac). Copy and paste the sequences into the text box labeled "Enter or Paste a set of Sequences in any supported format:" (Figure 5). Click Run. The analysis may take a while, so be patient. Alternatively, for class purposes, the ClustalW output is available in the folder ClustalOutput inside the tutorial package.

Figure 5. Configuring the ClustalW search using the web interface

Figure 6. The main result screen has links to the four output files generated by ClustalW

In the ClustalW results screen, you should see links to four files: the .output file contains the scores of the alignment, the .aln file contains the multiple sequence alignment, the .dnd file contains the guide tree used to determine the order of the progressive sequence alignment, and the original input file has the file extension .input (Figure 6). Right click on each of the links and save each file onto your computer for future reference.

Interpreting the ClustalW Output

Scroll down to the Guide Tree section of the ClustalW output to see the guide tree used to construct the multiple sequence alignment (Figure 7). The parentheses ( ) enclose species that are grouped into the same clade (for example, S. kudriavzevii and S. mikatae are in the same clade according to the guide tree). The numbers following the names of the organisms are the calculated branch lengths.

Figure 7. The Guide Tree shows the order in which ClustalW aligned the sequences
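The parenthesized guide tree notation with branch lengths described above is commonly known as the Newick format. As a rough sketch of how such a tree could be read programmatically, the Python code below parses a simple Newick string into nested tuples. The parser, the function names, and the example tree (including its species labels and branch lengths) are illustrative assumptions and are not taken from the actual .dnd file; quoted labels and comments are not handled.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    # Drop all whitespace (including newlines) and the terminating semicolon.
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children.append(node())
            while s[pos] == ",":
                pos += 1  # consume ','
                children.append(node())
            pos += 1  # consume ')'
        # Read the optional node label.
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        # Read the optional branch length after ':'.
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    return node()


def leaves(tree):
    """Return the leaf names of a parsed tree in left-to-right order."""
    name, _, children = tree
    if not children:
        return [name]
    names = []
    for child in children:
        names.extend(leaves(child))
    return names


if __name__ == "__main__":
    # Hypothetical guide tree; labels and branch lengths are made up.
    example = "((Skud:0.10,Smik:0.12):0.03,Scer:0.08,Sbay:0.15);"
    tree = parse_newick(example)
    print(leaves(tree))
    print(leaves(tree[2][0]))  # members of the first clade
```

Each pair of parentheses becomes one internal node, so the leaves below a node are exactly the species grouped into that clade.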

We can see a graphical version of the guide tree in the Cladogram section of the ClustalW output. A comparison of the guide tree with the known phylogenetic tree shows a different relationship among the four species (Figure 8). This should not be surprising, since ClustalW uses only the input sequences to construct the guide tree and the evolution of specific genes may deviate from the general evolutionary history of the yeast species.

Figure 8. Comparison of the known phylogenetic tree [4] (top) and the guide tree created by ClustalW shows different relationships among the four Saccharomyces species.

To examine the multiple sequence alignment, scroll up to the Alignment section of the page (Figure 9). For each alignment block, the labels on the left indicate the names of the sequences used to generate the alignment. The numbers on the right indicate the position of each sequence within the alignment. Under each alignment block is a matching line with symbols that indicate the degree of conservation in each column of the alignment. Note that ClustalW will only annotate columns that are fully conserved by default. The symbols and the corresponding degrees of sequence conservation are listed in the table below:

Character   Degree of Conservation
*           Fully conserved
:           Strongly conserved
.           Weakly conserved
(Space)     No sequence conservation

[4] Available online at
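To make the meaning of the matching line concrete, the sketch below recomputes the fully conserved marker for each alignment column. It is a simplified illustration, not ClustalW's own code: it only distinguishes fully conserved columns (*) from everything else, and the function name and the example rows are hypothetical.

```python
def conservation_line(aligned):
    """Return '*' under fully conserved columns and ' ' under all other columns."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = {base.upper() for base in column}
        # A column is fully conserved when every row has the same base
        # and none of the rows has a gap in that column.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Hypothetical aligned rows ('-' marks a gap), not the YDR374C alignment.
    rows = ["ACG-TA", "ACGTTA", "ACCTTA"]
    for row in rows:
        print(row)
    print(conservation_line(rows))
```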

Figure 9. The multiple sequence alignment generated by ClustalW for four Saccharomyces species (figure labels: Aligned Species, Sequence Position, Degree of Conservation)

Identifying Conserved Regions

From the alignment, we notice that some regions upstream of the YDR374C ORF are highly conserved among the four Saccharomyces species. While we expect a few contiguous bases to be conserved by random chance alone, a longer stretch of conserved sequence may indicate the presence of a functional motif. However, since the four species are closely related evolutionarily, conserved regions may simply be non-functional sequences that have not had sufficient time to diverge. Hence a long stretch of conserved sequence in a multiple sequence alignment does not, by itself, provide sufficient evidence that a conserved functional DNA motif is present in a genomic sequence. Nonetheless, multiple sequence alignments can be used to identify candidate sites that can then be validated through wet-lab experiments.

Looking at the alignment, we notice three highly conserved regions (Figure 10): TACCCGTT starting at base 94 of S. cerevisiae, TCGGCGGCTAAT starting at base 132 of S. cerevisiae, and GCCTTTTGTGATAT starting at base 154 of S. cerevisiae. We would now like to know if these conserved sequences correspond to any known regulatory motifs.

Since the sequences of regulatory motifs are often short and degenerate, comparing candidates against the consensus sequences in a database of known motifs often fails to produce useful results. However, regulatory motifs can be identified experimentally by studying gene expression profiles. Genes that show similar expression patterns are presumably controlled by the same regulatory mechanisms, i.e. binding of transcription factors to the same transcription factor binding sites (TFBS). Binding affinity analysis can also be used to facilitate the identification of transcription factor binding sites.

Unfortunately, finding regulatory motifs remains a difficult problem. In fact, the development of high-throughput methods to experimentally determine the functions of conserved sequence motifs, if any, remains an active area of research.
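The visual search for conserved stretches described above could also be done programmatically. The sketch below scans the conservation line for runs of fully conserved columns and reports where each run starts in the ungapped reference sequence. The function name, the minimum run length of 6, and the example data are assumptions chosen for illustration; the tutorial itself identifies the regions by eye.

```python
import re


def conserved_runs(ref_aligned, marks, min_len=6):
    """Return (start, motif) for each run of '*' spanning at least min_len columns.

    start is the 1-based position in the ungapped reference sequence.
    """
    runs = []
    for match in re.finditer(r"\*{%d,}" % min_len, marks):
        # Count the reference bases (non-gap characters) before the run.
        bases_before = len(ref_aligned[:match.start()].replace("-", ""))
        motif = ref_aligned[match.start():match.end()]
        runs.append((bases_before + 1, motif))
    return runs


if __name__ == "__main__":
    # Hypothetical aligned reference row and conservation line.
    ref = "AC-GTACCCGTTA"
    marks = "**  ******** "
    print(conserved_runs(ref, marks))  # [(4, 'TACCCGTT')]
```

Because fully conserved columns cannot contain gaps, each reported motif is a contiguous stretch of the reference sequence.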

Figure 10. Three highly conserved regions in the multiple sequence alignment of the four Saccharomyces species

In this tutorial, we will search the three conserved sequences against the Saccharomyces cerevisiae Promoter Database (SCPD). Open a new browser window and navigate to the SCPD motif search page. Enter our first motif (TACCCGTT) into the text box labeled Motif. Since we would like to find perfect matches in our initial analysis, we will set the Allowed mismatches to 0. Click Submit (Figure 11).

Figure 11. Search for known motifs using the SCPD database
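The effect of the Allowed mismatches setting can be illustrated with a small sliding-window search. This is only a sketch of the general idea; how SCPD actually implements its search is not described in this tutorial, and the function name and test sequence below are hypothetical. Only the given strand is scanned.

```python
def find_motif(sequence, motif, max_mismatches=0):
    """Return (position, matched_text, mismatches) for every acceptable window.

    position is the 1-based start of the window in the sequence.
    """
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        # Count the positions at which the window differs from the motif.
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence containing one exact and one near match.
    seq = "AATACCCGTTGGTACCAGTT"
    print(find_motif(seq, "TACCCGTT", max_mismatches=0))
    print(find_motif(seq, "TACCCGTT", max_mismatches=1))
```

With zero allowed mismatches only the exact site is reported; allowing one mismatch also reports the near match, which is why a measure of statistical significance becomes important once mismatches are permitted.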

Figure 12. Single match (MCM1) to our conserved sequence

There is only a single incomplete match (MCM1) to our query (Figure 12). When we reexamine the multiple sequence alignment, we notice that the conservation of the alignment decreases significantly (e.g. the alignment contains gaps) immediately following our conserved sequence. Hence it seems unlikely that our conserved region corresponds to this TFBS.

Figure 13. Multiple matches between REB1 and the conserved sequence (the REB1 and GRF2 hits that share the same TFBS sequence are marked in the figure)

Since it is possible for the conserved sequence to be longer than the known transcription factor binding site, we can search for our TFBS using a truncated version of the conserved sequence (e.g. TACCCG). Repeating our motif search using the sequence TACCCG results in multiple matches (Figure 13). In addition to our original MCM1 hit, we also have numerous matches to the transcription factor REB1. There are also single matches to PHO4 and GRF2. The match to PHO4, like the match to MCM1, is relatively poor. The match to GRF2 shows the same degree of sequence similarity as the REB1 matches. However, you may have noticed that the eighth TFBS hit for REB1 has the same sequence as the TFBS for GRF2 (the ninth hit). Hence our hypothesis is that GRF2 is actually a synonym of REB1.

To verify this hypothesis, we will search for information on GRF2 at the Saccharomyces Genome Database. Open a new browser window and navigate to the SGD home page, then search for GRF2 using the Quick Search box (Figure 14).

Figure 14. Search for GRF2 record at the Saccharomyces Genome Database

Instead of the summary page for GRF2, we are redirected to the summary page for REB1 (Figure 15). In the Alias section under the REB1 Basic Information section, we notice that REB1 has the alias GRF2. In addition, we find that REB1 is an RNA polymerase I Enhancer Binding protein and is required for terminating RNA polymerase I transcription.

Figure 15. SGD record shows GRF2 is an alias for REB1

Having resolved the apparent discrepancy between the high-quality hit to GRF2 and the other high-quality REB1 hits, we will try to gather additional evidence that the motif we have identified is in fact a REB1 binding site. Go back to the browser window with the list of TFBS in the SCPD database that match our conserved sequence. Click on any of the REB1 hits to retrieve the SCPD database record for REB1 (Figure 16), then click on the button Get consensus (Figure 17).

Figure 16. Click on the link REB1 to learn more about the binding site

Figure 17. Consensus sequence for the REB1 TFBS

We find that the consensus motif for REB1 is YYACCCG, where Y in IUPAC code represents a pyrimidine (C or T). Looking at the ClustalW alignment, we find that the position immediately before the conserved sequence (TACCCG) is either a C (in S. cerevisiae, S. bayanus, and S. mikatae) or a T (in S. kudriavzevii). Consequently, we are relatively confident that our conserved sequence corresponds to a binding site for REB1.

We can employ the same strategy to identify the other two motifs. Open a new browser window and navigate to the SCPD motif search page again. Enter the sequence TCGGCGGCTAAT into the text box, then click Submit. We find that this conserved sequence has only a single high-quality hit, to the TFBS URS1H (Figure 18).

Figure 18. Search for the conserved sequence TCGGCGGCTAAT results in a high quality hit to the TFBS URS1H

Finally, searching the database with the sequence GCCTTTTGTGATAT and its subsequences fails to produce matches to any known motifs. While it is possible that this region is conserved by random chance alone, the more likely hypothesis (given the length of the conserved region) is that this sequence is a novel motif not in the SCPD database. To determine if this sequence is a known TFBS, we can search it against TRANSFAC, a more comprehensive database of transcription factor binding sites. However, the TRANSFAC database is subscription-based and you must register in order to use this service.

Note that in our analysis, we have only focused on perfect matches to known motifs in the SCPD database. We would need some measure of statistical significance (similar to the Karlin-Altschul statistics used in BLAST) in order to incorporate mismatches into our analysis. However, this type of analysis is beyond the scope of this tutorial.
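The comparison of the aligned sequences against the REB1 consensus YYACCCG, carried out by eye above, can be sketched by translating the IUPAC codes into a regular expression. The function names and the test sequences below are hypothetical; the code table uses the standard IUPAC nucleotide ambiguity codes.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular-expression pattern."""
    return "".join(IUPAC[code] for code in consensus.upper())


def find_consensus(sequence, consensus):
    """Return (position, matched_text) for every match, including overlapping ones.

    position is the 1-based start of the match in the sequence.
    """
    # The lookahead lets the search report matches that overlap each other.
    pattern = re.compile("(?=(%s))" % consensus_to_regex(consensus))
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(consensus_to_regex("YYACCCG"))            # [CT][CT]ACCCG
    # Hypothetical sequences: the first has a pyrimidine before TACCCG,
    # the second has a purine (A) and therefore does not match.
    print(find_consensus("GGCTACCCGTT", "YYACCCG"))  # [(3, 'CTACCCG')]
    print(find_consensus("GGATACCCGTT", "YYACCCG"))  # []
```

A sequence with either C or T immediately before TACCCG satisfies the consensus, which mirrors the reasoning applied to the four aligned species.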

Finding More Information About REB1 with the YEASTRACT database

Figure 19. The YEASTRACT database home page

Now that we have found some regulatory motifs in our multiple sequence alignments, we can use the YEASTRACT (Yeast Search for Transcriptional Regulators and Consensus Tracking) database to learn more about each of the transcription factors we have identified (Figure 19). The YEASTRACT database collects information from the scientific literature, the SGD, and the Gene Ontology Consortium that shows associations between transcription factors and their target genes in S. cerevisiae. To search for information on the regulatory motifs we have previously identified, click on the link TF-Consensus List under Retrieve (Figure 20).

Figure 20. List of known transcription factors in the YEASTRACT database

Figure 21. Click on any of the links labeled Reb1p to retrieve the protein record of the Reb1p protein

Figure 22. Details on the Reb1p protein and known TFBS of Reb1p

Find the three links to Reb1p in the table of transcription factors and click on any of them (Figure 21). This will bring you to a new page with a brief summary of the Reb1p protein (Figure 22). In addition to the summary of the protein we previously saw in the SGD web page, there are also links to the known transcription factor binding sites, a list of documented genes regulated by this transcription factor, and a list of potential genes regulated by this factor.

Identifying TFBS in Other Regions of the Saccharomyces Genome

Now that you have learned how to identify known regulatory motifs, you can apply the same technique to other promoter regions. The ClustalW alignments and the original sequences of the orthologous promoters from other regions in the Cliften et al. manuscript are available for download.

Last Update: 07/05/