
Molecular Biology-2017

PRESENTING SEQUENCES

As you know, sequences may be either double stranded or single stranded, and have a polarity described as 5' and 3'. The 5' end carries a free phosphate group, whereas the 3' end has a free hydroxyl group. Molecular biologists typically present the sequence of only one strand, regardless of whether the source nucleic acid is single stranded or double stranded. Furthermore, all sequences are written in the 5' to 3' orientation unless otherwise specified.

For example, let's say that a given sequence within a cellular organism which possesses a double stranded genome is the following:

5' GAATGCGGCTTAGACTGGTACGATGGAAC 3'
3' CTTACGCCGAATCTGACCATGCTACCTTG 5'

It is highly unlikely that a sequence would actually be presented this way. Furthermore, bioinformatics programs would not accept such a format. However, if in exceptional cases both strands were to be presented, by default the top strand would be written in the 5' to 3' direction. A much more common way of presenting the sequence would be as follows:

Either: (A) 5' GAATGCGGCTTAGACTGGTACGATGGAAC 3'
Or: (B) 5' GTTCCATCGTACCAGTCTAAGCCGCATTC 3'

As you can see, the sequence of either strand may be presented, and in both cases it is written in the 5' to 3' direction. Also note that in this format it is not necessary to label the ends. Any molecular biologist or bioinformatics program will assume that a sequence is written in the 5' to 3' direction.

How do different bioinformatics programs treat inputted sequences?

Given that the sequence of only one strand is entered, all bioinformatics programs will assume that the sequence is written 5' to 3'. Depending on the program used, the subsequent analysis may be carried out only on the entered sequence, or in some cases the program will also carry out the analysis on the other strand even though it has not been entered.

In the case where a program is capable of analyzing both strands, by default the program will assign the symbol + to the entered sequence and the symbol - to the other strand's sequence. For example, if sequence A above were entered in such a program, it would be labelled + and sequence B would be labelled -. However, if sequence B were entered, it would be labelled + and sequence A would be labelled -.
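As a rough illustration of the + / - labelling described above, the following Python sketch derives the strand that would be labelled - from whichever sequence is entered. The helper name `label_strands` is hypothetical and does not correspond to any particular bioinformatics program; it simply assumes the input is DNA written 5' to 3'.

```python
# Sketch of the + / - labelling convention. label_strands is a hypothetical
# helper written for this example, not the API of any real program.
PAIRS = str.maketrans("ACGT", "TGCA")

def label_strands(entered):
    """Return both strands, each written 5' to 3', keyed by '+' and '-'."""
    entered = entered.upper()
    # The other strand is the complement of the entered one, read backwards
    # so that it is also written in the 5' to 3' direction.
    other = entered.translate(PAIRS)[::-1]
    return {"+": entered, "-": other}

seq_a = "GAATGCGGCTTAGACTGGTACGATGGAAC"
seq_b = "GTTCCATCGTACCAGTCTAAGCCGCATTC"

print(label_strands(seq_a))  # '+' is sequence A, '-' is sequence B
print(label_strands(seq_b))  # '+' is sequence B, '-' is sequence A
```

Entering either sequence yields the other one as the - strand, which is why the labels swap depending on what was typed in.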

MANIPULATING HOW A SEQUENCE IS PRESENTED

There are several different ways in which a sequence can be manipulated and presented. To illustrate this, let's start with a relatively simple short sequence:

(A) 5'-GAATGCGGCTTAGACTGGTACGATGGAAC-3'

A different version of the above sequence would be to present its complement. In this case the sequence would be written as 3'-CTTACGCCGAATCTGACCATGCTACCTTG-5'. Given that this is a complement sequence, the ends must be indicated and the orientation must necessarily be 3' to 5'.

Another way of presenting a sequence is referred to as the inverse (or reverse) sequence. This involves presenting the sequence written in the opposite orientation. Therefore the reverse sequence of sequence A would be:

5'-CAAGGTAGCATGGTCAGATTCGGCGTAAG-3'

Note that this sequence is still written in the 5' to 3' direction!

The final way of presenting a sequence is what is called the reverse (inverse) complement. In this case the complement sequence is presented in the 5' to 3' direction. Thus the inverse complement of sequence (A) would be:

5'-GTTCCATCGTACCAGTCTAAGCCGCATTC-3'

Note that the inverse complement and the complement of the inverse are not synonymous.

Fortunately, all these manipulations can be carried out by computer programs. The one we will use in this course is a Web based program called "Reverse complement". You can access this program from the link on this course's web site under the heading "Bioinfo links".

For your assignment, obtain the sequence with the accession number GBYX01460764 from the NCBI site. Once you've obtained this sequence, use the Web based program "Reverse complement" to carry out the following manipulations:

Step 1: Obtain the reverse sequence of the sequence.
Step 2: Obtain the complement sequence of the sequence obtained in step 1.
Step 3: Obtain the reverse complement sequence of the sequence obtained in step 2.
Step 4: Obtain the reverse sequence of the sequence obtained in step 3.
Step 5: Obtain the complement sequence of the sequence obtained in step 4.

Indicate the first 20 bases of the final sequence obtained. Make sure to indicate the 5' and 3' ends.
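The three manipulations described in this section can be sketched in a few lines of Python, using sequence (A) as input. This is only an illustration and not the implementation of the Web based program; the function names are chosen for this example. The functions work on plain strings of letters, and the 5' and 3' labels are added only when printing, following the conventions given above.

```python
# Illustrative helpers; these names are assumptions made for this sketch.
PAIRS = str.maketrans("ACGT", "TGCA")

def reverse(seq):
    """Same letters, written in the opposite order."""
    return seq[::-1]

def complement(seq):
    """Each base replaced by its pairing partner, order unchanged."""
    return seq.translate(PAIRS)

def reverse_complement(seq):
    """Complement, then written in the opposite order."""
    return complement(seq)[::-1]

seq_a = "GAATGCGGCTTAGACTGGTACGATGGAAC"

print("sequence            5'-" + seq_a + "-3'")
print("complement          3'-" + complement(seq_a) + "-5'")
print("reverse             5'-" + reverse(seq_a) + "-3'")
print("reverse complement  5'-" + reverse_complement(seq_a) + "-3'")
```

The printed lines correspond to the four presentations of sequence (A) shown in this section.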

SEQUENCE ALIGNMENTS

A sequence alignment is the assignment of base-to-base (or residue-to-residue) correspondences between two or more sequences. Aligning is therefore the task of locating equivalent regions of two or more sequences so as to maximize their similarity. This is essentially what the program Blast did when it searched the databases for a sequence of interest.

There are several other reasons one may want to perform an alignment. You may wish to compare a small subset of sequences to identify similar regions or to determine how similar two sequences are. Alternatively, you may simply wish to locate a short sequence within a larger sequence. This would be analogous to the ctrl-f function you commonly use in Word.

Several different alignment programs are available and the results returned can be very different. For the purpose of this course we will examine the use of two programs: Blast and Clustal Omega. Both programs can be used to align either nucleotide or protein sequences. In this exercise we will focus on the use of these programs to align nucleotide sequences.

Note: Aligning is NOT synonymous with base pairing.

The following is an example of two perfectly aligned (identical) sequences as they would be displayed in Blast or Clustal Omega.

Blast display: Perfect match
Clustal Omega display: Perfect match

    ************

Blast and Clustal Omega treat bases which are not identical between two sequences (mismatches) differently. If the mismatches are within the sequence (not at the ends), the display is similar for both Blast and Clustal Omega:

Blast display: Internal mismatches
Clustal Omega display: Internal mismatches

    THISSEQUENCE
    ** ********
    THISSEQUENCE
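The line of stars in the Clustal Omega display can be reproduced with a short Python sketch. It assumes the two sequences have already been aligned to the same length, so it only draws the display and does not perform the alignment itself. The function name `match_line` and the two input strings are illustrative choices made for this example.

```python
def match_line(seq1, seq2):
    """Mark identical columns with '*' and non-identical columns with a space."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must already be aligned to the same length")
    return "".join("*" if a == b else " " for a, b in zip(seq1, seq2))

first = "THATSEQUENCE"   # illustrative input chosen for this sketch
second = "THISSEQUENCE"  # differs from the first at two internal positions

print(first)
print(second)
print(match_line(first, second))
```

Each star sits under a column where both sequences carry the same letter, and the blanks mark the two internal mismatches.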

If the mismatches are at the ends of the sequence, then the display is different:

Sequences being compared: [...] and SHATSEQUENCG

Blast display: Mismatched ends

    HATSEQUENC
    HATSEQUENC

Clustal Omega display: Mismatched ends

     **********
    SHATSEQUENCG

In the case of insertions or deletions, I strongly recommend using Clustal Omega. The display would be as follows:

    THATISASEQUENCE
    THAT---SEQUENCE

ALIGNING SEQUENCES TO LOCATE A PRIMER WITHIN A TEMPLATE USING BLAST

1. Go to Blast and select the option "Align two or more sequences" below the query box. A new Blast program window will be displayed as shown below:

2. For this tutorial, we will use as a template the sequence corresponding to the accession number EF391433. Either obtain this sequence and enter it in FASTA format in the query box, or enter the accession number itself. Remember that this sequence will be assigned the symbol + by default.

3. In the second box (subject), copy and paste the following sequence in FASTA format: GAGACTATTTCCAGCACTCGAGC. Make sure to give it a name distinct from that of the template (query).

4. Choose the options "Somewhat similar sequences" and "Show results in a new window", as you've done in a previous exercise.

5. Now click on Blast to obtain your results. Scroll down the page to view the alignment as shown below:

6. Interpretation of the results:

   Length: This indicates the length of the subject sequence. In this case your primer is 23 bases long.

   Identities: In this case 23 out of 23 of the bases were identical. Therefore the total length of the primer is aligned. There are no mismatches.

   Strand: In this case the result (plus/plus) indicates that the match was found on the query as provided. If the alignment had been found on the reverse complement of the query sequence, this result would be indicated as plus/minus. Thus this must be a forward primer: a primer whose sequence is the same as that of the template sequence.

   Numbering: The numbering indicates the exact positions on both the query and the subject at which the match was found. Keep in mind that the smaller number will always represent the 5' end and the larger number the 3' end. Therefore, in this example the primer anneals between positions 293 (5') and 315 (3').

7. Repeat steps 2-6 with the following primer sequence: CGTAGCGGAACTTCACTGTAT. You should obtain the following results:

8. Interpretation of the results:

   Length: This indicates the length of the subject sequence. In this case your primer is 21 bases in length.

   Identities: In this case 20 out of 20 of the bases were identical. However, the length of your primer was 21 bases. Consequently there must be a mismatch, specifically at one of the ends!

   Strand: This time the result indicates that a match was not found with the query as provided. Instead, the match was found on the reverse complement sequence of the query; therefore plus/minus. This would therefore be a reverse primer: a primer whose sequence is the same as that of the reverse complement sequence of the template.

   Numbering: The numbering indicates that this primer aligns between positions 1010 (5') and 1029 (3') of the template. However, the numbering of the subject sequence (the primer) runs from position 1 (5') to position 20 (3'). Given that the primer is 21 bases long, this indicates that base 21 is either not aligned or mismatched. Therefore the 3' end (larger number) of this primer is mismatched.

USING CLUSTAL OMEGA

1. Another program which can be used to align sequences is Clustal Omega. Use the link on this course's web site under the heading "Bioinfo links" to access this program. You should obtain the following page:

2. Since we are interested in aligning nucleotide sequences, change the sequence parameter to "DNA".

3. Then enter the sequences EF391433 and GAGACTATTTCGAGCACTCGAGT in FASTA format in the query box. Note that all sequences must be entered in the same box and separated by a line break. Also, the sequence identifiers must be sufficiently different. See the example below, in which the sequence identifiers and the line break are labelled.

4. Change the output format to "Clustal w/ numbers".

5. Now click on submit. Once the analysis is completed, the alignment will be shown on a new page similar to the one shown below.

6. In contrast to Blast, a global alignment is shown rather than only the aligned region. Perfect matches are indicated by a star. In the above example two mismatches were detected, one of which is the 3' base. Note that this would not have been shown using Blast and may therefore be more difficult to notice. The base at the 3' end is one of the most important. Primers which have a mismatch at this position will NOT work. Furthermore, since the alignment was performed with the original template sequence, this must be a Forward primer.

7. In contrast to Blast, this program performs the alignment only on the entered sequence and not on the reverse complement strand. To illustrate this, repeat the above steps with the following primer: GAGACTATTTCGAGCACTCGAGT. You will notice this time that the alignment found is much less perfect.

8. Consequently, when aligning primers using Clustal Omega, you should always verify which strand gives the best alignment. To determine whether the alignment is better on the other strand, replace the primer sequence with its reverse complement. Use the Web based program "Reverse complement" used previously to convert the sequence, and then repeat this exercise.

9. This alignment is much better! Typically, if the best alignment is obtained with the original sequence, the primer is said to be a Forward primer. If the best alignment is obtained with the reverse complement, the primer is said to be a Reverse primer.

CAUTION: Given that this is the inverse complement of the actual primer sequence, the 3' end IS NOT the larger number (or on the right), but rather the smaller number (the left end) of the primer sequence.
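The logic of checking both strands, and of reading the 5' and 3' positions correctly for each orientation, can be sketched in Python as follows. This is a simplified illustration which assumes the primer matches the template exactly, whereas Blast and Clustal Omega also tolerate mismatches. The function `locate_primer` and the short template are hypothetical and were made up for this example.

```python
PAIRS = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(PAIRS)[::-1]

def locate_primer(template, primer):
    """Return (orientation, 5' position, 3' position) on the template.

    Positions are 1-based and refer to the template as entered.
    Only exact matches are found; None is returned otherwise.
    """
    template, primer = template.upper(), primer.upper()
    # Forward primer: same sequence as the template, 5' end on the left.
    pos = template.find(primer)
    if pos != -1:
        return ("Forward", pos + 1, pos + len(primer))
    # Reverse primer: matches the other strand, so search for its reverse
    # complement. Its 5' end is then the larger template position.
    pos = template.find(reverse_complement(primer))
    if pos != -1:
        return ("Reverse", pos + len(primer), pos + 1)
    return None

# Hypothetical toy template, not a real accession.
template = "ACGT" + "GAATGCGGCTTAGACTGGTACGATGGAAC" + "TTGA"

print(locate_primer(template, "GAATGCGGCT"))  # found on the entered strand
print(locate_primer(template, "GTTCCATCGT"))  # found on the other strand
```

For the reverse primer, the 3' position reported is the smaller of the two template numbers, which is the situation the CAUTION above describes.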

FOR YOUR ASSIGNMENT

1. Map the alignment positions of each of the following primer sequences on the pUC19 sequence. You may obtain this sequence on the course's Web site.

   A. TGCGGTGTGAAATACCCT
   B. GCCATTCAGGCTGCGCAA
   C. GGGTTATTGTCTCATGAG
   D. GAGACAATAACCCTGATA

   Use a diagram, as shown below, to indicate the annealing positions of each of the primers. Each primer should be indicated by an arrow, where the head of the arrow represents the 3' end and the tail represents the 5' end. The direction of the arrow should indicate whether the primer is a Forward (→) or a Reverse (←) primer. The numbers above the arrows should represent positions on the template corresponding to the 5' end of the primer. Indicate in a legend to your figure all primer pairs, if any, that would give an amplification product of at least 200 bp.

   [Example diagram: pUC19 map with arrows labelled 814, 1514, 56, 908 and 2351]

2. Amplification and mutagenesis of GFP: In exercise 3 of the lab manual, you performed PCR to amplify and mutagenize the GFP gene using pGFPuv as the template sequence. Use the knowledge gained in bioinformatics to submit a figure which represents the region extending from position 200 to 1035 of the pGFPuv sequence (your figure should indicate a numbering of 1-836). Indicate on this figure the HindIII and EcoRI restriction sites and their positions. As in the previous question, use arrows to illustrate the positions of the "Forward" and "Reverse" primers used for the amplification and mutagenesis. The pGFPuv sequence is available on the Web page of this course, while those of the primers are listed on page 38 of your lab manual. In a legend to your figure, indicate the predicted size of the amplified product. Because both primers are incorporated into the product, this is represented by the distance between the 5' end of the Forward primer and the 5' end of the Reverse primer.
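When deciding which primer pairs in question 1 could give a product, the arithmetic can be sketched as below. The sketch rests on several assumptions that are not stated in the assignment: the template is treated as linear, each primer is described only by the template position of its 5' end, and the product is taken to run from the 5' end of the Forward primer to the 5' end of the Reverse primer, primers included. The function name and the positions used are hypothetical.

```python
def product_size(forward_5prime, reverse_5prime):
    """Predicted product length in bp, or None if the primers do not converge.

    Both arguments are 1-based template positions of the primers' 5' ends.
    Assumes a linear template and counts both end positions.
    """
    # A Forward primer extends to the right and a Reverse primer to the left,
    # so they only face each other if the Forward primer lies upstream.
    if reverse_5prime <= forward_5prime:
        return None
    return reverse_5prime - forward_5prime + 1

# Hypothetical positions, for illustration only.
size = product_size(100, 350)
print(size)
print(size is not None and size >= 200)  # meets the 200 bp threshold?
print(product_size(350, 100))            # primers point away from each other
```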