Identification of Putative Coding Regions in a 10 kb Sequence of the E. coli 536 Genome. By Someone. Partner: Someone else

Introduction:

The instructions required for the maintenance of life are encoded in an organism's DNA. DNA (deoxyribonucleic acid) is a double-stranded helix consisting of two strands arranged antiparallel to one another and connected through hydrogen bonding between complementary nitrogenous bases. Each DNA strand is a sequence of nucleotides, the basic units of DNA, each consisting of a deoxyribose sugar, a nitrogenous base (adenine (A), thymine (T), cytosine (C), or guanine (G)), and a phosphate group. To extend a DNA chain, a phosphodiester bond is formed between the 5' phosphate group of the incoming nucleotide and the 3' hydroxyl group of the previous nucleotide. Because each new nucleotide is always added to the free 3' end of the chain, DNA has 5' to 3' directionality. The two antiparallel strands are held together by hydrogen bonds between complementary bases: adenine pairs with thymine, and cytosine pairs with guanine.

The entirety of an organism's DNA is called its genome. A genome consists of both coding DNA, which contains genes that provide instructions for making the proteins required for cellular survival, and non-coding DNA, which does not code for protein but can play important roles in regulating gene expression. In protein-coding regions of DNA, putative genes can typically be identified by locating open reading frames (ORFs) within the DNA sequence. An ORF is a stretch of DNA that begins with the start codon ATG and ends with one of the stop codons TAG, TAA, or TGA. The sequence between the start and stop codons is transcribed into mRNA, which is then translated into protein.
Although identifying ORFs can be a reliable way of determining whether a DNA sequence codes for protein, the method is not foolproof. By random chance alone, three sequential nucleotides can form a start or stop codon without marking the beginning or end of a protein-coding region. Identifying ORFs is also challenging because there are several frames in which a sequence can be read. On a given strand, ORFs can be read in three different reading frames (0, 1, 2), defined by the displacement of the first nucleotide of the codon from the leftmost (5') end of the DNA sequence. An ORF can also lie on the complementary strand of the DNA and therefore have a reverse reading frame. To predict the location of protein-coding genes in an organism's genome more confidently, strategies other than looking for ORFs can be used as well. For instance, the regions around genes have been found to have a higher GC content (relative proportion of C and G compared to the other nucleotides in

the sequence), so identifying regions of high GC content could aid the identification of protein-coding genes. The goal of this project was to design a program that detects putative coding regions within a 10 kb sequence of the E. coli 536 genome by identifying potential ORFs and regions of high GC content in the sequence.

Identifying potential genes within an organism's genome is of great interest for many reasons. For instance, by comparing the sequence of the same gene in two different species and identifying the differences between the sequences, the degree of relatedness between the two species can be estimated. Additionally, by comparing sequences of unknown genes to those of known genes, the function of an unknown gene can be inferred. Finally, by looking at how genes are clustered and organized in the genome, the mechanism by which gene expression is regulated can be inferred. Therefore, a program that determines the location of potential genes within a given DNA sequence would be instrumental in elucidating how the organization of the genome relates to its function.

Methods:

a.) The program first reads a text file containing a DNA sequence and displays the sequence on the screen, with the 5' end beginning at the bottom left of the screen. Then, for each possible reading frame of the DNA sequence (0, 1, and 2), a bar is drawn above the DNA sequence. If a region of the sequence lies within an ORF (defined as a start codon (ATG) followed by a DNA sequence terminated with a stop codon (TAA, TAG, or TGA)), the bar above that region is colored blue; if the region does not lie within an ORF, the bar above it is colored red.
Next, a plot of the relative GC content (the number of G and C nucleotides divided by the total number of nucleotides) over many small regions of the DNA sequence (windows of 5 nucleotides) is generated above the DNA sequence and the bars representing the ORFs. In that plot, a constant red line marks a GC content of 50% (0.50), while the GC content of each small DNA window is plotted as a blue line.

b.) My program contains six functions: plot, bar, gcfreq, orf1, viewer, and main. Before the functions are defined, turtle is imported so that turtle graphics can be used to draw the sequence, ORFs, and GC plot, and the window height and width and the number of rows and columns of text are defined.

The plot function takes a turtle object (tortoise) and the values index (the position of the nucleotide in the DNA), value (the GC content in a window of DNA), and window (the size of the region of the DNA sequence being examined) as parameters. Using the tortoise, it plots the GC content (fraction) of the window of DNA ending at the current index (nucleotide) position.

The bar function takes a turtle object (tortoise) and the integer values index (the position of the nucleotide being examined) and rf (the current reading frame) as parameters. Using the tortoise, it draws a colored bar over the codon of DNA starting at the current index. The bar is shifted upward for each successive reading frame.

The gcfreq function takes the string dna, the integer value window (the size of the region of the DNA sequence being examined), and a turtle object (tortoise) as parameters. First, the tortoise draws a red line above the ORF bars across the entire DNA sequence. Next, the GC content of the first window of the DNA sequence is calculated and plotted above the ORFs and DNA sequence as a blue line. To accomplish this, the GC count is initialized to 0 and the pen color of the tortoise is changed to blue. A for loop then iterates over every character in the first window (the slice dna[:window]), and an if statement inside the loop increments the GC count whenever the character is a G or a C. The GC fraction of the window is calculated by dividing the GC count by the window size, and it is plotted by calling the plot function.

To determine the GC content of the remaining windows and plot each GC fraction above the corresponding region of the DNA sequence, a second for loop is used. This loop iterates over the remaining window start positions, beginning with index 1 (the second nucleotide). The len function supplies the number of characters in the DNA sequence, which sets the upper bound of the loop.
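The windowed GC calculation that gcfreq performs (an initial count over dna[:window], followed by an incremental update as the window slides one nucleotide at a time) can be sketched without turtle graphics. This is a minimal sketch and not the original program: the function name gc_fractions and its list return value are assumptions made for illustration, whereas the original plots each value instead of collecting it.

```python
def gc_fractions(dna, window):
    """Return the GC fraction of every window of length `window` in dna.

    Hypothetical helper for illustration; the original gcfreq plots each
    value with a turtle instead of returning a list.
    """
    dna = dna.upper()
    if window <= 0 or window > len(dna):
        return []

    # Count G and C in the first window, dna[:window].
    gc = 0
    for nt in dna[:window]:
        if nt in "GC":
            gc += 1
    fractions = [gc / window]

    # Slide the window one nucleotide at a time, updating the running count.
    for index in range(1, len(dna) - window + 1):
        if dna[index - 1] in "GC":           # nucleotide that left the window
            gc -= 1
        if dna[index + window - 1] in "GC":  # nucleotide that entered the window
            gc += 1
        fractions.append(gc / window)
    return fractions
```

Updating the running count costs two comparisons per window regardless of the window size, which is why the sliding update is preferable to recounting every window from scratch.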
Each new window begins at the next index, so it shares all but one nucleotide with the previous window: the nucleotides at positions index through index+window-2 were already counted in the previous window, the nucleotide at position index-1 has dropped out of the window, and the only nucleotide not yet examined is the one at position index+window-1. Rather than recounting the whole window, the loop therefore updates the running GC count. One if statement subtracts one from the GC count if the nucleotide at position index-1 was a G or a C, so that it no longer contributes to the count for the current window. A second if statement adds one to the GC count if the nucleotide at position index+window-1 is a G or a C. The GC fraction for the window starting at the current index is then calculated by dividing the current GC count by the window size, and the plot function is called to add that value to the GC plot above the DNA sequence. The loop continues through the index that begins the final window, len(dna)-window (so the loop's upper bound, which is exclusive in a Python range, is len(dna)-window+1).

The orf1 function takes a dna string, the integer value rf (representing the current reading frame being evaluated), and a turtle object (tortoise) as parameters. It first defines a variable, inorf,

which will be used to record whether or not a codon is within an ORF. Then, using a for loop that iterates over every third index of the DNA sequence (the first nucleotide of each codon) in the given reading frame, inorf is assigned True or False depending on whether the codon starting at the current index is determined to be in an ORF. Two if statements make this determination. Because ATG is the start codon, if the slice of the DNA sequence between positions index and index+3 (dna[index:index+3]) equals ATG, inorf is assigned the Boolean value True. If the slice between positions index-3 and index (dna[index-3:index], the previous codon) equals a stop codon (TAA, TAG, or TGA), inorf is assigned the Boolean value False. Following those two if statements is an if/else statement that changes the pen color of the tortoise to blue if inorf is True, or leaves the pen color red if inorf is False or None. Finally, the bar function is called to draw a red or blue bar (depending on whether inorf is True) over the codon starting at position index. Thus a red bar is drawn by default until a start codon is reached; inorf then becomes True and a blue bar is drawn until a stop codon is reached, after which inorf is set to False at the next index value and a red bar is drawn again.

The viewer function takes a dna string as a parameter. It first converts all of the characters in the DNA string to uppercase letters using the string's upper method. It then creates a tortoise and uses methods of the screen class (setup and setworldcoordinates) to generate a viewing screen.
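The codon-by-codon scan that orf1 performs can be sketched without the drawing calls. This is an illustrative sketch and not the original code: the name orf_flags and its return value are hypothetical, and the stop-codon check is placed before the start-codon check (the reverse of the order described above) on the assumption that an ATG immediately following a stop codon should still be marked as inside an ORF.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def orf_flags(dna, rf):
    """Return (index, in_orf) for each complete codon in reading frame rf.

    Hypothetical helper: the original orf1 draws a blue or red bar for each
    codon instead of returning these flags.
    """
    dna = dna.upper()
    in_orf = False
    flags = []
    # Visit the first nucleotide of every complete codon in this frame.
    for index in range(rf, len(dna) - 2, 3):
        # The previous codon was a stop codon, so the ORF ended there.
        if index >= 3 and dna[index - 3:index] in STOP_CODONS:
            in_orf = False
        # A start codon opens (or continues) an ORF.
        if dna[index:index + 3] == "ATG":
            in_orf = True
        flags.append((index, in_orf))
    return flags
```

As in the description above, the stop codon itself is still flagged as part of the ORF; the flag only returns to False at the codon after it.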
The function then displays the DNA sequence in a window by using a for loop that iterates over all of the indices of the string and writes out each character. Then, using a for loop, the viewer function calls the orf1 function three times to find the ORFs in each reading frame. Finally, the viewer function calls the gcfreq function to generate a plot of the GC fractions across the DNA sequence.

The main function is where the program begins. It opens and reads the desired DNA sequence file using the open and read methods, and then calls the viewer function. After the program has finished running, the DNA sequence, the three reading frames, and the GC plot should appear in the viewing window.

Results:

Figure 1. Identification of potential ORFs in a 10 kb sequence of the E. coli 536 genome. The DNA sequence begins at the bottom left and ends at the top right. Above the DNA sequence are bars representing the first, second, and third reading frames. A blue bar indicates a potential ORF, while a red bar indicates a non-coding region of DNA between ORFs. Above the colored bars representing the ORFs is a plot of the GC content of the DNA sequence over windows of five nucleotides in length.

After running the program, multiple putative ORFs were observed in the first and third reading frames, but not the second (Figure 1). In the first reading frame there were 10 putative ORFs, but the small size of some of them (30-90 nucleotides; ~10 to 30 amino acids) made it unlikely that they were true ORFs, as the proteins produced would have been very small (Figure 1). Two of the ORFs (7 and 9, with 1 being the first ORF observed) were longer (200 nucleotides), so they could potentially represent coding regions of the DNA (Figure 1).

In the third reading frame there were two potential ORFs (Figure 1). The first was very small, so it could be excluded as a potential ORF (Figure 1). The second was much longer than not only the first ORF in the third reading frame but also all of the ORFs in the other reading frames, with a length of approximately 2400 nucleotides (Figure 1). Using BLAST, it was determined that the sequence of this ORF matched that of a potential aspartokinase I/homoserine dehydrogenase gene (RID=FYY405M4014). Both the gene and the ORF were around 2400 nucleotides in length, with the gene being 2463 nucleotides long, and they were located at around the same position in the E. coli 536 genome (Figure 1): the ORF began at approximately nucleotide 300 in the program's display, and the gene is located between nucleotides 336 and 2798 in the E. coli 536 genome (Figure 1).

Also of interest was whether the GC content of a region of DNA could indicate that a gene was present. After running the program, it appeared that GC content was not very helpful for determining the location of a gene. The GC content fluctuated frequently in both putative ORF regions and non-coding regions (Figure 1). In some cases, though, such as the region around nucleotides 100 to 200 where there was no ORF in any of the three reading frames, the GC content dipped significantly below 0.5 (Figure 1).
This would make sense if low GC content excluded a region as an ORF and, therefore, as a gene. Also, in some regions where putative ORFs were predicted in both the first and third reading frames, the GC content stayed slightly higher than 0.5, which would be expected if high GC content indicated the presence of a gene (Figure 1). Unfortunately, after looking at the GC content at the beginning and end of each of the ORFs, no consistent pattern could be observed (Figure 1). Therefore, it seems that using GC content to determine the location of a gene is semi-reliable at best.

Overall, it appears that many considerations need to be taken into account when trying to determine the location of genes within a DNA sequence. As seen with this program, using ORFs alone to identify genes is unreliable, as many of the ORFs found were too small to plausibly be genes. To correct for this, a minimum length could be required before a sequence is reported as an ORF. GC content also did not appear to be a reliable way to determine the location of a gene. To better identify putative genes in a sequence, it may therefore be useful to use other potential indicators of a gene, such as promoter and enhancer sequences, in addition to ORFs and GC content.
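The minimum-length requirement suggested above could be sketched as follows. This is only a sketch under stated assumptions: the function name find_orfs, the (start, end) return format, and the default threshold of 300 nucleotides are all hypothetical choices for illustration and were not part of the program described in this report.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def find_orfs(dna, rf, min_length=300):
    """Return (start, end) index pairs of ORFs in reading frame rf.

    start is the index of the A in ATG and end is one past the stop codon.
    Only ORFs at least min_length nucleotides long are reported.
    min_length=300 is an arbitrary illustrative default, not a value
    taken from the report.
    """
    dna = dna.upper()
    orfs = []
    start = None  # index of the current ORF's start codon, if any
    for index in range(rf, len(dna) - 2, 3):
        codon = dna[index:index + 3]
        if start is None and codon == "ATG":
            start = index
        elif start is not None and codon in STOP_CODONS:
            end = index + 3
            if end - start >= min_length:
                orfs.append((start, end))
            start = None
    return orfs
```

Calling find_orfs once for each of the reading frames 0, 1, and 2 would give a list of candidates comparable to the blue bars in Figure 1, with the short ORFs filtered out.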