ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

In this exercise, you will get a quick primer on how DNA is used to manufacture proteins. You will learn a little about how the building blocks of these proteins may change due to point mutations. You will also investigate how a statistical description of these mutations can be used as evidence supporting hypothesized evolutionary paths.

Summary of genetic machinery

DNA sequences are composed of four bases: adenine (A), thymine (T), guanine (G), and cytosine (C). The two strands of DNA are composed of nucleotides with complementary bases, where A pairs with T and C pairs with G. An example of the two strands of a fragment of DNA is shown here:

DNA   (3' end) ...cacctacgttcaggaggtcaggactggtac... (5' end)
      (5' end) ...gtggatgcaagtcctccagtcctgaccatg... (3' end)

A DNA sequence is transcribed into mRNA starting at the 3' end of the DNA strand. Each nucleotide of the DNA strand attracts its complementary RNA base according to a pairing rule. mRNA is composed of A, G, and C bases, with uracil (U) substituted for thymine. The pairing rule is shown below:

DNA -> mRNA
 A  ->  U
 T  ->  A
 G  ->  C
 C  ->  G

DNA   (3' end) ...cacctacgttcaggaggtcaggactggtac... (5' end)
mRNA  (5' end) ...guggaugcaaguccuccaguccugaccaug... (3' end)

mRNA is translated into a sequence of amino acids by ribosomes. This process is initiated by a ribosome with the amino acid methionine attached to it. This ribosome latches onto the 5' end of the mRNA strand and moves downstream until it finds a sequence of three nucleotides, AUG, known as the start codon. This codon triggers the ribosome to start building a polypeptide with methionine as the first amino acid in the chain. The ribosome continues down the mRNA strand by binding to the next triplet (without re-using any of the mRNA nucleotides) and matching each subsequent triplet with a particular amino acid. This continues until the ribosome reaches a stop codon, which signals the end of the chain of amino acids that make up the particular protein coded by this genetic sequence. The triplets and their corresponding amino acids are displayed in the matrix on the following page.
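The DNA-to-mRNA pairing rule described above is easy to try out in code. The following is a minimal Python sketch, not part of the original exercise; the names PAIRING and transcribe are hypothetical, and the function assumes the template strand is written from its 3' end to its 5' end, as in the example above.

```python
# Pairing rule from the handout: each DNA base attracts its complementary RNA base.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the mRNA (read 5'->3') for a DNA template strand written 3'->5'.

    Assumption: the input contains only the letters A, T, G, C (any case).
    """
    return "".join(PAIRING[base] for base in template_3_to_5.upper())


# The DNA fragment used in the handout's example.
print(transcribe("cacctacgttcaggaggtcaggactggtac"))
# GUGGAUGCAAGUCCUCCAGUCCUGACCAUG
```

The output matches the mRNA strand shown in the example, written here in upper case.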

An alternate way of visualizing the codon-to-amino-acid assignments is the codon wheel:

The red letters in the codon wheel are a simple one-letter code for each of the twenty amino acids. This code can be used to express the sequence of amino acids that make up a single protein. The ribosome finds the beginning of the gene by looking for a start codon and then reads the mRNA sequence as a series of triplets until it reaches a stop codon.

             start codon
                  v
mRNA (5' end) GUGGAUGCAAGUCCUCCAGUCCUGACCAUG (3' end)

mRNA triplets   AUG      CAA   GUC   CUC   CAG   UCC   UGA
                (start)                                (stop)
Amino acids     M        Q     V     L     Q     S

Notice that the codon table (or wheel) shows two different mRNA codes for glutamine (Q), which differ in the third base of the triplet. Either a CAG or a CAA triplet will cause a glutamine to be added to the polypeptide chain. The codon assignments are redundant for some amino acids, but they are never ambiguous: each triplet codes for exactly one amino acid. Redundancy is a trait that protects the genetic code from some point mutations.

A point mutation is the accidental substitution of a single nucleotide with another nucleotide. A point mutation is responsible for sickle cell disease: a T is substituted for an A in a GAG triplet. The sequence GAG codes for glutamic acid, whereas the sequence GTG codes for valine. The substitution of one amino acid for another can cause the resulting protein to have a different shape and/or functionality. Point mutations that occur in the third base position are sometimes silent: they change the genetic code but do not change the sequence of amino acids in the resulting protein. Point mutations in the first and second positions are much more likely to change the structure of a protein.
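The reading procedure and the idea of a silent mutation can both be checked with a short script. This is a sketch under stated assumptions, not part of the original exercise: the names CODON_TABLE, translate, and is_silent are hypothetical, the table is the standard genetic code written in mRNA letters, and the function simply starts at the first AUG it finds.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Scan 5'->3' for the first AUG, then read triplets until a stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def is_silent(codon_before, codon_after):
    """True if two mRNA codons code for the same amino acid."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]


print(translate("GUGGAUGCAAGUCCUCCAGUCCUGACCAUG"))  # MQVLQS
print(is_silent("CAA", "CAG"))  # True: both triplets code for glutamine (Q)
print(is_silent("GAG", "GUG"))  # False: glutamic acid (E) becomes valine (V)
```

The last line is the sickle cell substitution written in mRNA letters, so the DNA triplet GTG appears as GUG.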

Properties of amino acids

This Venn diagram relates some of the properties of amino acids. Substitutions of amino acids with similar properties are less likely to change the function of the protein than those whose properties are very different.

Evolution and bioinformatics

Bioinformatics is the study of genetics and molecular biology using tools derived from the fields of computer science and information technology. By studying many samples of genetic material from a single species, it is possible to estimate the probability that mutations to the genetic code will cause one amino acid in a protein to be changed to a different amino acid. This can be quantified by a substitution matrix. In the matrix below, each entry is a log-odds score, which means that it is the log of the likelihood of a certain amino acid substitution taking place in a fixed number of generations. A positive number represents a substitution that is relatively common. A negative number represents one that is less common. These numbers are known as PAM scores (for Point Accepted Mutation).

Consider a sequence of amino acids from each of two related proteins:

sequence 1: PEYDLLV
sequence 2: PERDILV

To get from sequence 1 to sequence 2, the first two amino acids must be conserved, the third substituted, the fourth conserved, etc. These two sequences would be scored as follows:

sequence 1:   P   E   Y   D   L   L   V
sequence 2:   P   E   R   D   I   L   V
PAM score:   +7  +5  -2  +6  +2  +4  +4  = 26

A third sequence might get a PAM score of 24 when compared to sequence 1. The lower PAM score suggests that the third sequence is less likely than sequence 2 to have evolved from sequence 1 over a given amount of time.

A simplified look at the study of evolution using bioinformatics

1. Isolate and decode a particular gene in two different species. This is not always as easy as it sounds: the genes may have different lengths and various insertions, deletions, and mutations.
2. Align the base pairs for the two genes.
3. Determine the sequence of amino acids coded by the genes.
4. Calculate a quantitative measure of the differences between the proteins produced by the two genes.
5. Assess this difference in evolutionary terms.

The average protein in the human body is composed of about 400 amino acids, but there might be tens of thousands of base pairs in the DNA sequence that produces this protein. Analyzing two alleles of a single gene would be virtually impossible to do by hand, and there are more than 20,000 genes in the human genome. Most of the raw work of bioinformatics must therefore be done by computers.
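As a small illustration of handing this work to a computer, the PAM comparison of PEYDLLV and PERDILV can be scripted. This is only a sketch: the names EXAMPLE_SCORES and pam_total are hypothetical, and the dictionary holds just the seven scores quoted in the worked example rather than the full PAM table. It also assumes the table is symmetric, so a pair can be looked up in either order.

```python
# Scores quoted in the handout's worked example (not the full PAM table).
EXAMPLE_SCORES = {
    ("P", "P"): 7, ("E", "E"): 5, ("Y", "R"): -2, ("D", "D"): 6,
    ("L", "I"): 2, ("L", "L"): 4, ("V", "V"): 4,
}


def pam_total(seq1, seq2, scores=EXAMPLE_SCORES):
    """Sum the substitution scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must already be aligned to the same length")
    total = 0
    for a, b in zip(seq1, seq2):
        # Assumption: the table is symmetric, so try the pair in both orders.
        if (a, b) in scores:
            total += scores[(a, b)]
        else:
            total += scores[(b, a)]  # raises KeyError for pairs not in the table
    return total


print(pam_total("PEYDLLV", "PERDILV"))  # 26
```

With the complete substitution matrix loaded into the dictionary, the same function would score any pair of equal-length sequences.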

Genetic Sequence Alignment

One of the more difficult and time-consuming tasks involved in the study of genetic information is deciding how two sequences should be aligned for comparison. Evolution of the genetic code includes not only point mutations, but also dropped and inserted information. The following notes describe how you can use PAM scores to decide how sequences of amino acids in two different species should be compared.

To solve alignment problems, we use the PAM table from the previous section. This table gives a relative measure of the likelihood of a substitution between amino acids due to point mutations. We will now consider one more type of mutation: an insertion or deletion of one or more amino acids. Each instance of an insertion or deletion has an equivalent PAM score of -5. Using this information, and given two sequences of possibly different lengths, one must decide on the most likely alignment of the amino acids. For example:

sequence 1: PEYLLV
sequence 2: PERDILV

To align these two sequences, we must take into account a deletion of a single codon in the DNA code of sequence 1 or, equivalently, an insertion of a single codon in sequence 2. Several alignments might reasonably be considered:

sequence 1:   P   E   Y   -   L   L   V
sequence 2:   P   E   R   D   I   L   V
PAM score:   +7  +5  -2  -5  +2  +4  +4  = 15

sequence 1:   P   E   -   Y   L   L   V
sequence 2:   P   E   R   D   I   L   V
PAM score:   +7  +5  -5  -3  +2  +4  +4  = 14

sequence 1:   P   E   Y   L   -   L   V
sequence 2:   P   E   R   D   I   L   V
PAM score:   +7  +5  -2  -4  -5  +4  +4  = 9

The highest PAM score suggests that the first alignment is the most likely to arise from evolutionary processes.

There is a tabular solution technique that can simplify the alignment process. To use this technique, we first write one sequence at the top of the table and the other along the left-hand side:

0   P        E        Y        L        L        V
P
E
R
D
I
L
V

In the upper left-hand corner of the table, the 0 is a starting point for calculating the relative probability of producing one of these sequences from the other through point mutations, insertions, or deletions. The goal is to sum the relative probabilities of each link in the replication/mutation sequence that leads to the bottom right corner of the table. The most probable path through this table is usually the best alignment of the two sequences.

The first amino acid in each sequence is proline (P). The relative probability of an accurate replication is given by the PAM score for a P to P replication, +7. This is the most likely event, so we add this PAM score to our current score of 0 (from the cell diagonally above and to the left of the current one). We enter the new sum in the corresponding cell, 0+7 = 7.

0   P        E        Y        L        L        V
P   0+7
E
R
D
I
L
V

Moving down in the table indicates that the vertical sequence has an insertion (PAM score = -5) compared to the horizontal sequence. Moving to the right indicates a deletion (PAM score = -5). Moving diagonally down and to the right indicates that the next two amino acids represent either an accurate duplication or a point mutation. In our case, this would be an accurate duplication of E (PAM score = +5). This last option is more probable (higher PAM score), as indicated in the table below.

0   P        E        Y        L        L        V
P   7        7-5=2
E   7-5=2    7+5=12
R
D
I
L
V

We continue this process for each cell in our table, choosing the most probable outcome and retaining some of the neighboring PAM values for comparison purposes.

0   P        E        Y        L        L        V
P   7        2
E   2        12       12-5=7
R            12-5=7   12-2=10
D
I
L
V

The alignment should generally follow a path to the right and downward. Inspection of the two sequences shows that the last amino acids should match up directly. So, the task is to find the most probable path from the upper left corner to the lower right corner. We will continue to choose the most likely outcome at each step to see where that takes us. The next step is shown below:

0   P    E    Y    L    L    V
P   7    2
E   2    12   7
R        7    10   5
D             5    6
I
L
V

And, continuing to the end of the sequence, we get the following:

0   P    E    Y    L    L    V
P   7    2
E   2    12   7
R        7    10   5
D             5    6    1
I                  1    8    3
L                       3    9
V                            4

This path has a net PAM score of +4. However, this is only one of many possible paths through the table. Another one is shown below:

0   P    E    Y    L    L    V
P   7    2
E   2    12   7
R        7    10   5
D             5    6
I                  1
L                       5
V                            9

Yet another path through this table is shown below. It corresponds to the most probable alignment that we found previously, with a net PAM score of +15.

0   P    E    Y    L    L    V
P   7    2
E   2    12   7
R        7    10   5
D             5    0
I             0    7
L                       11
V                            15

This path formulation allows more mathematically oriented scientists to use graph theory to solve for the most probable alignment between any two sequences. However, that optimization problem is beyond our present needs. A brute force approach will also work: try all possible paths through the table and choose the one with the highest PAM score.
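For readers curious about the optimization, here is a minimal Python sketch of one standard way to search the table: a dynamic-programming fill in which every cell keeps the best of its three possible moves, followed by a trace back from the bottom right corner. This is a swapped-in technique, not the step-by-step path following used above. The names KNOWN_SCORES, PLACEHOLDER, pair_score, and align are hypothetical. KNOWN_SCORES holds only the scores quoted in this handout, and PLACEHOLDER is an assumed stand-in (not a PAM value) for every pair the handout does not quote. The sketch also adds a border row and column of gap scores so that an alignment may begin with an insertion or deletion.

```python
# Substitution scores quoted in the handout's worked examples.
KNOWN_SCORES = {
    ("P", "P"): 7, ("E", "E"): 5, ("D", "D"): 6, ("L", "L"): 4, ("V", "V"): 4,
    ("Y", "R"): -2, ("Y", "D"): -3, ("L", "I"): 2, ("L", "D"): -4,
}
GAP = -5          # insertion/deletion score from the handout
PLACEHOLDER = -8  # ASSUMPTION: stand-in for pairs not quoted; not a real PAM value


def pair_score(a, b):
    """Look up a score in either order, falling back to the placeholder."""
    return KNOWN_SCORES.get((a, b), KNOWN_SCORES.get((b, a), PLACEHOLDER))


def align(top, side):
    """Fill the table for `top` (columns) and `side` (rows), then trace back."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):            # border row: leading gaps
        table[0][j] = table[0][j - 1] + GAP
    for i in range(1, rows):            # border column: leading gaps
        table[i][0] = table[i - 1][0] + GAP
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + pair_score(top[j - 1], side[i - 1]),  # diagonal
                table[i - 1][j] + GAP,                                      # down
                table[i][j - 1] + GAP,                                      # right
            )
    # Trace back from the bottom right corner to recover one best path.
    i, j = len(side), len(top)
    out_top, out_side = [], []
    while i > 0 or j > 0:
        came_diagonally = (
            i > 0 and j > 0
            and table[i][j] == table[i - 1][j - 1] + pair_score(top[j - 1], side[i - 1])
        )
        if came_diagonally:
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + GAP:
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
        else:
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
    return table[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_side))


score, row1, row2 = align("PEYLLV", "PERDILV")
print(score)  # 15
print(row1)   # PEY-LLV
print(row2)   # PERDILV
```

With these assumed scores the search recovers the first alignment considered earlier, with its net PAM score of +15. A real analysis would replace the placeholder with the complete PAM table, and the result could change for other sequences.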