BLAST Basics. ... Elements of Bioinformatics Spring, Tom Carter. tom/


1 BLAST Basics Elements of Bioinformatics Spring, 2003 Tom Carter tom/ March,

2 Sequence Comparison One of the fundamental tasks we would like to do in bioinformatics is to compare two sequences of nucleotides or amino acids. In general, if two sequences are similar, we can hope (or presume?) that the two sequences (molecules) have similar biological functions, or share (meaningful) evolutionary history, or both. We could look at the global similarity of two sequences (according to some probability model) to get an overall statistic relating the two sequences, or we could look at local areas of similarity between the two sequences, considering two sequences to be similar if they share similar subsequences. In what follows, we will look at an approach to local sequence comparisons. 2

3 Basic model for local sequence comparisons
Our general model for the similarity of molecular biological sequences is common origin, and therefore we will imagine that one sequence can be derived from the other via a succession of basic changes or operations (i.e., mutations). These basic operations are:

Substitution: this change replaces a single nucleotide or amino acid by another. Thus, for example, the sequence AGCTTTTCAT results in the sequence AGCATTTCAT via the substitution of an A for the T in the fourth location.

4 Insertion: this adds a single nucleotide or amino acid in the sequence. The sequence AGCTTTTCAT results in the sequence AGACTTTTCAT via the insertion of an A after the G.

Deletion: this removes a single nucleotide or amino acid in the sequence. The sequence AGCTTTTCAT results in the sequence ACTTTTCAT via the deletion of the G.
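The three basic operations can be sketched as small string functions. This is only an illustration of the examples given on the slides; the function names and the 1-based position convention are assumptions made here, not part of the original material.

```python
def substitute(seq, pos, residue):
    """Replace the residue at 1-based position pos with a new residue."""
    return seq[:pos - 1] + residue + seq[pos:]


def insert_after(seq, pos, residue):
    """Insert a residue immediately after 1-based position pos."""
    return seq[:pos] + residue + seq[pos:]


def delete(seq, pos):
    """Remove the residue at 1-based position pos."""
    return seq[:pos - 1] + seq[pos:]


original = "AGCTTTTCAT"
print(substitute(original, 4, "A"))    # A for the T in the fourth location
print(insert_after(original, 2, "A"))  # A inserted after the G
print(delete(original, 2))             # the G deleted
```

Applied to AGCTTTTCAT, the three calls reproduce the three example sequences given for substitution, insertion, and deletion.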

5 We are then interested in developing probability models to describe biological sequences. We will also be interested in models to describe the conversion of one sequence into another via a succession of the three basic operations.

There are a variety of possible approaches to developing such a probability model. The general approach is to start with a very simple model, and then add additional features to the model to get a better description.

In the simplest model, the Random model (R), we could assume, for example, that a DNA sequence arises from a random process whereby each nucleotide is equally likely to occur (i.e., each of A, C, G, T has probability 1/4 of occurring), and thus the probability of observing a given specific sequence of n nucleotides would be $(1/4)^n$.

6 More generally, given two nucleotide or amino acid sequences x and y, we can look at the joint probability of observing the pair of sequences (assuming the Random model R):

$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$

where $q_{x_i}$ is the probability of occurrence of the ith nucleotide in sequence x, etc.

What we want to do is build an alternative probability model which reflects evolutionary history (or at least biochemistry). Let us call this alternative model M. In a first version of this model, we assume that the two sequences have the same length, and think of aligning the two sequences. We can then look at the joint probability of the occurrence of the two nucleotides or amino acids in each position, assuming that they derive from a common ancestor at some time in the past via substitution. We would then have, for each pair (a, b) of nucleotides or amino acids

7 a probability $p_{ab}$ that the given pair resulted from substitutions from a common ancestor c. This would give us a probability of observing the pair of sequences x and y given by:

$$P(x, y \mid M) = \prod_i p_{x_i y_i}.$$

We then want to develop a way of estimating the likelihood that similarity between a pair of sequences reflects actual biological similarity, or is just a random occurrence. We can calculate the ratio of the two probabilities as an odds ratio:

$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}.$$

8 We can convert this into an additive system by taking logarithms, so that we can calculate by adding up a log-odds ratio:

$$S = \sum_i s(x_i, y_i).$$

In this formula,

$$s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

is the log-likelihood ratio of the pair (a, b) resulting from a substitution from a common ancestor rather than just by a random occurrence.

The next step, then, would be to build a table (an array) of s(a, b) log-likelihoods for all possible pairs of nucleotides or amino acids. On the next page is an example of such a table for amino acids, called the BLOSUM-62 Clustered Scoring Matrix in 1/2 Bit Units.
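The relation between the odds ratio and the additive log-odds score can be checked numerically. The sketch below uses a toy DNA model; the background frequencies `q` and the pair probabilities `p` are hypothetical values chosen only for illustration, and base-2 logarithms are an assumption (any base works, it only sets the units).

```python
import math

# Hypothetical parameters, chosen only for illustration (not from the slides).
# q: background frequency of each nucleotide under the Random model R.
# p: joint probability of each aligned pair under the alternative model M.
q = {base: 0.25 for base in "ACGT"}
p = {(a, b): (0.16 if a == b else 0.03) for a in "ACGT" for b in "ACGT"}


def s(a, b):
    """Log-odds score (in bits) for one aligned pair."""
    return math.log2(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x, y):
    """Sum the per-position log-odds scores of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("this first version of the model needs equal lengths")
    return sum(s(a, b) for a, b in zip(x, y))


def odds_ratio(x, y):
    """P(x, y | M) / P(x, y | R), computed as a product over positions."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= p[(a, b)] / (q[a] * q[b])
    return ratio


print(alignment_score("ACGT", "ACGT"))  # identical sequences score positive
print(alignment_score("ACGT", "ACTT"))  # one mismatch lowers the score
```

The logarithm of `odds_ratio(x, y)` equals `alignment_score(x, y)` up to floating-point error, which is the reason the scores in a substitution table can simply be added along an alignment.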

9 BLOSUM Clustered Scoring Matrix in 1/2 Bit Units. Cluster Percentage: >= 62.

[Table: scoring matrix with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V B Z X *. The numeric entries, and the entropy and expected-score values, did not survive in this copy.]

Blocks Substitution Matrices for Protein Sequence Comparisons (Blosum)

10 Basics of the BLAST algorithm Once we have a scoring matrix (such as the Blosum-62 matrix), we can develop the BLAST algorithm. BLAST stands for Basic Local Alignment Search Tool. The fundamental approach is to develop a similarity score between pairs of sequences. This is done by finding (short) local very close matches between sections of a given sequence and the target, or comparison, sequence. Typically these short matches are about 3 amino acids for proteins, and about 11 nucleotides for nucleic acids. This short match is then extended by one residue at a time to find the local alignment with the maximum score. This score can then be compared with other scores for other sequences to find the best match. 10

11 In a more general form of the algorithm, we can include the possibility of a gap in which we don't require a match. Typically we will include a penalty in our scoring for opening a gap, and an additional penalty for each residue added to the gap. These will then be included in the score for the local alignment match.

In practice, we often want to BLAST a given sequence against a whole database of sequences. In order to make this process reasonably fast, we can start by building a table of the short sequences that occur in the database. Our given sequence is then matched against the table of short sequences, which then refers the algorithm to appropriate sequences in the database for possible extensions. We can keep a running total of the best scores we have seen so far, and abandon extensions which are not as good as what we have already found.
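A minimal sketch of the word-table idea described above, under simplifying assumptions: exact word matches only, a flat match/mismatch score instead of a substitution matrix, no gaps, and a hypothetical drop-off rule for abandoning an extension. The function names and parameters are illustrative and are not those of any actual BLAST implementation.

```python
from collections import defaultdict


def build_word_table(database, k):
    """Map each k-letter word to the (sequence id, offset) pairs where it occurs."""
    table = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append((seq_id, i))
    return table


def extend(query, target, qi, ti, k, match=1, mismatch=-1, drop=3):
    """Extend an exact k-letter seed in both directions without gaps.

    In each direction the walk stops once the running score has fallen
    `drop` below the best score seen so far in that direction.
    """
    def walk(q_pos, t_pos, step):
        best = running = 0
        while 0 <= q_pos < len(query) and 0 <= t_pos < len(target):
            running += match if query[q_pos] == target[t_pos] else mismatch
            if running > best:
                best = running
            elif best - running >= drop:
                break
            q_pos += step
            t_pos += step
        return best

    seed_score = k * match
    return seed_score + walk(qi + k, ti + k, 1) + walk(qi - 1, ti - 1, -1)


def search(query, database, k=3):
    """Return the best ungapped extension score for each database sequence hit."""
    table = build_word_table(database, k)
    best = {}
    for qi in range(len(query) - k + 1):
        for seq_id, ti in table.get(query[qi:qi + k], []):
            score = extend(query, database[seq_id], qi, ti, k)
            if score > best.get(seq_id, 0):
                best[seq_id] = score
    return best


database = {"s1": "TTACGTACGTTT", "s2": "GGGGGGGG"}
print(search("ACGTACG", database))
```

Only sequences that share at least one word with the query are ever examined, which is what makes the table lookup faster than aligning the query against every database sequence in turn.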

12 Building the BLOSUM matrices The BLOSUM scoring matrices were developed from a database of typical blocks of amino acids observed in proteins. First, the blocks database was clustered at a given percentage level. In other words, blocks that were at least the given percentage identical were clustered together to give a typical example of a cluster of similar proteins. From each cluster, a representative amino acid sequence was developed. The entries in the BLOSUM matrix are then calculated as the actual frequency of occurrence of the amino acid pair in the clustered blocks database, divided by the expected probability of occurrence. The expected value is calculated from the frequency of occurrence of each of the two 12

13 individual amino acids in the blocks database, which gives an estimate of a chance (random) alignment of the two amino acids. The actual/expected ratio is expressed as a log-odds score in so-called halfbit units. These units are obtained by taking the base 2 logarithm of the ratio, and then multiplying by 2. A zero score means that the frequency of the amino acid pair in the database was the same as a chance alignment, a positive score that the pair was found more often than by chance, and a negative score that the pair was found less often than by chance. The accumulated score of a given alignment of several amino acids in two sequences is calculated by adding up the respective scores of each individual pair of amino acids in the alignment. The Blosum matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino 13

14 acid patterns, called blocks. These blocks come from a database of protein sequences representing over 500 groups of related proteins, which can act as signatures for protein families. The Blosum matrices are based on a different principle and a larger data set than the Dayhoff PAM (percent accepted substitution) matrices, which are derived from the observed rate of mutation during predicted evolutionary changes in a relatively small number of protein families. To build the blosum matrices, the sequences of the proteins in 500 families were aligned in the regions defined by the blocks. Each column in the aligned sequences then provided a set of possible amino acid substitutions. For this analysis, it is assumed that the probability of change from amino acid X to amino acid Y is the same as the probability of the reverse change from Y to X, and thus the resulting matrix is symmetric. 14
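The half-bit scoring described earlier for BLOSUM entries can be written out directly. In this sketch the observed and expected frequencies are hypothetical numbers, the residue pairs are arbitrary, and rounding to the nearest integer is an assumption made here to produce integer table entries.

```python
import math


def half_bit_score(observed, expected):
    """Log-odds score in half-bit units: 2 * log2(observed / expected), rounded."""
    return round(2 * math.log2(observed / expected))


# Hypothetical pair frequencies, for illustration only.
toy_matrix = {
    ("L", "L"): half_bit_score(0.04, 0.01),    # seen 4 times more often than chance
    ("L", "D"): half_bit_score(0.0025, 0.01),  # seen 4 times less often than chance
    ("A", "S"): half_bit_score(0.01, 0.01),    # seen exactly as often as chance
}

# The score of an alignment is the sum of the scores of its aligned pairs.
aligned_pairs = [("L", "L"), ("A", "S"), ("L", "D"), ("L", "L")]
print(sum(toy_matrix[pair] for pair in aligned_pairs))
```

A ratio of 1 gives a score of 0, ratios above 1 give positive scores, and ratios below 1 give negative scores, matching the interpretation of zero, positive, and negative entries given above.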

15 The various substitutions were then tabulated for all the aligned patterns in the database. More common substitutions should represent a closer relationship between two amino acids in related proteins, and thus should show a higher score in sequence alignment. Rare substitutions will show lower scores. This approach, however, can result in too high a representation of amino acid substitutions which occur in the most closely related members of the protein families. To reduce this dominant contribution from the most similar proteins, the sequences of the most similar proteins were grouped together into a single sequence before scoring the amino acid substitutions in the aligned blocks. The amino acid changes within these clustered sequences were then averaged. Patterns with 60% agreement were grouped together to make one substitution matrix called blosum60, and those 80% alike to make the blosum80 matrix, etc. 15

16 As the clustering percentage was increased, the ability of the resulting matrix to distinguish actual from chance alignments also increased. This discriminating capability of the scoring system depends on the relative entropy, or average information content per residue pair. However, at the same time, the dominance effect of the most similar proteins also increases, which biases the matches. Blosum62 represents a generally reasonable balance between information content and match bias and is therefore often used as the default matrix for predicting alignments among typical protein families.

References

Henikoff S. and Henikoff J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:

Henikoff S. and Henikoff J.G. (1993). Performance evaluation of amino acid substitution matrices. Proteins 17:

17 Building the PAM matrices
The PAM scoring / substitution matrices are intended to reflect actual changes in sequences over evolutionary time. The general idea is to compare sequences for which we (believe we) can infer an evolutionary history.

PAM stands for Point Accepted Mutation. The idea is that we build a scoring matrix according to point (substitution) mutations that have actually been accepted by evolution. The original PAM matrices were developed by M. O. Dayhoff.

The steps for building the original PAM matrices involved:

Gather sequences and align pairs with at least 85% agreement, being sure to minimize ambiguity and number of coincident mutations.

18 Build phylogenetic trees, inferring ancestral sequences. Here is an example of what an inferred phylogenetic tree might look like, with amino acid substitutions indicated:

                    ABIH
                   /    \
              I-G /      \ J-H
                 /        \
              ABGH        ABIJ
             /    \      /    \
        B-C /  A-D \ B-D/  A-C \
           /        \  /        \
        ACGH      DBGH ADIJ     CBIJ

19 Count the residue replacements accepted by natural selection. We let $A_{ij}$ be the number of times amino acid i was replaced by amino acid j. We take $A_{ii}$ to be 0.

For each amino acid j, compute the relative mutability. Call this $m_j$. We compute the relative mutability by counting the number of changes of an amino acid between aligned sequences, divided by the number of occurrences of the amino acid. These numbers are then scaled to the number of replacements of the given amino acid per 100 residues in the alignments.

20 For example:

Aligned sequences          A C A
                           A C B

Amino acids                A      B      C
Number of changes          1      1      0
Frequency of occurrence    3      1      2
Relative mutability        0.33   1      0

The relative mutabilities found by Dayhoff were (with Ala arbitrarily set to 100):

Ser 149    Asp 90    His 50
Met 122    Thr 90    Phe 45
Asn 111    Gap 84    Arg 44
Ile 110    Val 80    Leu 38
Glu 102    Gly 48    Tyr 34
Ala 100    Lys 57    Cys 27
Gln 98     Pro 56    Trp 22
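The small worked example (aligned sequences A C A and A C B) can be reproduced with a short function. This is a sketch of the counting rule as stated on the previous slide, changes divided by occurrences, for a single pair of aligned sequences; the function name is hypothetical and the scaling to replacements per 100 residues is left out.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Changes of each amino acid divided by its number of occurrences."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatched column counts as one change for each residue in it.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in sorted(occurrences)}


print(relative_mutability("ACA", "ACB"))
```

Here A changes once in three occurrences, B changes once in its single occurrence, and C never changes.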

21 We now compute the Mutation Probability Matrix (called one PAM of evolutionary distance) by the formulae

$$M_{ij} = \frac{m_j A_{ij}}{\sum_i A_{ij}}, \quad (i \neq j),$$

and

$$M_{jj} = 1 - m_j.$$

This matrix is symmetric, and the sum of any row (or column) is one:

$$\sum_i M_{ij} = \sum_j M_{ij} = 1.$$

Count the number $f_i$ of occurrences of each amino acid i, and then form the relatedness odds matrix:

$$R_{ij} = \frac{M_{ij}}{f_i}.$$

We can then calculate the log-odds scoring matrix by

$$S_{ij} = \log(R_{ij}).$$
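The construction can be followed step by step on a toy three-letter alphabet. All numbers below (the count matrix `A`, the mutabilities `m`, and the frequencies `f`) are hypothetical, `m` is assumed to be already scaled to one PAM, and the base-10 logarithm is an assumption, since the slide writes only "log".

```python
import math


def mutation_probability_matrix(A, m):
    """M[i][j] = m[j] * A[i][j] / sum_i A[i][j] for i != j, and M[j][j] = 1 - m[j]."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(A[i][j] for i in range(n))
        for i in range(n):
            if i != j:
                M[i][j] = m[j] * A[i][j] / column_total
        M[j][j] = 1.0 - m[j]
    return M


def log_odds_matrix(M, f):
    """Relatedness odds R[i][j] = M[i][j] / f[i], then S[i][j] = log10(R[i][j])."""
    n = len(M)
    return [[math.log10(M[i][j] / f[i]) for j in range(n)] for i in range(n)]


# Hypothetical data for a three-letter alphabet.
A = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]            # accepted replacement counts, with A[i][i] = 0
m = [0.010, 0.020, 0.005]  # relative mutabilities, assumed already scaled
f = [0.5, 0.3, 0.2]        # frequencies of occurrence

M = mutation_probability_matrix(A, m)
S = log_odds_matrix(M, f)
for j in range(3):
    print(sum(M[i][j] for i in range(3)))  # total probability in column j
```

The printed totals confirm that, for each amino acid j, the probabilities of staying unchanged or being replaced add up to one.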

22 Some properties of the Mutation Probability Matrix: The Mutation Probability Matrix M 1 defines a unit of evolutionary change that is, 1 PAM is an average of one Accepted Point Mutation per 100 residues. Note that there is no obvious direct connection between this rate of change and real (million years, or whatever) rates of evolutionary change. We can use M1 to simulate evolutionary change. Given an amino acid sequence, we use a random number generator to apply M1 to each residue in the sequence. M1 is scaled so that for an average amino acid sequence, applying M1 to each residue once will result, on average, in one amino acid change per 100 residues. We could repeat this process, to get 2, 3, 4,... PAMs of simulated evolutionary change. 22

23 The following are equivalent:

Successive applications of $M_1$ to a sequence n times.

Matrix multiplication of $M_1$ times itself n times, and applying the resulting $M_1^n$ to the sequence.

If the $m_j$ are all quite small, then the following is also approximately equivalent:

Scaling the elements of $M_1$ according to the formulae

$$M_{ij} = \frac{n\, m_j A_{ij}}{\sum_i A_{ij}}, \quad (i \neq j),$$

and

$$M_{jj} = 1 - n\, m_j,$$

and applying the resulting matrix to the sequence.

The last formulae give an easy way to approximate the matrix for any desired

24 PAM distance. Thus, for example, we could build a PAM100 or PAM250 matrix. What would be the difference between using a PAM100 matrix versus a PAM250 matrix for similarity scoring between protein sequences? Why might one choose one or the other? One way to think about this is that PAM100 corresponds with an average of 100 substitutions per 100 amino acids in the sequence (and PAM250 corresponds to 250 substitutions per 100 amino acids). At first, this may not seem very meaningful. How could there be an average of 250 substitutions per 100 amino acids? In general, we want to think of a scoring matrix as a way to measure the evolutionary distance between two 24

25 sequences. We can think of the PAM matrices as telling us about evolutionary distance in terms of PAM units. One PAM unit is the length of time it takes for a sequence to experience an average of one (accepted) mutation per every 100 amino acids in the sequence. Remember that most mutations will not confer a survival advantage (or in fact will have a deleterious effect) on the organism, and hence will not be (differentially) reproduced, and hence will not be accepted by evolution. Thus, PAM1 matrix scoring of sequence similarity will be sensitive to sequences that have diverged from each other (have a common ancestor) in the recent past (within 1 PAM). It will be much less sensitive to similarities between sequences that have diverged longer ago than 1 PAM. (Note: 1 PAM in years may be different for different families of proteins, or different regions of a protein. Why?) 25
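The earlier statement that repeated application of M1 corresponds to a matrix power, and that simple scaling approximates it when the mutabilities are small, can be checked numerically. The 3 x 3 matrix below is a hypothetical one-PAM-like matrix (each column sums to one), not Dayhoff's data, and the helper names are illustrative.

```python
def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Multiply M by itself n times, starting from the identity matrix."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def linear_scaling(M1, n):
    """Approximation: off-diagonal entries times n, diagonal 1 - n * m_j."""
    size = len(M1)
    return [[1 - n * (1 - M1[i][j]) if i == j else n * M1[i][j]
             for j in range(size)] for i in range(size)]


# Hypothetical matrix with small mutabilities (1%, 2%, 0.5%).
M1 = [[0.9900, 0.0120, 0.0017],
      [0.0075, 0.9800, 0.0033],
      [0.0025, 0.0080, 0.9950]]

exact = mat_power(M1, 2)
approx = linear_scaling(M1, 2)
largest_gap = max(abs(exact[i][j] - approx[i][j])
                  for i in range(3) for j in range(3))
print(largest_gap)
```

For small n the two agree closely; the gap grows with n, so the matrix power is the safer choice for large distances such as 250 PAM, where the scaled diagonal entries would even become negative.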

26 PAM100 (or PAM250) will be sensitive to similarities between sequences that have diverged within the past 100 PAM (or 250 PAM). In general, we can expect there to have been significantly more divergence between sequences in 100 PAM or 250 PAM than in 1 PAM. We can think of the PAM1, PAM100, and PAM250 scoring matrices as similarity measuring tools with differing levels of focus. In order for two sequences to have a relatively high PAM1 similarity score, they must have diverged very recently. PAM1 would in general give a low score to a pair of sequences that diverged 50 PAM ago. On the other hand, PAM100 scoring could still give a reasonably high similarity score to two such sequences (as could PAM250). Thus we can think of PAM100 as being sensitive to a broader range of sequences similarities than PAM1, and PAM250 a still broader range. 26

27 Thus, if we BLAST a sequence against a protein database with PAM1, we would expect relatively few sequences in the database to return a high score (only those that diverged within about 1 PAM). On the other hand, if we BLAST with PAM250, we would expect many more sequences to return a high score (including many that diverged more than 1 PAM ago, but less than 250 PAM ago). Suppose we had two sequences that had a high PAM1 similarity score. What would we expect them to look like? They would be nearly identical. We would expect them to differ only by about one residue per 100 (i.e., we would expect about 99 out of 100 residues to be the same). Suppose we had two sequences that had a relatively high PAM250 similarity score. What would we expect them to look like? Said another way, suppose we applied M1 27

28 to a sequence 250 times. How different would we expect it to look? Suppose we follow individual residues through 250 successive applications of M1. What sorts of things could happen? A residue could just stay the same through all 250 steps (on average, we could expect about 8% of the residues not to have changed at all). It could have changed somewhere along the way, and then at some later step changed back (remember that M1 is symmetric). Thus, a certain portion of the residues would be the same as they started. Other residues would have changed, but the changes would be expected to reflect the probabilities in the M1 matrix. In some sense, we could think of PAM1 as representing a relatively small cloud of sequences around a given sequence. PAM100 would represent a larger cloud, and PAM250 an even larger cloud. High 28

29 scores would occur for sequences within the cloud. Why would we choose one PAM matrix over another? PAM250 will give relatively high scores to more distantly related sequences (as well as to closely related sequences). On the other hand, PAM250 is less discriminating than PAM100, and thus is more likely to (mistakenly) give a relatively high score to a sequence that is similar simply by chance, rather than because of evolutionary relatedness. There is a tradeoff between how broad a range of sequences will be recognized as similar, and how many randomly similar sequences will be (mistakenly) included. 29

30 Log Odds Matrix for a 250 PAM evolutionary distance

[Table: lower-triangular log-odds matrix with rows and columns in the order Cys C, Ser S, Thr T, Pro P, Ala A, Gly G, Asn N, Asp D, Glu E, Gln Q, His H, Arg R, Lys K, Met M, Ile I, Leu L, Val V, Phe F, Tyr Y, Trp W. Only the first entries survive in this copy: Cys-Cys 12, Ser-Cys 0, Ser-Ser 2.]

In this table, the amino acids are grouped according to the chemistry of the side group: C: sulfhydryl; STPAG: small hydrophilic; NDEQ: acid, acid amide and hydrophilic; HRK: basic; MILV: small hydrophobic; and FYW: aromatic.

The matrix was obtained by taking the log of each element in the relatedness odds matrix for 250 PAM. The elements in this matrix are multiplied by 10 for readability. A score of -10 means that a given pair would be expected to be aligned only one tenth as frequently in related sequences as random chance would predict; a score of 2 means that the pair would be expected to align 1.6 times as frequently.

The amino acids were arranged by assuming that positive values represent evolutionarily conservative replacements; the clusters correspond to groupings based on the physicochemical properties of the amino acids.
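The two interpretations quoted above (a score of -10 and a score of 2) follow from inverting the scaling. This short sketch assumes the scores are 10 times a base-10 logarithm, which is what the stated examples imply; the function name is hypothetical.

```python
def odds_from_score(score):
    """Invert score = 10 * log10(odds) to recover the relatedness odds."""
    return 10 ** (score / 10)


print(odds_from_score(-10))
print(round(odds_from_score(2), 1))
```

A score of 0 corresponds to odds of exactly 1, meaning the pair is aligned as often as random chance would predict.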

31 About the PAM model
Some assumptions in the PAM model:

This is a pointwise Markov model; in other words, a substitution at any given site depends only on the amino acid at that site and the probability given by the table. In particular, a substitution does not depend on nearby residues.

The model assumes that sequences being compared have typical amino acid composition.

Some sources of error in the PAM model:

Many sequences are not typical in composition.

32 Rare substitutions may be observed too infrequently to reflect relative probabilities accurately. For example, in the original work, 36 amino acid pair substitutions were not observed at all.

Extrapolating to higher order PAMs will multiply errors in PAM1.

Markov models are an imperfect representation of evolutionary processes. For example, even distantly related sequences often have islands or blocks of conserved residues. This means that substitutions are not equally likely across entire sequences.

33 Scoring, Statistics and Expectations
Suppose we BLAST a sequence X against a database, and it reports a sequence Y with score S. What does the score S mean, and how much confidence can we have that the similarity is meaningful, and not just the result of a random coincidence?

One way to think about this is to imagine that we have a database of random sequences. How many sequences in the database would we expect to have a similarity score as high as S with our sequence? Or, perhaps, how large would the database have to be in order for us to expect there to be at least one sequence with a score as high as S?

Let's assume that our sequence X has length m, and that each sequence in the random

34 database has length n. If we assume that each sequence in the random database is constructed by an independent, identically distributed random process for each residue in the sequence, then we would expect the similarity scores with our sequence X to be normally distributed (this is by the central limit theorem). We would expect the maximum scores across the database to follow an Extreme Value Distribution, and thus for the expected values associated with scores S to be given by:

$$E(S) = K m n e^{-\lambda S}.$$

Here K and λ are scaling parameters for the size of the database and scoring method, respectively. If things are appropriately scaled, then an E(S) = 1 would mean that we would expect there to be about 1 sequence Y in the random database with a similarity score as

35 high as S. An E(S) = 5 would mean we expect about 5 sequences in the database to have a score as high as S. If E(S) is less than one, that would mean that in order to expect even one random sequence to score as high as S, the random database would have to be larger. For example, an E(S) = 0.01 would mean that the database would have to be 100 times as big for us to expect even one match with score as high as S. In evaluating the results of BLASTing a sequence against a database, we can then use an E-value to assess our results. If the E-value is very small (say ), then it is extremely unlikely that the S score is the result of a random coincidence. On the other hand, if the E-value is 1 or more, then there is a fairly high likelihood that the match is the result of a random coincidence (and the higher the E-value, the higher the likelihood). Fairly typically, people insist on an E-value 35

36 less than about 0.05 before they have confidence that the match is likely to be meaningful.

One other thing we can do is to normalize the scores according to

$$S' = \frac{\lambda S - \ln(K)}{\ln(2)}.$$

In effect, this sets the units of the score. The ln(2) in the denominator means that we are using bits as our units. If we use these normal units, then we have that the expectation is

$$E(S') = m n 2^{-S'}.$$

This normalization allows us to do reasonable comparisons of the results of various BLASTs.

Another issue is the fact that not all the sequences in the database will have the same

37 length. In effect, we can account for this by treating the database as though it were one long sequence, and then, since the E-value scales as the length of the sequences in the database, we can simply multiply the pairwise E-values by the number of sequences in the database. Thus, as the database grows, so do the E-values. It should be noted that the theoretical analysis underlying the foregoing has only really been done for ungapped scoring (i.e., no gaps allowed). On the other hand, empirical evidence suggests that the same general results hold for gapped scoring. NOTE: much of this material is discussed in the NCBI BLAST tutorial at 1.html 37
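The E-value and the normalized (bit) score discussed above can be computed in a few lines. In this sketch the values of K, λ, m, n, and the raw score are hypothetical and chosen only for illustration; real values depend on the scoring matrix and the database being searched.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E(S) = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E(S') = m * n * 2 ** (-S')."""
    return m * n * 2 ** (-bits)


# Hypothetical parameters, for illustration only.
K, lam = 0.13, 0.32
m, n = 250, 1_000_000   # query length and total database length
raw = 80

print(e_value(raw, m, n, K, lam))
print(bit_score(raw, K, lam))
print(e_value_from_bits(bit_score(raw, K, lam), m, n))
```

Both routes give the same E-value, which is why bit scores from searches that used different scoring systems can be compared directly. Doubling n doubles the E-value, in line with the remark that E-values grow as the database grows.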

38 General issues of scoring matrices
The results you will get from applying a local alignment algorithm will depend on the scoring matrix being used. In general, all the scoring matrices in general use are of the form:

$$S_{ij} = \frac{\ln\left(\dfrac{q_{ij}}{p_i p_j}\right)}{\lambda},$$

where $q_{ij}$ are the target frequencies (positive numbers that sum to 1), $p_i$ are the background frequencies of the residues, and λ is a scaling number (the same λ as above).

It is important to remember that the best scoring matrix to use for a given class of alignments is one whose target frequencies best characterize the class. The class of alignments depends on the specific characteristics of the research being done, the

39 sample sequences being used, and the databases being searched. It is worth your while to think about these issues as you choose parameters for your BLAST searches. It may also be worthwhile to explore various BLAST parameters to see what sorts of results are returned for a particular case at hand, and to progressively refine your search depending on the intermediate results you get. 39

40 One more example (codons)
Mutation costs for amino acids

[Table: lower-triangular matrix of mutation costs with columns A S G L K V T P E D N I Q R F Y C H M W Z B X and rows Ala=A, Ser=S, Gly=G, Leu=L, Lys=K, Val=V, Thr=T, Pro=P, Glu=E, Asp=D, Asn=N, Ile=I, Gln=Q, Arg=R, Phe=F, Tyr=Y, Cys=C, His=H, Met=M, Trp=W, Glx=Z, Asx=B, ???=X. The numeric entries did not survive in this copy.]

The table is generated by calculating the minimum number of base changes required to convert an amino acid in row i to an amino acid in column j. Note that Met->Tyr is an example of a change that requires all 3 codon positions to change.
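The rule used to generate the table, the minimum number of base changes between any codon of one amino acid and any codon of another, can be sketched as follows. The sketch assumes the standard genetic code written in DNA letters; the names `CODON_TABLE` and `min_base_changes` are hypothetical, and the ambiguity codes Z, B, and X are not handled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_base_changes(aa1, aa2):
    """Smallest number of differing positions over all codon pairs."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_for(aa1) for c2 in codons_for(aa2))


print(min_base_changes("M", "Y"))  # Met -> Tyr
print(min_base_changes("A", "S"))  # Ala -> Ser
```

Met has the single codon ATG and Tyr has TAT and TAC, so every position differs; Ala (GCx) and Ser (TCx among others) differ in only the first position.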

41 Nucleotide BLASTs
Many of these ideas can also be applied to comparison of nucleotide sequences. However, there are a variety of differences:

There are only four nucleotides, as opposed to the twenty amino acids. This means that the scoring matrix is much smaller. A typical very simple scoring matrix would look like:

[Table: 4 x 4 scoring matrix with rows and columns A, T, C, G. The numeric entries did not survive in this copy.]

In this matrix, we only score for a match or mismatch of nucleotides.

42 A slightly more sophisticated scoring matrix might take into account the differences between purines and pyrimidines:

[Table: 4 x 4 scoring matrix with rows and columns A, T, C, G. The numeric entries did not survive in this copy.]

In this version of a scoring matrix, a purine to purine (A ↔ G) or pyrimidine to pyrimidine (T ↔ C) transition is considered more likely than a purine ↔ pyrimidine (A ↔ T, A ↔ C, T ↔ G, or C ↔ G) transversion.
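A transition/transversion-aware scoring function might look like the sketch below. The numeric values (1 for a match, -1 for a transition, -2 for a transversion) are hypothetical placeholders, since the values in the slide's matrix are not available here; only the ordering, with transitions penalized less than transversions, reflects the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides."""
    if a == b:
        return match
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion


# Build the full 4 x 4 matrix in the A, T, C, G order used on the slides.
order = "ATCG"
matrix = [[nucleotide_score(a, b) for b in order] for a in order]
for base, row in zip(order, matrix):
    print(base, row)
```

Setting `transition` equal to `transversion` recovers the simpler match/mismatch matrix.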

43 Straight nucleotide sequence to nucleotide sequence comparisons can be useful, but they do not easily reflect the fact that much of the genome information in which we are interested is sequences that are translated into amino acid sequences (proteins). Thus, a very typical approach is first to translate a given nucleotide sequence into a corresponding amino acid sequence, and then BLAST the resulting sequence against protein databases. For a typical nucleotide sequence (e.g., a sequence derived from the direct analysis of the genome of some species), there are six possible translation frames three in the forward direction, and three in the reverse direction, using the Watson-Crick complementary sequence. These translations ordinarily do not take account of possible introns, but allowing gaps in the alignments may handle this. 43
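The six translation frames can be sketched as follows, assuming the standard genetic code and a DNA sequence containing only A, C, G, and T. The names are hypothetical, stop codons are shown as `*`, and, as noted above, introns are not taken into account.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; ignore a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward frames, then three frames of the reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

Each of the six protein strings could then be searched against a protein database, which is the approach described above.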

44 Because of the redundancy of the genetic code (i.e., multiple codons code for the same amino acid), it is not particularly easy to recode amino acid sequences to nucleotide sequences. There might be trillions of different nucleotide sequences, each of which encodes for the same amino acid sequence. There is another potential difficulty, for which provision is sometimes made. The genetic code is often thought of as universal, in the sense that the same codons code for the same amino acids and the same START and STOP codes are used in the vast majority of genes in nearly all species. However, some exceptions have been found. Exceptions often involve using one or two of the three STOP codons to code for an amino acid instead. 44

45 Mitochondrial genes are one place where alternative codings have been discovered. Animal and microorganism (but not plant, apparently) mitochondria use UGA to encode tryptophan (Trp) rather than as a chain terminator. In addition, most animal mitochondria use AUA to code for methionine instead of isoleucine. However, all vertebrate mitochondria seem to use AGA and AGG as chain terminators (STOP codons). Yeast mitochondria assign all codons beginning with CU to threonine instead of leucine (which is still encoded by UUA and UUG, as it is in normal cytosolic mRNA).

Plant mitochondria use the universal code, and this has permitted angiosperms to transfer mitochondrial genes to their nucleus.

Exceptions to the universal code seem to be far rarer for nuclear genes. A few

46 unicellular eukaryotes have been found that use one or two (of their three) STOP codons for amino acids instead.

These examples are all simple code substitutions (where a codon is used for another purpose, but the same 20 amino acids are used). The vast majority of proteins are constructed from the standard 20 amino acids, although some of these may be chemically altered, e.g. by phosphorylation, after mRNA to amino acid translation has occurred.

However, at least two cases have been found where an amino acid other than one of the standard 20 is inserted by a tRNA into the growing polypeptide. The two nonstandard amino acids that have been observed are:

Selenocysteine. In certain Archaea, eubacteria, and animals, the codon

47 UGA sometimes codes for selenocysteine, but UGA is still often used as a STOP codon.

Pyrrolysine. In one gene found in a member of the Archaea, the codon UAG is sometimes used for pyrrolysine. Again, UAG may still be used as a STOP codon.

In both of these cases, the codon (UGA for selenocysteine, or UAG for pyrrolysine) is sometimes used to code for the alternative amino acid, but is often still used as a STOP codon. How the ribosomal translation machinery knows when it encounters a UGA or UAG codon whether to use a special tRNA to insert selenocysteine or pyrrolysine, or simply to stop translation, is not yet known.

48 For example, in the Biology Workbench BLASTx (nucleotide to protein translation and then BLAST), the following genetic codes are available:

Standard
Vertebrate mitochondrial
Yeast mitochondrial
Mold mitochondrial
Invertebrate mitochondrial
Ciliate nuclear
Echinoderm mitochondrial
Euplotid nuclear
Bacterial
Alternative yeast nuclear
Ascidian mitochondrial
Flatworm mitochondrial
Blepharisma macronuclear
