Important points from last time
- Substitution rates differ from site to site.
- Fit a Γ distribution to the variation in rates.
- Γ generally has two parameters, but in biology we fix one to ensure a mean equal to 1; the other parameter (α) is called the shape parameter.
- Estimates of α from sequences are small.
- Estimates of K = 2µt jump up when α is small.
Other DNA based methods
- The PAML package of programs is the most common method.
- It makes use of a codon model.
- It takes the codon as the unit that changes and measures the changes from any one codon to any other.
Other DNA based methods
In its simplest form, the programs from the PAML package assume Pr(> one change) = 0 and have a model with two parameters (beyond the phylogenetic relationship of the sequences):
- κ: a rate to measure transition/transversion bias
- ω: a measure of nonsynonymous versus synonymous rates (their ratio)
Other DNA based methods
So if...
  ω < 1   purifying or negative selection
  ω = 1   no selection, neutral
  ω > 1   positive selection
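As a rough illustration of how κ and ω enter a codon model, the sketch below computes a relative rate between two codons. It is a toy in the spirit of the models described above, not PAML's code: the function name `relative_rate` is hypothetical, codon frequencies are ignored, changes to or from stop codons are given rate 0 (an assumption), and codons are written with the DNA alphabet (T in place of U).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def relative_rate(codon_i, codon_j, kappa, omega):
    """Relative substitution rate from codon_i to codon_j (toy version)."""
    if CODE[codon_i] == "*" or CODE[codon_j] == "*":
        return 0.0  # assumption: no changes to or from stop codons
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0  # identical codons, or more than one change: Pr(> one change) = 0
    a, b = diffs[0]
    rate = 1.0
    if (a in PURINES) == (b in PURINES):
        rate *= kappa  # transition (purine<->purine or pyrimidine<->pyrimidine)
    if CODE[codon_i] != CODE[codon_j]:
        rate *= omega  # nonsynonymous change
    return rate


print(relative_rate("TTT", "TTC", kappa=2.0, omega=0.5))  # synonymous transition
print(relative_rate("TTT", "TTA", kappa=2.0, omega=0.5))  # nonsynonymous transversion
print(relative_rate("TTT", "CCT", kappa=2.0, omega=0.5))  # two changes at once
```

With ω < 1 the nonsynonymous rates are scaled down relative to synonymous ones, which is what purifying selection looks like in this parameterization.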
Other DNA based methods - example
[Figure: Inferred Number of Genes Under Positive Selection. Ranges shown in the figure: (338-382), (119-162), (32-62), (234-327), (183-232), (219-257), (318-360), (357-426), (255-325), (213-292), (204-278), (281-333).] From: Kosiol et al. 2011
Other DNA based methods - example
[Figure: over-represented GO categories.] From: Kosiol et al. 2008 PLoS Genetics 4:e1000144
Other DNA based methods - example
[Figure: Co-evolution in complement immunity; legend: P<0.05, FDR<0.05.] From: Kosiol et al. 2011
Other DNA based methods - example
[Figure: immune genes under positive selection.] From: Kosiol et al. 2008 PLoS Genetics 4:e1000144
Amino acid distance measures
As for nucleotide sequences, the Jukes-Cantor distance can be applied to amino acid sequences. The only difference is 20 amino acids rather than 4 bases. With D the observed proportion of sites that differ:
$$D_{JC} = -\frac{19}{20}\ln\left(1 - \frac{20}{19}D\right)$$
Often simplified to just
$$D_{JC} = -\ln(1 - D)$$
As for nucleotide sequences, it assumes the same rate of substitution between all pairs of amino acids.
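The two formulas above can be compared numerically with a short sketch. The function names are hypothetical, and `p` stands for the observed proportion of differing sites (D in the formulas).

```python
import math


def jc_distance(p, n_states=20):
    """Jukes-Cantor distance from the observed proportion p of differing sites.

    n_states = 20 gives the amino acid version, n_states = 4 the nucleotide one.
    """
    b = (n_states - 1) / n_states  # 19/20 for amino acids, 3/4 for nucleotides
    if not 0 <= p < b:
        raise ValueError("p must lie in [0, (n-1)/n) for the distance to be defined")
    return -b * math.log(1 - p / b)


def simplified_distance(p):
    """The simplified form -ln(1 - p)."""
    return -math.log(1 - p)


for p in (0.05, 0.1, 0.3, 0.6):
    print(p, round(jc_distance(p), 4), round(simplified_distance(p), 4))
```

Both corrections exceed the raw proportion p, and the gap between them grows as the sequences become more divergent.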
Amino acid distance measures
Amino acids differ in various characteristics:
- charge
- polarity
- hydrophobicity
- aromaticity
- size
It is therefore unlikely that all amino acid substitutions occur with a similar probability.
Use empirical weighting schemes when computing amino acid distances.
The genetic code (rows: first base; columns: second base)

         U          C          A          G
U    UUU Phe    UCU Ser    UAU Tyr    UGU Cys
     UUC Phe    UCC Ser    UAC Tyr    UGC Cys
     UUA Leu    UCA Ser    UAA ter    UGA ter
     UUG Leu    UCG Ser    UAG ter    UGG Trp
C    CUU Leu    CCU Pro    CAU His    CGU Arg
     CUC Leu    CCC Pro    CAC His    CGC Arg
     CUA Leu    CCA Pro    CAA Gln    CGA Arg
     CUG Leu    CCG Pro    CAG Gln    CGG Arg
A    AUU Ile    ACU Thr    AAU Asn    AGU Ser
     AUC Ile    ACC Thr    AAC Asn    AGC Ser
     AUA Ile    ACA Thr    AAA Lys    AGA Arg
     AUG Met    ACG Thr    AAG Lys    AGG Arg
G    GUU Val    GCU Ala    GAU Asp    GGU Gly
     GUC Val    GCC Ala    GAC Asp    GGC Gly
     GUA Val    GCA Ala    GAA Glu    GGA Gly
     GUG Val    GCG Ala    GAG Glu    GGG Gly

[Colour legend in the original slide: non polar, polar, unusual]
Dayhoff et al. (1978) computed the percent accepted mutations (PAM)
- Took a number of globular proteins and compared every site, cataloguing the changes.
- Extrapolated the changes from a short period of time to a longer period.
[Picture: Margaret O. Dayhoff (1925-1983), Columbia University. From http://wikipedia.org]
PAM steps
1. Calculate how often pairs of amino acids are exchanged
2. Calculate the frequency of occurrence of each amino acid
3. Calculate the mutation probability
4. Calculate how mutable each amino acid is
5. Scale to one amino acid change
6. Calculate not only the probability of changes but also the probability of no change
7. End with a PAM score for all changes aa i to aa j
1572 amino acid pairwise differences (1978)

            Ala  Arg  Asn  Asp  Cys  Gln  Glu
             A    R    N    D    C    Q    E
Ala  A       -   30  109  154   33   93  266
Arg  R            -   17    0   10  120    0
Asn  N                 -  532    0   50   94
Asp  D                      -    0   76  831
Cys  C                           -    0    0
Gln  Q                                -  422
Glu  E                                     -
Normalized frequencies of amino acids within her dataset

Gly  0.089     Arg  0.041
Ala  0.087     Asn  0.040
Leu  0.085     Phe  0.040
Lys  0.081     Gln  0.038
Ser  0.070     Ile  0.037
Val  0.065     His  0.034
Thr  0.058     Cys  0.033
Pro  0.051     Tyr  0.030
Glu  0.050     Met  0.015
Asp  0.047     Trp  0.010
Relative Mutabilities (# substitutions/freq)

Asn  134     His  66
Ser  120     Arg  65
Asp  106     Lys  56
Glu  102     Pro  56
Ala  100     Gly  49
Thr   97     Tyr  41
Ile   96     Phe  41
Met   94     Leu  40
Gln   93     Cys  20
Val   74     Trp  18

Ala has been arbitrarily set to 100.
PAM-1 Matrix (values × 10,000)

From:        Ala   Arg   Asn   Asp   Cys   Gln   Glu
To:           A     R     N     D     C     Q     E
Ala  A      9867     2     9    10     3     8    17
Arg  R         1  9913     1     0     1    10     0
Asn  N         4     1  9822    36     0     4     6
Asp  D         6     0    42  9859     0     6    53
Cys  C         1     1     0     0  9973     0     0
Gln  Q         3     9     4     5     0  9876    27
Glu  E        10     0     7    56     0    35  9865
PAM1 is the expectation after approximately 1% of the sequence has been substituted.
PAM2 is calculated as PAM1 × PAM1.
PAMx is calculated as PAM(x-1) × PAM1.
PAM250 is generally used for distant comparisons. It corresponds to 2.5 substitutions per site (≈ 20% identity).
NOTE: These measure divergence, not time.
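The repeated multiplication can be sketched with a small stand-in matrix. The 3-state matrix below is made up for illustration (it is not Dayhoff's data) and the helper names are hypothetical; the real PAM1 matrix is 20 × 20 but is extrapolated in the same way.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, x):
    """m multiplied by itself x times (x = 0 gives the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result


# Made-up 3-state stand-in for PAM1: each state changes with probability 0.01 per step
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam2 = mat_power(TOY_PAM1, 2)      # same as TOY_PAM1 x TOY_PAM1
toy_pam250 = mat_power(TOY_PAM1, 250)

print(toy_pam2[0])
print(toy_pam250[0])
```

Even after 250 steps the diagonal of the toy matrix stays above the off-diagonal entries, because some sites change more than once or change back. This is why a divergence of 2.5 substitutions per site still leaves a fraction of sites identical.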
PAM-250 Matrix (values × 100)

From:        Ala   Arg   Asn   Asp   Cys   Gln   Glu
To:           A     R     N     D     C     Q     E
Ala  A        13     6     9     9     5     8     9
Arg  R         3    17     4     3     2     5     3
Asn  N         4     4     6     7     2     5     6
Asp  D         5     4     8    11     1     7    10
Cys  C         2     1     1     1    52     1     1
Gln  Q         3     5     5     6     1    10     7
Glu  E         5     4     7    11     1     9    12
PAM scoring matrix
The PAM scoring values are generally shown as a symmetric log odds ratio matrix.
Odds (for those who do not gamble) are p/(1 - p), where p is the probability of an event and 1 - p is the probability of any other outcome.
For example, if p = 0.5 then the odds are 50/50 or 1 to 1 (0.5/0.5 = 1), while if p = 0.75 then the odds are 3 to 1 (0.75/0.25 = 3).
The odds ratio is the ratio of the odds for and against.
PAM scoring matrix
Generally the odds are presented as log values. For PAM matrices it is generally log10 that is used, so each integer value represents an order of magnitude.
For example, if p = 0.08, the odds are 0.08/0.92 = 0.087 (about 1 to 11) and the log odds are log10(0.087) = -1.06, while if p = 0.996, the odds are 0.996/0.004 = 249 (orders of magnitude larger and in the opposite direction) and the log odds are log10(249) = +2.40.
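The arithmetic in these examples can be checked with a few lines. This is only a sketch; the function names are hypothetical.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)


def log10_odds(p):
    """Base-10 log odds; each unit is one order of magnitude."""
    return math.log10(odds(p))


for p in (0.5, 0.75, 0.08, 0.996):
    print(f"p = {p}: odds = {odds(p):.3f}, log10 odds = {log10_odds(p):+.2f}")
```

Probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive ones, so the sign alone shows which way the odds lean.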
For a PAM scoring matrix
$$S_{ij} = \log\frac{p_i M_{ij}}{p_i p_j} = \log\frac{M_{ij}}{p_j} = \log\frac{\text{observed frequency}}{\text{expected frequency}}$$
This matrix will be symmetric.
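A small numerical sketch of this formula follows. The two-state frequencies and mutation probabilities are made up for illustration, and the function name is hypothetical. The toy values are chosen so that p_i M_ij = p_j M_ji, which is the condition under which the scores come out symmetric.

```python
import math


def log_odds_matrix(M, p):
    """S[i][j] = log10(M[i][j] / p[j]), following the formula above.

    M[i][j] is taken as the probability that state i is replaced by state j,
    and p[j] as the background frequency of state j.
    """
    n = len(p)
    return [[math.log10(M[i][j] / p[j]) for j in range(n)] for i in range(n)]


# Made-up two-state example with p[0]*M[0][1] == p[1]*M[1][0] == 0.06
p = [0.6, 0.4]
M = [[0.90, 0.10],
     [0.15, 0.85]]

S = log_odds_matrix(M, p)
for row in S:
    print([round(s, 3) for s in row])
```

The diagonal scores are positive (identities are seen more often than chance pairing predicts) and the off-diagonal scores are negative and equal to each other.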
PAM250 log odds scoring matrix (lower triangle). Values multiplied by 10.

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   1   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -3  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
- A log odds of zero implies the two amino acids are found across from each other in an alignment as often as expected by chance (given their mutabilities and frequencies of occurrence).
- A log odds greater than zero implies they are found across from each other more often than expected by chance.
- A log odds less than zero implies they are found across from each other less often than expected by chance.
Two uses for PAM matrices
- Scoring matrix: PAM250 (very distant), PAM160 (distant), PAM70 (less distant), PAM30 (more similar), etc.
- Transition matrix: PAM1
PAM - strange (?) patterns
Lots of interesting properties:
- Many exchanges between amino acids D and E
- Far more double codon substitutions than expected
- Fewer of some single codon substitutions, e.g. G and W
PAM - scoring an amino acid alignment
Consider an alignment...
  Seq1      C   G   N   G
  Seq2      C   G   D   R
  PAM250   12   5   2  -3
Total score is 12 + 5 + 2 - 3 = 16.
The chance of getting an alignment this good by chance is given by the odds. Normally one would multiply the odds at each site (assuming independence), but since logs have been taken we can add the log odds.
The log10 odds of 1.6 correspond to odds of 39.8. So this is an unusual similarity between these two peptides despite their short length (in large part due to the rare cysteines across from each other).
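The calculation can be reproduced with a short sketch. Only the four PAM250 entries used in this example are included, and the helper names are hypothetical.

```python
# PAM250 entries needed for this example (10 x log10 odds), as given above
PAM250_SUBSET = {
    ("C", "C"): 12,
    ("G", "G"): 5,
    ("N", "D"): 2,
    ("G", "R"): -3,
}


def pair_score(a, b, table=PAM250_SUBSET):
    """Look up a symmetric score, trying both orders of the pair."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def alignment_score(seq1, seq2):
    """Sum of per-site scores for an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = alignment_score("CGNG", "CGDR")
odds = 10 ** (score / 10)  # scores are 10 x log10 odds, so divide by 10 first
print(score, round(odds, 1))
```

Adding the scores is equivalent to multiplying the per-site odds, which is the independence assumption stated above.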
- The PAM matrix was computed on globular proteins and may therefore not be a good representation of the substitution matrix for membrane or other non-globular proteins.
- It assumes that all sites are equally mutable (but not all residues).
- Only a limited number of proteins were available, in comparison to the huge numbers today.
The JTT matrix (Jones, Taylor, Thornton 1992) was an update of the PAM matrix. It is mostly used as a transition matrix rather than as a scoring matrix (for the latter purpose PAM250 still seems the method of choice).
A matrix of BLOCKS
BLOSUM: BLOcks SUbstitution Matrix
- Based on the analysis of conserved protein regions from the BLOCKS database
- More reliable than the PAM matrix for distantly related proteins
- Default for BLAST searches
- Used in many other programs, including FASTA