Evolution of protein coding sequences


1 Evolution of protein coding sequences

2 Kinds of nucleotide substitutions
Given 2 nucleotide sequences, how did their similarities and differences arise from a common ancestor? We assume A is the ancestral nucleotide:
Single substitution: A → C in one lineage; 1 change, 1 difference
Multiple substitution: A → C → T in one lineage; 2 changes, 1 difference
Coincidental substitution: A → G in one lineage, A → C in the other; 2 changes, 1 difference
Parallel substitution: A → C in both lineages; 2 changes, no difference
Convergent substitution: A → C → T in one lineage, A → T in the other; 3 changes, no difference
Back substitution: A → C → A in one lineage; 2 changes, no difference

3 Important properties inherent to the standard genetic code

4 Synonymous vs nonsynonymous substitutions
Nondegenerate sites: codon positions where mutations always result in amino acid substitutions (e.g. the first position of TTT (phenylalanine), CTT (leucine), ATT (isoleucine), and GTT (valine)).
Twofold degenerate sites: codon positions where 2 different nucleotides result in the translation of the same aa, and the 2 others code for a different aa (e.g. GAT and GAC code for aspartic acid (Asp, D), whereas GAA and GAG both code for glutamic acid (Glu, E)).
Threefold degenerate sites: codon positions where changing to 3 of the 4 nucleotides has no effect on the aa, while changing to the fourth possible nucleotide results in a different aa. There is only 1 threefold degenerate site: the 3rd position of an isoleucine codon. ATT, ATC, and ATA all encode isoleucine, but ATG encodes methionine.

5 Standard genetic code
Fourfold degenerate sites: codon positions where changing a nucleotide to any of the 3 alternatives has no effect on the aa, e.g. GGT, GGC, GGA, GGG (glycine); CCT, CCC, CCA, CCG (proline).
Three amino acids (arginine, leucine and serine) are each encoded by 6 different codons:
R  Arg  Arginine   CGT CGC CGA CGG AGA AGG
L  Leu  Leucine    TTA TTG CTT CTC CTA CTG
S  Ser  Serine     TCT TCC TCA TCG AGT AGC
Five amino acids are each encoded by 4 codons which differ only in the third position. These sites are called fourfold degenerate sites:
A  Ala  Alanine    GCT GCC GCA GCG
G  Gly  Glycine    GGG GGA GGT GGC
P  Pro  Proline    CCT CCC CCA CCG
T  Thr  Threonine  ACT ACC ACA ACG
V  Val  Valine     GTT GTC GTA GTG

6 Standard genetic code
Nine amino acids are each encoded by a pair of codons which differ by a transition substitution at the third position. These sites are called twofold degenerate sites:
N  Asn  Asparagine     AAT AAC
D  Asp  Aspartic acid  GAT GAC
C  Cys  Cysteine       TGT TGC
Q  Gln  Glutamine      CAA CAG
E  Glu  Glutamic acid  GAG GAA
H  His  Histidine      CAT CAC
K  Lys  Lysine         AAA AAG
F  Phe  Phenylalanine  TTT TTC
Y  Tyr  Tyrosine       TAT TAC
Isoleucine is encoded by three codons (with a threefold degenerate site):
I  Ile  Isoleucine     ATT ATC ATA
Methionine and tryptophan are each encoded by a single codon:
M  Met  Methionine     ATG
W  Trp  Tryptophan     TGG
Three stop codons: TAA, TAG and TGA
Transitions: A ↔ G; C ↔ T
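The degeneracy classes on slides 4 to 6 can be checked mechanically. The following is a minimal sketch, not part of the original slides: it builds the standard code in DNA letters and counts, for one codon position, how many of the four nucleotides leave the amino acid unchanged. The function name `degeneracy` and the 0-based position argument are assumptions made for illustration.

```python
# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def degeneracy(codon: str, pos: int) -> int:
    """Count the nucleotides at `pos` (0-based) that keep the encoded amino acid.

    The count includes the nucleotide already present, so
    1 = nondegenerate, 2 = twofold, 3 = threefold, 4 = fourfold.
    """
    aa = CODE[codon]
    variants = (codon[:pos] + b + codon[pos + 1:] for b in BASES)
    return sum(CODE[v] == aa for v in variants)


# The examples used on the slides (hypothetical driver loop).
for codon, pos in [("TTT", 0), ("GAT", 2), ("ATT", 2), ("GGT", 2)]:
    print(codon, "position", pos + 1, "->", degeneracy(codon, pos))
```

This simple count does not check whether the two nucleotides at a twofold site are related by a transition; it only counts how many alternatives are synonymous.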

7 Evolution of protein coding sequences
Some amino acid substitutions require more DNA substitutions than others
Ile → Thr: at least one DNA change
AUU → ACU
AUC → ACC
AUA → ACA
Ile → Cys: at least two DNA changes
AUU (Ile) → AGU (Ser) → UGU (Cys)
AUU (Ile) → UUU (Phe) → UGU (Cys)
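The minimum number of DNA changes between two amino acids can be found by comparing every pair of their codons. This is a small sketch under stated assumptions: it uses DNA letters (T in place of U), one-letter amino acid codes, and a hypothetical helper name `min_changes` that does not come from the slides.

```python
from itertools import product

# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def min_changes(aa1: str, aa2: str) -> int:
    """Smallest number of nucleotide differences between any codon of aa1
    and any codon of aa2 (one-letter amino acid codes)."""
    codons1 = [c for c, aa in CODE.items() if aa == aa1]
    codons2 = [c for c, aa in CODE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1, c2 in product(codons1, codons2)
    )


print("Ile -> Thr:", min_changes("I", "T"))
print("Ile -> Cys:", min_changes("I", "C"))
```

The function only reports the shortest distance; it does not list the intermediate codons shown on the slide.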

8 Example: 2 homologous sequences
         Glu Val Phe
SEQ.1    GAA GTT TTT
SEQ.2    GAC GTC GTA
         Asp Val Val
Codon 1: GAA → GAC; 1 nucleotide difference, 1 nonsynonymous difference
Codon 2: GTT → GTC; 1 nucleotide difference, 1 synonymous difference
Codon 3: counting is less straightforward, because TTT and GTA differ at 2 positions and the 2 changes can occur in either order:
Path 1: TTT (F: Phe) → GTT (V: Val) → GTA (V: Val); implies 1 nonsynonymous and 1 synonymous substitution
Path 2: TTT (F: Phe) → TTA (L: Leu) → GTA (V: Val); implies 2 nonsynonymous substitutions
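The path counting for codon 3 can be sketched in a few lines. The code below is an illustration rather than the method used in the slides: it enumerates every order in which the differing positions could change and tallies synonymous and nonsynonymous steps along each path. The function name `paths` is hypothetical, and paths that pass through a stop codon are not treated specially here.

```python
from itertools import permutations

# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def paths(start: str, end: str) -> list[tuple[int, int]]:
    """Return (synonymous, nonsynonymous) step counts for every
    single-nucleotide path from `start` to `end`."""
    differing = [i for i in range(3) if start[i] != end[i]]
    results = []
    for order in permutations(differing):
        codon, syn, nonsyn = start, 0, 0
        for i in order:
            nxt = codon[:i] + end[i] + codon[i + 1:]
            if CODE[nxt] == CODE[codon]:
                syn += 1
            else:
                nonsyn += 1
            codon = nxt
        results.append((syn, nonsyn))
    return results


for a, b in [("GAA", "GAC"), ("GTT", "GTC"), ("TTT", "GTA")]:
    print(a, "->", b, paths(a, b))
```

One common convention, assumed here and not stated on the slide, is to give each path equal weight; for codon 3 that would average to 0.5 synonymous and 1.5 nonsynonymous substitutions.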

9 Codon Adaptation Index (CAI)
In recognition of the role of selection in producing high codon bias, a statistic called the Codon Adaptation Index (CAI) is calculated.
The pattern of codon usage in very highly expressed genes can reveal:
(i) which of the alternative synonymous codons for an amino acid is the most efficient for translation;
(ii) the relative extent to which other codons are disadvantageous.
Sharp, PM & Li, WH (1987). NAR

10 RSCU
Relative Synonymous Codon Usage: a statistical measure of codon usage bias
$$\mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}}$$
where $X_{ij}$ is the number of occurrences of the $j$th codon for the $i$th amino acid, and $n_i$ is the number (from 1 to 6) of alternative codons for the $i$th amino acid,
i.e. the observed number of the $j$th codon for amino acid $i$, normalized by the average number over all codons coding for the same amino acid $i$.
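The RSCU definition translates directly into code. The sketch below is an assumption-laden illustration: the function name `rscu`, the dictionary input format, and the example counts are all hypothetical, and amino acids with no observed codons are simply skipped.

```python
from collections import defaultdict

# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def rscu(counts: dict[str, int]) -> dict[str, float]:
    """RSCU_ij = X_ij / ((1 / n_i) * sum_j X_ij) for every sense codon
    whose amino acid was observed at least once."""
    families = defaultdict(list)  # amino acid -> its synonymous codons
    for codon, aa in CODE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the data: RSCU undefined
        mean = total / len(codons)  # average count per synonymous codon
        for c in codons:
            values[c] = counts.get(c, 0) / mean
    return values


# Hypothetical counts for the two phenylalanine codons.
example = rscu({"TTT": 30, "TTC": 10})
print(example)
```

With these made-up counts the preferred codon has an RSCU above 1 and the other codon has an RSCU below 1; equal usage would give 1 for both.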

11 Relative adaptiveness of a codon
$$w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i\max}} = \frac{X_{ij}}{X_{i\max}}$$
where $\mathrm{RSCU}_{i\max}$ and $X_{i\max}$ are the RSCU and $X$ values for the most frequently used codon for the $i$th amino acid.

12 Codon Adaptation Index
The CAI for a gene is calculated as the geometric mean of the RSCU values corresponding to each of the codons used in that gene, divided by the maximum possible CAI for a gene of the same amino acid composition:
$$\mathrm{CAI} = \frac{\mathrm{CAI}_{obs}}{\mathrm{CAI}_{max}}$$
where
$$\mathrm{CAI}_{obs} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_{k}\right)^{1/L} \qquad \mathrm{CAI}_{max} = \left(\prod_{k=1}^{L} \mathrm{RSCU}_{k\max}\right)^{1/L}$$
where $\mathrm{RSCU}_{k}$ is the RSCU value for the $k$th codon in the gene, $\mathrm{RSCU}_{k\max}$ is the maximum RSCU value for the amino acid encoded by the $k$th codon in the gene, and $L$ is the number of codons in the gene.
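Because the ratio of the two geometric means equals the geometric mean of the per-codon ratios, the CAI can be computed from the relative adaptiveness values w = X / X_max. The following is a rough sketch under stated assumptions: the function names, the reference counts, and the test genes are hypothetical, and every codon in the gene is assumed to have a nonzero count in the reference set (a zero or missing count raises an error in this version).

```python
import math
from collections import defaultdict

# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def relative_adaptiveness(ref_counts: dict[str, int]) -> dict[str, float]:
    """w = X / X_max within each synonymous codon family of the reference set."""
    families = defaultdict(list)
    for codon, aa in CODE.items():
        if aa != "*":
            families[aa].append(codon)
    w = {}
    for codons in families.values():
        x_max = max(ref_counts.get(c, 0) for c in codons)
        if x_max == 0:
            continue  # amino acid not present in the reference set
        for c in codons:
            w[c] = ref_counts.get(c, 0) / x_max
    return w


def cai(gene: str, w: dict[str, float]) -> float:
    """Geometric mean of w over the codons of `gene`, computed via logarithms."""
    codons = [gene[i:i + 3] for i in range(0, len(gene), 3)]
    logs = [math.log(w[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts from "highly expressed genes".
w = relative_adaptiveness({"TTT": 30, "TTC": 10, "AAA": 10, "AAG": 40})
print(cai("TTTAAG", w))  # gene using only the preferred codons
print(cai("TTCAAA", w))  # gene using only the less frequent codons
```

Working in logarithms avoids numerical underflow when the product runs over the hundreds of codons in a real gene.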

13 Evolution of protein coding sequences
Redundancy of the genetic code
Biochemical properties of amino acids
Under neutral evolution (no effect of selection), amino acids should replace each other with a probability determined by the number of DNA substitutions required

14 Evolution of protein coding sequences
Some amino acid substitutions require more DNA substitutions than others
Ile → Thr: at least one DNA change
AUU → ACU
AUC → ACC
AUA → ACA
Ile → Cys: at least two DNA changes
AUU (Ile) → AGU (Ser) → UGU (Cys)
AUU (Ile) → UUU (Phe) → UGU (Cys)

15 Rates and patterns of nucleotide substitution
Influenced by three things:
Functional constraint (negative selection)
Positive selection
Mutation rate

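16 Rate of nucleotide substitution
K = mean number of substitutions per site
T = time since divergence
r = rate = number of substitutions per site per year
r = K / (2T)
The factor of 2 arises because the two sequences have each evolved independently for time T since the ancestral sequence.
[Figure: an ancestral sequence diverging into Sequence 1 and Sequence 2, each branch of length T]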
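As a worked illustration of r = K / (2T), the snippet below plugs in made-up numbers. The function name `substitution_rate` and the values K = 0.1 and T = 5 million years are assumptions chosen only to show the arithmetic; they do not come from the slides.

```python
def substitution_rate(k: float, t_years: float) -> float:
    """r = K / (2T): substitutions per site per year.

    k        mean number of substitutions per site between the two sequences
    t_years  time since the two sequences diverged, in years
    """
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2 * t_years)


# Hypothetical values: K = 0.1 substitutions per site, T = 5 million years.
r = substitution_rate(0.1, 5e6)
print(f"r = {r:.2e} substitutions per site per year")
```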

17 Gene tree - Species tree
[Figure: a gene tree and a species tree for A, B and C drawn against a time axis, with duplication and speciation events marked. Source: T.A. Brown, Genomes, 2nd edition]

18 Common ancestor of sequences
[Figure: time axis showing allele A and allele B in the ancestral species, the speciation event, and the descendant species Human and Gorilla]

19 Evolution of protein-coding sequences
The genetic code is redundant
Some nucleotide changes do not change the amino acid coded for
Changes at the 3rd codon position are often synonymous
Changes at the 2nd position are never synonymous
Changes at the 1st position are sometimes synonymous
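The position pattern above can be verified by brute force over the standard code. This is a sketch with some assumptions labeled: it starts only from the 61 sense codons, treats a change to a stop codon as nonsynonymous, and uses a hypothetical function name `synonymous_by_position`.

```python
# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def synonymous_by_position() -> tuple[list[int], list[int]]:
    """For each codon position, count single-nucleotide changes from sense
    codons that keep the amino acid, and the total number of such changes."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # assumption: stop codons are not used as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                synonymous[pos] += CODE[mutant] == aa
    return synonymous, total


synonymous, total = synonymous_by_position()
for pos in range(3):
    print(f"position {pos + 1}: {synonymous[pos]} of {total[pos]} changes are synonymous")
```

Under these assumptions the second position yields no synonymous changes, the first yields a few (within the leucine and arginine codon families), and the third yields the most.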

20 Standard Genetic Code
Phe UUU    Ser UCU    Tyr UAU    Cys UGU
Phe UUC    Ser UCC    Tyr UAC    Cys UGC
Leu UUA    Ser UCA    ter UAA    ter UGA
Leu UUG    Ser UCG    ter UAG    Trp UGG
Leu CUU    Pro CCU    His CAU    Arg CGU
Leu CUC    Pro CCC    His CAC    Arg CGC
Leu CUA    Pro CCA    Gln CAA    Arg CGA
Leu CUG    Pro CCG    Gln CAG    Arg CGG
Ile AUU    Thr ACU    Asn AAU    Ser AGU
Ile AUC    Thr ACC    Asn AAC    Ser AGC
Ile AUA    Thr ACA    Lys AAA    Arg AGA
Met AUG    Thr ACG    Lys AAG    Arg AGG
Val GUU    Ala GCU    Asp GAU    Gly GGU
Val GUC    Ala GCC    Asp GAC    Gly GGC
Val GUA    Ala GCA    Glu GAA    Gly GGA
Val GUG    Ala GCG    Glu GAG    Gly GGG

21 Rates
In general, rates of nucleotide substitution are:
lowest at nondegenerate sites (0.78 × 10^-9 per site per year)
intermediate at twofold degenerate sites (2.24 × 10^-9)
highest at fourfold degenerate sites (3.71 × 10^-9)
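To get a feel for these rates, they can be converted into an expected divergence using K = 2rT, which rearranges r = K / (2T). This is a back-of-the-envelope sketch: the divergence time of 80 million years and the names `RATES` and `expected_divergence` are assumptions for illustration only.

```python
# Rates from the slide, in substitutions per site per year.
RATES = {
    "nondegenerate": 0.78e-9,
    "twofold degenerate": 2.24e-9,
    "fourfold degenerate": 3.71e-9,
}


def expected_divergence(rate: float, years: float) -> float:
    """K = 2 * r * T: expected substitutions per site between two lineages."""
    return 2 * rate * years


T = 80e6  # hypothetical divergence time in years
baseline = RATES["nondegenerate"]
for site_class, rate in RATES.items():
    k = expected_divergence(rate, T)
    print(f"{site_class}: K = {k:.3f}, {rate / baseline:.2f}x the nondegenerate rate")
```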

22 Effect of amino acid substitutions
Deleterious: 86%
Neutral: 14%
Advantageous: 0.0%? (very low)
In protein coding sequences, selection is often acting to remove changes
A less common outcome is drift of neutral changes
Positive selection for advantageous changes is rarely seen

23 Functional Constraint
Proteins often have some functional constraint
The stronger the functional constraint, the slower the rate of evolution

24 Haemoglobin
Haem pocket is highly constrained at the protein sequence level
Remainder of the protein is only constrained to be hydrophilic

25 Histone 4
Two copies in the histone octamer
Forms a complex with other histones and binds DNA into chromatin
Almost the whole protein is highly constrained

26 Fibrinopeptides
Hardly any sequence constraint

28 Rates and Patterns
Patterns of change can be informative of the function of a protein
Different genes evolve at different rates
Amino acids that are always conserved are likely to be critical to the function

29 Biochemical properties

31 Histone 4
Highly conserved protein
Compare human and wheat H4 genes:
55 DNA differences
2 amino acid differences
Val → Ile (both aliphatic)
Lys → Arg (both charged)
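A comparison of this kind amounts to counting differences at two levels in an aligned pair of coding sequences. The sketch below shows the idea; since the H4 sequences themselves are not reproduced in these slides, it reuses the short SEQ.1 and SEQ.2 example from slide 8 as input. The function name `count_differences` is hypothetical, and the sequences are assumed to be gap-free, in frame, and of equal length.

```python
# Standard genetic code built in TCAG order (DNA alphabet, '*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def count_differences(seq1: str, seq2: str) -> tuple[int, int]:
    """Return (nucleotide differences, amino acid differences) for two
    aligned, in-frame coding sequences of equal length."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    nucleotide = sum(a != b for a, b in zip(seq1, seq2))
    amino_acid = sum(
        CODE[seq1[i:i + 3]] != CODE[seq2[i:i + 3]]
        for i in range(0, len(seq1), 3)
    )
    return nucleotide, amino_acid


# SEQ.1 and SEQ.2 from the earlier 3-codon example.
dna_diffs, aa_diffs = count_differences("GAAGTTTTT", "GACGTCGTA")
print(dna_diffs, "DNA differences,", aa_diffs, "amino acid differences")
```

A large gap between the two counts, as reported for H4, indicates that most DNA differences are synonymous.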

32 Evolution of non-coding regions
Homologous sequences, e.g. compare introns of homologous genes
5′ UTR and 3′ UTR (untranslated regions)
Pseudogenes

35 Synonymous substitution rate variation
Synonymous rates may differ between genes
How come?
Maybe different mutation rates in different parts of the genome

36 Variation in the rates of synonymous substitutions: secondary structure constraints
Stems in RNA secondary structures are more constrained than loops.