Semantic Information in Genetic Sequences

1 Semantic Information in Genetic Sequences. Hinrich Kielblock, Network Dynamics Group, Max Planck Institute for Dynamics and Self-Organization

2 Outline
1. The genetic code and its translation
2. The information theorist's view
3. Biological models
4. Semantics

3 The basic structure of genes
Genes may be seen as sequences of the base nucleotides (letters): Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
Example: ATGGCTTAGACA...

8 Creating amino acids from the DNA
The DNA specifies how to generate proteins. The proteins are built out of a set of 20 amino acids.
A single nucleotide could code for only 4 different amino acids.
Pairs of nucleotides achieve only $4^2 = 16$.
Sets of three nucleotides (codons, $4^3 = 64$) are needed to code for all 20 amino acids.
Three letters (a nucleotide triplet) form a codon. Each codon codes for a specific amino acid.
ATG GCT TAG ACA...
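The counting argument on this slide can be checked with a few lines of Python. This is only a minimal sketch: the helper name `split_into_codons` and the dictionary `words_per_length` are introduced here for illustration and do not come from the talk.

```python
def split_into_codons(sequence: str) -> list[str]:
    """Split a DNA string into consecutive, non-overlapping triplets."""
    # Drop an incomplete trailing triplet, if any.
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


# Number of distinct words of length n over the four letters A, C, G, T.
words_per_length = {n: 4 ** n for n in (1, 2, 3)}

codons = split_into_codons("ATGGCTTAGACA")
print(words_per_length)  # {1: 4, 2: 16, 3: 64}
print(codons)            # ['ATG', 'GCT', 'TAG', 'ACA']
```

Only the triplet length gives at least 20 words, which is the point of the argument above.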

9 Question: How are the codons mapped onto the amino acids?

11 The comma-free code
Crick's idea [Crick et al., PNAS 1957]: The genetic code is comma-free. There are 20 meaningful codons; the other 44 are nonsense codons. As a result, the code can be read unambiguously without any spacing between words.
Example: Let ATC and TGA be meaningful codons. Then in the sequence ATCTGA the out-of-frame triplets TCT and CTG must be nonsense codons.
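The comma-free condition can be stated as a small test: for every ordered pair of meaningful codons, the two triplets that straddle the boundary must not be meaningful themselves. The following is a sketch under that reading of the definition; the function names `out_of_frame` and `is_comma_free` are hypothetical.

```python
from itertools import product


def out_of_frame(first: str, second: str) -> list[str]:
    """Return the two triplets that overlap the boundary of two adjacent codons."""
    pair = first + second
    return [pair[1:4], pair[2:5]]


def is_comma_free(codons: set[str]) -> bool:
    """True if no out-of-frame triplet of any two adjacent codons is itself a codon."""
    for first, second in product(codons, repeat=2):
        if any(triplet in codons for triplet in out_of_frame(first, second)):
            return False
    return True


print(out_of_frame("ATC", "TGA"))      # ['TCT', 'CTG']
print(is_comma_free({"ATC", "TGA"}))   # True
print(is_comma_free({"AGT", "GTA"}))   # False: AGTAGT contains GTA out of frame
```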

14 The comma-free code
How many words can the comma-free code include?
Analysis: The codons AAA, CCC, GGG and TTT are forbidden, because a run such as AAAAAA reads as AAA in every frame. The remaining 60 codons can be sorted into 20 groups of three cyclic permutations (e.g. AGT, GTA, TAG). Only one codon out of each group can be meaningful.
20 words is therefore the maximum number.
However, this hypothesis is wrong.
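The grouping into cyclic permutations can be enumerated directly. This sketch only reproduces the upper bound of 20; it does not construct a comma-free code. The helper name `cyclic_class` is an assumption made for illustration.

```python
from itertools import product


def cyclic_class(codon: str) -> frozenset[str]:
    """All cyclic rotations of a triplet, e.g. AGT -> {AGT, GTA, TAG}."""
    return frozenset(codon[i:] + codon[:i] for i in range(3))


all_codons = ["".join(letters) for letters in product("ACGT", repeat=3)]
classes = {cyclic_class(codon) for codon in all_codons}

# AAA, CCC, GGG, TTT are their own rotations and form classes of size 1.
homopolymer_classes = [c for c in classes if len(c) == 1]
# Every other codon belongs to a class of exactly three rotations.
rotation_triples = [c for c in classes if len(c) == 3]

print(len(all_codons), len(homopolymer_classes), len(rotation_triples))  # 64 4 20
```

Since at most one member of each class of three may be meaningful, the count of 20 classes is the bound quoted on the slide.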

16 The real genetic code
Experiments clarified: Each codon codes for one specific amino acid, and the code is degenerate (several codons can code for the same amino acid). [Knight et al., TIBS 1999]

17 How the code is translated
DNA translation is similar to a Turing machine.
DNA is copied to mRNA or tRNA (transcription).
tRNA docks to the start codon AUG.
The mRNA is translated into the corresponding protein by tRNA, codon by codon.
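The codon-by-codon reading can be sketched as a simple loop. The example below is a strong simplification and rests on several assumptions: the DNA string is treated as the coding strand (so transcription is just T to U), and `PARTIAL_CODON_TABLE` contains only the four entries of the standard code needed for the example sequence, not the full table. All names are hypothetical.

```python
# Deliberately partial excerpt of the standard genetic code (mRNA codons).
PARTIAL_CODON_TABLE = {
    "AUG": "Met",   # also the start codon
    "GCU": "Ala",
    "ACA": "Thr",
    "UAG": "Stop",
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace thymine (T) by uracil (U)."""
    return dna.replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon or the end of the string."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = PARTIAL_CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" marks codons not in the excerpt
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


print(translate(transcribe("ATGGCTTAGACA")))  # ['Met', 'Ala']
```

In the example sequence the third codon TAG is a stop codon, so reading ends before ACA.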

18 Structure of the genome
The genetic code has a hierarchical structure. Structure and parallels to human language [Küppers, NAL 1996]:

Genetic unit   Size                 Function                      Analog
nucleotide     single symbol        primary coding symbol         letter
codon          nucleotide triplet   symbol for amino acid         phoneme
cistron        codons               coding unit for protein       word
scripton       15 cistrons          transcription unit (mRNA)     sentence
replicon       100 scriptons        reproduction unit             paragraph
genome         some replicons       mitotic unit                  text

19 Open questions What is the information coded in the DNA? Where does the information come from? What is the code optimized for? How does the genetic code know how to code information? How can a code evolve without someone specifying what should be coded?

23 Genetic Information
What does information mean in the context of the DNA?
DNA codes all necessary manufacturing instructions of an organism.
All reproduction processes of the organism go via copying of DNA.
Definition: The construction of a living system is completely regulated by the instructions of the genetic molecules. We define these instructions as the genetic information, immanent in the DNA.

26 And God said: Let there be life.
Where does the information in the DNA come from?
The probability that an information-carrying biopolymer forms by accident is extremely small: for a biopolymer sequence of length $l$ consisting of $m$ classes of monomers there are $N = m^l$ possible alternatives. Even for a reasonably small protein, $N$ is extremely large.
Conclusion: DNA interacts with the world and changes. These changes create an information gain.
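To get a feeling for the size of $N = m^l$, it helps to plug in numbers. The values below are assumptions chosen for illustration only (m = 20 amino acid classes and a hypothetical chain length l = 100); they are not taken from the slide.

```python
import math


def sequence_space_size(m: int, l: int) -> int:
    """Number of distinct sequences of length l over m monomer classes."""
    return m ** l


m, l = 20, 100  # illustrative values, not from the talk
n = sequence_space_size(m, l)

print(f"N has {len(str(n))} decimal digits")
print(f"log10(N) = {l * math.log10(m):.1f}")
```

The probability of hitting one particular sequence by a single uniform random draw is 1/N.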

29 Mutation and Selection
Spontaneous mutations (e.g. errors in the copying process) lead to variation of the DNA.
Organisms with different DNA have different reproduction rates: selection.
Example: ATG GCT TAG ACA... versus ATG ACT TAG ACA... (one nucleotide changed)
This leads to slow shifts of DNA concentrations towards regions of higher fitness.
Higher fitness = more information on how to reproduce best.
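The shift of concentrations under selection can be illustrated with a discrete replicator update, in which each variant's share grows in proportion to its reproduction rate. This is a sketch, not the model used in the talk: mutation is omitted, and the three reproduction rates are made-up values.

```python
def select(frequencies: list[float], rates: list[float], generations: int) -> list[float]:
    """Discrete replicator dynamics: x_k <- x_k * r_k / (mean reproduction rate)."""
    x = list(frequencies)
    for _ in range(generations):
        mean_rate = sum(xk * rk for xk, rk in zip(x, rates))
        x = [xk * rk / mean_rate for xk, rk in zip(x, rates)]
    return x


rates = [1.0, 1.1, 1.2]      # hypothetical reproduction rates of three variants
start = [1 / 3, 1 / 3, 1 / 3]  # all variants equally common at the beginning
end = select(start, rates, generations=200)

print([round(share, 4) for share in end])
```

After enough generations nearly the whole population consists of the variant with the highest rate, even though the rates differ only modestly.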

31 Shannon's view
Define: Each DNA string is a message $S_k$. In the system there is a set $S = \{S_1, \ldots, S_N\}$ of messages with probability distribution $P = \{p_1, \ldots, p_N\}$, where $\sum_{k=1}^{N} p_k = 1$.
$I_k = \log(p_k)$ is a measure for the amount of information in $S_k$ [Shannon, Weaver, 1949].
$H(P) = \sum_{k=1}^{N} p_k I_k$ is a measure for the information in the system. With this sign convention, $H$ is the negative of the Shannon entropy, so $H \leq 0$.
Mutation + selection increase $H$:
Starting condition: all strains are equally probable, $p_k = 1/N$, so $H(P) = -\log(N)$.
By mutation and selection: $P \to Q = \{q_1, \ldots, q_N\}$.
In the end only one species survives (the fittest, global maximum): $H(Q) = 0 > H(P)$.
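The two limiting cases can be evaluated numerically. The sketch below assumes the sign convention $I_k = \log(p_k)$ exactly as written on the slide, which makes $H$ the negative of the usual Shannon entropy $-\sum_k p_k \log p_k$; the function name `slide_h` and the choice N = 8 are illustrative.

```python
import math


def slide_h(p: list[float]) -> float:
    """H(P) = sum_k p_k * log(p_k), with the term 0 * log(0) taken as 0."""
    return sum(pk * math.log(pk) for pk in p if pk > 0)


N = 8
uniform = [1 / N] * N                 # all strains equally probable
survivor = [1.0] + [0.0] * (N - 1)    # only one strain left

h_start = slide_h(uniform)    # equals -log(N)
h_end = slide_h(survivor)     # equals 0

print(h_start, -math.log(N))
print(h_end)
print("Shannon entropy at the start:", -h_start)
```

In the standard convention the same process reads as a drop of the entropy from log(N) to 0, that is, a reduction of uncertainty by log(N).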

32 The ascent
More information about the environment means higher fitness. The DNA with the highest fitness survives.

33 Structures in the code
Where does the code come from?
[Figure: the genetic code, with amino acids shaded by chemical property. Hydrophobic: lighter shades (black text); neutral: medium shades (yellow text); hydrophilic: darker shades (white text).] [Knight et al., TIBS 1999]

36 Observations We find that: Amino acids with similar chemical properties are grouped together. Codons coding the same amino acid are grouped together. Some amino acids have more codons than others.

40 A frozen accident
Frozen accident model: The codon assignments were historical accidents that became fixed in the last common ancestor of all modern organisms.
This is highly improbable given the structures observed in the code.
The code is not absolutely universal; small differences exist.
Three main challenges to this model: chemical, historical and adaptive arguments.
Still, the frozen accident model is a useful null model against which other models can be tested.

44 Chemical and historical arguments
Chemical argument: The amino acids are assigned to particular codons because of direct chemical interactions between RNA and amino acids. Similar amino acids should bind to similar codons. This explains the chemical grouping of the code.
Historical argument: The genetic code evolved from a simpler ancestral form. New amino acids were incorporated into the code, leading to today's code. This explains the emergence of the code.

46 Adaptive arguments
Adaptive argument: Adaptation optimizes the code to reduce the lethal effect of errors. The code is error-correcting.
The natural code is highly optimal. Simulations by Freeland and Hurst show a very efficient error-correcting code: the efficiencies of $10^6$ randomly generated codes are compared to the natural code [Freeland, Hurst, 1998].

47 Three forces shaping the code
Knight's theory: The genetic code probably originated through stereochemical interactions, then underwent a period of expansion (incorporation of new amino acids) and adaptive evolution with codon reassignments for code optimization. [Knight et al., TIBS 1999]

50 Information and context
The fitness landscape defines which sequence will be selected. It defines a context to which the information in the sequence refers.
Information is not an absolute quantity. It is always defined with regard to an information-carrying context.
Sender and receiver need semantic specifications to understand each other.
Do you understand the message:

52 Understanding information
How much information is needed to understand another piece of information?
A complexity measure: For a sequence S we define the complexity C(S) as the length L(p) of the shortest algorithm p that generates S: $C(S) = \min_p L(p)$. This definition is also a measure for the information in a sequence.
It is impossible to derive the semantic information from the syntactic structure of a sequence.
Information-carrying sequences are of maximal complexity.
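The complexity C(S) defined above cannot be computed in general, but the length of a compressed encoding gives a rough, computable stand-in: a sequence that a short program can generate also tends to compress well. The sketch below uses zlib purely as such a proxy, which is a swapped-in technique and not part of the talk; the sequences and the name `compressed_length` are illustrative.

```python
import random
import zlib


def compressed_length(sequence: str) -> int:
    """Bytes needed by zlib (maximum compression level) to encode the sequence."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


rng = random.Random(0)  # fixed seed so the example is reproducible

periodic = "ATG" * 1000                                        # generated by a very short rule
shuffled = "".join(rng.choice("ACGT") for _ in range(3000))    # no obvious regularity

print(len(periodic), compressed_length(periodic))
print(len(shuffled), compressed_length(shuffled))
```

Both strings have the same length, yet the periodic one compresses to a small fraction of the size of the random one. The proxy says nothing about meaning, which fits the statement that semantics cannot be read off the syntactic structure.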

54 What came first: Hen or egg?
A sender gives a meaningful sequence S to a receiver R. The meaningful sequence is of maximal complexity, so R needs at least C(S) bits to understand the message: $C(R) \geq C(S)$.
The receiver needs at least as much information as the sender to understand the message. How can a language evolve anyway?

55 Bottom up learning
Start by guessing the meaning of a simple piece of information. If you are lucky (right), use it to understand larger pieces and work your way up in complexity.
This explains similarities between human language and the genetic code:

Genetic unit   Size                 Function                      Analog
nucleotide     single symbol        primary coding symbol         letter
codon          nucleotide triplet   symbol for amino acid         phoneme
cistron        codons               coding unit for protein       word
scripton       15 cistrons          transcription unit (mRNA)     sentence
replicon       100 scriptons        reproduction unit             paragraph
genome         some replicons       mitotic unit                  text

56 Summary DNA accumulates information about the environment by mutation and selection. The genetic code probably evolved in the early stages of life and was shaped by chemical interactions, insertion of new amino acids and optimization for error-correction. Evolution is a complex process in which biological complexity (information) comes from interactions with the surrounding world (context).

57 Thank you Thank you for your attention!

58 References
F. H. Crick, J. S. Griffith and L. E. Orgel, Codes without commas, Proc. Natl. Acad. Sci. 43 (1957).
B.-O. Küppers, Der semantische Aspekt von Information und seine evolutionsbiologische Bedeutung, Nova Acta Leopoldina 294 (1996).
R. D. Knight, S. J. Freeland and L. F. Landweber, Selection, history and chemistry: The three faces of the genetic code, Trends Biochem. Sci. 24(6) (1999).
C. E. Shannon and W. Weaver, The mathematical theory of communication, University of Illinois Press, Urbana, 1949.
S. J. Freeland and L. D. Hurst, The genetic code is one in a million, J. Mol. Evol. 47 (1998).