Biological sequence patterns

1 Biological sequence patterns The TPOX short tandem repeat has repeat pattern AATG. The start codon for protein coding genes is ATG. The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns. This is the text mining part of bioinformatics. Niels Richard Hansen (Univ. Copenhagen) Statistics BI/E November 30, / 18

2 Text mining How often and where does a pattern or motif occur in a text? By complete chance? Or due to a rule that we want to understand and/or model? This is a core problem in biological sequence analysis. It is also of broader relevance: spam detection, data mining of text databases, and as a key example of a data mining problem.

3 A dice game Throw the die until one of the patterns [...] or [...] occurs. I win if [...] occurs. This is a winner: [...] Is this a fair game? How much should I be willing to bet if you bet 1 kroner on your pattern being the winning pattern?

4 Fairness and odds Let's play the game $n$ times, let $p$ denote the probability that I win the game, and let $\xi$ denote my bet against your 1 kroner. With $\varepsilon_n$ the relative frequency with which I win, my average gain in the $n$ games is $\varepsilon_n - \xi(1 - \varepsilon_n) \approx p - \xi(1 - p)$, the approximation following from the frequency interpretation. The game is fair if the average gain is 0, that is, if $\xi = \frac{p}{1 - p}$. The quantity $\xi$ is called the odds of the event that I win.
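The fairness condition can be checked numerically. The following is a minimal sketch, assuming a hypothetical winning probability p = 0.6 (not a value from the slides); the function names `fair_bet` and `average_gain` are introduced here for illustration only.

```python
import random


def fair_bet(p):
    """Stake xi that makes the game fair when the opponent stakes 1: xi = p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    return p / (1 - p)


def average_gain(p, xi, n, seed=0):
    """Simulate n games: I gain 1 on a win (probability p) and lose xi otherwise."""
    rng = random.Random(seed)
    wins = sum(rng.random() < p for _ in range(n))
    eps_n = wins / n  # relative frequency of wins
    return eps_n - xi * (1 - eps_n)


p = 0.6  # hypothetical winning probability, chosen for illustration
xi = fair_bet(p)
print("fair bet:", xi)
print("average gain over 100000 games:", average_gain(p, xi, 100_000))
```

With the fair stake the simulated average gain should be close to 0, and it moves away from 0 as soon as a different stake is used.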

5 The probability of the start codon Let's try to compute the probability of the start codon ATG. We need a probability model for the single letters, a model of how the letters are related, and some notation to support the computations.

6 Random variables It is useful to introduce the concept of random variables as representations of unobserved variables. Three unobserved DNA letters are denoted XYZ, and we want to compute P(XYZ = ATG) = P(X = A, Y = T, Z = G) We assume that the random variables X, Y, and Z are independent, which means that P(X = A, Y = T, Z = G) = P(X = A)P(Y = T)P(Z = G). We assume that X, Y and Z have the same distribution, that is P(X = w) = P(Y = w) = P(Z = w) for all letters w in the DNA alphabet.

7 Amino acid distributions On the sample space of amino acids { A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V } we can take the uniform distribution (unloaded die) with probability 1/20 for each amino acid. We may also encounter the Robinson-Robinson point probabilities, obtained from the relative frequencies of the occurrences of amino acids in a selection of real proteins. [Table: Robinson-Robinson point probability for each of the 20 amino acids; the numerical values are missing here.]

8 The probability of the start codon Given point probabilities for the DNA letters A, C, G and T (the loaded die model), the probability of ATG is $P(XYZ = \mathrm{ATG}) = P(X = \mathrm{A})\,P(Y = \mathrm{T})\,P(Z = \mathrm{G})$. The probability of not observing a start codon is $P(XYZ \neq \mathrm{ATG}) = 1 - P(XYZ = \mathrm{ATG})$.
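As an illustration of the product rule under independence, here is a minimal sketch. The point probabilities below are hypothetical values chosen for this example, not the ones from the original slide, and `word_probability` is a name introduced here.

```python
from math import prod

# Hypothetical point probabilities for the DNA alphabet (assumed for illustration).
point_probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def word_probability(word, probs):
    """Probability of a word when letters are independent and identically distributed."""
    return prod(probs[letter] for letter in word)


p_start = word_probability("ATG", point_probs)  # P(XYZ = ATG)
p_not_start = 1 - p_start                       # P(XYZ != ATG)
print(p_start, p_not_start)
```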

9 Exercise 1: Simulate occurrences of start codons Use the R function sample to generate random DNA sequences of length 99 with the point probabilities as given above. Generate 10,000 sequences of length 99 and compute the relative frequency of sequences with (a) a start codon at any position, (b) a start codon in the reading frame beginning with the first letter.
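The exercise asks for R; the same simulation can be sketched in Python, with `random.choices` playing the role of R's sample. The point probabilities are hypothetical stand-ins for the values on the slides, and all function names are introduced here for illustration.

```python
import random

LETTERS = "ACGT"
WEIGHTS = [0.3, 0.2, 0.2, 0.3]  # hypothetical point probabilities for A, C, G, T


def random_dna(length, rng):
    """Draw independent letters with the given point probabilities."""
    return "".join(rng.choices(LETTERS, weights=WEIGHTS, k=length))


def has_start_anywhere(seq):
    """True if ATG occurs at any position."""
    return "ATG" in seq


def has_start_in_frame(seq):
    """True if ATG occurs as a codon in the reading frame starting at the first letter."""
    return any(seq[i:i + 3] == "ATG" for i in range(0, len(seq) - 2, 3))


def relative_frequencies(n_sequences=10_000, length=99, seed=1):
    """Relative frequencies of sequences with a start codon (anywhere, in frame)."""
    rng = random.Random(seed)
    anywhere = in_frame = 0
    for _ in range(n_sequences):
        seq = random_dna(length, rng)
        anywhere += has_start_anywhere(seq)
        in_frame += has_start_in_frame(seq)
    return anywhere / n_sequences, in_frame / n_sequences


freq_anywhere, freq_in_frame = relative_frequencies()
print(freq_anywhere, freq_in_frame)
```

Every in-frame start codon is also a start codon at some position, so the first frequency can never be smaller than the second.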

10 The probability of at least one start codon If we have $n$ codons ($3n$ DNA letters), what is the probability of at least one start codon? What is the probability of not observing a start codon? The codons are independent, so $P(\text{no start codon}) = P(XYZ \neq \mathrm{ATG})^n = (1 - P(XYZ = \mathrm{ATG}))^n$ and $P(\text{at least one start codon}) = 1 - (1 - P(XYZ = \mathrm{ATG}))^n$.
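A minimal sketch of this computation follows. The start codon probability 0.018 is a hypothetical value used only for illustration, and `prob_at_least_one` is a name introduced here.

```python
def prob_at_least_one(p_start, n_codons):
    """P(at least one start codon among n independent codons) = 1 - (1 - p)^n."""
    return 1 - (1 - p_start) ** n_codons


p_start = 0.018  # hypothetical value of P(XYZ = ATG)
for n in (1, 33, 100):
    print(n, prob_at_least_one(p_start, n))
```

The probability increases with the number of codons and approaches 1 for long sequences.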

11 The number of codons before the first start codon If $L$ denotes the number of codons before the first start codon, we have found that $P(L = n) = P(\text{no start codon in } n \text{ codons, start codon at codon } n + 1) = (1 - p)^n p$ with $p = P(XYZ = \mathrm{ATG})$. This is the geometric distribution with success probability $p$ on the non-negative integers $\mathbb{N}_0$. It has the general point probabilities $P(L = n) = (1 - p)^n p$.
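The geometric point probabilities can be checked against the previous computation: summing P(L = k) over k = 0, ..., n - 1 gives the probability of at least one start codon among the first n codons. A minimal sketch, assuming a hypothetical success probability of 0.018:

```python
def geometric_pmf(n, p):
    """P(L = n): n failures followed by a success, (1 - p)^n * p."""
    return (1 - p) ** n * p


p = 0.018  # hypothetical value of P(XYZ = ATG)
n_codons = 33

# P(L <= n - 1) equals the probability of at least one start codon in n codons.
cumulative = sum(geometric_pmf(k, p) for k in range(n_codons))
print(cumulative, 1 - (1 - p) ** n_codons)
```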

12 The number of start codons What is the probability of observing $k$ start codons among the first $n$ codons? All configurations of $k$ start codons and $n - k$ non-start codons are equally probable, each with probability $p^k (1 - p)^{n - k}$, where $p = P(XYZ = \mathrm{ATG})$. There are $\binom{n}{k}$ such configurations, so with $S$ the number of start codons, $P(S = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for $k \in \{0, \ldots, n\}$. This is the binomial distribution with success probability $p$.
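A minimal sketch of the binomial point probabilities, using `math.comb` for the binomial coefficient. The success probability 0.018 and the number of codons 33 are hypothetical values chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(S = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p = 0.018  # hypothetical value of P(XYZ = ATG)
n = 33     # number of codons in a sequence of 99 letters

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print("P(S = 0):", pmf[0])
print("P(S >= 1):", 1 - pmf[0])
```

Note that P(S = 0) is the probability of no start codon among the n codons, (1 - p)^n.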

13 More complicated pattern problems Start codons do not respect the reading frame I have chosen. More complicated motifs involve wild cards and self-overlap. The loaded die model is not accurate. Returning to the dice game, which of the patterns [...] or [...] occurs first?

14 The Poisson distribution On the sample space $\mathbb{N}_0$ of non-negative integers the Poisson distribution with parameter $\lambda > 0$ is given by the point probabilities $p(n) = e^{-\lambda} \frac{\lambda^n}{n!}$ for $n \in \mathbb{N}_0$. The expected value is $\sum_{n=0}^{\infty} n\,p(n) = \sum_{n=0}^{\infty} n\,e^{-\lambda} \frac{\lambda^n}{n!} = \lambda$.
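The expected value of the Poisson distribution can be verified step by step. The derivation below uses only the exponential series $e^{\lambda} = \sum_{m \geq 0} \lambda^m / m!$; the summation index $m = n - 1$ is introduced here for the substitution.

```latex
\begin{align*}
\sum_{n=0}^{\infty} n\, e^{-\lambda} \frac{\lambda^n}{n!}
  &= e^{-\lambda} \sum_{n=1}^{\infty} \frac{\lambda^n}{(n-1)!}
     && \text{(the term $n = 0$ vanishes)} \\
  &= \lambda e^{-\lambda} \sum_{m=0}^{\infty} \frac{\lambda^m}{m!}
     && \text{(substitute $m = n - 1$)} \\
  &= \lambda e^{-\lambda} e^{\lambda} = \lambda .
\end{align*}
```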

15 Exercise 2: Distribution of patterns Use sample as previously to generate random DNA sequences, this time of length 1,000. Find the distribution of the number of occurrences of the pattern AATG using the R function gregexpr. Compare the distribution with the Poisson distribution.
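The exercise asks for R; a Python sketch of the same experiment is given here, with `re.findall` standing in for gregexpr. The point probabilities are hypothetical, the number of simulated sequences (2,000) is an arbitrary choice, and the Poisson parameter is estimated by the sample mean, which is one possible choice among several. Since AATG cannot overlap with itself, counting non-overlapping matches gives the full number of occurrences.

```python
import math
import random
import re
from collections import Counter

LETTERS = "ACGT"
WEIGHTS = [0.3, 0.2, 0.2, 0.3]  # hypothetical point probabilities for A, C, G, T


def count_pattern(seq, pattern="AATG"):
    """Number of (non-overlapping) occurrences of the pattern in seq."""
    return len(re.findall(re.escape(pattern), seq))


def simulate_counts(n_sequences=2000, length=1000, seed=2):
    """Tabulate the number of pattern occurrences over simulated sequences."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sequences):
        seq = "".join(rng.choices(LETTERS, weights=WEIGHTS, k=length))
        counts[count_pattern(seq)] += 1
    return counts


def poisson_pmf(n, lam):
    """Poisson point probability exp(-lam) * lam^n / n!."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


counts = simulate_counts()
total = sum(counts.values())
lam = sum(k * c for k, c in counts.items()) / total  # sample mean as Poisson parameter

# Empirical relative frequencies next to the Poisson point probabilities.
for k in sorted(counts):
    print(k, counts[k] / total, round(poisson_pmf(k, lam), 4))
```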

16 Hardy-Weinberg equilibrium All diploid organisms, like humans, carry two copies of each chromosome. For a gene occurring in two variants, allele A or a, there are three possible genotypes: AA, Aa, aa. We sample a random individual from a population and observe the genotype $Y$, taking values in the sample space {AA, Aa, aa}. With $X_f$ and $X_m$ denoting the unobservable father and mother alleles of the random individual, $Y$ is a function of these. Under a random mating assumption (independence of $X_m$ and $X_f$) and the assumption of equal distributions in the male and female populations: $P(Y = \mathrm{AA}) = p^2$, $P(Y = \mathrm{Aa}) = 2p(1 - p)$, $P(Y = \mathrm{aa}) = (1 - p)^2$ with $p = P(X_m = \mathrm{A}) = P(X_f = \mathrm{A})$.
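The genotype probabilities follow directly from the allele frequency. A minimal sketch, where the function name `hardy_weinberg` and the allele frequency 0.5 are introduced here for illustration:

```python
def hardy_weinberg(p):
    """Genotype probabilities under Hardy-Weinberg equilibrium for P(allele = A) = p."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    return {"AA": p ** 2, "Aa": 2 * p * (1 - p), "aa": (1 - p) ** 2}


genotypes = hardy_weinberg(0.5)  # hypothetical allele frequency
print(genotypes)
```

The factor 2 in the heterozygote probability accounts for the two ways of obtaining Aa: A from the father and a from the mother, or the reverse.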