Gene Prediction in Eukaryotes


1 Gene Prediction in Eukaryotes Jan-Jaap Wesselink Biomol Informatics, S.L. June 2010/Madrid (BI) Gene Prediction June 2010/Madrid 1 / 34

2 Outline
1 Gene Structure
   Eukaryotes
   Find Signals
2 Different Approaches To Gene Finding
   Different Information
   Search by Signal
   PWM
   Search by Content
   Coding Statistics
   HMM
   Homology
   Comparative
3 Accuracy of Predictions
4 References
jjw@biomol-informatics.com

8 Eukaryotic Gene Structure

9 Schematic Gene Structure
[Figure: gene model Exon / Intron / Exon / Intron / Exon, with the signals ATG, GT, AG, GT, AG, TGA marked along it and the UTR and CDS portions labelled]
Gene prediction programs only predict the coding fraction of genes.
Signals                  Exons      Regions
Start (ATG)              Single     Exons
Stops (TGA, TAA, TAG)    First      Introns
Donor (GT)               Internal   Intergenic
Acceptor (AG)            Terminal   5' and 3' UTRs

11 Signals Are Difficult To Find (1)
Example: try reading this sentence:
LOOKITSMUCHEASIERLIKETHIS
Look! It's much easier like this!

13 Signals Are Difficult To Find (2)
Example: genomic DNA sequence
GTTTCAAGTGATCCTCCCGCCTCAGCCTGCCCAGGTGCTGAGATTACATGTATGAGCCAC
TGCACCTGGAAAGGAGCCAGAAATGTGAAGTGCTAGCTGAAGGATGAGCAGCAGCTAGCC
AGGCAAAGGTAGGGTTTGGGGAAGGAAAGTGCACATTCTCTTCCCATCTGTGTTTCAGGG
GGCAATGGCGGCTTCCTGTGTTCTACTGCACACTGGGCAGAAGATGCCTCTGATTGGTCT
GGGTACCTGGAAGAGTGAGCCTGGTCAGGTGAGGGATGGGGGAAGAAAAAAGAAACCTCT
GCTTCTCTCACCTGGCAGGTAAAAGCAGCTGTTAAGTATGCCCTTAGCGTAGGCTACCGC
CACATTGATTGTGCTGCTATCTACGGCAATGAGCCTGAGATTGGGGAGGCCCTGAAGGAG
GACGTGGGACCAGGCAAGGTAAGGACTGGGGTTGTAAATAGAGCTGTGGGCCCTGCCCCC
TGCACTAG
Only beginning and end of introns are shown.
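
To see why signals alone are not enough, one can list every position at which a start, stop, donor or acceptor string occurs. The following is only a minimal sketch: the function name `find_signals` and the `SIGNALS` table are illustrative assumptions, not part of geneid or any other tool, and only the forward strand is scanned.

```python
import re

# Signal strings taken from the gene structure slide; names are illustrative.
SIGNALS = {
    "start": "ATG",
    "stop": "TGA|TAA|TAG",
    "donor": "GT",
    "acceptor": "AG",
}


def find_signals(seq):
    """Return, per signal type, the 0-based positions of every candidate site."""
    seq = seq.upper()
    hits = {}
    for name, pattern in SIGNALS.items():
        # A lookahead lets overlapping occurrences be reported as well.
        regex = re.compile(f"(?=(?:{pattern}))")
        hits[name] = [match.start() for match in regex.finditer(seq)]
    return hits


if __name__ == "__main__":
    candidates = find_signals("ATGGTAAGTAG")
    for name, sites in candidates.items():
        print(name, sites)
```

Running this on the genomic sequence above yields many candidate sites of each type, most of which are not real signals, which is why the candidates have to be scored.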

15 All Signals Predicted by geneid in a Genomic DNA Sequence [figure]

16 All Exons Predicted by geneid in a Genomic DNA Sequence [figure]

19 Different Approaches to Gene Finding
Different types of information can be used:
- Signals: search for signals of transcription, splicing and translation. Typically, each candidate signal is assigned a score, and the highest-scoring signals are combined.
- Content: discriminate protein-coding from non-coding regions, using statistical models of nucleotide frequencies and dependencies in codons.
- Homology: significant sequence similarity of a genomic DNA sequence to a known gene implies that it is likely to share its function. This information may be used in the gene prediction process.

24 Search by Signal
Signals are usually represented as patterns. Examples of pattern representations:
- Strings: P = GCCACCTAGG
- Consensus sequences: substitutions occur at certain positions
- Regular expressions: describe the set of strings generated by a regular language
- Decision trees
- Position Weight Matrices

29 Patterns: examples
Consensus sequence:
p1 = CTAAAAATAA
p2 = TTAAAAATAA
p3 = TTTAAAATAA
p4 = CTATAAATAA
p5 = TTATAAATAA
p6 = CTTAAAATAG
p7 = TTTAAAATAG
P  = YTWWAAATAR
Y = pyrimidine (C or T), W = A or T, R = purine (A or G)
Regular expression (Prosite pattern):
P = G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]

32 Position Weight Matrices
Construction of a PWM for splice sites (...GT... or ...AG...):
1. Align the sequences of true splice sites.
2. Calculate the relative frequencies.
3. Do the same for false donor sites.
4. Calculate the log-odds ratio of the true and false frequencies: $M_{i,j} = \ln\left( f^{\text{true}}_{i,j} / f^{\text{false}}_{i,j} \right)$, with both frequencies taken from the same matrix cell (i, j).
5. The score of a given sequence is the sum of the matrix coefficients.
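
The five construction steps can be sketched in a few lines. This is a minimal illustration, not the geneid implementation: the function names are hypothetical, the sites are assumed to be pre-aligned strings of equal length, and a pseudocount (not mentioned on the slide) is added so that no frequency is zero and the logarithm stays defined.

```python
import math

ALPHABET = "ACGT"


def relative_frequencies(sites, pseudocount=1.0):
    """Steps 1-2: per-position nucleotide frequencies of aligned sites."""
    columns = []
    for position in range(len(sites[0])):
        counts = {nt: pseudocount for nt in ALPHABET}  # pseudocount is an assumption
        for site in sites:
            counts[site[position]] += 1
        total = sum(counts.values())
        columns.append({nt: counts[nt] / total for nt in ALPHABET})
    return columns


def build_pwm(true_sites, false_sites, pseudocount=1.0):
    """Steps 3-4: log-odds of true versus false frequencies per matrix cell."""
    f_true = relative_frequencies(true_sites, pseudocount)
    f_false = relative_frequencies(false_sites, pseudocount)
    return [
        {nt: math.log(col_true[nt] / col_false[nt]) for nt in ALPHABET}
        for col_true, col_false in zip(f_true, f_false)
    ]


def score(pwm, seq):
    """Step 5: sum the matrix coefficients selected by the sequence."""
    return sum(column[nt] for column, nt in zip(pwm, seq))


if __name__ == "__main__":
    # Made-up toy sites, for illustration only.
    true_donors = ["GTAAG", "GTGAG", "GTAAG"]
    false_donors = ["GTCCT", "GTTTC", "GTCTT"]
    pwm = build_pwm(true_donors, false_donors)
    print(score(pwm, "GTAAG"), score(pwm, "GTCTT"))
```

Positions where true and false sites look alike (here the invariant GT) contribute a coefficient of zero, so the score is driven by the flanking positions.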

38 Example: Donor Sites [figure]

40 Search by Content
Two approaches: coding statistics and Hidden Markov Models.
Coding Statistics
- there are 64 codons but only 20 standard amino acids (plus the stop signal), so most amino acids are encoded by more than one codon
- triplet frequencies in coding exons differ from those in introns
- codon usage is computed from the codon frequencies observed in coding DNA
- for a sequence S, the more of its codons occur frequently in coding sequences, the higher the probability that S codes for a protein
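
Computing codon usage from coding DNA amounts to counting in-frame triplets. A minimal sketch, assuming each input string is a coding sequence that starts in frame; the function name `codon_usage` is illustrative.

```python
from collections import Counter


def codon_usage(coding_sequences):
    """Relative frequency of each codon over a collection of in-frame CDSs."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    usage = codon_usage(["ATGGCTGCTTAA", "ATGGCTTGA"])
    for codon, frequency in sorted(usage.items()):
        print(codon, round(frequency, 3))
```

A realistic table would be estimated from thousands of annotated coding sequences of the species of interest, since codon usage differs between organisms.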

47 Coding Statistics
The probability of a sequence S, read as the codons C_1 C_2 ... C_n, being protein coding:
$p(S) = p(C_1)\, p(C_2) \cdots p(C_n)$
$p(S) \approx f(C_1)\, f(C_2) \cdots f(C_n)$
where f(C_k) is the frequency of codon C_k in coding sequences.
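
Because a product of many small frequencies underflows quickly, the product is usually evaluated as a sum of logarithms. The sketch below makes two assumptions that are not on the slide: codons missing from the frequency table receive a small floor value, and the function name `log_coding_probability` is made up for this example.

```python
import math


def log_coding_probability(seq, codon_freqs, floor=1e-6):
    """Sum of log f(C_k) over the in-frame codons of seq.

    codon_freqs maps codon -> frequency in coding sequences; unseen codons
    get the (assumed) floor value so that the logarithm stays defined.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    total = 0.0
    for i in range(0, usable, 3):
        total += math.log(codon_freqs.get(seq[i:i + 3], floor))
    return total


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    freqs = {"ATG": 0.5, "GCT": 0.5}
    print(log_coding_probability("ATGGCT", freqs))
    print(log_coding_probability("ATGTTT", freqs))
```

In practice the score is compared across the three reading frames, or against a non-coding model, rather than interpreted on its own.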

50 Hidden Markov Models
- HMMs can be built for entire genes
- HMMs can be built for coding/non-coding regions
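
As an illustration of the second case, here is a toy two-state HMM (coding / non-coding) decoded with the Viterbi algorithm. Every number below is a made-up parameter chosen for the example, not a trained model, and real gene finders use far richer state structures (exon phases, splice sites, introns).

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical toy parameters, for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for nt in seq[1:]:
        previous = scores[-1]
        column, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best = max(STATES, key=lambda p: previous[p] + math.log(TRANS[p][s]))
            column[s] = previous[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][nt])
            pointer[s] = best
        scores.append(column)
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGC"))
```

With these parameters the GC-rich state is labelled "coding" purely for demonstration; the point is that the decoded path segments the sequence into regions.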

54 Search by Homology
Various uses of homology in gene prediction:
- a genomic DNA sequence can be compared with a database of ESTs (using e.g. blastn)
- genomic DNA sequences can be compared to a database of protein sequences (using blastx) to identify coding regions
- predicted peptides can be compared with a protein sequence database to assign putative functions
- the genome of one species can be compared to the genome of another, closely related species: conserved regions often correspond to conserved functions (e.g. exons, parts of promoters)
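
Translated searches such as blastx compare a nucleotide query to proteins by first translating the query in all six reading frames. The sketch below shows only that translation step, using the standard genetic code; it is not BLAST itself, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate seq codon by codon; '*' marks a stop, 'X' an unknown codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translation(seq):
    """Return the translations of the three forward and three reverse frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translation("ATGGCTTAA").items():
        print(frame, peptide)
```

Each of the six peptides would then be aligned against the protein database; a frame that yields significant hits points to a coding region on that strand.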

59 Comparative Gene Prediction
Comparing the human FOS gene with [Mouse], [Chicken] and [Pufferfish] using tblastx

61 Gene Prediction Accuracy
- measured on annotated sequences
- can be measured at nucleotide, exon and gene level
$Sn = \frac{TP}{TP + FN}$, $Sp = \frac{TP}{TP + FP}$
where TP, FN and FP are the numbers of true positives, false negatives and false positives.
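
At the nucleotide level, Sn and Sp can be computed by comparing the sets of positions labelled as coding in the annotation and in the prediction. A minimal sketch under stated assumptions: intervals are half-open (start, end) pairs on one strand, the function names are illustrative, and returning 0.0 for an empty denominator is a choice made here, not a convention from the slides.

```python
def positions(intervals):
    """Expand half-open (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp) at the nucleotide level."""
    truth = positions(annotated)
    guess = positions(predicted)
    tp = len(truth & guess)   # coding and predicted coding
    fn = len(truth - guess)   # coding but missed
    fp = len(guess - truth)   # predicted coding but not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    print(nucleotide_accuracy([(0, 10), (20, 30)], [(0, 10)]))
```

The same two ratios apply at the exon and gene level; only the definition of a "correct" prediction changes (for example, an exon counts as a true positive only if both boundaries match).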

64 References
jjw/oeiras05/
Guigó, R. (1999). DNA Composition, Codon Usage and Exon Prediction. In Bishop, M., ed. Genetic Databases, Academic Press.
Eddy, S. (2004). What is a Hidden Markov Model? Nature 22:
Burset, M. and Guigó, R. (1996). Evaluation of gene structure prediction programs. Genomics, 34:
Brent, M.R. and Guigó, R. (2004). Recent advances in gene structure prediction. Curr. Opin. Struct. Biol. 14:
Guigó, R. and Reese, M.G. (2005). EGASP: collaboration through competition to find human genes. Nature Methods, 2: