Computational gene finding. Devika Subramanian Comp 470

Similar documents
O C. 5 th C. 3 rd C. the national health museum

Transcription in Eukaryotes

SSA Signal Search Analysis II

Gene Identification in silico

Make the protein through the genetic dogma process.

Fig Ch 17: From Gene to Protein

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot

Chapter 12 Packet DNA 1. What did Griffith conclude from his experiment? 2. Describe the process of transformation.

Prokaryotic Transcription

DNA Function: Information Transmission

DNA is normally found in pairs, held together by hydrogen bonds between the bases

Transcription Eukaryotic Cells

Lecture for Wednesday. Dr. Prince BIOL 1408

Gene Expression Transcription/Translation Protein Synthesis

Bio 101 Sample questions: Chapter 10

Chapter 12. DNA TRANSCRIPTION and TRANSLATION

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

TRANSCRIPTION AND PROCESSING OF RNA

PROTEIN SYNTHESIS. copyright cmassengale

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Eukaryotic Gene Structure

Chromosomes. Chromosomes. Genes. Strands of DNA that contain all of the genes an organism needs to survive and reproduce

Independent Study Guide The Blueprint of Life, from DNA to Protein (Chapter 7)

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Protein Synthesis. DNA to RNA to Protein

RNA and Protein Synthesis

PROTEIN SYNTHESIS. copyright cmassengale

Chapter 13. From DNA to Protein

Genes and gene finding

Genomics and Gene Recognition Genes and Blue Genes

What happens after DNA Replication??? Transcription, translation, gene expression/protein synthesis!!!!

Protein Synthesis & Gene Expression

From DNA to Protein: Genotype to Phenotype

Genomics and Gene Recognition Genes and Blue Genes

DNA Transcription. Visualizing Transcription. The Transcription Process

TRANSCRIPTION AND TRANSLATION

Year III Pharm.D Dr. V. Chitra

DNA RNA PROTEIN SYNTHESIS -NOTES-

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

Gene Expression. Student:

DNA. translation. base pairing rules for DNA Replication. thymine. cytosine. amino acids. The building blocks of proteins are?

BIO 311C Spring Lecture 36 Wednesday 28 Apr.

Ch. 10 Notes DNA: Transcription and Translation

DNA Transcription. Dr Aliwaini

Transcription. DNA to RNA

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

DNA Replication and Repair

7.2 Protein Synthesis. From DNA to Protein Animation

Name: Class: Date: ID: A

Chapter 13 - Concept Mapping

BEADLE & TATUM EXPERIMENT

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Chapter 10: Gene Expression and Regulation

Introduction to Bioinformatics

Ch 10 Molecular Biology of the Gene

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

TRANSCRIPTION COMPARISON OF DNA & RNA TRANSCRIPTION. Umm AL Qura University. Sugar Ribose Deoxyribose. Bases AUCG ATCG. Strand length Short Long

Protein Synthesis Notes

8/21/2014. From Gene to Protein

From Gene to Protein transcription, messenger RNA (mrna) translation, RNA processing triplet code, template strand, codons,

CHAPTERS , 17: Eukaryotic Genetics

Protein Synthesis

Algorithms in Bioinformatics

From Gene to Protein. Chapter 17. Biology Eighth Edition Neil Campbell and Jane Reece. PowerPoint Lecture Presentations for

Nucleic acids and protein synthesis

Answers to Module 1. An obligate aerobe is an organism that has an absolute requirement of oxygen for growth.

Nucleic acids deoxyribonucleic acid (DNA) ribonucleic acid (RNA) nucleotide

DNA - DEOXYRIBONUCLEIC ACID

DNA is the genetic material. DNA structure. Chapter 7: DNA Replication, Transcription & Translation; Mutations & Ames test

Protein Synthesis. OpenStax College

Name 10 Molecular Biology of the Gene Test Date Study Guide You must know: The structure of DNA. The major steps to replication.

Bundle 5 Test Review

DNA Structure and Analysis. Chapter 4: Background

Self-test Quiz for Chapter 12 (From DNA to Protein: Genotype to Phenotype)

Adv Biology: DNA and RNA Study Guide

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

NUCLEIC ACIDS AND PROTEIN SYNTHESIS

M I C R O B I O L O G Y WITH DISEASES BY TAXONOMY, THIRD EDITION

What is RNA? Another type of nucleic acid A working copy of DNA Does not matter if it is damaged or destroyed

RNA & PROTEIN SYNTHESIS

RNA and PROTEIN SYNTHESIS. Chapter 13

RNA and Protein Synthesis

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

The Central Dogma of Molecular Biology

Molecular Genetics. Before You Read. Read to Learn

Gene Expression: Transcription

Computational Biology I LSM5191 (2003/4)

13.1 RNA. Lesson Objectives. Lesson Summary

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

COMPUTER RESOURCES II:

DNA and RNA. Chapter 12

Protein Synthesis: Transcription and Translation

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

1. The diagram below shows an error in the transcription of a DNA template to messenger RNA (mrna).

Protein Synthesis. Lab Exercise 12. Introduction. Contents. Objectives

Chapter 14 Active Reading Guide From Gene to Protein

Transcription:

Computational gene finding Devika Subramanian Comp 470

Outline (3 lectures) The biological context Lec 1 Lec 2 Lec 3 Markov models and Hidden Markov models Ab-initio methods for gene finding Comparative methods for gene finding Evaluating gene finding programs (c) Devika Subramanian, 2009 2

The biological context Introduction to the human genome and genes The central dogma: transcription and translation (c) Devika Subramanian, 2009 3

Facts about the human genome The human genome contains 3 billion chemical nucleotide bases (A, C, T, and G). About 20,500 genes are estimated to be in the human genome. Chromosome 1 (the largest human chromosome) has the most genes (2968), and the Y chromosome has the fewest (231). http://www.pnas.org/content/104/49/19428.abstract ~17,000 genes confirmed experimentally/ensembl2009 (c) Devika Subramanian, 2009 4

More facts The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases. (c) Devika Subramanian, 2009 5

More facts Genes appear to be concentrated in random areas along the genome, with vast expanses of non-coding DNA between. About 2% of the genome encodes instructions for the synthesis of proteins. We do not know the function of more than 50% of the discovered genes. (c) Devika Subramanian, 2009 6

More facts The human genome sequence is almost (99.9%) exactly the same in all people. There are about 3 million locations where single-base DNA differences occur in humans (Single Nucleotide Polymorphisms or SNPs). Over 40% of the predicted human proteins share similarity with fruit-fly or worm proteins. (c) Devika Subramanian, 2009 7

A great site to learn more http://www.dnai.org/index.htm (c) Devika Subramanian, 2009 8

Genome sizes Organism Genome Size (Bases) Estimated Genes Human (Homo sapiens) 3 billion 20,000 Laboratory mouse (M. musculus) 2.6 billion 20,000 Mustard weed (A. thaliana) 100 million 20,000 Roundworm (C. elegans) 97 million 19,000 Fruit fly (D. melanogaster) 137 million 13,000 Yeast (S. cerevisiae) 12.1 million 6,000 Bacterium (E. coli) 4.6 million 3,200 Human immunodeficiency virus (HIV) 9700 9 No correlation between amount of DNA in a species and number of genes (c) Devika Subramanian, 2009 9

Codons 3 consecutive DNA bases code for an amino acid. There are 64 possible codons, but only 20 amino acids (some amino acids have multiple codon representations). Special codons: start codon (ATG) [and two others] and three stop codons (TAG, TGA, TAA). They indicate the start and end of translation regions. (c) Devika Subramanian, 2009 10

The central dogma mrna produced by transciption DNA mrna proteins (c) Devika Subramanian, 2009 11

Transcription When a gene is "expressed" the sequence of nucleotides in the DNA is used to determine the sequence of amino acids in a protein in a two step process. First, the enzyme RNA polymerase uses one strand of the DNA as a template to synthesize a complementary strand of messenger RNA (mrna) in a process called transcription. RNA is identical to DNA except that in RNA T is replaced with U (for uracil). Also, unlike DNA, RNA usually exists as a single stranded molecule. (c) Devika Subramanian, 2009 12

Splicing and Translation In eukaryotes, after a gene is transcribed the introns are removed from the mrna and the adjacent exons are spliced together in the nucleus prior to translation outside the nucleus. (alternative splicing) After the mrna for a particular gene is made it is used as a template with which ribosomes synthesize the protein in a process called translation. (c) Devika Subramanian, 2009 13

The Biological Model Eukaryotic Gene Structure (c) Devika Subramanian, 2009 14

The current view Philip A. Sharp, MIT, 2009 (c) Devika Subramanian, 2009 15

How genes are validated Genomic DNA Full-length mrna ESTs Protein cdna - single-stranded DNA complementary to an RNA, synthesized from it by reverse transcription full-length mrnas (GenBank RefSeq, ~16000 human sequences)/ensembl 2009 ESTs Expressed Sequence Tags relatively short, 500 bp long on average span one or more exons large data sets required (GenBank dbest 4.3 M human sequences) (c) Devika Subramanian, 2009 16

Signals Translation start codon CpG rich region TATA box ATG Gene DNA coding strand AAUAAA Exon 1 Exon 2 primary (A)n GT AG transcript Polyadenylation site Splice signals Upstream regulatory signals (TATA boxes) Translation start codon (ATG) Translation stop codon (e.g., TAA) Polyadenylation signal (~AATAAA) Splice recognition signals (e.g., GT-AG, branch point) (c) Devika Subramanian, 2009 17

Computational gene finding Gene finding in prokaryotes Gene finding in eukaryotes (c) Devika Subramanian, 2009 18

Finding genes in prokaryotes Prokaryotes are single-celled organisms without a nucleus (e.g., bacteria). Few introns in prokayotic cells. Over 70% of H. influenzae genome codes for proteins. No introns in coding region. gene1 gene2 gene3 (c) Devika Subramanian, 2009 19

Prokaryotic gene structure ORF (open reading frame) TATA box Start codon Stop codon Frame 1 Frame 2 Frame 3 ATGACAGATTACAGATTACAGATTACAGGATAG

Prokaryotes stack multiple genes together for expression ( operons ) Promoter Gene1 Gene2 Gene N Terminator Transcription RNA Polymerase mrna 5 3 C 1 2 N Translation N N C Polypeptides Ribosome, trnas, Protein Factors N C 1 2 3

Prokaryote gene finding Advantages Simple gene structure Small genomes (0.5 to 10 million bp) No introns High coding density (>90%) Disadvantages Some genes overlap (nested) Some genes are quite short (<60 bp)

Finding genes in prokaryotes Main idea: if bases were drawn uniformly at random, then a stop codon is expected once every 64/3 (about 21) bases. Since coding regions are terminated by stop codons, a simple technique to find genes is to look for long stretches of bases without a stop codon. Once a stop codon is found, we work backward to find the start codon corresponding to the gene. Main problems: misses short genes, overlapping ORFs. (c) Devika Subramanian, 2009 23

Gene finding based on start/stop codons Look for putative start codon (ATG) Staying in same frame, scan in groups of three until a stop codon is found If # of codons >=50, assume it s a gene If # of codons <50, go back to last start codon, increment by 1 & start again At end of chromosome, repeat process for reverse complement

Example ORF

Segment of Influenza Virus http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?d b=nucleotide&val=cy018024 1151 bp segment of Influenza B Virus. Has two genes: 4 to 750 and 750 to 1079 (c) Devika Subramanian, 2009 26

The sequence 1 aaaatgtcgc tgtttggaga cacaattgcc tacctgcttt cattgacaga agatggagaa 61 ggcaaagcag aactagcaga aaaattacac tgttggttcg gtgggaaaga atttgaccta 121 gactctgcct tggaatggat aaaaaacaaa agatgcttaa ctgatataca aaaagcacta 181 attggtgcct ctatctgctt tttaaaaccc aaagaccagg aaaggaaaag aagattcatc 241 acagagcctc tatcaggaat gggaacaaca gcaacaaaaa agaaaggcct gattctagct 301 gagagaaaaa tgagaagatg tgtgagcttt catgaagcat ttgaaatagc agaaggccat 361 gaaagctcag cgctactata ttgtctcatg gtcatgtacc tgaatcctgg aaattattca 421 atgcaagtaa aactaggaac gctctgtgct ttgtgcgaga aacaagcatc acattcacac 481 agggctcata gcagagcagc gagatcttca gtgcccggag tgagacgaga aatgcagatg 541 gtctcagcta tgaacacagc aaaaacaatg aatggaatgg gaaaaggaga agacgtccaa 601 aagctggcag aagagctgca aagcaacatt ggagtattga gatctcttgg agcaagtcaa 661 aagaatgggg aaggaattgc aaaggatgta atggaagtgc taaagcagag ctctatggga 721 aattcagctc ttgtgaagaa atatctataa tgctcgaacc atttcagatt ctttcaattt 781 gttcttttat cttatcagct ctccatttca tggcttggac aatagggcat ttgaatcaaa 841 taaaaagagg agtaaacatg aaaatacgaa taaaaggtcc aaacaaagag acaataaaca 901 gagaggtatc aattttgaga cacagttacc aaaaagaaat ccaggccaaa gaaacaatga 961 aggaagtact ctctgacaac atggaggtat tgagtgacca catagtgatt gaggggcttt 1021 ctgccgaaga gataataaaa atgggtgaaa cagttttgga gatagaagaa ttgcattaaa 1081 ttcaattttt tactgtattt cttattatgc atttaagcaa attgtaatca atgtcagcaa 1141 ataaactgga a (c) Devika Subramanian, 2009 27

Problems with start/stop codon based approaches Advantages Simple and fairly sensitive (>50%) Disadvantages Prokaryotic genes are not always so simple to find ATG is not the only possible start site (e.g. CTG, TTG class I alternates) Small genes tend to be overlooked and long ones over-predicted Solution? Use additional information to increase confidence in predictions

Other content features RNA polymerase promoter site (-10, -30 site or TATA box) Shine-Dalgarno sequence (+10, Ribosome Binding Site) to initiate protein translation Codon biases High GC content Search for nucleotide sequences and coded proteins in related organisms (c) Devika Subramanian, 2009 29

Promotor structure (c) Devika Subramanian, 2009 30

(c) Devika Subramanian, 2009 31

GLIMMER State of the art prokaryotic gene finder. Based on interpolated Markov models. Available at http://cbcb.umd.edu/software/glimmer 98% accuracy in identifying viral and microbial genes. 2007 paper in Bioinformatics that shows latest version of tool. (c) Devika Subramanian, 2009 32

Gene finding in eukaryotes Gene finding in eukaryotic DNA (c) Devika Subramanian, 2009 33

Structure of a human gene Exon-intron structure (c) Devika Subramanian, 2009 34

atg ggtgag caggtg ggtgag cagatg cagttg ggtgag ggtgag caggcc tga (c) Devika Subramanian, 2009 35

Ab initio methods Use information embedded in the genomic sequence exclusively to predict the gene structure. Find structure G representing gene boundaries + internal gene structure which maximizes the probability P(G genomic sequence). Hidden Markov models are the predominant generative method for modeling the problem. (c) Devika Subramanian, 2009 36