May 16. Gene Finding


1 Gene Finding

2 Q is a set of states. T is a matrix of transition probabilities; T[j,k] is the probability of moving from state j to state k. Σ is a set of symbols. $e_j(S)$ is the probability of emitting S while in state j. Automaton $M = (Q, T, \pi, \Sigma, e)$. At first, M goes to initial state j with probability $\pi_j$. In state j, M emits a symbol from Σ according to $e_j$, and moves to state k with probability T[j,k].

3 Viterbi: $P_{\max}(i,j \mid M) = \max_k P_{\max}(i-1,k)\, T[k,j]\, e_j(S_i)$. Sum over paths: $P_{\mathrm{sum}}(i,j \mid M) = \sum_k \big(P_{\mathrm{sum}}(i-1,k)\, T[k,j]\big)\, e_j(S_i)$.
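The two recurrences above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `viterbi_and_forward`, the state names F and L, and the initial and transition probabilities are assumptions made for the example; only $e_F(H) = 0.5$ and $e_L(H) = 0.1$ come from the slides.

```python
def viterbi_and_forward(obs, states, pi, T, e):
    """Return (best_path, best_prob, total_prob) for the observed symbols.

    p_max follows the max recurrence (Viterbi); p_sum follows the sum recurrence.
    """
    p_max = [{j: pi[j] * e[j][obs[0]] for j in states}]
    p_sum = [dict(p_max[0])]
    back = [{}]
    for i in range(1, len(obs)):
        p_max.append({})
        p_sum.append({})
        back.append({})
        for j in states:
            # Best predecessor k for state j at position i.
            k_best = max(states, key=lambda k: p_max[i - 1][k] * T[k][j])
            p_max[i][j] = p_max[i - 1][k_best] * T[k_best][j] * e[j][obs[i]]
            back[i][j] = k_best
            p_sum[i][j] = sum(p_sum[i - 1][k] * T[k][j] for k in states) * e[j][obs[i]]
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda j: p_max[-1][j])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, p_max[-1][last], sum(p_sum[-1].values())


states = ["F", "L"]
pi = {"F": 0.5, "L": 0.5}                       # assumed initial probabilities
T = {"F": {"F": 0.9, "L": 0.1},                 # assumed transition probabilities
     "L": {"F": 0.1, "L": 0.9}}
e = {"F": {"H": 0.5, "T": 0.5},                 # e_F(H) = 0.5 as on the slides
     "L": {"H": 0.1, "T": 0.9}}                 # e_L(H) = 0.1 as on the slides
path, best_prob, total_prob = viterbi_and_forward(list("HHTTT"), states, pi, T, e)
```

With these assumed parameters the most probable state path for H H T T T stays in F throughout; different transition probabilities would give different numbers.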

4 $e_F(H) = 0.5$, $e_L(H) = 0.1$, $M = (Q, T, \pi, \Sigma, e)$

5 $e_L(H) = 0.1$. H H T T T is the observed sequence. [Table of $P_{\max}$ values, e.g. $P_{\max}(2,F)$; entries not recoverable from this transcript.]

6 HMMs allow us to model position-specific gap penalties, and allow for automated training to get a good alignment. Patterns/Profiles/HMMs allow us to represent families and focus on key residues. Each has its advantages and disadvantages, and needs special algorithms to query efficiently.

7 Input: a protein sequence of unknown function. To get function: Compare against a database of protein sequences with known function. Create a database of multiple alignments of diverged family members. Search using patterns such as regular expressions. Search using profiles. Search using HMMs. In all cases, search domains, not the entire sequence.

8 A number of databases capture proteins (domains) using various representations. Each domain is also associated with structure/function information, parsed from the literature. Each database has specific query mechanisms that allow us to compare our sequences against them, and assign function.

9 What is a Gene? 5/16/16 CSE 182

10 In our discussion of BLAST, we alternated between looking at DNA and protein sequences, treating them as strings. DNA, RNA, and proteins are the 3 important molecules. What is the relation between the three?

11

12 We define a gene as a location on the genome that codes for proteins. The genic information is used to manufacture proteins through transcription and translation. There is a unique mapping from triplets to amino acids.

13

14

15 [Figure: gene structure on the sequence ATAGATGATGTACGATGAGAATGTGATTAATG, with labels Transcription start, Translation start, Donor, Acceptor]

16 The ribosomal machinery reads mRNA. Each triplet is translated into a unique amino acid until the STOP codon is encountered. There is also a special signal where translation starts, usually at the ATG (M) codon.

17 Given a DNA sequence, how many ways can you translate it?

18 The gene can lie on either strand (relative to the reference genome). The code can be in one of 3 frames.
Frame 1: S R V * W T
Frame 2: V E Y S G
Frame 3: * S I V D
AGTAGAGTATAGTGGACG
TCATCTCATATCACCTGC (-ve strand)
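The strand-and-frame idea can be checked with a short Python sketch that translates the slide's example sequence in all six reading frames (three per strand). The helper names `reverse_complement`, `translate`, and `six_frames` are illustrative, and the standard genetic code is assumed.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at offset frame (0, 1 or 2)."""
    codons = (dna[p:p + 3] for p in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(dna):
    """Map (strand, frame number) to the translated peptide string."""
    rc = reverse_complement(dna)
    return {
        (strand, f + 1): translate(seq, f)
        for strand, seq in (("+", dna), ("-", rc))
        for f in range(3)
    }


frames = six_frames("AGTAGAGTATAGTGGACG")
```

Here `frames[("+", 1)]`, `frames[("+", 2)]` and `frames[("+", 3)]` correspond to Frame 1 to Frame 3 on the slide; the three `"-"` entries are the frames of the reverse strand.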

19

20 [Figure: gene structure, labeled with Transcription start, 5' UTR, Translation start (ATG), exon, Donor splice site, intron, Acceptor, 3' UTR]

21 Eukaryotic gene definitions: Location that codes for a protein. The transcript sequence(s) that encode the protein. The protein sequence(s). Suppose you want to know all of the genes in an organism. This was a major problem in the 70s. PhDs and careers were spent isolating a single gene sequence. All of that changed with better reagents and the development of high-throughput methods like EST sequencing.

22 Proteins are the molecular machinery of the cell. Drugs target proteins, binding to activate/inhibit.

23 Only a few (protein) targets were known in the 1970s. The focus was on designing drugs that interact with the target.

24

25

26 It is possible to extract all of the mRNA from a cell. However, mRNA is unstable. An enzyme called reverse transcriptase is used to make a DNA copy of the RNA. Use DNA polymerase to get a complementary DNA strand. Sequence the (stable) cDNA from both ends. This leads to a collection of transcripts/expressed sequences (ESTs). Many might be from the same gene.

27 The expressed transcript (mRNA) has a poly-A tail at the end, which can be used as a template for reverse transcriptase. This collection of DNA has only the spliced message! It is sampled at random and sequenced from one (3'/5') or both ends. Each message is sampled many times. The resulting collection of sequences is called an EST database.

28 Often, reverse transcriptase breaks off early. Why is this a good thing? The 3' end may not have much coding sequence. We can assemble the 5' end to get more of the coding sequence.

29 Newer methods like RNA-seq offer a more comprehensive sampling of the set of transcripts. They can be used for gene finding, but: There are differences in expression/abundance of transcripts. The gene sequence is in small pieces. The fragments must still be mapped back to the genome to get the coordinates. There are other features of the gene that are not revealed by transcript sequencing.

30 Given genomic DNA, identify all the coordinates of the gene. TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode) [Figure: gene structure, labeled with Transcription start, 5' UTR, Translation start (ATG), exon, Donor splice site, intron, Acceptor, 3' UTR]

31 Given genomic DNA, does it contain a gene (or not)? Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions. Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

32 You are given a collection of exons, and a collection of intergenic sequence. Count the number of occurrences of ATGATG in intergenic sequence and in exons. Suppose 1% of the hexamers in exons are ATGATG. Only 0.01% of the hexamers in intergenic sequence are ATGATG. How can you use this idea to find genes?

33 [Table: hexamer frequencies ($\times 10^{-5}$) for AAAAAA, AAAAAC, AAAAAG, AAAAAT, ... in I, E, and X; values not recoverable from this transcript.] Compute a frequency count for all hexamers. Exons, Intergenic and the sequence X are all vectors in a multi-dimensional space. Use this to decide whether a sequence X is exonic/intergenic.
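A hexamer frequency vector of the kind described above can be computed with a few lines of Python. This is only a sketch: the function name `hexamer_frequencies` is illustrative, and it counts overlapping hexamers at every offset, which is an assumption (a frame-aware variant would step by three).

```python
from collections import Counter


def hexamer_frequencies(seq, k=6):
    """Return {k-mer: relative frequency} over all overlapping k-mers in seq."""
    counts = Counter(seq[p:p + k] for p in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


freqs = hexamer_frequencies("ATGATGATGATG")
```

Running the same function on a pooled exon set, a pooled intergenic set, and a query sequence X gives the three vectors E, I and X that the slide compares.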

34 Plot the following vectors: E = [10, 20], I = [10, 5], V_3 = [6, 10], V_4 = [9, 15]. Is V_3 more like E or more like I?

35 Normalize: $V \leftarrow V / \lVert V \rVert$. All vectors have the same length (lie on the unit circle). Next, compute the angle to E, and I. Choose the feature that is closer (smaller angle). $\text{E-score}(V_3) = \dfrac{\alpha}{\alpha + \beta}$, with α and β the angles marked in the figure.
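The angle comparison can be tried on the four vectors from the previous slide. In this sketch the assignment of the two angles is an assumption: α is taken as the angle between V and I, and β as the angle between V and E, so that a score near 1 means "exon-like". The function names are illustrative.

```python
import math


def angle(u, v):
    """Angle in radians between vectors u and v (normalization is implicit)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def e_score(v, exon, intergenic):
    """alpha / (alpha + beta), assuming alpha = angle to I and beta = angle to E."""
    alpha = angle(v, intergenic)
    beta = angle(v, exon)
    return alpha / (alpha + beta)


E, I = [10, 20], [10, 5]
V3, V4 = [6, 10], [9, 15]
score_v3 = e_score(V3, E, I)
score_v4 = e_score(V4, E, I)
```

V_3 and V_4 point in the same direction (6/10 = 9/15), so after normalization they receive the same score; both lie much closer to E than to I. The sketch does not handle the degenerate case where E and I are parallel.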

36 Fickett and Tung (1992) compared various measures. Measures that preserve the triplet frame are the most successful. Genscan uses a 5th order Markov model.

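37 Hexamer counts:
Hexamer  Exon  Intron
AAAAAA   20    1
AAAAAC   50
AAAAAG   5     30
AAAAAT   3
..
Total    78
$\Pr_{\text{EXON}}[\text{AAAAAACGAGAC..}] = T[\text{AAAAA},\text{A}]\; T[\text{AAAAA},\text{C}]\; T[\text{AAAAC},\text{G}]\; T[\text{AAACG},\text{A}] \cdots = (20/78)\,(50/78) \cdots$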

38 $\text{CodingDifferential}[x] = \log\left(\dfrac{\Pr_{\text{Exon}}[x]}{\Pr_{\text{Intron}}[x]}\right)$. The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. an intron. In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions.
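A 5th order Markov model and the resulting coding differential can be sketched as follows. This is a simplified illustration, not Genscan's procedure: it trains a single transition table per class rather than one per frame, adds a pseudocount (an assumption, to avoid zero probabilities), and uses toy training strings. All names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=5, pseudocount=1.0):
    """Return T(context, base): estimated probability of base after context."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for p in range(order, len(s)):
            counts[s[p - order:p]][s[p]] += 1

    def T(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return T


def log_prob(T, x, order=5):
    """Sum of log T[context, next base] over x (the first `order` bases are skipped)."""
    return sum(math.log(T(x[p - order:p], x[p])) for p in range(order, len(x)))


def coding_differential(x, T_exon, T_intron, order=5):
    """Log odds of x under the exon model versus the intron model."""
    return log_prob(T_exon, x, order) - log_prob(T_intron, x, order)


# Toy training data, invented for the example.
T_exon = train_markov(["GCCGCCGCCGCCGCCGCC"])
T_intron = train_markov(["ATATATATATATATATAT"])
exon_like = coding_differential("GCCGCCGCCGCC", T_exon, T_intron)
intron_like = coding_differential("ATATATATATAT", T_exon, T_intron)
```

A positive differential favors the exon model and a negative one favors the intron model. A frame-aware version would keep three exon tables and pick the table by position modulo 3.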

39

40 Plot the coding score using a sliding window of fixed length. The (large) exons will show up reliably. Not enough to predict gene boundaries reliably.
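The sliding-window idea reduces to a running sum. The sketch below assumes a hypothetical list `per_base` of per-position coding scores (for example, per-base log-odds contributions) and returns one total per window position.

```python
def sliding_window_scores(per_base, window):
    """Return the sum of per_base over every window of the given fixed length."""
    if window > len(per_base):
        return []
    total = sum(per_base[:window])
    scores = [total]
    for p in range(window, len(per_base)):
        # Slide right by one: add the entering score, drop the leaving one.
        total += per_base[p] - per_base[p - window]
        scores.append(total)
    return scores


scores = sliding_window_scores([1, 2, 3, 4, 5], window=3)
```

Large exons show up as long runs of high window totals, while the window length blurs the exact boundaries, which is the limitation the slide points out.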

41 Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise. When combined they can be effective. [Figure: ATG, GT and AG signals shown against the coding score]

42 We can compute the following: E-score[i,j], I-score[i,j], D-score[i], A-score[i]. Goal is to find coordinates that maximize the total score.

43 Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure. It was not easy to train on many parameters. The Guigo & Burset test revealed that accuracy was still very low. Problem with multiple genes in a genomic region.

44 An HMM is the best way to model and optimize the combination of signals. Here, we will use a simpler approach which is essentially the same as the Viterbi algorithm for HMMs, but without the formalism.

45 IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEEIIIII (segment boundaries at i_1, i_2, i_3, i_4). Identifying a gene is equivalent to labeling each nucleotide as E/I/intergenic etc. These labels are the hidden states. For simplicity, consider only two states E and I.

46 IIIIIEEEEEEIIIIIIEEEEEEIIIIEEEEEEIIIII (segment boundaries at i_1, i_2, i_3, i_4). Given a labeling L, we can score it as $\text{I-score}[0..i_1-1] + \text{E-score}[i_1..i_2] + \text{D-score}[i_2+1] + \text{I-score}[i_2+1..i_3-1] + \text{A-score}[i_3-1] + \text{E-score}[i_3..i_4] + \cdots$. Goal is to compute a labeling with maximum score.

47 Define $V_E(i)$ = best score of a labeling of the prefix 1..i such that the i-th position is labeled E. Define $V_I(i)$ = best score of a labeling of the prefix 1..i such that the i-th position is labeled I. Why is it enough to compute $V_E(i)$ and $V_I(i)$?

48 $V_E(i) = \max_{j<i} \{\, \text{E\_score}[j..i] + V_I(j-1) + \text{A\_score}[j-1] \,\}$
$V_I(i) = \max_{j<i} \{\, \text{I\_score}[j..i] + V_E(j-1) + \text{D\_score}[j] \,\}$
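The two-state recurrence can be turned into a small dynamic program. The sketch below makes several assumptions that are not on the slides: E_score[j..i] is the sum of a hypothetical per-position coding score and I_score[j..i] is its negative, positions are 1-based with index 0 unused, and the boundary case (the whole prefix carrying a single label) is handled by initializing each cell with that option.

```python
def best_labeling(coding, donor, acceptor):
    """Return (best score, label string) for positions 1..n.

    coding, donor and acceptor are lists of length n + 1; index 0 is padding.
    """
    n = len(coding) - 1
    prefix = [0.0] * (n + 1)
    for p in range(1, n + 1):
        prefix[p] = prefix[p - 1] + coding[p]

    def e_score(j, i):
        return prefix[i] - prefix[j - 1]

    def i_score(j, i):
        return -e_score(j, i)

    V = {"E": [0.0] * (n + 1), "I": [0.0] * (n + 1)}
    start = {"E": [1] * (n + 1), "I": [1] * (n + 1)}  # start of the last segment
    for i in range(1, n + 1):
        # Assumed boundary case: the whole prefix 1..i is one segment.
        V["E"][i] = e_score(1, i)
        V["I"][i] = i_score(1, i)
        for j in range(2, i):  # j < i, and j - 1 >= 1 so V(j - 1) is defined
            cand_e = e_score(j, i) + V["I"][j - 1] + acceptor[j - 1]
            if cand_e > V["E"][i]:
                V["E"][i], start["E"][i] = cand_e, j
            cand_i = i_score(j, i) + V["E"][j - 1] + donor[j]
            if cand_i > V["I"][i]:
                V["I"][i], start["I"][i] = cand_i, j

    # Trace back the segments, alternating between the two states.
    state = "E" if V["E"][n] >= V["I"][n] else "I"
    best = V[state][n]
    chunks, i = [], n
    while i >= 1:
        j = start[state][i]
        chunks.append(state * (i - j + 1))
        i = j - 1
        state = "I" if state == "E" else "E"
    return best, "".join(reversed(chunks))


coding = [0, -1, -1, -1, 2, 2, 2, -1, -1, -1]  # toy per-position scores
zeros = [0] * len(coding)                       # no splice-signal evidence
best, labels = best_labeling(coding, zeros, zeros)
```

On this toy input the best labeling is `IIIEEEIII`. The double loop takes quadratic time in the sequence length; real gene finders bound the segment lengths they consider.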

49 Note that we deal with two states, and consider all paths that move between the two states.

50 We did not deal with the boundary cases in the recurrence. Instead of labeling with two states, we can label with multiple states: E_init, E_fin, E_mid, I, I_G (intergenic). Note: all links are not shown here.

51

52 Gene finding can be interpreted as a d.p. approach that threads genomic sequence through the states of a gene HMM: E_init, E_fin, E_mid, I, I_G (intergenic). Note: all links are not shown here.

53 A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described. In standard HMMs, there is an exponential (geometric) distribution on the duration of time spent in a state. This is violated by many states of the gene structure HMM. Solution is to model these using generalized HMMs.
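The duration claim follows from the self-transition: if a state returns to itself with probability p at each step, staying exactly d steps has probability $p^{d-1}(1-p)$. The sketch below just evaluates that formula; the function name and the value p = 0.9 are assumptions for illustration.

```python
def geometric_duration(p_stay, d):
    """Probability of spending exactly d >= 1 steps in a state with self-transition p_stay."""
    return p_stay ** (d - 1) * (1 - p_stay)


durations = [geometric_duration(0.9, d) for d in range(1, 2001)]
```

The resulting distribution is always largest at d = 1 and decays from there, so it cannot represent a state whose typical duration is well above one step. A generalized HMM replaces it with an explicit, state-specific duration distribution.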

54

55 Each state also emits a duration for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

56 $F_k(i) = \sum_{j<i} P_{q_k}(X_{j..i})\; f_{q_k}(i-j+1) \sum_{l \in Q} a_{lk}\, F_l(j-1)$. Duration prob. $f_{q_k}(i-j+1)$: probability that you stayed in state $q_k$ for $i-j+1$ steps. Emission prob. $P_{q_k}(X_{j..i})$: probability that you emitted $X_j..X_i$ in state $q_k$ (given by the 5th order Markov model). Forward prob. $F_k(i)$: probability that you emitted i symbols and ended up in state $q_k$.

57 Various signals distinguish coding regions from non-coding. HMMs are a reasonable model for gene structures, and provide a uniform method for combining various signals. Further improvement may come from improved signal detection.

58 Coding versus non-coding. Splice signals. Translation start. [Figure: gene structure, labeled with Transcription start, 5' UTR, Translation start (ATG), exon, Donor splice site, intron, Acceptor, 3' UTR]

59 The donor site marks the junction where an exon ends, and an intron begins. For gene finding, we are interested in computing a probability D[i] = Prob[Donor site at position i]. Approach: Collect a large number of donor sites, align, and look for a signal.

60 Fixed length for the splice signal. Each position is generated independently according to a distribution. Figure shows data from > 1200 donor sites. Example donor sites: AAGGTGAGT, CCGGTAAGT, GAGGTGAGG, TAGGTAAGG.
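The independent-position model is a position weight matrix (PWM). The sketch below builds one from the four example donor sites on the slide and scores a candidate 9-mer; with only four sites the estimates are of course crude, and the function names and the optional pseudocount are assumptions.

```python
def build_pwm(sites, alphabet="ACGT", pseudocount=0.0):
    """Return a list of {base: probability} dicts, one per aligned position."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in alphabet})
    return pwm


def site_probability(pwm, candidate):
    """Probability of candidate under the PWM: product of per-position terms."""
    prob = 1.0
    for pos, base in enumerate(candidate):
        prob *= pwm[pos][base]
    return prob


donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(donor_sites)
p_seen = site_probability(pwm, "AAGGTGAGT")
p_no_gt = site_probability(pwm, "AAGCTGAGT")
```

Sliding `site_probability` along a genomic sequence, one 9-mer window per position i, gives a rough stand-in for D[i]. A nonzero pseudocount keeps unseen bases from forcing the product to zero.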


62 Nature Science

63

64 Gene prediction is harder with alternative splicing. One approach might be to use comparative methods to detect genes. Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence? Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.

65 Training donor sites: GGTA, GGTA, GGTA, GGTA, CGTG, CGTG, CGTG, CGTG. Pr[GGTA is a donor site]? 0.5*0.5. Pr[CGTA is a donor site]? 0.5*0.5. Is something wrong with this explanation?

66 PWMs do not capture correlations between positions. Many position pairs in the donor signal are correlated.
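The problem can be reproduced directly on the eight training sites from the previous slide. In this sketch (helper names are illustrative), the PWM assigns the same probability to GGTA, which makes up half of the training set, and to CGTA, which never occurs, because it ignores the perfect correlation between the first and last positions.

```python
from collections import Counter

sites = ["GGTA"] * 4 + ["CGTG"] * 4


def column_freqs(sites):
    """Per-position base frequencies (a PWM without pseudocounts)."""
    n = len(sites)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*sites)]


def pwm_prob(freqs, s):
    """Product of independent per-position probabilities."""
    p = 1.0
    for f, b in zip(freqs, s):
        p *= f.get(b, 0.0)
    return p


def pair_freq(sites, i, j, a, b):
    """Observed fraction of sites with base a at position i and base b at position j."""
    return sum(1 for s in sites if s[i] == a and s[j] == b) / len(sites)


freqs = column_freqs(sites)
p_ggta = pwm_prob(freqs, "GGTA")              # 0.5 * 0.5
p_cgta = pwm_prob(freqs, "CGTA")              # also 0.5 * 0.5
observed_c_a = pair_freq(sites, 0, 3, "C", "A")
expected_c_a = freqs[0]["C"] * freqs[3]["A"]  # what independence predicts
```

The gap between `observed_c_a` (zero) and `expected_c_a` (0.25) is the kind of pairwise dependence that a correlation score between positions is meant to detect.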

67 Choose the position i which has the highest correlation score. Split sequences into two: those which have the consensus at position i, and the remaining. Recurse until <Terminating conditions> Stop if #sequences is small enough

68


73 Procrustes/Sim4: mRNA vs. genomic. Genewise: proteins versus genomic. CEM: genomic versus genomic. Twinscan: combines comparative and de novo approaches. Mass spec related? Later in the class we will consider mass spectrometry data. Can we use this data to identify genes in eukaryotic genomes? (Research project)

74 RefSeq and other databases maintain sequences of full-length transcripts/genes. We can query using sequence.

75 Sequence Comparison (BLAST & other tools). Protein Motifs: Profiles/Regular Expression/HMMs. Discovering protein coding genes: Gene finding HMMs, DNA signals (splice signals). How is the genomic sequence itself obtained? [Figure labels: ESTs, Gene finding, Protein sequence analysis]


More information

Textbook Reading Guidelines

Textbook Reading Guidelines Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum Last updated: January 16, 2013 Textbook Reading Guidelines Preface: Read the whole preface, and especially: For the students with Life Science

More information

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein This is also known as: The central dogma of molecular biology Protein Proteins are made

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 08: Gene finding aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggc tatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

Molecular Cell Biology - Problem Drill 08: Transcription, Translation and the Genetic Code

Molecular Cell Biology - Problem Drill 08: Transcription, Translation and the Genetic Code Molecular Cell Biology - Problem Drill 08: Transcription, Translation and the Genetic Code Question No. 1 of 10 1. Which of the following statements about how genes function is correct? Question #1 (A)

More information

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Multiple choice questions (numbers in brackets indicate the number of correct answers) 1 Multiple choice questions (numbers in brackets indicate the number of correct answers) February 1, 2013 1. Ribose is found in Nucleic acids Proteins Lipids RNA DNA (2) 2. Most RNA in cells is transfer

More information

6.C: Students will explain the purpose and process of transcription and translation using models of DNA and RNA

6.C: Students will explain the purpose and process of transcription and translation using models of DNA and RNA 6.C: Students will explain the purpose and process of transcription and translation using models of DNA and RNA DNA mrna Protein DNA is found in the nucleus, but making a protein occurs at the ribosome

More information

SSA Signal Search Analysis II

SSA Signal Search Analysis II SSA Signal Search Analysis II SSA other applications - translation In contrast to translation initiation in bacteria, translation initiation in eukaryotes is not guided by a Shine-Dalgarno like motif.

More information

DNA is normally found in pairs, held together by hydrogen bonds between the bases

DNA is normally found in pairs, held together by hydrogen bonds between the bases Bioinformatics Biology Review The genetic code is stored in DNA Deoxyribonucleic acid. DNA molecules are chains of four nucleotide bases Guanine, Thymine, Cytosine, Adenine DNA is normally found in pairs,

More information

Lecture 11. Initiation of RNA Pol II transcription. Transcription Initiation Complex

Lecture 11. Initiation of RNA Pol II transcription. Transcription Initiation Complex Lecture 11 *Eukaryotic Transcription Gene Organization RNA Processing 5 cap 3 polyadenylation splicing Translation Initiation of RNA Pol II transcription Consensus sequence of promoter TATA Transcription

More information

Biology A: Chapter 9 Annotating Notes Protein Synthesis

Biology A: Chapter 9 Annotating Notes Protein Synthesis Name: Pd: Biology A: Chapter 9 Annotating Notes Protein Synthesis -As you read your textbook, please fill out these notes. -Read each paragraph state the big/main idea on the left side. -On the right side

More information

Make the protein through the genetic dogma process.

Make the protein through the genetic dogma process. Make the protein through the genetic dogma process. Coding Strand 5 AGCAATCATGGATTGGGTACATTTGTAACTGT 3 Template Strand mrna Protein Complete the table. DNA strand DNA s strand G mrna A C U G T A T Amino

More information

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

TIGR THE INSTITUTE FOR GENOMIC RESEARCH Introduction to Genome Annotation: Overview of What You Will Learn This Week C. Robin Buell May 21, 2007 Types of Annotation Structural Annotation: Defining genes, boundaries, sequence motifs e.g. ORF,

More information

Transcription and Translation. DANILO V. ROGAYAN JR. Faculty, Department of Natural Sciences

Transcription and Translation. DANILO V. ROGAYAN JR. Faculty, Department of Natural Sciences Transcription and Translation DANILO V. ROGAYAN JR. Faculty, Department of Natural Sciences Protein Structure Made up of amino acids Polypeptide- string of amino acids 20 amino acids are arranged in different

More information

From RNA To Protein

From RNA To Protein From RNA To Protein 22-11-2016 Introduction mrna Processing heterogeneous nuclear RNA (hnrna) RNA that comprises transcripts of nuclear genes made by RNA polymerase II; it has a wide size distribution

More information

Transcription steps. Transcription steps. Eukaryote RNA processing

Transcription steps. Transcription steps. Eukaryote RNA processing Transcription steps Initiation at 5 end of gene binding of RNA polymerase to promoter unwinding of DNA Elongation addition of nucleotides to 3 end rules of base pairing requires Mg 2+ energy from NTP substrates

More information

Transcription in Eukaryotes

Transcription in Eukaryotes Transcription in Eukaryotes Biology I Hayder A Giha Transcription Transcription is a DNA-directed synthesis of RNA, which is the first step in gene expression. Gene expression, is transformation of the

More information

Lecture for Wednesday. Dr. Prince BIOL 1408

Lecture for Wednesday. Dr. Prince BIOL 1408 Lecture for Wednesday Dr. Prince BIOL 1408 THE FLOW OF GENETIC INFORMATION FROM DNA TO RNA TO PROTEIN Copyright 2009 Pearson Education, Inc. Genes are expressed as proteins A gene is a segment of DNA that

More information

DNA Function: Information Transmission

DNA Function: Information Transmission DNA Function: Information Transmission DNA is called the code of life. What does it code for? *the information ( code ) to make proteins! Why are proteins so important? Nearly every function of a living

More information

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari Introduction to Cellular Biology and Bioinformatics Farzaneh Salari Outline Bioinformatics Cellular Biology A Bioinformatics Problem What is bioinformatics? Computer Science Statistics Bioinformatics Mathematics...

More information

Key Area 1.3: Gene Expression

Key Area 1.3: Gene Expression Key Area 1.3: Gene Expression RNA There is a second type of nucleic acid in the cell, called RNA. RNA plays a vital role in the production of protein from the code in the DNA. What is gene expression?

More information

I. Gene Expression Figure 1: Central Dogma of Molecular Biology

I. Gene Expression Figure 1: Central Dogma of Molecular Biology I. Gene Expression Figure 1: Central Dogma of Molecular Biology Central Dogma: Gene Expression: RNA Structure RNA nucleotides contain the pentose sugar Ribose instead of deoxyribose. Contain the bases

More information

Analysis of Biological Sequences SPH

Analysis of Biological Sequences SPH Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,

More information

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important! Themes: RNA is very versatile! RNA and RNA Processing Chapter 14 RNA-RNA interactions are very important! Prokaryotes and Eukaryotes have many important differences. Messenger RNA (mrna) Carries genetic

More information

Chapter 17: From Gene to Protein

Chapter 17: From Gene to Protein Name Period Chapter 17: From Gene to Protein This is going to be a very long journey, but it is crucial to your understanding of biology. Work on this chapter a single concept at a time, and expect to

More information

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome Lesson Overview 14.3 Studying the Human Genome THINK ABOUT IT Just a few decades ago, computers were gigantic machines found only in laboratories and universities. Today, many of us carry small, powerful

More information

Bis2A 12.0 Transcription *

Bis2A 12.0 Transcription * OpenStax-CNX module: m56068 1 Bis2A 12.0 Transcription * Mitch Singer Based on Transcription by OpenStax This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License

More information

The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Genetic Code. Genes and How They Work

The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Genetic Code. Genes and How They Work Genes and How They Work Chapter 15 Early ideas to explain how genes work came from studying human diseases. Archibald Garrod studied alkaptonuria, 1902 Garrod recognized that the disease is inherited via

More information

Machine Learning. HMM applications in computational biology

Machine Learning. HMM applications in computational biology 10-601 Machine Learning HMM applications in computational biology Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Biological data is rapidly

More information

Replication, Transcription, and Translation

Replication, Transcription, and Translation Replication, Transcription, and Translation Information Flow from DNA to Protein The Central Dogma of Molecular Biology Replication is the copying of DNA in the course of cell division. Transcription is

More information

Chapter 12. DNA TRANSCRIPTION and TRANSLATION

Chapter 12. DNA TRANSCRIPTION and TRANSLATION Chapter 12 DNA TRANSCRIPTION and TRANSLATION 12-3 RNA and Protein Synthesis WARM UP What are proteins? Where do they come from? From DNA to RNA to Protein DNA in our cells carry the instructions for making

More information

Transcription and Post Transcript Modification

Transcription and Post Transcript Modification Transcription and Post Transcript Modification You Should Be Able To 1. Describe transcription. 2. Compare and contrast eukaryotic + prokaryotic transcription. 3. Explain mrna processing in eukaryotes.

More information

Biotechnology Project Lab

Biotechnology Project Lab Only for teaching purposes - not for reproduction or sale Advanced Cell Biology & Biotechnology Biotechnology Project Lab Giovanna Gambarotta COMPETENCES THAT YOU WILL ACQUIRE - compare DNA sequences -

More information

Lecture Summary: Regulation of transcription. General mechanisms-what are the major regulatory points?

Lecture Summary: Regulation of transcription. General mechanisms-what are the major regulatory points? BCH 401G Lecture 37 Andres Lecture Summary: Regulation of transcription. General mechanisms-what are the major regulatory points? RNA processing: Capping, polyadenylation, splicing. Why process mammalian

More information

Chimp Sequence Annotation: Region 2_3

Chimp Sequence Annotation: Region 2_3 Chimp Sequence Annotation: Region 2_3 Jeff Howenstein March 30, 2007 BIO434W Genomics 1 Introduction We received region 2_3 of the ChimpChunk sequence, and the first step we performed was to run RepeatMasker

More information

The Structure of Proteins The Structure of Proteins. How Proteins are Made: Genetic Transcription, Translation, and Regulation

The Structure of Proteins The Structure of Proteins. How Proteins are Made: Genetic Transcription, Translation, and Regulation How Proteins are Made: Genetic, Translation, and Regulation PLAY The Structure of Proteins 14.1 The Structure of Proteins Proteins - polymer amino acids - monomers Linked together with peptide bonds A

More information

Gene finding: putting the parts together

Gene finding: putting the parts together Gene finding: putting the parts together Anders Krogh Center for Biological Sequence Analysis Technical University of Denmark Building 206, 2800 Lyngby, Denmark 1 Introduction Any isolated signal of a

More information

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004

An Overview of Probabilistic Methods for RNA Secondary Structure Analysis. David W Richardson CSE527 Project Presentation 12/15/2004 An Overview of Probabilistic Methods for RNA Secondary Structure Analysis David W Richardson CSE527 Project Presentation 12/15/2004 RNA - a quick review RNA s primary structure is sequence of nucleotides

More information

DNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences

DNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences DNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences Huiqing Liu Hao Han Jinyan Li Limsoon Wong Institute for Infocomm Research, 21 Heng Mui Keng Terrace,

More information

Section 10.3 Outline 10.3 How Is the Base Sequence of a Messenger RNA Molecule Translated into Protein?

Section 10.3 Outline 10.3 How Is the Base Sequence of a Messenger RNA Molecule Translated into Protein? Section 10.3 Outline 10.3 How Is the Base Sequence of a Messenger RNA Molecule Translated into Protein? Messenger RNA Carries Information for Protein Synthesis from the DNA to Ribosomes Ribosomes Consist

More information

Eukaryotic Gene Structure

Eukaryotic Gene Structure Eukaryotic Gene Structure Terminology Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Gene Basic physical and

More information