Lecture 11: Gene Prediction

Similar documents
Disease and selection in the human genome 3

Lecture 10, 20/2/2002: The process of solution development - The CODEHOP strategy for automatic design of consensus-degenerate primers for PCR

NAME:... MODEL ANSWER... STUDENT NUMBER:... Maximum marks: 50. Internal Examiner: Hugh Murrell, Computer Science, UKZN

ORFs and genes. Please sit in row K or forward

Materials Protein synthesis kit. This kit consists of 24 amino acids, 24 transfer RNAs, four messenger RNAs and one ribosome (see below).

Lecture 19A. DNA computing

Homework. A bit about the nature of the atoms of interest. Project. The role of electronega<vity

G+C content. 1 Introduction. 2 Chromosomes Topology & Counts. 3 Genome size. 4 Replichores and gene orientation. 5 Chirochores.

Codon Bias with PRISM. 2IM24/25, Fall 2007

Primer Design Workshop. École d'été en géné-que des champignons 2012 Dr. Will Hintz University of Victoria

Protein Structure Analysis

Figure S1. Characterization of the irx9l-1 mutant. (A) Diagram of the Arabidopsis IRX9L gene drawn based on information from TAIR (the Arabidopsis

Project 07/111 Final Report October 31, Project Title: Cloning and expression of porcine complement C3d for enhanced vaccines

PGRP negatively regulates NOD-mediated cytokine production in rainbow trout liver cells

Supporting information for Biochemistry, 1995, 34(34), , DOI: /bi00034a013

for Programmed Chemo-enzymatic Synthesis of Antigenic Oligosaccharides

National PHL TB DST Reference Center PSQ Reporting Language Table of Contents

Electronic Supplementary Information

Det matematisk-naturvitenskapelige fakultet

Supplementary. Table 1: Oligonucleotides and Plasmids. complementary to positions from 77 of the SRα '- GCT CTA GAG AAC TTG AAG TAC AGA CTG C

Genomic Sequence Analysis using Electron-Ion Interaction

Hes6. PPARα. PPARγ HNF4 CD36

Arabidopsis actin depolymerizing factor AtADF4 mediates defense signal transduction triggered by the Pseudomonas syringae effector AvrPphB

Supplemental Data Supplemental Figure 1.

PROTEIN SYNTHESIS Study Guide


Supplementary Figure 1A A404 Cells +/- Retinoic Acid

Outline. 1. Introduction. 2. Exon Chaining Problem. 3. Spliced Alignment. 4. Gene Prediction Tools

Dierks Supplementary Fig. S1

Supplemental Data. mir156-regulated SPL Transcription. Factors Define an Endogenous Flowering. Pathway in Arabidopsis thaliana

Lezione 10. Bioinformatica. Mauro Ceccanti e Alberto Paoluzzi

SAY IT WITH DNA: Protein Synthesis Activity by Larry Flammer

Expression of Recombinant Proteins

Introduction to Bioinformatics Dr. Robert Moss

PCR analysis was performed to show the presence and the integrity of the var1csa and var-

DNA sentences. How are proteins coded for by DNA? Materials. Teacher instructions. Student instructions. Reflection

Supplemental Data. Bennett et al. (2010). Plant Cell /tpc

Supporting Information

Supplementary Information. Construction of Lasso Peptide Fusion Proteins

ΔPDD1 x ΔPDD1. ΔPDD1 x wild type. 70 kd Pdd1. Pdd3

Creation of A Caspese-3 Sensing System Using A Combination of Split- GFP and Split-Intein

Supplement 1: Sequences of Capture Probes. Capture probes were /5AmMC6/CTG TAG GTG CGG GTG GAC GTA GTC

Y-chromosomal haplogroup typing Using SBE reaction

Supplementary Materials for

Thr Gly Tyr. Gly Lys Asn

Multiplexing Genome-scale Engineering

FROM DNA TO GENETIC GENEALOGY Stephen P. Morse

INTRODUCTION TO THE MOLECULAR GENETICS OF THE COLOR MUTATIONS IN ROCK POCKET MICE

Supporting Information

Table S1. Bacterial strains (Related to Results and Experimental Procedures)

Additional Table A1. Accession numbers of resource records for all rhodopsin sequences downloaded from NCBI. Species common name

BIOSTAT516 Statistical Methods in Genetic Epidemiology Autumn 2005 Handout1, prepared by Kathleen Kerr and Stephanie Monks

II 0.95 DM2 (RPP1) DM3 (At3g61540) b

Overexpression Normal expression Overexpression Normal expression. 26 (21.1%) N (%) P-value a N (%)

Sampling Random Bioinformatics Puzzles using Adaptive Probability Distributions

A Circular Code in the Protein Coding Genes of Mitochondria

SUPPORTING INFORMATION

Supplemental material

Supporting Online Information

SUPPLEMENTAL MATERIAL GENOTYPING WITH MULTIPLEXING TARGETED RESEQUENCING

Nucleic Acids Research

Supporting Information. Copyright Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, 2006

2

Add 5µl of 3N NaOH to DNA sample (final concentration 0.3N NaOH).

Supplemental Table 1. Mutant ADAMTS3 alleles detected in HEK293T clone 4C2. WT CCTGTCACTTTGGTTGATAGC MVLLSLWLIAAALVEVR

Supplemental Information. Human Senataxin Resolves RNA/DNA Hybrids. Formed at Transcriptional Pause Sites. to Promote Xrn2-Dependent Termination

RPA-AB RPA-C Supplemental Figure S1: SDS-PAGE stained with Coomassie Blue after protein purification.

hcd1tg/hj1tg/ ApoE-/- hcd1tg/hj1tg/ ApoE+/+

Genomics and Gene Recognition Genes and Blue Genes

Evolution of protein coding sequences

Supplementary Figure 1. Localization of MST1 in RPE cells. Proliferating or ciliated HA- MST1 expressing RPE cells (see Fig. 5b for establishment of

SUPPLEMENTARY INFORMATION

evaluated with UAS CLB eliciting UAS CIT -N Libraries increase in the

strain devoid of the aox1 gene [1]. Thus, the identification of AOX1 in the intracellular

Supporting Information

1. DNA, RNA structure. 2. DNA replication. 3. Transcription, translation

Quantitative reverse-transcription PCR. Transcript levels of flgs, flgr, flia and flha were

Gene Prediction: Statistical Approaches

Complexity of the Ruminococcus flavefaciens FD-1 cellulosome reflects an expansion of family-related protein-protein interactions

Search for and Analysis of Single Nucleotide Polymorphisms (SNPs) in Rice (Oryza sativa, Oryza rufipogon) and Establishment of SNP Markers

Cover Page. The handle holds various files of this Leiden University dissertation.

Biomolecules: lecture 6

Gene synthesis by circular assembly amplification

Genes and Proteins. Objectives

Supplementary Figures

Supplemental Data. Distinct Pathways for snorna and mrna Termination

FAT10 and NUB1L bind the VWA domain of Rpn10 and Rpn1 to enable proteasome-mediated proteolysis

11th Meeting of the Science Working Group. Lima, Peru, October 2012 SWG-11-JM-11

Supplemental Table 1. Primers used for PCR.

BioInformatics and Computational Molecular Biology. Course Website

Gene Prediction: Statistical Approaches

Converting rabbit hybridoma into recombinant antibodies with effective transient production in an optimized human expression system

Cat. # Product Size DS130 DynaExpress TA PCR Cloning Kit (ptakn-2) 20 reactions Box 1 (-20 ) ptakn-2 Vector, linearized 20 µl (50 ng/µl) 1

Legends for supplementary figures 1-3

Gene Prediction: Statistical Approaches

EECS730: Introduction to Bioinformatics

iclicker Question #28B - after lecture Shown below is a diagram of a typical eukaryotic gene which encodes a protein: start codon stop codon 2 3

PCR-based Markers and Cut Flower Longevity in Carnation

S4B fluorescence (AU)

Transcription:

Lecture 11: Gene Prediction Study Chapter 6.11-6.14 1

Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions of genes in a genome

Where are the genes in the genome? aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgcta atgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggcta tgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgcta atgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatg catgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaa tgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag ctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgaca Gene! atgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagc tgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggcta tgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg ctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaat gcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgcta agctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatg cggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaat gcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagc tgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg 3

Exons and Introns In prokaryotes, one-cell organisms without a cell nucleus (e.g. bacteria), a gene is a single sequence In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) This makes computational gene prediction in eukaryotes even more difficult 5 cap intron1 intron2 exon1 exon2 exon3 mrna transcription splicing translation 3 poly-a tail What are the hints that a gene is nearby? 4

Splicing Signals Exon 1 GT AG Exon 2 GT AG Exon 3 The genome provides subtle hints where exon/intron boundaries might occur Dinucleotide sequences GT and AG flank an exon >95% of the time With one or two additional flanking sequences cover 99.9% of the cases How helpful is this? 5

Two Approaches to Gene Prediction Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes. 6

Genetic Code and Stop Codons TAA, TAG and TGA correspond to 3 Stop codons that (together with Start codon ATG) delineate Open Reading Frames 7

Six Frames in a DNA Sequence CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG stop codons TAA, TAG, TGA start codons - ATG 8

Open Reading Frames (ORFs) Detect potential coding regions by looking at ORFs A genome of length n is comprised of (n/3) codons Stop codons break genome into segments between consecutive Stop codons The subsegments of these that start from the Start codon (ATG) are ORFs ORFs in different frames may overlap ATG TGA Genomic Sequence Open reading frame 9

Long versus Short ORFs Long open reading frames may be a gene At random, we should expect one stop codon every (64/3) ~= 21 codons However, genes are usually much longer than this A basic approach is to scan for ORFs whose length exceeds certain threshold This is an incomplete strategy because some genes (e.g. some neural and immune system genes) are relatively short 10

Testing ORFs: Codon Usage Create a 64-element hash table and count the frequencies of codons in an ORF Amino acids typically have more than one codon, but in nature certain codons are more in use Uneven use of the codons may characterize a real gene This compensates for pitfalls of the ORF length test 11

Codon Usage in the Human Genome T C A G TTT Phe 46 TCT Ser 19 TAT Tyr 44 TGT Cys 46 T C A G TTC Phe 54 TCC Ser 22 TAC Tyr 56 TGC Cys 54 TTA Leu 8 TCA Ser 15 TAA Stop 30 TGA Stop 47 TTG Leu 13 TCG Ser 5 TAG Stop 24 TGG Trp 100 CTT Leu 13 CCT Pro 29 CAT His 42 CGT Arg 8 CTC Leu 20 CCC Pro 32 CAC His 58 CGC Arg 18 CTA Leu 7 CCA Pro 28 CAA Gln 27 CGA Arg 11 CTG Leu 40 CCG Pro 11 CAG Gln 73 CGG Arg 20 ATT Ile 36 ACT Thr 25 AAT Asn 47 AGT Ser 15 ATC Ile 47 ACC Thr 36 AAC Asn 53 AGC Ser 24 ATA Ile 17 ACA Thr 28 AAA Lys 43 AGA Arg 21 ATG Met 100 ACG Thr 11 AAG Lys 57 AGG Arg 21 GTT Val 18 GCT Ala 27 GAT Asp 46 GGT Gly 16 GTC Val 24 GCC Ala 40 GAC Asp 54 GGC Gly 34 GTA Val 12 GCA Ala 23 GAA Glu 42 GGA Gly 25 GTG Val 46 GCG Ala 11 GAG Glu 58 GGG Gly 25 12

Codon Usage in the Mouse Genome T C A G TTT Phe 39 TCT Ser 16 TAT Tyr 41 TGT Cys 45 T C A G TTC Phe 61 TCC Ser 21 TAC Tyr 59 TGC Cys 55 TTA Leu 7 TCA Ser 21 TAA Stop 27 TGA Stop 27 TTG Leu 10 TCG Ser 6 TAG Stop 45 TGG Trp 100 CTT Leu 13 CCT Pro 26 CAT His 35 CGT Arg 6 CTC Leu 21 CCC Pro 30 CAC His 65 CGC Arg 13 CTA Leu 14 CCA Pro 34 CAA Gln 24 CGA Arg 10 CTG Leu 35 CCG Pro 10 CAG Gln 76 CGG Arg 13 ATT Ile 27 ACT Thr 24 AAT Asn 41 AGT Ser 13 ATC Ile 43 ACC Thr 28 AAC Asn 59 AGC Ser 23 ATA Ile 29 ACA Thr 40 AAA Lys 58 AGA Arg 34 ATG Met 100 ACG Thr 8 AAG Lys 42 AGG Arg 25 GTT Val 14 GCT Ala 23 GAT Asp 39 GGT Gly 17 GTC Val 24 GCC Ala 36 GAC Asp 61 GGC Gly 32 GTA Val 19 GCA Ala 30 GAA Glu 49 GGA Gly 29 GTG Val 43 GCG Ala 11 GAG Glu 51 GGG Gly 15 13

Codon Usage and Likelihood Ratio An ORF is more believable than another if it has more likely codons Do sliding window calculations to find ORFs that have the likely codon usage Allows for higher precision in identifying true ORFs; much better than merely testing for length. However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons) 14

Similarity-Based Approach Genes in different organisms are similar The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome Problem: Given a known gene and an unannotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene An alignment problem! 15

Chaining Local Alignments Locate codon substrings that match a given protein subsequence, putative (candidate) exons Define a putative exon as a 3-tuple (l, r, w) l = left starting position r = right ending position w = weight based on some scoring function w(# of amino acid s matched, codon freq, ) Look for a maximum chain of substrings Chain: a set of non-overlapping nonadjacent intervals. 16

Exon Chaining Problem 5 5 3 4 11 9 15 0 2 3 5 6 11 13 16 20 25 27 28 30 32 Locate the beginning and end of each interval (2n points) Find the best path 17

Exon Chaining Problem: Formulation Exon Chaining Problem: Given a set of putative exons, find a maximum set of nonoverlapping putative exons Input: a set of weighted intervals (putative exons) Output: A maximum chain of intervals from this set Would a greedy algorithm solve this problem? 18

Exon Chaining Problem: Graph Representation A greedy solution takes the highest scoring chain first, followed by largest one left in the uncovered range, and so on until none can be 21 taken, the score is the sum of all taken chains A dynamic programming solution can do better in O(n) time, by constructing a graph 19

Exon Chaining Algorithm ExonChaining (G, n) //Graph, number of intervals 1 for i to 2n 2 s i 0 3 for i 1 to 2n 4 if vertex v i in G corresponds to right end of the interval I 5 j index of vertex for left end of the interval I 6 w weight of the interval I 7 s j max {s j + w, s i-1 } 8 else 9 s i s i-1 10 return s 2n 20

Exon Chaining: Deficiencies Poor definition of the putative exon endpoints Optimal interval chain may not correspond to a valid alignment We enforce genomic order but not protein order First interval may correspond to a suffix, whereas second interval may correspond to a prefix Combination of such intervals is not a valid alignment 21

Spliced Alignment Mikhail Gelfand and colleagues proposed a spliced alignment approach of using a protein within one genome to reconstruct the exon-intron structure of a (related) gene in another genome. Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). This set is further filtered in a such a way that attempt to retain all true exons, but allow some false ones. (Many false positives, but no false negatives) 22

Spliced Alignment Problem: Formulation Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence Input: Genomic sequences G, target sequence T, and a set of candidate exons B. Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B. Γ* - concatenation of all exons from chain Γ 23

Lewis Carroll Example 24

Lewis Carroll Example 25

Lewis Carroll Example 26

Exon Chaining vs Spliced Alignment In Spliced Alignment, every path spells out a string obtained by concatenating the labels of its vertices. The overall path gives an optimal alignment score between concatenated labels (blocks) and target sequence Defines weight as the sum of vertex weights rather than as the sum of edge weights as in Spliced Alignment Exon Chaining assumes the positions and weights of exons are pre-defined How to solve using Dynamic Programming? 27

Spliced Alignment: Idea Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) But what is i-prefix of G? There may be a few i-prefixes of G depending on which block B we are in. Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T under the assumption that the alignment uses the block B at position i, S(i,j,B) Two cases to consider, block B starts at i, or it does not 28

Spliced Alignment Recurrence If i is not the starting vertex of block B: S(i 1, j,b) σ S(i, j,b) = max S(i, j 1,B) σ S(i 1, j 1,B) +δ(g i,t j ) Recall σ was the cost of an indel If i is the starting vertex of block B: S(i 1, j,b) σ S(i, j,b) = max max all blocks B' preceding B S(end(B'), j 1,B') δ(g i,t j ) max all blocks B' preceding B S(end(B'), j,b') σ 29

Spliced Alignment Solution After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is: max all blocks B S(end(B), length(t), B) 30

A Modern Approach Sequence novel transcripts Identify gene(s) with similar protein Align transcript sequence to genome Locus of alignment identifies location and intron/exon structure 34