Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Similar documents
Gene Identification in silico

Computational gene finding

Transcription in Eukaryotes

Genome annotation & EST

Computational gene finding

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

BIOLOGY - CLUTCH CH.17 - GENE EXPRESSION.

Computational gene finding. Devika Subramanian Comp 470

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

I. Gene Expression Figure 1: Central Dogma of Molecular Biology

Molecular Genetics of Disease and the Human Genome Project

From DNA to Protein: Genotype to Phenotype

HUMAN GENOME BIOINFORMATICS. Tore Samuelsson, Dec 2009

Transcription is the first stage of gene expression

SSA Signal Search Analysis II

From DNA to Protein: Genotype to Phenotype

Chapter 17. From Gene to Protein. Slide 1. Slide 2. Slide 3. Gene Expression. Which of the following is the best example of gene expression? Why?

The Nature of Genes. The Nature of Genes. Genes and How They Work. Chapter 15/16

9/19/13. cdna libraries, EST clusters, gene prediction and functional annotation. Biosciences 741: Genomics Fall, 2013 Week 3

CHAPTER 21 LECTURE SLIDES

Lecture Summary: Regulation of transcription. General mechanisms-what are the major regulatory points?

The Flow of Genetic Information

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Nature of Genes. The Genetic Code. Genes and How They Work

Make the protein through the genetic dogma process.

30 Gene expression: Transcription

Annotating the Genome (H)

ab initio and Evidence-Based Gene Finding

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

The study of the structure, function, and interaction of cellular proteins is called. A) bioinformatics B) haplotypics C) genomics D) proteomics

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

Biology. Biology. Slide 1 of 39. End Show. Copyright Pearson Prentice Hall

DNA is normally found in pairs, held together by hydrogen bonds between the bases

8/21/2014. From Gene to Protein

Homework 4. Due in class, Wednesday, November 10, 2004

13.1 RNA Lesson Objectives Contrast RNA and DNA. Explain the process of transcription.

Fig Ch 17: From Gene to Protein

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

Transcription. DNA to RNA

Lecture for Wednesday. Dr. Prince BIOL 1408

Unit 1: DNA and the Genome. Sub-Topic (1.3) Gene Expression

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

CH 17 :From Gene to Protein

Biotechnology Unit 3: DNA to Proteins. From DNA to RNA

Gene Structure & Gene Finding Part II

MODULE 5: TRANSLATION

Genomics and Gene Recognition Genes and Blue Genes

DNA REPLICATION. DNA structure. Semiconservative replication. DNA structure. Origin of replication. Replication bubbles and forks.

Eukaryotic Gene Structure

DNA Function: Information Transmission

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

From Gene to Protein. Chapter 17

GENOME ANNOTATION INTRODUCTION TO CONCEPTS AND METHODS. Olivier GARSMEUR & Stéphanie SIDIBE-BOCS

BS 50 Genetics and Genomics Week of Oct 24

Analysis of Biological Sequences SPH

Unit 6: Molecular Genetics & DNA Technology Guided Reading Questions (100 pts total)

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Applicazioni biotecnologiche

Bi 8 Lecture 5. Ellen Rothenberg 19 January 2016

How to design an HMM for a new problem. HMM model structure. Inherent limitation of HMMs. Duration modeling. Duration modeling

Lecture 2: Biology Basics Continued. Fall 2018 August 23, 2018

The Central Dogma. DNA makes RNA makes Proteins

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

Lecture 2: Biology Basics Continued

Genomes summary. Bacterial genome sizes

DNA Evolution of knowledge about gene. Contains information about RNAs and proteins. Polynucleotide chains; Double stranded molecule;

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Degenerate site - twofold degenerate site - fourfold degenerate site

CHAPTER 21 GENOMES AND THEIR EVOLUTION

Molecular Cell Biology - Problem Drill 08: Transcription, Translation and the Genetic Code

Videos. Lesson Overview. Fermentation

EECS730: Introduction to Bioinformatics

Chimp Sequence Annotation: Region 2_3

Chapter 13. From DNA to Protein

There are four major types of introns. Group I introns, found in some rrna genes, are self-splicing: they can catalyze their own removal.

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly

Bio 101 Sample questions: Chapter 10

Transcriptomics. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

Analyzed Fungi Neurospora crassa mutants. Mutants were UNABLE to grow without Arginine (an amino acid) Other biochemical experiments indicated:

Fermentation. Lesson Overview. Lesson Overview 13.1 RNA

Roadmap. The Cell. Introduction to Molecular Biology. DNA RNA Protein Central dogma Genetic code Gene structure Human Genome

Introduction to Plant Genome and Gene Structure. Dr. Jonaliza L. Siangliw Rice Gene Discovery Unit, BIOTEC

Chapter 17. From Gene to Protein

Aaditya Khatri. Abstract

Transcription steps. Transcription steps. Eukaryote RNA processing

Videos. Bozeman Transcription and Translation: Drawing transcription and translation:

Year III Pharm.D Dr. V. Chitra

Biol 478/595 Intro to Bioinformatics

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

An introduction to genetics and molecular biology

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

BIO 311C Spring Lecture 36 Wednesday 28 Apr.

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Unit IX Problem 3 Genetics: Basic Concepts in Molecular Biology

Wednesday, November 22, 17. Exons and Introns

Gene Prediction in Eukaryotes

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Transcription:

Genome Annotation Genome Sequencing Costliest aspect of sequencing the genome o But Devoid of content Genome must be annotated o Annotation definition Analyzing the raw sequence of a genome and describing relevant genetic and genomic features such as genes, mobile elements, repetitive elements, duplications, and polymorphisms Annotation costs originally low o May eventually exceed the sequencing cost Why?? Continued reanalysis is required to find all of the meaning What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Whole Genome Duplications Whole genome duplications are a major event in the history of all eukaryotic groups of species Duplications can be of o The full genome of one species Autopolyploid o The mating and retention of both chromosome sets of two species Allopolyploid Duplications in the biological kingdoms o Animals Three ancient whole genome duplications o Plants Multiple duplications along the two major lineages Dicots and monocots From: Genome Biology and Evolution 4:918

o Post whole-genome duplication events o Paleopolyploid An ancient species that results from a genome duplication Modern relationships between species Shared segments exist between Derived from an ancient ancestral species that does not exist today

Chromosomal Tree of Life From: Genome Biology and Evolution 4:918

Major result of duplications Many of the genes in all genomes are related by descent Protein sequences are similar Conserved function can be inferred BUT NOT PROVEN FROM SEQUENCE ANALYSIS Segmental Gene Duplications Segmental gene duplications: what are they??? o Large gene blocks that are duplicated in the genome o Parts of chromosomes Intrachromosomal duplication Duplicated region moved to same chromosome Interchromosomal duplication Duplicated region moved to another chromosome Confirms/determines biological function Are their large blocks of duplications in genomes? o Arabidopsis genome Model organism Originally considered devoid of gene duplications Sequencing discovery Large blocks of segmental duplication o Local duplication Intrachromosomal o Distal duplications Interchromosomal o Human genome Duplication pattern similar to Arabidopsis o Mouse genome Lesser degree of duplication

Current Arabidopsis Chromosomes Showing Shared Genomic History o Fragment between two chromosomes that share a color are similar because of the duplication history of the species lineage.

Genes Genes define the species itself o Development o Reproduction o Response to environment Biotic Abiotic What defines the gene??? o Coding region Contains the information that defines the nature of an expressed protein or a functional RNA molecule o Controlling regions Sequences that define where, when, and the degree to which the gene will be expressed Major goal of annotation o Defining genes and their controlling regions

Repetitive Elements Repetitive elements o May be major component of genome o Generally conserved o Fairly ease to discover Example Retrotransposons o Reverse transcriptase protein is conserved o Similarity searches for these proteins are fairly powerful Cataloging the repetitive elements o Critical first annotation step Greatly reduces the amount of sequence that must be searched o Repeat masking Procedure that removes the repetitive elements from the gene discovery process Mobile Genetic Elements Also called transposable elements A major component of some genomes Classes o DNA elements McClintock discovery o Retrotransposons Most abundant class of repeats Abundance Human o 50% of genome is mobile elements Arabidopsis o 10 % of total DNA o 20 % of gene rich-region

Major Classes of Transposable Elements Super families of TEs Number of TEs (x10 3 ) Coverage of TEs (bp) Fraction of genome (%) Class 1 (RNA-based) 283.1 195,948,599 41.47 LTR retrotransposon 244.3 181,963,056 38.51 Ty3-gypsy 146.7 125,312,211 26.52 Ty1-copia 62.2 47,126,880 9.97 Others 35.3 9,523,965 2.02 LINEs 37.8 13,825,275 2.93 SINEs 1.0 160,268 0.03 Class 2 (DNA-based) 87.4 26,832,637 5.68 CACTA 44.0 13,295,207 2.81 Harbinger/PIF 0.5 263,181 0.06 hat 4.0 1,062,438 0.22 Helitron 18.3 5,095,472 1.08 MULE 20.7 7,116,339 1.51 Unclassified TEs 14.7 2,728,570 0.58 Total 385.2 225,509,806 47.73 Table S8. Summary of transposable elements (TEs) in Phaseolus vulgaris. Schmutz et al. (2014). Other Repeat Elements Simple Sequence Repeats o SSRs o Major repetitive class found in genomes o Defined as Localized repetitions of di- or tri-nucleotides o Well described in many species o Widely used as molecular markers

Genetic Diversity How can gene sequences among individuals of a species vary? o Large deletions Eliminate gene function o Small deletions Gene functions but expression changed Example o Cystic fibrosis gene Triplet lost Phenylalanine missing Mutant CF phenotype expressed Single nucleotide polymorphisms (SNPs) o A difference in a single nucleotide between two alleles o Resequencing multiple individuals develop a HapMap of a species HapMap shows the regions of the genome that vary across the breadth of the species The most detailed analyses can be performed when more individuals from the species are surveyed Uncovers SNP diversity related to function Examples of SNPs o Sickle cell anemia Adenine changed to thymine in sixth amino acid codon of β-globin gene The change leads to the sickle cell phenotype o Mendel s plant height gene (Le) Guanine to adenine change at nucleotide 685 of the mrna First nucleotide of 229 codon Converts alanine to threonine Changes function of gibberellin 3 betahydroxylase Plant is short rather than tall

Exon Definition Model Key principle in gene model o Genes consist of exons and introns Exons are the functional building blocks of genes o Exon definition model is based on key sequences that define exons, introns, and 5 and 3 region of genes Defines the sequences that define exons These sequences are the keys to computationally discovering gene models 5 End of Intron 3 End of Intron Donor Site Acceptor Site AG GURAGU Y (10) NCAG G Figure 3 Exon-definition model. Typically, in vertebrates, exons are much shorter than introns. According to the exon-definition model, before introns are recognized and spliced out, each exon is initially recognized by the protein factors that form a bridge across it. In this way, each exon, together with its flanking sequences, forms a molecular, as well as a computational, recognition module (arrows indicate molecular interactions). Modified with permission from Ref. 26 (2002) Macmillan Magazines Ltd. CBC, cap-binding complex; CFI/II, cleavage factor I/II; CPSF, cleavage and polyadenylation specificity factor; CstF, the cleavage stimulation factor; PAP, poly(a) polymerase; snrnp, small nuclear RNP; SR, SR protein; U2AF, U2 small nuclear ribonucleoprotein particle (snrnp) auxiliary factor. From: Nature Review Genetics (2002) 3:698-709

Finding Genes In a Sea of Nucleotides Extrinsic Content Detection Uses data from databases to discover genes o Search genomic sequence as a query against Protein databases Nucleotide databases EST o Best information for gene structures because o Known to represent genes o Contain 3 sequences that are normally gene specific Problems o Exon-intron borders not always easy to predict o 5 -UTR sequences cannot be predicted

Hidden Markov Models Definition A statistical approach that builds on prior knowledge and assigns a probability that a certain sequence set is a member of specific state. State o 5 UTR o Exon o Intron o 3 UTR o Polyadenylation site Example: Predicting a GT splice site Given a specific nucleotide sequence o GTGAGT o What is the probability that the 5 G is start of an intron splice site if the next nucleotide is a T? Prior knowledge o Sequence of splice site begins with GT If the next nucleotide is indeed a T, a high probability exists that it is part of the GT splice site IF other properties of the state associate it with the GT splice site state o Is the GT preceded by an [A/C]AG sequence? o Is the GT followed by [A/G]AGT? If yes, then all criteria met, so [A/C]AGGT[A/G]AGT is a splice site and defines an intron

Intrinsic Content Detection Finding 5 exons o Difficult process 5 signal not fully defined Transcriptional start site (TSS) o Few known Promoter o Variation among promoters known o Algorithms search for: CpG island Normally gene rich Some algorithms find TSS within the island Some algorithms find TSS associated with TATA box in island Identify ATG start site in island Finding 3 exons o Identify polyadenylation addition site signal AATAAA o Use stop codon as a 3 prediction signal Essential for determining where one gene ends and another begins Finding internal exons o Based on splice site features Acceptor site Exon Intron AG GTRAGT (R = A or G) Donor site Intron Exon YYYYYYYYYYNCAG G (Y = C or T; N = any nucleotide)

Finding intronless exons o Difficult task o Must distinguish these from Long internal exon Pseudogenes that occurred by lost of intron in normal gene

What does a gene prediction program do? Calculates best scores for all gene features o Defines likelihood that neighboring coding features are really part of a gene o Likelihood is calculated as a Weight Probability Hidden Markov Model (HMM) approach is currently preferred approach o DNA fragments (a few nucleotides at a time) are defined as a state o Probability that a neighboring state can be coupled with the first state to form a gene feature is calculated o This allows interdependencies between exons to be explored o Calculation based on a training set of genes Training set are genes from a similar taxonomic group with putative similar gene features o Probability that multiple states can define a gene is calculated

Predicting Multiple Genes HMM approaches easily extended to study both strands of a DNA sequence simultaneously o Value of modeling both strands Prevents predicting two genes that overlap on the two strands A rare eukaryotic event Need to understand features common across chromosomes o Insulator elements o Boundary elements o Matrix attachment regions Scaffold attachment regions Comparative analysis One gene set can aid discovery in a related species o Gene order is conserved o Gene structure is conserved o Provides additional training set data for gene prediction o Example: Human gene models supporting mouse gene discovery

How does a popular software package do it?? FGENESH++C Pipeline (Softberry; quoted directly from company brochure) 1. RefSeq mrna mapping by EST_MAP program mapped genes are excluded from further gene prediction process. 2. Ab initio FGENESH gene prediction. 3. Search of all proeducts of predicted genes through NR database for protein homologs. 4. FGENESH+ gene prediction on sequences with found protein homology. 5. Second run of ab initio gene prediction in regions free from predictions made on stages 1 and 5. 6. Run of FGENESH gene predictions in large introns of known and predicted genes.

Naming The Genes Gene naming follows the discovery of potential genes Relies upon the significant amount of research that predated genome projects o Historically done on a gene-by-gene approach o Goals of gene-by-gene research goal is to clone and characterize an individual gene o Each gene is of interest to a specific research group Housekeeping genes Necessary for basic cellular biochemical processes of a cell Nearly all are characterized at the nucleotide and protein levels Sequence information is stored in large databases

The Naming Process BLAST o Software tool most often used to annotate (or name) a gene o Basic Local Alignment Search Tool o Series of computer programs Looks for sequence similarities between two sequences o Analysis consists of Query Sequence to which you are looking for a match Nucleotide or amino acid sequences Database Set of sequences that may be like the query o Nucleotide sequences GenBank o Protein sequences Swiss-Prot is used to uncover sequences that are similar to the query Translations possible Nucleotide query sequence can be translated Amino acid database sequences can be reverse translated Recent BLAST innovations Gaps can be incorporated to discover matches

Non-gene RNA Sequences RNA molecules o Important components of the genome Ribosomal RNAs and trnas Both essential for protein translation Small nuclear RNAs Important for RNA splicing Necessary component of the genome Highly conserved o Easily recognizable MicroRNAs Short RNA sequences 21-25 nt long Negative regulators of gene expression o Bind target gene mrnas and prevent their expression Programs that search specifically for these genes are available molecules

Regulatory Sequences Gene regulation a major area of research o Key to understanding gene expression Regulatory motifs discovered o Motifs Short sequences that define a function Nucleotide sequences o Sites where regulatory bind Orthology searches Scan promoter sequences Search for conserved regulatory motifs Amino acid sequences o Key sequences that bind DNA molecules Orthology searches Scan protein sequences Search for DNA binding motifs Discovered motifs must be tested experimentally Functional genomics concern

Transcription Factor Binding Motifs Sequences upstream to which transcription factors bind o Multiple sites possible per transcription factor o Motif location can vary because the 5 -UTR length is not consistent Motif/TF Family Table from: BMC Genomics (2014) 15:317 Motif GCTGCCGGAGA GCACGTGGAG ATGTGATGC GGTTGTGGT ACCAAACAT CACCTAAC Transcription Factor Family NAC bhlh bhlh R2R3-MYB R2R3-MYB R2R3-MYB