UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Similar documents
ab initio and Evidence-Based Gene Finding

Identifying Genes and Pseudogenes in a Chimpanzee Sequence Adapted from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. M.

Chimp BAC analysis: Adapted by Wilson Leung and Sarah C.R. Elgin from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. Michael R.

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Chimp Sequence Annotation: Region 2_3

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Annotation Walkthrough Workshop BIO 173/273 Genomics and Bioinformatics Spring 2013 Developed by Justin R. DiAngelo at Hofstra University

Lab Week 9 - A Sample Annotation Problem (adapted by Chris Shaffer from a worksheet by Varun Sundaram, WU-STL, Class of 2009)

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. Evidence Based Annotation. GEP goals: Evidence for Gene Models 08/22/2017

Collect, analyze and synthesize. Annotation. Annotation for D. virilis. GEP goals: Evidence Based Annotation. Evidence for Gene Models 12/26/2018

user s guide Question 1

Small Exon Finder User Guide

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

Genomic Annotation Lab Exercise By Jacob Jipp and Marian Kaehler Luther College, Department of Biology Genomics Education Partnership 2010

Aaditya Khatri. Abstract

Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

Annotating Fosmid 14p24 of D. Virilis chromosome 4

Annotation Practice Activity [Based on materials from the GEP Summer 2010 Workshop] Special thanks to Chris Shaffer for document review Parts A-G

Methods and Algorithms for Gene Prediction

Annotation of a Drosophila Gene

HUMAN GENOME BIOINFORMATICS. Tore Samuelsson, Dec 2009

user s guide Question 3

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

INTRODUCTION TO BIOINFORMATICS. SAINTS GENETICS Ian Bosdet

MODULE 5: TRANSLATION

Question 2: There are 5 retroelements (2 LINEs and 3 LTRs), 6 unclassified elements (XDMR and XDMR_DM), and 7 satellite sequences.

Lecture 10. Ab initio gene finding

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

GenBank Growth. In 2003 ~ 31 million sequences ~ 37 billion base pairs

Annotation of contig27 in the Muller F Element of D. elegans. Contig27 is a 60,000 bp region located in the Muller F element of the D. elegans.

user s guide Question 3

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

BME 110 Midterm Examination

Investigating Inherited Diseases

Hands-On Four Investigating Inherited Diseases

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

MODULE TSS1: TRANSCRIPTION START SITES INTRODUCTION (BASIC)

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Guided tour to Ensembl

Genome annotation. Erwin Datema (2011) Sandra Smit (2012, 2013)

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Draft 3 Annotation of DGA06H06, Contig 1 Jeannette Wong Bio4342W 27 April 2009

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

Lecture 7 Motif Databases and Gene Finding

EGPred: Prediction of Eukaryotic Genes Using Ab Initio Methods After Combining With Sequence Similarity Approaches

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Annotation of contig62 from Drosophila elegans Dot Chromosome

Computational gene finding

9/19/13. cdna libraries, EST clusters, gene prediction and functional annotation. Biosciences 741: Genomics Fall, 2013 Week 3

Genome annotation & EST

Annotation of Contig8 Sakura Oyama Dr. Elgin, Dr. Shaffer, Dr. Bednarski Bio 434W May 2, 2016

Bacterial Genome Annotation

Gene Prediction 10/21/05

FUNCTIONAL BIOINFORMATICS

Annotating the D. virilis Fourth Chromosome: Fosmid 99M21

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Overview: GQuery Entrez human and amylase Search Pubmed Gene Gene: collected information about gene loci AMY1A Genomic context Summary

Gene Identification in silico


132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

ProGen: GPHMM for prokaryotic genomes

Transcription Start Sites Project Report

Chapter 4 METHOD FOR PREDICTING GENES IN EUKARYOTIC GENOMES

The University of California, Santa Cruz (UCSC) Genome Browser

Homework 4. Due in class, Wednesday, November 10, 2004

Gene Signal Estimates from Exon Arrays

Using the Genome Browser: A Practical Guide. Travis Saari

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

Computational gene finding. Devika Subramanian Comp 470

Gene Finding Genome Annotation

Improved Splice Site Detection in Genie

Genome and DNA Sequence Databases. BME 110: CompBio Tools Todd Lowe April 5, 2007

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Gene Regulation 10/19/05

Web-based tools for Bioinformatics; A (free) introduction to (freely available) NCBI, MUSC and World-wide.

Genome Annotation. Stefan Prost 1. May 27th, States of America. Genome Annotation

MODULE TSS2: SEQUENCE ALIGNMENTS (ADVANCED)

Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama

Prokaryotic Annotation Pipeline SOP HGSC, Baylor College of Medicine

COMPUTER RESOURCES II:

Tutorial for Stop codon reassignment in the wild

Gene Prediction Group

AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG

Textbook Reading Guidelines

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Eukaryotic Gene Prediction. Wei Zhu May 2007

Bioinformatics Course AA 2017/2018 Tutorial 2

Gene Structure & Gene Finding Part II

BIOINF525: INTRODUCTION TO BIOINFORMATICS LAB SESSION 1

Biotechnology Explorer

Genscan. The Genscan HMM model Training Genscan Validating Genscan. (c) Devika Subramanian,

Computational gene finding

Chapter 2: Access to Information

Array-Ready Oligo Set for the Rat Genome Version 3.0

G4120: Introduction to Computational Biology

Transcription:

UCSC Genome Browser Introduction to ab initio and evidence-based gene finding Wilson Leung 06/2006 Outline Introduction to annotation ab initio gene finding Basics of the UCSC Browser Evidence-based gene finding using the UCSC Browser 1

AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAAT ATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATT CCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGA TGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGC TAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTC TCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGT GCCCRAGCTCACCEAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTA TGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTT AAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGG GCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGC GCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCAHATAEGCACAACGT CTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCEGTTGGCGCCGCTGCTGGTCC CCAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAAC TCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTC CCCGGCCGGAATGNTAANGAAOTAATCAAATTTTGGCIGACAOAATGNGCAC CGGCCGGAATGNTAANGAAOTAATCAAATTTTGGCIGACAOAATGNGCACGT GTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCAGATTCAGAAT ATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAA GCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGC ACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCA AATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGG TCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTT GTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAA ATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTT TTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTCTATGAATCT TTGTAATACTTTCGAATTTAATTATTAGTCTACATTAATATTGATACCGTTTTT CGTAGGCAATATATATCCGACGCCAAAAGATGCTGATATTTTTGCTTTTTGCT GCCC What to Annotate? Genes Novel genes, known genes, pseudogenes Non-coding RNAs trnas, mirnas, sirnas Regulatory Elements Promoters, enhancers, silencers Repeats Transposable elements, simple repeats 2

ab initio Gene Prediction ab initio = From the Beginning Gene prediction using only the genomic DNA sequence Search for signals of protein coding regions Typically uses a probabilistic model Hidden Markov Model (HMM) Requires external evidence to support predictions (mrna, ESTs) Performance of Gene Finders Most gene-finders can predict prokaryotic genes accurately However, gene-finders do a poor job of predicting genes in eukaryotes Not as much is known about the general properties of eukaryotic genes Splice site recognition, isoforms 3

ab initio Gene Predictions Examples: Glimmer for prokaryotic gene predictions (S. Salzberg, A. Delcher, S. Kasif, and O. White 1998) Genscan for eukaryotic gene predictions (Burge and Karlin 1997) Genscan is the gene finder we will use for our chimpanzee and Drosophila annotations Genscan Gene Model Genscan considers the following: Promoter signals Polyadenylation signals Splice signals Probability of coding and non-coding DNA Gene, exon and intron length Chris Burge and Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic DNA, JMB. (1997) 268, 78-94 4

Common Problems Common problems with gene finders Fusing neighboring genes Spliting a single gene Missing exons or entire genes Overpredicting exons or genes Other challenges Nested genes Noncanonical splice spites Pseudogenes Different isoforms of same gene How to Improve Predictions? New gene finders use additional evidence to generate better predictions: Twinscan extends model in Genscan by using homology between two related species Separate model used for exons, introns, splice sites, UTR s Ian Korf, et al. Integrating genomic homology into gene structure prediction. Bioinformatics. (17) S140-S148. 5

How to Improve Predictions? Manual annotation Collect evidence from multiple biological and computational sources to create gene models This method still generates the best annotations Need a place to collate all the different lines of evidence available UCSC Browser Introduction to the UCSC Browser http://genome.ucsc.edu 6

UCSC Browser Developers UCSC Browser is created by the Genome Bioinformatics Group of UC Santa Cruz Development team: http://genome.ucsc.edu/staff.html Led by Jim Kent and David Haussler UCSC Browser was initially created for the human genome project It has since been adapted for many other organisms We have set up local version of the UCSC Browser for Bio 4342 Functions of UCSC Browser Functionalities of UCSC Browser Genome Browser - views of genomic regions BLAT - BLAST-Like Alignment Tool Table Browser - SQL access to genomic data Training section on the UCSC web site http://openhelix.com/ucscmaterials.shtml Pre-recorded tutorial (presentation slide set) Reference cards 7

Chimp BAC Analysis Goal: Annotation of one of the features in a 170 kb chimpanzee BAC A more detailed walkthrough is available Genscan was run on the repeat-masked BAC using the vertebrate parameter set (GENSCAN_ChimpBAC.html) Predicts 8 genes within this BAC By default, Genscan also predicts promoter and poly-a sites; however, these are generally unreliable Output consists of map, summary table, peptide and coding sequences of the predicted genes Chimp BAC Analysis Analysis of Gene 1 (423 coding bases): Use the predicted peptide sequence to evaluate the validity of Genscan prediction blastp of predicted peptide against the nr database Typically uses the NCBI BLAST page: http://www.ncbi.nlm.nih.gov/blast/ Choose blastp and search against nr For the purpose of this tutorial, open blastpgene1.html 8

Interpreting blastp Output Many significant hits to the nr database that cover the entire length of the predicted protein Do not rely on hits that have accession numbers starting with XP XP indicates RefSeq without experimental confirmation Click on the Score for the second hit in the blastp output (gb AAH70482.1) Indicates hit to human HMGB3 protein Investigating HMGB3 Alignment The full HMGB3 protein has length of 200 aa However, our predicted peptide only has 140 aa Possible explanations: 1. Genscan mispredicted the gene Missed part of the real chimp protein 2. Genscan predicted the gene correctly Pseudogene that has acquired an in-frame stop codon Functional protein in chimp that lacks one or more functional domains when compared to the human version 9

Analysis using UCSC Browser Go back to Genscan output page and copy the first predicted coding sequence Navigate to UCSC browser @ http://genome.ucsc.edu Click on BLAT Select the human genome (May 2004 assembly) Paste the coding sequence into the text box Click submit Human Blat Results Predicted sequence matches to many places in the human genome Top hit shows sequence identity of 99.1% between our sequence and the human sequence Next best match has identity of 93.6%, below what we expect for human / chimp orthologs (98.5% identical) Click on browser for the top hit (on chromosome 7) The genome browser for this region in human chromosome 7 should now appear Navigation buttons are on the top menu bar Reinitialize the browser by clicking on hide all 10

Adjusting Display Options Adjust following tracks to pack Under Mapping and Sequencing Tracks : Blat Sequence Adjust following tracks to dense Under Gene and Gene Prediction Tracks Known Genes, RefSeq, Ensembl Genes, Twinscan, SGP Genes, Genscan Genes Under mrna and EST Tracks Human mrnas, Spliced ESTs, Human ESTs, Other ESTs Under Comparative Genomics Mouse Net Under Variation and Repeats RepeatMasker Human Genome Browser Hit refresh and look at new image; zoom out 3x to get a broader view There are no known genes in this region Only evidence is from hypothetical genes predicted by SGP and Genscan SGP predicted a larger gene with two exons There are also no known human mrna or human ESTs in the aligned region However, there are ESTs from other organisms 11

Investigate Partial Match Go to GenBank record for the human HMGB3 protein (using the BLAST result) Click on the Display button and select FASTA to obtain the sequence Go back to the BLAT search to search this sequence against the human genome assembly (May 2004) BLAT search of human HMGB3 Notice the match to part of human chromosome 7 we observed previously is only the 7th best match (identity of 88%) Consistent with one of our hypotheses that our predicted protein is a paralog Click on browser to see corresponding sequence on human chromosome 7 BLAT results overlap Genscan prediction but extend both ends Why would Genscan predict a shorter gene? 12

Examining Alignment Now we need to examine the alignment: Go back to previous page and click on details In general, the alignment looks good except for a few changes However, when examining some of the unmatched (black) regions, notice there is a tag - a stop codon. Confirm predicted protein is in frame relative to human chromosome 7 by Looking at the side-by-side alignment Confirming Pseudogene Side-by-side alignment color scheme Lines = match Green = similar amino acids Red = dissimilar amino acids We noticed a red X (stop codon) aligning to a Y (tyrosine) in the human sequence 13

Confirming Pseudogene Alignment after stop codon showed no deterioration in similarity suggest our prediction is a recently retrotransposed pseudogene To confirm hypothesis, go back to BLAT results and get the top hit (100% identity on chromosome X) The real HMGB3 gene in human is a 4-exon gene! Conclusions Based on evidence accumulated: As a cdna, the four-exon HMGB3 gene was retrotransposed It then acquired a stop codon mutation prior to the split of the chimpanzee and human lineages Retrotransposition event is relatively recent Pseudogene still retains 88.8% sequence identity to source protein 14

Questions? 15