Tutorial section. VEGA, the genome browser with a difference

Keywords: vertebrate, annotation, database, manual, curation

Abstract
The Vertebrate Genome Annotation (Vega) database is a community resource for browsing manual annotation of finished sequence from a variety of vertebrate genomes (http://vega.sanger.ac.uk). Vega differs from other genome browsers in that it has a standardised classification of genes which encompasses pseudogenes and non-coding transcripts. The data are manually curated, which is more accurate than current automated methods at identifying splice variants, pseudogenes, poly(A) features, non-coding genes and complex gene structures and arrangements. The database also contains annotation from regions, not just whole genomes, and displays annotation from multiple species (human, mouse, dog and zebrafish) for comparative analysis. Vega encourages community feedback that results in annotation updates and in manual annotation of finished vertebrate sequence.

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 2. 189-193. JUNE 2005

Since completion of the draft human genome sequence in 2000 [1,2] and its subsequent finishing in 2003 [3], many different genome browsers have been developed to enable scientists to access genome data. The initial interpretation of the human genome was through automated annotation, such as that shown in the Ensembl [4] and UCSC [5] genome browsers. There are currently limits to an automated approach to the analysis of genomes, for example in identifying unprocessed pseudogenes in duplicated regions, and therefore there is still a need for manual intervention. As the genome sequence became finished, quality curated browsers such as MapView [6,7] and the H-InvDB [8,9] were developed.

The Vertebrate Genome Annotation (Vega) database [10] is a community resource for browsing manual annotation of finished sequence from a variety of vertebrate genomes [11]. Vega is based on the Ensembl schema, with gene objects shown in shades of blue, and also incorporates curation-specific data. The database allows users to view the manual annotation provided by the Havana group at the Wellcome Trust Sanger Institute (WTSI) [12], IMB-Jena, the Joint Genome Institute, Genoscope and Washington University. It currently contains the manual annotation of ten human chromosomes (6, 7, 9, 10, 13, 14, 20, 22, X and Y). As the genome sequencing centres publish the annotation and analysis of their chromosomes, the data will become accessible in Vega.

Why is Vega different from other browsers?
- It has a standardised classification of genes which encompasses pseudogenes and non-coding transcripts.
- Poly(A) sites/signals are annotated.
- The data are manually curated.
- The data are periodically updated.
- It contains annotation of haplotypes.
- Single nucleotide polymorphisms (SNPs) are mapped to manual curation.
- It is multispecies, and small regions of finished sequence can be submitted and annotated as well as whole genomes.
- It encourages community feedback, which results in annotation updates.

GENE CLASSIFICATION
A standardised set of definitions has been used to categorise the annotation of the different gene features (Table 1). Irrespective of the category to which gene objects have been assigned, all annotated gene structures are supported by homology to cDNAs, expressed sequence tags (ESTs) or protein sequences.

Table 1: Vega annotation definitions
Known: Identical to human cDNA or protein sequences in the Entrez Gene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene/)
Novel: Have an open reading frame and are identical or homologous to known vertebrate cDNAs and/or proteins from all species
Novel transcript: Similar to novel gene but no open reading frame or open reading frame ambiguous
Putative: Homologous to spliced vertebrate expressed sequence tags (ESTs) with no significant open reading frame
Pseudogene: Homologous to protein sequences with a disrupted CDS and an active gene can be found at another locus
Predicted: Based on ab initio prediction for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or Tetraodon genomes). Only used in chromosome 14
Ig segment: Immunoglobulin gene segments
Ig pseudogene segment: Inactivated immunoglobulin segment

GENE NAMING
It is important to use the correct gene nomenclature to maintain consistency in the annotation database, especially when comparing haplotypic or syntenic regions. The Vega annotators interact closely with the nomenclature committees from the Human Genome Organisation (HUGO, HGNC) [13], Zebrafish Information Network (ZFIN) [14] and Mouse Genome Database (MGD) [15]. If an approved symbol is not available for a gene locus, an interim identifier is used in the format of the international clone identifier followed by a number, eg RP11-695B14.2. All loci and their associated transcripts and exons are given stable versioned database IDs (eg OTTHUMG00000021027) that are generated and tracked in the Otter database [16] that underlies Vega (see Figure 1). Whenever a locus is edited, the version number increases and the date of the change is saved.

Figure 1: Curated Locus Report giving information about the NFB1 locus on chromosome 9. Callouts in the figure mark the official HUGO ID, the date the gene was last modified, the stable Otter ID for the gene locus, and the splice variants (7 coding, 1 non-coding), each with a stable Otter transcript ID.

MAIN FEATURES OF VEGA
Manual annotation is currently more accurate than automated methods at identifying splice variants, pseudogenes, polyadenylation features, non-coding genes, complex gene arrangements and clusters.

Splice variants account for approximately 50 per cent of gene loci in finished chromosomes 9, 10 and X, with an average of 2.5 alternative transcripts per locus. Note that the majority are non-coding but have canonical splice sites. Splice variants must be supported by splicing EST/cDNA evidence, but the presence of a coding sequence (CDS) is not essential; hence the majority of variants are annotated without a CDS. ESTs and cDNAs from different species are also used as evidence to predict alternative transcripts, as genome comparison studies have shown that gene structures are generally conserved between human and mouse [17].

Pseudogenes are defined as non-functional copies of genes and are categorised in Vega into unprocessed and processed pseudogenes (viewed in two shades of grey). They are generated by either of two mechanisms: retrotransposition or duplication of genomic DNA. Those that arise from retrotransposition are called processed pseudogenes [18]; they have no 5' promoter sequence or introns but generally have an integrated poly(A) tail at the 3' end that often retains the poly(A) signal. Unprocessed pseudogenes have arisen from genomic duplication, often have a structure that is very similar to the ancestral gene and may even splice correctly. The majority of pseudogenes of both types contain frameshifts and/or stop codons in the coding region. Pseudogenes are valuable in annotation as they have been implicated in human disease [19] and can be used to study evolution.

Poly(A) sites/signals are annotated and may be browsed in Vega. Poly(A) signals are displayed in light red and poly(A) sites in dark red in ContigView. Alternative polyadenylation appears to affect many higher eukaryotes, mainly in a tissue-dependent manner which may be implicated in disease [20]. All poly(A) features are checked manually, using large numbers of ESTs marking out the 3' ends of genes and the fact that signals (of which there are 10 variants in human [21]) are usually found within 60 bases of the poly(A) site.

SNPs can be viewed in ContigView and are mapped from the Glovar database [22] onto the clones within Vega. Glovar contains all the data from dbSNP together with SNPs found from comparisons of the trace repository [23] with the current genome build. Using Vega annotation, SNPs are classified as coding (red), untranslated region (pink), intronic (blue) or other (grey).

ACCESSING AND QUERYING DATA
As the Vega browser is based on Ensembl web code, it has similar standard entry points such as keyword search and similarity searching (BLAST, SSAHA). ExportView can be used to download data in formats such as FastA, Gene Feature Format (GFF) and flat files. There is also direct access to annotation via a distributed annotation server (DAS). If required, the Ensembl API [24] can be used to perform more comprehensive searches of the Vega data. Vega genes mapped to the current genome assembly can also be downloaded from Ensembl using EnsMart.

MHC HAPLOTYPE ANNOTATION
Unlike other browsers, Vega can also contain annotations from regions, not just whole chromosomes. Regions available include the haplotype COX for the major histocompatibility complex (MHC) on human chromosome 6, with more haplotypes to follow [25].

ACCESSING MULTISPECIES ANNOTATION IN VEGA
Vega can display annotation from multiple species for comparative analysis. The mouse annotation browser contains selected regions such as the Del36H deletion region on chromosome 13 and the insulin-dependent diabetes (IDD) susceptibility loci regions. The latter are annotated in both the reference mouse strain (C57BL/6) and the non-obese diabetic (NOD) strain [26].

The zebrafish genome is being sequenced in its entirety at the Sanger Institute and Vega will be the main site for browsing the manually curated data. The reference is the Tuebingen strain, and Vega currently displays chromosomes/linkage groups 1-25 plus one artificial chromosome, U, that contains all clones with unknown chromosomal locations. The AB chromosome displays clones from the AB strain. Manual annotation is added on a monthly basis, and clones which have not yet been annotated (displayed in grey) are shown with features from automated computational analysis (repeat masking, BLAST searches, etc).

Recently the finished sequence of the MHC (DLA) class II region from the dog breed Doberman has been annotated and is available in Vega [27]. The sequence displays a high level of conservation with the human, cat and mouse class II regions.
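As a rough illustration of working with the GFF exports mentioned under Accessing and Querying Data, the sketch below parses tab-separated GFF lines into records and summarises them by feature type. It assumes the standard nine-column GFF layout (sequence name, source, feature, start, end, score, strand, frame, attributes). The sample records, the `vega` source label and the `OTTHUMT_EXAMPLE_1` attribute value are hypothetical and are not taken from a real ExportView download, whose exact columns and attribute syntax may differ.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class GffFeature(NamedTuple):
    """One GFF record; coordinates are 1-based and inclusive."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: str


def parse_gff(lines: Iterable[str]) -> List[GffFeature]:
    """Parse GFF text lines, skipping blank lines, comments and short rows."""
    features = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete GFF record
        attributes = cols[8] if len(cols) > 8 else ""
        features.append(
            GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)
        )
    return features


# Hypothetical records for illustration only (not real Vega annotation).
SAMPLE_GFF = "\n".join([
    "##gff-version 2",
    "\t".join(["chr20", "vega", "exon", "1000", "1200", ".", "+", ".", "OTTHUMT_EXAMPLE_1"]),
    "\t".join(["chr20", "vega", "exon", "1500", "1650", ".", "+", ".", "OTTHUMT_EXAMPLE_1"]),
    "\t".join(["chr20", "vega", "CDS", "1050", "1200", ".", "+", "0", "OTTHUMT_EXAMPLE_1"]),
])

features = parse_gff(SAMPLE_GFF.splitlines())
counts = Counter(f.feature for f in features)
exon_bases = sum(f.end - f.start + 1 for f in features if f.feature == "exon")
print(counts, exon_bases)
```

The same parsed records could be filtered by strand or region before further analysis. For larger or more complex queries, the Ensembl API mentioned in the text is likely the better route.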
COMMUNITY FEEDBACK
Vega is a community annotation database, and feedback from researchers is therefore essential to keep the annotation up to date. A webform [28] is available through which users can contact the Vega team to improve or correct annotation if there is additional evidence. Manual annotation of finished vertebrate sequence may also be submitted if it has been peer reviewed and/or meets the annotation standards [29].

FUTURE DEVELOPMENTS IN VEGA
Currently available genome browsers often display different transcript structures for the same loci. In order to produce a single standard human gene set, the Consensus CDS (CCDS) project has been set up between NCBI, UCSC, Ensembl and the Havana group. The aim is to compare the human gene sets produced by RefSeq, Ensembl and Vega and then identify transcripts where the protein coding region is agreed on by all collaborators. These CDSs will be identified by stable CCDS identifiers in all the browsers. In the near future, manual annotation of the regions for the ENCODE project [30,31] will be displayed in Vega. As the mouse and zebrafish genomes reach completion, it is hoped that the manually annotated orthologues may be browsed using MultiContigView, which is already available in Ensembl.

Acknowledgments
I gratefully acknowledge the help of Dr Jennifer Ashurst and Dr Laurens Wilming at the Wellcome Trust Sanger Institute.

Dr Jane Loveland
HAVANA Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK

Tel: +44 (0) 1223 495389
Fax: +44 (0) 1223 494919
E-mail: jel@sanger.ac.uk

References
1. Lander, E. S., Linton, L. M., Birren, B. et al. (2001), Initial sequencing and analysis of the human genome, Nature, Vol. 409(6822), pp. 860-921.
2. Venter, J. C., Adams, M. D., Myers, E. W. et al. (2001), The sequence of the human genome, Science, Vol. 291(5507), pp. 1304-1351.
3. International Human Genome Sequencing Consortium (2004), Finishing the euchromatic sequence of the human genome, Nature, Vol. 431(7011), pp. 931-945.
4. Hubbard, T., Andrews, D., Caccamo, M. et al. (2005), Ensembl 2005, Nucleic Acids Res., Vol. 33 (Database issue), pp. D447-453.
5. Kent, W. J., Sugnet, C. W., Furey, T. S. et al. (2002), The human genome browser at UCSC, Genome Res., Vol. 12(6), pp. 996-1006.
6. Wheeler, D. L., Chappey, C., Lash, A. E. et al. (2002), Database resources of the National Center for Biotechnology Information: 2002 update, Nucleic Acids Res., Vol. 30(1), pp. 13-16.
7. URL: http://www.ncbi.nlm.nih.gov/mapview/
8. Imanishi, T., Itoh, T., Suzuki, Y. et al. (2004), Integrative annotation of 21,037 human genes validated by full-length cDNA clones, PLoS Biol., Vol. 2(6), p. e162.
9. URL: http://www.h-invitational.jp/
10. Ashurst, J. L., Chen, C.-K., Gilbert, J. G. R. et al. (2005), The Vertebrate Genome Annotation (Vega) database, Nucleic Acids Res., Vol. 33 (Database issue), pp. D459-465.
11. URL: http://vega.sanger.ac.uk
12. URL: http://www.sanger.ac.uk/hgp/havana
13. Wain, H. M., Lush, M. J., Ducluzeau, F. et al. (2004), Genew: The Human Gene Nomenclature Database, 2004 updates, Nucleic Acids Res., Vol. 32 (Database issue), pp. D255-257.
14. Sprague, J., Clements, D., Conlin, T. et al. (2003), The Zebrafish Information Network (ZFIN): The zebrafish model organism database, Nucleic Acids Res., Vol. 31(1), pp. 241-243.
15. Eppig, J. T., Bult, C. J., Kadin, J. A. et al. (2005), The Mouse Genome Database (MGD): From genes to mice - a community resource for mouse biology, Nucleic Acids Res., Vol. 33 (Database issue), pp. D471-475.
16. Searle, S. M., Gilbert, J., Iyer, V. and Clamp, M. (2004), The otter annotation system, Genome Res., Vol. 14(5), pp. 963-970.
17. Batzoglou, S., Pachter, L., Mesirov, J. P. et al. (2000), Human and mouse gene structure: Comparative analysis and application to exon prediction, Genome Res., Vol. 10(7), pp. 950-958.
18. Vanin, E. F. (1985), Processed pseudogenes: Characteristics and evolution, Annu. Rev. Genet., Vol. 19, pp. 253-272.
19. Kenmochi, N., Yoshihama, M., Higa, S. and Tanaka, T. (2000), The human ribosomal protein L6 gene in a critical region for Noonan syndrome, J. Human Genet., Vol. 45(5), pp. 290-293.
20. Edwalds-Gilbert, G., Veraldi, K. L. and Milcarek, C. (1997), Alternative poly(A) site selection in complex transcription units: Means to an end?, Nucleic Acids Res., Vol. 25(13), pp. 2547-2561.
21. Beaudoing, E., Freier, S., Wyatt, J. R. et al. (2000), Patterns of variant polyadenylation signal usage in human genes, Genome Res., Vol. 10(7), pp. 1001-1010.
22. URL: http://www.glovar.org/Homo_sapiens/
23. URL: http://trace.ensembl.org/
24. URL: http://www.ensembl.org/docs/
25. Stewart, C. A., Horton, R., Allcock, R. J. N. et al. (2004), Complete MHC haplotype sequencing for common disease gene mapping, Genome Res., Vol. 14(6), pp. 1176-1187.
26. Hill, N. J., Lyons, P. A., Armitage, N. et al. (2000), NOD Idd5 locus controls insulitis and diabetes and overlaps the orthologous CTLA4/IDDM12 and NRAMP1 loci in humans, Diabetes, Vol. 49(10), pp. 1744-1747.
27. Debenham, S. L., Hart, E. A., Ashurst, J. L. et al. (2005), Genomic sequence of the class II region of the canine MHC: Comparison with the MHC of other mammalian species, Genomics, Vol. 85(1), pp. 48-59.
28. URL: http://vega.sanger.ac.uk/helpdesk/index.html
29. URL: http://sanger.ac.uk/hgp/havan/docs/guidelines.pdf
30. ENCODE Project Consortium (2004), The ENCODE (ENCyclopedia Of DNA Elements) Project, Science, Vol. 306(5696), pp. 636-640.
31. URL: http://www.genome.gov/1005107/