Tutorial section. VEGA, the genome browser with a difference

Keywords: vertebrate, annotation, database, manual, curation

Abstract

The Vertebrate Genome Annotation (Vega) database is a community resource for browsing manual annotation of finished sequence from a variety of vertebrate genomes (http://vega.sanger.ac.uk). Vega differs from other genome browsers in that it has a standardised classification of genes which encompasses pseudogenes and non-coding transcripts. The data are manually curated, and manual curation is currently more accurate than automated methods at identifying splice variants, pseudogenes, poly(A) features, non-coding genes and complex gene structures and arrangements. The database also contains annotation from regions, not just whole genomes, and displays multiple species annotation (human, mouse, dog and zebrafish) for comparative analysis. Vega encourages community feedback, which results in annotation updates, and the submission of manual annotation of finished vertebrate sequence.

Since completion of the draft human genome sequence in 2000 [1,2] and its subsequent finishing in 2003 [3], many different genome browsers have been developed to enable scientists to access genome data. The initial interpretation of the human genome was through automated annotation, such as that shown in the Ensembl [4] and UCSC [5] genome browsers. There are currently limits to an automated approach to the analysis of genomes, for example in identifying unprocessed pseudogenes in duplicated regions, and therefore there is still a need for manual intervention. As the genome sequence became finished, quality curated browsers such as MapView [6,7] and the H-InvDB [8,9] were developed. The Vertebrate Genome Annotation (Vega) database [10] is a community resource for browsing manual annotation of finished sequence from a variety of vertebrate genomes [11]. Vega is based on the Ensembl schema, with gene objects shown in shades of blue, and also incorporates curation-specific data.
The database allows users to view the manual annotation provided by the Havana group at the Wellcome Trust Sanger Institute (WTSI) [12], IMB-Jena, the Joint Genome Institute, Genoscope and Washington University. It currently contains the manual annotation of ten human chromosomes (6, 7, 9, 10, 13, 14, 20, 22, X and Y). As the genome sequencing centres publish the annotation and analysis of their chromosomes, the data will become accessible in Vega.

Why is Vega different from other browsers? It has a standardised classification of genes which encompasses pseudogenes and non-coding transcripts. Poly(A) sites/signals are annotated. The data are manually curated. The data are periodically updated. It contains annotation of haplotypes.

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 2. 189–193. JUNE 2005

Table 1: Vega annotation definitions

Known: Identical to human cDNA or protein sequences in the Entrez Gene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene/)
Novel: Have an open reading frame and are identical or homologous to known vertebrate cDNAs and/or proteins from all species
Novel transcript: Similar to novel gene but no open reading frame or open reading frame ambiguous
Putative: Homologous to spliced vertebrate expressed sequence tags (ESTs) with no significant open reading frame
Pseudogene: Homologous to protein sequences with a disrupted CDS and an active gene can be found at another locus
Predicted: Based on ab initio prediction for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or Tetraodon genomes). Only used in chromosome 14
Ig segment: Immunoglobulin gene segments
Ig pseudogene segment: Inactivated immunoglobulin segment

Single nucleotide polymorphisms (SNPs) are mapped to manual curation. Vega is multispecies, and small regions of finished sequence can be submitted and annotated as well as whole genomes. It encourages community feedback, which results in annotation updates.

GENE CLASSIFICATION

A standardised set of definitions has been used to categorise the annotation of the different gene features (Table 1). Irrespective of the category to which gene objects have been assigned, all annotated gene structures are supported by homology to cDNAs, expressed sequence tags (ESTs) or protein sequences.

GENE NAMING

It is important to use the correct gene nomenclature to maintain consistency in the annotation database, especially when comparing haplotypic or syntenic regions. The Vega annotators interact closely with the nomenclature committees from the Human Genome Organisation (HUGO, HGNC) [13], Zebrafish Information Network (ZFIN) [14] and Mouse Genome Database (MGD) [15]. If an approved symbol is not available for a gene locus, an interim identifier is used in the format of the international clone identifier followed by a number, eg RP11-695B14.2. All loci and their associated transcripts and exons are given stable versioned database IDs (eg OTTHUMG00000021027) that are generated and tracked in the Otter database [16] that underlies Vega (see Figure 1). Whenever a locus is edited, the version number increases and the date of the change is saved.

Figure 1: Curated Locus Report giving information about the NFB1 locus on chromosome 9. Callouts in the figure mark the official HUGO ID, the date the gene was last modified, the stable Otter ID for the gene locus and the splice variants (7 coding, 1 non-coding), each with a stable Otter transcript ID.

MAIN FEATURES OF VEGA

Manual annotation is currently more accurate at identifying splice variants, pseudogenes, polyadenylation features, non-coding genes, complex gene arrangements and clusters than automated methods.

Splice variants account for approximately 50 per cent of gene loci in finished chromosomes 9, 10 and X, with an average of 2.5 alternative transcripts per locus. Splice variants must be supported by splicing EST/cDNA evidence, but the presence of a coding sequence (CDS) is not essential. Hence the majority of variants are annotated without a CDS, although they have canonical splice sites. ESTs and cDNAs from different species are also used as evidence to predict alternative transcripts, as genome comparison studies have shown that gene structures are generally conserved between human and mouse [17].

Pseudogenes are defined as non-functional copies of genes and are categorised in Vega into unprocessed and processed pseudogenes (viewed in two shades of grey). They are generated by either of two mechanisms: retrotransposition or duplication of genomic DNA. Those that arise from retrotransposition are called processed pseudogenes [18] and have no 5′ promoter sequence or introns, but generally have an integrated poly(A) tail at the 3′ end that often retains the poly(A) signal. Unprocessed pseudogenes have arisen from genomic duplication; they often have a structure that is very similar to the ancestral gene and may even splice correctly. The majority of pseudogenes of both types contain frameshifts and/or stop codons in the coding region. Pseudogenes are valuable in annotation as they have been implicated in human disease [19] and can be used to study evolution.

Poly(A) sites/signals are annotated and may be browsed in Vega. Poly(A) signals are displayed in light red and poly(A) sites in dark red in ContigView. Alternative polyadenylation appears to affect many higher eukaryotes, mainly in a tissue-dependent manner, and may be implicated in disease [20]. All poly(A) features are checked manually, using large numbers of ESTs marking out the 3′ ends of genes and the fact that signals (of which there are 10 variants in human [21]) are usually found within 60 bases of the poly(A) site.

SNPs can be viewed in ContigView and are mapped from the Glovar database [22] onto the clones within Vega. Glovar contains all the data from dbSNP together with SNPs found from comparisons of the trace repository [23] with the current genome build. Using Vega annotation, SNPs are classified as coding (red), untranslated region (pink), intronic (blue) or other (grey).

ACCESSING AND QUERYING DATA

As the Vega browser is based on Ensembl web code, it has similar standard entry points such as keyword search and similarity searching (BLAST, SSAHA). ExportView can be used to download data in formats such as FASTA, Gene Feature Format (GFF) and flat files. There is also direct access to annotation via a distributed annotation server (DAS). If required, the Ensembl API [24] can be used to perform more comprehensive searches of the Vega data. Vega genes mapped to the current genome assembly can also be downloaded from Ensembl using EnsMart.

MHC HAPLOTYPE ANNOTATION

Unlike other browsers, Vega can also contain annotations from regions, not just whole chromosomes. Regions available include the haplotype COX for the major histocompatibility complex (MHC) on human chromosome 6, with more haplotypes to follow [25].

ACCESSING MULTISPECIES ANNOTATION IN VEGA

Vega can display multiple species annotation for comparative analysis. The mouse annotation browser contains selected regions such as the Del36H deletion region on chromosome 13 and the insulin-dependent diabetes (IDD) susceptibility loci regions. The latter are annotated in both the reference mouse strain (C57BL/6) and the non-obese diabetic (NOD) strain [26].

The zebrafish genome is being sequenced in its entirety at the Sanger Institute and Vega will be the main site for browsing the manually curated data. The reference is the Tuebingen strain, and Vega currently displays chromosomes/linkage groups 1–25 plus one artificial chromosome, U, that contains all clones with unknown chromosomal locations. The AB chromosome displays clones from the AB strain. Manual annotation is added on a monthly basis, and clones which have not yet been annotated (displayed in grey) are shown with features from automated computational analysis (repeat masking, BLAST searches, etc).

Recently the finished sequence of the MHC (DLA) class II region from the dog breed Doberman has been annotated and is available in Vega [27]. The sequence displays a high level of conservation with the human, cat and mouse class II regions.
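The GFF files that ExportView offers for download are plain tab-separated text, so exported annotation can be summarised with a short script. The following Python sketch is only an illustration under stated assumptions: it assumes the standard nine-column GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes) and treats the ninth column as optional. The sample lines, the feature names in them and the helper names (parse_gff, count_features, feature_length) are hypothetical and are not taken from a real Vega export.

```python
"""Minimal sketch: tally features in a GFF export.

Assumes the standard nine-column, tab-separated GFF layout. The sample
records below are invented for illustration and are not real Vega output.
"""
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str  # empty string when the ninth column is absent


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per data line, skipping blank and comment lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated fields: {line!r}")
        # The ninth (attributes) column is treated as optional in this sketch.
        attributes = fields[8] if len(fields) > 8 else ""
        yield GffRecord(
            fields[0], fields[1], fields[2],
            int(fields[3]), int(fields[4]),
            fields[5], fields[6], fields[7], attributes,
        )


def feature_length(record: GffRecord) -> int:
    """Length in bases; GFF coordinates are 1-based and inclusive."""
    return record.end - record.start + 1


def count_features(records: Iterable[GffRecord]) -> Counter:
    """Count records per value of the feature column."""
    return Counter(record.feature for record in records)


# Hypothetical sample lines (not real Vega data).
SAMPLE_GFF = [
    "##gff-version 2",
    "\t".join(["chr9", "example", "exon", "1000", "1200", ".", "+", ".", 'transcript "T1"']),
    "\t".join(["chr9", "example", "exon", "1500", "1650", ".", "+", ".", 'transcript "T1"']),
    "\t".join(["chr9", "example", "polyA_site", "1650", "1650", ".", "+", "."]),
]

if __name__ == "__main__":
    parsed = list(parse_gff(SAMPLE_GFF))
    print(count_features(parsed))
    print(sum(feature_length(r) for r in parsed if r.feature == "exon"))
```

To use the sketch on a downloaded file, the list of sample lines can be replaced by an open file handle, since parse_gff accepts any iterable of lines. The feature and source names in a real export depend on the server configuration and should be checked against the file itself.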
COMMUNITY FEEDBACK

Vega is a community annotation database, and feedback from researchers is therefore essential to keep the annotation up to date. A webform [28] is available by which users can contact the Vega team to improve or correct annotation if there is additional evidence. Manual annotation of finished vertebrate sequence may also be submitted if it has been peer reviewed and/or meets the annotation standards [29].

FUTURE DEVELOPMENTS IN VEGA

Currently available genome browsers often display different transcript structures for the same loci. In order to produce a single standard human gene set, the Consensus CDS (CCDS) project has been set up between NCBI, UCSC, Ensembl and the Havana group. The aim is to compare the human gene sets produced by RefSeq, Ensembl and Vega and then identify transcripts where the protein coding region is agreed on by all collaborators. These CDSs will be identified by stable CCDS identifiers in all the browsers. In the near future, manual annotation of the regions for the ENCODE project [30,31] will be displayed in Vega. As the mouse and zebrafish genomes reach completion, it is hoped that the manually annotated orthologues may be browsed using MultiContigView, which is already available in Ensembl.

Acknowledgments

I gratefully acknowledge the help of Dr Jennifer Ashurst and Dr Laurens Wilming at the Wellcome Trust Sanger Institute.

Dr Jane Loveland
HAVANA Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK

Tel: +44 (0) 1223 495389
Fax: +44 (0) 1223 494919
E-mail: jel@sanger.ac.uk

References

1. Lander, E. S., Linton, L. M., Birren, B. et al. (2001), Initial sequencing and analysis of the human genome, Nature, Vol. 409(6822), pp. 860–921.
2. Venter, J. C., Adams, M. D., Myers, E. W. et al. (2001), The sequence of the human genome, Science, Vol. 291(5507), pp. 1304–1351.
3. International Human Genome Sequencing Consortium (2004), Finishing the euchromatic sequence of the human genome, Nature, Vol. 431(7011), pp. 931–945.
4. Hubbard, T., Andrews, D., Caccamo, M. et al. (2005), Ensembl 2005, Nucleic Acids Res., Vol. 33 (Database issue), pp. D447–453.
5. Kent, W. J., Sugnet, C. W., Furey, T. S. et al. (2002), The human genome browser at UCSC, Genome Res., Vol. 12(6), pp. 996–1006.
6. Wheeler, D. L., Chappey, C., Lash, A. E. et al. (2002), Database resources of the National Center for Biotechnology Information: 2002 update, Nucleic Acids Res., Vol. 30(1), pp. 13–16.
7. URL: http://www.ncbi.nlm.nih.gov/mapview/
8. Imanishi, T., Itoh, T., Suzuki, Y. et al. (2004), Integrative annotation of 21,037 human genes validated by full-length cDNA clones, PLoS Biol., Vol. 2(6), p. e162.
9. URL: http://www.h-invitational.jp/
10. Ashurst, J. L., Chen, C.-K., Gilbert, J. G. R. et al. (2005), The Vertebrate Genome Annotation (Vega) database, Nucleic Acids Res., Vol. 33 (Database issue), pp. D459–465.
11. URL: http://vega.sanger.ac.uk
12. URL: http://www.sanger.ac.uk/hgp/havana
13. Wain, H. M., Lush, M. J., Ducluzeau, F. et al. (2004), Genew: The Human Gene Nomenclature Database, 2004 updates, Nucleic Acids Res., Vol. 32 (Database issue), pp. D255–257.
14. Sprague, J., Clements, D., Conlin, T. et al. (2003), The Zebrafish Information Network (ZFIN): The zebrafish model organism database, Nucleic Acids Res., Vol. 31(1), pp. 241–243.
15. Eppig, J. T., Bult, C. J., Kadin, J. A. et al. (2005), The Mouse Genome Database (MGD): From genes to mice - a community resource for mouse biology, Nucleic Acids Res., Vol. 33 (Database issue), pp. D471–475.
16. Searle, S. M., Gilbert, J., Iyer, V. and Clamp, M. (2004), The otter annotation system, Genome Res., Vol. 14(5), pp. 963–970.
17. Batzoglou, S., Pachter, L., Mesirov, J. P. et al. (2000), Human and mouse gene structure: Comparative analysis and application to exon prediction, Genome Res., Vol. 10(7), pp. 950–958.
18. Vanin, E. F. (1985), Processed pseudogenes: Characteristics and evolution, Annu. Rev. Genet., Vol. 19, pp. 253–272.
19. Kenmochi, N., Yoshihama, M., Higa, S. and Tanaka, T. (2000), The human ribosomal protein L6 gene in a critical region for Noonan syndrome, J. Human Genet., Vol. 45(5), pp. 290–293.
20. Edwalds-Gilbert, G., Veraldi, K. L. and Milcarek, C. (1997), Alternative poly(A) site selection in complex transcription units: Means to an end?, Nucleic Acids Res., Vol. 25(13), pp. 2547–2561.
21. Beaudoing, E., Freier, S., Wyatt, J. R. et al. (2000), Patterns of variant polyadenylation signal usage in human genes, Genome Res., Vol. 10(7), pp. 1001–1010.
22. URL: http://www.glovar.org/Homo_sapiens/
23. URL: http://trace.ensembl.org/
24. URL: http://www.ensembl.org/docs/
25. Stewart, C. A., Horton, R., Allcock, R. J. N. et al. (2004), Complete MHC haplotype sequencing for common disease gene mapping, Genome Res., Vol. 14(6), pp. 1176–1187.
26. Hill, N. J., Lyons, P. A., Armitage, N. et al. (2000), NOD Idd5 locus controls insulitis and diabetes and overlaps the orthologous CTLA4/IDDM12 and NRAMP1 loci in humans, Diabetes, Vol. 49(10), pp. 1744–1747.
27. Debenham, S. L., Hart, E. A., Ashurst, J. L. et al. (2005), Genomic sequence of the class II region of the canine MHC: Comparison with the MHC of other mammalian species, Genomics, Vol. 85(1), pp. 48–59.
28. URL: http://vega.sanger.ac.uk/helpdesk/index.html
29. URL: http://sanger.ac.uk/hgp/havan/docs/guidelines.pdf
30. ENCODE Project Consortium (2004), The ENCODE (ENCyclopedia Of DNA Elements) Project, Science, Vol. 306(5696), pp. 636–640.
31. URL: http://www.genome.gov/1005107/