Databases/Resources on the web


1 Databases/Resources on the web Jon K. Lærdahl

2 A lot of biological databases are available on the web...
MetaBase, the database of biological databases (1801 entries) - http://
bioinformatics links directory (620 databases) - http://bioinforma

3 btw, the bioinformatics links directory is an excellent resource
bioinformatics links directory - http://bioinforma
Currently 1459 tools, 620 databases, 164 resources
The problem is not to find a tool or database, but to know what is gold and what is junk

4 Some important centres for bioinformatics
National Center for Biotechnology Information (NCBI) - part of the US National Library of Medicine (NLM), a branch of the National Institutes of Health; located in Bethesda, Maryland
European Bioinformatics Institute (EMBL-EBI) - part of the European Molecular Biology Laboratory (EMBL); located in Hinxton, Cambridgeshire, UK

5 NCBI databases
NCBI has provided the GenBank DNA sequence database since 1992
Online Mendelian Inheritance in Man (OMIM) - known diseases with a genetic component and links to genes; started in the early 1960s as a book; online version, OMIM, since 1987; on the WWW by NCBI in 1995; currently >22,000 entries (14,400 genes)
EST - nucleotide database subset that contains only Expressed Sequence Tag records
Gene - genes and associated information for a number of organisms in addition to and including human
Protein sequence database - collection of protein sequence entries compiled from a variety of sources including Swiss-Prot, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
PubMed - access to over 15 million citations from MEDLINE and additional life sciences journals
SNP - repository for both single nucleotide substitutions and short deletion and insertion polymorphisms
All data is publicly available
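The databases listed above can also be queried programmatically. The slides do not cover this; the sketch below assumes NCBI's public Entrez E-utilities `esearch` endpoint and only builds the query URL (no network access), so the function name `esearch_url` and the example search terms are illustrative rather than part of the lecture.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-utilities (an assumption of this sketch;
# the slides themselves do not mention programmatic access).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database (e.g. 'pubmed', 'gene')."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the term is safe inside a URL.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example queries against two of the databases on the slide.
    print(esearch_url("pubmed", "Ensembl genome browser"))
    print(esearch_url("gene", "BRCA2[sym] AND human[orgn]", retmax=5))
```

Fetching the printed URL in a browser or with an HTTP client returns the matching record identifiers; keeping URL construction separate from fetching makes the query easy to inspect before any request is sent.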

6 NCBI databases
37 databases that together contain over 690 million records
Nucleic Acids Res. 41, D8 (2013)

7 EMBL-EBI databases
European Nucleotide Archive (ENA) - nucleotide sequence database
Ensembl - automatic and manually curated annotation on selected eukaryotic (vertebrate) genomes
Ensembl Genomes - Ensembl for all other organisms
UniProt - protein sequence and functional information
ChEMBL - database of bioactive compounds
IntAct - repository of molecular interactions, including protein-protein, protein-small molecule and protein-nucleic acid interactions
CiteXplore - 25 million literature abstracts including PubMed, Agricola & patents
Gene Ontology (GO) - controlled vocabulary to describe gene and gene product attributes in any organism
Gene Ontology Annotation (GOA) - GO annotations for proteins in UniProt
All data is publicly available

8 NCBI «Trace Archives»
Trace Archive - repository of raw sequencing trace data from gel and capillary electrophoresis sequencers; >2 billion traces
Sequence Read Archive (SRA) - data from high-throughput sequencing (454, Illumina, IonTorrent, SOLiD, etc.); 915 Tbases (9.15 x 10^14) of open access sequences; at present 1 Tbase added daily
http://
1 Pbp ≈ 100,000 human genomes
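The SRA figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide plus one labeled assumption: a haploid human genome of about 3.1 Gbp, which is not stated in the slides. Reading the slide's pairing of 1 Pbp with 100,000 human genomes as an implied sequencing depth is likewise an interpretation, not something the slide says.

```python
TBASE = 10**12
PBASE = 10**15

SRA_BASES = 915 * TBASE      # from the slide: 915 Tbases
DAILY_GROWTH = 1 * TBASE     # from the slide: 1 Tbase added daily

# Assumption (not from the slide): haploid human genome size in base pairs.
HUMAN_GENOME_BP = 3.1e9


def genome_equivalents(bases: float, coverage: float = 1.0) -> float:
    """How many human genomes `bases` would cover at the given depth."""
    return bases / (HUMAN_GENOME_BP * coverage)


def implied_coverage(bases: float, genomes: int) -> float:
    """Average depth per genome if `bases` were spread over `genomes`."""
    return bases / (genomes * HUMAN_GENOME_BP)


def days_until(target: int, current: int = SRA_BASES,
               per_day: int = DAILY_GROWTH) -> int:
    """Days until the archive reaches `target` bases at constant growth."""
    remaining = max(0, target - current)
    return -(-remaining // per_day)  # ceiling division on integers


if __name__ == "__main__":
    print(f"SRA size: {SRA_BASES:.2e} bases")
    print(f"Days to reach 1 Pbp: {days_until(PBASE)}")
    print(f"1 Pbp at 1x coverage: {genome_equivalents(PBASE):,.0f} genomes")
    print(f"Depth if 1 Pbp = 100,000 genomes: "
          f"{implied_coverage(PBASE, 100_000):.1f}x")
```

Under these assumptions, 915 Tbases growing by 1 Tbase per day reaches 1 Pbp in 85 days, and 100,000 genomes per Pbp corresponds to roughly 3x coverage per genome.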

9 UniProt
Database of protein sequences and functional annotations
A single worldwide database of protein sequence and function (2002)
UniProt consortium:
EMBL-EBI and Swiss Institute of Bioinformatics (SIB) - Swiss-Prot (Amos Bairoch, 1986) and TrEMBL (Translated EMBL Nucleotide Sequence Data Library, 1996)
Protein Information Resource (PIR) - roots in Margaret Dayhoff's Atlas of Protein Sequence and Structure (1965)
http://
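The Swiss-Prot/TrEMBL distinction shows up directly in UniProt sequence files. As a minimal sketch, assuming the header layout used in UniProt FASTA downloads (`>db|accession|entry_name description OS=...`, with `sp` for Swiss-Prot and `tr` for TrEMBL), the following parser reports which section an entry comes from. The function and class names are hypothetical, and the haemoglobin header is only an illustrative input.

```python
import re
from typing import NamedTuple


class UniProtHeader(NamedTuple):
    section: str      # "Swiss-Prot" or "TrEMBL"
    accession: str
    entry_name: str
    description: str


# >db|accession|entry_name description [OS=organism ...]
_HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*?)(?:\s+OS=.*)?$")
_SECTIONS = {"sp": "Swiss-Prot", "tr": "TrEMBL"}


def parse_uniprot_header(line: str) -> UniProtHeader:
    """Split one UniProt FASTA header line into its main fields."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProt FASTA header: {line!r}")
    db, accession, entry_name, description = match.groups()
    return UniProtHeader(_SECTIONS[db], accession, entry_name, description)


if __name__ == "__main__":
    example = (">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
               "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")
    print(parse_uniprot_header(example))
```

Filtering a downloaded file on the `section` field is a simple way to keep only manually reviewed (Swiss-Prot) entries.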

10 An even better place to look for good biological databases - Nucleic Acids Res. Database issues
Released once every year, in January
20th issue (2013): 88 new databases, 77 updates on databases previously described in NAR, 11 updates on databases previously described elsewhere
http://

11 While we are visiting NAR: a good place to look for bioinformatics tools
Nucleic Acids Res. Web server issues
Released once every year, in July
11th issue (2013): 95 web servers
http://
If you need an article or a citation for a bioinformatics tool or database, the NAR web server or database issues are often good places to look

12 Huge number of databases!
In bioinformatics, the number of databases, tools, algorithms, and papers is enormous
It is impossible to have an overview, especially if bioinformatics is not your main research area
Instead of trying to do everything yourself: get yourself a bioinformatics expert colleague or collaborator!
http://
NAR online Molecular Biology Database Collection, currently contains 1512 databases

13 Good and bad databases
Some are exceptionally good, well maintained and often updated
EMBL-EBI, NCBI, Ensembl,... http:// http://
Maintained by 10s and 100s of experts...
Species specific: http:// (Schizosaccharomyces pombe), http:// (Drosophila), http:// (Escherichia coli K-12 MG1655)
Unique content: http://
Many others are of poor quality, never updated, or unreliable
The trick is to know what is good and what is bad...
Let your favourite bioinformatician follow the field!

14 Ensembl genome browser and database

15 Genome browsers
Graphical interface for genomic data
Shows information from biological databases mapped onto genomic sequence
Genomic coordinates
Various annotations = tracks
NCBI Gene database
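The idea of annotation tracks mapped onto genomic coordinates can be sketched with plain intervals. This is a toy model, not how any particular browser is implemented; the feature names, coordinates, and the `view` helper are invented for illustration.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    name: str


def overlapping(track: List[Feature], chrom: str,
                start: int, end: int) -> List[Feature]:
    """Features on `chrom` that share at least one base with [start, end]."""
    return [f for f in track
            if f.chrom == chrom and f.start <= end and f.end >= start]


def view(tracks: Dict[str, List[Feature]], chrom: str,
         start: int, end: int) -> Dict[str, List[str]]:
    """What a browser window would show: per track, the visible features."""
    return {name: [f.name for f in overlapping(feats, chrom, start, end)]
            for name, feats in tracks.items()}


if __name__ == "__main__":
    # Hypothetical annotation data: one gene track and one SNP track.
    tracks = {
        "genes": [Feature("chr1", 1_000, 5_000, "geneA"),
                  Feature("chr1", 8_000, 12_000, "geneB")],
        "snps": [Feature("chr1", 4_200, 4_200, "snp1"),
                 Feature("chr2", 4_200, 4_200, "snp2")],
    }
    print(view(tracks, "chr1", 4_000, 9_000))
```

Each track is an independent list of features that share one coordinate system, which is why new annotations can be added to a browser without touching the existing ones.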

16 Ensembl Genome Browser
Joint project between EMBL-EBI and the Wellcome Trust Sanger Institute
Central resource for studying genomes of vertebrates
Mainly chordates, but also a few others (e.g. C. elegans and S. cerevisiae)
Updated several times a year with new genome assemblies and new species
Annotations of genomes (e.g. genes and their splice variants, SNPs) added by the Ensembl pipeline
Automatic gene prediction (with or without experimental evidence) & some curator input

17 Ensembl Genome Browser
http://
Excellent resource for exploring vertebrate species whose genomes have been sequenced

18 Ensembl Genome Browser Oslo Currently >70 species

19 EnsemblGenomes
Bacteria, protists, fungi, plants and other metazoa

20 Ensembl 2013
Read the article yourself!

21 UCSC Genome Browser

22 Genome browsers
Graphical interface for genomic data
Shows information from biological databases mapped onto genomic sequence
Genomic coordinates
Various annotations = tracks
NCBI Gene database

23 UCSC Genome Browser
Developed and maintained at the University of California, Santa Cruz (UCSC)
Interactive website
Access to genome sequence data from:
Human genome - latest assembly (GRCh37), but also earlier versions
Mouse, rat, and approx. 40 other mammals
Chicken, turkey, budgerigar, reptiles, frogs, and fishes
Insects, nematodes, S. cerevisiae and more

24 UCSC Genome Browser
http://
Kuhn et al. Brief. Bioinform. 14, 144 (2012)

25 UCSC Genome Browser
Reference genome, chromosome coordinates
Known genes
Annotation tracks
Predicted genes
Transcripts
Promoter binding sites
SNPs
Repeats
Epigenetic marks
Many, many more...
Links to detailed data

26 UCSC Genome Browser
Cow, chromosome 22, around position 17,380,000
Annotation tracks

27 Access to the databases and tools
UCSC Genome Browser http://
General information
News, updates, announcements

28 UCSC Genome Browser
Examples of searching options
Correct query format
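The slide's own query examples were part of a screenshot and are not in this text. As a sketch of what a correct query format involves, the code below assumes the common `chrom:start-end` position form with optional thousands separators; the helper names and the 10 kb window are illustrative choices, with the chromosome and centre taken from the cow example on slide 26.

```python
import re
from typing import Tuple

# chrom:start-end, where start and end may contain thousands separators.
_POSITION = re.compile(r"^(?P<chrom>[\w.]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_position(query: str) -> Tuple[str, int, int]:
    """Turn 'chr22:17,370,000-17,390,000' into ('chr22', 17370000, 17390000)."""
    match = _POSITION.match(query.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognised position query: {query!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {query!r}")
    return match["chrom"], start, end


def window_around(chrom: str, centre: int, flank: int) -> str:
    """Format a position query covering `flank` bases on each side of `centre`."""
    start = max(1, centre - flank)
    return f"{chrom}:{start:,}-{centre + flank:,}"


if __name__ == "__main__":
    query = window_around("chr22", 17_380_000, 10_000)
    print(query)
    print(parse_position(query))
```

Validating a position string before pasting it into a browser catches the usual mistakes early: a missing chromosome name, swapped coordinates, or stray characters.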

29 Different kinds of data
Direction of transcription shown in the introns
5' UTR, exon, 3' UTR
Wiggle (WIG) track format for dense, continuous data, e.g. conservation, epigenetic marks, and transcriptome data
Single location data, e.g. SNPs
Reset view; Hide; Add your own data; Switch direction
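To make the Wiggle format and the "add your own data" option concrete, here is a minimal sketch that writes dense per-base scores as a fixedStep WIG track. It assumes the standard `track type=wiggle_0` and `fixedStep chrom=... start=... step=...` declaration lines; the function name, track name, and score values are made up for illustration.

```python
from typing import Iterable


def to_fixedstep_wig(chrom: str, start: int, values: Iterable[float],
                     step: int = 1, span: int = 1,
                     name: str = "example") -> str:
    """Serialise evenly spaced scores as a fixedStep Wiggle track.

    `start` is the 1-based coordinate of the first value; each later value
    applies `step` bases further along the chromosome.
    """
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start={start} step={step} span={span}",
    ]
    lines.extend(f"{value:g}" for value in values)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical conservation scores for five consecutive bases.
    scores = [0.1, 0.5, 0.9, 0.7, 0.2]
    print(to_fixedstep_wig("chr22", 17_380_000, scores, name="toy scores"),
          end="")
```

Because fixedStep stores only one number per position after the header, it suits continuous signals; sparse single-location data such as SNPs is better kept in an interval-based format.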

30 ENCODE data in UCSC
http://

31 Much more in MBV-INFx410

32 CLS Wednesday seminars
Bioinformatics/CLS seminars every 14 days

33 cbo- the mailing list for bioinformatics and computational biology in the Oslo region
News about seminars, courses, jobs, and conferences
Relevant mainly for people in the Oslo region
Anyone can send an e-mail to the list
Curators check that the message is relevant (to avoid spam) and release the message
Currently >400 subscribers