A Field Guide to GenBank and NCBI Molecular Biology Resources


1 A Field Guide to GenBank and NCBI Molecular Biology Resources. Slightly modified from Peter Cooper (ftp://ftp.ncbi.nih.gov/pub/cooper/fieldguide/) and Eric Sayers (ftp://ftp.ncbi.nih.gov/pub/sayers/field_guide/u_penn/)

2 NCBI Resources: About NCBI; NCBI Sequence Databases (Primary Database: GenBank; Derivative Databases: RefSeq); Entrez Databases and Text Searching; BLAST Services; Genomic Resources

3 The National Center for Biotechnology Information (NCBI) Lister Hill Center William H. Natcher Building

4 The National Center for Biotechnology Information (NCBI)
  Created as a part of NLM in 1988 to:
    Establish public databases
    Perform research in computational biology
    Develop software tools for sequence analysis
    Disseminate biomedical information
  Tools: BLAST (1990), Entrez (1992), GenBank (1992), Free MEDLINE (PubMed, 1997), Human genome (2001)

5 NCBI Home Page ncbi.nlm.nih.gov To learn more, visit the Site Map and About NCBI web pages

6 Site Map

7 About NCBI

8 Some NCBI Statistics. [Figure: Growth of GenBank; axes: Sequences (millions) and Base Pairs of DNA (millions)]

9 [Figure: Users per day; annotation: Christmas Day]

10 Molecular Databases
  Primary Databases: original submissions by experimentalists; database staff organize but don't add additional information. Example: GenBank
  Derivative Databases:
    Human curated: compilation and correction of data. Examples: SWISS-PROT, NCBI RefSeq mRNA
    Computationally derived. Example: UniGene
    Combinations. Example: NCBI Genome Assembly

11 What is GenBank? NCBI's Primary Sequence Database
  Nucleotide-only sequence database
  GenBank data:
    Direct submissions of individual records (BankIt, Sequin)
    Batch submissions via email (EST, GSS, STS)
    ftp accounts established for sequencing centers
  Data shared among three collaborating databases: GenBank, DNA Database of Japan (DDBJ), European Molecular Biology Laboratory Database (EMBL)

12 The International Nucleotide Sequence Database Collaboration [Figure: three partners exchanging data. NCBI GenBank (NIH): access via Entrez; submissions and updates via Sequin, BankIt, ftp. EMBL (EBI): access via SRS; submissions and updates. DDBJ (CIB, NIG): access via getentry; submissions and updates.]

13 GenBank: NCBI's Primary Sequence Database
  Release 133 December ,318,883 Records
  28,507,990,166 nucleotides
  110,000+ species
  Full release every two months
  Incremental and cumulative updates daily
  Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
  >90 gigabytes of data

14 Entrez Nucleotide (23,464,770 records): GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%

15 Primary vs. Derivative Databases [Figure: sequencing centers and labs submit sequences to GenBank; curators derive RefSeq from GenBank records; algorithms derive UniGene and Genome Assembly]

16 Traditional GenBank Divisions
  Direct submissions (Sequin and BankIt); accurate; well characterized
  BCT  Bacterial and Archaeal
  INV  Invertebrate
  MAM  Mammalian (excluding ROD and PRI)
  PHG  Phage
  PLN  Plant and Fungal
  PRI  Primate
  ROD  Rodent
  SYN  Synthetic (cloning vectors)
  VRL  Viral
  VRT  Other Vertebrate

17 A Traditional GenBank Record. Labeled fields: Locus Field; Molecule Type; Definition Line; Accession Number; Version; GI (GenInfo); Keywords; Modification Date; GenBank Division; Taxonomy
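Several of the fields labeled on this slide (locus name, molecule type, division, modification date) sit on the first line of a flat-file record, the LOCUS line. The sketch below splits such a line on whitespace. It assumes the common layout `LOCUS name length bp type [topology] division date`; the example line, including its name, length, and date, is made up for illustration and is not taken from the slides.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into a dict of its main fields.

    Assumes whitespace-separated tokens in the order:
    LOCUS, name, length, 'bp', molecule type, [topology], division, date.
    """
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognizable LOCUS line: %r" % line)
    return {
        "locus_name": tokens[1],
        "length_bp": int(tokens[2]),
        "molecule_type": tokens[4],
        # Some records carry a topology token (linear/circular) before the
        # division, so division and date are read from the end of the line.
        "division": tokens[-2],
        "modification_date": tokens[-1],
    }


# Hypothetical LOCUS line; every value here is a placeholder.
example = "LOCUS       EXAMPLE1     1234 bp    mRNA    linear   PLN 01-JAN-2000"
record = parse_locus_line(example)
print(record["locus_name"], record["division"], record["modification_date"])
```

Reading the last two tokens from the end, rather than by fixed position, is a design choice that lets the same function handle lines with and without the topology token.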

18 A Traditional GenBank Record

19 Bulk Sequence Divisions of GenBank
  Batch submissions (email and ftp); inaccurate; poorly characterized
  EST  Expressed Sequence Tag
  STS  Sequence Tagged Site
  GSS  Genome Survey Sequence
  HTG  High Throughput Genomic
  HTC  High Throughput cDNA

20 Organization of GenBank (23,087,196 records)
  11 Traditional Divisions: Traditional 8%
  1 Patent Division: PAT 4%
  5 Bulk Divisions: EST 67%; GSS 19%; STS, HTG, HTC 2%
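The division codes on slides 16, 19, and 20 lend themselves to a simple lookup. The sketch below is one way to classify a three-letter code as traditional, bulk, or patent. The function name and return shape are illustrative choices, and the tables contain only the codes that appear on the slides (the slide counts 11 traditional divisions but lists 10).

```python
# Codes and descriptions as listed on the slides; not an exhaustive list.
TRADITIONAL_DIVISIONS = {
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}

BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}


def describe_division(code):
    """Return (group, description) for a GenBank division code.

    group is 'traditional', 'bulk', 'patent', or 'unknown'.
    """
    code = code.strip().upper()
    if code in TRADITIONAL_DIVISIONS:
        return ("traditional", TRADITIONAL_DIVISIONS[code])
    if code in BULK_DIVISIONS:
        return ("bulk", BULK_DIVISIONS[code])
    if code == "PAT":
        return ("patent", "Patent")
    return ("unknown", None)


for division in ("PLN", "est", "PAT"):
    print(division, describe_division(division))
```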

21 EST Division: Expressed Sequence Tags [Figure: workflow with labels: nucleus (30,000 genes); RNA gene products; make cDNA library (,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5' and 3'). Two example EST reads from IMAGE clones are shown as FASTA records with "mRNA sequence" definition lines.]

22 What is UniGene? A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  Nonredundant set of gene-oriented clusters
  Each cluster represents a unique gene
  Provides information on tissue-specific expression and map locations
  Includes well-characterized genes and novel ESTs
  Useful for gene discovery and selection of mapping reagents

23 Organisms Represented in UniGene. Just in: C. elegans, Ciona intestinalis, Gallus gallus

24 EST hits to Homo sapiens muscle creatine kinase mRNA (query sequence)
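To make the idea of grouping sequences into gene-oriented clusters concrete, here is a toy sketch that merges sequences connected by pairwise similarity hits, using a union-find structure. This is not UniGene's actual procedure: the slides say only that clustering is MegaBlast-based. The identifiers and the hit list are invented, and the hits stand in for whatever pairs a similarity search would report.

```python
def cluster_sequences(ids, hits):
    """Group sequence ids into clusters linked by pairwise hits.

    ids:  iterable of sequence identifiers
    hits: iterable of (id_a, id_b) pairs judged similar by some search
    Returns a sorted list of sorted clusters (single-linkage grouping).
    """
    ids = list(ids)
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Invented identifiers: two ESTs hit the same mRNA, two others hit each other.
example_ids = ["est1", "est2", "est3", "est4", "mrna1"]
example_hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
example_clusters = cluster_sequences(example_ids, example_hits)
print(example_clusters)
```

In this toy input, a well-characterized mRNA pulls its matching ESTs into one cluster, while ESTs that match only each other form a separate cluster, which mirrors the slide's point that clusters include both known genes and novel ESTs.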

25 Genome Sequencing [Figure: whole BAC insert (or genome), then shredding, cloning, isolating, and sequencing (GSS division or trace archive), then assembly into a Draft Sequence (HTG division)]

26 Working Draft Sequence gaps

27 HTG Division: High Throughput Genome. Phase 1: Acc = AC, division HTG. Phase 2: Acc = AC, division HTG. Phase 3: Acc = AC, division ROD. Record length: 40,000 to > 350,000 bp

28 HTG Division: High Throughput Genome

29 NCBI's Third Party Annotation (TPA) Database (new). NCBI now accepts submissions of new annotations of existing GenBank sequences, which facilitates the annotation of genomes by experts.

30 A Sample TPA record

31 RefSeq: NCBI's Derivative Sequence Database
  Curated transcripts and proteins (reviewed): human, mouse, rat, fruit fly, zebrafish, Arabidopsis
  Human model transcripts and proteins
  Assembled genomic regions (contigs): draft human genome, mouse genome
  Chromosome records: microbial, viral, organelle

32 The RefSeq Accession Numbers
  mRNAs and Proteins (human, mouse, rat, fruit fly, zebrafish, Arabidopsis)
    NM_  Curated mRNA
    NP_  Curated Protein
    NR_  Curated non-coding RNA
    XM_  Predicted Transcript (human, mouse)
    XP_  Predicted Protein (human, mouse)
    XR_  Predicted non-coding RNA
  Gene Records
    NG_  Reference Genomic Sequence (human)
  Assemblies
    NT_  Contig (mouse and human)
    NW_  Supercontig (mouse)
    NC_  Chromosome (microbial, viral, Arabidopsis)
    NR_  Interim Identifier for Microbial Chromosomes
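Because the record type is encoded in the accession prefix, it can be read off with a small lookup. The sketch below uses only the prefixes and meanings listed on this slide; the function name and the sample accessions are made up for illustration. The slide lists NR_ twice, and the table keeps the first meaning given.

```python
# Prefix meanings as given on the slide.
REFSEQ_PREFIXES = {
    "NM_": "Curated mRNA",
    "NP_": "Curated Protein",
    "NR_": "Curated non-coding RNA",  # also listed once as an interim identifier
    "XM_": "Predicted Transcript",
    "XP_": "Predicted Protein",
    "XR_": "Predicted non-coding RNA",
    "NG_": "Reference Genomic Sequence",
    "NT_": "Contig",
    "NW_": "Supercontig",
    "NC_": "Chromosome",
}


def refseq_record_type(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


# Placeholder accessions; the digits are not real record numbers.
for sample in ("NM_000000", "XP_000000", "U00000"):
    print(sample, "->", refseq_record_type(sample))
```

An accession without a recognized prefix, such as a plain GenBank-style accession, returns `None`, which gives a quick way to separate RefSeq records from primary submissions in a mixed list.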

33 Curated RefSeq Records: NM_, NP_

34 Entrez: Linking and Neighboring

35 The Entrez Databases

36 Entrez: Database Integration [Figure: linked databases and the neighboring method for each: PubMed abstracts (word weight); Taxonomy (phylogeny); Genomes; 3-D Structure (VAST); Nucleotide sequences (BLAST); Protein sequences (BLAST)]

37 The (ever) Expanding Entrez System: Journals, UniGene, PubMed Central, PubMed, Books, SNP, UniSTS, Nucleotide, PopSet, Protein, ProbeSet, Structure, Genome, CDD, Taxonomy, 3D Domains, OMIM

38 Entrez Nucleotides glucose 6 phosphate dehydrogenase

39 Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits

40 Entrez Nucleotides: Limits. Query: glucose 6 phosphate dehydrogenase. Searchable fields: Accession; All Fields; Author Name; EC/RN Number; Feature key; Filter; Gene Name; Issue; Journal Name; Keyword; Modification Date; Organism; Page Number; Primary Accession; Properties; Protein Name; Publication Date; SeqID String; Sequence Length; Substance Name; Text Word

41 Entrez Nucleotides: Preview/Index

42 Adding Terms: Preview/Index. Term added: green plants. Searchable fields: Accession; All Fields; Author Name; EC/RN Number; Feature key; Filter; Gene Name; Issue; Journal Name; Keyword; Modification Date; Organism; Page Number; Primary Accession; Properties; Protein Name; Publication Date; SeqID String; Sequence Length...

43 Plant G6PD mRNAs
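The fielded search built up over the last few slides can be written as a single query string in the `term[Field]` form shown on slide 39. The sketch below only assembles that string and does not contact Entrez. The helper name is hypothetical, and pairing "green plants" with the Organism field is an assumption, since the slide does not show which field was selected.

```python
def build_entrez_query(terms, operator="AND"):
    """Join (text, field) pairs into a fielded Entrez-style query string."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    parts = ["%s[%s]" % (text.strip(), field.strip()) for text, field in terms]
    return (" %s " % operator).join(parts)


# Assumed field choice for the second term; see the note above.
query = build_entrez_query([
    ("glucose 6 phosphate dehydrogenase", "All Fields"),
    ("green plants", "Organism"),
])
print(query)
```

Keeping each term as a (text, field) pair mirrors how Preview/Index adds one fielded term at a time to the running query.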

44 Display: Formats, Links, and Neighbors
  Formats: Summary; Brief; ASN.1; FASTA; XML; GenBank; GI list
  Links and neighbors: LinkOut; Nucleotide Neighbors; Genome Links; ProbeSet Links; OMIM Links; PopSet Links; Protein Links; PubMed Links; SNP Links; Structure Links; Taxonomy Links

45 [Figure: FASTA record for a Medicago sativa glucose-6-phosphate dehydrogenase sequence, with the parts of the FASTA definition line labeled]
  FASTA definition line: >gi gb U MSU18238 Medicago sativa glucose-6-phosphate dehyd
  Labeled parts: gi number; database identifier; accession number; locus name
  Database identifiers: gb GenBank; emb EMBL; dbj DDBJ; sp SWISS-PROT; pdb Protein Databank; pir PIR; prf PRF; ref RefSeq
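The labeled parts of the definition line can be pulled apart programmatically. The sketch below assumes the pipe-delimited layout `>gi|number|db|accession|locus description`; the transcript above has lost the delimiters and the numbers, so the gi number and accession in the example are placeholders rather than the real values for this record.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(defline):
    """Parse a '>gi|number|db|accession|locus description' definition line."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifiers, _, description = defline[1:].strip().partition(" ")
    fields = identifiers.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("unexpected identifier layout: %r" % identifiers)
    return {
        "gi": fields[1],
        "database": DB_TAGS.get(fields[2], fields[2]),
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


# Placeholder gi number and accession; only the locus name comes from the slide.
example_defline = ">gi|0000000|gb|U00000.1|MSU18238 Medicago sativa example description"
parsed = parse_defline(example_defline)
print(parsed)
```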

46 Entrez Genome

47 Organism Pages

48 The Map Viewer: a common platform for integrated display

49 The Map Viewer

50 Entrez PubMed

51 Online Books

52 Entrez Specialized Databases
  Taxonomy: searchable taxonomic tree having nodes for all species with records in an Entrez database
  OMIM: Online Mendelian Inheritance in Man, a database of genetically linked human diseases
  ProbeSet: expression data (GEO) and microarray datasets

53 Entrez Taxonomy

54 Entrez OMIM

55 Entrez ProbeSet

56 Trace Archive

57 Entrez Structure 1CET

58 Structure Summary Cn3D viewer Related Structures Conserved Domains

59 Cn3D: Displaying Structures Chloroquine

60 Structure Neighbors

61 Structural Alignment Chloroquine NADH

62 MMDB: Molecular Modeling Data Base
  Derived from experimentally determined PDB records
  Value added to PDB records, including:
    Addition of explicit chemical graph information
    Validation
    Inclusion of taxonomy, citation, and other information
    Conversion to ASN.1 data description language
  Structure neighbors determined by the Vector Alignment Search Tool (VAST)