A Field Guide to GenBank and NCBI Molecular Biology Resources
- Nicholas Little
1 A Field Guide to GenBank and NCBI Molecular Biology Resources slightly modified from Peter Cooper ftp://ftp.ncbi.nih.gov/pub/cooper/fieldguide/ Eric Sayers ftp://ftp.ncbi.nih.gov/pub/sayers/field_guide/u_penn/
2 NCBI Resources. About NCBI; NCBI Sequence Databases (Primary Database: GenBank; Derivative Databases: RefSeq); Entrez Databases and Text Searching; BLAST Services; Genomic Resources
3 The National Center for Biotechnology Information (NCBI) Lister Hill Center William H. Natcher Building
4 The National Center for Biotechnology Information (NCBI). Created as a part of NLM in 1988. Establish public databases; perform research in computational biology; develop software tools for sequence analysis; disseminate biomedical information. Tools: BLAST (1990), Entrez (1992), GenBank (1992), Free MEDLINE (PubMed, 1997), Human genome (2001)
5 NCBI Home Page ncbi.nlm.nih.gov To learn more, visit the Site Map and About NCBI web pages
6 Site Map
7 About NCBI
8 Some NCBI Statistics. [Figure: growth of GenBank; axes: sequences (millions) and base pairs of DNA (millions)]
9 [Figure: users per day; annotation: Christmas Day]
10 Molecular Databases. Primary databases: original submissions by experimentalists; database staff organize but don't add additional information (example: GenBank). Derivative databases: human-curated compilation and correction of data (example: SWISS-PROT, NCBI RefSeq mRNA); computationally derived (example: UniGene); combinations (example: NCBI Genome Assembly)
11 What is GenBank? NCBI's primary sequence database. Nucleotide-only sequence database. GenBank data: direct submissions of individual records (BankIt, Sequin); batch submissions (EST, GSS, STS); ftp accounts established for sequencing centers. Data shared amongst three collaborating databases: GenBank, DNA Database of Japan (DDBJ), European Molecular Biology Laboratory Database (EMBL)
12 The International Nucleotide Sequence Database Collaboration [Diagram: the three collaborating databases, GenBank (NCBI, NIH), EMBL (EBI), and DDBJ (CIB, NIG), each take submissions and updates and exchange data. Access and submission tools shown: Entrez, Sequin, BankIt, ftp, SRS, getentry]
13 GenBank: NCBI's Primary Sequence Database. Release 133 December ,318,883 Records; 28,507,990,166 Nucleotides; 110,000+ Species. Full release every two months; incremental and cumulative updates daily; available only through internet: ftp://ftp.ncbi.nih.gov/genbank/ (>90 Gigabytes of data)
14 Entrez Nucleotide: 23,464,770 records (GenBank 71%, DDBJ 19%, EMBL 9%, RefSeq 1%)
15 Primary vs. Derivative Databases [Figure: sequences flow from sequencing centers and labs into GenBank; curators and algorithms derive RefSeq, Genome Assembly, and UniGene from them]
16 Traditional GenBank Divisions. Direct submissions (Sequin and BankIt); accurate; well characterized. BCT: Bacterial and Archaeal; INV: Invertebrate; MAM: Mammalian (excluding ROD and PRI); PHG: Phage; PLN: Plant and Fungal; PRI: Primate; ROD: Rodent; SYN: Synthetic (cloning vectors); VRL: Viral; VRT: Other Vertebrate
17 A Traditional GenBank Record. Annotated fields: Locus Field, Molecule Type, Definition Line, Accession Number, Version, GI (GenInfo), Keywords, Modification Date, GenBank Division, Taxonomy
18 A Traditional GenBank Record
19 Bulk Sequence Divisions of GenBank. Batch submissions ( and ftp); inaccurate; poorly characterized. EST: Expressed Sequence Tag; STS: Sequence Tagged Site; GSS: Genome Survey Sequence; HTG: High Throughput Genomic; HTC: High Throughput cDNA
20 Organization of GenBank: 23,087,196 records. 11 Traditional Divisions (Traditional 8%); 1 Patent Division (PAT 4%); 5 Bulk Divisions (EST 67%; GSS 19%; STS, HTG, HTC 2%)
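The grouping of division codes on slides 16, 19, and 20 can be sketched as a small lookup. This is only an illustrative sketch: the function name `division_group` is hypothetical, and the traditional table holds just the ten codes listed on slide 16, although slide 20 counts eleven traditional divisions.

```python
# Division codes as listed on the slides (slide 16 names ten traditional codes).
TRADITIONAL = {
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}
BULK = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}
PATENT = {"PAT": "Patent"}


def division_group(code):
    """Return 'traditional', 'bulk', 'patent', or 'unknown' for a division code."""
    code = code.strip().upper()
    if code in TRADITIONAL:
        return "traditional"
    if code in BULK:
        return "bulk"
    if code in PATENT:
        return "patent"
    return "unknown"


for code in ("PRI", "est", "PAT", "XYZ"):
    print(code, "->", division_group(code))
```

A lookup like this is mainly useful as a reminder of data quality: records in the bulk group are single-pass or unfinished data, while the traditional group holds directly submitted, better characterized records.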
21 EST Division: Expressed Sequence Tags [Figure: EST workflow. RNA gene products from the nucleus (30,000 genes); make cDNA library (,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5' and 3'). Two example records are shown, each headed ">IMAGE: ... mRNA sequence", with single-pass reads that contain ambiguous N bases]
22 What is UniGene? A gene-oriented view of sequence entries. MegaBlast-based automated sequence clustering; nonredundant set of gene-oriented clusters; each cluster represents a unique gene; provides information on tissue-specific expression and map locations; includes well-characterized genes and novel ESTs; useful for gene discovery and selection of mapping reagents
23 Organisms Represented in UniGene. Just in: C. elegans, Ciona intestinalis, Gallus gallus
24 EST hits to Homo sapiens muscle creatine kinase mRNA (query sequence)
25 Genome Sequencing [Diagram: whole BAC insert (or genome); steps labeled shredding, cloning, isolating, sequencing (GSS division or trace archive), assembly; result: Draft Sequence (HTG division)]
26 Working Draft Sequence gaps
27 HTG Division: High Throughput Genome. Phase 1 (Acc = AC): HTG; phase 2 (Acc = AC): HTG; phase 3 (Acc = AC): ROD. 40,000 to >350,000 bp
28 HTG Division: High Throughput Genome
29 NCBI's Third Party Annotation (TPA) Database (new). NCBI now accepts the submission of new annotations of existing GenBank sequences; this facilitates the annotation of genomes by experts
30 A Sample TPA record
31 RefSeq: NCBI's Derivative Sequence Database. Curated transcripts and proteins: reviewed human, mouse, rat, fruit fly, zebrafish, arabidopsis. Human model transcripts and proteins. Assembled genomic regions (contigs): draft human genome, mouse genome. Chromosome records: microbial, viral, organelle
32 The RefSeq Accession Numbers. mRNAs and proteins (human, mouse, rat, fruit fly, zebrafish): NM_ curated mRNA; NP_ curated protein; NR_ curated non-coding RNA; XM_ predicted transcript (human, mouse); XP_ predicted protein (human, mouse); XR_ predicted non-coding RNA. Gene records: NG_ reference genomic sequence (human). Assemblies: NT_ contig (mouse and human); NW_ supercontig (mouse); NC_ chromosome (microbial, viral, Arabidopsis); NR_ interim identifier for microbial chromosomes Arabidopsis
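The prefix scheme on slide 32 lends itself to a simple lookup. The following is a minimal sketch, not an NCBI tool: the function name `refseq_type` and the example accessions are hypothetical, and because the slide lists NR_ twice, the table uses only its first meaning (curated non-coding RNA).

```python
# Prefix meanings copied from slide 32; NR_ uses the first meaning listed there.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "NR_": "curated non-coding RNA",
    "XM_": "predicted transcript",
    "XP_": "predicted protein",
    "XR_": "predicted non-coding RNA",
    "NG_": "reference genomic sequence",
    "NT_": "contig",
    "NW_": "supercontig",
    "NC_": "chromosome",
}


def refseq_type(accession):
    """Return the record type implied by a RefSeq prefix, or None if unrecognized."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


# Placeholder accessions for illustration only; they do not refer to real records.
for acc in ("NM_000000", "XP_000000", "U00000"):
    print(acc, "->", refseq_type(acc))
```

An accession without an underscore-style prefix, such as a plain GenBank accession, returns None, which is a quick way to tell primary records from RefSeq records in a mixed list.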
33 Curated RefSeq Records: NM_, NP_
34 Entrez: Linking and Neighboring
35 The Entrez Databases
36 Entrez: Database Integration [Diagram: linked databases: PubMed abstracts, Taxonomy, Genomes, 3-D Structure, Nucleotide sequences, Protein sequences; neighboring methods: word weight, phylogeny, VAST, BLAST]
37 The (ever) Expanding Entrez System: Journals, UniGene, PubMed Central, PubMed, Books, SNP, UniSTS, Nucleotide, PopSet, Protein, ProbeSet, Structure, Genome, CDD, Taxonomy, 3D Domains, OMIM
38 Entrez Nucleotides. Query: glucose 6 phosphate dehydrogenase
39 Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits
40 Entrez Nucleotides: Limits. Query: glucose 6 phosphate dehydrogenase. Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word
41 Entrez Nucleotides: Preview/Index
42 Adding Terms: Preview/Index. Term added: green plants. Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length...
43 Plant G6PD mRNAs
44 Display: Formats, Links, and Neighbors. Formats: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list. Links: LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links
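The fielded search built up over slides 38 to 43 can be written as a single query string. The sketch below is hedged: the helper names are hypothetical, placing "green plants" in the Organism field is an assumption (the slide does not say which index was chosen), and the E-utilities ESearch URL is not part of the original slides. The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI's public E-utilities ESearch service (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Attach an Entrez field tag, e.g. 'green plants[Organism]'."""
    return f"{text}[{field}]"


def combine(*terms):
    """Join already-tagged terms with the Boolean AND operator."""
    return " AND ".join(terms)


def esearch_url(db, term):
    """Build (but do not send) an ESearch request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


query = combine(
    field_term("glucose 6 phosphate dehydrogenase", "All Fields"),
    field_term("green plants", "Organism"),  # assumption: field not named on the slide
)
url = esearch_url("nucleotide", query)  # database name is also an assumption
print(query)
print(url)
```

Tagging each term with its field is what narrows the 748 all-field hits on slide 39 toward the plant records shown on slide 43; untagged terms are searched across all fields.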
45 [Figure: FASTA-format record for Medicago sativa glucose-6-phosphate dehydrogenase. FASTA definition line: >gi gb U MSU18238 Medicago sativa glucose-6-phosphate dehyd, followed by the nucleotide sequence. Annotated parts of the definition line: gi number; database identifier; accession number; locus name. Database identifiers: gb GenBank; emb EMBL; dbj DDBJ; sp SWISS-PROT; pdb Protein Databank; pir PIR; prf PRF; ref RefSeq]
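The parts of the FASTA definition line annotated on slide 45 can be pulled apart with a few string operations. This is a rough sketch that assumes the pipe-delimited form `>gi|number|db|accession|locus description`; the function name `parse_defline` is hypothetical, and the gi number and accession in the example are placeholders because the transcript lost the real values.

```python
# Database identifier tags as listed on slide 45.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(line):
    """Split a gi-style FASTA definition line into its labeled parts."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the identifier; the rest is free text.
    identifier, _, description = line[1:].strip().partition(" ")
    parts = identifier.split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("expected gi|number|db|accession|locus")
    return {
        "gi": parts[1],
        "database": DB_TAGS.get(parts[2], parts[2]),
        "accession": parts[3],
        "locus": parts[4],
        "description": description,
    }


# Placeholder gi number and accession; only the locus name comes from the slide.
example = ">gi|0000000|gb|U00000|MSU18238 Medicago sativa glucose-6-phosphate dehydrogenase"
record = parse_defline(example)
for key, value in record.items():
    print(f"{key}: {value}")
```

Unknown database tags are passed through unchanged rather than rejected, since the slide's list of identifiers may not be exhaustive.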
46 Entrez Genome
47 Organism Pages
48 The Map Viewer: a common platform for integrated display
49 The Map Viewer
50 Entrez PubMed
51 Online Books
52 Entrez Specialized Databases. Taxonomy: searchable taxonomic tree having nodes for all species with records in an Entrez database. OMIM: Online Mendelian Inheritance in Man, a database of genetically linked human diseases. ProbeSet: expression data (GEO) and microarray datasets
53 Entrez Taxonomy
54 Entrez OMIM
55 Entrez ProbeSet
56 Trace Archive
57 Entrez Structure 1CET
58 Structure Summary Cn3D viewer Related Structures Conserved Domains
59 Cn3D: Displaying Structures Chloroquine
60 Structure Neighbors
61 Structural Alignment Chloroquine NADH
62 MMDB: Molecular Modeling Data Base. Derived from experimentally determined PDB records. Value added to PDB records, including: addition of explicit chemical graph information; validation; inclusion of Taxonomy, Citation, and other information; conversion to ASN.1 data description language. Structure neighbors determined by Vector Alignment Search Tool (VAST)