National Center for Biotechnology Information (NCBI):

National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov By: Dr Hadi Mozafari

As a national resource for molecular biology information, NCBI's mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. More specifically, the NCBI has been charged with creating automated systems for:
1) Storing and analyzing knowledge about molecular biology, biochemistry, and genetics.
2) Facilitating the use of such databases and software by the research and medical community.
3) Coordinating efforts to gather biotechnology information both nationally and internationally.
4) Performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.

BLAST is a program for sequence similarity searching developed at NCBI and is instrumental in identifying genes and genetic features. BLAST can execute sequence searches against the entire DNA database in less than 15 seconds. Additional software tools provided by NCBI include: Open Reading Frame Finder (ORF Finder), Electronic PCR, and the sequence submission tools, Sequin and BankIt. All of NCBI's databases and software tools are available from the WWW or by FTP. NCBI also has email servers that provide an alternative way to access the databases for text searching or sequence similarity searching.

Structure: Three-dimensional structures provide a wealth of information on the biological function and the evolutionary history of macromolecules.

dbGaP: The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.

EST: The EST database is a collection of short single-read transcript sequences from GenBank. These sequences provide a resource to evaluate gene expression, find potential variation, and annotate genes.

MeSH: MeSH (Medical Subject Headings) is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed.

OMIM: Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily.

PMC: PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

Bookshelf: Provides free online access to books and documents in life science and healthcare, which can be searched, read, and browsed.

Entrez has links to MEDLINE: Entrez is much more than just a tool for finding sequences by keywords. It contains links to PubMed/MEDLINE. Entrez also contains all known protein sequences and 3-D protein structures.

Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. Entrez also provides graphical views of sequences and chromosome maps. PubMed comprises more than 25 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.
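Entrez can also be queried programmatically. The following is a minimal sketch, assuming NCBI's public E-utilities ESearch endpoint (which the slides do not mention) and the helper name build_esearch_url, which is invented here for illustration. It only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint; not named in the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL asking Entrez for record IDs that match `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # A literature search in PubMed and an accession lookup in the nucleotide database.
    print(build_esearch_url("pubmed", "farnesene synthase"))
    print(build_esearch_url("nucleotide", "AY182241[ACCN]"))
```

Opening either URL in a browser (or fetching it with any HTTP client) should return the matching record identifiers; the same pattern applies to the other Entrez databases by changing the db value.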

PubMed

Search in PubMed

Filling in the Limits options

Preview/Index

History

Amino acid & nucleotide abbreviations

Search abbreviations

dbSNP: Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants.

Direct links to useful parts of NCBI

GenBank: Annotated collection of all publicly available nucleotide sequences and their protein translations. Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. Grows exponentially, doubling every 10 months. Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited. A new release is issued every two months.
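To make the "doubling every 10 months" figure concrete, here is a small sketch of the implied growth factor. It assumes idealized exponential growth with a constant doubling time; the function name growth_factor is chosen here for illustration.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Factor by which the database grows after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    for months in (10, 30, 60):
        print(f"after {months} months: x{growth_factor(months):g}")
```

Under this assumption the collection would be 8 times larger after 30 months and 64 times larger after five years.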

DNA sequencing according to Sanger

Human Sequence in the High Throughput Sequence Division of GenBank

A Traditional GenBank Record: The Flatfile Format (Header, Feature Table, Sequence)

LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
            Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons;
            core eudicots; rosids; eurosids I; Rosales; Rosaceae;
            Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205,
            Beltsville, MD 20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205,
            Beltsville, MD 20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

Header: Locus Line

LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004

Locus name: AY182241
Length: 1931 bp
Molecule type: mRNA
Division: PLN
Modification date: 04-MAY-2004
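The fields of the LOCUS line can be pulled apart with a few lines of code. This is a sketch that assumes the whitespace-separated, eight-token layout of this particular record; real flat files are column-oriented and some records differ (for example, a strandedness prefix on the molecule type), so the Locus type and parse_locus_line helper are illustrative only.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    modification_date: str


def parse_locus_line(line: str) -> Locus:
    """Split a LOCUS line on whitespace; assumes the 8-token layout of AY182241."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    return Locus(tokens[1], int(tokens[2]), tokens[4], tokens[5], tokens[6], tokens[7])


if __name__ == "__main__":
    print(parse_locus_line("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004"))
```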

Header: Database Identifiers

ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057

Accession (AY182241): stable, reportable, universal.
Version: AY182241.2, GI:32265057.

The Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI

Coding sequence (CDS 54..1784): runs from the start codon (ATG) at position 54 through the stop codon (TAG) at the end of the interval.
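The CDS coordinates can be checked against the sequence itself. The sketch below takes the first 120 bases of the ORIGIN block of this record, confirms that position 54 begins with ATG, and translates the codons that follow with the standard genetic code. The names CODON_TABLE, ORIGIN_EXCERPT and translate_from are introduced here for illustration; only the bases and the CDS start come from the record.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# First two ORIGIN lines of AY182241 (bases 1-120), with the spacing removed.
ORIGIN_EXCERPT = "".join(
    """
    ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
    tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
    """.split()
)


def translate_from(sequence: str, start: int) -> str:
    """Translate complete codons from 1-based position `start`, stopping at a stop codon."""
    dna = sequence.upper()[start - 1:]
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(ORIGIN_EXCERPT[53:56])               # bases 54-56 of the record
    print(translate_from(ORIGIN_EXCERPT, 54))  # start of the protein
```

The translated prefix matches the beginning of the /translation qualifier, which is the relationship the CDS feature encodes.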

Accession number, GI number, Version

Accession number (GenBank) - The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers that are usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Take note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.

GI (GenBank) - A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.
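The two accession layouts described above, and the accession.version form seen in the VERSION line, can be expressed compactly. This is a sketch covering only those two layouts (other accession formats exist and are not handled); the function names are invented for illustration.

```python
import re
from typing import Tuple

# One letter + five digits (M12345) or two letters + six digits (AC123456).
CLASSIC_ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def is_classic_accession(text: str) -> bool:
    """True if `text` matches one of the two layouts described in the text."""
    return bool(CLASSIC_ACCESSION.match(text))


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'AY182241.2' into the stable accession and the version number."""
    accession, _, version = identifier.partition(".")
    if not version.isdigit():
        raise ValueError(f"no numeric version in {identifier!r}")
    return accession, int(version)


if __name__ == "__main__":
    print(is_classic_accession("M12345"), is_classic_accession("AY182241"))
    print(split_version("AY182241.2"))
```

The accession part stays fixed across sequence updates, while the version number after the dot increases with each change to the sequence data.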

GenBank Sections

In addition to DNA sequences of genes, GenBank has a number of other sections, including:
Protein sequences (translated from DNA)
Short RNA fragments (ESTs)
Sequence Tagged Sites (dbSTS)
Whole Genome Shotgun Sequences (WGS)
Third Party Annotation (TPA) database
Single Nucleotide Polymorphisms (SNPs), which represent genetic variations in the human population
Online Mendelian Inheritance in Man (OMIM), a database of human genetic disorders

Contigs: A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA.
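The idea of building a contig from overlapping segments can be sketched with a toy greedy merge. This is an illustration under strong assumptions (error-free reads, all on the same strand, exact suffix/prefix overlaps); it is not how production assemblers work, and the function names are invented here.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGAATTCAG", "TTCAGAGTTCAC", "GTTCACTTGCAA"]
    print(merge_reads(reads))
```

Here three short, overlapping reads collapse into a single 24-base contig. Repeated segments break the assumption that an overlap identifies a unique join, which is one reason real assembly is harder than this sketch suggests.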

Arrangement of primary sequences into a contig, an example. [Figure: individual sequence reads (S19T7, S12SK, S148O20, SC110T7, ...) aligned into the contig PSC148 (7405 bps), shown against a scale of 2000, 4000 and 6000 bp, with the features orf1, pcab, orf2, maca, orf-3, pcah and pcag and the subclones psc1/1, psc1/2, psc1/3, psc1/4, psc1/6, psc1/8 and psc1/10.]

Whole Genome Shotgun Sequences (WGS): Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

Shotgun Sequencing (source: Concepts in Biochemistry, 2nd Ed., R. Boyer): Segments are short (~2 kb). Repeated segments or genes are a problem for assembly.

EST, STS, and GSS

EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation. ESTs are first-pass sequences with an error rate as high as 1 in 100, including incorrectly identified bases and insertions. However, the sheer volume of sequences obtained in this manner makes EST databases a useful resource for identifying new genes and new gene functions, extending an existing sequence, or locating exons in genomic DNA sequences. ESTs now make up about 40% of GenBank.

STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.

GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

misc_feature = A region of biological interest that cannot be described by any other feature key.

CDS = The coding region of a gene, also known as the coding sequence.
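The EST error rate quoted above can be turned into rough numbers. This sketch assumes the worst-case rate of 1 in 100 applies uniformly and that errors at different positions are independent, which is a simplification; the function names are chosen for illustration.

```python
def expected_errors(read_length: int, error_rate: float = 0.01) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate


def probability_error_free(read_length: int, error_rate: float = 0.01) -> float:
    """Chance that every base in the read is correct, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


if __name__ == "__main__":
    for length in (100, 500):
        print(length, expected_errors(length), probability_error_free(length))
```

Under these assumptions a 500-base EST would carry about five errors on average and would be entirely correct less than 1% of the time, which is why ESTs are treated as evidence for genes rather than as finished sequence.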

High-Throughput Genomic Sequence (HTGS): HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank. Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum. High-throughput genomic sequences are the genomic DNA equivalent of ESTs, and can be a potential source of new genes, especially poorly expressed genes that would not be detected in an EST library.

HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. They may have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.

Submission Tools

BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.

Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Standalone application available on NCBI's FTP site.

Third Party Annotation (TPA) database

Contains nucleotide sequences built from existing primary data, with new annotation that has been published in a peer-reviewed scientific journal.

Two types of records:
Experimental: annotation supported by lab evidence.
Inferential: annotation inferred only.

Bridges the gap between GenBank and RefSeq by permitting authors who publish new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

RefSeq: A curated collection of DNA, RNA, and protein sequences built by NCBI. Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes. It may include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts. It is limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).

Comprehensive DB: RefSeq

For a particular gene, many independent, redundant records might exist in GenBank. All this information is integrated so that, for a particular locus in the genome, a complete description is given that is no longer redundant: the LocusLink entry.

Redundant GenBank entries, e.g. entries covering different parts of a gene's transcript (incomplete cDNA sequences, ESTs), are unified into a single RefSeq record that represents the complete transcript.

A RefSeq record can be, for example, a protein sequence (accession starting with NP_) or a genomic sequence (starting with NG_).

All RefSeq sequences that belong to the same locus on the genome receive the same LocusLink identifier.

Additional links are made to other databases containing functional annotation or further information (e.g. to Gene Ontology, KEGG, ...).
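Because the RefSeq accession prefix indicates the molecule type, a record can be classified from its accession alone. In this sketch only NP_ and NG_ come from the slide; the NM_, NR_ and NC_ entries are added from the general RefSeq naming convention, the table is not exhaustive, and the accession numbers in the usage lines are made up.

```python
from typing import Optional

REFSEQ_PREFIXES = {
    "NP_": "protein",               # from the slide
    "NG_": "genomic region",        # from the slide
    "NM_": "mRNA transcript",       # added here, not on the slide
    "NR_": "non-coding RNA",        # added here, not on the slide
    "NC_": "complete genomic molecule, e.g. a chromosome",  # added here
}


def classify_refseq(accession: str) -> Optional[str]:
    """Return the molecule type implied by a RefSeq prefix, or None if unrecognized."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


if __name__ == "__main__":
    # Hypothetical accession numbers, used only to show the prefixes.
    for accession in ("NP_000000.1", "NG_000000.1", "AY182241.2"):
        print(accession, "->", classify_refseq(accession))
```

A GenBank accession such as AY182241 has no underscore prefix, so it falls through to None, which is a quick way to tell primary GenBank records from RefSeq records.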

Comprehensive DB: UniGene

UniGene is an experimental system for automatically partitioning GenBank sequences into a nonredundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location. These clusters represent the same gene based on the alignment of EST sequences with each other and with the genome sequences of the organism. No attempt has been made to produce contigs.

DNA & Protein Abbreviations in GenBank

UCSC Genome Browser http://genome.ucsc.edu/

UCSC Genome Browser: BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC). BLAT is an alignment tool like BLAST, but it is structured differently. On DNA, BLAT works by keeping an index of an entire genome in memory. Thus, the target database of BLAT is not a set of GenBank sequences, but instead an index derived from the assembly of the entire genome.
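The idea of searching against an in-memory index of the genome, rather than scanning sequences, can be illustrated with a toy k-mer index. This is a simplified sketch and not BLAT's actual implementation: it indexes non-overlapping k-mers of a short string, looks up every k-mer of the query, and reports exact seed hits only. The function names and the tiny "genome" are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index: Dict[str, List[int]], query: str, k: int) -> List[Tuple[int, int]]:
    """Return (query_position, genome_position) for every query k-mer found in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits


if __name__ == "__main__":
    genome = "ATGGAATTCAGAGTTCACTTGCAA"
    index = build_index(genome, k=4)          # built once, reused for every query
    print(find_seed_hits(index, "TTCAGAGTTC", k=4))
```

Both hits lie on the same diagonal (genome position minus query position equals 6), which is the kind of evidence an aligner would then extend into a full alignment. The index is built once and each query costs only dictionary lookups.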

UCSC Conditions of users

UCSC Links

HGNC (HUGO Gene Nomenclature Committee): A curated online repository of HGNC-approved gene nomenclature, gene families and associated resources. The HGNC approves a unique and meaningful name for every known human gene, based on consultation with experts.

Results of searching "INS" for the insulin gene

Results of searching the word "insulin"

HGMD: The Human Gene Mutation Database (HGMD) represents an attempt to collate known (published) gene lesions responsible for human inherited disease.

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

Krebs Cycle in KEGG Pathway

KEGG Disease

Disease Results

Pathway of Melanoma

Enzyme database in KEGG

Results for Catalase

GeneCards: A searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. It automatically integrates gene-centric data from ~125 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.

Scopus (www.scopus.com): Scopus is a bibliographic database containing abstracts and citations for academic journal articles. It is one of the largest abstract and citation databases of peer-reviewed literature.

Document Search in Scopus

Author Search

Journal List & Comparison

Journals ordered by ranking

Journals ordered by citations

Journal documents per year

Uncited documents per journal

Isid.research.ac.ir

Search for members of KUMS