National Center for Biotechnology Information (NCBI):


National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov By: Dr Hadi Mozafari

As a national resource for molecular biology information, NCBI's mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. More specifically, the NCBI has been charged with creating automated systems for:
1) Storing and analyzing knowledge about molecular biology, biochemistry, and genetics.
2) Facilitating the use of such databases and software by the research and medical community.
3) Coordinating efforts to gather biotechnology information both nationally and internationally.
4) Performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.

BLAST is a program for sequence similarity searching developed at NCBI and is instrumental in identifying genes and genetic features. BLAST can execute sequence searches against the entire DNA database in less than 15 seconds. Additional software tools provided by NCBI include: Open Reading Frame Finder (ORF Finder), Electronic PCR, and the sequence submission tools, Sequin and BankIt. All of NCBI's databases and software tools are available from the WWW or by FTP. NCBI also has email servers that provide an alternative way to access the databases for text searching or sequence similarity searching.

Structure: Three-dimensional structures provide a wealth of information on the biological function and the evolutionary history of macromolecules.

dbGaP: The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.

EST: The EST database is a collection of short single-read transcript sequences from GenBank. These sequences provide a resource to evaluate gene expression, find potential variation, and annotate genes.

MeSH: MeSH (Medical Subject Headings) is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed.

OMIM: Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily.

PMC: PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

Bookshelf: Provides free online access to books and documents in life science and healthcare, which users can search, read, and explore.

Entrez has links to Medline
Entrez is much more than just a tool for finding sequences by keywords. It contains links to PubMed/Medline. Entrez also contains all known protein sequences and 3-D protein structures.

Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. Entrez also provides graphical views of sequences and chromosome maps. PubMed comprises more than 25 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.
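Entrez searches can also be expressed as URLs. The sketch below only builds such a URL and does not contact NCBI. It assumes the NCBI E-utilities ESearch endpoint, which the slides do not mention, and the function name build_esearch_url and the example search term are illustrative choices.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for an Entrez database and search term.

    db     -- Entrez database name, e.g. "pubmed" or "nucleotide"
    term   -- search term, optionally with a field tag such as [ACCN]
    retmax -- maximum number of record identifiers to request
    """
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes the square brackets and encodes spaces as "+".
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical literature query against PubMed.
print(build_esearch_url("pubmed", "farnesene synthase apple"))
# Accession search in the nucleotide database using the [ACCN] field.
print(build_esearch_url("nucleotide", "AY182241[ACCN]", retmax=1))
```

Keeping URL construction separate from the network call makes the query easy to inspect before anything is sent.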

PubMed

Search in PubMed

Fill limits option

Preview/Index

History

Amino acid & nucleotide abbreviations

Search abbreviations

dbSNP: Database of single nucleotide polymorphisms (SNPs) and multiple small-scale variations that include insertions/deletions, microsatellites, and non-polymorphic variants.

Direct links to useful parts of NCBI

GenBank
Annotated collection of all publicly available nucleotide sequences and their protein translations.
Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
Grows exponentially; NCBI reports that the number of bases has doubled approximately every 18 months.
Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.
A new release is made every two months.

DNA sequencing according to SANGER

Human Sequence in the High Throughput Sequence Division of GenBank

A Traditional GenBank Record: The Flatfile Format (Header, Feature Table, Sequence)

LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

The Header
The header of the record above runs from the LOCUS line through the COMMENT line (LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, ORGANISM, REFERENCE, REMARK, COMMENT).

Header: Locus Line
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
Locus name: AY182241
Length: 1931 bp
Molecule type: mRNA
Division: PLN
Modification date: 04-MAY-2004
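The fields of the LOCUS line can be pulled apart programmatically. The following is a minimal sketch that splits the line on whitespace, which works for the example record. Real flatfiles use a fixed-column layout and some lines carry extra tokens, so whitespace splitting is a simplification; the names LocusLine and parse_locus_line are illustrative.

```python
from typing import NamedTuple


class LocusLine(NamedTuple):
    """Fields of a GenBank LOCUS line, named after the slide labels."""
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    modification_date: str


def parse_locus_line(line: str) -> LocusLine:
    """Parse a LOCUS line with exactly eight whitespace-separated tokens.

    Simplifying assumption: tokens are
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>.
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    return LocusLine(
        name=fields[1],
        length_bp=int(fields[2]),
        molecule_type=fields[4],
        topology=fields[5],
        division=fields[6],
        modification_date=fields[7],
    )


locus = parse_locus_line("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
print(locus.name, locus.length_bp, locus.division, locus.modification_date)
```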

Header: Database Identifiers
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
Accession (AY182241): stable, reportable, universal
Version: AY182241.2 GI:32265057

Header: Organism
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
The SOURCE and ORGANISM fields use the NCBI-controlled taxonomy.

The Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI

Coding sequence (CDS): start codon (ATG) at position 54, stop codon (TAG) ending at position 1784.
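The relation between the CDS location and the /translation qualifier can be checked by hand. The sketch below takes the first two ORIGIN lines (bases 1 to 120) of record AY182241 shown earlier, starts at the CDS start position 54, and translates codon by codon with the standard genetic code. The helper name translate_from is illustrative, and only the opening residues are produced because only 120 bases are used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Bases 1-120 of AY182241, copied from the ORIGIN section (spaces removed).
ORIGIN_1_TO_120 = (
    "ttcttgtatcccaaacatctcgagcttcttgtacaccaaattaggtattcactatggaat"
    "tcagagttcacttgcaagctgataatgagcagaaaatttttcaaaaccagatgaaacccg"
)


def translate_from(sequence: str, start: int) -> str:
    """Translate complete codons beginning at 1-based position `start`.

    Stops at the first stop codon or when fewer than three bases remain.
    """
    cds = sequence[start - 1:].upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(ORIGIN_1_TO_120[53:56])               # start codon at positions 54-56
print(translate_from(ORIGIN_1_TO_120, 54))  # opening residues of the protein
```

The printed residues can be compared with the beginning of the /translation qualifier in the feature table.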

Accession number, GI number, Version

Accession number (GenBank): The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers, usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a sequence identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez system is searchable by accession number using the Accession [ACCN] search field.

GI (GenBank): A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.
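The two accession formats named above, and the accession.version form used on the VERSION line, are easy to check with a regular expression. This is a sketch covering only those two formats; GenBank also uses other accession formats that this pattern deliberately ignores, and the function names are illustrative.

```python
import re

# Only the two formats described in the text:
# one letter + five digits, or two letters + six digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def is_accession(text: str) -> bool:
    """Return True if text matches one of the two accession formats."""
    return bool(ACCESSION_RE.match(text))


def split_version(accession_version: str) -> tuple[str, int]:
    """Split 'AY182241.2' into the accession and its integer version."""
    accession, _, version = accession_version.partition(".")
    if not is_accession(accession) or not version.isdigit():
        raise ValueError(f"not an accession.version: {accession_version!r}")
    return accession, int(version)


for candidate in ["M12345", "AC123456", "AY182241", "AY182241.2", "32265057"]:
    print(candidate, is_accession(candidate))

print(split_version("AY182241.2"))
```

The version suffix and the GI number are sequence identifiers, so they fail the accession check by design.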

GenBank Sections
In addition to DNA sequences of genes, GenBank has a number of other sections, including:
Protein sequences (translated from DNA)
Short RNA fragments (ESTs)
Sequence Tagged Sites (dbSTS)
Whole Genome Shotgun Sequences (WGS)
Third Party Annotation (TPA) database
Single Nucleotide Polymorphisms (SNPs), which represent genetic variations in the human population
Online Mendelian Inheritance in Man (OMIM), a database of human genetic disorders

Contigs
A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA.
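The idea of overlapping segments forming a contig can be sketched with intervals. The example assumes each read has already been placed at known start and end coordinates, whereas a real assembler has to find the overlaps by comparing the sequences themselves. The read names and coordinates are hypothetical.

```python
def group_into_contigs(reads):
    """Group placed reads into contigs of mutually overlapping segments.

    reads -- iterable of (name, start, end) with inclusive coordinates.
    Returns a list of dicts with the contig span and the reads it contains.
    """
    contigs = []
    for name, start, end in sorted(reads, key=lambda read: read[1]):
        if contigs and start <= contigs[-1]["end"]:
            # The read overlaps the current contig: extend it.
            contigs[-1]["end"] = max(contigs[-1]["end"], end)
            contigs[-1]["reads"].append(name)
        else:
            # No overlap: this read starts a new contig.
            contigs.append({"start": start, "end": end, "reads": [name]})
    return contigs


# Hypothetical read placements (bp).
placed_reads = [
    ("r1", 1, 700),
    ("r3", 1100, 1900),
    ("r2", 500, 1200),
    ("r4", 2500, 3100),
]

for contig in group_into_contigs(placed_reads):
    print(contig["start"], contig["end"], contig["reads"])
```

Reads r1, r2 and r3 overlap in a chain and form one contig; r4 does not overlap any of them, so it stays separate until further reads close the gap.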

Arrangement of primary sequences into a contig: an example
Figure: overlapping primary sequence reads (e.g. S19T7, S12SK, S19SK, S11T7) arranged along the 7405 bp sequence PSC148, with the annotated features orf1, pcab, orf2, maca, orf-3, pcah and pcag and a scale marked at 2000, 4000 and 6000 bp.

Whole Genome Shotgun Sequences (WGS) Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

Shotgun Sequencing (figure from Concepts in Biochemistry, 2nd Ed., R. Boyer)
Segments are short (~2 kb).
Repeated segments or genes are a problem for assembly.

EST, STS, and GSS
EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation. ESTs are first-pass sequences with an error rate as high as 1 in 100, including incorrectly identified bases and insertions. However, the sheer volume of sequences obtained in this manner makes EST databases a useful resource in which to identify new genes and new gene functions, to extend an existing sequence, or to locate exons in genomic DNA sequences. ESTs now make up about 40% of GenBank.
STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.
GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.
misc_feature = A region of biological interest that cannot be described by any other feature key
CDS = The coding region of a gene, also known as the coding sequence

High-Throughput Genomic Sequence (HTGS) HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank. Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum. High throughput genome sequences are the genomic DNA equivalent of ESTs, and can be a potential source of new genes, especially poorly expressed genes which would not be detected in an EST library

HTC
HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. They may have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.

Submission Tools
BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Standalone application available on NCBI's FTP site.

Third Party Annotation (TPA) database
Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.
Two types of records:
Experimental: Annotation supported by lab evidence
Inferential: Annotation inferred only
Bridges the gap between GenBank and RefSeq by permitting authors who publish new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

RefSeq A curated collection of DNA, RNA, and protein sequences built by NCBI. Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes. May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts. Limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).

Comprehensive DB: RefSeq
For a particular gene, many independent, redundant records might exist in GenBank.
All this information is integrated so that, for a particular locus in the genome, a single non-redundant description is given: the LocusLink entry.
Redundant GenBank entries representing partial views of a gene's transcript (e.g. incomplete cDNA sequences, ESTs) are unified into a single RefSeq record that represents the complete transcript.
A RefSeq sequence can be an mRNA (accession starting with NM_), a protein (starting with NP_), or a genomic sequence (starting with NG_).
All RefSeq sequences that belong to the same locus on the genome receive the same LocusLink identifier.
Additional links are made to other databases containing functional annotation or further information (e.g. Gene Ontology, KEGG).

Comprehensive DB: UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters.
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location.
Sequences are clustered as the same gene based on the alignment of EST sequences with each other and with the genome sequence of the organism.
No attempt has been made to produce contigs.

DNA & Protein Abbreviations in GenBank

UCSC Genome Browser http://genome.ucsc.edu/

UCSC Genome Browser: BLAT
BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm developed by Jim Kent at the University of California, Santa Cruz (UCSC). BLAT is an alignment tool like BLAST, but it is structured differently. On DNA, BLAT works by keeping an index of an entire genome in memory. Thus, the target database of BLAT is not a set of GenBank sequences but an index derived from the assembly of the entire genome.
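Searching against an in-memory genome index can be illustrated with k-mers. This is a toy sketch of the general indexing idea, not BLAT's actual implementation. It assumes the genome is indexed by non-overlapping k-mers and the query is scanned with overlapping k-mers to find seed hits; the sequences, the value of k, and the function names are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index: dict, query: str, k: int) -> list:
    """Return (query_pos, genome_pos) pairs where a query k-mer is indexed."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits


# Hypothetical toy sequences (0-based positions).
genome = "ACGTACGTTAGCCGATAGGCTTAACG"
query = "GTTAGCCG"
K = 4

index = build_kmer_index(genome, K)   # built once, reused for every query
hits = find_seed_hits(index, query, K)
print(hits)

# Each hit suggests a candidate alignment offset (genome_pos - query_pos).
for qpos, gpos in hits:
    offset = gpos - qpos
    print(offset, genome[offset:offset + len(query)] == query)
```

The index is built once and kept in memory, so each query only costs a series of dictionary lookups followed by checking the candidate regions.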

UCSC Conditions of users

UCSC Links

HGNC
A curated online repository of HGNC-approved gene nomenclature, gene families and associated resources. The HGNC approves a unique and meaningful name for every known human gene after consulting experts.

Results of ins for Insulin gene

Results for Insulin word

HGMD: The Human Gene Mutation Database (HGMD) represents an attempt to collate known (published) gene lesions responsible for human inherited disease.

KEGG: A collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is used for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

Krebs Cycle in KEGG Pathway

KEGG Disease

Disease Results

Pathway of Melanoma

Enzyme database in KEGG

Results for Catalase

GeneCards: A searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. It automatically integrates gene-centric data from ~125 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.

www.scopus.com
Scopus is a bibliographic database containing abstracts and citations for academic journal articles. It is the largest abstract and citation database of peer-reviewed literature.

Document Search in Scopus

Author Search

Journal List & Comparison

Order of Journal Ranking

Order of Journal Citation

Journal documents per year

Not cited documents of journals

Isid.research.ac.ir

Search for members of KUMS