Computational Biology and Bioinformatics

Similar documents
EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Introduction to Bioinformatics

Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Textbook. Web sites. Literature references

Protein Bioinformatics Part I: Access to information

Chapter 2: Access to Information

NCBI web resources I: databases and Entrez

Types of Databases - By Scope

ELE4120 Bioinformatics. Tutorial 5

Introduction to Bioinformatics. What are the goals of the course? Who is taking this course? Different user needs, different approaches

The University of California, Santa Cruz (UCSC) Genome Browser

Gene-centered resources at NCBI

Introduction to BIOINFORMATICS

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Data Retrieval from GenBank

Bioinformatics for Proteomics. Ann Loraine

A Field Guide to GenBank and NCBI Molecular Biology Resources

Gene-centered databases and Genome Browsers

Gene-centered databases and Genome Browsers

Bioinformatics for Cell Biologists

Sequence Databases and database scanning

Two Mark question and Answers

Introduc)on to Databases and Resources Biological Databases and Resources

Sequence Based Function Annotation

BIOINF525: INTRODUCTION TO BIOINFORMATICS LAB SESSION 1

Introduction to Bioinformatics for Medical Research. Gideon Greenspan TA: Oleg Rokhlenko. Lecture 1

This software/database/presentation is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part

Since 2002 a merger and collaboration of three databases: Swiss-Prot & TrEMBL

Bioinformatics Prof. M. Michael Gromiha Department of Biotechnology Indian Institute of Technology, Madras. Lecture - 5a Protein sequence databases

SAMPLE LITERATURE Please refer to included weblink for correct version.

NUCLEIC ACIDS. DNA (Deoxyribonucleic Acid) and RNA (Ribonucleic Acid): information storage molecules made up of nucleotides.

DNAFSMiner: A Web-Based Software Toolbox to Recognize Two Types of Functional Sites in DNA Sequences

FUNCTIONAL BIOINFORMATICS

Genome Resources. Genome Resources. Maj Gen (R) Suhaib Ahmed, HI (M)

Compiled by Mr. Nitin Swamy Asst. Prof. Department of Biotechnology

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

I nternet Resources for Bioinformatics Data and Tools

Entrez Gene: gene-centered information at NCBI

Biological databases an introduction

BIOINFORMATICS FOR DUMMIES MB&C2017 WORKSHOP

ab initio and Evidence-Based Gene Finding

Introduction to Bioinformatics

Computational gene finding

user s guide Question 3

Biotechnology Explorer

Hands-On Four Investigating Inherited Diseases

Biology From gene to protein

Important gene-information's

BLASTing through the kingdom of life

What is Bioinformatics?

Week 1 BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Array-Ready Oligo Set for the Rat Genome Version 3.0

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Introduction to Bioinformatics

Applied Bioinformatics

What You NEED to Know

Bioinformatics Databases

user s guide Question 1

Digital information cycle. Database. Database. BINF 630: Bioinformatics Methods

Worksheet for Bioinformatics

Molecular Cloning. Genomic DNA Library: Contains DNA fragments that represent an entire genome. cdna Library:

COMPUTER RESOURCES II:

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

B I O I N F O R M A T I C S

Sequencing the Human Genome

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome

Web-based tools for Bioinformatics; A (free) introduction to (freely available) NCBI, MUSC and World-wide.

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

user s guide Question 3

O C. 5 th C. 3 rd C. the national health museum

Investigating Inherited Diseases

Sequencing the Human Genome

Ensembl workshop. Thomas Randall, PhD bioinformatics.unc.edu. handouts, papers, datasets

FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

Computational Molecular Biology Intro. Alexander (Sacha) Gultyaev

Databases/Resources on the web

Bioinformatics overview

Evolutionary Genetics. LV Lecture with exercises 6KP. Databases

Redundancy at GenBank => RefSeq. RefSeq vs GenBank. Databases, cont. Genome sequencing using a shotgun approach. Sequenced eukaryotic genomes

BIO4342 Lab Exercise: Detecting and Interpreting Genetic Homology

Lesson Overview. Studying the Human Genome. Lesson Overview Studying the Human Genome

BLASTing through the kingdom of life

Introduction and Public Sequence Databases. BME 110/BIOL 181 CompBio Tools

Computational gene finding

BIO 152 Principles of Biology III: Molecules & Cells Acquiring information from NCBI (PubMed/Bookshelf/OMIM)

Biological databases an introduction

NCBI Molecular Biology Resources. Entrez & BLAST. Entrez: Database Integration. Database Searching with Entrez. WWW Access. Using Entrez.

Guided tour to Ensembl

Hot Topics. What s New with BLAST?

Genome and DNA Sequence Databases. BME 110: CompBio Tools Todd Lowe April 5, 2007

Why learn sequence database searching? Searching Molecular Databases with BLAST

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

G4120: Introduction to Computational Biology

Introduction to 'Omics and Bioinformatics

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Chimp BAC analysis: Adapted by Wilson Leung and Sarah C.R. Elgin from Chimp BAC analysis: TWINSCAN and UCSC Browser by Dr. Michael R.

The Ensembl Database. Dott.ssa Inga Prokopenko. Corso di Genomica

Videos. Lesson Overview. Fermentation

Sequence Databases. Chapter 2. caister.com/bioinformaticsbooks. Paul Rangel. Sequence Databases

Transcription:

Computational Biology and Bioinformatics Computational biology Development of algorithms to solve problems in biology Bioinformatics Application of computational biology to the analysis and management of biological data Applied bioinformatics Intelligent use of tools to navigate the sequence space Better experiments can be designed by a careful Bioinformatics analysis before the bench work Bioinformatics 7,850,000 hits

DNA sequencing DNA sequences are the most abundant type of sequences (5 * 10 10 ) Generated by the chain termination method (Sanger sequencing) Based on the action of a DNA polymerase that adds nucleotides to complementary strand Fluorescently labeled ddntp (Dideoxynucleotides) stop synthesis acting as chain terminators. They are included in amounts so as to terminate every time the base appears in the template Requires template DNA, and primer and one ddntp for each base: A,C,G, and T Products are separated by electrophoresis G GG Primer A A A C C C T T T

Summary of chain termination sequencing In the past, four different reactions, one for each ddntp, were separated on a gel that could resolve one-base differences. The sequence was then read from the bottom of the gel to the top.

Sequence reading of fluorescently labeled reactions Fluorescently labeled reactions scanned by laser as particular point is passed Color picked up by detector Output sent directly to computer In its simplest form a sequence can be represented as a string of nucleotides with a basic tag or identifier after a greater than character > : FASTA format Definition line (commonly called def line ) >U54469.1 CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC GTGGCTACATCATCATTGTGTTCACCGATTATTTTTTGCACAATTGCTTAATATTAATTGTACTTGCACG CTATTGTCTACGTCATAGCTATCGCTCATCTCTGTCTGTCTCTATCAAGCTATCTCTCTTTCGCGGTCAC TCGTTCTCTTTTTTCTCTCCTTTCGCATTTGCATACGCATACCACACGTTTTCAGTGTTCTCGCTCTCTC TCTCTTGTCAAGACATCGCGCGCGTGTGTGTGGGTGTGTCTCTAGCACATATACATAAATAGGAGAGCGG More information can be be added to the FASTA definition line >gb U54469.1 DMU54469 Drosophila melanogaster eukaryotic initiation factor 4E (eif4e) gene. GenBank Accession.version Locus Descritpion

Editing Errors

Capillary electrophoresis Newer automated sequencers use very thin capillary tubes Run all four fluorescently tagged reactions in same capillary Can have 96 capillaries running at the same time robotic arm and syringe 96 glass capillaries 96 well plate load bar

New sequencing technologies 454 Life Sciences: FLX titanium: 1,000,000 reads 400 bp= 400,000,000 high quality bp in 10h run http://www.454.com/products-solutions/how-it-works/index.asp

Present and Future of sequencing Sequencing costs Dropping each year Opens possibility of sequencing genomes of individuals Greatly facilitates comparative genomics.

DNA RNA protein phenotype genomic DNA databases cdna ESTs UniGene protein sequence databases

There are three major public DNA databases EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National Center for Biotechnology Information DDBJ Housed in Japan

Taxonomy at NCBI: >200,000 species are represented in GenBank http://www.ncbi.nlm.nih.gov/taxonomy/txstat.cgi

The most sequenced organisms in GenBank Homo sapiens 13.1 billion bases Mus musculus 8.4b Rattus norvegicus 6.1b Bos taurus 5.2b Zea mays 4.6b Sus scrofa 3.6b Danio rerio 3.0b Oryza sativa (japonica) 1.5b Strongylocentrotus purpurata 1.4b Nicotiana tabacum 1.1b GenBank release 168.0

Sequence databases What is a database? An indexed set of records Records retrieved using a query language Examples of sequence databases Primary databases (archival) GenBank EMBL (European Molecular Biology Laboratory) DDBJ (DNA Data Bank of Japan) Secondary databases (curated) RefSeq EMBL Genome Reviews Protein databases TPA (Third party annotations) The client server model has made access to sequence databases fast and easy Biologist query Web Browser Web Server BLAST Search Engine Database

Data flow of submissions between primary databases (Chp. 1) Integrated information retrieval system Is an interface not a database http://www.ensembl.org/index.html Entrez NIH National Institute of Health (USA) Submissions Updates (Sequin) NCBI National Center for Biotechnology Information GenBank EMBL European Molecular Biology Laboratory International Nucleotide Sequence Database Collaboration Updated every 24 hs EMBL http://www.ensembl.org/i ndex.html EBI Ensambl European Bioinformatics Institute DDBJ Center for Information Biology CIB DNA Data Bank of Japan Submissions Updates Getentry http://getentry.ddbj.nig.a c.jp/getstart-e.html Submissions Updates (Sakura) NIG National Institute of Genetics (JAPAN)

Nucleotide sequence flatfiles Header: Locus (10 characters, 1 st letter, arbitrary name, not useful), Length, Molecule type (DNA or RNA), Division code (includes functional divisions as ExpressedSeqTags, SeqTagsSites, WholeGenomeSeq, etc.), Last release date EMBL). Definition: Summary of biological information Accession number: Primary key to reference a record in the database. Used in publications (1+5 or 2+6). Version: The version is increased every time the sequence is updated. For each version there is a dif. GI geninfo identifier, which is specific of GenBank, not very useful Keywords: abandoned because of absence of controlled vocabulary. Source / Organism : taxonomic information top-down. Reference: submission credit and published paper. Feature Table: direct representation of the biological information in the record: Source: org, chromosome, map Gene may include regulatory sequences mrna from start of translation to polyadenilation signal CDS (coding sequences): from start to stop codon, provides exon coordinates. Every CDS is assigned a protein_id.version (3+5) Sequence: actual sequence ending in //

Secondary databases Refseq Comprehensive, integrated, and non-redundant set of sequences Genomic DNA RNA Protein explicitly linked 2_6 Format (the underscore is never present in GenBank accessions): Experimental Predicted NT_123456 (genomic contig) NM_123456 (mrnas) NP_123456 (Proteins) [XM_123456 (model mrnas)] [XP_123456 (model protein)] Undergo continuous curation: most up to date sequence Each RefSeq is a synthesis of information, not a piece of a primary research: equivalent to a review article Message for Removed Secondary

Secondary databases Third Party Annotation (TPA) Includes Reannotations, Combinations of novel and existing primary entries Annotations of trace archives Whole genome Shotgun data Provides GenBank accession. Version numbers and nucleotide locations for all primary entries to which the TPA sequence relates EMBL Genome reviews Includes Add information from UniProt knowledgebase, Gene Ontology Annotation, InterPro, and others Curated versions of entries representing complete genomes Standardize annotations

Protein databases http://www.ebi.ac.uk/blast2/index.html?uniprot Merge 50% = GenPept: translations of all CDS. Not curated Merge 90% = Uniprot (Swiss-Prot/TrEMBL/PIR-PSD) UniParc: most comprehensive, public nonredundant protein database Swiss-Prot (manual)/trembl(computer)/pir-psd GenBank, Patents, Int. Pr. Index (IPI) Protein Data Bank UniProt Knowledgebase: curated subset of UniParc Function Postranslational modifications Domains Catalytic sites Structures Associated diseases Pathways Etc. UniRef: UniProt nonredundant reference database: 95%, 90% and 50% sets. Functional groups Pfam Prosite IternPro Merge 95% = All predicted coding regions

www.ncbi.nlm.nih.gov

PubMed is National Library of Medicine's search service 19 million citations in MEDLINE links to participating online journals PubMed tutorial

Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

BLAST is Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 100,000 searches per day

OMIM is Online Mendelian Inheritance in Man catalog of human genes and genetic disorders created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI

Bookshelf is searchable resource of on-line books

Taxonomy Browser is browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms practically useful to find a protein or gene from a species

Structure site includes Molecular Modeling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST)

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775 NT_030059 Rs7079946 GenBank genomic DNA sequence Genomic contig dbsnp (single nucleotide polymorphism) DNA N91759.1 An expressed sequence tag (1 of 170) NM_006744 RefSeq DNA sequence (from a transcript) RNA NP_007635 AAC02945 Q28369 1KT7 RefSeq protein GenBank protein SwissProt protein Protein Data Bank structure record protein Page 27

NCBI s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats: Complete genome Complete chromosome Genomic contig mrna (DNA format) Protein NC_###### NC_###### NT_###### NM_###### e.g. NM_006744 NP_###### e.g. NP_006735 Page 27

NCBI s RefSeq project: accession for genomic, mrna, protein sequences Accession Molecule Method Note AC_123456 Genomic Mixed Alternate complete genomic AP_123456 Protein Mixed Protein products; alternate NC_123456 Genomic Mixed Complete genomic molecules NG_123456 Genomic Mixed Incomplete genomic regions NM_123456 mrna Mixed Transcript products; mrna NM_123456789 mrna Mixed Transcript products; 9-digit NP_123456 Protein Mixed Protein products; NP_123456789 Protein Curation Protein products; 9-digit NR_123456 RNA Mixed Non-coding transcripts NT_123456 Genomic Automated Genomic assemblies NW_123456 Genomic Automated Genomic assemblies NZ_ABCD12345678 Genomic Automated Whole genome shotgun data XM_123456 mrna Automated Transcript products XP_123456 Protein Automated Protein products XR_123456 RNA Automated Transcript products YP_123456 Protein Auto. & Curated Protein products ZP_12345678 Protein Automated Protein products

Access to sequences: Entrez Gene at NCBI Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mrna) or protein (NP_000509) Page 29

From the NCBI home page, type beta globin and hit Search

Follow the link to Gene

Entrez Gene is in the header Note the Official Symbol HBB for beta globin Note the limits option

By applying limits, there are now far fewer entries

Entrez Gene (top of page) Note that links to many other HBB database entries are available

Entrez Gene (middle of page): genomic region, bibliography

Entrez Gene (middle of page, continued): phenotypes, function

Entrez Gene (bottom of page): RefSeq accession numbers

Entrez Protein: accession, organism, literature

Entrez Protein: features of a protein, and its sequence in the one-letter amino acid code

FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code