Introduction to Bioinformatics CPSC 265. What is bioinformatics? Textbooks

Similar documents
Introduction to Bioinformatics

Why learn sequence database searching? Searching Molecular Databases with BLAST

Types of Databases - By Scope

Protein Bioinformatics Part I: Access to information

ELE4120 Bioinformatics. Tutorial 5

Bioinformatics for Proteomics. Ann Loraine

Basic Bioinformatics: Homology, Sequence Alignment,

Introduction to BIOINFORMATICS

NCBI web resources I: databases and Entrez

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Worksheet for Bioinformatics

Gene-centered resources at NCBI

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Sequence Databases and database scanning

Evolutionary Genetics. LV Lecture with exercises 6KP

SAMPLE LITERATURE Please refer to included weblink for correct version.

Biotechnology Explorer


Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

user s guide Question 3

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Chimp Sequence Annotation: Region 2_3

Identification of Single Nucleotide Polymorphisms and associated Disease Genes using NCBI resources

user s guide Question 3

Hands-On Four Investigating Inherited Diseases

O C. 5 th C. 3 rd C. the national health museum

Entrez Gene: gene-centered information at NCBI

FACULTY OF BIOCHEMISTRY AND MOLECULAR MEDICINE

Bioinformatics, in general, deals with the following important biological data:

Genome and DNA Sequence Databases. BME 110: CompBio Tools Todd Lowe April 5, 2007

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

CHAPTER 21 LECTURE SLIDES

Engineering Genetic Circuits

GenBank. Dennis A. Benson*, Mark S. Boguski, David J. Lipman, James Ostell and B. F. Francis Ouellette

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Sequence Databases. Chapter 2. caister.com/bioinformaticsbooks. Paul Rangel. Sequence Databases

BIOINFORMATICS Introduction

APPENDIX. Appendix. Table of Contents. Ethics Background. Creating Discussion Ground Rules. Amino Acid Abbreviations and Chemistry Resources

Guided tour to Ensembl

BLAST. Basic Local Alignment Search Tool. Optimized for finding local alignments between two sequences.

Product Applications for the Sequence Analysis Collection

Integration of data management and analysis for genome research

Sequence Analysis Lab Protocol

AP BIOLOGY. Investigation #3 Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST. Slide 1 / 32. Slide 2 / 32.

I nternet Resources for Bioinformatics Data and Tools

Visit ABLE on the Web at:

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Agenda. Annotation of Drosophila. Muller element nomenclature. Annotation: Adding labels to a sequence. GEP Drosophila annotation projects 01/03/2018

Genomic and bioinformatics resources

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

B I O I N F O R M A T I C S

Sequencing the Human Genome

Application for Automating Database Storage of EST to Blast Results. Vikas Sharma Shrividya Shivkumar Nathan Helmick

Overview of Health Informatics. ITI BMI-Dept

Training materials.

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

Interpretation of sequence results

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Introduction to Molecular Biology Databases

The use of bioinformatic analysis in support of HGT from plants to microorganisms. Meeting with applicants Parma, 26 November 2015

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Genomics and Gene Recognition Genes and Blue Genes

Functional analysis using EBI Metagenomics

IPA : Maximizing the Biological Interpretation of Gene, Transcript & Protein Expression Data with IPA

FINDING GENES AND EXPLORING THE GENE PAGE AND RUNNING A BLAST (Exercise 1)

0. Courses Other Information 1. The Molecules of Life 2. The Origin of Life 3. The Cell an Introduction

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Teaching Bioinformatics in the High School Classroom. Models for Disease. Why teach bioinformatics in high school?

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

Bioinformatics to chemistry to therapy: Some case studies deriving information from the literature

Chapter 5. explain how information is submitted to and processed by biological databases.

Introduction to Bioinformatics

NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

Genetics and Genomics in Clinical Research

LAB. WALRUSES AND WHALES AND SEALS, OH MY!

Sequencing the Human Genome

From AP investigative Laboratory Manual 1

Exercise1 ArrayExpress Archive - High-throughput sequencing example

Molecular Databases and Tools

Chapter 15 Gene Technologies and Human Applications

BIOINFORMATICS AND FUNCTIONAL GENOMICS

Introduction to Bioinformatics

Fundamentals of Bioinformatics: computation, biology, computational biology

Concepts of Bioinformatics

Host : Dr. Nobuyuki Nukina Tutor : Dr. Fumitaka Oyama

What we ll do today. Types of stem cells. Do engineered ips and ES cells have. What genes are special in stem cells?

Recommendations from the BCB Graduate Curriculum Committee 1

Do engineered ips and ES cells have similar molecular signatures?

Exploring the Genetic Basis for Behavior. Instructor s Notes

Open Access. Abstract

Source: Zanotti (1745). c July :14 AM

Examination Assignments

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

Web based Bioinformatics Applications in Proteomics. Genbank

Genome Sequence Assembly

Access to Information from Molecular Biology and Genome Research

NiceProt View of Swiss-Prot: P18907

Transcription:

Introduction to Bioinformatics CPSC 265 Thanks to Jonathan Pevsner, Ph.D. Textbooks Johnathan Pevsner, who I stole most of these slides from (thanks!) has written a textbook, Bioinformatics and Functional Genomics (Wiley, 2003). The chapters contain content, lab exercises, and quizzes that were developed in his course over the past six years. All Pevsner s powerpoints are available at: http://pevsnerlab.kennedykrieger.org Several other bioinformatics texts are available: Baxevanis and Ouellette David Mount Durbin et al. What is bioinformatics? Interface of biology and computers Analysis of proteins, genes and genomes using computer algorithms and computer databases Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects. 1

Top ten challenges for bioinformatics [1] Precise models of where and when transcription will occur in a genome (initiation and termination) [2] Precise, predictive models of alternative RNA splicing [3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli [4] Determining protein:dna, protein:rna, protein:protein recognition codes [5] Accurate ab initio protein structure prediction Top ten challenges for bioinformatics [6] Rational design of small molecule inhibitors of proteins [7] Mechanistic understanding of protein evolution [8] Mechanistic understanding of speciation [9] Development of effective gene ontologies: systematic ways to describe gene and protein function [10] Education: development of bioinformatics curricula Source: Ewan Birney, Chris Burge, Jim Fickett Growth of GenBank Sequences (millions) Base pairs of DNA (billions) Updated 8-12-04: >40b base pairs 1982 1986 1990 1994 1998 2002 Fig. 2.1 Year Page 17 2

There are three major public DNA databases EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National Center for Biotechnology Information DDBJ Housed in Japan Page 16 >100,000 species are represented in GenBank all species 128,941 viruses 6,137 bacteria 31,262 archaea 2,100 eukaryota 87,147 Table 2-1 Page 17 The most sequenced organisms in GenBank Homo sapiens 11.2 billion bases Mus musculus 7.5b Rattus norvegicus 5.7b Danio rerio 2.1b Bos taurus 1.9b Zea mays 1.4b Oryza sativa (japonica) 1.2b Xenopus tropicalis 0.9b Canis familiaris 0.8b Drosophila melanogaster 0.7b Updated 8-29-05 GenBank release 149.0 Table 2-2 Page 18 3

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24 www.ncbi.nlm.nih.gov Fig. 2.5 Page 25 Fig. 2.5 Page 25 4

PubMed is National Library of Medicine's search service 12 million citations in MEDLINE links to participating online journals PubMed tutorial (via Education on side bar) Page 24 Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24 BLAST is Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 80,000 searches per day Page 25 5

OMIM is Online Mendelian Inheritance in Man catalog of human genes and genetic disorders edited by Dr. Victor McKusick, others at JHU Page 25 Books is searchable resource of on-line books Page 26 TaxBrowser is browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms Page 26 6

Structure site includes Molecular Modelling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST) Page 26 Accessing information on molecular sequences Page 26 Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26 7

From the NCBI home page, type lectin and hit Go revised Fig. 2.7 Page 29 revised Fig. 2.7 Page 29 8

Fig. 2.9 Page 32 FASTA format Fig. 2.10 Page 32 9

PubMed at NCBI to find literature information PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966. Page 35 10

BLAST BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include identifying homologs (orthologs and paralogs) discovering new genes or proteins discovering variants of genes or proteins investigating expressed sequence tags (ESTs) exploring protein structure and function page 88 Four components to a BLAST search (1) Choose the sequence (query) (2) Select the BLAST program (3) Choose the database to search (4) Choose optional parameters Then click BLAST page 88 11

Fig. 4.1 page 89 Fig. 4.2 page 90 Step 1: Choose your sequence Sequence can be input in FASTA format or as accession number page 89 12

Example of the FASTA format for a BLAST query Fig. 2.10 page 32 Step 2: Choose the BLAST program Step 2: Choose the BLAST program blastn (nucleotide BLAST) blastp (protein BLAST) tblastn (translated BLAST) blastx (translated BLAST) tblastx (translated BLAST) page 90 13

Choose the BLAST program Program Input Database 1 blastn DNA DNA 1 blastp protein protein 6 blastx DNA protein 6 tblastn protein DNA 36 tblastx DNA DNA Fig. 4.3 page 91 DNA potentially encodes six proteins 5 CAT CAA 5 ATC AAC 5 TCA ACT 5 CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3 3 GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5 5 GTG GGT 5 TGG GTA 5 GGG TAG page 92 Step 3: choose the database nr = non-redundant (most general database) dbest = database of expressed sequence tags dbsts = database of sequence tag sites gss = genomic survey sequences htgs = high throughput genomic sequence page 92-93 14

Step 4a: Select optional search parameters CD search page 93 Step 4a: Select optional search parameters Entrez! Filter Expect Word size Scoring matrix organism Fig. 4.5 page 94 BLAST: optional parameters You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size change the output format page 93 15

filtering Fig. 4.6 page 95 Fig. 4.7 page 95 Fig. 4.8 page 96 16

Step 4b: optional formatting parameters Alignment view Descriptions Alignments page 97 (page 90) program query database taxonomy Fig. 4.9 page 98 17

We will discuss the Conserved Domain Database (CDD) in chapter 10 (multiple sequence alignment) We will discuss the Conserved Domain Database (CDD) in chapter 10 (multiple sequence alignment) 18

Protein 3D structure NCBI Structure Unlike mostly everything else, NCBI is not the best http://pdbbeta.rcsb.org/pdb/welcome.do (latest version of SDSC PDB site) 19

So now you can Find any sequence in the database Find relevant publications Match DNA to protein sequence Find database matches to DNA or protein Find conserved domains in protein Find the 3D structure of a protein Without doing any experiments! 20