EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science


EECS 730 Introduction to Bioinformatics Sequence Alignment Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/

Database
What is a database?
  An organized set of data
  Can web pages, books, journal articles, tables, text files, and spreadsheet files be considered databases?
Molecular Biology Databases
  To disseminate biological data and information
  To provide biological data in computer-readable form
  To allow analysis of biological data
2012/9/11 EECS 730 2

Biological Information
Nucleic acids: DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
Genomics: genome, gene structure and expression, genetic map, genetic disorder
RNA: sequence, secondary structure, 3D structure, interactions
Proteins: protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
Proteomics: expression profile, proteins in disease processes, etc.
Ligands and drugs (inhibitors, activators, substrates, metabolites)

Biological Information
Function:
  Binding sites, interactions, molecular action (binding, chemical reaction, etc.)
  Biological effect (signaling, transport, feedback, regulation, modification, etc.)
  Functional relationship, protein families, motifs, and homologs
Pathways: molecular networks, biological chain events, regulation, feedback, kinetic data

Overview of molecular biology databases
Sequence
  DNA
    GenBank (www.ncbi.nlm.nih.gov)
    EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)
    DDBJ (DNA Data Bank of Japan)
  Protein
    SwissProt (www.ebi.ac.uk)
    NCBI
Protein classification databases
  Prosite (www.expasy.org)
  Pfam (www.sanger.ac.uk/pfam)
  InterPro (www.ebi.ac.uk/interpro)
  Gene ontology (www.geneontology.org)

Overview of molecular biology databases
Structure
  PDB (Protein Data Bank, www.rcsb.org/pdb/cgi/queryform.cgi): X-ray crystallography, NMR, modeling
  KLOTHO (small molecules, http://www.biocheminfo.org/klotho/)
Genome
  Mouse genome database (www.informatics.jax.org)
  Yeast genome (www.yeastgenome.org/)
  Bacterial genomes (www.tigr.org)
  Human genome browsers
    NCBI    www.ncbi.nlm.nih.gov
    UCSC    genome.ucsc.edu
    EBI     www.ensembl.org
    Celera  www.celera.com

Overview of molecular biology databases
Genetic disorders: OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
Taxonomy (www.ncbi.nlm.nih.gov)
Literature: PubMed (www.ncbi.nlm.nih.gov/entrez)

Data about Databases

Molecular biology databases
  Nucleic acid sequence
  Genome data
  Protein sequence
  Protein classification
  Protein structure

Nucleic Acids databases
What information is in these databases:
  DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
  Genomics: genome, gene structure and expression, genetic map, genetic disorder
  RNA sequence, secondary structure, 3D structure, interactions

Nucleic Acids databases
DNA databases: GenBank, EMBL, DDBJ
1. General-purpose databases focusing on DNA sequences and their properties.
2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently between the three centers.

Three major public DNA databases: EMBL, GenBank, DDBJ

International Nucleotide Sequence Database Collaboration

EMBL nucleotide sequence database
EMBL (http://www.ebi.ac.uk/embl/)
  Contains nucleotide sequences collected from all public sources.
  Accessible through the Sequence Retrieval System (SRS), which allows keyword searching.
  Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later)

EMBL Entry header
ID entryname dataclass; molecule; division; sequence length (BP).

EMBL Entry feature table
http://www.ebi.ac.uk/embl/documentation/ft_definitions/feature_table.html
Coding sequence

EMBL Entry sequence

EMBL format
http://www.ebi.ac.uk/embl/documentation/user_manual/usrman.html
ID: IDentification
AC: ACcession numbers. The primary means of identifying sequences, providing a stable way of identifying entries from release to release.
DE: DEscription
KW: KeyWord information, which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important.
OS: Organism Species
OC: Organism Classification. The taxonomic classification of the source organism.
OG: OrGanelle. Indicates the sub-cellular location of non-nuclear sequences.
SQ: SeQuence header. Marks the beginning of the sequence data and gives a summary of its content. Each sequence data line has a line code consisting of two blanks.
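A minimal sketch of how the two-letter line codes can be used: the Python below groups the lines of an EMBL-style entry by line code and joins the sequence lines. The sample entry is made up for illustration (it is not a real record), and the parser only handles the simple layout described above, not the full format in the user manual.

```python
from collections import defaultdict

# Hypothetical, heavily shortened entry used only for illustration.
SAMPLE_ENTRY = """\
ID   EXAMPLE1   standard; DNA; HUM; 42 BP.
AC   X12345;
DE   Hypothetical example gene
KW   example; illustration.
OS   Homo sapiens
SQ   Sequence 42 BP;
     cgtatcttac gagctactac gaggtcttat cggacgagcg ac        42
//
"""


def parse_embl(text):
    """Return (fields, sequence) for one EMBL-style entry.

    fields maps each two-letter line code to the list of its line contents;
    sequence is the concatenated sequence data.
    """
    fields = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        if line.startswith("//"):          # entry terminator
            break
        code, content = line[:2], line[2:].strip()
        if code == "  ":
            # Sequence data line: keep letters, drop blanks and the base count.
            sequence_parts.append("".join(ch for ch in content if ch.isalpha()))
        elif code.strip():
            fields[code].append(content)
    return dict(fields), "".join(sequence_parts)


fields, sequence = parse_embl(SAMPLE_ENTRY)
print(fields["AC"], fields["OS"], len(sequence))
```

Keying on the first two characters is enough here because every line type, including the sequence data lines, is identified by its line code.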

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
  DNA
    X02775      GenBank genomic DNA sequence
    NT_030059   Genomic contig
    rs7079946   dbSNP (single nucleotide polymorphism)
  RNA
    N91759.1    An expressed sequence tag (1 of 170)
    NM_006744   RefSeq DNA sequence (from a transcript)
  Protein
    NP_006735   RefSeq protein
    AAC02945    GenBank protein
    Q28369      SwissProt protein
    1KT7        Protein Data Bank structure record

GenBank database
GenBank (http://www.ncbi.nih.gov/genbank/)
  Contains publicly available DNA sequences from more than 100,000 organisms.
  Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
  Accessible through Entrez, NCBI's integrated retrieval system.
  Sequence similarity search tools: BLAST (studied later)

Number of base pairs in GenBank, 1982 - present
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
[Figure: two plots of GenBank base pairs against year, 1980 to 2005. Linear plot: base pairs (billions), 0 to 48. Semilogarithmic plot: base pairs, 1.E+06 to 1.E+11, annotated with growth rates of 2-fold / 18 mo and 10-fold / 5 yr.]
These graphs provide one example of the rapidly accumulating data in biology, leading to entire new fields of study.
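The two growth rates quoted on the semilogarithmic plot (2-fold / 18 mo and 10-fold / 5 yr) describe nearly the same exponential trend. A short check in Python, assuming simple exponential growth with a constant doubling time:

```python
import math

doubling_time_months = 18      # "2-fold / 18 mo" from the plot
five_years_in_months = 5 * 12  # "10-fold / 5 yr" from the plot

# Growth factor after 5 years if the data doubles every 18 months.
fold_in_5_years = 2 ** (five_years_in_months / doubling_time_months)

# Time needed for a 10-fold increase at the same doubling time.
tenfold_time_months = doubling_time_months * math.log2(10)

print(f"{fold_in_5_years:.1f}-fold in 5 years")
print(f"10-fold every {tenfold_time_months:.1f} months")
```

Doubling every 18 months gives roughly a 10-fold increase in 5 years, so the two annotations are two ways of stating one rate.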

>100,000 species are represented in GenBank
  all species  128,941
  viruses        6,137
  bacteria      31,262
  archaea        2,100
  eukaryota     87,147

The most sequenced organisms in GenBank
  Homo sapiens             10.7 billion bases
  Mus musculus              6.5b
  Rattus norvegicus         5.6b
  Danio rerio               1.7b
  Zea mays                  1.4b
  Oryza sativa              0.8b
  Drosophila melanogaster   0.7b
  Gallus gallus             0.5b
  Arabidopsis thaliana      0.5b
Updated 8-12-04, GenBank release 142.0

A GenBank entry - HEADER
http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html

GenBank entry - FEATURES

GenBank entry - SEQUENCE

Common sequence formats
EMBL release format
GenBank release format
FASTA format:
  >X12345 Y098TR gene
  CGTATCTTACGAGCTACTACGA
  GGTCTTATCGGACGAGCGACT...
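A minimal sketch of reading FASTA text in Python: a line starting with ">" is a header, and all following lines up to the next header are joined into one sequence. The example reuses the record shown on the slide with the trailing "..." left out; the function name parse_fasta is chosen here for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


example = """>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
"""

records = parse_fasta(example)
for header, seq in records.items():
    print(header, len(seq))
```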

FASTA format (Fig. 2.10, page 32)

cDNA
cDNA: DNA that is synthesized to be complementary to an mRNA molecule.
A cDNA represents a portion of the DNA that specifies a protein (the coding sequence of a gene).
If the sequence of the cDNA is known, the sequence of the corresponding coding DNA is known.
Non-translated introns are not found in the cDNA. (They are removed from the transcript after the DNA is transcribed.)
[Diagram: DNA -> RNA -> protein; complementary DNA (cDNA) is made from the RNA]
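To make "complementary" concrete, here is a small Python sketch that derives the first-strand cDNA from an mRNA by reverse complementing it, and recovers the DNA coding-strand sequence by replacing U with T. The mRNA string is a made-up toy example, and the function names are illustrative.

```python
# Base pairing used when copying RNA into DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the cDNA strand complementary to an mRNA, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


def coding_strand_dna(mrna):
    """Return the DNA sequence that matches the mRNA (U replaced by T)."""
    return mrna.upper().replace("U", "T")


mrna = "AUGGCCUAA"  # hypothetical toy transcript
print(first_strand_cdna(mrna))
print(coding_strand_dna(mrna))
```

Because the mRNA has already had its introns removed, both derived DNA sequences contain only the spliced sequence, which is why a cDNA lacks introns.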

EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences of expressed genes.
They are sequences of cDNA that have been reverse-transcribed from mRNA.
Short sequences (~500-1000 bases); each is the result of a single sequencing experiment, so the error frequency is high.
They represent a snapshot of what is expressed in a given tissue and developmental stage.

dbEST (Expressed Sequence Tags database)
http://www.ncbi.nlm.nih.gov/dbest/
dbEST is a division of GenBank that contains sequence data and other information on cDNA sequences, or ESTs, from a number of organisms.

EST (Expressed Sequence Tag)
Applications:
  Discovery of new genes
  Mapping of various genomes
  Identification of coding regions in genomic sequences
EST libraries are used to answer questions like: Which genes are expressed in a specific cell or tissue?

One gene can have multiple EST sequences!

UniGene: Unique Genes
http://www.ncbi.nlm.nih.gov/unigene
UniGene partitions GenBank sequences into a non-redundant set of gene-oriented clusters.
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
A majority of sequences are ESTs.

Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1.
This is a gene with 10 ESTs associated; the cluster size is 10.

Cluster sizes in UniGene (human)
  Cluster size     Number of clusters
  1                 8,100
  2                38,200
  3-4              23,300
  5-8              12,000
  9-16              5,600
  17-32             3,700
  500-1000          1,050
  2000-4000           100
  8000-16,000          12
  16,000-30,000         2
UniGene build 172, 8/04
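A table like this can be derived from EST-to-cluster assignments by counting twice: first the number of ESTs per cluster, then the number of clusters per size. The Python sketch below uses made-up identifiers and does not reproduce how UniGene itself builds clusters.

```python
from collections import Counter

# Hypothetical EST-to-cluster assignments; all identifiers are made up.
est_to_cluster = {
    "EST_001": "cluster_A",
    "EST_002": "cluster_B",
    "EST_003": "cluster_B",
    "EST_004": "cluster_C",
    "EST_005": "cluster_C",
    "EST_006": "cluster_C",
    "EST_007": "cluster_D",
}


def cluster_size_distribution(assignments):
    """Map cluster size -> number of clusters of that size."""
    sizes = Counter(assignments.values())   # cluster -> number of ESTs
    return dict(Counter(sizes.values()))    # size -> number of clusters


distribution = cluster_size_distribution(est_to_cluster)
for size in sorted(distribution):
    print(size, distribution[size])
```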

UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).
We will discuss UniGene further in the section on gene expression.

Using a database
How to get information out of a database:
  Browsing: no targeted information to retrieve
  Search: looking for particular information
Searching a database: you must have a key that identifies the element(s) of the database that are of interest.
  Accession number
  Name of gene
  Sequence of gene
  Keyword (any word that occurs somewhere in the database records)
  Other information
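The difference between searching by a unique key and searching by keyword can be sketched in Python as follows. The three records form a made-up miniature "database": the accession numbers appear elsewhere in these slides, but the descriptions are illustrative only.

```python
# Hypothetical miniature database; descriptions are illustrative only.
RECORDS = [
    {"accession": "X02775", "description": "retinol-binding protein genomic DNA"},
    {"accession": "NM_006744", "description": "retinol-binding protein transcript"},
    {"accession": "NP_006735", "description": "retinol-binding protein"},
]

# An index on the unique key gives direct lookup of a single record.
BY_ACCESSION = {record["accession"]: record for record in RECORDS}


def search_by_accession(accession):
    """Return the one record with this accession number, or None."""
    return BY_ACCESSION.get(accession)


def search_by_keyword(keyword):
    """Return accessions of all records whose description contains keyword."""
    kw = keyword.lower()
    return [r["accession"] for r in RECORDS if kw in r["description"].lower()]


print(search_by_accession("NM_006744"))
print(search_by_keyword("retinol"))
```

An accession number identifies at most one record, while a keyword can match many, which is why keyword searches usually need further narrowing.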

NCBI and Entrez
One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
NCBI provides interesting summaries, browsers for genome data, and search tools.
Entrez is its database search interface: http://www.ncbi.nlm.nih.gov/entrez
Can search on gene names, sequences, chromosomal location, diseases, keywords, ...

National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov

Entrez integrates:
  the scientific literature
  DNA and protein sequence databases
  3D protein structure data
  population study data sets
  assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Searching for HIV-1 pol: Following the genome link yields a manageable three results

Example of how to access sequence data: HIV-1 pol
For the Entrez query "hiv-1 pol" there are about 40,000 nucleotide or protein records (and >100,000 records for a search for "hiv-1"), but these can be reduced in two easy steps:
  -- specify the organism, e.g. hiv-1[organism]
  -- limit the output to RefSeq!
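The two narrowing steps can also be written as a single query string. The Python sketch below only builds the query and a search URL; it does not contact NCBI. Two things here are assumptions not taken from the slides: the use of the filter term refseq[filter] to restrict results to RefSeq, and the use of the NCBI E-utilities ESearch endpoint as a programmatic alternative to the Entrez web page.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(organism, keyword=None, refseq_only=False):
    """Combine an organism restriction, a keyword, and a RefSeq limit."""
    terms = [f"{organism}[organism]"]
    if keyword:
        terms.append(keyword)
    if refseq_only:
        terms.append("refseq[filter]")  # assumption: Entrez filter for RefSeq
    return " AND ".join(terms)


def esearch_url(term, db="nucleotide"):
    """Return a URL that would run the query against one Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


query = build_query("hiv-1", keyword="pol", refseq_only=True)
print(query)
print(esearch_url(query))
```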

Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq

NCBI's important RefSeq project: best representative sequences
http://www.ncbi.nlm.nih.gov/refseq/
The RefSeq collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
It provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
  Complete genome      NC_######
  Complete chromosome  NC_######
  Genomic contig       NT_######
  mRNA (DNA format)    NM_######   e.g. NM_006744
  Protein              NP_######   e.g. NP_006735
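Because the prefix encodes the molecule type, a RefSeq accession can be classified with a simple pattern match. The Python sketch below covers only the four prefixes listed above; the optional ".version" suffix in the pattern is an assumption added for convenience, and other RefSeq prefixes are not handled.

```python
import re

# Only the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Prefix, underscore, digits, and an optional ".version" suffix (assumption).
PATTERN = re.compile(r"^(N[CTMP])_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a listed RefSeq prefix, else None."""
    match = PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES[match.group(1)]


for accession in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(accession, classify_refseq(accession))
```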

Strategy for assessment of alternative multiple sequence alignment algorithms
1. Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define true homologs using structural criteria.
   BaliBase: a reference alignment resource with over 1,000 sequences in 142 alignments.
   http://www-igbmc.u-strasbg.fr/bioinfo/balibase/index.html
2. Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
3. Compare the answers.
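Step 3 needs a way to compare a test alignment with a reference alignment. The slide does not name a metric; one common choice, assumed here, is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. A minimal Python sketch with made-up toy sequences:

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    alignment is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified by (row index, position in ungapped sequence).
    """
    pairs = set()
    positions = [0] * len(alignment)
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, positions[row]))
                positions[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)


# Hypothetical toy alignments of the same two sequences (ACGT and ACAGT).
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
print(sum_of_pairs_score(test, reference))
```

A score of 1.0 means the test alignment reproduces every aligned pair in the reference; the misplaced gap in the toy test alignment lowers the score.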

Acknowledgment
Many of the images and slides in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2003 by John Wiley & Sons, Inc.