EECS 730 Introduction to Bioinformatics
Sequence Alignment
Luke Huan
Electrical Engineering and Computer Science
http://people.eecs.ku.edu/~jhuan/

Database
What is a database?
  An organized set of data
  Can web pages, books, journal articles, tables, text files, and spreadsheet files be considered databases?
Molecular Biology Databases
  To disseminate biological data and information
  To provide biological data in computer-readable form
  To allow analysis of biological data
2012/9/11 EECS 730 2

Biological Information
  Nucleic acids: DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
  Genomics: genome, gene structure and expression, genetic map, genetic disorder
  RNA sequence, secondary structure, 3D structure, interactions
  Proteins: protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
  Proteomics: expression profile, proteins in disease processes, etc.
  Ligands and drugs (inhibitors, activators, substrates, metabolites)

Biological Information
  Function:
    Binding sites, interactions, molecular action (binding, chemical reaction, etc.)
    Biological effect (signaling, transport, feedback, regulation, modification, etc.)
    Functional relationship, protein families, motifs, and homologs
  Pathways:
    Molecular networks, biological chain events, regulation, feedback, kinetic data

Overview of molecular biology databases
  Sequence
    DNA
      GenBank (www.ncbi.nlm.nih.gov)
      EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)
      DDBJ (DNA Data Bank of Japan)
    Protein
      Swissprot (www.ebi.ac.uk)
      NCBI
  Protein classification databases
    Prosite (www.expasy.org)
    Pfam (www.sanger.ac.uk/pfam)
    InterPro (www.ebi.ac.uk/interpro)
    Gene ontology (www.geneontology.org)

Overview of molecular biology databases
  Structure
    PDB (Protein Data Bank, www.rcsb.org/pdb/cgi/queryform.cgi)
      X-ray crystallography, NMR, modeling
    KLOTHO (small molecules, http://www.biocheminfo.org/klotho/)
  Genome
    Mouse genome database (www.informatics.jax.org)
    Yeast genome (www.yeastgenome.org/)
    Bacterial genomes (www.tigr.org)
    Human genome browsers
      NCBI www.ncbi.nlm.nih.gov
      UCSC genome.ucsc.edu
      EBI www.ensembl.org
      Celera www.celera.com

Overview of molecular biology databases
  Genetic disorders
    OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
  Taxonomy (www.ncbi.nlm.nih.gov)
  Literature
    PubMed (www.ncbi.nlm.nih.gov/entrez)

Data about Databases

Molecular biology databases
  Nucleic acid sequence
  Genome data
  Protein sequence
  Protein classification
  Protein structure

Nucleic Acids databases
What information is in these databases:
  DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
  Genomics: genome, gene structure and expression, genetic map, genetic disorder
  RNA sequence, secondary structure, 3D structure, interactions

Nucleic Acids databases
DNA databases: GenBank, EMBL, DDBJ
  1. General-purpose databases focusing on DNA sequences and their properties.
  2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently among the three centers.

Three major public DNA databases: EMBL, GenBank, DDBJ

International Nucleotide Sequence Database Collaboration

EMBL nucleotide sequence database
EMBL (http://www.ebi.ac.uk/embl/)
  Contains nucleotide sequences collected from all public sources.
  Accessible through the Sequence Retrieval System (SRS), which allows keyword searching.
  Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later)

EMBL Entry header
ID entryname dataclass; molecule; division; sequence length (BP).

EMBL Entry feature table
http://www.ebi.ac.uk/embl/documentation/ft_definitions/feature_table.html
Coding sequence

EMBL Entry sequence

EMBL format
http://www.ebi.ac.uk/embl/documentation/user_manual/usrman.html
  ID: IDentification
  AC: ACcession numbers. The primary means of identifying sequences, providing a stable way of identifying entries from release to release.
  DE: DEscription
  KW: KeyWord information, which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important.
  OS: Organism Species
  OC: Organism Classification, the taxonomic classification of the source organism.
  OG: OrGanelle, indicates the sub-cellular location of non-nuclear sequences.
  SQ: SeQuence header, marks the beginning of the sequence data and gives a summary of its content.
  The sequence data line has a line code consisting of two blanks.
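
The two-letter line codes listed above make an EMBL entry easy to split into fields. The following is a minimal sketch, not the EBI's own parser: the function name `parse_embl_entry` and the sample entry are hypothetical, and the sketch assumes that every line starts with a two-character code, that `XX` lines are spacers, and that `//` ends the entry.

```python
from collections import defaultdict

# Hypothetical entry written for illustration; it is not a real EMBL record.
SAMPLE_ENTRY = """\
ID   X12345 standard; DNA; PRO; 20 BP.
XX
AC   X12345;
XX
DE   Illustrative Y098TR gene fragment
XX
KW   example.
XX
OS   Escherichia coli
OC   Bacteria; Proteobacteria.
XX
SQ   Sequence 20 BP; 5 A; 6 C; 3 G; 6 T; 0 other;
     cgtatcttac gagctactac                                                    20
//
"""


def parse_embl_entry(text):
    """Group the lines of one entry by line code and collect the sequence."""
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        code = line[:2]
        content = line[2:].strip()
        if code == "//":              # assumed end-of-entry marker
            break
        if code == "  ":              # sequence data: line code of two blanks
            # Drop the spacing and the trailing position counter.
            residues.append("".join(ch for ch in content if ch.isalpha()))
        elif code.strip() and code != "XX":   # skip blank and spacer lines
            fields[code].append(content)
    return dict(fields), "".join(residues)


fields, sequence = parse_embl_entry(SAMPLE_ENTRY)
print(fields["AC"], fields["OS"])
print(sequence)
```

Keeping each field as a list of lines reflects the fact that a line code such as OC can repeat within one entry.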

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
  DNA
    X02775      GenBank genomic DNA sequence
    NT_030059   Genomic contig
    rs7079946   dbSNP (single nucleotide polymorphism)
  RNA
    N91759.1    An expressed sequence tag (1 of 170)
    NM_006744   RefSeq DNA sequence (from a transcript)
  Protein
    NP_007635   RefSeq protein
    AAC02945    GenBank protein
    Q28369      SwissProt protein
    1KT7        Protein Data Bank structure record

GenBank database
GenBank (http://www.ncbi.nih.gov/genbank/)
  Contains publicly available DNA sequences from more than 100,000 organisms.
  Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
  Accessible through Entrez, NCBI's integrated retrieval system
  Sequence similarity search tools: BLAST (studied later)

Number of base pairs in GenBank, 1982 - present
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
[Figure: two plots of base pairs in GenBank versus year, 1980 to 2005. Left: linear scale, base pairs in billions (0 to 48). Right: semilogarithmic plot (1.E+06 to 1.E+11) with reference growth rates of 2-fold / 18 mo and 10-fold / 5 yr.]
These graphs provide one example of the rapidly accumulating data in biology, leading to entire new fields of study.
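
The two growth rates quoted on the semilogarithmic plot, 2-fold / 18 mo and 10-fold / 5 yr, describe nearly the same exponential. A short check, assuming steady exponential growth (the helper name `fold_increase` is introduced here for illustration):

```python
import math


def fold_increase(fold, period_months, elapsed_months):
    """Growth factor after elapsed_months if the data grows `fold`-fold every period_months."""
    return fold ** (elapsed_months / period_months)


# Doubling every 18 months, compounded over 5 years (60 months).
five_year_fold = fold_increase(2, 18, 60)

# Doubling time implied by exactly 10-fold growth per 5 years.
doubling_months = 60 * math.log(2) / math.log(10)

print(round(five_year_fold, 2))    # roughly 10
print(round(doubling_months, 2))   # roughly 18
```

Doubling every 18 months compounds to about a 10-fold increase over 5 years, which is why both labels can sit on the same trend line.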

>100,000 species are represented in GenBank all species 128,941 viruses 6,137 bacteria 31,262 archaea 2,100 eukaryota 87,147 2012/9/11 EECS 730 23

The most sequenced organisms in GenBank
  Homo sapiens             10.7 billion bases
  Mus musculus             6.5b
  Rattus norvegicus        5.6b
  Danio rerio              1.7b
  Zea mays                 1.4b
  Oryza sativa             0.8b
  Drosophila melanogaster  0.7b
  Gallus gallus            0.5b
  Arabidopsis thaliana     0.5b
Updated 8-12-04, GenBank release 142.0

A GenBank entry - HEADER
http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html

GenBank entry - FEATURES

GenBank entry - SEQUENCE

Common sequence formats
  EMBL release format
  GenBank release format
  FASTA format:
    >X12345 Y098TR gene
    CGTATCTTACGAGCTACTACGA
    GGTCTTATCGGACGAGCGACT...
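
FASTA is the simplest of the three formats: a header line starting with `>` followed by one or more lines of sequence. A minimal reader might look like the sketch below. The function name `parse_fasta` is introduced for illustration, and the sample uses only the residues visible on the slide, whose sequence is truncated there.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
"""

records = parse_fasta(sample)
for header, sequence in records:
    identifier = header.split(None, 1)[0]   # first word of the header line
    print(identifier, len(sequence))
```

Sequence lines are joined without separators, so the way a record is wrapped across lines does not affect the result.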

FASTA format (Fig. 2.10, Page 32)

cDNA
cDNA: DNA that is synthesized to be complementary to an mRNA molecule.
  A cDNA represents a portion of the DNA that specifies a protein (coding sequence of a gene).
  If the sequence of the cDNA is known, the sequence of the DNA is known.
  Non-translated introns are not found in the cDNA. (They are removed after the DNA is transcribed into mRNA.)
[Diagram: DNA -> RNA -> protein, with complementary DNA (cDNA) derived from the RNA]
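
The relationship between a gene, its mRNA, and the cDNA can be made concrete with a toy example. This is a sketch under simplifying assumptions: the gene, intron, and exon coordinates are made up, the gene is given as its coding strand, and only the first cDNA strand (the reverse complement of the mRNA) is produced.

```python
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def transcribe_and_splice(coding_strand, exons):
    """Join the exon intervals (half-open, 0-based) and write the result as RNA."""
    mature = "".join(coding_strand[start:end] for start, end in exons)
    return mature.replace("T", "U")


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna))


# Hypothetical two-exon gene; the intron is removed during splicing.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "AAATGA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

mrna = transcribe_and_splice(gene, exons)
cdna = first_strand_cdna(mrna)
print(mrna)
print(cdna)
```

The cDNA is shorter than the gene by exactly the intron length, which is the point made on the slide: introns are absent from cDNA.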

EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences of expressed genes.
  They are sequences of cDNA that have been reverse-transcribed from mRNA.
  Short sequences (~500-1000 bases); each is the result of a single sequencing experiment -> high frequency of errors
  They represent a snapshot of what is expressed in a given tissue and developmental stage.

dbEST (Expressed Sequence Tags database)
http://www.ncbi.nlm.nih.gov/dbest/
dbEST is a division of GenBank that contains sequence data and other information on cDNA sequences, or ESTs, from a number of organisms.

EST (Expressed Sequence Tag)
Applications:
  Discovery of new genes
  Mapping of various genomes
  Identification of coding regions in genomic sequences
EST libraries are used to answer questions like: Which genes are expressed in a specific cell or tissue?

One gene can have multiple EST sequences!

UniGene: Unique Genes
http://www.ncbi.nlm.nih.gov/unigene
  UniGene partitions GenBank sequences into a nonredundant set of gene-oriented clusters.
  Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
  A majority of sequences are ESTs.

Cluster sizes in UniGene
  This is a gene with 1 EST associated; the cluster size is 1
  This is a gene with 10 ESTs associated; the cluster size is 10

Cluster sizes in UniGene (human)
  Cluster size      Number of clusters
  1                  8,100
  2                 38,200
  3-4               23,300
  5-8               12,000
  9-16               5,600
  17-32              3,700
  500-1000           1,050
  2000-4000            100
  8000-16,000           12
  16,000-30,000          2
UniGene build 172, 8/04
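
A table like this one is a histogram of cluster sizes: count the ESTs assigned to each cluster, then count how many clusters have each size. The sketch below shows the idea on made-up data; the EST and cluster identifiers are hypothetical and do not come from UniGene.

```python
from collections import Counter

# Hypothetical assignment of EST identifiers to gene-oriented clusters.
est_to_cluster = {
    "EST001": "cluster_A",
    "EST002": "cluster_B",
    "EST003": "cluster_B",
    "EST004": "cluster_B",
    "EST005": "cluster_C",
}

# Cluster size = number of ESTs assigned to that cluster.
cluster_sizes = Counter(est_to_cluster.values())

# Histogram: how many clusters have each size.
size_histogram = Counter(cluster_sizes.values())

for size in sorted(size_histogram):
    print(size, size_histogram[size])
```

Here two clusters have size 1 and one cluster has size 3. The slide's table additionally groups sizes into ranges such as 3-4 and 5-8.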

UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further in the section on gene expression. (Page 31)

Using a database
How to get information out of a database:
  Browsing: no targeted information to retrieve
  Search: looking for particular information
Searching a database: you must have a key that identifies the element(s) of the database that are of interest.
  Accession number
  Name of gene
  Sequence of gene
  Keyword (any word that occurs somewhere in the database records)
  Other information
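
The difference between searching by a unique key and searching by keyword can be sketched with a small in-memory table. The record texts below are assembled for illustration from the accession examples given earlier in these slides, and the function names are hypothetical.

```python
# Toy database: accession number -> free-text description.
records = {
    "X02775": "GenBank genomic DNA sequence for retinol-binding protein",
    "NM_006744": "RefSeq DNA sequence (from a transcript) for retinol-binding protein",
    "1KT7": "Protein Data Bank structure record for retinol-binding protein",
}


def search_by_accession(db, accession):
    """A key lookup returns at most one record."""
    return db.get(accession)


def search_by_keyword(db, keyword):
    """A keyword search scans every record and may return many accessions."""
    needle = keyword.lower()
    return sorted(acc for acc, text in db.items() if needle in text.lower())


print(search_by_accession(records, "1KT7"))
print(search_by_keyword(records, "refseq"))
print(search_by_keyword(records, "retinol"))
```

An accession number identifies exactly one record, while a keyword such as "retinol" matches every record that mentions it, so keyword searches usually need further narrowing.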

NCBI and Entrez
  One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine.
  NCBI provides interesting summaries, browsers for genome data, and search tools.
  Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/entrez
  Can search on gene names, sequences, chromosomal location, diseases, keywords...

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov

Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Searching for HIV-1 pol: Following the genome link yields a manageable three results (Page 34)

Example of how to access sequence data: HIV-1 pol
For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1), but these can be reduced in two easy steps:
  -- specify the organism, e.g. hiv-1[organism]
  -- limit the output to RefSeq!
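
The same two narrowing steps can be written as a query string and sent to NCBI programmatically. The sketch below only builds the request URL and makes no network call. It assumes the E-utilities `esearch.fcgi` endpoint with its `db` and `term` parameters, and it assumes that `srcdb_refseq[prop]` is the query term that limits results to RefSeq; both should be checked against current NCBI documentation. The helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(organism, keyword=None, refseq_only=False):
    """Combine an organism restriction, an optional keyword, and a RefSeq limit."""
    parts = [f"{organism}[organism]"]
    if keyword:
        parts.append(keyword)
    if refseq_only:
        parts.append("srcdb_refseq[prop]")   # assumed RefSeq filter term
    return " AND ".join(parts)


def build_esearch_url(term, database="nucleotide"):
    """URL-encode the query for the (assumed) esearch endpoint."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


term = build_entrez_term("hiv-1", keyword="pol", refseq_only=True)
url = build_esearch_url(term)

# Decoding the URL again should give back the original query term.
round_trip = parse_qs(urlsplit(url).query)["term"][0]
print(term)
print(url)
```

Building the term separately from the URL keeps the query readable and lets the same string be pasted into the Entrez search box.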

Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq

NCBI's important RefSeq project: best representative sequences
http://www.ncbi.nlm.nih.gov/refseq/
The RefSeq collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
It provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
  Complete genome      NC_######
  Complete chromosome  NC_######
  Genomic contig       NT_######
  mRNA (DNA format)    NM_######   e.g. NM_006744
  Protein              NP_######   e.g. NP_006735
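
Because the prefix of a RefSeq identifier encodes the molecule type, an accession can be classified with a simple pattern match. This sketch covers only the four prefixes listed on the slide. The function name is hypothetical, and accepting any number of digits plus an optional `.version` suffix is an assumption rather than part of the slide.

```python
import re

REFSEQ_PREFIXES = {
    "NC": "Complete genome or complete chromosome",
    "NT": "Genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "Protein",
}

# Prefix, underscore, digits, optional version suffix such as ".2" (assumed).
REFSEQ_PATTERN = re.compile(r"^(NC|NT|NM|NP)_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None if not recognized."""
    match = REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES[match.group(1)]


for accession in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(accession, classify_refseq(accession))
```

An accession such as X02775 returns None because it is an ordinary GenBank accession, not a RefSeq identifier.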

Strategy for assessment of alternative multiple sequence alignment algorithms
  1. Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define true homologs using structural criteria.
     BaliBase: a reference alignment resource with over 1,000 sequences in 142 alignments. http://www-igbmc.u-strasbg.fr/bioinfo/balibase/index.html
  2. Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
  3. Compare the answers.
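
Step 3, comparing the answers, needs a numerical measure. One common choice, used here as an illustration rather than as the method the slide has in mind, is a sum-of-pairs score: the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The toy alignments and function names below are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column of a gapped alignment.

    Each residue is identified by (sequence index, position in ungapped sequence).
    """
    positions = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, symbol in enumerate(column):
            if symbol != "-":
                present.append((seq_index, positions[seq_index]))
                positions[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)


# Toy alignments of the sequences ACGT and AGT.
reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]

score = sum_of_pairs_score(test, reference)
print(round(score, 3))
```

In this example the test alignment recovers two of the three reference pairs (it misplaces the G), so the score is 2/3. An alignment identical to the reference scores 1.0.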

Acknowledgement
Many of the images and slides in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2003 by John Wiley & Sons, Inc.