EECS 730 Introduction to Bioinformatics Sequence Alignment Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/
Database: What is a database? An organized set of data. Can web pages, books, journal articles, tables, text files, and spreadsheet files be considered databases? Molecular biology databases serve to disseminate biological data and information, to provide biological data in computer-readable form, and to allow analysis of biological data. 2012/9/11 EECS 730 2
Biological Information. Nucleic acids: DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs. Genomics: genome, gene structure and expression, genetic map, genetic disorder. RNA: sequence, secondary structure, 3D structure, interactions. Proteins: protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions. Proteomics: expression profile, proteins in disease processes, etc. Ligands and drugs: inhibitors, activators, substrates, metabolites.
Biological Information. Function: binding sites, interactions, molecular action (binding, chemical reaction, etc.); biological effect (signaling, transport, feedback, regulation, modification, etc.); functional relationship, protein families, motifs, and homologs. Pathways: molecular networks, biological chain events, regulation, feedback, kinetic data.
Overview of molecular biology databases. DNA sequence: GenBank (www.ncbi.nlm.nih.gov), EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk), DDBJ (DNA Data Bank of Japan). Protein sequence: Swiss-Prot (www.ebi.ac.uk), NCBI. Protein classification databases: Prosite (www.expasy.org), Pfam (www.sanger.ac.uk/pfam), InterPro (www.ebi.ac.uk/interpro). Gene ontology (www.geneontology.org).
Overview of molecular biology databases. Structure: PDB (Protein Data Bank, www.rcsb.org/pdb/cgi/queryform.cgi; X-ray crystallography, NMR, modeling), KLOTHO (small molecules, http://www.biocheminfo.org/klotho/). Genome: Mouse genome database (www.informatics.jax.org), Yeast genome (www.yeastgenome.org/), Bacterial genomes (www.tigr.org). Human genome browsers: NCBI (www.ncbi.nlm.nih.gov), UCSC (genome.ucsc.edu), EBI (www.ensembl.org), Celera (www.celera.com).
Overview of molecular biology databases. Genetic disorders: OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov). Taxonomy (www.ncbi.nlm.nih.gov). Literature: PubMed (www.ncbi.nlm.nih.gov/entrez).
Data about Databases
Molecular biology databases: nucleic acid sequence, genome data, protein sequence, protein classification, protein structure.
Nucleic acid databases. What information is in these databases? DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs. Genomics: genome, gene structure and expression, genetic map, genetic disorder. RNA: sequence, secondary structure, 3D structure, interactions.
Nucleic acid databases. DNA databases: GenBank, EMBL, DDBJ. 1. These are general-purpose databases focusing on DNA sequences and their properties. 2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently among the three centers.
Three major public DNA databases: EMBL, GenBank, DDBJ
International Nucleotide Sequence Database Collaboration
EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/). Contains nucleotide sequences collected from all public sources. Accessible through the Sequence Retrieval System (SRS), which allows keyword searching. Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later).
EMBL entry header. ID line layout: ID entryname dataclass; molecule; division; sequence length (BP).
EMBL entry feature table (http://www.ebi.ac.uk/embl/documentation/ft_definitions/feature_table.html): coding sequence
EMBL entry sequence
EMBL format (http://www.ebi.ac.uk/embl/documentation/user_manual/usrman.html)
ID: IDentification.
AC: ACcession numbers. The primary means of identifying sequences, providing a stable way of identifying entries from release to release.
DE: DEscription.
KW: KeyWord information, which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important.
OS: Organism Species.
OC: Organism Classification, the taxonomic classification of the source organism.
OG: OrGanelle, the sub-cellular location of non-nuclear sequences.
SQ: SeQuence header. Marks the beginning of the sequence data and gives a summary of its content. Each sequence data line has a line code consisting of two blanks.
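The line codes above suggest a simple way to read an entry: group lines by their two-letter code and treat lines with a blank code as sequence data. The following is a minimal sketch under stated assumptions, not a complete EMBL parser. The sample entry is hypothetical and heavily abbreviated, the function name `parse_embl_entry` is invented for illustration, and the sketch assumes the content of each line starts at column 6 and that `//` ends an entry.

```python
from collections import defaultdict

# Hypothetical, heavily abbreviated entry written for illustration only;
# it is not a real EMBL record.
SAMPLE_ENTRY = """\
ID   X12345     standard; DNA; HUM; 24 BP.
AC   X12345;
DE   Hypothetical example gene
KW   example; illustration.
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata.
SQ   Sequence 24 BP;
     cgtatcttac gagctactac gagg        24
//
"""


def parse_embl_entry(text):
    """Group the lines of one entry by their two-letter line code.

    Lines whose code is blank are sequence data; their letters are
    collected under the key 'sequence' (spacing and counters dropped).
    """
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if line.startswith("//"):          # assumed entry terminator
            break
        code, body = line[:2], line[5:].strip()
        if code.strip() == "":             # blank line code: sequence data
            residues.extend(ch for ch in body if ch.isalpha())
        else:
            fields[code].append(body)
    record = {code: " ".join(parts) for code, parts in fields.items()}
    record["sequence"] = "".join(residues)
    return record


record = parse_embl_entry(SAMPLE_ENTRY)
print(record["AC"], record["OS"], len(record["sequence"]))
```

Keeping the line code as the dictionary key mirrors the format itself, so multi-line fields such as OC are simply joined in order.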
What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
  X02775      GenBank genomic DNA sequence
  NT_030059   Genomic contig
  rs7079946   dbSNP (single nucleotide polymorphism)
RNA
  N91759.1    An expressed sequence tag (1 of 170)
  NM_006744   RefSeq DNA sequence (from a transcript)
Protein
  NP_007635   RefSeq protein
  AAC02945    GenBank protein
  Q28369      SwissProt protein
  1KT7        Protein Data Bank structure record
GenBank database (http://www.ncbi.nih.gov/genbank/). Contains publicly available DNA sequences from more than 100,000 organisms. Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features. Accessible through Entrez, NCBI's integrated retrieval system. Sequence similarity search tools: BLAST (studied later).
Number of base pairs in GenBank, 1982 - present (http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html). [Figure: two plots of base pairs versus year, 1980 to 2005. One is a linear plot in billions of base pairs (0 to 48); the other is a semilogarithmic plot (1.E+06 to 1.E+11) annotated with growth rates of 2-fold / 18 mo and 10-fold / 5 yr.] These graphs provide one example of the rapidly accumulating data in biology, leading to entire new fields of study.
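The two growth rates quoted on the GenBank plot (2-fold per 18 months and 10-fold per 5 years) describe roughly the same exponential trend. A short check, assuming steady doubling; the function name `fold_increase` is chosen here for illustration:

```python
def fold_increase(doubling_months, elapsed_months):
    """Fold growth after elapsed_months, given a fixed doubling time."""
    return 2 ** (elapsed_months / doubling_months)


# Doubling every 18 months, observed over 5 years (60 months).
five_year_fold = fold_increase(18, 60)
print(round(five_year_fold, 1))
```

Sixty months is 3.33 doubling periods, and 2 raised to that power is close to 10, so the two annotations agree.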
>100,000 species are represented in GenBank
  all species   128,941
  viruses         6,137
  bacteria       31,262
  archaea         2,100
  eukaryota      87,147
The most sequenced organisms in GenBank (billions of bases)
  Homo sapiens              10.7
  Mus musculus               6.5
  Rattus norvegicus          5.6
  Danio rerio                1.7
  Zea mays                   1.4
  Oryza sativa               0.8
  Drosophila melanogaster    0.7
  Gallus gallus              0.5
  Arabidopsis thaliana       0.5
Updated 8-12-04, GenBank release 142.0
A GenBank entry - HEADER (http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html)
GenBank entry - FEATURES
GenBank entry - SEQUENCE
Common sequence formats: EMBL release format, GenBank release format, FASTA format. FASTA example:
  >X12345 Y098TR gene
  CGTATCTTACGAGCTACTACGA
  GGTCTTATCGGACGAGCGACT...
FASTA format (Fig. 2.10, Page 32)
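A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text into (header, sequence) pairs. It is a minimal illustration: the function name `parse_fasta` is chosen here, and the example reuses the two visible sequence lines from the FASTA example, leaving out the part elided by "...".

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:         # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)            # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGA
GGTCTTATCGGACGAGCGACT
"""

records = parse_fasta(EXAMPLE)
for header, sequence in records:
    print(header, len(sequence))
```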
cDNA: DNA that is synthesized to be complementary to an mRNA molecule. A cDNA represents a portion of the DNA that specifies a protein (the coding sequence of a gene). If the sequence of the cDNA is known, the sequence of the corresponding DNA is known. Non-translated introns are not found in the cDNA. (They are removed after the DNA is transcribed into mRNA.) [Diagram: DNA -> RNA -> protein, with complementary DNA (cDNA) made from the RNA.]
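The statement that knowing the cDNA gives the corresponding DNA sequence follows from base pairing, which a few lines of code can illustrate. This is a sketch with a made-up six-base mRNA; the function names are chosen for illustration and only the four standard RNA bases are handled.

```python
# Base-pairing rules for copying RNA into DNA (reverse transcription).
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


def coding_strand_dna(mrna):
    """The mRNA sequence in DNA letters (U replaced by T)."""
    return mrna.upper().replace("U", "T")


mrna = "AUGGCC"  # hypothetical fragment
print(first_strand_cdna(mrna), coding_strand_dna(mrna))
```

The two outputs are reverse complements of each other, which is why sequencing either strand of a cDNA recovers the spliced coding sequence.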
EST (Expressed Sequence Tag). Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences of expressed genes. They are sequences of cDNA that have been reverse-transcribed from mRNA. They are short sequences (~500-1000 bases), each the result of a single sequencing experiment, so the error frequency is high. They represent a snapshot of what is expressed in a given tissue and developmental stage.
dbEST (Expressed Sequence Tags database), http://www.ncbi.nlm.nih.gov/dbest/. dbEST is a division of GenBank that contains sequence data and other information on cDNA sequences, or ESTs, from a number of organisms.
EST (Expressed Sequence Tag). Applications: discovery of new genes; mapping of various genomes; identification of coding regions in genomic sequences. EST libraries are used to answer questions like: Which genes are expressed in a specific cell or tissue?
One gene can have multiple EST sequences!
UniGene: Unique Genes (http://www.ncbi.nlm.nih.gov/unigene). UniGene partitions GenBank sequences into a nonredundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. A majority of sequences are ESTs.
Cluster sizes in UniGene. [Figure labels:] This is a gene with 1 EST associated; the cluster size is 1. This is a gene with 10 ESTs associated; the cluster size is 10.
Cluster sizes in UniGene (human), UniGene build 172, 8/04
  Cluster size     Number of clusters
  1                 8,100
  2                38,200
  3-4              23,300
  5-8              12,000
  9-16              5,600
  17-32             3,700
  500-1000          1,050
  2000-4000           100
  8000-16,000          12
  16,000-30,000         2
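A cluster-size table like this one can be produced by counting the ESTs assigned to each gene cluster and then binning the counts into doubling ranges (1, 2, 3-4, 5-8, ...). The sketch below uses hypothetical EST and gene names; it illustrates the counting only and does not reproduce how UniGene itself builds clusters.

```python
from collections import Counter

# Hypothetical (EST id, cluster id) assignments, for illustration only.
assignments = (
    [("a0", "geneA")]
    + [(f"b{i}", "geneB") for i in range(2)]
    + [(f"c{i}", "geneC") for i in range(4)]
    + [(f"d{i}", "geneD") for i in range(5)]
)


def cluster_sizes(pairs):
    """Number of ESTs in each cluster."""
    return Counter(cluster for _, cluster in pairs)


def size_bin(size):
    """Label doubling bins for a cluster size >= 1: 1, 2, 3-4, 5-8, 9-16, ..."""
    if size <= 2:
        return str(size)
    upper = 1 << (size - 1).bit_length()   # smallest power of two >= size
    return f"{upper // 2 + 1}-{upper}"


sizes = cluster_sizes(assignments)
histogram = Counter(size_bin(n) for n in sizes.values())
print(dict(histogram))
```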
UniGene: unique genes via ESTs (Page 31). Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further in the section on gene expression.
Using a database. How to get information out of a database: Browsing (no targeted information to retrieve) or Search (looking for particular information). Searching a database requires a key that identifies the element(s) of the database that are of interest: accession number; name of gene; sequence of gene; keyword (any word that occurs somewhere in the database records); other information.
NCBI and Entrez. One of the most useful and comprehensive sources of databases is the NCBI, part of the National Library of Medicine. NCBI provides interesting summaries, browsers for genome data, and search tools. Entrez is its database search interface (http://www.ncbi.nlm.nih.gov/entrez). It can search on gene names, sequences, chromosomal location, diseases, keywords...
National Center for Biotechnology Information (NCBI), www.ncbi.nlm.nih.gov
Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes
Entrez is a search and retrieval system that integrates NCBI databases
Example of how to access sequence data: HIV-1 pol. There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Searching for HIV-1 pol: Following the genome link yields a manageable three results (Page 34)
Example of how to access sequence data: HIV-1 pol. For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1), but these can be reduced in two easy steps: (1) specify the organism, e.g. hiv-1[organism]; (2) limit the output to RefSeq.
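The same two narrowing steps can be expressed as a query string for NCBI's E-utilities search endpoint. The sketch below only builds the URL and makes no network request. The helper name `build_esearch_url` is invented here, and the RefSeq restriction term `srcdb_refseq[prop]` is an assumption about Entrez query syntax (the slide only says to limit the output to RefSeq), so check it against the current Entrez help before relying on it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism, gene=None, refseq_only=False, db="nucleotide"):
    """Build an ESearch URL that narrows a query by organism and RefSeq."""
    terms = [f"{organism}[organism]"]        # step 1: specify the organism
    if gene:
        terms.append(gene)
    if refseq_only:
        terms.append("srcdb_refseq[prop]")   # step 2: assumed RefSeq filter
    query = " AND ".join(terms)
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


url = build_esearch_url("hiv-1", gene="pol", refseq_only=True)
print(url)
print(parse_qs(urlparse(url).query)["term"][0])
```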
Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq
NCBI's important RefSeq project: best representative sequences (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. It provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats:
  Complete genome      NC_######
  Complete chromosome  NC_######
  Genomic contig       NT_######
  mRNA (DNA format)    NM_######   e.g. NM_006744
  Protein              NP_######   e.g. NP_006735
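Because the RefSeq prefix encodes the molecule type, an accession can be classified by pattern alone. A minimal sketch covering only the four prefixes listed above; the function name and the optional ".version" suffix handling are assumptions made for illustration.

```python
import re

# Only the prefixes listed on the slide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Two letters, underscore, digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, classify_refseq(acc))
```

Accessions that do not follow the pattern, such as the GenBank accession X02775, return None instead of a guess.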
Strategy for assessment of alternative multiple sequence alignment algorithms
1. Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define true homologs using structural criteria. BAliBASE: a reference alignment resource with over 1,000 sequences in 142 alignments (http://www-igbmc.u-strasbg.fr/bioinfo/balibase/index.html).
2. Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
3. Compare the answers.
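One common way to carry out step 3 is a sum-of-pairs score: the fraction of residue pairs aligned in the reference alignment that a test alignment also places in the same column. The sketch below is an assumption about how the comparison could be done, not necessarily the measure used in this course; the alignments are tiny hypothetical examples, and both alignments must contain the same sequences in the same order.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    A residue is identified as (sequence index, residue index).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, symbol in enumerate(column):
            if symbol != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["ACGT", "A-GT"]   # hypothetical reference alignment
test = ["ACGT", "AG-T"]        # hypothetical output of an alignment program
score = sum_of_pairs_score(test, reference)
print(round(score, 3))
```

A score of 1.0 means the test alignment reproduces every aligned pair in the reference; here one of the three reference pairs is missed.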
Acknowledgment: Many of the images and slides in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright 2003 by John Wiley & Sons, Inc.