Introduction to Bioinformatics 260.602.01 September 1, 2006 Jonathan Pevsner, Ph.D. pevsner@kennedykrieger.org
Teaching assistants Hugh Cahill (hugh@jhu.edu) Jennifer Turney (jturney@jhsph.edu) Meg Zupancic (mzupanc1@jhmi.edu)
Who is taking this course? People with very diverse backgrounds in biology People with diverse backgrounds in computer science and biostatistics Most people have a favorite gene, protein, or disease
What are the goals of the course? To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI To focus on the analysis of DNA, RNA and proteins To introduce you to the analysis of genomes To combine theory and practice to help you solve research problems
Themes throughout the course Textbooks Web sites Literature references Gene/protein families Computer labs
Textbook The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2003). The chapters contain content, lab exercises, and quizzes that were developed in this course over the past six years. A few copies will be available on reserve at Welch Library for those of you who do not want to buy a copy (go up to the 2 nd floor), and the library has six more copies. Several other bioinformatics texts are available: Baxevanis and Ouellette David Mount Durbin et al.
Web sites The course website is reached via: http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm (or Google pevsnerlab courses) This site contains the powerpoints for each lecture. The textbook website is: http://www.bioinfbook.org This has 1000 URLs, organized by chapter This site also contains the same powerpoints. The weekly quizzes are on my website: http://pevsnerlab.kennedykrieger.org/moodle Once you log in and take a quiz, you will get instant feedback. You can use moodle to ask questions as well.
Literature references You are encouraged to read original source articles. They will enhance your understanding of the material. Reading will be assigned.
Themes throughout the course: gene/protein families We will use retinol-binding protein 4 (RBP4) as a model gene/protein throughout the course. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study it in a variety of contexts including --sequence alignment --gene expression --protein structure --phylogeny --homologs in various species We will also use other examples, such as the globins and the pol protein of HIV-1
The HIV-1 pol gene encodes three proteins Aspartyl protease Reverse transcriptase Integrase PR RT IN
Themes throughout the course: computer labs There is a computer lab each Friday. This is a chance to gain practical experience using a variety of web resources. You can do the lab on your own, ahead of time. However, during the Friday lab you can get help on problems, and in some cases the computers will have specialized software.
Grading 40% ten moodle quizzes (corresponding to chapters 2-11) 30% final exam October 25 (in class) 30% discovery of a novel gene: --Find the novel gene by the end of September, and turn in the final report, with phylogenetic tree, by October 25 --Instructions are posted on the course website --We will discuss this project in detail in the next two weeks.
Grading Quizzes are taken at the moodle website, and are due one week after the relevant lecture ten quizzes 4% Chapter 2 quiz (sequences) 4% Chapter 3 quiz (alignment) 4% Chapter 4 quiz (BLAST) 4% Chapter 5 quiz (advanced BLAST) 4% Chapter 6 quiz (RNA) 4% Chapter 7 quiz (microarrays) 4% Chapter 8 quiz (proteomics) 4% Chapter 9 quiz (protein structure) 4% Chapter 10 quiz (multiple alignment) 4% Chapter 11 quiz (phylogeny) 30% find-a-gene project (due October 25) 30% final exam October 25 (in class)
Outline for today (chapters 1 and 2) Definition of bioinformatics Overview of the NCBI website Accessing information about DNA and proteins --Definition of an accession number --Four ways to find information on proteins and DNA Access to biomedical literature
What is bioinformatics? Interface of biology and computers Analysis of proteins, genes and genomes using computer algorithms and computer databases Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Top ten challenges for bioinformatics [1] Precise models of where and when transcription will occur in a genome (initiation and termination) [2] Precise, predictive models of alternative RNA splicing [3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli [4] Determining protein:dna, protein:rna, protein:protein recognition codes [5] Accurate ab initio protein structure prediction
Top ten challenges for bioinformatics [6] Rational design of small molecule inhibitors of proteins [7] Mechanistic understanding of protein evolution [8] Mechanistic understanding of speciation [9] Development of effective gene ontologies: systematic ways to describe gene and protein function [10] Education: development of bioinformatics curricula Source: Ewan Birney, Chris Burge, Jim Fickett
On bioinformatics Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.
On bioinformatics However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome. Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project
bioinformatics medical informatics Tool-users public health informatics databases algorithms Tool-makers infrastructure
Three perspectives on bioinformatics The cell The organism The tree of life Page 4
DNA RNA protein phenotype Page 5
Time of development Body region, physiology, pharmacology, pathology Page 5
After Pace NR (1997) Science 276:734 Page 6
DNA RNA protein phenotype
Growth of GenBank Base pairs of DNA (billions) 1982 1986 1990 1994 1998 2002 Fig. 2.1 Year Page 17 Sequences (millions) Updated 8-12-04: >40b base pairs
Base pairs of DNA (billions) Growth of GenBank 70 60 50 40 30 20 10 0 1985 1990 1995 2000 Sequences (millions) December 1982 June 2006
Growth of the International Nucleotide Sequence Database Collaboration Base pairs of DNA (billions) Base pairs contributed by GenBank EMBL DDBJ http://www.ncbi.nlm.nih.gov/genbank/
Central dogma of molecular biology DNA RNA protein genome transcriptome proteome Central dogma of bioinformatics and genomics
DNA RNA protein phenotype genomic DNA databases cdna ESTs UniGene protein sequence databases Fig. 2.2 Page 20
There are three major public DNA databases EMBL GenBank DDBJ The underlying raw DNA sequences are identical Page 16
There are three major public DNA databases EMBL Housed at EBI European Bioinformatics Institute GenBank Housed at NCBI National Center for Biotechnology Information DDBJ Housed in Japan Page 16
>100,000 species are represented in GenBank all species 128,941 viruses 6,137 bacteria 31,262 archaea 2,100 eukaryota 87,147 Table 2-1 Page 17
Taxonomy nodes at NCBI 8/06 http://www.ncbi.nlm.nih.gov/taxonomy/txstat.cgi
The most sequenced organisms in GenBank Homo sapiens 10.7 billion bases Mus musculus 6.5b Rattus norvegicus 5.6b Danio rerio 1.7b Zea mays 1.4b Oryza sativa 0.8b Drosophila melanogaster 0.7b Gallus gallus 0.5b Arabidopsis thaliana 0.5b Updated 8-12-04 GenBank release 142.0 Table 2-2 Page 18
The most sequenced organisms in GenBank Homo sapiens 11.2 billion bases Mus musculus 7.5b Rattus norvegicus 5.7b Danio rerio 2.1b Bos taurus 1.9b Zea mays 1.4b Oryza sativa (japonica) 1.2b Xenopus tropicalis 0.9b Canis familiaris 0.8b Drosophila melanogaster 0.7b Updated 8-29-05 GenBank release 149.0 Table 2-2 Page 18
The most sequenced organisms in GenBank Homo sapiens 12.3 billion bases Mus musculus 8.0b Rattus norvegicus 5.7b Bos taurus 3.5b Danio rerio 2.5b Zea mays 1.8b Oryza sativa (japonica) 1.5b Strongylocentrotus purpurata 1.2b Sus scrofa 1.0b Xenopus tropicalis 1.0b Updated 7-19-06 GenBank release 154.0 Table 2-2 Page 18
National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24
www.ncbi.nlm.nih.gov Fig. 2.5 Page 25
Fig. 2.5 Page 25
PubMed is National Library of Medicine's search service 16 million citations in MEDLINE links to participating online journals PubMed tutorial (via Education on side bar) Page 24
Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24
Entrez is a search and retrieval system that integrates NCBI databases Page 24
BLAST is Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 100,000 searches per day Page 25
OMIM is Online Mendelian Inheritance in Man catalog of human genes and genetic disorders edited by Dr. Victor McKusick, others at JHU Page 25
Books is searchable resource of on-line books Page 26
TaxBrowser is browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms Page 26
Structure site includes Molecular Modelling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST) Page 26
Accessing information on molecular sequences Page 26
Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26
What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775 NT_030059 Rs7079946 GenBank genomic DNA sequence Genomic contig dbsnp (single nucleotide polymorphism) DNA N91759.1 An expressed sequence tag (1 of 170) NM_006744 RefSeq DNA sequence (from a transcript) RNA NP_007635 AAC02945 Q28369 1KT7 RefSeq protein GenBank protein SwissProt protein Protein Data Bank structure record protein Page 27
Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31). Page 27
4 ways to access protein and DNA sequences [1] Entrez Gene with RefSeq Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635) Page 27
From the NCBI home page, type rbp4 and hit Go revised Fig. 2.7 Page 29
revised Fig. 2.7 Page 29
By applying limits, there are now just two entries
Entrez Gene (top of page) Note that links to many other RBP4 database entries are available revised Fig. 2.8 Page 30
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Fig. 2.9 Page 32
Fig. 2.9 Page 32
Fig. 2.9 Page 32
FASTA format Fig. 2.10 Page 32
What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775 NT_030059 Rs7079946 GenBank genomic DNA sequence Genomic contig dbsnp (single nucleotide polymorphism) DNA N91759.1 An expressed sequence tag (1 of 170) NM_006744 RefSeq DNA sequence (from a transcript) RNA NP_007635 AAC02945 Q28369 1KT7 RefSeq protein GenBank protein SwissProt protein Protein Data Bank structure record protein Page 27
NCBI s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats: Complete genome Complete chromosome Genomic contig mrna (DNA format) Protein NC_###### NC_###### NT_###### NM_###### e.g. NM_006744 NP_###### e.g. NP_006735 Page 29-30
NCBI s RefSeq project: accession for genomic, mrna, protein sequences Accession Molecule Method Note AC_123456 Genomic Mixed Alternate complete genomic AP_123456 Protein Mixed Protein products; alternate NC_123456 Genomic Mixed Complete genomic molecules NG_123456 Genomic Mixed Incomplete genomic regions NM_123456 mrna Mixed Transcript products; mrna NM_123456789 mrna Mixed Transcript products; 9-digit NP_123456 Protein Mixed Protein products; NP_123456789 Protein Curation Protein products; 9-digit NR_123456 RNA Mixed Non-coding transcripts NT_123456 Genomic Automated Genomic assemblies NW_123456 Genomic Automated Genomic assemblies NZ_ABCD12345678 Genomic Automated Whole genome shotgun data XM_123456 mrna Automated Transcript products XP_123456 Protein Automated Protein products XR_123456 RNA Automated Transcript products YP_123456 Protein Auto. & Curated Protein products ZP_12345678 Protein Automated Protein products
Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31
DNA RNA protein complementary DNA (cdna) UniGene Fig. 2.3 Page 23
UniGene: unique genes via ESTs Find UniGene at NCBI: www.ncbi.nlm.nih.gov/unigene UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mrna from an expressed gene. ESTs are sequenced from a complementary DNA (cdna) library. UniGene data come from many cdna libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21
Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1 Fig. 2.3 Page 23
Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10
Cluster sizes in UniGene (human) Cluster size (ESTs) Number of clusters 1 42,800 2 6,500 3-4 6,500 5-8 5,400 9-16 4,100 17-32 3,300 500-1000 2,128 2000-4000 233 8000-16,000 21 16,000-30,000 8 UniGene build 194, 8/06
UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on September 18 (gene expression). Page 31
Five ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31
Ensembl to access protein and DNA sequences Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
click human
enter RBP4
Five ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 33
ExPASy to access protein and DNA sequences ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System) Visit http://www.expasy.ch/ Page 33
Fig. 2.11 Page 33
Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34
Searching for HIV-1 pol: Following the genome link yields a manageable three results Page 34
Example of how to access sequence data: HIV-1 pol For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1 ), but these can easily be reduced in two easy steps: --specify the organism, e.g. hiv-1[organism] --limit the output to RefSeq! Page 34
only 1 RefSeq over 100,000 nucleotide entries for HIV-1
Examples of how to access sequence data: histone query for histone # results protein records 21847 RefSeq entries 7544 RefSeq (limit to human) 1108 NOT deacetylase 697 At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein. 8-12-06
Access to Biomedical Literature Page 35
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966. Page 35
MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature. Page 35
PubMed search strategies Try the tutorial ( education on the left sidebar) Use boolean queries (capitalize AND, OR, NOT) lipocalin AND disease Try using limits Try Links to find Entrez information and external resources Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/ Page 35
1 AND 2 1 2 lipocalin AND disease (60 results) 1 OR 2 1 2 lipocalin OR disease (1,650,000 results) 1 NOT 2 8/04 1 2 lipocalin NOT disease (530 results) Fig. 2.12 Page 34
Search result: globin is present Article contents: globin is absent globin is found true positive false positive (article does not discuss globins) globin is not found false negative (article discusses globins) true negative 8/06
WelchWeb is available at http://www.welch.jhu.edu
http://www.welch.jhu.edu Brian Brown (bbrown20@jhmi.edu) and Carrie Iwema (iwema@jhmi.edu) are the Welch Medical Library liasons to the basic sciences
Course sponsors Dept. of Molecular Microbiology & Immunology, and Dept. of Biostatistics, School of Public Health