Introduction to Bioinformatics


1 Introduction to Bioinformatics September 1, 2006 Jonathan Pevsner, Ph.D.

2 Teaching assistants Hugh Cahill Jennifer Turney Meg Zupancic

3 Who is taking this course?
People with very diverse backgrounds in biology
People with diverse backgrounds in computer science and biostatistics
Most people have a favorite gene, protein, or disease

4 What are the goals of the course?
To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
To focus on the analysis of DNA, RNA and proteins
To introduce you to the analysis of genomes
To combine theory and practice to help you solve research problems

5 Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs

6 Textbook The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2003). The chapters contain content, lab exercises, and quizzes that were developed in this course over the past six years. A few copies will be available on reserve at Welch Library (go up to the 2nd floor) for those of you who do not want to buy a copy, and the library has six more copies. Several other bioinformatics texts are available: Baxevanis and Ouellette; David Mount; Durbin et al.

7 Web sites The course website is reached via: (or Google pevsnerlab courses) This site contains the powerpoints for each lecture. The textbook website is: This has 1000 URLs, organized by chapter This site also contains the same powerpoints. The weekly quizzes are on my website: Once you log in and take a quiz, you will get instant feedback. You can use moodle to ask questions as well.

8 Literature references You are encouraged to read original source articles. They will enhance your understanding of the material. Reading will be assigned.

9 Themes throughout the course: gene/protein families
We will use retinol-binding protein 4 (RBP4) as a model gene/protein throughout the course. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study it in a variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
We will also use other examples, such as the globins and the pol protein of HIV-1.

11 The HIV-1 pol gene encodes three proteins: aspartyl protease (PR), reverse transcriptase (RT), and integrase (IN)

12 Themes throughout the course: computer labs There is a computer lab each Friday. This is a chance to gain practical experience using a variety of web resources. You can do the lab on your own, ahead of time. However, during the Friday lab you can get help on problems, and in some cases the computers will have specialized software.

13 Grading
40% ten moodle quizzes (corresponding to chapters 2-11)
30% final exam October 25 (in class)
30% discovery of a novel gene:
--Find the novel gene by the end of September, and turn in the final report, with phylogenetic tree, by October 25
--Instructions are posted on the course website
--We will discuss this project in detail in the next two weeks.

14 Grading
Quizzes are taken at the moodle website, and are due one week after the relevant lecture.
Ten quizzes:
4% Chapter 2 quiz (sequences)
4% Chapter 3 quiz (alignment)
4% Chapter 4 quiz (BLAST)
4% Chapter 5 quiz (advanced BLAST)
4% Chapter 6 quiz (RNA)
4% Chapter 7 quiz (microarrays)
4% Chapter 8 quiz (proteomics)
4% Chapter 9 quiz (protein structure)
4% Chapter 10 quiz (multiple alignment)
4% Chapter 11 quiz (phylogeny)
30% find-a-gene project (due October 25)
30% final exam October 25 (in class)

15 Outline for today (chapters 1 and 2)
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature

16 What is bioinformatics?
Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

17 Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction

18 Top ten challenges for bioinformatics
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett

19 On bioinformatics Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.

20 On bioinformatics However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome. Martin Reese and Roderic Guigó, Genome Biology (Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

21 bioinformatics medical informatics Tool-users public health informatics databases algorithms Tool-makers infrastructure

22 Three perspectives on bioinformatics The cell The organism The tree of life Page 4

24 DNA RNA protein phenotype Page 5

25 Time of development Body region, physiology, pharmacology, pathology Page 5

26 After Pace NR (1997) Science 276:734 Page 6

27 DNA RNA protein phenotype

Growth of GenBank: base pairs of DNA (billions), 1982 to 1998 (figure)

28 Growth of GenBank. Axes: year vs. base pairs of DNA (billions) and sequences (millions). Updated : >40b base pairs. Fig. 2.1 Page 17

29 Growth of GenBank, December 1982 to June 2006. Axes: base pairs of DNA (billions) and sequences (millions)

30 Growth of the International Nucleotide Sequence Database Collaboration. Axis: base pairs of DNA (billions); base pairs contributed by GenBank, EMBL, DDBJ

31 Central dogma of molecular biology: DNA -> RNA -> protein. Central dogma of bioinformatics and genomics: genome -> transcriptome -> proteome
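
As a minimal sketch of the DNA to RNA to protein flow on this slide, the Python snippet below transcribes a short coding-strand DNA string into mRNA and translates it. The 12-base sequence, the function names, and the deliberately partial four-codon table are illustrative assumptions, not course material.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        # Raises KeyError for codons missing from the toy table above.
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGAAATGGTAA"  # hypothetical example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

A real analysis would use the full 64-codon table; the toy table only keeps the example short.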

32 DNA -> RNA -> protein -> phenotype. Databases: genomic DNA databases; cDNA, ESTs, UniGene; protein sequence databases. Fig. 2.2 Page 20

33 There are three major public DNA databases: EMBL, GenBank, and DDBJ. The underlying raw DNA sequences are identical. Page 16

34 There are three major public DNA databases
EMBL: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan
Page 16

35 >100,000 species are represented in GenBank
all species   128,941
viruses         6,137
bacteria       31,262
archaea         2,100
eukaryota      87,147
Table 2-1 Page 17

36 Taxonomy nodes at NCBI 8/06

37 The most sequenced organisms in GenBank
Homo sapiens               10.7 billion bases
Mus musculus                6.5b
Rattus norvegicus           5.6b
Danio rerio                 1.7b
Zea mays                    1.4b
Oryza sativa                0.8b
Drosophila melanogaster     0.7b
Gallus gallus               0.5b
Arabidopsis thaliana        0.5b
Updated GenBank release Table 2-2 Page 18

38 The most sequenced organisms in GenBank
Homo sapiens               11.2 billion bases
Mus musculus                7.5b
Rattus norvegicus           5.7b
Danio rerio                 2.1b
Bos taurus                  1.9b
Zea mays                    1.4b
Oryza sativa (japonica)     1.2b
Xenopus tropicalis          0.9b
Canis familiaris            0.8b
Drosophila melanogaster     0.7b
Updated GenBank release Table 2-2 Page 18

39 The most sequenced organisms in GenBank
Homo sapiens                     12.3 billion bases
Mus musculus                      8.0b
Rattus norvegicus                 5.7b
Bos taurus                        3.5b
Danio rerio                       2.5b
Zea mays                          1.8b
Oryza sativa (japonica)           1.5b
Strongylocentrotus purpuratus     1.2b
Sus scrofa                        1.0b
Xenopus tropicalis                1.0b
Updated GenBank release Table 2-2 Page 18

40 National Center for Biotechnology Information (NCBI) Page 24

41 Fig. 2.5 Page 25

42 Fig. 2.5 Page 25

43 PubMed is the National Library of Medicine's search service: 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via Education on the sidebar). Page 24

44 Entrez integrates the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

45 Entrez is a search and retrieval system that integrates NCBI databases Page 24

46 BLAST is the Basic Local Alignment Search Tool: NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day. Page 25

47 OMIM is Online Mendelian Inheritance in Man: a catalog of human genes and genetic disorders, edited by Dr. Victor McKusick and others at JHU. Page 25

48 Books is a searchable resource of on-line books. Page 26

49 TaxBrowser is a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26

50 The Structure site includes the Molecular Modelling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); and the vector alignment search tool (VAST). Page 26

51 Accessing information on molecular sequences Page 26

52 Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

53 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
X02775      GenBank genomic DNA sequence
NT_         Genomic contig
Rs          dbSNP (single nucleotide polymorphism)
RNA
N           An expressed sequence tag (1 of 170)
NM_         RefSeq DNA sequence (from a transcript)
protein
NP_         RefSeq protein
AAC02945    GenBank protein
Q           SwissProt protein
KT7         Protein Data Bank structure record
Page 27

54 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31). Page 27

55 4 ways to access protein and DNA sequences [1] Entrez Gene with RefSeq Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635) Page 27

56 From the NCBI home page, type rbp4 and hit Go revised Fig. 2.7 Page 29

57 revised Fig. 2.7 Page 29

60 By applying limits, there are now just two entries

61 Entrez Gene (top of page) Note that links to many other RBP4 database entries are available revised Fig. 2.8 Page 30

62 Entrez Gene (middle of page)

63 Entrez Gene (bottom of page)

64 Fig. 2.9 Page 32

65 Fig. 2.9 Page 32

66 Fig. 2.9 Page 32

67 FASTA format Fig Page 32
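
A FASTA record is a header line that begins with ">" followed by one or more lines of sequence. As a rough sketch of how such text can be read, the Python function below collects each record into a dictionary; the function name and the example record (including its made-up sequence) are hypothetical and are not taken from the textbook figure.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Hypothetical record for illustration only (not a real database entry).
example = """>example_protein hypothetical record for illustration
MKWVWALLLL
AALGSGRAER
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header.split()[0], len(sequence))
```

The first word of the header is conventionally the identifier (often an accession number), which is why the usage lines split the header on whitespace.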

69 NCBI's important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_
Protein               NP_######   e.g. NP_
Page 29-30

70 NCBI's RefSeq project: accession for genomic, mRNA, protein sequences
Accession   Molecule   Method            Note
AC_         Genomic    Mixed             Alternate complete genomic
AP_         Protein    Mixed             Protein products; alternate
NC_         Genomic    Mixed             Complete genomic molecules
NG_         Genomic    Mixed             Incomplete genomic regions
NM_         mRNA       Mixed             Transcript products; mRNA
NM_         mRNA       Mixed             Transcript products; 9-digit
NP_         Protein    Mixed             Protein products;
NP_         Protein    Curation          Protein products; 9-digit
NR_         RNA        Mixed             Non-coding transcripts
NT_         Genomic    Automated         Genomic assemblies
NW_         Genomic    Automated         Genomic assemblies
NZ_ABCD     Genomic    Automated         Whole genome shotgun data
XM_         mRNA       Automated         Transcript products
XP_         Protein    Automated         Protein products
XR_         RNA        Automated         Transcript products
YP_         Protein    Auto. & Curated   Protein products
ZP_         Protein    Automated         Protein products
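
Because the prefix of a RefSeq accession indicates the molecule type, the table on this slide can be turned into a simple lookup. The sketch below is one possible way to do that in Python; the dictionary copies the Molecule column above, while the function name is a hypothetical choice. The two RBP4 accessions come from the Entrez Gene slide earlier in the lecture, and X02775 is the GenBank (non-RefSeq) example.

```python
# Molecule type for each RefSeq accession prefix, copied from the slide's table.
REFSEQ_MOLECULE = {
    "AC_": "Genomic", "AP_": "Protein", "NC_": "Genomic", "NG_": "Genomic",
    "NM_": "mRNA",    "NP_": "Protein", "NR_": "RNA",     "NT_": "Genomic",
    "NW_": "Genomic", "NZ_": "Genomic", "XM_": "mRNA",    "XP_": "Protein",
    "XR_": "RNA",     "YP_": "Protein", "ZP_": "Protein",
}


def refseq_molecule(accession: str):
    """Return the molecule type for a RefSeq accession, or None if the
    accession does not start with a prefix from the table."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_MOLECULE.get(prefix)


for accession in ("NM_006744", "NP_007635", "X02775"):
    print(accession, refseq_molecule(accession))
```

Returning None for X02775 reflects that it is a GenBank accession rather than a RefSeq one, not that the record is invalid.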

71 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

72 DNA RNA protein complementary DNA (cDNA) UniGene Fig. 2.3 Page 23

73 UniGene: unique genes via ESTs Find UniGene at NCBI: UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21

74 Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1 Fig. 2.3 Page 23

75 Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10
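
Following the definition on the two slides above (cluster size is the number of ESTs assigned to a cluster), the Python sketch below counts cluster sizes and then tallies how many clusters have each size. The EST names and cluster names are invented for illustration; real UniGene identifiers look different.

```python
from collections import Counter

# Hypothetical mapping of ESTs to the cluster (gene) they were assigned to.
est_to_cluster = {
    "est_1": "cluster_A",
    "est_2": "cluster_A",
    "est_3": "cluster_A",
    "est_4": "cluster_B",
    "est_5": "cluster_C",
}

# Cluster size = number of ESTs in each cluster.
cluster_sizes = Counter(est_to_cluster.values())

# Distribution of cluster sizes: how many clusters have a given size.
size_distribution = Counter(cluster_sizes.values())

print(dict(cluster_sizes))
print(dict(size_distribution))
```

A table of cluster sizes versus number of clusters is the second counter computed over a whole UniGene build.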

76 Cluster sizes in UniGene (human) Cluster size (ESTs) Number of clusters 1 42, , , , , , , , ,000-30,000 8 UniGene build 194, 8/06

77 UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on September 18 (gene expression). Page 31

78 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

79 Ensembl to access protein and DNA sequences Try Ensembl at for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

80 click human

81 enter RBP4

83 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 33

84 ExPASy to access protein and DNA sequences ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System) Visit Page 33

85 Fig Page 33

87 Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34

89 Searching for HIV-1 pol: Following the genome link yields a manageable three results Page 34

90 Example of how to access sequence data: HIV-1 pol For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1 ), but these can easily be reduced in two easy steps: --specify the organism, e.g. hiv-1[organism] --limit the output to RefSeq! Page 34
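
The two narrowing steps on this slide can also be written as a single query string. The sketch below is an illustration only: the slide uses the Entrez web page and its Limits feature, whereas this example builds a URL for NCBI's E-utilities esearch service, a programmatic interface swapped in here for the sake of a code example. The helper name is hypothetical, and the use of srcdb_refseq[PROP] as the RefSeq filter is an assumption that should be checked against current NCBI documentation. No network request is made.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "nucleotide",
                      refseq_only: bool = False) -> str:
    """Build (but do not send) an esearch URL for an Entrez query."""
    if refseq_only:
        # Assumed filter syntax for restricting results to RefSeq records.
        term = f"({term}) AND srcdb_refseq[PROP]"
    return f"{EUTILS_ESEARCH}?{urlencode({'db': db, 'term': term})}"


# Step 1: specify the organism; step 2: limit the output to RefSeq.
url = build_esearch_url("hiv-1[organism] AND pol", refseq_only=True)
print(url)
```

Building the URL separately from sending it makes the query easy to inspect before any request goes to NCBI.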

91 only 1 RefSeq over 100,000 nucleotide entries for HIV-1

92 Examples of how to access sequence data: histone
query for histone: # results
protein records
RefSeq entries: 7544
RefSeq (limit to human): 1108
NOT deacetylase: 697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

94 Access to Biomedical Literature Page 35

95 PubMed at NCBI to find literature information

96 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to Page 35

97 MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature. Page 35

100 PubMed search strategies
Try the tutorial ("Education" on the left sidebar)
Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using limits
Try Links to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files):
Page 35

101 Boolean operators (sets 1 and 2)
1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (1,650,000 results)
1 NOT 2: lipocalin NOT disease (530 results)
8/ Fig Page 34
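
The Boolean operators on this slide behave like set operations on the articles matching each term. The Python sketch below shows the correspondence with two small sets of invented article IDs; the numbers are hypothetical and unrelated to the PubMed result counts above.

```python
# Hypothetical article IDs matching each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

and_results = lipocalin & disease  # lipocalin AND disease: in both sets
or_results = lipocalin | disease   # lipocalin OR disease: in either set
not_results = lipocalin - disease  # lipocalin NOT disease: in the first set only

print(sorted(and_results))
print(sorted(or_results))
print(sorted(not_results))
```

AND can only shrink a result list and OR can only grow it, which matches the ordering of the result counts on the slide.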

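102 Search result vs. article contents (8/06)
Search result: globin is found; article contents: globin is present -> true positive
Search result: globin is found; article contents: globin is absent -> false positive (article does not discuss globins)
Search result: globin is not found; article contents: globin is present -> false negative (article discusses globins)
Search result: globin is not found; article contents: globin is absent -> true negative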
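
The four outcomes on this slide are commonly summarized with a few ratios. The sketch below computes sensitivity, specificity, and precision from counts of each outcome; these measures are standard definitions rather than something stated on the slide, and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def search_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Summarize a literature search from its four outcome counts."""
    return {
        # Fraction of relevant articles that the search found.
        "sensitivity": tp / (tp + fn),
        # Fraction of irrelevant articles that the search correctly left out.
        "specificity": tn / (tn + fp),
        # Fraction of retrieved articles that are actually relevant.
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for a "globin" search over 1,000 articles.
metrics = search_metrics(tp=45, fp=5, fn=15, tn=935)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the search finds 45 of the 60 articles that discuss globins, and 45 of the 50 articles it returns are relevant.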

103 WelchWeb is available at http://www.welch.jhu.edu Brian Brown (bbrown20@jhmi.

104 Brian Brown and Carrie Iwema are the Welch Medical Library liaisons to the basic sciences

105 Course sponsors Dept. of Molecular Microbiology & Immunology, and Dept. of Biostatistics, School of Public Health
