Bioinformatics. Joshua Gilkerson Albert Kalim Ka-him Leung David Owen


1 Bioinformatics Joshua Gilkerson Albert Kalim Ka-him Leung David Owen 1

2 What is Bioinformatics? Bioinformatics: The collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics. (Dictionary.com) Molecular genetics: The branch of genetics that deals with the expression of genes by studying the DNA sequences of chromosomes. (Dictionary.com) 2

3 What is Bioinformatics? (cont.) Another definition of molecular genetics: The branch of genetics that deals with hereditary transmission and variation on the molecular level. (Dictionary.com) Genomics: A branch of biotechnology concerned with applying the techniques of genetics and molecular biology to the genetic mapping and DNA sequencing of sets of genes or the complete genomes of selected organisms using high-speed methods, with organizing the results in databases, and with applications of the data (as in medicine or biology). (Dictionary.com) 3

4 How old is the discipline? The answer to this one depends on which source you choose to read. From T K Attwood and D J Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN ]: "The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data." 4

5 How old is the discipline? (cont.) From Mark S. Boguski's article in the "Trends Guide to Bioinformatics", Elsevier, Trends Supplement 1998, p. 1: "The term 'bioinformatics' is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing..." 5

6 Bioinformatics Research up to 2005: DNA sequence, Gene expression, Protein expression, Protein structure, Genome mapping, Metabolic networks, Regulatory networks, Trait mapping, Gene function analysis, Scientific literature 6

7 What remains to be done? Comparative genomics; description of mRNAs and proteins (identity and structure); functional analyses; detailed understanding of development, regulation, and variation 7

8 The Human Genetic Code 8

9 Bioinformatics Activity: Where Is Bioinformatics Done? The biggest and best source of bioinformatics links is the Genome Web at the Rosalind Franklin Centre for Genomics Research at the Genome Campus near Cambridge, United Kingdom. Others: Research Centers, Sequencing Centers, and "Virtual" Centers (for example consortia and communities). 9

10 Research Centers Centro Nacional de Biotecnologia (CNB), Madrid, Spain. Computational Biology and Informatics Laboratory at the University of Pennsylvania, Philadelphia, USA. CIRB: Centro Interdipartimentale di Ricerche Biotecnologiche, Bologna, Italy. Cold Spring Harbor Labs, New York, USA. European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. Généthon, France. GIRI: Genetic Information Research Institute, California, USA. MRC Human Genetics Unit, Edinburgh, United Kingdom. MRC Rosalind Franklin Centre for Genomics Research (RFCGR), Hinxton, United Kingdom. 10

11 Sequencing Centers The Department of Genome Analysis at the Institute of Molecular Biotechnology, Jena, Germany. The Australian Genome Research Facility, Australia. Baylor College of Medicine, USA. Michael Smith Genome Sciences Centre, Canada. 11

12 Virtual Centers International Center for Cooperation in Bioinformatics network (ICCBnet); Belgian EMBnet node 12

13 Online Resources: What Bioinformatics Websites Are There? Blogs, Information, Directories, Portals, Societies, Tools, Tutorials 13

14 Blogs Bioinformatics.Org is a bioinformatics blog. The Bio-Web links to resources online for molecular and cell biologists and covers current news in various biological/computational fields. Genehack is one of the first bioinformatics blogs. 14

15 Information The Australian National Genomic Information Service (ANGIS) is operated by the Australian Genomic Information Centre ( html#agic, currently at the University of Sydney) to offer software, databases, documentation, training and support for biologists. "The University of Maryland AgNIC gateway is a guide to quality agricultural biotechnology information on the Internet." 15

16 Directories Christy Hightower, Engineering Librarian at the Science and Engineering Library, University of California Santa Cruz has already done this better than me. Visit her excellent article ( winter/internet.html) about bioinformatics Net resources in Issues in Science and Technology Librarianship. 16

17 Societies Humberto Ortiz Zuazaga kindly introduced The International Society for Computational Biology, which he points out "has links to programs of study and online courses in computational biology and to job postings". 17

18 Collection of Tools Bioinformatics.Org, for a collection of bioinformatics tools. The Rosalind Franklin Center's GenomeWeb. Of historical interest only now is the legendary "Pedro's Molecular Biology Search and Analysis Tools" ( ch_tools.html), which provides a collection of WWW Links to Information and Services Useful to Molecular Biologists. 18

19 Portals Bioinformatics.Org is an international organization which promotes freedom and openness in the field of bioinformatics and is the root domain of a damned fine Website. CCP11 (Collaborative Computational Project 11) is another product of the UK's Genome Campus. CCP11 is funded by the BBSRC and is hosted at the MRC Rosalind Franklin Center for Genomics Research (RFCGR), located on the Wellcome Trust Genome Campus, Cambridge. Jennifer Steinbachs runs compbiology.org, which is a general computational biology site as well as being a portal to her own work. BioPlanet is well worth visiting. It describes itself as "a not-for-profit site, funded with our resources, for [its users'] benefit." ColorBasePair is a densely packed portal with lots of bioinformatics links. 19

20 Genome Project Ka-Him Leung 20

21 Genomics Genome: the complete set of genetic instructions for making an organism. Genomics: attempts to analyze or compare the entire genetic complement of a species. 21

22 Genomic Issues Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T). DNA forms the double helix by pairing with its reverse complement (A-T, G-C). Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns. A gene is copied into complementary RNA in a process called transcription (U substitutes for T). 22
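
The pairing and transcription rules on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the input to transcribe is assumed to be the template strand written 5' to 3'.

```python
# Watson-Crick pairing rules from the slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the strand that pairs with `dna`, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(template: str) -> str:
    """Return the RNA complementary to a DNA template strand (U replaces T)."""
    return reverse_complement(template).replace("T", "U")


print(reverse_complement("AACG"))  # CGTT
print(transcribe("TTAGGCAT"))      # AUGCCUAA
```

Introns and exons are ignored here; the sketch only shows base pairing and the T-to-U substitution.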

23 Genomic Issues (cont.) DNA sequencing is the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes. In the human genome, about 3 billion bases are arranged along the chromosomes in a particular order for each unique individual. One million bases (called a megabase and abbreviated Mb) of DNA sequence data is roughly equivalent to 1 megabyte of computer data storage space. Since the human genome is 3 billion base pairs long, about 3 gigabytes of computer data storage space are needed to store the entire genome. 23
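
The storage estimate above assumes one byte (one text character) per base. A small sketch of the arithmetic follows; the helper name and the 2-bit packing are assumptions added for comparison and are not from the slides.

```python
def storage_bytes(num_bases: int, bits_per_base: int = 8) -> int:
    """Bytes needed to store num_bases at the given density, rounded up."""
    return (num_bases * bits_per_base + 7) // 8


HUMAN_GENOME_BASES = 3_000_000_000  # approximate figure quoted on the slide

as_text = storage_bytes(HUMAN_GENOME_BASES)    # one ASCII character per base
packed = storage_bytes(HUMAN_GENOME_BASES, 2)  # hypothetical 2 bits per base

print(as_text / 1e9, "GB as plain text")
print(packed / 1e6, "MB with 2-bit packing")
```

With four possible bases, two bits per base are enough in principle, which is why the packed figure is a quarter of the plain-text one.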

24 Different Genomics Comparative Genomics: the management and analysis of the millions of data points that result from genomics. Functional Genomics: ways of identifying gene functions and associations. Structural Genomics: emphasizes high-throughput, whole-genome analysis. 24

25 History of Genome 1977: First complete genome sequence for an organism is published, bacteriophage ΦX174: 5,386 base pairs coding nine proteins (~5 kb). 1995: First bacterial genome (Haemophilus influenzae) sequenced (1.8 Mb). 1996: Saccharomyces cerevisiae genome sequenced (baker's yeast, 12.1 Mb). 1997: E. coli genome sequenced (4.7 Mbp). 1999: Sequence of first human chromosome completed. 2000: A. thaliana genome (flowering plant) (100 Mb); D. melanogaster genome (fruit fly) (180 Mb). ,000 full-length human cDNAs sequenced. 2003: Human genome sequence completed. 25

26 Human Genome Project The U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health. Its goal was to complete the mapping and understanding of all the genes of human beings. In June 2000, scientists completed the first working draft of the human genome. A high-quality, "finished" full sequence was completed in April 2003.

27 Goals of HGP identify all the approximately 20,000-25,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project. 27

28 DNA Sequencing Process Mapping: identify a set of clones that span the region of the genome to be sequenced. Library creation: make sets of smaller clones from the mapped clones. Template preparation: purify DNA from the smaller clones; set up and perform sequencing chemistries. Gel electrophoresis: determine sequences from the smaller clones. Pre-finishing and finishing: specialty techniques to produce high-quality sequences. Data editing and annotation: quality assurance; verification; biological annotation; submission to public database. 28


30 Future of HGP The HGP is the first step in understanding humans at the molecular level. Work is still ongoing to determine the function of many of the human genes. What still needs to be done: gene number, exact locations, and functions; gene regulation; DNA sequence organization; chromosomal structure and organization; noncoding DNA types, amount, distribution, information content, and functions; coordination of gene expression, protein synthesis, and posttranslational events; interaction of proteins in complex molecular machines; predicted vs. experimentally determined gene function; evolutionary conservation among organisms; protein conservation (structure and function); proteomes (total protein content and function) in organisms. 30


32 Sequence Alignment Joshua Gilkerson 32

33 Sequence Alignment In genomics, many situations arise when sequences need to be compared or searched for similar sub-sequences. Both of these tasks are aided by aligning the sequences to one another. The two sequences are called the subject and the query. 33

34 Local vs. Global Global alignment aligns the entire query to the entire subject. Local alignment aligns a piece of one sequence to a piece of the other. Which is used depends on the application. Surprisingly, these are computationally equivalent. Sometimes mixed local-global alignments are used, aligning the entire query sequence against any one part of the subject. 34

35 Example Alignments
Global Alignment:
AGCTCGA--GATTGCTGGACATGCTGCTGCT
A--TCGAGCGATTGC-----ATGCAGCTGCT
Local Alignment (same subject as above; query sequence: GAGAT):
AGCTCGAGATTGCTGGACATGCTGCTGCT
AGAT GAGAT GAGAT 35

36 Model for Alignment The best alignment is the one, chosen from all possible alignments, that minimizes the score. Scoring is done pairwise at each position along the alignment. Introducing a gap is more expensive than extending one already introduced (affine gap penalty). 36

37 Model for Alignment Score = gap penalties + similarity weights. Gap penalty = open penalty + size * size penalty. Open penalty and size penalty are constants >= 0. Similarity weight is zero for the same base, >= 0 for disparate bases. BLOSUM similarity weights are most commonly used for protein sequences. 37

38 Scoring Example Same example as earlier, using: gap opening penalty of 1, gap size penalty of 1, similarity scores all 1 (for disparate bases).
AGCTCGA--GATTGCTGGACATGCTGCTGCT
A--TCGAGCGATTGC-----ATGCAGCTGCT
Score = 3 gap openings + 9 gapped positions + 1 mismatch = 13 38
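
The score of 13 can be checked with a short Python sketch of the affine gap model from the previous slides. The function name and parameter names are hypothetical; the penalties are the ones stated in the example.

```python
def alignment_score(row_a: str, row_b: str,
                    gap_open: int = 1, gap_size: int = 1,
                    mismatch: int = 1) -> int:
    """Score two aligned rows (with '-' for gaps); lower is better."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            if gap_row != prev_gap:
                score += gap_open      # a new gap starts here
            score += gap_size          # every gapped column pays the size penalty
            prev_gap = gap_row
        else:
            if a != b:
                score += mismatch      # identical bases cost nothing
            prev_gap = None
    return score


top = "AGCTCGA--GATTGCTGGACATGCTGCTGCT"
bottom = "A--TCGAGCGATTGC-----ATGCAGCTGCT"
print(alignment_score(top, bottom))  # 13
```

The three gaps have lengths 2, 2, and 5, so they contribute 3 + 9 = 12, and the single T/A mismatch contributes the remaining 1.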

39 Needleman-Wunsch Algorithm Sequences Q and S. Scoring matrix M of size (len(Q)+1) x (len(S)+1). Similarity matrix s. Gap length penalty g; opening penalty 0. M(i,j) is the score for the best alignment of the first i elements of Q and the first j elements of S. M(i,j) = min( M(i-1,j) + g, M(i,j-1) + g, M(i-1,j-1) + s(Q(i),S(j)) ) 39
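
A minimal Python sketch of this recurrence follows. It assumes a similarity weight of 0 for identical bases and a flat weight of 1 for any mismatch; those values and the function name are assumptions for illustration, not the similarity matrix used in the slides' own example.

```python
def needleman_wunsch_score(q: str, s: str, g: int = 1, mismatch: int = 1) -> int:
    """Minimal global alignment score of q and s under a linear gap penalty g."""
    rows, cols = len(q) + 1, len(s) + 1
    M = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap penalty per element.
    for i in range(1, rows):
        M[i][0] = i * g
    for j in range(1, cols):
        M[0][j] = j * g
    for i in range(1, rows):
        for j in range(1, cols):
            weight = 0 if q[i - 1] == s[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j] + g,           # gap in s
                          M[i][j - 1] + g,           # gap in q
                          M[i - 1][j - 1] + weight)  # pair q[i-1] with s[j-1]
    return M[-1][-1]


print(needleman_wunsch_score("CAT", "TAG"))
```

Recovering the alignment itself, rather than only its score, would additionally require backtracking from the last cell along the choices that produced each minimum.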

40-42 Needleman-Wunsch Example CAT vs TAG, g=1. [Tables: similarity matrix s over A, C, T, G (left) and scoring matrix M for CAT vs TAG (right), filled in step by step across three slides.]

43 Needleman-Wunsch Example Two equally good alignments: -CAT over T-AG, and C-AT over -TAG 43

44 Needleman-Wunsch Runs in O(n^2) time. Easily generalized to allow a gap opening penalty by using 3 copies of M: one for prefixes ending with a match, and one each for prefixes ending with a gap in either sequence. Easily generalized to local alignment by letting M(i,j) be the best score for an alignment of some suffixes of the prefixes ending at i and j. In practice, this means: The first row and column are filled with all zeroes instead of just the top-left-most position. The end of the alignment is at the globally minimal position, not the lower-right corner. The beginning is at the location where backtracking cannot continue. 44

45 Other Alignment Tools The Basic Local Alignment Search Tool (BLAST) is probably the most widely used tool in genomics. Finds local alignments. Used on very large sequences (entire genomes) Smith-Waterman Algorithm - Adaptation of Needleman-Wunsch for local alignments. FASTA package 45
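
As a rough illustration of the Smith-Waterman idea, the sketch below uses the conventional maximizing formulation (higher is better, cell values clamped at zero) rather than the minimizing scores used on the earlier slides. The scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions chosen for the example.

```python
def smith_waterman(query: str, subject: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best local score, (i, j)) where the best local alignment ends."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh alignment here
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


subject = "AGCTCGAGATTGCTGGACATGCTGCTGCT"
score, (i, j) = smith_waterman("GAGAT", subject)
print(score, subject[j - 5:j])  # 10 GAGAT
```

The alignment ends at the best-scoring cell anywhere in the matrix, and backtracking from there stops at the first zero, which mirrors the description of local alignment given for Needleman-Wunsch above.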

46 The Importance of Bioinformatics and Summary David Owen 46

47 The importance of bioinformatics Traditionally, molecular biology research was done entirely in a laboratory. But the genome projects have increased the amount of data enormously. Thus researchers need to incorporate computers to make sense of the vast amount of data. 47

48 Challenges Intelligent and efficient storage of the massive data. Easy and reliable access to the data. Development of tools which allow the extraction of meaningful information. The developer of the tool must also consider the following: The user (biologist) might not be an expert with computers. The tool must be able to provide access across the internet. 48

49 Processes Three main processes a bioinformatics tool must handle: DNA sequence determines protein sequence; protein sequence determines protein structure; protein structure determines protein function. The information obtained from these processes allows us to better understand the biology of organisms. 49
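
The first of these processes, DNA sequence determining protein sequence, can be sketched as a codon lookup. The example assumes the standard genetic code and a coding-strand DNA sequence read in frame from its first base; the function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed for codons in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate coding-strand DNA into one-letter amino acids, up to the first stop."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # MAW
```

The later two steps, from protein sequence to structure and from structure to function, have no comparably simple lookup, which is part of why they appear among the research areas on the following slides.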

50 Computer Scientist vs. Biologist Computer scientist: logic, problem-solving, process-oriented, algorithmic, optimizing. Biologist: knowledge gathering, experimentally focused, exceptions are as common as rules, describes work as a story, develops conclusions and models. Hence the need for communication between computer scientists and biologists. 50

51 Research Areas Further research areas include: sequence alignment, protein structure prediction, prediction of gene expression, protein-protein interactions, modeling of evolution 51

52 Future of Bioinformatics - Integration of a wide variety of data sources. E.g. combining GIS data (maps) and weather systems with crop health and genotype data allows us to predict successful outcomes of agricultural experiments. - Large-scale comparative genomics. E.g. the development of tools that can do 10-way comparisons of genomes. - Modeling and visualization of full networks of complex systems. 52

53 Ultimate Goal Obtain a better understanding of the biology of organisms through the examination of biological information hidden in the vast amount of data we have. This knowledge will allow us to improve our standard of life. 53
