Introduction to Bioinformatics. Ulf Leser

Similar documents
Introduction to Bioinformatics. Ulf Leser

Introduction to Bioinformatics Finish. Johannes Starlinger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Introduction to BIOINFORMATICS

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

GREG GIBSON SPENCER V. MUSE

Computational Genomics ( )

Data Mining for Biological Data Analysis

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Introduction to Bioinformatics

Genomes: What we know and what we don t know

CHAPTER 21 LECTURE SLIDES

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

CSE/Beng/BIMM 182: Biological Data Analysis. Instructor: Vineet Bafna TA: Nitin Udpa

Machine Learning. HMM applications in computational biology

M. Phil. (Computer Science) Programme < >

Genomics and Bioinformatics GMS6231 (3 credits)

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Challenging algorithms in bioinformatics

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG

Biology 644: Bioinformatics

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

Bioinformatics: Sequence Analysis. COMP 571 Luay Nakhleh, Rice University

Klinisk kemisk diagnostik BIOINFORMATICS

NCBI web resources I: databases and Entrez

Introduction to 'Omics and Bioinformatics

Analysis of Biological Sequences SPH

Complete draft sequence 2001

Introduction to BIOINFORMATICS

Unit 1: DNA and the Genome. Sub-Topic (1.3) Gene Expression

Motivation From Protein to Gene

Computational Challenges in Life Sciences Research Infrastructures

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

City University of Hong Kong Course Syllabus. offered by Department of Biomedical Sciences with effect from Semester B 2015 /16

BIMM 143: Introduction to Bioinformatics (Winter 2018)

Two Mark question and Answers

CENTER FOR BIOTECHNOLOGY

Genome Reassembly From Fragments. 28 March 2013 OSU CSE 1

Computational methods in bioinformatics: Lecture 1

Computational Genomics. Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall

Introduction to Algorithms in Computational Biology Lecture 1

Introduction to Bioinformatics

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction

Introducing Bioinformatics Concepts in CS1

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Introduction to Bioinformatics

Bioinformatics (Globex, Summer 2015) Lecture 1

11/22/13. Proteomics, functional genomics, and systems biology. Biosciences 741: Genomics Fall, 2013 Week 11

CSC 2427: Algorithms in Molecular Biology Lecture #14

Engineering Genetic Circuits

Ontologies - Useful tools in Life Sciences and Forensics

Logistics. Final exam date. Project Presentation. Plan for this week. Evolutionary Algorithms. Crossover and Mutation

AC Algorithms for Mining Biological Sequences (COMP 680)

The Human Genome Project

Genetics and Bioinformatics

Introduction to Bioinformatics and Gene Expression Technology

Basic Local Alignment Search Tool

3. human genomics clone genes associated with genetic disorders. 4. many projects generate ordered clones that cover genome

Advanced Bioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2018

Ecological genomics and molecular adaptation: state of the Union and some research goals for the near future.

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

Cory Brouwer, Ph.D. Xiuxia Du, Ph.D. Anthony Fodor, Ph.D.

Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized

Transcription: Synthesis of RNA

Introduction to Genome Biology

10/20/2009 Comp 590/Comp Fall

Introduction to Bioinformatics for Computer Scientists. Lecture 2

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Chapter 1. from genomics to proteomics Ⅱ

ESSENTIAL BIOINFORMATICS

DNA & THE GENETIC CODE DON T PANIC! THIS SECTION OF SLIDES IS AVAILABLE AT CLASS WEBSITE

CISC 436/636 Computational Biology &Bioinformatics (Fall 2016) Lecture 1

Office hours: Wednesday from 11:30 to 12:30 and Friday from 11:30 to 12:30. Lectures: Wednesday and Friday 10:15 AM -11:30 AM, Room HB-130

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

Year III Pharm.D Dr. V. Chitra

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

Unit 1 Human cells. 1. Division and differentiation in human cells

MOLECULAR BIOLOGY OF EUKARYOTES 2016 SYLLABUS

MATH 5610, Computational Biology

Microbial Metabolism Systems Microbiology

Computational Methods in Bioinformatics

3.1.4 DNA Microarray Technology

BIOL 461/ 661 Cell Biology 4 Credits Instructor: Dr. Kristin O Brien. T/TH 11:30-1:00, Irving I 208 Office hours: M 9-10, TH 3-4

Why learn sequence database searching? Searching Molecular Databases with BLAST

Applied Bioinformatics

BIOL 205. Fall Term (2017)

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

Chapter 17 Lecture. Concepts of Genetics. Tenth Edition. Regulation of Gene Expression in Eukaryotes

Annotating the Genome (H)

Textbook Reading Guidelines

Capabilities & Services

Microarrays & Gene Expression Analysis

Department of Biochemistry and Molecular Genetics

Bioinformatics for Cell Biologists

Transcription:

Introduction to Bioinformatics Ulf Leser

Bioinformatics 25.4.2003 50. Jubiläum der Entdeckung der Doppelhelix durch Watson/Crick 14.4.2003 Humanes Genom zu 99% sequenziert mit 99.99% Genauigkeit 2008 Genom of J. Watson finished 4 Months, 1.5 Million USD 2010 1000 Genomes Project Ulf Leser: Bioinformatics, Summer Semester 2013 2

Example: Int. Cancer Genome Cons. Large-scale, international endeavor Planned for 50 different cancer types Cancer types are assigned to countries Distributed BioMart-based infrastructure First federated approach to a large int. genome project [HAA+08] Ulf Leser: Bioinformatics, Summer Semester 2013 3

Things you can do with it 2002 2 companies 32 Tests Price: 100 1400 Quelle: Berth, Deutsches Ärzteblatt, 4.10.2002 Ulf Leser: Bioinformatics, Summer Semester 2013 4

State of the Art 6/2010: Gentest-Firma vertauscht DNA-Ergebnisse ihrer Kunden (Nature Blog) 7/2010: US general accounting office compared 15 (4) companies: totally contradicting results Ulf Leser: Bioinformatics, Summer Semester 2013 5

This Lecture Formal stuff A very short introduction in Molecular Biology What is Bioinformatics? And an example Topics of this course Ulf Leser: Bioinformatics, Summer Semester 2013 6

This course Is mandatory for students of Biophysics Bachelor Is open for Bachelor students in computer science Brings 5 SP and will be held as 2+2 Does assume basic knowledge in computer science Will not teach programming you need to know it already Does not assume knowledge in biology Is introductory many topics, often not much depth Visit Algorithmische Bioinformatik afterwards Ask questions! leser (a) informatik.hu berlin Ulf Leser: Bioinformatics, Summer Semester 2013 7

Exercises Taught by Philippe Thomas Registration through Goya There will be 6 assignments We build teams of 2 students No grades System First week: 2-3 presentations of results of previous assignment and discussion of new assignment Next week: Questions You need to pass all but one assignment to be admitted to the exam Ulf Leser: Bioinformatics, Summer Semester 2013 8

Exams Examination will be? If oral: 20-30 minutes per student Day(s) will be set in June Ulf Leser: Bioinformatics, Summer Semester 2013 9

Literature For algorithms Gusfield (1997). Algorithms on Strings, Trees, and Sequences, Cambridge University Press Böckenhauer, Bongartz (2003). Algorithmische Grundlagen der Bioinformatik, Teubner For other topics Lesk (2005). Introduction to Bioinformatics, Oxford Press Cristianini, Hahn (2007). "Introduction to Computational Genomics - A Case Study Approach", Cambridge University Press Merkl, Waack (2009). "Bioinformatik Interaktiv", Wiley-VCH Verlag. For finding motivation and relaxation Gibson, Muse (2001). "A Primer of Genome Science", Sinauer Associates. Krane, Raymer (2003). "Fundamental Concepts of Bioinformatics", Benjamine Cummings. These slides Ulf Leser: Bioinformatics, Summer Semester 2013 10

Web Side Ulf Leser: Bioinformatics, Summer Semester 2013 11

Wissensmanagement in der Bioinformatik Our topics in research Management of biomedical data and knowledge Scientific database systems Text Mining Biomedical data analysis Our topics in teaching Algorithmische Bioinformatik Text Analytics Data Warehousing und Data Mining Informationsintegration Ulf Leser: Bioinformatics, Summer Semester 2013 15

My Questions Diplominformatiker? Bachelor Informatik? Kombibachelor? Biophysik? Other? Semester? Prüfung? Spezielle Erwartungen? Ulf Leser: Bioinformatics, Summer Semester 2013 16

This Lecture Formal stuff on the course A very short introduction in Molecular Biology What is Bioinformatics? Topics of this course Ulf Leser: Bioinformatics, Summer Semester 2013 17

Cells and Bodies App. 75 trillion cells in a human body App. 250 different types: nerve, muscle, skin, blood, Ulf Leser: Bioinformatics, Summer Semester 2013 18

DesoxyriboNucleicAcid DNA: Desoxyribonukleinsäure Four different molecules The DNA of all chromosomes in a cell forms its genome All cells in a (human) body carry the same genome All living beings are based on DNA for proliferation There are always always always exceptions Ulf Leser: Bioinformatics, Summer Semester 2013 19

DesoxyriboNucleicAcid DNA: Desoxyribonukleinsäure Four different molecules (one swapped in RNA) The DNA of all chromosomes in a cell together with the mitochondria-dna forms its genome Almost all cells in a (human) body carry almost the same genome All living beings are based on DNA (or RNA) for proliferation Ulf Leser: Bioinformatics, Summer Semester 2013 20

The Human Genome 23 chromosomes Most in pairs ~3.000.000.000 letters ~50% are repetitions of 4 identical subsequences ~100.000 genes ~56.000 genes ~30.000 genes ~24.000 genes ~20.000 genes Ulf Leser: Bioinformatics, Summer Semester 2013 21

(Protein-Coding) Genes Chromosome RNA mrna Proteine A C G UT A C G UU G AU G AC G AC C A G A G C U U G U A G A G C U U G U Ulf Leser: Bioinformatics, Summer Semester 2013 22

Proliferation Sequence Proteins Networks Organism Ulf Leser: Bioinformatics, Summer Semester 2013 23

Computer Science in Molecular Biology / Medicine Genomics Sequencing Gene prediction Evolutionary relationships Motifs - TFBS Transcriptomics RNA folding Proteomics Structure prediction comparison Motives, active sites Docking Protein-Protein Interaction Proteomics Systems Biology Pathway analysis Gene regulation Signaling Metabolism Quantitative models Integrative analysis Medicine Phenotype genotype Mutations and risk Population genetics Adverse effects Ulf Leser: Bioinformatics, Summer Semester 2013 24

This Lecture Genomics Sequencing Gene prediction Evolutionary relationships Motifs - TFBS Transcriptomics RNA folding Proteomics Structure prediction comparison Motives, active sites Docking Protein-Protein Interaction Proteomics Systems Biology Pathway analysis Gene regulation Signaling Metabolism Quantitative models Integrative analysis Medicine Phenotype genotype Mutations and risk Population genetics Adverse effects Ulf Leser: Bioinformatics, Summer Semester 2013 25

This Lecture Formal stuff on the course A very short introduction in Molecular Biology What is Bioinformatics? And an example Topics of this course Ulf Leser: Bioinformatics, Summer Semester 2013 26

Bioinformatics / Computational Biology Computer Science methods for Solving biologically relevant problems Analyzing and managing experimental data sets Empirical: Data from high throughput experiments Focused on algorithms and statistics Problems are typically complex, data full of errors importance of heuristics and approximate methods Strongly reductionist Strings, graphs, sequences Interdisciplinary: Biology, Computer Science, Physics, Mathematics, Genetics, Ulf Leser: Bioinformatics, Summer Semester 2013 27

History First protein sequences: 1951 Sanger sequencing: 1972 Exponential growth of available data since end of 70 th Bioinformatics is largely data-driven new methods yield new data requiring new algorithms Quelle: EMBL, Genome Monitoring Tables Ulf Leser: Bioinformatics, Summer Semester 2013 28

History 2 First papers on sequence alignment Needleman-Wunsch 1970, Gibbs 1970, Smith-Waterman 1981, Altschul et al. 1990 Large impact of the Human Genome Projekt (~1990) Only 14 mentions of Bioinformatics before 1995 Journal of Computational Biology since 1994 First professorships in Germany: end of 90th First university programs: ~2000 First German book: 2001 Commercial hype: 1999 2004 Ulf Leser: Bioinformatics, Summer Semester 2013 29

A Concrete Example: Sequencing a Genome Chromosomes (yet) cannot be sequenced entirely Instead: Only small fragments can be sequenced But: Chromosomes cannot be cut at position X, Y, Instead: Chromosomes only can be cut at certain subsequences But: We don t know where in a chromosome those subsequences are Sequence assembly problem Ulf Leser: Bioinformatics, Summer Semester 2013 30

Problem Given a large set of (sub)sequences from randomly chosen positions from a given chromosome of unknown sequence Assembly problem: Determine the sequence of the original chromosome Everything may overlap with everything to varying degrees Let s forget about orientation and sequencing errors f 3-80 -50 f 1-10 f 2-60 -40 f4 Ulf Leser: Bioinformatics, Summer Semester 2013 31

Greedy? Take one sequence and compute overlap with all others Keep the one with largest overlap and align Repeat such extensions until no more sequences are left Note: This would work perfectly if all symbols of the chromosome were distinct accgttaaagcaaagatta aagattattgaaccgtt aaagcaaagattattg attattgccagta accgttaaagcaaagatta aaagcaaagattattg aagattattgaaccgtt attattgccagta accgttaaagcaaagatta aaagcaaagattattg attattgccagta aagattattgaaccgtt Ulf Leser: Bioinformatics, Summer Semester 2013 32

Abstract Formulation SUPERSTRING Given a set S of strings Find string t such that (a) s S: s t (all s are substrings of t) (b) t for which (a) holds: : t t Problem is NP-complete ( t ist minimal) Very likely, there is no algorithm that solves the problem in less than k 1 *k 2 2 n operations, where k 1,k 2 are constants and n= S Bioinformatics: Find clever heuristics Solve the problem good enough Finish in reasonable time Ulf Leser: Bioinformatics, Summer Semester 2013 33

Dimension Whole genome shotgun Fragment an entire chromosome in pieces of 1KB-100KB Sequence start and end of all fragments Homo sap.: 28 million reads Drosophila: 3.2 million reads Eukaryotes are very difficult to assemble because of repeats A random sequence is easy Ulf Leser: Bioinformatics, Summer Semester 2013 34

This Lecture Formal stuff on the course A very short introduction in Molecular Biology What is Bioinformatics? And an example Topics of this course Ulf Leser: Bioinformatics, Summer Semester 2013 35

Searching Sequences (Strings) A chromosome is a string Substrings may represent biologically important areas Genes on a chromosome Transcription factor binding sites Similar gene in a different species Exact or approximate string search Ulf Leser: Bioinformatics, Summer Semester 2013 36

Searching a Database of Strings Comparing two sequences is costly Given s, assume we want to find the most similar s in a database of all known sequences Naïve: Compare s with all strings in DB Will take years and years BLAST: Basic local alignment search tool Ranks all strings in DB according to similarity to s Similarity: High is s, s contain substrings that are highly similar Heuristic: Might miss certain similar sequences Extremely popular: You can blast a sequence Ulf Leser: Bioinformatics, Summer Semester 2013 37

Multiple Sequence Alignment Given a set S of sequences: Find an arrangement of all strings in S in columns such that there are (a) few columns and (b) columns are maximally homogeneous Additional spaces allowed Source: Pfam, Zinc finger domain Goal: Find commonality between a set of functionally related sequences Proteins are composed of different functional domains Which domain performs a certain function? Ulf Leser: Bioinformatics, Summer Semester 2013 38

Microarrays / Transcriptomics Referenzarray (Probe) Zellprobe (Sample) Hybridisierung Arrayaufbereitung Scanning TIFF Bild Bilderkennung Rohdaten Ulf Leser: Bioinformatics, Summer Semester 2013 39

Proteomics The real workhorses in a cell are proteins Differential splicing, post-translational modifications, degradation rates, various levels of regulation, But: Much more difficult to study (compared to mrna) Separation of proteins 2D page, GC / LC Identification of proteins Mass-spectrometry Ulf Leser: Bioinformatics, Summer Semester 2013 40

Protein-Protein-Interactions Proteins do not work in isolation but interact with each other Metabolism, complex formation, signal transduction, transport, PPI networks Neighbors tend to have similar functions Interactions tend to be evolutionary conserved Dense subgraphs (cliques) tend to perform distinct functions Are not random at all Ulf Leser: Bioinformatics, Summer Semester 2013 41

Systems Biology Biological networks are more than edges and nodes Example: Metabolic networks Graphs capturing biochemical reactions Can be described quantitatively: N 2 + 3H 2 2NH 3 Can be analyzed (Balanced flux? Elements never produced / consumed? ) Dynamic modeling: Kinetics Ulf Leser: Bioinformatics, Summer Semester 2013 42