BIOINF 4399B Computational Proteomics and Metabolomics

Size: px
Start display at page:

Download "BIOINF 4399B Computational Proteomics and Metabolomics"

Transcription

1 BIOINF 4399B Computational Proteomics and Metabolomics Oliver Kohlbacher & Sven Nahnsen WS 11/12 1. Introduction and Overview

2 Overview Administrative stuff (credits, requirements) Motivation/ quick review of relevant contents of Bioinformatics 2 Overview of the contents of this lecture Proteomics and Metabolomics Computational mass spectrometry

3 Course Requirements To pass this course you must: regularly and actively participate in the weekly problem sessions, pass the final exam and assignments You have to work on assignments alone (no groups!)

4 Course Credits & Grading Credits MSc Bioinfo: 4 LP, module Wahlpflichtbereich Bioinformatik MSc Info: 4 LP, area Wahlpflichtbereich Informatik Diplom: 2+2 SWS Prakt. Informatik (passing required) Grade 40% assignments 60% finals Finals: oral exam (30 minutes) covering the contents of the whole lecture and the assignments Finals will be scheduled at the end of the semester (in the week of February 7)

5 Assignments One assignment every week One week for each assignment, hand in: via before the problem sessions on Tuesdays Work alone on assignments! Assignments will comprise theoretical tasks, as well as programming tasks Schedule 1 st assignment: online: Oct. 17; printed: Oct. 18; due: Oct. 25

6 Recommended Software OpenMS/ TOPP A software library for mass spectrometry

7 Contact Questions concerning the lecture/assignments Website abi.inf.uni-tuebingen.de/teaching/ss12/cpm Timo Sachsenberg (Sand 14, C322) Mathias Walzer (Sand 14, C 304) Sven Nahnsen (Sand 14, C322, please send first)

8 The central dogma of molecular biology First articulation by Francis Crick in 1956 Published in Nature in 1970 Origin of the Central Dogma of Molecular Biology (Francis Crick, 1956)

9 The central dogma classical view In general, the classic view reflects how biology is (biological data are) organized Genomics enabled a more complex view Barry, P Genome 2.0: Mountains of new data are challenging old views. Science News 172(10):154 (week of Sept. 8). The RNA revolution: Biology's Big Bang. The Economist, Jun 14th 2007 Gerstein et al., What is a gene, post-encode? History and updated definition. Genome Research 17(6): RNA editing can lead to protein sequences that are very different from the initial DNA (Li, M. et al. Science doi: /science (2011)) National Human Genome Research Institute (NHGRI)

10 Oltvai-Barabasi, Science, 2002 Reminder (Bioinformatics 2) Systems Biology

11 Reminder (Bioinformatics 2) Systems Biology Quantitative data on various levels of biological complexity build fundaments of systems biology Mathematical modeling has been based on gene expression Recent important technological improvements allow the analysis of protein and metabolite profiles to a great depth Important layers for understanding biology New experimental techniques offer tremendous challenges for computational analysis

12 Aims of systems biology Describe large-scale organization Quantitative modeling Describe cell as system of networks Fundamental research: time-resolved quantitative understanding of living systems Medicine: enable personalized medicine (e.g., improve treatment strategies for cancer patients) Biotechnology: improve production, degradation, construction of synthetic organisms, etc.

13 Exp. Methods Transcriptomics Extract and amplify RNA Hybridization on microarray Identify and quantify by fluorescence signal Sequences can be mapped back to genome Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803

14 Microarray Data Analysis Key problems in microarray data analysis are Data normalization Clustering Dimension reduction Diagnostics/classification Network inference Visualization of results Janko Dietzsch, Nils Gehlenborg and Kay Nieselt. Mayday-a microarray data analysis workbench. Bioinformatics (8):

15 Genome sequencing February 15, 2001 February 16, 2001

16 Genome sequencing 2001: initial publication 2003: 2 nd draft Human Genome > 13 years of work and > 3*10 9 $ 2010: 8 days 1*10 4 $ Future: within 3 years Biotech company (Pacific Biosciences) expects similar amount of data in < 15 min for < 1*10 3 $

17 Status genomics/transcriptomics Dramatic drop in cost for genome sequencing Number of sequenced genomes grows continuously Genome is a very static snapshot of living system Biological adaption is rather slow; long-term information storage Proteins and their reaction products, metabolites are much closer to reality Genome and transcriptome databases are essential bases for proteomics and metabolomics research

18 Genomics vs. Proteomics Genomics Genomes rather static Proteomics Proteomes are dynamic (age, tissue, breakfast, ) ~ 20 k genes established technology (capillary sequencer) up to 1000 k proteins emerging technologies (MS, HPLC/MS, protein chips)

19 Main fields of proteomics protein interaction protein characterization (identification + PTMs)? protein localization 0.0 protein expression

20 Applications of proteomics protein interaction Drug target identification protein characterization (identification + PTMs) Determine content of a protein mixture Understanding regulation of protein activity Functional annotation (compartment and function) Drug target identification? Gene annotation Therapeutic markers Drug target identification protein localization protein expression

21 Exp. Methods Proteomics Compare two proteomes (e.g. healthy/diseased) Separate using 2D-PAGE (w.r.t. molecular mass, pi) Excise protein spots from the gel Tryptic digest of the proteins Identify proteins using mass spectrometry and Database search Lindsay, Nature Rev. Drug Discovery, 2003, 2, 803

22 HPLC-MS I HPLC ESI TOF RT Spectrum (scan) Separation 1 separate peptides by their retention time on column Ionization electrospray, transfers charge to the peptides Separation 2 MS separates by mass-to-charge ratio (m/z)

23 Intensity Mass Spectrometry Peak area proportional to peptide concentration mass spectrometry measure a peptide s mass-to-charge ratio m/z

24 Proteomics: Database Search Identification of mass spectra is easily done through database search Search all peptides of matching mass from a database Construct a theoretical mass spectrum for these peptide candidates Score against the experimental spectrum??? Sequence DB

25 Exp. Metabolomics Extract all metabolites/ small molecules, usually < 800 Da Separate homogenous collection of analytes (lipids, di- or tripeptides, phospholipids, sugars, etc.) Identify and quantify the analytes

26 Exptl. Metabolomics Nicholson and Lindon. Nature 2008, 455,

27 Metabolic Networks

28 Metabolomics: Simulation

29 This lecture uantitative mass spectrometry System-wide biological data on proteome and metabolome level Computational mass spectrometry

30 This lecture Basics of Proteomics/ Metabolomics Basics of chromatography and mass spectrometry Computational mass spectrometry Algorithms for peptide/ protein quantification and identification Algorithms for metabolite identification and quantification Applications, e.g., biomarkers and complete proteomes Open research questions, e.g., clinical translation

31 Proteomics Studying the proteome Proteome:=

32 Proteomics Studying the proteome Proteome:= All proteins that are expressed in a given organism, tissue or cell at a given state and time

33 Proteomics Studying the proteome Proteome:= All proteins that are expressed in a given organism, tissue or cell at a given state and time Goal of studying proteomes: understand the function of all proteins in a biological system Large databases have been established, e.g., the Gene Ontology Consortium ( catalogues all proteins by their molecular function, biological process and cellular compartment

34 Protein A protein or polypeptide consists of a linear chain of amino acids that build 3-dimensional structures Amino acids are connected via peptide bonds Peptide bonds N-terminus H 2 N H O C C R 1 NH C C NH C C NH C C R 3 R 2 O H O H O R 4 OH C-terminus

35 Protein There are some problematic issues on defining a protein Protein identity: unique amino acid sequence and single source of origin? There may be different genes encoding the identical amino acid sequence Different organisms may encode identical proteins Splice variants: A gene can give rise to different mrnas Polymorphisms: many genes occur in allelic variants encoding sequence variations Posttranslational modifications: PTMs are very heterogeneous and significantly alter the function of the protein

36 Metabolomics Studying the metabolome Metabolome:=

37 Metabolomics Studying the metabolome Metabolome:= The metabolome refers to the complete set of small-molecule metabolites (such as metabolic intermediates, hormones and other signaling molecules, and secondary metabolites) to be found within a biological sample, such as a single organism. (

38 Technologies Modern Proteomics and Metabolomics studies are based on Liquid chromatography (LC) - Mass spectrometry (MS)

39 (Liquid) chromatography Mobile phase liquid, stationary phase is usually solid Analytes are held back on a column Mobile phase is pumped over the column Analytes continously separate and elute from the column according to specific properties (e.g. hydrophobicity) Other chromatography (e.g. gas chromatography) techniques will also be mentioned

40 HPLC (High Performance Liquid Chromatography) pump column (stationary phase) detector mobile phase retention time (RT)

41 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

42 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

43 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

44 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

45 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

46 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

47 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

48 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

49 HPLC (High Performance Liquid Chromatography) injection valve pump analyte mixture eluent column detector

50 Mass spectrometry Mass spectrometry (MS) is an analytical technique to measure the mass (or more precisely: mass-to-charge ratio, m/z) of an analyte MS has a long history in physics and chemistry and today the key technology in proteomics and metabolomics soft ionization methods enable its application in the bio(-analytical) sciences For OMICS analyses MS is usually coupled to a second separation technique (e.g. LC for proteomics and LC/GC for metabolomics) There are various types of mass spectrometers (see 3 rd lecture)

51 Mass spectrometry Ionization techniques Electrospray ionization (ESI) Matrix Assisted Laser Desorption/ Ionisation (MALDI) Reflector time-of-flight (TOF) Quadrupole time of-flightc time-of-flight time-of-flight (TOF-TOF) Ion trap Triple Quadrupole Fourier transform Ioncyclotron resonance Orbitrap Electron multiplier Mass analyzers Mass detector Modified from Aebersold and Mann, Nature, 2003

52 Challenges in computational MS Huge data sets (up to TBs per experiment) Ambiguity in protein identification Uncertainty in proteome size Ambiguity of masses for small molecules Peak picking/ Feature Finding Map alignment/ Quantification Peptide/ Protein Identification upstream Metabolite identification Statistical analysis Enrichment analysis Analysis of time course data downstream Data integration

53 Peak Picking raw data sticks Identify peaks Integrate peaks to sticks

54 Quantification Determine volume of each feature in a map

55 Quantification RT: elution profile feature model m/z: Isotopic pattern Quantification as a 3D signal detection problem

56 Map alignment Correct for retention time offset and distortions in label-free experiments

57 Peptide Identification LC-MS/MS experiment Experimental spectra Fragment m/z values RT m/z [%] m/z Compare Score hits Theoretical spectra QRESTATDILQK EIEEDSLEGLKK GIEDDLMDLIKK [%] Q9NSC5 HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human) MSTAREQPIFSTRAHVFQIDPATKRNWIPAGKHALTVSYFY DATRNVYRIISIGGAKAIINSTVTPNMTFTKTSQKFGQWDS RANTVYGLGFASEQHLTQFAEKFQEVKEAARLAREKSQD GGELTSPALGLASHQVPPSPLVSANGPGEEKLFRSQSADA PGPTERERLKKMLSEGSVGEVQWEAEFFALQDSNNKLAG ALREANAAAAQWRQQLEAQRAEAERLRQRVAELEAQAAS EVTPTGEKEGLGQGQSLEQLEALVQTKDQEIQTLKSQTGG PREALEAAEREETQQKVQDLETRNAELEHQLRAMERSLEE ARAERERARAEVGRAAQLLDVSLFELSELREGLARLAEAAP Sequence db [%] [%] m/z m/z Theoretical fragment m/z values from suitable peptides m/z

58 Protein inference Nesvizhskii, Molecular and Cellular Proteomics, 2005

59 Metabolite identification Kind and Fiehn. BMC Bioiformatics 2006, 7:235

60 Preliminary Schedule Date Oct 11 Oct 18 Oct 25 Nov 1 Nov 8 Nov 15 Nov 22 Nov 29 Dec 6 Dec 13 Dec 20 Dec 27 / Jan 3 Jan 10 Jan 17 Jan 24 Jan 31 Week of Feb 7 Topic Today s overview Proteomics and Metabolomics Physics and chemistry of LC-MS ALLERHEILIGEN Lab Excursion: Proteome Center Tübingen Basic statistics for computational MS Protein/ Metabolite quantification I Protein/ Metabolite quantification II Protein/ Metabolite quantification III Peptide ID I Peptide ID II CHRISTMAS BREAK Protein ID Posttranslational Modifications Metabolite ID Summary/ repetition Exams

61 Textbooks Eidhammer, Flikka, Martens, Mikalsen: Computational methods for mass spectrometry proteomics, Wiley, 2007 Good introduction: Biochemical basics Mass spectrometry Algorithms for protein identification/ quantification

62 Important papers Nature Reviews, Molecular Cell Biology, 2004

63 Important papers Nature, 2003

64 Important papers Nature Methods, 2007

65 Nesvizhskii, Molecular and Cellular Proteomics, 2005

66 Important papers

67 Important papers

68 Important papers

69 Materials Script will be made available as a printout at the beginning of each class Additional materials (papers, literature) available on the course web site or can be requested via nahnsen@informatik.uni-tuebingen.de course web site: Textbooks are available in the library in the section dedicated to this lecture (Handapparat)