Proteomics: A Challenge for Technology and Information Science. What is proteomics?

CBCB Seminar, November 21, 2005
Tim Griffin, Dept. of Biochemistry, Molecular Biology and Biophysics
tgriffin@umn.edu

What is proteomics?
"Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function." (Stan Fields in Science, 2001)

Genomics vs. Proteomics
Similarities: large datasets; tools needed for annotation and interpretation of results.
Differences:
Genomics: generally mature technologies and data processing methods; the questions asked usually involve quantitative changes in RNA transcripts (microarrays).
Proteomics: still evolving; protein biochemical properties are complex (expression changes, modifications, interactions, activities), so there are many questions to ask and much data to interpret; methods are changing and approaches differ (mass spec, arrays, etc.).

Genomics, Proteomics, and Systems Biology
[Figure: genomics, proteomics, and computational biology laid out along genomic DNA, mRNA, protein products, and functional protein system, with maturity labels mature, prototype, and emerging. Technology labels: sequencing, arrays, 3D structure, quantitative profiling, protein cataloguing, catalytic activity, subcellular location, protein modifications, protein dynamics, protein phosphorylation, descriptive protein interaction maps. Goals: identify system components, measure and define properties, interactions between components.]

Shotgun identification of proteins in mixtures by LC-MS/MS
Liquid chromatography coupled to tandem mass spectrometry (MS/MS)
Protein(s) -> peptides -> peptide fragments
Steps: digestion; µLC separation (50-100 µm); ionization (MALDI or electrospray); isolation; fragmentation; mass analysis
Result: tandem mass spectrum (thousands in a matter of hours)

Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14
[Figure: MS/MS spectrum, relative abundance vs. m/z, 200 to 1200]
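The b- and y-series for the example peptide can be computed directly from residue masses. The sketch below is a minimal illustration, assuming singly charged fragments and standard monoisotopic masses; the function name `fragment_ions` is hypothetical and not part of the original slides.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, carried by the C-terminal (y) fragments
PROTON = 1.007276   # charge carrier for a 1+ ion


def fragment_ions(peptide):
    """Return ({i: b_i m/z}, {i: y_i m/z}) for singly charged fragments."""
    masses = [MONO[aa] for aa in peptide]
    b, y = {}, {}
    prefix = 0.0
    for i, mass in enumerate(masses, start=1):            # N-terminal prefixes
        prefix += mass
        b[i] = prefix + PROTON
    suffix = 0.0
    for i, mass in enumerate(reversed(masses), start=1):  # C-terminal suffixes
        suffix += mass
        y[i] = suffix + WATER + PROTON
    return b, y


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i in sorted(y_ions):
    print(f"b{i:<2d} {b_ions[i]:9.3f}   y{i:<2d} {y_ions[i]:9.3f}")
```

Adjacent ions in one series differ by a single residue mass, which is what lets the sequence be read off the spectrum.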

Peptide sequence identifies the protein
[Figure: MS/MS spectrum (relative abundance vs. m/z, 200 to 1200) of H2N-NSGDIVNLGSIAGR-COOH, annotated with the fragment ladder GDIVNLGSIAGR, DIVNLGSIAGR, IVNLGSIAGR, VNLGSIAGR, NLGSIAGR, LGSIAGR, GSIAGR, SIAGR, IAGR, AGR, GR, R]
Match: YMR134W, yeast protein involved in iron metabolism

High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum -> protein sequence and/or DNA sequence database search -> peptide sequence match -> protein identification
Direct identification of 1000 proteins from complex mixtures
[Figure: the raw MS/MS spectrum and the same spectrum annotated with the NSGDIVNLGSIAGR fragment ladder]
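The database search step can be sketched as: digest each database protein in silico, keep peptides whose mass matches the precursor, and score each candidate by how many of its theoretical fragments appear in the spectrum. This is a simplified illustration only. The shared-peak count is a stand-in for the scoring functions of real search engines, the two database sequences are hypothetical toy entries (not the real YMR134W sequence), and all function names and tolerances are assumptions.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def b_ions(peptide):
    total, ions = 0.0, []
    for aa in peptide:
        total += MONO[aa]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    total, ions = 0.0, []
    for aa in reversed(peptide):
        total += MONO[aa]
        ions.append(total + WATER + PROTON)
    return ions


def shared_peak_count(observed, theoretical, tol=0.5):
    """Number of theoretical fragments with an observed peak within tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def search(observed, precursor_mass, database, mass_tol=1.0, frag_tol=0.5):
    """Return (score, peptide, protein) hits, best score first."""
    hits = []
    for name, sequence in database.items():
        for pep in tryptic_peptides(sequence):
            if abs(peptide_mass(pep) - precursor_mass) > mass_tol:
                continue  # precursor mass filter
            theoretical = b_ions(pep) + y_ions(pep)
            hits.append((shared_peak_count(observed, theoretical, frag_tol), pep, name))
    return sorted(hits, reverse=True)


# Hypothetical toy database; protein B contains a same-mass scrambled peptide.
DATABASE = {
    "toy_protein_A": "MSTKNSGDIVNLGSIAGRLLEK",
    "toy_protein_B": "MAGKGAISGLNVIDGSNRTTPK",
}
observed = y_ions("NSGDIVNLGSIAGR")  # idealized spectrum: y-series only
hits = search(observed, peptide_mass("NSGDIVNLGSIAGR"), DATABASE)
```

Both candidates pass the precursor mass filter, so only the fragment match separates the correct peptide from the scrambled one.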

Dealing with the data
1. Data acquisition
Experimental information, metadata capture
2. Peak analysis
Sequence database searching
Quantitative analysis
3. Knowledge annotation and interpretation
Database mining
Assignment of function, pathway, localization, etc.
Output for database archiving, publication
Integrated workflow?

1. Data acquisition: capturing experimental information
Proteomics Experimental Data Repository (PEDRo)
Proposed schema
Similar to genomic needs, but the experimental information is somewhat different

2. Peak Analysis
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, and DNA sequences
Tools: ProFound, Mascot, PepSea, MS-Fit, MOWSE, Peptident, Multident, Sequest, PepFrag, MS-Tag
[Figure: MS/MS spectrum (relative abundance vs. m/z, 200 to 1200) leading to protein identification]
Need CPU horsepower (parallel computing)

2. Peak Analysis: data formats
Formats: Format 1, Format 2, Format 3??
Outputs: Output 1, Output 2, Output 3
Problems: lack of flexibility; slow to evolve; lack of incorporation of competing products and methods

2. Peak Analysis: need general, flexible, in-house solutions
Format 1, Format 2, Format 3 -> reverse engineering of data formats -> general tools for analysis of multiple data formats

2. Peak Analysis: reverse engineering data formats
http://sashimi.sourceforge.net/software_glossolalia.html
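One way to picture "general tools for multiple data formats" is a set of small readers that all return the same in-memory spectrum type, so downstream analysis never sees the source format. The sketch below is an illustration under assumptions: it handles only simplified, single-spectrum versions of two text formats (Sequest-style .dta and Mascot generic format), and the names `Spectrum`, `READERS`, and `read_spectrum` are hypothetical. It is not the approach used by the tools at the URL above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PROTON = 1.007276


@dataclass
class Spectrum:
    """Format-independent representation of one MS/MS spectrum."""
    precursor_mz: float
    charge: int
    peaks: List[Tuple[float, float]] = field(default_factory=list)  # (m/z, intensity)


def parse_dta(text):
    """Simplified .dta: first line is 'MH+ charge', then 'm/z intensity' lines."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    mh, charge = float(rows[0][0]), int(rows[0][1])
    # Convert the singly protonated mass (MH+) to m/z at the stated charge.
    precursor_mz = (mh + (charge - 1) * PROTON) / charge
    peaks = [(float(mz), float(inten)) for mz, inten in rows[1:]]
    return Spectrum(precursor_mz, charge, peaks)


def parse_mgf(text):
    """Simplified MGF: one BEGIN IONS/END IONS block with PEPMASS and CHARGE."""
    precursor_mz, charge, peaks = 0.0, 1, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line in ("BEGIN IONS", "END IONS"):
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            if key == "PEPMASS":
                precursor_mz = float(value.split()[0])
            elif key == "CHARGE":
                charge = int(value.rstrip("+-"))
        else:
            mz, inten = line.split()[:2]
            peaks.append((float(mz), float(inten)))
    return Spectrum(precursor_mz, charge, peaks)


# Adding a new format means registering one more reader here.
READERS = {"dta": parse_dta, "mgf": parse_mgf}


def read_spectrum(text, fmt):
    return READERS[fmt](text)


dta_text = "1372.716 2\n175.119 50\n232.140 80\n"
mgf_text = "BEGIN IONS\nPEPMASS=686.8616\nCHARGE=2+\n175.119 50\n232.140 80\nEND IONS\n"
from_dta = read_spectrum(dta_text, "dta")
from_mgf = read_spectrum(mgf_text, "mgf")
```

Keeping the format knowledge in the readers is what allows the analysis code to change independently of vendor formats.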

2. Peak Analysis: quality control of protein matches
Unfiltered: 10^5 matches (lots of noise and junk)
Filtering
Filtered: thousands of true matches
Statistical analysis of database results (tools are available)

2. Peak Analysis: Quantitative analysis
Two samples (State 1, State 2)
N = normal isotope label
H = heavy isotopic label (e.g. 2H, 13C, 15N)
Combine, proteolyze, and isolate labeled peptides
Analyze peptides by mass spectrometry
Labeling methods: external chemical labeling; metabolic labeling (SILAC); enzymatic incorporation (16O/18O)
[Figure: intensity vs. mass-to-charge (m/z); N- and H-labeled peaks separated by Δm]
relative protein abundance = [intensity of N-labeled peptide] / [intensity of H-labeled peptide]
Flexibility is key: need tools to handle different quantitative methods
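The abundance ratio above can be computed from a peak list once the light peak and the label mass difference are known. The following is a minimal sketch with made-up peak values. The function name, the tolerance, and the choice to take the most intense peak inside the tolerance window are all assumptions for illustration.

```python
def isotope_pair_ratio(peaks, light_mz, delta_m, charge=1, tol=0.05):
    """Ratio of normal-label (N) to heavy-label (H) peak intensity.

    peaks    : list of (m/z, intensity) tuples
    light_mz : m/z of the N-labeled peptide
    delta_m  : mass difference introduced by the heavy label (Da)
    Returns None if either peak is missing or the heavy peak is zero.
    """
    heavy_mz = light_mz + delta_m / charge  # heavy partner shifts by delta_m / z

    def intensity_at(target):
        matches = [inten for mz, inten in peaks if abs(mz - target) <= tol]
        return max(matches) if matches else None

    light = intensity_at(light_mz)
    heavy = intensity_at(heavy_mz)
    if light is None or heavy is None or heavy == 0:
        return None
    return light / heavy


# Hypothetical peak list: N-labeled peak, H-labeled partner, and an unrelated peak.
example_peaks = [(1000.50, 200.0), (1003.00, 5.0), (1006.52, 100.0)]
ratio = isotope_pair_ratio(example_peaks, light_mz=1000.50, delta_m=6.02)
```

Because `delta_m` is a parameter, the same function covers chemical, metabolic, and enzymatic labels that differ only in their mass shift.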

2. Peak Analysis: Quantitative analysis
[Figure: TOF MS spectrum, 20 MCA scans from mm_sample.wiff (a=3.56145059693694800e-004, t0=6.89652636903192620e001), max. 274.0 counts, x-axis 1914 to 1934 amu. Sample 1 peaks: 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165. Sample 2 peaks: 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077.]
Relative intensity = relative protein abundance

Evolving methodologies: iTRAQ
Samples 1, 2, 3, 4: each digested to peptides
iTRAQ labels: 114, 115, 116, 117
Multidimensional separation
MS/MS spectrum:
Diagnostic ions (114, 115, 116, 117) used for quantitative analysis
Peptide fragments used for sequence identification
4-way multiplexing: simultaneous comparison of multiple states, replicates
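Reading out the four iTRAQ diagnostic ions from an MS/MS peak list can be sketched as follows. The target m/z values are rounded approximations of the 114 to 117 reporter ions, and the tolerance, function names, and example intensities are assumptions chosen for illustration, not values from the slides.

```python
# Approximate reporter-ion m/z per channel (assumed; tune for the instrument).
REPORTER_MZ = {114: 114.1, 115: 115.1, 116: 116.1, 117: 117.1}


def reporter_intensities(peaks, tol=0.05):
    """Sum the intensity found within tol of each reporter-ion m/z."""
    return {
        channel: sum(inten for mz, inten in peaks if abs(mz - target) <= tol)
        for channel, target in REPORTER_MZ.items()
    }


def relative_abundance(intensities, reference=114):
    """Express every channel relative to the reference channel."""
    ref = intensities[reference]
    if ref == 0:
        raise ValueError("reference channel has no signal")
    return {channel: value / ref for channel, value in intensities.items()}


# Hypothetical MS/MS peaks: four reporter ions plus one sequence fragment.
msms_peaks = [
    (114.10, 100.0), (115.10, 200.0), (116.10, 50.0), (117.10, 100.0),
    (175.12, 500.0),
]
ratios = relative_abundance(reporter_intensities(msms_peaks))
```

The sequence fragment at 175.12 is ignored by the quantitation step, which mirrors the split on the slide between diagnostic ions and fragments used for identification.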

Need for changeable tools
Old: [Figure: TOF MS spectrum, 20 MCA scans from mm_sample.wiff (a=3.56145059693694800e-004, t0=6.89652636903192620e001), max. 274.0 counts, x-axis 1914 to 1934 amu. Sample 1 peaks: 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165. Sample 2 peaks: 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077. Relative intensity = relative protein abundance.]
New: [Figure: iTRAQ MS/MS reporter-ion region, intensity for samples 1 to 4 at 114.1005, 115.0963, 116.0972, and 117.1025]
Automated analysis tools?

3. Knowledge annotation: making sense of lists of data

3. Knowledge annotation: mining proteomic/genomic databases

3. Knowledge annotation: needs
Annotation: accession numbers and protein names
Functional assignments (functional degeneracy?)
Pathway assignments
Subcellular localization
Disease implications
Comparison of different proteomic datasets (e.g. expression profiles compared to modification state profiles or other protein properties)
Automated and streamlined??
Publication and deposit in databases
Visualization of complex phenomena, interpretation of biological relevance
Modeling, integration with genomics data: computational and systems biology