The Analysis of Proteomics Spectra from Serum Samples
Jeffrey S. Morris
Department of Biostatistics
MD Anderson Cancer Center

TAMU: PROTEOMICS SPECTRA 1 What Are Proteomics Spectra?
DNA makes RNA makes Protein
Microarrays allow us to measure the mRNA complement of a set of cells
Mass spectrometry allows us to measure the protein complement (or subset thereof) of a set of cells
Proteomics spectra are mass spectrometry traces of biological specimens

TAMU: PROTEOMICS SPECTRA 2 Why Are We Excited?
Profiles can be assessed using less invasive samples (serum, urine, nipple aspirate fluid) rather than tissue biopsies
Spectra are cheaper to run on a per unit basis than microarrays
Can run samples on large numbers of patients

TAMU: PROTEOMICS SPECTRA 3 How Does Mass Spec Work?

TAMU: PROTEOMICS SPECTRA 4 What Do the Data Look Like?

TAMU: PROTEOMICS SPECTRA 5 Learning: Spotting the Samples

TAMU: PROTEOMICS SPECTRA 6 What the Guts Look Like

TAMU: PROTEOMICS SPECTRA 7 Taking Data

TAMU: PROTEOMICS SPECTRA 8 Some Other Common Steps
Fractionating the Samples
Changing the Laser Intensity
Working with Different Matrix Substrates

TAMU: PROTEOMICS SPECTRA 9 SELDI: A Special Case
www.ciphergen.com
Precoated surface performs some preselection of the proteins for you.
Machines are nominally easier to use.

TAMU: PROTEOMICS SPECTRA 10 A Tale of Two Examples
Example 1: Learning from the literature
Example 2: Testing out our understanding
A story in pictures

TAMU: PROTEOMICS SPECTRA 11 Example 1: Feb 16 '02 Lancet
100 ovarian cancer patients
100 normal controls
16 patients with benign disease
Use 50 cancer and 50 normal spectra to train a classification method; test the algorithm on the remaining samples.

TAMU: PROTEOMICS SPECTRA 12 Their Results
Correctly classified 50/50 of the ovarian cancer cases.
Correctly classified 46/50 of the normal cases.
Correctly classified 16/16 of the benign disease as "other".
Data at http://clinicalproteomics.steem.com
Large sample sizes, using serum

TAMU: PROTEOMICS SPECTRA 13 The Data Sets
3 data sets on ovarian cancer
Data Set 1: The initial experiment. 216 samples, baseline subtracted, H4 chip
Data Set 2: Follow-up: the same 216 samples, baseline subtracted, WCX2 chip
Data Set 3: New experiment: 162 cancers, 91 normals, baseline NOT subtracted, WCX2 chip
A set of 5-7 separating peaks is supplied for each data set.
We tried to (a) replicate their results, and (b) check consistency of the proteins found

TAMU: PROTEOMICS SPECTRA 14 We Can't Replicate their Results (DS1 & DS2)

TAMU: PROTEOMICS SPECTRA 15 Some Structure is Visible in DS1

TAMU: PROTEOMICS SPECTRA 16 Or is it? Not in DS2

TAMU: PROTEOMICS SPECTRA 17 Processing Can Trump Biology (DS1 & DS2)

TAMU: PROTEOMICS SPECTRA 18 We Can Analyze Data Set 3!

TAMU: PROTEOMICS SPECTRA 19 Do the DS2 Peaks Work for DS3?

TAMU: PROTEOMICS SPECTRA 20 Do the DS3 Peaks Work for DS2?

TAMU: PROTEOMICS SPECTRA 21 Which Peaks are Best? T-statistics
Note the magnitudes: t-values in excess of 20 (absolute value)!
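As a rough illustration of ranking peaks by t-statistic, here is a minimal sketch assuming a two-sample Welch t-statistic per m/z value. The slide does not say which variant was used, and the function name and toy intensities are hypothetical, not data from the talk.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Two-sample Welch t-statistic for one peak (unequal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    se = math.sqrt(variance(group_a) / na + variance(group_b) / nb)
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical intensities at a single m/z value; NOT data from the talk.
cancer = [10.1, 9.8, 10.3, 10.0, 9.9]
normal = [5.2, 5.0, 4.9, 5.1, 5.3]

t_value = welch_t(cancer, normal)
print(round(t_value, 1))
```

Two groups that barely overlap give a |t| this large even with five samples each, which is why values above 20 on real serum spectra are a warning sign worth checking.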

TAMU: PROTEOMICS SPECTRA 22 One Bivariate Plot: M/Z = (435.46, 465.57)
Perfect Separation.
These are the first 2 peaks in their list, and ones we checked against DS2.

TAMU: PROTEOMICS SPECTRA 23 Another Bivariate Plot: M/Z = (2.79, 245.2)
Perfect Separation, using a completely different pair.
Further, look at the masses: this is the noise region.

TAMU: PROTEOMICS SPECTRA 24 Perfect Classification with Noise?
This is a problem, in that it suggests a qualitative difference in how the samples were processed, not just a difference in the biology.
This type of separation reminds us of what we saw with benign disease.

TAMU: PROTEOMICS SPECTRA 25 Mass Accuracy is Poor?
A tale of 5 masses...

Feb '02        Apr '02        Jun '02
DS1            DS2            DS3
-7.86E-05      -7.86E-05      -7.86E-05
2.18E-07       2.18E-07       2.18E-07
9.60E-05       9.60E-05       9.60E-05
0.000366014    0.000366014    0.000366014
0.000810195    0.000810195    0.000810195

TAMU: PROTEOMICS SPECTRA 26 How are masses determined?
Calibrating known proteins

TAMU: PROTEOMICS SPECTRA 27 Calibration is the Same?
M/Z vectors the same for all three data sets.
Machine calibration the same for 4+ months?

TAMU: PROTEOMICS SPECTRA 28 What is the Calibration Equation?
The Ciphergen equation: $\frac{m/z}{U} = a(t - t_0)^2 + b$, $U = 20\mathrm{K}$, $t = (0, 1, \ldots) \times 0.004$
Fitting it here: $a = 0.2721697 \times 10^{-3}$, $b = 0$, $t_0 = 0.0038$
These are the default settings that ship with the software!
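The fit can be checked by hand. The sketch below plugs the fitted constants into the calibration equation and evaluates the first five clock ticks, which should land on the five masses listed on slide 25. One assumption of mine, not stated on the slide: a plain square gives +7.86E-05 for the first tick, whereas the listed value is negative, so the code uses a signed square, sign(t - t0)(t - t0)^2. The function name is hypothetical.

```python
U = 20000.0        # "U = 20K" on the slide
A = 0.2721697e-3   # fitted a
B = 0.0            # fitted b
T0 = 0.0038        # fitted t0
TICK = 0.004       # time grid t = (0, 1, ...) * 0.004

def mz_from_tick(k, signed=True):
    """m/z at clock tick k under the quadratic calibration equation."""
    dt = k * TICK - T0
    square = dt * dt
    if signed and dt < 0:
        # Assumption: signed square, so that the first mass comes out negative
        # as listed on slide 25. The slide itself only shows (t - t0)^2.
        square = -square
    return U * (A * square + B)

first_five = [mz_from_tick(k) for k in range(5)]
for value in first_five:
    print(f"{value:.6g}")
```

That the default constants reproduce the supplied m/z vector to the printed precision is the arithmetic behind the slide's last line.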

TAMU: PROTEOMICS SPECTRA 29 Other issues
Q-star data
different clinical trials?

TAMU: PROTEOMICS SPECTRA 30 Example 2: Proteomics Data Mining
41 samples, 24 with lung cancer, 17 controls.
20 fractions per sample.
Goal: distinguish the two groups.
Data used to be at http://www.radweb.mc.duke.edu/cme/proteomics/explain.htm but the site has been retired. Send email to Ned Patz or Mike Campa at Duke if interested.

TAMU: PROTEOMICS SPECTRA 31 Raw Spectra Have Different Baselines
Note the need for baseline correction.
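The slide does not say which baseline method was applied. Purely as an illustration of what baseline correction does, here is a minimal sketch that swaps in a simple rolling-minimum estimate; the function names, window size, and synthetic spectrum are all assumptions.

```python
def rolling_min_baseline(intensities, half_window):
    """Baseline estimate: minimum intensity within +/- half_window samples of each point."""
    n = len(intensities)
    return [
        min(intensities[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]

def subtract_baseline(intensities, half_window=50):
    """Subtract the rolling-minimum baseline; the result is never negative."""
    base = rolling_min_baseline(intensities, half_window)
    return [y - b for y, b in zip(intensities, base)]

# Hypothetical spectrum: a sloping baseline plus one narrow peak near index 200.
raw = [50.0 - 0.05 * i + (30.0 if 195 <= i <= 205 else 0.0) for i in range(400)]
corrected = subtract_baseline(raw)
print(max(raw), max(corrected))
```

The window must be wider than the widest peak, otherwise the estimate climbs into the peak and removes signal along with the baseline.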

TAMU: PROTEOMICS SPECTRA 32 Oscillatory Behavior...
Roughly half the spectra have sinusoidal noise.
We're seeing the A/C power cord.

TAMU: PROTEOMICS SPECTRA 33 Baseline Adj: Fraction Agreement, Before & After

TAMU: PROTEOMICS SPECTRA 34 Fractionation is Unstable

TAMU: PROTEOMICS SPECTRA 35 Unfractionating the Data

TAMU: PROTEOMICS SPECTRA 36 The Overall Average Shows Spikes. Difference It.

TAMU: PROTEOMICS SPECTRA 37 Computer Buffer?
Spike spacing has a wavelength of 4096 = 2^12.
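Differencing the average spectrum and measuring the gaps between the resulting spikes can be sketched as follows. The synthetic average spectrum, the threshold, and the function name are assumptions made for illustration; only the 4096-sample spacing comes from the slide.

```python
import random

def spike_positions(avg_spectrum, threshold):
    """Indices where the first difference of the average spectrum jumps up by more than threshold."""
    diffs = [b - a for a, b in zip(avg_spectrum, avg_spectrum[1:])]
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]

# Synthetic stand-in for the overall average spectrum: smooth decay, small noise,
# and a one-sample spike every 4096 samples (hypothetical values).
rng = random.Random(0)
n = 5 * 4096
avg = [100.0 * (1 - i / n) + rng.gauss(0, 0.05) for i in range(n)]
for i in range(4096, n, 4096):
    avg[i] += 5.0

spikes = spike_positions(avg, threshold=2.0)
spacings = [b - a for a, b in zip(spikes, spikes[1:])]
print(spikes, spacings)
```

Differencing removes the slowly varying signal, so one-sample artifacts stand out; a constant spacing that is a power of two points at the electronics, not the chemistry.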

TAMU: PROTEOMICS SPECTRA 38 Are We Done Cleaning Yet?
Give the problem a chance to be easy: try some simple clustering.

TAMU: PROTEOMICS SPECTRA 39 PCA Splits off Half the Normals

TAMU: PROTEOMICS SPECTRA 40 Peaks at Integer Multiples of M/Z 180.6!
This suggests a polymer.
No Amino Acid dimers fit.
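A check for a repeating unit in a peak list can be sketched in a few lines. The 180.6 unit is from the slide; the tolerance, the function name, and the example peak list are hypothetical.

```python
def near_multiples(peaks, unit=180.6, abs_tol=1.0):
    """Return (peak, k) pairs where the peak lies within abs_tol of k * unit."""
    hits = []
    for p in peaks:
        k = round(p / unit)
        if k >= 1 and abs(p - k * unit) <= abs_tol:
            hits.append((p, k))
    return hits

# Hypothetical peak list: four near-multiples of 180.6 plus two unrelated masses.
peaks = [361.3, 541.7, 722.5, 903.0, 3077.0, 8840.0]
hits = near_multiples(peaks)
print(hits)
```

An absolute tolerance is used on purpose: a tolerance proportional to the mass would let almost any large mass pass as "near" some multiple.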

TAMU: PROTEOMICS SPECTRA 41 Cleaning Redux
Baseline Correction and Normalization
Inconsistent Fractionation
Computer Buffers
Polymers in some Normal Spectra
Peak Finding (Use Theirs)
Data reduced to 1 spectrum/patient, with 506 peaks per spectrum.

TAMU: PROTEOMICS SPECTRA 42 Find the Best Separators

Peaks                              MD      P-Value  Wrong  LOOCV
12886                              2.547   0.005    11     11
8840, 12886                        5.679   0.01     5      6
3077, 12886, 74263                 9.019   0.01     3      4
5863, 8143, 8840, 12886            12.585  0.01     3      3
4125, 7000, 9010, 12886, 74263     23.108  0.01     1      1

There are 9 values that recur frequently, at masses of 3077, 4069, 5825, 6955, 8840, 12886, 17318, 61000, and 74263.
P-values are not from table lookups!
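To make the table's columns concrete, here is a minimal sketch that assumes "MD" is the Mahalanobis distance between the group means under a pooled covariance and that "LOOCV" counts leave-one-out misclassifications from a nearest-mean rule. Both readings are assumptions, as are the helper names and the simulated two-peak data; the search over peak sets used in the talk is not reproduced.

```python
import math
import random

def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_cov_2d(a, b):
    """Pooled 2x2 covariance of two groups, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for rows in (a, b):
        mx, my = mean_vec(rows)
        for x, y in rows:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = len(a) + len(b) - 2
    return sxx / dof, sxy / dof, syy / dof

def mahalanobis_sq(u, v, cov):
    """Squared Mahalanobis distance between 2-D points u and v."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = u[0] - v[0], u[1] - v[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def loocv_errors(a, b):
    """Leave each sample out, refit means and covariance, assign it to the nearer group mean."""
    labelled = [(row, 0) for row in a] + [(row, 1) for row in b]
    errors = 0
    for i, (row, label) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        ga = [r for r, lab in rest if lab == 0]
        gb = [r for r, lab in rest if lab == 1]
        cov = pooled_cov_2d(ga, gb)
        da = mahalanobis_sq(row, mean_vec(ga), cov)
        db = mahalanobis_sq(row, mean_vec(gb), cov)
        predicted = 0 if da <= db else 1
        if predicted != label:
            errors += 1
    return errors

# Simulated intensities at two peaks for 24 cancers and 17 controls (hypothetical).
rng = random.Random(1)
cancers = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(24)]
controls = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(17)]

md = math.sqrt(mahalanobis_sq(mean_vec(cancers), mean_vec(controls),
                              pooled_cov_2d(cancers, controls)))
loocv = loocv_errors(cancers, controls)
print(round(md, 3), loocv)
```

Refitting inside each leave-one-out fold is what lets the LOOCV column exceed the "Wrong" column, as it does in two rows of the table.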

TAMU: PROTEOMICS SPECTRA 43 Testing Reality (Significance)
Generate a bunch of random noise data matrices, each 41 × 506 in size.
For each matrix, split the 41 noise samples into groups of 24 and 17.
Repeat our search procedure on the random data, and see how well we can separate things.
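The noise experiment can be sketched as below. The 41 × 506 shape and the 24/17 split are from the slide. The search itself is replaced by a much simpler stand-in (the single column with the largest absolute Welch t), and the Gaussian noise, the number of repetitions, and the "observed" score are assumptions.

```python
import math
import random

def welch_t(a, b):
    """Two-sample Welch t-statistic."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def best_abs_t(matrix, n_first):
    """Stand-in search: largest |t| over columns, first n_first rows vs the rest."""
    best = 0.0
    for j in range(len(matrix[0])):
        a = [row[j] for row in matrix[:n_first]]
        b = [row[j] for row in matrix[n_first:]]
        best = max(best, abs(welch_t(a, b)))
    return best

def null_scores(n_reps, n_samples=41, n_peaks=506, n_first=24, seed=0):
    """Best separation score found on pure-noise matrices of the same shape as the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_reps):
        noise = [[rng.gauss(0, 1) for _ in range(n_peaks)] for _ in range(n_samples)]
        scores.append(best_abs_t(noise, n_first))
    return scores

scores = null_scores(n_reps=20)
observed = 6.0  # hypothetical score from the real data
p_value = (1 + sum(s >= observed for s in scores)) / (1 + len(scores))
print(min(scores), max(scores), p_value)
```

Because the best of 506 columns is selected each time, even pure noise yields a sizeable score; comparing the real score against this empirical distribution accounts for the search, which a table lookup would not.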

TAMU: PROTEOMICS SPECTRA 44 The Eyeball Test
We applied one last filtering step and actually looked at the regions identified.
All 9 peaks listed above passed the eye test.
Blue lines = Cancers
Red lines = Controls

TAMU: PROTEOMICS SPECTRA 45 Punchlines
There is no magic bullet here. (Too bad)
Data preprocessing is extremely important with this type of data, and there is still much room for improvement.
Dimension reduction is critical, both to avoid spurious structure and to focus our attention on peaks.
There is structure in this data (some peaks have been confirmed, and the writeup is in progress) and it can be found!

TAMU: PROTEOMICS SPECTRA 46 Other Stuff
We were the only ones to notice the sinusoidal noise.
and the clock tick.
and we also won the competition...

TAMU: PROTEOMICS SPECTRA 47 Important Lessons
Experimental Design Issues are Crucial
    Randomization
    Uniform handling of samples
    Blinding
Careful Pre-Processing of Data is Essential
    Calibration
    Baseline Correction
    Normalization
Exploratory Data Analysis is Important: Look at the Data!
    Search for anomalies
    Confirm numerical results

TAMU: PROTEOMICS SPECTRA 48 References
On the Lancet data: Baggerly, Morris and Coombes (2003), accepted by Bioinformatics pending revisions.
On the Proteomics Data Mining Conference data: Baggerly, Morris, Wang, Gold, Xiao and Coombes (2003), Proteomics, 3(9):1677-1682.
PDF preprints are available.

TAMU: PROTEOMICS SPECTRA 49 The Deluge
Bladder Cancer
Pancreatic Cancer
Leukemia
Colorectal Cancer
Brain Cancer
Several show real structure, several show processing effects.
"If you're not working on a proteomics project, you will be soon!" Kevin Coombes to Bioinf section, 3/25/03

TAMU: PROTEOMICS SPECTRA 50 Collaborators
Keith Baggerly
Kevin Coombes
Jing Wang
David Gold
Lian-Chun Xiao
*********************
Ryuji Kobayashi
David Hawke
John Koomen