Data Analysis in Metabolomics. Tim Ebbels Imperial College London

1 Data Analysis in Metabolomics Tim Ebbels Imperial College London

2 Themes
- Overview of metabolomics data processing workflow
- Differences between metabolomics and transcriptomics data
- Approaches to improving reproducibility & data quality
- Key challenges and bottlenecks

3 Metabolomics
- The study of the complement of small molecules within biological systems
- Untargeted: no prior hypothesis of specific metabolites involved
[Figure labels: Metabolic profile; Cell, Tissue, Biofluid; Hormones]

4 Metabolomics workflow
Steps: Biological question → Experimental design → Sampling → Sample preparation → Data acquisition → Data preprocessing → Data analysis → Metabolite identification → Biological interpretation
Products along the way: Protocol; Samples; Metabolites; Raw data; Data table; Relevant metabolites, connectivities, models

5 Transcriptomic vs. Metabolomic Data

Property                                | Transcriptomics (microarrays / sequencing) | Metabolomics
Number of genes / metabolites known?    | Yes                                        | No
Identity of genes / metabolites known?  | Yes (sequence/locus)                       | No (a priori)
Coverage                                | Whole genome                               | Very low (few %) of metabolome
Number of platforms                     | Single                                     | Multiple (in same experiment)
Standardisation                         | High                                       | Constantly changing analytical technology
Standardisation of data analysis        | Relatively high                            | Low
Correlation between variables           | Medium                                     | Very high

6 LC MS Metabolomics Data

7 LC-MS Metabolic Profiles ~10,000s signals, s (?) metabolites

8 LC-MS preprocessing
[Diagram: raw data is converted to a peak table via peak detection, peak integration, peak matching, retention time alignment and peak filling]
Software: XCMS. Smith et al. Anal Chem 78, 779 (2006)

9 Quality Control Samples
- Representative biological sample, e.g. pool of study samples
- Repeated analysis throughout analytical run
[Figure: study samples and pooled QC sample injections against run order]
Gika, H. G., Theodoridis, G. A., Wingate, J. E., and Wilson, I. D., J. Proteome Res. 6 (8), 3291 (2007).

10 Quality Control and Data Filtering
- Repeatability filter: e.g. keep only features with CV < 30% in QC samples
- Linearity filter: e.g. filter out all features with correlation to dilution < 0.8
- Normalisation: correct global intensity drift
- Drift correction: correct feature-specific drift within a batch
- Batch correction: correct drift across batches

11 Drift Correction
- Instrument response changes smoothly over the run
- Use QC samples to estimate changes
- Typically local regression (e.g. LOESS) with cross validation
- Requires frequent QC injections
Dunn, W. B. et al. Nat Protoc 6, 1060 (2011).
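
The QC-based correction described on this slide can be sketched for a single feature as follows. This is a minimal illustration, not the procedure from the cited protocol: it uses a hand-written local linear fit with tricube weights as a stand-in for LOESS, a fixed `span` rather than one chosen by cross validation, and simulated intensities. The function names and the example data are assumptions made for this sketch.

```python
import math
from statistics import median

def tricube(u):
    """Tricube kernel weight for a scaled distance u."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(x0, xs, ys, span):
    """Weighted straight-line fit around x0 using the nearest `span` fraction of points."""
    k = min(len(xs), max(3, math.ceil(span * len(xs))))
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    h = h * 1.001 if h > 0 else 1.0  # widen slightly so the k-th point keeps a non-zero weight
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:
        return my  # all weighted points share one x: fall back to the weighted mean
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + sxy / sxx * (x0 - mx)

def correct_drift(run_order, intensity, is_qc, span=0.75):
    """Divide each injection by the QC trend at its run position, rescaled to the QC median."""
    qc_x = [x for x, q in zip(run_order, is_qc) if q]
    qc_y = [y for y, q in zip(intensity, is_qc) if q]
    if len(qc_x) < 3:
        raise ValueError("need at least three QC injections")
    target = median(qc_y)
    return [y / local_linear_fit(x, qc_x, qc_y, span) * target
            for x, y in zip(run_order, intensity)]

# Simulated feature: QC level 100, study-sample level 50, response rising 2% per injection.
run_order = list(range(11))
is_qc = [i % 2 == 0 for i in run_order]          # every second injection is a pooled QC
drift = [1.0 + 0.02 * i for i in run_order]
intensity = [(100.0 if q else 50.0) * d for q, d in zip(is_qc, drift)]
corrected = correct_drift(run_order, intensity, is_qc)
```

After correction the QC injections sit at a common level and the study samples keep their ratio to the QCs. In practice the span would be tuned (for example by leave-one-out cross validation on the QCs) and the fit repeated for every feature.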

12 Filtering for Repeatability
- Remove features with low repeatability in QC samples (e.g. keep only features with coefficient of variation, CV < 30%)
[Figure: scores plots (t[1] vs t[2]) for Lab C, positive ESI, at 100% and 10% CV thresholds. Credit: COMET2 / Rob Whiffin]
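
A repeatability filter of this kind can be sketched as below. The feature names and QC intensities are invented for illustration, and the 30% threshold is simply the example value from the slide; the sample standard deviation is used for the CV, which is one common choice rather than a stated requirement.

```python
from statistics import mean, stdev

def qc_cv_percent(values):
    """Coefficient of variation (%) of one feature across repeated QC injections."""
    m = mean(values)
    return float("inf") if m == 0 else 100.0 * stdev(values) / m

def repeatable_features(qc_table, max_cv=30.0):
    """Return the names of features whose QC CV is at or below max_cv percent."""
    return [name for name, vals in qc_table.items() if qc_cv_percent(vals) <= max_cv]

# Hypothetical QC intensities: one stable feature and one noisy feature.
qc_table = {
    "feature_A": [100.0, 105.0, 98.0, 102.0],
    "feature_B": [100.0, 20.0, 180.0, 60.0],
}
kept = repeatable_features(qc_table)
```

Here `feature_A` is retained and `feature_B`, whose QC measurements scatter widely, is dropped.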

13 Filtering for Linearity
- Some features will be unsuitable:
  - Metabolite concentrations outside linear range of instrument, or
  - Contaminants, solvent artefacts etc
- Use a dilution series to select features which respond linearly
[Figure axes: R², intensity, CV (%), dilution factor]
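
One way to express the dilution-series filter is to correlate each feature's intensity with the relative concentration of the diluted QC. The sketch below uses Pearson correlation with a 0.8 cut-off, the example threshold given on the quality control slide; the dilution levels and intensities are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no linearity information
    return sxy / math.sqrt(sxx * syy)

def linear_features(dilution, table, min_r=0.8):
    """Keep features whose intensity tracks the dilution series with r >= min_r."""
    return [name for name, vals in table.items() if pearson_r(dilution, vals) >= min_r]

# Relative concentration of each dilution level (hypothetical 2-fold series).
dilution = [1.0, 0.5, 0.25, 0.125]
table = {
    "metabolite_like": [1000.0, 510.0, 240.0, 130.0],   # scales with dilution
    "solvent_artefact": [500.0, 505.0, 498.0, 502.0],   # does not respond to dilution
}
kept_linear = linear_features(dilution, table)
```

A signal that stays flat across the series, as a solvent artefact would, fails the filter, while a signal that halves with each dilution step passes.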

14 NMR Metabolomics Data

15 NMR Metabolic Profiles ~100s signals, 10-100s (?) metabolites

16 NMR Metabolic Profiles: Problems
- Assignment
  - Knowns
  - Unknowns
- Peak overlap
- Peak shift?

17 Peak shifts
- Caused primarily by pH & ionic strength variations
- Some peaks more susceptible than others
- Peaks for same molecule generally do NOT shift
  - In same direction
  - By same amount
- Restrict pH variation using buffer
  - Try to keep in physiological range (~7-8)
- pH shift may be the effect you're looking for!
[Figure: urine titration series, pH 12 to pH 2]

18 Binning
- Integrate spectral intensity in each region → one variable
- Benefits: reduces problems of
  - Peak shift
  - Large number of data points
- Drawbacks
  - Bins not easily assigned: can be one or several compounds
  - Statistical models not easily interpreted
[Figure: raw spectrum vs binned spectrum]
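
Binning as described here can be sketched with fixed-width regions. The bin width, ppm range and toy spectra below are assumptions for illustration only; summing the points in each region approximates the integral when the ppm axis is evenly sampled.

```python
import math

def bin_spectrum(ppm, intensity, ppm_min, ppm_max, bin_width):
    """Sum intensity within fixed-width ppm regions; returns one value per bin."""
    n_bins = max(1, math.ceil((ppm_max - ppm_min) / bin_width - 1e-9))
    bins = [0.0] * n_bins
    for p, y in zip(ppm, intensity):
        if ppm_min <= p <= ppm_max:
            # Points on the upper edge are assigned to the last bin.
            idx = min(int((p - ppm_min) / bin_width), n_bins - 1)
            bins[idx] += y
    return bins

# Toy spectrum with six data points, binned into two 0.5 ppm regions.
ppm = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
spectrum = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
binned = bin_spectrum(ppm, spectrum, 0.0, 1.0, 0.5)

# The same peak at slightly different positions lands in the same bin.
peak_a = bin_spectrum(ppm, [0.0, 5.0, 0.0, 0.0, 0.0, 0.0], 0.0, 1.0, 0.5)
peak_b = bin_spectrum(ppm, [0.0, 0.0, 5.0, 0.0, 0.0, 0.0], 0.0, 1.0, 0.5)
```

The last two calls show both the benefit and the drawback listed on the slide: a small peak shift no longer changes the variable, but the bin value alone cannot say which signal within the region produced it.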

19 Full resolution spectra
- Benefits: reduces difficulty of assignment (still manual)
- Drawbacks: does not overcome
  - Overlap
  - Shift
  - Large number of data points

20 Full resolution + alignment
- Move peaks until positions in different spectra match
- Difficult task, usually requires manual validation
- Can produce artefacts
  - Misassignment
  - Artificial signal
  - Warping of peak shape and/or area
[Figure: non-aligned data vs RSPA corrected data; axes: ppm, intensity (a.u.), sample number. Veselkov et al. Anal. Chem. 2009]

21 Peak fitting
- E.g. Chenomx NMR suite
- Manual process, requiring manual validation
[Figure labels: Succinate, Glutamine, Glutamate, Malate]

22 Normalisation
- Transformation on each sample
  - Removing unwanted variation
  - Making samples more comparable
- What variation is unwanted? Examples:
  - Changes in detector response
  - Differences in urine volume/dilution
- Classically achieved by setting total signal to a constant (Σx = 1)

23 Constant sum
[Figure: raw data vs data after normalisation to constant sum]
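
Constant sum normalisation amounts to dividing every variable in a sample by that sample's total signal. A minimal sketch, with invented intensities standing in for a concentrated and a two-fold diluted version of the same sample:

```python
def normalise_constant_sum(sample, total=1.0):
    """Scale one sample so that its intensities sum to `total`."""
    s = sum(sample)
    if s == 0:
        raise ValueError("cannot normalise a sample with zero total signal")
    return [total * x / s for x in sample]

# The same hypothetical profile measured at two dilutions.
concentrated = [2.0, 4.0, 6.0, 8.0]
diluted = [1.0, 2.0, 3.0, 4.0]
norm_concentrated = normalise_constant_sum(concentrated)
norm_diluted = normalise_constant_sum(diluted)
```

Both samples map to the same normalised profile, which is the intended effect when the only difference between them is overall dilution.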

24 Normalisation
- Account for gross sample to sample changes
- Global, e.g.
  - Median fold change
  - Total intensity
- Intensity dependent, e.g.
  - LOESS
  - Quantile
[Figure: median fold change normalisation]
Veselkov, K. A. et al. Anal. Chem. 83, 5864 (2011).
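
Median fold change normalisation can be sketched as follows: each sample is divided by the median of its variable-wise fold changes relative to a reference profile. The choice of reference (here the per-variable median across samples) and the example data are assumptions for this sketch, not details taken from the cited paper.

```python
from statistics import median

def median_fold_change_normalise(samples, reference=None):
    """Divide each sample by the median of its fold changes against a reference profile."""
    if reference is None:
        # Assumed default: per-variable median across all samples.
        reference = [median(col) for col in zip(*samples)]
    normalised = []
    for sample in samples:
        # Non-positive values are skipped; a sample with no usable values raises StatisticsError.
        folds = [x / r for x, r in zip(sample, reference) if r > 0 and x > 0]
        factor = median(folds)
        normalised.append([x / factor for x in sample])
    return normalised

# Hypothetical data: sample 2 is twice as concentrated and has one genuinely
# elevated variable (last column); sample 3 is two-fold diluted.
samples = [
    [10.0, 20.0, 30.0, 40.0, 50.0],
    [20.0, 40.0, 60.0, 80.0, 1000.0],
    [5.0, 10.0, 15.0, 20.0, 25.0],
]
mfc = median_fold_change_normalise(samples)
```

Because the scaling factor is a median, the single strongly elevated variable in the second sample does not distort the correction of the other variables, whereas a total-intensity normalisation of the same data would shrink all of them.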

25 Comparison of Normalisation Methods
- Simulated data
- 4 normalisations: Total area; Median fold change; Minimum entropy; PCA scores
- Minimum entropy: difference (test - ref) is constant for dilution variables → low entropy
- Other methods: Histogram; Quantile; Robust regression
Credit: Hector Keun / Jake Pearce

26 Summary
- Metabolomic data share many characteristics with other omics
- But fundamentally different: cannot copy data analysis pipeline
- Current bottlenecks/challenges:
  - Metabolite identification
  - Standardisation of
    - Sample collection
    - Analytical procedure
    - Data analysis
