Estoril Education Day

Estoril Education Day - Experimental Design in Proteomics. October 23rd, 2010. Peter James

Note Taking
- All the PowerPoint slides from the talks are available for download from: http://www.immun.lth.se/education/protein_technology/hupo%2C_eupa_and_nordic_qp_courses/estoril_education_day/

Is this Course Necessary? Journal Guidelines: Journal of Proteome Research The methods for how the biological reliability of measurements was validated using biological replicates, statistical methods, independent experiments, etc. The methods for how the analytical reliability of measurements was validated using technical replicates and statistical methods. The treatment of relevant systematic error effects such as peptides shared by multiple proteins, interference from overlapping precursor ions, incomplete isotope labeling, bias correction for pipetting error, etc. The treatment of random error issues such as outlier rejection and the categorical exclusion of data by thresholds, for example, based on signal to noise or minimum ion counts. All quantitative results upon which conclusions are based must bear proper estimates of uncertainty and the methods for the error analysis should be clearly described.

Is this Course Necessary? Journal Guidelines: Journal of Proteomics The experimental design must be provided and must include details of the number of biological and analytical replicates. Only one biological/analytical replicate will not be acceptable. In clinical studies, it is highly desirable that a power analysis predicting the appropriate sample size for subsequent statistical analysis of the data is carried out. For expression analysis studies, summary statistics (mean, standard deviation) must be provided and results of statistical analysis must be shown. Reporting fold differences alone is not acceptable. Authors must report the following: methods of data normalization, transformation, missing value handling, the statistical tests used, the degrees of freedom and the statistical package or program used. Where biologically important differences in protein (gene) expression are reported, confirmatory data (e.g. from Western blot, RT-PCR analysis, etc.) are desirable. For biomarker discovery/validation studies, the sensitivity and specificity of the biomarker(s) should be provided wherever possible. It is desirable that receiver operator characteristic (ROC) curves and areas under the curves are given.

Is this Course Necessary? Journal Guidelines: Molecular and Cellular Proteomics A thorough description of the experimental design, including the biological sample size and number of technical replicates of such samples or preparations derived thereof so that (bio)statistical methods may be used to assess independently the significance of the results presented. Studies in which the number of biological and/or technical replicates equals one, can generally not be accepted particularly if only few or a single peptide is used for quantification. In exceptional circumstances, other lines of evidence such as time or dose dependent experiments may be acceptable instead of technical replicates.

Is this Course Necessary? Journal Guidelines: Molecular and Cellular Proteomics The experimental design must be provided and must include details of the number of biological and analytical replicates. Only one biological/analytical replicate will not be acceptable. In clinical studies, it is highly desirable that a power analysis predicting the appropriate sample size for subsequent statistical analysis of the data is carried out. For expression analysis studies, summary statistics (mean, standard deviation) must be provided and results of statistical analysis must be shown. Reporting fold differences alone is not acceptable. Authors must report the following: methods of data normalization, transformation, missing value handling, the statistical tests used, the degrees of freedom and the statistical package or program used. Where biologically important differences in protein (gene) expression are reported, confirmatory data (e.g. from validated immunoassays) are desirable. For biomarker discovery/validation studies, the sensitivity and specificity of the biomarker(s) should be provided wherever possible. It is desirable that receiver operator characteristic curves and areas under the curves are given.

Talk Overview
- Introduction to experimental design
- Sources of error
- How many replicates, controls?
- Experimental design flow
- Pilot experiments
- Normal data? Parametric, non-parametric
- Idea of power to calculate needs
- Journal guidelines

Is all this

Experimental Design
- Experimental design definition: the statistics that happens before an experiment
- Why think about it?
  - Proper planning can save having to repeat an entire experiment
  - Reduces analysis time and lowers error rate and costs
  - Reduces experimental time to a minimum
- Design the experiment to answer a biological question

Experimental Design Flow
[Flow diagram; stages as extracted: Pilot Study; Variation, Cluster and Power Analysis; Full Scale Experiment; Publication; Data Validation; Bioinformatics Complete Analysis]

Goals of Experimental Design
- Avoid experimental artifacts
- Eliminate bias
  - Use a simultaneous control group
  - Randomization
  - Blinding
- Reduce sampling error
  - Replication
  - Balance
  - Blocking

Experimental Artifacts
- An experimental artifact is a bias in a measurement produced by unintended consequences of experimental procedures
- E.g. using doxycycline to activate a cloned gene in a viral vector with a tetO gene promoter switches on the gene, but also many other pathways. A scrambled insert must be used as a control
- Conduct your experiments under conditions that are as close to reality as possible to avoid artifacts
- Inadequate CO2 in cell culture experiments leads to large variations in pH and hence protein expression

Can I Compare my Data Sets?
- [Figure: non-normalised versus normalised data]
- Correction for dye or isotope label incorporation efficiency
- Swap labels, e.g. replicate with Cy3/Cy5 swapped or TMT 126 swapped for 131

Scaling Data to a Target Intensity
- [Figure: intensities of Exp. 1 to Exp. 7 scaled to a target intensity of 100]
- Target intensity (TGT) = average intensity × scaling factor
- If the scaling factor is < 3-fold, a comparison can be made between all experiments in the set
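The scaling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular software: the function names and the intensity values are made up, and reading the "< 3 fold" rule as "the largest and smallest scaling factors in the set differ by less than 3-fold" is an assumption.

```python
from statistics import mean


def scaling_factor(intensities, target=100.0):
    """Return the factor that brings the average intensity to the target."""
    return target / mean(intensities)


def scale_run(intensities, target=100.0):
    """Scale every intensity in one experiment by that experiment's factor."""
    factor = scaling_factor(intensities, target)
    return [value * factor for value in intensities]


def within_fold(factors, max_fold=3.0):
    """Assumed reading of the rule: largest / smallest factor < max_fold."""
    factors = list(factors)
    return max(factors) / min(factors) < max_fold


# Hypothetical intensities for three runs (illustration only).
runs = {
    "Exp. 1": [80.0, 120.0, 100.0],
    "Exp. 2": [40.0, 60.0, 50.0],
    "Exp. 3": [10.0, 20.0, 15.0],
}
factors = {name: scaling_factor(values) for name, values in runs.items()}
scaled = {name: scale_run(values) for name, values in runs.items()}

print(factors)
print("comparable as a set:", within_fold(factors.values()))
```

In this made-up set the third run needs a much larger factor than the other two, so under the assumed rule the three runs would not be compared as one set.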

Eliminating Bias
- Use a control group
  - A control group is a group of subjects left untreated for the treatment of interest but otherwise experiencing the same conditions as the treated subjects
- Randomization
  - Randomization is the random assignment of treatments to units in an experimental study, which breaks the association between potential confounding variables and the explanatory variables
- Blinding
  - Some of the persons involved are prevented from knowing certain information that might lead to conscious or unconscious bias on their part, invalidating the results
  - Single blind: the experimenter knows all the facts, the subjects do not
  - Double blind: neither experimenter nor subject knows the facts until the finish

Randomization Without randomization, the confounding variable differs among treatments

Randomization With randomization, the confounding variable does not differ among treatments

Balance
- In a balanced experiment, all treatments have equal sample size
- This maximizes power
- This makes tests more robust to violations of their assumptions

Blocking
- Blocking is the grouping of experimental units that have similar properties
- Within each block, treatments are randomly assigned to experimental units
- This is a randomized block design
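Randomization, balance and blocking can be combined in one small routine. The sketch below is illustrative only: the function name, the block labels ("day 1", "day 2") and the sample names are hypothetical, and it assumes that each block holds a whole multiple of the number of treatments so that the design stays balanced.

```python
import random
from collections import Counter


def randomized_block_design(blocks, treatments, seed=None):
    """Randomly assign treatments to units separately within each block.

    blocks: dict mapping a block name to the list of units in that block.
    Returns a dict mapping each unit to a (block, treatment) pair.
    """
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError("block size must be a multiple of the number of treatments")
        # Equal numbers of each treatment per block keeps the design balanced.
        labels = list(treatments) * (len(units) // len(treatments))
        rng.shuffle(labels)
        for unit, label in zip(units, labels):
            assignment[unit] = (block_name, label)
    return assignment


# Hypothetical example: eight samples processed on two days.
blocks = {
    "day 1": ["s1", "s2", "s3", "s4"],
    "day 2": ["s5", "s6", "s7", "s8"],
}
design = randomized_block_design(blocks, ["control", "treated"], seed=1)
counts = Counter(design.values())
for unit, (block, treatment) in design.items():
    print(unit, block, treatment)
```

Because the shuffle happens inside each block, a block effect such as the processing day cannot line up with the treatment.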

Practical Questions to Consider
- How much variability does your system have?
  - Understand and minimize variation
- How many treatments? How many controls?
  - Comparative analysis (one experimental condition)
  - Serial analysis design (multiple conditions)
- What level of significance is needed?
  - More replicates are needed for subtle changes

Three Sources of Variability
- Biological: differences between samples (the ultimate goal of the research)
- Technical: sample preparation (protocols and operator)
- Systematic: MS analysis (instruments, reagents, settings)

Experimental Replicates
- Technical replicates: from the same sample
  - Allow an evaluation of the contribution of bench effects to the overall variability
- Biological replicates: from different samples
  - Replicates that reproduce the biological variables explored in the experiment
  - Permit the use of formal statistical tests
  - Also allow the interrogation of technical variability
- Gold standard
  - Use of a standard protein digest to evaluate sensitivity, mass accuracy and search parameter settings
  - Allows an estimation of systematic variation

Effective Studies may need many replicates
[Figure labels: Treatments, Controls, Average Differential Expression]

How many Samples do I need?
- You should estimate the size of the three error sources
- The best way is to do a pilot experiment
  - Use a minimum of three biological replicates
  - Use a minimum of two technical replicates
  - Check systematic errors with a gold standard
- Do a Power Analysis

Systematic Error Estimation: Reproducibility of retention time precision 5 days

Technical Error Estimation
- The coefficient of variation (CV%) is a measure of the relative variability amongst replicates
- Defined as the standard deviation (σ) divided by the mean, multiplied by 100
- Example: 5 values representing 5 replicates: 230.4, 241.7, 252.9, 338.8, 178.9
- Mean = 248.54; σ = 57.9 (sample standard deviation); CV% = 23.3%
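The CV% calculation for the five example values can be reproduced with the Python standard library. This is a minimal sketch; it assumes the sample standard deviation (n - 1 in the denominator), which is what gives a value of about 57.9 for these numbers.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample standard deviation / mean * 100."""
    return 100.0 * stdev(values) / mean(values)


replicates = [230.4, 241.7, 252.9, 338.8, 178.9]

print("mean =", round(mean(replicates), 2))
print("sd   =", round(stdev(replicates), 1))
print("CV%  =", round(cv_percent(replicates), 1))
```

Note that one high reading (338.8) accounts for most of the spread here, which is the kind of thing a pilot experiment is meant to reveal.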

Which Statistical Test to Use?
- Assess the normality for each protein species
- Then select a parametric or non-parametric test
- Student's t-test assumes normality, independent sampling, and homogeneity of variance
- Mann-Whitney assumes independent sampling but not a normal distribution
- [Figure: histogram (frequency, x axis from -3 to 3) illustrating the 2 families of tests: parametric and non-parametric]
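To make the non-parametric option concrete, here is a small sketch of the Mann-Whitney U statistic using only the standard library. The function names and example values are hypothetical. The p-value uses the large-sample normal approximation without tie or continuity correction, which is an assumption of this sketch and is rough for very small groups; a statistics package should be used for real analyses.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U_x, U_y): counts of pairs where one group exceeds the other.

    Ties between an x value and a y value count as half a pair.
    """
    u_x = sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )
    return u_x, len(x) * len(y) - u_x


def mann_whitney_p_normal(x, y):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    u = min(mann_whitney_u(x, y))
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma  # z <= 0 because u is the smaller statistic
    return 2.0 * NormalDist().cdf(z)


# Hypothetical protein intensities in two groups.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
print(mann_whitney_u(control, treated))
print(mann_whitney_p_normal(control, treated))
```

Only the ordering of the values enters the calculation, which is why the test does not need the normality assumption.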

Is my Data Normally Distributed?
- A q-q plot is a plot of the quantiles of data set 1 against the quantiles of data set 2 (to check normality, the second set is the theoretical normal distribution)
- A quantile is the value below which a given proportion of the observations fall; the 50% quantile is equivalent to the median value
- If the two sets come from a population with the same distribution, the points should fall approximately along a 45° reference line
- Alternatively, plot the data as a histogram: if it shows a symmetrical peak about the mean and about 68% of the data lies within 1 standard deviation of the mean, the data are approximately normally distributed
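Both checks on this slide can be computed without plotting. The sketch below is an illustration under stated assumptions: the function names are made up, the reference distribution is a normal with the sample's own mean and standard deviation, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Pair each sorted observation with the matching normal quantile.

    Returns (theoretical quantile, observed value) pairs; if the data are
    roughly normal, the pairs lie near the 45 degree line y = x.
    """
    n = len(data)
    reference = NormalDist(mean(data), stdev(data))
    ordered = sorted(data)
    return [
        (reference.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


def fraction_within_one_sd(data):
    """Share of observations within one standard deviation of the mean."""
    mu, sd = mean(data), stdev(data)
    return sum(1 for value in data if abs(value - mu) <= sd) / len(data)


# Hypothetical measurements.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
for theoretical, observed in normal_qq_points(values):
    print(round(theoretical, 2), observed)
print(fraction_within_one_sd(values))
```

With only a handful of replicates neither check is conclusive, so these are best treated as a screen for gross departures from normality.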

Biological Error Estimation
- Does the experiment make sense?
- Hierarchical clustering is an unsupervised process: it finds structures in unlabelled data
- A cluster is a set of objects (replicates) that are similar to each other and dissimilar to the objects in other clusters
- Basic way of checking results:
  - Do similar biological replicates cluster?
  - Do technical replicates cluster within biological clusters?

Estimation of Replicates Needed
- How many replicates must I have to prove my hypothesis?
- You must define a null hypothesis
  - The hypothesis is that there is no statistical difference between control and experiment at a defined confidence level
- Power Analysis can provide an estimate of the samples needed
  - One must define a confidence level
  - One must balance sample size against error rate and size of effect

Visualising Data - Clustering

Hierarchical Clustering
- The Nearest Neighbor Algorithm is a bottom-up approach
- Starts with n nodes, where n is the size of the sample
- Merge the 2 most similar nodes at each step
- Stop when the desired number of clusters is reached

Nearest Neighbour Algorithm
- [Figure series: Nearest Neighbor clustering, from Level 1 (k = 8 clusters) through Level 8 (k = 1 cluster), with one merge per level]
- Technical replicates should cluster together within biological replicates
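The bottom-up merging described on these slides can be sketched as follows. This is a minimal, unoptimised illustration: the function name, sample names and coordinates are hypothetical, and the choice of Euclidean distance with the nearest-neighbour (single-linkage) rule between clusters is an assumption about what the slides intend.

```python
from math import dist


def nearest_neighbour_clusters(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    points: dict mapping a sample name to its coordinates.
    The distance between two clusters is the distance between their
    closest members (nearest neighbour, also called single linkage).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of samples")
    clusters = [[name] for name in points]

    def cluster_distance(a, b):
        return min(dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Hypothetical data: two biological samples, two technical replicates each.
samples = {
    "A_tech1": (1.0, 1.0),
    "A_tech2": (1.1, 0.9),
    "B_tech1": (5.0, 5.0),
    "B_tech2": (5.2, 4.9),
}
print(nearest_neighbour_clusters(samples, k=2))
```

In this made-up example the technical replicates of each biological sample merge first, which is the pattern the slide says a sensible experiment should show.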

Verification
- Orthogonal validation (Physiol Genomics 28: 24-32, 2006): Western blots, enzyme activity assays, etc.
- But if you don't see a change the second time, was it a false positive in the first experiment or a false negative in the second?
- Need new samples. Why? Measurement error does not lead to false positives; rather, there is a need to validate against sampling variability
- Carry out a Power Analysis

Power Analysis
- You must define a null hypothesis H0: there is no difference between the experiments and controls
- Finding no difference does not prove the null hypothesis; we simply do not have evidence to reject it
- Lack of a significant effect does not have to signify that the means are equal
  - Perhaps an effect exists, but the data are too noisy to demonstrate it
- We need to define the power of the experiment: the probability of detecting a real effect, i.e. of not making a type II error (power = 1 - β)

Possible Experimental Outcomes

                                     Statistically significant              Statistically not significant
                                     (p < threshold, H0 rejected)           (p > threshold, H0 not rejected)
Biologically no change (H0 true)     False positive: type one error (α)     Correct rejection of an effect (true negative)
Biological change (H0 false)         Correct acceptance of an effect        False negative: type two error (β)
                                     (true positive)

What is Power?
- Power is your ability to find a difference when a real difference exists
- The power of a study is determined by three factors:
  - Alpha level (the p-value threshold, i.e. how many false positives are allowed)
  - Sample size (number of experiments needed to get a result)
  - Effect size (how large the biological effect is): the separation of the means relative to the error variance

How do you Calculate Power?
- The best freeware solution, G*Power, is available at http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3
- Works on Mac OS X and Windows XP/Vista

Power and Sample Size
- Power analysis can be used to estimate the sample size required for a particular study
- Too small a sample size and an effect may be missed
- Too large a sample size and the study is too expensive
- Different formulae/tables for calculating sample size are required according to the experimental design

Power and Effect Size As the separation between two means increases the power also increases

Power and Effect Size As the variability about a mean decreases power also increases
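The relationship between alpha, power, effect size and sample size can be illustrated with the textbook normal approximation for comparing two group means. This sketch is not what G*Power computes (G*Power uses exact distributions for the chosen test, so its answers for a t-test are slightly larger); the function names are hypothetical, and the effect size is assumed to be standardised, i.e. the difference in means divided by the common standard deviation.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided, two-group comparison.

    Uses n = 2 * ((z_(1 - alpha/2) + z_power) / d)^2, rounded up.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


def approximate_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power for a given group size (opposite tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2.0) - z_alpha)


for d in (0.5, 1.0, 2.0):
    n = samples_per_group(d)
    print(f"effect size {d}: about {n} per group, "
          f"power {approximate_power(n, d):.2f}")
```

Halving the standardised effect size roughly quadruples the number of replicates needed, which is why reducing variability (and so increasing the effect size relative to the noise) is usually cheaper than adding samples.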

Should I Pool my Samples?
- Pooling: taking the same amount of protein from different samples to create a pool
- Assumption: the signal from the pool represents the mathematical average
- Advantage: can increase the number of samples measured
- Disadvantage: intra-group biological variation is lost
  - Option: sub-pooling, which makes it possible to estimate biological variation
- Can result in irreversible loss of information
- A pool of all samples can be used as an internal reference in DIGE, iTRAQ, etc.
- Pool a minimum of three and a maximum of five samples
- Equal pooling of samples is essential

Mixing Replicate Types
- 3 readings on each of the 3 biological replicates, in each of two groups, gives a total of 18 readings
- This is an example of pseudoreplication: there are only really 3 different subjects per group
- Student's t-test requires independent samples and cannot be used on the individual readings
- A test which allows for hierarchy in the data is needed, such as a nested ANOVA
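A simple way to see the pseudoreplication problem is to collapse the technical readings to one value per subject before any testing. The sketch below is only that: it is not a nested ANOVA, which the slide recommends and which would also estimate the technical variance. The function name and the readings are hypothetical.

```python
from statistics import mean


def collapse_technical_replicates(readings):
    """Average the technical readings so each subject contributes one value.

    readings: dict mapping a subject to its list of technical readings.
    """
    return {subject: mean(values) for subject, values in readings.items()}


# Hypothetical group: 3 subjects, 3 technical readings each (9 readings).
control = {
    "subject 1": [10.0, 10.5, 9.5],
    "subject 2": [12.0, 12.5, 11.5],
    "subject 3": [8.0, 8.5, 7.5],
}
per_subject = collapse_technical_replicates(control)

n_readings = sum(len(values) for values in control.values())
print("readings:", n_readings)
print("independent values:", len(per_subject))
print(per_subject)
```

Treating the nine readings as nine independent samples would overstate the sample size threefold; after collapsing, the group correctly contributes three values.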

Getting Help
- Learn the basics of statistics
  - Look up Wikipedia for a starting point
- Collaborate with statisticians, informatics groups, etc., BEFORE you start
- Use a reliable statistics program such as SPSS (now called PASW)
  - This has extensive on-line tutorials

Thanks
To the following for providing many slides: Morten Krogh, Michaela Scigelova, Natasha Karp, Marianne Sandin, Fredrik Levander, and many others