BST 226 Statistical Methods for Bioinformatics David M. Rocke. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 1

Similar documents
CASE-STUDY- VALIDATION of PCR based methodology. Beata Surmacz-Cordle Senior Analytical Development Scientist

Quantitative Subject. Quantitative Methods. Calculus in Economics. AGEC485 January 23 nd, 2008

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

Timing Production Runs

VICH Topic GL2 (Validation: Methodology) GUIDELINE ON VALIDATION OF ANALYTICAL PROCEDURES: METHODOLOGY

Winsor Approach in Regression Analysis. with Outlier

Gasoline Consumption Analysis

OPTIMIZATION AND CV ESTIMATION OF A PLATE COUNT ASSAY USING JMP

Chapter 3. Table of Contents. Introduction. Empirical Methods for Demand Analysis

Background Correction and Normalization. Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Revision of 30 April 2013 draft, 4 November 2013

A SIMULATION STUDY OF THE ROBUSTNESS OF THE LEAST MEDIAN OF SQUARES ESTIMATOR OF SLOPE IN A REGRESSION THROUGH THE ORIGIN MODEL

Chapter 19. Confidence Intervals for Proportions. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Introduction to gene expression microarray data analysis

Estoril Education Day

R = kc Eqn 5.1. R = k C Eqn 5.2. R = kc + B Eqn 5.3

Managerial Economics

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Validating, Verifying, and Evaluating Your Test Methods: It s NOT a Regulatory Exercise!

Energy Efficiency Impact Study

A2LA. R231 Specific Requirements: Threat Agent Testing Laboratory Accreditation Program. December 6, 2017

SECTION 11 ACUTE TOXICITY DATA ANALYSIS

Quadratic Regressions Group Acitivity 2 Business Project Week #4

STATISTICAL TECHNIQUES. Data Analysis and Modelling

Measurement Systems Analysis

Basic Statistics, Sampling Error, and Confidence Intervals

Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds. Overview. Data Analysis Tutorial

User Bulletin. ABI PRISM 7000 Sequence Detection System DRAFT. SUBJECT: Software Upgrade from Version 1.0 to 1.1. Version 1.

Troubleshooting of Real Time PCR Ameer Effat M. Elfarash

Verification of Compendial Methods

Designing Complex Omics Experiments

Quantitative Analysis Using Statistics for Forecasting and Validity Testing. Course #6300/QAS6300 Course Material

California State University, Sacramento: Lead Study

A peer-reviewed version of this preprint was published in PeerJ on 16 March 2018.

Quantitative Real time PCR. Only for teaching purposes - not for reproduction or sale

SEPARATING BACKGROUND FROM METALS CONTAMINATION: GEOCHEMICAL EVALUATIONS

Regression diagnostics

High throughput metabolomic studies of developmental toxicity in Japanese medaka

1.103 CIVIL ENGINEERING MATERIALS LABORATORY (1-2-3) Dr. J.T. Germaine Spring 2004 PROPERTIES OF HEAT TREATED STEEL

WINDOWS, MINITAB, AND INFERENCE

Leveraging Smart Meter Data & Expanding Services BY ELLEN FRANCONI, PH.D., BEMP, MEMBER ASHRAE; DAVID JUMP, PH.D., P.E.

Problem Set "Normal Distribution - Chapter Review (Practice Test)" id:[14354]

Test Method Validation of In-Vitro Diagnostic Products. Tim Carr QA Manager, Beckman Coulter

Event-specific Method for the Quantification of Soybean CV127 Using Real-time PCR. Validation Report

FACTORS CONTRIBUTING TO VARIABILITY IN DNA MICROARRAY RESULTS: THE ABRF MICROARRAY RESEARCH GROUP 2002 STUDY

ICEAA 2016 Symposium 20 Oct 2016

COST-EFFECTIVE SAMPLING of GROUNDWATER MONITORING WELLS: A Data Review & Well Frequency Evaluation. Maureen Ridley (1) and Don MacQueen (2)

STATISTICS PART Instructor: Dr. Samir Safi Name:

CHAPTER 8 T Tests. A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test

CONSIDERATION OF NEAR-FAULT GROUND MOTION EFFECTS IN SEISMIC DESIGN

Econometría 2: Análisis de series de Tiempo

ANALYTICAL PERFORMANCE OF A HANDHELD EDXRF SPECTROMETER WITH MINIATURE X-RAY TUBE EXCITATION

Evaluating Diagnostic Tests in the Absence of a Gold Standard

Sawtooth Software. Sample Size Issues for Conjoint Analysis Studies RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

June 8, Submitted to: The Environmental Protection Agency Research Triangle Park, NC. Prepared by:

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

INTERNAL PILOT DESIGNS FOR CLUSTER SAMPLES

ImmuLux Human IL-6 Fluorescent ELISA Kit

Welcome to the course, Evaluating the Measurement System. The Measurement System is all the elements that make up the use of a particular gage.

COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE

Composition Analysis of Animal Feed by HR ICP-OES

TAMU: PROTEOMICS SPECTRA 1 What Are Proteomics Spectra? DNA makes RNA makes Protein

The groundwater hydrology of the site has been logically divided into five phases of monitoring, analysis, and investigation as outlined below:

Limit of Detection. Theoretical Lowest Concentration

Why do Gage R&Rs fail?

Online Student Guide Types of Control Charts

Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Procedure 3. Statistical Requirements for CEC Test Methods

Practical Regression: Fixed Effects Models

Reliability Engineering. RIT Software Engineering

Error Analysis and Data Quality

Lifecycle Management Concepts to analytical Procedures: A compendial perspective. Horacio Pappa, Ph.D. Director - General Chapters U.S.

Lab 9: Sampling Distributions

Appendix B: Assessing the Statistical Significance of Shifts in Cell EC 50 Data with Varying Coefficients of Variation

Forecasting Introduction Version 1.7

Connecticut Department of Energy and Environmental Protection and Environmental Professionals Organization of Connecticut, Inc.

RESIDUAL FILES IN HLM OVERVIEW

Heterogeneity of Variance in Gene Expression Microarray Data

Determination of Iron Content in Different Hemoglobin Samples from Some Patients by UV-Visible Spectrophotometer

Affymetrix GeneChip Arrays. Lecture 3 (continued) Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

The combination of factors and their levels describe the battlespace conditions.

Human Cytokine Assay Products from Meso Scale Discovery

SPECIAL CONTROL CHARTS

Roche Molecular Biochemicals Technical Note No. LC 10/2000

Rev Experiment 10

CHAPTER 10 REGRESSION AND CORRELATION

Electricity Generation and Water- Related Constraints: An Empirical Analysis of Four Southeastern States

QUANLYNX APPLICATION MANAGER

Outliers and Their Effect on Distribution Assessment

Transfer of Methods Supporting Biologics and Vaccines

ProQuantum high-sensitivity immunoassays

New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays

Performance of the Sievers 500 RL On-Line TOC Analyzer

Understanding and accounting for product

PAB PRINCIPLE. Kit Reorder # ANNUAL REVIEW Reviewed by: Date. Date INTENDED USE

Copyright K. Gwet in Statistics with Excel Problems & Detailed Solutions. Kilem L. Gwet, Ph.D.

Module1TheBasicsofRealTimePCR Monday, March 19, 2007

LEAN SIX SIGMA BLACK BELT CHEAT SHEET

Application Note Detecting low copy numbers. Introduction. Methods A08-005B

Transcription:

BST 226 Statistical Methods for Bioinformatics David M. Rocke January 13, 2014 BST 226 Statistical Methods for Bioinformatics 1

Quantitative Prediction Regression analysis is the statistical name for the prediction of one quantitative variable (fasting blood glucose level) from another (body mass index) Items of interest include whether there is in fact a relationship and what the expected change is in one variable when the other changes January 13, 2014 BST 226 Statistical Methods for Bioinformatics 2

Assumptions Inference about whether there is a real relationship or not is dependent on a number of assumptions, many of which can be checked When these assumptions are substantially incorrect, alterations in method can rescue the analysis No assumption is ever exactly correct January 13, 2014 BST 226 Statistical Methods for Bioinformatics 3

Linearity This is the most important assumption If x is the predictor, and y is the response, then we assume that the average response for a given value of x is a linear function of x E(y) = a + bx y = a + bx + ε ε is the error or variability January 13, 2014 BST 226 Statistical Methods for Bioinformatics 4

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 5

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 6

In general, it is important to get the model right, and the most important of these issues is that the mean function looks like it is specified If a linear function does not fit, various types of curves can be used, but what is used should fit the data Otherwise predictions are biased January 13, 2014 BST 226 Statistical Methods for Bioinformatics 7

Independence It is assumed that different observations are statistically independent If this is not the case inference and prediction can be completely wrong There may appear to be a relationship even though there is not Randomization and then controlling the treatment assignment prevents this in general January 13, 2014 BST 226 Statistical Methods for Bioinformatics 8

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 9

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 10

Note no relationship between x and y These data were generated as follows: x y 1 1 0 x y 0.95x i 1 i i 0.95y ε η i 1 i i January 13, 2014 BST 226 Statistical Methods for Bioinformatics 11

Constant Variance Constant variance, or homoscedacticity, means that the variability is the same in all parts of the prediction function If this is not the case, the predictions may be on the average correct, but the uncertainties associated with the predictions will be wrong Heteroscedacticity is non-constant variance January 13, 2014 BST 226 Statistical Methods for Bioinformatics 12

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 13

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 14

Consequences of Heteroscedacticity Predictions may be unbiased (correct on the average) Prediction uncertainties are not correct; too small sometimes, too large others Inferences are incorrect (is there any relationship or is it random?) January 13, 2014 BST 226 Statistical Methods for Bioinformatics 15

Normality of Errors Mostly this is not particularly important Very large outliers can be problematic Graphing data often helps If in a gene expression array experiment, we do 40,000 regressions, graphical analysis is not possible Significant relationships should be examined in detail January 13, 2014 BST 226 Statistical Methods for Bioinformatics 16

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 17

Statistical Lab Books You should keep track of what things you try The eventual analysis is best recorded in a file of commands so it can later be replicated Plots should also be produced this way, at least in final form, and not done on the fly Otherwise, when the paper comes back for review, you may not even be able to reproduce your own analysis January 13, 2014 BST 226 Statistical Methods for Bioinformatics 18

Fluorescein Example Standard aqueous solutions of fluorescein (in pg/ml) are examined in a fluorescence spectrometer and the intensity (arbitrary units) is recorded What is the relationship of intensity to concentration Use later to infer concentration of labeled analyte Concentration (pg/ml) 0 2 4 6 8 10 12 Intensity 2.1 5.0 9.0 12.6 17.3 21.0 24.7 January 13, 2014 BST 226 Statistical Methods for Bioinformatics 19

> fluor.lm <- lm(intensity ~ concentration) > summary(fluor.lm) Call: lm(formula = intensity ~ concentration) Residuals: 1 2 3 4 5 6 7 0.58214-0.37857-0.23929-0.50000 0.33929 0.17857 0.01786 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1.5179 0.2949 5.146 0.00363 ** concentration 1.9304 0.0409 47.197 8.07e-08 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.4328 on 5 degrees of freedom Multiple R-Squared: 0.9978, Adjusted R-squared: 0.9973 F-statistic: 2228 on 1 and 5 DF, p-value: 8.066e-08 January 13, 2014 BST 226 Statistical Methods for Bioinformatics 20

Use of the calibration curve yˆ 1.52 1.93x yˆ is the predicted average intensity x is the true concentration y 1.52 xˆ 1.93 y is the observed intensity xˆ is the estimated concentration January 13, 2014 BST 226 Statistical Methods for Bioinformatics 21

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 22

Measurement and Calibration Essentially all things we measure are indirect The thing we wish to measure produces an observed transduced value that is related to the quantity of interest but is not itself directly the quantity of interest Calibration takes known quantities, observes the transduced values, and uses the inferred relationship to quantitate unknowns January 13, 2014 BST 226 Statistical Methods for Bioinformatics 23

Measurement Examples Weight is observed via deflection of a spring (calibrated) Concentration of an analyte in mass spec is observed through the electrical current integrated over a peak (possibly calibrated) Gene expression is observed via fluorescence of a spot to which the analyte has bound (usually not calibrated) January 13, 2014 BST 226 Statistical Methods for Bioinformatics 24

Correlation Wright peak-flow data set has two measures of peak expiratory flow rate for each of 17 patients in l/min. ISwR library, data(wright) Both are subject to measurement error In ordinary regression, we assume the predictor is known For two measures of the same thing with no error-free gold standard, one can use correlation to measure agreement January 13, 2014 BST 226 Statistical Methods for Bioinformatics 25

> cor(wright) std.wright mini.wright std.wright 1.0000000 0.9432794 mini.wright 0.9432794 1.0000000 > wplot1() ----------------------------------------------------- File wright.r: library(iswr) data(wright) attach(wright) wplot1 <- function() { plot(std.wright,mini.wright,xlab="standard Flow Meter", ylab="mini Flow Meter",lwd=2) title("mini vs. Standard Peak Flow Meters") wright.lm <- lm(mini.wright ~ std.wright) abline(coef(wright.lm),col="red",lwd=2) } detach(wright) January 13, 2014 BST 226 Statistical Methods for Bioinformatics 26

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 27

Issues with Correlation For any given relationship between two measurement devices, the correlation will depend on the range over which the devices are compared. If we restrict the Wright data to the range 300-550, the correlation falls from 0.94 to 0.77. Correlation only measures linear agreement January 13, 2014 BST 226 Statistical Methods for Bioinformatics 28

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 29

Measurement with no Gold Standard y y ξ 1 j j j 2 j j = a+ bξ + = ξ j is the true concentration Method 2 is the gold standard, measured without error We can estimate all the unknowns, including σ 2 y y ξ 1 j j j 2 j j = a+ bξ + = ξ + η j j is the true unknown concentration We have one more unknown ( σ η 2 ) but no additional data Cannot be solved without information/assumptions January 13, 2014 BST 226 Statistical Methods for Bioinformatics 30

Calibrated Measurements We produce a calibration curve of the form y = f(x), where x is the concentration of the analyte and y is the transduced value, such as peak height or peak area in mass spectrometry. Often the curve is linear. It is estimated from measurements at a series of known concentrations and their responses. A new measurement y produces an estimated concentration using x = f -1 (y). January 13, 2014 BST 226 Statistical Methods for Bioinformatics 31

Limits of Detection The term limit of detection is actually ambiguous and can mean various things that are often not distinguished from each other We will instead define three concepts that are all used in this context called the critical level, the minimum detectable value, and the limit of quantitation. The critical level is the measurement that is not consistent with the analyte being absent The minimum detectable value is the concentration that will almost always have a measurement above the critical level January 13, 2014 BST 226 Statistical Methods for Bioinformatics 32

Distribution of Measurements for T 0.0 0.1 0.2 0.3 0.4 CL MDV -4-2 0 2 4 6 8 10 Measured Concentration (ppb) January 13, 2014 BST 226 Statistical Methods for Bioinformatics 33

y = x+ N 2 ~ (0, σ ) If the concentration is zero, then y N 2 ~ (0, σ ) Negative measured values are possible and informative. If σ = 1ppb, and if we want to use the 99.9% point of the normal, which is 3.090232, then the critical level is 0 + (1)(3.090232) = 3.090232 which is a measured level. If the reading is CL = 3.090232ppb or greater, then we are confident that there is more than 0 of the analyte. If the true concentration is 3.090232, then about half the time the measured concentration is less than the CL. If the true concentration is MDV = 3.090232 + 3.090232 = 6.180464ppb, then 99.9% of the time the measured value will exceed the CL. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 34

y = x + z α : Pr( z > z ~ N(0,1) CL( α):0+ z MDV ( αβ, ):(0+ z σ) + z σ CI for x : y ± z α /2 σ z Assuming σ α α ) σ = α α is known and constant. β January 13, 2014 BST 226 Statistical Methods for Bioinformatics 35

Examples Serum calcium usually lies in the range 8.5 10.5 mg/dl or 2.2 2.7 mmol/l. Suppose the standard deviation of repeat measurements is 0.15 mmol/l. Using α = 0.01, z α = 2.326, so the critical level is (2.326)(0.15) = 0.35 mmol/l. The MDV is 0.70 mg/l, well out of the physiological range. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 36

A test for toluene exposure uses GC/MS to test serum samples. The standard deviation of repeat measurements of the same serum at low levels of toluene is 0.03 μg/l. The critical level at α = 0.01 is (0.03)(2.326) = 0.070 μg/l. The MDV is 0.140 μg/l. Unexposed non-smokers 0.4 μg/l Unexposed smokers 0.6 μg/l Chemical workers 2.8 μg/l One EPA standard is < 1 mg/l blood concentration. Toluene abusers may have levels of 0.3 30 mg/l and fatalities have been observed at 10 48 mg/l January 13, 2014 BST 226 Statistical Methods for Bioinformatics 37

The EPA has determined that there is no safe level of dioxin (2,3,7,8-TCDD (tetrachlorodibenzodioxin)), so the Maximum Contaminant Level Goal (MCLG) is 0. The Maximum Contaminant Level (MCL) is based on the best existing analytical technology and was set at 30 ppq. EPA Method 1613 uses high-resolution GC/MS and has a standard deviation at low levels of 1.2 ppq. The critical level at 1% is (2.326)(1.2ppq) = 2.8 ppq and the MDV, called the Method Detection Limit by EPA, is 5.6 ppq. 1 ppq = 1pg/L = 1gm in a square lake 1 meter deep and 10 km on a side. The reason why the MCL is set at 30 ppq instead of 2.8 ppq will be addressed later. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 38

Error Behavior at High Levels For most analytical methods, when the measurements are well above the CL, the standard deviation is a constant multiple of the concentration The ratio of the standard deviation to the mean is called the coefficient of variation (CV), and is often expressed in percents. For example, an analytical method may have a CV of 10%, so when the mean is 100 mg/l, the standard deviation is 10 mg/l. When a measurement has constant CV, the log of the measurement has approximately constant standard deviation. If we use the natural log, then SD on the log scale is approximately CV on the raw scale January 13, 2014 BST 226 Statistical Methods for Bioinformatics 39

Zinc Concentration Spikes at 5, 10, and 25 ppb 9 or 10 replicates at each concentration Mean measured values, SD, and CV are below Raw 5 10 25 Mean 4.85 9.73 25.14 SD 0.189 0.511 0.739 CV 0.039 0.053 0.029 Log 1.61 2.30 3.22 Mean 1.58 2.27 3.22 SD 0.038 0.055 0.030 January 13, 2014 BST 226 Statistical Methods for Bioinformatics 40

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 41

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 42

Summary At low levels, assays tend to have roughly constant variance not depending on the mean. This may hold up to the MDV or somewhat higher. For low level data, analyze the raw data. At high levels, assays tend to have roughly constant CV, so that the variance is roughly constant on the log scale. For high level data, analyze the logs. We run into trouble with data sets where the analyte concentrations vary from quite high to very low This is a characteristic of many gene expression, proteomics, and metabolomics data sets. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 43

The two-component model The two-component model treats assay data as having two sources of error, an additive error that represents machine noise and the like, and a multiplicative error. When the concentration is low, the additive error dominates. When the concentration is high, the multiplicative error dominates. There are transformations similar to the log that can be used here. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 44

e σ 2 2 V( y) = xv( e ) + σ e 2 σ ( e 1) y = xe σ ~ σ = 2 σ = 4 σ = σ = 2 σ = 4 σ 2 η 2 2 2 η ση 2 σ 2 e ( e 1) + σ = x x η σ + + σ 2 2 2 η 1 1 1 σ σ σ 2 6 2 ~1+ σ 2 4 6 = + + + + 2 2 2 ~ (1 + σ ) σ ~ σ 0.1 0.01 0.0001 0.2 0.04 = 0.0016 January 13, 2014 BST 226 Statistical Methods for Bioinformatics 45

January 13, 2014 BST 226 Statistical Methods for Bioinformatics 46

Detection Limits for Calibrated Assays η y = a + bxe + y a xˆ = b η = xe + / b CL is in units of the response originally, can be translated to units of concentration. MDC is in units of concentration. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 47

So-Called Limit of Quantitation Consider an assay with a variability near 0 of 29 ppt and a CV at high levels of 3.9%. Where is this assay most accurate? Near zero where the SD is smallest? At high levels where the CV is smallest? LOQ is where the CV falls to 20% from infinite at zero to 3.9% at large levels. This happens at 148ppt Some use 10*sd(0) = 290 ppt instead CL is at 67 ppt and MDV is at 135 ppt Some say that measurements between 67 and 148 show that there is detection, but it cannot be quantified. This is clearly wrong. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 48

Conc Mean SD CV 0 22 28 10 29 2 0.07 20 81 4 0.05 100 164 17 0.10 200 289 5 0.02 500 555 12 0.02 1,000 1,038 32 0.03 2,000 1,981 28 0.01 5,000 4,851 188 0.04 10,000 9,734 511 0.05 25,000 25,146 739 0.03 January 13, 2014 BST 226 Statistical Methods for Bioinformatics 49

Confidence Limits Ignoring uncertainty in the calibration line. Assume variance is well enough estimated to be known Use SD 2 (x) = (28.9) 2 + (0.039x) 2 A measured value of 0 has SD(0) = 28.9, so the 95% CI is 0 ± (1.960)(28.9) = 0 ± 57 or [0, 57] A measured value of 10 has SD(10) = 28.9 so the CI is [0, 67] A measured value of -10 has a CI of [0, 47] For high levels, make the CI on the log scale January 13, 2014 BST 226 Statistical Methods for Bioinformatics 50

Zinc Calibration If we take the spiked concentration and the peak area and predict the peak area from the concentration using linear regression, we get Peak Area = 104.5 + 7.2080 Concentration. The predicted concentration is then (Peak Area 104.5)/7.2080 The peak area measurements at 0 true concentration are 115, 631, 508, 317, 220, 93, 99, 135 The predicted concentrations in ppt are then 1.45, 73.04, 55.97, 29.48, 16.02, 1.60, 0.77, 4.23 Note that two of them are negative. These should not be reported as < 0 rather than the actual number. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 51

Weighted Least Squares y = β + β x + 0 1 V () = σ + x σ 2 2 2 η A simple approach is as follows: Step 1: For each concentration level x, find the sample variance of the peak areas s 2 2 Step 2: Regress the si on the xi to produce predict Step 3: Regress y on x with weights 1/ ˆ σ i 2 i 2 ed variances ˆ σ for each Compare with unweighted regression. More complex methods are in Wilson, Rocke, Durbin, and Kahn (2004). These methods can fall into trouble if the estimated variance at 0 is negative or 0. i x i 2 i January 13, 2014 BST 226 Statistical Methods for Bioinformatics 52

Exercise 1 The standard deviation of measurements at low level for a method for detecting benzene in blood is 52 ng/l. What is the Critical Level if we use a 1% probability criterion? What is the Minimum Detectable Value? If we can use 52 ng/l as the standard deviation, what is a 95% confidence interval for the true concentration if the measured concentration is 175 ng/l? If the CV at high levels is 12%, about what is the standard deviation at high levels for the natural log measured concentration? Find a 95% confidence interval for the concentration if the measured concentration is 1850 ng/l? January 13, 2014 BST 226 Statistical Methods for Bioinformatics 53

Exercise 2 Download data on measurement of zinc in water by ICP/MS ( Zinc.csv ). Use read.csv() to load. Conduct a regression analysis in which you predict peak area from concentration Which of the usual regression assumptions appears to be satisfied and which do not? What would the estimated concentration be if the peak area of a new sample was 1850? From the blanks part of the data, how big should a result be to indicate the presence of zinc with some degree of certainty? Try using weighted least squares for a better estimate of the calibration curve. Does it seem to make a difference? January 13, 2014 BST 226 Statistical Methods for Bioinformatics 54

References Lloyd Currie (1995) Nomenclature in Evaluation of Analytical Methods Including Detection and Quantification Capabilities, Pure & Applied Chemistry, 67, 1699 1723. David M. Rocke and Stefan Lorenzato (1995) A Two- Component Model For Measurement Error In Analytical Chemistry, Technometrics, 37, 176 184. Machelle Wilson, David M. Rocke, Blythe Durbin, and Henry Kahn (2004) Detection Limits And Goodness-of-Fit Measures For The Two-component Model Of Chemical Analytical Error, Analytica Chimica Acta, 509, 197 208. January 13, 2014 BST 226 Statistical Methods for Bioinformatics 55