December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 1

Size: px
Start display at page:

Download "December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 1"

Transcription

1 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 1

2 Orthogonality and Collinearity Two predictors are orthogonal if they are uncorrelated. For example, if two wells of a microplate each are devoted to 1) 0 added TNF-α and 20 C; 2) 0 added TNF-α and 35 C; 3) 2 nm TNF-α and 20 C; 4) 2 nm TNF-α and 35 C, then the predictors are orthogonal because the vectors (0,0,0,0,2,2,2,2) and (20,20,35,35,20,20,35,35) have correlation 0. One consequence of orthogonality is that the coefficient of each variable stays the same if the other variable is omitted. Also, the sum of squares in the ANOVA for a variable does not depend on whether the other variable is in or out. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 2

3 Collinearity If predictors are not orthogonal, then a coefficient test for whether a variable can be omitted can depend on the presence or absence of other variables. Also, an ANOVA test of whether a variable can be omitted can depend on what other variables are in or out of the model. If we are selecting predictive variables from a set of possible ones, then this introduces complications. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 3

4 ANOVA The Analysis of Variance (ANOVA) is a method of dividing up the sum of squares into parts to test statistically if one of those parts is needed. SST = SSR + SSE We can compute the MS for these parts by dividing the SS by the df. Statistical tests come from ratios of mean squares using the F-distribution. The test for the whole model is MSR/MSE. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 4

5 >> thllm2 = fitlm(th,'ldiameter ~ lconc * glucose') Estimated Coefficients: Estimate SE tstat pvalue (Intercept) e-53 lconc e-20 glucose lconc:glucose >> anova(thllm2,'components',3) SumSq DF MeanSq F pvalue lconc e-20 glucose e e lconc:glucose Error This shows the increase in the error sum of squares when each variable individually is removed. For quantitative or binary variables, this gives the same results as the t-test of the coefficient. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 5

6 'sstype' Type of sum of squares h Type of sum squares, specified as the comma-separated pair consisting of 'sstype' and the following: 1 Type I sum of squares. The reduction in residual sum of squares obtained by adding that term to a fit that already includes the terms listed before it. 2 Type II sum of squares. The reduction in residual sum of squares obtained by adding that term to a model consisting of all other terms that do not contain the term in question. 3 Type III sum of squares. The reduction in residual sum of squares obtained by adding that term to a model containing all other terms. h Hierarchical model. Similar to type 2, but with continuous as well as categorical factors used to determine the hierarchy of terms. Components = 1 (sequential each variable added in listed order) Components = 2 (each variable added to model that includes all terms not involving that variable, with respect to factors not variables) Components = 3 (each variable removed with all other variables kept) Components = h (each variable added to model including all terms not involving that variable) (default in 2018a for linear models anova) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 6

7 Components = 1 (sequential each variable added in listed order) Components = 2 (each variable added to model including all terms not involving that variable, with respect to factors not variables) Components = 3 (each variable removed with all other variables kept) Components = h (each variable added to model including all terms not involving that variable) (default in 2014a) >> thllm2 = fitlm(th,'ldiameter ~ lconc * glucose') Components = 1 (sequential) lconc ldiameter ~ 1 + lconc ldiameter ~ 1 glucose ldiameter ~ lconc + glucose ldiameter ~ lconc lconc:glucose ldiameter ~ lconc + glucose + lconc:glucose ldiameter ~ lconc + glucose December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 7

8 Components = 1 (sequential each variable added in listed order) Components = 2 (each variable added to model including all terms not involving that variable, with respect to factors not variables) Components = 3 (each variable removed with all other variables kept) Components = h (each variable added to model including all terms not involving that variable) (default in 2014a) >> thllm2 = fitlm(th,'ldiameter ~ lconc * glucose') Components = h (hierarchical) lconc ldiameter ~ lconc + glucose ldiameter ~ glucose glucose ldiameter ~ lconc + glucose ldiameter ~ lconc lconc:glucose ldiameter ~ lconc + glucose + lconc:glucose ldiameter ~ lconc + glucose December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 8

9 Components = 1 (sequential each variable added in listed order) Components = 2 (each variable added to model including all terms not involving that variable, with respect to factors not variables) Components = 3 (each variable removed with all other variables kept) Components = h (each variable added to model including all terms not involving that variable) (default in 2014a) >> thllm2 = fitlm(th,'ldiameter ~ lconc * glucose') Components = 2 or 3 (individual terms) lconc ldiameter ~ lconc + glucose + lconc:glucose ldiameter ~ glucose + lconc:glucose glucose ldiameter ~ lconc + glucose + lconc:glucose ldiameter ~ lconc + lconc:glucose lconc:glucose ldiameter ~ lconc + glucose + lconc:glucose ldiameter ~ lconc + glucose December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 9

10 >> anova(thllm2,'components',1) SumSq DF MeanSq F pvalue lconc e-28 glucose e-14 lconc:glucose Error >> anova(thllm2,'components','h') SumSq DF MeanSq F pvalue lconc e-28 glucose e-14 lconc:glucose Error Both sequential ANOVA and hierarchical ANOVA show clearly that adding the glucose variable is important, but adding the interaction of glucose and lconc is not. Thus it is proper to fit the model without the interaction, meaning that the slopes of the two lines can be considered to be the same. The last two tests are the same only the first test is slightly different. The MATLAB default of h is usually OK. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 10

11 Model Selection Suppose we have several possible predictor variables for a response. Each of the predictors and the response may be transformed. We can try to find a subset of the variables that predict well. We may want to include powers or interactions of the variables. We need a way to compare and assess models. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 11

12 Principles of Model Selection Whenever an interaction is included, so should be any main effect or interaction contained in it. If x 1 :x 2 is in the model, then so should be x 1 and x 2. If x 1 :x 2 :x 3 is in the model, then so should be x 1 :x 2, etc. Whenever x 2 is in the model then so should be x. We want the simplest model that fits the data. Occam s Razor. Many possible criteria for whether a variable should be kept: the default in MATLAB stepwise is an F-test for the two models. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 12

13 Total Lung Capacity The data set tlc contains observations on 32 lung transplant patients. age in years sex (female=1, male=2) height (cm) tlc = total lung capacity (L) We want to investigate the ability of the first three variables to predict total lung capacity. This data set was imported as a Table. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 13

14 >> summary(tlc) age: 32x1 double Values: min 11 median 28.5 max 52 sex: 32x1 double Values: min 1 median 1.5 max 2 height: 32x1 double Values: min 138 median 170 max 189 tlc: 32x1 double Values: min 3.4 median 6.15 max 9.45 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 14

15 Linear Model for tlc Data >> fitlm(tlc) Linear regression model: tlc ~ 1 + age + sex + height Estimated Coefficients: Estimate SE tstat pvalue (Intercept) age sex height Number of observations: 32, Error degrees of freedom: 28 Root Mean Squared Error: 1.16 R-squared: 0.542, Adjusted R-Squared F-statistic vs. constant model: 11.1, p-value = 5.82e-05 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 15

16 >> fitlm(tlc,'tlc~age*height*sex') Linear regression model: tlc ~ 1 + age*sex + age*height + sex*height + age:sex:height Estimated Coefficients: Estimate SE tstat pvalue (Intercept) age sex height age:sex age:height sex:height age:sex:height Number of observations: 32, Error degrees of freedom: 24 Root Mean Squared Error: 1.14 R-squared: 0.618, Adjusted R-Squared F-statistic vs. constant model: 5.54, p-value = December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 16

17 >> stepwiselm(tlc) 1. Adding height, FStat = , pvalue = e-06 Linear regression model: tlc ~ 1 + height Estimated Coefficients: Estimate SE tstat pvalue (Intercept) height e-06 Number of observations: 32, Error degrees of freedom: 30 Root Mean Squared Error: 1.19 R-squared: 0.484, Adjusted R-Squared F-statistic vs. constant model: 28.1, p-value = 9.85e-06 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 17

18 >> stepwiselm(tlc,'tlc~age*height*sex') 1. Removing age:sex:height, FStat = , pvalue = Removing sex:height, FStat = , pvalue = Removing age:sex, FStat = , pvalue = Removing age:height, FStat = , pvalue = Removing age, FStat = 1.131, pvalue = Removing sex, FStat = , pvalue = Linear regression model: tlc ~ 1 + height Estimated Coefficients: Estimate SE tstat pvalue (Intercept) height e-06 Number of observations: 32, Error degrees of freedom: 30 Root Mean Squared Error: 1.19 R-squared: 0.484, Adjusted R-Squared F-statistic vs. constant model: 28.1, p-value = 9.85e-06 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 18

19 Data Set cpus Data on the performance of 209 computer processors. perf = benchmark performance (response) syct = cycle time in nanoseconds. mmin = minimum main memory mmax maximum main memory cach = cache size chmin = minimum number of channels. chmax = maximum number of channels. December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 19

20 >> summary(cpus) syct: 209x1 double min 17 median 110 max 1500 mmin: 209x1 double min 64 median 2000 max mmax: 209x1 double min 64 median 8000 max cach: 209x1 double min 0 median 8 max 256 chmin: 209x1 double min 0 median 2 max 52 chmax: 209x1 double min 0 median 8 max 176 perf: 209x1 double min 6 median 50 max 1150 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 20

21 160 Histogram of perf December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 21

22 50 Histogram of log(perf) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 22

23 140 Histogram of syct December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 23

24 45 Histogram of log(syct) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 24

25 160 Histogram of mmin x 10 4 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 25

26 60 Histogram of log(mmin) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 26

27 90 Histogram of max x 10 4 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 27

28 60 Histogram of log(mmax) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 28

29 150 Histogram of cach December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 29

30 80 Histogram of log(cach+2) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 30

31 150 Histogram of chmin December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 31

32 100 Histogram of log(chmin+1) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 32

33 100 Histogram of sqrt(chmin) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 33

34 150 Histogram of chmax December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 34

35 60 Histogram of log(chmax+1) December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 35

36 >> cpus2 = table(log(cpus.syct),log(cpus.mmin),log(cpus.mmax),log(cpus.cach+2), log(cpus.chmin+1),log(cpus.chmax+1),log(cpus.perf), 'VariableNames',{'lsyct', 'lmmin', 'lmmax', 'lcach', 'lchmin', 'lchmax', 'lperf'} ) >> fitlm(cpus2) Linear regression model: lperf ~ 1 + lsyct + lmmin + lmmax + lcach + lchmin + lchmax Estimated Coefficients: Estimate SE tstat pvalue (Intercept) lsyct lmmin lmmax e-11 lcach e-14 lchmin lchmax Number of observations: 209, Error degrees of freedom: 202 Root Mean Squared Error: R-squared: 0.834, Adjusted R-Squared F-statistic vs. constant model: 169, p-value = 7.86e-76 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 36

37 >> stepwiselm(cpus2) 1. Adding lmmax, FStat = , pvalue = e Adding lcach, FStat = , pvalue = e Adding lchmin, FStat = , pvalue = e Adding lmmin, FStat = , pvalue = Adding lmmin:lmmax, FStat = , pvalue = e Adding lchmax, FStat = , pvalue = Adding lchmin:lchmax, FStat = , pvalue = Adding lmmax:lchmin, FStat = , pvalue = Linear regression model: lperf ~ 1 + lcach + lmmin*lmmax + lmmax*lchmin + lchmin*lchmax Estimated Coefficients: Estimate SE tstat pvalue (Intercept) e-05 lmmin e-06 lmmax lcach e-12 lchmin lchmax lmmin:lmmax e-08 lmmax:lchmin lchmin:lchmax e-05 Number of observations: 209, Error degrees of freedom: 200 Root Mean Squared Error: R-squared: 0.866, Adjusted R-Squared 0.86 F-statistic vs. constant model: 161, p-value = 7.19e-83 December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 37

38 Model Selection Many models will make predictions of similar quality. In such cases, there is no one best model, but one can still select the simplest model that seems to do a good job. Stepwise regression can be helpful, but the variables should first be placed on the right scale (possibly taking logs). If one starts with many variables, stepwise regression will usually select some even if none of the variables actually is related to the response, so caution is often called for (see pages in the text). December 4, 2018 BIM 105 Probability and Statistics for Biomedical Engineers 38