Checking the model. Linearity. Normality. Constant variance. Influential points. Covariate overlap

1 Checking the model
- Linearity
- Normality
- Constant variance
- Influential points
- Covariate overlap

2 Checking the model: linearity
- Average value of outcome initially assumed to be a linear function of continuous predictors
  - slope of regression line assumed constant
  - equivalently, regression line has no curvature
- If model is correct, residuals have mean zero at every value of the predictor

3 Checking the model: linearity
- If assumption is badly violated, result can be
  - biased coefficient estimates, residual confounding
  - reduced precision and power, missed real effects
  - misleading, over-simplified conclusions

4 Three departures from linearity
[Figure: four panels of y versus x, each showing a linear fit, a LOWESS smooth, and E[y|x]]

5 Diagnostics: RVP and CPR plots
- To account for effects of other predictors, diagnostics use residuals rather than the outcome
- Basic approach: check for non-linear patterns in residual versus predictor (RVP) plots, one for each continuous predictor
- Better alternative: component plus residual (CPR) plots
  - component due to predictor added back into residual

6 Diagnostics: RVP and CPR plots
- CPR plots better for diagnosing non-linearity:
  - show trend, RVP plots do not
  - easier to add LOWESS smooth
- Need to use RVP for quadratic, other polynomial models
  - e.g., $E[Y \mid X] = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$
- In both CPR and RVP: mismatch between the linear regression line and the LOWESS smooth indicates lack of linearity
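
The two diagnostic plots can be sketched in Stata as follows. This is a minimal illustration that assumes Stata's built-in auto dataset as a stand-in for the BMD data, so the variables price, weight, mpg, and foreign are not from the slides.

```stata
* Illustrative only: built-in auto data, not the BMD data from the slides
sysuse auto, clear
regress price weight mpg foreign

* RVP plot: residuals versus one continuous predictor
rvpplot weight, yline(0)

* RVP plot with a LOWESS smooth added by hand
predict double resid, residuals
twoway (scatter resid weight) (lowess resid weight), yline(0)

* CPR plot with a LOWESS smooth to compare against the linear fit
cprplot weight, lowess
```

In the CPR plot, a LOWESS smooth that bends away from the straight line is the signal of non-linearity described above.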

7 RVP plot for weight and BMD
[Figure: BMD residuals versus weight (kg), with LOWESS smooth of the residuals]

8 CPR plot for weight and BMD
[Figure: BMD component plus residual versus weight (kg)]

9 Solution: transform continuous predictors
- Smooth predictor transformations to fix non-linearity:
  - $\log(x)$, provided $E[Y \mid X]$ is monotone
  - square root, cube root, other fractional powers of $x$
  - $x^2$, $x^3$ (lower-order terms usually included in the model)
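
A hedged sketch of these transformations in Stata, again assuming the built-in auto dataset rather than the course data (all variable names here are illustrative):

```stata
sysuse auto, clear

* log transformation of the predictor, then re-check with a CPR plot
generate lweight = ln(weight)
regress price lweight mpg foreign
cprplot lweight, lowess

* quadratic in weight; factor-variable notation keeps the lower-order term
regress price c.weight##c.weight mpg foreign

* Wald test of the squared term, i.e. a test for curvature
test c.weight#c.weight

* polynomial model: check residuals against weight (RVP-style plot)
predict double resid2, residuals
twoway (scatter resid2 weight) (lowess resid2 weight), yline(0)
```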

10 Predictor transformations
[Figure: four panels plotted against x: square of x; square and cube of x; log of x; square root of x]

11 CPR plot for log-weight and BMD
[Figure: BMD component plus residual versus natural log of weight]

12 RVP plot for log-weight and BMD
[Figure: BMD residuals versus natural log of weight, with LOWESS smooth of the residuals]

13 Alternatives: categorize the predictor
- Split at quantiles or clinically familiar cutpoints
- Models mean as a step function
- Flexible, familiar, clinically interpretable, but
  - unrealistic if the regression line changes smoothly
  - sensitive to choice of cutpoints
  - inefficient compared to smooth transformations
- Number of categories must balance fit against noisiness
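
As a rough illustration of categorizing at quantiles, assuming the built-in auto dataset and an arbitrary choice of four categories:

```stata
sysuse auto, clear

* quartile categories of weight (the number of categories is a judgment call)
xtile wtq = weight, nquantiles(4)
tabstat weight, by(wtq) stat(n min max)

* step-function model: mean price shifts between categories
regress price i.wtq mpg foreign

* joint test of the category indicators
testparm i.wtq
```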

14 Too coarsely categorizing the predictor
[Figure: BMD (gm/cm^2) versus BMI (kg/m^2), with categorical fit and LOWESS fit]

15 A better tradeoff
[Figure: BMD (gm/cm^2) versus BMI (kg/m^2), with categorical fit and LOWESS fit]

16 Alternatives: linear, restricted cubic splines
- Flexibly relax linearity assumption (mkspline command)
- Linear spline: piecewise linear with knots
- Restricted cubic spline: better behaved than polynomials
  - easy test for linearity, but presentation requires plotting
- Also: fractional polynomials (fracpoly command)

17 Linear spline model for BMI effect on BMD
. mkspline bmi1 # bmi2 25 bmi3 30 bmi4 35 bmi5 = bmi
. regress bmd bmi1-bmi5
(# marks the first knot, whose value did not survive transcription)
[Output: regression table with F(5, 272) and coefficients for bmi1 through bmi5 and _cons; numeric values not preserved]

18 Linear spline fit
[Figure: BMD (gm/cm^2) versus BMI (kg/m^2), with linear spline fit]

19 Testing for non-linearity using linear splines
. testparm bmi*, equal
 ( 1)  - bmi1 + bmi2 = 0
 ( 2)  - bmi1 + bmi3 = 0
 ( 3)  - bmi1 + bmi4 = 0
 ( 4)  - bmi1 + bmi5 = 0
       F(  4,   272) =    2.24
            Prob > F =
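
The same linear-spline test can be tried on data that ships with Stata. This sketch assumes the built-in auto dataset, and the knots at 2,500 and 3,500 lb are arbitrary illustrative choices.

```stata
sysuse auto, clear

* linear spline in weight with two illustrative knots
mkspline wt1 2500 wt2 3500 wt3 = weight

* without the marginal option, each coefficient is the slope within its own interval
regress price wt1 wt2 wt3 mpg foreign

* linearity corresponds to equal slopes in every interval
testparm wt1 wt2 wt3, equal
```

A small p-value from the equality test suggests that the slope changes across knots, that is, a departure from linearity.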

20 Cubic spline model for trends in viral load, in patients with wild type and drug-resistant HIV
. mkspline dursp = duration, cubic knots( )
. forvalues i = 1/4 {
      forvalues j = 0/1 {
          gen dursp`i'_`j' = dursp`i' * (Anyresistance == `j')
      }
  }
. xtmixed logvl Anyresistance dursp*_0 dursp*_1 || studyid: duration, cov(uns)
[Output: coefficient table for logvl with rows Anyresistance, dursp1_0 through dursp4_0, dursp1_1 through dursp4_1, and _cons; numeric values not preserved]

21 Cubic spline model for trends in viral load, in patients with wild type and drug-resistant HIV
[Figure: log viral load versus days since HIV infection, for wild type and any resistance]

22 Test for any time effect on VL in drug resistant group
. testparm dursp1_1 dursp2_1 dursp3_1 dursp4_1
 ( 1)  [logvl]dursp1_1 = 0
 ( 2)  [logvl]dursp2_1 = 0
 ( 3)  [logvl]dursp3_1 = 0
 ( 4)  [logvl]dursp4_1 = 0
           chi2(  4) =
         Prob > chi2 =
Test for departure from linearity in drug resistant group
. testparm dursp2_1 dursp3_1 dursp4_1
 ( 1)  [logvl]dursp2_1 = 0
 ( 2)  [logvl]dursp3_1 = 0
 ( 3)  [logvl]dursp4_1 = 0
           chi2(  3) =
         Prob > chi2 =
Similar code for testing within wild type group

23 Full disclosure: testing for between-group differences is complicated
foreach day in 30 60 90 {
    * calculate values of spline variables at 30, 60, and 90 days after infection
    * see mkspline entry of Stata online PDF manual, page 1057
    * requires variables k1-k5 giving knot locations
    local sp1 = `day'
    forvalues i = 1/3 {
        local j = `i' + 1
        local sp`j' = (max(0,(`day'-k`i')^3) - ///
            (max(0,(`day'-k4)^3)*(k5-k`i') - max(0,(`day'-k5)^3)*(k4-k`i'))/(k5-k4))/(k5-k1)^2
    }
    * estimate and test difference between wild type and drug resistant groups
    lincom Anyresistance ///
        + `sp1'*(dursp1_1-dursp1_0) ///
        + `sp2'*(dursp2_1-dursp2_0) ///
        + `sp3'*(dursp3_1-dursp3_0) ///
        + `sp4'*(dursp4_1-dursp4_0)
    display "Above: test for between-group differences at day `day'"
}

24 But results are suggestive
[Output: three lincom results for logvl (Coef., Std. Err., z, P>|z|, 95% Conf. Interval), each followed by "Above: test for between-group differences at day ...", the last at day 90; numeric values not preserved]

25 Checking linearity: summary
- Diagnostics:
  - linear models: curved LOWESS smooth in CPR or RVP plot
  - more generally (i.e., linear, logistic, Cox models): fit restricted cubic spline, test for departure from linearity using testparm for all but the first spline component
- Solutions: transform predictor, use linear or cubic splines
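
The restricted cubic spline test in the summary can be sketched as below. It assumes the built-in auto dataset and four knots at mkspline's default percentiles, both illustrative choices rather than anything used in the slides.

```stata
sysuse auto, clear

* restricted cubic spline in weight with 4 knots: creates wsp1, wsp2, wsp3
mkspline wsp = weight, cubic nknots(4)

regress price wsp1 wsp2 wsp3 mpg foreign

* wsp1 is weight itself; the remaining components carry the non-linearity
testparm wsp2 wsp3
```

The same testparm step applies after logistic or Cox models fitted with the spline components as predictors.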

26 Checking the model: normality
- t- and F-tests, CIs based on normality of errors (ε)
- Fairly robust to violations, especially short-tailed errors in larger samples
- However, long-tailed errors can degrade power, precision
- Diagnostics:
  - Q-Q and other plots of residuals
  - tests for normality lack power where you need it
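
A minimal sketch of the residual plots, assuming the built-in auto dataset as a stand-in for the course data:

```stata
sysuse auto, clear
regress price weight mpg foreign
predict double resid, residuals

* kernel density of residuals with a normal density overlaid
kdensity resid, normal

* normal quantile (Q-Q) plot: curvature signals non-normality
qnorm resid
```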

27 Diagnosing departures from normality
[Figure: density plots of residuals and normal Q-Q plots (residuals versus inverse normal)]

28 Solution: transform the outcome
- Residuals skewed (usually to the right):
  - log, square root, other power transformations
  - may need to add constant to make all values positive
- Search for best transformation using qladder command
- Residuals symmetrically long-tailed:
  - rank transformation, trimming, Winsorization
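
One way to run the transformation search, sketched on the built-in auto dataset (price plays the role of a right-skewed outcome; it is not the LDL variable from the slides):

```stata
sysuse auto, clear

* quantile-normal plots of price under each ladder-of-powers transformation
qladder price

* the same search in table form
ladder price

* refit with the log-transformed outcome and re-check the residuals
generate lprice = ln(price)
regress lprice weight mpg foreign
predict double lresid, residuals
qnorm lresid
```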

29 Q-ladder plots for LDL
[Figure: quantile-normal plots of LDL cholesterol (mg/dl) by transformation: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]

30 Residuals of log-transformed LDL
[Figure: histogram, kernel density estimate, and normal Q-Q plot (residuals versus inverse normal) of the residuals]

31 Another solution: bootstrap CIs
- Resample N observations with replacement from data, re-fit model, store estimates, repeat 100, 500, 1,000 times or more
- Distribution of bootstrap estimates models sampling distribution of actual estimate
- Quick, partial solution:
  1. replace model-based SE by SD of bootstrap estimates
  2. construct CIs assuming Normality

32 A better solution: percentile bootstrap CIs
- 95% CI: 2.5th to 97.5th percentile of bootstrap estimates
- Bias-correction shifts CI slightly to right or left
- Slower but avoids making Normality assumption
- Requires using many (1,000 or more) bootstrap samples
  - extreme percentiles are noisy!
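
Both bootstrap approaches can be sketched with Stata's bootstrap prefix. The example assumes the built-in auto dataset, and the seed and number of replications are arbitrary.

```stata
sysuse auto, clear

* 1,000 bootstrap resamples of the regression coefficients
* default output: bootstrap SEs with normal-based CIs (the quick, partial solution)
bootstrap _b, reps(1000) seed(12345): regress price weight mpg foreign

* percentile and bias-corrected CIs from the same bootstrap estimates
estat bootstrap, percentile bc
```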

33 Solution: model a transform of the mean (rather than a transform of the outcome)
- Logistic model for binary outcomes uses logit transformation of $E[Y \mid X] = \Pr[Y = 1 \mid X]$

  $$\log\left(\frac{E[Y \mid X]}{1 - E[Y \mid X]}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \qquad (1)$$

- Other generalized linear models (GLMs) avoid dichotomizing outcome, generally use $\log E[Y \mid X]$ (Biostat 209)
  - gamma, Poisson, negative binomial, zero-inflated Poisson and negative binomial
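
As a hedged illustration of modeling a transform of the mean, the sketch below fits a logistic model and a gamma GLM with log link to the built-in auto dataset. The outcome and predictors are chosen only for convenience.

```stata
sysuse auto, clear

* logistic model: the logit of Pr[foreign = 1 | X] is linear in the predictors
logit foreign mpg

* gamma GLM with log link: models log E[price | X] without transforming price
glm price weight mpg, family(gamma) link(log)

* exponentiated coefficients are ratios of means
glm price weight mpg, family(gamma) link(log) eform
```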

34 Another solution: ordinal models
- Agatston scores for coronary artery calcium (CAC)
  - mostly zeroes with long right tail
- Log-transformation (after adding 1) does not help:
  - still mostly zeroes with long right tail
- Could dichotomize outcome as CAC > 0 or CAC > 10, use logistic model
  - but potentially wasteful

35 Another solution: ordinal models
- Alternatively, categorize CAC as 0, 1-9, 10-99, 100-399, and 400 or more, and use a regression model for ordinal outcomes
  - proportional odds (ologit)
  - continuation ratio (ocratio)
- Proportional odds assumption relaxed using gologit2
- Steve will briefly cover these
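
A proportional odds model can be sketched with the built-in ologit command. The example assumes the auto dataset, where rep78 is a five-level ordered repair record standing in for categorized CAC.

```stata
sysuse auto, clear

* proportional odds model for an ordered categorical outcome
ologit rep78 mpg weight

* same model reported as odds ratios
ologit rep78 mpg weight, or
```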

36 Checking normality: summary
- Diagnostics: curvature in Q-Q plot
- Solutions: transform outcome, use bootstrap percentile CIs, or GLM or ordinal model

37 Checking the model: constant variance
- If constant variance assumption is violated
  - coefficient estimates unbiased but inefficient
  - tests for between-group differences may be invalid
  - unlike Normality problems, larger samples don't help

38 Diagnostics: constant variance
- Plot residuals against fitted values, predictors
  - check for horizontal funnel shapes
- Compare sample size, variance of residuals across subgroups:
  - watch out if both differ by factors of more than 2
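
These checks can be sketched as follows, assuming the built-in auto dataset with foreign as the illustrative subgroup variable:

```stata
sysuse auto, clear
regress price weight mpg foreign

* residuals versus fitted values: look for a funnel shape
rvfplot, yline(0)

* residuals versus a predictor
rvpplot weight, yline(0)

* sample size and residual variance by subgroup
predict double resid, residuals
tabstat resid, by(foreign) stat(n var) nototal
```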

39 RVF plot to diagnose non-constant variance
[Figure: residuals versus fitted values]

40 Solution: transform outcome

  Outcome                       Transformation
  variance $\propto$ mean       square root
  SD $\propto$ mean             log
  proportions                   arcsin
  correlations                  $\log[(1 + \rho)/(1 - \rho)]$

41 After square root transformation of outcome
[Figure: residuals versus fitted values]

42 Comparing N, residual variance by subgroup
. tabstat resid, by(physact) stat(n var) nototal
[Output: N and variance of resid for physact = much less active, somewhat less active, about as active, somewhat more active, much more active; numeric values not preserved]
. tabstat resid, by(diabetes) stat(n var) nototal
[Output: N and variance of resid for diabetes = no, yes; numeric values not preserved]

43 Solution: use robust SEs
. regress glucose diabetes i.physact age i.raceth smoking drinkany, vce(robust)
[Output: coefficient table for glucose with robust standard errors, rows diabetes, physact, age, raceth, smoking, drinkany, _cons; numeric values not preserved]

44 ... or use more conservative robust SEs
. regress glucose diabetes i.physact age i.raceth smoking drinkany, vce(hc3)
[Output: coefficient table for glucose with robust HC3 standard errors, same rows; numeric values not preserved]

45 Solution: use GLMs

  Distribution         Variance-to-mean relationship        Outcome
  Normal               $\sigma^2$ constant                  Continuous
  Binomial             $\sigma^2 = n\mu(1 - \mu)$           Successes in n trials
  OD Binomial          $\sigma^2 \propto n\mu(1 - \mu)$     Clustered successes
  Poisson              $\sigma^2 = \mu$                     Counts
  OD Poisson           $\sigma^2 \propto \mu$               Counts
  Negative binomial    $\sigma^2 = \mu + \mu^2/k$           Counts
  Gamma                $\sigma \propto \mu$                 Continuous

OD: over-dispersed. See Table 8.8, VGSM

46 Checking constant variance: summary
- Diagnostics: funnel shapes in RVP plot, variable Ns, SDs across subgroups
- Solutions: transform outcome, use robust SEs or GLM

47 Checking the model: high leverage and influential points
- High-leverage: an extreme predictor value, or an anomalous combination of predictor values
  - potential to influence coefficient estimates unduly
- Influential: high-leverage plus big impact on coefficients
- Inferences based on a few observations potentially misleading

48 Simple outlier, high leverage, high influence
[Figure: three panels of y versus x, each with regression lines fit to all data points and omitting point X: a low leverage outlier, a high leverage point, and a high leverage outlier. Panel annotations give leverage (0.04, 0.52, 0.52) and dfbeta for X; only one dfbeta value, -.61, is preserved]

49 Diagnostics: boxplots of dfbeta statistics
- dfbeta statistics measure changes in each $\beta_j$ when each data point is omitted
- Defined for each observation and predictor in model
- Check for outliers in boxplots of dfbetas
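
A minimal sketch of the dfbeta boxplots, assuming the built-in auto dataset and a recent Stata version in which dfbeta names its output variables _dfbeta_1, _dfbeta_2, and so on:

```stata
sysuse auto, clear
regress price weight mpg foreign

* one dfbeta variable per predictor: _dfbeta_1 (weight), _dfbeta_2 (mpg), _dfbeta_3 (foreign)
dfbeta

* boxplots: points far outside the whiskers flag potentially influential observations
graph box _dfbeta_1 _dfbeta_2 _dfbeta_3
```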

50 Boxplots of dfbetas for BMI - LDL model
[Figure: boxplots of DFbmi, DFnonwhite, DFdrinkany, DFage10, DFsmoking]

51 Solution
- Identify up to 10 observations with biggest DFbetas
- Check for data errors or other anomaly
- Refit model without influential points, re-assess conclusions, report sensitivities
- Consider deleting influential points if they represent a different population
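
The steps above can be sketched as a small sensitivity analysis. The example assumes the built-in auto dataset, focuses on the dfbeta for a single predictor, and borrows a 0.2 cutoff purely for illustration.

```stata
sysuse auto, clear
regress price weight mpg foreign
estimates store all_obs

* dfbeta for weight only: creates _dfbeta_1
dfbeta weight
generate double absdf = abs(_dfbeta_1)

* list the 10 observations with the largest absolute dfbeta
gsort -absdf
list make price weight absdf in 1/10

* refit without the flagged points and compare the two sets of estimates
regress price weight mpg foreign if absdf <= 0.2
estimates store trimmed
estimates table all_obs trimmed, b(%9.3f) p(%6.3f)
```

Reporting both columns side by side shows readers how far the conclusions depend on the flagged observations.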

52 Sensitivity of LDL model to 4 influential points with dfbetas > 0.2 in absolute value
[Table: estimated coefficient and P-value for BMI, Age, Nonwhite, Smoking, and Alcohol Use, from all observations and after omitting the 4 points; numeric values not preserved]

53 Checking influential points: summary
- Diagnostics: boxplots of dfbetas
- Solutions: fix errors, conduct sensitivity analyses omitting influential points

54 Checking the model: covariate overlap
- Observational analysis of binary exposure problematic if exposed, unexposed too unlike
- Lack of overlap makes true model hard to find, especially in small datasets
- Comparing each covariate in exposed and unexposed may not be enough, because covariates are correlated: some combinations of covariates may be unrepresented in one group

55 Lack of age overlap in model for effect of treatment on Beck Depression Inventory score
[Figure: change in BDI score versus age, with the true model for BDI change in treated and the true model for BDI change in controls]

56 No power to detect interaction
. regress del_bdi i.treatment##c.age
[Output: regression table with F(3, 27) and coefficients for treatment, age, treatment#c.age, and _cons; numeric values not preserved]

57 Diagnosing lack of overlap
- Compare mean, quartiles, range of covariates in exposed and unexposed
- Use propensity scores
  - fit logistic model for primary predictor
  - include an MSAS (minimally sufficient adjustment set) for the exposure-outcome relationship
  - capture non-linearities and interactions
  - get fitted values (on linear predictor or probability scale)
  - plot the results by primary predictor and check overlap

58 Propensity score model for statin use
. * logistic model for statin use
. quietly logistic statins agesp* i.raceth i.educ_cat ///
>     i.smoking##i.lessactive diabetes
. * calculate logit propensity score
. predict logit_ps, xb
. * density plots of logit scores in statin users and non-users
. twoway (kdensity logit_ps if statins==1, area(1) lpattern(solid)) ///
>     (kdensity logit_ps if statins==0, area(1) lpattern(longdash)), ///
>     ytitle("density") xtitle("logit Propensity Score") ///
>     legend(order(1 "Treated" 2 "Untreated")) ///
>     saving(pscores, replace)

59 Overlap diagnostics for statin use
[Figure: density of logit propensity score in treated and untreated]

60 Solution: lack of overlap
- Restrict inference to region of good overlap
- Match on prognostic covariates or propensity scores
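
One simple way to flag a region of overlap is sketched below on simulated data. The variables age, treated, and in_overlap are hypothetical, and the min/max rule on the logit propensity score is an assumption made for illustration, not the method used in the slides.

```stata
* simulated data: older subjects are more likely to be treated
clear
set seed 2024
set obs 500
generate age = rnormal(60, 10)
generate byte treated = runiform() < invlogit(-6 + 0.1*age)

* logit propensity score
quietly logistic treated age
predict double logit_ps, xb

* overlap region: larger of the two group minima to smaller of the two group maxima
quietly summarize logit_ps if treated == 1
local lo1 = r(min)
local hi1 = r(max)
quietly summarize logit_ps if treated == 0
local lo = max(`lo1', r(min))
local hi = min(`hi1', r(max))

generate byte in_overlap = inrange(logit_ps, `lo', `hi')
tabulate treated in_overlap
```

Analyses restricted with `if in_overlap == 1` then apply only to subjects whose propensity scores are represented in both groups.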

61 Restricting inference to region of overlap
[Figure: change in Beck Depression Inventory score versus age, with the inference region marked]

62 Checking overlap: summary
- Diagnostics: compare covariates, density plots of logit-propensity scores in exposed, unexposed
- Solutions: restrict inference to region of good overlap, possibly by matching

63 Model checking: to transform or not
- Transformations can help meet assumptions but make results harder to interpret
- If violations mild, results robust, reasonable not to transform
- If conclusions change substantially after transformation, the model that meets assumptions better is more reliable

64 Model checking: summary
- Non-linearity:
  - Diagnostics: curved LOWESS smooth in CPR or RVP plot
  - Solutions: transform predictor, including splines
- Non-normality:
  - Diagnostics: curvature in Q-Q plot
  - Solutions: transform outcome, use bootstrap CIs, GLM or ordinal model

65 Model checking: summary
- Non-constant variance:
  - Diagnostics: funnel shapes in RVP plot, SDs differ across unequal size subgroups
  - Solutions: transform outcome, use GLM, robust SEs
- Influential points:
  - Diagnostics: boxplots of dfbeta statistics
  - Solutions: identify up to 10 influential points, correct data errors, omit influential points if justifiable, present sensitivity analysis