RACE 616: Advance Analysis in Medical Research, Jan 5th - Feb 7th, 2017


1 Ammarin Thakkinstian, Ph.D.
Section for Clinical Epidemiology and Biostatistics (CEB)
ammarin.tha@mahidol.ac.th
Application: CEB-RAMA
Phone:

Course outline
Contents | Session | Module | Assignments
Logistic regression | 1 | I, P1 |
ROC curve analysis & clinical prediction score | 1 | I |
Clinical prediction scores (cont.) | 1 | I, P2 | 1
Log-linear & Poisson regression | 1 | II | 2
Tutoring: wrap up, questions & answers | | |
Survival analysis I: KM & Cox regression | 1 | III, P1 | 3
Survival analysis II: Competing risk model | 2 | III, P2 |
Survival analysis III: Multi-state model | 3 | III, P3 | 4
Sample size estimation | 4 | III, P4 |

2 Course outline (cont.)
Contents | Session | Module | Assignments
Longitudinal data analysis I | 1 | IV |
Longitudinal data analysis II | 2 | IV | 5

Evaluation
- Five assignments
- Due ~2 weeks after finishing that topic

Resource module
CEB-RAMA application
- Modules
- Data
- Assignments
- Slides
- Further readings
- Appendix

3 Reference
- Hosmer DW, Lemeshow S. Applied logistic regression, 2nd edition. New York: John Wiley & Sons, Inc.
- Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and other multivariable methods, 3rd edition. Washington: Duxbury Press; 1998.
- Pagano M, Gauvreau K. Principles of biostatistics. California: Duxbury Press; 1993.
- Moons KG, Kengne AP, Woodward M, et al. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart 2012;98(9):
- North RA, McCowan LM, Dekker GA, et al. Clinical risk prediction for pre-eclampsia in nulliparous women: development of model in international prospective cohort. BMJ 2011;342:d1875.
- Cook NR, Paynter NP. Performance of reclassification statistics in comparing risk prediction models. Biom J 2011;53(2):
- Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 2008;27(2):157-72; discussion

4 Logistic regression analysis

Objective
- Apply logistic regression properly
- Construct the logit equation
- Estimate the probability of the event, the adjusted odds ratio and its 95% confidence interval
- Interpret the results of logistic regression analysis
- Assess goodness of fit of the logit model & diagnostic measures

5 Objective
- Develop a clinical prediction model using the logit equation
- Calibrate the cut-off or threshold
- Assess model performance
  - Calibration
  - Validation
- Perform internal and external validations

Outline of talk
- Construct logistic equation
  - Simple logistic model
  - Multiple logistic model
- Model selection
- Assessing goodness of fit of the model
- Diagnostic measures
- Creating a clinical prediction score
  - Derivation phase
  - Validation phase
    - Internal validation
    - External validation

6 When to apply logistic regression
- Assessing association between factors and the outcome
  - Outcome: dichotomous only, e.g. DM/non-DM, HT/non-HT, CKD/non-CKD, retinopathy/non-retinopathy
  - Studied factors: can be either continuous or categorical variables
- Constructing risk/prognostic prediction models

Example 1
Factors associated with acute stroke
- Design: case-control study
- Outcome variable: case vs control
  - A case is a patient who has been diagnosed with haemorrhagic or ischemic stroke
  - A control is a subject who has no history of stroke

7 Variables of interest
- Demographic variables: age, gender, BMI, waist-hip ratio
- Risk behaviour: smoking, alcohol consumption, physical activity
- History of illness: DM, HT, high cholesterol, LDL, HDL, Trig

Variables (cont.)
- Genetic factors
  - Tissue-type plasminogen activator (t-PA)
  - R353Q polymorphism of the Factor VII gene
  - Platelet glycoprotein (GP Ibα) gene Thr/Met & Kozak polymorphisms

8 Example 2
Prognostic factors of retinopathy in type 2 diabetic patients
- Design: cohort study
- Study period: 10 years
- Outcome: retinopathy vs non-retinopathy

Variables
- Demographic data: age, gender, BMI/waist-hip ratio, smoking, alcohol
- History of disease: HT, abnormal lipid profile
- Clinical data: SBP/DBP, kidney function (GFR or Cr), HbA1c
- Medication: ACE-I, ARB

9 Example 3
Risk factors of chronic kidney disease (CKD)
- Design: cross-sectional survey study
- Outcome: CKD versus non-CKD

Variables
- Demographic variables: age, gender, BMI/waist-hip ratio
- Risk/preventive behaviours: alcohol consumption, smoking, exercise & physical activity
- Co-morbidity: DM, HT, abnormal lipid profile, kidney stone
- Medications: NSAID, cyclo-oxygenase type 2 inhibitor (Cox-2), traditional medicine

10 Example 4
Is MPV associated with progression of cardiovascular diseases?
- Design: EGAT cohort with major cardiovascular diseases
- Study period:
- Outcome: cardiovascular death

Studied variables
- MPV
- Covariables
  - Demographic variables: age, gender, BMI/waist-hip ratio
  - Risk/preventive behaviours: alcohol consumption, smoking, exercise & physical activity
  - Co-morbidity: DM, HT, abnormal lipid profile

11 Example 5
Factors associated with sleep apnea
- Design: cross-sectional study of subjects who were on the waiting list for polysomnography (PS) at the sleep lab centre, Royal Newcastle Hospital, Newcastle, Australia
- Variables
  - Demographic variables
  - Sleep variables
  - Co-morbidity

Variables | Description | Categorical variables | Code/value
Age | Age at performing PS, years | < ..., ..., > 60 |
BMI | Body mass index, weight/height^2 (height in m) | < ..., ..., > 40 |
Sex | Gender | Male, Female |

12 Variables | Description | Categorical variables | Code/value
Snoring | History of snoring | Yes, No |
Stopping breathing | History of stopping breathing | Yes, No |
Choking | History of choking during sleep | Yes, No |
Waking up refreshed | History of being refreshed after waking up | Yes, Sometimes, No |
Leg kicking | History of kicking legs during sleep | Yes, Sometimes, No |

Variables | Description | Categorical variables | Code/value
DM | History of diabetes | Yes, No |
Hypertension | History of high blood pressure | Yes, No |
Allergy | History of allergy | Yes, No |
Outcome: ahi | Apnoea-hypopnoea index | > 5, ≤ 5 |

13 Assess associations between categorical variables
2x2 contingency table: snore (1, 2, Total) by SA (1, 2, Total) [cell counts not preserved]
- H0: snoring and sleep apnea (SA) are independent, or H0: P1 = P2
- Statistical test
  - Chi-square
  - Exact test
- Magnitude of association
  - Odds ratio
  - Risk ratio

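14 Cross-tabulation of snore (1, 2, Total) by SA (1, 2, Total) [cell counts not preserved]
Pearson chi2(1) = ...   Pr = ...

cc SA snore2
[Output layout: Cases and Controls by Exposed/Unexposed, with Total and Proportion Exposed; point estimates with 95% confidence intervals for the Odds ratio (exact), Attr. frac. ex. (exact) and Attr. frac. pop; chi2(1) = ...   Pr>chi2 = ... Values not preserved.]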

15 Attributable risk
Attributable fraction of exposure
- AF = (OR - 1)/OR
- The proportion (or number) of cases among the exposed that can be attributed to that exposure
Population attributable risk (PAR)
- PAR = AF x a/n1, or
- PAR = Pe(ORe - 1) / [1 + Pe(ORe - 1)]
- The proportion (or number) of cases that would not occur if the factor were eliminated
- a = number of exposed cases
- n1 = number of cases
- Pe = proportion of the population that is exposed

2x4 contingency tables
tab SA age_gr, col
[Output: frequency and column percentage of SA by age_gr, with totals; values not preserved]
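The attributable-risk formulas above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names and all input values (OR = 4, 150 exposed cases out of 200, exposure prevalence 0.5) are hypothetical and are not taken from the sleep apnea data.

```python
def attributable_fraction_exposed(odds_ratio):
    """AF = (OR - 1) / OR."""
    return (odds_ratio - 1.0) / odds_ratio


def par_from_case_counts(odds_ratio, exposed_cases, total_cases):
    """PAR = AF x a/n1, with a = exposed cases and n1 = all cases."""
    return attributable_fraction_exposed(odds_ratio) * exposed_cases / total_cases


def par_from_exposure_prevalence(odds_ratio, p_exposed):
    """PAR = Pe(OR - 1) / [1 + Pe(OR - 1)]."""
    excess = p_exposed * (odds_ratio - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs for illustration only.
af = attributable_fraction_exposed(4.0)
par_cases = par_from_case_counts(4.0, exposed_cases=150, total_cases=200)
par_prev = par_from_exposure_prevalence(4.0, p_exposed=0.5)
print(af, par_cases, par_prev)
```

The two PAR expressions use different inputs (the proportion of cases exposed versus the proportion of the population exposed), so they need not give the same number for arbitrary inputs.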

16 H0: Odds1 = Odds2 = ... = Oddsk
tabodds SA age_gr
[Output columns: age_gr, cases, controls, odds, 95% Conf. Interval; values not preserved]
Test of homogeneity (equal odds): chi2(3) = ...   Pr>chi2 = ...

tabodds SA agegr, or
[Output columns: agegr, Odds Ratio, chi2, P>chi2, 95% Conf. Interval; values not preserved]
Test of homogeneity (equal odds): chi2(3) = ...   Pr>chi2 = ...
Score test for trend of odds: chi2(1) = ...   Pr>chi2 = ...

17 Confounder effects
Confounders: crude OR versus adjusted OR
-> snore = ...
[Stratified table: SA by choking (0, 1, Total), frequency and column percentage; values not preserved]

18 -> snore = ...
[Stratified table: SA by choking (0, 1, Total), frequency and column percentage; values not preserved]

19 Effect modifier

Logistic equations
- Consider > 2 variables simultaneously
- Linear regression

20 Age & Sleep apnea
[Table columns: Group, Age, SA, Non-SA, n, Mean, P; rows by age group starting at group 1 (< ...); values not preserved]

21 Mean value of SA given age group
E(Y|X): expected value (mean) of SA given X
0 ≤ E(Y|X) ≤ 1

22 Logit equation:
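The equation on this slide did not survive transcription. As a sketch, the standard textbook form of the logit for a single covariate is given below; the symbols π(x), β0 and β1 are assumed notation, not necessarily those used on the original slide.

```latex
\pi(x) = E(Y \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
\operatorname{logit}\,\pi(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x
```

The transformation maps a probability bounded by 0 and 1 onto the whole real line, which is why the right-hand side can be a linear function of x.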

23 Simple logistic regression
Fit equation

24 Performing analysis in STATA
xi: logit SA i.snore, nolog
i.snore   _Isnore_1-2   (naturally coded; _Isnore_2 omitted)
Logistic regression   Number of obs = 837
LR chi2(1) = ...   Prob > chi2 = ...   Log likelihood = ...   Pseudo R2 = ...
[Coefficient table for _Isnore_1 and _cons with columns Coef., Std. Err., z, P>|z|, 95% Conf. Interval; values not preserved]

Interpretation
The logit (log odds) of sleep apnea is 1.57 higher in patients with a history of snoring than in patients without a history of snoring.
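Exponentiating the coefficient turns the difference in logits into an odds ratio. A minimal Python sketch follows; the coefficient 1.57 comes from the interpretation above, while the standard error of 0.20 is a hypothetical value, because the Stata output table was not preserved.

```python
import math


def odds_ratio_with_ci(coef, se, z=1.96):
    """Return (OR, lower, upper) using exp(coef) and exp(coef -/+ z * se)."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))


coef_snore = 1.57  # logit difference quoted in the interpretation
se_snore = 0.20    # hypothetical standard error, for illustration only

or_snore, ci_low, ci_high = odds_ratio_with_ci(coef_snore, se_snore)
print(round(or_snore, 2), round(ci_low, 2), round(ci_high, 2))
```

The confidence limits are computed on the logit scale and then exponentiated, so the interval is not symmetric around the odds ratio.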

25 Interpretation
The logits of sleep apnea for patients with and without a history of snoring are therefore: [equations not preserved]

Interpretation

26 Testing association
- Wald test

Testing association
- Likelihood ratio test

27 Estimate probability of having event

Multiple logistic regression
Multiple factors are associated with the outcome of interest, for example:
- Osteoporotic hip fracture: age, BMI, use of corticosteroids, alcohol consumption, calcium intake, etc.
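Estimating the probability of the event amounts to inverting the logit. The sketch below assumes the usual inverse-logit form p = 1/(1 + exp(-xb)); the intercept of -1.0 is hypothetical (the fitted constant was not preserved), and 1.57 is the snoring coefficient quoted earlier.

```python
import math


def predicted_probability(intercept, coefs, values):
    """Inverse logit of the linear predictor intercept + sum(coef * value)."""
    xb = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-xb))


b0 = -1.0       # hypothetical intercept, for illustration only
b_snore = 1.57  # coefficient quoted in the interpretation slide

p_snore = predicted_probability(b0, [b_snore], [1])
p_no_snore = predicted_probability(b0, [b_snore], [0])
print(round(p_snore, 3), round(p_no_snore, 3))
```

Whatever intercept is used, the ratio of the two odds p/(1 - p) recovers exp(1.57), which is a quick consistency check on the calculation.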

28 - CKD: age, gender, BMI, use of NSAID, diabetes, HT, Chol
- SA: age, gender, BMI, snore, stop breathing, etc.

Multiple logistic regression
- Considers > 1 factor simultaneously
- Several factors combined can predict the event better than a single factor
- Controls confounding effects, i.e., assesses the effect of each factor while controlling for the other factors

29 Steps of analysis
Model selection
- Not too many variables
- Only variables that explain the event of interest well
  - Clinical significance
  - Statistical significance

Model selection
- Univariate analysis
- Multivariate model
  - Selection methods: backward, forward
- Model comparison
  - Likelihood ratio test: G = -2[LL0 - LL1]
  - Wald test: z = β/se
  - AIC/BIC
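The two test statistics listed under model comparison can be turned into p-values with the standard library alone when a single parameter is tested. This is a sketch under that assumption (1 degree of freedom); the log likelihoods, coefficient and standard error are hypothetical.

```python
import math


def lr_statistic(ll_reduced, ll_full):
    """G = -2[LL0 - LL1], where LL0 is the reduced and LL1 the full model."""
    return -2.0 * (ll_reduced - ll_full)


def chi2_p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def wald_test(coef, se):
    """Return (z, two-sided p) for z = coef / se under a standard normal."""
    z = coef / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical values for illustration only.
g = lr_statistic(ll_reduced=-570.0, ll_full=-540.0)
p_lr = chi2_p_value_df1(g)
z, p_wald = wald_test(coef=1.57, se=0.20)
print(g, p_lr, z, p_wald)
```

For models that differ by more than one parameter, the chi-square reference distribution needs the matching degrees of freedom, which this one-parameter sketch does not cover.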

30 Model selection
I) Univariate analysis
- Demographic variables: age_gr, sex, BMI_gr
- Sleep variables: snore, stop_bre, choking, awake_re, kick_leg, accident, ess
- Risk behaviour: smoker, alcohol
- Co-morbidity: ht, dm, allergy

Dealing with continuous variables
- Compare the mean/median between the two SA groups
- Fit the variable as it is in the logit model
  - Keeps all possible information
  - Linear, polynomial, or fractional polynomial relationship
- Categorization
  - Using previous reference ranges
  - Likelihood ratio test
  - Youden's index

31 Fractional polynomial regression
Allows
- Logarithmic transformation
- Non-integer powers (e.g., -0.5), and
- Repeated powers (e.g., 0.5, 0.5)
Equation
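The equation itself was lost in transcription. As a sketch, the usual second-degree fractional polynomial (FP2) form for a logit model is shown below; the notation (x for the positive, possibly rescaled covariate, p1 and p2 for the powers) is assumed rather than taken from the slide.

```latex
\operatorname{logit}\,\pi(x) = \beta_0 + \beta_1 x^{(p_1)} + \beta_2 x^{(p_2)},
\qquad
x^{(p)} =
\begin{cases}
x^{p}, & p \neq 0 \\
\ln x, & p = 0
\end{cases}

% Repeated powers (p_1 = p_2 = p):
\operatorname{logit}\,\pi(x) = \beta_0 + \beta_1 x^{(p)} + \beta_2 x^{(p)} \ln x
```

The p = 0 convention is what allows the logarithmic transformation, and the repeated-power form is what the "(0.5, 0.5)" example refers to.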

32 Age
fp <age>: logit SA <age>
(fitting 44 models)
(...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%)
Fractional polynomial comparisons:
[Table columns: age, df, Deviance, Dev. dif., P(*), Powers; rows: omitted, linear, m = 1, m = 2; values not preserved]
(*) P = sig. level of model with m = 2 based on chi^2 of dev. dif.

- FP searches for the best powers up to degree 2, choosing from a set of candidate powers (-2, -1, -0.5, 0, 0.5, 1, 2, 3)
- The deviance (-2LL) of each model is compared with that of the model with the lowest -2LL
- The model with df 1 (i.e., linear) is compared with the model with df 4: D = ... = 8.926; df = 4 - 1 = 3; p value = ...
- The FP model with powers (-.5 3) fits better than the linear model
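The p-value for the deviance difference was lost from the slide, but it can be recomputed from the surviving numbers D = 8.926 and df = 3. The sketch below uses the closed-form upper tail of the chi-square distribution with 3 degrees of freedom so that only the standard library is needed; the function name is hypothetical.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


deviance_difference = 8.926  # linear model versus FP2 model, from the slide
p_value = chi2_sf_df3(deviance_difference)
print(round(p_value, 3))
```

A p-value below 0.05 here is consistent with the slide's conclusion that the FP model with powers (-.5 3) improves on the linear model.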

33 Age is transformed into age_1 and age_2; details of the transformation can be seen from
. fracpoly logit SA age
...
-> gen double Iage__1 = X^... if e(sample)
-> gen double Iage__2 = X^... if e(sample)
(where: X = age/10)

That is, age is first rescaled to age/10 and then raised to the powers selected by FP, one power for age_1 and one for age_2 [exact expressions not preserved].

The powers suggested by FP can be used to generate new fp variables:
fp gen age1 = age^(-.5 3)
logit SA age1_1 age1_2
or, equivalently, by fitting the command
fp <age>, fp(-.5 3): logit SA <age>

34 TABLE 1. Patient characteristics in the SA and non-SA groups
Columns: Factors | SA, n = ... (%) | Non-SA, n = ... (%) | P value
Rows:
- Age: < ..., ..., > 60
- Gender: Male, Female
- BMI: < ..., ..., > 40
- Snoring: Yes, No
- Stopping breathing: Yes, No
- Choking: Yes, No
- Waking up refreshed: Yes, Sometimes, No
- Leg kicking: Yes, Sometimes, No
- Accident due to sleepiness: Yes, No

35 TABLE 1 (cont.)
- ESS score, median (range)
- Smoking: Yes, Ex-smoker, No
- Alcohol consumption: Yes, No
- Hypertension: Yes, No
- Diabetes mellitus: Yes, No
- Allergy: Yes, No

TABLE 2. Factors associated with SA (AHI > 5): multiple logistic regression analysis
Columns: Factors | Coefficient | SE | P value | OR (95% CI)

36 TABLE 3. Scoring scheme: steps used to calculate prediction scores
Columns: Factors | Scoring | Score for individual
Last row: Total score

TABLE 4. Percentage of sleep apnoea according to prediction score category in derivation and validation phases
Columns: Score | Probability of SA | Derivation: Groups (SA, Non-SA), LR+ (95% CI), PPV | Validation: Groups (SA, Non-SA), LR+ (95% CI), PPV
Rows: low, medium, high

37 Model selection
II) Multivariate analysis: simultaneously enter variables with p < 0.10 into the model
- Stepwise backward/forward selection using the LR test
III) AIC/BIC
- Leaps-and-bounds selection: gvselect (SJ15-4)

Fitness and complexity
Akaike information criterion (AIC)

38 Bayesian information criterion
BIC = -2(LL) + ln(n)k
- n = number of observations
- k = number of parameters estimated
Given two models fit on the same data, the model with the smaller value of the information criterion is considered to be better.

Leaps-and-bounds selection
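Both criteria are simple functions of the log likelihood. The Python sketch below uses the BIC formula given above and the standard definition AIC = -2(LL) + 2k, which is assumed here because the AIC formula did not survive transcription. The two models and their log likelihoods are hypothetical; n = 837 matches the number of observations reported in the Stata output.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2(LL) + 2k (standard definition, assumed here)."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2(LL) + ln(n)k, as on the slide."""
    return -2.0 * log_likelihood + math.log(n) * k


n_obs = 837
# Hypothetical models: (log likelihood, number of estimated parameters).
model_a = (-480.0, 5)
model_b = (-478.5, 8)

for name, (ll, k) in (("A", model_a), ("B", model_b)):
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n_obs), 1))
```

In this made-up comparison the larger model has the higher log likelihood but loses on both criteria, because the penalty for three extra parameters outweighs the gain in fit.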

39 xi: gvselect <term> (i.age_gr) i.sex (i.bmi_gr) i.snore i.stop (i.choking) i.awake i.accident i.ht : logit SA <term>
Optimal models:
[Table columns: # Preds, LL, AIC, BIC; values not preserved]
predictors for each model:
1 : _Istop_bre_2
2 : _Isex_2 _Isnore_1
3 : _Iage_gr_4 _Isex_2 _Isnore_1
4 : _Iage_gr_4 _Isex_2 _Istop_bre_2 _Isnore_1
5 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Isnore_1
6 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Iage_gr_3 _Isnore_1
7 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Iage_gr_3 _Isnore_1 _Iage_gr_2
8 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Iage_gr_3 _Isnore_1 _IBMI_gr_39 _Iage_gr_2
9 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Iage_gr_3 _Isnore_1 _IBMI_gr_39 _Iage_gr_2 _IBMI_gr_29
10 : _Iage_gr_4 _Isex_2 _IBMI_gr_40 _Istop_bre_2 _Iage_gr_3 _Isnore_1 _IBMI_gr_39 _Iage_gr_2 _IBMI_gr_29 _Iawake_re_2

40 logit SA stop_bre1 agegr2 agegr3 agegr4 sex2 BMI_gr2 BMI_gr3 BMI_gr4 snore2
estat ic
Akaike's information criterion and Bayesian information criterion
[Table columns: Model, Obs, ll(null), ll(model), df, AIC, BIC; values not preserved]
Note: N=Obs used in calculating BIC; see [R] BIC note.

Performance of the model
Calibration
- How similar are the predicted and observed outcomes?
- Testing: goodness of fit
- All possible covariate patterns = 2 x 4 x 4 x 2 x 2 = 128

41 Hosmer-Lemeshow GOF
estat gof, table gr(10)
Logistic model for SA, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
[Table columns: Group, Prob, Obs_1, Exp_1, Obs_0, Exp_0, Total; values not preserved]
number of observations = 837
number of groups = 10
Hosmer-Lemeshow chi2(8) = ...
Prob > chi2 = ...

$\chi^2_{HL} = \sum_j \frac{(o_j - e_j)^2}{e_j \,(1 - e_j/n_j)}$

#delimit ;
disp ((8-10.3)^2/(10.3*(1-10.3/87)) +
(37-33.8)^2/(33.8*(1-33.8/89)) +
(56-54.8)^2/(54.8*(1-54.8/94)) +
(49-44.7)^2/(44.7*(1-44.7/68)) +
(86-89.9)^2/(89.9*(1-89.9/111)) +
(56-57.9)^2/(57.9*(1-57.9/68)) +
(90-87.2)^2/(87.2*(1-87.2/98)) +
(103-102.4)^2/(102.4*(1-102.4/110)) +
(27-29.7)^2/(29.7*(1-29.7/31))) ;
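The Hosmer-Lemeshow formula can be evaluated by hand, as the disp command does. The Python sketch below repeats that calculation for all ten groups. The observed and expected counts are taken from the O/E expression on the following slide, the group totals from the denominators above (with 81 for the fifth group as the sum of its observed counts), and the observed count of 37 for the second group is inferred from its total of 89 minus the 52 non-events. Because the expected counts are rounded to one decimal, the result is only an approximation of what Stata would report.

```python
import math

# Event counts per decile group: observed, expected, and group size.
observed = [8, 37, 56, 49, 57, 86, 56, 90, 103, 27]
expected = [10.3, 33.8, 54.8, 44.7, 59.1, 89.9, 57.9, 87.2, 102.4, 29.7]
totals = [87, 89, 94, 68, 81, 111, 68, 98, 110, 31]


def hosmer_lemeshow(obs, exp, n):
    """Sum over groups of (o - e)^2 / [e * (1 - e / n)]."""
    return sum((o - e) ** 2 / (e * (1.0 - e / m))
               for o, e, m in zip(obs, exp, n))


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, valid for even df only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


hl = hosmer_lemeshow(observed, expected, totals)
p_value = chi2_sf_even_df(hl, df=len(observed) - 2)
print(round(hl, 2), round(p_value, 3))
```

A large p-value means the observed and expected counts do not differ more than chance would allow, so there is no evidence of poor calibration.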

42 O/E
#delimit ;
disp ((8/10.3)+(79/76.7)+
(37/33.8)+(52/55.2)+
(56/54.8)+(38/40)+
(49/44.7)+(19/23.3)+
(57/59.1)+(24/21.9)+
(86/89.9)+(25/21.1)+
(56/57.9)+(12/10.1)+
(90/87.2)+(8/10.8)+
(103/102.4)+(7/7.6)+
(27/29.7)+(4/1.3))/20 ;
notes: sum(oj/ej)/20; j = 1,...,20

Model performance
Discrimination
- Assign the cut-off/threshold
- Construct a 2x2 table
- Estimate predictive values
  - Sensitivity, specificity
  - PPV, NPV
  - Accuracy
- Area under the ROC curve or concordance (C) statistic
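Once a cut-off has been chosen, the discrimination measures listed above follow directly from the 2x2 table of predicted versus actual status. The sketch below uses hypothetical cell counts and also reports the positive likelihood ratio (LR+), which appears in the Table 4 layout.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Standard 2x2 measures from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "lr_positive": sensitivity / (1.0 - specificity),
    }


# Hypothetical counts at one threshold, for illustration only.
measures = diagnostic_measures(tp=80, fp=20, fn=10, tn=90)
for name, value in measures.items():
    print(name, round(value, 3))
```

Changing the threshold trades sensitivity against specificity, which is why the ROC curve looks at every possible threshold rather than a single table.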

43 Model discrimination
Area under the ROC curve
- Also known as the C statistic
- A summary statistic that tells us whether the logit model can discriminate diseased from non-diseased subjects
- The ROC curve plots sensitivity versus 1 - specificity (false positive rate) over the whole range of estimated probabilities
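The C statistic has a direct interpretation as a concordance probability: among all pairs of one diseased and one non-diseased subject, it is the share in which the diseased subject has the higher estimated probability, with ties counted as one half. A minimal sketch with hypothetical predicted probabilities:

```python
def c_statistic(probs_diseased, probs_non_diseased):
    """Proportion of concordant diseased/non-diseased pairs (ties count 0.5)."""
    concordant = 0.0
    for p_d in probs_diseased:
        for p_n in probs_non_diseased:
            if p_d > p_n:
                concordant += 1.0
            elif p_d == p_n:
                concordant += 0.5
    return concordant / (len(probs_diseased) * len(probs_non_diseased))


# Hypothetical estimated probabilities, for illustration only.
diseased = [0.9, 0.8, 0.6, 0.4]
non_diseased = [0.7, 0.3, 0.2, 0.1]

auc = c_statistic(diseased, non_diseased)
print(auc)
```

This pairwise count gives the same value as the trapezoidal area under the empirical ROC curve, but it is quadratic in the sample size, so it is meant only as an illustration.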

44 Interpretation of ROC
Area under ROC | Interpretation
0.5 ≤ ROC < 0.6 | Fail
0.6 ≤ ROC < 0.7 | Poor
0.7 ≤ ROC < 0.8 | Fair
0.8 ≤ ROC < 0.9 | Good
ROC ≥ 0.9 | Excellent

45 Diagnostic measures
Residuals
- Pearson's chi-square residual
- Deviance residual
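The residual formulas were not preserved. As a sketch, the standard covariate-pattern definitions are given below, in assumed notation: for pattern j, y_j is the observed number of events, m_j the number of subjects, and \hat{\pi}_j the estimated probability.

```latex
% Pearson residual
r_j = \frac{y_j - m_j \hat{\pi}_j}{\sqrt{m_j \hat{\pi}_j (1 - \hat{\pi}_j)}}

% Deviance residual, taking the sign of (y_j - m_j \hat{\pi}_j)
d_j = \pm \left\{ 2 \left[ y_j \ln\!\left( \frac{y_j}{m_j \hat{\pi}_j} \right)
      + (m_j - y_j) \ln\!\left( \frac{m_j - y_j}{m_j (1 - \hat{\pi}_j)} \right) \right] \right\}^{1/2}
```

Summing r_j^2 over all patterns gives the Pearson chi-square statistic, and summing d_j^2 gives the deviance.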

46 Outliers
Leverage
- The h_jj value reflects the distance of X_j from the centre (mean) of the covariates
- The higher the h_jj, the farther X_j lies from the centre

Influence of outliers
- Influence on the predicted value of Y
  - Including or excluding outlying covariate pattern(s) would change the predicted Y values
  - Pearson residual change

47 Deviance residual change

Influence on coefficient estimation
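The formulas for these influence measures did not survive transcription. A sketch of the standard definitions follows, in assumed notation: r_j is the Pearson residual, d_j the deviance residual and h_{jj} the leverage of covariate pattern j.

```latex
% Change in the Pearson chi-square when pattern j is deleted
\Delta X^2_j = \frac{r_j^2}{1 - h_{jj}}

% Change in the deviance when pattern j is deleted
\Delta D_j = d_j^2 + \frac{r_j^2 \, h_{jj}}{1 - h_{jj}}

% Influence on the estimated coefficients
\Delta \hat{\beta}_j = \frac{r_j^2 \, h_{jj}}{(1 - h_{jj})^2}
```

All three grow as the leverage approaches 1, so a pattern that is both poorly fitted and far from the centre of the covariates has the largest effect on the fit.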
