Module 20 Case Studies in Longitudinal Data Analysis

Size: px
Start display at page:

Download "Module 20 Case Studies in Longitudinal Data Analysis"

Transcription

1 Module 20 Case Studies in Longitudinal Data Analysis Benjamin French, PhD Radiation Effects Research Foundation University of Pennsylvania SISCR 2016 July 29, 2016

2 Learning objectives This module will focus on the design of longitudinal studies, exploratory data analysis, and application of regression techniques based on estimating equations and mixed-effects models Case studies will be used to discuss analysis strategies, the application of appropriate analysis methods, and the interpretation of results, with examples in R and Stata Some theoretical background and details will be provided; our goal is to translate statistical theory into practical application At the conclusion of this module, you should be able to apply appropriate exploratory and regression techniques to summarize and generate inference from longitudinal data B French (Module 20) Longitudinal Data Analysis SISCR / 155

3 Overview Review: Longitudinal data analysis Case study: Longitudinal depression scores Case study: Indonesian Children s Health Study Case study: Carpal tunnel syndrome Summary and resources B French (Module 20) Longitudinal Data Analysis SISCR / 155

4 Overview Review: Longitudinal data analysis Case study: Longitudinal depression scores Case study: Indonesian Children s Health Study Case study: Carpal tunnel syndrome Summary and resources B French (Module 20) Longitudinal Data Analysis SISCR / 155

5 Longitudinal studies Repeatedly collect information on the same individuals over time Benefits Record incident events Ascertain exposure prospectively Separate time effects: cohort, period, age Distinguish changes over time within individuals Offer attractive efficiency gains over cross-sectional studies Help establish causal effect of exposure on outcome B French (Module 20) Longitudinal Data Analysis SISCR / 155

6 Longitudinal studies Repeatedly collect information on the same individuals over time Challenges Determine causality when covariates vary over time Choose exposure lag when covariates vary over time Account for incomplete participant follow-up Use specialized methods that account for longitudinal correlation B French (Module 20) Longitudinal Data Analysis SISCR / 155

7 Motivating example Georgian infant birth weight Birth weight measured for each of m = 5 children of n = 200 mothers Birth weight for infants j comprise repeated measures on mothers i Interested in the association between birth order and birth weight Estimate the average time course among all mothers Estimate the time course for individual mothers Quantify the degree of heterogeneity across mothers Consider adjustment for mother s initial age (at first birth) B French (Module 20) Longitudinal Data Analysis SISCR / 155

8 Motivating example momid birthord bweight lowbrth initage [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] B French (Module 20) Longitudinal Data Analysis SISCR / 155

9 Motivating example Birth weight (grams) Birth weight (grams) Birth order Birth order B French (Module 20) Longitudinal Data Analysis SISCR / 155

10 Strategies for analysis of longitudinal data Derived variable: Collapse the longitudinal series for each subject into a summary statistic, such as a difference (a.k.a. change score ) or regression coefficient, and use methods for independent data Repeated measures: Include all data in a regression model for the mean response and account for longitudinal and/or cluster correlation B French (Module 20) Longitudinal Data Analysis SISCR / 155

11 Options for analysis of change Does mean change differ across groups? Consider simple situation with Baseline measurement (t = 0) Single follow-up measurement (t = 1) Analysis options for simple pre-post design Analysis of POST only Analysis of CHANGE (post-pre) Analysis of POST controlling for BASELINE Analysis of CHANGE controlling for BASELINE B French (Module 20) Longitudinal Data Analysis SISCR / 155

12 Change and randomized studies Key assumption: groups equivalent at baseline Methods that adjust for baseline are generally preferable due to greater precision ρ > 1/2 POST CHANGE ANCOVA ρ < 1/2 CHANGE POST ANCOVA CHANGE analysis adjusts for baseline by subtracting it from follow-up ANCOVA analysis adjusts for baseline by controlling for it in a model Missing data will impact each approach B French (Module 20) Longitudinal Data Analysis SISCR / 155

13 Change and non-randomized studies Baseline equivalence no longer guaranteed Methods no longer answer same scientific question POST: How different are groups at follow-up? CHANGE: How different is the change in outcome for the two groups? ANCOVA: What is the expected difference in the mean outcome at follow-up across the two groups, controlling for the baseline value of the outcome? CHANGE typically most relevant; multivariable methods to come later characterize CHANGE across multiple timepoints B French (Module 20) Longitudinal Data Analysis SISCR / 155

14 Strategies for analysis of longitudinal data Derived variable: Collapse the longitudinal series for each subject into a summary statistic, such as a difference (a.k.a. change score ) or regression coefficient, and use methods for independent data Example: birth weight of 2nd child birth weight of 1st child Might be adequate for two time points and no missing data Repeated measures: Include all data in a regression model for the mean response and account for longitudinal and/or cluster correlation Generalized estimating equations (GEE) Generalized linear mixed-effects models (GLMM) B French (Module 20) Longitudinal Data Analysis SISCR / 155

15 Notation Define m i = number of observations for subject i = 1,..., n Y ij = outcome for subject i at time j = 1,..., m i X i = (x i1, x i2,..., x imi ) x ij = (x ij1, x ij2,..., x ijp ) exposure, covariates Stacks of data for each subject: Y i1 Y Y i = i2. X i = x i11 x i12... x i1p x i21 x i22... x i2p.... Y imi x imi 1 x imi 2... x imi p B French (Module 20) Longitudinal Data Analysis SISCR / 155

16 Dependence and correlation Issue Response variables measured on the same subject are correlated Observations are dependent or correlated when one variable predicts the value of another variable The birth weight for a first child is predictive of the birth weight for a second child born to the same mother Variance: measures average distance that an observation falls away from the mean Covariance: measures whether, on average, departures in one variable Y ij µ j go together with departures in another variable Y ik µ k Correlation: measure of dependence that takes values from 1 to +1 B French (Module 20) Longitudinal Data Analysis SISCR / 155

17 Covariance: Something new to model Cov(Y i ) = = Var(Y i1 ) Cov(Y i1, Y i2 )... Cov(Y i1, Y imi ) Cov(Y i2, Y i1 ) Var(Y i2 )... Cov(Y i2, Y imi ).... Cov(Y imi, Y i1 ) Cov(Y imi, Y i2 )... Var(Y imi ) σ1 2 σ 1 σ 2 ρ σ 1 σ mi ρ 1mi σ 2 σ 1 ρ 21 σ σ 2 σ mi ρ 2mi.... σ mi σ 1 ρ mi 1 σ mi σ 2 ρ mi 2... σ 2 m i Note: ρ = correlation B French (Module 20) Longitudinal Data Analysis SISCR / 155

18 GEE (Liang and Zeger, 1986) 9145 citations as of July 2016 Contrast average outcome values across populations of individuals defined by covariate values, while accounting for correlation Focus on a generalized linear model with regression parameters β, which characterize the systemic variation in Y across covariates X Y i = (Y i1, Y i2,..., Y imi ) T X i = (x i1, x i2,..., x imi ) T x ij = (x ij1, x ij2,..., x ijp ) β = (β 1, β 2,..., β p ) T for i = 1,..., n; j = 1,..., m i ; and k = 1,..., p Longitudinal correlation structure is a nuisance feature of the data B French (Module 20) Longitudinal Data Analysis SISCR / 155

19 Mean model Assumptions Observations are independent across subjects Observations could be correlated within subjects Mean model: Primary focus of the analysis E[Y ij x ij ] = µ ij (β) g(µ ij ) = x ij β Corresponds to any generalized linear model with link g( ) Continuous outcome Count outcome Binary outcome E[Y ij x ij ] = µ ij E[Y ij x ij ] = µ ij P[Y ij = 1 x ij ] = µ ij µ ij = x ij β log(µ ij ) = x ij β logit(µ ij ) = x ij β Characterizes a marginal mean regression model B French (Module 20) Longitudinal Data Analysis SISCR / 155

20 Covariance model Longitudinal correlation is a nuisance; secondary to mean model of interest 1. Assume a form for variance that could depend on µ ij Continuous outcome: Var[Y ij x ij ] = σ 2 Count outcome: Var[Y ij x ij ] = µ ij Binary outcome: Var[Y ij x ij ] = µ ij (1 µ ij ) which could also include a scale or dispersion parameter φ > 0 2. Select a model for longitudinal correlation with parameters α Independence: Corr[Y ij, Y ij X i ] = 0 Exchangeable: Corr[Y ij, Y ij X i ] = α Auto-regressive: Corr[Y ij, Y ij X i ] = α j j Unstructured: Corr[Y ij, Y ij X i ] = α jj B French (Module 20) Longitudinal Data Analysis SISCR / 155

21 Intuition 0 = n Di T V 1 }{{} i (Y i ˆµ i ) }{{}}{{} i= The model for the mean, µ i (β), is compared to the observed data, Y i ; setting the equations to equal 0 tries to minimize the difference between observed and expected 2 Estimation uses the inverse of the variance (covariance) to weight the data from subject i; more weight is given to differences between observed and expected for subjects who contribute more information 3 Simply a change of scale from the scale of the mean, µ i, to the scale of the regression coefficients (covariates) B French (Module 20) Longitudinal Data Analysis SISCR / 155

22 Comments GEE is specified by a mean model and a correlation model 1. A regression model for the average outcome, e.g., linear, logistic 2. A model for longitudinal correlation, e.g., independence, exchangeable ˆβ is a consistent estimator for β provided that the mean model is correctly specified, even if the model for longitudinal correlation is incorrectly specified, i.e., ˆβ is robust to correlation model mis-specification However, the variance of ˆβ must capture the correlation in the data, either by choosing the correct correlation model, or via an alternative variance estimator GEE computes a sandwich variance estimator (aka empirical, robust, or Huber-White variance estimator) Empirical variance estimator provides valid standard errors for ˆβ even if the working correlation model is incorrect, but requires n 40 (Mancl and DeRouen, 2001) B French (Module 20) Longitudinal Data Analysis SISCR / 155

23 Variance estimators Independence estimating equation: An estimation equation with a working independence correlation structure Model-based standard errors are generally not valid Empirical standard errors are valid given large n and n m Weighted estimation equation: An estimation equation with a non-independence working correlation structure Model-based standard errors are valid if correlation model is correct Empirical standard errors are valid given large n and n m Variance estimator Estimating equation Model-based Empirical Independence +/ Weighted /+ + B French (Module 20) Longitudinal Data Analysis SISCR / 155

24 GEE commands Stata: xtset, then use xtgee R: geeglm in geepack library, using geese fitter function SAS: PROC GENMOD NB: Order might be important for analysis in software Requires sorting the data by unique subject identifier and time Important for exchangeable and auto-regressive correlation structures B French (Module 20) Longitudinal Data Analysis SISCR / 155

25 Motivating example Interested in the association between birth order and birth weight E[Y ij x ij ] = β 0 + β 1 x ij1 + β 2 x ij2 for i = 1,..., 200 and j = 1,..., 5 with Y ij : Infant birth weight (continuous) x ij1 : Infant birth order x ij2 : Mother s initial age B French (Module 20) Longitudinal Data Analysis SISCR / 155

26 Motivating example: Stata commands * Declare the dataset to be "panel" data, grouped by momid * with time variable birthord xtset momid birthord * Fit a linear model with independence correlation xtgee bweight birthord initage, corr(ind) robust * Fit a linear model with exchangeable correlation xtgee bweight birthord initage, corr(exc) robust B French (Module 20) Longitudinal Data Analysis SISCR / 155

27 Motivating example: Stata output GEE population-averaged model Number of obs = 1000 Group variable: momid Number of groups = 200 Link: identity Obs per group: min = 5 Family: Gaussian avg = 5.0 Correlation: independent max = 5 Wald chi2(2) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on momid) Semi-robust bweight Coef. Std. Err. z P> z [95% Conf. Interval] birthord initage _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

28 Motivating example: Stata output GEE population-averaged model Number of obs = 1000 Group variable: momid Number of groups = 200 Link: identity Obs per group: min = 5 Family: Gaussian avg = 5.0 Correlation: exchangeable max = 5 Wald chi2(2) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on momid) Semi-robust bweight Coef. Std. Err. z P> z [95% Conf. Interval] birthord initage _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

29 Motivating example: Comments Difference in mean birth weight between two populations of infants whose birth order differs by one is 46.6 grams, 95% CI: (27.0, 66.2) Model-based standard errors are smaller for exchangeable structure, indicating efficiency gain from assuming a correct correlation structure In practice, i.e. with real-world data, it s often difficult to tell what the correct correlation structure is from exploratory analyses A priori scientific knowledge should ultimately guide the decision I tend to use working independence with empirical standard errors unless I have a good reason to do otherwise, e.g. large efficiency gain Try not to select the structure that gives you the smallest p-value Stata labels the standard errors semi-robust because the empirical variance estimator protects against mis-specification of the correlation model, but requires correct specification of the mean model See help xtgee for detailed syntax, other options, and saved results B French (Module 20) Longitudinal Data Analysis SISCR / 155

30 GEE summary In the GEE approach the primary focus of the analysis is a marginal mean regression model that corresponds to any GLM Longitudinal correlation is secondary to the mean model of interest and is treated as a nuisance feature of the data Requires selection of a working correlation model Semi-parametric: Only the mean and correlation models are specified The correlation model does not need to be correctly specified to obtain a consistent estimator for β or valid standard errors for ˆβ Efficiency gains are possible if the correlation model is correct Issues Accommodates only one source of correlation: Longitudinal or cluster GEE requires that any missing data are missing completely at random Issues arise with time-dependent exposures and covariance weighting B French (Module 20) Longitudinal Data Analysis SISCR / 155

31 Strategies for analysis of longitudinal data Derived variable: Collapse the longitudinal series for each subject into a summary statistic, such as a difference (a.k.a. change score ) or regression coefficient, and use methods for independent data Example: birth weight of 2nd child birth weight of 1st child Might be adequate for two time points and no missing data Repeated measures: Include all data in a regression model for the mean response and account for longitudinal and/or cluster correlation Generalized estimating equations (GEE): A marginal model for the mean response and a model for longitudinal or cluster correlation g(e[y ij x ij ]) = x ij β and Corr[Y ij, Y ij ] = ρ(α) Generalized linear mixed-effects models (GLMM) B French (Module 20) Longitudinal Data Analysis SISCR / 155

32 Mixed-effects models (Laird and Ware, 1982) 4515 citations as of July 2016 Contrast outcomes both within and between individuals Assume that each subject has a regression model characterized by subject-specific parameters: a combination of fixed-effects parameters common to all individuals in the population and random-effects parameters unique to each individual subject Although covariates allow for differences across subjects, typically cannot measure all factors that give rise to subject-specific variation Subject-specific random effects induce a correlation structure B French (Module 20) Longitudinal Data Analysis SISCR / 155

33 Set-up For subject i the mixed-effects model is characterized by Y i = (Y i1, Y i2,..., Y imi ) T β = (β1, β 2,..., β p) T Fixed effects x ij = (x ij1, x ij2,..., x ijp ) X i = (x i1, x i2,..., x imi ) T Design matrix for fixed effects γ i = (γ 1i, γ 2i,..., γ qi ) T Random effects z ij = (z ij1, z ij2,..., z ijq ) Z i = (z i1, z i2,..., z imi ) T Design matrix for random effects for i = 1,..., n; j = 1,..., m i ; and k = 1,..., p with q p B French (Module 20) Longitudinal Data Analysis SISCR / 155

34 Linear mixed-effects model Consider a linear mixed-effects model for a continuous outcome Y ij Stage 1: Model for response given random effects Y ij = x ij β + z ij γ i + ɛ ij with x ij is a vector a covariates z ij is a subset of x ij β is a vector of fixed-effects parameters γi is a vector of random-effects parameters ɛij is observation-specific measurement error Stage 2: Model for random effects γ i N(0, G) ɛ ij N(0, σ 2 ) with γ i and ɛ ij are assumed to be independent B French (Module 20) Longitudinal Data Analysis SISCR / 155

35 Choices for random effects Consider the linear mixed-effects models that include Random intercepts Random intercepts and slopes Y ij = β 0 + β 1 t ij + γ 0i + ɛ ij = (β 0 + γ 0i ) + β 1 t ij + ɛ ij Y ij = β 0 + β 1 t ij + γ 0i + γ 1i t ij + ɛ ij = (β 0 + γ 0i ) + (β 1 + γ 1i )t ij + ɛ ij B French (Module 20) Longitudinal Data Analysis SISCR / 155

36 Choices for random effects Fixed intercept, fixed slope Random intercept, fixed slope Y ij Y ij t ij t ij B French (Module 20) Longitudinal Data Analysis SISCR / 155

37 Choices for random effects Fixed intercept, random slope Random intercept, random slope Y ij Y ij t ij t ij B French (Module 20) Longitudinal Data Analysis SISCR / 155

38 Choices for random effects: G G quantifies random variation in trajectories across subjects [ ] G 11 G 12 G = G 21 G 22 G 11 is the typical deviation in the level of the response G 22 is the typical deviation in the change in the response G 12 is the covariance between subject-specific intercepts and slopes G 12 = 0 indicates subject-specific intercepts and slopes are uncorrelated G 12 > 0 indicates subjects with high level have high rate of change G12 < 0 indicates subjects with high level have low rate of change (G 12 = G 21 ) B French (Module 20) Longitudinal Data Analysis SISCR / 155

39 Generalized linear mixed-effects models A GLMM is defined by random and systematic components Random: Conditional on γ i the outcomes Y i = (Y i1,..., Y imi ) T are mutually independent and have an exponential family density f (Y ij β, γ i, φ) = exp{[y ij θ ij ψ(θ ij )]/φ + c(y ij, φ)} for i = 1,..., n and j = 1,..., m i with a scale parameter φ > 0 and θ ij θ ij (β, γ i ) B French (Module 20) Longitudinal Data Analysis SISCR / 155

40 Generalized linear mixed-effects models A GLMM is defined by random and systematic components Systematic: µ ij is modeled via a linear predictor containing fixed regression parameters β common to all individuals in the population and subject-specific random effects γ i with a known link function g( ) g(µ ij) = x ij β + z ij γ i µ ij = g 1 (x ij β + z ij γ i ) where the random effects γ i are mutually independent with a common underlying multivariate distribution, typically assumed to be γ i N q (0, G) so that G quantifies random variation across subjects B French (Module 20) Longitudinal Data Analysis SISCR / 155

41 Likelihood-based estimation of β Requires specification of a complete probability distribution for the data Likelihood-based methods are designed for fixed effects, so integrate over the assumed distribution for the random effects n L Y (β, σ, G) = f Y γ (Y i γ i, β, σ) f γ (γ i G)dγ i i=1 where f γ is typically the density function of a Normal random variable For linear models the required integration is straightforward because Y i and γ i are both normally distributed (easy to program) For non-linear models the integration is difficult and requires either approximation or numerical techniques (hard to program) B French (Module 20) Longitudinal Data Analysis SISCR / 155

42 Likelihood-based estimation of β Two likelihood-based approaches to estimation using a GLMM 1. Conditional likelihood: Treat the random effects as if they were fixed parameters and eliminate them by conditioning on their sufficient statistics; does not require a specified distribution for γ i xtreg and xtlogit with fe option in Stata 2. Maximum likelihood: Treat the random effects as unobserved nuisance variables and integrate over their assumed distribution to obtain the marginal likelihood for β; typically assume γ i N(0, G) xtreg and xtlogit with re option in Stata mixed and melogit in Stata lmer and glmer in R package lme4 B French (Module 20) Longitudinal Data Analysis SISCR / 155

43 Fixed effects versus random effects Fixed-effects approach provided by conditional likelihood estimation Comparisons are made within individuals who act as their own control and differences are averaged across all individuals in the sample Could eliminate potentially large sources of bias by controlling for all stable characteristics of the individuals under study (+) Variation across subjects is ignored, which could provide standard error estimates that are too big; conservative inference ( ) Although controlled for by conditioning, cannot estimate coefficients for covariates that have no within-subject variation ( /+) B French (Module 20) Longitudinal Data Analysis SISCR / 155

44 Fixed effects versus random effects Random-effects approach provided by maximum likelihood estimation Comparisons are based on within- and between-subject contrasts Requires a specified distribution for subject-specific effects; correct specification is required for valid likelihood-based inference ( /+) Do not control for unmeasured characteristics because random effects are almost always assumed to be uncorrelated with covariates ( ) Can estimate effects of within- and between-subject covariates (+) B French (Module 20) Longitudinal Data Analysis SISCR / 155

45 Assumptions Valid inference from a linear mixed-effects model relies on Mean model: As with any regression model for an average outcome, need to correctly specify the functional form of x ij β (here also z ij γ i ) Included important covariates in the model Correctly specified any transformations or interactions Covariance model: Correct covariance model (random-effects specification) is required for correct standard error estimates for ˆβ Normality: Normality of ɛ ij and γ i is required for normal likelihood function to be the correct likelihood function for Y ij n sufficiently large for asymptotic inference to be valid These assumptions must be verified to evaluate any fitted model B French (Module 20) Longitudinal Data Analysis SISCR / 155

46 Motivating example Interested in the association between birth order and birth weight E[Y ij x ij, γ i ] = β 0 + β 1 x ij1 + β 2 x ij2 + γ 0i or β 0 + β 1 x ij1 + β 2 x ij2 + γ 0i + γ 1i x ij1 for i = 1,..., 200 and j = 1,..., 5 with Y ij : Infant birth weight (continuous) x ij1 : Infant birth order x ij2 : Mother s initial age B French (Module 20) Longitudinal Data Analysis SISCR / 155

47 Motivating example Birth weight (grams) Observed data Birth weight (grams) Individual fits Birth order Birth order B French (Module 20) Longitudinal Data Analysis SISCR / 155

48 Motivating example: Stata commands * Declare the dataset to be "panel" data, grouped by momid * with time variable birthord xtset momid birthord * Fit a linear model with random intercepts xtmixed bweight birthord initage momid: * Fit a linear model with random intercepts and slopes xtmixed bweight birthord initage momid: birthord B French (Module 20) Longitudinal Data Analysis SISCR / 155

49 Motivating example: Stata output Mixed-effects REML regression Number of obs = 1000 Group variable: momid Number of groups = 200 Obs per group: min = 5 avg = 5.0 max = 5 Wald chi2(2) = Log restricted-likelihood = Prob > chi2 = bweight Coef. Std. Err. z P> z [95% Conf. Interval] birthord initage _cons Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] momid: Identity sd(_cons) sd(residual) LR test vs. linear regression: chibar2(01) = Prob >= chibar2 = B French (Module 20) Longitudinal Data Analysis SISCR / 155

50 Motivating example: Stata output Mixed-effects REML regression Number of obs = 1000 Group variable: momid Number of groups = 200 Obs per group: min = 5 avg = 5.0 max = 5 Wald chi2(2) = Log restricted-likelihood = Prob > chi2 = bweight Coef. Std. Err. z P> z [95% Conf. Interval] birthord initage _cons Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] momid: Independent sd(birthord) sd(_cons) sd(residual) LR test vs. linear regression: chi2(2) = Prob > chi2 = B French (Module 20) Longitudinal Data Analysis SISCR / 155

51 Motivating example: Comments Difference in mean birth weight between two populations of infants whose birth order differs by one is 46.6 grams, 95% CI: (27.0, 66.2) Agrees well with GEE point estimate and confidence interval Ĝ00 = 323 indicates substantial variability across mothers in the initial level of infant birth weight; Ĝ11 = 49 indicates substantial variability across mothers in the trend of birth weight over time Note: Typically can specify correlated intercepts and slopes, i.e. G 01 0, but in this case the model would not converge There are options for formal statistical evaluation of two randomeffects specifications, but I generally do not recommend an inferential procedure in which a p-value tells you to use a simpler model Specification for the random effects should be guided by a priori scientific knowledge and exploratory data analysis B French (Module 20) Longitudinal Data Analysis SISCR / 155

52 GLMM summary A GLMM is defined by a random component describing outcomes given subject-specific effects and a systematic component describing the conditional mean given subject-specific effects Conditional likelihood for fixed effects eliminates subject-specific effects by conditioning on their sufficient statistics Maximum likelihood for random effects integrates over the assumed distribution of the subject-specific effects Interpretation of parameter estimates depends on the assumed distribution for the outcome given the subject-specific effects and the specified structure for subject-specific effects Issues GLMM requires that any missing data are missing at random Issues arise with time-dependent exposures and covariance weighting B French (Module 20) Longitudinal Data Analysis SISCR / 155

53 Strategies for analysis of longitudinal data Derived variable: Collapse the longitudinal series for each subject into a summary statistic, such as a difference (a.k.a. change score ) or regression coefficient, and use methods for independent data Example: birth weight of 2nd child birth weight of 1st child Might be adequate for two time points and no missing data Repeated measures: Include all data in a regression model for the mean response and account for longitudinal and/or cluster correlation Generalized estimating equations (GEE): A marginal model for the mean response and a model for longitudinal or cluster correlation g(e[y ij x ij ]) = x ij β and Corr[Y ij, Y ij ] = ρ(α) Generalized linear mixed-effects models (GLMM): A conditional model for the mean response given subject-specific random effects, which induce a (possibly hierarchical) correlation structure g(e[y ij x ij, γ i ]) = x ij β + z ij γ i B French (Module 20) Longitudinal Data Analysis SISCR / 155

54 Overview Review: Longitudinal data analysis Case study: Longitudinal depression scores Case study: Indonesian Children s Health Study Case study: Carpal tunnel syndrome Summary and resources B French (Module 20) Longitudinal Data Analysis SISCR / 155

55 Motivation and design Gregoire et al. (1996) published the results of an efficacy study on estrogen patches in treating postnatal depression 61 women with major depression, which began within 3 months of childbirth and persisted for up to 18 months postnatally, participated in a double-blind, placebo-controlled study Women were randomly assigned to active treatment (n = 34) or placebo (n = 27) Participants attended clinics monthly and at each visit self-ratings of depressive symptoms on the Edinburgh postnatal depression scale (EPDS) were measured EPDS is a standardized, validated, self-rating scale consisting of 10 items, each rated on a 4-point scale of 0 3 Goal: Investigate the antidepressant efficacy of treatment with estrogen over time B French (Module 20) Longitudinal Data Analysis SISCR / 155

56 Data Depression scores are assessed across m = 7 months for the n = 61 subjects in the study Depression scores for visit j are the longitudinal components measured on subject i subj group dep0 dep1 dep2 dep3 dep4 dep5 dep placebo placebo placebo placebo placebo placebo placebo placebo placebo placebo Wide form: A row for each subject Note that there are some missing data due to drop-out B French (Module 20) Longitudinal Data Analysis SISCR / 155

57 Exploratory analyses 1. Summarize the depression scores by visit and treatment group 2. Examine within-person correlations among depression scores, graphically and numerically 3. Graph depression scores over time, by treatment group; include a lowess line (smoother) for each group to summarize trends 4. Plot individual trajectories by treatment group B French (Module 20) Longitudinal Data Analysis SISCR / 155

58 Regression analyses 5. Consider collapsing the longitudinal series for each subject into a summary statistic between the baseline and sixth depression scores. Use methods for independent data to evaluate the association between change in depression scores and estrogen treatment 6. Reshape the data into long form and evaluate longitudinal associations between depression scores and treatment using GEE Use visit as a linear variable Use visit as a categorical variable Evaluate whether the treatment effect varies over time B French (Module 20) Longitudinal Data Analysis SISCR / 155

59 Reshape the data Recall what the data look like in wide form subj group dep0 dep1 dep2 dep3 dep4 dep5 dep placebo placebo placebo placebo placebo For some analyses, reshape the data from wide form to long form. reshape long dep, i(subj) j(visit) (note: j = ) Data wide -> long Number of obs. 61 -> 427 Number of variables 9 -> 4 j variable (7 values) -> visit xij variables: dep0 dep1... dep6 -> dep B French (Module 20) Longitudinal Data Analysis SISCR / 155

60 Reshape the data Long form: A row for each observation subj visit group dep placebo placebo placebo placebo placebo placebo placebo placebo placebo placebo B French (Module 20) Longitudinal Data Analysis SISCR / 155

61 Answers B French (Module 20) Longitudinal Data Analysis SISCR / 155

62 Summaries by group and visit. sort group. by group: summarize dep0 dep1 dep2 dep3 dep4 dep5 dep > group = placebo Variable Obs Mean Std. Dev. Min Max dep dep dep dep dep dep dep > group = estrogen Variable Obs Mean Std. Dev. Min Max dep dep dep dep dep dep dep Note: There are fewer observations observed over time B French (Module 20) Longitudinal Data Analysis SISCR / 155

63 Correlation. graph matrix dep0 dep1 dep2 dep3 dep4 dep5 dep6, half All correlations are positive Strong correlation between adjacent visits B French (Module 20) Longitudinal Data Analysis SISCR / 155

64 Depression scores over time. separate dep, by(group). graph twoway (scatter dep0 visit, jitter(10) mcolor(green)) (scatter dep1 visit, jitter(10) mcolor(purple)) /// (lowess dep0 visit, lcolor(green)) (lowess dep1 visit, lcolor(purple)) For each treatment arm, mean depression scores decrease over time B French (Module 20) Longitudinal Data Analysis SISCR / 155

65 Individual trajectories. xtline dep, i(subj) t(visit) overlay legend(off) xlab(0(1)6) xtitle("visit") ytitle("depression Score") Reveals the complexity of individual trajectories Note that several patients drop out after the second visit B French (Module 20) Longitudinal Data Analysis SISCR / 155

66 Simple difference. gen diff=dep6-dep0. ttest diff, by(group) unequal Two-sample t test with unequal variances Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] placebo estrogen combined diff diff = mean(placebo) - mean(estrogen) t = Ho: diff = 0 Satterthwaite s degrees of freedom = Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = Clear decreases over time; larger decreases among estrogen group Limited to those with complete measurement series B French (Module 20) Longitudinal Data Analysis SISCR / 155

67 GEE A special feature of longitudinal data is that the m = 7 observations that are nested within the n = 61 subjects are ordered in time We can consider marginal models to model the within-subject dependence by allowing us to specify the covariance structure across the nested observations Parameters describing the covariance must be estimated along with typical regression coefficients A variety of options are available to describe the covariance Some covariance patterns require more information (i.e., require more parameters to be estimated than others) Recall, we identify the data as a panel data set using the xtset command in Stata B French (Module 20) Longitudinal Data Analysis SISCR / 155

68 Assumptions To account for the repeated measures we can use generalized estimating equations which include all of the data over the time points in a marginal model for the mean response and account for the longitudinal correlation Assumptions g(e[y ij x ij ]) = x ij β and Corr[Y ij, Y ij ] = ρ(α) Observations are independent across subjects Observations could be correlated within subjects Missing data are missing completely at random B French (Module 20) Longitudinal Data Analysis SISCR / 155

69 Model Using the GEE framework, we consider the cross-sectional model where we are interested in the average treatment effect over time E[Y ij x ij ] = β 0 + β 1 x ij1 + β 2 x ij2 with Y ij : continuous depression score (dep) xij1 : continuous variable for visit (visit) xij2 : binary treatment group with 1=estrogen, 0=placebo (group) For the continuous outcome, we use an identity link, link(iden), in the Gaussian family, fam(gaus); these are the default In Stata, xtgee allows us to specify various working covariance structures through the corr option; the command estat wcorr allows us to view the working correlation matrix B French (Module 20) Longitudinal Data Analysis SISCR / 155

70 Correlation structures Independence: Observations are assumed to be independent For correlation between any two observations on the same subject we assume that Corr[Y ij, Y ij ] = 0 It is unlikely that for any subject, depression scores are independent from one visit to the next Exchangeable: Correlations are assumed to be constant between any two observations on the same subject; Corr[Y ij, Y ij ] = α AR(1): Correlation is assumed to decay as a function of time or distance between observations; Corr[Y ij, Y ij ] = α j j Likely to be appropriate in cases where there are a reasonable number of repeated measurements over time Given that our data are measured over time, using the AR(1) correlation might help increase efficiency of SE estimation Unstructured: No relationship is imposed on dependence over time or within subjects; Corr[Y ij, Y ij ] = α jj Robust variance estimator protects against incorrect choice B French (Module 20) Longitudinal Data Analysis SISCR / 155

71 GEE-independence. xtgee dep visit i.group, corr(ind) robust GEE population-averaged model Number of obs = 356 Group variable: subj Number of groups = 61 Link: identity Obs per group: Family: Gaussian min = 2 Correlation: independent avg = 5.8 max = 7 Wald chi2(2) = Scale parameter: Prob > chi2 = Pearson chi2(356): Deviance = Dispersion (Pearson): Dispersion = (Std. Err. adjusted for clustering on subj) Robust dep Coef. Std. Err. z P> z [95% Conf. Interval] visit group estrogen _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

72 GEE-AR(1). xtgee dep visit i.group, corr(ar1) robust GEE population-averaged model Number of obs = 356 Group and time vars: subj visit Number of groups = 61 Link: identity Obs per group: Family: Gaussian min = 2 Correlation: AR(1) avg = 5.8 max = 7 Wald chi2(2) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on subj) Robust dep Coef. Std. Err. z P> z [95% Conf. Interval] visit group estrogen _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

73 Working correlation structure Examine the correlation structure estimated by the model. estat wcorr Estimated within-subj correlation matrix R: c1 c2 c3 c4 c5 c6 c r1 1 r r r r r r Compare with simple pairwise correlations. corr dep0 dep1 dep2 dep3 dep4 dep5 dep6 (obs=45) dep0 dep1 dep2 dep3 dep4 dep5 dep dep dep dep dep dep dep dep B French (Module 20) Longitudinal Data Analysis SISCR / 155

74 Modeling time Valid inference from GEE requires that the mean model is correct We have two covariates: treatment group is binary, time is? Instead of a continuous variable (or, grouped linear term) for time, consider a categorical variable E[Y ij x ij ] = β 0 + β 2 x ij2 + β 3 x ij3 + β 4 x ij4 + β 5 x ij5 + β 6 x ij6 + β 7 x ij7 + β 8 x ij8 with, in addition to x ij2 representing the treatment variable (group) xij3 : dummy variable for visit 1 compared to visit 0 x ij4 : dummy variable for visit 2 compared to visit 0. xij8 : dummy variable for visit 6 compared to visit 0 B French (Module 20) Longitudinal Data Analysis SISCR / 155

75 GEE-AR(1), categorical time. xtgee dep i.visit i.group, corr(ar1) robust GEE population-averaged model Number of obs = 356 Group and time vars: subj visit Number of groups = 61 Link: identity Obs per group: Family: Gaussian min = 2 Correlation: AR(1) avg = 5.8 max = 7 Wald chi2(7) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on subj) Robust dep Coef. Std. Err. z P> z [95% Conf. Interval] visit group estrogen _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

76 Modeling time Strong evidence that depression scores vary over time. testparm i.visit ( 1) 1.visit = 0 ( 2) 2.visit = 0 ( 3) 3.visit = 0 ( 4) 4.visit = 0 ( 5) 5.visit = 0 ( 6) 6.visit = 0 chi2( 6) = Prob > chi2 = In the model with continuous visit, the difference in mean score between groups was 2.53 and it was highly significant (p = 0.008) When considering categorical visit, the difference in mean score between groups was 2.59 and it was highly significant (p = 0.007) Noting that the estimated treatment effect is the same in both models, we opt for the parsimony of the model with continuous visit B French (Module 20) Longitudinal Data Analysis SISCR / 155

77 Model with interaction Consider a model that allows the treatment effect to depend on time The model of interest becomes E[Y ij x ij ] = β 0 + β 1 x ij1 + β 2 x ij2 + β 3 (x ij1 x ij2 ) where Y ij is the continuous depression score, x ij1 is a continuous variable for visit, and x ij2 is the treatment variable Model includes their main effects and the interaction term For subjects in the placebo group (x ij2 = 0), the model is E[Y ij x ij ] = β 0 + β 1 x ij1 For subjects in the estrogen group (x ij2 = 1), the model is E[Y ij x ij ] = (β 0 + β 2 ) + (β 1 + β 3 )x ij1 Now we can compare whether the mean change in depression score over time differs between treatment groups ( longitudinal model) B French (Module 20) Longitudinal Data Analysis SISCR / 155

78 GEE-AR(1), continuous time, interaction. xtgee dep c.visit##i.group, corr(ar1) robust GEE population-averaged model Number of obs = 356 Group and time vars: subj visit Number of groups = 61 Link: identity Obs per group: Family: Gaussian min = 2 Correlation: AR(1) avg = 5.8 max = 7 Wald chi2(3) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on subj) Robust dep Coef. Std. Err. z P> z [95% Conf. Interval] visit group estrogen group#c.visit estrogen _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

79 Interpretation Estimate the change over time for the estrogen group by adding the coefficients for the visit variable and the interaction term. lincom visit + 1.group#c.visit ( 1) visit + 1.group#c.visit = dep Coef. Std. Err. z P> z [95% Conf. Interval] (1) For a population of women on placebo treatment, mean depression score decreases by approximately 1.65 points for each additional visit, 95% CI: (-2.04, -1.25) For a population of women on estrogen treatment, mean depression score decreases by approximately 2.37 points for each additional visit, 95% CI: (-2.65, -2.08) Strong evidence that these associations are different (p = 0.004) B French (Module 20) Longitudinal Data Analysis SISCR / 155

80 Summary GEE is specified by a mean model and a correlation model We created a linear regression model for the average depression score and modeled the longitudinal correlation using an AR(1) structure GEE requires that the mean model is correctly specified We explored different options for modeling temporal trends GEE provides valid estimates and standard errors for the regression parameters even under misspecification of the correlation structure, but efficiency gains are possible if the correlation model is correct We chose AR(1) with the robust option Model with a group-by-time interaction term facilitated estimation of changes over time within groups and between-group comparisons in temporal trends Contrasted this with a cross-sectional model that compared the mean depression score between groups over all times B French (Module 20) Longitudinal Data Analysis SISCR / 155

81 Overview Review: Longitudinal data analysis Case study: Longitudinal depression scores Case study: Indonesian Children s Health Study Case study: Carpal tunnel syndrome Summary and resources B French (Module 20) Longitudinal Data Analysis SISCR / 155

82 Indonesian Children s Health Study (ICHS) Determine the effects of vitamin A deficiency in preschool children n = 275 children examined for respiratory infection at up to 6 visits Xerophthalmia is an ocular manifestation of vitamin A deficiency Goal: Evaluate association between vitamin A deficiency and risk of respiratory infection Age (years) Xerophthalmia Infection No No No Yes Yes No Yes Yes B French (Module 20) Longitudinal Data Analysis SISCR / 155

83 Data. list id age time infection xerop gender hfora cost sint id age time infect~n xerop gender hfora cost sint Multiple records per person, with age in months, centered at 36 months, and time indicating visit number B French (Module 20) Longitudinal Data Analysis SISCR / 155

84 Exploratory analyses 1. Plot vitamin A deficiency and infection status, by age, for a sample of individuals 2. Plot percent with respiratory infection versus age, by presence or absence of vitamin A deficiency 3. Explore correlation structure by visit number, and calculate percent with respiratory infection at each visit B French (Module 20) Longitudinal Data Analysis SISCR / 155

85 Regression analyses 4. Evaluate the association between respiratory infection and vitamin A deficiency using an ordinary logistic regression model 5. Use GEE to estimate the population-averaged odds ratio for respiratory infection, comparing those with vitamin A deficiency to those without, given equivalent values of other covariates. Explore multiple specifications of working correlation 6. Use GLMM to estimate the conditional odds ratio for respiratory infection, comparing a typical individual with vitamin A deficiency to a typical individual without, given equivalent values of other covariates. Estimate the variability in the probability of respiratory infection across individuals B French (Module 20) Longitudinal Data Analysis SISCR / 155

86 Answers B French (Module 20) Longitudinal Data Analysis SISCR / 155

87 Individual trajectories Y Infection N subject Y Infection N subject Y Infection N subject Y Xerop N Age (years) Y Xerop N Age (years) Y Xerop N Age (years) Y Infection N subject Y Infection N subject Y Infection N subject Y Xerop N Age (years) Y Xerop N Age (years) Y Xerop N Age (years) Y Infection N subject Y Infection N subject Y Infection N subject Y Xerop N Age (years) Y Xerop N Age (years) Y Xerop N Age (years) B French (Module 20) Longitudinal Data Analysis SISCR / 155

88 Monthly averages 1.0 Percent with Respiratory Infection Age (years) No Xerop Xerop B French (Module 20) Longitudinal Data Analysis SISCR / 155

89 Yearly averages 1.0 Percent with Respiratory Infection Age (years) No Xerop Xerop B French (Module 20) Longitudinal Data Analysis SISCR / 155

90 Logistic regression model > summary(glm(infection ~ xerop + age + gender + hfora + cost + sint, data=ichs, family="binomial")) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) < 2e-16 *** xerop age e-07 *** gender hfora * cost *** sint exp β 1 = % CI: (0.88, 4.88) Does not take into account within-person correlation B French (Module 20) Longitudinal Data Analysis SISCR / 155

91 GEE motivation Do vitamin A deficient children have an increased risk of infection? µ ij = E[Y ij x ij ] = P[Y ij = 1 x ij ] µ ij logit µ ij = log 1 µ ij = β 0 + β 1 Xerophthalmia ij + log P[Y ij = 1 x ij ] P[Y ij = 0 x ij ] exp β 1 represents the ratio of the expected odds of respiratory infection among a population of vitamin A deficient children to that for a population of children replete with vitamin A of the same age, gender,... exp β 1 is therefore a population-averaged parameter Respiratory infection is rare so odds ratio approximates relative risk B French (Module 20) Longitudinal Data Analysis SISCR / 155

92 Correlations Use visit time (not age) to obtain a correlation matrix with n = observations per cell Time 1 Time 2 Time 3 Time 4 Time 5 Time 6 Time 1 1 Time Time Time Time Time % 5 % 7 % 4 % 15 % 9 % B French (Module 20) Longitudinal Data Analysis SISCR / 155

93 Covariance structure For a binary outcome, variance depends on mean Var[Y ij ] = E[Y ij ](1 E[Y ij ]) Correlation also depends (in a somewhat complicated way) on pairwise means NB With respect to age, data are neither balanced nor complete Even if our analysis will be a function of age, examination of covariance and correlation matrices with respect to visit time is useful Dependence of correlation on pairwise means motivates alternate methods that model odds ratios instead of correlations B French (Module 20) Longitudinal Data Analysis SISCR / 155

94 Covariance structure Odds ratios measure the association between two binary variables Here, binary outcomes at two different visit times Time 1 Time 2 Time 3 Time 4 Time 5 Time 6 Time 1 Time Time Time Time Time B French (Module 20) Longitudinal Data Analysis SISCR / 155

95 Covariance structure Variance model Var[Y ij x ij ] = µ ij (1 µ ij ) Consider various specifications for the working correlation structure Independence Exchangeable Auto-regressive NB: In practice, selection of a working correlation structure should be guided by a priori knowledge and/or exploratory analysis B French (Module 20) Longitudinal Data Analysis SISCR / 155

96 geepack geepack implements estimating equations for β, α, and φ geeglm Syntax similar to glm; returns an object similar to a glm object An anova method provides multivariate Wald tests for joint hypotheses Calls a fitter function geese to solve the estimating equations geese Provides estimation and inference for β, α, and φ Model objects are available within geeglm objects names(m1) names(m1$geese) m1$geese$vbeta B French (Module 20) Longitudinal Data Analysis SISCR / 155

97 R commands load("ichs.rdata") library(geepack) m1 <- geeglm(infection ~ xerop + age + gender + hfora + cost + sint, id=id, data=ichs, family="binomial", corstr="independence") m2 <- geeglm(infection ~ xerop + age + gender + hfora + cost + sint, id=id, data=ichs, family="binomial", corstr="exchangeable") m3 <- geeglm(infection ~ xerop + age + gender + hfora + cost + sint, id=id, data=ichs, family="binomial", corstr="ar1") B French (Module 20) Longitudinal Data Analysis SISCR / 155

98 GEE-independence > summary(m1) Coefficients: Estimate Std.err Wald Pr(> W ) (Intercept) < 2e-16 *** xerop age e-07 *** gender hfora * cost *** sint Signif. codes: 0 *** ** 0.01 * Estimated Scale Parameters: Estimate Std.err (Intercept) Correlation: Structure = independence Number of clusters: 275 Maximum cluster size: 6 B French (Module 20) Longitudinal Data Analysis SISCR / 155

99 GEE-exchangeable > summary(m2) Coefficients: Estimate Std.err Wald Pr(> W ) (Intercept) < 2e-16 *** xerop age e-07 *** gender hfora * cost *** sint Signif. codes: 0 *** ** 0.01 * Estimated Scale Parameters: Estimate Std.err (Intercept) Correlation: Structure = exchangeable Link = identity Estimated Correlation Parameters: Estimate Std.err alpha Number of clusters: 275 Maximum cluster size: 6 B French (Module 20) Longitudinal Data Analysis SISCR / 155

100 GEE-AR(1) > summary(m3) Coefficients: Estimate Std.err Wald Pr(> W ) (Intercept) < 2e-16 *** xerop age e-07 *** gender hfora * cost *** sint Signif. codes: 0 *** ** 0.01 * Estimated Scale Parameters: Estimate Std.err (Intercept) Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters: Estimate Std.err alpha Number of clusters: 275 Maximum cluster size: 6 B French (Module 20) Longitudinal Data Analysis SISCR / 155

101 Results ˆβ 1 (SE) exp( ˆβ 1 ) (95% CI) Independence 0.73 (0.42) 2.08 (0.91, 4.76) Exchangeable 0.63 (0.44) 1.87 (0.80, 4.40) Auto-regressive 0.67 (0.44) 1.95 (0.83, 4.63) Vitamin A deficient children have an increased risk of respiratory infection, but confidence interval includes the null-hypothesized value geese provides estimation and inference for β, α, and φ Cannot reject the hypothesis that α = 0 Note: Model fit can be evaluated using QIC (Pan, 2001) B French (Module 20) Longitudinal Data Analysis SISCR / 155

102 Working correlation structures Exchangeable : Auto-regressive : B French (Module 20) Longitudinal Data Analysis SISCR / 155

103 Stata commands * Declare the dataset to be "panel" data, grouped by id * with time variable time xtset id time * Fit models with an exchangeable correlation structure xtgee infection i.xerop age gender hfora cost sint, family(binomial) link(logit) corr(exch) robust * Examine working correlation structure estat wcorr B French (Module 20) Longitudinal Data Analysis SISCR / 155

104 GEE-exchangeable GEE population-averaged model Number of obs = 1200 Group variable: id Number of groups = 275 Link: logit Obs per group: min = 1 Family: binomial avg = 4.4 Correlation: exchangeable max = 6 Wald chi2(6) = Scale parameter: 1 Prob > chi2 = (Std. Err. adjusted for clustering on id) Semi-robust infection Coef. Std. Err. z P> z [95% Conf. Interval] xerop age gender hfora cost sint _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

105 Working correlation structure. estat wcorr Estimated within-id correlation matrix R: c1 c2 c3 c4 c5 c r1 1 r r r r r B French (Module 20) Longitudinal Data Analysis SISCR / 155

106 Mixed-effects models Do vitamin A deficient children have an increased risk of infection? µ ij = E[Y ij γ 0i ] = P[Y ij = 1 γ 0i ] µ ij logit µ ij = log 1 µ ij = (β0 + γ 0i ) + β1 Xerophthalmia ij + for i = 1,..., 275 and j = 1,..., m i exp β1 represents the ratio of the expected odds of respiratory infection for a typical individual with vitamin A deficiency to that for a typical individual with a sufficient amount of vitamin A of the same age, gender,... exp β1 is therefore a conditional parameter Respiratory infection is rare so odds ratio approximates relative risk B French (Module 20) Longitudinal Data Analysis SISCR / 155

107 R commands Use the glmer command in the lme4 library library(lme4)?glmer m_ri <- glmer(infection ~ (1 id) + factor(xerop) + age + factor(gender) + hfora + cost + sint, family=binomial, data=ichs, nagq=7) methods(class="mermod") expit <- function(x){exp(x)/(1+exp(x))} expit(fixef(m_ri)[1]) expit(fixef(m_ri)[1]-1.96*sqrt(varcorr(m_ri)$id[[1]])) expit(fixef(m_ri)[1]+1.96*sqrt(varcorr(m_ri)$id[[1]])) B French (Module 20) Longitudinal Data Analysis SISCR / 155

108 Random intercepts model > summary(m_ri) Random effects: Groups Name Variance Std.Dev. id (Intercept) Number of obs: 1200, groups: id, 275 Fixed effects: Estimate Std. Error z value Pr(> z ) (Intercept) < 2e-16 *** factor(xerop) age e-06 *** factor(gender) hfora * cost *** sint B French (Module 20) Longitudinal Data Analysis SISCR / 155

109 Interpreting random effects components For continuous outcomes interpreting random effects is easy because their standard deviation is on the scale of the outcome For binary outcomes the standard deviation is on the log-odds scale Recall for a GLMM with random intercepts γ 0i N(0, G 11 ) (β 0 + γ 0i ) N(β 0, G 11 ) In the ICHS analysis the intercept corresponds to the log odds of respiratory infection among females, age 36 months,..., with a sufficient amount of vitamin A We can use ˆβ 0 and Ĝ11 to form an interval to quantify variability in the probability of respiratory infection across these individuals expit( ˆβ 0 ± 1.96 Ĝ 11 ) = exp( ˆβ 0 ± 1.96 Ĝ 11 ) 1 + exp( ˆβ 0 ± 1.96 Ĝ 11 ), which is calculated to be 0.06 (0.01, 0.28) NB: This is not a confidence interval for β 0 B French (Module 20) Longitudinal Data Analysis SISCR / 155

110 Conditional and marginal effects Parameter estimates obtained from a marginal model (as obtained via a GEE) estimate population-averaged contrasts Parameter estimates obtained from a conditional model (as obtained via a GLMM) estimate subject-specific contrasts In a linear model for a Gaussian outcome with an identity link these contrasts are equivalent; not the case with non-linear models Depends on the outcome distribution Depends on the specified random effects B French (Module 20) Longitudinal Data Analysis SISCR / 155

111 Conditional and marginal effects Fitted conditional model Outcome Coefficient Random intercept Random intercept/slope Continuous Intercept Marginal Marginal Slope Marginal Marginal Count Intercept Conditional Conditional Slope Marginal Conditional Binary Intercept Conditional Conditional Slope Conditional Conditional Marginal = population-averaged; conditional = subject-specific B French (Module 20) Longitudinal Data Analysis SISCR / 155

112 Stata commands * Declare the dataset to be "panel" data, grouped by id * with time variable time xtset id time * Fit a model with random intercepts help melogit melogit infection i.xerop age i.gender hfora cost sint id: * Obtain predicted probabilities of infection, * setting the random effects to 0 margins i.xerop, predict(mu fixed) B French (Module 20) Longitudinal Data Analysis SISCR / 155

113 Random intercepts model Mixed-effects logistic regression Number of obs = 1200 Group variable: id Number of groups = 275 Obs per group: min = 1 avg = 4.4 max = 6 Integration method: mvaghermite Integration points = 7 Wald chi2(6) = Log likelihood = Prob > chi2 = infection Coef. Std. Err. z P> z [95% Conf. Interval] xerop age gender hfora cost sint _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

114 Random intercepts model id var(_cons) LR test vs. logistic regression: chibar2(01) = 5.52 Prob>=chibar2 = Predictive margins Number of obs = 1200 Model VCE : OIM Expression : Predicted mean, fixed portion only, predict(mu fixed) Delta-method Margin Std. Err. z P> z [95% Conf. Interval] xerop B French (Module 20) Longitudinal Data Analysis SISCR / 155

115 Summary Exploratory analysis with binary outcomes is not straightforward Plots of raw data not always useful Aggregated percents (means) can summarize mean response Correlation can be examined using correlations or odds ratios GEE provides marginal, population-averaged contrasts Ratio of the expected odds of respiratory infection among a population of vitamin A deficient children to that for a population of children replete with vitamin A of the same age, gender,... GLMM provides conditional, subject-specific contrasts Ratio of the expected odds of respiratory infection for a typical individual with vitamin A deficiency to that for a typical individual with a sufficient amount of vitamin A of the same age, gender,... Random effects variance components quantify heterogeneity in effects Lack of significance likely due to small number of exposed cases B French (Module 20) Longitudinal Data Analysis SISCR / 155

116 Overview Review: Longitudinal data analysis Case study: Longitudinal depression scores Case study: Indonesian Children s Health Study Case study: Carpal tunnel syndrome Summary and resources B French (Module 20) Longitudinal Data Analysis SISCR / 155

117 Carpal tunnel syndrome trial Jarvik et al. (2009) compared surgical versus non-surgical treatment for carpal tunnel syndrome (CTS) 116 participants were randomized Outcomes were assessed using the CTS assessment questionnaire (CTSAQ), every 3 months for 12 months Primary: functional status (low values are favorable) Secondary: symptom severity Crossover to surgery was allowed after 3 months Goal: Determine whether surgery improves functional status B French (Module 20) Longitudinal Data Analysis SISCR / 155

118 Data (wide format). list ID idgroup treatassign surgical ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf ID idgroup treata~n surgical ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf more-- B French (Module 20) Longitudinal Data Analysis SISCR / 155

119 Variables ID: unique participant ID idgroup: study site (1 = private, 2 = UW, 3 = VA, 4 = HMC) age: age in years gender (0 = male, 1 = female) treatassign: randomized intervention (0 = non-surgery, 1 = surgery) surgreported#: surgery reported at visit # (0 = no, 1 = yes) ctsaqf#: CTSAQ functional status at visit # ctsaqs#: CTSAQ symptom severity at visit # surgical: treated surgically during study (0 = never, 1 = 0 3 mos, 2 = 3 6 mos, 3 = 6 9 mos, 4 = 9 12 mos) B French (Module 20) Longitudinal Data Analysis SISCR / 155

120 Exploratory analyses 1. Plot individual trajectories in CTSAQF over time by treatment 2. Plot average CTSAQF over time by treatment 3. Summarize means, variances, and correlations over time by treatment B French (Module 20) Longitudinal Data Analysis SISCR / 155

121 Regression analyses (intention-to-treat) 4. Evaluate POST, CHANGE, and ANCOVA models for each timepoint with CTSAQF as outcome POST: follow-up measurement only CHANGE: difference between follow-up and baseline measurement ANCOVA: follow-up measurement controlling for baseline 5. Evaluate difference in mean CTSAQF between treatments at 12 months, adjusted for baseline CTSAQF and study site 6. Using repeated measures regression models, evaluate difference in mean CTSAQF between treatments using all timepoints, adjusted for baseline CTSAQF and study site B French (Module 20) Longitudinal Data Analysis SISCR / 155

122 Bonus analyses (as treated) 7. Summarize actual treatment patterns by assigned treatment group 8. Plot average CTSAQF by visit... For those who received surgery by 3 months versus those who did not For those who received surgery by 9 months versus those who did not 9. Use linear mixed-effects models to estimate differences in mean CTSAQF when exposure is surgery by 3 months and surgery by 9 months instead of assigned treatment group B French (Module 20) Longitudinal Data Analysis SISCR / 155

123 Answers B French (Module 20) Longitudinal Data Analysis SISCR / 155

124 Individual trajectories, non-surgery arm graph twoway connected ctsaqf visit if(id<=13062 & ID!=13009 & treatassign==0), by(id) mcolor(maroon) lcolor(maroon) xtitle("visit") ytitle("ctsaq-f") B French (Module 20) Longitudinal Data Analysis SISCR / 155

125 Individual trajectories, surgery arm graph twoway connected ctsaqf visit if(id<=13101 & ID!=13009 & treatassign==1), by(id) mcolor(maroon) lcolor(maroon) xtitle("visit") ytitle("ctsaq-f") B French (Module 20) Longitudinal Data Analysis SISCR / 155

126 Linear trajectories, non-surgery arm graph twoway (lfit ctsaqf visit) (scatter ctsaqf visit) if(id<=13062 & ID!=13009 & treatassign==0), by(id, legend(off)) xtitle("visit") ytitle("ctsaq-f") B French (Module 20) Longitudinal Data Analysis SISCR / 155

127 Linear trajectories, surgery arm graph twoway (lfit ctsaqf visit) (scatter ctsaqf visit) if (ID<=13101 & ID!=13009 & treatassign==1), by(id, legend(off)) xtitle("visit") ytitle("ctsaq-f") B French (Module 20) Longitudinal Data Analysis SISCR / 155

128 Mean CTSAQF collapse (mean) ctsaqf, by(visit treatassign) graph twoway (scatter ctsaqf visit if treatassign==0) (line ctsaqf visit if treatassign==0) (scatter ctsaqf visit if treatassign==1) (line ctsaqf visit if treatassign==1) B French (Module 20) Longitudinal Data Analysis SISCR / 155

129 Means and variances. use "cts.dta", clear. bysort treatassign: summarize ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf > treatassign = 0 Variable Obs Mean Std. Dev. Min Max ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf > treatassign = 1 Variable Obs Mean Std. Dev. Min Max ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf Both treatment groups improve, but surgery group improves more Variance is larger in non-surgery group after baseline Missing data exist in both treatment groups B French (Module 20) Longitudinal Data Analysis SISCR / 155

130 Correlation. bysort treatassign: cor ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf > treatassign = 0 (obs=41) ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf > treatassign = 1 (obs=44) ctsaqf0 ctsaqf1 ctsaqf2 ctsaqf3 ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf ctsaqf Strong positive correlations across most measurement pairs Note: Only a subset of participants has measurements at all times B French (Module 20) Longitudinal Data Analysis SISCR / 155

131 Generate change variables. use "cts.dta", clear. gen change1 = ctsaqf1 - ctsaqf0 (9 missing values generated). gen change2 = ctsaqf2 - ctsaqf0 (12 missing values generated). gen change3 = ctsaqf3 - ctsaqf0 (22 missing values generated). gen change4 = ctsaqf4 - ctsaqf0 (15 missing values generated) B French (Module 20) Longitudinal Data Analysis SISCR / 155

132 POST results. ttest ctsaqf1, by(treatassign) unequal Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest ctsaqf2, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest ctsaqf3, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest ctsaqf4, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = B French (Module 20) Longitudinal Data Analysis SISCR / 155

133 CHANGE results. ttest change1, by(treatassign) unequal Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest change2, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest change3, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = ttest change4, by(treatassign) unequal diff Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = Pr( T > t ) = Pr(T > t) = B French (Module 20) Longitudinal Data Analysis SISCR / 155

134 ANCOVA results. reg ctsaqf1 treatassign ctsaqf Coef. Std. Err. t P> t [95% Conf. Interval] treatassign ctsaqf _cons reg ctsaqf2 treatassign ctsaqf treatassign ctsaqf _cons reg ctsaqf3 treatassign ctsaqf treatassign ctsaqf _cons reg ctsaqf4 treatassign ctsaqf treatassign ctsaqf _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

135 Results for each timepoint 3 months 6 months 9 months 12 months Method Mean (SE) Mean (SE) Mean (SE) Mean (SE) POST 0.22 (0.17) 0.53 (0.17) 0.47 (0.18) 0.43 (0.17) CHANGE 0.17 (0.14) 0.42 (0.13) 0.41 (0.13) 0.35 (0.16) ANCOVA 0.19 (0.13) 0.45 (0.13) 0.42 (0.13) 0.38 (0.15) Standard errors are lower when baseline information is incorporated into the model (CHANGE and ANCOVA) Estimated difference (control group minus surgical group) also varies across methods due to difference in baseline values B French (Module 20) Longitudinal Data Analysis SISCR / 155

136 CTSQAF at 12 months. reg ctsaqf4 i.treatassign ctsaqf0 i.idgroup Source SS df MS Number of obs = F(5, 95) = Model Prob > F = Residual R-squared = Adj R-squared = Total Root MSE = ctsaqf4 Coef. Std. Err. t P> t [95% Conf. Interval] treatassign ctsaqf idgroup _cons Significant difference in adjusted mean CTSQAF at 12 months, indicating superiority of surgery Symptoms in both groups improved, but surgical treatment led to better outcome than did non-surgical treatment Clinical relevance of this difference was modest B French (Module 20) Longitudinal Data Analysis SISCR / 155

137 GEE-independence. xtset ID visit. xtgee ctsaqf i.treatassign ctsaqfbase visit i.idgroup if visit!=0, corr(ind) robust GEE population-averaged model Number of obs = 406 Group variable: ID Number of groups = 113 Link: identity Obs per group: Family: Gaussian min = 1 Correlation: independent avg = 3.6 max = 4 Wald chi2(6) = Scale parameter: Prob > chi2 = Pearson chi2(406): Deviance = Dispersion (Pearson): Dispersion = (Std. Err. adjusted for clustering on ID) Robust ctsaqf Coef. Std. Err. z P> z [95% Conf. Interval] treatassign ctsaqfbase visit idgroup _cons B French (Module 20) Longitudinal Data Analysis SISCR / 155

138 GEE-exchangeable. xtgee ctsaqf i.treatassign ctsaqfbase visit i.idgroup if visit!=0, corr(exc) robust GEE population-averaged model Number of obs = 406 Group variable: ID Number of groups = 113 Link: identity Obs per group: Family: Gaussian min = 1 Correlation: exchangeable avg = 3.6 max = 4 Wald chi2(6) = Scale parameter: Prob > chi2 = (Std. Err. adjusted for clustering on ID) Robust ctsaqf Coef. Std. Err. z P> z [95% Conf. Interval] treatassign ctsaqfbase visit idgroup _cons Estimated correlation for exchangeable structure: 0.33 B French (Module 20) Longitudinal Data Analysis SISCR / 155

139 Random intercepts model. xtmixed ctsaqf i.treatassign ctsaqfbase visit i.idgroup if visit!=0 ID: Mixed-effects ML regression Number of obs = 406 Group variable: ID Number of groups = ctsaqf Coef. Std. Err. z P> z [95% Conf. Interval] treatassign ctsaqfbase visit idgroup _cons Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] ID: Identity sd(_cons) sd(residual) LR test vs. linear model: chibar2(01) = Prob >= chibar2 = B French (Module 20) Longitudinal Data Analysis SISCR / 155

140 Random intercepts and slopes model. xtmixed ctsaqf i.treatassign ctsaqfbase visit i.idgroup if visit!=0 ID: visit Mixed-effects ML regression Number of obs = 406 Group variable: ID Number of groups = ctsaqf Coef. Std. Err. z P> z [95% Conf. Interval] treatassign ctsaqfbase visit idgroup _cons Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] ID: Independent sd(visit) sd(_cons) sd(residual) LR test vs. linear model: chi2(2) = Prob > chi2 = B French (Module 20) Longitudinal Data Analysis SISCR / 155

141 Treatment. tab treatassign treatassign Freq. Percent Cum Total tab treatassign surgical treatassig surgical n Total Total Of 57 assigned to surgery, 42 had it by 3 months and 13 never had it Of 59 assigned to no surgery, 23 actually had surgery during the study B French (Module 20) Longitudinal Data Analysis SISCR / 155

142 Treatment. gen surgby3 = (surgical==1). gen surgby9 = (surgical==1 surgical==2 surgical==3). collapse (mean) surgby3 surgby9 treatassign, by(id). tab treatassign surgby3, row (mean) treatassig (mean) surgby3 n 0 1 Total Total tab treatassign surgby9, row (mean) treatassig (mean) surgby9 n 0 1 Total Total B French (Module 20) Longitudinal Data Analysis SISCR / 155

143 Mean CTSQAF, 3-month exposure collapse (mean) ctsaqf, by(visit surgby3) graph twoway (scatter ctsaqf visit if surgby3==0) (line ctsaqf visit if surgby3==0) (scatter ctsaqf visit if surgby3==1) (line ctsaqf visit if surgby3==1) B French (Module 20) Longitudinal Data Analysis SISCR / 155

144 Mean CTSQAF, 9-month exposure collapse (mean) ctsaqf, by(visit surgby9) graph twoway (scatter ctsaqf visit if treatassign==0) (line ctsaqf visit if treatassign==0) (scatter ctsaqf visit if treatassign==1) (line ctsaqf visit if treatassign==1) B French (Module 20) Longitudinal Data Analysis SISCR / 155