Case study: Modelling berry yield through GLMMs


1 Case study: Modelling berry yield through GLMMs. Jari Miina, Finnish Forest Research Institute (Metla). European NWFPs network, Action FP. TRAINING SCHOOL Modelling NWFP, El Escorial, 29th September – 2nd October 2014

2 Content
- Generalized Linear Mixed Models (GLMMs) for non-normal data
- Modelling the percentage coverage of berries (binomial)
- Modelling the number of berries (Poisson)
- Further training with count data: GLMM_training.pdf (lecture), GLMM_training.csv (data), GLMM_training.r (R code)

3 Purpose of modelling NWFPs
Research task → Data collection & analyses → Information
Data collection:
- Surveys by sampling: a subset of a population; probability vs. non-probability sampling
- Experiments: detect causal relationships, control external effects, reduce variability in conditions
Analyses: linear regression, non-linear regression, generalized linear models, mixed models, multivariate models, time series analysis, expert modelling, zero-inflated models, spline modelling, process-based modelling, etc.
Information:
- Models for predicting NWFPs in stand and forest management simulators
- Abundance, yield and distribution of NWFP resources

4 Generalized Linear Mixed Models?
A linear model: $Y = (\text{effect}_1 \times \text{predictor}_1) + (\text{effect}_2 \times \text{predictor}_2) + \dots$
(General) Linear Mixed Models: conditionally normal outcome distribution, fixed and random effects.
Generalized Linear Mixed Models: any conditional outcome distribution, fixed and random effects. The expected value of the dependent variable, $\mu = E[Y]$, is predicted from a linear combination of fixed and random effects through a link function $g(\cdot)$:
$g(\mu) = (\text{effect}_1 \times \text{predictor}_1) + (\text{effect}_2 \times \text{predictor}_2) + \dots$
Note: a link function is not the same as a transformation. For example, the log link models $\ln(E[y])$, not the transformed response $\ln(y)$.
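The distinction between a log link and a log transformation can be checked numerically. The sketch below uses simulated Poisson counts (the rate of 20 and the sample size are arbitrary assumptions, not values from the berry data): an intercept-only Poisson GLM recovers ln(E[y]), whereas a linear model fitted to ln(y) recovers E[ln(y)], which is smaller.

```r
set.seed(1)
y <- rpois(1000, lambda = 20)   # simulated counts, for illustration only

log_of_mean <- log(mean(y))     # what a log link models: ln(E[y])
mean_of_log <- mean(log(y))     # what a log transformation models: E[ln(y)]

# An intercept-only Poisson GLM with log link estimates ln(E[y])
glm_intercept <- unname(coef(glm(y ~ 1, family = poisson(link = "log"))))

# A linear model on the log-transformed response estimates E[ln(y)]
lm_intercept <- unname(coef(lm(log(y) ~ 1)))

c(glm = glm_intercept, lm_on_log = lm_intercept)
```

Back-transforming the linear-model intercept with exp() therefore underestimates the mean count, which is one reason to prefer the link-function approach for count data.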

5 Bilberry and cowberry sample plots Different factors affect the (1) abundance and (2) fruiting of wild berries => two different datasets and models (1 and 2) Model 1: The percentage coverage of a berry plant as a function of stand and site variables (the PSP3000 data) Model 2: The number of berries as a function of coverage and stand and site variables (the MASI data)

6 (1) Berry sample plots: the PSP3000 data The coverages (%) of bilberry and cowberry are determined on 2-m² quadrats in the permanent sample plots (PSP) of the NFI. Kühlmann-Berenzon, S. & Hjorth, U. 2007

7 (1) Berry sample plots: the PSP3000 data
Dependent variable: the percentage coverage (%) of the berry plant
- Errors are not normally distributed (non-normal)
- Variance is not constant
- Response is bounded (0–100 %)
- Overdispersion (+ extra zeros?) => random effect at the lowest level (i.e. pseudo-level)
Why not convert to proportions, arcsine-transform and fit a linear model?

8 (1) Berry sample plots: the PSP3000 data For the binomial distribution: Mean = np, Variance = np(1 − p)
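The binomial mean-variance relationship, and what overdispersion does to it, can be illustrated by simulation. In this sketch the mean coverage of 20 %, the number of observations and the standard deviation of the observation-level effect are all hypothetical values chosen for illustration.

```r
set.seed(42)
n_trials <- 100      # coverage in % treated as successes out of 100 trials
p <- 0.2             # hypothetical mean coverage of 20 %
n_obs <- 10000

# Pure binomial counts: the variance should be close to n * p * (1 - p) = 16
y_binom <- rbinom(n_obs, size = n_trials, prob = p)

# Over-dispersed counts: p varies between observations on the logit scale
u <- rnorm(n_obs, mean = 0, sd = 1)   # observation-level random effect
y_over <- rbinom(n_obs, size = n_trials, prob = plogis(qlogis(p) + u))

theoretical_var <- n_trials * p * (1 - p)
c(theoretical = theoretical_var,
  binomial = var(y_binom),
  overdispersed = var(y_over))
```

The extra variance in the second sample is what a random effect at the lowest level is meant to absorb.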

9 (1) Berry sample plots: the PSP3000 data
Dependent variable: the percentage coverage (%) of the berry plant
Hierarchical data structure: forest centre region i, municipality j, NFI-plot cluster k, NFI plot l, stand m, quadrats
Proportional response variable y: the mean percentage coverage of the 2-m² quadrats in the stand; p % = p/100, i.e. p successes in 100 trials
A generalized linear mixed model (a multi-level binomial model with logit link function):
$y_{ijklm} \sim \text{binomial}(n_{ijklm}, p_{ijklm})$
$\ln\left(\frac{p_{ijklm}}{1 - p_{ijklm}}\right) = f(X_{ijklm}, \beta) + u_i + u_{ij} + u_{ijk} + u_{ijkl} + u_{ijklm}$
where $u_{ijklm}$ is the random effect at the pseudo-level and the logit is the natural logarithm of the odds $p/(1 - p)$.
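To make the logit scale concrete, the following sketch adds up a fixed part and one random effect per hierarchical level and converts the result back to a coverage percentage. All numbers are hypothetical and are not estimates from the PSP3000 data.

```r
# Hypothetical values on the logit scale, for illustration only
fixed_part <- -1.5                      # f(X, beta): contribution of the fixed effects
u <- c(region = 0.30, municipality = -0.10, cluster = 0.20,
       plot = 0.05, stand = -0.40)      # one random effect per hierarchical level

eta <- fixed_part + sum(u)              # linear predictor = logit(p)
p <- plogis(eta)                        # inverse logit: exp(eta) / (1 + exp(eta))
coverage_pct <- 100 * p                 # back on the percentage scale

odds <- p / (1 - p)                     # the logit is the log of these odds
c(eta = eta, p = p, coverage_pct = coverage_pct, log_odds = log(odds))
```

Because the inverse logit maps any real number into (0, 1), predictions always stay within the 0–100 % bounds of the response.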

10 (1) Berry sample plots: the PSP3000 data A model for the mean percentage coverage of bilberry on 2-m² quadrats in the stand (Miina et al. 2009): Site = site fertility; Pine, Birch = dominant tree species; Alt = altitude (m); T, G = stand age (years) and stand basal area (m²/ha)

11 (1) Berry sample plots: the PSP3000 data Simulated coverage of bilberry and cowberry in Scots pine and Norway spruce stands on Myrtillus (MT), Vaccinium (VT) and Calluna (CT) sites in southern Finland. The stand development was simulated using the Motti simulator, and thinnings (arrows) and final cutting were simulated according to the silvicultural recommendations.

12 (2) Berry sample plots: the MASI data In about 200 stands, five 1-m² quadrats; berries counted annually. Metla/Kauko Salo; Turtiainen et al. 2011

13 (2) Berry sample plots: the MASI data
Dependent variable: the number of berries
- Errors are not normally distributed (non-normal)
- Variance is not constant
- Count response is a non-negative integer (≥ 0)
- Mean = 83 << Variance = 5829, whereas for the Poisson distribution Mean = Variance
- Overdispersion? => random effect at the lowest level (i.e. pseudo-level)
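A quick way to quantify overdispersion is the Pearson dispersion statistic of a Poisson fit, which should be close to 1 if mean = variance holds. The sketch below simulates counts whose mean and variance roughly match the figures quoted above (83 and 5829); generating them from a negative binomial distribution, and the sample size of 500, are assumptions made only for illustration.

```r
set.seed(7)
# For a negative binomial, variance = mu + mu^2 / size,
# so size = mu^2 / (variance - mu) reproduces the target variance
mu <- 83
size <- mu^2 / (5829 - mu)
y <- rnbinom(500, mu = mu, size = size)   # simulated counts, not the MASI data

fit <- glm(y ~ 1, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals / residual df
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

c(mean = mean(y), variance = var(y), dispersion = dispersion)
```

For an intercept-only model this statistic equals var(y) / mean(y), so a value far above 1 signals that a plain Poisson model is too narrow.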

14 (2) Berry sample plots: the MASI data
Dependent variable: the number of berries
Hierarchical data structure: forest centre region i, municipality j, stand m, quadrats
Cross-effect: year t (as a fixed effect)
Count response variable y: the annual mean number of berries on the 1-m² quadrats in the stand
A generalized linear mixed model (a multi-level Poisson model with log link function):
$y_{ijmt} \sim \text{Poisson}(\mu_{ijmt})$
$\ln(\mu_{ijmt}) = f(X_{ijmt}, \beta) + u_i + u_{ij} + u_{ijm} + u_{ijmt}$
where $u_{ijmt}$ is the random effect at the pseudo-level.
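One practical consequence of adding normally distributed random effects on the log scale is that the population-averaged mean is larger than exp(fixed part): for a normal effect with standard deviation sigma, E[exp(u)] = exp(sigma^2 / 2). The sketch below checks this by simulation; the fixed part and sigma are hypothetical values, and the random effects of all levels are lumped into a single term for simplicity.

```r
set.seed(123)
n <- 100000
eta_fixed <- log(50)   # hypothetical fixed part f(X, beta) on the log scale
sigma_u <- 0.8         # hypothetical SD of the summed random effects

u <- rnorm(n, mean = 0, sd = sigma_u)
y <- rpois(n, lambda = exp(eta_fixed + u))

conditional_mean <- exp(eta_fixed)                 # expected count when u = 0
marginal_mean <- exp(eta_fixed + sigma_u^2 / 2)    # population-averaged expected count

c(simulated = mean(y), conditional = conditional_mean, marginal = marginal_mean)
```

When such a model is used to predict average yields, this correction term matters; ignoring it gives the prediction for a "typical" unit rather than the mean over units.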

15 (2) Berry sample plots: the MASI data A model for the annual number of unripe bilberries on 1-m² quadrats in pine and spruce stands (Miina et al. 2009): Pine stands: Spruce stands: G = stand basal area (m²/ha), u = random year effect

16 Models (1) & (2): Simulated mean annual berry yield (solid line) and its 95% confidence interval (broken lines) in Scots pine stands on Myrtillus and Vaccinium sites in southern Finland. The stand development was simulated using the Motti simulator, and thinnings (arrows) and final cutting were simulated according to the silvicultural recommendations.

17 R code for fitting the over-dispersed binomial mixed model using the PSP3000 data. The response variable (bilberry coverage, %) is given as (success, failure):

    library(MASS)  # provides glmmPQL()
    bilberry <- round(bilberry_coverage)  # Only integers
    Model_1 <- glmmPQL(cbind(bilberry, 100 - bilberry) ~
        # Dummy variables for site types (site type 3, Myrtillus sites, is the reference)
        I(sitetype == 1)               # Oxalis-Maianthemum sites
        + I(sitetype == 2)             # Oxalis-Myrtillus sites
        + I(sitetype == 4)             # Vaccinium sites
        + I(sitetype == 5)             # Calluna sites
        # Dummy variables for dominant tree species
        + I(sitetype == 2 & broadleaves == 1)
        + I(pine == 1)
        # Dummy variables for stand management
        + I(FormerAgrLand == 1)        # Former agricultural land
        + I(ArtificialRegen == 1)      # Planted or seeded stand
        # Continuous predictors
        + altitude                     # Altitude, m
        + age + I(age^2)               # Stand age, years
        + ba + I(ba^2),                # Stand basal area, m2/ha
        # Hierarchical data structure, data clustering, unbalanced data
        random = ~ 1 | ForestCentre/Municipality/NFIblock/NFIplot/Stand,
        family = binomial(), data = PSP3000)

18 Summary of the mixed binomial model fitted using the R function glmmPQL:

    > summary(Model_1)
    Linear mixed-effects model fit by maximum likelihood
     Data: PSP3000
      AIC BIC logLik
       NA  NA     NA

    Random effects:
     Formula: ~1 | ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | Municipality %in% ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | NFIblock %in% Municipality %in% ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | NFIplot %in% NFIblock %in% Municipality %in% ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | Stand %in% NFIplot %in% NFIblock %in% Municipality %in% ForestCentre
            (Intercept) Residual
    StdDev:
    (continued)

19 Summary of the mixed binomial model fitted using the R function glmmPQL (continued):

    > summary(Model_1)
                                          Value Std.Error DF t-value p-value
    (Intercept)
    I(sitetype == 1)TRUE
    I(sitetype == 2)TRUE
    I(sitetype == 4)TRUE
    I(sitetype == 5)TRUE
    I(sitetype == 2 & broadleaves == 1)
    I(pine == 1)TRUE
    I(FormerAgrLand == 1)TRUE
    I(ArtificialRegen == 1)TRUE
    altitude
    age
    I(age^2)
    ba
    I(ba^2)
    [snip]
    Number of Observations: 2056
    Number of Groups:
        ForestCentre
        Municipality %in% ForestCentre
        NFIblock %in% Municipality %in% ForestCentre
        NFIplot %in% NFIblock %in% Municipality %in% ForestCentre
        Stand %in% NFIplot %in% NFIblock %in% Municipality %in% ForestCentre  2056

20 R code for fitting the over-dispersed Poisson mixed model using the MASI data (glmmPQL):

    library(MASS)  # provides glmmPQL()
    bilberries <- round(bilberry_count)  # Only integers
    Model_2 <- glmmPQL(bilberries ~
        # Dummy variables for fixed year effects (year 2006 is the reference)
        I(year == 2001) + I(year == 2002) + I(year == 2003)
        + I(year == 2004) + I(year == 2005) + I(year == 2007),
        # Continuous predictors (left out of this fit):
        # + ba + I(ba^2)               # Stand basal area, m2/ha
        # Hierarchical data structure, data clustering, unbalanced data
        random = ~ 1 | ForestCentre/Municipality/Stand/StandYear,
        family = poisson(), data = MASI)
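Since the MASI data are not included here, a runnable miniature of the same call may help. The sketch below simulates counts for 30 hypothetical stands over the years 2001 to 2007 and fits a Poisson model with glmmPQL. It is deliberately simplified: there is a single stand-level random intercept instead of the full nested hierarchy, and all effect sizes are invented for the simulation.

```r
library(MASS)   # glmmPQL()
library(nlme)   # fixef(); glmmPQL() fits the model through nlme::lme()

set.seed(2014)
n_stands <- 30
years <- 2001:2007
d <- expand.grid(stand = factor(seq_len(n_stands)), year = years)

# Hypothetical "true" year effects (2006 = 0) and stand-level random intercepts
year_effect <- c(0.0, 0.4, -0.3, 0.6, 0.1, 0.0, 0.5)[match(d$year, years)]
stand_effect <- rnorm(n_stands, sd = 0.5)[as.integer(d$stand)]
d$berries <- rpois(nrow(d), lambda = exp(3 + year_effect + stand_effect))

# Year 2006 as the reference level, as in the slide
d$year_f <- relevel(factor(d$year), ref = "2006")

fit <- glmmPQL(berries ~ year_f, random = ~ 1 | stand,
               family = poisson(link = "log"), data = d, verbose = FALSE)

round(fixef(fit), 2)   # year effects on the log scale, relative to 2006
```

Using a factor with relevel() gives the same parameterisation as the explicit I(year == ...) dummies while keeping the formula short.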

21 Summary of the mixed Poisson model fitted using the R function glmmPQL:

    > summary(Model_2)
    Linear mixed-effects model fit by maximum likelihood
     Data: MASI
      AIC BIC logLik
       NA  NA     NA

    Random effects:
     Formula: ~1 | ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | Municipality %in% ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | Stand %in% Municipality %in% ForestCentre
            (Intercept)
    StdDev:
     Formula: ~1 | StandYear %in% Stand %in% Municipality %in% ForestCentre
            (Intercept) Residual
    StdDev:
    (continued)

22 Summary of the mixed Poisson model fitted using the R function glmmPQL (continued):

    > summary(Model_2)
                        Value Std.Error DF t-value p-value
    (Intercept)
    I(year == 2001)TRUE
    I(year == 2002)TRUE
    I(year == 2003)TRUE
    I(year == 2004)TRUE
    I(year == 2005)TRUE
    I(year == 2007)TRUE
    Number of Observations: 112
    Number of Groups:
        ForestCentre                                               11
        Municipality %in% ForestCentre                             21
        Stand %in% Municipality %in% ForestCentre                  25
        StandYear %in% Stand %in% Municipality %in% ForestCentre  112

23 Some observations on non-normal data
- For over-dispersed count data, also try the negative binomial distribution (i.e. the counts follow a mixture of Poisson and gamma distributions).
- Under-/over-dispersion vs. spatial distribution ("Poisson forest").
- The R function glmmPQL assumes only a mean and a variance, not a full distribution ("statistical voodoo") => no fit statistics are available:
  - Raw residuals are skewed and have non-constant variance => use the Pearson residuals (the raw residual divided by the square root of the variance function).
  - Pseudo-R² = 1 − var(obs − pred)/var(obs) (Model_1: 26 %)
- Use the R function glmer: the MASI data worked well, but the estimation did not converge with the PSP3000 data.
  - AIC, BIC, logLik and deviance are available
  - Random cross-effects can be included (e.g. a year effect)
- For a review of GLMMs, see Bolker et al. 2008, Trends in Ecology and Evolution.
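The Pearson residuals and the pseudo-R² described above are easy to compute by hand. The following sketch does so for a plain Poisson GLM fitted to simulated data; the predictor, its coefficients and the sample size are hypothetical, and a fixed-effects glm is used here only to keep the example short.

```r
set.seed(99)
x <- runif(200, 0, 40)                        # hypothetical predictor, e.g. basal area
y <- rpois(200, lambda = exp(1 + 0.05 * x))   # simulated counts
fit <- glm(y ~ x, family = poisson)

pred <- fitted(fit)
raw_resid <- y - pred

# Pearson residual: raw residual / sqrt(variance function); for Poisson, V(mu) = mu
pearson_manual <- raw_resid / sqrt(pred)
pearson_builtin <- residuals(fit, type = "pearson")

# Pseudo-R2 as defined above: 1 - var(obs - pred) / var(obs)
pseudo_r2 <- 1 - var(raw_resid) / var(y)
pseudo_r2
```

For a binomial model the same recipe applies with the variance function n·mu·(1 − mu) in place of mu.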


26 The mixed Poisson model fitted using the R function glmer (year as a fixed effect). See the GLMM_training.* files for further training:

    library(lme4)  # provides glmer()
    Fit1 <- glmer(bilberries ~ as.factor(year) + (1 | ForestCentre) + (1 | Municipality)
                  + (1 | Stand) + (1 | StandYear), data = MASI, family = poisson)
    > summary(Fit1)
    Generalized linear mixed model fit by the Laplace approximation
    Formula: bilberries ~ as.factor(year) + (1 | ForestCentre) + (1 | Municipality) + (1 | Stand) + (1 | StandYear)
       Data: MASI
     AIC BIC logLik deviance
    Random effects:
     Groups       Name        Variance Std.Dev.
     StandYear    (Intercept) e        e-01
     Stand        (Intercept) e        e-01
     Municipality (Intercept) e        e-07
     ForestCentre (Intercept) e        e-05
    Number of obs: 112, groups: StandYear, 112; Stand, 25; Municipality, 21; ForestCentre, 11

                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                 < 2e-16 ***
    as.factor(year)                                     *
    as.factor(year)
    as.factor(year)                             e-06    ***
    as.factor(year)
    as.factor(year)                                     ***
    as.factor(year)                                     **

27 The mixed Poisson model fitted using the R function glmer (year as a random effect):

    Fit2 <- glmer(bilberries ~ 1 + (1 | Year) + (1 | ForestCentre) + (1 | Municipality)
                  + (1 | Stand) + (1 | StandYear), data = MASI, family = poisson)
    > summary(Fit2)
    Generalized linear mixed model fit by the Laplace approximation
    Formula: bilberries ~ 1 + (1 | Year) + (1 | ForestCentre) + (1 | Municipality) + (1 | Stand) + (1 | StandYear)
       Data: MASI
     AIC BIC logLik deviance
    Random effects:
     Groups       Name        Variance Std.Dev.
     StandYear    (Intercept) e        e-01
     Stand        (Intercept) e        e-01
     Municipality (Intercept) e        e+00
     ForestCentre (Intercept) e        e-06
     Year         (Intercept) e        e-01
    Number of obs: 112, groups: StandYear, 112; Stand, 25; Municipality, 21; ForestCentre, 11; Year, 7

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                             <2e-16 ***

    > anova(Fit1, Fit2)
        Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
    Fit
    Fit                                        **
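The Chisq and Pr(>Chisq) columns of such an anova table come from a likelihood-ratio comparison, and AIC is derived from the same log-likelihood. The sketch below reproduces the arithmetic with two plain Poisson GLMs on simulated data. This is a stand-in chosen so that the example runs with base R only; it is not the mixed models above, and the year effects are invented.

```r
set.seed(5)
year <- factor(rep(2001:2007, each = 16))            # 7 years x 16 = 112 observations
year_effect <- c(0, 0.4, -0.3, 0.6, 0.1, 0, 0.5)[as.integer(year)]
y <- rpois(length(year), lambda = exp(3 + year_effect))

m0 <- glm(y ~ 1, family = poisson)      # no year effect
m1 <- glm(y ~ year, family = poisson)   # year as a fixed effect

# Likelihood-ratio statistic and its degrees of freedom
chisq <- as.numeric(2 * (logLik(m1) - logLik(m0)))
df_diff <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
p_value <- pchisq(chisq, df = df_diff, lower.tail = FALSE)

# AIC = -2 * logLik + 2 * number of estimated parameters
aic_manual <- -2 * as.numeric(logLik(m1)) + 2 * attr(logLik(m1), "df")

c(Chisq = chisq, Df = df_diff, p = p_value, AIC_m1 = aic_manual)
```

The chi-square reference distribution is only approximate when the models differ in their random effects, so p-values from such comparisons are best read with some caution.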

28 Thank you