Clustering of Quality of Life Items around Latent Variables


1 Clustering of Quality of Life Items around Latent Variables
Jean-Benoit Hardouin
Laboratory of Biostatistics, University of Nantes, France
Perugia, September 8th, 2006

2 Context
How can we select variables whose responses depend on the same latent variable?
In the field of quality of life (QoL): how can we identify the "items" (binary or ordinal variables) whose responses depend on the same "dimension" of quality of life?

3 Methods used in QoL to determine sets of items
Factor analysis
  Results obtained after a rotation (e.g. Varimax) that assigns each item to only one latent variable (factorial axis)
Item Response Theory (IRT)
  Mokken Scale Procedure (MSP): based on the coherence of the responses to the items (Hemker et al., 1995)
  HCA/CCPROX: based on the matrix of covariances between the items conditional on the latent variable (Roussos et al., 1998)
  Raschfit: based on the fit of the data to a parametric model (e.g. the Rasch model) (Hardouin et al., 2003)

4 Clustering around latent variables (CLV)
Vigneau & Qannari (2003)
Method developed in the field of sensometrics, for quantitative variables (scores given by the members of a jury)
Based on a Hierarchical Cluster Analysis (HCA) of the variables, followed by a consolidation phase
Reference: Vigneau & Qannari, Communications in Statistics: Simulation and Computation, 2003

5 CLV: algorithm of the HCA
At each step, a partition of the variables is defined
At the initial step, each variable forms its own cluster
For each cluster of variables, we denote by λ1:
  the variance of the variable, if the cluster contains only one variable
  the first (largest) eigenvalue of the covariance matrix, if the cluster contains several variables
The T criterion is the sum of all the λ1 and represents the variance explained by the partition
At each step, the two clusters that are merged are those that maximize the T criterion among all possible groupings of two clusters
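The hierarchical step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SAS macro or the Stata module: the function names `lambda1` and `clv_hca` are hypothetical, the search over pairs is a brute-force loop, and the consolidation phase of CLV is not included.

```python
import numpy as np
from itertools import combinations


def lambda1(cov, cluster):
    """Variance of a lone variable, or largest eigenvalue of the cluster's covariance block."""
    idx = list(cluster)
    if len(idx) == 1:
        return float(cov[idx[0], idx[0]])
    block = cov[np.ix_(idx, idx)]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    return float(np.linalg.eigvalsh(block)[-1])


def clv_hca(data):
    """Merge clusters of variables (columns of `data`) so that T stays as large as possible.

    Returns a list of (T, partition) pairs, one per merge.
    """
    cov = np.cov(data, rowvar=False)
    clusters = [(j,) for j in range(cov.shape[0])]  # initial step: one cluster per variable
    history = []
    while len(clusters) > 1:
        best = None
        # Try every grouping of two clusters and keep the one with the largest T
        for a, b in combinations(range(len(clusters)), 2):
            merged = clusters[a] + clusters[b]
            others = [c for k, c in enumerate(clusters) if k not in (a, b)]
            t = lambda1(cov, merged) + sum(lambda1(cov, c) for c in others)
            if best is None or t > best[0]:
                best = (t, a, b, merged)
        t, a, b, merged = best
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((t, [tuple(sorted(c)) for c in clusters]))
    return history
```

Calling `clv_hca(data)` on an individuals-by-items array returns the value of T and the partition after each merge, which is the information a dendrogram of unexplained variance is drawn from.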

6 CLV: Dendrogram
[Figure: dendrogram "Clustering around Latent Variables (CLV)"; axes: Variables (leaf order: itema3, itema1, itema2, itema4, itema5, itemb5, itemb1, itemb2, itemb3, itemb4) and % Unexplained Variance]

7 CLV with binary and ordinal variables
Is CLV a suitable method for binary variables?
  CLV is based on the covariance matrix of the variables
  How should the covariance between two binary or ordinal variables be interpreted?
  Nevertheless, analyses of the covariance matrix are very common in the field of QoL
Is it useful to adapt CLV to this kind of variable with more rigorous indexes such as the tetrachoric and polychoric correlations?

8 Is CLV a good method with binary data? Is it useful to adapt CLV?
Simulation study 1: is CLV able to find the real partition of binary items?
  100 replications of each case
  2 latent variables; 500 individuals; 7 items related to each latent variable
  Simulation of the responses to the items with a Rasch model:
$$P\left(X_{nj}^{(q)} = x \mid \theta_{nq}, \delta_j\right) = \frac{\exp\left[x\left(\theta_{nq} - \delta_j\right)\right]}{1 + \exp\left[\theta_{nq} - \delta_j\right]}, \quad x = 0, 1$$
  θ_nq: value of the qth latent variable for the nth individual
  δ_j: difficulty of item j
  The correlation between the two latent variables is expected to influence the results
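One replication of this design could be simulated roughly as follows. The slide gives the sample size, the number of items and the Rasch model, but not the distribution of the latent variables or the item difficulties, so the standard bivariate normal traits and the evenly spaced difficulties on [-2, 2] below are assumptions, and `simulate_rasch` is a hypothetical name.

```python
import numpy as np


def simulate_rasch(n_individuals=500, items_per_trait=7, rho=0.4, seed=0):
    """Simulate binary responses to 2 x items_per_trait Rasch items on two correlated traits."""
    rng = np.random.default_rng(seed)
    # Assumption: latent variables are standard normal with correlation rho
    corr = np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal(np.zeros(2), corr, size=n_individuals)
    # Assumption: difficulties evenly spaced on [-2, 2] within each set of items
    delta = np.tile(np.linspace(-2.0, 2.0, items_per_trait), 2)
    # trait[j] is the index q of the latent variable that item j depends on
    trait = np.repeat([0, 1], items_per_trait)
    # Rasch model: P(X = 1) = exp(theta - delta) / (1 + exp(theta - delta))
    prob = 1.0 / (1.0 + np.exp(-(theta[:, trait] - delta)))
    responses = (rng.random(prob.shape) < prob).astype(int)
    return responses, trait
```

Repeating the call with different seeds and values of `rho` gives the kind of replicated data sets on which the recovered partition can be compared with `trait`.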

9 Results: rate of simulations in which the simulated partition is recovered
[Figure: bar chart of the success rate (0% to 100%) by correlation between the latent variables (rho=0, rho=0.2, rho=0.4, rho=0.6, rho=0.8), for classical CLV and CLV with polychoric correlations. Bar labels, in order: 100%, 100%, 100%, 98%, 89%, 90%, 59%, 58%, 13%, 15%]

10 Is CLV a good method? Is it useful to adapt CLV?
Is CLV a good method to find the real partition of binary items?
  Yes, in particular if the latent variables are weakly correlated (<0.3): close to 100% success
  Yes, if the latent variables are moderately correlated (between 0.3 and 0.6): 60 to 90% success
  No, if the latent variables are strongly correlated (>0.6)
Is it useful to adapt CLV (by using polychoric correlations)?
  The results suggest NO
  The polychoric version is (much) more computationally expensive

11 Is CLV a better method than existing methods?
Simulation study 2
  Comparison of (classical) CLV with HCA/CCPROX, MSP and Raschfit
  500 individuals
  800 replications
Main factors influencing the results of the existing methods:
  Correlation between the latent variables
  Discriminating power of the items

12 Simulations
Simulation of the data with a Birnbaum model:
$$P\left(X_{nj}^{(q)} = x \mid \theta_{nq}, \delta_j\right) = \frac{\exp\left[\alpha_j x\left(\theta_{nq} - \delta_j\right)\right]}{1 + \exp\left[\alpha_j\left(\theta_{nq} - \delta_j\right)\right]}$$
Discriminating powers of the items (α_j): drawn at random from a Gaussian distribution with mean µ and variance σ² = 0.2² = 0.04
  Case 1: µ = 0.5 (weak)
  Case 2: µ = 1 (medium)
  Case 3: µ = 2 (strong)
[Figure: curves for weak (0.5), medium (1) and strong (2) discriminating power]
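The effect of the discriminating power can be checked numerically with a short sketch of the Birnbaum (two-parameter logistic) response probability. The function names `birnbaum_prob` and `draw_discriminations` are hypothetical, and the example values of θ and δ are chosen only for illustration.

```python
import numpy as np


def birnbaum_prob(theta, delta, alpha):
    """P(X = 1 | theta, delta) under the Birnbaum model with discriminating power alpha."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - delta)))


def draw_discriminations(n_items, mu, sigma=0.2, seed=0):
    """Draw discriminating powers from a Gaussian with mean mu and variance sigma**2."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mu, scale=sigma, size=n_items)


# Probability of a positive response for an individual one unit above the item difficulty,
# for the three mean discriminating powers used in the simulations
for label, mu in [("weak", 0.5), ("medium", 1.0), ("strong", 2.0)]:
    print(label, birnbaum_prob(theta=1.0, delta=0.0, alpha=mu))
```

The larger α_j is, the faster the probability moves away from 0.5 as θ departs from δ_j, which is why items with weak discriminating power carry little information about the latent variable.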

13 Simulations
[Table: for each scenario, the number of items whose response depends on the first latent variable, the number of items whose response depends on the second latent variable, and the correlation between the 2 latent traits (ρ). Rows: Scenario A (unidimensional case); Scenarios B, C, D and E (bidimensional cases)]

14 Results: case 3 (µ=2, strong)
[Figure: results for HCA/CCPROX, MSP, Raschfit and CLV; panels: unidimensional case, and bidimensional cases with r=0.0, r=0.2, r=0.4, r=0.6]

15 Results: case 2 (µ=1, medium)
[Figure: results for HCA/CCPROX, MSP, Raschfit and CLV; panels: unidimensional case, and bidimensional cases with r=0.0, r=0.2, r=0.4, r=0.6]

16 Results: case 1 (µ=.5, weak)
[Figure: results for HCA/CCPROX, MSP, Raschfit and CLV; panels: unidimensional case, and bidimensional cases with r=0.0, r=0.2, r=0.4, r=0.6]

17 Quality of the methods

18 Quality of the methods
Area where these methods can be used in practice

19 Quality of the methods
Difficulty in interpreting the latent traits (confounding area)

20 Quality of the methods
Items that do not discriminate between individuals (useless items)

21 Real data: HADS
Hospital Anxiety and Depression Scale
French version
14 items with 4 response categories
  Even-numbered questions: depression
  Odd-numbered questions: anxiety
392 "old" inhabitants of Orléans (France)
Missing data: 0.3%

22 Dendrogram
[Figure: dendrogram "Clustering around Latent Variables (CLV)" of the HADS items; axes: Variables (had1 to had14) and % Unexplained Variance]

23 Dendrogram
[Figure: dendrogram "Clustering around Latent Variables (CLV)" of the HADS items, with clusters labelled ANXIETY and DEPRESSION; axes: Variables (had1 to had14) and % Unexplained Variance]

24 Dendrogram
[Figure: dendrogram "Clustering around Latent Variables (CLV)" of the HADS items, with clusters labelled ANXIETY, ???? and DEPRESSION; axes: Variables (had1 to had14) and % Unexplained Variance]

25 Problematic items
Items 7 & 11: 2 items intended to measure anxiety
Other references:
  Friedman et al.: three-dimensional questionnaire (items 1, 7 and 11: psychomotor agitation)
  Caci et al.: items 7, 11 and 14 are problematic, "the anxiety score must be taken with caution"

26 CLV: programs
SAS, macro %clv (Vigneau & Qannari)
Stata, module clv (Hardouin)
  Polychoric option

27 Conclusion about CLV
Easy to interpret (dendrogram, T criterion)
Fast to run (compared to Raschfit!)
Interesting results with binary data
From a practical point of view, it is worth comparing the results obtained with several methods
Work in progress:
  How to stop the HCA algorithm automatically (how to find a good index)?
  Application to ordinal variables