Defining models using equations...


A Course in Statistical Modelling
August 27, 28 and 29, 2014
Session 03: Defining models and test selection
Graeme Hutcheson
Manchester Institute of Education, University of Manchester

Defining models using equations...

The first stage of our system of analysis is to represent our research questions using a standard format. The format we will use is based on equations that can be simply constructed to represent a whole range of analyses and research designs. In these equations the ~ symbol separates the response variable (on the left) from the explanatory variables (on the right) and can be read as "is related to".

These equations can be used to represent simple research questions. For example, whether scores in a mathematics test are related to gender...

  mathematics test score ~ gender

or whether success at a particular task is related to age...

  success ~ age

The equations can include multiple explanatory variables. For example, whether the seriousness of injuries inflicted on cyclists is related to the wearing of helmets, the cyclist's age and the type of vehicle involved in the accident...

  injury ~ helmet use + age + type of vehicle

or whether the starting salary of new employees is related to their gender, age and ethnicity...

  salary ~ gender + age + ethnicity

More complex relationships, including interactions, can also be represented using this equation notation. For example, the combination of gender and age may be related to the wages a company pays an employee (we suspect that older males get paid the most). This combination can be included in the model as an interaction term...

  wages ~ age + gender + age:gender

which can be written more succinctly as...

  wages ~ age*gender

Note: interactions
  age:gender represents an interaction term (the combination of age and gender).
  age*gender represents the interaction of age and gender together with the main effects of age and gender.

When an interaction term is entered into a model, the lower-order terms also need to be included if the model parameters are to make sense. A full explanation of this, including a demonstration, is provided in...

  Hutcheson, G. D. (2013). Models with interactions: test type and misinterpretation. Journal of Modelling in Management, 8, 1.
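The shorthand described above can be sketched in a few lines of Python. This is only an illustration of the notation, not part of the course software: the function name expand_formula is hypothetical, and the sketch assumes that each * joins exactly two variables.

```python
def expand_formula(formula):
    """Expand each `a*b` term into `a + b + a:b` (two-variable products only)."""
    response, rhs = (part.strip() for part in formula.split("~", 1))
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if "*" in term:
            left, right = (v.strip() for v in term.split("*", 1))
            # Main effects first, then the interaction term.
            candidates = [left, right, f"{left}:{right}"]
        else:
            candidates = [term]
        for candidate in candidates:
            if candidate not in terms:  # skip terms that are already listed
                terms.append(candidate)
    return f"{response} ~ " + " + ".join(terms)


print(expand_formula("wages ~ age*gender"))
# prints: wages ~ age + gender + age:gender
```

Formulas that contain no * operator, such as salary ~ gender + age + ethnicity, pass through unchanged.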

Note: Wilkinson and Rogers model notation
The notation used to represent models (~, +, *, :, etc.) is based on a scheme proposed by Wilkinson and Rogers (1973). This notation is also used by the statistical software used in this course. The Wilkinson and Rogers notation is comprehensive; however, in order to complete this course, we just need to know the +, * and : operators. A useful discussion of Wilkinson and Rogers notation is provided online at...

The equation notation can also represent the experimental design of the study. For example, the following shows an independent groups design...

  School A        School B        School C
  child01 = 56    child06 = 72    child11 = 61
  child02 = 71    child07 = 69    child12 = 52
  child03 = 59    child08 = 75    child13 = 43
  child04 = 62    child09 = 61    child14 = 60
  child05 = 68    child10 = 76    child15 = 62

which is represented using the equation...

  Score ~ School

Other experimental designs can also be represented using equation notation. For example, a dependent groups (or matched groups) design...

            Mathematics   Physics
  child01       58           61
  child02       72           69
  child03       40           53
  child04       62           65
  child05       48           62

which is represented using the equation...

  Score ~ Academic Subject + Child

Note: How to structure data
The pictorial representations given above illustrate the research designs, but do not show how the data should be entered in the spreadsheet. For example, should the data for the independent groups design be entered as three columns, or using some other structure? Although this is crucial for running analyses, it is not always immediately obvious how data should be structured.

The GLM representation of the model is useful here as it shows exactly how the data should be structured. The equation simply identifies each variable in the analysis, each of which is recorded in its own column...
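The rule that every variable named in the equation gets its own column can be sketched as a small Python helper. The function name columns_for is hypothetical, and the sketch assumes the formula uses only the ~, +, * and : operators covered in this course.

```python
import re


def columns_for(formula):
    """List the data columns a model equation needs: the response first,
    then each explanatory variable once."""
    response, rhs = formula.split("~", 1)
    columns = [response.strip()]
    # Variables on the right-hand side are separated by +, * or :
    for name in re.split(r"[+*:]", rhs):
        name = name.strip()
        if name and name not in columns:
            columns.append(name)
    return columns


print(columns_for("Score ~ School"))
print(columns_for("Score ~ Academic Subject + Child"))
print(columns_for("wages ~ age*gender"))
```

An interaction such as age*gender does not add a column of its own; it only reuses the age and gender columns.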

The independent groups design, Score ~ School, is therefore recorded as two columns...

  Score   School
  56      schoola
  71      schoola
  59      schoola
  62      schoola
  68      schoola
  72      schoolb
  69      schoolb
  ...     ...
  62      schoolc

and the dependent groups design, Score ~ Academic Subject + Child, is recorded as three columns...

  Score   Subject       Child
  58      mathematics   child01
  72      mathematics   child02
  40      mathematics   child03
  62      mathematics   child04
  48      mathematics   child05
  61      physics       child01
  69      physics       child02
  53      physics       child03
  65      physics       child04
  62      physics       child05
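The restructuring of the independent groups data can be sketched in Python as follows. The scores are those of child01 to child15 from the design shown earlier; assigning five consecutive children to each school is an assumption based on the two-column listing, and the names wide and to_long are hypothetical.

```python
# One list of scores per school (the pictorial, "wide" layout).
# Assumption: child01-05 attend school A, child06-10 school B, child11-15 school C.
wide = {
    "schoola": [56, 71, 59, 62, 68],
    "schoolb": [72, 69, 75, 61, 76],
    "schoolc": [61, 52, 43, 60, 62],
}


def to_long(wide_table):
    """Return one (score, school) row per child, i.e. one column per variable."""
    return [
        (score, school)
        for school, scores in wide_table.items()
        for score in scores
    ]


long_rows = to_long(wide)
print("Score School")
for score, school in long_rows:
    print(score, school)
```

Each row now describes one child, and each column holds one variable from Score ~ School.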

A detailed discussion of data structures is provided in...

  Hutcheson, G. D. (2011). Data Set Structure. In L. Moutinho and G. D. Hutcheson, The SAGE Dictionary of Quantitative Management Research.

a copy of which is provided as part of the additional reading for this course.

More complex experimental and non-experimental designs which include control groups and additional covariates can also be represented using the equation notation. Examples of representing these research designs and complex interactions using the equation format are included in the exercises provided below...

Information given in contingency tables can also be represented in equation format. For example, job satisfaction and occupation...

             Satisfied   Unsatisfied
  Doctor        ...          ...
  Dentist        7           19
  Engineer      ...          ...

In order to determine whether type of occupation is related to satisfaction, we need to model the cell counts. This analysis can be represented using the following equation, which results in the same output as a standard chi-square analysis...

  cell count ~ occupation + satisfaction + occupation:satisfaction

or more succinctly as...

  cell count ~ occupation*satisfaction

A more direct way to analyse the data above is to predict satisfaction (a categorical variable) using occupation as an explanatory variable. This analysis is equivalent to the one above and can be represented using the equation...

  satisfied (Y/N) ~ occupation

Data structure: In order to run these two analyses, the data will need to be structured differently: three columns for the analysis of cell count (cell count, occupation, satisfaction) and two columns for the analysis of satisfaction (satisfied, occupation).
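The two data structures can be sketched in Python as follows. Only the dentist counts (7 and 19) are taken from the table; the doctor and engineer counts are made-up placeholders used purely for illustration, and the names cell_counts and to_cases are hypothetical.

```python
# Three-column structure for: cell count ~ occupation*satisfaction
# (one row per cell of the contingency table).
cell_counts = [
    (12, "doctor", "satisfied"),      # hypothetical count
    (8, "doctor", "unsatisfied"),     # hypothetical count
    (7, "dentist", "satisfied"),      # from the table
    (19, "dentist", "unsatisfied"),   # from the table
    (15, "engineer", "satisfied"),    # hypothetical count
    (5, "engineer", "unsatisfied"),   # hypothetical count
]


def to_cases(rows):
    """Expand (count, occupation, satisfaction) rows into one
    (satisfied, occupation) row per person."""
    cases = []
    for count, occupation, satisfaction in rows:
        satisfied = "Y" if satisfaction == "satisfied" else "N"
        cases.extend([(satisfied, occupation)] * count)
    return cases


# Two-column structure for: satisfied (Y/N) ~ occupation
cases = to_cases(cell_counts)
print(len(cell_counts), "rows in the cell-count structure")
print(len(cases), "rows in the case-level structure")
```

The cell-count structure always has one row per cell, whereas the case-level structure has one row per respondent.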

Although the equation format can look complicated, representing research in this way has enormous advantages...

  It provides a standard way of representing research which highlights common underlying analytical processes.
  It can accommodate many different research designs and types of data.
  It identifies the analytical technique that can be applied.
  It identifies the structure and format of the data.

Representing research questions in the form of equations is an important skill for analysts and a central part of this course.

Statistical analysis...

Generalized linear models (GLMs) are used in this course as they provide a powerful and theoretically consistent approach to analysis. The advantages of using GLMs are particularly clear when they are applied as an integral part of the system of analysis described in this course.

The Generalized Linear Model
The basic idea behind the GLM is simple. A particular variable (the response variable) may be predicted using other variables (the explanatory variables). This can be represented using an equation...

  variable Y ~ variable X + variable Z

or, more generally, as...

  Y ~ X_i + X_j + ... + X_n

In a GLM the response and explanatory variables (also known as the random and systematic components) are linked (~) according to a function that takes account of the measurement scale of the response variable. In practice, this means that if we know the scale of measurement of the response variable, we are able to select an appropriate link function and the appropriate analytical technique. Although GLMs offer many links to take account of the different ways in which response variables can be distributed, this course introduces three links to deal with continuous, categorical and count response variables.

Measurement Scales
Measurement scales are an important part of GLMs and readers who are unfamiliar with them are directed to the following references, which deal with the theory of measurement scales in detail...

  Hutcheson, G. D. and Moutinho, L. (2008). Statistical Modelling for Management. Sage Publications.
  Hutcheson, G. D. (2011). Measurement scales. In Moutinho, L. and Hutcheson, G. D. (editors), The Sage Dictionary of Quantitative Management Research. Sage Publications.
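The linking of the random and systematic components is usually written as follows in the GLM literature. The symbols g (the link function), mu (the expected value of the response) and beta (the model parameters) do not appear in the slides and are introduced here as assumptions, together with the usual definitions of the identity, logit and log links.

```latex
g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n,
\qquad \mu = \mathrm{E}[Y]

\begin{aligned}
\text{identity link:} \quad & g(\mu) = \mu \\
\text{logit link:}    \quad & g(\mu) = \log\frac{\mu}{1 - \mu} \\
\text{log link:}      \quad & g(\mu) = \log\mu
\end{aligned}
```

The logit is shown in its simplest binary form, where mu is a probability. Models for ordered and unordered categories apply the same idea to cumulative probabilities or to probabilities relative to a baseline category.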

GLM techniques
The following link functions and techniques may be applied to continuous, categorical and count data...

  Distribution of response variable   Link function   Technique
  continuous                          identity        OLS regression
  ordered categorical                 logit           Proportional odds
  unordered categorical               logit           Multinomial
  count                               log             Poisson regression

Detailed examples of all these techniques are provided later in this course.

Traditional tests and equivalent GLM models
The GLM models reproduce or replace many of the traditional tests. For example, tests for independent group designs...

  One independent variable (GLM: Y ~ X)
    t-test (unrelated)
    Mann-Whitney
    1-way ANOVA (unrelated)
    Kruskal-Wallis
    Jonckheere trend
    chi-square (contingency table)
    etc.

  Multiple independent variables (GLM: Y ~ X1 + X2)
    complex selection of multi-way ANOVA models
    multi-way contingency tables (log-linear)
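The table of link functions and techniques amounts to a simple lookup, which can be sketched in Python as follows. The names GLM_TECHNIQUES and select_technique are hypothetical; the entries are copied from the table.

```python
# Response scale -> (link function, technique), as listed in the table.
GLM_TECHNIQUES = {
    "continuous": ("identity", "OLS regression"),
    "ordered categorical": ("logit", "proportional odds"),
    "unordered categorical": ("logit", "multinomial"),
    "count": ("log", "Poisson regression"),
}


def select_technique(response_scale):
    """Return the (link, technique) pair for a given response scale."""
    key = response_scale.strip().lower()
    if key not in GLM_TECHNIQUES:
        raise ValueError(f"Unknown response scale: {response_scale!r}")
    return GLM_TECHNIQUES[key]


link, technique = select_technique("count")
print(link, "->", technique)
# prints: log -> Poisson regression
```

The only input needed is the measurement scale of the response variable; the explanatory variables do not affect the choice.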

and tests for dependent (or matched) group designs...

  One independent variable (GLM: Y ~ subject + X)
    paired t-test
    Wilcoxon
    1-way ANOVA (related)
    Friedman
    Page's L trend
    etc.

  Multiple independent variables (GLM: Y ~ subject + X1 + X2)
    complex selection of multi-way ANOVA models
    multi-way contingency tables (log-linear)

In order to realise the power of the GLMs, it is important to understand the equivalence of the traditional tests and the GLMs. Detailed information about this is provided in...

  Hutcheson, G. D. and Schaefer, L. (2012). Test selection in the 21st century. Journal of Modelling in Management, 7, 3 (emeraldinsight.com/products/journals/journals.htm?id=jm2).
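The only difference between the independent and dependent group equations is the extra term that identifies the subject. A minimal Python sketch of this, using a hypothetical helper called build_formula, is:

```python
def build_formula(response, explanatory, subject=None):
    """Build a model equation; pass `subject` for dependent (matched) designs."""
    # For matched designs the subject identifier is entered as an extra term.
    terms = ([subject] if subject else []) + list(explanatory)
    return f"{response} ~ " + " + ".join(terms)


print(build_formula("Y", ["X"]))                           # independent groups
print(build_formula("Y", ["X1", "X2"]))                    # independent groups
print(build_formula("Y", ["X"], subject="subject"))        # dependent groups
print(build_formula("Y", ["X1", "X2"], subject="subject")) # dependent groups
```

The printed equations match the four GLM forms listed for the independent and dependent group designs.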

The usefulness of the equation format for representing statistical models is evident when the analyses are run in the software.

In order to select the appropriate GLM...

1. Define the equation
2. Identify the scale of measurement for the response variable
3. Select the appropriate model in the Rcmdr:
   Generalized linear model... for continuous and count responses
   Multinomial logit model... for unordered categorical responses
   Ordinal regression model... for ordered categorical responses
4. Input the equation
5. Press OK

Exercises... defining models and identifying statistical techniques

Defining the model and identifying an appropriate statistical technique and data structure for your research are crucial skills in data analysis. The following exercises provide a number of research scenarios. The task here is to...

  represent your research questions in equation format
  identify how the dataset should be structured
  identify an appropriate statistical technique to analyse the data

Scenario 1
A standardized test in mental arithmetic is given to 100 classrooms in England and Singapore (a balanced sample of classrooms was randomly selected from each country). You also have information about the amount of formal teaching the children receive in mental arithmetic in each school (it is likely that this variable influences attainment and it is also likely that the amount is different for the two countries).

Predict performance in mental arithmetic using information about country and amount of formal teaching when...

  performance is recorded as a percentage
  performance is recorded as a grade (A to E)

Scenario 2
A social skills programme is applied to a sample of companies in an attempt to change the attitudes of the employees. In order to test whether the programme was successful, the following studies were run.

1. An independent groups study comparing attitudes with a control group post-treatment.
2. A matched groups study comparing attitudes with a control group post-treatment.
3. An independent groups study comparing attitudes with a control group pre- and post-treatment.
4. A dependent groups study comparing attitudes with a control group pre- and post-treatment.

Identify the statistical techniques when attitudes are recorded as ordered categorical (Likert scale) and unordered categorical (positive, negative, no answer).

Scenario 3
You are interested in modelling choice of school for pupils with different SES backgrounds, ethnicities and levels of parental engagement (ordered), from a number of schools with different league table scores (continuous). You suspect that all the above variables may affect school choice and that there may be an interaction between SES and parental engagement (you suspect higher SES to be positively associated with engagement). Provide a model to predict school choice and identify an appropriate statistical technique.

Scenario 4
The relationship between O-ring failure in the Space Shuttle and the temperature at launch.

1. When O-ring failure is recorded as a binary yes-no variable
2. When O-ring failure is recorded as a number of fails in each ring (0 to 5)
3. When O-ring failure is recorded as ordered categorical data (no, some, lots)