Principles of Psychometrics and Measurement Design


1 Principles of Psychometrics and Measurement Design. Questionmark Analytics, Austin Fossey. 2014 Users Conference, San Antonio, March 4th-7th.

2 Austin Fossey Reporting and Analytics Manager, Questionmark

3 Objectives Learning objectives: explain the differences between criterion, construct, and content validity; summarize a validity study; implement Toulmin's structure to support argument-based validity; summarize the concept of reliability and its relationship to validity; define the three parts of the conceptual assessment framework.

4 Introduction

5 Basic Terms Measurement: assigning scores/values based on a set of rules. Testing: a standardized procedure to collect information. Assessment: collecting information with an evaluative aspect. Psychometrics: the application of a probabilistic model to make an inference about an unobserved/latent construct. Construct: a hypothetical concept that the test developers define based on theory, existing literature, and prior empirical work.

6 Where do I calculate validity? Test developers work at the intersection of assessment tools & technology, a deep body of knowledge on best practices, and well-defined criteria for assessment quality.

7 Validity Uses and Inferences

8 Validity Survey Do you have validity studies built into your test development process? Do you use the results to improve your assessments? Do you report your findings to stakeholders? Do you have a plan in place if there is evidence that an assessment is not valid?

9 Defining Validity Validity refers to proper inferences and uses of assessment results (Bachman, 2005). This implies that the assessment itself is not what is valid; validity refers to how we interpret and use the assessment results. Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests (APA, NCME, & AERA, 1999).

10 Defining Validity A simple concept at first glance, but validity is a continually evolving concept, and there are disagreements about what is important and what needs to be validated (Sireci, 2013). It is easy for there to be a lack of alignment between validity theories, test development approaches and documentation, and informed decisions about the defensibility and suitability of results (Sireci, 2013).

11 Where do I calculate validity? Modern validity studies are typically research projects with both a quantitative and a qualitative element. Validity is no longer restricted to test scores, e.g., the Smarter Balanced Consortium's program validity (Sireci, 2013). The Standards identify five sources of validity evidence: 1. test content, 2. response process, 3. internal structure, 4. relations to other variables, and 5. testing consequences, all of which are integrated into a validity argument (APA, NCME, & AERA, 1999).

12 Validity Studies: Common Types of Validity

13 Validity and Reliability Reliability is a measure of consistency. It expresses how well our observed scores relate to the true scores (Crocker & Algina, 2008): $\rho_{XX'} = \sigma_T^2 / \sigma_X^2$, where $\sigma_T^2$ is the variance of the true scores and $\sigma_X^2$ is the variance of the observed scores. If our instrument is not reliable, our inferences are not valid; we cannot trust the scores. But just because an instrument is reliable does not mean our inferences are valid. We still must demonstrate that we measure what we intend to and draw the correct inferences from the results.
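To make the variance ratio concrete, here is a minimal simulation sketch in Python (not part of the presentation), assuming normally distributed true scores and independent measurement error; the reliability estimate is simply the true-score variance divided by the observed-score variance.

```python
import numpy as np

rng = np.random.default_rng(42)

n_examinees = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_examinees)  # sigma_T = 10
error = rng.normal(loc=0, scale=5, size=n_examinees)          # sigma_E = 5
observed_scores = true_scores + error                         # X = T + E

# Classical test theory reliability: true-score variance over observed-score variance
reliability = true_scores.var() / observed_scores.var()
print(f"Estimated reliability: {reliability:.3f}")  # ~ 100 / 125 = 0.80
```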

14 Criterion-Related Validation Demonstrate that assessment scores relate to a relevant criterion that is tied to the inferences or uses surrounding the assessment results. Concurrent: the relationship between the assessment scores and a criterion measure taken at the same time. Predictive: the relationship between the assessment scores and a criterion measure taken in the future.

15 Criterion-Related Validation Examples Concurrent: do scores on the written driver's license assessment correlate with performance in the on-the-road test taken the same day? Predictive: do SAT scores correlate with students' first-semester GPA in college?

16 Criterion-Related Validation Study Steps of a criterion-related validation study (Crocker & Algina, 2008): 1. Identify a suitable criterion behavior and a method for measuring it. 2. Identify a representative sample of participants. 3. Administer the assessment and record the scores. 4. Obtain a criterion measure from each participant in the sample when it becomes available. 5. Determine the strength of the relationship between the assessment scores and the criterion measures.
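As a rough illustration of step 5 (not part of the original slides), the strength of the relationship is often summarized with a Pearson correlation, reported as the validity coefficient. The paired scores below are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired data: assessment scores and a criterion measure
# collected later (e.g., supervisor performance ratings).
assessment_scores = np.array([62, 71, 55, 80, 68, 90, 47, 73, 66, 84])
criterion_measure = np.array([3.1, 3.8, 2.6, 4.2, 3.5, 4.6, 2.2, 3.9, 3.3, 4.4])

validity_coefficient, p_value = pearsonr(assessment_scores, criterion_measure)
print(f"Criterion-related validity coefficient: {validity_coefficient:.2f} (p = {p_value:.3f})")
```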

17 Criterion-Related Validation in Practice Criterion problem: the criterion of interest may be a very complex construct (e.g., teaching effectiveness) and may require in-depth, ongoing measures of the criterion to validate the assessment results. Sample size: small sample sizes will not yield accurate validity coefficients; the study may need to collect research from studies of similar predictors as evidence of criterion-related validity. Criterion contamination: assessment scores affect criterion measures (dependence). Restriction of range: some measures are systematically missing from the criterion (e.g., only hired applicants have job performance data), which attenuates the validity coefficient; a correction is sketched below. From Crocker & Algina, 2008.
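The slides do not prescribe a remedy for restriction of range, but one commonly used option is Thorndike's Case II correction, which estimates what the validity coefficient would be if the full range of predictor scores were available. A minimal sketch, assuming direct restriction on the predictor and hypothetical standard deviations:

```python
import math

def correct_restriction_of_range(r_restricted: float,
                                 sd_restricted: float,
                                 sd_unrestricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the predictor."""
    ratio = sd_unrestricted / sd_restricted
    return (r_restricted * ratio) / math.sqrt(
        1 - r_restricted**2 + (r_restricted**2) * ratio**2
    )

# Example: observed r = .30 among hired applicants whose score SD is 5,
# while the full applicant pool has an SD of 10.
print(round(correct_restriction_of_range(0.30, 5.0, 10.0), 2))  # ~0.53
```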

18 Reporting Criterion-Related Results Report statistics related to the relation between the assessment scores and the criterion measure. Report standard errors of measurement and reliability coefficients for the assessment and the criterion (if appropriate). Visualize the relation with an expectancy table (Crocker & Algina, 2008).

19 Criterion-Related Expectancy Table An expectancy table cross-tabulates assessment score ranges against criterion outcomes: each score-range row reports the percentage of applicants hired above entry level, hired at entry level, and not hired, along with the number of applicants in that range. Adapted from Crocker & Algina, 2008.
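A rough sketch of how such an expectancy table could be assembled with pandas; the score bins, records, and outcome labels below are hypothetical, not the values from Crocker & Algina.

```python
import pandas as pd

# Hypothetical applicant records: assessment score and hiring outcome.
df = pd.DataFrame({
    "score": [92, 88, 75, 81, 64, 59, 95, 70, 83, 48, 77, 66],
    "outcome": ["Above Entry", "Entry", "Entry", "Above Entry", "Not Hired", "Not Hired",
                "Above Entry", "Entry", "Entry", "Not Hired", "Entry", "Not Hired"],
})

# Bin scores into ranges, then cross-tabulate against outcomes as row percentages.
df["score_range"] = pd.cut(df["score"], bins=[0, 60, 70, 80, 90, 100])
expectancy = pd.crosstab(df["score_range"], df["outcome"], normalize="index").round(2) * 100
expectancy["n_applicants"] = df.groupby("score_range", observed=False).size()
print(expectancy)
```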

20 Content Validity Demonstrate that items adequately represent the construct being measured. This requires that the construct be defined with a set of learning objectives or tasks, such as those determined in a Job Task Analysis study. Content validity studies take place after the assessment is constructed. The study should use a set of subject matter experts who are independent from those who wrote the items and constructed the forms.

21 Content Validation Study Steps of a content validation study (Crocker & Algina, 2008): 1. Define the construct or performance domain (e.g., job task analysis, cognitive task analysis). 2. Recruit an independent panel of subject matter experts. 3. Provide the panel with a structured framework and documented instructions for the process of matching items to the construct. 4. Collect, summarize, and report the results.

22 Content Validation in Practice Items can be weighted by importance to determine representation on the assessment (e.g., JTA results); if this is done, it requires a specific definition of importance. The process for matching items to objectives needs to be defined in advance. Reviewers also need to know which aspects of an item are supposed to be matched to objectives. The study may be flawed if the objectives do not properly represent the construct. From Crocker & Algina, 2008.

23 Reporting Content Validation Results Percentage of items matched to objectives (Crocker & Algina, 2008); percentage of items matched to high-importance objectives (Crocker & Algina, 2008); percentage of objectives not assessed by any items (Crocker & Algina, 2008); correlation between objectives' importance ratings and the number of items matched to those objectives (Klein & Kosecoff, 1975); index of item-objective congruence (Rovinelli & Hambleton, 1977).
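A small worked sketch of the first few reporting statistics (not from the presentation); the item-to-objective matches and importance ratings are hypothetical.

```python
import numpy as np

# Hypothetical content-validation data: each item is matched by the SME panel
# to at most one objective; objectives carry importance ratings from a JTA.
item_to_objective = {"item1": "obj1", "item2": "obj1", "item3": "obj2",
                     "item4": None, "item5": "obj3", "item6": "obj2"}
objective_importance = {"obj1": 5, "obj2": 3, "obj3": 4, "obj4": 2}

matched = [obj for obj in item_to_objective.values() if obj is not None]
pct_items_matched = 100 * len(matched) / len(item_to_objective)

unassessed = [obj for obj in objective_importance if obj not in matched]
pct_objectives_unassessed = 100 * len(unassessed) / len(objective_importance)

# Correlation between objective importance and number of matched items (Klein & Kosecoff, 1975)
counts = [matched.count(obj) for obj in objective_importance]
importance = list(objective_importance.values())
r = np.corrcoef(importance, counts)[0, 1]

print(f"{pct_items_matched:.0f}% of items matched to an objective")
print(f"{pct_objectives_unassessed:.0f}% of objectives not assessed by any item")
print(f"Importance vs. item-count correlation: {r:.2f}")
```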

24 Index of Item-Objective Congruence Assumes that each item should measure one and only one objective. Raters score an item with +1 if it matches the objective, 0 if there is uncertainty, and -1 if it does not match the objective (Rovinelli & Hambleton, 1977). $I_{ik} = \frac{N}{2N-2}(\mu_k - \mu)$, where $I_{ik}$ is the index of item-objective congruence for item $i$ on objective $k$, $N$ is the number of objectives, $\mu_k$ is the mean rating of item $i$ on objective $k$, and $\mu$ is the mean rating of item $i$ across all objectives.
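A direct implementation of the index as defined above; the rater data are hypothetical.

```python
import numpy as np

def item_objective_congruence(ratings: np.ndarray, k: int) -> float:
    """Index of item-objective congruence for one item (Rovinelli & Hambleton, 1977).

    ratings: array of shape (n_raters, n_objectives) with values in {-1, 0, +1}
             for a single item; k is the index of the objective of interest.
    """
    n_objectives = ratings.shape[1]
    mu_k = ratings[:, k].mean()  # mean rating of the item on objective k
    mu = ratings.mean()          # mean rating of the item across all objectives
    return n_objectives / (2 * n_objectives - 2) * (mu_k - mu)

# Example: 3 raters, 4 objectives; the item is intended to measure objective 0.
ratings = np.array([[ 1, -1, -1,  0],
                    [ 1, -1,  0, -1],
                    [ 1, -1, -1, -1]])
print(round(item_objective_congruence(ratings, k=0), 2))  # close to 1.0 indicates congruence
```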

25 Construct Validity Uses assessment scores and supporting evidence to support a theory of a nomological network: How does a construct relate to observed (measurable) variables? How does a construct relate to other constructs, as represented by other observed variables? The slide shows a sample nomological network linking three constructs (Constructs 1-3) to observed variables (Observed A-G).

26 Construct Validation Study Steps of a construct validation study (Crocker & Algina, 2008): 1. Explicitly define a theory of how those who differ on the assessed construct will differ in terms of demographics, performance, or other validated constructs. 2. Administer an assessment whose items are specific, concrete manifestations of the construct. 3. Gather data for the other nodes in the nomological network to test the hypothesized relationships. 4. Determine whether the data are consistent with the original theory, and consider other possible conflicting theories (rebuttals).

27 Construct Validation in Practice Possibly one of the more difficult validity studies to complete; it can require a lot of data and research. Statistical approaches include multiple regression analysis or factor analysis, but one can also use correlations, as in the multi-trait/multi-method matrix. In experimental scenarios, it is difficult to diagnose why relationships are not found: bad theory, bad instrument, or both? From Crocker & Algina, 2008.

28 Reporting Construct Validity Results A common method for reporting construct validity is with a multi-trait multi-method matrix (Crocker & Algina, 2008). Measuring the same construct with different methods should yield similar results. In practice, the data may come from different studies (not ideal).

29 Multi-Trait Multi-Method Matrix The example matrix crosses three traits (Sex-Guilt, Hostility-Guilt, Morality-Conscience) with three methods (True-False, Forced Response, Incomplete Sentences). Reliability coefficients sit on the diagonal (e.g., .95 for Sex-Guilt measured with the True-False method), mono-trait/hetero-method correlations form the validity diagonals, and hetero-trait/mono-method and hetero-trait/hetero-method correlations fill the remaining blocks. From Mosher, 1968.
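A minimal sketch of assembling a multi-trait multi-method correlation matrix with pandas. The traits, method names, and simulated scores are hypothetical (not Mosher's data); the same trait measured by two methods shares a latent component, so the mono-trait/hetero-method correlations come out higher than the hetero-trait blocks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical scores: three traits (A, B, C), each measured by two methods.
latent = {trait: rng.normal(size=n) for trait in "ABC"}
data = {f"{trait}_{method}": latent[trait] + rng.normal(scale=0.7, size=n)
        for trait in "ABC" for method in ("survey", "interview")}

# The full correlation matrix of trait-method combinations is the MTMM matrix.
mtmm = pd.DataFrame(data).corr().round(2)
print(mtmm)  # inspect validity diagonals vs. hetero-trait blocks
```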

30 Argument-Based Validity Criterion, content, and construct validity are crucial aspects of assessment result validity, but how do we demonstrate the link to the inferences and uses of the assessment results? Argument-based validity (e.g., Kane, 1992) provides logic using Toulmin's structure of an argument to support claims about inferences. Bachman (2005) expands this to include validity arguments for use cases.

31 Example Toulmin Structure for Validity Inference Data: Mike got a failing score on his exam about making a sandwich. Claim: Mike cannot make a sandwich. Warrant (since): poor performance on the sandwich exam correlates with low performance at making sandwiches. Backing evidence (supports the warrant): a criterion validity study of sandwich exam scores and sandwich assembly performance at the sandwich shop. Rebuttal (unless): too many questions were about the bread, and Mike did not have sufficient opportunity to demonstrate knowledge of ingredients and layering standards. Rebuttal-rejecting evidence: a content validity study confirms that items are categorized correctly for the blueprint, the blueprint is based on the results of the JTA, and there were not too many questions about bread.

32 Argument-Based Validity for Use Cases Bachman (2005) defines four decision (use case) warrants that should be addressed with a validity argument for each use case associated with the assessment results: Is the interpretation of the score relevant to the decision being made? Is the interpretation of the score useful for the decision being made? Are the intended consequences of the assessment beneficial for the stakeholders? Does the assessment provide sufficient information for making the decision?

33 Argument-Based Validation Study Steps of an argument-based validation study (Chapelle, Enright, & Jamieson, 2010): 1. Identify the inferences, the warrants leading to these inferences, and the assumptions underlying the warrants. Document these inferences. 2. Identify or collect evidence backing the assumptions for the warrants. Document this evidence. 3. Identify or collect rebuttals, and document evidence supporting or refuting each rebuttal. Document this evidence with the evidence backing the assumptions.
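One hedged way to keep this documentation organized (not a structure prescribed by the presentation or by Chapelle et al.) is to record each inference as a small data structure; the sketch below encodes the sandwich example from the earlier slide.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rebuttal:
    statement: str
    counter_evidence: List[str] = field(default_factory=list)  # evidence refuting the rebuttal

@dataclass
class ValidityInference:
    data: str           # observed result (e.g., an exam score)
    claim: str          # inference drawn from the data
    warrant: str        # why the data supports the claim
    backing: List[str]  # evidence backing the warrant (e.g., validity studies)
    rebuttals: List[Rebuttal] = field(default_factory=list)

sandwich_inference = ValidityInference(
    data="Mike got a failing score on the sandwich-making exam",
    claim="Mike cannot make a sandwich",
    warrant="Poor exam performance correlates with poor sandwich-making performance",
    backing=["Criterion validity study of exam scores vs. on-the-job sandwich assembly"],
    rebuttals=[Rebuttal(
        statement="Too many questions were about the bread",
        counter_evidence=["Content validity study confirms items match the JTA-based blueprint"],
    )],
)
```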

34 Utility of Argument-Based Validity Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use (APA, NCME, & AERA, 1999). Argument-based validity forces us to look at and document the logical connections between classic validity studies and the real-world defensibility of our assessment (Sireci, 2013).

35 Utility of Argument-Based Validity By requiring test developers to research inferences and build the argument structure to support these inferences, we can avoid three common fallacies of validity studies: taking inferences and their assumptions as givens; making overly ambitious, unrealistic inferences; and making a claim of validity by selectively choosing evidence while glossing over evidence of weaknesses in the inferences. From Kane, 2006.

36 Evidence-Centered Design (ECD): A Principled Test Development Framework

37 Principled Test Development Frameworks Frameworks for how to connect assessment tools and practices to reach desired goals for assessment quality. They provide practical methods for implementing assessment design and development, guide test developers to make thoughtful, explicit decisions, improve the efficiency and effectiveness of item/task development, and typically support the documentation of evidence needed to support argument-based validity. They help meet increasing design-decision and granular-data needs while minimizing construct-irrelevant variance. From Ferrara, Nichols, & Lai, 2013.

38 Examples of Principled Test Development Frameworks Diagnostic Assessment Framework; Construct-Centered Framework; Evidence-Centered Design; Assessment Engineering; Principled Design for Efficacy. From Ferrara, Nichols, & Lai, 2013.

39 Principled Test Development Survey Do you use a principled test development framework? Have you wanted to try to implement a principled test development framework, but been deterred because it seems like too much work?

40 Evidence-Centered Design ECD is a framework for assessment development that is designed to create the evidence needed to support assessment inferences as the assessment is being built (e.g., Mislevy, 2011; Mislevy et al., 2012). It applies a broad range of assessment design resources (e.g., subject matter knowledge, software design, psychometrics, pedagogical knowledge) to the inferences, and it avoids the awkward situation of finding validity problems after the assessment has already been built.

41 ECD Process Domain Analysis: what is important about this domain (construct)? What work and situations are central to this domain? Domain Modeling: how do we represent the aspects from the domain analysis as assessment arguments? Conceptual Assessment Framework: design structures (the student model, evidence model, and task model). Assessment Implementation: building the assessment (item writing, scoring engines, statistical models). Assessment Delivery: participants interact with items/tasks; performance is evaluated, and results and feedback are reported. From Mislevy et al., 2012.

42 ECD Flexibility ECD is designed to be flexible enough to accommodate any assessment design: different construct modeling approaches; new item types and assessment formats with new technology; different scoring models, or combinations of scoring models; growing use of assessment scores and inferences. The ECD vocabulary and process aligns test development work across disciplines and documents how test development outcomes connect; the common vocabulary helps people understand what they are doing and why.

43 Example of ECD: IMMEX True Roots An educational game to measure cognitive behavior based on the sequence of participants' actions. It captures the sequence of responses in the game and classifies the sequence with an artificial neural network. Designed and reported with ECD (Stevens & Casillas, 2006). The slide shows the True Roots problem space (Cox Jr., Jordan, Cooper, & Stevens, 2004).

44 Conceptual Assessment Framework (CAF) ECD may be a lot to implement for every assessment, but the principles can still help guide our test development work. The CAF represents the keystone of ECD; this is how we begin to explain the intellectual leap from scores to inferences. The three parts of an assessment's CAF are the task model, the student model, and the evidence model.

45 Conceptual Assessment Framework (CAF) Diagram of the CAF: the evidence model (comprising an evaluation component and a measurement model component) connects the student model and the task model. From Mislevy et al., 2012.

46 CAF: Task Model Defines the assumptions and specifications for what the participant can do in the assessment and the features of the environment in which the task takes place (Mislevy et al., 2012). Examples of task model decisions: item format and content; delivery format (random delivery, time limits); resources; translations or accommodations; response format.

47 CAF: Student Model Defines the construct and construct relationships being measured, from which we will make an inference (Mislevy et al., 2012). Examples of student model decisions: total score rules and interpretations; topic score rules and interpretations; rubric structure; conditional delivery (e.g., jump blocks, CAT).

48 CAF: Evidence Model Connects the task model and the student model (Mislevy et al., 2012). Evaluation component: defines how evidence is identified from the responses generated within the task model, including rules for identifying correct responses, rules for which aspects of a response to observe in a performance task or human-scored item, and sequence and tagging. Measurement model: aggregates response data to yield inferences about the student, including item scoring and outcomes, weighting and scaling, and aggregation models (e.g., CTT, IRT, Bayes nets, regression, network analysis).
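The measurement model is where aggregation happens. As one hedged illustration (IRT is only one of the options the slide lists, and none of these numbers come from the presentation), here is a minimal Rasch-model sketch that turns a 0/1 response pattern into an ability estimate by grid-search maximum likelihood, using hypothetical item difficulties.

```python
import numpy as np

def rasch_probability(theta: float, difficulty: np.ndarray) -> np.ndarray:
    """P(correct) for each item under the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))

# Hypothetical measurement model: estimate a participant's ability (theta) by
# maximizing the likelihood of their response pattern over a grid of theta values.
difficulty = np.array([-1.0, -0.5, 0.0, 0.5, 1.5])
responses = np.array([1, 1, 1, 0, 0])

thetas = np.linspace(-4, 4, 801)
log_lik = [np.sum(responses * np.log(rasch_probability(t, difficulty)) +
                  (1 - responses) * np.log(1 - rasch_probability(t, difficulty)))
           for t in thetas]
theta_hat = thetas[int(np.argmax(log_lik))]
print(f"Estimated ability (theta): {theta_hat:.2f}")
```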

49 Reporting the CAF There will be blurred lines between the three elements of the CAF, but this is because the three models are interdependent. Documenting the CAF is becoming more common in the literature. It provides defensibility by demonstrating how your instrument collects and scores evidence about the construct to support specific inferences, and it naturally lends itself to argument-based validity: this is the evidence needed to support many of your warrants.

50 Thank you! Austin Fossey Reporting and Analytics Manager, Questionmark

51 References American Psychological Association, National Council on Measurement in Education, & American Educational Research Association. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association. Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1). Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1). Cox Jr., C. T., Jordan, J., Cooper, M. M., & Stevens, R. (2004). Assessing student understanding with technology: the use of IMMEX problems in the science classroom. Retrieved February 21. Crocker, L., & Algina, J. (2008). Introduction to Classical and Modern Test Theory. Mason, OH: Cengage Learning.

52 References Ferrara, S., Nichols, P., & Lai, E. (2013). Design and development for next generation tests: Principled design for efficacy (PDE). Proceedings from the Maryland Assessment Research Center Conference. Retrieved February 20. Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112. Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-64). Westport, CT: Praeger Publishers. Klein, S. P., & Kosecoff, J. P. (1975). Determining how well a test measures your objectives (CSE Report No. 94). Los Angeles, CA: Center for the Study of Evaluation, University of California.

53 References Mislevy, R. J. (2011). Evidence-centered design for simulation-based assessment (CRESST Report 800). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-Centered Design, psychometrics, and Educational Data Mining. Journal of Educational Data Mining, 4(1). Mosher, D. L. (1968). Measurement of guilt by self-report inventories. Journal of Consulting and Clinical Psychology, 32. Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the assessment of criterion-referenced test item validity. Dutch Journal of Educational Research, 2.

54 References Sireci, S. G. (2013). A theory of action of test validation. Proceedings from the Maryland Assessment Research Center Conference. Retrieved February 20. Stevens, R. H., & Casillas, A. (2006). Artificial neural networks. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing. Mahwah, NJ: Erlbaum.