Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT


University of Iowa
Iowa Research Online
Theses and Dissertations
Summer 2011

Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT

Benjamin James Andrews, University of Iowa
Copyright 2011 Benjamin Andrews
This dissertation is available at Iowa Research Online.
Part of the Educational Psychology Commons

Recommended Citation
Andrews, Benjamin James. "Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT." PhD (Doctor of Philosophy) thesis, University of Iowa, 2011.

ASSESSING FIRST- AND SECOND-ORDER EQUITY FOR THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN USING MULTIDIMENSIONAL IRT

by
Benjamin James Andrews

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

July 2011

Thesis Supervisors: Professor Michael J. Kolen, Associate Professor Won-Chan Lee

Copyright by
BENJAMIN JAMES ANDREWS
2011
All Rights Reserved

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Benjamin James Andrews has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the July 2011 graduation.

Thesis Committee:
Michael J. Kolen, Thesis Supervisor
Won-Chan Lee, Thesis Supervisor
Robert L. Brennan
Timothy N. Ansley
Mary Kathryn Cowles

ACKNOWLEDGMENTS

I would first like to thank my academic advisor, Dr. Michael Kolen. I am truly grateful for his insightful guidance and feedback throughout my time at the University of Iowa. I would also like to thank Dr. Won-Chan Lee for his constant assistance throughout the whole dissertation process. He was always willing to discuss research ideas, assist with computer programming, and help in any way he could. The other members of my dissertation committee deserve thanks as well. Dr. Robert Brennan, Dr. Tim Ansley and Dr. Kate Cowles all provided valuable feedback on my dissertation and I am very thankful for their willingness to serve on my committee. I cannot thank my fellow classmates enough. They were always so eager to help with any issues and always provided wonderful feedback. Specifically, I would like to thank my office mates in 204 Lindquist Center. Karoline Jarr, Sarah Hagge and Katherine Furgol never hesitated to offer their help. They always provided the challenge, encouragement and humor that was needed and made the process truly enjoyable.

ABSTRACT

The equity properties can be used to assess the quality of an equating. The degree to which expected scores conditional on ability are similar between test forms is referred to as first-order equity. Second-order equity is the degree to which conditional standard errors of measurement are similar between test forms after equating. The purpose of this dissertation was to investigate the use of a multidimensional IRT framework for assessing first- and second-order equity of mixed-format tests.

Both real and simulated data were used for assessing the equity properties for mixed-format tests. Using real data from three Advanced Placement (AP) exams, five different equating methods were compared in their preservation of first- and second-order equity. Frequency estimation, chained equipercentile, unidimensional IRT true score, unidimensional IRT observed score, and multidimensional IRT observed score equating methods were used. Both a unidimensional IRT framework and a multidimensional IRT framework were used to assess the equity properties. Two simulation studies were also conducted. The first investigated the accuracy of expected scores and conditional standard errors of measurement as tests became increasingly multidimensional, using both a unidimensional IRT framework and a multidimensional IRT framework. In the second simulation study, the five different equating methods were compared in their ability to preserve first- and second-order equity as tests became more multidimensional and as differences in group ability increased.

Results from the real data analyses indicated that the performance of the equating methods based on first- and second-order equity varied depending on which framework was used to assess equity and which test was used. Some tests showed similar preservation of equity for both frameworks while others differed greatly in their assessment of equity. Results from the first simulation study showed that, when the correlation between abilities was high, estimates of expected scores had lower mean squared error values under the unidimensional framework than under the multidimensional framework. The multidimensional IRT framework had lower mean squared error values for conditional standard errors of measurement when the correlation between abilities was less than .95. In the second simulation study, chained equating performed better than frequency estimation for first-order equity. Frequency estimation better preserved second-order equity compared to the chained method. As tests became more multidimensional or as group differences increased, the multidimensional IRT observed score equating method tended to perform better than the other methods.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER I. INTRODUCTION
  Equity Properties
  Item Response Theory
  The Unidimensional IRT Approach
  The Multidimensional IRT Approach
  Equating Design
  Equating Methods
  Research Purpose
CHAPTER II. LITERATURE REVIEW
  Equity Frameworks
    Unidimensional IRT Equity Framework
    Multidimensional IRT Equity Framework
  Scale Linking
  Estimates of Ability
  Review of Relevant Literature
    Equivalence of Traits Assessed By Multiple-Choice and Free-Response Items
    Scale Linking
    Equating Methods
CHAPTER III. METHODOLOGY
  Equating Methods
    Frequency Estimation Equipercentile Equating
    Chained Equipercentile Equating
    Unidimensional IRT True Score Equating
    Unidimensional IRT Observed Score Equating
    Multidimensional IRT Observed Score Equating
    Estimation and Software
  Description of the Real Data
  Scale Score Conversion Table
  Procedures for Real Data Examples
  Simulation Study 1
    Population Parameters
    Data Generation Procedures
    Simulation Procedures
    Evaluation of Results
  Simulation Study 2
    Population Parameters
    Simulation Procedures
    Evaluation of Results
CHAPTER IV. RESULTS
  Real Data Examples
    English Language Test
    Spanish Literature Test
    French Language Test
  Simulation Study 1
    Effect of Correlation Between Abilities
    Differences in Ability Estimates
  Simulation Study 2
    Comparison Between Traditional Equating Methods
    Comparison Between IRT Equating Methods
    Conditional Results
CHAPTER V. DISCUSSION
  Summary of Findings
    Research Questions 1 and 2
    Research Question 3
    Research Questions 4 and 5
  Limitations and Future Research
  Practical Implications
REFERENCES

LIST OF TABLES

Table 3-1. Descriptive Statistics for Spanish Literature, English Language and French Language Test Forms
Table 3-2. Summary of Simulation Conditions for Simulation Study
Table 4-1. Statistics for the English Language Test Using the Unidimensional Framework
Table 4-2. Statistics for the English Language Test Using the Multidimensional Framework
Table 4-3. Statistics for the Spanish Literature Test Using the Unidimensional Framework
Table 4-4. Statistics for the Spanish Literature Test Using the Multidimensional Framework
Table 4-5. Statistics for the French Language Test Using the Unidimensional Framework
Table 4-6. Statistics for the French Language Test Using the Multidimensional Framework
Table 4-7. Summary Statistics of Estimated Expected Scores with Correlation of
Table 4-8. Summary Statistics of Estimated CSEMs with Correlation of
Table 4-9. Summary Statistics of Estimated Expected Scores with Correlation of
Table 4-10. Summary Statistics of Estimated CSEMs with Correlation of
Table 4-11. Summary Statistics of Estimated Expected Scores with Correlation of
Table 4-12. Summary Statistics of Estimated CSEMs with Correlation of
Table 4-13. Summary Statistics of Estimated Expected Scores with Correlation of
Table 4-14. Summary Statistics of Estimated CSEMs with Correlation of
Table 4-15. Mean D1 and D2 Statistics for Unidimensional Data
Table 4-16. Mean D1 and D2 Statistics for a Correlation of
Table 4-17. Mean D1 and D2 Statistics for a Correlation of
Table 4-18. Mean D1 and D2 Statistics for a Correlation of
Table 4-19. Mean D1 and D2 Statistics for a Correlation of

LIST OF FIGURES

Figure 4-1. Equating relationship for English Language test forms
Figure 4-2. Differences in expected scale scores and scale score CSEMs for the English Language test forms using the unidimensional framework
Figure 4-3. Differences in expected scale scores for the English Language test forms using the multidimensional framework
Figure 4-4. Differences in scale score CSEMs for the English Language test forms using the multidimensional framework
Figure 4-5. Equating relationship for the Spanish Literature test forms
Figure 4-6. Differences in expected scale scores and scale score CSEMs for the Spanish Literature test forms using the unidimensional framework
Figure 4-7. Differences in expected scale scores for the Spanish Literature test forms using the multidimensional framework
Figure 4-8. Differences in scale score CSEMs for the Spanish Literature test forms using the multidimensional framework
Figure 4-9. Equating relationship for French Language test forms
Figure 4-10. Differences in expected scale scores and scale score CSEMs for the French Language test forms using the unidimensional framework
Figure 4-11. Differences in expected scale scores for the French Language test forms using the multidimensional framework
Figure 4-12. Differences in scale score CSEMs for the French Language test forms using the multidimensional framework
Figure 4-13. Mean differences between expected scores and CSEMs for unidimensional data with a group difference of
Figure 4-14. Mean differences between expected scores and CSEMs for unidimensional data with a group difference of
Figure 4-15. Mean differences between expected scores and CSEMs for unidimensional data with a group difference of
Figure 4-16. Mean differences between expected scale scores when correlation between ability was .95 and difference between groups was
Figure 4-17. Mean differences between scale score CSEMs when correlation between ability was .95 and difference between groups was
Figure 4-18. Mean differences between expected scale scores when correlation between ability was .95 and difference between groups was
Figure 4-19. Mean differences between scale score CSEMs when correlation between ability was .95 and difference between groups was
Figure 4-20. Mean differences between expected scale scores when correlation between ability was .95 and difference between groups was
Figure 4-21. Mean differences between scale score CSEMs when correlation between ability was .95 and difference between groups was
Figure 4-22. Mean differences between expected scale scores when correlation between ability was .80 and difference between groups was
Figure 4-23. Mean differences between scale score CSEMs when correlation between ability was .80 and difference between groups was
Figure 4-24. Mean differences between expected scale scores when correlation between ability was .80 and difference between groups was
Figure 4-25. Mean differences between scale score CSEMs when correlation between ability was .80 and difference between groups was
Figure 4-26. Mean differences between expected scale scores when correlation between ability was .80 and difference between groups was
Figure 4-27. Mean differences between scale score CSEMs when correlation between ability was .80 and difference between groups was
Figure 4-28. Mean differences between expected scale scores when correlation between ability was .65 and difference between groups was
Figure 4-29. Mean differences between scale score CSEMs when correlation between ability was .65 and difference between groups was
Figure 4-30. Mean differences between expected scale scores when correlation between ability was .65 and difference between groups was
Figure 4-31. Mean differences between scale score CSEMs when correlation between ability was .65 and difference between groups was
Figure 4-32. Mean differences between expected scale scores when correlation between ability was .65 and difference between groups was
Figure 4-33. Mean differences between scale score CSEMs when correlation between ability was .65 and difference between groups was
Figure 4-34. Mean differences between expected scale scores when correlation between ability was .50 and difference between groups was
Figure 4-35. Mean differences between scale score CSEMs when correlation between ability was .50 and difference between groups was
Figure 4-36. Mean differences between expected scale scores when correlation between ability was .50 and difference between groups was
Figure 4-37. Mean differences between scale score CSEMs when correlation between ability was .50 and difference between groups was
Figure 4-38. Mean differences between expected scale scores when correlation between ability was .50 and difference between groups was
Figure 4-39. Mean differences between scale score CSEMs when correlation between ability was .50 and difference between groups was

CHAPTER I. INTRODUCTION

In many testing programs, it is important for scores on different forms of a test to be comparable to one another. In order for test forms built to have similar content to be used interchangeably, there needs to be a statistical adjustment made for the differences in difficulty between forms. This statistical process is called equating (Kolen & Brennan, 2004). Equating is an important aspect of any testing program that administers multiple forms of a test. The need for equating arises if a testing program administers different forms on different test dates for security reasons, when examinees are allowed to take tests on more than one occasion, or when test scores are to be compared across years.

There are many different procedures that have been used to evaluate the results of equating. Harris and Crouse (1993) provided a summary of these methods. One of these methods for evaluating results is to assess how well equity properties are preserved. This method serves as the focus of this dissertation.

Equity Properties

The equity property of equating was first discussed by Lord (1980). According to Lord, equity is achieved if any examinee would be indifferent about which form of a test he or she takes. Specifically, conditional on ability level, the distribution of equated scores on the new form should be the same as the conditional distribution of scores on the old form. These conditional distributions should be the same between test forms at all levels of ability. Lord also demonstrated that the only way to achieve this definition of equity is if both forms have reliabilities of one or if both forms are identical. In other words, an equating of two tests cannot meet Lord's equity property unless the forms are perfectly reliable, an unrealistic condition, or unless the forms are identical, which makes equating unnecessary.

Morris (1982) proposed a less stringent definition of equity. According to this definition, the means conditional on ability should be the same for each test form after equating. The extent to which the means of the conditional distributions are similar is commonly referred to as first-order equity. Typically, the standard deviations of conditional distributions are also compared between test forms. The extent to which the standard deviations of the conditional distributions are similar is referred to as second-order equity.

The issue of equity can also have practical implications. If first- and second-order equity properties are not preserved, it is no longer a matter of indifference to examinees which test form is taken. For instance, if first-order equity fails to hold and the expected scale score at a certain ability is higher for one form of a test, examinees of that ability taking that form would have an advantage over students of the same ability taking a different test form. A situation could also occur where first-order equity is preserved but second-order equity is not. In such an instance, though the conditional means for two test forms could be the same for a wide range of abilities, lower ability students would prefer to take the test form with the greater variance because they would be more likely to benefit from measurement error. Higher ability students would prefer to take the test form with the smaller variance because it would be more likely that their high ability is assessed accurately.

In order to assess how well these equity properties are being met, some sort of psychometric model needs to be used. The model must relate true score to ability and also have a way of quantifying error variance conditional on true score. For example, Kolen, Hanson, and Brennan (1992) presented methods for assessing how well the equity properties are preserved for scale scores using a strong true score model. Kolen, Zeng, and Hanson (1996) demonstrated methods that can be used to estimate how well the equity properties are preserved using unidimensional IRT models for dichotomous item tests, and Wang, Kolen, and Harris (2000) described methodologies for using unidimensional IRT models with polytomous item tests. Kolen and Wang (1998, 2007) extended the unidimensional methodologies for use with multidimensional IRT models. This dissertation focuses on the IRT methodologies.
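Stated compactly, with \theta denoting ability, S(\cdot) a raw-to-scale score transformation, X and Y the new and old form raw scores, and eq_Y(\cdot) the function equating Form X scores to the Form Y scale, the two properties amount to

E\big[ S(eq_Y(X)) \mid \theta \big] = E\big[ S(Y) \mid \theta \big]
\quad\text{and}\quad
\sigma\big[ S(eq_Y(X)) \mid \theta \big] = \sigma\big[ S(Y) \mid \theta \big]

for all values of \theta; Chapter II states these conditions as Equations 2.4 and 2.6.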

Item Response Theory

Item response theory (IRT; Lord, 1980) is a psychometric theory that relates examinee ability or proficiency, frequently symbolized by \theta, to the probability that an examinee responds a certain way to a particular item. For multiple-choice items where examinees receive scores of 0 or 1, a frequently used model is the three-parameter logistic model. The probability of a correct response on item j for the three-parameter logistic model is written as

P_j(\theta) = c_j + (1 - c_j)\,\frac{\exp[a_j(\theta - b_j)]}{1 + \exp[a_j(\theta - b_j)]},   (1.1)

where \theta is examinee ability, a_j is the item discrimination parameter, b_j is the item difficulty, and c_j is the lower asymptote parameter. This probability is referred to as the item characteristic curve.

Several IRT models also exist for items where examinees could receive one of many different scores. For instance, the graded response model (Samejima, 1997) and the nominal response model (Bock, 1997) can both be used with polytomous examinee response data. For these models, the probability of an examinee receiving each possible score category is a function of examinee ability. A special case of the nominal response model, described by Muraki (1992), is called the generalized partial credit model. The probability of getting a score that corresponds to category k on item j for the generalized partial credit model is

P_{jk}(\theta) = \frac{\exp\left[ \sum_{v=1}^{k} a_j(\theta - b_j + d_{jv}) \right]}{\sum_{c=1}^{m_j} \exp\left[ \sum_{v=1}^{c} a_j(\theta - b_j + d_{jv}) \right]},   (1.2)

where \theta is ability, a_j is the item discrimination parameter, b_j is item difficulty, and d_{j1}, ..., d_{jm_j} are the category parameters. Category m_j is the largest score an examinee can receive on item j. This probability of receiving a particular score is called the category response function. This dissertation uses the generalized partial credit model for polytomous items.

There are some assumptions that are made with the use of these IRT models. The first assumption is unidimensionality. The probability of an examinee getting a particular score on an item is dependent on the item parameters and a single latent variable. No other variables are assumed to influence examinee responses. Another assumption is local independence. For the local independence assumption to hold, examinee performance on any two items must be independent (and hence, uncorrelated) for any fixed value of \theta. Provided these assumptions are not violated, IRT provides a helpful framework to use in a variety of applications. This dissertation focuses on the use of IRT both for assessing equity properties and as a foundation for equating methods.
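As a concrete illustration of these two models, the sketch below (Python, with illustrative parameter values that are not taken from the dissertation) computes the item characteristic curve for a multiple-choice item under the three-parameter logistic model and the category response functions for a polytomous item under the generalized partial credit model, using a zero-based category indexing convention.

import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    z = a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def gpcm_probs(theta, a, b, d):
    """Category response functions for the generalized partial credit model.

    d holds the category parameters (d_1, ..., d_m); the function returns the
    probability of each of the m + 1 score categories 0, ..., m, with the
    exponent for category 0 fixed at zero (a common identification convention).
    """
    theta = np.atleast_1d(theta).astype(float)
    steps = a * (theta[:, None] - b + np.asarray(d, dtype=float)[None, :])
    z = np.concatenate([np.zeros((theta.size, 1)), np.cumsum(steps, axis=1)], axis=1)
    num = np.exp(z - z.max(axis=1, keepdims=True))  # stabilize before normalizing
    return num / num.sum(axis=1, keepdims=True)

# Hypothetical item parameters, used only for illustration.
theta = np.linspace(-3, 3, 7)
print(p_3pl(theta, a=1.2, b=0.3, c=0.2))
print(gpcm_probs(theta, a=0.8, b=-0.2, d=[0.5, 0.0, -0.5]))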

The Unidimensional IRT Approach

The unidimensional IRT framework for assessing first- and second-order equity has the same assumptions that accompany unidimensional item response theory. Specifically, only a single latent variable is assumed to influence examinee responses. This framework is based on procedures described by Kolen, Zeng, and Hanson (1996) and Wang, Kolen, and Harris (2000).

Assessing first- and second-order equity using the unidimensional IRT framework involves the following steps. First, data for each test are fit with a unidimensional IRT model. For this approach, all sets of items are considered to be influenced by the same latent variable. Using the estimated item parameters, the distribution of raw scores is found conditional on true score for the old test form. The distribution of raw scores is then converted to scale scores. For the new form, once item parameter estimates are put on the old form scale, the conditional distributions of raw scores are found. Those raw scores are then converted to their old form equivalents and then to scale scores based on the results from an equating method. First-order equity for a particular equating method is evaluated based on how similar the conditional means of these two scale score distributions are at each true score level. The standard deviations of the scale score distributions are the conditional standard errors of measurement (CSEMs) for scale scores. The degree to which CSEM values are similar across the two forms is the degree to which second-order equity is preserved.

The Multidimensional IRT Approach

The methodologies for the multidimensional IRT approach are described by Kolen and Wang (1998, 2007). For this particular multidimensional IRT model, instead of a single latent variable underlying all examinee responses, a different latent variable underlies responses to each different section. This is referred to as a simple structure IRT model. For this approach, it is assumed that items load on only a single ability. These latent abilities are allowed to be correlated. For the multidimensional method, instead of conditioning on a single \theta, the conditioning variable is a vector, \boldsymbol{\theta}, that is made up of the \theta for each dimension. In this dissertation, each element of \boldsymbol{\theta} represents a different section. For instance, one element of \boldsymbol{\theta} represents ability for the multiple-choice items and the other element represents ability for the free-response items. The same IRT model assumptions are made for each section. Namely, a single latent variable underlies responses to each section, and local independence holds for responses in each section.

The process of finding the conditional distributions is similar to what is done for the unidimensional IRT method. Data for each section on both tests are fit with a separate unidimensional IRT model. All item parameter estimates are obtained using separate runs of IRT estimation software. A distribution of raw scores is found conditional on \boldsymbol{\theta} for the old form based on the estimated item parameters. These raw scores are then converted to scale scores. For the new form, the distribution of raw scores is found conditional on \boldsymbol{\theta} after item parameters are put on the same scale as the old form. These raw scores are converted to old form equivalents using an equating method. Those equivalents are then converted to scale scores. First-order equity is the degree to which the means of these conditional scale score distributions are similar across test forms when a particular equating method is used. Second-order equity for a particular equating is the degree to which the standard deviations of these conditional scale score distributions are similar across test forms.

Equating Design

Several different designs can be used to gather the data needed to conduct equating (Kolen & Brennan, 2004). The equating design used in this study is the common-item nonequivalent groups design. The old form, denoted as Form Y, is taken by Group 1. The new form is called Form X and is taken by Group 2. Groups 1 and 2 are assumed to be sampled from different populations. Under this design, Form X and Form Y share a set of common items. Through examinee performance on this common item set, test form differences are disentangled from examinee group differences.

Equating Methods

Five different equating methods are used in this dissertation. They have varying assumptions, advantages, and limitations. Equipercentile equating defines two scores as equivalent if they both have the same percentile rank for a particular population. Chained equating uses equipercentile methods to link from new form scores to common-item scores to old form scores. Unidimensional IRT true score equating defines equivalent scores as the true scores on both forms that correspond to the same ability. The true score in an IRT context is the expected raw score conditional on ability. Unidimensional IRT observed score equating uses estimated item parameters to estimate score distributions for the population for each test form. Like equipercentile equating, two scores are equivalent if they have the same percentile rank in each of the population score distributions. Multidimensional IRT observed score equating follows the same procedures except that a multidimensional ability distribution is assumed to underlie examinee responses.
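To make the equipercentile idea concrete, the sketch below is a simplified single-population illustration (not the frequency estimation or chained procedures actually used in the dissertation): each Form X integer score is mapped to the Form Y score with the same percentile rank, using the usual discrete percentile-rank definition with linear interpolation. The frequency distributions are hypothetical.

import numpy as np

def percentile_ranks(freqs):
    """Percentile ranks for integer scores 0..K given score frequencies."""
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    below = np.concatenate([[0.0], np.cumsum(p)[:-1]])
    return 100.0 * (below + 0.5 * p)  # midpoint convention

def equipercentile(freq_x, freq_y):
    """Form Y equivalents of each Form X integer score (simplified sketch)."""
    pr_x = percentile_ranks(freq_x)
    p_y = np.asarray(freq_y, dtype=float) / np.sum(freq_y)
    cum_y = np.cumsum(p_y)
    equivalents = []
    for pr in pr_x / 100.0:
        # smallest Form Y score whose cumulative proportion reaches pr
        j = min(int(np.searchsorted(cum_y, pr)), len(p_y) - 1)
        lower_cum = cum_y[j - 1] if j > 0 else 0.0
        # interpolate within the score interval [j - 0.5, j + 0.5]
        frac = (pr - lower_cum) / p_y[j] if p_y[j] > 0 else 0.5
        equivalents.append(j - 0.5 + frac)
    return np.array(equivalents)

# Hypothetical frequency distributions on two 0-5 point forms, for illustration only.
print(equipercentile([2, 5, 10, 10, 5, 2], [1, 4, 8, 12, 6, 3]))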

Research Purpose

Assessing equity properties can provide valuable evidence about how well an equating method adjusts for differences in difficulty between forms. There has been some research evaluating how well equity properties are met using unidimensional IRT models (e.g., Tong & Kolen, 2005). This research suggests that when multiple-choice test forms are of similar difficulty, both first- and second-order equity can be achieved reasonably well. However, when test forms differ considerably in difficulty, first- and second-order equity might not both be achieved. In this case, there is some evidence that IRT true score equating better achieves first-order equity and that equipercentile and IRT observed score equating better achieve second-order equity.

When tests contain both multiple-choice and free-response items, the assumption of unidimensionality that is often used to assess equity could be violated. In such a situation, using unidimensional IRT as the model to judge first- and second-order equity could lead to inaccurate results. A simple structure multidimensional IRT model could provide more accurate information about the quality of equating results. In other words, a unidimensional IRT model for each section of the test could be more appropriate.

This dissertation is divided into three different parts. The first part uses real test data to demonstrate how different equating methods preserve first- and second-order equity using both unidimensional IRT and multidimensional IRT frameworks. The second part addresses how estimates of expected scores and CSEMs differ under both frameworks as mixed-format tests become more multidimensional. The third part addresses how well different equating methods preserve the equity properties for mixed-format tests under varying conditions of multidimensionality. How well equating methods preserve the equity properties as groups become increasingly different is also addressed. Specifically, this research focuses on the following questions:

1. How do estimates of first- and second-order equity compare for different equating methods using the common-item nonequivalent groups design?
2. To what extent does using a unidimensional or a multidimensional framework provide different evidence concerning how well equity properties are being met?
3. As mixed-format tests are increasingly multidimensional, how do estimates of expected scores and CSEMs differ between the unidimensional framework and the multidimensional framework?
4. When mixed-format tests are increasingly multidimensional, are there some equating methods that work better than others for preserving first- and second-order equity?
5. When the groups that take each test form become more different, do some equating methods better preserve first- and second-order equity?

The first two research questions are addressed by the first part of the dissertation using real data. The third research question is addressed by the first simulation study, and the second simulation study addresses the final two research questions.

This dissertation adds to the literature in this area in the following ways. Previous research on the equity properties has involved tests that contained only multiple-choice items. The current research uses mixed-format tests. Also, previous work used either a strong true score framework or a unidimensional IRT framework to assess first- and second-order equity. The current research uses both a unidimensional IRT framework and a multidimensional IRT framework to assess the equity properties.

CHAPTER II. LITERATURE REVIEW

There are four different sections in this chapter. The first section describes the unidimensional and multidimensional IRT frameworks. The second section describes the process for IRT scale linking, the third describes IRT ability estimation, and the final section discusses relevant research.

Equity Frameworks

Unidimensional IRT Equity Framework

When first- and second-order equity are assessed using a unidimensional IRT framework, a single latent variable is assumed to influence examinee responses to all items. This framework is based on theory described by Kolen, Zeng, and Hanson (1996) for dichotomous item tests. Wang, Kolen, and Harris (2000) described procedures for using polytomous IRT models. The processes for calculating expected scale scores and CSEMs are described next.

The majority of the tests considered in this dissertation are mixed-format tests, meaning they contain both multiple-choice and free-response items. All raw scores in this dissertation are weighted sums of item scores. The raw score is

X = \sum_{j} w_j U_j,   (2.1)

where w_j is the item weight and U_j is the examinee's score on item j.

To evaluate how well equity properties are preserved using this framework, score distributions conditional on ability are found using item parameters. The recursive algorithm used for calculating the conditional distributions is described by Hanson (1994) and Thissen, Pommerich, Billeaud, and Williams (1995). Using this algorithm, the distribution of raw scores conditional on each ability is found. The conditional raw score distribution of Form X over the first r items at a given ability is represented by f_r(x \mid \theta). For the first item, each score category has a probability of earning that score given the ability: f_1(x = w_1 u_{11} \mid \theta) = P_{11}(\theta) for the first category, f_1(x = w_1 u_{12} \mid \theta) = P_{12}(\theta) for the second category, and so on up to the last category. The recursion formula can then be used for each additional item to find the probability of earning a score x on the test that now contains r items. The recursion formula for f_r(x \mid \theta) is

f_r(x \mid \theta) = \sum_{k} f_{r-1}(x - w_r u_{rk} \mid \theta)\, P_{rk}(\theta), \quad \min_r \le x \le \max_r,   (2.2)

where \min_r and \max_r are the minimum and maximum scores on the test after the item is added, w_r is the weight for item r, u_{rk} is the score for category k of item r, and P_{rk}(\theta) is the probability of that category. The recursive algorithm is also used to obtain the distribution of raw scores for Form Y. The mean of these conditional distributions is the true or expected raw score at that ability level. The standard deviation of this distribution is the raw score CSEM.

The raw scores are then converted to scale scores. The expected or true scale score is the mean of the scale score distribution. The formula can be written as

E[S(X) \mid \theta] = \sum_{x=\min X}^{\max X} S(x)\, f(x \mid \theta),   (2.3)

where min X and max X are the minimum and maximum raw scores on the test form, respectively. The probability of an examinee with ability \theta receiving a raw score of x is represented by f(x \mid \theta), and S(x) represents the scale score that corresponds to a raw score of x.

First-order equity under the unidimensional IRT framework is said to hold to the extent that the expected scale scores are similar for two equated forms over the entire range of abilities. This statement is written symbolically as

E[S(eq_Y(X)) \mid \theta] = E[S(Y) \mid \theta],   (2.4)

where S(eq_Y(X)) is the scale score associated with the old form equivalent after equating and Y is the old form raw score.

The variance of the conditional scale score distribution is the conditional error variance for scale scores. This quantity is written as

\sigma^2[S(X) \mid \theta] = \sum_{x=\min X}^{\max X} \{ S(x) - E[S(X) \mid \theta] \}^2 f(x \mid \theta).   (2.5)

The square root of this value is the CSEM. The degree of similarity between CSEM values across test forms is the extent to which second-order equity is preserved. Second-order equity is written symbolically as

\sigma[S(eq_Y(X)) \mid \theta] = \sigma[S(Y) \mid \theta].   (2.6)
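The sketch below illustrates this machinery: it builds the conditional raw score distribution with the recursion in Equation 2.2 and then returns the conditional mean and standard deviation of scale scores (Equations 2.3 and 2.5). In practice the item category probabilities would come from fitted IRT models such as those in Chapter I; here they are supplied directly, and the scale score conversion table is arbitrary and purely illustrative.

import numpy as np

def conditional_raw_dist(items):
    """Conditional raw score distribution f(x | theta) via the recursion in Eq. 2.2.

    `items` is a list of (weight, category_scores, category_probs) triples, where
    category_probs are the category response probabilities at one fixed theta.
    """
    max_total = int(sum(w * max(scores) for w, scores, _ in items))
    f = np.zeros(max_total + 1)
    f[0] = 1.0                      # before any item, a total of 0 with probability 1
    for w, scores, probs in items:
        g = np.zeros_like(f)
        for u, p in zip(scores, probs):
            shift = int(w * u)
            g[shift:] += p * f[: len(f) - shift]  # f_r(x) += P_rk * f_{r-1}(x - w_r u_rk)
        f = g
    return f

def conditional_scale_moments(f, scale_table):
    """Expected scale score and scale score CSEM given f(x | theta) (Eqs. 2.3 and 2.5)."""
    s = np.asarray(scale_table, dtype=float)   # s[x] = scale score for raw score x
    mean = float(np.sum(s * f))
    var = float(np.sum((s - mean) ** 2 * f))
    return mean, np.sqrt(var)

# Hypothetical two-item example at one theta: a 0/1 item and a 0-2 item, both with weight 1.
items = [
    (1, [0, 1], [0.35, 0.65]),
    (1, [0, 1, 2], [0.2, 0.5, 0.3]),
]
f = conditional_raw_dist(items)
print(conditional_scale_moments(f, scale_table=[10, 12, 15, 20]))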

Multidimensional IRT Equity Framework

Kolen and Wang (1998, 2007) described a general procedure that can be used for assessing first- and second-order equity using a multidimensional IRT framework. Under this framework, a simple structure multidimensional IRT model is used. Each item type is fit with a separate unidimensional IRT model. Items load on only a single dimension. For instance, the multiple-choice section is represented by one ability dimension and the free-response section is represented by another ability dimension. A vector, \boldsymbol{\theta}, made up of the \theta for each dimension, is assumed to underlie responses for the entire test.

The process for finding expected scale scores and CSEMs is similar to the procedure for the unidimensional IRT framework. First, raw score distributions are obtained conditional on ability. Conditional independence is also assumed, just as in other IRT contexts. In other words, conditional on a vector of abilities, responses to all items are independent. Because of this conditional independence, the probability of a particular combination of scores on each of the different sections can be written as

f(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} f_k(x_k \mid \theta_k),   (2.7)

where \mathbf{x} is the vector of scores on each section, represented by (x_1, ..., x_K). The conditional distributions for each of the K sections are represented by f_k(x_k \mid \theta_k). The conditional distribution of raw scores for each of the dimensions is found using the algorithm described by Hanson (1994) and Thissen et al. (1995). Let X be the summation of all the x_k. The distribution of X can be written as

f(x \mid \boldsymbol{\theta}) = \sum_{x_1 + \cdots + x_K = x} \prod_{k=1}^{K} f_k(x_k \mid \theta_k),   (2.8)

where the summation is taken over all combinations of section scores that result in a total score of x. The scale score that corresponds to a raw score of x is denoted as S(x). The following equation is then used to find the mean of the distribution of scale scores conditional on \boldsymbol{\theta}:

E[S(X) \mid \boldsymbol{\theta}] = \sum_{x} S(x)\, f(x \mid \boldsymbol{\theta}),   (2.9)

where the summation is taken over all possible values of x. The conditional error variance for scale scores under the multidimensional framework is

\sigma^2[S(X) \mid \boldsymbol{\theta}] = \sum_{x} \{ S(x) - E[S(X) \mid \boldsymbol{\theta}] \}^2 f(x \mid \boldsymbol{\theta}).   (2.10)

The square root of the conditional error variance is the conditional standard error of measurement. If the identity function is used in place of the scale score transformation S(x) in the two preceding equations, the resulting values are the expected scores and error variances for raw scores.
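Because the sections are independent given the ability vector, the conditional distribution of the total score in Equation 2.8 is simply the convolution of the per-section conditional distributions. A small sketch follows; the per-section distributions (which in practice would come from applying the Equation 2.2 recursion within each section) and the scale score table are hypothetical.

import numpy as np

def conditional_total_dist(section_dists):
    """Convolve per-section distributions f_k(x_k | theta_k) into f(x | theta) (Eq. 2.8)."""
    f = np.array([1.0])
    for fk in section_dists:
        f = np.convolve(f, fk)
    return f

# Hypothetical conditional distributions for a multiple-choice and a free-response
# section at one point of the ability vector.
f_mc = np.array([0.1, 0.3, 0.4, 0.2])   # raw scores 0-3
f_fr = np.array([0.25, 0.5, 0.25])      # raw scores 0-2
f_total = conditional_total_dist([f_mc, f_fr])

scale_table = [5, 8, 10, 13, 16, 18]    # illustrative scale scores for totals 0-5
s = np.asarray(scale_table, dtype=float)
mean = float(np.sum(s * f_total))
csem = float(np.sqrt(np.sum((s - mean) ** 2 * f_total)))
print(mean, csem)                       # conditional expected scale score and CSEM (Eqs. 2.9, 2.10)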

Scale Linking

Under the common-item nonequivalent groups design, examinees taking each test form are believed to be from different populations. When item parameter estimates for each test form are obtained separately, they are on different scales. A transformation is needed to put the new form item parameters on the old form scale. Kolen and Brennan (2004) demonstrated that, provided the data fit the model, any linear transformation of the scale also fits the data if the item parameters are also transformed. Let Scale N represent the scale for the new form and let Scale O represent the scale for the old form. The relationship between ability for the two scales can be written as

\theta_O = A\,\theta_N + B,   (2.11)

where A and B are constants. Similar relationships exist for item parameters. Item discrimination, difficulty, and category parameters for Scale N can be transformed to Scale O for item j using the following equations:

a_{Oj} = a_{Nj}/A, \quad b_{Oj} = A\,b_{Nj} + B, \quad d_{Ojv} = A\,d_{Njv}.   (2.12)

The lower asymptote parameters are not affected by the scale transformation. When item parameters are estimated, these relationships fail to hold perfectly due to sampling error and model misfit. Several different methods exist for calculating the constants, A and B, for transforming item parameters. Two commonly used types of methods are the moment methods and the characteristic curve methods. The mean/sigma method (Marco, 1977) and the mean/mean method (Loyd & Hoover, 1980) are moment methods. These two methods use the estimated moments of item parameters for the common items to calculate A and B. One disadvantage of these methods is that they do not utilize information about all item parameters simultaneously. The Haebara (1980) and Stocking and Lord (1983) methods are characteristic curve methods. All parameters are considered simultaneously to find the constants A and B that minimize differences in the characteristic curves between tests.

The Haebara (1980) method finds the A and B constants that minimize the differences between item characteristic curves for the common items. The criterion function is

Hdiff = Q_1 + Q_2,   (2.13)

where

Q_1 = \int \sum_{j=1}^{V} \sum_{k=1}^{m_j} \left[ P_{jk}(\theta_O) - P^{*}_{jk}(\theta_O) \right]^2 \psi_O(\theta_O)\, d\theta_O   (2.14)

and

Q_2 = \int \sum_{j=1}^{V} \sum_{k=1}^{m_j} \left[ P_{jk}(\theta_N) - P^{**}_{jk}(\theta_N) \right]^2 \psi_N(\theta_N)\, d\theta_N,   (2.15)

where \psi_O and \psi_N represent the ability distributions for the old and new forms, respectively, and m_j is the maximum score that can be obtained on item j. The total number of common items is V. P_{jk}(\theta_O) and P_{jk}(\theta_N) are the category response functions for the old and new scales, respectively. P^{*}_{jk}(\theta_O) represents the category response function for item parameters on the new scale converted to the old form scale, and P^{**}_{jk}(\theta_N) is the category response function for parameters on the old scale converted to the new form scale.
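As a simple illustration of putting new form parameters on the old form scale, the sketch below computes A and B with the mean/sigma method from the common item difficulty estimates and applies the transformations in Equations 2.11 and 2.12; the characteristic curve methods would instead choose A and B by numerically minimizing a criterion such as Equations 2.13-2.15. All parameter values are hypothetical.

import numpy as np

def mean_sigma(b_old, b_new):
    """Mean/sigma linking constants from common item difficulties (Marco, 1977)."""
    b_old = np.asarray(b_old, dtype=float)
    b_new = np.asarray(b_new, dtype=float)
    A = b_old.std() / b_new.std()
    B = b_old.mean() - A * b_new.mean()
    return A, B

def to_old_scale(a_new, b_new, d_new, A, B):
    """Transform new form item parameters to the old form scale (Eqs. 2.11-2.12)."""
    return a_new / A, A * b_new + B, A * np.asarray(d_new, dtype=float)

# Hypothetical common item difficulty estimates on each scale.
b_old = np.array([-0.8, 0.1, 0.6, 1.2])
b_new = np.array([-1.0, -0.1, 0.5, 1.0])
A, B = mean_sigma(b_old, b_new)

# A hypothetical polytomous new form item: discrimination, difficulty, category parameters.
a_j, b_j, d_j = 0.9, 0.4, [0.6, 0.0, -0.6]
print(A, B, to_old_scale(a_j, b_j, d_j, A, B))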

Estimates of Ability

There exist many different ways to estimate examinee ability from test data (Thissen & Wainer, 2001). Three different methods were used in the current research to estimate ability. The likelihood equation for a unidimensional IRT model can be written as

L(u_1, \ldots, u_n \mid \theta) = \prod_{i=1}^{n} P(u_i \mid \theta),   (2.16)

where P(u_i \mid \theta) is the probability of response u_i on item i given ability \theta. The maximum likelihood estimate (MLE) of ability is the value of \theta that maximizes the likelihood function in Equation 2.16. MLE ability values do not depend on the population distribution of ability.

Bayesian expected a posteriori estimators (EAP) are another way to estimate an examinee's ability. The EAP ability estimate is the mean of the posterior distribution and is found using the following equation:

\hat{\theta}_{EAP} = \frac{\int \theta\, L(u_1, \ldots, u_n \mid \theta)\, g(\theta)\, d\theta}{\int L(u_1, \ldots, u_n \mid \theta)\, g(\theta)\, d\theta},   (2.17)

where g(\theta) is the density of the ability distribution. EAP estimates of ability are dependent on the ability distribution in the population.

Ability can also be estimated through the test characteristic curve. To obtain the test characteristic curve (TCC) estimate of ability, examinee responses to all items are first summed. This summed score is substituted on the left side of the test characteristic curve function, and an iterative process is then used to find the \theta that corresponds to that summed score.
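A minimal sketch of EAP estimation for dichotomous items follows, assuming a standard normal ability distribution and evaluating Equation 2.17 on a grid of quadrature points; the item parameters and response pattern are hypothetical.

import numpy as np

def eap_estimate(responses, a, b, c, n_points=61):
    """EAP ability estimate (Eq. 2.17) for 0/1 responses under the 3PL model."""
    theta = np.linspace(-4, 4, n_points)
    prior = np.exp(-0.5 * theta ** 2)                  # standard normal, unnormalized
    p = c[:, None] + (1 - c[:, None]) / (1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    u = np.asarray(responses, dtype=float)[:, None]
    like = np.prod(p ** u * (1 - p) ** (1 - u), axis=0)  # Eq. 2.16 at each grid point
    post = like * prior
    return float(np.sum(theta * post) / np.sum(post))

# Hypothetical item parameters and a response pattern for five items.
a = np.array([1.0, 1.2, 0.8, 1.5, 0.9])
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
c = np.array([0.2, 0.15, 0.2, 0.25, 0.2])
print(eap_estimate([1, 1, 1, 0, 0], a, b, c))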

Review of Relevant Literature

The review of relevant literature is comprised of three different sections. First, the equivalence of the traits assessed by different item formats is discussed, followed by a discussion of scale linking for mixed-format tests. Relevant research on equating is then discussed.

Equivalence of Traits Assessed By Multiple-Choice and Free-Response Items

Traub (1993) reviewed nine studies that aimed to investigate whether the traits measured by multiple-choice and free-response items were equivalent abilities. The studies that were reviewed were classified based on the type of tasks the tests required, language tasks or quantitative tasks. The language tasks were further classified into writing, word knowledge, and reading comprehension. There was some evidence that different item formats measured different traits for the writing tasks. Studies using tests measuring word knowledge had conflicting evidence of the equivalence of traits measured by the two item types. For studies involving reading comprehension and quantitative constructs, the literature review suggests that the item formats do not differ in what they measure.

The construct equivalence of multiple-choice and free-response items was investigated by Rodriguez (2003). The goal of the meta-analysis was to find what factors influenced the relationship between multiple-choice and free-response item scores. Typically the equivalence of item types is investigated in one of two different ways. Stem-equivalent items are those that have the same stem. When both multiple-choice and free-response items have the same stem, it is possible to investigate format effects. Items of different formats can also be written to have different stems. Items of this type can assess either similar or different content or cognitive processes. When item types differ in their stems, it is not possible to separate content and format effects. The disattenuated correlations between scores on multiple-choice and free-response items were significantly higher when stems were equivalent than when they were not. The mean disattenuated correlation was .92 for stem-equivalent forms and .85 for non-stem-equivalent forms. Rodriguez suggested that equivalence of multiple-choice and free-response items depends partially on the intent of the item writer. Items that are intended to measure similar constructs tend to have higher disattenuated correlations than items written to measure a different construct.

Scale Linking

Kim and Lee (2006) compared IRT scale linking methods for mixed-format tests in a simulation study. Both the moment methods and the characteristic curve methods were compared. The characteristic curve methods were found to perform better than the moment methods. Also, the characteristic curve methods were found to perform similarly to the concurrent calibration method.

Kim and Kolen (2006) investigated the robustness of IRT parameter linking for mixed-format tests under varying conditions of multidimensionality. The two moment methods and the two characteristic curve methods were compared. The degree of multidimensionality was defined as the correlation between abilities for the multiple-choice and free-response items. The characteristic curve methods were found to perform better than the moment methods when tests were increasingly multidimensional. Results suggested that even when the correlation between parts is relatively low, the characteristic curve methods might be suitable for tests simulated to have characteristics similar to typical achievement tests.

Equating Methods

Tong and Kolen (2005) used both real and simulated data to evaluate how well equity properties were preserved for different equating methods. Tests containing multiple-choice items were used and data were collected using the random groups design. When test forms were similar in difficulty, IRT equating methods and equipercentile equating methods provided similar results. If test forms differ in difficulty, the equating method chosen can have an effect on how well first- and second-order equity are preserved. First-order equity was better preserved by the IRT true score equating method. Equipercentile equating and IRT observed score equating performed better regarding second-order equity. Equity properties were evaluated using only a unidimensional IRT framework.

Kim, Brennan, and Kolen (2005) used the equity properties to evaluate equating results for unidimensional IRT equating methods and beta 4 true and observed score equating. Two different frameworks were used to assess preservation of first- and second-order equity for tests containing multiple-choice items using the random groups design. One framework was a unidimensional IRT framework and the other was a strong true score framework. The strong true score framework assumes that the distribution of true scores is a four-parameter beta distribution and the distribution of observed scores conditional on true score is the compound binomial distribution. When the equating method had the same underlying model assumptions as the framework used to assess the equity properties, there were a couple of consistent findings. First-order equity was better preserved for the true score equating method and second-order equity was better preserved for observed score equating. Also, unidimensional IRT observed score equating tended to better preserve second-order equity compared to all other equating methods regardless of the framework used to assess the equity properties.

Brennan (2010) gave a more theoretical discussion of first- and second-order equity. The conditions for attaining or nearly attaining first- and second-order equity were discussed when the equating relationship is linear and when the equating relationship is curvilinear. In the curvilinear example, the relationship between forms could be fully explained by a quadratic polynomial function. Two different types of equating methods were considered. The first was the use of observed scores in place of true scores, referred to as applied to score equating (ATSE). When ATSE was used for the curvilinear equating relationship, first-order equity held exactly only when the true equating relationship was linear or the conditional error variance at each true score was 0. There are some conditions under which first-order equity would typically be closer to being satisfied. As reliability for the new form increases, first-order equity is expected to become closer to being satisfied. When observed score equating was used, first-order equity is closer to being achieved when reliabilities are equal and are high. It was shown that second-order equity likely does not hold when ATSE is used. For second-order equity to be approximately satisfied when using ATSE, one of two different sets of conditions must be met. One set of conditions is high reliabilities together with a ratio of conditional error variances that is a constant equal to the ratio of true score variances. Also, second-order equity holds approximately for ATSE for tests that have similar, high reliabilities and have homoscedastic error variances. Similar conditions exist that make second-order equity more likely to hold approximately for observed score