Subscore equating with the random groups design


University of Iowa
Iowa Research Online
Theses and Dissertations

Spring 2016

Subscore equating with the random groups design
Euijin Lim, University of Iowa

Copyright 2016 Euijin Lim

This dissertation is available at Iowa Research Online:

Recommended Citation
Lim, Euijin. "Subscore equating with the random groups design." PhD (Doctor of Philosophy) thesis, University of Iowa, 2016.

Follow this and additional works at:
Part of the Education Commons

SUBSCORE EQUATING WITH THE RANDOM GROUPS DESIGN

by
Euijin Lim

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

May 2016

Thesis Supervisor: Associate Professor Won-Chan Lee

Copyright by
EUIJIN LIM
2016
All Rights Reserved

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Euijin Lim has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the May 2016 graduation.

Thesis Committee:
Won-Chan Lee, Thesis Supervisor
Robert L. Brennan
Michael J. Kolen
Timothy N. Ansley
Aixin Tan

ACKNOWLEDGMENTS

First, I would like to thank Dr. Won-Chan Lee for his support, patience, and guidance as my dissertation chair and academic advisor. He helped me prepare for and progress in the long journey of a PhD. I would also like to thank Dr. Robert Brennan for his insightful comments and suggestions on the dissertation. Throughout my time at CASMA, Dr. Lee and Dr. Brennan provided me with great opportunities to learn and pursue my research interests. I cannot emphasize enough how valuable the experience was for me. I am also very thankful to Dr. Michael Kolen, Dr. Timothy Ansley, and Dr. Aixin Tan for their helpful feedback that made the dissertation better.

Many thanks to my friends for all of the support that they have given me through the years. I thank my friends from the department for sharing not only their knowledge but also their time in Iowa City with me. When I was going through challenging times, they supported and encouraged me. I would also like to express my gratitude to my friends in Korea. Even though Korea is far from Iowa City, they have made me feel as though we have been together.

I need to thank my Mom, Dad, and Grandma. Whatever I might have accomplished would never have been possible without the many sacrifices my family had to make. Their phone calls of love, support, and encouragement filled me with warmth and energy. Finally, I thank my fiancé, Kookwon Yun, for so many reasons. I could not have completed this journey without him. He has patiently waited for me to complete my studies and supported me in every way every day. I am so happy to be on a new journey with him.

ABSTRACT

There is an increasing demand for subscore reporting in the testing industry. Many testing programs already include subscores as part of their score report or are considering plans to report subscores. However, relatively few studies have been conducted on subscore equating. The purpose of this dissertation is to address the necessity for subscore equating and to evaluate the performance of various equating methods for subscores. Assuming the random groups design and number-correct scoring, this dissertation analyzed two sets of real data and simulated data with four study factors: test dimensionality, subtest length, form difference in difficulty, and sample size. Equating methods considered in this dissertation were linear equating, equipercentile equating, equipercentile equating with log-linear presmoothing, equipercentile equating with cubic-spline postsmoothing, IRT true score equating using a three-parameter logistic model (3PL) with separate calibration (3PsepT), IRT observed score equating using 3PL with separate calibration (3PsepO), IRT true score equating using 3PL with simultaneous calibration (3PsimT), IRT observed score equating using 3PL with simultaneous calibration (3PsimO), IRT true score equating using a bifactor model (BF) with simultaneous calibration (BFT), and IRT observed score equating using BF with simultaneous calibration (BFO). They were compared to identity equating and evaluated with respect to systematic, random, and total errors of equating. The main findings of this dissertation were as follows: (1) reporting subscores without equating would provide misleading information in terms of score profiles; (2) reporting subscores without a pre-specified test specification would bring practical issues such as constructing alternate subtest forms with comparable difficulty, conducting equating between forms with different lengths, and deciding an appropriate score scale to be reported; (3) the best performing subscore equating method, overall, was 3PsepO

followed by equipercentile equating with presmoothing, and the worst performing method was BFT; (4) simultaneous calibration involving other subtest items in the calibration process yielded larger bias but smaller random error than did separate calibration, indicating that borrowing information from other subtests increased bias but decreased random error in subscore equating; (5) BFO performed the best when the test was multidimensional, the form difference was small, the subtest length was short, or the sample size was small; (6) equating results for BFT and BFO were affected by the magnitude of the factor loadings and variability for the estimated general and specific factors; and (7) smoothing improved equating results, in general.

PUBLIC ABSTRACT

There is an increasing demand for subscore reporting in the testing industry. Many test users believe that subscores provide more insight into students' strengths and weaknesses. Due to such demands, many testing programs already include subscores as part of their score report or are considering plans to report subscores. However, reported subscores might represent form differences in difficulty rather than a student's relative performance across subareas. The purpose of this dissertation is to address why equating, a statistical process that adjusts for possible differences in form difficulty, is required for subscores and to examine which equating method is preferred under various subtest conditions. The results showed that reporting subscores without equating can indicate a student's strengths and weaknesses incorrectly due to form differences in difficulty. It was also noted that reporting subscores that were not originally intended to be reported would raise numerous practical issues, which make it difficult to conduct equating and to assign meaning to subscores. The findings of this dissertation may help test developers make decisions on subscore reporting and equating and inform test users about how to interpret a score report that includes subscores.

TABLE OF CONTENTS

LIST OF TABLES ... viii
LIST OF FIGURES ... x
LIST OF NOTATIONS ... xi

CHAPTER I INTRODUCTION ... 1
    Subscore ... 2
    Previous Research on Subscores ... 3
    Practical Issues in Subscore Reporting ... 4
    Equating ... 6
    Equating Designs ... 6
    Traditional Equating Methods ... 8
    Equating with Item Response Theory ... 9
    Research Objectives ... 11
    Overview of Dissertation ... 12

CHAPTER II LITERATURE REVIEW ... 14
    Traditional Equating Methods ... 14
    Linear Equating Method ... 14
    Equipercentile Equating Method ... 15
    Smoothing Method ... 17
    Equating with Item Response Theory ... 23
    Unidimensional Item Response Theory ... 23
    Multidimensional Item Response Theory ... 24
    IRT Equating Methods ... 27
    Subscore ... 33
    Subscore Scoring ... 33
    Subscore Reporting ... 38
    Subscore Equating ... 40
    Summary of the Literature Review ... 41

CHAPTER III METHODOLOGY ... 43
    Study 1: Operational Data Analyses ... 43
    Data Preparation ... 43
    Equating Procedures ... 46
    Conversion to Scale Scores ... 48
    Evaluation Criteria ... 49
    Study 2: Simulated Data Analyses ... 51
    Factors Considered ... 51
    Simulation Procedures ... 54
    Equating Procedures ... 56
    Evaluation Criteria ... 57
    Study 3: Simulated Data Analyses for a Special Case ... 58
    Factors Considered ... 59
    Data Modification and Simulation Procedures ... 59
    Equating Procedures ... 60

    Evaluation Criteria ... 60

CHAPTER IV RESULTS ... 84
    Operational Data Analyses ... 84
    Research Question 1-1
    Research Question 1-2
    Research Question 1-3
    Simulated Data Analyses ... 94
    Research Question 2-1
    Research Question 2-2
    Research Question 2-3
    Research Question 2-4
    Research Question 2-5
    Conditional Results

CHAPTER V DISCUSSION
    Summary of Results
    Research Question 1-1
    Research Question 1-2
    Research Question 1-3
    Research Question 2-1
    Research Question 2-2
    Research Question 2-3
    Research Question 2-4
    Research Question 2-5
    Limitations and Future Research
    Conclusions and Implications

REFERENCES

LIST OF TABLES

Table 3.1. Characteristics of FL and PHY Tests ... 62
Table 3.2. Descriptive Statistics of Sample for FL and PHY ... 62
Table 3.3. Descriptive Statistics of Sample for FL Subtests ... 63
Table 3.4. Descriptive Statistics of Sample for PHY Subtests ... 63
Table 3.5. Correlation, Cronbach's Alpha, and Disattenuated Correlation for FL Subtests ... 64
Table 3.6. Correlation, Cronbach's Alpha, and Disattenuated Correlation for PHY Subtests ... 64
Table 3.7. Degree of Smoothing for Subtests ... 65
Table 3.8. Scale Score Conversion Table for FL Subtest 1 (Reading) ... 66
Table 3.9. Scale Score Conversion Table for FL Subtest 2 (Listening) ... 67
Table 3.10. Scale Score Conversion Table for PHY Subtest 1 (Classical Mechanics) ... 68
Table 3.11. Scale Score Conversion Table for PHY Subtest 2 (Electricity & Magnetism) ... 68
Table 3.12. Scale Score Conversion Table for PHY Subtest 3 (Waves & Optics) ... 69
Table 3.13. Scale Score Conversion Table for PHY Subtest 4 (Thermal Physics) ... 69
Table 3.14. Scale Score Conversion Table for PHY Subtest 5 (Fluid Mechanics) ... 69
Table 3.15. Scale Score Conversion Table for PHY Subtest 6 (Atomic & Nuclear Physics) ... 70
Table 3.16. Summary of Conditions for Study 2: Simulated Data Analyses ... 70
Table 3.17. Descriptive Statistics of M3PL Item Parameter Estimates for FL and PHY ... 71
Table 3.18. Descriptive Statistics of Generated Item Parameters for Subtest Length of
Table 3.19. Descriptive Statistics of Generated Item Parameters for Subtest Length of
Table 3.20. Descriptive Statistics of Generated Item Parameters for Subtest Length of
Table 3.21. Descriptive Statistics of Generated Item Parameters for Subtest Length of

Table 3.22. Central Moments for Generated Subscores Aggregated over 100 Replications When N = 3,000 and ρ =
Table 3.23. Summary of Conditions for Study 3: Simulated Data Analyses for a Special Case ... 76
Table 4.1. MAD for FL Subtests
Table 4.2. MAD for PHY Subtests
Table 4.3. Proportion of Examinees Whose Rank Orders Change for FL
Table 4.4. Proportion of Examinees Whose Rank Orders Change for PHY
Table 4.5. New Form NCEs Conversion Table for FL Subtest 1 (Reading)
Table 4.6. New Form Stanines Conversion Table for FL Subtest 1 (Reading)
Table 4.7. New Form NCEs Conversion Table for PHY Subtest 4 (Thermal Physics)
Table 4.8. New Form Stanines Conversion Table for PHY Subtest 4 (Thermal Physics)
Table 4.9. Summary Statistics for Equating Methods in Study
Table 4.10. Summary Statistics for Equating Methods in Study
Table 4.11. Summary Statistics under the Condition of Test Dimensionality in Study
Table 4.12. Summary Statistics under the Condition of Test Dimensionality in Study
Table 4.13. Mean and SD of BF a_g and a_s Estimates When N = 3,000 in Study
Table 4.14. Summary Statistics under the Condition of Subtest Length in Study
Table 4.15. Summary Statistics under the Condition of Subtest Length in Study
Table 4.16. Summary Statistics under the Condition of Form Difference in Difficulty in Study
Table 4.17. Summary Statistics under the Condition of Sample Size in Study
Table 4.18. Summary Statistics under the Condition of Sample Size in Study

LIST OF FIGURES

Figure 3.1. Form Y Score Distribution When N = 3,000 and ρ = .8 for the First Replication ... 77
Figure 3.2. Form X1 Score Distribution When N = 3,000 and ρ = .8 for the First Replication ... 78
Figure 3.3. Form X2 Score Distribution When N = 3,000 and ρ = .8 for the First Replication ... 79
Figure 3.4. Presmoothing Parameter Chosen for Form Y ... 80
Figure 3.5. Presmoothing Parameter Chosen for Form X
Figure 3.6. Presmoothing Parameter Chosen for Form X
Figure 3.7. Presmoothing Parameter Chosen for Form X
Figure 4.1. Difference Plot for FL Subtests
Figure 4.2. Difference Plot for PHY Subtests
Figure 4.3. FL Group Mean Profile for the New Form Group
Figure 4.4. PHY Group Mean Profile for the New Form Group
Figure 4.5. Bias, SEE, and RMSD When n = 5, form difference = .20, and N = 3,000 in Study
Figure 4.6. Bias, SEE, and RMSD When n = 30, form difference = .20, and N = 3,000 in Study
Figure 4.7. Bias, SEE, and RMSD When n = 30 and N = 3,000 in Study

LIST OF NOTATIONS

1PL        One-parameter logistic model
2PL        Two-parameter logistic model
3PL        Three-parameter logistic model
3PsepO     IRT observed score equating using 3PL with separate calibration
3PsepT     IRT true score equating using 3PL with separate calibration
3PsimO     IRT observed score equating using 3PL with simultaneous calibration
3PsimT     IRT true score equating using 3PL with simultaneous calibration
a          Item discrimination parameter
a (vector) Vector of item discrimination parameters
a_g        Item discrimination parameter for the general dimension
AIC        The Akaike Information Criterion
APSS       Approximate simple structure
a_s        Item discrimination parameter for the specific dimension
b          Item difficulty parameter
BF         Bifactor model
BFO        IRT observed score equating using BF with simultaneous calibration
BFT        IRT true score equating using BF with simultaneous calibration
Bias       Conditional bias
BIC        The Bayesian Information Criterion
C          Degree of polynomial log-linear presmoothing
c          Item pseudo-guessing parameter
CAIC       The Consistent Akaike Information Criterion
CINEG      Common item nonequivalent groups
d          Item intercept parameter
DTM        Difference that matters
EM         Expectation-maximization
F          The cumulative distribution function of Form X scores
f(x)       The observed relative frequency distribution for X
f̂(x)       The expected relative frequency distribution for X
FL         The AP French Language test
G          The cumulative distribution function of Form Y scores
g(y)       The observed relative frequency distribution for Y
ĝ(y)       The expected relative frequency distribution for Y
G²         The likelihood ratio chi-square
ICC        Item characteristic curve
ICS        Item characteristic surface
IRT        Item response theory
K          Number of items (the maximum score point)
m          Number of dimensions
M3PL       Multidimensional 3PL model
MAD        The weighted average of the absolute value of the differences
MDIFF      Multidimensional item difficulty index
MDISC      Multidimensional item discrimination index
MH-RM      Metropolis-Hastings Robbins-Monro
MIRT       Multidimensional item response theory
N          Number of examinees
n          Number of items
NCEs       The Normal Curve Equivalents
NORTA      NORmal To Anything
P          Percentile rank for X
P*         A given percentile rank
P^{-1}     The percentile function for X
PHY        The AP Physics test
PRMSE      The Proportion Reduction in Mean Square Error
Q          Percentile rank for Y
Q*         A given percentile rank
q          Quadrature point
Q^{-1}     The percentile function for Y
R          Number of replications
RG         Random groups
RMSD       Conditional root mean square deviation
S          Degree of cubic-spline postsmoothing
           Number of subtests
SEE        Conditional standard error of equating
TCC        Test characteristic curve
TCS        Test characteristic surface
UIRT       Unidimensional item response theory
w          Relative frequency of a score point
WB         Weighted bias
WRMSD      Weighted root mean square deviation
WSEE       Weighted standard error of equating
x*         An integer closest to x such that x* - .5 ≤ x < x* + .5
x_U        The smallest integer score with a cumulative percent [100F(x)] greater than P*
y_U        The smallest integer score with a cumulative percent [100G(y)] greater than Q*
α          Shape parameter of a beta distribution
β          Shape parameter of a beta distribution
γ          Angle in radians
θ          Ability parameter
θ (vector) Vector of ability parameters
θ_g        General ability parameter
θ_s        Specific ability parameter
λ          Factor loading
μ          Mean vector
ρ          Correlation
Σ          Variance-covariance matrix
ψ          The distribution of θ
ω          Parameter of the polynomial function

CHAPTER I
INTRODUCTION

When a test consists of several different content areas or constructs, scores on the subdomains are called subscores. Usually a test blueprint does not consider a situation where subscores are reported. However, there is a non-negligible demand for subscore reporting in the testing industry. Policy makers, admission officers, educators, and individual test takers want to receive subscores to make decisions about remediation or classification (Brennan, 2011; Haberman, 2008). Due to such demands, many testing programs already include subscores as a part of their score report or are considering plans to report subscores. For example, the College Board (2014) announced that, for their redesigned SAT, they are planning to report seven subscores to provide more insight into students' strengths and weaknesses. ACT (2014) also included three to ten subscores, called category scores, for each subject in the ACT Aspire score report.

Due to the high interest in subscores, many studies have been conducted to validate decisions about scoring and reporting of subscores. Ideally, a decision on subscore reporting should be made at a very early stage of test development; in practice, however, subscores are often added after a testing program has been defined and used. In such cases, the test specification does not include detailed requirements for subtests, so the number of items may differ between subtests and even between forms of the same subtest. This makes it difficult to define the meaning of reported subscores. Then, what is the best solution for subscore reporting in practice?

In order to facilitate a comparison of scores from different content areas, scores typically are expressed in a score profile, a common unit that has a norm-referenced meaning (Thorndike & Thorndike-Christ, 2009, p. 96). For example, Iowa Assessments provides an individual score profile for all subject level scores and subscores using the

national standard scores, the national grade equivalents, and the national percentile ranks (Iowa Testing Programs, n.d.). Putting subscores on the same scale is not enough. Once the decision to report subscores has been made, subscore equating is a necessary process to sustain the meaning of subscores across forms. This is because a test is commonly administered with multiple alternate forms, so equating, a statistical process that adjusts for possible differences in form difficulty, is required to ensure score comparability across forms. However, for a test not originally designed to report subscores, it is common to report proportion-correct subscores or percent subscores without equating. The main focus of this dissertation is to address the necessity for subscore equating and the practical issues related thereto. Further background about subscores, equating designs, equating methods, and the research questions is provided in the following sections.

Subscore

Subscores are scores for subareas of a test. The concept of a subscore is relative to the grain size, that is, the degree of content specificity of a test. Thus, the psychometric properties of subscores vary across tests depending on how the total score is defined. A test covering broad content areas has distinctive subtests, while a test on a very specific domain has subscores with subtle differences. For example, a teacher certification test may consist of four content areas such as language arts, mathematics, social studies, and science; on the other hand, a mathematics test would have four content areas, including algebra, geometry, statistics, and trigonometry, which are not as distinct as the four subtests of the first case. That is, subscores differ greatly in content distinctiveness and statistical properties from test to test. Compared to the total test, whose construct is broadly and generally defined, each subtest might focus on more specific constructs; sometimes, the construct of the total test can be defined as a composite of what is measured by the subtests, which are correlated with one another. In any case, subscores are

reported based on an assumption that they provide information that is at least somewhat different from the total score. This implies that a test reporting subscores is multidimensional.

Subtests have two salient properties: short length and relatedness with other subtests belonging to the same test. For subtests, which typically have a limited number of items, constructing alternate forms of comparable difficulty is challenging because form difficulty is substantially affected by the choice of each item (Stahl & Masters, 2009). Consequently, equating subscores can be challenging, too. The other property of subtests is that subscores are correlated with each other. Because a subtest consists of items which are a subset of the total test, subscores are correlated with the other subscores and the total score. The relatedness property makes it possible to use information from items in other subtests to stabilize scoring or equating for a target subtest.

Previous Research on Subscores

The strong demand for subscores has led to many studies about scoring and reporting of subscores. For scoring, many researchers have shown that, in order to estimate ability parameters for each subtest, using data for the total test gives better results, especially in terms of stability, than using the corresponding subtest data only (Ackerman & Davey, 1991; de la Torre & Song, 2009; Wainer et al., 2001; W.-C. Wang, Chen, & Cheng, 2004; Yao, 2010; Yao & Boughton, 2007; Yen, 1987).

Accurate scoring is not a sufficient condition for reporting subscores. Another factor to be considered is distinctiveness of the interpretation of subscores and profiles (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). Evidence is required that subscores provide additional information different from that provided by the total score. Recent research has investigated whether subscores are distinctive enough from the total

score to give additional information for individuals and has found that few subscores have added value over the total score (Haberman, 2008; Sinharay & Haberman, 2008; Sinharay, Puhan, & Haberman, 2011). Also, some studies focus on subscores reported at an aggregate level (Haberman, Sinharay, & Puhan, 2009; Puhan, Sinharay, Haberman, & Larkin, 2010; Sinharay & Haberman, 2008). These studies examined whether aggregated subscores reported to institutions give additional information over and beyond the total score and concluded that subscores rarely give additional information even at an institutional level.

Although many of the previous studies on subscore reporting have concluded that reporting subscores does not provide additional information over the total score, test users still want to receive subscores, and testing programs try to meet their needs. Until recently, researchers have mainly focused on scoring methods and on the decision about whether or not to report subscores. On the other hand, the practical issues that follow a decision to report subscores have been little studied.

Practical Issues in Subscore Reporting

When a score report includes subscores, the diagnostic information from subscores is conveyed in the form of a score profile so that users know their relative performance across subdomains. A score profile presents scores from a test battery, or subscores from a single test, on the same metric. Sometimes, a visualized graph or chart of subscores facilitating the interpretation is also called a score profile. Even for tests whose subscores are not on the same scale-score metric, it is common to report a bar chart or a line graph for the subscores that enables a comparison of performance across subareas.

Using the same scale for scores from different content areas and different numbers of items raises a variety of practical issues. For instance, it is hard to decide on an adequate number of score points for the score scale. More scale-score points than raw-score points may suggest more precision than actually exists in the test, while too few scale-score

points may lead to loss of much information (Petersen, Kolen, & Hoover, 1989). That is, a scale with many score points has gaps between score points and exaggerates a difference of just one item in raw scores; if scale scores have a much smaller number of score points than raw scores, a range of raw-score points is converted to the same scale-score point, and thus discrimination power is lost. Choosing a common scale becomes much more complicated if subtests have very different numbers of items.

Given such issues, reporting unequated subscores might seem a reasonable option; for example, fraction-correct raw scores denoted with the total number of items, such as 1/3, 2/3, and 3/3, would be less confusing to the users. However, there is a serious problem in reporting unequated subscores. A score profile based on unequated subscores could report an examinee's relative performance on different subtests in reverse. That is, if subscores are not equated, the relative strengths and weaknesses of examinees shown in score profiles do not solely show their proficiencies but reflect differences in subtest form difficulties as well. For example, a student's score profile for mathematics could say that he is better at algebra than at geometry when using unequated scores; however, his geometry score could become higher than his algebra score after equating. This may lead to substantial resources being spent inappropriately.

Only a few studies exist in the literature dealing with subscore equating. Sinharay and Haberman (2011) suggested equating methods with regressed subscores, when a target subtest is scored by borrowing information from the other subtests. Even though number-correct scoring is still the most widely used in operational settings, a thorough study dealing with subscore equating for number-correct scores has not been conducted. Moreover, the effect of equating on score profiles has not been studied yet.

Another issue involves the use of collateral information from other subtest items. Some scoring methods try to improve subscore reliability by borrowing information from other subtests (Ackerman & Davey, 1991; Kahraman & Kamata, 2004; Wainer et al., 2001; Yen, 1987). Such methods could introduce bias but produce better

estimates by reducing standard error. Collateral information can also be used for equating purposes. For example, Puhan and Liang (2011) proposed a subscore equating method using the total score as an anchor. If equating methods depend on collateral information from other subtests or the total test, the existence and extent of bias should be examined.

This dissertation attempts to show how equating influences score profiles using two sets of real data from an operational test. The tests differ in subtest length, relatedness between subscores, and the degree of parallelism between forms. The performance of various equating methods is evaluated using simulated data. Various factors including subtest length, correlation between subscores, form difficulty, and sample size are considered.

Equating

Equating is a statistical process to adjust scores from alternate forms for differences in difficulty (Kolen & Brennan, 2014, p. 2). The process of equating is used to secure score comparability across different forms, which is one of the criteria for score reporting required by the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Moreover, equating a newly constructed form to an old form that is linked to scale scores enables scores from the new form to be placed on the same scale and to have the same meaning. In order for subscores to be reported, equating should be performed for subscores to ensure fairness of scores across alternate forms and to sustain the meaning of the scores.

Equating Designs

Data collection is a starting point in equating. Because the fundamental step of equating is separating differences in scores into form differences and group differences, there should be something in common between the forms or the groups. The most widely used equating designs are the random groups (RG) design and the common item

nonequivalent groups (CINEG) design. Each design has its advantages and disadvantages.

This dissertation uses the RG design, which collects data from randomly equivalent groups of examinees. It assumes that the random groups of examinees are from the same population. A spiraling process randomly assigns examinees to alternate forms, so more than two forms can be administered and equated on a single administration date. Test administration with multiple forms could be a source of concern if test security is a sensitive issue. The virtue of this design is that statistical assumptions are minimal. Because the groups are randomly equivalent, it can be assumed that score differences come from form differences in difficulty, so relatively simple equating methods can be applied to data gathered under this design.

By contrast, form differences and group differences are confounded in the CINEG design. In this design, different forms are administered to groups of examinees from different populations. In order to separate form differences and group differences, the CINEG design requires the use of an anchor test, a set of items administered in or with both forms. Information from group performance on the anchor test is used to extract the degree of form differences in difficulty. Because group differences estimated from an anchor test are generalized to the entire forms (assuming that groups perform on the entire test as they do on the anchor test), it is crucial to construct an anchor test that is a representative sample of the whole test in content and statistical characteristics. Otherwise, using the anchor test would lead to biased equating results. In the subscore equating context, an anchor test corresponding to a subtest is typically too short to represent the subtest in content and statistical characteristics and to function appropriately as an anchor.
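As a concrete illustration of the spiraling idea described above, the following minimal Python sketch assigns examinees to alternate forms in handout order. The form labels match those used later in the simulation studies (Y, X1, X2), but the code itself, including the function name and the group-size check, is only an illustrative assumption, not part of the dissertation's procedures.

```python
from collections import Counter
import random

def spiral_assign(examinee_ids, forms):
    """Assign examinees to alternate forms by spiraling: the first examinee in
    handout order gets the first form, the second gets the second form, and so
    on, cycling through the forms."""
    return {eid: forms[i % len(forms)] for i, eid in enumerate(examinee_ids)}

# Hypothetical administration: 3,000 examinees and three alternate forms.
random.seed(1)
ids = list(range(1, 3001))
random.shuffle(ids)                    # handout/seating order is effectively random
assignment = spiral_assign(ids, ["Y", "X1", "X2"])

# Spiraling yields randomly equivalent groups of (nearly) equal size, so any
# difference in the observed score distributions can be attributed to form
# difficulty rather than group ability.
print(Counter(assignment.values()))    # -> 1,000 examinees per form
```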

Traditional Equating Methods

The focus of this section is on traditional equating methods for the random groups design. Traditional equating methods aim at finding an equating relationship between observed scores from the old and new forms. The equating relationship moves the statistical characteristics of the new-form score distribution closer to those of the old form.

Linear equating assumes a linear relationship between scores; i.e., differences in difficulty vary linearly along the score scale. This means that the location of the new-form score distribution can change, while the shape of the distribution is preserved, although it may be stretched or compressed. After linear equating, equated new-form scores have the same mean and variance as those for the old form. The linear equating function can be determined easily by estimating the means and variances of the two forms. However, there is typically an issue at extreme scores, where equated scores may exceed the possible score range or the highest intended scale score is not reached.

Equipercentile equating allows a curvilinear equating relationship to make the equated score distribution for the new form as similar as possible to the old-form score distribution. Equipercentile equating requires large sample sizes to estimate the frequency for every score point, but it is very flexible in handling differences in difficulty along the score scale. The equipercentile equating function defines equated scores to have the same cumulative density as old-form scores. Theoretically, the distribution of equated scores in equipercentile equating has the same central moments, including mean, variance, skewness, and kurtosis, as the old-form score distribution. However, because test scores are discrete, the cumulative density function and its inverse cannot be used directly and are replaced by percentile ranks and a continuization process. Such replacement yields departures in the moments, and these moment departures are likely to be exacerbated when test length is short (Kolen & Brennan, 2014, p. 45). Because equipercentile equating is one of the most widely used methods in operational

contexts, its performance in subscore equating should be investigated because subtest length is usually very short.

Equipercentile equating often involves bumpiness in observed-score distributions. Because observed-score distributions are expected to be smooth in the population, it is plausible to consider that the bumpiness is due to sampling error. Smoothing is introduced to render them smooth so as to reduce sampling error. There are two types of smoothing: presmoothing and postsmoothing. Presmoothing is applied to score distributions before equating, and postsmoothing to an equating relationship after equating. This dissertation uses polynomial log-linear presmoothing and cubic-spline postsmoothing, methods that have long been used and researched. They have been found to perform well, especially when sample size is small (Hanson, Zeng, & Colton, 1994; Liu, 2011; Livingston, 1993; Zeng, 1995). However, for a short test, the performance of smoothing is not obvious because the equating relationships can appear irregular even after smoothing (Kolen & Brennan, 2014, p. 65).

Equating with Item Response Theory

Item response theory (IRT) is based on modeling the probability of correctly answering an item as a function of item and ability parameters. The most important assumption in IRT is local independence, which means that item responses are independent of each other given item and ability parameters. This assumption cannot be met in a strict sense, because numerous factors affect item responses in reality, such as passages related to specific items, different sub-content areas, and testing time. To be defensible, it must be assumed that IRT models are robust with respect to violations of this assumption.

Unidimensional IRT (UIRT) models use a single latent variable to represent a person's proficiency, under the assumption that the test measures a single unidimensional proficiency. When a test measures multiple traits or skills, more complex IRT models might be needed to reflect the complexity. Multidimensional IRT

(MIRT) models are able to consider more complex situations than UIRT models (Reckase, 2009). IRT is widely used in the testing industry for many purposes such as form construction, scoring, and computerized adaptive testing. Another important use of IRT is equating. A preliminary step in IRT equating is item parameter estimation, which is hereinafter referred to as calibration. With item parameter estimates, IRT equating can be conducted in two different ways: IRT true score equating and IRT observed score equating. IRT true score equating uses the test characteristic curve (TCC), a relationship between true scores and ability parameters, to equate two forms. IRT observed score equating constructs a model-based fitted observed-score distribution based on the item parameter estimates for each form and conducts equipercentile equating using these fitted distributions. Various IRT models can be employed to build a TCC or a fitted distribution.

In terms of the calibration step, there are two possible approaches in IRT subscore equating. One is separate calibration for each subtest, which treats a subtest as an independent test and rules out a possible adverse effect of other subtests. The other approach is simultaneous calibration for the entire test, which borrows information from other subtests in some sense. Several studies using UIRT models have shown that using information from other subtest items improves proficiency estimation for the target subtest (Ackerman & Davey, 1991; Kahraman & Kamata, 2004; Kahraman & Thompson, 2011). However, the effect of simultaneous item calibration on equating has not been investigated.

For simultaneous calibration, it is important to consider whether to use a UIRT or MIRT model. If a test consists of multiple subtests but they are very similar to each other, using a UIRT model should provide acceptable results. For a test with very distinctive subtests, on the other hand, applying a MIRT model seems reasonable. In an operational

setting, using a UIRT model is common because MIRT models require larger response datasets and involve much more complexity. This dissertation compares IRT equating methods for subtests using both UIRT and MIRT models. Separate calibration and simultaneous calibration are also compared with each other. A more detailed description of IRT equating methods is provided in Chapter II.

Research Objectives

Subscore equating, as discussed in this dissertation, involves a distinctive issue that is not typically considered in total-score equating; namely, score profiles can change when equating is conducted. The possible changes in score profiles underscore the importance of subscore equating, which has not received much attention in the literature. Another issue is choosing the best equating method under various conditions. The relatedness property of subscores allows more complexity in equating methods with respect to the use of collateral information, and the performances of various equating methods can differ under conditions such as multidimensionality, test length, form difference, and sample size. For this dissertation, the RG design is used, and ten equating methods, including traditional and IRT equating, are compared with each other and against predefined criteria.

This dissertation consists of three studies that differ in purpose and data type. The first study involves applying various equating methods to operational datasets to show how score profiles change, particularly when compared with unequated results. Also, using the operational data, practical issues and concerns related to reported scale scores are discussed. The main purpose of the first study is to address the necessity for subscore equating. The second and third studies compare and evaluate the performance of various equating methods for subscores using simulated data. Different criteria are used in the second and third studies, which bring different issues to be discussed. The specific focus

of the second study is on examining the effect of important factors related to subtests on the equating results; the third study focuses more on evaluating equating methods in terms of bias. More specifically, this dissertation seeks to address the following research questions. The first study of this dissertation answers the first three questions, and the second and third studies address the remaining five questions.

1-1. How does identity equating (i.e., no equating) differ from other equating methods in subscore equating results?
1-2. How does subscore equating impact score profiles?
1-3. How does the choice of a score scale impact conversion tables?
2-1. In subscore equating, how do different equating methods compare in equating accuracy?
2-2. When the degree of test dimensionality varies, how do different equating methods compare in equating accuracy?
2-3. When subtest length becomes shorter, how do different equating methods compare in equating accuracy?
2-4. When form differences become larger, how do different equating methods compare in equating accuracy?
2-5. When sample sizes become larger, how do different equating methods compare in equating accuracy?

Overview of Dissertation

This dissertation consists of five chapters. Chapter I, the current chapter, describes practical issues related to subscores with some background information. Equating designs, a brief overview of equating methods, and research questions are also provided in Chapter I. Chapter II includes a detailed description of equating methods, and the literature review on subscores. Chapter III describes the methodology used to address the

research questions for the operational and simulated data analyses. Chapter IV presents results. Chapter V summarizes findings and implications for future research.

CHAPTER II
LITERATURE REVIEW

The current chapter provides background information on equating and subscores. This chapter is divided into three sections. The first and the second sections present equating methods under the RG design. In the first section, traditional equating methods, including linear equating, equipercentile equating, and smoothing methods, are presented in detail. The second section provides an overview of UIRT and MIRT models and of IRT equating methods. The third section provides the literature review on subscore scoring, reporting, and equating methods.

Traditional Equating Methods

A common feature of traditional equating methods is the use of observed-score frequencies rather than response patterns. The score moments or frequency distributions are used to construct an equating relationship. Because the statistical assumptions required under the RG design are minimal, the equating methods are relatively simple. This section provides the definitions and properties of linear and equipercentile equating. Smoothing, a process employed before or after equipercentile equating to reduce sampling error, is also described. This section follows the notation used in Kolen and Brennan (2014).

Linear Equating Method

The linear equating method linearly transforms raw scores to the old-form scale. That is, the differences in difficulty linearly increase or decrease along the score scale. In order to determine the slope and intercept of the linear equating relationship, linear equating uses the first two moments from each of the old- and new-form score distributions. Here the old form is denoted as Y and the new form as X; then linear equating is defined by the following formula:

    l_Y(x) = [σ(Y)/σ(X)] x + μ(Y) - [σ(Y)/σ(X)] μ(X),    (2.1)

where l_Y(x) refers to the equated score from a new-form score x to the old-form scale. As shown in Equation (2.1), the slope and intercept are expressed as a function of means and standard deviations. After conducting linear equating, the new-form equated scores have the same mean and the same standard deviation as those for the old-form scores.

If the old and new forms have the same standard deviation, Equation (2.1) shortens to

    m_Y(x) = x + [μ(Y) - μ(X)],    (2.2)

which is referred to as mean equating; it adjusts scores using only the mean difference between the two forms. After mean equating, the equated new-form scores have the same mean as the old-form scores, but the standard deviation remains unchanged. Equation (2.2) can be shortened even further when the old and new forms have the same mean in addition to the same standard deviation. In such a case, new-form scores are equivalent to old-form scores without any adjustment:

    i_Y(x) = x.    (2.3)

This is called identity equating, which means no equating. Proportion-correct raw scores or normalized scores without equating are commonly reported for subscores.

Equipercentile Equating Method

It is not always plausible to assume that form differences in difficulty change linearly along the score scale. Equipercentile equating is a powerful and refined method for adjusting the new-form scores to have the same distribution as the old-form scores through a nonlinear transformation. As the name implies, after equipercentile equating is employed, examinees with the same percentile rank receive the same score regardless of form. Let F and G refer to the cumulative distribution functions of Form X scores and Form Y scores, respectively. Then, equipercentile equating is defined as follows:

    e_Y(x) = G^{-1}[F(x)],    (2.4)

where G^{-1} is the inverse function of G. As shown in Equation (2.4), finding an equipercentile equivalent of a new-form score x is a two-step process: find F(x), the proportion of examinees getting a score equal to or below x in the Form X score distribution, and then find the Form Y score point with the same cumulative proportion of examinees using G^{-1}. However, there is a challenge that has to be solved: test scores are typically discrete, and finding an old-form score point with exactly the same percentile rank is rarely possible. Using percentiles and percentile ranks with a uniform continuization is a traditional solution. This approach treats a score x as a representation of the score range [x - 0.5, x + 0.5], and the frequency of x is assumed to be uniformly distributed over that range. Then P(x), the percentile rank of a (possibly non-integer) score x, is computed by using the percentile rank function defined as follows:

    P(x) = 100{F(x* - 1) + [x - (x* - .5)][F(x*) - F(x* - 1)]},   for -.5 ≤ x < K_X + .5,
         = 0,   for x < -.5,
         = 100,   for x ≥ K_X + .5,    (2.5)

where x* is the integer closest to x such that x* - .5 ≤ x < x* + .5, F(x) is the discrete cumulative distribution function of Form X scores, and K_X is the number of items on Form X. The inverse of Equation (2.5), also called the percentile function, is used to find the score corresponding to a given percentile rank. For a given percentile rank P*, the percentile is

    x_U(P*) = P^{-1}[P*]
            = [P*/100 - F(x_U - 1)] / [F(x_U) - F(x_U - 1)] + (x_U - .5),   for 0 ≤ P* < 100,
            = K_X + .5,   for P* ≥ 100,    (2.6)

where x_U refers to the smallest integer score with a cumulative percent [100F(x)] greater than P*. An alternative function using x_L, a lower bound integer, is provided in Kolen and Brennan (2014, p. 43) with examples.
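To make the continuization mechanics concrete, below is a minimal Python sketch of the linear equating function in Equation (2.1) and the percentile rank and percentile functions in Equations (2.5) and (2.6). The function names and the array-based representation of the score distribution are illustrative assumptions, not part of the dissertation.

```python
import numpy as np

def percentile_rank(x, rel_freq):
    """Percentile rank function of Equation (2.5).
    rel_freq[k] is the relative frequency of integer score k = 0, ..., K."""
    K = len(rel_freq) - 1
    F = np.cumsum(rel_freq)                    # discrete cdf F(0), ..., F(K)
    if x < -0.5:
        return 0.0
    if x >= K + 0.5:
        return 100.0
    x_star = int(np.floor(x + 0.5))            # x*: integer with x* - .5 <= x < x* + .5
    F_prev = F[x_star - 1] if x_star >= 1 else 0.0
    return 100.0 * (F_prev + (x - (x_star - 0.5)) * (F[x_star] - F_prev))

def percentile(p_star, rel_freq):
    """Percentile (inverse) function of Equation (2.6)."""
    K = len(rel_freq) - 1
    if p_star >= 100.0:
        return K + 0.5
    F = np.cumsum(rel_freq)
    x_u = int(np.argmax(100.0 * F > p_star))   # smallest integer score with 100F(x) > P*
    F_prev = F[x_u - 1] if x_u >= 1 else 0.0
    return (p_star / 100.0 - F_prev) / (F[x_u] - F_prev) + (x_u - 0.5)

def linear_equate(x, mu_x, sigma_x, mu_y, sigma_y):
    """Linear equating function of Equation (2.1)."""
    return (sigma_y / sigma_x) * x + mu_y - (sigma_y / sigma_x) * mu_x

# Applying percentile_rank to the Form X distribution and percentile to the
# Form Y distribution composes Equation (2.4): e_Y(x) = Q^{-1}[P(x)].
# e_y = percentile(percentile_rank(x, rel_freq_x), rel_freq_y)
```

Composing the percentile rank function for Form X with the percentile function for Form Y, as in the final comment, yields the discrete equipercentile function that is formalized next in Equation (2.7).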

Using the percentile rank function and the percentile function, Equation (2.4) is redefined as follows:

    e_Y(x) = Q^{-1}[P(x)]
           = [P(x)/100 - G(y_U - 1)] / [G(y_U) - G(y_U - 1)] + (y_U - .5),   for 0 ≤ P(x) < 100,
           = K_Y + .5,   for P(x) = 100,    (2.7)

where Q^{-1} is the percentile function of Form Y, G(y) is the discrete cumulative distribution function of Form Y scores, y_U is the smallest integer score with a cumulative percent [100G(y)] greater than P(x), and K_Y is the number of items on Form Y. An alternative function using y_L, a lower bound integer, produces the same results as long as the Form Y score distribution has non-zero probabilities for all possible score points.

The intended outcome of equipercentile equating is that the new form has the same score distribution as the old form; after equipercentile equating, not only the mean and the standard deviation but every moment of the two forms should, in theory, be the same. However, the distribution of equated scores is not exactly identical to the old-form distribution because scores are discrete, which yields differences in moments between the two forms. Moment differences would not be serious when test length is reasonably long, but they might be problematic in a short test such as a subtest, as shown in Kolen and Brennan (2014, p. 46).

Smoothing Methods

Even though Equations (2.1) through (2.7) are expressed in terms of parameters for the population, in practice the parameters have to be estimated using samples. Sample size matters particularly in equipercentile equating because it requires percentile ranks to be estimated for every score point. Sometimes the score distributions and/or the equipercentile equating relationships based on samples are irregular; the irregularity may be attributed to sampling error (i.e., random error), because using larger samples produces smoother results. Smoothing is a procedure which renders observed score distributions or

an equating relationship as smooth as they would be in the population. Notwithstanding the risk of systematic error (i.e., bias) introduced by smoothing, it is hoped that smoothing reduces random error so that the final equating results are closer to the population equating relationship. There are two types of smoothing, differentiated by when and where it is conducted: presmoothing is applied to observed score distributions before equating; postsmoothing is applied to equipercentile equivalents after equating. This section overviews the polynomial log-linear presmoothing and cubic-spline postsmoothing methods, which have been widely used and well researched.

Polynomial Log-Linear Presmoothing

Presmoothing is used to estimate the population density of scores based on observed relative frequencies. The estimated score distributions are expected to be as smooth as they would be in the population. The polynomial log-linear presmoothing method fits a polynomial model to the log of the sample density in order to find a smooth estimate of the distribution. The log of the expected density, expressed as a polynomial of degree C, is

    log[N f̂(x)] = ω_0 + Σ_{k=1}^{C} ω_k x^k,    (2.8)

where N is the number of examinees, f̂(x) is the expected relative frequency distribution, and ω_0, ω_1, ..., ω_C are the parameters of the polynomial function, estimated by the method of maximum likelihood. The value of C determines the number of central moments preserved in the fitted distribution; for example, if C = 2, the first two moments (mean and variance) are preserved; if C = 4, the first four moments (mean, variance, skewness, and kurtosis) are preserved. That is, the higher the value of C, the more closely the fitted distribution resembles the observed distribution. The selection strategies for the value of C are divided into two classes: statistical significance tests and the principle of parsimony (Moses & Holland, 2009). Because a