
IRT-Based Assessments of Rater Effects in Multiple Source Feedback Instruments

Michael A. Barr
Nambury S. Raju
Illinois Institute of Technology

RUNNING HEAD: IRT-Based Assessments

Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, California, April 28, 2001. Correspondence should be sent to Michael A. Barr (barrmic@charlie.cns.iit.edu) or Nambury S. Raju (raju@iit.edu), Institute of Psychology, Illinois Institute of Technology, Chicago, Illinois.

Abstract

This study compared three IRT-based models of assessing measurement equivalence in 360° feedback: the traditional DIF methodology, Muraki's (1993) Rater's Effect Model, and Patz, Junker, and Johnson's (1999) Hierarchical Rater Model. Using data from 491 managers collected on the Benchmarks instrument, we found that the traditional DIF methodology provides the most information about the rater's conception of the ratee's ability, whereas the other two models provide explicit estimates of rater leniency/severity. We also found that rater source effects of leniency and severity may have little practical impact. The different results and conclusions produced by each model are discussed.

IRT-Based Assessments of Rater Effects in Multiple Source Feedback Instruments

The use of multi-source feedback (MSF) instruments has increased in recent years, partly due to the belief that obtaining performance ratings from multiple sources increases the usefulness of the feedback to the ratee (London & Smither, 1995). The most common format of MSF is the 360° feedback instrument, where ratees are provided with ratings from several sources. Sources of ratings on a 360° feedback instrument almost universally include subordinates, peers, and at least one supervisor, and sometimes include customers or vendors. Self-ratings are also frequently included. Ratings from these various sources are usually reported separately to the ratee, because the sources frequently disagree on the ratings (for example, Conway & Huffcutt, 1997; Harris & Schaubroeck, 1988). When ratings lacking in agreement are presented to the ratee for the purpose of personal development, the bewildered ratee might well ask, "Which ratings reflect my actual performance?"

Recently, the focus of research surrounding 360° feedback has been on examining the relationship between self-other agreement and effectiveness (Atwater, 1998; Atwater, Ostroff, Yammarino, & Fleenor, 1998; Van Velsor, Taylor, & Leslie, 1993; Yammarino & Atwater, 1993; Yammarino & Atwater, 1997). These studies generally appear to lead to the conclusion that there is a complex predictive relationship between self-other agreement discrepancies and managerial effectiveness. It is critical to recognize that, when self-other rating discrepancies are calculated, a strong assumption must be made that the performance ratings from the various sources are rendered on a common psychological metric; in other words, that they have measurement equivalence and can be directly compared.

Empirical studies that compare scores across raters for a single employee frequently use scale- or test-level scores. For example, Atwater et al. (1998) averaged across 16 scales to obtain a single score. If the assumption that the scores are all on the same metric is violated, the conclusions drawn from such studies may not be valid. It is good practice, then, to check for violations of this assumption before providing performance feedback through the MSF process. In practice, feedback from 360° instruments is typically provided to the ratee at the item level (London & Smither, 1995). This means that confirming measurement equivalence at the scale level is not sufficient for the purpose of comparing MSF across raters. The assumption must be made that there is measurement equivalence between sources not only at the scale level, but at the item level as well. The purpose of this investigation is to compare three different item response theory (IRT)-based methods for assessing measurement equivalence, at the item level as well as at the scale level, of a 360° feedback assessment.

Rater Agreement

The differences that appear between rating sources in MSF instruments are generally viewed as disagreement between raters. Discrepancies in ratings between sources are viewed as useful information (London & Smither, 1995), but real differences must be separated from artifactual differences before using that information. There is some question as to how much agreement between raters can be expected.

Kenny (1991) suggests that the degree to which raters will agree on a given ratee's performance can be modeled as a function of six parameters: (1) the number of behaviors each rater observes, (2) the proportion of ratee behaviors that overlap between raters, (3) the degree to which the ratee's behaviors are consistent across raters, (4) the correlation between the two raters' scale values for the same acts or behaviors, (5) the weight assigned by a particular rater to the unique impression that the ratee's behaviors have produced, and (6) the degree to which the raters influence each other's ratings. Kenny implies that raters can be used as the psychometric equivalents of items only when three conditions are met. First, the raters must each view a non-overlapping set of behaviors. Second, the unique impressions each rater has must have no weight. Finally, raters must be rating on the same psychological metric. Essentially, Kenny's model says that real differences in rating scores reflect differences in observed behavior only when measurement equivalence between raters exists, raters are calibrated to each other, and raters have ample opportunity to independently sample ratees' behaviors.

According to Drasgow and Kanfer (1985), measurement equivalence in a psychometric sense exists when the relationship between observed scores and latent constructs is the same across groups. Although the literal definition of measurement equivalence remains the same when applying it to MSF, the meaning becomes quite different. When different raters assign scores to the same individual, the actual ability of the rated individual remains the same regardless of the rater assigning the score, but the underlying construct being estimated in this model is not the ratee's proficiency or ability, but rather the rater's conception of the ratee's proficiency. Landy and Farr (1980) called these constructs "implicit personality theories" of raters (p. 97); they may be similar to Kenny's (1991) weighted unique impressions. The ratee's standing on the underlying construct of performance proficiency does not change from rater to rater, but the method of data collection, i.e., the rater, does (although the instrument does not).

Measurement Equivalence

Studies that have examined the measurement equivalence between sources of performance ratings (and therefore, the appropriate use of between-source score comparisons) have used both IRT-based differential item functioning (DIF) methodology and simultaneous confirmatory factor analytic (SCFA) models (for a direct comparison of these methods, see Maurer, Raju, & Collins, 1998). Both of these models are based on methodology developed to assess between-group measurement equivalence when the objects of measurement (persons) may be drawn from different sub-populations. They are effective in detecting DIF when the sources of error variance are limited to the item and the group membership of the person who is the object of assessment.

In the current study, three different IRT-based methods of assessing measurement equivalence or DIF are investigated. Each of these methods incorporates some of the unique features associated with 360° feedback assessments. For example, although each manager may be characterized by a unique level of proficiency, one's conception of this unique proficiency may vary from one source to the next; that is, the Boss's conception of a manager may be different from that of a Peer or a Direct Report (DR). Furthermore, even when the conceptions are the same, some raters are more lenient or severe than others. Some of these special features of 360° feedback assessments are incorporated into the three IRT-based methods of assessing between-source measurement equivalence described below.

Traditional DIF Method

In this method, the feedback assessment data are calibrated separately for each source, using a polytomous IRT model.

That is, within source, a separate estimate of proficiency (θ) for each manager is obtained; also, separate estimates of item parameters are generated for each source. Because the data from each source yield a separate proficiency estimate, these estimates may be viewed as reflecting the raters' conceptions of a manager's true proficiency. Therefore, the inter-correlations between these proficiency estimates may be thought of as reflecting the degree to which the conceptions of a manager's proficiency from two different sources converge, while allowing for measurement error. These estimates of a manager's proficiency may also reflect a rater's tendency for leniency or severity; with the traditional DIF method, there is no way to separate the rater's conception of a manager's proficiency from his/her tendency for leniency/severity.

Muraki's (1993) generalization of Masters' partial credit model (Masters, 1982) was used in this study. The underlying structure for this generalization is a two-parameter logistic model, which may be expressed as

P^{*}_{jk}(\theta_{ir}) = \frac{e^{D a_{jr} (\theta_{ir} - b_{jkr})}}{1 + e^{D a_{jr} (\theta_{ir} - b_{jkr})}},   (1)

where P^{*}_{jk} is the probability of rater r assigning manager i to category k over category (k - 1) on a polytomous item j. In the above equation, D and e are constants, \theta_{ir} is rater r's conception of manager i's proficiency, and a_{jr} and b_{jkr} are the item parameters for item j from rating source r. Two important aspects of this equation should be mentioned: (1) a manager's proficiency estimate varies from one source to the next, reflecting both the conception and severity/leniency effects; (2) item parameters are allowed to vary from item to item and from source to source.

In the case of a polytomous item with m categories, there will be one a parameter and (m - 1) b parameters (Muraki & Bock, 1996). Following Muraki (1993), the probability of manager i being assigned to category k by rater r on item j in the generalized partial credit model may be expressed as

P_{jk}(\theta_{ir}) = \frac{\exp\left[\sum_{f=0}^{k} a_{jr}(\theta_{ir} - b_{jfr})\right]}{\sum_{x=0}^{m_j} \exp\left[\sum_{f=0}^{x} a_{jr}(\theta_{ir} - b_{jfr})\right]},   (2)

where x = 0, 1, ..., m_j and k = 0, 1, ..., m_j.

Muraki's (1993) Rater's Effect Model

The probability of being assigned to category k in Muraki's rater's effect model may be expressed as

P_{jk}(\theta_{i}) = \frac{\exp\left[\sum_{f=0}^{k} D a_{j}(\theta_{i} - b_{jf} + \rho_{r})\right]}{\sum_{x=0}^{m_j} \exp\left[\sum_{f=0}^{x} D a_{j}(\theta_{i} - b_{jf} + \rho_{r})\right]}.   (3)

There are three important differences between the rater's effect model (Equation 3) and the traditional DIF model (Equation 2). First, the estimates of a manager's proficiency do not vary from source to source in the rater's effect model. That is, only a single theta is estimated, meaning that the source-based conceptions of a manager's proficiency are not reflected in the estimation of theta. Second, the item parameters do not vary from source to source. Third, Equation 3 reflects a rater's tendency for severity/leniency. This is denoted by \rho_{r}, and it is independent of items.
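To make Equations 2 and 3 concrete, the following Python sketch computes category response probabilities for a single polytomous item under the generalized partial credit model, with an optional rater shift term. It is a minimal illustration of the parameterization written above, not the PARSCALE implementation used later in this paper; the function name, item parameters, and theta value are hypothetical.

import numpy as np

D = 1.7  # scaling constant, as written in Equations 1 and 3

def gpcm_category_probs(theta, a, b_steps, rho=0.0, use_D=False):
    """Category probabilities P_jk(theta), k = 0, ..., m_j, for one item.

    theta   : proficiency (a rater's conception of it, in Equation 2)
    a       : item slope (a_jr when calibrated separately by source, a_j otherwise)
    b_steps : step parameters b_j1, ..., b_jm; the f = 0 term cancels in the ratio
    rho     : rater shift; 0 corresponds to Equation 2, nonzero to Equation 3
    use_D   : include the constant D, as in Equation 3
    """
    scale = D if use_D else 1.0
    b = np.concatenate(([0.0], np.asarray(b_steps, dtype=float)))
    # cumulative sums over f of  a * (theta - b_jf + rho)
    cumulative = np.cumsum(scale * a * (theta - b + rho))
    numerators = np.exp(cumulative)
    return numerators / numerators.sum()

# Hypothetical four-category item (three step parameters)
p_eq2 = gpcm_category_probs(theta=0.5, a=1.2, b_steps=[-1.0, 0.0, 1.0])
p_eq3 = gpcm_category_probs(theta=0.5, a=1.2, b_steps=[-1.0, 0.0, 1.0],
                            rho=0.4, use_D=True)
print(p_eq2.round(3), p_eq3.round(3))  # each set of probabilities sums to 1

With rho = 0 the function reduces to Equation 2; the nonzero-rho call shows how a single shift term moves all of an item's category thresholds together, which is why, as discussed later, the rater's effect model can produce only uniform differences between sources.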

In summary, the traditional DIF model appears to reflect a rater's conception of a manager's proficiency; it also reflects the interaction between the rating source and the item by allowing different item parameters for each source. The rater's effect model, on the other hand, clearly and separately accounts for the rater's tendency to be either lenient or severe, whereas such an effect is confounded with the rater's conception of a manager's proficiency in the traditional DIF model.

The HRM Model

The hierarchical rater model (HRM) of Patz, Junker, and Johnson (1999) conceptualizes the rating process as having two distinct phases, and evaluates each phase separately. In the first phase, the examinee generates a behavior (for example, produces a work sample). The proficiency exhibited by this observable performance is presumably a product of the underlying latent proficiency of the examinee, or θ. This performance has an ideal rating ξ, which is also an unobserved variable. In the second phase, the rater places an observable value k on the performance. If the rater is accurate, k = ξ. Theoretically, ξ can be thought of as equivalent to the ratee's expected true score.

The HRM process employs traditional IRT methods (here, Masters' partial credit model) to estimate parameters. What makes the HRM different from the Rater's Effect model is that, rather than estimating the rater shift parameter ρ as a component of the b parameter, the HRM estimates rater shift as a function of the probability that the observed rating given by the rater is equal to the test taker's ideal rating. Using Markov chain Monte Carlo (MCMC) simulations, the HRM estimates the ideal rating ξ_ij for each item for each examinee, conditioning this estimate on the probability mass generated by the matrix of all observed rating-by-item-by-person combinations and the assumed distributions of rater shift and rater reliability (see Patz & Junker, 1999a, for a detailed explanation of MCMC).

At this point the ideal rating estimates ξ_ij are conditionally independent of the observed ratings. The ideal rating estimates are then used in the GPCM to calculate ability parameters θ for each examinee. The probability of a rater successfully detecting the ideal rating is described in a transition matrix having a unimodal discrete distribution in each row, with the mode at the ideal rating. This model is shown in Table 1.

Insert Table 1 About Here

Similar to the Rater's Effect model, HRM estimates of a manager's proficiency do not vary from source to source; therefore, only a single theta is estimated, and the item parameters do not vary from source to source. The rater's effect parameter is similarly independent of items. The HRM also generates an index of rater reliability.

Establishing Measurement Equivalence

IRT-based assessment of DIF fundamentally involves the pairwise comparison of the item response functions from each group (Hambleton, Swaminathan, & Rogers, 1991). An item is said to be biased or to function differentially (have DIF) when a member of one group has a different probability of getting an item correct (or choosing a particular response category) than a member of another group, when both members have the same amount of proficiency or ability. DIF can be examined at the item level or at the scale level. The differential functioning of items and tests can be examined with Raju, van der Linden, and Fleer's (1995) DFIT framework. Within this framework, indices of DIF are provided at the item level (denoted NCDIF) and at the test/scale level (denoted DTF).

At the item level, the NCDIF index reflects the degree to which item-level true scores vary between sources across the theta scale. Similarly, the DTF index measures the degree to which scale-level true scores vary from one source to the next across the theta scale. When there is significant DTF, the DFIT program deletes items, one item at a time, until DTF becomes non-significant. This study employs the DFIT procedure across methodologies to assess differential item and test functioning. For additional information about the DFIT framework, please refer to Raju et al. (1995).

In summary, the question addressed in this study is: Will these three methods of isolating and identifying rater source effects (measurement equivalence using the DFIT framework across the traditional DIF method, the Rater's Effect Model, and the HRM) each provide different (or conflicting) information on the same data when applied to the analysis of MSF ratings from boss, self, peer, and direct reports? This exploratory investigation will compare the results obtained from analyzing the same data three different ways and discuss the possible impact of drawing different conclusions from the data based on the method used for analysis.
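For readers who want to see the mechanics, the DFIT indices described above can be written in a few lines of Python. The sketch below follows the usual definitions in the DFIT framework: NCDIF is the mean squared difference between focal- and reference-based expected item scores evaluated at the focal group's theta estimates, CDIF is the mean product of the item-level and test-level differences, and DTF is the mean squared test-level difference (the sum of the CDIF values). This is a simplified illustration, not the DFIT6PM program used in this study; the expected-score functions and the toy data are hypothetical.

import numpy as np

def dfit_indices(theta_focal, expected_focal, expected_reference):
    """NCDIF, CDIF, and DTF in the spirit of Raju et al. (1995).

    theta_focal        : focal-group theta estimates, shape (n,)
    expected_focal     : callable(theta) -> (n, items) expected item scores
                         computed with focal-group item parameters
    expected_reference : the same, computed with equated reference parameters
    """
    d = expected_focal(theta_focal) - expected_reference(theta_focal)  # (n, items)
    D_total = d.sum(axis=1)                        # test-level difference per person
    ncdif = np.mean(d ** 2, axis=0)                # one index per item
    cdif = np.mean(d * D_total[:, None], axis=0)   # compensatory index per item
    dtf = np.mean(D_total ** 2)                    # scale-level index; equals cdif.sum()
    return ncdif, cdif, dtf

# Toy usage with made-up expected-score functions for two four-category items
rng = np.random.default_rng(0)
theta = rng.normal(size=500)
es_focal = lambda t: np.column_stack([3 / (1 + np.exp(-(t - 0.3))),
                                      3 / (1 + np.exp(-t))])
es_reference = lambda t: np.column_stack([3 / (1 + np.exp(-t)),
                                          3 / (1 + np.exp(-t))])
ncdif, cdif, dtf = dfit_indices(theta, es_focal, es_reference)
# Items would be flagged against cutoffs such as the .054 NCDIF value for
# four-category items quoted in the Method section below.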

Method

The exploration and comparison of the results obtained from data analysis using the methodologies described above comprised five steps. First, because all methods to be compared were based on IRT, it was necessary to test for unidimensionality. Second, the same data set was analyzed three different ways using Raju et al.'s (1995) DFIT method: using the traditional DIF approach, using Muraki's (1993) Rater's Effect Model, and using Patz et al.'s (1999) Hierarchical Rater Model. The final step was to juxtapose the various results and compare the conclusions that would be drawn from each method.

Sample

Data were collected from managers who participated in leadership development programs in which a common 360° feedback instrument was used. Our sample consisted of 491 managers. Each manager had one rating from the Boss (B), one rating from a Peer (P), one rating from a Direct Report (DR), and a rating from himself/herself (S). No information on age, ethnicity, or gender was provided in this archival data set for the participants. Two categories of job information were provided: (1) Position (staff or line), and (2) Title (seven levels in order of increasing responsibility from supervisor to corporate officer). A cross-tabulation of these demographics is presented in Table 2.

Insert Table 2 About Here

The Benchmarks Survey

The common 360° feedback survey used in the manager workshops from which the sample was drawn was the Benchmarks survey (Lombardo & McCauley, 1990).

Benchmarks is a multisource feedback instrument used to evaluate managerial strengths and weaknesses. It is divided into two sections. Table 3 describes how the survey is organized.

Insert Table 3 About Here

As scales with more items produce less biased ability estimates (Hambleton, Swaminathan, & Rogers, 1991), and a minimum of ten items is required in a scale to generate satisfactory fit statistics (Mislevy & Bock, 1990), only scales with more than ten items were selected for examination. Data from four scales from Section 1 were selected for evaluation, as these scales (and only these scales) have more than ten items each, making them suitable for IRT analysis (particularly DFIT). The four scales used here were Scale 1 ("Resourcefulness"), which contains 17 items; Scale 2 ("Doing Whatever It Takes"), which contains 14 items; Scale 5 ("Leading Employees"), which contains 13 items; and Scale 10 ("Building and Mending Relationships"), which contains 11 items.

Methodology

Data Preparation. In a preliminary review of the suitability of the data for this investigation, it was noted that none of the managers received a rating of 1 on some of the items. Since it was not possible to successfully complete the IRT calibrations with zero frequencies in the end categories, Categories 1 and 2 were combined into a single category for all items. This action reduced the number of usable categories from 5 to 4 for all items in the four scales. Further, this review of the item data for the four scales showed that some managers had missing values as ratings on some items. Deleting such cases would have resulted in substantial reductions in the sample size.

Based on preliminary item analysis, the typical mean rating of 3 was substituted for a missing item rating in each of the four scales.

Assessing Unidimensionality. To evaluate the assumption of local independence required by IRT, the unidimensionality of each scale by source (for a total of 16 data sets) was assessed by two methods: principal components analysis (PCA) and confirmatory factor analysis (CFA). As Likert-type scales yield ordinal-level data, polychoric correlation matrices were used for both unidimensionality analyses. For the CFA (see below), the weighted least squares option was used because it is appropriate when polychoric correlation matrices are analyzed together with their asymptotic covariance matrices as input.

Traditional DIF Analysis. The DFIT analysis for the traditional DIF model was performed according to the procedure outlined in Collins, Raju, and Edwards (2000). In this study, we used parameters generated with Muraki's (1992) Generalized Partial Credit Model. First, items were calibrated separately for each scale by source using PARSCALE 3.2 (Muraki & Bock, 1997). The calibration function in PARSCALE allows for either Samejima's (1969) graded response model or Masters' (1982) partial credit model. For this analysis, in order to be consistent across the three models being compared, Masters' model (actually the 2PL model developed by Muraki, 1992) was used.

When comparing independently calibrated parameters of the same instrument, it was necessary to put the parameters on a common metric. This is known as equating. The process of equating requires that one group be chosen as the reference group (usually the majority group in a DIF analysis) and one group be chosen as the focal group (usually the minority group in a DIF analysis).

As noted above, Other ratings tend to agree with each other more than they do with Self ratings, so for this analysis, Self ratings were used as the focal group in each of three sets of comparisons with Boss, Peer, and Direct Report. Each set consisted of all items from Scale 1, Scale 2, Scale 5, and Scale 10. This process generated 12 sets of linear transformation values that were used to equate the data and facilitate the evaluation of each focal/reference pair for differential item and test functioning. For this purpose, the computer program EQUATE 2.1 (Baker, 1995) was used.

The final step in the traditional DIF analysis was to compare the equated parameters and test the differences in the item parameters for statistical significance. This was done using the DFIT6PM program created by Raju (2001a). This program provides DIF indices at the item level (NCDIF) and the test/scale level (DTF). An item was considered to have NCDIF when this value exceeded the suggested cutoff for the number of options in the item. For a four-option item, this value is .054. The cutoff for DTF is the cutoff for NCDIF multiplied by the number of items retained in the scale. For example, using .054 as the cutoff value for NCDIF, the DTF cutoff for a ten-item scale would be .54.

The Rater's Effect Model. To perform the RE model analysis, we used the Rater's Effect option of PARSCALE 4.0 (Muraki & Bock, 2001) to compare the four rating sources as raters in the model. Parameter estimates were then used to calculate CDIF and NCDIF using DFITP6MR (Raju, 2001b).
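As an aside on the equating step, the sketch below places focal-group item parameters on the reference metric with a linear transformation obtained by the mean/sigma method. This is offered only as an illustration of what the 12 sets of linear transformation values accomplish; it is not necessarily the procedure implemented in EQUATE 2.1, and the parameter values shown are hypothetical.

import numpy as np

def mean_sigma_constants(b_reference, b_focal):
    """Linking constants A and B from the mean/sigma method, computed over
    the location/step parameters of the common items (here, all items)."""
    b_r = np.asarray(b_reference, dtype=float).ravel()
    b_f = np.asarray(b_focal, dtype=float).ravel()
    A = b_r.std(ddof=1) / b_f.std(ddof=1)
    B = b_r.mean() - A * b_f.mean()
    return A, B

def to_reference_metric(a_focal, b_focal, A, B):
    """Transform focal-group parameters: b* = A*b + B and a* = a / A
    (and theta* = A*theta + B for the focal-group proficiency estimates)."""
    return (np.asarray(a_focal, dtype=float) / A,
            A * np.asarray(b_focal, dtype=float) + B)

# Hypothetical step parameters (items x 3 steps) for a short scale
b_ref = np.array([[-1.2, 0.1, 1.0], [-0.8, 0.3, 1.4]])
b_foc = np.array([[-1.0, 0.3, 1.2], [-0.6, 0.5, 1.6]])
a_foc = np.array([1.1, 0.9])

A, B = mean_sigma_constants(b_ref, b_foc)
a_star, b_star = to_reference_metric(a_foc, b_foc, A, B)

Once the focal parameters and thetas are on the reference metric, the DFIT indices sketched earlier can be computed and compared against the cutoffs described above.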

The Hierarchical Rater Model. Parameters of the HRM were estimated for each rater and item using experimental software obtained from Brian Junker, Department of Statistics, Carnegie Mellon University, and described in Patz et al. (1999). Parameter estimates were then used to calculate CDIF and NCDIF using DFITP6MR (Raju, 2001b). As this experimental software uses Masters' (1982) partial credit model (a one-parameter model) for parameter estimation, values of the a parameter were set equal to 1 across items in the DFIT program.

The HRM does not provide point estimates of parameters. Instead, it generates a set of parameter values sampled from the MCMC simulations. To obtain point estimates, it was necessary to calculate the mean and standard deviation of these parameters across draws. Although Patz and Junker (1999b) used the median value of the distribution as their point estimate, they did so because their distributions were skewed. As the distributions of parameter estimates generated by the HRM for this study were not skewed, the mean value from the simulation runs was used as the point estimate.
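Two details of the HRM step lend themselves to a brief sketch: collapsing the MCMC draws to point estimates, and the rater-level transition matrix idea. The code assumes the signal-detection form in which the probability of an observed category is a unimodal discrete (here, discretized normal) distribution centered near the ideal rating plus the rater's shift, with spread governed by the rater scale parameter; this is our reading of Patz et al. (1999), and the experimental software may differ in detail. All draws and parameter values below are hypothetical.

import numpy as np

def point_estimates(mcmc_draws):
    """Collapse MCMC draws (draws x parameters) to means and standard
    deviations, as was done here because the posteriors were not skewed."""
    draws = np.asarray(mcmc_draws, dtype=float)
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)

def hrm_rating_probs(ideal, shift, scale_param, n_categories):
    """One row of the rater's transition matrix: P(observed = k | ideal rating),
    a unimodal discrete distribution whose mode sits at (or near) the ideal
    rating, displaced by the rater's severity/leniency shift."""
    k = np.arange(n_categories)
    weights = np.exp(-((k - (ideal + shift)) ** 2) / (2.0 * scale_param ** 2))
    return weights / weights.sum()

# Hypothetical draws for one rating source's shift parameter
rng = np.random.default_rng(1)
shift_draws = rng.normal(loc=0.10, scale=0.05, size=(2000, 1))
shift_mean, shift_sd = point_estimates(shift_draws)

# Transition-matrix row for a four-category item whose ideal rating is category 2
row = hrm_rating_probs(ideal=2, shift=shift_mean[0], scale_param=0.8, n_categories=4)

A shift smaller than .5 in absolute value leaves the mode of this distribution at the ideal rating, which is the sense in which the Results and Discussion sections below interpret the estimated shifts as having little practical impact.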

Results

This study investigated the use of the DFIT method (Raju et al., 1995) for detecting differential item and test functioning across the traditional DIF model, the Rater's Effect Model (Muraki, 1993), and the HRM (Patz et al., 1999) in order to compare the usefulness of, and the information generated by, each model. Descriptive statistics of the raw data are presented first. This is followed by a summary of the analyses performed to test the required assumption of unidimensionality. The rater shift parameters (rater effects) are then summarized, followed by a detailed comparison of results across models. Next, the expected raw scores are compared across models, followed by the intercorrelations of raw scores and ability estimates across models by scale. Presented last are the rater scale parameters, the feature unique to the HRM.

Summary of Raw Data

Scale means and standard deviations of observed scores by scale and rating source, as well as the corresponding alpha reliabilities, are presented in Table 4. Reliabilities for Self ratings were consistently lower than those for the other sources. Reliabilities among the remaining sources were similar to each other. Self ratings also showed the least variance, while Direct Report ratings showed the greatest variance. This may explain why Self ratings generally showed the lowest alpha reliabilities and Direct Report ratings generally showed the highest alpha reliabilities.

Insert Table 4 About Here
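As a side note on Table 4, the alpha reliabilities reported there can be reproduced from item-level ratings with the standard coefficient alpha formula; the sketch below shows that computation for a single source-by-scale ratings matrix (the ratings shown are hypothetical).

import numpy as np

def coefficient_alpha(ratings):
    """Cronbach's alpha for an (n_ratees x n_items) matrix of item ratings."""
    X = np.asarray(ratings, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 1-4 ratings of six managers on a five-item scale from one source
demo_ratings = np.array([[3, 4, 3, 4, 3],
                         [2, 2, 3, 2, 2],
                         [4, 4, 4, 3, 4],
                         [3, 3, 2, 3, 3],
                         [4, 3, 4, 4, 4],
                         [2, 3, 2, 2, 3]])
print(round(coefficient_alpha(demo_ratings), 2))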

Assessment of Unidimensionality

The results of the unidimensionality analysis required for the IRT analyses are shown in Table 5. The first principal component accounted for a minimum of 34.2% of the total scale variance (Self ratings, Scale 1) and a maximum of 62.7% of the total scale variance (Direct Report ratings, Scale 10). The lowest values were for Self ratings (between 34 and 44 percent), while the remaining three sources (Boss, Peer, and Direct Report) were all above 49%. A visual inspection of the scree plots of all 16 scales very clearly indicated a single dominant factor. The results of the confirmatory factor analysis show a reasonable fit for a one-factor model for all scales. All GFIs were between .95 and .98, indicating good fit. RMSEAs were all less than .8, also indicating good fit. Although the RMRs for eight of the 16 scales exceeded 1, based on the other evidence of unidimensionality presented, these were still small enough to consider the fit of the one-factor model acceptable.

Insert Table 5 About Here

Rater Shift Parameters

Rater shift parameters, which are indicative of rater severity/leniency, are compared between the RE model and the HRM in Table 6. These values have the effect of shifting the item response function. A more positive value (a more severe rater) shifts the IRF in a way that makes the item appear more difficult. In other words, for that rater, the ratee needs more proficiency or ability to get a high score than for a rater who is more lenient. More negative (lower) values indicate relative leniency.

Insert Table 6 About Here

The shift parameter values cannot be directly compared across models, as they are on different metrics. It should be noted here that the RE model constrains the rater shift parameters to sum to zero, but the HRM does not. The patterns are nevertheless similar in that the rank ordering of leniency (and hence, severity) by rating source is the same in both models. Eleven of the twelve comparisons in the RE model were judged to be significant by constructing either a 95% or 99% confidence interval around the difference between the paired comparisons of shift parameters using the pooled standard error. Effect sizes ranged from 1.0 (between Self and Boss for Scale 1) to 12.6 (between Self and Peer for Scale 5), with an average of 5.0. The only non-significant difference in this comparison set was between Self and Boss for Scale 10. Using the same procedure, ten of the twelve comparisons were significant in the HRM. Effect sizes for this model ranged from zero (between Self and Direct Report in Scale 10) to 10.3 (between Self and Peer in Scale 5). Non-significant differences in this model were between Self and Boss for Scale 10 (as in the RE model) and between Direct Report and Self in Scale 10.

Although Self ratings were more lenient than Boss ratings for each comparison, Self did not appear as the most lenient source when compared to Peer and Direct Report, and Boss did not appear as the most severe source for any comparison. For Scale 1, Direct Report was the most lenient source (in both models); for Scale 2, Direct Report was the most lenient (in both models); for Scale 5, Self was the most lenient (in both models); and for Scale 10, Direct Report and Self were equally lenient in the HRM, but Self was the most lenient in the RE model.

Measurement Equivalence Across Models

A summary of the results of the three models (the traditional DIF model, Raju et al., 1995; the RE or Rater's Effect model, Muraki, 1993; and the HRM, Patz et al., 1999, 2000) is shown in Table 7. A striking feature of the RE model and the HRM is that these models display no significant NCDIF indices.

Insert Table 7 About Here

Comparisons Across Models. The traditional DIF model identified four different comparisons at the item level with significant NCDIF in Scale 1, nine in Scale 2, two in Scale 5, and three in Scale 10. DIF is illustrated in Figure 1, where the eighth item from Scale 1 is shown for the Self/Direct Report comparison. This item showed significant DIF in two comparisons and was removed to achieve non-significant DTF in the Self/Direct Report comparison. Figure 2 shows the scale true score response function for the Self/Direct Report comparison on Scale 1. This comparison displayed significant DTF before the removal of Item 8. Note that the true score functions cross, indicating a difference in item discrimination parameters between raters.

Insert Figures 1 and 2 About Here

The scale-level true score functions for the Self/Direct Report comparison for Scale 2 are shown in Figure 3. This comparison did not show significant DTF. At the item level, NCDIF was significant for four items in this comparison. Two of these differentially functioning items, Items 13 and 14, are displayed in Figures 4 and 5, respectively. They were chosen for illustration for two reasons. First, Item 13 shows significant, non-uniform NCDIF (.125; the cutoff is .054). Second, Item 14 shows significant uniform NCDIF (.076). Notice that in Item 13, when θ < -2, Direct Reports perceive the item as easier and more discriminating, and in Item 14, Self raters perceive the item as easier but as having about the same discriminating power.

Insert Figures 3, 4, and 5 About Here

The RE model identified four comparisons with significant DTF (differential test/scale functioning), whereas the traditional DIF model identified only one, and the HRM did not identify any. As expected, those comparisons having the greatest difference between rater shift parameters corresponded with the comparisons showing DTF. The presence of DTF in the RE model also depends on the standard deviation of the threshold parameters. For example, the difference between Self and Peer for Scale 2 was about the same as the difference between Self and Direct Report for Scale 5, yet only Scale 5 showed significant DTF. An examination of the threshold parameters of these two scales shows that their standard deviation for Scale 2 (.509) was slightly more than twice that for Scale 5 (.229). It is not only the size of the difference between rater shift parameters, but also the variance of the threshold parameters, that results in significant DTF in this model.

Similar to Scale 5, the size of the difference between rater shift parameters in the Self/Peer comparison for Scale 10 was large enough to cause significant DTF before the removal of three items. This scale-level DTF is shown in Figure 6. Note that the true score functions do not cross. This is also true at the item level. Item 9 for this comparison is shown in Figure 7. Item 9 had the largest value for CDIF in the comparison and was the first removed by the DFIT program in order to reduce the DTF to a non-significant level.

Insert Figures 6 and 7 About Here

The most extreme example of DTF was provided by the Self/Peer comparison for Scale 5 in the RE model, which required the removal of ten of the 13 items to create a scale with no significant DTF. This large initial value of DTF (3.519, with a cutoff of .702 when all 13 items are included) is clearly visible in Figure 8. It appears to occur because the only difference between the item true score functions is the rater shift parameter, which in this model is constant across items, causing all CDIF values to be positive and, therefore, non-compensatory at the scale level. For example, Item 4 in this comparison had the largest CDIF value (.330), reflected in the large area between the true score functions in Figure 9. For this comparison, the value of the rater shift parameter was large enough to produce a relatively large cumulative difference at the scale level.

Insert Figures 8 and 9 About Here

No differentially functioning items or scales were commonly identified across any two models. In fact, none were detected in the HRM at all. Patz and his colleagues (1999) point out that rater shift parameters in the HRM of less than .5 indicate that the highest probability is that the rater will select the ideal rating. None of the rater shift parameters estimated by the HRM approached this value. Additionally, although the rater shift parameters appear to be equivalent between models (see Table 6), they are not on the same metric. When compared to the standard deviations of the threshold parameters in the RE model, they are relatively large, but when compared to the standard deviations of the threshold parameters in the HRM, they are relatively small, indeed too small to shift the true score response function enough to show DTF. The standard deviations of the threshold parameters in the HRM were approximately ten times those obtained in the RE model.

Expected True Scores

A comparison of the means of the expected true scores (at the scale level) listed in Table 8 with the means of the observed scores (at the scale level) listed in Table 4 shows that the raw scores are most similar to the expected true scores of the RE model. In every case, the HRM expected true scores are approximately two points higher than the expected true scores generated by either the traditional DIF model or the RE model, because the rater shift is constant across items. The magnitude of this difference depends on the number of items in the scale; scales with more items show a greater difference. Table 9 shows the correlations between the raw scores and theta estimates across models.

All correlations were significant at p < .01. The patterns that emerged in this table were as expected, with low correlations between sources when rating sources were calibrated separately in the traditional DIF model. Raw scores by source correlated very highly (about .99 in all cases) with the corresponding estimated thetas. Theta estimates from the RE model and the HRM correlated .956 with each other on average. Correlations between thetas estimated by the RE model and thetas estimated using the traditional DIF method, and between thetas estimated by the RE model and raw scores, were very close to the correlations between thetas estimated by the HRM and thetas estimated using the traditional DIF method, and between thetas estimated by the HRM and raw scores. These correlations ranged from .500 to .730.

Insert Tables 8 and 9 About Here

Rater Scale Parameters

A feature of the HRM that is not shared by the other models is a separate index of rater reliability. This parameter is designated the rater scale (see Table 10). Patz et al. (1999) do not provide guidance on what a "relatively reliable" value might be, but they consider a value of .72 to be "surprisingly large" (p. 20). Note that these values are not comparable to alpha reliabilities, as they are not on a similar scale; higher values indicate greater unreliability. Rater scale parameter estimates ranged from a low (most reliable) of .698 (Self, Scale 10) to a high (least reliable) for Direct Report ratings on Scale 5. Across scales, Self received the lowest rater scale values, indicating that Self was the most reliable source. Boss was the next most reliable source, followed by Peer, and the least reliable source was Direct Report.

These findings are consistent with Conway and Huffcutt's (1997) estimates of single-rater reliabilities, although Conway and Huffcutt do not estimate single-rater reliabilities for Self as a rating source.

Insert Table 10 About Here

Discussion

This study was undertaken to compare the results obtained across three different IRT-based methodologies when the investigator's purpose is to identify rating source bias in 360° feedback data. The goal was to see whether each method would identify the same effects from raters and items, and to compare the estimates of the ratee's position on the underlying construct of job performance returned by each analysis. Ratings here were collected from four sources: supervisor (Boss), subordinate (Direct Report), colleague (Peer), and the ratee (Self). An established MSF rating instrument, Benchmarks, was used for this study. Restated, the primary research question was, "Will each method identify the same or different items and rating sources as contributing to systematic measurement error?"

The results indicated that each method yields somewhat different information on ratees, items, and raters. Ratee proficiency does not appear to be directly estimated in the traditional DIF method, and rater effects (severity/leniency) are not explicitly modeled either. Rater-by-item interactions are not as evident in the Rater's Effect and Hierarchical Rater Models, although these models appear to estimate ratee proficiency more directly. Across methods, no items or scales were commonly identified as functioning differentially. On the other hand, rater effects appeared to be consistent across the RE and HRM models, while having a differential impact depending on the method employed.

Comparisons Across Models

The traditional DIF model appears to give the most information about differential item functioning.

The Rater's Effect Model and the Hierarchical Rater Model provide almost the same information about raters, although the meaning of this information is not the same between the two, and only the HRM provides the rater scale parameter. Each of these models will be discussed in turn.

The Traditional DIF Model. In this study, 19 instances of item-level DIF were identified across ten of the twelve focal/reference comparisons. This indicates that the observed scores from Boss, Peer, or Direct Report for one or more items in each scale were not directly comparable to the Self rating. For the ratee, this makes the feedback on those particular items less useful, since for the affected items the meaning of any difference in observed scores is not readily interpretable. This lack of interpretability arises because item-level DIF may be caused by rater bias, the rater's conception of ability, or a rater-by-item interaction, and these effects are not separated out using this model.

On the other hand, very little DTF was found using the traditional DIF analysis. In all but one comparison, the differential item functioning turned out to be compensatory at the scale level. The one instance of DTF (found in the Self/Direct Report comparison for Scale 1) was remedied by the deletion of only one item from the scale. Finding very little evidence of differential functioning at the scale level using the traditional DIF methodology may imply that MSF is more frequently comparable at the scale level. One might conclude from this that the practice of providing item-level ratings from each source may be less effective than comparing scores only at the scale level.

Item-level differential functioning appeared only when using the traditional DIF method. This results from the way the data sets are analyzed.

In this method, each set of ratings is a separate data set that is calibrated independently of the others, and both the item discrimination parameters (slope) and the item difficulty parameters (location) have different values for the focal and reference groups (when the two-parameter model is used). The process of equating required by the traditional DIF methodology brings to light a very important distinction between the traditional DIF model and the other two models: in the traditional DIF model, there is one estimate of some underlying construct generated by each rater, so in this study there were four different estimates of what has previously been assumed to be ratee proficiency or ability. The underlying construct estimated using a model in which each set of ratings generates a different value may more accurately be described as the rater's conception of the ratee's ability, even when the rater is Self. When measurement equivalence is affirmed by this model, it may be a way of judging that two raters have equivalent conceptions of the ratee's ability, rather than that the measure reflects agreement on the estimate of the ratee's ability per se.

Rater's Effect Model. In contrast to the traditional DIF model, the RE model delivers just one proficiency or ability parameter estimate per scale, regardless of the number of raters providing ratings. By design, only one set of item parameters is estimated as well. Therefore, when the DFIT method is employed to calculate NCDIF and CDIF between the raters, the item parameters differ only by the size of the rater shift parameter. A consequence of this method is that there is only uniform DIF and the differences between item parameters are in the same direction across items. This results in all CDIF values being positive, and as CDIF accumulates over items, DTF is much more likely to be statistically significant in the RE model. This is exactly the result that was seen. Conversely, very little NCDIF was detected, and none of it was significant in this model.

This is to be expected, as the item discrimination (slope) parameters are held constant between rating sources, so there is no possibility of non-uniform DIF being produced by this method.

The information provided by the rater shift parameters and the differential functioning detected at the scale level paints quite a different picture than that provided by the traditional DIF method. The evidence presented by the Rater's Effect model may suggest that rater bias makes scale-level scores incommensurable, demonstrated most notably in Scale 5 (see Table 7). For example, in Scale 5 for the Self/Direct Report comparison, the traditional DIF model did not identify any item- or scale-level differential functioning. When analyzed using the Rater's Effect model, this same set of data indicated that six items had to be removed to produce a scale with non-significant DTF. Similarly, in Scale 2 for the Self/Direct Report comparison, four items were identified as having NCDIF using the traditional DIF methodology, but no DTF was found because this item-level differential functioning appears to be compensatory at the scale level. For this particular comparison, no item- or test-level differential functioning was revealed by the Rater's Effect model. Because the two models did not identify the same items as having significant DIF, the question arises of which model is appropriate to use.

The Hierarchical Rater Model. For the most part, the results obtained from the HRM were similar to the results obtained from the RE model, with two notable exceptions. First, no DIF of any kind was detected with this model, and second, the model generates the rater scale parameter, an index of rater reliability.

Considering the similarities between the models, it was unexpected that the HRM did not produce any results indicating DIF at either the scale or the item level. The ability parameter estimates were highly correlated between the HRM and the RE model, and the rater shift parameters were also comparable, in both rank order and effect size. What was startling was that these parameter estimates, when compared with the average standard deviation of the threshold parameters, were only about one-tenth the size of the rater effects found in the RE model. According to Patz and his associates (Patz et al., 1999), rater shift parameters less than .5 are too small to negatively affect the probability that a rater will assign the ideal rating to a given response. Although these small biases are reflected in the expected true score estimates as differences in scale scores, these differences would not change any scale score rounded to the nearest integer. This is not much different from the results obtained from the RE model, again suggesting that the effect of the rater shift parameter has little practical impact.

The answer may lie in the fact that the rater effect parameter in the HRM is estimated independently of the item parameters. The calibration of the item parameters is conditioned on the ideal score, not the observed score. Because of the way the rater scale parameter (the estimate of individual rater reliability) enters the HRM, larger rater scale parameters shrink the estimated rater effects. In other words, unreliability in raters leads to smaller estimates of rater effects. The relatively small rater effects seen in the present study may also be due to the fact that the HRM was developed using individual raters, not rating sources. As Scullen, Mount, and Goff (2000) demonstrated, the idiosyncratic effects of an individual rater tend to be much larger than the effects due to the rater's position.

The results presented here demonstrate only the practical effects of rater leniency or bias due to the rater's position, and not those of the individual rater. Additionally, the HRM was developed in an educational setting where each rater is carefully trained and trial ratings are calibrated against expert ratings of the same material. Furthermore, the rated behavior, which in that setting is student essays, is constant across raters and may be viewed and reviewed during the rating task.

Rater scale estimates as large as those seen in this study were termed "surprisingly large" by Patz and his associates (Patz et al., 1999, p. 20). Alpha reliabilities for the ratings in this study were all above the traditionally recommended value of .80, though the highest was .92. The rater scale scores present a picture of rater reliability (or rather, unreliability) that is the opposite of that found using alpha reliabilities. Using coefficient alpha alone, one would judge Self to be the least reliable source and Direct Report to be the most reliable; this is exactly the opposite of the picture presented by the rater scale parameters. This may be the result of separating item effects from rater effects, and it suggests that the lower alpha reliability for Self as a source may be the result of a greater item effect for this source. The unidimensionality analysis presented in Table 5 provides support for this: the scales appear to be relatively more multidimensional for Self than for the other sources, which would reduce the value of the observed alpha coefficient.

A final consequence of the method by which rater shift parameters are obtained in the HRM may be that the expected true scores deviate somewhat from the observed scores and from the expected true scores produced by the other two methods.

Rater shift parameters in the HRM are not constrained to sum to zero, and in fact in this study their sums indicate a general severity bias on the part of all raters; that is, the sum of the rater shift parameters is positive in each case.

Implications for Practice

There are two practice areas that are affected by these findings. The first is that no single current methodology for assessing measurement equivalence across different rating sources or raters captures all of its dimensions. Second, depending on the methodology employed, we may reach different conclusions about the between-source measurement equivalence of 360° feedback, and this has implications for the practice of providing item-level feedback separately by source. Basically, two approaches were investigated here: the traditional DIF approach and the rater shift approach. From the traditional DIF approach, it appears that measurement inequivalence at the item level is likely to be compensatory at the scale level. From the rater shift models, it may be concluded that severity or leniency may render large parts of a scale non-comparable at either the item or the scale level. Both types of model provide useful information. The traditional DIF model identifies a lack of measurement equivalence at the item level, which may be caused by differences in the raters' conceptions of the ratee's performance, by rater severity/leniency, or by both. The rater shift models separate item effects from rater effects and facilitate the evaluation of the relative severity or leniency of a rater or rating source. They do not, however, provide information about differences in the raters' conceptions of the ratee's performance.