Comparative Analysis of the Reliability of Job Performance Ratings


Journal of Applied Psychology, 1996, Vol. 81, No. 5. Copyright 1996 by the American Psychological Association, Inc.

Chockalingam Viswesvaran, Florida International University
Deniz S. Ones, University of Houston
Frank L. Schmidt, University of Iowa

This study used meta-analytic methods to compare the interrater and intrarater reliabilities of ratings of 10 dimensions of job performance used in the literature; ratings of overall job performance were also examined. There was mixed support for the notion that some dimensions are rated more reliably than others. Supervisory ratings appear to have higher interrater reliability than peer ratings. Consistent with H. R. Rothstein (1990), the mean interrater reliability of supervisory ratings of overall job performance was found to be .52. In all cases, interrater reliability is lower than intrarater reliability, indicating that the inappropriate use of intrarater reliability estimates to correct for biases from measurement error leads to biased research results. These findings have important implications for both research and practice.

Several measures of job performance have been used over the years as criterion measures (cf. Campbell, 1990; Campbell, Gasser, & Oswald, 1996; Cleveland, Murphy, & Williams, 1989). Attempts have also been made to identify the specifications for these criteria (Blum & Naylor, 1968; Brogden, 1946; Dunnette, 1963; Stuit & Wilson, 1946; Toops, 1944). For example, Blum and Naylor (1968) identified 11 dimensions or characteristics on which the different criteria can be evaluated, whereas Brogden (1946) identified relevance, reliability, and practicality as desired characteristics for criteria. Reliability of criteria has been included as an important consideration by all authors writing about job performance measurement.
Author note: Chockalingam Viswesvaran, Department of Psychology, Florida International University; Deniz S. Ones, Department of Management, University of Houston; Frank L. Schmidt, Department of Management and Organizations, University of Iowa. The order of authorship is arbitrary; all three authors contributed equally. Deniz S. Ones is now at the Department of Psychology, University of Minnesota. An earlier version of this article was presented at a symposium, "Reliability and Accuracy Issues in Measuring Job Performance," chaired by Frank L. Schmidt at the 10th Annual Meeting of the Society for Industrial and Organizational Psychology, Orlando, Florida, May. Correspondence concerning this article should be addressed to Chockalingam Viswesvaran, Department of Psychology, Florida International University, Miami, Florida. Electronic mail may be sent via Internet to vish@servax.fiu.edu.

Indeed, for a measure to have any research or administrative use, it must have some reliability. Low reliability results in a systematic reduction in the magnitude of observed relationships and can therefore distort theory testing. The recent resurgence of interest in criteria (Austin & Villanova, 1992) and in developing a theory of job performance (e.g., Campbell, 1990; Campbell, McCloy, Oppler, & Sager, 1993; McCloy, Campbell, & Cudeck, 1994; Schmidt & Hunter, 1992) also emphasizes the importance of the reliability of criterion measures. A thorough investigation of the criterion domain ought to include an examination of the reliability of dimensions of job performance. The focus of this article is the reliability of job performance ratings. Of the different ways to measure job performance, performance ratings are the most prevalent.
Ratings are subjective evaluations that can be obtained from supervisors, peers, subordinates, self, or customers, with supervisors being the most commonly used source (Cascio, 1991; Cleveland et al., 1989) and peers constituting the second most commonly used source. For example, Bernardin and Beatty (1984) found in a survey of human resource managers that over 90% of the respondents used supervisory ratings as their primary source of performance ratings and peers were the second most widely used source of ratings. Constructing comprehensive and valid theories of human motivation and work behavior is predicated on the reliable measurement of constructs. Given the centrality of the construct of job performance in industrial and organizational psychology (Campbell et al., 1996), and given that ratings are the most commonly used source for

measuring job performance, it is important to estimate precisely the reliability of job performance ratings. Furthermore, competing cognitive process mechanisms have been postulated (e.g., Borman, 1979; Wohlers & London, 1989) to explain the convergence in ratings between two raters. An accurate evaluation of these competing mechanisms will enhance understanding of the psychology underlying the rating (or evaluative judgment) process in general and the rating of job performance in particular. Finally, many human resource practices recommended to practitioners in organizations are predicated on the reliable measurement of job performance. As such, both from a theoretical perspective (i.e., to analyze and build theories contributing to the science of industrial and organizational psychology) and from a practice perspective, a comparative analysis of the reliability of job performance ratings is warranted.

The primary purpose of this study was to investigate the reliability of peer and supervisory ratings of various job performance dimensions. Meta-analytic principles (Hunter & Schmidt, 1990) were used to cumulate reliability estimates across studies. Reliability estimates of various job performance dimensions can be compared to identify which dimensions are rated reliably and which are rated with low reliability, so that the latter can be improved through rater training. A second objective of this study was to compare interrater agreement and intrarater consistency in the reliability of ratings. A third and final objective was to compile and present reliability distributions that can be used in future meta-analyses involving job performance ratings.

Comparing Reliability Across Dimensions

Supervisory and peer ratings have been used to assess individuals on many dimensions of job performance.
Comparing the reliability of ratings of different dimensions enables an empirical test of the hypothesis that certain dimensions of job performance are easier to evaluate than others (cf. Wohlers & London, 1989). In essence, the thrust of this hypothesis is that some dimensions of job performance are easier than others to evaluate because they are easier to observe and clearer standards of evaluation are available. Wohlers and London (1989) suggested that dimensions of performance such as administrative competence, leadership, and communication competence are more difficult to evaluate than dimensions such as output and errors. Similarly, Borman (1979) found that "raters evaluated ratees significantly more accurately on some dimensions than on others, and that for most part these differences were consistent across formats" (p. 419). Borman (1979) also stated that the rank order of accuracy across the different dimensions in his study was similar to the rank order of accuracy in an earlier study by Borman, Hough, and Dunnette (1976); the rank order correlation was .88 for assessing managers and .54 for recruiters. That is, the rank order of accuracy across dimensions was consistent across rating formats, studies and samples, and jobs. Borman (1979) noted that this consistent dimension effect, even across a variety of formats (and, we may add, across jobs and samples), may be due to something inherent in the nature of the dimensions that makes them either difficult or easy for raters. Furthermore, Borman suggested that "accuracy is highest on those dimensions for which actors provided the least ambiguous, most consistent performances, perhaps because they, as well as the student raters, understood those particular dimensions better than some of the other dimensions" (p. 420). The hypothesis that certain dimensions of job performance are easier to evaluate than others is also found in the personality literature (e.g., Christensen, 1974).
This line of thought is also found in the social psychology literature. For example, Bandura (1977), as well as Salancik and Pfeffer (1978), posited from a social information-processing framework that interrater agreement will be lower when there are no clear, interpretable signs of behavior or when the standards of evaluation are ambiguous than when clear, interpretable signs are present and the standards are unambiguous. Lower agreement is also hypothesized when certain dimensions of job performance occur rarely (low base rate) or have exceptional salience in memory (e.g., accidents). Comparing the reliability of dimensions facilitates an empirical test of the hypothesis (Borman, 1979; Wohlers & London, 1989) of a gradient in reliabilities across job performance dimensions. Such knowledge will facilitate an understanding of rating processes.

Comparing Different Types of Reliability Estimates

Comparing the different types of reliability estimates (coefficient of equivalence, coefficient of stability, etc.) for each dimension of job performance is also valuable. The reliability of a measure is defined as the ratio of true to observed variance (Nunnally, 1978). Different types of reliability coefficients assign different sources of variance to measurement error. In general, the most frequently used reliability coefficients associated with criterion ratings can be broadly classified into two categories: interrater and intrarater. In the context of performance measurement, interrater reliability assesses the extent to which different raters agree on the performance of different individuals. As such, individual raters' idiosyncratic perceptions of job performance are considered to be part of measurement error. Intrarater reliability, on

the other hand, assigns any error specific to the individual rater to true variance. That is, each rater's idiosyncratic perceptions of job performance are relegated to the true variance component. Both coefficient alpha and the coefficient of stability (rate-rerate reliability with the same rater) are forms of intrarater reliability. Intrarater reliability is most frequently indexed by coefficient alpha computed on ratings from a single rater, on the basis of the correlations or covariances among different rating items or dimensions. Coefficient alpha assesses the extent to which the different items used to measure a criterion are indeed assessing the same criterion.1 Rate-rerate reliability computed using data from the same rater at two points in time assesses the extent to which a given rater's performance appraisal ratings are consistent over time. Both of these indices of intrarater reliability, coefficient alpha and the coefficient of stability (over short periods of time, when it is assumed that true performance does not change), estimate what the correlation would be if the same rater rerated the same employees (Cronbach, 1951).2 Thus, different types of reliability estimates assign different sources of variance to measurement error variance. When a single judge rates a job performance dimension with a set of items, coefficient alpha may be computed on the basis of the correlations or covariances among the items. Coefficient alpha, which is a measure of the equivalence of the items, assigns item-specific variance and variance due to random responses in ratings to measurement error variance. Influences unique to the particular rating occasion and unique to the rater are not assigned to measurement error but are incorporated into the true variance.
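The distinction between the two families of coefficients can be made concrete with a small computational sketch. This is our illustration, not code from the original study: interrater reliability treats raters as the replications, whereas coefficient alpha from a single rater treats items as the replications.

```python
import numpy as np

def interrater_reliability(rater_a, rater_b):
    """Correlation between two raters' scores for the same ratees.
    Rater-specific idiosyncrasies count as measurement error here."""
    return np.corrcoef(rater_a, rater_b)[0, 1]

def coefficient_alpha(items):
    """Cronbach's alpha from ONE rater's item-level ratings
    (rows = ratees, columns = items). Item-specific and random-response
    variance count as error; the rater's idiosyncrasy does not."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()   # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)          # variance of item sums
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)
```

Feeding a single rater's item matrix to `coefficient_alpha` therefore yields an intrarater estimate; only `interrater_reliability`, computed across two independent raters, treats rater idiosyncrasy as error.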
When job performance is assessed by the same rater with the same set of items at two different points in time, the resulting coefficient of stability (rate-rerate reliability) assigns variance due to transient errors in rating (i.e., variance from mental states and other factors in raters that vary over days) to measurement error variance (Schmidt & Hunter, 1996). Thus, by comparing the different reliability estimates for the same dimension of job performance, one can gauge the magnitude of a particular source of error in ratings of that dimension. Such knowledge can be valuable in designing rating formats and rater training programs.

Constructing Artifact Distributions

Constructing artifact distributions for different dimensions of job performance also serves meta-analytic cumulations involving ratings of job performance. The reliability distributions reported here can be used in meta-analyses of studies involving ratings of performance.3 Also, some published meta-analyses involving ratings have erroneously combined estimates of interrater and intrarater reliability in one artifact distribution, as if these estimates were equivalent. With the increasing emphasis on the precision of estimates used in theory testing (Schmidt & Hunter, 1996; Viswesvaran & Ones, 1995), it is imperative that future meta-analyses use the appropriate reliability estimates. By providing estimates of the different reliability coefficients for each dimension of job performance, this article aims to provide a useful source of reference for researchers. Thus, the primary purpose of this article is to cumulate the reliabilities of job performance ratings with the principles of psychometric meta-analysis (Hunter & Schmidt, 1990) and to compare the reliability of the ratings of different dimensions.
Comparing the reliability of different dimensions enables an evaluation of the hypothesis (Borman, 1979; Wohlers & London, 1989) that evaluation difficulty varies across dimensions. A secondary purpose of this article is to compare the magnitude of the different sources of error (by comparing interrater reliabilities, coefficient alphas, and coefficients of stability) that exist in ratings of each dimension of job performance. A third purpose is to provide reliability distributions that could be used in future meta-analytic cumulations involving ratings of performance.

Method

Database

We searched the literature for articles that reported reliability coefficients either for job performance dimensions or for overall job performance. Only studies that were based on actual job performance were included. Interviewer ratings, assessment center ratings, and ratings of performance in simulated exercises were excluded. We searched all issues, from the first issue of each journal through January 1994, of the following 15 journals: Journal of Applied Psychology, Personnel Psychology, Academy of Management Journal, Human Relations, Journal of Business and Psychology, Journal of Management, Organizational Behavior and Human Decision Processes, Accident Analysis and Prevention, International Journal of Intercultural Relations, Journal of Vocational Behavior, Journal of Applied Behavioral Analysis, Human Resources Management Research, Journal of Occupational Psychology, Psychological Reports, and Journal of Organizational Behavior.

1 Coefficient alpha computed on ratings from a single rater is an estimate of the rate-rerate reliability with the same rater. As such, it is a form of intrarater reliability. Note, however, that a different coefficient alpha can be used to index interrater reliability; this is possible if the variance-covariance matrix across raters is used in the computations. In this study, we did not examine coefficient alphas obtained by using data across raters.

2 For a recent discussion of these and other reliabilities in industrial and organizational psychology research, see Schmidt and Hunter (1996).

3 Frequency distributions of the reliabilities contributing to the analyses reported in this article may be obtained by writing to Chockalingam Viswesvaran.

Analyses

In cumulating results across studies, the same job performance dimension can be referred to with different labels. Any grouping of different labels as measures of the same criterion has to be guided by theoretical considerations; that is, we need to define the criteria theoretically first (Campbell et al., 1993). The broader the definition, the more general and possibly more useful the criteria are; on the other hand, the narrower the definition (up to a point), the more unambiguous the criteria become. The delineation of the job performance domain into its component dimensions was undertaken as part of a study examining whether a general job performance factor is responsible for the covariation among job performance dimensions (Viswesvaran, 1993). Viswesvaran (1993) identified 10 job performance dimensions that comprehensively represent the entire job performance domain. In this study, all the job performance measures used in the individual studies were listed and then grouped by the authors into conceptually similar categories. That is, the definition of the job performance dimensions and the classification of the job performance ratings into these 10 dimensions preceded the coding of the reliability estimates. We read all the articles making up our database and then classified the reliabilities, taking into account not only the definitions but also the context (and all other information) provided in each article. Interrater agreement for this classification was 93%; disagreements were resolved through discussion until consensus was reached.
Definitions for the 10 groups of ratings for which analyses are reported here are provided in Table 1. Given 10 dimensions of job performance and three types of reliability (interrater, stability, and equivalence), there were potentially 30 reliability distributions to be investigated. Because our interest was in examining the reliability of both supervisory and peer ratings, there were potentially 60 distributions to be meta-analyzed. Of these, some combinations have not been assessed in the literature. The reliability values obtained from the individual studies were coded into one of the 60 distributions.

Next, in cumulating the reliability of any particular criterion across several studies, the length of the measuring instrument (number of raters for interrater reliability estimates and number of items for coefficient alpha estimates) varied across studies. One option was to use the Spearman-Brown formula to bring all estimates to a common length. We reduced all interrater reliability estimates to the single-rater level. In many organizations there will almost never be more than one supervisor doing the rating, but there will almost never be an instrument with only one item (i.e., one performance dimension rated). As such, we did not correct the coefficient alphas for the number of items. Furthermore, most rating instruments had numbers of items in a range where Spearman-Brown adjustments did not make a practical difference. For the coefficient of stability, without knowing the functional relationship between estimates of stability and the time interval between the measurements, corrections to bring estimates of stability to the same interval are impossible. All we can say, intuitively speaking, is that as the time interval increases, the reliability estimate generally decreases. The function that captures this decrease is unknown.
Jensen (1980), based on fitting curves to empirical data, reported that the stability of IQ test scores is a function of the square root of the ratio of the chronological ages at the two points of measurement. Another possibility is to assume an asymptotic function in which the reliability estimate falls as the time interval increases (at an infinite time interval, the reliability estimate will be zero, or at an asymptote of zero). This is similar to Rothstein (1990), who presented empirical evidence that as the opportunity to observe (indexed by the number of years supervised or observed) increases, interrater reliability increases but reaches an asymptotic maximum of .60. Lacking information on the functional relationship between reliability estimates and the time intervals between measurements, we made no corrections to bring all estimates of the coefficient of stability included in a meta-analysis to the same interval. Note that in our intrarater reliability analyses, we were careful to include only coefficients of stability that were based on ratings from the same rater. Rate-rerate correlations from different raters at two points in time are interrater reliability coefficients and will be lower than estimates in which the same rater provides the ratings at both points in time and intrarater reliability is thus assessed (Cronbach, 1947).

A meta-analysis correcting only for sampling error was conducted for each of the 60 distributions for which there were at least four estimates to be cumulated. The sample size weighted mean, observed standard deviation, and residual standard deviation were computed for each distribution. We also computed the unweighted mean and standard deviation; these computations do not weight the reliability estimates by the sample size of each contributing study, so each reliability coefficient is equally weighted.
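The Spearman-Brown adjustment used above, which reduces interrater reliability estimates based on several raters to the single-rater level, can be sketched as follows. This is a generic illustration of the standard formula; the function names are ours, not the authors'.

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k parallel raters (or items),
    given the reliability of a single rater."""
    return k * r_single / (1 + (k - 1) * r_single)

def to_single_rater(r_k, k):
    """Inverse adjustment: step a reliability based on k raters down
    to the implied reliability of one rater."""
    return r_k / (k - (k - 1) * r_k)
```

For example, if the average of two supervisors' ratings has reliability .68, the implied single-rater reliability is .68 / (2 - .68), about .52, which is why estimates based on multiple raters must be stepped down before being pooled with single-rater estimates.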
The sample size weighted mean gives the best estimate of the mean reliability, whereas the unweighted mean ensures that our results are not skewed by a few large-sample estimates. In addition, we computed the mean and standard deviation of the square root of the reliabilities; the mean of the square roots of the reliabilities differs slightly from the square root of the mean of the reliabilities. Both sample size weighted and unweighted (i.e., frequency weighted) analyses were undertaken. Thus, for each of the 60 distributions, the objective was to estimate the mean and standard deviation of (a) sample size weighted reliability estimates, (b) reliability estimates (unweighted or frequency weighted), (c) sample size weighted square roots of the reliabilities, and (d) square roots of the reliabilities (unweighted or frequency weighted). The sampling error variance associated with the mean of the reliability was estimated as the variance divided by the number of estimates averaged (Callender & Osburn, 1988). The sampling error of the mean was used to construct an 80% confidence interval around the mean. Assuming normality, 80% of the points in the distribution fall within this interval; that is, the probability of obtaining a value higher than the upper bound of the interval and the probability of obtaining a value lower than the lower bound are each .10.

For both interrater reliability and the coefficient of stability (rate-rerate reliability with the same rater), in addition to the confidence interval, the sampling error of the correlation was computed and credibility intervals were constructed. A residual standard deviation was computed as the square root of the difference between the observed variance and the sampling error variance of the correlation (i.e., the interrater reliability coefficient in the former case and the rate-rerate reliability coefficient in the latter). Note, however, that the sampling error formula for coefficient alpha is different from that for interrater reliability coefficients and coefficients of stability (i.e., correlation coefficients). Given the mean and the residual standard deviation, along with the normality assumption, we can compute the estimated reliability below which the population reliability value falls with 90% probability: M + 1.28(residual standard deviation), the upper bound of the 80% credibility interval. Though different (90%, 95%, etc.) credibility intervals (and upper bound values) can be constructed, we report only the 80% credibility interval for the sample size weighted mean reliability estimate. Interested readers can compute other credibility intervals (90%, 95%, etc.) on the basis of the mean reliability and the residual standard deviation.

Table 1
Definitions of Job Performance Rating Dimensions

Overall job performance: Ratings on statements (or rankings of individuals on statements) referring to overall performance, overall effectiveness, overall job performance, overall work reputation, or the sum of all individual dimensions rated.

Job performance or productivity: Ratings of the quantity or volume of work produced. Ratings or rankings of individuals based on productivity or sales; examples include ratings of the number of accounts opened by bank tellers and the number of transactions completed by sales clerks.

Quality: A measure of how well the job was done. Ratings of (or rankings of individuals on statements referring to) the quality of tasks completed, lack of errors, accuracy to specifications, thoroughness, and amount of wastage.

Leadership: A measure of the ability to inspire, to bring out extra performance in others, to motivate others to scale great heights, and professional stature; includes performance appraisal statements such as "gets subordinates to work efficiently," "stimulates subordinates effectively," and "maintains authority easily and comfortably."

Communication competence: Skill in gathering and transmitting information, in both oral and written format; the proficiency to express information, views, opinions, and positions, and the ability to make oneself understood. Includes performance appraisal statements such as "very good in making reports," "reports are clear," "reports are unambiguous," and "reports need no further clarification."

Administrative competence: Proficiency in handling the coordination of different roles in an organization. Refers to proficiency in organizing and scheduling work periods, administrative maintenance of records (note, however, that clarity of records falls under Communication competence above), the ability to place and assign subordinates, and knowledge of the job duties and responsibilities of others.

Effort: The amount of work an individual expends in striving to do a good job. Initiative, attention to duty, alertness, resourcefulness, enthusiasm about work, industriousness, earnestness at work, persistence in seeking goals, dedication, personal involvement in the job, and effort and energy expended on the job characterize this dimension.

Interpersonal competence: The ability to work well with others. Ratings or rankings of individuals on cooperation with others, customer relations, working with co-workers, and acceptance by others, as well as nominations for "easy to get along with," are included in this dimension.

Job knowledge: A measure of the knowledge required to get the job done. Includes ratings or rankings of individuals on job knowledge and keeping up-to-date, as well as nominations of who knows the job best and who keeps up-to-date.

Compliance with or acceptance of authority: A generally positive perspective about rules and regulations; includes obeying rules, conforming to regulations in the workplace, having a positive attitude toward supervision, conforming to organizational norms and culture without incessant complaining about organizational policies, and following instructions.

Results

Tables 2-4 summarize the results of the meta-analyses. Interrater reliability estimates for supervisory ratings are summarized in Table 2. Interrater reliability estimates for peer ratings are in Table 3, and estimates of the coefficient of stability for supervisory ratings of overall job performance are also in Table 3. Estimates of coefficient alpha for supervisory and peer ratings are provided in Table 4. Note that not all 10 dimensions are present in every table; we do not present the results of meta-analyses that were based on fewer than four reliability estimates.

In each table, Column 1 indicates the job performance dimension being meta-analyzed, Column 2 indicates the total sample size (the total number of individuals rated across the studies included in that meta-analysis), and Column 3 provides the number of independent estimates included in the meta-analysis. Columns 4 and 5 provide the sample size weighted mean and standard deviation of the values meta-analyzed. The unweighted (or frequency weighted) mean and standard deviation are in Columns 6 and 7, respectively. The sample size weighted mean and standard deviation of the square roots of the reliabilities are in Columns 8 and 9, and the unweighted (or frequency weighted) mean and standard deviation of the square roots of the reliabilities are in Columns 10 and 11. Column 12 provides the 80% confidence interval based on the sample size weighted mean reliability values. Different intervals (e.g., 95%) can be constructed on the basis of the values reported in Columns 3, 4, and 5. Similarly, different intervals can be computed for (a) unweighted (or frequency weighted) reliability values, from the data reported in Columns 3, 6, and 7; (b) sample size weighted square roots of the reliability estimates, from the data presented in Columns 3, 8, and 9; and (c) unweighted (or frequency weighted) square roots of the reliabilities, from the information provided in Columns 3, 10, and 11. For interrater reliability and the coefficient of stability, the residual standard deviations of the reliability distributions and the 80% credibility intervals are reported in Columns 13 and 14, respectively. The credibility interval refers to the entire distribution, not the mean value; it also refers to population values (the estimated distribution of population values), not observed values, which are affected by sampling error.

In discussing the results, we first compare the supervisory rating reliabilities of the different dimensions of rated performance for each type of reliability (e.g., interrater). Second, we examine the same type of reliability based on peer ratings of the different dimensions. Third, we compare the reliabilities of peer and supervisory ratings. These three steps are repeated for each type of reliability: interrater, stability, and coefficient alpha.

Table 2
Interrater Reliabilities of Supervisory Ratings of Job Performance
[The numeric entries of this table were not recoverable from the transcription. The table reports, for overall job performance and the dimensions productivity, quality, leadership, communication competence, administrative competence, effort, interpersonal competence, job knowledge, and compliance with or acceptance of authority: N, k, the sample size weighted and unweighted means and standard deviations of the reliabilities and of their square roots, the 80% confidence interval, the residual standard deviation, and the 80% credibility interval.]

Table 3
Interrater Reliabilities of Peer Ratings and Coefficients of Stability for Supervisory Ratings of Job Performance
[The numeric entries of this table were not recoverable from the transcription. For peer ratings, the table covers overall job performance, productivity, leadership, effort, interpersonal competence, job knowledge, and compliance with or acceptance of authority; for supervisory ratings, it covers the coefficient of stability for overall job performance. Columns are as in Table 2.]

Table 4
Coefficient Alpha Reliabilities of Supervisory and Peer Ratings of Job Performance (Intrarater Reliabilities)
[The numeric entries of this table were not recoverable from the transcription. For supervisory ratings, the table covers overall job performance and all 10 dimensions; for peer ratings, it covers overall job performance, leadership, effort, and interpersonal competence. Columns report N, k, the sample size weighted and unweighted means and standard deviations of the reliabilities and of their square roots, and the 80% confidence interval.]
A final section discusses the assessment of the relative influence of the different sources of error.

Interrater Reliability

From the results reported in Table 2, the mean interrater reliability for supervisory ratings of overall job performance was .52 (k = 40, N = 14,650). The 80% credibility interval ranged from .41 to .63. That is, it is estimated that 90% of the values of interrater reliability of supervisory ratings of overall job performance are below .63. For supervisors, the mean sample size weighted interrater reliability across nine specific job performance dimensions (excluding overall job performance) was .53. It appears that, for supervisors, the interrater reliability of overall job performance ratings is similar to the mean interrater reliability across job performance dimensions. This is noteworthy because most interrater reliabilities for overall performance in our database were for sums of items across different job performance dimensions. Contrary to expectations, the higher intrarater reliability associated with longer rating forms (see also the section on coefficient alphas below) does not appear to improve interrater reliability in the job performance domain. A second interesting point is that there is variation across the 10 dimensions in the mean interrater reliabilities for supervisory ratings. Although the credibility intervals for all 10 dimensions do overlap, the 80% credibility intervals indicate that, for example, both communication competence and interpersonal competence are rated less reliably, on average, than productivity or quality. Thus, the hypothesis of Wohlers and London (1989) and Borman (1979) is partially supported.
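The 80% credibility interval reported above can be reproduced from the mean and the residual standard deviation under the normality assumption, since 90% of a normal distribution lies below the mean plus 1.28 standard deviations. In this sketch the mean of .52 is from the text, but the residual standard deviation of .09 is an assumed, illustrative value:

```python
def credibility_interval_80(mean_rel, resid_sd):
    """80% credibility interval for a meta-analytic reliability distribution.

    Under a normality assumption, the central 80% of population values
    lies within 1.28 standard deviations of the mean, so 90% of values
    fall below the upper bound and 90% fall above the lower bound.
    """
    z = 1.28  # approximate 90th percentile of the standard normal
    return mean_rel - z * resid_sd, mean_rel + z * resid_sd

# Mean interrater reliability .52 (from the text); residual SD .09 is assumed.
lo, hi = credibility_interval_80(0.52, 0.09)
```

With these inputs the interval is roughly (.40, .64); the exact bounds in the article depend on the residual standard deviation actually estimated from the data.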

Interrater reliabilities for peer ratings of 7 of the 10 dimensions are reported in Table 3 (there were fewer than four estimates for the other three dimensions). The estimates ranged from .34 for ratings of productivity (SD = .14) to .71 for ratings of compliance with authority (SD = .05). For peers, the sample size weighted mean interrater reliability across six specific dimensions of job performance (i.e., excluding overall job performance) was .42. For ratings of overall job performance, the interrater reliability of peer ratings was also .42 (SD = .11). The 80% credibility interval for interrater reliability of peer ratings of overall job performance ranged from .30 to .54. That is, 90% of the actual (population) values are estimated to be less than .54, and 90% of the values are estimated to be greater than .30. Similar to the results for supervisors, the interrater reliability of overall job performance ratings is the same as the sample size weighted mean interrater reliability across individual job performance dimensions. Even though a large portion of the peer interrater reliabilities for overall performance in our database were computed in studies by summing items across different job performance dimensions, the higher intrarater reliability associated with longer rating forms (see also Coefficient Alphas: Measures of Intrarater Reliability below) does not appear to lead to higher peer interrater reliability. This mirrors the case for supervisors. A comparison of the results reported in Tables 2 and 3 seems to indicate that there was generally more agreement between two supervisors than between two peers. However, caution is needed in drawing such a conclusion. First, the interrater reliability estimates for peer ratings were based on a small number of studies (as were those for some dimensions of supervisory ratings).
Second, there is considerable overlap in the credibility intervals of the interrater reliability estimates for peer and supervisory ratings. Finally, two of the studies reporting interrater reliabilities of peer ratings (Borman, 1974; Hausman & Strupp, 1955) reported very low values. When these two studies were eliminated from the database as outliers, peers and supervisors had comparable levels of interrater agreement. However, consistent with the overall results of this meta-analysis, we should note that a recent large-sample primary study also reported lower interrater reliability estimates for peers than for supervisors (Scullen, Mount, & Sytsma, 1995). Furthermore, in practice, given that peer ratings are often based on the average ratings of several peers, the averaged multiple peer ratings may be more reliable than the ratings from a single supervisor. The Spearman-Brown prophecy formula can be used to determine the number of peer raters required.

Coefficient of Stability

Compared with the number of studies reporting interrater reliabilities or coefficient alphas, very few studies reported coefficients of stability. This is consistent with the general preponderance of cross-sectional over longitudinal studies among published journal articles. In fact, we were able to assess the coefficient of stability only for supervisory ratings of overall job performance. There were 12 reliabilities across 1,374 individuals contributing to this analysis. For supervisory ratings of overall job performance, the sample size weighted mean coefficient of stability was .81 (SD = .09). This analysis included only estimates in which the same rater was used at the two points in time.

Coefficient Alphas: Measures of Intrarater Reliability

Intrarater reliabilities assessed by coefficient alphas were also substantial. For supervisory ratings, overall job performance was the most reliably rated (.86). The least reliably rated dimension was communication competence (.73).
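The Spearman-Brown calculation mentioned above for determining the number of peer raters required can be sketched as follows. The single-peer interrater reliability of .42 is from the text; the target reliability of .80 is our assumption for illustration:

```python
import math

def spearman_brown(single_rater_rel, k):
    """Reliability of the average of k parallel raters (Spearman-Brown)."""
    r = single_rater_rel
    return k * r / (1 + (k - 1) * r)

def raters_needed(single_rater_rel, target_rel):
    """Smallest number of raters whose averaged ratings reach target_rel."""
    r = single_rater_rel
    k = (target_rel * (1 - r)) / (r * (1 - target_rel))
    return math.ceil(k)

# How many peers (at a single-rater reliability of .42) must be averaged
# to reach an aggregate reliability of .80?
k = raters_needed(0.42, 0.80)
```

Under these assumed values, averaging across about six peers would be required, since five raters yield an aggregate reliability just under .80 and six raters yield one just over it.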
Although the alpha estimates for all dimensions as well as for overall ratings were higher than .70, it is important to note that these estimates are inclusive of, among other things, a halo component. Another observation is that for supervisory ratings of overall job performance, the coefficient of stability reported above and the coefficient alpha were similar in size (.81 and .86, respectively). This finding supports the inference that the variance of transient errors (variance due to rater mental states or moods that vary over days) is small. These figures suggest that this source of measurement error variance in overall job performance ratings is only 5% of the total variance (.86 - .81 = .05). In Table 4, it can be seen that peer ratings of overall job performance had a mean alpha of .85 (k = 10, N = 1,270). The intrarater reliabilities associated with peer ratings of leadership, effort, and interpersonal competence were above .60. As with interrater reliability, comparisons of alphas for peer and supervisory ratings should be tentative. It appears that intrarater reliability is lower for peer than for supervisory ratings of specific job performance dimensions, but not for overall performance. An interesting point for both peer and supervisory ratings is that the alphas were higher for the overall job performance ratings than for any of the dimensional ratings. For supervisors, the coefficient alpha for overall job performance was .86, whereas the mean sample size weighted alpha across the specific job performance dimensions was .78. For peers, the coefficient alpha for overall job performance was .85, whereas the mean sample size weighted alpha across the specific job performance dimensions was .68. There are two potential explanations for this result. First, it could be due to the greater length of the instrument used for measurement. In a large number of the studies we coded, overall job performance was measured by summing the various dimensions of job performance into a composite. However, we should point out that the relationship between the number of items and reliability is best described as concave: reliability increases rapidly at first as the number of items increases, but beyond some point further increases in reliability are very small. Most of the scales meta-analyzed in this article had enough items that further increases in length (and the application of the Spearman-Brown formula) did not make an appreciable difference. Note that this could indirectly explain our earlier finding that both supervisor and peer interrater reliabilities for specific dimensions of job performance are similar to the interrater reliabilities for overall job performance. The second potential explanation for the higher alphas for ratings of overall job performance stems from the broadness of the construct of overall job performance compared with any of the constructs represented by the individual dimensions of job performance. There is some evidence (at least in the personality domain) that broader constructs are rated more reliably than narrowly defined constructs (Ones & Viswesvaran, in press). Unfortunately, this meta-analytic investigation cannot determine which of the two potential explanations is correct.

Comparison of Different Types of Reliability Estimates

Conceptually, given that (a) the reliability coefficient is the ratio of true to observed variance and (b) observed variance is true plus error variance, all types of reliability estimates have the same denominator.
Coefficient alpha (using a single rater) has variance specific to the rater and variance due to transient error in the numerator. The coefficient of stability, or rate-rerate reliability with the same rater, has variance specific to the rater in the numerator (assuming true performance did not change in the rate-rerate interval), but not transient variance. Thus, the difference between coefficient alpha and the coefficient of stability with the same rater gives an estimate of the transient error in that job performance dimension, as noted earlier. Interrater reliability has neither variance specific to the rater nor transient error variance in the numerator. Therefore, the difference between the coefficient of stability and the interrater reliability provides an estimate of the variance from rater idiosyncrasy. For both peer and supervisory ratings, and for all dimensions and overall ratings, interrater reliability estimates are substantially lower than intrarater reliability estimates (coefficients of stability and coefficient alphas). For example, consider supervisory ratings of overall job performance: the mean interrater reliability estimate is .52, the mean coefficient of stability is .81 (on the basis of ratings from the same rater at the two points in time), and the mean coefficient alpha is .86. Approximately 29% of the variance (.81 - .52 = .29) in supervisory ratings of overall job performance appears to be due to rater idiosyncrasy, whereas 5% of the variance [(.86 - .81) x 100] is estimated to be from transient error, assuming true job performance is stable. Similar analyses can be done for other dimensions on the basis of the data reported in Tables 2-4 to compare the magnitudes of the sources of error in ratings of different dimensions of job performance. Intrarater reliabilities for supervisory ratings of job performance dimensions fall between .70 and .90, whereas the mean interrater reliabilities range from approximately .50 to .65.
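The variance decomposition described above can be expressed directly in code. This sketch uses the supervisory overall-performance values from the text (alpha = .86, stability = .81, interrater = .52); treating 1 - alpha as random response error is our addition, following standard classical test theory logic rather than an explicit statement in the text:

```python
def decompose_rating_variance(alpha, stability, interrater):
    """Decompose observed rating variance (as proportions) into components,
    following the logic in the text: transient error = alpha - stability,
    and rater idiosyncrasy = stability - interrater, assuming stable true
    performance and the same rater at both time points.
    """
    return {
        "shared (interrater) variance": interrater,      # agreed on across raters
        "rater idiosyncrasy": stability - interrater,    # stable rater-specific
        "transient error": alpha - stability,            # mood/day-to-day
        "random response error": 1 - alpha,              # residual (our assumption)
    }

# Supervisory ratings of overall job performance (values from the text).
parts = decompose_rating_variance(alpha=0.86, stability=0.81, interrater=0.52)
```

The components reproduce the figures in the text: 29% rater idiosyncrasy and 5% transient error, with the four proportions summing to 1.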
The difference between the intrarater and interrater reliability estimates that we obtained indicates that 20% to 30% of the variance in job performance dimension ratings of the average rater is specific to the rater. Using coefficient alpha instead of the interrater reliability of job performance ratings to correct observed validities (say, in validating interviews) will underestimate the validity. Lacking empirically derived reliability distributions like those yielded by this study, previous meta-analysts may have combined correct interrater and incorrect intrarater reliabilities. However, future meta-analyses involving job performance ratings should use the appropriate reliability coefficients (Schmidt & Hunter, 1996) to obtain more precise estimates of correlations that can be used for theory testing.

Discussion

Job performance measures play a crucial role in research and practice. Ratings (both peer and supervisory) are an important method of job performance measurement in organizations, and many decisions are made on the basis of ratings. As such, the reliability of ratings is an important concern in organizational science. Depending on the objective of the researcher, different reliability estimates need to be assessed. In personnel selection, the use of intrarater reliabilities to correct criterion-related validity coefficients for unreliability in job performance ratings may result in substantial downward biases in estimates of actual operational validity. This bias arises mostly from including rater-specific error variance (variance due to rater idiosyncrasies) as true job performance variance in computing intrarater reliability. On the other hand, what is needed to assess actual job performance and its dimensions is an answer to the question: Would the same ratings be obtained if a different but equally knowledgeable judge rated the same employees? This calls for an assessment of interrater reliability. This
is why interrater reliability, not coefficient alpha or rate-rerate reliability with the same rater, is the appropriate reliability for making corrections for criterion unreliability in validation research. This article quantitatively summarizes the available evidence in the literature for use by researchers and practitioners. A question for future research is whether the interrater reliability of overall job performance ratings can be increased by obtaining dimensional ratings before obtaining the overall ratings.4 (Note that a similar potential does not exist for intrarater reliabilities.) It is possible that when overall performance is rated after dimension ratings are made, interrater reliabilities for overall ratings are higher because all raters have a more similar frame-of-reference than when the overall performance rating is made on its own or when overall ratings precede dimensional ratings. Furthermore, the issue is complicated by the fact that in many studies overall job performance ratings are obtained by summing the dimensional ratings, whereas in others overall ratings are obtained on a single item (or a few items) before or after dimensional ratings are provided. To the extent that frame-of-reference effects were operating, the standard deviation of the interrater reliability for overall ratings should be higher than the standard deviation for dimensional ratings. That is, some studies would have obtained overall performance ratings prior to dimensional ratings, and others would have obtained overall ratings after dimensional ratings. If the frame-of-reference hypothesis were correct, in a meta-analytic investigation this would have been detected as greater variance in the interrater reliability of overall job performance ratings. Of course, the interrater reliability of dimensional ratings would not have this source of variance.
Hence, the standard deviation of the interrater reliability for overall ratings would be high compared with the standard deviation for dimensional ratings. Our results indicate that this is not the case. However, given that the studies contributing to our overall job performance analyses were a mixture of sums of dimensional ratings and items directly assessing overall job performance, we cannot reach any definite conclusions regarding frame-of-reference effects. In any event, this is an interesting hypothesis for future research. In cumulating results across studies, a concern exists that moderating influences may be obscured. The low values of the standard deviations (compared with the means) mitigate this concern to some extent. Furthermore, Churchill and Peter (1984) and Peterson (1994) examined as many as 13 moderators of reliability estimates (e.g., whether the reliabilities were obtained for research or administrative purposes). No substantial relationships were found between any hypothesized moderator and the magnitude of reliability estimates. A potentially important moderating influence may be whether the ratings were obtained for research or administrative purposes. McDaniel, Whetzel, Schmidt, and Maurer (1994) found that the purpose of the performance ratings (administrative vs. research) moderated the validities of employment interviews. In this study, we examined three moderators of job performance rating reliabilities: type of reliability (interrater vs. intrarater), source of rating (peer vs. supervisor), and the job performance dimension rated. We were not able to examine the moderating influence of administrative versus research-based ratings, primarily because, given the number of studies, analysis of any additional moderator in a fully hierarchical design (Hunter & Schmidt, 1990) would have resulted in too few studies for a robust meta-analysis.
The concern for sufficient data to detect moderators, coupled with the fact that previous meta-analyses (e.g., Peterson, 1994) that included alternate moderators did not find support for them, led us to focus only on these three moderators (type of reliability, source of rating, and rating content). However, future research should examine the interaction of these three moderators with other potential moderators, such as the purpose for which the ratings were obtained (administrative vs. research). The results reported here can be used to construct reliability artifact distributions for use in meta-analyses (Hunter & Schmidt, 1990) when correcting for unreliability in criterion ratings. For example, the report by a National Academy of Sciences (NAS) panel (Hartigan & Wigdor, 1989) evaluating the utility gains from validity generalization (Hunter, 1983) maintained that the mean interrater reliability estimate of .60 used by Hunter (1983) was too small and that the interrater reliability of supervisory ratings of overall job performance is better estimated as .80. The results reported here indicate that the average interrater reliability of supervisory ratings of job performance (cumulated across all studies available in the literature) is .52. Furthermore, this value is similar to that obtained by Rothstein (1990), although we should point out that a recent large-scale primary study (N = 2,249) obtained a lower value of .45 (Scullen et al., 1995). On the basis of our findings, we estimate that the probability of the interrater reliability of supervisory ratings of overall job performance being as high as .80 (as claimed by the NAS panel) is very close to zero. These findings indicate that the reliability estimate used by Hunter (1983) is, if anything, probably an overestimate of the reliability of supervisory ratings of overall job performance.
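The probability statement above can be illustrated under the paper's normality assumption for the distribution of population reliabilities. In this sketch the mean of .52 is from the text, while the residual standard deviation of .09 is an assumed, illustrative value:

```python
import math

def prob_reliability_at_least(cutoff, mean_rel, resid_sd):
    """Probability that a population reliability value is at least `cutoff`,
    assuming population values are normally distributed with the
    meta-analytic mean and residual standard deviation."""
    z = (cutoff - mean_rel) / resid_sd
    # Standard normal upper-tail probability via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Mean interrater reliability .52 (from the text); residual SD .09 is assumed.
p = prob_reliability_at_least(0.80, 0.52, 0.09)
```

With these inputs, .80 lies more than three standard deviations above the mean, so the probability is well under one in a thousand, consistent with the argument that the NAS panel's .80 figure is implausible.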
Thus, it appears that Schmidt, Ones, and Hunter (1992) were correct in concluding that the NAS panel underestimated the validity of the General Aptitude Test Battery (GATB). The estimated validity of other operational tests may be similarly rescrutinized.

4 We thank an anonymous reviewer for suggesting this.

An anonymous reviewer presented two concerns as fundamental questions that need to be addressed. First, the reviewer raised the question of whether reliability corrections should be undertaken when one does not have estimates from the same study in which the validity was estimated. Second, if the answer to the first question is affirmative, should one use the mean reliabilities reported in this article or some more conservative value (e.g., the 80% upper bound values reported in this article)? There are two reasons for answering the first question in the affirmative. First, any bias introduced into the estimated true validity from using the reliability estimates reported in this article will be much less than the downward bias in validity estimates if no corrections were undertaken. That is, when reliability estimates from the sample are not available, the alternative is to make no corrections. Second, the meta-analytically obtained reliability estimates reported here may be more accurate than the sample-based estimates a primary researcher could obtain, given the major effect of sampling error on reliability estimates in single studies (which typically have small sample sizes). Using the meta-analytically obtained estimates reported here instead of sample-based estimates may therefore result in greater accuracy. The numerous simulation studies indicating the robustness of artifact distribution-based meta-analyses (cf. Hunter & Schmidt, 1990) support the conclusion that bias is lower when meta-analytically obtained means are used to correct for bias than if either (a) sample-based estimates are used in the corrections or (b) no corrections are made. The answer to the second question raised by the reviewer can also be framed in terms of bias in the estimated correlations.
Using conservative values for reliability results in more bias than using the mean values. Many researchers maintain that being conservative is good science, but conservative estimates are by definition biased estimates. We believe it is more appropriate to aim for unbiased estimates, because the research goal is to maximize the accuracy of the final estimates. Future meta-analytic research is needed to examine the reliability of criteria obtained from other sources, such as customer ratings, self-ratings, and subordinate ratings. In a large-scale primary study (N = 2,273), Scullen et al. (1995) reported that the interrater reliability of subordinate ratings is similar to that obtained for peers (ranging between .31 and .36) for various dimensions of job performance. We see the efforts of Scullen et al. (1995) as a valuable first step toward generalizable conclusions about the reliability of subordinate ratings. Future research is also needed to examine the process mechanisms (e.g., Campbell, 1990; DeNisi, Cafferty, & Meglino, 1984) by which criterion data are gathered, and thus improve the reliability of the obtained ratings. The present study makes several unique contributions. In particular, we want to clearly delineate how our study contributes over and beyond the Rothstein (1990) study, which was the largest scale study reported examining the interrater reliability of supervisory ratings. First, Rothstein (1990) focused only on interrater reliabilities; here, we investigated both interrater and intrarater reliabilities, cumulating interrater reliabilities, coefficient alphas, and test-retest reliabilities. Second, Rothstein (1990) focused on overall job performance and did not examine the reliabilities of dimensions of the job performance construct.
Given the theoretical arguments and hypothesized rating processes that posit different reliabilities for different dimensions, we examined the reliability of different dimensions of job performance as well as the reliability of overall job performance. Third, whereas the Rothstein (1990) study was based on a large sample, it was nevertheless a single primary study confined to one research corporation that markets the Supervisory Profile Record. Finally, Rothstein (1990) focused on the reliabilities of supervisory ratings only. We analyzed both supervisory and peer ratings, and we examined whether the reliabilities of peer and supervisory ratings are similar across job performance dimensions. However, in contrast to our study, Rothstein (1990) was able to examine the effects of length of exposure on interrater reliability with her primary data. We were not able to test this effect, as most studies did not specify how long the raters had been exposed to the ratees. (Of course, that was not the focus of many of the studies making up our database.) Future meta-analytic research should attempt to generalize the Rothstein (1990) findings with regard to length of exposure to other rating instruments. The results of this article offer psychometric insights into the psychological and substantive characteristics of job performance measures. The construction of generalizable theories of job performance starts with the reliable measurement of job performance dimensions. Given that ratings (supervisory and peer) are used most frequently in the measurement of this central construct, it is crucial that researchers and managers be concerned about the reliability of these measurements. For research involving the construct of job performance, accurate construct measurement is predicated on reliable job performance measurement. For practice, accurate administrative decisions depend on the reliable measurement of job performance.
It is our hope that the results presented here can be used to understand and improve job performance measurement in organizations.

References

The asterisk (*) indicates studies that were included in the meta-analysis.

*Albrecht, P. A., Glaser, E. M., & Marks, J. (1964). Validation of a multiple-assessment procedure for managerial personnel. Journal of Applied Psychology, 48.
*Anderson, H. E., Jr., Roush, S. L., & McClary, J. E. (1973). Relationships among ratings, production, efficiency, and the general aptitude test battery scales in an industrial setting. Journal of Applied Psychology, 58.
*Arvey, R. D., Landon, T. E., Nutting, S. M., & Maxwell, S. E. (1992). Development of physical ability tests for police officers: A construct validation approach. Journal of Applied Psychology, 77.
*Ashford, S. J., & Tsui, A. S. (1991). Self-regulation for managerial effectiveness: The role of active feedback seeking. Academy of Management Journal, 34.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917-1992. Journal of Applied Psychology, 77.
*Baird, L. S. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20.
Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice-Hall.
*Barrick, M. R., Mount, M. K., & Strauss, J. P. (1993). Conscientiousness and performance of sales representatives: Test of the mediating effects of goal setting. Journal of Applied Psychology, 78.
*Bass, A. R., & Turner, J. N. (1973). Ethnic group differences in relationships among criteria of job performance. Journal of Applied Psychology, 57.
*Becker, T. E., & Vance, R. J. (1993). Construct validity of three types of organizational citizenship behavior: An illustration of the direct product model with refinements. Journal of Management, 19.
*Bernardin, H. J. (1987). Effect of reciprocal leniency on the relation between consideration scores from the leader behavior description questionnaire and performance ratings. Psychological Reports, 60.
Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston: Kent.
*Bhagat, R. S., & Allie, S. M. (1989). Organizational stress, personal life style, and symptoms of life strains: An examination of the moderating role of sense of competence. Journal of Vocational Behavior, 35.
*Blank, W., Weitzel, J. R., & Green, S. G. (1990). A test of the situational leadership theory. Personnel Psychology, 43.
*Blanz, F., & Ghiselli, E. E. (1972). The mixed standard scale: A new rating system. Personnel Psychology, 25.
*Blau, G. (1986). The relationship of management level to effort level, direction of effort, and managerial performance. Journal of Vocational Behavior, 29.
*Blau, G. (1988). An investigation of the apprenticeship organizational socialization strategy. Journal of Vocational Behavior, 32.
*Blau, G. (1990). Exploring the mediating mechanisms affecting the relationship of recruitment source to employee performance. Journal of Vocational Behavior, 37.
*Bledsoe, J. C. (1981). Factors related to academic and job performance of graduates of practical nursing programs. Psychological Reports, 49.
Blum, M. L., & Naylor, J. C. (1968). Industrial psychology: Its theoretical and social foundations. New York: Harper & Row.
*Borman, W. C. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12.
Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64.
Borman, W. C., Hough, L. M., & Dunnette, M. D. (1976). Performance ratings: An investigation of reliability, accuracy, and relationship between individual differences and rater error. Minneapolis, MN: Personnel Decisions.
*Breaugh, J. A. (1981a). Predicting absenteeism from prior absenteeism and work attitudes. Journal of Applied Psychology, 66.
*Breaugh, J. A. (1981b). Relationships between recruiting sources and employee performance, absenteeism, and work attitudes. Academy of Management Journal, 24.
Brogden, H. E. (1946). An approach to the problem of differential prediction. Psychometrika, 11.
*Buckner, D. N. (1959). The predictability of ratings as a function of interrater agreement. Journal of Applied Psychology, 43.
*Buel, W. D., & Bachner, V. M. (1961). The assessment of creativity in a research setting. Journal of Applied Psychology, 45.
*Bushe, G. R., & Gibbs, B. W. (1990). Predicting organization development consulting competence from the Myers-Briggs type indicator and state of ego development. Journal of Applied Behavioral Science, 26.
*Butler, M. C., & Ehrlich, S. B. (1991). Positional influences on job satisfaction and job performance: A multivariate, predictive approach. Psychological Reports, 69.
Callender, J. C., & Osburn, H. G. (1988). Unbiased estimation of the sampling variance of correlations. Journal of Applied Psychology, 73.
*Campbell, C. H., Ford, P., Rumsey, M. G., Pulakos, E. D., Borman, W. C., Felker, D. B., De Vera, M. V., & Riegelhaupt, B. J. (1990). Development of multiple job performance measures in a representative sample of jobs. Personnel Psychology, 43.
Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1). Palo Alto, CA: Consulting Psychologists Press.
*Campbell, J. P., Dunnette, M. D., Arvey, R. D., & Hellervik, L. V. (1973). The development and evaluation of behaviorally based rating scales. Journal of Applied Psychology, 57.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variability. In K. R. Murphy (Ed.), Individual differences and behavior in organizations. San Francisco: Jossey-Bass.

RELIABILITY OF RATINGS 569
Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp ). San Francisco: Jossey-Bass. Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice-Hall. *Cascio, W. F., & Valenzi, E. R. (1977). Behaviorally anchored rating scales: Effects of education and job experience of raters and ratees. Journal of Applied Psychology, 62, *Cascio, W. F., & Valenzi, E. R. (1978). Relations among criteria of police performance. Journal of Applied Psychology, 63, *Cheloha, R. S., & Farr, J. L. (1980). Absenteeism, job involvement, and job satisfaction in an organizational setting. Journal of Applied Psychology, 65, Christensen, L. (1974). The influence of trait, sex, and information accuracy of personality assessment. Journal of Personality Assessment, 38, Churchill, G. A., Jr., & Peter, J. P. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21, *Cleveland, J. N., & Landy, F. J. (1981). The influence of rater and ratee age on two performance judgments. Personnel Psychology, 34, Cleveland, J. N., Murphy, K. R., & Williams, R. E. (1989). Multiple uses of performance appraisal: Prevalence and correlates. Journal of Applied Psychology, 74, *Cleveland, J. N., & Shore, L. M. (1992). Self- and supervisory perspectives on age and work attitudes and performance. Journal of Applied Psychology, 77, *Colarelli, S. M., Dean, R. A., & Konstans, C. (1987). Comparative effects of personal and situational influences on job outcomes of new professionals. Journal of Applied Psychology, 72, *Cooper, R. (1966). Leader's task relevance and subordinate behaviour in industrial work groups. Human Relations, 19, *Cooper, R., & Payne, R. (1967). Extraversion and some aspects of work behavior. Personnel Psychology, 20, *Cortina, J. M., Doherty, M. L., Schmitt, N., Kaufman, G., & Smith, R. G. (1992). The "Big Five" personality factors in the IPI and MMPI: Predictors of police performance. Personnel Psychology, 45, *Cotton, J., & Stoltz, R. E. (1960). The general applicability of a scale for rating research productivity. Journal of Applied Psychology, 44, Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, *David, F. R., Pearce, J. A., II, & Randolph, W. A. (1989). Linking technology and structure to enhance group performance. Journal of Applied Psychology, 74, *Day, D. W., & Silverman, S. B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology, 42, *Deadrick, D. L., & Madigan, R. M. (1990). Dynamic criteria revisited: A longitudinal study of performance stability and predictive validity. Personnel Psychology, 43, DeNisi, A. S., Cafferty, T. P., & Meglino, B. M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, *Dicken, C. F., & Black, J. D. (1965). Predictive validity of psychometric evaluations of supervisors. Journal of Applied Psychology, 49, *Dickinson, T. L., & Tice, T. E. (1973). A multitrait-multimethod analysis of scales developed by retranslation. Organizational Behavior and Human Performance, 9, *Distefano, M. K., Jr., Pryer, M. W., & Erffmeyer, R. C. (1983). Application of content validity methods to the development of a job-related performance rating criterion. Personnel Psychology, 36, *Dreher, G. F. (1981). Predicting the salary satisfaction of exempt employees. Personnel Psychology, 34, *Dunegan, K. J., Duchon, D., & Uhl-Bien, M. (1992). Examining the link between leader-member exchange and subordinate performance: The role of task analyzability and variety as moderators. Journal of Management, 18, Dunnette, M. D. (1963).
A note on the criterion. Journal of Applied Psychology, 47, *Edwards, P. K. (1979). Attachment to work and absence behavior. Human Relations, 32, *Ekpo-Ufot, A. (1979). Self-perceived task-relevant abilities, rated job performance, and complaining behavior of junior employees in a government ministry. Journal of Applied Psychology, 64, *Farh, J., Podsakoff, P. M., & Organ, D. W. (1990). Accounting for organizational citizenship behavior: Leader fairness and task scope versus satisfaction. Journal of Management, 16, *Farh, J., Werbel, J. D., & Bedeian, A. G. (1988). An empirical investigation of self-appraisal-based performance evaluation. Personnel Psychology, 41, *Farr, J. L., O'Leary, B. S., & Bartlett, C. J. (1971). Ethnic group membership as a moderator of the prediction of job performance. Personnel Psychology, 24, *Flanders, J. K. (1918). Mental tests of a group of employed men showing correlations with estimates furnished by employer. Journal of Applied Psychology, 2, *Gardner, D. G., Dunham, R. B., Cummings, L. L., & Pierce, J. L. (1989). Focus of attention at work: Construct definition and empirical validation. Journal of Occupational Psychology, 62, *Gerloff, E. A., Muir, N. K., & Bodensteiner, W. D. (1991). Three components of perceived environmental uncertainty: An exploratory analysis of the effects of aggregation. Journal of Management, 17, *Ghiselli, E. E. (1942). The use of the Strong Vocational Interest Blank and the Pressey Senior Classification Test in the selection of casualty insurance agents. Journal of Applied Psychology, 26, *Gough, H. G., Bradley, P., & McDonald, J. S. (1991). Performance of residents in anesthesiology as related to measures of personality and interests. Psychological Reports, 68,
570 VISWESVARAN, ONES, AND SCHMIDT
*Graen, G., Dansereau, F., Jr., & Minami, T. (1972). An empirical test of the man-in-the-middle hypothesis among executives in a hierarchical organization employing a unit-set analysis. Organizational Behavior and Human Performance, 8, *Graen, G., Novak, M. A., & Sommerkamp, P. (1982). The effects of leader-member exchange and job design on productivity and satisfaction: Testing a dual attachment model. Organizational Behavior and Human Performance, 30, *Green, S. B., & Stutzman, T. (1986). An evaluation of methods to select respondents to structured job-analysis questionnaires. Personnel Psychology, 39, *Greenhaus, J. H., Bedeian, A. G., & Mossholder, K. W. (1987). Work experiences, job performance, and feelings of personal and family well-being. Journal of Vocational Behavior, 31, *Griffin, R. W. (1991). Effects of work redesign on employee perceptions, attitudes, and behaviors: A long-term investigation. Academy of Management Journal, 34, *Guion, R. M. (1965). Synthetic validity in a small company: A demonstration. Personnel Psychology, 18, *Gunderson, E. K. E., & Nelson, P. D. (1966). Criterion measures for extremely isolated groups. Personnel Psychology, 19, *Gunderson, E. K. E., & Ryman, D. H. (1971). Convergent and discriminant validities of performance evaluations in extremely isolated groups. Personnel Psychology, 24, *Hackman, J. R., & Lawler, E. E., III. (1971). Employee reactions to job characteristics [Monograph]. Journal of Applied Psychology, 55, *Hackman, J. R., & Porter, L. W. (1968). Expectancy theory predictions of work effectiveness. Organizational Behavior and Human Performance, 3, Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employee testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press. *Hatcher, L., Ross, T. L., & Collins, D. (1989). Prosocial behavior, job complexity, and suggestion contribution under gainsharing plans.
Journal of Applied Behavioral Science, 25, *Hater, J. J., & Bass, B. M. (1988). Superiors' evaluations and subordinates' perceptions of transformational and transactional leadership. Journal of Applied Psychology, 73, *Hausman, H. J., & Strupp, H. H. (1955). Non-technical factors in supervisors' ratings of job performance. Personnel Psychology, 5, *Heneman, H. G., III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, *Heron, A. (1954). Satisfaction and satisfactoriness: Complementary aspects of occupational adjustment. Occupational Psychology, 28, *Hilton, A. C., Bolin, S. R., Parker, J. W., Jr., Taylor, E. K., & Walker, W. B. (1955). The validity of personnel assessments by professional psychologists. Journal of Applied Psychology, 39, *Hoffman, C. C., Nathan, B. R., & Holden, L. M. (1991). A comparison of validation criteria: Objective versus subjective performance measures and self- versus supervisor ratings. Personnel Psychology, 44, *Hogan, J., Hogan, R., & Busch, C. M. (1984). How to measure service orientation. Journal of Applied Psychology, 69, *Hough, L. M. (1984). Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal of Applied Psychology, 69, *Huck, J. R., & Bray, D. W. (1976). Management assessment center evaluations and subsequent job performance of white and black females. Personnel Psychology, 29, *Hughes, G. L., & Prien, E. P. (1986). An evaluation of alternate scoring methods for the mixed standard scale. Personnel Psychology, 39, Hunter, J. E. (1983). Test validation for 12,000 jobs: An application of job classification and validity generalization to the General Aptitude Test Battery (U.S. Employment Service Test Research Report No. 45). Washington, DC: U.S. Department of Labor. Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting for error and bias in research findings. Newbury Park, CA: Sage.
*Ivancevich, J. M. (1980). A longitudinal study of behavioral expectation scales: Attitudes and performance. Journal of Applied Psychology, 65, *Ivancevich, J. M. (1983). Contrast effects in performance evaluation and reward practices. Academy of Management Journal, 26, *Ivancevich, J. M. (1985). Predicting absenteeism from prior absence and work attitudes. Academy of Management Journal, 28, *Ivancevich, J. M., & McMahon, J. T. (1977). Black-white differences in a goal-setting program. Organizational Behavior and Human Performance, 20, *Ivancevich, J. M., & McMahon, T. J. (1982). The effects of goal setting, external feedback, and self-generated feedback on outcome variables: A field experiment. Academy of Management Journal, 25, *Ivancevich, J. M., & Smith, S. V. (1981). Goal setting interview skills training: Simulated and on-the-job analyses. Journal of Applied Psychology, 66, *Ivancevich, J. M., & Smith, S. V. (1982). Job difficulty as interpreted by incumbents: A study of nurses and engineers. Human Relations, 35, *Jamal, M. (1984). Job stress and job performance controversy: An empirical assessment. Organizational Behavior and Human Performance, 33, *James, L. R., & Ellison, R. L. (1973). Criterion composites for scientific creativity. Personnel Psychology, 26, Jensen, A. R. (1980). Bias in mental testing. New York: Free Press. *Johnson, J. A., & Hogan, R. (1981). Vocational interests, personality and effective police performance. Personnel Psychology, 34, *Jones, J. W., & Terris, W. (1983). Predicting employees' theft in home improvement centers. Psychological Reports, 52,

*Jordan, J. L. (1989). Effects of race on interrater reliability of peer ratings. Psychological Reports, 64, *Jurgensen, C. E. (1950). Intercorrelations in merit rating traits. Journal of Applied Psychology, 34, *Keller, R. T. (1984). The role of performance and absenteeism in the prediction of turnover. Academy of Management Journal, 27, *King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, *Klaas, B. S. (1989). Managerial decision making about employee grievances: The impact of the grievant's work history. Personnel Psychology, 42, *Klaas, B. S., & DeNisi, A. S. (1989). Managerial reactions to employee dissent: The impact of grievance activity on performance ratings. Academy of Management Journal, 32, *Klimoski, R. J., & Hayes, N. J. (1980). Leader behavior and subordinate motivation. Personnel Psychology, 33, *Knauft, E. B. (1949). A selection battery for bake shop managers. Journal of Applied Psychology, 33, *Kubany, A. J. (1957). Use of sociometric peer nominations in medical education research. Journal of Applied Psychology, 41, *Landy, F. J., & Guion, R. M. (1970). Development of scales for the measurement of work motivation. Organizational Behavior and Human Performance, 5, *Latham, G. P., Fay, C. H., & Saari, L. M. (1979). The development of behavioral observation scales for appraising the performance of foremen. Personnel Psychology, 32, *Latham, G. P., & Wexley, K. N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 30, *Lawshe, C. H., & McGinley, A. D., Jr. (1951). Job performance criteria studies: I. The job performance of proofreaders. Journal of Applied Psychology, 35, *Lee, R., Malone, M., & Greco, S. (1981). Multitrait-multimethod-multirater analysis of performance ratings for law enforcement personnel. Journal of Applied Psychology, 66, *Love, K. G. (1981).
Comparison of peer assessment methods: Reliability, validity, friendship bias, and user reaction. Journal of Applied Psychology, 66, *Love, K. G., & O'Hara, K. (1987). Predicting job performance of youth trainees under a Job Training Partnership Act program (JTPA): Criterion validation of a behavior-based measure of work maturity. Personnel Psychology, 40, *MacKenzie, S. B., Podsakoff, P. M., & Fetter, R. (1991). Organizational citizenship behavior and objective productivity as determinants of managerial evaluations of salespersons' performance. Organizational Behavior and Human Decision Processes, 50, *Matteson, M. T., Ivancevich, J. M., & Smith, S. V. (1984). Relation of Type A behavior to performance and satisfaction among sales personnel. Journal of Vocational Behavior, 25, *Mayfield, E. C. (1970). Management selection: Buddy nominations revisited. Personnel Psychology, 23, *McCarrey, M. W., & Edwards, S. A. (1973). Organizational climate conditions for effective research scientist role performance. Organizational Behavior and Human Performance, 9, *McCauley, C. D., Lombardo, M. M., & Usher, C. J. (1989). Diagnosing management development needs: An instrument based on how managers develop. Journal of Management, 15, McCloy, R. A., Campbell, J. P., & Cudeck, R. (1994). A confirmatory test of a model of performance determinants. Journal of Applied Psychology, 79, McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, *McEvoy, G. M., & Beatty, R. W. (1989). Assessment centers and subordinate appraisals of managers: A seven-year examination of predictive validity. Personnel Psychology, 42, *Meglino, B. M., Ravlin, E. C., & Adkins, C. L. (1989). A work values approach to corporate culture: A field test of the value congruence process and its relationship to individual outcomes. Journal of Applied Psychology, 74, *Meredith, G. M.
(1990). Dossier evaluation in screening candidates for excellence in teaching awards. Psychological Reports, 67, *Meyer, J. P., Paunonen, S. V., Gellatly, I. R., Goffin, R. D., & Jackson, D. N. (1989). Organizational commitment and job performance: It's the nature of the commitment that counts. Journal of Applied Psychology, 74, *Miner, J. B. (1970). Executive and personnel interviews as predictors of consulting success. Personnel Psychology, 23, *Miner, J. B. (1970). Psychological evaluations as predictors of consulting success. Personnel Psychology, 23, *Mitchell, T. R., & Albright, D. W. (1972). Expectancy theory predictions of the satisfaction, effort, performance, and retention of naval aviation officers. Organizational Behavior and Human Performance, 8, *Morgan, R. B. (1993). Self- and co-worker perceptions of ethics and their relationships to leadership and salary. Academy of Management Journal, 36, *Morse, J. J., & Wagner, F. R. (1978). Measuring the process of managerial effectiveness. Academy of Management Journal, 21, *Mossholder, K. W., Bedeian, A. G., Norris, D. R., Giles, W. F., & Feild, H. S. (1988). Job performance and turnover decisions: Two field studies. Journal of Management, 14, *Motowidlo, S. J. (1982). Relationship between self-rated performance and pay satisfaction among sales representatives. Journal of Applied Psychology, 67, *Mount, M. K. (1984). Psychometric properties of subordinate ratings of managerial performance. Personnel Psychology, 37, *Nathan, B. R., Mohrman, A. M., Jr., & Milliman, J. (1991). Interpersonal relations as a context for the effects of appraisal interviews on performance and satisfaction: A longitudinal study. Academy of Management Journal, 34,

*Nealey, S. M., & Owen, T. W. (1970). A multitrait-multimethod analysis of predictors and criteria of nursing performance. Organizational Behavior and Human Performance, 5, *Niehoff, B. P., & Moorman, R. H. (1993). Justice as a mediator of the relationship between methods of monitoring and organizational citizenship behavior. Academy of Management Journal, 36, *Noe, R. A., & Schmitt, N. (1986). The influence of trainee attitudes on training effectiveness: Test of a model. Personnel Psychology, 39, *Norris, D. R., & Niebuhr, R. E. (1984). Organization tenure as a moderator of the job satisfaction-job performance relationship. Journal of Vocational Behavior, 24, Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill. *O'Connor, E. J., Peters, L. H., Pooyan, A., Weekley, J., Frank, B., & Erenkrantz, B. (1984). Situational constraint effects on performance, affective reactions, and turnover: A field replication and extension. Journal of Applied Psychology, 69, *Oldham, G. R. (1976). The motivational strategies used by supervisors: Relationships to effectiveness indicators. Organizational Behavior and Human Performance, 15, Ones, D. S., & Viswesvaran, C. (in press). Bandwidth-fidelity dilemma in personality measurement for personnel selection. Journal of Organizational Behavior. *Organ, D. W., & Konovsky, M. (1989). Cognitive versus affective determinants of organizational citizenship behavior. Journal of Applied Psychology, 74, *Otten, M. W., & Kahn, M. (1975). Effectiveness of crisis center volunteers and the Personal Orientation Inventory. Psychological Reports, 37, *Parker, J. W., Taylor, E. K., Barrett, R. S., & Martens, L. (1959). Rating scale content: III. Relationship between supervisory- and self-ratings. Personnel Psychology, 12, *Parsons, C. K., Herold, D. M., & Leatherwood, M. L. (1985). Turnover during initial employment: A longitudinal study of the role of causal attributions.
Journal of Applied Psychology, 70, *Penley, L. E., & Hawkins, B. L. (1980). Organizational communication, performance, and job satisfaction as a function of ethnicity and sex. Journal of Vocational Behavior, 16, Peterson, R. A. (1994). A meta-analysis of Cronbach's coefficient alpha. Journal of Consumer Research, 21, *Podsakoff, P. M., Niehoff, B. P., MacKenzie, S. B., & Williams, M. L. (1993). Do substitutes for leadership really substitute for leadership? An empirical examination of Kerr and Jermier's situational leadership model. Organizational Behavior and Human Decision Processes, 54, *Podsakoff, P. M., Todor, W. D., & Skov, R. (1982). Effects of leader contingent and noncontingent reward and punishment behaviors on subordinate performance and satisfaction. Academy of Management Journal, 25, *Prien, E. P., & Kult, M. (1968). Analysis of performance criteria and comparison of a priori and empirically-derived keys for a forced-choice scoring. Personnel Psychology, 21, *Prien, E. P., & Liske, R. E. (1962). Assessments of higher level personnel: III. Rating criteria: A comparative analysis of supervisor ratings and incumbent self-ratings of job performance. Personnel Psychology, 15, *Puffer, S. M. (1987). Prosocial behavior, noncompliant behavior, and work performance among commission salespeople. Journal of Applied Psychology, 72, *Pulakos, E. D., Borman, W. C., & Hough, L. M. (1988). Test validation for scientific understanding: Two demonstrations of an approach to studying predictor-criterion linkages. Personnel Psychology, 41, *Pulakos, E. D., & Wexley, K. N. (1983). The relationship among perceptual similarity, sex, and performance ratings in manager-subordinate dyads. Academy of Management Journal, 26, *Pym, D. L. A., & Auld, H. D. (1965). The self-rating as a measure of employee satisfactoriness. Occupational Psychology, 39, *Rabinowitz, S., & Stumpf, S. A. (1987). 
Facets of role conflict, role-specific performance, and organizational level within the academic career. Journal of Vocational Behavior, 30, *Ronan, W. W. (1963). A factor analysis of eleven job performance measures. Personnel Psychology, 16, *Rosinger, G., Myers, L. B., Levy, G. W., Loar, M., Mohrman, S. A., & Stock, J. R. (1982). Development of a behaviorally based performance appraisal system. Personnel Psychology, 35, *Ross, P. F., & Dunfield, N. M. (1964). Selecting salesmen for an oil company. Personnel Psychology, 17, *Rosse, J. G. (1987). Job-related ability and turnover. Journal of Business and Psychology, 1, *Rosse, J. G., & Kraut, A. I. (1983). Reconsidering the vertical dyad linkage model of leadership. Journal of Occupational Psychology, 56, *Rosse, J. G., Miller, H. E., & Barnes, L. K. (1991). Combining personality and cognitive ability predictors for hiring service-oriented employees. Journal of Business and Psychology, 5, *Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, *Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, *Rousseau, D. M. (1978). Relationship of work to nonwork. Journal of Applied Psychology, 63, *Rush, C. H., Jr. (1953). A factorial study of sales criteria. Personnel Psychology, 6, *Russell, C. J. (1990). Selecting top corporate leaders: An example of biographical information. Journal of Management, 16, *Russell, C. J., Mattson, J., Devlin, S. E., & Atwater, D. (1990). Predictive validity of biodata items generated from retrospective life experience essays. Journal of Applied Psychology, 75, *Sackett, P. R., Zedeck, S., & Fogli, L. (1988). Relations between measures of typical and maximum job performance. Journal of Applied Psychology, 73, Salancik, G. R., & Pfeffer, J. (1978). A social information processing approach to job attitudes and task design. Administrative Science Quarterly, 23, *Schaubroeck, J., Ganster, D. C., Sime, W. E., & Ditman, D. (1993). A field experiment testing supervisory role clarification. Personnel Psychology, 46, *Schippmann, J. S., & Prien, E. P. (1986). Psychometric evaluation of an integrated assessment procedure. Psychological Reports, 59, Schmidt, F. L., & Hunter, J. E. (1992). Development of a causal model of processes determining job performance. Current Directions in Psychological Science, 1, Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (pp ). Palo Alto, CA: Annual Reviews. *Schuerger, J. M., Kochevar, K. F., & Reinwald, J. E. (1982). Male and female correction officers: Personality and rated performance. Psychological Reports, 51, Scullen, S. E., Mount, M. K., & Sytsma, M. R. (1995). Comparison of self, peer, direct report and boss ratings of managers' performance. Unpublished manuscript. *Seybolt, J. W., & Pavett, C. M. (1979). The prediction of effort and performance among hospital professionals: Moderating effects of feedback on expectancy theory formulations. Journal of Occupational Psychology, 52, *Siegel, A. I., Schultz, D. G., Fischl, M. A., & Lanterman, R. S. (1968). Absolute scaling of job performance. Journal of Applied Psychology, 52, *Siegel, L. (1982). Paired comparison evaluations of managerial effectiveness by peers and supervisors. Personnel Psychology, 35, *Slocum, J. W., Jr., & Cron, W. L. (1985). Job attitudes and performance during three career stages.
Journal of Vocational Behavior, 26, *Smircich, L., & Chesser, R. J. (1981). Superiors' and subordinates' perceptions of performance: Beyond disagreement. Academy of Management Journal, 24, *Sneath, F. A., White, G. C., & Randell, G. A. (1966). Validating a workshop reporting procedure. Occupational Psychology, 40, *Soar, R. S. (1956). Personal history as a predictor of success in service station management. Journal of Applied Psychology, 40, *South, J. C. (1974). Early career performance of engineers: Its composition and measurement. Personnel Psychology, 27, *Spector, P. E., Dwyer, D. J., & Jex, S. M. (1988). Relation of job stressors to affective, health, and performance outcomes: A comparison of multiple data sources. Journal of Applied Psychology, 73, *Spencer, D. G., & Steers, R. M. (1981). Performance as a moderator of the job satisfaction-turnover relationship. Journal of Applied Psychology, 66, *Spitzer, M. E., & McNamara, W. J. (1964). A managerial selection study. Personnel Psychology, 17, *Sprecher, T. B. (1959). A study of engineers' criteria for creativity. Journal of Applied Psychology, 43, *Springer, D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, *Steel, R. P., & Mento, A. J. (1986). Impact of situational constraints on subjective and objective criteria of managerial job performance. Organizational Behavior and Human Decision Processes, 37, *Steel, R. P., Mento, A. J., & Hendrix, W. H. (1987). Constraining forces and the work performance of finance company cashiers. Journal of Management, 13, *Steel, R. P., & Ovalle, N. K. (1984). Self-appraisal based upon supervisory feedback. Personnel Psychology, 37, *Steel, R. P., Shane, G. S., & Kennedy, K. A. (1990). Effects of social-system factors on absenteeism, turnover, and job performance. Journal of Business and Psychology, 4, Stout, S. K., Slocum, J. W., Jr., & Cron, W. L. (1987). Career transitions of superiors and subordinates.
Journal of Vocational Behavior, 30, Stuit, D. B., & Wilson, J. T. (1946). The effect of an increasingly well defined criterion on the prediction of success at naval training school (tactical radar). Journal of Applied Psychology, 30, *Stumpf, S. A. (1981). Career roles, psychological success, and job attitudes. Journal of Vocational Behavior, 19, *Stumpf, S. A., & Rabinowitz, S. (1981). Career stage as a moderator of performance relationships with facets of job satisfaction and role perceptions. Journal of Vocational Behavior, 18, *Sulkin, H. A., & Pranis, R. W. (1967). Comparison of grievants with non-grievants in a heavy machinery company. Personnel Psychology, 20, *Swaroff, P. G., Barclay, L. A., & Bass, A. R. (1985). Recruiting sources: Another look. Journal of Applied Psychology, 70, *Szilagyi, A. D. (1980). Causal inferences between leader reward behaviour and subordinate performance, absenteeism, and work satisfaction. Journal of Occupational Psychology, 53, *Taylor, E. K., Schneider, D. E., & Symons, N. A. (1953). A short forced-choice evaluation form for salesmen. Personnel Psychology, 6, Taylor, R. L., & Wilsted, W. D. (1974). Capturing judgment policies: A field study of performance appraisal. Academy of Management Journal, 17, Taylor, R. L., & Wilsted, W. D. (1976). Capturing judgment policies in performance rating. Industrial Relations, 15, Taylor, S. M., & Schmidt, D. W. (1983). A process-oriented investigation of recruitment source effectiveness. Personnel Psychology, 36, Tenopyr, M. L. (1969). The comparative validity of selected leadership scales relative to success in production management. Personnel Psychology, 22, Thompson, D. E., & Thompson, T. A. (1985). Task-based performance appraisal for blue-collar jobs: Evaluation of race and sex effects. Journal of Applied Psychology, 70, *Thomson, H. A. (1970). Comparison of predictor and criterion judgments of managerial performance using the multitrait-multimethod approach. Journal of Applied Psychology, 54, Toops, H. A. (1944). The criterion. Educational and Psychological Measurement, 4, Tsui, A. S., & Ohlott, P. (1988). Multiple assessment of managerial effectiveness: Interrater agreement and consensus in effectiveness models. Personnel Psychology, 41, Tucker, M. F., Cline, V. B., & Schmitt, J. R. (1967). Prediction of creativity and other performance measures from biographical information among pharmaceutical scientists. Journal of Applied Psychology, 51, Turner, W. W. (1960). Dimensions of foreman performance: A factor analysis of criterion measures. Journal of Applied Psychology, 44, *Validity information exchange. (1954). No. Personnel Psychology, 7, 279. *Validity information exchange. (1954). No. Personnel Psychology, 7, *Validity information exchange. (1956). No. Personnel Psychology, 9, *Validity information exchange. (1958). No. Personnel Psychology, 11, *Validity information exchange. (1958). No. Personnel Psychology, 11, *Validity information exchange. (1958). No. Personnel Psychology, 11, *Validity information exchange. (1960). No. Personnel Psychology, 13, *Validity information exchange. (1963). No. Personnel Psychology, 16, *Validity information exchange. (1963). No. Personnel Psychology, 16, *Vecchio, R. P. (1987). Situational leadership theory: An examination of a prescriptive theory. Journal of Applied Psychology, 72, *Vecchio, R. P., & Gobdel, B. C. (1984). The vertical dyad linkage model of leadership: Problems and prospects. Organizational Behavior and Human Performance, 34, *Villanova, P., & Bernardin, J. H. (1990). Work behavior correlates of interviewer job compatibility. Journal of Business and Psychology, 5, Viswesvaran, C. (1993).
Modeling job performance: Is there a general factor? Unpublished doctoral dissertation, University of Iowa, Iowa City. Viswesvaran, C., & Ones, D. S. (1995). Theory testing: Combining psychometric meta-analysis and structural equations modeling. Personnel Psychology, 48, *Waldman, D. A., Yammarino, F. J., & Avolio, B. J. (1990). A multiple level investigation of personnel ratings. Personnel Psychology, 43, *Wanous, J. P., Stumpf, S. A., & Bedrosian, H. (1979). Job survival of new employees. Personnel Psychology, 32, *Wayne, S. J., & Ferris, G. R. (1990). Influence tactics, affect, and exchange quality in supervisor-subordinate interactions: A laboratory experiment and field study. Journal of Applied Psychology, 75, *Wernimont, P. F., & Kirchner, W. K. (1972). Practical problems in the revalidation of tests. Occupational Psychology, 46, *Wexley, K. N., Alexander, R. A., Greenawalt, J. P., & Couch, M. A. (1980). Attitudinal congruence and similarity as related to interpersonal evaluations in manager-subordinate dyads. Academy of Management Journal, 23, *Wexley, K. N., & Pulakos, E. D. (1982). Sex effects on performance ratings in manager-subordinate dyads: A field study. Journal of Applied Psychology, 67, *Wexley, K. N., & Youtz, M. A. (1985). Rater beliefs about others: Their effects on rating errors and rater accuracy. Journal of Occupational Psychology, 58, *Williams, C. R., Labig, C. E., Jr., & Stone, T. H. (1993). Recruitment sources and posthire outcomes for job applicants and new hires: A test of two hypotheses. Journal of Applied Psychology, 78, *Williams, L. J., & Anderson, S. E. (1991). Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors. Journal of Management, 17, *Williams, W. E., & Seiler, D. A. (1973). Relationship between measures of effort and job performance. Journal of Applied Psychology, 57, Wohlers, A. J., & London, M. (1989).
Ratings of managerial characteristics: Evaluation difficulty, co-worker agreement, and self-awareness. Personnel Psychology, 42, *Woodmansee, J. J. (1978). Validation of the nurturance scale of the Edwards Personal Preference Schedule. Psychological Reports, 42, *Worbois, G. M. (1975). Validation of externally developed assessment procedures for identification of supervisory potential. Personnel Psychology, 28, *Yammarino, F. J., & Dubinsky, A. J. (1990). Salesperson performance and managerially controllable factors: An investigation of individual and work group effects. Journal of Management, 16, *Yukl, G. A., & Latham, G. P. (1978). Interrelationships among employee participation, individual differences, goal difficulty, goal acceptance, goal instrumentality, and performance. Personnel Psychology, 31, *Zedeck, S., & Baker, H. T. (1972). Nursing performance as measured by behavioral expectation scales: A multitrait-multirater analysis. Organizational Behavior and Human Performance, 7,

Received October 10, 1995
Revision received March 29, 1996
Accepted April 22, 1996


Debunking the Myths of Performance Appraisal/Management. C. Allen Gorman, PhD Debunking the Myths of Performance Appraisal/Management C. Allen Gorman, PhD Performance Appraisal/Management (PAM) Myth #1: PAM is Based on Objective Measurement Wherry & Bartlett (1982) ratings =

More information

ACHIEVEMENT VARIANCE DECOMPOSITION 1. Online Supplement

ACHIEVEMENT VARIANCE DECOMPOSITION 1. Online Supplement ACHIEVEMENT VARIANCE DECOMPOSITION 1 Online Supplement Ones, D. S., Wiernik, B. M., Wilmot, M. P., & Kostal, J. W. (2016). Conceptual and methodological complexity of narrow trait measures in personality-outcome

More information

The process of making an informed hiring decision depends largely on two basic principles of selection.

The process of making an informed hiring decision depends largely on two basic principles of selection. Unit 2: Workforce Planning and Employment 95 SELECTION Except in small organizations, the human resource department assumes the major responsibility for employee selection. The human resource department

More information

Standardized Measurement and Assessment

Standardized Measurement and Assessment Standardized Measurement and Assessment Measurement Identify dimensions, quantity, capacity, or degree of something Assign a symbol or number according to rules (e.g., assign a number for height in inches

More information

COUNTERPRODUCTIVE BEHAVIORS AND WORK PERFORMANCE IN MILITARY ORGANIZATION. Crenguţa Mihaela MACOVEI

COUNTERPRODUCTIVE BEHAVIORS AND WORK PERFORMANCE IN MILITARY ORGANIZATION. Crenguţa Mihaela MACOVEI International Conference KNOWLEDGE-BASED ORGANIZATION Vol. XXII No 2 2016 COUNTERPRODUCTIVE BEHAVIORS AND WORK PERFORMANCE IN MILITARY ORGANIZATION Crenguţa Mihaela MACOVEI Nicolae Bălcescu Land Forces

More information

Reliability & Validity Evidence for PATH

Reliability & Validity Evidence for PATH Reliability & Validity Evidence for PATH Talegent Whitepaper October 2014 Technology meets Psychology www.talegent.com Outline the empirical evidence from peer reviewed sources for the validity and reliability

More information

ALTE Quality Assurance Checklists. Unit 1. Test Construction

ALTE Quality Assurance Checklists. Unit 1. Test Construction s Unit 1 Test Construction Name(s) of people completing this checklist: Which examination are the checklists being completed for? At which ALTE Level is the examination at? Date of completion: Instructions

More information

The Concept of Organizational Citizenship Walter C. Borman

The Concept of Organizational Citizenship Walter C. Borman CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE The Concept of Organizational Citizenship Personnel Decisions Research Institutes, Inc., Tampa, Florida, and University of South Florida ABSTRACT This article

More information

Chapter 4 Fuzzy Analytic Hierarchy Process of Green Supply Chain Management in the Pharmaceutical Industry

Chapter 4 Fuzzy Analytic Hierarchy Process of Green Supply Chain Management in the Pharmaceutical Industry Chapter 4 Fuzzy Analytic Hierarchy Process of Green Supply Chain Management in the Pharmaceutical Industry 4.1 Introduction During the past decade with increasing environmental concerns, a consensus, the

More information

Understanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report

Understanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report Understanding the Dimensionality and Reliability of the Cognitive Scales of the UK Clinical Aptitude test (UKCAT): Summary Version of the Report Dr Paul A. Tiffin, Reader in Psychometric Epidemiology,

More information

IMPACT OF LEADERSHIP STYLE ADOPTED BY WOMEN MANAGEMENT ON TEAM PERFORMANCE

IMPACT OF LEADERSHIP STYLE ADOPTED BY WOMEN MANAGEMENT ON TEAM PERFORMANCE [VOLUME 5 I ISSUE 4 I OCT. DEC. 2018] e ISSN 2348 1269, Print ISSN 2349-5138 http://ijrar.com/ Cosmos Impact Factor 4.236 IMPACT OF LEADERSHIP STYLE ADOPTED BY WOMEN MANAGEMENT ON TEAM PERFORMANCE R. Divyaranjani

More information

GLOSSARY OF COMPENSATION TERMS

GLOSSARY OF COMPENSATION TERMS GLOSSARY OF COMPENSATION TERMS This compilation of terms is intended as a guide to the common words and phrases used in compensation administration. Most of these are courtesy of the American Compensation

More information

ALTE Quality Assurance Checklists. Unit 4. Test analysis and Post-examination Review

ALTE Quality Assurance Checklists. Unit 4. Test analysis and Post-examination Review s Unit 4 Test analysis and Post-examination Review Name(s) of people completing this checklist: Which examination are the checklists being completed for? At which ALTE Level is the examination at? Date

More information

ALTE Quality Assurance Checklists. Unit 1. Test Construction

ALTE Quality Assurance Checklists. Unit 1. Test Construction ALTE Quality Assurance Checklists Unit 1 Test Construction Name(s) of people completing this checklist: Which examination are the checklists being completed for? At which ALTE Level is the examination

More information

Harrison Assessments Validation Overview

Harrison Assessments Validation Overview Harrison Assessments Validation Overview Dan Harrison, Ph.D. 2016 Copyright 2016 Harrison Assessments Int l, Ltd www.optimizepeople.com HARRISON ASSESSMENT VALIDATION OVERVIEW Two underlying theories are

More information

Article Review: Personality assessment in organisational settings

Article Review: Personality assessment in organisational settings Article Review: Personality assessment in organisational settings Author Published 2009 Journal Title Griffith University Undergraduate Psychology Journal Downloaded from http://hdl.handle.net/10072/340326

More information

INTERAGENCY GUIDANCE ON THE ADVANCED MEASUREMENT APPROACHES FOR OPERATIONAL RISK

INTERAGENCY GUIDANCE ON THE ADVANCED MEASUREMENT APPROACHES FOR OPERATIONAL RISK INTERAGENCY GUIDANCE ON THE ADVANCED MEASUREMENT APPROACHES FOR OPERATIONAL RISK Robert Rell February 29, 2012 Disclaimer: The views expressed do not necessarily reflect the views of the Federal Reserve

More information

Kristin Gustavson * and Ingrid Borren

Kristin Gustavson * and Ingrid Borren Gustavson and Borren BMC Medical Research Methodology 2014, 14:133 RESEARCH ARTICLE Open Access Bias in the study of prediction of change: a Monte Carlo simulation study of the effects of selective attrition

More information

Developing Cultures for Safety & Reliability in Manufacturing Organizations. By: Hank Sarkis, MBA The Reliability Group, Lighthouse Point, Florida

Developing Cultures for Safety & Reliability in Manufacturing Organizations. By: Hank Sarkis, MBA The Reliability Group, Lighthouse Point, Florida Developing Cultures for Safety & Reliability in Manufacturing Organizations By: Hank Sarkis, MBA The Reliability Group, Lighthouse Point, Florida Abstract The level of safety and reliability achieved by

More information

I-O Defined 3/4/ American Time Use Survey

I-O Defined 3/4/ American Time Use Survey Industrial Organizational Psychology: An Introductory Module Adam Butler Professor, Department of Psychology 2009 American Time Use Survey 2 Hours Spent 8 7 6 5 4 3 2 1 0 Childcare TV Work Daily Activity

More information

7 Statistical characteristics of the test

7 Statistical characteristics of the test 7 Statistical characteristics of the test Two key qualities of an exam are validity and reliability. Validity relates to the usefulness of a test for a purpose: does it enable well-founded inferences about

More information

PSYC C performance appraisal 11/27/11 [Arthur] 1

PSYC C performance appraisal 11/27/11 [Arthur] 1 PERFORMANCE APPRAISAL PSYC 353 11C performance appraisal 11/27/11 [Arthur] 1 Personnel Psychology subfield of I/O psychology focusing on the management of human resources Recruitment Selection Placement

More information

The Assessment Center Process

The Assessment Center Process The Assessment Center Process Introduction An Assessment Center is not a place - it is a method of evaluating candidates using standardized techniques under controlled conditions. These techniques offer

More information

CHAPTER 2 Foundations of Recruitment and Selection I: Reliability and Validity

CHAPTER 2 Foundations of Recruitment and Selection I: Reliability and Validity CHAPTER 2 Foundations of Recruitment and Selection I: Reliability and Validity If Nothing Else, My Students Should Learn Personnel recruitment and selection strategies based on information obtained through

More information

How the Leaders Emotional Intelligence Influences their Effectiveness and the Assessment of their Performance?

How the Leaders Emotional Intelligence Influences their Effectiveness and the Assessment of their Performance? How the Leaders Emotional Intelligence Influences their Effectiveness and the Assessment of their Performance? András GÖNDÖR Budapest Business School College of Finance and Accountancy gondorandras@ideacom.hu

More information

NOGDAWINDAMIN FAMILY AND COMMUNITY SERVICES

NOGDAWINDAMIN FAMILY AND COMMUNITY SERVICES This dictionary describes the following six functional competencies and four enabling competencies that support the differentiated territory for professional accountants in strategic management accounting:

More information

A SITUATIONAL APPROACH TO LEADERSHIP EFFECTIVENESS

A SITUATIONAL APPROACH TO LEADERSHIP EFFECTIVENESS Journal of Applied Psychology 1969, Vol. 53, No. 6, 513-51 A SITUATIONAL APPROACH TO LEADERSHIP EFFECTIVENESS WALTER HILL i Department of Management, University oj Florida Fiedler's contingency model suggests

More information

THE UNIVERSITY OF ALABAMA ANNUAL EMPLOYEE PERFORMANCE EVALUATION Comprehensive Form

THE UNIVERSITY OF ALABAMA ANNUAL EMPLOYEE PERFORMANCE EVALUATION Comprehensive Form Revised 4/1/2017 THE UNIVERSITY OF ALABAMA ANNUAL EMPLOYEE PERFORMANCE EVALUATION Comprehensive Form EMPLOYEE NAME (FIRST, MI, LAST): CWID : JOB TITLE: DEPARTMENT: DIVISION: PERIOD OF EVALUATION: From:

More information

Management Science Letters

Management Science Letters Management Science Letters 3 (013) 81 90 Contents lists available at GrowingScience Management Science Letters homepage: www.growingscience.com/msl Investigating the relationship between auditor s opinion

More information

PERFORMANCE APPRAISAL OF EMPLOYEES

PERFORMANCE APPRAISAL OF EMPLOYEES PERFORMANCE APPRAISAL OF EMPLOYEES Dr. B.SWARNALATHA Assistant professor (Senior Grade) Bharathiyar College of Engineering and Technology, Karaikal ABSTRACT Performance appraisal is a formal, structured

More information

Chapter 3. Basic Statistical Concepts: II. Data Preparation and Screening. Overview. Data preparation. Data screening. Score reliability and validity

Chapter 3. Basic Statistical Concepts: II. Data Preparation and Screening. Overview. Data preparation. Data screening. Score reliability and validity Chapter 3 Basic Statistical Concepts: II. Data Preparation and Screening To repeat what others have said, requires education; to challenge it, requires brains. Overview Mary Pettibone Poole Data preparation

More information

INFLUENCE OF SEX ROLE STEREOTYPES ON PERSONNEL DECISIONS

INFLUENCE OF SEX ROLE STEREOTYPES ON PERSONNEL DECISIONS Journal of Applied Psychology 1974, Vol. 59, No. 1, 9-14 INFLUENCE OF SEX ROLE STEREOTYPES ON PERSONNEL DECISIONS BENSON ROSEN 1 AND THOMAS H. JERDEE Graduate School of Business Administration, University

More information

Distinguish between different types of numerical data and different data collection processes.

Distinguish between different types of numerical data and different data collection processes. Level: Diploma in Business Learning Outcomes 1.1 1.3 Distinguish between different types of numerical data and different data collection processes. Introduce the course by defining statistics and explaining

More information

Running Head: THE PREDICTIVE VALIDITY OF COGNITIVE ABILITY TESTS: A UK. The predictive validity of cognitive ability tests: A UK meta-analysis.

Running Head: THE PREDICTIVE VALIDITY OF COGNITIVE ABILITY TESTS: A UK. The predictive validity of cognitive ability tests: A UK meta-analysis. Running Head: THE PREDICTIVE VALIDITY OF COGNITIVE ABILITY TESTS: A UK META-ANALYSIS. The predictive validity of cognitive ability tests: A UK meta-analysis. Cristina Bertua (Independent Consultant, London,

More information

Identifying and Developing Predictors of Job Performance

Identifying and Developing Predictors of Job Performance Identifying and Developing Predictors of Job Performance Patrick J. Curtin Deborah L. Whetzel Kenneth E. Graham Caliber Associates Pre-Conference Workshop, IPMAAC June 22, 2003 1 Agenda Introductions Test

More information

UPWARD APPRAISAL: PARTICIPATIVE MANAGEMENT OR WINDOW DRESSING?

UPWARD APPRAISAL: PARTICIPATIVE MANAGEMENT OR WINDOW DRESSING? 1 UPWARD APPRAISAL: PARTICIPATIVE MANAGEMENT OR WINDOW DRESSING? Upward appraisal has for some time now been regarded as a strategy for creating the involvement of front-line employees within the workplace,

More information

MULTIPLE RATERS AND HYBRID CONSENSUS IN EVALUATING PRIOR EXPERIENCE FOR A FINANCIAL ANALYST POSITION: ARE RELIABILITY AND EMPIRICAL VALIDITY IMPACTED?

MULTIPLE RATERS AND HYBRID CONSENSUS IN EVALUATING PRIOR EXPERIENCE FOR A FINANCIAL ANALYST POSITION: ARE RELIABILITY AND EMPIRICAL VALIDITY IMPACTED? MULTIPLE RATERS AND HYBRID CONSENSUS IN EVALUATING PRIOR EXPERIENCE FOR A FINANCIAL ANALYST POSITION: ARE RELIABILITY AND EMPIRICAL VALIDITY IMPACTED? Baugher, Dan Pace University Weisbord, Ellen Pace

More information

Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition

Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition Submitted: 23 rd of May 2016 Published: 22 nd of July 2016 Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition Bryan J. Pesta Peter J. Poznanski Open Differential

More information

Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition

Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition Submitted: 23 rd of May 2016 Published: 22 nd of July 2016 Putting Spearman s Hypothesis to Work: Job IQ as a Predictor of Employee Racial Composition Bryan J. Pesta Peter J. Poznanski Open Differential

More information

Identifying Motivation and Interpersonal Performance Using Peer Evaluations

Identifying Motivation and Interpersonal Performance Using Peer Evaluations MILITARY PSYCHOLOGY, 2001, 13(2), 73 88 Identifying Motivation and Interpersonal Performance Using Peer Evaluations Michelle M. Zazanis U.S. Army Research Institute Alexandria, Virginia Stephen J. Zaccaro

More information

FEM3104 WORK PERFORMANCE & PRODUCTIVITY DR SA ODAH AHMAD JPMPK/FEM/UPM

FEM3104 WORK PERFORMANCE & PRODUCTIVITY DR SA ODAH AHMAD JPMPK/FEM/UPM FEM3104 WORK PERFORMANCE & PRODUCTIVITY DR SA ODAH AHMAD JPMPK/FEM/UPM LEARNING OUTCOMES 1. To explain the criteria and types of work performance. 2. To explain concept and theories of work productivity

More information

GENERALIZABLE BIOGRAPHICAL DATA VALIDITY CAN BE ACHIEVED WITHOUT MULTI-ORGANIZATIONAL DEVELOPMENT AND KEYING

GENERALIZABLE BIOGRAPHICAL DATA VALIDITY CAN BE ACHIEVED WITHOUT MULTI-ORGANIZATIONAL DEVELOPMENT AND KEYING PERSONNEL PSYCHOLOGY 1999.52 GENERALIZABLE BIOGRAPHICAL DATA VALIDITY CAN BE ACHIEVED WITHOUT MULTI-ORGANIZATIONAL DEVELOPMENT AND KEYING KEVIN D. CARLSON Department of Management Virginia Polytechnic

More information

DATA-INFORMED DECISION MAKING (DIDM)

DATA-INFORMED DECISION MAKING (DIDM) DATA-INFORMED DECISION MAKING (DIDM) Leadership and decision-making can benefit from data-informed decision making, also called evidencebased practice. DIDM supports decisions informed by measurable outcomes;

More information

Performance Skills Leader. Individual Feedback Report

Performance Skills Leader. Individual Feedback Report Performance Skills Leader Individual Feedback Report Jon Sample Date Printed: /6/ Introduction REPORT OVERVIEW Recently, you completed the PS Leader assessment. You may recall that you were asked to provide

More information

Issues In Validity Generalization The Criterion Problem

Issues In Validity Generalization The Criterion Problem University of Central Florida Electronic Theses and Dissertations Masters Thesis (Open Access) Issues In Validity Generalization The Criterion Problem 2010 Raquel Hodge University of Central Florida Find

More information

PERFORMANCE MANAGEMENT AND REWARD

PERFORMANCE MANAGEMENT AND REWARD PERFORMANCE MANAGEMENT AND REWARD By EMMANUEL IRIKEFE IBORDOR Department of Management Science, University of Nigeria, Nsukka. Abstract Performance management and reward is a planned, strategic and integrating,

More information

EMPLOYEE ANNUAL PERFORMANCE EVALUATION

EMPLOYEE ANNUAL PERFORMANCE EVALUATION Employee Information Name Job Title Department January 1 - December 31, 2016 Evaluation Period Total Years In Current Position Supervisor Instructions This form must be completed on all full-time employees

More information

ORGANIZATIONAL ACCOUNTABILITIES. Sample Phrases for Appraiser Coach-ability Receptive to feedback Willingness to learn

ORGANIZATIONAL ACCOUNTABILITIES. Sample Phrases for Appraiser Coach-ability Receptive to feedback Willingness to learn PERFORMANCE APPRAISAL TOOLKIT FOR MANAGERS PERFORMANCE APPRAISAL DO S AND DON TS Do: Prepare in advance Be specific about reasons for ratings Consider your role in this Decide on specific steps to be taken

More information

Chapter 3 Research Methodology

Chapter 3 Research Methodology Chapter 3 Research Methodology 68 Chapter 3 Research Methodology Research Methodology-3.1 In this chapter the theoretical framework and methodology taken in the study has been elaborated. It covers the

More information

PREDICTORS AND TESTING. PSYC C predictors & testing 10/18/11 [Arthur] 1

PREDICTORS AND TESTING. PSYC C predictors & testing 10/18/11 [Arthur] 1 PREDICTORS AND TESTING in Personnel Selection 1 Personnel Psychology subfield of I/O psychology focusing on the management of human resources Recruitment Training and Selection development Placement Team

More information

Department of Economics, University of Michigan, Ann Arbor, MI

Department of Economics, University of Michigan, Ann Arbor, MI Comment Lutz Kilian Department of Economics, University of Michigan, Ann Arbor, MI 489-22 Frank Diebold s personal reflections about the history of the DM test remind us that this test was originally designed

More information

National Council for Strength & Fitness

National Council for Strength & Fitness 1 National Council for Strength & Fitness Certified Personal Trainer Examination Annual Exam Report January 1 to December 31, 2016 March 24 th 2017 Exam Statistical Report Copyright 2017 National Council

More information

Customer satisfaction as a gain/loss situation: Are experienced customers more loss aversive?

Customer satisfaction as a gain/loss situation: Are experienced customers more loss aversive? Customer satisfaction as a gain/loss situation: Are experienced customers more loss aversive? 1 Magnus Söderlund Center for Consumer Marketing, Stockholm School of Economics, P.O. Box 6501, SE-113 83 Stockholm,

More information

Assessments and Tools in Coaching. Assessments

Assessments and Tools in Coaching. Assessments Assessments and Tools in Coaching by Brian Nichol and Lou Raye Nichol Business coaches draw on many assessment instruments and training tools in their work with clients. It s important that a coach becomes

More information

Sales Selector Technical Report 2017

Sales Selector Technical Report 2017 Sales Selector Technical Report 2017 Table of Contents Executive Summary... 3 1. Purpose... 5 2. Development of an Experimental Battery... 5 3. Sample Characteristics... 6 4. Dimensions of Performance...

More information

A Cautionary Note on the Effects of Range Restriction on Predictor Intercorrelations

A Cautionary Note on the Effects of Range Restriction on Predictor Intercorrelations Journal of Applied Psychology Copyright 2007 by the American Psychological Association 2007, Vol. 92, No. 2, 538 544 0021-9010/07/$12.00 DOI: 10.1037/0021-9010.92.2.538 A Cautionary Note on the Effects

More information

Performance Appraisal: Methods

Performance Appraisal: Methods Paper: 01 Module: 20 Principal Investigator Co-Principal Investigator Paper Coordinator Content Writer Prof. S P Bansal Vice Chancellor Maharaja Agrasen University, Baddi Prof YoginderVerma Pro Vice Chancellor

More information

CHAPTER 1: THE FIELD OF ORGANIZATIONAL BEHAVIOR

CHAPTER 1: THE FIELD OF ORGANIZATIONAL BEHAVIOR CHAPTER 1: THE FIELD OF ORGANIZATIONAL BEHAVIOR CHAPTER SYNOPSIS The chapter introduces Organizational Behavior (OB) as an important field of study. OB has four essential characteristics: (1) the use of

More information

Strategy development is about asking and answering four questions:

Strategy development is about asking and answering four questions: Context Analysis: A comprehensive approach to environmental scanning and organizational assessment Working Paper Herbert A. Marlowe, Jr. February 12, 2012 Introduction Strategy development is about asking

More information

Chapter 2 - Job Performance Concepts and Measures

Chapter 2 - Job Performance Concepts and Measures Job Performance Concepts and Measures 1 Chapter 2 - Job Performance Concepts and Measures Job Performance Concepts and Measures (PPT 2-3) Applicants who score high on selection tests are predicted to do

More information

page 1 1. JOB KNOWLEDGE RATING

page 1 1. JOB KNOWLEDGE RATING 1. JOB KNOWLEDGE a) Understands requirements of job b) Is familiar with appropriate legislation c) Is familiar with appropriate duties d) Is familiar with policies and procedures e) Is familiar with current

More information

A Systematic Approach to Performance Evaluation

A Systematic Approach to Performance Evaluation A Systematic Approach to Performance evaluation is the process of determining how well an existing or future computer system meets a set of alternative performance objectives. Arbitrarily selecting performance

More information

A Comparison of Incumbent and Analyst Ratings of O*NET Skills

A Comparison of Incumbent and Analyst Ratings of O*NET Skills FR05-66 A Comparison of Incumbent and Analyst Ratings of O*NET Skills Suzanne Tsacoumis Chad H. Van Iddekinge Prepared for: National Center for O*NET Development North Carolina O*NET Center 700 Wade Avenue

More information

Achieving and Relating: Validation of a Two-Factor Model of Managerial Orientation

Achieving and Relating: Validation of a Two-Factor Model of Managerial Orientation International Journal of Business and Social Science Vol. 5, No. 6(1); May 2014 Achieving and Relating: Validation of a Two-Factor Model of Managerial Orientation R. Douglas Waldo, DBA, SPHR Associate

More information

Description of Module Food Technology Food Business Management

Description of Module Food Technology Food Business Management Subject Name Paper Name Paper No. 14 Module Name/Title Module Id Description of Module Food Technology Food Business Management Organizational Leadership FT/FBM/09 Objectives To know about difference between

More information

Dependable on-the-job performance criterion

Dependable on-the-job performance criterion SECURITY OFFICER WORK SAMPLE 29 How Useful are Work Samples in Validational Studies? Douglas N. Jackson*, William G. Harris, Michael C. Ashton, Julie M. McCarthy and Paul F. Tremblay Some job tasks do

More information

LEADERSHIP STYLES AND INFLUENCE STRATEGIES: THE MODERATING EFFECT OF ORGANIZATIONAL CLIMATE

LEADERSHIP STYLES AND INFLUENCE STRATEGIES: THE MODERATING EFFECT OF ORGANIZATIONAL CLIMATE Advances in Industrial Organizational Psychology B.J. Fallon, H.P. Pfister, J. Brebner (eds.) Elsevier Science Publishers B.V. (North-Holland), 1989 59 pp. 59-65. LEADERSHIP STYLES AND INFLUENCE STRATEGIES:

More information

Pushing Forward with ASPIRE 1

Pushing Forward with ASPIRE 1 Pushing Forward with ASPIRE 1 Heather BERGDAHL Quality Coordinator, Statistics Sweden Dr. Paul BIEMER Distinguished Fellow, RTI International Dennis TREWIN Former Australian Statistician 0. Abstract Statistics

More information

Safety Perception / Cultural Surveys

Safety Perception / Cultural Surveys Safety Perception / Cultural Surveys believes in incorporating safety, health, environmental and system management principles that address total integration, thus ensuring continuous improvement, equal

More information

A STUDY ON JOB SATISFACTION AND ORGANIZATIONAL CITIZENSHIP BEHAVIOUR IN AXIS BANK, CHITTOOR C.JYOTHSNA 1 Dr.COLONEL(RTD) MUKESH KUMAR.V 2 1 Research Scholar, Dept of Management Studies, Bharathiar University,

More information

THE WORLD OF ORGANIZATION

THE WORLD OF ORGANIZATION 22 THE WORLD OF ORGANIZATION In today s world an individual alone can not achieve all the desired goals because any activity requires contributions from many persons. Therefore, people often get together

More information

Crowe Critical Appraisal Tool (CCAT) User Guide

Crowe Critical Appraisal Tool (CCAT) User Guide Crowe Critical Appraisal Tool (CCAT) User Guide Version 1.4 (19 November 2013) Use with the CCAT Form version 1.4 only Michael Crowe, PhD michael.crowe@my.jcu.edu.au This work is licensed under the Creative

More information

Which is the best way to measure job performance: Self-perceptions or official supervisor evaluations?

Which is the best way to measure job performance: Self-perceptions or official supervisor evaluations? Which is the best way to measure job performance: Self-perceptions or official supervisor evaluations? Ned Kock Full reference: Kock, N. (2017). Which is the best way to measure job performance: Self-perceptions

More information

Objectives The objectives behind the preparation of this document, including the list of key principles, are:

Objectives The objectives behind the preparation of this document, including the list of key principles, are: Draft version November 5, 2017 Risk Analysis: Fundamental Principles The Society for Risk Analysis (SRA) Specialty Group on foundational issues in risk analysis has established a group of risk analysis

More information

LOSS DISTRIBUTION ESTIMATION, EXTERNAL DATA

LOSS DISTRIBUTION ESTIMATION, EXTERNAL DATA LOSS DISTRIBUTION ESTIMATION, EXTERNAL DATA AND MODEL AVERAGING Ethan Cohen-Cole Federal Reserve Bank of Boston Working Paper No. QAU07-8 Todd Prono Federal Reserve Bank of Boston This paper can be downloaded

More information

Worker Types: A New Approach to Human Capital Management

Worker Types: A New Approach to Human Capital Management Worker Types: A New Approach to Human Capital Management James Houran, President, 20 20 Skills Employee Assessment 20 20 SKILLS ASSESSMENT 372 Willis Ave. Mineola, NY 11501 +1 516.248.8828 (ph) +1 516.742.3059

More information

ANALYSIS OF DIFFERENCES IN ATTITUDES BETWEEN MANAGERS AND NON-MANAGING EMPLOYEES *

ANALYSIS OF DIFFERENCES IN ATTITUDES BETWEEN MANAGERS AND NON-MANAGING EMPLOYEES * ANALYSIS OF DIFFERENCES IN ATTITUDES BETWEEN MANAGERS AND NON-MANAGING EMPLOYEES * Gordana Dukić 1 1 Senior Assistant, Josip Juraj Strossmayer University of Osijek, Faculty of Philosophy, Department of

More information

Chipping Away at the Monument: A Critical Look at the Assessment Center Method

Chipping Away at the Monument: A Critical Look at the Assessment Center Method Chipping Away at the Monument: A Critical Look at the Assessment Center Method Symposium Presented at the 2000 IPMAAC Conference 2000 Arlington, Virginia Background Assessment center (AC) method has taken

More information

CHAPTER 5 TRAINING AND DEVELOPMENT PRACTICES IN HOTEL INDUSTRY
