Sales Selector Technical Report 2017


Table of Contents

Executive Summary
1. Purpose
2. Development of an Experimental Battery
3. Sample Characteristics
4. Dimensions of Performance
5. Item Analysis and Validity
6. Tests for Adverse Impact and Test Bias
7. Conclusions
8. References
9. Appendix A
Appendix B

Executive Summary

IBM Kenexa conducted a concurrent validity study to develop a Sales Selector for use in hiring effective Sales Representatives within sales environments. Complete data were available on 1,568 currently employed Sales Representatives from 7 different organizations. Predictor data included measures of cognitive ability, personality-oriented items, and situational judgment questions. The primary criterion measure was participants' performance as rated by their immediate supervisors or reported as objective sales. A predictor was configured from a combination of the personality, cognitive ability, and situational judgment measures. The Sales Selector was significantly related to composite performance with an estimated validity of .27. To illustrate the potential impact of using the Selector, the total sample of employees included in this study was divided into above average, average, and below average groups based on their predictor test scores. The average criterion performance of these three groups is shown below.

Predictor Group    Objective Performance    Subjective Performance    Composite Performance
Above Average
Average
Below Average

There are useful differences in mean performance across the three predictor categories. Those who scored in the top third of the predictor scored approximately 13.1%, 16.5%, and 15.8% higher than those in the bottom third for objective performance, manager-rated performance, and overall performance, respectively. To further understand the impact of such a validity coefficient (estimated validity of approximately .27), a utility analysis was conducted. From the Taylor-Russell tables (1939), the improvement in hit rates using the Selector can be estimated. Assuming that 50% of current Sales Representative hires are above-average performers (i.e., a hit rate of 50%) and that the employer begins hiring only candidates scoring in the top one-half of the Selector, the hit rate is expected to increase from 50% to 59%. In other words, organizations using the predictor and hiring only top-half scorers will experience an 18% improvement in their hit rates amongst new hires (i.e., 59/50). The Selector was examined for evidence of bias, using the Cleary regression model of test bias. There was no evidence that the Selector was biased against any subgroup considered here (i.e., race, gender, and age). Results of the adverse impact analyses showed there to be no adverse impact as a function of race, gender, or age at selection ratios up to and including the 70th percentile. In terms of conclusions:

First, a sound predictor for the Sales Representative position was developed. The Selector differentiated between participants in objective, manager-rated, and composite performance. Second, the Selector appears to be unbiased across all of the subgroups considered here. Using the Cleary model of bias, no evidence was found of test bias with respect to gender, race, and age. Furthermore, the adverse impact analyses indicate no potential for adverse impact as a function of race, gender, or age at cut scores up to and including the 70th percentile. It is recommended that the cut scores not exceed the 50th percentile. Future analyses should be conducted to confirm the present findings.

IBM Kenexa Sales Selector Technical Report

1. Purpose

IBM Kenexa developed and validated a structured system for use in hiring effective Sales Representatives. The primary objective of this research was to develop and validate an assessment predictive of Sales Representative performance across industries and sales positions. The ideal selector would be significantly related to performance, free of bias and adverse impact, and generalizable across similar jobs and organizations. Prior to detailing how the Sales Selector was developed, some of the statistical procedures used in this report are briefly explained below. The primary procedures used are Pearson's product-moment correlation (r) and regression analysis. A correlation coefficient is a value used to determine the extent to which two variables are related; it ranges from -1.00 to +1.00. A correlation of +1.00 denotes a perfect positive relationship, -1.00 indicates a perfect negative relationship, and 0 signifies no relationship. For example, height and weight generally have a positive correlation (e.g., .50 to .70). Taller people tend to weigh more, but the relationship is not perfect, as would be indicated by a correlation of 1.00. Negative correlations express inverse relationships. For example, the amount of exercise tends to be inversely related to body fat. Though usually expressed as a decimal point followed by two digits, correlations should not be confused with percentages. Regression analysis is a technique in which an independent variable or set of variables is fitted in a model in such a way as to maximize the prediction of a different, dependent variable. In its simplest form, regression analysis defines the slope and y-intercept of a line describing the relationship between two variables (e.g., height and weight). The likelihood that a given finding is due to chance is expressed by the probability value (p). Probability is greatly influenced by sample size. Given two correlations of equal size, one may be considered significant and the other not by virtue of the former's larger sample size. Typically, any probability less than 5% (p < .05) is considered statistically significant. Due to the large sample size, relationships involving analyses of the entire sample are considered statistically significant only if the probabilities are less than p = .01 (1%). In other words, there must be less than a 5% probability that the observed result is due to chance before the result is considered significant in analyses of the calibration and cross-validation samples, but there must be less than a 1% probability that the observed result is due to chance when the entire sample is being analyzed.

2. Development of an Experimental Battery

A comprehensive analysis of the Sales Representative job was undertaken to identify the knowledge, skills, abilities, and other requirements (KSAOs) critical for effective performance within the job. Job analyses of the relevant jobs were completed. One major conclusion from these analyses was that there were few differences between the jobs in terms of the KSAOs required for successful performance. For the Sales Representative jobs, the KSAOs required for successful performance were identified, and several online experimental instruments were identified or developed to measure these KSAOs. To increase the validity of the Sales Selector, a review of the sales performance literature was conducted to identify the constructs most likely to predict high-performing Sales Representatives. Conscientiousness and extraversion were cited as predictors of sales performance, with specific evidence for the facets of achievement and potency (Barrick, Mount, & Strauss, 1993; Vinchur et al., 1998). The relationship is likely to be stronger for more autonomous sales professionals, and agreeableness may demonstrate a negative relationship with performance (Barrick & Mount, 1993). Meta-analytic evidence supports the relationship between general mental ability (GMA) and sales ratings (Vinchur et al., 1998). Finally, situational judgment items have been shown to provide incremental validity over general mental ability (McDaniel et al., 2001). While the constructs assessed differed slightly between samples, the overall assessment for each sales environment consisted of three sections. The first section of the assessment was designed to measure important work styles or personality variables including achievement, adaptability, analytical thinking, concern for others, dependability, detail orientation, initiative, leadership, persuasion, persistence, self-control, social orientation, and stress tolerance.

Next, either a 40-item measure or a 32-item measure of general cognitive ability containing items related to verbal skills, mathematical operations, and general reasoning abilities was used. Cognitive ability is usually among the strongest predictors of job performance, particularly as job complexity increases (Hunter & Hunter, 1984). Finally, data were gathered on situational judgment. Using critical incident reports of subject matter experts, 51 item pairs were developed to measure situations Sales Representatives would be likely to encounter on the job. Each situational description was followed by five options describing possible behavioral responses. Subjects were directed to select the option they thought was the best course of action under the circumstances and the option they thought was the worst course of action. The data collected varied slightly by sample. Items were selected if they predicted performance in the calibration sample and were shown to cross-validate.

3. Sample Characteristics

Data were gathered from 7 different sales environments. All of the jobs included in this research were Sales Representative jobs, requiring customer interaction with

sales as the primary responsibility. Sales Representatives in organization 1 held one of six sales jobs in a call center environment. Organization 2 is composed of account managers who focused on working primarily internally or externally, although the job analysis noted few differences between the two. Sales Representatives in organization 3 were involved in either business-to-business or business-to-customer sales. Sales Representatives in organization 4 worked in one of four outbound sales focused jobs in a call center environment. Sales Representatives in organization 5 include high-volume business-to-customer and business-to-business sales. Organization 6 includes account managers in business-to-business sales. Organization 7 includes inbound Sales Representatives in a call center environment. A total of 1,568 individuals participated in the study. Of these, 67.0% were white and 33.0% were nonwhite. Additionally, 48.3% were male and 51.7% were female. Furthermore, 51.8% of the participants were below the age of 40 and 48.2% were 40 years of age or older. The breakdown for each sales environment is shown in Table 1 as percentages of those participants who chose to report demographic data.

Table 1. Demographic Data for the Entire Sample

Organization      n    % Male   % Female   % White   % Nonwhite   % Under 40   % 40 or Over
Organization 1
Organization 2
Organization 3
Organization 4
Organization 5
Organization 6
Organization 7
Total Sample

4. Dimensions of Performance

Performance for Sales Representatives was collected through company archives or rated by the participant's immediate supervisor using a research-only performance appraisal (PA) that was either developed from existing job documentation (e.g., job descriptions, training materials, etc.) obtained during a job analysis or taken from an existing internal performance appraisal.
Research has shown that performance appraisals conducted for purposes other than research can be artificially inflated (Murphy & Cleveland, 1991) and yield lower validities (McDaniel, Whetzel, Schmidt, & Mauer, 1994). In fact, Jawahar and Williams (1997) recommend taking the time to gather appraisal data for research purposes only, as ratings obtained for administrative purposes have been shown to be far more lenient than those obtained for research purposes. Thus, research-only performance appraisals were used in some of these

studies to ensure more accurate, predictable performance ratings. The criteria varied between organizations and are presented in Table 2.

Table 2. Criterion Measurement across Sampled Organizations

Organization   Objective Performance
1
2              Goal attainment
3
4              Monthly average sales
5              Goal attainment
6              Goal attainment
7              Goal attainment

Objective performance was collected from company archives and averaged over periods ranging from two months to one year. It is important to note that research indicates that objective sales can have low test-retest reliability (Sturman, Cheramie, & Cashen, 2005). Stewart and Nandkeolyar (2007) proposed that organizational performance constraints, such as the actions of other coworkers, may limit objective performance, which would reduce the relationship between predictors and objective criteria. All other performance criteria (task, sales, customer service, teamwork, and overall performance) are manager ratings of performance collected either through research-only forms or through company archives. All objective sales data were standardized, using intra-organizational norms, to account for differences between organizations in sales volume and value. The task, sales, customer service, teamwork, and overall performance ratings were also standardized within organization and averaged into an overall subjective performance criterion. The objective and subjective performance criteria were averaged to form a composite performance criterion. In each case, the composite consisted of a unit-weighted combination of available performance measures.
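The within-organization standardization and unit-weighted compositing described above can be sketched as follows. The sales figures and ratings below are invented for illustration; the study's actual data are not reproduced here.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to mean 0, standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Invented sales figures for two organizations with very different
# sales volumes; standardizing within each organization puts both
# groups on a common scale before pooling (intra-organizational norms).
org_a_sales = [100, 120, 140]
org_b_sales = [10, 14, 18]
objective = zscores(org_a_sales) + zscores(org_b_sales)

# Invented manager ratings, standardized the same way.
subjective = zscores([3.0, 3.5, 4.0, 2.5, 3.0, 3.5])

# A unit-weighted composite is simply the mean of the available
# standardized measures for each participant.
composite = [mean(pair) for pair in zip(objective, subjective)]
```

Because each organization's scores are centered before pooling, between-organization differences in sales volume drop out, as the report intends.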
Relationships with individual criteria such as manager ratings of sales, customer service, teamwork, and manager ratings of overall performance are reported separately. The measures of performance were correlated with one another. Intercorrelations for these measures are shown in Table 3.

Table 3. Intercorrelations between Performance Dimensions

1. Objective Performance
2. Subjective Performance

3. Task Performance
4. Customer Service Performance
5. Teamwork Performance
6. Composite Performance

Note: All scales reported in standardized scores. An asterisk (*) denotes significance at p < .01.

The measures of performance were correlated with race, gender, and age (see Table 4). As shown below, the correlations were generally small. Slightly higher subjective manager ratings were given to white Sales Representatives (r = -.10), while employees over 40 were rated higher on teamwork (r = .20).

Table 4. Correlations between Performance Dimensions and Demographic Characteristics

Performance Dimension            Gender   Race    Age
Objective Performance
Subjective Performance                    -.10*   .04
Task Performance
Customer Service Performance
Teamwork Performance                              .20*
Composite Performance

Note: An asterisk (*) denotes significance at p < .01. Negative correlations indicate that whites, males, or those below the age of 40 were rated higher on the performance dimension.

5. Item Analysis and Validity

To develop a predictor, the total sample (n = 1,568) was used, comprising responses to the IBM Kenexa experimental predictor instruments. The personality scales were designed around three of the Big Five personality characteristics: Conscientiousness, Extraversion, and Agreeableness. Personality traits were operationalized with 10-item sub-trait scales from the IBM Kenexa WorkStyles Questionnaire: Achievement Orientation (i.e., achievement, initiative, and persistence), Conscientiousness (i.e., dependability, detail orientation, and integrity), and Social Influence (i.e., leadership, persuasion, and energy). These scales were averaged separately and included in analyses. Cognitive ability was measured with 40 items. The cognitive ability items were scored as correct or incorrect and averaged separately before being combined to create an overall measure of critical thinking ability. Unlike the personality items, the situational judgment items were not arrayed along a continuum.
For the situational judgment items, respondents were presented with a description of a situation they would be likely to encounter on the job. Each situation was followed by five possible response options. Respondents were instructed to choose which of the five options was the best and which was the worst. These items were keyed at the response-option level against subject matter expert (SME) opinion using the following strategy. Respondents choosing the correct best answer (as determined by the SMEs) received credit for 1 point. Respondents correctly identifying the SMEs' correct worst answer

were similarly credited with a point. Respondents received a -1 by choosing the correct worst answer as their best answer or the correct best answer as their worst answer. In summary, scores on each item could range from -2 to 2 and were calculated as follows: (1) to receive a 2, a person would have to choose as their best response the correct best answer and as their worst response the correct worst answer, i.e., get both right; (2) to receive a -2, a person would have to choose the correct worst answer as their best response and the correct best answer as their worst response, i.e., get both wrong; (3) a 1 was received if the respondent successfully identified either the correct best answer or the correct worst answer, but not both; (4) a -1 was received if the respondent chose as the best response the correct worst answer or chose as the worst response the correct best answer, but not both; and (5) a 0 was received by those choosing neither the correct best nor the correct worst response. Forty of the 51 situational judgment item pairs were found to predict performance in the developmental sample, with 21 found to cross-validate. Item correlations were compared between organizations, and the 14 item pairs most consistently predictive of performance were used for the final selector. A total predictor was created by summing the situational judgment, cognitive ability [1], and personality scales [2] (i.e., detail orientation, energy [3], initiative, leadership, and persistence) after first standardizing each to a mean of 0 and standard deviation of 1, which has the effect of ensuring that each scale is equally weighted. As shown in Table 5, the total predictor was positively related to the composite measure of performance (r = .21; p < .01). These validity estimates are likely to be biased downward due to range restriction and criterion unreliability.
Assuming criterion reliability of .60 (Rothstein, 1990), true validity is estimated at r = .27 when corrected for attenuation (most recent research places the reliability of most performance measures lower, at about .50, making this a conservative correction; see Viswesvaran, Ones, & Schmidt, 1996).

Table 5. Predictor Reliabilities and Validities (Entire Sample)

Scale                     Alpha    1      2      3      4      5      6
Achievement                        *      .09*   *
Dependability                      *      .09*   .08*   .09*   .11*   .11*
Detail Orientation*                *      .08*   *
Energy*                            *      .13*   .12*   .10*   .16*   .11*
Initiative*                        *      .23*   .21*   .17*   .29*   .19*
Integrity
Leadership*                        *      .18*   .17*   .14*   .17*   .15*

[1] The cognitive ability scale was updated to a computer adaptive section based on item response theory. Additional information regarding this updated scale is discussed in Appendix A.
[2] The personality scales have been updated as discussed in Appendix B.
[3] The achievement orientation and dependability scales were not included in the overall selector based on high correlations with the energy and leadership scales.
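The best/worst scoring scheme spelled out above can be sketched as a small function. The option labels used here are arbitrary placeholders, not the study's actual response options.

```python
def score_sjt_item(chosen_best, chosen_worst, keyed_best, keyed_worst):
    """Score one best/worst situational judgment item pair.

    Each of the two choices earns +1 for matching the SME key,
    -1 for matching the opposite key, and 0 otherwise, so the
    item score ranges from -2 to +2.
    """
    score = 0
    for choice, right, wrong in (
        (chosen_best, keyed_best, keyed_worst),
        (chosen_worst, keyed_worst, keyed_best),
    ):
        if choice == right:
            score += 1
        elif choice == wrong:
            score -= 1
    return score

# Both choices correct -> +2; both reversed -> -2.
print(score_sjt_item("A", "E", keyed_best="A", keyed_worst="E"))  # 2
print(score_sjt_item("E", "A", keyed_best="A", keyed_worst="E"))  # -2
```

Identifying only one of the keyed options yields +1, picking only the opposite of a keyed option yields -1, and picking neither keyed option yields 0, matching cases (1) through (5) in the text.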

Persistence*                       *
Persuasion                         *      .08*
Cognitive Ability*                 *      .08*   .06*   .05    .08*
Situational Judgment*              *      .12*   .05    .08*   .15*
Total Predictor           n/a      .16*   .21*   .21*   .15*   .16*   .21*

Note: 1: objective performance, 2: subjective ratings, 3: task performance rating, 4: customer service rating, 5: teamwork rating, 6: composite performance. n/a indicates not available or unable to calculate. An asterisk (*) denotes significance (p < .01, one-tailed test).

To further understand the performance implications of these results, the total sample was divided roughly into thirds, corresponding with above average, average, and below average scores on the Sales Selector. To eliminate negative values, each performance dimension was standardized to a mean of 3 and standard deviation of 1. Table 6 shows the mean criterion scores by predictor category.

Table 6. Expected Performance by Test Group

Predictor Group    Objective Performance    Subjective Performance    Composite Performance
Above Average
Average
Below Average

As can be seen, there are useful differences in mean performance across the Sales Selector categories. For objective performance, the above-average group performed 13.1% higher than the below-average group; the average group, in turn, performed 7.0% higher than the below-average group. The above-average group was rated 16.5% higher on subjective performance than the below-average group, with a 7.6% increase when comparing the average group to the below-average group. Lastly, the above-average group performed 15.8% higher than the below-average group on the composite performance criterion; the average group, in turn, performed 7.3% higher than the below-average group. From the Taylor-Russell tables (1939), the improvement in hit rates using the Sales Selector can be estimated.
Assuming that 50% of current hires are above-average performers (i.e., a hit rate of 50%) and that the employer begins hiring only candidates scoring in the top one-half on the Sales Selector, the hit rate is expected to increase from 50% to 59%. In other words, using the Sales Selector (estimated validity of .27) and hiring only top-half scorers, an 18% improvement in the hit rate amongst new hires is expected (i.e., 59/50). In summary, the Sales Selector does a good job of separating participants into meaningfully different categories. Those predicted to be above average show higher performance than do those predicted to be average or below average. Similar differences are observed between those predicted to be average and below average.
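The two corrections used above, the attenuation correction for criterion unreliability and the Taylor-Russell hit-rate comparison, are simple arithmetic:

```python
import math

observed_r = 0.21             # observed validity against composite performance
criterion_reliability = 0.60  # assumed value, following Rothstein (1990)

# Correction for attenuation: r_true = r_observed / sqrt(criterion reliability)
true_r = observed_r / math.sqrt(criterion_reliability)  # about .27

# Hit-rate improvement read from the Taylor-Russell (1939) tables:
base_hit_rate = 0.50  # proportion of above-average performers among current hires
new_hit_rate = 0.59   # expected hit rate when hiring only top-half scorers
improvement = new_hit_rate / base_hit_rate  # 1.18, i.e., an 18% improvement
```

Note that only the attenuation correction is computed directly; the 59% figure itself comes from the published Taylor-Russell tables, which tabulate the bivariate-normal relationship between validity, base rate, and selection ratio.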

6. Tests for Adverse Impact and Test Bias

Adverse impact is found when the hiring rate of the minority group is less than four-fifths, or 80%, of the hiring rate of the majority group. This is often referred to as the four-fifths rule. Standardized mean differences between groups directly contribute to adverse impact: the larger the group differences, the greater the likelihood (and magnitude) of adverse impact. Therefore, it is useful to examine the extent of these mean differences before interpreting adverse impact. Table 7 illustrates the subgroup mean differences for the Sales Selector. To examine the extent of these group differences, all mean differences were converted to a d statistic (also known as a standardized mean difference). The d statistic shows the mean difference in standard deviation units through the following formula: d = (majority sample mean - minority sample mean) / pooled standard deviation of both samples. Note that all comparisons are referenced to the majority sample; thus negative numbers indicate the majority sample scored lower than the minority sample. For example, when examining gender differences, the male sample is the reference group, so a negative d statistic would indicate that males scored lower on the assessment than females. In general, d values of .20, .50, and .80 correspond to small, medium, and large differences, respectively. As shown in Table 7, there were no observed mean differences for gender, race, or age statistically significant at the .05 level. Based on the Aamodt (2007) tables, adverse impact will likely not occur below the 80th percentile; however, differences between samples could create subgroup mean differences in practice, and thus the potential for adverse impact should be monitored. Table 7.
Standardized Mean Differences (d values) for Gender, Race, and Age (Percentile Scores)

                       All    Male    Female    White    Minority    Under 40    40 or Over
Mean
SD
Mean Difference
Pooled SD
d
Critical Cut Score            n/a               95th                 n/a

Note: Negative numbers indicate that the males, whites, and those under the age of 40 scored lower on the selector.

Adverse impact analyses were conducted by calculating the pass rates for each subgroup (i.e., gender, race, and age): (a) each Sales Selector scale was converted to z-scores (using the norms), (b) the composite z-score was converted into a percentile, and (c) the percentile was used to make a hiring recommendation.
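The d statistic defined above can be computed as follows. The report does not state the exact pooling formula used, so a simple sample-size-weighted pooled standard deviation is assumed here, and the score lists are invented for illustration.

```python
from statistics import mean, pstdev
import math

def d_statistic(majority, minority):
    """Standardized mean difference, referenced to the majority group:
    d = (majority mean - minority mean) / pooled SD of both samples.
    Negative values mean the majority group scored lower.
    """
    n1, n2 = len(majority), len(minority)
    # Sample-size-weighted pooled SD (an assumed pooling choice).
    pooled_sd = math.sqrt(
        (n1 * pstdev(majority) ** 2 + n2 * pstdev(minority) ** 2) / (n1 + n2)
    )
    return (mean(majority) - mean(minority)) / pooled_sd

# Invented score lists: the majority group scores one point lower on
# average, so d is negative, matching the sign convention in the note.
print(round(d_statistic([1, 2, 3], [2, 3, 4]), 2))  # -1.22
```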

Adverse impact analyses are generally calculated by creating a ratio in which the hiring rate of one group is divided by the hiring rate of the second group: (# minority group passing / # minority group applying) / (# majority group passing / # majority group applying). An adverse impact ratio of 1.00 indicates that both groups passed at an equal rate; if the value is less than .80, there is evidence of adverse impact. Table 8 shows the pass rates and adverse impact ratios based on cuts at the 20th, 30th, 40th, 50th, 60th, and 70th percentiles using the concurrent norms.

Table 8. Pass Rates Based on Cut Scores at the 20th through 70th Percentiles (Entire Sample)

Percentile     Gender                        Race                            Age
Cut Score      Male   Female   AI Ratio     White   Nonwhite   AI Ratio     Under 40   40 or Over   AI Ratio
20th
30th
40th
50th
60th
70th

Note: Adverse impact values less than .80 indicate the potential for adverse impact; values greater than 1.20 indicate the minority group is hired at a greater rate.

As noted in the table above, the results show no evidence of adverse impact for the Sales Selector at the cut scores studied (i.e., the 20th percentile to the 70th percentile). Therefore, if needed, cut scores for the Sales Selector can be set as high as the 70th percentile. However, it is recommended that the cut scores not exceed the 50th percentile. Using the Cleary model of test bias (Cascio, 1991, p. 180), moderated regressions were conducted to determine whether the predictor was differentially valid for various subgroups (males vs. females, whites vs. nonwhites, and over 40 vs. under 40). In the Cleary model, moderated regressions are conducted in which the predictor score is entered into the model first, followed by the subgroup variable, and finally by the interaction between the predictor score and the subgroup variable. A significant interaction term would indicate that the predictor works better for one subgroup than another and would suggest bias.
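The four-fifths calculation described above reduces to one ratio of pass rates. The applicant counts below are invented for illustration:

```python
def adverse_impact_ratio(minority_pass, minority_applied,
                         majority_pass, majority_applied):
    """Ratio of the minority pass rate to the majority pass rate.
    Values below .80 are evidence of adverse impact (four-fifths rule)."""
    minority_rate = minority_pass / minority_applied
    majority_rate = majority_pass / majority_applied
    return minority_rate / majority_rate

# Invented counts at a hypothetical cut score: 45 of 100 minority
# applicants pass versus 50 of 100 majority applicants.
ratio = adverse_impact_ratio(45, 100, 50, 100)
print(round(ratio, 2))  # 0.9 -- above .80, so no evidence of adverse impact
```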
The results of the moderated regression analyses using the composite performance scores as the criterion are reported in Table 9. As can be seen, the Sales Selector scores provide estimates of performance that are equally accurate across race, gender, and age subgroups. None of the interaction terms was statistically significant (p < .01), suggesting that the predictor does not have differential validity for any of the groups studied here. Furthermore, there were no significant main effects for any of the demographic groups, which implies that the Sales Selector does not over- or under-predict performance for race, gender, or age.
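A minimal sketch of the hierarchical (moderated) regression procedure follows, using invented data and a small normal-equations solver rather than a statistics package. In the actual study the increment at each step would be tested with an F-change statistic; here the sketch only shows the R-squared change that such a test evaluates.

```python
def ols_r2(xcols, y):
    """R-squared from ordinary least squares with an intercept,
    solved via the normal equations (fine for tiny examples)."""
    n, k = len(y), len(xcols) + 1
    X = [[1.0] + [col[i] for col in xcols] for i in range(n)]
    # Normal equations: (X'X) beta = X'y
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    yhat = [sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented toy data: performance depends only on the predictor score,
# so adding the group term and the interaction adds almost nothing.
score = [1, 2, 3, 4, 5, 6, 7, 8]
group = [0, 1, 0, 1, 0, 1, 0, 1]       # 0/1 subgroup indicator
perf = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9, 7.1, 8.0]
inter = [s * g for s, g in zip(score, group)]

r2_step1 = ols_r2([score], perf)                  # predictor only
r2_step3 = ols_r2([score, group, inter], perf)    # + group + interaction
delta = r2_step3 - r2_step1   # trivially small -> no evidence of bias
```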

Table 9. Cleary Model of Test Bias for the Selector

Source         Sum of Squares   F-change   p
Predictor
Gender                                     ns
Interaction                                ns
Predictor
Race                                       ns
Interaction                                ns
Predictor
Age                                        ns
Interaction                                ns

Note: ns indicates a non-significant relationship.

7. Conclusions

From this study, the following conclusions can be drawn: First, a sound predictor for the Sales Representative position was developed. The Selector differentiated between participants in objective, subjective, and overall performance ratings. Second, the Selector appears to be unbiased across all of the subgroups considered here. Using the Cleary model of bias, no evidence was found of test bias with respect to gender, race, and age. Furthermore, the adverse impact analyses indicate no potential for adverse impact as a function of race, gender, or age at cut scores up to and including the 70th percentile. Future analyses should be conducted to confirm the present findings.

8. References

Aamodt, M. G. (2007). Industrial/organizational psychology: An applied approach (5th ed.). Pacific Grove, CA: Wadsworth Publishing.

Barrick, M. R., & Mount, M. K. (1993). Autonomy as a moderator of the relationship between the Big Five personality dimensions and job performance. Journal of Applied Psychology, 78.

Barrick, M. R., Mount, M. K., & Strauss, J. P. (1993). Conscientiousness and performance of sales representatives: Test of the mediating effects of goal setting. Journal of Applied Psychology, 78.

Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice Hall.

Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96.

Jawahar, I. M., & Williams, C. R. (1997). Where all the children are above average: The performance appraisal purpose effect. Personnel Psychology, 50.

McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Mauer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79.

Murphy, K. R., & Cleveland, J. N. (1991). Performance appraisal: An organizational perspective. Boston: Allyn and Bacon.

Stewart, G. L., & Nandkeolyar, A. K. (2007). Exploring how constraints created by other people influence intraindividual variation in objective performance measures. Journal of Applied Psychology, 92.

Sturman, M. C., Cheramie, R. A., & Cashen, L. H. (2005). The impact of job complexity and performance measurement on the temporal consistency, stability, and test-retest reliability of employee job performance ratings. Journal of Applied Psychology, 90.

Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effectiveness of tests in selection: Discussion and tables. Journal of Applied Psychology, 23.

Vinchur, A. J., Schippmann, J. S., Switzer, F. S., & Roth, P. L. (1998). A meta-analytic review of predictors of job performance for salespeople. Journal of Applied Psychology, 83.

Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81.

17 9. Appendix A Equivalence of Scores on Cognitive Ability Assessments IBM offers cognitive ability assessments as a standalone or a section in the selector assessment packages in two delivery modes: Computerized adaptive testing (CAT) and a computer-based testing (CBT). Both delivery modes are provided via internet/online. CBT is delivered in two kinds: Infinity Series (i.e. a static version of reasoning test) and General Mental Ability (GMA). CAT and Infinity Series measure the same constructs in three domain areas: Logical Reasoning, Numerical Reasoning and Verbal Reasoning. This memo describes a process and outcome of establishing the relationship between CAT and CBT (Infinity Series or GMA), in terms of score linking and comparability. Static Version (Infinity Series) vs. IBM Ability CAT Version The Infinity Series assessments and CATs are designed to measure the same constructs in three domains: Logical Reasoning (LR), Numerical Reasoning (NR) and Verbal Reasoning (VR). A framework developed by researchers (Holland and Dorans, 2006; Mislevy, 1992; Lin, 1993) suggests that linking is a general class of transformation between the scores from one form of assessment and those from another. In such framework, the methods to link scores and make scores comparable are generally categorized into predicting, scale aligning and equating. Predicting a form Y from another form X has been performed using a linear regression method since Galton era (19th century). Under the classical test theory (CTT), observed scores are different from the true scores, but can be estimated with some error. Prediction is possible on either the observed scores or the true scores, which are estimated from the observed scores. Scale Aligning (or simply put scaling ) refers to a procedure of transformation of scores from two different test forms, X and Y, onto a common scale. Equating is the strongest form of score linking. 
It requires that the tests measure the same constructs and are built to the same psychometric specifications, such as similar difficulty and precision (reliability). Even when test forms are built to the same specifications, their psychometric properties may not be identical, so a statistical adjustment for difficulty is necessary. Equating is thus a procedure that places the scores from different test forms on a common scale and allows those scores to be used interchangeably.

The Infinity Series assessments (Logical Reasoning Test, or LRT; Numerical Reasoning Test, or NRT; Verbal Reasoning Test, or VRT) have been scored and reported under CTT. The CAT assessments (LRCAT, NRCAT, and VRCAT) were developed to achieve maximum efficiency (saving up to 50% of the items, and thus seat time, for each individual test taker) and effectiveness (the same, guaranteed precision level of measurement for all test takers). The current CATs measure cognitive ability in the three domains using item response theory (IRT), specifically the three-parameter logistic model (3PLM). Each CAT session selects items adaptively from the same item bank, and test length is designed to vary across individuals to reach maximum efficiency and effectiveness.

Score equivalency between the Infinity Series and the CATs is best achieved through a psychometric procedure known as calibration, since these tests measure the same constructs at similar difficulty but possibly different reliability. Under the IRT framework, calibration refers to item parameter estimation; scores derived from the calibrated items are comparable on a common scale. Table 1 presents the CAT bank sizes and Infinity Series test lengths reflected in the calibration analysis used to establish score equivalency.

Table 1. Bank Size and Test Length by Assessment Domain (columns: LR, NR, VR, Total; rows: CAT bank size, Infinity Series test length)

The item pool for the Infinity Series is larger than the standard test lengths above. These pool items (30 for LRT, 146 for NRT, and 61 for VRT) were linked to the CAT bank via concurrent calibration with the IRT 3PLM, a procedure called item linking. After item linking, the Infinity Series sample data (i.e., item responses) were rescored, so the rescored measures, on the theta metric, are comparable to those from the CATs in each domain.

The score scales themselves differ: the Infinity Series scores are number-correct (raw) scores, while the CAT scores are thetas ranging from -4.0 to 4.0, with higher theta values indicating greater ability to answer the questions correctly. Through calibration and IRT rescoring, the relationship between the Infinity Series scores and the CAT scores is established on the common theta scale.
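The 3PLM scoring step can be illustrated with a small sketch. This is not IBM's production scoring code: the item parameters and responses below are made up, and theta is estimated by a simple grid-search maximum likelihood over the reported range of -4.0 to 4.0.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a = discrimination, b = difficulty, c = guessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, responses, step=0.01):
    """Grid-search maximum-likelihood estimate of theta on [-4.0, 4.0].
    `items` is a list of (a, b, c) tuples; `responses` holds 1/0 scores."""
    best_theta, best_ll = -4.0, float("-inf")
    n_steps = int(8.0 / step) + 1
    for i in range(n_steps):
        theta = -4.0 + i * step
        ll = 0.0
        for (a, b, c), u in zip(items, responses):
            p = p_correct(theta, a, b, c)
            ll += math.log(p) if u == 1 else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical calibrated items spanning easy to hard: (a, b, c).
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (1.5, 1.0, 0.2)]
theta_hat = estimate_theta(items, [1, 1, 0])  # right, right, wrong
```

An adaptive session would additionally pick, after each response, the bank item most informative at the current theta estimate; the grid search here merely shows how responses and calibrated item parameters combine into a score on the theta scale.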
Table 2 presents a raw-score (Infinity Series) to theta (CAT) conversion table. Users of the Infinity Series assessments can locate the comparable CAT score for any raw Infinity Series score. The theta value corresponding to a given raw score indicates the CAT theta a test taker would, on average, have received in the same domain had the CAT been taken in place of the Infinity Series test.

Table 2. Raw Score to Theta Conversion Table (columns: Raw Total; LRT Theta; NRT Theta; VRT Theta)
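Applying a conversion table of this kind is a simple lookup. The sketch below uses fabricated raw-to-theta pairs (not the published Table 2 values) and interpolates linearly between tabled anchor points for raw scores that fall between them.

```python
# Illustrative only: these raw-to-theta anchor points are made up,
# not the published Table 2 values.
raw_to_theta_nrt = {0: -3.2, 5: -1.4, 10: -0.3, 15: 0.6, 20: 1.5, 25: 2.8}

def comparable_cat_score(raw, table):
    """Return the theta comparable to a raw score, interpolating
    linearly between tabled raw-score anchor points and clamping
    at the ends of the table."""
    keys = sorted(table)
    if raw <= keys[0]:
        return table[keys[0]]
    if raw >= keys[-1]:
        return table[keys[-1]]
    for lo, hi in zip(keys, keys[1:]):
        if lo <= raw <= hi:
            frac = (raw - lo) / (hi - lo)
            return table[lo] + frac * (table[hi] - table[lo])

theta = comparable_cat_score(12, raw_to_theta_nrt)
```

A published table would list every raw score explicitly, so the interpolation step only matters when a table is sparse, as in this toy example.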

General Mental Ability (GMA) vs. IBM Ability CAT Version

The Cognitive Ability measure in a Selector was a test of General Mental Ability (GMA) with a range of cognitive items. GMA has been replaced with a measure of deductive reasoning, specifically the CAT numerical reasoning assessment (NR CAT). NR CAT provides measures that are more reliable and equally precise for all test takers, and its scores are obtained more efficiently than with the GMA section of the Selector, saving on the number of items administered (and seat time) during a CAT session, typically up to 50%.

Score equivalency between GMA and the CAT is best achieved through a psychometric procedure known as concordance. Concordance applies when different tests measure similar constructs with similar test length, difficulty, and reliability. Scores from the two assessments can be linked via concordance and are thus comparable in a relative sense; this does not mean that GMA scores can be used interchangeably with their CAT counterparts. The relationship derived via concordance nevertheless yields a score-comparison table, similar to the calibration case, in terms of relative standing. Given data from a specific Selector, a concordance table can be developed quickly for score comparison, upon request.
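One common way to build such a table is equipercentile linking: a GMA score is matched to the CAT score holding the same percentile rank in a reference group. The sketch below assumes this approach (the report does not specify the exact method used) and runs on fabricated score distributions.

```python
# Sketch of an equipercentile concordance between GMA raw scores and
# CAT thetas. All distributions here are fabricated for illustration.

def percentile_rank(score, scores):
    """Fraction of the reference group scoring below `score`,
    plus half of those tied with it (mid-percentile rank)."""
    below = sum(1 for s in scores if s < score)
    tied = sum(1 for s in scores if s == score)
    return (below + 0.5 * tied) / len(scores)

def concordant_score(gma_score, gma_scores, cat_thetas):
    """CAT theta whose percentile rank in its reference group is
    closest to the percentile rank of the given GMA score."""
    target = percentile_rank(gma_score, gma_scores)
    return min(cat_thetas,
               key=lambda t: abs(percentile_rank(t, cat_thetas) - target))

gma_scores = [10, 20, 30, 40, 50]         # fabricated GMA reference group
cat_thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]  # fabricated CAT reference group
match = concordant_score(30, gma_scores, cat_thetas)
```

Because concordance preserves only relative standing, the resulting pairs support score comparison but not the interchangeable use that equating would license.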

10. Appendix B

Recent Changes to the Scale Composition of the Sales Selector

The Sales Selector, as outlined in the studies in this technical report, was composed of personality scales taken from Kenexa's Work Style Questionnaire (WSQ). In 2015, the WSQ was discontinued and the newly developed Kenexa Personality Assessment (KPA) took its place. This section considers the overlap between the personality scales in the WSQ and KPA versions of the Sales Selector.

In essence, the new KPA scales are conceptually identical to four of the five WSQ scales and retain the same names. For the remaining WSQ scale (Leadership), the KPA equivalent (Authority) is not conceptually identical but shares many of its characteristics. Both scales emphasize wanting to take the lead role and being seen as a leader by others, but the KPA equivalent is more about actively controlling and directing others, with an element of dominance over them. The following table shows the correspondence between the Sales Selector using the WSQ scales and the Sales Selector using the KPA equivalent scales.

Final Selector (WSQ)      Current Selector (KPA)
Detail Orientation        Detail Orientation
Energy                    Energy
Initiative                Initiative
Leadership                Authority
Persistence               Persistence
Cognitive Ability         Numerical Reasoning
Situational Judgment      Situational Judgment

The Situational Judgment sections of the Sales Selector are not based on the WSQ or KPA and remain unchanged. The Cognitive Ability measure in the Sales Selector was a test of General Mental Ability with a range of cognitive items; it has been replaced with a measure of deductive reasoning, specifically numerical reasoning. As discussed in Appendix A, this new measure (the CAT Numerical Reasoning assessment) is a highly reliable and valid assessment and is likely to increase the overall validity of the Sales Selector.
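Purely as an illustration, the scale correspondence above can be captured as a simple lookup, which may be useful when migrating stored selector configurations from the WSQ era to the KPA era; the scale names are taken from the table above.

```python
# Mapping from WSQ-era Sales Selector scales to their KPA-era
# equivalents, as listed in the correspondence table.
WSQ_TO_KPA = {
    "Detail Orientation": "Detail Orientation",
    "Energy": "Energy",
    "Initiative": "Initiative",
    "Leadership": "Authority",  # conceptually close, but not identical
    "Persistence": "Persistence",
    "Cognitive Ability": "Numerical Reasoning",
    "Situational Judgment": "Situational Judgment",
}

def migrate_scale(wsq_scale):
    """Return the KPA-era scale name for a WSQ-era scale;
    raises KeyError for an unknown scale name."""
    return WSQ_TO_KPA[wsq_scale]
```

Keeping the Leadership-to-Authority substitution explicit in one place makes it easy to flag, since it is the one pairing that is not a conceptual identity.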