IBM Workforce Science. IBM Kenexa Ability Series Computerized Adaptive Tests (IKASCAT) Technical Manual
IBM Workforce Science
IBM Kenexa Ability Series Computerized Adaptive Tests (IKASCAT) Technical Manual
Version UK/Europe
Release Date: October 2014
Copyright IBM Corporation. All rights reserved.
Table of Contents

Chapter 1: What is IKASCAT?
  1.1 Introduction
Chapter 2: Assessment Content
  2.1 Assessment Components
    2.1.1 Logical Reasoning Test
    2.1.2 Numerical Reasoning Test
    2.1.3 Verbal Reasoning Test
Chapter 3: IKASCAT Utilizes CAT Technology
Chapter 4: Development of IKASCAT CAT System
  4.1 CAT System Design and Development
  4.2 CAT Content Development
  4.3 CAT Implementation and Maintenance
Chapter 5: Why Use Psychometric Tests?
  5.1 Why Use Cognitive Ability Tests?
  5.1.2 Generalized Validity of Ability Tests
Chapter 6: CAT and How It Is Used in IKASCAT
  6.1 What is CAT?
  The Advantages of Using CAT
  What is Item Response Theory (IRT)?
  Item and Test Information
  Parameter Estimation
  Theta Estimation
  Item Parameter Estimation
  Building Appropriate CAT Strategies
  Starting rule for selecting the first item
  Item selection algorithm
  Item scoring and updating ability procedure
  Constraints on item selection
  Stopping Rule
Chapter 7: Administration, Scoring & Reporting
  7.1 Administration
  7.2 Scoring
  7.3 Reporting
Chapter 8: Summary Stats and Group Differences
  'Norming' and Norms Groups Available
  Logical Reasoning Test
  Numerical Reasoning Test
  Verbal Reasoning Norms
  Group Differences
  Setting cut-off scores
  Group Differences: LRT
  Group Differences: NRT
  Group Differences: VRT
Chapter 9: Reliability
Chapter 10: Validity
  Defining Validity
  Criterion Validation Studies
  What Do Employers Get From Using IKASCAT?
  What do employers get from the LRT?
  What do employers get from the NRT?
  What do employers get from the VRT?
Chapter 11: Equating IKASCAT to Infinity Series
  Can a PBT or CBT (static form) be equated to CAT?
  Framework of Score Linking
  Purpose of Equating
  Linking Design
  Data Collection Design for Equating
  Classical Equating Methods
  IRT Equating Methods
  Establishing relationship between IKASCAT and Infinity Series
Chapter 12: Validation Studies (LRT, NRT, VRT)
Chapter 13: References
Chapter 1: What is IKASCAT?

1.1 Introduction
One of the main interests in the field of occupational psychology lies in the area of recruitment and selection, and the identification of factors which can predict successful occupational performance. Researchers have compared possible predictors of job performance, such as biographical data, references, educational level, college grades, interviews, ability tests and personality questionnaires, and the general consensus of the research is that the best predictor of occupational performance is cognitive ability (Schmidt & Hunter, 1998; Gottfredson, 2002). IBM's Kenexa Ability Series Computerized Adaptive Test (IKASCAT) is a suite of assessments that measures three of the major components of cognitive ability (Logical Reasoning, Numerical Reasoning, and Verbal Reasoning). The IKASCAT utilizes computerized adaptive testing (CAT), which adapts to a test taker's responses, presenting test takers with items that most closely reflect their ability and calculating their ability in the most accurate and secure manner available. IKASCAT measures cognitive abilities that are important predictors of job performance and training success. Schmidt and Hunter's (1998) review of over 85 years of research into personnel selection identified tests of cognitive ability as the best predictors of job performance and training success. IKASCAT measures such cognitive abilities, assessing both deductive reasoning skills (using verbal and numerical formats) and inductive reasoning skills (using abstract/logical reasoning formats) for use in work-related settings. There are two distinct parts to IKASCAT: the Assessment Content (i.e. the assessments themselves, including all of the questions asked) and the CAT system (i.e. the administration and scoring system that delivers the questions and produces scores based on the test taker's answers). Both of these are described in some detail later in this document.
This technical manual has been written for users of the IKASCAT, and provides the following:
- Descriptions of the assessments themselves
- The rationale behind IBM developing CAT systems for assessments
- An explanation of what CAT is
- A summary of the development of the CAT system
- Statistical details on IKASCAT
- Information on administration, scoring and reporting
- Examples of reports produced
Chapter 2: Assessment Content

The questions (or items) used in IKASCAT were written and reviewed by a team of occupational psychologists and psychometricians with combined test development expertise in excess of 100 years, and from a range of English-speaking countries (including the UK, Ireland, the US, New Zealand, Singapore and South Africa). IBM Workforce Science has developed computer-administered psychometric tests for the last 25 years, and online ability assessments produced by IBM psychologists are used with tens of millions of test takers annually. The development work started formally in January 2012 and continued until the final pilot studies in June. The assessment went through several iterations until the final pilot study, the results of which provide some of the technical and statistical information in this manual.

General design criteria have been applied in developing this assessment. The most important of these general criteria is the quality of the questions asked (the items). Great care was taken in choosing the format, structure, and appearance of the items. All items have been checked, reviewed, modified if necessary, trialled and re-trialled. All items have been reviewed for issues of legality, particularly concerning diversity or disability, and to ensure that local idioms are avoided and offence is not caused by any of the questions asked. The core stems or stimuli used (the information on which the items are based) need to allow test takers to show their ability to draw conclusions, make deductions and infer logically from the information provided. Multiple choice questions were developed: test takers need to choose the correct answer from a range of possible options, and in each case one and only one of the possible options is correct. Items (for the NRT and VRT in particular) cover a wide range of subject matters. Different IRT scoring methods were used, including Rasch scoring, 2PL and 3PL models.
2.1 Assessment Components
The IKASCAT is composed of three assessments: Logical Reasoning (LRT), Numerical Reasoning (NRT) and Verbal Reasoning (VRT). These are designed for use in an unproctored internet (or online) testing context, in both CAT and non-CAT contexts (IRT scoring or traditional scoring).
2.1.1 Logical Reasoning Test
Logical reasoning is the ability to analyse situations, identify patterns and relationships that underpin these situations, and derive or extrapolate from these. This is a necessary condition for all logical problem-solving situations, particularly those requiring scientific, mathematical, engineering or financial problem solving. The Logical Reasoning Test (LRT) is designed to provide a fair, objective, rapid and practical measure of inductive reasoning. It measures a person's skill in evaluating the patterns and trends in information, without reference to written text or numerical data. The LRT has been developed to be a culture-fair assessment, useful in multi-cultural, multi-racial or multiple-language contexts. Inductive reasoning is the process of reasoning from specific premises or observations to reach a general conclusion or overall rule. Deductive reasoning denotes the process of reasoning from a set of general premises to reach a logically valid conclusion. Deductive inferences draw out conclusions that are implicit in the given information, whereas inductive inferences add information in order to draw a conclusion. The information in the LRT questions comes in the form of abstract forms or shapes that have been changed or modified across a series of stages. One of these stages is missing, and candidates need to analyse the information to choose which one of a series of options would complete the series logically.

The LRT does require the test taker to:
- Attend to the information available (i.e. the characteristics of the forms and shapes)
- Identify the relationships, patterns and trends in the information
- Derive a set of rules that can support the relationship
- Apply these rules to correctly identify the required answer

The LRT does not require the candidate to:
- Use prior knowledge or have knowledge of a particular subject or area
- Have learned or acquired a particular skill
- Be a speaker of a particular language

2.1.2 Numerical Reasoning Test
The Numerical Reasoning Test (NRT) is a test of deductive reasoning, one of the major components of fluid intelligence, a concept originally identified by Raymond Cattell (1971). Numerical reasoning is the ability to evaluate numerical information critically, understand patterns and trends in data, and draw logically valid inferences from the information presented. The NRT is designed to provide a fair, objective, rapid and practical measure of deductive reasoning, using numerical information.
The content of this test is representative of numerical information likely to be encountered within a business context, thus providing wide applicability across a range of professional and managerial selection, development and recruitment activities. Managerial and professional roles inherently require employees to deal frequently with complex numerical data, for example in financial planning, market analysis and problem-solving situations. The NRT was therefore designed to assess this level of numerical reasoning ability. Questions needed to:
- Be easy to read and assess
- Present information in the simplest format possible
- Include realistic scenarios
- Use real data sets (simplified and modified for use in assessment)
- Involve simple arithmetic operations such as addition, subtraction, multiplication and division
- Involve the use of whole numbers (integers), decimals and fractions
- Involve the use of ratios and percentages
- Present information in the form of charts, graphs and tables (often a combination of these)

The NRT does require the test taker to:
- Evaluate numerical information critically
- Understand patterns and trends in the data presented
- Carry out simple computational analysis in order to come to the correct conclusions

The NRT does not require the candidate to:
- Have prior knowledge of the numerical content in the stimuli
- Apply complex formulae
- Have knowledge of complex mathematical methods

2.1.3 Verbal Reasoning Test
The Verbal Reasoning Test (VRT) is designed to provide a fair, objective, rapid and practical measure of deductive reasoning, using written information. It measures a person's ability to critically evaluate information presented in a written verbal format. In addition to understanding written communication, the VRT also encompasses the ability to understand complex discussions and other verbal interactions.
Many jobs involve working with verbal information, and verbal comprehension forms a core component of almost all professional and managerial roles. The VRT offers a high-level assessment of the verbal reasoning processes that people use almost daily when analysing and evaluating the detailed content of reports and other business documentation, produced by themselves, by colleagues or by outside agencies. In many organisations, verbal reasoning skills are key to the effective dissemination of business information, upwards and downwards, right across the workforce.

Most of the items in the VRT include a number of short passages of text followed by statements based on the information given in the passage. Candidates are asked to indicate whether the statements are true or false, or whether it is not possible to say either way. In answering these questions, candidates use only the information given in the passage and should not try to answer them in the light of any more detailed knowledge that they personally may have. Test developers needed to:
- Make passage length as short as possible (around 120 words)
- Take into account general reading speed
- Avoid grammatical or vocabulary complications
- Ensure that the information in the passage was factually correct
- Ensure that the information in the passage was not controversial
- Ensure that the information in the passage was not emotionally affective (i.e. something people may react to emotionally)
- Develop passages that were similar to short articles found on websites, in newspapers or magazines

The VRT does require the test taker to:
- Analyze and critically evaluate verbal information
- Understand complex arguments or positions in written communication
- Draw appropriate inferences from complex written information

The VRT does not require the test taker to:
- Have prior knowledge of the factual content in the passages
- Have technical knowledge of grammar
- Spot errors in the spelling of unfamiliar words
- Show knowledge of acquired specialist vocabulary
Chapter 3: IKASCAT utilizes CAT Technology

IKASCAT utilizes CAT technology in order to provide test users such as hiring managers with the most efficient, effective and accurate method of assessing cognitive ability. IBM has invested millions in developing a bespoke CAT system because the psychometric testing literature shows that CAT has a range of significant advantages over conventional online testing. These advantages include:
- Shorter test length (more than 50% fewer questions required)
- Shorter test duration (between 30% and 50% saving in time required)
- Greater measurement accuracy and test reliability
- Increased test taker motivation
- Improved test taker experience
- Increased test effectiveness (better at differentiating between candidates)
- Greater test security (particularly important with unsupervised testing)
- Greater scope for enhancement and updating

These advantages are elaborated on, and fully referenced, in Chapter 6 of this document. Other considerations involve the use of online assessments with the diversity of candidates expected. Fixed-length, timed ability tests are the most commonly used format outside North America, and fixed, timed versions of ability tests show larger differences between disabled and non-disabled candidates than untimed assessments (REFERENCE NEEDED). IBM presented a paper at the BPS Division of Occupational Psychology Conference 2014 (Keeley, S., & Parkes, J., 2014a) which showed that adjustments in test time (i.e. increasing the time allowed) reduced differences between disabled and non-disabled candidates but did not remove them, as some disabled candidates still timed out even when given extra time.
Accordingly, being untimed, CAT tests have additional advantages:
- Candidate performance is maximized
- They deal better with the adjustments required by disabled candidates (no need to add additional time, as the assessments are untimed)

The IKASCAT utilizes computerized adaptive testing, so item administration is tailored to the ability of each individual test taker. Each test is likely to have a unique combination of items; items are drawn from an item bank (or database) containing a large number of individual items and their psychometric
characteristics (e.g. item difficulty). Tests are constructed based on a number of criteria, the most important of which is the test taker's performance during the test itself. The items presented are selected based on how the test taker has answered previous questions. If the test taker answers correctly, a more difficult item is administered; if the test taker answers incorrectly, an easier item is administered. The test adapts itself to the test taker's ability. Accordingly, lower-ability test takers will be presented with easier questions than higher-ability test takers. This means that test takers may have answered the same number or percentage of questions correctly, but the higher-ability test takers will score better because they have answered more difficult questions.

The psychometric models behind IKASCAT are item response theory (IRT) models for both dichotomously scored items (i.e. scored 0 and 1) and polytomously scored items (i.e. scored in more than two categories), for a variety of possible item types and formats. In particular, the IRT models available for the IKASCATs are the three-parameter logistic (3PL) model, the two-parameter logistic (2PL) model and the one-parameter logistic (1PL) model, or Rasch dichotomous measurement model. The IRT models adopted for the development of the IKASCATs are important building blocks that enable candidates' performances on the cognitive ability assessments to be scored in real time and made comparable. IBM's CAT system built around the IRT models is the most advanced CAT system in the industry, with its signature components (Item Banker, CAT engine, CAT delivery and CAT management system) hosted on the Assess on the Cloud platform. IBM began its pre-production process for both the CAT system development and the content development based on the test specification or blueprint. The following chapter explains how this CAT system was developed and what it actually entails.
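To make the IRT models named in this chapter concrete: the 3PL model expresses the probability of a correct response as a function of a test taker's ability (theta) and three item parameters (discrimination a, difficulty b, and guessing c); the 2PL and 1PL/Rasch models are special cases. The sketch below is illustrative only, not IKASCAT's implementation, and the parameter values are invented for the example:

```python
import math

def p_correct(theta, a, b, c=0.0):
    """3PL item response function: probability that a test taker with
    ability `theta` answers an item with parameters (a, b, c) correctly.
    Setting c = 0 gives the 2PL model; additionally fixing a across
    items gives the 1PL/Rasch model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta == b, a 2PL item is answered correctly half the time;
# a 3PL item with guessing parameter c sits at c + (1 - c) / 2.
print(round(p_correct(0.0, a=1.2, b=0.0), 3))         # 2PL special case
print(round(p_correct(0.0, a=1.2, b=0.0, c=0.2), 3))  # 3PL with guessing
```

This is the building block an adaptive engine uses both to score responses and to decide which item would be most informative next.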
Chapter 4: Development of IKASCAT CAT System

Based on the most popular psychometric models, IKASCAT was developed in three phases: system design and development, content development, and implementation/maintenance. These are shown in Figure 1 below.

Figure 1. Phases of Development for the IKASCATs

4.1 CAT System Design and Development
During phase one (CAT System Design and Development), the CAT system was designed to accommodate dichotomous and polytomous IRT models and popular item types (e.g. multiple choice, rating scale, forced choice). It accommodates both unproctored and proctored internet-based testing (IBT), as well as multiple languages. A psychometric design and programming guideline was
produced to guide development of a CAT system, based on optimal conditions identified via Monte Carlo simulation studies. A large team of experts in programming and psychometrics developed, and conducted quality control checks on, the programming code from spring 2012 through spring. As a result, a series of improvements was made to the system to enhance its usability and scoring accuracy. Further improvements have been made to CAT's scoring and effectiveness since the initial CAT system development phase.

The CAT system consists of an item banking system, a test engine, and a test management and delivery system. The item banking system (or banker) stores item content and the psychometric properties associated with each item (or question). The test engine module reads the psychometric characteristics of items from the item banker, administers items adaptively and estimates the ability for each content domain. The engine also records, processes and stores all item response data, item records and ability estimates. The test management and delivery system takes in candidate registration information from the applicant tracking system (ATS) and controls administration, allowing unlimited access to CAT via the Internet around the globe. It also produces final scores, such as raw or scale scores, and reports the results (item responses, ability estimates and psychometric item characteristics) to end users, internally and externally. The CAT delivery and management module is integrated with IBM's signature assessment platform, Assess on the Cloud. CAT administration and score reporting follows the standard procedural order of Assess: authoring and publishing CATs into Assess, scheduling, delivery and score reporting (see Figure 2 below).

Figure 2.
CAT Management and Delivery via Assess
[The figure shows the Assess workflow: Item Banking (test creation/customization; Master Catalog; Custom Catalog); Authoring; Scheduling (standalone scheduling, or scheduling via integration with 2x Solutions (2xB, ATS)); Reporting (standard reports, custom reports, summary statistics, test and item analysis); Delivery (online, mobile, print/scan; on-demand via integration with 2x solutions (2xB, ATS)).]
4.2 CAT Content Development
IBM Workforce Science has developed computer-based tests for the last 25 years. Drawing on this extensive test development experience and expertise, more than 20 I-O psychologists and content experts, as well as psychometricians, were involved in the content development process for IKASCAT. The full-cycle development process is presented below:
- Collected and reviewed item content and characteristics of existing cognitive ability assessments; as the test will be used globally, each item was reviewed to ensure cultural sensitivity across multiple languages.
- Identified the item type/style/format for use in CAT.
- Recruited item writers from a range of global geographic regions and cultures, including many English-speaking countries (UK, Ireland, US, South Africa, Australia, and New Zealand) as well as China, Pakistan, Hong Kong, Singapore, France, and Germany.
- Conducted item writing training sessions via web conferences to ensure consistency.
- Wrote new items.
- Conducted bias and sensitivity reviews to ensure that new items were free of bias.
- Assembled standalone pretesting (field testing, item tryout or item trial) given the psychometric conditions documented in the psychometric design: IRT model, sample size and demographics, data collection design, number/percentage of items covering each content section or domain, multiple form assembly, test publishing, testing window, test administration, delivery platform, data collection, and item linking and calibration.
- Performed final psychometric data review and final content review.
- Identified operational items and built the initial item pool for each subject (domain).
- Conducted simulation studies with the approved operational items to find the optimal conditions for building operational CATs.
- Planned new item writing and standalone pretesting, or embedded pretesting in live CAT, depending on the pool size.
Standalone pretesting, with multiple pretest forms assembled, was necessary to build up the initial item pool, since not all participants in the pretesting can see all items in a given test form. Two popular approaches to building a final item pool/bank (a process known as item linking) are using common items that are included between two adjacent pretesting forms or across all pretesting forms, or having a common group (or sample) of participants take all pretesting forms. Both of these approaches were used in building the final item pool/bank for the IKASCAT assessments; the former approach is known as common item linking, and the latter as common person linking.

4.3 CAT Implementation and Maintenance
In building an item pool and measurement scale for use in an adaptive test, it is critical to determine procedures for identifying items that do not perform well. Poor items should be removed from the pool as soon as they are identified; otherwise, they introduce bias into the ability estimates. It is necessary to evaluate item performance at intervals determined by job candidate volume, to see whether items are performing as the target functions require. It is possible for the difficulty of items to drift, or change, over time: sometimes items drift to be easier, other times harder. It is important to evaluate items for drift on an annual basis, and to update item parameter estimates when needed. At specified points in the test life cycle, item pools are refreshed to ensure model fit and to conform to specified security provisions. The current item refreshment plan is primarily concerned with replacing items that have been overexposed with new items. Further expansion of the banked items is underway, with new items being trialled and included in the item pool on an ongoing basis.
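One simple way to flag drifting items, in the spirit of the monitoring described above, is to compare an item's observed proportion correct against the proportion its calibrated parameters predict for the respondents who saw it. The sketch below is an illustrative approach, not IBM's procedure; it uses the Rasch model for simplicity, and the flagging threshold is an assumption:

```python
import math

def rasch_p(theta, b):
    """Rasch (1PL) probability of a correct response for ability theta
    on an item with calibrated difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def flag_drift(item_b, respondent_thetas, responses, threshold=0.10):
    """Flag an item whose observed proportion correct differs from the
    model-predicted proportion by more than `threshold` (assumed cut-off).
    `responses` are the 0/1 scores of the same respondents."""
    expected = sum(rasch_p(t, item_b) for t in respondent_thetas) / len(responses)
    observed = sum(responses) / len(responses)
    return abs(observed - expected) > threshold, observed, expected

# An item calibrated at b = 0 that average-ability respondents answer
# correctly far more often than predicted has likely drifted easier
# (e.g. through overexposure).
drifted, obs, exp = flag_drift(0.0, [0.0] * 10, [1] * 9 + [0])
print(drifted, obs, round(exp, 2))
```

In practice a significance test (and far larger samples) would replace the fixed threshold, but the comparison of observed versus model-expected performance is the core of any drift check.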
Chapter 5: Why use Psychometric Tests?

The term psychometric means mental measurement. Consequently, psychometric tests are devices that measure psychological characteristics such as intelligence, personality, or the ability to perform a particular task. One major benefit of psychometric tests is that they are designed as systematic and standardised methods of measurement. In practice this means that the questions asked are consistent for every person who completes the test, the instructions they are given are consistent, and the conditions under which they complete the test should be controlled and as standardized as possible. With standardized practices, we are able to compare the results from tests taken at different times and in different places. Test developers also put in place systems for scoring their tests (for almost all Kenexa assessments this is computerized), enabling us to score and interpret the results in a consistent way. Another characteristic of psychometric tests is that they are designed to obtain a snapshot or sample of a person's ability or characteristics upon which we can make an assessment. An alternative would be, for example, to observe a person continuously in order to assess their ability, but this would be impractical. Designers of psychometric tests aim to ensure that the information obtained by assessing a sample of a person's ability can be reliably used to make an assessment of their ability in general. In order to make sense of the information obtained from a psychometric test, a person's results are often compared with those from a relevant group or population. For example, a person's results on a graduate ability test will be compared with the scores of a graduate population. Similarly, a person's results on a work-based personality questionnaire will be compared with those from a working population. Psychometric tests can be divided into those that assess maximum performance and those that assess typical performance.
Tests that assess maximum performance are designed to determine how well a person performs at their best. These types of test may be timed, with everyone given exactly the same amount of time to complete them, and they typically have right and wrong answers. Tests that assess maximum performance include ability tests and attainment tests. These are often timed, but the IKASCAT assessments are usually untimed, with no strict limit on the amount of time allowed for completing the test (although guidelines are often provided).

5.1 Why Use Cognitive Ability Tests?
Cognitive ability is one of the most studied constructs in psychology, with over 100 years of research behind it. Almost from the outset, work on the understanding of cognitive abilities has been conducted from an applied standpoint. For example, Alfred Binet, considered to be the developer of the first intelligence test, constructed measurements to understand the potential of children to benefit from educational instruction. This resulted in the first recognised test of mental ability being published in 1905 (Binet, 1905). This tradition of applied research has continued, particularly in the areas of education and personnel selection.
Measures of cognitive ability have always been recognised in the academic literature as the best general predictors of job performance, and are among the cheapest and most cost-effective methods to implement. The US Office of Personnel Management state on their website that cognitive ability tests are used because they are 'among the least expensive measures to administer and the most valid for the greatest variety of jobs'.1 As with many areas of psychology, there is no single agreed definition of what cognitive ability is. In his influential book on the structure of human abilities, Carroll (1993) argues that abilities need to be understood in the context of a specific task, with a cognitive task being any task in which correct or appropriate processing of mental information is critical to successful performance. Cognitive ability is 'any class of cognitive activity that concerns some class of cognitive tasks, so defined' (Carroll, 1993, p. 10). Carroll's model is particularly helpful as it not only provides a far-ranging map of intelligence, but also allows individual tests to be placed within this structure. As Carroll's model shows, at the level of Stratum I sit tests of specific abilities. Stratum II clusters these into broad families of tests, on the basis of factor-analytic research. For example, performance across sequential reasoning, induction and quantitative reasoning tests is assumed to be related to the underlying influence of fluid intelligence. In turn, performance on all tests is assumed to be influenced by a person's general intelligence, which forms Stratum III of Carroll's model. From the perspective of test development, it is important to recognise that most psychometric tests can exist only at Stratum I. Strata II and III of the model are abstractions hypothesised from the statistical analysis of test results and are never directly observed.
However, the weight of empirical research strongly suggests that these abstractions do have psychological reality (Carroll, 1993).

Figure 3. Carroll's Three-Stratum Model of Intelligence

1 Retrieved from apps.opm.gov/adt/content.aspx?page=2-02
5.1.2 Generalized Validity of Ability Tests
A number of major studies are often invoked to support the use of cognitive ability tests such as the Logical, Numerical and Verbal Reasoning tests included in IKASCAT. In 1998, Schmidt and Hunter reviewed over 85 years of research into personnel selection. This extensive synthesis of the literature identified tests of general mental ability (GMA)2 as the single best predictor of job performance and success on job-related training courses. Outtz's study (2002) showed significant correlations between cognitive ability tests and measures of job performance across a large range of jobs and roles. Ree et al. (1994) investigated the role of general cognitive ability and specific abilities or knowledge as predictors of work sample job performance criteria in seven jobs for US Air Force enlistees. Analyses revealed that cognitive ability was the best predictor of all criteria, and that specific abilities or knowledge added a statistically significant but smaller amount to predictive efficiency. These results are consistent with previous military studies, such as Army Project A. Schmidt and Hunter's major meta-analytical study (2004) presented extensive evidence that cognitive ability predicts both occupational level attainment and performance within one's chosen occupation, and does so better than any other ability, trait, or disposition, and considerably better than job experience. Other work, much of it involving meta-analysis, has further supported the validity of GMA in the prediction of job performance. Bertua, Anderson and Salgado (2005) examined the literature on criterion validity, and largely replicated previous work. Tests of GMA were seen to predict job performance (0.48) and training success (0.50). Validity was again seen to vary among occupations, ranging from 0.74 for professional roles to 0.32 for clerical roles. Bertua et al.'s work also studied different types of ability tests.
All test types studied had substantial validity. For measures of job performance, across 20 different samples (n = 3,410), numerical ability tests showed an operational validity of 0.42 and a 90% credibility value of 0.26. For measures of training success, across 46 different samples (n = 15,925), numerical ability tests showed an operational validity of 0.54 and a 90% credibility value of 0.43. For measures of job performance, across 14 different samples (n = 3,464), verbal ability tests showed a slightly lower operational validity of 0.39 and a 90% credibility value of 0.20. For training success, across 33 different samples (n = 12,679), verbal ability tests showed an operational validity of 0.49 and a 90% credibility value of 0.36. In each case, the positive credibility values indicate that the validity of these ability tests can be generalized across samples and settings.

2 General mental ability is the term frequently used in literature that summarises the results from research using a range of cognitive ability tests. Variations in the content and style of the tests are acknowledged. However, the positive manifold demonstrated by such tests, which implies an underlying construct influencing performance across different tests, is used to justify considering them all as assessments of the construct of general mental ability.
Chapter 6: CAT and how it is used in IKASCAT

6.1 What is CAT?

A Computerized Adaptive Test (CAT) is a test, administered by computer, which dynamically adjusts itself to the cognitive ability level of each test taker during the course of administration. CAT normally describes a test delivery method, as compared with conventional paper-and-pencil based testing (PBT). In a conventional PBT test of ability, every person takes the same fixed-form test, regardless of the item characteristics for a given level of ability. Typically, a conventional ability PBT presents items that measure mid-ability candidates well. This introduces more measurement error for those at the extreme levels of ability. In other words, it is wasteful if the hardest items are administered to candidates with the lowest ability level, or if the easiest items are administered to candidates with the highest ability level. Bored high-ability persons are likely to respond carelessly, and frustrated low-ability persons are more likely to respond in a random manner, so more errors of measurement of ability are introduced.

CAT creates and delivers a customized test for each respondent using computers (increasingly online), aiming to measure various psychological constructs such as ability, achievement, attitude and personality traits in the most efficient and effective way. CAT successively selects questions so as to maximize the precision of the test based on what is known about the candidate from previous questions. From the candidate's perspective, the difficulty of the exam seems to tailor itself to his or her level of ability. For example, if a candidate performs well on an item of intermediate difficulty, he or she will then be presented with a more difficult question; if the candidate performs poorly, an easier question will follow.
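The adaptive principle just described can be sketched as a deliberately simplified loop. This is only an illustration (the starting point and fixed step size are hypothetical); an operational CAT selects items by maximizing IRT information, as described later in this chapter.

```python
def next_difficulty(current, answered_correctly, step=0.5):
    """Simple 'staircase' rule: move up the difficulty scale after a
    correct response, down after an incorrect one."""
    return current + step if answered_correctly else current - step

# A candidate starts at intermediate difficulty (0.0 on a standardized
# scale), answers two items correctly, then misses the third.
difficulty = 0.0
for correct in [True, True, False]:
    difficulty = next_difficulty(difficulty, correct)

print(difficulty)  # 0.5  (0.0 -> 0.5 -> 1.0 -> 0.5)
```

Even this toy rule shows the key behaviour: the sequence of difficulties homes in on the level where the candidate's answers switch between right and wrong.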
Compared to static multiple-choice tests, where everyone is required to take a fixed set of items regardless of their ability (or construct) levels, CAT requires fewer test items to arrive at equally precise measures.

6.2 The Advantages of using CAT

Among many known advantages, efficiency and control of measurement precision are prominent. CATs are more efficient than conventional tests delivered via PBT (or non-CAT Internet-based testing, IBT). The test length for examinees can be reduced by 50% or more (i.e., the variable-length feature of CAT). A properly designed CAT can measure every examinee with the same degree of precision, which is not true of conventional PBT or non-CAT IBT. Figure 4 shows that the standard error of measurement remains similar, and low, across the full range of ability.

Figure 4. Degree of Precision: Conditional Standard Error of Measurement across Ability Estimates (Thetas)
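The flat conditional standard error shown in Figure 4 follows from every candidate receiving items matched to his or her own ability. A minimal numerical sketch, using the 2PL information function defined in Section 6.3 with hypothetical item parameters; the "adaptive" test is idealized here as a set of items centred exactly on the candidate's ability:

```python
import math

def p_2pl(theta, a, b, D=1.7):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def info_2pl(theta, a, b, D=1.7):
    """2PL item information: D^2 * a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

def csem(theta, items):
    """Conditional standard error = 1 / sqrt(total test information)."""
    return 1.0 / math.sqrt(sum(info_2pl(theta, a, b) for a, b in items))

offsets = (-0.5, -0.25, 0.0, 0.25, 0.5)
# A fixed test clusters item difficulties around average ability (0.0).
fixed_test = [(1.0, d) for d in offsets]

for theta in (-2.0, 0.0, 2.0):
    # Idealized adaptive test: difficulties centred on this candidate's theta.
    adaptive_test = [(1.0, theta + d) for d in offsets]
    print(theta, round(csem(theta, fixed_test), 2),
          round(csem(theta, adaptive_test), 2))
```

The adaptive column is constant (about 0.55 with these parameters) at every ability level, while the fixed test's error grows sharply for candidates far from the middle of its difficulty range, mirroring the pattern in Figure 4.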
There are many additional advantages recorded in the literature with regard to CAT (Linacre, 2000; Rudner, 1998).

Test takers receive tests that are tailored to their actual ability level. This means that test takers are not given a series of irrelevant questions which are either too easy (and therefore do not tell us the highest level of performance for this test taker) or too difficult (and therefore only tell us that their highest level of performance is lower than this). Because CAT assessments adapt to the actual performance of the test taker, their approximate ability level is more quickly identified, and more specific questions can then be administered to enable more accurate identification of the test taker's actual ability level.

The adaptive nature of these assessments means that CAT tests are shorter in duration (around 50% shorter in terms of time, and up to 65% shorter in terms of questions presented). These CAT tests are most likely to be administered unproctored, but both on-site and off-site testing time will be reduced.

Overall, CAT tests are much more accurate (i.e. more reliable) than conventional static cognitive ability tests, or even tests in which items are administered randomly from a large item bank (Grelle, Dainis, & Hurst, 2009). The Kenexa Ability Series CAT tests use a minimum reliability of 0.8 for each test; some will be well in excess of this.

Test security is also increased by the use of CAT. Item exposure is reduced because fewer questions are administered. By comparison with Kenexa's non-CAT versions of these assessments, this might reduce the number of questions presented from 20 items for a fixed NRT test to 8 items or fewer for a CAT version. This means that each candidate sees fewer questions, and only sees questions which equate to their ability level. The methods used to score CAT assessments also mean that efforts to access a large number of items can be thwarted.
A maximum number of items per administration is set, and test sessions may time out if excessive time is taken over the test as a whole or over individual items. If test items do become over-exposed or compromised (through cheating or piracy), these items can be deleted from the item bank without affecting the integrity of the whole item bank. One of the advantages of the CAT methodology is that items can be deleted, and new items easily added, to the total item bank. Replacement and alternative items are constantly being trialled and added to the item banks for these assessments.

Despite the sophistication and complexity of CAT scoring, scores for test takers are immediately available. This is because the ability level (represented here by a score, or theta value) must be recalculated after every question in order to select the next question administered.

Another possible unexpected advantage is increased motivation. Linacre (2000) mentions increases in the motivation of candidates during CAT testing sessions. During the assessment, test takers might feel discouraged if the items are too difficult or, on the other hand, might lose interest if the items are too easy. As CAT assessments adapt themselves to a test taker's ability level, this enables the test taker to achieve their most accurate and highest score possible. The shorter test time is also likely to improve the test taker experience by reducing the chances of test fatigue, which should result in a reduction in drop-out rates, i.e. the number of test takers who leave the assessment unfinished.

6.3 What is Item Response Theory (IRT)?

Item response theory (IRT) is an important advance in the technology of psychometrics that provides benefits to tests and their stakeholders, including individualized score precision, better characterization of the concept of measurement error, and the possibility of CAT. The calculation of CAT scores is founded on the principles of IRT models.
As suggested, IRT consists of several families of mathematical models, including dichotomous, polytomous, and multidimensional models. This manual focuses primarily on dichotomous models, which are appropriate for data with two score categories, typically right and wrong or correct and incorrect, where the item type is multiple choice with three to five response options/alternatives, depending on the item domain area.

In dichotomous IRT, we assume that the relationship between a person and his or her response to an item can be explained by a specific mathematical function called the item response function (IRF). There are several commonly used models. One of these is the three-parameter logistic model (3PLM), which models the probability of a person j with a given ability θ_j (Greek letter theta) correctly responding to an item i as (Hambleton & Swaminathan, 1985):

P(X_i = 1 \mid \theta_j) = c_i + (1 - c_i) \frac{\exp[D a_i (\theta_j - b_i)]}{1 + \exp[D a_i (\theta_j - b_i)]} \quad (1)
where a_i is the item discrimination parameter, or slope; b_i is the item difficulty, or location parameter (or threshold); c_i is the lower asymptote, or pseudo-guessing parameter; and D is a scaling constant equal to 1.7 or 1.0.

Figure 5 illustrates an IRF for the 3PLM. The difficulty (0.0) is the inflection point of the IRF projected onto the ability continuum, where the probability of a correct response to this item is 0.6 (i.e., the midpoint after taking into consideration the pseudo-guessing parameter). The discrimination parameter (1.5) is the slope of the IRF, indicating the strength of an item for discriminating among persons with different levels of ability. The degree of item discrimination is related to precision; that is, a more discriminating item adds more information to the measurement, and thus increases the precision of the ability estimate. The pseudo-guessing parameter (0.2) introduces a non-zero lower bound to the model; it represents the probability of a low-ability person correctly responding to an item, presumably by chance.

Figure 5. Item Response Function for a Dichotomously Scored Item

The model can be simplified into two other commonly used dichotomous IRT models. The two-parameter logistic model (2PLM) assumes that there is no guessing (c_i = 0.0) and only utilizes the difficulty and discrimination parameters. It is therefore appropriate when guessing would not play an important role in assessment. The one-parameter logistic model (1PLM) makes the further assumption that all items have a discrimination parameter of 1.0, and therefore differ only with respect to difficulty. The 1PLM is
mathematically equivalent to the Rasch model, although the users of the two models differ in philosophy. Figure 6 presents IRFs for three example items under the 1PL, 2PL, and 3PL models. Note that all IRFs for the 1PLM are parallel to one another and do not intersect. This demonstrates the objective measurement property of the 1PLM, whereby there is no interaction between items and ability: the probability (P) of a correct response to harder items will always be lower than the probability for easier items. This is not always the case for the 2PLM and 3PLM, as evidenced by Figure 6, because the slopes (i.e., the discrimination parameters) differ in these two models, whereas the slopes for the 1PLM are equal.

Figure 6. IRFs for Dichotomous IRT Models. 1PLM: b = -1.0, 0.5, 2.0. 2PLM: (a, b) = (0.5, -1.0), (1.5, 0.5), (1.0, 2.0). 3PLM: (a, b, c) = (0.5, -1.0, 0.2), (1.5, 0.5, 0.3), (1.0, 2.0, 0.4). (x-axes: Ability (theta).)

The above models assume that the item responses are a function of only a single latent trait (unidimensionality) and that a person's item response is solely determined by his/her location on the latent continuum and not by his/her responses to other items (local or conditional independence). One way to support the claim that a test is unidimensional is to show model-data fit (or data-model fit in the Rasch dichotomous model); item-level fit can also be checked. Another way is to compare the model IRF against the empirical IRF. The model IRF can be conceptualized similarly to a standard linear or logistic regression line: it is simply a model-based function that is fit to a particular set of data. This is illustrated in Figure 7, which provides plots of empirical and model IRFs. An empirical IRF can be constructed by classifying persons according to ability and computing the proportion correct within
each ability category. The model IRF attempts to model the curve for the correct response for, in effect, an infinite number of such groups on a continuous distribution.

Figure 7. Empirical and Model IRFs. (a) Good fit between an empirical and model-based IRF. (b) Poor fit between an empirical and model-based IRF, suggesting the need for a 3PL. (x-axes: Ability (theta).)

Item and Test Information

An important concept in IRT for the purposes of test development and adaptive testing is information. Broadly defined, information is an index of the increase in measurement precision (or decrease in uncertainty). Like the IRF, it is a continuous function across θ, as an item can provide more information at certain levels. This is because information is primarily a function of the slope of an IRF; at levels of θ where the IRF has little slope, and therefore little differentiating power, the item provides little information. An item provides the most information where the slope of the IRF is steepest. For example, a very difficult multiple choice item will differentiate amongst top persons, but provide no differentiation amongst below-average persons; virtually all of the latter would respond incorrectly or be forced to guess. The information function for the 3PL is specifically defined as (Embretson & Reise, 2000):

I_i(\theta) = D^2 a_i^2 \frac{1 - P_i}{P_i} \left( \frac{P_i - c_i}{1 - c_i} \right)^2 \quad (2)

which simplifies to D^2 a_i^2 P_i (1 - P_i) for the 2PL and D^2 P_i (1 - P_i) for the 1PL models. While information is maximized at b_i for the one- and two-parameter models, for the 3PLM it is maximized at (Lord, 1980):

\theta_{\max} = b_i + \frac{1}{D a_i} \ln \left[ \frac{1 + \sqrt{1 + 8 c_i}}{2} \right] \quad (3)
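As a sketch, equations (1) to (3) can be checked numerically. The parameter values below are taken from the Figure 5 example item (a = 1.5, b = 0.0, c = 0.2); the function names are illustrative only.

```python
import math

D = 1.7  # scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response, equation (1)."""
    z = math.exp(D * a * (theta - b))
    return c + (1.0 - c) * z / (1.0 + z)

def info_3pl(theta, a, b, c):
    """3PL item information, equation (2)."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def theta_max_info(a, b, c):
    """Ability level at which a 3PL item is most informative, equation (3)."""
    return b + (1.0 / (D * a)) * math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0)

a, b, c = 1.5, 0.0, 0.2  # the Figure 5 item
print(round(p_3pl(b, a, b, c), 2))  # 0.6 at theta = b, as described in the text

# With c > 0, information peaks slightly above b; equation (3) locates the peak.
t_star = theta_max_info(a, b, c)
assert info_3pl(t_star, a, b, c) >= info_3pl(t_star - 0.01, a, b, c)
assert info_3pl(t_star, a, b, c) >= info_3pl(t_star + 0.01, a, b, c)
```

For this item the peak falls at roughly θ = 0.10 rather than exactly at b, which is one reason 3PL-based item selection does not simply match item difficulty to the ability estimate.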
Each item has its own item information function (IIF) that differs based on the item parameters. Consider the example items in Table 1.

Table 1. Example item parameters

Item | a | b | c

Item 1 is a relatively easy item, with b = -2.00, while Item 4 is more difficult, with b =. The IRFs for these items are shown in the following figure.

Figure 8. IRFs for Example Items

The IIFs for the same items are shown below. Note that each item has more information (y-axis) where its IRF in the figure above has more slope. Item 1 has the highest discrimination value, and therefore has the highest peak in its IIF.

Figure 9. IIFs for Example Items
The figure above illustrates one of the core concepts of adaptive testing. A CAT typically works by constructing a table of values representing that graph and looking for the items that are most informative at a given ability level. For example, if a person's ability estimate is -2.00, then Item 1 is the most appropriate item for them, as it provides easily the most information around that ability estimate.

IIFs are useful in the test construction process because they can be summed across all items to produce the test information function (TIF). The TIF provides an index of expected (model-based) measurement precision as a function of θ, since the TIF and the standard error of measurement (SEM) conditional on ability (CSEM) are inversely related, such that:

\mathrm{CSEM}(\theta) = \frac{1}{\sqrt{\sum_{i=1}^{n} I_i(\theta)}} \quad (4)

A test intended for a pass/fail decision with a single cut-off score can be built to have a TIF that peaks near that cut-off score, and thus a high amount of precision there. A test that contains several decision points across θ can be built with a TIF that is high across a wider range. The use of the TIF and CSEM in test and item bank design is discussed in detail later.

Parameter Estimation

In IRT, both items and persons are characterized by parameters. Item parameters include a, b, and c, while the person parameter is the ability level θ (theta). These parameters are estimated from a set of item response data. Estimation of the item and person parameters is mutually dependent. That is, item parameters are used to calculate person θ estimates, which are in turn necessary to estimate item parameters. For this reason, the process of calibrating data with IRT is iterative, and
Introducing Introducing Assessment Consultant Introducing the WISC V Spanish, a culturally and linguistically valid test of cognitive ability in Spanish for use with Spanish-speaking children ages 6:0
More informationThe Changing Singaporean Graduate
Startfolie The Changing Singaporean Graduate The impact of demographic and economic trends how this impacts you who select David Barrett Managing Director cut-e 1 cut-e talent Solutions Services Overview
More informationEquivalence of Q-interactive and Paper Administrations of Cognitive Tasks: Selected NEPSY II and CMS Subtests
Equivalence of Q-interactive and Paper Administrations of Cognitive Tasks: Selected NEPSY II and CMS Subtests Q-interactive Technical Report 4 Mark H. Daniel, PhD Senior Scientist for Research Innovation
More informationWOMBAT-CS. Candidate's Manual Electronic Edition. Version 6. Aero Innovation inc.
WOMBAT-CS Version 6 Candidate's Manual Electronic Edition Aero Innovation inc. www.aero.ca Familiarization with WOMBAT-CS Candidate's Manual This manual should be read attentively by the candidate before
More informationMarketing Plan Handbook
Tennessee FFA Association Marketing Plan Handbook 2017-2021 TENNESSEE FFA ASSOCIATION MARKETING PLAN HANDBOOK 2017 2021 2 Purpose The Tennessee FFA State Marketing Plan Career Development Event is designed
More informationExaminer s report F5 Performance Management December 2017
Examiner s report F5 Performance Management December 2017 General comments The F5 Performance Management exam is offered in both computer-based (CBE) and paper formats. The structure is the same in both
More informationHow Differential Item Functioning Analysis (DIF) Can Increase the Fairness and Accuracy of Your Assessments
How Differential Item Functioning Analysis (DIF) Can Increase the Fairness and Accuracy of Your Assessments Nikki Eatchel, SVP of Assessment David Grinham, SVP International Assessment Solutions Scantron
More informationCONSTRUCTING A STANDARDIZED TEST
Proceedings of the 2 nd SULE IC 2016, FKIP, Unsri, Palembang October 7 th 9 th, 2016 CONSTRUCTING A STANDARDIZED TEST SOFENDI English Education Study Program Sriwijaya University Palembang, e-mail: sofendi@yahoo.com
More informationAnalyzing Language & Literacy using the WMLS-R
Analyzing Language & Literacy using the WMLS-R CRISTINA HUNTER ERIC WILLIAMSON TWIN Academy 2017 Part 1: What the assessment tells us about language acquisition? How can it be used to inform teaching and
More informationAnnual Employer Survey : Employers satisfaction with DWP performance against Departmental Strategic Objective 7
Department for Work and Pensions Research Report No 635 Annual Employer Survey 2008 09: Employers satisfaction with DWP performance against Departmental Strategic Objective 7 Jan Shury, Lorna Adams, Alistair
More informationThe Technological Edge: Unproctored Employment Testing in Large Organizations
The Technological Edge: Unproctored Employment Testing in Large Organizations Presented by Jasmin Loi Human Resources Services Manager Erik Collier Human Resources Analyst 31 st ANNUAL IPMAAC CONFERENCE
More informationCROWN FINANCIAL MINISTRIES
RESEARCH AND DEVELOPMENT TECHNICAL SUMMARY for Career Direct I. TECHNICAL INFORMATION ON THE Career Direct PERSONALITY SECTION The Personality Section of the Career Direct Report is a personality inventory
More informationData Collection Instrument. By Temtim Assefa
Data Collection Instrument Design By Temtim Assefa Instruments Instruments are tools that are used to measure variables There are different types of instruments Questionnaire Structured interview Observation
More informationUnderstanding Your GACE Scores
Understanding Your GACE Scores October 2017 Georgia educator certification is governed by the Georgia Professional Standards Commission (GaPSC). The assessments required for educator certification are
More informationThinking about competence (this is you)
CPD In today s working environment, anyone who values their career must be prepared to continually add to their skills, whether it be formally through a learning programme, or informally through experience
More informationInformation and Practice Leaflet
Information and Practice Leaflet Verbal and Numerical Reasoning Tests Why are tests used? Ability or aptitude tests are increasingly being used in the world of work to assess the key skills relevant to
More informationAssessment Center Report
Assessment Center Report Candidate Name: Title: Department: Assessment Date: Presented to Company/Department Purpose As of the Assessment Center Service requested by (Company Name) to identify potential
More informationCANDIDATE FEEDBACK REPORT KATHERINE ADAMS
CANDIDATE FEEDBACK REPORT KATHERINE ADAMS Report Date: 24 Aug 2016 Position: Example Position Client/Company: ABC Company Assessments Included Report Interpretation Module Assessment Date Results Valid
More informationLaunchPad psychometric assessment system An overview
LaunchPad psychometric assessment system An overview P ERCEPT RESOURCE MANAGEMENT INDEX LAUNCHPAD OVERVIEW...1 LaunchPad s outstanding value proposition...1 THE FEATURES AND FUNCTIONS OF LAUNCHPAD...2
More informationThe effective recruitment and selection practices of organizations in the financial sector operating in the Slovak republic
The effective recruitment and selection practices of organizations in the financial sector operating in the Slovak republic Ľuba Tomčíková University of Prešov in Prešov Department of management Ul. 17
More informationCRITERION- REFERENCED TEST DEVELOPMENT
t>feiffer~ CRITERION- REFERENCED TEST DEVELOPMENT TECHNICAL AND LEGAL GUIDELINES FOR CORPORATE TRAINING 3rd Edition Sharon A. Shrock William C. Coscarelli BICBNTBNNIAL Bl C NTBN NI A L List of Figures,
More informationWatson-Glaser Critical Thinking Appraisal III (US)
Watson-Glaser Critical Thinking Appraisal III (US) Profile Report Candidate Name: Organization: Pearson Sample Corporation Date of Testing: 21-11-2017 (dd-mm-yyy) 21-11-2017 Page 1 of 5 Watson Glaser III
More informationRobotic Process Automation. Reducing process costs, increasing speed and improving accuracy Process automation with a virtual workforce
Robotic Process Automation Reducing process costs, increasing speed and improving accuracy Process automation with a virtual workforce What is Robotic Process Automation (RPA)? Advanced macros? Robots...
More informationSTUDY SUBJECTS TAUGHT IN ENGLISH FOR EXCHANGE STUDENTS SPRING SEMESTER 2017/2018
STUDY SUBJECTS TAUGHT IN ENGLISH FOR EXCHANGE STUDENTS SPRING SEMESTER 2017/2018 1-3 YEAR Study programme: INTERNATIONAL BUSINESS Credits Description of study subject (ECTS) Subject International Business
More informationBefore You Start Modelling
Chapter 2 Before You Start Modelling This chapter looks at the issues you need to consider before starting to model with ARIS. Of particular importance is the need to define your objectives and viewpoint.
More informationA Quality Assurance Framework for Knowledge Services Supporting NHSScotland
Knowledge Services B. Resources A1. Analysis Staff E. Enabling A3.1 Monitoring Leadership A3. Measurable impact on health service Innovation and Planning C. User Support A Quality Assurance Framework for
More informationLinking Current and Future Score Scales for the AICPA Uniform CPA Exam i
Linking Current and Future Score Scales for the AICPA Uniform CPA Exam i Technical Report August 4, 2009 W0902 Wendy Lam University of Massachusetts Amherst Copyright 2007 by American Institute of Certified
More informationReliability & Validity
Request for Proposal Reliability & Validity Nathan A. Thompson Ph.D. Whitepaper-September, 2013 6053 Hudson Road, Suite 345 St. Paul, MN 55125 USA P a g e 1 To begin a discussion of reliability and validity,
More informationabc GCE 2005 January Series Mark Scheme Economics ECN2/1 & ECN2/2 The National Economy
GCE 2005 January Series abc Mark Scheme Economics ECN2/1 & ECN2/2 The National Economy Mark schemes are prepared by the Principal Examiner and considered, together with the relevant questions, by a panel
More informationProgram Assessment. University of Cincinnati School of Social Work Master of Social Work Program. August 2013
University of Cincinnati School of Social Work Master of Social Work Program Program Assessment August 01 Submitted to the College of Allied Health Sciences University of Cincinnati 1 University of Cincinnati
More informationGetting Started with OptQuest
Getting Started with OptQuest What OptQuest does Futura Apartments model example Portfolio Allocation model example Defining decision variables in Crystal Ball Running OptQuest Specifying decision variable
More informationStudent Workbook. Designing A Pay Structure TOTAL REWARDS. Student Workbook. STUDENT WORKBOOK Designing A Pay Structure. By Lisa A. Burke, Ph.D.
Case Study and Integrated Application Exercises By Lisa A. Burke, Ph.D., SPHR Student Workbook Student Workbook TOTAL REWARDS 2008 SHRM Lisa Burke, Ph.D., SPHR 45 46 2008 SHRM Lisa Burke, Ph.D., SPHR INSTRUCTOR
More informationSTATISTICAL TECHNIQUES. Data Analysis and Modelling
STATISTICAL TECHNIQUES Data Analysis and Modelling DATA ANALYSIS & MODELLING Data collection and presentation Many of us probably some of the methods involved in collecting raw data. Once the data has
More informationDIPLOMA OF HUMAN RESOURCES MANAGEMENT-BSB50615 Study Support materials for Manage recruitment selection and induction processes BSBHRM506
DIPLOMA OF HUMAN RESOURCES MANAGEMENT-BSB50615 Study Support materials for Manage recruitment selection and induction processes BSBHRM506 STUDENT HANDOUT This unit describes the performance outcomes, skills
More informationHTS Report. d2-r. Test of Attention Revised. Technical Report. Another Sample ID Date 14/04/2016. Hogrefe Verlag, Göttingen
d2-r Test of Attention Revised Technical Report HTS Report ID 467-500 Date 14/04/2016 d2-r Overview 2 / 16 OVERVIEW Structure of this report Narrative Introduction Verbal interpretation of standardised
More informationInfluence of the Criterion Variable on the Identification of Differentially Functioning Test Items Using the Mantel-Haenszel Statistic
Influence of the Criterion Variable on the Identification of Differentially Functioning Test Items Using the Mantel-Haenszel Statistic Brian E. Clauser, Kathleen Mazor, and Ronald K. Hambleton University
More informationTest and Measurement Chapter 10: The Wechsler Intelligence Scales: WAIS-IV, WISC-IV and WPPSI-III
Test and Measurement Chapter 10: The Wechsler Intelligence Scales: WAIS-IV, WISC-IV and WPPSI-III Throughout his career, Wechsler emphasized that factors other than intellectual ability are involved in
More informationStaffing Organizations (2nd Canadian Edition) Heneman et al. - Test Bank
Chapter 08 1. External selection refers to the assessment and evaluation of external job applicants. 2. Understanding the legal issues of assessment methods is necessary. 3. Cost should not be used to
More informationTDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.
Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide
More informationNear-Balanced Incomplete Block Designs with An Application to Poster Competitions
Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania
More informationIdentifying the people to grow your organisation and deliver business results
Innovative talent assessment solutions for Retail & Hospitality Identifying the people to grow your organisation and deliver business results Meeting the key retail & hospitality talent challenges Company
More informationThe 360-Degree Assessment:
WHITE PAPER WHITE PAPER The : A Tool That Can Help Your Organization Maximize Human Potential CPS HR Consulting 241 Lathrop Way Sacramento, CA 95815 t: 916.263.3600 f: 916.263.3520 www.cpshr.us INTRODUCTION
More informationTraining Watson: How I/O Psychology, Data Science, and Engineering integrate to produce responsible AI in HR.
Training Watson: How I/O Psychology, Data Science, and Engineering integrate to produce responsible AI in HR. Stefan Liesche, IBM Distinguished Engineer - Watson Talent Architecture Nigel Guenole, IO Psychologist
More information