Chapter 3: Standardization and Derivation of Scores

This chapter presents the sampling and standardization procedures used to create the normative scores for the UNIT. The demographic characteristics of the standardization sample and the sample's representativeness of the U.S. population according to these variables are reported in detail. The chapter concludes with a description of the procedures used to derive scores for UNIT interpretation.

Standardization

Standardization Sample

The standardization of the UNIT was based on a carefully designed, stratified, random sampling plan that resulted in a sample closely representative of the U.S. population. Normative data were collected from a comprehensive national sample of 2,100 children and adolescents from ages 5 years 0 months through 17 years 11 months 30 days. An additional 1,765 children and adolescents participated in the reliability, validity, and fairness studies. Several sources provided the demographic and educational parameters of the U.S. population. These sources included the Current Population Survey, March 1995 (U.S. Bureau of the Census, 1995), the Seventeenth Annual Report to Congress on the Implementation of the Individuals With Disabilities Education Act (U.S. Department of Education, 1995), the
Digest of Education Statistics 1996 (U.S. Department of Education National Center for Education Statistics [NCES], 1996), and Schools and Staffing in the United States: Selected Data for Public and Private Schools, 1993-94 (U.S. Department of Education NCES, 1995). Based on the 1995 U.S. census data, the standardization sample was stratified and proportionately representative of the U.S. population according to the following variables:

- Sex
- Race (White, African American, Asian/Pacific Islander, Native American, Other)
- Hispanic Origin (Hispanic, Non-Hispanic)
- Region (Midwest, Northeast, South, West)
- Community Setting (Urban/Suburban, Rural)
- Classroom Placement (Full-Time Regular Classroom, Full-Time Self-Contained Classroom, Part-Time Special Education Resource, Other)
- Special Education Services (Learning Disability, Speech and Language Impairments, Serious Emotional Disturbance, Mental Retardation, Giftedness, English as a Second Language and Bilingual Education, and Regular Education)
- Parental Educational Attainment (Less Than High School Degree, High School Graduate or Equivalent, Some College or Technical School, Four or More Years of College)

Data were collected at 108 sites in 38 states, as indicated in Figure 3.1 and listed in Appendix F.

Age and Sex

The UNIT standardization sample included an approximately equal number of female and male respondents in each of 12 age groups, from age 5 through age 17, with 175 respondents in each group. The decision to combine ages 16 and 17 into one group was based on early studies suggesting that only minimal development in the abilities assessed by the UNIT occurs from age 16 to age 17. Efforts were made to ensure an equal distribution of participants throughout each age group. Sufficient balance was achieved to generate age norms with 4-month intervals through the year, that is, 0 months 0 days to 3 months 30 days, 4 months 0 days to 7 months 30 days, and 8 months 0 days to 11 months 30 days.
These gradations enhance the precision of developmental assessment. The composition of the UNIT standardization sample by age and sex is shown in Table 3.1. Although U.S. census breakdowns by sex are not precisely equal, male and female participants are, by design, equally represented in all of the remaining demographic breakdowns of the UNIT standardization sample discussed in the following sections, in order to ensure fairness.
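As a rough illustration (not part of the UNIT materials), the assignment of an examinee to a full-year age group and one of the three 4-month norm bands described above can be sketched from a birth date and a test date:

```python
from datetime import date

def norm_band(birth: date, test: date) -> tuple[int, int]:
    """Return (age in whole years, 4-month band index) for a chronological age.

    Band 0 covers 0 months 0 days through 3 months 30 days past the last
    birthday, band 1 covers 4 through 7 months, and band 2 covers 8
    through 11 months, matching the intervals described in the text.
    """
    months = (test.year - birth.year) * 12 + (test.month - birth.month)
    if test.day < birth.day:  # the current month of age is not yet complete
        months -= 1
    return months // 12, (months % 12) // 4

# A child born March 15, 1990 and tested August 1, 1997 is 7 years
# 4 months old, so the age-7, months-4-through-7 norms would apply.
print(norm_band(date(1990, 3, 15), date(1997, 8, 1)))  # (7, 1)
```

This is only a calendar calculation; the manual's own age-computation worksheet governs actual administration.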
Race and Ethnicity

The U.S. Bureau of the Census asks individuals to decide for themselves to which racial and ethnic groups they belong. During sample acquisition for the UNIT, parents were asked to identify on permission and consent forms their children's race: White or Caucasian, African American or Black, Native American or American Indian, Asian or Pacific Islander, or Other. For the purpose of tabulating the UNIT sample data, three categories (American Indian, Asian or Pacific Islander, and Other) were collapsed into the Other category. The performance of several racial groups is discussed at length in Chapter 6. As the data reported in Table 3.2 indicate, the racial proportions of the UNIT standardization sample closely matched those of the U.S. population. Parents were also asked whether their children were of Hispanic origin. According to census procedures, persons of any race can be Hispanic on the basis of their ethnic origin, including Mexican American, Chicano, Mexican, Mexicano, Puerto Rican, Cuban, Central or South American, or other Hispanic. The percentages of children and adolescents in the standardization sample identified as Hispanic are reported in Table 3.3. The results indicate that the UNIT sample was representative of the U.S. population of children from ages 5 through 17 in terms of Hispanic origin.

Community Size and Geographic Region

Community size was used to ensure adequate representation of children and adolescents from rural communities in the United States. The U.S. Bureau of the Census defines rural areas as those with populations of less than 2,500 that are not classified as urban. A total of 27.6% of the UNIT standardization sample had their primary residence in a rural community, with the remaining 72.4% living in urban or suburban communities. These proportions closely approximate the 1996 U.S. census proportions of 24.8% rural and 75.2% urban (U.S. Bureau of the Census, 1996).
The four geographic regions of the United States shown in Figure 3.1 (i.e., Midwest, Northeast, South, West) were sampled according to the national population parameters. The proportions of the UNIT standardization sample stratified by age and geographic region are reported in Table 3.4. Results show that the UNIT sample closely matched the regional distributions of the U.S. population.

Parent Education Attainment

Parent education attainment has been widely used by researchers as an indication of socioeconomic status (SES) because it is information that most families will convey accurately, unlike more personal information such as annual income. Education attainment is measured by the U.S. Bureau of the Census as the highest grade that an individual has completed or the highest degree that an individual has received. During the UNIT standardization, the highest education attainment of either parent was used as the criterion for SES. Parent education was divided into four levels
for stratification: less than a high school education (<HS), high school graduate (HS), 1 to 3 years of college or technical training (Some College), and 4 or more years of college or technical training (4+ Years College). The proportions of the UNIT standardization sample stratified by age, sex, and parent education level are presented in Table 3.5; stratified by age, race, and parent education in Table 3.6; and stratified by age, Hispanic origin, and parent education level in Table 3.7. As the data in these tables show, the UNIT sample breakdowns were consistently similar to U.S. census figures for parent education attainment.

Additional Stratifications

The UNIT standardization sample was constructed to represent the U.S. population on several additional cross-tabular stratifications. The data in Tables 3.8, 3.9, and 3.10, respectively, show that the UNIT standardization sample was similar to the U.S. population on age, sex, and race; age, sex, and Hispanic origin; and age, race, and geographic region. As the data reported in Tables 3.1 through 3.10 show, the UNIT standardization sample was closely representative of the U.S. population on all of the important demographic variables.

Special Educational Services

As part of the UNIT normative studies, children and adolescents who were diagnosed with and who were receiving special educational services for an educational disability or exceptionality were included in the sampling plan. The percentages of these special populations in the UNIT standardization sample and in the U.S. population of school-aged children are reported here according to the categories of exceptionality. Special populations included students with learning disabilities (5.6% of the UNIT sample, 5.9% of the U.S.
population), speech and language delays or impairments (2.3%, 2.4%), serious emotional disturbance (0.9%, 1.0%), mental retardation (1.2%, 1.3%), hearing impairments (0.2%, 0.2%), intellectual giftedness (6.2%, 6.4%), bilingual education (1.8%, 3.1%), and English as a second language (2.0%, 4.0%). These data indicate that the proportions of participants in the UNIT sample with various educational exceptionalities were close to those expected for school-aged children in the U.S. population. The percentages of U.S. public school students participating in particular programs or services were obtained from the Seventeenth Annual Report to Congress on the Implementation of the Individuals With Disabilities Education Act (U.S. Department of Education, 1995), the NCES (U.S. Department of Education NCES, 1995), and the Digest of Education Statistics 1996 (U.S. Department of Education NCES, 1996).
Examiner Identification and Training

Standardized testing procedures were ensured through the screening of examiners according to qualifications and experience, comprehensive examiner training, completion and review of practice cases, ongoing consultation and guidance from the project staff, and quality-control procedures for reviewing every returned protocol. A total of 274 examiners received training and participated in the UNIT data collection. Examiners either were licensed or certified psychologists or were students in psychology graduate training programs receiving supervision from licensed psychologists or university trainers. All examiners were required to have had prior course work and practical experience in individual psychological assessment before receiving training to administer the UNIT. As part of their training, examiners viewed a 28-minute UNIT training videotape, which was distributed to each of 116 site coordinators for training purposes. Site coordinators also received supplemental UNIT training materials, including sample cases, test materials, and the examiner's manual. The training videotape described the UNIT, including its theoretical foundation, materials, and administration and scoring procedures, and included a demonstration of a child being administered the UNIT. Site coordinators ensured that each of their examiners viewed the videotape and answered any questions from the examiners. After viewing the videotape, examiners practiced administering the UNIT under supervision until the site coordinator deemed the examiner sufficiently prepared to attempt a practice assessment. The site coordinator checked the examiner's practice protocol, pointed out and corrected administration errors, and submitted the protocol to the project staff. The project staff reviewed the protocol to determine whether a standardized administration had been accomplished and provided additional feedback to the examiner when necessary.
At this point, the examiner was either authorized to collect data for standardization or required to receive additional training and to conduct additional practice administrations.

Quality Control

To ensure the quality of the UNIT normative data, the project staff followed a stringent quality-control procedure. Every submitted record form and response booklet was checked against a 33-point checklist. This review included an examination of all scoring decisions and subtest data for adherence to starting and discontinuing rules, time limits, and scoring rules. When completed record forms were quality checked and approved, the data were coded and entered into the UNIT normative data pool. Data were checked multiple times through the key-entry process to ensure that the tests had been correctly administered and that all information had been correctly entered into the computer database. Demographic information from each protocol was entered into the database twice, once upon receipt from the examiner and again during final key entry. Information from the two entries was compared, and all discrepant cases were corrected. Retest demographic information was compared against that from the first testing as well as that in the database. Once keyed, the computer records were checked by a program for unusual or inconsistent response patterns and corrected if necessary. After all of the standardization protocols were entered, all of the data were again run through the checking programs to ensure that all values were within expected ranges.

Summary of Standardization

The UNIT standardization was designed so that the sample conformed to the contemporary U.S. population on important demographic variables. Participants were randomly selected within the demographically determined strata.

Derivation of UNIT Scores

Construction of the UNIT norms began with analyses of item, subtest, and scale properties. (Many of these analyses are described in Chapter 5, Technical Properties, and Chapter 6, Fairness.) On the basis of these analyses, 25 items from the standardization edition were deleted because of scaling issues, psychometric properties, or bias considerations. In addition to item deletion, a slight rearrangement of items within each subtest ensured that items were ordered by difficulty level, with easier items appearing before more difficult items. Empirically determined, and thus more accurate, starting and stopping points were also established on the basis of the standardization data analyses.

Subtest Scoring

Determination of Starting and Ceiling Rules

To keep administration time as short as possible, each subtest has two suggested starting points, one for 5- to 7-year-olds and one for examinees 8 years and older. Stopping rules were determined for each subtest on the basis of the standardization data.
Because items are arranged in ascending order of difficulty, it was possible to identify the point at which greater than 95% of the standardization sample were not likely to respond correctly to any additional items. That point (i.e., the number of failed items) was used as the final discontinue rule.

Raw Scores

After all the UNIT normative data were collected and analyzed, raw scores were determined for each subtest, for each examinee. Item responses on each of the UNIT Memory subtests are dichotomously scored, that is, as
34 Examiner s Manual passed or failed. Memory items for which responses were correct in all respects (e.g., color, object, placement, sequence, and number) were scored as passed and awarded 1-point credit; responses that were incorrect in any respect were awarded zero points. For example, on the Symbolic Memory subtest, the examinee might have selected the correct objects (e.g., man, woman, girl) and placed them in their correct sequence and number but might have failed to select the correct color of one or more of the objects (e.g., green girl rather than black girl). In this instance, the response would have been scored as incorrect because color, object, number, and sequence are all salient characteristics that must be recalled accurately before the response can be scored as correct. Of the three Reasoning subtests, only Analogic Reasoning responses were scored dichotomously. Cube Design items are three-dimensional, and the examinee received credit for each of the three faces of the design (i.e., top, left side, and right side) positioned correctly. If an examinee correctly matched the top and left faces of the response design with the stimulus model, for example, but failed to match the right exposed side correctly, the examinee was awarded 2-points credit 1 point each for the top and left faces positioned correctly. On the Mazes subtest, an examinee received 1 point for every correct decision made until he or she made an incorrect decision. Once the examinee made an incorrect decision, no additional correct decisions were credited. For example, Item 8 on the Mazes subtest has a total of six decision points and therefore a total possible raw score of 6. If an examinee made consecutive correct decisions at Junctures 1 4 but an incorrect decision at Juncture 5, then self-corrected and continued to complete the remainder of the maze correctly, including Juncture 6, the examinee was awarded 1 point for each of the first four consecutive correct decisions. 
No further credit was given for the examinee's self-correction at Juncture 5 or the correct decision at Juncture 6.

Bonus Points

On the Cube Design and Mazes subtests, bonus points were awarded for perfect completion of an item within specified time limits. Although these subtests are not primarily speeded tasks, additional credit was awarded to those examinees who demonstrated the foresight and mental efficiency to complete the task quickly and accurately.

Subtest Total Raw Scores

The total raw score on each of the subtests was determined as follows:

- All unadministered items (excluding sample items) preceding the first item passed by the examinee were credited as passed.
- All items (except sample items) to which the examinee responded correctly were scored as passed.
- All items beyond the discontinuation point were scored as failed.
- Item raw scores across all items within each subtest were added to derive the subtest raw score.

For dichotomously scored items (i.e., all items of the Memory subtests and Analogic Reasoning), each item that was answered correctly contributed 1 point to the subtest total raw score. Because items
on the Cube Design and Mazes subtests contribute differentially to the subtest total raw score, an item score was the sum of the points awarded for the response and any bonus points awarded for quick and perfect completion.

Scaled and Standard Score Development

Scaled Score Equivalents of Raw Scores

Subtest total raw scores were converted to percentile ranks for each full-year age group. For each age group, percentile ranks were then converted to z scores; these were converted to subtest scaled scores with a mean of 10 and a standard deviation of 3 (range 1 to 19, or ±3 SD). These normalized scaled scores were then smoothed within each age group and across all age groups. For the intermediate age levels within an age group, subtest scaled scores were interpolated from the smoothed norms. The result was a smooth distribution of scaled scores that reflects the intellectual growth pattern of children and adolescents as assessed by the UNIT subtests.

Standard Score Equivalents of Sums of Scaled Scores

For each of the UNIT scales (i.e., Memory, Reasoning, Symbolic, Nonsymbolic, and Full Scale), the scaled scores of the contributing subtests were added. For example, for the Memory Scale of the Extended Battery, the scaled scores on the Symbolic Memory, Spatial Memory, and Object Memory subtests were summed. The sums of the respective scaled scores were then distributed and converted, via percentile ranks, to deviation IQ scores with a mean of 100 and a standard deviation of 15, and the distributions were smoothed.

Confidence Intervals

Confidence intervals at the 90% and 95% levels for all five scales were based on the estimated true scores and the standard errors of estimation (SE_E). The confidence interval is centered on the estimated true score, which incorporates a correction for regression to the mean, and is calculated according to methods described by Dudek (1979) and Glutting, McDermott, and Stanley (1987).
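The conversion chain described in this section (raw score to percentile rank, percentile rank to z score, z score to a scaled score with mean 10 and SD 3, clipped to 1 through 19) can be sketched as follows. The midrank percentile computation and the rounding are illustrative assumptions, and the smoothing and interpolation steps the text describes are omitted:

```python
from statistics import NormalDist

def scaled_scores(raw_scores):
    """Convert one age group's raw scores to normalized scaled scores
    (mean 10, SD 3, range 1-19) via percentile ranks and z scores.

    Midrank percentiles are used so no percentile is exactly 0 or 1;
    the actual UNIT norming also smoothed distributions within and
    across age groups, which this sketch does not attempt.
    """
    n = len(raw_scores)
    nd = NormalDist()
    out = []
    for x in raw_scores:
        below = sum(1 for v in raw_scores if v < x)
        ties = sum(1 for v in raw_scores if v == x)
        pct = (below + 0.5 * ties) / n      # midrank percentile
        z = nd.inv_cdf(pct)                 # percentile rank -> z score
        out.append(max(1, min(19, round(10 + 3 * z))))  # clip to 1-19
    return out

# By construction, the median raw score of the group maps to a
# scaled score of 10.
```

The same percentile-and-smooth logic, applied to sums of scaled scores, yields the deviation IQ metric (mean 100, SD 15) described above.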
The standard error of estimation is derived according to methods described by Stanley (1971).

Test-Age Equivalents

Test-age equivalents were derived from the subtest norms tables and are reported in Table C.2 in Appendix C. For any given subtest raw score, the test-age equivalent is the age at which that raw score would receive a scaled score of 10. If there was more than one age at which a particular raw score received a scaled score of 10, the median age group was used as the test-age equivalent. A test-age equivalent may be viewed as the level of performance of the typical child of that given age.
Statistical Significance of Subtest and Scale Score Differences

The values required for significance when two subtests or scales are compared were computed with the following formula, which is based on the standard error of measurement of each score:

    Difference Score = z * sqrt(SEM_a^2 + SEM_b^2),

where z is the normal-curve value associated with the specified significance level and SEM_a and SEM_b are the standard errors of measurement of the two scores.

Intraindividual (Ipsative) Score Differences

The values required for significance when an individual's performance is compared with his or her average subtest performance within scales, or average performance across all subtests, were computed with Davis's (1959) formula.

Frequency of Differences

The frequency of a discrepancy between two scores refers to the percentage of examinees in the standardization sample whose scores differed by the specified magnitude. This frequency may be used as an estimate of the occurrence of the discrepancy in the general population.
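The difference-score formula above can be computed directly. The SEM values in the example are arbitrary placeholders, not UNIT statistics, and the two-tailed choice of z is an assumption for illustration:

```python
from statistics import NormalDist

def critical_difference(sem_a, sem_b, level=0.95):
    """Minimum difference between two scores needed for significance
    at the given level:  z * sqrt(SEM_a^2 + SEM_b^2).

    z is taken as the two-tailed normal-curve value for the level
    (e.g., about 1.96 for the .05 significance level).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * (sem_a ** 2 + sem_b ** 2) ** 0.5

# With SEMs of 3.0 and 4.0 points, sqrt(9 + 16) = 5, so two scores
# must differ by about 1.96 * 5 = 9.8 points to be significant at .05.
```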