PRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT


PRINCIPLES AND APPLICATIONS OF SPECIAL EDUCATION ASSESSMENT CLASS 3: DESCRIPTIVE STATISTICS & RELIABILITY AND VALIDITY FEBRUARY 2, 2015

OBJECTIVES Define basic terminology used in assessment, such as validity, reliability, standard deviation, etc. Understand how to evaluate the technical adequacy of tests including the norms, reliability, and validity. Interpret information from formal and informal assessments. Describe the function of standardized assessment in the eligibility process.

TONIGHT'S SCHEDULE
4:30–4:45 Group Presentations: Utah SPED Rules
4:45–5:15 Problem-Solving Teams: Case Studies
5:15–6:00 Descriptive Statistics
6:00–6:15 Break
6:15–7:00 Reliability & Validity
7:00–7:20 Graduate Students: Annotated Bibliography

REVIEW Children cannot be determined to have a disability because of what? Describe each of the following: RIOT/ICEL; RTI and its relationship to the medical model; LRE; Components of an IEP.

CRITERION OR NORM-REFERENCED? WISC-IV (Intelligence Test) History test Correct words on a spelling test Woodcock Johnson Achievement Test III Driving Test Number of steps correctly performed in a dressing routine.

WHY IS MEASUREMENT IMPORTANT? Standardized assessment is heavily applied in the educational decision-making process. Educators must understand Test-selection criteria Basic principles of measurement Administration techniques Scoring procedures

CONCERNS IN THE FIELD High priority placed on assessment Mistakes made by professionals: Identified students based upon referral information and not testing. Data presented played little role in planning. Choosing poor-quality instruments. Taking the recommendation at face value. Using quick assessments even if those assessments do not address the areas of concern. Failure to establish effective rapport with the examinee. Failure to document behaviors during the examination that may be of diagnostic value. Failure to adhere to the administration rules. Making scoring errors. Ineffectively interpreting assessment results for educational use.

NUMERICAL SCALES Nominal Scale Used for identification purposes only; the numbers function like a name (e.g., an ID number) Numbers cannot be used in mathematical operations Least useful scale

NUMERICAL SCALES Ordinal Scale Used to rank the order of items Numbers have the quality of identification and indicate greater or lesser quality (e.g., first place, second place, etc.) Numbers are not equidistant (i.e., the distance between first and second place and second and third place is not necessarily the same)

NUMERICAL SCALES Ratio Scale Used for direct comparisons and mathematical manipulations. Numbers are equidistant from each other and have a true absolute zero. Can be used in all mathematical operations (e.g., counts of behaviors, income, height, weight, etc.)

NUMERICAL SCALES Interval Scale Used for identification and to rank greater or lesser quality or amount. Numbers are equidistant (e.g., degrees on a thermometer, IQ scores, rating scales). Most data in education will be interval scale data. Does not have an absolute zero, so numbers cannot be used in other mathematical operations (e.g., multiplication).

RAW SCORES The score an individual receives when correct responses on a test are summed; equivalently, subtract the number of items the student missed from the number of items presented. Raw scores convey very little meaning unless referenced to some standard. All other scores (derived scores) are computed from the raw score.

DESCRIPTIVE STATISTICS Large sets of data are organized and understood through methods known as descriptive statistics. Derived scores obtain meaning from large sets of data or large samples of scores. Scores derived from the raw score include: Percentile rank Standard score Grade equivalent Age equivalent

MEASURES OF CENTRAL TENDENCY Measures of central tendency are ways to organize data to see how the data cluster, or are distributed, around a numerical representation of the average score. Use caution with this technique if your scores are widely scattered. A normal distribution represents the way test scores would fall if a test were given to every single student of the same age or grade in the population: most students' scores fall in the middle of the curve, the distribution is symmetric (equal on either side of the vertical line), and fewer students' scores fall at the edges of the curve.

AVERAGE PERFORMANCE It is important to know how students performed as a group and what constitutes excellent, average, and poor performance. Frequency Distribution: rank scores from highest to lowest and tally how many times each score was obtained. Mode: the score that occurs most often. Bimodal Distribution: a distribution with two modes. Multimodal Distribution: a distribution with three or more modes. Frequency Polygon: a graph that represents a data set.

MODE & FREQUENCY POLYGON [Slide shows a frequency polygon with the mode marked at the peak of the curve.]

MEDIAN Median Found by rank ordering the data set, writing each score the number of times it occurs. Count halfway down the list of scores; 50% of the data are listed above the median and 50% are below. In a data set with an even number of scores, the median score may not actually exist in the data set.

MEDIAN EXAMPLES
Example 1 (odd number of scores): 100, 97, 89, 85, 85, 78, 69, 69, 68, 62, 60 → Median = 78 (the middle score).
Example 2 (even number of scores): 100, 96, 95, 90, 85, 83, 82, 80, 78, 77 → Median = 84 (the average of 85 and 83, a score that does not actually appear in the data set).

MEAN One of the best measures of average performance is the mean, found by calculating the simple average. The mean can be affected by extreme scores, especially if the group is composed of only a few students. This can be controlled by eliminating extreme scores (i.e., outliers). Example: Data set: 90, 80, 75, 60, 70, 65, 80, 100, 80, 80. Sum = 780; 780 ÷ 10 = 78.
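The mean, median, and mode described above can all be computed with Python's standard statistics module; a minimal sketch using the data set from the mean example:

```python
import statistics

# Data set from the mean example on this slide
scores = [90, 80, 75, 60, 70, 65, 80, 100, 80, 80]

print(statistics.mean(scores))    # simple average: 780 / 10 = 78
print(statistics.median(scores))  # middle of the ordered list: (80 + 80) / 2 = 80.0
print(statistics.mode(scores))    # most frequently occurring score: 80
```

Note that because this set has an even number of scores, the median is the average of the two middle values, as the median slide explains.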

MEASURES OF DISPERSION Measures of dispersion are used to calculate how scores are spread from the mean. Variability is the way that scores in a set of data are spread apart. Range: provides an idea about the spread; calculated by subtracting the lowest score from the highest score. Example: Top score = 100; lowest score = 45; Range = 100 − 45 = 55.

VARIANCE Data are described as having variance. Variance can be described as the degree or amount of variability or dispersion in a set of scores. The dispersion of a set of scores around the mean Applicable for Equal Interval & Ratio, not Nominal or Ordinal

STANDARD DEVIATION Standard deviation describes the typical distance of scores above and below the mean. It is one method of calculating the difference in scores, or variability of scores, known as dispersion. You must calculate variance before you can calculate standard deviation: Standard Deviation = √variance. Any test score that is 1 standard deviation above or below the mean is considered significant. Applicable for Equal Interval & Ratio scales, not Nominal or Ordinal.
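The variance-then-square-root relationship can be checked in a few lines of Python, reusing the hypothetical data set from the mean example (pvariance treats the scores as the whole population):

```python
import math
import statistics

scores = [90, 80, 75, 60, 70, 65, 80, 100, 80, 80]

variance = statistics.pvariance(scores)  # mean squared deviation from the mean
sd = math.sqrt(variance)                 # standard deviation = square root of variance

print(variance)  # 121
print(sd)        # 11.0 (matches statistics.pstdev(scores))
```

So for this set, a score of 89 or higher (or 67 or lower) falls more than one standard deviation from the mean of 78.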

STANDARD DEVIATION & NORMAL DISTRIBUTION In a normal distribution, the standard deviations correspond to fixed percentages of scores on the bell curve. Approximately 68% of the scores fall within one standard deviation above or below the mean. Two standard deviations below the mean = Intellectual Disability range. Two standard deviations above the mean = Gifted range.

SKEWED DISTRIBUTIONS When small samples or very restricted populations are used, test results may not distribute into a normal curve. Extreme scores can change the appearance of a set of scores and subsequently influence the way the data are described. Distributions can be skewed in a positive or negative direction. Negatively Skewed: Large number of scores occur above the mean. Positively Skewed: Large number of scores occur below the mean.

TYPES OF SCORES Percentile Rank Ranks each score on the continuum of the normal distribution. Percentile ranks range from <1 to 99.9, with 50 being the average. A person who scores at the 75th percentile scored as well as or better than 75% of the students in that age/grade group.

Percentile Rank For example: Jalen obtained a percentile rank of 42. This means that Jalen performed as well as or better than 42% of children his age on the test. Or, 42% of children Jalen's age scored at or below Jalen's score.

Descriptors for Percentile Ranges
98th %ile and above: Upper Extreme
91st to 97th %ile: Well Above Average
75th to 90th %ile: Above Average
25th to 74th %ile: Average
9th to 24th %ile: Below Average
3rd to 8th %ile: Well Below Average
2nd %ile and below: Lower Extreme

TYPES OF SCORES T scores: have a mean of 50 and a standard deviation of 10. Stanines: scores are divided into 9 groups, with 5 being the mean and 2 the standard deviation. Deciles: scores are divided into 10 groups, labeled from 10 for the lowest group through 100 for the highest; each group represents 10% of the obtained scores.

STANDARD SCORES Standard scores are scores of relative standing with a set, fixed, predetermined mean and standard deviation
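Because standard scores sit on a fixed scale, they convert mechanically to z-scores, T scores, and percentile ranks. A sketch, assuming the common mean-100 / SD-15 scale used by many cognitive tests:

```python
from statistics import NormalDist

# Assumed scale: standard scores with mean 100 and SD 15
dist = NormalDist(mu=100, sigma=15)

score = 85                          # one SD below the mean
z = (score - 100) / 15              # z-score: -1.0
t = 50 + 10 * z                     # T score (mean 50, SD 10): 40.0
percentile = dist.cdf(score) * 100  # percentile rank: about 15.9

print(z, t, round(percentile, 1))
```

This is why the bell curve lets scores "be compared from one subtest to the next": every score on a fixed scale maps to the same percentile rank regardless of which test produced it.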

CHOICE OF TEST SCORES Percentile Ranks Preferable over age and grade equivalents Are considered comparable scores Straightforward indicators of an individual s standing within a group Reported as a reference to the student s standing to the group upon which the test was normed

CHOICE OF TEST SCORES (CONTINUED) Standard Scores Advantages Comparative Based upon a normal or normalized distribution of scores (bell curve) Can be directly translated into percentile ranks Because of a uniform mean (bell curve), they can be compared from one subtest to the next and one test administration to another.

CHOICE OF TEST SCORES (CONTINUED) Age and Grade Equivalents: appear to be the simplest scores but in fact can be the most misinterpreted. Major limitations: they do not provide information about whether a student's performance is within average limits; they do not describe a student's current instructional level; they do not indicate which test questions the student answered correctly. A word of caution: findings should be reported and worded carefully to prevent misinterpretation.

GRADE EQUIVALENTS ARE OBTAINED FROM MEAN OR MEDIAN SCORES BY GRADE.

MISLEADING: FALSE IMPRESSION OF PROGRESS
Grade Placement | Grade Equivalent | Years Below | Percentile Rank
2 | 1.9 | 0.1 | 25th
3 | 2.4 | 0.6 | 25th
4 | 3.1 | 0.9 | 25th
5 | 3.9 | 1.1 | 25th
6 | 4.5 | 1.5 | 25th
7 | 5.3 | 1.7 | 25th

AGE AND GRADE EQUIVALENTS What are the scores based on? Why is this a problem? [Slide shows a worksheet of 12 arithmetic problems.] Chelyn gets only the even-numbered questions correct: Raw Score = 6. Lou gets only the odd-numbered questions correct: Raw Score = 6. DO THESE STUDENTS HAVE THE SAME SKILLS?

AGE & GRADE EQUIVALENTS 2 years below grade has different meanings at different grades Kurt is at the 12.5 grade level and obtained a grade equivalent of 10.5 on the Reading Recognition Subtest of the PIAT. Mason is at the 3.5 grade level and obtained a grade equivalent of 1.5 on the same test. Is their performance the same? Who performed better? Kurt obtained a standard score of 93, 33rd percentile Mason obtained a standard score of 72, 3rd percentile

GRADE EQUIVALENTS MEAN DIFFERENT THINGS ON DIFFERENT TESTS Billy, grade placement 7.5, obtained a grade equivalent of 5.5 on the WRMT. Bobby, grade placement 7.5, obtained a grade equivalent of 5.5 on the Reading Subtest of the WRAT. Is their performance the same? Who performed better? Billy performed at the 18th percentile Bobby performed at the 34th percentile At the same point on the scale and the same age level, identical grade equivalents mean different things on different tests.

RELIABILITY & VALIDITY

RELIABILITY & VALIDITY Aid in determining test accuracy and dependability. Reliability: the dependability or consistency of an instrument across time or items. Validity: the degree to which an instrument measures what it was designed to measure. A strong instrument has both properties; an instrument with only one (e.g., reliable but not valid) is not a strong instrument.

Correlation (r) Correlation the degree of relationship between two variables. Two administrations of the same test Administration of equivalent forms Correlation coefficient ranges: +1.00 to -1.00 Perfect positive correlation = +1.00 Perfect negative correlation = -1.00 No correlation = 0 Numbers closer to +1.00 represent stronger relationships The greater degree of the relationship, the more reliable the instrument. The + does not indicate strength, but direction.
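The correlation coefficient can be computed directly from its definition; a self-contained sketch with hypothetical scores from two administrations of the same test:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores from two administrations of the same test
first = [80, 85, 90, 70, 95]
second = [82, 84, 91, 72, 96]
print(round(pearson_r(first, second), 3))  # close to +1.00: a highly consistent instrument
```

Reversing one list would flip the sign of r without changing its magnitude, illustrating that the sign indicates direction, not strength.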

SCATTERGRAM Scattergrams provide a graphic representation of a data set and show a correlation. The more closely the dots on a scattergram approximate a straight line, the nearer to perfect the correlation.

TYPES OF CORRELATION Positive Correlation: variables move in the same direction; scores on both variables increase together. Negative Correlation: high scores on one variable are associated with low scores on the other. No Correlation: data from the two variables are not associated, have no relationship, and show no linear direction on a scattergram.

RELATIONSHIP BETWEEN RELIABILITY & VALIDITY Suppose I have a faulty measuring tape and I use it to measure each student's height. My tool is invalid, but it's still reliable: it gives the same (wrong) reading every time. On the other hand, if I have a correctly printed measuring tape, my tool is both valid and reliable.

RELIABILITY Another way to think of reliability is to imagine a kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for the potatoes an hour later.

VALIDITY Let's imagine a bathroom scale that consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh 150 pounds (perhaps you re-set the scale in a weak moment).

RELIABILITY CHECKS Test-Retest (Stability) Equivalent Forms Inter-Rater (Agreement)

TEST-RETEST RELIABILITY Test-retest reliability assumes the trait being measured is stable over time. If the trait remains constant, re-administration of the instrument will result in scores similar to the first score. It is important to conduct the retest shortly after the first test to control for influencing variables. Difficulties: Too soon: students may remember test items (practice effect) and score higher the second time. Too long an interval: greater influence of time-related variables (e.g., learning, maturation, etc.).

EQUIVALENT (ALTERNATE) FORMS RELIABILITY Equivalent forms reliability Two forms of the same instrument are used. Items are matched for difficulty. Advantage: Two tests of the same difficulty level that can be administered within a short time frame without the influence of practice effects.

INTERRATER RELIABILITY Interrater reliability The consistency of a test across examiners. One person administers a test, a second person rescores the test. The scores are then correlated to determine how much variability exists between the scores.

ASSUMPTIONS OF TESTING 1. People involved are skilled 2. Error is always present 3. Acculturation is comparable 4. Behavior sample is adequate 5. Present behavior is observed

1. PEOPLE ARE SKILLED in administering the test (including establishing rapport), in scoring the test, in interpreting the results, and in utilizing the results.

2. ERROR IS ALWAYS PRESENT Obtained Score = True Score + Error. Random error is unreliability (e.g., lack of familiarity with tests, examiner fatigue, etc.). Do not make decisions based on error.
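The Obtained = True + Error relationship can be simulated: give the same hypothetical examinees two administrations that differ only in random error, and the correlation between administrations approximates the test's reliability. A sketch with assumed (made-up) parameters:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)  # reproducible illustration

# Assumed: 1,000 examinees, true scores ~ N(100, 15), random error ~ N(0, 5)
true_scores = [random.gauss(100, 15) for _ in range(1000)]
first = [t + random.gauss(0, 5) for t in true_scores]   # obtained = true + error
second = [t + random.gauss(0, 5) for t in true_scores]  # new, independent error

# Reliability is roughly var(true) / (var(true) + var(error)) = 225 / 250 = 0.90
print(round(pearson_r(first, second), 2))
```

Increasing the error SD (more examiner fatigue, less test familiarity) drives the correlation, and thus the reliability, down.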

3. ACCULTURATION IS COMPARABLE The comparison (norm) group has a comparable Experiential Background (e.g., a test item asking how to get out of a forest may be unfair to an inner-city child) and a comparable Opportunity to Learn (e.g., books available in the child's home).

4. BEHAVIOR SAMPLE IS ADEQUATE All tests are only samples of behavior: the samples of behaviors on the test stand in for the whole domain of behaviors of interest.

5. PRESENT BEHAVIOR IS OBSERVED Future behavior is inferred. Tests can only inform us directly about present behavior.

TEST VALIDITY Does the test actually measure what it is supposed to measure? Criterion-related validity: comparing scores with other criteria known to be indicators of the same trait or skill. Concurrent Validity: two tests are given within a very short timeframe (often the same day); if scores are similar, the tests are said to be measuring the same trait. Predictive Validity: measures how well an instrument can predict performance on some other variable (e.g., ACT or GRE scores predicting later academic performance).

CONTENT VALIDITY Ensuring that the items in a test are representative of the content purported to be measured. PROBLEM: Teachers often generalize and assume the test covers more than it does (e.g., the WRAT-3 reading subtest measures only word recognition, not phonemic awareness, phonics, vocabulary, reading comprehension, etc.). Some variables of content validity may influence the manner in which results are obtained and can contribute to bias in testing. Presentation Format: the method by which items are presented to the student. Response Mode: the method by which the examinee answers items.

VALIDITY OF TEST VS. VALIDITY OF USE Tests may be used inappropriately even though they are valid instruments, and results obtained may be used in an invalid manner. Tests may be biased and/or discriminate against different groups. Item bias: an item is answered incorrectly a disproportionate number of times by one group compared to another. Predictive validity may predict accurately for one group and not another.

NEXT WEEK Read Chapter 5 Submit Online Self-Assessment

SOURCES Overton, T. (2012). Assessing learners with special needs (7th ed.). Upper Saddle River, NJ: Pearson Education Inc.