OAG Sampling Guide for Performance Audits


OAG Sampling Guide for Performance Audits
February 2006
Paul Pilon

This document is also available in French. Contents may not be reproduced for commercial purposes, but any other reproduction, with acknowledgments, is encouraged.

Minister of Public Works and Government Services Canada, 2006

Table of Contents

1 Audit Objectives
    Typical Audit Objectives
    File Reviews and Surveys
    Typical Audit Objectives for Representative Sampling
    Detailed Description of Processes

2 Representative Sampling: Preliminary Assessment
    Representative Sampling: Types of Variables
    Defining Populations and/or Sub-Populations
    Establishing the Scope of the Research
    Assessing Homogeneity
    Types of Distributions of Numeric Variables
    Batching
    Reducing Sample Size
    Level of Detail When Reporting
    Calculating Sample Size

3 Representative Sampling Methods
    Sampling Large Populations
    Simple Random Sampling
    Define Strata for Stratified Random Sampling
    Cumulative Square Root Method: Defining Strata for a Numeric Variable
    Is Each Stratum Equally Important?
    Proportional Stratified Random Sampling
    Non-Proportional Stratified Random Sampling

4 Reporting Results from Representative Samples
    Calculating Estimates and Precision of Estimates
    Estimates
    Sources of Non-Sampling Error

5 Conducting a Census
    Surveys of Small Populations
    Parameters

6 Purposeful Sampling
    Purposeful Sampling
    Purposeful sampling strategies
    Sample Size for Purposeful Sampling
    Extracting a Sample
    Reporting Information from a Purposeful Sample


Chapter 1 Audit Objectives

1 Audit Objectives

[Flow chart: 1.1 Typical Audit Objectives; 1.2 File Reviews and Surveys; 1.3 Typical Audit Objectives for Representative Sampling; 1.4 Detailed Description of Processes; with links onward to 2.0 Representative Sampling: Preliminary Assessment and 6.0 Purposeful Sampling.]

1.1 Typical Audit Objectives

Audit objectives should always be made explicit before any sampling plan is developed, because they may affect the choice of sampling method. As the sampling plan is developed, compromises may be required to accommodate the available time and resources, and the audit objectives may then need to be adjusted to reflect the specific sampling method. Audit objectives need to be stated in very concrete terms.

1.2 File Reviews and Surveys

Surveys

A survey refers to any systematic review of members of a population. The members of a population are called sampling units. The typical survey has individual people as sampling units and uses either a questionnaire or an interview to collect information. However, the sampling units and the tools for collecting information can vary. For example, an audit might involve a survey of Aboriginal communities, and the data collection might be a combination of interviews, demographic variables, and measurements of the physical environment.

Whenever you refer to a survey, you should always mention the sampling unit. For example:

- Survey of Customs Inspectors
- Survey of Entities audited in
- Survey of Aboriginal Communities in British Columbia

File reviews

File reviews are one specific type of survey and require special attention because of their unique nature. The sampling units are often the individual files, and data is often collected with a file review checklist. Unlike other surveys, many populations of files have an associated dollar amount that must be taken into consideration.

As with other surveys, it is important to be explicit about the sampling unit. For example:

- Review of Grants managed by Health Canada in
- Review of Minister's Permits issued from to
- Review of National Parole Board files prepared during

1.3 Typical Audit Objectives for Representative Sampling

When using representative sampling, audit objectives can be very specific and quantitative. The audit objective establishes a goal of determining what percentage or ratio of units falls into a specific category. For example, how many files comply with a specific standard, or what percentage of people agree with a statement?

The following are examples of audit objectives that make use of representative sampling.

- To accurately estimate the ratio of grants whose management complies with the Terms and Conditions established by the Treasury Board.
- To accurately estimate the percentage of CBSA inspectors who agree that they receive insufficient professional training.
- To accurately estimate the ratio of Minister's Permits involving individuals with criminal records that comply with documentation standards.
- To accurately estimate the ratio of sponsorship files that comply with the Terms and Conditions established by the Treasury Board.

1.4 Detailed Description of Processes

Typical audit objectives for purposeful sampling

The objectives of audits that make use of purposeful sampling are broader and more qualitative in nature. Purposeful samples should be used to illustrate a problem in great detail. They should not be used to estimate the prevalence of a problem.

The following are examples of audit objectives that make use of purposeful samples.

- To illustrate the potential for, and the resulting impact of, not documenting deliverables of sponsorship contracts.
- To illustrate the impact of DND purchasing guidelines for high-cost equipment.
- To document the favourable treatment of some recipients of Health Canada Grants and Contributions, and to illustrate the managerial style that limits the capacity of file managers to enforce guidelines.


Chapter 2 Representative Sampling: Preliminary Assessment

2 Representative Sampling: Preliminary Assessment

[Flow chart. Boxes: 1.0 Audit Objectives; 2.1 Representative Sampling: Types of Variables; 2.2 Defining Populations and/or Sub-Populations; 2.3 Establishing the Scope of the Research; 2.4 Assessing Homogeneity (NO: 2.6 Batch / Identify Outliers; YES: 2.8 Level of Detail When Reporting); 2.9 Calculating Sample Size; 2.7 Sample is too large (YES: 2.7 Reduce Level of Detail, Reduce Scope, or Reduce # of Sub-Populations; NO: 3.0 Representative Sampling Methods).]

2.1 Representative Sampling: Types of Variables

The purpose of representative sampling is to provide quantitative information that can be used to infer something about the population. Descriptive statistics of variables are the typical quantitative information that we produce using representative sampling. The type of descriptive statistic depends on the type of variable.

Types of variables typically measured with representative sampling

There are three basic types of variables:

- Character-Nominal
- Character-Ordinal
- Numeric

Character-nominal variables

Character-Nominal variables are used to categorize members of a population or sample into distinct groups. Character-Nominal variables with only two values have a special classification: they are referred to as Binomial Variables. The following are examples of Character-Nominal variables.

Character-Nominal Variable #1: Certified to use X-ray equipment in a customs office
    Value 1: Yes
    Value 2: No

Character-Nominal Variable #2: Geographic region within Canada you work in
    Value 1: Pacific
    Value 2: Prairies
    Value 3: Southern Ontario
    Value 4: Northern Ontario
    Value 5: Quebec
    Value 6: Atlantic

Character-Nominal Variable #3: Compliance with Terms and Conditions
    Value 1: Does Comply with Terms and Conditions
    Value 2: Does Not Comply with Terms and Conditions

Character-ordinal variables

Character-Ordinal variables also categorize members of a population or sample, and in addition place them in a specific order. The following are examples of Character-Ordinal variables.

Character-Ordinal Variable #1: Size of Automotive Vehicles
    Value 1: Compact
    Value 2: Small
    Value 3: Medium
    Value 4: Large

Character-Ordinal Variable #2: Size of Sweaters
    Value 1: Small
    Value 2: Medium
    Value 3: Large
    Value 4: Extra-large

Character-Ordinal Variable #3: Level of Agreement with a Statement
    Value 1: Strongly Disagree
    Value 2: Disagree
    Value 3: Neutral
    Value 4: Agree
    Value 5: Strongly Agree

The value of a character-ordinal variable only indicates a member's position relative to other members of the population or sample. We know that a Medium sized car is larger than a Compact, but we have no measure of how much larger. We know that an Extra-large sweater is larger than a Medium, but we don't know if it is 1.5 times larger or 10 times larger. We know that a person who Strongly Agrees with a statement agrees with it more than someone who claims to be Neutral, but we don't know how much more he or she agrees with the statement.

Numeric variables

Numeric variables are used to place members of a population or sample along a continuum that ranges from a small value to a large value. The range of values of a numeric variable can be infinitely large, and the value given to any member of a population or sample also reflects its degree of difference from other members. The following are examples of Numeric variables.

Numeric variable #1: Amount of overpayment of EI claim
Values can range from $0.00 to positive infinity dollars. An overpayment of $200 is twice as large as an overpayment of $100. An overpayment of $600 is three times as large as an overpayment of $200.

Numeric variable #2: Years of experience
Values can range from 0 years to approximately years. An employee with 15 years of experience has ten more years of experience than an employee with only five years of experience. An employee with 20 years of experience has four times as many years of experience as a person with five years of experience.

2.2 Defining Populations and/or Sub-Populations

Basic definitions

What is a population?
A population is made up of all the members that are of interest to your study.

What is a sample?
A sample is a subset of the population that is selected such that the distribution of the sample is similar to the distribution of the population.

What is a sampling unit?
Sampling units are the individual elements that make up the population. A sampling unit can be a file, a person, or an administrative unit such as a company or an organization.

What is a sample frame?
The sample frame is the list or process used to identify all the sampling units.

What is over-coverage and under-coverage?
Coverage refers to how well the sampling frame includes all members of the population and excludes all non-members of the population. Specifically, under-coverage occurs when the sampling frame does not include all members of the population. For example, using a municipal telephone directory as a sampling frame of households excludes all households that have unlisted phone numbers or no land-line phone. Likewise, over-coverage occurs when the sampling frame includes non-members of the population.

Over-coverage can be dealt with relatively easily by using a screening tool to identify non-members of the target population and then excluding them from the survey. Unfortunately, under-coverage is more difficult to deal with and poses a more serious risk of bias than over-coverage.

What is an operational definition of sampling units?
An operational definition is a clear and unambiguous definition of who is to be included in the sample frame and who is to be excluded. The operational definition needs to be unambiguous enough that if the study were repeated by another person, he or she would be sampling from the same population.

What is a sub-population?
Sub-populations are subsets of the original population. Each sub-population is distinct from the other sub-populations, and each sub-population is homogeneous.

What is homogeneity?
The term homogeneity is used to describe a population made up of sampling units that are similar enough to each other to be considered members of the same group. See section 2.4, Assessing Homogeneity.

Defining the population (surveys)

Who are the people you are sampling from? An operational definition of the population needs to be very precise and concrete. Start by identifying your sampling unit. Is it a person, or is it an organization?

Individuals as sampling units

Sometimes sampling units are individuals. For example, for a survey of CRA employees, the operational definition of the population might be all unionized employees of the CRA who held the position of Inspector during the fiscal year and who have held the position for at least two consecutive years. This definition explicitly determines who is and who is not considered a member of the population for this particular survey.

The sample frame used to identify the population would be a human resources database that identifies all employees, whether or not they were inspectors during the fiscal year, and how long they held the position.

Organizations as sampling units

Sometimes surveys are designed to gain information from many organizations or departments. In this case, the sampling unit is the organization or department, and the operational definition of the sampling unit needs to clearly identify both the departments to be included in the survey and the individual to be surveyed within each.

The Office's post-audit surveys are an example of a survey that has an organization, the entity, as a sampling unit. Hence, we must clearly define which entities are to be included in the post-audit survey. We must also clearly stipulate who within the entity is in the best position to respond to the survey. For example, the operational definition for the population of the post-audit survey could be: all entities that have been the focus of a single-entity audit that resulted in a chapter published during the fiscal year. The person selected to complete the survey should hold a position not lower than director within the organization and should have been exposed to all aspects of the audit. If more than one person meets these criteria, then each person will be asked to complete the survey, and the responses will be aggregated.

Defining sub-populations (surveys)

It is important to define sub-populations when the population is not homogeneous, because one sample is not capable of representing a heterogeneous population. Hence, heterogeneous populations must be subdivided into homogeneous sub-populations, and each sub-population needs to be sampled independently. See section 2.4, Assessing Homogeneity.

An example of a heterogeneous population is the employees of a hospital. While they all work for the same organization, they are made up of several very different groups of employees:

- Medical service providers
- Administration
- Technicians
- Support staff

When defining a population, every effort should be made to explore the possibility that distinct sub-populations exist. Otherwise, the aggregated results will not be representative.

Defining the population (file review)

Exactly which files do you want to audit? An operational definition of the population needs to be very precise and concrete. Start by identifying your sampling unit.
Which files make up the population, and is the sampling unit an individual file, or does each sampling unit consist of several files?

Individual files as sampling units

Sometimes sampling units are individual files. For example, for a review of Health Canada grants, each grant is represented by one file. The operational definition of the population might be all Health Canada grants in existence during the fiscal year, and which commenced no later than the beginning of the previous fiscal year. This definition explicitly determines which files are and are not considered members of the population for this particular file review.

The sample frame used to identify the population would be a management database of files for the fiscal year, which identifies the commencement date of each grant.

Multiple files as sampling units

Sometimes each file represents only a portion of the sampling unit. For example, if the audit objective is to determine whether advertising contracts are being properly managed, each advertising contract may consist of several advertising project files. As a result, care needs to be taken to ensure that the proper sampling unit is used to meet the needs of the audit objective. Do not assume that the organization of files as given to you by the entity is consistent with the sampling units you need to use.

Defining sub-populations (file review)

It is important to define sub-populations when the population is not homogeneous, because one sample is not capable of representing a heterogeneous population. Hence, heterogeneous populations must be subdivided into homogeneous sub-populations, and each sub-population needs to be sampled independently. See section 2.4, Assessing Homogeneity.

A population of files can be heterogeneous with respect to either a character variable or a numeric variable.

Defining sub-populations using a numeric variable

Some files have an associated dollar value. Dollars are a numeric variable. Very often, the distribution of dollar values is not normal but skewed. As a result, the population is not homogeneous and has to be divided into sub-populations. One sub-population might be all high-dollar amounts, and another sub-population might consist of all remaining files.

Defining a sub-population using a character variable

Some files have no associated dollar value but can still be divided into different groups using a character variable.
For example, files prepared by Correctional Service Canada for the National Parole Board are used to determine whether an inmate should be granted parole. These files can easily be divided according to the type of crime committed, a character variable. It would be reasonable to assume that files prepared for inmates convicted of a violent crime would be dealt with differently than other files. Hence, it would be reasonable to define at least two sub-populations of files: those associated with violent crime, and all other files. These sub-populations of files would be sampled independently.

2.3 Establishing the Scope of the Research

The iterative process

As the scope of a research project expands, so do the time and resources needed to complete the project successfully. As a result, the scope of a research project has to be balanced against the time and resources available. Establishing the scope of the research is an iterative process that goes through a series of steps before finally settling on a scope that meets the needs of the audit and can be completed with the given resources.

To ensure that sufficient time and resources are available for research projects, it is important to establish the scope during the survey phase of your audit. If electronic files exist for the population of interest, it is a good idea to request them during the survey phase and review them in order to analyze the distribution and develop a realistic sampling plan.

Elements of scope

Complexity and number of audit objectives

As audit objectives increase in number and complexity, so does the cost of the audit. In a file review, more numerous and more complex objectives translate into a longer examination time for each file. In a traditional survey, they result in lengthy and complex questionnaires, and as a questionnaire becomes longer and more difficult to complete, the response rate decreases substantially. A low response rate has a serious impact on the validity of the results, and more time and resources have to be directed towards encouraging participants to respond. To manage the cost of the research, the number of audit objectives may have to be reduced, or some of the objectives may have to be simplified.

Number of sub-populations

In order to reduce heterogeneity, a large population may be divided into several sub-populations.
The total sample size needed to report reliable estimates is roughly proportional to the number of sub-populations. As a result, the number of sub-populations has a direct impact on the cost of the audit. If there are many sub-populations, some may have to be excluded from the scope of the audit in order to reduce the cost of the research. For example, when doing a review of files, only high-dollar amounts might be examined, and low-dollar amounts might be excluded from the scope of the research.

2.4 Assessing Homogeneity

Definition of homogeneity

Homogeneity refers to the degree of similarity among the members (sampling units) of a population. Although we expect members of a population to differ from each other, they must nonetheless be similar enough to be considered members of one single population. Homogeneity can be assessed with respect to either a character (nominal or ordinal) variable or a numeric variable.

Homogeneity and numeric variables

Numeric variables, as the term implies, are numbers that range from small to large values. The shape of the distribution of the numeric variable is used to assess its level of homogeneity. We can see the shape of the distribution by creating a histogram of the variable.

Histograms (numeric variables only)

Histograms help us see the distribution of a numeric variable within a population. The software package IDEA is an excellent tool for creating histograms.

Normal distributions

Normal distributions are unimodal, symmetrical, and have a bell shape. The mode is the value in a distribution that occurs most frequently, so it is the highest point of the distribution. Normal distributions have only one mode, or only one hill in the histogram. Normal distributions are also symmetrical: if the histogram were divided in half at the mode, the left-hand side would look very similar to the right-hand side.
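The idea of a histogram and its mode can be sketched in a few lines of Python. This is only an illustration, not the IDEA procedure the guide refers to: the dollar values, the bin width of 100, and the function names `histogram` and `modal_bin` are all assumptions made for the example.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Returns a dict that maps the lower edge of each bin to its count,
    ordered from the lowest bin to the highest.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def modal_bin(counts):
    """Return the lower edge of the bin with the highest count (the mode)."""
    return max(counts, key=counts.get)


# Hypothetical file values in dollars, for illustration only.
values = [120, 180, 210, 240, 260, 270, 290, 310, 330, 360, 410, 480]
counts = histogram(values, bin_width=100)

# Print a simple text histogram, one row of '#' marks per bin.
for lower_edge, n in counts.items():
    print(f"{lower_edge:>4} to {lower_edge + 99:<4} {'#' * n}")
```

With these made-up values the histogram has a single hill, in the 200 to 299 bin, which is what a unimodal distribution looks like.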

A normal distribution indicates that the population is homogeneous relative to the numeric variable being examined. Populations with a normal distribution typically do not have to be batched into sub-populations.

Bimodal distributions

Bimodal distributions have two modes. The histogram of a bimodal distribution will have two hills. If the distribution has more than two modes, it is called multimodal. Bimodal distributions are heterogeneous. If the variable represents an important characteristic of the population, then the population should be batched to create two or more unimodal distributions.

Skewed distributions

Skewed distributions are non-symmetrical. Positively skewed distributions have a limited number of members with extremely high values. This results in a tail in the histogram that extends outward to the right. Negatively skewed distributions have a limited number of members with extremely low values. This results in a tail in the histogram that extends outward to the left. The members in the tail of the histogram are considered outliers from the main body of the distribution.

Skewed distributions are heterogeneous. If the variable represents an important characteristic of the population, then the population should be batched in order to remove the outliers from the population.

Homogeneity and character variables (nominal or ordinal)

Character variables are used to categorize members of a population into distinct groups or categories. When defining the population to be sampled, it is important not to combine diverse groups. The potential for heterogeneity regarding a character variable can be assessed using bar charts.

After identifying the groups that exist within the population, the researcher must decide which groups, if any, can be combined into single populations.

In the case of a survey, if group membership is expected to dramatically affect how members respond to survey questions, then the researcher should NOT combine the groups into a single population. In the case of a file review, if group membership is expected to dramatically affect the results of the file review, then the researcher should NOT combine the groups into a single sample.

Bar chart example: Types of deaths experienced by rock stars

[Bar chart not reproduced.]

This bar chart displays the types of death experienced by rock stars. If a researcher is assessing the lifestyle of rock stars, he or she may want to differentiate between types of death.

2.5 Types of Distributions of Numeric Variables

Normal distribution
- Unimodal
- Symmetrical
- Bell shaped
[Histogram not reproduced.]

Bimodal distribution
- More than one mode
[Histogram not reproduced.]

Positively skewed distribution
- Non-symmetrical
- Limited number of extremely high values
[Histogram not reproduced.]

Negatively skewed distribution
- Non-symmetrical
- Limited number of extremely low values
[Histogram not reproduced.]

2.6 Batching

Batching is the process of dividing a heterogeneous population into two or more homogeneous sub-populations. As you create more batches or sub-populations, each sub-population becomes more homogeneous. However, as the number of sub-populations increases, so does the number of samples needed, and the total sample size grows roughly in proportion to the number of batches.

Batching a population using a character variable

Character variables place members of a population into discrete categories. As a result, it is easy to batch using a character variable.

Example from a file review: Advertising contracts separated into media costs and production costs

Advertising contracts can be divided into two main categories: contracts in which the majority of spending is directed toward purchasing media time, and contracts in which most of the spending is directed toward production costs. Depending on the focus of the audit, an auditor might decide to sample independently from each of these two sub-populations.

Example from a survey: CRA employees separated into unionized and management

Surveys of employees often ask questions regarding the work environment. Union or management status can have a great impact on the way these questions are answered. As a result, unionized and management employees should be surveyed separately.

Batching a population using a numeric variable

Numeric variables (a.k.a. cardinal variables) range from small to large values. As a result, it can be difficult to find clear and definitive values to use as cut-off points to create two or more sub-populations. If the distribution of the numeric variable is bimodal, then the population can be divided at the value with the lowest frequency between the two modes. If the distribution is skewed, then outliers can be identified using Tukey's Outlier Filter (Tukey, J.W. Exploratory Data Analysis, Addison-Wesley).
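The bimodal rule above (divide at the lowest frequency between the two modes) can be sketched as follows. This is a minimal illustration under stated assumptions: the histogram has already been computed, a "peak" is taken to be any bin higher than its left neighbour and at least as high as its right neighbour, and the function name `bimodal_cut` and the counts are hypothetical.

```python
def bimodal_cut(edges, counts):
    """Return the lower edge of the emptiest bin between the two tallest peaks.

    edges[i] is the lower edge of bin i and counts[i] is its frequency.
    Returns None when the histogram has fewer than two peaks.
    """
    last = len(counts) - 1
    # A peak is higher than its left neighbour and at least as high
    # as its right neighbour (an assumption made for this sketch).
    peaks = [
        i for i in range(len(counts))
        if (i == 0 or counts[i] > counts[i - 1])
        and (i == last or counts[i] >= counts[i + 1])
    ]
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks, then restore left-to-right order.
    tallest_two = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    left, right = sorted(tallest_two)
    # The valley is the lowest-frequency bin strictly between the peaks.
    valley = min(range(left + 1, right), key=lambda i: counts[i])
    return edges[valley]


# Hypothetical histogram of file values: bins of width 100 starting at 0.
edges = [0, 100, 200, 300, 400, 500, 600]
counts = [2, 6, 3, 1, 4, 7, 2]
cut = bimodal_cut(edges, counts)
```

Here the two hills sit in the 100 and 500 bins, and the emptiest bin between them starts at 300, so files below and above that bin would form the two sub-populations. Real data usually needs a visual check of the histogram as well, since small random bumps can look like peaks.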

A skewed population can be batched into at least two sub-populations by separating the outliers from the rest of the population. If necessary, the remainder of the population can be batched further. Depending on the number of outliers that exist, the sub-population of outliers can either be sampled from, or a census of all the outliers can be performed. Typically, outliers represent high-value items and hence can also represent high-risk items.

Tukey's Outlier Filter: Used to identify outliers

Tukey's Outlier Filter is an empirical method of separating outliers from normally distributed values in a population or sample. Depending on the population, non-empirical criteria may also exist for identifying outliers. In some cases, you might find expert advice that specific values represent outlier cut-offs. In the absence of any solid non-empirical criteria, one can usually rely on Tukey's Outlier Filter to identify outliers.

The following equation is used to identify a cut-off point for outliers in a positively skewed population:

    High-value cut-off = 3rd Quartile + 1.5 * (3rd Quartile - 1st Quartile)

The following equation is used to identify a cut-off point for outliers in a negatively skewed population:

    Low-value cut-off = 1st Quartile - 1.5 * (3rd Quartile - 1st Quartile)

The 1st and 3rd Quartiles are the 25th and 75th percentiles, respectively, of a distribution of numbers.

- The 25th percentile is the value in a distribution for which 25 percent of the population is lower and 75 percent is higher than that value.
- The 50th percentile is the value in a distribution for which 50 percent of the population is lower and 50 percent is higher than that value. The 50th percentile is also known as the median value.
- The 75th percentile is the value in a distribution for which 75 percent of the population is lower and 25 percent is higher than that value.
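As a rough cross-check on the quartile-based cut-offs described above, they can be computed with Python's standard library. This sketch assumes the conventional Tukey multiplier of 1.5 and uses the "inclusive" quartile method, which interpolates between the minimum and maximum of the data; the function name `tukey_cutoffs` and the dollar values are hypothetical.

```python
import statistics


def tukey_cutoffs(values, k=1.5):
    """Return (low_cutoff, high_cutoff) using Tukey's outlier rule.

    k is the multiplier applied to the interquartile range;
    1.5 is the conventional choice and is an assumption here.
    """
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: 3rd Quartile - 1st Quartile
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical file values in dollars, with one very large item.
total_costs = [100, 200, 300, 400, 500, 600, 700, 800, 5000]
low, high = tukey_cutoffs(total_costs)

# Items above the high-value cut-off are treated as outliers.
outliers = [v for v in total_costs if v > high]
```

For these made-up values the quartiles are 300 and 700, so the high-value cut-off is 700 + 1.5 * 400 = 1300 and only the 5000 item is flagged. Different software packages interpolate quartiles slightly differently, so cut-offs computed elsewhere may not match to the dollar.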
Calculating a high-value cut-off with Tukey's Outlier Filter

Step 1

Define your population. Eliminate any units that fall outside the scope of your audit. In this example, all files with a TOTAL_COST of 0 are eliminated using IDEA.

[Screenshots: IDEA extraction dialogs. Callouts: "Enter a title for the new database", "Click here to enter the criteria for extraction", "Enter the criteria for extraction", "Click here to continue".]

Step 2

Export data to an MS Excel file.

Open the Excel file.

Step 3

Use the quartile worksheet function to calculate the first and third quartiles of all the data. Use the drop-down menu beside the summation symbol and select More functions.

Alternatively, you can type in the quartile worksheet function directly.

Step 4

Type in the formula to calculate Tukey's Outlier Filter.

High Value cut-off = 3rd Quartile + 1.5 × (3rd Quartile − 1st Quartile)

The cut-off value for high values for this dataset is $559,

Extracting a database of non high-value items

[Screenshots: IDEA extraction dialogs. Callouts: "Enter a title for the new database", "Click here to enter a criterion for extraction", "Enter the criteria for extraction", "Click here to continue".]

Extracting a database of high-value items

[Screenshots: IDEA extraction dialogs. Callouts: "Enter a title for the new database", "Click here to enter a criterion for extraction".]

2.7 Reducing Sample Size

The cost of an audit is related to the sample size, especially in the case of a file review. Each additional file adds significantly to both the time necessary to perform the audit and the total cost of the audit.

When conducting a survey, increasing the sample size has less of an impact on the overall cost and length of the audit, since respondents typically complete questionnaires simultaneously and on their own time.

Prior to calculating the sample sizes of the various groups and sub-populations, researchers need to:

- Establish the scope of the research
- Batch heterogeneous populations into sub-populations
- Decide on a level of detail for reporting results

Each of these activities has an impact on the total sample size needed for the audit. To reduce the total sample size, it is necessary to revise one or all of these decisions.

- Scope of the audit
- Example: Audit of advertising contracts
- Reducing the amount of batching
- Reducing the level of detail of reportable results
- Example: Review of EI benefits claims

Scope of the audit

If the scope of the audit is reduced, then sections of the population will be ignored and will not have to be sampled.

Example: Audit of advertising contracts

The initial scope of an audit was to assess the management of all advertising contracts.

An assessment of homogeneity with respect to the dollar amounts of the contracts (a numeric variable) revealed that the distribution was positively skewed. The population was batched into two groups: a normal distribution and outliers.

Further assessment of homogeneity with respect to type of contract (a character variable) revealed two distinct types of contracts, Media and Production. The population was batched again into two groups.

Batching along two variables resulted in four sub-populations: 1) Media-Normally Distributed, 2) Media-High Value, 3) Production-Normally Distributed, and 4) Production-High Value.
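As a rough sketch of how batching along two variables produces four sub-populations, the Python below cross-classifies records by contract type and by a high-value cut-off. The field names, the cut-off of 500,000, and the four records are hypothetical, and placing items exactly equal to the cut-off in the normally distributed group is an assumption.

```python
from collections import defaultdict


def batch_contracts(contracts, high_value_cutoff):
    """Group contracts by (type, value class)."""
    batches = defaultdict(list)
    for contract in contracts:
        if contract["total_cost"] > high_value_cutoff:
            value_class = "High Value"
        else:
            value_class = "Normally Distributed"
        batches[(contract["type"], value_class)].append(contract)
    return dict(batches)


# Hypothetical records, invented for illustration only.
contracts = [
    {"id": 1, "type": "Media", "total_cost": 40_000},
    {"id": 2, "type": "Media", "total_cost": 800_000},
    {"id": 3, "type": "Production", "total_cost": 55_000},
    {"id": 4, "type": "Production", "total_cost": 650_000},
]
batches = batch_contracts(contracts, high_value_cutoff=500_000)

# Narrowing the audit scope to the two high-value sub-populations.
in_scope = {key: rows for key, rows in batches.items() if key[1] == "High Value"}
```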

Including all four sub-populations in the scope of the audit resulted in a total sample size that would cost too much and take too long.

As a result, the scope of the audit was reduced to include only two sub-populations: Media-High Value and Production-High Value.

Reducing the amount of batching

If, in retrospect, the population was batched into too many sub-populations, you may decide to recombine sub-populations into larger populations. However, do not risk the homogeneity of sub-populations. It would be more appropriate to reduce the scope of the audit than to sample from a heterogeneous population.

Reducing the level of detail of reportable results

Attempting to report results at a very fine level will increase the total sample size. If possible, report results for larger groups.

Example: Review of EI benefits claims

The Comprehensive Tracking System (CTS) of HRDC samples 500 files of EI benefit claims each year.

For the purposes of a financial audit, 500 files was the minimum number necessary for a national estimate of over- and under-payments.

This dataset was reviewed to determine if the sample of 500 files was sufficient to produce Provincial/Territorial estimates of error rates.

It was determined that the estimates for some individual Provinces and Territories had Confidence Intervals larger than 10%.

As a result, it was recommended to group some Provinces/Territories together to create larger regional areas in order to increase the reliability of estimates.

Instead of reporting error rates for each individual Province/Territory, reliable estimates could be reported for larger regional areas (e.g., Atlantic, Quebec, Ontario, Prairies, and BC/Territories).

2.8 Level of Detail When Reporting

Results from a survey can be reported at various levels of detail.

Reporting at a fine level of detail offers results for many sub-groups within the population.

Reporting at a very gross level of detail offers results for only the entire population, or for two or three large groups.

- Example: Differing levels of regional detail
- Level of detail and batching
- Level of detail and sample size

Example: Differing levels of regional detail

Table 1 reports data at a very fine level of detail. Values for each province and territory are reported.

Table 1
Question: Is the amount of formal training for CRA employees adequate?

Region                     n     Agree (%)   Disagree (%)
Newfoundland & Labrador    36    10%         86%
Nova Scotia                36    15%         76%
New Brunswick              36    20%         62%
Prince Edward Island       36    10%         81%
Quebec                     50    25%         68%
Ontario                    50    20%         71%
Manitoba                   40    15%         78%
Alberta                    40    13%         84%
Saskatchewan               40    24%         52%
British Columbia           40    11%         78%
Yukon                      30    5%          90%
NWT                        30    13%         77%
Nunavut                    30    21%         71%
Canada                           %           74%

Table 2 reports data at a moderate level of detail. Values for each large geographic region are reported.

Table 2
Question: Is the amount of formal training for CRA employees adequate?

Region            n     Agree (%)   Disagree (%)
Maritimes         50    14%         76%
Quebec            50    25%         68%
Ontario           50    20%         71%
Prairies          50    17%         72%
BC/Territories    50    12%         79%
Canada                  %           74%

Table 3 reports data only at the national level.

Table 3
Question: Is the amount of formal training for CRA employees adequate?

Region    n     Agree (%)   Disagree (%)
Canada    75    16%         74%

Notice that the level of detail for reporting has a direct impact on the total sample size required. The sample size for each group must be large enough to provide reliable estimates.

Level of detail and batching

Level of detail is not the same as batching.

Batching is used to divide a heterogeneous population into two or more homogeneous sub-populations.

Creating levels of detail is used to divide a single homogeneous population into groups for the sake of reporting results.

Level of detail and sample size

Fine levels of detail increase the sample size.

Prior to establishing a sampling method and sample size, you have to decide beforehand how you plan to report the results (that is, what level of detail you wish to use).

2.9 Calculating Sample Size

- Goal of calculating sample size
- How is reliability measured
- Confidence interval
- Confidence level
- Confidence interval (CI) and confidence level (CL)
- Factors affecting reliability
- Population size (N)
- Most Likely Estimate (MLE)
- Sample size (n)
- Calculating sample size

Goal of calculating sample size

The goal of calculating sample size is to choose a sample size that is large enough to deliver reliable results.

Click here for a Power Point Presentation that reviews calculating sample size.

How is reliability measured

Reliability is expressed in terms of the amount of error we can tolerate for the estimates we make of the population.

Confidence interval

We express error as a Confidence Interval around the estimate.

A 10% Confidence Interval implies that we expect the true population parameter to be within +/- 10% of the sample estimate. For example, if our estimate is 25%, then the true population parameter might be as low as 15% or as high as 35%.

Small Confidence Intervals imply higher reliability.

The Office should use Confidence Intervals that are 10% or less. Confidence Intervals larger than 10% should not be reported.

Confidence level

Confidence Level represents the likelihood that the population parameter is within our Confidence Interval.

A 90% Confidence Level means that we are 90% certain that the true population parameter is within the Confidence Interval.

A high Confidence Level implies higher reliability.

The Office should use Confidence Levels that are 90% or higher.
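To make the link between sample size, Confidence Interval, and Confidence Level concrete, here is a minimal Python sketch based on the normal approximation for a proportion. It is an assumption for illustration only and is not necessarily the calculation IDEA performs; the function name `margin_of_error` and its parameters are invented for the example.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.90, population=None):
    """Approximate +/- half-width of the interval around a sample proportion p."""
    # Two-sided z-value for the requested Confidence Level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction, relevant when n is a large share of N.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


# A group of 36 respondents with a result near 50%, at a 90% Confidence Level.
small_group = margin_of_error(0.5, 36)
# A group of 100 respondents under the same conditions.
larger_group = margin_of_error(0.5, 100)
```

Under this approximation the group of 36 falls outside the 10% limit while the group of 100 falls inside it, which is the kind of check that motivates grouping small regions together.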

Confidence interval (CI) and confidence level (CL)

The CI and CL of an estimate need to be reported together.

The Office should aim to report results with a maximum CI of 10% and a minimum CL of 90%.

Reliability of results should always be included when reporting estimates. For example: "The results of the survey are accurate within +/- 10%, 18 times out of 20."

Factors affecting reliability

Three factors affect the reliability of results:

- Population Size (N)
- Expected Results or Most Likely Estimate (MLE)
- Sample Size (n)

Population size (N)

When the population is relatively small (N < 500), the population size has a major impact on sample size.

Beyond a certain size (N > 1000), the population size has a much smaller impact on the sample size.

Most Likely Estimate (MLE)

The expected findings have a major impact on sample size.

The required sample size is largest when the MLE is 50% of the population. As the MLE deviates from 50%, smaller samples are capable of generating equally reliable results.

As a convention, the MLE is always expressed as a percentage between 0% and 50%.

Sample size (n)

Of the three factors that affect the reliability of the results, sample size is the only one that is under the direct control of the researcher.

As sample size increases, so does the reliability of the results.

Prior to calculating sample size, we need to know the Population Size (N) and the Most Likely Estimate (MLE).

If the MLE is unknown, then we should assume it is 50%.

Typically, a survey seeks to assess several variables, and each has its own MLE. In this case, the researcher should use the MLE that is closest to 50%.
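The way these three factors interact can be sketched with a common textbook approximation: a normal-approximation sample size for a proportion, adjusted with a finite population correction. This is an assumption made for illustration and may differ from the method IDEA uses; the function name `sample_size` and its parameter names are invented for the example.

```python
import math
from statistics import NormalDist


def sample_size(population, mle=0.5, interval=0.10, confidence=0.90):
    """Approximate sample size for estimating a proportion.

    population : Population Size (N)
    mle        : Most Likely Estimate, as a fraction (0.5 if unknown)
    interval   : Confidence Interval half-width, as a fraction (0.10 = +/- 10%)
    confidence : Confidence Level, as a fraction
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * mle * (1 - mle) / interval ** 2
    # Finite population correction: small populations need fewer units.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


default_large = sample_size(1000)           # MLE unknown, so 50% is assumed
default_small = sample_size(200)            # same settings, smaller population
lower_mle = sample_size(1000, mle=0.25)     # MLE further from 50%
```

In this sketch the required sample shrinks as the population gets smaller and as the MLE moves away from 50%, in line with the factors described above.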

Calculating sample size

Click here for a demonstration of how to calculate sample size with IDEA

3 Representative Sampling Methods

[Flowchart: decision path for Chapter 3. Boxes: 2.0 Representative Sampling: Preliminary Assessment; 2.7 Sample is too large (Yes/No); Small Populations; 5.0 Conducting a Census; 3.1 Large Populations; 3.1 Stratification (Yes/No); 3.2 Simple Random Sampling; 3.3 Define Strata; 3.5 Is Each Stratum Equally Important & Proportional? (Yes/No); 3.6 Proportional Stratified Random Sampling; 3.7 Non-Proportional Stratified Random Sampling; 4.0 Reporting Results from Representative Samples.]

3.1 Sampling Large Populations

In order to get representative results from a sample of a large population, a survey needs to meet three important criteria:

- The sample size needs to be sufficiently large.
- The sample has to be selected in such a way as to be representative of the population.
- The response rate has to be sufficiently high.

This section of the manual deals with the second criterion: methods of sample selection.

Decide if the population should be stratified

The first step in designing a method of sample selection is to decide if the population can or should be stratified.

Stratification is the process of dividing the population into different sections or strata. Each stratum represents one layer or section of the population.

A population can only be stratified if:

- there are clear and effective means of assigning sampling units to different strata
- the population size of each stratum is known
- each stratum can be sampled separately

If a population cannot be stratified, then a researcher should rely on simple random sampling as a method of sample selection.

If the population can be stratified, then a researcher needs to decide what type of stratified sampling method to use. There are two general categories of stratified sampling:

- Proportional Stratified Sampling
- Non-Proportional Stratified Sampling

3.2 Simple Random Sampling

Simple random sampling is the most basic method of selecting a sample from a population. Each member of the population must have an equal chance of being selected.

- Step 1: Sampling frame
- Step 2: Random selection
- Random number generator
- Sampling intervals

Step 1: Sampling frame

The Sampling Frame is a method of listing all the members (sampling units) of the population.

The Sampling Frame might take the form of an actual list of all the members of the population. An example would be a list of all files within a department that is being audited.

It is very common to gain access to a list of all files from a department when doing a file review. If so, IDEA can be used to select a random sample. Click here for a demonstration of how to use IDEA for sample selection.

If an actual list is not possible to produce, a Sampling Frame could also be a description of how to access any member within the population. For example, when sampling all adults living in a community, it is rare to find a comprehensive list of all people. Instead, surveyors make use of telephone lists to contact households. Once in contact with a household, the surveyor determines the number of adults currently living there and randomly chooses one to be surveyed.

Step 2: Random selection

Random selection is typically done in one of two ways:

- Using a random number generator
- Using a sampling interval with a random starting point

Random number generator

Most common spreadsheet applications, such as Excel, have a random number generator.

In the same spreadsheet that lists all the members of the population, generate a new column of random numbers.

If you need a sample of 50 from a population of 200, then generate a set of random numbers between 1 and an arbitrarily large number, about 100 or 1,000 times the size of the population. This reduces the chances of getting duplicate random numbers.

Re-sort the population based on the new column of random numbers. The sample will be the first fifty units.

If any units of the original sample are unavailable or turn out to be outside the sampling frame, then replace them with the subsequent units.
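The spreadsheet procedure above (attach a random number to every unit, re-sort, take the first units, and keep the rest in order as replacements) can be sketched in Python as follows. The function name `simple_random_sample` and the optional `seed` argument are illustrative assumptions; the seed is there only so that a selection can be reproduced and documented.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Return (sample, reserve) after sorting the frame on a random key."""
    rng = random.Random(seed)
    # Attach one random number to each sampling unit, like a new spreadsheet column.
    keyed = [(rng.random(), unit) for unit in frame]
    # Re-sort the population on the random column.
    keyed.sort(key=lambda pair: pair[0])
    ordered = [unit for _key, unit in keyed]
    # The first n units are the sample; the remainder, taken in order,
    # supplies replacements for unavailable or out-of-frame units.
    return ordered[:n], ordered[n:]


file_numbers = list(range(1, 201))          # a frame of 200 files
sample, reserve = simple_random_sample(file_numbers, 50, seed=2006)
```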
Sampling intervals

Using a sampling interval is also referred to as systematic random sampling. With a random starting point, each member of the population still has an equal chance of being selected, so it is treated here as a method of simple random sampling.
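As a minimal sketch of selection with a sampling interval: the interval is the population size divided by the sample size, a random starting point is drawn within the first interval, and every unit one interval apart is then selected. The function name `systematic_sample` is invented for the example, and the use of integer division (which leaves any leftover units at the end of the list unselectable when the population is not an exact multiple of the sample size) is a simplifying assumption.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select n units from an ordered frame using a fixed sampling interval."""
    rng = random.Random(seed)
    interval = len(frame) // n            # e.g. 1200 // 50 = 24
    start = rng.randint(1, interval)      # random start, counted from 1
    positions = range(start, start + interval * n, interval)
    # Positions are 1-based, list indices are 0-based.
    return [frame[position - 1] for position in positions]


files = list(range(1, 1201))              # a frame of 1200 numbered files
selected = systematic_sample(files, 50, seed=7)
```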

If a sample of 50 files is needed from a population of 1200, then the population size is divided by the sample size to generate a Sampling Interval.

Sampling Interval = Population size / Sample size
Sampling Interval = 1200 / 50
Sampling Interval = 24

A random starting point between 1 and 24 (the sampling interval) is selected with a random number generator.

Starting at the file that corresponds to the random starting point, every 24th file after that is selected as part of the sample.

Click here to see a demonstration of how to perform a simple random sample using IDEA.

3.3 Define Strata for Stratified Random Sampling

- Number of strata (k)
- Defining strata with character variables
- Example of strata using character variables
- Defining strata with numeric variables
- Cumulative square root (CSR) method of defining strata

Number of strata (k)

Researchers in the area of surveys and statistics have suggested that no more than 4 strata are sufficient for substantially improving the quality of a survey sample (Sethi, V. K., A Note on Optimum Stratification for Estimating the Population Mean, The Australian Journal of Statistics, 5:20-23).

This means that you need to divide your population into no more than 4 groups.

Defining strata with character variables

Character variables place members of a population into categories. Depending on the number of categories and the frequency within each, some of the categories may need to be combined to provide fewer, but larger, strata to use for sampling.

Example of strata using character variables

A survey of gun ownership is being conducted of all households in Canada. The sampling unit is the household.

For the purposes of sampling, communities in Canada can be divided into basic regions or strata. These strata can be: 1) Urban areas, 2) Suburban areas, and 3) Rural areas.

In order to help ensure a representative sample that includes households from all types of communities, the population of households is divided into three strata (Urban, Suburban, and Rural), and a sample is selected from each.

Defining strata with numeric variables

If the variable being used for stratification is numeric, then cut-off points are used to define the boundaries of each stratum.

Cumulative square root (CSR) method of defining strata

The CSR Method uses the cumulative sum of the square roots of the frequencies in a histogram.

Click here to see an example of defining strata using IDEA.

Click here to see a demonstration of this method.

Step 1: Create a histogram of the distribution. A histogram can be easily created in IDEA. Save it as a database for further manipulation.

Step 2: Create a new variable that is equal to the square root of the number of records (NO_OF_RECS).

Step 3: Use the Field Statistics to find the sum of all the square roots of the number of records (cumulative of square root of frequencies), Σ√FREQ.

Step 4: Divide this sum by the number of strata (k) you want to define.

INT = (Σ√FREQ) / k

Step 5: Use this value as an interval size to divide the population into strata. Starting at the lowest values, group the frequencies together until the sum of the square roots of the frequencies is equal to the interval size.

Step 6: If the interval size (INT) is too large, then return to Step 1 and create a new histogram with a larger number of categories.

3.4 Cumulative Square Root Method: Defining Strata for a Numeric Variable

- Step 1: Create a histogram of the distribution
- Choose an interval size
- Numeric stratification
- Transfer frequency distribution into a database
- Step 2: Calculate the square root of the frequency
- Step 3: Calculate the sum of the square root of frequencies
- Step 4: Choose number of strata
- Step 5: Define strata
- Example of proportional stratified random sampling
- Step 6: Repeat if necessary

Step 1: Create a histogram of the distribution

A histogram can be easily created in IDEA. Save it as a database for further manipulation.

Choose an interval size

Start with a relatively normally distributed population. (Use Tukey's Outlier Filter to remove outliers for separate consideration.)

Take note of the minimum and maximum values of the variable you wish to use for stratification (TOTAL_COST).

Choose an interval size for the histogram that roughly divides the range of the variable into 20 portions (30,000 in this case).

Numeric stratification

Perform a Numeric File Stratification from the Analysis menu option.

- Select the variable for stratification (TOTAL_COST)
- Enter the appropriate interval size (30,000)
- Define the Lower and Upper limits for the histogram

Transfer frequency distribution into a database

This will result in a Frequency Distribution.

Choose the option: Create a database from the results. This will allow you to create new columns of data.

Step 2: Calculate the square root of the frequency

Create a new variable that is equal to the square root of the number of records (NO_OF_RECS).

- Click on Append to create a new variable
- Enter a new variable name (FREQ_SQRT) for the square root of the frequency (NO_OF_RECS)
- Set the variable Type to Virtual numeric
- Set the Parameter to calculate the square root of NO_OF_RECS

Step 3: Calculate the sum of the square root of frequencies

Use the Field Statistics to find the sum of all the square roots of the number of records (cumulative of square root of frequencies), Σ√FREQ.

The sum of the square roots of the frequencies is 49.

Step 4: Choose number of strata

Divide this sum by the number of strata (k) you want to define.

INT = (Σ√FREQ) / k
INT = 49 / 3
INT = 16.3

Use a value of 16 or 17 to group the levels of the histogram.

Step 5: Define strata

Use this value as an interval size to divide the population into strata. Starting at the lowest values, group the frequencies together until the sum of the square roots of the frequencies is equal to the interval size.

Example of proportional stratified random sampling

Stratum   Limits         Population   %     Sample
1         $0-$60K                     %     37
2         $60K-$240K     57           23%   13
3         $240K-$570K    26           10%   6
Total                                 %     56*

*A sample size of 56 was selected based on a population of 248, an expected error rate of 25%, a confidence level of 95%, and a one-tailed confidence interval of 10%.
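Steps 2 to 5 of the CSR Method can be sketched in Python as follows. The sketch is an illustration under stated assumptions rather than the IDEA procedure: the function name `csr_boundaries` and the histogram values are invented, and each stratum boundary is placed at the histogram limit whose cumulative value is closest to a multiple of INT, which is one reasonable reading of "group the frequencies together until the sum is equal to the interval size".

```python
import math


def csr_boundaries(bins, k):
    """Return k - 1 stratum boundaries from a histogram.

    bins : list of (upper_limit, frequency) pairs in ascending order
    k    : number of strata to define
    """
    cumulative = []
    running = 0.0
    for upper_limit, frequency in bins:
        running += math.sqrt(frequency)      # Step 2 and Step 3
        cumulative.append((upper_limit, running))
    interval = running / k                   # Step 4: INT
    boundaries = []
    for j in range(1, k):                    # Step 5
        target = j * interval
        upper_limit, _ = min(cumulative, key=lambda pair: abs(pair[1] - target))
        boundaries.append(upper_limit)
    return boundaries


# Hypothetical histogram: (upper limit of interval, number of records).
histogram = [(30_000, 16), (60_000, 9), (90_000, 4), (120_000, 1)]
cut_points = csr_boundaries(histogram, k=2)
```

If two boundaries land on the same histogram limit, the histogram is too coarse for the requested number of strata, which corresponds to the situation Step 6 addresses.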

Step 6: Repeat if necessary

If the interval size (INT) is too large, then return to Step 1 and create a new histogram with a larger number of categories. For example, create a histogram with 30 categories instead of 20.

3.5 Is Each Stratum Equally Important?

There are two basic methods of sampling from several strata: 1) Proportional Stratified Sampling, and 2) Non-Proportional Stratified Sampling.

When each stratum is considered equally important and roughly the same size, then Proportional Stratified Sampling is acceptable.

When one or more strata are considered much more important than the rest, then Non-Proportional Stratified Sampling is preferred.

Non-Proportional Stratified Sampling is also preferred when one of the strata is much smaller than the others. This is because Proportional Stratified Sampling would likely select very few units, or none, from very small strata.

Example 1: Over-representation of a small but important stratum

In doing a survey of health care usage, a researcher wishes to sample the general population. She divides it into several strata based on age: Children, Adolescents, Adults, and Aged. While the Aged make up only a small portion of the population, they are also the most prominent users of the health care system. Accordingly, the researcher decides to use a non-proportional stratified sampling method and over-represent the Aged in her sample.

               Population               Sample
Group          Size       Proportion    Size    Proportion
Children       18,000     15%           30      15%
Adolescents    30,000     25%           50      25%
Adults         66,000     55%           60      30%
Aged           6,000      5%            60      30%
Total          120,       %                     %
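One way to see the over-representation in this table is to compute, for each stratum, the sampling fraction (sample size divided by stratum size) and its inverse, the number of population members each sampled unit stands for. The Python below is a small sketch using the stratum figures from the table; the function name `sampling_summary` and the dictionary layout are illustrative assumptions.

```python
def sampling_summary(strata):
    """Return sampling fraction and weight for each stratum.

    strata : dict mapping stratum name to (population size, sample size)
    """
    summary = {}
    for name, (population, sample) in strata.items():
        summary[name] = {
            "fraction": sample / population,   # share of the stratum sampled
            "weight": population / sample,     # population members per sampled unit
        }
    return summary


health_survey = {
    "Children": (18_000, 30),
    "Adolescents": (30_000, 50),
    "Adults": (66_000, 60),
    "Aged": (6_000, 60),
}
summary = sampling_summary(health_survey)
```

Children and Adolescents end up with the same weight, while the Aged have a much smaller weight than Adults, which is what over-representing a stratum means in practice.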

Example 2

In doing a survey of advertising contracts, the researcher divides them into two main strata: Media-Time purchases and Production. Media purchases are considered low risk for poor oversight, while the Production contracts are considered high risk. Accordingly, when sampling files for review, he or she uses a non-proportional stratified sampling method and over-represents Production contracts.

3.6 Proportional Stratified Random Sampling

Proportional Stratified Random Sampling (PSRS) is used to help ensure that a sample is representative of its population.

- Determining sample sizes for each stratum
- Example 1: General population divided into five age groups
- Example 2: General population divided into five geographic areas

Determining sample sizes for each stratum

Example 1: General population divided into five age groups

Step 1

Divide the population into approximately 3-5 groups. The groups don't have to be of equal size, but it is important that each group represents a substantial proportion of the population. In this example, we will divide the non-child population of Canada into 5 age groups.

Step 2

Determine the population size (N) of each group or stratum, and its proportion of the entire population.

Age Group    N            Proportion
             ,120,600     8%
             ,188,500     8%
             ,547,300     37%
             ,931,500     31%
65+          4,060,200    16%
Total        25,848,

Step 3

When sampling, ensure that the proportions of each group within the sample match those of the population.

If the total sample size is 1200, then the number of people in the first age group in the sample should be about 8% of 1200, or 98. (The proportions shown are rounded; the sample sizes are based on the unrounded proportions.)

             Population                Sample
Age Group    N            Proportion   n      Proportion
             ,120,600     8%           98     8%
             ,188,500     8%           102    8%
             ,547,300     37%                 %
             ,931,500     31%                 %
65+          4,060,200    16%                 %
Total        25,848,

Example 2: General population divided into five geographic areas

Step 1

Divide the population into approximately 3-5 groups. The groups don't have to be of equal size, but it is important that each group represents a substantial proportion of the population. In this example, we will divide the non-child population of Canada into 5 geographical regions.

Step 2

Determine the population size (N) of each group or stratum, and its proportion of the entire population.

Region      N             Proportion
Atlantic    1,947,400     8%
Quebec      6,203,100     24%
Ontario     9,931,200     38%
Prairies    4,250,300     16%
BC/Terr.    3,516,200     14%
Total       25,848,200
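Proportional allocation amounts to multiplying the total sample size by each stratum's share of the population. A minimal Python sketch using the regional population figures from the table above follows. The function name `proportional_allocation` is invented, and the largest-remainder rounding step is an assumption added so that the stratum sample sizes always add up to the requested total.

```python
def proportional_allocation(populations, total_sample):
    """Allocate total_sample across strata in proportion to population size."""
    total_population = sum(populations.values())
    exact = {name: total_sample * size / total_population
             for name, size in populations.items()}
    # Start from the whole-number part of each exact allocation.
    allocation = {name: int(value) for name, value in exact.items()}
    shortfall = total_sample - sum(allocation.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:shortfall]:
        allocation[name] += 1
    return allocation


regions = {
    "Atlantic": 1_947_400,
    "Quebec": 6_203_100,
    "Ontario": 9_931_200,
    "Prairies": 4_250_300,
    "BC/Terr.": 3_516_200,
}
allocation = proportional_allocation(regions, total_sample=500)
```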

Step 3

When sampling, ensure that the proportions of each group within the sample match those of the population.

If the total sample size is 500, then the number of people in the sample from Ontario should be 38% of 500, or 192.

            Population                 Sample
Region      N             Proportion   n      Proportion
Atlantic    1,947,400     8%           38     8%
Quebec      6,203,100     24%                 %
Ontario     9,931,200     38%          192    %
Prairies    4,250,300     16%          82     16%
BC/Terr.    3,516,200     14%          68     14%
Total       25,848,

3.7 Non-Proportional Stratified Random Sampling

Like Proportional Stratified Random Sampling (PSRS), Non-Proportional Stratified Random Sampling (NSRS) also divides the population into meaningful layers or strata.

When using PSRS, the distribution of the sample is the same as that of the population. That is, the proportion within each stratum of the sample is the same as in the population.

With NSRS, some strata are purposely over- and under-represented.

NSRS helps increase the accuracy of population estimates by oversampling areas of the population that are scarce.

Age Group    N            Proportion of Pop.   PSRS n   NSRS n   NSRS proportions
             ,120,600     8%                                     %
             ,188,500     8%                                     %
             ,547,300     37%                                    %
             ,931,500     31%                                    %
65+          4,060,200    16%                                    %
Total        25,848,      %                                      %

Determining sample sizes for each stratum

Example 1: General population divided into five age groups

Step 1: Define strata

Divide the population into approximately 3-5 groups. The groups don't have to be of equal size, but it is important that each group represents a substantial and homogeneous proportion of the population. In this example, we will divide the non-child population of Canada into 5 age groups.

Age Group    N
             ,120,
             ,188,
             ,547,
             ,931,
             ,060,200
Total        25,848,100

Step 2: Define the population

Determine the population size (N) of each group or stratum, and its proportion of the entire population.

Age Group    N            Proportion
             ,120,600     8%
             ,188,500     8%
             ,547,300     37%
             ,931,500     31%
65+          4,060,200    16%
Total        25,848,100

Step 3: Assess risk

High-risk groups should always be well represented in your sample.

Any important group that is very scarce is not likely to be well represented using a PSRS strategy. Hence, scarce groups should be over-represented, especially if they represent a critical area of the population.

Step 4: Assign sample sizes

While the youth and the aged represent small proportions of the population, they are often the most important groups to consider. Hence, it is reasonable to increase their sample sizes and over-represent them in our sample.

Age Group    N            Proportion of Pop.   NSRS n   NSRS proportions
             ,120,600     8%                   20       17%
             ,188,500     8%                   20       17%
             ,547,300     37%                  30       25%
             ,931,500     31%                  30       25%
65+          4,060,200    16%                  20       17%
Total        25,848,      %                             %

Example 2: General population divided into five geographic areas

Step 1: Define strata

Divide the population into approximately 3-5 groups. The groups don't have to be of equal size, but it is important that each group represents a substantial proportion of the population. In this example, we will divide the non-child population of Canada into 5 geographical regions.

Step 2: Define the population

Determine the population size (N) of each group or stratum, and its proportion of the entire population.

Region      N             Proportion
Atlantic    1,947,400     8%
Quebec      6,203,100     24%
Ontario     9,931,200     38%
Prairies    4,250,300     16%
BC/Terr.    3,516,200     14%
Total       25,848,

Step 3: Assess risk

High risk groups should always be well represented in your sample. Any important group that is very scarce is unlikely to be well represented using a PSRS strategy. Hence, scarce groups should be over-represented, especially if they represent a critical area of the population. In this case, the Atlantic region of Canada is the scarcest.

Step 4: Assign sample sizes

While provinces such as Ontario and Quebec would be well represented using a PSRS, areas such as Atlantic Canada would not. Hence, it would be reasonable to over-represent Atlantic Canada.

                  Population                  NSRS Sample
Region                N   Proportion        n   Proportion
Atlantic      1,947,400           8%       90          18%
Quebec        6,203,100          24%    [...]        [...]
Ontario       9,931,200          38%    [...]        [...]
Prairies      4,250,300          16%       90          18%
BC/Terr.      3,516,200          14%       90          18%
Total        25,848,[...]       100%    [...]         100%
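The degree of over-representation in a non-proportional allocation can be quantified by comparing each stratum's share of the sample with its share of the population. The sketch below applies this to the age-group allocation from Example 1 (20, 20, 30, 30, 20). It is an illustration only: the stratum labels other than "65+" are placeholders, the stratum sizes assume leading digits that make the strata sum to 25,848,100, and the ratio itself is a convenience measure that the guide does not define.

```python
# Example 1 strata: population size (N) and NSRS sample size (n).
# Labels other than "65+" are placeholders.
population = {
    "group 1": 2_120_600,
    "group 2": 2_188_500,
    "group 3": 9_547_300,
    "group 4": 7_931_500,
    "65+": 4_060_200,
}
nsrs_n = {"group 1": 20, "group 2": 20, "group 3": 30, "group 4": 30, "65+": 20}

total_N = sum(population.values())
total_n = sum(nsrs_n.values())

pop_share = {g: population[g] / total_N for g in population}
sample_share = {g: nsrs_n[g] / total_n for g in nsrs_n}

# Ratio above 1: the stratum is over-represented; below 1: under-represented.
representation = {g: sample_share[g] / pop_share[g] for g in population}

for g in population:
    print(f"{g:8s} pop={pop_share[g]:6.1%}  sample={sample_share[g]:6.1%}  "
          f"ratio={representation[g]:.2f}")
```

A ratio well above 1 for the youngest and oldest groups reflects the deliberate choice made in Step 4. It is also the reason weighted calculations are needed when estimates are later produced from such a sample.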


Chapter 4 Reporting Results from Representative Samples

3.0 Representative Sampling Methods
    3.2 Simple Random Sampling
    3.6 Proportional Stratified Random Sampling
    3.7 Non-Proportional Stratified Random Sampling
4.1 Calculating simple estimates and precision
4.1 Calculating weighted estimates and precision
4.2 Reporting Estimates

OAG January 2003 Sampling Guide for Performance Audits

4.1 Calculating Estimates and Precision of Estimates

The manner in which we calculate population estimates and the degree of precision of those estimates depends on the method of sampling. When the distribution of the sample is the same as the distribution of the population, such as when we use either Simple Random Sampling or Proportional Stratified Random Sampling, the calculation of estimates and precision is simple and straightforward. However, when the distribution of the sample differs from that of the population, weights must be used to calculate the estimates and precision.

Calculating simple estimates and precision
    Simple estimates of deviation rates
    Simple estimates of averages
    Precision of simple estimates of deviation rates (using IDEA)
    Precision of simple estimates of averages
    An example of calculating precision of a simple estimate of an average

Calculating weighted estimates and precision
    Weighted estimates of deviation rates
    Weighted estimates of averages
    Precision of weighted estimates of deviation rates
    Precision of weighted estimates of averages

Calculating simple estimates and precision

Simple estimates of deviation rates

The calculation of estimates is very straightforward when using either Simple Random Sampling or Proportional Stratified Random Sampling. The population estimates of central tendency (Mean, Median or Mode), ratio (error rate), or variance (standard deviation), are the same as those calculated for the sample. Hence, the error rate found in the sample is the estimate of the error rate of the population.

Calculation of a ratio from a simple random sample

Population Size (N)                             25,848,100
Sample size (n)                                        120
# of Deviations (d)                                     33
Sample Deviation Rate (P = d/n)                      27.5%
Estimate of population deviations (D = N*P)      7,108,228
Most Likely Estimate (MLE = D/N)                     27.5%

The Most Likely Estimate (MLE) of the deviation rate in the population from this example is simply calculated by dividing the number of deviations (d) in the sample by the sample size (n). Alternatively, we can divide the estimated number of deviations in the population (D) by the population size (N). With a simple random sample, the Sample Deviation Rate (P) will equal the Most Likely Estimate (MLE).

Calculation of a ratio from a proportional stratified random sample

When using a Proportional Stratified Random Sample (PSRS), the distribution of the sample is identical to that of the population. As a result, the Most Likely Estimate (MLE) can be calculated by simply dividing the number of deviations (d) by the sample size (n) (33 / 120 = 27.5%). Alternatively, we can divide the estimated number of deviations in the population (D) by the population size (N) (7,108,228 / 25,848,100 = 27.5%).

Columns: Population Size (N), Sample size (n), # of Deviations (d), Stratum Deviation Rate (P), Estimate of population deviations (D), Most Likely Estimate (MLE). Strata (L=5) are the five age groups.

Age stratum                 N     n    d   P = d/n     D = N*P   MLE = D/N
1                   2,154,008    10    1     10.0%     215,401
2                   2,154,008    10    2     20.0%     430,802
3                   9,477,637    44   18     40.9%   3,877,215
4                   7,969,831    37   11     29.7%   2,369,409
5                   4,092,616    19    1      5.3%     215,401
Total Population   25,848,100   120   33     27.5%   7,108,228       27.5%

Simple estimates of averages

Estimating the average of a population is very simple when using either a simple random sample or proportional stratified sampling. The average of the sample is the estimate of the population average.
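The two routes to the Most Likely Estimate described above (d/n and D/N) can be checked with a few lines of code. This is a sketch under stated assumptions: the stratum sizes and sample sizes are those of the proportional stratified example, while the per-stratum deviation counts (1, 2, 18, 11, 1) are inferred from the estimated population deviations and should be treated as illustrative.

```python
# (N, n, d) per stratum: population size, sample size, deviations found.
# The deviation counts d are inferred values, used here for illustration.
strata = [
    (2_154_008, 10, 1),
    (2_154_008, 10, 2),
    (9_477_637, 44, 18),
    (7_969_831, 37, 11),
    (4_092_616, 19, 1),
]

total_N = sum(N for N, _, _ in strata)
total_n = sum(n for _, n, _ in strata)
total_d = sum(d for _, _, d in strata)

# Route 1: pooled sample deviation rate, P = d / n.
sample_rate = total_d / total_n

# Route 2: estimate deviations in each stratum (D = N * P), then MLE = D / N.
stratum_D = [N * d / n for N, n, d in strata]
total_D = sum(stratum_D)
mle = total_D / total_N

print(f"P = d/n   = {sample_rate:.1%}")
print(f"D         = {total_D:,.0f}")
print(f"MLE = D/N = {mle:.1%}")
```

The two routes agree only because the sample is allocated proportionally. With a non-proportional allocation the pooled d/n would no longer match D/N.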

Calculation of an average from a simple random sample

Population (N)   Sample (n)   Sample Sum (sum)   Average Income (mean = sum/n)
25,848,100       120          $6,351,557         $52,930

Calculation of an average from a proportional stratified random sample

Stratum   Population (Ni)   Sample (ni)   Sum of Income (sum)   Average Income (mean = sum/n)
1               2,154,008            10              $241,580
2               2,154,008            10              $309,520
3               9,477,637            44            $2,586,012
4               7,969,831            37            $2,381,542
5               4,092,616            19              $832,903
Total (N, n)   25,848,100           120            $6,351,557   $52,930
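The stratified average above is simply the pooled sum of income divided by the pooled sample size. A minimal sketch of that arithmetic, using the stratum figures from the table (the variable names are illustrative):

```python
# (sample size, sum of income) per stratum, from the table above.
strata = [
    (10, 241_580),
    (10, 309_520),
    (44, 2_586_012),
    (37, 2_381_542),
    (19, 832_903),
]

total_n = sum(n for n, _ in strata)
total_sum = sum(s for _, s in strata)

# With a proportional allocation, the pooled sample mean is the estimate
# of the population mean; no weighting is needed.
mean_income = total_sum / total_n

print(f"n = {total_n}, sum = ${total_sum:,}, mean = ${mean_income:,.0f}")
```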

Precision of simple estimates of deviation rates (using IDEA)

Once the survey is complete and the sample deviation rate is known, IDEA can be used to estimate the level of precision of the MLE. Select Attribute Planning and Evaluation from the Sampling menu.

Select the Sample Evaluation tab. Enter the necessary parameters:

1) Population Size: The total number of sampling units in the population.
2) Sample Size: The number of sampling units in your sample.
3) Number of Deviations in Sample: The raw number of errors that were found.
4) % Desired Confidence Level: Enter either [...] for a moderate level of confidence, or [...] for a high level.

Press Compute to get an interpretation of your findings.
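For readers who want to see the kind of calculation a sample evaluation performs, a one-sided upper limit for a deviation rate can be sketched with an exact binomial (Clopper-Pearson style) bound. This is not a description of IDEA's internal method, which the guide does not specify: the binomial model ignores the finite population size, and the inputs (35 deviations in a sample of 120, 90% confidence) are hypothetical values chosen for illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of observing k or fewer deviations in n draws at rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_limit(d: int, n: int, confidence: float) -> float:
    """One-sided upper confidence limit for the deviation rate.

    Finds the rate p at which seeing d or fewer deviations would have
    probability (1 - confidence), using bisection.
    """
    alpha = 1 - confidence
    lo, hi = d / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(d, n, mid) > alpha:
            lo = mid  # rate still too plausible; move the limit up
        else:
            hi = mid
    return hi


# Hypothetical inputs, for illustration only.
d, n, confidence = 35, 120, 0.90
rate = d / n
limit = upper_limit(d, n, confidence)
print(f"sample deviation rate: {rate:.2%}")
print(f"one-sided upper limit: {limit:.2%}")
print(f"difference:            {limit - rate:.2%}")
```

The difference printed on the last line corresponds to what the guide calls the one-sided confidence interval. Results from IDEA may differ slightly if it uses a different distribution or accounts for population size.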

The one-sided confidence interval is the difference between the 1-sided Upper Limit and the Sample Deviation Rate. In this example, the confidence interval is 35.14% - 29.17% = 5.97%. As a general rule, if the confidence interval is less than 10%, then the findings are deemed representative of the population. The output also gives a plain language interpretation of the findings.

Precision of simple estimates of averages

IDEA can only be used to calculate estimates of ratios. Fortunately, the vast majority of data collected during an audit is binomial in nature (e.g., pass/fail or yes/no), and the Attribute Sampling utility in IDEA is ideal for calculating precision in these cases. However, when we are collecting numeric data, and the estimate produced is an average, then the following formulae can be used to generate a level of precision.