EUROSTAT
Statistical Office of the European Communities

PRACTICAL GUIDE TO DATA VALIDATION IN EUROSTAT


TABLE OF CONTENTS

1. Introduction
2. Data editing
   2.1 Literature review
   2.2 Main general procedures adopted in Member States
       Foreign Trade; Industrial Output; Commercial Companies Survey; Employment Survey; Private Sector Statistics on Earnings; Survey of Business Owners; Building Permits Survey
   2.3 Main general procedures adopted in Eurostat
       Harmonization of national data; Corrections using data from the same Member State; Corrections using data from other Member States; Foreign Trade; Transport Statistics; Labour Force Survey; Eurofarm
   2.4 Guidelines for data editing
       Stages of data editing: Micro data (Error detection, Error correction); Country data (Error detection, Error correction); Aggregate (Eurostat) data
   Concluding remarks
3. Missing data and imputation
   3.1 Literature review
       Single imputation methods: Explicit modelling (Mean imputation, Regression imputation); Implicit modelling (Hot deck imputation, Substitution, Cold deck imputation, Composite methods)
       Multiple imputation methods

   3.2 Main general procedures adopted in Member States
       Foreign Trade; Industrial Output; Commercial Companies Survey; Employment Survey; Annual Survey of Hours and Earnings; Survey of Business Owners; Building Permits Survey; Housing Rents Survey; Basic Monthly Survey
   Main general procedures adopted in Eurostat
       Community Innovation Survey; Continuing Vocational Training Survey; European Community Household Panel
   Guidelines for data imputation
       Stages of imputation of missing data: Micro data; Country data; Aggregate (Eurostat) data
   Concluding remarks
4. Advanced validation
   Literature review
       Strategies for handling outliers; Testing for discordancy (Exploratory data analysis; Statistical testing for outliers: Single outlier tests, Multiple outlier tests, Multivariate data); Methods of accommodation (Estimation of location, Estimation of dispersion); Time series analysis
   Main general procedures adopted in Member States
       Foreign Trade; Consumer Price Index
   Main general procedures adopted in Eurostat
       Community Innovation Survey
   Guidelines for advanced validation
       Stages of advanced validation: Micro data (Advanced detection of problems, Error correction); Country data; Aggregate (Eurostat) data
   Concluding remarks
References

1. INTRODUCTION

A main goal of any statistical organization is the dissemination of high-quality information, and this is particularly true for Eurostat. Quality implies that the data available to users are able to satisfy their needs and requirements concerning statistical information; it is defined in a multidimensional way involving six criteria: Relevance; Accuracy; Timeliness and punctuality; Accessibility and clarity; Comparability; and Coherence.

Broadly speaking, data validation may be defined as supporting all the other steps of the data production process in order to improve the quality of statistical information. In the Handbook on improving quality by analysis of process variables (LEG on Quality project by ONS UK, Statistics Sweden, National Statistical Service of Greece, and INE PT) it is described as the method of detecting errors resulting from data collection. In short, it is designed to check the plausibility of the data and to correct possible errors, and it is one of the most complex operations in the life cycle of statistical data, including steps and procedures of two main categories: checks (or edits) and transformations (or imputations). Its three main components are the following:
- Data editing: the application of checks that identify missing, invalid or inconsistent entries or that point to data records that are potentially in error.
- Missing data and imputation: analysis of imputation and reweighting methods used to correct for missing data caused by non-response. Non-response can be total, when there is no information on a given respondent (unit non-response), or partial, when only part of the information on the respondent is missing (item non-response). Imputation is a procedure used to estimate and replace missing or inconsistent (unusable) data items in order to provide a complete data set.
- Advanced validation: advanced statistical methods can be used to improve data quality. Many of them are related to outlier detection, since the conclusions and inferences obtained from a data set contaminated by outliers may be seriously biased.

Before Eurostat dissemination, data validation has to be performed at different stages depending on who is processing the data:
- The first stage is at the end of the collection phase and concerns micro data. Member States are responsible for it, since they conduct the surveys.
- The second stage concerns country data, i.e., the micro-data country aggregates sent by Member States to Eurostat. Validation at this stage has to be performed by Eurostat.
- The third and last stage concerns aggregate (Eurostat) data before their dissemination, and it is also performed by Eurostat.

Validation should be performed according to a set of common rules and of specific rules that depend on the stage and on the data aggregation level. In this document, some general and common guidance is provided for each stage. More detailed rules and procedures can only be provided when looking at a specific survey, since each one has its own particular characteristics and problems. A thorough set of validation guidelines can therefore only be defined for a specific statistical project. Nevertheless, this document discusses the most important issues that arise concerning the validation of any statistical data set, describing its main problems and how to handle them.
It lists as thoroughly as possible the different aspects that need to be analysed for error diagnosis and checking, the most adequate methods and procedures for that purpose and

finally possible ways to correct the errors found. It should be seen as an introduction to data validation and as a pointer to further reading for any statistician or staff member of a statistical organization working on this matter. That is, being the general starting point for data validation, this document may be applied and adapted to any particular statistical project or data set, and it may also be used as input and building block for specific handbooks defining a set of rules and procedures common to Member States and Eurostat. In short, this document should be regarded as guidelines for a general approach to data validation, to be followed by subsequent rules and procedures specifically designed for each statistical project and shared by Member States and Eurostat, whose respective responsibilities also have to be clearly defined. In fact, the ultimate purpose should be the set-up of Current Best Methods (the description of the best methods available for a specific process) in validation for Member States and Eurostat, leading to efficiency gains and to an improvement in data quality as mentioned above. To this end, the introduction of new processes or of process changes, the adoption of new solutions and methods, and the promotion of know-how and information exchange are sought. Therefore, the rules, procedures and methods should be discussed, and recommendations should be provided that are not only based on sound statistical methodology but are also commonly used and widely tested in practice. The structure of this document is the following: the next sections discuss the three validation components mentioned above, in that order, listing the main problems that may arise, providing some guidance for their detection and correction and indicating who should run validation at each stage. Some examples of validation procedures in surveys conducted in Member States, the USA and Canada are also provided. They are only a few illustrative examples of the main rules and procedures used.

2. DATA EDITING

In data validation, the data have to be checked for correctness, consistency and completeness in terms of number and content, because several errors can arise in the collection process, such as:
- failure to identify some population units or inclusion of units outside the scope (under- and over-coverage);
- difficulties in defining and classifying statistical units;
- differences in the interpretation of questions;
- errors in recording or coding the data obtained;
- other errors of collection, response, coverage, processing and estimation for missing or misreported data.

The purpose of any checks is to ensure a higher level of data quality. It is also important to reduce the time required for the data editing process, and the following procedures can help:
- Electronic data processing: data should be checked and corrected already when provided by the respondents. Supplying data by electronic means should therefore be encouraged (electronic questionnaires and electronic data interchange).
- Application of statistical methods: faulty, incomplete and missing data can be corrected by queries with the respondents, but errors can also be corrected through the application of statistical models, largely preserving the data structure while still meeting the requirements in terms of accuracy and timeliness of the output.
- Continuous improvement of data editing procedures: for repeated statistics, data editing settings should be adjusted to meet changing requirements, and knowledge from previous editing of statistical data should be taken into account to improve questionnaires and make data editing more efficient.
- Omitting the editing and/or correction of data whenever the change would have only a negligible impact on the estimates or aggregates.

2.1 Literature review

Although there is a large number of papers on data editing in the literature, the seminal paper by Fellegi and Holt (1976) is still the main reference. These authors introduced the normal form of edits as a systematic approach to automatic editing (and imputation) based on set theory. Following these authors, the logical edits for qualitative variables are based on combinations of code values in different fields that are not acceptable. Therefore, any edit can be broken down into a series of statements of the form "a specified combination of code values is not permissible". The subset of the code space such that any record in it fails an edit is called the normal form of edits. Any complex edit statement can be broken down into a series of edits, each having the normal form. Edit specifications contain essentially two types of statements:
- Simple validation edits, specifying the set of permissible code values for a given field in a record, any other value being an error. These can be converted into the normal form very easily and automatically.
- More complex consistency edits, involving a finite set of codes. These are typically of the form that whenever a record has certain combinations of code values in some fields, it should have some other combinations of code values in some other fields. The edit statement is then that if a record does not respect this condition on the intersection of

combinations of code values, the record fails the edit. This statement can also be converted into the normal form. Hence, whether the edits are given in a form defining edit failures explicitly or in a form describing conditions that must be satisfied, they can be converted into a series of edits in the normal form, each specifying conditions of edit failure. The normal form of edits was originally designed for qualitative variables, but it can be extended to quantitative variables, even though for the latter it is not the natural form. In that case the edits are expressed as equalities or inequalities, and a record that does not respect them for all the quantitative variables fails the edit.

A record which passes all the stated edits is said to be a clean record, not in need of any correction. Conversely, a record which fails any of the edits is in need of some correction. The advantage of this methodology is that it eliminates the need for a separate set of specifications for data corrections: the need for corrections is automatically deduced from the edits themselves, which ensures that the corrections are always consistent with the edits. Another important aspect is that the corrections required for a record to satisfy all edits change the fewest possible data items (fields), so that the maximum amount of original data is kept unchanged, subject to the edit constraints. The methods and procedures described and discussed next, as well as the proposed guidelines on data editing and correction, fit into this model of the normal form of edits, as will become clear.

2.2 Some general procedures applied in Member States

Data editing procedures depend on the specific data they concern. Therefore, as illustrative examples, we describe some of the main procedures applied by national statistical institutes. Error detection usually implies contact with the respondents, leading to the correction of the errors found.

Foreign Trade
- Some responses can only be accepted if they belong to a given list of categories (nomenclatures). The admissibility of the response is therefore checked against that list (for example, delivery conditions, nature of transaction or means of transport can only be accepted if they assume a category of the corresponding list).
- The combination of the values of some variables has to respect a set of rules; otherwise, the value of one or several of those variables is incorrect.
- Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values.
- Detection of large differences between the current period and historical data.
- Detection of non-admissible invoice or statistical values, net weights, supplementary units or prices. The detection is based on admissibility intervals for these variables, computed from historical data at a highly disaggregated level.
- Detection of large differences between the response and the values provided by other sources, e.g., VAT data.

Industrial Output
- Detection of large differences (quantities, values, prices, etc.) between the response of the current period t and the values in past periods (t-1) and (t-2). For infra-annual data, the differences between the response of the current period and the response of the same period

in the previous year are also checked. For example, for monthly data the differences between the values at time t and (t-12) are checked; for quarterly data the differences are between the values at time t and (t-4), and so on.
- Detection of large differences between the response and those provided by similar respondents, namely companies of the same industrial branch and/or in the same region, for the same variables (quantities, values, prices, etc.).

Commercial Companies Survey
- Automatic checking of the main activity code.
- Coherence of the companies' responses, mainly their balance sheets. Correction of small errors is carried out automatically.
- Coherence with the previous period is also checked.

Employment Survey
- Error detection: the respondents are surveyed twice in the same period, and the detection of large differences between the two responses leads to the deletion of the first one, i.e., the second response is considered correct and the first is considered wrong.
- Error assessment: a global error measure may be computed from the comparison between the first and the second responses for every respondent. For any given characteristic with k categories C_1, C_2, ..., C_k, the responses can be classified in the following table:

                        1st response
                C_1    C_2    ...   C_j    ...   C_k
  2nd    C_1    n_11   n_12   ...   n_1j   ...   n_1k
  resp.  C_2    n_21   n_22   ...   n_2j   ...   n_2k
         ...
         C_i    n_i1   n_i2   ...   n_ij   ...   n_ik
         ...
         C_k    n_k1   n_k2   ...   n_kj   ...   n_kk

where n_ij represents the number of respondents classified in category C_i in the second response and in category C_j in the first response. If there are no errors among the n respondents surveyed, only the elements on the main diagonal will be non-zero. The global quality index is computed as

QI = \frac{\sum_{i=1}^{k} n_{ii}}{n} \times 100\%.

If both responses agree for every respondent, QI = 100, and QI = 0 if they disagree for every respondent. This indicator is a global measure of the quality of the data in the entire survey, i.e., over every characteristic in the survey. It is also computed for every variable in the survey.

Private Sector Statistics on Earnings
- Automated checking of different items concerning salaries and occupation, namely the number of employees, salary item averages and salary item average changes relative to the previous year.

- Every item (such as the basic monthly salary) is subject to specific checking routines in order to detect errors such as negative salaries, values under the minimum salary, unusually low or high salaries or other benefits, and unusually low or high growth rates. Data are also examined at different levels of aggregation: total level, industry level and company level. If errors are found, data are analysed and corrected at the micro level. Minimum and maximum values for each salary item are checked (and corrected if wrong).

Survey of Business Owners
- Data errors are detected and corrected through an automated data edit designed to review the data for reasonableness and consistency.
- Quality control techniques are used to verify that operating procedures were carried out as specified.

Building Permits Survey
- Most reporting and data entry errors are corrected through computerized input and complex data review procedures.
- Strict quality control procedures are applied to ensure that collection, coding and data processing are as accurate as possible.
- Checks are also performed on totals and on the magnitude of the data.
- Comparisons are made to assess the quality and consistency of the data series: the data and trends from the survey are periodically compared with data on housing starts from other sources, with other public and private survey data for the non-residential sector and with data published by some municipalities on the number of building permits issued.

2.3 Some general procedures applied in Eurostat

Eurostat checks the internal and external consistency of each data set received from Member States (country data). The main checks and corrections concerning several statistical projects, made by Eurostat after discussion with the Member State involved, are as follows (a minimal sketch of such checks in code follows the list):
- Ex post harmonization of national data to EU norms.
- Data format checking.
- Classification of data according to the appropriate nomenclature.
- Rules on relationships between variables (consistency).
- Non-negativity constraints for statistics; mirror flows.
- Plausibility checks of the data.
- Balance checks, such as the difference between credits and debits.
- Aggregation of items and general consistency when breaking down information (e.g. geographical or activity breakdowns).
- Time evolution checking.
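Several of the checks just listed (non-negativity, balance checks, aggregation consistency) can be expressed as record-level edits in the spirit of the Fellegi-Holt normal form described in section 2.1: each rule names a condition that a record is not allowed to satisfy, and a record failing any rule needs correction. The following is a minimal sketch in Python; the field names (credits, debits, balance, total, breakdown) are purely illustrative and do not come from any specific Eurostat data set.

```python
# Each edit names a condition a record must NOT satisfy (normal form of edits).
# Field names are hypothetical; tolerances are arbitrary for the sketch.
EDITS = [
    ("negative flow",         lambda r: r["credits"] < 0 or r["debits"] < 0),
    ("balance mismatch",      lambda r: abs(r["balance"] - (r["credits"] - r["debits"])) > 1e-6),
    ("total != sum of parts", lambda r: abs(r["total"] - sum(r["breakdown"])) > 1e-6),
]

def failed_edits(record):
    """Return the names of all edits the record fails (empty list = clean record)."""
    return [name for name, violates in EDITS if violates(record)]

record = {"credits": 120.0, "debits": 80.0, "balance": 35.0,
          "total": 200.0, "breakdown": [120.0, 80.0]}
print(failed_edits(record))   # ['balance mismatch'] -> the record needs correction
```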

More precisely, different kinds of corrections can be envisaged.

Harmonization of national data

It is necessary to ensure the comparability and consistency of national data. Statistical tables for each Member State can then be compiled and published based on the common Eurostat classification. To this end, Eurostat checks that the instructions for filling in the questionnaire have been followed by the reporting countries. When relevant differences with respect to the definitions are detected, Eurostat reallocates national statistics according to the common classification. This involves the following verifications:
- on the country and economic zone, to ensure that the contents of each country and economic zone have been filled in the same way;
- on the economic activity, to check whether all the items (sub-items) have been aggregated in the same way by Member States.

Corrections (deterministic imputation) using data from the same Member State

Corrections with direct data:
- Correction of a variable using the difference between two others, such as net flows from credit and debit flows, or flows for an individual item from the flows of two other aggregated items.
- Correction of a variable using the sum of other variables, such as flows for an aggregated item from the given individual items.
- Correction of a variable using others, such as flows for an aggregated partner zone from the flows of other partner zone(s).
- Correction of a variable by computing net amounts, such as the flows of insurance services from the available gross flows, i.e. by deducting gross claims received and gross claims paid from the gross flows.

Corrections with weighted structure (a small numerical sketch of this type of correction follows at the end of this subsection):
- Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone and other years.
- Correction of flows for a given item and a given year using an average proportion involving another item and other years.
- Correction of flows for a given item and a given partner zone using an average proportion involving another item and another partner zone.
- Correction of flows for a given item using a proportion involving two other items.

Corrections (deterministic imputation) using data from other Member States

Corrections with direct data:
- Correction of flows for the intra-EU partner zone using available bilateral flows of main EU partners.

Corrections with weighted structure:
- Correction of flows for a given item and a given year using an average proportion involving a mixed item, other EU Member States and several years.
- Correction of flows for the extra-EU partner zone using an average proportion involving intra-EU partner(s), (intra-EU + extra-EU) partner(s) and other EU Member States.
- Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone, other EU Member States and another year.
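To illustrate the "weighted structure" corrections listed above, the sketch below (with entirely hypothetical figures) imputes the missing flow of a target item in one year by applying the average proportion observed between that item and a reference item over the years in which both are available.

```python
# Hypothetical yearly flows: the reference item is fully observed,
# the target item is missing in 2021.
reference_item = {2019: 950.0, 2020: 1010.0, 2021: 980.0}
target_item    = {2019: 190.0, 2020: 205.0, 2021: None}

# Average proportion of the target item relative to the reference item,
# computed over the years where both values are observed.
shares = [target_item[y] / reference_item[y]
          for y in reference_item if target_item[y] is not None]
avg_share = sum(shares) / len(shares)

# Deterministic imputation of the missing year.
target_item[2021] = round(avg_share * reference_item[2021], 1)
print(target_item)   # {2019: 190.0, 2020: 205.0, 2021: 197.5}
```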

We next present examples of surveys where validation is performed by Eurostat.

Foreign Trade

The data sets received by Eurostat are checked according to a set of rules - the same rules as those applied by Member States - such as the following:
- checking for invalid nomenclature codes, i.e., some variables have to assume values from a given list (nomenclature);
- checking for invalid combinations of values in different variables;
- detection of non-admissible values, i.e., checking whether a variable is within a certain interval (range).

Transport Statistics

Transport statistics are available for the maritime, air, road and rail transport modes. Some of the main checks are the following:
- checking the variables' attributes, such as data format, length and type, or nomenclature codes;
- detection of non-admissible values;
- checking for invalid combinations and relationships of values in different variables.

Labour Force Survey

The Labour Force Unit collects data on employment in the Member States. The main checks are as follows:
- checking the variables' attributes, such as data format, length and type, or nomenclature codes;
- comparison of variables to detect possible inconsistencies.

Eurofarm

Eurofarm is a system aimed at processing and storing statistics on the structure of agricultural holdings, derived from surveys carried out by Member States. Its main checks are the following:
- checking the variables' attributes, such as data format, length and type, or nomenclature codes;
- checking for non-response;
- detection of non-admissible values;
- comparison of variables to detect possible inconsistencies.

2.4 Guidance on data editing

Stages of data editing

Before dissemination, data checking and editing may have to be performed at the three different validation stages mentioned in the introduction, depending on who is processing the data and on the phase of the production process. The first stage for error checking and correcting is the collection stage and concerns micro data. In general, Member States (MS) are responsible for it, since they conduct the surveys, even when Eurostat receives this type of data. The second stage concerns country data, i.e., the micro-data country aggregates sent by Member States to Eurostat. Data checking at this stage has to be performed by

Eurostat (presumably after thorough verification by the data source) and, if errors are detected, the data set should be sent back to the country involved for correction. If sending it back is not possible, Eurostat has to make the necessary adjustments and estimations. The third stage concerns aggregated (Eurostat) data before their dissemination, and a last check has to be run by Eurostat, since some inconsistencies or errors in the data may be found only at this stage. This requires further corrections by Eurostat.

Since data editing and correction depend on the specific data, we propose several procedures that can be generally applied at each stage. The actual application should select the appropriate procedures.

Micro data

Validation checks on micro data should be run by Member States, i.e., when they send their data sets to Eurostat, these sets should already have been scrutinized and be error-free. This also applies to those situations where Eurostat receives the micro data, because MS conduct the surveys and are therefore closer to the respondents, so they can detect and correct errors more efficiently. In fact, as discussed later, error correction very often requires new contacts with the respondents, which can be handled much more quickly and better by national statistical agencies. As mentioned above, it is important to reduce the time required for the data checking and editing process; to this end, automated data processing, the application of statistical methods and the continuous improvement of data editing procedures should be pursued.

Error detection

Since checking and editing depend on the specific data concerned, we next propose some procedures that can be generally applied and adapted to any particular survey:
1. Checking of the data format: the data must have a predefined format (data attributes, number of fields or records, etc.). Examples: foreign trade, industrial statistics or the employment survey.
2. Checking of the data sender, particularly for electronic submission. Example: foreign trade statistics (Intrastat).
3. Checking for non-responses: in many surveys, several respondents are known in advance, especially the largest or most important ones. If their responses are not received, it usually means that they failed to respond, and this may have a significant impact on the final data. Thus, checking for missing responses is very important. Examples: in foreign trade, industrial statistics or the building permits survey, the most important respondents (companies in the former two cases and municipalities in the latter) are perfectly known by the national statistical organizations, and if they fail to send their information, the impact on the final data may be very strong.
4. Detection of non-admissible responses:
   - Checking of the response category of qualitative variables, since responses on this type of variable have to assume a category of a given list (nomenclatures). Therefore, only responses belonging to that list can be accepted. Examples: delivery conditions,

transaction nature, means of transport, gender, occupation, main activity sector such as industrial branch.
   - Quantitative variables whose values cannot be outside a given range. Examples: salaries, income, sales, output, exports, imports, prices, weights, numbers, age, etc., have to be positive.
   - Quantitative variables whose values have to be within a given interval. These admissibility intervals have to be computed from historical data at a highly disaggregated level. Examples: unit values or prices, unit weights, the height of a building, the age of a person, the number of hours worked, income, etc., have to be inside a given interval of admissible values; salaries cannot be lower than the minimum salary, etc.
5. Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values, or between the response and VAT data. Examples: foreign trade, industrial statistics.
6. Detection of large differences between current and past values (growth rates). In particular, the value at time t (current value) should be compared with the values at times (t-1) and (t-2), for example, and, for infra-annual data, with the corresponding period of the previous year, i.e., time (t-12) for monthly data, (t-4) for quarterly data and so forth.
7. Detection of large differences between the response and those provided by similar respondents. Examples: companies of the same industrial branch and/or in the same region, for the same variables (quantities, values, prices, etc.).
8. Detection of incoherencies in the responses from the same respondent, and error assessment, since there are usually relationships and restrictions among several variables. Examples: exports or imports and total output of the same company (these variables have to be coherent); coherence in a company's balance sheets; age and marital status (for instance, a two-year-old person who is a widow). When the respondents are surveyed more than once, coherence between the responses has to be checked (usually, this is the purpose of surveying the same respondent more than once). Large differences between the two responses require corrections, and a global error measure such as the QI statistic in the Employment Survey mentioned above can be computed (error assessment). Low values of this indicator mean significant incoherencies requiring error correction.
9. Outlier detection: the last four items are related to outlier detection, which is discussed in section 4.

The number and variety of data editing and checking procedures is very large, since they depend on the specific data and country, thus requiring that the general procedures described above be adapted. Some categories and reference (admissible) values or intervals, however, are common to the different countries.

Error correction

When errors are detected in the micro data, they have to be corrected, and this should be done by Member States, even in those cases where Eurostat receives these disaggregated data. Like error detection, correction procedures depend on the particular data and disaggregation level. Therefore, we discuss the main procedures that can be generally adopted:
- Generation of the list of errors as a starting point for the correction process. The errors may have attributes such as severity and size of impact, and a score function can be used to assign their importance.
- Correction of coding, classification or typing errors and of other data attributes such as the format.
- Correction of those variables whose values can be obtained from other variables of the same respondent. Example: unit prices can be computed from the total value and the corresponding quantity.

- Contact with the respondents: most of the errors have to be resolved through contact with the respondents. Moreover, the values questioned are often correct and end up being confirmed, which can only be done by the respondents themselves, requiring such contact.
- Imputation of missing or erroneous data: in case the contact is not possible, is too expensive or its outcome is not received on time, the values requiring correction have to be discarded from the database, thus originating non-responses. These values will then have to be imputed with the methods discussed in section 3.

These last two procedures are the main reasons why Member States should be in charge of the validation of micro data, i.e., they should run validation at this stage even when Eurostat receives these data. In fact, if validation were performed by Eurostat, it would have to return the error list to the country involved for correction, which is an important loss of efficiency and may jeopardize the deadlines for dissemination. Therefore, it is very important that validation be run by MS at this stage. Note that the editing and imputation procedures should be as uniform (identical) as possible among all data sources.

Country data

The country data received by Eurostat should already have been validated at the micro level by the national statistical organizations. Nevertheless, some errors or problems can only be detected when data from the different countries are combined, compared or analysed, such as bilateral flows in foreign trade. When such errors are detected and the problem is significant, the correction should be made by Eurostat, consulting the country involved whenever possible.

Error detection

As for micro data, checking and editing depend on the specific data concerned, and therefore we propose some general procedures for error detection by Eurostat that can be applied and adapted to any particular survey:
1. Checking the data format: the data must have a predefined format.
2. Checking for incomplete data: checking whether the data are complete or whether there are missing data. The most extreme situation is when a Member State does not send its data set at all. Other examples of partially missing data are when the country total is received but not the regional breakdown or, in foreign trade, when the country total is received but not some or all of the bilateral flows.
3. Checking the classification of variables: this classification has to follow the appropriate nomenclatures.
4. Detection of different definitions in national statistics: common definitions and classifications have to be used in national statistics. The data sets supplied by Member States can only be compiled and published by Eurostat if they are based on the same classification (or one that can be mapped 1:1 or n:1), in order to ensure the comparability and consistency of national data. If divergences are found, they have to be corrected.
5. Changes in the definitions and classifications used: when the definitions and classifications adopted are changed (concepts, methodologies, surveyed population, data processing), the data will show the differences.
6. Detection of non-admissible values: the value of some variables has to be within a given range. For example, age, salaries, foreign trade flows, output or price indices cannot be negative; indices (with base 100) cannot take values that make no sense, such as decimal values or values in the order of tens of thousands.

7. Detection of incoherencies among variables: there are often relationships and restrictions among variables that have to be satisfied. When they are not, the incoherence found has to be corrected. A very simple example, among many others, is that the balance has to equal the difference between credits and debits.
8. Detection of large differences between the country's current and past values (growth rates): the current value should be compared with the previous values and, for infra-annual data, with the corresponding period of the previous year. The occurrence of such differences is usually caused by errors.
9. Search for breaks in the series, i.e., large jumps or differences in the data from one period to the next: these differences are probably caused by an error or by a change in the definitions and classifications adopted.
10. Large changes in the series length: if the number of observations in a data series supplied by a Member State changes substantially, the reason for this difference has to be checked, because it may be caused by an error, by changes in data processing, by retropolation of the series, etc.
11. Aggregated items correspond to the sum of sub-items: when the country provides the breakdown of given data, the total has to equal the sum of the parts. Similarly, when a country provides different breakdowns of the same data, such as company turnover by region and by activity, the totals of the two breakdowns have to be the same.
12. Cross-checking with other sources: the data from a given country should be checked for coherence with other data from the same country or with data from another country. If differences are found, they have to be investigated and corrected. Examples: industrial and foreign trade statistics from the same country; in foreign trade statistics, the bilateral flows reported by a country should be checked against the corresponding bilateral flows reported by its partners (mirror statistics).

Error correction

When errors are detected in country data, they have to be corrected by Eurostat, possibly after discussion with the national statistical organization involved. Like error detection, correction procedures depend on the particular data, and consequently we discuss the main procedures that can be generally adopted, taking into account that the corrections performed at Eurostat described above are appropriate.
- Harmonization of national data: if significant discrepancies arise in the statistics of a given country because of relevant differences with respect to the definitions (concepts and classifications), Eurostat has to check whether the instructions for filling in the questionnaire have been followed by the reporting countries and ask the country to recompute the national statistics according to the common definition or classification. This involves the following steps: on the country and economic zone, to ensure that the contents of each country and economic zone have been filled in the same way; on the economic activity, to check whether all the items (sub-items) have been aggregated in the same way by Member States. Moreover, the statistical agency of the country involved has to correct the problem in the future, i.e., it has to stop using its own definitions and classifications and start using those set up by Eurostat.
- Correction of the data format and variable classification: this may require a considerable programming and computational effort for large data sets, thus being time-consuming. If

the classification of the data is wrong, the data have to be regrouped based on the correspondence between the two classifications (nomenclatures) used.
- Changes in the definitions and classifications used: a warning has to be issued about those changes and about when they occurred. Retropolation of the series should be computed based on the new definitions, if possible.
- Imputation of incomplete data: when part of the data set, or the whole data set, is missing, Eurostat should first try to get the Member State to send the missing data. If this is not possible in a timely manner, the situation is equivalent to a non-response (total or partial) and Eurostat has to impute it (with the methods discussed in section 3). The solution of flagging it as "not available" is inadequate and should be avoided.
- Correction of non-admissible values: when this type of error occurs, it may be possible to determine the correct values by using other variables in the same or in other data sets. If this is not possible, the non-admissible values have to be imputed with the methods discussed in section 3.
- Correction of large differences relative to the country's past values: Eurostat should first try to get the Member State involved to correct or confirm the values leading to such differences in a timely manner. If this is not possible, the correction has to be made by Eurostat. This issue is related to outlier detection and correction, discussed in section 4. Nevertheless, some errors can be corrected (by Eurostat) with methods like the following, which are very straightforward and easy to apply:
  - Corrections using data from the same Member State. Examples: correction of a variable using the difference of two others (such as net flows from the positive and negative flows) or the sum of others (such as flows of an aggregated item from the individual items); correction of a variable using net amounts, i.e., by comparing the available net amounts with the result of computing those amounts from the difference of the variables involved (and likewise for sums); correction of a variable using others (such as flows for an aggregated partner zone from the flows of other partner zones).
  - Corrections using data from other Member States. Examples: correction of the intra-EU flows of a Member State using the available bilateral flows of its partners; correction of the extra-EU flows of a Member State using published data from other sources (such as OECD, IMF or UN) with extra-EU bilateral flows to or from that Member State.
  Note also that these simple procedures can be applied to the correction of the previous two items, namely the imputation of missing data and the correction of non-admissible values, which is very straightforward.
- Correction of incoherencies among variables: when incoherencies are found, they have to be corrected. It is sometimes possible to correct them by using other variables from the same country, such as computing the balance from the difference between credits and debits, as in the first set of examples in the previous item. In other situations, data from another country have to be used, as in the second set of examples in the previous item. When such corrections are not possible, Eurostat has to impute the values of the incoherent variable(s) by using the methods of section 3.
- Series breaks: if a break is caused by an error, it has to be corrected. If breaks are caused by other factors, such as changes in the definitions or classifications used, these changes have to be flagged or the data have to be recomputed with the previous parameters.
If this is not possible in time, it is preferable to impute the values after the break(s) with the methods of section 3 and correct them later.
- Series length: if the length changes because of an error, Eurostat should revert to the old series. Otherwise, the change should be flagged.

- Correction of incoherencies in the aggregation of data: if aggregated variables do not correspond to the sum of their parts in the breakdown, the former, i.e., the aggregate, has to be corrected. If two different breakdowns of the same data do not have the same total, the parts of each breakdown have to be checked and corrected, and it is possible that some of them will have to be imputed (section 3).
- Correction of incoherencies with other sources: if differences between alternative sources are found for the data of a given country, they have to be corrected by using the most reliable source. Sometimes the highest value is chosen from the alternatives. For example, in foreign trade statistics, when the bilateral flows between two Member States do not agree, the highest value should be used and the appropriate corrections made to the total flows of the partner that had the smallest value (mirror statistics).

Aggregate (Eurostat) data

The data sets received from the Member States have to be scrutinized and error-free before their dissemination, and the two previous editing and correction stages should be sufficient for this purpose. However, some problems or inconsistencies may become apparent only when aggregate (Eurostat) data are computed, such as growth rates, European aggregates, or bilateral flows with other geographical zones or economic entities. Moreover, the aggregate values computed for different geographical zones have to correspond to the aggregation (sum) of the countries involved. Another issue, particularly important for dissemination purposes, is that the figures published by Eurostat have to be compatible with national statistics. When such problems are detected, their cause has to be identified and corrected at the country level, since simply discarding the data received from a country (or several countries) is not an adequate solution: it provides no information on that country (or countries) and prevents the computation of Eurostat aggregates. Consequently, that solution should not be considered an option, and we are back at the previous stage of editing and correction, which means that the same methods described above for country data apply here. This is the final stage where these methods can be applied and, if no correction is possible in time for dissemination, imputation should be performed (section 3). It is preferable to use imputed data (assuming that the imputation method used is appropriate) rather than a wrong value or no value at all. After this final stage of corrections is complete, the data are ready for dissemination.

Concluding remarks

Error detection and correction in Eurostat statistical data may be performed at each of the three stages of the production process: at the micro (collection) level, at the country level and at the aggregate (Eurostat) level. Moreover, it should be performed at the earliest stage possible. Ideally, each stage should seek the complete detection and correction of any errors, leaving as few problems as possible to be solved later, because the earlier the detection, the more accurate the correction can be. This simplifies the task of the following stages, achieving higher quality and speeding up the process of data production and dissemination. Member States are responsible for the first stage and Eurostat for the other two. Nevertheless, the latter should play an important role in the coordination and harmonization of the editing procedures used by the former.
The checks and corrections applied depend on the stage and on the data set under scrutiny. The more efficient the detection and correction procedures are, the higher the quality of the data and the better the inferences drawn from them will be. Quality assessment can be made by comparing the corrected values with the corresponding revised data that will be obtained later. To this end, accuracy measures such as the mean squared error or the QI statistic of the employment

survey may be calculated (a small numerical example is given below). It is also important to keep a record of the errors detected, their sources and the corrections required, in order to avoid the former and to improve and speed up the latter in the future.
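As an illustration of the QI statistic defined for the Employment Survey in section 2.2, the sketch below computes it from a hypothetical cross-classification of first and second responses for one characteristic.

```python
# n[i][j] = number of respondents in category i at the second response
#           and in category j at the first response (hypothetical counts).
n = [
    [40,  2,  1],
    [ 3, 35,  0],
    [ 0,  1, 18],
]

total = sum(sum(row) for row in n)                 # all surveyed respondents
agreements = sum(n[i][i] for i in range(len(n)))   # main diagonal: both responses agree
QI = 100.0 * agreements / total

print(QI)   # 93.0 -> QI = 100 means full agreement, QI = 0 means none
```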

3. MISSING DATA AND IMPUTATION

Missing data caused by non-response are a source of error in any data set and require correction. To this end, imputation methods can be used to fill those gaps and provide a complete data set. Non-response errors are often the major source of error in surveys and they can lead to serious problems in statistical analysis. It is usual to distinguish missing data caused by unit non-response (total non-response) from missing data caused by item non-response (partial non-response). The former is usually dealt with by reweighting, whereas the latter is usually corrected by imputation.

3.1 Literature review

The literature on imputation of missing data is vast, and covering it thoroughly is far beyond the scope of this document. Nevertheless, we briefly discuss the main and most commonly used methods, including those used in Eurostat and in Member States. The main references in this field are Lehtonen and Pahkinen (2004), Little and Rubin (2002), which we will follow closely, and Rubin (2004). Time series models can also be used and are in fact a valid, useful and easy-to-implement approach to this problem. However, since they are another class of methods with a different perspective, based entirely on historical data, we will not consider them here.

There are two main classes of methods: single imputation methods, where one value is imputed for each missing item, and multiple imputation methods, where more than one value is imputed to allow the assessment of imputation uncertainty. Each method has advantages and disadvantages, but discussing them is beyond the scope of this document; such a discussion can be found in the references mentioned above or in Eurostat Internal Document (2000). We start by describing the former methods.

Single imputation methods

There are two generic approaches to single imputation of missing data based on the observed values, explicit and implicit modelling, and they are briefly described next.

Explicit modelling

Imputation is based on a formal statistical model and hence the assumptions are explicit. The methods included here are discussed next.

Mean imputation
- Unconditional mean imputation: the missing values are replaced (estimated) by the mean of the observed (i.e., respondent) values.
- Conditional mean imputation: respondents and non-respondents are first classified into classes (strata) based on the observed variables, and the missing values are replaced by the mean of the respondents of the same class. In order to avoid the effect of outliers, the median may be used instead of the mean. For categorical data, the mode is used for the imputation.
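A minimal sketch of conditional mean imputation, using hypothetical salary records grouped into imputation classes (the field and class names are illustrative only):

```python
# Hypothetical records: "stratum" is the observed classification variable,
# "salary" is the item to be imputed where missing (None = item non-response).
records = [
    {"stratum": "A", "salary": 1800.0},
    {"stratum": "A", "salary": 2200.0},
    {"stratum": "A", "salary": None},
    {"stratum": "B", "salary": 3100.0},
    {"stratum": "B", "salary": None},
    {"stratum": "B", "salary": 2900.0},
]

# Class means computed from respondents only.
class_means = {}
for s in {r["stratum"] for r in records}:
    observed = [r["salary"] for r in records if r["stratum"] == s and r["salary"] is not None]
    class_means[s] = sum(observed) / len(observed)

# Replace each missing value with the mean of its class.
for r in records:
    if r["salary"] is None:
        r["salary"] = class_means[r["stratum"]]

print(class_means)   # {'A': 2000.0, 'B': 3000.0} (key order may vary)
```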

Regression imputation

Deterministic regression imputation: this method replaces the missing values by predicted values from a regression of the missing item on items observed for the unit. Consider $X_1, \ldots, X_{k-1}$ fully observed and $X_k$ observed for the first r observations and missing for the last n - r observations. Regression imputation computes the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases and then fills in the missing values as predictions from the regression. Suppose case i has $X_{ik}$ missing and $X_{i1}, \ldots, X_{i,k-1}$ observed. The missing value is imputed using the fitted regression equation

$\hat{X}_{ik} = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \ldots + \hat{\beta}_{k-1} X_{i,k-1}$   (3.1)

where $\hat{\beta}_0$ is the intercept (which may be zero, leading to a regression through the origin) and $\hat{\beta}_1, \ldots, \hat{\beta}_{k-1}$ are the regression coefficients of $X_1, \ldots, X_{k-1}$ in the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases (estimated parameters or predicted values of a variable are denoted by a ^). Note that if the observed variables are dummies for a categorical variable, the predictions from regression (3.1) are respondent means within classes defined by that variable, and this method reduces to conditional mean imputation. The above regression equation has no residual (stochastic) term and therefore this method is called deterministic regression imputation.

Stochastic regression imputation: this is a similar approach to the previous one, but a random residual term is added to the right-hand side of the regression equation. Consequently, instead of imputing the mean (3.1), we impute a draw:

$\hat{X}_{ik} = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \ldots + \hat{\beta}_{k-1} X_{i,k-1} + U_{ik}$   (3.2)

where $U_{ik}$ is a random normal residual with mean zero and variance $\hat{\sigma}^2$, the residual variance from the regression of $X_k$ on $X_1, \ldots, X_{k-1}$ based on the r complete cases. The addition of the random normal variable makes the imputation a draw from the predictive distribution of the missing values, rather than the mean. If the observed variables are dummies for a categorical variable, the predictions from regression (3.2) are conditional draws (instead of conditional means as in regression (3.1)).

Implicit modelling

The focus is on an algorithm, which implies an underlying model. Assumptions are implicit, but it is necessary to check whether they are reasonable.

Hot deck imputation

This is a common method in survey practice. Missing data are replaced by values drawn from similar respondents, called donors, and there are several donor sampling schemes. Suppose that a sample of n out of N units is selected and that r out of the n sampled values of a variable X are recorded. The mean of X may then be estimated as the mean of the responding and the imputed units:

$\bar{X}_{HD} = \frac{r \bar{X}_R + (n - r) \bar{X}_{NR}}{n}$   (3.3)

where $\bar{X}_R$ is the mean of the respondent units and

$\bar{X}_{NR} = \frac{1}{n - r} \sum_{i=1}^{r} H_i X_i$   (3.4)
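As a minimal sketch of stochastic regression imputation, equation (3.2), with a single regressor and entirely hypothetical data: the regression is fitted on the complete cases, and each missing value is imputed as the fitted value plus a normal draw with the estimated residual variance.

```python
import random

# Hypothetical data: x1 is fully observed, xk has the last values missing.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xk = [2.1, 3.9, 6.2, 8.1, None, None]

pairs = [(a, b) for a, b in zip(x1, xk) if b is not None]   # the r complete cases
r = len(pairs)

# Ordinary least squares for a single regressor.
mean_x = sum(a for a, _ in pairs) / r
mean_y = sum(b for _, b in pairs) / r
beta1 = (sum((a - mean_x) * (b - mean_y) for a, b in pairs)
         / sum((a - mean_x) ** 2 for a, _ in pairs))
beta0 = mean_y - beta1 * mean_x

# Residual variance estimated from the complete cases.
residuals = [b - (beta0 + beta1 * a) for a, b in pairs]
sigma2 = sum(e ** 2 for e in residuals) / (r - 2)

# Impute each missing value as a draw from the predictive distribution.
imputed = [b if b is not None else beta0 + beta1 * a + random.gauss(0.0, sigma2 ** 0.5)
           for a, b in zip(x1, xk)]
print([round(v, 2) for v in imputed])
```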