ITEM 4.2B: METHODOLOGICAL DOCUMENTS STANDARD REPORT


Doc. Eurostat/A4/Quality/03/General/Standard_Report
Available in EN

Working Group "Assessment of quality in statistics"
Sixth meeting
Luxembourg, 2-3 October 2003 at 9 h 30
Room Ampere, Bech building

ITEM 4.2B: METHODOLOGICAL DOCUMENTS STANDARD REPORT

STANDARD QUALITY REPORT

CONTENTS

Introduction
1. RELEVANCE
2. ACCURACY
   2.1 Sampling errors
   2.2 Non-sampling errors
       Coverage errors
       Measurement errors
       Processing errors
       Non-response errors
       Model assumption errors
3. TIMELINESS AND PUNCTUALITY
4. ACCESSIBILITY AND CLARITY
5. COMPARABILITY
6. COHERENCE
7. COST AND BURDEN

INTRODUCTION

This document presents the basic guidelines for statisticians of the European Statistical System who have to report on the quality of statistics. It is in line with the Eurostat "Definition of quality in statistics" (draft October 2003) and the comprehensive handbook on How to make a Quality Report (draft October 2003). It supersedes the Eurostat Standard Quality Report of May 2002 (Eurostat/A4/Quality/02/General/Standard Report).

This document is intentionally limited to presenting the basic definitions and the necessary explanatory text for the quality components identified in Eurostat's definition of quality:
- relevance
- accuracy
- timeliness and punctuality
- accessibility and clarity
- comparability, and
- coherence.

In addition, for each topic, it presents a summary checklist of the items to be addressed in order to complete the Standard Quality Report appropriately. The last chapter presents reporting aspects of cost and burden which, although not quality dimensions, are essential elements of quality assessment.

Readers who would like more extensive information can refer to the Eurostat definition of quality and the handbook mentioned above, to the Eurostat Glossary (draft October 2003), or to the traditional statistical literature. All of these are available in the CIRCA Interest Group Quality in Statistics. To access the Circa group, visit the page:

1. Relevance

Relevance is the degree to which statistics meet current and potential users' needs. It refers to whether all the statistics that are needed are produced, and to the extent to which the concepts used (definitions, classifications, etc.) reflect user needs.

When reporting on relevance, the aim is to describe the extent to which the statistics are useful to, and used by, the broadest array of users. For this purpose, statisticians need to compile information, firstly about their users (who they are, how many they are, how important each of them is), secondly about their needs, and finally to assess how far these needs are met. In addition, the Quality Report should ideally conclude with an overall evaluation of the level of relevance of the statistical product and state the main reasons for any lack of relevance.

The Quality Report should first contain a classification and a description of the users. It should also present the translation of users' needs into appropriate statistical terminology and into products that meet those needs. Preferably, terminology and products should refer to existing statistical concepts, particularly when these are defined and agreed at international level.

The analysis of the different user needs should aim at synthesising and balancing them. This allows the statistical authority to define the statistical products of the survey (indicators, precise definitions of the indicators, domains of study the indicators will refer to, etc.) that will satisfy the synthesis of needs. The report should present this synthesis (mainly by class of users), reflect on possibly contradictory expectations among users, and report on the priorities set. If a user class is given strategic importance, a more thorough description of its needs can prove useful.

The evaluation of whether users' needs have been satisfied should be carried out using all efficient available means: user satisfaction surveys are the best scenario but, failing this, proxy measures and substitute indicators of user satisfaction should be produced using auxiliary means such as publication sales, the number of questions received, complaints, etc.

Completeness is easier to assess. The rate of non-available statistics is an indicator of the degree to which completeness has been achieved for a specific product. One problem, however, is that not all statistics are of the same value to users. A weighted ratio of non-availability that reflects the usefulness of the provided statistics probably reflects the level of incompleteness better. If a weighted ratio is not feasible, then the ordinary ratio should be accompanied by descriptive information that enables users to interpret it correctly. This can include a description of what is not available and the reasons behind it.
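As a purely illustrative sketch of the weighted non-availability ratio described above (not part of the original guidelines; the usefulness weights and the pair structure of the input are hypothetical):

    def weighted_non_availability(items):
        """items: list of (is_available, usefulness_weight) pairs,
        one per statistic that should be available."""
        total_weight = sum(w for _, w in items)
        missing_weight = sum(w for available, w in items if not available)
        return missing_weight / total_weight

    # Example: three statistics, the unavailable one being the most useful.
    print(weighted_non_availability([(True, 1.0), (True, 2.0), (False, 3.0)]))  # 0.5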

What should be reported in the Quality Report on relevance
- A description and classification of users.
- A description of the variety of users' needs (mainly by class of users), the possibly contradictory expectations among them and the priorities set. If a class of users is given strategic importance, a more thorough description of its needs can prove useful.
- References to specific documents where a more comprehensive description of needs can be found, if any.
- The main results regarding the satisfaction of users; in particular, appraisals by the most important class of users, if any.
- The main reasons for lack of relevance.
- The number or percentage of unavailable results, compared with what should be available. The reasons for incompleteness, as well as the prospects for future solutions.
- The follow-up of the user satisfaction assessment, i.e. the measures and actions taken to improve user satisfaction.
- Circulation and/or readership of publications (paper or electronic).
- The number of web hits for the relevant web pages and/or the number of downloads of specific products.

2. Accuracy

The purpose of each survey is to produce statistics, i.e. estimates of the (usually) unknown values of quantifiable characteristics of a target population. Accuracy in the general statistical sense denotes the closeness of computations or estimates to the exact or true values. Statistics are not equal to the true values because of variability (the statistics change from implementation to implementation of the survey due to random effects) and bias (the average of the possible values of the statistics over implementations is not equal to the true value, due to systematic effects).

Several types of error, stemming from all survey processes, make up the error of the statistics (their bias and variability). A certain typology of errors has by now been adopted in statistics. Sampling errors affect only sample surveys; they are simply due to the fact that only a subset of the population, usually randomly selected, is enumerated. Non-sampling errors affect sample surveys and complete enumerations alike and comprise:
1. Coverage errors;
2. Measurement errors;
3. Processing errors;
4. Non-response errors; and
5. Model assumption errors.

The variability of a statistic around its expected value is expressed by its variance, its standard error, its coefficient of variation (CV) or a confidence interval. The computation of the bias requires knowledge of the true population value and detailed knowledge of the survey processes; in practice it is usually only possible to get an idea of whether the bias is positive or negative. The variability of a statistic around the unknown true population value is expressed by the mean square error (MSE), defined as the sum of the variance and the square of the bias.
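In code, the two quantities just defined could be computed as follows (a minimal sketch; the estimate, variance and bias inputs are hypothetical):

    import math

    def cv(estimate, variance):
        """Coefficient of variation: standard error divided by the estimate."""
        return math.sqrt(variance) / abs(estimate)

    def mse(variance, bias):
        """Mean square error: variance plus the square of the bias."""
        return variance + bias ** 2

    print(cv(100.0, 25.0))  # 0.05
    print(mse(25.0, 2.0))   # 29.0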

What should be reported in the Quality Report on accuracy
- The order of magnitude (or at least the sign) of the bias of the main variables.
- Estimated CVs, confidence intervals and MSEs for the statistics; alternatively, qualitative assessments of variability.
- A statement about the types of error that have been taken into account in the estimation of variances.
- Accuracy levels for given statistics imposed by regulations, explanations for non-compliance with prescribed levels, and proposed improvements.
- Comparisons of the bias and variability of the main statistics with previous implementations of the survey (growth rates, average figures where they are meaningful, etc.).
- Specific information about sampling, coverage, measurement, processing, non-response and model assumption errors and their contribution (see below).

2.1 Sampling errors

Sampling errors affect only sample surveys and arise from the fact that not all units of the frame population (1) are enumerated. The statistics produced from a sample survey will differ from the values that would be computed if exactly the same survey operations were applied to the whole frame population.

Sampling can be of two types: probability sampling, meaning that each unit of the frame population has a known, non-zero probability of being selected in the sample, and non-probability sampling. For probability sampling, sampling theory and approximation techniques allow the estimation of the expected value and variance of statistics over all possible samples. Therefore, the CV which corresponds to sampling can in principle be estimated. Bias and MSE are harder to estimate because the true population value is not known; from sampling theory, an indication of their size and of the sign of the bias may be all that can be obtained.

The sampling CV of each statistic can be estimated as the ratio: square root of the sampling variance estimate / parameter estimate. If many similar statistics are produced (e.g. a mean for each category of a classification) then, for brevity, the report can contain the indicators for the most important statistics and a summary of the CVs (minimum, maximum, mean, median and quartiles) for the remaining ones (a sketch of such a summary is given at the end of this subsection).

When non-probability sampling is applied, it is theoretically impossible to measure the sampling error. Under the assumption that the sample obtained by such sampling is representative, i.e. resembles a probability sample, the formulae for similar probability sampling designs are used to obtain estimates of the sampling variance.

(1) The frame population is the population which is accessible for data collection. It need not coincide with, nor be a subset of, the target population. The issue is clarified in the section on coverage errors.
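The following sketch illustrates the estimated sampling CV and the suggested summary over many CVs (illustrative only; the input values are hypothetical):

    import math
    import statistics

    def sampling_cv(estimate, variance_estimate):
        """Estimated sampling CV: sqrt(sampling variance estimate) / estimate."""
        return math.sqrt(variance_estimate) / abs(estimate)

    def summarise_cvs(cvs):
        """Summary (minimum, maximum, mean, median, quartiles) of many CVs."""
        q1, median, q3 = statistics.quantiles(cvs, n=4)
        return {"min": min(cvs), "max": max(cvs),
                "mean": statistics.mean(cvs), "median": median,
                "q1": q1, "q3": q3}

    # Example: CVs of many similar statistics, summarised for the report.
    cvs = [sampling_cv(e, v) for e, v in [(100, 25), (80, 16), (50, 9), (20, 4)]]
    print(summarise_cvs(cvs))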

What should be reported in the Quality Report on sampling errors

If probability sampling is applied:
- the use of biased or unbiased estimators;
- the sampling bias of the main variables and the sampling CV for the main variables; additionally, the MSE for variables with biased estimators;
- a summary of the sampling CVs (minimum, maximum, mean, median, etc.) for the rest of the variables;
- a reference to documents or databases where complete series of CVs may be found;
- any regulation-imposed thresholds on sampling bias and variability;
- explanations for the low accuracy of the least accurate variables;
- a brief reference to the sampling design;
- the methodologies applied for variance estimation: simple formulae can be reported or the main principles stated; any bias of the variance estimates resulting from the methodology should be given;
- the name of the software and the main options used, for packages that include facilities for variance estimation;
- a mention that ad hoc computer code has been used, if that is the case;
- a mention of other factors (non-response, imputation, misclassification, etc.) whose effect has been incorporated in the estimated sampling variances; more details about them will be given in the relevant sections of the quality report.

If non-probability sampling is applied:
- the type of sampling and the exact way sample selection was carried out in the field;
- estimates of sampling bias and variance, CV, etc.;
- the assumptions used in the estimations;
- the justification, or lack of it, of the assumptions;
- a repetition of the last three items if more than one set of alternative assumptions is used;
- other error types taken into account in the bias and variance assessments.

2.2 Non-sampling errors

Coverage errors

The frame is a device that permits access to population units. The frame population is the set of population units which can be accessed through the frame, and the survey's conclusions really apply to this population. Coverage errors (or frame errors) are due to divergences between the target population and the frame population. We can distinguish the following types of coverage error:
- Undercoverage: there are target population units which are not accessible via the frame (e.g. persons without a telephone will not be listed in a telephone directory).
- Overcoverage: there are units accessible via the frame which do not belong to the target population (e.g. deceased persons still listed in a telephone directory).
- Multiple listings: target population units are present more than once in the frame (e.g. persons with two or more telephone connections).
- Incorrect auxiliary information: the auxiliary information provided by the frame may be inaccurate for some population units (e.g. the wrong size of business establishments in a business register).

Coverage errors may lead to bias and to underestimation of variance. They may or may not be detected. In any survey, every contacted population unit should be checked as to whether the frame information about it is accurate. In this way overcoverage, inaccurate auxiliary information and multiple listings can be detected. The extent of these problems among the selected units can give an idea of their extent over the whole frame. Undercoverage, on the other hand, cannot be detected in this way: specialised frame quality reviews must be undertaken, or survey resources must be allocated to the detection of target population units not present in the frame.

What should be reported in the Quality Report on coverage errors
- The type and size of coverage errors.
- Which errors have been taken into account in the production of statistics and in variance estimation (and the methodology used for this purpose).
- The possible impact (bias and extra variation) of errors unaccounted for in estimation.
- The actions taken for the assessment of undercoverage.
- Information about the frame: reference period, updating actions, quality review actions.

Measurement errors

Measurement errors are errors that occur during data collection and cause the recorded values of variables to differ from the true ones. Their causes are commonly categorised as:
- Survey instrument: the form, questionnaire or measuring device used for data collection may lead to the recording of wrong values.
- Respondent: respondents may, consciously or unconsciously, give erroneous data.
- Interviewer: interviewers may influence the answers given by respondents.

Measurement errors may cause both bias and extra variability in the produced statistics. In order to assess instrument or interviewer effects, repeated measurements would have to be taken with different instruments (e.g. alternative phrasings of questions) or different interviewers; alternatively, an experiment could be carried out with subsamples randomly allocated to different instruments and/or interviewers. Respondent effects are even harder to assess, requiring independent sources of information about the same respondents. Assessment of measurement error effects may lead to a probabilistic measurement error model, which can then be combined with the sampling model (if a sample survey is involved) to arrive at estimates of the values of interest and of their variances.

Data editing identifies inconsistencies in the data, which usually represent errors. (The errors could also be processing errors, due to coding or data entry.) The proportion of records that fail each edit is an indication of the quality of the original data. The failure rate of each edit should be calculated over the records on which the edit was applied. The rates can be combined into a single rate for the whole dataset by taking their weighted average, with weights equal to the number of records each edit was applied to. Clerical correction or imputation is usually applied in order to remove the inconsistencies from the data; the failure rates are therefore an indication of the quality of data collection and processing, not of the quality of the final data.
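A sketch of the combined edit failure rate just described (illustrative; the per-edit counts are hypothetical). Note that the weighted average of per-edit failure rates, with weights equal to the number of records checked, reduces to total failures over total records checked:

    def combined_failure_rate(edits):
        """edits: list of (records_checked, records_failed) pairs, one per edit rule.
        Returns the weighted average of the per-edit failure rates, with weights
        equal to the number of records each edit was applied to."""
        total_checked = sum(checked for checked, _ in edits)
        total_failed = sum(failed for _, failed in edits)
        return total_failed / total_checked

    # Example: two edits, applied to 1000 and 400 records respectively.
    print(combined_failure_rate([(1000, 30), (400, 20)]))  # about 0.036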

What should be reported in the Quality Report on measurement errors
- The measurement errors identified and their extent (e.g. the mean and variance of the measurement error per variable of interest), together with the methods and any error models used to assess the errors.
- An indication of whether the statistics and their variances have taken these errors into account (and the method used to achieve this).
- The remaining errors' impact on the statistics (bias and possible extra variation not accounted for).
- Indications about the causes of measurement errors.
- The efforts made in questionnaire design and testing, and information on interviewer training.
- Information about mechanisms (e.g. randomised response) used for reducing measurement error.
- The overall failure rate for the consistency edits applied to the data (it indicates both measurement and processing errors prior to correction or imputation).

Processing errors

Between data collection and the beginning of the statistical analysis for the production of statistics, the data must undergo a certain amount of processing: coding, data entry, data editing, imputation, etc. Errors introduced at these stages are called processing errors. These errors are in essence similar to measurement errors, and they cause bias and variation in the produced statistics just as measurement errors do. To assess their extent and their impact on the figures, a statistical experiment must be carried out where, for example, a sample of the questionnaires is recoded, re-entered into the computer or re-imputed (a sketch of such a re-entry comparison is given after the checklist below). During such experiments the actual sample questionnaires will have some of their errors corrected; these corrections should therefore not be taken into account in the error calculations, but should give indications about the errors in the rest of the data. On-line consistency checks during data entry may keep a log of data entry mistakes and of their extent. Finally, the method of multiple imputation explicitly suggests multiple imputed values for each missing or wrong data value, and their variability is taken into account in estimation.

Failure rates of edits are useful, as discussed in the section on measurement errors, as indicators of the quality of the original data (prior to correction). Since it is not always possible to distinguish between processing and measurement errors, cases where the two could be confounded should be marked out in the quality report.

What should be reported in the Quality Report on processing errors
- A summary of the processing the data are subjected to between collection and the production of statistics.
- The processing errors identified and their extent (e.g. the mean and variance of the processing error per variable of interest), together with the methods used to assess the errors.
- An indication of whether the statistics and their variances have taken these errors into account (and the method used to achieve this).
- An indication of the remaining errors' impact on the statistics (bias and possible extra variation caused by non-corrected errors and not accounted for in estimation).
- Indications about the causes of processing errors.
- A presentation of the processes put in place for controlling and reducing processing errors (coders' training, performance data of automatic coding software, data entry personnel's training, the data editing used, the imputation algorithms used).
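As a sketch of the kind of re-entry experiment mentioned above (illustrative only; the field values are hypothetical), a simple discrepancy rate between the first keying and an independent re-keying of a subsample could be computed as follows:

    def rekeying_discrepancy_rate(first_keying, second_keying):
        """Share of fields whose independently re-keyed value differs from the
        originally keyed value (an indication of data entry error)."""
        pairs = list(zip(first_keying, second_keying))
        return sum(1 for a, b in pairs if a != b) / len(pairs)

    # Example: one discrepancy among four re-keyed fields.
    print(rekeying_discrepancy_rate(["12", "7", "300", "5"],
                                    ["12", "7", "030", "5"]))  # 0.25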

Non-response errors

Non-response is the failure of a survey to collect data on all survey variables from all the population units designated for data collection in a sample or complete enumeration. The difference between the statistics computed from the collected data and those that would be computed if there were no missing values is the non-response error.

There are two types of non-response:
- unit non-response, which occurs when no data at all are collected about a designated population unit, and
- item non-response, which occurs when data on only some, but not all, of the survey variables are collected about a designated population unit.

The extent of response (and hence of non-response) is measured with response rates, which can be of two kinds:
- unit response rate: the ratio of the number of units which have provided data on at least some variables to the total number of units designated for data collection;
- item response rate: the ratio of the number of units which have provided data for a given variable (an item) to the total number of designated units, or to the number of units that have provided data on at least some variables.

Weighted response rates sum the sampling weights of the units instead of counting them. Value-weighted response rates sum the values of auxiliary variables instead of the sampling weights. A sketch of these rates in code is given at the end of this subsection.

The impact of non-response on the statistics is that it increases their variability and introduces bias. Variability increases because non-response simply reduces the available number of responses. Bias is introduced by the fact that non-respondents may differ from respondents in their values of some survey variables.

The estimation method used to produce the statistics and to estimate the variance should, as far as possible, take account of remaining non-response and of the imputation methods used to replace missing values. Random subsamples of non-respondents, or randomised response mechanisms and the data they provide, can also be incorporated in the estimation of statistics using probabilistic arguments. If data are still missing after all possible efforts, a response model may be assumed, along with suitable assumptions about the mechanism which causes non-response, which assigns probabilities of item non-response to the units. This leads to a re-weighting of the units. Auxiliary variables may also be used in the estimation.
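A sketch of the unweighted and weighted unit response rates defined above (the unit records and sampling weights are hypothetical):

    def unit_response_rate(units, weighted=False):
        """units: list of (responded, sampling_weight) pairs, one per unit
        designated for data collection. With weighted=True the rate sums
        sampling weights instead of counting units."""
        if weighted:
            return (sum(w for responded, w in units if responded)
                    / sum(w for _, w in units))
        return sum(1 for responded, _ in units if responded) / len(units)

    # Example: three designated units, one non-respondent with weight 3.
    units = [(True, 1.0), (True, 2.0), (False, 3.0)]
    print(unit_response_rate(units))                 # about 0.67 (unweighted)
    print(unit_response_rate(units, weighted=True))  # 0.5 (weighted)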

What should be reported in the Quality Report on non-response errors
- The non-response (remaining after call-backs or data collection from other sources, but before imputation): unit and item non-response rates for the main variables, both unweighted and weighted.
- The imputation methods used (if any).
- A statement of whether the statistics and the variances have taken non-response (and the methods used to correct it) into account, and the estimation methods used.
- Findings about the similarity, or otherwise, between non-respondents and respondents for the main survey variables.
- Indications of the remaining non-response impact on the statistics (bias and possible extra variation not accounted for).
- Indications about the causes of non-response.
- Information about call-backs or the collection of data from other sources, with an indication of the accuracy of the latter.
- Information about mechanisms (e.g. incentives, legal obligations of respondents, interviewer training, randomised response) used for reducing non-response.

Model assumption errors

Very often, statistical models need to be estimated and used in the estimation phase of a survey. Every modelling activity, that is, the selection of a model, the collection of relevant data and the estimation of the model's parameters, involves certain assumptions, ranging from the model's parametric form to the assumptions necessary for its estimation. In survey estimation, a further assumption is that any data used, apart from those collected in the survey, are accurate. If some of the assumptions are violated, the accuracy of the survey's statistics will be affected. Model assumption errors will probably lead to bias in the final statistics; moreover, the uncertainty about the models' appropriateness, as well as the variability of their parameter estimators, will lead to increased variance of the statistics.

The assumptions used in modelling should be thoroughly checked. Sensitivity analyses should also be carried out, with the help of simulations, in order to check the robustness of the survey statistics to the assumptions (a sketch is given after the checklist below). If the additional data used in the estimation of the models are of unknown accuracy, an errors-in-variables modelling method should also be used. The estimation of the survey's statistics should take into account the variability introduced by the use of the models. Unfortunately, the same cannot be done for the bias; the latter can only be assessed qualitatively.

What should be reported in the Quality Report on model assumption errors
- The models used in the production of the survey's statistics and the assumptions on which they rely.
- Evidence about the validity of the assumptions.
- The models' estimation process.
- A statement about the accuracy of any additional data used in the models' estimation.
- Evidence about the robustness of the survey's statistics against the assumptions (the outcome of sensitivity studies carried out with the help of simulations).
- A statement of whether the statistics and their estimated variances take into account the uncertainty of the modelling assumptions.
- An indication of any remaining (unaccounted-for) bias and variability which could affect the statistics.
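A minimal sketch of the simulation-based sensitivity analysis suggested above. The `statistic` and `perturb_assumptions` callables are hypothetical placeholders: the first computes a survey statistic from the data under a given model, the second re-draws or perturbs whatever the model assumes:

    import statistics

    def sensitivity_analysis(statistic, data, perturb_assumptions, n_sim=200):
        """Recompute a statistic under n_sim randomly perturbed versions of the
        model assumptions and summarise the spread of the results. A robust
        statistic shows a small spread relative to its estimated standard error."""
        results = [statistic(perturb_assumptions(data)) for _ in range(n_sim)]
        return {"mean": statistics.mean(results),
                "stdev": statistics.stdev(results),
                "min": min(results), "max": max(results)}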

3. Timeliness and punctuality

Timeliness of statistics reflects the length of time between their availability and the event or phenomenon they describe. Punctuality refers to the time lag between the release date of the data and the target date on which they should have been delivered, with reference to dates announced in some official release calendar, laid down, for instance, by Regulations or previously agreed among partners.

Timeliness is relatively easy and straightforward to measure. A common measure is the average production time (over a number of survey implementations). The maximum production time can also be useful, as it records the worst case.

Punctuality and timeliness are connected with the frequency of the released statistics: monthly data, for example, should not become available too many months after the reference month, otherwise there is an obvious loss of interest. It is therefore interesting to benchmark production time against the periodicity of the statistics by computing the ratio of production time to periodicity. This allows some comparisons between surveys with different periodicities. Where quality standards have been set up, they can be used for benchmarking as well, either by taking the ratio of the actual production time to the standard, or the ratio of the difference to the standard. The difference between the actual production time and the target can also be used.

Data freshness, referring to how current statistical data are, provides another framework for combining survey frequency and timeliness. In many cases statistical data are released periodically, helping users to monitor specific developments. In this case, what is important for users is the time lag between the present and the reference time of the last available statistics. If we assume a constant production time t, the freshness of the statistics is never more than -t and its lowest value is -(t+T), where T is the periodicity of the survey. A useful indicator is the average value of data freshness over any day in the year, and a natural choice is F = -(t + T/2). A sketch of these indicators follows the checklist below.

What should be reported in the Quality Report on timeliness and punctuality
- The average timeliness of the data.
- The data frequency and the average data freshness.
- The percentage of late data releases, based on the scheduled dissemination dates laid down in Regulations, official timetables or other agreements.
- The mean delay of data not delivered punctually, assessed in appropriate units: number of days, working days, weeks and so on.
- The maximum observed delay.
- The reasons for late delivery: bottlenecks in the production phase, breakdowns, strikes, etc.
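A sketch of the indicators discussed in this section (the production time, periodicity and delay figures are hypothetical):

    def production_to_periodicity_ratio(production_time, periodicity):
        """Benchmark of production time against the periodicity of the survey."""
        return production_time / periodicity

    def average_freshness(production_time, periodicity):
        """Average data freshness F = -(t + T/2), in the same unit as t and T."""
        return -(production_time + periodicity / 2.0)

    def punctuality_summary(delays):
        """delays: days late per release (zero or negative means on time)."""
        late = [d for d in delays if d > 0]
        return {"share_late": len(late) / len(delays),
                "mean_delay_of_late": sum(late) / len(late) if late else 0.0,
                "max_delay": max(late) if late else 0.0}

    # Example: a monthly survey (T = 1 month) with a production time of 2 months.
    print(average_freshness(2.0, 1.0))  # -2.5 months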

4. Accessibility and clarity

Accessibility and clarity refer to the simplicity and ease with which users can access statistics, using simple and user-friendly procedures, obtaining them in the expected form and within an acceptable time period, with the appropriate user information and assistance: a global context which finally enables them to make optimum use of the statistics.

Accessibility refers to the physical conditions in which users can access statistics: distribution channels, ordering procedures, time required for delivery, pricing policy, marketing conditions (copyright, etc.), availability of micro or macro data, media (paper, CD-ROM, Internet), etc.

Clarity refers to the statistics' information environment: appropriate metadata provided with the statistics (textual information, explanations, documentation, etc.); graphs, maps and other illustrations; the availability of information on the quality of the statistics (possible limitations in use); and the assistance offered to users by the NSI.

The evaluation of accessibility can take many forms, since it is affected by the many aspects of a dissemination practice: (a) dissemination channels, (b) the ease with which a user can get the product, (c) the form of the available datasets (microdata or aggregate figures), and (d) pricing policies.

Clarity is more difficult to assess and relates to the quality of the statistical metadata which are disseminated alongside a statistical product. In the quality framework used by Statistics Canada, clarity is seen as the relevance of statistical metadata: in effect, it refers to the extent to which the metadata satisfy users' needs. Assessment requires information both from the producer, for the description of the accompanying information, and from the user, for assessing the adequacy and appropriateness of such information for future use. A standard set of headings for each product (such as a template) might help to examine, at least in a qualitative manner, whether the provided metadata are complete.

What should be reported in the Quality Report on accessibility and clarity
- A summary description of the conditions of access to the data: media, support, marketing conditions, possible restrictions, any existing service-level agreement, etc.
- A summary description of the information accompanying the statistics (documentation, explanations, quality limitations, etc.).
- A summary description of any further assistance available to users.
- A summary of user feedback.
- A presentation of possible improvements, compared with the previous situation.

5. Comparability

Comparability aims at measuring the impact of differences in the applied statistical concepts and definitions when statistics are compared between geographical areas, non-geographical domains, or over time. We can say that it is the extent to which differences between statistics can be attributed to differences between the true values of the statistical characteristics.

The factors that may cause two statistical figures to lose comparability are attributes of the surveys that produce them. These attributes may be grouped into two major categories: (a) the concepts of the survey and (b) the measurement/estimation methodology.

Concepts: in order to plan a survey, many entities must be defined in advance. Such entities include the reference population, the characteristics themselves, the classes of a classification of the population, etc. If two surveys of the same characteristic, in different countries for example, do not use exactly the same definitions, their statistical products will lack comparability. The extent of the lack will be directly related to the difference in the definitions.

Measurement/estimation methodology: measurement aspects include the methods of measurement and data collection and the related substantive analysis, and should be strictly standardised so as to control (make similar) the measurement biases in the comparisons. Estimation aspects include many operational aspects, as well as weighting, estimation and other aspects of statistical analysis. Generally, these have to be chosen flexibly to suit the conditions and requirements of the individual populations in the comparison. What is required is not identical procedures, but common standards to be followed.

The list that follows presents a detailed breakdown of aspects that may affect the comparability of statistical figures. It can be adapted to any domain of application by adding aspects that are pertinent to the domain:

1 Concepts
1.1 Statistical characteristics
1.2 Statistical measure (indicator)
1.3 Statistical unit
1.4 Target population
1.5 Frame population
1.6 Reference period and frequency
1.7 Study domains
1.8 Geographical coverage (for comparability over time)
1.9 Standards
1.10 Structure effects
1.11 Conceptual aspects specific to the domain under study

2 Measurement
2.1 Sample design
2.2 Data collection
2.3 Data processing
2.4 Estimation
2.5 Measurement aspects specific to the domain under study (these include characteristics pertinent to any specific domain, e.g. thresholds for Foreign Trade Statistics)

The following kinds of comparability may be discerned:
- Geographical comparability: the degree of comparability between similar surveys that measure the same phenomenon, are conducted by different statistical agencies and refer to populations in different geographical entities.
- Comparability over time: the degree of comparability between two instances of the same survey at different points in time.
- Comparability between domains: the comparability between different surveys which target similar characteristics in different statistical domains.
- Combinations of the above: comparability problems can arise in pairs, for instance when one wants to use the time series of several countries in order to compare forecast values.

5.1 Geographical comparability

The closest we can come to an absolute measurement of geographical comparability is by comparing a figure with a gold standard. This can be a European norm or a model survey from a single country. If we have figures from a number of countries, we can obtain an absolute assessment of comparability by making all pairwise comparisons between the figures and then summarising them.

Statistical figures may be compared by comparing the respective metadata of the surveys that produced them. The first step, therefore, is to determine a comprehensive list of aspects that can cause incomparability and to assemble the necessary (meta)data from the countries. Any differences in the metadata must then be quantified. One effective way is to use sensitivity analysis and simulations; another is to apply each set of definitions to each dataset.

A conceptually simple approach is to devise a scoring scheme which assigns a score to each metadata difference according to its effect on comparability. In the end, by adding the scores, we see in how many of the listed metadata items the two regions differ. For pairwise studies, the scoring scheme leads to a two-way symmetric matrix with the regions defining its rows and columns; the entry of each cell is the comparison score of the two regions to which the cell corresponds. To find an overall score for a region we can add, or take the mean of, the scores in the row (or column) to which it corresponds (a sketch of such a computation is given at the end of this subsection). This matrix can also be used as input for further analysis: applying multidimensional scaling methods, for example, will produce a two-dimensional scatter diagram of the regions in which the relative magnitude of the distances between them is proportional to their methodological differences. The closer two regions are on the map, the more comparable their statistical figures are.

Another way to assess the geographical comparability of statistics in some domains is by using mirror statistics. These typically consist of two matrices, with inbound flows in one and outbound flows in the other. The absolute differences between the inbound and outbound flows for a pair of countries can be summed for each country, yielding an indicator based on discrepancies. This value may be interpreted as an indication of the country's comparability with the rest.

What should be reported on Geographical Comparability
- Brief descriptions of all concepts and methods that can affect the comparability of the results.
- Differences between national practices and European standards (if such standards exist), together with a (preferably quantitative) assessment of the effect of each reported difference on the estimates.
- In the case of comparisons between certain regions: difference scores, comparison matrices, multidimensional scaling maps; the scoring method used to quantify the metadata differences should also be reported.
- In the case of mirror statistics: comments on the discrepancies that appear in the mirror statistics.
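The sketch below illustrates the pairwise scoring matrix and the mirror-statistics discrepancy indicator described above (illustrative only; the `score_difference` helper and the flow matrices are hypothetical placeholders):

    def comparison_matrix(metadata, score_difference):
        """metadata: dict mapping region -> metadata record.
        score_difference: hypothetical helper scoring the difference between
        two records. Returns the symmetric score matrix and an overall
        (mean) score per region."""
        regions = sorted(metadata)
        matrix = {a: {b: (0.0 if a == b
                          else score_difference(metadata[a], metadata[b]))
                      for b in regions} for a in regions}
        overall = {a: sum(matrix[a].values()) / (len(regions) - 1)
                   for a in regions}
        return matrix, overall

    def mirror_discrepancy(inbound, outbound):
        """inbound[a][b]: flow into country a as reported by a;
        outbound[b][a]: the same flow as reported by country b.
        Returns, per country, the sum of absolute discrepancies."""
        return {a: sum(abs(inbound[a][b] - outbound[b][a])
                       for b in inbound[a])
                for a in inbound}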

5.2 Comparability over time

Inconsistencies over time occur when the data collected for a specific reference period are not fully comparable with the data of the following periods, due to peculiarities in certain time periods. In such cases we say that there is a break in the time series. The difference in concepts and methods of measurement between the two reference periods should be examined. The measurement of comparability between two different instances of the same survey may be achieved in a way similar to that used for geographical comparability.

What should be reported on comparability over time
- The reference period of the survey in which the break occurred. It should specifically be mentioned (a) whether the reported difference is a one-off policy with limited implications for the time series or a policy adopted for the future, and/or (b) whether the reported change led to harmonisation with any standards.
- The difference in concepts and methods of measurement before and after the break.
- A description of the difference (changes in classification, in statistical methodology, in the statistical population, in the methods of data manipulation, etc.).
- An assessment of the magnitude of the effect of the change, in as quantitative a way as possible.

5.3 Comparability between domains

Users frequently compare statistics from different domains, which are often defined according to classifications. For instance, these classifications may relate to economic activities, size classes, products, modes of transport, sex, etc. The differences in the concepts used for the estimation of the statistics should be reported. This concerns mainly the definition of the statistical characteristics, the reference period, the definition of the statistical unit and the statistical measure. There are various domains between which such a comparability assessment is of value. For example, small business and large business statistics may be based on different sources, thus rendering all concepts and standard definitions targets of comparability assessment. The measurement of comparability between domains may be achieved in a way similar to that used for geographical comparability.

What should be reported for comparability between domains
- The inclusions in and exclusions from the definitions for each survey and/or source.
- The collection methods (the same survey, different surveys, a census, administrative sources).
- The target and frame populations used for each survey and the sampling methods, sampling units, etc.
- A description of the differences in classifications, statistical methodology, statistical population, methods of data manipulation, etc.
- An assessment of the magnitude of the effect of the differences, in as quantitative a way as possible.

6. Coherence

The coherence of statistics is their adequacy to be reliably combined in different ways and for various uses. It is, however, generally easier to show cases of incoherence than to prove coherence.

When originating from different sources, and in particular from statistical surveys of a different nature and/or frequency, statistics may not be completely coherent, in the sense that they may be based on different approaches, classifications and methodological standards.

Both coherence and comparability refer to one dataset with respect to another. The difference between the two is that the basis for deciding whether two datasets are coherent is inconsistencies between the actual data, whereas comparability can usually be assessed only on the basis of metadata. This is because comparability refers to comparisons between statistics based on usually unrelated populations, while coherence refers to comparisons between statistics for the same or largely similar populations. Having said that, we should note that exceptions exist: in the case of flows, two agencies compute statistics on the same populations for two countries, and coherence and geographical comparability are assessed in the same way.

There are several areas in which coherence can be assessed, some of which are described in the following paragraphs.

6.1 Coherence between provisional and final statistics

Provisional and final statistics are normally based on the same concepts and data collection methods. However, for provisional statistics less information may be available and the processing must be quicker. The lack of coherence between provisional and final statistics can materialise in terms of reliability or detail. This is due to a well-established trade-off between accuracy and timeliness.

An (in)coherence metric must be chosen, and a threshold must be established that defines when provisional and final statistics differ too much. Two possible metrics are: (a) the absolute percentage error, which is the absolute value of the difference between the provisional and final statistics divided by the final statistic, and (b) the unbiased absolute percentage error, which is the absolute value of the difference between the provisional and final statistics divided by the average of the provisional and final statistics.

6.2 Coherence of annual and short-term statistics

For many characteristics, statistics have to be produced with both infra-annual and annual frequencies. These statistics are often produced according to different methodologies. A simple validation method consists in comparing estimates of annual average levels or totals when both frequencies provide estimators in levels, and in comparing annual growth rates when at least one of the statistics is an index. The same metrics as in the previous section can be used to establish when a difference is so large that an explanation is deemed necessary. The same approach can be applied in cases where statistics should not necessarily be equal but should not diverge too much.
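A sketch of the two coherence metrics defined in section 6.1 (the provisional and final values used in the example are hypothetical):

    def absolute_percentage_error(provisional, final):
        """|provisional - final| divided by the final statistic."""
        return abs(provisional - final) / abs(final)

    def unbiased_absolute_percentage_error(provisional, final):
        """|provisional - final| divided by the average of the two statistics."""
        return abs(provisional - final) / abs((provisional + final) / 2.0)

    # Example: a provisional figure of 95 revised to a final figure of 100.
    print(absolute_percentage_error(95.0, 100.0))           # 0.05
    print(unbiased_absolute_percentage_error(95.0, 100.0))  # about 0.051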

6.3 Coherence of statistics in the same socio-economic domain

Frequently, a group of statistics, possibly of different types (in monetary value, in volume or at constant prices, price indicators, etc.), measure the same phenomenon but from different approaches. For instance, business short-term statistics like turnover, value added or variations in stocks may be compared on a yearly basis with the results of Structural Business Statistics after deflation by production prices. It is very important to check that these representations do not diverge too much, in order to anticipate users' questions and prepare corrective actions.

For statistics of flows (e.g. trade, transport, balance of payments, tourism), mirror statistics give an idea of accuracy through discrepancies or asymmetries. However, there are often differences in the concepts used.

6.4 Comparison of statistics with national accounts

In order to advise users on the information source best suited to their needs, it may also be useful to compare survey statistics with national accounts. The methodology used for the national accounts would need to be described for the statistics considered, including the primary data source and the adjustments made. Divergences in the concepts should also be taken into account.

What should be reported in the Quality Report on coherence

Coherence between provisional and final statistics:
- A comparison of provisional and final statistics for the main characteristics. If possible, a division of the total error into sampling errors, coverage errors, measurement errors, processing errors, non-response errors and model assumption errors.
- Explanations and other comments for differences which are considered large.

Coherence of annual and short-term statistics:
- A comparison, on an annual basis, of statistics and growth rates, if relevant, taking into account the overall accuracy with which both kinds of statistics should be estimated. If the differences are not fully explained by the accuracy components, differences in national concepts should be investigated and assessed.
- Comments on the results.

Coherence of statistics in the same domain:
- The annual differences for the common characteristics, broken down by accuracy component and by differences in national concepts.
- Summaries of the mirror statistics. An estimation of the mirror asymmetries due to the differences in concepts and in accuracy.
- Comments on the results.

Coherence with national accounts:
- A summary of the comparison.

7. Cost and Burden

Cost - for whoever finally bears it - and respondent burden are aspects of the quality assessment task, in the sense that the quality of statistics cannot be regarded in isolation from them.

The assessment of the cost associated with a statistical product is a rather complicated task, since there must exist a mechanism for apportioning shared costs (for instance, the business register or shared IT resources and dissemination channels) and overheads (office space, utility bills, etc.), and this mechanism must be detailed and clear enough to allow international comparisons among agencies with different structures. In the first instance, it is proposed to limit the investigation to the direct costs borne by the National Statistical Institutes. In a further step, additional costs, such as charges paid by users, may also be considered; it is likely that, for the time being, they represent a negligible share for most current statistical operations.

Response burden cannot easily be expressed in financial terms, but rather in the time spent filling in questionnaires or responding to an interviewer. However, the Office for National Statistics, UK, has developed a method for measuring the response burden of enterprises in financial terms. This is presented in the Eurostat document on How to make a quality report.

What should be reported in the Quality Report on cost and burden
- The costs borne by the National Statistical Institutes.
- The response burden, evaluated with the ONS method; or, failing that, an evaluation of the burden on respondents in physical terms only (time required for response, etc.).