Combining technical standards for statistical business processes from end-to-end

Dušan Praženka, Peter Boško
e-mail: prazenka@infostat.sk
e-mail: bosko@infostat.sk

Abstract

The paper discusses the technical standards defined in SDMX and DDI for the standardization and exchange of data and metadata. It gives a brief overview of these two standards, which are the main ones relevant to official statistics. The key question is how, and at what stage of the statistical business process, each standard can be used, and whether the standards should instead be combined or linked. The relationship between SDMX and DDI, their common features and their differences are presented. The analysis also takes into account the sequence of processes described by the Generic Statistical Business Process Model (GSBPM, as defined by UNECE/Eurostat/OECD). Some practical experience with applying SDMX to statistical micro data in data collection is added, and a proposal for possible further development of the standards for use in official statistics is outlined.

Keywords: SDMX, DDI, GSBPM, statistical questionnaire, statistical processes, Eurostat, NSI

1. Standardization efforts for statistics

Standardization of data and metadata is an important issue in the processing and dissemination of statistical data. A great deal of the standardization effort in statistics was initiated by international organizations that compile, analyze and publish statistical information received from different countries. This is because the statistical data collected from national statistical institutes (NSIs) and other sources differ in content and structure and may therefore be interpreted in various ways, which can reduce data comparability and consistency. Metadata received together with the data in common standards help these organizations to interpret the data properly.
The metadata standards developed so far are oriented mostly towards supporting data exchange between statistical agencies and international organizations. It should be noted, however, that in practice this is less an exchange than a collection of data from the national statistical agencies: most NSIs act as respondents, sending their data to these organizations. Since the NSIs have usually not adopted the same standards in their own production of statistics, they have to translate the data (and metadata) from their internal standards, if any, to the standard required. In many cases some manual work is needed to prepare the responses. Current projects, supported mainly by Eurostat, aim to reduce the amount of extra work in preparing data for exchange by using the SDMX standards.

It should be noted that the problems of data and metadata standardization specific to statistical data production at the NSI level have not received particular attention so far. However, recent initiatives indicate that this situation is about to change.

2. SDMX and DDI introduction

Two important initiatives introducing standards for statistical data exchange have emerged in the statistical community over the last decade: SDMX and DDI.

2.1 SDMX - Statistical data and metadata exchange

SDMX is an XML-based standard created for the exchange of statistical data and metadata. It is supported by organizations with wide international impact and strong influence on statistical activities. The organizations sponsoring SDMX are the BIS (Bank for International Settlements), ECB, Eurostat, OECD, IMF, UN and World Bank. The initiative to create the standard was launched in 2002; the version currently adopted is SDMX 2.1. Because the support for SDMX came from international organizations, SDMX is primarily used to exchange aggregated data in standard formats. The SDMX initiative aims at developing and employing more efficient processes for the exchange and sharing of statistical data and metadata among international organizations and their member countries.
The SDMX facilitates mainly:
- processes for the exchange and sharing of data and metadata using modern technology,
- access to statistical data, wherever these data may be,
- access to metadata that make the data more meaningful and usable,
- national organizations in fulfilling their responsibilities towards international organizations, by using their data as soon as they are released,
- replacing the individual approaches of international and national statistical institutes to their clients with a common platform for data collection and/or dissemination.

2.2 DDI - Data documentation initiative

The Data Documentation Initiative (DDI) is an effort to create an international standard for describing data from the social, behavioral, and economic sciences. The first versions of DDI were developed by an informal network of individuals from the social science community and official statistics. With DDI 3, the DDI Alliance was formed to facilitate the development in a consistent and ongoing fashion. The current version, 3.1, was published in October 2009.

DDI version 3 represents a major change from the preceding versions: its scope has increased. Historically, DDI was focused on data archiving. This focus remains, but the major change in version 3 is that the full data life cycle is supported. One example is the use of metadata: as the data collection process proceeds, the growing set of metadata describing this activity can be collected and expressed in DDI. As declared in the DDI documentation, DDI metadata accompanies and enables data conceptualization, collection, processing, distribution, discovery, analysis, repurposing, and archiving.

As follows from the DDI documentation [4], [5], [6], the DDI facilitates:
- Repurposing of data for different needs and applications, i.e. secondary use of the data from a study.
- Richer content, providing the potential data analyst with broader knowledge about a given collection.
- Capturing metadata throughout the life cycle of the data.
- Modular structure, allowing creators to use only those sections or modules of the DDI that are needed at the time and to add new modules as the data progress through the life cycle.
- Use of XML namespaces that allow the vocabulary to be modularized, making it more manageable and maintainable over the long run.
- Structuring the questions in the questionnaire design: description of the questionnaire, its content and question flow.
- A concept scheme that contains a list of concept terms and definitions, which may be grouped into a hierarchical structure. The concepts in the scheme are referenced by questions and variables, providing a consistent definition for all concept terms and a means of locating all questions and variables used to measure or represent a single concept.
- Interoperability. Codebooks marked up using the DDI specification can be exchanged and transported.
- Precision in searching, due to the specific tagging of elements in a DDI-compliant codebook.

3. SDMX and DDI comparison

Both DDI and SDMX are built on the same kinds of basic artefacts:
- Identifiable elements, which have an ID and can be referenced through it.
- Versionable elements, which support versions: multiple versions of the same object can exist.
- Maintainable elements, which are maintained by a specific agency.
- Notes attached to elements. In SDMX these are called annotations; they support multi-language notes and can be attached to any SDMX element.
- Both use XML technology.

Some major differences identified in the documentation:

SDMX
- Used for macro data.
- SDMX standards refer to business processes, but do not have a stateful model; SDMX is stateless.
- Handles the statistical type of questionnaire (large data matrices) better.
- Is much simpler than DDI; covers the functionality of datasets and the distinction between attributes, dimensions and groups. The DSD describes a conceptual multidimensional cube used in a Data Flow and referenced in Datasets.
- Processes metadata.
- Operates with fewer components. The basic ones are Concept, Concept scheme, Code List, Data structure definition, Metadata structure definition, Category schemes and Organizational schemes.
- Metadata content: concepts organized into schemes; Data Structure Definitions (Key Families).
- Use cases of SDMX:
  o Any statistical data exchange between any participant organizations
  o Dissemination of macro data
  o A possible new use case, now being considered in the ESSnet project, which should be an important approach to the dissemination of micro data: questionnaire data dissemination

DDI
- Used for micro data.
- Has a combined life cycle model and supports operations in every phase of the life cycle.
- Concentrates on the question-answer type of questionnaire rather than on forms with large data matrices.
- Is much more complex than SDMX in its handling of metadata. For example, it provides functionality for the definition of questionnaires, such as various types of loops within the document, complex validations and conditional question flow.
- Processes and archives metadata.
- Consists of more schemes: Category Scheme, Code Scheme, Concept Scheme, Control Construct Scheme, Geographic Structure Scheme, Geographic Location Scheme, Interviewer Instruction Scheme, Question Scheme, NCubeScheme, Organization Scheme, Physical Structure Scheme, Record Layout Scheme, Universe Scheme, Variable Scheme.
- Data collection metadata: methodology, sampling, collection strategy, questions, control constructs and interviewer instructions, organized into schemes.
- Conceptual metadata: concepts, universes, and geography structures and locations, each organized into schemes.
- Use cases of DDI:
  o Study design/survey instrumentation generation/data collection and processing
  o Data recoding, aggregation and other processing
  o Data dissemination/discovery
  o Archival ingestion/metadata value-add
  o Question/concept/variable banks
  o DDI for use within a research project
  o Capture of metadata regarding data use
  o Metadata mining for comparison, etc.
  o Generating instruction packages/presentations
  o Data sourced from registers
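The artefact properties that SDMX and DDI share (identifiable, versionable, maintainable, annotatable) and the SDMX idea of a Data Structure Definition with dimensions and attributes can be illustrated with a small data model. The sketch below is not the SDMX or DDI information model and uses no SDMX library. All class names, field names and identifiers (`Artefact`, `DataStructureDefinition`, `EXAMPLE_AGENCY`, `DSD_WAGES`, the dimension names) and the reference format returned by `urn()` are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Artefact:
    """Properties common to SDMX and DDI artefacts (illustrative only)."""
    id: str        # identifiable: can be referenced through its ID
    version: str   # versionable: several versions of one object may coexist
    agency: str    # maintainable: maintained by a specific agency
    annotations: dict = field(default_factory=dict)  # language code -> note

    def urn(self) -> str:
        # Hypothetical reference format, NOT the official SDMX/DDI URN syntax.
        return f"{self.agency}:{self.id}({self.version})"


@dataclass
class DataStructureDefinition(Artefact):
    """A DSD seen as a conceptual cube: dimensions identify, attributes qualify."""
    dimensions: tuple = ()
    attributes: tuple = ()

    def series_key(self, observation: dict) -> str:
        # Every dimension must be present; attributes are optional here.
        missing = [d for d in self.dimensions if d not in observation]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        return ".".join(str(observation[d]) for d in self.dimensions)


# Two versions of the same (hypothetical) structure, kept by the same agency.
dsd_v1 = DataStructureDefinition(
    id="DSD_WAGES", version="1.0", agency="EXAMPLE_AGENCY",
    annotations={"en": "First draft", "sk": "Prvý návrh"},
    dimensions=("FREQ", "REF_AREA", "INDICATOR"), attributes=("UNIT",),
)
dsd_v2 = DataStructureDefinition(
    id="DSD_WAGES", version="1.1", agency="EXAMPLE_AGENCY",
    dimensions=("FREQ", "REF_AREA", "INDICATOR"), attributes=("UNIT",),
)

key = dsd_v1.series_key(
    {"FREQ": "Q", "REF_AREA": "SK", "INDICATOR": "WAGES", "UNIT": "EUR"}
)
print(dsd_v1.urn(), key)
```

The multi-language `annotations` dictionary mirrors the notes that both standards allow on their elements; a real implementation would follow the respective XML schemas rather than this simplified structure.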

SDMX and DDI and their mapping to GSBPM

Processes in DDI are described in the DDI documentation [7] by the combined life cycle model (see Figure 1).

Figure 1: Combined life cycle model

The GSBPM is described in [2]. It introduces 9 groups of statistical business processes:
1. Specify Needs
2. Design
3. Build
4. Collect
5. Process
6. Analyse
7. Disseminate
8. Archive
9. Evaluate

Each group of the model is composed of several sub-processes. This makes it relatively easy to map any package that implements statistical processes to the GSBPM. The following comparison shows the mapping of DDI processes to the GSBPM:

GSBPM              DDI life cycle model
1 Specify Needs    Study Concept
2 Design           Repurposing (part)
3 Build
4 Collect          Data Collection
5 Process          Data Processing (mostly), Repurposing (part)
6 Analyse          Data Discovery, Data Analysis, Data Processing (part)
7 Disseminate      Data Distribution
8 Archive          Data Archiving
9 Evaluate

4. Use of SDMX in collection of statistical data by NSI - An application development in the ESSnet on SDMX project

4.1. Application development

Rich sources of metadata clearly exist in the processes of survey design and data processing planning. Additional sources of metadata are found in the data collection process. The metadata entering in these two processes should be preserved for further use, either internally in the course of the statistical business process cycle or externally in the distribution of statistical products. As data flow from one process to the next, the accompanying metadata are either used directly by the process, transformed, or passed through in their original form. Finally, they may be filtered or completed in the dissemination phase.

The basic objective of the project package development was to collect and preserve metadata describing the micro data at the earliest stages of the statistical business process cycle. In our approach we keep the reference and structural metadata that describe the raw data in the questionnaire design process and later in the data collection process. This corresponds approximately to the first four groups of processes of the GSBPM. In the questionnaire design phase all related metadata are kept in SDMX formats.

To facilitate the application development, a unified structure of the statistical questionnaire has been defined. The basic principles in the design of the unified questionnaire were:
- division of the e-questionnaire into sections,
- a fixed order of the sections,
- each section dedicated to particular content,
- division of the data parts of the questionnaire into data modules.

The following 6 sections have been defined for the unified questionnaire: Preamble, Identification, Information, Data, Declarative, Methodological notes.
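The unified questionnaire structure (six sections in a fixed order, with the data part divided into data modules) could be modelled roughly as follows. This is a sketch under assumptions, not the ECOLLECT-X implementation: the class names and the `validate_section_order` helper are hypothetical, the questionnaire title is invented, and the reading of the six section names is taken from the list in the text. The example data module borrows its code and row labels from the module shown in Figure 2.

```python
from dataclasses import dataclass, field

# Fixed order of the sections of the unified questionnaire.
SECTION_ORDER = (
    "Preamble",
    "Identification",
    "Information",
    "Data",
    "Declarative",
    "Methodological notes",
)


@dataclass
class DataModule:
    """One tabular block of the Data section (rows x columns of cells)."""
    code: str
    title: str
    rows: list
    columns: list


@dataclass
class Questionnaire:
    title: str
    sections: list = field(default_factory=lambda: list(SECTION_ORDER))
    data_modules: list = field(default_factory=list)


def validate_section_order(questionnaire: Questionnaire) -> bool:
    """True when all six sections are present in the prescribed order."""
    return tuple(questionnaire.sections) == SECTION_ORDER


# Example instance; the title is a placeholder, the module follows Figure 2.
q = Questionnaire(title="Example quarterly survey")
q.data_modules.append(
    DataModule(
        code="604",
        title="EMPLOYEES AND SALARIES",
        rows=[
            "Private entrepreneurs and their associates (number)",
            "Average registered number of employees in natural persons",
            "Wages and salary compensation of employees (in EUR)",
            "Compensation for standby duty outside the workplace (in EUR)",
            "Checksum (line 1 to 4)",
        ],
        columns=["1", "2"],
    )
)

print(validate_section_order(q), len(q.data_modules))
```

Keeping the section order in a single constant means that every questionnaire generated from the metadata can be checked against the same layout, which is the point of a unified structure.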
The application package of the ESSnet on SDMX project, with the working title ECOLLECT-X, is developed in two steps:

1. Definition of the questionnaire. In this step the application for preparing the survey questionnaire is developed. A user (the survey designer) uses SDMX tools to enter all metadata describing the content and form of the questionnaire. The descriptions are saved as MSD reports and are used to describe the questionnaire on the Web in the next step of the application development, which adds the functionality for filling in the questionnaire by the respondent.

2. The second step of the application development adds functionality for respondents to log in and fill in the questionnaire. The application also implements the workflow of a filled questionnaire: sending the filled values to the statistical office, possible return to the respondent, and acceptance of the values. A statistician can review the filled questionnaires at any time in the future.

Some limitations had to be taken into account: we are able to handle only simple tabular questionnaires (no complex GUI), we provide only simple validations (mandatory/optional fields), and we do not provide conditional question flow.

Figure 2: An example of a data module (closed data module) of the questionnaire. The example shows module 604, "EMPLOYEES AND SALARIES ACCOMMODATION" (to be completed quarterly), with the rows "Private entrepreneurs and their associates (number)", "Average registered number of employees in natural persons", "Wages and salary compensation of employees (in EUR)", "Compensation for standby duty outside the workplace (in EUR)" and "Checksum (line 1 to 4)" (line 99), two numbered data columns, and the note "1) If you are not VAT payer, just fill Col. 1".

The ECOLLECT-X facilitates:

Functionality for survey designers - in the questionnaire generating phase:
o Matrix form for statistical data input with descriptions of variables in the matrix cells
o Concepts definition for the questionnaire
o Methodological notes for respondents in text format
o Code lists input
o Multi-language questionnaire and code lists
o Preserving all metadata describing the questionnaire
o Preserving the metadata structure
o Saving of survey attributes and questionnaire definition
o Displaying and accessing the questionnaires on the Web

Functionality for respondents and survey administrator:
o logging in of respondents according to the duties assigned
o filling in of the questionnaire by the respondents
o sending the filled questionnaires to the statistical office
o possible return to the respondent for data editing
o acceptance messages on corrected filled questionnaires
o future review of the filled questionnaires
o simple checking procedure (formal checks)

Figure 3 illustrates the overall schema of the ECOLLECT-X system for the questionnaire generating phase.

Figure 3: A general schema of the questionnaire generating phase in the ECOLLECT-X system. Labels in the diagram: survey administrator; metadata input tools (MSD Editor); unified metadata base (code lists, concepts, methodological notes, data modules); metadata message; parser; metadata (code lists, data modules, data items in the data module); generator; Web page visualization.

Figure 4 illustrates the possible location of the ECOLLECT-X functionality within the GSBPM, in comparison with the existing standards used in statistical business processes.
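The respondent workflow listed earlier (filling, sending to the statistical office, possible return for editing, acceptance) together with the simple formal check of mandatory fields can be sketched as a small state machine. The state names, the transition table, the field names and the functions `missing_mandatory` and `advance` are assumptions made for illustration; the text does not describe how ECOLLECT-X implements the workflow.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"        # being filled in by the respondent
    SENT = "sent"          # sent to the statistical office
    RETURNED = "returned"  # returned to the respondent for data editing
    ACCEPTED = "accepted"  # values accepted by the statistical office


# Allowed moves between states (hypothetical workflow).
TRANSITIONS = {
    State.DRAFT: {State.SENT},
    State.SENT: {State.RETURNED, State.ACCEPTED},
    State.RETURNED: {State.SENT},
    State.ACCEPTED: set(),
}


def missing_mandatory(values: dict, mandatory: list) -> list:
    """Formal check: names of mandatory fields left empty (0 counts as filled)."""
    return [name for name in mandatory if values.get(name) in (None, "")]


def advance(current: State, target: State, values: dict, mandatory: list) -> State:
    """Move the questionnaire to `target` if the workflow and checks allow it."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"transition {current.value} -> {target.value} not allowed")
    if target is State.SENT:
        missing = missing_mandatory(values, mandatory)
        if missing:
            raise ValueError(f"mandatory fields not filled: {missing}")
    return target


# Hypothetical field names for one data module.
mandatory = ["employees", "wages_eur"]
values = {"employees": 12, "wages_eur": None, "standby_eur": ""}

unfilled = missing_mandatory(values, mandatory)  # optional field is ignored
values["wages_eur"] = 25400

state = advance(State.DRAFT, State.SENT, values, mandatory)
state = advance(state, State.RETURNED, values, mandatory)  # office asks for edits
state = advance(state, State.SENT, values, mandatory)      # respondent resends
state = advance(state, State.ACCEPTED, values, mandatory)
print(unfilled, state.value)
```

Running the formal check only on the transition to `SENT` matches the limitation stated in the text that only simple mandatory/optional validations are provided; conditional question flow is deliberately left out.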

Figure 4: Covering the GSBPM processes by ECOLLECT-X application package

As follows from Figure 4, the ECOLLECT-X functionality covers the processes that are also covered by DDI, with the exception of processes 5, 8 and 9. It should be noted, however, that the functionality embedded in ECOLLECT-X is oriented more towards describing the typical statistical questionnaire, with room for the input of large data matrices.

4.2. Further development

In the next phase of the system development we shall concentrate on the use of the SDMX standard in micro data dissemination. The aim is to complete the system with full functionality, making use of the metadata produced in the questionnaire descriptions. The plan is to use questionnaire metadata to generate the Data structure definition (the SDMX artefact used in data dissemination) automatically, and also to automate the creation of the SDMX infrastructure used for data dissemination (e.g. generation of the Mapping Assistant database).

5. Some conclusions

Most of the initiatives for developing standards for data exchange were taken by international bodies collecting the aggregated data of national statistics. These initiatives still continue, e.g. with Eurostat supporting SDMX applications at NSIs for the dissemination of statistical data. In the past, less activity was visible in applying standards to statistical data processing at the national level. We have tried to show that there is a tendency to expand the use of standards for statistical data exchange from the international level to the national level at the NSIs. The most recent initiatives in this respect were taken by Eurostat, by supporting the applications developed within the ESSnet on SDMX project, and, independently, by the DDI Alliance. In both initiatives similar aims can be identified, i.e. to apply data and metadata standards along the chain of statistical business processes, from data collection through data processing up to data dissemination.

At present two main streams of statistical standards can be recognized:
1. Processing, exchange and dissemination of macro data. SDMX is fully suitable for this purpose; it was designed for it and can be used without restrictions.
2. Collection, processing, distribution and analysis of micro data. Two approaches exist within this stream: one is the use of DDI; the other is the ESSnet on SDMX project described in this paper, which tries to create a tool for processing micro data within the SDMX infrastructure.

Both streams can be combined in a mixed fashion, using SDMX and DDI in an integrated solution. In addition, we can conclude that no single standard is applied to the whole chain of statistical business processes as described in the GSBPM. It should also be noted that even though the objectives of the standard developers are almost the same, the approaches differ in their coverage of the GSBPM processes as well as in the implementation strategies chosen by the developers.

References

[1] Arofan Gregory (2010), Combining Metadata Standards: Approaches and Benefits
[2] Steven Vale, Exploring the relationship between DDI, SDMX and the Generic Statistical Business Process Model
[3] GESIS, Converting General MS Word to DDI, presentation at the EDDI conference
[4] DDI 3.1 Part II User manual
[5] Wendy Thomas, Minnesota Population Center, Course on DDI 3: Putting DDI to Work for You, presentation at the 2nd Annual European DDI Users Group Meeting, Utrecht, Netherlands
[6] DDI Web page, www.ddialliance.org
[7] DDI 3.1 Part I Overview