GSBPM: a proposed evolution of the model

1 GSBPM: a proposed evolution of the model
Telling Canada's story in numbers
Paul Holness, Senior Analyst
Jackey Mayda, Director, International Cooperation and Corporate Statistical Methods
April 2018

2 Evolving data ecosystem
Rapidly changing and increasingly complex economy and society
Proliferation of data and data providers
Data revolution, ingenuity and innovation
Increased expectations and demand for real-time and micro/detailed data

3 Statistical Organizations Trends
Agency transformation: agility, flexibility, quality, efficiency, relevance
Modes of engagement and delivery: products, services, partnerships, collaboration, leadership, education
Provider partnerships, administrative and Big Data
Digital platforms and shared services
Cost optimization
Advanced methods and tools
Cross-agency, domain, levels of government, statistics to support new policy, service delivery initiatives
Locally relevant statistics (small area) for local government service delivery
Sustainable development goals and world collaboration
Internal innovation programs
Open data platforms

4 Statistics Canada Modernization Vision Statements
Pillar | Vision
User-centric Service Delivery | Users have the information and data they need, when they need it, in the ways they want to access it, with the tools and knowledge to make full use of it.
Leading-edge Methods & Data Integration | Access to new or untapped data; modify the role of surveys; greater reliance on modelling and integration; capacity through R&D environment.
Statistical Capacity Building & Leadership | To be leaders in identifying, building and fostering savvy information and critical analysis skills beyond our own perimeters.
Sharing & Collaboration | Statistics Canada has developed and nurtured strategic, innovative partnerships that allow for the open sharing of data, expertise and best practices. We are proactive, flexible and responsive to partner needs.
Modern Workforce and Flexible Workplace | Have the talent and environment required to fulfill our business needs at the time and be open and nimble to continue to position ourselves for the future.

5 Statistics Canada's Data Model Vision
This is aligned with architectures such as UK, Netherlands, Australia
Business process stages: GATHER, GUARD, GROW, GIVE
Diagram labels, in order: Data Discovery; Data needs; Negotiation; Preliminary files; Ingestion; Temporary Repository; Pre-processing; De-identification; Statistical Identification; Corporate Repository; Registers Management; Integration; Programs; Analysis; Update registers; Direct tabulation; Dissemination; Open Government
Supporting layers: Governance; Information; Metadata-driven IM; Access and Security Rules

6 DATA CHARACTERISTICS STATISTICAL SYSTEM

7 Why do we want to enhance the GSBPM?
Explore opportunities where changes in the data ecosystem have exposed gaps and challenges in the existing model
Encompass all activities undertaken in the production of official statistics that result in data outputs
Applicable to all types of data sources, not just survey data:
    Administrative sources / register-based statistics
    Non-survey sources (Big Data, earth observations, sensor data, scanner data)
    Mixed sources
Cover the comprehensive data lifecycle (including data preparation and integration)
Support multiple input/output streams and data types:
    Structured, semi-structured and unstructured
Built-in data science platform:
    Support profiling & discovery, visualisation, integration, data analytics and decision processing
Support data management and data quality
Provide built-in framework for performance measurement
Increase collaboration and promote the use of common statistical production architecture

8 Applying data visualization to the GSBPM
A few conventions (legend labels): Change; Activity; Macro Economic Accounts; STC Modernization Objective; Continuation

9 Transition from GSBPM 5.0
Overview of the data lifecycle
Collect becomes Acquisition
Process becomes Data Preparation (with sub-processes Profile & Discover and Clean & Transform)
Integration (Join, Link, Model)
Diagram label: GSBPM NL

10 Specify Needs
Sub-processes in the proposal:
1.1 Identify needs
1.2 Consult and confirm needs
1.3 Establish output objectives
1.4 Identify concepts & sensitive data elements
1.5 Check data & intelligence availability (environmental scan) & initial input data quality assessment
1.6 Prepare business case & seek approval
Changes:
GSBPM V 5.0 | GSBPM Proposal | Why change?
1.4 Identify concepts | 1.4 Identify concepts & sensitive data elements | Identify & protect sensitive information throughout lifecycle
1.5 Check data availability | 1.5 Check data & intelligence availability & initial data quality | Check availability, metadata, initial data quality
1.6 Prepare business case | 1.6 Prepare business case & seek approval | (no reason stated)

11 Design
Sub-processes in the proposal:
2.1 Design outputs
2.2 Design variable descriptions
2.3 Design data input channels
2.4 Design sample & target data strategy
2.5 Design processing and analysis
2.6 Design production systems and work flow
Changes:
GSBPM V 5.0 | GSBPM Proposal | Why change?
2.3 Design data collection | 2.3 Design data input channels | Multiple input data sources: survey, admin, streaming, earth observation, sensors, etc.; greater use of unstructured data
2.4 Design frame & sample | 2.4 Design frame & sample | Changed underlying text to support alternative data types including survey, admin, web-based, etc.

12 Build
Sub-processes in the proposal:
3.1 Build or enhance acquisition instrument
3.2 Build or enhance process components
3.3 Build or enhance dissemination components
3.4 Configure workflows
3.5 Test production system
3.6 Test statistical business process
3.7 Finalise production systems
Changes:
GSBPM V 5.0 | GSBPM Proposal | Why change?
3.1 Build collection instrument | 3.1 Build or enhance acquisition instrument | Different sources & types require alternative instruments

13 Acquisition
Sub-processes in the proposal:
4.1 Select sample & target data
4.2 Set up data acquisition (collection & ingestion)
4.3 Acquire (collect & ingest) data
4.4 Monitor acquisition, report, visualize & adjust to support data quality
4.5 Finalise acquisition (collection & ingestion)
Why change?
4.4 Monitor acquisition, report, visualize & adjust to support data quality: this sub-process refers to the monitoring and remediation of the acquisition process towards optimizing the quality of data collection

14 Data Preparation: Profile & Discover
Sub-processes:
5.1 Procure access to raw data & intelligence
5.2 Perform profile & overlap analysis
5.3 Locate, classify and mask sensitive data
5.4 Explore matching variables, features & merge analysis
5.5 Discover & map transformations from source to target
5.6 Conflict analysis (concept, definition, convention)
    Feedback to/from data provider
5.7 Build & evaluate data / statistical models
    Create unified data models (MDM)
5.8 Document profile & transformations & prepare treatment strategy
    Share program code
5.9 Export data objects
Profiling and Discovery
The analysis of information for use (in a data warehouse) in order to clarify the structure, content, relationships, and derivation rules of the data
Profiling helps not only to understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata
The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design
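
As an illustration of what sub-process 5.2 (profile & overlap analysis) might look like in practice, here is a minimal sketch in Python. It is not taken from the presentation: the function names `profile_column` and `key_overlap`, the sample records and the choice of summary measures are all assumptions made for the example.

```python
from collections import Counter

def profile_column(records, column):
    """Summarize one column: row count, missing values, distinct values, top value."""
    values = [r.get(column) for r in records]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

def key_overlap(source_a, source_b, key):
    """Share of distinct key values in source_a that also appear in source_b."""
    keys_a = {r[key] for r in source_a if r.get(key)}
    keys_b = {r[key] for r in source_b if r.get(key)}
    return len(keys_a & keys_b) / len(keys_a) if keys_a else 0.0

# Hypothetical sample data: a small survey file and an administrative file.
survey = [
    {"id": "A1", "province": "ON"},
    {"id": "A2", "province": ""},
    {"id": "A3", "province": "QC"},
    {"id": "A4", "province": "ON"},
]
admin = [{"id": "A1"}, {"id": "A3"}, {"id": "B9"}]

province_profile = profile_column(survey, "province")
overlap = key_overlap(survey, admin, "id")
print(province_profile)
print(f"Share of survey ids found in admin source: {overlap:.0%}")
```

A profile like this (missing counts, distinct values, key overlap between sources) is the kind of evidence that could feed the early go/no-go decision on a candidate source.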

15 Data Preparation: Clean & Transform
Sub-processes:
6.1 Standardize attribute formats
6.2 Parse, tokenize & map attributes to fields or concepts
6.3 Normalize abbreviations, honorifics & stop words
6.4 Classify & code attributes
6.5 Review, validate attributes
6.6 Edit & impute attributes
6.7 Derive new variables & units
6.8 Finalize unified source data files
6.9 Measure & document the impact of cleansing & transformation & lineage
Examples of cleansing & transformation:
Convert all letters to lower case
Remove all punctuation marks (avoid if seeking emojis)
Remove all numerals (avoid when mining for quantities)
Remove all extraneous white space
Remove characters within brackets
Replace all numerals with words
Replace abbreviations
Replace contractions
Replace all symbols with words
Remove stop words and uninformative words
Stem words and complete stems to remove empty variation
Phonetic accent representation
Neologisms and portmanteaus
Poor translations or foreign words
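
Several of the cleansing examples listed on this slide can be chained into a single text-normalization step. The sketch below is only one possible arrangement, written in Python for illustration: the function `clean_text`, the `ABBREVIATIONS` and `STOP_WORDS` tables and the sample address are hypothetical and do not come from the presentation.

```python
import re
import string

# Hypothetical lookup tables; a real program would maintain much larger ones.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "apt": "apartment"}
STOP_WORDS = {"the", "of", "and", "a"}

def clean_text(text, keep_numerals=True):
    """Apply a fixed sequence of cleansing steps to a free-text attribute."""
    text = text.lower()                                   # lower case
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)     # drop bracketed text
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    if not keep_numerals:
        text = re.sub(r"\d+", " ", text)                  # numerals (optional)
    tokens = text.split()                                 # also collapses white space
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # replace abbreviations
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return " ".join(tokens)

raw = "  The Office of Records (Main), 150 Tunney's St. "
print(clean_text(raw))
print(clean_text(raw, keep_numerals=False))
```

The `keep_numerals` switch reflects the caveat on the slide that numerals should be kept when mining for quantities. Recording the before and after values of each attribute would support sub-process 6.9 (measuring and documenting the impact of cleansing).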

16 Data Integration (Join, Link, Model)
Sub-processes:
7.1 Identify potential record pairs
7.2 Reduce comparison space
7.3 Compare & classify candidate record pairs
7.4 Create new or update existing integrated datasets
7.5 Assess join & linkage quality & performance
7.6 Calculate weights for unit data
7.7 Calculate aggregates, seasonality, deflation, benchmarking
7.8 Assess data quality, balance, adjust & recalculate
7.9 Document & report methods & outcomes & metadata to Picasso
Generic Record Linkage Process (diagram labels): Data Source 1; Data Source 2; Profile & Discover; Clean & Transform; Data Reduction (Blocking/Index); Field Comparison; Classification (Match, Possible Match, Unmatched); Clerical Review; Assess data quality; Staging data; Retrieve Analytical Variables; Analytical data
Type of Integration | Method | Example Identifier
Transactional Joins | Primary/Foreign Key | Record Number
Record Linkage | Imperfect identifiers | Name, Address, Postal
Statistical Linkages | Statistical & Model-based Matching | Statistical Attributes
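
The generic record linkage process on this slide (blocking, field comparison, classification into match, possible match and unmatched) can be sketched in a few lines of Python. This is an illustrative toy, not the method used by Statistics Canada: the thresholds `UPPER` and `LOWER`, the field names, the sample records and the use of `difflib.SequenceMatcher` as the string comparator are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical classification thresholds on the similarity score (0 to 1).
UPPER, LOWER = 0.9, 0.7

def block(records, key):
    """Blocking: group records by a key so only same-block pairs are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

def classify(score):
    """Classification: map a similarity score to one of three outcomes."""
    if score >= UPPER:
        return "match"
    if score >= LOWER:
        return "possible match"   # would be routed to clerical review
    return "unmatched"

def link(source_1, source_2):
    """Compare surnames for record pairs that share a postal block."""
    blocks_2 = block(source_2, "postal")
    results = []
    for rec_1 in source_1:
        for rec_2 in blocks_2.get(rec_1["postal"], []):
            score = SequenceMatcher(None, rec_1["surname"], rec_2["surname"]).ratio()
            results.append((rec_1["id"], rec_2["id"], classify(score)))
    return results

source_1 = [
    {"id": "s1", "surname": "smith", "postal": "K1A"},
    {"id": "s2", "surname": "tremblay", "postal": "H2X"},
    {"id": "s3", "surname": "wong", "postal": "V5K"},
]
source_2 = [
    {"id": "a1", "surname": "smyth", "postal": "K1A"},
    {"id": "a2", "surname": "tremblay", "postal": "H2X"},
    {"id": "a3", "surname": "diaz", "postal": "V5K"},
]

links = link(source_1, source_2)
for pair in links:
    print(pair)
```

Blocking on postal code reduces the comparison space from nine candidate pairs to three here. Production linkage would typically compare several fields and derive weights from them rather than rely on a single string similarity.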

17 Data Analytics & Decision Process
Sub-processes:
8.1 Procure access to analytical dataset
8.2 Exploratory data analysis, visualization, measure, diagnose
8.3 Data analytics
8.4 Validate outputs
8.5 Interpret and explain outputs
8.6 Assess the impact of integration on analytical outputs
8.7 Apply disclosure control
8.8 Finalise outputs
This phase is broken down into eight sub-processes, which are generally sequential, from left to right, but can also occur in parallel and can be iterative. It includes three new sub-processes.
8.1 Procuring access to analytical dataset
8.2 Exploratory data analysis: analyzing data sets to summarize their main characteristics, often with visual methods, e.g. self-service dashboards
8.3 Data analytics, which consists of three distinct types:
    8.3a Descriptive analytics (observe): uses data aggregation and data mining to provide insight into the past and answer "What has happened?", e.g. mean, median, mode
    8.3b Predictive analytics (predict): encompasses statistical techniques, including predictive modeling, machine learning and data mining, that analyze current and historical facts to make predictions about future or otherwise unknown events
    8.3c Prescriptive analytics (influence): the area of business analytics (BA) dedicated to finding the best course of action for a given situation; it is related to both descriptive and predictive analytics
8.6 Assess the impact of integration on analytical outputs
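
The descriptive measures named under 8.3a (mean, median, mode) can be computed with the Python standard library. The sketch below is illustrative only; the `household_sizes` values are made-up sample data, not figures from the presentation.

```python
import statistics

# Hypothetical sample: household sizes reported by five respondents.
household_sizes = [2, 3, 3, 5, 7]

summary = {
    "mean": statistics.mean(household_sizes),
    "median": statistics.median(household_sizes),
    "mode": statistics.mode(household_sizes),
}
print(summary)
```

Descriptive summaries of this kind answer "What has happened?"; the predictive and prescriptive types build on them with models and decision rules.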

18 Disseminate
Sub-processes:
9.1 Update output systems
9.2 Produce dissemination products
9.3 Manage release of dissemination products
9.4 Promote dissemination products
    Assess products & services
9.5 Manage user-support
    Track and measure quality of interaction with users and link to prioritization

19 Evaluate
Sub-processes:
10.1 Gather evaluation inputs
10.2 Conduct evaluation
10.3 Agree to an action plan

20 GSBPM Proposal

21 Concluding remarks
We feel the proposed changes to GSBPM support the current evolution in business process activities:
Increase visibility into the data lifecycle
Support multiple data types
Potentially improve information production timelines by accelerating data preparation
Promote standardized delivery of outputs (data, metadata and code)
Support activity-based costing by breaking down the process into appropriate pieces

22 Next steps
Feedback within Statistics Canada has been quite positive
We are seeking your input on the relevance of further development of the GSBPM in this way

23 Comments and feedback are welcome