Data Preparation and the Question of Data Quality. Harald Smith, Director, Product Management


1 Data Preparation and the Question of Data Quality Harald Smith, Director, Product Management

2 Speaker
Harald Smith, Director of Product Management, Trillium Software
~20 years in Information Management, incl. data quality, integration, and governance
Consulting, product management, software & solution development
Co-author of Patterns of Information Management, as well as two Redbooks on Information Governance and Data Integration
TRILLIUM SOFTWARE, A Harte Hanks Company

3 Agenda
Data Preparation and the question of Data Quality
  Common Challenges Facing Business Analysts
  Delivering Corporate Insights is Getting Harder
  Emergence of Self-Service Data Preparation
  What about Data Quality?
  What's relevant?
  What's fit for purpose?
  What should you measure?
  What do you need to communicate?

4 Big Data driving Business Disruption
Big Data has changed our world, and with it came disruption.
Social media, Internet of Things, machine learning, disruptive technologies
Today, businesses need to move faster than many organizations are built to move.

5 Big Data challenges across your business
Data preparation is one of the most difficult and time-consuming challenges facing business users of BI and data discovery tools, as well as advanced analytics platforms.
70% of business executives spend more than 40% of their time vetting and validating data.

6 Big Data challenges across your business
Business leaders: Lack trust in data needed to make rapid, accurate decisions that grow business
Business analysts: Can't access or understand data, and spend excessive time on data preparation
Information leaders: Must facilitate business collaboration and data transparency
IT leaders: Working with limited resources, and need to empower the business faster

7 The Challenges Facing a Business Analyst
Reliance on IT support for ETL processes causes delay in analysis
Lack of common language between the business and IT means further delay
Existing tools have steep learning curves and require basic knowledge of programming and an understanding of the platform architecture

8 The Challenges Are Getting Worse
Many more data sources are now part of business operations
These new data sources are often unstructured, causing more delay in ETL
Real-time data sources are coming online: IoT data, sensor data, social media data, etc.
Time to insight needed is now measured in days or minutes, not weeks
Many more analysts across the organization with BI tools need data preparation services
Data access policies, security, and governance add barriers to self-service integration platforms

9 What if you could:
Access data your competitors can't
Accelerate data preparation from days and months to hours and minutes
Scale analytics projects with minimal IT resources
Increase the ROI of your analytics and Big Data investments in 30 days or less
Trust your data and make better business decisions

10 Emergence of Self-Service Data Preparation
[Figure: chart with axes "Automation" and "End-user Productivity", spanning IT-centric to business-user centric. Labels: Traditional Data Integration; SSDP (Managed); SSDP (Semi-Automated); SSDP (Automated); Additional Automation; Technology disruptions. Capabilities shown: Catalog Data; Automatic Discovery; Surface Relationships; Generate Enterprise Entity-relationship Graph; Generate Metadata; Discover & Automate Data Quality Rules; Discover & Automate Governance Rules.]

11 Six Key Capabilities for Data Preparation
Self-service data preparation must address ALL of the following elements:
ACQUIRE: Source data from first- or third-party data sources
DISCOVER: Search all data sources to select which are required for the desired analysis
CLEANSE & ENRICH: Remove delimiters, spurious fields, unwanted values, etc. and/or add values to enrich analysis
NORMALIZE: Combine or modify two or more data sources based on objects of interest
TRANSFORM: Provide structure to the data so it can be combined with other data sources or viewed individually
FORMAT: Format the resulting transform to be viewed by a visualization tool

12 The Value of Data Preparation: One Customer
Challenge: Combine customer data and over 60 other data sources with existing campaign data to determine why campaign results are achieved and modify campaigns in-flight
Solution: Deploy a data preparation tool on Hadoop. Allow business users across the company to access ALL data sources quickly and easily
Benefit: Campaigns are more effective and support organizational differentiation as a service organization based on data
2016 Forbes Insights. All rights reserved.

13 The Question of Data Quality
In a world without standard or canonical information models:
  What is Data Quality? Does it matter?
  If it does matter, is it about the quality of data used? Or is it about the quality of data produced? Or both?
  And if we are measuring data quality, what is it we should measure?
"Overwhelmed by a glut of predictive analytics, machine-learning models, deep-learning applications, and other fruits of this new age, it's not clear how we will distinguish the junk from the output that has value."
James Kobielus, IBM Data Science Evangelist & Editor-in-Chief, IBM Data Magazine. Blog post, July 6, 2016: Pushing Data-Science Automation To Its Practical Limits

14 We Have Moved to a World of Digitalization
[Figure: three eras (IT Craftsmanship, IT Industrialization, Digitalization) with markers "We Were Here" and "We Are Here". Labels: Business-Driven Information Models; Information-Driven Business Models; ADOPT, IDEATE, MONETIZE, CREATE, OFFER, ENGAGE. Era captions: "IT provided innovation and capabilities"; "IT supports efficiency, effectiveness, integrity"; "Digital provides continual growth, innovation and differentiation opportunities". Data quality captions: "Huh? Why are you even asking?"; Data quality is an "IT thing"; Data quality is a "Business thing".]
2016 Gartner, Inc. and/or its affiliates. All rights reserved.

15 Critical Business Decisions driven by Trust
[Figure: decision diagram with labels YES, NO, Option 1, Option 2, Option 3]

16 Four Steps to achieving Trust
How do you reach a state of Trust?
1) Know your goal (or at least your hypothesis!)
2) Understand your data
3) Determine if the data is Fit for Purpose
   Fitness for Purpose
   Traditional Measures of Data Quality
   New Measures of Data Quality
4) Document and validate your findings or results within your data governance process
Data are of high quality if they are fit for use in their intended operational, decision-making, and other roles. Fitness implies both freedom from defects and possession of desired features. Juran and Godfrey [1999]
Quality: degree to which a set of inherent characteristics of an object fulfils requirements. ISO 9000:

17 Know your goal
What information matters to your decisions!
Two Critical Factors:
1) What am I trying to achieve? This is a business decision
   a) Increase Revenue
   b) Improve Customer Satisfaction/Retention
   c) Mitigate Risks
   d) Reduce Costs
2) What information is required to make this decision?
   a) Timeframes, costs, and risks associated with the decision (i.e. constraints exist)
   b) Hypotheses and questions to evaluate

18 Understand your data
What you don't know can hurt you!
Signal loss; Unexpressed assumptions; Incorrect defaults; Noise; Differing aggregations; Missing inputs; Cognitive bias; Lack of context
[Figure label: YES???]

19 Subtle differences exist!
Name: John Doe
Many applications: Common name used in test data. Evaluate %; exclude from use.
Healthcare: Common name for someone admitted to a facility without identification. Evaluate %; must include, but must consider each as a unique person.
Social media: Someone's handle? ~100 on Twitter! A reference to a generic anybody in a hashtag? Case-by-case: does it have any relevance or importance?

20 Fit for Purpose
What do you need to know to make an informed and effective business decision?
Business Requirements:
  Complete set of potential customers beyond 20 miles from a brick-and-mortar store for the Veteran's Day Sale
  Timeliness is critical
Analysis of the Data:
  Customer master file excludes prospects
  20% of prospective customers only have [email] or phone information
  External sources have specific geographic or demographic limits
Data Requirements & Measures:
  All customers & prospects are unique
  All records have an identified geolocation
  All records have a flag indicating exact or approximate location
  Need to target households to reduce cost and waste
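The data requirements on this slide can be turned into simple checks. The following is a minimal sketch in Python, not the speaker's method: the record layout (`id`, `lat`, `lon`, `precision`), the store coordinates, and the sample prospects are all hypothetical, and distance is approximated with the haversine formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Hypothetical store location and prospect records (illustrative only).
STORE = (42.3601, -71.0589)
prospects = [
    {"id": "P1", "lat": 42.2626, "lon": -71.8023, "precision": "exact"},
    {"id": "P2", "lat": 42.3736, "lon": -71.1097, "precision": "approximate"},
    {"id": "P3", "lat": None, "lon": None, "precision": None},
    {"id": "P1", "lat": 42.2626, "lon": -71.8023, "precision": "exact"},
]


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def check_requirements(rows):
    """Return (id, issue) pairs for records that miss a stated requirement."""
    issues = []
    seen = set()
    for r in rows:
        if r["id"] in seen:
            issues.append((r["id"], "duplicate"))
        seen.add(r["id"])
        if r["lat"] is None or r["lon"] is None:
            issues.append((r["id"], "no_geolocation"))
        if r["precision"] not in ("exact", "approximate"):
            issues.append((r["id"], "no_location_flag"))
    return issues


def targets_beyond(rows, store, min_miles=20):
    """Unique ids of located records farther than min_miles from the store."""
    ids = []
    for r in rows:
        if r["lat"] is None or r["lon"] is None:
            continue  # cannot be placed, so cannot be targeted by distance
        far = miles_between(store[0], store[1], r["lat"], r["lon"]) > min_miles
        if far and r["id"] not in ids:
            ids.append(r["id"])
    return ids


issues = check_requirements(prospects)
targets = targets_beyond(prospects, STORE)
```

Records that fail a check are reported rather than silently dropped, so the analyst can decide whether the remaining data is still fit for the campaign.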

21 Traditional Measures of Data Quality
What measures can we take advantage of?
1) Completeness: Are the relevant fields populated?
2) Integrity: Does the data maintain an internal structural integrity or a relational integrity across sources?
3) Uniqueness: Are keys or records unique?
4) Validity: Does the data have the correct values?
5) Consistency: Is the data at consistent levels of aggregation, or does it have consistent valid values over time?
6) Timeliness: Did the data arrive in a time period that makes it useful or usable?
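Several of these measures reduce to simple ratios over a set of records. The sketch below shows one possible way to compute four of them in Python. The sample records, field names, reference list of valid states, and arrival deadline are assumptions made for illustration. Integrity and consistency are left out because they depend on relationships across sources or over time.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; field names are illustrative assumptions.
records = [
    {"id": 1, "state": "MA", "amount": 120.0, "loaded": datetime(2016, 11, 1, 8, 0)},
    {"id": 2, "state": "ZZ", "amount": None, "loaded": datetime(2016, 11, 1, 9, 0)},
    {"id": 2, "state": "NH", "amount": 75.5, "loaded": datetime(2016, 11, 3, 9, 0)},
    {"id": 4, "state": "", "amount": 10.0, "loaded": datetime(2016, 11, 1, 7, 30)},
]
VALID_STATES = {"MA", "NH", "VT"}  # assumed reference list
DEADLINE = datetime(2016, 11, 2)   # assumed cutoff for a "timely" arrival


def ratio(hits, total):
    """Fraction of hits, or 0.0 for an empty data set."""
    return hits / total if total else 0.0


def completeness(rows, field):
    """Share of rows where the field is populated (not None or empty)."""
    return ratio(sum(1 for r in rows if r.get(field) not in (None, "")), len(rows))


def uniqueness(rows, key):
    """Share of rows whose key value occurs exactly once."""
    counts = Counter(r[key] for r in rows)
    return ratio(sum(1 for r in rows if counts[r[key]] == 1), len(rows))


def validity(rows, field, allowed):
    """Share of rows whose field value is in the allowed set."""
    return ratio(sum(1 for r in rows if r.get(field) in allowed), len(rows))


def timeliness(rows, field, deadline):
    """Share of rows that arrived on or before the deadline."""
    return ratio(sum(1 for r in rows if r[field] <= deadline), len(rows))


scores = {
    "completeness(amount)": completeness(records, "amount"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(state)": validity(records, "state", VALID_STATES),
    "timeliness(loaded)": timeliness(records, "loaded", DEADLINE),
}
```

Whether a given score is acceptable is not a property of the data alone; it depends on the goal and the fitness-for-purpose criteria set earlier.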

22 Call Center Data Record
[Figure: sample call center data record annotated with the traditional measures: Complete; Valid? (Is Duration = 0 important? Is 01/01/20xx a defaulted date?); Integrity; Consistent; Unique; Timely]
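The two validity questions on this slide can be surfaced automatically and left for an analyst to judge. This is a minimal Python sketch under assumed field names (`call_id`, `start`, `duration_sec`) and made-up records. It flags values for review and does not decide whether they are wrong.

```python
from datetime import date

# Hypothetical call records; field names and values are illustrative assumptions.
calls = [
    {"call_id": "A1", "start": date(2016, 3, 14), "duration_sec": 312},
    {"call_id": "A2", "start": date(2016, 1, 1), "duration_sec": 0},
    {"call_id": "A3", "start": date(2016, 1, 1), "duration_sec": 45},
    {"call_id": "A4", "start": date(2016, 5, 2), "duration_sec": 0},
]


def review_flags(call):
    """List the reasons a call record deserves a second look."""
    flags = []
    if call["duration_sec"] == 0:
        flags.append("zero_duration")  # dropped call, or a missing value?
    if (call["start"].month, call["start"].day) == (1, 1):
        flags.append("possible_default_date")  # January 1 is a common default
    return flags


# Keep only the records that raised at least one flag.
flagged = {c["call_id"]: review_flags(c) for c in calls if review_flags(c)}
```

A genuine call on January 1 is still valid, so the share of flagged records matters more than any single hit.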

23 Twitter Feed
Complete? Valid? Integrity? Consistent? Unique? Timely?

24 New Measures of Data Quality
What else can we measure?
1) Provenance: Where did the data originate, who gathered it, and what criteria were used to create it? E.g. government agency, 3rd-party provider
2) Continuity: Are there data points for all intervals or expected intervals? E.g. sensors, weather records, call data records
3) Triangulation: What Gartner describes as consistency of data across proximate data points, i.e. consistent measurements from related points of reference. E.g. if temperatures in Chicago and Louisville are 30 and 32, then the temperature in Indianapolis for the same day is unlikely to be 70
4) Repetition or duplication of data patterns: Are data points exactly the same across multiple recording intervals or across multiple sensors? E.g. is there tampering with sensors or call data?
5) Relevance: How relevant is the data source for its potential purpose? E.g. relevant geography, free or paid data
6) Transformation from origin: How many layers and/or changes has the data passed through? E.g. the address was parsed, matched to a postal table, replaced with the verified postal address, and merged with two other records
7) Usage and/or Ratings: Who else has used this data and what was their opinion of it? E.g. this source has been used repeatedly by marketing for sales analysis
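Continuity, triangulation, and repetition lend themselves to simple automated checks. The Python sketch below is one possible reading of those three measures, not a standard implementation. The function names, the neighbor mapping, and the tolerance of 15 degrees are assumptions, and the temperatures are the ones from the slide's example.

```python
from datetime import datetime, timedelta
from statistics import mean


def missing_intervals(timestamps, step):
    """Continuity: expected timestamps between first and last that are absent."""
    if not timestamps:
        return []
    seen = set(timestamps)
    expected, last = min(timestamps), max(timestamps)
    gaps = []
    while expected <= last:
        if expected not in seen:
            gaps.append(expected)
        expected += step
    return gaps


def triangulation_outliers(readings, neighbors, tolerance):
    """Triangulation: sites whose reading is far from the mean of their neighbors."""
    outliers = []
    for site, value in readings.items():
        near = [readings[n] for n in neighbors.get(site, []) if n in readings]
        if near and abs(value - mean(near)) > tolerance:
            outliers.append(site)
    return outliers


def longest_repeat(values):
    """Repetition: length of the longest run of identical consecutive values."""
    best = run = 0
    previous = object()  # sentinel that equals no real value
    for v in values:
        run = run + 1 if v == previous else 1
        best = max(best, run)
        previous = v
    return best


# Hourly sensor feed with the 02:00 reading missing (made-up data).
stamps = [datetime(2016, 7, 6, h) for h in (0, 1, 3, 4)]
gaps = missing_intervals(stamps, timedelta(hours=1))

# Temperatures from the slide; only Indianapolis is checked against neighbors.
temps = {"Chicago": 30, "Louisville": 32, "Indianapolis": 70}
neighbors = {"Indianapolis": ["Chicago", "Louisville"]}  # assumed mapping
suspect = triangulation_outliers(temps, neighbors, tolerance=15)

# A sensor reporting the same value four intervals in a row (made-up data).
stuck = longest_repeat([21.5, 21.5, 21.5, 21.5, 22.0])
```

A long run of identical values or a reading far from its neighbors is a prompt for investigation, not proof of tampering or error. The remaining measures (provenance, relevance, transformation, usage) are mostly metadata to record rather than values to compute.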

25 Twitter Feed
Provenance: Jane Doe pulled from Twitter based on #Blackberry
Repeated patterns: All tweets appear unique within the date & vs. prior feeds
Continuity: All items for #Blackberry in time interval appear to be included
Relevance: Good association with current sales data; Marketing confirms these have high value
Triangulated; Transformation; Usage

26 Document and validate your findings
Considerations & Recommendations:
Does the available data and information support a recommendation for a given decision?
a) Hypotheses and questions asked of the data
   a) Data evaluated, measurements, & conclusions
   b) Which data supports and which contradicts the hypotheses?
b) Timeframes, costs, and risks associated with the current data
   a) Is more or better data required, and if so, what does it cost to obtain? E.g. different sources, better algorithms, enrichment,
   b) What information must be communicated to ensure effective use of the data and reduce risks with its use?
   c) What monitoring is needed to ensure the data remains viable?
Based on the available data and evidence, can you achieve your goal? E.g. increase revenue, improve customer satisfaction, mitigate risks, reduce costs

27 Document and validate your findings
Considerations & Recommendations:
How do others know what you've found during your data evaluation?
a) Metadata is critical: what's recorded in association with the data preparation process?
   a) Provenance
   b) Lineage
   c) Usage
   d) Relevance
b) Data profiles and rule results capture the quality measurements applied
   a) Descriptive or categorical information provides context for use & understanding
Ultimately, is this recommendation supported by the data governance processes, policies, & procedures in place?

28 Key Capabilities to Validate Data Preparation
Data quality in the self-service world must address ALL of the following elements:
ENUMERATE: Establish the criteria defining goals, relevance, and fitness for purpose
ACQUIRE: Capture the metadata for data sources under consideration and use
DISCOVER: Profile the data sources which are required for the desired analysis
VALIDATE: Evaluate the data sources for the identified and required qualities
DOCUMENT: Document and store the findings about data sources and processes
CATALOG: Provide and communicate findings about data sources and processes for others to utilize

29 Benefits to your business
ACCURATE, TIMELY ANALYTICS
  Uncover new, previously inaccessible insights
  Accelerate speed of organizational decision-making
TARGETED MARKETING & REVENUE GROWTH
  Gain the most accurate, in-depth view of your customers
  Monitor and respond to customer activity in real-time
OPERATIONAL EFFICIENCY & COST REDUCTION
  Minimize time spent on manual data preparation
  Ensure accuracy of global operations and supply chain
RISK AND COMPLIANCE
  Ensure confidence in regulatory reporting
  Identify and manage risk more quickly and completely

30 Questions and Next Steps
For more information, please contact: Harald Smith, Director of Product Management, Trillium Software

31 Thanks! For more information go to: trilliumsoftware.com