Cognitive Data Governance

Similar documents
Accelerating Cloud Value through Analytics


Information Server: 11.x Information Governance Catalog. Marc Haber Senior Offering Manager, Governance Catalog & Tools

Getting Started: Modeling the Structure and Operations of Big Data

Paul Chang Senior Consultant, Data Scientist, IBM Cloud tw.ibm.com

GET MORE VALUE OUT OF BIG DATA

Big Data for the Pharmaceutical Industry

Advancing Information Management and Analysis with Entity Resolution. Whitepaper ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION

Amsterdam. (technical) Updates & demonstration. Robert Voermans Governance architect

How to Design a Successful Data Lake

IBM Db2 Warehouse. Hybrid data warehousing using a software-defined environment in a private cloud. The evolution of the data warehouse

The Next Frontier in HR Analytics

Tata Consultancy Services Enters the Cognitive Software Market with Digitate and ignio A Neural Automation System

1. Search, for finding individual or sets of documents and files

Improved Automotive Brand Loyalty

The IBM DataFirst Method Fueling your data-driven transformation

Making intelligent decisions about identities and their access

Solution Brief. The IBM Explorys Platform. Liberate your healthcare data

HOW TO USE AI IN BUSINESS

Planning for the General Data Protection Regulation

Take a Dive into the Data Lake

A Visualization is Worth a Thousand Tables: How IBM Business Analytics Lets Users See Big Data

It's a Game Changer. Brightfield Jason Ezratty President. Allegis Global Solutions Bruce Morton Head of Strategy

TECHNOLOGY BRIEF Intapp and Applied AI. THE INTAPP PROFESSIONAL SERVICES PLATFORM Unified Client Life Cycle

Data Driven Culture Enabled by Data Management

IBM Planning Analytics

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

Adobe Cloud Platform

Financial Services Compliance

Retail Business Intelligence Solution

Successful healthcare analytics begin with the right data blueprint

IBM Analytics. Data science is a team sport. Do you have the skills to be a team player?

IBM Systems Lab Services Systems Consulting. Proven expertise to help leaders design, build, and deliver IT infrastructure for the cognitive era

PORTFOLIO MANAGEMENT Thomas Zimmermann, Solutions Director, Software AG, May 03, 2017

IBM Business Automation Content Analyzer

Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud

DLT AnalyticsStack. Powering big data, analytics and data science strategies for government agencies

Mosaic Decisions. LTI's Advanced Analytics Platform for Insurance

IBM Tivoli Endpoint Manager for Software Use Analysis

EMPOWER YOUR ANALYSTS. GO BEYOND BIG DATA. Delivering Unparalleled Clarity of Entity Data. White Paper. September 2015 novetta.com 2015, Novetta, LLC.

Ensuring progress toward risk management and continuous configuration compliance

IBM SPSS Modeler Personal

Executive Brief. 3 Keys to Self-Service Data Preparation

Rapid Da ta Monetization

IBM Decision Optimization and Data Science

IBM InfoSphere Master Data Management

Customer Value Analytics for Banking & Capital Markets

BUSINESS CASES & OUTCOMES

Job Aid: Multi Organization, Multi Site Set Up Considerations

Machine Learning 101

2017 IBM Corporation. IBM s Journey to GDPR Readiness

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

IBM Grid Offering for Analytics Acceleration: Customer Insight in Banking

PERSPECTIVE. Monetize Data

A technical discussion of performance and availability December IBM Tivoli Monitoring solutions for performance and availability

The Five Essential Elements of Self-Service Data Integration

Smarter Manufacturing across the Lifecycle with Analytics

Intelligence, a powerful

Leading the way in technology support

Responsive enterprise the future of the enterprise PERSPECTIVE

Cognitive enterprise archive and retrieval

Customer Value Analytics for Banking & Capital Markets

: Boosting Business Returns with Faster and Smarter Data Lakes

ACCELERATE TO THE NEW PORTALS TO BIG DATA

IBM Business Consulting Services. IBM Business Intelligence Services: enabling information on demand.

A RESPONSE TO CELENT S DAWN OF A NEW ERA IN AML TECHNOLOGY REPORT. Addressing Financial Crime and Compliance Challenges with Advanced Analytics

Embark on Your Data Management Journey with Confidence

IBM Maximo Asset Management solutions for the oil and gas industry

SAP Leonardo Machine Learning Enabling the intelligent enterprise. Bruno Renzo Localization Product Manager Localization Day Spain 2017

NEAT EVALUATION FOR INTELENET GLOBAL SERVICES: Market Segments: Overall & Increased Revenue Capability

Is Machine Learning the future of the Business Intelligence?

Smarter Healthcare across the Lifecycle with Analytics

Big Data Management Best Practices for Data Lakes Philip Russom, Ph.D.

Rapid Delivery Predictable Outcomes Your SAP Data Partner

INSIGHTS & BIG DATA. Data Science as a Service BIG DATA ANALYTICS

Hybrid Multi-cloud Artificial Intelligence (AI): IBM Watson Studio and Watson Machine Learning

How Artificial Intelligence Will (and Won t) Change Procurement and Contracting

IBM Analytics Unleash the power of data with Apache Spark

Data Trends & Opportunities in the

Intelligent Enterprise for SAP

U.S. Bank Access Online

SAS Decision Manager

Top 10. best practices for successful multi-cloud management. How the multi-cloud world is changing the face of IT

COMM 391. Learning Objectives. Introduction to Management Information Systems. Case 10.1 Quality Assurance at Daimler AG. Winter 2014 Term 1

BIG DATA IN BANKING Pravin Burra, Sept 2018

IBM SPSS Modeler Personal

Business Insight and Big Data Maturity in 2014

IBM Balanced Warehouse Buyer s Guide. Unlock the potential of data with the right data warehouse solution

Infosys Real Time Streams

Solution Brief. An Agile Approach to Feeding Cloud Data Warehouses

Enable Connected Intelligence

Need Management. April 2009

Make smart business decisions when they matter most September IBM Active Content: Linking ECM and BPM to enable the adaptive enterprise

This document (including, without limitation, any product roadmap or statement of direction data) illustrates the planned testing, release and

Service management solutions White paper. Six steps toward assuring service availability and performance.

Winning the Hearts & Minds of the Data Scientist in the Cognitive Era. Gaurav Rao Director, Advanced Analytics IBM Analytics

Our Emerging Offerings Differentiators In-focus

THE IMPORTANCE OF END USER DATA PREPARATION

Confidential

Integrated Social and Enterprise Data = Enhanced Analytics

Transcription:

IBM Unified Governance & Integration White Paper Powered by Machine Learning to find and use governed data Jo Ramos Distinguished Engineer & Director IBM Analytics Rakesh Ranjan Program Director & Data Scientist IBM Analytics

Contents. 3 4 5 6 7 8 Introduction Benefits of Managing Master Data Auto Data Classification Importance of Feedback Learning Applying Regulations to the Business Data Conclusion ã Copyright IBM Corporation 2018 IBM Corporation Route 100 Somers, NY 10589 This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. THE INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

3 Introduction The purpose of this white paper is to discuss how machine learning and deep learning techniques can impact the way companies implement metadata discovery and business term assignments using the IBM Unified Governance & Integration platform. First, a set of definitions to ensure consistent understanding of the topic: Definitions Master Data: the consistent and uniform set of identifiers and extended attributes that describe the core entities of an enterprise (such as existing/prospective customers, products, services, employees, vendors, suppliers, hierarchies and the charts of accounts) Machine Learning: an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed Data Governance: the overall management of data availability, relevancy, usability, integrity and security in an enterprise Regulatory Compliance: an organization s adherence to laws, regulations, guidelines and specifications relevant to its business Most organizations spend a great deal of time and energy wrestling dirty, poorly-integrated data. They either cannot find the right data or cannot trust the data they find. On top of that, they must deal with multiple regulations in their industry that are barriers to self-service and data democratization. As a result, they try to fix their data through a variety of labor-intensive tasks, from writing custom programs to global replace functions overall diminishing their productivity as data analysts and data scientists. This is particularly true within large organizations, where many years of mergers and acquisitions have resulted in an extremely complex data environment of diverse systems and databases. While organizations are busy maintaining these legacy data environments, new data is constantly being created at unseen speed. Some try to solve this problem using Master Data Management tools by unifying desperate data sources to achieve a single view of their critical business entities. Several vendor tools approached this problem with a rule-based engine that unites a variety of data sources in their offerings. Rules are easy to implement and understood by many, however, rule-based engines do not scale very well. In the context of large enterprises, where organizations must deal with large amounts of data and a variety of desperate systems, machine learning technologies are replacing rules engines. Machine learning has proven remarkably powerful in accomplishing a wide variety of analytics objectives, such as predicting customer churn or detecting fraud in online credit card transactions. While identifying data similarities or unifying data may not be the most exciting application of machine learning, it is one of the most beneficial and financially valuable applications to IBM s clients.

4 The Benefits of Managing Master Data Building a data catalog can be very labor-intensive and time-consuming, which is why so many organizations give up on creating and updating a well-organized data catalog. They also face further challenges, such as: Standardizing business definitions and creating a business glossary Cataloguing all data sources and updating with clear business descriptions Linking business terms to data fields across all data sources Time is not the only thing needed to build a robust data catalog, it is also extremely expensive to hire domain experts who can perform this on an ongoing basis. This is where artificial intelligence and machine learning technology can help. IBM Unified Governance & Integration uses machine learning and neural networks to identify probabilistic matches of multiple data records that are likely to be the same entity, even if they look different. This enables analysis of master data for quality and business term relationships, a major pain point for IBM clients. Projects that used to take months could now be done in a few weeks. While machine learning enables automation of tasks, there is always a need for human intervention in the process like any other artificial intelligence or machine learning application. Through feedback learning, if the confidence score of match is below a certain threshold level, the system will refer the candidate data records to a human expert using the workflow. It is far more productive for those experts to deal with a small subset of weak matches that an entire dataset. The benefits of this activity are huge, both for data curators and data scientists in any organization. Consider a new data scientist who is given a task to develop a machine learning model to detect customer churn for a specific product or service. While the data scientist has an idea on what needs to be accomplish, he or she has no idea what data sets he/she can use to start the task. With IBM data governance technology enabled with machine learning, the data scientists can easily search for business terms like customer retention to get a graph view of all connected entities to be able to drill down and get information about the quality and authenticity of the data. Figure 1: IBM Information Server Enterprise Search showing Business terms and Technical metadata relationships

5 What you see in the figure above is a result of automated data classifications and term assignments using a machine learning model. A classification or taxonomy is a way of understanding the world by grouping and categorizing. Many organizations use Social Security Numbers (SSN) to track a customer with various investment products, but they may appear in various forms such as a Tax Identification Number or Employee Identification Number. Using a traditional rule-based engine, it is difficult to conclude that these three terminologies refer to the same entity. In contrast, one term may also have different meanings in the same organization. Machine learning models offer a new way to train the system to describe a domain from the data that helps identify these relationships. Auto Data Classification The use of machine learning to perform data classification is a three-step process involving clustering, sorting and classifying. The goal of clustering is to find similar entities in data. The different characteristics used to compare data such as shape and size of an object are called features. The more features that are the same or close to being the same among two data values are considered similar data values. Several statistical techniques are available that can be employed to perform clustering. Some techniques ask the user to specify features in advance, whereas others develop the features through comparison of different data values. The figure below shows the IBM Unified Governance and Integration software stack with core services that it offers. Some of these services are either in beta mode or will be available later this year. Figure 2: IBM Information Server 11.7 core services The IBM Unified Governance and Integration software stack is focused on providing personalized user experiences and targeted outcomes for a variety of personas. For example, business users often struggle to find the data they need due to data silos and lack of a central data catalog oftentimes turning to IT for help. Even when business users find the data they are looking for, they struggle to understand it due to the following reasons: The data sources are unlabelled leaving users to guess the context and meaning of each data field The data sources have schemas and data fields that use abbreviations or acronyms that are difficult to understand, specifically from legacy systems

6 The data fields are repeated on multiple data sources with different labels, creating a need to standardize business terms IBM Unified Governance & Integration is enabling business and technical users in the enterprise to identify data not just easily, but also discover quality data that is useful and endorsed by other users in the organization. Importance of Feedback Learning in the Discovery Process Many organizations do not have a well-defined relationship between their business terms and technical metadata. IBM Unified Governance & Integration has employed a technique to pre-train models with industry-oriented datasets and labels to provide candidates matches to users as soon as they load their business glossaries into the IBM Governance Catalog. Users receive confidence scores associated with every candidate term, helping them pick the right one. If a match meets the threshold of 80%, the term is automatically assigned to the metadata and if not, the system perceives this as a knowledge input to retrain the machine learning model. This dynamic feedback loop enables data curators to become more productive with enhanced results on subsequent discoveries of metadata. Figure 3: IBM Information Server 11.7 ML powered metadata discovery process Traditional techniques of metadata matching and assignments are rule-based. While machine learning models to a better job with ambiguous data sets, they are not replacements for existent application rules especially in cases where existing application rules and regular expression have proven to work well. IBM Unified Governance & Integration uses machine learning to complement existing rules and provide optimized results where systems can learn from a customer s data domain.

7 Figure 4: Automated business term assignment using machine learning Applying Regulation to Business Data When businesses try to find, understand, and use the information in their organization, they use several approaches to organize data. They physically move data from systems of engagement to systems of insight data warehouses, data marts, connecting a graph database straight to the instance data, data lakes to create a landing zone of all the information that they might be interested in, fishing for what is useful, at the edge of the lake creating sandboxes at the edge of the lake so people can sift through what is coming into the lake and start mining for gold. Ultimately, the most important aspect is knowing what data exists, what it means, where it comes from, how it has been manipulated (data lineage), and what users can do with the data. To identify what data exists, IBM has offerings for discovery and auto classification such as StoredIQ and Information Analyzer. These product use machine learning models to read customer data, understand the pattern, and learn from it. IBM Unified Governance & Integration is addressing key issues of on-boarding regulations to a governed data lake. IBM s approach combines IBM Industry Models with neural networks model to on-ramp a variety of regulations into the IBM Governance Catalog and then map the regulatory terms to the business terms. Take for example, the Global Data Protection Regulation (GDPR). There are four processes before GDPR terminology can relate to business terminology to help leverage them for privacy regulations: 1. Support content terms must be manually extracted from GDPR documents (various sections, articles) 2. Hierarchies must be created for key categories 3. Supportive content terms must be matched manually with the business terms by domain experts 4. These supportive terms must be mapped to the business data model

8 IBM Unified Governance & Integration uses machine learning to create a neural network model that interprets regulations based on its experience with similar regulations. This not only extracts supportive content terms from raw documents but also create a well-formed taxonomy that can easily be ingested into IBM s Governance Catalog. Conclusion Companies in the current trend of global regulations and emergence of the use of hybrid cloud environments (Public, Private, Hybrid) will have to work on managing their Master Data Entities more efficiently and use AI and machine learning to understand the impact of regulations and changes on their business. They need a solution that scales their business processes and helps them throughout their journey of data curation, data discovery, quality and governance. IBM s Unified Governance and Integration solutions are designed to help clients achieve that with ease and efficiency. Watch this webcast to learn how you can faster insights from your data using cognitive data governance in IBM Information Server.