Advancing Information Management and Analysis with Entity Resolution

Whitepaper, February 2016
novetta.com
© 2016, Novetta

ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION
ENTERPRISE ARCHITECTURE DELIVERING AVAILABLE AND ACCURATE DATA
DATA INTEGRATION PROVIDING TRUSTED, UNIFIED VIEWS OF DATA
DATA QUALITY MONITORING AND IMPROVING DATA RELIABILITY
MASTER DATA MANAGEMENT OPERATIONALIZING CUSTOMER AND PRODUCT DATA
DATA WAREHOUSE STORING DATA FOR ANALYSIS AND REPORTING
BUSINESS INTELLIGENCE DELIVERING BETTER INFORMATION AND PREDICTIONS
ENHANCING ENTERPRISE ARCHITECTURES AND APPLICATIONS WITH NOVETTA ENTITY ANALYTICS

ADVANCING INFORMATION MANAGEMENT AND ANALYSIS WITH ENTITY RESOLUTION

Organizations use data integration, data quality, master data management (MDM), data warehouse and business intelligence technologies and applications to collect, curate, transform, operationalize, store and analyze data. Many organizations are looking for ways to leverage data that were not previously possible. They need a quick and effective way to combine and analyze greater volumes, varieties and velocities of new and existing data. Novetta Entity Analytics unlocks the potential of data and enables organizations to get more out of existing enterprise technologies and applications.

This white paper outlines a typical enterprise architecture and the roles of data integration, data quality, MDM, data warehouse and business intelligence applications, and describes how Novetta Entity Analytics augments each of these five technologies to help organizations achieve better results from their data.

ENTERPRISE ARCHITECTURE DELIVERING AVAILABLE AND ACCURATE DATA

Enterprise architectures evolved to establish processes to collect, curate, transform, operationalize, store and analyze data, while ensuring that data is available, accurate and trustworthy. Data within enterprise architectures originates from internal or external sources and contains information about customers, prospects and potential markets. Internal operational systems such as sales, enterprise resource planning (ERP), supply chain management (SCM), point of sale (POS) and other transactional systems contain vast quantities of structured data critical to an organization's daily operations. The quality, trustworthiness, reliability and availability of these systems and their data are critical because they have a direct impact on an organization's success or failure.
In addition to operational systems, organizations have large volumes of semi-structured and unstructured data in the form of email, content repository, clickstream, log file and call log data. These sources, often referred to as dark data because they are rarely used in analysis, contain hidden content that most organizations have failed to leverage quickly and effectively. Organizations also access external data sources that contain vital information about customers, prospects and markets, but the veracity and quality of these sources vary widely. External sources can include social media such as Twitter, Facebook or LinkedIn, open source news, blogs and other media, as well as data from brokers such as Acxiom, TransUnion, Equifax, Experian and Dun & Bradstreet.

In a typical enterprise architecture, data from internal and external sources is replicated to a relational database, text-based files stored on a file system, or a Hadoop data lake within an integration staging layer (see Figure 1). The primary purpose of a staging layer is to collect data from sources and employ data integration applications to extract, transform, and load (ETL) it into data targets. Data integration applications have a wide variety of connectors that enable access to data sources and targets. In the staging layer, data is consolidated, aligned, cleansed and tracked before it is sent to operational data stores, MDM applications, data warehouses or data marts. Data quality applications, used within the staging layer and at other points within the enterprise architecture, track, measure and standardize data to improve data quality over time, across all systems.

Figure 1: Example Enterprise Architecture

Data integration and quality processes are transparent to most business users because they have already been applied to the data within warehouses and marts that users access through business intelligence and analytics applications. Since business intelligence reporting and analytical modeling entail heavier analytic processing than operational systems, they require warehouses with different data storage and schemas. In addition, data discovery and analytics applications often access data within a staging layer and bypass processes that govern and measure data quality and insights.

It is important to note that data within enterprise architectures does not always flow in one direction. In addition to data flowing from sources to a staging layer and out to data and analysis applications, data quality business processes and workflows also send updates from MDM, data quality or data integration applications back to operational source systems to improve an organization's data over time.

With the recent availability of less expensive and more robust analytic processing and storage technologies, many organizations are looking for new ways to leverage the applications and processes central to their enterprise architectures. This includes adding large volumes of fragmented and lower quality data from external sources and internal unstructured channels, which requires the use of new processing methods to avoid sacrificing overall quality. As depicted in Figure 2, Novetta Entity Analytics augments data integration and data quality workflows to allow organizations to add new data varieties at greater volumes while improving the quality of the data in data warehouses, data marts, operational data stores and MDM applications.

Figure 2: Example Enterprise Architecture with Novetta Entity Analytics

Below are details about the roles of data integration, data quality, MDM, data warehouse and business intelligence applications, with explanations of how Novetta Entity Analytics complements and increases their capabilities.

DATA INTEGRATION PROVIDING TRUSTED, UNIFIED VIEWS OF DATA

Data integration applications provide trusted, unified views of data throughout an organization. These applications extract data from multiple sources and prepare, curate and track it for use by target systems such as operational data stores, data warehouses, data marts and MDM applications. Historically, users of business intelligence and analytic applications have relied on data already sorted within data warehouses or data marts, and expect that the data they use has already been perfected by data integration applications. Newer users of business intelligence and analytic applications, sometimes referred to as explorers or miners, seek to find nuggets of hidden information across multiple datasets by directly accessing raw data in a staging layer, thereby bypassing data warehouses.

Organizations employ several types of data integration applications depending on the capabilities they need and which enterprise applications and systems are deployed. Today, most data integration applications have added business processes and rules to ensure data is accurate and trustworthy for enterprise operations. As data landscapes expand beyond operational systems, and organizations want to maintain trusted analytics streams separate from operational data feeds, data integration applications struggle to deal effectively with more highly fragmented and ephemeral data sources.
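The extract, transform and load flow described for the staging layer can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sample CSV, the field names, the `crm_export` source label and the in-memory list standing in for a warehouse table are hypothetical and do not describe any particular data integration product.

```python
import csv
import io

# Hypothetical raw extract from a source system (illustrative data only).
RAW_CSV = """customer_id,name,email
1, Ada Lovelace ,ADA@EXAMPLE.COM
2,Grace Hopper,grace@example.com
2,Grace Hopper,grace@example.com
"""

def extract(text):
    """Read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, source):
    """Trim whitespace, lowercase emails, drop exact duplicates, tag the source."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            "customer_id": row["customer_id"].strip(),
            "name": " ".join(row["name"].split()),
            "email": row["email"].strip().lower(),
        }
        key = tuple(record.values())
        if key in seen:
            continue  # exact duplicate of a record already kept
        seen.add(key)
        record["source"] = source  # tracking: remember where the record came from
        cleaned.append(record)
    return cleaned

def load(records, target):
    """Append cleaned records to an in-memory target (stand-in for a warehouse table)."""
    target.extend(records)
    return len(records)

warehouse = []
loaded = load(transform(extract(RAW_CSV), source="crm_export"), warehouse)
print(loaded, warehouse)
```

Tagging each record with its source during the transform step is one simple way to keep data "tracked" as it moves from the staging layer to its targets.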
Organizations that extend data integration application workflows to leverage Novetta Entity Analytics can more easily add volatile and less trustworthy data sources for analysis, and identify buried entities and relationships across existing datasets. Novetta Entity Analytics enables data analysts to quickly and easily:

- Profile and measure the quality and characteristics of new or existing datasets in the time it would normally take them to analyze a sample.
- Resolve and group records into entities of people, organizations and locations, and iterate to refine and improve the quality of entities.
- Take the guesswork out of defining strategies and matching rules to implement governance policies, deal with ambiguity and connect hidden dots within the data.
- Create multiple views of data to support the needs of different system targets from the same set of source data.
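To make the idea of resolving and grouping records into entities concrete, here is a minimal sketch. It is not Novetta Entity Analytics' method, which this paper does not specify; it is a simple deterministic stand-in that links records sharing a normalized email or phone number and groups them with a union-find structure. The records, field names and matching rules are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records from three sources; fields and values are illustrative only.
RECORDS = [
    {"id": "crm-1", "name": "Jon Smith", "email": "jon.smith@example.com", "phone": "571-555-0100"},
    {"id": "web-7", "name": "SMITH, JON", "email": "Jon.Smith@Example.com", "phone": ""},
    {"id": "pos-3", "name": "J. Smith", "email": "", "phone": "(571) 555-0100"},
    {"id": "crm-2", "name": "Ann Lee", "email": "ann.lee@example.com", "phone": "703-555-0199"},
]

def match_keys(record):
    """Return normalized keys; two records sharing any key are treated as a match."""
    keys = []
    email = record["email"].strip().lower()
    if email:
        keys.append(("email", email))
    digits = "".join(ch for ch in record["phone"] if ch.isdigit())
    if digits:
        keys.append(("phone", digits))
    return keys

def resolve(records):
    """Group record ids into entities using union-find over shared match keys."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # match key -> id of the first record carrying that key
    for r in records:
        for key in match_keys(r):
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())

entities = resolve(RECORDS)
print(entities)
```

In this sample, `web-7` joins `crm-1` through the shared email and `pos-3` joins through the shared phone number, so three differently formatted records resolve to one person. Real matching rules are usually probabilistic and tuned iteratively, which is the refinement step the list above refers to.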

DATA QUALITY MONITORING AND IMPROVING DATA RELIABILITY

Data quality applications improve the usefulness and reliability of data within enterprise applications and systems and provide better insights into the quality of data throughout the enterprise. Data quality applications perform a wide range of actions, including:

- Profile data, measure quality, track source quality over time, and assign trust metrics based on the reliability of sources.
- Parse and standardize data in a specific field, such as name, address or part number, by applying rules to the reference data.
- Cleanse data to remove noise and duplicates, or to correct data errors.
- Measure and track source quality of target systems over time to identify issues with ETL integration processes and ensure reliability of enterprise data.

Novetta Entity Analytics applies some of the same processes as data quality applications to the data it groups and resolves into unified entity views. Its main goal is to improve the quality of entity views, whereas data quality applications are responsible for monitoring and tracking overall enterprise data quality. Through data profiling, measurement and standardization, Novetta Entity Analytics transforms and cleanses data as needed, ensures incomplete or error-prone data is resolved to the correct entity, and provides data analysts with insights about the context of all underlying data. Novetta Entity Analytics also provides organizations with a simple way to evaluate less trusted data sources, especially those not yet monitored or operationalized. In addition, Novetta Entity Analytics automatically identifies patterns and values in data, measures quality, creates advanced 360-degree views from data, and defines important relationships, even for dirty or incomplete sources.
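The profiling and standardization steps described above can be sketched for a single field. This is a minimal illustration under stated assumptions: the sample phone values, the ten-digit US format and the chosen metrics are hypothetical and are not taken from any specific data quality product.

```python
import re
from collections import Counter

# Hypothetical column of phone values from a source system (illustrative only).
PHONES = ["(571) 555-0100", "571.555.0100", "", "5715550100", "n/a", "703-555-0199"]

def standardize_phone(value):
    """Return a 10-digit US phone as NNN-NNN-NNNN, or None if it cannot be parsed."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def profile(values):
    """Summarize fill rate, validity and duplication for one field."""
    standardized = [standardize_phone(v) for v in values]
    valid = [s for s in standardized if s is not None]
    counts = Counter(valid)
    return {
        "total": len(values),
        "populated": sum(1 for v in values if v.strip()),
        "valid": len(valid),
        "distinct": len(counts),
        "duplicates": sum(c - 1 for c in counts.values()),
    }

report = profile(PHONES)
print(report)
```

Standardizing before counting is what exposes the duplicates here: three differently formatted values collapse to the same number, which a raw string comparison would miss.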
MDM OPERATIONALIZING CUSTOMER AND PRODUCT DATA

MDM applications incorporate the business processes, policies, governance, and services required to define, create and manage unified views of customers and products across an organization. These unified views contain critical pieces of information that business applications need to ensure they are referring to the correct customer or product. MDM applications support hundreds of millions of queries on names, customer numbers or other personally identifiable details within customer records. Query responses are immediately presented to all operational systems as unified views of required customer information. To ensure changes to data are appropriate, accurate and trusted, MDM systems also support data governance and other policy and workflow processes, such as auditing and reporting.

Organizations can use Novetta Entity Analytics to add new data sources that augment the unified customer and product views provided by MDM systems without reducing the trustworthiness or reliability of enterprise systems. Novetta Entity Analytics connects data stored within MDM systems to a wide variety of data coming from lower quality sources, such as social media, logs and clickstreams. These sources are inappropriate for mastering, but offer organizations new opportunities to gain analytic insights. With Novetta Entity Analytics, organizations can also connect MDM data to high volume transactional data to define behavioral patterns and relationships, and uncover hidden leads or instances of fraud.

DATA WAREHOUSE STORING DATA FOR ANALYSIS AND REPORTING

Data warehouses serve as a centralized location for storing current and historical data produced by operational systems and used for reporting and analysis. Data warehouses come in many forms, including data marts, operational data stores and analytical warehouses. Organizations usually employ data integration applications operating within a staging area to extract and transform data before it is loaded into a warehouse. Data warehouses rely on prebuilt indices to respond quickly to analytic and reporting queries, and provide data for models. Data analysts and scientists regularly create new indices or algorithms to support changing business use cases, sources and market demands. Some organizations leverage Hadoop's parallel processing capabilities to reduce the time it takes to create new indices. In addition, many organizations archive data they want to use for longer running analysis to Hadoop data lakes because of the high cost of keeping data in warehouses for more than three to six months.

It is still a significant challenge for organizations to know the full state of their business, even with access to more data and analytics capabilities. This is due to the constant fluctuation of current and historical data, and its connections within data warehouses, Hadoop, and other external sources.
Organizations can use Novetta Entity Analytics and the resolved entity models it creates to clean up nodes and edges, reduce duplication within the data, and improve the accuracy of models and reporting for online analytical processing systems. With Novetta Entity Analytics, organizations can not only enrich data stored in data warehouses by adding entity indices that describe how data is connected around entity types such as customer or product, but also easily connect data within warehouses to external sources.

BUSINESS INTELLIGENCE DELIVERING BETTER INFORMATION AND PREDICTIONS

Business intelligence applications apply analytical processing and reporting to data stored within data warehouses to provide real-time insights about what is happening across the organization, from supply chains, sales, ordering, fulfillment and tracking, to brand reputation and customer experience. Business intelligence applications also help business owners predict how changes to activities and strategies will impact future outcomes.

Business intelligence analytics have traditionally been the domain of Analytics Centers of Excellence, which include stakeholders from different functional business teams who determine the potential business value of analytic applications and their potential impact on business processes and IT. Some organizations have also created Business Intelligence Competency Centers responsible for providing practitioners with stewardship guidance and the skills needed to use analytics and business intelligence effectively.

As organizations look for new ways to get more out of existing data or add external data sources for analysis, self-service access for savvy business users and analysts who want to explore raw data on their own is becoming increasingly popular, yet it introduces new challenges. Many organizations have to formulate new policies for handling data, including how to bring together and manage data from multiple locations and sources, measure the quality of input datasets, govern and audit data, and ensure compliance with internal and external requirements. In addition, organizations must support users who lack data science expertise while ensuring the validity of the analytics produced by these same users.

Novetta Entity Analytics employs proven data science principles to allow business users and analysts to produce datasets containing complete views of people, organizations, locations and other entities, and the details about complex relationships between entities. Self-service business intelligence, data discovery and other analytics applications can use these valid subsets of resolved data for analysis. With Novetta Entity Analytics, data scientists can easily view the data produced to:

- Ensure datasets and analysis are acceptable.
- Understand the original quality of the source data and details about how data was joined together to form entities.
- Determine the exact business logic used to resolve data and define relationships.

ENHANCING ENTERPRISE ARCHITECTURES AND APPLICATIONS WITH NOVETTA ENTITY ANALYTICS

Organizations need new ways to increase efficiencies and insights from existing enterprise technologies and applications without causing disruption to enterprise architectures and IT infrastructures.
Novetta Entity Analytics enables organizations to improve business analytics and outcomes by easily adding internal and external data sources, and augmenting data integration, data quality, MDM, data warehouse and business intelligence applications to gain valuable new insights from data.

Headquartered in McLean, VA with nearly 750 employees across the US, Novetta has over two decades of experience solving problems of national significance through advanced analytics for government and commercial enterprises worldwide. Grounded in its work for national security clients, Novetta has pioneered disruptive technologies in four key areas of advanced analytics: data, cyber, open source/media and multi-int fusion. Novetta enables customers to find clarity from the complexity of Big Data at the scale and speed needed to drive enterprise and mission success.

Visit novetta.com for more information.

Jones Branch Dr, Suite 500 McLean, VA (571) novetta
Copyright 2016, Novetta