The Role of Big Data and Data Warehousing in the Modern Analytics Ecosystem

By Wayne W. Eckerson
April 2018

About the Author

Wayne W. Eckerson has been a thought leader in the business intelligence and analytics field since the early 1990s. He is a sought-after consultant, noted speaker, and expert educator who thinks critically, writes clearly, and presents persuasively about complex topics. Eckerson has conducted many groundbreaking research studies, chaired numerous conferences, written two widely read books on performance dashboards and analytics, and consulted on BI, analytics, and data management topics for numerous organizations. Eckerson is the founder and principal consultant of Eckerson Group.

About Eckerson Group

Eckerson Group is a research and consulting firm that helps business and analytics leaders use data and technology to drive better insights and actions. Through its reports and advisory services, the firm helps companies maximize their investment in data and analytics. Its researchers and consultants each have more than 25 years of experience in the field and are uniquely qualified to help business and technical leaders succeed with business intelligence, analytics, data management, data governance, performance management, and data science.

About this Report

This report is the culmination of meetings with data architects at both vendor and user organizations, as well as colleagues who are consultants and practitioners in the field. The report is made possible by SAP.

Executive Summary

The data warehouse has served as the foundation for analytics architecture for the past 30 years. But the era of big data has rolled over many of the traditional approaches to delivering data and insights to business users. Consequently, some think the data warehouse has become obsolete, while others believe it still plays a key role in a modern analytics ecosystem.

This report explores the promise and reality of both traditional data warehouses running on relational databases and data lakes running on Hadoop. Each environment has unique capabilities that make it ideal for supporting different kinds of use cases and workloads. This report examines the strengths and weaknesses of both environments and shows how they can complement one another in a modern analytics ecosystem.

Organizations today, especially those with large volumes of multi-structured data, are pairing data warehouses and data lakes to create a vibrant information supply chain that continuously feeds a multiplicity of downstream systems and applications. The result is a modern analytics ecosystem that adapts quickly to new requirements while hiding the complexity of the data environment from business users and many developers.

Two Poles of Data Processing

A Debate Breaks Out

Last fall, a raging debate broke out at Eckerson Group in a very public manner. My colleague, Stephen Smith, wrote a blog titled "The Demise of the Data Warehouse," in which he argued that the data warehouse has failed to live up to its promise and is being replaced by DL+MDM, a combination of a data lake and master data management system. He writes that DL+MDM is "a nimble, flexible and superior substitute for the classical data warehouse." That article triggered a flurry of comments from readers and a series of rebuttals from other Eckerson Group research consultants (Dave Wells, Dewayne Washington, David Loshin, and myself), most of whom came to the defense of the data warehouse while admitting its shortcomings. All recognize that new technologies have surrounded the data warehouse, and in some cases significantly changed its technological foundation, but all reject the notion that a modern organization can live without one.

This debate prompts the question: What is a data warehouse in the age of big data? How does the advent of Hadoop, Spark, Python, data virtualization, data preparation, and data catalogs affect how we design and implement a data warehouse? What does a modern data warehouse look like in the context of the larger analytics ecosystem?

The Traditional Data Warehouse

To answer these questions, let's start at the beginning. The traditional data warehouse arose in the 1990s to offload queries from operational systems and create a data repository dedicated to decision making. Using a relational database, data architects designed data warehouses with dimensional schema (stars and snowflakes) and populated them with data from operational systems via extract, transform, and load (ETL) tools. Business professionals used SQL-based business intelligence (BI) tools to query the data warehouse.

The Promise. The beauty of a traditional data warehouse, and why almost every organization has one, is that it's a one-stop shop for any data that users need to make decisions. Also, thanks to the data warehouse's relational and SQL underpinnings, business people can use any BI tool they want to query that data. Moreover, the traditional data warehouse is designed to house enterprise data and history. It allows business users to query data across functional boundaries and ask questions that once were impossible to answer. In addition, a data warehouse uses dimensional models (or views) that make it easy for business users to understand and converse with the data in an iterative manner. Finally, as a single, central repository of enterprise data, the data warehouse serves as the single version of truth. It provides consistent, trustworthy data based on consensus definitions of key terms and metrics, and a systematic data hygiene process.

The Pitfalls. Unfortunately, data warehouses don't meet all enterprise needs. Because source data must first be transformed to conform with the data warehouse's data model (a process known as schema on write), loading a data warehouse with new data can require a complex pipeline (illustrated in the sketch below). The relational model also makes it difficult to load multi-structured data that doesn't fit neatly into the rows and columns of a relational database. And there are scalability issues: most relational data warehouses scale comfortably in the low terabyte range (1 to 20 terabytes), but bog down as they grow without sizable hardware investments.

The Reality. Today, well-designed data warehouses are critical for supporting top-down reporting and analysis requirements. In essence, they are great at dishing up standard reports and dashboards to hundreds or thousands of concurrent users. Users can interact with the content iteratively, drilling down and across dimensions to do lightweight analysis along predefined drill paths and hierarchies. But they do less well at handling pure ad hoc workloads and providing power users quick access to granular data of all shapes and sizes.
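To make the schema-on-write pattern concrete, here is a minimal sketch in Python, using SQLite as a stand-in relational engine. The star schema, table names, and source fields are hypothetical illustrations, not from this report: the point is that the dimensional model is fixed before any data arrives, and every record must be cleaned and conformed on the way in.

```python
import sqlite3

# Schema on write: the star schema is modeled up front, and every incoming
# record must be transformed to conform before loading. All names are
# illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        region        TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL,
        customer_key INTEGER NOT NULL REFERENCES dim_customer,
        amount_usd   REAL NOT NULL
    );
""")

# Raw operational extract: inconsistent casing and a string amount --
# exactly the kind of data that must be cleaned *before* it is loaded.
source_rows = [
    {"cust": "ACME corp", "region": "emea", "date": "2018-04-02", "amt": "1200.50"},
]

for row in source_rows:
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_name, region) VALUES (?, ?)",
        (row["cust"].title(), row["region"].upper()),
    )
    conn.execute(
        "INSERT INTO fact_sales (date_key, customer_key, amount_usd) VALUES (?, ?, ?)",
        (int(row["date"].replace("-", "")), cur.lastrowid, float(row["amt"])),
    )

# BI tools then issue dimensional (star-join) queries against the conformed model.
for name, total in conn.execute("""
    SELECT c.customer_name, SUM(f.amount_usd)
    FROM fact_sales f JOIN dim_customer c USING (customer_key)
    GROUP BY c.customer_name
"""):
    print(name, total)
```

Every new source attribute requires a change to this model and pipeline first, which is why loading new kinds of data into a warehouse can be slow and complex.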

Hadoop

Promise. The big data movement emerged around 2010, largely to address the shortcomings of data warehousing. Built on commodity hardware, open source software, and a scale-out distributed file system, Hadoop can process large volumes of multi-structured data at very low cost. Hadoop gives power users quick access to the raw data they prize.

Hadoop quickly gave rise to the notion of a data lake. Like a data warehouse, the data lake is a central data repository for decision making; but unlike a data warehouse, it doesn't require organizations to first model, clean, and conform data before loading it. This schema-on-read approach accelerates insight and lowers the cost of acquiring and using new types of high-volume data, notably log files, clickstream data, social media content, and sensor data.

Challenges. Responding to Hadoop's revolutionary approach to data processing, many companies installed small clusters to experiment with the new low-cost technology. Unfortunately, not every experiment in the age of big data has succeeded. As companies moved beyond pilot projects to run Hadoop in production and at scale, some have found the "free" software quite expensive once they added the investments in people and products required to manage a scale-out infrastructure and keep pace with rapid changes in open source functionality. They also discovered that Hadoop lacked strong governance, metadata, security, and other standard features of enterprise software.

Reality. To be fair, new technology takes years, even decades, to mature before it can support enterprise requirements. Currently, Hadoop is ideal for supporting use cases that require large-scale, multi-structured data, such as data science sandboxes or ETL offloading. But it has not replaced most traditional data warehouses, as some early advocates predicted. Thus, organizations that want to modernize their data environments need to design a data architecture in which both Hadoop and the traditional data warehouse play well together.
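By contrast, here is a minimal sketch of schema on read, using PySpark (the report mentions Spark; the storage path and field names are hypothetical). Raw events are ingested as-is, with no upfront modeling, and structure is imposed only at query time by whoever needs it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema on read: raw events land in the lake untouched. The path and
# field names below are hypothetical, for illustration only.
raw = spark.read.json("s3://lake/landing/clickstream/2018/04/")

# Structure is imposed only at query time, by the analyst who needs it.
sessions = (
    raw.where(F.col("event_type") == "page_view")
       .groupBy("session_id")
       .agg(F.count("*").alias("page_views"),
            F.min("ts").alias("session_start"))
)
sessions.show()
```

The cost of adding a new event type is near zero at load time; the modeling work shifts to the analyst at read time, which is the tradeoff the two mantras discussed later in this report capture.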

Dueling Environments

With the advent of the data lake, the modern data analytics environment has become more complex, and uncertainties abound. Data architects now ask:

- Should we replace our data warehouse with a data lake?
- Should we migrate or rebuild our data warehouse inside a data lake?
- Can a data warehouse and data lake coexist?
- What workloads run best on a data warehouse? On a data lake?
- Should we move our data warehouse or data lake to the cloud?

Merging Polarities. There are no easy answers to these questions, especially since the technologies powering these environments continue to evolve at a fast pace. In 2010, when Hadoop was in its infancy, the processing characteristics of the two environments were diametrically opposite. Since then, the two environments have converged in features, so the functional gap between them has shrunk considerably. (See figure 1.)

Figure 1. Disparate Data Processing Platforms, Circa 2010. Relational and Hadoop platforms, once polar opposites in terms of data processing characteristics, are becoming more similar.

Since 2010, relational data warehouses have become more agile, scalable, open, programmatic, and affordable, while data lakes powered by Hadoop have become faster and more interactive, secure, structured, and governed. And both are fast moving into the cloud to reduce the friction in purchasing, deploying, and maintaining complex data infrastructure.

Theoretically, an organization today can choose either environment to support the entire spectrum of data and analytical requirements. Small and midsize companies may have no choice but to select one and live with the tradeoffs. But large or leading-edge organizations find that pairing a relational data warehouse with a data lake makes sense: they get the best of both worlds with few of the downsides.

Complementary Workloads

Rather than force-fit workloads into a single architecture, many organizations use data lakes for what they do best (ingest, parse, and transform large volumes of multi-structured data) and data warehouses for what they do best (process dimensional queries for large numbers of concurrent users). In essence, the data lake becomes the pre-processing zone, a landing and staging area for the data warehouse and other downstream systems and applications. It also can archive data from the data warehouse and serve as a sandbox for data analysts and data scientists. Meanwhile, the data warehouse serves as a repository of integrated, historical data that is dimensionally modeled and can be queried in an ad hoc manner by dozens or hundreds of concurrent business users.

Today, organizations use data lakes and data warehouses for different types of data, use cases, and workloads, making the environments complementary. (See table 1.)

Table 1. Complementary Workloads and Use Cases

            | Data Warehouse                       | Data Lake
Scale       | Moderate: terabyte range             | High: petabyte range
Schema      | Schema on write                      | Schema on read
Data        | Structured, internal                 | Multi-structured, external
Data prep   | Model once for multiple uses         | Self-service modeling for specific use cases
Loading     | Batch and near-real-time             | Batch, near-real-time, and streaming
Governance  | Mature                               | Emerging
Joins       | Complex, multi-table joins           | Simple joins; full-table scans
Use cases   | Standard reports and dashboards,     | Data science sandboxes, ETL offload,
            | OLAP-style analysis                  | DW archive, DW staging area

Use Cases. Today, a data lake is often used as a data science sandbox or BI "SWAT" environment, where power users can explore data, create analytical models, or build prototype reports unfettered by relational models, IT processes, or architectural requirements. In contrast, a data warehouse is best used to support enterprise reports, dashboards, and OLAP analyses in which requirements can be defined up front.

Where to Productionize. A big question facing organizations that use data lakes for data science experiments and analytic exploration is where to productionize the applications. Some argue that it doesn't make sense to move data and logic from Hadoop to a data warehouse just to scale up an application; as Hadoop becomes more industrial strength, they argue, organizations should just process the data in place. However, others claim that Hadoop isn't quite ready to run operational applications and that it still makes sense to rebuild the application on a relational system despite the migration costs and delays.

Integration

Companies that pair a traditional data warehouse with a data lake need to support data flows between the two environments, as well as to and from source and target systems. These data flows are governed by the data processing styles of each environment. (See figure 2.)

A traditional data warehouse lives by the mantra "capture only what's needed." That's due to the high cost of its schema-on-write approach, in which developers must model data before loading it. In contrast, the data lake uses a schema-on-read approach that doesn't require upfront modeling. Therefore, the cost to load new data is minimal, giving rise to the mantra "capture in case it's needed."

Figure 2. Data Flows

The eight data flows depicted in figure 2 are described here:

1: Extract, transform, load (or ELT). Moves structured data, largely from operational systems, into an analytical database. This is the traditional data flow used to create and maintain a data warehouse.

2: Query data. Business users query the data warehouse using SQL-based analytical tools. In some cases, the data warehouse is linked to the data lake using external tables or another pass-through technique that automatically redirects queries to the data lake. This flow also works in reverse.

3: Archive data. The data lake archives cold or lukewarm data from the data warehouse that is rarely queried by business users. This functions as a low-cost, online archive.

4: Land un/semi-structured data. The data lake is ideal for storing large volumes of multi-structured data, such as clickstream, log, social media, or sensor data.

5: Explore data. Power users (e.g., data scientists or data analysts) explore data in the data lake to find interesting combinations, patterns, and insights. In the absence of a robust data catalog (e.g., HCatalog), these users need to know the shape and schema of the data (schema on read) before they can query it.

6: Parse, aggregate data. To standardize and disseminate insights discovered in the data lake, organizations might want to parse and aggregate the data (essentially, clean it up) to support more standardized reporting and queries.

7: Bulk load. Organizations can move the parsed, aggregated data into the data warehouse, where dozens or hundreds of concurrent users can query it or run reports with consistently high performance.

8: Federated query/ETL. Alternatively, rather than moving the data, organizations can query it in place using a data virtualization or query federation tool (see the sketch after this list). This frees users from having to know which database and tool to use to get the data they need. The same works on the back end for data developers, who can create transformation rules in one place and deploy them in multiple places.

The goal of a paired environment (data lake + data warehouse) is to give business users the best of both worlds while minimizing the downsides of each. This requires tight integration between the two environments. Fortunately, leading Hadoop and relational database vendors have developed integration techniques and technologies to bind the two environments seamlessly.
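As a rough illustration of flow 8, the sketch below uses Spark as a simple federation layer: a warehouse table is exposed over JDBC, a lake data set is read in place, and a single SQL statement joins them. The connection string, credentials, and table and file names are hypothetical placeholders; dedicated data virtualization tools offer richer versions of the same idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-query").getOrCreate()

# One engine, two storage systems: the warehouse table arrives over JDBC,
# the lake data is read in place. All connection details and names below
# are hypothetical placeholders.
warehouse = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/edw")
    .option("dbtable", "fact_sales")
    .option("user", "analyst")
    .option("password", "...")
    .load())
lake = spark.read.parquet("s3://lake/refined/clickstream/")

warehouse.createOrReplaceTempView("fact_sales")
lake.createOrReplaceTempView("clickstream")

# Users write one SQL statement and never need to know which system
# physically holds each table -- the essence of flow 8.
spark.sql("""
    SELECT s.customer_key,
           SUM(s.amount_usd)    AS revenue,
           COUNT(c.session_id)  AS visits
    FROM fact_sales s
    LEFT JOIN clickstream c ON c.customer_key = s.customer_key
    GROUP BY s.customer_key
""").show()
```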

The Modern Analytics Ecosystem

Data lakes and data warehouses are part of a modern analytics ecosystem that funnels source data to business users. Just as an oil refinery processes crude oil into a multiplicity of products (e.g., heating oil, gasoline, lubricants), the ecosystem refines data as it flows from source systems to business users. In this sense, the modern analytics ecosystem serves as an information supply chain, or data pipeline. Figure 3 shows the basic composition of the information supply chain, with four zones of refinement.

Figure 3. Information Supply Chain. A modern information supply chain refines data in zones while providing different types of users access to optimal points in the supply chain.

Zones of Refinement

As data flows through the supply chain, it gets refined from its raw state into higher-level objects. A critical feature of this ecosystem is that business users can access the supply chain at any point during the refinement process, as long as they have prior authorization and suitable data skills.

Zone 1: Landing Area. Here, data is ingested, time-stamped, tagged with metadata, and perhaps encrypted. It is often ingested in increments via change data capture, with snapshots via periodic refreshes to support time-series and other analyses. Very few business users access the data in this zone, since it's still in its raw state, reflecting source system schemas. Only data scientists (~2% of employees) work here; they prefer raw, atomic data to the scrubbed, transformed, and aggregated data found in a data warehouse.
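A minimal sketch of the Zone 1 landing pattern in Python, using local files as a stand-in for the lake's landing zone. The envelope fields and directory layout are illustrative assumptions: the payload is stored untouched, while ingest metadata (time stamp, source system, batch id) is attached around it.

```python
import json
import time
import uuid
from pathlib import Path

LANDING = Path("landing")  # stand-in for the lake's landing zone (hypothetical layout)

def land(record: dict, source: str) -> Path:
    """Write a raw record to the landing zone untouched, wrapped in ingest
    metadata -- the Zone 1 pattern described above. Field names are
    illustrative assumptions, not a standard."""
    envelope = {
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "source_system": source,
        "batch_id": str(uuid.uuid4()),
        "payload": record,  # raw, still reflecting the source system's schema
    }
    out = LANDING / source / f"{envelope['batch_id']}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(envelope))
    return out

land({"order_id": 42, "status": "SHIPPED"}, source="erp")
```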

Zone 2: Data Hub. The data hub consists of base-level tables that feed downstream systems and tools. The base tables consist of clean, flattened, subject-area tables as well as master data, reference data, and standard KPIs and dimensions that can serve as the building blocks of analytic applications. Unlike a data warehouse, the data hub is not meant to be queried directly; it serves as a clearinghouse of commonly used data. Data analysts (~8% of employees) access the hub to populate self-service visualization tools and create dashboards to share with others.

Zone 3: Business Views. A business view is a business representation of data in the hub or elsewhere that data explorers (~30% of employees) can view, manipulate, and query. A business view can be the semantic layer of a BI tool, a database view, or a data service provided by a data virtualization tool or data-centric API (see the sketch at the end of this section). The business view makes it easy for data explorers to edit existing reports and dashboards or create ad hoc reports to get exactly the information they need to make decisions or take action.

Zone 4: Analytic Applications. Here, data consumers (~60% of employees) consume predefined reports and dashboards built by data analysts, data engineers, or the IT department using BI tools and custom development environments. These reports and dashboards can answer 60-80% of the questions data consumers ask on a daily basis; they use self-service BI tools for the remainder.

The IT department generally manages Zones 1 and 2, which constitute a data lake, while business unit developers and analysts work in Zones 3 and 4. There is generally a single instance of Zones 1 and 2 (i.e., the data lake) but multiple instances of Zones 3 and 4 (i.e., departmental reports, data sets, and data marts). An existing data warehouse might supply data for Zones 2 and 3.
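As a simple illustration of a business view (Zone 3), the sketch below defines a SQL view over hypothetical hub tables, again using SQLite as a stand-in engine. Data explorers query the view's business-friendly columns without knowing the join logic or physical layout underneath.

```python
import sqlite3

conn = sqlite3.connect("hub.db")  # stand-in for the data hub; name is illustrative

# A business view: a stable, business-friendly representation over base
# tables. Table and column names are hypothetical.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS fact_sales   (customer_key INT, amount_usd REAL);
    CREATE TABLE IF NOT EXISTS dim_customer (customer_key INT, region TEXT);
    CREATE VIEW IF NOT EXISTS revenue_by_region AS
        SELECT c.region, SUM(f.amount_usd) AS revenue
        FROM fact_sales f JOIN dim_customer c USING (customer_key)
        GROUP BY c.region;
""")

# The explorer's entire interaction surface is the view, not the base tables.
for region, revenue in conn.execute("SELECT * FROM revenue_by_region"):
    print(region, revenue)
```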

Multiple Data Flows

In reality, data does not flow through the ecosystem in one direction. Most organizations have multiple pipelines emanating from the data lake, one for each department or domain-specific use case. As seen in figure 4, the data lake feeds a multiplicity of downstream systems, including a data warehouse, data marts, visual discovery servers (i.e., BI data extracts), real-time applications, and data science sandboxes.

Figure 4. The Modern Analytics Ecosystem. The modern analytics ecosystem supports multiple data pipelines and bidirectional flows of data. All the components work together seamlessly to support the continuous flow of data from source to users, who are shielded from its complexity by data abstraction tools.

Bidirectional. The information supply chain is not linear. Rather, it is iterative and bidirectional. Many downstream systems feed data back into the data hub, where the data or its artifacts (i.e., metrics, dimensions, reference data) can be reused.

Abstraction. Business users (and some developers) are shielded from the complexity of this ecosystem by data abstraction tools. Specifically, casual users query data via reports or BI semantic layers that make all data appear unified and local. Power users typically search a data catalog to download relevant data sets and publish their output back into the catalog for others to use.

Data Environment. Some might consider the environment in figure 4 to be a data lake; we prefer to call it a data environment because the term "data lake" is closely associated with Hadoop. Each element depicted in the data environment can be implemented numerous ways; some might reside in Hadoop and others outside of it, in a relational database or the cloud. For instance, the elements in dark blue might be virtual constructs inside a traditional data lake (e.g., Hive tables) or physical constructs outside a traditional data lake (e.g., a relational data warehouse or OLAP cube).

The Modern Analytics Ecosystem in Action

Many companies have already combined data lakes and data warehouses with good results. For example, SAP customers often use SAP Cloud Platform Big Data Services (i.e., Hadoop and Spark as a service) and SAP HANA (i.e., a relational data warehouse) in conjunction with each other to support a multiplicity of users and application use cases. (See figure 5.)

Figure 5. Integrating Data Lakes and Data Warehouses in an SAP Environment. Many SAP customers now combine a data lake (i.e., SAP Cloud Platform Big Data Services) with a relational data warehouse (i.e., SAP HANA) to support a multiplicity of users and application use cases.

Payment Solutions Provider. A provider of payment solutions for the retail industry combined a Hadoop data refinery with an in-memory relational database, both from SAP. The company pushes incoming data from various sources into a big data cluster, where it is integrated with historical data using Spark. This configuration enabled the company to offload ETL operations from its previous relational system to a more cost-effective Hadoop platform. It replicates a copy of incoming relational data to both the Hadoop cluster and the relational engine. This preserves business rules and logic in the relational source data, so they don't have to be recreated from the data lake. It also provides a copy of all data in the data lake to support the discovery requirements of data analysts and scientists. (A schematic sketch of this dual-write pattern follows.)
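The sketch below is a schematic of that dual-write pattern, not the provider's actual SAP pipeline: each incoming record is replicated to a stand-in lake (raw files, complete payload) and a stand-in warehouse (a relational table that preserves the governed subset). All names and fields are hypothetical.

```python
import json
import sqlite3
from pathlib import Path

# Stand-ins for the two targets; paths and schema are illustrative only.
lake_dir = Path("lake/payments")
lake_dir.mkdir(parents=True, exist_ok=True)
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS payments (txn_id TEXT PRIMARY KEY, amount REAL)"
)

def replicate(record: dict) -> None:
    # Lake copy: raw and complete, for discovery by analysts and scientists.
    (lake_dir / f"{record['txn_id']}.json").write_text(json.dumps(record))
    # Warehouse copy: keeps the governed, relational subset so business
    # rules and logic never need to be recreated from the lake.
    warehouse.execute(
        "INSERT OR REPLACE INTO payments (txn_id, amount) VALUES (?, ?)",
        (record["txn_id"], record["amount"]),
    )
    warehouse.commit()

replicate({"txn_id": "t-1001", "amount": 49.95, "channel": "pos"})
```

Note how the lake copy retains fields (here, "channel") that the warehouse schema does not model, which is what makes the lake suitable for open-ended discovery.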

Electronics Company. Similarly, an electronics company captures data from both the product and marketing sides of the house (sensor data from its smart products and web clickstream data) and loads it into a data lake. This data is parsed and loaded into a relational data warehouse (SAP HANA). Data scientists can analyze the data in the data lake, while data analysts access the combined data in the relational data warehouse, effectively connecting newer data from the data lake with enterprise data in the data warehouse for internal reporting and analytics. In turn, the data lake offloads detailed data from the data warehouse.

Organizations such as these exploit the strengths of both data lake and data warehouse, in the context of the modern analytics ecosystem, to modernize their business models and customer experiences.

Liberation

There is no single best way to design a data environment. That's because data processing technologies are evolving fast. Relational, Hadoop, NoSQL, and cloud platforms continue to add features and functions, perpetually shifting which workloads each platform is best suited to support. It's hard to know which platform will dominate the data landscape in five years, or whether a new technology will emerge to supplant them all. Perhaps we won't be talking about data lakes and data warehouses as separate entities down the road, and a single platform or one set of technologies will emerge as a standard architecture.

However, one thing is clear now: the data lake has liberated the data warehouse, which no longer has to do all the heavy lifting in a data analytics environment. The data warehouse can focus on what it does best: supporting standard reporting and dashboard environments based on known requirements. It can leave the pre-processing of multi-structured data and the ad hoc and exploratory work to the data lake. Together, the data warehouse and data lake can form the backbone of any data analytics processing environment and meet the data management needs of modern enterprises.

Need help with your business analytics or data management and governance strategy? Want to learn about the latest business analytics and big data tools and trends? Check out Eckerson Group research and consulting services.