Take a Dive into the Data Lake

1 Take a Dive into the Data Lake
Philip Russom, Ph.D.
Senior Research Director, TDWI
March 29, 2017

2 SPONSOR

3 PHILIP RUSSOM Senior Research Director for Data Management, TDWI

4 ROBERT ROUTZAHN Program Director, Data Warehouse and Hadoop, IBM

5 AGENDA
Dive into the Data Lake
- Data Lakes
  1. What are they?
  2. Why are lakes hot?
  3. What tech is required?
  4. How do I get started?
  5. What are success factors?
- Busting five myths about Data Lakes
- Questions and Answers

6 Question No.1 What is a data lake? tdwi.org

7 DEFINING Data Lake
- Method for organizing large volumes of highly diverse data
  - For broad data exploration and discovery, plus advanced analytics
  - Depending on the platform, a data lake may handle many data structures
- Tends to ingest data quickly & prep it later, so data is available ASAP
- Tends to persist data in its original raw detailed state
  - So data can be repurposed repeatedly for new analytics and use cases
  - For analytics needing detailed source: mining, statistics, machine learning
- Supports multiple use cases, in multiple data architectures
  - Marketing, sales, healthcare, logistics // EDW, data integration, analytics
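
The "ingest quickly, prep later" and "persist raw, repurpose repeatedly" points above can be illustrated with a minimal Python sketch. It is not taken from the presentation: the directory layout and the names `ingest_raw`, `read_prepared`, `for_marketing`, and `for_logistics` are hypothetical, and a local temporary folder stands in for whatever platform actually hosts the lake.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path, source: str, payload: str) -> Path:
    """Land a payload exactly as received; no parsing or cleansing at ingest."""
    target_dir = lake_dir / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a running sequence number per source.
    sequence = len(list(target_dir.glob("*.json")))
    target = target_dir / f"{sequence:06d}.json"
    target.write_text(payload, encoding="utf-8")
    return target


def read_prepared(lake_dir: Path, source: str, prepare):
    """Apply a use-case-specific preparation function when the data is read."""
    rows = []
    for path in sorted((lake_dir / "raw" / source).glob("*.json")):
        rows.extend(prepare(json.loads(path.read_text(encoding="utf-8"))))
    return rows


def for_marketing(record):
    # One row per customer, with the email normalized for matching.
    return [{"customer": record["customer"],
             "email": record["email"].strip().lower()}]


def for_logistics(record):
    # One row per ordered item; the email is irrelevant to this use case.
    return [{"customer": record["customer"], "sku": item["sku"], "qty": item["qty"]}
            for item in record["items"]]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        raw = ('{"customer": "C-1", "email": " Ann@Example.com ", '
               '"items": [{"sku": "A", "qty": 2}]}')
        path = ingest_raw(lake, "orders", raw)
        stored_unchanged = path.read_text(encoding="utf-8") == raw
        marketing = read_prepared(lake, "orders", for_marketing)
        logistics = read_prepared(lake, "orders", for_logistics)
        return stored_unchanged, marketing, logistics
```

Because the raw record is never altered, the same stored file serves both the marketing view and the logistics view, and a later use case can add its own preparation function without re-ingesting anything.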

8 The Governed Data Lake: Data Lake Redefined
IBM's vision of a data lake is a governed data lake: a group of repositories that are managed, governed, protected, connected by metadata, and provide self-service access, with a methodology to build analytic capabilities over all data in a manner that documents their contents, provides data lineage back to source systems, and allows the use of the best tool for the job.
IBM Corporation

9 MYTH: A data lake is a dumping ground.
A data lake is a balancing act
- Allow some users to dump some data
  - Required for sandboxing, data science, analytics, profiling new data...
  - Yet, control everyone else and all other data
- Respect the lake's mandate for raw data
  - Required for exploration, repurposing data, some forms of discovery-oriented analytics...
  - Yet, alter and improve some datasets in a lake so they achieve full usability and business value

10 Question No.2 Why are data lakes suddenly hot?

11 Drivers for Data Lakes
- On one hand, businesses depend on data more heavily than ever before.
- On the other, many organizations have new data coming online
  - From new applications and sources; customer channels and touch points; machinery, sensors, IoT; external sources like partners & data aggregators
- Technical teams are under pressure to capture new data and develop its business value, largely via analytics.
- The data lake captures big data, new data, other diverse sources
  - Enables data exploration & profiling, visualization & analytics, complete views

12 Why this matters: What a Governed Data Lake Is and Isn't
What a governed data lake is
- An environment where users can access vast amounts of raw data
- An environment for developing and proving an analytics model, and then moving it into production
- An analytics sandbox for exploring data to gain insight
- An enterprise-wide catalog that helps users find data and link business terms with technical metadata
- An environment for enabling reuse of data transformations and queries
What a governed data lake isn't
- A data warehouse or data mart for housing all of the data in an enterprise
- A replacement operational data store (ODS)
- A high-performance production environment
- A production reporting application
- A purpose-built system to solve a specific problem (though a purpose-built data mart could be fed from a data lake)

13 MYTH: Data lakes are for Internet firms.
- TDWI has found data lakes in production in several mainstream industries
  - Financials, insurance, telco, healthcare
- TDWI has found departmental data lakes
  - Marketing, sales forecasting, logistics, R&D
- TDWI has found multiple forms of analytics operating on lake data
  - From mining to clustering to predictive to NLP
  - EDW extension, fraud, risk, customer segmentation, security breach, insider trading

14 Question No.3 What kind of technology do I need to set up a data lake?

15 Choose Data Platforms & Tools Carefully
- Choice of data platform is important
  - Hadoop, maybe other file systems
  - Relational database, maybe multiple ones
  - Hybrid mix, with integration across platforms
- Choice of tooling is important
  - Analytics, exploration, viz, integration & prep
  - Give priority to end user requirements
- Cloud-based tools and platforms are being used, along with on-premises ones

16 Hadoop is not the data lake silver bullet!
The Hype
- Centralize all your data!
- Low cost storage!
- Data scientist's paradise!
- Replace your warehouses!
The Reality
- Many departments still withhold data (security concerns)
- Custom ETL coding is expensive
- ML workloads take too much RAM / don't parallelize well
- Hadoop is too slow for reporting workloads
Even Worse
- Data governance on Hadoop alone is not sufficient
- Many still think of Hadoop as the sole component in a data lake

17 MYTH: Data lakes require Hadoop.
- Most data lakes are on Hadoop (53%)
- Some are on relational database management systems (RDBMSs) (5%)
- A data lake can be a virtual entity that spans multiple data platforms of multiple types. For example:
  - Hadoop cluster integrated with RDBMS (24%)
  - Multiple types of RDBMSs, each specialized for general use, columns, clouds, SQL-based analytics, graph analytics, etc.

18 Question No.4 How can I get started with a data lake?

19 Get a Goal, Plan, Tools, Staff, and Skills
- Start with a business use case that a lake can address w/ROI
  - E.g., consolidation of customer data; biz users needing exploration; new data sources needing leverage; adjustments to old data architectures
- Create a plan. Prioritize use cases. Update as biz evolves.
- Choose a data platform you can stand up fast & easy
- Get tools that work with platform & satisfy user requirements
- Augment your staff with consultants experienced with data lakes
- Train staff for Hadoop, analytics, lakes. Maybe hire, too.

20 Efficient Management, Protection and Access
[Diagram labels: Data Lake Services; Data Lake Repositories; Information Management / Governance Fabric; Governed Data Lake]
The governed data lake reference architecture includes a description of governance and management processes and definitions to ensure the human and business systems around the technology support a collaborative, self-service, and safe environment for data use.

21 MYTH: If we build it, they will come.
- They won't come unless you have
  - Compelling business case / Right tools for self-service, viz, analytics
- They won't stay unless you provide
  - Governed, quality, trusted data
  - Plan for controlled expansion
- They won't succeed without
  - Data-lake-oriented training and consulting help

22 Question No.5 What are the critical success factors for data lakes?

23 Users' Concerns RE: Data Lakes
Addressing these ensures success
[Image text: UNKNOWN TERRITORY AHEAD]
- Lack of data governance (DG) (41%)
- Inadequate skills for big data (32%), Hadoop (32%), data integration (32%)
- Lack of compelling business case (31%) or sponsorship (28%)
- Exposing sensitive data (28%)
- Immaturity of data lake concept (27%)
- OTHER: Self-service tools for end users, Business metadata, Automation for DG
SOURCE: TDWI report on Data Lakes, to be published April 1, 2017

24 Benefits of Governed Data Lakes
- Easier access to a broad range of data across the organization
- Faster data preparation
- Enhanced agility
- Maintain data quality, security and governance
- More accurate insights
- Better decisions

25 MYTH: All data lakes become swamps.
- A lake may deteriorate into a data swamp
  - An undocumented and disorganized data store
  - Difficult to navigate, use, trust, and leverage
- A data swamp results from a lack of
  - Data governance, stewardship, or curation
  - Control over incoming data and access to data
  - Standards & data mgt practices for data in lake
- Data governance prevents swamps
  - A data steward should curate the data of a lake
  - DG policies should define controls & standards
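
One way to picture the swamp-prevention points above is a small catalog audit that flags datasets with no steward or documentation. The sketch below is illustrative only: `CatalogEntry`, `REQUIRED_FIELDS`, `governance_gaps`, and `swamp_report` are hypothetical names, and the three required fields are an assumed stand-in for whatever controls a real data governance policy would define.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed minimum documentation a DG policy might require for each dataset.
REQUIRED_FIELDS = ("steward", "description", "source_system")


@dataclass
class CatalogEntry:
    name: str
    steward: Optional[str] = None
    description: Optional[str] = None
    source_system: Optional[str] = None


def governance_gaps(entry: CatalogEntry) -> List[str]:
    """Return the required fields that are missing or empty for one dataset."""
    return [field for field in REQUIRED_FIELDS if not getattr(entry, field)]


def swamp_report(catalog: List[CatalogEntry]) -> Dict[str, List[str]]:
    """Map each under-documented dataset name to its list of missing fields."""
    report = {}
    for entry in catalog:
        gaps = governance_gaps(entry)
        if gaps:
            report[entry.name] = gaps
    return report


catalog = [
    CatalogEntry("crm_contacts", "j.doe", "Customer contact records", "CRM"),
    CatalogEntry("sensor_dump"),  # landed with no steward or documentation
]
```

Running `swamp_report(catalog)` on this toy catalog lists only `sensor_dump`, the kind of undocumented dataset a data steward would need to curate before it is opened to more users.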

26 SUMMARY Take a Dive into the Data Lake
1. What is a data lake?
   - Way of organizing large repositories of data from diverse sources, structures, containers
   - Exploration, analytics, integration, DW extension
2. Why are data lakes suddenly hot?
   - Compelling use cases across industries, departments, data infrastructures
3. What kind of tech do I need?
   - Data platforms: Database, Hadoop, file system
   - Tools for analytics, exploration, integration & prep
4. How do I get started with a data lake?
   - Biz-driven use case w/ROI & data it requires
5. What are the critical success factors?
   - Governance, skills, biz case, tools, data platform

27 Question-&-Answer Period
Philip Russom, Research Director, TDWI
Rob Routzahn, Program Director, IBM

28 Further Study
Attend TDWI webinar by Philip Russom
Data Lakes: Purposes, Practices, Patterns, and Platforms
April 13, 2017, Noon ET
Register Online:

29 Further Study
Attend the next TDWI Leadership Summit
Architecting Modern Data Ecosystems
Chicago, May 8-9, 2017
bit.ly/tdwichicago17