Enterprise Data Lake Platforms: Deep Storage for Big Data and Analytics


Insight

Ashish Nadkarni, Laura DuBois

IDC OPINION

In the past 18 months or so, the term data lakes has surfaced as yet another phrase that seemingly attempts to describe a large repository of unstructured data. Other terms used in the industry have included content depots, content repositories, object stores and, of course, the well-used moniker Big Data. Given IDC's previous definition of Big Data, there is an obvious affinity or overlap between Big Data repositories and data lakes. With this document, IDC has sought to define data lakes and enterprise data lake platforms (EDLPs) and distinguish them from other similarly descriptive terms.

In the Worldwide Storage in Big Data 2013-2017 Forecast Update (IDC #244959, December 2013), IDC noted that revenue associated with storage for the Big Data market will surpass $6 billion in 2014. IDC expects a significant portion of this revenue to be associated with unstructured and semistructured data collected and collated from different sources into large repositories. These Big Data repositories offer several benefits over traditional NAS systems, which are used simply to store unstructured data and limit how that data can be ingested or accessed. However, many Big Data repositories also pose restrictions on data ingest and access, forcing the data to be formatted and moved to the platform for analytics.

Data lakes, on the other hand, are a corpus of unstructured and semistructured data collected and collated from different sources into a large repository. More crucially, data lakes support the ingest, persistent storage, and access of data in a manner agnostic to how it is moved into the repository and in a manner that makes it easier for adjacent Big Data workloads to perform in-place analytics on the data in this repository. Specifically, the data is stored using open standards rather than in proprietary formats.
Storage platforms that support data lakes (known as enterprise data lake platforms) present firms with an opportunity to use multiple analytics tools (such as Hadoop) to concurrently analyze the data. IDC expects that macro trends such as the Internet of Things, associated with the move to the 3rd Platform era, will continue to push more and more businesses to adopt EDLPs for their Big Data repositories and:

- Consolidate their Big Data storage islands into a single repository for unstructured, structured, and semistructured data (including data stored in Hadoop- and NoSQL-friendly formats). They can leverage a single platform that supports upstream (flash) tiering for analysis of hot data sets and downstream (cloud, cold storage) tiering for inactive data sets.
- Use a "data in place" model that can concurrently service various Big Data workloads (via their native access mechanisms) such as Hadoop without moving data to the compute layer and back.

In essence, they can take a step closer to replacing their enterprise data warehouse (EDW) with an EDL that stores data in an open and multiple-access format.

July 2014, IDC #250000

IN THIS INSIGHT

This IDC Insight provides IDC's perspective on enterprise data lake platforms. The concept is still new and, at this time, in IDC's view, does not warrant a comprehensive taxonomy. However, given the characteristics of systems that are designed to support EDLPs, IDC contends that the relevant definitions and taxonomies, when developed, will align closely to IDC's Worldwide File- and Object-Based Storage Taxonomy, 2014 (IDC #245940, January 2014).

SITUATION OVERVIEW

In a recent document that outlined the growth of unstructured data in the enterprise, IDC noted that suppliers shipped nearly 52EB of capacity in 2013, corresponding to nearly $36.1 billion in revenue. In IDC's estimates, unstructured data accounted for 64.6% of the total capacity shipped but only 46.4% of the total revenue. IDC estimates that by the end of 2015, as far as enterprise disk storage systems are concerned, unstructured data will surpass structured data both in terms of capacity shipped and customer revenue (unstructured data already occupies the lion's share of cloud-based storage):

- Much of the unstructured data growth in the traditional enterprise will come from end-user computing, an ecosystem that is slowly shifting toward BYOD devices like smartphones and tablets.
- Chiefly, however, in many industries, this data growth will come from devices, sensors, and "connected things" that are spewing analytics-rich (unstructured) data sets onto the enterprise disk storage infrastructure. Much of this infrastructure is capacity optimized, meaning it is tuned for storing large quantities of data at low dollar-per-gigabyte costs.
- A new hybrid data type known as "semistructured" data will slowly gain a foothold in many industries that are pursuing the "Internet of Things" (i.e., the proliferation of sensor and machine-generated data).
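The gap between capacity share and revenue share quoted above can be turned into an implied price per gigabyte. The following is a rough back-of-the-envelope sketch, not an IDC calculation: it assumes decimal units (1EB = 10^9 GB), treats everything that is not unstructured as structured, and uses the rounded "nearly" figures from the text, so the results are indicative only. All names in the code are illustrative.

```python
# Approximate 2013 figures quoted in the text ("nearly 52EB", "nearly $36.1 billion").
TOTAL_CAPACITY_EB = 52.0
TOTAL_REVENUE_BUSD = 36.1           # billions of US dollars
UNSTRUCTURED_CAPACITY_SHARE = 0.646
UNSTRUCTURED_REVENUE_SHARE = 0.464

GB_PER_EB = 1e9                     # assumption: decimal units


def implied_dollars_per_gb(revenue_busd, capacity_eb):
    """Return revenue divided by capacity, expressed in dollars per gigabyte."""
    return (revenue_busd * 1e9) / (capacity_eb * GB_PER_EB)


# Split the totals using the quoted shares.
unstructured_eb = TOTAL_CAPACITY_EB * UNSTRUCTURED_CAPACITY_SHARE
unstructured_busd = TOTAL_REVENUE_BUSD * UNSTRUCTURED_REVENUE_SHARE

# Assumption: the remainder is treated as structured data.
structured_eb = TOTAL_CAPACITY_EB - unstructured_eb
structured_busd = TOTAL_REVENUE_BUSD - unstructured_busd

unstructured_rate = implied_dollars_per_gb(unstructured_busd, unstructured_eb)
structured_rate = implied_dollars_per_gb(structured_busd, structured_eb)

print(f"unstructured: {unstructured_eb:.1f}EB, ${unstructured_busd:.2f}B, "
      f"${unstructured_rate:.2f}/GB")
print(f"structured:   {structured_eb:.1f}EB, ${structured_busd:.2f}B, "
      f"${structured_rate:.2f}/GB")
```

Under these assumptions, unstructured capacity works out to roughly half the implied dollar-per-gigabyte price of the remainder, which is consistent with the observation that much of this infrastructure is capacity optimized.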
IDC also expects that unstructured and semistructured data placement techniques will undergo a shift as well, moving from unitary file systems to distributed (scale-out), seamlessly extensible, and unified file- and object-based systems. By leveraging server-side technologies like PCIe flash, enterprises will also be able to realize the value of this data by analyzing it continuously and on demand.

Today, businesses are forced to select a suitable storage platform for each of their Big Data and unstructured data storage workloads. This selection has to be made during the design stage and largely locks them into that platform. While selecting the platform, they have to:

- Make the choice between a file-based and an object-based platform.
- Choose a single-access mechanism up front for feeding data into the platform.
- Restrict their applications to the metadata limitations of the access protocol or mechanism.
- Make the painful choice between moving data to compute or vice versa, especially for large data sets that are incompatible with each other.
- Create separate storage infrastructure islands for different workloads such as Hadoop, search, and discovery. This results in a fragmented infrastructure, the antithesis of where the IT industry wants to go with converged and densely utilized infrastructure.

Other solutions create "islands" of storage that are difficult and costly to manage, resulting in hot spots, inefficiencies, and poor storage utilization. They also require heavy lifting to scale. For example, many businesses that deploy Hadoop have to implement a separate repository for Hadoop and move data that requires MapReduce operations to this repository. Not only does this become inefficient as the size of the data sets increases, but it also creates multiple copies of the same data set, something that IDC believes businesses are grappling with as well (see The Copy Data Problem: An Order of Magnitude Analysis, IDC #239875, March 2013, and The Copy Data Management Challenge: 65% of External Storage Capacity Is Used for Copy Data, IDC #lcus24655014, January 2014).

Data Lakes and Enterprise Data Lake Platforms

Like Big Data repositories, data lakes can be thought of as a corpus of unstructured and semistructured data collected and collated from different sources into a single unified data pool (hence the term data lake). A data lake offers multiple access points for data "on-ramping," meaning support for standard network access protocols (NFS, CIFS, pNFS) as well as RESTful object interfaces by which applications can write data into the repository. More crucially, a data lake supports the storing of the data in a manner agnostic to how it is moved into the repository and in a manner that makes it easier for adjacent Big Data workloads to analyze it. Specifically, the data is stored using open standards rather than in proprietary formats.
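The access-agnostic idea can be illustrated with a toy model. The sketch below is purely illustrative and does not represent any vendor's product or API: `ToyDataLake`, its methods, and the interface labels are hypothetical names, and the "protocols" are just strings rather than real NFS, HDFS, or REST endpoints. It shows a single stored copy of an object being written through one interface and read through another, with the ingest path recorded only as metadata.

```python
from dataclasses import dataclass, field

# Hypothetical, illustrative subset of access interfaces (labels only).
SUPPORTED_INTERFACES = {"nfs", "smb", "hdfs", "rest"}


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyDataLake:
    """In-memory stand-in for a repository that keeps one copy per object."""

    def __init__(self):
        self._objects = {}

    def ingest(self, key, data, via, **metadata):
        """Store data once; the ingest interface is kept only as metadata."""
        self._check(via)
        self._objects[key] = StoredObject(data, {"ingested_via": via, **metadata})

    def read(self, key, via):
        """Return the same bytes regardless of which interface is used."""
        self._check(via)
        return self._objects[key].data

    def metadata_of(self, key):
        """Expose metadata programmatically, independent of access interface."""
        return dict(self._objects[key].metadata)

    @staticmethod
    def _check(interface):
        if interface not in SUPPORTED_INTERFACES:
            raise ValueError(f"unsupported interface: {interface}")


lake = ToyDataLake()
lake.ingest("/logs/sensor-01.json", b'{"temp": 21.5}', via="nfs", source="sensor-01")

# An analytics job reads the same object through a different interface;
# no export or second copy is involved in this model.
payload = lake.read("/logs/sensor-01.json", via="hdfs")
print(payload, lake.metadata_of("/logs/sensor-01.json"))
```

The design point is that the storage layer holds the object and its metadata once, while each interface is only a view onto it; a real platform would have to map protocol semantics (permissions, locking, naming) onto that shared copy, which this sketch leaves out.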
In enterprises, data lakes can be considered to be a central "deep storage" repository for consolidating different types of unstructured, semistructured and, to some extent, structured data. Enterprise data lake platforms, or EDLPs, are seen as a solution to the data deluge and access conundrum faced by enterprises that cannot be solved using Big Data repositories built on a single platform like Hadoop. Akin to enterprise data warehouses, EDLPs allow disparate and incoherent data types to be consolidated onto a single, scalable, extensible, and agile storage platform. IDC expects that most EDLPs will need to support:

- Multiformat, multiprotocol data ingest and access. EDLPs need to allow data to be ingested (i.e., placed on them) via a variety of file, object, and even block interfaces that include, but are not limited to, NFS, pNFS, SMB, NDMP, HDFS, or RESTful object interfaces (such as OpenStack Swift, Amazon S3, and CDMI) by which applications can write data into the repository. Access mechanisms can be open, standards based or, where required, application specific. The expectation then is that this same data should be consumable (read: accessible) via different mechanisms or interfaces without the need to copy, replicate, or export it (into a different format).
- Access-agnostic storage. EDLPs should not make any presumptions on the manner in which the data is ingested or accessed; the two mechanisms could be completely different. For example, data ingested via NFS could be accessed via HDFS or via an API. This also means that unlike typical file-based storage platforms, EDLPs should make their metadata extensible and programmatically accessible, beyond the normal expectations of a specific access interface or API. EDLPs can therefore be suitable for both traditional workloads such as home directories, file shares, sync-and-share applications, and Hadoop, as well as next-generation business and social analytics and cloud and mobile applications.
- Deep storage with "infinite" scalability and efficiency. EDLPs should support unprecedented nondisruptive scalability and agility. EDLPs should also support efficiency, both for upstream and downstream tiering. The platform should make use of upstream (flash) tiering that supports efficient analytics workloads on hot data sets and downstream (cloud, cold storage) tiering for storing inactive data sets but, crucially, should support data movement between tiers depending on the I/O activity and decay. In other words, the highly active data is placed on a tier optimized for performance (dollar per IOPS), while inactive data is placed on a tier optimized for capacity (dollar per gigabyte). EDLPs also need to have built-in data optimization, protection, and availability mechanisms that exceed the service levels established for the workloads operating on them. As a consolidated repository, it is essential that the platform also have a robust built-in data loss prevention (DLP) mechanism.
- Support for the three As of security. Given that most EDLPs will need to store sensitive data sets, it is essential that they support robust authorization, audit, and authentication mechanisms for users and applications. They also need to support inline and at-rest data encryption.

With a platform that acts like an enterprise data lake, businesses can minimize fragmentation and gain better and more consistent insight into their entire data.

FUTURE OUTLOOK

IDC believes that EDLPs will become a core part of enterprise storage infrastructure in the coming years.
As businesses learn to collate data from various sources and convert it into consumable nuggets of information for their various organizational units, they will no doubt be compelled to establish enterprisewide data lakes upon which various workloads can concurrently operate. Such data lakes will enable existing workloads and be future-proof, seamlessly supporting new applications and workloads. It is evident from recent announcements that suppliers like EMC are beginning to position their scale-out file and object platforms like Isilon to support data lakes as the next stop on a journey to make their platforms ready for the next wave of Big Data, social, mobile, and cloud applications.

EDLPs set the stage for an extension and perhaps an eventual unification of the scale-out file and object market segments. This unification will come at the expense of the traditional scale-up unitary file server and file-based storage (NAS) market segments, both of which will shrink further as businesses consolidate their file-based and object-based data onto deep storage repositories, also known as enterprise data lakes.
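The tiering requirement described earlier under deep storage (movement between tiers depending on I/O activity and decay) can be sketched as a simple policy. This is a minimal illustration under stated assumptions, not a description of how any EDLP implements tiering: the exponential half-life, the thresholds, and the middle "capacity" tier are all hypothetical choices made for the example; the source names only the upstream (flash) and downstream (cloud, cold storage) tiers.

```python
# Hypothetical policy parameters, chosen only for illustration.
HALF_LIFE_DAYS = 7.0    # activity score halves after this many idle days
HOT_THRESHOLD = 5.0     # at or above: place on the performance tier
COLD_THRESHOLD = 0.5    # at or below: place on the cold tier


def decayed_score(score, idle_days, half_life_days=HALF_LIFE_DAYS):
    """Apply exponential decay to an activity score after idle_days without I/O."""
    return score * 0.5 ** (idle_days / half_life_days)


def choose_tier(score):
    """Map an activity score to a tier name (tier names are illustrative)."""
    if score >= HOT_THRESHOLD:
        return "flash"      # optimized for performance (dollar per IOPS)
    if score <= COLD_THRESHOLD:
        return "cold"       # optimized for capacity (dollar per gigabyte)
    return "capacity"       # assumed default tier between the two


# A data set with a fixed starting score, evaluated after increasing idle time.
initial_score = 8.0
for idle_days in (0.0, 7.0, 28.0):
    score = decayed_score(initial_score, idle_days)
    print(f"idle {idle_days:>4} days: score {score:.2f} -> {choose_tier(score)}")
```

In a real platform the score would be driven by observed I/O rather than a fixed starting value, and moving data between tiers carries its own cost, so a production policy would typically add hysteresis to avoid data bouncing between tiers.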

About IDC

International Data Corporation (IDC) is the premier global provider of market intelligence, advisory services, and events for the information technology, telecommunications, and consumer technology markets. IDC helps IT professionals, business executives, and the investment community make fact-based decisions on technology purchases and business strategy. More than 1,100 IDC analysts provide global, regional, and local expertise on technology and industry opportunities and trends in over 110 countries worldwide. For 50 years, IDC has provided strategic insights to help our clients achieve their key business objectives. IDC is a subsidiary of IDG, the world's leading technology media, research, and events company.

Global Headquarters
5 Speen Street
Framingham, MA 01701 USA
508.872.8200
Twitter: @IDC
idc-insights-community.com
www.idc.com

Copyright Notice

This IDC research document was published as part of an IDC continuous intelligence service, providing written research, analyst interactions, telebriefings, and conferences. Visit www.idc.com to learn more about IDC subscription and consulting services. To view a list of IDC offices worldwide, visit www.idc.com/offices. Please contact the IDC Hotline at 800.343.4952, ext. 7988 (or +1.508.988.7988) or sales@idc.com for information on applying the price of this document toward the purchase of an IDC service or for information on additional copies or Web rights.

Copyright 2014 IDC. Reproduction is forbidden unless authorized. All rights reserved.