Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud


1  Datametica: The Modern Data Platform, Enterprise Data Hub Implementations. Why is workload moving to Cloud?

2  Enterprise Data Hub & Analytics
- What we used to do
- What is changing
- Why it is changing

3  Enterprise Data Hub & Analytics

What we used to do (and still do):
- Copy data many times
- Break up data into analytics-ready tables
- Extensive ETL to coordinate
- Divide the analytics workload in order to scale
- Move data to lower-cost platforms
- Archive frequently for capacity management
- Re-use personal data marts as input
- Audit and lineage mostly ignored
- Access control poorly managed

What is changing:
- Self-service access to data and tools
- Scale: to almost unlimited size
- Performance: to almost unlimited levels
- Cost: reduced by 50%-80%
- Flexibility: to store and analyze any data
- Latency: down to milliseconds
- ETL: becoming obsolete, replaced by streaming and ELTTT...

Why it is changing:
- Data growth of more than 40% in most companies
- Proliferation of new tools
- Personalization
- Open-source and cloud innovation
- Mobile and online require low latency
- Cost pressures on IT organizations
- Growth of advanced marketing approaches
- Move to digital customer engagement
- Need for better fraud detection
- Need for better opportunity detection
- Geo-location services
- Social communication channels

4  The Single Point of Truth

Always a promise of data warehousing, never delivered because of:
i. The inability to scale
ii. Performance has always been restricted on large data sets (for technical and cost reasons)
iii. The prohibitive cost of large data warehouses
iv. The rigid-schema challenge

With cloud and Big Data architecture, a single point of truth (SPoT) in an Enterprise Data Hub is now realistic:
- Separation of storage from compute (cloud object storage is a paradigm shift)
- Serverless queries that scale without infrastructure limits
- A layering concept for the data model: delay the application of schemas until downstream
- Cloud scalability:
  a. Unlimited storage
  b. Unlimited compute
  c. Flexible capacity on demand
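The "delay the application of schemas until downstream" idea (schema-on-read) can be sketched in a few lines of Python: raw records land in the storage layer untyped, and a schema is applied only when a downstream consumer reads them. The record fields and the `apply_schema` helper are illustrative assumptions, not from the slides.

```python
import json

# Hypothetical raw records, landed as-is in an object-store "incoming" layer.
raw_lines = [
    '{"id": 1, "amount": "19.99", "ts": "2017-03-01"}',
    '{"id": 2, "amount": "5.00"}',  # missing field: handled at read time, not load time
]

# Schema-on-read: the schema lives with the downstream query, not with storage.
schema = {"id": int, "amount": float, "ts": str}

def apply_schema(line, schema):
    """Parse one raw record and cast it to the consumer's schema."""
    rec = json.loads(line)
    return {k: cast(rec[k]) if k in rec else None for k, cast in schema.items()}

rows = [apply_schema(line, schema) for line in raw_lines]
print(rows[0])  # typed record; absent fields come back as None
```

A different consumer could apply a different schema to the same stored bytes, which is what makes the single stored copy reusable.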

5  The Modern Data Platform / The Enterprise Data Hub

What do we mean by MDP and EDH?
- A single point of truth for data
- Analytics separated from data integration and storage
- Real-time data integration where needed
- All workloads referencing a single data source
- A layered and secured data architecture
- Data lineage visibility with audit capabilities
- A dynamic and flexible infrastructure
- Analytics tools brought to the data (not moving/copying data to the tools)
- Tool-agnostic
- A future-proofed architecture, designed to avoid software and vendor lock-in

Bring tools to the data. DO NOT move and copy data to the tools.

6  Conceptual View: Data Hub (diagram)

7  Data Hub Layered View

- Data Sources
- Incoming Layer: incoming data, data acquisition, basic file-level validation
- Persisted Layer: full history, full detail, concatenated
- Data Quality Layer: data profiling, data quality
- Gold Layer: the single point of truth; de-normalized, concatenated, segmented
- Marts: user-defined, persisted, temporary, export, and security-driven marts
- Batch working area
- Cross-cutting: data processing, data serving, metadata management and audit
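A minimal sketch of the layered flow above, with one small function per layer. The layer names come from the slide; the record fields, the validation rule, and the per-customer aggregation are illustrative assumptions only.

```python
# Records arriving in the incoming layer (illustrative data).
incoming = [
    {"cust": "A", "spend": 10},
    {"cust": "B", "spend": -5},   # fails basic validation
    {"cust": "A", "spend": 30},
]

def validate(records):
    """Incoming layer: basic record-level validation."""
    return [r for r in records if r["spend"] >= 0]

def gold(records):
    """Gold layer: the single point of truth, full detail, ordered/segmented."""
    return sorted(records, key=lambda r: r["cust"])

def mart(records):
    """Serving layer: a user-defined mart, aggregated per customer."""
    out = {}
    for r in records:
        out[r["cust"]] = out.get(r["cust"], 0) + r["spend"]
    return out

result = mart(gold(validate(incoming)))
print(result)  # invalid record dropped, spend rolled up per customer
```

The point of the layering is that each downstream stage only ever reads from the stage before it, so every mart traces back to the same gold copy.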

8  Data Hub Reference Architecture (diagram)

Common ingestion, processing, persistence, analytics, and provisioning layers covering data staging, integration, enrichment, modeling, analytics, persistence, and security.

- Sources: file data, DB data, stream data; structured, semi-structured, and un-structured (variety); batch, micro-batch, near-real-time, and streaming/real-time (velocity); low to high volume
- Ingestion Layer: streaming, API, replication
- Storage Layer: raw, refined, transient; NoSQL, relational, in-memory, graph, traditional
- Processing Layer: enrichment, aggregation
- Analytics Layer: predictive analytics, data mining, machine learning, discovery
- Provisioning Layer: provision/package, publish/subscribe, data marts, extracts/feeds; visualization, access, and delivery via charting, reports, mobile, browse/search, and web portal; real-time alerts
- Data Governance: replicate, retain, monitor, quality, lineage, metadata, workflow
- Security: authenticate, authorize, encrypt, protect, audit

DataMetica. All rights reserved. Fast Big Data. Big Experience. Big Expertise.

9  Data Hub Patterns (diagram)

10  What is the Status of Hadoop?

Status:
- More than a decade old
- Mature and stable
- In widespread use
- Deployments still growing

Changes/Trends:
- Taking a minor, supporting role; new workloads are using less pure Hadoop
- Hadoop is being used for data ingestion, profiling, data quality, and ETL replacement
- It is being used less for BI and analytics; Hadoop is being mixed with cloud BI tools
- Dominance of open-source versions (EMR and Apache Hadoop)
- Cloud storage, especially Amazon S3 and Google Cloud Storage, is taking significant share away from Hadoop storage

11  Datametica: Why is workload moving to Cloud? Trends and Considerations

12  Data Growth (chart, zettabytes per year)

- Cloud: 30% CAGR
- On-premise: 9% CAGR

13  Public Cloud Adoption, 2017 vs. (chart)
Source: RightScale

14  Cloud Investment

Analysts and vendors agree: a massive migration of dollars and data into cloud.

Gartner says:
- Worldwide public cloud services will grow 18% this year to become a $247 billion business
- Cloud will easily account for the majority of analytics purchases by 2020

Forrester says:
- It will be a $162 billion business

This projected growth appears to come, in part, at the expense of on-premise Hadoop and data warehouses. The installed base of Hadoop clusters in 2020 will not be as big as the industry widely believed during the peak Hadoop years of 2012 to 2014.

15  Analytics and Hadoop on Cloud: Why Not

- Cloud has (some) ongoing stigma related to putting data into a shared environment
- Legal indemnification by cloud vendors
- Cloud can be expensive if not well managed
- Enterprise security teams have concerns about the use of public cloud
- Some perceive that cloud adds complexity; support costs and complexity are a concern
- Some like to own everything within their own walls
- Pay-for-use is uncomfortable for some IT executives who like to control their spending limits
- Some say: we already made huge investments in data warehouses and analytics tools, and we need to use them
- Copying data to another environment takes time, bandwidth, people, tools, and added cost
- Some say they want fewer tools and technologies, not more; skill implications are a concern
- Audit groups feel that they will lose control and that data lineage becomes more complex
- Architects see increased complexity

16  Analytics and Hadoop on Cloud: Why

- Cloud has fully come of age; all tools are available
- Scale is effectively unlimited
- Performance is equivalent, and better in most cases
- Powerful cost advantages (with caution)
- Security is fully matured and tested
- Simplified data management and analytics environment (e.g. Google BigQuery)
- Reduced support costs and complexity of upgrades
- Purchasing, repairing, and replacing hardware is no longer a consideration
- Agility: spin platforms and BI tools up and down in minutes
- Pay for only what is used
- Focus on analytics to drive value, not on infrastructure and running hardware
- Reduced traditional software licensing
- Eliminate most traditional ETL, reducing cost and data latency
- Easily adopt new tools as they become available

17  Analytics on Cloud: Costs

- Hadoop was considered an economical solution for both structured and unstructured data; it is designed to handle almost any workload
- Deploying Hadoop on-premise was the standard over the past 10 years, because cloud options were immature and expensive
- Not any more: organizations are deploying BI/analytics in the cloud
  - Savings of 20% to 60% over on-premises infrastructure cost
  - The cost of cloud storage is approximately 1 to 3 cents/GB/month
- Separate storage and compute: a revolutionary change from legacy data warehouses and Hadoop
- Eliminate data center capital planning
  - Hardware and software capacity planning is gone; even the concept of servers, nodes, or clusters is fading
  - Buying capacity ahead of the need is gone; short-term increases in capacity are now a routine capability
  - Infrastructure upgrade costs are mostly eliminated; changes in hardware become irrelevant
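The storage figure above is easy to sanity-check with back-of-the-envelope arithmetic: at 1 to 3 cents/GB/month, even a sizeable data set is cheap to hold. The 100 TB volume below is an illustrative assumption, not a figure from the slides.

```python
# Annual cost of cloud object storage at the quoted 1-3 cents/GB/month,
# for an illustrative 100 TB data set (decimal TB -> GB).
TB = 100
GB = TB * 1000

annual_cost = {
    cents: GB * cents / 100 * 12  # cents -> dollars, months -> one year
    for cents in (1, 3)
}
for cents, usd in annual_cost.items():
    print(f"{TB} TB at {cents}c/GB/month = ${usd:,.0f}/year")
# 100 TB at 1c/GB/month = $12,000/year
# 100 TB at 3c/GB/month = $36,000/year
```

With storage this cheap and billed separately from compute, keeping full history in the hub stops being a capacity-planning decision.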

18  Analytics and Hadoop on Cloud: Agility & Efficiency

- Analysts have what they need at any time (within security-profile restrictions)
- A well-governed Enterprise Data Hub (EDH) eliminates up to 50% of analysts' time spent gathering data
- Advanced analytics and machine learning to build predictive models
- Easily move into text analytics to analyze unstructured data
- Leverage real-time streaming analytics
- Development teams can have their own clusters
- Faster advanced analytics results; analysts are more productive
- Full historical data and full fidelity are available
- Focus on driving business through real-time performance KPI analytics
- Tools can be tested and used dynamically: any data, any BI tool, any scale, any time
- Segregate the compute function for each team; share read-only data; stress test in isolated environments
- Geographically dispersed: backup and geographical data distribution is much simpler and faster

19  Analytics on Cloud: Simplify

- A well-designed MDP/EDH greatly simplifies the data landscape
- Sourcing data directly from service buses or transactional systems removes an ETL layer
- Infrastructure is vastly simplified and much faster
- Dev/Test/QA environments are simple and flexible: have as many as needed, for just as long as needed
- Development teams can have isolated environments and access to full data sets
- Stress testing is simple and fast in separate compute environments
- Capital planning is greatly simplified
- Chargeback to businesses is simpler and can be automated
- Tools are rich for automation of processes
- Reduced software-licensing complexity with tools that are pay-per-use
- Accelerators speed the migration to Cloud and reduce errors

20  Accelerators: Faster Migration to Cloud with Fewer Errors

- Traditional data warehouse costs remain high, which is driving this fast-moving trend
- Accelerators and architecture are critical to reduce risk and mistakes and to speed modernization
- Automation and tools greatly speed the migration of data warehouse workloads to Cloud
  - Convert data, jobs, and queries from Teradata, Netezza, Greenplum, Oracle, DB2, Mainframe, and more
- Mature cloud architectures reduce time to production
- Cross-platform data matching and validation tools, both cloud and on-premise, reduce errors and mistakes
- Automated data-model migration
- Automated SQL migration
- Automated migration project sequencing
- Test the migration plan in advance and run in parallel to avoid production issues
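To make the "automated SQL migration" bullet concrete, here is a toy rule-based rewriter that expands a few real Teradata statement shorthands (SEL, INS, DEL) to standard SQL keywords. Production accelerators parse the SQL properly and handle far more than keywords; this regex sketch is illustrative only, and the sample query is invented.

```python
import re

# Teradata allows abbreviated statement keywords; a target warehouse
# typically does not, so a migration pass must expand them.
RULES = [
    (re.compile(r"\bSEL\b", re.IGNORECASE), "SELECT"),
    (re.compile(r"\bINS\b", re.IGNORECASE), "INSERT"),
    (re.compile(r"\bDEL\b", re.IGNORECASE), "DELETE"),
]

def migrate(sql: str) -> str:
    """Apply each rewrite rule in order to one SQL statement."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return sql

print(migrate("SEL cust_id, spend FROM sales.orders"))
# SELECT cust_id, spend FROM sales.orders
```

A real accelerator would run rules like these over thousands of extracted queries, then validate row counts and data between the old and new platforms, which is the cross-platform matching the slide mentions.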

21  Datametica: Why is workload moving to Cloud? I think you can see why!

22  About Us

DataMetica was created and is led by some of the strongest experts in Cloud and Big Data analytics.

- From ideation to support
- Market leaders in Big Data, Cloud migration, and analytics
- Background of implementing large Big Data solutions in numerous global companies
- Accelerators driving Data and Analytics project delivery time down, typically by 50%
- Expert technical know-how and proven implementation methodologies in Big Data & Cloud
- Modern data platforms, operational model, DevOps, and governance
- Implement Big Data analytics, Cloud, BI, and performance-management priorities
- Global Delivery Centre in India: cost-effective access to leading Big Data resources
