Sanoma Big Data Migration Sander Kieft

About me
Manager Core Services at Sanoma, responsible for common services, including the Big Data platform.
Work: centralized services, data platform, search
Likes: work, water(sports), whiskey, tinkering (Arduino, Raspberry Pi, soldering stuff)

Sanoma, a publishing and learning company
2 Finnish newspapers
Over 100 magazines in The Netherlands, Belgium and Finland
TV channels in Finland and The Netherlands, incl. on-demand platforms
Learning applications
Websites
Mobile applications on various mobile platforms

Sanoma, Big Data use cases
Users of Sanoma's websites, mobile applications and online TV products generate large volumes of data. We use this data to improve our products for our users and our advertisers.
Use cases: dashboards and reporting, product improvements, advertising optimization, reporting on various data sources, self-service data access, editorial guidance, A/B testing, recommenders, search optimizations, adaptive learning/tutoring, (re)targeting, attribution, auction price optimization

Photo credits: kennysarmy - https://www.flickr.com/photos/kennysarmy/3207539502/

History < 2008 2009 2010 2011 2012 2013 2014 2015

Data Lake Photo credits: Wikipedia

In science, one man's noise is another man's signal. -- Edward Ng

History < 2008 2009 2010 2011 2012 2013 2014 2015 11 6/23/2016 Sanoma Media

Lake filling

Self service Photo credits: misternaxal - http://www.flickr.com/photos/misternaxal/2888791930/

Enabling self service
Extract and transform flow: source systems (Finland, Netherlands, Learning, SL) -> staging -> raw data (native format) -> normalized data structure (FI, NL), stored on HDFS and queried through Hive
Reusable stem data: weather, international event calendars, international AdWord prices, exchange rates
Consumed by data scientists and data analysts

Storage growth
(chart: net usage in TB from March 2010 to April 2015, y-axis 0-140 TB)
Current growth rate: 100 GB/day source data, 150 GB/day refined data, 1 server/month

Keeps filling

Starting point (Sanoma data center and AWS cloud)
Collecting: S3, Kafka; stream processing: Storm
Compute & storage: Cloudera Distribution of Hadoop, with Hive and the Hadoop User Environment (HUE), in the Sanoma data center
ETL (Jenkins & Jython) from EDW sources; reporting with QlikView and R Studio
Serving: Redis / Druid.io, in the AWS cloud

Present < 2008 2009 2010 2011 2012 2013 2014 2015 YARN

60+ NODES

720TB CAPACITY

475TB DATA STORED (GROSS)

180TB DATA STORED (NET)

150GB DAILY GROWTH

50+ DATA SOURCES

200+ DATA PROCESSES

3000+ DAILY HADOOP JOBS

200+ DASHBOARDS

275+ AVG MONTHLY DASHBOARD USERS

Challenges
Positives: users have been steadily increasing; demand for more real-time processing, faster data availability, higher availability (SLA), quicker availability of new versions, specialized hardware (GPUs/SSDs), quicker experiments (fail fast)
Negatives: data center network support end of life, outsourcing of our own operations team, version upgrades harder to manage, poor job isolation between test/dev/prod and between interactive, ETL and data science workloads, need for a higher level of security and access control

Lake was full Photo credits: Wikipedia

Big Data Hosting Options
Options evaluated across storage, compute and services: current on premise, on premise with new hardware, generic cloud provider A (AWS naive, AWS Redshift, AWS EMR + Spot), generic cloud provider B (Cloud B), specialized cloud
Adding data services from the cloud provider to the comparison
Replacing a portion of the processing with spot pricing
Considerations:
Flexibility and time-to-market of new data services more important than a slight cost increase
More knowledge available in-house and in the market
Investment in connectivity and automation already made by Sanoma
Other Sanoma assets also in AWS
Spot pricing was the deciding factor

Amazon Instance Buying Options (example: c3.xlarge, 4 vCPU, 7.5 GiB)
On-demand instances: hourly pricing, $0.239 per hour
Reserved instances: $0.146 - $0.078 per hour; up to 75% cheaper than on-demand pricing; 1-3 year commitment; (large) upfront costs; typical breakeven at 50-80% utilization
Spot instances: ~$0.03 per hour (> 80% reduction); AWS sells excess capacity to the highest bidder; hourly commitments at a price you name; can be up to 90% cheaper than on-demand pricing
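Before committing to a bidding strategy, the recent spot market for a given instance type can be compared with the on-demand rate straight from the EC2 API. A minimal sketch with the AWS CLI; the region and instance type below are just examples:

    # Recent spot prices for c3.xlarge Linux instances, per availability zone
    aws ec2 describe-spot-price-history \
        --region eu-west-1 \
        --instance-types c3.xlarge \
        --product-descriptions "Linux/UNIX" \
        --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
        --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,Timestamp]' \
        --output table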

Amazon - Spot Prices
AWS determines the market price based on supply and demand; the instance is allocated to the highest bidder, who pays the market price
An allocated instance is terminated (with a 2-minute warning) when the market price rises above your bid price
Diversification across instance families, instance types and availability zones increases continuity
(chart: spot price averaging ~$0.03, roughly 80% below the on-demand price)
Spot instances can take a while to provision, with a different workflow than the traditional on-demand model
Guard against termination with adequate pricing, but don't try to prevent it. Automation is key.
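The 2-minute warning is exposed through the EC2 instance metadata service, so automation running on the node itself can watch for it and start draining work. A minimal sketch, assuming a plain shell loop is acceptable; the polling interval and log message are illustrative:

    # The endpoint returns a timestamp (and curl succeeds) only once the
    # spot instance has been marked for termination; otherwise it is a 404.
    while true; do
        if curl -s -f http://169.254.169.254/latest/meta-data/spot/termination-time; then
            echo "Spot termination notice received - draining work"
            break
        fi
        sleep 5
    done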

Amazon Spot Pricing
Via the console or via the CLI:

    aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
        --instance-groups \
        InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
        InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
        InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Destination Amazon Photo credits: Amauri Aguiar- https://www.flickr.com/photos/amauriaguiar/3204101173/

Amazon Migration Scenarios
Three scenarios evaluated for moving (S, M, L):
EC2 + EBS + S3: all services on EC2, only source data on S3
EC2 + EBS + EMR + S3: all data on S3, EMR for Hadoop, EC2 only for utility services not provided by EMR
EC2 + EBS + EMR + Redshift + S3: all data on S3, EMR for Hadoop, interactive querying workload moved to Redshift instead of Hive
Easier to leverage spot pricing, due to the data being on S3

Target architecture
Sanoma data center: EDW sources, ETL (Jenkins & Jython), QlikView reporting
AWS cloud: S3 buckets (sources, work, Hive data warehouse); Amazon EMR cluster running Hive, with core instances, task instances and task spot instances; HUE, Zeppelin, R Studio; Amazon RDS; collecting (Kafka), stream processing (Storm), serving (Redis / Druid.io)

Node types
(diagram: Amazon EMR cluster with Hive, HUE and Zeppelin; core instances, task instances, task spot instances)
Basic cluster: 1x MASTER m4.2xlarge, 25x CORE d2.xlarge, 40x TASK r3.2xlarge
Basic cluster with spot pricing: 1x MASTER m4.2xlarge, 25x CORE d2.xlarge, 20x TASK r3.2xlarge (on demand), 40x TASK r3.2xlarge (spot pricing)
Possible bidding strategies:
- Bid the on-demand price
- Diversification of instance families, instance types and availability zones
- Bundle and rolling prices, e.g. 5x $0.01, 5x $0.02, 5x $0.03
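One way to implement the bundled bidding strategy is to attach several small task instance groups to a running cluster, each with its own bid, so losing one price tier only removes part of the task capacity. A sketch with the AWS CLI; the cluster id, group names, counts and prices are hypothetical:

    # Three task groups with staggered bid prices on an existing cluster
    aws emr add-instance-groups \
        --cluster-id j-XXXXXXXXXXXXX \
        --instance-groups \
        InstanceGroupType=TASK,InstanceType=r3.2xlarge,InstanceCount=5,BidPrice=0.01,Name=spot-low \
        InstanceGroupType=TASK,InstanceType=r3.2xlarge,InstanceCount=5,BidPrice=0.02,Name=spot-mid \
        InstanceGroupType=TASK,InstanceType=r3.2xlarge,InstanceCount=5,BidPrice=0.03,Name=spot-high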

Enabling self service in Amazon
Extract and transform flow: source systems (Finland, Netherlands, Learning, SL) -> staging -> raw data (native format) -> normalized data structure (FI, NL), now running on Amazon EMR with S3 buckets for sources, work and the Hive data warehouse, plus Amazon Redshift
Reusable stem data: weather, international event calendars, international AdWord prices, exchange rates
Consumed by data scientists and data analysts

Migration

Migration
Migration of the data from HDFS to S3 took a long time: large volume of data, and both the source data and the data warehouse had to be moved
Using EMR required more rewrites than initially planned:
- Due to the network isolation of the EMR cluster it's harder to initiate jobs from outside the cluster
- Jobs had to be rewritten to mitigate side effects of EMRFS: INSERT OVERWRITE TABLE xx AS SELECT xx has different behavior
- Data formats: Hive on EMR doesn't support RC files, so we had to convert our data warehouse to ORC
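The bulk copy and the format conversion can both be scripted; s3-dist-cp ships with EMR, and Hive can rewrite a table into ORC with a CTAS statement. A rough sketch of the kind of steps involved; the bucket, paths and table names are purely illustrative, and partitioned tables would need a more careful approach:

    # Copy a warehouse directory from HDFS to S3 (runs as a MapReduce job)
    s3-dist-cp --src hdfs:///user/hive/warehouse/weblogs \
               --dest s3://example-hive-data/warehouse/weblogs

    # Rewrite an RC-file table as ORC, then swap it in under the old name
    hive -e "
      CREATE TABLE weblogs_orc STORED AS ORC AS SELECT * FROM weblogs;
      DROP TABLE weblogs;
      ALTER TABLE weblogs_orc RENAME TO weblogs;
    "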

Learnings
We're almost done with the migration; running in parallel now. The setup solves our challenges, though some still require work.
- Missing Cloudera Manager for small config changes and monitoring
- EMR is not ideal for long-running clusters
- S3 bucket structure impacts performance
- Set up access control; human error will occur
- Uploading data takes time. Start early! Check Snowball or the new upload service
EMR:
- Check data formats! RC vs ORC/Parquet
- Start jobs from the master node
- Break up long-running jobs into shorter, independent ones
- Spot pricing & pre-empted nodes with Spark
- Keep HUE and Zeppelin metadata on RDS
- Research EMRFS behavior for your use case
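One way to follow the "start jobs from the master node" advice without logging in by hand is to submit work as EMR steps, which the master node executes itself. A minimal sketch; the cluster id, step name and script location are hypothetical:

    # The Hive script runs from the master node, so the job does not depend
    # on network access from outside the (isolated) EMR cluster.
    aws emr add-steps \
        --cluster-id j-XXXXXXXXXXXXX \
        --steps Type=Hive,Name="Nightly ETL",ActionOnFailure=CONTINUE,Args=[-f,s3://example-etl/scripts/nightly.q]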

S3 Bucket Structure - throughput optimization
S3 automatically partitions based upon the key prefix.
Bucket: example-hive-data, object keys:
  warehouse/weblogs/2016-01-01/a.log.bz2
  warehouse/advertising/2016-01-01/a.log.bz2
  warehouse/analytics/2016-01-01/a.log.bz2
  warehouse/targeting/2016-01-01/a.log.bz2
  -> single partition key: example-hive-data/w
Bucket: example-hive-data, object keys:
  weblogs/2016-01-01/a.log.bz2
  advertising/2016-01-01/a.log.bz2
  analytics/2016-01-01/a.log.bz2
  targeting/2016-01-01/a.log.bz2
  -> partition keys: example-hive-data/w, example-hive-data/a, example-hive-data/t

Spot pricing & pre-empted nodes with Spark
Spark: massively parallel; uses DAGs instead of MapReduce for execution; minimizes I/O by storing data in RDDs in memory; partitioning-aware to avoid network-intensive shuffles; ideal for iterative algorithms
We use Spark a lot for data science, and ETL processing is slowly moving to Spark too
Problem with Spark and EMR: no control over the node where the Application Master lands; Spark executors are termination-resilient, the master is not
Possible solutions: run a separate cluster for the Spark workload, or assign node labels (but the current implementation is crude and exclusive)
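A minimal sketch of the first workaround: a dedicated EMR cluster for the Spark workload, kept entirely on demand so the Application Master is never placed on a spot node (spot task groups could still be added later, at the cost of re-introducing that risk). The release label, instance types and counts are illustrative:

    # Separate, on-demand EMR cluster reserved for Spark jobs
    aws emr create-cluster \
        --name "Spark workload" \
        --release-label emr-4.7.2 \
        --applications Name=Spark \
        --instance-groups \
        InstanceGroupType=MASTER,InstanceType=m4.xlarge,InstanceCount=1 \
        InstanceGroupType=CORE,InstanceType=r3.2xlarge,InstanceCount=4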

Conclusion
Amazon is a wonderful place to run your Big Data infrastructure: it's flexible, but costs can grow rapidly
Many options for cost control are available, but they might impact your architecture
Take your time testing and validating your setup
If you have time, rethink the whole setup; if not, move as-is first and then start optimizing
It is much faster to iterate on your solution when everything is at Amazon than when it is partly AWS / partly on premise

Thank you! Questions? Twitter: @skieft