Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Similar documents
Hadoop and Analytics at CERN IT CERN IT-DB

New Big Data Solutions and Opportunities for DB Workloads

Data Analytics Use Cases, Platforms, Services. ITMM, March 5 th, 2018 Luca Canali, IT-DB

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

5th Annual. Cloudera, Inc. All rights reserved.

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

20775A: Performing Data Engineering on Microsoft HD Insight

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

20775 Performing Data Engineering on Microsoft HD Insight

20775: Performing Data Engineering on Microsoft HD Insight

20775A: Performing Data Engineering on Microsoft HD Insight

Transforming Analytics with Cloudera Data Science WorkBench

BIG DATA AND HADOOP DEVELOPER

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Big Data Hadoop Administrator.

Common Customer Use Cases in FSI

Cloudera, Inc. All rights reserved.

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

Sr. Sergio Rodríguez de Guzmán CTO PUE

Oracle Big Data Cloud Service

Taking Advantage of Cloud Elasticity and Flexibility

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Cloudera Data Science and Machine Learning. Robin Harrison, Account Executive David Kemp, Systems Engineer. Cloudera, Inc. All rights reserved.

Bringing the Power of SAS to Hadoop Title

Apache Hadoop in the Datacenter and Cloud

Spark and Hadoop Perfect Together

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Hadoop in the Cloud. Ryan Lippert, Cloudera Product Cloudera, Inc. All rights reserved.

Architecture Optimization for the new Data Warehouse. Cloudera, Inc. All rights reserved.

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

Cloud Based Analytics for SAP

Big Data Analytics met Hadoop

Cask Data Application Platform (CDAP) Extensions

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Leveraging Predictive Tools to Decrease Resolution Time

Building Your Big Data Team

Insights to HDInsight


Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

SAS & HADOOP ANALYTICS ON BIG DATA

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Hadoop Course Content

Cask Data Application Platform (CDAP)

Operational Hadoop and the Lambda Architecture for Streaming Data

Confidential

Big Data Introduction

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Hortonworks Connected Data Platforms

Stateful Services on DC/OS. Santa Clara, California April 23th 25th, 2018

How In-Memory Computing can Maximize the Performance of Modern Payments

Adobe and Hadoop Integration

Spotlight Sessions. Nik Rouda. Director of Product Marketing Cloudera, Inc. All rights reserved. 1

Preface About the Book

DataAdapt Active Insight

SOLUTION SHEET Hortonworks DataFlow (HDF ) End-to-end data flow management and streaming analytics platform

Microsoft Azure Essentials

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

Big Data & Hadoop Advance

Analytics for the NFV World with PNDA.io

Building a Robust Analytics Platform

Big data is hard. Top 3 Challenges To Adopting Big Data

Analysis and machine learning on logs of the monitoring infrastructure

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

Data Analytics. Nagesh Madhwal Client Solutions Director, Consulting, Southeast Asia, Dell EMC

Berkeley Data Analytics Stack (BDAS) Overview

Modernizing Your Data Warehouse with Azure

zdata Solutions BI / Advanced Analytic Platform and Pilot Programs

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

REDEFINE BIG DATA. Zvi Brunner CTO. Copyright 2015 EMC Corporation. All rights reserved.

Adobe and Hadoop Integration

Intro to Big Data and Hadoop

Advanced Analytics in Azure

Teradata IntelliSphere

MapR: Converged Data Pla3orm and Quick Start Solu;ons. Robin Fong Regional Director South East Asia

StackIQ Enterprise Data Reference Architecture

Analytics in Action transforming the way we use and consume information

Got Data Silos? Automate Data Ingestion Into Isilon In Support Of Analytics

What s New. Bernd Wiswedel KNIME KNIME AG. All Rights Reserved.

SOLUTION SHEET End to End Data Flow Management and Streaming Analytics Platform

Safe Harbor Statement

Pentaho 8.0 Overview. Pedro Alves

Modern Analytics Architecture

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Enterprise-Scale MATLAB Applications

APAC Big Data & Cloud Summit 2013

TechArch Day Digital Decoupling. Oscar Renalias. Accenture

Flink meet DC/OS. Deploying Apache Flink at Scale. Elizabeth K. Ravi FlinkForward San Francisco

Amsterdam. (technical) Updates & demonstration. Robert Voermans Governance architect

Data Science is a Team Sport and an Iterative Process

IBM Analytics Unleash the power of data with Apache Spark

Analytics Platform System

How to develop Data Scientist Super Powers! Using Azure from R to scale and persist analytic workloads.. Simon Field

WELCOME TO. Cloud Data Services: The Art of the Possible

AZURE HDINSIGHT. Azure Machine Learning Track Marek Chmel

BIG DATA and DATA SCIENCE

Big Data Trends Arató Bence. BI Consulting

Transcription:

Data Analytics and CERN IT Hadoop Service CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB 1

Data Analytics at Scale The Challenge When you cannot fit your workload in a desktop Data analysis and ML algorithms over large data sets Deploy on distributed systems Complexity quickly goes up Data ingestion tools and file systems Storage and processing engines ML tools that work at scale

Engineering Effort for Effective ML From Hidden Technical Debt in Machine Learning Systems, D. Sculley at al. (Google), paper at NIPS 2015

Managed Services for Data Engineering Platform Capacity planning and configuration Define, configure and support components Running central services Build a team with domain expertise Share experience Economy of scale

Hadoop Service at CERN IT Setup and run the infrastructure Provide consultancy Build user community Joint work IT-DB and IT-ST

Overview of Available Components (Dec 2016) Kafka Streaming/In gestion

Hadoop clusters at CERN IT 3 production clusters (+ 1 for QA) as of December 2016 Cluster Name Configuration Primary Usage lxhadoop analytix hadalytic 22 nodes (cores 560,Mem 880GB,Storage 1.30 PB) 56 nodes (cores 780,Mem 1.31TB,Storage 2.22 PB) 14 nodes (cores 224,Mem 768GB,Storage 2.15 PB) Experiment activities General Purpose SQL-oriented engines and datawarehouse workloads

Apache Spark Spark evolution from map reduce ideas Powerful engine, in particular for data science and streaming Aims to be a unified engine for big data processing 8

Machine Learning and Spark Spark addresses use cases for machine learning at scale Distributed deep learning Working on use cases with CMS and ATLAS Custom development: library to integrate Keras + Spark Testing also other solutions 9

Some Important Challenges Infrastructure Components Evolution Running Services Knowledge sharing and technology adoption

Infrastructure and configuration What HW to use, how to configure it? We are starting small (~ 1000 cores, 3TB RAM, 6 PB disk) Commodity HW (standard components at CERN datacenter) Cloud Pilot in the pipeline using private cloud at CERN collaboration with OpenStack team Future: test public cloud?

Components, Engines and File Formats State of the art evolves quickly Challenge: reviewing application and architecture choices on short cycles (~2 years?) Need to minimize technical debt Examples: Yesterday: it was all about map reduce Today: Spark, Impala Service evolution and testing in progress Example: Kudu promising to add fast insert and updates and keybased search

Deep Learning Quickly evolving field New tools and platforms Role of HW also very important (GPUs, FPGAs, etc) Many challenges to understand with distributed learning Hadoop service at CERN working on Integration with Spark Comment: an area to further explore

Running the service Configuration: Currently using CDH (Cloudera) distribution + puppet Plans to test use of Cloudera Manager (at least for monitoring, possibly for installation) Backup and recovery: work in progress More work needed on monitoring and security, building from current experience Further work and understanding on workload management and performance

Pilot Implementation NXCALS Pilot architecture tested by CERN Accelerator Logging Services Critical system for running LHC Challenge: production in 2017 Credit: Jakub Wozniak, BE-CO-DS

Performance and Testing at Scale Challenges with ramping up the scale Example from the CMS data reduction challenge: 1 PB and 1000 cores Production for this use case is expected 10x of that. New territory to explore Action: access to HW for tests CERN clusters + external resources from partners?

Technology Adoption Build community at CERN Hadoop users forum, analytics working group, openlab workshops, seminars, Build a HEP community around ML How to position ML with current tools for HEP analysis? Offer and provide training Examples: tutorials by IT-DB, also discussing with tech training at CERN on courses for the catalog, Link with other scientific communities and industry Knowledge sharing

Conclusions CERN Hadoop and Spark service Established and evolving Bring big data solutions from open source into CERN use cases Several production implementations more in pipeline Brings value for analytics and large datasets Machine learning at scale IT Hadoop service provides consultancy, platforms and tools 18

Acknowledgements The following have contributed to the work reported in this presentation Members of IT-DB-SAS section Supporting Hadoop components FE Rainer Toebbicke, Dirk Duellmann, Luca Menichetti from IT-ST Supporting Hadoop Core FE 19

Discussion / Feedback Q & A 20

Backup Slides Backup Slides with additional details on ongoing projects

Impala - SQL on Hadoop Distributed SQL query engine for data stored in Hadoop Based on MPP paradigm (no MapReduce, Spark) Designed for high performance Written in C++ Runtime code generation using LLVM Direct data access 22

Production Implementation WLCG Monitoring HADOOP Web UI Lambda Architecture Real-time Batch processing ACTIV E MQ Modernizing the applications for ever demanding analytics needs Credit: Luca Magnoni, IT-CM 23

Projects Atlas Event Index (Production Service) HBASE for fast lookup events; 40 TB/year LHC Postmortem Analysis Real-time Postmortem Analytics of LHC monitoring data Kafka + Spark Analysis of industrial controls data Future Circular Collider: Reliability and Availability analysis Integrating heterogeneous data sources Correlation between different domains 24

Connecting Hadoop and Oracle Offload data from Oracle to Hadoop recent data in Oracle; archive data in Hadoop Advantages No changes to the application Data sources are transparent to the users Opens up the possibility for new analytical queries Oracle Hadoop table partitions limited storage throughput Offload queries to Hadoop scalable storage create view big_table as select * from online_big_table where date > 2016-05-05 union all select * from archival_big_table@hadoop where date <= 2016-05-05 Oracle offloaded SQL Hadoop Offload interface: DB LINK, External table SQL engines: Impala, Hive

Jupyter Notebooks Jupyter notebooks for data analysis System developed at CERN (EP-SFT) based on CERN IT cloud SWAN: Service for Web-based Analysis ROOT and other libraries available Integration with Hadoop and Spark service Distributed processing for ROOT analysis Access to EOS and HDFS storage 26

Hadoop User Experience - HUE Hue is a web interface for analyzing data with Apache Hadoop View your data using HDFS filebrowser Enhance and Analyze using Query editors for Impala, HIVE Analyze & visualize using Spark notebooks (beta) Requested by the user community Available on Hadoop clusters https://hue-hadalytic.cern.ch https://hue-analytix.cern.ch (soon) 27

Oracle Big Data Discovery Features Data Exploration & Discovery Data Transformation with Spark in Hadoop Apply built-in transformations or write your own scripts Data Enrichment: Text analytics, geolocation, etc. Collaborative environment CERN SSO integrated Already available for Hadoop test cluster Some demos https://www.youtube.com/watch?v=jyw9ntuz_ks

Hadoop performance troubleshooting hprofile Tool developed by IT Hadoop service to troubleshoot application performance on Hadoop Ability to identify part of the code the application is spending most time on and visualize this in a Human readable manner using flamegraphs Usage and more information https://github.com/cerndb/hadoop-profiler Blog - http://db-blog.web.cern.ch/blog/joeri-hermans/2016-04- hadoop-performance-troubleshooting-stack-tracing-introduction 29

Hadoop performance troubleshooting This profiler helped to identify the performance bottlenecks in sqoop when importing data in parquet format

Service Evolution Kudu New Hadoop storage for faster analytics Complements HDFS and HBASE Fills the gap in capabilities of HDFS (optimized for analytics on extremely large datasets) and HBASE (optimized for fast ingestion and queries over small datasets) Backups for Hadoop Evaluation and possible deployment of Alluxio in-memory distributed filesystem 31