Hadoop and Analytics at CERN IT CERN IT-DB

Similar documents
Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

New Big Data Solutions and Opportunities for DB Workloads

Data Analytics Use Cases, Platforms, Services. ITMM, March 5 th, 2018 Luca Canali, IT-DB

Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

20775A: Performing Data Engineering on Microsoft HD Insight

20775 Performing Data Engineering on Microsoft HD Insight

5th Annual. Cloudera, Inc. All rights reserved.

20775: Performing Data Engineering on Microsoft HD Insight

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

20775A: Performing Data Engineering on Microsoft HD Insight

BIG DATA AND HADOOP DEVELOPER

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

Transforming Analytics with Cloudera Data Science WorkBench

DataAdapt Active Insight

Architecture Optimization for the new Data Warehouse. Cloudera, Inc. All rights reserved.

Big Data Hadoop Administrator.

Common Customer Use Cases in FSI

Operational Hadoop and the Lambda Architecture for Streaming Data

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Sr. Sergio Rodríguez de Guzmán CTO PUE

Microsoft Azure Essentials

Cask Data Application Platform (CDAP) Extensions

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

Bringing the Power of SAS to Hadoop Title

Taking Advantage of Cloud Elasticity and Flexibility

Spark and Hadoop Perfect Together

Big data is hard. Top 3 Challenges To Adopting Big Data

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Data Analytics. Nagesh Madhwal Client Solutions Director, Consulting, Southeast Asia, Dell EMC

Hadoop Course Content

Spotlight Sessions. Nik Rouda. Director of Product Marketing Cloudera, Inc. All rights reserved. 1

Big Data Analytics met Hadoop

Business is being transformed by three trends

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Oracle Big Data Cloud Service

Insights to HDInsight

Confidential

Cloudera, Inc. All rights reserved.

When Big Data Meets Fast Data

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Building a Robust Analytics Platform

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Cask Data Application Platform (CDAP)

SOLUTION SHEET Hortonworks DataFlow (HDF ) End-to-end data flow management and streaming analytics platform


SAS & HADOOP ANALYTICS ON BIG DATA

Apache Hadoop in the Datacenter and Cloud

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

Big Data & Hadoop Advance

AZURE HDINSIGHT. Azure Machine Learning Track Marek Chmel

Building Your Big Data Team

Adobe and Hadoop Integration

Hortonworks Connected Data Platforms

Pentaho 8.0 Overview. Pedro Alves

Intro to Big Data and Hadoop

How In-Memory Computing can Maximize the Performance of Modern Payments

Analytics in Action transforming the way we use and consume information

Enabling Self-Service Analytics Across The UDA With Teradata AppCenter

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

Analytics for the NFV World with PNDA.io

Modern Analytics Architecture

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

Leveraging Predictive Tools to Decrease Resolution Time

IBM Db2 Warehouse. Hybrid data warehousing using a software-defined environment in a private cloud. The evolution of the data warehouse

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance

Datametica. The Modern Data Platform Enterprise Data Hub Implementations. Why is workload moving to Cloud

Modernizing Your Data Warehouse with Azure

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

TechArch Day Digital Decoupling. Oscar Renalias. Accenture

MapR Pentaho Business Solutions

Analysis and machine learning on logs of the monitoring infrastructure

Leveraging smart meter data for electric utilities:

Leveraging smart meter data for electric utilities:

StackIQ Enterprise Data Reference Architecture

Cloud Based Analytics for SAP

Hadoop in the Cloud. Ryan Lippert, Cloudera Product Cloudera, Inc. All rights reserved.

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Adobe and Hadoop Integration

zdata Solutions BI / Advanced Analytic Platform and Pilot Programs

D a t a J u g g l i n g a t S k y B e t t i n g a n d G a m i n g. A b r i e f l o o k i n s i d e t h e D a t a S c i e n c e t o o l b o x

Cask Data Application Platform (CDAP) The Integrated Platform for Developers and Organizations to Build, Deploy, and Manage Data Applications

Why Big Data Matters? Speaker: Paras Doshi

Managing explosion of data. Cloudera, Inc. All rights reserved.

REDEFINE BIG DATA. Zvi Brunner CTO. Copyright 2015 EMC Corporation. All rights reserved.

Preface About the Book

Hadoop Stories. Tim Marston. Director, Regional Alliances Page 1. Hortonworks Inc All Rights Reserved

Hortonworks Data Platform

Exploring Big Data and Data Analytics with Hadoop and IDOL. Brochure. You are experiencing transformational changes in the computing arena.

Your Top 5 Reasons Why You Should Choose SAP Data Hub INTERNAL

Microsoft Big Data. Solution Brief

Got Data Silos? Automate Data Ingestion Into Isilon In Support Of Analytics

Amsterdam. (technical) Updates & demonstration. Robert Voermans Governance architect

Analytics Platform System

Session 30 Powerful Ways to Use Hadoop in your Healthcare Big Data Strategy

Transcription:

Hadoop and Analytics at CERN IT CERN IT-DB 1

Hadoop Use cases Parallel processing of large amounts of data Perform analytics on a large scale Dealing with complex data: structured, semi-structured, unstructured Projects: Migrations and consolidations of SQL-like workloads Data warehouse and reporting Data ingestion and pipelines into Hadoop Do more with data at scale: analytics and machine learning 2

Hadoop Service in IT Setup and run the infrastructure Provide consultancy Build user community Joint work IT-DB and IT-ST 3

Hadoop clusters in IT 4 main clusters Cluster Name Configuration Primary Usage lxhadoop analytix hadalytic hadoopqa 22 nodes (cores 560,Mem 880GB,Storage 1.30 PB) 56 nodes (cores 780,Mem 1.31TB,Storage 2.22 PB) 14 nodes (cores 224,Mem 768GB,Storage 2.15 PB) 12 nodes (cores 28,Mem 42 GB,Storage 358 TB) Experiment activities General Purpose SQL oriented installation QA cluster 4

Overview of Available Components Kafka Streaming/In gestion

Apache Spark Spark Eco System Hadoop Service Update 6

Impala - SQL on Hadoop distributed SQL query engine for data stored in Hadoop Based on MPP paradigm (no MapReduce, Spark) Designed for high performance Written in C++ Runtime code generation using LLVM Direct data access Hadoop Service Update 7

Production Implementation WLCG Monitoring HADOOP Web UI Lambda Architecture Real-time Batch processing ACTIV E MQ Modernizing the applications for ever demanding analytics needs Credit: Luca Magnoni, IT-CM 8

Pilot Implementation CALS 2.0 Pilot architecture tested by CERN Accelerator Logging Services Credit: Jakub Wozniak, BE-CO-DS

Projects Atlas Event Index (Production Service) HBASE for fast lookup events; 40 TB/year LHC Postmortem Analysis Real-time Postmortem Analytics of LHC monitoring data Kafka + Spark Analysis of industrial controls data Future Circular Collider: Reliability and Availability analysis Integrating heterogeneous data sources Correlation between different domains 10

Connecting Hadoop and Oracle Offload data from Oracle to Hadoop recent data in Oracle; archive data in Hadoop Advantages No changes to the application Data sources are transparent to the users Opens up the possibility for new analytical queries Oracle Hadoop table partitions limited storage throughput Offload queries to Hadoop scalable storage create view big_table as select * from online_big_table where date > 2016-05-05 union all select * from archival_big_table@hadoop where date <= 2016-05-05 Oracle offloaded SQL Hadoop Offload interface: DB LINK, External table SQL engines: Impala, Hive

Jupyter Notebooks Jupyter notebooks for data analysis System developed at CERN (EP-SFT) based on CERN IT cloud SWAN: Service for Web-based Analysis ROOT and other libraries available Integration with Hadoop and Spark service Distributed processing for ROOT analysis Access to EOS and HDFS storage 12

Machine Learning and Spark Spark addresses use cases for machine learning at scale Distributed deep learning Working on use cases with CMS and ATLAS Custom development: library to integrate Keras + Spark Testing also deeplearning4j Hadoop Service Update 13

Hadoop User Experience - HUE Hue is a web interface for analyzing data with Apache Hadoop View your data using HDFS filebrowser Enhance and Analyze using Query editors for Impala, HIVE Analyze & visualize using Spark notebooks (beta) Requested by the user community Available on Hadoop clusters https://hue-hadalytic.cern.ch https://hue-analytix.cern.ch (soon) Hadoop Service Update 14

Oracle Big Data Discovery Features Data Exploration & Discovery Data Transformation with Spark in Hadoop Apply built-in transformations or write your own scripts Data Enrichment: Text analytics, geolocation, etc. Collaborative environment CERN SSO integrated Already available for Hadoop test cluster Some demos https://www.youtube.com/watch?v=jyw9ntuz_ks

Hadoop performance troubleshooting hprofile Tool developed by IT Hadoop service to troubleshoot application performance on Hadoop Ability to identify part of the code the application is spending most time on and visualize this in a Human readable manner using flamegraphs Usage and more information https://github.com/cerndb/hadoop-profiler Blog - http://db-blog.web.cern.ch/blog/joeri-hermans/2016-04- hadoop-performance-troubleshooting-stack-tracing-introduction 16

Hadoop performance troubleshooting This profiler helped to identify the performance bottlenecks in sqoop when importing data in parquet format

Service Evolution Kudu New Hadoop storage for faster analytics Complements HDFS and HBASE Fills the gap in capabilities of HDFS (optimized for analytics on extremely large datasets) and HBASE (optimized for fast ingestion and queries over small datasets) Backups for Hadoop Evaluation and possible deployment of Alluxio in-memory distributed filesystem 18

Conclusions CERN Hadoop and Spark service Established and evolving Bring big data solutions from open source into CERN use cases Several production implementations more in pipeline Brings value for analytics and large datasets Machine learning at scale IT Hadoop service provides consultancy, platforms and tools 19

Acknowledgements The following have contributed to the work reported in this presentation Members of IT-DB-SAS section Supporting Hadoop components FE Rainer Toebbicke, Dirk Duellmann, Luca Menichetti from IT-ST Supporting Hadoop Core FE 20

Discussion / Feedback Q & A 21