Leveraging Oracle Big Data Discovery to Master CERN s Data. Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

Similar documents
Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Hadoop and Analytics at CERN IT CERN IT-DB

New Big Data Solutions and Opportunities for DB Workloads

Course Content. The main purpose of the course is to give students the ability plan and implement big data workflows on HDInsight.

20775A: Performing Data Engineering on Microsoft HD Insight

20775 Performing Data Engineering on Microsoft HD Insight

20775: Performing Data Engineering on Microsoft HD Insight

20775A: Performing Data Engineering on Microsoft HD Insight

Oracle Big Data Discovery The Visual Face of Big Data

Big data is hard. Top 3 Challenges To Adopting Big Data

BIG DATA AND HADOOP DEVELOPER

Oracle Big Data Discovery Cloud Service

Spark and Hadoop Perfect Together

Data Analytics Use Cases, Platforms, Services. ITMM, March 5 th, 2018 Luca Canali, IT-DB

Microsoft Azure Essentials


Big Data Hadoop Administrator.

Transforming Analytics with Cloudera Data Science WorkBench

Oracle Big Data Cloud Service

Intro to Big Data and Hadoop

KnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE

Cloudera Data Science and Machine Learning. Robin Harrison, Account Executive David Kemp, Systems Engineer. Cloudera, Inc. All rights reserved.

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

Copyright 2015, Oracle and/or its affiliates. All rights reserved.

5th Annual. Cloudera, Inc. All rights reserved.

Big Data Introduction

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

Hortonworks Connected Data Platforms

Your Top 5 Reasons Why You Should Choose SAP Data Hub INTERNAL

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

BIG DATA and DATA SCIENCE

Cask Data Application Platform (CDAP) Extensions

How In-Memory Computing can Maximize the Performance of Modern Payments

Hadoop Course Content

Apache Hadoop in the Datacenter and Cloud

MapR: Solution for Customer Production Success

Modern Analytics Architecture

Insights-Driven Operations with SAP HANA and Cloudera Enterprise

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

Bringing the Power of SAS to Hadoop Title

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Cask Data Application Platform (CDAP)

Big Data & Hadoop Advance

Enterprise Analytics Accelerating Your Path to Value with an Open Analytics Platform

PNDA.io: when big data and OSS collide

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Apache Spark 2.0 GA. The General Engine for Modern Analytic Use Cases. Cloudera, Inc. All rights reserved.

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

Data Analytics. Nagesh Madhwal Client Solutions Director, Consulting, Southeast Asia, Dell EMC

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

HDInsight - Hadoop for the Commoner Matt Stenzel Data Platform Technical Specialist

Edge Analytics for IoT Device Intelligence

Cloudera, Inc. All rights reserved.

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

NFLABS SIMPLIFYING BIG DATA. Real &me, interac&ve data analy&cs pla4orm for Hadoop

Azure ML Data Camp. Ivan Kosyakov MTC Architect, Ph.D. Microsoft Technology Centers Microsoft Technology Centers. Experience the Microsoft Cloud

Sr. Sergio Rodríguez de Guzmán CTO PUE

Common Customer Use Cases in FSI

Managing explosion of data. Cloudera, Inc. All rights reserved.

Azure Data Analytics & Machine Learning Seminar. Daire Cunningham: BI Practice Area Manager

D a t a J u g g l i n g a t S k y B e t t i n g a n d G a m i n g. A b r i e f l o o k i n s i d e t h e D a t a S c i e n c e t o o l b o x

Cloud Based Analytics for SAP

Business is being transformed by three trends

Intel Public Sector 3

Analytics for the NFV World with PNDA.io

GET MORE VALUE OUT OF BIG DATA

Simplifying the Process of Uploading and Extracting Data from Apache Hadoop

Towards Seamless Integration of Data Analytics into Existing HPC Infrastructures

SOLUTION SHEET Hortonworks DataFlow (HDF ) End-to-end data flow management and streaming analytics platform

New Approach for scheduling tasks and/or jobs in Big Data Cluster

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Cloudera Enterprise Data Hub Reference Architecture for Oracle Cloud Infrastructure Deployments O R A C L E W H I T E P A P E R J U N E

Berkeley Data Analytics Stack (BDAS) Overview

Spark, Hadoop, and Friends

Amsterdam. (technical) Updates & demonstration. Robert Voermans Governance architect

Real-time IoT Big Data-in-Motion Analytics Case Study: Managing Millions of Devices at Country-Scale

Applying Automated Methods of Managing Test and Evaluation Processes

SAP Predictive Analytics Suite

Making Data Science Simple

Data - tools for data integration, access, preparation, discovery, and data streaming.

Monetizing the Lake. Kirk Haslbeck, Hortonworks Dan Kernaghan, Pitney Bowes

Taking Advantage of Cloud Elasticity and Flexibility

CASE STUDY Delivering Real Time Financial Transaction Monitoring

SOLUTION SHEET End to End Data Flow Management and Streaming Analytics Platform

The Applicability of HPC for Cyber Situational Awareness

Analytics in Action transforming the way we use and consume information

Pentaho 8.0 and Beyond. Matt Howard Pentaho Sr. Director of Product Management, Hitachi Vantara

ADVANCED ANALYTICS & IOT ARCHITECTURES

SAS FORUM RUSSIA Welcome

ETL on Hadoop What is Required

Insights to HDInsight

Operational Hadoop and the Lambda Architecture for Streaming Data

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Pentaho 8.0 Overview. Pedro Alves

TechArch Day Digital Decoupling. Oscar Renalias. Accenture

MapR Pentaho Business Solutions

Two offerings which interoperate really well

DataAdapt Active Insight

Power BI for Data Science Integration and exploration capabilities

Adobe and Hadoop Integration

Transcription:

Leveraging Oracle Big Data Discovery to Master CERN s Data Manuel Martín Márquez Oracle Business Analytics Innovation 12 October- Stockholm, Sweden

Manuel Martin Marquez Intel IoT Ignition Lab Cloud and Big Data Munich, September 17th CERN - European Laboratory for Particle Physics 3

A World-Wide Collaboration Manuel Martin Marquez Intel IoT Ignition Lab Cloud and Big Data Munich, September 17th 4

Manuel Martin Marquez Intel IoT Ignition Lab Cloud and Big Data Munich, September 17th 5

6

11/30/2016 7

CERN Aerial View 11/30/2016 Document reference 8

LHC Installation 11/30/2016 Document reference 9

CMS Detector 11/30/2016 Document reference 10

11

Hadoop and Analytics IT-DB-SAS New scalable data services Scalable databases Hadoop ecosystem Time Series databases Big Data Analytics Activities and objectives Support of Hadoop Components Further value of Analytics solutions Define scalable platform evolution Hadoop Production Service 12

requests per day. Direct SQL access is not permitted. A generic Java GUI called TIMBER is also provided CERN Accelerator as a means visualize and Logging extract logged data. The Service +800 extraction clients +5 million extraction requests per day 130 custom applications Credit: BE-CO-DS ~ 250 000 Signals ~ 50 data loading processes ~ 5.5 billion records per day ~ 275 GB / day 100 TB / year throughput Filters for data Reduction tool is heavily used, with more than 800 active users. 800 600 400 200 0 Data ingestion Per day GB 2014 Figure 2: Logging ervice architecture overview. The Java APIs for both logging and extracting data are procedures have been h an understanding of h performing, in terms of how it is being done, and Optimal Use of Softw ~ 1 million signals ~ 300 data loading processes ~ 4 billion records per day ~ 160 GB / day 52 TB / year stored The database mod business logic (written Java infrastructure inter engineered to use the Oracle, to maximize per the LS systems are be aforementioned instrum which features and tech performance. Data Quality Contro The MDB (introduc filtering capabilities wh for long-term storage. effort [4] to ensure 13 configurations, the MD

2008 2008 2008 2009 2009 2010 2010 2010 2011 2011 2012 2012 2012 2013 2013 2014 2014 2014 2015 2015 2015 CERN Accelerator Logging Service New Landscape bring new challenges Better Performance on bigger datasets Big Data queries: Impala, Spark SQL Leverage analytics capabilities Spark Analytics: Python, ML, R More heterogeneous data access models Storage Evolution - Size in GB / day 700 600 500 400 300 200 100 0 Credit: BE-CO-DS QPS 14

CERN Accelerator Logging Service Speed Log. Proc. Log. Proc. 100mS Kafka 1min 7 min Gobblin 1min 7 min Batch Compactor HBase HDFS Credit: BE-CO-DS CCDB Schema Partition Provider Storage 15

Accelerator Postmortem Analysis Postmortem Analysis Diagnostic on failures Continue operations safely Intervention Required Designed for CERN LHC Extended to injectors complex (SPS) External Post Operational Checks Injection Quality Checks 16

Accelerator Postmortem Analysis Challenges: Stringent Timing Constraint Better scalability data storage IO throughput Big Data Streaming Analytics 17

Post-LHC accelerator projects (80-100 km)

Architecture overview CDH 5.7.1 16 nodes, 24 GB ram Intel Xeon L5520 @ 2.27GHz 165 TB HDFS C o o r d i n a t i o n Oracle Big Data Discovery Libraries + Hive table detector Resource Management (YARN) Data Storage Data Integration Big Data Discovery v1.2.2 Dgraph & Studio 4x Xeon E7-8895 v2 (15 cores each) 2 TB RAM 4.8 TB Flash + 6 x 1.2 TB 10K HDD

Oracle Big Data Discovery Overview Data Exploration & Discovery Interactive catalog of all data Assess attribute statistics, data quality and outliers Quick data exploration or create dashboards and applications Data Transformation with Spark in Hadoop Apply built-in transformations or write your own scripts Data Enrichment Text: Entity extraction, relevant terms, sentiment, language detection Geographical information: address, IP, reverse Preview results, undo, commit and replay transforms Collaborative environment Share and bookmarks Create and share transformed datasets

Data Transformation UI - ETL 26/01/2016 - USA BIWA 16 22

Discovery Applications 26/01/2016 - USA BIWA 16 23

Advance Analytics - Notebooks Easy to create and share documents that contain live code Step by step execution reproduce the analysis, charts, etc. Support for multiple languages/kernels Multiple notebook software available Jupyter/IPython BDD provides notebook from version 1.2.0 (BDD Shell) Can be used with Jupyter/IPython HUE notebooks Apache Zeppelin More

Scalable Analytics Reliability of degrading components of valves in the cryogenic system of the LHC (University of Delft) BDD -> Data Extraction -> Refine Calculations Scalable solutions apply to all the cryogenics valves

Conclusions Hadoop is not the solution for all your problems but.. Unlock new ways to exploit your investment on data overcome technical limitations for several CERN use cases Allows heterogeneous data access not only SQL or custom java APIs Once the data is in Hadoop only half of the way is done Data visualization and discovery Notebooks are easy to use and powerful for advanced analytics Self-service tools improve productivity Users should be able to do what they need without IT intervention 11/30/2016 26