Guagua: An Iterative Computing Framework on Hadoop. Zhang Pengshan(David), PayPal

3 AGENDA: Introduction; Distributed Neural Network Algorithm; What is Guagua?; Guagua Advanced Features; Shifu on Guagua; Future Plans

5 ALIPAY vs. PAYPAL Q: Where is risk control in PayPal? A: Risk control is everywhere in paypal.com.

6 FRAUD TYPES IN PAYPAL: Account Take Over; Stolen Financials (Credit Cards); INR/SNAD. INR: Item Not Received. SNAD: Significantly Not as Described.

7 RISK CONTROL IN PAYPAL: Models, Rules, Agents

8 RISK MODELING IN PAYPAL: MODELING CHALLENGES. Thousands of Features; Algorithms (LR, NN, DT); Big Training Data; SLA (Online); Simulation.

9 RISK MODELING IN PAYPAL: MODELING CHALLENGES. Thousands of Features; Algorithms (LR, NN, DT); Big Training Data; SLA (Online); Simulation. Q: How to train models with TB data and thousands of features?

11 DISTRIBUTED NEURAL NETWORK ALGORITHM* [Diagram: in each iteration, three GRADIENTS: DOUBLE[] arrays feed an ACCUMULATE GRADIENTS / UPDATE WEIGHTS step, which produces three WEIGHTS: DOUBLE[] arrays for the next iteration. Shown for the 1st and 2nd iterations.] * Distributed batch gradient descent algorithm.

12 DISTRIBUTED NEURAL NETWORK ALGORITHM [Same diagram as slide 11.] Q: How to implement it?
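The gradient and weight exchange on slides 11 and 12 can be sketched as follows. This is a minimal illustration, not Guagua or Shifu code: a linear model with squared error stands in for the neural network, the "workers" run sequentially in one process, and all names (`worker_gradients`, `master_update`, `train`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, target)


def worker_gradients(weights: Sequence[float], partition: Sequence[Sample]) -> List[float]:
    """Sum the squared-error gradients over one worker's data partition."""
    grads = [0.0] * len(weights)
    for features, target in partition:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grads[i] += error * x
    return grads


def master_update(weights: Sequence[float], all_grads: Sequence[Sequence[float]],
                  n_samples: int, learning_rate: float) -> List[float]:
    """Accumulate the workers' gradients, then take one batch descent step."""
    total = [sum(g[i] for g in all_grads) for i in range(len(weights))]
    return [w - learning_rate * t / n_samples for w, t in zip(weights, total)]


def train(partitions: Sequence[Sequence[Sample]], n_features: int,
          iterations: int = 500, learning_rate: float = 0.1) -> List[float]:
    weights = [0.0] * n_features
    n_samples = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # On a cluster, each call below would run on a separate worker.
        all_grads = [worker_gradients(weights, p) for p in partitions]
        weights = master_update(weights, all_grads, n_samples, learning_rate)
    return weights


if __name__ == "__main__":
    # Toy data for y = 2*x + 1; the constant second feature acts as a bias input.
    data = [([x, 1.0], 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
    print(train([data[:2], data[2:]], n_features=2))
```

Because the gradients are summed before the update, splitting the data across workers gives the same weights (up to floating-point rounding) as training on the whole data set in one place.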

13 WHY NOT MAHOUT OR SPARK? Mahout: no distributed logistic regression or neural network; iterates through Hadoop jobs, with bad performance. Spark: no independent Spark cluster; the Hadoop cluster is still 1.0 based, not YARN. Q: How to implement it in Hadoop?

14 POSSIBLE SOLUTIONS. Hadoop YARN, Pros: flexible framework for framework; self resource management. Hadoop YARN, Cons: Alpha; PayPal Clusters: Hadoop; extra fault tolerance, splits, UI. Hadoop MapReduce, Pros: works well on all Hadoop versions; mature computing model; internal fault tolerance, splits, UI. Hadoop MapReduce, Cons: different computing model; how to do iterative coordination?

16 ITERATIVE COMPUTING MODEL IN GUAGUA [Diagram: in each iteration, three WORKER RESULTs are combined and three MASTER RESULTs are passed on to the next iteration. Shown for the 1st and 2nd iterations.] Guagua is a framework built on this iterative computing model, in the same way that Hadoop 1.0 is built on MapReduce.
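The iterative model on slide 16 can be sketched as a generic loop that alternates a worker step and a master step. This is an illustration under assumptions, not the Guagua API: `run_iterations`, `worker_compute`, `master_compute`, and the toy 1-D k-means application are all hypothetical names chosen for this example, and the workers run sequentially rather than on a cluster.

```python
from typing import Any, Callable, List, Optional


def run_iterations(worker_compute: Callable[[Optional[Any], int], Any],
                   master_compute: Callable[[List[Any], Optional[Any]], Any],
                   n_workers: int,
                   max_iterations: int,
                   should_halt: Callable[[Any], bool] = lambda _result: False) -> Optional[Any]:
    """Alternate worker and master steps until a halt condition or the iteration cap."""
    master_result = None  # no master result exists before the first iteration
    for _ in range(max_iterations):
        # On a cluster these calls would run in parallel, one per worker.
        worker_results = [worker_compute(master_result, w) for w in range(n_workers)]
        master_result = master_compute(worker_results, master_result)
        if should_halt(master_result):
            break
    return master_result


# Toy application: 1-D k-means with two centroids over two data partitions.
PARTITIONS = [[1.0, 1.2, 0.8], [5.0, 5.2, 4.8, 1.0]]
INITIAL_CENTROIDS = (0.0, 6.0)


def worker_compute(master_result, worker_id):
    """Return per-cluster (sums, counts) for this worker's partition."""
    centroids = master_result[0] if master_result else INITIAL_CENTROIDS
    sums, counts = [0.0, 0.0], [0, 0]
    for x in PARTITIONS[worker_id]:
        k = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
        sums[k] += x
        counts[k] += 1
    return sums, counts


def master_compute(worker_results, previous):
    """Merge worker results into new centroids and flag whether they stopped moving."""
    old = previous[0] if previous else INITIAL_CENTROIDS
    new = []
    for k in range(2):
        total = sum(sums[k] for sums, _ in worker_results)
        count = sum(counts[k] for _, counts in worker_results)
        new.append(total / count if count else old[k])
    return tuple(new), tuple(new) == old


if __name__ == "__main__":
    centroids, converged = run_iterations(
        worker_compute, master_compute, n_workers=2, max_iterations=20,
        should_halt=lambda result: result[1])
    print(centroids, converged)
```

Only the two compute functions know about k-means; the loop itself is algorithm-agnostic, which is the point of separating the framework from the application.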

17 GUAGUA API Computable Computable

18 GUAGUA OVERVIEW [Architecture diagram. Core: Fault Tolerance, Coordination, Iterative Computing Framework. Adapters: MapReduce Adapter (for Hadoop 1.0), YARN Adapter (for Hadoop 2.0). Consistent Client. Application: Distributed Neural Network Application.]

19 PLUGGABLE, SCALABLE INTERCEPTORS [Two diagrams, each nesting these layers around Computation: Fault Tolerance Interceptor; ZooKeeper Coordinator; Perf Interceptor (Timer); User Defined Interceptors.] * These two graphs show the aspects applied in each iteration.
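The layering on slide 19 resembles the interceptor (aspect) pattern: hooks run before and after the computation of every iteration. The sketch below is an assumption-laden illustration, not Guagua's interceptor interface. The class and method names (`Interceptor`, `pre_iteration`, `post_iteration`, `run_iteration`) are hypothetical, and running the post hooks in reverse order is a design choice made here so that the first interceptor wraps all the others.

```python
import time
from typing import Callable, Dict, List, Sequence


class Interceptor:
    """Base class: hooks that run around the computation of each iteration."""

    def pre_iteration(self, iteration: int) -> None:
        pass

    def post_iteration(self, iteration: int) -> None:
        pass


class TimerInterceptor(Interceptor):
    """Records the wall-clock time spent inside each iteration."""

    def __init__(self) -> None:
        self.elapsed: Dict[int, float] = {}
        self._start = 0.0

    def pre_iteration(self, iteration: int) -> None:
        self._start = time.perf_counter()

    def post_iteration(self, iteration: int) -> None:
        self.elapsed[iteration] = time.perf_counter() - self._start


class RecordingInterceptor(Interceptor):
    """Appends a line to a shared log on every hook, to make the call order visible."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name, self.log = name, log

    def pre_iteration(self, iteration: int) -> None:
        self.log.append(f"{self.name}:pre:{iteration}")

    def post_iteration(self, iteration: int) -> None:
        self.log.append(f"{self.name}:post:{iteration}")


def run_iteration(iteration: int, interceptors: Sequence[Interceptor],
                  computation: Callable[[int], object]) -> object:
    """Run pre hooks in order, the computation, then post hooks in reverse order."""
    for interceptor in interceptors:
        interceptor.pre_iteration(iteration)
    result = computation(iteration)
    for interceptor in reversed(interceptors):
        interceptor.post_iteration(iteration)
    return result


if __name__ == "__main__":
    log: List[str] = []
    timer = TimerInterceptor()
    chain = [RecordingInterceptor("outer", log), timer, RecordingInterceptor("user", log)]
    run_iteration(1, chain, lambda it: log.append(f"compute:{it}"))
    print(log)
    print(timer.elapsed)
```

Adding a user-defined interceptor then means appending one more object to the chain, without touching the computation itself.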

20 GUAGUA RUNTIME [Diagram: six nodes, each running as a Mapper (Container).]

21 GUAGUA RUNTIME [Diagram: each of the six Mapper (Container) nodes sends REGISTER to the ZooKeeper Cluster.] 1. The master listens to the znodes of the workers. 2. The workers listen to the znode of the master.

22 GUAGUA RUNTIME [Diagram: each of the six Mapper (Container) nodes exchanges UPDATE ITER with the ZooKeeper Cluster.] 1. Data is loaded into worker memory in the first iteration. 2. The whole process finishes when the maximum iteration is reached or a halt condition is triggered.
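The per-iteration handshake on slides 21 and 22 can be imitated in a single process. In the sketch below an in-memory `Coordinator` built on `threading.Condition` stands in for the ZooKeeper znodes; no ZooKeeper client is used, and every name (`Coordinator`, `publish_worker`, `await_master`, `run`) is hypothetical. The toy computation simply adds up the workers' partial sums until a limit is reached.

```python
import threading
from typing import Dict, List, Sequence, Tuple


class Coordinator:
    """In-memory stand-in for the znodes that the master and workers watch."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._worker_results: Dict[Tuple[int, int], int] = {}
        self._master_results: Dict[int, Tuple[int, bool]] = {}

    def publish_worker(self, iteration: int, worker_id: int, result: int) -> None:
        with self._cond:
            self._worker_results[(iteration, worker_id)] = result
            self._cond.notify_all()

    def await_workers(self, iteration: int, n_workers: int) -> List[int]:
        """Master side: block until every worker has published for this iteration."""
        with self._cond:
            self._cond.wait_for(lambda: all(
                (iteration, w) in self._worker_results for w in range(n_workers)))
            return [self._worker_results[(iteration, w)] for w in range(n_workers)]

    def publish_master(self, iteration: int, result: Tuple[int, bool]) -> None:
        with self._cond:
            self._master_results[iteration] = result
            self._cond.notify_all()

    def await_master(self, iteration: int) -> Tuple[int, bool]:
        """Worker side: block until the master has published for this iteration."""
        with self._cond:
            self._cond.wait_for(lambda: iteration in self._master_results)
            return self._master_results[iteration]


def worker(coord: Coordinator, worker_id: int, split: Sequence[int],
           max_iterations: int, loads: List[int]) -> None:
    data = None
    for iteration in range(1, max_iterations + 1):
        if data is None:  # data is loaded into memory in the first iteration only
            data = list(split)
            loads[worker_id] += 1
        coord.publish_worker(iteration, worker_id, sum(data))
        _total, halt = coord.await_master(iteration)
        if halt:
            break


def master(coord: Coordinator, n_workers: int, max_iterations: int,
           limit: int, out: Dict[str, int]) -> None:
    total = 0
    for iteration in range(1, max_iterations + 1):
        total += sum(coord.await_workers(iteration, n_workers))
        halt = total >= limit  # halt condition
        out["iterations"], out["total"] = iteration, total
        coord.publish_master(iteration, (total, halt))
        if halt:
            break


def run(splits: Sequence[Sequence[int]], max_iterations: int, limit: int):
    coord, out, loads = Coordinator(), {}, [0] * len(splits)
    threads = [threading.Thread(target=master,
                                args=(coord, len(splits), max_iterations, limit, out))]
    threads += [threading.Thread(target=worker, args=(coord, i, s, max_iterations, loads))
                for i, s in enumerate(splits)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, loads


if __name__ == "__main__":
    print(run([[1, 2], [3, 4]], max_iterations=10, limit=25))
```

Both stopping rules from slide 22 appear here: the loops end at `max_iterations`, or earlier when the master publishes a result with the halt flag set.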

24 FAULT TOLERANCE n

25 FAULT TOLERANCE n * The same on workers.

26 STRAGGLER MITIGATION 1 2 3

29 PROGRESS AND STATE REPORT Progress = (Current Iteration) / (Total Iteration), e.g. 0.86 = 432/501
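The progress figure on slide 29 is a plain ratio. A minimal sketch, with the function name `progress` chosen here for illustration:

```python
def progress(current_iteration: int, total_iterations: int) -> float:
    """Fraction of the job completed: current iteration over total iterations."""
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    return current_iteration / total_iterations


if __name__ == "__main__":
    print(f"{progress(432, 501):.2f}")  # prints 0.86, the value on the slide
```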

30 GUAGUA UNIT

32 WHAT IS SHIFU? Shifu* is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. [Pipeline diagram, steps as shown: NEW, INIT, STATS, VARSELECT, EVAL, NORMALIZE, POSTTRAIN, TRAIN (built on Guagua).] *Want to try Shifu? Please visit

33 SHIFU ON GUAGUA (TRAIN STEP) [Class diagram in three layers: GUAGUA API, SHIFU CODE, ENCOG CODE. Labels shown: Computable, Interceptor, Computable, Gradients, NN, AbstractComputable, BasicInterceptor, Weights, NN, NNOutput.]

34 SHIFU NN vs. SPARK LR [Chart: Run Time Comparison; axes: Time and Iteration; series: Shifu-NN and Spark-LR.] Shifu-NN: 1102*20*1 network, 319 mappers * 1G. Spark-LR: 1102 features, 120 executors * 3G.

35 SHIFU NN BENCHMARK RESULTS [Chart: Run Time (Seconds) against Size of Input, with input sizes up to 1000G.] All data is located in memory. At most 2400 mappers were used. 20 epochs were run.
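For a sense of scale, the length of the weight and gradient arrays for the 1102*20*1 network on slide 34 can be estimated. The sketch below assumes a fully connected network and, optionally, one bias weight per non-input neuron; neither assumption is stated on the slide, and `count_weights` is a hypothetical helper.

```python
from typing import Sequence


def count_weights(layer_sizes: Sequence[int], with_bias: bool = True) -> int:
    """Number of weights in a fully connected feed-forward network."""
    extra = 1 if with_bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    n = count_weights([1102, 20, 1])
    # Each weight or gradient is one 8-byte double.
    print(n, "weights,", n * 8, "bytes per array")
```

Under these assumptions each worker exchanges an array of about 22 thousand doubles per iteration, which is small compared with the training data held in worker memory.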

37 WHAT'S NEXT? More open-source docs; support more (distributed) machine learning algorithms; improve YARN (Beta) implementation; support more input formats; big model support; deep learning support.

38 Q&A

39 APPENDIX Website Guagua issue website Shifu & Guagua source code:

41 @InfoQ infoqchina

BIG DATA AND HADOOP DEVELOPER

BIG DATA AND HADOOP DEVELOPER BIG DATA AND HADOOP DEVELOPER Approximate Duration - 60 Hrs Classes + 30 hrs Lab work + 20 hrs Assessment = 110 Hrs + 50 hrs Project Total duration of course = 160 hrs Lesson 00 - Course Introduction 0.1

More information

Deloitte School of Analytics. Demystifying Data Science: Leveraging this phenomenon to drive your organisation forward

Deloitte School of Analytics. Demystifying Data Science: Leveraging this phenomenon to drive your organisation forward Deloitte School of Analytics Demystifying Data Science: Leveraging this phenomenon to drive your organisation forward February 2018 Agenda 7 February 2018 8 February 2018 9 February 2018 8:00 9:00 Networking

More information

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Roger Ding Cloudera February 3rd, 2018 1 Agenda Hadoop History Introduction to Apache Hadoop

More information

Enterprise-Scale MATLAB Applications

Enterprise-Scale MATLAB Applications Enterprise-Scale Applications Sylvain Lacaze Rory Adams 2018 The MathWorks, Inc. 1 Enterprise Integration Access and Explore Data Preprocess Data Develop Predictive Models Integrate Analytics with Systems

More information

Intro to Big Data and Hadoop

Intro to Big Data and Hadoop Intro to Big and Hadoop Portions copyright 2001 SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Reproduced with permission of SAS Institute Inc., Cary, NC, USA. SAS Institute Inc. makes no warranties

More information

New Approach for scheduling tasks and/or jobs in Big Data Cluster

New Approach for scheduling tasks and/or jobs in Big Data Cluster New Approach for scheduling tasks and/or jobs in Big Data Cluster IT College, Chairperson of MS Dept. Agenda Introduction What is Big Data? The 4 characteristics of Big Data V4s Different Categories of

More information

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase

Big Data Application Engineer/ Developer. Specialization in Apache Spark, Kafka, Airflow, HBase BIG DATA COURSE Big Data Application Engineer/ Developer Specialization in Apache Spark, Kafka, Airflow, HBase In Exclusive Association with 21,347+ Participants 10,000+ Brands 1200+ Trainings 45+ Countries

More information

DATA SCIENCE: HYPE AND REALITY PATRICK HALL

DATA SCIENCE: HYPE AND REALITY PATRICK HALL DATA SCIENCE: HYPE AND REALITY PATRICK HALL About me SAS Enterprise Miner, 2012 Cloudera Data Scientist, 2014 Do you use Kolmogorov Smirnov often? Statistician No, I mix my martinis with gin. Data Scientist

More information

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA.

ABOUT THIS TRAINING: This Hadoop training will also prepare you for the Big Data Certification of Cloudera- CCP and CCA. ABOUT THIS TRAINING: The world of Hadoop and Big Data" can be intimidating - hundreds of different technologies with cryptic names form the Hadoop ecosystem. This comprehensive training has been designed

More information

Big data in practice

Big data in practice Big data in practice Françoise Soulié Fogelman Gilles Polart-Donat IMT - TeraLab Berkeley, CA Clark Kerr Campus Conference Center September 30 - October 2, 2014 Agenda What is Big Data Opportunities &

More information

How In-Memory Computing can Maximize the Performance of Modern Payments

How In-Memory Computing can Maximize the Performance of Modern Payments How In-Memory Computing can Maximize the Performance of Modern Payments 2018 The mobile payments market is expected to grow to over a trillion dollars by 2019 How can in-memory computing maximize the performance

More information

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW TOPICS COVERED 1 2 Fundamentals of Big Data Platforms Major Big Data Tools Scaling Up vs. Out SCALE UP (SMP) SCALE OUT (MPP) + (n) Upgrade

More information

Stateful Services on DC/OS. Santa Clara, California April 23th 25th, 2018

Stateful Services on DC/OS. Santa Clara, California April 23th 25th, 2018 Stateful Services on DC/OS Santa Clara, California April 23th 25th, 2018 Who Am I? Shafique Hassan Solutions Architect @ Mesosphere Operator 2 Agenda DC/OS Introduction and Recap Why Stateful Services

More information

CS 6604: Data Mining Large Networks and Time- series. B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis

CS 6604: Data Mining Large Networks and Time- series. B. Aditya Prakash Lecture #6: Hadoop and Graph Analysis CS 0: Data Mining Large Networks and Time- series B. Aditya Prakash Lecture #: Hadoop and Graph Analysis What to do if data is really large? Peta- bytes (exabytes, ze?abytes.. ) Google processed PB of

More information

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11

Top 5 Challenges for Hadoop MapReduce in the Enterprise. Whitepaper - May /9/11 Top 5 Challenges for Hadoop MapReduce in the Enterprise Whitepaper - May 2011 http://platform.com/mapreduce 2 5/9/11 Table of Contents Introduction... 2 Current Market Conditions and Drivers. Customer

More information

CASE STUDY Delivering Real Time Financial Transaction Monitoring

CASE STUDY Delivering Real Time Financial Transaction Monitoring CASE STUDY Delivering Real Time Financial Transaction Monitoring Steve Wilkes Striim Co-Founder and CTO Background Customer is a US based Payment Systems Provider Large Network of ATM and Cashier Operated

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 1, 2017 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2457

More information

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform Analytics R&D and Product Management Document Version 1 WP-Urika-GX-Big-Data-Analytics-0217 www.cray.com

More information

Hadoop Course Content

Hadoop Course Content Hadoop Course Content Hadoop Course Content Hadoop Overview, Architecture Considerations, Infrastructure, Platforms and Automation Use case walkthrough ETL Log Analytics Real Time Analytics Hbase for Developers

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Common Customer Use Cases in FSI

Common Customer Use Cases in FSI Common Customer Use Cases in FSI 1 Marketing Optimization 2014 2014 MapR MapR Technologies Technologies 2 Fortune 100 Financial Services Company 104M CARD MEMBERS 3 Financial Services: Recommendation Engine

More information

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica Accelerating Your Big Data Analytics Jeff Healey, Director Product Marketing, HPE Vertica Recent Waves of Disruption IT Infrastructu re for Analytics Data Warehouse Modernization Big Data/ Hadoop Cloud

More information

Brian Macdonald Big Data & Analytics Specialist - Oracle

Brian Macdonald Big Data & Analytics Specialist - Oracle Brian Macdonald Big Data & Analytics Specialist - Oracle Improving Predictive Model Development Time with R and Oracle Big Data Discovery brian.macdonald@oracle.com Copyright 2015, Oracle and/or its affiliates.

More information

HPC IN THE CLOUD Sean O Brien, Jeffrey Gumpf, Brian Kucic and Michael Senizaiz

HPC IN THE CLOUD Sean O Brien, Jeffrey Gumpf, Brian Kucic and Michael Senizaiz HPC IN THE CLOUD Sean O Brien, Jeffrey Gumpf, Brian Kucic and Michael Senizaiz Internet2, Case Western Reserve University, R Systems NA, Inc 2015 Internet2 HPC in the Cloud CONTENTS Introductions and Background

More information

Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management

Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management Hadoop on HPC: Integrating Hadoop and Pilot-based Andre Luckow, Ioannis Paraskevakos, George Chantzialexiou and Shantenu Jha Overview Introduction and Motivation Background Integrating Hadoop/Spark with

More information

Berkeley Data Analytics Stack (BDAS) Overview

Berkeley Data Analytics Stack (BDAS) Overview Berkeley Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY What is Big used For? Reports, e.g., - Track business processes, transactions Diagnosis, e.g., - Why is user engagement dropping?

More information

Apache Mesos. Delivering mixed batch & real-time data infrastructure , Galway, Ireland

Apache Mesos. Delivering mixed batch & real-time data infrastructure , Galway, Ireland Apache Mesos Delivering mixed batch & real-time data infrastructure 2015-11-17, Galway, Ireland Naoise Dunne, Insight Centre for Data Analytics Michael Hausenblas, Mesosphere Inc. http://www.meetup.com/galway-data-meetup/events/226672887/

More information

Data Analytics for Semiconductor Manufacturing The MathWorks, Inc. 1

Data Analytics for Semiconductor Manufacturing The MathWorks, Inc. 1 Data Analytics for Semiconductor Manufacturing 2016 The MathWorks, Inc. 1 Competitive Advantage What do we mean by Data Analytics? Analytics uses data to drive decision making, rather than gut feel or

More information

Insights to HDInsight

Insights to HDInsight Insights to HDInsight Why Hadoop in the Cloud? No hardware costs Unlimited Scale Pay for What You Need Deployed in minutes Azure HDInsight Big Data made easy Enterprise Ready Easier and more productive

More information

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget

Machine Learning For Enterprise: Beyond Open Source. April Jean-François Puget Machine Learning For Enterprise: Beyond Open Source April 2018 Jean-François Puget Use Cases for Machine/Deep Learning Cyber Defense Drug Discovery Fraud Detection Aeronautics IoT Earth Monitoring Advanced

More information

E-guide Hadoop Big Data Platforms Buyer s Guide part 1

E-guide Hadoop Big Data Platforms Buyer s Guide part 1 Hadoop Big Data Platforms Buyer s Guide part 1 Your expert guide to Hadoop big data platforms for managing big data David Loshin, Knowledge Integrity Inc. Companies of all sizes can use Hadoop, as vendors

More information

SmartCare. SPSS Workshop. Rick Durham - North American Advanced Analytics Channel Team IBM Corporation. Date: 5/28/2014

SmartCare. SPSS Workshop. Rick Durham - North American Advanced Analytics Channel Team IBM Corporation. Date: 5/28/2014 SPSS Workshop Key Presenter Rick Durham - North American Advanced Analytics Channel Team Date: 5/28/2014 Agenda What is Predictive Analytics? What is the architecture of the IBM/SPSS technology stack?

More information

SAS Machine Learning and other Analytics: Trends and Roadmap. Sascha Schubert Sberbank 8 Sep 2017

SAS Machine Learning and other Analytics: Trends and Roadmap. Sascha Schubert Sberbank 8 Sep 2017 SAS Machine Learning and other Analytics: Trends and Roadmap Sascha Schubert Sberbank 8 Sep 2017 How Big Analytics will Change Organizations Optimization and Innovation Optimizing existing processes Customer

More information

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer

Redefine Big Data: EMC Data Lake in Action. Andrea Prosperi Systems Engineer Redefine Big Data: EMC Data Lake in Action Andrea Prosperi Systems Engineer 1 Agenda Data Analytics Today Big data Hadoop & HDFS Different types of analytics Data lakes EMC Solutions for Data Lakes 2 The

More information

New Big Data Solutions and Opportunities for DB Workloads

New Big Data Solutions and Opportunities for DB Workloads New Big Data Solutions and Opportunities for DB Workloads Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook Luca Canali, IT-DB Hadoop and Spark Service WLCG, GDB meeting CERN, September

More information

Evolution or Revolution: Top Ten Development Trends

Evolution or Revolution: Top Ten Development Trends Evolution or Revolution: Top Ten Development Trends Jim Lundy CEO and Lead Analyst IT Development Trends: Building a Fighter Jet Agenda What are the Top Ten Trends in Development? What are the Best Practices

More information

Analytical Capability Security Compute Ease Data Scale Price Users Traditional Statistics vs. Machine Learning In-Memory vs. Shared Infrastructure CRAN vs. Parallelization Desktop vs. Remote Explicit vs.

More information

IBM SPSS & Apache Spark

IBM SPSS & Apache Spark IBM SPSS & Apache Spark Making Big Data analytics easier and more accessible ramiro.rego@es.ibm.com @foreswearer 1 2016 IBM Corporation Modeler y Spark. Integration Infrastructure overview Spark, Hadoop

More information

Data Analytics with MATLAB Adam Filion Application Engineer MathWorks

Data Analytics with MATLAB Adam Filion Application Engineer MathWorks Data Analytics with Adam Filion Application Engineer MathWorks 2015 The MathWorks, Inc. 1 Case Study: Day-Ahead Load Forecasting Goal: Implement a tool for easy and accurate computation of dayahead system

More information

FINANCE + DEEP LEARNING SKYMIND / DEEPLEARNING4J 2015

FINANCE + DEEP LEARNING SKYMIND / DEEPLEARNING4J 2015 FINANCE + DEEP LEARNING SKYMIND / DEEPLEARNING4J 2015 WHICH OUTCOMES MATTER? BASIC APPLICATIONS Anomaly Detection for Compliance Rogue Traders, Fat Fingers Fraud, Money Laundering Detection Trading Strategies

More information

Using the Blaze Engine to Run Profiles and Scorecards

Using the Blaze Engine to Run Profiles and Scorecards Using the Blaze Engine to Run Profiles and Scorecards 1993, 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording

More information

Building a Multi-Tenant Infrastructure for Diverse Application Workloads

Building a Multi-Tenant Infrastructure for Diverse Application Workloads Building a Multi-Tenant Infrastructure for Diverse Application Workloads Rick Janowski Marketing Manager IBM Platform Computing 1 The Why and What of Multi-Tenancy 2 Parallelizable problems demand fresh

More information

Hadoop and Analytics at CERN IT CERN IT-DB

Hadoop and Analytics at CERN IT CERN IT-DB Hadoop and Analytics at CERN IT CERN IT-DB 1 Hadoop Use cases Parallel processing of large amounts of data Perform analytics on a large scale Dealing with complex data: structured, semi-structured, unstructured

More information

Modernizing Your Data Warehouse with Azure

Modernizing Your Data Warehouse with Azure Modernizing Your Data Warehouse with Azure Big data. Small data. All data. Christian Coté S P O N S O R S The traditional BI Environment The traditional data warehouse data warehousing has reached the

More information

INFORMS Analytics Maturity Model. User Guide

INFORMS Analytics Maturity Model. User Guide INFORMS Analytics Maturity Model User Guide Introducing the INFORMS Scorecard INFORMS, the leading association for advanced analytics professionals, seeks to advance the practice, research, methods, and

More information

Integrating Artificial Intelligence and Simulation Modeling

Integrating Artificial Intelligence and Simulation Modeling www.pwc.com Integrating Artificial Intelligence and Simulation Modeling PwC Artificial Intelligence Accelerator Fall 2017 PwC Artificial Intelligence Accelerator is working with AnyLogic simulation and

More information

Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015

Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015 Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Probabilistic Classification Introduction to Logistic regression Binary logistic regression

More information

Building Efficient Large-Scale Big Data Processing Platforms

Building Efficient Large-Scale Big Data Processing Platforms University of Massachusetts Boston ScholarWorks at UMass Boston Graduate Doctoral Dissertations Doctoral Dissertations and Masters Theses 5-31-2017 Building Efficient Large-Scale Big Data Processing Platforms

More information

Spark, Hadoop, and Friends

Spark, Hadoop, and Friends Spark, Hadoop, and Friends (and the Zeppelin Notebook) Douglas Eadline Jan 4, 2017 NJIT Presenter Douglas Eadline deadline@basement-supercomputing.com @thedeadline HPC/Hadoop Consultant/Writer http://www.basement-supercomputing.com

More information

Synthetic Data Generation for Realistic Analytics Examples and Testing

Synthetic Data Generation for Realistic Analytics Examples and Testing Synthetic Data Generation for Realistic Analytics Examples and Testing Ronald J. Nowling Red Hat, Inc. rnowling@redhat.com http://rnowling.github.io/ Who Am I? Software Engineer at Red Hat Data Science

More information

How to develop Data Scientist Super Powers! Using Azure from R to scale and persist analytic workloads.. Simon Field

How to develop Data Scientist Super Powers! Using Azure from R to scale and persist analytic workloads.. Simon Field How to develop Data Scientist Super Powers! Using Azure from R to scale and persist analytic workloads.. Simon Field Topics Why cloud Managing cloud resources from R Highly parallelised model training

More information

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme VIRT1400BU Real-World Customer Architecture for Big Data on VMware vsphere Joe Bruneau, General Mills Justin Murray, Technical Marketing, VMware #VMworld #VIRT1400BU Disclaimer This presentation may contain

More information

Intro Logistic Regression Gradient Descent + SGD

Intro Logistic Regression Gradient Descent + SGD Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade March 29, 2016 1 Ad Placement

More information

Transforming Analytics with Cloudera Data Science WorkBench

Transforming Analytics with Cloudera Data Science WorkBench Transforming Analytics with Cloudera Data Science WorkBench Process data, develop and serve predictive models. 1 Age of Machine Learning Data volume NO Machine Learning Machine Learning 1950s 1960s 1970s

More information

One Year Executive Program in Applied Business Analytics

One Year Executive Program in Applied Business Analytics Vivekanand Education Society Institute of Management Studies & Research Big Data & Advanced Analytics One Year Executive Program in Applied Business Analytics In a rapidly changing global environment,

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform An open-architecture platform to manage data in motion and at rest Highlights Addresses a range of data-at-rest use cases Powers real-time customer applications Delivers robust

More information

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB

Data Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB Data Analytics and CERN IT Hadoop Service CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB 1 Data Analytics at Scale The Challenge When you cannot fit your workload in a desktop Data

More information

5th Annual. Cloudera, Inc. All rights reserved.

5th Annual. Cloudera, Inc. All rights reserved. 5th Annual 1 The Essentials of Apache Hadoop The What, Why and How to Meet Agency Objectives Sarah Sproehnle, Vice President, Customer Success 2 Introduction 3 What is Apache Hadoop? Hadoop is a software

More information

Real Applications of Big Data. Challenges and Opportunities

Real Applications of Big Data. Challenges and Opportunities Real Applications of Big Data. Challenges and Opportunities David Gil University of Alicante Polytechnic School - EPSA Department of Computer Technology Introduction to Big Data Projects Benchmarking (weka,

More information

BIG DATA ANALYTICS WITH HADOOP. 40 Hour Course

BIG DATA ANALYTICS WITH HADOOP. 40 Hour Course 1 BIG DATA ANALYTICS WITH HADOOP 40 Hour Course OVERVIEW Learning Objectives Understanding Big Data Understanding various types of data that can be stored in Hadoop Setting up and Configuring Hadoop in

More information

Predictive Maintenance and Condition Monitoring

Predictive Maintenance and Condition Monitoring Predictive Maintenance and Condition Monitoring Tim Yeh Application Engineer 2018 The MathWorks, Inc. 1 Agenda What is Predictive Maintenance? Algorithm behind Remaining Useful Life (RUL) Estimation Methods

More information

Copyright 2013 Oracle and/or its affiliates. All rights reserved.

Copyright 2013 Oracle and/or its affiliates. All rights reserved. 1 The Value of Big Data and Analytics in Government Jan 22, 2014 Wayne Babby, Deputy Director (A), California Department of Corrections & Rehabilitation Tim Dexter, Solution Architect, Analytics Practice,

More information

Cluster management at Google

Cluster management at Google Cluster management at Google LISA 2013 john wilkes (johnwilkes@google.com) Software Engineer, Google, Inc. We own and operate data centers around the world http://www.google.com/about/datacenters/inside/locations/

More information

DEVELOP QUALITY CHARACTERISTICS BASED QUALITY EVALUATION PROCESS FOR READY TO USE SOFTWARE PRODUCTS

DEVELOP QUALITY CHARACTERISTICS BASED QUALITY EVALUATION PROCESS FOR READY TO USE SOFTWARE PRODUCTS DEVELOP QUALITY CHARACTERISTICS BASED QUALITY EVALUATION PROCESS FOR READY TO USE SOFTWARE PRODUCTS Daiju Kato 1 and Hiroshi Ishikawa 2 1 WingArc1st Inc., Tokyo, Japan kato.d@wingarc.com 2 Graduate School

More information

Conquer Dark Data by Unleasing the Power of SAP Analytics

Conquer Dark Data by Unleasing the Power of SAP Analytics Conquer Dark Data by Unleasing the Power of SAP Analytics Natasa Mastellou, Business Intelligence Solution Advisor, SAP Hellas, Cyprus & Malta 1 IoT is everywhere Connected Cars Smart Equipment Smart Logistics

More information

National Occupational Standard

National Occupational Standard National Occupational Standard Overview This unit is about performing research and designing a variety of algorithmic models for internal and external clients 19 National Occupational Standard Unit Code

More information

On Cloud Computational Models and the Heterogeneity Challenge

On Cloud Computational Models and the Heterogeneity Challenge On Cloud Computational Models and the Heterogeneity Challenge Raouf Boutaba D. Cheriton School of Computer Science University of Waterloo WCU IT Convergence Engineering Division POSTECH FOME, December

More information

St Louis CMG Boris Zibitsker, PhD

St Louis CMG Boris Zibitsker, PhD ENTERPRISE PERFORMANCE ASSURANCE BASED ON BIG DATA ANALYTICS St Louis CMG Boris Zibitsker, PhD www.beznext.com bzibitsker@beznext.com Abstract Today s fast-paced businesses have to make business decisions

More information
