Knowledge Discovery and Data Mining


Unit # 19

Acknowledgement
The following discussion is based on the paper "Mining Big Data: Current Status, and Forecast to the Future" by Fan and Bifet and an online presentation by Internet Database Lab, Korea.

Abstract
The Big Data challenge is becoming one of the most exciting opportunities for the coming years. Big Data is a new term for datasets that, because of their large size and complexity, cannot be managed with our current methodologies or data mining software tools. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, something that was not possible before because of their volume, variability, and velocity.

What Is Big Data?
There is no consensus on how to define big data.
- "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." (Teradata Magazine article, 2011)
- "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (The McKinsey Global Institute, 2011)

Big Data Mining
- The term 'Big Data' first appeared in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled "Big Data and the Next Wave of InfraStress" [9].
- Big Data mining was relevant from the beginning: the first book mentioning 'Big Data' is a data mining book by Weiss and Indurkhya, also published in 1998 [34].
- The first academic paper with the words 'Big Data' in the title appeared a bit later, in 2000, by Diebold [8].
- The term 'Big Data' originates from the fact that we are creating a huge amount of data every day.

Big Data Mining (Cont'd)
- In his invited talk at the KDD BigMine'12 Workshop, Usama Fayyad [11] presented striking numbers about internet usage. Each day, Google has more than 1 billion queries, Twitter more than 250 million tweets, Facebook more than 800 million updates, and YouTube more than 4 billion views.
- The data produced nowadays is estimated to be on the order of zettabytes, and it is growing by around 40% every year.
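To make the "around 40% every year" figure concrete, the following is a minimal sketch of the compound-growth arithmetic. The function names are illustrative, and the constant 40% rate is taken from the slide as an assumption rather than a forecast. At that rate the data volume doubles roughly every two years and grows about 29-fold over a decade.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def projected_volume(initial: float, annual_growth_rate: float, years: int) -> float:
    """Volume after `years` of compound growth, in the same unit as `initial`."""
    return initial * (1 + annual_growth_rate) ** years


# Assumption: growth stays at 40% per year, as quoted on the slide.
rate = 0.40
print(f"Doubling time: {doubling_time_years(rate):.2f} years")
print(f"Growth factor after 10 years: {projected_volume(1.0, rate, 10):.1f}x")
```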

3 V's
Doug Laney [19] was the first to talk about the 3 V's in Big Data management:
- Volume: there is more data than ever before, and its size continues to increase, but the percentage of data that our tools can process does not
- Variety: there are many different types of data, such as text, sensor data, audio, video, graph, and more
- Velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time

How Is Big Data Different?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value: need to focus on the important part
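The Velocity point above implies processing each item as it arrives instead of storing the whole stream first. As a minimal sketch, assuming a hypothetical stream of numeric sensor readings, a running mean can be maintained in one pass with constant memory. The class name and the sample readings are invented for illustration.

```python
class RunningMean:
    """One-pass mean: each reading updates the estimate and is then discarded."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def sensor_stream():
    """Hypothetical stand-in for an unbounded stream of sensor readings."""
    yield from (2.0, 4.0, 6.0, 8.0)


tracker = RunningMean()
# An up-to-date estimate is available after every single reading.
estimates = [tracker.update(reading) for reading in sensor_stream()]
print(estimates)
```

The same pattern (keep a small summary, update it per item) applies to other stream statistics such as counts, minima, or variances.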

Risks of Big Data
- Being overwhelmed by the data: need the right people solving the right problems
- Costs escalating too fast: it isn't necessary to capture 100% of the data
- Privacy, a concern for many sources of big data: self-regulation and legal regulation

Why You Need to Tame Big Data
- Analyzing big data is already standard in some areas (e.g. e-commerce)
- Organizations that ignore it risk being left behind in a few years
- So far, they have only missed the chance to be on the bleeding edge
- Capturing data and using analysis to make decisions is just an extension of what you are already doing today

The Structure of Big Data
- Structured: most traditional data sources
- Semi-structured: many sources of big data
- Unstructured: video data, audio data

Exploring Big Data
The time for developing an analysis:
- Gathering & preparing data: 70~80%
- Analyzing data: 20~30%
The time for developing an analysis (initially working with big data):
- Gathering & preparing data: 95%
- Analyzing data: 5%

Mixing Big Data with Traditional Data
- The biggest value in big data can be derived by combining big data with other corporate data
- Big data + other data: create a synergy effect

Mixing Big Data with Traditional Data (Cont'd)
- Browsing history is more valuable when combined with:
  - Knowing how valuable a customer is
  - What they have bought in the past
- Text (online chat and e-mails) is more valuable when combined with:
  - Knowing the detailed product specification being discussed
  - The sales data related to those products
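The browsing-history example above amounts to a join between an event stream and a corporate customer table. The following is a minimal sketch of that idea, not a production design: the record layouts, field names (`lifetime_value`, `purchased`), and sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical "big data" source: raw page-view events as (customer_id, product).
browsing_events = [
    ("c1", "laptop"),
    ("c2", "phone"),
    ("c1", "laptop"),
    ("c3", "tablet"),
    ("c1", "monitor"),
]

# Hypothetical traditional corporate data: customer value and purchase history.
customers = {
    "c1": {"lifetime_value": 5200.0, "purchased": {"monitor"}},
    "c2": {"lifetime_value": 300.0, "purchased": set()},
}


def enrich_browsing(events, customer_table):
    """Join browsing events with customer records on customer_id."""
    viewed = defaultdict(set)
    for customer_id, product in events:
        viewed[customer_id].add(product)

    enriched = []
    for customer_id, products in viewed.items():
        profile = customer_table.get(customer_id)
        if profile is None:
            continue  # no corporate record to combine with, so skip
        enriched.append({
            "customer_id": customer_id,
            "lifetime_value": profile["lifetime_value"],
            # Products the customer looked at but has not bought yet.
            "viewed_not_bought": sorted(products - profile["purchased"]),
        })
    # Most valuable customers first.
    enriched.sort(key=lambda row: row["lifetime_value"], reverse=True)
    return enriched


leads = enrich_browsing(browsing_events, customers)
for row in leads:
    print(row)
```

Neither source alone yields this list: the events say what was viewed, while the customer table says who is worth following up and what they already own.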

Who Benefits from Big Data
- Companies with a tradition of fact-based decision making
- Engineering and research functions
- The best web-native companies

Big Data Controversies
- There is no need to distinguish Big Data analytics from data analytics, as data will continue growing and will never be small again.
- Big Data may be hype to sell Hadoop-based computing systems. Hadoop is not always the best tool [23]. Data management system vendors seem eager to sell Hadoop-based systems, yet MapReduce may not always be the best programming platform, for example for medium-size companies.
- In real-time analytics, data may be changing. In that case, what matters is not the size of the data but its recency.

Big Data Controversies (Cont'd)
- Claims to accuracy are misleading. As Taleb explains in his new book [32], when the number of variables grows, the number of fake correlations also grows. For example, Leinweber [21] showed that the S&P 500 stock index was correlated with butter production in Bangladesh, along with other amusing correlations.
- Bigger data are not always better data. It depends on whether the data is noisy and whether it is representative of what we are looking for. For example, Twitter users are sometimes assumed to be representative of the global population, which is not always the case.

Big Data Controversies (Cont'd)
- Ethical concerns about accessibility. The main issue is whether it is ethical for people to be analyzed without knowing it.
- Limited access to Big Data creates new digital divides. There may be a divide between people or organizations that can analyze Big Data and those that cannot. Organizations with access to Big Data will also be able to extract knowledge that cannot be obtained without it. We may create a division between Big Data rich and Big Data poor organizations.
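The point above about fake correlations growing with the number of variables can be illustrated with simple counting. This is a sketch under stated assumptions, not Taleb's own derivation: every variable is pure independent noise, each pair of variables is tested separately, and a test has a false-positive rate of `alpha`. The number of pairs grows quadratically (45 pairs for 10 variables, 4,950 for 100, 499,500 for 1,000), and so does the expected number of chance "discoveries".

```python
from math import comb


def expected_spurious_pairs(num_variables: int, alpha: float = 0.05) -> float:
    """Expected number of variable pairs flagged as correlated by chance alone.

    Assumes all variables are unrelated noise and each pair is tested at
    significance level `alpha`. By linearity of expectation, the result is
    alpha times the number of pairs.
    """
    return alpha * comb(num_variables, 2)


for n in (10, 100, 1000):
    pairs = comb(n, 2)
    print(f"{n} variables -> {pairs} pairs -> "
          f"about {expected_spurious_pairs(n):.1f} spurious correlations expected")
```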

Tools: Open Source Revolution
- Apache Hadoop [3]: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called Hadoop Distributed Filesystem (HDFS).
- Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
- A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job.

Big Data Mining Tools
- Apache Mahout [4]: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
- R [29]: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.
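The MapReduce flow described above (split the input, run map tasks on each subset, then reduce the map output) can be sketched in plain Python with a word count. This is an in-process illustration of the programming model only. It does not use the Hadoop API, the function names are invented here, and the map tasks run sequentially, whereas Hadoop would distribute them across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one independent subset."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key so each reduce task sees one key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values for one key into the final result."""
    return key, sum(values)


def run_job(lines, num_splits=2):
    # Divide the input dataset into independent subsets.
    chunks = [lines[i::num_splits] for i in range(num_splits)]
    # Each chunk could be mapped on a different node; here they run in turn.
    mapped = chain.from_iterable(map_task(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


documents = ["big data is big", "data mining finds patterns in data"]
counts = run_job(documents)
print(counts)
```

Because each map task touches only its own chunk and each reduce task only its own key, both phases can be parallelized without shared state, which is the property that lets the model scale to large clusters.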

Concluding Remarks on Big Data
- Big Data is going to continue growing during the next years, and each data scientist will have to manage much larger amounts of data every year.
- This data is going to be more diverse, larger, and faster.
- Big Data is becoming the new Final Frontier for scientific data research and for business applications.
- We are at the beginning of a new era where Big Data mining will help us to discover knowledge that no one has discovered before.

Analytics and Decision Making
"Google, Amazon, and others have prospered not by giving customers information but by giving them shortcuts to decisions and actions." (Analytics 3.0, Harvard Business Review, December 2013)

Course Recap

Course Objectives
- Know the knowledge discovery process
- Understand the different categories of algorithms
- Be able to judge which algorithms fit different problems
- Have practical experience choosing algorithms for a specific problem
- Have practical experience working in technical teams
- Have practical experience executing data mining projects
- Have practical experience using open source data mining software

Course Outlines
- Classification Techniques
  - Classification/Decision Trees
  - Naïve Bayes
  - Neural Networks
- Clustering
  - Partitioning Methods
  - Hierarchical Methods
- Feature Selection
- Model Evaluation
- Patterns and Association Mining
- Text Mining
- Papers/Case Studies/Applications Reading

What Next?
- An excellent online book on Data Mining with R by Yanchang Zhao: http://cran.r-project.org/doc/contrib/zhao_r_and_data_mining.pdf
- Another great online book, Mining of Massive Datasets, by several Stanford professors: http://infolab.stanford.edu/~ullman/mmds/book.pdf

What Next? (Cont'd)
- Cloudera has provided a virtual machine of Hadoop. You can simply download the VM file and open it in VirtualBox to learn and to experiment with Hadoop and its ecosystem: http://www.cloudera.com/content/support/en/downloads.html
- Keep visiting Kaggle.com and other similar websites to keep your knowledge and skills up to date.
- Wish you a strong career in analytics!