Data Informatics. Seon Ho Kim, Ph.D.


1 Data Informatics Seon Ho Kim, Ph.D.

2 What is Big Data?

3 What is Big Data?
Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.

4 Trends leading to Data Flood
More data is generated:
- Bank, telecom, and other business transactions
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce

5 Who's Generating Big Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

6 Unstructured Data
- Unstructured data is a generic label for any corporate information that is not in a database.
- It can be textual or non-textual.
- Examples: Facebook, YouTube, Twitter, web logs, etc.
- Storage and search problem: just adding more hardware to house data while ignoring its content no longer suffices.

7 Characteristics of Big Data: 1-Scale (Volume)
- Data volume: a 44x increase, from 0.8 zettabytes to 35 zettabytes
- The volume of collected and generated data is increasing exponentially

8 Characteristics of Big Data: 2-Complexity (Variety)
- Various formats, types, and structures
- Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
- Static data vs. streaming data
- A single application can be generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together.

9 Characteristics of Big Data: 3-Speed (Velocity)
- Data is being generated fast and needs to be processed fast
- Online data analytics
- Late decisions lead to missed opportunities
Examples:
- E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
- Healthcare monitoring: sensors monitor your activities and body, and any abnormal measurement requires immediate reaction
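
The healthcare-monitoring example can be sketched as a check that runs on each reading as it arrives. This is a minimal illustration, not a clinical method: the heart-rate stream, the function name `monitor`, the window size, and the threshold are all assumptions made for the example.

```python
from collections import deque
from statistics import mean

def monitor(readings, window=5, threshold=30.0):
    """Yield (index, value) for readings far from the recent average.

    Each reading is handled as it arrives; only the last `window`
    values are kept in memory. All parameters are illustrative.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        # Only judge a reading once a full window of history exists.
        if len(recent) == window and abs(value - mean(recent)) > threshold:
            yield i, value  # react now instead of waiting for a batch job
        recent.append(value)

# Hypothetical heart-rate stream (beats per minute).
stream = [72, 75, 71, 74, 73, 76, 140, 74, 72]
alerts = list(monitor(stream))
print(alerts)
```

Because `monitor` is a generator, an alert is available as soon as the abnormal reading is seen, which is the point of processing fast data online.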

10 Big Data: 3 V's

11 Some Make It 4 V's

14 The Model Has Changed
The model of generating and consuming data has changed:
- Old model: few companies generate data; all others consume it
- New model: all of us generate data, and all of us consume it

15 More Formally Big Data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
Challenges include:
- Management (capture, store, process, share, etc.), for example the Hadoop ecosystem
- Analysis (predictive analysis or other methods to extract value from data), for example machine learning
- Privacy: an open question
Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk.

16 Management

17 Exploring Big Data
The time for developing an analysis (when initially working with big data):
- Gathering and preparing data (95%)
- Analyzing data (5%)
ETL process: taking a raw feed of data, reading it, and producing a usable set of output
- Extract
- Transform
- Load
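
The three ETL steps can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the raw feed, its column names, the cleaning rules, and the `purchases` table are hypothetical, chosen only to show where each step sits.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed with messy whitespace, casing, and a missing value.
RAW_FEED = """user_id,amount,country
1, 19.99 ,us
2,,US
3,5.00,kr
"""

def extract(text):
    """Extract: read the raw feed into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows and normalize types and casing."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # illustrative rule: skip records with no amount
        clean.append((int(row["user_id"]), float(amount),
                      row["country"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a queryable table."""
    conn.execute("CREATE TABLE IF NOT EXISTS purchases "
                 "(user_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_FEED)), conn)
loaded = conn.execute(
    "SELECT user_id, amount, country FROM purchases ORDER BY user_id"
).fetchall()
print(loaded)
```

Even in this toy case most of the code deals with gathering and preparing the data, which matches the 95% versus 5% split above.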

18 Why Machine Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.
There is no need to learn to calculate payroll.
Learning is used when:
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- The solution changes over time (routing on a computer network)
- The solution needs to be adapted to particular cases (user biometrics)

19 What We Talk About When We Talk About Learning
- Learning general models from data of particular examples
- Data is cheap and abundant; knowledge is expensive and scarce.
- Example in retail: from customer transactions to consumer behavior, e.g. "People who bought X also bought Y"
- Build a model that is a good and useful approximation to the data.
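
The retail example ("People who bought X also bought Y") can be sketched as simple co-occurrence counting over transactions. The basket data and the function name `also_bought` are assumptions for illustration; real association rule learning adds support thresholds and more careful scoring.

```python
from collections import Counter

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

def also_bought(transactions, item):
    """Rank other items by the fraction of `item` baskets that contain them.

    Returns a list of (other_item, confidence) pairs, highest first.
    """
    with_item = [t for t in transactions if item in t]
    if not with_item:
        return []
    counts = Counter(other for t in with_item for other in t if other != item)
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(other, n / len(with_item)) for other, n in ranked]

print(also_bought(transactions, "bread"))
```

The returned fraction is the confidence of the rule "item implies other", estimated from the particular examples in the data.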

20 What is Machine Learning?
Optimize a performance criterion using example data or past experience.
Role of statistics:
- Building mathematical models
- Inference from samples
Role of computer science: efficient algorithms to
- Solve the optimization problem
- Represent and evaluate the model for inference
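
As one possible illustration of "optimize a performance criterion using example data", the sketch below fits a line to examples by gradient descent on mean squared error. The choice of criterion, the learning rate, the step count, and the sample points are assumptions for this example, not something the slides prescribe.

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y = w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the criterion (1/n) * sum((w*x + b - y)^2).
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Example data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The statistical part is the model and the criterion; the computational part is the loop that solves the optimization problem.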

21 The Structure of Big Data
- Structured: most traditional data sources
- Semi-structured: many sources of big data
- Unstructured: video data, audio data

22 Applications
- Association
- Supervised learning: learning from known values
  - Classification (recognition)
  - Regression
- Unsupervised learning: learning from unknown values
  - Clustering (grouping)
- Reinforcement learning: learning a policy, a sequence of outputs
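
The difference between learning from known values and from unknown values can be sketched on one-dimensional data. Both functions below are illustrative choices (a 1-nearest-neighbor classifier and a small k-means loop); the data, labels, and starting centers are assumptions.

```python
def nearest_neighbor(train, point):
    """Supervised: predict the label of the closest labeled example."""
    return min(train, key=lambda ex: abs(ex[0] - point))[1]

def k_means_1d(values, centers, iterations=10):
    """Unsupervised: group unlabeled values around the given starting centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Classification: labels are known for the training examples.
labeled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
prediction = nearest_neighbor(labeled, 2.0)

# Clustering: no labels, only the values themselves.
unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers, groups = k_means_1d(unlabeled, centers=[0.0, 5.0])
print(prediction, centers, groups)
```

The classifier needs the labels in `labeled` to answer at all, while the clustering step discovers the two groups from the values alone.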

23 Techniques Creating Business Value
- Anomaly or outlier detection
- Association rule learning
- Clustering analysis
- Classification analysis
- Regression analysis

24 Big Data Visualization

25 Big Data Analysis Example

26 What's Driving Big Data
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
In contrast, traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

27 Value of Big Data Analytics
- Big data is more real-time in nature than traditional data warehouse (DW) applications
- Traditional DW architectures are not well-suited for big data applications
- Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data applications
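
The shared-nothing, scale-out idea can be sketched in a single process: each record is routed to exactly one partition, every partition is aggregated without touching the others, and the partial results are merged. Threads stand in for separate nodes here, and the record format and function names are assumptions for illustration only.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_nodes):
    """Route each (key, value) record to exactly one node by hashing its key."""
    parts = [[] for _ in range(n_nodes)]
    for key, value in records:
        parts[zlib.crc32(key.encode()) % n_nodes].append((key, value))
    return parts

def local_aggregate(part):
    """Work done by one node, using only its own partition (no shared state)."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals

def scale_out_sum(records, n_nodes=3):
    """Sum values per key by aggregating partitions independently, then merging."""
    parts = partition(records, n_nodes)
    # Threads stand in for independent machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = list(pool.map(local_aggregate, parts))
    merged = {}
    for totals in results:
        merged.update(totals)  # a key lives in one partition, so no conflicts
    return merged

# Hypothetical (country, order count) records.
records = [("us", 3), ("kr", 5), ("us", 2), ("de", 1), ("kr", 4)]
print(scale_out_sum(records))
```

Because all records for a key land on the same node, adding nodes spreads the work without requiring any coordination during aggregation.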

28 Challenges in Handling Big Data
- The bottleneck is in technology: new architectures, algorithms, and techniques are needed
- It is also in technical skills: experts in using the new technology and dealing with big data are needed

29 Big Data Summary
- Big data is being generated everywhere, by humans and machines
- Big data analysis is already everywhere
- Risks remain:
  - Overwhelmed: right problem, right person?
  - Cost escalates fast: how much data, how much accuracy?
  - Privacy issue: what is tolerable?
- Big potential for new startup businesses too!