IBM Taiwan Claire Lin Data Mining using SPSS Modeler 2nd Session 2014 IBM Corporation
Agenda
- Data Mining Process
- Business Understanding
- Data Understanding
  - Live Demo and Exercise
- Data Preparation and Manipulation
  - Live Demo and Exercise
What is Data Mining?
- Data mining is the analysis step of the Knowledge Discovery in Databases (KDD) process. It encompasses a number of techniques for extracting useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.
- The goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.
Data mining process
- Cross Industry Standard Process for Data Mining (CRISP-DM)
Data mining process
- Cross Industry Standard Process for Data Mining (CRISP-DM)
- What can SPSS Modeler do?
  - Input raw data
  - Data understanding
    - Check missing data
    - Check anomalous and outlier data
  - Data preparation
    - Filter, Derive, and Reclassify nodes
  - Modeling
  - Output
Business Understanding
- Determining business objectives
  - Finding what people buy together with zongzi (sticky rice dumplings) during the Dragon Boat Festival
  - Predicting who is likely not to renew a contract for mobile phone service
- Assessing the situation
- Determining data mining goals
- Producing a project plan
Data Understanding
- Need to understand
  - What your data resources are
  - What the characteristics of those resources are
- Includes
  - Collecting initial data
  - Describing data
  - Exploring data
  - Verifying data quality
    - Missing data
    - Anomalous data
Data Understanding - Missing Data
- Blank: contains no information. This is white space if the field is a string, and a null (non-numeric) value if the field is numeric.
- Empty string: a string field may be empty, which means that it contains nothing. (This is common in databases.)
- Value blanks: values that represent missing or invalid information.
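The categories above can be sketched in Python as a small classifier. This is a minimal illustration, not SPSS Modeler's implementation: the function name `missing_kind`, the use of `None` to stand in for a null value, and the example blank codes (`-1` and `"N/A"`) are all assumptions made for the example.

```python
def missing_kind(value, blank_values=()):
    """Return the kind of missing data a value represents, or None if it is valid.

    None stands in for a null value. blank_values holds codes that the
    analyst has declared to mean "missing" (hypothetical examples below).
    """
    if value is None:
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":
            return "white space"
    if value in blank_values:
        return "value blank"
    return None


# Hypothetical column with one valid value and several kinds of missing data.
column = [42, None, "", "   ", -1, "N/A"]
kinds = [missing_kind(v, blank_values=(-1, "N/A")) for v in column]
print(kinds)
```

Keeping the declared blank codes as a parameter mirrors the idea that value blanks depend on the data set, while nulls, empty strings, and white space can be detected from the value alone.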
Data Understanding - Anomalous Data
- What is anomalous data?
  - Far from the center of the distribution
    - The center is measured by the mean or median, with the standard deviation as a measure of spread
  - Far from other values
    - Whether or not the value is close to the center of the distribution
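The first definition above (far from the center, with the standard deviation as the measure of spread) can be sketched as follows. The function name `far_from_center`, the sample data, and the cutoff of 3 standard deviations are assumptions chosen for illustration. The sketch does not cover the second definition (far from other values).

```python
from statistics import mean, stdev


def far_from_center(values, threshold=3.0):
    """Return the indices of values lying more than `threshold`
    sample standard deviations away from the mean.

    Needs at least two values, because stdev() is undefined for fewer.
    """
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # all values identical, so nothing is far from the center
    return [i for i, v in enumerate(values)
            if abs(v - center) / spread > threshold]


# Hypothetical field: twenty ordinary readings and one suspicious value.
readings = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
            10, 10, 9, 11, 10, 10, 9, 11, 10, 10, 100]
flagged = far_from_center(readings)
print(flagged)
```

Note that with very small samples no value can reach a high cutoff, because a single extreme value also inflates the standard deviation.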
Data Understanding - Anomaly detection
SPSS Modeler User Interface
Data Sources
- Database: ODBC source
- Var. File: free-field text file
- Fixed File: fixed-field text file
- Statistics File / SAS File / Excel File
Data Understanding - The Data Audit node
- Provides a report on:
  - Missing values
  - Outliers and extreme values
  - Information on a field's distribution
Data Understanding - Anomaly detection
- Anomaly detection models identify outliers, or unusual cases, by using cluster analysis
- Each record is assigned an anomaly index
  - The index is the ratio of the record's group deviation index to the average group deviation index over the cluster that the record belongs to
  - Cases with an index value greater than 2 could be good anomaly candidates
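The ratio described above can be sketched in Python under simplifying assumptions. Here the cluster labels are supplied by hand, and the squared Euclidean distance to the cluster centroid stands in for the group deviation index. SPSS Modeler builds its own clusters and computes the deviation differently, so this only illustrates the ratio, and the name `anomaly_indices` is hypothetical.

```python
from collections import defaultdict


def anomaly_indices(records, labels):
    """Return one anomaly index per record: the record's deviation from its
    cluster centroid divided by the average deviation within that cluster.

    Assumption: squared Euclidean distance is used as the deviation measure.
    """
    members = defaultdict(list)
    for record, label in zip(records, labels):
        members[label].append(record)

    # Centroid of each cluster: the per-field mean of its records.
    centroids = {
        label: tuple(sum(column) / len(group) for column in zip(*group))
        for label, group in members.items()
    }

    deviations = [
        sum((x - c) ** 2 for x, c in zip(record, centroids[label]))
        for record, label in zip(records, labels)
    ]

    # Average deviation over each cluster.
    average = {}
    for label, group in members.items():
        total = sum(d for d, l in zip(deviations, labels) if l == label)
        average[label] = total / len(group)

    return [
        d / average[label] if average[label] > 0 else 0.0
        for d, label in zip(deviations, labels)
    ]


# Hypothetical data: four records close together and one far away.
records = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
labels = ["a"] * 5
indices = anomaly_indices(records, labels)
candidates = [i for i, value in enumerate(indices) if value > 2]
print(candidates)
```

Because each index is a ratio to the cluster average, the indices within a cluster average to 1, which is why a value above 2 stands out.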
Data Understanding - Outlier Data Live Demo
- Live Demo
  - SPSS Modeler UI
  - Read data into SPSS Modeler
  - Check missing data
  - Check anomalous and outlier data
    - Data Audit node
    - Anomaly node
Live Demo & Exercise I
Data Preparation and Manipulation
- Objective: construct the final dataset for modeling
- Record Operations
  - Select partial data from the dataset
  - Sort the data
- Field Operations
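The two record operations above (select and sort) can be illustrated outside SPSS Modeler with plain Python. The field names and values below are hypothetical and only show what the operations do to a set of records.

```python
# Hypothetical customer records.
customers = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": 17, "spend": 40.0},
    {"id": 3, "age": 52, "spend": 310.5},
    {"id": 4, "age": 29, "spend": 75.25},
]

# Select: keep only the records that satisfy a condition.
adults = [row for row in customers if row["age"] >= 18]

# Sort: order the selected records by a field, largest spend first.
by_spend = sorted(adults, key=lambda row: row["spend"], reverse=True)

print([row["id"] for row in by_spend])
```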
Type: Specifies field metadata and properties
- Continuous: used to describe numeric values, such as a range of 0-100 or 0.75-1.25. A continuous value can be an integer, real number, or date/time.
- Categorical: used for string values.
- Nominal: used to describe data with multiple distinct values, each treated as a member of a set.
- Ordinal: used to describe data with multiple distinct values that have an inherent order.
- Flag: used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1.
Filter: Filters and renames fields
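As a rough analogy to what filtering and renaming fields means, the sketch below drops and renames keys in a list of records. The helper `filter_fields` and the field names are hypothetical and are not part of SPSS Modeler.

```python
def filter_fields(rows, drop=(), rename=None):
    """Return new records without the fields in `drop`,
    renaming the remaining fields according to `rename`."""
    rename = rename or {}
    return [
        {rename.get(name, name): value
         for name, value in row.items() if name not in drop}
        for row in rows
    ]


# Hypothetical input: drop a scratch field and give another a clearer name.
rows = [{"cust_id": 1, "tmp": "x", "amt": 9.5}]
filtered = filter_fields(rows, drop=("tmp",), rename={"amt": "amount"})
print(filtered)
```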
Derive: Modifies data values or creates new fields
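Creating a new field from existing ones can be pictured with a short Python sketch. The helper `derive`, the field names, and the formula are assumptions for illustration only.

```python
def derive(rows, name, formula):
    """Return new records with an extra field `name`,
    computed by applying `formula` to each record."""
    return [{**row, name: formula(row)} for row in rows]


# Hypothetical order lines: derive a total from quantity and unit price.
orders = [
    {"quantity": 3, "unit_price": 20.0},
    {"quantity": 1, "unit_price": 5.5},
]
with_total = derive(orders, "total",
                    lambda row: row["quantity"] * row["unit_price"])
print([row["total"] for row in with_total])
```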
Reclassify
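Reclassifying transforms one set of categorical values into another, for example by collapsing many categories into a few. A minimal Python sketch of the idea follows. The helper `reclassify`, the field name, and the mapping are hypothetical, and keeping unmapped values unchanged is a design choice made for this example.

```python
def reclassify(rows, field, mapping):
    """Return new records in which the values of `field` are replaced
    according to `mapping`. Values not in the mapping are kept as they are."""
    return [
        {**row, field: mapping.get(row[field], row[field])}
        for row in rows
    ]


# Hypothetical example: collapse cities into broader regions.
stores = [{"city": "Taipei"}, {"city": "Kaohsiung"}, {"city": "Tainan"}, {"city": "Hualien"}]
regions = {"Taipei": "North", "Kaohsiung": "South", "Tainan": "South"}
reclassified = reclassify(stores, "city", regions)
print([row["city"] for row in reclassified])
```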
Live Demo & Exercise II
Thank You
Trugarez (Breton), Merci (French), Gracias (Spanish), Grazie (Italian), Obrigado (Brazilian Portuguese), go raibh maith agat (Gaelic), Dankon (Esperanto), Tack så mycket (Swedish), Tak (Danish), Danke (German), Thank You (English), Dank u (Dutch), Dekujeme Vam (Czech)