Supervised and Unsupervised Learning

1 Supervised and Unsupervised Learning. Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology. 1/7/2009

2 Data Mining (KDD) Process: Determine Business Objectives; Data Preparation; Mining & Modeling; Consolidation and Application.

3 Data Mining and Modeling

4 Data Mining & Modeling (flowchart). Boxes: Start; Choose Models; Consider Alternate Models; Collect more data; Sample Data; Train Data; Validation Data; Test Data (Evaluation Data); Build/Fit Model; Refine/Tune Model (model size & diagnosis); Evaluate Model (e.g. prediction error); Score Data; Prediction; Make Decisions. Decision point: Meet accuracy requirement? (YES / NO).
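The "Sample Data" step of the flowchart can be sketched in Python as a random three-way split. This is only an illustration: the function name `split_data`, the 60/20/20 fractions and the fixed seed are assumptions, not taken from the slides.

```python
import random

def split_data(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation and test lists."""
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                    # used to build/fit the model
    valid = shuffled[n_train:n_train + n_valid]   # used to refine/tune the model
    test = shuffled[n_train + n_valid:]           # held out to evaluate the model
    return train, valid, test

train, valid, test = split_data(range(100))
```

The test part is touched only once, after tuning, so that the reported prediction error is not biased by the tuning itself.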

5 Data Description & Visualization. Descriptive statistical measures: central tendency/location; dispersion/spread; shape & symmetry. Class characterization and comparisons: analytical characterization; attribute relevance analysis; class discrimination and comparisons. Data visualization: scatter-plot matrix & density plot; 3-D stereoscopic scatter-plot; parallel coordinate plot.

6 Supervised & Unsupervised Learning. Supervised learning: learning with a teacher. Classification, e.g. online shoppers (buyers vs. non-buyers). Unsupervised learning: learning without a teacher. Clustering, e.g. online shoppers (segmentation of non-buyers). Other related terms: Machine Learning (analogies to human learning); Neural Networks (biological analogies to the brain).

7 Supervised Learning. Inputs (predictors, independent variables, x): a set of variables which are measured or preset. Outputs (responses, dependent variables, y): a set of measurable variables which are influenced by the inputs. Steps: (1) Establish models/systems (ŷ) based on collected inputs and outputs (x and y). (2) Predict the values of outputs based on the established models/systems and a new set of specified inputs.
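The two steps on this slide (establish a model from collected inputs and outputs, then predict for new inputs) can be sketched with a one-input least-squares line. The names `fit_line` and `predict` and the toy data are assumptions for illustration; inputs are written as `xs` and outputs as `ys`.

```python
def fit_line(xs, ys):
    """Step 1: establish a model y_hat = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x_new):
    """Step 2: predict the output for a new specified input."""
    intercept, slope = model
    return intercept + slope * x_new

# Toy data generated from y = 1 + 2x (hypothetical values).
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
y_hat = predict(model, 5)
```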

8 Supervised Learning: learning with a teacher (generalization). Student presents an answer ŷ_i given x_i. Teacher provides the correct answer y_i or an error for the student's answer. The result is characterized by some loss function L(y, ŷ). Objective: minimize the expected loss. Function approximation: Y = f(x, ε).
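As a small sketch of the loss idea on this slide, the average loss over observed answers can stand in for the expected loss. Squared error is an assumed choice of L(y, ŷ), and the function names are hypothetical.

```python
def squared_error(y, y_hat):
    """One common choice of loss L(y, y_hat); assumed here for illustration."""
    return (y - y_hat) ** 2

def average_loss(ys, y_hats, loss=squared_error):
    """Average loss over the sample, an estimate of the expected loss."""
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)

# The "teacher" holds the correct answers, the "student" supplies predictions.
correct = [1.0, 2.0, 3.0]
answers = [1.0, 2.0, 5.0]
risk = average_loss(correct, answers)
```

A different loss, for example a 0/1 loss for classification, can be passed through the `loss` argument without changing the averaging step.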

9 Problems in Supervised Learning (Application/Problem Oriented). Classification problem: output is categorical/qualitative. Prediction (regression) problem: output is continuous/quantitative. Forecasting problem: output in future domain.

10 Supervised Learning Methods (by type of output Y given inputs X). Y continuous: Prediction (Regression). Y categorical: Classification.

11 Statistical Problems and Decision Theory

12 Formulation of Statistical Problems: Estimation (Point and Interval); Hypothesis Testing; Ranking and Selection; Prediction and Forecasting; Decision Making; etc.

13 Statistical Decision Theory

14 Statistical Decision Theory

15 Statistical Decision Theory: Least Squares Estimation

16 Statistical Decision Theory: Classification & Bayes Classifier

17 Statistical Decision Theory: Classification & Bayes Classifier. Bayes classifier: choose the class with maximum probability.
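The rule "choose the class with maximum probability" can be sketched as follows, assuming the class priors and the likelihood of the observed input under each class are already known. The function name, the class labels and the numbers are hypothetical.

```python
def bayes_classify(priors, likelihoods):
    """Pick the class with the largest posterior probability.

    priors:      {class: P(class)}
    likelihoods: {class: P(x | class)} for the one observed input x
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())                  # P(x)
    posteriors = {c: joint[c] / evidence for c in joint}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Hypothetical online-shopper example: few visitors buy, but the observed
# behavior x is much more likely among buyers.
best, posteriors = bayes_classify(
    priors={"buyer": 0.2, "non-buyer": 0.8},
    likelihoods={"buyer": 0.9, "non-buyer": 0.1},
)
```

In practice the priors and likelihoods are not known and have to be estimated from training data, which is where the specific classification models differ.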

18 Model Complexity and Prediction/Classification Error

19 Datasets. Training set: dataset used for creating classifiers. Testing set: dataset used for validating the classifier obtained from the training set.

20 Classification Example: Linear Regression Method for Classification

22 Classification Example: Nearest Neighbor Method for Classification
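A minimal sketch of nearest-neighbor classification, assuming Euclidean distance and a majority vote among the k closest training points. The function name `knn_classify`, the choice k = 3 and the toy points are assumptions; the slide's own example data are not preserved in this transcription.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy classes in the plane (hypothetical data).
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
label_near_a = knn_classify(points, labels, (2, 2))
label_near_b = knn_classify(points, labels, (6, 5))
```

Small k gives a flexible, high-complexity classifier and large k a smoother one, which ties this method to the model complexity discussion in the deck.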

26 Prediction or Classification Error (figure). Axes: prediction error against model complexity (low to high). Curves: training error; test error. Annotation: overfitting.

27 Training Error, Cross-Validation Error, Testing Error (diagram). Labels: testing data; training data; cross-validation K...; fitted model using training data; training error based on training data; testing error based on testing data.

28 Cross-Validation Method (diagram showing the 1st round, the 2nd round, and later rounds)
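The rounds of cross-validation can be sketched as below, assuming the usual K-fold scheme in which each fold is held out once for validation. The helper names and the deliberately simple "predict the training mean" model are assumptions for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, valid_indices) for each of the k rounds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, valid

def cross_validation_error(ys, k):
    """Cross-validated squared error of a model that predicts the training mean."""
    total = 0.0
    for train, valid in k_fold_indices(len(ys), k):
        mean = sum(ys[j] for j in train) / len(train)    # fit on the other folds
        total += sum((ys[j] - mean) ** 2 for j in valid)  # score on the held-out fold
    return total / len(ys)

rounds = list(k_fold_indices(10, 5))
cv_error = cross_validation_error([1.0, 3.0, 2.0, 4.0, 3.0, 5.0], k=3)
```

Every observation is used for validation exactly once, so the estimate uses all of the training data without ever scoring a point with a model that saw it.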

29 Models for Supervised Learning (Methodology/Model Oriented). Regression type models: linear models, GLM, logistic regression; generalized additive models (Hastie & Tibshirani, 1990); Classification and Regression Trees (CART) (Breiman, Friedman, Olshen, Stone, 1984); Multivariate Adaptive Regression Splines (MARS); Multiple Additive Regression Trees (MART); neural networks.

30 Models for Supervised Learning. Segmentation type models: support vector machines (SVM); generalized linear discriminant analysis (DA): flexible DA, penalized DA, mixture DA; k-nearest neighbors (k-NN), adaptive k-NN; Bayesian classification; genetic algorithms; fuzzy set classification; Classification and Regression Trees (CART).

31 Supervised Learning. It is not clear how to categorize the regression and segmentation type models. Most regression models can be used for both classification and prediction (regression) problems. Segmentation models can also be useful for regression problems, e.g. regression trees, SVR. Computer scientists focus on problems while statisticians focus on models (algorithms vs. models, e.g. boosting vs. MART).

32 Some Characteristics of Different Learning Methods (Hastie et al.). Rating scale: good, fair, poor (the rating symbols are not preserved in this transcription). Methods compared: Neural Net; SVM; Trees; MARS; K-NN, Kernel; MART. Characteristics rated: natural handling of data of mixed type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; predictive power.

33 Unsupervised Learning: learning without a teacher. Statistical definition: observe N vectors from the population distribution and directly infer properties (e.g. relationships, groupings) of that distribution. The dimension of the observations (number of variables or attributes) is often very high (much higher than in supervised learning). There is no clear measure of success; success is often judged (subjectively) by the value of the discovered knowledge or the effectiveness of the algorithm.

34 Problems in Unsupervised Learning. Association rules: single-dimensional association rules from transaction databases; multi-level association rules from transaction databases; multi-dimensional association rules from relational databases and data warehouses. Correlation analysis.

35 Association Rules: Examples. Multi-level association rule: Computer: desktop (IBM, Dell), laptop (Toshiba, Sony). Software: educational (Microsoft, ...), financial (...). Printer: color (HP, Epson), B/W (HP, Sony). Rule, e.g.: {IBM desktop computer} => {B/W printer}. Multi-dimensional association rule: buys(X, "IBM desktop computer") => buys(X, "Sony B/W printer"); age(X, "20 to 29") & occupation(X, "students") => buys(X, "laptops").
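A rule such as {IBM desktop computer} => {B/W printer} is usually judged by its support and confidence. These two measures are standard in association rule mining but are not named on the slide, and the transactions below are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"IBM desktop", "B/W printer"},
    {"IBM desktop", "B/W printer", "software"},
    {"IBM desktop", "color printer"},
    {"laptop", "color printer"},
    {"laptop"},
]
rule_support = support(transactions, {"IBM desktop", "B/W printer"})
rule_confidence = confidence(transactions, {"IBM desktop"}, {"B/W printer"})
```

Here the rule holds in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets with an IBM desktop also contain a B/W printer (confidence about 0.67).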

36 Algorithms in Unsupervised Learning: Clustering. Partitioning methods: K-means, K-medoids. Hierarchical methods: BIRCH, CURE, CHAMELEON algorithms. Density-based methods: DBSCAN, OPTICS, DENCLUE. Grid-based methods: STING, WaveCluster, CLIQUE. Model-based clustering: COBWEB (tree model), neural network models.
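Of the partitioning methods listed, K-means is the simplest to sketch: alternate between assigning points to the nearest center and moving each center to the mean of its points. The version below assumes Euclidean distance, user-supplied starting centers and a fixed number of iterations; all names and data are illustrative.

```python
import math

def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, centers, n_iter=10):
    """Run n_iter rounds of assignment and center update."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        clusters = assign(points, centers)
        centers = [
            # New center is the coordinate-wise mean; an empty cluster keeps its center.
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, assign(points, centers)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
```

The result depends on the starting centers, so in practice the algorithm is often rerun from several random starts.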

37 Models for Unsupervised Learning. Association rules: market basket analysis; generalized association rules. Cluster analysis: K-means algorithms; clustering algorithms; combinatorial algorithms. Other multivariate methods: principal components; factor analysis and latent variables; projection pursuit; multi-dimensional scaling.

38 Application Examples

39 Classification: learn a method for predicting the instance class from pre-labeled (classified) instances. Classification models.

40 Clustering: find a natural grouping of the data, given unlabeled data.

41 Classification Problem (figure). Stages: sensing; feature extraction (width, length, lightness, etc.); classification. Plot of width against lightness for bass and salmon.

42 Index Problem (Query by Content). Given a functional data query and some similarity measure (e.g. Euclidean distance), find the nearest matching functional data in the database. Figure: query Q (template) compared with database C; C6 is the best match.
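A brute-force sketch of query by content, assuming every signal is sampled at the same points so that Euclidean distance applies directly. The function name `best_match` and the tiny database are hypothetical.

```python
import math

def best_match(query, database):
    """Return (index, distance) of the database signal closest to the query."""
    distances = [math.dist(query, signal) for signal in database]
    index = min(range(len(database)), key=distances.__getitem__)
    return index, distances[index]

# Hypothetical database of three short signals and one query template.
database = [
    [0, 0, 0, 0],
    [1, 2, 3, 4],
    [4, 3, 2, 1],
]
index, distance = best_match([1, 2, 3, 5], database)
```

This linear scan compares the query with every stored signal; large databases typically need an index structure or a reduced representation to avoid that cost.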

43 Clustering Signals: find natural groups (clusters) of the functional data in the database.

44 Faulty Signal Detection: given a faulty signal from the monitoring procedure, how should it be classified into one of the known classes? (Figure: example signals for the known fault classes.)

45 Handwriting Recognition

46 Bioinformatics: Microarray. Microarray (e.g. 50,000 spots), from normal and from disease samples.

47 Bioinformatics: Microarray. Clustering problem: partition the genes into groups or clusters based on their expression patterns.

48 Bioinformatics: Gene Finding. Input: a DNA string (nucleotides) over the alphabet {A, C, G, T}. Output: an annotation of the string showing for every nucleotide whether it is coding (gene) or non-coding. Example: the string AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG is passed to a gene finder, which returns the same string with the gene region marked.

49 Bioinformatics: Protein Structure Prediction. Amino acid sequence; secondary structure; 3D coordinates of atoms.

50 Bioinformatics: Metabolomics. Finding inherent metabolic patterns in response to pathophysiological stimuli or genetic modification. (Figure: principal component score plots with axes PC1 and further PCs, and a chemical shift (ppm) axis; samples are labeled by collection time and grouped as morning (7:30-12:30), afternoon/evening (13:30-22:30) and night (23:30-6:30).)

51 Industrial Projects: AT&T business data mining; inventory management in military maintenance; sea cargo demand forecasting; SMARTRAQ project in transportation policies; location problem of letter boxes; home improvement store shrinkage analysis; hotels & resorts chain data mining; fast food drive-through call center.

52 Data Mining in Telecom. (funded AT&T project). An industry of about 160 billion dollars per year (about 70 billion long distance and about 90 billion local). More than 100 million customers/accounts/lines. More than 1 billion phone calls per day. Book closing (estimating this month's price/usage/revenue). Budgeting (forecasting next year's price/usage/revenue). Segmentation (clustering of usage, growth, ...). Cross-selling (association rules). Churn (disconnect prediction & tracking). Fraud (detection of unusual usage time series behavior). Each of these problems is worth hundreds of millions of dollars.

53 Inventory Management in Air Force (funded project). A contractor manages parts inventory for aircraft maintenance. Characterization and forecasting of demand and lead time distributions. 60,000 different parts and 500 bench locations. Data tracked by an automated system. Demand data not available & stockout penalty.

54 Data Mining in Sea Cargo Application (funded TLIAP project). Sea cargo network optimization. Contract planning & booking control. Characterize & forecast sea cargo demand distribution & cost structure. Improve ocean carrier and terminal operation efficiency.

55 SMARTRAQ Project for Transportation Policies. Strategies for Metropolitan Atlanta's Regional Transportation & Air Quality. Five-year project sponsored by Transportation Dept., Federal Highway Admin., EPA, CDC, etc. Assess air quality, travel behavior, land use & transportation policies. Reduce auto-dependence and vehicle emissions. Highway design based on detailed GPS data.

56 Mining of Letter Box Transaction Data. Improve performance of express mail drop-off letter boxes. 50,000 letter boxes & 8 months of transaction data. Relate performance to important factors, e.g. region, demographics, adjacent competition, pick-up schedule. Comparison with direct competitors. Customer demand analysis and forecast.

57 Data Mining for Shrinkage Analysis in Retail Industry. Inventory shrinkage costs US retailers 32 billion. Shrinkage = book inventory - inventory on hand. Working with a home improvement store's Loss Prevention Group. Develop a predictive model to relate shrinkage to important variables. Extract hidden knowledge to reduce loss and improve operating efficiency.

58 Data Mining for Hotels and Resorts Chain. Business: manage chain hotels and resorts of different scales. Evaluate impact of promotional programs. Forecast customer behavior in the frequent stay program. Monitor performance in customer surveys. Predict performance from important factors.

59 Data Mining and Forecasting for Fast Food Call Center. Centralized call center for drive-through orders operated by an independent company. Profit = revenue (fixed rate per call) - cost (number of operators). Constraint: 3-second response time and 20-second route back to store. Objective: reduce cost by optimizing operation time and scheduling. Tools: data mining & forecasting, simulation, optimization.

60 A General Framework for Dynamic Modeling & Activity Monitoring (DMDA). Problem: profile data (time domain profile; profile with controllable predictors; profile with uncontrollable predictors). Objective: detection/classification; interpretation; forecasting/prediction. Segmentation & model selection: segmentation (known; unknown); model selection (global without segmentation; global with segmentation; local within segment). Monitoring: Phase I: estimating unknown parameters; Phase II: monitoring and detecting. Dynamic update: anticipated drifts vs. unanticipated changes. Actions.

61 Applications. Manufacturing processes: stamping tonnage signal data (functional data); mass flow controller (MFC) calibration (linear profile); vertical density profile (VDP) data (nonlinear profile). Service operations: telecom. customer usage; sea cargo terminal operation; used car price mining and prediction; hotel performance monitoring; fast food drive-through call center forecasting & scheduling.

62 Manufacturing: Stamping Tonnage Signal Data. Figure 2: A Tonnage Signal and Some Possible Faults (Jin and Shi 1999).

63 Stamping Tonnage Signal Data. Problem: time domain profile (a tonnage signal represents the stamping force in a process cycle). Objective: fault detection and classification. Segmentation & model selection: known segmentation: most process faults occur only in specific working stages; boundaries and sizes of segments are determined by process knowledge (Jin and Shi 1999). Global model: wavelet transforms. Monitoring: for each segment, use T² charts based on selected wavelet coefficients to conduct monitoring (Jin and Shi 2001). Dynamic update: classify a signal as normal, a known fault, or a new fault (abnormal), and update the wavelet coefficient selection and parameter estimates (e.g. μ, etc.) using all available data. Actions: identify and remove assignable causes.
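The T² monitoring step can be sketched for two monitored features, for example two selected coefficients. This is a simplified illustration and not the procedure of Jin and Shi: the mean and covariance are estimated from in-control reference data, and the statistic T² = (x - μ)' S⁻¹ (x - μ) is computed for a new observation. The function names and data are hypothetical, and the control limit (normally taken from an F distribution) is left to the user.

```python
def mean_vector(rows):
    """Column means of the reference (in-control) observations."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_2d(rows, mu):
    """Sample variances and covariance of two features: (sxx, sxy, syy)."""
    n = len(rows)
    sxx = sum((r[0] - mu[0]) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - mu[1]) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mu[0]) * (r[1] - mu[1]) for r in rows) / (n - 1)
    return sxx, sxy, syy

def t2_statistic(x, mu, cov):
    """Hotelling T^2 for a 2-D observation, using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mu = mean_vector(reference)
cov = covariance_2d(reference, mu)
t2_new = t2_statistic((3.0, 1.0), mu, cov)
```

A signal would be flagged when `t2_new` exceeds the chosen control limit; with more than two features the same quadratic form needs a general matrix inverse.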

64 Telecom. Customer Usage. Problem: profile with uncontrollable predictors. Objective: abnormal behavior detection and classification; forecasting/prediction. Segmentation & model selection: unknown segmentation: segment customers based on demographic, geographic, psychographic and/or behavioral information. Local model: fit a model for each customer segment, e.g. linear regression. Monitoring: use the model built for each segment to monitor customer behavior, e.g. monitor the linear regression parameter vector β using a T² chart. Dynamic update: update customer segmentation, segmental model fitting and/or parameter monitoring, e.g. parameter updates based on a known trend. Actions: service improvement, customer approval, etc.

65 Telecom. Customer Usage. Profile: profile with uncontrollable predictors. Objective: abnormal behavior detection and classification; forecasting/prediction. Segmentation: unknown (segments are defined by customer information). Model selection: segmental (e.g. linear regression on uncontrollable predictors for each segment). Monitoring: Phase I: unknown control chart parameters estimated from data; Phase II: monitoring by control charts, such as the T² chart, EWMA chart, etc. Dynamic update: update segmentation, model selection and/or parameter monitoring. Actions: service improvement, customer approval, etc.
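Phase II monitoring with an EWMA chart can be sketched as below, assuming the in-control mean `mu0` and standard deviation `sigma` were already estimated in Phase I. The code uses the textbook recursion z_t = λ x_t + (1 - λ) z_{t-1} with time-varying limits; λ = 0.2, a limit width of 3 and the usage values are illustrative assumptions.

```python
import math

def ewma_chart(xs, mu0, sigma, lam=0.2, width=3.0):
    """Return a list of (ewma_value, out_of_control) pairs, one per observation."""
    z = mu0  # the EWMA starts at the in-control mean
    results = []
    for t, x in enumerate(xs, start=1):
        z = lam * x + (1 - lam) * z
        # Half-width of the control limits at time t.
        half = width * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        results.append((z, abs(z - mu0) > half))
    return results

# Hypothetical usage series: in control at 10, then a sustained shift to 14.
results = ewma_chart([10, 10, 10, 14, 14, 14, 14], mu0=10.0, sigma=1.0)
flags = [out for _, out in results]
```

In this example the chart stays quiet for the first shifted observation and signals on the second, which reflects how the EWMA accumulates evidence of a sustained shift.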

66 Conclusions: Data Mining. Subject matter experts: support data mining by producing data, business problems, and software & hardware equipment for testing and implementing results. Statisticians: support data mining by mathematical theory and statistical methods. Computer scientists: support data mining by computational algorithms and relevant software.

67 Conclusions. It is not hard to obtain interesting and useful knowledge from data mining. The challenge is to transform and implement that knowledge for business decision making. Issues involved: internal collaboration efforts (sales versus marketing), external collaboration efforts (competitors within the industry), privacy protection.