Clustering. CS4780 Machine Learning, Fall 2009. Thorsten Joachims, Cornell University. Reading: Manning/Schuetze Chapter 14 (not 14.1.3, 14.1.4).

Transcription:

Clustering
CS4780 Machine Learning, Fall 2009
Thorsten Joachims, Cornell University
Reading: Manning/Schuetze Chapter 14 (not 14.1.3, 14.1.4)
Based on slides from Prof. Claire Cardie, Prof. Ray Mooney, Prof. Yiming Yang

Outline
Supervised vs. Unsupervised Learning
Hierarchical Clustering
  Hierarchical Agglomerative Clustering (HAC)
Non-Hierarchical Clustering
  K-means
  EM-Algorithm

Supervised vs. Unsupervised Learning
Supervised Learning
  Classification: partition examples into groups according to pre-defined categories
  Regression: assign value to feature vectors
  Requires labeled data for training
Unsupervised Learning
  Clustering: partition examples into groups when no pre-defined categories/classes are available
  Novelty detection: find changes in data
  Outlier detection: find unusual events (e.g. hackers)
  Only instances required, but no labels

Clustering
Partition unlabeled examples into disjoint subsets of clusters, such that:
  Examples within a cluster are similar
  Examples in different clusters are different
Discover new categories in an unsupervised manner (no sample category labels provided).

Applications of Clustering
Cluster retrieved documents (e.g. Teoma) to present more organized and understandable results to user
Detecting near duplicates
  Entity resolution, e.g. "Thorsten Joachims" == "Thorsten B Joachims"
  Cheating detection
Exploratory data analysis
Automated (or semi-automated) creation of taxonomies, e.g. Yahoo-style
Compression

Clustering Example [figure]

Similarity (Distance) Measures
Euclidean distance ($L_2$ norm): $L_2(x, x') = \sqrt{\sum_{i=1}^{m} (x_i - x'_i)^2}$
$L_1$ norm: $L_1(x, x') = \sum_{i=1}^{m} |x_i - x'_i|$
Cosine similarity: $\cos(x, x') = \frac{x \cdot x'}{\|x\| \, \|x'\|}$
Kernels
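
The three explicit measures on this slide can be sketched in a few lines. This is a minimal illustration, assuming vectors are plain Python sequences of equal length; the function names are hypothetical and not taken from the lecture.

```python
import math

def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Note that the first two are distances (smaller means more alike) while the cosine is a similarity (larger means more alike), so a clustering routine has to be told which convention it is given.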

Hierarchical Clustering
Build a tree-based hierarchical taxonomy from a set of unlabeled examples.
  animal
    vertebrate: fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean
Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering
Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
Divisive (top-down) methods separate all examples immediately into clusters.
  animal
    vertebrate: fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean

Hierarchical Agglomerative Clustering (HAC)
Assumes a similarity function for determining the similarity of two clusters.
Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
The history of merging forms a binary tree or hierarchy.
Basic algorithm:
  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, $c_i$ and $c_j$, that are most similar.
    Replace $c_i$ and $c_j$ with a single cluster $c_i \cup c_j$.
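
The basic algorithm above can be sketched directly. This is a deliberately naive version, assuming a user-supplied pairwise similarity function `sim` and a `linkage` argument (`max` for single link, `min` for complete link); both names are illustrative. It recomputes every cluster similarity from scratch in each round, so it is much slower than the O(n^2) bookkeeping the lecture discusses for HAC.

```python
def hac(instances, sim, linkage=max):
    """Naive hierarchical agglomerative clustering.

    instances: list of items.
    sim:       function(item, item) -> similarity (larger = more similar).
    linkage:   max for single link, min for complete link.
    Returns the merge history as a list of (cluster_a, cluster_b) pairs,
    where each cluster is a tuple of instance indices.
    """
    clusters = [(i,) for i in range(len(instances))]
    history = []

    def cluster_sim(a, b):
        return linkage(sim(instances[i], instances[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = cluster_sim(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        _, p, q = best
        a, b = clusters[p], clusters[q]
        history.append((a, b))
        # Replace the two clusters by their union (delete q first, since q > p).
        del clusters[q]
        del clusters[p]
        clusters.append(a + b)
    return history
```

The returned history is the binary merge tree in bottom-up order; for example, `hac([0.0, 1.0, 10.0, 11.0, 30.0], lambda a, b: -abs(a - b))` merges the two nearby pairs first and attaches the isolated point last.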

Cluster Similarity
How to compute similarity of two clusters each possibly containing multiple instances?
  Single link: Similarity of two most similar members.
  Complete link: Similarity of two least similar members.
  Group average: Average similarity between members.

Single-Link Agglomerative Clustering
When computing cluster similarity, use maximum similarity of pairs:
  $sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$
Can result in "straggly" (long and thin) clusters due to chaining effect.

Single Link Example [figure with labels 1, 2, 5, 6, 3, 4, 7, 8]

Complete Link Agglomerative Clustering
When computing cluster similarity, use minimum similarity of pairs:
  $sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
Makes more "tight", spherical clusters.

Complete Link Example [figure with labels 1, 2, 5, 6, 3, 4, 7, 8]

Computational Complexity of HAC
In the first iteration, all HAC methods need to compute the similarity of all pairs of $n$ individual instances, which is $O(n^2)$.
In each of the subsequent $n-2$ merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain the similarity matrix in $O(n^2)$ overall, computing the similarity to any other cluster must each be done in constant time.
Maintain e.g. a heap to find the smallest pair.

Computing Cluster Similarity
After merging $c_i$ and $c_j$, the similarity of the resulting cluster to any other cluster, $c_k$, can be computed by:
  Single link: $sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$
  Complete link: $sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
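
The update rule can be illustrated with a small similarity table. The sketch below assumes the table is a dict keyed by `frozenset` pairs of cluster ids; that representation and the name `merge_update` are choices made for this example, not something the slides specify.

```python
def merge_update(sim, i, j, new, others, linkage=max):
    """Update a cluster-similarity table after merging clusters i and j.

    sim:     dict mapping frozenset({a, b}) of cluster ids to a similarity.
    new:     id of the merged cluster (i union j).
    others:  ids of all remaining clusters (excluding i and j).
    linkage: max for single link, min for complete link.
    Each new entry is computed in O(1) from two old entries.
    """
    for k in others:
        sim[frozenset((new, k))] = linkage(sim[frozenset((i, k))],
                                           sim[frozenset((j, k))])
        # The rows for the two old clusters are no longer needed.
        del sim[frozenset((i, k))], sim[frozenset((j, k))]
    del sim[frozenset((i, j))]
    return sim
```

Because each entry costs constant time, one merge costs time linear in the number of remaining clusters, which is what keeps the total matrix maintenance at O(n^2).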

Group Average Agglomerative Clustering
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
  $sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} sim(x, y)$
Compromise between single and complete link.

Computing Group Average Similarity
Assume cosine similarity and normalized vectors with unit length.
Always maintain sum of vectors in each cluster:
  $s(c_j) = \sum_{x \in c_j} x$
Compute similarity of clusters in constant time:
  $sim(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}$
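
To see why the sum-vector formula works, it can be checked against the brute-force average over all ordered pairs. The sketch below assumes unit-length vectors stored as Python lists; all function names are illustrative. "Constant time" here means independent of the cluster sizes; the cost is still linear in the vector dimensionality.

```python
import math

def normalize(x):
    """Scale x to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def vector_sum(vectors):
    """Component-wise sum s(c) of the vectors in a cluster."""
    return [sum(col) for col in zip(*vectors)]

def group_average_fast(sum_i, size_i, sum_j, size_j):
    """Group-average similarity from the maintained sums and sizes."""
    s = [a + b for a, b in zip(sum_i, sum_j)]
    n = size_i + size_j
    # s.s counts every x.x once (each equals 1 for unit vectors), so subtract n.
    return (dot(s, s) - n) / (n * (n - 1))

def group_average_brute(vectors):
    """Average cosine similarity over all ordered pairs x != y."""
    n = len(vectors)
    total = sum(dot(x, y)
                for a, x in enumerate(vectors)
                for b, y in enumerate(vectors) if a != b)
    return total / (n * (n - 1))
```

For two clusters `ci` and `cj`, `group_average_fast(vector_sum(ci), len(ci), vector_sum(cj), len(cj))` agrees with `group_average_brute(ci + cj)` up to floating-point rounding.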

Non-Hierarchical Clustering
Single-pass clustering
K-means clustering ("hard")
Expectation maximization ("soft")

Clustering Criterion
Evaluation function that assigns a (usually real-valued) value to a clustering
  Clustering criterion typically function of within-cluster similarity and between-cluster dissimilarity
Optimization
  Find clustering that maximizes the criterion
  Global optimization (often intractable)
  Greedy search
  Approximation algorithms

Centroid-Based Clustering
Assumes instances are real-valued vectors.
Clusters represented via centroids (i.e. mean of points in a cluster) $c$:
  $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm
Input: $k$ = number of clusters, distance measure $d$
Select $k$ random instances $\{s_1, s_2, \ldots, s_k\}$ as seeds.
Until clustering converges or other stopping criterion:
  For each instance $x_i$:
    Assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is min.
  For each cluster $c_j$ // update the centroid of each cluster
    $s_j = \mu(c_j)$
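
A minimal sketch of the loop above, assuming Euclidean distance as the measure d, points given as tuples of numbers, and convergence defined as "no assignment changed". The `max_iters` and `seed` parameters, and the choice to leave an empty cluster's centroid unchanged, are assumptions of this example rather than part of the slide.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Select k random instances as seeds.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no instance changed cluster
        assignment = new_assignment
        # Update each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one centroid near each group and an assignment list with one cluster index per point.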

K-means Example (k=2) [figure sequence]
Pick seeds. Reassign clusters. Compute centroids. Reassign clusters. Compute centroids. Reassign clusters. Converged!

Time Complexity
Assume computing distance between two instances is $O(m)$ where $m$ is the dimensionality of the vectors.
Reassigning clusters for $n$ points: $O(kn)$ distance computations, or $O(knm)$.
Computing centroids: Each instance gets added once to some centroid: $O(nm)$.
Assume these two steps are each done once for $i$ iterations: $O(iknm)$.
Linear in all relevant factors, assuming a fixed number of iterations; more efficient than HAC.

Buckshot Algorithm
Problem
  Results can vary based on random seed selection, especially for high-dimensional data.
  Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
Idea: Combine HAC and K-means clustering.
  First randomly take a sample of instances of size $\sqrt{n}$.
  Run group-average HAC on this sample.
  Use the results of HAC as initial seeds for K-means.
Overall algorithm is efficient and avoids problems of bad seed selection.
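
The three steps can be sketched end to end. This is an illustration under several assumptions that the slide does not fix: similarity is taken to be negative Euclidean distance, HAC is stopped when k clusters remain, the sample size is raised to at least k, and the seeds are the centroids of the HAC clusters. All function names are hypothetical, and the HAC here is the naive version, not an optimized one.

```python
import math
import random
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def group_average_hac(points, k):
    """Merge until k clusters remain. A candidate merge is scored by the
    average pairwise similarity (negative distance) within the merged cluster."""
    clusters = [[p] for p in points]

    def score(a, b):
        pairs = list(combinations(a + b, 2))
        return -sum(euclidean(x, y) for x, y in pairs) / len(pairs)

    while len(clusters) > k:
        p, q = max(combinations(range(len(clusters)), 2),
                   key=lambda pq: score(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]  # p < q by construction
        del clusters[q]
    return clusters

def kmeans_from_seeds(points, seeds, max_iters=100):
    """K-means started from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iters):
        new = [min(range(len(centroids)),
                   key=lambda j: euclidean(p, centroids[j])) for p in points]
        if new == assignment:
            break
        assignment = new
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = centroid(members)
    return centroids, assignment

def buckshot(points, k, seed=0):
    rng = random.Random(seed)
    # Sample of size sqrt(n); assumption: at least k so HAC can return k clusters.
    sample_size = max(k, math.isqrt(len(points)))
    sample = rng.sample(points, sample_size)
    seeds = [centroid(c) for c in group_average_hac(sample, k)]
    return kmeans_from_seeds(points, seeds)
```

Because HAC runs only on about sqrt(n) instances, even a quadratic-time HAC on the sample costs roughly O(n), which is how the combination stays efficient while giving K-means better starting points than purely random seeds.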