Clustering
  CS4780 Machine Learning, Fall 2009
  Thorsten Joachims, Cornell University
  Reading: Manning/Schuetze Chapter 14 (not 14.1.3, 14.1.4)
  Based on slides from Prof. Claire Cardie, Prof. Ray Mooney, Prof. Yiming Yang
Outline
  Supervised vs. Unsupervised Learning
  Hierarchical Clustering
    Hierarchical Agglomerative Clustering (HAC)
  Non-Hierarchical Clustering
    K-means
    EM-Algorithm
Supervised vs. Unsupervised Learning
  Supervised Learning
    Classification: partition examples into groups according to pre-defined categories
    Regression: assign value to feature vectors
    Requires labeled data for training
  Unsupervised Learning
    Clustering: partition examples into groups when no pre-defined categories/classes are available
    Novelty detection: find changes in data
    Outlier detection: find unusual events (e.g. hackers)
    Only instances required, but no labels
Clustering
  Partition unlabeled examples into disjoint subsets (clusters), such that:
    Examples within a cluster are similar
    Examples in different clusters are different
  Discover new categories in an unsupervised manner (no sample category labels provided).
Applications of Clustering
  Cluster retrieved documents (e.g. Teoma) to present more organized and understandable results to user
  Detecting near duplicates
  Entity resolution, e.g. "Thorsten Joachims" == "Thorsten B Joachims"
  Cheating detection
  Exploratory data analysis
  Automated (or semi-automated) creation of taxonomies, e.g. Yahoo-style
  Compression
Clustering Example
  [figure]
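Similarity (Distance) Measures
  Euclidean distance ($L_2$ norm): $L_2(\vec{x}, \vec{x}') = \sqrt{\sum_{i=1}^{m} (x_i - x'_i)^2}$
  $L_1$ norm: $L_1(\vec{x}, \vec{x}') = \sum_{i=1}^{m} |x_i - x'_i|$
  Cosine similarity: $\cos(\vec{x}, \vec{x}') = \frac{\vec{x} \cdot \vec{x}'}{|\vec{x}| \, |\vec{x}'|}$
  Kernels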
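The three measures on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming vectors are equal-length lists of floats; the function names `l2_distance`, `l1_distance`, and `cosine_similarity` are chosen here for illustration and do not come from the slides.

```python
import math

def l2_distance(x, y):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1_distance(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [1.0, 0.0]
y = [4.0, 4.0]
print(l2_distance(x, y))        # 5.0
print(l1_distance(x, y))        # 7.0
print(cosine_similarity(x, y))  # about 0.707, i.e. a 45 degree angle
```

Note that the first two are distances (smaller means more alike) while the cosine is a similarity (larger means more alike), so a clustering routine has to be told which direction to optimize.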
Hierarchical Clustering
  Build a tree-based hierarchical taxonomy from a set of unlabeled examples.
    animal
      vertebrate: fish, reptile, amphib., mammal
      invertebrate: worm, insect, crustacean
  Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
Agglomerative vs. Divisive Clustering
  Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
  Divisive (top-down) methods separate all examples immediately into clusters.
    animal
      vertebrate: fish, reptile, amphib., mammal
      invertebrate: worm, insect, crustacean
Hierarchical Agglomerative Clustering (HAC)
  Assumes a similarity function for determining the similarity of two clusters.
  Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
  The history of merging forms a binary tree or hierarchy.
  Basic algorithm:
    Start with all instances in their own cluster.
    Until there is only one cluster:
      Among the current clusters, determine the two clusters, $c_i$ and $c_j$, that are most similar.
      Replace $c_i$ and $c_j$ with a single cluster $c_i \cup c_j$.
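The basic algorithm above can be sketched directly. This is a deliberately naive version that recomputes every cluster similarity in every round, so it is far slower than the bookkeeping discussed later in the lecture. The names `hac` and `single_link`, the choice of single link as the default cluster similarity, and the toy 1-D data are all assumptions made for illustration.

```python
def single_link(a, b, sim):
    """Cluster similarity as the similarity of the two most similar members."""
    return max(sim(x, y) for x in a for y in b)

def hac(instances, sim, cluster_sim=single_link):
    """Naive HAC; returns the merge history as a list of (left, right) cluster pairs."""
    clusters = [(x,) for x in instances]  # every instance starts in its own cluster
    history = []
    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j], sim)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        history.append((clusters[i], clusters[j]))
        # Replace the two clusters by their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Toy 1-D data; similarity is the negated absolute difference.
points = [1.0, 2.0, 6.0, 7.0, 20.0]
merges = hac(points, lambda x, y: -abs(x - y))
for left, right in merges:
    print(left, "+", right)
```

Reading the history from first to last merge gives the binary tree bottom-up: n instances always produce n - 1 merges, and the last entry joins the two subtrees under the root.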
Cluster Similarity
  How to compute similarity of two clusters each possibly containing multiple instances?
    Single link: Similarity of two most similar members.
    Complete link: Similarity of two least similar members.
    Group average: Average similarity between members.
Single-Link Agglomerative Clustering
  When computing cluster similarity, use maximum similarity of pairs:
    $sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$
  Can result in "straggly" (long and thin) clusters due to chaining effect.
Single Link Example
  [figure; item labels: 1 2 5 6 3 4 7 8]
Complete Link Agglomerative Clustering
  When computing cluster similarity, use minimum similarity of pairs:
    $sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
  Makes more "tight", spherical clusters.
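A small made-up example may help show why the two criteria behave differently. The sketch below assumes 1-D points and a similarity equal to the negated absolute difference; the clusters `chain`, `lone`, and `pair` are invented for illustration.

```python
def single_link(a, b, sim):
    """Maximum similarity over all cross-cluster pairs."""
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b, sim):
    """Minimum similarity over all cross-cluster pairs."""
    return min(sim(x, y) for x in a for y in b)

sim = lambda x, y: -abs(x - y)

chain = [0.0, 1.0, 2.0, 3.0]  # an elongated cluster
lone = [4.0]                  # a single instance sitting next to the end of the chain
pair = [6.5, 7.0]             # a compact cluster on the other side

# Single link: 'lone' is most similar to the chain, because only the closest member counts.
print(single_link(chain, lone, sim), single_link(lone, pair, sim))      # -1.0 -2.5

# Complete link: 'lone' is most similar to the compact pair, because the farthest member counts.
print(complete_link(chain, lone, sim), complete_link(lone, pair, sim))  # -4.0 -3.0
```

Under single link the chain keeps growing one neighbor at a time, which is the chaining effect; under complete link the far end of the chain penalizes the merge, which favors compact clusters.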
Complete Link Example
  [figure; item labels: 1 2 5 6 3 4 7 8]
Computational Complexity of HAC
  In the first iteration, all HAC methods need to compute similarity of all pairs of n individual instances, which is $O(n^2)$.
  In each of the subsequent $n - 2$ merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
  In order to maintain the similarity matrix in $O(n^2)$ overall, computing the similarity to any other cluster must each be done in constant time.
  Maintain e.g. a heap to find the smallest pair.
Computing Cluster Similarity
  After merging $c_i$ and $c_j$, the similarity of the resulting cluster to any other cluster, $c_k$, can be computed by:
    Single Link: $sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$
    Complete Link: $sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
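The update rules above can be checked against a recomputation from scratch. The sketch below assumes 1-D points and a negated absolute difference as similarity; `merged_similarity` is a hypothetical helper name, and the three clusters are invented for illustration.

```python
def single_link(a, b, sim):
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b, sim):
    return min(sim(x, y) for x in a for y in b)

def merged_similarity(sim_ik, sim_jk, linkage):
    """Similarity of (c_i union c_j) to c_k, using only the two stored values."""
    if linkage == "single":
        return max(sim_ik, sim_jk)
    if linkage == "complete":
        return min(sim_ik, sim_jk)
    raise ValueError("linkage must be 'single' or 'complete'")

sim = lambda x, y: -abs(x - y)
c_i, c_j, c_k = [0.0, 1.0], [3.0], [5.0, 9.0]

for linkage, full in (("single", single_link), ("complete", complete_link)):
    # Constant-time update from the two similarities already in the matrix.
    shortcut = merged_similarity(full(c_i, c_k, sim), full(c_j, c_k, sim), linkage)
    # Full recomputation over all cross-cluster pairs, for comparison.
    from_scratch = full(c_i + c_j, c_k, sim)
    print(linkage, shortcut, from_scratch)
```

Both lines print the same value twice, which is why the similarity matrix can be updated after a merge without looking at individual instances again.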
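Group Average Agglomerative Clustering
  Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
    $sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{\vec{x} \in (c_i \cup c_j)} \; \sum_{\vec{y} \in (c_i \cup c_j):\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})$
  Compromise between single and complete link.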
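Computing Group Average Similarity
  Assume cosine similarity and normalized vectors with unit length.
  Always maintain sum of vectors in each cluster.
    $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$
  Compute similarity of clusters in constant time:
    $sim(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|) \, (|c_i| + |c_j| - 1)}$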
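The shortcut works because, for unit-length vectors, the dot product of the summed vector with itself equals the sum of all pairwise cosine similarities plus one self-similarity of 1 per instance. The sketch below compares the shortcut with the pairwise definition on a tiny invented example; all function names are assumptions made for illustration. "Constant time" here means independent of the cluster sizes; the cost still grows with the vector dimensionality.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def vector_sum(cluster):
    """Coordinate-wise sum of all vectors in a cluster: s(c)."""
    return [sum(col) for col in zip(*cluster)]

def group_average_fast(s_i, n_i, s_j, n_j):
    """Group-average similarity from the stored sum vectors and cluster sizes only."""
    s = [a + b for a, b in zip(s_i, s_j)]
    n = n_i + n_j
    return (dot(s, s) - n) / (n * (n - 1))

def group_average_brute(c_i, c_j):
    """Group-average similarity from the definition: mean over all ordered pairs x != y."""
    merged = c_i + c_j
    n = len(merged)
    total = sum(dot(x, y)
                for p, x in enumerate(merged)
                for q, y in enumerate(merged) if p != q)
    return total / (n * (n - 1))

c_i = [normalize([1.0, 0.0]), normalize([1.0, 1.0])]
c_j = [normalize([0.0, 1.0])]

fast = group_average_fast(vector_sum(c_i), len(c_i), vector_sum(c_j), len(c_j))
brute = group_average_brute(c_i, c_j)
print(fast, brute)  # the two values agree up to rounding
```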
Non-Hierarchical Clustering
  Single-pass clustering
  K-means clustering ("hard")
  Expectation maximization ("soft")
Clustering Criterion
  Evaluation function that assigns a (usually real-valued) value to a clustering
    Clustering criterion is typically a function of within-cluster similarity and between-cluster dissimilarity
  Optimization
    Find clustering that maximizes the criterion
      Global optimization (often intractable)
      Greedy search
      Approximation algorithms
Centroid-Based Clustering
  Assumes instances are real-valued vectors.
  Clusters represented via centroids (i.e. mean of points in a cluster) $c$:
    $\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
  Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm
  Input: k = number of clusters, distance measure d
  Select k random instances $\{s_1, s_2, \ldots, s_k\}$ as seeds.
  Until clustering converges or other stopping criterion:
    For each instance $x_i$:
      Assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is minimal.
    For each cluster $c_j$: // update the centroid of each cluster
      $s_j = \mu(c_j)$
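The pseudocode above translates almost line by line into Python. The sketch below is one possible reading, not a reference implementation: it assumes Euclidean distance by default, treats "converges" as "no assignment changed", caps the loop at a hypothetical `max_iter`, and leaves a center where it is if its cluster becomes empty. The optional `seeds` argument is an addition so that the example is reproducible.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(instances, k, d=euclidean, seeds=None, max_iter=100, rng=None):
    rng = rng or random.Random(0)
    # Select k random instances as seeds unless the caller supplies them.
    centers = list(seeds) if seeds is not None else rng.sample(instances, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each instance to the cluster whose center is closest.
        new_assignment = [min(range(k), key=lambda j: d(x, centers[j]))
                          for x in instances]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update the centroid of each cluster.
        for j in range(k):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = centroid(members)
    return centers, assignment

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = k_means(points, 2, seeds=[points[0], points[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

With the two seeds taken from different groups, the first reassignment already separates the groups, and the second pass only confirms that nothing changes.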
K-means Example (k=2)
  [figure sequence: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!]
Time Complexity
  Assume computing distance between two instances is $O(m)$ where m is the dimensionality of the vectors.
  Reassigning clusters for n points: $O(kn)$ distance computations, or $O(knm)$.
  Computing centroids: Each instance gets added once to some centroid: $O(nm)$.
  Assume these two steps are each done once for $I$ iterations: $O(Iknm)$.
  Linear in all relevant factors, assuming a fixed number of iterations; more efficient than HAC.
Buckshot Algorithm
  Problem
    Results can vary based on random seed selection, especially for high-dimensional data.
    Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
  Idea: Combine HAC and K-means clustering.
    First randomly take a sample of instances of size $\sqrt{n}$.
    Run group-average HAC on this sample.
    Use the results of HAC as initial seeds for K-means.
    Overall algorithm is efficient and avoids problems of bad seed selection.
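The three steps of the idea can be sketched end to end. Several details below are assumptions, since the slide does not fix them: HAC is stopped once k clusters remain and their centroids are used as the seeds, similarity is the negated Euclidean distance rather than cosine, the sample size is the integer square root of n (but at least k), and all function names and the toy data are invented for illustration.

```python
import math
import random

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def group_average(a, b):
    """Mean similarity (negated distance) over all ordered pairs in the merged cluster."""
    merged = a + b
    n = len(merged)
    total = sum(-dist(x, y)
                for p, x in enumerate(merged)
                for q, y in enumerate(merged) if p != q)
    return total / (n * (n - 1))

def hac_to_k(sample, k):
    """Naive group-average HAC, stopped when k clusters remain (assumption)."""
    clusters = [[x] for x in sample]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: group_average(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

def k_means(instances, seeds, max_iter=100):
    centers = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda j: dist(x, centers[j]))
               for x in instances]
        if new == assignment:
            break
        assignment = new
        for j in range(len(centers)):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
    return centers, assignment

def buckshot(instances, k, rng):
    size = max(k, math.isqrt(len(instances)))            # sample of about sqrt(n) instances
    sample = rng.sample(instances, size)
    seeds = [centroid(c) for c in hac_to_k(sample, k)]   # HAC result becomes the seeds
    return k_means(instances, seeds)

# Two well-separated groups of nine points each.
data = [(float(i), 0.0) for i in range(9)] + [(100.0 + i, 0.0) for i in range(9)]
centers, assignment = buckshot(data, 2, random.Random(0))
print(sorted(centers))
```

Because HAC only runs on about sqrt(n) instances, its quadratic number of pairwise comparisons stays roughly linear in n, which is what keeps the combined procedure cheap.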