Document Classification and Clustering II. Linear classifiers

Transcription:

Document Classification and Clustering II
CS 510, Winter 2007
(c) 2007 David Maier and Susan Price

Linear classifiers
- Support both DPC and CPC (document-pivoted and category-pivoted classification)
- Can make use of a search index
- DPC: treat document d as a query against the category vectors v_1, v_2, ..., v_n
- CPC: treat category vector v_i as a query against the documents in D
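Both directions reduce to scoring vector similarity; only the roles of query and collection swap. A minimal Python sketch of the two modes (the dict-based sparse vectors, cosine scoring, and all names here are illustrative assumptions, not from the slides):

```python
import math

def cosine(u, v):
    # cosine similarity of two sparse {term: weight} vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_dpc(d, category_vectors):
    # DPC: the document d is the query; score it against each category vector
    return max(category_vectors, key=lambda c: cosine(d, category_vectors[c]))

def rank_cpc(v_i, documents):
    # CPC: the category vector v_i is the query; rank the documents in D
    return sorted(documents, key=lambda name: cosine(v_i, documents[name]),
                  reverse=True)
```

With a search index, either direction can reuse the normal query-evaluation machinery instead of scoring exhaustively as above.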

Profile construction
- Rocchio method (based on an approach to relevance feedback)
- Have category c_i; want profile v_i = <w_1, w_2, ..., w_m>, the weights for terms t_1, t_2, ..., t_m
- POS_i = positive examples for c_i
- NEG_i = negative examples for c_i

Calculating weights
To calculate weight w_j:
- Positive contribution: P_j = (sum of w_{d,t_j} over d in POS_i) / |POS_i|
- Negative contribution: N_j = (sum of w_{d,t_j} over d in NEG_i) / |NEG_i|
- w_j = β·P_j - γ·N_j
- If β = 1 and γ = 0, the profile is the centroid of the positive examples
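The weight formula above can be sketched directly in Python (a hedged sketch; the sparse {term: weight} representation and the function name are assumptions):

```python
def rocchio_profile(pos, neg, beta=1.0, gamma=0.0):
    # pos, neg: lists of sparse {term: weight} document vectors,
    # standing in for POS_i and NEG_i.
    terms = set()
    for d in pos + neg:
        terms.update(d)
    profile = {}
    for t in terms:
        # P_j and N_j: mean weight of term t in the positive / negative examples
        p = sum(d.get(t, 0.0) for d in pos) / len(pos) if pos else 0.0
        n = sum(d.get(t, 0.0) for d in neg) / len(neg) if neg else 0.0
        profile[t] = beta * p - gamma * n   # w_j = beta*P_j - gamma*N_j
    return profile
```

Calling it with beta=1.0, gamma=0.0 reproduces the centroid special case from the slide.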

Problem with centroid
- Doesn't do well if a category has disjoint clusters
[Figure: the "Furniture" category contains disjoint clusters (beds, mattresses, couches, tables); the centroid falls in between, matching none of them well.]

Use of near-positives
- Can improve on Rocchio by using only near-positives instead of all negatives
- Choose them from close categories (for example, appliances and fixtures, if they are both under home furnishings with furniture)
- Query the negatives using the centroid of the positives; take the k closest, or everything within a radius
- The idea comes from relevance feedback
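Selecting near-positives amounts to querying the negative examples with the positive centroid and keeping the best matches. A sketch under the same assumed sparse-vector representation (function names are illustrative):

```python
import math

def cosine(u, v):
    # cosine similarity of two sparse {term: weight} vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    # average the document vectors term by term
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w / len(docs)
    return c

def near_positives(pos, neg, k):
    # Query the negatives with the centroid of the positives and keep the
    # k closest; these near-positives replace NEG_i in the Rocchio formula.
    c = centroid(pos)
    return sorted(neg, key=lambda d: cosine(c, d), reverse=True)[:k]
```

The radius variant would filter by `cosine(c, d) >= threshold` instead of truncating at k.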

Near-positives around centroid
[Figure: the same disjoint "Furniture" clusters (beds, mattresses, couches, tables); the negatives nearest the positive centroid are marked "use these as near-positives".]

Example-based classification
- Lazy learners don't do anything in advance with the training set, except perhaps index it
- At classification time, compare the document d to be classified with the training documents to determine its category

k-NN: k nearest neighbors
- Given document d, find the k documents d_1, d_2, ..., d_k in the training collection T that are closest to d
- Look at their categories and use that distribution to make a classification decision
[Figure: d surrounded by neighboring training documents labeled c1, c2, c3, c4]

For example, with k = 10, the closest documents and their categories:

  document: d6  d14  d29  d49  d51  d54  d68  d70  d71  d93
  category: c1  c2   c2   c3   c1   c2   c3   c1   c4   c2
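Retrieval plus a vote over the neighbors' categories is all k-NN needs at classification time. A minimal sketch (sparse vectors and cosine similarity are assumptions carried over from earlier examples):

```python
import math
from collections import Counter

def cosine(u, v):
    # cosine similarity of two sparse {term: weight} vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn(d, training, k):
    # training: list of (vector, category) pairs; return the k nearest to d
    return sorted(training, key=lambda pair: cosine(d, pair[0]),
                  reverse=True)[:k]

def knn_majority(d, training, k):
    # classify d into the category most frequent among the k neighbors
    votes = Counter(cat for _, cat in knn(d, training, k))
    return votes.most_common(1)[0][0]
```

On the slide's k = 10 example the vote distribution is c1:3, c2:4, c3:2, c4:1, so majority voting would pick c2.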

What next?
- Classify d into the category that appears most frequently among the k
- Classify d into all categories that occur above a given threshold
- Compute CSV_i(d) from the similarity of the documents for c_i among the k
- Question: how would you adapt the k-NN approach if training examples have multiple labels?

Advantages
1. Looks a lot like standard IR search: document d is the query, and the training examples are what is searched
2. Doesn't divide the document space linearly, so it can handle categories with internal clusters
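The similarity-based CSV_i(d) and the threshold decision can be sketched on top of an already-retrieved neighbor list (a hedged sketch; function names and the summed-similarity form of CSV are assumptions consistent with the slide):

```python
import math

def cosine(u, v):
    # cosine similarity of two sparse {term: weight} vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_csv(d, neighbors):
    # neighbors: the k nearest (vector, category) pairs already retrieved.
    # CSV_i(d): sum the similarities of the neighbors belonging to c_i.
    csv = {}
    for vec, cat in neighbors:
        csv[cat] = csv.get(cat, 0.0) + cosine(d, vec)
    return csv

def above_threshold(csv, tau):
    # multi-category decision: every category whose CSV clears the threshold
    return sorted(c for c, score in csv.items() if score >= tau)
```

For the multi-label question, one natural adaptation is to let each neighbor vote once per label it carries; the structure above already supports that if `cat` is iterated over a label set.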

Disadvantages
1. Really only suited to DPC, not CPC
2. Inefficient: need to compare d to all the training documents
- The lazy strategy defers all the work from training time to classification time
- Could combine with profile-based methods: represent each cluster in a category by a profile, and search only the profiles

Profile per cluster
[Figure: each cluster within a category summarized by its own cluster profile]

Classification effectiveness
- Consider single-label classification
- Compare the classifier's results to expert judgments
- For category c_i, have counts TP_i, FP_i, TN_i, FN_i from classifying all documents in a test set S
- Error: (FP_i + FN_i) / (TP_i + FP_i + TN_i + FN_i)
- Often used in machine learning
- Problem here: the TN_i term dominates

Precision & recall apply
- Precision_i = TP_i / (TP_i + FP_i)
- Recall_i = TP_i / (TP_i + FN_i)
- Note that TN_i doesn't enter in
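The three measures are direct to compute; the counts in the example below are made up to show why error is misleading for a rare category, where TN_i swamps everything:

```python
def error(tp, fp, tn, fn):
    # fraction of all test documents classified incorrectly
    return (fp + fn) / (tp + fp + tn + fn)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# Hypothetical counts for a rare category in a 1000-document test set:
# the classifier is wrong on half its positive calls and misses half the
# true positives, yet the error rate looks excellent.
tp, fp, tn, fn = 5, 5, 985, 5
print(error(tp, fp, tn, fn))  # 0.01
print(precision(tp, fp))      # 0.5
print(recall(tp, fn))         # 0.5
```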

Issue: overall rating
I. Micro-averaging: sum the counts over all categories, e.g., TP = TP_1 + TP_2 + ... + TP_m, then compute the measures from the summed counts
II. Macro-averaging: compute effectiveness per category, then average those values, e.g., Precision = avg(Precision_i)
- Micro-averaging washes out low-generality categories

Utility
- Might take into account the consequence or gain of TP, FP, TN, FN
- Consider spam labeling: FN probably worse than FP
- What's worse: five spams that get into your regular mailbox, or five legitimate emails in your spam box?
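Micro- and macro-averaged precision differ only in when the division happens. The counts below are made up: one large category and one low-generality category, to show the washing-out effect:

```python
def micro_precision(per_category):
    # per_category: list of (tp, fp) pairs; sum the counts first, then divide
    tp = sum(t for t, _ in per_category)
    fp = sum(f for _, f in per_category)
    return tp / (tp + fp)

def macro_precision(per_category):
    # compute precision per category, then average the values
    precs = [t / (t + f) for t, f in per_category]
    return sum(precs) / len(precs)

counts = [(90, 10),   # large category: precision 0.9
          (1, 9)]     # low-generality category: precision 0.1
print(micro_precision(counts))  # ~0.83: dominated by the large category
print(macro_precision(counts))  # 0.5: the small category counts equally
```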

Questions to consider
- What about the ranked case?
- What about multi-label classifiers?

Uses of clustering
1. Search: the cluster hypothesis; search for and retrieve whole clusters
2. Relevance feedback
3. Organizing search results (e.g., Northern Light's "Little Blue Folders"); gives you an overview of the information space and helps find novel items
4. Discovering natural categories
5. Understanding an information space

[Slide: Northern Light screenshot]
[Slide: Example: Google News screenshot]

[Slides: In-Spire (Pacific Northwest National Laboratory), two screenshots illustrating information space analysis]

[Slide: ThemeRiver (Pacific Northwest National Laboratory), information space analysis]

Methods
- van Rijsbergen: graph-based; determine the edges using thresholds on a similarity or difference measure
- Unsupervised learning from machine learning; generally iterative, e.g., k-means
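The iterative structure of k-means is just an alternating assign/update loop. A deliberately naive sketch (seeding with the first k points is an assumption for determinism; real implementations use random or k-means++ initialization):

```python
def dist2(p, q):
    # squared Euclidean distance between two dense points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    centers = [tuple(p) for p in points[:k]]  # naive deterministic seeding
    clusters = []
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```

A fixed iteration count stands in for a proper convergence test (stop when assignments no longer change).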

Questions
- Do we expect clusters to be stable over time? (Or: when do we fix them?)
- How do you explain a cluster to a human?
  - Features
  - Examples
  - Visualizations (which need to reduce from many dimensions down to 2 or 3)