Document Classification and Clustering II. Linear classifiers


CS 510 Winter 2007
(c) 2007 David Maier and Susan Price

Linear classifiers
- Support both DPC and CPC
- Can make use of a search index
  - DPC: treat document d as a query against the category vectors v_1, v_2, ..., v_n
  - CPC: treat profile v_i as a query against the documents in D

Profile construction
- Rocchio method (based on an approach to relevance feedback)
- Have category c_i; want profile v_i = <w_1, w_2, ..., w_m>, the weights for terms t_1, t_2, ..., t_m
- POS_i = positive examples for c_i
- NEG_i = negative examples for c_i

Calculating weights
To calculate weight w_j:
- Positive contribution: P_j = (sum of w_{d,t_j} over d in POS_i) / |POS_i|
- Negative contribution: N_j = (sum of w_{d,t_j} over d in NEG_i) / |NEG_i|
- w_j = beta * P_j - gamma * N_j
- If beta = 1 and gamma = 0, we have the centroid of the positive examples
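The weight calculation above can be sketched in Python. This is a minimal sketch, assuming documents are represented as sparse term-to-weight dictionaries; the function name and that representation are illustrative choices, not part of the original slides.

```python
def rocchio_profile(pos_docs, neg_docs, beta=1.0, gamma=0.0):
    """Rocchio profile: w_j = beta * P_j - gamma * N_j, where P_j and N_j
    are the mean weights of term t_j over the positive and negative
    examples. Documents are dicts mapping term -> weight."""
    terms = set()
    for d in pos_docs + neg_docs:
        terms |= d.keys()
    profile = {}
    for t in terms:
        p = sum(d.get(t, 0.0) for d in pos_docs) / len(pos_docs)
        n = sum(d.get(t, 0.0) for d in neg_docs) / len(neg_docs) if neg_docs else 0.0
        profile[t] = beta * p - gamma * n
    return profile
```

With beta = 1 and gamma = 0 the result is exactly the centroid of the positive examples.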

Problem with the centroid
- Doesn't do well if a category has disjoint clusters
[Figure: a Furniture category with disjoint clusters (tables, beds, mattresses, couches) and its single centroid]

Use of near-positives
- Can improve on Rocchio by using only near-positives instead of all negatives
- Choose them from close categories (for example, appliances and fixtures, if they are both under home furnishings with furniture)
- Query the negatives using the centroid of the positives; take the k closest, or everything within a radius
- Idea comes from relevance feedback
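The near-positive selection step can be sketched as follows: rank the negative examples by cosine similarity to the centroid of the positives and keep the k closest. This is a sketch under the same assumed sparse-dictionary document representation; the function names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity over sparse term -> weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def near_positives(pos_docs, neg_docs, k):
    """Rank negatives by similarity to the centroid of the positives and
    keep the k closest (a radius cutoff would work similarly)."""
    centroid = {}
    for d in pos_docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(pos_docs)
    return sorted(neg_docs, key=lambda d: cosine(centroid, d), reverse=True)[:k]
```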

Near-positives around the centroid
[Figure: the same Furniture clusters; the negatives closest to the centroid are used as near-positives]

Example-based classification
- Lazy learners don't do anything in advance with the training set, except perhaps index it
- At classification time, compare the document d to be classified with the training documents to determine its category

5 k-nn: k nearest neighbors Given document d, find k documents d 1, d 2,, d k in training collection T that are closest to d Look at their categories, use that distribution for a classification decision c1 c4 d c2 c3 CS 510 Winter For example k = 10, closest documents and categories d6 d14 d29 d49 d51 d54 d68 d70 d71 d93 c1 c2 c2 c3 c1 c2 c3 c1 c4 c2 CS 510 Winter (c) 2007 David Maier and Susan Price 5

What next?
- Classify d into the category that appears most frequently among the k
- Classify d into all categories that occur above a given threshold
- Compute CSV_i(d) from the similarity of the documents for c_i among the k
- Question: how would you adapt the k-NN approach if training examples have multiple labels?

Advantages
1. Looks a lot like standard IR search: document d is the query, and the training examples are what is searched
2. Doesn't divide the document space linearly, so it can handle categories with internal clusters
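The first two decision rules can be sketched directly from the neighbors' category labels; using the k = 10 example from the slides, c2 appears most often. The function name and the threshold-as-fraction interpretation are illustrative assumptions.

```python
from collections import Counter

def knn_decide(neighbor_cats, threshold=None):
    """Decision rules over the categories of the k nearest neighbors.
    With no threshold: single-label, most frequent category wins.
    With a threshold: multi-label, every category whose share of the
    k neighbors is at least the threshold."""
    counts = Counter(neighbor_cats)
    if threshold is None:
        return counts.most_common(1)[0][0]
    k = len(neighbor_cats)
    return [c for c, n in counts.items() if n / k >= threshold]

# The k = 10 example from the slides:
cats = ["c1", "c2", "c2", "c3", "c1", "c2", "c3", "c1", "c4", "c2"]
```

Here `knn_decide(cats)` picks c2 (4 of the 10 neighbors), while a threshold of 0.3 returns both c1 and c2.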

Disadvantages
1. Really only suited to DPC, not CPC
2. Inefficiency: d must be compared to all training documents
- The lazy strategy defers all work from training time to classification time
- Could combine with profile-based methods: represent each cluster in a category by a profile and search only the profiles
[Figure: profile per cluster; each cluster within a category gets its own profile]

Classification effectiveness
- Consider single-label classification
- Compare classifier results to expert judgments
- For category c_i, we have TP_i, FP_i, TN_i, FN_i from classifying all documents in a test set S
- Error_i = (FP_i + FN_i) / (TP_i + FP_i + TN_i + FN_i)
- Often used in machine learning
- Problem here: the TN_i term dominates

Precision and recall apply
- Precision_i = TP_i / (TP_i + FP_i)
- Recall_i = TP_i / (TP_i + FN_i)
- Note that TN_i doesn't enter in
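A small numeric sketch shows why error is misleading when TN_i dominates: with 985 true negatives, a classifier with mediocre precision and recall still posts a 1% error rate. The function names are illustrative.

```python
def error_rate(tp, fp, tn, fn):
    # Error_i = (FP_i + FN_i) / (TP_i + FP_i + TN_i + FN_i)
    return (fp + fn) / (tp + fp + tn + fn)

def precision(tp, fp):
    # Precision_i = TP_i / (TP_i + FP_i); TN_i never appears.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Recall_i = TP_i / (TP_i + FN_i); TN_i never appears.
    return tp / (tp + fn) if tp + fn else 0.0

# TP=5, FP=5, TN=985, FN=5: error is only 1%,
# yet precision and recall are both just 0.5.
```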

Issue: overall rating
I. Micro-averaging: sum the counts over all categories, e.g., TP = TP_1 + TP_2 + ... + TP_m, then compute one overall measure
II. Macro-averaging: compute effectiveness per category, then average those values, e.g., Precision = avg(Precision_i)
- Micro-averaging washes out low-generality categories

Utility
- Might take into account the consequence or gain of TP, FP, TN, FN
- Consider spam labeling: an FN is probably worse than an FP
- What's worse, five spams that get into your regular mailbox or five legitimate e-mails in your spam box?
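The micro/macro distinction for precision can be sketched with per-category (TP_i, FP_i) counts; the function name is hypothetical. A large category with precision 0.9 and a small low-generality category with precision 0.1 average to 0.5 under macro-averaging, but micro-averaging is dominated by the large category's counts.

```python
def micro_macro_precision(per_category):
    """per_category: list of (TP_i, FP_i) pairs.
    Micro: sum the counts first, then compute one precision.
    Macro: compute per-category precisions, then average them."""
    tp = sum(t for t, f in per_category)
    fp = sum(f for t, f in per_category)
    micro = tp / (tp + fp)
    macro = sum(t / (t + f) for t, f in per_category) / len(per_category)
    return micro, macro

# Large category (90, 10) vs. small category (1, 9):
# micro = 91/110, about 0.83; macro = (0.9 + 0.1)/2 = 0.5.
```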

Questions to consider
- What about the ranked case?
- What about multi-label classifiers?

Uses of clustering
1. Search: the cluster hypothesis (search, then retrieve whole clusters)
2. Relevance feedback
3. Organizing search results (e.g., Northern Light's Little Blue Folders): gives you an overview of the information space and helps find novel items
4. Discovering natural categories
5. Understanding an information space

Northern Light
[Screenshot: Northern Light clustered search results]

Example: Google News
[Screenshot: Google News clustered stories]

In-Spire, Pacific Northwest National Laboratory
[Screenshots: In-Spire information space analysis, two views]

ThemeRiver, Pacific Northwest National Laboratory
[Screenshot: ThemeRiver information space analysis]

Methods
- van Rijsbergen: graph-based; determine edges using thresholds on a similarity or difference measure
- Unsupervised learning from machine learning; generally iterative, e.g., k-means
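The iterative flavor of k-means can be sketched on 2-D points: assign each point to its nearest mean, recompute the means, and repeat. This is a minimal illustrative sketch (random initialization, fixed iteration count), not a production clustering routine.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means sketch on 2-D points with Euclidean distance:
    assign each point to its nearest mean, recompute means, repeat."""
    rng = random.Random(seed)
    means = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2 + (p[1] - means[j][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old mean if a cluster goes empty
                means[i] = (sum(x for x, y in c) / len(c),
                            sum(y for x, y in c) / len(c))
    return means, clusters
```

On two well-separated groups of points, a few iterations are enough to recover the two natural clusters.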

Questions
- Do we expect clusters to be stable over time (or when do we fix them)?
- How do you explain a cluster to a human?
  - Features
  - Examples
  - Visualizations (need to reduce from many dimensions down to 2 or 3)