Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis: Supervised Learning

Size: px
Start display at page:

Download "Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis: Supervised Learning"

Transcription

1 Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis: Supervised Learning Gary King Institute for Quantitative Social Science Harvard University April 22, 2012 Gary King (Harvard, IQSS) Text Analysis: Supervised Learning April 22, / 30

2 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text American Journal of Political Science, Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 2 / 30

3 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text American Journal of Political Science, commercialized via: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 2 / 30

4 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text American Journal of Political Science, commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, Statistical Science Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 2 / 30

5 References Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text American Journal of Political Science, commercialized via: Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, Statistical Science Copies at Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 2 / 30

6 Inputs and Target Quantities of Interest Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

7 Inputs and Target Quantities of Interest Input Data: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

8 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

9 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

10 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

11 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

12 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

13 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

14 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

15 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

16 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) High classification accuracy unbiased category proportions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

17 Inputs and Target Quantities of Interest Input Data: Large set of text documents (blogs, web pages, s, etc.) A set of (mutually exclusive and exhaustive) categories A small set of documents hand-coded into the categories Quantities of interest individual document classifications (spam filters) proportion in each category (proportion which is spam) Estimation Can get the 2nd by counting the 1st (turns out not to be necessary!) High classification accuracy unbiased category proportions Different methods optimize estimation of the different quantities Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 3 / 30

18 One specific quantity of interest Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

19 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

20 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

21 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

22 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

23 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

24 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Informal language: my crunchy gf thinks dubya hid the wmd s, :)! Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

25 One specific quantity of interest Daily opinion about President Bush and 2008 candidates among all English language blog posts Specific categories: Label Category 2 extremely negative 1 negative 0 neutral 1 positive 2 extremely positive NA no opinion expressed NB not a blog Hard case: Part ordinal, part nominal categorization Sentiment categorization is more difficult than topic classification Informal language: my crunchy gf thinks dubya hid the wmd s, :)! Little common internal structure (no inverted pyramid) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 4 / 30

26 The Conversation about John Kerry s Botched Joke Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 5 / 30

27 The Conversation about John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 5 / 30

28 The Conversation about John Kerry s Botched Joke You know, education if you make the most of it... you can do well. If you don t, you get stuck in Iraq. Affect Towards John Kerry Proportion Sept Oct Nov Dec Jan Feb Mar Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 5 / 30

29 Representing Text as Numbers Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

30 Representing Text as Numbers Filter: choose English language blogs that mention Bush Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

31 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

32 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

33 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

34 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

35 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

36 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Groups infinite possible posts into only 2 3,672 distinct types Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

37 Representing Text as Numbers Filter: choose English language blogs that mention Bush Preprocess: convert to lower case, remove punctuation, keep only word stems ( consist, consisted, consistency consist ) Code variables: presence/absence of unique unigrams, bigrams, trigrams Our Example: Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams. keep only unigrams in > 1% or < 99% of documents: 3,672 variables Groups infinite possible posts into only 2 3,672 distinct types More sophisticated summaries: we ve used, but they re not necessary Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 6 / 30

38 Notation Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 7 / 30

39 Notation Document Category -2 extremely negative -1 negative 0 neutral D i = 1 positive 2 extremely positive NA no opinion expressed NB not a blog Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 7 / 30

40 Notation Document Category -2 extremely negative -1 negative 0 neutral D i = 1 positive 2 extremely positive NA no opinion expressed NB not a blog Word Stem Profile: S i1 = 1 if awful is used, 0 if not S i2 = 1 if good is used, 0 if not S i =.. S ik = 1 if except is used, 0 if not Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 7 / 30

41 Quantities of Interest Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 8 / 30

42 Quantities of Interest Computer Science: individual document classifications D 1, D 2..., D L Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 8 / 30

43 Quantities of Interest Computer Science: individual document classifications D 1, D 2..., D L Social Science: proportions in each category P(D = 2) P(D = 1) P(D = 0) P(D) = P(D = 1) P(D = 2) P(D = NA) P(D = NB) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 8 / 30

44 Issues with Existing Statistical Approaches Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

45 Issues with Existing Statistical Approaches 1 Direct Sampling Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

46 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

47 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

48 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

49 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

50 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

51 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

52 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

53 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

54 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. S spans the space of all predictors of D (i.e., all information in the document) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

55 Issues with Existing Statistical Approaches 1 Direct Sampling Biased without a random sample nonrandomness common due to population drift, data subdivisions, etc. (Classification of population documents not necessary) 2 Aggregation of model-based individual classifications Biased without a random sample Models P(D S), but the world works as P(S D) Bias unless P(D S) encompasses the true model. S spans the space of all predictors of D (i.e., all information in the document) Bias even with optimal classification and high % correctly classified Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 9 / 30

56 Using Misclassification Rates to Correct Proportions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

57 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

58 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

59 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

60 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

61 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

62 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions (No new assumptions beyond that of the classifier) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

63 Using Misclassification Rates to Correct Proportions Use some method to classify unlabeled documents Aggregate classifications to category proportions Use labeled set to estimate misclassification rates (by cross-validation) Use misclassification rates to correct proportions Result: vastly improved estimates of category proportions (No new assumptions beyond that of the classifier) (still requires random samples, individual classification, etc) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 10 / 30

64 Formalization from Epidemiology (Levy and Kass, 1970) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 11 / 30

65 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 11 / 30

66 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) Solve: P(D = 1) = P( ˆD = 1) (1 spec) sens (1 spec) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 11 / 30

67 Formalization from Epidemiology (Levy and Kass, 1970) Accounting identity for 2 categories: Solve: P( ˆD = 1) = (sens)p(d = 1) + (1 spec)p(d = 2) P(D = 1) = P( ˆD = 1) (1 spec) sens (1 spec) Use this equation to correct P( ˆD = 1) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 11 / 30

68 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 12 / 30

69 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 12 / 30

70 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Drop ˆD calculation, since ˆD = f (S): P(S = s) = J P(S = s D = j )P(D = j ) j =1 Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 12 / 30

71 Generalizations: J Categories, No Individual Classification (King and Lu, 2008, in press) Accounting identity for J categories P( ˆD = j) = J P( ˆD = j D = j )P(D = j ) j =1 Drop ˆD calculation, since ˆD = f (S): P(S = s) = J P(S = s D = j )P(D = j ) j =1 Simplify to an equivalent matrix expression: P(S) = P(S D)P(D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 12 / 30

72 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

73 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Document category proportions (quantity of interest) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

74 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Word stem profile proportions (estimate in unlabeled set by tabulation) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

75 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 Word stem profiles, by category (estimate in labeled set by tabulation) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

76 Estimation The matrix expression again: P(S) 2 K 1 = Y = X β = P(S D) P(D) 2 K J J 1 Alternative symbols (to emphasize the linear equation) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

77 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Solve for quantity of interest (with no error term) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

78 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

79 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

80 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

81 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

82 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

83 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

84 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

85 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Use constrained LS to constrain P(D) to simplex Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

86 Estimation The matrix expression again: P(S) 2 K 1 = P(S D) P(D) 2 K J J 1 = Y = X β = β = (X X ) 1 X y Technical estimation issues: 2 K is enormous, far larger than any existing computer P(S) and P(S D) will be too sparse Elements of P(D) must be between 0 and 1 and sum to 1 Solutions Use subsets of S; average results Equivalent to kernel density smoothing of sparse categorical data Use constrained LS to constrain P(D) to simplex Result: fast, accurate, with very little (human) tuning required Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 13 / 30

87 A Nonrandom Hand-coded Sample Differences in Document Category Frequencies Differences in Word Profile Frequencies P(D) P(S) P h (D) P h (S) All existing methods would fail with these data. Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 14 / 30

88 Accurate Estimates Estimated P(D) Actual P(D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 15 / 30

89 Out-of-sample Comparison: 60 Seconds vs. 8.7 Days Affect in Blogs Estimated P(D) Actual P(D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 16 / 30

90 Out of Sample Validation: Other Examples Congressional Speeches Immigration Editorials Enron s Estimated P(D) Estimated P(D) Estimated P(D) Actual P(D) Actual P(D) Actual P(D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 17 / 30

91 Verbal Autopsy Methods Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

92 Verbal Autopsy Methods The Problem Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

93 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

94 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

95 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

96 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

97 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

98 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Apply expert algorithms (high reliability, low validity) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

99 Verbal Autopsy Methods The Problem Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies High quality death registration: only 23/192 countries Existing Approaches Verbal Autopsy: Ask relatives or caregivers symptom questions Ask physicians to determine cause of death (low intercoder reliability) Apply expert algorithms (high reliability, low validity) Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in population (model-dependent, low accuracy) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 18 / 30

100 An Alternative Approach Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 19 / 30

101 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 19 / 30

102 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Word Stem Profile, Symptoms: S i1 = 1 if breathing difficulties, 0 if not S i2 = 1 if stomach ache, 0 if not S i =.. S ik = 1 if diarrhea, 0 if not Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 19 / 30

103 An Alternative Approach Document Category, Cause of Death, 1 if bladder cancer 2 if cardiovascular disease D i = 3 if transportation accident.. J if infectious respiratory Word Stem Profile, Symptoms: S i1 = 1 if breathing difficulties, 0 if not S i2 = 1 if stomach ache, 0 if not S i =.. S ik = 1 if diarrhea, 0 if not Apply the same methods Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 19 / 30

104 Validation in Tanzania Random Split Sample Community Sample Estimate Error Error Estimate TRUE TRUE Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 20 / 30

105 Validation in China Random Split Sample Estimate TRUE Error City Sample I Estimate TRUE Error City Sample II Estimate TRUE Error Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 21 / 30

106 Implications for an Individual Classifier Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

107 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

108 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

109 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

110 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

111 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) The goal: individual classification Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

112 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Output from our estimator (described above) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

113 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Nonparametric estimate from labeled set (an assumption) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

114 Implications for an Individual Classifier All existing classifiers assume: P h (S, D) = P(S, D) For a different quantity we assume: P h (S D) = P(S D) How to use this (less restrictive) assumption for classification (Bayes Theorem): P(D l S l = s l ) = P(S l = s l D l = j)p(d l = j) P(S l = s l ) Nonparametric estimate from unlabeled set (no assumption) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 22 / 30

115 Classification with Less Restrictive Assumptions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 23 / 30

116 Classification with Less Restrictive Assumptions P h (D=j) P(D=j) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 23 / 30

117 Classification with Less Restrictive Assumptions P h (D=j) P h (S k=1) P(D=j) P(S k=1) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 23 / 30

118 Classification with Less Restrictive Assumptions Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 24 / 30

119 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 24 / 30

120 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 24 / 30

121 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: SVM (best existing classifier): 40.5% Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 24 / 30

122 Classification with Less Restrictive Assumptions P^(D=j) SVM Nonparametric P(D=j) Percent correctly classified: SVM (best existing classifier): 40.5% Our nonparametric approach: 59.8% Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 24 / 30

123 Misclassification Matrix for Blog Posts NA NB P(D 1 ) NA NB Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 25 / 30

124 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 26 / 30

125 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 27 / 30

126 SIMEX Analysis of Not a Blog Category Category NB α Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 28 / 30

127 SIMEX Analysis of Other Categories Category 2 Category 0 Category α Category α Category α Category NA α α α Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 29 / 30

128 For more information Gary King (Harvard, IQSS) Text Analysis: Supervised Learning 30 / 30