Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians)

Gary King (Harvard, IQSS)
April 25, 2010

© Copyright 2010 Gary King, All Rights Reserved.

References

  - Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text 54, 1 (January 2010): commercialized via:
  - Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death, Statistical Science 23, 1 (February, 2008): Pp In use by (among others):
  - Copies at
  - (play after ad)
  - (play 10:00-12:06)

Inputs and Target Quantities of Interest

  Input Data:
    - Large set of text documents (blogs, web pages, emails, etc.)
    - A set of (mutually exclusive and exhaustive) categories
    - A small set of documents hand-coded into the categories
  Quantities of interest:
    - individual document classifications (spam filters)
    - proportion in each category (proportion which is spam)
  Estimation:
    - Can get the 2nd by counting the 1st (turns out not to be necessary!)
    - High classification accuracy does not imply unbiased category proportions
    - Different methods optimize estimation of the different quantities

Blogs as a Running Example

  - Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order.
  - We are living through the largest expansion of expressive capability in the history of the human race
  - Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials)
  - (Public opinion ≠ surveys)

One specific quantity of interest

  Daily opinion about President Bush and 2008 candidates among all English language blog posts

  Specific categories:

    Label   Category
    -2      extremely negative
    -1      negative
     0      neutral
     1      positive
     2      extremely positive
    NA      no opinion expressed
    NB      not a blog

  Hard case:
    - Part ordinal, part nominal categorization
    - Sentiment categorization is more difficult than topic classification
    - Informal language: "my crunchy gf thinks dubya hid the wmd's, :)!"
    - Little common internal structure (no inverted pyramid)

Example of output: John Kerry's Botched Joke

  "You know, education if you make the most of it... you can do well. If you don't, you get stuck in Iraq."

  [Figure: Affect Towards John Kerry. Vertical axis: Proportion. Horizontal axis: months, Sept through Mar.]

Representing Text as Numbers

  - Filter: choose English language blogs that mention Bush
  - Preprocess: convert to lower case, remove punctuation, keep only word stems ("consist", "consisted", "consistency" all become "consist")
  - Code variables: presence/absence of unique unigrams, bigrams, trigrams
  - Our Example:
      - Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.
      - Keep only unigrams appearing in more than 1% and fewer than 99% of documents: 3,672 variables
      - Groups infinite possible posts into only $2^{3,672}$ distinct types
      - More sophisticated summaries: we've used them, but they're not necessary
      - (More systematic than 1 dummy variable per document)
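The preprocessing and coding steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `crude_stem` is a toy suffix stripper standing in for a real stemmer, the function names and the three example posts are hypothetical, and only unigrams are coded.

```python
import re
from collections import Counter

def crude_stem(word):
    """Toy suffix stripper (assumption: a stand-in for a real stemmer)."""
    for suffix in ("ency", "ed", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_set(text):
    """Lowercase, drop punctuation, and return the set of unigram stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words}

def build_profiles(docs, min_share=0.01, max_share=0.99):
    """Return (vocabulary, 0/1 indicator rows), keeping stems that appear in
    more than min_share and fewer than max_share of the documents."""
    stem_sets = [stem_set(d) for d in docs]
    doc_freq = Counter(s for stems in stem_sets for s in stems)
    n = len(docs)
    vocab = sorted(s for s, c in doc_freq.items()
                   if min_share < c / n < max_share)
    rows = [[1 if s in stems else 0 for s in vocab] for stems in stem_sets]
    return vocab, rows

# Hypothetical posts; "bush" occurs in every post, so the filter drops it.
docs = [
    "Bush was awful.",
    "Bush is good!",
    "Nothing consisted of Bush, except this.",
]
vocab, rows = build_profiles(docs)
```

Each row of `rows` plays the role of a word stem profile: one presence/absence indicator per retained stem.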

Notation

  Document Category:

  \[
  D_i =
  \begin{cases}
    -2 & \text{extremely negative} \\
    -1 & \text{negative} \\
    0 & \text{neutral} \\
    1 & \text{positive} \\
    2 & \text{extremely positive} \\
    \text{NA} & \text{no opinion expressed} \\
    \text{NB} & \text{not a blog}
  \end{cases}
  \]

  Word Stem Profile:

  \[
  S_i =
  \begin{cases}
    S_{i1} = 1 \text{ if ``awful'' is used, 0 if not} \\
    S_{i2} = 1 \text{ if ``good'' is used, 0 if not} \\
    \quad \vdots \\
    S_{iK} = 1 \text{ if ``except'' is used, 0 if not}
  \end{cases}
  \]

Quantities of Interest

  - Computer Science: individual document classifications $D_1, D_2, \ldots, D_L$
  - Social Science: proportions in each category

  \[
  P(D) =
  \begin{pmatrix}
    P(D = -2) \\
    P(D = -1) \\
    P(D = 0) \\
    P(D = 1) \\
    P(D = 2) \\
    P(D = \text{NA}) \\
    P(D = \text{NB})
  \end{pmatrix}
  \]

Issues with Existing Statistical Approaches

  1. Direct Sampling
       - Biased without a random sample
       - Nonrandomness common due to population drift, data subdivisions, etc.
       - (Classification of population documents not necessary)
  2. Aggregation of model-based individual classifications
       - Biased without a random sample
       - Models $P(D \mid S)$, but the world works as $P(S \mid D)$
       - Bias unless
           - $P(D \mid S)$ encompasses the true model.
           - $S$ spans the space of all predictors of $D$ (i.e., all information in the document)
       - Bias even with optimal classification and high % correctly classified

Using Misclassification Rates to Correct Proportions

  - Use some method to classify unlabeled documents
  - Aggregate classifications to category proportions
  - Use labeled set to estimate misclassification rates (by cross-validation)
  - Use misclassification rates to correct proportions
  - Result: vastly improved estimates of category proportions
  - (No new assumptions beyond that of the classifier)
  - (Still requires random samples, individual classification, etc.)

Formalization from Epidemiology (Levy and Kass, 1970)

  Accounting identity for 2 categories:

  \[
  P(\hat{D} = 1) = (\text{sens})\, P(D = 1) + (1 - \text{spec})\, P(D = 2)
  \]

  Solve:

  \[
  P(D = 1) = \frac{P(\hat{D} = 1) - (1 - \text{spec})}{\text{sens} - (1 - \text{spec})}
  \]

  Use this equation to correct $P(\hat{D} = 1)$
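A small numeric sketch of the two-category correction may help. The function name and all numbers are hypothetical, and the code uses $P(D = 2) = 1 - P(D = 1)$, which holds because the two categories are mutually exclusive and exhaustive. It also illustrates how a classifier that is right 82% of the time can still give a badly biased raw proportion.

```python
def corrected_proportion(p_hat, sens, spec):
    """Solve P(D_hat=1) = sens*P(D=1) + (1-spec)*(1-P(D=1)) for P(D=1)."""
    false_pos = 1.0 - spec
    denom = sens - false_pos
    if abs(denom) < 1e-12:
        raise ValueError("uninformative classifier: sens equals 1 - spec")
    return (p_hat - false_pos) / denom

# Hypothetical values: 20% of documents are truly in category 1.
true_p = 0.20
sens, spec = 0.90, 0.80

# Raw classify-and-count proportion implied by the accounting identity.
p_hat = sens * true_p + (1 - spec) * (1 - true_p)

# Share of documents classified correctly under these rates.
accuracy = sens * true_p + spec * (1 - true_p)

estimate = corrected_proportion(p_hat, sens, spec)
print(f"raw={p_hat:.2f} corrected={estimate:.2f} accuracy={accuracy:.2f}")
```

In practice `sens` and `spec` would themselves be estimated from the labeled set, so the corrected value inherits their sampling error.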

Generalizations: J Categories, No Individual Classification (King and Lu, 2008)

  Accounting identity for $J$ categories:

  \[
  P(\hat{D} = j) = \sum_{j'=1}^{J} P(\hat{D} = j \mid D = j')\, P(D = j')
  \]

  Drop the $\hat{D}$ calculation, since $\hat{D} = f(S)$:

  \[
  P(S = s) = \sum_{j'=1}^{J} P(S = s \mid D = j')\, P(D = j')
  \]

  Simplify to an equivalent matrix expression:

  \[
  P(S) = P(S \mid D)\, P(D)
  \]

Estimation

  The matrix expression again:

  \[
  \underbrace{P(S)}_{2^K \times 1} = \underbrace{P(S \mid D)}_{2^K \times J}\; \underbrace{P(D)}_{J \times 1}
  \]

  - $P(D)$: document category proportions (quantity of interest)
  - $P(S)$: word stem profile proportions (estimate in unlabeled set by tabulation)
  - $P(S \mid D)$: word stem profiles, by category (estimate in labeled set by tabulation)

  Alternative symbols (to emphasize the linear equation), solved for the quantity of interest (with no error term):

  \[
  Y = X\beta \quad \Longrightarrow \quad \beta = (X'X)^{-1} X'Y
  \]

  Technical estimation issues:
    - $2^K$ is enormous, far larger than any existing computer
    - $P(S)$ and $P(S \mid D)$ will be too sparse
    - Elements of $P(D)$ must be between 0 and 1 and sum to 1

  Solutions:
    - Use subsets of $S$; average results
    - Equivalent to kernel density smoothing of sparse categorical data
    - Use constrained LS to constrain $P(D)$ to simplex
    - Result: fast, accurate, with very little (human) tuning required
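The estimation recipe above (tabulate, solve on random subsets of $S$, constrain to the simplex, average) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the subset sizes, and the choice of projected gradient descent as the constrained least squares solver are all assumptions, and the toy data are constructed so that $P(S \mid D)$ is identical in the labeled and unlabeled sets while $P(D)$ differs. The sketch also assumes every category appears at least once in the labeled set.

```python
import random
from collections import Counter

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(sorted(v, reverse=True), start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def simplex_least_squares(X, y, steps=500):
    """Minimize ||y - X b||^2 over the simplex by projected gradient descent
    (assumption: one simple way to do constrained LS)."""
    J = len(X[0])
    b = [1.0 / J] * J
    # trace(X'X) bounds the largest eigenvalue, giving a safe step size.
    bound = sum(x * x for row in X for x in row) or 1.0
    for _ in range(steps):
        resid = [sum(x * bj for x, bj in zip(row, b)) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(row[j] * r for row, r in zip(X, resid)) for j in range(J)]
        b = project_to_simplex([bj - g / bound for bj, g in zip(b, grad)])
    return b

def estimate_proportions(labeled, labels, unlabeled, categories,
                         n_subsets=10, subset_size=2, seed=0):
    """Average simplex-constrained solutions of P(S) = P(S|D) P(D) over
    random subsets of the K word stem indicators."""
    rng = random.Random(seed)
    K = len(unlabeled[0])
    totals = [0.0] * len(categories)
    for _ in range(n_subsets):
        cols = rng.sample(range(K), subset_size)

        def key(row):
            return tuple(row[c] for c in cols)

        p_s = Counter(key(r) for r in unlabeled)      # tabulate P(S)
        by_cat = {j: Counter() for j in categories}
        for row, d in zip(labeled, labels):
            by_cat[d][key(row)] += 1                  # tabulate P(S|D)
        n_cat = {j: sum(by_cat[j].values()) for j in categories}
        profiles = sorted(set(p_s) | {p for c in by_cat.values() for p in c})
        y = [p_s[p] / len(unlabeled) for p in profiles]
        X = [[by_cat[j][p] / n_cat[j] for j in categories] for p in profiles]
        b = simplex_least_squares(X, y)
        totals = [t + bj for t, bj in zip(totals, b)]
    return [t / n_subsets for t in totals]

# Toy data: the labeled set is 50% "A", the unlabeled population is 80% "A".
a_block = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
b_block = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
labeled = a_block * 2 + b_block * 2
labels = ["A"] * 10 + ["B"] * 10
unlabeled = a_block * 8 + b_block * 2
estimate = estimate_proportions(labeled, labels, unlabeled, ["A", "B"])
```

No document in `unlabeled` is ever classified individually: the category proportions come straight from the two tabulations, which is why a labeled set with the wrong mix of categories does not by itself bias the result.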

A Nonrandom Hand-coded Sample

  [Figure: two panels. "Differences in Document Category Frequencies", with axes $P(D)$ and $P^h(D)$; "Differences in Word Profile Frequencies", with axes $P(S)$ and $P^h(S)$.]

  All existing methods would fail with these data.

Accurate Estimates

  [Figure: Estimated $P(D)$ plotted against Actual $P(D)$.]

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days

  [Figure: "Affect in Blogs". Estimated $P(D)$ plotted against Actual $P(D)$.]

Out of Sample Validation: Other Examples

  [Figure: three panels, "Congressional Speeches", "Immigration Editorials", and "Enron Emails". Each plots Estimated $P(D)$ against Actual $P(D)$.]

Verbal Autopsy Methods

  The Problem
    - Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies
    - High quality death registration: only 23/192 countries
  Existing Approaches
    - Verbal Autopsy: Ask relatives or caregivers symptom questions
    - Ask physicians to determine cause of death (low intercoder reliability)
    - Apply expert algorithms (high reliability, low validity)
    - Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in population (model-dependent, low accuracy)

112 An Alternative Approach

Document Category, Cause of Death:

$$
D_i = \begin{cases}
1 & \text{if bladder cancer} \\
2 & \text{if cardiovascular disease} \\
3 & \text{if transportation accident} \\
\vdots & \\
J & \text{if infectious respiratory}
\end{cases}
$$

Word Stem Profile, Symptoms:

$$
S_i = \begin{cases}
S_{i1} = 1 \text{ if breathing difficulties, } 0 \text{ if not} \\
S_{i2} = 1 \text{ if stomach ache, } 0 \text{ if not} \\
\vdots \\
S_{ik} = 1 \text{ if diarrhea, } 0 \text{ if not}
\end{cases}
$$

Apply the same methods
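The encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the symptom and cause lists reuse the slide's examples, while the function names and the two records are made up here and are not part of the lecture.

```python
import numpy as np

# Illustrative lists only; a real verbal autopsy instrument has many more items.
SYMPTOMS = ["breathing difficulties", "stomach ache", "diarrhea"]
CAUSES = ["bladder cancer", "cardiovascular disease",
          "transportation accident", "infectious respiratory"]

def encode_symptoms(reported):
    """Return the 0/1 profile S_i: entry k is 1 if symptom k was reported."""
    return np.array([1 if s in reported else 0 for s in SYMPTOMS], dtype=int)

def encode_cause(cause):
    """Return the category D_i as an integer in 1..J."""
    return CAUSES.index(cause) + 1

# Two made-up deaths: caregiver-reported symptoms and a certified cause.
records = [
    ({"breathing difficulties"}, "infectious respiratory"),
    ({"stomach ache", "diarrhea"}, "bladder cancer"),
]
S = np.vstack([encode_symptoms(reported) for reported, _ in records])
D = np.array([encode_cause(cause) for _, cause in records])
```

With deaths coded this way, S plays the role of the word stem profile and D the role of the document category, which is what lets the text-analysis methods carry over.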

113 Validation in Tanzania
[Figure: two panels, Random Split Sample and Community Sample. Axis labels: Estimate, TRUE, Error.]

114 Validation in China
[Figure: three panels, Random Split Sample, City Sample I, and City Sample II. Axis labels: Estimate, TRUE, Error.]

123 Implications for an Individual Classifier

All existing classifiers assume: $P^h(S, D) = P(S, D)$
For a different quantity we assume: $P^h(S \mid D) = P(S \mid D)$
How to use this (less restrictive) assumption for classification (Bayes Theorem):

$$
P(D_\ell = j \mid S_\ell = s_\ell) = \frac{P(S_\ell = s_\ell \mid D_\ell = j)\, P(D_\ell = j)}{P(S_\ell = s_\ell)}
$$

  $P(D_\ell = j \mid S_\ell = s_\ell)$: The goal: individual classification
  $P(D_\ell = j)$: Output from our estimator (described above)
  $P(S_\ell = s_\ell \mid D_\ell = j)$: Nonparametric estimate from labeled set (an assumption)
  $P(S_\ell = s_\ell)$: Nonparametric estimate from unlabeled set (no assumption)
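A minimal sketch of the Bayes Theorem step, assuming simple frequency tabulation as the nonparametric estimate. The function names and toy data are hypothetical, and the category proportions `p_d` are typed in by hand as a stand-in for the output of the estimator; none of this is the lecture's own code.

```python
from collections import Counter

def conditional_profile_probs(labeled):
    """Tabulate P(S = s | D = j) from (profile, category) pairs in the labeled set."""
    by_category = {}
    for profile, category in labeled:
        by_category.setdefault(category, Counter())[tuple(profile)] += 1
    return {
        category: {s: n / sum(counts.values()) for s, n in counts.items()}
        for category, counts in by_category.items()
    }

def marginal_profile_probs(unlabeled):
    """Tabulate P(S = s) from profiles in the unlabeled set."""
    counts = Counter(tuple(profile) for profile in unlabeled)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

def posterior(profile, p_s_given_d, p_d, p_s):
    """Bayes Theorem: P(D = j | S = s) = P(S = s | D = j) P(D = j) / P(S = s)."""
    s = tuple(profile)
    return {j: p_s_given_d[j].get(s, 0.0) * p_d[j] / p_s[s] for j in p_d}

# Toy labeled set: four documents in each of two categories.
labeled = ([((1, 0), 1)] * 3 + [((0, 1), 1)] * 1
           + [((1, 0), 2)] * 1 + [((0, 1), 2)] * 3)
# Toy unlabeled set of 20 documents.
unlabeled = [(1, 0)] * 7 + [(0, 1)] * 13
# Stand-in for the estimated category proportions in the unlabeled set.
p_d = {1: 0.2, 2: 0.8}

p_s_given_d = conditional_profile_probs(labeled)
p_s = marginal_profile_probs(unlabeled)
post = posterior((1, 0), p_s_given_d, p_d, p_s)
```

The toy numbers are chosen so that the three inputs are mutually consistent and the posterior sums to one. With separately estimated inputs that need not hold exactly in finite samples, so renormalizing the posterior is a reasonable extra step.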

126 Classification with Less Restrictive Assumptions
[Figure: two scatterplots comparing the labeled set with the population, one plotting $P^h(D=j)$ and $P(D=j)$, the other plotting $P^h(S_k=1)$ and $P(S_k=1)$.]
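A small numeric sketch of why the conditional assumption is the less restrictive one: if the labeled set and the population share P(S | D) but differ in P(D), the marginals P(S_k = 1) also differ, so an assumption on the joint distribution would fail. All numbers below are invented for illustration.

```python
import numpy as np

# Shared P(S_k = 1 | D = j): rows are word stems k, columns are categories j.
p_s_given_d = np.array([[0.9, 0.2, 0.1],
                        [0.3, 0.8, 0.4]])

# Category proportions differ between the labeled set and the population.
p_d_labeled = np.array([0.6, 0.3, 0.1])
p_d_population = np.array([0.1, 0.3, 0.6])

# Law of total probability: P(S_k = 1) = sum_j P(S_k = 1 | D = j) P(D = j).
p_s_labeled = p_s_given_d @ p_d_labeled
p_s_population = p_s_given_d @ p_d_population
```

Here the first word stem appears in 61% of labeled documents but only 21% of population documents, even though its conditional probabilities are identical in both.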

131 Classification with Less Restrictive Assumptions
[Figure: two panels, SVM and Nonparametric, each plotting $\hat{P}(D=j)$ and $P(D=j)$.]

Percent correctly classified:
  SVM (best existing classifier): 40.5%
  Our nonparametric approach: 59.8%

132 Misclassification Matrix for Blog Posts
[Table: misclassification matrix. Surviving labels: NA, NB, and $P(D_1)$; cell values were lost in extraction.]

135 SIMEX Analysis of Not a Blog Category
[Figure: Category NB plotted against α.]

136 SIMEX Analysis of Other Categories
[Figure: six panels, one per category, each plotted against α. Surviving panel labels: Category 2, Category 0, Category NA.]

137 For more information


More information

Tracking #metoo on Twitter to Predict Engagement in the Movement

Tracking #metoo on Twitter to Predict Engagement in the Movement Tracking #metoo on Twitter to Predict Engagement in the Movement Ana Tarano (atarano) and Dana Murphy (d km0713) Abstract: In the past few months, the social movement #metoo has garnered incredible social

More information

Final Examination. Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring Tuesday June 7, 2011

Final Examination. Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring Tuesday June 7, 2011 Department of Computer Science and Engineering CSE 291 University of California, San Diego Spring 2011 Your name: Final Examination Tuesday June 7, 2011 Instructions: Answer each question in the space

More information

Forecasting for Short-Lived Products

Forecasting for Short-Lived Products HP Strategic Planning and Modeling Group Forecasting for Short-Lived Products Jim Burruss Dorothea Kuettner Hewlett-Packard, Inc. July, 22 Revision 2 About the Authors Jim Burruss is a Process Technology

More information

Analysis of Microarray Data

Analysis of Microarray Data Analysis of Microarray Data Lecture 1: Experimental Design and Data Normalization George Bell, Ph.D. Senior Bioinformatics Scientist Bioinformatics and Research Computing Whitehead Institute Outline Introduction

More information

A simulation approach for evaluating hedonic wage models ability to recover marginal values for risk reductions

A simulation approach for evaluating hedonic wage models ability to recover marginal values for risk reductions A simulation approach for evaluating hedonic wage models ability to recover marginal values for risk reductions Xingyi S. Puckett PhD. Candidate Center for Environmental and Resource Economic Policy North

More information

Masters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web:

Masters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web: Masters in Business Statistics (MBS) - 2014/2015 Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa Web: www.mrt.ac.lk Course Coordinator: Prof. T S G Peiris Prof. in Applied

More information

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN Predictive Modeling using SAS Enterprise Miner and SAS/STAT : Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN 1 Overview This presentation will: Provide a brief introduction of how to set

More information

Analysing the Immune System with Fisher Features

Analysing the Immune System with Fisher Features Analysing the Immune System with John Department of Computer Science University College London WITMSE, Helsinki, September 2016 Experiment β chain CDR3 TCR repertoire sequenced from CD4 spleen cells. unimmunised

More information

15. Text Data Visualization. Prof. Tulasi Prasad Sariki SCSE, VIT, Chennai

15. Text Data Visualization. Prof. Tulasi Prasad Sariki SCSE, VIT, Chennai 15. Text Data Visualization Prof. Tulasi Prasad Sariki SCSE, VIT, Chennai www.learnersdesk.weebly.com Why Visualize Text? Understanding get the gist of a document Grouping cluster for overview or classifcation

More information

1 PEW RESEARCH CENTER

1 PEW RESEARCH CENTER 1 Methodology This report contains two different analyses of Twitter hashtags: an analysis of the volume of tweets over time mentioning certain hashtags and a content analysis of the major topics mentioned

More information

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 02 Data Mining Process Welcome to the lecture 2 of

More information

Rebuilding Reputation won t Work Without the Full Picture

Rebuilding Reputation won t Work Without the Full Picture Rebuilding Reputation won t Work Without the Full Picture 11. Agenda Setting Conference, October 28, 2010 www.mediatenor.com www.agendasetting.com What you know of Pisa 4 And what is really there 5 Strategic

More information

Reaction Paper Regarding the Flow of Influence and Social Meaning Across Social Media Networks

Reaction Paper Regarding the Flow of Influence and Social Meaning Across Social Media Networks Reaction Paper Regarding the Flow of Influence and Social Meaning Across Social Media Networks Mahalia Miller Daniel Wiesenthal October 6, 2010 1 Introduction One topic of current interest is how language

More information

Automatic Facial Expression Recognition

Automatic Facial Expression Recognition Automatic Facial Expression Recognition Huchuan Lu, Pei Wu, Hui Lin, Deli Yang School of Electronic and Information Engineering, Dalian University of Technology Dalian, Liaoning Province, China lhchuan@dlut.edu.cn

More information

Inventory Lot Sizing with Supplier Selection

Inventory Lot Sizing with Supplier Selection Inventory Lot Sizing with Supplier Selection Chuda Basnet Department of Management Systems The University of Waikato, Private Bag 315 Hamilton, New Zealand chuda@waikato.ac.nz Janny M.Y. Leung Department

More information

Characterizing the long-term PM mortality response function: Comparing the strengths and weaknesses of research synthesis approaches

Characterizing the long-term PM mortality response function: Comparing the strengths and weaknesses of research synthesis approaches Characterizing the long-term PM 2.5 - mortality response function: Comparing the strengths and weaknesses of research synthesis approaches Neal Fann*, Elisabeth Gilmore & Katherine Walker* 1 * Usual institutional

More information

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

Outline. Analysis of Microarray Data. Most important design question. General experimental issues Outline Analysis of Microarray Data Lecture 1: Experimental Design and Data Normalization Introduction to microarrays Experimental design Data normalization Other data transformation Exercises George Bell,

More information

MODULE 1 LECTURE NOTES 2 MODELING OF WATER RESOURCES SYSTEMS

MODULE 1 LECTURE NOTES 2 MODELING OF WATER RESOURCES SYSTEMS 1 MODULE 1 LECTURE NOTES 2 MODELING OF WATER RESOURCES SYSTEMS INTRODUCTION In this lecture we will discuss about the concept of a system, classification of systems and modeling of water resources systems.

More information

Data Preprocessing, Sentiment Analysis & NER On Twitter Data.

Data Preprocessing, Sentiment Analysis & NER On Twitter Data. IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727 PP 73-79 www.iosrjournals.org Data Preprocessing, Sentiment Analysis & NER On Twitter Data. Mr.SanketPatil, Prof.VarshaWangikar,

More information

Supervised Learning Using Artificial Prediction Markets

Supervised Learning Using Artificial Prediction Markets Supervised Learning Using Artificial Prediction Markets Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, FSU Dept. of Scientific Computing 1 Main Contributions

More information

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring

ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring 1 Kavyashree M Bandekar, 2 Maddala Tejasree, 3 Misba Sultana S N, 4 Nayana G K, 5 Harshavardhana Doddamani 1, 2, 3, 4 Engineering

More information

A Fuzzy Multiple Attribute Decision Making Model for Benefit-Cost Analysis with Qualitative and Quantitative Attributes

A Fuzzy Multiple Attribute Decision Making Model for Benefit-Cost Analysis with Qualitative and Quantitative Attributes A Fuzzy Multiple Attribute Decision Making Model for Benefit-Cost Analysis with Qualitative and Quantitative Attributes M. Ghazanfari and M. Mellatparast Department of Industrial Engineering Iran University

More information

Learn What s New. Statistical Software

Learn What s New. Statistical Software Statistical Software Learn What s New Upgrade now to access new and improved statistical features and other enhancements that make it even easier to analyze your data. The Assistant Let Minitab s Assistant

More information