Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis I: How to Read 100 Million Blogs (& Classify Deaths Without Physicians)

Gary King (Harvard, IQSS)
April 25, 2010

Copyright 2010 Gary King, All Rights Reserved.

References

- Daniel Hopkins and Gary King. Extracting Systematic Social Science Meaning from Text. 54, 1 (January 2010). Commercialized via:
- Gary King and Ying Lu. Verbal Autopsy Methods with Multiple Causes of Death. Statistical Science 23, 1 (February 2008). In use by (among others):
- Copies at
- (play after ad) (play 10:00-12:06)

Inputs and Target Quantities of Interest

Input Data:
- Large set of text documents (blogs, web pages, emails, etc.)
- A set of (mutually exclusive and exhaustive) categories
- A small set of documents hand-coded into the categories

Quantities of interest:
- Individual document classifications (spam filters)
- Proportion in each category (proportion which is spam)

Estimation:
- Can get the 2nd by counting the 1st (turns out not to be necessary!)
- High classification accuracy does not imply unbiased category proportions
- Different methods optimize estimation of the different quantities

Blogs as a Running Example

- Blogs (web logs): web version of a daily diary, with posts listed in reverse chronological order.
- "We are living through the largest expansion of expressive capability in the history of the human race"
- Measures classical notion of public opinion: active public expressions designed to influence policy and politics (previously: strikes, boycotts, demonstrations, editorials)
- (Public opinion surveys)

One specific quantity of interest

Daily opinion about President Bush and 2008 candidates among all English language blog posts

Specific categories:

Label  Category
-2     extremely negative
-1     negative
 0     neutral
 1     positive
 2     extremely positive
NA     no opinion expressed
NB     not a blog

Hard case:
- Part ordinal, part nominal categorization
- Sentiment categorization is more difficult than topic classification
- Informal language: "my crunchy gf thinks dubya hid the wmd's, :)!"
- Little common internal structure (no inverted pyramid)

Example of output: John Kerry's Botched Joke

"You know, education if you make the most of it... you can do well. If you don't, you get stuck in Iraq."

[Figure: Affect Towards John Kerry. Axes: Proportion, by month from Sept to Mar.]

Representing Text as Numbers

- Filter: choose English language blogs that mention Bush
- Preprocess: convert to lower case, remove punctuation, keep only word stems ("consist", "consisted", "consistency" → "consist")
- Code variables: presence/absence of unique unigrams, bigrams, trigrams
- Our Example:
  - Our 10,771 blog posts about Bush and Clinton: 201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.
  - Keep only unigrams in > 1% and < 99% of documents: 3,672 variables
  - Groups infinite possible posts into only 2^3,672 distinct types
  - More sophisticated summaries: we've used, but they're not necessary
- (More systematic than 1 dummy variable per document)

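The preprocessing and coding steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for the blog data: `crude_stem` is a toy suffix stripper standing in for a real stemmer, the function names and the three example posts are made up, and only unigrams are coded.

```python
import string

def crude_stem(word):
    """Toy suffix stripper (an assumption); a real pipeline would use a proper stemmer."""
    for suffix in ("ency", "ed", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, strip punctuation, and reduce each token to a stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(tok) for tok in text.split()]

def build_profiles(docs, min_share=0.01, max_share=0.99):
    """Return (vocabulary, binary profiles), keeping stems whose document
    share lies strictly between min_share and max_share."""
    stem_sets = [set(preprocess(d)) for d in docs]
    counts = {}
    for stems in stem_sets:
        for s in stems:
            counts[s] = counts.get(s, 0) + 1
    n = len(docs)
    vocab = sorted(s for s, c in counts.items() if min_share < c / n < max_share)
    profiles = [[1 if s in stems else 0 for s in vocab] for stems in stem_sets]
    return vocab, profiles

# Made-up example posts.
docs = [
    "Bush consisted of awful ideas.",
    "Bush is good; consistency matters!",
    "Bush: good, not awful",
]
vocab, profiles = build_profiles(docs)
for stem_list, row in zip([vocab] * len(docs), profiles):
    print(dict(zip(stem_list, row)))
```

Because "bush" appears in every post, it carries no information and is dropped by the 99% cutoff; each remaining stem becomes one 0/1 variable in the profile.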
Notation

Document Category:

\[
D_i =
\begin{cases}
-2 & \text{extremely negative} \\
-1 & \text{negative} \\
0 & \text{neutral} \\
1 & \text{positive} \\
2 & \text{extremely positive} \\
\text{NA} & \text{no opinion expressed} \\
\text{NB} & \text{not a blog}
\end{cases}
\]

Word Stem Profile:

\[
S_i =
\begin{cases}
S_{i1} = 1 \text{ if ``awful'' is used, } 0 \text{ if not} \\
S_{i2} = 1 \text{ if ``good'' is used, } 0 \text{ if not} \\
\qquad \vdots \\
S_{iK} = 1 \text{ if ``except'' is used, } 0 \text{ if not}
\end{cases}
\]

Quantities of Interest

Computer Science: individual document classifications

\[
D_1, D_2, \ldots, D_L
\]

Social Science: proportions in each category

\[
P(D) =
\begin{pmatrix}
P(D = -2) \\
P(D = -1) \\
P(D = 0) \\
P(D = 1) \\
P(D = 2) \\
P(D = \text{NA}) \\
P(D = \text{NB})
\end{pmatrix}
\]

Issues with Existing Statistical Approaches

1. Direct Sampling
   - Biased without a random sample
   - Nonrandomness common due to population drift, data subdivisions, etc.
   - (Classification of population documents not necessary)
2. Aggregation of model-based individual classifications
   - Biased without a random sample
   - Models $P(D \mid S)$, but the world works as $P(S \mid D)$
   - Bias unless
     - $P(D \mid S)$ encompasses the true model.
     - $S$ spans the space of all predictors of $D$ (i.e., all information in the document)
   - Bias even with optimal classification and high % correctly classified

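A small arithmetic sketch shows why a high percent correctly classified need not give unbiased proportions. The function name and the numbers (a 10% true share, sensitivity 0.80, specificity 0.90) are made up for illustration. Sensitivity is the share of true category-1 documents classified as 1; specificity is the share of all other documents classified as not 1.

```python
def classify_and_count(true_share, sens, spec):
    """Return the share of documents labeled category 1 and the overall accuracy."""
    # Labeled 1 = true positives + false positives.
    est_share = sens * true_share + (1 - spec) * (1 - true_share)
    # Correctly classified = true positives + true negatives.
    accuracy = sens * true_share + spec * (1 - true_share)
    return est_share, accuracy

est, acc = classify_and_count(true_share=0.10, sens=0.80, spec=0.90)
print(f"accuracy = {acc:.2f}, estimated share = {est:.2f}, true share = 0.10")
```

With these numbers the classifier gets 89% of documents right, yet counting its labels puts the category at 17% of documents when the truth is 10%. The false positives from the large category swamp the small one.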
Using Misclassification Rates to Correct Proportions

- Use some method to classify unlabeled documents
- Aggregate classifications to category proportions
- Use labeled set to estimate misclassification rates (by cross-validation)
- Use misclassification rates to correct proportions
- Result: vastly improved estimates of category proportions
- (No new assumptions beyond that of the classifier)
- (Still requires random samples, individual classification, etc.)

Formalization from Epidemiology (Levy and Kass, 1970)

Accounting identity for 2 categories:

\[
P(\hat{D} = 1) = (\text{sens})\, P(D = 1) + (1 - \text{spec})\, P(D = 2)
\]

Solve:

\[
P(D = 1) = \frac{P(\hat{D} = 1) - (1 - \text{spec})}{\text{sens} - (1 - \text{spec})}
\]

Use this equation to correct $P(\hat{D} = 1)$

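The two-category correction is a one-line computation. A minimal sketch follows, with a made-up function name and made-up inputs: a raw classified share of 0.17 from a classifier with sensitivity 0.80 and specificity 0.90.

```python
def correct_proportion(raw_share, sens, spec):
    """Solve the two-category accounting identity for P(D = 1)."""
    false_pos_rate = 1 - spec
    denom = sens - false_pos_rate
    if denom == 0:
        # sens == 1 - spec means the classifier's output is unrelated to D.
        raise ValueError("classifier is uninformative: sens equals 1 - spec")
    return (raw_share - false_pos_rate) / denom

corrected = correct_proportion(raw_share=0.17, sens=0.80, spec=0.90)
print(f"corrected P(D = 1) = {corrected:.2f}")
```

The correction maps the raw 0.17 back to 0.10, since (0.17 - 0.10) / (0.80 - 0.10) = 0.10. In practice sensitivity and specificity are themselves estimates from the labeled set, so the corrected value inherits their sampling error.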
Generalizations: J Categories, No Individual Classification (King and Lu, 2008)

Accounting identity for J categories:

\[
P(\hat{D} = j) = \sum_{j'=1}^{J} P(\hat{D} = j \mid D = j')\, P(D = j')
\]

Drop the $\hat{D}$ calculation, since $\hat{D} = f(S)$:

\[
P(S = s) = \sum_{j'=1}^{J} P(S = s \mid D = j')\, P(D = j')
\]

Simplify to an equivalent matrix expression:

\[
P(S) = P(S \mid D)\, P(D)
\]

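To make the matrix expression concrete, here is a tiny forward computation with K = 2 word stems (so 4 possible profiles) and J = 2 categories. All numbers are made up; each column of `p_s_given_d` is a distribution over profiles within one category.

```python
# Rows are the 2^K = 4 profiles (0,0), (0,1), (1,0), (1,1); columns are categories.
p_s_given_d = [
    [0.1, 0.4],
    [0.2, 0.3],
    [0.3, 0.2],
    [0.4, 0.1],
]
p_d = [0.25, 0.75]  # category proportions, the quantity of interest

# P(S = s) = sum over j' of P(S = s | D = j') * P(D = j')
p_s = [sum(row[j] * p_d[j] for j in range(len(p_d))) for row in p_s_given_d]
print([round(p, 3) for p in p_s])
```

Estimation runs this in reverse: $P(S)$ and $P(S \mid D)$ are tabulated from data, and $P(D)$ is the unknown.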
Estimation

The matrix expression again:

\[
\underset{2^K \times 1}{P(S)} = \underset{2^K \times J}{P(S \mid D)} \;\; \underset{J \times 1}{P(D)}
\]

- $P(D)$: document category proportions (quantity of interest)
- $P(S)$: word stem profile proportions (estimate in unlabeled set by tabulation)
- $P(S \mid D)$: word stem profiles, by category (estimate in labeled set by tabulation)

Alternative symbols (to emphasize the linear equation), solved for the quantity of interest (with no error term):

\[
Y = X\beta \quad \Longrightarrow \quad \beta = (X'X)^{-1} X'Y
\]

Technical estimation issues:
- $2^K$ is enormous, far larger than any existing computer can handle
- $P(S)$ and $P(S \mid D)$ will be too sparse
- Elements of $P(D)$ must be between 0 and 1 and sum to 1

Solutions:
- Use subsets of $S$; average results
  - Equivalent to kernel density smoothing of sparse categorical data
- Use constrained LS to constrain $P(D)$ to the simplex

Result: fast, accurate, with very little (human) tuning required

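The whole procedure (random subsets of S, tabulation, constrained least squares, averaging) can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' implementation: projected gradient descent stands in for whatever constrained LS solver is actually used, the subset size and number of subsets are arbitrary, and all function names and the toy data are made up.

```python
import random
from collections import Counter

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        running += ui
        shift = (running - 1.0) / i
        if ui - shift > 0:
            theta = shift
    return [max(x - theta, 0.0) for x in v]

def simplex_least_squares(X, y, steps=1000):
    """Minimise ||X b - y||^2 over the simplex by projected gradient descent."""
    n_rows, n_cats = len(X), len(X[0])
    # 2 * (squared Frobenius norm) bounds the gradient's Lipschitz constant.
    lipschitz = 2.0 * sum(x * x for row in X for x in row) or 1.0
    b = [1.0 / n_cats] * n_cats
    for _ in range(steps):
        resid = [sum(X[r][j] * b[j] for j in range(n_cats)) - y[r]
                 for r in range(n_rows)]
        grad = [2.0 * sum(X[r][j] * resid[r] for r in range(n_rows))
                for j in range(n_cats)]
        b = project_to_simplex([b[j] - grad[j] / lipschitz for j in range(n_cats)])
    return b

def tabulate(profiles, cols):
    """Relative frequency of each pattern observed on the chosen columns."""
    counts = Counter(tuple(p[c] for c in cols) for p in profiles)
    return {pattern: n / len(profiles) for pattern, n in counts.items()}

def estimate_proportions(labeled, labels, unlabeled, categories,
                         subset_size=2, n_subsets=20, seed=0):
    """Average simplex-constrained LS estimates of P(D) over random subsets of S."""
    rng = random.Random(seed)
    total = [0.0] * len(categories)
    for _ in range(n_subsets):
        cols = rng.sample(range(len(unlabeled[0])), subset_size)
        p_s = tabulate(unlabeled, cols)                      # Y: P(S), unlabeled set
        by_cat = [tabulate([s for s, d in zip(labeled, labels) if d == c], cols)
                  for c in categories]                       # X: P(S | D), labeled set
        patterns = sorted(set(p_s).union(*by_cat))
        X = [[table.get(pat, 0.0) for table in by_cat] for pat in patterns]
        y = [p_s.get(pat, 0.0) for pat in patterns]
        estimate = simplex_least_squares(X, y)
        total = [t + e for t, e in zip(total, estimate)]
    return [t / n_subsets for t in total]

# Toy data (made up): 3 binary stems, 2 categories.
neg = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
pos = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
labeled = neg + pos
labels = ["neg"] * 4 + ["pos"] * 4   # hand-coded set is split 50/50
unlabeled = neg + pos * 3            # population is split 25/75
estimate = estimate_proportions(labeled, labels, unlabeled, ["neg", "pos"])
print([round(p, 3) for p in estimate])
```

In the toy data the hand-coded set is split 50/50 while the population is 25/75, and the estimate should still land at about 0.25 and 0.75. The method only needs the profile distribution within each category to match between the two sets; no individual document is ever classified.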
A Nonrandom Hand-coded Sample

[Figure, two panels. Differences in Document Category Frequencies: axes $P(D)$ and $P^h(D)$. Differences in Word Profile Frequencies: axes $P(S)$ and $P^h(S)$.]

All existing methods would fail with these data.

Accurate Estimates

[Figure: axes Estimated $P(D)$ and Actual $P(D)$.]

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days

[Figure: Affect in Blogs. Axes: Estimated $P(D)$ and Actual $P(D)$.]

Out of Sample Validation: Other Examples

[Figure, three panels: Congressional Speeches, Immigration Editorials, Enron Emails. Axes in each panel: Estimated $P(D)$ and Actual $P(D)$.]

Verbal Autopsy Methods

The Problem:
- Policymakers need the cause-specific mortality rate to set research goals, budgetary priorities, and ameliorative policies
- High quality death registration: only 23/192 countries

Existing Approaches:
- Verbal Autopsy: Ask relatives or caregivers symptom questions
- Ask physicians to determine cause of death (low intercoder reliability)
- Apply expert algorithms (high reliability, low validity)
- Find deaths with medically certified causes from a local hospital, trace caregivers to their homes, ask the same symptom questions, and statistically classify deaths in population (model-dependent, low accuracy)

112 An Alternative Approach
Document Category, Cause of Death:
\[ D_i = \begin{cases} 1 & \text{if bladder cancer} \\ 2 & \text{if cardiovascular disease} \\ 3 & \text{if transportation accident} \\ \vdots & \\ J & \text{if infectious respiratory} \end{cases} \]
Word Stem Profile, Symptoms:
\[ S_i = \begin{cases} S_{i1} = 1 \text{ if breathing difficulties, } 0 \text{ if not} \\ S_{i2} = 1 \text{ if stomach ache, } 0 \text{ if not} \\ \quad\vdots \\ S_{iK} = 1 \text{ if diarrhea, } 0 \text{ if not} \end{cases} \]
Apply the same methods
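A rough sketch of what "apply the same methods" could look like on symptom data, assuming the estimator referred to is the one that solves P(S) = P(S | D) P(D) for the category proportions P(D). The function name, the simulated symptom probabilities in `theta`, the use of single-symptom prevalences rather than full symptom profiles, and the weighted sum-to-one row are all illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

def estimate_proportions(S_labeled, D_labeled, S_unlabeled, n_categories):
    """Estimate P(D) in the unlabeled set from
    P(S_k = 1) = sum_j P(S_k = 1 | D = j) P(D = j)."""
    # K x J matrix of P(S_k = 1 | D = j), estimated from the labeled set.
    cond = np.column_stack([
        S_labeled[D_labeled == j].mean(axis=0) for j in range(n_categories)
    ])
    # Length-K vector of P(S_k = 1), estimated from the unlabeled set.
    marg = S_unlabeled.mean(axis=0)
    # An extra, heavily weighted row pushes the solution toward summing to one.
    A = np.vstack([cond, 10.0 * np.ones((1, n_categories))])
    b = np.append(marg, 10.0)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Clip small negative values and renormalize to a proper distribution.
    p = np.clip(p, 0.0, None)
    return p / p.sum()

# Hypothetical simulation: J = 3 causes of death, K = 4 binary symptoms.
rng = np.random.default_rng(0)
theta = np.array([[0.9, 0.1, 0.2, 0.5],   # P(S_k = 1 | D = 0)
                  [0.2, 0.8, 0.1, 0.5],   # P(S_k = 1 | D = 1)
                  [0.1, 0.2, 0.9, 0.5]])  # P(S_k = 1 | D = 2)

def simulate(n, p_D):
    D = rng.choice(len(p_D), size=n, p=p_D)
    S = (rng.random((n, theta.shape[1])) < theta[D]).astype(int)
    return S, D

p_labeled = np.array([1 / 3, 1 / 3, 1 / 3])  # hospital (labeled) sample
p_true = np.array([0.6, 0.3, 0.1])           # population (unlabeled) sample
S_lab, D_lab = simulate(20000, p_labeled)
S_unl, _ = simulate(20000, p_true)

p_hat = estimate_proportions(S_lab, D_lab, S_unl, 3)
print(np.round(p_hat, 3))
```

The labeled sample here deliberately has a different cause-of-death distribution from the population; the sketch relies only on P(S | D) being shared between the two.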
113 Validation in Tanzania
[Figure: panels "Random Split Sample" and "Community Sample"; axis labels: Estimate, TRUE, Error.]
114 Validation in China
[Figure: panels "Random Split Sample", "City Sample I", and "City Sample II"; axis labels: Estimate, TRUE, Error.]
123 Implications for an Individual Classifier
- All existing classifiers assume: \( P^h(S, D) = P(S, D) \)
- For a different quantity we assume: \( P^h(S \mid D) = P(S \mid D) \)
- How to use this (less restrictive) assumption for classification (Bayes' theorem):
\[ P(D_\ell = j \mid S_\ell = s_\ell) = \frac{P(S_\ell = s_\ell \mid D_\ell = j)\, P(D_\ell = j)}{P(S_\ell = s_\ell)} \]
- \( P(D_\ell = j \mid S_\ell = s_\ell) \): the goal, individual classification
- \( P(D_\ell = j) \): output from our estimator (described above)
- \( P(S_\ell = s_\ell \mid D_\ell = j) \): nonparametric estimate from labeled set (an assumption)
- \( P(S_\ell = s_\ell) \): nonparametric estimate from unlabeled set (no assumption)
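The Bayes' theorem step on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `posterior`, the toy arrays, and the prior `p_D` (standing in for the output of the proportion estimator) are hypothetical, and the profile probabilities are plain relative frequencies of exact profile matches.

```python
import numpy as np

def posterior(profile, S_labeled, D_labeled, S_unlabeled, p_D):
    """P(D = j | S = s) = P(S = s | D = j) P(D = j) / P(S = s) for each j."""
    s = np.asarray(profile)
    # P(S = s | D = j): share of labeled documents in category j
    # whose profile matches s exactly (relies on the labeled set).
    lik = np.array([
        (S_labeled[D_labeled == j] == s).all(axis=1).mean()
        for j in range(len(p_D))
    ])
    # P(S = s): share of unlabeled documents with this profile.
    p_s = (S_unlabeled == s).all(axis=1).mean()
    return lik * np.asarray(p_D) / p_s

# Toy data: two categories, two binary symptoms / word stems.
S_labeled = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 1], [0, 1]])
D_labeled = np.array([0, 0, 0, 1, 1, 1])
S_unlabeled = np.array([[1, 0]] * 6 + [[0, 1]] * 4)
p_D = [0.8, 0.2]  # assumed output of the proportion estimator

post = posterior([1, 0], S_labeled, D_labeled, S_unlabeled, p_D)
print(post, post.argmax())
```

The labeled set is balanced across the two categories, but the prior in the formula is the estimated population P(D), so the posterior leans toward category 0 more than the labeled-set frequencies alone would suggest. A profile that never occurs in the unlabeled set would make the denominator zero; a real implementation would need to handle that case.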
126 Classification with Less Restrictive Assumptions
[Figure: two panels, one with axes \( P^h(D=j) \) and \( P(D=j) \), the other with axes \( P^h(S_k=1) \) and \( P(S_k=1) \).]
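The quantities compared in this figure connect directly to the two assumptions on the preceding slide. As a short sketch using only standard probability identities (not taken from the slides), summing the joint-equality assumption over S or over D shows what it requires beyond the conditional assumption:

```latex
\begin{align*}
P^h(S, D) = P(S, D)
  &\;\Rightarrow\; P^h(D = j) = \sum_{s} P^h(S = s, D = j)
     = \sum_{s} P(S = s, D = j) = P(D = j), \\
  &\;\Rightarrow\; P^h(S = s) = \sum_{j} P^h(S = s, D = j)
     = \sum_{j} P(S = s, D = j) = P(S = s), \\
  &\;\Rightarrow\; P^h(S \mid D) = \frac{P^h(S, D)}{P^h(D)}
     = \frac{P(S, D)}{P(D)} = P(S \mid D).
\end{align*}
```

So the joint assumption implies the conditional one plus equality of both marginals, whereas assuming only \( P^h(S \mid D) = P(S \mid D) \) leaves \( P^h(D) \) and \( P^h(S) \) free to differ from their population counterparts.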
131 Classification with Less Restrictive Assumptions
[Figure: \( \hat{P}(D=j) \) against \( P(D=j) \) for SVM and for the nonparametric approach.]
Percent correctly classified:
- SVM (best existing classifier): 40.5%
- Our nonparametric approach: 59.8%
132 Misclassification Matrix for Blog Posts
[Table: misclassification matrix with row and column labels including NA and NB, plus a \( P(D_1) \) column.]
133 SIMEX Analysis of "Not a Blog" Category
[Figure: Category NB plotted against α.]
136 SIMEX Analysis of Other Categories
[Figure: six panels, one per remaining category (Category NA among them), each plotted against α.]
137 For more information Gary King (Harvard, IQSS) Text Analysis 31 / 1
Supervised Learning Using Artificial Prediction Markets Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, FSU Dept. of Scientific Computing 1 Main Contributions
More informationML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring
ML Methods for Solving Complex Sorting and Ranking Problems in Human Hiring 1 Kavyashree M Bandekar, 2 Maddala Tejasree, 3 Misba Sultana S N, 4 Nayana G K, 5 Harshavardhana Doddamani 1, 2, 3, 4 Engineering
More informationA Fuzzy Multiple Attribute Decision Making Model for Benefit-Cost Analysis with Qualitative and Quantitative Attributes
A Fuzzy Multiple Attribute Decision Making Model for Benefit-Cost Analysis with Qualitative and Quantitative Attributes M. Ghazanfari and M. Mellatparast Department of Industrial Engineering Iran University
More informationLearn What s New. Statistical Software
Statistical Software Learn What s New Upgrade now to access new and improved statistical features and other enhancements that make it even easier to analyze your data. The Assistant Let Minitab s Assistant
More information