The Foundations of Similarity and their Applications to NLP Tasks
1 The Foundations of Similarity and their Applications to NLP Tasks Enrique Amigó Julio Gonzalo Felisa Verdejo Jesús Giménez Damiano Spina Anselmo Peñas Victor Fresno Fernando Giner Guillermo Garrido Fernando López
2 The Problem
3 The Problem (IR)
4 The Problem (Clustering)
5 The Problem (Machine Translation and Summarization)
6 NLP PROBLEMS: Textual Entailment, Semantic Textual Similarity, Spam Detection, Novelty Detection, Instance Based Learning
7 The Problem: The appropriateness of a similarity measure depends on the task. The appropriateness of a similarity measure depends on the data set... Therefore, evaluating/combining measures over a training data set is not enough.
8 The Goal: To study the basic properties of similarity measures. What is the best measure for a task? What do we need to assume in order to exploit measures?
9 Methodology: Certain measures in certain conditions exhibit certain behavior (formal analysis and theoretical proofs). Do the conditions match real scenarios? (Experiments)
10
TASK   | CORPUS    | TEXT PAIRS / GROUPS                           | MEASURES
IR     | GOV       | queries; 100 docs returned by the best system | 60
CLUST. | WEPS      | topics; 100 randomly selected pairs           | 167
MT     | NIST 2005 | text sources and systems                      | 60
AS     | DUC 2005  | text sources and systems                      | 30
TE     | RTE       | 800 pairs (four data sets)                    | 102
STS    | SEMEVAL   | five data sets                                | 88
12 Similarity Definition
13 High Similarity Theorem: Same eye color is always positive evidence of being brothers.
15 Similarity Definition: Randomly adding or removing a word.
16 High Similarity Theorem: Consequences. Over random text samples: any IR system has a decreasing Precision/Recall curve; increasing the coherence of a clustering according to any similarity measure increases the cluster quality; improving an MT or AS system according to any measure increases the quality...
17 High Similarity Theorem: Consequences. This is the problem, not the task. Over random text samples: any IR system has a decreasing Precision/Recall curve; increasing the coherence of a clustering according to any similarity measure increases the cluster quality; improving an MT or AS system according to any measure increases the quality...
18 High Similarity Theorem: Experiment. For each measure: I. We take all similarity instances. II. We sort them by the estimated similarity. III. We compute the assessed similarity above each rank position. IV. We compute the correlation with the rank position. V. We compute the measure granularity (number of similarity ranges).
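The steps above can be sketched in code; the names `rank_correlation`, `pearson`, and `granularity` are hypothetical, step III is read as the average assessed similarity above each rank position, and this is only a minimal sketch under those assumptions, not the authors' exact implementation.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_correlation(instances):
    """instances: (estimated, assessed) similarity pairs.
    Sort by estimated similarity (step II), compute the average assessed
    similarity above each rank position (step III), and correlate it with
    the rank position (step IV)."""
    ranked = sorted(instances, key=lambda p: p[0], reverse=True)
    running, averages = 0.0, []
    for i, (_, assessed) in enumerate(ranked, start=1):
        running += assessed
        averages.append(running / i)  # avg assessed similarity above rank i
    ranks = list(range(1, len(ranked) + 1))
    return pearson(ranks, averages)

def granularity(instances):
    """Step V: number of distinct estimated-similarity values (similarity ranges)."""
    return len({est for est, _ in instances})
```

For a measure whose estimates agree with the assessments, the average assessed similarity decreases down the ranking, so the correlation with rank is strongly negative.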
19 Experiments: Clustering
20 Experiments: Textual Entailment
21 Experiments: Information Retrieval
22 Experiments: Machine Translation
23 Experiments: Summarization. Precision-oriented ROUGE.
24 Experiments: Summarization. I don't know.
25 Strictness Theorem: The number of salient students can predict the level of schools.
27 Strictness Theorem: Assessed similarity distribution. Random scores for highly similar samples; low scores for low-similarity samples. The average assessed similarity correlates with the amount of random scores.
30 Strictness Theorem: Consequences. Assuming a fixed similarity deviation in a set of samples, a strict measure: predicts the average relevance of IR system outputs; predicts the optimal grouping threshold; predicts textual entailment at the document level; predicts the average quality of MT and AS systems.
31 Strictness Theorem: Experiment. For each measure: we divide the data set into subsets; we estimate the strictness of the measure; we estimate the correlation between the average estimated and the average assessed similarity across subsets.
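The subset step of this experiment can be sketched as follows; `strictness_experiment` and the random equal-size split are assumptions for illustration, not the authors' protocol.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strictness_experiment(instances, n_subsets=10, seed=0):
    """instances: (estimated, assessed) similarity pairs.
    Split the data set into random subsets, then correlate the per-subset
    average estimated similarity with the per-subset average assessed
    similarity."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    avg_est = [sum(e for e, _ in s) / len(s) for s in subsets]
    avg_ass = [sum(a for _, a in s) / len(s) for s in subsets]
    return pearson(avg_est, avg_ass)
```

A strict measure is one for which this subset-level correlation is high, which is what licenses the average-quality predictions listed on the previous slide.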
32 Experiments: Information Retrieval
33 Experiments: Textual Entailment
34 Experiments: Clustering
35 Experiments: Semantic Textual Similarity
36 Experiments: Summarization
37 Experiments: Machine Translation
38 Conclusions. Using a strict measure: we can predict the average relevance of an IR ranking (0.4 Pearson); we can predict the number of clusters (0.6 Spearman); we can predict the average quality of systems in MT and AS. We find the same pattern in all scenarios.
39 Observations: Combining diverse classifiers improves the classification results. Combining diverse similarity measures improves MT automatic evaluation.
40 Heterogeneity Theorem: If your mother, your grandfather and your son like your clothes, you are well dressed.
44 What is the similarity according to the most reliable measure? How heterogeneous are the measures that corroborate a result?
45 HBR tends to be at least as reliable as the best single measure... You could just use the best measure, but which one is it?
46 The correlation with HBR predicts the reliability of measures.
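The corroboration idea behind HBR can be sketched as follows; `hbr_scores` is a hypothetical name, and this simplified variant scores an item by how many pairwise comparisons all measures corroborate, ignoring the weighting by measure heterogeneity that the actual method involves.

```python
def hbr_scores(scores_by_measure):
    """Unsupervised combination sketch.
    scores_by_measure maps a measure name to its list of similarity scores
    over the same items. An item's combined score is the number of other
    items it beats according to EVERY measure (full corroboration)."""
    measures = list(scores_by_measure.values())
    n_items = len(measures[0])
    combined = []
    for i in range(n_items):
        wins = sum(
            1
            for j in range(n_items)
            if i != j and all(m[i] > m[j] for m in measures)
        )
        combined.append(wins)
    return combined
```

No training data is needed: the combined ranking emerges purely from agreement among the individual measures, which is why the correlation of a single measure with this combination can serve as an unsupervised reliability estimate.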
47 RESULTS I. A theoretical definition of similarity covering most current measures. II. Formal explanations for several phenomena: decreasing P/R curves, the reliability decrease of evaluation measures across development cycles, correlation at segment vs. system level of MT measures, accuracy of combining diverse systems, etc. III. A methodology to predict the average relevance in an IR ranking, the average quality of systems, the number of clusters in a document distribution... IV. An unsupervised method to combine evaluation measures, ranking systems, clustering similarity measures, etc. V. An unsupervised method to meta-evaluate measures without requiring human assessments.