The Foundations of Similarity and their Applications to NLP Tasks


1 The Foundations of Similarity and their Applications to NLP Tasks Enrique Amigó Julio Gonzalo Felisa Verdejo Jesús Giménez Damiano Spina Anselmo Peñas Victor Fresno Fernando Giner Guillermo Garrido Fernando López

2 The Problem

3 The Problem (IR)

4 The Problem (Clustering)

5 The Problem (Machine Translation and Summarization)

6 NLP PROBLEMS Textual Entailment Semantic Textual Similarity Spam Detection Novelty Detection Instance Based Learning

7 The Problem The appropriateness of a similarity measure depends on the task. The appropriateness of a similarity measure depends on the data set... Therefore, evaluating/combining measures over a training data set is not enough.

8 The Goal To study the basic properties of similarity measures: What is the best measure for a task? What do we need to assume to exploit measures?

9 Methodology Certain measures in certain conditions have certain behavior: formal analysis and theoretical proofs. Do the conditions match real scenarios? Experiments.

10 Experimental data sets:

TASK    CORPUS      TEXT PAIRS / GROUPS                             MEASURES
IR      GOV         queries, 100 docs returned by the best system   60
CLUST.  WEPS        topics, 100 randomly selected pairs             167
MT      NIST 2005   text sources and systems                        60
AS      DUC 2005    text sources and systems                        30
TE      RTE         800, four data sets                             102
STS     SEMEVAL     five data sets                                  88

11

12 Similarity Definition

13 High Similarity Theorem Same eye color is always positive evidence of being brothers.

14

15 Similarity Definition Randomly adding or removing a word.

16 High Similarity Theorem: Consequences Over random text samples: Any IR system has a decreasing Precision/Recall curve. Increasing the coherence of a clustering according to any similarity measure increases the cluster quality. Improving an MT or AS system according to any measure increases the quality...

17 High Similarity Theorem: Consequences ("This is the problem, not the task") Over random text samples: Any IR system has a decreasing Precision/Recall curve. Increasing the coherence of a clustering according to any similarity measure increases the cluster quality. Improving an MT or AS system according to any measure increases the quality...

18 High Similarity Theorem: Experiment For each measure: I. We take all similarity instances. II. We sort them by the estimated similarity. III. We compute the assessed similarity above each rank position. IV. We compute the correlation with the rank position. V. We compute the measure granularity (number of similarity ranges).
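The five experimental steps above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the input format (pairs of estimated and assessed similarity) and the helper names are assumptions.

```python
# Hypothetical sketch of the High Similarity Theorem experiment.
# instances: list of (estimated_sim, assessed_sim) pairs (illustrative format).

def pearson(xs, ys):
    """Plain Pearson correlation, implemented from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def high_similarity_experiment(instances):
    # II. Sort instances by the estimated similarity, highest first.
    ranked = sorted(instances, key=lambda p: p[0], reverse=True)
    # III. Average assessed similarity above each rank position.
    avg_above, total = [], 0.0
    for i, (_, assessed) in enumerate(ranked, start=1):
        total += assessed
        avg_above.append(total / i)
    # IV. Correlation between rank position and that running average.
    ranks = list(range(1, len(ranked) + 1))
    corr = pearson(ranks, avg_above)
    # V. Granularity: number of distinct estimated-similarity values.
    granularity = len({est for est, _ in ranked})
    return corr, granularity
```

If high estimated similarity really implies high assessed similarity, the running average decreases with rank, so the correlation comes out negative.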

19 Experiments: Clustering

20 Experiments: Textual Entailment

21 Experiments: Information Retrieval

22 Experiments: Machine Translation

23 Experiments: Summarization Precision-oriented ROUGE

24 Experiments: Summarization "I don't know"

25 Strictness Theorem The number of outstanding students can predict the level of a school.

26

27 Strictness Theorem Assessed similarity distribution: random scores for highly similar samples, low scores for low-similarity samples. The average assessed similarity correlates with the amount of random scores.

30 Strictness Theorem: Consequences Assuming a fixed similarity deviation in a set of samples, a strict measure: Predicts the average relevance of IR system outputs. Predicts the optimal grouping threshold. Predicts textual entailment at document level. Predicts the average quality of MT and AS systems.

31 Strictness Theorem: Experiment For each measure: We divide the data set into subsets. We estimate the strictness of the measure. We estimate the correlation between the average estimated vs. assessed similarity.
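The experiment on this slide can be sketched as follows. The even split into subsets and the per-subset averaging are assumptions made for illustration; the original study's exact partitioning is not given here.

```python
# Illustrative sketch of the Strictness Theorem experiment:
# correlate per-subset averages of estimated and assessed similarity.
# instances: list of (estimated_sim, assessed_sim) pairs (assumed format).

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def strictness_experiment(instances, n_subsets=3):
    # Divide the data set into subsets (round-robin split, an assumption).
    subsets = [instances[i::n_subsets] for i in range(n_subsets)]
    # Average estimated and assessed similarity within each subset.
    est_means = [mean([e for e, _ in s]) for s in subsets]
    ass_means = [mean([a for _, a in s]) for s in subsets]
    # Correlation between the two sequences of averages.
    return pearson(est_means, ass_means)
```

A strict measure should yield a high correlation here: its average score on a subset tracks the average assessed similarity of that subset.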

32 Experiments: Information Retrieval

33 Experiments: Textual Entailment

34 Experiments: Clustering

35 Experiments: Semantic Textual Similarity

36 Experiments: Summarization

37 Experiments: Machine Translation

38 Conclusions Using a strict measure: We can predict the average relevance of an IR ranking (0.4 Pearson). We can predict the number of clusters (0.6 Spearman). We can predict the average quality of systems in MT and AS. We find the same pattern in all scenarios.

39 Observations Combining diverse classifiers improves classification results. Combining diverse similarity measures improves MT automatic evaluation.

40 Heterogeneity Theorem If your mother, your grandfather and your son like your clothes, you are well dressed.

41

42

43

44 What is the similarity according to the most reliable measure? How heterogeneous are the measures that corroborate a result?

45 HBR tends to be at least as reliable as the best measure... You could use the best measure instead, but which one is it?
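The unsupervised combination behind this idea can be sketched as rank aggregation: score each item by its average normalized rank across all measures, so no measure needs training data or a reliability label. This illustrates the spirit of combining heterogeneous measures, not the exact HBR definition; the function name and input format are hypothetical.

```python
# Hypothetical sketch of unsupervised combination of heterogeneous
# similarity measures via average normalized rank (not the exact HBR formula).

def combined_ranking(scores_per_measure):
    """scores_per_measure: one list of scores per measure, all over the
    same items; returns one combined score in [0, 1] per item."""
    n_items = len(scores_per_measure[0])
    combined = [0.0] * n_items
    for scores in scores_per_measure:
        # Rank items under this measure (rank 0 = lowest score).
        order = sorted(range(n_items), key=lambda i: scores[i])
        for rank, item in enumerate(order):
            combined[item] += rank / (n_items - 1)
    # Average the normalized ranks over all measures.
    return [c / len(scores_per_measure) for c in combined]
```

Items that every measure ranks high end up near 1; disagreement among heterogeneous measures pulls the combined score toward the middle, which is why corroboration by diverse measures carries more weight.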

46 The correlation with HBR predicts the reliability of measures.

47 RESULTS I. A theoretical definition of similarity covering most current measures. II. Formal explanations for several phenomena: decreasing P/R curves, the reliability decrease of evaluation measures across development cycles, correlation at segment vs. system level of MT measures, accuracy of combining diverse systems, etc. III. A methodology to predict the average relevance in an IR ranking, the quality of systems, the number of clusters in a document distribution... IV. An unsupervised method to combine evaluation measures, ranking systems, clustering similarity measures, etc. V. An unsupervised method to meta-evaluate measures without requiring human assessments.