1 Latent Semantic Analysis Hongning Wang

2 VS model in practice
- Document and query are represented by term vectors
- Terms are not necessarily orthogonal to each other
  - Synonymy: car vs. automobile
  - Polysemy: fly (action vs. insect)

3 Choosing basis for VS model
- A concept space is preferred
  - The semantic gap will be bridged
[Figure: documents D1-D5 and a query plotted in a concept space with axes Finance, Sports, and Education]

4 How to build such a space
- Automatic term expansion
  - Construction of a thesaurus: WordNet
  - Clustering of words
- Word sense disambiguation
  - Dictionary-based: the relation between a pair of words should be similar in the text and in the dictionary's descriptions
  - Explore word usage context

5 How to build such a space
- Latent Semantic Analysis
  - Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval
  - It means: the observed term-document association data is contaminated by random noise

6 How to build such a space
- Solution: low-rank matrix approximation
[Figure: the observed term-document matrix is modeled as a *true* concept-document matrix plus random noise over the word selection in each document]

7 Latent Semantic Analysis (LSA)
- Low-rank approximation of the term-document matrix $C_{M \times N}$
- Goal: remove noise in the observed term-document association data
- Solution: find a matrix with rank $k$ which is closest to the original matrix in terms of the Frobenius norm:
$$Z^* = \arg\min_{Z:\,\mathrm{rank}(Z)=k} \|C - Z\|_F = \arg\min_{Z:\,\mathrm{rank}(Z)=k} \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} (C_{ij} - Z_{ij})^2}$$
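
A minimal numerical sketch of this objective, assuming NumPy (the matrix values below are invented for illustration): by the Eckart-Young theorem, truncating the SVD to the top $k$ singular values yields exactly the rank-$k$ matrix closest to $C$ in Frobenius norm.

```python
import numpy as np

# Toy term-document matrix C (M = 4 terms x N = 5 documents); values invented.
C = np.array([[2., 0., 1., 0., 0.],
              [1., 1., 0., 0., 0.],
              [0., 0., 0., 3., 1.],
              [0., 0., 1., 2., 2.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2
# Best rank-k approximation: keep only the k largest singular values.
Z = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Its Frobenius error equals the root of the sum of squared discarded
# singular values -- no other rank-k matrix can do better (Eckart-Young).
print(np.linalg.norm(C - Z, ord='fro'))
print(np.sqrt(np.sum(s[k:] ** 2)))       # same value
```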

8 Basic concepts in linear algebra
- Symmetric matrix: $C = C^T$
- Rank of a matrix: the number of linearly independent rows (columns) in a matrix $C_{M \times N}$; $\mathrm{rank}(C_{M \times N}) \le \min(M, N)$
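
A quick NumPy check of both definitions (toy matrix, for illustration only):

```python
import numpy as np

C = np.array([[1., 2., 3.],
              [2., 4., 6.],    # = 2 x row 1, so linearly dependent
              [0., 1., 1.]])

print(np.linalg.matrix_rank(C))   # 2: only two linearly independent rows

S = C @ C.T                       # C C^T is always symmetric
print(np.allclose(S, S.T))        # True
```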

9 Basic concepts in linear algebra
- Eigen system
  - For a square matrix $C_{M \times M}$: if $Cx = \lambda x$, $x$ is called a right eigenvector of $C$ and $\lambda$ is the corresponding eigenvalue
  - For a symmetric full-rank matrix $C_{M \times M}$, we have its eigen-decomposition $C = Q \Lambda Q^T$, where the columns of $Q$ are the orthogonal and normalized eigenvectors of $C$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $C$
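
For symmetric matrices, NumPy's eigh computes exactly this decomposition; a minimal sketch with a randomly generated symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = A @ A.T                    # symmetric (and full-rank with probability 1)

lam, Q = np.linalg.eigh(C)     # eigenvalues and orthonormal eigenvectors
print(np.allclose(C, Q @ np.diag(lam) @ Q.T))  # True: C = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))         # True: Q's columns are orthonormal
```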

10 Basic concepts in linear algebra
- Singular value decomposition (SVD)
  - For a matrix $C_{M \times N}$ with rank $r$, we have $C = U \Sigma V^T$, where $U_{M \times r}$ and $V_{N \times r}$ are orthogonal matrices, and $\Sigma$ is an $r \times r$ diagonal matrix with $\Sigma_{ii} = \sqrt{\lambda_i}$, where $\lambda_1 \ge \dots \ge \lambda_r$ are the eigenvalues of $CC^T$
  - We define $C^k_{M \times N} = U_{M \times k} \Sigma_{k \times k} V^T_{N \times k}$, where we place the $\Sigma_{ii}$ in descending order and set $\Sigma_{ii} = \sqrt{\lambda_i}$ for $i \le k$ and $\Sigma_{ii} = 0$ for $i > k$
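
The relation between the singular values of $C$ and the eigenvalues of $CC^T$ can be verified directly; a small NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((4, 6))            # M = 4, N = 6, rank-4 toy matrix

U, s, Vt = np.linalg.svd(C, full_matrices=False)
lam = np.linalg.eigvalsh(C @ C.T)[::-1]    # eigenvalues of C C^T, descending

print(np.allclose(s ** 2, lam))            # True: Sigma_ii = sqrt(lambda_i)
```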

11 Latent Semantic Analysis (LSA)
- Solve LSA by SVD:
$$Z^* = \arg\min_{Z:\,\mathrm{rank}(Z)=k} \|C - Z\|_F = \arg\min_{Z:\,\mathrm{rank}(Z)=k} \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} (C_{ij} - Z_{ij})^2} = C^k_{M \times N}$$
- Procedure of LSA (map to a lower-dimensional space):
  1. Perform SVD on the document-term adjacency matrix
  2. Construct $C^k_{M \times N}$ by keeping only the $k$ largest singular values in $\Sigma$ non-zero (a runnable sketch follows below)
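
In practice, steps 1-2 are usually carried out with a truncated SVD that never materializes the full decomposition. A minimal sketch using scikit-learn (the corpus and the choice of $k$ are illustrative placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["human computer interaction",
        "user interface design",
        "graph theory and trees",
        "paths in random graphs"]           # toy corpus

C = CountVectorizer().fit_transform(docs)   # document-term matrix, documents as rows

k = 2
lsa = TruncatedSVD(n_components=k)          # keeps only the k largest singular values
doc_vectors = lsa.fit_transform(C)          # documents mapped into the k-dim latent space
print(doc_vectors.shape)                    # (4, 2)
```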

12 Latent Semantic Analysis (LSA)
- Another interpretation: $C_{M \times N}$ is the document-term adjacency matrix
  - $D_{M \times M} = C_{M \times N} C^T_{M \times N}$, where $D_{ij}$ is the document-document similarity obtained by counting how many terms co-occur in $d_i$ and $d_j$
  - $D = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma^2 U^T$, i.e., the eigen-decomposition of the document-document similarity matrix
  - $d_i$'s new representation in this system (space) is then $(U\Sigma)_i$; in the lower-dimensional space, we use only the first $k$ elements of $(U\Sigma)_i$ to represent $d_i$
  - The same analysis applies to the term-term similarity matrix $T_{N \times N} = C^T_{M \times N} C_{M \times N}$
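
The equivalence between the SVD of $C$ and the eigen-decomposition of $D$ can be checked numerically; a NumPy sketch with toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.standard_normal((5, 8))       # 5 documents x 8 terms, toy data

U, s, Vt = np.linalg.svd(C, full_matrices=False)
D = C @ C.T                           # document-document similarity matrix

# D = U Sigma^2 U^T, so the rows of U Sigma are the new document representations.
doc_repr = U * s                      # same as U @ np.diag(s)
print(np.allclose(D, doc_repr @ doc_repr.T))   # True
```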

13 Geometric interpretation of LSA
- $C^k_{M \times N}(i, j)$ measures the relatedness between $d_i$ and $w_j$ in the $k$-dimensional space
- Therefore, as $C^k_{M \times N} = U_{M \times k} \Sigma_{k \times k} V^T_{N \times k}$:
  - $d_i$ is represented as $(U_{M \times k} \Sigma^{1/2}_{k \times k})_i$
  - $w_j$ is represented as $(V_{N \times k} \Sigma^{1/2}_{k \times k})_j$
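
Splitting $\Sigma$ evenly between the document and term representations means their inner products reproduce $C^k$ exactly; a short NumPy check on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
C = rng.standard_normal((5, 7))       # 5 documents x 7 terms, toy data

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2

docs  = U[:, :k] * np.sqrt(s[:k])     # d_i -> (U_k Sigma_k^{1/2})_i
terms = Vt[:k, :].T * np.sqrt(s[:k])  # w_j -> (V_k Sigma_k^{1/2})_j

Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.allclose(docs @ terms.T, Ck))  # True: <d_i, w_j> = C^k(i, j)
```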

14 Latent Semantic Analysis (LSA)
- Visualization
[Figure: two-dimensional LSA plot of terms and documents from the HCI and graph theory domains]

15 What are those dimensions in LSA
- Principal component analysis (a sketch of the connection follows below)
[Figure: principal component analysis illustration]
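
One way to see the connection, sketched in NumPy (random data for illustration): after centering, the right singular vectors of the data matrix are the principal directions, and the squared singular values scale to the variances explained.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))     # 100 samples x 5 features, toy data

Xc = X - X.mean(axis=0)               # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; s**2 / (n - 1) are the
# eigenvalues of the sample covariance matrix (variance explained).
cov_eig = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]
print(np.allclose(s ** 2 / (len(X) - 1), cov_eig))   # True
```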

16 Latent Semantic Analysis (LSA)
- What we have achieved via LSA
  - Terms/documents that are closely associated are placed near one another in this new space
  - Terms that do not occur in a document may still be close to it, if that is consistent with the major patterns of association in the data
  - A good choice of concept space for the VS model!

17 LSA for retrieval
- Project queries into the new document space: $\hat{q} = q V_{N \times k} \Sigma^{-1}_{k \times k}$
  - Treat the query as a pseudo-document given by its term vector (a sketch follows below)
- Cosine similarity between the query and documents in this lower-dimensional space
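
A sketch of the folding-in step and cosine-similarity ranking, assuming NumPy (the matrix and query are invented):

```python
import numpy as np

# Toy document-term matrix: M = 3 documents x N = 4 terms (documents as rows,
# following the convention of the earlier slides).
C = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 2., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2

docs = U[:, :k] * s[:k]            # documents in the k-dim space: (U_k Sigma_k)_i

q = np.array([1., 1., 0., 0.])     # query as a pseudo-document over the N terms
q_hat = q @ Vt[:k, :].T / s[:k]    # fold in: q_hat = q V_k Sigma_k^{-1}

# Rank documents by cosine similarity to the folded-in query.
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-cos))            # document indices, best match first
```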

18 LSA for retrieval
[Figure: the two-dimensional HCI / graph theory space with the projected query q: "human computer interaction"]

19 Discussions
- Computationally expensive
  - Time complexity $O(MN^2)$
- Empirically helpful for recall but not for precision
  - Recall increases as $k$ decreases
- Optimal choice of $k$
- Difficult to handle a dynamic corpus
- Difficult to interpret the decomposition results
(We will come back to this later!)

20 LSA beyond text
- Collaborative filtering (see the sketch below)
[Figure: collaborative filtering example]
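
A hedged sketch of the idea on a toy user-item rating matrix (values invented; a real recommender must also handle missing entries, which plain SVD does not):

```python
import numpy as np

# Toy user-item rating matrix (users as rows, items as columns); values invented.
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction

# Reconstructed scores for unrated entries can serve as naive
# recommendation scores.
print(np.round(R_hat, 2))
```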

21 LSA beyond text
- Eigenface
[Figure: eigenface example]

22 LSA beyond text
- Cat neuron from a deep neural network
  - One of the neurons in the artificial neural network, trained on still frames from unlabeled YouTube videos, learned to detect cats.

23 What you should know
- The assumption in LSA
- Interpretations of LSA
  - Low-rank matrix approximation
  - Eigen-decomposition of the co-occurrence matrices for documents and terms
- LSA for IR

24 Today's reading
- Chapter 18: Matrix decompositions and latent semantic indexing (all the sections!)
- Deerwester, Scott C., et al. "Indexing by latent semantic analysis." JASIS 41.6 (1990): 391-407.