Covariance Matrices & All-pairs Similarity

Size: px

Start display at page:

Download "Covariance Matrices & All-pairs Similarity"

Nathan Sanders
5 years ago
Views:

1 Matrices & Similarity April 2015, Stanford DAO (Stanford) April / 34

2 Notation for matrix A Given m n matrix A, with m n. a 1,1 a 1,2 a 1,n a 2,1 a 2,2 a 2,n A = a m,1 a m,2 a m,n A is tall and skinny, example values m = 10 12, n = {10 4, 10 6 }. A has sparse rows, each row has at most L nonzeros. A is stored across hundreds of machines and cannot be streamed through a single machine. (Stanford) April / 34

3 Computing A T A We compute A T A. A T A is n n, considerably smaller than A. A T A is dense. Holds dot products between all pairs of columns of A. (Stanford) April / 34

4 Guarantees There is a knob γ which can be turned to preserve similarities and singular values. Paying O(nLγ) communication cost and O(γ) computation cost. With a low setting of γ, preserve similar entries of A T A (via Cosine, Dice, Overlap, and Jaccard ). With a high setting of γ, preserve singular values of A T A. (Stanford) April / 34

5 Computing All Pairs of Cosine Similarities We have to find dot products between all pairs of columns of A We prove results for general matrices, but can do better for those entries with cos(i, j) s Cosine : a widely used definition for " between two vectors cos(i, j) = ct i c j c i c j c i is the i th column of A (Stanford) April / 34

6 Example matrix Rows: users. Columns: movies. (Stanford) April / 34

7 Distributed Computing Environment With such large datasets, we must use many machines. Algorithm code available in and Scalding. Described with Maps and Reduces so that the framework takes care of distributing the computation. (Stanford) April / 34

8 Naive Implementation 1 Given row r i, Map with NaiveMapper (Algorithm 1) 2 Reduce using the NaiveReducer (Algorithm 2) Algorithm 1 NaiveMapper(r i ) for all pairs (a ij, a ik ) in r i do Emit ((j, k) a ij a ik ) end for Algorithm 2 NaiveReducer((i, j), v 1,..., v R ) output c T i c j R i=1 v i (Stanford) April / 34

9 for Very easy analysis 1) Shuffle size: O(mL 2 ) 2) Largest reduce-key: O(m) Both depend on m, the larger dimension, and are intractable for m = 10 12, L = 100. We ll bring both down via clever sampling Assuming column norms are known or estimates available (Stanford) April / 34

10 Dimension Independent Matrix Square using MapReduce Algorithm 3 Mapper(r i ) for all pairs (a ij, a ik ) in r i do With probability min emit ((j, k) a ij a ik ) end for ( 1, γ 1 c j c k ) Algorithm 4 Reducer((i, j), v 1,..., v R ) γ if c i c j > 1 then output b ij 1 R c i c j i=1 v i else output b ij 1 R γ i=1 v i end if (Stanford) April / 34

11 for The algorithm outputs b ij, which is a matrix of cosine similarities, call it B. Four things to prove: 1 Shuffle size: O(nLγ) 2 Largest reduce-key: O(γ) 3 The sampling scheme preserves similarities when γ = Ω(log(n)/s) 4 The sampling scheme preserves singular values when γ = Ω(n/ɛ 2 ) (Stanford) April / 34

12 Shuffle size for Theorem For {0, 1} matrices, the expected shuffle size for Mapper is O(nLγ). Proof. The expected contribution from each pair of columns will constitute the shuffle size: n n i=1 j=i+1 #(c i,c j ) k=1 Pr[Emit(c i, c j )] = n n #(c i, c j )Pr[Emit(c i, c j )] i=1 j=i+1 (Stanford) April / 34

13 Shuffle size for Proof. n n #(c i, c j ) γ #(ci ) #(c j ) i=1 j=i+1 (Stanford) April / 34

14 Shuffle size for Proof. n n #(c i, c j ) γ #(ci ) #(c j ) (by AM-GM) γ 2 i=1 j=i+1 n n 1 #(c i, c j )( #(c i ) + 1 #(c j ) ) i=1 j=i+1 (Stanford) April / 34

15 Shuffle size for Proof. n (by AM-GM) γ 2 n i=1 j=i+1 γ n #(c i, c j ) γ #(ci ) #(c j ) n i=1 j=i+1 n i=1 1 #(c i ) 1 #(c i, c j )( #(c i ) + 1 #(c j ) ) n #(c i, c j ) j=1 (Stanford) April / 34

16 Shuffle size for Proof. n (by AM-GM) γ 2 n i=1 j=i+1 γ n #(c i, c j ) γ #(ci ) #(c j ) n i=1 j=i+1 n i=1 1 #(c i ) 1 #(c i, c j )( #(c i ) + 1 #(c j ) ) n #(c i, c j ) j=1 γ n i=1 1 #(c i ) L#(c i) = γln (Stanford) April / 34

17 Shuffle size for O(nLγ) has no dependence on the dimension m, this is the heart of. Happens because higher magnitude columns are sampled with lower probability: 1 γ c 1 c 2 (Stanford) April / 34

18 Shuffle size for For matrices with real entries, we can still get a bound Let H be the smallest nonzero entry in magnitude, after all entries of A have been scaled to be in [ 1, 1] E.g. for {0, 1} matrices, we have H = 1 Shuffle size is bounded by O(nLγ/H 2 ) (Stanford) April / 34

19 Largest reduce key for Each reduce key receives at most γ values (the oversampling parameter) Immediately get that reduce-key complexity is O(γ) Also independent of dimension m. Happens because high magnitude columns are sampled with lower probability. (Stanford) April / 34

20 Correctness Since higher magnitude columns are sampled with lower probability, are we guaranteed to obtain correct results w.h.p.? Yes. By setting γ correctly. Preserve similarities when γ = Ω(log(n)/s) Preserve singular values when γ = Ω(n/ɛ 2 ) (Stanford) April / 34

21 Correctness Theorem Let A be an m n tall and skinny (m > n) matrix. If γ = Ω(n/ɛ 2 ) and D a diagonal matrix with entries d ii = c i, then the matrix B output by satisfies, with probability at least 1/2. DBD A T A 2 A T A 2 ɛ Relative error guaranteed to be low with constant probability. (Stanford) April / 34

22 Proof Uses Latala s theorem, bounds 2nd and 4th central moments of entries of B. Really need extra power of moments. (Stanford) April / 34

23 Latala s Theorem Theorem (Latala s theorem). Let X be a random matrix whose entries x ij are independent centered random variables with finite fourth moment. Denoting X 2 as the matrix spectral norm, we have E X 2 C[max i j E x 2 ij 1/2 + max j 1/4 ( i E x 2 ij ) 1/2 + i,j E x 4 ij ]. (Stanford) April / 34

24 Proof Prove two things E[(b ij Eb ij ) 2 ] 1 γ (easy) E[(b ij Eb ij ) 4 ] 2 (not easy) γ 2 Details in paper. (Stanford) April / 34

25 Correctness Theorem For any two columns c i and c j having cos(c i, c j ) s, let B be the output of with entries b ij = 1 γ m k=1 X ijk with X ijk as the indicator for the k th coin in the call to Mapper. Now if γ = Ω(α/s), then we have, and ] ( Pr [ c i c j b ij > (1 + δ)[a T e δ A] ij (1 + δ) (1+δ) ] Pr [ c i c j b i,j < (1 δ)[a T A] ij < exp( αδ 2 /2) ) α Relative error guaranteed to be low with high probability. (Stanford) April / 34

26 Correctness Proof. In the paper Uses standard concentration inequality for sums of indicator random variables. Ends up requiring that the oversampling parameter γ be set to γ = log(n 2 )/s = 2 log(n)/s. (Stanford) April / 34

27 Observations helpful when there are some popular columns e.g. The Netflix Matrix (some columns way more popular than others) Power-law columns are effectively neutralized (Stanford) April / 34

28 In practice Forget about theoretical settings for γ Crank up γ until application works Estimates for c i can be used, expectations still hold, but concentration isn t guaranteed If using for singular values, watch for ill-conditioned matrices (Stanford) April / 34

29 Large scale production live at twitter.com (Stanford) April / 34

30 Figure : Y-axis ranges from 0 to 100s of terabytes (Stanford) April / 34

31 Implementation Figure : Widely distributed with as of version 1.2 (Stanford) April / 34

32 Other Similarity Measures Picking out similar columns work for some other measures. Similarity Definition Shuffle Size Reduce-key size #(x,y) Cosine O(nL log(n)/s) O(log(n)/s) #(x) #(y) Jaccard Overlap Dice #(x,y) #(x)+#(y) #(x,y) O((n/s) log(n/s)) O(log(n/s)/s) #(x,y) min(#(x),#(y)) O(nL log(n)/s) O(log(n)/s) 2#(x,y) #(x)+#(y) O(nL log(n)/s) O(log(n)/s) Table : All sizes are independent of m, the dimension. (Stanford) April / 34

33 Locality Sensitive Hashing MinHash from the Locality-Sensitive-Hashing family can have its vanilla implementation greatly improved by. Another set of theorems for shuffle size and correctness in DISCO paper. stanford.edu/~rezab/papers/disco.pdf (Stanford) April / 34

34 Conclusion Consider if you ever need to compute A T A for large sparse A Many more experiments and results in paper at stanford.edu/~rezab (Stanford) April / 34

35 Combiners All bounds are without combining: can only get better with combining For similarities, (without combiners) beats naive with combining outright For singular values, (without combiners) beats naive with combining if the number of machines is large (e.g. 1000) with combining empirically beats naive with combining Difficult to analyze combiners since they happen at many levels. Optimizations break models. with combiners is best of both. (Stanford) April / 34

36 Combiners With k machines shuffle with combiner: O(min(nLγ, kn 2 )) reduce-key with combiner: O(min(γ, k)) Naive shuffle with combiner: O(kn 2 ) Naive reduce-key with combiner: O(k) with combiners is best of both. (Stanford) April / 34

37 1 0.8 DISCO Cosine shuffle size vs accuracy tradeoff DISCO Shuffle / Naive Shuffle avg relative err DISCO Shuffle / Naive Shuffle avg relative err log(p / ε) Figure : As γ = p/ɛ increases, shuffle size increases and error decreases. There is no thresholding for highly similar pairs here. (Stanford) April / 34

A Dynamics for Advertising on Networks. Atefeh Mohammadi Samane Malmir. Spring 1397

A Dynamics for Advertising on Networks Atefeh Mohammadi Samane Malmir Spring 1397 Outline Introduction Related work Contribution Model Theoretical Result Empirical Result Conclusion What is the problem?