Caroline Fetzer, Martin Köppelmann, Sebastian Stange. Track 2


1 Caroline Fetzer, Martin Köppelmann, Sebastian Stange Track 2

2 Track 2 Assignment 2011: #Users, #Items, #Ratings, #Train Ratings, #Test Ratings; > 80 % not rated

3 Winner of 2007
1. Place: "Who Rated What: a combination of SVD, correlation and frequent sequence mining" by Miklós Kurucz, András A. Benczúr, Tamás Kiss, István Nagy, Adrienn Szabó and Balázs Torma (Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Hungary).
KDD Cup 2007, Task 1 "Who Rated What in 2006": predict which users rated which movies in 2006; Task 2 "How Many Ratings in 2006": predict the number of additional ratings of movies. For Task 1 the goal was to predict, for a given list of 100,000 user-movie pairs drawn from the Netflix Prize data set, the probability that the user rated the movie in 2006 (the actual date and rating being irrelevant); none of the pairs were rated in the training set. The pairs were sampled by picking user and movie independently, each with probability proportional to the number of times it appears in the 2006 ratings.
The method combines the following predictors, listed in the order of their efficiency: the predicted number of ratings for each movie (time-series prediction, using movie and DVD release dates and movie-series detection by the edit distance of the titles); the predicted number of ratings by each user (exploiting the margin-proportional sampling); the low-rank approximation of the 0/1 matrix of known user-movie pairs; the movie-movie similarity matrix; and association rules obtained by frequent sequence mining of user ratings considered as ordered itemsets. Combining the predictions by linear regression gave a root mean squared error of 0.256; a pure all-zeroes prediction already gives 0.279, indicating the hardness of the task.
2. Place: "A classical predictive modeling approach" by Jorge Sueiras et al. (Neo Metrics, Spain). A history-based model (linear logistic regression) that predicts whether a user will vote a given movie. Key points: the estimation of the true baseline (the ratio of 1s in the test data), which is critical given the binary outcome of the problem; the definition of the explanatory variables, grouped as user voting behaviour, movie characteristics and user-movie interactions; and the choice of the model form. Quality is measured by rmse² = (1/N) * Σ_ij (w_ij - ŵ_ij)².
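Both papers are scored by root mean squared error. A minimal sketch of the metric (pure Python; the 0/1 "rated in 2006" labels below are made up for illustration):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over scored (user, movie) pairs."""
    assert len(actual) == len(predicted)
    return math.sqrt(sum((w - wh) ** 2 for w, wh in zip(actual, predicted)) / len(actual))

# An all-zeroes prediction against a mostly-zero 0/1 target illustrates
# why a constant 0 already scores fairly well on this task.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # hypothetical labels
print(round(rmse(labels, [0.0] * 10), 3))  # 0.316
```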

4 Track 1 Assignment 2007: Which users rated which movies in 2006? Data set of the last years; 100,000 user-movie pairs of 2006; predict the probability that each user-movie pair was rated (e.g. 0.7)

5 Track 2 Assignment 2007: How many ratings? Predict the number of additional ratings per movie in 2006

6 Who Rated What [...] - ilab Paper 1 Result & Approach
Result: RMSE (stdv) = 0.256
0.5533 * base prediction + 0.1987 * singular value decomposition + 0.029 * item-item similarity - 0.0121 * association rules + 0.0042

7 Who Rated What [...] - ilab Paper 1 Base Prediction
Result: RMSE (stdv) = 0.256
prob(user-movie pair x) = 0 everywhere: 10th-13th place with stdv = 0.279
guessing correct factors for the baseline: 5th-6th place with stdv = 0.268

8 Who Rated What [...] - ilab Paper 1 Base Prediction
p_um = (N_u * N_m) / (M * U), treating the user-movie relation as independent
N_u = number of ratings of the user
N_m = number of ratings of the movie
M = total number of movies
U = total number of users
to estimate (Track )

9 Who Rated What [...] - ilab Paper 1 Base Prediction
p_um = (N_u * N_m) / (M * U), treating the user-movie relation as independent
we know (Track )
N_u = number of ratings of the user
N_m = number of ratings of the movie
M = total number of movies
U = total number of users
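The independence-based base prediction of slides 8-9 can be sketched as follows (the toy rating log and the user/movie IDs are illustrative, not from the papers):

```python
from collections import Counter

def base_prediction(ratings, user, movie):
    """p_um = (N_u * N_m) / (M * U), following the slide's
    independence assumption; variable names as on the slide."""
    n_u = Counter(u for u, m in ratings)   # N_u: ratings per user
    n_m = Counter(m for u, m in ratings)   # N_m: ratings per movie
    U = len(n_u)                           # total number of users
    M = len(n_m)                           # total number of movies
    return n_u[user] * n_m[movie] / (M * U)

ratings = [("u1", "m1"), ("u1", "m2"), ("u2", "m1")]  # toy rating log
print(base_prediction(ratings, "u1", "m1"))  # 2*2/(2*2) = 1.0
```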

10 Who Rated What [...] - ilab Paper 1 Prediction #Ratings/Movie N_m
Secondary information: DVD release, IMDB movie release, series-continuation releases
Analyse the time distribution and continue the trend
[Figure: #ratings over time]
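Continuing the #ratings trend can be sketched as a least-squares line fit with one-step extrapolation, a deliberate simplification of the paper's time-series prediction (the monthly counts below are made up):

```python
def continue_trend(counts):
    """Fit a straight line to monthly rating counts (ordinary least
    squares) and extrapolate one step ahead -- a minimal stand-in
    for the time-series prediction of N_m."""
    n = len(counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope * n + intercept  # forecast for the next month

print(continue_trend([10, 20, 30, 40]))  # linear trend continues: 50.0
```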

11 Who Rated What [...] - ilab Paper 1 Prediction #Ratings/User N_u
Same sampling method as in the KDD Cup
Compare the stdv of the sampled ratings of 2005 with the stdv of the base predictions of 2006, then adapt 2005 to 2006
[Figure: #ratings of user U in 2005 vs. #ratings of user U by base prediction]

12 Who Rated What [...] - ilab Paper 1 SVD
C_k = U_k Σ_k V_k^T
In: user-movie matrix with predictions from several base-prediction values
Out: denser matrix
Eckart-Young theorem: the SVD truncated to rank k yields the rank-k matrix that best approximates the original one; the error is measured by the Frobenius norm (important for us)
Implementation: Lanczos (SVDPACK)
Too high a number of dimensions k leads to overfitting; a machine-learning approach is used to find the optimal k
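The rank-k truncation and its Frobenius-norm error can be sketched with NumPy (toy 0/1 matrix; the paper uses Lanczos/SVDPACK at Netflix scale, but np.linalg.svd shows the same idea):

```python
import numpy as np

# Toy 0/1 user-movie matrix; rows are users, columns are movies.
C = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C_k = U[:, :k] * s[:k] @ Vt[:k, :]   # best rank-k approximation

# Eckart-Young: C_k minimises the Frobenius-norm error among all
# rank-k matrices; here rank(C) = 2, so the rank-2 error vanishes.
err = np.linalg.norm(C - C_k, "fro")
print(round(err, 6))  # 0.0
```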

13 Who Rated What [...] - ilab Paper 1 SVD

14 Who Rated What [...] - ilab Paper 1 Item Item Similarity
Cosine similarity between item vectors
[Table: user x item matrix; the columns Item a, Item b, Item c are the item vectors i, j, k; one row per user]
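A minimal cosine-similarity computation over item column vectors (the 0/1 entries are a made-up stand-in for the table sketched on the slide):

```python
import math

def cosine(a, b):
    """Cosine similarity between two item column vectors
    (each entry: did this user rate the item)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Rows of the matrix = users; each list below is one item's column.
item_a = [1, 1, 0]
item_b = [1, 0, 0]
item_c = [0, 1, 1]
print(round(cosine(item_a, item_b), 3))  # 0.707: a and b share one user
```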

15 Who Rated What [...] - ilab Paper 1 Item Item Similarity

16 Who Rated What [...] - ilab Paper 1 Machine Learning
Weka toolkit: training data = sample of 2005 (sampling method of 2007), applied to the data of 2006
Weighting of the several factors by linear regression:
0.5533 * p_um (base prediction)
0.029 * correlation (item-item similarity)
0.1987 * SVD (singular value decomposition)
-0.0121 * association rules
+ 0.0042
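The Weka-style linear-regression fusion can be sketched with ordinary least squares (the per-pair predictor outputs and labels below are invented, and the fitted weights are not the paper's):

```python
import numpy as np

# Hypothetical per-pair predictor outputs (columns: base prediction,
# item-item similarity, SVD score) and the 0/1 training labels.
X = np.array([[0.9, 0.8, 0.7],
              [0.1, 0.2, 0.3],
              [0.8, 0.1, 0.9],
              [0.3, 0.7, 0.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Ordinary least squares with an intercept column; the last weight
# plays the role of the constant term in the slide's formula.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
fused = A @ w   # combined prediction per pair
# With 4 pairs and 4 parameters the fit is exact on this toy data.
```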

17 classical modeling approach [...] Neo Metrics Paper 2 Result & Approach
Result: RMSE (stdv) = 0.263
- deliberate selection of variables
- construction of further variables with SVD
- machine learning over all variables with an own training sample

18 classical modeling approach [...] Neo Metrics Paper 2 Baseline
guessed baseline: ca. 20 %
baseline after data cleaning: 3.8 %
real baseline: 7.8 %
Cleaning removes time-dependent information: new users and movies are outliers (2004 had more ratings on average, 2005 fewer), so outliers are eliminated to create a time-independent model

19 classical modeling approach [...] Neo Metrics Paper 2 Variables
User variables:
- number of historic user ratings
- percent of 1-star ratings of the user
- stdv of the user's ratings
- number of months since the first rating of the user
- ...
Movie variables:
- number of historic ratings received by the movie
- percent of 1-star ratings received by the movie
- stdv of ratings received by the movie
- ...
User-movie interactions (after SVD):
- likelihood of rating similar movies more than the mean
- likelihood of similar users rating the movie more than the mean
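The user variables above can be sketched as plain feature extraction from a user's rating history (dates, star values and variable names below are illustrative, not the paper's):

```python
import statistics
from datetime import date

def user_variables(user_ratings, as_of):
    """Build slide-19-style user-level input variables from a list
    of (date, stars) ratings."""
    stars = [s for _, s in user_ratings]
    first = min(d for d, _ in user_ratings)
    months = (as_of.year - first.year) * 12 + (as_of.month - first.month)
    return {
        "n_ratings": len(stars),                       # historic ratings
        "pct_one_star": stars.count(1) / len(stars),   # share of 1-star votes
        "stdv": statistics.pstdev(stars),              # spread of the ratings
        "months_since_first": months,                  # account age in months
    }

history = [(date(2004, 3, 1), 1), (date(2005, 6, 1), 4), (date(2005, 7, 1), 5)]
v = user_variables(history, as_of=date(2005, 12, 31))
print(v["n_ratings"], v["months_since_first"])  # 3 21
```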

20 classical modeling approach [...] Neo Metrics Paper 2 SVD, Cluster Analysis
SVD on the user-movie rating matrix, followed by cluster analysis on the reduced matrix, to group users / movies

21 classical modeling approach [...] Neo Metrics Paper 2 Machine Learning
training data: sample of 2004 (sampling method of 2007), applied to 2005
weighting of all variables

22 Conclusion / our possible approach
- the baseline was very important in both papers, but we cannot use such a baseline
- SVD was used in two different ways and could also be important for us
- item-item similarity was less efficient there; we think it is more efficient for us
- machine learning, perhaps to weight our methods
- work with hierarchies for clustering (n songs of the same album rated)
- delete users from the training data (users with incalculable music taste?)

23 Sources:
- Su, Xiaoyuan & Khoshgoftaar, Taghi M.: "A Survey of Collaborative Filtering Techniques"
- Kurucz, Miklós; Benczúr, András A.; Kiss, Tamás; Nagy, István; Szabó, Adrienn & Torma, Balázs: "Who Rated What: a Combination of SVD, Correlation and Frequent Sequence Mining"
- Sueiras, Jorge: "A Classical Predictive Modeling Approach for Task 'Who Rated What?' of the KDD Cup 2007"
- Kurucz, Miklós; Benczúr, András A. & Csalogány, Károly: "Methods for Large Scale SVD with Missing Values"
- Book??? of Arvid with explanation of SVD