Rating Predictions on Kindle Store Reviews

Guanlin Li, Yang Liu, Junjie Zhu

Abstract

This paper focuses on predicting customers' ratings in Kindle Store reviews. In detail, we extract features from the Kindle Store reviews in the Amazon Product Dataset [1]. We use a Latent Factor Model and a text mining model to predict a customer's rating of a Kindle Store item. We implement a trivial average-prediction method and a linear regression model as our baseline models. We also apply the Gradient Boosted Regression Tree [3], a mature model frequently used in regression problems. Our measure of prediction quality is the Mean Squared Error (MSE) on a held-out test set. In addition, we compare the above three models against the baselines and discuss the reasons why one model performs better than another. Through this work we explore the attributes of the Kindle Store dataset in detail, so that we can choose models with more confidence when we encounter a similar dataset.

Index Terms

Kindle Store dataset, rating prediction, linear regression, boosted regression tree, latent factor, text mining, comparison and analysis.

I. INTRODUCTION

Rating prediction from customers' reviews is an essential part of recommender systems. A good recommendation effectively informs the customer about a new product and may create a sale. Good recommendations also keep customers inside the recommender system and can build a loyal community for future benefit. A poor recommender system, in contrast, disappoints current and potential customers and can be worse than having no recommendations at all (in which case customers are at least free to explore the product catalog on their own). Rating prediction is at the core of recommender systems because it gives a reasonable expectation of a customer's attitude towards a given product.

Rating prediction can also be used to investigate customers' opinions in a community (e.g. an Internet community such as the Kindle Store). For example, after collecting a large number of reviews about a certain product, rating prediction can approximate customers' overall opinion of that product. Further, after running sentiment analysis, keywords of customers' opinions can be extracted. There are plenty of other applications of rating prediction as well.

In this paper, we explore different approaches to rating prediction: a simple linear regression model, a gradient boosted regression tree model, a latent factor model and a sentiment analysis model. These models rest on different assumptions about how customers' ratings arise. The linear regression model takes in features such as the number (or ratio) of helpful votes, the length of the review text and the review time. The latent factor model focuses on user- and item-specific rating behavior. The sentiment analysis model specializes in the text and emotions of customers' reviews. The gradient boosted regression tree can take in any useful feature. Throughout the report we compare these models and the advantages and disadvantages of each, so that we can determine appropriate models for future rating-prediction applications.

II. DATASET ANALYSIS

A. Data Source

Here we use the Kindle Store 5-core data from the Amazon product data [1]. There are pieces of reviews in total. Each record has nine fields:
reviewerID - ID of the reviewer
asin - ID of the product
reviewerName - name of the reviewer
helpful - helpfulness rating of the review, with two attributes, nHelpful and outOf
reviewText - text of the review
overall - rating of the product
summary - summary of the review
unixReviewTime - time of the review (Unix time)
reviewTime - time of the review (raw)
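For concreteness, here is a minimal sketch of loading these records in Python. The file name follows the Amazon product data release and the assumption that each line is strict JSON; both are assumptions, as some releases of this data require a looser parser.

    import gzip
    import json

    def parse(path):
        """Yield one review dict per line of the gzipped JSON file."""
        with gzip.open(path, "rt") as f:
            for line in f:
                yield json.loads(line)

    # Hypothetical file name for the Kindle Store 5-core subset.
    reviews = list(parse("reviews_Kindle_Store_5.json.gz"))
    print(reviews[0]["reviewerID"], reviews[0]["asin"], reviews[0]["overall"])
    nhelpful, outof = reviews[0]["helpful"]  # helpful votes, total votes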

B. Analysis on Data

Among the distinct records, there are distinct users and distinct items.

Statistics on Users. The table below shows the min, max, mean, median, mode and variance of the total number of reviews a user gives.

min | max | mean | median | mode | variance

Below is a histogram partitioning reviewers by the total number of reviews they have given. There are 814 users who give more than 100 reviews; they do not appear in the histogram.

The histogram follows a power law: a few people give many reviews in total, while many people give only a few.

Statistics on Items. The table below shows the min, max, mean, median, mode and variance of the total number of reviews an item receives.

min | max | mean | median | mode | variance

Below is a histogram partitioning items by the total number of reviews they have received. There are 624 items that receive more than 100 reviews; they do not appear in the histogram.

This histogram also follows a power law. This gives us some confidence that the dataset is reasonable for the predictive task.

Statistics on Ratings. A rating is an integer in the range [1, 5]. The table below shows the mean, median, mode and variance of the ratings.

mean | median | mode | variance

Below is a bar chart showing the distribution of ratings.

Statistics on Time. The records begin on Mar 5th, 2000 and end on Mar 29th. Below is a diagram showing the number of reviews received on each day.

C. Data Setting

We first shuffle the data, then use the first 2/3 of the records as the training set and the remaining 1/3 as the test set for all models. Some models also need a validation set; in particular, for the Latent Factor Model we further split the training set into two halves, the first half as the actual training set and the second half as the validation set.

D. Data Analysis Conclusion

From the above analysis, we have records for training and records for testing, which is large enough for the learning models considered here. This Kindle Store dataset also provides plenty of features on users and items, on text, and on helpfulness votes. We therefore expect the Latent Factor Model, the Sentiment Analysis Model and the Gradient Boosted Regression Tree Model to have enough input for training and to produce reasonable rating predictions.

III. BASELINE PERFORMANCE

A. Baseline Prediction Strategy

A trivial prediction of ratings is to always predict the global average rating of the training data, since this is the constant prediction that minimizes the Mean Squared Error (assuming a Gaussian distribution of ratings). Based on the training data, we calculate the average rating as

$$\mathrm{avg} = \frac{1}{|T_{\mathrm{train}}|} \sum_{t \in T_{\mathrm{train}}} t[\mathrm{overall}]$$

B. Baseline Prediction Result

With the calculated average rating, we compute the MSE on the test data as

$$\mathrm{MSE}_{\mathrm{base}} = \frac{1}{|T_{\mathrm{test}}|} \sum_{t \in T_{\mathrm{test}}} \left( t[\mathrm{overall}] - \mathrm{avg} \right)^2$$
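A minimal sketch of the data split and the average baseline described above, assuming reviews is the list of record dicts loaded earlier. The shuffle seed is an arbitrary choice for reproducibility.

    import random

    random.seed(0)          # arbitrary seed
    random.shuffle(reviews)

    split = len(reviews) * 2 // 3
    train, test = reviews[:split], reviews[split:]

    # Global-average baseline: predict the training mean for every test review.
    avg = sum(r["overall"] for r in train) / len(train)
    mse_base = sum((r["overall"] - avg) ** 2 for r in test) / len(test)
    print(f"avg = {avg:.4f}, baseline MSE = {mse_base:.4f}")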

Any model below should at least beat this trivial predictor: the baseline assumes no specific knowledge of the dataset, while the other models explore its features in detail.

IV. LINEAR REGRESSION MODEL

A. Concepts of the Linear Regression Model

Our second baseline is a linear regression model, a simple form of supervised learning that explores the relationship between features and a label. For our rating prediction problem, the features are the number of helpful votes, the number of total votes, the length of the review text, and the length of the summary text, all of which are available directly in the dataset. The label is the rating. We use this model because it reveals the relative importance of the features in predicting ratings, so we can see which feature correlates most with the rating.

B. Mathematical Formula

The mathematical expression of linear regression is simple. Let nh denote the number of helpful votes, nt the number of total votes, len(rt) the length of the review text, and len(s) the length of the summary:

$$\mathrm{rating} \approx \theta_0 + \theta_1\, nh + \theta_2\, nt + \theta_3\, \mathrm{len}(rt) + \theta_4\, \mathrm{len}(s)$$

C. Computational Results

We used the NumPy linear-algebra least-squares solver to obtain the values of the θ_i. The resulting model beats the baseline model in test MSE. This is reasonable: linear regression actually looks at the features when predicting ratings. However, the improvement in MSE is small, probably because these features have only a limited relationship with the label (the rating).

D. Analysis of Feature Importance

To measure the importance of each feature, we performed an ablation experiment in which we removed one feature at a time and checked how well the prediction does (we did not remove the offset term).

Feature removed | Resulting MSE

From this experiment, the number of helpful votes and the number of total votes are the relatively important features: removing either of them increases the MSE. The lengths of the review text and summary are less important for rating prediction. The intuition is that when a review gathers more total or helpful votes, its author typically gave a good rating to the item, probably because the author genuinely loves the Kindle book (it is rare for a customer to write a high-quality review about the disadvantages of a product that many other people then upvote).
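A sketch of the least-squares fit and the ablation loop. This is our own hedged illustration, not the authors' exact code; the feature extraction from the record dicts is an assumption.

    import numpy as np

    def features(r):
        nh, nt = r["helpful"]
        return [1.0, nh, nt, len(r["reviewText"]), len(r["summary"])]

    X = np.array([features(r) for r in train])
    y = np.array([r["overall"] for r in train])
    Xt = np.array([features(r) for r in test])
    yt = np.array([r["overall"] for r in test])

    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("test MSE:", np.mean((Xt @ theta - yt) ** 2))

    # Ablation: drop one feature column at a time (never the offset, column 0).
    for col in range(1, X.shape[1]):
        keep = [c for c in range(X.shape[1]) if c != col]
        th, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        mse = np.mean((Xt[:, keep] @ th - yt) ** 2)
        print(f"without column {col}: MSE = {mse:.4f}")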
V. LATENT FACTOR MODEL

A. Concepts of the Latent Factor Model

The Latent Factor Model is a supervised learning model that uses features extracted from users and items. It is a reasonable model for rating prediction because user and item effects are key factors in how a user rates an item: the model captures how a certain user typically rates items and what kind of rating a certain item tends to receive. We also add a term that captures the interaction between users and items, and a regularization weight λ to prevent overfitting. With this configuration, we further divide the training data evenly into a training part and a validation part.

After investigating the dataset, we find that some users in the validation set do not appear in the training set, and 8,200 items in the validation set do not appear in the training set (meaning the model has never seen these users/items and cannot make a meaningful personalized prediction for them). Similarly, some users in the test set do not appear in the training set, and 8,141 items do not appear. These are the statistics after shuffling the original dataset; they get worse if we do not shuffle.

This induces two methods of calculating the MSE:

Method 1: if the user/item does not appear in the training set, the model predicts using only the offset term. The sum of squared errors is averaged over the number of reviews in the test set.

Method 2: if the user/item does not appear in the training set, the model makes no prediction for that review. The sum of squared errors excludes these reviews and is averaged over the total number of reviews minus the number of reviews with unseen users or items.

We expect the first method to produce a higher MSE, because predicting an unseen user's/item's rating with the offset alone ignores the user/item information that the latent factors exploit.
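A sketch of the two MSE conventions, assuming a fitted predict(u, i) function and an alpha offset (both hypothetical names for illustration):

    def mse_two_ways(test, train_users, train_items, predict, alpha):
        """Method 1: unseen users/items fall back to the offset alpha.
        Method 2: reviews with unseen users/items are excluded entirely."""
        se1 = se2 = n2 = 0.0
        for r in test:
            u, i, y = r["reviewerID"], r["asin"], r["overall"]
            seen = u in train_users and i in train_items
            pred = predict(u, i) if seen else alpha
            se1 += (y - pred) ** 2
            if seen:
                se2 += (y - pred) ** 2
                n2 += 1
        return se1 / len(test), se2 / n2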

B. Mathematical Formula

The prediction is

$$\mathrm{rating} \approx f(u, i) = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i$$

We use an alternating least squares procedure to fit this model. First we fix γ_i and update α, β_u, β_i, γ_u; then we fix γ_u and update α, β_u, β_i, γ_i. The update equations are given below, where R_{u,i} is the rating of user u on item i, λ is the regularization weight, N_train is the number of training reviews, I_u is the set of items rated by user u, and U_i is the set of users who rated item i:

$$\alpha = \frac{1}{N_{\mathrm{train}}} \sum_{(u,i) \in \mathrm{Train}} \left( R_{u,i} - (\beta_u + \beta_i + \gamma_u \cdot \gamma_i) \right)$$

$$\beta_u = \frac{\sum_{i \in I_u} \left( R_{u,i} - (\alpha + \beta_i + \gamma_u \cdot \gamma_i) \right)}{\lambda + |I_u|}$$

$$\beta_i = \frac{\sum_{u \in U_i} \left( R_{u,i} - (\alpha + \beta_u + \gamma_u \cdot \gamma_i) \right)}{\lambda + |U_i|}$$

The interaction vectors γ_u and γ_i are updated by gradient descent with a fixed learning rate η:

$$\frac{\partial}{\partial \gamma_u^k} = 2\gamma_i^k \left( \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i - R_{u,i} \right) + 2\lambda\gamma_u^k, \qquad \gamma_u^k \leftarrow \gamma_u^k - \eta\, \frac{\partial}{\partial \gamma_u^k}$$

$$\frac{\partial}{\partial \gamma_i^k} = 2\gamma_u^k \left( \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i - R_{u,i} \right) + 2\lambda\gamma_i^k, \qquad \gamma_i^k \leftarrow \gamma_i^k - \eta\, \frac{\partial}{\partial \gamma_i^k}$$

C. Computational Results

Here are the results on the validation set:

λ | MSE Method 1 | MSE Method 2

From this table, λ = 3 yields the best results and is therefore used on the full training set. Also, as argued above, Method 1 does yield a higher MSE than Method 2. Here are the results on the test set:

λ | MSE Method 1 | MSE Method 2

D. Analysis of Performance

From these results, the Latent Factor Model performs much better than the baseline model and the Linear Regression Model. The likely reason is that user and item identities are much more informative features than the number of helpful votes or the length of the review text. Moreover, after taking into account the vectors representing the interaction between users and items (γ_u and γ_i), the performance becomes even better. However, the Latent Factor Model does have a disadvantage: cold start, i.e. users or items that were never seen in the training set. From the Method 1 vs. Method 2 comparison above, cold start clearly increases the MSE.
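A compact sketch of one alternating pass implementing the updates above. This is our own illustration; the latent dimension, regularizer, learning rate and iteration count are placeholders, and train and avg come from the earlier sketches.

    import numpy as np
    from collections import defaultdict

    K, lam, eta = 5, 3.0, 1e-4   # placeholder hyperparameters

    users = sorted({r["reviewerID"] for r in train})
    items = sorted({r["asin"] for r in train})
    uidx = {u: k for k, u in enumerate(users)}
    iidx = {i: k for k, i in enumerate(items)}

    alpha = avg
    beta_u = np.zeros(len(users)); beta_i = np.zeros(len(items))
    gamma_u = np.random.normal(0, 0.01, (len(users), K))
    gamma_i = np.random.normal(0, 0.01, (len(items), K))

    by_user = defaultdict(list); by_item = defaultdict(list)
    for r in train:
        u, i, y = uidx[r["reviewerID"]], iidx[r["asin"]], r["overall"]
        by_user[u].append((i, y)); by_item[i].append((u, y))

    for _ in range(10):                      # a few alternating passes
        alpha = np.mean([y - (beta_u[u] + beta_i[i] + gamma_u[u] @ gamma_i[i])
                         for u, rs in by_user.items() for i, y in rs])
        for u, rs in by_user.items():        # closed-form beta_u update
            beta_u[u] = sum(y - (alpha + beta_i[i] + gamma_u[u] @ gamma_i[i])
                            for i, y in rs) / (lam + len(rs))
        for i, rs in by_item.items():        # closed-form beta_i update
            beta_i[i] = sum(y - (alpha + beta_u[u] + gamma_u[u] @ gamma_i[i])
                            for u, y in rs) / (lam + len(rs))
        for u, rs in by_user.items():        # gradient step on interaction terms
            for i, y in rs:
                err = alpha + beta_u[u] + beta_i[i] + gamma_u[u] @ gamma_i[i] - y
                gu = gamma_u[u].copy()
                gamma_u[u] -= eta * (2 * err * gamma_i[i] + 2 * lam * gamma_u[u])
                gamma_i[i] -= eta * (2 * err * gu + 2 * lam * gamma_i[i])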

VI. SENTIMENT ANALYSIS MODEL

A. Concepts of the Sentiment Analysis Model

The Sentiment Analysis Model is a supervised learning model, based on linear regression, that uses text to solve the predictive task. For our rating prediction problem, it predicts what rating an item tends to receive given the review text from a certain user. We analyze the review text of all users by finding the 1000 most common unigrams, bigrams and trigrams; this is the Bag-of-Words representation. We build predictors from two feature representations. The first uses the number of appearances of unigrams, bigrams and trigrams in the review texts as features. The second uses TF-IDF representations, which estimate the relevance of a term in a document, as features.

B. Word Cloud Analysis

Word clouds based on the frequencies of the 1000 most common unigrams, bigrams and trigrams in reviews rated 1 and 5 are shown in Figures 1-3.

Fig. 1: Unigram word clouds for rating 1 (a) and rating 5 (b).

Fig. 2: Bigram word clouds for rating 1 (a) and rating 5 (b).

Fig. 3: Trigram word clouds for rating 1 (a) and rating 5 (b).

The unigram clouds are based on single-word frequency. The unigram clouds for ratings 1 and 5 do not give much meaningful information, because most terms are the same in both clouds. From the bigram cloud for rating 1, we find that dissatisfied readers mostly complain about a short story line and terrible main-character development. Most rating-5 reviews praise a great story line, and most of these reviewers mention that they look forward to the writer's next book and would highly recommend this book to other people. The additional information in the trigram cloud for rating 1 is that most of these reviewers could not even finish reading the book and felt that most of it was nonsense.

C. Mathematical Formula

The formula for predicting ratings from the numbers of appearances is given below, where ω ranges over the bag of words (unigrams, bigrams and trigrams) and count(ω) is the number of appearances of ω in the review text:

$$\mathrm{rating} \approx \alpha + \sum_{\omega \in \mathrm{text}} \mathrm{count}(\omega)\, \theta_\omega$$

For the TF-IDF features, let tf(t, d) be the number of times term t appears in document d and idf(t, D) measure how rare t is across the corpus D of N documents:

$$\mathrm{tf}(t, d) = |\{ t \in d \}|$$

$$\mathrm{idf}(t, D) = \log\!\left( \frac{N}{|\{ d \in D : t \in d \}|} \right)$$

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

$$\mathrm{rating} \approx \alpha + \sum_{\omega \in \mathrm{text}} \mathrm{tfidf}(\omega, d, D)\, \theta_\omega$$

D. Computational Results

Here are the results of using the different Bag-of-Words representations as features:

Bag of words | Train set MSE | Test set MSE
Unigram
Bigram
Unigram+Bigram
Uni/Bi/Trigram

From this table, using unigrams alone as features yields the best results. Here are the results of using TF-IDF representations as features:

Train set MSE | Test set MSE

TF-IDF representations yield a better MSE on the test set than raw appearance counts.
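A sketch of both feature pipelines using scikit-learn [4]. The library choice and parameters are our assumptions, and we use Ridge (regularized least squares) for numerical stability where the paper describes plain linear regression.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    docs_train = [r["reviewText"] for r in train]
    docs_test = [r["reviewText"] for r in test]
    y_train = [r["overall"] for r in train]
    y_test = [r["overall"] for r in test]

    for name, vec in [
        ("counts, uni/bi/trigrams", CountVectorizer(ngram_range=(1, 3), max_features=1000)),
        ("tf-idf, unigrams", TfidfVectorizer(ngram_range=(1, 1), max_features=1000)),
    ]:
        X_train = vec.fit_transform(docs_train)   # vocabulary fit on training text only
        X_test = vec.transform(docs_test)
        model = Ridge(alpha=1.0).fit(X_train, y_train)
        print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))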

E. Analysis of Performance

First, we note that this model does not overfit badly: it is based on the linear regression model and generalizes to new data. From the results above, among the Bag-of-Words representations, unigrams perform best. The following two tables suggest why. Here are the 10 unigrams with the most positive and most negative associated weights:

10 most positive unigrams: excellent, wait, awesome, disappoint, highly, hooked, thanks, five, recommended, wow

10 most negative unigrams: boring, predictable, ok, potential, sorry, okay, supposed, premise, finish, rushed

Here are the 10 uni/bi/trigrams with the most positive and most negative associated weights:

Positive uni/bi/trigrams: year old, writing style, best friend, really enjoyed reading, character development, would highly recommend, worth reading, would definitely recommend, written story, takes place

Negative uni/bi/trigrams: im looking forward, two books, two main characters, felt like, get enough, years later, first two books, sex scenes, several times, dont get

From these two tables, the entries in the second table are mostly bigrams or trigrams, and their associated weights are much larger. The likely reason is that these bigrams/trigrams appear much less frequently than other terms. Compared to the unigram model, the uni/bi/trigram model counts the appearances of unigrams, bigrams and trigrams together, and because bigrams and trigrams appear far less often than unigrams, they receive larger coefficients even though they may not be that important. This is why the unigram model performs best.

We also find that TF-IDF representations perform best on the test set compared to raw appearance counts. The reason is that TF-IDF takes the overall frequency of a term across the whole corpus into account, giving a normalized frequency and thereby eliminating the problem that low-frequency terms tend to receive larger coefficients.

In addition, the Sentiment Analysis Model outperforms the baseline models and the Latent Factor Model. The reason is probably that the information contained in the review text is more important than the relation between users and items, and more helpful for rating prediction than features such as the number of helpful votes or the length of the review text. However, the Sentiment Analysis Model does have disadvantages: it takes more time and costs more memory to train.

VII. RELATED WORK

To gain more confidence in the models above, we searched the literature and found the Gradient Boosted Regression Tree, which is widely used in regression problems because of its flexibility. In this section we introduce the GBRT model and then apply it to our dataset; this gives us more information about how our Latent Factor Model and Sentiment Analysis Model perform.

A. Concepts of the Gradient Boosted Regression Model

The Gradient Boosted Regression Tree is a non-linear supervised learning model in the form of an ensemble of decision trees. It is flexible because it can handle different types of predictor variables and missing data. According to "A working guide to boosted regression trees" [6], its predictive performance is superior to that of most traditional modelling methods.

B. Feature Design and Tuning Hyperparameters

Although GBRT needs no prior data transformation, we clearly cannot use all nine fields of a record for prediction; for example, unixReviewTime and reviewTime provide exactly the same information. In the end we use five features to predict the rating: reviewerID, asin, nHelpful, outOf and unixReviewTime. To achieve better predictions, we tune the following hyperparameters:

max_depth: the maximum depth of the individual regression estimators.
learning_rate: shrinks the contribution of each tree by this factor.
min_samples_leaf: the minimum number of samples required at a leaf node.

We tune these parameters by grid search with scikit-learn [4], as sketched below; the grid search returns the optimal values for max_depth, learning_rate and min_samples_leaf.
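A sketch of the tuning step, reusing the uidx/iidx mappings from the latent-factor sketch. The grid values are illustrative, and integer-encoding the ID features (with unseen IDs mapped to -1) is our assumption, since scikit-learn trees need numeric inputs.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    def gbrt_features(r, uidx, iidx):
        nh, nt = r["helpful"]
        return [uidx.get(r["reviewerID"], -1), iidx.get(r["asin"], -1),
                nh, nt, r["unixReviewTime"]]

    X = np.array([gbrt_features(r, uidx, iidx) for r in train])
    y = np.array([r["overall"] for r in train])

    grid = GridSearchCV(
        GradientBoostingRegressor(n_estimators=800),
        param_grid={"max_depth": [2, 3, 4],
                    "learning_rate": [0.01, 0.05, 0.1],
                    "min_samples_leaf": [1, 5, 20]},
        scoring="neg_mean_squared_error", cv=3)
    grid.fit(X, y)
    print(grid.best_params_)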
Since GBRTs are prone to overfitting, we also plot how the errors on the training set and the test set change as n_estimators increases. From this plot there is no overfitting problem even when we choose n_estimators to be 800, because the error on the test set is still decreasing.
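The error-vs-n_estimators curve can be produced with scikit-learn's staged_predict, which yields the ensemble's prediction after each boosting stage. A sketch under the same assumptions as above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import mean_squared_error

    best = grid.best_estimator_
    Xt = np.array([gbrt_features(r, uidx, iidx) for r in test])
    yt = np.array([r["overall"] for r in test])

    # MSE after each boosting stage, on train and test.
    train_err = [mean_squared_error(y, p) for p in best.staged_predict(X)]
    test_err = [mean_squared_error(yt, p) for p in best.staged_predict(Xt)]

    plt.plot(train_err, label="train MSE")
    plt.plot(test_err, label="test MSE")
    plt.xlabel("n_estimators"); plt.ylabel("MSE"); plt.legend()
    plt.show()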

C. Computational Results

Using the hyperparameters above, we obtain the final training and test MSEs, and the following feature importances:

reviewerID | asin | nHelpful | outOf | unixReviewTime

The importances are reasonable: the two most important features are reviewerID and asin (the item ID).

VIII. COMPARISON AND ANALYSIS

The table below shows the MSE on the test set for each model.

Model | MSE on Test Data
Average Prediction
Linear Regression Model
Latent Factor Model
Sentiment Analysis Model (TF-IDF)
GBRT Model

From this work, the linear regression and Gradient Boosted Regression Tree models perform better (in MSE) than the trivial average prediction, and the Latent Factor and Sentiment Analysis models in turn perform better than linear regression and GBRT.

The Latent Factor Model performs much better than the baseline models because: 1) user/item identities are more useful features than, e.g., the length of the review text or the number of helpful votes; 2) the Latent Factor Model captures the interactions between users and items (represented by the features γ_u and γ_i), so a user's preference for an item can be expressed and computed; 3) the regularization parameter λ prevents overfitting, so the model yields reasonable results on the test set. Although GBRT takes user/item features into consideration, it does not capture the interactions between users and items; that is why the Latent Factor Model beats it. However, the Latent Factor Model suffers from cold start and performs nearly the same as the trivial average prediction on users/items that it has never seen.

Moreover, the Sentiment Analysis Model outperforms the baseline models, the Latent Factor Model and the Gradient Boosted Regression Tree model because: 1) features derived from the review text carry more information for the predictive task than user/item features or the number of helpful votes; 2) the TF-IDF representation not only accounts for each term's frequency across all review texts, but also for the relevance of each term within a given review; 3) given review text, the model can be applied directly to a new dataset. Its disadvantage is the large amount of computation: it costs much more time to train a predictor.

IX. CONCLUSION

In conclusion, we implemented a trivial average prediction method and a linear regression model as baseline models, a Gradient Boosted Regression Tree model as a comparison from the literature, and a Latent Factor Model and a Sentiment Analysis Model as our main work. After comparing these models, we find that review text and user/item features are the most effective predictors of ratings in the Kindle Store Reviews Dataset. In the future, when we are about to make a prediction on ratings in a similar dataset, e.g.
Amazon Product Reviews, we can focus on extracting features from the review text and from users/items. Having accomplished this rating prediction task, we can also use our models in different applications: the Latent Factor Model can be used to build a recommender system, and the Sentiment Analysis Model can be used to investigate public opinions on items in a community.

REFERENCES

[1] J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015.
[2] J. McAuley, R. Pandey, J. Leskovec. Inferring networks of substitutable and complementary products. Knowledge Discovery and Data Mining, 2015.
[3] P. Prettenhofer. Gradient Boosted Regression Trees. April 4, 2014. Available:
[4] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011.
[5] A. Mueller. word clouds in Python, wordcloud 0.1 documentation. 2013. Available: cloud/index.html
[6] J. Elith, J. R. Leathwick, T. Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 8 April, 2008.