Rating Prediction and Recommendation of California Restaurants Based on Google Database


Jiajie Shi, Muhan Zhao, Zhiling Liu
Department of Mathematics, University of California, San Diego, California, USA
(Dated: March 13, 2017)

This report summarizes the regression modeling and analysis results for the Google location dataset. It documents the models implemented for the rating prediction task (linear regression, a latent factor model and a latent semantic model) and the association analysis method used for the recommender system. The report is structured as follows. The first section introduces the basic statistics of the Google location dataset in order to establish the properties of the features and the objectives of the later tasks. The second section constructs different models for the prediction and recommendation tasks and compares their results. The third section presents related literature on the models we implemented. Conclusions are presented in the last section.

I. INTRODUCTION

A. Identify and filter the dataset

Automating the learning process is one of the long-standing goals of Artificial Intelligence and of its more recent specialization, Machine Learning, and it is also the core goal of newer research areas such as Data Mining. To carry out a data mining task, we first need to identify an appropriate dataset of suitable size with interesting relationships among its features. We chose the Google location dataset, which contains information about places (ratings, price, business hours, category, phone number, longitude and latitude) and about users (rating, job, graduation, review text and review time). Compared with smaller datasets, its advantage is that it holds varied information about both users and places, which makes it suitable for mining the relationships between them.
The main object of this study is a set of 292,465 reviews and 45,536 restaurants located in the state of California. In this section we describe the exploratory data analysis plots used to make an initial assessment of the features of the restaurants and users present in the data.

The full dataset is very large, and it also contains information about hotels, bars, cafes and so on, so we first need to identify the object of study. Doing so both reduces the data to a manageable size and makes the object of study well defined. We first filter the data using the geographic information (longitude and latitude) of the places. Using the geographic range provided by Google Maps, we identify the buildings located in the state of California. This subset contains locations such as restaurants, churches, companies, hotels and coffee shops. We found that restaurants are the main type of place in this dataset, so we chose restaurants as our main object of study. TABLE I lists the most frequent restaurant categories.

Using the basemap package in Python, we draw the distribution of restaurants in California, shown in FIG 1. The figure makes clear that most restaurants are located near the main cities: San Francisco, LA, San Diego, San Jose and Irvine. Following this aggregation of restaurants, we applied K-means clustering to group the restaurants into 10 clusters in order to recognize their distribution, as presented in FIG 2. In FIG 2, the orange points represent the main cities and the white circles represent the cluster centroids.

TABLE I. Frequency of restaurant categories

Name of Category              Num of Restaurants
Fast Food Restaurant          [...]
Asian Restaurant              [...]
Mexican Restaurant            [...]
American Restaurant           [...]
Latin American Restaurant     [...]
European Restaurant           [...]
Hamburger Restaurant          [...]

Contact e-mail: muz021@ucsd.edu

B. Basic Statistical Analysis of Dataset

After the filtering steps above, the dataset is ready for modeling rating predictions. To find potential features that account for restaurant ratings, we plot several features against rating: the review time, the number of reviews that a restaurant received, and the length of its reviews. The results are shown in the four panels of FIG 3. We randomly extracted 40,000 data points to build those plots.
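The K-means grouping of restaurant coordinates used for FIG 2 can be illustrated with a short sketch. This is not the authors' code: it is a plain-Python version of Lloyd's algorithm, and the function name `kmeans`, the toy coordinates, the farthest-first initialisation and the use of Euclidean distance on raw (longitude, latitude) pairs are all assumptions made for illustration. The report uses 10 clusters; the toy example uses 2.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=100):
    """Cluster 2-D points with Lloyd's algorithm (hypothetical helper).

    points: list of (longitude, latitude) tuples.
    Returns (centroids, labels), where labels[n] is the cluster of points[n].
    """
    # Assumption: deterministic farthest-first initialisation.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))

    def nearest(p):
        # Index of the closest current centroid.
        return min(range(k), key=lambda j: dist2(p, centroids[j]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated

    labels = [nearest(p) for p in points]
    return centroids, labels


# Toy data: two well-separated groups of made-up coordinates.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(toy, k=2)
```

On real coordinates, a library implementation with several random restarts would usually be preferable, since Lloyd's algorithm can stop at a poor local optimum.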

FIG. 1. Distribution of restaurants in CA based on 40,000 data.

FIG. 2. K-means clustering of restaurants in CA

In FIG 3, the first panel shows the average review length for each star rating. One-star reviews are much longer than the others. This is reasonable: users usually give 1 star because they have many complaints about the restaurant and many issues to talk about. As the rating goes higher, the reviews become quite short, with just a few words of compliment.

The second panel of FIG 3 shows the relationship between months and ratings. In June people usually give higher ratings, while in March, April, October and December people tend to give lower ratings. Perhaps this is because the weather becomes pleasant in June and people have more chances to travel, so they tend to be happier and give higher ratings. In contrast, people might be busy with work and study in March, April, October and December, and it seems reasonable that ratings drop because of this busyness. The lack of festivals in those months might also have an influence.

The third panel shows the average rating of each year. The ratings are lowest in 2008; perhaps the financial crisis contributed significantly to this. Except for 2008, the remaining years have quite similar average ratings.

The last panel of FIG 3 shows the relationship between ratings and the average number of reviews that restaurants with that rating received. It seems that the lower the rating, the fewer reviews a restaurant receives. From the customer's point of view, if we have been to a pretty good restaurant we are happy to give praise, while if we dislike a restaurant we do not want to say much about it.

In the next section, we build rating prediction models based on linear regression, a latent factor model and a latent semantic model.

II. RATING PREDICTION

In this section, we implement three models (a linear regression model, a latent factor model and a latent semantic model) to model the ratings that customers give to restaurants. For the rating prediction task, we shuffle the dataset randomly and divide it into three parts: a training dataset of 70,000 reviews, a validation dataset of 30,000 reviews and a test dataset of 20,000 reviews. Each model uses these three datasets. After implementing the three models, we compare the results and comment on them.

A. Linear Regression Model

The first model we considered is a linear regression model. Using FIG 3, we tried to identify potentially useful features related to ratings. The figure shows a roughly linear relationship between certain features and the ratings. For example, the review length and the number of reviews a restaurant has received have some linear effect on customers' ratings. Some features, however, cannot simply be put into the linear regression model as a single number, such as the year and month in which the review was made. If we simply assign one coefficient to the month or the year, the mean squared error (MSE) on the test dataset is not satisfactory. The category of a restaurant does not have a linear relationship with the ratings either. For such features we use feature vectors. Taking the month as an example, the vector (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) represents January, and so on. After several tries, we selected four features as our predictors: the review length, the number of reviews that the restaurant has received, and the month and year in which the review was made. The predictor equation is

$$\text{rating} = \theta_0 + \theta_1 \,(\text{length of review}) + \theta_2 \,(\text{number of reviews}) + \theta_3 \cdot (\text{month feature vector}) + \theta_4 \cdot (\text{year feature vector})$$

After tuning the regularization parameter on the validation set, the final MSE on the test dataset is [...].

[FIG 3 panels: "Average length of reviews based on rating" (average length of review versus rating); "Average rating based on various months" (average rating versus month); "Average review time of year based on rating" (year); "Average num of reviews on rating" (average number of reviews versus rating). Each panel shows the average points and a polyfit curve.]

B. Latent Factor Model

Compared with the linear model, which adjusts the coefficients assigned to different features, the latent factor model focuses on the behavior of users and the properties of items. In a latent factor model, features are not useful if we have many observations about users or items, but they are useful for new users and items that have never appeared in the dataset before. This is known as the cold-start problem in recommender systems. For example, how much does this particular user tend to rate things above the mean? And how likely is this item to receive higher ratings than others? To quantify these tendencies, we introduce the user term $\beta_u$ and the item term $\beta_i$. The resulting linear model treats users and items independently,

$$f(u, i) = \alpha + \beta_u + \beta_i, \qquad (1)$$

where $f(u, i)$ represents the rating of item $i$ given by user $u$, and $\alpha$ is a constant that is usually close to the average rating over all data. In contrast, the nonlinear model treats the properties of users and items as a multiplicative interaction. Introducing the fitting parameters $\gamma_u$ and $\gamma_i$, the equation takes the form

$$f(u, i) = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i \qquad (2)$$

To compute ratings with the latent factor model, we use gradient descent to solve for the variables $\beta_u$, $\beta_i$, $\gamma_u$ and $\gamma_i$. As the initial guess, we calculate the average rating given by each user and subtract the overall average rating to construct $\beta_u$.
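A minimal sketch of fitting equation (2) by gradient descent is given here for illustration. It is not the authors' implementation: the function names, the full-batch update, the step size `lr`, the number of steps, the 0.1 scale of the random latent vectors and the exact form of the regularized objective are assumptions. What it takes from this section is the form of $f(u, i)$, the mean-offset initialisation of the bias terms, the normally distributed latent vectors of dimension 5, and a regularization weight $\lambda$.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_latent_factor(ratings, k=5, lam=1.0, lr=0.01, steps=200, seed=0):
    """Full-batch gradient descent for f(u,i) = alpha + b_u + b_i + g_u . g_i.

    ratings: list of (user, item, rating) triples.
    Assumed objective: sum of squared errors
                       + lam * (sum of squared biases and latent entries).
    """
    rng = random.Random(seed)
    alpha = sum(r for _, _, r in ratings) / len(ratings)  # fixed global mean

    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append(r)
        by_item.setdefault(i, []).append(r)
    # Initial biases: mean rating of each user / item minus the global mean.
    beta_u = {u: sum(rs) / len(rs) - alpha for u, rs in by_user.items()}
    beta_i = {i: sum(rs) / len(rs) - alpha for i, rs in by_item.items()}
    # Latent vectors: random normal values (the 0.1 scale is an assumption).
    gamma_u = {u: [rng.gauss(0, 0.1) for _ in range(k)] for u in by_user}
    gamma_i = {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in by_item}

    for _ in range(steps):
        # Start each gradient with the regularization part.
        g_bu = {u: 2 * lam * b for u, b in beta_u.items()}
        g_bi = {i: 2 * lam * b for i, b in beta_i.items()}
        g_gu = {u: [2 * lam * x for x in v] for u, v in gamma_u.items()}
        g_gi = {i: [2 * lam * x for x in v] for i, v in gamma_i.items()}
        # Add the squared-error part, accumulated over all observations.
        for u, i, r in ratings:
            err = alpha + beta_u[u] + beta_i[i] + dot(gamma_u[u], gamma_i[i]) - r
            g_bu[u] += 2 * err
            g_bi[i] += 2 * err
            for d in range(k):
                g_gu[u][d] += 2 * err * gamma_i[i][d]
                g_gi[i][d] += 2 * err * gamma_u[u][d]
        # Simultaneous update of all parameters.
        for u in beta_u:
            beta_u[u] -= lr * g_bu[u]
            gamma_u[u] = [x - lr * g for x, g in zip(gamma_u[u], g_gu[u])]
        for i in beta_i:
            beta_i[i] -= lr * g_bi[i]
            gamma_i[i] = [x - lr * g for x, g in zip(gamma_i[i], g_gi[i])]
    return alpha, beta_u, beta_i, gamma_u, gamma_i


def predict(model, u, i):
    """Predicted rating; unseen users or items fall back to alpha plus known terms."""
    alpha, beta_u, beta_i, gamma_u, gamma_i = model
    return (alpha + beta_u.get(u, 0.0) + beta_i.get(i, 0.0)
            + dot(gamma_u.get(u, []), gamma_i.get(i, [])))


def mse(model, ratings):
    return sum((predict(model, u, i) - r) ** 2 for u, i, r in ratings) / len(ratings)


# Toy ratings (made up): user A is generous, item x is well liked.
toy = [("A", "x", 5), ("A", "y", 4), ("B", "x", 3), ("B", "y", 2)]
model = fit_latent_factor(toy, lam=0.1)
```

The fallback in `predict` shows the cold-start limitation in miniature: for a user and an item that were never observed, the prediction collapses to the global mean $\alpha$.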
The item terms $\beta_i$ are built the same way: we calculate the average rating received by each restaurant and subtract the overall average rating. The vectors $\gamma_u$ and $\gamma_i$ are initialised with random values drawn from a normal distribution, and we set the dimension of $\gamma_u$ and $\gamma_i$ for each customer and restaurant to 5. The training dataset of 70,000 reviews is used to train the predictor based on the nonlinear latent factor model, while the regularization parameter λ is tuned on the validation set over the range 0.2 to 10. Finally, we compute the MSE of the ratings on the test dataset. The trend of the validation MSE for different values of λ is shown in FIG 4. The minimum MSE on the validation set is [...], and the corresponding λ is 4.4. On the test dataset, using λ = 4.4, the MSE is [...].

We also tried setting different regularization parameters for the different terms $\beta_u$, $\beta_i$, $\gamma_u$ and $\gamma_i$. However, the MSE of this model on the test dataset is only a tiny improvement, which indicates that this approach is not very useful on this dataset. To show that the latent factor model performs better than the simplest model, we use the average rating of the training dataset as the prediction for every rating in the test dataset. The MSE of this simplest model on the test dataset is [...], which is much higher than the MSE obtained by the latent factor model.

FIG. 3. Basic statistics between features and ratings

FIG. 4. MSE of validation set based on various λ (MSE versus λ)

C. Latent Semantic Model

In this section, we use a text mining method, the latent semantic model, to model the ratings given by different users based on their review text. Because some customers gave ratings without writing reviews, we first filter the dataset to remove the empty reviews. We then remove punctuation and stop words to obtain a clean and manageable dataset of unigrams for analysis. To build the semantic model, we first use the counts of the 1000 most frequent unigrams in each review as the features of a linear regression model, as shown in equation (3). This model gives a relatively large MSE, which amounts to [...].

$$\text{rating} \approx \alpha + \sum_{\omega \in \text{text}} \text{count}(\omega)\,\theta_\omega \qquad (3)$$

We then performed a basic analysis of those popular words. FIG 5 shows the 10 most popular words: "food" appears more than 100,000 times, "place" more than 4,000 times and "go" more than 2,000 times in the reviews. Most of the popular words are just common nouns and verbs such as "food", "place", "go" and "get". Such words reveal nothing about customers' opinions of restaurants, and they strongly affect the final result by adding a biased weight to the prediction for each review. Therefore, we filter the 1000 most popular words by checking whether each word carries a positive or negative emotional tendency.

FIG. 5. Most frequent unigrams

By referring to a package called "A list of English positive and negative opinion words or sentiment words", we separate the popular words into two parts: positive popular words and negative popular words. Here we take the positive popular words as an example to explain the next part of the analysis. After filtering, FIG 6 shows part of the 964 most popular positive words; the most frequent positive word is "good". In contrast, FIG 7 shows part of the most popular negative words; the most frequent negative word is "bad".

After removing the empty reviews, there are still 144,283 reviews from highly rated restaurants, that is, restaurants whose rating is greater than or equal to 4. We randomly pick 72,000 reviews as the training dataset and use the remaining 72,083 reviews as the test dataset. After performing linear regression on these unigram features, the MSE of this model on the test dataset is [...] with regularization parameter λ equal to 1. The coefficients θ of those features are as follows: [...].

FIG. 6. Positive unigrams
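The unigram count features described in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function names, the tiny stop-word set, the toy reviews and the three-word stand-in for the opinion-word list are all hypothetical. Only the overall recipe (strip punctuation and stop words, keep the most frequent unigrams that appear in a sentiment lexicon, use their counts as regression features) comes from the text.

```python
import string
from collections import Counter

# Tiny stand-in stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of"}


def tokenize(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def build_vocabulary(reviews, lexicon, size):
    """Most frequent unigrams that also occur in the opinion-word lexicon."""
    counts = Counter(w for text in reviews for w in tokenize(text) if w in lexicon)
    return [w for w, _ in counts.most_common(size)]


def count_features(text, vocabulary):
    """Feature vector [1, count(w_1), ..., count(w_n)] for one review.

    The leading 1 is the constant term that multiplies alpha in equation (3).
    """
    counts = Counter(tokenize(text))
    return [1] + [counts[w] for w in vocabulary]


# Made-up reviews and a three-word stand-in for the sentiment lexicon.
reviews = ["Good food, good place!",
           "Bad service and bad food.",
           "Great tacos, good salsa"]
lexicon = {"good", "great", "bad"}
vocabulary = build_vocabulary(reviews, lexicon, size=2)
features = count_features(reviews[0], vocabulary)
```

Because "food" and "place" are not in the lexicon, they never enter the vocabulary, which is the effect the sentiment filter is meant to have. The resulting vectors can be passed to any regularized linear regression solver.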

FIG. 7. Negative unigrams

Up to this point, the features we use are just the counts of frequent words with an emotional tendency. To get a better MSE on the test dataset, we replace the simple counts with the Term Frequency and Inverse Document Frequency (TF-IDF) values of the most frequent unigrams in the linear regression model. Term Frequency (TF) measures how often a term appears in a document, and Inverse Document Frequency (IDF) measures how rare the term is across all documents. First, we randomly select 10,000 high-rating reviews and split them equally into training data and test data. Then we compute the TF and IDF values for all of the popular positive words. The definition of IDF is

$$\text{idf}(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right),$$

where $N$ is the total number of documents, $t$ is the term, $d$ is a document and $D$ is the corpus. The ten smallest IDFs among the positive and negative words, which correspond to the most frequent positive and negative words, are shown in TABLE II.

TABLE II. IDF of the 10 most frequent positive and negative words

Positive word    IDF      Negative word    IDF
great            [...]    bad              [...]
good             [...]    horrible         [...]
best             [...]    poor             [...]
love             [...]    wrong            [...]
delicious        [...]    dirty            [...]
like             [...]    bland            [...]
excellent        [...]    expensive        [...]
nice             [...]    sucks            [...]
friendly         [...]    overpriced       [...]
amazing          [...]    worse            [...]

Finally, we calculate the TF-IDF for each review and popular word using the equation

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$$

Thus, we model the rating with the following equation rather than with the simple counts:

$$\text{rating} \approx \alpha + \sum_{\omega \in \text{text}} \text{tfidf}(\omega, d, D)\,\theta_\omega$$

With these features, we get a relatively small MSE on the test data, equal to [...], with λ equal to 1.

III. RECOMMENDER SYSTEM

A. Association rule learning

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large datasets. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence. The support of X with respect to T is defined as the proportion of reviewers' histories t in the dataset T that contain category X:

$$\text{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|} \qquad (4)$$

Confidence is an indication of how often the rule has been found to be true. The confidence value of a rule $X \Rightarrow Y$ with respect to a set of histories T is the proportion of the histories containing X that also contain Y. Confidence is defined as

$$\text{conf}(X \Rightarrow Y) = \frac{\text{supp}(X \cup Y)}{\text{supp}(X)} \qquad (5)$$

We first select the 12 most popular categories as our research target and try to find the association rules among them. From the reviewers' histories, we calculate the support and the confidence between each pair of categories. FIG 8 shows the support: the larger the number, the more frequently the two categories appear together. FIG 9 shows the confidence: the larger the number, the more frequently the second category accompanies the first category.

FIG. 8. Support
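Equations (4) and (5) can be made concrete with a small sketch. The code below is an illustration under assumptions, not the authors' implementation: the function names and the toy reviewer histories are hypothetical, and each history is assumed to be a Python set of category names. The thresholds 0.03 and 0.2 in the usage lines are the minimum support and minimum confidence that the report states it uses.

```python
from itertools import permutations


def support(histories, categories):
    """Equation (4): fraction of histories containing every given category."""
    wanted = set(categories)
    return sum(1 for h in histories if wanted <= h) / len(histories)


def confidence(histories, x, y):
    """Equation (5): conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(histories, {x, y}) / support(histories, {x})


def strong_rules(histories, min_support, min_confidence):
    """Ordered category pairs (x, y) that pass both thresholds.

    Returns a list of (x, y, support, confidence) tuples.
    """
    categories = sorted(set().union(*histories))
    rules = []
    for x, y in permutations(categories, 2):
        s = support(histories, {x, y})
        if s < min_support:
            continue  # step 1: the pair is not frequent enough
        c = confidence(histories, x, y)
        if c >= min_confidence:  # step 2: the rule is reliable enough
            rules.append((x, y, s, c))
    return rules


# Made-up reviewer histories: the set of categories each reviewer has visited.
histories = [
    {"Seafood", "American"},
    {"Seafood", "American"},
    {"Seafood"},
    {"Mexican"},
]
rules = strong_rules(histories, min_support=0.03, min_confidence=0.2)
```

In the toy data, three of the four histories contain "Seafood" and two of those also contain "American", so conf(Seafood ⇒ American) is 2/3, while conf(American ⇒ Seafood) is 1. The asymmetry is why confidence is computed for ordered pairs.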

FIG. 9. Confidence

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two separate steps:

1. A minimum support threshold is applied to find all frequent categories in a database. Here, the support is the chance that two kinds of restaurants both appear in a reviewer's history.
2. A minimum confidence constraint is applied to these frequent categories in order to form rules. Here, the confidence is the chance that a reviewer goes to another kind of restaurant given that one kind of restaurant appears in his history.

We set the minimum support to 0.03 and the minimum confidence to 0.2 to find strong associations. This gives about 60 pairs of categories, which may indicate strong associations behind the reviewers' behaviour. The final confidence values are shown in FIG 10.

FIG. 10. Modified Confidence

The left-hand column is the confidence of the pair of categories shown in the right-hand column. For example, if a reviewer has been to a seafood restaurant, then we can guess that he is more likely to go to an American restaurant than to any other kind of restaurant.

B. Model Ensemble

Finally, we want to build a recommendation system for different users. Since we now have four models based on different features and methods, the simplest way to ensemble them is a linear combination:

$$R = \lambda_1 (\alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i) + \lambda_2 (\theta_1 \cdot F) + \lambda_3 (\theta_2 \cdot \omega) + \lambda_4 (C)$$

In the above equation, the first term is the prediction of the latent factor model, which predicts a final score for a certain reviewer towards a certain restaurant based on latent factors. The second term is the linear regression on a restaurant's general features, such as the average review length and the number of reviews, which judges the quality of the restaurant. The third term is very similar to the second one, but replaces the general features with the semantic features to find out how good or bad a restaurant is based on the textual information. The last term is the association factor, which is equal to the confidence in the association analysis. This term is also important in our recommendation system because it indicates a reviewer's bias towards going to another kind of restaurant based on the kinds he has already been to.

In summary, the first term includes the latent factors between reviewers and restaurants that may affect the reviewers' choices, the middle two terms judge a restaurant's quality through general and textual features, and the last term tells us which kind of restaurant a reviewer tends to choose based on association rule learning. By comparing the final scores computed with this formula, we can recommend the restaurant with the highest score to our users.

IV. LITERATURE REVIEW

As we have implemented three models for the rating prediction task, we also searched online for more interesting research based on those three models.

A. Association Rule Learning

In a recent paper [1] published by Google, YouTube engineers analyzed in greater detail the inner workings of YouTube's recommendation algorithm. YouTube recommendations are driven by Google Brain, which was recently open-sourced as TensorFlow. By using TensorFlow one can experiment with different deep neural network
architectures using distributed training. The system consists of two neural networks. The first one, candidate generation, takes the user's watch history as input and uses collaborative filtering to select videos in the range of hundreds. An important distinction between development and final deployment to production is that during development Google uses offline metrics to measure the performance of algorithms, but the final decision comes from live A/B testing between the best-performing algorithms. The second neural network is used for ranking the few hundred videos in order. This is a much simpler problem than candidate generation, because the number of videos is smaller and more information is available for each video and its relationship with the user. YouTube's recommendation system is one of the most sophisticated and heavily used recommendation systems in industry.

B. LFM for highly multi-relational data

Many data such as social networks, movie preferences or knowledge bases are multi-relational, in that they describe multiple relationships between entities. While there is a large body of work focused on modeling these data, few have considered modeling these multiple types of relationships jointly. Further, existing approaches tend to break down when the number of these types grows. The paper [2] proposes a method for modeling large multi-relational datasets with possibly thousands of relations. The model is based on a bilinear structure, which captures the various orders of interaction of the data and also shares sparse latent factors across different relations. The paper illustrates the performance of the approach on standard tensor-factorization datasets, where it attains or outperforms state-of-the-art results. Finally, an NLP application demonstrates the scalability of the model and its ability to learn efficient and semantically meaningful verb representations.

C. Sentiment Analysis

Sentiment analysis is the study of opinions, sentiments, evaluations, emotions, attitudes and so on, which often appear in reviews, feedback or comments. Opinions are key influencers of our behavior. The work [3] discusses in detail how our understanding and perception of things are affected by the judgment of others: when people need to make a decision, they often consider the views of others. Therefore, turning these opinionated documents into structured data helps us to slice, dice and visualize the results and to carry out qualitative and quantitative analysis. According to [4], people have six main emotions: love, joy, surprise, anger, sadness and fear. Distinguishing those emotions therefore helps us to analyze much more information and to find the necessary summaries. There are several ways to distinguish words.

1. Directly apply supervised learning techniques to classify reviews into positive and negative.
2. Three classification techniques were tried: Naive Bayes, Maximum entropy and Support vector machines.

V. CONCLUSIONS

First, we chose the restaurants located in California as our object of study and performed some basic statistical analysis of the data, such as the frequency of different restaurant categories, the geographical distribution of restaurants, and the relationship between review features and ratings. Then we trained three different models to predict the rating of each review. From the final MSE on the test set, we can weigh the pros and cons of each.

The Linear Regression model has the highest MSE, which is easy to understand because the relationship between the features we selected and the rating does not have to be linear. The latent factor model has a better MSE, 0.586, which implies that the latent factor model is a powerful way to predict the rating and that no extra features are needed. This is because the latent factor model can discover latent dimensions corresponding to outside features, so we do not need extra features. But this model also has its limit: the latent factor model is next to useless if either the user or the restaurant was never observed before. Finally, the Latent Semantic Model has the best MSE, which is [...]. This is because the rating is highly related to the review content. The Latent Semantic Model may seem like cheating, because if we want to predict a user's preference for a restaurant, we usually do not know the review in advance. However, the historical reviews are still useful for our recommendation system because they reflect how a restaurant has been evaluated.

In the last part of this paper, in order to improve the precision of our recommendations, we performed association rule learning and put the association factor into our recommendation formula. The recommendation system we built combines the three models we trained with the association rules. By computing the score for each user and restaurant, we can find the restaurant with the highest score, which is the one we recommend to this user.

Finally, after reading some literature related to our work, we realized that building a recommendation system is a very sophisticated project, and there is still a lot to learn in the future. For example, we could use non-linear regression for our features, use a collaborative filtering algorithm instead of simple association rule learning, and try other methods to ensemble the different models.

[1] Davidson J, Liebald B, Liu J, et al. The YouTube video recommendation system. In: Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 2010.
[2] Jenatton, Rodolphe, et al. A latent factor model for highly multi-relational data. Advances in Neural Information Processing Systems.
[3] Bing Liu. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
[4] Parrott, W. Gerrod. Emotions in social psychology: Essential readings. Psychology Press, 2001.