Product Recommender System for Small-Scale Niche Businesses using Association Rule Mining



Julia Di Russo
ANR 156304, SNR u1278576
Master Thesis
Master Data Science: Business and Governance
Academic Year
Tilburg University
Date: May 15th, 2017
Supervisor: Sander Bakkes
Second reader: Drew Hendrickson
Supervisors at OnMarc: Kevin Van Kalkeren & Melissa Paans
Provider of the dataset: OnMarc
Faculty: Tilburg School of Humanities

Table of Contents

Preface
Abstract
Introduction
Related Work
  General methods for recommender systems
  Common techniques for the evaluation of recommender systems
  Recommender systems for small-scale businesses
  Frequent item set mining & association rule mining
  Summary of related work
Method
  Dataset description
  Pre-processing of the data
  Exploratory data analysis
  Experimental procedure
  Evaluation of the model
Results
  Performance of the algorithm on the validation set
  Performance obtained on the test set
  Statistical analysis of the accuracy scores on test set
Conclusion & Discussion
  Answers to the Research Questions
  Answer to the problem statement
  General Discussion
  Limitations and Recommendations for future research
References
Appendix

Preface

I would like to thank my academic supervisor Sander Bakkes for his help regarding the writing of this thesis. I would also like to sincerely thank Kevin and Melissa for their advice and their valuable presence during this research, as well as Peter for his valuable advice regarding the analysis of my results. Moreover, I would like to thank the management of OnMarc for making it possible for me to carry out my thesis project in the company as well as for providing me with the datasets. Additionally, I would like to thank Michiel, for the daily support he gave me throughout my Master.

Abstract

Over the last decades, online retailing has grown substantially worldwide. Consequently, the importance of product recommender systems has become clear both to the leading companies in the online retail market and to smaller-scale online retailers. While many successful recommendation methods have already been published, small-scale retailers are limited in their computational capabilities and in the amount of data available, which creates a set of challenges that remains almost untouched in the field of recommender systems. Therefore, this thesis aims at designing an accurate product recommender system for small-scale niche businesses using two popular market basket analysis methods: frequent item set mining and association rule mining. The performance of the chosen method on unseen data was measured with an offline experiment only, and a popularity baseline was chosen as a relevant point of comparison for the accuracy of our method. As a second part of this research, the impact of the temporal and geographical dimensions on the accuracy of the association rules was investigated by measuring whether the rules are more accurate when they are generated on a temporally or geographically split dataset. In conclusion, our recommender system performs better than the chosen popularity baseline but does not cover a large part of the products available on the website of the retailer. However, the association rules generated on the split datasets had an accuracy score lower than the baseline score. These results may be due to the low number of transactions available for most of these analyses. Finally, it was determined that the predictions are most accurate when the association rules are based on 2 or 3 products that were previously basket-added.
To gain more insight into the accuracy and efficiency of this recommender system, we recommend evaluating this recommendation method in an online experiment.

1. Introduction

Over the last decades, online retailing has grown substantially worldwide. According to a study by the Centre for Retail Research, this sector grew by 18.6% in Europe in 2015 and by 16.7% in 2016, making it the fastest growing retail market in that region. The increasing competition in the online retail market has forced these businesses to improve their techniques and their online environment in order to target their customers accurately and to maintain a high level of satisfaction. As the number of products available online is extremely large, online retailers now aim at instantly providing their customers with the products they are seeking, reducing the effort required of their customers as much as possible. Consequently, the importance of product recommender systems has become clear to the leading companies in the online retail market. According to Huseynov, Huseynov & Özkan (2016), recommender systems can commonly be described as intelligent software providing easily accessible, high-quality recommendations for online consumers. Such systems are now considered serious business tools (Schafer et al, 2001), helping customers to find the item they are seeking or suggesting additional ones. Accurate recommender systems offer numerous advantages. Previous research has established that accurate product recommender systems help online customers make better decisions during their purchases and reduce the time and effort put into their search. It was also found that the use of such systems can increase the number of purchases (Huseynov, Huseynov & Özkan, 2016).

Often cited as examples, Amazon.com, the largest online retailer in the world, and Netflix.com, the largest online distributor of streaming media, are two well-known and successful websites that established efficient product recommenders following collaborative filtering methods (among other techniques) (Marlin, Adams, Sadasivam & Houston, 2013). Collaborative filtering is one of the most popular methods used in recommender systems and is based on the opinions or ratings given by previous users (Schafer et al, 2007). As reported in an article about their recommendation system, Netflix's researchers estimate that the combined effect of personalization and recommendation probably saves the company around 1 billion dollars per year (Gomez-Uribe & Hunt, 2016). Frequently used and cited in recommender system studies, collaborative filtering filters items using the implicit or explicit ratings of users (see Related Work section). However, the quality of recommendations based on collaborative filtering can be significantly lower when data is sparse, which is often the case for small-scale online retailers (Cai et al, 2014).

In fact, traditional recommendation techniques are often hard to apply for small-scale retailers because of the small amount of data available as well as the lack of data about returning users (Kaminskas, Bridge, Foping & Roche, 2015). As stated above, many techniques, such as collaborative filtering methods, require a large amount of data and a large number of purchases to train the recommender system. Moreover, small-scale companies do not commonly have the resources and the computing capability that some of the large-scale systems cited above require (Chen, Miller & Dagher, 2014). Although many companies do not benefit from the same number of visitors as Netflix or, in the case of retailers, the same number of buying users as Amazon, very little attention has been paid to the design of recommender systems for small-scale businesses (see Related Work section for more details). Nevertheless, for small online retailers, an efficient product recommender is fundamental, as it can have a significant influence on their sales revenue and can sometimes determine the success or failure of their business. Moreover, small-scale online retailers often display a large range of products that customers have to browse through, and they can benefit greatly from an accurate and efficient recommender system.

This study aims at designing an accurate and simple recommendation algorithm within the limits of the data that can be gathered by small online businesses, taking into account the potentially limited computing capability of such businesses. This research will also provide more detailed and useful insights into product recommendations for small-scale retailers. Firstly, this study focuses on the design of an algorithm adapted to the data and to the structure of the website of the selected small-scale online retailer.
However, this thesis might also be of interest to other small-scale retailers who wish to design and implement, or simply adapt, such a recommendation system without having the resources to research it. As recommender system algorithms for small-scale retailers are scarce, a new functional algorithm could provide a great opportunity for the expansion of small-scale retail businesses. Secondly, this study intends to provide other researchers in the field with new knowledge and insights regarding product recommendations for small-scale online retailers. While the scientific focus is currently on the use and analysis of big or significantly large data sets, this research provides a detailed approach to the design of recommendation algorithms with sparse data and describes the current possibilities in this relatively untouched field of research.

Problem Statement: To what extent can we accurately predict the next product bought by an online customer using a simple recommender system based on sparse purchase data?

To answer the problem statement, a product recommender system will be designed using purchase data from a real small-scale online retailer. The recommender system will be based on the literature and on previous work carried out in the field of recommendation systems for small-scale retailers. To cover all aspects of this question, the problem has been split into three research questions. Firstly, to establish a reliable product recommender system and to accurately predict the next product to be purchased, the recommendation method chosen is based on the purchase history of previous customers. Therefore, the first research question of this thesis is expressed as follows:

Research Question 1: To what extent does the purchase history of previous customers accurately predict the next product that a new customer will purchase?

To answer this question, we will attempt to design a recommender system using frequent item set mining and association rule mining based on the purchase history of previous customers. As described in more detail in the Related Work section, frequent item set mining and association rule mining are two popular basket analysis methods often used to find patterns in transactional data. As stated in the first research question, the recommendation method chosen for this research is based on the products basket-added and then purchased by the customer. The number of products included in a transaction can vary greatly from one user to another. Therefore, we aim at determining the number of products needed to make a confident recommendation to the user. Consequently, Research Question 2 is presented as follows:

Research Question 2: How many basket-added products must the recommendation be based on to obtain a highly accurate prediction?

This question will be answered by finding the average number of products used in the most accurate association rules generated with our data set.
As suggested by Chen et al (2014), the geographical dimension and the temporal dimension can be two important factors to consider when recommending products to customers. To measure the influence of these two factors on the accuracy of the method chosen, Research Question 3 is established as follows:

Research Question 3: To what extent do the geographical origin and the temporal dimension improve the accuracy of the predictions?

This question will be answered with three different analyses. First, we will investigate the effect of the temporal dimension on the accuracy of the association rules. To investigate whether the day of the week influences the prediction accuracy, we will split the transactions into those carried out during the week and those carried out during the weekend. As we also aim at finding out whether the time of the day has an influence on the prediction accuracy, we will perform the same analysis for transactions made before 12 pm and after 12 pm. Second, the transactions will be split between the most represented countries in this dataset. The association rules will be generated in all cases and the accuracy of the rules compared.

This study is organized as follows. The next section gives an overview of the previous and related work done in the field of product recommender systems, focusing on the systems designed for small-scale retailers. The method section then introduces the selected small-scale retailer, describes the dataset used for this research and explains the different manipulations carried out to extract the frequent item sets, generate the association rules and measure the performance of the newly designed method. In the fourth section, the results of the study are described in detail, including important measures such as the accuracy and performance of the designed algorithm. The discussion and the conclusion follow, stating the strengths and weaknesses of this research and summarizing its findings.
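The three splits planned for Research Question 3 can be illustrated with a minimal sketch. The transaction records, the field layout and the helper `split_by` below are hypothetical and only serve as an illustration; they do not come from the retailer's dataset. The sketch also assumes that a transaction made at exactly 12 pm counts as "after 12 pm", a choice the text does not specify.

```python
from datetime import datetime

# Hypothetical transactions: (timestamp, country code, purchased items).
transactions = [
    (datetime(2017, 3, 6, 9, 30), "NL", {"a", "b"}),    # Monday morning
    (datetime(2017, 3, 11, 15, 0), "DE", {"a", "c"}),   # Saturday afternoon
    (datetime(2017, 3, 12, 11, 59), "NL", {"b", "c"}),  # Sunday morning
    (datetime(2017, 3, 8, 12, 0), "GB", {"a"}),         # Wednesday at noon
]


def split_by(records, key):
    """Group the records by the label returned by `key`."""
    groups = {}
    for record in records:
        groups.setdefault(key(record), []).append(record)
    return groups


# Temporal split 1: week (Monday-Friday) versus weekend (Saturday-Sunday).
by_day_type = split_by(
    transactions, lambda t: "weekend" if t[0].weekday() >= 5 else "week"
)

# Temporal split 2: before 12 pm versus after 12 pm.
# Assumption: exactly 12:00 is counted as "after_noon".
by_time_of_day = split_by(
    transactions, lambda t: "before_noon" if t[0].hour < 12 else "after_noon"
)

# Geographical split: one group per country.
by_country = split_by(transactions, lambda t: t[1])
```

Association rules would then be mined separately on each group, so that the accuracy obtained on each subset can be compared with the accuracy obtained on the full dataset.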

2. Related Work

A considerable amount of literature has been published in the field of recommender systems. The vast majority of these studies focus on identifying new recommendation techniques to push the limits of their algorithm's accuracy or on reviewing previous techniques to seek out the best performance and efficiency. While much can be said about the field of recommender systems, this section focuses on a few aspects of this subject. First, we present in Section 2.1 the general categories of recommendation methods and their differences. Subsequently, Section 2.2 and Section 2.3 respectively present the most common methods of evaluation for recommendation systems and highlight the previous studies done in the field of recommender systems for small online retailers. Finally, we introduce in Section 2.4 the two popular basket analysis methods chosen for the design of our recommendation system: frequent item set mining and association rule mining.

2.1. General methods for recommender systems

One study conducted by Huseynov et al (2014), as well as another study conducted the same year by Cai et al (2014), gives a clear overview of the general organization of recommender systems and establishes the main characteristics of the main methods currently available. According to Cai et al (2014), recommendation methods are most commonly separated into three categories: collaborative filtering methods, content-based methods and hybrid methods combining the previous two categories. Content-based systems commonly try to recommend items to a user by matching the characteristics of the user profile with the characteristics of certain products. These profiles are built using the characteristics of the products previously rated by that user (Lops et al, 2011). Content-based filtering methods have the great advantage of being able to provide recommendations for any new product that has not yet been rated by the users, and they do not rely on the number of ratings given by other users (Lops et al, 2011).
However, besides requiring the building of extensive user interest profiles, content-based systems encounter several limitations. As stated in the study of Lops et al (2011), content-based systems cannot provide very reliable recommendations for new users and tend to recommend new products that are similar to previously bought products. For instance, if a customer recently purchased a coffee machine, the same customer might be recommended another similar coffee machine during their next visit to the retailer's website.

In contrast, the recommendations of collaborative filtering systems are based on the ratings given by previous users with a similar taste (Huseynov et al, 2016). This recommendation technique can include temporal dynamics, which makes it flexible and able to adapt to changing trends and to the user's changing tastes (Koren et al, 2011). As stated by Herlocker et al (2004), many collaborative filtering algorithms have been built for datasets in which the number of users is much larger than the number of products to recommend. As this method is based on users' ratings, a recommender system based on collaborative filtering cannot recommend items which have not yet been rated by users, limiting the number of items covered by the algorithm (Herlocker et al, 2004). Considering that explicit ratings are not always available, some collaborative filtering recommender systems also take into account implicit ratings such as basket-adds, clicks on product pages and mouse movements.

While both previous methods present major limitations, hybrid recommender systems combine them to generate reliable recommendations and to avoid the drawbacks of each individual system. Several techniques are available to create hybrid recommendations, such as weighted hybrid recommenders, switching hybrid recommenders or mixed hybrid recommenders (Burke et al, 2002). A weighted hybrid recommender combines the results of the available recommendation techniques and adjusts the weight of each technique according to the quality of its predictions. A switching hybrid recommender, on the other hand, simply switches between recommendation techniques depending on the situation it is facing. While both of these recommenders typically produce one single recommendation, mixed hybrid recommenders simultaneously provide the user with all the recommendations from all techniques included in the hybrid system (Burke et al, 2002).

2.2. Common techniques for the evaluation of recommender systems

Evaluating or comparing the accuracy and performance of existing recommendation techniques is one of the main issues in this field.
As briefly discussed above, many studies in the field of product recommender systems aim at reviewing the existing methods and algorithms to provide users with an objective overview of their performance in cases other than the one they were built for. However, reviewing the efficiency of recommendation algorithms is a difficult task, as many algorithms are adapted to a certain type of dataset and will not yield the same results in different contexts (Herlocker et al, 2004). Moreover, many different metrics are used to measure the accuracy of recommender systems and there seems to be no standard metric currently available (Herlocker et al, 2004). Among classification accuracy metrics, precision and recall are the two measures most often used to evaluate the performance of collaborative systems. As explained by Raghavan, Jung and Bollmann (1989), recall is the number of relevant instances retrieved divided by the total number of relevant instances, and precision is the number of relevant instances retrieved divided by the total number of retrieved instances. While precision and recall have been popular in this field for decades, many other metrics can also be chosen to evaluate information retrieval systems, such as the ROC curve, another classification accuracy metric. The ROC curve is an alternative to precision and recall which attempts to measure the ability of the system to distinguish between relevant information and noise (Herlocker et al, 2004). A critical advantage of classification accuracy metrics when using sparse data is their ability to ignore recommendations for items that do not have any ratings. Nevertheless, different types of tasks require different types of metrics, including rank accuracy metrics or predictive accuracy metrics (e.g. mean absolute error), which greatly increases the number of metrics used in this field (Herlocker et al, 2004). This lack of standardization and the large number of metrics used in research therefore often complicate the task of comparing the performance of recommender systems designed by different authors.

2.3. Recommender systems for small-scale businesses

As stated previously, small-scale retailers commonly require a different approach for the design of their recommender system, as they face several limitations. Small-scale businesses frequently face three major challenges: the sparsity of their data, the low number of returning users on their website and the potentially limited computational capabilities at their disposal. However, few studies have investigated the matter of product recommendations for small-scale businesses. Recent work by Kaminskas et al (2015), based on the data of two small-scale online retailers, has established a new hybrid approach allowing small-scale online retailers to produce accurate product recommendations with an item-centric approach based on two techniques: one using the product co-occurrences in the browsing history and one focusing on the textual description of the items.
While approaches relying on association rule mining alone do not provide recommendations for all products available on the site because the data is too sparse, this study addresses the problem by including the textual descriptions of the items. This research also uses the theme of the products as a feature, based on the categories that were initially assigned manually by the retailer. However, this approach only intends to pair products together and relies on the fact that all items must have a sufficient textual description available on the website. It is also important to mention that, to solve the problem of data sparsity, the researchers only focus on product views and not on product purchases, limiting the validity of this recommender system. The same authors conducted another study one year later in an identical context and with the same retailers using a very similar approach. On top of the product views, this new technique adds basket events such as basket-adds to the initial hybrid approach including association rule mining and text-based similarity (Kaminskas et al, 2016). Interestingly, in both studies, the researchers were able to carry out both an offline and an online evaluation of their recommender system. Both recommender systems were implemented on the websites of the small-scale retailers, so that the real influence on purchases could be measured. It was found that users engaging with the newly displayed recommendations generated a higher number of completed orders and a higher revenue in both cases.

Remaining in the field of recommender systems for small-scale retailers, another study by Chen et al (2014) attempted to design a product recommender using association rule mining and common features such as the month of purchase, the country of origin, the product last selected by the customer and the previous purchases of the customer. The recommender system designed in this study uses the Apriori algorithm and frequency analysis to include the demographic features in the association rule model. As efficiency and scalability were two major objectives in this study, the algorithm was run with several test sets to record its runtime. The results showed that the recommender system was both scalable and efficient, as the runtime increased linearly with the size of the test set and the system produced recommendations in less than 0.1 seconds. To evaluate the content of the recommendations supplied by the newly designed method, the algorithm was tested on new instances and yielded an accuracy of 56% with 100 instances. Nevertheless, the accuracy was severely reduced when tested with a larger dataset of 200 instances, dropping to only 28%.

While the previous studies on small-scale retailers used different approaches and features in their datasets, they all based their recommender system on association rule mining. Many algorithms have been built for association rule mining, among them Apriori, Eclat and Partition (Hipp, Guntzer & Nakhaizadeh, 2000).
According to the results of a study by Hipp, Guntzer & Nakhaizadeh (2000), which reviewed several association rule mining algorithms, there is, unexpectedly, no significant difference in the run time of these algorithms or in their performance on basket-like data.

2.4. Frequent item set mining & association rule mining

As mentioned previously, frequent item set mining and association rule mining are two popular data mining methods commonly chosen by retailers for basket analysis. Often used in combination with association rule mining, frequent item set mining is a data mining method initially created for market basket analysis and aimed at finding hidden patterns in the purchasing behaviour of customers. While frequent item set mining algorithms find the recurring patterns in the transactional data, association rule mining algorithms subsequently use the frequent item sets to create association rules. However, frequent item set mining is now also used more widely for different types of tasks, such as finding regularities in certain variables (Borgelt, 2012). In a study by Geyer-Schulz & Hahsler (2002), which attempted to evaluate recommender systems using frequent item set mining, it was found that frequent item sets obtained from purchase histories and yielding a high accuracy appear to match the concept of useful recommendation as given by the KDD (community for data mining, data science and analytics).

First, frequent item sets are commonly extracted from a transaction database. Each frequent item set is generated with a support value, which represents the number of transactions that include the frequent item set. The minimum support value (set by the user) determines which item sets will be considered frequent. Despite the popularity and the simplicity of this method, a recurring problem with frequent item set mining, especially in large databases, is that the number of frequent item sets obtained can often become extremely high (Borgelt, 2012). Frequent item set mining follows the Apriori principle, which can be stated as follows: "If an item set is frequent then all of its subsets must be frequent" (Kumar et al, 2006). This means that if {a, b, c} is a frequent item set, all its subsets, such as {a, b}, {b, c}, {a}, {b} and {c}, are also frequent item sets. Nevertheless, maximal frequent item sets and closed frequent item sets can help reduce the number of sets generated. A frequent item set is maximal if none of its supersets is frequent, while a frequent item set is closed if none of its supersets has the same support (Borgelt, 2012). By retaining only maximal frequent item sets or closed frequent item sets, one can significantly reduce the number of item sets generated.

Unsurprisingly, maximal frequent item set mining is one of the most investigated topics in the large field of data mining. Popular in this field, the DepthProject algorithm was designed in 2000 by three researchers from IBM and aims at efficiently finding maximal item sets in long databases using a depth-first technique (Agarwal, Aggarwal & Prasad, 2000).
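To make the definitions of frequent, maximal and closed item sets given above concrete, the following minimal sketch enumerates them on a toy transaction database. It is a brute-force illustration under stated assumptions, not the Apriori, DepthProject or MAFIA algorithm: the transactions, the minimum support of 2 and all function names are hypothetical. The only use made of the Apriori principle is the early stop once no frequent set of a given size exists.

```python
from itertools import combinations

# Hypothetical toy database; each transaction is a set of product labels.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"b", "c"},
]
MIN_SUPPORT = 2  # absolute count, chosen arbitrarily for the illustration


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force search for all item sets reaching `min_support`."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support_count(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
                found = True
        if not found:
            # Apriori principle: if no set of this size is frequent,
            # no larger set can be frequent either.
            break
    return frequent


def maximal_itemsets(frequent):
    """Frequent sets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


def closed_itemsets(frequent):
    """Frequent sets with no proper superset of identical support."""
    return {
        s
        for s, count in frequent.items()
        if not any(s < other and frequent[other] == count for other in frequent)
    }


frequent = frequent_itemsets(transactions, MIN_SUPPORT)
maximal = maximal_itemsets(frequent)
closed = closed_itemsets(frequent)
```

On this toy database, seven item sets are frequent, five of them are closed and only {a, b, c} is maximal, which shows how retaining maximal or closed sets shrinks the output.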
While DepthProject might be considered the most efficient algorithm known for maximal frequent item set mining, newer techniques are being built to try to maximise the efficiency of these algorithms, such as the MAFIA technique from Burdik et al (2001), which also aims at mining maximal frequent item sets in long databases. Functioning like a tree, this technique uses several efficient pruning components to trim the tree at several levels and significantly reduce the running time. The MAFIA algorithm achieved a running time five times shorter than that of the DepthProject algorithm when applied to the same publicly available data sets.

Secondly, association rule mining uses the frequent item sets previously generated to create rules regarding the allocation of items. For instance, an association rule can be stated as "if a customer bought the products a and b, the next product bought is likely to be c". As association rule mining often generates a very large number of rules, this method typically has two objective measures, called support and confidence, that must be manually tuned to filter the useful association rules. Only association rules reaching the minimum support and the minimum confidence are retained by the recommender system. The support measure is slightly different from the support of a frequent item set, as it represents the fraction of transactions containing the items in the rule. The equation below explains the calculation for an example rule with a premise item a and a recommendation item c.

$$\text{support} = \frac{\text{number of transactions including } a \text{ and } c}{\text{total number of transactions}} \tag{1}$$

The confidence represents the fraction of transactions with item a that also contain item c. It is calculated for each association rule according to the following formula (Lai & Cerpa, 2001):

$$\text{confidence} = \frac{\text{number of transactions including } a \text{ and } c}{\text{number of transactions including } a} \tag{2}$$

As concluded by Geyer-Schulz & Hahsler (2002), association rules do not have model assumptions, which makes them a flexible and easy-to-tune model that can be applied to a vast range of data. However, the previously mentioned study of Kaminskas et al (2015) shows that association rules alone often do not cover all the products that a retailer offers, especially when the number of transactions available to train the algorithm is low. Conversely, with a large data set, the association rule algorithm follows the same pattern as the frequent item set algorithm and the number of rules generated can quickly become enormous. To avoid the long running time that comes with a large number of association rules, Lin et al (2002) designed a collaborative recommender system which adjusts the parameters of the association rule mining algorithm during the mining process to generate a number of rules within a predefined range. Their approach yielded a better accuracy than traditional correlation-based methods and reduced the running time needed to provide a good recommendation.

Other measures are also available to reduce the number of association rules while retaining the rules of high interest. One of the most popular measures used to filter a large number of association rules is the lift, also called interest. Lift selects the association rules by measuring their interestingness (or added value) according to the following formula:

$$\text{lift} = \frac{P(c \cap a)}{P(a)\,P(c)} \tag{3}$$

The interpretation of the lift value is as follows. When the lift is greater than 1, the items are associated.
If the lift is exactly 1, a and c are independent from each other and co-occur in the database only by chance. However, many researchers agree that the use of the lift measure can sometimes be problematic, as it tends to yield high values for the rules that have a support value close to the minimum support value. Therefore, lift is highly unstable, as it is highly likely to vary with any change of the minimum support value (Hahsler & Hornik, 2007).
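As an illustration of equations (1) to (3), the short sketch below computes the support, confidence and lift of a rule from a toy list of transactions. The transactions and the function name `rule_measures` are hypothetical, and the probabilities in the lift formula are estimated by simple relative frequencies, which is an assumption of this sketch.

```python
# Hypothetical toy database; each transaction is a set of product labels.
transactions = [
    {"a", "c"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b"},
    {"b", "d"},
]


def rule_measures(premise, recommendation, transactions):
    """Return (support, confidence, lift) for the rule premise -> recommendation."""
    n = len(transactions)
    n_premise = sum(1 for t in transactions if premise <= t)
    n_reco = sum(1 for t in transactions if recommendation <= t)
    n_both = sum(1 for t in transactions if (premise | recommendation) <= t)
    if n_premise == 0 or n_reco == 0:
        raise ValueError("premise and recommendation must each occur at least once")
    support = n_both / n                                # equation (1)
    confidence = n_both / n_premise                     # equation (2)
    lift = support / ((n_premise / n) * (n_reco / n))   # equation (3)
    return support, confidence, lift


# Rule "a -> c": item a is the premise, item c the recommendation.
support, confidence, lift = rule_measures({"a"}, {"c"}, transactions)
```

In this toy example, a and c each appear in three of the five transactions and always together, so the rule has a support of 3/5, a confidence of 1 and a lift of about 1.67, which indicates an association between the two items.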

2.5. Summary of related work

Kaminskas et al. (2015) and Chen et al. (2014) are two recent studies in the field of small-scale retailers that provide us with valuable insight for our research. We choose to follow their method closely, as association rule mining appears to be a successful method yielding a satisfying accuracy in their online experiments. Consequently, frequent item set mining and association rule mining are the two methods that will be used in the design of our recommender system for sparse data. However, to provide the field with new insights, we wish to analyse the application of association rule mining further and to also investigate the influence of the temporal and geographical dimensions on the accuracy of the rules. The next section will further justify this choice and describe the implementation of the algorithms.

3. Method

Firstly, this section describes the content and features of the dataset analysed in this thesis (Section 3.1). Subsequently, Section 3.2 explains the cleaning of the data set as well as the transformation needed to obtain the transaction format required by the chosen data mining method. Then, we provide an exploratory data analysis in Section 3.3 and we present in Section 3.4 the process used to apply the data mining technique which extracts patterns and recommendations out of the transactions. Lastly, in Section 3.5, we explain the method of evaluation chosen to measure the performance of our model and the selected baseline to compare it to.

3.1. Dataset description

The small-scale online retailer whose data is analysed in this study sells high-priced luxurious shoe accessories on its website, which is available in three languages: English, Dutch and German. Its online shop receives around 11,000 visitors each month and records around 350 purchasing visitors per month. This retailer's characteristics are similar to those of the retailers included in the study of Kaminskas et al. (2015), as they all have a sparse number of transactions and operate in a niche market. It is important to consider that this retailer does not actively gather personal data about its users, such as gender or age, during the registration or purchasing process, which limits the amount of demographic data available about its users. Moreover, unique user IDs were not available at the start of our research, so we could not track sessions that belonged to the same user and all sessions were assumed to be independent from each other. Considering these limitations, a new approach (as compared to previous studies in this field) was taken to design an accurate product recommender system for this small-scale online retailer.
Meaningful events such as a product being added to the customer's basket (a basket-add) were included in the dataset with the date of the purchase completion and the name of each product added to the basket during a single session. Along with these features, and as partially mentioned in the research questions, day of the week, country of origin and city of origin are three user-focused features available in our data, following the model of Chen et al. (2014), which includes similar demographic features in its analysis. The dataset used for training the algorithm contains around 3 months of collected data. The data was extracted from the Celebrus tracking system and was initially available in three CSV files containing, respectively, the basket adds, the visitors' countries and cities of origin, and the recording of specific goals in the system, such as a client completing a purchase order or adding a product to their wish list. The wish list is a common feature on many websites that allows users to keep the products they might want to purchase later in a personal list, often only available with a user account. The wish-listed product is then easily retrievable during the user's next visit. Considering the low number of purchases, the sessions of users that only wish-listed products were also kept in the file and considered similar to a session including a purchase completion.

Basket removals were not available in our datasets as they could not be extracted from the database. However, since only sessions with a meaningful goal such as a purchase completion were retained, all products basket-added during the retained sessions are highly likely to have been purchased. Therefore, it was assumed that all basket-added products from a session including an order completion were purchased.

Since the data available for this study is very sparse and severely limits the quality of our analysis, new CSV files including new data had to be extracted twice, a few months later, to validate the algorithm and eventually to test it. The new files had the same structure and features as the training files. Consequently, as the additional data was not available at the beginning of our research, the validation was executed on data that did not originate from the same period of time as the training data.

3.2. Pre-processing of the data

This section justifies and describes the cleaning of the dataset and the different transformation tasks that had to be performed on the dataset before our analysis.

Cleaning and transformation of the data set

As our data is separated into three different datasets, a transformation task must be performed. Using Python and an additional open source Python library called pandas, the three datasets were merged by session number to obtain a complete file and to gather all available information for each session (see Figure 1). Some of the session number columns had to be renamed in several files in order for the merging to succeed.
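The merging step described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions: the column names (`session_number`, `Session Number`, `session`, `goal_date`), the toy rows and the use of a left join are hypothetical, since the exact structure of the exported files and the join type are not specified here.

```python
import pandas as pd

# Toy stand-ins for the three exported files (all names and values are hypothetical).
basket_adds = pd.DataFrame({
    "session_number": [101, 101, 102, 103],
    "product": ["cederhouten-schoenspanners", "pommadier-cream",
                "saphir-renovateur", "schoen-oprekker"],
})
origins = pd.DataFrame({
    "Session Number": [101, 102, 103],
    "Country": ["Netherlands", "Great Britain", "Belgium"],
    "City": ["Tilburg", "London", "Antwerp"],
})
goals = pd.DataFrame({
    "session": [101, 102],
    "Goal Name": ["Order completed", "Wish list"],
    "goal_date": pd.to_datetime(["2017-01-07", "2017-01-09"]),
})

# The session number column is named differently in each file, so it is
# renamed before merging.
origins = origins.rename(columns={"Session Number": "session_number"})
goals = goals.rename(columns={"session": "session_number"})

# Assumption: a left join keeps every basket add; sessions without a goal
# get missing values in the goal columns.
merged = (
    basket_adds
    .merge(origins, on="session_number", how="left")
    .merge(goals, on="session_number", how="left")
)

# Retain only the sessions that completed an order or a wish list goal.
with_goal = merged.dropna(subset=["Goal Name"])
```

Under these assumptions, session 103 is dropped from `with_goal` because it has basket adds but no recorded goal.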
The merged dataset was initially composed of around 41,000 rows and about 15 columns. Since we wish to predict the next product to be purchased, we aim at only keeping sessions that showed a high interest in the products. Therefore, the merged dataset was cleaned to only retain sessions including a purchase order completion or a wish list goal completion. Due to the formatting of the tracking system export, some empty columns were present in the file and had to be removed. Columns that were irrelevant or of no use after the exploratory data analysis, such as Goal Name and Total times goal achieved, were deleted. Merging tasks often duplicate columns, as identical columns with identical information might be present in several of the merged files (e.g. columns including the date of purchase). In our case, the columns holding the dates and times of the different actions were duplicated. These columns were deleted to only retain the date column and the time column of the goal completion. Additional columns, such as Month and Day of the week, were created using information extracted from the date of goal completion.

Figure 1. Illustration of the merging of the three datasets.

There were two important issues remaining in the merged dataset. First, certain rows or sessions contained a goal completion but the product column was empty or contained a missing value. Consequently, the rows with no meaningful name in the product column were deleted from the dataset. Secondly, it was also noticed that certain pages were wrongly considered as products by the tracking system, such as retour-service Nederland or any product name including http and consisting of a link redirecting to a web page such as a social media page. Therefore, all the rows including the words retour, order, or http in the product column were also removed. Ultimately, the dataset had to be grouped by session number to obtain only one row per session, and all basket-added products in each session were grouped in a new column as tuples. As the chosen algorithm can only be applied to a list of transactions, this column of tuples including all transactions of our dataset was extracted into a new variable, which will be referred to as the transactions dataset for the rest of this thesis.

Temporal splitting of the dataset

Our second research question partly focuses on measuring whether the temporal dimension has an influence on the accuracy of the prediction. Surprisingly, we could not find any literature on the measurement of the temporal influence on association rules. Therefore, two analyses were carried out to determine whether the time or day of purchase has an influence on the accuracy of the obtained association rules. The first analysis aims at determining whether there is a significant difference between the association rules of weekday transactions and weekend transactions. This question is answered by splitting the original transactions dataset between weekdays and weekend days and generating association rules for each period. This split was carried out by filtering the data set according to the Weekday column and retaining, for the first part, only the days from Monday to Friday and, for the second part, only Saturday and Sunday. If the rules are significantly different between the two periods, the accuracy of the rules generated on the split dataset should be higher than the accuracy of the rules generated on the complete dataset.

The second analysis aims at determining whether association rules are also significantly different between transactions carried out before 12 pm and transactions carried out after 12 pm. This analysis was executed similarly to the previous one. The entire dataset was again split into two new datasets according to the time of the goal completion, in order to have one dataset with all transactions completed before 12 pm and another dataset with all transactions completed after 12 pm.

Geographical splitting of the dataset

Additionally, as our third research question focuses on measuring the influence of the geographical dimension on the accuracy of the association rules, we split the dataset to generate association rules individually for each of the three most represented countries. To that end, three new data sets were created to retain the transactions from each of the three countries most represented in the data set: the Netherlands, Great Britain and Belgium.
This split was carried out by filtering the dataset according to the value present in the column Country. The data set with purchases from the Netherlands was considerably larger than the two other data sets (Netherlands: N = 961; Great Britain: N = 148; Belgium: N = 137). The same pre-processing tasks were carried out on the validation dataset and on the test dataset after their extraction from the system.

3.3. Exploratory data analysis

Once the dataset is cleaned and transformed, the list of unique products sold consists of 488 items. Most of the sessions retained in our dataset have a completed order goal (N = 1591) and only a low number of sessions have only added one or more products to the wish list (N = 27). The most frequent product was basket-added 487 times, while the second most frequent product was basket-added 195 times (see Table 1). A total of 1618 sessions have been retained for this analysis. The sessions have their origin in 37 countries and only 6 sessions had a country of origin that could not be tracked. There were 432 sessions carried out during a weekend day, while 1186 were carried out during a weekday. The three most represented countries in this dataset are the Netherlands (N = 1125), Great Britain (N = 162) and Belgium (N = 159). Considering the low number of purchases available for other countries, only the three most represented countries were kept for the analysis including the geographical origin of the transaction.

Table 1. Number of basket adds for the top 5 most popular products on training set.

Name of the product                  N product purchased
cederhouten-schoenspanners           487
paar-cederhouten-schoenspanners      195
pommadier-cream                      142
schoen-oprekker                      118
saphir-renovateur                    117

It is important to mention that some products were duplicated in the dataset with two or three different names, one in Dutch and one in English or German, as the website is available in the three languages. Consequently, the total number of unique products appearing in our transaction dataset does not reflect the real number of unique products available on the website, as it might contain the name of certain products in several languages (e.g. cedar-shoe-trees and cederhouten-schoenspanners are two different names for the same product, sold on two different versions of the same website). Nevertheless, as each transaction could only be completed on one of the three sites, each transaction is always composed of product names in a single language.
Therefore, a product cannot appear under more than one language within a transaction, and this matter affects neither the extraction nor the accuracy of the association rules. As translating the entire list of product names would be very time consuming, we decided not to carry out this cleaning task. However, the presence of several names for one product was taken into account when calculating the baseline and did not negatively influence its reliability (see Section 3.5).
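The cleaning, grouping and temporal splitting steps described in Section 3.2 can be sketched in plain Python as follows. This is a simplified illustration and not the pandas code used in this thesis: the function names, the row structure and the toy rows are hypothetical, and "before 12 pm" is assumed to mean an hour strictly lower than 12.

```python
from collections import defaultdict
from datetime import datetime

# Words that mark a page wrongly tracked as a product (see Section 3.2).
BLOCKED_WORDS = ("retour", "order", "http")


def build_sessions(rows):
    """Drop unusable rows and group the remaining basket adds per session."""
    products = defaultdict(list)
    goal_times = {}
    for row in rows:
        name = (row["product"] or "").strip().lower()
        # Skip rows with a missing product name or a wrongly tracked page.
        if not name or any(word in name for word in BLOCKED_WORDS):
            continue
        products[row["session"]].append(name)
        goal_times[row["session"]] = row["goal_time"]
    # One entry per session: (time of goal completion, tuple of products).
    return {s: (goal_times[s], tuple(items)) for s, items in products.items()}


def split_by(sessions, predicate):
    """Split the transactions in two lists according to the goal time."""
    matching = [items for when, items in sessions.values() if predicate(when)]
    rest = [items for when, items in sessions.values() if not predicate(when)]
    return matching, rest


saturday = datetime(2017, 1, 7, 10, 30)  # a Saturday morning
monday = datetime(2017, 1, 9, 15, 0)     # a Monday afternoon
rows = [
    {"session": 1, "product": "cederhouten-schoenspanners", "goal_time": saturday},
    {"session": 1, "product": "pommadier-cream", "goal_time": saturday},
    {"session": 1, "product": "retour-service-nederland", "goal_time": saturday},
    {"session": 2, "product": "saphir-renovateur", "goal_time": monday},
    {"session": 2, "product": None, "goal_time": monday},
    {"session": 3, "product": "http://example.com/page", "goal_time": monday},
]

sessions = build_sessions(rows)
transactions = [items for _when, items in sessions.values()]
# weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday.
weekend, weekdays = split_by(sessions, lambda when: when.weekday() >= 5)
before_noon, after_noon = split_by(sessions, lambda when: when.hour < 12)
```

The geographical split could be obtained in the same way by storing the country next to the goal time and filtering on it.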


3.4. Experimental procedure

Since the association rule mining method was a successful choice for a similar small-scale retailer case in the previous studies of Chen et al. (2014) and Kaminskas et al. (2015), this data mining method was chosen for the design of our recommender system algorithm. This method was also chosen for its simplicity and its flexibility, as described by Geyer-Schulz & Hahsler (2002). To generate the frequent item sets out of our data, a frequent item set mining algorithm was applied to the transactions dataset previously created in Section 3.2. Two functions from the pymining package were used to extract the frequent item sets and calculate their support (see Related Work, Section 2.4). All integers between 2 and 6 were used as minimum support value in order to determine the optimal minimum support and to avoid a lack of frequent item sets. Indeed, it is important to consider that a minimum support that is too low would generate a large number of frequent item sets that would slow down the recommendation system, while a minimum support that is too high would generate a very low number of frequent item sets, which would prevent recommendations for a large fraction of the products available. Moreover, a very low support score might provide us with a high number of unreliable frequent item sets, and a very high support score might force the recommender system to ignore very relevant frequent item sets that do not fit the criteria. Nevertheless, the values tried for the minimum support were kept at or below 6 because of the low number of transactions available.

Once a sufficient number of frequent item sets was obtained, the association rule mining function was used to generate rules out of the frequent item sets. As explained in Section 2.4, the confidence is an additional parameter available with the association rule mining algorithm to filter the obtained association rules. Several values of minimum confidence were used in order to determine the best parameters for generating highly accurate rules. Figure 2 gives a short overview of the transaction mining process.
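The two mining steps described above can be illustrated with a plain-Python stand-in. It is not the pymining implementation used in this thesis: it counts item combinations by brute force, which is only practical for small baskets, and it assumes that each rule has a single-item conclusion. The function names and the toy transactions are hypothetical. Rules are returned as tuples of the form (premise, conclusion, support, confidence), with the support expressed as a number of transactions.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(transactions, min_support):
    """Count every item combination and keep those seen at least min_support times."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {item_set: n for item_set, n in counts.items() if n >= min_support}


def association_rules(item_sets, min_confidence):
    """Build (premise, conclusion, support, confidence) tuples.

    Assumption: the conclusion always holds exactly one item.
    """
    rules = []
    for item_set, support in item_sets.items():
        if len(item_set) < 2:
            continue
        for item in item_set:
            premise = item_set - {item}
            # Every subset of a frequent item set is frequent too, so the
            # premise is always present in item_sets.
            confidence = support / item_sets[premise]
            if confidence >= min_confidence:
                rules.append((premise, frozenset({item}), support, confidence))
    return rules


A, P, W = "applicator-cloth", "polishing-cloth", "wax-polish"
transactions = [(A, P, W), (A, P, W), (A, P), (W,)]

item_sets = frequent_item_sets(transactions, min_support=2)
rules = association_rules(item_sets, min_confidence=0.8)
```

Trying several parameter values then amounts to calling both functions in a loop over the candidate minimum support and minimum confidence values.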

Figure 2. Process of extraction of frequent item sets and association rules.

The association rules are generated in the following format: if a and b occur together, then recommend c, followed by the support and the confidence score. The example below shows the construction of a rule as it occurs with the transactions currently analysed.

(frozenset({'applicator-cloth-by-saphir', 'polishing-cloth-by-saphir', 'pommadier-cream'}), frozenset({'pate-de-luxe-wax-shoe-polish-100ml'}), 4, 0.8)

This rule can be translated into natural language as follows: if the products 'applicator-cloth-by-saphir', 'polishing-cloth-by-saphir' and 'pommadier-cream' are in the transaction, then the product 'pate-de-luxe-wax-shoe-polish-100ml' is highly likely to be the next product added to the transaction. The support of this rule is 4 and its confidence is 0.8.

3.5. Evaluation of the model

Association rules are commonly complex to validate, and the validation of such an algorithm is often very challenging when executed in an offline environment. To carry out the validation of our model and to tune the parameters of the association rule algorithm, the association rules were applied to the cleaned validation set with several combinations of parameters. The measures used to validate and evaluate our recommender system are the accuracy and the coverage, which are two common measures used to evaluate the performance of such machine learning algorithms (Geyer-Schulz & Hahsler, 2002). The accuracy measures the share of correct recommendations compared to the total number of possible recommendations, and the coverage represents the share of items for which recommendations are available, compared to the total number of items available.

The accuracy is calculated in the following way. For each rule that had both its premise and its recommendation in one of the validation set transactions, a score of +1 was added for that rule in the correct count list. If the same transaction includes more items than the ones present in the rule, the transaction still counts as correct, as the recommendation would have remained correct. If a transaction contained only the premise of the rule and not its recommendation, a score of +1 was added for that rule in the incorrect count list. The rules that did not have their premise in any transaction were ignored and did not influence the accuracy score. The accuracy per rule was then computed by dividing the number of correct counts of each rule by the sum of the correct counts and incorrect counts of the same rule. The accuracy per rule is a meaningful measurement in this thesis, as it provides us with the accuracy scores needed to determine later on how many basket-added products are needed to create an accurate prediction. This analysis is part of our third research question, as described later in this section. Lastly, to know the average performance of the complete set of association rules, the average accuracy score was computed over the whole list of association rules found or partially found in the validation set.

Once the best performing parameters on the validation set were found, these parameters were applied to generate the rules that would be evaluated on the test set. The frequent item sets and association rules were generated in the same way for the transactions that were previously temporally split, using the same parameters as for the original transaction set. The frequent item sets and association rules were also generated for the transaction sets split by country, using the parameters retained from the validation of the complete transaction dataset. The accuracy and coverage scores were compared for the whole dataset and for the temporally split datasets in order to determine whether the splitting between two different periods (weekday and weekend transactions, or transactions before 12 pm and after 12 pm) allowed a better accuracy in our model. The same analysis was carried out for the geographically split dataset to measure whether the geographical split improved the accuracy of the rules.

Baseline comparison

Aiming at comparing the performance of our recommender system to a relevant baseline, we selected the 5 most popular products available in our training dataset and calculated the accuracy, in percentage, that would have been obtained if we had recommended these top 5 products. This baseline follows the method of the study of Chen et al. (2014), which used the top 8 most popular products of their small-scale retailer as the baseline to compare their recommender system to.
As our data set is smaller than the one used in that study and the number of products sold is lower, we decided to limit the number of most popular products to 5 instead of 8. However, as the website is available in three languages, the most popular products appear under three different names. To address this issue, the purchases of each popular product were summed across the three language versions of the website. Subsequently, the total number of basket adds for our top 5 most popular products was divided by the total number of basket adds in order to get our baseline score. This popularity baseline allows us to determine whether recommending popular products would be an easier and better performing solution than providing personalized recommendations with our association rule system. To evaluate whether our recommender system performs significantly better than the baseline, a statistical analysis of the accuracy scores on the test set was carried out by calculating the confidence interval of each accuracy score and comparing it to the confidence interval of the baseline's accuracy on the test set.
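The accuracy per rule and the popularity baseline described in this section can be sketched as follows. The function names and the toy data are hypothetical. Two simplifications are assumed: the baseline ranks and counts the products on a single list of transactions, and the merging of the different language names of a product is left out.

```python
from collections import Counter


def accuracy_per_rule(rules, transactions):
    """Return {(premise, conclusion): accuracy} for the rules seen in the data.

    Rules are (premise, conclusion, support, confidence) tuples. Rules whose
    premise appears in no transaction are ignored.
    """
    scores = {}
    for premise, conclusion, _support, _confidence in rules:
        correct = incorrect = 0
        for transaction in transactions:
            basket = set(transaction)
            if premise <= basket:
                if conclusion <= basket:
                    correct += 1    # premise and recommendation both present
                else:
                    incorrect += 1  # premise present, recommendation missing
        if correct + incorrect:
            scores[(premise, conclusion)] = correct / (correct + incorrect)
    return scores


def popularity_baseline(transactions, top_n=5):
    """Share of all basket adds that belong to the top_n most popular products."""
    counts = Counter(item for transaction in transactions for item in transaction)
    top = sum(n for _item, n in counts.most_common(top_n))
    return top / sum(counts.values())


rules = [
    (frozenset({"cream"}), frozenset({"brush"}), 3, 0.75),
    (frozenset({"wax"}), frozenset({"cloth"}), 2, 0.8),  # premise never occurs below
]
validation = [("cream", "brush"), ("cream", "brush", "trees"), ("cream",), ("trees",)]

scores = accuracy_per_rule(rules, validation)
average_accuracy = sum(scores.values()) / len(scores)
baseline = popularity_baseline(validation, top_n=1)
```

In this toy example the second rule is ignored, as its premise does not occur in any transaction, so the average accuracy is computed over the first rule only.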

Calculation of the length of the most accurate rules

To answer our third research question regarding the number of basket-added items needed to influence the accuracy of the product recommender system, we used the accuracy score previously calculated for each association rule. As most of the rules with a high number of correct counts had a minimum accuracy score of 0.50, we chose to consider that these association rules are very likely to provide a highly accurate prediction. A list was created, retaining only association rules with an accuracy score higher than 0.50, and the mean number of items in the premise of these rules was computed. In other words, the length of the premise represents the number of basket adds on which the association rule bases its recommendation (see Figure 3). The mode and the standard deviation were also computed to learn whether the number of products needed for an accurate association rule varies much and whether the mean is reliable. These numbers were computed both on the validation set and on the test set for comparison. It is important to consider that this average is computed using only the accuracy scores of the rules that were partially or entirely present in the validation set or in the test set. The number of rules that were ignored during the evaluation was also calculated to give an insight into the relevance of the association rules generated on the training set.

Figure 3. Association rule composition. This figure shows the two different parts of an association rule. The premise is the basis from which the conclusion is drawn, meaning the products already basket-added by the user in the current session. The conclusion is the outcome of the rule, meaning the final product to be recommended.
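The computation of the premise length statistics can be sketched as follows. The function name and the toy accuracy scores are hypothetical, and the sample standard deviation is assumed, as the type of standard deviation is not specified above.

```python
from statistics import mean, mode, stdev


def premise_length_stats(scores, threshold=0.5):
    """Mean, mode and standard deviation of the premise length of accurate rules.

    scores maps (premise, conclusion) to the accuracy of that rule. Only rules
    with an accuracy strictly higher than the threshold are retained.
    """
    lengths = [
        len(premise)
        for (premise, _conclusion), accuracy in scores.items()
        if accuracy > threshold
    ]
    # Assumption: sample standard deviation (needs at least two retained rules).
    return mean(lengths), mode(lengths), stdev(lengths)


# Toy accuracy scores (hypothetical rules).
scores = {
    (frozenset({"a"}), frozenset({"x"})): 0.9,
    (frozenset({"a", "b"}), frozenset({"x"})): 0.6,
    (frozenset({"a", "b"}), frozenset({"y"})): 0.75,
    (frozenset({"a", "b", "c"}), frozenset({"z"})): 0.4,  # below the threshold
    (frozenset({"b", "c", "d"}), frozenset({"x"})): 0.5,  # not strictly higher
}

mean_length, mode_length, sd_length = premise_length_stats(scores)
```

Here three rules are retained, with premise lengths of 1, 2 and 2 items.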

4. Results

In this section, we present the detailed results of our experiment. In Section 4.1, we describe the scores obtained during the tuning of the parameters carried out on the validation set. Subsequently, in Section 4.2, we present the final performance of the chosen data mining technique using accuracy and coverage scores. These scores are also presented for the temporally and geographically split data sets to compare their results to the performance on the general data set. Finally, the number of products needed for an accurate prediction on the test set is given.

4.1. Performance of the algorithm on the validation set

First, this section presents the results obtained on the validation set when we apply the chosen algorithms using different sets of parameters. Secondly, we present the number of association rules generated for each temporal and geographical split using the best performing parameters.

Tuning of the parameters on the complete data set

This section describes the results obtained on the validation set during the tuning of the model parameters. A total of 1618 unique transactions was obtained from the dataset. The frequent item set mining algorithm was run three times with a minimum support score varying from 2 to 4 in order to obtain the optimal number of frequent item sets. More than 1 million frequent item sets were generated with a minimum support of 2, 813 frequent item sets were generated with a minimum support of 3 and 468 frequent item sets were generated with a minimum support of 4. A minimum support score of 2 was found to generate too many frequent item sets (> 1 million), as we aim at designing a simple and efficient product recommender that does not have a long running time (see Table 2). As shown in the table, such a large number of frequent item sets would produce a similar or larger number of association rules, requiring a long running time or a large computational capacity to produce recommendations.

Table 2. Number of frequent item sets and association rules generated with different support and confidence scores. Columns: support; number of frequent item sets; number of association rules generated with 0.8 minimum confidence; number of association rules generated with 0.5 minimum confidence.

Subsequently, the association rules were generated several times with frequent item sets of a minimum support of 3 and with various minimum confidence values between 0.5 and 0.95. Figure 4 displays the accuracy and coverage scores obtained on the validation set with the different sets of association rules.

Figure 4. Evolution of accuracy and coverage with different minimum confidence scores (minimum support is constant at 3). Horizontal axis: minimum confidence of generated association rules, from 0.5 to 0.95 in steps of 0.05. Vertical axis: accuracy and coverage (in percentage).

The best accuracy score on the validation set was obtained when the association rules were generated with a minimum confidence of [...]. Indeed, Figure 4 clearly displays that the accuracy decreases again when the confidence reaches higher than [...]. The best coverage is measured at a minimum confidence of 0.5, as coverage decreases continuously as the confidence increases. However, one can observe that the coverage remains stable when the confidence reaches 0.85 or
