Waldemar Jaroński* Tom Brijs** Koen Vanhoof** COMBINING SEQUENTIAL PATTERNS AND ASSOCIATION RULES FOR SUPPORT IN ELECTRONIC CATALOGUE DESIGN


*University of Economics in Wrocław, Department of Artificial Intelligence Systems, ul. Komandorska 118/120, Wrocław, POLAND
**Limburg University Centre, Department of Applied Economic Sciences, B-3590 Diepenbeek, BELGIUM

1. Introduction

This paper focuses on the use of sequential patterns and frequent itemsets in electronic commerce scenarios. The aim of our research is to determine how these two types of knowledge can be used to support a user who is browsing a product catalogue, by giving hints on products he might be interested in buying next during his visit to the seller's site. Applications exist that use other methods of recommending products of potential interest to the user, e.g. collaborative filtering (as at Amazon.com). Our approach differs in that it uses both of the above-mentioned data mining techniques at the same time. The rationale is that the decision about these hints should be based on observed frequent itemsets, on optimal purchase baskets in terms of the cross-selling effects of products, and on observed sequential patterns that capture the underlying order in purchase behaviour. To maximize the effectiveness of our approach, other data mining methods could be used in support of our methodology, e.g. product or customer clustering, or collaborative filtering. The module under consideration should continuously, in real time during the client's visit, determine which product to display as an incentive for the client to put in his basket next.

2. Problem statement

The Internet, with its WWW multimedia service, is an ideal environment for maintaining one-to-one relationships between clients and the seller. The electronic nature of the interaction between client and seller makes it easy to gather and use information on product browsing and to automatically discover the rationale behind clients' behaviour, which is one of the one-to-one marketing paradigms. On the other hand, one of the methods of increasing sales volume for regular clients is to exploit the cross-selling potential of the offered products. Our methodology makes it possible to identify such products. Both kinds of information might then be used to provide an easier interface to the product catalogue and, hopefully, to increase sales volume.

A dialogue between the user and the on-line electronic catalogue system is presented in Fig. 1. First, when a client enters an e-shop, the system asks him to log on with a user name and password and, if the user is recognized, assigns him a path, designating the products reviewed during the current visit, that is initially empty.

[Fig. 1. Interaction between user and electronic catalogue on-line web system. Flowchart: START; Path := ∅; based on the path and the user's profile, determine and show a set of products P; read the user's reaction; is browsing finished? If yes: update the user's profile and evaluate the path, proceed to finish the transaction, STOP. If no: read the product, update the path, and determine the next set of products.]

Next, based on the known user's profile and the path so far, the system determines a set of several products, displays them and waits for the user's reaction.

The user can either exit the catalogue or browse it further. In the former case the system proceeds to take an order and finish the transaction. Otherwise, the system interacts with the user while he browses the catalogue, showing the incentives determined in the previous phase. When the user puts a product in the basket, the system updates the user's path with the selected product and goes on to determine and show the next set of products that are the most likely to be bought and the most profitable.

Our approach is thus motivated by the assumption that clients tend to buy specific groups of products together in one transaction, i.e. there exist products that occur together relatively frequently across the transaction database. These sets of products are called frequent itemsets. Frequent itemsets may therefore be considered as sets of products that exhibit complementarity effects. The strength of the complementarity effect between products A and B can be measured by the confidence of the association rule A → B. Knowledge of frequent itemsets thus also indicates which products have cross-selling potential, which is very important from a marketing perspective because it makes it possible to increase the sales volume per customer.

However, we do not want to consider all frequent itemsets. Only those frequent itemsets that are also relatively highly profitable are of interest to us. This is the meaning of the term optimal baskets. Optimal baskets are selected based on knowledge of frequent itemsets and their profitability. The items that constitute optimal baskets will be preferred over other items in our application.

The next assumption we make is that some recurrent regularity can be observed across the transaction database in the order in which customers buy the products that constitute specific optimal baskets. Such a regularity is called a sequential pattern. An example of a sequential pattern might be: in 20% of cases people first put Beethoven's IX Symphony and then Chopin's mazurkas in their shopping basket. We are thus interested in intra-transaction patterns rather than in inter-transaction patterns.

3. Determining optimal baskets

This phase comprises three subtasks. Frequent itemset mining determines the sets of products frequently bought together in one transaction on a website. This information is used by the PROFSET model [BRIJ99] to determine optimal baskets. Association rule generation is then needed to evaluate the cross-selling possibilities of products after the PROFSET method has been applied.
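The notions introduced so far (the support of an itemset, the confidence of a rule A → B, and the support of an intra-transaction sequential pattern) can be made concrete with a small sketch. The transaction data and all function names below are hypothetical and serve only as an illustration. The sketch also assumes that each transaction records the order in which items were put into the basket, and that a sequential pattern may be matched with other items in between.

```python
# Hypothetical toy data: each transaction lists items in the order
# in which they were put into the basket.
TRANSACTIONS = [
    ["a", "c", "e"],
    ["a", "d", "e"],
    ["d", "a"],
    ["b", "e"],
    ["a", "c"],
]


def itemset_support(transactions, items):
    """Absolute support: number of transactions containing every item of `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t))


def confidence(transactions, a, b):
    """Confidence of the rule a -> b: share of transactions with a that also contain b."""
    with_a = itemset_support(transactions, [a])
    if with_a == 0:
        return 0.0
    return itemset_support(transactions, [a, b]) / with_a


def is_subsequence(pattern, transaction):
    """True if the items of `pattern` occur in `transaction` in the same order
    (gaps between them are allowed; this is an assumption of the sketch)."""
    remaining = iter(transaction)
    return all(item in remaining for item in pattern)


def sequence_support(transactions, pattern):
    """Absolute support of an intra-transaction sequential pattern."""
    return sum(1 for t in transactions if is_subsequence(pattern, t))


if __name__ == "__main__":
    print(itemset_support(TRANSACTIONS, ["a", "d"]))   # unordered co-occurrence
    print(confidence(TRANSACTIONS, "a", "c"))          # rule a -> c
    print(sequence_support(TRANSACTIONS, ("a", "d")))  # a before d
    print(sequence_support(TRANSACTIONS, ("d", "a")))  # d before a
```

In this toy data the supports of the two orderings of {a, d} add up to the support of the itemset itself, which is the property that relates sequential patterns to the itemsets they are built from.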

3.1 Frequent itemsets phase

In this phase, the sets of products that are frequently bought together, i.e. with at least minimum support, are determined automatically from the transaction database. A typical approach to discovering frequent itemsets exploits the fact that all subsets of a frequent itemset must also be frequent. Any of the algorithms presented, for example, in [AGRA96] might be used to discover frequent itemsets.

3.2 Optimal basket selection phase

With the PROFSET model [BRIJ99] we determine the set of optimal baskets based on the support value and the profit margin generated by the frequent itemsets. The model was originally designed to support product assortment decisions in a fully automated convenience store, but it can easily be adapted to other real-world retail applications, including hypermarket basket analysis and electronic catalogue design in an e-commerce scenario. The main idea of the model is that the best assortment of products for a shop should be determined from frequent baskets of products, as opposed to the more common product-specific approach in which the most frequent and profitable individual products are selected. Because baskets of products are taken into account, the model is able to capture the cross-selling effects of products.

The PROFSET model is formulated as an integer programming task in which the maximized objective function consists of two components. Frequent itemsets and their associated gross margins contribute positively to the objective function, whereas individual products and their costs contribute negatively. The model specification can be found in [BRIJ99]. To adapt the model to the e-shop scenario, product costs can be taken to be the costs of dispatch, packing or order processing, and other costs not accounted for in the calculation of the gross margin.
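The trade-off described above, itemset margins counted positively and item costs counted negatively, can be illustrated with a toy selection procedure. This is not the PROFSET formulation from [BRIJ99]: it replaces integer programming with brute-force enumeration, and the margins, the costs, the cap `max_items` on the number of selected items and all function names are assumptions made for the sake of the example. A basket's margin is counted only if every one of its items is selected.

```python
from itertools import combinations

# Hypothetical inputs, not taken from [BRIJ99].
ITEMSET_MARGIN = {
    frozenset({"a", "d"}): 9.0,
    frozenset({"a", "c"}): 4.0,
    frozenset({"b", "e"}): 3.0,
}
ITEM_COST = {"a": 1.0, "b": 2.5, "c": 1.0, "d": 2.0, "e": 1.5}


def objective(selected_items):
    """Margins of fully covered itemsets minus the costs of the selected items."""
    chosen = set(selected_items)
    gain = sum(m for itemset, m in ITEMSET_MARGIN.items() if itemset <= chosen)
    cost = sum(ITEM_COST[item] for item in chosen)
    return gain - cost


def select_optimal_baskets(max_items):
    """Enumerate item selections of up to `max_items` items and keep the best one.

    Exhaustive search is only feasible for toy inputs; a real model of this
    kind would be handed to an integer programming solver.
    """
    best_items, best_value = frozenset(), 0.0
    for size in range(1, max_items + 1):
        for combo in combinations(sorted(ITEM_COST), size):
            value = objective(combo)
            if value > best_value:
                best_items, best_value = frozenset(combo), value
    baskets = [itemset for itemset in ITEMSET_MARGIN if itemset <= best_items]
    return best_items, baskets, best_value


if __name__ == "__main__":
    items, baskets, value = select_optimal_baskets(max_items=3)
    print(sorted(items), [sorted(b) for b in baskets], value)
```

With these made-up numbers the basket {b, e} is left out because its margin does not cover the costs of its items, while item a is worth selecting because it takes part in two profitable baskets.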
The output of the optimal basket selection phase is thus a set of frequent itemsets, called optimal baskets, that generate the most profit for the seller.

3.3 Association rules mining

In this phase association rules between items in optimal baskets are mined to check for complementarity effects between products. Complementarity effects can be measured by the confidence of the association rule A → B, which is calculated as the percentage of transactions containing product A that also contain product B. The greater the confidence of the association rule A → B, the stronger the complementarity effect between products A and B.

4. Sequential patterns mining phase

In the next step we mine for sequential patterns, where each transaction is a separate sequence. To limit the number of patterns found, we can set constraints on items, so that only sequential patterns containing items selected by PROFSET are mined. The second constraint is the minimum support. Only sequential patterns that have at least minimum support are considered significant and interesting, and only such sequential patterns should be mined. The minimum support can be set in three ways:

1. Specifying minimum absolute support. Absolute support can be defined as the number of transactions in the transaction database in which a given sequential pattern is present. In this case the minimum support could be specified as follows:

\[ \mathrm{min\_abs\_sup}(sp) = \frac{\mathrm{sup}(l\text{-itemset})}{k!}, \]

where sup(l-itemset) is the support of the largest itemset selected by PROFSET, k is the size of the largest itemset selected by PROFSET, and sp is a sequential pattern. For a given basket, the support values of all sequential patterns of size k must sum to the support value of the k-itemset. Thus, with the minimum support set as above, we will mine at least one sequential pattern of size k leading to that basket. An alternative definition of absolute support is the fraction of transactions in the transaction database in which a sequential pattern is present.

2. Specifying minimum relative support. The relative support is defined as follows:

\[ \mathrm{rel\_sup}(sp) = \frac{\mathrm{sup}(sp)}{\mathrm{sup}(\mathrm{itemset})}, \]

where

sup(sp) is the absolute support of the sequential pattern and sup(itemset) is the absolute support of the itemset consisting of the same items as the sequential pattern. It may happen that the absolute support of some sequential pattern is lower than the minimum support while its relative support is significant, so, in order not to lose such sequential patterns, relative support might be taken into consideration. It is reasonable to specify a minimum relative support value for each value of k, e.g. rel_sup_2 = 0.35 for k = 2, rel_sup_3 = 0.15 for k = 3, and so on. If, for example, the support of a sequential pattern over some 3-itemset is being counted, then we expect its relative support rel_sup to be noticeably greater than 1/3! = 1/6, for example 1/3, because a value of about 1/6 is what each of the 3! = 6 possible orderings would receive if the order were random, and is therefore statistically insignificant. In other words, the greater the relative support, the more interesting the pattern.

3. Specifying both absolute and relative support values.

The algorithm for mining sequential patterns can be similar to the one in [AGRA95].

5. Determining the next items

In this section we present a general method of determining the items to be displayed as incentives, based on the sequential patterns obtained in the previous phase. For the algorithm to be effective, these sequential patterns must be represented efficiently with an appropriate data structure. One method is to transform the patterns into a tree structure and use one of the classical state-space search algorithms. The general idea of representing a set of sequential patterns with a tree is explained below.

The first layer of the tree consists of the items that are the first elements of the sequential patterns in the set. Next, for each node on this layer we generate child nodes based on the second items of the sequential patterns. For a given node, each sequential pattern is checked against its first element and, if that element is the same as the node, a new child node is generated whose item is the second element of this sequential pattern. The third layer is built in the same way as the second, except that for each node on the second layer the whole path leading to that node is checked against the first two elements of each sequential pattern and, if they match, a new node is generated for the third item of this sequential pattern. Each subsequent layer is generated in the same way as the previous one.

Let us assume we have a set Q of sequential patterns as shown in Fig. 2. They can be represented with the tree structure shown in Fig. 3.

[Table with columns Sequential pattern, Support, Margin profit; the sequential patterns are (a c), (a e), (a d), (d e), (d c), (b e), (e b), (a d e).]
Fig. 2. Set Q of sequential patterns.

a
    c
    e
    d
        e
d
    e
    c
b
    e
e
    b
Fig. 3. Sequential patterns Q represented with tree.

Having transformed a set of sequential patterns into a tree like the one in Fig. 3, it is straightforward to calculate automatically the total potential profit for each item. To determine the products to show on the webpage, we search the tree and calculate the value of an objective function for every item that is the first item of some sequential pattern in the set. The objective function calculates the total potential profit likely to be gained from buying a given product as the sum, over the sequential patterns that begin with that product, of the support value of the sequential pattern multiplied by the gross margin of the frequent itemset that corresponds to that sequence. The gross margin of a frequent itemset is calculated during the PROFSET phase as the total margin on all transactions consisting of the items that constitute the given frequent itemset. The items with the maximal objective function value are selected as the products to display on the webpage.
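The tree construction and the objective function described above can be sketched as follows. The pattern set is the set Q of Fig. 2, but the support and margin values are hypothetical placeholders chosen for illustration, and the names `Node`, `build_tree`, `potential_profit` and `rank_first_items` are assumptions of this sketch rather than part of the method's specification. The parameter `max_len` bounds the length of the sequential patterns that are taken into account.

```python
from dataclasses import dataclass, field

# (pattern, support, gross margin of the corresponding itemset).
# Patterns follow Fig. 2; the numeric values are hypothetical.
PATTERNS = [
    (("a", "c"), 0.10, 4.0),
    (("a", "e"), 0.05, 2.0),
    (("a", "d"), 0.08, 5.0),
    (("d", "e"), 0.06, 3.0),
    (("d", "c"), 0.04, 2.5),
    (("b", "e"), 0.07, 3.0),
    (("e", "b"), 0.03, 3.0),
    (("a", "d", "e"), 0.02, 10.0),
]


@dataclass
class Node:
    value: float = 0.0  # support * margin of the pattern ending at this node, if any
    children: dict = field(default_factory=dict)


def build_tree(patterns):
    """Insert every pattern as a path from the root; shared prefixes share nodes."""
    root = Node()
    for items, support, margin in patterns:
        node = root
        for item in items:
            node = node.children.setdefault(item, Node())
        node.value = support * margin
    return root


def potential_profit(node, depth):
    """Sum of support * margin over the patterns ending in this subtree,
    looking at most `depth` further items below `node`."""
    total = node.value
    if depth > 0:
        total += sum(potential_profit(child, depth - 1)
                     for child in node.children.values())
    return total


def rank_first_items(root, max_len):
    """Score every first-layer item and sort by decreasing potential profit."""
    scores = {item: potential_profit(child, max_len - 1)
              for item, child in root.children.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    tree = build_tree(PATTERNS)
    for item, score in rank_first_items(tree, max_len=3):
        print(item, round(score, 3))
```

Storing the patterns as a prefix tree means that, once the user has put some items in the basket, the candidates for the next incentive are simply the children of the node reached by following the user's path, and the same scoring function can be applied to them.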
As the value of the objective function is influenced by the size of the sequential patterns, it is important to specify the number of different products, i.e. the depth to which the tree is searched. This number of products, which is also the size of the sequential patterns under consideration, can be set as:
- the average number of items per transaction for this customer,

- an arbitrarily set number of items.

It is a good idea to set the size of the itemsets under consideration to the average number of items per transaction made by the given customer, because the customer can be expected to buy roughly this number of items in the current transaction. The modified version of the algorithm for displaying the best products is presented in Fig. 4.

[Fig. 4. Modified version of the algorithm for displaying. Flowchart: START; set parameter m; determine best paths, k := 0; k := k + 1; show the k-th items DP from the determined best paths; read the user's reaction p and update the path; end of browsing? (Y: STOP); p ∈ DP? (N: determine best paths again); k = m? (Y: determine best paths again; N: k := k + 1).]

First, when the user enters the e-shop, the system checks the average number of items that this user buys in one transaction (parameter m in the diagram). Based on this value the system determines a set of best paths by computing the objective function for every consecutive item, and displays the first items of these paths on the webpage. The system then waits for the user's reaction and updates the path. If the user does not end the browsing, the system checks whether the user has just put one of the determined products into the basket. If so, another condition is checked: whether the number of items for which the best paths were calculated has been reached. If not, the system continues by displaying the consecutive items from the determined paths. Otherwise, and also when the user selects a product other than those determined, the system has to determine the best paths again based on the actual sequence in which the user has put items into his shopping basket.

It is worth noting that our approach to the design of an intelligent on-line catalogue considers only situations in which the user has already put an item in the basket, i.e. it does not take information about the browsing of the catalogue into account, which could be a subject of research in the near future.

6. Conclusions and future work

The Internet with its WWW multimedia service is not just another marketing channel. Its nature, especially its interactivity and its two-way, individual and inexpensive communication, makes it, unlike other channels, ideal for forming and maintaining long-lasting business relationships between customer and company. This article shows how a company selling its products through a website can deliver a more comfortable interface to its catalogue by making use of knowledge obtained from the transaction database. The knowledge about frequent and profitable sets of products bought by customers, called frequent itemsets, and about the sequence in which they are put into the electronic basket is used in real time to recommend products that might be of interest to the user.
However, this method is suitable only when some transaction data are already present in the database. At the start of running such a website, the association rules and sequential patterns can therefore be supplied by an expert and gradually replaced by patterns obtained by mining the database. Our approach also does not take data about the browsing itself into account. Sometimes it would be desirable to know the sequence in which the user browses the catalogue and reviews product descriptions; our approach can easily be extended to consider this type of information as well. It would also be straightforward to modify it to consider sequential patterns in which all transactions of a particular customer are treated as one sequence.

Future work can also deal with incorporating other KDD methods into the design of an intelligent on-line catalogue.

REFERENCES

[AGRA94] Agrawal R., Srikant R.: Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference. Santiago, Chile.
[AGRA96] Agrawal R., Mannila H., Srikant R., Toivonen H., Verkamo A.: Fast discovery of association rules. In: Fayyad U., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds.): Advances in Knowledge Discovery and Data Mining. AAAI Press.
[AGRA95] Agrawal R., Srikant R.: Mining sequential patterns. Proceedings of the International Conference on Data Engineering (ICDE). Taipei, Taiwan, March 1995.
[BRIJ99] Brijs T., Swinnen G., Vanhoof K., Wets G.: Using association rules for product assortment decisions: a case study. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, August 15-18.