Discovery of Frequent itemset using Higen Miner with Multiple Taxonomies


International Journal of Current Trends in Engineering & Research (IJCTER), Volume 2, Issue 6, June 2016

Shrikant Shinde (1), R. A. Mangrule (2)
(1) P.G. Student, Computer Science and Engineering, M.I.T. Aurangabad
(2) Assistant Professor, Computer Science and Engineering, M.I.T. Aurangabad

Abstract: Data mining is widely used to extract information from large volumes of raw data. Frequent itemset mining finds the itemsets that occur most often in the raw data. Change mining detects and reports changes in the sets of itemsets mined from different time periods. Some itemsets occur only during a very specific time period, yet their frequency in that period is high; such itemsets are considered non-redundant itemsets. A number of data mining algorithms have been introduced to find frequent itemsets. This work uses the HIGEN Miner algorithm and a non-redundant algorithm to find generalized itemsets with the help of multiple taxonomies. A History Generalized Pattern (HIGEN) tracks a frequent itemset over time: if the itemset becomes infrequent at some level, an Apriori-based algorithm tries to make it frequent at the next generalization level. The proposed work finds HIGENs over multiple attributes with a smaller number of extracted items.

Keywords: Data Mining, Change Mining, Association Rule, HIGEN Miner, Minimum Threshold (Support).

I. INTRODUCTION

Frequent itemset mining extracts the itemsets that occur frequently, where frequency is judged against a minimum support value. An itemset that does not meet the minimum support value is infrequent, and an itemset that satisfies the minimum support value is called a frequent itemset. Applications of frequent pattern mining have been thoroughly investigated in market basket analysis, medical image processing, etc.
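The frequent/infrequent distinction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_supports` and `frequent_itemsets` and the small `baskets` dataset are hypothetical, and support is taken here as an absolute count of transactions.

```python
from collections import Counter
from itertools import combinations


def count_supports(transactions, k):
    """Count how many transactions contain each itemset of length k."""
    counts = Counter()
    for transaction in transactions:
        # Each k-item subset of a transaction is supported by that transaction.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[frozenset(itemset)] += 1
    return counts


def frequent_itemsets(transactions, k, min_sup):
    """Keep only the k-itemsets whose support count reaches min_sup."""
    supports = count_supports(transactions, k)
    return {itemset: n for itemset, n in supports.items() if n >= min_sup}


# Hypothetical market-basket data, used only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent_pairs = frequent_itemsets(baskets, k=2, min_sup=2)
```

With `min_sup=2`, the pair {butter, milk} appears in only one basket and is therefore discarded as infrequent, which is exactly the loss that generalization is meant to avoid.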
A frequent itemset is one whose observed support in the source data is equal to or exceeds a given minimum support value. A low support threshold can generate a large number of patterns, many of which may not be useful for further analysis, whereas a high support threshold generates fewer patterns. To overcome this issue, the concept of generalized itemsets is exploited. Generalized itemsets were first used in market basket analysis, where they provide a high-level abstraction of the mined knowledge. To use them, a taxonomy, i.e. a hierarchy, is created over the data items. If an itemset is infrequent at some level, it is generalized to the next level in order to make it frequent.

This work suggests the concept of Taxonomy HIGEN (History Generalized Pattern) mining, in which itemsets are extracted based on multiple attributes. In multiple taxonomy HIGEN mining, patterns are extracted based on both data features, i.e. the attribute Location and the attribute Time. In single taxonomy HIGEN mining, either the Time attribute or the Location attribute is considered when analyzing the patterns.

Two datasets are considered. They contain data about the usage of different social websites, along with the date, time span, and location attributes for different countries. The complete dataset is divided into two parts based on the date, one for each of two months (January and February 2014). Table 1 and Table 2 show the datasets collected for January 2014 and February 2014, respectively.

Table 1: Dataset D1, Network Use in January 2014

Date          Time       Location   Social site
02/01/2014    :00 pm     Pune       Facebook
05/01/2014    :00 pm     Delhi      Facebook
13/01/2014    :00 am     Paris      Twitter
30/01/2014    :00 pm     Paris      Twitter
31/01/2014    :00 pm     Cannes     Twitter

Table 2: Dataset D2, Network Use in February 2014

Date          Time       Location   Social site
02/02/2014    :50 pm     Pune       Facebook
03/02/2014    :45 pm     Pune       Facebook
05/02/2014    :00 am     Paris      Twitter
25/02/2014    :40 pm     Cannes     Twitter
10/02/2014    :40 am     Delhi      Twitter

Figure 1: Taxonomy of Countries (India: Pune, Delhi; France: Paris, Cannes)

Figure 2: Taxonomy of Time (A.M and P.M, each with the slots 7 A.M to 9 A.M and 9 A.M to 11 A.M)

Table 1 and Table 2 describe the usage of social sites together with the time, date, and location of use. For example, we may be interested in mining generalized and non-generalized itemsets that involve the social site and the location, or the social site and the time. The generalized and non-generalized itemsets mined from Table 1 (dataset D1) and Table 2 (dataset D2) are listed below along with their support values. With a traditional data mining algorithm, some itemsets are frequent and some are infrequent with respect to the minimum support threshold. Consider the itemsets {Pune, Facebook} and {Paris, Twitter} with a minimum support value of 2. The itemset {Pune, Facebook} is infrequent in D1 while it is frequent in D2. Likewise, {Paris, Twitter} is frequent in D1 and infrequent in D2. Because infrequent itemsets are discarded, itemset generalization can be used to avoid losing them.

Table 3: Non-generalized Itemsets, min_sup = 2

Non-Generalized Itemset   Sup D1    Sup D2
{Facebook}                2         2
{Twitter}                 3         3
{Facebook, Pune}          1 (Inf)   2
{Twitter, Paris}          2         1 (Inf)

Table 4: Generalized Itemsets, min_sup = 2

Generalized Itemset       Sup D1    Sup D2
{Facebook, India}         2         2
{Twitter, France}         3         2

Table 5: HIGEN's Extracted Generalized Itemsets

Generalized Itemset       Sup D1    Sup D2
{Facebook}                2         2
{Twitter}                 3         3
{Facebook, India}         2         2
{Twitter, France}         3         2

From this analysis we have generated the generalized and non-generalized itemsets by enforcing the minimum support value on the same datasets. Here the generalized itemsets are those that satisfy the minimum support count in both months, while some non-generalized itemsets fail to satisfy the minimum support count in one of the two months. A non-generalized itemset may still reach the minimum support value in one period, so such itemsets should be retained for further analysis.

II. RELATED WORK

Several data mining techniques can be used to compute frequent itemsets for further analysis. R. Agrawal and R. Srikant [1] worked on the problem of mining generalized association rules.
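As a concrete illustration of generalized itemsets, the supports reported in Tables 3 and 4 can be recomputed from the records of Tables 1 and 2 and the taxonomy of Figure 1. The following Python sketch is only an illustration: the names `D1`, `D2`, `PARENT`, `support` and `MIN_SUP` are chosen here and do not come from the paper, and only the Location and Social site columns are used.

```python
# Records of Table 1 (D1) and Table 2 (D2) as (location, social site) pairs.
D1 = [("Pune", "Facebook"), ("Delhi", "Facebook"), ("Paris", "Twitter"),
      ("Paris", "Twitter"), ("Cannes", "Twitter")]
D2 = [("Pune", "Facebook"), ("Pune", "Facebook"), ("Paris", "Twitter"),
      ("Cannes", "Twitter"), ("Delhi", "Twitter")]

# Taxonomy of Figure 1: each city is mapped to its country.
PARENT = {"Pune": "India", "Delhi": "India",
          "Paris": "France", "Cannes": "France"}

MIN_SUP = 2


def support(dataset, site, place):
    """Count records for `site` whose location is `place` or a city of `place`."""
    return sum(1 for location, s in dataset
               if s == site and place in (location, PARENT[location]))


def is_frequent(dataset, site, place):
    """An itemset is frequent when its support count reaches MIN_SUP."""
    return support(dataset, site, place) >= MIN_SUP
```

For example, `support(D1, "Facebook", "Pune")` is 1 (infrequent) while `support(D1, "Facebook", "India")` is 2, so the generalized itemset {Facebook, India} survives in D1 where {Facebook, Pune} does not.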
In their setting, the database consists of a large set of customer transactions, where each transaction is a set of items and a taxonomy is defined over the items. Associations between items are computed across the levels of the taxonomy; earlier work did not address the generalization of items or the concept of a taxonomy. Each transaction is replaced with an extended transaction that contains all the items of the original transaction as well as all the ancestors of each of those items. Three algorithms are presented: Basic, EstMerge and Cumulate. EstMerge works faster than Basic and Cumulate, and the performance gap increases as the size of the database increases.

Rakesh Agrawal, Tomasz Imielinski and Arun Swami focus on mining itemsets in large databases [2]. Their algorithm incorporates buffer management and finds the large itemsets, i.e. the itemsets supported by a fraction of the transactions above a certain minimum support.

Rakesh Agrawal and Giuseppe Psaila worked on active data mining [3], which combines recent work on data mining with the rich literature on active databases. In this approach the data is continuously mined at some frequency, and the discovered rules are added to a rule base. The approach relies on shape queries and triggers: when a trigger fires, the appropriate action is executed.

E. Baralis, L. Cagliero, T. Cerquitelli et al. [4] focus on support-driven opportunistic aggregation for generalized itemset extraction. The GenIO (Generalized Itemset DiscOverer) algorithm extracts frequent generalized itemsets. However, it mines a generalized itemset if and only if at least one of its descendants is infrequent with respect to the minimum support threshold. It exploits a taxonomy T (i.e., a set of generalization hierarchies of arbitrary heights) to generalize concepts defined in the structured dataset under analysis.

III. SYSTEM MODEL

A. Basic Design of HIGEN Miner

The proposed system uses an Apriori-based algorithm. HIGEN is used to find the frequent itemsets in a dataset. The HIGEN algorithm finds the frequent itemsets for a particular time interval over multiple attributes. If an itemset is infrequent in a particular time interval, the algorithm tries to find it at the next generalization level; this procedure is triggered on infrequent itemsets only. Fig. 3 shows the taxonomy structure, where locations are grouped by country.

Figure 3: Taxonomy (Location: India, Australia; India: Mumbai, Delhi; Australia: Melbourne, Sydney)

B. HIGEN Miner

The HIGEN Miner addresses a problem that occurs in HIGEN.
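The rule from Section III-A, generalizing an itemset only when it is infrequent, can be sketched in Python using the taxonomy of Figure 3. This is a simplified illustration for single items under stated assumptions, not the HIGEN Miner itself: the function names and the `records` list are hypothetical, and level 1 is taken to be the leaf level.

```python
# Taxonomy of Figure 3: child -> parent, with "Location" as the root.
PARENT = {"Mumbai": "India", "Delhi": "India",
          "Melbourne": "Australia", "Sydney": "Australia",
          "India": "Location", "Australia": "Location"}


def ancestors(item):
    """Return the item followed by all of its ancestors, most specific first."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def support(records, item):
    """Count records whose location is `item` or lies below it in the taxonomy."""
    return sum(1 for location in records if item in ancestors(location))


def generalize_until_frequent(records, item, min_sup):
    """Climb the taxonomy from `item` until the support reaches min_sup.

    Returns (generalized item, level, support), or None when even the
    root of the taxonomy is infrequent.
    """
    for level, candidate in enumerate(ancestors(item), start=1):
        sup = support(records, candidate)
        if sup >= min_sup:
            return candidate, level, sup
    return None


# Hypothetical locations observed in one time period.
records = ["Mumbai", "Delhi", "Sydney", "Sydney", "Melbourne"]
```

An item that is already frequent, such as Sydney with `min_sup=2`, is returned at level 1 without being generalized, while Mumbai is lifted to India.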
It stores the infrequent items for future consideration so that they can be used later. This was not possible in HIGEN, which rejected the infrequent items at the outset and never used them again. The HIGEN Miner algorithm selects the data features from two consecutive months and finds the frequent and infrequent items based on the support threshold value [5]. All frequent and infrequent items are kept, and if any change occurs at the next generalization level, the stored items are updated.

C. HIGEN Categorization and Selection

HIGENs are categorized into three classes:
1) Stable HIGEN: HIGENs whose generalized itemsets all belong to the same generalization level.
2) Monotonous HIGEN: HIGENs whose generalized itemsets show a monotonous trend in generalization level.
3) Oscillatory HIGEN: HIGENs whose generalization level shows a variable, non-monotonous tendency.

IV. PROPOSED WORK

In the proposed system, Taxonomy HIGEN mining generates different patterns. These patterns include generalized and non-generalized itemsets based on the minimum support threshold value. Generalized itemsets, first introduced in [1] in the context of market basket analysis, are itemsets that provide a high-level abstraction of the mined knowledge. By exploiting a taxonomy over data items, items are aggregated into higher-level (generalized) ones. This project focuses on change mining in the context of frequent itemsets by exploiting generalized itemsets to represent patterns that become rare with respect to the support threshold, and thus are no longer extracted, at a certain point.

A. System Architecture

Non-generalized itemsets contain items at the child level of the taxonomy; they can be frequent or infrequent depending on the minimum support threshold. If a pattern is infrequent at the child (non-generalized) level, the data features are extracted at the higher level of the given taxonomy, which yields a generalized itemset. This itemset can again be frequent or infrequent. In Taxonomy HIGEN mining, patterns are extracted based on both data features, i.e. the attribute Location and the attribute Time, whereas in single taxonomy HIGEN mining either the Time attribute or the Location attribute is considered.

Figure 4: The Framework for Taxonomy Higen Mining (Start; Dataset; Select Minimum Support; Non Generalization and Generalization; Location (Single) and Time (Single); Performance Analysis)

B. Algorithm for Higen Mining with Taxonomies

Here C_k denotes the set of candidate itemsets of length k, L_k the frequent itemsets of length k, D_i a timestamped dataset in D, l the generalization level, and Gen the container of generalized itemsets.

1. Initialize HIGEN with null value (HG = ∅)
2. Initialize candidate length k to 1
3. C_k = set of distinct k-itemsets in dataset D
4. for each candidate itemset c in C_k
5.    Scan dataset D and calculate support(c) in each dataset D_i
6. End for
7. If support(c) >= min_sup for some D_i
8.    Then HG = update HIGEN
9. Initialize candidate generalization level l to 1
10. Initialize generalized itemset container Gen to ∅
11. for all c in C_k at level l
12.    If sup(c) in dataset D_i < min_sup
13.    Then gen(c) = generalizations of c at the new generalization level (l+1)
14.    gen(c) = evaluation of taxonomy
15.    Gen = Gen ∪ gen(c)
16.    End if
17. C_(k+1) = candidates generated from L_k ∪ Gen
18. l = l + 1
19. Until Gen = ∅
20. k = k + 1
21. Until C_k = ∅
22. Return HG

The above algorithm is the pseudocode of the HIGEN Miner. The algorithm iteratively extracts frequent generalized itemsets of increasing length from each timestamped dataset by using the Apriori-based algorithm and directly includes them in the HIGEN.

C. Mathematical Model

Let the system be S = (I, O, P, S, F), where:

I is the set of inputs (Dataset, minimum support value, Taxonomy), with Dataset = (D1, ..., Dn).

O is the set of outputs generated, O = (O1, ..., O4):
O1 = Preprocessed datasets
O2 = Frequent itemsets for multiple level taxonomy
O3 = HIGENs
O4 = HIGENs with multiple levels of taxonomy

P is the set of processes used, P = (P1, ..., P6):
P1 = Preprocessing of dataset
P2 = Candidate generation
P3 = Checking whether support is greater than minimum support
P4 = Support calculation
P5 = Traverse HIGEN to upper level of multiple taxonomy
P6 = HIGEN updation

S is the success case: S = HIGEN miner for multiple levels of taxonomies.

F is the set of failure cases.

Generalized Itemset Support: Let D be a timestamped structured dataset with a taxonomy defined on D. The support of a generalized or non-generalized itemset X in D is computed from the records of D that match X, where a record matches a generalized item if it contains one of that item's descendants in the taxonomy.

D. State Transition Diagram

Figure 5: Transition Diagram (processes P1, P2, P3, P4, P5 and P6, with YES and NO branches)

1. Process P1 performs preprocessing: the dataset is preprocessed and the required attributes are extracted from the large dataset.
2. Process P2 generates the candidate itemsets; here the HIGEN mining algorithm is executed using the multiple taxonomy approach.
3. Process P3 checks whether the support is greater than the minimum support.
4. Process P4 calculates the support count and applies the minimum support threshold to the generated candidate itemsets. In this project two minimum support thresholds are given, one for the child attribute of the taxonomy and one for the parent attribute.
5. Process P5 defines the HIGEN for multiple taxonomies.
6. In process P6 the algorithm terminates once the generated itemsets satisfy the minimum support value, and the HIGEN is finally updated.

V. PERFORMANCE ANALYSIS

The Taxonomy HIGEN mining process considers multiple heterogeneous features from multiple datasets, in the form of taxonomies, while generating generalized and non-generalized itemsets. The features to include are selected during this phase. As discussed, one dataset is divided into the records of two different months. These datasets contain multiple heterogeneous features, i.e. attributes with different data types. Among those attributes, at most 3 attributes in total can be selected across both datasets. A minimum of one feature can be selected from the social site attribute and the location attribute.

Table 6: Output of HIGEN Categorization for Taxonomy HIGEN Algorithm
(columns: Time Period, Generalized Itemset, Support, Gen. Level)

Strongly Stable HIGEN1 (January, February)
{(Service, Twitter, LinkedIn, Facebook), (Time, 6:00-8:00)}      50   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 21:00-22:00)}    26   2
{(Service, Twitter, LinkedIn, Facebook), (Time, 6:00-8:00)}      31   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 21:00-22:00)}    26   2

Strongly Stable HIGEN2 (January, February)
{(Service, Twitter, LinkedIn, Facebook), (Time, 7:00-12:00)}     72   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 19:00-23:00)}    58   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 7:00-12:00)}     69   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 19:00-23:00)}    75   3

Monotonous HIGEN3
January    {(Service, Twitter, LinkedIn, Facebook), (Location, Pune)}      19   1
February   {(Service, Twitter, Snapdeal, Instagram), (Location, Pune)}     30   1

Monotonous HIGEN4
January    {(Service, Twitter, LinkedIn, Facebook), (Location, India)}     81   2
February   {(Service, Twitter, Snapdeal, Instagram), (Location, India)}

Oscillatory HIGEN5
January    {(Service, Twitter, LinkedIn, Facebook), (Location, Cannes)}    37   1
February   {(Service, Twitter, Snapdeal, Instagram), (Location, Cannes)}   24   1

Oscillatory HIGEN6 (January, February)
{(Service, Twitter, LinkedIn, Facebook), (Time, 7:00-8:00)}      25   2
{(Service, Twitter, LinkedIn, Facebook), (Time, 13:00-14:00)}    23   2
{(Service, Twitter, LinkedIn, Facebook), (Time, 7:00-9:00)}      33   3
{(Service, Twitter, LinkedIn, Facebook), (Time, 13:00-15:00)}    35   3

Table 6 shows the categorization of the HIGENs into the classes Strongly Stable HIGEN, Monotonous HIGEN, and Oscillatory HIGEN.

Table 7: Comparative performance of the Algorithm
(columns: Dataset, Algorithm, Non-Generalized, Generalized)

Dataset1   Location
Dataset1   Time
Dataset1   Taxonomy HIGEN
Dataset2   Location
Dataset2   Time
Dataset2   Taxonomy HIGEN
Dataset3   Location
Dataset3   Time
Dataset3   Taxonomy HIGEN   130
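The three categories used in Table 6 follow the definitions of Section III-C and can be written as a small classifier over the sequence of generalization levels of a HIGEN, one level per time period. The sketch below is an assumption-laden illustration: the function name `categorize_higen` is hypothetical, ties between consecutive periods are treated as compatible with a monotonous trend, and the code does not attempt to reproduce the labels printed in Table 6.

```python
def categorize_higen(levels):
    """Classify a HIGEN from its generalization level in each time period.

    `levels` is a non-empty sequence of integers, ordered by time period.
    """
    if not levels:
        raise ValueError("at least one time period is required")
    # Stable: the generalization level never changes.
    if len(set(levels)) == 1:
        return "stable"
    # Differences between consecutive periods (assumption: ties are allowed).
    steps = [later - earlier for earlier, later in zip(levels, levels[1:])]
    if all(s >= 0 for s in steps) or all(s <= 0 for s in steps):
        return "monotonous"
    # Levels go both up and down over time.
    return "oscillatory"
```

With only two time periods, as in the January and February datasets, a HIGEN can only come out as stable or monotonous under these definitions; an oscillatory trend needs at least three periods, for example levels 1, 2, 1.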

Figure 6: Generalized and Non Generalized Itemset Extracted from Dataset 1 for Min Support Threshold 2% (y-axis: Number of Extracted Items; series: Non-Generalized Itemset, Generalized Itemset; x-axis: Location, Time)

Fig. 6 shows the generalized and non-generalized items extracted from Dataset1 for a minimum support threshold of 2% and a dataset size of 2000 transactions. HIGEN for the Location attribute, HIGEN for the Time attribute, and HIGEN for both attributes are considered. The figure shows that the number of extracted items, both generalized and non-generalized, is smaller for Taxonomy HIGEN. The Time attribute generates more items when it is generalized.

Figure 7: Generalized and Non Generalized Itemset Extracted from Dataset 2 for Min Support Threshold 1% (y-axis: Number of Extracted Items; series: Non-Generalized Itemset, Generalized Itemset; x-axis: Location, Time)

Fig. 7 shows the generalized and non-generalized items extracted from Dataset2 for a minimum support threshold of 1% and a dataset size of 6000 transactions. HIGEN for the Location attribute, HIGEN for the Time attribute, and HIGEN for both attributes are considered. The figure shows that the number of extracted items, both generalized and non-generalized, is much smaller for Taxonomy HIGEN. The Time attribute generates more items when it is generalized.

Figure 8: Generalized and Non Generalized Itemset Extracted from Dataset 3 for Min Support Threshold 3% (y-axis: Number of Extracted Items; series: Non-Generalized Itemset, Generalized Itemset; x-axis: Location, Time)

The analysis shows that enforcing a higher support threshold generates fewer generalized and non-generalized items than a small support threshold does.

VI. CONCLUSION AND FUTURE SCOPE

This project work addresses the problem of change mining in the context of frequent itemsets. To represent the evolution of itemsets across time periods without discarding relevant but rare itemsets because of minimum support enforcement, it proposes to extract generalized itemsets characterized by minimal redundancy (i.e., minimum abstraction level) when an itemset becomes infrequent in a certain time period. Multiple levels of Taxonomy HIGEN have been introduced. The usefulness of the proposed approach to support user and service profiling in a mobile context-aware environment has been validated by a domain expert. Compared with the Location attribute, the Taxonomy HIGEN achieves 66.33% of the extracted non-generalized itemsets and 22.85% of the extracted generalized itemsets. Compared with the Time attribute, it achieves 80.84% of the extracted non-generalized itemsets and 73.49% of the extracted generalized itemsets.

REFERENCES

[1] R. Agrawal and R. Srikant, "Mining Generalized Association Rules," Proc. 21th Int'l Conf. Very Large Data Bases (VLDB '95).
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," ACM SIGMOD Record, vol. 22.
[3] R. Agrawal and G. Psaila, "Active Data Mining," Proc. First Int'l Conf. Knowledge Discovery and Data Mining, pp. 3-8.
[4] E. Baralis, L. Cagliero, T. Cerquitelli, V. D'Elia, and P. Garza, "Support Driven Opportunistic Aggregation for Generalized Itemset Extraction," Proc. IEEE Fifth Int'l Conf. Intelligent Systems (IS '10).
[5] M.L. Antonie, O.R. Zaiane, and A. Coman, "Application of Data Mining Techniques for Medical Image Classification," Proc. Second Int'l Workshop Multimedia Data Mining (MDM/KDD '01).
[6] Luca Cagliero (March 2013), "Discovering Temporal Change Patterns in the Presence of Taxonomies," IEEE, vol. 25, no. 3.
[7] Skiruthika, T. Sheik Yousuf, "Discovery of frequent and non-redundant itemset using Higen miner," International Journal of Computer Science and Information Technology Research (online), vol. 2, issue 2, pp. 53-59, April-June.

[8] Luca Cagliero, "Data mining by means Generalized Patterns," PhD Thesis, PhD in Information and System Engineering, XXIV cycle, III Facoltà di Ingegneria, Settore scientifico ING-INF/05, Politecnico di Torino. Supervisor: Prof. Elena Baralis.
[9] Poul Gaeta Venkatrao, Prof. P.D. Lambhate, "Ascertaining Chronological Change Patterns in the Presence of Taxonomies," IJCET, Aug.
[10] G.V. Poul, P.D. Lambhate (March 2015), "Framework for Change Detection Using Taxonomies," vol. 3.
[11] Bay Vo and Bac Le, "Fast Algorithm for Mining Generalized Association Rules," International Journal of Database Theory and Application, vol. 2.
[12] TPC-H (2009), "The TPC Benchmark H," Transaction Processing Performance Council.