Consumption capability analysis for Micro-blog users based on data mining

Consumpton capablty analyss for Mcro-blog users based on data mnng ABSTRACT Yue Sun Bejng Unversty of Posts and Telecommuncaton Bejng, Chna Emal: sunmoon5723@gmal.com Data mnng s an effectve method of dscoverng useful nformaton n a large amount of data. The capablty of understandng the user s consumpton s vtal for a company. Dscoverng the sgnfcant customers allows the company to focus on the most valuable customers. Ths paper uses mcro-blog users check-n data and shop nformaton for analyss and cluster method of data mnng. We analyze user s spendng ablty quanttatvely based on user s check-n actons. Compared wth other clusterng method, we choose DBSCAN clusterng method of data mnng to analyze the shop nformaton and poston. Users are dvded nto dfferent categores accordng to ther spendng power. Dscoverng the users wth hgh consumpton level n large amounts of data, whch s sgnfcant for a frm, can help the frm develop better strateges. KEYWORDS DBSCAN clusterng method, user consumpton acton, Mcro-blog user 1. INTRODUCTION It s wdely acknowledged that 80% of the revenue of a corporaton comes from mportant clents that take up 20% of the total, whle the remanng 20% of the revenue comes from the ordnary clents that take up 80%. As a result, we can conclude that not all clents are of the same value to a corporaton, and t s sgnfcant to dscover and keep ths 20% of mportant clents. Mcroblog s a platform for sharng, propagatng, and accessng nformaton based on clent connecton. Wth the fast development of the Internet n recent years, mcroblog, as a newly-born pattern of communcaton, has blossomed nto an ndustry wth hundreds of mllons of clents around the world. Based on the data of the mcroblog users, we are to extract useful nformaton usng the technology of data mnng. Specfcally, we are to evaluate ther consumpton capablty to dentfy the 20% mportant clents, makng t convenent for corporatons to concentrate on the clents wth the most values and potentals. Also, we are to learn the characterstcs of the consumptons of these users, whch provde the opportunty for corporatons to offer personalzed servces to ncrease user fdelty. 2. RELATED WORK McroBlog platform provdes a functon called check-ns, whch means the users can select ther current regon through moble or network locaton system when they tweet a new message. So other users related to them wll know ther current poston together wth the new message. The DOI : 10.5121/jaa.2013.4404 35

check-n pont s decded by moble poston servce or computer IP address, so the users records ncludng tme, poston, longtude, lattude are real and relable. 2.1 User Characterstc Analyss Accordng to survey data, there are 82.3% McroBlog users under the 35 years old.[1] So the analyss results we got n the paper can only reflect the consumpton characterstcs of young McroBlog users group. In addton, we suppose that all of the user s check-n actons are ther general behavor n lfe. 2.2 Check-n Place Characterstcs Analyss Po s short for pont of nterest that s user s check-n place. The scope of Users check-n places are wde, whch nclude dfferent categores such as restaurant, hotel, park, hosptal, subway staton, market, school, offce buldng. To analyss users characterstcs n dfferent aspects, we classfy all of po data. In ths paper, we take det servce for example. We select the food and beverage servce po from total po data that can represent user det consumpton capablty. These det servce po nclude restaurant, bar, coffee house, drnkng bar. Through practcal survey and statstcs, we have got the po shops nformaton ncludng prce, poston, shop area, category. The prce of shop means average per capta cost. Among the whole shop nformaton we collect, average per capta cost s the sutable attrbutes of a shop whch can reflect ts rank. Meanwhle, the shop poston decdes the dfferent busness area the shop belongs to. These busness areas are qute dfferent n aspect of prosperous degree that wll nfluence classfcaton of the shop. Therefore we make a shop grade dvson accordng to attrbutes we selected before, namely per capta consumer cost and ts geographcal locaton. 3. METHOD SELECTION We select the clusterng method nstead of choosng classfcaton method because classfcaton methods typcally requre a hgh prce collecton and a large number of marked tranng set [1]. Our analyss objects are mcro blog users, and we were n trouble when verfyng user attrbutes authentcty. So t s dffcult to get the tranng sets that are requred for classfcaton. We select the clusterng method whch dvdng data set nto group based on data smlarty and ts advantages are less data quantty and easy to adapt changes. There are many clusterng methods used n data dnng, such as the most commonly used partton method -- K-means algorthm whch s based on dstance, but can not fnd clusters of arbtrary shape. Densty-based BVSCAN algorthm can found arbtrary shape. Ths clusterng method wll regard cluster as object regon that s segmented by low-densty regons representng nose n data space and can dscover clusters of arbtrary shape n data space ncludng nose. DBSCAN dscover clusters by checkng the e neghborhood of each pont. If the e neghborhood of P pont contans ponts more than MnPts (namely Mnmum ponts), then create a new cluster whch p s a core object. DBSCAN then teratvely gathered the object that s drectly densty-reachable from the core object. 36

q m p s r o Fgure 1. DBSCAN method of dscoverng clusters DBSCAN algorthm need choose two parameters artfcally, namely MnPts and radus of specfed crcle. The parameter choce only depends on developer s experment and dfferent parameter chosen may cause dfferent clusterng results. In order to get an optmal clusterng result, we attempt range below to choose a sutable parameter. Set parameter ε from 0.01 to 0.3, MnPts from 3 to 6. Fgure 2. Percentage curves of ponts n dfferent clusters The horzontal axs represents cluster number, and the vertcal axs represents pont number percentage n a cluster. All data used are 16015 po records. In the testng, f parameter ε> 0.1, we cannot get multple clusters but only one cluster. If parameter ε< 0.05, the cluster amount s more than 20 whch s too large for the followng classfcaton and 25 percent clusters s no meanng for classfcaton because the percentage of the pont number n cluster s approxmately equal to zero. The ponts n same cluster represent these shops wth smlar attrbutes, so too lttle ponts n one cluster s not a sutable. Changng value of MnPts has no apparent effect on clusterng result. Selectng parameter ε= 0.8 and MnPts = 3, the clusterng result s ten clusters. Calculate the average prce of shops n the same category based on the average per capta cost of dfferent shops. We use normalzaton processng to obtan a value wthn the range of 0 to 1 and represent the dfferent categores of shops weghts. Results are as follows: AV: Average Prce C: Cluster Unt: Yuan C1 C2 C3 C4 C5 AV 52.24 13.54 84.95 217.27 138.6 0 C6 C7 C8 C9 C10 AV 556.86 957.79 356.66 52.48 31.44 Table 1. Average prce of shops n dfferent categores 37

Results after normalzaton processng: C1 C2 C3 C4 C5 AV 0.055 0.014 0.089 0.227 0.145 C6 C7 C8 C9 C10 AV 0.581 1.000 0.372 0.055 0.033 Accordng to user records, we statstc the total number of user records n dfferent consumpton levels. The amount of data s dfferent between dfferent users, whch wll mpact on the results. So use normalzaton processng on the user s check-n data. Total cost of one user = 9 0 C V V = Cluster record/total record(for a user) C : Cluster The total cost represents the normalzed user spendng whch can reflect the level of consumpton of each user, but s not the user s actual consumpton value. Select ten users randomly and draw ther consumpton cost curve of dfferent level. Fgure 3. User consumpton cost curve Ths approach helps us to quantfy the user s food and beverage consumpton level. It allows us to ntutve understandng of the user s spendng power. The maxmum value representng consumpton power s 10, and the mnmum value s 0. Larger value ndcates user s stronger spendng power, on the contrary, the user spendng power s weaker. Fgure 4. User consumpton ablty 38

4. TESTING THE METHOD In order to verfy the ratonalty of ths method, we use MATLAB to ft all user consumpton value, and the get the normal dstrbuton graph. Fgure 5. Matlab testng curve Fgure 6. Matlab testng parameter Goodness of ft s 0.9805, and adjusted goodness of ft s 0.9786. µ equals to 8.591. In order to facltate the mappng, the horzontal axs x1 that represents the user consumpton value has been enlarged 50 tmes. The actual value of µ s 0.17182. It means the average consumpton value of the mcro-blog user groups s 0.17182. As we have learned, the spendng power of an ordnary consumer groups should be consstent wth the normal dstrbuton. We verfy the feasblty of ths approach of analyzng mcro-blog user s spendng power. It can be learned from the results of our analyss, the consumpton cost of users of whom the spendng power s more than 0.2354 s accounted for 80% of the total. And the number of these users s accounted for 30% of the total. Ths also proved to us that 20% of customers would brng the company 80% of the profts. 5. CONCLUSION From above dscusson, we get an analytcal method of the spendng power of the user, whch s dvdng all po locatons nto dfferent levels by DBSCAN clusterng method and quantfy the user s consumpton ablty by addng dfferent weght. In order to get the approprate DBSCAN parameter, we compared the dfferent parameter values on the clusterng results. Through the dstrbuton of user s theoretcal consumpton cost, we obtan the top 20% of users who can brng huge profts for the frm. It also show us that the dstrbuton of consumer values of all users s normal dstrbuton. 39

6. FURTHER WORK Accordng to the top 20% users of sgnfcant nfluence on the frm we have dscovered, we can further analyze the behavoral characterstcs of the consumer groups to develop them to become loyal users. To assess the user s consumpton level, we have another method of statstcs user s consumer value before clusterng. REFERENCE [1] J. Han, M. Kamber, Data Mnng: Concepts and Technques, p. cm. London: Academc Press, 2001 [2] Margaret H. Dunham. Data Mnng Tutoral [M]. Tsnghua Unversty Press, 2005 33-46. [3] M. Hu and B. Lu, Mnng and summarzng customer revews, n Proceedngs of the tenth ACM SIGKDD nternatonal conference on Knowledge dscovery and data mnng, Seattle, WA, USA, August 2004, pp. 168 177. [4] A. Balahur and A. Montoyo, A feature dependent method for opnon mnng and classfcaton, n Proceedng of Internatonal Conference on Natural Language Processng and Knowledge Engneerng, Bejng, Chna, Octorber 2008, pp. 1 7. [5] Grsh Punj and Davd W. Stewart. Cluster analyss n marketng research: Revew and suggestons for applcaton. Journal of Marketng Research, 20(2):134 148, 1983. 40