Fraudulent Behavior Forecast in Telecom Industry Based on Data Mining Technology

Similar documents
PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

Session 15 Business Intelligence: Data Mining and Data Warehousing

DATA MINING: A BRIEF INTRODUCTION

Chapter 8 Analytical Procedures

The Application of Data Mining Technology in Building Energy Consumption Data Analysis

IBM SPSS Modeler Personal

Data Mining in CRM THE CRM STRATEGY

Achieve Better Insight and Prediction with Data Mining

RFM analysis for decision support in e-banking area

Application of neural network to classify profitable customers for recommending services in u-commerce

Chapter 5 Evaluating Classification & Predictive Performance

Approaching an Analytical Project. Tuba Islam, Analytics CoE, SAS UK

A HYBRID MODERN AND CLASSICAL ALGORITHM FOR INDONESIAN ELECTRICITY DEMAND FORECASTING

Data Analytics with MATLAB Adam Filion Application Engineer MathWorks

Business Intelligence, Analytics, and Predictive Modeling Kim Gaddy

Using Predictive Analytics to Detect Contract Fraud, Waste, and Abuse Case Study from U.S. Postal Service OIG

A Trust Evaluation Model for Social Commerce Based on BP Neural Network

SAS Machine Learning and other Analytics: Trends and Roadmap. Sascha Schubert Sberbank 8 Sep 2017

The application of data mining technology in e-commerce

Marketing & Big Data

What is Business Intelligence and Performance Management

Who Are My Best Customers?

Research on Regional Logistics Industry Development Strategy Based on System Dynamics: Take Changzhou as a Case

Genetic Algorithms-Based Model for Multi-Project Human Resource Allocation

Data Mining: Next Generation Challenges and Future Directions

Data Mining Research on Time Series of E-commerce Transaction. Lanzhou, China.

Product Purchase Sequence Analyses By Using Horizontal Data Sorting Technique

Advanced Analytics through the credit cycle

Data Warehousing. and Data Mining. Gauravkumarsingh Gaharwar

Week 1 Unit 4: Defining Project Success Criteria

Use of Data Mining and Machine. Use of Data Mining and Machine. Learning for Fraud Detection. Learning for Fraud Detection. Welcome!

The Combined Model of Gray Theory and Neural Network which is based Matlab Software for Forecasting of Oil Product Demand

Data Mining Applications in Human Resources Management System

RAID Collections. Datasheet. RAID Collections Datasheet 1

SPM 8.2. Salford Predictive Modeler

Progress Apama & Event Processing. Mark Palmer, Vice President of Event Processing

The Training Strategy of Professional Core Competencies of Talents in Competitive Intelligence

Data Warehouses & OLAP

Value Proposition for Telecom Companies

International Journal of Recent Trends in Electrical & Electronics Engg., Dec IJRTE ISSN:

CLASSIFICATION OF WELDING DEFECTS IN RADIOGRAPHS USING TRAVERSAL PROFILES TO THE WELD SEAM

The Research of Middleware Architecture of Intelligent Logistics System Based on SOA

Is Machine Learning the future of the Business Intelligence?

User Guide. User Guide to Recurring Billing and Storage. Setting up and using the Recurring Billing system

Churn Prediction Model Using Linear Discriminant Analysis (LDA)

The Analytical Revolution

Multi-agent Modeling and Simulation for the Evolution of Enterprise IT Capability

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista

BACKPROPAGATION NEURAL NETWORK AND CORRELATION-BASED FEATURE SELECTION FOR EARNING RESPONSE COEFFICIENT PREDICTION

HOW ADVANCED INDUSTRIAL COMPANIES SHOULD APPROACH ARTIFICIAL-INTELLIGENCE STRATEGY

An Integrated Data Mining and Behavioral Scoring Model for Analyzing Bank Customers Nan-Chen Hsieh

BelVis PRO Enhancement Package (EHP)

Lecture 1: What is Data Mining?

Financial Crime Mitigation

A Novel Intelligence Recommendation Model for Insurance Products with Consumer Segmentation

MANAGING FRAUD ON TRAVEL BOOKINGS

Data Processing of Thermal Power Plants Based on Dynamic Data Reconciliation

MEASURING PUBLIC SATISFACTION FOR GOVERNMENT PROCESS REENGINEERING

Using Data Mining to Increase Customer Life Time Value

IM S5028. IMS Customer Relationship Management. What Is CRM? Ilona Jagielska 1

Research on the E-Commerce Credit Scoring Model Using the Gaussian Density Function

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Predicting Corporate Influence Cascades In Health Care Communities

Business Process Modeling for Supply Chain Transformation

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

ACCA Certified Accounting Technician Examination, Paper T5

Predictive Analysis Risk Analysis

Investigating the Demand for Short-shelf Life Food Products for SME Wholesalers

Ranking Potential Customers based on GroupEnsemble method

Intelligent Systems. For more information on partnering with the Kansas City Plant, contact:

The Age of Intelligent Data Systems: An Introduction with Application Examples. Paulo Cortez (ALGORITMI R&D Centre, University of Minho)

Big Data for Measuring the Information Society. Mohammad Ahli Executive Director Statistic Sector

WHITEPAPER. Unlocking Your ATM Big Data : Understanding the power of real-time transaction monitoring and analytics.

Using Decision Tree to predict repeat customers

IBM SPSS Decision Trees

Development Center, Warren, MI , USA 3 State Key Laboratory of Automotive Safety and Energy, Tsinghua University, Beijing , China

Time Series Motif Discovery

Billing Strategies for. Innovative Business Models

Leveraging Robots for the Customer Journey

Zoning Supporting Design of Deep Foundation Pit Based on Controlling Deformation of Surrounding Buildings

Machine Learning 101

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 1, Jan-Feb 2014

Digital Finance in Shared Services & GBS. Deloitte: Piyush Mistry & Oscar Hamilton LBG: Steve McKenna

Knowledge Discovery Services and Applications

Create a New Expense Report

How to build and deploy machine learning projects

Integrated production planning and control: A multi-objective optimization model

Assessment of modification level of hypoeutectic Al -Si alloys by pattern recognition of cooling curves

Managing Business Rules and Analytics as an Enterprise Asset

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

A Highly Customizable Blockchain Infrastructure

Investigating the myths and realities of contactless payment

Data processing. In this project, researchers Henry Liu of the Department of Civil Engineering and

Classification Model for Intent Mining in Personal Website Based on Support Vector Machine

FFQA 1. Complied by: Mohammad Faizan Farooq Qadri Attari ACCA (Finalist) Contact:

DESIGN OF COPD BIGDATA HEALTHCARE SYSTEM THROUGH MEDICAL IMAGE ANALYSIS USING IMAGE PROCESSING TECHNIQUES

Adaptive Water Demand Forecasting for Near Real-time Management of Smart Water Distribution Systems

TDA and Machine Learning: Better Together WHITE PAPER

Sequential Association Rules Based on Apriori Algorithm Applied in Personal Recommendation

Predictive analytics [Page 105]

Transcription:

Fraudulent Behavior Forecast in Telecom Industry Based on Data Mining Technology Sen Wu Naidong Kang Liu Yang School of Economics and Management University of Science and Technology Beijing ABSTRACT Outlier analysis in data mining is to find the deviation and abnormal data in the databases. By finding and explaining the outliers, we can apply them to detecting the business frauds effectively. This paper analyzes the common characteristics of fraudulent behavior of customers in telecom industry systematically. Based on the outlier-finding by clustering in data mining, we propose an effective solution to forecast the customers who are maliciously in arrears. Coupled with the actual application of forecasting the customers who are maliciously in arrears in telecom industry, we propose the specific method to forecast this kind of customers by using Kohonen neural network clustering algorithm. INTRODUCTION The behavior of malicious arrearage in telecom industry in China is common and has lasted for a long time. The malicious arrearage increases the bad account ratio in telecom companies and makes the profit untrue and the state-owed assets lost. Telecom industry is data-intensive. After many years of development, telecom companies have the detail records of both the customers and the real-time calls. Therefore, by analyzing the data on hand, companies can know the purchase habits and the nature attributes of customers. Based on the analysis of customers behavior, credit and risk, operators can build a system to avoid frauds. Through this system, operators can manage customers on-line at any time. Once a customer has abnormal phone call, operator gives alarm to service staff to take necessary actions, in case the customer leaves in arrears maliciously. Based on the clustering technique in data mining, this paper focuses on the problem of finding abnormal outliers in large data set by using Kohonen neural network clustering algorithm, and gives the specific method to forecast the behavior of malicious arrearage. Clustering Technique in Data Mining (1) Knowledge discovery by clustering From business aspect, data mining is a new method to process business information. Its main characteristic is that it selects, transforms and analyzes transaction data in business databases, extracts crucial information to help making business decisions. A common problem all companies face is that valuable information is limited though the data is rich. Therefore it is just like to collect gold from mine for company to gain information from the large data set, so data mining gets its name(han and Kamber, 2001). Company can enhance its competency by deep analysis of the information. Outlier detection in data mining is to find the deviation in databases, this kind of exploration of deviation or abnormal pattern has important meaning in related fields such as forecasting the customers who are maliciously in arrears in telecom industry. Supported by Program for New Century Excellent Talents in University (NCET-05-0097) Communications of the IIMA 1 2007 Volume 7 Issue 4

We could use clustering to accomplish the outlier detection. Knowledge discovery by clustering is one of the important functions of data mining. From the aspect of data mining, clustering research extracts valuable knowledge from large data sets intelligently and automatically. Knowledge discovery by clustering was proposed along with the development of databases and the emergence of data mining and Knowledge discovery technology. Knowledge discovery by clustering is applied in many areas, such as: forecast of bankruptcy, pattern recognition, marketing, market segmentation and so on. The abnormal data implies outliers in clustering. Through the finding and explaining of the outliers, we can extract the features of abnormal data, so apply them to detect business frauds effectively. (2) Kohonen neural network clustering algorithm We compared three kinds of clustering algorithms by experiment, Kohonen neural network algorithm, two-step clustering algorithm, and K-means algorithm. We figure out that Kohonen neural network algorithm is the most effective algorithm to find outliers. Kohonen neural network was proposed in 1981. It is also called self-organizing feature maps neural network clustering method. It gets the clusters of data set by using iterative algorithm to optimize the objective function (Li, Deng and Li, 2004; Yuan, 2000; Chen, 1996). Kohonen neural network is one of the most popular neural network methods for cluster analysis. Its goal is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships are preserved as much as possible. The method is particularly useful when a nonlinear mapping is inherent in the problem itself (Han and Kamber, 2001). Kohonen neural network is a two layers feed-forward neural network, it has input layer and output layer. The number of neurons of input layer is M, which amounts to the dimension of input sample vector. The neurons of output layer are competitive output neurons, whose values are in {0, 1}. After self-organizing learning, Kohonen neural network makes the density of the connection weight vectors consistent with the probability distribution of input model, which means the density of the connection weight vectors could reflect the statistical features of input model. The layout of neurons in output layer has many forms, the typical one is two-dimension plane matrix (Yan and Zhang, 2000; Malone, etc., 2005). (1) Outlier finding BASIC IDEAS TO FORECAST THE MALICIOUS ARREARAGE To forecast the behavior of malicious arrearage in telecom industry based on outlier detection in data mining, is to find the abnormal customers from the large customer data set who have high-level call fees in short period of time and leave in arrears at last. In order to find the outliers, we first capture the features of the customers who leave network in arrears, then build model and apply it to analyze the current customers by comparing the features of current calling behavior of customers with those of malicious arrearage finally, detect the customers who will be possibly in arrears, warn them and suspend service to them (Wang, 2004). Classification is widely used for prediction in data mining, but when the ratio of target customer is small, it will miss many important rules by using classification, so the forecast result is poor. Outliers are the seldom abnormal case in data set. Outlier finding is a special and crucial application in data mining. We view the customers who are maliciously in arrears as outliers in the entire customer set. Based on Kohonen neural network clustering, we can find the outliers before classification in order to improve the quality of classification and forecast. The process of clustering can extract the outliers from the entire data set, but cannot state the features of this group, so we still have to use the decision tree of classification to extract the classification rules of this special group, and apply them to the process of business forecast. (2) Main steps to forecast the malicious arrearage We use a data mining tool Clementine (a data mining tool of SPSS) and Kohonen neural network clustering algorithm to analyze the data of training set, in order to find customer groups which are maliciously in arrears and Communications of the IIMA 2 2007 Volume 7 Issue 4

the basic features of these groups, then we will use these features to forecast the customers who may be maliciously in arrears and the probability of malicious arrearage happens so that we can ask for calling fees before this kind of customers leave the network. In order to evaluate the effectiveness of the model, we will test the model by test set different from training set. We follow the standard process of data mining CRISP-DM (Wu, Gao and Bastian, 2003), the specific steps are as follows: Step1: Business understanding: Data source is prepared in order to build the model for analysis; according to the business rules and requirements of telecom companies, clear definition is given of customers who are maliciously in arrears, and the entire customers are divided into target customers and unrelated customers. Step 2: Data understanding: Relevant data is selected based on the business understanding. By processing the bills and detail records of customers in a regular time period, we connect the fees of calling, time of calling, duration of calling, type of calling with information table and arrears table of customer and build feature records set of target customers. Step 3: Data preprocessing: it is the most time-consuming process in data mining. Its goal is to transform the original data into reliable data for mining (Corinne, etc., 2001). In the subject of forecasting and analysis of the malicious cheating, data preprocessing includes sampling, setting the label of cheat, the feature field calculation and so on. Step 4: Modeling: The sample customers are cluttered by Kohonen neural network algorithm in order to find the target customers which have features of malicious arrearage. We can extract the classification rules of target customers and build the model to forecast malicious arrearage. Step 5: Evaluation: adjusting and optimizing the model, evaluating the effectiveness of forecast model by test data set. Step 6: Implement: we forecast the behavior of malicious arrearage in the coming period by the forecast model, give the probability of malicious arrearage, and show it to telecom supervisor. (1) Target customer definition DATA PREPARATION BEFORE MODELING In order to forecast cheat, we should define the target customers beforehand. The target customers are the ones who are in debt and leave the network with high-level calling fees not paid. According to their information of consuming, account, state-changing, we forecast the probability of them to become the target customers in the later months. Customers may be halted out of services for arrearage for some reasons by the telecom companies, but they will pay for the bill in time. These customers are not target customers. We discover that 96.6% of the customers, who are halted out of services for more than four months, will not pay for the bill in the fifth month. And we should focus on the customers whose amounts of debts are high. If we could make them pay for the bill, we can recover major losses of the operators. We assume that the forecast period is one month. The customers who are maliciously in arrears, or the target customers, are described as follows: In forecast period, their states are normal and the bills for each customer are no less than 200 yuan, but they always halt with bills not paid or leave network in debt in the next four months. (2) Setting the time window The goal of forecast is to find the target customers before they leave the network. If the alarming point is too early, the accuracy is lower; on the contrary, if the alarming point is too late, we have no time to take action. Therefore our goal is to find the inclination of customers to leave network 2 or 3 months ahead. Based on the definition of target customers given above, we use the data in July, 2006 and the states of these customers in November, 2006, to build the model. That means if we want to forecast the customers who leave network in arrears in November, we can know it in August, the forecast depends on the data before August. Communications of the IIMA 3 2007 Volume 7 Issue 4

Indexes of the time window are as follows: (1) data window: July, 2006; (2) forecast window: November, 2006; (3) delay period: three months. (3) Data understanding Given the specific business goal, data understanding collects all the internal and external data relevant to the object of business. Then we check the quality of data and find subsets of data which are interesting and contain information. In the subject of forecasting the customers who are maliciously in arrears, we extract customers basic information table, bill of month, detailed records of calling, table of arrears and table of state-changing as our original data tables. From these tables, we can get the original data to forecast the customers who are maliciously in arrears. The original data includes: customer s ID, gender, age, time in network, state, fees of long-distant phone call, fees of roaming, lasting time of phone call, times of phone call, fees in arrears, service change type and so on. (4) Setting the label of cheat According to the standard of time window, we correlate the customer information tables from July to November. We choose the customer who joined the network before July and his/her state was normal and he/she spent more the 200 yuan in calling in July. We choose their states from August to November using customer s ID as foreign key. We add a field of flag as attribute of cheat in the tables. If the customer whose state is out of service and leave network in arrears, the value of flag is 1, which identifies the target customers; otherwise the flag is 0, which denotes that the customer is not our target. (5) Feature field selection We analyzed the amount of customer s spending in consecutive months and found that the range of cost fluctuates, but the range is small. Before the behavior of malicious arrearage conducted, the amount of customer s spending would rise dramatically compared with before. Therefore we create a derived variable, using the ratio of rising range of spending to the standard deviation of the amount of spending to measure the raised spending is normal or not. We call it index of spending. The formula is as below: Index of spending of the month n+1 M n+ 1 M S = n+ 1 n i= 1 ( M i M ) 2 (1) M i is the amount of spending in the month i, M is average amount of spending by month, M = n i=1 M n i (2) IMPLEMENT OF THE MODEL TO FORECAST MALICIOUS ARREARAGE (1) Clustering analysis The customers who are maliciously in arrears are outliers. Therefore when we set the parameters of the model, we should set enough classes to reflect the features of outliers so that the outliers will not submerged in the classes. We set 70 classes (the two-dimension matrix, length is 10, width is 7). The result of clustering analysis is show as figure 1. The black spots represent cheating customers, the gray spots represent the normal customers. A couple of Communications of the IIMA 4 2007 Volume 7 Issue 4

horizontal and vertical coordinates represent one class. In the picture, the black spots are mainly in the cluster (0, 6) and cluster (9, 6). Figure 1: Distribution map of the result of clustering analysis. (2) Extracting the rule of feature We get the cluster information of customers who are maliciously in arrears by using clustering analysis. If we want to acquire the main features of customers in that cluster, we should use decision tree of classification algorithm to attain the features of the target customer group. For example, a customer is target customer, if he/she has long-distant phone call with a particular number for a lot of times, the total time of calling is more than 12420 seconds or about 3.27 hours, he/she was in the network less than 6 months, and the index of spending is more than 4.164409. (3) Evaluating the effectiveness of the model We need to evaluate the forecasting effectiveness of the model by using test set. We can calculate the confusion matrix based on the test set. Table 1: Forecasting Effectiveness Forecast not in arrears 0 Forecast in arrears 1 Actually not in arrears 0 A B Actually in arrears 1 C D Accuracy rate of forecasting malicious arrearage = D/ (B+D) Accuracy rate of forecasting not in arrears maliciously = A/ (A+C) Hitting rate of forecasting malicious arrearage = D/(C+D) Hitting rate of forecasting not in arrears maliciously = A/ (A+B) (4) Application of the model Communications of the IIMA 5 2007 Volume 7 Issue 4

The model is a classifier. We can forecast the target customers by it. If the probability of a customer to be our target customer is higher than a given value, we can suspect that the customer inclines to cheat and take necessary action in order to reduce the probability of malicious arrearage and to effectively protect the profits of telecom companies. Because the parameters of the model are retained by historical data the model will be old after a few months. We can calculate the accuracy rate and hitting rate by comparing the target customers we forecasted 2 or 3 months ahead with the customers who actually leave the network. When the two rates decline, we should retrain and adjust the model. CONCLUSION In this paper, we apply Kohonen neural network clustering algorithm to the analysis of forecasting the customers who are maliciously in arrears in telecom industry. The main contents and results of the research are: (1) Data mining technology is used to forecast the cheat in telecom industry based on Kohonen neural network clustering algorithm. And the main steps are given to accomplish the task of forecasting the customers who are maliciously in arrears. (2) Following the CRISP-DM data mining model, we give the definition of target customers who are maliciously in arrears and the model of data extracting. And the specific methods are proposed to build, evaluate, and apply the model. (3) We find the outliers by using Kohonen neural network clustering algorithm, and furthermore extract the consumption features of target customers. A forecasting system is developed to find the customers who are maliciously in arrears for a telecom company. REFERENCES Chen, M. (1996). Neural Network Model. Dalian University of Technology Press. Dalian. Corinne, B., etc.. (2001). Mining Your Own Business in Telecoms Using DB2 Intelligent Miner for Data. IBM RedBooks, Malone, J., etc.. (2005). Data mining Using Rule Extraction from Kohonen Self-organising Maps. Neural Comput & Applic 15: 9-17. Han, J., Kamber, M. (2001). Data Mining Concepts and Techniques. Academic Press, New York. Li, Z., Deng, Q. and Li, H. (2004). Kohonen SOFM Neural Network Evolution and Research. Computer Engineering and Design. 1729-1830. Wang, Y. (2004). The Study on Data Warehouse and Data Mining in Telecom industry Management Analysis and Application. College of Computer Science, Chongqing University. Chongqing. Wu, S., Gao, X. and Bastian, M. (2003). Data Warehousing and Data Mining. Metallurgical Industry Press. Beijing. Yan, P., Zhang, Ch. (2000). Artificial Neuron Network and Simulated Evolution Computing. Tsinghua University Press. Beijing. Yuan, C. (2000). Artificial Neuron Network and Application. Tsinghua University Press. Beijing. Communications of the IIMA 6 2007 Volume 7 Issue 4