ANALYSIS OF INTEGRATED COMPLAINT MANAGEMENT SYSTEM DATA


REPORT OF SUMMER PROJECT
Institute of Development and Research in Banking Technology
May 10 to July 9, 2013

By Sarbojit Roy
Indian Institute of Technology, Kanpur

Guided by Dr. N.P. Dhavale
Deputy General Manager, Strategic Business Unit,
Institute of Development and Research in Banking Technology, Hyderabad

CERTIFICATE OF COMPLETION

This is to certify that Mr. Sarbojit Roy, pursuing the M.Sc. course at the Indian Institute of Technology, Kanpur, with statistics as his major subject, undertook a project as an intern at the Institute of Development and Research in Banking Technology (IDRBT), Hyderabad, from 10th May, 2013 to 9th July, 2013. He was assigned the project "ANALYSIS OF INTEGRATED COMPLAINT MANAGEMENT SYSTEM DATA", which he successfully completed under the guidance of Dr. N. P. Dhavale, IDRBT.

Dr. N.P. Dhavale (Project Guide)
Deputy General Manager
Strategic Business Unit
IDRBT, Hyderabad

ABSTRACT

The Integrated Complaint Management System (ICMS), as the name suggests, keeps and manages the record of faults and problems occurring in the network through which RBI and its member banks are connected. Each problem, the cause(s) behind it and other relevant details are recorded in this system under a unique identity tag called the Ticket no.

This project tries to find a pattern in the available data and to examine whether any relationship exists between the problems that occur and their causes. The objective is that, when a problem occurs, one can predict the underlying cause and take immediate remedial action. For this purpose, two fields of the system records, Problem Reported and Cause of the Problem, are categorized and analyzed.

CONTENTS

1. INTRODUCTION
2. ANALYSIS
3. SUGGESTION
4. ANALYSIS II
5. RESULT
6. CONCLUSION
7. LIMITATION OF THE PROJECT
8. FUTURE WORKSCOPE
9. ACKNOWLEDGEMENT

INTRODUCTION

This project is an analysis of Integrated Complaint Management System data, or ICMS data for short. RBI and its member banks are connected through a network. If any difficulty or problem occurs in this network, a ticket is generated, which works as a unique tag for the problem. The necessary details of the problem are stored in the ICMS database as a record.

The objective of this project is to provide ideas for the improvement and upgrade of this complaint management system by extracting valuable information from ICMS data through statistical analysis.

ANALYSIS

For each problem, all the relevant and necessary details are typed into ICMS manually. This is the reason for the large amount of error, or noise, in the data set. Typing errors are the most common; inappropriate and incomplete entries are the other two common kinds of error. Removing these errors as far as possible, so that an appropriate analysis can be done, requires considerable time and labour.

To begin the analysis, we first need the data set in an appropriate format. The data set is easily available from the ICMS database in MS-Excel spreadsheet format. A small sample of the spreadsheet is given on the next page.

In this project 4200 records are taken as a sample, starting from 30th March 2011 to 17th May. Since most of the leased line connections have now been converted to MPLS, records corresponding to leased line connections are dropped.

First, the entries in the column named Problem Reported are grouped into 9 major categories according to their type. These 9 categories are:

1. Application related problems
2. Hardware related problems
3. Latency issues
4. Link related issues
5. Management/Process issues
6. Packet drops related issues
7. Power issues
8. Software/Configuration related issues
9. Other issues

Similarly, the entries of the column Cause of the Problem are grouped into 7 categories. These categories are:

1. Hardware failure
2. Link failure (unspecified)
3. Management/Process issues
4. Non network/out of scope issues
5. Physical damage
6. Service provider internal issues
7. Software/Configuration failure

A simple Excel formula for string manipulation is used to do this categorization. Each of these categories is further divided into sub-categories.

SUBCATEGORIES

Some examples of sub-categories are given below.

PROBLEM REPORTED
  CATEGORY: SOFTWARE/CONFIGURATION RELATED PROBLEMS, etc.
  SUBCATEGORY: UPGRADATION REQUEST, IOS CRASH, ROUTER REBOOTED, etc.

CAUSE OF THE PROBLEM
  CATEGORY: SERVICE PROVIDER INTERNAL ISSUES, etc.
  SUBCATEGORY: BACKBONE FAILURE, AIRBACKHAUL PROBLEM, etc.
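
The report does not reproduce its Excel formula, so the following is only a minimal Python sketch of keyword-based categorization of free-text entries. The rule table `CAUSE_RULES`, its keywords, the function name `categorize` and the fallback label `UNCATEGORIZED` are all assumptions made for illustration; only the category names come from the list above.

```python
# Hypothetical keyword rules: (category, keywords that trigger it).
# The category names are from the report; the keywords are illustrative guesses.
CAUSE_RULES = [
    ("Physical damage", ("fiber cut", "fibre cut", "cable cut")),
    ("Service provider internal issues", ("backbone", "airbackhaul")),
    ("Software/Configuration failure", ("config",)),
    ("Hardware failure", ("hardware",)),
]


def categorize(text, rules, default="UNCATEGORIZED"):
    """Return the first category whose keyword occurs in the free-text entry."""
    # Normalise case and collapse repeated whitespace to absorb typing noise.
    cleaned = " ".join(text.lower().split())
    for category, keywords in rules:
        if any(keyword in cleaned for keyword in keywords):
            return category
    return default


if __name__ == "__main__":
    for entry in ("  FIBER   Cut near exchange ", "Backbone failure", "xyz"):
        print(repr(entry), "->", categorize(entry, CAUSE_RULES))
```

Rules are checked in order, so more specific keywords should be listed before more general ones. Entries that match no rule are returned with the fallback label so that they can be reviewed by hand.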

SUGGESTION

To reduce the noise in the data, the typing-based procedure for updating the database can be replaced by drop-down lists in the system, built from the categories and sub-categories created above.

[Figures: CURRENT SCENARIO and SUGGESTED SCENARIO]
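
The effect of a drop-down list can be sketched as a check against a fixed vocabulary. This is a minimal illustration, not the ICMS implementation: the dictionary `ALLOWED`, its field keys and the function `validate_entry` are hypothetical, and only the example categories and sub-categories quoted earlier in the report are filled in.

```python
# Hypothetical controlled vocabulary: field -> category -> allowed sub-categories.
# Only the examples quoted in the report are listed; a real list would be longer.
ALLOWED = {
    "problem_reported": {
        "SOFTWARE/CONFIGURATION RELATED PROBLEMS": {
            "UPGRADATION REQUEST",
            "IOS CRASH",
            "ROUTER REBOOTED",
        },
    },
    "cause_of_problem": {
        "SERVICE PROVIDER INTERNAL ISSUES": {
            "BACKBONE FAILURE",
            "AIRBACKHAUL PROBLEM",
        },
    },
}


def validate_entry(field, category, subcategory):
    """Accept an entry only if it matches the controlled vocabulary exactly."""
    options = ALLOWED[field]
    if category not in options:
        raise ValueError(f"unknown category for {field}: {category!r}")
    if subcategory not in options[category]:
        raise ValueError(f"unknown sub-category for {category}: {subcategory!r}")
    return category, subcategory


if __name__ == "__main__":
    print(validate_entry("cause_of_problem",
                         "SERVICE PROVIDER INTERNAL ISSUES",
                         "BACKBONE FAILURE"))
```

A misspelt value such as "BACKBONE FAILUER" raises `ValueError` instead of being stored, which is the kind of typing noise the suggestion aims to eliminate.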

ANALYSIS (CONTINUED)

Suppose the categories of Problem Reported are numbered 1, 2, ..., k. Consider a random variable P which denotes the reported problem. Then P takes an integer value between 1 and k, and the category counts in a sample follow a multinomial distribution with success probabilities $p_1, p_2, \ldots, p_k$. These probabilities can be estimated by the sample proportions, i.e.

$$\rho_i = \frac{\text{no. of occurrences of category } i}{\text{total no. of sample points}}, \qquad i = 1, 2, \ldots, k,$$

where $\rho_i$ is the estimated probability of success for category i.

Similarly, suppose Cause of the Problem is classified into m categories. Let us denote the cause of the problem by Q. Then Q takes integer values from 1 to m. However, it is not considered sensible here to treat Q as a random variable.

Now we can put the whole data set into a cross-tabulated format, formally known as a contingency table. Here the categories of the causes are taken as rows, the categories of the reported problems are taken as columns, and each cell contains the corresponding frequency.

[Image: contingency table of cause categories (rows) against reported-problem categories (columns)]
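
The two steps above, estimating the category proportions and cross-tabulating causes against reported problems, can be sketched as follows. The function names and the four toy records are assumptions made for illustration; they are not taken from the ICMS data.

```python
from collections import Counter


def estimate_proportions(labels):
    """Sample proportion of each category: count of category / sample size."""
    counts = Counter(labels)
    n = len(labels)
    return {category: count / n for category, count in counts.items()}


def contingency_table(pairs):
    """Cross-tabulate (cause, problem) pairs: causes as rows, problems as columns."""
    row_labels = sorted({cause for cause, _ in pairs})
    col_labels = sorted({problem for _, problem in pairs})
    counts = Counter(pairs)  # missing (cause, problem) combinations count as 0
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


if __name__ == "__main__":
    # Toy records (cause, problem reported); invented for illustration only.
    records = [
        ("Physical damage", "Link related issues"),
        ("Physical damage", "Link related issues"),
        ("Hardware failure", "Hardware related problems"),
        ("Hardware failure", "Link related issues"),
    ]
    print(estimate_proportions([problem for _, problem in records]))
    rows, cols, table = contingency_table(records)
    for label, row in zip(rows, table):
        print(label, row)
```

Sorting the labels fixes the row and column order, so the same table layout is produced regardless of the order of the records.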

Now we want to check whether there exists any association between P and Q. If the distribution of the second variable is nearly the same for every category of the first variable, then we say that there is no association between the two variables. If there are significant differences between the distributions, then we say that there is an association between the two variables.

To check the existence of association, the Pearsonian Chi-Square test is performed. The null hypothesis is that P and Q are not associated, and the alternative hypothesis is that there exists an association between P and Q. To perform this test the Chi-Square statistic ($\chi^2$) is used:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum runs over all cells of the table, O = observed frequency of a cell and E = expected frequency of that cell. The expected frequency of the (i, j)-th cell is calculated by the following formula:

$$E_{i,j} = \frac{(i\text{-th row total}) \times (j\text{-th column total})}{\text{table total}}$$

Under the null hypothesis this $\chi^2$ approximately follows the standard $\chi^2$ distribution with

degrees of freedom = (number of rows - 1) * (number of columns - 1) = (k-1) * (m-1).

We reject the null hypothesis, i.e. we conclude that there exists an association between P and Q, if the observed $\chi^2$ value is large enough. At level of significance α we reject the null hypothesis if

$$\chi^2_{obs} > \chi^2_{\alpha,\,(k-1)(m-1)}$$

Here $\chi^2_{\alpha,\,(k-1)(m-1)}$ is the upper-α point of a $\chi^2$ distribution with (k-1)*(m-1) degrees of freedom.

From the data set at hand we get $\chi^2_{obs}$ = [value missing], and if we take α = 0.05 then the critical value with 192 degrees of freedom, written $\chi^2_{(0.95,\,192)}$ in the report, is [value missing]. Clearly, we reject the null hypothesis, and we may conclude that there exists an association between Problem Reported and Cause of the Problem.
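
The computation of the test statistic can be sketched in a few lines of plain Python. The function name `chi_square_statistic` and the 2 x 2 toy table are assumptions for illustration, not ICMS figures; the sketch also assumes that every row and column total is positive, since a zero margin would give a zero expected frequency.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed frequencies. Every row total and
    column total is assumed to be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    table_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E_ij = (i-th row total * j-th column total) / table total
            expected = row_totals[i] * col_totals[j] / table_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    toy_table = [[10, 20], [20, 10]]  # invented counts, for illustration only
    stat, dof = chi_square_statistic(toy_table)
    print(f"chi-square = {stat:.4f}, degrees of freedom = {dof}")
```

For the toy table every expected frequency is 15, so the statistic is 4 * 25/15, about 6.67, on 1 degree of freedom. This exceeds 3.841, the upper 5% point of the $\chi^2$ distribution with 1 degree of freedom, so the null hypothesis of no association would be rejected for these invented counts.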

RESULTS

Now we want to find out which types of causes are most frequent. For this purpose, Pareto charts are created using MS-Excel.

[Figure: PARETO CHART SHOWING THE MOST SIGNIFICANT CAUSES BEHIND THE REPORTED PROBLEMS. X axis, sub-categories of causes in the order plotted: POWER, FIBER/CABLE, UNSPECIFIED, PLANNED/MAINTENANCE, RESOLVED, NO ISSUE FOUND, HARDWARE, CONFIGURATION, ALARM, FLUCTUATED, OTHER, PHY DIS, REBOOT, BACKBONE, LASTMILE(UNSPECIFIED), HIGH UTILIZATION, HANG, LAN, LOCAL LEAD, MISTAKE, CRYPTO, DUPLEX MISMATCH, LATENCY, WAN, AIRBACKHAUL, AUTOFAILOVER. Series: marginal frequency (bars) and marginal cumulative percentage (line). Marked point: (80, 30).]

In this chart the sub-categories of causes are placed along the X axis. The bars represent the marginal frequencies, and the line indicates the cumulative proportion of the sub-categories with respect to the total frequency. The chart indicates that Power failure, Fiber/cable cut, Planned/Maintenance activity, Hardware failure and some Unspecified issues cover 80% of the total number of reported problems, while these factors make up just 30% of the total number of causes.

[Figure: PARETO CHART SHOWING THE MOST SIGNIFICANT CAUSES BEHIND DOWNTIME OF LINK. X axis, sub-categories of causes in the order plotted: POWER, UNSPECIFIED, FIBER/CABLE, RESOLVED, CONFIGURATION, PLANNED/MAINTENANCE, NO ISSUE FOUND, FLUCTUATED, HARDWARE, OTHER, PHY DIS, ALARM, LAN, REBOOT, BACKBONE, DUPLEX MISMATCH, HANG, LATENCY, HIGH UTILIZATION, CRYPTO, WAN, LOCAL LEAD, AIRBACKHAUL, LASTMILE(UNSPECIFIED), AUTOFAILOVER, MISTAKE. Series: marginal downtime (bars) and cumulative downtime % (line).]

From the above graph it can be seen that 75% of the total link downtime is caused by Power failure, Fiber/cable cut, Configuration failure, Planned/Maintenance activity and some Unspecified issues. These cover almost 30% of the total sub-categories of causes.

It is to be noted that both graphs broadly support the 80:20 principle, i.e. roughly 80% of the problems are caused by roughly 20% of the reasons.
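
The figures behind a Pareto chart, i.e. the categories sorted by decreasing frequency together with their cumulative percentages, can be sketched as below. The report built its charts in MS-Excel; the function names `pareto_rows` and `vital_few` and the toy counts are assumptions for illustration and are not the ICMS figures.

```python
def pareto_rows(frequencies):
    """Sort categories by decreasing frequency and attach cumulative percentages."""
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    total = sum(frequencies.values())
    rows = []
    running = 0
    for name, freq in ordered:
        running += freq
        rows.append((name, freq, 100.0 * running / total))
    return rows


def vital_few(rows, threshold=80.0):
    """Leading categories needed for the cumulative percentage to reach the threshold."""
    selected = []
    for name, _, cumulative in rows:
        selected.append(name)
        if cumulative >= threshold:
            break
    return selected


if __name__ == "__main__":
    # Invented counts for four cause sub-categories, for illustration only.
    toy_counts = {"HARDWARE": 15, "POWER": 50, "OTHER": 5, "FIBER/CABLE": 30}
    rows = pareto_rows(toy_counts)
    for name, freq, cumulative in rows:
        print(f"{name:12s} {freq:4d} {cumulative:6.1f}%")
    print("Covers 80%:", vital_few(rows))
```

The same two functions apply to the downtime chart if the dictionary values are total downtime per sub-category instead of ticket counts.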

CONCLUSION

Based on the above analysis it can be said that power failure, fiber or cable cut, and configuration related issues are three of the most frequent factors behind the problems occurring in the network. The same factors are also responsible for a major portion of link downtime. So, if we focus on these three factors, we can cure a major part of the problems and prevent them from recurring in the system.

LIMITATIONS OF THE PROJECT

The categories and sub-categories for each of the fields Problem Reported and Cause of the Problem may not be mutually exclusive and exhaustive. The created categories, as well as the number of categories and sub-categories, are based entirely on the available data. One can always refine the analysis and make a better categorization.

It is to be noted that, for many of the reported problems, the reasons are either unspecified or unknown. For such cases we can use the estimated probabilities of success $\rho_i$ for categorization.

FUTURE WORKSCOPE

If a sufficiently large amount of data is available, a location-based analysis can be carried out. By location-based analysis we mean checking for any pattern in the problems occurring at a particular bank location. Suppose a particular location, i.e. a branch of any member bank (or an RBI location), repeatedly faces a similar kind of problem. If we find such a pattern, we can handle the problems more confidently.

ACKNOWLEDGMENT

I would like to express my sincere gratitude to the Institute for Development and Research in Banking Technology (IDRBT) and especially to Dr. N.P. Dhavale (DGM, INFINET and services, IDRBT), who was my guide in this project. I am extremely grateful to Dr. N.P. Dhavale for his advice, innovative suggestions and supervision. I thank him for introducing me to an excellent banking application and giving me the opportunity to work on its upgrade.

I am thankful to every member of staff of the INFINET department at IDRBT for helping me to get familiar with the system and giving me a chance to study it. I am thankful to IDRBT for providing such an amazing platform for students to work on real, application-oriented research.

Finally, I thank Shri V.S. Mahesh, Smt Anuradha, Shri Srihari and my co-interns at INFINET, with whom I worked throughout my stint at IDRBT; the project was possible only with their co-operation.

Sarbojit Roy
Project Trainee
Department of INFINET
IDRBT, Hyderabad