The Search for Gold Nuggets Using CRISP-DM Without a Seasoned Miner

P.W. Beers
University of Twente, The Netherlands
P.O. Box 217, 7500AE Enschede
p.w.beers@student.utwente.nl

ABSTRACT
The rise of data mining has changed people's lives, and it has also changed companies and the importance they attach to data analysis. Companies have always tended to gather as much data as possible, but only recently have developments in IT made it possible to analyze large quantities of data quickly and easily. This new field gave rise to the Cross Industry Standard Process for Data Mining (CRISP-DM), which is the current standard methodology for data mining. The process has been widely applied but has not been updated since its release in 1999. Many researchers, such as Clifton and Thuraisingham [3] and Zapata and Gil [13], have suggested improvements to it. This study looks into what improvements can be made to CRISP-DM. The recommended improvements are based on a literature study and on a field study at a company, where a data mining process was observed. The literature contains many suggestions to improve CRISP-DM, and the field study showed that a company does not always have a project leader but may instead have a departmental structure, which makes it harder to implement a methodology such as CRISP-DM. This paper makes six suggestions for improvements to CRISP-DM, which can result in new versions of CRISP-DM or even new data mining techniques.

Keywords
Data mining, CRISP-DM, improved CRISP-DM

1. INTRODUCTION
Data mining has become not only a useful tool but an absolute necessity for businesses that want to stay competitive. The knowledge that companies have is their most valuable resource, but utilizing this knowledge is a difficult process.
Since the start of data analytics there have been many data mining methodologies, which eventually resulted in the current standard, the Cross Industry Standard Process for Data Mining, or CRISP-DM for short. Since its release in 1999 [7], CRISP-DM has been used for data mining by countless companies and organizations as well as in research. It became the standard without ever being updated since its release. Over the years, there have been many applications as well as suggestions that could change CRISP-DM into a more up-to-date methodology. These suggestions have come from various researchers; an overview of them has been made in Reflection on Experience and Improvements in CRISP-DM [2] (footnote 1).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
25th Twente Student Conference on IT, July 1st, 2016, Enschede, The Netherlands.
Copyright 2016, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

Apart from a literature review, this research focuses on a field study of a data mining project. The purpose is to observe current practices as well as problems in the application of data mining. The company itself does not implement CRISP-DM but uses its own process for data mining. Because no company implementing CRISP-DM was found, the problems observed at the company will be compared to the CRISP-DM methodology. From this comparison, assumptions will be made about how CRISP-DM would handle these problems better or would need to be improved accordingly.
This, in addition to the current literature on CRISP-DM, will form the basis for a new version of CRISP-DM. The main question answered by this paper is: What improvements can be made to CRISP-DM? This question will be answered using the current literature on improvements to CRISP-DM as well as a field study on data mining. The result will form a basis for future research into data mining, the development of new data mining techniques, and the improvement of the existing CRISP-DM.

Footnote 1: This article has not been published; it is available on request.

2. RESEARCH QUESTIONS
This research attempts to form the theoretical basis of a new version of CRISP-DM by gathering suggestions made in the literature and by performing a field study. The main research question of this paper is: What improvements can be made to CRISP-DM? The following three sub-questions will be used to answer the main research question.
1. What suggestions are made by the current literature to improve CRISP-DM?
2. What observations are made during the field study on data mining?
3. How would CRISP-DM handle the processes and difficulties observed during the field study?

3. RESEARCH METHOD
3.1 General Description
To answer the research questions, the first step is to review the current literature. Over the years researchers have made many suggestions on how CRISP-DM can be improved. Reviewing these suggestions, as well as the concepts they aim to improve, is the first step towards a new version of CRISP-DM. With the literature as a starting point, a field study will then be done. This study observes a current data mining project at a
production factory with around 1900 full-time employees. In addition to the observations, interviews will be held with various stakeholders in the data mining project. Using the interviews and observations as a basis, the way the company handles data mining projects will be described. The problems discovered in this process will then be compared to the workings of CRISP-DM in order to conclude whether they would also have occurred if CRISP-DM had been used. Finally, the results of these comparisons will be added to the suggestions made in the literature to form a theoretical basis for a new version of CRISP-DM.

3.2 Research Approach
A lot of research has used CRISP-DM as its data mining process. As CRISP-DM was applied in various fields, new uses for it were discovered, but so were many shortcomings. These shortcomings have been stated in the literature and are gathered in this research based on the article Reflection on Experience and Improvements in CRISP-DM [2]. The practices described in that article concern not only specific solutions but also the concepts behind them. An overview of the suggested solutions will be created using these concepts and the step of the CRISP-DM process to which they relate. To make these improvements easier to understand, a general description of the six phases of CRISP-DM is also given.

Going a step further than the literature review, a field study on data mining will be held at a company. The company is production oriented, in a factory setting, with around 1900 full-time employees. The production process generates a large amount of data, which can be used to further improve that process. The company has installed measurement equipment and is now in the phase of data gathering and analysis. The observation of this real-life case consists of 16 hours per week for five consecutive weeks.
The observations will be made on the work floor as well as at management level, during a data mining project that is currently in progress at the company. During this period an overview is created of the general structure of the company and how it works. This is achieved by interviewing various stakeholders in the project, ranging from developer to user. First, the observations of the field study will be listed and then discussed. Using these observations, a comparison will be made with the current practices of CRISP-DM. This is meant to bring theory and practice closer together in the advice on improvements for a new version of CRISP-DM.

4. DATA COLLECTION AND ANALYSIS
4.1 Literature
The literature contains multiple suggestions to improve CRISP-DM. The methodology should first be described, however, so that the suggestions can be understood. CRISP-DM is a process for data mining that has a total of six phases [1, 7]:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
These phases are cyclic in nature, and it is possible to move back and forth between them, as can be seen in figure 1.

The business understanding phase is meant to determine the project objectives and requirements from a business perspective. These objectives are then used to state a data mining problem definition and a preliminary project plan. The data understanding phase is the phase in which the researcher becomes familiar with the data and discovers first insights into it. The data preparation phase is meant to construct the final dataset that is fed into the modeling tool, to prepare it for use, and to ensure its consistency. In the modeling phase, various models are constructed using different techniques. In the evaluation phase, the created models are evaluated and compared to the objectives. If there are objectives that have not yet been addressed, these are taken back to modeling.
It is also decided whether the models are sufficient. In the last phase, deployment, the model is finished and put to work. Feedback will come from its use, and the model might have to be adapted to the customer's use. Deployment can be as simple as making the model available to the customers or as difficult as implementing a repeatable data mining process across the enterprise.

Figure 1. The CRISP-DM process.

The suggestions made in the review article [2] range from very general to specific changes. One example is a proposal to incorporate pre-conceptual schemas and goal diagrams in the business understanding phase to make this part of CRISP-DM less informal [13]. There are also suggestions to automate the process and remove the manual labor currently involved in CRISP-DM [4, 10]. Other suggestions propose that, instead of keeping CRISP-DM as a general framework, it be changed to include specialized tasks for certain domains [6, 9]. Another problem addressed is the difficulty of estimating the cost and effort involved in a data mining project, especially for small and medium-sized enterprises (SMEs). One suggestion is to add a new estimation method oriented to SMEs to CRISP-DM [8]. Another way to better estimate costs, effort, planning and performance is the addition of the balanced scorecard to CRISP-DM. This has even been developed into a new version of CRISP-DM called CDM-BSC, which stands for CRISP-DM with Balanced Scorecard [11, 12].

This literature review answers sub-question 1. In short, the literature recommends adding schemas to the business
understanding phase, automating CRISP-DM, adding specialized tasks, adding cost estimation for SMEs and, lastly, combining CRISP-DM with the Balanced Scorecard.

4.2 Field Study
The company at which the field study was done can be considered a medium-sized enterprise. It is divided into multiple departments such as IT, production, research and development, finance, etc. The observations at this company start in the phase of data preparation and modeling. An increasing demand for safety from the customer led to the addition of multiple sensors and measurement tools to the production process in the factory. Management decided that, once analyzed, these measurements could also be used to increase performance. From this the data mining project started, but it was left without a clear direction.

First of all, multiple parties were connected to this new data mining project. These parties were spread over different departments, making communication even less direct. Due to the size of the company, the departments act as customers and suppliers to one another, so the relation between departments is one of asking or of being asked. The main departments concerned with the data mining project were production, IT and management. The project was originally started by management because of the customer's demand for safety. The production department was instructed to implement the changes, and afterwards the IT department was expected to create guided reports based on the data. A guided report can be seen as an overview of selected data for a specific purpose, for example an overview of the quarterly sales.

Because of the customer/supplier nature of the company, the first phase went smoothly. Management (customer) wanted improvements and gave the production department (supplier) the assignment to add the measurement devices.
It was when management (customer) gave the IT department (supplier) the assignment to create guided reports that the data mining project became unclear. The customer, management, was not actually the party that wanted the reports; that was most likely the production department. The IT department, which has other guided reports and projects to work on, was given an unclear assignment, which made it a less attractive one to finish. The lack of clarity came from the fact that IT is not familiar with the useful applications of the data, because these are very process specific. IT could decide to develop the guided reports that it believes are useful, but because the customer is absent from the development, these reports might never be used at all. The observations made during the weeks at the company are listed in table 1.

Table 1. Observations made.
Week 1: It is unclear who wants the guided reports.
Week 2: Data mining and the guided reports that follow from it should be an addition to current processes in order to be useful; otherwise the reports will be functional but not helpful to the user.
Week 3: There is no clear leader of the data mining project, but there is a problem owner.
Week 4: Due to the structure and culture at the company, the current data mining process has been obstructed. To break through this situation, someone from one of the departments or from outside the company should study the wishes of the user.
Week 5: The addition of the intern has made it possible to get a feel for the wishes of the user. Contact has now been made with the user, which makes him the customer and the IT department the supplier.

The company in the field study has a small Business Intelligence (BI) department, as an IT specialist at the company explained in an interview. Many ad hoc requests are made within the company; these are requests that are far from standard reporting and often not even necessary or helpful for the user.
Within the company a structure with process owners is used, in which the process owner, often a manager, is the main contact person for the project. Often, however, this process owner lacks the BI knowledge needed to lead a data mining project effectively. The IT department can contribute its BI knowledge, but this takes time that is unavailable because of the many ad hoc requests. This is illustrated in figure 2. In bigger companies the process owner is a manager who has BI knowledge. Processes such as CRISP-DM are developed for these companies.

Figure 2. Link between BI and Management.

5. RESULTS
In this section of the paper sub-questions 2 and 3 are answered. Sub-question 1 has already been answered in section 4.1 Literature.

During the field study, a lack of direction in the data mining process was observed. As described in section 4.2, this was due to the customer/supplier relationships between the departments. The IT department was expected to build guided reports using the data gathered from the measurement devices, but there were no specific requests for guided reports; there was only the general request from management to make guided reports using this data. In the company culture it was normal for someone to approach another department and ask for something specific, but it was not normal to be expected to deliver something without a specific customer. This was a breaking point in the data mining project, which was eventually resolved by meeting people from different departments and asking them whether they had a specific purpose for the data. This is not normally a task for the IT department, and it was achieved by adding an intern. The intern approached different stakeholders in the company, interviewed them to learn what they would deem useful, and reported this back to the IT department. This was an intervention to break the stalemate that the company culture had caused in the data mining project.
This situation is, however, not a unique occurrence for small and medium-sized companies. The divided departments inside a company have to come together to work on a data mining project, which is something that easily gets sidetracked. In bigger companies, and in methodologies such as CRISP-DM, the process is walked through under the leadership of one person. This person, who is often a consultant, oversees the processes in the company and the IT capabilities, and communicates with management. This makes it possible to control the CRISP-DM process in an organized way. In smaller companies, where this central person is absent, a construction with project owners and different departments is used to control the data mining project, which creates situations such as the unclear assignment for the IT department to create guided reports. The biggest observation from this field study is the lack of a central person who controls the data mining process. This absence is not covered in any methodology and is very company specific. With this, sub-question 2 has been answered.

The CRISP-DM methodology was designed as a general process to be used in almost any kind of setting. It was tested mainly on industrial test cases, but the literature shows that it has also been used in the field of medicine [5, 6, 9]. The original publication of CRISP-DM [7] makes no mention of the communication difficulties that arise when a data mining project has no clear leader. Just following the CRISP-DM process is insufficient, because knowledge from different departments needs to be combined for the data mining project. Because the methodology does not say who is meant to guide a CRISP-DM project, it leaves a large gap for companies to fill. When data mining professionals decide to use CRISP-DM, the user will most likely become this central person. However, when a project is started without a clear goal and with unclear communication and wishes, the methodology should offer a fallback to handle the situation. I propose two solutions to this problem.
The first is to appoint a central person who takes full responsibility for the data mining project. This can be someone within the company or, possibly even better, a consultant from outside the company who is not yet influenced by its standard practices. This was also done at the company in the field study, where an intern fulfilled this role. The second option, less efficient but workable, is to form a group of people from different departments, each of whom becomes the contact person within their own department and towards the other departments. For this to work, communication between the members of this group is essential and regular meetings are needed. This solution focuses on working within the current hierarchy instead of trying to oversee multiple departments at once with one person.

It can be concluded that CRISP-DM would not be able to handle situations in which the leadership is unclear. For such situations a fallback should be added to CRISP-DM that suggests either appointing a central person from within or outside the company to lead the data mining project or forming a group from different departments to lead it. With this, sub-question 3 has been answered.

6. CONCLUSION
The main question of this research is: What improvements can be made to CRISP-DM? It is answered by discussing the three sub-questions. The suggestions made in the literature to improve CRISP-DM consist of adding schemas to the business understanding phase, automating CRISP-DM, adding specialized tasks, adding cost estimation for SMEs and, lastly, combining CRISP-DM with the Balanced Scorecard. These suggestions would make CRISP-DM more efficient and grant greater control over the data mining process.
The main observation made during the field study was that there is not always a central person who implements a data mining methodology such as CRISP-DM; more usually there is a cooperation between departments that have a customer/supplier relationship with each other. CRISP-DM lacks the capabilities to handle such a situation and would work inefficiently, or not at all, in such an environment.

To answer the main question of this research, it is suggested that a new version of CRISP-DM incorporate a fallback that helps to establish clear leadership for the data mining project. In addition, as suggested by the literature, the current version of CRISP-DM can be improved by automating it fully or partly, adding schemas to the business understanding phase, adding specialized tasks for specific fields, adding cost estimation for SMEs and, lastly, combining CRISP-DM with the Balanced Scorecard. Any of these suggestions is expected to improve CRISP-DM and could form the basis for a new version of CRISP-DM.

7. DISCUSSION
This research focused on discovering ways to improve the data mining methodology CRISP-DM. Several suggestions have been made, based both on the literature and on the field study, but they still have to be tested to see how they perform. Using the suggestions made in this article could lead to new methodologies as well as new uses of CRISP-DM. The observations in the field study were taken from one company. To be able to conclude that the departmental way of working observed here is indeed common, research can be done at multiple small and medium-sized companies. This could also lead to a more specific methodology for SMEs instead of an update of CRISP-DM. The literature also suggested that the cost and benefit analysis of CRISP-DM is inadequate for SMEs, which suggests that CRISP-DM might not work as effectively at SMEs as at big enterprises.
Research into the effectiveness of CRISP-DM and other data mining methodologies in relation to the size of the enterprise could give more insight into what kind of data mining methodology is actually needed by the current market.

8. REFERENCES
[1] Bednár, P., et al. Design and implementation of local data mining model for short-term fog prediction at the airport. In 9th IEEE International Symposium on Applied Machine Intelligence and Informatics, SAMI 2011 - Proceedings. 2011.
[2] Beers, P.W., Reflection on experience and improvements in CRISP-DM. 2016.
[3] Clifton, C. and B. Thuraisingham, Emerging standards for data mining. Computer Standards and Interfaces, 2001. 23(3): p. 187-193.
[4] Kumara, B.T.G.S., et al. Ontology-Based Workflow Generation for Intelligent Big Data Analytics. In Proceedings - 2015 IEEE International Conference on Web Services, ICWS 2015. 2015.
[5] McGregor, C., C. Catley, and A. James. A process mining driven framework for clinical guideline improvement in critical care. In CEUR Workshop Proceedings. 2011.
[6] Pérez, J., et al., A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases. Journal of Medical Systems, 2015. 39(11).
[7] Chapman, P. (NCR), J. Clinton (SPSS), R. Kerber (NCR), T. Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C. Shearer (SPSS), and R. Wirth (DaimlerChrysler), Cross industry standard process for data mining. 1999.
[8] Pytel, P., P. Britos, and R. García-Martínez, A proposal of effort estimation method for information mining projects oriented to SMEs. In Lecture Notes in Business Information Processing. 2013. p. 58-74.
[9] Roa, D., M. Del Pilar Villamil, and J.D. Arboleda. Process model for data mining in health care sector. In CEUR Workshop Proceedings. 2011.
[10] Siriweera, T.H.A.S., I. Paik, and B.T.G.S. Kumara. Ontology-based service discovery for intelligent Big Data analytics. In IEEE 7th International Conference on Awareness Science and Technology, iCAST 2015 - Proceedings. 2015.
[11] Yun, Z. The study of CDM-BSC-based data mining driven fishbone applied for data processing. In 2015 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2015. 2015.
[12] Yun, Z., L. Weihua, and C. Yang. Applying balanced scorecard strategic performance management to CRISP-DM. In Proceedings - 2014 International Conference on Information Science, Electronics and Electrical Engineering, ISEEE 2014. 2014.
[13] Zapata, J.C.M. and N. Gil. Incorporation of both preconceptual schemas and goal diagrams in CRISP-DM. In 2011 6th Colombian Computing Congress, CCC 2011. 2011.