The Search for Gold Nuggets Using CRISP-DM Without a Seasoned Miner

P.W. Beers
University of Twente, The Netherlands
P.O. Box 217, 7500AE Enschede
p.w.beers@student.utwente.nl

ABSTRACT
The rise of data mining has changed people's lives, and it has also changed companies and the importance they attach to data analysis. Companies have always tended to gather as much data as possible, but only recently have developments in IT made it possible to analyze large quantities of data quickly and easily. This new field gave rise to the Cross Industry Standard Process for Data Mining (CRISP-DM), which is the current standard for data mining. The process has been widely applied but has not been updated since its release in 1999. Researchers such as Clifton and Thuraisingham [3], Zapata and Gil [13], and many others have suggested improvements to it. This study looks into what improvements can be made to CRISP-DM. The recommended improvements are based on a literature study and on a field study at a company, where a data mining process was observed. The literature contains many suggestions for improving CRISP-DM, and the field study showed that a company does not always have a project leader but may instead rely on a departmental structure, which makes it harder to implement a methodology such as CRISP-DM. This paper makes six suggestions for improvements to CRISP-DM, which can result in new versions of CRISP-DM or even new data mining techniques.

Keywords
Data mining, CRISP-DM, improved CRISP-DM

1. INTRODUCTION
Data mining has become not only a useful tool but an absolute necessity for businesses that want to stay competitive. The knowledge that companies have is their most valuable resource, but utilizing this knowledge is a difficult process.
Since the start of data analytics there have been many data mining methodologies, which eventually resulted in the current standard, the Cross Industry Standard Process for Data Mining, or CRISP-DM for short. Since its release in 1999 [7], CRISP-DM has been used for data mining by countless companies and organizations as well as in research. It became the standard without ever being updated. Over the years there have been many applications as well as suggestions that could turn CRISP-DM into a more up-to-date methodology. These suggestions come from various researchers; an overview of them is given in "Reflection on Experience and Improvements in CRISP-DM" [2] (footnote 1).

Apart from a literature review, this research focuses on a field study of a data mining project. The purpose of the field study is to observe current practices as well as problems in the application of data mining. The company itself does not use CRISP-DM but its own process for data mining. Because no company implementing CRISP-DM could be found, the problems observed at this company are compared to the CRISP-DM methodology. From this comparison, assumptions are made about how CRISP-DM would handle these problems better or how it would need to be improved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 25th Twente Student Conference on IT, July 1st, 2016, Enschede, The Netherlands. Copyright 2016, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.
This comparison, together with the current literature on CRISP-DM, forms the basis for a new version of CRISP-DM. The main question answered by this paper is: "What improvements can be made to CRISP-DM?" It is answered using the current literature on improvements to CRISP-DM as well as a field study on data mining. The result forms a basis for future research into data mining, the development of new data mining techniques, and the improvement of the existing CRISP-DM.

2. RESEARCH QUESTIONS
This research attempts to form the theoretical basis of a new version of CRISP-DM by gathering suggestions made in the literature and by performing a field study. The main research question of this paper is: "What improvements can be made to CRISP-DM?" The following three sub-questions are used to answer the main research question.

1. What suggestions are made by the current literature to improve CRISP-DM?
2. What observations are made during the field study on data mining?
3. How would CRISP-DM handle the observed processes/difficulties made during the field study?

3. RESEARCH METHOD
3.1 General Description
To answer the research questions, the first step is to review the current literature. Over the years researchers have made many suggestions on how CRISP-DM can be improved. Reviewing these suggestions, as well as the concepts they intend to improve, is the first step towards a new version of CRISP-DM.

With the literature as a starting point, a field study is done next. This study observes a current data mining project at a production factory with around 1900 full-time employees. In addition to the observations, interviews are held with various stakeholders in the data mining project. On the basis of the interviews and observations, the way the company handles data mining projects is described. The problems discovered in this process are then compared to the workings of CRISP-DM in order to conclude whether they would also have occurred if CRISP-DM had been used. Finally, the results of these comparisons are added to the suggestions made in the literature to form a theoretical basis for a new version of CRISP-DM.

Footnote 1: This article has not been published; it is available on request.

3.2 Research Approach
A lot of research has used CRISP-DM as its data mining process. Its use in various fields has revealed new uses for CRISP-DM but also many shortcomings. These shortcomings have been stated in the literature and are gathered in this research on the basis of the article "Reflection on Experience and Improvements in CRISP-DM" [2]. The practices described in that article concern not only specific solutions but also the concepts behind them. An overview of the suggested solutions is created using these concepts and the step of the CRISP-DM process to which they relate. To make these improvements easier to understand, the six phases of CRISP-DM are also described in general terms.

Going a step further than the literature review, a field study on data mining is held at a company. The company is production oriented, in a factory setting, with around 1900 full-time employees. The production process yields a large amount of data which can be used to further improve that process. The company has installed measurement equipment and is now in the phase of data gathering and analysis. The observation of this real-life case consists of 16 hours per week for five consecutive weeks.
The observations are made on the work floor as well as at management level, during a data mining project that is currently in progress at the company. During this period an overview is created of the general structure of the company and the way it works. This is achieved by interviewing various stakeholders in the project, ranging from developer to user. The observations of the field study are first listed and then discussed. Using these observations, a comparison is made with the current practices of CRISP-DM. This is meant to bring theory and real life closer together in the advice on improvements for a new version of CRISP-DM.

4. DATA COLLECTION AND ANALYSIS
4.1 Literature
The literature has suggested improvements to CRISP-DM multiple times. The technique should first be described, however, so that the suggestions can be understood. CRISP-DM is a process for data mining which has a total of six phases [1, 7]:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

These phases are cyclic in nature, and a project can move back and forth between them, as can be seen in figure 1. The business understanding phase is meant to determine the project objectives and requirements from a business perspective; these objectives are then used to state a data mining problem definition and a preliminary project plan. The data understanding phase is the phase in which the researcher becomes familiar with the data and discovers first insights into it. The data preparation phase is meant to construct the final dataset that is fed into the modeling tool, to prepare it for use, and to ensure its consistency. In the modeling phase various models are constructed using different techniques. In the evaluation phase the created model(s) are evaluated and compared to the objectives. If there are objectives that have not yet been addressed, these are taken back to modeling.
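The six phases and the movement between them can be sketched as a small transition table. This is only an illustrative sketch: the names `Phase`, `ALLOWED`, and `is_valid_path` are hypothetical, and the set of allowed moves is an assumption based on the commonly reproduced CRISP-DM diagram plus the step from evaluation back to modeling described above. It is not taken from figure 1 of this paper.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in the order listed above."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed moves between phases (illustrative, see the lead-in):
# forward moves plus the back-steps that make the process cyclic.
ALLOWED = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING,
                               Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.MODELING,
                       Phase.BUSINESS_UNDERSTANDING,
                       Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))
```

For example, `is_valid_path([Phase.MODELING, Phase.EVALUATION, Phase.MODELING])` accepts the loop in which unaddressed objectives are taken back to modeling, while a direct jump from business understanding to modeling is rejected.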
In the evaluation phase it is also decided whether the models are sufficient. In the last phase, deployment, the model is finished and put to work. Feedback will come from its use, and the model might have to be adapted to the needs of the customer. Deployment can be as simple as making the model available to the customers or as difficult as implementing a repeatable data mining process across the enterprise.

Figure 1. The CRISP-DM process.

The suggestions made in the review article [2] range from very general to specific changes. One is a proposal to incorporate pre-conceptual schemas and goal diagrams in the business understanding phase to make this part of CRISP-DM less informal [13]. There are also suggestions to automate the process and remove the manual labor currently involved in CRISP-DM [4, 10]. Other suggestions propose that, instead of keeping CRISP-DM as a general framework, it should be given specialized tasks for certain domains [6, 9]. Another problem addressed is the difficulty of estimating the cost and effort involved in a data mining project, especially for small and medium-sized enterprises (SMEs). One suggestion is to add a new estimation method oriented to SMEs to CRISP-DM [8]. Another approach to better estimation of costs, effort, planning and performance is the addition of the balanced scorecard to CRISP-DM. This has even been developed into a new version of CRISP-DM called CDM-BSC, which stands for CRISP-DM with Balanced Scorecard [11, 12].

This literature review answers sub-question 1. In short, the literature recommends adding schemas to the business understanding phase, automating CRISP-DM, adding specialized tasks, adding cost estimation for SMEs and, lastly, combining CRISP-DM with the Balanced Scorecard.

4.2 Field Study
The company at which the field study was done can be considered a medium-sized enterprise. It is divided into multiple departments such as IT, production, research and development, and finance. The observations at this company start in the phases of data preparation and modeling. An increasing demand for safety from the customer led to the addition of multiple sensors and measurement tools to the production process in the factory. Management decided that, once analyzed, these measurements could also be used to increase performance. This is how the data mining project started, but it was left without a clear direction.

First of all, multiple parties were connected to this new data mining project. These parties were spread over different departments, which made communication even less direct. Due to the size of the company, the departments had been turned into customers and suppliers of one another, so that the relation between two departments was either one of asking or one of being asked. The main departments concerned with the data mining project were production, IT and management. The project was originally started by management because of the customer's demand for safety. The production department was instructed to implement the changes, and afterwards the IT department was expected to create guided reports based on the data. A guided report can be seen as an overview of selected data for a specific purpose, for example an overview of the quarterly sales.

Because of the customer/supplier nature of the company, the first phase went smoothly. Management (customer) wanted improvements and gave the production department (supplier) the assignment to add the measurement devices.
It was when management (customer) gave the IT department (supplier) the assignment to create guided reports that the data mining project became unclear. The customer, management, was not actually the party that wanted the reports; most likely that was the production department. The IT department, which has other guided reports and projects to work on, was given an unclear assignment, which made it a less attractive task to finish. The lack of clarity came from the fact that IT is not familiar with the useful applications of the data, because these are very process specific. IT could decide to develop the guided reports that it believes are useful, but because the customer is absent from the development, such reports might never be used at all. The observations made during the weeks at the company can be seen in table 1.

Table 1. Observations made.
Week 1: There is an unclear situation about who wants guided reports.
Week 2: Data mining and the guided reports that follow from it should be an addition to current processes to be useful; otherwise the reports will be functional but not helpful to the user.
Week 3: There is no clear leader of the data mining project, but there is a problem owner.
Week 4: Due to the structure and culture at the company, the current data mining process has been obstructed. To break through this situation, someone from one of the departments or someone from outside the company should study the wishes of the user.
Week 5: The addition of the intern has made it possible to get a feel for the wishes of the user, and contact has now been made with the user, which makes him the customer and the IT department the supplier.

The company in the field study has a small Business Intelligence (BI) department, as an IT specialist at the company stated in an interview. Many ad hoc requests are made within the company. These are requests that are far from standard reporting and often not even necessary or helpful for the user.
Within the company a structure with process owners is used, in which the process owner, often a manager, is the main contact person for the project. This process owner often lacks the BI knowledge needed to lead a data mining project effectively. The IT department can contribute its BI knowledge, but this costs time that is unavailable because of the many ad hoc requests. This is illustrated in figure 2. In bigger companies the process owner is a manager who has BI knowledge. Processes such as CRISP-DM are developed for these companies.

Figure 2. Link between BI and Management.

5. RESULTS
In this section of the paper sub-questions 2 and 3 are answered. Sub-question 1 has already been answered in section 4.1 Literature.

During the field study an absence of direction in the data mining process was observed. As described in section 4.2, this was due to the customer/supplier relationships between the different departments. The IT department was expected to build guided reports using the data gathered from the measurement devices, but there were no specific requests for guided reports; there was only the general request from management to "make guided reports using this data". In the company culture it was normal for someone to approach another department and ask for something specific, but it was not normal to be expected to deliver something without a specific customer. This was a breaking point in the data mining project, which was eventually resolved by meeting people from different departments and asking them whether they had a specific purpose for the data. This is not normally a task for the IT department and was achieved by the addition of an intern. The intern approached different stakeholders in the company, interviewed them to find out what they would deem useful, and reported this back to the IT department. This was an intervention to break the stalemate that the company culture had caused in the data mining project.
This situation is not a unique occurrence among small and medium-sized companies. The separate departments inside a company have to come together to work on a data mining project, and such cooperation is easily sidetracked. In bigger companies, and in methodologies such as CRISP-DM, there is a process that is walked through under the leadership of one person. This person, who is often a consultant, oversees the processes in the company and its IT capabilities, and communicates with management. This makes it possible to control the CRISP-DM process in an organized way. In smaller companies where this central person is absent, however, a construction with project owners and different departments is used to control the data mining project. This creates situations such as the unclear assignment for the IT department to create guided reports. The most important observation from this field study is the lack of a central person who controls the data mining process. This absence is not covered in any methodology and is very company specific. With this, sub-question 2 has been answered.

The methodology of CRISP-DM was designed as a general process to be used in almost any kind of setting. It was tested mainly on industrial cases but has also been used in the field of medicine, as the literature shows [5, 6, 9]. The original publication of CRISP-DM [7] makes no mention of the communication difficulties that arise when a data mining project has no clear leader. Just following the process of CRISP-DM is insufficient, because knowledge from different departments needs to be combined for the data mining project. The methodology does not state who is meant to guide a CRISP-DM project, which leaves a large gap for companies to fill in. When data mining professionals decide to use CRISP-DM, the user will most likely become this central person. However, when a project is started without a clear goal and with unclear communication and wishes, there should be a fallback in the methodology to handle this situation. I propose two solutions to this problem.
The first is that a central person is appointed who takes full responsibility for the data mining project. This can be someone within the company or, possibly even better, a consultant from outside the company who is not yet influenced by the company's standard practices. This was also done at the company in the field study, where an intern fulfilled this role. The second option, less efficient but workable, is a group of persons from different departments, each of whom becomes the contact person within their own department as well as towards the other departments. For this to work, communication between the members of this group is essential and they need to meet regularly. This solution focuses on working within the current hierarchy instead of trying to oversee multiple departments at once with one person.

It can be concluded that CRISP-DM would not be able to handle situations in which the leadership is unclear. For this, a fallback should be added to CRISP-DM which suggests either appointing a central person from within or from outside the company to lead the data mining project, or forming a group from different departments to lead the project. With this, sub-question 3 has been answered.

6. CONCLUSION
The main question of this research is: "What improvements can be made to CRISP-DM?" It is answered by discussing the three sub-questions. The suggestions made by the literature to improve CRISP-DM consist of adding schemas to the business understanding phase, automating CRISP-DM, adding specialized tasks, adding cost estimation for SMEs and, lastly, combining CRISP-DM with the Balanced Scorecard. According to the literature, these suggestions would make CRISP-DM more efficient and would grant greater control over the data mining process.
The main observation made during the field study was that there is not always a central person present who implements a data mining methodology such as CRISP-DM; instead there is usually cooperation between different departments that have a customer/supplier relationship with each other. CRISP-DM lacks the capability to handle such a situation and would work inefficiently, or not at all, in such an environment.

To answer the main question of this research, it is suggested that a new version of CRISP-DM incorporate a fallback in the methodology that helps to establish clear leadership for the data mining project. In addition, as suggested by the literature, the current version of CRISP-DM can be fully or partly automated, extended with schemas in the business understanding phase, extended with specialized tasks for specific fields, extended with cost estimation for SMEs and, lastly, combined with the Balanced Scorecard. Any of these suggestions could improve CRISP-DM and could form the basis for a new version of CRISP-DM.

7. DISCUSSION
This research focused on discovering ways to improve the data mining methodology CRISP-DM. While several suggestions have been made, based both on the literature and on the field study, it still has to be tested how they perform. Using the suggestions made in this article could lead to new methodologies as well as new uses of CRISP-DM.

The observations in the field study were taken from one company. To be able to conclude that the departmental way of working described here is indeed common, research can be done at multiple small and medium-sized companies to confirm this statement. This could also lead to a more specific methodology for SMEs instead of an update of CRISP-DM. The literature also suggested that the cost and benefit analysis of CRISP-DM is inadequate for SMEs, which suggests that CRISP-DM might not work as effectively at SMEs as at big enterprises.
Research into the effectiveness of CRISP-DM and other data mining methodologies in relation to the size of the enterprise could give more insight into what kind of data mining methodology is actually needed by the current market.

8. REFERENCES
[1] Bednár, P., et al. Design and implementation of local data mining model for short-term fog prediction at the airport. In 9th IEEE International Symposium on Applied Machine Intelligence and Informatics, SAMI 2011 - Proceedings. 2011.
[2] Beers, P.W. Reflection on experience and improvements in CRISP-DM. 2016.
[3] Clifton, C. and B. Thuraisingham. Emerging standards for data mining. Computer Standards and Interfaces, 2001. 23(3): p. 187-193.
[4] Kumara, B.T.G.S., et al. Ontology-Based Workflow Generation for Intelligent Big Data Analytics. In Proceedings - 2015 IEEE International Conference on Web Services, ICWS 2015. 2015.
[5] McGregor, C., C. Catley, and A. James. A process mining driven framework for clinical guideline improvement in critical care. In CEUR Workshop Proceedings. 2011.
[6] Pérez, J., et al. A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases. Journal of Medical Systems, 2015. 39(11).
[7] Chapman, P. (NCR), J. Clinton (SPSS), R. Kerber (NCR), T. Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C. Shearer (SPSS), and R. Wirth (DaimlerChrysler). Cross industry standard process for data mining. 1999.
[8] Pytel, P., P. Britos, and R. García-Martínez. A proposal of effort estimation method for information mining projects oriented to SMEs. In Lecture Notes in Business Information Processing. 2013. p. 58-74.
[9] Roa, D., M. Del Pilar Villamil, and J.D. Arboleda. Process model for data mining in health care sector. In CEUR Workshop Proceedings. 2011.
[10] Siriweera, T.H.A.S., I. Paik, and B.T.G.S. Kumara. Ontology-based service discovery for intelligent Big Data analytics. In IEEE 7th International Conference on Awareness Science and Technology, iCAST 2015 - Proceedings. 2015.
[11] Yun, Z. The study of CDM-BSC-based data mining driven fishbone applied for data processing. In 2015 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2015. 2015.
[12] Yun, Z., L. Weihua, and C. Yang. Applying balanced scorecard strategic performance management to CRISP-DM. In Proceedings - 2014 International Conference on Information Science, Electronics and Electrical Engineering, ISEEE 2014. 2014.
[13] Zapata, J.C.M. and N. Gil. Incorporation of both preconceptual schemas and goal diagrams in CRISP-DM. In 2011 6th Colombian Computing Congress, CCC 2011. 2011.