Data Science in a pricing process

Similar documents
C-14 FINDING THE RIGHT SYNERGY FROM GLMS AND MACHINE LEARNING. CAS Annual Meeting November 7-10

SPM 8.2. Salford Predictive Modeler

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Salford Predictive Modeler. Powerful machine learning software for developing predictive, descriptive, and analytical models.

GLMs the Good, the Bad, and the Ugly Ratemaking and Product Management Seminar March Christopher Cooksey, FCAS, MAAA EagleEye Analytics

Ensemble Modeling. Toronto Data Mining Forum November 2017 Helen Ngo

Using Decision Tree to predict repeat customers

Linear model to forecast sales from past data of Rossmann drug Store

Practical Application of Predictive Analytics Michael Porter

Jialu Yan, Tingting Gao, Yilin Wei Advised by Dr. German Creamer, PhD, CFA Dec. 11th, Forecasting Rossmann Store Sales Prediction

From Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques. Full book available for purchase here.

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

CS6716 Pattern Recognition

Lecture 6: Decision Tree, Random Forest, and Boosting

PERSONALIZED INCENTIVE RECOMMENDATIONS USING ARTIFICIAL INTELLIGENCE TO OPTIMIZE YOUR INCENTIVE STRATEGY

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Genetic Programming for Symbolic Regression

A Systematic Approach to Performance Evaluation

Olin Business School Master of Science in Customer Analytics (MSCA) Curriculum Academic Year. List of Courses by Semester

ENVIRONMENTAL FINANCE CENTER AT THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL SCHOOL OF GOVERNMENT REPORT 3

From concept to implementation: the evolution of forest phenotyping

CONNECTING CORPORATE GOVERNANCE TO COMPANIES PERFORMANCE BY ARTIFICIAL NEURAL NETWORKS

Models in Engineering Glossary

PREDICTING EMPLOYEE ATTRITION THROUGH DATA MINING

Stochastic Gradient Approach for Energy and Supply Optimisation in Water Systems Management

Price Optimisation in Practice Stakeholder Views and Key Challenges. Nelson Henwood and John Yick

County- Scale Carbon Estimation in NASA s Carbon Monitoring System

Utilizing Optimization Techniques to Enhance Cost and Schedule Risk Analysis

Retail Sales Forecasting at Walmart. Brian Seaman WalmartLabs

MAS187/AEF258. University of Newcastle upon Tyne

STAT 2300: Unit 1 Learning Objectives Spring 2019

New Methods and Data that Improves Contact Center Forecasting. Ric Kosiba and Bayu Wicaksono

IBM SPSS Decision Trees

Effective CRM Using. Predictive Analytics. Antonios Chorianopoulos

Data Cleaning for sales modeling

Department of Economics, University of Michigan, Ann Arbor, MI

When to Book: Predicting Flight Pricing

Applications of Machine Learning to Predict Yelp Ratings

A new approach to Early Warning Systems for smaller European banks

Feature Selection for Predictive Modelling - a Needle in a Haystack Problem

Test lasts for 120 minutes. You must stay for the entire 120 minute period.

Exploiting full potential of predictive analytics on small data to drive business outcomes

10.2 Correlation. Plotting paired data points leads to a scatterplot. Each data pair becomes one dot in the scatterplot.

BUSS1020 Quantitative Business Analysis

Accurate Campaign Targeting Using Classification Algorithms

CASE STUDY: WEB-DOMAIN PRICE PREDICTION ON THE SECONDARY MARKET (4-LETTER CASE)

Quaero s Guide to Analytics

Sales Forecast and Business Management Strategy - ADDM (Advanced Data Decomposition Model)

FINAL PROJECT REPORT IME672. Group Number 6

Deep Dive into High Performance Machine Learning Procedures. Tuba Islam, Analytics CoE, SAS UK

Predicting user rating on Amazon Video Game Dataset

Introduction to Business Research 3

Utilizing Design Thinking for Digital Transformation

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005

Watts App: An Energy Analytics and Demand-Response Advisor Tool

Tree Depth in a Forest

Data mining and Renewable energy. Cindi Thompson

Predictive Analytics

AP Statistics Scope & Sequence

Machine Learning Models for Sales Time Series Forecasting

TOLERANCE ALLOCATION OF MECHANICAL ASSEMBLIES USING PARTICLE SWARM OPTIMIZATION

Methodological challenges of Big Data for official statistics

A Parametric Bootstrapping Approach to Forecast Intermittent Demand

Leveraging Attitudinal & Behavioral Data to Better Understand Global & Local Trends in Customer Loyalty & Retention

JMP TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

Estimating Discrete Choice Models of Demand. Data

Can Product Heterogeneity Explain Violations of the Law of One Price?

Multi-factor Stock Selection Model Based on Adaboost

Who Are My Best Customers?

Who Is Likely to Succeed: Predictive Modeling of the Journey from H-1B to Permanent US Work Visa

Copyright 2013, SAS Institute Inc. All rights reserved.

Evolutionary Algorithms


COGNITIVE COGNITIVE COMPUTING _

A GENETIC ALGORITHM WITH DESIGN OF EXPERIMENTS APPROACH TO PREDICT THE OPTIMAL PROCESS PARAMETERS FOR FDM

Creative Commons Attribution-NonCommercial-Share Alike License

Irrigation network design and reconstruction and its analysis by simulation model

Influence Maximization on Social Graphs. Yu-Ting Wen

A. ( ) ( ) 2. A process that is in statistical control necessarily implies that the process is capable of meeting the customer requirements.

Genetic Algorithm for Supply Planning Optimization under Uncertain Demand

Using People Analytics to Help Prevent Absences Due to Mental Health Issues

Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies

Rank hotels on Expedia.com to maximize purchases

Genetic Algorithms and Sensitivity Analysis in Production Planning Optimization

Our Guide to Technibond Double Sided Foam Tapes

Detecting and Pruning Introns for Faster Decision Tree Evolution

Forecasting for Short-Lived Products

CREDIT RISK MODELLING Using SAS

A Comparison of PRIM and CART for Exploratory Analysis

Timber Management Planning with Timber RAM and Goal Programming

Evolutionary Algorithms

Heuristic Techniques for Solving the Vehicle Routing Problem with Time Windows Manar Hosny

Incorporating supply chain risk into a production planning process EXECUTIVE SUMMARY. Joshua Matthew Merrill. May 17, 2007

Introduction to Random Forests for Gene Expression Data. Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 3.

References. Introduction to Random Forests for Gene Expression Data. Machine Learning. Gene Profiling / Selection

Winsor Approach in Regression Analysis. with Outlier

Indicator Development for Forest Governance. IFRI Presentation

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Metrics based field problem prediction. Paul Luo Li ISRI SE - CMU

Logistic Regression and Decision Trees

Transcription:

Data Science in a pricing process Michaël Casalinuovo Consultant, ADDACTIS Software michael.casalinuovo@addactis.com

Contents Nowadays, we live in a continuously changing market environment, Pricing has become a challenge. Non-life insurance companies started a price competition. Among other reasons, this competition is due to the emergence of price Comparison websites. The aim is to calculate the right tariff in order to keep customers satisfied while creating business. This implies that they need to: compensate the risk they take; continue to be competitive; avoid anti selection. 2 4 8 0 GLM Machine learning methods Mixture of methods Conclusion In addition to that, insurers need to: operate very quickly; use appropriate data, sophisticated and best practice pricing models; have well documented and transparent pricing processes in place. Currently, GLM models are commonly used by Non-life insurance companies for their pricing process. These models have a lot of advantages: transparency, understanding, etc. However, Data science methods and algorithms are growing and creating new niches not only for insurance companies but for many businesses. Therefore, some companies try to integrate machine learning methods in their pricing process. As a result, we can wonder if Data science would replace GLM and become the new standard in the Pricing field.

In addactis Pricing, once a GLM has run, results are very easy to analyze. GLM As it is a multiplicative model, the user is able to visualize the gap between modalities of a variable, all other things being equal. Besides, tariff structure is also commonly multiplicative. Finally, GLM are flexible, the user is able to integrate constraints or smoothing if he has to review its model for legal or commercial reasons. GLM is the current market standard for pricing. Indeed, these models are used to model, for a segment, pricing indicators like frequency, average cost or directly risk premium. Then, GLM will be used to model estimations made by previous models in order to obtain a tariff structure. 1- Advantages First of all, model s behavior is easy to understand. Indeed, GLMs are an improvement of linear models. As a result, the user can easily configure the model he wants to run. Moreover, many software packages offer to use GLM. On the one hand, open source software with several packages and on the other, actuarial software which embeds a powerful GLM engine like addactis Pricing. 2- Disadvantages However, GLM also have disadvantages. First, GLMs are parametric methods. That means, the user has to specify a distribution when he sets up the model. And sometimes, it could be difficult to find a distribution which fits the data. As a result, this can cause prediction errors in the model. Moreover, the user has to choose by himself variables to include in the model. Even if there are methods enabling to identify which variables are the more discriminant in the model, it does not take into account interaction between variables. In addition to that, if two variables are highly correlated, the user will not be able to include them together in the model. Last, GLMs need a data preparation phase before they are run. Indeed, explicative factors have to be prepared for them to be added on a model. That means, the user has to check if there are no outliers or missing values and has to handle modalities with small exposures. Data preparation step can be time consuming and the time spent on this step is not spent on the GLM analysis itself. 2-3

MACHINE LEARNING METHODS Indeed, Data Science is an interdisciplinary field. This is a datadriven science and it actually tries to extract insights from large amount of data. Many algorithms are based on statistical learning and machine learning methods offer new perspectives for data analyses and models conception. These booming techniques have proved their effectiveness on the web and multimedia fields, for online advertising targeting for example. In the pricing sector, with exogeneous data through the arrival of open data, connected objects and satellite data which are used as explanatory variables, machine learning methods could be adapted. 1- Algorithms The first advantage of these methods is that they are non-parametric. Also, these methods need to split the initial base into two bases: training set and testing set. The model is applied on the training set and the testing set enables to validate the model by calculating statistics. There are 2 groups of algorithms: classification methods and regression methods. As pricing indicators are quantitative, regression methods will be used. The first method studied is a regression by tree and the others are refinements of the previous one. a. CART The CART method means Classification And Regression Trees. It is a recursive algorithm: based on the training set, at each step it searches, among all the available variables, which criteria reduce variance the most and splits the base according to this criterion. The algorithm stops when there are no criteria left or according to parameters specified by the user. Also, this method calculates variable importance by summing all the variance reduction brought by the criterion based on this variable. With this graph, the user can easily detect which variable is the most significant in the model. Finally, the model is applied on the testing set and statistics are calculated by comparing observed values and estimated values: As a first approach, we have decided to study 3 algorithms which can be adapted to model pricing indicators: CART Random forest Gradient Boosting 4-5 The tree is easy to read. The top of the tree is the training set, intermediate layer named node represents the splitting criteria. The last node is named leaf and contains the estimation for observations which fulfil all the criteria from the base to the leaf, this estimation is calculated by taking the observations values average for this group. Where: n : testing size y i : observed value for line i μ i : estimated value for line i The user can compare several models with these statistics, the smallest the value, the better.

b. Random Forest The previous method depends a lot on the training set, resulting in a risk of overlearning. As a result, researches have been made in order to improve this algorithm. Random forest methods have been introduced as a generalization of CART. These methods have the advantage to reduce the prediction variance. The Random Forest method consists in building a high number of independent trees. It is an iterative method; the user has to choose: number of simulations to run the number of variables to use (m) CART parameters for all trees At each iteration, a new base is created, randomly sampling the training set with replacement. Then a CART is fitted by randomly picking m variables (when m is equal to the initial number of variables, the method is also called Bagging ). Model estimation is calculated by taking average estimations from all individual CART. This method is more robust than CART but results interpretation are more difficult. c. Boosting Again, this last method is based on trees but it works differently than Random Forest. The aim of it is to reduce, at each step, prediction errors by fitting a tree on it. This algorithm is iterative, it begins by fitting a weak model on the training set. Then, the next steps are the following: Calculate the loss function gradient for each prediction Fit a CART to explain them Sum the estimations of the previous CART with the estimations of the previous iteration in order to obtain new predictions The user can choose the number of iterations and some other parameters in order to optimize the model. The main advantage of this method is, besides reducing the variance, that it reduces the bias and its predictive power is better than the previous algorithms. However, results are difficult to analyze and time processing can take longer than other methods. 2- Limits To conclude, machine learning methods can have better prediction quality, but they have also some limits. First of all, unlike standard methods, data science is slightly more difficult to implement and computational time is higher. Moreover, many people consider these algorithms as a black box because they do not totally handle actions made during the process. Then, even if they are non-parametric algorithms, there are several parameters the user has to define and understand. Furthermore, results are more difficult to analyze and the structure is not multiplicative as for GLM models. Finally, there is no possibility for the user to fix constraints for certain variables or to smooth estimations. As a result, in the pricing sector, machine learning enables to more precisely model pricing indicators but it is not totally adapted to define tariff structure. Machine learning methods 6-7

MIXTURE OF METHODS Failing to totally replace GLM models, machine learning methods can be combined with GLM in order to improve them. First of all, data science can bring another method to detect significant factors and modalities. Indeed, just like already widely used methods, machine learning algorithms can detect significant variables. But, in addition, it enables to define the best way to handle modalities; that means groups to create for nominal variables and interval for ordinal and continuous variables. Once this step is made, the user will apply these variables and modalities on a GLM model. Besides, the user could use machine learning methods to detect interaction between variables and then be able to create and add these interactions in a GLM model. Finally, as we have seen at the end of the previous part, machine learning algorithms results have not a multiplicative structure. Therefore, an alternative can be the following; use these algorithms to model the pricing indicators (frequency, amount, risk premium) and once these models have been consolidated the user can run GLM to explain the estimated premium in order to obtain a multiplicative tariff structure and to smooth estimators. 8-9

CONCLUSION Recently, Machine learning methods emerge in a lot of fields, including insurance. More specifically, in the pricing domain, these methods can bring a lot in a continuously changing market environment. However, even machine learning has advantages that current GLM methods haven t. Also, they have limits which can be blocking in a pricing process: lack of transparency, difficult to analyse, etc. As a result, from my point of view and even if people think so, the GLM age is not achieved. The next step is to learn more about machine learning methods and to find the best way to combine machine learning methods with current GLM methods in order to optimize the pricing process. GET IN TOUCH WITH contact@addactis.com www.addactis.com There is always an addactis expert close to you! Closer to your needs, with a fine understanding of your market and requirements. We are YOUR ALTERNATIVE! 10