Comparative analysis on the probability of being a good payer

Size: px

Start display at page:

Download "Comparative analysis on the probability of being a good payer"

Noah Wilkerson
6 years ago
Views:

1 Comparative analysis on the probability of being a good payer V. Mihova, and V. Pavlov Citation: AIP Conference Proceedings 1895, (2017); View online: View Table of Contents: Published by the American Institute of Physics Articles you may be interested in Planning outstanding reserves in general insurance AIP Conference Proceedings 1895, (2017); / Mathematical modeling of the heat transfer during pyrolysis process used for end-of-life tires treatment AIP Conference Proceedings 1895, (2017); / Preface: Application of Mathematics in Technical and Natural Sciences 9th International Conference - AMiTaNS 17 AIP Conference Proceedings 1895, (2017); / The method of trend analysis of parameters time series of gas-turbine engine state AIP Conference Proceedings 1895, (2017); / Regression trees modeling and forecasting of PM10 air pollution in urban areas AIP Conference Proceedings 1895, (2017); / Analysis and modeling of daily air pollutants in the city of Ruse, Bulgaria AIP Conference Proceedings 1895, (2017); /

2 Comparative Analysis on the Probability of Being a Good Payer V. Mihova a) and V. Pavlov b) University of Ruse, 8 Studentska str., 7017 Ruse, Bulgaria a) Corresponding author: vmicheva@uni-ruse.bg b) vpavlov@uni-ruse.bg Abstract. Credit risk assessment is crucial for the bank industry. The current practice uses various approaches for the calculation of credit risk. The core of these approaches is the use of multiple regression models, applied in order to assess the risk associated with the approval of people applying for certain products (loans, credit cards, etc.). Based on data from the past, these models try to predict what will happen in the future. Different data requires different type of models. This work studies the causal link between the conduct of an applicant upon payment of the loan and the data that he completed at the time of application. A database of 100 borrowers from a commercial bank is used for the purposes of the study. The available data includes information from the time of application and credit history while paying off the loan. Customers are divided into two groups, based on the credit history: Good and Bad payers. Linear and logistic regression are applied in parallel to the data in order to estimate the probability of being good for new borrowers. A variable, which contains value of 1 for Good borrowers and value of 0 for Bad candidates, is modeled as a dependent variable. To decide which of the variables listed in the database should be used in the modelling process (as independent variables), a correlation analysis is made. Due to the results of it, several combinations of independent variables are tested as initial models both with linear and logistic regression. The best linear and logistic models are obtained after initial transformation of the data and following a set of standard and robust statistical criteria. A comparative analysis between the two final models is made and scorecards are obtained from both models to assess new customers at the time of application. A cut-off level of points, bellow which to reject the applications and above it to accept them, has been suggested for both the models, applying the strategy to keep the same Accept Rate as in the current data. INTRODUCTION Credit risk is the risk of default on a debt due to a borrower failing to make required payments (on credit line, loan, mortgage, etc.). Duties of the borrower include everything, connected with the loan (credit): principals, coupons, interest, etc. The existence of credit risk raises a number of consequences [1]. At first the banking and financial industry, which is affected by this type of risk, in most countries is allowed to collect personal data about the borrowers. This data is used to build mathematical models for assessment of the credit risk. Moreover, changes in the behavior of financial companies arise, based on assessments of credit risk. These changes include restructuring of clients accounts, portfolio diversification, activity specialization, changes in interest rates and individual interest levels to different customers, depending on their solvency. Given the number of corporate relationships that arise due to the presence of credit risk, that risk is essential for the success of the financial institutions. Statistical models are commonly used in the banking industry in order to assess the credit risk associated with the approval of people applying for certain products (loans, credit cards, etc.). Based on data from the past, these models try to predict what will happen in the future [2]. Different type of models (linear, logistic, polynomial, reciprocal, etc.) could be used according to the available data and creditor s purposes. Sometimes several different models have to be tested in order to choose the one that gives the best fit to the data. As a result of the built statistical model each candidate receives an estimate in the form of points. The points of all candidates draw up a point system or the so-called scorecard. Due to the type of the model, some results are Application of Mathematics in Technical and Natural Sciences AIP Conf. Proc. 1895, ; Published by AIP Publishing /$

3 more difficult to interpret. The interpretation of the results of the linear model is straightforward, as it models directly the dependent variable Y, while in the logistic regression, for example, the estimates are in log-odds units, which makes them difficult to interpret. A relation between the information that each client submits at the time of application for a product and the behavior of that client after he has been approved for the product is made [1]. A regression model is used to assess how the behavior of the approved customers (good or bad customers) depends on the data submitted in the application form for the product. The resulting relation can be used to assess new customers even in the time of application. The purpose of the scorecard is to model the resulting mechanism. When the model is based on the data submitted in the application form, not on the repayment behavior, then this mechanism could be modeled as the probability the candidate to be good or bad [3]. (In fact, the linear model gives the likelihood the candidate to be a good payer, while the logistic model gives the log-likelihood of being a good payer.) The above statement is described mathematically as follows: p P( y 1/ x) 1 P( y 0 / x), where x a vector that contains the information from the application form; p the probability a candidate with characteristic vector x to be good; y=1 the event the candidate is Good ; y=0 the event the candidate is Bad. This work has studied the causal link between the conduct of an applicant upon payment of the loan (the dependent variable, y) and the data that he completed at the time of application (independent variables x ( x 1,..., x k ) ). Linear and logistic regression have been used to estimate the probability of being good (or the log Good: Bad odds in the case of logistic model). PROBLEM FORMULATION A database of 100 borrowers from a commercial bank is used for the purposes of the study. They have applied for loans in the first quarter of The available data includes information from the time of application and credit history while paying off the loan. Customers are divided into two groups, based on the credit history: Good and Bad payers. If in the last 18 months a borrower has no more than one missed payment, he is Good, otherwise the candidate is Bad. The Good: Bad ratio (G:B Odds) for the monitored customers is 1.86 (65 Good borrowers and 35 Bad borrowers). Linear and logistic regression are applied in parallel to the data. A variable named Good Flag (Y), which contains the information mentioned above (value of 1 for Good borrowers and value of 0 for Bad candidates), is modeled as a dependent variable. A correlation analysis is made in order to decide which of the variables listed in the database should be used in the modelling process (as independent variables). Due to the results of this analysis, several combinations of independent variables are tested as initial models both with linear and logistic regression. To choose a starting point between the initial test models, the linear ones are compared by Adjusted R-squared, while for the comparison of the logistic ones is used Nagelkerke R-squared. Both the linear and the logistic initial test models are tested for regression significance. After choosing a starting point for the two types of regression, final linear and logistic models are obtained after new iterations, using initial transformation of the data and following a set of standard and robust statistical criteria. A comparative analysis between the two final models is made. As results from both of the models, two primary scorecards are made correspondingly, where all the coefficients are statistically significant and meet economic expectations. The scorecards could be used to assess new borrowers each applicant could be passed through the preferred scorecard and will receive certain points (formed on the basis of the available data for the person at the time of application). The bank could set a cut off level of points, bellow which to reject the applications and above it to accept them. The independent variables / indicators that are analyzed are presented in Table 1. The indicators X1, X2, X3, X4, X5 and X11 are quantitative, while X6, X7, X8, X9 and X10 are qualitative. The qualitative variables have the following categories: - Residential Status (X6): H (Homeowner), L (Living with parents), T (Tenant), O (Other); - Loan Payment Method (X7): B (Bank Payment), Q (Cheque), S (Standing Order), X (Not Given); - Occupation Code (X8): B (Self-employed), M (Employee), P (Pensioner), O (Other); - Account Type (X9): FL (Fixed Loan), VL (Variable Loan); - Marital Status (X10): D (Divorced), M (Married), S (Single), W (Widow), Z (Not Given)

4 TABLE 1. List of independent variables / indicators. Indicators / Variables Designation Units of Measurement Time with Bank Months Time in Employment Months Gross Annual Income Euro Loan Amount Euro Age of Applicant Years Residential Status Category Loan Payment Method Category Occupation Code Category Account Type Category Marital Status Category Loan as Percent of Income Percentages MODELLING The present data analysis has been done by SPSS [4]. Basic steps for statistical analysis in SPSS are presented in [5, 6, 7]. Dummy variables have been created for each of the qualitative variables in order to quantify the qualitative data. Dummies are variables that take the value of 0 (for the absence of an attribute) and 1 (for its presence) and are used in the same way in regression analysis as the quantitative variables. The dummies correspond to the number of classes / categories of the qualitative variable - 1. Thus, the category for which no dummy variable is created, is used as a reference category [8]. Let us take for an example the variable Residential Status (X6). The category Homeowner is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - set to 1 if the applicant lives with his parents and to 0 in all other cases; - set to 1 if the applicant rents a house and to 0 in all other cases; - takes value of 0 if the applicant is a homeowner, lives with parents or rents a house, and 1 in all other cases. Another example: the variable Loan Payment Method (X7). The category Bank Payment is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - Loan Payment Method Cheque 7.1) set to 1 if the applicant pays with cheque and to 0 in all other cases; - Loan Payment Method Standing Order 7.2) set to 1 if the applicant uses standing order to pay and to 0 in all other cases; - Loan Payment Method Not Given 7.3) takes value of 0 if the applicant pays via bank payment, cheque or standing order, and 1 in all other cases. In order to decide which of the independent variables should be used in the model, a correlation analysis has been made. Due to the presence of qualitative (nominal) variables, Cramer s V [9] have been used to measure the correlation. The existing correlations between the used indicators and the degree of their significance (** - significant correlation at P = 99%, * - significant correlation at P = 95%) have been established using the everyone against everyone method (Table 2). Based on the results of the correlation analysis, some of the indicators have been excluded from the model. X3, X4, X5 and X9 have been excluded from the modelling process due to statistically insignificant correlation with the dependent variable Y

5 TABLE 2. The correlation matrix. Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 Y 1.00 X1 0.56* 1.00 X2 0.54* X ** 1.00 X * ** 1.00 X ** 0.79* X6 0.36** 0.56** ** ** 1.00 X7 0.45** ** 0.89** 0.79* 0.71** X8 0.30* ** ** 1.00 X * X * ** 0.61** 0.35** 0.31** X * ** 0.85** 0.78** ** 0.89** * Due to statistically significant correlation between the independent variables that could lead to displacement of scores and hence wrong conclusions from the model, the following combinations of the indicators have been tested. M1: X1, X2 and X10 enter the model. (X6, X7, X8 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M2: X1, X8 and X11 enter the model. (X2, X6, X7 and X10 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M3: X1, X10 and X11 enter the model. (X2, X6, X7 and X8 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M4: X2, X6 and X8 enter the model. (X1, X7, X10 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M5: X6 and X7 enter the model. (X1, X2, X8, X10 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) The results from the initial test models are briefly outlined in Table 3. TABLE 3. Statistics summary of the initial test models. Model M1 M2 M3 M4 M5 Lin. Log. Lin. Log. Lin. Log. Lin. Log. Lin. Log. R-squared * Regression Significance ** Correctly classified cases (%) *** *Adjusted R-squared is given for the linear models and Nagelkerke R-squared (which is pseudo R-squared) is used to assess the logistic models. ** F-test is used for the linear models and Chi-square - for the logistic models. The p-value is given in the table. *** These statistics are collected only for the logistic models. The statistics, stated in Table 3, are used to choose one of the five models as starting point. The adjusted R- squared is greatest for the Linear Model M5. Also, the Logistic Model M5 has greatest Nagelkerke R-squared. Both the linear and logistic regression, version M5, are statistically significant even at significance level of 1%

6 Linear Regression Version M5 of the linear model is used as starting model. It has the following general form Y b b X b X b X b X b X b X 7.3 The values of the coefficients in front of each of the indicators are listed in Table 4. The table shows also that some of the coefficients are statistically insignificant (X6.1, X7.1 and X7.2) at the significance level of 5%. TABLE 4. Coefficients in the Linear Regression Initial Model M5 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X X X X By replacing in the general form of the model with the relevant coefficients from Table 4 the following equation is obtained: Y X X X 0.318X 0.088X X The coefficients in front of the indicators make sense except of the one in front of X6.1. The expectations for the applicants who live with their parents (X6.1) are that they cannot be better than the homeowners. To handle this situation, the applicants, who live with their parents, could be added to the null group (i.e., X6.1 to be excluded from the model). Another reason to exclude X6.1 from the modelling process is that the coefficient in front of it is not significant. As it comes to the loan payment method, it is worthy to combine the applicants who pay with cheque (X7.1) with the ones who use standing order (X7.2) and to check whether the new variable (X7.12) will have statistically significant coefficient. After applying the mentioned changes to the Initial Linear Model M5, the following model (Version M5.1) is obtained Y X 0.324X 0.078X X. 7.3 The adjusted R-squared for the Linear Model M5.1 is 0.269, which is better than for the initial version. The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Fvalue is 10.1). All of the coefficients are statistically significant at the significance level of 5% except X7.12 (Table 5). TABLE 5. Coefficients in the Linear Regression Model M5.1 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X X One way to handle the situation with the insignificant coefficient of X7.12 is to add it to the null group (Version M5.2), i.e., to leave in the null group everybody, except the ones with not given loan payment method that way they will be punished for the missing information (with negative coefficient). Another way is to add the applicants with not given payment method to the null, but let the ones with bank payment to get points (Version M5.3). Of course, if the null group is changed, the applicants who pay with cheque or with standing order can be tested again in the model together (Version M5.4) or separately (Version M5.5). The suggested changes have been analyzed. Statistics summary for the versions of the linear model could be seen in Table

7 In all of the new versions of the linear model (from M5.2 to M5.5) the logic behind residential status remains the same (i.e., negative points for X and more negative points for X). In Linear Model Version M5.2 the applicants with not given loan payment method receive negative points, which meets the economic expectations. In Linear Model Version M5.3 the applicants with bank payment receive positive points. In Linear Model Version M5.4 the applicants with bank payment again receive positive points, and those who pay via cheque or standing order are better than them. In Linear Model Version M5.5 the applicants with bank payment again receive positive points, those who pay via standing order are better than them and those who pay with cheque are the best. TABLE 6. Statistics summary of the linear models. Versions of the Linear Model / Statistics Adjusted R- squared Regression Significance * M5 M5.1 M5.2 M5.3 M5.4 M * F-test is used to test the significance of the model. The p-value is given in the table. How to choose the best model? The Adjusted R-squared of the versions M5.2, M5.4 and M5.5 are close to one another. All the models are statistically significant even at significance level of 1%. In principle, it is better to have more categories of the variable Loan Payment Method in the model, as this would give better differentiation between the applicants (i.e., to choose between versions M5.4 and M5.5 of the linear model). But there are some other considerations, on the basis of which Linear Model Version M5.2 is preferred in this work. The applicants with bank payment represent 80% of the used database. It is better to leave such a big group in the null, as it will make no further distinction at all. Let us proceed with Linear Model Version M5.2. It has the following form Y X X X % of the change in the behavior of the applicant (the dependent variable Y) is determined by a change in the modeled independent variables (Adjusted R-squared is 0.274) [7, 8]. As previously mentioned, Linear Model Version M5.2 adequately reflects the dependency of the variables, even at a level of significance of 1%, which can be seen also from Table 7. Variations TABLE 7. ANOVA table for Linear Model Version M5.2. Sum of squares Degrees of freedom Degrees of freedom 7.3 F-value Significance level ESS RSS TSS All of the coefficients are statistically significant at the significance level of 5% (Table 8). TABLE 8. Coefficients in the Linear Regression Model M5.2 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X

8 The model conducted above gives the likelihood the candidate to be a good payer. The scorecard, corresponding to the considered model, can be seen in Table 9 (the coefficients are multiplied by 1000 for clarity). Accordingly, each borrower can be passed through this scoring system. The model could be interpreted as follows. Each candidate starts at 816 points. If the applicant rents a house, he will get 393 points; if he is a homeowner or lives with parents, he will get no points on the variable Residential Status; if he has some other status (not in the mentioned ones) he will get 330 points. If the applicant is going to pay via bank payment, cheque or standing order, he will get no points on the variable Loan Payment Method. If he did not give information about the way of payment, his score will be deducted with 692 points. Linear Model Version M5.2 is used as the final linear model in this work. TABLE 9. Scorecard from Linear Model Version M5.2. No. Variable Points INTERCEPT RESIDENTIAL STATUS (X6) Homeowner (X6.0) 0 Living with Parents (X6.1) 0 Tenant (X) Other status (X) LOAN PAYMENT METHOD (X7) Bank Payment (X7.0) 0 Cheque (X7.1) Standing Order (X7.2) 0 Not Given (X7.3) 692 Logistic Regression Version M5 of the logistic model is used as starting model. The variable Good Flag (Y) takes values of 1 (Good borrowers) and 0 (Bad borrowers) with probabilities respectively p and 1-p. The starting model has the following general form p log e b0 b6.1x 6.1 b X b X b7.1x 7.1 b7.2 X 7.2 b7.3 X p The values of the coefficients in front of each of the indicators are listed in Table 9. The table shows that most of the coefficients are statistically insignificant (X6.1, X7.1, X7.2, X7.3 and the intercept) at the significance level of 5%. TABLE 10. Coefficients in the Logistic Regression Initial Model M5 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X *10 8 X X X X X By replacing in the general form of the model with the relevant coefficients from Table 10 the following equation is obtained p loge X X 1.598X 0.598X X X p Before interpreting the coefficients in front of the indicators, it is better to handle with the insignificant ones

9 Using the same logic as in the linear model, it is a good idea to add the applicants, who live with their parents (X6.1) to the null group (i.e., X6.1 to be excluded from the model). As it comes to the Loan Payment Method (X7), it has been checked that the overall variable X7 is statistically insignificant (p-value=0.946) even at the significance level of 20%, so it is better to exclude it from the model. After applying the suggested changes to the Initial Logistic Model M5, the following model (Version M5.1) is obtained p loge X X. 1 p The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Chi-square-value is 12.57). 71% of the cases are correctly classified. All of the coefficients are statistically significant at the significance level of 5% except the intercept (Table 11). TABLE 11. Coefficients in the Logistic Regression Model M5.1 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X Because the coefficients are in log-odds units, they are often difficult to interpret. Therefore, it is better to use the odds ratios (the exponentiation of the coefficients), listed in the right-most column in Table 11 labeled Exp(b), or the inverted odds ratios (1/odds ratios). The inverted odds ratios for X and X indicate that the odds of being good for the applicants who own a home or live with their parents are approximately 9 times higher than for the ones who rent a house (1/0.107) and approximately 4 times higher than for the rest applicants (1/0.267). This corresponds to the logic from the linear model. The Nagelkerke R-squared for the Logistic Model M5.1 is For comparison, the Nagelkerke R-squared for the Logistic Model M5 was and the Nagelkerke coefficients of all of the initial logistic models were higher. In this case, it is worthy to check the Initial Logistic Model M4 and approve it (it contains X6 and has Nagelkerke R-squared of the highest after the one of Initial Version M5). The Initial Logistic Model Version M4 has the following form p loge X X X 0.880X 0.563X X X p The coding of the variable Occupation Code (X8) is as follows. The category Self-employed is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - set to 1 if the applicant is an employee and to 0 in all other cases; - set to 1 if the applicant is a pensioner and to 0 in all other cases; - takes value of 0 if the applicant is self-employed, employee or pensioner, and 1 in all other cases. The values of the coefficients in front of each of the indicators are listed in Table 12. TABLE 12. Coefficients in the Logistic Regression Initial Model M4 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X *10 8 X X X X X

10 The table shows that most of the coefficients are statistically insignificant (X6.1, X, X8.1, X8.2, X8.3 and the intercept) at the significance level of 5%. It could be seen that at the significance level of 10% X8.2 and X8.3 are significant. It could be seen from Table 12 that for every one-month increase in Time in Employment (X2), a increase in the log-odds of Good Flag (Y) is expected, holding all other independent variables constant. This meets the economic expectations. As the coefficient in front of X6.1 is again insignificant at the significance level of 5%, it is better to add the applicants, who live with their parents, to the null group (i.e., X6.1 to be excluded from the model) and see what will happen with the rest categories of Residential Status (X6). The same logic could be applied to Occupation Code (X8) exclude X8.1 and check what will happen with the rest categories. After applying the suggested changes to the Initial Logistic Model 4, the following model (Version M4.1) is obtained p loge X X 0.939X 0.913X X p The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Chi-square-value is 24.43). 79% of the cases are correctly classified. The Nagelkerke R- squared for the Logistic Model M4.1 is All of the coefficients are statistically significant at the significance level of 15%, although X, X8.2 and X8.3 are insignificant at the significance level of 5% (Table 13). TABLE 13. Coefficients in the Logistic Regression Model M4.1 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X X X Logistic Model Version M4.1 could be treated as the final logistic model (at significance level of 15%) or some changes could be applied to it in order to approve the significance level. A good starting point for a new version of the model is to bring together X8.2 and X8.3 (coded with the dummy variable X8.23), as they both are expected to be worse than the null group. After applying the suggested changes, the following model (Version 4.3) is obtained p loge X X 0.993X X p All of the coefficients are statistically significant at the significance level of 10%, although X and the intercept are insignificant at the significance level of 5% (Table 14). TABLE 14. Coefficients in the Logistic Regression Model M4.2 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X X The new version of the logistic model is statistically significant (Chi-square-value is 23.71), p < The model explains 29.1% of the variance in Good Flag (Nagelkerke R-squared is 0.291) and correctly classifies 79% of cases

11 TABLE 15. Scorecard from Logistic Model Version M4.2. No. Variable Points INTERCEPT TIME IN EMPLOYMENT (X2) 7 2 RESIDENTIAL STATUS (X6) Homeowner (X6.0) 0 Living with Parents (X6.1) 0 Tenant (X) 2108 Other status (X) OCCUPATION CODE (X8) Self-employed (X8.0) 0 Employee (X8.1) 0 Pensioner (X8.2) 1089 Other (X8.3) 1089 Increasing Time in Employment (X2) is associated with an increased likelihood of being a good payer. The inverted odds ratios for X and X indicate that the odds of being good for the applicants who own a home or live with their parents are ~ 8 times higher than for the ones who rent a house and ~ 3 times higher than for the rest applicants (the inverted odds ratio for X is 1/0.121 and for X is 1/0.371). Employees and self-employed are ~ 3 times more likely to be good payers than pensioners and other applicants (the inverted odds ratio for X8.23 is 1/0.336). The scorecard, corresponding to the considered model, could be seen in Table 15 (the coefficients are multiplied by 1000 for clarity). In order to improve the significance level of the coefficients, X and X could be grouped together. This option is tested, but then the intercept becomes insignificant even at the significance level of 20% (p-value=0.297). Logistic Model Version M4.2 is used as the final logistic model (at significance level of 10%) in this work. It has better level of significance of the coefficients than Logistic Model Version M4.1 and commensurable performance. COMPARATIVE ANALYSIS After the final linear and logistic models have been chosen, here comes the question how to compare them. Using R-squared for this purpose is not a good idea, because both models need to work with the same sample size and the same dependent variables. The second condition is violated here the logistic model gives the loglikelihood of being a good payer, while the linear model gives the likelihood the candidate to be a good payer. There are some transformations, on the basis of which the coefficients of determination of the two models could be compared [8]. Instead of doing this, some other criteria of goodness of fit have been checked below for both the models. The reason is that R-squared is not the best criteria to look at for the logistic models. Something more, the quest for determining the right model is not to get a model with the highest coefficient of determination, but to get estimates of regression coefficients and to be able to draw statistical conclusions about these coefficients. It often happens in the empirical analysis to obtain models with a very high coefficient of determination, but the coefficients of the model to be statistically insignificant or even with the opposite sign of the theory. What is important is to obtain a model with statistically significant coefficients that meet the theoretical (economic) expectations. Therefore, the coefficient of determination has been used as a key factor (along with the overall significance of the model) only when choosing the initial models and after that more attention has been paid on the significance of the model coefficients. It should also be noticed that a multicollinearity check has been performed for both the final models and the results show that there is no multicollinearity neither for the linear, nor for the logistic model. The Variance Inflation Factor (VIF) has been used for this check. The variables from the final linear model have VIF between 1.02 and 1.05, while those from the final logistic model are with VIF between 1.02 and

12 Score Distribution Reports In order to compare the final linear and logistic models score distribution reports have been created, including the number and the percentages of Goods and Bads, as well as the Good: Bad Ratio (G:B Odds) per score band. The scores from the two models have been created from the Predicted Probability of Being Good (multiplied by 1000 for clarity), obtained from the corresponding models. The scores have been classed on 5 bands with approximately 20% of the population in each band. A good model is a model that have a monotonically increasing trend, i.e., more Bad candidates in the lower score bands, and increasing Good ones in the upper bands. In other words, the G:B Odds should increase with increasing of the points. The score distribution for the linear model (Table 16) shows that there is a big cluster of observations in the highest band. In fact, 64% of the applicants have a score greater than 500, or to be more precise the score of all of them is 817 points, which means there is no good distinction between the applicants in the highest band. Despite of this, the overall report for the linear score looks good the G:B Odds are increasing per interval (score bands arranged in ascending order). In Figure 1a it could be seen that the number of Goods is increasing per interval, but the number of Bads varies this is due to the bigger percentages of observations in the last bands (due to the clusters). Class of points TABLE 16. Score Distribution Report for Linear Model Version 5.3. # Goods % Goods # Bads % Bads G:B Odds # Total % Total [Low : 100] 0 0.0% 3 8.6% % (100 : 200] 0 0.0% % % (200 : 450] 2 3.1% % % (450 : 500] % % % (500 : High] % % % Total % % % The score for the logistic model is better distributed (Table 17), there are not such big clusters as in the linear model. The G:B Odds are increasing per interval and Figure 1b shows that the logistic score has decreasing number of Bads and increasing number of Goods per interval (score bands arranged in ascending order). Class of points TABLE 17. Score Distribution Report for Logistic Model Version 4.3. # Goods % Goods # Bads % Bads G:B Odds # Total % Total [Low : 450] 5 7.7% % % (450 : 550] % % % (550 : 750] % % % (750 : 850] % 3 8.6% % (850 : High] % 3 8.6% % Total % % % Gini Index The power of the model could be measured using the Gini index. The Gini index (or Gini coefficient) is a measure of statistical dispersion introduced by the Italian statistician and sociologist Corrado Gini in [10]. In fact, the Gini index is an example of a Lorentz curve that is used to estimate inequalities [11]. Gini index of 0 shows perfect equality, and index of 1 (if calculated as number, or 100 if calculated as percentage) shows perfect inequality (excellent discrimination between the variables)

In this work the calculation of the Gini index is based on a comparison between the cumulative number of Goods and Bads for a given interval of points.

61%) and the Gini index of the Logistic Score is 0.5965 (or 59.65%). This means that the logistic model gives a little bit better discrimination.

13 In this work the calculation of the Gini index is based on a comparison between the cumulative number of Goods and Bads for a given interval of points. It has been calculated (using the score distribution reports) that the Gini index of the Linear Score is (or 55.61%) and the Gini index of the Logistic Score is (or 59.65%). This means that the logistic model gives a little bit better discrimination. (a) (b) FIGURE 1. Bar Charts for the number of Goods and Bads per score interval: (a) for Linear Score; (b) for Logistic Score FIGURE 2. Gini index graphs for the final linear and logistic scores

14 An interesting fact was observed while drawing the graphs of the two Gini indexes. Figure 2 shows that the linear score performs better than the logistic score in the lower bands: the blue curve, corresponding to the linear model, is farther from the green line (the line of perfect equality). The situation in the upper bands is the opposite: the red line, corresponding to the logistic model, is farther from the line of perfect equality, which means that the logistic score performs better than the linear one here. In fact, the Gini index for the linear model represents the area between the line of perfect equality (the green line) and the curve of cumulative Goods (%) against cumulative Bads (%) for the linear score (the blue curve). Similarly, the Gini index for the logistic model represents the area between the green line and the red curve. The main disadvantage of the Gini coefficient is the implication of the relative value of the changes that may occur in different parts of the distribution [12]. Moving from the points of better to the points of worse applicant has a significantly greater impact on the Gini index if the two individuals are close to the average than if they were at either end of the score distribution. In some cases, this is a desired effect. In this work especially, the average bands of the score are important, as here is the place to set different credit limits. As the logistic score works better with the Good applicants (the upper bands) and has better Gini index, it could be used to set credit limits. At the same time, the linear score discriminates better between the Bad applicants (the lower bands) and therefore it could be used to set a cut-off level, which is to be done in the next section. Cut-Off Level How to separate Good from Bad candidates, that is, how to decide to whom to grant a loan? Once there is a working model, it already has a scoring system that can divide people to be rejected by those who will be approved. Let us fix a point level c and accept candidates who have received points that are higher or equal to that level, and reject those with points lower than c. This level is called Cut-off and is determined by the risk that the borrower is ready to take [3]. Different strategies could be taken, depending on the goal which the creditor pursues: retaining the same Accept Rate, but increasing the number of Goods; more accepted candidates, but with higher G:B Odds, etc. Two types of mistakes could be made when setting a cut off level: a first-order error and a second-order error. In the credit scoring a first-order error occurs when a Bad payer is marked as Good. When a Good payer is marked as Bad, then this is a second-order error. The cut-off level depends on how big first-order error is likely to accept the creditor. In the reviewed data the initial level of accepted candidates is ~ 90% (Accepted / (Accepted + Rejected)), while the initial Bad Rate (Bads / Accepted) is 35%. In order to compare the results of both methods, it will be checked how the Bad levels would change within each of the methods, keeping the level of accepted candidates at 90%. For the logistic model Accept Rate of ~ 90% corresponds to a cut-off level of approximately 283 points (i.e., to reject the 10%, which have points below (or equal) that level and accept the 90% with points above that level). For the linear model Accept Rate of ~ 90% corresponds to a cut-off level of approximately 125 points (as already mentioned, there are clusters of observations at certain points in the score from the linear model and for that reason cannot be rejected exactly 10% of the applicants, but the most closer is 9% Rejects). The cut-off points, the Accept Rates and the Bad Rates, corresponding to that cut-off levels of both the linear and the logistic model, could be seen in Table 18. TABLE 18. Cut-off analysis with strategy to keep the same Accept Rate. Type of the final model Cut-off level (points) Accept Rate Bad Rate Linear % 28.57% Logistic % 30.00% One could easily notice that both the models have lower Bad Rate than the current one. The linear model has a little bit lower Bad Rate than the logistic one, but they are comparable, having in mind the small number of Bads (in fact, there are no Goods in the rejected applicants from the linear model, and there are only 2 Goods in the rejected applicants from the logistic model). To prove this, a confidence interval on the number of Bads has been built. In fact, the Good Flag has binomial distribution (0 for the Bad applicants and 1 for the Good ones). For a

15 sample of 100 applicants with 35 Bads in them, the 95% confidence interval for the number of Bads shows that they can vary from 26 to 45, while as the 90% confidence interval shows from 28 to 43 Bads. Anyway, it is better to use the linear score for setting a cut-off level, as the Gini index graph (Figure 2) shows the linear model works better with the Bad applicants. Remark: A common practice to validate the results from the final model is to randomly separate small part from the population, which is excluded from the modeling process. The goal is to be able to demonstrate the performance of the model from three different points of view: the whole population; the population, on which the model is built (~ 80%); the population, which is excluded from modeling (~ 20%). By using a sample that is not taken in the modeling process, it will be possible to demonstrate how the model actually works on candidates that are new. In this case, they have data on whether they were Good or Bad, and it will be easy to check the discrimination, imposed by the model. In this work no validation sample is left because of the small amount of the data. But when working with big samples, it is very useful to use such sample to test the stability of the model. CONCLUSIONS The relationship between the information that each borrower submitted at the time of application for a product and his behavior, after he has been approved for the product, can be used to estimate the probability of being good for new customers. This could be done using different types of statistical models. Two of these models linear and logistic, have been compared in this work, using a database of 100 borrowers and following a set of standard and robust statistical criteria. As a result, two scorecards have been obtained from both models to assess new customers at the time of application. The comparative analysis between the two final models shows that the logistic model is more precise, but the linear model is easier to interpret. Furthermore, the linear model could be used to set a cut-off level, bellow which to reject the applicants, as it works better in identifying the Bad applicants. At the same time, the logistic model could be used to determine credit limits or annual percentage rates for the accepted candidates, to choose Good applicants for loyalty programs, etc. it works better with the Good applicants. The presented algorithm and criteria could be used for other data in a similar situation. Recommendation: Banks could set up strategies to work with the Bad applicants and take out the best of them, i.e., to think how exactly to approach them, so that they can still pay. Appropriate techniques for this purpose could be rescheduling, renegotiation, discounts, etc. ACKNOWLEDGMENTS This paper contains results of the work on project No 2017-FNSE-05, financed by Scientific Research Fund of Ruse University, Bulgaria. REFERENCES 1. V. Mihova and V. Pavlov (2017) An approach of estimating the probability of default for new borrowers, to appear in Economy & Business Journal D. Montrichard, Reject inference methodologies in credit risk modeling, in SESUG 2008, The Proceedings of the SouthEast SAS Users Group, 3. J. Mok, Reject inference in credit scoring, BMI paper, Amsterdam, SPSS IBM Statistics, 5. V. Goev, Statistical Processing and Data Analysis from Sociological, Marketing and Political Studies with SPSS (University Press Stopanstvo, Sofia, 1996). [in Bulgarian] 6. Manov, Statistics with SPSS (Trakia-M, Sofia, 2001). [in Bulgarian] 7. V. Pavlov and V. Mihova, Applied Statistics with SPSS (Avangard Print, Ruse, 2016). [in Bulgarian] 8. D. Gujarati, Basic Econometrics, 4th edn (McGraw Hill, New York, 2005). 9. H. Cramér, Mathematical Methods of Statistics (Princeton University Press, Princeton, 1946). 10. C. Gini, Variabilità e Mutabilità (C. Cuppini, Bologna, 1912)

16 11. A. Marshall, I. Olkin, and B. Arnold, Inequalities: Theory of Majorization and Its Applications (Academic press; New York, 1979). 12. F. Cowell, Measuring Inequality, LSE Perspectives in Economic Analysis (Oxford University Press, UK, 2008)

RESULT AND DISCUSSION

RESULT AND DISCUSSION 4 Figure 3 shows ROC curve. It plots the probability of false positive (1-specificity) against true positive (sensitivity). The area under the ROC curve (AUR), which ranges from to 1, provides measure