RESULT AND DISCUSSION


Figure 3 shows the ROC curve. It plots the probability of a false positive (1 - specificity) against the probability of a true positive (sensitivity). The area under the ROC curve (AUR), which ranges from 0.5 to 1, provides a measure of the model's ability to discriminate between the subjects who experience the outcome of interest and those who do not. The measure of the AUR is the c statistic:

c = (nc + 0.5(t - nc - nd)) / t

where
nc : the number of concordant pairs
nd : the number of discordant pairs
t : the total number of pairs

As a general rule (Hosmer & Lemeshow 2000):
C = 0.5 : no discrimination
0.7 ≤ C < 0.8 : acceptable discrimination
0.8 ≤ C < 0.9 : excellent discrimination
C ≥ 0.9 : outstanding discrimination

METHODOLOGY

Data Source

The data used in this research was the publicly available German Credit data set. It contains observations on 1000 past credit applicants. Each applicant was rated as good (700 cases) or bad (300 cases). After considering the Six Basic Cs of Lending, 17 variables were used in this research: 3 numeric variables, 6 ordinal variables, 7 nominal variables, and 1 binary variable. A description of the variables can be seen in Appendix 1.

Method

The procedures used in this research were:
1. Divide the data into training data (740 observations) for modeling and testing data (260 observations) for validation. Each data set has the same pattern of good/bad debtors as the full data set, i.e. 70% good debtors and 30% bad.
2. Data exploration.
3. Model the data using stepwise, forward, and backward logistic regression. The probability modeled was that of Y = 1 (the debtor had a good collectability status). Then choose one of those three models by considering the fit of the model and the model having the highest c statistic.
4. Model the data using logistic ridge regression.
5. Determine the optimal cutpoint from the intersection of sensitivity and specificity.
6. Validate the models with the testing data.
7. Compare the classification rate and the c statistic between logistic ridge regression and the logistic regression with variable selection.
8.
Generate V2*, a variable having a specified correlation with V1. Then repeat steps 3 through 7 on the new data (with V2 replaced by V2*) to see how logistic regression with variable selection and logistic ridge regression perform as the correlation between V1 and V2* increases.

RESULT AND DISCUSSION

Data Exploration

There were no outliers or missing values in the full data set, so all 1000 observations were included in the analysis. The allocation of the data into modeling and validation sets was based on the proportion of bad and good cases in the overall data set: each set had 70% good and 30% bad cases, matching the full data set. The variables V1 (duration of credit) and V2 (credit amount) had a decreasing trend with respect to the response variable. Figure 4 shows that as the credit amount increased, the proportion of debtors with good collectability status decreased. Debtors with a high installment rate (V4) tended to be bad debtors. The difference in the proportion of good debtors between occupation categories was not significant. The group of debtors who were unemployed/unskilled-nonresident had the highest proportion of good debtors, compared to the unskilled-resident, official, and officer groups.

Figure 4 Plot of the percentage of good debtors in each group of credit amount (V2)

It can be seen in Appendix 2 that, based on age (V3), the group of debtors aged 20 to 50 years old showed a positive trend in the proportion of good debtors. As the age

increased, the proportion of good debtors increased, up to the age of 50. The group of debtors aged 66 to 75 years old had the lowest proportion of good debtors. Debtors with two dependents had a higher proportion of good debtors than those with one dependent (V6). As the status of the checking account (V7) increased, debtors tended to be good debtors. Home ownership status (V12) also had a positive trend: the proportion of good debtors increased as the home ownership status changed from free to rent to own. There was no pattern in the proportion of good debtors as the time of working experience in the current job (V10) or the time living in the present residence (V11) increased; the figures can be seen in Appendix 2. The group of debtors who had been working four to seven years in their current job had the highest proportion of good debtors. Debtors who had been working more than seven years in their current job tended to be good debtors compared to those with less than four years of working experience in their current job. The unemployed debtors had a higher proportion of good debtors than those with less than a year of working experience in their current job. For the time living in the present residence, debtors with at most one year, and also debtors with two to three years, tended more to be good debtors than the others.

Figure 5 Proportion of good debtors in each category of credit history (V8): no credit taken, all credits paid back duly, existing credit paid back duly, delay in paying off in the past, critical account

Figure 5 shows that credit history (V8) had a positive trend. Debtors who had never taken credit before had the lowest proportion of good debtors. Those with a high average balance in the savings account (V9) tended to be good debtors.
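The stratified allocation into modeling and validation data described earlier (each set keeping the full data's 70% good / 30% bad pattern) can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual procedure; the 1000/740/260 counts are consistent with the classification percentages reported later.

```python
import numpy as np

def stratified_split(y, train_frac=0.74, seed=0):
    """Split indices into training/testing so that each split keeps the
    full data's good/bad proportion (70%/30% in the German Credit data)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]       # all cases with this label
        rng.shuffle(idx)
        n_train = int(round(train_frac * idx.size))
        train.extend(idx[:n_train])         # same fraction from each class
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test

# 1000 applicants: 700 good (coded 1), 300 bad (coded 0)
y = np.r_[np.ones(700), np.zeros(300)]
train, test = stratified_split(y)
print(train.size, test.size)            # 740 260
print(y[train].mean(), y[test].mean())  # 0.7 0.7
```

Splitting within each class separately is what guarantees both subsets reproduce the 70/30 good/bad pattern exactly, rather than only on average.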
The difference between marital statuses (V14) was not significant, although single males and married males had a higher proportion of good debtors than females and divorced males. Debtors who had a guarantor (V15) tended more to be good debtors than those who had a co-applicant. Those who owned property (V16) tended more to be good debtors than those with no property. Those whose credit purpose (V17) was a used car, furniture, or radio/television had a higher proportion of good debtors than the other purposes; the lowest proportion of good debtors was among those with education as the purpose of taking credit. Debtors who had a telephone number under their own name (V19) tended more to be good debtors than those who did not. The figures of the percentage of good debtors for each variable can be seen in Appendix 2.

The evaluation of the correlations between the predictor variables can be seen in Table 1 and Table 2. It can be concluded that there were many significant correlations but only one high correlation coefficient, which exists between V1 and V2. Table 3 shows the Cramer statistic as the measure of association between the nominal variables.

Table 1 Pearson correlation coefficients of the numeric variables

Table 2 Spearman correlation coefficients of the ordinal variables

Table 3 Cramer coefficients of the nominal variables

Among the numeric predictors, the only significant correlation occurs between V1 (duration of credit) and V2 (credit amount), with a correlation coefficient of 0.628, as shown in Table 1. Using the Spearman coefficient of correlation shown in Table 2, the largest correlation was 0.327, which occurred between V11 (time in present residence) and V12 (housing). Variable V10 (time in current job) had significant correlations with all other ordinal variables except V12 (home ownership).
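The Cramer coefficient used in Table 3 can be computed from a two-way contingency table of two nominal variables. A minimal NumPy sketch follows; the example table is hypothetical, not taken from the data set.

```python
import numpy as np

def cramers_v(table):
    """Cramer's V for an r x c contingency table of counts:
    V = sqrt(chi2 / (n * (min(r, c) - 1)))."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / n                      # counts under independence
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# hypothetical cross-tabulation of two nominal predictors
tab = [[30, 10],
       [10, 30]]
print(round(cramers_v(tab), 3))  # 0.5
```

V lies between 0 (independence) and 1 (perfect association) for any table size, which is what makes the coefficients in Table 3 comparable across variable pairs with different numbers of categories.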

The strength of association between the nominal variables was measured by the Cramer coefficient and can be seen in Table 3. Variable V17 (purpose of credit) had significant correlations with all other nominal variables. The highest correlation among the nominal predictors exists between V16 (property owned) and V17 (purpose of credit), at 0.218.

Logistic Regression With Variable Selection

Logistic regression models using the forward, backward, and stepwise variable selection methods were built. Forward logistic regression gave the same result as stepwise logistic regression. Among the three selection methods, backward was the one with the highest c statistic. Using the Hosmer-Lemeshow goodness-of-fit test as proposed in Hosmer & Lemeshow (2000), the backward logistic regression model was considered fit, with a p-value of 0.724.

Table 4 Comparison of backward, forward, and stepwise logistic regression (Hosmer and Lemeshow goodness-of-fit chi-square and p-value, and c statistic, for each method)

Using backward logistic regression, eight significant predictors were selected from the 17 predictor variables: credit duration (V1), credit amount (V2), installment rate (V4), checking account status (V7), credit history (V8), balance in savings account (V9), marital status (V14), and purpose of credit (V17). The parameter estimates of the backward logistic regression are shown in Table 5. Variables V1 and V2, which had a correlation coefficient of 0.628, both entered the model, showing that a correlation of 0.628 was not high enough. All the parameter estimates were consistent with the data exploration. For example, for V8, it was explained above that debtors with no credit history (the first dummy variable, V8_1) had the lowest proportion of good debtors; we can see from Table 5 that the parameter estimate of V8_1 was the lowest compared to V8_2, V8_3, and V8_4.
The value of the 3rd dummy (representing debtors who paid back the existing credit duly) was higher than that of the 4th dummy (representing debtors who delayed paying off in the past), which was also consistent with the data exploration.

Table 5 Parameter estimates of the backward logistic regression (parameter, estimate, SE, Wald statistic, and p-value for the intercept and each dummy variable; * marks estimates significant at the 0.05 level)

With the model of eight significant predictors, the c statistic of the backward logistic regression was 0.817, indicating that the model was an excellent classifier. From the intersection of sensitivity and specificity over all possible cutpoints, the optimal cutpoint of the backward logistic regression was 0.68. The classification table can be seen in Table 6.
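The c statistic and the sensitivity/specificity-intersection cutpoint reported above can both be computed directly from predicted probabilities. A minimal NumPy sketch follows, using the concordant/discordant-pair formula from the background section; the toy labels and scores are illustrative only.

```python
import numpy as np

def c_statistic(y, p):
    """c = (n_c + 0.5 * (t - n_c - n_d)) / t over all (good, bad) pairs."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    good, bad = p[y == 1], p[y == 0]
    diff = good[:, None] - bad[None, :]   # every good score vs every bad score
    n_c = np.sum(diff > 0)                # concordant pairs
    n_d = np.sum(diff < 0)                # discordant pairs
    t = good.size * bad.size              # total pairs
    return (n_c + 0.5 * (t - n_c - n_d)) / t

def optimal_cutpoint(y, p):
    """Cutpoint where sensitivity and specificity are closest
    (the intersection of the two curves)."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    best, best_gap = 0.5, np.inf
    for cut in np.unique(p):
        pred = p >= cut
        sens = np.mean(pred[y == 1])      # true positive rate
        spec = np.mean(~pred[y == 0])     # true negative rate
        if abs(sens - spec) < best_gap:
            best, best_gap = cut, abs(sens - spec)
    return best

# perfectly separated toy scores: every good/bad pair is concordant
print(c_statistic([1, 1, 0, 0], [0.9, 0.8, 0.4, 0.3]))  # 1.0
```

On this scale, the backward model's c of 0.817 falls in the "excellent" band of the Hosmer & Lemeshow rule quoted earlier.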

Table 6 Classification table of the backward logistic regression using a cutpoint of 0.68

                 Predicted bad   Predicted good   Total
  Actual bad     168 (74.34%)     58 (25.66%)      226
  Actual good    131 (25.49%)    383 (74.51%)      514
  Total correct: 551 (74.46%)

Of the 740 total cases, 551 cases (74.46%) were correctly predicted. There were 58 bad debtors (25.66%) predicted as good debtors out of the 226 bad debtors. Of the 514 good debtors, 131 cases (25.49%) were predicted as bad by the chosen model. With the true positive rate and the true negative rate both at about 74%, this model was good enough at classification, in line with the high c statistic mentioned above.

Logistic Ridge Regression

All 17 variables were included in the logistic ridge regression; these variables were chosen in view of the Six Basic Cs of Lending. The ridge parameter λ was obtained from a calculation involving the standardized parameter estimates from ordinary logistic regression and the standardized predictor variables.

Table 7 Classification table of the logistic ridge regression using a cutpoint of 0.677

                 Predicted bad   Predicted good   Total
  Actual bad     171 (75.66%)     55 (24.33%)      226
  Actual good    126 (24.51%)    388 (75.49%)      514
  Total correct: 559 (75.54%)

The c statistic of this model was 0.832, so the model was categorized as an excellent classifier. Using the optimal cutpoint of 0.677, the classification rates can be seen in Table 7. The total number of correctly predicted cases was 559. There were 55 cases (24.33%) of false positives out of the 226 bad cases. Of the 514 good debtors, 126 (24.51%) were predicted as bad debtors. Although the differences were not large, there were some differences between the parameter estimates obtained from the logistic ridge regression and the backward logistic regression. The signs of the parameter estimates of the two models were all the same. But for variable V8, the parameter estimates of this model gave an inconsistent result compared to the data exploration: the 1st dummy (V8_1) should be lower than the 2nd dummy (V8_2).
Table 8 Parameter estimates of the logistic ridge regression (estimate for the intercept and for each variable and dummy variable)

Comparison of Backward Logistic Regression and Logistic Ridge Regression

At the optimal cutpoint of each model, the percentage of total correctly predicted cases of the logistic ridge regression was better, although only six cases higher than the backward model: 559 cases were correctly predicted by the logistic ridge regression and 553 by the backward logistic regression. Figure 6 shows that the misclassification rates (false positive and false negative) of the logistic ridge regression were lower than those of the backward logistic regression, and its correct classification rates (true positive and true negative) were higher. Although the logistic ridge regression was better (lower MCR and higher CCR), the values were not very different.
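Logistic ridge regression maximizes the log-likelihood minus an L2 penalty λ·Σβ², which shrinks the coefficients toward zero. A minimal Newton-iteration sketch in NumPy follows; the simulated data and the λ values are illustrative, not the paper's.

```python
import numpy as np

def fit_logistic_ridge(X, y, lam=1.0, n_iter=50):
    """Newton/IRLS for logistic regression with an L2 (ridge) penalty
    on the slopes; the intercept is left unpenalized."""
    n, p = X.shape
    Xb = np.c_[np.ones(n), X]                 # add intercept column
    beta = np.zeros(p + 1)
    pen = lam * np.eye(p + 1)
    pen[0, 0] = 0.0                           # do not shrink the intercept
    for _ in range(n_iter):
        eta = Xb @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))       # P(Y=1)
        W = mu * (1.0 - mu)                   # IRLS weights
        grad = Xb.T @ (y - mu) - pen @ beta
        hess = (Xb * W[:, None]).T @ Xb + pen
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_eta = 0.5 + X @ np.array([1.0, -1.0, 0.5])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_eta))).astype(float)
b_ml = fit_logistic_ridge(X, y, lam=0.0)      # ordinary maximum likelihood
b_rdg = fit_logistic_ridge(X, y, lam=50.0)    # heavily penalized
print(np.all(np.abs(b_rdg[1:]) < np.abs(b_ml[1:])))  # ridge shrinks the slopes
```

Odds ratios for either fit are obtained as np.exp(beta); the shrinkage seen here is also why ridge estimates generally differ slightly from unpenalized ones, as observed between Table 8 and Table 5.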

Figure 6 Classification rates (total, FP, FN, TP, TN) of the backward and the logistic ridge regression, each at its optimal cutpoint

Figure 7 shows that on the validation data set the total number of correctly predicted cases differed only a little between the two models: 188 of the 260 cases using backward logistic regression and 185 using logistic ridge regression. On the validation data, the backward logistic regression method therefore resulted in a lower MCR and a higher CCR than the logistic ridge regression.

Figure 7 Validation classification rates of the backward and the logistic ridge regression, each at its optimal cutpoint

Because the classification rates depend on only one threshold (cutpoint), another measure of model assessment was also evaluated: the c statistic, which measures the area under the ROC curve and summarizes how well the model performs.

Table 9 Comparison of the c statistics of the backward and the logistic ridge regression on the modeling and validation data sets

The c statistics of the logistic ridge regression were higher than those of the backward logistic regression on both the training data and the validation data. However, the category of discrimination ability of the two models was the same: with the training data, backward and ridge were both excellent; with the testing data, both models were acceptable at discriminating the good/bad debtors.

Figure 8 Observed collectability status versus P(Y=1) using backward logistic regression

Figure 8 shows the distribution of the probability of Y=1 obtained by the backward logistic regression for each collectability status. The upper panel of the figure is the histogram of the probability of Y=1 for the bad debtors and the lower panel is for the good debtors, as denoted by the vertical axis. The good debtors (lower panel) tended to have a high probability of Y=1; only a few had a low probability of Y=1. The bad debtors (upper panel) were distributed over the whole range of probability, which shows the poor capability of the model in differentiating the bad debtors from the good ones.

Figure 9 Observed collectability status versus P(Y=1) using logistic ridge regression

Figure 9 shows the distribution of the probability of Y=1 from the logistic ridge regression. It is similar to the backward logistic regression: the bad debtors are distributed over the whole range of the probability of Y=1, which again represents the poor capability of the model in identifying the bad debtors.

The odds ratio is the interpretation of a model coefficient in logistic regression. The odds ratios of V8 (credit history of the debtor) can be seen in Table 10.

Table 10 Odds ratios of V8 (credit history), for each dummy category versus the reference category (critical account), for the backward and the ridge model

Using backward logistic regression, the odds of being a good debtor for debtors who had taken no credit were 0.21 times those of debtors with a critical account. For debtors who paid back duly all the credits at the bank, the odds were 0.229 times those of debtors with a critical account. For debtors with the existing credits paid back duly, and for those who delayed paying off in the past, the odds of being a good debtor were about one half of those of debtors with a critical account. The last column of Table 10 contains the odds ratios of V8 from the logistic ridge regression. The odds ratios of the logistic ridge regression were similar to those of the backward logistic regression. But, as with the parameter estimates, the 1st and 2nd dummy variables of the logistic ridge regression showed an inconsistent result compared to the data exploration: the 1st dummy should be lower than the 2nd dummy. Except for V8, all parameter estimates of the logistic ridge regression were consistent with the data exploration. The odds ratios for the other variables can be seen in Appendix 3.

Comparison of Logistic Regression with Variable Selection and Ridge with Generated V2*

From the comparison of the c statistics, logistic regression with variable selection and logistic ridge regression gave similar results. To examine the performance of these two methods over a range of correlation coefficients, a variable V2* was generated to replace V2, with correlation coefficients from 0.60 to 0.95 in increments of 0.05. The c statistics and the total numbers of correctly predicted cases were then compared, with the correctly predicted totals obtained at each model's optimal cutpoint.
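A variable with a prescribed correlation to V1 can be generated by mixing the standardized V1 with independent noise. A minimal NumPy sketch follows; the normal stand-in for V1 and the seed are illustrative, not the paper's actual generation procedure.

```python
import numpy as np

def generate_correlated(v1, rho, rng):
    """Return a v2_star whose population correlation with v1 is rho:
    rho * z1 + sqrt(1 - rho^2) * noise, with z1 the standardized v1."""
    v1 = np.asarray(v1, dtype=float)
    z1 = (v1 - v1.mean()) / v1.std()
    noise = rng.normal(size=v1.size)
    noise = (noise - noise.mean()) / noise.std()
    return rho * z1 + np.sqrt(1.0 - rho ** 2) * noise

rng = np.random.default_rng(42)
v1 = rng.normal(size=1000)               # stand-in for duration of credit
for rho in np.arange(0.60, 1.0, 0.05):   # 0.60, 0.65, ..., 0.95
    v2_star = generate_correlated(v1, rho, rng)
    # the sample correlation fluctuates around the target rho
    print(round(float(rho), 2), round(np.corrcoef(v1, v2_star)[0, 1], 3))
```

The population correlation of rho·z1 + sqrt(1 - rho²)·noise with z1 is exactly rho; the sample correlation fluctuates around it by roughly (1 - rho²)/√n, so with 1000 observations the realized values stay close to the targets.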
Figure 10 Comparison of the c statistic between logistic regression with variable selection and logistic ridge regression on the data sets with generated V2*

Figure 11 Comparison of the total correctly predicted cases between logistic regression with variable selection and logistic ridge regression on the data sets with generated V2*

Figure 10 and Figure 11 show that the c statistic and the total correctly predicted cases of the logistic ridge regression were always higher than those of logistic regression with variable selection, at every correlation coefficient between V1 and V2*. There was no clear pattern in the difference in c statistic and total correctly predicted cases between these models as the correlation coefficient between V1 and V2* increased. This may be because only two predictor variables had a high correlation coefficient. The values of the c statistic and the total correctly predicted cases, both in modeling and in validation, can be seen in Appendix 4. Table 11 shows which parameters entered each logistic regression model with variable selection. It can be seen in the table that V11 only entered the model with a correlation coefficient of 0.75. This may be why the c statistic and the total correctly predicted cases of logistic regression with variable selection at a correlation coefficient of

0.75 were higher than the others, as shown in Figure 10 and Figure 11.

Table 11 Parameters entering the logistic regression model with variable selection at each correlation coefficient between V1 and V2*

CONCLUSION

Comparing the total correctly predicted cases on the training data set, logistic ridge regression was better than logistic regression with variable selection (in this case, backward elimination); but on the validation data set, the backward logistic regression was better. The optimal cutpoint of the backward model was 0.68, while for ridge the optimal cutpoint was 0.677. The comparison of the c statistics and the total correctly predicted cases on the German Credit data set shows that logistic ridge regression has a slightly higher capability to predict a new applicant's collectability status than logistic regression with variable selection. But both models had low capability in identifying the bad debtors.

RECOMMENDATION

Data with high multicollinearity among the predictor variables are needed to examine the performance of logistic ridge regression further. Other ways of defining the optimal cutpoint are also worth trying; banks usually consider that a false positive results in greater losses for the bank than a false negative. Before entering a variable as a predictor in logistic ridge regression, it is important to know whether the variable affects the response, based on relevant theory.

REFERENCES

Agresti A. 2007. An Introduction to Categorical Data Analysis. Second Edition. New Jersey: J Wiley.

le Cessie S, van Houwelingen JC. 1992. Ridge estimators in logistic regression. Appl Statist 41: 191-201.

Daniel WW. 1990. Applied Nonparametric Statistics. Second Edition. Atlanta: PWS-KENT.

Gonen M. 2006. Receiver Operating Characteristics (ROC) Curves. In: Statistics and Data Analysis. Proceedings of the 31st SAS Users Group International (SUGI 31) Conference; San Francisco, Mar 2006. New York: SAS Institute.

Hosmer DW, Lemeshow S. 2000. Applied Logistic Regression.
Second Edition. New York: J Wiley.

Kemeny S, Vago E. 2006. Logistic ridge regression for clinical data analysis (a case study). Appl Ecol Environ Res 4.

Perlich C, Provost F, Simonoff JS. 2003. Tree induction vs. logistic regression: a learning-curve analysis. J Mach Learn Res 4: 211-255.

Purnomo H. 2010a. BI: Kredit Perbankan Tumbuh 10% di Januari 2010. /11/1325/ /5/bi-kreditperbankan-tumbuh-1-di-januari-21 [3 Jun 2010].

Purnomo H. 2010b. Kredit Tumbuh 2.3% Hingga Agustus. 9/3/145545/ /5/kredit-tumbuh-23-hingga-agustus [15 Sep 2010].

Reed D. 2008. Mortgages 101. New York: AMACOM.

Rose PS. Commercial Bank Management. Fourth Edition. USA: McGraw-Hill.