CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION RAKHMAWATI


CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION

RAKHMAWATI

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2011

ABSTRACT

RAKHMAWATI. Credit Risk Modeling Using Logistic Ridge Regression. Supervised by AAM ALAMUDI and DIAN KUSUMANINGRUM.

The growth of credit in national banking may expose banks to greater risk. A key question is how to determine whether a new applicant will be reliable in loan repayments. Logistic regression is a well-known and widely used method for classifying new credit applicants. Multicollinearity is a problem frequently encountered in model building. It is usually handled with a variable selection method, but this can create a new problem when an important variable does not enter the model. Logistic ridge regression is an alternative to logistic regression when multicollinearity exists. Its advantage is that it can handle multicollinearity without deleting any predictor variables. This research compared the performance of logistic ridge regression and logistic regression with variable selection in predicting the collectability status of new credit applicants. The German Credit data set contains 1000 observations, of which 740 were used for modeling and 260 for validation. Among the variable selection methods, backward elimination was the best: it had the highest c statistic, and the model fit according to the Hosmer and Lemeshow Goodness-of-Fit Test. With backward logistic regression, eight of the 17 variables were significant in the Wald test. There were many significant correlations among the predictors, but the highest correlation coefficient was 0.628, between duration of credit (V1) and credit amount (V2). The ridge parameter λ was 0.1. The optimal cut point was 0.68 for backward logistic regression and 0.677 for logistic ridge regression.

Comparing the c statistic and the total correctly predicted cases shows that logistic ridge regression was better than backward logistic regression on the training data. However, on the testing data (validation), backward logistic regression was better. To better understand the models under higher correlation between V1 and V2, a variable V2* was generated to replace V2, and logistic regression with variable selection and logistic ridge regression were built again. The results indicated that logistic ridge regression has a slightly higher capability to predict a new applicant's collectability status than logistic regression with variable selection.

Key words: credit risk modeling, logistic ridge regression, multicollinearity

CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION

RAKHMAWATI
G

Thesis as a requirement for Bachelor Degree in Statistics

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2011

Title: Credit Risk Modeling Using Logistic Ridge Regression
Author: Rakhmawati
NIM: G

Approved by:
Advisor I: Ir. Aam Alamudi, M.Si, NIP
Advisor II: Dian Kusumaningrum, S.Si, M.Si

Acknowledged by:
Head of Department of Statistics: Dr. Ir. Hari Wijayanto, M.Si, NIP

Graduation date:

BIOGRAPHY

Rakhmawati was born in Salatiga on June 16th, 1988, as the daughter of Paijo Nurhadi Santoso and Siti Naimah. She has an older brother named Nur Rahman Istianto and a younger brother named Aulia Kharis Kurniawan. After graduating from SMA Negeri 1 Salatiga in 2006, she continued her studies at Bogor Agricultural University through USMI. She took Statistics as her major, chose Information Systems as her minor, and also took some supporting courses from the Department of Mathematics. She was a staff member of the Database and Computational Department of Gamma Sigma Beta, a statistics student organization at Bogor Agricultural University. In her 8th semester, she had the chance to take an internship program at PT Ganesha Cipta Informatika. There, she and her partner wrote a SAS program for risk management that calculates the Value at Risk of market risk.

ACKNOWLEDGEMENTS

Alhamdulillah, thanks to Allah SWT, Who gives me love, opportunity, health, and capability in finalizing my research, which is entitled Credit Risk Modeling Using Logistic Ridge Regression. I recognize that my research could not have been completed without help from other people. I want to thank Mr. Aam Alamudi and Mrs. Dian Kusumaningrum, my advisors, for their criticism, ideas, and patience. Thanks to Miss Indah Permatasari, my internship advisor, whose suggestions led me to the topic of my research. To Defri Ramadhan Ismana and Yulia Triwijiwati, thank you for the discussions. Special thanks to my beloved family for their love and support. Last, I hope this thesis will be beneficial.

Bogor, February 2011
Rakhmawati

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
INTRODUCTION
  Background
  Objective
LITERATURE REVIEW
  Credit Risk
  The Cramer Statistic
  Logistic Regression
  Logistic Ridge Regression
  Optimal Cut Point
  Model Evaluation
METHODOLOGY
  Data Source
  Method
RESULT AND DISCUSSION
  Data Exploration
  Logistic Regression with Variable Selection
  Logistic Ridge Regression
  Comparison of Backward and Logistic Ridge Regression
  Comparison of Logistic Regression with Variable Selection and Ridge with Generated V2*
CONCLUSION
RECOMMENDATION
REFERENCE
APPENDIX

LIST OF TABLES

Table 1 Pearson correlation coefficient of numeric variables
Table 2 Spearman correlation coefficient of ordinal variables
Table 3 Cramer coefficient of nominal variables
Table 4 Comparison of backward, forward, and stepwise logistic regression
Table 5 Parameter estimate by using backward logistic regression
Table 6 Classification table of backward logistic regression by using a cut point of 0.68
Table 7 Classification table of logistic ridge regression by using a cut point of 0.677
Table 8 Parameter estimate by using logistic ridge regression
Table 9 Comparison of c statistic between backward and logistic ridge regression
Table 10 Odds ratio estimate of V8 (credit history)
Table 11 Parameter existence on the logistic regression model with variable selection

LIST OF FIGURES

Figure 1 Plot of sensitivity and specificity versus all possible cut points
Figure 2 Classification table
Figure 3 ROC curve
Figure 4 Plot of percentage of good debtors in each group of credit amount (V2)
Figure 5 Proportion of good debtors in credit history (V8)
Figure 6 Classification rate of backward and logistic ridge regression on each optimal cut point
Figure 7 Validation's classification rate of backward and logistic ridge regression on each optimal cut point
Figure 8 Observed collectability status on P(Y=1) by using backward logistic regression
Figure 9 Observed collectability status on P(Y=1) by using logistic ridge regression
Figure 10 Comparison of C statistic between logistic regression with variable selection and logistic ridge regression of data set with generated V2*
Figure 11 Comparison of total correctly predicted cases between logistic regression with variable selection and logistic ridge regression with generated V2*

LIST OF APPENDICES

Appendix 1 Description of variables used in analysis
Appendix 2 Proportion of good debtor on each variable
Appendix 3 Odds ratio of backward logistic regression and logistic ridge regression
Appendix 4 Comparison of C statistic and the correctly predicted cases between logistic regression with variable selection and logistic ridge regression of data set with generated V2*

INTRODUCTION

Background

Credit risk is one of the eight risks that banks must consider. It is important to build a credit risk system that is measurable, documented, and open to further development. Logistic regression, discriminant analysis, and artificial neural networks are some of the methods used in credit risk modeling. They are useful for predicting whether a new applicant will become a good or bad debtor if he or she receives a loan.

Multicollinearity is a common problem in credit risk modeling. The usual solution is a variable selection method (forward, backward, or stepwise). However, this solution may lose information about the response variable if the deleted predictor variable is an important one. Ridge regression is another statistical procedure for dealing with the problem of multicollinearity (Ravinshanker & Dey 2001). With logistic ridge regression, multicollinearity is expected to be handled without deleting any variables, so no information from the collected data is lost.

Bank of Indonesia noted that the growth of credit of national banks in January 2010 was 1%. By the end of August 2010, the credit of the banking industry had grown and reached 2.3% (Purnomo 2010a, 2010b). This may lead to greater risk than banks have faced before. Hence, it is important to build a more accurate credit scoring model to decide whether a new applicant is credible enough to get a loan.

Objectives

The objectives of this research are:
1. To build a credit risk model using logistic regression with variable selection and logistic ridge regression.
2. To determine the optimal probability cutpoint.
3. To compare the classification rate and the c statistic of logistic regression with variable selection and logistic ridge regression.

LITERATURE REVIEW

Credit Risk Model

Banks lend to individuals by first asking them to fill out a loan application.
The customer is asked to submit several documents that the bank needs in order to evaluate the loan request. Six aspects of the loan application determine whether a new applicant is creditworthy. These Six Basic Cs of Lending are character, capacity, cash, collateral, condition, and control. Character is information about the applicant's personality. Capacity is the capacity to borrow money. Cash relates to the borrower's income and savings account balance. Collateral is the adequacy of the borrower's assets to support the loan; the age and degree of specialization of the borrower's assets are examples of collateral. Condition is the business outlook associated with economic conditions. A correctly prepared loan document is an example of control.

The basic theory of credit scoring is that the bank can identify the financial, economic, and motivational factors that separate good debtors from bad ones by observing a large group of people who have borrowed in the past. Credit scoring systems are usually based on discriminant models or related techniques such as logit or probit models or neural networks. If the applicant's score exceeds a critical cutpoint level, he or she is more likely to be approved for credit. Among the most important variables used in evaluating consumer loans are age, marital status, number of dependents, home ownership, telephone ownership, type of occupation, and length of employment in the current job.

The Cramer Statistic

The chi-square test of independence is used to conclude whether there is an association between two categorical variables. When the numbers of rows and columns of the contingency table are unequal, the Cramer coefficient measures the strength of this association. Its value is between 0 and 1. The Cramer coefficient is defined as

$$V = \sqrt{\frac{X^2}{n(t-1)}}$$

where $X^2$ is the chi-square statistic, $n$ is the total sample size, and $t$ is either the number of rows or the number of columns in the contingency table, whichever is smaller.

Logistic Regression

Let the conditional probability that the outcome is present be denoted by $P(Y=1 \mid x) = \pi(x)$. The logit of the multiple logistic regression model is given by

$$g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where

$$g(x) = \log\left[\frac{\pi(x)}{1-\pi(x)}\right]$$

in which case the logistic regression model is

$$\pi(x) = \frac{e^{g(x)}}{1+e^{g(x)}}$$

When an independent variable is categorical, dummy (design) variables are needed. In general, if a categorical variable (nominal or ordinal scale) has k values, then k-1 design variables are needed. Thus, the logit for a model with p variables and the j-th variable being categorical would be

$$g(x) = \beta_0 + \beta_1 x_1 + \dots + \sum_{l=1}^{k-1} \beta_{jl} D_{jl} + \dots + \beta_p x_p$$

where $D_{jl}$ are the k-1 design variables of the j-th variable and $\beta_{jl}$ are their coefficients. Maximum likelihood estimators for the logit model are obtained by maximizing, with respect to β, the log-likelihood function

$$l(\beta) = \sum_{i=1}^{n} \left\{ y_i \log \pi(x_i) + (1-y_i) \log\left[1-\pi(x_i)\right] \right\}$$

After fitting the model, the process of model assessment begins. The significance of the covariates can be assessed with the G test statistic and the Wald test. The G test statistic is a likelihood ratio test that measures the significance of the parameters in the overall model. The hypotheses of the G test are:

$H_0$: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
$H_1$: at least one $\beta_i \neq 0$, i = 1, 2, ..., p

The G test statistic is formulated as

$$G = -2 \log\left(\frac{L_0}{L_p}\right)$$

where $L_0$ is the likelihood without covariates and $L_p$ is the likelihood with p covariates. Under the null hypothesis, G follows a chi-square distribution with p degrees of freedom. If the null hypothesis is rejected, so that at least one and perhaps all p coefficients differ from zero, the Wald test can be used to assess the significance of each covariate:

$H_0$: $\beta_i = 0$
$H_1$: $\beta_i \neq 0$, where i = 1, 2, ..., p

$$W = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$

Under the null hypothesis, the W statistic follows a standard normal distribution (Hosmer & Lemeshow 2000).

Coefficients in logistic regression are interpreted through the odds ratio, which indicates how much more likely, with respect to odds, a certain event is to occur in one group relative to its occurrence in another group. The odds ratio is defined as $\exp(\beta)$.
For a numeric variable, the odds ratio indicates that for every one-unit increase in the predictor, the odds of the outcome are multiplied by $\exp(\beta)$.

Multicollinearity can cause unstable estimates and inaccurate variances, which affects hypothesis tests (Hoerl & Kennard 1980, in Shen & Gao 2002). In regression, there are several approaches to handling multicollinearity, namely variable selection methods (forward, backward, and stepwise) and ridge regression. Forward selection adds terms sequentially until further additions do not improve the model. Backward elimination begins with a complex model and sequentially removes terms. The stepwise procedure starts by choosing the equation containing the most important variable and then attempts to build up the model by adding one variable at a time, as long as these additions are worthwhile.

Logistic Ridge Regression

Unstable parameter estimates occur when the number of covariates is relatively large or when the covariates are correlated. An alternative procedure for obtaining more stable estimates is to specify a restriction on the parameters. Consider the maximization of the log-likelihood function with a penalty on the norm of β:

$$l^{\lambda}(\beta) = l(\beta) - \lambda \lVert \beta \rVert^2$$

where $\lVert \beta \rVert = \left(\sum_j \beta_j^2\right)^{1/2}$ is the norm of the parameter vector β. The ridge parameter λ controls the amount of shrinkage of the norm of β. When λ = 0 the solution is the ordinary MLE. For a good choice of λ, the estimate is expected to be on average closer to the real value of β than the ordinary MLE, i.e. $MSE(\hat{\beta}^{\lambda}) < MSE(\hat{\beta})$ (Cessie & Houwelingen 1990). The parameter estimate of logistic ridge regression is calculated in the following way:

1. Fit the logistic regression model using maximum likelihood, leading to the estimate $\hat{\beta}$. Construct standardized coefficients by defining $\hat{\beta}_j^{s} = s_j \hat{\beta}_j$, j = 1, 2, ..., p, where $s_j$ is the standard deviation of the j-th predictor in the training data.
2. Construct the Pearson statistic

$$X^2 = \sum_{k=1}^{g} \frac{(y_k - m_k \hat{\pi}_k)^2}{m_k \hat{\pi}_k (1 - \hat{\pi}_k)}$$

where g is the number of covariate patterns, $m_k$ is the number of subjects with $x = x_k$, $y_k$ is the number of positive responses (y=1) among the $m_k$ subjects, and $\hat{\pi}_k$ is the estimated probability that the outcome is present at $x = x_k$. This is a measure of the difference between the observed and the fitted values.
3. Define the ridge parameter λ [formula not preserved in the source].
4. Let Z (N × p) be the matrix of centered and scaled predictors, and let V be the accompanying N × N matrix [definitions not preserved in the source]. Let $\hat{\beta}^{*}$ equal $\hat{\beta}^{s}$ with the intercept omitted. The ridge regression estimate is then obtained from these quantities [formula not preserved in the source].

Optimal Cutpoint

The optimal cutpoint for the purpose of classification can be obtained from the plot of sensitivity and specificity versus all possible cutpoints (Hosmer & Lemeshow 2000). The plot can be seen in Figure 1. The optimal cutpoint is not the only criterion for deciding whether a new applicant should get a loan. Even when the correct classification rate at the optimal cutpoint is high, the number of false positives should be considered, because the loss caused by this error is extremely large relative to that of a false negative. Each bank has its own criteria for making a decision. These errors are explained in the next section. In this research, the cutpoint is taken only from the plot of sensitivity and specificity versus all possible cutpoints.

Figure 1 Plot of sensitivity and specificity versus all possible cutpoints

Model Evaluation

In model assessment, a classification table is most appropriate when classification is a stated goal of the analysis. Figure 2 is the classification table. It is a two-way frequency table between the actual data and the prediction. The correct classification rate (CCR) consists of the percentages of true positives and true negatives, while the misclassification rate (MCR) consists of the percentages of false positives and false negatives.

            Predicted 0            Predicted 1
Actual 0    True Negative (TN)     False Positive (FP)
Actual 1    False Negative (FN)    True Positive (TP)

Figure 2 Classification table

True positive (TP) is the number of observations in category 1 that were correctly predicted; sensitivity is TP as a proportion of all observations in category 1. True negative (TN) is the number of observations in category 0 that were correctly predicted; specificity is TN as a proportion of all observations in category 0. False positive (FP) is the number of observations in category 0 that were predicted as category 1. False negative (FN) is the number of observations in category 1 that were predicted as category 0.

Figure 3 ROC curve
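The classification quantities and the cutpoint rule described above can be illustrated with a short sketch. It assumes that an observation is predicted as category 1 when its estimated probability is at least the cutpoint, and that the optimal cutpoint is taken as the grid value where sensitivity and specificity are closest. The function names, the grid, and the toy data are hypothetical and do not come from the thesis.

```python
import numpy as np

def classification_rates(y_true, p_hat, cutpoint):
    """Return (sensitivity, specificity, correct classification rate)."""
    y_true = np.asarray(y_true)
    # Assumption: predict category 1 when the probability reaches the cutpoint
    y_pred = (np.asarray(p_hat) >= cutpoint).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ccr = (tp + tn) / y_true.size
    return sensitivity, specificity, ccr

def optimal_cutpoint(y_true, p_hat, grid=None):
    """Grid cutpoint where sensitivity and specificity are closest."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    gaps = []
    for c in grid:
        sens, spec, _ = classification_rates(y_true, p_hat, c)
        gaps.append(abs(sens - spec))
    return float(grid[int(np.argmin(gaps))])

# Toy data for illustration only (not taken from the German Credit data)
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
p = np.array([0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9, 0.95])
cut = optimal_cutpoint(y, p)
sens, spec, ccr = classification_rates(y, p, cut)
print(cut, sens, spec, ccr)
```

On real data the two curves rarely cross exactly at a grid value, so the sketch picks the grid point with the smallest gap rather than an exact intersection.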

Figure 3 shows the ROC curve. It plots the probability of a false positive (1 - specificity) against the probability of a true positive (sensitivity). The area under the ROC curve (AUR), which ranges from 0 to 1, measures the model's ability to discriminate between subjects who experience the outcome of interest and those who do not. The measure of AUR is the c statistic:

$$c = \frac{n_c + 0.5\,(t - n_c - n_d)}{t}$$

where
$n_c$: the number of concordant pairs
$n_d$: the number of discordant pairs
$t$: the total number of pairs

As a general rule:
C = 0.5: no discrimination
0.7 ≤ C < 0.8: acceptable discrimination
0.8 ≤ C < 0.9: excellent discrimination
C ≥ 0.9: outstanding discrimination
(Hosmer & Lemeshow 2000).

METHODOLOGY

Data Source

The data used in this research were the German Credit data set, which was available at [URL missing]. It contains observations on 1000 past credit applicants. Each applicant was rated as good (700 cases) or bad (300 cases). After considering the Six Basic Cs of Lending, 17 variables were used in this research: 3 numeric variables, 6 ordinal variables, 7 nominal variables, and 1 binary variable. A description of the variables can be seen in Appendix 1.

Method

The procedures used in this research were:
1. Divide the data into training data (740) for modeling and testing data (260) for validation. Each data set has the same proportion of good and bad debtors as the full data set, which comprises 70% good debtors and 30% bad debtors.
2. Explore the data.
3. Model the data using stepwise, forward, and backward logistic regression. The probability modeled was Y=1 (the debtor had a good collectability status). Then choose one of the three models by considering the fit of the model and the highest c statistic.
4. Model the data using logistic ridge regression.
5. Determine the optimal cutpoint from the intersection of sensitivity and specificity.
6. Validate the models with the testing data.
7. Compare the classification rate and the c statistic between logistic ridge regression and logistic regression with variable selection.
8. Generate V2* with a specified correlation with V1. Then repeat steps 3 to 7 with the new data (replacing V2 with V2*) to see the performance of logistic regression with variable selection and logistic ridge regression as the correlation between V1 and V2* increases.

RESULT AND DISCUSSION

Data Exploration

There were no outliers or missing values in the full data set, so all 1000 observations were included in the analysis. The allocation of the data into modeling and validation sets was based on the proportion of bad and good cases in the overall data set. Each set had 70% good and 30% bad cases, matching the full data set.

The proportion of good debtors decreased as V1 (duration of credit) and V2 (credit amount) increased. Figure 4 shows that as the amount of credit increased, the proportion of debtors with good collectability status decreased. Debtors with a high installment rate (V4) tended to be bad debtors. The difference in the proportion of good debtors between occupation categories was not significant. The group of debtors who were unemployed/unskilled-nonresident had the highest proportion of good debtors compared to the unskilled-resident, official, and officer groups.

Figure 4 Plot of percentage of good debtors in each group of credit amount (V2) [axes: credit amount; proportion of good debtors]

It can be seen in Appendix 2 that, based on age (V3), the proportion of good debtors had a positive trend for debtors aged 20 to 50 years. As age increased, the proportion of good debtors increased until the age of 50 years. The group of debtors aged 66 to 75 years had the lowest proportion of good debtors. Debtors with two dependents had a higher proportion of good debtors than those with one dependent (V6). As the status of the checking account (V7) increased, debtors tended to be good debtors. Home ownership status (V12) also had a positive trend: the proportion of good debtors increased as the home ownership status changed from free to rent to own.

There was no pattern in the proportion of good debtors as the length of working experience in the current job (V10) and the time living in the present residence (V11) increased. The figure can be seen in Appendix 2. Debtors who had been working four to seven years in their current job had the highest proportion of good debtors. Debtors who had been working more than seven years in their current job tended to be good debtors compared to those with less than four years of working experience in their current job. Unemployed debtors had a higher proportion of good debtors than those with less than a year of working experience in their current job. For time living in the present residence, debtors with at most one year and debtors with two to three years were more likely to be good debtors than the others.

Figure 5 Proportion of good debtors in credit history (V8) [categories: no credit taken; all credits paid back duly; existing credit paid back duly; delay in paying off in the past; critical account]

Figure 5 shows that credit history (V8) had a positive trend. Debtors who had never taken credit before had the lowest proportion of good debtors. Those with a high average balance in their savings account (V9) tended to be good debtors.
The difference between marital statuses (V14) was not significant, although single males and married males had a higher proportion of good debtors than females and divorced males. Debtors who had a guarantor (V15) were more likely to be good debtors than those who had a co-applicant. Those who had property (V16) were more likely to be good debtors than those with no property. Those whose credit purpose (V17) was used cars, furniture, or radio/television had a higher proportion of good debtors than those with other purposes. The lowest proportion of good debtors was among those with education as the purpose of taking credit. Debtors who had a telephone number under their own name (V19) were more likely to be good debtors than those who did not. The figure of the percentage of good debtors for each variable can be seen in Appendix 2.

The correlations between predictor variables can be seen in Table 1 and Table 2. There were many significant correlations, but only one high correlation coefficient, between V1 and V2. Table 3 shows the Cramer statistic as the measure of association between the nominal variables.

Table 1 Pearson correlation coefficient of numeric variables [values not recoverable]

Table 2 Spearman correlation coefficient of ordinal variables [variables: V6, V10, V11, V12, V13; values not recoverable]

Table 3 Cramer coefficient of nominal variables [variables: V8, V9, V14, V15, V16, V17, V19; values not recoverable]

Among the numeric predictors, the only significant correlation occurred between V1 (duration of credit) and V2 (credit amount), with a correlation coefficient of 0.628, as shown in Table 1. Using the Spearman correlation coefficient shown in Table 2, the largest correlation was 0.327, between V11 (time in present residence) and V12 (housing). Variable V10 (time in current job) was significantly correlated with all other ordinal variables except V12 (home ownership).
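As an illustration of how the Cramer coefficient for a pair of nominal variables can be obtained, the following sketch computes the chi-square statistic of a contingency table and divides it by n(t - 1), where t is the smaller of the number of rows and columns, before taking the square root. The function name and the example tables are hypothetical and are not taken from the thesis data.

```python
import numpy as np

def cramers_v(table):
    """Cramer coefficient of a two-way contingency table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    t = min(table.shape)  # smaller of the number of rows and columns
    return float(np.sqrt(chi2 / (n * (t - 1))))

# Hypothetical 2 x 3 table of counts, for illustration only
example = [[30, 20, 10],
           [10, 20, 30]]
print(cramers_v(example))
```

A value near 0 indicates almost no association and a value near 1 indicates a very strong one, which is why the coefficient is convenient for comparing pairs of nominal predictors with different numbers of categories.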

The strength of association between nominal variables was measured by the Cramer coefficient and can be seen in Table 3. Variable V17 (purpose of credit) was significantly correlated with all other nominal variables. The highest association among the nominal predictors was 0.218, between V16 (property owned) and V17 (purpose of credit).

Logistic Regression with Variable Selection

Logistic regression models using forward, backward, and stepwise variable selection were built. Forward logistic regression gave the same result as stepwise logistic regression. Among the three selection methods, backward had the highest c statistic. Using the Hosmer and Lemeshow Goodness-of-Fit Test as proposed in Hosmer & Lemeshow (2000), the backward logistic regression model was considered to fit, with a p-value of 0.724.

Table 4 Comparison of backward, forward, and stepwise logistic regression [columns: Method; Hosmer and Lemeshow Goodness-of-Fit Test (Chi-Square, P-value); C statistic; rows: Backward, Forward, Stepwise; values not recoverable]

With backward logistic regression, eight significant predictors were selected from the 17 predictor variables: credit duration (V1), credit amount (V2), installment rate (V4), checking account status (V7), credit history (V8), balance in savings account (V9), marital status (V14), and purpose of credit (V17). The parameter estimates of backward logistic regression are shown in Table 5. Variables V1 and V2, which had a correlation coefficient of 0.628, both entered the model. This suggests that a correlation of 0.628 was not high enough to keep either variable out of the model.

All the parameter estimates were consistent with the data exploration. For example, the data exploration showed that debtors with no credit history (the first dummy variable of V8, V8_1) had the lowest proportion of good debtors. Table 5 shows that the parameter estimate of V8_1 was the lowest compared to V8_2, V8_3, and V8_4. The estimate of the 3rd dummy (debtors who paid back their existing credit duly) was higher than that of the 4th dummy (debtors who delayed paying off in the past), which was also consistent with the data exploration.

Table 5 Parameter estimate by using backward logistic regression [columns: Parameter, Estimate, SE, Wald, P-value; values not recoverable]
*significant at the 0.05 level

With the model that had eight significant predictors, the c statistic of backward logistic regression was 0.817, which indicated that the model was an excellent classifier. From the intersection of sensitivity and specificity versus all possible cutpoints, the optimal cutpoint of backward logistic regression was 0.68. The classification table can be seen in Table 6.

Table 6 Classification table of backward logistic regression by using a cutpoint of 0.68

            Predicted 0      Predicted 1
Actual 0    168 (74.34%)     58 (25.66%)
Actual 1    131 (25.49%)     383 (74.51%)
Total correct: 551 (74.46%)

From the 740 total cases, 551 cases (74.46%) were correctly predicted. Of the 226 bad debtors, 58 (25.66%) were predicted as good debtors. Of the 514 good debtors, 131 (25.49%) were predicted as bad debtors by the chosen model. With the true positive rate and the true negative rate at about 74%, this model was good enough for classification, in line with the high c statistic mentioned above.

Logistic Ridge Regression

All 17 variables were included in the logistic ridge regression. These variables were chosen by considering the Six Basic Cs of Lending. The ridge parameter λ, obtained from the calculation involving the standardized parameters of ordinary logistic regression and the standardized predictor variables, was 0.1.

Table 7 Classification table of logistic ridge regression by using a cutpoint of 0.677

            Predicted 0      Predicted 1
Actual 0    171 (75.66%)     55 (24.33%)
Actual 1    126 (24.51%)     388 (75.49%)
Total correct: 559 (75.54%)

The c statistic of this model was 0.832, so the model was categorized as an excellent classifier. Using the optimal cutpoint of 0.677, the classification rates can be seen in Table 7. The total number of correctly predicted cases was 559. There were 55 false positives (24.33%) among the 226 bad cases. Of the 514 good debtors, 126 (24.51%) were predicted as bad debtors.

Although the differences were not large, the parameter estimates from logistic ridge regression differed somewhat from those of backward logistic regression. The signs of the parameter estimates were the same in both models. For variable V8, however, the parameter estimates of the ridge model were inconsistent with the data exploration: the 1st dummy (V8_1) should be lower than the 2nd dummy (V8_2).

Table 8 Parameter estimate by using logistic ridge regression [columns: Parameter, Estimate; values not recoverable]

Comparison of Backward Logistic Regression and Logistic Ridge Regression

At the optimal cutpoint of each model, the percentage of total correctly predicted cases of logistic ridge regression was better, although only eight cases higher than backward. There were 559 cases correctly predicted by logistic ridge regression and 551 by backward logistic regression. Figure 6 shows that the misclassification rates (false positive and false negative) of logistic ridge regression were lower than those of backward logistic regression. The correct classification rates (true positive and true negative) of logistic ridge regression were higher than those of backward logistic regression. Although logistic ridge regression was better (lower MCR and higher CCR), the values did not differ much.
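To make the shrinkage behind the ridge model more concrete, the following is a minimal sketch that maximizes a log-likelihood with a penalty λ times the squared norm of the slope coefficients, using Newton-Raphson iterations. This is not the procedure used in the thesis, which computes the ridge estimate from standardized maximum likelihood coefficients; the function name, the choice to leave the intercept unpenalized, the value of λ, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, n_iter=50, tol=1e-8):
    """Maximize l(beta) - lam * ||slopes||^2 by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])  # add intercept column
    penalty = 2.0 * lam * np.eye(Xd.shape[1])
    penalty[0, 0] = 0.0  # assumption: the intercept is not penalized
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - pi) - penalty @ beta          # penalized score
        weights = pi * (1.0 - pi)
        neg_hess = Xd.T @ (Xd * weights[:, None]) + penalty
        step = np.linalg.solve(neg_hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data with two correlated predictors (illustration only)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1.0 - 0.9 ** 2) * rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x1 - 0.5 * x2)))
y = (rng.random(n) < prob).astype(float)
X = np.column_stack([x1, x2])

b_mle = fit_ridge_logistic(X, y, lam=0.0)    # lam = 0 gives the ordinary MLE
b_ridge = fit_ridge_logistic(X, y, lam=5.0)  # hypothetical ridge parameter
print(b_mle, b_ridge)
```

With λ = 0 the fit reduces to ordinary maximum likelihood, and a positive λ pulls the slope coefficients toward zero, which is the stabilizing effect the ridge approach relies on when predictors are correlated.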

8 Percentage (%) 8 7 6 5 4 3 2 1 74.46 75.5 54 25.66 24.33 25.49 24.51 75.49 75.66 74.51 74..34 validation were higher than the backward logistic regression.

With testing data, these two models were acceptable to discriminate the good/bad debtor.

predicted cases by using backward logistic regression differ a little bit from logistic ridge regression; it was 188 by using backward logistic regression and 185 by using logistic ridge regression

Figure 6 Classification rate of backward and logistic ridge regression on each optimal cutpoint (percentage (%) of Total, FP, FN, TP, and TN for Backward and Ridge)

Figure 7 shows that in the validation data set the total number of correctly predicted cases differed only slightly between the two methods: 188 for backward logistic regression and 185 for logistic ridge regression, out of 260 cases. Backward logistic regression therefore had a lower MCR and a higher CCR than logistic ridge regression.

Figure 7 Validation's classification rate of backward and logistic ridge regression on each optimal cutpoint (percentage (%) of Total, FP, FN, TP, and TN for Backward and Ridge)

Because the classification rates depend on only one threshold (cutpoint), another measure of model assessment was also evaluated: the c statistic, which measures the area under the ROC curve and summarizes how well the model performs.

Table 9 Comparison of c statistic between backward logistic regression and logistic ridge regression (columns: Model, Validation; rows: Backward, Ridge)

The c statistics of logistic ridge regression, with training data and also in validation, were higher than those of backward logistic regression. However, the category of discrimination ability was the same for both models. With training data, backward and ridge were both excellent. With testing data, both models were acceptable in discriminating good debtors from bad debtors.

Figure 8 Observed collectability status on P(Y=1) by using backward logistic regression

Figure 8 shows the distribution of the probability of Y=1 obtained by backward logistic regression for each collectability status. The upper panel of the figure is the histogram of the probability of Y=1 for the bad debtors and the lower panel is for the good debtors, as denoted by the vertical axis. The good debtors (lower panel) tend to have a high probability of Y=1; only a few had a low probability of Y=1. The bad debtors (upper panel) were spread over the whole range of probability, which shows the model's poor ability to differentiate the bad debtors from the good ones.

Figure 9 Observed collectability status on P(Y=1) by using logistic ridge regression

Figure 9 shows the distribution of the probability of Y=1 obtained by logistic ridge regression. As with backward logistic regression, the bad debtors were spread over the whole range of the probability of Y=1, which reflects the model's poor ability to identify the bad debtors.
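The two assessment measures discussed above can be illustrated with a minimal Python sketch. It assumes that CCR is the share of correctly classified cases and MCR its complement, and that the c statistic is the share of (good, bad) debtor pairs in which the good debtor receives the higher predicted probability, with ties counted as one half. The counts 188 and 185 out of 260 come from the validation results; the function names and the small probability lists are hypothetical and serve only as an illustration.

```python
def classification_rates(correct, total):
    """Return (CCR, MCR) from the number of correctly predicted cases."""
    ccr = correct / total
    return ccr, 1.0 - ccr


def c_statistic(p_good, p_bad):
    """Share of (good, bad) pairs ranked correctly; ties count one half."""
    score = 0.0
    for g in p_good:
        for b in p_bad:
            if g > b:
                score += 1.0
            elif g == b:
                score += 0.5
    return score / (len(p_good) * len(p_bad))


# Validation counts reported in the text (260 validation cases).
ccr_backward, mcr_backward = classification_rates(188, 260)
ccr_ridge, mcr_ridge = classification_rates(185, 260)
print(f"Backward: CCR = {ccr_backward:.3f}, MCR = {mcr_backward:.3f}")
print(f"Ridge:    CCR = {ccr_ridge:.3f}, MCR = {mcr_ridge:.3f}")

# Hypothetical predicted probabilities of Y=1, not taken from the data set.
example_good = [0.9, 0.8, 0.4]
example_bad = [0.7, 0.3]
print(f"c statistic (toy example) = {c_statistic(example_good, example_bad):.3f}")
```

Unlike CCR and MCR, the c statistic does not depend on the chosen cutpoint, which is why it complements the classification rates.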

The odds ratio is the interpretation of a model coefficient in logistic regression. The odds ratios of V8 (credit history of debtor) can be seen in Table 10.

Table 10 Odds ratio of V8 (credit history) (columns: Variable, Backward, Ridge; rows: contrasts between V8 categories)

By using backward logistic regression, the odds of being a good debtor for debtors who had taken no credits were .21 times the odds for debtors with a critical account. For debtors who had paid back duly all the credits at the bank, the odds were .229 times those of debtors with a critical account. For debtors whose existing credits were paid back duly, and for those with a delay in paying off in the past, the odds were about one half of those of debtors with a critical account.

The last column of Table 10 is the odds ratio of V8 from the logistic ridge regression. It was similar to the odds ratio of backward logistic regression. However, as with the parameter estimates, the 1st and 2nd dummy variables of logistic ridge regression showed a result that was inconsistent with the data exploration: the 1st dummy should be lower than the 2nd dummy. Except for V8, all parameter estimates of the logistic ridge regression were consistent with the data exploration. The odds ratios for the other variables can be seen in Appendix 3.

Comparison of Logistic Regression with Variable Selection and Ridge with Generated V2*

From the comparison of the c statistic, logistic regression with variable selection and logistic ridge regression gave similar results. To examine the performance of these two methods at several values of the correlation coefficient, a variable V2* was generated to replace V2. Versions of V2* whose correlation coefficient with V1 ranged from .6 to .95 in increments of .05 were generated. Then the c statistics and the total correctly predicted cases were compared. The total correctly predicted cases were obtained from each model's optimal cutpoint.
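The text does not state how V2* was generated. One possible construction, shown here only as a sketch under that assumption, mixes the standardized V1 with a noise component that is orthogonal to it, so that the sample correlation with V1 equals the target value r. The function names, the random seed, and the simulated `v1` values are all hypothetical; the real V1 is the duration of credit in the data set.

```python
import math
import random


def standardize(xs):
    """Center a list and scale it to unit Euclidean length."""
    m = sum(xs) / len(xs)
    centered = [x - m for x in xs]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    return sum(a * b for a, b in zip(standardize(xs), standardize(ys)))


def generate_correlated(v1, r, rng):
    """Return a variable whose sample correlation with v1 is r (assumed method)."""
    z1 = standardize(v1)
    noise = standardize([rng.gauss(0.0, 1.0) for _ in v1])
    # Remove the part of the noise that lies along z1 (Gram-Schmidt step).
    proj = sum(e * z for e, z in zip(noise, z1))
    resid = [e - proj * z for e, z in zip(noise, z1)]
    rnorm = math.sqrt(sum(x * x for x in resid))
    z2 = [x / rnorm for x in resid]
    # Mix the two orthogonal unit vectors; the result is on a standardized scale.
    return [r * a + math.sqrt(1.0 - r * r) * b for a, b in zip(z1, z2)]


rng = random.Random(2011)                          # arbitrary seed
v1 = [rng.uniform(4, 72) for _ in range(200)]      # simulated stand-in for V1
grid = [round(0.60 + 0.05 * k, 2) for k in range(8)]  # .60, .65, ..., .95

for r in grid:
    v2_star = generate_correlated(v1, r, rng)
    print(f"target r = {r:.2f}, achieved r = {pearson(v1, v2_star):.4f}")
```

Because correlation is unchanged by shifting and rescaling, the generated variable could be mapped back to the scale of V2 without affecting r.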
Figure 10 Comparison of c statistic between logistic regression with variable selection and logistic ridge regression of data set with generated V2* (c statistic against correlation between V1 and V2*, for Ridge and Var. Selection)

Figure 11 Comparison of total correctly predicted cases between logistic regression with variable selection and logistic ridge regression of data set with generated V2* (total correctly predicted (%) against correlation between V1 and V2*, for Ridge and Var. Selection)

Figure 10 and Figure 11 show that the c statistic and the total correctly predicted cases of logistic ridge regression were always higher than those of logistic regression with variable selection, at every correlation coefficient between V1 and V2*. There was no clear pattern in the difference between the two models, in either the c statistic or the total correctly predicted cases, as the correlation coefficient between V1 and V2* increased. This may be because only two predictor variables had a high correlation coefficient. The values of the c statistic and the total correctly predicted cases, both in modeling and in validation, can be seen in Appendix 4.

Table 11 shows which parameters are present in each logistic regression model with variable selection. It can be seen in the table that V11 exists only in the model with a correlation coefficient of .75. This may be why the c statistic and the total correctly predicted cases of logistic regression with variable selection at a correlation coefficient of .75 were higher than the others, as shown in Figure 10 and Figure 11.

Table 11 Parameter existence of the logistic regression model with variable selection (rows: parameters V1, V2*, V3, V4, V6 to V17, and V19; columns: correlation coefficient between V1 and V2*)

CONCLUSION

Comparing the total correctly predicted cases with the training data set, logistic ridge regression was better than logistic regression with variable selection (in this case backward elimination). With the validation data set, however, backward logistic regression was better. The optimal cutpoint of backward was .68, while for ridge the optimal cutpoint was .677. The comparison of the c statistic and the total correctly predicted cases on the German Credit data set shows that logistic ridge regression has a slightly higher capability to predict a new applicant's collectability status than logistic regression with variable selection. Both models, however, had low capability in identifying the bad debtors.

RECOMMENDATION

A data set with high multicollinearity among the predictor variables is needed to examine the performance of logistic ridge regression. Other ways of defining the optimal cutpoint are also worth trying, since banks usually consider that a false positive results in greater losses for the bank than a false negative. Before entering a variable as a predictor in logistic ridge regression, it is important to know, based on related theory, whether the variable affects the response.

APPENDIX

Appendix 1 Description of variables used in analysis

Variable, annotation, and description of categories. For dummy-coded variables, the category without a dummy variable is marked (reference).

V1 Duration of credit
V2 Amount of credit
V3 Age
V4 Installment rate: 1: 1%; 2: 2%; 3: 3%; 4: 4% (reference)
V6 Number of dependents: 1: 1 dependent (reference); 2: 2 dependents
V7 Status of checking account: 1: < 0 DM; 2: 0 <= ... < 200 DM; 3: >= 200 DM; 4: no checking account (reference)
V8 Credit history: 1: no credits taken; 2: all credits paid duly; 3: existing credits paid duly; 4: delay in paying off; 5: critical account (reference)
V9 Saving account balance: 1: < 100 DM; 2: 100 <= ... < 500 DM; 3: 500 <= ... < 1000 DM; 4: >= 1000 DM; 5: no savings account (reference)
V10 Time of working experience in current job: 1: unemployed (reference); 2: < 1 year; 3: 1 <= ... < 4 years; 4: 4 <= ... < 7 years; 5: >= 7 years
V11 Time of living in present residence: 1: <= 1 year (reference); 2: 1 < ... <= 2 years; 3: 2 < ... <= 3 years; 4: > 4 years
V12 Home ownership: 1: owns residence; 2: rents; 3: free (reference)
V13 Occupation category: 1: unemployed/unskilled (reference); 2: unskilled - resident; 3: official; 4: officer
V14 Marital status: 1: male & divorced; 2: male & single; 3: male & married; 4: female (reference)
V15 Guarantor: 1: has a co-applicant; 2: has a guarantor; 3: has none (reference)
V16 Property owned: 1: car or other; 2: real estate; 3: no property (reference)
V17 Purpose of credit: 1: new car; 2: used car; 3: furniture; 4: radio/television; 5: education; 6: retraining; 7: other (reference)
V19 Telephone ownership: 1: no (reference); 2: yes
Y Good credit rating: 0: no; 1: yes

Appendix 2 Proportion of good debtor on each variable

(Charts of the proportion or percentage of good debtors by: duration of credit (month); number of dependents; amount of credit (DM); status of checking account; age; credit history; installment rate; saving account balance; time of working experience in current job; guarantor; time of living in present residence (year); property owned; home ownership; purpose of credit; occupation category; telephone ownership; marital status.)

Appendix 3 Odds ratio of backward logistic regression and logistic ridge regression

(Table columns: Variable, Ridge, Backward.)

Appendix 4 Comparison of c statistic and the total correctly predicted cases of logistic regression with variable selection and logistic ridge regression of data set with generated V2*

(Table columns: Correlation of V1 and V2*; then, for Model and for Validation separately, C statistic (Variable Selection, Ridge) and Total correctly predicted (%) (Variable Selection, Ridge).)

RESULT AND DISCUSSION

Figure 3 shows the ROC curve. It plots the probability of a false positive (1 - specificity) against the probability of a true positive (sensitivity). The area under the ROC curve (AUR), which ranges from 0 to 1, provides a measure