CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION RAKHMAWATI

CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION

RAKHMAWATI

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2011

ABSTRACT

RAKHMAWATI. Credit Risk Modeling Using Logistic Ridge Regression. Supervised by AAM ALAMUDI and DIAN KUSUMANINGRUM.

The growth of credit in national banking may expose banks to greater risk. One issue that must be addressed is how to determine whether a new applicant will be good at loan repayment. A well-known and widely used method for classifying new credit applicants is logistic regression. Multicollinearity is a problem that is frequently encountered in model building. Usually a variable selection method is used to handle this problem, but this sometimes creates a new problem when an important variable does not enter the model. Logistic ridge regression could be an alternative to logistic regression when multicollinearity exists. The advantage of this method is that it can handle multicollinearity without deleting any predictor variables. This research compared the performance of logistic ridge regression and logistic regression with variable selection in predicting the collectability status of new credit applicants. The German Credit data set contains 1000 observations; 740 observations were used for modeling and 260 observations were used for validation. Backward elimination was the best of the variable selection methods: it had the highest c statistic, and the model fit according to the Hosmer and Lemeshow Goodness-of-Fit Test. Backward logistic regression showed that, among 17 variables, eight were significant in the Wald test. There were many significant correlations among the predictors, but the highest correlation coefficient was 0.628, between duration of credit (V1) and credit amount (V2). The ridge parameter λ was 0.1. The optimal cut point of backward logistic regression was 0.68, while that of logistic ridge regression was 0.677.

Comparing the c statistic and the total correctly predicted cases shows that logistic ridge regression was better than backward logistic regression on the training data. However, on the testing data (validation), backward logistic regression was better. To better understand the models under higher correlation between V1 and V2, V2* was generated to replace V2, and logistic regression with variable selection and logistic ridge regression were built again. The result indicated that logistic ridge regression has a slightly higher capability to predict a new applicant's collectability status than logistic regression with variable selection.

Key words: credit risk modeling, logistic ridge regression, multicollinearity

CREDIT RISK MODELING USING LOGISTIC RIDGE REGRESSION

RAKHMAWATI G

Thesis as a requirement for Bachelor Degree in Statistics

DEPARTMENT OF STATISTICS
FACULTY OF MATHEMATICS AND NATURAL SCIENCES
BOGOR AGRICULTURAL UNIVERSITY
2011

Title : Credit Risk Modeling Using Logistic Ridge Regression
Author : Rakhmawati
NIM : G

Approved by:

Advisor I
Ir. Aam Alamudi, M.Si
NIP

Advisor II
Dian Kusumaningrum, S.Si, M.Si

Acknowledged by:

Head of Department of Statistics
Dr. Ir. Hari Wijayanto, M.Si
NIP

Graduation date:

BIOGRAPHY

Rakhmawati was born in Salatiga on June 16th, 1988, as the daughter of Paijo Nurhadi Santoso and Siti Naimah. She has an older brother named Nur Rahman Istianto and a younger brother named Aulia Kharis Kurniawan. After graduating from SMA Negeri 1 Salatiga in 2006, she continued her studies at Bogor Agricultural University through USMI. She took Statistics as her major, chose Information System as her minor subject, and also took some supporting courses from the Department of Mathematics. She was a staff member of the Database and Computational Department in Gamma Sigma Beta, an organization of statistics students at Bogor Agricultural University. In her 8th semester, she had the chance to take an internship program at PT Ganesha Cipta Informatika. There, she and her partner wrote a SAS program for risk management that calculates the Value at Risk of market risk.

ACKNOWLEDGEMENTS

Alhamdulillah, thanks to Allah SWT Who gives me love, opportunity, health, and capability in finalizing my research, which is entitled Credit Risk Modeling Using Logistic Ridge Regression. I recognize that my research could not have been completed without help from other people. I want to thank Mr. Aam Alamudi and Mrs. Dian Kusumaningrum, my advisors, for their criticism, ideas, and patience. Thanks to Miss Indah Permatasari, my internship advisor, whose suggestions led me to the topic of my research. To Defri Ramadhan Ismana and Yulia Triwijiwati, thank you for the discussions. Special thanks to my beloved family for their love and support. Last, I hope this thesis will be beneficial.

Bogor, February 2011

Rakhmawati

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
INTRODUCTION
  Background
  Objective
LITERATURE REVIEW
  Credit Risk
  The Cramer Statistic
  Logistic Regression
  Logistic Ridge Regression
  Optimal Cut Point
  Model Evaluation
METHODOLOGY
  Data Source
  Method
RESULT AND DISCUSSION
  Data Exploration
  Logistic Regression with Variable Selection
  Logistic Ridge Regression
  Comparison of Backward and Logistic Ridge Regression
  Comparison of Logistic Regression with Variable Selection and Ridge with Generated V2*
CONCLUSION
RECOMMENDATION
REFERENCE
APPENDIX

LIST OF TABLES

Table 1 Pearson correlation coefficient of numeric variables
Table 2 Spearman correlation coefficient of ordinal variables
Table 3 Cramer coefficient of nominal variables
Table 4 Comparison of backward, forward, and stepwise logistic regression
Table 5 Parameter estimate by using backward logistic regression
Table 6 Classification table of backward logistic regression by using a cut point of 0.68
Table 7 Classification table of logistic ridge regression by using a cut point of 0.677
Table 8 Parameter estimate by using logistic ridge regression
Table 9 Comparison of c statistic between backward and logistic ridge regression
Table 10 Odds ratio estimate of V8 (credit history)
Table 11 Parameter existence on the logistic regression model with variable selection

LIST OF FIGURES

Figure 1 Plot of sensitivity and specificity versus all possible cut points
Figure 2 Classification table
Figure 3 ROC curve
Figure 4 Plot of percentage of good debtors in each group of credit amount (V2)
Figure 5 Proportion of good debtors in credit history (V8)
Figure 6 Classification rate of backward and logistic ridge regression on each optimal cut point
Figure 7 Validation's classification rate of backward and logistic ridge regression on each optimal cut point
Figure 8 Observed collectability status on P(Y=1) by using backward logistic regression
Figure 9 Observed collectability status on P(Y=1) by using logistic ridge regression
Figure 10 Comparison of C statistic between logistic regression with variable selection and logistic ridge regression of data set with generated V2*
Figure 11 Comparison of total correctly predicted cases between logistic regression with variable selection and logistic ridge regression with generated V2*

LIST OF APPENDICES

Appendix 1 Description of variables used in analysis
Appendix 2 Proportion of good debtor on each variable
Appendix 3 Odds ratio of backward logistic regression and logistic ridge regression
Appendix 4 Comparison of C statistic and the correctly predicted cases between logistic regression with variable selection and logistic ridge regression of data set with generated V2*

INTRODUCTION

Background

Credit risk is one of the eight risks that banks must consider. It is important to build a credit risk system that is measurable, documented, and open to further development. Logistic regression, discriminant analysis, and artificial neural networks are some of the methods used in credit risk modeling. They are useful for predicting whether a new applicant will become a good or bad debtor if he or she receives a loan.

Multicollinearity is a common problem in credit risk modeling. The usual solution is a variable selection method (forward, backward, or stepwise). However, this solution may lose information about the response variable if a deleted predictor variable is an important one. Ridge regression is another statistical procedure for dealing with multicollinearity (Ravinshanker & Dey 2001). With logistic ridge regression, multicollinearity is expected to be handled without deleting any variables, so no information is lost from the data that have been collected.

Bank of Indonesia noted that the growth of credit of national banks in January 2010 was 1%. By the end of August 2010, the credit of the banking industry had grown and reached 2.3% (Purnomo 2010a, 2010b). This may lead to a greater risk than banks have faced before. Hence, it is important to build a more accurate credit scoring model to decide whether a new applicant is credible enough to get a loan.

Objectives

The objectives of this research are:
1. To build a credit risk model using logistic regression with variable selection and logistic ridge regression.
2. To determine the optimal probability cutpoint.
3. To compare the classification rate and the c statistic of logistic regression with variable selection and logistic ridge regression.

LITERATURE REVIEW

Credit Risk Model

Banks lend to individuals by first asking them to fill out a loan application.
The customer is asked to submit several documents that the bank needs in order to evaluate the loan request. Six aspects of the loan application determine whether a new applicant is creditworthy. These Six Basic Cs of Lending are character, capacity, cash, collateral, condition, and control. Character is the data about the applicant's personality. Capacity is the capacity to borrow money. Cash is related to the borrower's income and the balance in the savings account. Collateral is the adequacy of the borrower's assets to support the loan; the age and degree of specialization of the borrower's assets are examples of collateral. Condition is the prospect of the business in relation to economic conditions. A correctly prepared loan document is an example of control.

The basic theory of credit scoring is that the bank can identify the financial, economic, and motivational factors that separate good debtors from bad ones by observing a large group of people who have borrowed in the past. Credit scoring systems are usually based on discriminant models or related techniques such as logit or probit models or neural networks. If the applicant's score exceeds a critical cutpoint level, he or she is more likely to be approved for credit. Among the most important variables used in evaluating consumer loans are age, marital status, number of dependents, home ownership, telephone ownership, type of occupation, and length of employment in the current job.

The Cramer Statistic

The chi-square test of independence is used to conclude whether there is an association between two categorical variables. When the numbers of rows and columns of the contingency table are unequal, the Cramer coefficient measures the strength of this association. Its value is between 0 and 1. The Cramer coefficient V is defined as:

$$V = \sqrt{\frac{X^2}{n\,(t-1)}}$$

where $X^2$ is the chi-square statistic, $n$ is the total sample size, and $t$ is either the number of rows or the number of columns in the contingency table, whichever is smaller.

Logistic Regression

Let the conditional probability that the outcome is present be denoted by $P(Y=1 \mid x) = \pi(x)$. The logit of the multiple logistic regression is given by the equation

$$g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where

$$g(x) = \log\left[\frac{\pi(x)}{1-\pi(x)}\right]$$

in which case the logistic regression model is

$$\pi(x) = \frac{e^{g(x)}}{1+e^{g(x)}}$$

When an independent variable is categorical, dummy (design) variables are needed. In general, if a categorical variable (nominal or ordinal scale) has k values, then k-1 design variables are needed. Thus, the logit for a model with p variables and the jth variable being categorical would be

$$g(x) = \beta_0 + \beta_1 x_1 + \dots + \sum_{l=1}^{k-1} \beta_{jl} D_{jl} + \dots + \beta_p x_p$$

where $D_{jl}$ are the k-1 design variables of the jth variable and $\beta_{jl}$ are their coefficients.

Maximum likelihood estimates for the logit model are obtained by maximizing, with respect to β, the log-likelihood function

$$l(\beta) = \sum_{i=1}^{n} \left\{ y_i \log \pi(x_i) + (1-y_i) \log\left[1-\pi(x_i)\right] \right\}$$

After fitting the model, the process of model assessment begins. The significance of the covariates can be assessed by the G test statistic and the Wald test. The G test statistic is a likelihood ratio test and measures the significance of the parameters in the overall model. The hypotheses of the G test are:

H0: β1 = β2 = ... = βp = 0
H1: at least one βi ≠ 0, i = 1, 2, ..., p

The G test statistic is formulated as

$$G = -2 \ln\left(\frac{L_0}{L_p}\right)$$

where $L_0$ is the likelihood without covariates and $L_p$ is the likelihood with p covariates. Under the null hypothesis, G follows a chi-square distribution with p degrees of freedom.

If the null hypothesis is rejected, so that at least one and perhaps all p coefficients are different from zero, the Wald test can be used to assess the significance of each covariate:

H0: βi = 0
H1: βi ≠ 0, where i = 1, 2, ..., p

$$W = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$

Under the null hypothesis, the W statistic follows a standard normal distribution (Hosmer & Lemeshow 2000).

Coefficients in logistic regression are interpreted through the odds ratio, which indicates how much more likely, with respect to odds, a certain event is to occur in one group relative to its occurrence in another group. The odds ratio is defined as $\exp(\hat{\beta}_i)$.
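The fitting and testing steps in this section can be illustrated with a short numerical sketch. It is not the procedure used in the thesis, which does not show its software code: the data below are simulated, the coefficient values are arbitrary, and the helper names `sigmoid`, `fit_logistic`, and `log_likelihood` are hypothetical. The sketch maximizes the log-likelihood by Newton-Raphson and then computes the Wald statistics, the odds ratios, and the G statistic.

```python
import numpy as np

def sigmoid(eta):
    """pi(x) = e^g(x) / (1 + e^g(x)), written in a numerically simpler form."""
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit by Newton-Raphson. X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ beta)
        info = X.T @ (X * (pi * (1.0 - pi))[:, None])   # information matrix X'VX
        beta = beta + np.linalg.solve(info, X.T @ (y - pi))
    pi = sigmoid(X @ beta)
    info = X.T @ (X * (pi * (1.0 - pi))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))           # SE of each coefficient
    return beta, se

def log_likelihood(X, y, beta):
    pi = sigmoid(X @ beta)
    return np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

# Simulated data (assumption): an intercept and two numeric predictors.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (rng.random(n) < sigmoid(X @ np.array([0.5, 1.0, -0.7]))).astype(float)

beta_hat, se = fit_logistic(X, y)
wald = beta_hat / se                     # W = beta_i / SE(beta_i)
odds_ratios = np.exp(beta_hat[1:])       # exp(beta_i) for each predictor

# G = -2 ln(L0 / Lp): the model without covariates has logit(mean(y)) as intercept.
ll_p = log_likelihood(X, y, beta_hat)
ll_0 = log_likelihood(X[:, :1], y, np.array([np.log(y.mean() / (1.0 - y.mean()))]))
G = -2.0 * (ll_0 - ll_p)

print("Wald statistics:", wald)
print("Odds ratios:", odds_ratios)
print("G statistic:", G)
```

Each Wald statistic would be compared with the standard normal distribution, and G with a chi-square distribution with p = 2 degrees of freedom in this toy setting.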
For a numeric variable, the odds ratio indicates that for every one-unit increase in the predictor, the odds of the outcome are multiplied by $\exp(\hat{\beta}_i)$.

Multicollinearity can cause unstable estimates and inaccurate variances, which affect hypothesis tests (Hoerl & Kennard 1980, in Shen & Gao 2002). In regression, there are two main approaches to handling multicollinearity: variable selection methods (forward, backward, and stepwise) and ridge regression. Forward selection adds terms sequentially until further additions do not improve the model. Backward elimination begins with a complex model and sequentially removes terms. The stepwise procedure starts by choosing the equation containing the most important variable and then attempts to build it up by adding variables one at a time, as long as these additions are worthwhile.

Logistic Ridge Regression

Unstable parameter estimates occur when the number of covariates is relatively large or when the covariates are correlated. An alternative procedure for obtaining more stable estimates is to specify a restriction on the parameters. Consider the maximization of the log-likelihood function $l(\beta)$ with a penalty on the norm of β:

$$l^{\lambda}(\beta) = l(\beta) - \lambda \lVert \beta \rVert^{2}$$

where $\lVert \beta \rVert = \left(\sum_j \beta_j^2\right)^{1/2}$ is the norm of the parameter vector β. The ridge parameter λ controls the amount of shrinkage of the norm of β. When λ = 0 the solution is the ordinary MLE. For a good choice of λ, the estimate is expected to be on average closer to the real value of β than the ordinary MLE, i.e. $MSE(\hat{\beta}^{\lambda}) < MSE(\hat{\beta})$ (Cessie & Houwelingen 1990).

The parameter estimates of logistic ridge regression are calculated as follows:

1. Fit the logistic regression model using maximum likelihood, leading to the estimate $\hat{\beta}$. Construct standardized coefficients by defining
   $$\hat{\beta}_j^{s} = s_j \hat{\beta}_j, \quad j = 1, 2, \dots, p$$
   where $s_j$ is the standard deviation of the jth predictor in the training data.
2. Construct the Pearson statistic
   $$X^2 = \sum_{k=1}^{g} \frac{(y_k - m_k \hat{\pi}_k)^2}{m_k \hat{\pi}_k (1 - \hat{\pi}_k)}$$
   where
   g = the number of covariate patterns
   m_k = the number of subjects with x = x_k
   y_k = the number of positive responses (y=1) among the m_k subjects
   $\hat{\pi}_k$ = the probability that the outcome is present at x = x_k
   This is a measure of the difference between the observed and the fitted values.
3. Define the ridge parameter (λ).
4. Let Z be the N × p matrix of centered and scaled predictors and let V be an N × N matrix. Take the vector of estimates with the intercept omitted. The ridge regression estimate is then computed from these quantities and λ.

Optimal Cutpoint

The optimal cutpoint for the purpose of classification can be obtained from the plot of sensitivity and specificity versus all possible cutpoints (Hosmer & Lemeshow 2000). The plot can be seen in Figure 1. The optimal cutpoint is not the only criterion for deciding whether a new applicant should get a loan. Even when the correct classification rate at the optimal cutpoint is high, the number of false positives should be considered, because the loss caused by this error is extremely large relative to that of a false negative. Each bank has its own criteria for making a decision. These errors are explained in the next section. In this research, the cutpoint is taken only from the plot of sensitivity and specificity versus all possible cutpoints.

Figure 1 Plot of sensitivity and specificity versus all possible cutpoints

Model Evaluation

In model assessment, a classification table is most appropriate when classification is a stated goal of the analysis. Figure 2 is the classification table. It is a two-way frequency table between the actual data and the prediction.
The correct classification rate (CCR) consists of the percentages of true positives and true negatives, while the misclassification rate (MCR) consists of the percentages of false positives and false negatives.

                    Predicted
                    0                      1
Actual    0         True Negative (TN)     False Positive (FP)
          1         False Negative (FN)    True Positive (TP)

Figure 2 Classification Table

A true positive (TP) is an observation in category 1 that was correctly predicted, and sensitivity is the proportion of category 1 observations that are true positives. A true negative (TN) is an observation in category 0 that was correctly predicted, and specificity is the proportion of category 0 observations that are true negatives. A false positive is an observation in category 0 that was predicted as category 1. A false negative is an observation in category 1 that was predicted as category 0.

Figure 3 ROC curve
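The quantities in the classification table can be computed directly from observed classes and predicted probabilities. The following is a minimal sketch with made-up data. The function names are hypothetical, and the convention that an observation is predicted as category 1 when its probability is at or above the cutpoint is an assumption.

```python
def classification_table(y_true, p_hat, cutpoint):
    """Count TN, FP, FN, TP for a given probability cutpoint."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, prob in zip(y_true, p_hat):
        predicted = 1 if prob >= cutpoint else 0   # assumed tie rule: >= cutpoint
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

def classification_rates(counts):
    """Sensitivity, specificity, CCR, and MCR from the four cell counts."""
    total = sum(counts.values())
    return {
        "sensitivity": counts["TP"] / (counts["TP"] + counts["FN"]),
        "specificity": counts["TN"] / (counts["TN"] + counts["FP"]),
        "CCR": (counts["TP"] + counts["TN"]) / total,
        "MCR": (counts["FP"] + counts["FN"]) / total,
    }

# Made-up example: 1 = good debtor, 0 = bad debtor.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
p_hat = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6, 0.2, 0.75]

table = classification_table(y_true, p_hat, cutpoint=0.5)
rates = classification_rates(table)
print(table)
print(rates)
```

With Y=1 coded as a good debtor, a false positive is a bad debtor who is predicted to be good.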

Figure 3 shows the ROC curve. It plots the probability of a false positive (1 - specificity) against the probability of a true positive (sensitivity). The area under the ROC curve (AUR), which ranges from 0 to 1, provides a measure of the model's ability to discriminate between subjects who experience the outcome of interest and those who do not. The measure of AUR is the c statistic:

$$c = \frac{n_c + 0.5\,(t - n_c - n_d)}{t}$$

where
n_c : the number of concordant pairs
n_d : the number of discordant pairs
t : the total number of pairs

As a general rule:
C = 0.5 : no discrimination
0.7 ≤ C < 0.8 : acceptable discrimination
0.8 ≤ C < 0.9 : excellent discrimination
C ≥ 0.9 : outstanding discrimination
(Hosmer & Lemeshow 2000).

METHODOLOGY

Data Source

The data used in this research was the German Credit data set, which was available online. It contains observations on 1000 past credit applicants. Each applicant was rated as good (700 cases) or bad (300 cases). After considering the Six Basic Cs of Lending, 17 variables were used in this research: 3 numeric variables, 6 ordinal variables, 7 nominal variables, and 1 binary variable. The description of the variables can be seen in Appendix 1.

Method

The procedures used in this research were:
1. Divide the data into training data (740) for modeling and testing data (260) for validation. Each data set has the same pattern of good/bad debtors as the full data set, which comprises 70% good debtors and 30% bad debtors.
2. Explore the data.
3. Model the data using stepwise, forward, and backward logistic regression. The probability modeled was Y=1 (the debtor had a good collectability status). Then choose one of the three models by considering the fit of the model and the highest c statistic.
4. Model the data using logistic ridge regression.
5. Determine the optimal cutpoint from the intersection of sensitivity and specificity.
6. Validate the models with the testing data.
7. Compare the classification rate and the c statistic between logistic ridge regression and logistic regression with variable selection.
8. Generate V2* with specific correlations with V1. Then repeat steps 3 to 7 with the new data (replacing V2 with V2*) to see the performance of logistic regression with variable selection and logistic ridge regression as the correlation between V1 and V2* increases.

RESULT AND DISCUSSION

Data Exploration

There were no outliers or missing values in the full data set, so all of the 1000 observations were included in the analysis. The allocation of the data into modeling and validation sets was based on the proportion of bad and good cases in the overall data set. Each set had 70% good and 30% bad cases, in line with the full data set.

The variables V1 (duration of credit) and V2 (credit amount) showed a decreasing trend with respect to the response variable. Figure 4 shows that as the amount of credit increased, the proportion of debtors with good collectability status decreased. Debtors with a high installment rate (V4) tended to be bad debtors. The difference in the proportion of good debtors between occupation categories was not significant. The group of debtors who were unemployed/unskilled-nonresident had the highest proportion of good debtors compared to the unskilled-resident, official, and officer groups.

Figure 4 Plot of percentage of good debtors in each group of credit amount (V2)

It can be seen in Appendix 2 that, based on age (V3), the proportion of good debtors increased with age for debtors aged 20 to 50 years. The group of debtors aged 66 to 75 years had the lowest proportion of good debtors. Debtors with two dependents had a higher proportion of good debtors than those with one dependent (V6). As the status of the checking account (V7) increased, debtors tended to be good debtors. Home ownership status (V12) also showed a positive trend: the proportion of good debtors increased as the home ownership status changed from free to rent to own.

There was no pattern in the proportion of good debtors as the length of working experience in the current job (V10) or the length of time living in the present residence (V11) increased. The figure can be seen in Appendix 2. Debtors who had been working four to seven years in their current job had the highest proportion of good debtors. Debtors who had been working more than seven years in their current job tended to be good debtors compared to those with less than four years of working experience in their current job. Unemployed debtors had a higher proportion of good debtors than those with less than a year of working experience in their current job. For the time living in the present residence, debtors with one year or less and debtors with two to three years were more likely to be good debtors than the others.

Figure 5 Proportion of good debtors in credit history (V8). Categories: no credit taken, all credits paid back duly, existing credit paid back duly, delay in paying off in the past, critical account.

Figure 5 shows that credit history (V8) has a positive trend. Debtors who had never taken credit before had the lowest proportion of good debtors. Those with a high average balance in their savings account (V9) tended to be good debtors.
The difference between marital statuses (V14) was not significant, although single males and married males had a higher proportion of good debtors than females and divorced males. Debtors who had a guarantor (V15) were more likely to be good debtors than those who had a co-applicant. Those who had property (V16) were more likely to be good debtors than those with no property. Those whose credit purpose (V17) was a used car, furniture, or radio/television had a higher proportion of good debtors than those with other purposes. The lowest proportion of good debtors was among those with education as the purpose of taking credit. Debtors who had a telephone number under their own name (V19) were more likely to be good debtors than those who did not. The percentage of good debtors for each variable can be seen in Appendix 2.

The correlations between predictor variables can be seen in Table 1 and Table 2. There were many significant correlations, but only one high correlation coefficient, between V1 and V2. Table 3 shows the Cramer statistic as the measure of association between the nominal variables.

Table 1 Pearson correlation coefficient of numeric variables

Table 2 Spearman correlation coefficient of ordinal variables

Table 3 Cramer coefficient of nominal variables

Among the numeric predictors, the only significant correlation occurred between V1 (duration of credit) and V2 (credit amount), with a correlation coefficient of 0.628, as shown in Table 1. Among the Spearman correlation coefficients shown in Table 2, the largest was 0.327, between V11 (time in present residence) and V12 (housing). Variable V10 (time in current job) had a significant correlation with all other ordinal variables except V12 (home ownership).
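The Cramer coefficient reported in Table 3 can be computed from any two-way contingency table of nominal variables. The sketch below uses made-up counts, not the German Credit data, and the function name `cramers_v` is hypothetical. It forms the chi-square statistic from observed and expected counts and then applies the definition with t equal to the smaller of the number of rows and columns.

```python
import numpy as np

def cramers_v(table):
    """Cramer coefficient sqrt(X^2 / (n (t - 1))) for a two-way contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi_square = ((table - expected) ** 2 / expected).sum()
    t = min(table.shape)                  # smaller of number of rows and columns
    return np.sqrt(chi_square / (n * (t - 1)))

# Made-up tables (assumption) showing the two extremes and an intermediate case.
perfect = cramers_v([[10, 0], [0, 10]])          # complete association
independent = cramers_v([[10, 20], [20, 40]])    # rows proportional: no association
mixed = cramers_v([[30, 10], [15, 25], [20, 20]])

print(perfect, independent, mixed)
```

No continuity correction is applied here, which is a simplifying assumption of the sketch.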

The strength of association between nominal variables was measured by the Cramer coefficient and can be seen in Table 3. Variable V17 (purpose of credit) had a significant correlation with all other nominal variables. The highest correlation among the nominal predictors was between V16 (property owned) and V17 (purpose of credit), at 0.218.

Logistic Regression With Variable Selection

Logistic regression models using forward, backward, and stepwise variable selection were built. Forward logistic regression gave the same result as stepwise logistic regression. Among the three selection methods, backward elimination had the highest c statistic. Using the Hosmer-Lemeshow Goodness-of-Fit Test proposed in Hosmer & Lemeshow (2000), the backward logistic regression model was considered to fit, with a p-value of 0.724.

Table 4 Comparison of backward, forward, and stepwise logistic regression

Using backward logistic regression, eight significant predictors were selected from the 17 predictor variables: credit duration (V1), credit amount (V2), installment rate (V4), checking account status (V7), credit history (V8), balance in savings account (V9), marital status (V14), and the purpose of credit (V17). The parameter estimates of backward logistic regression are shown in Table 5. Variables V1 and V2, which had a correlation coefficient of 0.628, both entered the model. This suggests that a correlation of 0.628 was not high enough to keep either variable out of the model.

All the parameter estimates were consistent with the data exploration. For example, for V8 it was explained above that debtors with no credit history (the first dummy variable, V8_1) had the lowest proportion of good debtors. Table 5 shows that the parameter estimate of V8_1 was the lowest compared to V8_2, V8_3, and V8_4. The value of the 3rd dummy (representing debtors who duly paid back their existing credits) was higher than that of the 4th dummy (representing debtors who had delays in paying off in the past), which was also consistent with the data exploration.

Table 5 Parameter estimate by using backward logistic regression
Columns: Parameter, Estimate, SE, Wald, P-value
*significant at the 0.05 level

With the model of eight significant predictors, the c statistic of backward logistic regression was 0.817, which indicated that the model was an excellent classifier. From the intersection of sensitivity and specificity over all possible cutpoints, the optimal cutpoint of backward logistic regression was 0.68. The classification table can be seen in Table 6.
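The intersection rule for choosing the cutpoint can be sketched as a search over a grid of candidate cutpoints for the one where sensitivity and specificity are closest. The data and the grid below are made up, the function name `optimal_cutpoint` is hypothetical, and the thesis does not state how the intersection was located numerically, so this is only one plausible reading.

```python
import numpy as np

def optimal_cutpoint(y_true, p_hat, grid):
    """Return (cutpoint, sensitivity, specificity) where the two curves are closest."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    best = None
    for c in grid:
        predicted_good = p_hat >= c                        # assumed tie rule: >= cutpoint
        sensitivity = np.mean(predicted_good[y_true == 1])
        specificity = np.mean(~predicted_good[y_true == 0])
        gap = abs(sensitivity - specificity)
        if best is None or gap < best[0]:
            best = (gap, float(c), float(sensitivity), float(specificity))
    return best[1:]

# Made-up predicted probabilities: five bad debtors (0) and five good debtors (1).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_hat = [0.12, 0.33, 0.47, 0.52, 0.71, 0.42, 0.56, 0.63, 0.81, 0.86]
grid = np.arange(1, 20) / 20            # candidate cutpoints 0.05, 0.10, ..., 0.95

cutpoint, sensitivity, specificity = optimal_cutpoint(y_true, p_hat, grid)
print(cutpoint, sensitivity, specificity)
```

A finer grid, or the set of distinct predicted probabilities, could be used instead. A bank that weighs false positives more heavily than false negatives might prefer a higher cutpoint than the intersection.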

Table 6 Classification table of backward logistic regression by using a cutpoint of 0.68

Of the 740 total cases, 551 cases (74.46%) were correctly predicted. Of the 226 bad debtors, 58 (25.66%) were predicted as good debtors. Of the 514 good debtors, 131 (25.49%) were predicted as bad debtors by the chosen model. With the true positive rate and the true negative rate both at about 74%, this model was good enough in classification, in line with the high c statistic mentioned above.

Logistic Ridge Regression

All 17 variables were included in the logistic ridge regression. These variables were chosen by considering the Six Basic Cs of Lending. The ridge parameter λ, obtained from a calculation involving the standardized parameters of ordinary logistic regression and the standardized predictor variables, was 0.1.

Table 7 Classification table of logistic ridge regression by using a cutpoint of 0.677

The c statistic of this model was 0.832, so the model was categorized as an excellent classifier. The classification rates at the optimal cutpoint of 0.677 can be seen in Table 7. The total number of correctly predicted cases was 559. There were 55 false positives (24.33%) among the 226 bad cases. Of the 514 good debtors, 126 (24.51%) were predicted as bad debtors.

Although the differences were not large, the parameter estimates from logistic ridge regression differed somewhat from those of backward logistic regression. The signs of the parameter estimates were the same in both models. For variable V8, however, the parameter estimates of this model were inconsistent with the data exploration: the 1st dummy (V8_1) should be lower than the 2nd dummy (V8_2).
Table 8 Parameter estimate by using logistic ridge regression

Comparison of Backward Logistic Regression and Logistic Ridge Regression

At the optimal cutpoint of each model, the percentage of total correctly predicted cases of logistic ridge regression was better, although only six cases higher than for backward logistic regression. There were 559 cases correctly predicted by logistic ridge regression and 553 correctly predicted by backward logistic regression. Figure 6 shows that the misclassification rates (false positive and false negative) of logistic ridge regression were lower than those of backward logistic regression, and its correct classification rates (true positive and true negative) were higher. Although logistic ridge regression was better (lower MCR and higher CCR), the values did not differ greatly.
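The shrinkage effect of the ridge penalty can be illustrated with a small sketch that maximizes the penalized log-likelihood l(β) - λ‖β‖² by Newton-Raphson. This is not the computation used in the thesis, which derives λ from standardized coefficients: here λ is simply fixed at an arbitrary value, the data are simulated with two highly correlated predictors, the intercept is left unpenalized (an assumption), and the function name `fit_ridge_logistic` is hypothetical.

```python
import numpy as np

def fit_ridge_logistic(Z, y, lam, n_iter=50):
    """Maximize l(beta) - lam * ||beta||^2 by Newton-Raphson.

    Z holds centered and scaled predictors; an intercept column is added here
    and is not penalized (assumption of this sketch).
    """
    n, p = Z.shape
    X = np.column_stack([np.ones(n), Z])
    penalty = 2.0 * lam * np.diag([0.0] + [1.0] * p)   # second derivative of lam*||beta||^2
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - pi) - penalty @ beta
        hessian = X.T @ (X * (pi * (1.0 - pi))[:, None]) + penalty
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated data (assumption): two predictors with correlation of about 0.95.
rng = np.random.default_rng(1)
n = 400
z1 = rng.normal(size=n)
z2 = 0.95 * z1 + np.sqrt(1.0 - 0.95 ** 2) * rng.normal(size=n)
Z = np.column_stack([z1, z2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)               # center and scale
prob = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * Z[:, 0] + 0.4 * Z[:, 1])))
y = (rng.random(n) < prob).astype(float)

beta_mle = fit_ridge_logistic(Z, y, lam=0.0)           # lam = 0 gives the ordinary MLE
beta_ridge = fit_ridge_logistic(Z, y, lam=5.0)

print("MLE slopes:  ", beta_mle[1:])
print("Ridge slopes:", beta_ridge[1:])
```

With λ = 0 the fit reduces to ordinary maximum likelihood, and a larger λ shrinks the norm of the slope vector, which matches the description of the ridge parameter in the literature review.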

Figure 6 Classification rate of backward and logistic ridge regression on each optimal cutpoint
[Bar chart: Percentage (%) for Total, FP, FN, TP and TN; series Backward and Ridge.]

Figure 7 shows that in the validation data set the total number of correctly predicted cases differed only slightly between the two models: 188 with backward logistic regression and 185 with logistic ridge regression, out of 260 cases. Here backward logistic regression gave a lower MCR and a higher CCR than logistic ridge regression.

Figure 7 Validation's classification rate of backward and logistic ridge regression on each optimal cutpoint
[Bar chart: Percentage (%) for Total, FP, FN, TP and TN; series Backward and Ridge.]

Because the classification rates depend on only one threshold (cutpoint), another measure of model assessment was also evaluated. This was the c statistic, which measures the area under the ROC curve and summarizes how well the model discriminates.

Table 9 Comparison of c statistic between backward logistic regression and logistic ridge regression
[Rows: Backward, Ridge; columns: Model, Validation. The values did not survive text extraction.]

The c statistics of logistic ridge regression, with the training data and also in validation, were higher than those of backward logistic regression. However, the category of discrimination ability was the same for both models. With training data, backward and ridge were both excellent. With testing data, both models were acceptable in discriminating good from bad debtors.

Figure 8 Observed collectability status on P(Y=1) by using backward logistic regression

Figure 8 shows the distribution of the probability of Y=1 obtained by backward logistic regression for each collectability status. The upper panel of the figure is the histogram of the probability of Y=1 for the bad debtors and the lower panel is the histogram for the good debtors, as denoted by the vertical axis. The good debtors (lower panel) tend to have a high probability of Y=1; only a few had a low probability of Y=1. The bad debtors (upper panel) were spread over the whole range of probability, which shows that the model was poor at differentiating the bad debtors from the good ones.

Figure 9 Observed collectability status on P(Y=1) by using logistic ridge regression

Figure 9 shows the distribution of the probability of Y=1 obtained by logistic ridge regression. As with backward logistic regression, the bad debtors were spread over the whole range of probability of Y=1, which again reflects the model's poor capability in identifying the bad debtors.
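The c statistic used in this comparison can be computed directly as a concordance probability, which equals the area under the ROC curve. The sketch below is a minimal illustration with made-up data. The category labels follow the commonly quoted Hosmer and Lemeshow rule of thumb (0.7 to 0.8 acceptable, 0.8 to 0.9 excellent, 0.9 and above outstanding), which is assumed here to be the scale behind the terms "excellent" and "acceptable"; the label for values under 0.7 is a placeholder of this sketch.

```python
def c_statistic(y_true, p_hat):
    """Share of (good, bad) pairs in which the good debtor gets the higher P(Y=1).

    Ties count as half a concordant pair.
    """
    good = [p for y, p in zip(y_true, p_hat) if y == 1]
    bad = [p for y, p in zip(y_true, p_hat) if y == 0]
    if not good or not bad:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for pg in good:
        for pb in bad:
            if pg > pb:
                concordant += 1.0
            elif pg == pb:
                concordant += 0.5
    return concordant / (len(good) * len(bad))


def discrimination_category(c):
    """Rule-of-thumb label (assumed thresholds, see the lead-in)."""
    if c >= 0.9:
        return "outstanding"
    if c >= 0.8:
        return "excellent"
    if c >= 0.7:
        return "acceptable"
    return "below acceptable"


# Toy example (made-up values): 13 of the 15 good/bad pairs are concordant.
y = [1, 1, 1, 0, 0, 1, 0, 1]
p = [0.90, 0.80, 0.40, 0.70, 0.20, 0.75, 0.50, 0.95]
c = c_statistic(y, p)
label = discrimination_category(c)
print(round(c, 4), label)
```

Unlike the classification rates, this measure does not depend on any single cutpoint, which is why it complements them.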

Odds ratio is the interpretation of a model coefficient in logistic regression. The odds ratio of V8 (credit history of debtor) can be seen in Table 10.

Table 10 Odds ratio of V8 (credit history)
[Columns: Variable, Backward, Ridge. Four rows compare V8 categories 1 to 4 against the reference category (critical account). The values did not survive text extraction.]

By using backward logistic regression, the odds of being a good debtor for debtors who had taken no credit were .21 times the odds for debtors with a critical account. For debtors who duly paid back all credits at the bank, the odds were .229 times those of debtors with a critical account. For debtors whose existing credits were paid back duly, and for those with a delay in paying off in the past, the odds were about one half of those of debtors with a critical account.

The last column of Table 10 gives the odds ratios of V8 from logistic ridge regression, which were similar to those of backward logistic regression. However, as with the parameter estimates, the 1st and 2nd dummy variables of logistic ridge regression gave a result inconsistent with the data exploration: the 1st dummy should be lower than the 2nd dummy. Except for V8, all parameter estimates of logistic ridge regression agreed with the data exploration. The odds ratios for the other variables can be seen in Appendix 3.

Comparison of Logistic Regression with Variable Selection and Ridge with Generated V2*

In the comparison of c statistics, logistic regression with variable selection and logistic ridge regression gave similar results. To examine the performance of the two methods at several values of the correlation coefficient, a variable V2* was generated to replace V2. Versions of V2* were generated with correlation coefficients with V1 from 0.60 to 0.95 in increments of 0.05. The c statistics and the total correctly predicted cases were then compared. The total correctly predicted cases were obtained at each model's optimal cutpoint.
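The text does not say how V2* was generated. One possible construction, shown here purely as an assumption, mixes the centered V1 with noise that has been made orthogonal to it, so that the sample correlation equals the target value. The function names and the toy V1 values below are hypothetical.

```python
import math
import random


def _center(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    a, b = _center(a), _center(b)
    return _dot(a, b) / math.sqrt(_dot(a, a) * _dot(b, b))


def generate_correlated(v1, r, rng):
    """Return a variable whose sample correlation with v1 is r (illustrative method).

    The result is on an arbitrary scale; a linear rescaling to the units of
    the original variable would leave the correlation unchanged.
    """
    z = _center(v1)
    noise = _center([rng.gauss(0.0, 1.0) for _ in v1])
    # Remove the component of the noise that lies along z.
    slope = _dot(noise, z) / _dot(z, z)
    noise = [e - slope * zi for e, zi in zip(noise, z)]
    z_len = math.sqrt(_dot(z, z))
    e_len = math.sqrt(_dot(noise, noise))
    weight = math.sqrt(1.0 - r * r)
    return [r * zi / z_len + weight * e / e_len for zi, e in zip(z, noise)]


rng = random.Random(0)
v1 = [rng.randint(4, 72) for _ in range(200)]          # made-up stand-in for V1
targets = [round(0.60 + 0.05 * k, 2) for k in range(8)]  # 0.60, 0.65, ..., 0.95
achieved = [pearson(v1, generate_correlated(v1, r, rng)) for r in targets]
for r, a in zip(targets, achieved):
    print(r, round(a, 6))
```

Each generated variable could then replace V2 before both models are refitted, which is the comparison described above.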
Figure 10 Comparison of c statistic between logistic regression with variable selection and logistic ridge regression of data set with generated V2*
[Line chart: C statistic against the correlation between V1 & V2*; series Ridge and Var. Selection.]

Figure 11 Comparison of total correctly predicted cases between logistic regression with variable selection and logistic ridge regression of data set with generated V2*
[Line chart: Total correctly predicted (%) against the correlation between V1 & V2*; series Ridge and Var. Selection.]

Figure 10 and Figure 11 show that the c statistic and the total correctly predicted of logistic ridge regression were always higher than those of logistic regression with variable selection, at every correlation coefficient between V1 and V2*. There was no clear pattern in the differences in c statistic and total correctly predicted between the models as the correlation coefficient between V1 and V2* increased. This may be because only two predictor variables had a high correlation coefficient. The values of the c statistic and the total correctly predicted, both in modeling and in validation, can be seen in Appendix 4.

Table 11 shows which parameters are present in each logistic regression model obtained with the variable selection method. It can be seen in the table that V11 exists only in the model with a correlation coefficient of 0.75. This may be why the c statistic and the total correctly predicted of logistic regression with variable selection at a correlation coefficient of 0.75 were higher than at the others, as shown in Figure 10 and Figure 11.

Table 11 Parameter existence of the logistic regression model with variable selection
[Rows: model parameters (V2* and the other predictors); columns: correlation coefficient between V1 & V2*. The cell marks did not survive text extraction.]

CONCLUSION

Comparing the total correctly predicted cases with the training data set, logistic ridge regression was better than logistic regression with variable selection (in this case backward elimination). With the validation data set, however, backward logistic regression was better. The optimal cutpoint of backward was .68, while for ridge it was .677. The comparison of the c statistic and the total correctly predicted cases on the German Credit data set shows that logistic ridge regression has a slightly higher capability to predict a new applicant's collectability status than logistic regression with variable selection. But both models had low capability in identifying the bad debtors.

RECOMMENDATION

Data with high multicollinearity among the predictor variables are needed to examine the performance of logistic ridge regression. Other ways of defining the optimal cutpoint are worth trying. Usually, banks consider that a false positive causes greater losses for the bank than a false negative. Before entering a variable as a predictor in logistic ridge regression, it is important to know, based on related theory, whether the variable affects the response.
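The recommendation to try other ways of defining the optimal cutpoint, given that a false positive is usually costlier to a bank than a false negative, could be explored with a cost-weighted rule. The sketch below is only one such possibility and is not the method used in this study: the cost values, function name and toy data are all hypothetical.

```python
def best_cutpoint(y_true, p_hat, cost_fp, cost_fn):
    """Return (cutpoint, total_cost) minimizing cost_fp * FP + cost_fn * FN.

    Assumptions: a case is predicted good (1) when p_hat >= cutpoint, the
    candidate cutpoints are the observed probabilities, and the lowest
    cutpoint wins when costs tie.
    """
    best = None
    for cut in sorted(set(p_hat)):
        fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= cut)
        fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < cut)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (cut, cost)
    return best


# Toy example (made-up values).
y = [1, 1, 1, 1, 0, 1, 0, 0]
p = [0.95, 0.90, 0.60, 0.55, 0.65, 0.80, 0.30, 0.50]

equal_costs = best_cutpoint(y, p, cost_fp=1, cost_fn=1)
fp_costlier = best_cutpoint(y, p, cost_fp=5, cost_fn=1)
print(equal_costs, fp_costlier)
```

In this toy example, weighting false positives five times as heavily moves the chosen cutpoint upward, so fewer bad debtors are accepted at the price of rejecting more good ones.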

APPENDIX

Appendix 1 Description of variables used in analysis

[Columns: Variable, Annotation, Description. The dummy-variable indicator columns did not survive text extraction and are omitted.]

V1 Duration of credit
V2 Amount of credit
V3 Age
V4 Installment rate. 1: 1%; 2: 2%; 3: 3%; 4: 4%
V6 Number of dependents. 1: 1 dependent; 2: 2 dependents
V7 Status of checking account. 1: < 0 DM; 2: 0 <...< 200 DM; 3: >= 200 DM; 4: no checking account
V8 Credit history. 1: no credits taken; 2: all credits paid duly; 3: existing credits paid duly; 4: delay in paying off; 5: critical account
V9 Saving account balance. 1: < 100 DM; 2: 100 <=...< 500 DM; 3: 500 <=...< 1000 DM; 4: >= 1000 DM; 5: no savings account
V10 Time of working experience in current job. 1: unemployed; 2: < 1 year; 3: 1 <=...< 4 years; 4: 4 <=...< 7 years; 5: >= 7 years
V11 Time of living in present residence. 1: <= 1 year; 2: 1 <...<= 2 years; 3: 2 <...<= 3 years; 4: > 4 years
V12 Home ownership. 1: owns residence; 2: rents; 3: free
V13 Occupation category. 1: unemployed/unskilled; 2: unskilled - resident; 3: official; 4: officer
V14 Marital status. 1: male & divorced; 2: male & single; 3: male & married; 4: female
V15 Guarantor. 1: has a co-applicant; 2: has a guarantor; 3: has none
V16 Property owned. 1: car or other; 2: real estate; 3: no property
V17 Purpose of credit. 1: new car; 2: used car; 3: furniture; 4: radio/television; 5: education; 6: retraining; 7: other
V19 Telephone ownership. 1: no; 2: yes
Y Good credit rating. 0: no; 1: yes

Appendix 2 Proportion of good debtor on each variable
[Bar charts of the proportion (percentage) of good debtors by category, one panel per variable: duration of credit (month), amount of credit (DM), age, number of dependents, status of checking account, credit history, installment rate, saving account balance, time of working experience in current job, guarantor, time of living in present residence (year), property owned, home ownership, purpose of credit, occupation category, telephone ownership, and marital status.]

Appendix 3 Odds ratio of backward logistic regression and logistic ridge regression
[Table with columns Variable, Ridge and Backward, set in two side-by-side blocks. The values did not survive text extraction.]

Appendix 4 Comparison of C statistic and the total correctly predicted cases of logistic regression with variable selection and logistic ridge regression of data set with generated V2*
[Table: for each correlation of V1 and V2*, the C statistic and the total correctly predicted (%) of Variable Selection and Ridge, reported separately for the Model and Validation data sets. The values did not survive text extraction.]


More information

Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data

Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data Paper 942-2017 Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data Josephine S Akosa, Oklahoma State University ABSTRACT The most commonly reported model evaluation metric

More information

Topics in Biostatistics Categorical Data Analysis and Logistic Regression, part 2. B. Rosner, 5/09/17

Topics in Biostatistics Categorical Data Analysis and Logistic Regression, part 2. B. Rosner, 5/09/17 Topics in Biostatistics Categorical Data Analysis and Logistic Regression, part 2 B. Rosner, 5/09/17 1 Outline 1. Testing for effect modification in logistic regression analyses 2. Conditional logistic

More information

Please respond to each of the following attitude statement using the scale below:

Please respond to each of the following attitude statement using the scale below: Resp. ID: QWL Questionnaire : Part A: Personal Profile 1. Age as of last birthday. years 2. Gender 0. Male 1. Female 3. Marital status 0. Bachelor 1. Married 4. Level of education 1. Certificate 2. Diploma

More information

ADVANCED DATA ANALYTICS

ADVANCED DATA ANALYTICS ADVANCED DATA ANALYTICS MBB essay by Marcel Suszka 17 AUGUSTUS 2018 PROJECTSONE De Corridor 12L 3621 ZB Breukelen MBB Essay Advanced Data Analytics Outline This essay is about a statistical research for

More information

ANALYSING QUANTITATIVE DATA

ANALYSING QUANTITATIVE DATA 9 ANALYSING QUANTITATIVE DATA Although, of course, there are other software packages that can be used for quantitative data analysis, including Microsoft Excel, SPSS is perhaps the one most commonly subscribed

More information

TRANSPORTATION PROBLEM AND VARIANTS

TRANSPORTATION PROBLEM AND VARIANTS TRANSPORTATION PROBLEM AND VARIANTS Introduction to Lecture T: Welcome to the next exercise. I hope you enjoyed the previous exercise. S: Sure I did. It is good to learn new concepts. I am beginning to

More information

Winsor Approach in Regression Analysis. with Outlier

Winsor Approach in Regression Analysis. with Outlier Applied Mathematical Sciences, Vol. 11, 2017, no. 41, 2031-2046 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2017.76214 Winsor Approach in Regression Analysis with Outlier Murih Pusparum Qasa

More information

Credit Card Marketing Classification Trees

Credit Card Marketing Classification Trees Credit Card Marketing Classification Trees From Building Better Models With JMP Pro, Chapter 6, SAS Press (2015). Grayson, Gardner and Stephens. Used with permission. For additional information, see community.jmp.com/docs/doc-7562.

More information

Semester 2, 2015/2016

Semester 2, 2015/2016 ECN 3202 APPLIED ECONOMETRICS 3. MULTIPLE REGRESSION B Mr. Sydney Armstrong Lecturer 1 The University of Guyana 1 Semester 2, 2015/2016 MODEL SPECIFICATION What happens if we omit a relevant variable?

More information

LECTURE 17: MULTIVARIABLE REGRESSIONS I

LECTURE 17: MULTIVARIABLE REGRESSIONS I David Youngberg BSAD 210 Montgomery College LECTURE 17: MULTIVARIABLE REGRESSIONS I I. What Determines a House s Price? a. Open Data Set 6 to help us answer this question. You ll see pricing data for homes

More information

QUESTION 2 What conclusion is most correct about the Experimental Design shown here with the response in the far right column?

QUESTION 2 What conclusion is most correct about the Experimental Design shown here with the response in the far right column? QUESTION 1 When a Belt Poka-Yoke's a defect out of the process entirely then she should track the activity with a robust SPC system on the characteristic of interest in the defect as an early warning system.

More information

The Business of Coupons-Do coupons lead to repeat purchases?

The Business of Coupons-Do coupons lead to repeat purchases? Pursuit - The Journal of Undergraduate Research at the University of Tennessee Volume 5 Issue 1 Article 14 June 2014 The Business of Coupons-Do coupons lead to repeat purchases? Margaret P. Ross University

More information

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015 Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015 NOTE: The routines spost13, lrdrop1, and extremes

More information

Choosing the best statistical tests for your data and hypotheses. Dr. Christine Pereira Academic Skills Team (ASK)

Choosing the best statistical tests for your data and hypotheses. Dr. Christine Pereira Academic Skills Team (ASK) Choosing the best statistical tests for your data and hypotheses Dr. Christine Pereira Academic Skills Team (ASK) ask@brunel.ac.uk Which test should I use? T-tests Correlations Regression Dr. Christine

More information

FACTORS INFLUENCING MICRO AND SMALL ENTERPRISES ACCESS TO FINANCE IN BOTSWANA

FACTORS INFLUENCING MICRO AND SMALL ENTERPRISES ACCESS TO FINANCE IN BOTSWANA Journal of Social and Economic Policy, Vol. 12, No. 2, December 2015, pp. 65-76 FACTORS INFLUENCING MICRO AND SMALL ENTERPRISES ACCESS TO FINANCE IN BOTSWANA MALEFHO K * AND MOFFAT B ** Abstract: This

More information

Getting Started with HLM 5. For Windows

Getting Started with HLM 5. For Windows For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 About this Document... 3 1.2 Introduction to HLM... 3 1.3 Accessing HLM... 3 1.4 Getting Help with HLM... 3 Section 2: Accessing

More information

Education and Labour Productivity in Papua New Guinea's Tuna Processing Industry. H.F. Campbell School of Economics University of Queensland.

Education and Labour Productivity in Papua New Guinea's Tuna Processing Industry. H.F. Campbell School of Economics University of Queensland. Education and Labour Productivity in Papua New Guinea's Tuna Processing Industry H.F. Campbell School of Economics University of Queensland Abstract Wage and personal characteristics data from a sample

More information

FACTORS AFFECTING ON YOUTH PARTICIPATION AND SATISFACTION IN OCCUPATION RELATED TO AGRICULTURE

FACTORS AFFECTING ON YOUTH PARTICIPATION AND SATISFACTION IN OCCUPATION RELATED TO AGRICULTURE FACTORS AFFECTING ON YOUTH PARTICIPATION AND SATISFACTION IN OCCUPATION RELATED TO AGRICULTURE S.D.P. Sudarshanie (108814M) Dissertation submitted in partial fulfillment of the requirements for the degree

More information

Linear model to forecast sales from past data of Rossmann drug Store

Linear model to forecast sales from past data of Rossmann drug Store Abstract Linear model to forecast sales from past data of Rossmann drug Store Group id: G3 Recent years, the explosive growth in data results in the need to develop new tools to process data into knowledge

More information

THE DYNAMICS OF SKILL MISMATCHES IN THE DUTCH LABOR MARKET

THE DYNAMICS OF SKILL MISMATCHES IN THE DUTCH LABOR MARKET THE DYNAMICS OF SKILL MISMATCHES IN THE DUTCH LABOR MARKET Wim Groot* Department of Health Sciences, Maastricht University and "Scholar" Research Centre for Education and Labor Market Department of Economics,

More information

Practical Aspects of Modelling Techp.iques in Logistic Regression Procedures of the SAS System

Practical Aspects of Modelling Techp.iques in Logistic Regression Procedures of the SAS System r""'=~~"''''''''''''''''''''''''''''\;'=="'~''''o''''"'"''~ ~c_,,..! Practical Aspects of Modelling Techp.iques in Logistic Regression Procedures of the SAS System Rainer Muche 1, Josef HogeP and Olaf

More information

A study of cartel stability: the Joint Executive Committee, Paper by: Robert H. Porter

A study of cartel stability: the Joint Executive Committee, Paper by: Robert H. Porter A study of cartel stability: the Joint Executive Committee, 1880-1886 Paper by: Robert H. Porter Joint Executive Committee Cartels can increase profits by restricting output from competitive levels. However,

More information

The impact of banner advertisement frequency on brand awareness

The impact of banner advertisement frequency on brand awareness The impact of banner advertisement frequency on brand awareness Author Hussain, Rahim, Sweeney, Arthur, Sullivan Mort, Gillian Published 2007 Conference Title 2007 ANZMAC Conference Proceedings Copyright

More information

FACTORS AFFECTING JOB STRESS AMONG IT PROFESSIONALS IN APPAREL INDUSTRY: A CASE STUDY IN SRI LANKA

FACTORS AFFECTING JOB STRESS AMONG IT PROFESSIONALS IN APPAREL INDUSTRY: A CASE STUDY IN SRI LANKA FACTORS AFFECTING JOB STRESS AMONG IT PROFESSIONALS IN APPAREL INDUSTRY: A CASE STUDY IN SRI LANKA W.N. Arsakularathna and S.S.N. Perera Research & Development Centre for Mathematical Modeling, Faculty

More information

Opening SPSS 6/18/2013. Lesson: Quantitative Data Analysis part -I. The Four Windows: Data Editor. The Four Windows: Output Viewer

Opening SPSS 6/18/2013. Lesson: Quantitative Data Analysis part -I. The Four Windows: Data Editor. The Four Windows: Output Viewer Lesson: Quantitative Data Analysis part -I Research Methodology - COMC/CMOE/ COMT 41543 The Four Windows: Data Editor Data Editor Spreadsheet-like system for defining, entering, editing, and displaying

More information

Chapter 3. Basic Statistical Concepts: II. Data Preparation and Screening. Overview. Data preparation. Data screening. Score reliability and validity

Chapter 3. Basic Statistical Concepts: II. Data Preparation and Screening. Overview. Data preparation. Data screening. Score reliability and validity Chapter 3 Basic Statistical Concepts: II. Data Preparation and Screening To repeat what others have said, requires education; to challenge it, requires brains. Overview Mary Pettibone Poole Data preparation

More information

Evaluation next steps Lift and Costs

Evaluation next steps Lift and Costs Evaluation next steps Lift and Costs Outline Lift and Gains charts *ROC Cost-sensitive learning Evaluation for numeric predictions 2 Application Example: Direct Marketing Paradigm Find most likely prospects

More information

Standard analysis model for monitoring compliance with wage equality between women and men in federal procurement (methodology)

Standard analysis model for monitoring compliance with wage equality between women and men in federal procurement (methodology) Federal Department of Home Affairs FDHA Federal Office for Gender Equality FOGE Standard analysis model for monitoring compliance with wage equality between women and men in federal procurement (methodology)

More information

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University Multiple Regression Dr. Tom Pierce Department of Psychology Radford University In the previous chapter we talked about regression as a technique for using a person s score on one variable to make a best

More information

Lecture-21: Discrete Choice Modeling-II

Lecture-21: Discrete Choice Modeling-II Lecture-21: Discrete Choice Modeling-II 1 In Today s Class Review Examples of maximum likelihood estimation Various model specifications Software demonstration Other variants of discrete choice models

More information

Report for PAKDD 2007 Data Mining Competition

Report for PAKDD 2007 Data Mining Competition Report for PAKDD 2007 Data Mining Competition Li Guoliang School of Computing, National University of Singapore April, 2007 Abstract The task in PAKDD 2007 data mining competition is a cross-selling business

More information

Modelling Repeat Visitation

Modelling Repeat Visitation European Regional Science Association 40 th European Congress, Barcelona 2000 Modelling Repeat Visitation Jie Zhang AKF (Institute of Local Government Studies) Nyropsgade 37 DK-1602 Copenhagen V Denmark

More information

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

Correlation and Simple. Linear Regression. Scenario. Defining Correlation Linear Regression Scenario Let s imagine that we work in a real estate business and we re attempting to understand whether there s any association between the square footage of a house and it s final selling

More information

arxiv: v1 [cs.lg] 13 Oct 2016

arxiv: v1 [cs.lg] 13 Oct 2016 Bank Card Usage Prediction Exploiting Geolocation Information Martin Wistuba, Nghia Duong-Trung, Nicolas Schilling, and Lars Schmidt-Thieme arxiv:1610.03996v1 [cs.lg] 13 Oct 2016 Information Systems and

More information

Commitment and discounts in a loyalty model

Commitment and discounts in a loyalty model Commitment and discounts in a loyalty model Martin Karvik Masteruppsats i försäkringsmatematik Master Thesis in Actuarial Mathematics Masteruppsats 2016:1 Försäkringsmatematik Juni 2016 www.math.su.se

More information

Logistic Regression with Expert Intervention

Logistic Regression with Expert Intervention Smart Cities Symposium Prague 2016 1 Logistic Regression with Expert Intervention Pavla Pecherková and Ivan Nagy Abstract This paper deals with problem of analysis of traffic data. A traffic network has

More information