Comparative analysis on the probability of being a good payer

Size: px
Start display at page:

Download "Comparative analysis on the probability of being a good payer"

Transcription

1 Comparative analysis on the probability of being a good payer V. Mihova, and V. Pavlov Citation: AIP Conference Proceedings 1895, (2017); View online: View Table of Contents: Published by the American Institute of Physics Articles you may be interested in Planning outstanding reserves in general insurance AIP Conference Proceedings 1895, (2017); / Mathematical modeling of the heat transfer during pyrolysis process used for end-of-life tires treatment AIP Conference Proceedings 1895, (2017); / Preface: Application of Mathematics in Technical and Natural Sciences 9th International Conference - AMiTaNS 17 AIP Conference Proceedings 1895, (2017); / The method of trend analysis of parameters time series of gas-turbine engine state AIP Conference Proceedings 1895, (2017); / Regression trees modeling and forecasting of PM10 air pollution in urban areas AIP Conference Proceedings 1895, (2017); / Analysis and modeling of daily air pollutants in the city of Ruse, Bulgaria AIP Conference Proceedings 1895, (2017); /

2 Comparative Analysis on the Probability of Being a Good Payer V. Mihova a) and V. Pavlov b) University of Ruse, 8 Studentska str., 7017 Ruse, Bulgaria a) Corresponding author: vmicheva@uni-ruse.bg b) vpavlov@uni-ruse.bg Abstract. Credit risk assessment is crucial for the bank industry. The current practice uses various approaches for the calculation of credit risk. The core of these approaches is the use of multiple regression models, applied in order to assess the risk associated with the approval of people applying for certain products (loans, credit cards, etc.). Based on data from the past, these models try to predict what will happen in the future. Different data requires different type of models. This work studies the causal link between the conduct of an applicant upon payment of the loan and the data that he completed at the time of application. A database of 100 borrowers from a commercial bank is used for the purposes of the study. The available data includes information from the time of application and credit history while paying off the loan. Customers are divided into two groups, based on the credit history: Good and Bad payers. Linear and logistic regression are applied in parallel to the data in order to estimate the probability of being good for new borrowers. A variable, which contains value of 1 for Good borrowers and value of 0 for Bad candidates, is modeled as a dependent variable. To decide which of the variables listed in the database should be used in the modelling process (as independent variables), a correlation analysis is made. Due to the results of it, several combinations of independent variables are tested as initial models both with linear and logistic regression. The best linear and logistic models are obtained after initial transformation of the data and following a set of standard and robust statistical criteria. A comparative analysis between the two final models is made and scorecards are obtained from both models to assess new customers at the time of application. A cut-off level of points, bellow which to reject the applications and above it to accept them, has been suggested for both the models, applying the strategy to keep the same Accept Rate as in the current data. INTRODUCTION Credit risk is the risk of default on a debt due to a borrower failing to make required payments (on credit line, loan, mortgage, etc.). Duties of the borrower include everything, connected with the loan (credit): principals, coupons, interest, etc. The existence of credit risk raises a number of consequences [1]. At first the banking and financial industry, which is affected by this type of risk, in most countries is allowed to collect personal data about the borrowers. This data is used to build mathematical models for assessment of the credit risk. Moreover, changes in the behavior of financial companies arise, based on assessments of credit risk. These changes include restructuring of clients accounts, portfolio diversification, activity specialization, changes in interest rates and individual interest levels to different customers, depending on their solvency. Given the number of corporate relationships that arise due to the presence of credit risk, that risk is essential for the success of the financial institutions. Statistical models are commonly used in the banking industry in order to assess the credit risk associated with the approval of people applying for certain products (loans, credit cards, etc.). Based on data from the past, these models try to predict what will happen in the future [2]. Different type of models (linear, logistic, polynomial, reciprocal, etc.) could be used according to the available data and creditor s purposes. Sometimes several different models have to be tested in order to choose the one that gives the best fit to the data. As a result of the built statistical model each candidate receives an estimate in the form of points. The points of all candidates draw up a point system or the so-called scorecard. Due to the type of the model, some results are Application of Mathematics in Technical and Natural Sciences AIP Conf. Proc. 1895, ; Published by AIP Publishing /$

3 more difficult to interpret. The interpretation of the results of the linear model is straightforward, as it models directly the dependent variable Y, while in the logistic regression, for example, the estimates are in log-odds units, which makes them difficult to interpret. A relation between the information that each client submits at the time of application for a product and the behavior of that client after he has been approved for the product is made [1]. A regression model is used to assess how the behavior of the approved customers (good or bad customers) depends on the data submitted in the application form for the product. The resulting relation can be used to assess new customers even in the time of application. The purpose of the scorecard is to model the resulting mechanism. When the model is based on the data submitted in the application form, not on the repayment behavior, then this mechanism could be modeled as the probability the candidate to be good or bad [3]. (In fact, the linear model gives the likelihood the candidate to be a good payer, while the logistic model gives the log-likelihood of being a good payer.) The above statement is described mathematically as follows: p P( y 1/ x) 1 P( y 0 / x), where x a vector that contains the information from the application form; p the probability a candidate with characteristic vector x to be good; y=1 the event the candidate is Good ; y=0 the event the candidate is Bad. This work has studied the causal link between the conduct of an applicant upon payment of the loan (the dependent variable, y) and the data that he completed at the time of application (independent variables x ( x 1,..., x k ) ). Linear and logistic regression have been used to estimate the probability of being good (or the log Good: Bad odds in the case of logistic model). PROBLEM FORMULATION A database of 100 borrowers from a commercial bank is used for the purposes of the study. They have applied for loans in the first quarter of The available data includes information from the time of application and credit history while paying off the loan. Customers are divided into two groups, based on the credit history: Good and Bad payers. If in the last 18 months a borrower has no more than one missed payment, he is Good, otherwise the candidate is Bad. The Good: Bad ratio (G:B Odds) for the monitored customers is 1.86 (65 Good borrowers and 35 Bad borrowers). Linear and logistic regression are applied in parallel to the data. A variable named Good Flag (Y), which contains the information mentioned above (value of 1 for Good borrowers and value of 0 for Bad candidates), is modeled as a dependent variable. A correlation analysis is made in order to decide which of the variables listed in the database should be used in the modelling process (as independent variables). Due to the results of this analysis, several combinations of independent variables are tested as initial models both with linear and logistic regression. To choose a starting point between the initial test models, the linear ones are compared by Adjusted R-squared, while for the comparison of the logistic ones is used Nagelkerke R-squared. Both the linear and the logistic initial test models are tested for regression significance. After choosing a starting point for the two types of regression, final linear and logistic models are obtained after new iterations, using initial transformation of the data and following a set of standard and robust statistical criteria. A comparative analysis between the two final models is made. As results from both of the models, two primary scorecards are made correspondingly, where all the coefficients are statistically significant and meet economic expectations. The scorecards could be used to assess new borrowers each applicant could be passed through the preferred scorecard and will receive certain points (formed on the basis of the available data for the person at the time of application). The bank could set a cut off level of points, bellow which to reject the applications and above it to accept them. The independent variables / indicators that are analyzed are presented in Table 1. The indicators X1, X2, X3, X4, X5 and X11 are quantitative, while X6, X7, X8, X9 and X10 are qualitative. The qualitative variables have the following categories: - Residential Status (X6): H (Homeowner), L (Living with parents), T (Tenant), O (Other); - Loan Payment Method (X7): B (Bank Payment), Q (Cheque), S (Standing Order), X (Not Given); - Occupation Code (X8): B (Self-employed), M (Employee), P (Pensioner), O (Other); - Account Type (X9): FL (Fixed Loan), VL (Variable Loan); - Marital Status (X10): D (Divorced), M (Married), S (Single), W (Widow), Z (Not Given)

4 TABLE 1. List of independent variables / indicators. Indicators / Variables Designation Units of Measurement Time with Bank Months Time in Employment Months Gross Annual Income Euro Loan Amount Euro Age of Applicant Years Residential Status Category Loan Payment Method Category Occupation Code Category Account Type Category Marital Status Category Loan as Percent of Income Percentages MODELLING The present data analysis has been done by SPSS [4]. Basic steps for statistical analysis in SPSS are presented in [5, 6, 7]. Dummy variables have been created for each of the qualitative variables in order to quantify the qualitative data. Dummies are variables that take the value of 0 (for the absence of an attribute) and 1 (for its presence) and are used in the same way in regression analysis as the quantitative variables. The dummies correspond to the number of classes / categories of the qualitative variable - 1. Thus, the category for which no dummy variable is created, is used as a reference category [8]. Let us take for an example the variable Residential Status (X6). The category Homeowner is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - set to 1 if the applicant lives with his parents and to 0 in all other cases; - set to 1 if the applicant rents a house and to 0 in all other cases; - takes value of 0 if the applicant is a homeowner, lives with parents or rents a house, and 1 in all other cases. Another example: the variable Loan Payment Method (X7). The category Bank Payment is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - Loan Payment Method Cheque 7.1) set to 1 if the applicant pays with cheque and to 0 in all other cases; - Loan Payment Method Standing Order 7.2) set to 1 if the applicant uses standing order to pay and to 0 in all other cases; - Loan Payment Method Not Given 7.3) takes value of 0 if the applicant pays via bank payment, cheque or standing order, and 1 in all other cases. In order to decide which of the independent variables should be used in the model, a correlation analysis has been made. Due to the presence of qualitative (nominal) variables, Cramer s V [9] have been used to measure the correlation. The existing correlations between the used indicators and the degree of their significance (** - significant correlation at P = 99%, * - significant correlation at P = 95%) have been established using the everyone against everyone method (Table 2). Based on the results of the correlation analysis, some of the indicators have been excluded from the model. X3, X4, X5 and X9 have been excluded from the modelling process due to statistically insignificant correlation with the dependent variable Y

5 TABLE 2. The correlation matrix. Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 Y 1.00 X1 0.56* 1.00 X2 0.54* X ** 1.00 X * ** 1.00 X ** 0.79* X6 0.36** 0.56** ** ** 1.00 X7 0.45** ** 0.89** 0.79* 0.71** X8 0.30* ** ** 1.00 X * X * ** 0.61** 0.35** 0.31** X * ** 0.85** 0.78** ** 0.89** * Due to statistically significant correlation between the independent variables that could lead to displacement of scores and hence wrong conclusions from the model, the following combinations of the indicators have been tested. M1: X1, X2 and X10 enter the model. (X6, X7, X8 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M2: X1, X8 and X11 enter the model. (X2, X6, X7 and X10 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M3: X1, X10 and X11 enter the model. (X2, X6, X7 and X8 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M4: X2, X6 and X8 enter the model. (X1, X7, X10 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) M5: X6 and X7 enter the model. (X1, X2, X8, X10 and X11 are excluded from the modelling process due to significant correlations with some of the rest indicators, as can be seen in Table 2.) The results from the initial test models are briefly outlined in Table 3. TABLE 3. Statistics summary of the initial test models. Model M1 M2 M3 M4 M5 Lin. Log. Lin. Log. Lin. Log. Lin. Log. Lin. Log. R-squared * Regression Significance ** Correctly classified cases (%) *** *Adjusted R-squared is given for the linear models and Nagelkerke R-squared (which is pseudo R-squared) is used to assess the logistic models. ** F-test is used for the linear models and Chi-square - for the logistic models. The p-value is given in the table. *** These statistics are collected only for the logistic models. The statistics, stated in Table 3, are used to choose one of the five models as starting point. The adjusted R- squared is greatest for the Linear Model M5. Also, the Logistic Model M5 has greatest Nagelkerke R-squared. Both the linear and logistic regression, version M5, are statistically significant even at significance level of 1%

6 Linear Regression Version M5 of the linear model is used as starting model. It has the following general form Y b b X b X b X b X b X b X 7.3 The values of the coefficients in front of each of the indicators are listed in Table 4. The table shows also that some of the coefficients are statistically insignificant (X6.1, X7.1 and X7.2) at the significance level of 5%. TABLE 4. Coefficients in the Linear Regression Initial Model M5 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X X X X By replacing in the general form of the model with the relevant coefficients from Table 4 the following equation is obtained: Y X X X 0.318X 0.088X X The coefficients in front of the indicators make sense except of the one in front of X6.1. The expectations for the applicants who live with their parents (X6.1) are that they cannot be better than the homeowners. To handle this situation, the applicants, who live with their parents, could be added to the null group (i.e., X6.1 to be excluded from the model). Another reason to exclude X6.1 from the modelling process is that the coefficient in front of it is not significant. As it comes to the loan payment method, it is worthy to combine the applicants who pay with cheque (X7.1) with the ones who use standing order (X7.2) and to check whether the new variable (X7.12) will have statistically significant coefficient. After applying the mentioned changes to the Initial Linear Model M5, the following model (Version M5.1) is obtained Y X 0.324X 0.078X X. 7.3 The adjusted R-squared for the Linear Model M5.1 is 0.269, which is better than for the initial version. The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Fvalue is 10.1). All of the coefficients are statistically significant at the significance level of 5% except X7.12 (Table 5). TABLE 5. Coefficients in the Linear Regression Model M5.1 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X X One way to handle the situation with the insignificant coefficient of X7.12 is to add it to the null group (Version M5.2), i.e., to leave in the null group everybody, except the ones with not given loan payment method that way they will be punished for the missing information (with negative coefficient). Another way is to add the applicants with not given payment method to the null, but let the ones with bank payment to get points (Version M5.3). Of course, if the null group is changed, the applicants who pay with cheque or with standing order can be tested again in the model together (Version M5.4) or separately (Version M5.5). The suggested changes have been analyzed. Statistics summary for the versions of the linear model could be seen in Table

7 In all of the new versions of the linear model (from M5.2 to M5.5) the logic behind residential status remains the same (i.e., negative points for X and more negative points for X). In Linear Model Version M5.2 the applicants with not given loan payment method receive negative points, which meets the economic expectations. In Linear Model Version M5.3 the applicants with bank payment receive positive points. In Linear Model Version M5.4 the applicants with bank payment again receive positive points, and those who pay via cheque or standing order are better than them. In Linear Model Version M5.5 the applicants with bank payment again receive positive points, those who pay via standing order are better than them and those who pay with cheque are the best. TABLE 6. Statistics summary of the linear models. Versions of the Linear Model / Statistics Adjusted R- squared Regression Significance * M5 M5.1 M5.2 M5.3 M5.4 M * F-test is used to test the significance of the model. The p-value is given in the table. How to choose the best model? The Adjusted R-squared of the versions M5.2, M5.4 and M5.5 are close to one another. All the models are statistically significant even at significance level of 1%. In principle, it is better to have more categories of the variable Loan Payment Method in the model, as this would give better differentiation between the applicants (i.e., to choose between versions M5.4 and M5.5 of the linear model). But there are some other considerations, on the basis of which Linear Model Version M5.2 is preferred in this work. The applicants with bank payment represent 80% of the used database. It is better to leave such a big group in the null, as it will make no further distinction at all. Let us proceed with Linear Model Version M5.2. It has the following form Y X X X % of the change in the behavior of the applicant (the dependent variable Y) is determined by a change in the modeled independent variables (Adjusted R-squared is 0.274) [7, 8]. As previously mentioned, Linear Model Version M5.2 adequately reflects the dependency of the variables, even at a level of significance of 1%, which can be seen also from Table 7. Variations TABLE 7. ANOVA table for Linear Model Version M5.2. Sum of squares Degrees of freedom Degrees of freedom 7.3 F-value Significance level ESS RSS TSS All of the coefficients are statistically significant at the significance level of 5% (Table 8). TABLE 8. Coefficients in the Linear Regression Model M5.2 and their significance. Indicator Coefficient (b) t-value Significance level Intercept X X X

8 The model conducted above gives the likelihood the candidate to be a good payer. The scorecard, corresponding to the considered model, can be seen in Table 9 (the coefficients are multiplied by 1000 for clarity). Accordingly, each borrower can be passed through this scoring system. The model could be interpreted as follows. Each candidate starts at 816 points. If the applicant rents a house, he will get 393 points; if he is a homeowner or lives with parents, he will get no points on the variable Residential Status; if he has some other status (not in the mentioned ones) he will get 330 points. If the applicant is going to pay via bank payment, cheque or standing order, he will get no points on the variable Loan Payment Method. If he did not give information about the way of payment, his score will be deducted with 692 points. Linear Model Version M5.2 is used as the final linear model in this work. TABLE 9. Scorecard from Linear Model Version M5.2. No. Variable Points INTERCEPT RESIDENTIAL STATUS (X6) Homeowner (X6.0) 0 Living with Parents (X6.1) 0 Tenant (X) Other status (X) LOAN PAYMENT METHOD (X7) Bank Payment (X7.0) 0 Cheque (X7.1) Standing Order (X7.2) 0 Not Given (X7.3) 692 Logistic Regression Version M5 of the logistic model is used as starting model. The variable Good Flag (Y) takes values of 1 (Good borrowers) and 0 (Bad borrowers) with probabilities respectively p and 1-p. The starting model has the following general form p log e b0 b6.1x 6.1 b X b X b7.1x 7.1 b7.2 X 7.2 b7.3 X p The values of the coefficients in front of each of the indicators are listed in Table 9. The table shows that most of the coefficients are statistically insignificant (X6.1, X7.1, X7.2, X7.3 and the intercept) at the significance level of 5%. TABLE 10. Coefficients in the Logistic Regression Initial Model M5 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X *10 8 X X X X X By replacing in the general form of the model with the relevant coefficients from Table 10 the following equation is obtained p loge X X 1.598X 0.598X X X p Before interpreting the coefficients in front of the indicators, it is better to handle with the insignificant ones

9 Using the same logic as in the linear model, it is a good idea to add the applicants, who live with their parents (X6.1) to the null group (i.e., X6.1 to be excluded from the model). As it comes to the Loan Payment Method (X7), it has been checked that the overall variable X7 is statistically insignificant (p-value=0.946) even at the significance level of 20%, so it is better to exclude it from the model. After applying the suggested changes to the Initial Logistic Model M5, the following model (Version M5.1) is obtained p loge X X. 1 p The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Chi-square-value is 12.57). 71% of the cases are correctly classified. All of the coefficients are statistically significant at the significance level of 5% except the intercept (Table 11). TABLE 11. Coefficients in the Logistic Regression Model M5.1 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X Because the coefficients are in log-odds units, they are often difficult to interpret. Therefore, it is better to use the odds ratios (the exponentiation of the coefficients), listed in the right-most column in Table 11 labeled Exp(b), or the inverted odds ratios (1/odds ratios). The inverted odds ratios for X and X indicate that the odds of being good for the applicants who own a home or live with their parents are approximately 9 times higher than for the ones who rent a house (1/0.107) and approximately 4 times higher than for the rest applicants (1/0.267). This corresponds to the logic from the linear model. The Nagelkerke R-squared for the Logistic Model M5.1 is For comparison, the Nagelkerke R-squared for the Logistic Model M5 was and the Nagelkerke coefficients of all of the initial logistic models were higher. In this case, it is worthy to check the Initial Logistic Model M4 and approve it (it contains X6 and has Nagelkerke R-squared of the highest after the one of Initial Version M5). The Initial Logistic Model Version M4 has the following form p loge X X X 0.880X 0.563X X X p The coding of the variable Occupation Code (X8) is as follows. The category Self-employed is selected as a reference, since there falls the largest percentage of the observations. No dummy was created for this category. For the remaining categories dummy variables were created as follows - set to 1 if the applicant is an employee and to 0 in all other cases; - set to 1 if the applicant is a pensioner and to 0 in all other cases; - takes value of 0 if the applicant is self-employed, employee or pensioner, and 1 in all other cases. The values of the coefficients in front of each of the indicators are listed in Table 12. TABLE 12. Coefficients in the Logistic Regression Initial Model M4 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X *10 8 X X X X X

10 The table shows that most of the coefficients are statistically insignificant (X6.1, X, X8.1, X8.2, X8.3 and the intercept) at the significance level of 5%. It could be seen that at the significance level of 10% X8.2 and X8.3 are significant. It could be seen from Table 12 that for every one-month increase in Time in Employment (X2), a increase in the log-odds of Good Flag (Y) is expected, holding all other independent variables constant. This meets the economic expectations. As the coefficient in front of X6.1 is again insignificant at the significance level of 5%, it is better to add the applicants, who live with their parents, to the null group (i.e., X6.1 to be excluded from the model) and see what will happen with the rest categories of Residential Status (X6). The same logic could be applied to Occupation Code (X8) exclude X8.1 and check what will happen with the rest categories. After applying the suggested changes to the Initial Logistic Model 4, the following model (Version M4.1) is obtained p loge X X 0.939X 0.913X X p The new version of the model adequately reflects the dependency of the variables, even at a level of significance of 1% (Chi-square-value is 24.43). 79% of the cases are correctly classified. The Nagelkerke R- squared for the Logistic Model M4.1 is All of the coefficients are statistically significant at the significance level of 15%, although X, X8.2 and X8.3 are insignificant at the significance level of 5% (Table 13). TABLE 13. Coefficients in the Logistic Regression Model M4.1 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X X X Logistic Model Version M4.1 could be treated as the final logistic model (at significance level of 15%) or some changes could be applied to it in order to approve the significance level. A good starting point for a new version of the model is to bring together X8.2 and X8.3 (coded with the dummy variable X8.23), as they both are expected to be worse than the null group. After applying the suggested changes, the following model (Version 4.3) is obtained p loge X X 0.993X X p All of the coefficients are statistically significant at the significance level of 10%, although X and the intercept are insignificant at the significance level of 5% (Table 14). TABLE 14. Coefficients in the Logistic Regression Model M4.2 and their significance. Indicator Coefficient (b) Wald Chi-square value Significance level Exp(b) Intercept X X X The new version of the logistic model is statistically significant (Chi-square-value is 23.71), p < The model explains 29.1% of the variance in Good Flag (Nagelkerke R-squared is 0.291) and correctly classifies 79% of cases

11 TABLE 15. Scorecard from Logistic Model Version M4.2. No. Variable Points INTERCEPT TIME IN EMPLOYMENT (X2) 7 2 RESIDENTIAL STATUS (X6) Homeowner (X6.0) 0 Living with Parents (X6.1) 0 Tenant (X) 2108 Other status (X) OCCUPATION CODE (X8) Self-employed (X8.0) 0 Employee (X8.1) 0 Pensioner (X8.2) 1089 Other (X8.3) 1089 Increasing Time in Employment (X2) is associated with an increased likelihood of being a good payer. The inverted odds ratios for X and X indicate that the odds of being good for the applicants who own a home or live with their parents are ~ 8 times higher than for the ones who rent a house and ~ 3 times higher than for the rest applicants (the inverted odds ratio for X is 1/0.121 and for X is 1/0.371). Employees and self-employed are ~ 3 times more likely to be good payers than pensioners and other applicants (the inverted odds ratio for X8.23 is 1/0.336). The scorecard, corresponding to the considered model, could be seen in Table 15 (the coefficients are multiplied by 1000 for clarity). In order to improve the significance level of the coefficients, X and X could be grouped together. This option is tested, but then the intercept becomes insignificant even at the significance level of 20% (p-value=0.297). Logistic Model Version M4.2 is used as the final logistic model (at significance level of 10%) in this work. It has better level of significance of the coefficients than Logistic Model Version M4.1 and commensurable performance. COMPARATIVE ANALYSIS After the final linear and logistic models have been chosen, here comes the question how to compare them. Using R-squared for this purpose is not a good idea, because both models need to work with the same sample size and the same dependent variables. The second condition is violated here the logistic model gives the loglikelihood of being a good payer, while the linear model gives the likelihood the candidate to be a good payer. There are some transformations, on the basis of which the coefficients of determination of the two models could be compared [8]. Instead of doing this, some other criteria of goodness of fit have been checked below for both the models. The reason is that R-squared is not the best criteria to look at for the logistic models. Something more, the quest for determining the right model is not to get a model with the highest coefficient of determination, but to get estimates of regression coefficients and to be able to draw statistical conclusions about these coefficients. It often happens in the empirical analysis to obtain models with a very high coefficient of determination, but the coefficients of the model to be statistically insignificant or even with the opposite sign of the theory. What is important is to obtain a model with statistically significant coefficients that meet the theoretical (economic) expectations. Therefore, the coefficient of determination has been used as a key factor (along with the overall significance of the model) only when choosing the initial models and after that more attention has been paid on the significance of the model coefficients. It should also be noticed that a multicollinearity check has been performed for both the final models and the results show that there is no multicollinearity neither for the linear, nor for the logistic model. The Variance Inflation Factor (VIF) has been used for this check. The variables from the final linear model have VIF between 1.02 and 1.05, while those from the final logistic model are with VIF between 1.02 and

12 Score Distribution Reports In order to compare the final linear and logistic models score distribution reports have been created, including the number and the percentages of Goods and Bads, as well as the Good: Bad Ratio (G:B Odds) per score band. The scores from the two models have been created from the Predicted Probability of Being Good (multiplied by 1000 for clarity), obtained from the corresponding models. The scores have been classed on 5 bands with approximately 20% of the population in each band. A good model is a model that have a monotonically increasing trend, i.e., more Bad candidates in the lower score bands, and increasing Good ones in the upper bands. In other words, the G:B Odds should increase with increasing of the points. The score distribution for the linear model (Table 16) shows that there is a big cluster of observations in the highest band. In fact, 64% of the applicants have a score greater than 500, or to be more precise the score of all of them is 817 points, which means there is no good distinction between the applicants in the highest band. Despite of this, the overall report for the linear score looks good the G:B Odds are increasing per interval (score bands arranged in ascending order). In Figure 1a it could be seen that the number of Goods is increasing per interval, but the number of Bads varies this is due to the bigger percentages of observations in the last bands (due to the clusters). Class of points TABLE 16. Score Distribution Report for Linear Model Version 5.3. # Goods % Goods # Bads % Bads G:B Odds # Total % Total [Low : 100] 0 0.0% 3 8.6% % (100 : 200] 0 0.0% % % (200 : 450] 2 3.1% % % (450 : 500] % % % (500 : High] % % % Total % % % The score for the logistic model is better distributed (Table 17), there are not such big clusters as in the linear model. The G:B Odds are increasing per interval and Figure 1b shows that the logistic score has decreasing number of Bads and increasing number of Goods per interval (score bands arranged in ascending order). Class of points TABLE 17. Score Distribution Report for Logistic Model Version 4.3. # Goods % Goods # Bads % Bads G:B Odds # Total % Total [Low : 450] 5 7.7% % % (450 : 550] % % % (550 : 750] % % % (750 : 850] % 3 8.6% % (850 : High] % 3 8.6% % Total % % % Gini Index The power of the model could be measured using the Gini index. The Gini index (or Gini coefficient) is a measure of statistical dispersion introduced by the Italian statistician and sociologist Corrado Gini in [10]. In fact, the Gini index is an example of a Lorentz curve that is used to estimate inequalities [11]. Gini index of 0 shows perfect equality, and index of 1 (if calculated as number, or 100 if calculated as percentage) shows perfect inequality (excellent discrimination between the variables)

13 In this work the calculation of the Gini index is based on a comparison between the cumulative number of Goods and Bads for a given interval of points. It has been calculated (using the score distribution reports) that the Gini index of the Linear Score is (or 55.61%) and the Gini index of the Logistic Score is (or 59.65%). This means that the logistic model gives a little bit better discrimination. (a) (b) FIGURE 1. Bar Charts for the number of Goods and Bads per score interval: (a) for Linear Score; (b) for Logistic Score FIGURE 2. Gini index graphs for the final linear and logistic scores

14 An interesting fact was observed while drawing the graphs of the two Gini indexes. Figure 2 shows that the linear score performs better than the logistic score in the lower bands: the blue curve, corresponding to the linear model, is farther from the green line (the line of perfect equality). The situation in the upper bands is the opposite: the red line, corresponding to the logistic model, is farther from the line of perfect equality, which means that the logistic score performs better than the linear one here. In fact, the Gini index for the linear model represents the area between the line of perfect equality (the green line) and the curve of cumulative Goods (%) against cumulative Bads (%) for the linear score (the blue curve). Similarly, the Gini index for the logistic model represents the area between the green line and the red curve. The main disadvantage of the Gini coefficient is the implication of the relative value of the changes that may occur in different parts of the distribution [12]. Moving from the points of better to the points of worse applicant has a significantly greater impact on the Gini index if the two individuals are close to the average than if they were at either end of the score distribution. In some cases, this is a desired effect. In this work especially, the average bands of the score are important, as here is the place to set different credit limits. As the logistic score works better with the Good applicants (the upper bands) and has better Gini index, it could be used to set credit limits. At the same time, the linear score discriminates better between the Bad applicants (the lower bands) and therefore it could be used to set a cut-off level, which is to be done in the next section. Cut-Off Level How to separate Good from Bad candidates, that is, how to decide to whom to grant a loan? Once there is a working model, it already has a scoring system that can divide people to be rejected by those who will be approved. Let us fix a point level c and accept candidates who have received points that are higher or equal to that level, and reject those with points lower than c. This level is called Cut-off and is determined by the risk that the borrower is ready to take [3]. Different strategies could be taken, depending on the goal which the creditor pursues: retaining the same Accept Rate, but increasing the number of Goods; more accepted candidates, but with higher G:B Odds, etc. Two types of mistakes could be made when setting a cut off level: a first-order error and a second-order error. In the credit scoring a first-order error occurs when a Bad payer is marked as Good. When a Good payer is marked as Bad, then this is a second-order error. The cut-off level depends on how big first-order error is likely to accept the creditor. In the reviewed data the initial level of accepted candidates is ~ 90% (Accepted / (Accepted + Rejected)), while the initial Bad Rate (Bads / Accepted) is 35%. In order to compare the results of both methods, it will be checked how the Bad levels would change within each of the methods, keeping the level of accepted candidates at 90%. For the logistic model Accept Rate of ~ 90% corresponds to a cut-off level of approximately 283 points (i.e., to reject the 10%, which have points below (or equal) that level and accept the 90% with points above that level). For the linear model Accept Rate of ~ 90% corresponds to a cut-off level of approximately 125 points (as already mentioned, there are clusters of observations at certain points in the score from the linear model and for that reason cannot be rejected exactly 10% of the applicants, but the most closer is 9% Rejects). The cut-off points, the Accept Rates and the Bad Rates, corresponding to that cut-off levels of both the linear and the logistic model, could be seen in Table 18. TABLE 18. Cut-off analysis with strategy to keep the same Accept Rate. Type of the final model Cut-off level (points) Accept Rate Bad Rate Linear % 28.57% Logistic % 30.00% One could easily notice that both the models have lower Bad Rate than the current one. The linear model has a little bit lower Bad Rate than the logistic one, but they are comparable, having in mind the small number of Bads (in fact, there are no Goods in the rejected applicants from the linear model, and there are only 2 Goods in the rejected applicants from the logistic model). To prove this, a confidence interval on the number of Bads has been built. In fact, the Good Flag has binomial distribution (0 for the Bad applicants and 1 for the Good ones). For a

15 sample of 100 applicants with 35 Bads in them, the 95% confidence interval for the number of Bads shows that they can vary from 26 to 45, while as the 90% confidence interval shows from 28 to 43 Bads. Anyway, it is better to use the linear score for setting a cut-off level, as the Gini index graph (Figure 2) shows the linear model works better with the Bad applicants. Remark: A common practice to validate the results from the final model is to randomly separate small part from the population, which is excluded from the modeling process. The goal is to be able to demonstrate the performance of the model from three different points of view: the whole population; the population, on which the model is built (~ 80%); the population, which is excluded from modeling (~ 20%). By using a sample that is not taken in the modeling process, it will be possible to demonstrate how the model actually works on candidates that are new. In this case, they have data on whether they were Good or Bad, and it will be easy to check the discrimination, imposed by the model. In this work no validation sample is left because of the small amount of the data. But when working with big samples, it is very useful to use such sample to test the stability of the model. CONCLUSIONS The relationship between the information that each borrower submitted at the time of application for a product and his behavior, after he has been approved for the product, can be used to estimate the probability of being good for new customers. This could be done using different types of statistical models. Two of these models linear and logistic, have been compared in this work, using a database of 100 borrowers and following a set of standard and robust statistical criteria. As a result, two scorecards have been obtained from both models to assess new customers at the time of application. The comparative analysis between the two final models shows that the logistic model is more precise, but the linear model is easier to interpret. Furthermore, the linear model could be used to set a cut-off level, bellow which to reject the applicants, as it works better in identifying the Bad applicants. At the same time, the logistic model could be used to determine credit limits or annual percentage rates for the accepted candidates, to choose Good applicants for loyalty programs, etc. it works better with the Good applicants. The presented algorithm and criteria could be used for other data in a similar situation. Recommendation: Banks could set up strategies to work with the Bad applicants and take out the best of them, i.e., to think how exactly to approach them, so that they can still pay. Appropriate techniques for this purpose could be rescheduling, renegotiation, discounts, etc. ACKNOWLEDGMENTS This paper contains results of the work on project No 2017-FNSE-05, financed by Scientific Research Fund of Ruse University, Bulgaria. REFERENCES 1. V. Mihova and V. Pavlov (2017) An approach of estimating the probability of default for new borrowers, to appear in Economy & Business Journal D. Montrichard, Reject inference methodologies in credit risk modeling, in SESUG 2008, The Proceedings of the SouthEast SAS Users Group, 3. J. Mok, Reject inference in credit scoring, BMI paper, Amsterdam, SPSS IBM Statistics, 5. V. Goev, Statistical Processing and Data Analysis from Sociological, Marketing and Political Studies with SPSS (University Press Stopanstvo, Sofia, 1996). [in Bulgarian] 6. Manov, Statistics with SPSS (Trakia-M, Sofia, 2001). [in Bulgarian] 7. V. Pavlov and V. Mihova, Applied Statistics with SPSS (Avangard Print, Ruse, 2016). [in Bulgarian] 8. D. Gujarati, Basic Econometrics, 4th edn (McGraw Hill, New York, 2005). 9. H. Cramér, Mathematical Methods of Statistics (Princeton University Press, Princeton, 1946). 10. C. Gini, Variabilità e Mutabilità (C. Cuppini, Bologna, 1912)

16 11. A. Marshall, I. Olkin, and B. Arnold, Inequalities: Theory of Majorization and Its Applications (Academic press; New York, 1979). 12. F. Cowell, Measuring Inequality, LSE Perspectives in Economic Analysis (Oxford University Press, UK, 2008)

RESULT AND DISCUSSION

RESULT AND DISCUSSION 4 Figure 3 shows ROC curve. It plots the probability of false positive (1-specificity) against true positive (sensitivity). The area under the ROC curve (AUR), which ranges from to 1, provides measure

More information

CHAPTER 5 RESULTS AND ANALYSIS

CHAPTER 5 RESULTS AND ANALYSIS CHAPTER 5 RESULTS AND ANALYSIS This chapter exhibits an extensive data analysis and the results of the statistical testing. Data analysis is done using factor analysis, regression analysis, reliability

More information

CREDIT RISK MODELLING Using SAS

CREDIT RISK MODELLING Using SAS Basic Modelling Concepts Advance Credit Risk Model Development Scorecard Model Development Credit Risk Regulatory Guidelines 70 HOURS Practical Learning Live Online Classroom Weekends DexLab Certified

More information

UNIVERSITY OF STELLENBOSCH ECONOMICS DEPARTMENT ECONOMICS 388: 2018 INTRODUCTORY ECONOMETRICS

UNIVERSITY OF STELLENBOSCH ECONOMICS DEPARTMENT ECONOMICS 388: 2018 INTRODUCTORY ECONOMETRICS UNIVERSITY OF STELLENBOSCH ECONOMICS DEPARTMENT ECONOMICS 388: 2018 INTRODUCTORY ECONOMETRICS Lecturer: Le Roux Burrows Office: Schumann 512 Telephone: 021-808-2243 E-Mail: LRB@SUN.AC.ZA A AIM OF THE COURSE

More information

W H I T E P A P E R. How to Credit Score with Predictive Analytics

W H I T E P A P E R. How to Credit Score with Predictive Analytics How to Credit Score with Predictive Analytics Managing Credit Risk Credit scoring and automated rule-based decisioning are the most important tools used by financial services and credit lending organizations

More information

Correcting Sample Bias in Oversampled Logistic Modeling. Building Stable Models from Data with Very Low event Count

Correcting Sample Bias in Oversampled Logistic Modeling. Building Stable Models from Data with Very Low event Count Correcting Sample Bias in Oversampled Logistic Modeling Building Stable Models from Data with Very Low event Count ABSTRACT In binary outcome regression models with very few bads or minority events, it

More information

Credit Scoring, Response Modelling and Insurance Rating

Credit Scoring, Response Modelling and Insurance Rating Credit Scoring, Response Modelling and Insurance Rating Also by Steven Finlay THE MANAGEMENT OF CONSUMER CREDIT CONSUMER CREDIT FUNDAMENTALS Credit Scoring, Response Modelling and Insurance Rating A Practical

More information

Telecommunications Churn Analysis Using Cox Regression

Telecommunications Churn Analysis Using Cox Regression Telecommunications Churn Analysis Using Cox Regression Introduction As part of its efforts to increase customer loyalty and reduce churn, a telecommunications company is interested in modeling the "time

More information

Distinguish between different types of numerical data and different data collection processes.

Distinguish between different types of numerical data and different data collection processes. Level: Diploma in Business Learning Outcomes 1.1 1.3 Distinguish between different types of numerical data and different data collection processes. Introduce the course by defining statistics and explaining

More information

The SPSS Sample Problem To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pa

The SPSS Sample Problem To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pa The SPSS Sample Problem To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS Professional Statistics 7.5, pages 37-64. The description of the problem can be found

More information

Solution for Customer and Portfolio Management

Solution for Customer and Portfolio Management Scorto Behavia Scorto Loan Decision Scorto Loan Manager SME Scorto Ample Collection Scorto Fraud Barrier Scorto Model Maestro Scorto Supervisor Solution for Customer and Portfolio Management All rights

More information

Correlation and Simple. Linear Regression. Scenario. Defining Correlation

Correlation and Simple. Linear Regression. Scenario. Defining Correlation Linear Regression Scenario Let s imagine that we work in a real estate business and we re attempting to understand whether there s any association between the square footage of a house and it s final selling

More information

Credit Risk Models Cross-Validation Is There Any Added Value?

Credit Risk Models Cross-Validation Is There Any Added Value? Credit Risk Models Cross-Validation Is There Any Added Value? Croatian Quants Day Zagreb, June 6, 2014 Vili Krainz vili.krainz@rba.hr The views expressed during this presentation are solely those of the

More information

Small Business advice seeking behaviour technical report. An analysis of the 2018 small business legal need survey July 2018

Small Business advice seeking behaviour technical report. An analysis of the 2018 small business legal need survey July 2018 Small Business advice seeking behaviour technical report An analysis of the 2018 small business legal need survey July 2018 Which characteristics of small businesses and the legal issues they face have

More information

= = Name: Lab Session: CID Number: The database can be found on our class website: Donald s used car data

= = Name: Lab Session: CID Number: The database can be found on our class website: Donald s used car data Intro to Statistics for the Social Sciences Fall, 2017, Dr. Suzanne Delaney Extra Credit Assignment Instructions: You have been hired as a statistical consultant by Donald who is a used car dealer to help

More information

Chapter 3. Database and Research Methodology

Chapter 3. Database and Research Methodology Chapter 3 Database and Research Methodology In research, the research plan needs to be cautiously designed to yield results that are as objective as realistic. It is the main part of a grant application

More information

Chapter Six- Selecting the Best Innovation Model by Using Multiple Regression

Chapter Six- Selecting the Best Innovation Model by Using Multiple Regression Chapter Six- Selecting the Best Innovation Model by Using Multiple Regression 6.1 Introduction In the previous chapter, the detailed results of FA were presented and discussed. As a result, fourteen factors

More information

Getting Started with OptQuest

Getting Started with OptQuest Getting Started with OptQuest What OptQuest does Futura Apartments model example Portfolio Allocation model example Defining decision variables in Crystal Ball Running OptQuest Specifying decision variable

More information

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN Predictive Modeling using SAS Enterprise Miner and SAS/STAT : Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN 1 Overview This presentation will: Provide a brief introduction of how to set

More information

An Analysis of Factors that Influence the Income of Poor Women: A Research Survey in Ho Chi Minh City of Vietnam

An Analysis of Factors that Influence the Income of Poor Women: A Research Survey in Ho Chi Minh City of Vietnam An Analysis of Factors that Influence the Income of Poor Women: A Research Survey in Ho Chi Minh City of Vietnam Do Phu Tran Tinh University of Economics and Law, Vietnam National Ho Chi Minh University,

More information

Who Are My Best Customers?

Who Are My Best Customers? Technical report Who Are My Best Customers? Using SPSS to get greater value from your customer database Table of contents Introduction..............................................................2 Exploring

More information

Maths Tables and Formulae wee provided within the question paper and are available elsewhere on the website.

Maths Tables and Formulae wee provided within the question paper and are available elsewhere on the website. Examination Question and Answer Book Foundation Level 3c Business Mathematics FBSM 18 November 00 Day 1 late afternoon INSTRUCTIONS TO CANDIDATES Read this page before you look at the questions THIS QUESTION

More information

After completion of this unit you will be able to: Define data analytic and explain why it is important Outline the data analytic tools and

After completion of this unit you will be able to: Define data analytic and explain why it is important Outline the data analytic tools and After completion of this unit you will be able to: Define data analytic and explain why it is important Outline the data analytic tools and techniques and explain them Now the difference between descriptive

More information

Application of Decision Trees in Mining High-Value Credit Card Customers

Application of Decision Trees in Mining High-Value Credit Card Customers Application of Decision Trees in Mining High-Value Credit Card Customers Jian Wang Bo Yuan Wenhuang Liu Graduate School at Shenzhen, Tsinghua University, Shenzhen 8, P.R. China E-mail: gregret24@gmail.com,

More information

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015 Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised March 28, 2015 NOTE: The routines spost13, lrdrop1, and extremes

More information

Regression analysis of profit per 1 kg milk produced in selected dairy cattle farms

Regression analysis of profit per 1 kg milk produced in selected dairy cattle farms ISSN: 2319-7706 Volume 4 Number 2 (2015) pp. 713-719 http://www.ijcmas.com Original Research Article Regression analysis of profit per 1 kg milk produced in selected dairy cattle farms K. Stankov¹*, St.

More information

AP Statistics Scope & Sequence

AP Statistics Scope & Sequence AP Statistics Scope & Sequence Grading Period Unit Title Learning Targets Throughout the School Year First Grading Period *Apply mathematics to problems in everyday life *Use a problem-solving model that

More information

Masters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web:

Masters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web: Masters in Business Statistics (MBS) - 2014/2015 Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa Web: www.mrt.ac.lk Course Coordinator: Prof. T S G Peiris Prof. in Applied

More information

Customer Lifetime Value Modelling

Customer Lifetime Value Modelling Rhino Risk Customer Lifetime Value Modelling Alan Lucas Rhino Risk Ltd. 1 Purpose of Talk To discuss the Customer Value paradigm, explaining how it can be used to optimize customer-level decisions Key

More information

Spreadsheets in Education (ejsie)

Spreadsheets in Education (ejsie) Spreadsheets in Education (ejsie) Volume 2, Issue 2 2005 Article 5 Forecasting with Excel: Suggestions for Managers Scott Nadler John F. Kros East Carolina University, nadlers@mail.ecu.edu East Carolina

More information

Code Compulsory Module Credits Continuous Assignment

Code Compulsory Module Credits Continuous Assignment CURRICULUM AND SCHEME OF EVALUATION Compulsory Modules Evaluation (%) Code Compulsory Module Credits Continuous Assignment Final Exam MA 5210 Probability and Statistics 3 40±10 60 10 MA 5202 Statistical

More information

Advanced Analytics through the credit cycle

Advanced Analytics through the credit cycle Advanced Analytics through the credit cycle Alejandro Correa B. Andrés Gonzalez M. Introduction PRE ORIGINATION Credit Cycle POST ORIGINATION ORIGINATION Pre-Origination Propensity Models What is it?

More information

Model Validation of a Credit Scorecard Using Bootstrap Method

Model Validation of a Credit Scorecard Using Bootstrap Method IOSR Journal of Economics and Finance (IOSR-JEF) e-issn: 2321-5933, p-issn: 2321-5925.Volume 3, Issue 3. (Mar-Apr. 2014), PP 64-68 Model Validation of a Credit Scorecard Using Bootstrap Method Dilsha M

More information

Transportation Research Forum

Transportation Research Forum Transportation Research Forum Estimation of Satisfied Customers in Public Transport Systems: A New Methodological Approach Author(s): Maria Morfoulaki, Yannis Tyrinopoulos, and Georgia Aifadopoulou Source:

More information

Mining for Gold gets easier and a lot more fun! By Ken Deal

Mining for Gold gets easier and a lot more fun! By Ken Deal Mining for Gold gets easier and a lot more fun! By Ken Deal Marketing researchers develop and use scales routinely. It seems to be a fairly common procedure when analyzing survey data to assume that a

More information

Center for Demography and Ecology

Center for Demography and Ecology Center for Demography and Ecology University of Wisconsin-Madison A Comparative Evaluation of Selected Statistical Software for Computing Multinomial Models Nancy McDermott CDE Working Paper No. 95-01

More information

Data Science in a pricing process

Data Science in a pricing process Data Science in a pricing process Michaël Casalinuovo Consultant, ADDACTIS Software michael.casalinuovo@addactis.com Contents Nowadays, we live in a continuously changing market environment, Pricing has

More information

Binary Classification Modeling Final Deliverable. Using Logistic Regression to Build Credit Scores. Dagny Taggart

Binary Classification Modeling Final Deliverable. Using Logistic Regression to Build Credit Scores. Dagny Taggart Binary Classification Modeling Final Deliverable Using Logistic Regression to Build Credit Scores Dagny Taggart Supervised by Jennifer Lewis Priestley, Ph.D. Kennesaw State University Submitted 4/24/2015

More information

Outliers identification and handling: an advanced econometric approach for practical data applications

Outliers identification and handling: an advanced econometric approach for practical data applications Outliers identification and handling: an advanced econometric approach for practical data applications G. Palmegiani LUISS University of Rome Rome Italy DOI: 10.1481/icasVII.2016.d24c ABSTRACT PAPER Before

More information

intelligent world. GIV

intelligent world. GIV Huawei provides industries aiming to development, as well that will enable the ecosystem to truly intelligent world. GIV 2025 the direction for ramp up the pace of as the foundations diverse ICT industry

More information

Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/

Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/ Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/ a parameter In item response theory (IRT), the a parameter is a number that indicates the discrimination of a

More information

Gush vs. Bore: A Look at the Statistics of Sampling

Gush vs. Bore: A Look at the Statistics of Sampling Gush vs. Bore: A Look at the Statistics of Sampling Open the Fathom file Random_Samples.ftm. Imagine that in a nation somewhere nearby, a presidential election will soon be held with two candidates named

More information

SPSS Guide Page 1 of 13

SPSS Guide Page 1 of 13 SPSS Guide Page 1 of 13 A Guide to SPSS for Public Affairs Students This is intended as a handy how-to guide for most of what you might want to do in SPSS. First, here is what a typical data set might

More information

Business Quantitative Analysis [QU1] Examination Blueprint

Business Quantitative Analysis [QU1] Examination Blueprint Business Quantitative Analysis [QU1] Examination Blueprint 2014-2015 Purpose The Business Quantitative Analysis [QU1] examination has been constructed using an examination blueprint. The blueprint, also

More information

Introduction to Analytics Tools Data Models Problem solving with analytics

Introduction to Analytics Tools Data Models Problem solving with analytics Introduction to Analytics Tools Data Models Problem solving with analytics Analytics is the use of: data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based

More information

Statistics and Business Decision Making TEKS/LINKS Student Objectives One Credit

Statistics and Business Decision Making TEKS/LINKS Student Objectives One Credit First Six Weeks Defining and Collecting of Data SBDM 8(A) The student will define the types of variables and the measurement scales of variables. SBDM 8(B) The student will understand the collecting of

More information

Amsterdam, The future of retail banking: the customers

Amsterdam, The future of retail banking: the customers Amsterdam, 2016 The future of retail banking: the customers The future of retail banking: the customers Retail banking was long referred to as a 3 6 3 business. The bank borrows money at a 3% interest

More information

Lithium-Ion Battery Analysis for Reliability and Accelerated Testing Using Logistic Regression

Lithium-Ion Battery Analysis for Reliability and Accelerated Testing Using Logistic Regression for Reliability and Accelerated Testing Using Logistic Regression Travis A. Moebes, PhD Dyn-Corp International, LLC Houston, Texas tmoebes@nasa.gov Biography Dr. Travis Moebes has a B.S. in Mathematics

More information

Semester 2, 2015/2016

Semester 2, 2015/2016 ECN 3202 APPLIED ECONOMETRICS 3. MULTIPLE REGRESSION B Mr. Sydney Armstrong Lecturer 1 The University of Guyana 1 Semester 2, 2015/2016 MODEL SPECIFICATION What happens if we omit a relevant variable?

More information

THE ANALYTICS ECOSYSTEM AND SOME USER CASES. Kristian Stegenborg, Bisnode 4/11 SAS Nordic Credit Scoring User Group

THE ANALYTICS ECOSYSTEM AND SOME USER CASES. Kristian Stegenborg, Bisnode 4/11 SAS Nordic Credit Scoring User Group THE ANALYTICS ECOSYSTEM AND SOME USER CASES Kristian Stegenborg, Bisnode 4/11 SAS Nordic Credit Scoring User Group ABOUT ME Master in Statistics Worked with credit scoring for 15 years Prior experience:

More information

Industrial Engineering Prof. Inderdeep Singh Department of Mechanical and Industrial Engineering Indian Institute of Technology, Roorkee

Industrial Engineering Prof. Inderdeep Singh Department of Mechanical and Industrial Engineering Indian Institute of Technology, Roorkee Industrial Engineering Prof. Inderdeep Singh Department of Mechanical and Industrial Engineering Indian Institute of Technology, Roorkee Module - 04 Lecture - 04 Sales Forecasting I A very warm welcome

More information

= = Intro to Statistics for the Social Sciences. Name: Lab Session: Spring, 2015, Dr. Suzanne Delaney

= = Intro to Statistics for the Social Sciences. Name: Lab Session: Spring, 2015, Dr. Suzanne Delaney Name: Intro to Statistics for the Social Sciences Lab Session: Spring, 2015, Dr. Suzanne Delaney CID Number: _ Homework #22 You have been hired as a statistical consultant by Donald who is a used car dealer

More information

Homework 1: Who s Your Daddy? Is He Rich Like Me?

Homework 1: Who s Your Daddy? Is He Rich Like Me? Homework 1: Who s Your Daddy? Is He Rich Like Me? 36-402, Spring 2015 Due at 11:59 pm on Tuesday, 20 January 2015 GENERAL INSTRUCTIONS: You may submit either (1) a single PDF, containing all your written

More information

Opening SPSS 6/18/2013. Lesson: Quantitative Data Analysis part -I. The Four Windows: Data Editor. The Four Windows: Output Viewer

Opening SPSS 6/18/2013. Lesson: Quantitative Data Analysis part -I. The Four Windows: Data Editor. The Four Windows: Output Viewer Lesson: Quantitative Data Analysis part -I Research Methodology - COMC/CMOE/ COMT 41543 The Four Windows: Data Editor Data Editor Spreadsheet-like system for defining, entering, editing, and displaying

More information

Model Building Process Part 2: Factor Assumptions

Model Building Process Part 2: Factor Assumptions Model Building Process Part 2: Factor Assumptions Authored by: Sarah Burke, PhD 17 July 2018 Revised 6 September 2018 The goal of the STAT COE is to assist in developing rigorous, defensible test strategies

More information

Statistics and Data Analysis

Statistics and Data Analysis Selecting the Appropriate Outlier Treatment for Common Industry Applications Kunal Tiwari Krishna Mehta Nitin Jain Ramandeep Tiwari Gaurav Kanda Inductis Inc. 571 Central Avenue #105 New Providence, NJ

More information

Real Estate Modelling and Forecasting

Real Estate Modelling and Forecasting Real Estate Modelling and Forecasting Chris Brooks ICMA Centre, University of Reading Sotiris Tsolacos Property and Portfolio Research CAMBRIDGE UNIVERSITY PRESS Contents list of figures page x List of

More information

Department of Sociology King s University College Sociology 302b: Section 570/571 Research Methodology in Empirical Sociology Winter 2006

Department of Sociology King s University College Sociology 302b: Section 570/571 Research Methodology in Empirical Sociology Winter 2006 Department of Sociology King s University College Sociology 302b: Section 570/571 Research Methodology in Empirical Sociology Winter 2006 Computer assignment #3 DUE Wednesday MARCH 29 th (in class) Regression

More information

LECTURE 17: MULTIVARIABLE REGRESSIONS I

LECTURE 17: MULTIVARIABLE REGRESSIONS I David Youngberg BSAD 210 Montgomery College LECTURE 17: MULTIVARIABLE REGRESSIONS I I. What Determines a House s Price? a. Open Data Set 6 to help us answer this question. You ll see pricing data for homes

More information

Model Selection, Evaluation, Diagnosis

Model Selection, Evaluation, Diagnosis Model Selection, Evaluation, Diagnosis INFO-4604, Applied Machine Learning University of Colorado Boulder October 31 November 2, 2017 Prof. Michael Paul Today How do you estimate how well your classifier

More information

THE RESPONSE OF FINANCIAL AND GOODS MARKETS TO VELOCITY INNOVATIONS: AN EMPIRICAL INVESTIGATION FOR THE US*

THE RESPONSE OF FINANCIAL AND GOODS MARKETS TO VELOCITY INNOVATIONS: AN EMPIRICAL INVESTIGATION FOR THE US* THE RESPONSE OF FINANCIAL AND GOODS MARKETS TO VELOCITY INNOVATIONS: AN EMPIRICAL INVESTIGATION FOR THE US* by Flavio Padrini Ministry of the Treasury, Italy ABSTRACT It is commonly thought that interest

More information

M.Sc. Agril. Economics

M.Sc. Agril. Economics M.Sc. Agril. Economics Sl. Course Name Course Credit Semester No. Code 1. Micro & Macro Economics Theory ECON-701 3 (3+0+0) I 2. Research Methodology ECON-705 4 (2+0+4) I 3. Farm Management ECON-703 4

More information

GETTING STARTED WITH PROC LOGISTIC

GETTING STARTED WITH PROC LOGISTIC PAPER 255-25 GETTING STARTED WITH PROC LOGISTIC Andrew H. Karp Sierra Information Services, Inc. USA Introduction Logistic Regression is an increasingly popular analytic tool. Used to predict the probability

More information

Statistics, Data Analysis, and Decision Modeling

Statistics, Data Analysis, and Decision Modeling - ' 'li* Statistics, Data Analysis, and Decision Modeling T H I R D E D I T I O N James R. Evans University of Cincinnati PEARSON Prentice Hall Upper Saddle River, New Jersey 07458 CONTENTS Preface xv

More information

Testing the Predictability of Consumption Growth: Evidence from China

Testing the Predictability of Consumption Growth: Evidence from China Auburn University Department of Economics Working Paper Series Testing the Predictability of Consumption Growth: Evidence from China Liping Gao and Hyeongwoo Kim Georgia Southern University; Auburn University

More information

Comparison of sales forecasting models for an innovative agro-industrial product: Bass model versus logistic function

Comparison of sales forecasting models for an innovative agro-industrial product: Bass model versus logistic function The Empirical Econometrics and Quantitative Economics Letters ISSN 2286 7147 EEQEL all rights reserved Volume 1, Number 4 (December 2012), pp. 89 106. Comparison of sales forecasting models for an innovative

More information

Session 7. Introduction to important statistical techniques for competitiveness analysis example and interpretations

Session 7. Introduction to important statistical techniques for competitiveness analysis example and interpretations ARTNeT Greater Mekong Sub-region (GMS) initiative Session 7 Introduction to important statistical techniques for competitiveness analysis example and interpretations ARTNeT Consultant Witada Anukoonwattaka,

More information

Introduction to Labour Economics. Professor H.J. Schuetze Economics 370. What is Labour Economics?

Introduction to Labour Economics. Professor H.J. Schuetze Economics 370. What is Labour Economics? Introduction to Labour Economics Professor H.J. Schuetze Economics 370 What is Labour Economics? Let s begin by looking at what economics is in general Study of interactions between decision makers, which

More information

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 20 MANOVA Case Study (Refer Slide Time:

More information

CHAPTER VII SUMMARY OF FINDINGS, CONCLUSION AND SUGGESTION. So far, the aspects inspiring the micro entrepreneurship enterprise

CHAPTER VII SUMMARY OF FINDINGS, CONCLUSION AND SUGGESTION. So far, the aspects inspiring the micro entrepreneurship enterprise CHAPTER VII SUMMARY OF FINDINGS, CONCLUSION AND SUGGESTION 7.1 INTRODUCTION So far, the aspects inspiring the micro entrepreneurship enterprise involvement among the entrepreneurs, constraints faced by

More information

POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM)

POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM) OUTLINE FOR THE POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM) Module Subject Topics Learning outcomes Delivered by Exploratory & Visualization Framework Exploratory Data Collection and

More information

Industrial organisation and procurement in defence firms: An economic analysis of UK aerospace markets.

Industrial organisation and procurement in defence firms: An economic analysis of UK aerospace markets. Industrial organisation and procurement in defence firms: An economic analysis of UK aerospace markets. Abstract: The UK aerospace industry is currently in a process of structural change, which has already

More information

Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology

Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology As noted previously, Hierarchical Linear Modeling (HLM) can be considered a particular instance

More information

Forecasting Introduction Version 1.7

Forecasting Introduction Version 1.7 Forecasting Introduction Version 1.7 Dr. Ron Tibben-Lembke Sept. 3, 2006 This introduction will cover basic forecasting methods, how to set the parameters of those methods, and how to measure forecast

More information

Advanced Tutorials. SESUG '95 Proceedings GETTING STARTED WITH PROC LOGISTIC

Advanced Tutorials. SESUG '95 Proceedings GETTING STARTED WITH PROC LOGISTIC GETTING STARTED WITH PROC LOGISTIC Andrew H. Karp Sierra Information Services and University of California, Berkeley Extension Division Introduction Logistic Regression is an increasingly popular analytic

More information

Econometric Analysis Dr. Sobel

Econometric Analysis Dr. Sobel Econometric Analysis Dr. Sobel Econometrics Session 1: 1. Building a data set Which software - usually best to use Microsoft Excel (XLS format) but CSV is also okay Variable names (first row only, 15 character

More information

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization 1) One of SiriusXM's challenges was tracking potential customers

More information

Daily Optimization Project

Daily Optimization Project Daily Optimization Project Personalized Facebook Analysis 10/29/15 By Kevin DeLuca ThoughtBurner Kevin DeLuca and ThoughtBurner, 2015. Unauthorized use and/or duplication of this material without express

More information

GETTING STARTED WITH PROC LOGISTIC

GETTING STARTED WITH PROC LOGISTIC GETTING STARTED WITH PROC LOGISTIC Andrew H. Karp Sierra Information Services and University of California, Berkeley Extension Division Introduction Logistic Regression is an increasingly popular analytic

More information

Choosing the best statistical tests for your data and hypotheses. Dr. Christine Pereira Academic Skills Team (ASK)

Choosing the best statistical tests for your data and hypotheses. Dr. Christine Pereira Academic Skills Team (ASK) Choosing the best statistical tests for your data and hypotheses Dr. Christine Pereira Academic Skills Team (ASK) ask@brunel.ac.uk Which test should I use? T-tests Correlations Regression Dr. Christine

More information

The prediction of economic and financial performance of companies using supervised pattern recognition methods and techniques

The prediction of economic and financial performance of companies using supervised pattern recognition methods and techniques The prediction of economic and financial performance of companies using supervised pattern recognition methods and techniques Table of Contents: Author: Raluca Botorogeanu Chapter 1: Context, need, importance

More information

Algebra II Common Core

Algebra II Common Core Core Algebra II Common Core Algebra II introduces students to advanced functions, with a focus on developing a strong conceptual grasp of the expressions that define them. Students learn through discovery

More information

THE GUIDE TO SPSS. David Le

THE GUIDE TO SPSS. David Le THE GUIDE TO SPSS David Le June 2013 1 Table of Contents Introduction... 3 How to Use this Guide... 3 Frequency Reports... 4 Key Definitions... 4 Example 1: Frequency report using a categorical variable

More information

AN ECONOMIC RELIABILITY TEST PLAN FOR GENERALIZED LOG- LOGISTIC DISTRIBUTION

AN ECONOMIC RELIABILITY TEST PLAN FOR GENERALIZED LOG- LOGISTIC DISTRIBUTION AN ECONOMIC RELIABILITY TEST PLAN FOR GENERALIZED LOG- LOGISTIC DISTRIBUTION G. Srinivasa Rao 1, R.R.L. Kantam 2, K. Rosaiah 2 and S.V.S.V.S.V. Prasad 2 Department of Statistics, Dilla University, Dilla,

More information

AN ANALYSIS OF FUNCTIONAL / PSYCHOLOGICAL BARRIERS AND QUALITY OF E-BANKING / INTERNET BANKING

AN ANALYSIS OF FUNCTIONAL / PSYCHOLOGICAL BARRIERS AND QUALITY OF E-BANKING / INTERNET BANKING CHAPTER 5 AN ANALYSIS OF FUNCTIONAL / PSYCHOLOGICAL BARRIERS AND QUALITY OF E-BANKING / INTERNET BANKING 5.1 Introduction India is one of the leading countries in the world where e-banking / internet banking

More information

Improving long run model performance using Deviance statistics. Matt Goward August 2011

Improving long run model performance using Deviance statistics. Matt Goward August 2011 Improving long run model performance using Deviance statistics Matt Goward August 011 Objective of Presentation Why model stability is important Financial institutions are interested in long run model

More information

Forecasting with Seasonality Version 1.6

Forecasting with Seasonality Version 1.6 Forecasting with Seasonality Version 1.6 Dr. Ron Tibben-Lembke Sept 3, 2006 Forecasting with seasonality and a trend is obviously more difficult than forecasting for a trend or for seasonality by itself,

More information

THE IMPROVEMENTS TO PRESENT LOAD CURVE AND NETWORK CALCULATION

THE IMPROVEMENTS TO PRESENT LOAD CURVE AND NETWORK CALCULATION 1 THE IMPROVEMENTS TO PRESENT LOAD CURVE AND NETWORK CALCULATION Contents 1 Introduction... 2 2 Temperature effects on electricity consumption... 2 2.1 Data... 2 2.2 Preliminary estimation for delay of

More information

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

Business Intelligence, 4e (Sharda/Delen/Turban) Chapter 2 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization Business Intelligence Analytics and Data Science A Managerial Perspective 4th Edition Sharda TEST BANK Full download at: https://testbankreal.com/download/business-intelligence-analytics-datascience-managerial-perspective-4th-edition-sharda-test-bank/

More information

Factors Influencing the Choice of Management Strategy among Small-Scale Private Forest Owners in Sweden

Factors Influencing the Choice of Management Strategy among Small-Scale Private Forest Owners in Sweden Forests 2014, 5, 1695-1716; doi:10.3390/f5071695 Article OPEN ACCESS forests ISSN 1999-4907 www.mdpi.com/journal/forests Factors Influencing the Choice of Management Strategy among Small-Scale Private

More information

Statistics 201 Summary of Tools and Techniques

Statistics 201 Summary of Tools and Techniques Statistics 201 Summary of Tools and Techniques This document summarizes the many tools and techniques that you will be exposed to in STAT 201. The details of how to do these procedures is intentionally

More information

Models in Engineering Glossary

Models in Engineering Glossary Models in Engineering Glossary Anchoring bias is the tendency to use an initial piece of information to make subsequent judgments. Once an anchor is set, there is a bias toward interpreting other information

More information

Research on the Approach and Strategy of Traditional Logistics Enterprise Transformation Under the Context of the Internet

Research on the Approach and Strategy of Traditional Logistics Enterprise Transformation Under the Context of the Internet Canadian Social Science Vol. 4, No. 6, 208, pp. 09-7 DOI:0.3968/0399 ISSN 72-8056 [Print] ISSN 923-6697 [Online] www.cscanada.net www.cscanada.org Research on the Approach and Strategy of Traditional Logistics

More information

Principles of Economics

Principles of Economics Principles of Economics Sylvain Barde Fall term 2009-2010: Microeconomics Course Outline Week 1: Introduction Part 1: Theories of the agent Week 2: Consumer preferences and Utility Week 3: The budget constraint

More information

The Economic and Social Review, Vol. 33, No. 1, Spring, 2002, pp Portfolio Effects and Firm Size Distribution: Carbonated Soft Drinks*

The Economic and Social Review, Vol. 33, No. 1, Spring, 2002, pp Portfolio Effects and Firm Size Distribution: Carbonated Soft Drinks* 03. Walsh Whelan Article 25/6/02 3:04 pm Page 43 The Economic and Social Review, Vol. 33, No. 1, Spring, 2002, pp. 43-54 Portfolio Effects and Firm Size Distribution: Carbonated Soft Drinks* PATRICK PAUL

More information

More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances

More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances APPENDIX 10B More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances Appendix 10B contains more mathematical and statistical details related to the test of controls sampling

More information

Value Proposition for Financial Institutions

Value Proposition for Financial Institutions WWW.CUSTOMERS-DNA.COM VALUE PROPOSITION FINANCIAL INSTITUTIONS Value Proposition for Financial Institutions Customer Intelligence in Banking CRM & CUSTOMER INTELLIGENCE SERVICES INFO@CUSTOMERS-DNA.COM

More information

RISK CLASSIFICATION BASED ON DISCRIMINANT ANALYSIS FOR SMES

RISK CLASSIFICATION BASED ON DISCRIMINANT ANALYSIS FOR SMES RISK CLASSIFICATION BASED ON DISCRIMINANT ANALYSIS FOR SMES Srinvas Gumparthi and Dr. V.Manickavasagam Abstract Credit rating agencies specializes in analyzing and evaluating the creditworthiness of large

More information

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics & Data Mining Modeling Using R Dr. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 02 Data Mining Process Welcome to the lecture 2 of

More information

Acknowledgments. Data Mining. Examples. Overview. Colleagues

Acknowledgments. Data Mining. Examples. Overview. Colleagues Data Mining A regression modeler s view on where it is and where it s likely to go. Bob Stine Department of Statistics The School of the Univ of Pennsylvania March 30, 2006 Acknowledgments Colleagues Dean

More information