Flowchart of K-Means Cluster Analysis and Regression Analysis

Shawn Lewis

Select Clusters and Variables

The objective of this project is to identify factors that may cause differences in total profits between two clusters, and to build models that predict future profits. To explore the differences clearly, we chose two very distinct clusters based on the outcome of Project (Cluster & ). Cluster has the largest number of customers, who are light users, while customers in Cluster are heavier users but fewer in number. Figure 1 lists the dependent and independent variables used in the Regression Analysis. We hope to identify how the number of products, returned revenue, customer lifetime, payment method and purchasing channel affect the total profits of H&M.

Figure 1: Description of Selected Dependent and Independent Variables

Regression Analysis of Cluster

To check the accuracy of the regression models, we randomly divided Cluster into Calibration (60% of Cluster ) and Validation (remaining 40% of Cluster ) samples. We first ran a Multiple Regression Analysis on the Calibration sample. Since the sig of ANOVA is (Figure 2), at least one variable has a relationship with total profits.

Figure 2: ANOVA of Regression on Calibration of Cluster

Based on Figure 3, the sigs of all independent variables are 0.000, which means that all five independent variables influence total profits. Because the absolute Beta value of total product quantity is the highest, it has the strongest positive relationship with total profits. Moreover, all Tolerance values are greater than 0.5 and all VIFs are less than 4, so no multicollinearity problem is detected. The Regression Model of Cluster is as follows:

Total Profit = *Product Quantity + 0.*Returned Revenue + 0.5*Customer Lifetime.0*Visa (Payment Method) - 4.5*Web (Channel)

If a customer purchases one more product, profits increase by $9. Also, for every dollar customers return to H&M, profits go up by $0.. We believe this is because customers purchase more when they know they can return products they do not like. Customers who stay with H&M for one more month generate $0.5 more in profits. However, customers who pay with Visa create $ less than customers who use other payment methods. Moreover, orders placed through the website make less profit than orders placed through other channels, such as mail or phone apps.

Figure 3: Coefficients of Regression on Calibration of Cluster

The Adjusted R Square in Figure 4 shows that 7% of the variation in total profits can be explained by this model. Moreover, the mean of the residuals equals 0, so we consider the regression model valid.

Figure 4: Model Summary of Regression on Calibration of Cluster
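The calibration/validation split and the multicollinearity check described in this section can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' actual workflow: the function names, the random seed and the generated predictors are all assumptions, and only NumPy is used. Tolerance is computed as 1 - R_j^2, where R_j^2 comes from regressing predictor j on the remaining predictors, and VIF is its reciprocal.

```python
import numpy as np

def split_calibration_validation(n_rows, calibration_share=0.6, seed=0):
    """Randomly split row indices into calibration and validation sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    cut = int(round(calibration_share * n_rows))
    return shuffled[:cut], shuffled[cut:]

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, b1, ..., bp]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def tolerance_and_vif(X):
    """Return a list of (tolerance, VIF) pairs, one per predictor column."""
    pairs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        coef = fit_ols(others, X[:, j])
        fitted = coef[0] + others @ coef[1:]
        sse = np.sum((X[:, j] - fitted) ** 2)
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        tolerance = sse / sst  # equals 1 - R_j^2
        pairs.append((tolerance, 1.0 / tolerance))
    return pairs

# Synthetic stand-in data (hypothetical, not the H&M data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([5.0, 0.3, 0.1]) + rng.normal(scale=0.5, size=500)

cal_idx, val_idx = split_calibration_validation(len(y))
coef = fit_ols(X[cal_idx], y[cal_idx])
residuals = y[cal_idx] - (coef[0] + X[cal_idx] @ coef[1:])
collinearity = tolerance_and_vif(X[cal_idx])

print("calibration rows:", len(cal_idx), "validation rows:", len(val_idx))
print("coefficients:", coef)
print("mean residual:", residuals.mean())
print("(tolerance, VIF) per predictor:", collinearity)
```

With uncorrelated synthetic predictors, the tolerances stay close to 1 and the VIFs close to 1, which would pass the thresholds quoted above (Tolerance greater than 0.5, VIF less than 4).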

Deal With Outliers of Cluster

Figure 5: Leverage Values of First Four Cases in Calibration Sample of Cluster

We then used Leverage to identify outliers. Figure 5 shows the Leverage values of the first four cases. All Leverage values are less than 0.0, so there are no outliers in the regression analysis of Cluster.

Test on Validation Sample of Cluster

We tested the regression model on the Validation sample to see how well it fits. First, we ran a similar regression analysis on the Validation sample to screen for outliers. Because all Leverage values are below 0.0, there are no outliers in the Validation sample. Then we calculated the R Square and Adjusted R Square of the Validation sample using the following formulas:

\[ SST = SSR + SSE \]

\[ SST = \text{Total sum of squares} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

\[ SSE = \text{Sum of squares due to error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

\[ SSR = \text{Sum of squares due to regression} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]

R Square:

\[ R^2 = \frac{SSR}{SST} \]

Adjusted R Square:

\[ R_a^2 = 1 - (1 - R^2)\left(\frac{n - 1}{n - p - 1}\right) \]

where n is the number of observations in the sample and p is the number of independent variables.

The R Square of the Validation sample is and the Adjusted R Square is . Compared to those values for the Calibration sample, the Adjusted R Square of the Validation sample is almost the same. Thus, the regression model from the Calibration sample fits the Validation sample and can be used to predict profit for the whole Cluster.

Regression Analysis of Cluster

We conducted regression analysis on Cluster with the same variables and procedures. The sig of ANOVA is 0.000 (Figure 7), which indicates that at least one factor has a relationship with total profits.

Figure 7: ANOVA of Regression on Calibration of Cluster

In Figure 8, the sig of Visa is above the significance threshold, so Visa does not significantly influence total profits and we can remove this variable. According to the absolute Beta value, product quantity still has the strongest influence on total profits. In addition, no multicollinearity problem is detected, since the Tolerance values are large and the VIF values are small. The Regression Model of Cluster is as follows:

Total Profit = *Product Quantity + 0.*Returned Revenue + 0.*Customer Lifetime - 9.4*Web (Channel)

If a customer purchases one more product, total profits grow by $4.8 on average. Even when customers return one extra dollar, profits still increase by $0.. Furthermore, when customer lifetime extends by one more month, H&M makes $0. more in profits. Nevertheless, customers who purchase on the website generate less profit for H&M than those who purchase by mail or through phone apps.

Figure 8: Coefficients of Regression on Calibration of Cluster
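The R Square and Adjusted R Square formulas used for the Validation samples can be checked with a short, dependency-free sketch. The function name and the five-point data set below are illustrative assumptions. The fitted values come from a simple least-squares line (intercept 2.2, slope 0.6) fitted to the same five points, so the identity SST = SSR + SSE holds here.

```python
def r_square_stats(y, y_hat, n_predictors):
    """Compute SST, SSE, SSR, R Square and Adjusted R Square.

    y            observed values
    y_hat        fitted (predicted) values
    n_predictors number of independent variables (p)
    """
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"sst": sst, "sse": sse, "ssr": ssr, "r2": r2, "adj_r2": adj_r2}

# Hypothetical example: x = 1..5, least-squares line y_hat = 2.2 + 0.6 * x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

stats = r_square_stats(y, y_hat, n_predictors=1)
print(stats)
```

For this toy data SST is 6, SSR is 3.6 and SSE is 2.4, giving an R Square of 0.6 and an Adjusted R Square of about 0.467. Note that SST = SSR + SSE is only guaranteed when the fitted values come from a least-squares model with an intercept estimated on the same sample. If predictions from a Calibration model are applied to Validation data, SSR/SST and 1 - SSE/SST can differ.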

36.8% of the variation in total profits can be explained by the four independent variables (Figure 9). Since the mean of the residuals is 0, we consider this regression model valid and reliable.

Figure 9: Model Summary of Regression on Calibration of Cluster

Deal With Outliers of Cluster

Figure 10: Leverage Values of First Seven Cases in Calibration of Cluster

The first five cases can be regarded as outliers because their Leverage values are greater than 0.0. To explore how the results change, we deleted the first case and ran the regression analysis again. The new Adjusted R Square (36.8% in Figure 11) is the same as the original one, and even the coefficients do not change much (Figure 12). Since removing the strongest outlier does not materially change the model, we decided to keep those outliers.

Figure 11: Model Summary of Regression on Calibration of Cluster (First Outlier Removed)

Figure 12: Coefficients of Regression on Calibration of Cluster (First Outlier Removed)

Test on Validation Sample of Cluster

After handling outliers in the Validation sample, we calculated its R Square ( ) and Adjusted R Square ( ). The Adjusted R Square values of the Calibration and Validation samples are very similar. Therefore, the regression model of the Calibration sample is a good fit for the whole Cluster.
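The leverage-based outlier check, and the comparison of Adjusted R Square before and after removing the highest-leverage case, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used in the report: the data are synthetic, the function names are hypothetical, and leverage is taken as the diagonal of the hat matrix. The report does not say which leverage variant it used, and some statistical packages report centered leverage (the hat value minus 1/n) instead.

```python
import numpy as np

def leverage_values(X):
    """Diagonal of the hat matrix H = D (D'D)^-1 D', with D = [1, X]."""
    design = np.column_stack([np.ones(len(X)), X])
    q, _ = np.linalg.qr(design)      # reduced QR, so H = Q Q'
    return np.sum(q ** 2, axis=1)    # row sums of Q^2 give h_ii

def adjusted_r_square(X, y):
    """In-sample Adjusted R Square of a least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n, p = X.shape
    # In-sample with an intercept, 1 - SSE/SST equals SSR/SST.
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data with one deliberately extreme case in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, -8.0]
y = 1.0 + X @ np.array([3.0, 0.5]) + rng.normal(size=200)

h = leverage_values(X)
worst = int(np.argmax(h))            # case with the highest leverage

adj_all = adjusted_r_square(X, y)
keep = np.arange(len(y)) != worst
adj_without_worst = adjusted_r_square(X[keep], y[keep])

print("highest-leverage case:", worst, "leverage:", h[worst])
print("Adjusted R Square, all cases:", adj_all)
print("Adjusted R Square, worst case removed:", adj_without_worst)
```

If the two Adjusted R Square values are close, the high-leverage case is not driving the fit, which is the reasoning the report uses to keep its outliers.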