Contents

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTERS

1 INTRODUCTION
    INTRODUCTION
    OBJECTIVES OF THE THESIS
    MACHINE LEARNING
        Classification
        Clustering
        Association Rule Mining
        Sequential Pattern Mining
        Regression
        Neural Network in Regression
        Support Vector Regression (SVR)
    STAGES IN FORECASTING
    FORMULATING A FORECASTING STRATEGY
    FORECASTING METHODS
        Qualitative Forecasting Methods
        Quantitative Forecasting Methods
        Time Series Method
        Moving Average Model
        Exponential Smoothing
        Causal Forecasting
    APPLICATION OF REGRESSION ANALYSIS
    FORECASTING ACCURACY MEASURES
    FORECASTING IN THE WOOD INDUSTRY
    PROBLEM STATEMENT
    ORGANIZATION OF THE THESIS

2 LITERATURE REVIEW
    INTRODUCTION
    FORECASTING USING LINEAR REGRESSION
    FORECASTING USING MULTIPLE LINEAR REGRESSION
    SOFTWARE EFFORT ESTIMATION
    FORECASTING USING ANN
        Software Effort Estimation using ANN
        Forecasting using ANN in other Domains
    FORECASTING USING SVR
        Software Effort Estimation using SVR
        Forecasting using SVR in other Domains

3 PERFORMANCE ANALYSIS OF EXISTING REGRESSION TECHNIQUES
    3.1 FORECASTING PRODUCT DEMAND
    SOFTWARE DEVELOPMENT EFFORT ESTIMATION
        Software Estimation Techniques
        Software Metrics
    PULPWOOD
    METHODOLOGY
        COnstructive COst MOdel (COCOMO) Data Set
        M5 Algorithm
        Linear Regression
        Error Statistics
    RESULTS AND DISCUSSION
    CONCLUSION

PROPOSED GAUSSIAN KERNEL SUPPORT VECTOR MACHINES FOR REGRESSION
    4.1 INTRODUCTION
    REGRESSION ANALYSIS
    SUPPORT VECTOR REGRESSION
        SVM Classifiers
        Generalizing Optimization Hyperplanes
        Generalization for High Dimensional Features
        Support Vector Regression
        Linear Regression
        Non-Linear Regression
    METHODOLOGY
        Proposed Gaussian Kernel Support Vector Regression
    RESULTS AND DISCUSSION
    CONCLUSION

5 NOVEL HYBRID PARTICLE SWARM OPTIMIZATION FOR KERNEL OPTIMIZATION IN SUPPORT VECTOR REGRESSION
    5.1 INTRODUCTION
    METHODOLOGY
        Particle Swarm Optimization (PSO)
        Proposed PSO GA optimization
    RESULTS AND DISCUSSION
    CONCLUSION

CONCLUSION AND FUTURE ENHANCEMENTS
    CONCLUSION
    FUTURE ENHANCEMENTS

REFERENCES
LIST OF PUBLICATIONS
APPENDICES
    A Glossary of Terms
    B Magnitude relative error and absolute error achieved for COCOMO dataset
        B1. M5 algorithm
        B2. Linear Regression
        B3. SVR-RBF kernel
        B4. Proposed Gaussian Kernel SVR
        B5. SVR-Proposed optimization
    C Source Code
        C1. Prototype

List of Tables

Table No. Title

1.1 Data for linear regression
Predicted values and the errors of prediction for Data
Statistics for computing the regression line
Projected consumption of Paper (Million Tonnes)
Projected production of Paper
Wood, Recycled and Agro based Mills production status
Status of availability of Recycled / waste paper
Availability of Agro based raw materials (Million Tonnes)
Variety wise production of paper from different raw materials ( ) (Million Tonnes)
3.7 COCOMO II Cost drivers
COCOMO II Scale factors
COCOMO Dataset
Average MMRE and MdMRE achieved for M5 and Linear Regression Technique - COCOMO Dataset
3.11 Sample Demand data of Pulpwood in MT (Metric Tonne)
Average MMRE and MdMRE for pulpwood Dataset
MRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - COCOMO Dataset
4.2 Average MMRE and MdMRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - COCOMO Dataset
4.3 Average MMRE and MdMRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - pulpwood dataset
5.1 Magnitude Relative Error achieved for COCOMO Dataset
Average MMRE and MdMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO dataset
5.3 Actual Vs Predicted values for pulpwood dataset
MMRE and MdMRE for various Techniques - pulpwood

List of Figures

Figure No. Title

1.1 Machine Learning Algorithms
Classification
1.3 Qualitative Forecasting Methods
Quantitative Forecasting Methods
A scatter plot of the data
Regression line for the data
Software Estimation Process
Software Estimation Techniques
Types of COCOMO Model
Pulpwood stacked in the Processing Yard
Indian paper industry growth
Pulping process
M5 model tree algorithm
Pseudo code for M5 Algorithm
Magnitude Relative Error for M5 and Linear Regression - COCOMO Dataset
MMRE for M5 and Linear Regression - COCOMO Dataset
MdMRE for M5 and Linear Regression - COCOMO Dataset
MMRE for M5 and Linear Regression Technique - Pulpwood dataset
MdMRE for M5 and Linear Regression Technique - Pulpwood dataset
4.1 SVM Algorithms
Example for Support Vector Regression using kernel trick
4.3 Pseudo code for SVR algorithm
Pseudo code of the proposed algorithm
Magnitude Relative Error for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset
MMRE for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset
MdMRE for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset
MMRE for M5, SVR RBF and Proposed Gaussian SVR - pulpwood dataset
MdMRE for M5, SVR RBF and Proposed Gaussian SVR - pulpwood dataset
5.1 Machine Learning to Train a Predictive Model
Initial swarm of PSO
Proposed optimization algorithm using PSO
Flowchart for hybrid GA-PSO
Magnitude Relative Error for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset
MMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset
MdMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset
5.8 Actual Vs Predicted values for pulpwood dataset
Mean Magnitude Relative Error - pulpwood dataset
Median Magnitude Relative Error - pulpwood dataset

List of Abbreviations

ACLRM: Autocorrelation-Corrected Linear Regression Model
ANN: Artificial Neural Network
ARIMA: Auto Regressive Integrated Moving Average
ATM: Automatic Teller Machines
CART: Classification And Regression Tree
CMMI: Capability Maturity Model Integration
COCOMO: COnstructive COst Model
COQUALMO: COnstructive QUALity Model
CPNN: Counter Propagation Neural Network
CRB: Commodity Research Bureau
CV: Coefficient of Variation
DENFIS: Dynamic Evolving Neuro Fuzzy Inference System
EA: Estimation by Analogy
EA: Evolution Algorithms
EC: Evolutionary Computation
EM: Effort Multipliers
FLEX: Flexibility
GA: Genetic Algorithms
GDP: Gross Domestic Product
GGGP: Grammar Guided Genetic Programming
GLS: Generalized Least Squares
GMDH: Group Method of Data Handling
GP: Genetic Programming
GRA: Grey Relational Analysis
ICA: Independent Component Analysis
IID: Independent and Identically Distributed
IPI: Industrial Production Index
ISBSG: International Software Benchmarking Standards Group
ISE: Istanbul Stock Exchange
KAs: Knowledge Areas
KDSI: Kilo Delivered Source Instructions
LOC: Lines Of Code
MAD: Mean Absolute Deviation
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
MARS: Multivariate Adaptive Regression Splines
MdMRE: Median Magnitude of Relative Error
MFE: Mean Forecast Error
ML: Machine Learning
MLE: Maximum Likelihood Estimator
MLFF: Multilayer FeedForward Neural Network
MLP: Multi-Layer Perceptron
MLR: Multiple Linear Regression
MMRE: Mean Magnitude of Relative Error
MOPSO: Multiple Objective Particle Swarm Optimization
MOS: Model Output Statistics
MRE: Magnitude Relative Error
MSE: Mean Square Error
NDFD: National Digital Forecast Database
NDSI: Number of Delivered Source Instructions
NGT: Nominal Group Technique
NLOC: Number of Lines of Code
NWP: Numerical Weather Predictions
NYSE: New York Stock Exchange
OD: Oven Dry
OLS: Ordinary Least Squares
PCA: Principal Component Analysis
PDFM: Propane Database and Forecasting Model
PMs: Person-Months
PMAT: Process Maturity
PREC: Precedentedness
PRED: Prediction
PSO: Particle Swarm Optimization
QP: Quadratic Programming
RBF: Radial Basis Function
RMS: Root Mean Square
RPROP: Resilient back Propagation algorithm
RR: Ridge Regression
RSEL: Risk Resolution
RT: Regression Technique
SDR: Standard Deviation Reduction
SEBI: Securities and Exchange Board of India
SEE: Software Effort Estimation
SEER-SEM: System Evaluation and Estimation of Resource Software Evaluation Model
SF: Scale Factor
SIC: Schwarz Information Criterion
SKU: Stock Keeping Unit
SLIM: Software Life Cycle Management
SMO: Sequential Minimal Optimization
SPX: Standard and Poor 500
SRGM: Software Reliability Growth Model
SSE: Sum of Squared Errors
STLF: Short Term Load Forecasting
SUR: Seemingly Unrelated Regression
SVM: Support Vector Machines
SVR: Support Vector Regressions
SWEBOK: Software Engineering Body Of Knowledge
T-BILL: Treasury Bill
TEAM: Team Cohesion
TNPL: Tamil Nadu Newsprint and Papers Ltd
USDX: US Dollar index
VC: Vector Classification
VIF: Variance Inflation Factor
VSTS: Visual Studio Team System

Abstract

Forecasting a product's demand and supply is crucial to any supplier, manufacturer, or retailer. Forecasts of future demand determine the quantities that should be purchased, produced, and shipped. Classification and regression are two important data mining techniques, used for classification and prediction respectively. Classification maps a set of independent input values to an output value drawn from a predetermined set of values. Prediction maps a set of independent input values to a numeric value based on a predetermined relationship.

The use of paper and related products is increasing exponentially every year. Because annual per-person paper consumption keeps rising, a forecast of demand and supply is necessary to support India's socio-economic development. In this work, the demand for pulpwood is forecast using data collected from Tamil Nadu Newsprint and Papers Limited (TNPL), India. The COnstructive COst MOdel (COCOMO) dataset for software effort estimation is used to benchmark the proposed regression techniques, which are then evaluated on the real-world pulpwood dataset.

Of the paper mills in India, 30% to 40% use wood as raw material. The current raw material requirement is slightly more than 5 million metric cubes against a domestic supply of 2.6 million metric cubes, leading to a 45% shortfall in supply. A forecast of pulpwood demand and supply will help in responding to the alarming increase in demand, which has not been matched by a commensurate increase in supply.

The M5 algorithm was introduced by Quinlan in 1992 to build trees whose leaves are associated with multivariate linear models. The attribute tested at each node of the tree is the one that maximizes the expected error reduction, measured as a function of the standard deviation of the output parameter.
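The M5 split criterion described above is commonly written as the standard deviation reduction (SDR): the standard deviation of the target at a node minus the size-weighted standard deviations of the subsets produced by a split. The following is a minimal sketch under that assumption, for a single numeric attribute and binary splits only; the function names `sdr` and `best_threshold` are hypothetical and are not taken from the thesis prototype.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    n = len(parent)
    # Weight each subset's standard deviation by its share of the samples.
    weighted = sum(len(s) / n * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted


def best_threshold(xs, ys):
    """Return (threshold, sdr) for the best binary split on one numeric attribute."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        gain = sdr(ys, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best


if __name__ == "__main__":
    # Illustrative values only, not data from the thesis.
    xs = [1, 2, 3, 10, 11, 12]
    ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
    print(best_threshold(xs, ys))
```

A full M5 implementation would repeat this search over every attribute, recurse on the resulting subsets, and fit a linear model at each leaf; none of that is shown here.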
The Mean Magnitude Relative Error (MMRE) is the de facto standard evaluation criterion for assessing the accuracy of software prediction models. The Magnitude Relative Error (MRE) computes the absolute percentage error between the actual and predicted efforts for the reference samples. The attributes of the COCOMO and pulpwood datasets are evaluated and the percentage difference in performance is computed. The percentage difference in MMRE between M5 and Linear Regression is 50.48% for the COCOMO dataset and 6.85% for the pulpwood dataset. Similarly, the percentage difference in MdMRE between the two algorithms is 36.93% for the COCOMO dataset and 7.61% for the pulpwood dataset. From these results it can be concluded that linear regression performs reasonably well when predicting data with a low number of attributes.

Support Vector Regression (SVR) belongs to the general category of kernel methods. A kernel method is an algorithm that depends on the data only through dot products. The Radial Basis Function (RBF) kernel of the SVR is modified to improve the accuracy of the pulpwood demand predictions. In this work, a method to find the ideal kernel width is proposed. In dense areas, the width is narrowed and the weight assigned is less than 1, whereas in sparse areas the width is increased and the weight assigned is more than 1. When a pattern x falls in a dense area, the closest members of x are calculated using a weighted nearest-neighbor distance formula and the values are obtained. Experiments were conducted for ten different cost functions, and the average MMRE and MdMRE were computed for the proposed kernel and the RBF kernel. The results show that the percentage difference in MMRE between the proposed Gaussian kernel SVR and SVR-RBF is 57.75% for the COCOMO dataset and 43.30% for the pulpwood dataset. Similarly, the percentage difference in MdMRE between the proposed Gaussian kernel SVR and SVR-RBF is 47.4% for the COCOMO dataset and 69.2% for the pulpwood dataset.
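The accuracy measures named above (MRE, MMRE, and MdMRE) can be sketched as follows, assuming the usual definition of MRE as the absolute error divided by the actual value. The function names and sample numbers are illustrative assumptions, not values from the COCOMO or pulpwood datasets.

```python
from statistics import mean, median


def mre(actual, predicted):
    """Magnitude Relative Error for one sample (as a fraction, not a percentage)."""
    return abs(actual - predicted) / abs(actual)


def mmre(actuals, predictions):
    """Mean MRE over all samples."""
    return mean(mre(a, p) for a, p in zip(actuals, predictions))


def mdmre(actuals, predictions):
    """Median MRE over all samples; less sensitive to a few large errors."""
    return median(mre(a, p) for a, p in zip(actuals, predictions))


if __name__ == "__main__":
    # Hypothetical effort values, for illustration only.
    actual = [100.0, 200.0, 400.0]
    predicted = [90.0, 250.0, 400.0]
    print(mmre(actual, predicted), mdmre(actual, predicted))
```

Multiplying either result by 100 gives the percentage form in which such errors are usually reported.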
Within the SVR method, a hybrid of Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA) is proposed to optimize the parameters of the kernel functions. Regression models are created using the data collected from TNPL. The strength of SVR relies on the design of the kernel functions. The advantages of SVR are generalizability and robustness in the presence of outliers in the training set. The proposed Gaussian kernel SVR and the modified PSO improve the performance of SVR by reducing regression errors. The results show that, for the COCOMO dataset, the proposed SVR optimization reduces the percentage difference of MMRE by 5.11% and reduces MdMRE by 1.37% compared with the proposed Gaussian kernel SVR method. Similarly, for the pulpwood dataset, the SVR optimization reduces MMRE by 7.15% and the percentage difference of MdMRE by 10.67% compared with the proposed Gaussian kernel SVR method. Consolidating all the results, the proposed Gaussian kernel SVR reduces MMRE for both the COCOMO and pulpwood datasets compared with the other methods, including SVR-RBF and M5.

The algorithms proposed in this research have been implemented successfully to increase the accuracy of prediction. The algorithms used are the M5 regression tree, linear regression, SVM with an RBF kernel, and the proposed Gaussian kernel SVR with PSO optimization. MMRE and MdMRE are used to evaluate the results. Software effort estimation using the COCOMO dataset is evaluated, and the demand for pulpwood is forecast with data collected from TNPL over the past 10 years. The results show that the proposed algorithm with the PSO-GA optimization approach gives the highest accuracy, with a significant decrease in MMRE and MdMRE compared with all other methods. Future work aims to analyze the performance of the proposed method for various industrial wood species.
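To make the parameter search concrete, the sketch below shows a plain global-best PSO minimizing an objective over a bounded parameter space. It is not the hybrid PSO-GA proposed in the thesis: the GA step is omitted, and `toy_error` is a hypothetical stand-in for the validation error (for example MMRE) of an SVR trained with a given cost and kernel width. The inertia and acceleration coefficients are common textbook defaults, assumed here for illustration.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_error(params):
    """Hypothetical stand-in for SVR validation error over (cost, width)."""
    cost, width = params
    return (cost - 10.0) ** 2 + (width - 0.5) ** 2


if __name__ == "__main__":
    print(pso_minimize(toy_error, bounds=[(0.0, 20.0), (0.0, 2.0)]))
```

In practice the objective would train and score a regression model for each candidate parameter vector, which makes every evaluation expensive; that cost is the usual motivation for keeping the swarm small.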