


CONTENTS

ACKNOWLEDGEMENTS  i
LIST OF TABLES  iii
LIST OF FIGURES  iv
LIST OF ABBREVIATIONS  vi
ABSTRACT  x

CHAPTERS

1  INTRODUCTION  1
    1.1  INTRODUCTION  1
    1.2  OBJECTIVES OF THE THESIS  2
    1.3  MACHINE LEARNING  2
        1.3.1  Classification  3
        1.3.2  Clustering  4
        1.3.3  Association Rule Mining  4
        1.3.4  Sequential Pattern Mining  5
        1.3.5  Regression  5
            1.3.5.1  Neural Network in Regression  6
            1.3.5.2  Support Vector Regression (SVR)  6
    1.4  STAGES IN FORECASTING  7
    1.5  FORMULATING A FORECASTING STRATEGY  7
    1.6  FORECASTING METHODS  9
        1.6.1  Qualitative Forecasting Methods  9
        1.6.2  Quantitative Forecasting Methods  11
            1.6.2.1  Time Series Method  12
            1.6.2.2  Moving Average Model  14
            1.6.2.3  Exponential Smoothing  15
            1.6.2.4  Causal Forecasting  17
    1.7  APPLICATION OF REGRESSION ANALYSIS  31
    1.8  FORECASTING ACCURACY MEASURES  38
    1.9  FORECASTING IN THE WOOD INDUSTRY  40
    1.10  PROBLEM STATEMENT  41
    1.11  ORGANIZATION OF THE THESIS  41

2  LITERATURE REVIEW  43
    2.1  INTRODUCTION  43
    2.2  FORECASTING USING LINEAR REGRESSION  43
    2.3  FORECASTING USING MULTIPLE LINEAR REGRESSION  47
    2.4  SOFTWARE EFFORT ESTIMATION  52
    2.5  FORECASTING USING ANN  62
        2.5.1  Software Effort Estimation using ANN  63
        2.5.2  Forecasting using ANN in other Domains  66
    2.6  FORECASTING USING SVR  70
        2.6.1  Software Effort Estimation using SVR  70
        2.6.2  Forecasting using SVR in other Domains  73

3  PERFORMANCE ANALYSIS OF EXISTING REGRESSION TECHNIQUES  76
    3.1  FORECASTING PRODUCT DEMAND  76
    3.2  SOFTWARE DEVELOPMENT EFFORT ESTIMATION  78
        3.2.1  Software Estimation Techniques  79
        3.2.2  Software Metrics  87
    3.3  PULPWOOD  90
    3.4  METHODOLOGY  105
        3.4.1  COnstructive COst MOdel (COCOMO) Data Set  105
        3.4.2  M5 Algorithm  114
        3.4.3  Linear Regression  117
        3.4.4  Error Statistics  118
    3.5  RESULTS AND DISCUSSION  120
    3.6  CONCLUSION  125

4  PROPOSED GAUSSIAN KERNEL SUPPORT VECTOR MACHINES FOR REGRESSION  126
    4.1  INTRODUCTION  126
    4.2  REGRESSION ANALYSIS  128
    4.3  SUPPORT VECTOR REGRESSION  133
        4.3.1  SVM Classifiers  133
            4.3.1.1  Generalizing Optimization Hyperplanes  135
            4.3.1.2  Generalization for High Dimensional Features  136
        4.3.2  Support Vector Regression  136
            4.3.2.1  Linear Regression  137
            4.3.2.2  Non-Linear Regression  138
    4.4  METHODOLOGY  139
        4.4.1  Proposed Gaussian Kernel Support Vector Regression  139
    4.5  RESULTS AND DISCUSSION  144
    4.6  CONCLUSION  151

5  NOVEL HYBRID PARTICLE SWARM OPTIMIZATION FOR KERNEL OPTIMIZATION IN SUPPORT VECTOR REGRESSION  152
    5.1  INTRODUCTION  152
    5.2  METHODOLOGY  153
        5.2.1  Particle Swarm Optimization (PSO)  154
        5.2.2  Proposed PSO GA Optimization  156
    5.3  RESULTS AND DISCUSSION  158
    5.4  CONCLUSION  166

6  CONCLUSION AND FUTURE ENHANCEMENTS  167
    6.1  CONCLUSION  167
    6.2  FUTURE ENHANCEMENTS  168

REFERENCES  169
LIST OF PUBLICATIONS  183
APPENDICES  185
    A  Glossary of Terms  185
    B  Magnitude relative error and absolute error achieved for COCOMO dataset  188
        B1.  M5 algorithm  188
        B2.  Linear Regression  191
        B3.  SVR-RBF kernel  194
        B4.  Proposed Gaussian Kernel-SVR Kernel  197
        B5.  SVR-Proposed optimization  200
    C  Source Code  203
        C1.  Prototype  203


LIST OF TABLES

1.1  Data for linear regression  20
1.2  Predicted values and the errors of prediction for Data  21
1.3  Statistics for computing the regression line  22
3.1  Projected consumption of Paper (Million Tonnes)  94
3.2  Projected production of Paper  96
3.3  Wood, Recycled and Agro based Mills production status  97
3.4  Status of availability of Recycled / waste paper  98
3.5  Availability of Agro based raw materials (Million Tonnes)  99
3.6  Variety wise production of paper from different raw materials (2010-11) (Million Tonnes)  99
3.7  COCOMO II Cost drivers  107
3.8  COCOMO II Scale factors  109
3.9  COCOMO Dataset  110
3.10  Average MMRE and MdMRE achieved for M5 and Linear Regression Technique - COCOMO Dataset  121
3.11  Sample Demand data of Pulpwood in MT (Metric Tonne)  123
3.12  Average MMRE and MdMRE for pulpwood Dataset  123
4.1  MRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - COCOMO Dataset  145
4.2  Average MMRE and MdMRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - COCOMO Dataset  148
4.3  Average MMRE and MdMRE for M5, SVR-RBF, Proposed Gaussian kernel SVR - pulpwood dataset  149
5.1  Magnitude Relative Error achieved for COCOMO Dataset  159
5.2  Average MMRE and MdMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO dataset  162
5.3  Actual vs Predicted values for pulpwood dataset  163
5.4  MMRE and MdMRE for various Techniques - pulpwood  164


LIST OF FIGURES

1.1  Machine Learning Algorithms  3
1.2  Classification  4
1.3  Qualitative Forecasting Methods  10
1.4  Quantitative Forecasting Methods  12
1.5  A scatter plot of the data  20
1.6  Regression line for the data  21
3.1  Software Estimation Process  80
3.2  Software Estimation Techniques  81
3.3  Types of COCOMO Model  86
3.4  Pulpwood stacked in the Processing Yard  91
3.5  Indian paper industry growth  93
3.6  Pulping process  101
3.7  M5 model tree algorithm  115
3.8  Pseudo code for M5 Algorithm  117
3.9  Magnitude Relative Error for M5 and Linear Regression - COCOMO Dataset  121
3.10  MMRE for M5 and Linear Regression - COCOMO Dataset  122
3.11  MdMRE for M5 and Linear Regression - COCOMO Dataset  122
3.12  MMRE for M5 and Linear Regression Technique - Pulpwood dataset  124
3.13  MdMRE for M5 and Linear Regression Technique - Pulpwood dataset  124
4.1  SVM Algorithms  127
4.2  Example for Support Vector Regression using kernel trick  137
4.3  Pseudo code for SVR algorithm  142
4.4  Pseudo code of the proposed algorithm  144
4.5  Magnitude Relative Error for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset  147
4.6  MMRE for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset  148
4.7  MdMRE for M5, SVR RBF and Proposed Gaussian SVR - COCOMO Dataset  149
4.8  MMRE for M5, SVR RBF and Proposed Gaussian SVR - pulpwood dataset  150
4.9  MdMRE for M5, SVR RBF and Proposed Gaussian SVR - pulpwood dataset  150
5.1  Machine Learning to Train a Predictive Model  153
5.2  Initial swarm of PSO  155
5.3  Proposed optimization algorithm using PSO  156
5.4  Flowchart for hybrid GA-PSO  157
5.5  Magnitude Relative Error for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset  161
5.6  MMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset  162
5.7  MdMRE for SVR RBF, Proposed Gaussian SVR, Optimized SVR - COCOMO Dataset  163
5.8  Actual vs Predicted values for pulpwood dataset  164
5.9  Mean Magnitude Relative Error - pulpwood dataset  165
5.10  Median Magnitude Relative Error - pulpwood dataset  165


LIST OF ABBREVIATIONS

ACLRM       Autocorrelation-Corrected Linear Regression Model
ANN         Artificial Neural Network
ARIMA       Auto Regressive Integrated Moving Average
ATM         Automatic Teller Machines
CART        Classification And Regression Tree
CMMI        Capability Maturity Model Integration
COCOMO      COnstructive COst MOdel
COQUALMO    COnstructive QUALity Model
CPNN        Counter Propagation Neural Network
CRB         Commodity Research Bureau
CV          Coefficient of Variation
DENFIS      Dynamic Evolving Neuro Fuzzy Inference System
EA          Estimation by Analogy
EA          Evolution Algorithms
EC          Evolutionary Computation
EM          Effort Multipliers
FLEX        Flexibility
GA          Genetic Algorithms
GDP         Gross Domestic Product
GGGP        Grammar Guided Genetic Programming
GLS         Generalized Least Squares
GMDH        Group Method of Data Handling
GP          Genetic Programming
GRA         Grey Relational Analysis
ICA         Independent Component Analysis
IID         Independent and Identically Distributed
IPI         Industrial Production Index
ISBSG       International Software Benchmarking Standards Group
ISE         Istanbul Stock Exchange
KAs         Knowledge Areas
KDSI        Kilo Deliverable Source Instructions
LOC         Line Of Code
MAD         Mean Absolute Deviation
MAE         Mean Absolute Error
MAPE        Mean Absolute Percentage Error
MARS        Multivariate Adaptive Regression Splines
MdMRE       Median Magnitude of Relative Error
MFE         Mean Forecast Error
ML          Machine Learning
MLE         Maximum Likelihood Estimator
MLFF        Multilayer FeedForward Neural Network
MLP         Multi-Layer Perceptron
MLR         Multiple Linear Regression
MMRE        Mean Magnitude of Relative Error
MOPSO       Multiple Objective Particle Swarm Optimization
MOS         Model Output Statistics
MRE         Magnitude Relative Error
MSE         Mean Square Error
NDFD        National Digital Forecast Database
NDSI        Number of Delivered Source Instructions
NGT         Nominal Group Technique
NLOC        Number of Lines of Code
NWP         Numerical Weather Predictions
NYSE        New York Stock Exchange
OD          Oven Dry
OLS         Ordinary Least Squares
PCA         Principal Component Analysis
PDFM        Propane Database and Forecasting Model
PMs         Person-Months
PMAT        Process Maturity
PREC        Precedentedness
PRED        Prediction
PSO         Particle Swarm Optimization
QP          Quadratic Programming
RBF         Radial Basis Function
RMS         Root Mean Square
RPROP       Resilient back Propagation algorithm
RR          Ridge Regression
RSEL        Risk Resolution
RT          Regression Technique
SDR         Standard Deviation Reduction
SEBI        Securities and Exchange Board of India
SEE         Software Effort Estimation
SEER-SEM    System Evaluation and Estimation of Resources - Software Estimation Model
SF          Scale Factor
SIC         Schwarz Information Criterion
SKU         Stock Keeping Unit
SLIM        Software Life Cycle Management
SMO         Sequential Minimal Optimization
SPX         Standard and Poor's 500
SRGM        Software Reliability Growth Model
SSE         Sum of Squared Errors
STLF        Short Term Load Forecasting
SUR         Seemingly Unrelated Regression
SVM         Support Vector Machines
SVR         Support Vector Regression
SWEBOK      Software Engineering Body Of Knowledge
T-BILL      Treasury Bill
TEAM        Team Cohesion
TNPL        Tamil Nadu Newsprint and Papers Ltd
USDX        US Dollar Index
VC          Vector Classification
VIF         Variance Inflation Factor
VSTS        Visual Studio Team System


ABSTRACT

Forecasting a product's demand and supply is crucial to any supplier, manufacturer or retailer, because forecasts of future demand determine the quantities to be purchased, produced and shipped. Classification and regression are two important data mining techniques used for classification and prediction: classification maps an independent set of input values onto an output value drawn from a predetermined set of values, whereas prediction maps an independent set of values onto a numeric value based on a predetermined relationship. The consumption of paper and related products is increasing every year at a near-exponential rate, and with the rising annual per-capita consumption of paper, a forecast of demand and supply is necessary to support India's socio-economic development. In this work the demand for pulpwood is forecast using data collected from Tamil Nadu Newsprint and Papers Limited (TNPL), India. Software effort estimation using the COnstructive COst MOdel (COCOMO) dataset is also evaluated with the proposed techniques: the COCOMO dataset is used to benchmark the proposed regression techniques, which are then evaluated on the real-world pulpwood dataset. Among the paper mills in India, 30% to 40% use wood as raw material; the current requirement of raw material is slightly more than 5 million cubic metres against a domestic supply of 2.6 million cubic metres, a 45% shortfall in supply. A forecast of pulpwood demand and supply is therefore valuable, given the alarming increase in demand without a commensurate increase in supply.

The M5 algorithm, introduced by Quinlan in 1992, builds trees whose leaves are associated with multivariate linear models; at each node the split is chosen over the attribute that maximizes the expected error reduction, measured as a function of the standard deviation of the output parameter.
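The split criterion just described is usually written as a standard deviation reduction (SDR); the formulation below is the standard statement of it, using our own notation rather than symbols taken from the thesis body:

    \mathrm{SDR} = \sigma(T) - \sum_{i} \frac{|T_i|}{|T|}\,\sigma(T_i)

where T is the set of training cases reaching a node, the T_i are the subsets produced by a candidate split, and \sigma(\cdot) denotes the standard deviation of the output values; the attribute giving the largest SDR is selected.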

The MMRE (Mean Magnitude of Relative Error) is the de facto standard evaluation criterion for assessing the accuracy of software prediction models; the Magnitude of Relative Error (MRE) computes the absolute error percentage between the actual and predicted effort for each reference sample. The attributes of the COCOMO and pulpwood datasets are evaluated and the percentage difference in performance is computed. The percentage difference between M5 and linear regression in MMRE is 50.48% for the COCOMO dataset and 6.85% for the pulpwood dataset, while the percentage difference in MdMRE between the two algorithms is 36.93% for the COCOMO dataset and 7.61% for the pulpwood dataset. From these results it can be concluded that linear regression performs reasonably well when predicting data with a small number of attributes.

Support Vector Regression (SVR) belongs to the general category of kernel methods, that is, algorithms that depend on the data only through dot products. The Radial Basis Function (RBF) kernel of the SVR is modified to improve the accuracy of the pulpwood demand predictions. In this work a method to find the ideal kernel width is proposed: in dense areas the width is narrowed and the weight assigned is less than 1, whereas in sparse areas the width is increased and the weight assigned is more than 1. When a pattern x falls in a dense area, its closest members are found using a weighted nearest-neighbour distance formula and the corresponding values are obtained. Experiments were conducted for ten different cost functions, and the average MMRE and MdMRE were computed for the proposed kernel and the RBF kernel. The results show that the percentage difference in MMRE between the proposed Gaussian kernel SVR and SVR-RBF is 57.75% for the COCOMO dataset and 43.30% for the pulpwood dataset, while the percentage difference in MdMRE is 47.4% for the COCOMO dataset and 69.2% for the pulpwood dataset.
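To make the evaluation criteria above concrete, the following minimal sketch (in Python with NumPy, which the thesis does not prescribe) computes the per-sample MRE together with MMRE and MdMRE; the effort values in the usage lines are hypothetical and are not taken from the COCOMO or pulpwood data.

    import numpy as np

    def mre(actual, predicted):
        """Per-sample Magnitude of Relative Error: |actual - predicted| / actual."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return np.abs(actual - predicted) / actual

    def mmre(actual, predicted):
        """Mean Magnitude of Relative Error over all reference samples."""
        return float(np.mean(mre(actual, predicted)))

    def mdmre(actual, predicted):
        """Median Magnitude of Relative Error over all reference samples."""
        return float(np.median(mre(actual, predicted)))

    # Hypothetical effort values, for illustration only.
    actual = [2040, 1600, 243, 240, 33]
    predicted = [1800, 1700, 260, 200, 30]
    print("MMRE :", round(mmre(actual, predicted), 4))
    print("MdMRE:", round(mdmre(actual, predicted), 4))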
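The width-adaptation rule described above can likewise be sketched. The code below is only one plausible reading of it, in which local density is estimated from k-nearest-neighbour distances; the function names, the choice of k, and the normalization by the mean neighbour distance are illustrative assumptions, not the formulation given in Chapter 4.

    import numpy as np

    def adaptive_widths(X, base_width=1.0, k=5):
        """Per-sample kernel width: narrower than base_width in dense regions
        (weight < 1), wider in sparse regions (weight > 1)."""
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances between all training patterns.
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        np.fill_diagonal(d, np.inf)
        # Mean distance to the k nearest neighbours as a local sparsity proxy.
        knn = np.sort(d, axis=1)[:, :k].mean(axis=1)
        weight = knn / knn.mean()   # < 1 in dense areas, > 1 in sparse areas
        return base_width * weight

    def gaussian_kernel(x, z, width):
        """Gaussian (RBF-style) kernel value evaluated with a local width."""
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * width ** 2))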

For the SVR method, a hybrid Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) approach is proposed to optimize the parameters of the kernel functions. Regression models are built using the data collected from TNPL. The strength of SVR lies in the design of its kernel functions, and its advantages are generalizability and robustness in the presence of outliers in the training set. The proposed Gaussian kernel SVR together with the modified PSO improves the performance of SVR by reducing the regression errors. The results show that the proposed SVR optimization reduces the percentage difference in MMRE by 5.11% and in MdMRE by 1.37% on the COCOMO dataset when compared with the proposed Gaussian kernel SVR; similarly, on the pulpwood dataset it reduces MMRE by 7.15% and the percentage difference in MdMRE by 10.67%. Consolidating all the results, the proposed Gaussian kernel SVR reduces MMRE on both the COCOMO and pulpwood datasets compared with the other methods, including SVR-RBF and M5.

The algorithms proposed in this research have been implemented to increase prediction accuracy: the M5 regression tree, linear regression, SVM with the RBF kernel, and the proposed Gaussian kernel SVR with PSO optimization. MMRE and MdMRE are used to evaluate the results. Software effort estimation is evaluated on the COCOMO dataset, and pulpwood demand is forecast with data collected from TNPL over the past 10 years. The results show that the proposed algorithm with the PSO-GA optimization approach gives the highest accuracy, with a significant decrease in MMRE and MdMRE compared with all other methods. Future work aims to analyse the performance of the proposed method for various industrial wood species.
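A minimal sketch of such a hybrid PSO-GA search is given below. It assumes scikit-learn's RBF-kernel SVR as a stand-in for the proposed Gaussian kernel, uses MMRE on a hold-out set as the fitness, and crosses over the two best particles while replacing the worst one each iteration; the swarm size, parameter bounds, and mutation rule are illustrative choices, not the settings reported in Chapter 5.

    import numpy as np
    from sklearn.svm import SVR

    def mmre(actual, predicted):
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return float(np.mean(np.abs(actual - predicted) / actual))

    def fitness(params, X_train, y_train, X_val, y_val):
        """Train an RBF-kernel SVR with the candidate (C, gamma) and score it by MMRE."""
        C, gamma = params
        model = SVR(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        return mmre(y_val, model.predict(X_val))

    def pso_ga(X_train, y_train, X_val, y_val, n_particles=20, n_iter=30, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = np.array([0.1, 1e-4]), np.array([1000.0, 1.0])   # (C, gamma) bounds
        pos = rng.uniform(lo, hi, size=(n_particles, 2))
        vel = np.zeros_like(pos)
        pbest = pos.copy()
        pbest_fit = np.array([fitness(p, X_train, y_train, X_val, y_val) for p in pos])
        gbest = pbest[pbest_fit.argmin()].copy()
        w, c1, c2 = 0.7, 1.5, 1.5                                  # PSO coefficients
        for _ in range(n_iter):
            r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = np.clip(pos + vel, lo, hi)
            # GA step: crossover the two best particles, mutate, replace the worst.
            order = pbest_fit.argsort()
            alpha = rng.random()
            child = alpha * pos[order[0]] + (1 - alpha) * pos[order[1]]
            pos[order[-1]] = np.clip(child * rng.normal(1.0, 0.1, size=2), lo, hi)
            fits = np.array([fitness(p, X_train, y_train, X_val, y_val) for p in pos])
            improved = fits < pbest_fit
            pbest[improved], pbest_fit[improved] = pos[improved], fits[improved]
            gbest = pbest[pbest_fit.argmin()].copy()
        return gbest, pbest_fit.min()

In use, the returned gbest would hold the (C, gamma) pair giving the lowest validation MMRE, which would then be used to fit the final SVR model.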