Planning for Quality using Defect Prediction Model

Esha Banerjee, Kalyani Sekhar
NIIT Technologies, New Delhi, India

1. Abstract

To provide a comprehensive model for quality planning, we developed an in-house defect prediction model using statistical regression analysis. The model is based on NIIT's past project data. Because capability, productivity and defect detection rate in software delivery can change dramatically even over a short period, owing to rapid changes in technology, tools, processes and methodologies, we considered only the projects handled by NIIT during the last two years. Experience at NIIT, and in the industry in general, shows that the number of defects detected in a project is closely linked to the size of the project: as the size of the application increases, the opportunities for defects increase proportionally. Hence more care is needed in planning to unearth these defects on time and deliver a quality product. This paper begins with a brief introduction to the need for such a model and the alternatives considered, followed by the use of regression analysis and best fit to create the defect prediction model. The paper also describes NIIT's approach to quality planning using this model.

Note: Although a brief introduction to regression analysis is given below, elementary prior knowledge of statistics is assumed for a complete understanding of this paper.

2. Introduction

Need for the model

There are core measures for which organization-level goals are defined. Every project carries a target for achieving the relevant goals. These measures comprise both process and product measures. The goals are in line with the organization's business objectives. At the start of the project, the project manager needs to plan for the critical processes and the measures to control these critical processes.
It was observed that, to achieve the quality goals, defect planning was a critical process for all development projects. The measures associated with this process were identified as the number of defects in the various lifecycle phases and work products. As product quality depends on the processes followed, establishing the right processes and measures at each activity or phase will ensure a quality product.

Also, CMMI V1.1 introduced a new process area for organizational process performance at Level 4. This process area has a specific goal: "Baselines and models that characterize the expected process performance of the organization's set of standard processes are established and maintained." To meet these requirements, NIIT has focused on establishing process performance baselines and models to quantitatively manage the organization's projects.

Alternatives Considered

Some of the alternatives considered for developing a suitable base for quality planning were:
- Quality planning using industry data
- Quality planning using past project data of similar projects
- Delphi method
- Use of standard models like Rayleigh for reliability
- Statistical techniques like Regression Analysis

After deliberating on the pros and cons of these alternatives, Regression Analysis was selected because:
- It is based on statistical techniques which are long proven and established
- The goodness of the model can be verified through statistical analysis

The subsequent sections describe how regression analysis has been used to derive the model, followed by the use of the model for quality planning.

3. Use of regression analysis

What is Regression?

Regression is the statistical process by which we can predict (or estimate) the values of one variable, called the Dependent Variable, from known values of one or more other variables, called Independent Variable(s). For example, if we know that effort and size in a software project are correlated, we can find the expected amount of effort for a given project size. When there is more than one independent variable, it is called Multiple Regression.
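For reference, the simple and multiple regression models described above are conventionally written as follows, together with the ordinary least-squares estimates for the simple case. The notation (a and b for the coefficients, an error term, bars for sample means) is assumed here for illustration and is not taken from the paper.

```latex
% Simple linear regression: one independent variable x
Y = a + b\,x + \varepsilon

% Ordinary least-squares estimates from n observed pairs (x_i, y_i)
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}

% Multiple regression: p independent variables
Y = a + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + \varepsilon
```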

What is Regression analysis?

Regression analysis is a method of finding the line of best fit for a set of data. It is a mathematical procedure that produces two results.
1. First, it produces an equation to match the data gathered. There are different types of analysis (linear, quadratic, cubic, exponential, etc.), so one may want to check them to see which one matches the collected data most closely.
2. Second, regression analysis (or multiple regression analysis, if more than one independent variable is involved) may produce numbers that indicate how closely the new formula fits the data.

For example, the dependent variable might be overall satisfaction and the independent variables might be price, quality, value for money, delivery time and staff knowledge. The multiple regression analysis would then identify the relationship between the dependent variable and the independent variables. This is presented as an equation or model (formula) that might look like this:

Overall satisfaction = 1.37 * price rating + ... * quality rating + ... * delivery time rating + ... (a constant)

Approach to Regression Analysis

The standard approach in regression analysis is:
1. Gather data - Past data for the given independent variable(s) and the corresponding dependent variable is collected.
2. Determine the form of equation to fit - We plot the dependent and independent data sets (in the case of multiple independent variables, one variable at a time) on a scatter plot, which shows the existence (or otherwise) of statistical relationships between variables, and examine the pattern formed by these sets.
3. Fit an equation - Depending on the number of independent variables, a simple (Y = a + b*x) or multiple (Y = a + b1*x1 + b2*x2 + ... + bp*xp) regression equation is selected.
4. Evaluate the fit using statistics - such as the Coefficient of Determination (R²), the Standard Error of Estimate (SE), etc.
   The first number is the linear correlation coefficient, r, which indicates how closely the data fit a straight line. The closer r is to 1 (for a positive correlation) or -1 (for a negative correlation), the better the fit. A value of 0 indicates no linear fit at all.
   Second is R² (r²), the coefficient of determination, which indicates how closely the curve fits the data. Its values range from 0 to 1, with 1 being a perfect fit.
   The standard error of estimate measures the reliability and accuracy of the regression equation in predicting the value of the dependent variable for given value(s) of the independent variable(s). It measures the variability of the observed values of the dependent variable (Y) around the regression line.
5. Use the equation to predict the value of the dependent variable for given value(s) of the independent variable(s).

4. Development of Defect Prediction Model for Software Projects Using Regression Analysis

In order to meet the requirement of the CMMI framework (Level 4 process area Organizational Process Performance), NIIT has developed a statistical regression model, as explained above, to estimate the total number of defects in a project on the premise that its size is known. At NIIT (and in the industry as well), it is observed that the number of defects detected is independent of the technology and is a function of the size of the system. The prediction model for defects is accordingly independent of technology and predicts full life cycle, in-process defects. This model will generally predict a higher number of defects.

Defect Prediction Model for Software Projects

To develop the defect prediction model, we followed the approach explained above, as elaborated below.

Gather Data

We collected data on size and number of defects for all development projects executed in the past two years that involved the full development life cycle. Partial life cycle projects were also considered, and their data was extrapolated to the full life cycle. These projects covered various technologies: Internet technologies (VB, Java, VB.Net) and others (Oracle D2K, ETL, etc.).
In the first release of the model, data from ten projects were considered.

Determining the Form of Equation

We plotted scatter diagrams to determine whether there was a linear relationship between Number of Defects and Size. As Figure-1 shows, at a top level, the data pairs form approximately a straight line.
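The fit-and-evaluate steps of the approach (fit Y = a + bX, then compute r, R² and the standard error of estimate) can be sketched in a few lines of Python. This is a minimal illustration, not NIIT's implementation: the function names and the size and defect figures are hypothetical, and the standard error uses n - 2 degrees of freedom, which is one common convention and is assumed here.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_statistics(xs, ys, a, b):
    """Return (r, r_squared, standard_error) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)  # linear correlation coefficient
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Assumption: n - 2 degrees of freedom (two fitted coefficients).
    standard_error = math.sqrt(residual_ss / (n - 2))
    return r, r * r, standard_error


# Hypothetical project data: size in function points, weighted defects.
sizes = [300, 600, 900, 1200, 1500, 1800]
defects = [1500, 3400, 5100, 7300, 8600, 10900]

a, b = fit_line(sizes, defects)
r, r_squared, se = fit_statistics(sizes, defects, a, b)
print(f"Y = {a:.1f} + {b:.3f} * X, r = {r:.4f}, R^2 = {r_squared:.4f}, SE = {se:.1f}")
```

An r close to 1 on the scatter data supports the choice of a straight-line form before the coefficients are used for prediction.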

(Figure-1: Relationship Between # of Defects & Size; plot of Total Wt. Defects against Size)

On a half-yearly basis, the defect prediction model is re-calibrated, i.e. the new data is considered and plotted and the equation is re-calculated. We have had three releases of the defect prediction model over the last two years. Trends observed in the Coefficient of Correlation over the past three calibrations of the model:

Release Number    Coefficient of Correlation
Release           %
Release           %
Release           %

Note: The increasing trend in the value of the Coefficient of Correlation shows the reliability of the model.
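The half-yearly re-calibration can be pictured as selecting the projects that fall inside a rolling window and refitting the line on them. The sketch below assumes a 24-month window, in line with the paper's use of the last two years of data. The record layout, the function names, the boundary handling (a project closed exactly 24 months before calibration is kept) and all project figures are hypothetical.

```python
import math
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def select_window(projects, calibration_date, window_months=24):
    """Keep projects closed within the last `window_months` months."""
    return [
        p for p in projects
        if 0 <= months_between(p["closed"], calibration_date) <= window_months
    ]


def recalibrate(projects, calibration_date, window_months=24):
    """Refit Y = a + b*X on the windowed projects; returns a dict with a, b, r, n."""
    window = select_window(projects, calibration_date, window_months)
    xs = [p["size_fp"] for p in window]
    ys = [p["weighted_defects"] for p in window]
    n = len(xs)
    if n < 3:
        raise ValueError("too few projects in the window to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return {"a": a, "b": b, "r": r, "n": n}


# Hypothetical project records, for illustration only.
HYPOTHETICAL_PROJECTS = [
    {"closed": date(2003, 2, 10), "size_fp": 700, "weighted_defects": 4500},
    {"closed": date(2005, 1, 20), "size_fp": 600, "weighted_defects": 3300},
    {"closed": date(2005, 9, 5), "size_fp": 1100, "weighted_defects": 6400},
    {"closed": date(2006, 2, 14), "size_fp": 1500, "weighted_defects": 8900},
    {"closed": date(2006, 5, 30), "size_fp": 900, "weighted_defects": 5200},
]

release = recalibrate(HYPOTHETICAL_PROJECTS, date(2006, 6, 30))
print(release)
```

Comparing the `r` value from one release to the next gives the kind of trend described in the note above.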

Regression Line of Best Fit

Since the number of defects (Y) depends on the independent variable Size (X), we deduced the regression line Y = a + bX, where:
- Y denotes Number of defects
- X denotes Size (in FPs)
- a, b are the regression coefficients, computed using the LINEST function of MS Excel.

Number of Defects = (5.87 * Size) ± 2319

Note: The number of defects given here is the number of weighted defects, where weights are associated with the severity of the defect. In NIIT, defects are categorised as A (Fatal), B (Major) and C (Minor) and are given weights of 10, 5 and 1 respectively. In the subsequent sections, the term "number of defects" is synonymous with "number of weighted defects".

Determine How Good is the Linear Relationship between Number of Defects & Size

As discussed earlier, the coefficient of determination is the measure used to check the linearity of the relationship between the dependent and independent variables. In our case, the value of the coefficient of determination is 94.67%, which is very close to 100% and indicates a very high correlation between number of defects and size.

Determine How Accurate will be the Estimates given by the Regression Line

The Standard Error (S.E.) is a measure of the accuracy of the regression equation in predicting the value of the dependent variable for given value(s) of the independent variable(s). In our case this value is computed as (value not given). This regression equation would give the best possible estimates of defects for any given size of a project.

Before releasing this prediction model for use by projects, the model was corroborated with the actual data generated from various projects.

5. How is the Model used for Quality Planning

NIIT's measurement objectives are derived from the business objectives that are initiated in management review meetings. The measurement needs of the organization are set on a periodic basis to coincide with the business requirements.
These measurements form the basis of the "Information System" required by higher management and hence constitute the Organization's Capability Baseline. Some of the key measurements are Productivity, Defects at Acceptance, Cycle Time and Effort Overrun. For maintenance projects, the measurements are in terms of achieving SLAs. In order to provide a more focussed and accurate view, the capability baseline is organized technology-wise. Currently the capability baseline is available for Internet Microsoft, Internet Java, Client Server (PB-RDBMS, VB-RDBMS and Oracle D2K) and ETL technologies.

While the capability baseline uses the organization's data and computes the capability in each technology, the Process Performance Models predict effort, number of defects, etc. based on parameters such as schedule and size. These models are used for estimation and planning at the contracting and project start-up stages. Like the capability baseline, these models use the organization's own data for prediction. Both the capability baseline and the process performance models are used extensively for planning and managing projects quantitatively. Process performance models exist for predicting effort and defects. The capability baseline and the process performance models are available on the QMS site and are accessible to all users.

The diagram shown below illustrates the process flow for maintaining a process performance model:

(Figure: Process Flow of the Process Performance Model)
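Because every defect figure that feeds the model is a weighted count, it may help to see the weighting from Section 4 (A = 10, B = 5, C = 1) as a small function. This is a sketch only: the function name and the sample counts are hypothetical.

```python
# Severity weights as stated in the paper: A (Fatal), B (Major), C (Minor).
SEVERITY_WEIGHTS = {"A": 10, "B": 5, "C": 1}


def weighted_defects(counts):
    """Sum of defect counts weighted by severity category.

    `counts` maps a severity category ("A", "B" or "C") to the number of
    defects found in that category. Unknown categories raise KeyError.
    """
    return sum(SEVERITY_WEIGHTS[category] * n for category, n in counts.items())


# Hypothetical counts for one project phase.
sample = {"A": 3, "B": 12, "C": 40}
print(weighted_defects(sample))  # 3*10 + 12*5 + 40*1 = 130
```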

Projects plan their measures based on project goals, which are derived from organization goals, customer-specified goals and SLAs, if any. These measures include both process and product measures. Key inputs to planning are the project data in terms of size and schedule, the organization capability baseline, the process performance models and the data of similar projects in the organization process database.

One of the organizational goals for any project is to control the number of defects at the User Acceptance Testing (UAT) stage. To meet the quality goals at UAT, one of the core metrics is the number of in-process defects at each lifecycle phase. For this, the project manager performs the following steps:

1. Predict the total number of defects using the Defect Prediction Model
   The total number of defects is predicted from the size of the project. The final number of defects is adjusted depending on project parameters such as:
   - Customer quality goals
   - Criticality of the application
   - Team experience in domain and technology
   - Past data from similar projects
   - Use of tools and reusable components
   - Type of development methodology used
   - Any other project-specific constraints

2. Distribute the defects amongst the lifecycle phases
   Depending on the technology of the project, the corresponding capability baseline data is considered. The capability baseline contains the phase-wise % distribution of defects. It has been observed that the distribution of defects across the phases varies with technology. The number of defects arrived at in step 1 is distributed amongst the various phases based on the technology-wise distribution. The % distribution of defects in NIIT for two widely used technologies, Internet MS and Internet Java, is as follows:

   Technology                        RQA     HLD      LLD      CNS      TES      ACC
   Internet - Java Technology        4.25%   %        19.72%   50.83%   10.47%   1.69%
   Internet - Microsoft Technology   0.82%   10.77%   %        %        %        1.29%

3. The phase-wise distribution arrived at above is further adjusted to:
   a. Plan for zero defects at the acceptance phase by distributing these defects into earlier phases
   b. Distribute the remaining defects in the other phases as per the scope of work for the project
   c. Reflect the verification and validation strategy for the project
   Projects may have these special considerations:
   a. Use of code review and testing tools may impact the number of defects found in the construction and testing phases
   b. Any special type of testing planned in the project, such as performance testing or stress testing, may further distribute the testing defects

4. Based on the phase-wise number of defects, the following measures are derived:
   a. Defects per Function Point for each phase
   b. Defects per Person Month, which is a measure of the rate of injection of defects
   c. Review Effectiveness, which is a measure of defect detection

5. Record these figures in the project metrics sheet and track them for every phase.

6. Based on the actuals of the previous phase, readjust the number of defects in the subsequent phases.

The following sheet illustrates an example of the use of the defect prediction model:

Size in Function Points: 900
Total Weighted Defects from the equation in the defect prediction model: 3293

                                                    RQA     HLD      LLD      CNS      TES      ACC
Phase-wise Weighted Defects Distribution %
(from Capability Baseline)                          4.25%   13.05%   19.72%   50.83%   10.47%   1.68%
Redistribution (Acceptance phase defects set to
zero and added to the HLD phase)                    4.25%   14.73%   19.72%   50.83%   10.47%   0.00%
Distributing Total Defects

6. Sustenance of the model

Projects upload their actual metrics into the organization's measurement repository at phase end and project end. The SEPG compiles this data, and the model is re-calibrated every six months based on the project data. Projects closed in the last six months are added to the model and projects older than 24 months are removed. This is done to reflect current capability and trends. The estimation sheet in the QMS is integrated with the latest release of the defect prediction model. All project managers, quality managers and other roles responsible for planning are trained in the concept and use of the model for defect prediction. The equation is published in NIIT's QMS and is made available to all users.
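Returning to the worked example in Section 5 (900 FP, 3293 total weighted defects), the redistribution and phase-wise split can be reproduced with a short script. The percentages and totals are the ones quoted in the example sheet. The function names are hypothetical, and the per-phase counts are left unrounded because the sheet's own "Distributing Total Defects" figures are not available; whether the original rounded them is not known.

```python
# Phase-wise distribution (%) quoted in the worked example.
BASELINE_PCT = {
    "RQA": 4.25, "HLD": 13.05, "LLD": 19.72,
    "CNS": 50.83, "TES": 10.47, "ACC": 1.68,
}


def redistribute_acceptance(pct, target="HLD"):
    """Plan for zero acceptance defects by moving the ACC share to `target`."""
    adjusted = dict(pct)
    adjusted[target] = round(adjusted[target] + adjusted["ACC"], 2)
    adjusted["ACC"] = 0.0
    return adjusted


def distribute(total_defects, pct):
    """Split the predicted total across phases according to `pct`."""
    return {phase: total_defects * share / 100 for phase, share in pct.items()}


def defects_per_fp(phase_defects, size_fp):
    """Defects per Function Point for each phase (measure 4a)."""
    return {phase: count / size_fp for phase, count in phase_defects.items()}


size_fp = 900          # from the example sheet
total_defects = 3293   # from the example sheet

adjusted_pct = redistribute_acceptance(BASELINE_PCT)
planned = distribute(total_defects, adjusted_pct)
density = defects_per_fp(planned, size_fp)

for phase in BASELINE_PCT:
    print(f"{phase}: {adjusted_pct[phase]:6.2f}%  "
          f"{planned[phase]:9.2f} defects  {density[phase]:.3f} per FP")
```

Steps 5 and 6 would then replace the planned figures with actuals as each phase closes and rerun the split for the remaining phases.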

7. Limitations in using the model

The SEPG periodically analyses data on the usage of the performance model in various projects and identifies the limitations in using the model. Some of the limitations are:
- The model can be used only for development projects and cannot be used for maintenance projects
- The model predicts the total number of weighted defects and does not give any indication of the severity-wise (category-wise) number of defects
- Small projects (less than 500 FP) have shown variation between the actual and predicted defects
- Projects larger than 2000 FP have shown variation between the actual and predicted defects
- Since the model uses data from the past 24 months, the equation may vary significantly depending on the individual characteristics of the projects being dropped or added

The defect prediction equation derived here is statistical in nature. As such, users are expected to use their own experience to adjust the results given by this equation. The value of the standard error is adjusted based on the typical nature of the project.

8. Conclusions

While the engineering processes focus on the product and product requirements, management processes focus on planning the project to meet the goals defined by the organization and the customer. This includes planning to ensure that the right techniques, tools and processes have been applied to develop the product. It is this focus that has led to the development of the prediction model and the use of the capability baseline for quality planning. Sustained use of the model, and the advantage that projects have gained from it, has led to the development of regression-analysis-based models for other critical processes.